Lead Site Reliability Engineer - Autonomous Finance AI Platform

Curb

Accounting & Finance, Software Engineering, Data Science
San Francisco, CA, USA
Posted on Mar 26, 2026
Lead Site Reliability Engineer — Scalable Financial Technology Platform

Build the foundation of reliability for next-generation financial operations. A rapidly growing technology platform is redefining how B2B organizations manage accounts receivable, replacing manual, resource-intensive workflows with intelligent, automated systems. Serving industries such as manufacturing, chemicals, and industrial operations, this platform is helping finance teams operate faster, smarter, and with greater accuracy at scale.

This is an opportunity to take ownership of reliability engineering for a high-growth SaaS product, ensuring systems remain highly available, resilient, and performant as usage expands and complexity increases.

Role Overview

The Lead Site Reliability Engineer will define and execute the strategy for system reliability, scalability, and operational excellence across a modern cloud-based platform. This role blends hands-on engineering with technical leadership, focusing on building robust infrastructure, improving observability, and establishing best practices that enable rapid product development without compromising system stability.

You will work closely with engineering and product teams to ensure systems are designed with reliability in mind from the outset. By driving improvements in infrastructure, automation, and incident management, you will play a key role in building trust with customers who rely on the platform for mission-critical financial processes.

Key Responsibilities

Reliability Strategy & Standards

Define and implement a long-term reliability strategy, including service level objectives (SLOs), service level indicators (SLIs), and error budgets. Establish operational standards that guide engineering teams in building and maintaining reliable systems.

Cloud Infrastructure & Scalability

Design and maintain scalable, fault-tolerant infrastructure using cloud-native technologies. Ensure systems are secure, highly available, and optimized for both performance and cost efficiency.

Observability & Monitoring

Develop and enhance monitoring, logging, and alerting systems that provide clear visibility into system performance. Create actionable alerts that enable teams to detect and resolve issues quickly.

Incident Management & Continuous Improvement

Lead response efforts for critical incidents, ensuring rapid resolution and effective communication. Drive post-incident reviews focused on identifying root causes and implementing long-term improvements.

System Resilience & Risk Mitigation

Proactively identify reliability risks and implement solutions such as redundancy, failover strategies, and capacity planning to ensure systems remain stable under varying workloads.

Deployment & Release Reliability

Collaborate with engineering teams to improve deployment processes, ensuring releases are safe, observable, and easily reversible. Enhance CI/CD workflows to reduce risk and improve delivery efficiency.

Cross-Functional Collaboration

Partner with product, backend, frontend, and infrastructure teams to influence system design decisions that balance reliability, performance, and development velocity.

Automation & Developer Experience

Reduce manual operational work by building automation tools, improving runbooks, and streamlining workflows that support engineering productivity.

Root Cause Analysis

Guide teams through deep technical investigations to identify and resolve underlying system issues, ensuring sustainable improvements rather than temporary fixes.

Reliability Culture & Mentorship

Promote a culture of shared ownership and operational excellence. Mentor engineers on best practices related to system design, incident response, and infrastructure management.

Required Qualifications

Experience

7+ years of experience in site reliability engineering, infrastructure engineering, or backend software development within high-scale environments.

Distributed Systems Expertise

Proven experience designing and operating production systems that require high availability, scalability, and performance.

Programming Skills

Proficiency in languages such as Python and/or TypeScript, with experience building automation and internal tooling.

Cloud & Infrastructure

Strong experience with AWS, Kubernetes (EKS), containerization (Docker), and cloud-native architecture patterns.

Observability

Experience implementing monitoring and observability systems, including metrics, logging, and distributed tracing.

Reliability Engineering Practices

Deep understanding of SLOs, SLIs, and error budgets, and how to apply them to real-world systems.

Modern Technology Stack

Familiarity with modern application architectures, including APIs, event-driven systems, and technologies such as FastAPI, frontend frameworks, and relational databases.

CI/CD & DevOps

Experience working with continuous integration and deployment pipelines, infrastructure as code, and modern deployment strategies.

Problem-Solving & Decision-Making

Ability to balance reliability, performance, and cost considerations while making pragmatic engineering decisions.

Collaboration

Strong communication skills with the ability to work across engineering disciplines and influence technical direction.

What Makes a Strong Candidate

Passion for building systems that are reliable, scalable, and easy to operate.

Ability to lead technical initiatives while remaining hands-on in system design and implementation.

Comfort working in fast-paced environments where systems evolve rapidly.

A mindset focused on continuous improvement, automation, and operational excellence.

Compensation & Benefits

This role offers a highly competitive compensation package, including a strong base salary and equity participation.

Additional benefits for full-time employees may include comprehensive health coverage, retirement savings options with employer contributions, flexible paid time off, and parental leave programs.

This is an opportunity to shape the reliability and performance of a platform that supports critical financial operations for businesses operating at scale.

About Andiamo

Talent Partners for the AI Revolution. As a globally recognized staffing and consulting firm, we specialize in placing the top 2% of technology and go-to-market professionals with the world’s largest and most well-known companies.

For over 20 years, we have been a tier-one vendor for firms such as Palantir, Amazon, Fluidstack, Bloomberg, Relativity Space, Firefly, MasterCard, Visa, Two Sigma, and Citadel, as well as other major financial services firms, elite hedge funds, Google-backed tech start-ups, and major software firms.

Our talent solutions include Permanent Placement, Contract Staffing, Executive Search, and Dedicated Recruiting Services (RPO). Find out more at www.andiamogo.com