Role: Lead Site Reliability Engineer
Location: Remote (U.S. based)
Funding Stage: Growth-stage venture-backed (Series B/C equivalent)
Office Type: Remote-first
Salary: $170K – $200K base
Our client is a rapidly growing, mission-driven technology company serving defense, government, and critical infrastructure enterprises. Their secure collaboration platform helps organizations operate in high-stakes environments where resilience, adaptability, and compliance are paramount. The team has been remote-first since inception and prides itself on hiring exceptional engineers globally.
As the Lead Site Reliability Engineer, you’ll drive the architecture, reliability, and operational excellence of a platform supporting mission-critical organizations. This role is a 70/30 split between technical leadership and hands-on engineering. You will:
Define the strategy and roadmap for the SRE function, aligning infrastructure with product and business goals.
Lead the design, deployment, and optimization of production-grade, compliant cloud environments.
Build observability, monitoring, and alerting frameworks to ensure performance and reliability at scale.
Own incident management processes, including on-call rotations, root cause analysis, and reliability improvements.
Partner with security and compliance teams to meet federal and industry standards.
Champion automation to improve efficiency and scale operations.
Oversee cost management and capacity planning for cloud infrastructure.
Mentor engineers and foster a culture of collaboration and technical excellence.
Qualifications:
5+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles.
Deep expertise in Kubernetes and infrastructure-as-code (Terraform preferred).
Strong background with AWS or other major cloud providers.
Skilled in designing monitoring, alerting, and performance optimization strategies.
Proven troubleshooting and incident management abilities for distributed systems.
Proficiency in at least one scripting/programming language for automation.
Strong communicator with experience leading cross-functional initiatives.
Comfortable working in a distributed, remote-first environment.
Familiarity with Grafana, Prometheus, and modern observability stacks.
Experience designing high-availability and disaster recovery architectures.
Exposure to GCP or Azure in addition to AWS.
Experience in highly regulated industries (defense, finance, healthcare, or critical infrastructure).
Knowledge of compliance frameworks such as FedRAMP, NIST 800-53, or DoD standards.
Prior leadership of distributed teams.
Cloud or DevOps certifications (e.g., CKA, CKAD, AWS Solutions Architect).
Open-source contributions in reliability, DevOps, or infrastructure tooling.
Compensation and Benefits:
Competitive base salary: $170K – $200K
Equity participation
Fully remote U.S. role with a remote-first culture
Mission-driven work directly supporting organizations in defense, government, and critical infrastructure
Growth-stage company with significant funding and strong customer retention