Site Reliability Engineer

Crunchafi

Full-time

Remote

United States

About This Role

Crunchafi is looking for a Site Reliability Engineer to ensure the availability, performance, and scalability of our cloud-based SaaS platform. This role bridges software engineering and operations — you will build and maintain the infrastructure, observability, and automation that keep our systems running reliably at scale. The ideal candidate brings deep Azure cloud expertise, a strong background in infrastructure-as-code and incident management, and a passion for eliminating toil through automation.

Responsibilities

Design, build, and maintain scalable and resilient infrastructure on Microsoft Azure to support production SaaS workloads

Define and track service level objectives (SLOs), service level indicators (SLIs), and error budgets to drive reliability decisions

Build and maintain comprehensive monitoring, alerting, and observability systems to ensure early detection of issues

Develop and maintain CI/CD pipelines using GitHub Actions to enable safe, rapid, and repeatable deployments

Lead incident response and on-call rotations, conduct blameless post-incident reviews, and drive follow-up action items to completion

Automate operational tasks and eliminate toil through scripting, infrastructure-as-code, and self-healing systems

Manage and optimize Azure Kubernetes Service (AKS) clusters, container orchestration, and related networking and storage configurations

Collaborate with software engineering teams to embed reliability into application architecture, including capacity planning, load testing, and chaos engineering

Maintain and improve infrastructure-as-code using tools such as Terraform, Bicep, or ARM templates

Partner cross-functionally with Product, Support, and Quality to reduce friction and accelerate delivery

Qualifications

5+ years of professional experience in site reliability engineering, DevOps, or infrastructure engineering roles

Strong hands-on experience with Microsoft Azure cloud services (AKS, Azure SQL, App Services, Virtual Networks, Azure Monitor, etc.)

Proficiency in at least one programming or scripting language (Python, Go, Bash, PowerShell, or C#)

Experience designing and managing CI/CD pipelines using GitHub Actions, Azure DevOps, or equivalent

Hands-on experience with containerization and orchestration technologies (Docker, Kubernetes)

Demonstrated experience with infrastructure-as-code tools (e.g., Bicep and ARM templates)

Strong understanding of networking fundamentals, DNS, load balancing, and TLS/SSL management

Experience with monitoring and observability platforms (Azure Monitor, Alerts, App Insights, Seq, etc.)

Proven track record of managing production incidents, conducting post-mortems, and driving reliability improvements

Exceptional analytical, interpersonal, and communication skills

Preferred Qualifications

Experience operating SaaS platforms in accounting, financial services, or B2B environments

Experience with chaos engineering practices and tools

Familiarity with microservices and event-driven architecture patterns

Background in capacity planning, performance tuning, and cost optimization on Azure

Experience with security hardening, compliance frameworks, or SOC 2 readiness

Azure certifications (AZ-104, AZ-400, AZ-500, or equivalent) are a plus

Benefits

Competitive salary

Health, dental, and vision plans

401(k) retirement savings plan for US-based employees

100% remote work environment, with occasional travel for in-person company and/or team meetings

Unlimited PTO

Significant professional development and growth opportunities

Dynamic and inclusive company culture with a real commitment to our values
