About This Role
Crunchafi is looking for a Site Reliability Engineer to ensure the availability, performance, and scalability of our cloud-based SaaS platform. This role bridges software engineering and operations — you will build and maintain the infrastructure, observability, and automation that keep our systems running reliably at scale. The ideal candidate brings deep Azure cloud expertise, a strong background in infrastructure-as-code and incident management, and a passion for eliminating toil through automation.
Responsibilities
- Design, build, and maintain scalable and resilient infrastructure on Microsoft Azure to support production SaaS workloads
- Define and track service level objectives (SLOs), service level indicators (SLIs), and error budgets to drive reliability decisions
- Build and maintain comprehensive monitoring, alerting, and observability systems to ensure early detection of issues
- Develop and maintain CI/CD pipelines using GitHub Actions to enable safe, rapid, and repeatable deployments
- Lead incident response and on-call rotations, conduct blameless post-incident reviews, and drive follow-up action items to completion
- Automate operational tasks and eliminate toil through scripting, infrastructure-as-code, and self-healing systems
- Manage and optimize Azure Kubernetes Service (AKS) clusters, container orchestration, and related networking and storage configurations
- Collaborate with software engineering teams to embed reliability into application architecture, including capacity planning, load testing, and chaos engineering
- Maintain and improve infrastructure-as-code using tools such as Terraform, Bicep, or ARM templates
- Partner cross-functionally with Product, Support, and Quality to reduce friction and accelerate delivery
Qualifications
- 5+ years of professional experience in site reliability engineering, DevOps, or infrastructure engineering roles
- Strong hands-on experience with Microsoft Azure cloud services (AKS, Azure SQL, App Services, Virtual Networks, Azure Monitor, etc.)
- Proficiency in at least one programming or scripting language (Python, Go, Bash, PowerShell, or C#)
- Experience designing and managing CI/CD pipelines using GitHub Actions, Azure DevOps, or equivalent
- Hands-on experience with containerization and orchestration technologies (Docker, Kubernetes)
- Demonstrated experience with infrastructure-as-code tools (e.g., Bicep and ARM templates)
- Strong understanding of networking fundamentals, DNS, load balancing, and TLS/SSL management
- Experience with monitoring and observability platforms (Azure Monitor, Alerts, App Insights, Seq, etc.)
- Proven track record of managing production incidents, conducting post-mortems, and driving reliability improvements
- Exceptional analytical, interpersonal, and communication skills
Preferred Qualifications
- Experience operating SaaS platforms in accounting, financial services, or B2B environments
- Experience with chaos engineering practices and tools
- Familiarity with microservices and event-driven architecture patterns
- Background in capacity planning, performance tuning, and cost optimization on Azure
- Experience with security hardening, compliance frameworks, or SOC 2 readiness
- Azure certifications (AZ-104, AZ-400, AZ-500, or equivalent) are a plus
Benefits
- Health, dental, and vision plans
- 401(k) retirement savings plan for US-based employees
- 100% remote work environment, with occasional travel for in-person company and/or team meetings
- Significant opportunities for professional development and growth
- Dynamic and inclusive company culture with a real commitment to our values