We are looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC), a team at the heart of platform reliability for mission-critical SaaS environments. You will help maintain and optimize the systems that power our cloud infrastructure across AWS and Kubernetes, ensuring their reliability and performance, with a strong focus on automation, observability, and continuous improvement. This role blends reliability engineering with incident command, giving you real ownership over uptime, performance, and innovation. You will be part of a highly skilled team that values creative problem-solving, operational excellence, and resilience engineering.
Your responsibilities
- Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform.
- Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
- Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
- Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
- Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
- Participate in the on-call rotation.
- Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience.
- Implement monitoring, logging, alerting, and SLA reporting.
- Create and maintain technical documentation.
- Implement, maintain and mature SRE best practices.
- Lead incidents: Act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
- Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
- Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
- Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users.
Requirements
- 5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
- Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.
- Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.
- Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).
- Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).
- Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.
- Experience with incident management, on-call participation, escalation, and structured postmortems.
- Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics.
- Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned.
- Experience with FedRAMP (Federal Risk and Authorization Management Program) compliance is a strong asset.
- Basic knowledge of Java- or .NET-based development required.
- Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec.
Additional requirements
- Escalation on-call rotation
- Occasional travel (quarterly offsites, conferences – less than 10%)