Site Reliability Engineer

Flex Dental
Full-time
On-site
Alpharetta, Georgia, United States

Responsibilities

  • Be available to respond to critical service incidents outside of business hours on a rotating on-call schedule.

  • Proactively monitor application health and performance across cloud infrastructure (AWS).

  • Troubleshoot and prevent service interruptions in real time, working closely with development teams to resolve incidents efficiently.

  • Lead and participate in disaster recovery drills and security incident simulations.

  • Implement Infrastructure as Code (IaC) and maintain scalable deployments using AWS-native tools and services.

  • Collaborate with development teams to ensure smooth CI/CD workflows using Git and containerized deployments (Docker).

  • Work closely with stakeholders and product teams to ensure technical reliability aligns with business needs.

  • Support and improve observability tools, alerting mechanisms, and logging infrastructure to promote transparency and response agility.

  • Champion best practices in security, availability, performance, and incident response.


Required Technologies & Tools

  • Cloud Infrastructure: Strong proficiency in Amazon Web Services (AWS) with knowledge of services like EC2, ECS, RDS, CloudWatch, and IAM.

  • Programming/Scripting: Proficiency in Node.js and scripting for automation and tooling.

  • Containerization: Experience with Docker for container-based deployment pipelines.

  • Frontend Awareness: Familiarity with React and Ember.js to understand performance implications at the frontend level.

  • Backend Stack: Understanding of NestJS and scalable Node-based services.

  • Databases: Proficiency in MySQL and in performance monitoring of relational databases.

  • Version Control: Proficiency with Git for collaborative code management and DevOps workflow integration.


Core Competencies

  • Incident Response: Calm and focused under pressure with a structured approach to resolving outages and degradation.

  • System Design: Ability to contribute to and review architectural designs for scalability and resiliency.

  • Collaboration: Strong communication skills to coordinate across developers, QA, and product teams.

  • Automation & Efficiency: Passion for automation, repeatability, and continuous improvement.

  • Security Mindset: Consistent implementation of security best practices and a strong grasp of data protection standards.


Qualifications

  • 3+ years of experience in a Site Reliability, DevOps, or related engineering role.

  • Proven track record of managing and scaling applications in a production AWS environment.

  • Familiarity with full-stack environments, particularly those using Node.js.

  • Experience deploying, maintaining, and performance-tuning databases such as MySQL.

  • Experience with container orchestration (e.g., ECS or Kubernetes) is a plus.

  • Commitment to uptime, performance, and security in fast-moving SaaS environments.