
Sr. Site Reliability Engineer

ConstructConnect
Full-time
On-site
Cincinnati, Ohio, United States

Overview

This position sits within our Product Development division, which develops, tests, and improves our software solutions in an innovative and collaborative environment.

 

The Opportunity

The Sr. SRE will be responsible for providing scalable and secure solutions in a demanding 24x7 SaaS environment. The individual will be responsible for the health, observability, and maintainability of our production environment as well as the associated auxiliary components that drive our day-to-day business. An ideal candidate would have a background in both software development, with a solid grounding in the fundamentals of computer science, and systems, with the skills necessary to understand and operate within both contexts.

Responsibilities

What You’ll Be Doing

 

  • Work as part of a team that troubleshoots applications, middleware, infrastructure, networks, tools, and patching.
  • Ensure high availability and reliability of our systems, applications, and infrastructure
  • Build enhancements within an existing software architecture and suggest improvements to the architecture.
  • Assist in defining appropriate operational plans.
  • Apply strong communication and analytical skills and a thorough understanding of product development.
  • Collaborate with development teams to identify and address performance and stability issues in applications and services
  • Own, define and improve metrics, KPIs, SLOs and visualizations for systems.
  • Act as an escalation point for complex or critical issues that have not yet been documented as Standard Operating Procedures (SOPs).
  • Drive quality accountability within the organization with well-defined processes, metrics, and goals for process quality. This includes leading effective postmortems and ensuring actions are followed up.
  • Build and maintain robust, actionable alerting and monitoring systems and workflows.
  • Influence across boundaries and at all levels of the organization.
  • Implement, maintain and improve CI/CD processes and tools.
  • Work closely with development teams to improve services, deployments and releases.
  • Troubleshoot production issues and continually document runbooks.
  • Participate in an on-call rotation to address production issues.
  • This job description in no way implies that the duties listed here are the only ones that team members can be required to perform.

Qualifications

What You Bring to the Team

  • Very high attention to detail and organization, with the passion and ability to create order out of disorder with excellence and efficiency.
  • Strong understanding of cloud technologies (GCP, AWS, Azure, etc.), containerization (Docker, Kubernetes), and infrastructure as code (Terraform, Ansible, etc.).
  • Experience with CI/CD pipelines and automation tools such as GitLab CI, Jenkins, or TeamCity
  • Proficiency in scripting and programming languages, such as Python, Go, or Bash 
  • Expertise in monitoring, logging, and observability tools, such as New Relic, Site24x7, Prometheus, Grafana, Elasticsearch, or Splunk
  • A desire to automate everything. Whether that be infrastructure as code or tooling to eliminate toil, automation should be a core focus of your mindset and the elimination of repetitive tasks should be a constant desire in the role.
  • Natural curiosity. You aren’t simply satisfied with something working; you want to know why it works and how it works.
  • A mindset of total ownership. You aren’t afraid to dig into things you’ve never worked on before, from the browser all the way to the persistence layer. You’ve got a solid foundation in debugging and can jump into any problem you’re asked to help with.
  • An architectural mind. You understand the fundamentals of distributed computing and look for ways to make systems more resilient and self-healing, eliminating the need for human intervention as much as possible.
  • Very strong communication and interpersonal skills allowing the candidate to work well in a team environment and deliver excellent customer service.
  • The ability to convey the importance of site reliability in both business and technical terms to a wide variety of audiences that range from non-technical to the most technical of engineers. Drive stakeholder buy-in of key metrics such as SLAs/SLOs for all supported systems.
  • Ability to maintain SLAs through proactive issue detection and reporting.
  • Experience developing scripts or tools for automating administrative tasks.
  • Prior successful experience as a systems performance or site/systems reliability engineer.
  • Demonstrated experience working in large, complex systems environments.

Physical Demands and Work Environment:

  • The physical activities of this position include frequent sitting, telephone communication, and working on a computer for extended periods of time. Visual acuity is required to perform activities close to the eyes.
  • This position is fully remote with only occasional travel to the office for team meetings and events. Team members are expected to have an established workspace.   
  • Ability to work remotely in the United States or Canada.  

 

E-Verify Statement 


ConstructConnect utilizes the E-Verify program with every potential new hire. This makes it possible for us to make certain that every employee who works for ConstructConnect is eligible to work in the United States. To learn more about E-Verify you can call 1-800-255-7688 or visit their website. E-Verify® is a registered trademark of the United States Department of Homeland Security. 

 

Privacy Notice