We are looking for a talented, experienced manager with strong technical skills. You will play a crucial role in fostering a culture of operational excellence and continuous improvement.
In this role, you will:
- Be responsible for maintaining the reliability, performance, and availability of our software systems and infrastructure
- Work closely with R&D, system administrators, and system architects to:
  - Design and implement scalable and robust systems
  - Plan and implement a long-term modernization plan
  - Ensure lab infrastructure is provisioned correctly and maintained regularly
  - Resolve operational issues
- Build a team of SREs (future request)
Key Responsibilities:
Team Leadership:
- Lead, mentor, and develop a team of SREs, setting clear objectives and providing guidance.
- Promote a culture of collaboration, innovation, and continuous learning.
Reliability Engineering:
- Drive the design, implementation, and maintenance of scalable and reliable systems.
- Define, monitor, and uphold Service Level Objectives (SLOs) and Service Level Indicators (SLIs).
Incident Management:
- Oversee incident response, ensuring timely resolution and effective communication.
- Conduct post-incident reviews to identify root causes and implement preventive measures.
Operational Excellence:
- Enhance monitoring, alerting, and logging systems to proactively identify and address issues.
- Optimize system performance, reliability, and capacity planning.
Collaboration:
- Work closely with development teams and GCO to integrate reliability practices into the software development lifecycle.
Process Improvement:
- Implement and drive best practices in reliability, including automation and efficiency improvements.
- Analyze system performance and recommend changes to improve reliability and efficiency.
Qualifications:
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience.
- At least 3 to 5 years of proven experience as an SRE or in a similar role, with a strong track record of leading technical teams.
- Expertise in system administration, cloud platforms, and monitoring tools (e.g., Prometheus, Grafana).
- Strong problem-solving skills and the ability to handle high-pressure situations effectively.
- Excellent communication and leadership skills, with experience in mentoring and developing team members.
Preferred Skills:
- Experience with containerization and orchestration tools (e.g., Docker, Kubernetes), CI/CD practices and pipelines, system internals, performance tuning, and resource management.
- Experience with setting up and monitoring SLIs, SLOs, and Service Level Agreements (SLAs).
- Deep knowledge of operating systems (Linux, Windows) and networking principles (TCP/IP, DNS, HTTP/HTTPS).
- Proficiency with monitoring tools (e.g., Prometheus, Grafana, Nagios) and setting up effective alerting systems.
- Experience with IaC (Infrastructure as Code) tools like Terraform, Ansible, or Puppet for managing and automating infrastructure.
- Background in software development and knowledge of SRE and DevOps practices.
- Familiarity with cloud platforms such as AWS, Azure, or GCP.
- Relevant certifications (e.g., AWS Certified Solutions Architect, Google Professional DevOps Engineer).