Site Reliability Engineering Manager

Cognyte

Full-time

Remote

We are looking for a talented and experienced manager with strong technical skills. You will play a crucial role in fostering a culture of operational excellence and continuous improvement.

In this role, you will:

Be responsible for maintaining the reliability, performance, and availability of our software systems and infrastructure
Work closely with R&D, system administrators, and system architects to:
Design and implement scalable and robust systems
Plan and implement a long-term modernization plan
Ensure lab infrastructure is provisioned correctly and maintained regularly
Resolve operational issues
Build a team of SREs (a future requirement)

Key Responsibilities:

Team Leadership:

Lead, mentor, and develop a team of SREs, setting clear objectives and providing guidance.
Promote a culture of collaboration, innovation, and continuous learning.

Reliability Engineering:

Drive the design, implementation, and maintenance of scalable and reliable systems.
Define, monitor, and uphold Service Level Objectives (SLOs) and Service Level Indicators (SLIs).

Incident Management:

Oversee incident response, ensuring timely resolution and effective communication.
Conduct post-incident reviews to identify root causes and implement preventive measures.

Operational Excellence:

Enhance monitoring, alerting, and logging systems to proactively identify and address issues.
Optimize system performance, reliability, and capacity planning.

Collaboration:

Work closely with development teams and GCO to integrate reliability practices into the software development lifecycle.

Process Improvement:

Implement and drive best practices in reliability, including automation and efficiency improvements.
Analyze system performance and recommend changes to improve reliability and efficiency.

Qualifications:

Bachelor’s degree in Computer Science, Engineering, or a related field, or equivalent experience.
Minimum of 3-5 years of proven experience as an SRE or in a similar role, with a strong track record of leading technical teams.
Expertise in system administration, cloud platforms, and monitoring tools (e.g., Prometheus, Grafana).
Strong problem-solving skills and the ability to handle high-pressure situations effectively.
Excellent communication and leadership skills, with experience in mentoring and developing team members.

Preferred Skills:

Experience with containerization and orchestration tools (e.g., Docker, Kubernetes), CI/CD practices and pipelines, system internals, performance tuning, and resource management.
Experience with setting up and monitoring SLIs, SLOs, and Service Level Agreements (SLAs).
Deep knowledge of operating systems (Linux, Windows) and networking principles (TCP/IP, DNS, HTTP/HTTPS).
Proficiency with monitoring tools (e.g., Prometheus, Grafana, Nagios) and setting up effective alerting systems.
Experience with IaC (Infrastructure as Code) tools like Terraform, Ansible, or Puppet for managing and automating infrastructure.
Background in software development and knowledge of SRE & DevOps practices.
Familiarity with cloud platforms such as AWS, Azure, or GCP.
Relevant certifications (e.g., AWS Certified Solutions Architect, Google Professional DevOps Engineer).
