Overview
The Site Reliability Engineer ensures the reliability, scalability, and performance of our infrastructure and applications. The role is central to maintaining high uptime and system efficiency, which improves the user experience and helps the organization meet its objectives.
Key Responsibilities
- Design and implement monitoring and alerting systems to ensure high availability and performance of services
- Develop automation tools for system provisioning, configuration management, and application deployment
- Collaborate with cross-functional teams to ensure that new software and systems are production-ready
- Plan and manage infrastructure capacity efficiently
- Conduct root cause analysis of production issues and implement preventive measures
- Participate in on-call rotations and respond to system emergencies
- Ensure compliance with security and regulatory standards in all aspects of the infrastructure
- Contribute to the continuous improvement of the reliability and performance of systems and applications
- Implement best practices for cloud infrastructure and services
- Lead initiatives to optimize system performance and stability
- Conduct periodic testing of disaster recovery and failover systems
- Document system configurations, processes, and procedures
- Assist in evaluating new technologies and methods to improve reliability and performance
Required Qualifications
- Bachelor's degree in Computer Science, Information Technology, or a related field
- 3+ years of experience in a site reliability engineering role
- Proficiency in Linux system administration and troubleshooting
- Strong programming skills in Python, shell scripting, or another scripting language
- Experience with cloud platforms such as AWS, GCP, or Azure
- Expertise in building and maintaining scalable, high-performance systems
- Knowledge of containerization and orchestration technologies (Docker, Kubernetes)
- Hands-on experience with monitoring and logging tools (e.g., Prometheus, Grafana, ELK)
- Ability to design and implement automated solutions for infrastructure and application deployment
- Excellent troubleshooting and problem-solving skills
- Understanding of networking concepts and protocols
- Strong communication and collaboration skills
- Relevant certifications (e.g., AWS Certified DevOps Engineer, Google Professional Cloud DevOps Engineer) are a plus