Design and implement solutions that enhance the reliability and scalability of AI/ML platforms and applications to accommodate fast-growing demand.
Partner with product engineering teams to ensure the AI/ML systems are reliable and high performing.
Develop tools and orchestration for observability, security, automation, and FinOps.
Provide strategic technology leadership by defining and evaluating standards and architecture for reliability, observability and automation frameworks.
Build strong cross-functional relationships that foster engagements across the organization and deliver solutions to user problems.
Debug and resolve issues in production environments, identify root causes, and remediate them.
Participate in on-call rotations, incident management, and escalation workflows.
Take full ownership of problems, develop solutions, and acquire whatever new knowledge is needed to complete the task.
Mentor and guide junior engineers.
Required Qualifications:
Bachelor’s degree in Computer Science, Information Technology, or an equivalent technical qualification, with 5+ years of professional experience.
Expertise in SRE principles and in the reliability, scalability, and performance of applications and infrastructure.
Hands-on experience with cloud platforms (AWS, GCP, Azure) and IaC tools (Terraform, Ansible).
Extensive experience implementing advanced observability using tools like OpenTelemetry, Dynatrace, Grafana, and/or cloud-native services.
Experience in architecting distributed systems and cloud-native architecture in AWS.
Systematic problem-solving and troubleshooting skills in complex systems.
Excellent communication skills and the ability to present business and technical concepts to stakeholders.
Self-managed and self-motivated, with a strong sense of ownership, urgency, and drive.
Good to have:
Prior experience working in AI, ML, or data engineering.
Prior experience developing AIOps tooling or AI agents.
Multi-cloud experience (AWS, GCP, Azure) is a plus.