We are seeking a dedicated DevOps/SRE - Cloud Infrastructure Engineer with expertise in Kubernetes, AWS, and monitoring tools to manage and enhance our cloud-based infrastructure. This role requires hands-on experience with deployment and automation in EKS/Kubernetes clusters using Terraform and Helm/Flux, as well as 24x7 on-call support for critical SaaS events. Key responsibilities include proactive issue monitoring, root cause analysis (RCA), incident management, and collaboration with cross-functional teams to maintain a high-availability platform. Ideal candidates will have strong scripting skills (Python, Node.js, or Go), familiarity with CI/CD, and experience with tools like Datadog, Prometheus, and the ELK Stack. We offer quarterly bonuses, a flexible and remote work environment, comprehensive medical insurance, professional development support, and unlimited paid vacation and sick leave.
Details:
Experience:
Schedule: Shifts, 1 week on/1 week off
Start: ASAP
English: Upper-Intermediate
Russian/Ukrainian: Native
Team is from Ukraine
Employment: B2B contract
Responsibilities:
- Manage day-to-day alerts, system checks, and issue escalation as necessary.
- Provide 24x7 on-call support for critical SaaS events.
- Document issues and remediation steps.
- Proactively create monitors within the EKS/K8s ecosystem.
- Deploy to EKS/K8s cluster using Terraform and Helm/Flux.
- Enhance infrastructure health by implementing checks and scripts to address known issues.
- Maintain and develop deployment code.
- Implement/integrate new technologies into our Cloud Infrastructure.
- Collaborate with other teams to provide top-notch support and assistance.
- Plan deployments and updates with the customer in mind, keeping their impact to a minimum.
- Conduct RCA and take necessary corrective actions to prevent issue recurrence.
- Assign alert-related actions to the appropriate team after investigation.
- Handle support requests for environment-specific actions.
Requirements:
- Strong experience with incident handling (RCA, postmortems).
- Proficiency in Kubernetes (deployment, scaling, troubleshooting).
- Familiarity with AWS, Terraform, Docker, CI/CD.
- Experience with monitoring tools like Datadog, Prometheus, and Grafana, and with logging solutions like the ELK Stack (Elasticsearch, Logstash, Kibana) or AWS CloudWatch.
- Strong understanding of networking concepts and protocols.
- Proficiency in at least one scripting language (e.g., Python, Node.js, Go).
- Experience with GitOps continuous delivery tools like FluxCD/ArgoCD.
- Proficiency in Git or other version control systems.
- Familiarity with incident response and management tools like PagerDuty, Opsgenie, or VictorOps.
- Ownership, proactiveness, persistence, and passion for maintaining a high-traffic online platform.
Client offers:
- Quarterly bonuses based on transparent and systematic evaluation.
- Flexible work schedule.
- Remote work option for enhanced flexibility.
- Comprehensive medical insurance for you and your significant other.
- Financial support for life events.
- Unlimited paid vacation.
- Unlimited paid sick leave.
- Reimbursement for professional development courses and training.