Senior DevOps Engineer

Acclaim
1 day ago
Full-time
On-site
Buenos Aires, United States

We are looking for a DevOps/SRE Engineer to strengthen our team!

Requirements

  • Minimum 5 years of experience in a DevOps and/or Site Reliability Engineering role
  • Strong hands-on experience with Linux system administration
  • Extensive experience deploying, operating, and scaling Kubernetes in both cloud and bare-metal environments
  • Deep expertise and practical experience with at least one major cloud provider (preferably Google Cloud Platform)
  • Experience with ML inference on GPU/CPU is a strong plus
  • Proven experience implementing SRE practices and building observability stacks using Grafana, Prometheus, and Loki
  • Strong adherence to GitOps, Infrastructure as Code (IaC), and CI/CD principles
  • Advanced expertise in Terraform, Ansible, and Python
  • Comfortable working in high-uncertainty environments: we are building a new product, requirements evolve quickly, and the ability to rapidly learn new technologies and patterns is essential
  • Proactive mindset: ability to look beyond DevOps tasks and actively debug and understand the product
  • Strategic thinking: ability to choose technologies and architectural approaches based on long-term goals rather than short-term compromises

Responsibilities

  • Deploy, operate, and evolve a microservices-based platform running in Kubernetes clusters across AWS, GCP, and on-prem (Rancher)
  • Operate and support GPU-based ML inference services (Triton Inference Server, vLLM) deployed on RunPod, Scaleway, and Nebius
  • Build and maintain Docker images for all microservices and ensure a stable service lifecycle
  • Maintain and scale development and production Kubernetes clusters, and actively participate in deployment debugging, incident investigation, and performance troubleshooting
  • Develop, maintain, and evolve custom Helm charts for each service
  • Design and operate CI/CD pipelines using GitHub (code and pipelines) and GitLab for on-prem customer deployments
  • Ensure platform compliance with SOC 2 requirements and actively contribute to improving security and compliance processes
  • Manage cluster access via NetBird VPN, implementing role-based access control using group policies
  • Deploy and manage infrastructure using IaC practices with Terraform and Ansible
  • Develop and continuously improve observability systems:
      • Grafana & Prometheus for metrics
      • ELK stack for centralized log storage and analysis
  • Continuously optimize infrastructure in the areas of IaC, IAM, Observability, and CI/CD
  • Work with a technology stack that includes Python, Kubernetes, Linux, Docker, GitHub CI/CD, PostgreSQL, ClickHouse, Kafka, Superset, Terraform, and Ansible

What we offer

  • The team has built award-winning AI products for tech corporations: devices, voice assistants, products that are actually in the world
  • Cutting-edge tech stack: Speech Technologies, NLP, Generative AI (LLMs, diffusion models), voice-first agentic architecture with privacy-first and on-premises deployment
  • High engineering bar and real ownership: the team cares about what actually works in production, not what looks good in a demo, and you'll see the impact of your work directly
  • Fast career progression: a senior-heavy team and a high volume of real problems mean you grow faster than you would anywhere else
  • Startup pace with enterprise stability: real clients, real revenue, no bureaucracy
  • Fully remote
  • 21 vacation days + public holidays + 5 sick days
  • Private English lessons via Preply
  • Participation in Employee Stock Ownership Plan (ESOP)