About Arctera
Arctera keeps the world’s IT systems working. We can trust that our credit cards will work at the store, that power will be routed to our homes and that factories will produce our medications because those companies themselves trust Arctera.
Arctera is behind the scenes making sure that many of the biggest organizations in the world – and many of the smallest too – can face down ransomware attacks, natural disasters, and compliance challenges without missing a beat. We do this through the power of data and our flagship products, Insight, InfoScale and Backup Exec.
Illuminating data also helps our customers maintain personal privacy, reduce the environmental impact of data storage, and defend against illegal or immoral use of information.
It’s a task that continues to get more complex as data volumes surge. Every day, the world produces more data than it ever has before. And global digital transformation – and the arrival of the age of AI – has set the course for a new explosion in data creation.
Joining the Arctera team, you’ll be part of a group innovating to harness the opportunity of the latest technologies to protect the world’s critical infrastructure and to keep all our data safe.
As a Senior Site Reliability Engineer, you will be a key contributor to the design, implementation, and support of complex, large-scale systems across Windows Server, Linux, and containers running in the Azure cloud environment. You'll be responsible for improving reliability, automation, observability, and operational excellence while ensuring security and compliance across all environments.
This role requires a strong background in systems engineering, infrastructure as code, security best practices, and experience operating services at enterprise scale.
Responsibilities:
- Deploy and manage scalable, resilient, and secure infrastructure in Microsoft Azure using Terraform and GitOps workflows.
- Maintain and optimize hybrid environments across Windows Server and Linux systems.
- Design and implement configuration management using Puppet, ensuring consistency, automation, and compliance.
- Manage and secure Microsoft SQL Server instances and contribute to database performance and availability strategies.
- Operate and scale Elasticsearch clusters for logging, monitoring, analytics, and search workloads.
- Develop and help maintain robust observability practices using metrics, tracing, and logging.
- Lead the resolution of complex production issues, conduct root cause analysis, and define long-term remediation plans.
- Partner with security, development, and compliance teams to enforce infrastructure security, hardening, and access controls.
- Work with existing team members and contribute to a culture of engineering excellence and continuous improvement.
- Participate in a rotating on-call schedule.
Basic Qualifications:
- 2+ years of experience in Site Reliability Engineering, Infrastructure Engineering, or a related role.
- Advanced knowledge of both Windows Server (2016/2019/2022) and Linux (Ubuntu/RHEL) environments.
- Proven experience operating infrastructure in Microsoft Azure, including networking, VMs, security, and automation tools.
- Strong programming/scripting skills in PowerShell, Bash, and/or Python.
Additional Qualifications:
- Experience using and developing Puppet or Ansible code for configuration management.
- Experience with Terraform for managing IaC at enterprise scale.
- Strong command of SQL Server administration, backup strategies, performance tuning, and security.
- Excellent understanding of security principles (IAM, RBAC, least privilege, encryption, audit logging).
- Proficiency in monitoring and alerting using tools such as Azure Monitor, Datadog, Grafana, Prometheus, or similar.
- Experience with building, managing, and scaling large production-grade Elasticsearch clusters.
- Exposure to compliance and regulatory environments (SOC 2, HIPAA, ISO 27001).
- Familiarity with containerization (Docker) and orchestration (Kubernetes).
- Experience with CI/CD pipelines (Azure DevOps, GitHub Actions, Jenkins, etc.).
- Experience with high-scale architecture, capacity planning, and cost optimization in cloud environments.