As a Production Services Site Reliability Engineer, your responsibilities will include:
- Configure and monitor on-prem and cloud-based dependencies
- Automate continuous integration (CI) and continuous delivery (CD) pipelines
- Maintain staging and production environments with the goal of maximizing uptime
- Implement observability of systems for monitoring, alerting, and metrics reporting
- Generate reports regarding service metrics on performance, availability, and reliability
- Champion practices regarding change control management and incident response
A successful Production Services Site Reliability Engineer will be expected to:
- Proactively communicate status of Atlassian services to stakeholders and follow through on time-sensitive tasks
- Demonstrate willingness to ask for clarification and increase awareness of the larger context
- Explore solutions to problems, evaluate risk vs reward, then execute best approach
- Communicate asynchronously with a global team across multiple timezones
- Document new processes or update existing documentation pages
- Show eagerness and curiosity to learn across multiple technology stacks
B.S. in Computer Science or related work experience
Passion for building reliable, scalable, and performant distributed systems
Understanding of distributed systems with respect to applications, networking, and security
SRE or DevOps experience managing customer-facing systems in a 24/7 environment
Experience managing and monitoring fleets of *nix systems or container platforms
Excellent judgment and integrity, with the ability to make timely and sound decisions
Ability to anticipate the needs of others and adapt to changing conditions