
Staff Site Reliability Engineer

Velocity Global
Full-time
On-site
Palo Alto, California, United States

Velocity Global seeks a Staff Site Reliability Engineer (Staff SRE) with extensive cloud engineering experience. In this role, you will lead, design, and help create the automation and support efforts for our cloud infrastructure, identify and execute strategies to improve our full-stack telemetry, monitoring, and alerting capabilities, and improve our overall SLAs.

We obsess over performance, scalability, privacy and security. SREs work cross-functionally with DevOps and Engineering teams, combining operations work with software engineering principles to enable high availability of production systems. You will serve as a partner to our Engineering organization to help make their services more performant, scalable, observable, and reliable. Every engineering team at Velocity Global should be responsible for the software they build. SREs are critical in providing the tools, practices, and expertise to make that happen. 

You will be based in Palo Alto, California, and in-office collaboration is required for at least three days per week.

You Will:

  • Automate observability and alerting across an ever-changing landscape of microservices
  • Develop automated Service Reliability Scorecards and Production Readiness Standards
  • Run Chaos Engineering experiments and Game Day simulations to discover weak spots, and test fixes for them, that would otherwise not be identified until a real-life production incident occurs
  • Drive software engineering project work, proposed and led by individual SRE team members, to remove operational bottlenecks and increase velocity in ways we've never considered before
  • Expand and improve our observability and monitoring footprint
  • Collaborate with Engineering and DevOps to create architectural plans, define project requirements, and establish technical standards
  • Address common operational challenges by building tools and automation scripts
  • Serve on the Incident Response Team to help debug and drive resolution of production reliability issues, contribute to the postmortem, and work to prevent recurrence
  • Participate in design and production reviews for new features, products, or infrastructure
  • Audit and tune the configuration of systems owned by other engineering teams
  • Plan for the growth of Velocity Global’s infrastructure and infrastructure reliability/resiliency
  • Design and implement the High Availability architecture underlying Velocity Global’s platform
  • Create Disaster Recovery solutions, including backups, redundant systems, and emergency response processes
  • Collaborate with Architects and Engineering leaders on hiring, training, and mentoring talent

You Have:

  • Outstanding analytical skills with the ability to solve complex systems challenges and performance bottlenecks
  • Proficiency in public cloud infrastructure, networking, architecture, and Linux, as well as orchestration, monitoring, automation, and configuration management solutions
  • Practical knowledge of distributed service design and performance, including messaging protocols, caching, data residency, and observability
  • Passion for designing and evolving complex systems while also being able to support day-to-day infrastructure operations. We want someone who does not favor a single tool, but rather looks for the right tool for the job
  • A dedication to learning new techniques and technologies and sharing ideas with your fellow engineers, with a mastery of breaking down, discussing, and communicating technical concepts

Nice to Have:

  • 5-8 years of software engineering experience (depending on the open role), preferably in Infrastructure Engineering.
  • 5-8 years of experience in highly scalable cloud architectures including service-oriented architectures (AWS and/or GCP experience preferred)
  • Ability to collaborate well and come up with maintainable, reliable solutions. Experience building scalable, high-performing systems.
  • Strong analytical and problem-solving skills.
  • Ability to provide both architectural guidance and detailed technical directions.
  • Excellent communication, collaboration and leadership skills

 

#LI-Hybrid