Staff Site Reliability Engineer

Velocity Global

Full-time

On-site

Palo Alto, California, United States

Velocity Global seeks a Staff Site Reliability Engineer (Staff SRE) with extensive cloud engineering experience. In this role, you will lead, design, and help build the automation and support of our cloud infrastructure, identify and execute strategies to improve our full-stack telemetry, monitoring, and alerting capabilities, and improve our overall SLAs.

We obsess over performance, scalability, privacy and security. SREs work cross-functionally with DevOps and Engineering teams, combining operations work with software engineering principles to enable high availability of production systems. You will serve as a partner to our Engineering organization to help make their services more performant, scalable, observable, and reliable. Every engineering team at Velocity Global should be responsible for the software they build. SREs are critical in providing the tools, practices, and expertise to make that happen.

You will be based in Palo Alto, California, and in-office collaboration is required for at least three days per week.

You Will:

Automate observability and alerting across an ever-changing landscape of microservices
Build automated Service Reliability Scorecards and Production Readiness Standards
Run Chaos Engineering and Game Day simulations to discover weak spots, and test fixes for them, before they would otherwise surface in a real production incident
Propose and drive software engineering projects that remove operational bottlenecks and increase velocity in ways we haven't considered before
Expand and improve our observability and monitoring footprint
Collaborate with the Engineering and DevOps teams to create architectural plans, define project requirements, and establish technical standards
Address common operational challenges by building tools and automation scripts
Serve on the Incident Response Team to help debug and drive resolution of production reliability issues, contribute to postmortems, and work to prevent recurrence
Participate in design and production reviews for new features, products, or infrastructure
Audit and tune the configuration of systems owned by other engineering teams
Plan for the growth of Velocity Global's infrastructure and its reliability and resiliency
Design and implement the high-availability architecture underlying Velocity Global's platform
Create disaster recovery solutions, including backups, redundant systems, and emergency response processes
Collaborate with architects and engineering leaders to hire, train, and mentor talent

You Have:

Outstanding analytical skills with the ability to solve complex systems challenges and performance bottlenecks
Proficiency in public cloud infrastructure, networking, architecture, and Linux, as well as orchestration, monitoring, automation, and configuration management solutions
Practical knowledge of distributed service design and performance, including messaging protocols, caching, data residency, and observability
Passion for designing and evolving complex systems while also supporting day-to-day infrastructure operations. We want someone who doesn't default to a favorite tool but looks for the right tool for the job
A dedication to learning new techniques and technologies and sharing ideas with fellow engineers, with a talent for breaking down, discussing, and communicating technical concepts

Nice to Have:

5-8 years of software engineering experience (depending on the open role), preferably in infrastructure engineering
5-8 years of experience with highly scalable cloud architectures, including service-oriented architectures (AWS and/or GCP experience preferred)
Ability to collaborate well and come up with maintainable, reliable solutions. Experience building scalable, high-performing systems.
Strong analytical and problem-solving skills.
Ability to provide both architectural guidance and detailed technical directions.
Excellent communication, collaboration and leadership skills

