Site Reliability Engineer Lead

Bank of America

Full-time

On-site

Plano, Texas, United States

$125,300 - $167,900 USD yearly

Job Description:

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.

Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and making an impact in the communities we serve.

Bank of America is committed to an in-office culture with specific requirements for office-based attendance, while allowing an appropriate level of flexibility for our teammates and businesses based on role-specific considerations.

At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!

This job is responsible for partnering with engineering and technology teams to implement measures prescribed by the Site Reliability Engineer teams it leads. Key responsibilities include ensuring appropriate instrumentation, tooling, ticketing, alerting and on call routines are in place for key services, demonstrating technical expertise within domains, and decomposing objectives into work units. Job expectations include advancing efficient solution delivery practices and promoting exceptional design, engineering, and organizational practices.

Role Overview:

The individual in this role is accountable for establishing and maintaining partnerships with Application Development and Production Support teams to implement the measures prescribed through the collaboration of the Senior Site Reliability Engineer (SRE) and the SRE team(s) they are leading. Responsibilities include ensuring the appropriate instrumentation, tooling, ticketing, alerting and on call routines are in place for key services. This role demonstrates a high level of technical expertise within one or more technical domains and the ability to decompose issues or objectives into units of work that can be assigned to other team members. This individual will advocate and advance more efficient solution delivery practices and evangelize great design, engineering, and organizational practices.

Responsibilities:

Collaborates with Development and Infrastructure teams to understand technical solutions and implement monitoring capabilities outlined in the application and system monitoring designs put forward by the Senior Site Reliability Engineer (SRE)
Develops and maintains reliability scripts, tools and libraries, leverages them for common instrumentation, automation, and operational needs, and uses them when mentoring SRE resources on reliability practices and established tools/capabilities
Partners to implement code changes to make use of common reliability libraries and tools and helps Application Production Services and Application Development teammates understand how to use them
Participates regularly in architecture community of practice meetings and communication via other channels
Identifies vulnerabilities and opportunities for reliability improvement, such as investigating low level error rates and 'noise' in monitoring, and defines solutions to reduce manual support effort and/or improve system reliability
Engages as a subject matter expert (SME) in major incident triage efforts and failure scenario modelling, and works with the Problem Manager to diagnose root causes for major incident/problem management investigations
Define and maintain a multi-year stability roadmap aligned with business objectives and technology strategy
Identify critical dependencies, risks, and mitigation strategies across infrastructure, applications, and services
Work with the architects to develop and adhere to the enterprise architectural patterns and frameworks that enhance system reliability and fault tolerance
Ensure designs adhere to best practices for high availability, disaster recovery, and performance optimization
Establish stability metrics, KPIs, and compliance standards for technology teams
Drive adoption of reliability engineering principles across development and operations
Partner with engineering, operations, and product teams to embed stability into the software development lifecycle
Act as a trusted advisor to senior leadership on stability-related initiatives and investments
Monitor emerging technologies and industry trends to enhance stability strategies
Lead post-incident reviews and ensure lessons learned are incorporated into future designs
Develop and maintain a catalog of extensible reliability scripts, tools, and libraries that can be leveraged for common instrumentation, automation and operational needs
Partner with infrastructure engineers and application teams to implement the code changes needed to make use of common reliability libraries and tools, and help Application Production Services (APS) and Application Development teammates understand how to use them

Required Qualifications:

8+ years in technology architecture, reliability engineering, or infrastructure strategy roles
Proven track record of delivering stability-focused initiatives in large-scale environments
Strong knowledge of distributed systems, cloud architecture (AWS, Azure, GCP), and microservices
Experience with reliability engineering, chaos testing, and observability tools
Ability to influence cross-functional teams and communicate complex concepts to non-technical stakeholders

Desired Qualifications:

SRE Certification

Skills:

Automation
Collaboration
Influence
Production Support
Result Orientation
Analytical Thinking
Application Development
Architecture
Solution Design
Stakeholder Management
Adaptability
DevOps Practices
Project Management
Risk Management
Solution Delivery Process

Shift:

1st shift (United States of America)

Hours Per Week:

Pay Transparency details

US - NJ - Pennington - 1300 American Blvd - Hopewell Bldg 3 (NJ2130)

Pay and benefits information

Pay range

$125,300.00 - $167,900.00 annualized salary, offers to be determined based on experience, education and skill set.

Discretionary incentive eligible

This role is eligible to participate in the annual discretionary plan. Employees are eligible for an annual discretionary award based on their overall individual performance results and behaviors, the performance and contributions of their line of business and/or group, and the overall success of the Company.

Benefits

This role is currently benefits eligible. We provide industry-leading benefits, access to paid time off, resources and support to our employees so they can make a genuine impact and contribute to the sustainable growth of our business and the communities we serve.
