
Site Reliability Engineer Lead

Bank of America
Full-time
On-site
Plano, Texas, United States
$125,300 - $167,900 USD yearly

Job Description:

At Bank of America, we are guided by a common purpose to help make financial lives better through the power of every connection. We do this by driving Responsible Growth and delivering for our clients, teammates, communities and shareholders every day.

Being a Great Place to Work is core to how we drive Responsible Growth. This includes our commitment to being an inclusive workplace, attracting and developing exceptional talent, supporting our teammates’ physical, emotional, and financial wellness, recognizing and rewarding performance, and making an impact in the communities we serve.

Bank of America is committed to an in-office culture with specific requirements for office-based attendance, while allowing an appropriate level of flexibility for our teammates and businesses based on role-specific considerations.

At Bank of America, you can build a successful career with opportunities to learn, grow, and make an impact. Join us!
 


This job is responsible for partnering with engineering and technology teams to implement measures prescribed by the Site Reliability Engineer teams it leads. Key responsibilities include ensuring appropriate instrumentation, tooling, ticketing, alerting, and on-call routines are in place for key services, demonstrating technical expertise within domains, and decomposing objectives into work units. Job expectations include advancing efficient solution delivery practices and promoting exceptional design, engineering, and organizational practices.

Role Overview:

The individual in this role is accountable for establishing and maintaining partnerships with Application Development and Production Support teams to implement the measures prescribed through the collaboration of the Senior Site Reliability Engineer (SRE) and the SRE team(s) they are leading. This includes ensuring the appropriate instrumentation, tooling, ticketing, alerting, and on-call routines are in place for key services. The role requires a high level of technical expertise within one or more technical domains and the ability to decompose issues or objectives into units of work that can be assigned to other team members. This individual will advocate and advance more efficient solution delivery practices and evangelize great design, engineering, and organizational practices.

Responsibilities:

  • Collaborate with Development and Infrastructure teams to understand technical solutions and implement the monitoring capabilities outlined in the application and system monitoring designs put forward by the Senior Site Reliability Engineer (SRE)
  • Develop and maintain a catalog of extensible reliability scripts, tools, and libraries for common instrumentation, automation, and operational needs, and use them when mentoring SRE resources on reliability practices and established tools and capabilities
  • Partner with infrastructure engineers and application teams to implement the code changes needed to use common reliability libraries and tools, and help Application Production Services (APS) and Application Development teammates understand how to use them
  • Participate regularly in architecture community of practice meetings and in communication via other channels
  • Identify vulnerabilities and opportunities for reliability improvement, such as investigating low-level error rates and 'noise' in monitoring, and define solutions to reduce manual support effort and/or improve system reliability
  • Engage as a subject matter expert (SME) in major incident triage and failure scenario modelling, and work with the Problem Manager to diagnose root causes for major incident and problem management investigations
  • Define and maintain a multi-year stability roadmap aligned with business objectives and technology strategy
  • Identify critical dependencies, risks, and mitigation strategies across infrastructure, applications, and services
  • Work with the architects to develop and adhere to the enterprise architectural patterns and frameworks that enhance system reliability and fault tolerance
  • Ensure designs adhere to best practices for high availability, disaster recovery, and performance optimization
  • Establish stability metrics, KPIs, and compliance standards for technology teams
  • Drive adoption of reliability engineering principles across development and operations
  • Partner with engineering, operations, and product teams to embed stability into the software development lifecycle
  • Act as a trusted advisor to senior leadership on stability-related initiatives and investments
  • Monitor emerging technologies and industry trends to enhance stability strategies
  • Lead post-incident reviews and ensure lessons learned are incorporated into future designs

Required Qualifications:

  • 8+ years in technology architecture, reliability engineering, or infrastructure strategy roles
  • Proven track record of delivering stability-focused initiatives in large-scale environments
  • Strong knowledge of distributed systems, cloud architecture (AWS, Azure, GCP), and microservices
  • Experience with reliability engineering, chaos testing, and observability tools
  • Ability to influence cross-functional teams and communicate complex concepts to non-technical stakeholders

Desired Qualifications:

  • SRE Certification

Skills:

  • Automation
  • Collaboration
  • Influence
  • Production Support
  • Result Orientation
  • Analytical Thinking
  • Application Development
  • Architecture
  • Solution Design
  • Stakeholder Management
  • Adaptability
  • DevOps Practices
  • Project Management
  • Risk Management
  • Solution Delivery Process

Shift:

1st shift (United States of America)

Hours Per Week: 

40

Pay Transparency details

US - NJ - Pennington - 1300 American Blvd - Hopewell Bldg 3 (NJ2130)

Pay and benefits information

Pay range

$125,300.00 - $167,900.00 annualized salary, offers to be determined based on experience, education and skill set.

Discretionary incentive eligible

This role is eligible to participate in the annual discretionary plan. Employees are eligible for an annual discretionary award based on their overall individual performance results and behaviors, the performance and contributions of their line of business and/or group, and the overall success of the Company.

Benefits

This role is currently benefits eligible. We provide industry-leading benefits, access to paid time off, resources and support to our employees so they can make a genuine impact and contribute to the sustainable growth of our business and the communities we serve.