Senior Lead Site Reliability Engineer

JPMorganChase
Full-time
On-site
Plano, Texas, United States
Description

Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership.

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Infrastructure & Production Management sector of Consumer & Community Banking, you are the non-functional requirement owner and champion for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.

 

Job responsibilities

  • Demonstrates expertise in site reliability principles and an understanding of the fine balance between features, efficiency, and stability
  • Effectively negotiates with peers and executive partners to ensure optimal outcomes for all 
  • Drives the adoption of site reliability practices throughout the organization
  • Ensures your teams apply site reliability best practices and can demonstrate this empirically through stability and reliability metrics
  • Drives a culture of continual improvement and solicits real-time feedback to improve the customer’s experience
  • Ensures your team collaborates with other teams within your group’s specialization and avoids duplication of work where possible
  • Follows blameless, data-driven post-mortem strategies and conducts regular team debriefs to enable learning from both successes and mistakes
  • Provides personalized coaching for entry- to mid-level team members
  • Ensures your team documents and shares their knowledge and innovations via internal forums, communities of practice, guilds, and conferences 

 

Required qualifications, capabilities, and skills 

  • Formal training or certification in software engineering concepts and 5+ years of applied experience; plus 2+ years leading technologists to manage and solve complex technical items within your domain.
  • Advanced proficiency in SRE culture and principles, with a track record of implementing SRE practices across application and platform teams while avoiding common pitfalls.
  • Strong observability fundamentals: define and measure SLIs, set and manage SLOs and error budgets, build actionable alerting and dashboards; hands-on experience with Dynatrace and Splunk.
  • Proven resiliency engineering: capacity planning, failure mode analysis, fault-tolerant design (circuit breakers, retries, bulkheads), disaster recovery strategies, and running game days.
  • Proficiency in at least one programming language (e.g., Python, Java Spring Boot, .NET) to build production-grade automation and tooling; deeper coding skills are a plus but not a hard requirement.
  • Proficiency in CI/CD and Infrastructure as Code (e.g., Jenkins, GitLab, Terraform), including pipeline design, environment promotion, and secrets/artifact management.
  • Experience with containers and orchestration (e.g., Docker, Kubernetes, ECS), including image hardening, Helm, and operational runbooks.
  • Ability to troubleshoot common networking technologies and issues (TCP/IP, DNS, HTTP, proxies, load balancers, TLS, routing, VPCs/subnets, firewalls).
  • Demonstrated proficiency operating cloud-scale, distributed systems within a technical discipline (e.g., cloud platforms), with experience at firmwide or similarly large scale.
  • Ability to influence team culture by championing innovation and change; experience mentoring and leading technologists (including hiring, developing, and recognizing talent) as an individual contributor.
  • Automation mindset focused on reducing toil (target ~25% of time), building self-service capabilities, and codifying operational procedures into code.

 

Preferred qualifications, capabilities, and skills 

  • Experience in banking/financial services and familiarity with risk and control expectations in regulated environments.
  • AWS experience; AWS Certified Solutions Architect (Associate or Professional) preferred.
  • Advanced observability ecosystem knowledge beyond Dynatrace/Splunk (e.g., OpenTelemetry, Prometheus, Grafana, ELK).
  • Experience scaling SRE practices across multiple teams/platforms, including playbooks, SRE onboarding, and maturity assessments.
  • Exposure to payments concepts and platforms (e.g., ISO 20022, SWIFT, real-time payments) with willingness to learn; not required for the role.
  • Experience with chaos engineering tools (e.g., Gremlin, Litmus, Chaos Mesh) and integrating resilience tests into CI/CD pipelines.
  • Proven cloud cost/performance optimization in production (autoscaling, caching, capacity management, and efficiency tuning).