Senior Lead Site Reliability Engineer

JPMorganChase

21 days ago

Full-time

On-site

Plano, Texas, United States

Description

Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership.

As a Senior Lead Site Reliability Engineer at JPMorganChase within the Infrastructure & Production Management sector of Consumer & Community Banking, you are the non-functional requirement owner and champion for the applications in your remit. You are a key influencer in your team’s strategic planning, driving continual improvement in customer experience, resiliency, security, scalability, monitoring, instrumentation, and automation of the software in your area. You act in a blameless, data-driven manner and navigate difficult situations with composure and tact.

Job responsibilities

Demonstrates expertise in site reliability principles and an understanding of the fine balance between features, efficiency, and stability
Effectively negotiates with peers and executive partners to ensure optimal outcomes for all
Drives the adoption of site reliability practices throughout the organization
Ensures your teams follow site reliability best practices and can demonstrate this empirically through stability and reliability metrics
Drives a culture of continual improvement and solicits real-time feedback to improve the customer’s experience
Ensures your team collaborates with other teams within your group’s specialization and avoids duplication of work where possible
Follows blameless, data-driven, post-mortem strategies and conducts regular team debriefs to enable learning from both successes and mistakes
Provides personalized coaching for entry to mid-level team members
Ensures your team documents and shares their knowledge and innovations via internal forums, communities of practice, guilds, and conferences

Required qualifications, capabilities, and skills

Formal training or certification in software engineering concepts and 5+ years of applied experience; plus 2+ years leading technologists to manage and solve complex technical items within your domain.
Advanced proficiency in SRE culture and principles, with a track record of implementing SRE practices across application and platform teams while avoiding common pitfalls.
Strong observability fundamentals: define and measure SLIs, set and manage SLOs and error budgets, build actionable alerting and dashboards; hands-on experience with Dynatrace and Splunk.
Proven resiliency engineering: capacity planning, failure mode analysis, fault-tolerant design (circuit breakers, retries, bulkheads), disaster recovery strategies, and running game days.
Proficiency in at least one programming language (e.g., Python, Java Spring Boot, .NET) to build production-grade automation and tooling; deeper coding skills are a plus but not a hard requirement.
Proficiency in CI/CD and Infrastructure as Code (e.g., Jenkins, GitLab, Terraform), including pipeline design, environment promotion, and secrets/artifact management.
Experience with containers and orchestration (e.g., Docker, Kubernetes, ECS), including image hardening, Helm, and operational runbooks.
Ability to troubleshoot common networking technologies and issues (TCP/IP, DNS, HTTP, proxies, load balancers, TLS, routing, VPCs/subnets, firewalls).
Demonstrated proficiency operating cloud-scale, distributed systems within a technical discipline (e.g., cloud platforms), with experience at firmwide or similarly large scale.
Ability to influence team culture by championing innovation and change; experience mentoring and leading technologists (including hiring, developing, and recognizing talent) as an individual contributor.
Automation mindset focused on reducing toil (target ~25% of time), building self-service capabilities, and codifying operational procedures into code.

Preferred qualifications, capabilities, and skills

Experience in banking/financial services and familiarity with risk and control expectations in regulated environments.
AWS experience; AWS Certified Solutions Architect (Associate or Professional) preferred.
Advanced observability ecosystem knowledge beyond Dynatrace/Splunk (e.g., OpenTelemetry, Prometheus, Grafana, ELK).
Experience scaling SRE practices across multiple teams/platforms, including playbooks, SRE onboarding, and maturity assessments.
Exposure to payments concepts and platforms (e.g., ISO 20022, SWIFT, real-time payments) with willingness to learn; not required for the role.
Experience with chaos engineering tools (e.g., Gremlin, Litmus, Chaos Mesh) and integrating resilience tests into CI/CD pipelines.
Proven cloud cost/performance optimization in production (autoscaling, caching, capacity management, and efficiency tuning).
