
Senior Lead Site Reliability Engineer

JPMorganChase
Full-time
On-site
Plano, Texas, United States
$171,000 - $260,000 USD yearly
Description

Guide and shape the future of technology at a globally recognized firm, driven by pride in ownership.

 

As a Senior Lead Site Reliability Engineer at JPMorgan Chase within the Infrastructure & Production Management sector of Consumer & Community Banking, you will work closely with stakeholders to define non-functional requirements (NFRs) and set service availability targets for applications and product lines. You will incorporate these NFRs into the design and testing stages of product development, measure customer experience accurately through service level indicators (SLIs), and define and implement service level objectives (SLOs) in production in partnership with stakeholders.

 

Job responsibilities:

  • Design, build, and maintain scalable and dependable infrastructure using Infrastructure as Code (IaC) tools, such as Terraform.
  • Manage incidents effectively and reduce Mean Time to Recovery (MTTR) and other MTTx metrics through proactive monitoring and response strategies.
  • Implement and maintain observability tools, including Grafana, Splunk, CloudWatch, and Datadog, to monitor and ensure system performance and reliability.
  • Deploy and oversee services within cloud environments, with a strong preference for AWS, ensuring optimal performance and cost-efficiency.
  • Employ container technologies like Docker and Kubernetes to facilitate efficient service deployment and management.
  • Implement monitoring solutions to track API performance metrics such as response time, error rates, and throughput.
  • Use tools like Grafana, Datadog, or Dynatrace to visualize these metrics and set up dashboards for real-time monitoring.
  • Ensure comprehensive logging of API requests and responses, including metadata such as timestamps, request paths, and status codes.
  • Define and monitor SLOs for API performance and availability, ensuring they align with stakeholder expectations.
  • Implement health checks to monitor the status of API Gateway instances, and configure automatic scaling up or down based on traffic demand.

 

Required qualifications, capabilities, and skills

  • Formal training or certification in software engineering concepts and 5+ years of applied experience.
  • Advanced knowledge of site reliability culture and principles, with demonstrated ability to implement site reliability within an application or platform.
  • Proficiency in programming languages such as Java, Go, and Python.
  • Extensive experience with Infrastructure as Code (IaC) tools, with a strong emphasis on Terraform.
  • Proven experience deploying services to cloud platforms, preferably AWS.
  • Strong expertise in container technologies, including Docker and Kubernetes.
  • Comprehensive experience with observability tools, specifically Grafana, Splunk, CloudWatch, and Datadog.
  • Experience in incident management and improving MTTx metrics.
  • Strong problem-solving skills with a high level of accountability.
  • Excellent written and verbal communication skills.
  • Ability to work independently and manage ambiguous scopes effectively.

 

Preferred qualifications, capabilities, and skills
  • Experience in developing and consuming APIs (REST, SOAP, GraphQL).
  • Strong expertise in networking concepts (TCP/IP, DNS, TLS, HTTPS) and CDN technologies.
  • Experience in developing and operating large-scale API Gateways (Apigee, Kong, Envoy) with high reliability.

 
