Senior Site Reliability Engineer (SRE) – Dynatrace & Azure Observability Expert

RaceTrac

22 days ago

Full-time

On-site

Atlanta, Georgia, United States

RaceTrac Company Overview

Job Description:

Location: Hybrid - 3 days onsite in Atlanta, GA

We are seeking a highly experienced Site Reliability Engineer (SRE) with deep expertise in Dynatrace, observability engineering, and Azure cloud technologies. This role will be exclusively focused on building, enhancing, and managing enterprise observability, telemetry, monitoring, and proactive reliability engineering practices across critical digital platforms.

The ideal candidate must possess advanced hands-on expertise in Dynatrace, especially Dynatrace Query Language (DQL), along with strong knowledge of Azure Monitor, Azure KQL, Application Insights, Azure Functions, APIM, and distributed telemetry concepts. The candidate should have a strong understanding of .NET application architecture and the ability to read and analyze .NET code to support troubleshooting, root cause analysis, and observability implementation within Azure environments. Experience enabling observability for mobile platforms such as iOS and Android is also required.

This is a highly technical, hands-on role requiring a proactive engineering mindset, strong analytical capabilities, and the ability to collaborate across engineering, cloud, mobile, and business teams.

Key Responsibilities

Dynatrace & Observability Engineering

Serve as the primary Dynatrace SME across the organization.
Design, develop, and optimize enterprise observability solutions using Dynatrace.
Develop advanced Dynatrace DQL queries, dashboards, workflows, alerts, and analytics.
Implement intelligent monitoring strategies for applications, APIs, integrations, Azure services, mobile platforms, and distributed systems.
Continuously improve observability maturity through telemetry standardization, proactive monitoring, and automation.
Configure and tune alerting mechanisms to improve signal-to-noise ratio and reduce alert fatigue.
Leverage Dynatrace Davis AI, anomaly detection, and AI-driven root cause analysis capabilities.
Enable and enhance observability for mobile applications across iOS and Android platforms.

Azure Monitoring & Cloud Operations

Build and maintain monitoring solutions using:
- Azure Monitor
- Application Insights
- Azure Log Analytics
- Azure KQL
Monitor and troubleshoot Azure Function Apps, App Services, APIs, integrations, and backend services.
Analyze telemetry, traces, logs, metrics, and distributed transactions to identify root causes and performance bottlenecks.
Troubleshoot cloud-native applications and Azure infrastructure issues.
Develop proactive monitoring for cloud services, integrations, APIs, and backend processing systems.

API & Integration Monitoring

Monitor and troubleshoot Azure API Management (APIM), API Gateways, API endpoints, and integrations.
Understand end-to-end API transaction flows and dependency mapping.
Build observability solutions for APIs, middleware platforms, and integration services.
Diagnose latency issues, transaction failures, authentication issues, and backend service degradation.

Mobile Application Observability

Enable telemetry, monitoring, tracing, and performance analysis for iOS and Android applications.
Analyze mobile-to-backend transaction flows and end-user experience metrics.
Troubleshoot mobile application latency, crash analytics, API failures, and connectivity issues.
Correlate mobile telemetry with backend application and infrastructure monitoring data.

Application Engineering & Troubleshooting

Utilize prior .NET development experience to troubleshoot application behavior, performance, and deployment issues.
Read and understand .NET application code to support root cause analysis and observability implementation.
Work closely with development teams to understand application logic, API flows, dependencies, and exception handling.
Support Azure Function deployments, configuration management, scaling, and runtime troubleshooting.
Collaborate with development teams during architecture reviews and production releases.
Ensure observability and monitoring readiness before deployments go live.

Site Reliability Engineering (SRE)

Perform deep technical analysis across systems by correlating logs, metrics, traces, and application telemetry.
Conduct root cause analysis (RCA) for recurring incidents and systemic issues.
Partner with engineering and operations teams to implement preventive improvements and automation.
Develop KPI-driven reliability improvements focused on system stability, performance, and operational excellence.
Proactively identify risks, bottlenecks, failure patterns, and reliability concerns before business impact occurs.

Continuous Improvement & Automation

Automate operational workflows and monitoring processes wherever possible.
Improve operational efficiency using AI-driven insights and automation capabilities.
Build reusable monitoring frameworks, dashboards, and telemetry standards.
Drive observability best practices across engineering teams.

Required Skills & Qualifications

Mandatory Technical Skills

10+ years of overall IT experience.
Expert-level hands-on experience with Dynatrace.
Advanced expertise in Dynatrace Query Language (DQL).
Strong hands-on expertise in Azure Kusto Query Language (KQL).
Deep understanding of telemetry, observability, distributed tracing, metrics, and logging concepts.
Strong Azure cloud experience with emphasis on:
- Azure Monitor
- Application Insights
- Azure Functions
- Azure API Management (APIM)
- Azure Log Analytics
- App Services
Strong understanding of API architectures, API Gateways, and backend integrations.
Prior hands-on experience developing .NET applications.
Strong ability to read, analyze, and understand .NET application code.
Experience troubleshooting and deploying Azure Functions and cloud-native applications.
Experience enabling observability and telemetry for mobile applications on iOS and Android.
Understanding of mobile telemetry, crash analytics, API monitoring, and end-user experience monitoring.
Strong understanding of distributed systems and enterprise application architectures.

Preferred Skills

Experience with OpenTelemetry implementation and instrumentation.
Experience with CI/CD pipelines and DevOps practices.
Knowledge of AI-driven observability and AIOps concepts.
Experience monitoring high-volume enterprise digital platforms.
Familiarity with ServiceNow and incident management workflows.
Experience with Databricks, SQL platforms, and integration technologies.

Core Competencies

Strong analytical and troubleshooting skills.
Excellent communication and stakeholder management abilities.
Ability to work independently and proactively drive work forward.
Strong collaboration skills across engineering, cloud, SRE, mobile, and business teams.
Ability to quickly adapt to new technologies and evolving environments.

Success Criteria

Reduction in recurring incidents through proactive monitoring and RCA.
Improved observability coverage across enterprise systems, APIs, and mobile applications.
Faster incident detection and resolution.
Reduction in monitoring noise and false positives.
Increased automation and operational efficiency.
Improved reliability and performance of critical systems and APIs.
Strong partnership with engineering teams to ensure production readiness and operational excellence.

Responsibilities:

Engages in and improves the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
Develops software and provides hands-on technical knowledge to design, deploy, and optimize large-scale, massively distributed, fault-tolerant systems.
Supports services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning, automation, pipelining and launch reviews.
Maintains services once they are live by measuring/monitoring availability, latency, and overall system health.
Scales systems sustainably through mechanisms like automation; evolves systems by pushing for changes that improve reliability and velocity. Reduces manual intervention and turnaround time for repetitive problems while automating and monitoring the health of our sites and services.
Practices sustainable incident response and blameless postmortems.
Improves and tunes operational efficiency within the Windows-based infrastructure and production environment.
Actively participates in deploying and supporting applications in our private and public cloud environments.
Collaborates with development teams to support the current environment as we transform into a cloud architecture and provides resources “as a service” to developers.

Qualifications:

Bachelor’s degree from an accredited college or university in Computer Science or related field preferred. Equivalent practical experience will be considered.
Experience programming in at least one of the following languages: C, C++, Java, Python, or Go.
Minimum of 4 years of working experience with Azure. Experience with Jenkins or a similar application.
General knowledge of Infrastructure as Code and configuration management tools such as Terraform, Ansible, Chef, Puppet, or SCCM.
Comfort with large-scale production systems and technologies (load balancing, monitoring, distributed systems, and configuration management). Expertise in designing, analyzing, and troubleshooting such systems.
Ability to debug, optimize code, and automate routine tasks.
Systematic problem-solving approach, coupled with effective communication skills and a sense of drive.
Experience supporting services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity management, and launch reviews.
Demonstrated history of living the values that are important to RaceTrac: Honesty, Efficiency, Attitude, Respect, Teamwork.

All qualified applicants will receive consideration for employment with RaceTrac without regard to their race, national origin, religion, age, color, sex, sexual orientation, gender identity, disability, or protected veteran status, or any other characteristic protected by local, state, or federal laws, rules, or regulations.
