Senior Site Reliability Engineer (Kubernetes)

Cytracom

Full-time

On-site

McKinney, Texas, United States

About Us:

Cytracom delivers infrastructure software purpose-built to empower managed service providers (MSPs) and IT service providers (ITSPs) with cloud solutions that connect and secure both traditional and hybrid workforces.

Our secure access service edge (SASE) solution provides identity-based network security and connectivity within a single platform that enables businesses to deploy zero-trust networks, enforce compliance, and eliminate traditional firewalls and VPNs.

Our unified communications suite (UCaaS) uniquely aligns with the operating needs of MSPs and enables their customers to experience seamless communication and collaboration regardless of physical location.

Here's a closer look at this key role:

As a Senior Site Reliability Engineer (Kubernetes), you will lead the design, implementation, and operation of Cytracom's Kubernetes environment with two focus areas: (1) ensuring overall platform reliability with deep observability, and (2) building a dynamic, policy-driven traffic enforcement layer within our Kubernetes-hosted SASE platform. This role is responsible for both platform uptime and multi-tenant fairness across distributed ingress points and cloud-native workloads.

Responsibilities

Serve as the primary owner for platform stability and performance, with ultimate accountability for SLA attainment.
Design and implement resilient Kubernetes architectures that maximize fault tolerance across regions and availability zones.
Lead capacity planning and proactive scaling initiatives to ensure consistent performance under variable loads.
Establish and refine incident management processes, including on-call rotations, automated remediation, and post-incident analysis.
Implement rigorous change management protocols to minimize service disruptions during platform updates.
Conduct regular disaster recovery exercises and failure simulations to validate resilience strategies.
Build a world-class observability stack that provides actionable insights across the platform, network, and application layers.
Implement golden signals monitoring for all critical system components with appropriate alerting thresholds.
Design and maintain dashboards that provide clear visibility into platform health, performance bottlenecks, and customer impact.
Create observability-driven automated remediation workflows that detect and resolve issues before they affect customers.
Implement distributed tracing across service boundaries to pinpoint performance bottlenecks and optimize critical paths.
Establish logging standards and retention policies that balance troubleshooting needs with storage efficiency.
Own the lifecycle management of Cytracom's Kubernetes clusters, including upgrades, migrations, resiliency testing, and cluster autoscaling.
Lead incident response and platform stability initiatives, including root cause analysis and infrastructure remediation.
Automate cluster operations using GitOps methodologies with Helm, ArgoCD, Flux, and Terraform.
Ensure compliance with industry security standards and perform regular security audits.
Optimize Kubernetes resource allocation and implement FinOps practices for cost management.
Design and build a Kubernetes-native traffic control plane capable of shaping traffic and enforcing per-tenant and per-device session/bandwidth limits.
Architect dynamic enforcement logic using Kubernetes CRDs and custom controllers integrated with Prometheus and OpenTelemetry.
Extend Kubernetes policy expressiveness to encompass dynamic global usage thresholds and fair-share logic across tenants.
Implement and troubleshoot service mesh technologies for advanced traffic routing, security, and observability.
Deploy and operate telemetry agents and exporters (e.g., Cilium Hubble, custom WireGuard agents, conntrack, Prometheus) for SDN and edge device flows.
Integrate flow, session, and tunnel data into OpenTelemetry pipelines and deliver actionable metrics to Grafana dashboards.
Implement real-time response workflows using Prometheus rule evaluation, webhook triggers, and automated tc/iptables enforcement.
Design and implement chaos engineering tests to validate system resilience under network stress conditions.
Contribute to architecture decisions and technical strategy for our cloud-native infrastructure.
Mentor junior engineers and share Kubernetes best practices across the organization.
Participate in the wider Kubernetes community through contributions, speaking, or knowledge sharing.
Partner cross-functionally with security, development, and product teams to align infrastructure capabilities with business needs.

Required Competencies

Proven success managing Kubernetes in high-stakes, multi-region environments with significant production workloads.
Extensive experience building comprehensive observability solutions across metrics, logs, and traces.
Track record of maintaining 99.9%+ availability for mission-critical services.
Advanced proficiency in Kubernetes controller patterns, custom resources, and operators (Kubebuilder preferred).
Hands-on experience managing Kubernetes clusters atop OpenStack, with strong understanding of Nova, Neutron, and Ceph integration.
Deep understanding of CNI plugins (Cilium preferred) and dynamic network policy enforcement.
Strong Linux networking fundamentals: tc, nftables, conntrack, iptables, and WireGuard.
Observability stack expertise: Prometheus, Grafana, OpenTelemetry, Jaeger, Loki.
Experience implementing SLOs and error budgets to drive reliability improvements.
Fluent in Go for Kubernetes controller development; familiarity with Python/Bash for scripting.
Experience implementing GitOps workflows using Helm, Terraform, ArgoCD, or Flux for Kubernetes automation.
Knowledge of network security and overlay architectures including VXLAN, IPsec, and SDN routing.

Preferred Competencies

Experience implementing and troubleshooting service mesh technologies (Istio, Linkerd) for advanced traffic management.
Hands-on experience with multi-cluster management technologies (Fleet, Cluster API, Rancher).
Implementation of chaos engineering practices using frameworks like Chaos Mesh or Litmus.
Experience building internal developer platforms on Kubernetes that abstract complexity.
Knowledge of FinOps practices for Kubernetes resource and cost optimization.
Experience with edge computing Kubernetes deployments.
Active participation in the Kubernetes community through contributions, speaking, or knowledge sharing.
Experience with stateful application management on Kubernetes (databases, message queues).
Background in network function virtualization (NFV) or software-defined networking (SDN).
Design and implementation of event-driven auto-remediation systems.

Our Benefits:

Medical, dental, and vision insurance
401(k)
Disability and life insurance
Paid vacation and holidays
Flexible PTO policy
Casual, laid-back work environment
Free refreshments
Standing desks

Cytracom, LLC is an Equal Opportunity Employer and supports a diverse, inclusive work environment. All qualified applicants will receive consideration for employment without regard to protected characteristics, including race, color, religion, sex, national origin, disability, veteran status, sexual orientation, gender identity or age.
