Job Title: Site Reliability Engineer (SRE)
Job Location: Germany (Remote)
Job Type: Fixed Term Contract (12 Months)
Responsibilities
• Design, develop, and maintain observability platform components and integrations across Prometheus, Thanos, Grafana, OpenTelemetry, and streaming telemetry systems.
• Contribute to the architecture and technical design of scalable monitoring solutions running on Kubernetes, Docker, and cloud-native environments.
• Implement standardised instrumentation using OpenTelemetry SDKs, collectors, exporters, and agents across services and infrastructure.
• Build and optimise telemetry pipelines for metrics, logs, and traces using Prometheus, the OpenTelemetry Collector, Kafka/streaming pipelines, and time-series backends.
• Develop advanced PromQL queries, recording rules, and Alertmanager logic for complex monitoring scenarios.
• Create reusable dashboards and visualisation templates using Grafana (and Perses, if applicable).
• Automate deployments and configuration using Git, GitHub/GitLab, Jenkins, Argo CD, Helm, and Infrastructure-as-Code practices.
• Troubleshoot and optimise performance across collectors, exporters, storage backends and query layers.
• Support performance testing, load validation, and reliability analysis of observability components.
• Collaborate with engineering and SRE teams to onboard services and improve telemetry coverage across platforms.
• Document implementations, standards, and operational procedures.
Required skills and expertise
• Strong programming experience in Go, Python, or Java, with a focus on backend or platform engineering.
• Hands-on expertise with the Prometheus ecosystem (Prometheus, Alertmanager, exporters, Pushgateway) and PromQL.
• Experience implementing OpenTelemetry instrumentation, collectors, processors and pipelines.
• Strong knowledge of Kubernetes, containers, Helm, and microservices architecture.
• Experience with CI/CD tools such as Jenkins, GitHub Actions, GitLab CI, or Argo CD.
• Understanding of distributed systems, performance tuning, debugging, and profiling techniques.
• Familiarity with streaming and messaging systems (for example, Kafka or equivalent) and time-series databases.
• Experience building or integrating REST/gRPC APIs.
• Proficiency in Git workflows, scripting (Bash/Python), and automation frameworks.
• Understanding of SNMP, exporters, and infrastructure/device telemetry collection.
• Awareness of security, RBAC, secrets management, and compliance requirements in platform environments.