Date:  Feb 20, 2026

Site Reliability Engineer

Location: 

ID

Level:  Staff
Employment Status:  Permanent
Department:  Group Digital Commercial
Description: 

Role Summary

We are seeking a skilled and passionate Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our hybrid and cloud-native infrastructure. You will play a critical role in automating operations, improving system resilience, and supporting mission-critical services running across Kubernetes and cloud environments. This role is ideal for engineers who enjoy solving complex infrastructure challenges, building automation, and improving platform reliability at scale.

 

Job Description (1/2)

Reliability & System Performance

  • Maintain high availability, scalability, and performance of production systems.
  • Define and monitor SLIs, SLOs, and error budgets to ensure service reliability.
  • Perform root cause analysis, incident response, and postmortem reviews.
  • Implement reliability improvements and proactive failure prevention. 

Cloud & Kubernetes Platform Management

  • Manage and optimize workloads running on Google Kubernetes Engine (GKE) and OpenShift.
  • Support multi-cluster and hybrid infrastructure environments.
  • Implement autoscaling and high-availability architecture.

CI/CD, GitOps & Release Engineering

  • Design and maintain CI/CD pipelines using GitLab CI/CD.
  • Implement GitOps deployment workflows using Argo CD.
  • Implement safe deployment strategies including:

Infrastructure as Code & Automation

  • Provision and manage infrastructure using Terraform / OpenTofu.
  • Develop and maintain Helm charts for Kubernetes deployments.
  • Automate operational tasks using Python scripting to reduce manual toil.

  

Job Description (2/2)

Observability, Monitoring & Distributed Tracing

  • Implement centralized logging using Grafana Loki and ELK Stack.
  • Build dashboards and alerts using Grafana and Datadog.
  • Implement distributed tracing using OpenTelemetry to improve system visibility.
  • Improve monitoring coverage and alert accuracy.

Performance & Load Testing

  • Conduct load and stress testing using tools such as k6, Locust, or JMeter.
  • Analyze performance bottlenecks and implement tuning strategies.
  • Support capacity planning and performance optimization.

Data Streaming & Integration

  • Support Change Data Capture (CDC) and real-time data streaming pipelines.
  • Work with Confluent Platform / Apache Kafka to ensure reliable event-driven data flow.

Security & Secret Management

  • Manage secrets securely using Google Cloud Secret Manager, Kubernetes Secrets, and HashiCorp Vault.
  • Implement secure CI/CD and platform access practices.

 

Education

Bachelor’s degree in Computer Science, Informatics, Information Systems, Electrical Engineering, Mathematics/Statistics, or a related field.

 

Experience

  • 0–4 years of experience in SRE, DevOps, Cloud Engineering, or Platform Engineering.
  • Hands-on experience supporting production systems and cloud infrastructure.

 

Technical Skills

  • Strong Linux system administration and networking fundamentals.
  • Hands-on experience with Kubernetes and containerized environments.
  • Experience designing and maintaining CI/CD pipelines.
  • Infrastructure as Code experience with Terraform and Ansible.
  • Helm chart development and Kubernetes deployment management.
  • Monitoring, logging, and observability best practices.
  • Programming/scripting skills in Bash and Python (Go is a plus).
  • Familiarity with Google Cloud Platform (GCP).