Date: Mar 21, 2026

Site Reliability Engineer

Location:

ID

Level: Staff

Employment Status: Permanent

Department: Group Consumer Technology & Digital Innovation

Description:

Role Summary

We are seeking a skilled and passionate Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our hybrid and cloud-native infrastructure. You will play a critical role in automating operations, improving system resilience, and supporting mission-critical services running across Kubernetes and cloud environments. This role is ideal for engineers who enjoy solving complex infrastructure challenges, building automation, and improving platform reliability at scale.

Job Description (1/2)

Reliability & System Performance

Maintain high availability, scalability, and performance of production systems.
Define and monitor SLIs, SLOs, and error budgets to ensure service reliability.
Perform root cause analysis, incident response, and postmortem reviews.
Implement reliability improvements and proactive failure prevention.

Cloud & Kubernetes Platform Management

Manage and optimize workloads running on Google Kubernetes Engine (GKE) and OpenShift.
Support multi-cluster and hybrid infrastructure environments.
Implement autoscaling and high-availability architecture.

CI/CD, GitOps & Release Engineering

Design and maintain CI/CD pipelines using GitLab CI/CD.
Implement GitOps deployment workflows using Argo CD.
Implement safe deployment strategies including:

🔹 Infrastructure as Code & Automation

Provision and manage infrastructure using Terraform / OpenTofu.
Develop and maintain Helm charts for Kubernetes deployments.
Automate operational tasks using Python scripting to reduce manual toil.

Job Description (2/2)

🔹 Observability, Monitoring & Distributed Tracing

Implement centralized logging using Grafana Loki and ELK Stack.
Build dashboards and alerts using Grafana and Datadog.
Implement distributed tracing using OpenTelemetry to improve system visibility.
Improve monitoring coverage and alert accuracy.

🔹 Performance & Load Testing

Conduct load and stress testing using tools such as k6, Locust, or JMeter.
Analyze performance bottlenecks and implement tuning strategies.
Support capacity planning and performance optimization.

🔹 Data Streaming & Integration

Support Change Data Capture (CDC) and real-time data streaming pipelines.
Work with Confluent Platform / Apache Kafka to ensure reliable event-driven data flow.

🔹 Security & Secret Management

Manage secrets securely using Google Cloud Secret Manager, Kubernetes Secrets, and HashiCorp Vault.
Implement secure CI/CD and platform access practices.

Education

Bachelor’s degree in Computer Science, Informatics, Information Systems, Electrical Engineering, Mathematics/Statistics, or related field.

Experience

0–4 years of experience in SRE, DevOps, Cloud Engineering, or Platform Engineering.
Hands-on experience supporting production systems and cloud infrastructure.

Technical Skills

Strong Linux system administration and networking fundamentals.
Hands-on experience with Kubernetes and containerized environments.
Experience designing and maintaining CI/CD pipelines.
Infrastructure as Code experience (Terraform, Ansible).
Helm chart development and Kubernetes deployment management.
Monitoring, logging, and observability best practices.
Programming/scripting skills in Bash and Python (Go is a plus).
Familiarity with Google Cloud Platform (GCP).