Site Reliability Engineer
Role Summary
We are seeking a skilled and passionate Site Reliability Engineer (SRE) to ensure the reliability, scalability, and performance of our hybrid and cloud-native infrastructure. You will play a critical role in automating operations, improving system resilience, and supporting mission-critical services running across Kubernetes and cloud environments. This role is ideal for engineers who enjoy solving complex infrastructure challenges, building automation, and improving platform reliability at scale.
Job Description
🔹 Reliability & System Performance
- Maintain high availability, scalability, and performance of production systems.
- Define and monitor SLIs, SLOs, and error budgets to ensure service reliability.
- Perform root cause analysis, incident response, and postmortem reviews.
- Implement reliability improvements and proactive failure prevention.
🔹 Cloud & Kubernetes Platform Management
- Manage and optimize workloads running on Google Kubernetes Engine (GKE) and OpenShift.
- Support multi-cluster and hybrid infrastructure environments.
- Implement autoscaling and high-availability architectures.
🔹 CI/CD, GitOps & Release Engineering
- Design and maintain CI/CD pipelines using GitLab CI/CD.
- Implement GitOps deployment workflows using Argo CD.
- Implement safe deployment strategies.
🔹 Infrastructure as Code & Automation
- Provision and manage infrastructure using Terraform / OpenTofu.
- Develop and maintain Helm charts for Kubernetes deployments.
- Automate operational tasks using Python scripting to reduce manual toil.
🔹 Observability, Monitoring & Distributed Tracing
- Implement centralized logging using Grafana Loki and the ELK Stack.
- Build dashboards and alerts using Grafana and Datadog.
- Implement distributed tracing using OpenTelemetry to improve system visibility.
- Improve monitoring coverage and alert accuracy.
🔹 Performance & Load Testing
- Conduct load and stress testing using tools such as k6, Locust, or JMeter.
- Analyze performance bottlenecks and implement tuning strategies.
- Support capacity planning and performance optimization.
🔹 Data Streaming & Integration
- Support Change Data Capture (CDC) and real-time data streaming pipelines.
- Work with Confluent Platform / Apache Kafka to ensure reliable event-driven data flow.
🔹 Security & Secret Management
- Manage secrets securely using Google Cloud Secret Manager, Kubernetes Secrets, and HashiCorp Vault.
- Implement secure CI/CD and platform access practices.
Education
Bachelor’s degree in Computer Science, Informatics, Information Systems, Electrical Engineering, Mathematics/Statistics, or a related field.
Experience
- 0–4 years of experience in SRE, DevOps, Cloud Engineering, or Platform Engineering.
- Hands-on experience supporting production systems and cloud infrastructure.
Technical Skills
- Strong Linux system administration and networking fundamentals.
- Hands-on experience with Kubernetes and containerized environments.
- Experience designing and maintaining CI/CD pipelines.
- Infrastructure as Code experience with Terraform and Ansible.
- Helm chart development and Kubernetes deployment management.
- Monitoring, logging, and observability best practices.
- Programming/scripting skills in Bash and Python (Go is a plus).
- Familiarity with Google Cloud Platform (GCP).