Date: Feb 20, 2026

Observability Engineer

Location:

ID

Level:

Employment Status: Permanent

Department: Group Digital Commercial

Description:

About the Role

We are seeking an Observability Engineer to enhance the visibility, performance insight, and operational intelligence of our cloud-native and hybrid systems. This role focuses on designing and implementing observability strategies that provide deep insight into system health, performance, and user experience.

You will work closely with platform engineers, SREs, and application teams to instrument services, implement telemetry standards, and build actionable monitoring that enables proactive incident prevention and faster troubleshooting.

This role blends SRE fundamentals, telemetry engineering, and application-level instrumentation.

Key Responsibilities

🔹 Observability Strategy & Platform Ownership

Design and implement end-to-end observability architecture across hybrid and cloud environments.
Define telemetry standards for metrics, logs, and traces.
Ensure full service visibility across microservices and infrastructure layers.

🔹 Metrics, Monitoring & Alerting

Build and maintain monitoring solutions using Datadog, Prometheus, and Grafana.

Develop actionable alerting strategies that reduce noise and improve signal accuracy.
Tune alert thresholds and implement intelligent escalation logic.
Define service health indicators and golden signals.

🔹 Logging & Log Intelligence

Implement centralized logging using ELK Stack or Grafana Loki.
Build structured logging standards and log correlation strategies.
Enable log-driven troubleshooting and anomaly detection.

🔹 Distributed Tracing & Telemetry Instrumentation

Implement distributed tracing using OpenTelemetry.

Instrument applications and services to expose telemetry data.
Work with developers to integrate tracing and metrics into application code.
Ensure correlation across logs, metrics, and trace spans.

🔹 Application-Level Observability

Collaborate with development teams to embed observability into services.

Define telemetry instrumentation standards for microservices.
Support performance profiling and latency analysis.
Ensure end-to-end transaction visibility.

🔹 CI/CD & Observability Integration

Integrate observability checks into CI/CD pipelines.

Ensure deployments include telemetry validation and monitoring readiness.
Support reliability gates and observability-driven deployment validation.

🔹 Performance & Reliability Insights

Analyze system performance trends and detect anomalies.

Support capacity planning and performance optimization.
Provide insights to improve system reliability and user experience.

Required Qualifications

Education

Bachelor’s degree in Computer Science, Informatics, Information Systems, Electrical Engineering, Mathematics/Statistics, or related field.

Experience

2–5 years of experience in Observability, SRE, DevOps, or Platform Engineering.
Experience supporting production systems and troubleshooting complex distributed systems.

Technical Skills

Observability & Monitoring

Hands-on experience with Datadog, Prometheus, and Grafana.

Experience designing actionable alerting and reducing alert fatigue.
Understanding of golden signals and service health metrics.

Telemetry & Tracing

Experience with OpenTelemetry instrumentation.

Strong understanding of distributed tracing concepts.
Knowledge of how to correlate metrics, logs, and traces.

Logging & Analysis

Experience with ELK Stack or Loki.

Experience with structured logging and log parsing strategies.

Platform & Infrastructure

Familiarity with Kubernetes environments.

Understanding of microservices architecture.
Basic cloud platform knowledge (GCP preferred).

Programming & Automation

Experience with Bash, Python, or similar scripting languages.

Ability to instrument services and analyze telemetry data.

What Makes You Successful in This Role

You can distinguish signal from noise in monitoring.
You think in telemetry, visibility, and system behavior, not just dashboards.
You collaborate with developers to improve observability inside applications.
You design monitoring that prevents incidents, not just reacts to them.