
Distributed tracing


Distributed tracing starts with a single incoming request from a user, and tracks and records all the sub-queries issued between different microservices to handle that request.

If you are new to distributed tracing in general, we recommend beginning with Tutorial/Start.

We implement distributed tracing with the OpenTelemetry Collector (otelcol) running on each production host as a collection point, and Jaeger as the indexer and search/display interface for the trace data. The data at rest lives on the same OpenSearch cluster that backs Logstash.

As of early 2026, production components and services emitting trace data to otelcol include the Envoy instances in our service mesh, MediaWiki, and certain Abstract Wikipedia services.

The vast majority of traces are emitted by Envoy, whose configuration offers simple opt-in tracing of both incoming and outgoing requests, provided your application propagates tracing context. See #Enabling tracing for a service for how to get started and Propagating tracing context for the specific requirements.

Service owners who want to emit OTel themselves should get in touch at TODO (irc, phab, and/or slack?)


Enabling tracing for a service

In production at WMF, Envoy plays a central role in tracing. In addition to emitting traces for incoming and outgoing requests, Envoy's OpenTelemetry tracing provider serves as the default source of sampling decisions, based on the x-request-id header value and an optional sampling fraction.

The simplest path to enabling basic distributed tracing for your service, with minimal changes to your application code, involves:

  1. Ensuring that your application propagates tracing context between incoming and outgoing requests.
  2. Configuring the Envoy instance collocated with your service (e.g., the mesh sidecar container in your Kubernetes deployment) to enable tracing.

In the most common scenario, #2 can be achieved by modifying the production Helmfile values for your service in the deployment-charts repo, setting mesh.tracing.enabled to true and optionally mesh.tracing.sampling to a percentage of requests to sample (default: 100). Example:

mesh:
  tracing:
    enabled: true
    sampling: 10  # Sample 10% of incoming and outgoing requests

Note that you may already have a mesh object defined in your values file, in which case you should merge the tracing fields into the existing object rather than adding a second mesh block.

Technically, propagating only x-request-id in your application is sufficient to ensure that the Envoy instance collocated with your service can make consistent sampling decisions for a given sampling fraction - i.e., if an incoming request with a given x-request-id is sampled, all outgoing requests carrying that same header value will be as well. However, it is strongly recommended that you also propagate the standard W3C trace context headers (traceparent and tracestate) in order to get the most out of distributed tracing.
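As an illustration, here is a minimal sketch (not WMF code) of header propagation in a Node.js service, assuming Express and the built-in fetch; the route and backend URL are hypothetical. The same idea applies in any language: copy the tracing headers from the incoming request onto every outgoing request made while handling it.

import express from 'express';

// Headers to copy from incoming to outgoing requests:
// Envoy's x-request-id plus the W3C trace context headers.
const TRACING_HEADERS = ['x-request-id', 'traceparent', 'tracestate'];

const app = express();

app.get('/example', async (req, res) => {
  // Collect whichever tracing headers are present on the incoming request...
  const forwarded: Record<string, string> = {};
  for (const name of TRACING_HEADERS) {
    const value = req.get(name);
    if (value !== undefined) {
      forwarded[name] = value;
    }
  }
  // ...and attach them to the outgoing request made while handling it.
  const upstream = await fetch('http://localhost:6500/some-backend', { headers: forwarded });
  res.status(upstream.status).send(await upstream.text());
});

app.listen(8080);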

Instrumenting your application

The above focuses on tracing in Envoy, while application code is largely considered to be passive (i.e., responsible only for context propagation).

However, you may also want to instrument your code, either by introducing custom spans that represent semantically meaningful operations within your application logic or by enabling instrumentation in dependencies (e.g., via the auto-instrumentation available for your programming language).

If you are working on MediaWiki core and extensions, instrumentation is already configured and ready for you to use to implement custom spans. The MediaWiki instrumentation portion of the tracing tutorial contains an overview of that process.

If you are working on a Node.js service that has adopted service-utils, the Node.js service instrumentation portion of the tracing tutorial provides an overview of setup, including OpenTelemetry SDK initialization and auto-instrumentation, as well as creating custom spans in your application code.
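As a concrete illustration, a custom span created via the OpenTelemetry API might look like the following minimal sketch, assuming the SDK has already been initialized (e.g., by service-utils or your own setup code); the tracer name, attribute, and renderPage/doExpensiveRendering functions are purely illustrative.

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('my-service');  // illustrative instrumentation scope name

async function doExpensiveRendering(title: string): Promise<string> {
  return `<h1>${title}</h1>`;  // stand-in for real work
}

async function renderPage(title: string): Promise<string> {
  // startActiveSpan makes the new span the parent of anything created inside the
  // callback, including spans produced by auto-instrumented libraries.
  return tracer.startActiveSpan('renderPage', async (span) => {
    try {
      span.setAttribute('page.title', title);
      return await doExpensiveRendering(title);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}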

Configuring your application

If your application uses an OpenTelemetry SDK and is deployed to a production Kubernetes cluster, there are three key pieces of configuration to consider that influence how trace data is emitted to otelcol, all of which can be set via environment variables:

OTEL_EXPORTER_OTLP_ENDPOINT must be set to http://main-opentelemetry-collector.opentelemetry-collector.svc.cluster.local:4318/. This is how the exporter knows where to send trace data.

OTEL_SERVICE_NAME must be set to a string that clearly identifies your service. This only applies if you have not already set the service.name attribute in some other way (e.g., in your application code).

OTEL_TRACES_SAMPLER should be set to parentbased_always_off. This ensures that root spans created by your application - i.e., those without a parent span - are never sampled, while spans with a sampled parent follow the parent's decision. This is consistent with our general approach of delegating sampling decisions to Envoy and is the default behavior we have established in MediaWiki.
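For reference, here is a minimal sketch of SDK initialization in a Node.js service, assuming the @opentelemetry/sdk-node, @opentelemetry/exporter-trace-otlp-http, and @opentelemetry/auto-instrumentations-node packages are available. With this setup, all three environment variables above are read by the SDK itself, so nothing needs to be hard-coded.

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  // With no explicit URL, the exporter respects OTEL_EXPORTER_OTLP_ENDPOINT.
  traceExporter: new OTLPTraceExporter(),
  // Auto-instrument common libraries (http, express, etc.).
  instrumentations: [getNodeAutoInstrumentations()],
});

// OTEL_SERVICE_NAME and OTEL_TRACES_SAMPLER are likewise picked up from the
// environment when the SDK starts.
sdk.start();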

Dealing with PII

After enabling tracing, you should also do a brief audit for any easily removable PII embedded in traces. Some PII is inevitable, but especially sensitive data can be scrubbed by writing an otelcol processor rule. SRE is happy to assist with this; you can also find existing examples under transform/scrub: in helmfile.d/admin_ng/opentelemetry-collector/values.yaml.

TODOs

TODO a way for service owners to e2e test in staging? no otelcol deployment there https://phabricator.wikimedia.org/T365809

More user-facing documentation

[17:30]  <    taavi> how do i search trace.wikimedia.org with a specific request id?
[09:33]  <   claime> taavi: search for guid:x-request-id=$your-request-id

Tutorial sub-page: how to read a trace

/Tutorial/Start