Distributed tracing
Distributed tracing starts with a single incoming request from a user, then tracks and records all of the sub-queries issued between different microservices to handle that request.
If you are new to distributed tracing, we recommend beginning with Tutorial/Start.
We implement distributed tracing by running the OpenTelemetry Collector (otelcol) on each production host as a local collection point, with Jaeger acting as the indexer and search/display interface for the trace data. The data at rest lives on the same OpenSearch cluster that also backs Logstash.
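For orientation, the sketch below shows roughly the shape of such a per-host collector pipeline: traces arrive over OTLP, are batched, and are forwarded to the tracing backend. The component names, endpoints, and ports here are illustrative assumptions only, not the production configuration.

<syntaxhighlight lang="yaml">
# Illustrative per-host collector pipeline, NOT the production config:
# accept OTLP locally, batch, and forward to the Jaeger backend.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317   # local clients send OTLP here (assumed)
processors:
  batch: {}
exporters:
  otlp/jaeger:
    # Hypothetical backend address; Jaeger accepts OTLP natively.
    endpoint: jaeger-collector.example.internal:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
</syntaxhighlight>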
As of August 2024, the only producer of trace data in production is the Envoy proxy. Our Envoy configuration offers simple opt-in tracing of both incoming and outgoing requests, as long as your application propagates tracing context -- see #Enabling tracing for a service and /Propagating tracing context.
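To make "propagating context" concrete: an application behind Envoy needs to copy the tracing headers from each inbound request onto every outbound sub-request, otherwise Envoy starts a new trace instead of continuing the caller's. The sketch below assumes W3C Trace Context headers (traceparent/tracestate) plus x-request-id, and a hypothetical upstream URL; /Propagating tracing context is the authoritative reference for the actual header set.

<syntaxhighlight lang="python">
# Minimal sketch of context propagation in a Flask service; header names and
# the upstream URL are assumptions, see /Propagating tracing context.
import requests
from flask import Flask, request

app = Flask(__name__)

# Assumed tracing headers; confirm the real set before relying on this.
TRACING_HEADERS = ("traceparent", "tracestate", "x-request-id")


def tracing_headers():
    """Extract whichever tracing headers the caller (usually Envoy) sent us."""
    return {h: request.headers[h] for h in TRACING_HEADERS if h in request.headers}


@app.route("/thing")
def thing():
    # Forward the tracing headers on the outgoing sub-request so Envoy can
    # stitch both spans into the same trace.
    resp = requests.get(
        "http://localhost:6500/api",  # hypothetical upstream via the local mesh
        headers=tracing_headers(),
        timeout=5,
    )
    return resp.text
</syntaxhighlight>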
Service owners who are interested in emitting OTel themselves should get in touch at TODO (IRC, Phabricator, and/or Slack?).
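If you do go that route, emitting spans from application code looks roughly like the following, using the OpenTelemetry Python SDK with OTLP export to the collector on the local host. The service name, span names, and the assumption that the local otelcol listens on the standard OTLP gRPC port (4317) are for illustration only.

<syntaxhighlight lang="python">
# Minimal sketch of direct OTel emission; endpoint and names are assumptions.
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify the service and export spans to the host-local collector.
provider = TracerProvider(resource=Resource.create({"service.name": "my-service"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("handle-request") as span:
    span.set_attribute("http.route", "/example")
    # ... do the actual work for the request here ...
</syntaxhighlight>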
Enabling tracing for a service
TODO: simple configuration instructions for helm charts in production
After enabling tracing, you should also do a brief audit for any easily-removable PII embedded in your traces. Some PII is inevitable, but especially sensitive data can be scrubbed by writing an otelcol processor rule. SRE is happy to assist with this; existing examples live under the transform/scrub processors in helmfile.d/admin_ng/opentelemetry-collector/values.yaml.
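To give a sense of what such a rule looks like, here is an illustrative (not production) snippet for the collector's transform processor, using OTTL statements to redact a query-string token and drop a span attribute. The attribute names and regex are made up for the example.

<syntaxhighlight lang="yaml">
# Illustrative scrubbing rule, not copied from production configuration.
processors:
  transform/scrub-example:
    trace_statements:
      - context: span
        statements:
          - replace_pattern(attributes["http.url"], "token=[^&]*", "token=REDACTED")
          - delete_key(attributes, "user.email")
</syntaxhighlight>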
TODO: a way for service owners to e2e test in staging? There is currently no otelcol deployment there -- see https://phabricator.wikimedia.org/T365809
More user-facing documentation