distributed tracing pattern

Each request gets a trace id; every service propagates it and records spans (start, end, metadata) for the work it did. Tools (Jaeger, Honeycomb, Datadog) reconstruct the tree to show which service spent time where. Without it, debugging cross-service latency is guesswork. Costs: instrumentation effort, sampling decisions, storage.

Each request gets a trace id; every service propagates it and records spans (start, end, metadata) for the work it did. Tools (Jaeger, Honeycomb, Datadog) reconstruct the tree to show which service spent time where. Without it, debugging cross-service latency is guesswork. Costs: instrumentation effort, sampling decisions, storage.

symptoms

no idea why a request was slow across N services
metrics tell you something's slow but not where
logs don't correlate across services

causes

no trace context propagation
logs don't carry trace ids

fixes

OpenTelemetry SDK in every service
trace-context header propagation (W3C)
sample at ingest (head-based) or after (tail-based)
correlate logs by trace id

you might say

trace it
distributed tracing
check the trace

related

topics: observability, distributed-systems

references:

Dapper paper (Sigelman et al.)