distributed tracing pattern
Each request gets a trace id; every service propagates it and records spans (start, end, metadata) for the work it did. Tools (Jaeger, Honeycomb, Datadog) reconstruct the tree to show which service spent time where. Without it, debugging cross-service latency is guesswork. Costs: instrumentation effort, sampling decisions, storage.
Each request gets a trace id; every service propagates it and records spans (start, end, metadata) for the work it did. Tools (Jaeger, Honeycomb, Datadog) reconstruct the tree to show which service spent time where. Without it, debugging cross-service latency is guesswork. Costs: instrumentation effort, sampling decisions, storage.
symptoms
- no idea why a request was slow across N services
- metrics tell you something's slow but not where
- logs don't correlate across services
causes
- no trace context propagation
- logs don't carry trace ids
fixes
- OpenTelemetry SDK in every service
- trace-context header propagation (W3C)
- sample at ingest (head-based) or after (tail-based)
- correlate logs by trace id
you might say
- trace it
- distributed tracing
- check the trace