recall

distributed tracing pattern

Each request gets a trace id; every service propagates it and records spans (start, end, metadata) for the work it did, each span pointing at its parent. Tools (Jaeger, Honeycomb, Datadog) reconstruct the span tree to show which service spent time where. Without it, debugging cross-service latency is guesswork. Costs: instrumentation effort, sampling decisions, storage.
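
As a rough illustration of what "reconstruct the tree" means, here is a minimal sketch in plain Python. The `Span` fields and the `self_time_by_service` helper are hypothetical names made up for this example and are not the data model of any particular tool. It assumes child spans under one parent run one after another rather than in parallel.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span of the trace
    service: str
    start_ms: float
    end_ms: float


def self_time_by_service(spans: list[Span]) -> dict[str, float]:
    """Time each service spent in its own spans, excluding direct children.

    Assumption: children of a span do not overlap in time.
    """
    # Link spans into a tree by grouping them under their parent's span id.
    children: dict[str, list[Span]] = defaultdict(list)
    for s in spans:
        if s.parent_id is not None:
            children[s.parent_id].append(s)

    totals: dict[str, float] = defaultdict(float)
    for s in spans:
        duration = s.end_ms - s.start_ms
        child_time = sum(c.end_ms - c.start_ms for c in children[s.span_id])
        totals[s.service] += duration - child_time
    return dict(totals)


# One trace: gateway calls orders, orders calls db.
spans = [
    Span("t1", "a", None, "gateway", 0, 120),
    Span("t1", "b", "a", "orders", 10, 110),
    Span("t1", "c", "b", "db", 20, 100),
]
print(self_time_by_service(spans))
# {'gateway': 20.0, 'orders': 20.0, 'db': 80.0}
```

In this made-up trace the request took 120 ms end to end, and the breakdown shows that 80 ms of it sat in the `db` span, which request-level metrics alone would not reveal.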

symptoms

  • no idea why a request was slow across N services
  • metrics tell you something's slow but not where
  • logs don't correlate across services

causes

  • no trace context propagation
  • logs don't carry trace ids

fixes

  • OpenTelemetry SDK in every service
  • W3C Trace Context header propagation (traceparent)
  • sample when the trace starts (head-based) or after it completes (tail-based)
  • correlate logs by trace id
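
The fixes above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the OpenTelemetry SDK API: the helper names (`should_sample`, `new_traceparent`, `parse_traceparent`, `outgoing_headers`, `TraceIdFilter`) are hypothetical, and in a real service the SDK's propagators and samplers would do this work. Only the `traceparent` layout (`version-traceid-parentid-flags`, lowercase hex) comes from the W3C Trace Context spec. The sketch accepts version `00` only.

```python
import contextvars
import logging
import re
import secrets

# W3C traceparent: 2 hex version, 32 hex trace id, 16 hex parent id, 2 hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

# Trace id of the request currently being handled ("-" when there is none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


def should_sample(trace_id: str, ratio: float) -> bool:
    """Head-based decision, derived from the trace id so it is repeatable."""
    return int(trace_id[-8:], 16) / 0x100000000 < ratio


def new_traceparent(ratio: float = 1.0) -> str:
    """Start a new trace; the sampled flag is decided once, here."""
    trace_id = secrets.token_hex(16)
    flags = "01" if should_sample(trace_id, ratio) else "00"
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_id, sampled), or None if the header is invalid."""
    m = TRACEPARENT_RE.match(header.strip())
    if m is None:
        return None
    trace_id, parent_id, flags = m.groups()
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero ids are invalid per the spec
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def outgoing_headers(incoming: dict) -> dict:
    """Continue the caller's trace, or start a new one if none was sent."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        return {"traceparent": new_traceparent()}
    trace_id, _caller_span, sampled = parsed
    flags = "01" if sampled else "00"
    # Same trace id, fresh span id for this service's own span.
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}


class TraceIdFilter(logging.Filter):
    """Stamp every log record with the current trace id."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


handler = logging.StreamHandler()
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
log = logging.getLogger("orders")  # hypothetical service name
log.addHandler(handler)
log.setLevel(logging.INFO)

# Example header value taken from the W3C Trace Context spec.
incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
out = outgoing_headers(incoming)
current_trace_id.set(parse_traceparent(out["traceparent"])[0])
log.info("calling inventory service")
```

The log line carries `trace_id=4bf92f3577b34da6a3ce929d0e0e4736`, the same id the downstream call receives, so logs from both services can be joined on it. Tail-based sampling is not shown: it needs a collector that buffers all spans of a trace and decides after the trace completes, for example keeping only slow or failed traces.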

you might say

  • trace it
  • distributed tracing
  • check the trace

related

topics: observability, distributed-systems