Observability and Distributed Tracing in Federation
A federated GraphQL request fans out across a router and multiple subgraphs, so a single client operation produces work in several processes before a response is assembled. Production-grade operation (see Federated GraphQL Operations in Production) means treating that fan-out as a single distributed transaction: instrumenting the router and every subgraph so that one trace, with correctly linked spans, tells you exactly where latency and errors originate.
This guide covers the federated trace lifecycle from client to subgraphs, OpenTelemetry configuration in the Apollo Router (router.yaml telemetry blocks for tracing, metrics, and OTLP/Datadog/Prometheus exporters), trace-context propagation across service boundaries, the span attributes that expose query-plan structure, and Apollo Studio field-level tracing for usage and latency analysis.
The Problem: Latency Has No Single Owner
In a monolithic GraphQL server, a slow operation is a slow resolver in one process. In federation, a slow operation might be a poorly shaped query plan that serialises three subgraph fetches, an N+1 storm inside one subgraph’s __resolveReference, a cold downstream dependency, or router-side parsing and planning overhead. Without distributed tracing, you see only the aggregate p95 at the router and have no way to attribute it. Observability in federation is therefore about attribution: every millisecond of wall-clock time must be assignable to a named span on a named service.
The hard part is span linkage. If the router emits a trace but each subgraph emits its own unrelated trace, you have telemetry but not observability — you cannot reconstruct the causal tree. The entire discipline hinges on propagating a single trace context through the router into each subgraph HTTP call, which is covered in depth in propagating trace context across subgraphs with OpenTelemetry.
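To make span linkage concrete, here is a minimal sketch of what propagating a single trace context means at the header level. It assumes only the W3C `traceparent` layout (version `00`, a 32-hex trace id, a 16-hex parent span id, 2-hex flags); the helper names `extract`, `start_span`, and `inject` are illustrative and are not the router's or any SDK's API.

```python
import re
import secrets
from typing import NamedTuple, Optional

# W3C Trace Context, version 00: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
# This sketch accepts only version 00.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def extract(header: Optional[str]) -> Optional[TraceContext]:
    """Parse an incoming traceparent header; return None if absent or malformed."""
    if not header:
        return None
    match = _TRACEPARENT.match(header.strip().lower())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    # All-zero ids are invalid in the W3C format.
    if int(trace_id, 16) == 0 or int(span_id, 16) == 0:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def start_span(parent: Optional[TraceContext]) -> TraceContext:
    """Continue the parent's trace if there is one, otherwise start a new root."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True
    return TraceContext(trace_id, secrets.token_hex(8), sampled)


def inject(ctx: TraceContext) -> str:
    """Serialise a span's context as the traceparent header for an outbound call."""
    return f"00-{ctx.trace_id}-{ctx.span_id}-{'01' if ctx.sampled else '00'}"
```

The failure case described above corresponds to a subgraph calling `start_span(None)` on every request: it gets a fresh trace id, so its spans can never be joined to the router's tree. When the subgraph extracts the header first, the trace id is shared and only the span id changes per hop.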
Prerequisites
The Federated Trace Lifecycle
A federated trace is a tree rooted at the client request and branching through the router into subgraph fetches. Walk it span by span:
1. Client span (optional root). A browser or service client may start a trace and send a traceparent header. If present, the router adopts it as the parent; if absent, the router starts a new root span.
2. router span. The top-level span on the router process. Its duration is the full server-side wall-clock time the client experiences.
3. supergraph/request spans. Parsing the operation, validating it, and computing (or cache-hitting) the query plan. Query-plan caching latency shows up here; see configuring query plan caching in Apollo Router.
4. execution span. Wraps the query-plan walk. Its children mirror the plan’s Fetch, Flatten, Parallel, and Sequence nodes.
5. subgraph spans (fetch). One per subgraph request the plan issues. Each carries the subgraph name, the operation, and (when propagation is configured) becomes the parent of the subgraph’s own server span.
6. Subgraph server + resolver spans. Inside each subgraph, the @apollo/server instrumentation opens a server span whose parent is the router’s fetch span, then resolver and __resolveReference spans beneath it.
Because steps 5 and 6 are linked by W3C trace context, the final tree shows router planning, parallel vs. sequential subgraph fetches, and per-resolver time in one view.
Two properties of this lifecycle deserve emphasis because they shape every downstream decision. First, the router span’s duration is the only number the client actually experiences — everything else is internal accounting that must sum (allowing for parallelism) to that figure. If your subgraph spans add up to far less than the router span, the missing time is router-side: parsing, validation, query planning, coprocessor round-trips, or response formatting. If they add up to more, you have parallelism the trace view should make visible. Second, the lifecycle is recursive at every subgraph: a subgraph that itself calls a database, a cache, or another service should propagate context onward, so the trace doesn’t dead-end at the subgraph server span but continues into the data layer where many real latency problems live. A trace that stops at the subgraph boundary tells you which subgraph is slow but not why; extending instrumentation into the subgraph’s outbound calls is what closes that gap.
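The accounting in the first property can be sketched as interval arithmetic. The sketch below assumes you have exported the start and end offsets of each subgraph fetch span relative to the router span. The function names and the millisecond figures are hypothetical, chosen only to show why overlapping fetches must be counted once.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_ms, end_ms), relative to the router span start


def covered_ms(intervals: List[Interval]) -> float:
    """Time covered by at least one interval; overlapping stretches count once."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Gap before this interval: close the previous merged run.
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the current merged run.
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def router_side_ms(router_duration_ms: float, fetches: List[Interval]) -> float:
    """Wall-clock time during which no subgraph fetch was in flight."""
    return router_duration_ms - covered_ms(fetches)


# Hypothetical trace: two overlapping fetches, then a third.
fetches = [(10.0, 60.0), (10.0, 40.0), (65.0, 95.0)]
naive_sum = sum(end - start for start, end in fetches)  # counts the overlap twice
unattributed = router_side_ms(120.0, fetches)
```

With these numbers the fetch durations sum to 110 ms, but they cover only 80 ms of wall-clock time, leaving 40 ms of a 120 ms router span attributable to router-side work.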
The diagram makes the central insight visible: total latency is the critical path through the span tree, not the sum of all spans. Two subgraph fetches running in parallel cost only the slower of the two; a sequential fetch that depends on an upstream entity’s keys is additive. Distributed tracing is the only practical way to see which branch is on the critical path.
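A small sketch of that rule, assuming a simplified plan tree with only Fetch, Sequence, and Parallel nodes (Flatten is omitted because it wraps a single child and does not change the arithmetic). The class names mirror the plan node names but are not the router's types, and the durations and the `reviews` subgraph are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Fetch:
    subgraph: str
    duration_ms: float


@dataclass
class Sequence:
    nodes: List["PlanNode"] = field(default_factory=list)


@dataclass
class Parallel:
    nodes: List["PlanNode"] = field(default_factory=list)


PlanNode = Union[Fetch, Sequence, Parallel]


def critical_path_ms(node: PlanNode) -> float:
    """Latency of the plan: sequences add up, parallel branches cost their slowest child."""
    if isinstance(node, Fetch):
        return node.duration_ms
    child_costs = [critical_path_ms(child) for child in node.nodes]
    if isinstance(node, Parallel):
        return max(child_costs, default=0.0)
    return sum(child_costs)


def sum_of_fetches_ms(node: PlanNode) -> float:
    """Naive total: every fetch duration added together, ignoring parallelism."""
    if isinstance(node, Fetch):
        return node.duration_ms
    return sum(sum_of_fetches_ms(child) for child in node.nodes)


# orders must finish first (its keys feed the next step); the two follow-up
# fetches then run side by side.
plan = Sequence([
    Fetch("orders", 40.0),
    Parallel([Fetch("catalog", 30.0), Fetch("reviews", 80.0)]),
])
```

For this plan the critical path is 120 ms while the fetch durations sum to 150 ms. Speeding up `catalog` would not change the client-visible latency at all, which is the kind of conclusion the span tree lets you draw.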
OpenTelemetry in the Apollo Router
The Apollo Router has first-class OpenTelemetry support configured entirely through the telemetry block in router.yaml. There are three sub-areas: instrumentation (spans/traces), metrics, and exporters. The router also has a native Apollo exporter that ships traces to Apollo Studio independently of OTLP.
Tracing configuration
```yaml
telemetry:
  instrumentation:
    spans:
      mode: spec_compliant # emit OpenTelemetry-spec span names/attrs
      router:
        attributes:
          # promote request metadata onto the router span
          http.request.method: true
          graphql.operation.name:
            operation_name: string
      subgraph:
        attributes:
          subgraph.name: true # tag every fetch span with its subgraph
  exporters:
    tracing:
      common:
        service_name: apollo-router
        sampler: 0.1 # head-based sampling: 10% of traces
        # parent_based_sampler: keep child decisions consistent with the root
        parent_based_sampler: true
      otlp:
        enabled: true
        endpoint: http://otel-collector:4317
        protocol: grpc
```
sampler: 0.1 is head-based sampling: the router decides at the root whether to record the whole trace. parent_based_sampler: true makes the router honour the sampled flag on an incoming traceparent instead of re-rolling the decision. Subgraphs need their own parent-based sampler to honour the router’s decision in turn; otherwise each service samples independently, which shreds traces. For low-traffic services start at 1.0 (sample everything) and dial down as volume grows.
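The interaction between a ratio sampler and a parent-based sampler can be sketched in a few lines. This is an illustration of the idea, not the router's or any OpenTelemetry SDK's exact algorithm: it assumes the decision is derived deterministically from the low 64 bits of the trace id, and the function names are hypothetical.

```python
from typing import Optional


def ratio_decision(trace_id_hex: str, ratio: float) -> bool:
    """Head sampling: keep the trace when its low 64 bits fall below ratio * 2**64.

    Deriving the decision from the trace id means every service that applies
    the same ratio to the same trace reaches the same answer.
    """
    threshold = int(ratio * (1 << 64))
    return int(trace_id_hex[-16:], 16) < threshold


def parent_based_decision(
    trace_id_hex: str, ratio: float, parent_sampled: Optional[bool]
) -> bool:
    """Follow the parent's sampled flag when there is one; otherwise decide locally."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_decision(trace_id_hex, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
```

A service that ignores the parent flag and applies its own, different ratio is what produces half-recorded traces: the root is kept, the child is dropped, and the tree has a hole in it.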
Metrics configuration
Traces tell you about individual slow requests; metrics tell you about aggregate health and feed alerting. The router emits RED-style metrics (rate, errors, duration) for the router and per-subgraph.
```yaml
telemetry:
  exporters:
    metrics:
      common:
        service_name: apollo-router
        # custom histogram buckets tuned for sub-second GraphQL latencies
        views:
          - name: http.server.request.duration
            unit: s
            aggregation:
              histogram:
                buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5]
      prometheus:
        enabled: true
        listen: 0.0.0.0:9090
        path: /metrics
      otlp:
        enabled: true
        endpoint: http://otel-collector:4317
        protocol: grpc
        temporality: delta # Datadog prefers delta; Prometheus uses cumulative
```
Exporter selection: OTLP vs. Datadog vs. Prometheus
| Exporter | Transport | Best for | Notes |
|---|---|---|---|
| OTLP | gRPC/HTTP to an OTel Collector | Vendor-neutral pipelines | The recommended default — export OTLP to a Collector, then fan out to any backend. |
| Datadog | OTLP to Datadog Agent (or native) | Teams standardised on Datadog APM | Use temporality: delta for metrics; map service_name to the Datadog service. |
| Prometheus | Pull (/metrics) | Metrics + Grafana dashboards | Metrics only, not traces; pair with an OTLP trace exporter to Tempo/Jaeger. |
The durable pattern is OTLP to a Collector, with the Collector handling tail sampling, batching, and routing to one or more backends. This keeps router.yaml stable while you change observability vendors behind the Collector.
A subtlety teams hit when adopting Datadog: the router can export OTLP straight to the Datadog Agent’s OTLP intake, which is preferable to a Datadog-native exporter because it keeps your configuration vendor-neutral and lets you mirror the same spans to a second backend during a migration. Metric temporality is the one place vendor choice leaks into config — Datadog expects delta temporality, while Prometheus is inherently cumulative because it scrapes a counter’s running total. Setting temporality: delta on an OTLP exporter feeding Datadog, while leaving the Prometheus exporter at its cumulative default, lets both coexist. Resist the temptation to run three independent exporters with three sampling configs; that path produces traces that disagree with each other and dashboards that disagree with traces. One Collector, one sampling decision, fan-out at the Collector.
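The difference between the two temporalities is easier to see on a concrete counter. The sketch below is a generic illustration with made-up values and hypothetical function names; it is not how any exporter is implemented.

```python
from itertools import accumulate
from typing import List


def cumulative_to_delta(cumulative: List[int]) -> List[int]:
    """Turn running totals (what a scrape sees) into per-interval increments."""
    deltas = []
    previous = 0
    for value in cumulative:
        # A value below the previous total means the counter restarted from zero,
        # so the whole value is new since the reset.
        deltas.append(value if value < previous else value - previous)
        previous = value
    return deltas


def delta_to_cumulative(deltas: List[int]) -> List[int]:
    """Rebuild running totals from per-interval increments."""
    return list(accumulate(deltas))


# Request counter sampled at four consecutive intervals.
scraped_totals = [5, 12, 12, 20]
per_interval = cumulative_to_delta(scraped_totals)
```

Here the running totals `[5, 12, 12, 20]` become increments `[5, 7, 0, 8]`. A backend that expects one form and receives the other will either double-count or show a counter that appears to fall, which is why temporality has to match the destination.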
Span Attributes for Query Plans
The single most valuable thing federated tracing adds over generic HTTP tracing is query-plan visibility. The router annotates execution spans with attributes that describe the plan it computed, so you can correlate a slow operation with the shape of its plan.
Key attributes to surface and watch:
| Attribute | Span | Why it matters |
|---|---|---|
| graphql.operation.name | router | Group traces by operation for p50/p95 analysis. |
| apollo.client.name / version | router | Attribute load and regressions to specific clients. |
| subgraph.name | subgraph/fetch | Identify which service a fetch span belongs to. |
| graphql.document | router/subgraph | The operation text (redact variables; can be large). |
| apollo_private.ftv1 | subgraph | Encoded field-level trace the router forwards to Studio. |
A plan that issues many sequential fetch spans for one operation signals a cross-service reference that cannot be parallelised — often a design smell addressable by adjusting entity boundaries or using @provides. A fetch span whose duration dwarfs its subgraph’s own server span points to network or connection-pool overhead between router and subgraph rather than slow resolution.
Query-plan visibility is what makes federated tracing schema-aware rather than merely service-aware. A generic APM will tell you that service B called service C; only the query-plan attributes tell you that it did so because resolving Order.recommendedProducts required hopping from the orders subgraph into the catalog subgraph, and that the hop was sequential because the catalog fetch needed the order’s product keys first. That is the difference between a diagnosis you can act on at the schema level — “denormalise this reference,” “add a @provides so the keys travel with the parent,” “split this entity differently” — and a generic latency number that only tells you to make service C faster. When you read a trace, always pull the operation name and the subgraph names onto the spans (as the instrumentation.spans config above does) so that six months later, looking at a trace from an incident, you can reconstruct which operation produced which plan without re-deriving it by hand.
Be deliberate about which attributes you promote. The router lets you lift almost any request or response value onto a span, but each promoted attribute multiplies storage cost and, for high-cardinality values, can render a backend’s index unusable. The reliable set is low-cardinality and high-signal: operation name, client name and version, subgraph name, response status, and a boolean for whether the query plan was a cache hit. Resist promoting full document text or anything derived from variables onto every span — sample those into FTV1 traces instead, where Studio expects and redacts them.
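A rough way to reason about the cost is to bound the number of distinct attribute combinations a backend may have to index. The sketch below is back-of-the-envelope arithmetic only: the counts are invented, and the `query_plan.cache_hit` and `user.id` keys are hypothetical names used for illustration.

```python
from math import prod
from typing import Dict


def max_attribute_combinations(distinct_values: Dict[str, int]) -> int:
    """Upper bound on distinct attribute tuples: the product of per-attribute cardinalities."""
    return prod(distinct_values.values())


# Assumed cardinalities for a low-cardinality, high-signal attribute set.
reliable_set = {
    "graphql.operation.name": 200,
    "apollo.client.name": 10,
    "subgraph.name": 8,
    "query_plan.cache_hit": 2,  # hypothetical attribute name
}

# The same set with one per-user attribute added.
with_user_id = {**reliable_set, "user.id": 1_000_000}  # hypothetical attribute name
```

Under these assumptions the reliable set tops out at 32,000 combinations, while adding a single per-user attribute raises the bound to 32 billion. Real traffic never fills the whole space, but the ratio shows why one high-cardinality attribute outweighs every other choice.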
Apollo Studio Field-Level Tracing
OpenTelemetry gives you operation- and span-level latency. Apollo Studio’s federated trace format (FTV1) goes one level deeper: it captures per-field execution timing inside each subgraph and stitches the subgraph traces back together under the router’s plan. The result is a flame graph keyed by schema field, so you can answer “which field, in which subgraph, is slow across all operations that select it” — the basis for monitoring federated query performance with Apollo Studio.
To enable it, the router needs the Apollo exporter turned on and subgraphs must emit FTV1 traces (the router requests them via the apollo-federation-include-trace: ftv1 header; @apollo/server v4 with the inline-trace plugin responds automatically).
```yaml
telemetry:
  apollo:
    # field_level_instrumentation_sampler controls how often subgraphs
    # are asked to produce expensive per-field traces
    field_level_instrumentation_sampler: 0.01 # 1%; FTV1 is not free
    send_headers:
      only:
        - apollographql-client-name
    send_variable_values: none # never ship variable values
```
Field-level instrumentation is heavier than span sampling because it times every field resolution, so keep field_level_instrumentation_sampler low (1–5%) in production. Studio aggregates the sampled traces into stable field statistics regardless.
The practical payoff of field-level tracing is a different question than OTLP answers. OTLP answers “where did the time go in this one request?” Studio’s field statistics answer “across every operation that touches this field, what is its typical and tail latency, and how often is it requested?” That second question is what drives schema-evolution decisions: a field that is expensive but rarely selected is a candidate for a @defer or a separate operation; a field that is cheap but selected on every query is fine; a field that is both hot and slow is your top optimisation target regardless of which operation surfaced it. Because Studio keys these statistics to the schema rather than to HTTP routes, the same field’s cost is visible no matter which of a dozen operations happened to include it — something endpoint-level monitoring structurally cannot do for GraphQL, where one endpoint serves every operation. The detailed workflow for reading these statistics, segmenting by client, and turning them into alerts is covered in monitoring federated query performance with Apollo Studio.
Performance and Scale Considerations
Tracing is not free, and naive configuration can itself become a latency source.
- Sample at the head, refine at the tail. Head-based sampling in the router (sampler) bounds overhead cheaply; if you need to keep all error/slow traces, do tail sampling in the Collector, not the router.
- Export asynchronously and batch. The router batches span export by default. Never point exporters at a slow or unreachable endpoint synchronously; a stalled collector must not back-pressure request handling. Run a local Collector sidecar/agent so the router’s export hop is localhost.
- Bound attribute cardinality. Promoting high-cardinality values (raw query text with inlined variables, user ids) onto spans explodes backend cost and can leak PII. Prefer graphql.operation.name over full documents, and always set send_variable_values: none.
- Separate FTV1 from OTLP budgets. OTLP span sampling and Studio field-level sampling are independent knobs; tune them separately so deep field tracing doesn’t force you to under-sample your distributed traces.
- Watch the router↔subgraph hop. A fetch span much larger than the subgraph’s server span usually means connection-pool exhaustion or DNS/TLS overhead; tune keep-alive and pool size before blaming resolvers. Pair this with the guidance in optimizing reference resolvers for performance.
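The last check in the list can be automated over exported span durations. This is a sketch under stated assumptions: the input pairs each subgraph with its router-side fetch duration and its own server span duration, and the function name and both thresholds (20 ms, 2x) are arbitrary starting points rather than recommended values.

```python
from typing import Dict, List, Tuple


def slow_hops(
    spans: Dict[str, Tuple[float, float]],
    min_overhead_ms: float = 20.0,
    min_ratio: float = 2.0,
) -> List[str]:
    """Return subgraphs whose fetch span is much longer than their server span.

    spans maps subgraph name -> (router fetch span ms, subgraph server span ms).
    A subgraph is flagged only if the gap is large in absolute terms AND the
    fetch took at least min_ratio times as long as the server-side work.
    """
    flagged = []
    for subgraph, (fetch_ms, server_ms) in spans.items():
        overhead_ms = fetch_ms - server_ms
        if overhead_ms >= min_overhead_ms and fetch_ms >= min_ratio * server_ms:
            flagged.append(subgraph)
    return flagged


# Made-up durations: orders spends 90 ms outside the subgraph, catalog only 5 ms.
observed = {"orders": (120.0, 30.0), "catalog": (45.0, 40.0)}
```

With these figures only `orders` is flagged, which points at the network or connection pool between router and subgraph rather than at its resolvers.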
Failure Modes and Debugging
Disconnected subgraph traces (the most common failure)
Symptom: Apollo Studio or your APM shows a router trace and separate subgraph traces with no parent link; the subgraph spans float as their own roots.
Cause: The subgraph’s OpenTelemetry SDK is not extracting the incoming traceparent header, or the router is not configured to propagate it.
Fix: Enable propagation in router.yaml and register the W3C TraceContextPropagator plus a header-extracting instrumentation in the subgraph. Full walkthrough in propagating trace context across subgraphs with OpenTelemetry.
Sampling drops the trace you need
Symptom: Intermittent slow operations never appear in the backend; only fast ones are recorded.
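Cause: Head-based sampling at 0.1 makes the keep-or-drop decision at the root span, before the request’s latency is known, so about 90% of slow traces are discarded along with the fast ones.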
Fix: Move slow/error retention to tail sampling in the Collector (tail_sampling processor with a latency policy), keeping a low head sample for baseline traffic.
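The reason tail sampling fixes this is that the decision runs after the trace is complete, so it can look at the outcome. The sketch below shows that decision logic in isolation. It is not the Collector's implementation: the `CompletedTrace` type, the 1000 ms threshold, and the 5% baseline are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class CompletedTrace:
    trace_id: str  # 32 hex characters
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: CompletedTrace,
    latency_threshold_ms: float = 1000.0,
    baseline_ratio: float = 0.05,
) -> bool:
    """Tail-sampling decision made once the whole trace has been seen."""
    # Policy 1: always keep traces that contain an error.
    if trace.has_error:
        return True
    # Policy 2: always keep traces slower than the latency threshold.
    if trace.duration_ms >= latency_threshold_ms:
        return True
    # Policy 3: keep a deterministic fraction of everything else as a baseline.
    return int(trace.trace_id[-16:], 16) < int(baseline_ratio * (1 << 64))


slow = CompletedTrace("4bf92f3577b34da6a3ce929d0e0e4736", 2400.0, False)
fast = CompletedTrace("4bf92f3577b34da6a3ce929d0e0e4736", 35.0, False)
failed = CompletedTrace("4bf92f3577b34da6a3ce929d0e0e4736", 35.0, True)
```

The slow and failed traces are kept regardless of the baseline ratio, while the fast one is dropped. Note that a tail sampler can only keep what reaches it, so the head sample rate upstream has to be high enough for slow traces to arrive in the first place.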
Inconsistent sampling across services
Symptom: Traces are half-complete — the router span is sampled but a subgraph independently decided not to record.
Cause: parent_based_sampler is off, so each service samples independently.
Fix: Set parent_based_sampler: true in the router and configure subgraphs with a ParentBased sampler so they honour the root decision.
Exporter back-pressure
Symptom: Request latency rises and correlates with collector outages.
Cause: A synchronous or under-buffered exporter blocking on a slow endpoint.
Fix: Use a batched exporter to a local Collector; set sane export timeouts; alert on the router’s own export-failure metric.
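The property to preserve is that the request path never waits on the exporter. A minimal sketch of that idea, using a bounded in-memory buffer that drops spans when full instead of blocking; the class and method names are hypothetical, and real batch exporters add timers, retries, and thread-safe counters on top of this.

```python
import queue
from typing import List


class BoundedSpanBuffer:
    """Hands spans from request handlers to an export worker without blocking."""

    def __init__(self, max_spans: int, batch_size: int) -> None:
        self._queue: "queue.Queue[dict]" = queue.Queue(maxsize=max_spans)
        self._batch_size = batch_size
        # Count of dropped spans; expose this as a metric and alert on it.
        # (A plain int is enough for a sketch; guard it with a lock if
        # several threads call offer concurrently.)
        self.dropped = 0

    def offer(self, span: dict) -> bool:
        """Request path: enqueue if there is room, otherwise drop immediately."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def next_batch(self) -> List[dict]:
        """Export worker: take up to batch_size spans without waiting for more."""
        batch: List[dict] = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch


buffer = BoundedSpanBuffer(max_spans=3, batch_size=2)
accepted = [buffer.offer({"name": f"span-{i}"}) for i in range(5)]
```

When the collector stalls and the buffer fills, the cost is lost telemetry and a rising drop counter, not slower responses. That trade is usually the right one for a request-serving process.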
Frequently Asked Questions
What is the difference between OpenTelemetry tracing and Apollo Studio field-level tracing?
OpenTelemetry traces capture spans at the operation, query-plan, and subgraph-fetch level and can be exported to any OTLP backend (Tempo, Datadog, Honeycomb). Apollo Studio field-level tracing (FTV1) goes deeper, timing each individual schema field’s resolution inside subgraphs and aggregating those into per-field statistics in Studio. Most teams run both: OTLP for cross-service distributed traces and operational dashboards, Studio for field usage and schema-aware latency.
How do I keep tracing overhead low in a high-throughput router?
Use head-based sampling (sampler) to record only a fraction of traces, enable parent_based_sampler so the decision is consistent across services, keep Apollo field_level_instrumentation_sampler at 1–5%, batch all exports to a local OpenTelemetry Collector, and do expensive tail sampling (keep-all-errors, keep-all-slow) in the Collector rather than the router.
Why are my subgraph spans showing up as separate traces instead of children of the router span?
The W3C traceparent header is not being propagated from the router into the subgraph request, or the subgraph is not extracting it. Enable header propagation in router.yaml and register the trace-context propagator in each subgraph’s OpenTelemetry setup so the incoming context becomes the parent of the subgraph’s server span.
Can I use Prometheus for distributed traces?
No — Prometheus is a metrics system. Use the Prometheus exporter for the router’s RED metrics (rate, errors, duration) and Grafana dashboards, and pair it with an OTLP trace exporter to a tracing backend such as Tempo, Jaeger, or Datadog for the actual span trees.
Which span attribute tells me whether subgraph fetches ran in parallel?
Compare the wall-clock overlap of sibling fetch spans under the execution span. Fetches the planner placed in a Parallel node will overlap in time; fetches in a Sequence node (typically because one depends on entity keys returned by another) will be adjacent and additive. Sequential fetches on the critical path are the primary target for entity-boundary or @provides optimisation.
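That comparison can be scripted against exported span timestamps. The sketch below assumes each fetch span is available as a (subgraph, start, end) tuple under the same execution span; the function name and the sample timings are hypothetical, and it compares only fetches that are adjacent in start order.

```python
from typing import List, Tuple

Span = Tuple[str, float, float]  # (subgraph name, start_ms, end_ms)


def classify_adjacent_fetches(fetches: List[Span]) -> List[Tuple[str, str, str]]:
    """Label each pair of fetches adjacent in start order as parallel or sequential.

    Two fetches are treated as parallel when the later one starts before the
    earlier one has finished, and as sequential otherwise.
    """
    ordered = sorted(fetches, key=lambda span: span[1])
    relations = []
    for (a, _a_start, a_end), (b, b_start, _b_end) in zip(ordered, ordered[1:]):
        relation = "parallel" if b_start < a_end else "sequential"
        relations.append((a, b, relation))
    return relations


# Made-up timings: catalog and reviews both start only after orders returns.
observed_fetches = [
    ("orders", 0.0, 40.0),
    ("catalog", 42.0, 70.0),
    ("reviews", 43.0, 120.0),
]
relations = classify_adjacent_fetches(observed_fetches)
```

Here `orders` to `catalog` is sequential and `catalog` to `reviews` is parallel, matching a plan in which a Sequence node contains a Fetch followed by a Parallel node.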
Related
- Propagating Trace Context Across Subgraphs with OpenTelemetry
- Monitoring Federated Query Performance with Apollo Studio
- Apollo Router Configuration and Deployment: router telemetry lives in router.yaml
- Caching Strategies for Federated GraphQL: interpret cache-hit spans in traces
- Federated GraphQL Operations in Production: parent guide