Propagating Trace Context Across Subgraphs with OpenTelemetry
Linking a router span to its subgraph spans in one distributed trace requires the W3C traceparent header to flow from the Apollo Router into each subgraph and be adopted as the parent of that subgraph’s server span. This guide shows the exact router and subgraph configuration that makes a federated trace a single connected tree rather than disconnected fragments.
This is the implementation detail behind the broader Observability and Distributed Tracing in Federation guide.
When to use this pattern
- You see a router trace and separate subgraph traces in your backend, with no parent-child link between them.
- You need per-subgraph and per-resolver latency attributed under a single client operation.
- You are standardising on OpenTelemetry/OTLP across the router and all @apollo/server subgraphs and want spec-compliant context propagation.
Prerequisites
How Context Propagation Works
OpenTelemetry context propagation is two halves on two sides of an HTTP call. On the router side, when the query planner issues a fetch to a subgraph, the router injects the active span context into the outgoing request as a W3C traceparent header (format: 00-<trace-id>-<span-id>-<flags>). On the subgraph side, the HTTP instrumentation must extract that header, rebuild the remote span context, and start its server span as a child of it. If either half is missing, the subgraph starts a brand-new root span and the trace fragments.
The router does this automatically once propagation is enabled in router.yaml. The subgraph does it automatically only if its OpenTelemetry SDK has the W3C trace-context propagator registered and HTTP auto-instrumentation active — which is the part teams most often miss.
It helps to be precise about the W3C format, because most propagation bugs are really format bugs. A traceparent header has four hyphen-separated fields: a two-digit version (00), a 32-hex-character trace id, a 16-hex-character parent span id, and two trace-flags digits where the low bit indicates whether the trace was sampled. When the router injects this header, the trace id is the id shared by the whole operation and the parent span id is the router’s fetch span — so the subgraph’s extracted parent is that fetch span, which is exactly the linkage you want. The optional tracestate header carries vendor-specific data and rides along automatically; you rarely touch it directly. The two ways this breaks are a subgraph configured for a different propagation format (B3 single/multi-header, used by some Zipkin setups, or Jaeger’s uber-trace-id) so it never looks for traceparent, and a sampling flag of 00 that tells the subgraph “the root chose not to record this,” which a correctly configured ParentBasedSampler will honour by also not recording. Both are configuration choices, not bugs in OpenTelemetry — which is why the fix is always in setup, never in code you write per request.
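To make the four-field layout concrete, here is a minimal sketch of a traceparent parser. It is illustrative only: in a real subgraph the registered W3C propagator does this work, and the names `parseTraceparent` and `TraceParent` are hypothetical helpers, not part of any OpenTelemetry package. The sketch assumes the version 00 layout with exactly four fields.

```typescript
// Hypothetical helper types and names; the OpenTelemetry propagator does this for you.
interface TraceParent {
  version: string;
  traceId: string; // 32 hex characters, shared by the whole operation
  parentSpanId: string; // 16 hex characters, the router's fetch span
  sampled: boolean; // low bit of the trace-flags field
}

const HEX2 = /^[0-9a-f]{2}$/;
const HEX16 = /^[0-9a-f]{16}$/;
const HEX32 = /^[0-9a-f]{32}$/;
const ALL_ZERO = /^0+$/;

function parseTraceparent(header: string): TraceParent | null {
  const parts = header.trim().split('-');
  if (parts.length !== 4) return null; // version 00 has exactly four fields
  const [version, traceId, parentSpanId, flags] = parts as [string, string, string, string];

  // Field shapes: 2-hex version, 32-hex trace id, 16-hex span id, 2-hex flags
  if (!HEX2.test(version) || !HEX32.test(traceId)) return null;
  if (!HEX16.test(parentSpanId) || !HEX2.test(flags)) return null;

  // Version ff and all-zero ids are invalid in the W3C format
  if (version === 'ff' || ALL_ZERO.test(traceId) || ALL_ZERO.test(parentSpanId)) return null;

  return {
    version,
    traceId,
    parentSpanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}
```

Feeding it a header logged at the subgraph is a quick way to tell a format problem (the function returns null) from a sampling problem (`sampled` is false).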
Implementation Walkthrough
1. Enable propagation in the router
Tell the router to propagate W3C trace context on outgoing subgraph requests and to keep sampling decisions consistent across the trace.
# router.yaml
telemetry:
  exporters:
    tracing:
      common:
        service_name: apollo-router
        # honour the root's sampling decision so children aren't dropped
        parent_based_sampler: true
        sampler: 1.0 # sample everything while verifying
      otlp:
        enabled: true
        endpoint: http://otel-collector:4317
        protocol: grpc
      # propagation block: which formats the router injects/extracts
      propagation:
        trace_context: true # W3C traceparent/tracestate
  instrumentation:
    spans:
      subgraph:
        attributes:
          subgraph.name: true # label each fetch span
  # (Apollo Studio FTV1 reporting lives under telemetry.apollo and is separate;
  # this section covers OTLP propagation only)
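The router emits W3C traceparent on subgraph requests by default when tracing is enabled. parent_based_sampler: true is the critical line: it makes the router follow the sampling decision of any incoming parent context instead of deciding again, so the sampled flag it writes into traceparent is consistent for the whole trace. A subgraph using a ParentBasedSampler then records or drops its spans to match, rather than deciding independently and losing the child spans.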
2. Instrument the subgraph with @opentelemetry/sdk-node
Create a tracing bootstrap file that registers the W3C trace-context propagator and HTTP auto-instrumentation, then load it before anything else (so HTTP and GraphQL are patched at import time).
// tracing.ts: MUST be imported first, before @apollo/server / http
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
import { W3CTraceContextPropagator } from '@opentelemetry/core';
import {
  ParentBasedSampler,
  TraceIdRatioBasedSampler,
} from '@opentelemetry/sdk-trace-base';

const sdk = new NodeSDK({
  resource: new Resource({
    // service.name MUST match what you expect to see in the backend
    [SemanticResourceAttributes.SERVICE_NAME]: 'products-subgraph',
  }),
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317', // same collector the router uses
  }),
  // Extract/inject W3C traceparent so the router's span becomes our parent
  textMapPropagator: new W3CTraceContextPropagator(),
  // Honour the router's sampling decision; only sample roots we originate
  sampler: new ParentBasedSampler({
    root: new TraceIdRatioBasedSampler(1.0),
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // the HTTP instrumentation is what reads the incoming traceparent
      '@opentelemetry/instrumentation-http': { enabled: true },
    }),
  ],
});

sdk.start(); // begins extracting context from inbound requests immediately
Wire it in ahead of the server. With ts-node/tsx use --import ./tracing.ts, or simply import './tracing'; as the very first line of your entrypoint:
// index.ts
import './tracing'; // <-- first line, no exceptions
import { ApolloServer } from '@apollo/server';
import { startStandaloneServer } from '@apollo/server/standalone';
import { buildSubgraphSchema } from '@apollo/subgraph';
import gql from 'graphql-tag';

const typeDefs = gql`
  extend schema
    @link(url: "https://specs.apollo.dev/federation/v2.9", import: ["@key"])

  type Product @key(fields: "id") {
    id: ID!
    name: String!
  }
`;

const resolvers = {
  Product: {
    // resolver spans created here will nest under the router's fetch span
    __resolveReference: (ref: { id: string }) => ({ id: ref.id, name: 'Widget' }),
  },
};

const server = new ApolloServer({ schema: buildSubgraphSchema({ typeDefs, resolvers }) });
const { url } = await startStandaloneServer(server, { listen: { port: 4001 } });
console.log(`products-subgraph ready at ${url}`);
The auto-instrumentation patches the inbound HTTP server, so the moment the router’s request arrives carrying traceparent, the subgraph’s server span is opened as a child of the router’s fetch span — and every resolver/__resolveReference span nests beneath it.
3. Continue the trace into the subgraph’s own dependencies
Linking the router to the subgraph is the headline goal, but the trace is far more useful if it does not dead-end at the subgraph server span. The same auto-instrumentation set that extracts traceparent on the way in also injects it into the subgraph’s outgoing HTTP calls and wraps its database calls in child spans, provided the relevant instrumentations are enabled. Because getNodeAutoInstrumentations() bundles instrumentation for http, common database drivers (pg, mysql2, mongodb, ioredis), and graphql, a subgraph that queries Postgres will, with no extra code, produce a span tree of fetch (router) → subgraph server → __resolveReference → pg.query.
That last span is usually where a federated N+1 becomes obvious: instead of one pg.query span you see twenty identical ones in a row beneath a single _entities request, one per __resolveReference call. Keeping these downstream spans inside the same trace is what turns “the products subgraph is slow” into “the products subgraph issues one query per product key instead of batching”, which is an actionable, code-level finding rather than a service-level one.
If a subgraph calls a downstream service that is not auto-instrumented (say, a hand-rolled fetch wrapper or a non-Node service), make sure the W3C propagator is what stamps the outbound request, so the next hop can adopt the context just as this subgraph did from the router. The chain only stays connected if every hop both extracts on entry and injects on exit using the same format.
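As a rough illustration of what the outbound header has to look like, the sketch below stamps traceparent onto a plain headers object for a hand-rolled wrapper. The helper names (`toTraceparent`, `withTraceContext`) and the `SpanContextLike` interface are hypothetical; the interface mirrors the `traceId`, `spanId` and `traceFlags` fields of the span context exposed by `@opentelemetry/api`. Where the OpenTelemetry API is available, calling `propagation.inject(context.active(), headers)` is the better choice because it uses whichever propagator the SDK registered; this sketch only shows the result that call should produce for W3C trace context.

```typescript
// Hypothetical structural subset of the span context shape in @opentelemetry/api.
interface SpanContextLike {
  traceId: string; // 32 hex characters
  spanId: string; // 16 hex characters, the span making the outbound call
  traceFlags: number; // bit 0 set means "sampled"
}

// Render the flags byte as two lowercase hex digits.
function flagsToHex(traceFlags: number): string {
  const hex = (traceFlags & 0xff).toString(16);
  return hex.length < 2 ? '0' + hex : hex;
}

// Build a version-00 traceparent value: 00-<trace-id>-<span-id>-<flags>.
function toTraceparent(ctx: SpanContextLike): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${flagsToHex(ctx.traceFlags)}`;
}

// Copy the caller's headers and add traceparent, so the next hop can
// adopt the current span as its parent.
function withTraceContext(
  headers: Record<string, string>,
  ctx: SpanContextLike,
): Record<string, string> {
  return { ...headers, traceparent: toTraceparent(ctx) };
}
```

The span id written here is the id of a span that is actually exported (the one active when the call is made). Inventing a fresh id without recording a matching span would leave the downstream span pointing at a parent that never reaches the backend.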
Verification Steps
- Send a traced operation. Issue a query through the router so the planner fetches the subgraph:
  curl -s http://localhost:4000/ \
    -H 'content-type: application/json' \
    -d '{"query":"{ topProducts { id name } }"}' > /dev/null
- Confirm the header on the wire. Temporarily log inbound headers in the subgraph (or inspect with a debugging proxy). You should see a single traceparent per request:
  traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
  The third field (00f067aa0ba902b7) is the router’s fetch span id; that is the parent the subgraph will adopt.
- Check the backend trace tree. In your tracing UI (Tempo/Jaeger/Datadog), open the trace by id. You should see one tree: apollo-router → fetch (products) → products-subgraph server span → resolver spans. The trace.id is identical across all of them.
- Confirm shared trace id. Both processes report the same trace_id. If the subgraph shows a different trace id, the propagator is not registered or tracing.ts was not loaded first.
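The last two checks can also be scripted against spans exported from your backend. The following is a minimal sketch, assuming you can obtain each span’s trace id, span id and parent span id in some form. The `SpanRecord` shape and `checkSingleTree` function are hypothetical and not tied to any particular backend’s export format, and the check does not look for cycles.

```typescript
// Hypothetical, backend-agnostic span shape for the check.
interface SpanRecord {
  service: string;
  traceId: string;
  spanId: string;
  parentSpanId?: string; // undefined for a root span
}

interface TreeCheck {
  connected: boolean; // true only for one trace id, one root, no orphans
  traceIds: string[];
  roots: SpanRecord[];
  orphans: SpanRecord[]; // spans whose parent is not in the set
}

function checkSingleTree(spans: SpanRecord[]): TreeCheck {
  const knownIds = new Set(spans.map((s) => s.spanId));
  const traceIds = Array.from(new Set(spans.map((s) => s.traceId)));
  const roots = spans.filter((s) => s.parentSpanId === undefined);
  const orphans = spans.filter(
    (s) => s.parentSpanId !== undefined && !knownIds.has(s.parentSpanId),
  );
  const connected =
    spans.length > 0 && traceIds.length === 1 && roots.length === 1 && orphans.length === 0;
  return { connected, traceIds, roots, orphans };
}
```

A fragmented trace typically shows up as two trace ids and two roots (the subgraph started its own root), while a missing intermediate span shows up as an orphan.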
Common Mistakes and Gotchas
- Loading tracing.ts too late. If @apollo/server or http is imported before the SDK starts, the HTTP instrumentation never patches the server and the incoming traceparent is ignored. The bootstrap import must be the first statement in the process.
- Forgetting parent_based_sampler. Without it (router) or ParentBasedSampler (subgraph), each service samples independently and you get half-trees: the router span sampled, the subgraph span dropped, or vice versa.
- Mismatched propagator formats. The router emits W3C traceparent. If a subgraph is configured only for B3 or Jaeger propagation it won’t extract the header. Register W3CTraceContextPropagator (or a CompositePropagator including it) on the subgraph.
- Different collector endpoints. If the router exports to one collector and a subgraph to another that writes to a different backend, the spans share a trace id but live in two stores and never appear as one tree. Point every service at the same trace pipeline.
- Sampling at full volume in production. sampler: 1.0 and TraceIdRatioBasedSampler(1.0) are correct while verifying, but recording every trace in a high-throughput system is expensive. Once linkage is confirmed, lower the root ratio and rely on ParentBased/parent_based_sampler so the whole tree is recorded or dropped as a unit.
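The “recorded or dropped as a unit” behaviour follows from how a parent-based sampler decides. The sketch below is a simplified model of that decision. It is not the SDK’s implementation: the function names are hypothetical, and the ratio check uses the first 32 bits of the trace id as an assumed stand-in for the real TraceIdRatioBasedSampler algorithm.

```typescript
type Decision = 'record' | 'drop';

// Simplified ratio sampler: deterministic per trace id, so every service that
// evaluates the same id and ratio reaches the same answer.
// ASSUMPTION: uses the first 8 hex chars of the trace id; the SDK differs.
function ratioDecision(traceId: string, ratio: number): Decision {
  if (ratio >= 1) return 'record';
  if (ratio <= 0) return 'drop';
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000; // in [0, 1)
  return bucket < ratio ? 'record' : 'drop';
}

// Parent-based decision: if a parent context arrived (for example via
// traceparent), copy its sampled flag; only root spans consult the ratio.
function parentBasedDecision(
  parentSampled: boolean | undefined,
  traceId: string,
  rootRatio: number,
): Decision {
  if (parentSampled !== undefined) {
    return parentSampled ? 'record' : 'drop';
  }
  return ratioDecision(traceId, rootRatio);
}
```

Under this model a subgraph behind the router never consults its own ratio for router-originated requests, which is why lowering the root ratio at the router is enough to reduce volume across the whole tree.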
Frequently Asked Questions
Do I need to manually read the traceparent header in my resolvers?
No. The @opentelemetry/instrumentation-http auto-instrumentation extracts the header and establishes the remote parent context before your resolvers run, provided the SDK (with the W3C propagator) was started before the HTTP server was imported. Your resolver and __resolveReference spans are created within that context automatically.
Why is my subgraph creating a new root trace instead of a child span?
Either the W3C trace-context propagator is not registered on the subgraph SDK, or tracing.ts is loaded after @apollo/server or http, so HTTP auto-instrumentation never patched the inbound server. Register W3CTraceContextPropagator and make the tracing bootstrap the first import in the process.
Does this work the same with a manually composed Express subgraph?
Yes. As long as the OpenTelemetry HTTP instrumentation is active and the W3C propagator is registered before the Express app is created, the incoming traceparent is extracted regardless of whether you use startStandaloneServer or expressMiddleware.
Related
- Observability and Distributed Tracing in Federation (parent guide)
- Monitoring Federated Query Performance with Apollo Studio
- Apollo Router Configuration and Deployment (where the telemetry block lives)
- Optimizing Reference Resolvers for Performance (once spans reveal a slow __resolveReference)