langchain-otel-observability
Wire LangChain 1.0 / LangGraph 1.0 traces into an OpenTelemetry-native backend (Jaeger, Honeycomb, Grafana Tempo, Datadog) with LLM-specific SLOs, safe prompt-content policy, and subgraph-aware span propagation. Use when LangSmith is not the right fit (existing OTEL stack, compliance, multi-cloud) or alongside LangSmith for deep-system traces. Trigger with "langchain OTEL", "langchain opentelemetry", "langchain jaeger", "langchain honeycomb", "langchain SLO", "LLM span", "langchain tempo", "langchain datadog tracing".
Allowed Tools
Provided by Plugin
langchain-py-pack
Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns
Installation
This skill is included in the langchain-py-pack plugin:
/plugin install langchain-py-pack@claude-code-plugins-plus
Instructions
LangChain OTEL Observability (Python)
Overview
An engineer wires OpenTelemetry expecting to see prompts and responses in
Honeycomb. The traces land — but only timing, model name, and token counts
appear. The prompt body is blank. This is not a bug: it is the OTEL GenAI
semantic-conventions privacy-safe default (P27), where
`OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` is off. The instinct is to
flip it on and move on. On a multi-tenant workload that flip is a leak — the
next engineer to search traces for Tenant A sees Tenant B's PII in the results,
because redaction was supposed to happen upstream and never did.
A second trap lives inside LangGraph. A BaseCallbackHandler attached to the
parent runnable never fires on inner agent tool calls, because LangGraph
creates a child runtime per subgraph and callbacks do not inherit (P28). Spans
inside subgraphs appear orphaned in the waterfall — or they do not appear at
all — and SLO dashboards under-count latency on the exact calls that matter
most: the nested agent loops.
This skill wires LangChain 1.0 / LangGraph 1.0 into an OTEL-native backend
(Jaeger, Honeycomb, Grafana Tempo, Datadog) with a correct content-capture
policy, subgraph-aware span propagation, and five LLM-specific SLOs (p95 / p99
latency, error rate, cost-per-request, TTFT) with burn-rate alerts. Pin:
langchain-core 1.0.x, langgraph 1.0.x,
opentelemetry-instrumentation-langchain >= 0.33, OTEL GenAI semconv as of
2026-04. Pain-catalog anchors: P27, P28 (and cross-references P04, P34, P37).
Prerequisites
- Python 3.10+
- langchain-core >= 1.0, < 2.0
- langgraph >= 1.0, < 2.0
- An OTEL-native backend picked: Jaeger (dev), Honeycomb / Tempo / Datadog (prod)
- For multi-tenant: upstream redaction middleware already in place (see
  langchain-security-basics and langchain-middleware-patterns)
- Access to set env vars at deploy time (OTLP_ENDPOINT, API keys)
Instructions
Step 1 — Install the SDK and instrumentor, configure the exporter
```bash
pip install \
  opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp-proto-http \
  "opentelemetry-instrumentation-langchain>=0.33"
```
```python
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.langchain import LangchainInstrumentor


def _parse_headers(raw: str) -> dict:
    """Parse 'k1=v1,k2=v2' from OTLP_HEADERS into a header dict."""
    return dict(pair.split("=", 1) for pair in raw.split(",") if "=" in pair)


resource = Resource.create({
    "service.name": "my-langchain-app",
    "service.version": "1.0.0",
    "deployment.environment": os.getenv("ENV", "dev"),
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(
    OTLPSpanExporter(
        endpoint=os.environ["OTLP_ENDPOINT"],  # per-backend; see matrix
        headers=_parse_headers(os.getenv("OTLP_HEADERS", "")),
    ),
    max_queue_size=2048,        # spans buffered before drop; raise for high volume
    max_export_batch_size=512,  # batched export keeps per-span overhead under 1 ms
))
trace.set_tracer_provider(provider)

LangchainInstrumentor().instrument()  # emits gen_ai.* attrs on every run
```
BatchSpanProcessor keeps per-span overhead well under 1 ms. Use
SimpleSpanProcessor only in local dev — it blocks the call path per span.
Per-backend OTLP_ENDPOINT and header config lives in
Backend Setup Matrix — Jaeger,
Honeycomb, Grafana Tempo, Datadog.
Step 2 — Verify the GenAI attribute schema
Trigger one call and inspect what landed in the backend. LangChain 1.0 emits
these gen_ai.* attributes natively on every chat-model span:
| Attribute | Example |
|---|---|
| `gen_ai.system` | `anthropic` |
| `gen_ai.request.model` | `claude-sonnet-4-6` |
| `gen_ai.request.temperature` | `0.0` |
| `gen_ai.usage.input_tokens` | `1234` |
| `gen_ai.usage.output_tokens` | `567` |
| `gen_ai.response.finish_reasons` | `["stop"]` |
Missing anything? Likely a stale instrumentor version or an outdated provider
package. The full emitted-vs-custom matrix plus LangGraph's span taxonomy
(LangGraph.invoke → LangGraph.node. → LangGraph.subgraph.) is in the
semantic-conventions reference.
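The schema check can also be scripted against a span-attribute dict pulled from your backend's API. A minimal sketch (the helper name and the expected-attribute set below are ours, taken from the table above — not an official checklist):

```python
# Expected gen_ai.* attribute names on a chat-model span, per the table above.
EXPECTED_GEN_AI_ATTRS = {
    "gen_ai.system",
    "gen_ai.request.model",
    "gen_ai.request.temperature",
    "gen_ai.usage.input_tokens",
    "gen_ai.usage.output_tokens",
    "gen_ai.response.finish_reasons",
}


def missing_gen_ai_attrs(span_attributes: dict) -> set:
    """Return the expected gen_ai.* keys absent from one span's attributes."""
    return EXPECTED_GEN_AI_ATTRS - set(span_attributes)


# A span missing token-usage attrs points at a stale instrumentor version:
span = {"gen_ai.system": "anthropic", "gen_ai.request.model": "claude-sonnet-4-6"}
print(sorted(missing_gen_ai_attrs(span)))
```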
Step 3 — Decide on prompt-content capture (critical — do not skip)
The engineer's instinct is to flip the capture flag to see prompts. Before
flipping it, classify the workload into one of these buckets:
| Workload | Flag | Notes |
|---|---|---|
| Dev / staging with synthetic inputs | `true` | Fine. Do not copy these traces to prod. |
| Single-tenant internal tool | `true` | Fine if RBAC on the backend is tight. |
| Single-tenant product, signed compliance artifacts | `true` | BAA / DPIA in place; retention policy matches log retention. |
| Multi-tenant SaaS, no upstream redaction | `false` | Hard no. Fix redaction first. |
| Multi-tenant SaaS, with upstream redaction | `true` | Safe — the span sees the already-redacted text. |
| Healthcare / finance / legal without legal sign-off | `false` | Hard no. |
```bash
# trusted single-tenant ONLY
export OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT=true
export TRACELOOP_TRACE_CONTENT=true  # OpenLLMetry alias; set both to be safe
```
Leave unset (default) anywhere else. To capture bodies in a multi-tenant
system, wire redaction middleware upstream of the model call first — see
Prompt Content Policy and cross-reference
pack siblings langchain-security-basics (PII redaction middleware pattern,
P34) and langchain-middleware-patterns (middleware order: redact → cache →
model, P24). Failure pattern P27 — prompts missing from traces because
capture was never opted in — is the #1 first-day OTEL complaint; make the
decision explicit instead of surprise-flipping the flag in prod.
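For the multi-tenant rows, the upstream redactor is the real control. A minimal regex-based sketch of what such middleware does before text reaches the model call (illustrative only — the patterns and function name are ours; the pack's langchain-security-basics middleware is the production path, and a real deployment needs a reviewed pattern set):

```python
import re

# Illustrative patterns only; a production redactor needs a vetted pattern set.
_REDACTIONS = [
    (re.compile(r"sk-[A-Za-z0-9]{16,}"), "[REDACTED_API_KEY]"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "[REDACTED_SSN]"),
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),
]


def redact(text: str) -> str:
    """Strip known PII / secret patterns before the model call, so any
    captured gen_ai prompt content only ever sees the redacted text."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text
```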
Step 4 — Propagate callbacks through subgraphs (P28)
LangGraph creates a child runtime per subgraph. Callbacks bound at the parent
definition time do not inherit:
```python
from langgraph.prebuilt import create_react_agent

# llm, tools, and my_handler are defined elsewhere in your app.

# WRONG — subagent spans orphaned or missing (P28)
agent = create_react_agent(model=llm, tools=tools).with_config(
    callbacks=[my_handler]  # bound at definition time; children do not see it
)
agent.invoke({"messages": [...]})

# RIGHT — pass callbacks at invocation via config; they propagate down
agent.invoke(
    {"messages": [...]},
    config={"callbacks": [my_handler]},  # invocation-time; inherited by children
)
```
The same rule applies to custom attribute handlers (e.g. the
CostAttributeHandler in the semantic-conventions reference that stamps
`gen_ai.usage.cost_usd` on each model span). Attach via
`config["callbacks"]`, never via `.with_config()`. **Failure pattern P28
symptom:** SLO dashboards show low latency because the slow nested spans are
missing entirely, not because the nested calls are fast.
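The token-to-cost arithmetic such a handler stamps can be sketched as follows (the pricing table and function name are illustrative assumptions, not current list prices — look up your provider's pricing):

```python
# Illustrative per-million-token prices; NOT current list prices.
PRICING_USD_PER_MTOK = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Compute the gen_ai.usage.cost_usd value a cost callback would stamp."""
    p = PRICING_USD_PER_MTOK[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# On the example span from Step 2 (1234 input, 567 output tokens):
# 1234 * 3.00 / 1e6 + 567 * 15.00 / 1e6 ≈ $0.0122
```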
Step 5 — Define LLM SLOs and dashboards
Five SLIs matter from day one. All five derive from gen_ai.* span attributes
— no second pipeline required:
| SLI | Target example | Why |
|---|---|---|
| p95 latency (top-level chat) | < 5 s for chat UI | Provider variance dominates |
| p99 latency | < 15 s | Tail matters on chat; agents with tools live here |
| Error rate | < 0.5% | Includes 429s + `finish_reason IN ("length", "content_filter")` |
| Cost per request (p95) | < $0.05 | Catches haiku→opus regressions |
| TTFT p95 (streaming) | < 2 s | Perceived latency, not total duration |
Concrete Honeycomb / PromQL / Datadog queries for each SLI, plus multi-window
multi-burn-rate alerts (14.4× / 1h fast burn, 6× / 6h slow burn), are in
LLM SLO Dashboards (references/llm-slo-dashboards.md).
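The burn-rate multipliers decode with standard SRE arithmetic, sketched below (the helper name is ours): a burn rate of N means errors consume the budget N times faster than planned, so a 14.4× burn exhausts a 30-day budget in about two days, which is why it pages on a short window.

```python
def budget_exhaustion_hours(burn_rate: float, window_days: int = 30) -> float:
    """Hours until the error budget is gone if errors continue at
    burn_rate times the budgeted rate over a window_days SLO window."""
    return window_days * 24 / burn_rate


# 14.4x fast burn on a 30-day window → budget gone in 50 h (~2 days);
# 6x slow burn → 120 h (~5 days), caught by the slower 6 h window.
print(budget_exhaustion_hours(14.4), budget_exhaustion_hours(6.0))
```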
Step 6 — Tune sampling
Defaults are wrong for two ends of the volume spectrum:
```python
from opentelemetry.sdk.trace.sampling import TraceIdRatioBased

# Low/medium volume (< ~100 req/s) — SDK default 100% is fine; keep
# everything for debuggability.

# High volume — head-sample, but carve out errors + slow spans via tail
# sampling at the OTEL Collector (see references/llm-slo-dashboards.md)
provider = TracerProvider(
    resource=resource,
    sampler=TraceIdRatioBased(0.10),  # 10% head sample
)
```
Watch out: head sampling at 10% means 90% of p99 outliers are discarded
before they reach the backend — p99 metrics become noisy and biased toward
the median. For tail-latency SLOs, move sampling to a Collector with the
`tail_sampling` processor so errors and slow spans (latency > 5000 ms) are
always kept while the rest is probabilistically sampled at 10%. Typical trace
overhead with BatchSpanProcessor at the 512-span batch size is **under 1 ms
per span**; the recommended sampling rate for high-volume production is 1-10%.
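A Collector policy matching that rule might look like the following sketch (policy names are ours; syntax per the Collector's `tail_sampling` processor — verify against your Collector version):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-spans
        type: latency
        latency: {threshold_ms: 5000}
      - name: sample-the-rest
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
```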
Output
- OTEL exporter wired to a chosen backend (Jaeger / Honeycomb / Tempo / Datadog)
- opentelemetry-instrumentation-langchain emitting gen_ai.* attrs on every
  LangChain and LangGraph span
- Explicit prompt-content capture decision recorded against a workload bucket,
with the multi-tenant guardrail enforced upstream
- Callbacks propagated via config["callbacks"] at invocation time so
  subgraph spans nest correctly under their parent node
- Five LLM SLOs (p95 / p99 latency, error rate, cost-per-request, TTFT) with
dashboards and MWMBR burn-rate alerts
- Sampling strategy matched to workload volume and SLO precision needs
Error Handling
| Symptom | Cause | Fix |
|---|---|---|
| Traces land but prompt and completion bodies are empty | `OTEL_INSTRUMENTATION_GENAI_CAPTURE_MESSAGE_CONTENT` unset (P27 — privacy-safe default) | Set to `true` only for the workload buckets in Step 3; for multi-tenant, wire upstream redaction first |
| Subgraph / tool-call spans orphaned or missing | Callbacks bound via `.with_config()` at definition time (P28) | Pass via `config["callbacks"]` at invocation time so children inherit |
| `gen_ai.usage.cache_read_input_tokens` resets every call | Per-call usage; aggregation is your job (P04) | Custom callback summing across calls keyed by `session.id`; see langchain-model-inference |
| p99 dashboard looks noisy and median-biased | 10% head sampling drops outliers before the backend | Move to the Collector `tail_sampling` processor — always keep errors and latency > 5000 ms |
| Traces never appear | OTLPSpanExporter endpoint uses the wrong protocol (gRPC on 4317 vs HTTP on 4318) | Verify with `curl -v $OTLP_ENDPOINT`; swap to the proto-grpc exporter package if your backend expects gRPC |
| Cost attribute missing from spans | LangChain 1.0 does not emit `gen_ai.usage.cost_usd` natively | Add a BaseCallbackHandler that computes cost from tokens × pricing; see the semantic-conventions reference |
| PR review flags `sk-...` in trace attributes | Secrets in prompts captured via `gen_ai.prompt` content (P37-adjacent) | Upstream redactor must strip API-key patterns before the model call; audit via a 0.1% sampler |
| Exporter dropping spans silently | Queue overflow at high volume | Increase `max_queue_size` to 4096+; add a Collector between SDK and backend |
Examples
Running Jaeger locally for dev-loop tracing
Spin up Jaeger in Docker, point the SDK at http://localhost:4318/v1/traces,
leave content capture on (it's dev, inputs are synthetic). You get a generic
span waterfall — no LLM-specific UX, but good for verifying the instrumentor
emits what you expect before paying for a SaaS backend.
See Backend Setup Matrix for the
docker run command and SDK config.
Investigating an agent latency incident in Honeycomb
Honeycomb's BubbleUp over gen_ai.request.model, gen_ai.usage.input_tokens,
and tool call count is the fastest path from "p95 spiked at 14:00" to "one
specific tool took 20 s because the vectorstore was slow." Requires
content-capture-off by default so you can turn the team loose on search
without PII-leak worries.
See LLM SLO Dashboards for the exact
Honeycomb query shape.
Dual-exporting during a LangSmith → Tempo migration
Register two BatchSpanProcessors — one to LangSmith's OTLP endpoint, one to
Tempo. Run both for two weeks, compare waterfalls, cut over. LangSmith handles
LLM-specific analytics; Tempo handles unified trace search across LLM and
non-LLM services in your Grafana stack.
See Backend Setup Matrix dual-export
section.
Resources
- OTEL GenAI semantic conventions
- OTEL Python SDK
- OpenLLMetry LangChain instrumentation
- Honeycomb OTLP ingest
- Grafana Tempo
- Datadog OTLP
- Google SRE — Alerting on SLOs
- Pack cross-references: langchain-security-basics (redaction, P34),
  langchain-middleware-patterns (order: redact → cache → model, P24),
  langchain-model-inference (cost callback pattern, P04)
- Pack pain catalog: docs/pain-catalog.md — P27 (content-capture default),
  P28 (subgraph callback propagation), P04 (cache token aggregation),
  P34 (prompt injection), P37 (secrets in env / prompts)