langchain-observability
Wire LangSmith tracing and custom metric callbacks into a LangChain 1.0 chain or LangGraph 1.0 agent correctly — env-var spelling, subgraph propagation, per-tenant dimensions, cost and latency counters. Use when setting up observability on a new service, debugging blank traces in LangSmith, or adding per-tenant cost breakdowns. Trigger with "langchain observability", "langsmith tracing", "langchain callbacks", "langchain metrics".
Allowed Tools
Provided by Plugin
langchain-py-pack
Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns
Installation
This skill is included in the langchain-py-pack plugin:
/plugin install langchain-py-pack@claude-code-plugins-plus
Instructions
LangChain Observability (Python)
Overview
Engineer sets LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY=... from the
0.2 docs, restarts the service, and sees zero traces in LangSmith — no errors,
no warnings. That is P26: in LangChain 1.0 the canonical env vars are
LANGSMITH_TRACING and LANGSMITH_API_KEY. The LANGCHAIN_* names are
soft-deprecated and fail silently on any chain that goes through 1.0 middleware
or create_react_agent. One-line fix:
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=lsv2_...
export LANGSMITH_PROJECT=my-service-prod
Next failure mode: a custom BaseCallbackHandler attached via
chain.with_config(callbacks=[meter]) fires on the parent but is silent on
LangGraph subgraphs and create_react_agent tool calls — token counts
under-report by 30-70% vs the provider dashboard. That is P28: LangGraph
creates a child runtime per subgraph, and bound callbacks do not propagate.
Pass callbacks at invocation time instead:
await chain.ainvoke(inputs, config={"callbacks": [meter], "configurable": {"tenant_id": t}})
This skill walks through canonical LangSmith setup, a metric-callback template
with tenant dimensions, invocation-time propagation, RunnableConfig trace
tagging, and a decision tree for LangSmith-only vs OTEL-native (defer to
langchain-otel-observability / L33 for OTEL-heavy). Pin: langchain-core 1.0.x,
langgraph 1.0.x, langsmith current. LangSmith tracing adds <5ms per-span
overhead; metric callbacks add <1ms per fire. Pain-catalog anchors: P26, P28,
P04 (cache-token aggregation), P25 (retry double-counting).
Prerequisites
- Python 3.10+
- langchain-core >= 1.0, < 2.0
- langgraph >= 1.0, < 2.0
- langsmith (bundled with langchain; upgrade to current for 1.0 env-var support)
- A LangSmith API key (lsv2_...) — free tier at https://smith.langchain.com
- Optional metric sinks: prometheus_client, statsd, or datadog Python packages
Instructions
Step 1 — Enable LangSmith with the canonical 1.0 env vars
LANGSMITH_TRACING=true is the switch. LANGSMITH_API_KEY authenticates.
LANGSMITH_PROJECT groups traces by environment — use one project per
service-env pair (myapp-prod, myapp-staging), not one per service.
# .env (loaded via python-dotenv or secret manager)
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_PROJECT=my-service-prod
# Legacy fallback names (still work, soft-deprecated — do not use in new code):
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_API_KEY=lsv2_pt_...
# LANGCHAIN_PROJECT=my-service-prod
Verify in a REPL that the client sees the key before relying on it in
production:
from langsmith import Client
c = Client() # reads LANGSMITH_API_KEY and LANGSMITH_ENDPOINT
print(c.list_projects(limit=1)) # raises LangSmithAuthError if key is wrong
Do NOT set both LANGCHAIN_TRACING_V2 and LANGSMITH_TRACING — mixed settings
have caused stale project routing in 1.0.x. See P26.
For selective sampling in high-traffic services, set
LANGSMITH_SAMPLING_RATE=0.1 (10% of runs).
Step 2 — Write a metric callback for per-request observability
Subclass BaseCallbackHandler. Record token_in, token_out, latency_ms,
tool_calls, and error, tagged with a tenant_id dimension for downstream
grouping.
import time
from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult
class MetricCallback(BaseCallbackHandler):
    """Per-LLM-call metrics tagged with tenant_id. Overhead <1ms per event."""

    def __init__(self, tenant_id: str, sink) -> None:
        self.tenant_id = tenant_id
        self.sink = sink
        self._starts: dict[str, float] = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs) -> None:
        self._starts[str(run_id)] = time.perf_counter()

    def on_llm_end(self, response: LLMResult, *, run_id, **kwargs) -> None:
        t0 = self._starts.pop(str(run_id), time.perf_counter())
        elapsed_ms = (time.perf_counter() - t0) * 1000  # wall-clock latency
        tags = {"tenant_id": self.tenant_id}
        for gen in response.generations:
            for g in gen:
                # usage_metadata can be None on intermediate streaming chunks
                meta = getattr(g.message, "usage_metadata", None) or {}
                self.sink.incr("llm.token_in", meta.get("input_tokens", 0), tags)
                self.sink.incr("llm.token_out", meta.get("output_tokens", 0), tags)
                # P04 — aggregate Anthropic cache reads across calls
                cache = meta.get("input_token_details", {}).get("cache_read", 0)
                self.sink.incr("llm.cache_read", cache, tags)
        self.sink.hist("llm.latency_ms", elapsed_ms, tags)

    def on_llm_error(self, error, *, run_id, **kwargs) -> None:
        self._starts.pop(str(run_id), None)
        self.sink.incr("llm.error", 1, {"tenant_id": self.tenant_id,
                                        "error_type": type(error).__name__})

    def on_tool_end(self, output, *, run_id, **kwargs) -> None:
        self.sink.incr("llm.tool_calls", 1, {"tenant_id": self.tenant_id})
A thin sink protocol (incr, hist) swaps between Prometheus, StatsD, or
Datadog. Alternative sinks (LangSmith-only, OTEL) do not need this callback
at all — see Step 5. Full sink adapters and P25 retry dedupe are covered in
Custom Metrics Callback.
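The two-method sink contract can be sketched as a Protocol plus an in-memory
implementation for tests. Both names here (MetricSink, InMemorySink) are
hypothetical helpers for illustration, not part of any library; only the
incr/hist signatures are assumed from the callback above.

```python
from collections import defaultdict
from typing import Protocol


class MetricSink(Protocol):
    """The two-method contract MetricCallback writes to."""

    def incr(self, name: str, value: int, tags: dict[str, str]) -> None: ...
    def hist(self, name: str, value: float, tags: dict[str, str]) -> None: ...


class InMemorySink:
    """Hypothetical test double: accumulates counters and histogram samples
    keyed by metric name plus sorted tags."""

    def __init__(self) -> None:
        self.counters: dict[str, int] = defaultdict(int)
        self.samples: dict[str, list[float]] = defaultdict(list)

    @staticmethod
    def _key(name: str, tags: dict[str, str]) -> str:
        tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        return f"{name}{{{tag_str}}}"

    def incr(self, name: str, value: int, tags: dict[str, str]) -> None:
        self.counters[self._key(name, tags)] += value

    def hist(self, name: str, value: float, tags: dict[str, str]) -> None:
        self.samples[self._key(name, tags)].append(value)
```

In production, swap InMemorySink for one of the Step 5 adapters; the callback
never needs to change.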
Step 3 — Pass callbacks via config["callbacks"] at invocation (P28)
This is the single most common observability bug in LangGraph 1.0 services.
Binding callbacks at definition time does not propagate into subgraphs or
create_react_agent tool nodes — those create child runtimes with their own
callback scope.
# WRONG — fires on parent runnable only; silent on subgraphs (P28)
agent_bound = agent.with_config(callbacks=[MetricCallback(tenant_id, sink)])
result = await agent_bound.ainvoke(inputs)
# RIGHT — propagates to every runnable, subgraph, and tool call
meter = MetricCallback(tenant_id, sink)
result = await agent.ainvoke(
    inputs,
    config={
        "callbacks": [meter],
        "configurable": {"thread_id": session_id, "tenant_id": tenant_id},
        "tags": ["prod", f"tenant:{tenant_id}"],
        "metadata": {"request_id": req_id, "tier": "enterprise"},
    },
)
Construct the callback inside the request handler so it captures a fresh
tenant_id per request — and in that pattern, invocation-time config is the
only way callbacks reach subgraphs. See Trace Metadata and Tagging
for the full RunnableConfig shape.
Step 4 — Tag and annotate traces via RunnableConfig
LangSmith indexes two per-request fields: tags (flat list, filterable) and
metadata (key-value, searchable). Fix conventions early — LangSmith has no
rename tool.
config = {
    "callbacks": [meter],
    "tags": [
        "env:prod",                 # environment
        f"tenant:{tenant_id}",      # tenant
        f"tier:{tenant_tier}",      # plan tier
        f"feature:{feature_flag}",  # A/B experiment arm
    ],
    "metadata": {
        "request_id": req_id,
        "user_id": user_id,
        "session_id": session_id,
        "app_version": os.environ["APP_VERSION"],
    },
    "run_name": "agent_main",  # LangSmith UI label; overrides chain class name
}
Hierarchical tag conventions (env:prod, tenant:acme, tier:enterprise)
make LangSmith filters work. Free-form tags ("important", "check-me") do
not. See Trace Metadata and Tagging.
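Teams often centralize these conventions in a small helper so every service
emits the same shape. The sketch below assumes the tag scheme above;
build_run_config, APP_ENV, and the default fallbacks are hypothetical, not a
LangChain API.

```python
import os


def build_run_config(*, tenant_id: str, tenant_tier: str, req_id: str,
                     user_id: str, session_id: str, callbacks: list,
                     feature_flag: str = "") -> dict:
    """Hypothetical helper: one place that enforces hierarchical tags and
    structured metadata so traces stay filterable across services."""
    tags = [
        "env:" + os.environ.get("APP_ENV", "dev"),  # assumed env var
        f"tenant:{tenant_id}",
        f"tier:{tenant_tier}",
    ]
    if feature_flag:
        tags.append(f"feature:{feature_flag}")
    return {
        "callbacks": callbacks,
        "tags": tags,
        "metadata": {
            "request_id": req_id,
            "user_id": user_id,
            "session_id": session_id,
            "app_version": os.environ.get("APP_VERSION", "unknown"),
        },
        "run_name": "agent_main",
    }
```

The helper also gives you one grep target when the tag conventions need to
evolve.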
Step 5 — Pick a sink and the stack shape
The callback handler is the integration point. Options, in decreasing order of
fit:
- LangSmith only — zero additional overhead; tracing already covers latency
and token accounting. Fine for solo dev, small teams, and LLM-native ops.
- Prometheus (pull) — best fit for Kubernetes + existing Prom stack. Export
via prometheus_client HTTP endpoint. Watch tenant label cardinality.
- StatsD / Datadog (push) — UDP fire-and-forget; sub-1ms overhead. Safe on
high-throughput async services. Use datadog.dogstatsd for tag support.
- OTEL native — multi-service distributed tracing. Defer to
langchain-otel-observability (L33); do not reimplement here.
Decision tree:
Existing OTEL stack (Collector, Tempo, Jaeger)?
├── YES → OTEL-native (L33). LangSmith optional for prompt inspection.
└── NO → LLM-specific features (prompt inspection, evals, queues) enough?
├── YES → LangSmith only. Add MetricCallback only for tenant cost.
└── NO → Hybrid: LangSmith for prompts + Prometheus/Datadog for SLOs.
See references/hybrid-langsmith-otel.md for split-point rules.
Mixing paths without a plan creates double-emission and conflicting trace IDs.
See Custom Metrics Callback for
Prometheus / StatsD / Datadog sink implementations, plus dedupe for P25 retry
double-counts; see Hybrid LangSmith + OTEL
for the split-point contract.
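As one concrete push-path sketch: the DogStatsD wire format is
name:value|type|#tag:value over UDP. DogStatsDSink below is illustrative,
not the datadog package's client; the dedupe_key parameter is an assumed
extension of the Step 2 sink contract for P25 retry double-counting.

```python
import socket


class DogStatsDSink:
    """Illustrative push sink: formats DogStatsD datagrams and fires them
    over UDP (fire-and-forget, sub-1ms). Keeps a request-scoped dedupe set
    for P25 — pass the same dedupe_key for every attempt of one request."""

    def __init__(self, host: str = "127.0.0.1", port: int = 8125) -> None:
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self._seen: set[str] = set()

    @staticmethod
    def format_line(name: str, value, kind: str,
                    tags: dict[str, str]) -> str:
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        return f"{name}:{value}|{kind}|#{tag_str}"

    def incr(self, name, value, tags, dedupe_key=None):
        if dedupe_key is not None:
            key = f"{name}/{dedupe_key}"
            if key in self._seen:  # P25: this retry was already counted
                return
            self._seen.add(key)
        self.sock.sendto(
            self.format_line(name, value, "c", tags).encode(), self.addr)

    def hist(self, name, value, tags):
        self.sock.sendto(
            self.format_line(name, value, "h", tags).encode(), self.addr)
```

A real deployment would bound the _seen set (e.g. clear it per request) to
avoid unbounded growth.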
Step 6 — Feed runs back into evals
Real traffic is the best eval set. Route a sampled subset of production runs
into a LangSmith annotation queue for human review; the queue feeds Dataset
objects replayable against candidate models.
from langsmith import Client
Client().create_annotation_queue(
    name="prod-regressions",
    description="1% sample, weekly review",
)
# Add metadata={"eval_candidate": "true"} on 1% of runs — LangSmith UI has
# a rule to route into the queue by metadata filter.
Keep annotation queues under 500 runs/week (reviewers saturate past that).
See LangSmith Setup for the queue and
dataset flow.
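The 1% selection can also be made deterministic client-side, so retries of
the same request always make the same sampling decision. should_sample is a
hypothetical helper, not a LangSmith API — the UI metadata rule described
above is the server-side alternative.

```python
import hashlib


def should_sample(request_id: str, rate: float = 0.01) -> bool:
    """Deterministic sampler: hash the request id into [0, 1) and compare
    against the rate. Same id, same decision — safe across retries (P25)."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Usage sketch: tag only sampled runs so a LangSmith rule can route them
# into the annotation queue by metadata filter.
# metadata = {"eval_candidate": "true"} if should_sample(req_id) else {}
```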
Output
- LangSmith tracing on via LANGSMITH_TRACING / LANGSMITH_API_KEY /
LANGSMITH_PROJECT with a langsmith.Client() smoke-check
- MetricCallback(BaseCallbackHandler) emitting token_in, token_out,
cache_read, latency_ms, tool_calls, error tagged with tenant_id
- All chain invocations pass config={"callbacks": [...], ...} at invoke time
so metrics propagate to subgraphs and agent tools
- RunnableConfig carries hierarchical tags (env:*, tenant:*, tier:*)
and structured metadata (request_id, user_id, session_id)
- One metric sink wired (Prometheus, StatsD, Datadog, or LangSmith-only)
- Explicit choice recorded for LangSmith / OTEL / hybrid / custom
Error Handling
| Error | Cause | Fix |
|---|---|---|
| No traces in LangSmith, no errors | Used LANGCHAIN_TRACING_V2 spelling on 1.0 middleware path (P26) | Switch to LANGSMITH_TRACING=true and LANGSMITH_API_KEY |
| langsmith.utils.LangSmithAuthError: Unauthorized | Key is valid but points to a deleted workspace, or copied with trailing whitespace | Regenerate at smith.langchain.com; check repr(os.environ['LANGSMITH_API_KEY']) for \n |
| Callback fires on parent only, silent on subgraphs | Bound via .with_config(callbacks=[...]) — does not propagate (P28) | Pass via config["callbacks"] at invoke() / ainvoke() |
| Token counts under by 30-70% vs provider dashboard | Combination of P28 (subgraph silence) and P25 (retry double-count not deduped) | Fix P28 first; for P25 add request_id dedupe key in sink |
| Trace duration shows 0ms on streamed calls | on_llm_end fires after stream closes but handler records before — timing race | Use time.perf_counter() captured in on_llm_start, not on_chat_model_start |
| Prometheus cardinality explosion | tenant_id label has high cardinality (>10k tenants) | Bucket tenants into tiers for metrics; keep full tenant_id in LangSmith metadata only |
| LangSmith UI shows runs under default project, not the configured one | LANGSMITH_PROJECT env var not set at process start | Set before import; LANGSMITH_PROJECT is read once at Client() init |
| AttributeError: 'NoneType' object has no attribute 'get' in on_llm_end | usage_metadata is None on intermediate streaming chunks | Guard with if meta := getattr(g.message, 'usage_metadata', None): |
Examples
Multi-tenant SaaS: per-tenant cost dashboard
A production SaaS has 200 tenants on a shared LangGraph agent. Finance wants
weekly cost reports per tenant. The MetricCallback records token_in,
token_out, and cache_read tagged with tenant_id; Prometheus scrapes the
/metrics endpoint; Grafana aggregates
sum by (tenant_id) (rate(llm_token_out_total[1w])) * 0.0000015
for Sonnet output cost. The invocation-time config["callbacks"] propagation
is load-bearing here — without it, subgraph tool calls (the bulk of token
spend) go uncounted. See Custom Metrics Callback
for the full Prometheus integration.
Debugging missing traces in staging
A team deploys a new LangGraph service to staging. No traces show up in
LangSmith. Checking: (1) LANGSMITH_TRACING spelled correctly — yes; (2) API
key valid — langsmith.Client().list_projects(limit=1) returns ok; (3) project
name matches — LANGSMITH_PROJECT=myservice-staging. Traces appear in the
default project, not myservice-staging. Root cause: the env var was set in
the runtime env-file but the process was started before the env-file was
sourced. Client() read LANGSMITH_PROJECT at import time. Fix: restart the
process cleanly. See LangSmith Setup for the
process-order checklist.
Feeding prod traffic to an eval dataset
A team wants to validate a Claude 4.6 → Claude 4.7 upgrade against recent prod
runs. They add metadata={"eval_candidate": "pre-upgrade"} to 1% of runs for
one week, create a LangSmith dataset from the tagged runs, then replay against
the new model and diff outputs. The sampling rule lives in LangSmith UI,
filtered by metadata.eval_candidate. See LangSmith Setup
for the annotation-queue and dataset-creation flow.
Resources
- LangSmith Observability concepts
- LangSmith env variable reference
- LangChain callbacks (1.0)
- BaseCallbackHandler API
- RunnableConfig API
- For OTEL-native instrumentation: langchain-otel-observability (L33) in this pack
- Pack pain catalog: docs/pain-catalog.md (entries P04, P25, P26, P28)