langchain-observability

Wire LangSmith tracing and custom metric callbacks into a LangChain 1.0 chain or LangGraph 1.0 agent correctly — env-var spelling, subgraph propagation, per-tenant dimensions, cost and latency counters. Use when setting up observability on a new service, debugging blank traces in LangSmith, or adding per-tenant cost breakdowns. Trigger with "langchain observability", "langsmith tracing", "langchain callbacks", "langchain metrics".

claude-code, codex
4 Tools
langchain-py-pack Plugin
saas packs Category

Allowed Tools

Read, Write, Edit, Bash(python:*)

Provided by Plugin

langchain-py-pack

Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns

saas packs v2.0.0

Installation

This skill is included in the langchain-py-pack plugin:

/plugin install langchain-py-pack@claude-code-plugins-plus


Instructions

LangChain Observability (Python)

Overview

An engineer copies LANGCHAIN_TRACING_V2=true and LANGCHAIN_API_KEY=... from the
0.2 docs, restarts the service, and sees zero traces in LangSmith, with no errors
and no warnings. That is P26: in LangChain 1.0 the canonical env vars are
LANGSMITH_TRACING and LANGSMITH_API_KEY. The LANGCHAIN_* names are
soft-deprecated and fail silently on any chain that goes through 1.0 middleware
or create_react_agent. One-line fix:


export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY=lsv2_...
export LANGSMITH_PROJECT=my-service-prod

Next failure mode: a custom BaseCallbackHandler attached via
chain.with_config(callbacks=[meter]) fires on the parent but is silent on
LangGraph subgraphs and create_react_agent tool calls, so token counts
under-report by 30-70% vs the provider dashboard. That is P28: LangGraph
creates a child runtime per subgraph, and bound callbacks do not propagate.
Pass callbacks at invocation time instead:


await chain.ainvoke(inputs, config={"callbacks": [meter], "configurable": {"tenant_id": t}})

This skill walks through canonical LangSmith setup, a metric-callback template
with tenant dimensions, invocation-time propagation, RunnableConfig trace
tagging, and a decision tree for LangSmith-only vs OTEL-native (defer to
langchain-otel-observability / L33 for OTEL-heavy stacks). Pin: langchain-core
1.0.x, langgraph 1.0.x, langsmith current. LangSmith tracing adds <5ms per-span
overhead; metric callbacks add <1ms per fire. Pain-catalog anchors: P26, P28,
P04 (cache-token aggregation), P25 (retry double-counting).

Prerequisites

  • Python 3.10+
  • langchain-core >= 1.0, < 2.0, langgraph >= 1.0, < 2.0
  • langsmith (bundled with langchain; upgrade to current for 1.0 env-var support)
  • A LangSmith API key (lsv2_...) — free tier at https://smith.langchain.com
  • Optional metric sinks: prometheus_client, statsd, or datadog Python packages

Instructions

Step 1 — Enable LangSmith with the canonical 1.0 env vars

LANGSMITH_TRACING=true is the switch. LANGSMITH_API_KEY authenticates.
LANGSMITH_PROJECT groups traces by environment: use one project per
service-env pair (myapp-prod, myapp-staging), not one per service.


# .env (loaded via python-dotenv or secret manager)
LANGSMITH_TRACING=true
LANGSMITH_API_KEY=lsv2_pt_...
LANGSMITH_PROJECT=my-service-prod

# Legacy fallback names (still work, soft-deprecated — do not use in new code):
# LANGCHAIN_TRACING_V2=true
# LANGCHAIN_API_KEY=lsv2_pt_...
# LANGCHAIN_PROJECT=my-service-prod

Verify in a REPL that the client sees the key before relying on it in

production:


from langsmith import Client
c = Client()                       # reads LANGSMITH_API_KEY and LANGSMITH_ENDPOINT
# list_projects() is lazy; next() forces the request so a bad key fails here
print(next(iter(c.list_projects(limit=1)), None))   # raises LangSmithAuthError if key is wrong

Do NOT set both LANGCHAIN_TRACING_V2 and LANGSMITH_TRACING: mixed settings
have caused stale project routing in 1.0.x. See P26.
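
A startup check can catch the legacy and mixed spellings before they silently drop traces. The following is a minimal sketch, not part of the langsmith SDK: check_tracing_env and LEGACY_TO_CANONICAL are hypothetical names, and the function only inspects the mapping it is given.

```python
from typing import Mapping

# Legacy 0.2-era names mapped to their canonical 1.0 replacements.
LEGACY_TO_CANONICAL = {
    "LANGCHAIN_TRACING_V2": "LANGSMITH_TRACING",
    "LANGCHAIN_API_KEY": "LANGSMITH_API_KEY",
    "LANGCHAIN_PROJECT": "LANGSMITH_PROJECT",
}


def check_tracing_env(env: Mapping[str, str]) -> list[str]:
    """Return human-readable problems found in the tracing env vars."""
    problems: list[str] = []
    for legacy, canonical in LEGACY_TO_CANONICAL.items():
        if legacy in env and canonical in env:
            problems.append(f"both {legacy} and {canonical} are set; remove {legacy}")
        elif legacy in env:
            problems.append(f"{legacy} is set but {canonical} is not; rename it")

    key = env.get("LANGSMITH_API_KEY", "")
    if env.get("LANGSMITH_TRACING", "").lower() == "true" and not key.strip():
        problems.append("LANGSMITH_TRACING=true but LANGSMITH_API_KEY is empty")
    if key != key.strip():
        problems.append("LANGSMITH_API_KEY has leading or trailing whitespace")
    return problems
```

Calling check_tracing_env(os.environ) at process start, before any chain is built, and logging or raising on a non-empty result is one way to use it.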

For selective sampling in high-traffic services, set
LANGSMITH_TRACING_SAMPLING_RATE=0.1 (10% of runs). Full detail in
LangSmith Setup.

Step 2 — Write a metric callback for per-request observability

Subclass BaseCallbackHandler. Record token_in, token_out, latency_ms,
tool_calls, and error, tagged with a tenant_id dimension for downstream
grouping.


import time
from langchain_core.callbacks import BaseCallbackHandler
from langchain_core.outputs import LLMResult

class MetricCallback(BaseCallbackHandler):
    """Per-LLM-call metrics tagged with tenant_id. Overhead <1ms per event."""

    def __init__(self, tenant_id: str, sink) -> None:
        self.tenant_id = tenant_id
        self.sink = sink
        self._starts: dict[str, float] = {}

    def on_llm_start(self, serialized, prompts, *, run_id, **kwargs) -> None:
        self._starts[str(run_id)] = time.perf_counter()

    def on_llm_end(self, response: LLMResult, *, run_id, **kwargs) -> None:
        t0 = self._starts.pop(str(run_id), time.perf_counter())
        elapsed_ms = (time.perf_counter() - t0) * 1000   # wall-clock latency
        tags = {"tenant_id": self.tenant_id}
        for gen in response.generations:
            for g in gen:
                # Plain (non-chat) generations have no .message; treat usage as empty
                msg = getattr(g, "message", None)
                meta = getattr(msg, "usage_metadata", None) or {}
                self.sink.incr("llm.token_in",   meta.get("input_tokens", 0),  tags)
                self.sink.incr("llm.token_out",  meta.get("output_tokens", 0), tags)
                # P04: aggregate Anthropic cache reads across calls
                details = meta.get("input_token_details") or {}
                self.sink.incr("llm.cache_read", details.get("cache_read", 0), tags)
        self.sink.hist("llm.latency_ms", elapsed_ms, tags)

    def on_llm_error(self, error, *, run_id, **kwargs) -> None:
        self._starts.pop(str(run_id), None)
        self.sink.incr("llm.error", 1, {"tenant_id": self.tenant_id,
                                         "error_type": type(error).__name__})

    def on_tool_end(self, output, *, run_id, **kwargs) -> None:
        self.sink.incr("llm.tool_calls", 1, {"tenant_id": self.tenant_id})

A thin sink protocol (incr, hist) swaps between Prometheus, StatsD, or

Datadog. Alternative sinks (LangSmith-only, OTEL) do not need this callback

at all — see Step 5. Full sink adapters and P25 retry dedupe in

Custom Metrics Callback.
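
The sink protocol mentioned above could look roughly like this. It is a sketch under assumptions: MetricSink and InMemorySink are illustrative names, not classes shipped by the plugin, and the in-memory version is meant for unit tests rather than production.

```python
from collections import defaultdict
from typing import Protocol


class MetricSink(Protocol):
    """The two methods MetricCallback calls on its sink."""

    def incr(self, name: str, value: float, tags: dict[str, str]) -> None: ...
    def hist(self, name: str, value: float, tags: dict[str, str]) -> None: ...


class InMemorySink:
    """Test double: stores counters and samples keyed by (name, sorted tags)."""

    def __init__(self) -> None:
        self.counters: dict[tuple, float] = defaultdict(float)
        self.samples: dict[tuple, list[float]] = defaultdict(list)

    @staticmethod
    def _key(name: str, tags: dict[str, str]) -> tuple:
        # Sort tags so {"a": 1, "b": 2} and {"b": 2, "a": 1} share a key.
        return (name, tuple(sorted(tags.items())))

    def incr(self, name: str, value: float, tags: dict[str, str]) -> None:
        self.counters[self._key(name, tags)] += value

    def hist(self, name: str, value: float, tags: dict[str, str]) -> None:
        self.samples[self._key(name, tags)].append(value)
```

Because MetricSink is a structural Protocol, a Prometheus, StatsD, or Datadog adapter only needs matching incr and hist methods; no inheritance is required.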

Step 3 — Pass callbacks via config["callbacks"] at invocation (P28)

This is the single most common observability bug in LangGraph 1.0 services.
Binding callbacks at definition time does not propagate into subgraphs or
create_react_agent tool nodes, because those create child runtimes with their
own callback scope.


# WRONG — fires on parent runnable only; silent on subgraphs (P28)
agent_bound = agent.with_config(callbacks=[MetricCallback(tenant_id, sink)])
result = await agent_bound.ainvoke(inputs)

# RIGHT — propagates to every runnable, subgraph, and tool call
meter = MetricCallback(tenant_id, sink)
result = await agent.ainvoke(
    inputs,
    config={
        "callbacks": [meter],
        "configurable": {"thread_id": session_id, "tenant_id": tenant_id},
        "tags": ["prod", f"tenant:{tenant_id}"],
        "metadata": {"request_id": req_id, "tier": "enterprise"},
    },
)

Construct the callback inside the request handler so it captures a fresh

tenant_id per request — and in that pattern, invocation-time config is the

only way callbacks reach subgraphs. See Trace Metadata and Tagging

for the full RunnableConfig shape.
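
One way to keep the per-request pattern consistent is a small builder that every handler calls. This is a sketch, not an API from LangChain or the plugin: build_run_config and make_meter are hypothetical names, and the dict keys mirror the RIGHT example above.

```python
from typing import Any, Callable


def build_run_config(
    make_meter: Callable[[str], Any],
    *,
    tenant_id: str,
    session_id: str,
    request_id: str,
) -> dict[str, Any]:
    """Build a fresh invocation-time config for a single request."""
    return {
        # A new handler per request, so tenant_id is never shared across requests.
        "callbacks": [make_meter(tenant_id)],
        "configurable": {"thread_id": session_id, "tenant_id": tenant_id},
        "tags": [f"tenant:{tenant_id}"],
        "metadata": {"request_id": request_id},
    }
```

In a request handler the result would be passed as config= to agent.ainvoke, with make_meter given as something like lambda t: MetricCallback(t, sink).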

Step 4 — Tag and annotate traces via RunnableConfig

LangSmith indexes two per-request fields: tags (flat list, filterable) and

metadata (key-value, searchable). Fix conventions early — LangSmith has no

rename tool.


import os

config = {
    "callbacks": [meter],
    "tags": [
        "env:prod",                # environment
        f"tenant:{tenant_id}",     # tenant
        f"tier:{tenant_tier}",     # plan tier
        f"feature:{feature_flag}", # A/B experiment arm
    ],
    "metadata": {
        "request_id": req_id,
        "user_id": user_id,
        "session_id": session_id,
        "app_version": os.environ["APP_VERSION"],
    },
    "run_name": "agent_main",      # LangSmith UI label; overrides chain class name
}

Hierarchical tag conventions (env:prod, tenant:acme, tier:enterprise)

make LangSmith filters work. Free-form tags ("important", "check-me") do

not. See Trace Metadata and Tagging.
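
Since tags cannot be renamed later, it may help to build them through one function that rejects anything off-convention. The sketch below is illustrative only: make_tag, ALLOWED_PREFIXES, and the exact character rules are assumptions, not LangSmith requirements.

```python
import re

# Prefixes taken from the conventions above; extend deliberately, not ad hoc.
ALLOWED_PREFIXES = {"env", "tenant", "tier", "feature"}

# Values: letters, digits, underscore, dot, hyphen. No spaces or extra colons.
_VALUE_RE = re.compile(r"^[A-Za-z0-9_.\-]+$")


def make_tag(prefix: str, value: str) -> str:
    """Return 'prefix:value' or raise ValueError if it breaks the convention."""
    if prefix not in ALLOWED_PREFIXES:
        raise ValueError(f"unknown tag prefix {prefix!r}")
    if not _VALUE_RE.match(value):
        raise ValueError(f"tag value {value!r} contains unsupported characters")
    return f"{prefix}:{value}"
```

For example, make_tag("tenant", "acme") yields "tenant:acme", while a free-form tag such as make_tag("note", "check-me") raises instead of reaching LangSmith.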

Step 5 — Pick a sink and the stack shape

The callback handler is the integration point. Options, in decreasing order of

fit:

  • LangSmith only: zero additional overhead; tracing already covers latency
    and token accounting. Fine for solo dev, small teams, and LLM-native ops.
  • Prometheus (pull): best fit for Kubernetes + existing Prom stack. Export
    via prometheus_client HTTP endpoint. Watch tenant label cardinality.
  • StatsD / Datadog (push): UDP fire-and-forget; sub-1ms overhead. Safe on
    high-throughput async services. Use datadog.dogstatsd for tag support.
  • OTEL native: multi-service distributed tracing. Defer to
    langchain-otel-observability (L33); do not reimplement here.

Decision tree:


Existing OTEL stack (Collector, Tempo, Jaeger)?
├── YES → OTEL-native (L33). LangSmith optional for prompt inspection.
└── NO  → LLM-specific features (prompt inspection, evals, queues) enough?
         ├── YES → LangSmith only. Add MetricCallback only for tenant cost.
         └── NO  → Hybrid: LangSmith for prompts + Prometheus/Datadog for SLOs.
                   See references/hybrid-langsmith-otel.md for split-point rules.

Mixing paths without a plan creates double-emission and conflicting trace IDs.

See Custom Metrics Callback for

Prometheus / StatsD / Datadog sink implementations, plus dedupe for P25 retry

double-counts; see Hybrid LangSmith + OTEL

for the split-point contract.
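
The decision tree above is small enough to encode directly, which makes the choice explicit in code review or a service template. This is a sketch; the function name and the returned labels are made up for illustration.

```python
def choose_observability_stack(has_otel_stack: bool, llm_features_suffice: bool) -> str:
    """Mirror the decision tree: OTEL first, then LangSmith-only, else hybrid."""
    if has_otel_stack:
        # Existing Collector / Tempo / Jaeger: go OTEL-native (L33).
        return "otel-native"
    if llm_features_suffice:
        # Prompt inspection, evals, and queues are enough.
        return "langsmith-only"
    # LangSmith for prompts plus Prometheus/Datadog for SLOs.
    return "hybrid"
```

Note that the second argument is ignored when an OTEL stack already exists, matching the tree's first branch.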

Step 6 — Feed runs back into evals

Real traffic is the best eval set. Route a sampled subset of production runs

into a LangSmith annotation queue for human review; the queue feeds Dataset

objects replayable against candidate models.


from langsmith import Client
Client().create_annotation_queue(
    name="prod-regressions",
    description="1% sample, weekly review",
)
# Add metadata={"eval_candidate": "true"} on 1% of runs — LangSmith UI has
# a rule to route into the queue by metadata filter.

Keep annotation queues under 500 runs/week (reviewers saturate past that).

See LangSmith Setup for the queue and

dataset flow.
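
The 1% marking can be done deterministically so that a given request is always in or out of the sample. A minimal sketch, assuming the request ID is a stable string; is_eval_candidate and eval_metadata are hypothetical helpers, and only the eval_candidate key comes from the text above.

```python
import hashlib


def is_eval_candidate(request_id: str, rate: float = 0.01) -> bool:
    """Hash the request ID into [0, 1) and compare against the sample rate."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def eval_metadata(request_id: str, rate: float = 0.01) -> dict[str, str]:
    """Metadata to merge into RunnableConfig['metadata'] for sampled runs."""
    return {"eval_candidate": "true"} if is_eval_candidate(request_id, rate) else {}
```

Hashing rather than calling random.random() means retries of the same request land in the same bucket, so a run is not sampled on one attempt and skipped on the next.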

Output

  • LangSmith tracing on via LANGSMITH_TRACING / LANGSMITH_API_KEY /
    LANGSMITH_PROJECT with a langsmith.Client() smoke-check
  • MetricCallback(BaseCallbackHandler) emitting token_in, token_out,
    cache_read, latency_ms, tool_calls, error tagged with tenant_id
  • All chain invocations pass config={"callbacks": [...], ...} at invoke time
    so metrics propagate to subgraphs and agent tools
  • RunnableConfig carries hierarchical tags (env:*, tenant:*, tier:*)
    and structured metadata (request_id, user_id, session_id)

  • One metric sink wired (Prometheus, StatsD, Datadog, or LangSmith-only)
  • Explicit choice recorded for LangSmith / OTEL / hybrid / custom

Error Handling

| Error | Cause | Fix |
|---|---|---|
| No traces in LangSmith, no errors | Used LANGCHAIN_TRACING_V2 spelling on 1.0 middleware path (P26) | Switch to LANGSMITH_TRACING=true and LANGSMITH_API_KEY |
| langsmith.utils.LangSmithAuthError: Unauthorized | Key is valid but points to a deleted workspace, or copied with trailing whitespace | Regenerate at smith.langchain.com, check repr(os.environ['LANGSMITH_API_KEY']) for \n |
| Callback fires on parent only, silent on subgraphs | Bound via .with_config(callbacks=[...]), which does not propagate (P28) | Pass via config["callbacks"] at invoke() / ainvoke() |
| Token counts off by 30-70% vs provider dashboard | Combination of P28 (subgraph silence) and P25 (retry double-count not deduped) | Fix P28 first; for P25 add request_id dedupe key in sink |
| Trace duration shows 0ms on streamed calls | on_llm_end fires after stream closes but handler records before (timing race) | Use time.perf_counter() captured in on_llm_start, not on_chat_model_start |
| Prometheus cardinality explosion | tenant_id label has high cardinality (>10k tenants) | Bucket tenants into tiers for metrics; keep full tenant_id in LangSmith metadata only |
| LangSmith UI shows runs under default project, not the configured one | LANGSMITH_PROJECT env var not set at process start | Set before import; LANGSMITH_PROJECT is read once at Client() init |
| AttributeError: 'NoneType' object has no attribute 'get' in on_llm_end | usage_metadata is None on intermediate streaming chunks | Guard with if meta := getattr(g.message, 'usage_metadata', None): |
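
For the cardinality row, the split between metric labels and trace metadata might be sketched as follows. split_tenant_dimensions is a hypothetical helper, and using the plan tier as the only metric label is an assumption about what a low-cardinality bucket looks like in a given service.

```python
def split_tenant_dimensions(
    tenant_id: str, tenant_tier: str
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (metric_tags, trace_metadata) for one request."""
    # Low cardinality: a handful of tiers is safe as a Prometheus label.
    metric_tags = {"tier": tenant_tier}
    # Full detail stays in LangSmith metadata, where cardinality is not an issue.
    trace_metadata = {"tenant_id": tenant_id, "tier": tenant_tier}
    return metric_tags, trace_metadata
```

The first dict would go to the metric sink and the second into RunnableConfig metadata, so per-tenant drill-down remains possible in LangSmith without one time series per tenant.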

Examples

Multi-tenant SaaS: per-tenant cost dashboard

A production SaaS has 200 tenants on a shared LangGraph agent. Finance wants
weekly cost reports per tenant. The MetricCallback records token_in,
token_out, and cache_read tagged with tenant_id; Prometheus scrapes the
/metrics endpoint; Grafana aggregates
sum by (tenant_id) (increase(llm_token_out_total[1w])) * 0.000015
for Sonnet output cost (assuming $15 per million output tokens; check current
pricing). increase() is used rather than rate() because the report needs total
tokens over the week, not a per-second average. The invocation-time
config["callbacks"] propagation is load-bearing here: without it, subgraph tool
calls (the bulk of token spend) go uncounted. See Custom Metrics Callback
for the full Prometheus integration.
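
The same arithmetic can be checked offline against exported token totals. This is a sketch with made-up numbers: weekly_output_cost is a hypothetical helper, and the per-million price is a parameter precisely because provider pricing changes.

```python
def weekly_output_cost(
    output_tokens_by_tenant: dict[str, int],
    usd_per_million_tokens: float,
) -> dict[str, float]:
    """Convert weekly output-token totals into USD per tenant."""
    return {
        tenant: tokens * usd_per_million_tokens / 1_000_000
        for tenant, tokens in output_tokens_by_tenant.items()
    }


# Illustrative totals only; real values would come from the metrics backend.
costs = weekly_output_cost({"acme": 2_000_000, "globex": 500_000}, usd_per_million_tokens=15.0)
```

With these example inputs, acme's 2,000,000 tokens at $15 per million come to $30.00 and globex's 500,000 come to $7.50. Input and cache-read tokens would need their own prices and are left out here.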

Debugging missing traces in staging

A team deploys a new LangGraph service to staging. No traces show up in

LangSmith. Checking: (1) LANGSMITH_TRACING spelled correctly — yes; (2) API

key valid — langsmith.Client().list_projects(limit=1) returns ok; (3) project

name matches — LANGSMITH_PROJECT=myservice-staging. Traces appear in the

default project, not myservice-staging. Root cause: the env var was set in

the runtime env-file but the process was started before the env-file was

sourced. Client() read LANGSMITH_PROJECT at import time. Fix: restart the

process cleanly. See LangSmith Setup for the

process-order checklist.

Feeding prod traffic to an eval dataset

A team wants to validate a Claude 4.6 → Claude 4.7 upgrade against recent prod

runs. They add metadata={"eval_candidate": "pre-upgrade"} to 1% of runs for

one week, create a LangSmith dataset from the tagged runs, then replay against

the new model and diff outputs. The sampling rule lives in LangSmith UI,

filtered by metadata.eval_candidate. See LangSmith Setup

for the annotation-queue and dataset-creation flow.
