langchain-local-dev-loop
Build a fast, deterministic local test loop for LangChain 1.0 / LangGraph 1.0 — FakeListChatModel fixtures, pytest config, VCR cassettes with key redaction, warning-filter policy. Use when adding tests to a new chain, fixing a flaky test, or making integration tests reproducible. Trigger with "langchain pytest", "FakeListChatModel", "VCR langchain", "langchain test fixtures", "langchain integration test".
Allowed Tools
Provided by Plugin
langchain-py-pack
Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns
Installation
This skill is included in the langchain-py-pack plugin:
/plugin install langchain-py-pack@claude-code-plugins-plus
Instructions
LangChain Local Dev Loop (Python)
Overview
An engineer writes the most natural assertion possible:
def test_summarize():
    out = chain.invoke({"text": "..."})
    assert out.content == "expected summary"
It passes locally against Claude at temperature=0. It fails in CI on the third
run with a one-token delta in the output. That is P05: Anthropic's temperature=0
is not greedy — it still samples. Tests against live Claude are not deterministic,
period.
So the engineer swaps in FakeListChatModel(responses=["expected summary"]) and
the assertion passes. Then the downstream callback that logs cost blows up in CI
with KeyError: 'token_usage' — because FakeListChatModel does not emit
response_metadata["token_usage"] (P43). Production code reads that key, so
either the fake has to synthesize it or the test has to skip the callback.
Meanwhile, the first integration test under VCR records a cassette that ships
Authorization: Bearer sk-ant-api03-... in the repo (P44). PR review catches it;
the reviewer revokes the key; the dev loop is hosed for an afternoon.
And none of this matters if pytest cannot even collect the suite because
import langchain_community emits a DeprecationWarning that -W error promotes
to failure (P45).
This skill installs the four layers that make the whole loop fast and safe:
FakeListChatModel / FakeListLLM with a metadata-emitting subclass (fixes P43);
VCR with filter_headers plus a pre-commit hook (fixes P44); pytest
filterwarnings policy in pyproject.toml (fixes P45); and an env-var-gated
integration marker so the default pytest run never touches live APIs.
Speed targets: unit tests with FakeListChatModel run in < 100ms per
test; VCR-replayed integration tests run in 500ms – 2s per test; live
integration tests (the RUN_INTEGRATION=1 gate) run only in nightly or
manual workflows.
Pin: langchain-core 1.0.x, langgraph 1.0.x, pytest current, vcrpy
current. Pain-catalog anchors: P05, P43, P44, P45.
Prerequisites
- Python 3.10+
- pip install "langchain-core>=1.0,<2.0" "langgraph>=1.0,<2.0" pytest vcrpy pytest-recording (quote the specifiers so the shell does not treat < and > as redirects)
- For integration tests: at least one provider key (ANTHROPIC_API_KEY, etc.)
- Project uses pyproject.toml (PEP 621) for pytest config
Instructions
Step 1 — Deterministic unit tests with FakeListChatModel
Use FakeListChatModel from langchain_core.language_models.fake for chat
chains and FakeListLLM for legacy completion LLMs. Responses cycle through
the list.
from langchain_core.language_models.fake import FakeListChatModel
from langchain_core.prompts import ChatPromptTemplate

def test_classifier_picks_positive():
    fake = FakeListChatModel(responses=["positive"])
    prompt = ChatPromptTemplate.from_messages([("user", "Classify: {text}")])
    chain = prompt | fake
    out = chain.invoke({"text": "I love it"})
    assert out.content == "positive"
This is deterministic, runs in single-digit milliseconds, and has zero provider
dependency. Use it for every chain assertion that does not specifically require
real model behavior.
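The cycling behavior matters when one test drives multiple turns: each invoke consumes the next response, wrapping back to the start when the list is exhausted. A toy stand-in (not the real langchain_core class) showing the index arithmetic that cycling implies:

```python
# Hypothetical stand-in that mirrors how a list-backed fake cycles
# through canned responses; not the langchain_core implementation.
class CyclingResponses:
    def __init__(self, responses):
        self.responses = responses
        self.i = 0  # index of the next response to serve

    def next(self):
        response = self.responses[self.i % len(self.responses)]
        self.i += 1
        return response

fake = CyclingResponses(["positive", "negative"])
turns = [fake.next() for _ in range(4)]
print(turns)  # wraps around: positive, negative, positive, negative
```

In practice this means a multi-turn test should supply one response per expected model call, in call order.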
Step 2 — Subclass FakeListChatModel to emit response_metadata (P43 fix)
The stock fake emits no response_metadata["token_usage"]. If your chain has a
callback that records cost, the callback crashes under the fake. Subclass and
synthesize the metadata instead of mocking around the callback:
from langchain_core.language_models.fake import FakeListChatModel
from langchain_core.outputs import ChatGeneration, ChatResult
from langchain_core.messages import AIMessage

class FakeChatWithUsage(FakeListChatModel):
    """FakeListChatModel that emits response_metadata['token_usage'] so
    downstream callbacks reading token usage do not crash under test."""

    def _generate(self, messages, stop=None, run_manager=None, **kwargs):
        response = self.responses[self.i % len(self.responses)]
        self.i += 1
        message = AIMessage(
            content=response,
            response_metadata={
                "token_usage": {
                    "input_tokens": 10,
                    "output_tokens": len(response.split()),
                    "total_tokens": 10 + len(response.split()),
                },
                "model_name": "fake-chat",
            },
            usage_metadata={
                "input_tokens": 10,
                "output_tokens": len(response.split()),
                "total_tokens": 10 + len(response.split()),
            },
        )
        return ChatResult(generations=[ChatGeneration(message=message)])
Use FakeChatWithUsage whenever a chain's observability / cost path is in the
assertion surface. See Fake Model Fixtures
for agent, retriever, and embedder fakes.
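With the synthesized token_usage in place, a cost-tracking callback can run its normal arithmetic under test. A sketch of that accounting in isolation — the per-million-token prices below are made-up placeholders, not real provider pricing:

```python
# Hypothetical per-million-token prices; substitute your provider's real rates.
PRICES = {"fake-chat": {"input": 3.00, "output": 15.00}}

def cost_usd(model_name, token_usage):
    """Compute call cost from a token_usage dict shaped like the one
    FakeChatWithUsage synthesizes."""
    rates = PRICES[model_name]
    return (
        token_usage["input_tokens"] / 1_000_000 * rates["input"]
        + token_usage["output_tokens"] / 1_000_000 * rates["output"]
    )

usage = {"input_tokens": 10, "output_tokens": 2, "total_tokens": 12}
print(f"{cost_usd('fake-chat', usage):.8f}")  # tiny but nonzero
```

Asserting that this path runs without a KeyError under the fake is exactly what P43 is about.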
Step 3 — pytest fixtures that wire the fake into chains
Put fixtures in tests/conftest.py so they are shared across the suite:
# tests/conftest.py
import pytest
from langchain_core.prompts import ChatPromptTemplate
from tests.fakes import FakeChatWithUsage

@pytest.fixture
def fake_chat():
    """Reusable fake chat model. Override responses per-test via
    monkeypatch.setattr(fake_chat, 'responses', [...])."""
    return FakeChatWithUsage(responses=["ok"])

@pytest.fixture
def summarize_chain(fake_chat):
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Summarize the user's text in one line."),
        ("user", "{text}"),
    ])
    return prompt | fake_chat
Per-test response override:
def test_summary_shape(summarize_chain, fake_chat):
    fake_chat.responses = ["short summary"]
    out = summarize_chain.invoke({"text": "long input"})
    assert out.content == "short summary"
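Mutating fake_chat.responses in one test is safe because the fixture is function-scoped: every test gets a fresh instance, so overrides cannot leak into the next test. The same property in miniature, with a hypothetical factory standing in for the fixture:

```python
def make_fake(responses=("ok",)):
    """Stand-in for a function-scoped fixture: a new object on every call."""
    return {"responses": list(responses), "i": 0}

test_a = make_fake()
test_a["responses"] = ["short summary"]  # per-test override, local to test_a

test_b = make_fake()  # next test gets a fresh instance with defaults
print(test_b["responses"])  # ['ok'] — the override did not leak
```

A module- or session-scoped fake would not have this property; keep fake-model fixtures function-scoped unless you reset them explicitly.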
Step 4 — VCR cassettes for integration tests with key redaction (P44 fix)
Unit tests should never touch the network. Integration tests do, exactly once —
to record a cassette — and every subsequent run replays from the cassette file.
vcrpy records headers by default, which means Authorization: Bearer sk-...
lands in the fixture unless you filter it.
Configure VCR in tests/conftest.py:
# tests/conftest.py (continued)
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    return {
        "filter_headers": [
            "authorization",
            "x-api-key",
            "anthropic-version",
            "openai-organization",
            "cookie",
        ],
        "filter_query_parameters": ["api_key"],
        # Block accidental re-recording in CI:
        "record_mode": "none",
    }
Use pytest-recording:
import pytest

@pytest.mark.vcr  # cassette at tests/cassettes/<test_name>.yaml
@pytest.mark.integration
def test_live_claude_short_answer():
    from langchain_anthropic import ChatAnthropic

    chat = ChatAnthropic(model="claude-sonnet-4-6", temperature=0, timeout=30)
    out = chat.invoke("Say 'ok' and nothing else.")
    assert "ok" in out.content.lower()
To record (once, locally, with a real key): pytest --record-mode=once tests/.
Every other run replays — cassettes are committed, real API is never hit again.
Pre-commit hook to block key leaks:
# .git/hooks/pre-commit or .pre-commit-config.yaml entry
#!/usr/bin/env bash
set -e
if git diff --cached --name-only | grep -q '^tests/cassettes/'; then
    if git diff --cached -U0 -- 'tests/cassettes/' | \
        grep -E '(sk-ant-[a-zA-Z0-9_-]+|sk-[a-zA-Z0-9]{20,}|Bearer\s+[a-zA-Z0-9_-]{20,})'; then
        echo "ERROR: API key pattern found in staged cassette." >&2
        exit 1
    fi
fi
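The hook's regex can be unit-tested in Python before you trust it as a commit gate. A sketch running the same pattern against sample cassette lines — the key strings below are fabricated examples, not real keys:

```python
import re

# Same pattern the pre-commit hook greps for.
KEY_PATTERN = re.compile(
    r"(sk-ant-[a-zA-Z0-9_-]+|sk-[a-zA-Z0-9]{20,}|Bearer\s+[a-zA-Z0-9_-]{20,})"
)

def has_key_leak(text):
    """True if a line looks like it contains a provider key."""
    return KEY_PATTERN.search(text) is not None

# Fabricated key-shaped string: must be flagged.
assert has_key_leak("authorization: Bearer sk-ant-api03-AAAABBBBCCCC")
# Hypothetical redacted cassette line: must pass.
assert not has_key_leak("authorization: REDACTED")
```

Keeping this as a tiny test in the suite means a regex change that weakens the gate fails CI instead of silently letting keys through.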
See VCR Cassette Hygiene for the full
pre-commit config, record-new-episodes flow, shared-cassette patterns, and the
PR review checklist.
Step 5 — Pytest warnings + markers in pyproject.toml (P45 fix)
langchain_community and some provider SDKs emit DeprecationWarning at import
time. If the suite runs -W error, collection fails before any test does. Set
the policy once in pyproject.toml:
[tool.pytest.ini_options]
minversion = "8.0"
testpaths = ["tests"]
addopts = [
    "-ra",
    "--strict-markers",
    "--strict-config",
    "-W", "error",
]
markers = [
    "integration: hits real APIs or replays VCR cassettes (set RUN_INTEGRATION=1)",
    "slow: takes > 1s per test",
    "smoke: minimal healthcheck run in CI",
]
filterwarnings = [
    "error",
    "ignore::DeprecationWarning:langchain_community.*",
    "ignore::DeprecationWarning:pydantic.*",
    "ignore::PendingDeprecationWarning:langchain_core.*",
]
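Order matters here: pytest applies later filterwarnings entries with higher precedence, so the ignore lines carve narrow exceptions out of the blanket "error" policy. The same precedence behavior can be observed with the stdlib warnings module directly (the warning messages below are illustrative):

```python
import warnings

with warnings.catch_warnings():
    # Blanket policy first: every warning becomes an exception...
    warnings.simplefilter("error")
    # ...then a narrower filter, added later, takes precedence for its match.
    warnings.filterwarnings("ignore", category=DeprecationWarning)

    warnings.warn("old import path", DeprecationWarning)  # ignored, no raise
    try:
        warnings.warn("something else", UserWarning)
    except UserWarning:
        print("UserWarning still escalated to an error")
```

This is why putting "error" first and the langchain_community ignores after it works, while reversing the order would re-promote those import-time warnings to failures.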
See Pytest Config for the full skeleton
including coverage config and parallel execution notes.
Step 6 — Integration-test gating via env var
Default pytest must never hit real APIs. Gate on RUN_INTEGRATION=1:
# tests/conftest.py (continued)
import os
import pytest

def pytest_collection_modifyitems(config, items):
    if os.getenv("RUN_INTEGRATION") == "1":
        return
    skip_integration = pytest.mark.skip(reason="set RUN_INTEGRATION=1 to run")
    for item in items:
        if "integration" in item.keywords:
            item.add_marker(skip_integration)
CI default: pytest (unit only). Nightly / manual: RUN_INTEGRATION=1 pytest -m integration.
Step 7 — LangGraph tests: per-test thread_id + state assertions
LangGraph state is scoped to a thread_id. Tests that share a thread_id leak
state between each other. Give every test a fresh thread_id and a fresh
MemorySaver:
from langgraph.checkpoint.memory import MemorySaver
import uuid, pytest

@pytest.fixture
def graph_config():
    return {"configurable": {"thread_id": str(uuid.uuid4())}}

@pytest.fixture
def checkpointed_graph(fake_chat):
    from my_app.graphs import build_graph

    return build_graph(fake_chat).compile(checkpointer=MemorySaver())

def test_node_emits_plan(checkpointed_graph, graph_config, fake_chat):
    fake_chat.responses = ["step 1\nstep 2\nstep 3"]
    result = checkpointed_graph.invoke({"goal": "deploy"}, graph_config)
    # Assert state shape per node, not just the final output:
    assert result["plan"] == ["step 1", "step 2", "step 3"]
    # Time-travel: inspect every checkpoint for debugging
    history = list(checkpointed_graph.get_state_history(graph_config))
    assert history[-1].values == {"goal": "deploy"}  # initial state
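Why the fresh thread_id matters: a checkpointer keys saved state by thread_id, so two tests that reuse one id see each other's checkpoints. A toy checkpoint store (not the real MemorySaver) making the leak visible:

```python
import uuid

# Toy checkpoint store keyed the way a checkpointer scopes state: by thread_id.
store = {}

def save(thread_id, state):
    store.setdefault(thread_id, []).append(state)

def history(thread_id):
    return store.get(thread_id, [])

shared = "thread-1"                  # anti-pattern: one id reused across tests
save(shared, {"goal": "deploy"})     # "test A" runs first
save(shared, {"goal": "rollback"})   # "test B" now sees test A's checkpoint
print(len(history(shared)))          # 2 — leaked state across tests

fresh_a, fresh_b = str(uuid.uuid4()), str(uuid.uuid4())  # per-test ids
save(fresh_a, {"goal": "deploy"})
save(fresh_b, {"goal": "rollback"})
print(len(history(fresh_a)))         # 1 — isolated
```

The per-test MemorySaver in the fixture above provides the same isolation one level up: even identical thread_ids cannot collide across tests when each test gets its own checkpointer instance.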
Subgraph isolation testing cross-references langchain-langgraph-subgraphs
(pain P21 — parent cannot read child state unless the key is in the parent
schema). See LangGraph Test Patterns
for the subgraph-shared-state test recipe.
Output
- tests/fakes.py with FakeChatWithUsage subclass that emits response_metadata
- tests/conftest.py with fake-model fixtures, VCR config, and RUN_INTEGRATION gate
- pyproject.toml [tool.pytest.ini_options] block with markers and filterwarnings
- tests/cassettes/ committed with filtered headers (no Authorization / x-api-key)
- Pre-commit hook grepping cassettes for sk- / sk-ant- / Bearer patterns
- LangGraph tests with per-test thread_id and MemorySaver — no cross-test leakage
Test-type matrix
| Type | Model | Network | Target speed | Determinism | Use case |
|---|---|---|---|---|---|
| Unit | FakeListChatModel / FakeChatWithUsage | none | < 100ms | total | Chain shape, parser, routing logic |
| Integration (VCR) | real model, replayed cassette | replay only | 500ms – 2s | total (once recorded) | End-to-end chain behavior, provider-specific edge cases |
| Integration (live) | real model | live API | 2s – 30s | probabilistic (P05) | Nightly smoke, recording new cassettes, provider regression |
| Smoke | real model, minimal prompt | live API | < 5s | probabilistic | CI healthcheck — 1 test per provider, gated on RUN_INTEGRATION=1 |
| Load | real model | live API | minutes | probabilistic | Throughput / retry-storm reproduction, never in PR CI |
Error Handling
| Error | Cause | Fix |
|---|---|---|
| AssertionError on content despite temperature=0 | Anthropic temperature=0 still samples (P05) | Switch to FakeListChatModel or VCR replay |
| KeyError: 'token_usage' under fake model | FakeListChatModel emits no response_metadata (P43) | Use FakeChatWithUsage subclass from Step 2 |
| PR review flags Authorization: Bearer sk-... in cassette | VCR recorded headers by default (P44) | Set filter_headers before recording; re-record; add pre-commit grep hook |
| pytest fails at collection with DeprecationWarning | -W error + SDK import warnings (P45) | Add filterwarnings = ["ignore::DeprecationWarning:langchain_community.*"] |
| vcr.errors.CannotOverwriteExistingCassetteException | Test changed request shape but cassette is stale | pytest --record-mode=new_episodes locally, inspect diff, commit |
| LangGraph test pollutes next test's state | Shared thread_id + shared MemorySaver | Per-test thread_id=uuid.uuid4(), per-test MemorySaver() |
Examples
A flaky chain assertion, fixed in three commits
- Commit 1 — failing test: uses real ChatAnthropic, passes locally, fails 1-in-5 in CI at temperature=0 (P05).
- Commit 2 — swap to fake model: uses FakeListChatModel, passes deterministically, but the cost-logging callback crashes (P43).
- Commit 3 — fake with metadata: uses FakeChatWithUsage, the callback reads response_metadata["token_usage"] cleanly, the test is green and runs in 40ms.
See Fake Model Fixtures for the full
worked example including agent and retriever fakes.
Recording a cassette without leaking a key
# 1. Ensure conftest.py has filter_headers configured FIRST
# 2. Record with real key present in the environment
ANTHROPIC_API_KEY=sk-ant-... pytest --record-mode=once tests/integration/test_summarize.py
# 3. Verify no leak
grep -E 'sk-|Bearer' tests/cassettes/*.yaml && echo "LEAK" || echo "clean"
# 4. Commit cassettes/ — pre-commit hook runs the same grep as a hard gate
git add tests/cassettes/ && git commit -m "test: record summarize cassette"
See VCR Cassette Hygiene for
record-new-episodes mode, rerecord-on-mismatch, and the PR review checklist.
LangGraph time-travel debugging on a failing test
When a graph test fails mid-graph, get_state_history(config) returns every
checkpoint — you can replay from any point by passing its config.checkpoint_id
back into graph.invoke. See
LangGraph Test Patterns for the full
time-travel debugging recipe and the subgraph-shared-state test pattern
(cross-ref langchain-langgraph-subgraphs / pain P21).
Resources
- LangChain Python: testing guide
- FakeListChatModel API
- vcrpy documentation
- pytest-recording
- LangGraph MemorySaver + get_state_history
- Pytest filterwarnings
- Pack pain catalog: docs/pain-catalog.md (entries P05, P43, P44, P45)