langchain-performance-tuning
Tune LangChain 1.0 / LangGraph 1.0 Python chains and agents for throughput, latency, and cost — streaming modes, explicit batch concurrency, semantic plus exact caches, persistent message history, and async-safe retriever patterns. Use when p95 latency exceeds target, batching "does not work", cost grows linearly with traffic, or a process restart wipes chat history. Trigger with "langchain performance", "langchain slow batch", "langchain throughput", "langchain p95 latency", "semantic cache hit rate".
Allowed Tools
Provided by Plugin
langchain-py-pack
Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns
Installation
This skill is included in the langchain-py-pack plugin:
/plugin install langchain-py-pack@claude-code-plugins-plus
Instructions
LangChain Performance Tuning
Overview
An engineer calls `chain.batch(inputs)` on 1,000 inputs expecting 1,000 parallel LLM calls. Actual behavior: `Runnable.batch` and `Runnable.abatch` in LangChain 1.0 default to `max_concurrency=1`, so the 1,000 inputs run sequentially with bookkeeping overhead — sometimes slower than a plain `for` loop. This is pain-catalog entry P08. The fix is one line:
# Before: serial, ~1000 * per_call_latency
await chain.abatch(inputs)
# After: ~10x throughput with 10 concurrent provider calls
await chain.abatch(inputs, config={"max_concurrency": 10})
Other silent regressions in the same pain catalog: P48 (`invoke` inside `async def` blocks the FastAPI event loop), P22 (`InMemoryChatMessageHistory` loses every user's chat on restart), P62 (`RedisSemanticCache` at the default `score_threshold=0.95` returns under 5% hit rate), P59 (async retrievers leak connections on cancellation), P60 (`BackgroundTasks` fires after the response — wrong for per-token SSE), P01 (streaming token counts are only reliable on the `on_chat_model_end` event).
This skill wires a production performance baseline: explicit batch concurrency, async-only code paths, Redis-backed caches tuned on a golden set, persistent chat history with TTL, and TTFT instrumentation from astream_events(version="v2").
Prerequisites
- Python 3.11+ with `langchain>=1.0,<2`, `langgraph>=1.0,<2`, `langchain-openai` or `langchain-anthropic`, `langchain-community`, and `langchain-redis` or `redis>=5`.
- A working LangChain 1.0 chain or LangGraph 1.0 graph that already passes functional tests.
- Redis 7+ reachable from the app for cache and history (local Docker is fine for dev).
- A FastAPI / Starlette async endpoint, or an equivalent async entrypoint.
- Observability: a place to emit metrics (Prometheus, OpenTelemetry, or LangSmith) — needed to measure TTFT, p95, and cache hit rate.
Instructions
- Establish a latency budget and baseline. Pick explicit targets before changing code: TTFT under 1s, p95 total under 5s, throughput over 20 req/s per worker, cost under $X per 1k interactions. Run a 5-minute load test with `locust` or `wrk` against the current chain and record p50 / p95 / p99 / TTFT / total cost. Without these numbers every downstream change is theater.
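The recorded numbers can live in `perf/baseline.json`. A minimal sketch of reducing raw load-test latencies to that record — the `summarize` helper and its field names are illustrative placeholders, not part of any LangChain or load-tester API:

```python
import statistics

def summarize(samples_s: list[float]) -> dict:
    """Reduce raw per-request latencies (seconds) into the baseline record
    that later tuning steps are diffed against."""
    qs = statistics.quantiles(samples_s, n=100)  # 99 cut points; index 49 = p50
    return {
        "p50_s": round(qs[49], 3),
        "p95_s": round(qs[94], 3),
        "p99_s": round(qs[98], 3),
        # serial-equivalent throughput; a real load tester reports the measured number
        "throughput_rps": round(len(samples_s) / sum(samples_s), 2),
    }
```

Dump the returned dict to `perf/baseline.json` before changing anything, and to `perf/tuned.json` after each step.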
- Convert every hot path to async (P48). Inside `async def` handlers, replace `invoke`, `stream`, `batch`, `get_relevant_documents`, and `tool.run` with `ainvoke`, `astream` / `astream_events(version="v2")`, `abatch`, `aget_relevant_documents`, and `tool.arun`. See `references/async-safety-checklist.md` for a grep pattern and a CI linter. Target: zero sync LangChain calls inside any async function.
- Fix `.abatch()` concurrency (P08). Every `.abatch` / `.batch` call must pass `config={"max_concurrency": N}`, where N is chosen from the provider table in `references/batch-concurrency-per-provider.md` (Anthropic 10-20, OpenAI 20-50, local vLLM 100+). For multi-worker deploys, cap account-wide calls with a LiteLLM / Portkey proxy or a Redis semaphore — `max_concurrency` only governs one process.
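The single-process cap made explicit looks like the sketch below — a process-local `asyncio.Semaphore`, equivalent in effect to `config={"max_concurrency": limit}`. `bounded_abatch` is a hypothetical helper; an account-wide cap would swap the semaphore for a Redis-backed one, which this sketch does not implement:

```python
import asyncio

async def bounded_abatch(chain, inputs, limit: int = 10):
    """Run chain.ainvoke over inputs with at most `limit` calls in flight.
    Results come back in input order, like .abatch."""
    sem = asyncio.Semaphore(limit)

    async def one(item):
        async with sem:  # at most `limit` provider calls concurrently
            return await chain.ainvoke(item)

    return await asyncio.gather(*(one(i) for i in inputs))
```

Prefer the built-in `config={"max_concurrency": N}` in real code; the sketch exists to make the mechanism inspectable and to show where a distributed semaphore would plug in.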
- Instrument TTFT with `astream_events(version="v2")` (P01). Measure time to first token separately from total latency — user-perceived performance hinges on TTFT. Read usage metadata only on the `on_chat_model_end` event; per-chunk usage fields lag and are not reliable mid-stream.
from time import perf_counter

async def run(chain, query: str):
    t0 = perf_counter(); ttft = None; tokens = 0
    async for ev in chain.astream_events({"input": query}, version="v2"):
        if ev["event"] == "on_chat_model_stream" and ttft is None:
            ttft = perf_counter() - t0
        if ev["event"] == "on_chat_model_end":
            tokens = ev["data"]["output"].usage_metadata["total_tokens"]
    return {"ttft_s": ttft, "total_s": perf_counter() - t0, "tokens": tokens}
- Enable an exact LLM cache. For deterministic (temperature=0) prompts, set `RedisCache` or `SQLiteCache` globally. LangChain 1.0 keys include the bound tools signature (P61 fix), which prevents cache poisoning when an agent's tool list changes. Always set an explicit TTL on Redis keys — default Redis keys are immortal.
from langchain_core.globals import set_llm_cache
from langchain_community.cache import RedisCache
import redis

# ttl (seconds) keeps exact-cache keys from living forever
set_llm_cache(RedisCache(redis.Redis.from_url("redis://cache:6379/0"), ttl=3600))
- Add a semantic cache with a tuned threshold (P62). The `RedisSemanticCache` default `score_threshold=0.95` produces < 5% hit rate on real traffic. Collect a 200-500 prompt golden set with labeled near-duplicates, measure cosine similarity with your embedding model, and pick the F1-maximizing threshold — typically 0.85-0.90 for `text-embedding-3-small`. Full procedure in `references/cache-tuning.md`. Do not run a semantic cache behind `temperature > 0`; users will see prior random draws.
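The golden-set sweep fits in a few lines. `best_threshold` is a hypothetical helper operating on (cosine similarity, is-duplicate) pairs you have already computed with your embedding model:

```python
def best_threshold(pairs: list[tuple[float, bool]]) -> tuple[float, float]:
    """pairs: (cosine_similarity, is_true_duplicate) from the labeled golden set.
    Sweep candidate thresholds 0.70-0.99 and return (threshold, f1) maximizing F1."""
    best = (0.0, 0.0)
    for t in [x / 100 for x in range(70, 100)]:
        tp = sum(1 for s, dup in pairs if s >= t and dup)
        fp = sum(1 for s, dup in pairs if s >= t and not dup)
        fn = sum(1 for s, dup in pairs if s < t and dup)
        if tp == 0:
            continue
        prec = tp / (tp + fp)
        rec = tp / (tp + fn)
        f1 = 2 * prec * rec / (prec + rec)
        if f1 > best[1]:
            best = (t, f1)
    return best
```

Feed the winning threshold to `RedisSemanticCache(score_threshold=...)` and re-measure hit rate against live traffic before trusting it.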
- Replace `InMemoryChatMessageHistory` (P22). Every production chat path must use `RedisChatMessageHistory` (with `ttl`) or a LangGraph checkpointer (`AsyncPostgresSaver` / `AsyncSqliteSaver`). Add a restart test: mid-conversation, kill and restart the worker, then assert the next user turn still sees prior messages. See `references/persistent-history.md` for migration steps and trim policies.
- Close retriever connection pools in FastAPI `lifespan` (P59). Build the vector store once at startup, expose it via `app.state`, and close it in the `finally` block. Never construct a retriever per request — cancellations leak pg connections.
- Stream tokens with SSE, not `BackgroundTasks` (P60). `BackgroundTasks` runs after the response body is flushed; per-token dispatch via it delivers tokens the client will never read. Use `EventSourceResponse` (`sse-starlette`) or a WebSocket and pipe events from `astream_events`.
- Re-run the load test and diff the four metrics: TTFT, p95, throughput, cost per 1k. If any regressed, revert that step and investigate — do not stack changes without verification. Execute in this order to isolate effects:
  - Run the baseline load test and save results.
  - Set `max_concurrency` on every `.abatch` call and re-run.
  - Add the exact cache, re-run, and check the cache hit rate.
  - Configure the semantic cache with the tuned threshold, re-run, and check the hit rate again.
  - Verify persistent history survives a worker restart.
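A sketch of the diff step, assuming `baseline.json` and `tuned.json` hold flat dicts with the metric names shown below — the key names are placeholders for whatever your load tester emits:

```python
def diff_runs(baseline: dict, tuned: dict) -> dict:
    """Percent change per headline metric. Positive means improvement:
    latency/cost improve by going down, throughput by going up."""
    report = {}
    for key in ("ttft_p95_s", "total_p95_s", "cost_per_1k_usd"):
        report[key] = round(100 * (baseline[key] - tuned[key]) / baseline[key], 1)
    report["throughput_rps"] = round(
        100 * (tuned["throughput_rps"] - baseline["throughput_rps"])
        / baseline["throughput_rps"], 1
    )
    return report
```

Any negative entry in the report means that step regressed a metric and should be reverted before the next step runs.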
Throughput Tuning Table (starting values)
| Provider | Safe max_concurrency | Ceiling signal |
|---|---|---|
| Anthropic (sonnet-4.5/4.6) | 10-20 | 429 `rate_limit_error` |
| OpenAI (gpt-4o / 4o-mini) | 20-50 | 429 + TPM exhaustion header |
| OpenAI o1 / reasoning | 2-5 | Cost + latency, not rate |
| Google Gemini 1.5/2.5 | 10-30 | 429 |
| Cohere | 20-40 | 429 |
| Local vLLM / TGI | 100-500 (batch N≈32-64) | GPU KV-cache OOM |
| Ollama on consumer GPU | 1-4 | Process queue backpressure |
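These starting values can be validated with a quick sweep. A sketch, assuming `call` is any coroutine that issues one request (e.g. wrapping `chain.ainvoke`); `sweep` is a hypothetical helper, not a LangChain API:

```python
import asyncio
from time import perf_counter

async def sweep(call, inputs, limits=(1, 5, 10, 20, 50)):
    """Run the same workload at each concurrency limit and report wall time.
    The 'knee' is where doubling the limit stops cutting wall time."""
    results = {}
    for limit in limits:
        sem = asyncio.Semaphore(limit)

        async def bounded(item):
            async with sem:
                return await call(item)

        t0 = perf_counter()
        await asyncio.gather(*(bounded(i) for i in inputs))
        results[limit] = round(perf_counter() - t0, 3)
    return results
```

Stop raising the limit at the first 429 or when wall time flattens — whichever comes first.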
Latency Breakdown Template
Record these for every change, not just total:
| Metric | Target | Source |
|---|---|---|
| TTFT p50 / p95 | 500ms / 1s | first `on_chat_model_stream` event |
| Total p50 / p95 | 2s / 5s | end-to-end handler |
| Tool-call p95 | < 1s per tool | `on_tool_end` - `on_tool_start` |
| Retriever p95 | < 300ms | `on_retriever_end` - `on_retriever_start` |
| Provider p95 | measure per model | split by LLM node |
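The per-span rows can be derived from `astream_events` start/end pairs. A sketch, assuming the caller has collected `(event_name, run_id, timestamp)` tuples while iterating the stream — the tuple shape is this sketch's assumption, not a LangChain type:

```python
from collections import defaultdict

def span_durations(events: list[tuple[str, str, float]]) -> dict[str, list[float]]:
    """Pair *_start/*_end events by run_id and return durations (seconds)
    grouped by span kind, e.g. 'tool', 'retriever', 'chat_model'."""
    starts: dict[str, float] = {}
    out: dict[str, list[float]] = defaultdict(list)
    for name, run_id, ts in events:
        if name.endswith("_start"):
            starts[run_id] = ts
        elif name.endswith("_end") and run_id in starts:
            kind = name[len("on_"):-len("_end")]  # "on_tool_end" -> "tool"
            out[kind].append(ts - starts.pop(run_id))
    return dict(out)
```

Feed each duration list into the same percentile computation used for the baseline to fill in the tool-call and retriever p95 rows.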
Batch Sweet-Spot Numbers
- Anthropic tier 2 chat: `max_concurrency=10` saturates at roughly 8 req/s; p95 doubles past 20.
- OpenAI `gpt-4o-mini` tier 3: knee of the curve around `max_concurrency=30-40`; ~40 req/s throughput.
- Local vLLM on an A100: server-side batch sweet spot N=32-64, client `max_concurrency=100+`.

Verify on your own account — these are starting points, not promises.
Output
Deliverables from running this skill end-to-end:
- A `perf/` directory with `baseline.json` and `tuned.json` load-test results.
- All async handlers use `ainvoke` / `astream_events` / `abatch` with explicit `max_concurrency`.
- `set_llm_cache` wired to `RedisCache` (exact) and optionally `RedisSemanticCache` (tuned threshold).
- `RunnableWithMessageHistory` or a LangGraph checkpointer backed by Redis or Postgres, with TTL.
- FastAPI `lifespan` closing vector store pools on shutdown.
- An SSE endpoint streaming from `astream_events(version="v2")`.
- A `tests/test_no_sync_in_async.py` CI guard (see the async-safety reference).
- Metrics exported: `ttft_seconds`, `total_latency_seconds`, `cache_hit_total`, `cache_miss_total`, `batch_concurrency_current`.
- A runbook entry with the tuned `max_concurrency` per provider and the semantic-cache threshold, versioned in git.
Error Handling
| Symptom | Root cause | Fix |
|---|---|---|
| `.abatch(inputs)` no faster than a for loop | `max_concurrency=1` default (P08) | Pass `config={"max_concurrency": N}` |
| FastAPI TTFT collapses under load | Sync `invoke` inside `async def` (P48) | Switch to `ainvoke` / `astream_events` |
| Chat forgets prior turns after deploy | `InMemoryChatMessageHistory` (P22) | Move to `RedisChatMessageHistory` with TTL |
| Semantic cache hit rate < 5% | `score_threshold=0.95` default (P62) | Tune on golden set to 0.85-0.90 |
| pg pool exhausted hours into load test | Retriever not closed on cancel (P59) | Close vector store in FastAPI `lifespan` |
| SSE client sees zero tokens | Dispatching via `BackgroundTasks` (P60) | Use `EventSourceResponse` and `astream_events` |
| Per-chunk token counts fluctuate | Usage metadata lags during stream (P01) | Read only on `on_chat_model_end` |
| 429 storm after tuning concurrency | Per-worker limit * N workers > account RPM | Add a LiteLLM/Portkey proxy or Redis semaphore |
| Semantic cache returns off-brand output | Cache hit on a `temperature > 0` route | Disable semantic cache or force `temperature=0` |
| Cache poisoning after tool change | Missing tools in cache key | Upgrade LangChain to 1.0.x post-P61 fix |
Examples
Example 1 — Fix a sequential batch job.
# Before — 1000 items, 18 minutes end-to-end
results = await chain.abatch(inputs)
# After — 1000 items, ~2 minutes; Anthropic tier-2 account, N=10
results = await chain.abatch(inputs, config={"max_concurrency": 10})
Example 2 — Wire persistent history and an exact cache on a FastAPI app.
from contextlib import asynccontextmanager
from fastapi import FastAPI
from langchain_core.globals import set_llm_cache
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.cache import RedisCache
from langchain_community.chat_message_histories import RedisChatMessageHistory
import redis
@asynccontextmanager
async def lifespan(app: FastAPI):
    r = redis.Redis.from_url("redis://cache:6379/0")
    set_llm_cache(RedisCache(r))
    app.state.r = r
    try:
        yield
    finally:
        r.close()  # close the pool even if shutdown is triggered by an error

app = FastAPI(lifespan=lifespan)

def history_for(session_id: str) -> RedisChatMessageHistory:
    return RedisChatMessageHistory(
        session_id=session_id,
        url="redis://history:6379/2",
        ttl=60 * 60 * 24 * 14,  # 14 days
    )

chain_with_history = RunnableWithMessageHistory(
    base_chain, history_for,
    input_messages_key="input",
    history_messages_key="history",
)
Example 3 — Stream tokens with measured TTFT.
from sse_starlette.sse import EventSourceResponse
from time import perf_counter
@app.post("/chat")
async def chat(req: ChatReq):
    async def gen():
        t0 = perf_counter()
        ttft_recorded = False
        async for ev in chain_with_history.astream_events(
            {"input": req.text},
            config={"configurable": {"session_id": req.session_id}},
            version="v2",
        ):
            if ev["event"] == "on_chat_model_stream":
                if not ttft_recorded:
                    # record TTFT at the first token, not after the stream ends
                    app.state.r.incrbyfloat("ttft_sum_s", perf_counter() - t0)
                    ttft_recorded = True
                yield {"data": ev["data"]["chunk"].content}
    return EventSourceResponse(gen())
Resources
- One-pager — problem / solution / key features snapshot.
- batch-concurrency-per-provider — per-provider `max_concurrency` table, sweep procedure, semaphore patterns.
- cache-tuning — exact vs semantic, Redis key design, golden-set threshold procedure, TTL strategy.
- persistent-history — Redis / Postgres / LangGraph checkpointer migration off `InMemoryChatMessageHistory`.
- async-safety-checklist — sync-in-async grep + linter, lifespan pool cleanup, SSE vs `BackgroundTasks`.
- LangChain streaming / batching — official docs for `Runnable.batch` and streaming modes.
- LangChain caching — `set_llm_cache`, Redis and SQLite backends.
- LangGraph checkpointers — persistence for graph state.
- Companion skills in `langchain-py-pack`: `langchain-model-inference` (token accounting), `langchain-embeddings-search` (retrieval tuning), `langchain-middleware-patterns` (tool-signature cache keying, P61).