langchain-rate-limits
Rate-limit LangChain 1.0 calls correctly across multi-worker deployments — Redis-backed limiters, asyncio.Semaphore, narrow exception whitelists, and provider-specific throttle handling. Use when hitting 429s in production, scaling workers horizontally, or tuning throughput against Anthropic, OpenAI, or Gemini tier limits. Trigger with "langchain rate limit", "langchain 429", "langchain semaphore", "langchain token bucket", "anthropic rpm", "openai rpm throttling", "InMemoryRateLimiter", "redis rate limiter".
Allowed Tools
Provided by Plugin
langchain-py-pack
Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns
Installation
This skill is included in the langchain-py-pack plugin:
/plugin install langchain-py-pack@claude-code-plugins-plus
Instructions
LangChain Rate Limits (Python)
Overview
A team deploys 10 Cloud Run workers. Each worker initializes its ChatAnthropic
with InMemoryRateLimiter(requests_per_second=10) — they read the docs, they
picked a safe-looking number, they shipped. Thirty seconds later the dashboard
lights up with 429s: the cluster is pushing 100 RPS at Anthropic's 50 RPM
tier-1 ceiling, not the 10 RPS they configured. The name is the tell —
InMemoryRateLimiter is in-process. Each worker has its own counter. Ten
workers × 10 RPS = 100 RPS to the provider. This is pain-catalog entry P29
and it lands on every team that scales past one pod.
Three more traps wait on the same code path:
- P07 — .with_fallbacks([backup]) defaults exceptions_to_handle to (Exception,),
  which on Python <3.12 swallows KeyboardInterrupt. Ctrl+C during a 429
  retry storm silently falls through to the backup chain and keeps billing.
- P30 — ChatOpenAI and ChatAnthropic default to max_retries=6. That is
  retries, not attempts: 7 total requests per logical call on flaky
  networks. One .invoke() can bill 7x.
- P31 — Anthropic's RPM counts cache reads, cache writes, and uncached
calls uniformly. Cache-heavy workloads at 50 RPM can 429 on cache writes
while the ITPM dashboard shows headroom.
This skill covers measuring demand before picking a limit; the
InMemoryRateLimiter vs Redis-backed limiter vs asyncio.Semaphore decision
tree; the narrow exceptions_to_handle whitelist; max_retries=2 math; and
the provider-specific limit taxonomy (RPM, ITPM, OTPM, concurrent,
cached-vs-uncached). Pin: langchain-core 1.0.x, langchain-anthropic 1.0.x,
langchain-openai 1.0.x. Pain-catalog anchors: P07, P08, P29, P30, P31.
For .batch(max_concurrency=...) tuning, see the sibling skill
langchain-performance-tuning — this skill is about provider-facing rate caps.
Prerequisites
- Python 3.10+ (3.12+ fixes the KeyboardInterrupt half of P07)
- langchain-core >= 1.0, < 2.0
- At least one provider: pip install langchain-anthropic langchain-openai
- For multi-worker prod: a redis >= 4.5 client and a Redis server reachable from every worker
- Completed langchain-model-inference — the chat-model factory from that skill is where rate_limiter= gets attached
Instructions
Step 1 — Measure actual demand before picking a number
Do not guess at requests_per_second. Instrument first, size second.
Attach a BaseCallbackHandler that logs per-call input_tokens,
output_tokens, and cache_read_input_tokens from response.generations[].message.usage_metadata:
chain.with_config({"callbacks": [DemandLogger()]})
Collect 24-48 hours of representative traffic. Roll up: p50 and p95 RPM, p95
ITPM, p95 OTPM, cache hit rate. Size the limiter at **70% of the binding
constraint's tier ceiling** on your p95.
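A minimal sketch of that handler (the JSONL sink and field names are illustrative; the reference below carries the full implementation):

import json
import time

from langchain_core.callbacks import BaseCallbackHandler


class DemandLogger(BaseCallbackHandler):
    """Sketch: append one JSON line of usage metadata per completed LLM call."""

    def on_llm_end(self, response, **kwargs):
        for generations in response.generations:
            for gen in generations:
                msg = getattr(gen, "message", None)
                usage = getattr(msg, "usage_metadata", None) or {}
                details = usage.get("input_token_details") or {}
                record = {
                    "ts": time.time(),
                    "input_tokens": usage.get("input_tokens", 0),
                    "output_tokens": usage.get("output_tokens", 0),
                    "cache_read_tokens": details.get("cache_read", 0),
                }
                with open("demand_log.jsonl", "a") as f:  # illustrative sink
                    f.write(json.dumps(record) + "\n")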
See Measuring Demand for the full
DemandLogger implementation, pandas roll-up, OTEL integration, load-test
harness, and multi-tenant sizing strategies.
Step 2 — InMemoryRateLimiter for single-process dev only; never multi-worker prod
LangChain 1.0 ships InMemoryRateLimiter as a first-class BaseChatModel parameter:
from langchain_anthropic import ChatAnthropic
from langchain_core.rate_limiters import InMemoryRateLimiter
limiter = InMemoryRateLimiter(
requests_per_second=0.58, # 35 RPM = 70% of Anthropic tier-1 50 RPM
check_every_n_seconds=0.1,
max_bucket_size=5, # burst capacity
)
llm = ChatAnthropic(
model="claude-sonnet-4-6",
rate_limiter=limiter,
max_retries=2,
timeout=30,
)
InMemoryRateLimiter is per-process. Safe for:
- Single-process local dev (python script.py)
- Single-worker uvicorn (uvicorn --workers 1)
- Jupyter notebooks, batch scripts
Unsafe for (this is P29):
- Multi-worker uvicorn / gunicorn (--workers 4)
- Any container orchestrator with replica count > 1 (Cloud Run min-instances > 1, K8s, ECS)
- Distributed job runners (Celery, Temporal, Cloud Tasks fanout)
Step 3 — Redis-backed limiter for cluster-wide enforcement
For multi-worker deployments, cluster-wide rate limiting requires shared state.
Redis is the default answer — an atomic Lua script for a sliding window, or the
redis-cell module's CL.THROTTLE for GCRA.
import redis
from langchain_anthropic import ChatAnthropic
# RedisRateLimiter class defined in references/redis-limiter-pattern.md
from your_app.limiters import RedisRateLimiter
client = redis.Redis.from_url("redis://redis.internal:6379/0")
limiter = RedisRateLimiter(
client,
key="anthropic:prod",
requests_per_second=35 / 60, # 35 RPM cluster-wide, not per-worker
)
llm = ChatAnthropic(
model="claude-sonnet-4-6",
rate_limiter=limiter,
max_retries=2,
timeout=30,
)
Key scoping decisions:
- key="anthropic:prod" — all tenants share one global budget (simplest)
- key=f"anthropic:tenant:{tenant_id}" — per-tenant quota (requires cleanup for dead tenants)
- Two-level: per-tenant + global, acquire both (best for multi-tenant SaaS)
See Redis Limiter Pattern for the full
RedisRateLimiter implementation (atomic Lua sliding window), the GCRA
alternative via CL.THROTTLE, failure modes (Redis down, clock skew), and
per-tenant cleanup strategy.
Step 4 — asyncio.Semaphore for per-worker in-flight concurrency cap
The rate limiter throttles request rate. A semaphore throttles **in-flight
count**. Use both:
import asyncio
# Cluster: 35 RPM (Redis enforces)
# Worker: 20 in-flight at once (semaphore enforces)
worker_sem = asyncio.Semaphore(20)
async def bounded_invoke(inp):
async with worker_sem:
return await llm.ainvoke(inp)
# Fanout
results = await asyncio.gather(*[bounded_invoke(x) for x in inputs])
Why both: a semaphore prevents a single worker from queueing hundreds of
pending limiter acquires against Redis (head-of-line blocking on the event
loop). The limiter prevents the cluster from exceeding the provider tier. They
solve different problems.
Semaphore sizing: target the latency-bandwidth product. If p95 request latency
is 2s and the worker's RPS cap is 10, in-flight count ≈ 2 × 10 = 20. Overshoot
is wasted memory; undershoot leaves throughput on the table.
Step 5 — Narrow with_fallbacks(exceptions_to_handle=...) — never (Exception,)
.with_fallbacks([backup]) defaults to catching Exception. This is P07 — on
Python <3.12, that blanket catch can swallow KeyboardInterrupt mid-retry:
Ctrl+C during a retry storm silently hands off to the backup and keeps running.
Always narrow the tuple:
from anthropic import (
RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)
resilient = (prompt | claude | parser).with_fallbacks(
[prompt | gpt4o | parser],
exceptions_to_handle=(
RateLimitError, APITimeoutError,
APIConnectionError, InternalServerError,
),
# NEVER: Exception, BaseException, AuthenticationError,
# BadRequestError, ValidationError
)
The whitelist is only transient provider errors. AuthenticationError,
BadRequestError, and ValidationError are bugs in your code/credentials —
fallback produces the same crash. See the sibling skill's reference
langchain-sdk-patterns/references/fallback-exception-list.md for the full
per-provider whitelist (Anthropic, OpenAI, Gemini).
Step 6 — max_retries=2, never the default max_retries=6
max_retries is retries, not attempts. The default max_retries=6 on
ChatOpenAI / ChatAnthropic means initial + 6 retries = 7 billed requests
per logical call (P30). On a flaky network, one .invoke() costs 7x what you
budgeted.
# BAD — default
llm = ChatOpenAI(model="gpt-4o") # max_retries=6
# GOOD — production default
llm = ChatOpenAI(
model="gpt-4o",
max_retries=2, # initial + 2 retries = 3 total billed requests max
timeout=30,
rate_limiter=redis_limiter,
)
Trade resilience off to the fallback layer — with_fallbacks is strictly
cheaper than retry amplification when the primary is genuinely unhealthy.
Instrument retry count via callback and alert if retry rate exceeds ~5%.
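One way to make retry counts observable: the provider SDK's internal max_retries loop runs below LangChain and fires no callbacks, so a retry layer added with .with_retry() (whose attempts do fire on_retry) is easier to instrument. A sketch, with an assumed monitor class and threshold — not the skill's prescribed setup:

from langchain_core.callbacks import BaseCallbackHandler


class RetryRateMonitor(BaseCallbackHandler):
    """Sketch: track retries per completed call; alert past ~5%."""

    def __init__(self):
        self.calls = 0
        self.retries = 0

    def on_llm_end(self, response, **kwargs):
        self.calls += 1

    def on_retry(self, retry_state, **kwargs):
        self.retries += 1

    @property
    def retry_rate(self) -> float:
        return self.retries / max(self.calls, 1)


monitor = RetryRateMonitor()
# Retries must run at the LangChain layer to be visible to callbacks; if you
# adopt this, drop the client-level max_retries to 0 so the layers don't stack.
observed_llm = llm.with_retry(stop_after_attempt=3).with_config(
    {"callbacks": [monitor]}
)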
See Backoff and Retry for the full math,
Retry-After header handling, and circuit-breaker pattern for sustained
overload.
Step 7 — Understand the provider limit taxonomy
Different providers expose different limit types. Know which one binds your
workload before you size:
| Limit | Meaning | Who enforces | Binds for |
|---|---|---|---|
| RPM | Requests/minute (counts every call) | All three providers | Short chat replies |
| ITPM | Input tokens/minute | Anthropic, OpenAI (as TPM combined) | Long document Q&A |
| OTPM | Output tokens/minute | Anthropic separately; OpenAI as combined TPM | Long completions |
| Concurrent | In-flight request cap | Mainly OpenAI higher tiers | Burst traffic |
| Cached reads | Cache-read input tokens (Anthropic) | Anthropic separate budget line | Cache-heavy workloads (but still counts toward RPM — P31) |
Critical for Anthropic cache workloads (P31): RPM counts uniformly across
cached reads, cache writes, and uncached calls. A workload at 90% cache hit
rate still trips the 50 RPM ceiling at 51 requests/min. Separate monitors for
cache_read_input_tokens vs input_tokens (minus cache read/write) give
early warning.
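A sketch of that split, rolling up the JSONL emitted by the Step 1 DemandLogger sketch (column names are assumptions carried over from that sketch):

import pandas as pd

# Per-minute cached vs uncached series, plus raw RPM (every call counts -- P31).
df = pd.read_json("demand_log.jsonl", lines=True)
df["ts"] = pd.to_datetime(df["ts"], unit="s")
df["uncached_input_tokens"] = df["input_tokens"] - df["cache_read_tokens"]

buckets = df.resample("1min", on="ts")
per_min = buckets[["cache_read_tokens", "uncached_input_tokens"]].sum()
per_min["rpm"] = buckets.size()

# Early warning: minutes within 70% of a 50 RPM ceiling despite cache headroom.
print(per_min[per_min["rpm"] >= 0.7 * 50])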
Step 8 — Decision tree: which limiter to use
┌─ Single process (dev, notebooks, sync CLI, --workers 1)?
│ └─ InMemoryRateLimiter
│
├─ Multi-process but single host (same-machine pool, local gunicorn)?
│ └─ Redis-backed limiter (even localhost Redis beats InMemoryRateLimiter —
│ which still has per-process counters)
│
├─ Multi-host cluster (Cloud Run --min-instances>1, K8s, ECS)?
│ └─ Redis-backed limiter (mandatory)
│
├─ Multi-region or cross-cloud?
│ └─ Regional Redis per zone + provider-side account quota
│ (cross-region Redis latency adds 30-200ms per acquire)
│
└─ Any of the above + multi-tenant SaaS?
└─ Two-level Redis limiter: per-tenant + global, acquire both
Always pair with asyncio.Semaphore(N) per-worker for in-flight concurrency.
Step 9 — Provider tier snapshot (verify before shipping)
2026-04-21 snapshot — re-verify against the official console before shipping.
| Provider | Free tier RPM | Tier-1 RPM | High tier RPM | Source |
|---|---|---|---|---|
| Anthropic | 5 | 50 (Build 1) | 4000 (Build 4) | https://docs.anthropic.com/en/api/rate-limits |
| OpenAI | 3 | 500 | 10000 (Tier 5) | https://platform.openai.com/docs/guides/rate-limits |
| Google Gemini | 15 | 2000 (Paid 1) | 30000 (Paid 3) | https://ai.google.dev/gemini-api/docs/rate-limits |
Tiers change quarterly. A limiter sized six months ago on a different tier is
a liability. See Provider Tier Matrix for
the full matrix including ITPM / OTPM / cached-read separation, binding-limit
math, and the pre-ship verification checklist.
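A sketch of the binding-limit math (the p95s and the ITPM/OTPM ceilings below are illustrative placeholders; pull the real values from your console and the Step 1 measurements):

# Utilization per limit type; the highest ratio is the binding constraint.
measured_p95 = {"rpm": 30, "itpm": 28_000, "otpm": 7_000}   # from Step 1 roll-up
tier_ceiling = {"rpm": 50, "itpm": 40_000, "otpm": 8_000}   # verify in console

utilization = {k: measured_p95[k] / tier_ceiling[k] for k in tier_ceiling}
binding = max(utilization, key=utilization.get)             # "otpm" at 0.875 here
limiter_target = 0.70 * tier_ceiling[binding]               # size at 70% of it
print(binding, limiter_target)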
Output
- Instrumented DemandLogger callback attached to your chains for 24-48h before sizing
- InMemoryRateLimiter in dev / notebooks / single-worker only
- RedisRateLimiter (sliding-window Lua or CL.THROTTLE GCRA) for any multi-worker deployment, keyed per-tenant or global
- asyncio.Semaphore(N) per-worker in-flight cap paired with the cluster-wide limiter
- max_retries=2 on every ChatAnthropic / ChatOpenAI / ChatGoogleGenerativeAI
- .with_fallbacks(exceptions_to_handle=(RateLimitError, APITimeoutError, APIConnectionError, InternalServerError)) — never (Exception,)
- Per-provider tier re-verified from the official console, sized at 70% of the binding constraint
Error Handling
| Error | Cause | Fix |
|---|---|---|
| anthropic.RateLimitError: 429 THROTTLED at cluster RPM = N × InMemoryRateLimiter ceiling | InMemoryRateLimiter is per-process; N workers each send at their limit (P29) | Switch to Redis-backed limiter (Step 3) |
| 429 on cache writes while ITPM dashboard shows headroom | Anthropic RPM counts cache writes uniformly (P31) | Budget at RPM level with limiter; separate cached vs uncached metrics |
| One .invoke() bills as 7 requests on flaky networks | Default max_retries=6 (P30) | max_retries=2 + fallback layer for resilience |
| Ctrl+C during retry storm silently falls through to backup chain | exceptions_to_handle=(Exception,) catches KeyboardInterrupt on Python <3.12 (P07) | Narrow tuple to (RateLimitError, APITimeoutError, APIConnectionError, InternalServerError) |
| Limiter queue p95 wait > 500ms | Limiter is oversubscribed for real traffic | Re-measure demand (Step 1); upgrade provider tier OR shed load |
| redis.exceptions.ConnectionError blocks all LLM calls | Redis unavailable and limiter is fail-closed | Instrument Redis health; decide fail-open (log loudly) vs fail-closed (shed load) — for provider safety, prefer fail-closed |
| retry-after header climbing 2→4→8→16 | Pushing past tier; backoff amplifying, not absorbing | Lower limiter target RPS by 20%; upgrade tier if sustained |
| google.api_core.exceptions.ResourceExhausted on Gemini | Gemini free tier 15 RPM is brutal | Upgrade to paid Gemini tier 1 (2000 RPM) or use Redis limiter at 10 RPM |
Examples
Multi-worker Cloud Run deployment with Anthropic tier-1 50 RPM
Ten workers, single region, Redis in same VPC. Target: 35 RPM cluster-wide
(70% of 50 RPM ceiling), 20 in-flight per worker.
import asyncio, os, redis
from langchain_anthropic import ChatAnthropic
from anthropic import (
RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)
from your_app.redis_limiter import RedisRateLimiter # see references
_client = redis.Redis.from_url(os.environ["REDIS_URL"])
anthropic_limiter = RedisRateLimiter(
_client, key="anthropic:prod",
requests_per_second=35 / 60, # 35 RPM cluster-wide
)
llm = ChatAnthropic(
model="claude-sonnet-4-6",
rate_limiter=anthropic_limiter, # cluster gate
max_retries=2, # not 6 (P30)
timeout=30,
)
chain = (prompt | llm | parser).with_fallbacks(
[prompt | gpt4o_backup | parser],
exceptions_to_handle=( # narrow tuple (P07)
RateLimitError, APITimeoutError,
APIConnectionError, InternalServerError,
),
)
worker_sem = asyncio.Semaphore(20) # per-worker in-flight cap
async def invoke_bounded(inp):
async with worker_sem:
return await chain.ainvoke(inp)
Cluster behavior: every worker's limiter call hits the same Redis key. At 35
RPM cluster-wide, individual workers see fair-share throughput. max_retries=2
plus the narrow fallback tuple means transient 429s surface quickly and hand
off to GPT-4o instead of amplifying cost.
Multi-tenant SaaS with per-tenant isolation
Two-level Redis limiter. Per-tenant limit prevents noisy neighbors; global limit
protects the provider tier.
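The acquire ordering, as a sketch (try_acquire and release are hypothetical non-blocking helpers on the limiter; the reference below has the real implementation):

def acquire_two_level(tenant_limiter, global_limiter) -> bool:
    """Sketch: tenant quota first, then the cluster-wide budget."""
    if not tenant_limiter.try_acquire():      # hypothetical non-blocking acquire
        return False                          # noisy neighbor stops at its own quota
    if not global_limiter.try_acquire():
        tenant_limiter.release()              # hypothetical: return the tenant slot
        return False
    return True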
See Redis Limiter Pattern for the
two-level acquire implementation (acquire tenant key first, then global key;
release tenant if global fails) and the per-tenant cleanup cron.
Single-process dev — InMemoryRateLimiter is fine
For local debugging, notebook work, or a sync CLI tool:
from langchain_core.rate_limiters import InMemoryRateLimiter
limiter = InMemoryRateLimiter(requests_per_second=0.5, max_bucket_size=3)
llm = ChatAnthropic(model="claude-sonnet-4-6", rate_limiter=limiter, max_retries=2)
Do not carry this into production without re-reading Step 2.
Resources
- LangChain how-to: Chat model rate limiting
- InMemoryRateLimiter API reference
- Anthropic rate limits
- OpenAI rate limits
- Google Gemini rate limits
- Redis CL.THROTTLE (redis-cell module)
- Pack pain catalog: docs/pain-catalog.md (entries P07, P08, P29, P30, P31)
- Sibling skills: langchain-sdk-patterns (batch concurrency, fallback exception whitelist), langchain-performance-tuning (.batch(max_concurrency=...) tuning for throughput)