langchain-rate-limits
Rate-limit LangChain 1.0 calls correctly across multi-worker deployments — Redis-backed limiters, asyncio.Semaphore, narrow exception whitelists, and provider-specific throttle handling. Use when hitting 429s in production, scaling workers horizontally, or tuning throughput against Anthropic, OpenAI, or Gemini tier limits. Trigger with "langchain rate limit", "langchain 429", "langchain semaphore", "langchain token bucket", "anthropic rpm", "openai rpm throttling", "InMemoryRateLimiter", "redis rate limiter".
Allowed Tools
Provided by Plugin
langchain-py-pack
Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns
Installation
This skill is included in the langchain-py-pack plugin:
/plugin install langchain-py-pack@claude-code-plugins-plus
Instructions
LangChain Rate Limits (Python)
Overview
A team deploys 10 Cloud Run workers. Each worker initializes its ChatAnthropic
with InMemoryRateLimiter(requests_per_second=10) — they read the docs, they
picked a safe-looking number, they shipped. Thirty seconds later the dashboard
lights up with 429s: the cluster is pushing 100 RPS at Anthropic's 50 RPM
tier-1 ceiling, not the 10 RPS they configured. The name is the tell —
InMemoryRateLimiter is in-process. Each worker has its own counter. Ten
workers × 10 RPS = 100 RPS to the provider. This is pain-catalog entry P29
and it lands on every team that scales past one pod.
Three more traps wait on the same code path:
- P07 — .with_fallbacks([backup]) defaults exceptions_to_handle to (Exception,),
  which on Python <3.12 swallows KeyboardInterrupt. Ctrl+C during a 429
  retry storm silently falls through to the backup chain and keeps billing.
- P30 — ChatOpenAI and ChatAnthropic default to max_retries=6. That is
  retries, not attempts: 7 total requests per logical call on flaky
  networks. One .invoke() can bill 7x.
- P31 — Anthropic's RPM counts cache reads, cache writes, and uncached
calls uniformly. Cache-heavy workloads at 50 RPM can 429 on cache writes
while the ITPM dashboard shows headroom.
This skill covers measuring demand before picking a limit; the
InMemoryRateLimiter vs Redis-backed limiter vs asyncio.Semaphore decision
tree; the narrow exceptions_to_handle whitelist; max_retries=2 math; and
the provider-specific limit taxonomy (RPM, ITPM, OTPM, concurrent,
cached-vs-uncached). Pin: langchain-core 1.0.x, langchain-anthropic 1.0.x,
langchain-openai 1.0.x. Pain-catalog anchors: P07, P08, P29, P30, P31.
For .batch(max_concurrency=...) tuning, see the sibling skill
langchain-performance-tuning — this skill is about provider-facing rate caps.
Prerequisites
- Python 3.10+ (3.12+ fixes the KeyboardInterrupt half of P07)
- langchain-core >= 1.0, < 2.0
- At least one provider: pip install langchain-anthropic langchain-openai
- For multi-worker prod: a redis >= 4.5 client and a Redis server reachable from every worker
- Completed langchain-model-inference — the chat-model factory from that skill is where rate_limiter= gets attached
Instructions
Step 1 — Measure actual demand before picking a number
Do not guess at requests_per_second. Instrument first, size second.
Attach a BaseCallbackHandler that logs per-call input_tokens,
output_tokens, and cache_read_input_tokens from response.generations[].message.usage_metadata:
chain.with_config({"callbacks": [DemandLogger()]})
Collect 24-48 hours of representative traffic. Roll up: p50 and p95 RPM, p95
ITPM, p95 OTPM, cache hit rate. Size the limiter at **70% of the binding
constraint's tier ceiling** on your p95.
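A minimal sketch of that handler (the JSONL sink and field names are illustrative; the reference below carries the full implementation):

import json
import time

from langchain_core.callbacks import BaseCallbackHandler


class DemandLogger(BaseCallbackHandler):
    """Sketch: append one JSON line of usage metadata per completed LLM call."""

    def on_llm_end(self, response, **kwargs):
        for generations in response.generations:
            for gen in generations:
                msg = getattr(gen, "message", None)
                usage = getattr(msg, "usage_metadata", None) or {}
                details = usage.get("input_token_details") or {}
                record = {
                    "ts": time.time(),
                    "input_tokens": usage.get("input_tokens", 0),
                    "output_tokens": usage.get("output_tokens", 0),
                    "cache_read_tokens": details.get("cache_read", 0),
                }
                with open("demand_log.jsonl", "a") as f:  # illustrative sink
                    f.write(json.dumps(record) + "\n")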
See Measuring Demand for the full
DemandLogger implementation, pandas roll-up, OTEL integration, load-test
harness, and multi-tenant sizing strategies.
Step 2 — InMemoryRateLimiter for single-process dev only; never multi-worker prod
LangChain 1.0 ships InMemoryRateLimiter as a first-class BaseChatModel parameter:
from langchain_anthropic import ChatAnthropic
from langchain_core.rate_limiters import InMemoryRateLimiter
limiter = InMemoryRateLimiter(
requests_per_second=0.58, # 35 RPM = 70% of Anthropic tier-1 50 RPM
check_every_n_seconds=0.1,
max_bucket_size=5, # burst capacity
)
llm = ChatAnthropic(
model="claude-sonnet-4-6",
rate_limiter=limiter,
max_retries=2,
timeout=30,
)
InMemoryRateLimiter is per-process. Safe for:
- Single-process local dev (python script.py)
- Single-worker uvicorn (uvicorn --workers 1)
- Jupyter notebooks, batch scripts
Unsafe for (this is P29):
- Multi-worker uvicorn / gunicorn (--workers 4)
- Any container orchestrator with replica count > 1 (Cloud Run min-instances > 1, K8s, ECS)
- Distributed job runners (Celery, Temporal, Cloud Tasks fanout)
Step 3 — Redis-backed limiter for cluster-wide enforcement
For multi-worker deployments, cluster-wide rate limiting requires shared state.
Redis is the default answer — an atomic Lua script for a sliding window, or the
redis-cell module's CL.THROTTLE for GCRA.
import redis
from langchain_anthropic import ChatAnthropic
# RedisRateLimiter class defined in references/redis-limiter-pattern.md
from your_app.limiters import RedisRateLimiter
client = redis.Redis.from_url("redis://redis.internal:6379/0")
limiter = RedisRateLimiter(
client,
key="anthropic:prod",
requests_per_second=35 / 60, # 35 RPM cluster-wide, not per-worker
)
llm = ChatAnthropic(
model="claude-sonnet-4-6",
rate_limiter=limiter,
max_retries=2,
timeout=30,
)
Key scoping decisions:
- key="anthropic:prod" — all tenants share one global budget (simplest)
- key=f"anthropic:tenant:{tenant_id}" — per-tenant quota (requires cleanup for dead tenants)
- Two-level: per-tenant + global, acquire both (best for multi-tenant SaaS)
See Redis Limiter Pattern for the full
RedisRateLimiter implementation (atomic Lua sliding window), the GCRA
alternative via CL.THROTTLE, failure modes (Redis down, clock skew), and
per-tenant cleanup strategy.
Step 4 — asyncio.Semaphore for per-worker in-flight concurrency cap
The rate limiter throttles request rate. A semaphore throttles **in-flight
count**. Use both:
import asyncio
# Cluster: 35 RPM (Redis enforces)
# Worker: 20 in-flight at once (semaphore enforces)
worker_sem = asyncio.Semaphore(20)
async def bounded_invoke(inp):
async with worker_sem:
return await llm.ainvoke(inp)
# Fanout
results = await asyncio.gather(*[bounded_invoke(x) for x in inputs])
Why both: a semaphore prevents a single worker from queueing hundreds of
pending limiter acquires against Redis (head-of-line blocking on the event
loop). The limiter prevents the cluster from exceeding the provider tier. They
solve different problems.
Semaphore sizing: target the latency-bandwidth product. If p95 request latency
is 2s and the worker's RPS cap is 10, in-flight count ≈ 2 × 10 = 20. Overshoot
is wasted memory; undershoot leaves throughput on the table.
Step 5 — Narrow with_fallbacks(exceptions_to_handle=...) — never (Exception,)
.with_fallbacks([backup]) defaults to catching Exception. This is P07 — on
Python <3.12, that blanket catch can swallow KeyboardInterrupt mid-retry:
Ctrl+C during a retry storm silently hands off to the backup and keeps running.
Always narrow the tuple:
from anthropic import (
RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)
resilient = (prompt | claude | parser).with_fallbacks(
[prompt | gpt4o | parser],
exceptions_to_handle=(
RateLimitError, APITimeoutError,
APIConnectionError, InternalServerError,
),
# NEVER: Exception, BaseException, AuthenticationError,
# BadRequestError, ValidationError
)
The whitelist is only transient provider errors. AuthenticationError,
BadRequestError, and ValidationError are bugs in your code/credentials —
fallback produces the same crash. See the sibling skill's reference
langchain-sdk-patterns/references/fallback-exception-list.md for the full
per-provider whitelist (Anthropic, OpenAI, Gemini).
Step 6 — max_retries=2, never the default max_retries=6
max_retries is retries, not attempts. The default max_retries=6 on
ChatOpenAI / ChatAnthropic means initial + 6 retries = 7 billed requests
per logical call (P30). On a flaky network, one .invoke() costs 7x what you
budgeted.
# BAD — default
llm = ChatOpenAI(model="gpt-4o") # max_retries=6
# GOOD — production default
llm = ChatOpenAI(
model="gpt-4o",
max_retries=2, # initial + 2 retries = 3 total billed requests max
timeout=30,
rate_limiter=redis_limiter,
)
Trade resilience off to the fallback layer — with_fallbacks is strictly
cheaper than retry amplification when the primary is genuinely unhealthy.
Instrument retry count via callback and alert if retry rate exceeds ~5%.
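One way to make retry counts observable: the provider SDK's internal max_retries loop runs below LangChain and fires no callbacks, so a retry layer added with .with_retry() (whose attempts do fire on_retry) is easier to instrument. A sketch, with an assumed monitor class and threshold — not the skill's prescribed setup:

from langchain_core.callbacks import BaseCallbackHandler


class RetryRateMonitor(BaseCallbackHandler):
    """Sketch: track retries per completed call; alert past ~5%."""

    def __init__(self):
        self.calls = 0
        self.retries = 0

    def on_llm_end(self, response, **kwargs):
        self.calls += 1

    def on_retry(self, retry_state, **kwargs):
        self.retries += 1

    @property
    def retry_rate(self) -> float:
        return self.retries / max(self.calls, 1)


monitor = RetryRateMonitor()
# Retries must run at the LangChain layer to be visible to callbacks; if you
# adopt this, drop the client-level max_retries to 0 so the layers don't stack.
observed_llm = llm.with_retry(stop_after_attempt=3).with_config(
    {"callbacks": [monitor]}
)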
See Backoff and Retry for the full math,
Retry-After header handling, and circuit-breaker pattern for sustained
overload.
Step 7 — Understand the provider limit taxonomy
Different providers expose different limit types. Know which one binds your
workload before you size:
| Limit | Meaning | Who enforces | Binds for |
|---|---|---|---|
| RPM | Requests/minute (counts every call) | All three providers | Short chat replies |
| ITPM | Input tokens/minute | Anthropic, OpenAI (as TPM combined) | Long document Q&A |
| OTPM | Output tokens/minute | Anthropic separately; OpenAI as combined TPM | Long completions |
| Concurrent | In-flight request cap | Mainly OpenAI higher tiers | Burst traffic |
| Cached reads | Cache-read input tokens (Anthropic) | Anthropic separate budget line | Cache-heavy workloads (but still counts toward RPM — P31) |
Critical for Anthropic cache workloads (P31): RPM counts uniformly across
cached reads, cache writes, and uncached calls. A workload at 90% cache hit
rate still trips the 50 RPM ceiling at 51 requests/min. Separate monitors for
cache_read_input_tokens vs input_tokens (minus cache read/write) give
early warning.
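A sketch of that split, rolling up the JSONL emitted by the Step 1 DemandLogger sketch (column names are assumptions carried over from that sketch):

import pandas as pd

# Per-minute cached vs uncached series, plus raw RPM (every call counts -- P31).
df = pd.read_json("demand_log.jsonl", lines=True)
df["ts"] = pd.to_datetime(df["ts"], unit="s")
df["uncached_input_tokens"] = df["input_tokens"] - df["cache_read_tokens"]

buckets = df.resample("1min", on="ts")
per_min = buckets[["cache_read_tokens", "uncached_input_tokens"]].sum()
per_min["rpm"] = buckets.size()

# Early warning: minutes within 70% of a 50 RPM ceiling despite cache headroom.
print(per_min[per_min["rpm"] >= 0.7 * 50])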
Step 8 — Decision tree: which limiter to use
┌─ Single process (dev, notebooks, sync CLI, --workers 1)?
│ └─ InMemoryRateLimiter
│
├─ Multi-process but single host (same-machine pool, local gunicorn)?
│ └─ Redis-backed limiter (even localhost Redis beats InMemoryRateLimiter —
│ which still has per-process counters)
│
├─ Multi-host cluster (Cloud Run --min-instances>1, K8s, ECS)?
│ └─ Redis-backed limiter (mandatory)
│
├─ Multi-region or cross-cloud?
│ └─ Regional Redis per zone + provider-side account quota
│ (cross-region Redis latency adds 30-200ms per acquire)
│
└─ Any of the above + multi-tenant SaaS?
└─ Two-level Redis limiter: per-tenant + global, acquire both
Always pair with asyncio.Semaphore(N) per-worker for in-flight concurrency.
Step 9 — Provider tier snapshot (verify before shipping)
2026-04-21 snapshot — re-verify against the official console before shipping.
| Provider | Free tier RPM | Tier-1 RPM | High tier RPM | Source |
|---|---|---|---|---|
| Anthropic | 5 | 50 (Build 1) | 4000 (Build 4) | https://docs.anthropic.com/en/api/rate-limits |
| OpenAI | 3 | 500 | 10000 (Tier 5) | https://platform.openai.com/docs/guides/rate-limits |
| Google Gemini | 15 | 2000 (Paid 1) | 30000 (Paid 3) | https://ai.google.dev/gemini-api/docs/rate-limits |
Tiers change quarterly. A limiter sized six months ago on a different tier is
a liability. See Provider Tier Matrix for
the full matrix including ITPM / OTPM / cached-read separation, binding-limit
math, and the pre-ship verification checklist.
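A sketch of the binding-limit math (the p95s and the ITPM/OTPM ceilings below are illustrative placeholders; pull the real values from your console and the Step 1 measurements):

# Utilization per limit type; the highest ratio is the binding constraint.
measured_p95 = {"rpm": 30, "itpm": 28_000, "otpm": 7_000}   # from Step 1 roll-up
tier_ceiling = {"rpm": 50, "itpm": 40_000, "otpm": 8_000}   # verify in console

utilization = {k: measured_p95[k] / tier_ceiling[k] for k in tier_ceiling}
binding = max(utilization, key=utilization.get)             # "otpm" at 0.875 here
limiter_target = 0.70 * tier_ceiling[binding]               # size at 70% of it
print(binding, limiter_target)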
Output
- Instrumented DemandLogger callback attached to your chains for 24-48h before sizing
- InMemoryRateLimiter in dev / notebooks / single-worker only
- RedisRateLimiter (sliding-window Lua or CL.THROTTLE GCRA) for any multi-worker deployment, keyed per-tenant or global
- asyncio.Semaphore(N) per-worker in-flight cap paired with the cluster-wide limiter
- max_retries=2 on every ChatAnthropic / ChatOpenAI / ChatGoogleGenerativeAI
- .with_fallbacks(exceptions_to_handle=(RateLimitError, APITimeoutError, APIConnectionError, InternalServerError)) — never (Exception,)
- Per-provider tier re-verified from the official console, sized at 70% of the binding constraint
Error Handling
| Error | Cause | Fix |
|---|---|---|
| anthropic.RateLimitError: 429 THROTTLED at cluster RPM = N × InMemoryRateLimiter ceiling | InMemoryRateLimiter is per-process; N workers each send at their limit (P29) | Switch to Redis-backed limiter (Step 3) |
| 429 on cache writes while ITPM dashboard shows headroom | Anthropic RPM counts cache writes uniformly (P31) | Budget at RPM level with limiter; separate cached vs uncached metrics |
| One .invoke() bills as 7 requests on flaky networks | Default max_retries=6 (P30) | max_retries=2 + fallback layer for resilience |
| Ctrl+C during retry storm silently falls through to backup chain | exceptions_to_handle=(Exception,) catches KeyboardInterrupt on Python <3.12 (P07) | Narrow tuple to (RateLimitError, APITimeoutError, APIConnectionError, InternalServerError) |
| Limiter queue p95 wait > 500ms | Limiter is oversubscribed for real traffic | Re-measure demand (Step 1); upgrade provider tier OR shed load |
| redis.exceptions.ConnectionError blocks all LLM calls | Redis unavailable and limiter is fail-closed | Instrument Redis health; decide fail-open (log loudly) vs fail-closed (shed load) — for provider safety, prefer fail-closed |
| retry-after header climbing 2→4→8→16 | Pushing past tier; backoff amplifying, not absorbing | Lower limiter target RPS by 20%; upgrade tier if sustained |
| google.api_core.exceptions.ResourceExhausted on Gemini | Gemini free tier 15 RPM is brutal | Upgrade to paid Gemini tier 1 (2000 RPM) or use Redis limiter at 10 RPM |
Examples
Multi-worker Cloud Run deployment with Anthropic tier-1 50 RPM
Ten workers, single region, Redis in same VPC. Target: 35 RPM cluster-wide
(70% of 50 RPM ceiling), 20 in-flight per worker.
import asyncio, os, redis
from langchain_anthropic import ChatAnthropic
from anthropic import (
RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)
from your_app.redis_limiter import RedisRateLimiter # see references
_client = redis.Redis.from_url(os.environ["REDIS_URL"])
anthropic_limiter = RedisRateLimiter(
_client, key="anthropic:prod",
requests_per_second=35 / 60, # 35 RPM cluster-wide
)
llm = ChatAnthropic(
model="claude-sonnet-4-6",
rate_limiter=anthropic_limiter, # cluster gate
max_retries=2, # not 6 (P30)
timeout=30,
)
chain = (prompt | llm | parser).with_fallbacks(
[prompt | gpt4o_backup | parser],
exceptions_to_handle=( # narrow tuple (P07)
RateLimitError, APITimeoutError,
APIConnectionError, InternalServerError,
),
)
worker_sem = asyncio.Semaphore(20) # per-worker in-flight cap
async def invoke_bounded(inp):
async with worker_sem:
return await chain.ainvoke(inp)
Cluster behavior: every worker's limiter call hits the same Redis key. At 35
RPM cluster-wide, individual workers see fair-share throughput. max_retries=2
plus the narrow fallback tuple means transient 429s surface quickly and hand
off to GPT-4o instead of amplifying cost.
Multi-tenant SaaS with per-tenant isolation
Two-level Redis limiter. Per-tenant limit prevents noisy neighbors; global limit
protects the provider tier.
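The acquire ordering, as a sketch (try_acquire and release are hypothetical non-blocking helpers on the limiter; the reference below has the real implementation):

def acquire_two_level(tenant_limiter, global_limiter) -> bool:
    """Sketch: tenant quota first, then the cluster-wide budget."""
    if not tenant_limiter.try_acquire():      # hypothetical non-blocking acquire
        return False                          # noisy neighbor stops at its own quota
    if not global_limiter.try_acquire():
        tenant_limiter.release()              # hypothetical: return the tenant slot
        return False
    return True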
See Redis Limiter Pattern for the
two-level acquire implementation (acquire tenant key first, then global key;
release tenant if global fails) and the per-tenant cleanup cron.
Single-process dev — InMemoryRateLimiter is fine
For local debugging, notebook work, or a sync CLI tool:
from langchain_core.rate_limiters import InMemoryRateLimiter
limiter = InMemoryRateLimiter(requests_per_second=0.5, max_bucket_size=3)
llm = ChatAnthropic(model="claude-sonnet-4-6", rate_limiter=limiter, max_retries=2)
Do not carry this into production without re-reading Step 2.
Resources
- LangChain how-to: Chat model rate limiting
- InMemoryRateLimiter API reference
- Anthropic rate limits
- OpenAI rate limits
- Google Gemini rate limits
- Redis CL.THROTTLE (redis-cell module)
- Pack pain catalog: docs/pain-catalog.md (entries P07, P08, P29, P30, P31)
- Sibling skills: langchain-sdk-patterns (batch concurrency, fallback exception whitelist), langchain-performance-tuning (.batch(max_concurrency=...) tuning for throughput)