langchain-rate-limits

Rate-limit LangChain 1.0 calls correctly across multi-worker deployments — Redis-backed limiters, asyncio.Semaphore, narrow exception whitelists, and provider-specific throttle handling. Use when hitting 429s in production, scaling workers horizontally, or tuning throughput against Anthropic, OpenAI, or Gemini tier limits. Trigger with "langchain rate limit", "langchain 429", "langchain semaphore", "langchain token bucket", "anthropic rpm", "openai rpm throttling", "InMemoryRateLimiter", "redis rate limiter".

claude-codecodex
5 Tools
langchain-py-pack Plugin
saas packs Category

Allowed Tools

ReadWriteEditBash(python:*)Bash(redis-cli:*)

Provided by Plugin

langchain-py-pack

Claude Code skill pack for LangChain 1.0 + LangGraph 1.0 (Python) - 34 skills covering chains, agents, RAG, middleware, checkpointing, HITL, streaming, and production patterns

saas packs v2.0.0
View Plugin

Installation

This skill is included in the langchain-py-pack plugin:

/plugin install langchain-py-pack@claude-code-plugins-plus

Click to copy

Instructions

LangChain Rate Limits (Python)

Overview

A team deploys 10 Cloud Run workers. Each worker initializes its ChatAnthropic

with InMemoryRateLimiter(requestspersecond=10) — they read the docs, they

picked a safe-looking number, they shipped. Thirty seconds later the dashboard

lights up with 429s: the cluster is pushing 100 RPS to Anthropic's 50 RPM

tier-1 ceiling, not the 10 RPS they configured. The name is the fix —

InMemoryRateLimiter is in-process. Each worker has its own counter. Ten

workers × 10 RPS = 100 RPS to the provider. This is pain-catalog entry P29

and it lands on every team that scales past one pod.

Three more traps wait on the same code path:

  • P07.withfallbacks([backup]) defaults exceptionsto_handle=(Exception,),

which on Python <3.12 swallows KeyboardInterrupt. Ctrl+C during a 429

retry storm silently falls through to the backup chain and keeps billing.

  • P30ChatOpenAI and ChatAnthropic default max_retries=6. That is

retries, not attempts: 7 total requests per logical call on flaky

networks. One .invoke() can bill 7x.

  • P31 — Anthropic's RPM counts cache reads, cache writes, and uncached

calls uniformly. Cache-heavy workloads at 50 RPM can 429 on cache writes

while the ITPM dashboard shows headroom.

This skill covers measuring demand before picking a limit; the

InMemoryRateLimiter vs Redis-backed limiter vs asyncio.Semaphore decision

tree; the narrow exceptionstohandle whitelist; max_retries=2 math; and

the provider-specific limit taxonomy (RPM, ITPM, OTPM, concurrent,

cached-vs-uncached). Pin: langchain-core 1.0.x, langchain-anthropic 1.0.x,

langchain-openai 1.0.x. Pain-catalog anchors: P07, P08, P29, P30, P31.

For .batch(max_concurrency=...) tuning, see the sibling skill

langchain-performance-tuning — this skill is about provider-facing rate caps.

Prerequisites

  • Python 3.10+ (3.12+ fixes the KeyboardInterrupt half of P07)
  • langchain-core >= 1.0, < 2.0
  • At least one provider: pip install langchain-anthropic langchain-openai
  • For multi-worker prod: redis >= 4.5 client and a Redis server reachable from every worker
  • Completed langchain-model-inference — the chat-model factory from that skill is where rate_limiter= gets attached

Instructions

Step 1 — Measure actual demand before picking a number

Do not guess at requestspersecond. Instrument first, size second.

Attach a BaseCallbackHandler that logs per-call input_tokens,

outputtokens, and cachereadinputtokens from response.generations[].message.usage_metadata:


chain.with_config({"callbacks": [DemandLogger()]})

Collect 24-48 hours of representative traffic. Roll up: p50 and p95 RPM, p95

ITPM, p95 OTPM, cache hit rate. Size the limiter at **70% of the binding

constraint's tier ceiling** on your p95.

See Measuring Demand for the full

DemandLogger implementation, pandas roll-up, OTEL integration, load-test

harness, and multi-tenant sizing strategies.

Step 2 — InMemoryRateLimiter for single-process dev only; never multi-worker prod

LangChain 1.0 ships InMemoryRateLimiter as a first-class BaseChatModel parameter:


from langchain_anthropic import ChatAnthropic
from langchain_core.rate_limiters import InMemoryRateLimiter

limiter = InMemoryRateLimiter(
    requests_per_second=0.58,    # 35 RPM = 70% of Anthropic tier-1 50 RPM
    check_every_n_seconds=0.1,
    max_bucket_size=5,           # burst capacity
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=limiter,
    max_retries=2,
    timeout=30,
)

InMemoryRateLimiter is per-process. Safe for:

  • Single-process local dev (python script.py)
  • Single-worker uvicorn (uvicorn --workers 1)
  • Jupyter notebooks, batch scripts

Unsafe for (this is P29):

  • Multi-worker uvicorn / gunicorn (--workers 4)
  • Any container orchestrator with replica count > 1 (Cloud Run min-instances > 1, K8s, ECS)
  • Distributed job runners (Celery, Temporal, Cloud Tasks fanout)

Step 3 — Redis-backed limiter for cluster-wide enforcement

For multi-worker deployments, cluster-wide rate limiting requires shared state.

Redis is the default answer — atomic Lua script for sliding-window, or Redis

6.2+ CL.THROTTLE for GCRA.


import redis
from langchain_anthropic import ChatAnthropic
# RedisRateLimiter class defined in references/redis-limiter-pattern.md
from your_app.limiters import RedisRateLimiter

client = redis.Redis.from_url("redis://redis.internal:6379/0")

limiter = RedisRateLimiter(
    client,
    key="anthropic:prod",
    requests_per_second=35 / 60,  # 35 RPM cluster-wide, not per-worker
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=limiter,
    max_retries=2,
    timeout=30,
)

Key scoping decisions:

  • key="anthropic:prod" — all tenants share one global budget (simplest)
  • key=f"anthropic:tenant:{tenant_id}" — per-tenant quota (requires cleanup for dead tenants)
  • Two-level: per-tenant + global, acquire both (best for multi-tenant SaaS)

See Redis Limiter Pattern for the full

RedisRateLimiter implementation (atomic Lua sliding window), the GCRA

alternative via CL.THROTTLE, failure modes (Redis down, clock skew), and

per-tenant cleanup strategy.

Step 4 — asyncio.Semaphore for per-worker in-flight concurrency cap

The rate limiter throttles request rate. A semaphore throttles **in-flight

count**. Use both:


import asyncio

# Cluster: 35 RPM (Redis enforces)
# Worker: 20 in-flight at once (semaphore enforces)
worker_sem = asyncio.Semaphore(20)

async def bounded_invoke(inp):
    async with worker_sem:
        return await llm.ainvoke(inp)

# Fanout
results = await asyncio.gather(*[bounded_invoke(x) for x in inputs])

Why both: a semaphore prevents a single worker from queueing hundreds of

pending limiter acquires against Redis (head-of-line blocking on the event

loop). The limiter prevents the cluster from exceeding the provider tier. They

solve different problems.

Semaphore sizing: target latency-bandwidth-product. If p95 request latency

is 2s and the worker's RPS cap is 10, in-flight count ≈ 2 × 10 = 20. Overshoot

is wasted memory; undershoot leaves throughput on the table.

Step 5 — Narrow withfallbacks(exceptionsto_handle=...) — never (Exception,)

.with_fallbacks([backup]) defaults to catching Exception. This is P07 — on

Python <3.12, Exception edge-cases include KeyboardInterrupt propagation.

Ctrl+C during a retry storm silently hands off to the backup and keeps running.

Always narrow the tuple:


from anthropic import (
    RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)

resilient = (prompt | claude | parser).with_fallbacks(
    [prompt | gpt4o | parser],
    exceptions_to_handle=(
        RateLimitError, APITimeoutError,
        APIConnectionError, InternalServerError,
    ),
    # NEVER: Exception, BaseException, AuthenticationError,
    # BadRequestError, ValidationError
)

The whitelist is only transient provider errors. AuthenticationError,

BadRequestError, and ValidationError are bugs in your code/credentials —

fallback produces the same crash. See the sibling skill's reference

langchain-sdk-patterns/references/fallback-exception-list.md for the full

per-provider whitelist (Anthropic, OpenAI, Gemini).

Step 6 — maxretries=2, never the default maxretries=6

maxretries is retries, not attempts. Default maxretries=6 on

ChatOpenAI / ChatAnthropic means initial + 6 retries = 7 billed requests

per logical call (P30). On a flaky network, one .invoke() costs 7x what you

budgeted.


# BAD — default
llm = ChatOpenAI(model="gpt-4o")  # max_retries=6

# GOOD — production default
llm = ChatOpenAI(
    model="gpt-4o",
    max_retries=2,      # initial + 2 retries = 3 total billed requests max
    timeout=30,
    rate_limiter=redis_limiter,
)

Trade resilience off to the fallback layer — with_fallbacks is strictly

cheaper than retry amplification when the primary is genuinely unhealthy.

Instrument retry count via callback and alert if retry rate exceeds ~5%.

See Backoff and Retry for the full math,

Retry-After header handling, and circuit-breaker pattern for sustained

overload.

Step 7 — Understand the provider limit taxonomy

Different providers expose different limit types. Know which one binds your

workload before you size:

Limit Meaning Who enforces Binds for
RPM Requests/minute (counts every call) All three providers Short chat replies
ITPM Input tokens/minute Anthropic, OpenAI (as TPM combined) Long document Q&A
OTPM Output tokens/minute Anthropic separately; OpenAI as combined TPM Long completions
Concurrent In-flight request cap Mainly OpenAI higher tiers Burst traffic
Cached reads Cache-read input tokens (Anthropic) Anthropic separate budget line Cache-heavy workloads (but still counts toward RPM — P31)

Critical for Anthropic cache workloads (P31): RPM counts uniformly across

cached reads, cache writes, and uncached calls. A workload at 90% cache hit

rate still trips the 50 RPM ceiling at 51 requests/min. Separate monitors for

cachereadinputtokens vs inputtokens (minus cache read/write) give

early warning.

Step 8 — Decision tree: which limiter to use


┌─ Single process (dev, notebooks, sync CLI, --workers 1)?
│  └─ InMemoryRateLimiter
│
├─ Multi-process but single host (same-machine pool, local gunicorn)?
│  └─ Redis-backed limiter (even localhost Redis beats InMemoryRateLimiter —
│     which still has per-process counters)
│
├─ Multi-host cluster (Cloud Run --min-instances>1, K8s, ECS)?
│  └─ Redis-backed limiter (mandatory)
│
├─ Multi-region or cross-cloud?
│  └─ Regional Redis per zone + provider-side account quota
│     (cross-region Redis latency adds 30-200ms per acquire)
│
└─ Any of the above + multi-tenant SaaS?
   └─ Two-level Redis limiter: per-tenant + global, acquire both

Always pair with asyncio.Semaphore(N) per-worker for in-flight concurrency.

Step 9 — Provider tier snapshot (verify before shipping)

2026-04-21 snapshot — re-verify against the official console before shipping.

Provider Free tier RPM Tier-1 RPM High tier RPM Source
Anthropic 5 50 (Build 1) 4000 (Build 4) https://docs.anthropic.com/en/api/rate-limits
OpenAI 3 500 10000 (Tier 5) https://platform.openai.com/docs/guides/rate-limits
Google Gemini 15 2000 (Paid 1) 30000 (Paid 3) https://ai.google.dev/gemini-api/docs/rate-limits

Tiers change quarterly. A limiter sized six months ago on a different tier is

a liability. See Provider Tier Matrix for

the full matrix including ITPM / OTPM / cached-read separation, binding-limit

math, and the pre-ship verification checklist.

Output

  • Instrumented DemandLogger callback attached to your chains for 24-48h before sizing
  • InMemoryRateLimiter in dev / notebooks / single-worker only
  • RedisRateLimiter (sliding-window Lua or CL.THROTTLE GCRA) for any multi-worker deployment, keyed per-tenant or global
  • asyncio.Semaphore(N) per-worker in-flight cap paired with the cluster-wide limiter
  • max_retries=2 on every ChatAnthropic / ChatOpenAI / ChatGoogleGenerativeAI
  • .withfallbacks(exceptionsto_handle=(RateLimitError, APITimeoutError, APIConnectionError, InternalServerError)) — never (Exception,)
  • Per-provider tier re-verified from the official console, sized at 70% of the binding constraint

Error Handling

Error Cause Fix
anthropic.RateLimitError: 429 THROTTLED at cluster RPM = N × InMemoryRateLimiter ceiling InMemoryRateLimiter is per-process; N workers each send at their limit (P29) Switch to Redis-backed limiter (Step 3)
429 on cache writes while ITPM dashboard shows headroom Anthropic RPM counts cache writes uniformly (P31) Budget at RPM level with limiter; separate cached vs uncached metrics
One .invoke() bills as 7 requests on flaky networks Default max_retries=6 (P30) max_retries=2 + fallback layer for resilience
Ctrl+C during retry storm silently falls through to backup chain exceptionstohandle=(Exception,) catches KeyboardInterrupt on Python <3.12 (P07) Narrow tuple to (RateLimitError, APITimeoutError, APIConnectionError, InternalServerError)
Limiter queue p95 wait > 500ms Limiter is oversubscribed for real traffic Re-measure demand (Step 1); upgrade provider tier OR shed load
redis.exceptions.ConnectionError blocks all LLM calls Redis unavailable and limiter is fail-closed Instrument Redis health; decide fail-open (log loudly) vs fail-closed (shed load) — for provider safety, prefer fail-closed
retry-after header climbing 2→4→8→16 Pushing past tier; backoff amplifying, not absorbing Lower limiter target RPS by 20%; upgrade tier if sustained
google.api_core.exceptions.ResourceExhausted on Gemini Gemini free tier 15 RPM is brutal Upgrade to paid Gemini tier 1 (2000 RPM) or use Redis limiter at 10 RPM

Examples

Multi-worker Cloud Run deployment with Anthropic tier-1 50 RPM

Ten workers, single region, Redis in same VPC. Target: 35 RPM cluster-wide

(70% of 50 RPM ceiling), 20 in-flight per worker.


import asyncio, os, redis
from langchain_anthropic import ChatAnthropic
from anthropic import (
    RateLimitError, APITimeoutError, APIConnectionError, InternalServerError,
)
from your_app.redis_limiter import RedisRateLimiter  # see references

_client = redis.Redis.from_url(os.environ["REDIS_URL"])
anthropic_limiter = RedisRateLimiter(
    _client, key="anthropic:prod",
    requests_per_second=35 / 60,    # 35 RPM cluster-wide
)

llm = ChatAnthropic(
    model="claude-sonnet-4-6",
    rate_limiter=anthropic_limiter, # cluster gate
    max_retries=2,                  # not 6 (P30)
    timeout=30,
)

chain = (prompt | llm | parser).with_fallbacks(
    [prompt | gpt4o_backup | parser],
    exceptions_to_handle=(          # narrow tuple (P07)
        RateLimitError, APITimeoutError,
        APIConnectionError, InternalServerError,
    ),
)

worker_sem = asyncio.Semaphore(20)  # per-worker in-flight cap
async def invoke_bounded(inp):
    async with worker_sem:
        return await chain.ainvoke(inp)

Cluster behavior: every worker's limiter call hits the same Redis key. At 35

RPM cluster-wide, individual workers see fair-share throughput. max_retries=2

  • narrow fallback tuple means transient 429s surface quickly and hand off to

GPT-4o instead of amplifying cost.

Multi-tenant SaaS with per-tenant isolation

Two-level Redis limiter. Per-tenant limit prevents noisy neighbors; global limit

protects the provider tier.

See Redis Limiter Pattern for the

two-level acquire implementation (acquire tenant key first, then global key;

release tenant if global fails) and the per-tenant cleanup cron.

Single-process dev — InMemoryRateLimiter is fine

For local debugging, notebook work, or a sync CLI tool:


from langchain_core.rate_limiters import InMemoryRateLimiter

limiter = InMemoryRateLimiter(requests_per_second=0.5, max_bucket_size=3)
llm = ChatAnthropic(model="claude-sonnet-4-6", rate_limiter=limiter, max_retries=2)

Do not carry this into production without re-reading Step 2.

Resources

Ready to use langchain-py-pack?