openrouter-context-optimization

Optimize context window usage for OpenRouter models to reduce cost and improve quality. Use when hitting context limits, managing long conversations, or building RAG systems. Triggers: 'openrouter context', 'context window', 'openrouter token limit', 'reduce tokens openrouter'.

claude-code · codex · openclaw
5 Tools
Plugin: openrouter-pack
Category: saas packs

Allowed Tools

Read, Write, Edit, Bash, Grep

Provided by Plugin

openrouter-pack

Flagship+ skill pack for OpenRouter - 30 skills for multi-model routing, fallbacks, and LLM gateway mastery

saas packs v1.0.0

Installation

This skill is included in the openrouter-pack plugin:

/plugin install openrouter-pack@claude-code-plugins-plus


Instructions

OpenRouter Context Optimization

Overview

OpenRouter models have varying context windows (4K to 1M+ tokens). Since pricing is per-token, stuffing unnecessary context wastes money and can degrade output quality. This skill covers context window lookup, token estimation, conversation trimming, chunking strategies, and Anthropic prompt caching for large contexts.

Query Context Limits


# Check context window for specific models
curl -s https://openrouter.ai/api/v1/models | jq '[.data[] | select(
  .id == "anthropic/claude-3.5-sonnet" or
  .id == "openai/gpt-4o" or
  .id == "google/gemini-2.0-flash-001" or
  .id == "meta-llama/llama-3.1-70b-instruct"
) | {id, context_length, prompt_per_M: ((.pricing.prompt|tonumber)*1000000)}]'

Context-Aware Model Selection


import os, requests
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    default_headers={"HTTP-Referer": "https://my-app.com", "X-Title": "my-app"},
)

# Cache model metadata at startup
MODELS = {m["id"]: m for m in requests.get("https://openrouter.ai/api/v1/models", timeout=10).json()["data"]}

def estimate_tokens(text: str) -> int:
    """Rough estimate: 1 token ~ 4 characters for English text."""
    return len(text) // 4

def select_model_for_context(messages: list, preferred: str = "anthropic/claude-3.5-sonnet") -> str:
    """Pick a model that fits the context, falling back to larger windows."""
    estimated_tokens = sum(estimate_tokens(m.get("content", "")) for m in messages)

    FALLBACK_CHAIN = [
        ("openai/gpt-4o-mini", 128_000),
        ("anthropic/claude-3.5-sonnet", 200_000),
        ("google/gemini-2.0-flash-001", 1_000_000),
    ]

    # Try preferred model first
    preferred_ctx = MODELS.get(preferred, {}).get("context_length", 0)
    if estimated_tokens < preferred_ctx * 0.8:  # 80% safety margin
        return preferred

    for model_id, ctx in FALLBACK_CHAIN:
        if estimated_tokens < ctx * 0.8:
            return model_id

    raise ValueError(f"Content too large ({estimated_tokens} est. tokens)")
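The 0.8 multiplier above bundles two buffers into one number: room for the completion and slack for the rough character-based estimate. A minimal sketch making both explicit (`max_completion` and `margin` are illustrative defaults, not OpenRouter parameters):

```python
def fits_with_margin(estimated_tokens: int, context_length: int,
                     max_completion: int = 4096, margin: float = 0.1) -> bool:
    """True if the prompt fits after reserving completion space plus a
    safety margin for estimation error (the 4-chars-per-token heuristic
    can be off by 10-20% on code or non-English text)."""
    budget = context_length - max_completion
    return estimated_tokens < budget * (1 - margin)
```

Splitting the buffers lets you tune them independently, e.g. a larger `max_completion` for long-form generation tasks.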

Conversation Trimming


def trim_conversation(
    messages: list[dict],
    max_tokens: int = 100_000,
    keep_system: bool = True,
    keep_last_n: int = 4,
) -> list[dict]:
    """Trim conversation history to fit context window.

    Strategy: Keep system prompt + last N messages.
    If still too large, reduce to last 2 messages.
    """
    system = [m for m in messages if m["role"] == "system"] if keep_system else []
    non_system = [m for m in messages if m["role"] != "system"]

    kept = non_system[-keep_last_n:]

    total_est = sum(estimate_tokens(m.get("content", "")) for m in system + kept)
    if total_est > max_tokens and keep_last_n > 2:
        kept = non_system[-2:]

    # Compute trimmed after any reduction so the summary note count is accurate
    trimmed = non_system[:-len(kept)] if len(non_system) > len(kept) else []

    result = system + kept
    if trimmed:
        summary_note = {
            "role": "system",
            "content": f"[Previous {len(trimmed)} messages trimmed for context limits]",
        }
        result = system + [summary_note] + kept

    return result
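An alternative to the fixed keep-last-N strategy is a token-budget loop that drops the oldest non-system messages until the estimate fits. A minimal sketch using the same 4-characters-per-token heuristic:

```python
def trim_to_budget(messages: list[dict], max_tokens: int = 100_000) -> list[dict]:
    """Drop oldest non-system messages until the estimated total fits the budget."""
    est = lambda m: len(m.get("content", "")) // 4
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    # Always keep at least the most recent message
    while len(rest) > 1 and sum(map(est, system + rest)) > max_tokens:
        rest.pop(0)  # oldest first
    return system + rest
```

This preserves recency up to an exact budget rather than a fixed count, at the cost of sometimes dropping more mid-conversation context than keep-last-N would.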

Chunking for Large Documents


def chunk_and_process(document: str, question: str, model: str = "openai/gpt-4o-mini",
                      chunk_size: int = 8000, overlap: int = 500) -> str:
    """Process a large document in overlapping chunks, then synthesize.

    chunk_size and overlap are in characters (~2K tokens at 4 chars/token).
    """
    chunks = []
    start = 0
    while start < len(document):
        chunks.append(document[start:start + chunk_size])
        start += chunk_size - overlap

    results = []
    for i, chunk in enumerate(chunks):
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": f"Analyzing chunk {i+1}/{len(chunks)}."},
                {"role": "user", "content": f"Document:\n{chunk}\n\nQuestion: {question}"},
            ],
            max_tokens=1024, temperature=0,
        )
        results.append(response.choices[0].message.content)

    # Synthesize
    synthesis = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Synthesize these partial analyses."},
            {"role": "user", "content": f"Question: {question}\n\nResults:\n" + "\n---\n".join(results)},
        ],
        max_tokens=2048, temperature=0,
    )
    return synthesis.choices[0].message.content
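Latency and cost scale linearly with the number of chunks, so it is worth predicting the count before committing. This helper mirrors the sliding-window loop above (same default size and overlap):

```python
import math

def num_chunks(doc_len: int, chunk_size: int = 8000, overlap: int = 500) -> int:
    """Chunk count from a window of chunk_size advancing by chunk_size - overlap,
    matching the while-loop in chunk_and_process."""
    step = chunk_size - overlap
    return max(1, math.ceil(doc_len / step))
```

For a 100K-character document at the defaults, that's 14 chunk calls plus one synthesis call.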

Prompt Caching for Repeated Context


# Anthropic models support prompt caching -- mark large static blocks
# Subsequent requests with same cached block cost 90% less for input tokens
response = client.chat.completions.create(
    model="anthropic/claude-3.5-sonnet",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": large_reference_document,  # 50K+ tokens
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Summarize section 3."},
    ],
    max_tokens=1024,
)
# First request: cache_creation_input_tokens at 1.25x rate
# Subsequent: cache_read_input_tokens at 0.1x rate (90% savings)
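Given the 1.25x write and 0.1x read multipliers, caching a static prefix pays for itself from the second request onward. A sketch of the arithmetic (`price_per_m` is an illustrative input-token price in dollars, not fetched from the API):

```python
def cached_vs_uncached(static_tokens: int, n_requests: int,
                       price_per_m: float = 3.0) -> tuple[float, float]:
    """Input cost for the static prefix over n_requests: cached vs. resent each time."""
    base = static_tokens / 1e6 * price_per_m          # uncached cost per request
    uncached = n_requests * base
    cached = 1.25 * base + 0.10 * base * (n_requests - 1)  # one write, then reads
    return cached, uncached
```

Solving 1.25 + 0.1(n - 1) < n gives n > 1.28, so caching wins whenever the prefix is reused at least once within the cache's lifetime.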

Error Handling

| Error | Cause | Fix |
|---|---|---|
| 400 context_length_exceeded | Input + max_tokens > model limit | Trim messages or use a larger-context model |
| 400 max_tokens too large | max_tokens alone exceeds limit | Reduce max_tokens |
| Slow responses | Very large context | Use streaming; consider chunking |
| Degraded quality | Too much irrelevant context | Trim to relevant content only |

Enterprise Considerations

  • Query /api/v1/models at startup to cache context limits -- don't hardcode (they change)
  • Use max_tokens on every request to prevent runaway completion costs on large contexts
  • Implement conversation trimming as middleware so all calls respect limits
  • Use Anthropic prompt caching for RAG contexts that repeat across requests (90% input savings)
  • Route large-context tasks to cost-effective models (Gemini Flash for 1M context at low cost)
  • Monitor prompt_tokens in responses to detect context bloat before it hits limits
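The last point can be a one-line middleware check. A sketch that flags requests approaching the window, fed from the `prompt_tokens` usage field in each response (the 0.8 threshold is an assumption to tune):

```python
def context_bloat_warning(prompt_tokens: int, context_length: int,
                          warn_at: float = 0.8) -> bool:
    """True when prompt usage meets or exceeds warn_at of the model's window."""
    if context_length <= 0:
        return False
    ratio = prompt_tokens / context_length
    if ratio >= warn_at:
        print(f"context {ratio:.0%} full ({prompt_tokens}/{context_length}) -- trim soon")
        return True
    return False
```

Wiring this into your response handler surfaces bloat before requests start failing with context_length_exceeded.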
