groq-reference-architecture

Implement Groq reference architecture with model routing, streaming

v1.11.0

Jeremy Longshore

MIT

Allowed Tools

Read, Grep

Provided by Plugin

groq-pack

Claude Code skill pack for Groq (24 skills)

saas packs v1.11.0

Installation

This skill is included in the groq-pack plugin:

/plugin install groq-pack@claude-code-plugins-plus

Groq Reference Architecture

Overview

Production architecture for applications built on Groq's LPU inference API. It covers four concerns that every serious Groq integration needs: routing requests to the right model by latency, capability, and cost; a middleware layer (cache, metrics, retry); a multi-provider fallback chain; and a streaming pipeline. The service layer built here is reusable across a chat UI, an API backend, a batch processor, or an agent.

The full layer diagram, with how the pieces interact, is in references/architecture.md; the complete, copy-ready TypeScript for every layer is in references/implementation.md.

Prerequisites

Groq API key: create one at console.groq.com and export it as GROQ_API_KEY. The Groq SDK reads it from the environment; the client is constructed as new Groq({ apiKey: process.env.GROQ_API_KEY }). Never hardcode the key.
Runtime: Node.js 18+ (for performance.now() and native fetch).
Packages: groq-sdk and lru-cache (npm install groq-sdk lru-cache).
Optional backup provider: an OpenAI-compatible key if you extend the fallback chain beyond Groq's own models.
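
The key handling described above can be made to fail fast at startup. The following is a minimal sketch, not the shipped client.ts; requireGroqApiKey is a hypothetical helper name, and the only external fact it relies on is the GROQ_API_KEY environment variable.

```typescript
// Hypothetical helper (not part of groq-sdk): validate the key once at
// startup so a missing variable fails loudly instead of on the first request.
export function requireGroqApiKey(
  env: Record<string, string | undefined>,
): string {
  const key = env.GROQ_API_KEY?.trim();
  if (!key) {
    throw new Error(
      "GROQ_API_KEY is not set; export it before starting the process",
    );
  }
  return key;
}

// Usage with the SDK (requires `npm install groq-sdk`):
//   import Groq from "groq-sdk";
//   const groq = new Groq({ apiKey: requireGroqApiKey(process.env) });
```

Passing the environment in as an argument keeps the helper easy to test without touching process.env.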

Instructions

Build the service layer in five ordered steps. Each step is one file under src/groq/. The router depends on the registry; the middleware and fallback depend on the client; the streaming pipeline stands alone. Full source for every step (verbatim) is in references/implementation.md.

Model Registry (models.ts): declare a ModelSpec for each model with its tier, context window, speed, cost, and capabilities. Skeleton:


   export const MODELS: Record<string, ModelSpec> = {
     "llama-3.1-8b-instant":     { tier: "speed",   /* fast, cheap */ },
     "llama-3.3-70b-versatile":  { tier: "quality", /* tools + JSON */ },
     "meta-llama/llama-4-scout-17b-16e-instruct": { tier: "vision" },
     "whisper-large-v3-turbo":   { tier: "audio" },
   };
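
The skeleton above leaves ModelSpec undefined. One possible shape is sketched below. Every field name here is an assumption for illustration, the rank values are placeholder orderings rather than real prices or benchmarks, and modelsWith is a hypothetical lookup helper. The 131072-token window corresponds to the 128K limit cited under Error Handling.

```typescript
// Assumed field names; the shipped ModelSpec may differ.
export type Tier = "speed" | "quality" | "vision" | "audio";
export type Capability = "chat" | "tools" | "json" | "vision" | "transcription";

export interface ModelSpec {
  tier: Tier;
  capabilities: Capability[];
  contextWindow?: number; // tokens; left out where this sketch has no value
  speedRank: number; // placeholder ordering only: 1 = fastest
  costRank: number; // placeholder ordering only: 1 = cheapest
}

export const MODELS: Record<string, ModelSpec> = {
  "llama-3.1-8b-instant": {
    tier: "speed",
    capabilities: ["chat"],
    contextWindow: 131072,
    speedRank: 1,
    costRank: 1,
  },
  "llama-3.3-70b-versatile": {
    tier: "quality",
    capabilities: ["chat", "tools", "json"],
    contextWindow: 131072,
    speedRank: 2,
    costRank: 3,
  },
  "meta-llama/llama-4-scout-17b-16e-instruct": {
    tier: "vision",
    capabilities: ["chat", "vision"],
    speedRank: 2,
    costRank: 2,
  },
  "whisper-large-v3-turbo": {
    tier: "audio",
    capabilities: ["transcription"],
    speedRank: 1,
    costRank: 1,
  },
};

// Hypothetical helper: list the ids of models that offer a capability.
export function modelsWith(capability: Capability): string[] {
  return Object.entries(MODELS)
    .filter(([, spec]) => spec.capabilities.includes(capability))
    .map(([id]) => id);
}
```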

Model Router (router.ts): selectModel(req) maps requirements (maxLatencyMs, needsVision, needsTools, costSensitive) to the cheapest model that satisfies them. Callers pass requirements, never hardcoded ids.
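
A minimal sketch of that selection rule follows: filter by hard requirements, then take the cheapest survivor. The registry shape, the latency figures, and the cost ranks are placeholders invented for the example, not measurements or real prices. The costSensitive flag is declared but not consulted here, because this sketch always picks the cheapest match; how the shipped router weighs it is not shown in this document.

```typescript
type Capability = "tools" | "vision";

interface ModelSpec {
  id: string;
  capabilities: Capability[];
  typicalLatencyMs: number; // placeholder, not a measured benchmark
  costRank: number; // placeholder ordering: 1 = cheapest
}

interface Requirements {
  maxLatencyMs?: number;
  needsVision?: boolean;
  needsTools?: boolean;
  costSensitive?: boolean; // unused in this sketch (see note above)
}

// Illustrative registry; capabilities mirror the skeleton's comments.
const REGISTRY: ModelSpec[] = [
  { id: "llama-3.1-8b-instant", capabilities: [], typicalLatencyMs: 50, costRank: 1 },
  { id: "meta-llama/llama-4-scout-17b-16e-instruct", capabilities: ["vision"], typicalLatencyMs: 200, costRank: 2 },
  { id: "llama-3.3-70b-versatile", capabilities: ["tools"], typicalLatencyMs: 200, costRank: 3 },
];

export function selectModel(
  req: Requirements,
  registry: ModelSpec[] = REGISTRY,
): ModelSpec {
  // Keep only models that meet every hard requirement.
  const candidates = registry.filter(
    (m) =>
      (!req.needsVision || m.capabilities.includes("vision")) &&
      (!req.needsTools || m.capabilities.includes("tools")) &&
      (req.maxLatencyMs === undefined || m.typicalLatencyMs <= req.maxLatencyMs),
  );
  if (candidates.length === 0) {
    throw new Error("No model satisfies the given requirements");
  }
  // Cheapest candidate wins.
  return candidates.reduce((best, m) => (m.costRank < best.costRank ? m : best));
}
```

Throwing when nothing matches keeps an impossible requirement set from silently falling through to an arbitrary model.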

Middleware (middleware.ts): completionWithMiddleware() wraps each call with an LRU cache (deterministic requests only, temperature === 0), latency and token metrics, and a pluggable metrics sink.
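
The wrapper can be sketched as below. Several things here are assumptions made so the example runs on its own: a small Map-based LRU stands in for the lru-cache package, ChatClient is a structural stand-in for the subset of the Groq client used, the options object and its temperature default of 0 are invented, and cache hits report zero tokens.

```typescript
interface Message { role: "system" | "user" | "assistant"; content: string }
interface Completion {
  choices: { message: { content: string | null } }[];
  usage?: { total_tokens: number };
}
// Structural stand-in for the part of the client this sketch calls.
interface ChatClient {
  chat: { completions: { create(params: {
    model: string; messages: Message[]; temperature?: number;
  }): Promise<Completion> } };
}
interface CallMetrics { model: string; latencyMs: number; tokens: number; cached: boolean }
type MetricsSink = (m: CallMetrics) => void;

// Map-based LRU: a Map iterates in insertion order, so the first key is
// always the least recently used one.
const MAX_ENTRIES = 500;
const cache = new Map<string, Completion>();

function cacheGet(key: string): Completion | undefined {
  const hit = cache.get(key);
  if (hit !== undefined) {
    cache.delete(key); // re-insert to mark as most recently used
    cache.set(key, hit);
  }
  return hit;
}

function cacheSet(key: string, value: Completion): void {
  cache.set(key, value);
  if (cache.size > MAX_ENTRIES) {
    const oldest = cache.keys().next().value;
    if (oldest !== undefined) cache.delete(oldest);
  }
}

export async function completionWithMiddleware(
  client: ChatClient,
  model: string,
  messages: Message[],
  opts: { temperature?: number; sink?: MetricsSink } = {},
): Promise<Completion> {
  const temperature = opts.temperature ?? 0; // assumed default for the sketch
  const sink: MetricsSink = opts.sink ?? console.log;
  const cacheable = temperature === 0; // only deterministic requests
  const key = JSON.stringify({ model, messages });
  const start = Date.now();

  if (cacheable) {
    const hit = cacheGet(key);
    if (hit) {
      sink({ model, latencyMs: Date.now() - start, tokens: 0, cached: true });
      return hit;
    }
  }
  const res = await client.chat.completions.create({ model, messages, temperature });
  if (cacheable) cacheSet(key, res);
  sink({
    model,
    latencyMs: Date.now() - start,
    tokens: res.usage?.total_tokens ?? 0,
    cached: false,
  });
  return res;
}
```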

Fallback Chain (fallback.ts): completionWithFallback() tries the primary model, drops to a model in a different rate-limit pool on 429/5xx, then returns a graceful-degradation payload instead of throwing.
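
A sketch of that chain follows, under stated assumptions: ChatClient is a structural stand-in for the Groq client, errors are assumed to carry a numeric status property, the default chain order is illustrative, and the degraded flag and fallback message are hypothetical additions rather than part of any Groq response.

```typescript
interface Message { role: "system" | "user" | "assistant"; content: string }
interface Completion {
  model: string;
  choices: { message: { role: "assistant"; content: string | null } }[];
  degraded?: boolean; // hypothetical flag marking the canned payload
}
interface ChatClient {
  chat: { completions: { create(params: {
    model: string; messages: Message[];
  }): Promise<Completion> } };
}

// Rate limits (429) and server errors (5xx) are worth retrying elsewhere;
// anything else (bad request, bad key) is a caller bug and is rethrown.
function isRetryable(err: unknown): boolean {
  if (typeof err !== "object" || err === null || !("status" in err)) return false;
  const status = (err as { status?: unknown }).status;
  return typeof status === "number" && (status === 429 || status >= 500);
}

// Illustrative order: primary first, then a model assumed to sit in a
// different rate-limit pool.
const DEFAULT_CHAIN = ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"];

export async function completionWithFallback(
  client: ChatClient,
  messages: Message[],
  chain: string[] = DEFAULT_CHAIN,
): Promise<Completion> {
  for (const model of chain) {
    try {
      return await client.chat.completions.create({ model, messages });
    } catch (err) {
      if (!isRetryable(err)) throw err;
      // Retryable failure: move on to the next model in the chain.
    }
  }
  // Every model failed with a retryable error: degrade instead of throwing.
  return {
    model: "none",
    choices: [{
      message: {
        role: "assistant",
        content: "The assistant is temporarily unavailable. Please try again shortly.",
      },
    }],
    degraded: true,
  };
}
```

Rethrowing non-retryable errors keeps configuration mistakes visible instead of hiding them behind the degraded payload.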

Streaming Pipeline (streaming.ts): streamCompletion() is an async generator yielding { type: "token" | "done" | "error" } for real-time SSE UIs.
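
The generator can be sketched as follows. StreamingClient is a structural stand-in that assumes a call with stream: true resolves to an async iterable of chunks carrying choices[0].delta.content; the optional model parameter and its default are assumptions, and the event type is one possible encoding of the { type, content } shape.

```typescript
interface Message { role: "system" | "user" | "assistant"; content: string }
interface StreamChunk { choices: { delta: { content?: string | null } }[] }
// Structural stand-in for the streaming part of the client.
interface StreamingClient {
  chat: { completions: { create(params: {
    model: string; messages: Message[]; stream: true;
  }): Promise<AsyncIterable<StreamChunk>> } };
}

export type StreamEvent =
  | { type: "token"; content: string }
  | { type: "done" }
  | { type: "error"; content: string };

export async function* streamCompletion(
  client: StreamingClient,
  messages: Message[],
  model: string = "llama-3.1-8b-instant", // assumed default for the sketch
): AsyncGenerator<StreamEvent> {
  try {
    const stream = await client.chat.completions.create({
      model,
      messages,
      stream: true,
    });
    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content;
      if (token) yield { type: "token", content: token }; // skip empty deltas
    }
    yield { type: "done" };
  } catch (err) {
    // Surface failures as a terminal event so consumers need no try/catch.
    yield {
      type: "error",
      content: err instanceof Error ? err.message : String(err),
    };
  }
}
```

Ending with exactly one done or error event gives a UI a single place to stop its loading indicator.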

When applying this to an existing repo, Read the current src/ layout and Grep for direct groq.chat.completions.create calls to find code that should route through the middleware and fallback wrappers instead.

Integration Patterns

Pattern	When to Use	Groq Feature
Direct completion	Simple request/response	`chat.completions.create`
Streaming SSE	Real-time chat UI	`stream: true`
Tool calling	Agent with function execution	`tools` parameter
JSON extraction	Structured data from text	`response_format: { type: "json_object" }`
Batch processing	High-volume document processing	Queue + rate limiting
Audio transcription	Voice input	`audio.transcriptions.create`
Vision analysis	Image understanding	Llama 4 Scout/Maverick

Output

Applying this skill produces a src/groq/ service layer with six files (client.ts, models.ts, router.ts, middleware.ts, fallback.ts, streaming.ts) plus the service and API layers that consume it. At runtime you get:

Routed completions: selectModel() returns a ModelSpec; callers never hardcode a model id, so cost/latency policy lives in one place.

Cached deterministic responses: repeated temperature: 0 calls return from the LRU cache instead of re-billing the API.

Resilient calls: completionWithFallback() returns a valid completion shape even when Groq is rate-limited, never surfacing a raw 429 to the user.

Streamed tokens: streamCompletion() yields { type, content } events for SSE, with a terminal done or error event.

Metrics: every call emits { model, latencyMs, tokens, cached } to your metrics sink (Prometheus, Datadog, or console.log by default).
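
To put the streamed events on the wire, each one has to be framed as a Server-Sent Events message. A minimal sketch follows; toSseFrame is a hypothetical helper name, and the event shape matches the { type, content } form described above.

```typescript
interface StreamEvent {
  type: "token" | "done" | "error";
  content?: string;
}

// One SSE message: an `event:` line, a `data:` line, then a blank line.
// JSON.stringify escapes newlines, so the payload always fits on one
// `data:` line even when a token contains a line break.
export function toSseFrame(event: StreamEvent): string {
  return `event: ${event.type}\ndata: ${JSON.stringify(event)}\n\n`;
}

// Usage inside a Node HTTP handler (response object assumed):
//   res.setHeader("Content-Type", "text/event-stream");
//   for await (const event of streamCompletion(groq, messages)) {
//     res.write(toSseFrame(event));
//   }
//   res.end();
```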

Error Handling

Issue	Cause	Solution
429 on primary model	RPM/TPM exceeded	Fall back to different model
High latency	Wrong model tier	Route to `8b-instant` for latency-critical paths
Context overflow	Input > 128K tokens	Truncate or chunk input
Vision errors	Wrong model for images	Use the full Llama 4 Scout model path (`meta-llama/llama-4-scout-17b-16e-instruct`)
`GROQ_API_KEY` undefined	Env var not exported	Export the key before starting the process
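
For the context-overflow row, chunking can be sketched as below. The function name is hypothetical, and the four-characters-per-token ratio is a rough heuristic assumed for illustration, not a tokenizer; leave headroom below the model's limit when choosing maxTokens.

```typescript
// Split text into chunks that stay under an approximate token budget,
// preferring paragraph boundaries and hard-splitting oversized paragraphs.
export function chunkByApproxTokens(
  text: string,
  maxTokens: number,
  charsPerToken = 4, // heuristic assumption, not an exact token count
): string[] {
  const maxChars = Math.max(1, Math.floor(maxTokens * charsPerToken));
  const chunks: string[] = [];
  let current = "";

  const flush = (): void => {
    if (current) {
      chunks.push(current);
      current = "";
    }
  };

  for (const paragraph of text.split(/\n{2,}/)) {
    if (paragraph.length > maxChars) {
      // A single paragraph exceeds the budget: cut it at fixed offsets.
      flush();
      for (let i = 0; i < paragraph.length; i += maxChars) {
        chunks.push(paragraph.slice(i, i + maxChars));
      }
      continue;
    }
    const joined = current ? `${current}\n\n${paragraph}` : paragraph;
    if (joined.length > maxChars) {
      flush();
      current = paragraph;
    } else {
      current = joined;
    }
  }
  flush();
  return chunks;
}
```

Each chunk can then be sent as its own request, with the partial results merged by the caller.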

Examples

A latency-critical chat turn routes to the speed tier and returns one completion:


const model = selectModel({ maxLatencyMs: 80, costSensitive: true });
// → llama-3.1-8b-instant
const res = await completionWithMiddleware(groq, model.id, messages);

Streaming a UI consumes the async generator token-by-token:


for await (const event of streamCompletion(groq, messages)) {
  if (event.type === "token") process.stdout.write(event.content!);
}

Four fully worked examples (latency-critical, quality-with-fallback, streaming, and vision routing) are in references/examples.md.

Resources

Next Steps

For multi-environment deployment, see the groq-multi-env-setup skill, which extends this service layer with per-environment configuration and secrets handling.