Complete Groq integration skill pack with 24 skills covering LPU inference, ultra-fast AI, and Groq Cloud deployment. Flagship tier vendor pack.
Installation
Open Claude Code and run this command:
/plugin install groq-pack@claude-code-plugins-plus
Use --global to install for all projects, or --project for current project only.
Skills (24)
Configure Groq CI/CD integration with GitHub Actions, testing, and model validation.
Groq CI Integration
Overview
Set up CI/CD pipelines for Groq integrations with unit tests (mocked), integration tests (live API), and model deprecation checks. Groq's fast inference makes live integration tests practical in CI -- a completion round-trip takes < 500ms.
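The mocked-vs-live split described above can be enforced with a small gate in the test suite. A minimal sketch (the helper name `shouldRunIntegration` is illustrative, not part of any SDK; the `GROQ_INTEGRATION` flag matches the variable used in the CI workflow):

```typescript
// Gate live integration tests behind an explicit flag plus an API key check,
// mirroring the GROQ_INTEGRATION=1 variable set in the CI job.
function shouldRunIntegration(env: Record<string, string | undefined>): boolean {
  return env.GROQ_INTEGRATION === "1" && Boolean(env.GROQ_API_KEY);
}

// In a vitest file: describe.skipIf(!shouldRunIntegration(process.env))("live", ...)
console.log(shouldRunIntegration(process.env));
```

This keeps `npm test` safe to run anywhere while letting CI opt in to the live round-trips.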
Prerequisites
- GitHub repository with Actions enabled
- Groq API key stored as GitHub secret
- vitest or jest for testing
Instructions
Step 1: GitHub Actions Workflow
# .github/workflows/groq-tests.yml
name: Groq Integration Tests
on:
push:
branches: [main]
pull_request:
branches: [main]
schedule:
- cron: "0 6 * * 1" # Weekly model deprecation check
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
cache: "npm"
- run: npm ci
- run: npm test -- --coverage
# Unit tests use mocked groq-sdk -- no API key needed
integration-tests:
runs-on: ubuntu-latest
if: github.event_name != 'pull_request' # Only on push to main
env:
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with:
node-version: "20"
cache: "npm"
- run: npm ci
- name: Run Groq integration tests
run: GROQ_INTEGRATION=1 npx vitest tests/groq.integration.ts --reporter=verbose
timeout-minutes: 2
model-check:
runs-on: ubuntu-latest
env:
GROQ_API_KEY: ${{ secrets.GROQ_API_KEY }}
steps:
- uses: actions/checkout@v4
- name: Check for deprecated models
run: |
set -euo pipefail
# Get current models from Groq API
MODELS=$(curl -sf https://api.groq.com/openai/v1/models \
-H "Authorization: Bearer $GROQ_API_KEY" | jq -r '.data[].id')
# Check our code references valid models
USED=$(grep -roh "model.*['\"].*['\"]" src/ --include="*.ts" | \
grep -oP "(?<=['\"])[\w./-]+(?=['\"])" | sort -u)
echo "=== Models in our code ==="
echo "$USED"
echo ""
echo "=== Available on Groq ==="
echo "$MODELS"
# Flag any model in our code that's not in the API response
MISSING=""
while IFS= read -r model; do
if ! echo "$MODELS" | grep -qF "$model"; then
MISSING="$MISSING\n - $model"
fi
done <<< "$USED"
if [ -n "$MISSING" ]; then
echo -e "WARNING: These models in code are not available on Groq:$MISSING"
exit 1
fi
Diagnose and fix Groq API errors with real error codes and solutions.
Groq Common Errors
Overview
Comprehensive reference for Groq API error codes, their root causes, and proven fixes. Groq returns standard HTTP status codes with structured error bodies and rate-limit headers.
Error Response Format
{
"error": {
"message": "Rate limit reached for model `llama-3.3-70b-versatile`...",
"type": "tokens",
"code": "rate_limit_exceeded"
}
}
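A caller usually only needs to know one thing about a failed request: is it worth retrying? A minimal sketch based on standard HTTP status semantics (401 auth, 429 rate limit, 5xx transient); the helper name is illustrative:

```typescript
// Map an HTTP status from a failed Groq call to a retry decision.
interface ErrorAction {
  retryable: boolean;
  reason: string;
}

function classifyGroqError(status: number): ErrorAction {
  if (status === 401) return { retryable: false, reason: "invalid or missing API key" };
  if (status === 429) return { retryable: true, reason: "rate limited; honor retry-after" };
  if (status >= 500) return { retryable: true, reason: "transient server error" };
  return { retryable: false, reason: `unhandled status ${status}` };
}
```

Pair this with the diagnostic commands below: a non-retryable classification means fix the request or key, not the retry loop.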
Quick Diagnostic
set -euo pipefail
# 1. Verify API key is valid
curl -s https://api.groq.com/openai/v1/models \
-H "Authorization: Bearer $GROQ_API_KEY" | jq '.data | length'
# 2. Check specific model availability
curl -s https://api.groq.com/openai/v1/models \
-H "Authorization: Bearer $GROQ_API_KEY" | jq '.data[].id' | sort
# 3. Test a minimal completion
curl -s https://api.groq.com/openai/v1/chat/completions \
-H "Authorization: Bearer $GROQ_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"llama-3.1-8b-instant","messages":[{"role":"user","content":"ping"}],"max_tokens":5}' | jq .
Error Reference
401 — Authentication Error
Authentication error: Invalid API key provided
Causes: Key missing, revoked, or malformed.
Fix:
# Verify key is set and starts with gsk_
echo "${GROQ_API_KEY:0:4}" # Should print "gsk_"
# Test key directly
curl -s -o /dev/null -w "%{http_code}" \
https://api.groq.com/openai/v1/models \
-H "Authorization: Bearer $GROQ_API_KEY"
# Should return 200
429 — Rate Limit Exceeded
Rate limit reached for model `llama-3.3-70b-versatile` in organization `org_xxx`
on tokens per minute (TPM): Limit 6000, Used 5800, Requested 500.
Causes: RPM (requests/min), TPM (tokens/min), or RPD (requests/day) limit hit.
Rate limit headers returned:
| Header | Description |
|---|---|
| `retry-after` | Seconds to wait before retrying |
| `x-ratelimit-limit-requests` | Max requests per window |
| `x-ratelimit-limit-tokens` | Max tokens per window |
| `x-ratelimit-remaining-requests` | Requests remaining |
| `x-ratelimit-remaining-tokens` | Tokens remaining |
| `x-ratelimit-reset-requests` | When the request limit resets |
| `x-ratelimit-reset-tokens` | When the token limit resets |
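When handling a 429, prefer the `retry-after` header over a blind retry loop. A minimal backoff sketch (header access shown as a plain record; a real client would read these from the HTTP response object):

```typescript
// Pick a wait time after a 429: honor retry-after when present,
// otherwise fall back to capped exponential backoff.
function backoffMs(headers: Record<string, string>, attempt: number): number {
  const retryAfter = headers["retry-after"];
  if (retryAfter !== undefined) return Math.ceil(parseFloat(retryAfter) * 1000);
  return Math.min(1000 * 2 ** attempt, 30_000); // 1s, 2s, 4s, ... capped at 30s
}
```

Honoring the server's hint avoids hammering an already-throttled organization-wide limit.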
| Task | Recommended Model | Why |
|---|---|---|
| Chat with tools | `llama-3.3-70b-versatile` | Best tool-calling accuracy |
| JSON extraction | `llama-3.1-8b-instant` | Fast, accurate for structured tasks |
| Structured outputs | `llama-3.3-70b-versatile` | Supports `strict: true` schema compliance |
| Vision + chat | `meta-llama/llama-4-scout-17b-16e-instruct` | Multimodal input |
Instructions
Step 1: Chat Completion with System Prompt
import Groq from "groq-sdk";
const groq = new Groq();
async function chat(userMessage: string, history: any[] = []) {
const messages = [
{ role: "system" as const, content: "You are a concise technical assistant." },
...history,
{ role: "user" as const, content: userMessage },
];
const completion = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages,
temperature: 0.7,
max_tokens: 1024,
});
return {
reply: completion.choices[0].message.content,
usage: completion.usage,
};
}
Step 2: Tool Use / Function Calling
// Define tools with JSON Schema
const tools: Groq.Chat.ChatCompletionTool[] = [
{
type: "function",
function: {
name: "get_weather",
description: "Get current weather for a location",
parameters: {
type: "object",
properties: {
location: { type: "string", description: "City name" },
unit: { type: "string", enum: ["celsius", "fahrenheit"] },
},
required: ["location"],
},
},
},
{
type: "function",
function: {
name: "search_docs",
description: "Search internal documentation",
parameters: {
type: "object",
properties: {
query: { type: "string" },
limit: { type: "number", description: "Max results" },
},
required: ["query"],
},
},
},
];
async function chatWithTools(userMessage: string) {
Execute Groq secondary workflows: audio transcription (Whisper), vision, text-to-speech, and batch model evaluation.
Groq Core Workflow B: Audio, Vision & Speech
Overview
Beyond chat completions, Groq provides ultra-fast audio transcription (Whisper at 216x real-time), multimodal vision (Llama 4 Scout/Maverick), and text-to-speech. These endpoints use the same groq-sdk client.
Prerequisites
- groq-sdk installed, GROQ_API_KEY set
- For audio: audio files in supported formats
- For vision: image URLs or base64 images
Audio Models
| Model ID | Languages | Speed | Best For |
|---|---|---|---|
| `whisper-large-v3` | 100+ | 164x real-time | Best accuracy, multilingual |
| `whisper-large-v3-turbo` | 100+ | 216x real-time | Best speed/accuracy balance |
Supported audio formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, webm
Instructions
Step 1: Audio Transcription (Whisper)
import Groq from "groq-sdk";
import fs from "fs";
const groq = new Groq();
// Transcribe audio file
async function transcribe(filePath: string): Promise<string> {
const transcription = await groq.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-large-v3-turbo",
response_format: "json", // or "text" or "verbose_json"
language: "en", // Optional: ISO 639-1 code
});
return transcription.text;
}
// With timestamps (verbose mode)
async function transcribeWithTimestamps(filePath: string) {
const transcription = await groq.audio.transcriptions.create({
file: fs.createReadStream(filePath),
model: "whisper-large-v3-turbo",
response_format: "verbose_json",
timestamp_granularities: ["segment"],
});
return transcription;
// Returns segments with start/end times
}
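The `verbose_json` segments can be rendered into a readable transcript without any SDK code. A sketch, assuming the `{ start, end, text }` segment shape returned by Whisper's verbose mode (the formatting itself is just an illustration):

```typescript
// Render verbose_json segments as a timestamped transcript.
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

function toTimestamped(segments: Segment[]): string {
  return segments
    .map((s) => `[${s.start.toFixed(1)}s -> ${s.end.toFixed(1)}s] ${s.text.trim()}`)
    .join("\n");
}
```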
Step 2: Audio Translation (to English)
// Translate any language audio to English text
async function translateAudio(filePath: string): Promise<string> {
const translation = await groq.audio.translations.create({
file: fs.createReadStream(filePath),
model: "whisper-large-v3",
});
return translation.text;
}
Step 3: Vision (Image Understanding)
// Analyze images with Llama 4 Scout (up to 5 images per request)
async function analyzeImage(imageUrl: string, question: string) {
const completion = await groq.chat.completions.create({
model: "meta-llama/llama-4-scout-17b-16e-instruct",
messages: [
{
role: "user",
content: [
{ type: "text", text: question },
{ type: "image_url", image_url: { url: imageUrl } },
],
},
],
});
return completion.choices[0].message.content;
}
Optimize Groq costs through model routing, token management, and usage monitoring.
Groq Cost Tuning
Overview
Optimize Groq inference costs through smart model routing, token minimization, and caching. Groq pricing is already extremely competitive, but at high volume, routing classification traffic to the 8B model instead of the 70B model cuts cost roughly 12x per request.
Groq Pricing (per million tokens)
| Model | Input | Output |
|---|---|---|
| `llama-3.1-8b-instant` | ~$0.05 | ~$0.08 |
| `llama-3.3-70b-versatile` | ~$0.59 | ~$0.79 |
| `llama-3.3-70b-specdec` | ~$0.59 | ~$0.99 |
| `meta-llama/llama-4-scout-17b-16e-instruct` | ~$0.11 | ~$0.34 |
| `whisper-large-v3-turbo` | ~$0.04/hr | — |
Check current pricing at groq.com/pricing.
Instructions
Step 1: Smart Model Routing
import Groq from "groq-sdk";
const groq = new Groq();
// Route to cheapest model that meets quality requirements
interface ModelConfig {
model: string;
inputCostPer1M: number;
outputCostPer1M: number;
}
const ROUTING: Record<string, ModelConfig> = {
classification: { model: "llama-3.1-8b-instant", inputCostPer1M: 0.05, outputCostPer1M: 0.08 },
extraction: { model: "llama-3.1-8b-instant", inputCostPer1M: 0.05, outputCostPer1M: 0.08 },
summarization: { model: "llama-3.1-8b-instant", inputCostPer1M: 0.05, outputCostPer1M: 0.08 },
reasoning: { model: "llama-3.3-70b-versatile", inputCostPer1M: 0.59, outputCostPer1M: 0.79 },
codeReview: { model: "llama-3.3-70b-versatile", inputCostPer1M: 0.59, outputCostPer1M: 0.79 },
chat: { model: "llama-3.3-70b-versatile", inputCostPer1M: 0.59, outputCostPer1M: 0.79 },
vision: { model: "meta-llama/llama-4-scout-17b-16e-instruct", inputCostPer1M: 0.11, outputCostPer1M: 0.34 },
};
function getModel(useCase: string): string {
return ROUTING[useCase]?.model || "llama-3.1-8b-instant";
}
// Classification on 8B: $0.05/M vs 70B: $0.59/M = 12x savings
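Routing decisions are easier to defend with a cost estimate attached. A minimal sketch using the per-million rates from the pricing table above (rates are approximate; check current pricing before relying on them):

```typescript
// Estimate per-request cost from token counts and per-million-token rates.
function costUsd(
  inputTokens: number,
  outputTokens: number,
  inPer1M: number,
  outPer1M: number
): number {
  return (inputTokens * inPer1M + outputTokens * outPer1M) / 1_000_000;
}

// 70B at ~$0.59/$0.79: a 200-token prompt with a 50-token reply
const perCall = costUsd(200, 50, 0.59, 0.79); // roughly $0.00016 per call
```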
Step 2: Minimize Tokens Per Request
// COST SAVINGS: Reduce system prompt tokens
// Groq charges for BOTH input and output tokens
// Verbose system prompt: ~200 tokens (~$0.12 per 1000 calls at 70B input pricing)
const expensive = "You are a highly skilled AI assistant specializing in text classification. When given a piece of text, carefully analyze the sentiment, considering tone, word choice, connotation...";
// Concise system prompt: ~15 tokens (~$0.009 per 1000 calls at 70B input pricing)
const cheap = "Classify sentiment: positive/negative/neutral. One word.";
// COST SAVINGS: Limit output tokens
async function cheapC
Implement prompt sanitization, PII redaction, response filtering, and usage tracking for Groq API integrations.
Groq Data Handling
Overview
Manage data flowing through Groq's inference API. Covers prompt sanitization before sending to Groq, response filtering after receiving, PII redaction, conversation audit logging, and token usage tracking. Key fact: Groq does not use API data for model training (Groq Privacy Policy).
Groq Data Policy
- Groq does not train on API request/response data
- Prompts and completions are processed and discarded
- Groq may temporarily log requests for abuse prevention
- For enterprise: contact Groq for DPA and SOC 2 compliance details
Instructions
Step 1: Prompt Sanitization Layer
import Groq from "groq-sdk";
const groq = new Groq();
interface RedactionRule {
name: string;
pattern: RegExp;
replacement: string;
}
const PII_RULES: RedactionRule[] = [
{ name: "email", pattern: /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g, replacement: "[EMAIL]" },
{ name: "phone", pattern: /\b\d{3}[-.]?\d{3}[-.]?\d{4}\b/g, replacement: "[PHONE]" },
{ name: "ssn", pattern: /\b\d{3}-\d{2}-\d{4}\b/g, replacement: "[SSN]" },
{ name: "credit_card", pattern: /\b\d{4}[- ]?\d{4}[- ]?\d{4}[- ]?\d{4}\b/g, replacement: "[CARD]" },
{ name: "ip_address", pattern: /\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b/g, replacement: "[IP]" },
];
function sanitizeText(text: string): { sanitized: string; redactedTypes: string[] } {
let sanitized = text;
const redactedTypes: string[] = [];
for (const rule of PII_RULES) {
if (rule.pattern.test(sanitized)) {
redactedTypes.push(rule.name);
sanitized = sanitized.replace(rule.pattern, rule.replacement);
}
}
return { sanitized, redactedTypes };
}
function sanitizeMessages(messages: any[]): { messages: any[]; hadPII: boolean } {
let hadPII = false;
const sanitized = messages.map((m) => {
if (typeof m.content !== "string") return m;
const { sanitized: text, redactedTypes } = sanitizeText(m.content);
if (redactedTypes.length > 0) hadPII = true;
return { ...m, content: text };
});
return { messages: sanitized, hadPII };
}
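The redaction behavior above is easy to verify in isolation. A standalone check with one rule inlined so the snippet runs without the full `PII_RULES` list (the email pattern matches the one defined above):

```typescript
// Minimal single-rule redaction, mirroring the email entry in PII_RULES.
const emailPattern = /\b[\w.+-]+@[\w-]+\.[\w.]+\b/g;

function redactEmails(text: string): string {
  return text.replace(emailPattern, "[EMAIL]");
}

console.log(redactEmails("Contact alice@example.com about the invoice"));
```

Regex-based redaction is a best-effort layer, not a guarantee; treat it as defense in depth alongside access controls.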
Step 2: Safe Completion Wrapper
async function safeCompletion(
messages: any[],
model = "llama-3.3-70b-versatile",
options?: { maxTokens?: number }
) {
// Sanitize input
const { messages: sanitized, hadPII } = sanitizeMessages(messages);
if (hadPII) {
console.warn("[groq-data] PII detected and redacted before sending to Groq API");
}
// Call Groq
const completion = await groq.chat.completions.create({
model,
messages: sanitized,
max_tokens: options?.maxTokens ?? 1024,
});
// Filter response
const responseContent = completion.choices[0].message.content;
Collect Groq debug evidence for support tickets and troubleshooting.
Groq Debug Bundle
Current State
!node --version 2>/dev/null || echo 'N/A'
!python3 --version 2>/dev/null || echo 'N/A'
!npm list groq-sdk 2>/dev/null | grep groq-sdk || echo 'groq-sdk not installed'
Overview
Collect all diagnostic information needed to resolve Groq API issues. Produces a redacted support bundle with environment info, SDK version, connectivity test results, and rate limit status.
Prerequisites
- GROQ_API_KEY set in environment
- curl and jq available
- Access to application logs
Instructions
Step 1: Create Debug Bundle Script
#!/bin/bash
set -euo pipefail
BUNDLE_DIR="groq-debug-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE_DIR"
echo "Collecting Groq debug bundle..."
# === Environment ===
cat > "$BUNDLE_DIR/environment.txt" <<ENVEOF
=== Groq Debug Bundle ===
Generated: $(date -u +"%Y-%m-%dT%H:%M:%SZ")
Hostname: $(hostname)
OS: $(uname -sr)
Node.js: $(node --version 2>/dev/null || echo 'not installed')
Python: $(python3 --version 2>/dev/null || echo 'not installed')
npm groq-sdk: $(npm list groq-sdk 2>/dev/null | grep groq-sdk || echo 'not installed')
pip groq: $(pip show groq 2>/dev/null | grep Version || echo 'not installed')
GROQ_API_KEY: $(if [ -n "${GROQ_API_KEY:-}" ]; then echo "SET (${#GROQ_API_KEY} chars, prefix: ${GROQ_API_KEY:0:4}...)"; else echo "NOT SET"; fi)
ENVEOF
Step 2: API Connectivity Test
# Test API endpoint and capture headers
echo "--- API Connectivity ---" >> "$BUNDLE_DIR/connectivity.txt"
# Models endpoint (lightweight, confirms auth)
curl -s -o /tmp/groq-models.json -w "HTTP Status: %{http_code}\nTime: %{time_total}s\n" \
https://api.groq.com/openai/v1/models \
-H "Authorization: Bearer $GROQ_API_KEY" >> "$BUNDLE_DIR/connectivity.txt"
jq '.data | length' /tmp/groq-models.json >> "$BUNDLE_DIR/connectivity.txt" 2>&1
echo "Models available: $(curl -s https://api.groq.com/openai/v1/models \
-H "Authorization: Bearer $GROQ_API_KEY" | jq -r '.data[].id' | wc -l)" \
>> "$BUNDLE_DIR/connectivity.txt"
Step 3: Rate Limit Status
# Make a minimal request and capture rate limit headers
echo "--- Rate Limit Status ---" >> "$BUNDLE_DIR/rate-limits.txt"
curl -si https://api.groq.com/openai/v1/chat/completions \
-H "Authorization: Bearer $GROQ_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"llama-3.1-8b-instant","messages":[{"role":"user","content":"ping"}],"max_tokens":1}' \
2>/dev/null | grep -iE "^(x-ratelimit|retry-after)" >> "$BUNDLE_DIR/rate-limits.txt"
Deploy Groq integrations to Vercel, Cloud Run, and containerized platforms.
Groq Deploy Integration
Overview
Deploy applications using Groq's inference API to Vercel Edge, Cloud Run, Docker, and other platforms. Groq's sub-200ms latency makes it ideal for edge deployments and real-time applications.
Prerequisites
- Groq API key stored in GROQ_API_KEY
- Application using the groq-sdk package
- Platform CLI installed (vercel, docker, or gcloud)
Instructions
Step 1: Vercel Edge Function
// app/api/chat/route.ts (Next.js App Router)
import Groq from "groq-sdk";
export const runtime = "edge";
export async function POST(req: Request) {
const groq = new Groq({ apiKey: process.env.GROQ_API_KEY! });
const { messages, stream: useStream } = await req.json();
if (useStream) {
const stream = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages,
stream: true,
max_tokens: 2048,
});
const encoder = new TextEncoder();
const readable = new ReadableStream({
async start(controller) {
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
controller.enqueue(
encoder.encode(`data: ${JSON.stringify({ content })}\n\n`)
);
}
}
controller.enqueue(encoder.encode("data: [DONE]\n\n"));
controller.close();
},
});
return new Response(readable, {
headers: {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
},
});
}
const completion = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages,
max_tokens: 2048,
});
return Response.json(completion);
}
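On the consumer side, each streamed line from the route above arrives as `data: {...}`. A minimal parsing sketch (the `{ content }` payload shape matches what the edge function emits; the helper name is illustrative):

```typescript
// Parse one SSE line from the streaming route: return the content delta,
// or null for [DONE] markers and non-data lines (comments, keepalives).
function parseSseLine(line: string): string | null {
  if (!line.startsWith("data: ")) return null;
  const payload = line.slice("data: ".length);
  if (payload === "[DONE]") return null;
  return (JSON.parse(payload) as { content: string }).content;
}
```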
Step 2: Vercel Deployment
set -euo pipefail
# Set secret
vercel env add GROQ_API_KEY production
# Deploy
vercel --prod
Step 3: Docker Container
FROM node:20-slim AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
RUN npm run build
FROM node:20-slim
WORKDIR /app
COPY --from=builder /app/dist ./dist
COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/package.json .
EXPOSE 3000
# node:20-slim has no curl; use Node's built-in fetch for the health probe
HEALTHCHECK --interval=30s --timeout=5s CMD node -e "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
CMD ["node", "dist/index.js"]
Step 4: Cloud Run Deployment
set -euo pipefail
# Store API key in Secret Manager
echo -n "$GROQ_API_KEY" | gcloud secrets create groq-api-key --data-file=-
# Deploy with streaming support
gcloud run deploy groq-api \
--source . \
--region us-central1 \
--set-secrets=GROQ_API_KEY=groq-api-key:latest
Configure Groq organization management, API key scoping, spending controls, and team access patterns.
Groq Enterprise Access Management
Overview
Manage team access to Groq's inference API through API key strategy, model-level routing controls, spending limits, and usage monitoring. Groq uses flat API keys (gsk_ prefix) with no built-in scoping -- access control is implemented at the application layer.
Groq Access Model
- API keys are per-organization, not per-user
- No built-in scopes -- every key has full API access
- Rate limits are per-organization, shared across all keys
- Spending limits are configurable in the Groq Console
- Projects allow creating isolated API keys with separate limits
Instructions
Step 1: API Key Strategy
// Create separate keys per team/service via Groq Console Projects
// Each project gets its own API key and can have independent rate limits
// Key naming convention: {team}-{environment}-{purpose}
const KEY_REGISTRY = {
// Each team gets a separate Groq Project
"chatbot-prod": "gsk_...", // Project: chatbot-production
"chatbot-staging": "gsk_...", // Project: chatbot-staging
"analytics-prod": "gsk_...", // Project: analytics-production
"batch-processor": "gsk_...", // Project: batch-processing
} as const;
Step 2: Application-Level Model Access Control
// Since Groq keys don't have model scoping, implement it in your gateway
interface TeamConfig {
allowedModels: string[];
maxTokensPerRequest: number;
monthlyBudgetUsd: number;
rateLimitRPM: number;
}
const TEAM_CONFIGS: Record<string, TeamConfig> = {
chatbot: {
allowedModels: ["llama-3.3-70b-versatile", "llama-3.1-8b-instant"],
maxTokensPerRequest: 2048,
monthlyBudgetUsd: 200,
rateLimitRPM: 60,
},
analytics: {
allowedModels: ["llama-3.1-8b-instant"], // Only cheapest model
maxTokensPerRequest: 512,
monthlyBudgetUsd: 50,
rateLimitRPM: 30,
},
research: {
allowedModels: [
"llama-3.3-70b-versatile",
"llama-3.1-8b-instant",
"meta-llama/llama-4-scout-17b-16e-instruct",
],
maxTokensPerRequest: 4096,
monthlyBudgetUsd: 500,
rateLimitRPM: 120,
},
};
function validateRequest(team: string, model: string, maxTokens: number): void {
const config = TEAM_CONFIGS[team];
if (!config) throw new Error(`Unknown team: ${team}`);
if (!config.allowedModels.includes(model)) {
throw new Error(`Team ${team} not authorized for model ${model}`);
}
if (maxTokens > config.maxTokensPerRequest) {
throw new Error(`max_tokens ${maxTokens} exceeds limit ${config.maxTokensPerRequest} for team ${team}`);
}
}
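The `monthlyBudgetUsd` field above needs an enforcement point. A minimal in-memory sketch; a real gateway would persist counters (Redis or a database) and reset them monthly, and the class name is illustrative:

```typescript
// Track per-team spend and reject requests once the monthly budget is hit.
class BudgetTracker {
  private spent = new Map<string, number>();

  constructor(private budgets: Record<string, number>) {}

  record(team: string, usd: number): void {
    const total = (this.spent.get(team) ?? 0) + usd;
    if (total > (this.budgets[team] ?? 0)) {
      throw new Error(`Team ${team} exceeded monthly budget`);
    }
    this.spent.set(team, total);
  }
}
```

Call `record()` after each completion using actual token usage, so the check reflects real spend rather than estimates.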
Create a minimal working Groq chat completion example.
Groq Hello World
Overview
Build a minimal chat completion with Groq's LPU inference API. Groq uses an OpenAI-compatible endpoint, so the API shape is familiar -- but responses arrive 10-50x faster than GPU-based providers.
Prerequisites
- groq-sdk installed (npm install groq-sdk)
- GROQ_API_KEY environment variable set
- Completed groq-install-auth setup
Instructions
Step 1: Basic Chat Completion (TypeScript)
import Groq from "groq-sdk";
const groq = new Groq();
async function main() {
const completion = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "What is Groq's LPU and why is it fast?" },
],
});
console.log(completion.choices[0].message.content);
console.log(`Tokens: ${completion.usage?.total_tokens}`);
}
main().catch(console.error);
Step 2: Streaming Response
async function streamExample() {
const stream = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages: [
{ role: "user", content: "Explain quantum computing in 3 sentences." },
],
stream: true,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content || "";
process.stdout.write(content);
}
console.log(); // newline
}
Step 3: Python Equivalent
from groq import Groq
client = Groq()
completion = client.chat.completions.create(
model="llama-3.3-70b-versatile",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is Groq's LPU and why is it fast?"},
],
)
print(completion.choices[0].message.content)
print(f"Tokens: {completion.usage.total_tokens}")
Step 4: Try Different Models
// Speed tier -- fastest responses (~560 tok/s)
const fast = await groq.chat.completions.create({
model: "llama-3.1-8b-instant",
messages: [{ role: "user", content: "Hello!" }],
});
// Quality tier -- best reasoning (~280 tok/s)
const quality = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages: [{ role: "user", content: "Explain monads in Haskell." }],
});
// Vision tier -- multimodal understanding
const vision = await groq.chat.completions.create({
model: "meta-llama/llama-4-scout-17b-16e-instruct",
messages: [{
role: "user",
content: [
{ type: "text", text: "Describe this image." }, // hypothetical prompt
{ type: "image_url", image_url: { url: "https://example.com/photo.jpg" } }, // placeholder URL
],
}],
});
Execute Groq incident response: triage, mitigation, fallback, and postmortem.
Groq Incident Runbook
Overview
Rapid incident response procedures for Groq API failures. Groq is a third-party inference provider -- when it goes down, your mitigation options are: wait, fall back to a different model, or fall back to a different provider.
Severity Levels
| Level | Definition | Response Time | Examples |
|---|---|---|---|
| P1 | Complete API failure | < 15 min | Groq API returns 5xx on all models |
| P2 | Degraded performance | < 1 hour | High latency, partial 429s, one model down |
| P3 | Minor impact | < 4 hours | Intermittent errors, non-critical feature affected |
| P4 | No user impact | Next business day | Monitoring gap, cost anomaly |
Quick Triage (Run First)
set -euo pipefail
echo "=== 1. Groq API Status ==="
curl -sf https://status.groq.com > /dev/null && echo "status.groq.com: REACHABLE" || echo "status.groq.com: UNREACHABLE"
echo ""
echo "=== 2. API Authentication ==="
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" \
https://api.groq.com/openai/v1/models \
-H "Authorization: Bearer $GROQ_API_KEY")
echo "GET /models: HTTP $HTTP_CODE"
echo ""
echo "=== 3. Model Availability ==="
for model in "llama-3.1-8b-instant" "llama-3.3-70b-versatile"; do
CODE=$(curl -s -o /dev/null -w "%{http_code}" \
https://api.groq.com/openai/v1/chat/completions \
-H "Authorization: Bearer $GROQ_API_KEY" \
-H "Content-Type: application/json" \
-d "{\"model\":\"$model\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}],\"max_tokens\":1}")
echo "$model: HTTP $CODE"
done
echo ""
echo "=== 4. Rate Limit Status ==="
curl -si https://api.groq.com/openai/v1/chat/completions \
-H "Authorization: Bearer $GROQ_API_KEY" \
-H "Content-Type: application/json" \
-d '{"model":"llama-3.1-8b-instant","messages":[{"role":"user","content":"ping"}],"max_tokens":1}' \
2>/dev/null | grep -iE "^(x-ratelimit|retry-after)" || echo "No rate limit headers"
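The mitigation options above (wait, fall back to another model, fall back to another provider) reduce to trying a list of attempts in order. A minimal provider-agnostic sketch; attempt functions are supplied by the caller, so nothing here is Groq-specific:

```typescript
// Try each attempt in order; return the first success, rethrow the last failure.
async function withFallback<T>(attempts: Array<() => Promise<T>>): Promise<T> {
  let lastError: unknown = new Error("no attempts given");
  for (const attempt of attempts) {
    try {
      return await attempt();
    } catch (err) {
      lastError = err; // record and try the next option
    }
  }
  throw lastError;
}
```

In an incident, the attempt list might be: primary Groq model, cheaper Groq model, then a different provider entirely.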
Decision Tree
Is the Groq API responding?
├─ NO (timeout/connection refused):
│ ├─ Check status.groq.com
│ │ ├─ Incident reported → Wait, enable fallback provider
│ │ └─ No incident → Network issue on our side (check DNS, firewall, proxy)
│ └─ Check if api.groq.com resolves: dig api.groq.com
│
├─ YES, but 401/403:
│ ├─ API key revoked or expired
Install and configure Groq SDK authentication for TypeScript or Python.
Groq Install & Auth
Overview
Install the official Groq SDK and configure API key authentication. Groq provides ultra-fast LLM inference on custom LPU hardware through an OpenAI-compatible REST API at api.groq.com/openai/v1/.
Prerequisites
- Node.js 18+ or Python 3.8+
- Package manager (npm, pnpm, or pip)
- Groq account at console.groq.com
- API key from GroqCloud console (Settings > API Keys)
Instructions
Step 1: Install the SDK
set -euo pipefail
# TypeScript / JavaScript
npm install groq-sdk
# Python
pip install groq
Step 2: Get Your API Key
- Go to console.groq.com/keys
- Click "Create API Key"
- Copy the key (starts with
gsk_) - Store it securely -- you cannot view it again
Step 3: Configure Environment
# Set environment variable (recommended)
export GROQ_API_KEY="gsk_your_key_here"
# Or create .env file (add .env to .gitignore first)
echo 'GROQ_API_KEY=gsk_your_key_here' >> .env
Step 4: Verify Connection (TypeScript)
import Groq from "groq-sdk";
const groq = new Groq({
apiKey: process.env.GROQ_API_KEY,
});
async function verify() {
const models = await groq.models.list();
console.log("Connected! Available models:");
for (const model of models.data) {
console.log(` ${model.id} (owned by ${model.owned_by})`);
}
}
verify().catch(console.error);
Step 5: Verify Connection (Python)
import os
from groq import Groq
client = Groq(api_key=os.environ.get("GROQ_API_KEY"))
models = client.models.list()
print("Connected! Available models:")
for model in models.data:
print(f" {model.id} (owned by {model.owned_by})")
SDK Defaults
The Groq SDK auto-reads GROQ_API_KEY from the environment if no apiKey is passed to the constructor. Additional constructor options:
const groq = new Groq({
apiKey: process.env.GROQ_API_KEY, // Optional if env var set
baseURL: "https://api.groq.com/openai/v1", // Default
maxRetries: 2, // Default retry count
timeout: 60_000, // 60 second timeout (ms)
});
API Key Formats
| Prefix | Type | Usage |
|---|---|---|
| `gsk_` | Standard API key | All API endpoints |
Groq uses a single key type. There are no separate read/write scopes -- all keys have full API access. Restrict access through organizational controls in the console.
Error Handling
Configure Groq local development with hot reload, mocking, and testing.
Groq Local Dev Loop
Overview
Set up a fast, reproducible local development workflow for Groq. Groq's sub-second response times make it uniquely suited for tight dev loops -- you get LLM responses fast enough to iterate without context-switching.
Prerequisites
- groq-sdk installed
- GROQ_API_KEY set (free tier is fine for development)
- Node.js 18+ with tsx for TypeScript execution
- vitest for testing
Instructions
Step 1: Project Structure
my-groq-project/
├── src/
│ ├── groq/
│ │ ├── client.ts # Singleton Groq client
│ │ ├── models.ts # Model constants and selection
│ │ └── completions.ts # Completion wrappers
│ └── index.ts
├── tests/
│ ├── groq.test.ts # Unit tests with mocks
│ └── groq.integration.ts # Live API tests (CI-only)
├── .env.local # Local secrets (git-ignored)
├── .env.example # Template for team
└── package.json
Step 2: Package Setup
{
"scripts": {
"dev": "tsx watch src/index.ts",
"test": "vitest",
"test:watch": "vitest --watch",
"test:integration": "GROQ_INTEGRATION=1 vitest tests/groq.integration.ts"
},
"dependencies": {
"groq-sdk": "^0.12.0"
},
"devDependencies": {
"tsx": "^4.0.0",
"vitest": "^2.0.0"
}
}
Step 3: Singleton Client
// src/groq/client.ts
import Groq from "groq-sdk";
let _client: Groq | null = null;
export function getGroqClient(): Groq {
if (!_client) {
if (!process.env.GROQ_API_KEY) {
throw new Error("GROQ_API_KEY not set. Copy .env.example to .env.local");
}
_client = new Groq({
apiKey: process.env.GROQ_API_KEY,
maxRetries: 2,
timeout: 30_000,
});
}
return _client;
}
// Reset for testing
export function resetClient(): void {
_client = null;
}
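The lazy-singleton-plus-reset shape above is worth seeing in isolation. A reduced sketch that runs without groq-sdk; `FakeClient` stands in for the real `Groq` class:

```typescript
// Lazy singleton: construct on first use, reuse afterward, reset for tests.
class FakeClient {
  constructor(public apiKey: string) {}
}

let _client: FakeClient | null = null;

function getClient(): FakeClient {
  if (!_client) {
    _client = new FakeClient("gsk_test"); // placeholder key
  }
  return _client;
}

function resetClient(): void {
  _client = null;
}
```

The reset hook matters: without it, a test that changes GROQ_API_KEY mid-suite would silently keep the old client.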
Step 4: Model Constants
// src/groq/models.ts
export const MODELS = {
FAST: "llama-3.1-8b-instant", // Dev default: cheapest, fastest
VERSATILE: "llama-3.3-70b-versatile", // Production quality
SPECDEC: "llama-3.3-70b-specdec", // Speculative decoding variant
SCOUT: "meta-llama/llama-4-scout-17b-16e-instruct", // Vision
} as const;
export const DEV_MODEL = MODELS.FAST; // Use 8B for dev to save quota
Step 5: Unit Tests with Mocking
// tests/groq.test.ts
import { describe, it, expect, vi, beforeEach } from "vitest";
import Groq from "groq-sdk";
// Mock the entire groq-sdk module
vi.mock("groq-sdk");
Migrate from OpenAI/Anthropic/other LLM providers to Groq, or migrate between Groq model generations with zero-downtime traffic shifting.
Groq Migration Deep Dive
Current State
!npm list groq-sdk openai @anthropic-ai/sdk 2>/dev/null | grep -E "groq|openai|anthropic" || echo 'No LLM SDKs found'
Overview
Migrate to Groq from OpenAI, Anthropic, or other LLM providers. Groq's OpenAI-compatible API makes migration straightforward -- the primary changes are: different SDK import, different model IDs, and different response metadata. The reward is 10-50x faster inference.
Migration Complexity
| Source | Complexity | Key Changes |
|---|---|---|
| OpenAI | Low | Import, model IDs, base URL -- API shape is identical |
| Anthropic | Medium | Different API shape, message format, streaming protocol |
| Local LLMs | Medium | Remove infra, add API calls |
| Other cloud (Bedrock, Vertex) | Medium | Remove cloud SDK, add groq-sdk |
Instructions
Step 1: OpenAI to Groq Migration
// BEFORE: OpenAI
import OpenAI from "openai";
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const result = await openai.chat.completions.create({
model: "gpt-4o-mini",
messages: [{ role: "user", content: "Hello" }],
});
// AFTER: Groq (minimal changes)
import Groq from "groq-sdk";
const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });
const result = await groq.chat.completions.create({
model: "llama-3.3-70b-versatile", // or "llama-3.1-8b-instant"
messages: [{ role: "user", content: "Hello" }],
});
// Same response shape: result.choices[0].message.content
Step 2: Model ID Mapping
// OpenAI → Groq model equivalents
const MODEL_MAP: Record<string, string> = {
// OpenAI → Groq (quality equivalent)
"gpt-4o": "llama-3.3-70b-versatile",
"gpt-4o-mini": "llama-3.1-8b-instant",
"gpt-4-turbo": "llama-3.3-70b-versatile",
"gpt-3.5-turbo": "llama-3.1-8b-instant",
// Anthropic → Groq (approximate)
"claude-3-5-sonnet": "llama-3.3-70b-versatile",
"claude-3-haiku": "llama-3.1-8b-instant",
};
function migrateModelId(model: string): string {
return MODEL_MAP[model] || "llama-3.3-70b-versatile";
}
Step 3: Provider Abstraction Layer
// Build a provider-agnostic layer for zero-downtime migration
interface LLMProvider {
name: string;
complete(messages: any[], model: string, maxTokens: number): Promise<{
content: string;
model: string;
tokens: { prompt: number; completion: number; total: number };
}>;
}
class GroqProvider implements LLMProvider {
name = "groq";
private client = new Groq({ apiKey: process.env.GROQ_API_KEY });
async complete(messages: any[], model: string, maxTokens: number) {
const res = await this.client.chat.completions.create({
model,
messages,
max_tokens: maxTokens,
});
const usage = res.usage!;
return {
content: res.choices[0].message.content || "",
model: res.model,
tokens: {
prompt: usage.prompt_tokens,
completion: usage.completion_tokens,
total: usage.total_tokens,
},
};
}
}
Configure Groq across dev, staging, and production with environment-specific model selection, rate limits, and API keys.
Groq Multi-Environment Setup
Overview
Configure Groq API access across development, staging, and production with the right model, rate limit strategy, and secret management per environment. Key insight: use llama-3.1-8b-instant in development (cheapest, fastest), match production model in staging, and harden production with retries and fallbacks.
Environment Strategy
| Environment | API Key Source | Default Model | Retry | Logging |
|---|---|---|---|---|
| Development | `.env.local` | `llama-3.1-8b-instant` | 1 | Verbose |
| Staging | CI/CD secrets | `llama-3.3-70b-versatile` | 3 | Standard |
| Production | Secret manager | `llama-3.3-70b-versatile` | 5 | Structured |
Instructions
Step 1: Configuration Module
// config/groq.ts
import Groq from "groq-sdk";
interface GroqEnvConfig {
apiKey: string;
model: string;
maxTokens: number;
temperature: number;
maxRetries: number;
timeout: number;
logRequests: boolean;
}
const configs: Record<string, GroqEnvConfig> = {
development: {
apiKey: process.env.GROQ_API_KEY || "",
model: "llama-3.1-8b-instant", // Cheapest, fastest for iteration
maxTokens: 512,
temperature: 0.7,
maxRetries: 1,
timeout: 15_000,
logRequests: true, // Verbose in dev
},
staging: {
apiKey: process.env.GROQ_API_KEY_STAGING || process.env.GROQ_API_KEY || "",
model: "llama-3.3-70b-versatile", // Match production model
maxTokens: 2048,
temperature: 0.3,
maxRetries: 3,
timeout: 30_000,
logRequests: false,
},
production: {
apiKey: process.env.GROQ_API_KEY_PROD || process.env.GROQ_API_KEY || "",
model: "llama-3.3-70b-versatile", // Quality model
maxTokens: 2048,
temperature: 0.3,
maxRetries: 5, // More retries in prod
timeout: 30_000,
logRequests: false,
},
};
function getEnv(): string {
return process.env.NODE_ENV || "development";
}
export function getGroqConfig(): GroqEnvConfig {
const env = getEnv();
const config = configs[env] || configs.development;
if (!config.apiKey) {
throw new Error(
`GROQ_API_KEY not set for ${env}. ` +
(env === "development"
? "Copy .env.example to .env.local and add your key from console.groq.com/keys"
: `Set GROQ_API_KEY_${env.toUpperCase()} in your secret manager`)
);
}
return config;
}
let _client: Groq | null = null;
export function getGroqClient(): Groq {
if (!_client) {
const config = getGroqConfig();
_client = new Groq({
apiKey: config.apiKey,
maxRetries: config.maxRetries,
timeout: config.timeout,
});
}
return _client;
}
Set up observability for Groq integrations: latency histograms, token throughput, rate limit gauges, cost tracking, and Prometheus alerts.
Groq Observability
Overview
Monitor Groq LPU inference for latency, token throughput, rate limit utilization, and cost. Groq's defining advantage is speed (280-560 tok/s), so latency degradation is the highest-priority signal. The API returns rich timing metadata (`queue_time`, `prompt_time`, `completion_time`) and rate limit headers on every response.
Key Metrics to Track
| Metric | Type | Source | Why |
|---|---|---|---|
| TTFT (time to first token) | Histogram | Client-side timing | Groq's main value prop |
| Tokens/second | Gauge | `usage.completion_time` | Throughput degradation |
| Total latency | Histogram | Client-side timing | End-to-end performance |
| Rate limit remaining | Gauge | `x-ratelimit-remaining-*` headers | Prevent 429s |
| Token usage | Counter | `usage.total_tokens` | Cost attribution |
| Error rate by code | Counter | Error handler | Availability |
| Estimated cost | Counter | Tokens × model price | Budget tracking |
Instructions
Step 1: Instrumented Groq Client
import Groq from "groq-sdk";
const groq = new Groq();
interface GroqMetrics {
model: string;
latencyMs: number;
ttftMs: number;
tokensPerSec: number;
promptTokens: number;
completionTokens: number;
totalTokens: number;
queueTimeMs: number;
estimatedCostUsd: number;
}
const PRICE_PER_1M: Record<string, { input: number; output: number }> = {
"llama-3.1-8b-instant": { input: 0.05, output: 0.08 },
"llama-3.3-70b-versatile": { input: 0.59, output: 0.79 },
"llama-3.3-70b-specdec": { input: 0.59, output: 0.99 },
"meta-llama/llama-4-scout-17b-16e-instruct": { input: 0.11, output: 0.34 },
};
async function trackedCompletion(
model: string,
messages: any[],
options?: { maxTokens?: number; temperature?: number }
): Promise<{ result: any; metrics: GroqMetrics }> {
const start = performance.now();
const result = await groq.chat.completions.create({
model,
messages,
max_tokens: options?.maxTokens ?? 1024,
temperature: options?.temperature ?? 0.7,
});
const latencyMs = performance.now() - start;
const usage = result.usage!;
const pricing = PRICE_PER_1M[model] || { input: 0.10, output: 0.10 };
const metrics: GroqMetrics = {
model,
latencyMs: Math.round(latencyMs),
ttftMs: Math.round(((usage as any).prompt_time ?? 0) * 1000),
tokensPerSec: Math.round(
usage.completion_tokens / ((usage as any).completion_time || latencyMs / 1000)
),
promptTokens: usage.prompt_tokens,
completionTokens: usage.completion_tokens,
totalTokens: usage.total_tokens,
queueTimeMs: Math.round(((usage as any).queue_time ?? 0) * 1000),
estimatedCostUsd:
(usage.prompt_tokens * pricing.input + usage.completion_tokens * pricing.output) / 1_000_000,
};
return { result, metrics };
}
Optimize Groq API performance with model selection, caching, streaming, and parallel requests.
Groq Performance Tuning
Overview
Maximize Groq's LPU inference speed advantage. Groq already delivers extreme throughput (280-560 tok/s) and low latency (<200ms TTFT), but client-side optimization -- model selection, prompt size, streaming, caching, and parallelism -- determines whether your application fully exploits that speed.
Groq Speed Benchmarks
| Model | TTFT | Throughput | Context |
|---|---|---|---|
| `llama-3.1-8b-instant` | ~50ms | ~560 tok/s | 128K |
| `llama-3.3-70b-versatile` | ~150ms | ~280 tok/s | 128K |
| `llama-3.3-70b-specdec` | ~100ms | ~400 tok/s | 128K |
| `meta-llama/llama-4-scout-17b-16e-instruct` | ~80ms | ~460 tok/s | 128K |
TTFT = Time to First Token. Actual values depend on prompt size and server load.
Instructions
Step 1: Choose the Right Model for Speed
import Groq from "groq-sdk";
const groq = new Groq();
// Speed tiers for different use cases
const SPEED_MAP = {
// Under 100ms TTFT -- use for latency-critical paths
instant: "llama-3.1-8b-instant",
// Under 200ms TTFT -- use for quality-sensitive paths
balanced: "llama-3.3-70b-versatile",
// Speculative decoding -- same quality as 70b, faster throughput
fast70b: "llama-3.3-70b-specdec",
} as const;
type SpeedTier = keyof typeof SPEED_MAP;
async function tieredCompletion(prompt: string, tier: SpeedTier = "instant") {
return groq.chat.completions.create({
model: SPEED_MAP[tier],
messages: [{ role: "user", content: prompt }],
temperature: 0, // Deterministic = cacheable
max_tokens: 256, // Only request what you need
});
}
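Because `temperature: 0` makes responses deterministic, identical prompts can be served from a cache instead of re-hitting the API. A minimal in-memory sketch -- the key scheme and TTL here are illustrative choices, not part of the Groq API:

```typescript
// In-memory response cache for deterministic (temperature: 0) completions,
// keyed on model + prompt. Entries expire after a TTL.
const cache = new Map<string, { value: string; expires: number }>();

function cacheGet(key: string, now = Date.now()): string | undefined {
  const hit = cache.get(key);
  return hit && hit.expires > now ? hit.value : undefined;
}

function cacheSet(key: string, value: string, ttlMs = 60_000, now = Date.now()): void {
  cache.set(key, { value, expires: now + ttlMs });
}

cacheSet("llama-3.1-8b-instant:hello", "cached answer", 1_000, 0);
console.log(cacheGet("llama-3.1-8b-instant:hello", 500));   // hit: "cached answer"
console.log(cacheGet("llama-3.1-8b-instant:hello", 2_000)); // expired: undefined
```

Check the cache before calling `tieredCompletion` and populate it after a successful response; for multi-instance deployments, swap the `Map` for Redis.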
Step 2: Minimize Token Count
// Groq charges per token AND rate limits on TPM
// Smaller prompts = faster responses + less quota usage
// BAD: verbose system prompt (200+ tokens)
const verbosePrompt = "You are an AI assistant that classifies text. Given a piece of text, analyze it carefully and determine whether the sentiment is positive, negative, or neutral. Consider the tone, word choice, and overall message...";
// GOOD: concise system prompt (15 tokens)
const concisePrompt = "Classify as positive/negative/neutral. One word only.";
// BAD: high max_tokens for short expected output
const wasteful = { max_tokens: 4096 }; // for a one-word response
// GOOD: match max_tokens to expected output
const efficient = { max_tokens: 5 }; // "positive" is 1 token
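To budget prompts against TPM limits before sending, a rough client-side estimate is enough. The ~4 characters/token ratio below is a common heuristic for English text, not Groq's actual tokenizer:

```typescript
// Heuristic token estimate (~4 chars/token for English). Good enough for
// budgeting max_tokens and TPM headroom; not billing-accurate.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

console.log(estimateTokens("Classify as positive/negative/neutral. One word only."));
```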
Step 3: Streaming for Perceived Performance
async function streamWithMetrics(
messages: any[],
model = "llama-3.3-70b-versatile"
) {
const start = performance.now();
let ttftMs = 0;
let text = "";
const stream = await groq.chat.completions.create({
model,
messages,
stream: true,
max_tokens: 2048,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
if (!ttftMs) ttftMs = performance.now() - start; // time to first token
text += content;
}
}
return { text, ttftMs, totalMs: performance.now() - start };
}
Execute Groq production deployment checklist and go-live procedures.
Groq Production Checklist
Overview
Complete pre-launch checklist for deploying Groq-powered applications to production. Covers API key security, model selection, rate limit planning, fallback strategies, and monitoring setup.
Prerequisites
- Staging environment tested with Groq API
- Groq Developer or Enterprise plan (free tier is not suitable for production)
- Production API key created in console.groq.com
- Monitoring and alerting infrastructure ready
Pre-Deployment Checklist
API Key & Auth
- [ ] Production API key stored in secret manager (not `.env` files)
- [ ] Key is NOT shared with development or staging environments
- [ ] Key rotation procedure documented and tested
- [ ] Pre-commit hook blocks `gsk_` pattern in code
Model Selection
- [ ] Production model chosen and tested (recommend `llama-3.3-70b-versatile`)
- [ ] Fallback model configured (`llama-3.1-8b-instant`)
- [ ] Deprecated model IDs removed (check deprecations)
- [ ] `max_tokens` set to actual expected output size (not context max)
Rate Limit Planning
- [ ] Production rate limits known (check console.groq.com/settings/limits)
- [ ] Estimated peak RPM < 80% of limit
- [ ] Estimated peak TPM < 80% of limit
- [ ] Exponential backoff with `retry-after` header implemented
- [ ] Request queue for burst protection (`p-queue` or similar)
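The request-queue item above is usually covered by `p-queue`; to illustrate the idea, here is a tiny promise-concurrency limiter -- a sketch, not a replacement for a real queue library:

```typescript
// Limit in-flight Groq calls to `concurrency` at a time; excess callers
// wait in FIFO order until a slot frees up.
function createLimiter(concurrency: number) {
  let active = 0;
  const waiting: (() => void)[] = [];
  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active >= concurrency) {
      await new Promise<void>((resolve) => waiting.push(resolve));
    }
    active++;
    try {
      return await task();
    } finally {
      active--;
      waiting.shift()?.(); // wake the next queued task, if any
    }
  };
}

// Usage: const limit = createLimiter(5);
// await limit(() => groq.chat.completions.create({ ... }));
```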
Error Handling
- [ ] All Groq error types caught (`Groq.APIError`, `Groq.APIConnectionError`)
- [ ] 429 errors retried with backoff
- [ ] 5xx errors retried with backoff
- [ ] 401 errors trigger alert (key may be revoked)
- [ ] Network timeouts configured (default 60s may be too long)
- [ ] Circuit breaker pattern for sustained failures
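The circuit-breaker item can be as simple as a consecutive-failure counter with a cooldown. A minimal sketch (threshold and cooldown values are illustrative):

```typescript
// Open after `threshold` consecutive failures; reject fast while open,
// and allow a single probe request once the cooldown has elapsed.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  canRequest(now = Date.now()): boolean {
    if (this.failures < this.threshold) return true;
    return now - this.openedAt >= this.cooldownMs; // half-open: allow a probe
  }
  recordSuccess(): void { this.failures = 0; }
  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.failures === this.threshold) this.openedAt = now;
  }
}
```

Call `canRequest()` before each Groq request; on a 5xx or timeout call `recordFailure()`, on any success call `recordSuccess()`.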
Fallback & Degradation
async function completionWithFallback(messages: any[]) {
try {
return await groq.chat.completions.create({
model: "llama-3.3-70b-versatile",
messages,
timeout: 15_000,
});
} catch (err: any) {
if (err.status === 429 || err.status >= 500) {
console.warn("Groq primary failed, trying fallback model");
try {
return await groq.chat.completions.create({
model: "llama-3.1-8b-instant",
messages,
timeout: 10_000,
});
} catch {
console.error("Groq fully unavailable, degrading gracefully");
return { choices: [{ message: { content: "Service temporarily unavailable. Please try again." } }] };
}
}
throw err;
}
}
Health Check Endpoint
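A sketch of what that endpoint's core can look like -- the probe target (the cheap `GET /models` call), latency threshold, and function names are assumptions, not a prescribed API:

```typescript
// Classify Groq health from a cheap probe and its round-trip latency.
type Health = { status: "ok" | "degraded" | "down"; latencyMs: number };

export function evaluateHealth(ok: boolean, latencyMs: number): Health {
  if (!ok) return { status: "down", latencyMs };
  return { status: latencyMs < 2_000 ? "ok" : "degraded", latencyMs };
}

export async function groqHealthCheck(): Promise<Health> {
  const start = Date.now();
  try {
    const res = await fetch("https://api.groq.com/openai/v1/models", {
      headers: { Authorization: `Bearer ${process.env.GROQ_API_KEY}` },
    });
    return evaluateHealth(res.ok, Date.now() - start);
  } catch {
    return evaluateHealth(false, Date.now() - start);
  }
}
```

Wire `groqHealthCheck()` into your `/health` route and alert on sustained `degraded` or `down` results.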
Implement Groq rate limit handling with backoff, queuing, and header parsing.
Groq Rate Limits
Overview
Handle Groq rate limits using the retry-after header, exponential backoff, and request queuing. Groq enforces limits at the organization level with both RPM (requests/minute) and TPM (tokens/minute) constraints -- hitting either one triggers a 429.
Rate Limit Structure
Groq rate limits vary by plan and model. Limits are applied simultaneously -- you must stay under both RPM and TPM.
| Constraint | Description |
|---|---|
| RPM | Requests per minute |
| RPD | Requests per day |
| TPM | Tokens per minute |
| TPD | Tokens per day |
Free tier limits are significantly lower than paid tier. Check your current limits at console.groq.com/settings/limits.
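When a 429 does arrive, the wait should prefer the server's `retry-after` value and fall back to exponential backoff with jitter. A minimal helper:

```typescript
// Delay before retrying a 429: honor retry-after when present, otherwise
// exponential backoff (1s, 2s, 4s, ... capped at 30s) plus jitter.
function backoffMs(attempt: number, retryAfterSec?: number): number {
  if (retryAfterSec && retryAfterSec > 0) return retryAfterSec * 1000;
  const base = Math.min(1_000 * 2 ** attempt, 30_000);
  return base + Math.floor(Math.random() * 250); // jitter avoids thundering herd
}

console.log(backoffMs(0, 2)); // 2000 -- server-specified wait wins
```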
Rate Limit Response Headers
When Groq responds (even on success), it includes these headers:
| Header | Description |
|---|---|
| `x-ratelimit-limit-requests` | Max requests in current window |
| `x-ratelimit-limit-tokens` | Max tokens in current window |
| `x-ratelimit-remaining-requests` | Requests remaining before limit |
| `x-ratelimit-remaining-tokens` | Tokens remaining before limit |
| `x-ratelimit-reset-requests` | Time until request limit resets |
| `x-ratelimit-reset-tokens` | Time until token limit resets |
| `retry-after` | Seconds to wait (only on 429 responses) |
Instructions
Step 1: Parse Rate Limit Headers
import Groq from "groq-sdk";
interface RateLimitInfo {
limitRequests: number;
limitTokens: number;
remainingRequests: number;
remainingTokens: number;
resetRequestsMs: number;
resetTokensMs: number;
}
function parseRateLimitHeaders(headers: Record<string, string>): RateLimitInfo {
return {
limitRequests: parseInt(headers["x-ratelimit-limit-requests"] || "0"),
limitTokens: parseInt(headers["x-ratelimit-limit-tokens"] || "0"),
remainingRequests: parseInt(headers["x-ratelimit-remaining-requests"] || "0"),
remainingTokens: parseInt(headers["x-ratelimit-remaining-tokens"] || "0"),
resetRequestsMs: parseResetTime(headers["x-ratelimit-reset-requests"]),
resetTokensMs: parseResetTime(headers["x-ratelimit-reset-tokens"]),
};
}
function parseResetTime(value?: string): number {
if (!value) return 0;
// Groq returns reset times like "1.2s" or "120ms"
if (value.endsWith("ms")) return parseFloat(value);
if (value.endsWith("s")) return parseFloat(value) * 1000;
return 0;
}
Implement Groq reference architecture with model routing, streaming pipelines, and fallbacks.
Groq Reference Architecture
Overview
Production architecture for applications built on Groq's LPU inference API. Covers model routing by latency requirements, streaming pipelines, multi-provider fallback, and the middleware layer that ties it together.
Architecture Diagram
┌──────────────────────────────────────────────────────────────┐
│ Application Layer │
│ Chat UI │ API Backend │ Batch Processor │ Agent │
└─────┬─────┴──────┬────────┴────────┬──────────┴──────┬───────┘
│ │ │ │
▼ ▼ ▼ ▼
┌──────────────────────────────────────────────────────────────┐
│ Groq Service Layer │
│ ┌─────────────┐ ┌────────────┐ ┌─────────────────────┐ │
│ │ Model Router │ │ Middleware │ │ Fallback Chain │ │
│ │ │ │ │ │ │ │
│ │ speed → │ │ Cache │ │ Groq (primary) │ │
│ │ 8b-instant│ │ Rate Guard │ │ ↓ 429/5xx │ │
│ │ quality → │ │ Metrics │ │ Groq (fallback model)│ │
│ │ 70b-versa.│ │ Logging │ │ ↓ still failing │ │
│ │ vision → │ │ Retry │ │ OpenAI (backup) │ │
│ │ llama-4 │ │ │ │ ↓ also failing │ │
│ │ audio → │ │ │ │ Graceful degrade │ │
│ │ whisper │ │ │ │ │ │
│ └─────────────┘ └────────────┘ └─────────────────────┘ │
└──────────────────────────────────────────────────────────────┘
Project Structure
src/
├── groq/
│ ├── client.ts # Singleton Groq client
│ ├── models.ts # Model constants and capabilities
│ ├── router.ts # Model selection logic
│ ├── middleware.ts # Cache, rate limit, metrics
│ ├── fallback.ts # Multi-provider fallback chain
│ └── types.ts # Shared types
├── services/
│ ├── chat.ts # Chat completion service
│ ├── transcription.ts # Audio transcription (Whisper)
│ ├── extraction.ts # Structured data extraction
│ └── batch.ts # Batch processing service
└── api/
├── chat.ts # HTTP endpoint
├── transcribe.ts # Audio endpoint
└── health.ts # Health check
Instructions
Step 1: Model Registry
// src/groq/models.ts
export interface ModelSpec {
id: string;
tier: "speed" | "quality" | "vision" | "audio";
contextWindow: number;
maxOutput: number;
speedTokPerSec: number;
inputCostPer1M: number;
outputCostPer1M: number;
capabilities: ("text" | "tools" | "json" | "vision" | "audio")[];
}
export const MODELS: Record<string, ModelSpec> = {
// ... one ModelSpec entry per model in the registry ...
};
Apply production-ready Groq SDK patterns for TypeScript and Python.
Groq SDK Patterns
Overview
Production patterns for the groq-sdk package. The Groq SDK mirrors the OpenAI SDK interface (chat.completions.create), so patterns feel familiar but must account for Groq-specific behavior: extreme speed (500+ tok/s), aggressive rate limits on free tier, and unique response metadata like `queue_time` and `completion_time`.
Prerequisites
- `groq-sdk` installed
- Understanding of async/await and error handling
- Familiarity with OpenAI SDK patterns (Groq is API-compatible)
Instructions
Step 1: Typed Client Singleton
// src/groq/client.ts
import Groq from "groq-sdk";
let _client: Groq | null = null;
export function getGroq(): Groq {
if (!_client) {
_client = new Groq({
apiKey: process.env.GROQ_API_KEY,
maxRetries: 3,
timeout: 30_000,
});
}
return _client;
}
Step 2: Type-Safe Completion Wrapper
import Groq from "groq-sdk";
import type { ChatCompletionMessageParam } from "groq-sdk/resources/chat/completions";
const groq = getGroq();
interface CompletionResult {
content: string;
model: string;
tokens: { prompt: number; completion: number; total: number };
timing: { queueMs: number; totalMs: number; tokensPerSec: number };
}
async function complete(
messages: ChatCompletionMessageParam[],
model = "llama-3.3-70b-versatile",
options?: { maxTokens?: number; temperature?: number }
): Promise<CompletionResult> {
const response = await groq.chat.completions.create({
model,
messages,
max_tokens: options?.maxTokens ?? 1024,
temperature: options?.temperature ?? 0.7,
});
const usage = response.usage!;
return {
content: response.choices[0].message.content || "",
model: response.model,
tokens: {
prompt: usage.prompt_tokens,
completion: usage.completion_tokens,
total: usage.total_tokens,
},
timing: {
queueMs: (usage.queue_time ?? 0) * 1000,
totalMs: (usage.total_time ?? 0) * 1000,
tokensPerSec: usage.completion_tokens / ((usage.completion_time ?? 1) || 1),
},
};
}
Step 3: Streaming with Typed Events
async function* streamCompletion(
messages: ChatCompletionMessageParam[],
model = "llama-3.3-70b-versatile"
): AsyncGenerator<string> {
const stream = await groq.chat.completions.create({
model,
messages,
stream: true,
max_tokens: 2048,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) yield content;
}
}
// Usage
async function printStream(prompt: string) {
const messages: ChatCompletionMessageParam[] = [
{ role: "user", content: prompt },
];
for await (const token of streamCompletion(messages)) {
process.stdout.write(token);
}
}
Apply Groq security best practices for API key management and data protection.
Groq Security Basics
Overview
Security practices for Groq API keys and data flowing through Groq's inference API. Groq uses a single API key type (gsk_ prefix) with full access -- there are no scoped tokens -- so key management and rotation are critical.
Prerequisites
- Groq account at console.groq.com
- Understanding of environment variable management
- Secret management solution for production (Vault, AWS Secrets Manager, etc.)
Key Security Facts
- Groq API keys start with
gsk_and grant full API access - There are no read-only or scoped keys -- every key can call every endpoint
- Keys are created at console.groq.com/keys and cannot be viewed after creation
- Rate limits are per-organization, not per-key
- Groq does not store prompt data for training (see privacy policy)
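Because every key grants full access, it is worth failing fast when a key is missing or malformed. A startup-time check using the same `gsk_` pattern as the pre-commit hook later in this skill (the helper name `assertGroqKey` is hypothetical):

```typescript
// Fail fast at boot if the key is missing or malformed, so a bad deploy
// is caught before the first inference request.
function assertGroqKey(key: string | undefined): string {
  if (!key || !/^gsk_[A-Za-z0-9]{20,}$/.test(key)) {
    throw new Error("GROQ_API_KEY missing or malformed (expected gsk_ prefix)");
  }
  return key;
}

// Usage at startup:
// const apiKey = assertGroqKey(process.env.GROQ_API_KEY);
console.log(assertGroqKey("gsk_" + "a".repeat(24)).startsWith("gsk_")); // true
```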
Instructions
Step 1: Secure Key Storage by Environment
# Development: .env file (NEVER commit)
echo "GROQ_API_KEY=gsk_dev_key_here" > .env.local
# .gitignore (mandatory)
echo -e ".env\n.env.local\n.env.*.local" >> .gitignore
# Production: use platform secret managers
# Vercel
vercel env add GROQ_API_KEY production
# AWS
aws secretsmanager create-secret --name groq-api-key --secret-string "gsk_..."
# GCP
echo -n "gsk_..." | gcloud secrets create groq-api-key --data-file=-
# GitHub Actions
gh secret set GROQ_API_KEY --body "gsk_..."
Step 2: Key Rotation Procedure
set -euo pipefail
# 1. Create new key in console.groq.com/keys
# Name it with a date: "prod-2026-03"
# 2. Deploy new key to production first (both keys work simultaneously)
# Update secret manager with new value
# 3. Verify new key works
curl -s -o /dev/null -w "%{http_code}" \
https://api.groq.com/openai/v1/models \
-H "Authorization: Bearer $NEW_GROQ_KEY"
# Should return 200
# 4. Monitor for 24h -- ensure no requests use old key
# 5. Delete old key in console.groq.com/keys
Step 3: Git Leak Prevention
# Pre-commit hook to detect leaked keys
cat > .git/hooks/pre-commit << 'HOOKEOF'
#!/bin/bash
if git diff --cached --diff-filter=ACM | grep -qE "gsk_[a-zA-Z0-9]{20,}"; then
echo "ERROR: Groq API key detected in staged files!"
echo "Remove the key and use environment variables instead."
exit 1
fi
HOOKEOF
chmod +x .git/hooks/pre-commit
Step 4: Server-Side Key Usage Pattern
import Groq from "groq-sdk";
// NEVER expose key to client-side code
// Always proxy through your backend
export async function POST(req: Request) {
// Key stays server-side
const groq = new Groq({ apiKey: process.env.GROQ_API_KEY });
const { messages } = await req.json();
const completion = await groq.chat.completions.create({
model: "llama-3.1-8b-instant",
messages,
});
return Response.json({ content: completion.choices[0].message.content });
}
Upgrade groq-sdk versions and handle Groq model deprecations.
Groq Upgrade & Migration
Current State
!npm list groq-sdk 2>/dev/null | grep groq-sdk || echo 'groq-sdk not installed'
!pip show groq 2>/dev/null | grep -E "Name|Version" || echo 'groq not installed (python)'
Overview
Guide for upgrading the groq-sdk package and migrating away from deprecated model IDs. Groq regularly deprecates older models in favor of newer, faster alternatives.
Model Deprecation Timeline
Groq announces deprecations with advance notice. These models have been deprecated:
| Deprecated Model | Deprecation Date | Replacement |
|---|---|---|
| `mixtral-8x7b-32768` | 2025-03-05 | `llama-3.3-70b-versatile` or `llama-3.1-8b-instant` |
| `gemma2-9b-it` | 2025-08-08 | `llama-3.1-8b-instant` |
| `llama-3.1-70b-versatile` | 2024-12-06 | `llama-3.3-70b-versatile` |
| `llama-3.1-70b-specdec` | 2024-12-06 | `llama-3.3-70b-specdec` |
| `playai-tts` | 2025-12-23 | Orpheus TTS models |
| `playai-tts-arabic` | 2025-12-23 | Orpheus TTS models |
| `distil-whisper-large-v3-en` | — | `whisper-large-v3-turbo` |
Current Model IDs (Use These)
| Model ID | Type | Context | Speed |
|---|---|---|---|
| `llama-3.1-8b-instant` | Text | 128K | ~560 tok/s |
| `llama-3.3-70b-versatile` | Text | 128K | ~280 tok/s |
| `llama-3.3-70b-specdec` | Text | 128K | Faster |
| `meta-llama/llama-4-scout-17b-16e-instruct` | Vision+Text | 128K | ~460 tok/s |
| `meta-llama/llama-4-maverick-17b-128e-instruct` | Vision+Text | 128K | — |
| `whisper-large-v3` | Audio STT | — | 164x RT |
| `whisper-large-v3-turbo` | Audio STT | — | 216x RT |
Always verify at: GET https://api.groq.com/openai/v1/models
Instructions
Step 1: Check Current Version and Models
set -euo pipefail
# SDK version
npm list groq-sdk 2>/dev/null
npm view groq-sdk version # latest on npm
# Find all model references in your code
grep -rn "model.*['\"]" src/ --include="*.ts" --include="*.js" | grep -i "groq\|llama\|mixtral\|gemma\|whisper"
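Building on the grep above, a scan that fails when any deprecated ID from the table appears in the codebase (the ID list mirrors the deprecation table; flags are standard GNU grep):

```shell
# Scan a directory for deprecated Groq model IDs from the table above.
deprecated='mixtral-8x7b-32768|gemma2-9b-it|llama-3\.1-70b-versatile|llama-3\.1-70b-specdec|distil-whisper-large-v3-en'

scan_deprecated() {
  # $1 = directory to scan; prints matches and returns 1 if any found
  if grep -rnE "$deprecated" "$1" --include="*.ts" --include="*.js" 2>/dev/null; then
    echo "Deprecated model IDs found -- migrate before upgrading"
    return 1
  fi
  echo "No deprecated model IDs"
}

# Usage: scan_deprecated src/
```

Wire `scan_deprecated src/` into CI so an SDK upgrade never ships alongside a retired model ID.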
Step 2: Upgrade SDK
Build event-driven architectures with Groq streaming, batch processing, and async patterns.
Groq Events & Async Patterns
Overview
Build event-driven architectures around Groq's inference API. Groq does not provide native webhooks, but its sub-second latency enables unique patterns: real-time SSE streaming, batch processing with callbacks, queue-based pipelines, and event processors that use Groq as an LLM classification/extraction engine.
Prerequisites
- `groq-sdk` installed, `GROQ_API_KEY` set
- Queue system for batch patterns (BullMQ, Redis, SQS)
- Understanding of Server-Sent Events (SSE) for streaming
Instructions
Step 1: SSE Streaming Endpoint
import Groq from "groq-sdk";
import express from "express";
const groq = new Groq();
const app = express();
app.use(express.json());
app.post("/api/chat/stream", async (req, res) => {
const { messages, model = "llama-3.3-70b-versatile" } = req.body;
res.writeHead(200, {
"Content-Type": "text/event-stream",
"Cache-Control": "no-cache",
Connection: "keep-alive",
"X-Accel-Buffering": "no", // Disable nginx buffering
});
try {
const stream = await groq.chat.completions.create({
model,
messages,
stream: true,
max_tokens: 2048,
});
for await (const chunk of stream) {
const content = chunk.choices[0]?.delta?.content;
if (content) {
res.write(`data: ${JSON.stringify({ content, type: "token" })}\n\n`);
}
}
res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
} catch (err: any) {
res.write(`data: ${JSON.stringify({ type: "error", message: err.message })}\n\n`);
}
res.end();
});
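On the consuming side, the `data:` frames this endpoint emits can be decoded with a small parser -- a sketch for a Node client (browsers can use `EventSource` instead):

```typescript
// Decode `data: {...}\n\n` SSE frames back into the event objects
// emitted by the streaming endpoint above.
interface StreamEvent { type: string; content?: string; message?: string }

function parseSSE(chunk: string): StreamEvent[] {
  return chunk
    .split("\n\n")
    .filter((frame) => frame.startsWith("data: "))
    .map((frame) => JSON.parse(frame.slice(6)) as StreamEvent);
}

const raw = 'data: {"content":"Hi","type":"token"}\n\ndata: {"type":"done"}\n\n';
console.log(parseSSE(raw)); // [{ content: "Hi", type: "token" }, { type: "done" }]
```

Note that real streams can split a frame across TCP chunks; buffer partial data until a blank line arrives before parsing.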
Step 2: Batch Processing with BullMQ
import { Queue, Worker } from "bullmq";
import Groq from "groq-sdk";
import { randomUUID } from "crypto";
const groq = new Groq();
const groqQueue = new Queue("groq-batch", { connection: { host: "localhost" } });
// Enqueue a batch of prompts
async function submitBatch(
prompts: string[],
callbackUrl: string,
model = "llama-3.1-8b-instant"
): Promise<string> {
const batchId = randomUUID();
for (const [index, prompt] of prompts.entries()) {
await groqQueue.add("inference", {
batchId,
index,
prompt,
model,
callbackUrl,
total: prompts.length,
});
}
return batchId;
}
// Worker processes queue items
const worker = new Worker("groq-batch", async (job) => {
const { prompt, model, callbackUrl, batchId, index, total } = job.data;
const completion = await groq.chat.completions.create({
model,
messages: [{ role: "user", content: prompt }],
temperature: 0,
});
const content = completion.choices[0].message.content;
// Deliver this item's result to the caller's webhook
await fetch(callbackUrl, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ batchId, index, total, content }),
});
});
Ready to use groq-pack?
Related Plugins
ai-ethics-validator
AI ethics and fairness validation
ai-experiment-logger
Track and analyze AI experiments with a web dashboard and MCP tools
ai-ml-engineering-pack
Professional AI/ML Engineering toolkit: Prompt engineering, LLM integration, RAG systems, AI safety with 12 expert plugins
ai-sdk-agents
Multi-agent orchestration with AI SDK v5 - handoffs, routing, and coordination for any AI provider (OpenAI, Anthropic, Google)
anomaly-detection-system
Detect anomalies and outliers in data
automl-pipeline-builder
Build AutoML pipelines