| Job failed | Data quality issue | Check training file format |
Together AI core workflow b for inference, fine-tuning, and model deployment.
Together AI — Fine-Tuning & Model Management
Overview
Create fine-tuning jobs, monitor training runs, and deploy custom models on Together AI's
infrastructure. Use this workflow when you need to customize an open-source model on your
own data, track training metrics, manage model versions, or set up dedicated inference
endpoints for production. This is the secondary workflow — for basic inference and chat
completions, see together-core-workflow-a.
Instructions
Step 1: Upload Training Data and Create a Fine-Tune Job
import fs from 'node:fs';
import Together from 'together-ai';
const client = new Together({ apiKey: process.env.TOGETHER_API_KEY });
const file = await client.files.upload({
file: fs.createReadStream('training.jsonl'),
purpose: 'fine-tune',
});
const job = await client.fineTuning.create({
training_file: file.id,
model: 'meta-llama/Llama-3.3-70B-Instruct-Turbo',
n_epochs: 3,
learning_rate: 1e-5,
batch_size: 4,
suffix: 'support-agent-v2',
});
console.log(`Fine-tune job ${job.id} — status: ${job.status}`);
Step 2: Monitor Training Progress
let status = await client.fineTuning.retrieve(job.id);
while (!['completed', 'failed', 'cancelled'].includes(status.status)) {
console.log(`Status: ${status.status} — ${status.training_steps_completed}/${status.total_steps} steps`);
if (status.metrics) console.log(` Loss: ${status.metrics.training_loss.toFixed(4)}`);
await new Promise(r => setTimeout(r, 30_000));
status = await client.fineTuning.retrieve(job.id);
}
console.log(`Final model: ${status.fine_tuned_model}`);
Step 3: List and Manage Model Versions
const models = await client.models.list({ owned_by: 'me' });
models.data.forEach(m =>
console.log(`${m.id} — created ${m.created_at}, type: ${m.type}`)
);
// Delete an old model version
await client.models.delete('my-org/support-agent-v1');
console.log('Deleted old model version');
Step 4: Deploy to a Dedicated Endpoint
const endpoint = await client.endpoints.create({
model: status.fine_tuned_model,
instance_type: 'gpu-a100-80gb',
min_replicas: 1,
max_replicas: 3,
autoscale_target_utilization: 0.7,
});
console.log(`Endpoint ${endpoint.id} — URL: ${endpoint.url}`);
console.log(`Status: ${endpoint.status}, replicas: ${endpoint.current_replicas}`);
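Once the endpoint reports a running status, the fine-tuned model is reachable through the same chat completions API by passing its model ID. A minimal sketch, reusing the client and status from the steps above (the prompt is illustrative):
const reply = await client.chat.completions.create({
  model: status.fine_tuned_model,
  messages: [{ role: 'user', content: 'How do I reset my password?' }],
  max_tokens: 200,
});
console.log(reply.choices[0].message.content);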
Error Handling
| Issue | Cause | Fix |
| --- | --- | --- |
| 401 Unauthorized | Invalid or expired API key | Regenerate at api.together.xyz/settings |
| 400 Invalid JSONL | Malformed training file | Each line must be valid JSON |
Together AI cost tuning for inference, fine-tuning, and model deployment.
Together AI Cost Tuning
Overview
Optimize Together AI costs with model selection, batching, and caching.
Instructions
Together AI Pricing Model
| Model Category | Price (per 1M tokens) | Example Models |
| --- | --- | --- |
| Small (< 10B) | $0.10-0.30 | Llama-3.2-3B, Qwen-2.5-7B |
| Medium (10-40B) | $0.60-1.20 | Mixtral-8x7B, Llama-3.3-70B-Turbo |
| Large (40B+) | $2.00-5.00 | Llama-3.1-405B, DeepSeek-V3 |
| Image gen | $0.003-0.05/image | FLUX.1-schnell, SDXL |
| Embeddings | $0.008/1M tokens | M2-BERT |
| Fine-tuning | ~$5-25/hour | Depends on model + GPU |
| Batch inference | 50% off | Same models, async |
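To turn the table into a budget estimate, a rough per-workload cost calculator helps. A sketch using midpoints of the price ranges above (the numbers are approximations; check Together's live pricing page before budgeting):
// Approximate $/1M tokens per tier, midpoints of the ranges above (illustrative)
const PRICE_PER_M: Record<'small' | 'medium' | 'large', number> = { small: 0.2, medium: 0.9, large: 3.5 };
function estimateCostUSD(tier: 'small' | 'medium' | 'large', promptTokens: number, completionTokens: number): number {
  return ((promptTokens + completionTokens) / 1_000_000) * PRICE_PER_M[tier];
}
// 10k requests at ~500 prompt + ~300 completion tokens on a medium model
console.log(estimateCostUSD('medium', 10_000 * 500, 10_000 * 300).toFixed(2)); // "7.20"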
Cost Reduction Strategies
# Setup: assumes TOGETHER_API_KEY is set in the environment
from together import Together
client = Together()

# 1. Use Turbo variants (faster, cheaper, similar quality)
# meta-llama/Llama-3.3-70B-Instruct-Turbo vs Llama-3.1-70B-Instruct

# 2. Batch inference (50% cost reduction)
batch_response = client.batch.create(
input_file_id=file_id,
model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
completion_window="24h",
)
# 3. Cache responses for identical prompts
from functools import lru_cache
@lru_cache(maxsize=1000)
def cached_completion(prompt: str, model: str) -> str:
response = client.chat.completions.create(
model=model, messages=[{"role": "user", "content": prompt}],
)
return response.choices[0].message.content
# 4. Use smallest model that works
# Test with 3B first, upgrade to 70B only if quality insufficient
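Strategy 4 can be automated as a model cascade: answer with the small model first and escalate only when a cheap quality check fails. A TypeScript sketch against the OpenAI-compatible endpoint (the model pair and the qualityCheck predicate are assumptions, not a Together feature):
import OpenAI from 'openai';
const together = new OpenAI({ apiKey: process.env.TOGETHER_API_KEY, baseURL: 'https://api.together.xyz/v1' });
// Cheapest first; escalate only when the check rejects the answer
const CASCADE = ['meta-llama/Llama-3.2-3B-Instruct-Turbo', 'meta-llama/Llama-3.3-70B-Instruct-Turbo'];
async function cascadeComplete(prompt: string, qualityCheck: (text: string) => boolean): Promise<string> {
  let last = '';
  for (const model of CASCADE) {
    const res = await together.chat.completions.create({
      model, messages: [{ role: 'user', content: prompt }], max_tokens: 500,
    });
    last = res.choices[0].message.content ?? '';
    if (qualityCheck(last)) return last; // good enough: stop before paying for the larger model
  }
  return last; // largest model's answer as the final fallback
}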
Error Handling
| Issue | Cause | Solution |
| --- | --- | --- |
| High costs | Wrong model tier | Downsize model |
| Batch failures | Invalid input format | Validate JSONL |
| Fine-tuning expensive | Too many epochs | Start with 1-2 epochs |
Resources
Next Steps
For architecture patterns, see together-reference-architecture.
Together AI debug bundle for inference, fine-tuning, and model deployment.
Together AI Debug Bundle
Overview
Guidance for debug bundle with Together AI inference and fine-tuning API.
Instructions
Key Points
- Together AI is OpenAI-compatible: base_url = 'https://api.together.xyz/v1'
- Use the together Python SDK or any OpenAI client library
- Supports 100+ open-source models (Llama, Mixtral, Qwen, FLUX)
- Fine-tuning available for supported models
- Batch inference at 50% cost reduction
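None of the points above ship code, so here is one way to actually capture a debug bundle: wrap each raw API call and append the request, status, latency, and response to a local JSONL file. A sketch (the file name and field layout are our own choices):
import fs from 'node:fs';
async function debugCall(path: string, body: unknown): Promise<unknown> {
  const started = Date.now();
  const res = await fetch(`https://api.together.xyz/v1${path}`, {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  const json = await res.json();
  // One bundle entry per call; headers are never logged, so the API key stays out of the file
  fs.appendFileSync('together-debug.jsonl', JSON.stringify({
    path, request: body, status: res.status, latencyMs: Date.now() - started, response: json,
  }) + '\n');
  return json;
}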
Error Handling
| Error | Cause | Solution |
| --- | --- | --- |
| 401 Unauthorized | Invalid API key | Check at api.together.xyz |
| Model not found | Wrong model ID | Use client.models.list() |
| 429 Rate limit | Too many requests | Implement backoff |
| 500 Server error | Model overloaded | Retry with backoff |
Resources
Next Steps
See related Together AI skills for more patterns.
Together AI deploy integration for inference, fine-tuning, and model deployment.
Together AI Deploy Integration
Overview
Deploy a containerized Together AI inference integration service with Docker. This skill covers building a production image that connects to Together's OpenAI-compatible API for running completions, embeddings, and image generation across 100+ open-source models. Includes environment configuration for model selection and batch processing, health checks that verify API key validity and model availability, and rolling update strategies for zero-downtime deployments serving real-time inference requests.
Docker Configuration
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt
FROM python:3.12-slim
# Install curl for the HEALTHCHECK below; slim images do not ship it
RUN apt-get update && apt-get install -y --no-install-recommends curl && rm -rf /var/lib/apt/lists/*
RUN groupadd -r app && useradd -r -g app app
WORKDIR /app
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY src/ ./src/
USER app
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD curl -f http://localhost:8000/health || exit 1
CMD ["python", "src/server.py"]
Environment Variables
export TOGETHER_API_KEY="tog_xxxxxxxxxxxx"
export TOGETHER_BASE_URL="https://api.together.xyz/v1"
export TOGETHER_DEFAULT_MODEL="meta-llama/Llama-3.1-8B-Instruct"
export TOGETHER_MAX_TOKENS="2048"
export LOG_LEVEL="info"
export PORT="8000"
Health Check Endpoint
import express from 'express';
const app = express();
app.get('/health', async (req, res) => {
try {
const response = await fetch(`${process.env.TOGETHER_BASE_URL}/models`, {
headers: { 'Authorization': `Bearer ${process.env.TOGETHER_API_KEY}` },
});
if (!response.ok) throw new Error(`Together API returned ${response.status}`);
res.json({ status: 'healthy', service: 'together-integration', model: process.env.TOGETHER_DEFAULT_MODEL, timestamp: new Date().toISOString() });
} catch (error) {
res.status(503).json({ status: 'unhealthy', error: (error as Error).message });
}
});
Deployment Steps
Step 1: Build
docker build -t together-integration:latest .
Step 2: Run
docker run -d --name together-integration \
-p 8000:8000 \
-e TOGETHER_API_KEY -e TOGETHER_BASE_URL -e TOGETHER_DEFAULT_MODEL \
together-integration:latest
Step 3: Verify
curl -s http://localhost:8000/health | jq .
Step 4: Rolling Update
docker build -t together-integration:v2 . && \
docker stop together-integration && \
docker rm together-integration && \
docker run -d --name together-integration \
  -p 8000:8000 \
  -e TOGETHER_API_KEY -e TOGETHER_BASE_URL -e TOGETHER_DEFAULT_MODEL \
  together-integration:v2
Run inference with Together AI -- chat completions, streaming, and model selection.
Together AI Hello World
Overview
Run chat completions with open-source models via Together AI's OpenAI-compatible API. Supports Llama, Mixtral, Qwen, and 100+ models. Key endpoints: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/images/generations.
Instructions
Step 1: Chat Completions
from together import Together
client = Together()
response = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
messages=[
{"role": "system", "content": "You are a helpful coding assistant."},
{"role": "user", "content": "Write a Python function to calculate fibonacci numbers"},
],
max_tokens=500,
temperature=0.7,
top_p=0.9,
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} in, {response.usage.completion_tokens} out")
Step 2: Streaming
stream = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
messages=[{"role": "user", "content": "Explain quantum computing"}],
stream=True,
max_tokens=200,
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Step 3: Image Generation
response = client.images.generate(
model="black-forest-labs/FLUX.1-schnell-Free",
prompt="A sunset over mountains, digital art style",
width=1024, height=768,
n=1,
)
print(f"Image URL: {response.data[0].url}")
Step 4: Embeddings
response = client.embeddings.create(
model="togethercomputer/m2-bert-80M-8k-retrieval",
input=["Hello world", "Together AI is great"],
)
print(f"Embedding dim: {len(response.data[0].embedding)}")
Step 5: Node.js with OpenAI Client
import OpenAI from 'openai';
const together = new OpenAI({
apiKey: process.env.TOGETHER_API_KEY,
baseURL: 'https://api.together.xyz/v1',
});
const chat = await together.chat.completions.create({
model: 'meta-llama/Llama-3.3-70B-Instruct-Turbo',
messages: [{ role: 'user', content: 'Hello!' }],
});
console.log(chat.choices[0].message.content);
Output
def fibonacci(n):
if n <= 1:
return n
return fibonacci(n-1) + fibonacci(n-2)
Tokens: 28 in, 45 out
Error Handling
| Error | Cause | Solution |
| --- | --- | --- |
| Model not found | Wrong model ID | Use client.models.list() to verify |
Install Together AI SDK and configure API key for inference and fine-tuning.
Together AI Install & Auth
Overview
Together AI provides an OpenAI-compatible API for open-source model inference and fine-tuning. Base URL: https://api.together.xyz/v1. Works with the official together Python SDK or any OpenAI-compatible client.
Prerequisites
- Together AI account at api.together.xyz
- API key from Settings > API Keys
- Python 3.8+ or Node.js 18+
Instructions
Step 1: Install SDK
# Python (official)
pip install together
# Node.js (use OpenAI SDK with custom base URL)
npm install openai
Step 2: Configure API Key
# .env
TOGETHER_API_KEY=your-api-key-here
Step 3: Verify Connection (Python)
import os
from together import Together
client = Together(api_key=os.environ["TOGETHER_API_KEY"])
response = client.chat.completions.create(
model="meta-llama/Llama-3.3-70B-Instruct-Turbo",
messages=[{"role": "user", "content": "Say hello"}],
max_tokens=10,
)
print(f"Connected! Response: {response.choices[0].message.content}")
Step 4: Verify with OpenAI Client (Node.js)
import OpenAI from 'openai';
const client = new OpenAI({
apiKey: process.env.TOGETHER_API_KEY,
baseURL: 'https://api.together.xyz/v1',
});
const response = await client.chat.completions.create({
model: 'meta-llama/Llama-3.3-70B-Instruct-Turbo',
messages: [{ role: 'user', content: 'Say hello' }],
max_tokens: 10,
});
console.log(`Connected! ${response.choices[0].message.content}`);
Step 5: List Available Models
models = client.models.list()
for m in models.data[:5]:
print(f"{m.id} ({m.type})")
Error Handling
| Error | Cause | Solution |
| --- | --- | --- |
| 401 Unauthorized | Invalid API key | Check key at api.together.xyz |
| Model not found | Wrong model ID | Use client.models.list() to verify |
| ModuleNotFoundError | SDK not installed | pip install together |
| 429 Too Many Requests | Rate limit | Back off and retry |
Resources
Next Step
Together AI local dev loop for inference, fine-tuning, and model deployment.
Together AI Local Dev Loop
Overview
Local development workflow for Together AI inference API integration. Provides a fast feedback loop with mock chat completions, embeddings, and model listing endpoints so you can build AI-powered applications without consuming live API credits. Together AI is OpenAI-compatible, so the same client libraries work with both. Toggle between mock mode for rapid iteration and live mode for model evaluation.
Environment Setup
cp .env.example .env
# Set your credentials:
# TOGETHER_API_KEY=tog_xxxxxxxxxxxx
# TOGETHER_BASE_URL=https://api.together.xyz/v1
# MOCK_MODE=true
npm install express axios dotenv tsx typescript @types/node http-proxy-middleware
npm install -D vitest supertest @types/express
# Or for Python: pip install together openai httpx pytest
Dev Server
// src/dev/server.ts
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";
const app = express();
app.use(express.json());
const MOCK = process.env.MOCK_MODE === "true";
if (!MOCK) {
app.use("/v1", createProxyMiddleware({
target: process.env.TOGETHER_BASE_URL,
changeOrigin: true,
headers: { Authorization: `Bearer ${process.env.TOGETHER_API_KEY}` },
}));
} else {
  // Dynamic import keeps mock-only code out of live mode and works under ESM
  const { mountMockRoutes } = await import("./mocks");
  mountMockRoutes(app);
}
app.listen(3009, () => console.log(`Together dev server on :3009 [mock=${MOCK}]`));
Mock Mode
// src/dev/mocks.ts — OpenAI-compatible mock responses for inference
export function mountMockRoutes(app: any) {
app.post("/v1/chat/completions", (req: any, res: any) => res.json({
id: "chatcmpl-mock-001", object: "chat.completion", model: req.body.model || "meta-llama/Llama-3-70b-chat-hf",
choices: [{ index: 0, message: { role: "assistant", content: "This is a mock response from Together AI." }, finish_reason: "stop" }],
usage: { prompt_tokens: 25, completion_tokens: 12, total_tokens: 37 },
}));
app.post("/v1/embeddings", (req: any, res: any) => res.json({
object: "list", model: req.body.model || "togethercomputer/m2-bert-80M-8k-retrieval",
data: [{ object: "embedding", index: 0, embedding: Array(768).fill(0).map(() => Math.random() * 2 - 1) }],
}));
app.get("/v1/models", (_req: any, res: any) => res.json({
data: [
{ id: "meta-llama/Llama-3-70b-chat-hf", type: "chat", context_length: 8192 },
{ id: "mistralai/Mixtral-8x22B-Instruct-v0.1", type: "chat", context_length: 65536 },
{ id: "togethercomputer/m2-bert-80M-8k-retrieval", type: "embedding", context_length: 8192 },
],
}));
}
Testing Workflow
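A minimal test against the mock server, assuming the vitest and supertest dev dependencies installed above and the server running on :3009 (assertions mirror the mock routes):
// src/dev/server.test.ts
import { describe, expect, it } from 'vitest';
import request from 'supertest';
describe('mock Together endpoints', () => {
  it('returns an OpenAI-shaped chat completion', async () => {
    const res = await request('http://localhost:3009')
      .post('/v1/chat/completions')
      .send({ model: 'meta-llama/Llama-3-70b-chat-hf', messages: [{ role: 'user', content: 'hi' }] });
    expect(res.status).toBe(200);
    expect(res.body.choices[0].message.role).toBe('assistant');
    expect(res.body.usage.total_tokens).toBeGreaterThan(0);
  });
});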
Together AI performance tuning for inference, fine-tuning, and model deployment.
Together AI Performance Tuning
Overview
Guidance for performance tuning with Together AI inference and fine-tuning API.
Instructions
Key Points
- Together AI is OpenAI-compatible: base_url = 'https://api.together.xyz/v1'
- Use the together Python SDK or any OpenAI client library
- Supports 100+ open-source models (Llama, Mixtral, Qwen, FLUX)
- Fine-tuning available for supported models
- Batch inference at 50% cost reduction
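For throughput-bound workloads the main client-side lever is bounded concurrency: fan requests out in parallel while keeping in-flight calls under the rate limit. A sketch with a simple worker pool (the limit of 8 and the prompts are placeholders to tune per workload):
import OpenAI from 'openai';
const together = new OpenAI({ apiKey: process.env.TOGETHER_API_KEY, baseURL: 'https://api.together.xyz/v1' });
async function mapWithConcurrency<T, R>(items: T[], limit: number, fn: (item: T) => Promise<R>): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  // Each worker pulls the next unclaimed index until the list is drained
  const workers = Array.from({ length: limit }, async () => {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
const prompts = ['Summarize document A', 'Summarize document B']; // example inputs
const answers = await mapWithConcurrency(prompts, 8, (p) =>
  together.chat.completions.create({
    model: 'meta-llama/Llama-3.3-70B-Instruct-Turbo',
    messages: [{ role: 'user', content: p }],
    max_tokens: 256,
  }).then(r => r.choices[0].message.content ?? '')
);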
Error Handling
| Error | Cause | Solution |
| --- | --- | --- |
| 401 Unauthorized | Invalid API key | Check at api.together.xyz |
| Model not found | Wrong model ID | Use client.models.list() |
| 429 Rate limit | Too many requests | Implement backoff |
| 500 Server error | Model overloaded | Retry with backoff |
Resources
Next Steps
See related Together AI skills for more patterns.
Together AI prod checklist for inference, fine-tuning, and model deployment.
Together AI Production Checklist
Overview
Together AI provides OpenAI-compatible inference across 100+ open-source models (Llama, Mixtral, Qwen, FLUX) plus fine-tuning and batch processing. A production integration routes completions, embeddings, or image generation through Together's API. Failures mean inference latency spikes, model availability gaps, or unexpected cost overruns from uncontrolled batch jobs.
Authentication & Secrets
- [ ] TOGETHER_API_KEY stored in secrets manager (not source code)
- [ ] API key restricted to production workspace
- [ ] Key rotation schedule documented (90-day cycle)
- [ ] Separate keys for dev/staging/prod environments
- [ ] Fine-tuning job tokens scoped separately from inference tokens
API Integration
- [ ] Production base URL configured (https://api.together.xyz/v1)
- [ ] Rate limit handling with exponential backoff
- [ ] Model IDs validated against client.models.list() before deployment
- [ ] Completion streaming implemented for real-time use cases
- [ ] Embedding batch size optimized (max 2048 inputs per request)
- [ ] Batch inference configured for non-real-time workloads (50% cost savings)
- [ ] Fallback model configured if primary model is unavailable (see the sketch below)
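A sketch of the fallback-model item above, assuming the OpenAI-compatible client; the trigger conditions and model pair are illustrative:
import OpenAI from 'openai';
const together = new OpenAI({ apiKey: process.env.TOGETHER_API_KEY, baseURL: 'https://api.together.xyz/v1' });
const PRIMARY = 'meta-llama/Llama-3.3-70B-Instruct-Turbo';
const FALLBACK = 'mistralai/Mixtral-8x7B-Instruct-v0.1';
async function completeWithFallback(messages: Array<{ role: 'system' | 'user' | 'assistant'; content: string }>) {
  try {
    return await together.chat.completions.create({ model: PRIMARY, messages, max_tokens: 512 });
  } catch (err: any) {
    // Fall back only on availability errors; auth and validation problems should surface
    if (err.status === 404 || err.status >= 500) {
      return await together.chat.completions.create({ model: FALLBACK, messages, max_tokens: 512 });
    }
    throw err;
  }
}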
Error Handling & Resilience
- [ ] Circuit breaker configured for Together API outages
- [ ] Retry with backoff for 429/5xx responses
- [ ] Model-not-found errors caught before user-facing requests
- [ ] Token usage tracked per request to prevent budget overruns
- [ ] Fine-tuning job failure alerts configured
- [ ] Timeout handling for long-running generation requests (>30s)
Monitoring & Alerting
- [ ] API latency tracked per model and endpoint (chat, embeddings, images)
- [ ] Error rate alerts set (threshold: >5% over 5 minutes)
- [ ] Token consumption monitored against daily/monthly budget caps
- [ ] Model availability checked (Together status page integration)
- [ ] Batch job completion rate tracked
Validation Script
async function checkTogetherReadiness(): Promise<void> {
const checks: { name: string; pass: boolean; detail: string }[] = [];
// API connectivity
try {
const res = await fetch('https://api.together.xyz/v1/models', {
headers: { Authorization: `Bearer ${process.env.TOGETHER_API_KEY}` },
});
checks.push({ name: 'Together API', pass: res.ok, detail: res.ok ? 'Connected' : `HTTP ${res.status}` });
} catch (e: any) { checks.push({ name: 'Together API', pass: false, detail: e.message }); }
// Credentials present
checks.push({ name: 'API Key Set', pass: !!process.env.TOGETHER_API_KEY, detail: process.env.TOGETHER_API_KEY ? 'Present' : 'Missing' });
  // Report results
  checks.forEach(c => console.log(`${c.pass ? 'PASS' : 'FAIL'} ${c.name}: ${c.detail}`));
}
Together AI rate limits for inference, fine-tuning, and model deployment.
Together AI Rate Limits
Overview
Together AI's OpenAI-compatible inference API enforces per-key rate limits that vary by model tier and operation type. Chat completions and embeddings share a global request quota, while fine-tuning jobs and batch inference have separate concurrency caps. High-throughput workloads like embedding entire document corpora or running evaluations across 100+ prompts require client-side token bucket limiting. Together's batch inference endpoint offers 50% cost savings but has its own queue depth limits that differ from real-time inference.
Rate Limit Reference
| Endpoint | Limit | Window | Scope |
| --- | --- | --- | --- |
| Chat completions | 600 req | 1 minute | Per API key |
| Embeddings | 300 req | 1 minute | Per API key |
| Image generation (FLUX) | 60 req | 1 minute | Per API key |
| Fine-tune jobs (concurrent) | 3 jobs | Rolling | Per API key |
| Batch inference | 100 req/batch, 10 batches | Rolling | Per API key |
Rate Limiter Implementation
class TogetherRateLimiter {
private tokens: number;
private lastRefill: number;
private readonly max: number;
private readonly refillRate: number;
private queue: Array<{ resolve: () => void }> = [];
constructor(maxPerMinute: number) {
this.max = maxPerMinute;
this.tokens = maxPerMinute;
this.lastRefill = Date.now();
this.refillRate = maxPerMinute / 60_000;
}
async acquire(): Promise<void> {
  this.refill();
  if (this.tokens >= 1) { this.tokens -= 1; return; }
  return new Promise(resolve => {
    this.queue.push({ resolve });
    // Without this timer, queued waiters would only be released by a later acquire() call
    setTimeout(() => this.refill(), Math.ceil(1 / this.refillRate));
  });
}
private refill() {
const now = Date.now();
this.tokens = Math.min(this.max, this.tokens + (now - this.lastRefill) * this.refillRate);
this.lastRefill = now;
while (this.tokens >= 1 && this.queue.length) {
this.tokens -= 1;
this.queue.shift()!.resolve();
}
}
}
const chatLimiter = new TogetherRateLimiter(500); // buffer under 600
const embedLimiter = new TogetherRateLimiter(250);
Retry Strategy
async function togetherRetry<T>(
limiter: TogetherRateLimiter, fn: () => Promise<Response>, maxRetries = 4
): Promise<T> {
for (let attempt = 0; attempt <= maxRetries; attempt++) {
await limiter.acquire();
const res = await fn();
if (res.ok) return res.json();
if (res.status === 429) {
const retryAfter = parseInt(res.headers.get("Retry-After") || "5", 10);
const jitter = Math.random() * 2000;
await new Promise(r => setTimeout(r, retryAfter * 1000 + jitter));
continue;
}
if (res.status >= 500) {
      // Transient server error: back off exponentially before the next attempt
      await new Promise(r => setTimeout(r, 2 ** attempt * 1000));
      continue;
    }
    throw new Error(`Together API error ${res.status}`);
  }
  throw new Error(`Together request failed after ${maxRetries} retries`);
}
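Usage pairs the limiter with the retry wrapper, so every request first takes a token and then retries on 429/5xx (request body as in the inference sections):
const completion = await togetherRetry<any>(chatLimiter, () =>
  fetch('https://api.together.xyz/v1/chat/completions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.TOGETHER_API_KEY}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'meta-llama/Llama-3.3-70B-Instruct-Turbo',
      messages: [{ role: 'user', content: 'Hello' }],
    }),
  })
);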
Together AI reference architecture for inference, fine-tuning, and model deployment.
Together AI Reference Architecture
Overview
Production architecture for AI inference, fine-tuning, and batch processing with Together AI's OpenAI-compatible API. Designed for teams routing requests across 100+ open-source models (Llama, Mixtral, Qwen, FLUX) with intelligent model selection, response caching, fine-tune pipeline management, and cost optimization via batch inference at 50% discount. Key design drivers: model routing for cost/quality tradeoffs, inference caching for repeated queries, fine-tune lifecycle management, and graceful degradation across model providers.
Architecture Diagram
Application ──→ Model Router ──→ Cache (Redis) ──→ Together API (v1)
↓ /chat/completions
Queue (Bull) ──→ Batch Worker /completions
↓ /images/generations
Fine-Tune Manager ──→ Together API /fine-tunes
↓ /models
Cost Tracker ──→ Analytics Dashboard
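The Model Router in the diagram is where the cost/quality tradeoff is decided. A minimal sketch of one routing policy; the task names, priorities, and model table are assumptions for illustration:
type Task = 'chat' | 'summarize' | 'code';
type Priority = 'low' | 'high' | 'batch';
class ModelRouter {
  // Cheap default per task, with a premium upgrade for high-priority traffic
  private table: Record<Task, { standard: string; premium: string }> = {
    chat: { standard: 'meta-llama/Llama-3.2-3B-Instruct-Turbo', premium: 'meta-llama/Llama-3.3-70B-Instruct-Turbo' },
    summarize: { standard: 'meta-llama/Llama-3.2-3B-Instruct-Turbo', premium: 'mistralai/Mixtral-8x7B-Instruct-v0.1' },
    code: { standard: 'Qwen/Qwen2.5-72B-Instruct-Turbo', premium: 'Qwen/Qwen2.5-72B-Instruct-Turbo' },
  };
  selectModel(task: Task, priority: Priority): string {
    const entry = this.table[task];
    // Batch and low-priority traffic take the cheap model; 'high' pays for quality
    return priority === 'high' ? entry.premium : entry.standard;
  }
}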
Service Layer
import crypto from 'node:crypto';
class InferenceService {
  constructor(
    private together: TogetherClient,
    private cache: CacheLayer,
    private router: ModelRouter,
    private costTracker: CostTracker,
  ) {}
  async complete(request: InferenceRequest): Promise<InferenceResponse> {
    const model = this.router.selectModel(request.task, request.priority);
    const cacheKey = `inference:${model}:${this.hashPrompt(request.prompt)}`;
    const cached = await this.cache.get(cacheKey);
    if (cached && request.allowCached) return cached;
    const response = await this.together.chatCompletions({ model, messages: request.messages, temperature: request.temperature ?? 0.7 });
    await this.cache.set(cacheKey, response, CACHE_CONFIG.inference.ttl);
    await this.costTracker.record(model, response.usage);
    return response;
  }
  // Stable hash so identical prompts share a cache entry
  private hashPrompt(prompt: string): string {
    return crypto.createHash('sha256').update(prompt).digest('hex');
  }
  async submitBatch(requests: InferenceRequest[]): Promise<string> {
    const batchId = await this.together.createBatch(requests.map(r => ({
      model: this.router.selectModel(r.task, 'batch'), messages: r.messages })));
    return batchId; // 50% cost reduction for batch processing
  }
}
Caching Strategy
const CACHE_CONFIG = {
inference: { ttl: 3600, prefix: 'infer' }, // 1 hr — deterministic prompts (temp=0) cache well
embeddings: { ttl: 86400, prefix: 'embed' }, // 24 hr — embeddings are stable for same input
modelList: { ttl: 3600, prefix: 'models' }, // 1 hr — available models change infrequently
fineTune: { ttl: 60, prefix: 'ft' }, // 1 min — training status needs near-real-time
batchStatus: { ttl: 30, prefix: 'batch' }, // 30s — batch completion polling
};
// Cache only temp=0 responses by default; stochastic responses bypass cache unless explicitly opted in
Together AI sdk patterns for inference, fine-tuning, and model deployment.
Together AI SDK Patterns
Overview
Production-ready patterns for Together AI inference. Together exposes an OpenAI-compatible REST API at https://api.together.xyz/v1, meaning any OpenAI client library works with a base URL swap. This makes Together a drop-in replacement for OpenAI when running open-source models (Llama, Mixtral, Qwen, FLUX). A singleton client centralizes the base URL override and enables seamless backend switching.
Singleton Client
import OpenAI from 'openai';
let _client: OpenAI | null = null;
export function getClient(): OpenAI {
if (!_client) {
const apiKey = process.env.TOGETHER_API_KEY;
if (!apiKey) throw new Error('TOGETHER_API_KEY must be set — get it from api.together.xyz/settings');
_client = new OpenAI({ apiKey, baseURL: 'https://api.together.xyz/v1' });
}
return _client;
}
// Usage: const client = getClient();
// await client.chat.completions.create({ model: 'meta-llama/Meta-Llama-3.1-70B-Instruct', messages: [...] });
Error Wrapper
export class TogetherError extends Error {
constructor(public status: number, public code: string, message: string) { super(message); }
}
export async function safeCall<T>(operation: string, fn: () => Promise<T>): Promise<T> {
try { return await fn(); }
catch (err: any) {
const status = err.status ?? err.response?.status ?? 0;
if (status === 429) { await new Promise(r => setTimeout(r, 3000)); return fn(); }
if (status === 401) throw new TogetherError(401, 'AUTH', 'Invalid TOGETHER_API_KEY');
if (status === 404) throw new TogetherError(404, 'MODEL', `${operation}: model not found — use client.models.list()`);
throw new TogetherError(status, 'API_ERROR', `${operation} failed [${status}]: ${err.message}`);
}
}
Request Builder
class TogetherRequest {
private params: Record<string, any> = {};
model(m: string) { this.params.model = m; return this; }
messages(msgs: Array<{ role: string; content: string }>) { this.params.messages = msgs; return this; }
temperature(t: number) { this.params.temperature = t; return this; }
maxTokens(n: number) { this.params.max_tokens = n; return this; }
stream(s = true) { this.params.stream = s; return this; }
jsonMode() { this.params.response_format = { type: 'json_object' }; return this; }
build() { return this.params; }
}
// Usage: new TogetherRequest().model('meta-llama/Meta-Llama-3.1-70B-Instruct')
// .messages([{ role: 'user', content: 'Summarize this' }]).temperature(0.3).jsonMode().build();
Response Types
interface TogetherModel {
  id: string;
  type: 'chat' | 'language' | 'image' | 'embedding';
  context_length?: number;
}
Together AI security basics for inference, fine-tuning, and model deployment.
Together AI Security Basics
Overview
Together AI provides inference and fine-tuning for 100+ open-source models (Llama, Mixtral, Qwen, FLUX) via an OpenAI-compatible API. Security concerns include API key management for production inference, protecting fine-tuning datasets that may contain proprietary or sensitive data, rate limit handling to prevent cost overruns, and ensuring model outputs are not logged with sensitive prompt content. A leaked API key grants full access to inference, fine-tuning, and model management endpoints.
API Key Management
function createTogetherClient(): { apiKey: string; baseUrl: string } {
const apiKey = process.env.TOGETHER_API_KEY;
if (!apiKey) {
throw new Error("Missing TOGETHER_API_KEY — store in secrets manager, never in code");
}
// Together keys access inference + fine-tuning — treat as production credentials
console.log("Together AI client initialized (key suffix:", apiKey.slice(-4), ")");
return { apiKey, baseUrl: "https://api.together.xyz/v1" };
}
Webhook Signature Verification
import crypto from "crypto";
import { Request, Response, NextFunction } from "express";
function verifyTogetherWebhook(req: Request, res: Response, next: NextFunction): void {
  const signature = req.headers["x-together-signature"] as string | undefined;
  const secret = process.env.TOGETHER_WEBHOOK_SECRET!;
  const expected = crypto.createHmac("sha256", secret).update(req.body).digest("hex");
  // timingSafeEqual throws if buffer lengths differ, so check length first
  if (!signature || signature.length !== expected.length ||
      !crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expected))) {
    res.status(401).send("Invalid signature");
    return;
  }
  next();
}
Input Validation
import { z } from "zod";
const InferenceRequestSchema = z.object({
model: z.string().min(1).max(200),
messages: z.array(z.object({
role: z.enum(["system", "user", "assistant"]),
content: z.string().max(100_000),
})).min(1),
max_tokens: z.number().int().min(1).max(4096).default(512),
temperature: z.number().min(0).max(2).default(0.7),
stop: z.array(z.string()).max(4).optional(),
});
function validateInferenceRequest(data: unknown) {
return InferenceRequestSchema.parse(data);
}
Data Protection
const TOGETHER_SENSITIVE_FIELDS = ["api_key", "prompt_content", "fine_tune_dataset", "model_output", "system_prompt"];
function redactTogetherLog(record: Record<string, unknown>): Record<string, unknown> {
const redacted = { ...record };
for (const field of TOGETHER_SENSITIVE_FIELDS) {
if (field in redacted) redacted[field] = "[REDACTED]";
}
return redacted;
}
Security Checklist
Together AI upgrade migration for inference, fine-tuning, and model deployment.
Together AI Upgrade & Migration
Overview
Together AI provides an OpenAI-compatible inference platform hosting 100+ open-source models (Llama, Mixtral, Qwen, FLUX) with fine-tuning and batch inference capabilities. The API lives at api.together.xyz/v1 and follows OpenAI's chat completions format. Tracking model deprecations and API changes matters because Together regularly retires older model versions, updates model IDs when weights are refreshed, and changes fine-tuning job schemas — causing silent failures when a model ID that worked yesterday returns 404 today with no advance warning in the response.
Version Detection
const TOGETHER_BASE = "https://api.together.xyz/v1";
async function detectTogetherChanges(apiKey: string): Promise<void> {
// List available models and check for deprecations
const res = await fetch(`${TOGETHER_BASE}/models`, {
headers: { Authorization: `Bearer ${apiKey}` },
});
const data = await res.json();
const models = data.data ?? data;
// Check if commonly used models are still available
const trackedModels = [
"meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
"mistralai/Mixtral-8x7B-Instruct-v0.1",
"Qwen/Qwen2.5-72B-Instruct-Turbo",
];
for (const modelId of trackedModels) {
const available = models.some((m: any) => m.id === modelId);
if (!available) console.warn(`Model deprecated or renamed: ${modelId}`);
}
// Check API version headers
const version = res.headers.get("x-together-api-version");
if (version) console.log(`Together API version: ${version}`);
}
Migration Checklist
- [ ] Check Together model list for deprecated or renamed model IDs
- [ ] Update all hardcoded model ID strings in codebase
- [ ] Verify fine-tuning job creation schema (new required fields or parameter renames)
- [ ] Test chat completions response format for new fields (usage breakdown, etc.)
- [ ] Check if embeddings endpoint model list changed
- [ ] Validate batch inference job API for schema or pricing tier changes
- [ ] Update function calling format if tool use schema evolved
- [ ] Test streaming response for new SSE event types or finish reasons
- [ ] Verify image generation models (FLUX) for parameter changes
- [ ] Run cost comparison — pricing per token may change with model updates
Schema Migration
// Together model IDs change when model versions are updated
interface ModelMigration {
oldId: string;
newId: string;
breakingChanges: string[];
}
const MODEL_MIGRATIONS: ModelMigration[] = [
{
oldId: "togethercomputer/llama-2-70b-chat",
newId: "meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo",
breakingChanges: ["New chat template format", "Different default stop tokens"],
  },
];
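A small helper can then resolve configured model IDs through the migration table at startup, surfacing the breaking changes for review (the helper name is ours):
function resolveModelId(configured: string): string {
  const migration = MODEL_MIGRATIONS.find(m => m.oldId === configured);
  if (!migration) return configured;
  console.warn(`Migrating model ${configured} -> ${migration.newId}`);
  migration.breakingChanges.forEach(c => console.warn(`  breaking change: ${c}`));
  return migration.newId;
}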
Together AI webhooks events for inference, fine-tuning, and model deployment.
Together AI Webhooks & Events
Overview
Together AI delivers webhook callbacks for asynchronous operations including fine-tuning jobs, batch inference, and model lifecycle events. Subscribe to events for fine-tune completion, job failures, model deprecation notices, and batch processing status to build automated ML pipelines without polling the jobs API.
Webhook Registration
const response = await fetch("https://api.together.xyz/v1/webhooks", {
method: "POST",
headers: {
"Authorization": `Bearer ${process.env.TOGETHER_API_KEY}`,
"Content-Type": "application/json",
},
body: JSON.stringify({
url: "https://yourapp.com/webhooks/together",
events: ["fine_tune.completed", "fine_tune.failed", "model.deprecated", "batch.done"],
secret: process.env.TOGETHER_WEBHOOK_SECRET,
}),
});
Signature Verification
import crypto from "crypto";
import { Request, Response, NextFunction } from "express";
function verifyTogetherSignature(req: Request, res: Response, next: NextFunction) {
  const signature = req.headers["x-together-signature"] as string | undefined;
  const expected = crypto.createHmac("sha256", process.env.TOGETHER_WEBHOOK_SECRET!)
    .update(req.body).digest("hex");
  // Guard against missing or length-mismatched signatures before the constant-time compare
  if (!signature || signature.length !== expected.length ||
      !crypto.timingSafeEqual(Buffer.from(signature), Buffer.from(expected))) {
    return res.status(401).json({ error: "Invalid signature" });
  }
  next();
}
Event Handler
import express from "express";
const app = express();
app.post("/webhooks/together", express.raw({ type: "application/json" }), verifyTogetherSignature, (req, res) => {
const event = JSON.parse(req.body.toString());
res.status(200).json({ received: true });
switch (event.type) {
case "fine_tune.completed":
deployModel(event.data.fine_tune_id, event.data.model_name); break;
case "fine_tune.failed":
alertTeam(event.data.fine_tune_id, event.data.error_message); break;
case "model.deprecated":
migratePipelines(event.data.model_id, event.data.replacement_model); break;
case "batch.done":
collectResults(event.data.batch_id, event.data.output_url); break;
}
});
Event Types
| Event | Payload Fields | Use Case |
| --- | --- | --- |
| fine_tune.completed | fine_tune_id, model_name, eval_loss | Auto-deploy fine-tuned model |
| fine_tune.failed | fine_tune_id, error_message, step | Alert team and retry with adjusted parameters |
| model.deprecated | model_id, replacement_model | Migrate pipelines to the replacement model |
| batch.done | batch_id, output_url | Collect batch results from the output URL |