
Incident Debugging Playbook

SEV-1/2/3/4 incident response protocols. Log analysis, root cause investigation (5 Whys, Fishbone), postmortem templates, and on-call procedures.

~25 min read · 5,000 words · Production-Ready

Production Playbook for DevOps and Plugin Maintainers

Debugging production incidents in multi-agent Claude Code workflows requires systematic approaches to log analysis, root cause identification, and rapid remediation. This playbook provides battle-tested debugging techniques, incident response workflows, postmortem templates, and real-world examples of common failure modes.

Incident Classification

Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV-1 | Production down | Immediate | All agents failing, API completely offline |
| SEV-2 | Major degradation | 15 minutes | 50%+ error rate, critical features broken |
| SEV-3 | Minor degradation | 1 hour | Intermittent failures, single plugin broken |
| SEV-4 | Cosmetic issues | 24 hours | UI bugs, non-critical warnings |
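
The thresholds in this table can be encoded so alerting code classifies incidents automatically. A minimal sketch (the exact cutoffs are illustrative assumptions; tune them to your own SLOs):

// Hypothetical classifier derived from the severity table above.
// The 50% and 5% cutoffs are illustrative, not official values.
function classifySeverity(
  errorRate: number,
  serviceDown: boolean
): 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4' {
  if (serviceDown) return 'SEV-1';        // Production down
  if (errorRate >= 0.5) return 'SEV-2';   // Major degradation
  if (errorRate >= 0.05) return 'SEV-3';  // Minor degradation
  return 'SEV-4';                         // Cosmetic issues
}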

Common Incident Types

enum IncidentType {
  API_FAILURE = 'api_failure',           // Claude API unreachable
  RATE_LIMIT = 'rate_limit',             // 429 errors from API
  TIMEOUT = 'timeout',                    // Agent/tool timeouts
  MEMORY_LEAK = 'memory_leak',           // Process memory exhaustion
  PLUGIN_CRASH = 'plugin_crash',         // Plugin process died
  DATA_CORRUPTION = 'data_corruption',   // Invalid data in DB/cache
  PERFORMANCE = 'performance',           // Slow response times
  AUTHENTICATION = 'authentication'      // Auth failures
}

interface Incident {
  id: string;
  severity: 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';
  type: IncidentType;
  startTime: number;
  affectedUsers: number;
  errorRate: number;
  description: string;
}
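
As a concrete example, the SEV-2 rate-limit incident used throughout this playbook would be recorded roughly like this (the id scheme is an illustrative assumption; the other values mirror the communication template below):

const incident: Incident = {
  id: 'inc-2025-12-24-001', // Hypothetical ID scheme
  severity: 'SEV-2',
  type: IncidentType.RATE_LIMIT,
  startTime: Date.parse('2025-12-24T14:35:00Z'),
  affectedUsers: 1200,
  errorRate: 0.68,
  description: 'Agent execution failing with 429 errors'
};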

Initial Response Protocol

First 5 Minutes (SEV-1/SEV-2)

Step 1: Assess Impact

# Check current error rate
tail -n 1000 /var/log/claude-code.log | grep -c ERROR

# Check affected users
grep "ERROR" /var/log/claude-code.log | awk '{print $5}' | sort -u | wc -l

# Check service health
curl http://localhost:3333/api/status

Step 2: Check Obvious Issues

import { exec } from 'child_process';
import { promisify } from 'util';

const execAsync = promisify(exec);

// Quick health check script
async function quickHealthCheck(): Promise<{ healthy: boolean; issues: string[] }> {
  const issues: string[] = [];

  // 1. Check Claude API connectivity
  try {
    const response = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'x-api-key': process.env.ANTHROPIC_API_KEY!,
        'anthropic-version': '2023-06-01',
        'content-type': 'application/json'
      },
      body: JSON.stringify({
        model: 'claude-3-5-haiku-20241022',
        messages: [{ role: 'user', content: 'test' }],
        max_tokens: 10
      })
    });
    if (!response.ok) issues.push('Claude API unreachable');
  } catch (error) {
    issues.push('Network connectivity issue');
  }

  // 2. Check disk space
  const { stdout } = await execAsync("df -h / | tail -1 | awk '{print $5}' | sed 's/%//'");
  if (parseInt(stdout) > 90) issues.push('Disk space critical');

  // 3. Check memory
  const memUsage = process.memoryUsage();
  if (memUsage.heapUsed / memUsage.heapTotal > 0.9) issues.push('Memory exhaustion');

  return { healthy: issues.length === 0, issues };
}

Step 3: Stabilize (if possible)

# Restart failed services
systemctl restart claude-code-daemon
pm2 restart all

# Clear cache if corrupted
redis-cli FLUSHALL

# Rate limit protection
iptables -A INPUT -p tcp --dport 80 -m limit --limit 25/minute --limit-burst 100 -j ACCEPT

Communication Template

# Incident Alert: [TITLE]

Severity: SEV-2
Status: Investigating
Started: 2025-12-24 14:35 UTC
Affected: ~1,200 users (15% of total)

Current Impact

  • Agent execution failing with 429 errors
  • Error rate: 68% (normal: <1%)
  • No data loss

Actions Taken

  • ✅ Identified rate limit exhaustion (14:40)
  • ✅ Implemented emergency rate limiting (14:42)
  • 🔄 Monitoring recovery (14:45)

Next Update

In 15 minutes or when resolved.
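
Status updates stay consistent when generated from the incident record itself rather than written by hand. A rough sketch, assuming the Incident interface defined earlier (formatIncidentAlert and the totalUsers parameter are illustrative additions, not part of the original interface):

// Hypothetical formatter: render a status update from an Incident.
function formatIncidentAlert(incident: Incident, totalUsers: number): string {
  const pct = ((incident.affectedUsers / totalUsers) * 100).toFixed(0);
  return [
    `# Incident Alert: ${incident.description}`,
    '',
    `Severity: ${incident.severity}`,
    'Status: Investigating',
    `Started: ${new Date(incident.startTime).toISOString()}`,
    `Affected: ~${incident.affectedUsers} users (${pct}% of total)`,
    `Error rate: ${(incident.errorRate * 100).toFixed(0)}% (normal: <1%)`
  ].join('\n');
}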

Common Failure Modes

1. Rate Limit Exhaustion

Symptoms:

Error 429: Rate limit exceeded
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2025-12-24T15:00:00Z

Diagnosis:

async function diagnoseRateLimits(): Promise<void> {
  // Check recent API calls
  const recentCalls = await queryLogs('SELECT COUNT(*) FROM api_calls WHERE timestamp > NOW() - INTERVAL 1 MINUTE');
  console.log(`API calls in last minute: ${recentCalls}`);

  // Check rate limit headers from last successful call
  const lastHeaders = await getLastAPIHeaders();
  console.log('Remaining requests:', lastHeaders['anthropic-ratelimit-requests-remaining']);
  console.log('Reset time:', lastHeaders['anthropic-ratelimit-requests-reset']);
}

Fix:

// Implement token bucket rate limiter
const sleep = (ms: number) => new Promise(resolve => setTimeout(resolve, ms));

class EmergencyRateLimiter {
  private tokens = 50; // Match API tier
  private lastRefill = Date.now();

  async throttle(): Promise<void> {
    this.refill();
    while (this.tokens < 1) {
      await sleep(100);
      this.refill();
    }
    this.tokens--;
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const tokensToAdd = elapsed * (50 / 60); // 50 tokens per minute
    this.tokens = Math.min(50, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }
}
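
During an incident, a limiter like this would typically gate every outbound API call. An illustrative usage sketch (callClaudeThrottled is a hypothetical wrapper):

// Gate every outbound Claude API call through the emergency limiter.
const limiter = new EmergencyRateLimiter();

async function callClaudeThrottled(payload: unknown): Promise<Response> {
  await limiter.throttle(); // Blocks until a token is available
  return fetch('https://api.anthropic.com/v1/messages', {
    method: 'POST',
    headers: {
      'x-api-key': process.env.ANTHROPIC_API_KEY!,
      'anthropic-version': '2023-06-01',
      'content-type': 'application/json'
    },
    body: JSON.stringify(payload)
  });
}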

2. Agent Timeout

Symptoms:

Error: Agent execution timed out after 300000ms
Task: code-review
Conversation: abc-123-def

Diagnosis:

# Check for hung processes
ps aux | grep claude | grep -v grep

# Check system load
uptime
# Output: load average: 12.5, 8.3, 5.2 (CPU overload!)

# Check for blocking I/O
iotop -o -d 5

Fix:

// Implement aggressive timeouts
class TimeoutManager {
  async executeWithTimeout<T>(
    fn: () => Promise<T>,
    timeoutMs: number
  ): Promise<T> {
    return Promise.race([
      fn(),
      new Promise<T>((_, reject) =>
        setTimeout(() => reject(new Error(`Timeout after ${timeoutMs}ms`)), timeoutMs)
      )
    ]);
  }
}

// Usage
const timeout = new TimeoutManager();
const result = await timeout.executeWithTimeout(
  () => agent.execute(task),
  30000 // 30 second hard limit
);

3. Memory Leak

Symptoms:

# Memory usage climbing over time
free -m
#              total   used   free
# Mem:         16384  15892    492  # Critical!

# Process memory
ps aux --sort=-%mem | head -5
# claude-daemon: 8.2GB (!)

Diagnosis:

// Track memory usage over time
setInterval(() => {
  const usage = process.memoryUsage();
  console.log(JSON.stringify({
    timestamp: Date.now(),
    heapUsed: usage.heapUsed / 1024 / 1024, // MB
    heapTotal: usage.heapTotal / 1024 / 1024,
    external: usage.external / 1024 / 1024,
    rss: usage.rss / 1024 / 1024
  }));

  // Trigger GC if usage > 80%
  if (usage.heapUsed / usage.heapTotal > 0.8) {
    global.gc(); // Requires --expose-gc flag
  }
}, 60000); // Every minute

Common Causes:

// ❌ Leak: Global cache never cleared
const cache = new Map();
function addToCache(key: string, value: any) {
  cache.set(key, value); // Grows forever!
}

// ✅ Fix: LRU cache with size limit
import LRU from 'lru-cache';
const cache = new LRU<string, any>({ max: 1000 });

4. Plugin Crash Loop

Symptoms:

# PM2 showing rapid restarts
pm2 status
# plugin-server | errored | 47 restarts in 2 minutes

# Logs show crash
tail -f /var/log/pm2/plugin-server-error.log
# Error: ECONNREFUSED 127.0.0.1:5432
# (PostgreSQL connection failed)

Diagnosis:

# Check dependencies
docker ps | grep postgres
# (empty - PostgreSQL container not running!)

# Check network
netstat -tulpn | grep 5432
# (no listener on port 5432)

Fix:

# Restart dependency
docker-compose up -d postgres

# Verify connectivity
psql -h localhost -U user -d database -c "SELECT 1"

# Restart plugin
pm2 restart plugin-server


Debugging Techniques

1. Binary Search Debugging

Problem: Unknown change broke production

# Use git bisect to find breaking commit
git bisect start
git bisect bad HEAD              # Current version is broken
git bisect good v1.2.0           # Last known good version

# Git will check out commits for testing
# Test each commit:
npm install && npm run build && npm test

# Mark results
git bisect good   # if tests pass
git bisect bad    # if tests fail

# Git will find the exact breaking commit

2. Correlation Analysis

Find patterns in failures:

interface FailureEvent {
  timestamp: number;
  errorType: string;
  userId?: string;
  pluginName?: string;
  duration: number;
}

function analyzeFailureCorrelations(failures: FailureEvent[]): void {
  // Group by time windows
  const byHour = groupBy(failures, f => Math.floor(f.timestamp / 3600000));

  // Find spike times
  const spikes = Object.entries(byHour)
    .filter(([_, events]) => events.length > 100)
    .map(([hour, events]) => ({
      hour: new Date(parseInt(hour) * 3600000),
      count: events.length,
      topError: mode(events.map(e => e.errorType))
    }));

  console.log('Failure spikes:', spikes);

  // Find common attributes
  const byPlugin = groupBy(failures, f => f.pluginName);
  const suspiciousPlugin = Object.entries(byPlugin)
    .sort((a, b) => b[1].length - a[1].length)[0];

  console.log(`Most failures from plugin: ${suspiciousPlugin[0]} (${suspiciousPlugin[1].length} errors)`);
}
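
The snippet above assumes groupBy and mode helpers that are not shown. Minimal sketches of what they might look like (not from the original source):

// Group items by a key function (assumed helper).
function groupBy<T>(
  items: T[],
  keyFn: (item: T) => string | number | undefined
): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const item of items) {
    const key = String(keyFn(item));
    (groups[key] ??= []).push(item);
  }
  return groups;
}

// Most frequent value in a list (the statistical mode).
function mode(values: string[]): string | undefined {
  const counts = new Map<string, number>();
  for (const v of values) counts.set(v, (counts.get(v) ?? 0) + 1);
  return [...counts.entries()].sort((a, b) => b[1] - a[1])[0]?.[0];
}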

3. Distributed Tracing

Track request across services:

import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('claude-code');

async function executeAgent(agentName: string, task: any): Promise<any> {
  const span = tracer.startSpan('agent.execute', {
    attributes: { 'agent.name': agentName, 'task.id': task.id }
  });

  try {
    // Execute agent logic
    const result = await agent.run(task);

    span.setStatus({ code: SpanStatusCode.OK });
    span.setAttribute('result.success', true);

    return result;
  } catch (error) {
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    span.recordException(error);
    throw error;
  } finally {
    span.end();
  }
}


Log Analysis

Parsing Claude Code Logs

Log Format:

[2025-12-24T14:35:22.123Z] [ERROR] [agent:code-review] Rate limit exceeded
  conversationId: abc-123-def
  userId: user-456
  errorCode: 429
  retryAfter: 12
  stack: Error: Rate limit exceeded
    at callClaude (/app/src/api.ts:45:11)

Analysis Script:

import { readFileSync } from 'fs';

interface LogEntry {
  timestamp: Date;
  level: 'ERROR' | 'WARN' | 'INFO';
  component: string;
  message: string;
  metadata: Record<string, any>;
}

function parseLog(entry: string): LogEntry | null {
  const match = entry.match(/\[(.*?)\] \[(.*?)\] \[(.*?)\] (.*)/s);
  if (!match) return null;

  const [, timestamp, level, component, rest] = match;
  const lines = rest.split('\n');
  const message = lines[0];

  // Parse metadata
  const metadata: Record<string, any> = {};
  for (const line of lines.slice(1)) {
    const metaMatch = line.match(/^\s*(\w+): (.+)$/);
    if (metaMatch) {
      const [, key, value] = metaMatch;
      metadata[key] = value;
    }
  }

  return { timestamp: new Date(timestamp), level: level as any, component, message, metadata };
}

function analyzeLogs(logPath: string): void {
  const content = readFileSync(logPath, 'utf-8');
  const logs = content.split('\n')
    .map(parseLog)
    .filter(Boolean) as LogEntry[];

  // Error rate by component
  const errorsByComponent = groupBy(
    logs.filter(l => l.level === 'ERROR'),
    l => l.component
  );

  console.log('Errors by component:');
  Object.entries(errorsByComponent)
    .sort((a, b) => b[1].length - a[1].length)
    .forEach(([component, errors]) => {
      console.log(`  ${component}: ${errors.length}`);
    });

  // Recent errors (last 5 minutes)
  const recentErrors = logs.filter(l =>
    l.level === 'ERROR' &&
    Date.now() - l.timestamp.getTime() < 300000
  );

  console.log(`\nRecent errors: ${recentErrors.length}`);
  recentErrors.slice(0, 10).forEach(err => {
    console.log(`  ${err.timestamp.toISOString()} - ${err.message}`);
  });
}

Using Analytics Daemon

// Query analytics daemon for incident patterns
const ws = new WebSocket('ws://localhost:3456');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  // Track rate limit warnings
  if (data.type === 'rate_limit.warning') {
    console.warn(`⚠️ Rate limit approaching: ${data.current}/${data.limit}`);
  }

  // Track errors
  if (data.type === 'llm.call' && data.error) {
    console.error(`❌ LLM call failed: ${data.error}`);
  }
};

// Query historical data
const response = await fetch('http://localhost:3333/api/sessions');
const sessions = await response.json();
const failedSessions = sessions.filter(s => s.errorCount > 0);

console.log(`Failed sessions: ${failedSessions.length}/${sessions.length}`);

Root Cause Analysis

The 5 Whys Method

Example: Agent Timeout Incident

  1. Why did the agent timeout?
     → Because it took > 300 seconds to respond
  2. Why did it take so long?
     → Because the Claude API call was slow (280s)
  3. Why was the API call slow?
     → Because we sent a 50,000 token prompt
  4. Why did we send such a large prompt?
     → Because the code-reviewer agent included the entire codebase in context
  5. Why did it include the entire codebase?
     → Root Cause: The file globbing pattern **/* matched all files, including node_modules (500MB)

Fix: Update file globbing to exclude node_modules

// Before: includes everything
const files = glob.sync('**/*');

// After: exclude dependencies
const files = glob.sync('**/*', {
  ignore: ['node_modules/**', '.git/**', 'dist/**']
});

Fishbone Diagram (Ishikawa)

interface RootCauseAnalysis {
  problem: string;
  categories: {
    people?: string[];
    process?: string[];
    technology?: string[];
    environment?: string[];
  };
  rootCause: string;
  fix: string;
}

const analysis: RootCauseAnalysis = {
  problem: 'Agent timeout causing 68% error rate',
  categories: {
    people: [
      'Developer added file globbing without testing',
      'No code review caught the issue'
    ],
    process: [
      'No integration tests for large codebases',
      'No performance testing in CI/CD'
    ],
    technology: [
      'Glob pattern included node_modules (500MB)',
      'No size limit on prompts',
      'No timeout on file reading'
    ],
    environment: [
      'Production codebase larger than test repos',
      'No staging environment for testing'
    ]
  },
  rootCause: 'Missing file size validation and glob pattern filtering',
  fix: 'Add file exclusion patterns and max prompt size validation'
};
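
The fix field above calls for max prompt size validation. A hedged sketch of that guard (MAX_PROMPT_CHARS is an assumed limit, not an official API value):

// Hypothetical guard implementing 'max prompt size validation'.
const MAX_PROMPT_CHARS = 400_000; // Illustrative limit, roughly ~100k tokens

function validatePromptSize(prompt: string): void {
  if (prompt.length > MAX_PROMPT_CHARS) {
    throw new Error(
      `Prompt too large: ${prompt.length} chars (limit ${MAX_PROMPT_CHARS}). ` +
      'Check glob patterns for node_modules or build artifacts.'
    );
  }
}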

Recovery Procedures

Emergency Rollback

# Immediate rollback to last known good version
git log --oneline | head -5
# c534df4 (HEAD) feat: Add new feature (BROKEN)
# 3946b1f docs: Update README
# fc73caa (tag: v1.2.0) fix: Bug fix (LAST GOOD)

# Rollback
git reset --hard fc73caa
npm install
npm run build
pm2 restart all

# Deploy
./deploy.sh production

# Verify
curl http://api.example.com/health

Circuit Breaker Reset

// Manually reset circuit breaker after fixing issue
interface BreakerState {
  state: 'open' | 'closed' | 'half-open';
  failures: number;
}

class CircuitBreakerManager {
  private breakers = new Map<string, BreakerState>();

  reset(serviceName: string): void {
    const breaker = this.breakers.get(serviceName);
    if (breaker) {
      breaker.state = 'closed';
      breaker.failures = 0;
      console.log(`✓ Reset circuit breaker for ${serviceName}`);
    }
  }

  resetAll(): void {
    for (const service of this.breakers.keys()) {
      this.reset(service);
    }
    console.log('✓ Reset all circuit breakers');
  }
}
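
Once the failed dependency is confirmed healthy, resets can be triggered manually, for example from an admin script (illustrative usage, assuming breakers were registered under service names such as 'claude-api'):

// Re-enable one service, or clear everything after full recovery.
const manager = new CircuitBreakerManager();
manager.reset('claude-api');
manager.resetAll();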

Data Recovery

# Recover from backup
BACKUP_DATE="2025-12-24-14:00"

# Stop services
pm2 stop all

# Restore database (pg_restore expects a custom-format archive)
pg_restore -d database_prod backups/backup_${BACKUP_DATE}.dump

# Restore files
rsync -av backups/files_${BACKUP_DATE}/ /var/lib/claude-code/

# Restart
pm2 restart all

# Verify data integrity
psql -d database_prod -c "SELECT COUNT(*) FROM conversations"

Postmortem Templates

Incident Postmortem

# Postmortem: Agent Timeout Incident (2025-12-24)

Date: 2025-12-24
Duration: 14:35 - 15:15 UTC (40 minutes)
Severity: SEV-2
Impact: 1,200 users (15%), 68% error rate

Summary

Code-reviewer agent began timing out due to excessive file inclusion in prompts, causing 68% error rate for 40 minutes.

Timeline (UTC)

  • 14:35 - First timeout alerts
  • 14:40 - Error rate reaches 68%
  • 14:42 - On-call engineer paged
  • 14:45 - Root cause identified (file globbing)
  • 14:50 - Fix deployed to staging
  • 14:55 - Fix deployed to production
  • 15:00 - Error rate drops to 5%
  • 15:15 - Incident resolved, error rate < 1%

Root Cause

File globbing pattern **/* included the node_modules/ directory (500MB), creating prompts that exceeded the Claude API's context limits and caused timeouts.

Contributing Factors

  • No file size validation before prompt construction
  • No integration tests with large codebases
  • No staging environment for testing

What Went Well

  • Fast root cause identification (10 minutes)
  • Effective rollback procedure
  • Clear communication to affected users

What Went Poorly

  • No monitoring alerts fired before users reported the issue
  • No prompt size limits were in place to prevent the issue
  • Fix took 20 minutes to deploy

Action Items

  • [ ] P0: Add file size validation (Owner: @dev, Due: 2025-12-25)
  • [ ] P0: Implement max prompt size limit (Owner: @dev, Due: 2025-12-25)
  • [ ] P1: Add monitoring for agent timeouts (Owner: @ops, Due: 2025-12-27)
  • [ ] P1: Create staging environment (Owner: @ops, Due: 2025-12-30)
  • [ ] P2: Add integration tests with large repos (Owner: @qa, Due: 2026-01-05)

Lessons Learned

  • File operations need size limits
  • Production testing with realistic data is critical
  • Monitoring must detect issues before users report them

Best Practices

DO ✅

  • Log structured data

// ✅ Structured logging
logger.error('Agent execution failed', {
  agentName: 'code-reviewer',
  conversationId: 'abc-123',
  errorCode: 429,
  duration: 1234
});

// ❌ Unstructured
console.log('Error in code-reviewer agent');

  • Set up alerts before incidents

// Alert on error rate > 5%
if (errorRate > 0.05) {
  pagerDuty.trigger({
    severity: 'critical',
    title: 'High error rate detected',
    details: `Error rate: ${(errorRate * 100).toFixed(1)}%`
  });
}
  • Keep runbooks updated

# Agent Timeout Runbook
1. Check logs: tail -f /var/log/claude-code.log | grep TIMEOUT
2. Identify pattern: Which agents are timing out?
3. Check system resources: top, free -m, df -h
4. If rate limits: Implement emergency throttling
5. If resource exhaustion: Restart services
  • Test recovery procedures

# Monthly disaster recovery drill
./test-recovery.sh
# 1. Trigger circuit breaker
# 2. Verify monitoring alerts
# 3. Execute rollback
# 4. Verify service restoration

DON'T ❌

  • Don't skip postmortems

// ❌ Mark as resolved without learning
incident.status = 'resolved';

// ✅ Document and learn
incident.status = 'resolved';
await createPostmortem(incident);
await scheduleReview(incident);

  • Don't blame individuals

# ❌ Blame-focused
Root cause: Developer X wrote bad code

# ✅ System-focused
Root cause: Missing code review process for file operations

  • Don't ignore warning signs

// ❌ Suppress warnings
if (memoryUsage > 0.8) {
  // TODO: Fix later
}

// ✅ Alert and track
if (memoryUsage > 0.8) {
  logger.warn('High memory usage', { usage: memoryUsage });
  metrics.gauge('memory.usage', memoryUsage);
}


Tools & Resources

Monitoring Tools

Analytics Daemon (from this marketplace):

cd packages/analytics-daemon
pnpm start
# Real-time monitoring on http://localhost:3333

System Monitoring:

# CPU, memory, disk
htop

# Network
iftop

# Disk I/O
iotop

Log Aggregation

Centralized logging:

# Ship logs to central server
tail -f /var/log/claude-code.log | \
  nc logserver.example.com 514

Summary

Key Takeaways:

  • Classify incidents immediately - SEV-1/2 require immediate response
  • Follow response protocol - Assess, stabilize, communicate
  • Use systematic debugging - Binary search, correlation analysis, tracing
  • Analyze logs effectively - Structured logging enables fast analysis
  • Find root causes - 5 Whys and Fishbone diagrams prevent recurrence
  • Document everything - Postmortems are learning opportunities
  • Test recovery procedures - Practice makes perfect

Incident Response Checklist:

  • [ ] Classify severity (SEV-1 through SEV-4)
  • [ ] Assess impact (error rate, affected users)
  • [ ] Check obvious issues (API, disk, memory)
  • [ ] Stabilize systems (restart, rate limit, rollback)
  • [ ] Communicate status to stakeholders
  • [ ] Identify root cause (5 Whys, logs, metrics)
  • [ ] Deploy fix and verify recovery
  • [ ] Write postmortem within 24 hours
  • [ ] Create action items with owners and dates
  • [ ] Schedule review meeting with team

Last Updated: 2025-12-24

Author: Jeremy Longshore

Related Playbooks: Multi-Agent Rate Limits, MCP Server Reliability