anth-incident-runbook

'Execute incident response procedures for Claude API outages and degradation.

v1.0.0

Jeremy Longshore

MIT

3 Tools

anthropic-pack Plugin

saas packs Category

Allowed Tools
        ReadBash(curl:*)Grep
      

Provided by Plugin

anthropic-pack

Claude Code skill pack for Anthropic (30 skills)

saas packs v1.0.0

View Plugin

Installation

This skill is included in the anthropic-pack plugin:

/plugin install anthropic-pack@claude-code-plugins-plus

Click to copy

Instructions

Anthropic Incident Runbook

Severity Classification

Severity	Condition	Response Time
P1	API returning 500/529 for all requests	Immediate
P2	Rate limiting (429) or high latency (>10s p99)	15 minutes
P3	Intermittent errors (<5% error rate)	1 hour
P4	Degraded quality (not errors)	Next business day

Immediate Triage (First 5 Minutes)


# 1. Check Anthropic status page
curl -s https://status.anthropic.com/api/v2/status.json | python3 -c \
  "import sys,json; d=json.load(sys.stdin); print(d['status']['indicator'], '-', d['status']['description'])"

# 2. Test API connectivity
curl -s -w "\nHTTP %{http_code} | Time: %{time_total}s\n" \
  https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-20250514","max_tokens":8,"messages":[{"role":"user","content":"1"}]}'

# 3. Check rate limit headers
curl -s -D - https://api.anthropic.com/v1/messages \
  -H "x-api-key: $ANTHROPIC_API_KEY" \
  -H "anthropic-version: 2023-06-01" \
  -H "content-type: application/json" \
  -d '{"model":"claude-haiku-4-20250514","max_tokens":8,"messages":[{"role":"user","content":"1"}]}' \
  2>/dev/null | grep -i "ratelimit\|retry-after\|request-id"

Decision Tree


API returning errors?
├── 401/403 → Key issue → Check ANTHROPIC_API_KEY is set and valid
├── 429 → Rate limited → Check headers, reduce traffic, wait for retry-after
├── 500 → Server error → Check status.anthropic.com, retry with backoff
├── 529 → Overloaded → Temporary, retry after 30-60s
└── Timeouts → Network or long generation → Increase timeout, check max_tokens

Mitigation Actions

Rate Limiting (429)


# Immediate: reduce traffic
# 1. Enable circuit breaker
# 2. Queue non-critical requests
# 3. Switch to Message Batches for bulk work
# 4. Reduce max_tokens to shorten generation time

API Outage (500/529)


# Graceful degradation
def get_response_with_fallback(prompt: str) -> str:
    try:
        msg = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}]
        )
        return msg.content[0].text
    except (anthropic.InternalServerError, anthropic.APIStatusError):
        return "Our AI assistant is temporarily unavailable. Please try again shortly."

Key Compromise


# 1. Immediately revoke key at console.anthropic.com
# 2. Generate new key
# 3. Deploy new key to all environments
# 4. Audit recent usage for unauthorized calls
# 5. File incident report

Postmortem Template


## Incident: [Title]
- **Duration:** [start] to [end]
- **Severity:** P[1-4]
- **Impact:** [what users experienced]
- **Root Cause:** [what went wrong]
- **Detection:** [how we found out]
- **Mitigation:** [what we did to fix it]
- **Request IDs:** [from debug logs]
- **Action Items:**
  - [ ] [preventive measure 1]
  - [ ] [preventive measure 2]

Error Handling

Symptom	Likely Cause	Quick Fix
All requests fail 401	Key rotated/expired	Check Console for active keys
Sudden 429 spike	Traffic burst or tier change	Check rate limit headers
Slow responses (>10s)	Large max_tokens or complex prompt	Reduce max_tokens, use Haiku
Intermittent 500s	Upstream API issue	Check status.anthropic.com

Resources

Next Steps

For data compliance, see anth-data-handling.

Allowed Tools

Provided by Plugin

anthropic-pack

Installation

Instructions

Anthropic Incident Runbook

Severity Classification

Immediate Triage (First 5 Minutes)

Decision Tree

Mitigation Actions

Rate Limiting (429)

API Outage (500/529)

Key Compromise

Postmortem Template

Error Handling

Resources

Next Steps

Ready to use anthropic-pack?

Related Skills

abridge-ci-integration

abridge-common-errors

abridge-core-workflow-a

abridge-core-workflow-b

abridge-cost-tuning

abridge-debug-bundle