klaviyo-incident-runbook

Execute Klaviyo incident response procedures with triage, mitigation, and postmortem. Use when responding to Klaviyo-related outages, investigating errors, or running post-incident reviews for Klaviyo integration failures. Trigger with phrases like "klaviyo incident", "klaviyo outage", "klaviyo down", "klaviyo on-call", "klaviyo emergency", "klaviyo broken".

claude-code
4 Tools
klaviyo-pack Plugin
saas packs Category

Allowed Tools

Read, Grep, Bash(kubectl:*), Bash(curl:*)

Provided by Plugin

klaviyo-pack

Claude Code skill pack for Klaviyo (24 skills)

saas packs v1.0.0

Installation

This skill is included in the klaviyo-pack plugin:

/plugin install klaviyo-pack@claude-code-plugins-plus

Instructions

Klaviyo Incident Runbook

Overview

Rapid incident response procedures for Klaviyo-related outages.

Prerequisites

  • Access to Klaviyo dashboard and status page
  • kubectl access to production cluster
  • Prometheus/Grafana access
  • Communication channels (Slack, PagerDuty)

Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Klaviyo API unreachable |
| P2 | Degraded service | < 1 hour | High latency, partial failures |
| P3 | Minor impact | < 4 hours | Webhook delays, non-critical errors |
| P4 | No user impact | Next business day | Monitoring gaps |
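
The table above can also be kept as a small lookup so that scripts and on-call engineers read the same targets. This is a minimal sketch; the function name `response_target` is hypothetical and not part of the klaviyo-pack plugin.

```shell
# Hypothetical helper: print the response-time target for a severity level,
# mirroring the Severity Levels table above.
response_target() {
  case "$1" in
    P1) echo "15 minutes" ;;
    P2) echo "1 hour" ;;
    P3) echo "4 hours" ;;
    P4) echo "next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_target P2` prints `1 hour`, and an unrecognized level returns a non-zero exit status.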

Quick Triage


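# 1. Check Klaviyo status (assumes the page exposes a Statuspage-style JSON endpoint; if not, open it in a browser)
curl -s https://status.klaviyo.com/api/v2/status.json | jq '.status'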

# 2. Check our integration health
curl -s https://api.yourapp.com/health | jq '.services.klaviyo'

# 3. Check error rate (last 5 min); -G with --data-urlencode keeps the PromQL expression intact
curl -sG localhost:9090/api/v1/query --data-urlencode 'query=rate(klaviyo_errors_total[5m])' | jq

# 4. Recent error logs (--tail=-1 because kubectl returns only the last 10 lines per pod when -l is used)
kubectl logs -l app=klaviyo-integration --since=5m --tail=-1 | grep -i error | tail -20

Decision Tree


Klaviyo API returning errors?
├─ YES: Is status.klaviyo.com showing incident?
│   ├─ YES → Wait for Klaviyo to resolve. Enable fallback.
│   └─ NO → Our integration issue. Check credentials, config.
└─ NO: Is our service healthy?
    ├─ YES → Likely resolved or intermittent. Monitor.
    └─ NO → Our infrastructure issue. Check pods, memory, network.
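
The decision tree above can be sketched as a small shell function that takes the three yes/no answers and prints the recommended next step. The function name `triage_decision` and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper encoding the decision tree above.
# Arguments are "yes" or "no":
#   $1  is the Klaviyo API returning errors?
#   $2  is status.klaviyo.com showing an incident?
#   $3  is our own service healthy?
triage_decision() {
  if [ "$1" = "yes" ]; then
    if [ "$2" = "yes" ]; then
      echo "Klaviyo-side incident: enable fallback and wait for Klaviyo to resolve"
    else
      echo "Integration issue: check credentials and config"
    fi
  elif [ "$3" = "yes" ]; then
    echo "Likely resolved or intermittent: keep monitoring"
  else
    echo "Infrastructure issue: check pods, memory, network"
  fi
}
```

For example, `triage_decision yes no yes` points at the integration itself rather than at Klaviyo.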

Immediate Actions by Error Type

401/403 - Authentication


# Verify API key is set (print only its length so the key is not exposed in the terminal)
kubectl get secret klaviyo-secrets -o jsonpath='{.data.api-key}' | base64 -d | wc -c

# Check if key was rotated
# → Verify in Klaviyo dashboard

# Remediation: Update secret and restart pods
kubectl create secret generic klaviyo-secrets --from-literal=api-key=NEW_KEY --dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/klaviyo-integration

429 - Rate Limited


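# Check rate limit headers on an authenticated request
# (KLAVIYO_API_KEY and KLAVIYO_REVISION are placeholders: your private key and the YYYY-MM-DD API revision you use)
curl -s -o /dev/null -D - https://a.klaviyo.com/api/accounts/ \
  -H "Authorization: Klaviyo-API-Key $KLAVIYO_API_KEY" -H "revision: $KLAVIYO_REVISION" | grep -iE 'ratelimit|retry-after'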

# Enable request queuing
kubectl set env deployment/klaviyo-integration RATE_LIMIT_MODE=queue

# Long-term: Contact Klaviyo for limit increase
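
While requests are queued, a client usually needs to decide how long to wait before retrying. The sketch below assumes the 429 response may carry a `Retry-After` value in seconds and falls back to capped exponential backoff when it does not. The function name `retry_delay` and the 60-second cap are illustrative choices, not settings from the plugin.

```shell
# Hypothetical helper: seconds to wait before retry number $1 (1-based).
# If a numeric Retry-After value in seconds is passed as $2, use it as is;
# otherwise use exponential backoff (1, 2, 4, ...) capped at 60 seconds.
retry_delay() {
  attempt=$1
  retry_after=${2:-}
  case "$retry_after" in
    ''|*[!0-9]*) ;;                       # missing or non-numeric: ignore
    *) echo "$retry_after"; return 0 ;;
  esac
  delay=1
  i=1
  while [ "$i" -lt "$attempt" ] && [ "$delay" -lt 60 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 60 ]; then
    delay=60
  fi
  echo "$delay"
}
```

A retry loop would then call something like `sleep "$(retry_delay "$attempt" "$retry_after")"` between attempts.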

500/503 - Klaviyo Errors


# Enable graceful degradation
kubectl set env deployment/klaviyo-integration KLAVIYO_FALLBACK=true

# Notify users of degraded service
# Update status page

# Monitor Klaviyo status for resolution
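
To route quickly from an observed status code to the right section above, the mapping can be sketched as a lookup. The function name `runbook_section` is hypothetical. Codes the runbook does not name are reported as unclassified rather than guessed at.

```shell
# Hypothetical helper: map an HTTP status code returned by Klaviyo
# to the runbook section that covers it.
runbook_section() {
  case "$1" in
    401|403) echo "401/403 - Authentication" ;;
    429)     echo "429 - Rate Limited" ;;
    500|503) echo "500/503 - Klaviyo Errors" ;;
    2??)     echo "no action: request succeeded" ;;
    *)       echo "unclassified: investigate manually" ;;
  esac
}
```

The code can come from logs or from `curl -s -o /dev/null -w '%{http_code}'` against whichever endpoint is failing.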

Communication Templates

Internal (Slack)


🔴 P1 INCIDENT: Klaviyo Integration
Status: INVESTIGATING
Impact: [Describe user impact]
Current action: [What you're doing]
Next update: [Time]
Incident commander: @[name]
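
Filling in the template by hand under pressure invites omissions, so it can be rendered from arguments instead. This is a sketch only: the function name `incident_message` and its argument order are assumptions, and how the text reaches Slack depends on your own setup.

```shell
# Hypothetical helper: fill in the internal Slack template above.
# Usage: incident_message SEVERITY STATUS IMPACT ACTION NEXT_UPDATE COMMANDER
incident_message() {
  printf '🔴 %s INCIDENT: Klaviyo Integration\n' "$1"
  printf 'Status: %s\n' "$2"
  printf 'Impact: %s\n' "$3"
  printf 'Current action: %s\n' "$4"
  printf 'Next update: %s\n' "$5"
  printf 'Incident commander: @%s\n' "$6"
}
```

For example, `incident_message P1 INVESTIGATING "Email sends delayed" "Checking credentials" "14:30 UTC" alice` prints the six template lines ready to paste.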

External (Status Page)


Klaviyo Integration Issue

We're experiencing issues with our Klaviyo integration.
Some users may experience [specific impact].

We're actively investigating and will provide updates.

Last updated: [timestamp]

Post-Incident

Evidence Collection


# Generate debug bundle
./scripts/klaviyo-debug-bundle.sh

# Export relevant logs (--tail=-1 overrides the 10-line-per-pod default that applies with -l)
kubectl logs -l app=klaviyo-integration --since=1h --tail=-1 > incident-logs.txt

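# Capture metrics for the last 2 hours (query_range needs start and end as Unix timestamps, plus a step)
now=$(date +%s)
curl -sG localhost:9090/api/v1/query_range --data-urlencode 'query=klaviyo_errors_total' \
  -d "start=$((now - 7200))" -d "end=$now" -d 'step=60' > metrics.json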

Postmortem Template


## Incident: Klaviyo [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]

### Summary
[1-2 sentence description]

### Timeline
- HH:MM - [Event]
- HH:MM - [Event]

### Root Cause
[Technical explanation]

### Impact
- Users affected: N
- Revenue impact: $X

### Action Items
- [ ] [Preventive measure] - Owner - Due date
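
The Duration field in the template can be derived from the start and end timestamps rather than estimated. A minimal sketch, assuming both are available as Unix epoch seconds; the function name `format_duration` is hypothetical.

```shell
# Hypothetical helper: format a duration given in seconds as
# "X hours Y minutes" for the Duration field of the template above.
# Leftover seconds are dropped.
format_duration() {
  total_minutes=$(( $1 / 60 ))
  echo "$(( total_minutes / 60 )) hours $(( total_minutes % 60 )) minutes"
}
```

With `start_epoch` and `end_epoch` set, `format_duration $(( end_epoch - start_epoch ))` gives the value to paste into the template.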

Instructions

Step 1: Quick Triage

Run the triage commands to identify the issue source.

Step 2: Follow Decision Tree

Determine if the issue is Klaviyo-side or internal.

Step 3: Execute Immediate Actions

Apply the appropriate remediation for the error type.

Step 4: Communicate Status

Update internal and external stakeholders.

Output

  • Issue identified and categorized
  • Remediation applied
  • Stakeholders notified
  • Evidence collected for postmortem

Error Handling

| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Secret rotation fails | Permission denied | Escalate to admin |

Examples

One-Line Health Check


# jq -e exits non-zero on null or empty input, so a failed curl also prints UNHEALTHY
curl -sf https://api.yourapp.com/health | jq -er '.services.klaviyo.status' || echo "UNHEALTHY"
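
While waiting for recovery, the health check above can be polled instead of re-run by hand. The sketch below repeats any command until it succeeds or a maximum number of attempts is reached. The function name `wait_until_healthy` and its parameters are illustrative assumptions.

```shell
# Hypothetical helper: run a check command repeatedly until it succeeds.
# Usage: wait_until_healthy MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as COMMAND succeeds, 1 if every attempt fails.
wait_until_healthy() {
  max_attempts=$1
  interval=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "healthy after $attempt attempt(s)"
      return 0
    fi
    # Sleep between attempts, but not after the final one.
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$interval"
    fi
    attempt=$((attempt + 1))
  done
  echo "UNHEALTHY after $max_attempts attempt(s)"
  return 1
}
```

For example, `wait_until_healthy 10 30 curl -sf https://api.yourapp.com/health` checks every 30 seconds for up to 10 attempts. That call only tests that the endpoint responds, so wrap a stricter check in `sh -c '...'` if the Klaviyo status field must also be inspected.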

Next Steps

For data handling, see klaviyo-data-handling.
