coreweave-incident-runbook

Incident response runbook for CoreWeave GPU workload failures.


Allowed Tools

Read, Bash(kubectl:*), Grep

Provided by Plugin

coreweave-pack

Claude Code skill pack for CoreWeave (24 skills)

saas packs v1.0.0

Installation

This skill is included in the coreweave-pack plugin:

/plugin install coreweave-pack@claude-code-plugins-plus


Instructions

CoreWeave Incident Runbook

Triage Steps


# 1. Check pod status
kubectl get pods -l app=inference -o wide

# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20

# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide

# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi

Common Incidents

Inference Service Down

  1. Check pod status and events
  2. If OOMKilled: reduce batch size or upgrade GPU
  3. If ImagePullBackOff: check registry credentials
  4. If Pending: check GPU quota and availability
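The branch logic above can be sketched as a small helper that maps a pod's failure reason (as shown in the STATUS column of `kubectl get pods` or in events) to the matching remediation. This is a minimal sketch; the function name is hypothetical.

```shell
# Map a pod failure reason to the runbook remediation above.
suggest_action() {
  case "$1" in
    OOMKilled)        echo "reduce batch size or upgrade GPU" ;;
    ImagePullBackOff) echo "check registry credentials" ;;
    Pending)          echo "check GPU quota and availability" ;;
    *)                echo "check pod status and events" ;;
  esac
}

suggest_action OOMKilled   # prints "reduce batch size or upgrade GPU"
```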

GPU Node Failure

  1. Pods will be rescheduled automatically
  2. If no capacity: scale down non-critical workloads
  3. Contact CoreWeave support for extended outages
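Step 2 can be sketched as a helper that cordons the failed node and scales down a non-critical workload to free GPU capacity. The node and deployment names are placeholders for your environment.

```shell
# Sketch: keep new pods off a failed GPU node, then free capacity
# by scaling a non-critical deployment to zero.
cordon_and_free() {
  local node="$1" noncritical="$2"
  kubectl cordon "$node"                                # mark node unschedulable
  kubectl scale "deployment/$noncritical" --replicas=0  # release its GPUs
}

# Example (placeholder names):
#   cordon_and_free gpu-node-7 batch-eval
```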

Model Loading Failure

  1. Check HuggingFace token secret exists
  2. Verify model name spelling
  3. Check PVC has sufficient storage
  4. Review container logs for download errors
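The four checks above can be combined into one helper. The secret name (`hf-token`), PVC name (`model-cache`), and label selector are assumptions; substitute the names your deployment actually uses.

```shell
# Sketch of the model-loading checks as a single helper.
check_model_prereqs() {
  # 1. HuggingFace token secret exists?
  kubectl get secret hf-token >/dev/null 2>&1 \
    || echo "missing HuggingFace token secret 'hf-token'"
  # 3. PVC exists and has capacity?
  kubectl get pvc model-cache -o jsonpath='{.status.capacity.storage}' 2>/dev/null \
    || echo "PVC 'model-cache' not found"
  # 4. Scan recent container logs for download errors.
  kubectl logs -l app=inference --tail=100 2>/dev/null \
    | grep -iE 'download|403|404' || true
}
```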

Rollback


kubectl rollout undo deployment/inference
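A hedged sketch of a safer rollback: record the revision history first, then wait for the new rollout to settle before declaring the incident resolved. The helper name is hypothetical.

```shell
# Roll back a deployment and block until pods are ready again.
rollback_inference() {
  local deploy="${1:-deployment/inference}"
  kubectl rollout history "$deploy"                  # note the revision being left
  kubectl rollout undo "$deploy"                     # revert to previous revision
  kubectl rollout status "$deploy" --timeout=120s    # wait for pods to come up
}
```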


Next Steps

For data handling, see coreweave-data-handling.
