Claude Code skill pack for CoreWeave (24 skills)
Installation
Open Claude Code and run this command:
/plugin install coreweave-pack@claude-code-plugins-plus
Use --global to install for all projects, or --project for current project only.
Skills (24)
Integrate CoreWeave deployments into CI/CD pipelines with GitHub Actions.
CoreWeave CI Integration
Overview
Set up CI/CD for CoreWeave GPU cloud workloads: run unit tests with mocked Kubernetes clients on every PR, deploy inference containers to CoreWeave namespaces on merge to main, and validate GPU resource requests against quota. CoreWeave uses standard Kubernetes APIs with GPU-specific scheduling, so CI pipelines authenticate via kubeconfig and manage deployments through kubectl.
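A sketch of the quota validation step mentioned above, assuming manifests live under k8s/ and a ResourceQuota already exists in the target namespace:
# Server-side dry-run catches schema and most admission errors without deploying
kubectl apply --dry-run=server -f k8s/
# Inspect current GPU quota usage in the namespace
kubectl describe resourcequota | grep -E "nvidia.com/gpu|Name:"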
GitHub Actions Workflow
# .github/workflows/coreweave-ci.yml
name: CoreWeave CI
on:
pull_request:
paths: ['src/**', 'k8s/**', 'Dockerfile']
push:
branches: [main]
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: '20' }
- run: npm ci
- run: npm test -- --reporter=verbose
deploy:
if: github.ref == 'refs/heads/main'
needs: unit-tests
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build and push container
run: |
echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
docker build -t ghcr.io/${{ github.repository }}/inference:${{ github.sha }} .
docker push ghcr.io/${{ github.repository }}/inference:${{ github.sha }}
- name: Deploy to CoreWeave
env:
KUBECONFIG_DATA: ${{ secrets.COREWEAVE_KUBECONFIG }}
run: |
echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
export KUBECONFIG=/tmp/kubeconfig
kubectl set image deployment/inference \
inference=ghcr.io/${{ github.repository }}/inference:${{ github.sha }}
kubectl rollout status deployment/inference --timeout=300s
Mock-Based Unit Tests
// tests/coreweave-service.test.ts
import { describe, it, expect, vi } from 'vitest';
import { deployInferenceModel } from '../src/coreweave-service';
vi.mock('@kubernetes/client-node', () => ({
KubeConfig: vi.fn().mockImplementation(() => ({
loadFromDefault: vi.fn(),
makeApiClient: vi.fn().mockReturnValue({
patchNamespacedDeployment: vi.fn().mockResolvedValue({ body: { status: { readyReplicas: 1 } } }),
listNamespacedPod: vi.fn().mockResolvedValue({
body: { items: [{ metadata: { name: 'inference-abc' }, status: { phase: 'Running' } }] },
}),
}),
})),
AppsV1Api: vi.fn(),
}));
describe('CoreWeave Service', () => {
it('deploys inference model with GPU requests', async () => {
const result = await deployInferenceModel('llama-70b', { gpu: 'A100', count: 4 });
expect(result.status).toBe('deployed');
expect(result.gpuType).toBe('A100');
});
});
Integration Tests
Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors.
CoreWeave Common Errors
Error Reference
1. Pod Stuck Pending -- No GPU Available
kubectl describe pod <pod-name> | grep -A5 Events
# "0/N nodes are available: insufficient nvidia.com/gpu"
Fix: Check GPU availability: kubectl get nodes -l gpu.nvidia.com/class=A100_PCIE_80GB. Try a different GPU type or region.
2. CUDA Out of Memory
torch.cuda.OutOfMemoryError: CUDA out of memory
Fix: Reduce batch size, enable gradient checkpointing, or use a larger GPU (A100-80GB instead of 40GB).
3. Image Pull BackOff
Fix: Create an imagePullSecret:
kubectl create secret docker-registry regcred \
--docker-server=ghcr.io \
--docker-username=$GH_USER \
--docker-password=$GH_TOKEN
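Creating the secret is only half the fix; pods also have to reference it. One way, assuming workloads run under the default service account:
# Attach the pull secret to the default service account so new pods use it automatically
kubectl patch serviceaccount default \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'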
4. NCCL Timeout (Multi-GPU)
NCCL error: unhandled system error
Fix: Ensure all GPUs are on the same node (NVLink). For multi-node, use InfiniBand-connected nodes.
5. PVC Not Mounting
Fix: Check storage class availability: kubectl get sc. Use CoreWeave storage classes like shared-hdd-ord1 or shared-ssd-ord1.
6. Node Affinity Mismatch
Fix: List valid GPU class labels:
kubectl get nodes -o json | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' | sort -u
7. Service Not Reachable
Fix: Check Service and Endpoints:
kubectl get svc,endpoints <service-name>
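If the Endpoints object is empty, the Service selector usually does not match the pod labels; a quick comparison (service name is a placeholder):
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels -o wide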
Resources
Next Steps
For diagnostics, see coreweave-debug-bundle.
Deploy KServe InferenceService on CoreWeave with autoscaling and GPU scheduling.
CoreWeave Core Workflow: KServe Inference
Overview
Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.
Prerequisites
- Completed coreweave-install-auth setup
- KServe available on your CKS cluster
- Model stored in S3, GCS, or HuggingFace
Instructions
Step 1: Deploy an InferenceService
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-inference
annotations:
autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
autoscaling.knative.dev/metric: "concurrency"
autoscaling.knative.dev/target: "1"
autoscaling.knative.dev/minScale: "1"
autoscaling.knative.dev/maxScale: "5"
spec:
predictor:
minReplicas: 1
maxReplicas: 5
containers:
- name: kserve-container
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--port"
- "8080"
ports:
- containerPort: 8080
protocol: TCP
resources:
limits:
nvidia.com/gpu: "1"
memory: 48Gi
cpu: "8"
requests:
nvidia.com/gpu: "1"
memory: 32Gi
cpu: "4"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_PCIE_80GB"]
kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w
Step 2: Scale-to-Zero Configuration
# For dev/staging -- scale down to zero when idle
metadata:
annotations:
autoscaling.knative.dev/minScale: "0" # Scale to zero
autoscaling.knative.dev/maxScale: "3"
autoscaling.knative.dev/scaleDownDelay: "5m"
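To switch an existing InferenceService to this profile without editing the manifest, the annotations can be applied in place (a sketch; this assumes annotations on the InferenceService metadata propagate the same way as in the example above):
kubectl annotate inferenceservice llama-inference \
autoscaling.knative.dev/minScale=0 \
autoscaling.knative.dev/scaleDownDelay=5m \
--overwrite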
Step 3: Test the Endpoint
# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
-o jsonpath='{.status.url}')
curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "contentRun distributed GPU training jobs on CoreWeave with multi-node PyTorch.
CoreWeave Core Workflow: GPU Training
Overview
Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage.
Prerequisites
- CKS cluster with multi-GPU node pools (8xA100 or 8xH100)
- Shared storage (CoreWeave PVC or NFS)
- Training container with PyTorch and NCCL
Instructions
Step 1: Single-Node Multi-GPU Training
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: llm-finetune
spec:
template:
spec:
restartPolicy: Never
containers:
- name: trainer
image: ghcr.io/myorg/trainer:latest
command: ["torchrun"]
args:
- "--nproc_per_node=8"
- "train.py"
- "--model_name=meta-llama/Llama-3.1-8B"
- "--batch_size=4"
- "--epochs=3"
resources:
limits:
nvidia.com/gpu: "8"
memory: 512Gi
cpu: "64"
volumeMounts:
- name: data
mountPath: /data
- name: checkpoints
mountPath: /checkpoints
volumes:
- name: data
persistentVolumeClaim:
claimName: training-data
- name: checkpoints
persistentVolumeClaim:
claimName: model-checkpoints
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_NVLINK_A100_SXM4_80GB"]
Step 2: Persistent Storage for Training Data
# storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: training-data
spec:
accessModes: ["ReadWriteMany"]
resources:
requests:
storage: 500Gi
storageClassName: shared-hdd-ord1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-checkpoints
spec:
accessModes: ["ReadWriteMany"]
resources:
requests:
storage: 200Gi
storageClassName: shared-ssd-ord1
Step 3: Monitor Training Progress
# Watch training logs
kubectl logs -f job/llm-finetune
# Check GPU utilization
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- nvidia-smi
# Check training metrics
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- \
cat /checkpoints/training_log.json | tail -5
Error Handling
| Error | Cause | Solution |
|---|---|---|
| NCCL timeout | Network issue between GPUs | Use NVLink nodes (SXM4/SXM5) |
GPU Pricing
| GPU | Per GPU/hour | Best For |
|---|---|---|
| A100 40GB PCIe | ~$1.50 | Development, smaller models |
| A100 80GB PCIe | ~$2.21 | Production inference |
| H100 80GB PCIe | ~$4.76 | High-throughput inference |
| H100 SXM5 (8x) | ~$6.15/GPU | Training, multi-GPU |
| L40 | ~$1.10 | Image generation, light inference |
Cost Optimization Strategies
Scale-to-Zero for Dev/Staging
autoscaling.knative.dev/minScale: "0"
autoscaling.knative.dev/scaleDownDelay: "5m"
Right-Size GPU Selection
def recommend_gpu(model_size_b: float, inference_only: bool = True) -> str:
if model_size_b <= 7:
return "L40" if inference_only else "A100_PCIE_80GB"
elif model_size_b <= 13:
return "A100_PCIE_80GB"
elif model_size_b <= 70:
return "A100_PCIE_80GB (4x tensor parallel)"
else:
return "H100_SXM5 (8x tensor parallel)"
Quantization to Use Smaller GPUs
Use AWQ or GPTQ quantization to fit larger models on smaller GPUs:
# 70B model at 4-bit fits on single A100-80GB instead of 4x
vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ --quantization awq
Resources
Next Steps
For architecture patterns, see coreweave-reference-architecture.
Handle training data and model artifacts on CoreWeave persistent storage.
CoreWeave Data Handling
Overview
CoreWeave GPU cloud workloads involve large-scale data artifacts: model weights (multi-GB safetensors/GGUF), training datasets (parquet, TFRecord, WebDataset), checkpoint snapshots, and inference cache volumes. Data flows through Kubernetes PersistentVolumeClaims backed by region-specific storage classes. Compliance requires encryption at rest via the storage driver, namespace-scoped RBAC for volume access, and audit logging for any data egress from GPU nodes.
Data Classification
| Data Type | Sensitivity | Retention | Encryption |
|---|---|---|---|
| Model weights | Medium | Until deprecated | AES-256 at rest |
| Training datasets | High (may contain PII) | Per data license | AES-256 + TLS in transit |
| Checkpoint snapshots | Medium | 30 days post-training | AES-256 at rest |
| Inference cache | Low | Session/TTL | Volume-level encryption |
| HuggingFace tokens | Critical | Rotate quarterly | K8s Secret + KMS |
Data Import
import { KubeConfig, BatchV1Api } from '@kubernetes/client-node';
async function importDataset(pvcName: string, sourceUrl: string, namespace: string) {
const kc = new KubeConfig();
kc.loadFromDefault();
const batch = kc.makeApiClient(BatchV1Api);
const job = {
metadata: { name: `import-${Date.now()}`, namespace },
spec: { template: { spec: {
restartPolicy: 'Never',
containers: [{ name: 'loader', image: 'python:3.11-slim',
command: ['python3', '-c', `
import urllib.request, hashlib
dest = '/data/dataset.tar.gz'
urllib.request.urlretrieve('${sourceUrl}', dest)
print(f"SHA256: {hashlib.sha256(open(dest,'rb').read()).hexdigest()}")`],
volumeMounts: [{ name: 'storage', mountPath: '/data' }],
}],
volumes: [{ name: 'storage', persistentVolumeClaim: { claimName: pvcName } }],
}}}
};
await batch.createNamespacedJob(namespace, job);
}
Data Export
async function exportCheckpoint(pvcName: string, destBucket: string, ns: string) {
// Validate export destination is in approved region list
const APPROVED_REGIONS = ['us-east-1', 'us-central-1', 'eu-west-1'];
const region = destBucket.split('-').slice(0, 3).join('-');
if (!APPROVED_REGIONS.some(r => destBucket.includes(r))) {
throw new Error(`Export blocked: ${region} not in approved regions`);
}
// Stream from PVC → object storage with integrity check
const exportCmd = `tar czf - /models | gsutil cp - gs://${destBucket}/export.tar.gz`;
// Execution of exportCmd (e.g., via a Job mounted on the PVC, as in importDataset above) is elided here
console.log(`Export prepared for gs://${destBucket}: ${exportCmd}`);
}
Collect CoreWeave cluster diagnostics for support tickets.
CoreWeave Debug Bundle
Overview
Collect GPU node health, Kubernetes pod status, event logs, and API connectivity into a single diagnostic archive for CoreWeave support tickets. This bundle captures cluster-level resource allocation, failed pod logs, GPU device plugin state, and network reachability so support engineers can diagnose infrastructure issues without requesting additional information. Useful when GPU pods are stuck pending, inference workloads OOM, or node autoscaling behaves unexpectedly.
Debug Collection Script
#!/bin/bash
set -euo pipefail
BUNDLE="debug-coreweave-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE"
# Environment check
echo "=== CoreWeave Debug Bundle ===" | tee "$BUNDLE/summary.txt"
echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$BUNDLE/summary.txt"
echo "COREWEAVE_API_KEY: ${COREWEAVE_API_KEY:+[SET]}" >> "$BUNDLE/summary.txt"
echo "KUBECONFIG: ${KUBECONFIG:-default}" >> "$BUNDLE/summary.txt"
echo "kubectl: $(kubectl version --client --short 2>/dev/null || echo 'not found')" >> "$BUNDLE/summary.txt"
# API connectivity
HTTP=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer ${COREWEAVE_API_KEY}" \
https://api.coreweave.com/v1/namespaces 2>/dev/null || echo "000")
echo "API Status: HTTP $HTTP" >> "$BUNDLE/summary.txt"
# Cluster state
kubectl get nodes -o wide > "$BUNDLE/nodes.txt" 2>&1 || true
kubectl get pods --all-namespaces -o wide > "$BUNDLE/pods.txt" 2>&1 || true
kubectl get events --sort-by=.lastTimestamp > "$BUNDLE/events.txt" 2>&1 || true
# GPU allocation and device plugin status
kubectl describe nodes | grep -A10 "Allocated resources" > "$BUNDLE/gpu-allocation.txt" 2>&1 || true
kubectl get pods -n kube-system -l k8s-app=nvidia-device-plugin -o wide > "$BUNDLE/gpu-plugin.txt" 2>&1 || true
# Failed pod logs
for pod in $(kubectl get pods --field-selector=status.phase=Failed -o name 2>/dev/null); do
kubectl logs "$pod" --tail=200 > "$BUNDLE/$(basename "$pod")-logs.txt" 2>&1 || true
done
# Rate limit headers
curl -s -D "$BUNDLE/rate-headers.txt" -o /dev/null \
-H "Authorization: Bearer ${COREWEAVE_API_KEY}" \
https://api.coreweave.com/v1/namespaces 2>/dev/null || true
tar -czf "$BUNDLE.tar.gz" "$BUNDLE" && rm -rf "$BUNDLE"
echo "Bundle: $BUNDLE.tar.gz"
Analyzing the Bundle
tar -xzf debug-coreweave-*.tar.gz
cat debug-coreweave-*/summary.txt # API + env status at a glance
grep -i "error\|fail\|oom" debug-coreweave-*/events.txt Deploy inference services on CoreWeave with Helm charts and Kustomize.
CoreWeave Deploy Integration
Overview
Deploy GPU-accelerated inference services on CoreWeave Kubernetes (CKS). This skill covers containerizing inference workloads with NVIDIA CUDA base images, configuring GPU resource limits and node affinity for A100/H100 scheduling, setting up health checks that validate GPU availability and model loading, and executing rolling updates that respect GPU node draining. CoreWeave's scheduler requires explicit GPU resource requests to place pods on the correct hardware tier.
Docker Configuration
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04 AS base
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 python3-pip curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt
FROM base
RUN groupadd -r app && useradd -r -g app app
COPY --chown=app:app src/ ./src/
COPY --chown=app:app models/ ./models/
USER app
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
CMD ["python3", "src/server.py"]
Environment Variables
export COREWEAVE_API_KEY="cw_xxxxxxxxxxxx"
export COREWEAVE_NAMESPACE="tenant-my-org"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export GPU_TYPE="A100_PCIE_80GB"
export GPU_COUNT="1"
export LOG_LEVEL="info"
export PORT="8080"
Health Check Endpoint
import express from 'express';
import { execSync } from 'child_process';
const app = express();
app.get('/health', async (req, res) => {
try {
const gpuInfo = execSync('nvidia-smi --query-gpu=name,memory.used --format=csv,noheader').toString().trim();
const modelLoaded = globalThis.modelReady === true;
if (!modelLoaded) throw new Error('Model not loaded');
res.json({ status: 'healthy', gpu: gpuInfo, model: process.env.MODEL_NAME, timestamp: new Date().toISOString() });
} catch (error) {
res.status(503).json({ status: 'unhealthy', error: (error as Error).message });
}
});
Deployment Steps
Step 1: Build
docker build -t registry.coreweave.com/my-org/inference-svc:latest .
docker push registry.coreweave.com/my-org/inference-svc:latest
Step 2: Run
# k8s/deployment.yaml
resources:
limits:
nvidia.com/gpu: 1
cpu: "4"
memory: "48Gi"
nodeSelector:
gpu.nvidia.com/class: A100_PCIE_80GB
kubectl apply -f k8s/deployment.yaml -n tenant-my-org
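To keep rolling updates from requesting an extra GPU while old pods drain (the node-draining concern from the overview), surge can be capped; a sketch, assuming the deployment is named inference-svc:
# Roll one pod at a time without scheduling an additional GPU pod during the update
kubectl patch deployment inference-svc -n tenant-my-org --type merge \
-p '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":0,"maxUnavailable":1}}}}'
kubectl rollout status deployment/inference-svc -n tenant-my-org --timeout=300s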
Step 3: Verify
kubectl get pods -n tenant-my-org -l app=inference-svc
Configure RBAC and namespace isolation for CoreWeave multi-team GPU access.
CoreWeave Enterprise RBAC
Overview
CoreWeave runs GPU workloads on Kubernetes, so RBAC maps directly to K8s namespace isolation and ResourceQuotas. Each team gets a dedicated namespace with GPU limits, storage caps, and network policies. This prevents noisy-neighbor problems where one team's training job starves another's inference service. SOC 2 and HIPAA workloads require namespace-level audit logging and team-scoped API key rotation.
Role Hierarchy
| Role | Permissions | Scope |
|---|---|---|
| Cluster Admin | Full CKS control, namespace creation, quota management | All namespaces |
| Team Lead | Deploy workloads, manage team API keys, adjust pod limits | Own namespace |
| ML Engineer | Launch jobs, access PVCs, view logs | Own namespace |
| Inference Operator | Deploy/scale inference endpoints, read metrics | Own namespace |
| Viewer | Read-only pod status, logs, GPU utilization metrics | Own namespace |
Permission Check
import { KubeConfig, AuthorizationV1Api } from '@kubernetes/client-node';
async function checkNamespaceAccess(user: string, namespace: string, verb: string, resource: string): Promise<boolean> {
const kc = new KubeConfig();
kc.loadFromDefault();
// SubjectAccessReview is served by the authorization.k8s.io group, so use AuthorizationV1Api
const authz = kc.makeApiClient(AuthorizationV1Api);
const review = { apiVersion: 'authorization.k8s.io/v1', kind: 'SubjectAccessReview',
spec: { user, resourceAttributes: { namespace, verb, resource } } };
const result = await authz.createSubjectAccessReview(review);
return result.body.status?.allowed ?? false;
}
Role Assignment
import { execSync } from 'child_process';
// Minimal helper assumed by the snippets below: shell out to the kubectl CLI
async function kubectl(cmd: string): Promise<void> {
execSync(`kubectl ${cmd}`, { stdio: 'inherit' });
}
async function assignTeamNamespace(team: string, group: string, gpuLimit: number): Promise<void> {
await kubectl(`create namespace ${team}`);
await kubectl(`create resourcequota ${team}-gpu --namespace=${team} --hard=requests.nvidia.com/gpu=${gpuLimit}`);
await kubectl(`create rolebinding ${team}-access --namespace=${team} --clusterrole=edit --group=${group}`);
console.log(`Namespace ${team} created with ${gpuLimit} GPU quota bound to ${group}`);
}
async function revokeAccess(team: string, binding: string): Promise<void> {
await kubectl(`delete rolebinding ${binding} --namespace=${team}`);
}
Audit Logging
interface CoreWeaveAuditEntry {
timestamp: string; user: string; namespace: string;
action: 'gpu_request' | 'deploy' | 'scale' | 'delete' | 'quota_change';
resource: string; gpuCount?: number; result: 'allowed' | 'denied';
}
function logAccess(entry: CoreWeaveAuditEntry): void {
console.log(JSON.stringify({ ...entry, cluster: process.env.COREWEAVE_CLUSTER ?? 'cks' })); // cluster field value is an assumption
}
Deploy a GPU workload on CoreWeave with kubectl.
CoreWeave Hello World
Overview
Deploy your first GPU workload on CoreWeave: a simple inference service using vLLM or a batch CUDA job. CoreWeave runs Kubernetes on bare-metal GPU nodes with A100, H100, and L40 GPUs.
Prerequisites
- Completed coreweave-install-auth setup
- kubectl configured with CoreWeave kubeconfig
- Namespace with GPU quota
Instructions
Step 1: Deploy a vLLM Inference Server
# vllm-inference.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 1
selector:
matchLabels:
app: vllm-server
template:
metadata:
labels:
app: vllm-server
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--port"
- "8000"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1
memory: 48Gi
cpu: "8"
requests:
nvidia.com/gpu: 1
memory: 32Gi
cpu: "4"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_PCIE_80GB"]
---
apiVersion: v1
kind: Service
metadata:
name: vllm-server
spec:
selector:
app: vllm-server
ports:
- port: 8000
targetPort: 8000
type: ClusterIP
# Create HuggingFace token secret
kubectl create secret generic hf-token --from-literal=token="${HF_TOKEN}"
# Deploy
kubectl apply -f vllm-inference.yaml
kubectl get pods -w # Wait for Running state
# Port-forward and test
kubectl port-forward svc/vllm-server 8000:8000 &
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
Step 2: Batch GPU Job
# gpu-batch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: gpu-benchmark
spec:
template:
spec:
restartPolicy: Never
containers:
- name: benchmark
image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
command: ["python3", "-c"]
args:
- |
import torch
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
x = torch.randn(4096, 4096, device="cuda")
print("Matmul checksum:", float((x @ x).sum()))
resources:
limits:
nvidia.com/gpu: 1
Incident response runbook for CoreWeave GPU workload failures.
CoreWeave Incident Runbook
Triage Steps
# 1. Check pod status
kubectl get pods -l app=inference -o wide
# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20
# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide
# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi
Common Incidents
Inference Service Down
- Check pod status and events
- If OOMKilled: reduce batch size or upgrade GPU
- If ImagePullBackOff: check registry credentials
- If Pending: check GPU quota and availability
GPU Node Failure
- Pods will be rescheduled automatically
- If no capacity: scale down non-critical workloads
- Contact CoreWeave support for extended outages
Model Loading Failure
- Check HuggingFace token secret exists
- Verify model name spelling
- Check PVC has sufficient storage
- Review container logs for download errors
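A few of these checks as commands (secret and label names follow the examples used elsewhere in this pack):
# Confirm the HuggingFace token secret exists and is non-empty
kubectl get secret hf-token -o jsonpath='{.data.token}' | base64 -d | wc -c
# Check PVC binding and capacity
kubectl get pvc
# Look for download errors in recent logs
kubectl logs -l app=inference --tail=100 | grep -iE "error|denied|not found"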
Rollback
kubectl rollout undo deployment/inference
Resources
Next Steps
For data handling, see coreweave-data-handling.
Configure CoreWeave Kubernetes Service (CKS) access with kubeconfig and API tokens.
CoreWeave Install & Auth
Overview
Set up access to CoreWeave Kubernetes Service (CKS). CKS runs bare-metal Kubernetes with NVIDIA GPUs -- no hypervisor overhead. Access is via standard kubeconfig with CoreWeave-issued credentials.
Prerequisites
- CoreWeave account at https://cloud.coreweave.com
- kubectl v1.28+ installed
- Kubernetes namespace provisioned by CoreWeave
Instructions
Step 1: Download Kubeconfig
- Log in to https://cloud.coreweave.com
- Navigate to API Access > Kubeconfig
- Download the kubeconfig file
# Save kubeconfig
mkdir -p ~/.kube
cp ~/Downloads/coreweave-kubeconfig.yaml ~/.kube/coreweave
# Set as active context
export KUBECONFIG=~/.kube/coreweave
# Verify connection
kubectl get nodes
kubectl get namespaces
Step 2: Configure API Token
# CoreWeave API token for programmatic access
export COREWEAVE_API_TOKEN="your-api-token"
# Store securely
echo "COREWEAVE_API_TOKEN=${COREWEAVE_API_TOKEN}" >> .env
echo "KUBECONFIG=~/.kube/coreweave" >> .env
Step 3: Verify GPU Access
# List available GPU nodes
kubectl get nodes -l gpu.nvidia.com/class -o custom-columns=\
NAME:.metadata.name,GPU:.metadata.labels.gpu\.nvidia\.com/class,\
STATUS:.status.conditions[-1].type
# Check GPU allocatable resources
kubectl describe nodes | grep -A5 "Allocatable:" | grep nvidia
Step 4: Test with a Simple GPU Pod
# test-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
name: gpu-test
spec:
restartPolicy: Never
containers:
- name: cuda-test
image: nvidia/cuda:12.2.0-base-ubuntu22.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: 1
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_PCIE_80GB"]
kubectl apply -f test-gpu.yaml
kubectl logs gpu-test # Should show nvidia-smi output
kubectl delete pod gpu-test
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Unable to connect to the server | Wrong kubeconfig | Verify KUBECONFIG path |
| Forbidden | Missing namespace permissions | Contact CoreWeave support |
| No GPU nodes found | Wrong node labels | Check gpu.nvidia.com/class labels |
| Pod stuck Pending | GPU capacity exhausted | Try a different GPU class or region |
Set up local development workflow for CoreWeave GPU deployments.
CoreWeave Local Dev Loop
Overview
Local development workflow for CoreWeave: build containers, test YAML manifests with dry-run, push to registry, and deploy to CoreWeave CKS.
Prerequisites
Instructions
Step 1: Project Structure
Step 2: Build and Push Container
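A minimal sketch for this step, reusing the registry naming from the deploy skill (image name and tag are placeholders):
docker build -t registry.coreweave.com/my-org/inference-svc:dev .
docker push registry.coreweave.com/my-org/inference-svc:dev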
Step 3: Validate Manifests Before Deploy
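A sketch of the dry-run validation mentioned in the overview, assuming manifests live under k8s/:
# Client-side check for YAML and schema problems, then a server-side dry-run against the cluster
kubectl apply --dry-run=client -f k8s/
kubectl apply --dry-run=server -f k8s/
# Show what would change relative to the live objects (kubectl diff exits non-zero when there are diffs)
kubectl diff -f k8s/ || true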
Step 4: Deploy and Watch
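A sketch of the deploy-and-watch step; namespace and label are placeholders:
kubectl apply -f k8s/ -n tenant-my-org
kubectl rollout status deployment/inference-svc -n tenant-my-org
kubectl logs -n tenant-my-org -l app=inference-svc -f --tail=50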
Error Handling
Resources
Next Steps
See coreweave-migration-deep-dive.
Migrate ML workloads from AWS/GCP/Azure to CoreWeave GPU cloud.
CoreWeave Migration Deep Dive
Cost Comparison
Migration Steps
Phase 1: Containerize
Phase 2: Adapt YAML for CoreWeave
Key changes from AWS EKS / GKE (node selector example sketched below):
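One concrete example of the adaptation: GPU node selection moves from cloud-specific labels to CoreWeave's gpu.nvidia.com/class label (the GKE label shown is illustrative):
# Before (GKE): nodeSelector uses cloud.google.com/gke-accelerator: nvidia-tesla-a100
# After (CoreWeave): nodeSelector uses gpu.nvidia.com/class: A100_PCIE_80GB
# Find manifests that still carry cloud-specific selectors
grep -rn "gke-accelerator\|eks.amazonaws.com" k8s/ || echo "no cloud-specific node selectors found"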
Phase 3: Parallel Deploy
Run both old and new infrastructure simultaneously, gradually shifting traffic.
Phase 4: Cut Over
Decommission old GPU instances after a validation period.
Common Gotchas
Resources
Next Steps
This completes the CoreWeave skill pack. Start with coreweave-install-auth.
Configure CoreWeave across development, staging, and production environments.
CoreWeave Multi-Environment Setup
Overview
CoreWeave GPU cloud requires strict environment separation to control infrastructure costs and prevent resource contention. Each environment maps to an isolated Kubernetes namespace with its own GPU quota, scaling policy, and access controls. Development uses cheaper GPU tiers for iteration speed, staging mirrors production GPU types for accurate benchmarking, and production runs full-scale with no scale-to-zero to guarantee inference latency SLAs.
Environment Configuration
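A sketch of the per-environment isolation described above -- one namespace per environment with its own GPU quota (names and limits are illustrative):
for env in dev staging prod; do
kubectl create namespace "ml-${env}" --dry-run=client -o yaml | kubectl apply -f -
done
kubectl create resourcequota gpu-quota -n ml-dev --hard=requests.nvidia.com/gpu=2
kubectl create resourcequota gpu-quota -n ml-staging --hard=requests.nvidia.com/gpu=4
kubectl create resourcequota gpu-quota -n ml-prod --hard=requests.nvidia.com/gpu=16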
Environment Files
Environment Validation
Promotion Workflow
Environment Matrix