Claude Code skill pack for CoreWeave (24 skills)
Installation
Open Claude Code and run this command:
/plugin install coreweave-pack@claude-code-plugins-plus
Use --global to install for all projects, or --project for current project only.
Skills (24)
Integrate CoreWeave deployments into CI/CD pipelines with GitHub Actions.
CoreWeave CI Integration
Overview
Set up CI/CD for CoreWeave GPU cloud workloads: run unit tests with mocked Kubernetes clients on every PR, deploy inference containers to CoreWeave namespaces on merge to main, and validate GPU resource requests against quota. CoreWeave uses standard Kubernetes APIs with GPU-specific scheduling, so CI pipelines authenticate via kubeconfig and manage deployments through kubectl.
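A sketch of the quota validation step mentioned above, assuming manifests live under k8s/ and a ResourceQuota already exists in the target namespace:
# Server-side dry-run catches schema and most admission errors without deploying
kubectl apply --dry-run=server -f k8s/
# Inspect current GPU quota usage in the namespace
kubectl describe resourcequota | grep -E "nvidia.com/gpu|Name:"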
GitHub Actions Workflow
# .github/workflows/coreweave-ci.yml
name: CoreWeave CI
on:
pull_request:
paths: ['src/**', 'k8s/**', 'Dockerfile']
push:
branches: [main]
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-node@v4
with: { node-version: '20' }
- run: npm ci
- run: npm test -- --reporter=verbose
deploy:
if: github.ref == 'refs/heads/main'
needs: unit-tests
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build and push container
run: |
echo "${{ secrets.GHCR_TOKEN }}" | docker login ghcr.io -u ${{ github.actor }} --password-stdin
docker build -t ghcr.io/${{ github.repository }}/inference:${{ github.sha }} .
docker push ghcr.io/${{ github.repository }}/inference:${{ github.sha }}
- name: Deploy to CoreWeave
env:
KUBECONFIG_DATA: ${{ secrets.COREWEAVE_KUBECONFIG }}
run: |
echo "$KUBECONFIG_DATA" | base64 -d > /tmp/kubeconfig
export KUBECONFIG=/tmp/kubeconfig
kubectl set image deployment/inference \
inference=ghcr.io/${{ github.repository }}/inference:${{ github.sha }}
kubectl rollout status deployment/inference --timeout=300s
Mock-Based Unit Tests
// tests/coreweave-service.test.ts
import { describe, it, expect, vi } from 'vitest';
import { deployInferenceModel } from '../src/coreweave-service';
vi.mock('@kubernetes/client-node', () => ({
KubeConfig: vi.fn().mockImplementation(() => ({
loadFromDefault: vi.fn(),
makeApiClient: vi.fn().mockReturnValue({
patchNamespacedDeployment: vi.fn().mockResolvedValue({ body: { status: { readyReplicas: 1 } } }),
listNamespacedPod: vi.fn().mockResolvedValue({
body: { items: [{ metadata: { name: 'inference-abc' }, status: { phase: 'Running' } }] },
}),
}),
})),
AppsV1Api: vi.fn(),
}));
describe('CoreWeave Service', () => {
it('deploys inference model with GPU requests', async () => {
const result = await deployInferenceModel('llama-70b', { gpu: 'A100', count: 4 });
expect(result.status).toBe('deployed');
expect(result.gpuType).toBe('A100');
});
});
Integration Tests
Diagnose and fix CoreWeave GPU scheduling, pod, and networking errors.
CoreWeave Common Errors
Error Reference
1. Pod Stuck Pending -- No GPU Available
kubectl describe pod <pod-name> | grep -A5 Events
# "0/N nodes are available: insufficient nvidia.com/gpu"
Fix: Check GPU availability: kubectl get nodes -l gpu.nvidia.com/class=A100_PCIE_80GB. Try a different GPU type or region.
2. CUDA Out of Memory
torch.cuda.OutOfMemoryError: CUDA out of memory
Fix: Reduce batch size, enable gradient checkpointing, or use a larger GPU (A100-80GB instead of 40GB).
3. Image Pull BackOff
Fix: Create an imagePullSecret:
kubectl create secret docker-registry regcred \
--docker-server=ghcr.io \
--docker-username=$GH_USER \
--docker-password=$GH_TOKEN
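Creating the secret is only half the fix; pods also have to reference it. One way, assuming workloads run under the default service account:
# Attach the pull secret to the default service account so new pods use it automatically
kubectl patch serviceaccount default \
-p '{"imagePullSecrets": [{"name": "regcred"}]}'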
4. NCCL Timeout (Multi-GPU)
NCCL error: unhandled system error
Fix: Ensure all GPUs are on the same node (NVLink). For multi-node, use InfiniBand-connected nodes.
5. PVC Not Mounting
Fix: Check storage class availability: kubectl get sc. Use CoreWeave storage classes like shared-hdd-ord1 or shared-ssd-ord1.
6. Node Affinity Mismatch
Fix: List valid GPU class labels:
kubectl get nodes -o json | jq -r '.items[].metadata.labels["gpu.nvidia.com/class"]' | sort -u
7. Service Not Reachable
Fix: Check Service and Endpoints:
kubectl get svc,endpoints <service-name>
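If the Endpoints object is empty, the Service selector usually does not match the pod labels; a quick comparison (service name is a placeholder):
kubectl get svc <service-name> -o jsonpath='{.spec.selector}'
kubectl get pods --show-labels -o wide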
Resources
Next Steps
For diagnostics, see coreweave-debug-bundle.
Deploy KServe InferenceService on CoreWeave with autoscaling and GPU scheduling.
CoreWeave Core Workflow: KServe Inference
Overview
Deploy production inference services on CoreWeave using KServe InferenceService with GPU scheduling, autoscaling, and scale-to-zero. CKS natively integrates with KServe for serverless GPU inference.
Prerequisites
- Completed coreweave-install-auth setup
- KServe available on your CKS cluster
- Model stored in S3, GCS, or HuggingFace
Instructions
Step 1: Deploy an InferenceService
# inference-service.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
name: llama-inference
annotations:
autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
autoscaling.knative.dev/metric: "concurrency"
autoscaling.knative.dev/target: "1"
autoscaling.knative.dev/minScale: "1"
autoscaling.knative.dev/maxScale: "5"
spec:
predictor:
minReplicas: 1
maxReplicas: 5
containers:
- name: kserve-container
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--port"
- "8080"
ports:
- containerPort: 8080
protocol: TCP
resources:
limits:
nvidia.com/gpu: "1"
memory: 48Gi
cpu: "8"
requests:
nvidia.com/gpu: "1"
memory: 32Gi
cpu: "4"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_PCIE_80GB"]
kubectl apply -f inference-service.yaml
kubectl get inferenceservice llama-inference -w
Step 2: Scale-to-Zero Configuration
# For dev/staging -- scale down to zero when idle
metadata:
annotations:
autoscaling.knative.dev/minScale: "0" # Scale to zero
autoscaling.knative.dev/maxScale: "3"
autoscaling.knative.dev/scaleDownDelay: "5m"
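To switch an existing InferenceService to this profile without editing the manifest, the annotations can be applied in place (a sketch; this assumes annotations on the InferenceService metadata propagate the same way as in the example above):
kubectl annotate inferenceservice llama-inference \
autoscaling.knative.dev/minScale=0 \
autoscaling.knative.dev/scaleDownDelay=5m \
--overwrite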
Step 3: Test the Endpoint
# Get inference URL
INFERENCE_URL=$(kubectl get inferenceservice llama-inference \
-o jsonpath='{.status.url}')
curl -X POST "${INFERENCE_URL}/v1/chat/completions" \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "contentRun distributed GPU training jobs on CoreWeave with multi-node PyTorch.
CoreWeave Core Workflow: GPU Training
Overview
Run distributed GPU training on CoreWeave: single-node multi-GPU and multi-node training with PyTorch DDP, Slurm-on-Kubernetes, and shared storage.
Prerequisites
- CKS cluster with multi-GPU node pools (8xA100 or 8xH100)
- Shared storage (CoreWeave PVC or NFS)
- Training container with PyTorch and NCCL
Instructions
Step 1: Single-Node Multi-GPU Training
# training-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: llm-finetune
spec:
template:
spec:
restartPolicy: Never
containers:
- name: trainer
image: ghcr.io/myorg/trainer:latest
command: ["torchrun"]
args:
- "--nproc_per_node=8"
- "train.py"
- "--model_name=meta-llama/Llama-3.1-8B"
- "--batch_size=4"
- "--epochs=3"
resources:
limits:
nvidia.com/gpu: "8"
memory: 512Gi
cpu: "64"
volumeMounts:
- name: data
mountPath: /data
- name: checkpoints
mountPath: /checkpoints
volumes:
- name: data
persistentVolumeClaim:
claimName: training-data
- name: checkpoints
persistentVolumeClaim:
claimName: model-checkpoints
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_NVLINK_A100_SXM4_80GB"]
Step 2: Persistent Storage for Training Data
# storage.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: training-data
spec:
accessModes: ["ReadWriteMany"]
resources:
requests:
storage: 500Gi
storageClassName: shared-hdd-ord1
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-checkpoints
spec:
accessModes: ["ReadWriteMany"]
resources:
requests:
storage: 200Gi
storageClassName: shared-ssd-ord1
Step 3: Monitor Training Progress
# Watch training logs
kubectl logs -f job/llm-finetune
# Check GPU utilization
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- nvidia-smi
# Check training metrics
kubectl exec -it $(kubectl get pod -l job-name=llm-finetune -o name) -- \
cat /checkpoints/training_log.json | tail -5
Error Handling
| Error | Cause | Solution |
|---|---|---|
| NCCL timeout | Network issue between GPUs | Use NVLink nodes (SXM4/SXM5) |
GPU Pricing
| GPU | Per GPU/hour | Best For |
|---|---|---|
| A100 40GB PCIe | ~$1.50 | Development, smaller models |
| A100 80GB PCIe | ~$2.21 | Production inference |
| H100 80GB PCIe | ~$4.76 | High-throughput inference |
| H100 SXM5 (8x) | ~$6.15/GPU | Training, multi-GPU |
| L40 | ~$1.10 | Image generation, light inference |
Cost Optimization Strategies
Scale-to-Zero for Dev/Staging
autoscaling.knative.dev/minScale: "0"
autoscaling.knative.dev/scaleDownDelay: "5m"
Right-Size GPU Selection
def recommend_gpu(model_size_b: float, inference_only: bool = True) -> str:
if model_size_b <= 7:
return "L40" if inference_only else "A100_PCIE_80GB"
elif model_size_b <= 13:
return "A100_PCIE_80GB"
elif model_size_b <= 70:
return "A100_PCIE_80GB (4x tensor parallel)"
else:
return "H100_SXM5 (8x tensor parallel)"
Quantization to Use Smaller GPUs
Use AWQ or GPTQ quantization to fit larger models on smaller GPUs:
# 70B model at 4-bit fits on single A100-80GB instead of 4x
vllm serve meta-llama/Llama-3.1-70B-Instruct-AWQ --quantization awq
Resources
Next Steps
For architecture patterns, see coreweave-reference-architecture.
Handle training data and model artifacts on CoreWeave persistent storage.
CoreWeave Data Handling
Overview
CoreWeave GPU cloud workloads involve large-scale data artifacts: model weights (multi-GB safetensors/GGUF), training datasets (parquet, TFRecord, WebDataset), checkpoint snapshots, and inference cache volumes. Data flows through Kubernetes PersistentVolumeClaims backed by region-specific storage classes. Compliance requires encryption at rest via the storage driver, namespace-scoped RBAC for volume access, and audit logging for any data egress from GPU nodes.
Data Classification
| Data Type | Sensitivity | Retention | Encryption |
|---|---|---|---|
| Model weights | Medium | Until deprecated | AES-256 at rest |
| Training datasets | High (may contain PII) | Per data license | AES-256 + TLS in transit |
| Checkpoint snapshots | Medium | 30 days post-training | AES-256 at rest |
| Inference cache | Low | Session/TTL | Volume-level encryption |
| HuggingFace tokens | Critical | Rotate quarterly | K8s Secret + KMS |
Data Import
import { KubeConfig, BatchV1Api } from '@kubernetes/client-node';
async function importDataset(pvcName: string, sourceUrl: string, namespace: string) {
const kc = new KubeConfig();
kc.loadFromDefault();
const batch = kc.makeApiClient(BatchV1Api);
const job = {
metadata: { name: `import-${Date.now()}`, namespace },
spec: { template: { spec: {
restartPolicy: 'Never',
containers: [{ name: 'loader', image: 'python:3.11-slim',
command: ['python3', '-c', `
import urllib.request, hashlib
dest = '/data/dataset.tar.gz'
urllib.request.urlretrieve('${sourceUrl}', dest)
print(f"SHA256: {hashlib.sha256(open(dest,'rb').read()).hexdigest()}")`],
volumeMounts: [{ name: 'storage', mountPath: '/data' }],
}],
volumes: [{ name: 'storage', persistentVolumeClaim: { claimName: pvcName } }],
}}}
};
await batch.createNamespacedJob(namespace, job);
}
Data Export
async function exportCheckpoint(pvcName: string, destBucket: string, ns: string) {
// Validate export destination is in approved region list
const APPROVED_REGIONS = ['us-east-1', 'us-central-1', 'eu-west-1'];
const region = destBucket.split('-').slice(0, 3).join('-');
if (!APPROVED_REGIONS.some(r => destBucket.includes(r))) {
throw new Error(`Export blocked: ${region} not in approved regions`);
}
// Stream from PVC → object storage with integrity check
const exportCmd = `tar czf - /models | gsutil cp - gs://${destBucket}/export.tar.gz`;
// Execution of exportCmd (e.g., via a Job mounted on the PVC, as in importDataset above) is elided here
console.log(`Export prepared for gs://${destBucket}: ${exportCmd}`);
}
Collect CoreWeave cluster diagnostics for support tickets.
CoreWeave Debug Bundle
Overview
Collect GPU node health, Kubernetes pod status, event logs, and API connectivity into a single diagnostic archive for CoreWeave support tickets. This bundle captures cluster-level resource allocation, failed pod logs, GPU device plugin state, and network reachability so support engineers can diagnose infrastructure issues without requesting additional information. Useful when GPU pods are stuck pending, inference workloads OOM, or node autoscaling behaves unexpectedly.
Debug Collection Script
#!/bin/bash
set -euo pipefail
BUNDLE="debug-coreweave-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BUNDLE"
# Environment check
echo "=== CoreWeave Debug Bundle ===" | tee "$BUNDLE/summary.txt"
echo "Generated: $(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$BUNDLE/summary.txt"
echo "COREWEAVE_API_KEY: ${COREWEAVE_API_KEY:+[SET]}" >> "$BUNDLE/summary.txt"
echo "KUBECONFIG: ${KUBECONFIG:-default}" >> "$BUNDLE/summary.txt"
echo "kubectl: $(kubectl version --client --short 2>/dev/null || echo 'not found')" >> "$BUNDLE/summary.txt"
# API connectivity
HTTP=$(curl -s -o /dev/null -w "%{http_code}" -H "Authorization: Bearer ${COREWEAVE_API_KEY}" \
https://api.coreweave.com/v1/namespaces 2>/dev/null || echo "000")
echo "API Status: HTTP $HTTP" >> "$BUNDLE/summary.txt"
# Cluster state
kubectl get nodes -o wide > "$BUNDLE/nodes.txt" 2>&1 || true
kubectl get pods --all-namespaces -o wide > "$BUNDLE/pods.txt" 2>&1 || true
kubectl get events --sort-by=.lastTimestamp > "$BUNDLE/events.txt" 2>&1 || true
# GPU allocation and device plugin status
kubectl describe nodes | grep -A10 "Allocated resources" > "$BUNDLE/gpu-allocation.txt" 2>&1 || true
kubectl get pods -n kube-system -l k8s-app=nvidia-device-plugin -o wide > "$BUNDLE/gpu-plugin.txt" 2>&1 || true
# Failed pod logs
for pod in $(kubectl get pods --field-selector=status.phase=Failed -o name 2>/dev/null); do
kubectl logs "$pod" --tail=200 > "$BUNDLE/$(basename "$pod")-logs.txt" 2>&1 || true
done
# Rate limit headers
curl -s -D "$BUNDLE/rate-headers.txt" -o /dev/null \
-H "Authorization: Bearer ${COREWEAVE_API_KEY}" \
https://api.coreweave.com/v1/namespaces 2>/dev/null || true
tar -czf "$BUNDLE.tar.gz" "$BUNDLE" && rm -rf "$BUNDLE"
echo "Bundle: $BUNDLE.tar.gz"
Analyzing the Bundle
tar -xzf debug-coreweave-*.tar.gz
cat debug-coreweave-*/summary.txt # API + env status at a glance
grep -i "error\|fail\|oom" debug-coreweave-*/events.txt Deploy inference services on CoreWeave with Helm charts and Kustomize.
CoreWeave Deploy Integration
Overview
Deploy GPU-accelerated inference services on CoreWeave Kubernetes (CKS). This skill covers containerizing inference workloads with NVIDIA CUDA base images, configuring GPU resource limits and node affinity for A100/H100 scheduling, setting up health checks that validate GPU availability and model loading, and executing rolling updates that respect GPU node draining. CoreWeave's scheduler requires explicit GPU resource requests to place pods on the correct hardware tier.
Docker Configuration
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04 AS base
RUN apt-get update && apt-get install -y --no-install-recommends \
python3 python3-pip curl && rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt ./
RUN pip3 install --no-cache-dir -r requirements.txt
FROM base
RUN groupadd -r app && useradd -r -g app app
COPY --chown=app:app src/ ./src/
COPY --chown=app:app models/ ./models/
USER app
EXPOSE 8080
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
CMD curl -f http://localhost:8080/health || exit 1
CMD ["python3", "src/server.py"]
Environment Variables
export COREWEAVE_API_KEY="cw_xxxxxxxxxxxx"
export COREWEAVE_NAMESPACE="tenant-my-org"
export MODEL_NAME="meta-llama/Llama-3.1-8B-Instruct"
export GPU_TYPE="A100_PCIE_80GB"
export GPU_COUNT="1"
export LOG_LEVEL="info"
export PORT="8080"
Health Check Endpoint
import express from 'express';
import { execSync } from 'child_process';
const app = express();
app.get('/health', async (req, res) => {
try {
const gpuInfo = execSync('nvidia-smi --query-gpu=name,memory.used --format=csv,noheader').toString().trim();
const modelLoaded = globalThis.modelReady === true;
if (!modelLoaded) throw new Error('Model not loaded');
res.json({ status: 'healthy', gpu: gpuInfo, model: process.env.MODEL_NAME, timestamp: new Date().toISOString() });
} catch (error) {
res.status(503).json({ status: 'unhealthy', error: (error as Error).message });
}
});
Deployment Steps
Step 1: Build
docker build -t registry.coreweave.com/my-org/inference-svc:latest .
docker push registry.coreweave.com/my-org/inference-svc:latest
Step 2: Run
# k8s/deployment.yaml
resources:
limits:
nvidia.com/gpu: 1
cpu: "4"
memory: "48Gi"
nodeSelector:
gpu.nvidia.com/class: A100_PCIE_80GB
kubectl apply -f k8s/deployment.yaml -n tenant-my-org
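To keep rolling updates from requesting an extra GPU while old pods drain (the node-draining concern from the overview), surge can be capped; a sketch, assuming the deployment is named inference-svc:
# Roll one pod at a time without scheduling an additional GPU pod during the update
kubectl patch deployment inference-svc -n tenant-my-org --type merge \
-p '{"spec":{"strategy":{"type":"RollingUpdate","rollingUpdate":{"maxSurge":0,"maxUnavailable":1}}}}'
kubectl rollout status deployment/inference-svc -n tenant-my-org --timeout=300s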
Step 3: Verify
kubectl get pods -n tenant-my-org -l app=inference-svc
Configure RBAC and namespace isolation for CoreWeave multi-team GPU access.
CoreWeave Enterprise RBAC
Overview
CoreWeave runs GPU workloads on Kubernetes, so RBAC maps directly to K8s namespace isolation and ResourceQuotas. Each team gets a dedicated namespace with GPU limits, storage caps, and network policies. This prevents noisy-neighbor problems where one team's training job starves another's inference service. SOC 2 and HIPAA workloads require namespace-level audit logging and team-scoped API key rotation.
Role Hierarchy
| Role | Permissions | Scope |
|---|---|---|
| Cluster Admin | Full CKS control, namespace creation, quota management | All namespaces |
| Team Lead | Deploy workloads, manage team API keys, adjust pod limits | Own namespace |
| ML Engineer | Launch jobs, access PVCs, view logs | Own namespace |
| Inference Operator | Deploy/scale inference endpoints, read metrics | Own namespace |
| Viewer | Read-only pod status, logs, GPU utilization metrics | Own namespace |
Permission Check
import { KubeConfig, AuthorizationV1Api } from '@kubernetes/client-node';
async function checkNamespaceAccess(user: string, namespace: string, verb: string, resource: string): Promise<boolean> {
const kc = new KubeConfig();
kc.loadFromDefault();
// SubjectAccessReview is served by the authorization.k8s.io group, so use AuthorizationV1Api
const authz = kc.makeApiClient(AuthorizationV1Api);
const review = { apiVersion: 'authorization.k8s.io/v1', kind: 'SubjectAccessReview',
spec: { user, resourceAttributes: { namespace, verb, resource } } };
const result = await authz.createSubjectAccessReview(review);
return result.body.status?.allowed ?? false;
}
Role Assignment
import { execSync } from 'child_process';
// Minimal helper assumed by the snippets below: shell out to the kubectl CLI
async function kubectl(cmd: string): Promise<void> {
execSync(`kubectl ${cmd}`, { stdio: 'inherit' });
}
async function assignTeamNamespace(team: string, group: string, gpuLimit: number): Promise<void> {
await kubectl(`create namespace ${team}`);
await kubectl(`create resourcequota ${team}-gpu --namespace=${team} --hard=requests.nvidia.com/gpu=${gpuLimit}`);
await kubectl(`create rolebinding ${team}-access --namespace=${team} --clusterrole=edit --group=${group}`);
console.log(`Namespace ${team} created with ${gpuLimit} GPU quota bound to ${group}`);
}
async function revokeAccess(team: string, binding: string): Promise<void> {
await kubectl(`delete rolebinding ${binding} --namespace=${team}`);
}
Audit Logging
interface CoreWeaveAuditEntry {
timestamp: string; user: string; namespace: string;
action: 'gpu_request' | 'deploy' | 'scale' | 'delete' | 'quota_change';
resource: string; gpuCount?: number; result: 'allowed' | 'denied';
}
function logAccess(entry: CoreWeaveAuditEntry): void {
console.log(JSON.stringify({ ...entry, cluster: process.env.COREWEAVE_CLUSTER ?? 'cks' })); // cluster field value is an assumption
}
Deploy a GPU workload on CoreWeave with kubectl.
CoreWeave Hello World
Overview
Deploy your first GPU workload on CoreWeave: a simple inference service using vLLM or a batch CUDA job. CoreWeave runs Kubernetes on bare-metal GPU nodes with A100, H100, and L40 GPUs.
Prerequisites
- Completed coreweave-install-auth setup
- kubectl configured with CoreWeave kubeconfig
- Namespace with GPU quota
Instructions
Step 1: Deploy a vLLM Inference Server
# vllm-inference.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-server
spec:
replicas: 1
selector:
matchLabels:
app: vllm-server
template:
metadata:
labels:
app: vllm-server
spec:
containers:
- name: vllm
image: vllm/vllm-openai:latest
args:
- "--model"
- "meta-llama/Llama-3.1-8B-Instruct"
- "--port"
- "8000"
ports:
- containerPort: 8000
resources:
limits:
nvidia.com/gpu: 1
memory: 48Gi
cpu: "8"
requests:
nvidia.com/gpu: 1
memory: 32Gi
cpu: "4"
env:
- name: HUGGING_FACE_HUB_TOKEN
valueFrom:
secretKeyRef:
name: hf-token
key: token
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_PCIE_80GB"]
---
apiVersion: v1
kind: Service
metadata:
name: vllm-server
spec:
selector:
app: vllm-server
ports:
- port: 8000
targetPort: 8000
type: ClusterIP
# Create HuggingFace token secret
kubectl create secret generic hf-token --from-literal=token="${HF_TOKEN}"
# Deploy
kubectl apply -f vllm-inference.yaml
kubectl get pods -w # Wait for Running state
# Port-forward and test
kubectl port-forward svc/vllm-server 8000:8000 &
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}]}'
Step 2: Batch GPU Job
# gpu-batch-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
name: gpu-benchmark
spec:
template:
spec:
restartPolicy: Never
containers:
- name: benchmark
image: pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime
command: ["python3", "-c"]
args:
- |
import torch
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0))
x = torch.randn(4096, 4096, device="cuda")
print("Matmul checksum:", float((x @ x).sum()))
resources:
limits:
nvidia.com/gpu: 1
Incident response runbook for CoreWeave GPU workload failures.
CoreWeave Incident Runbook
Triage Steps
# 1. Check pod status
kubectl get pods -l app=inference -o wide
# 2. Check recent events
kubectl get events --sort-by=.lastTimestamp | tail -20
# 3. Check node status
kubectl get nodes -l gpu.nvidia.com/class -o wide
# 4. Check GPU health
kubectl exec -it $(kubectl get pod -l app=inference -o name | head -1) -- nvidia-smi
Common Incidents
Inference Service Down
- Check pod status and events
- If OOMKilled: reduce batch size or upgrade GPU
- If ImagePullBackOff: check registry credentials
- If Pending: check GPU quota and availability
GPU Node Failure
- Pods will be rescheduled automatically
- If no capacity: scale down non-critical workloads
- Contact CoreWeave support for extended outages
Model Loading Failure
- Check HuggingFace token secret exists
- Verify model name spelling
- Check PVC has sufficient storage
- Review container logs for download errors
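A few of these checks as commands (secret and label names follow the examples used elsewhere in this pack):
# Confirm the HuggingFace token secret exists and is non-empty
kubectl get secret hf-token -o jsonpath='{.data.token}' | base64 -d | wc -c
# Check PVC binding and capacity
kubectl get pvc
# Look for download errors in recent logs
kubectl logs -l app=inference --tail=100 | grep -iE "error|denied|not found"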
Rollback
kubectl rollout undo deployment/inference
Resources
Next Steps
For data handling, see coreweave-data-handling.
Configure CoreWeave Kubernetes Service (CKS) access with kubeconfig and API tokens.
CoreWeave Install & Auth
Overview
Set up access to CoreWeave Kubernetes Service (CKS). CKS runs bare-metal Kubernetes with NVIDIA GPUs -- no hypervisor overhead. Access is via standard kubeconfig with CoreWeave-issued credentials.
Prerequisites
- CoreWeave account at https://cloud.coreweave.com
- kubectl v1.28+ installed
- Kubernetes namespace provisioned by CoreWeave
Instructions
Step 1: Download Kubeconfig
- Log in to https://cloud.coreweave.com
- Navigate to API Access > Kubeconfig
- Download the kubeconfig file
# Save kubeconfig
mkdir -p ~/.kube
cp ~/Downloads/coreweave-kubeconfig.yaml ~/.kube/coreweave
# Set as active context
export KUBECONFIG=~/.kube/coreweave
# Verify connection
kubectl get nodes
kubectl get namespaces
Step 2: Configure API Token
# CoreWeave API token for programmatic access
export COREWEAVE_API_TOKEN="your-api-token"
# Store securely
echo "COREWEAVE_API_TOKEN=${COREWEAVE_API_TOKEN}" >> .env
echo "KUBECONFIG=~/.kube/coreweave" >> .env
Step 3: Verify GPU Access
# List available GPU nodes
kubectl get nodes -l gpu.nvidia.com/class -o custom-columns=\
NAME:.metadata.name,GPU:.metadata.labels.gpu\.nvidia\.com/class,\
STATUS:.status.conditions[-1].type
# Check GPU allocatable resources
kubectl describe nodes | grep -A5 "Allocatable:" | grep nvidia
Step 4: Test with a Simple GPU Pod
# test-gpu.yaml
apiVersion: v1
kind: Pod
metadata:
name: gpu-test
spec:
restartPolicy: Never
containers:
- name: cuda-test
image: nvidia/cuda:12.2.0-base-ubuntu22.04
command: ["nvidia-smi"]
resources:
limits:
nvidia.com/gpu: 1
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: gpu.nvidia.com/class
operator: In
values: ["A100_PCIE_80GB"]
kubectl apply -f test-gpu.yaml
kubectl logs gpu-test # Should show nvidia-smi output
kubectl delete pod gpu-test
Error Handling
| Error | Cause | Solution |
|---|---|---|
| Unable to connect to the server | Wrong kubeconfig | Verify KUBECONFIG path |
| Forbidden | Missing namespace permissions | Contact CoreWeave support |
| No GPU nodes found | Wrong node labels | Check gpu.nvidia.com/class labels |
| Pod stuck Pending | GPU capacity exhausted | Try a different GPU class or region |
Set up local development workflow for CoreWeave GPU deployments.
CoreWeave Local Dev Loop
Overview
Local development workflow for CoreWeave: build containers, test YAML manifests with dry-run, push to registry, and deploy to CoreWeave CKS.
Prerequisites
Instructions
Step 1: Project Structure
Step 2: Build and Push Container
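A minimal sketch for this step, reusing the registry naming from the deploy skill (image name and tag are placeholders):
docker build -t registry.coreweave.com/my-org/inference-svc:dev .
docker push registry.coreweave.com/my-org/inference-svc:dev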
Step 3: Validate Manifests Before Deploy
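A sketch of the dry-run validation mentioned in the overview, assuming manifests live under k8s/:
# Client-side check for YAML and schema problems, then a server-side dry-run against the cluster
kubectl apply --dry-run=client -f k8s/
kubectl apply --dry-run=server -f k8s/
# Show what would change relative to the live objects (kubectl diff exits non-zero when there are diffs)
kubectl diff -f k8s/ || true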
Step 4: Deploy and Watch
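A sketch of the deploy-and-watch step; namespace and label are placeholders:
kubectl apply -f k8s/ -n tenant-my-org
kubectl rollout status deployment/inference-svc -n tenant-my-org
kubectl logs -n tenant-my-org -l app=inference-svc -f --tail=50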
Error Handling
Resources
Next Steps
See coreweave-migration-deep-dive.
Migrate ML workloads from AWS/GCP/Azure to CoreWeave GPU cloud.
CoreWeave Migration Deep Dive
Cost Comparison
Migration Steps
Phase 1: Containerize
Phase 2: Adapt YAML for CoreWeave
Key changes from AWS EKS / GKE (node selector example sketched below):
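One concrete example of the adaptation: GPU node selection moves from cloud-specific labels to CoreWeave's gpu.nvidia.com/class label (the GKE label shown is illustrative):
# Before (GKE): nodeSelector uses cloud.google.com/gke-accelerator: nvidia-tesla-a100
# After (CoreWeave): nodeSelector uses gpu.nvidia.com/class: A100_PCIE_80GB
# Find manifests that still carry cloud-specific selectors
grep -rn "gke-accelerator\|eks.amazonaws.com" k8s/ || echo "no cloud-specific node selectors found"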
Phase 3: Parallel Deploy
Run both old and new infrastructure simultaneously, gradually shifting traffic.
Phase 4: Cut Over
Decommission old GPU instances after a validation period.
Common Gotchas
Resources
Next Steps
This completes the CoreWeave skill pack. Start with coreweave-install-auth.
Configure CoreWeave across development, staging, and production environments.
CoreWeave Multi-Environment Setup
Overview
CoreWeave GPU cloud requires strict environment separation to control infrastructure costs and prevent resource contention. Each environment maps to an isolated Kubernetes namespace with its own GPU quota, scaling policy, and access controls. Development uses cheaper GPU tiers for iteration speed, staging mirrors production GPU types for accurate benchmarking, and production runs full-scale with no scale-to-zero to guarantee inference latency SLAs.
Environment Configuration
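A sketch of the per-environment isolation described above -- one namespace per environment with its own GPU quota (names and limits are illustrative):
for env in dev staging prod; do
kubectl create namespace "ml-${env}" --dry-run=client -o yaml | kubectl apply -f -
done
kubectl create resourcequota gpu-quota -n ml-dev --hard=requests.nvidia.com/gpu=2
kubectl create resourcequota gpu-quota -n ml-staging --hard=requests.nvidia.com/gpu=4
kubectl create resourcequota gpu-quota -n ml-prod --hard=requests.nvidia.com/gpu=16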
Environment Files
Environment Validation
Promotion Workflow
Environment Matrix