coreweave-performance-tuning

Optimize CoreWeave GPU inference latency and throughput.


Allowed Tools

Read, Write, Edit, Bash(kubectl:*)

Provided by Plugin

coreweave-pack

Claude Code skill pack for CoreWeave (24 skills)

Category: saas packs, version v1.0.0

Installation

This skill is included in the coreweave-pack plugin:

/plugin install coreweave-pack@claude-code-plugins-plus


Instructions

CoreWeave Performance Tuning

GPU Selection by Workload

| Workload | Recommended GPU | Why |
| --- | --- | --- |
| LLM inference (7-13B) | A100 80GB | Good balance of memory and cost |
| LLM inference (70B+) | 8x H100 | NVLink for tensor parallelism |
| Image generation | L40 | Good for diffusion models |
| Training (large models) | 8x H100 SXM5 | Fastest interconnect |
| Batch processing | A100 40GB | Cost-effective |
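
To actually land a pod on the GPU class chosen from this table, CoreWeave clusters have conventionally been targeted with node affinity on a GPU class label. A minimal sketch, assuming the legacy `gpu.nvidia.com/class` label and the `A100_PCIE_80GB` value; verify the real labels on your cluster with `kubectl get nodes --show-labels`:

```yaml
# Sketch: pin an inference pod to A100 80GB nodes via node affinity.
# Label key and value are assumptions; check your cluster's node labels.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: gpu.nvidia.com/class
              operator: In
              values:
                - A100_PCIE_80GB
```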

Inference Optimization

vLLM's continuous batching admits new requests into in-flight batches so the GPU stays busy between token steps; the flags below bound the batch and reserve VRAM for the KV cache.

```yaml
# Continuous batching with vLLM
containers:
  - name: vllm
    args:
      - "--model=meta-llama/Llama-3.1-8B-Instruct"
      - "--max-num-batched-tokens=8192"   # max tokens processed per engine step
      - "--max-num-seqs=256"              # max concurrent sequences in a batch
      - "--gpu-memory-utilization=0.90"   # fraction of VRAM for weights + KV cache
      - "--enable-prefix-caching"         # reuse KV cache for shared prompt prefixes
      - "--dtype=float16"                 # half-precision inference
```

Autoscaling Tuning

Scale on GPU utilization rather than CPU: the NVIDIA DCGM exporter publishes per-GPU metrics that prometheus-adapter can surface to the HPA as a custom pods metric (see the adapter sketch after the manifest).

```yaml
# HPA based on GPU utilization
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: inference-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: inference-server
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: DCGM_FI_DEV_GPU_UTIL   # published by the NVIDIA DCGM exporter
        target:
          type: AverageValue
          averageValue: "70"           # target 70% average GPU utilization per pod
```
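
The `DCGM_FI_DEV_GPU_UTIL` pods metric only resolves if it is exposed through the custom metrics API, which typically means running the NVIDIA DCGM exporter and mapping the series in prometheus-adapter. A minimal sketch of the adapter rule; the `exported_*` label names depend on how Prometheus scrapes the exporter, so treat them as assumptions to verify against the raw series:

```yaml
# Sketch: prometheus-adapter rule exposing DCGM GPU utilization per pod.
# Label names (exported_namespace, exported_pod) vary with scrape config.
rules:
  - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_pod!=""}'
    resources:
      overrides:
        exported_namespace: {resource: "namespace"}
        exported_pod: {resource: "pod"}
    name:
      matches: "DCGM_FI_DEV_GPU_UTIL"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```

Confirm the metric resolves with `kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | grep -i gpu` before trusting the HPA to scale.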

Performance Benchmarks

| Metric | A100 80GB | H100 80GB |
| --- | --- | --- |
| Llama-8B tokens/sec | ~2,000 | ~4,500 |
| Llama-70B tokens/sec (4-GPU tensor parallel) | ~200 | ~500 |
| Cold start (vLLM) | 30-60s | 20-40s |
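
To check latency on your own deployment rather than relying on these figures, a single request against vLLM's OpenAI-compatible API works as a smoke test. The Service name and port below are assumptions; measuring sustained tokens/sec requires a concurrent load generator, not one curl:

```bash
# Rough single-request latency check (not a throughput benchmark).
# Assumes a Service named inference-server exposing vLLM's default port 8000.
kubectl port-forward svc/inference-server 8000:8000 &
sleep 3   # give the port-forward a moment to bind
time curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "prompt": "Hello", "max_tokens": 128}'
```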

Next Steps

For cost optimization, see coreweave-cost-tuning.
