vastai-prod-checklist

'Execute Vast.ai production deployment checklist for GPU workloads.

v1.11.0

Jeremy Longshore

MIT

Allowed Tools

ReadBash(vastai:*)Bash(curl:*)Grep

Provided by Plugin

vastai-pack

Claude Code skill pack for Vast.ai (24 skills)

saas packs v1.11.0

View Plugin

Installation

This skill is included in the vastai-pack plugin:

/plugin install vastai-pack@claude-code-plugins-plus

Click to copy

Instructions

Vast.ai Production Checklist

Overview

Complete checklist for running production GPU workloads on Vast.ai, covering account setup, instance selection, data safety, monitoring, and cost controls.

Prerequisites

Vast.ai account with sufficient credits
Docker images tested and published to registry
Checkpoint-based training pipeline

Instructions

Account & Authentication

[ ] API key stored in secrets manager (not in code or env files)
[ ] Dedicated SSH key pair for Vast.ai (not shared with other services)
[ ] Account balance sufficient for planned workload duration + 50% buffer
[ ] Billing alerts configured at cloud.vast.ai

Instance Selection

[ ] GPU type validated for workload (VRAM, compute capability)
[ ] Reliability filter set to >= 0.98 for production jobs
[ ] Internet speed filter set to inet_down >= 200 for data transfer
[ ] Disk allocation includes room for checkpoints + data + 20% overhead
[ ] CUDA version on host matches Docker image requirements

Data Safety

[ ] Training data encrypted before upload to instances
[ ] Checkpoint saving every N steps (not just per epoch)
[ ] Checkpoints uploaded to persistent storage (S3/GCS) periodically
[ ] Instance cleanup script removes data before destruction
[ ] No sensitive data (API keys, PII) embedded in Docker images

Spot Instance Protection

[ ] Spot preemption handler implemented (save checkpoint on SIGTERM)
[ ] Auto-recovery: detect destroyed instance, provision replacement, resume
[ ] On-demand fallback configured for critical final training stages
[ ] Checkpoint integrity verification after recovery

Monitoring & Alerting

[ ] GPU utilization monitoring (alert if < 50% for > 10 min)
[ ] Instance health polling every 60 seconds
[ ] Cost accumulation tracking with budget threshold alerts
[ ] Training loss/metrics logged to external service (W&B, MLflow)
[ ] Dead instance detection (auto-destroy stuck instances)

Cost Controls

[ ] Maximum dph_total set in search queries
[ ] Auto-destroy timeout for all instances (e.g., 24h max)
[ ] Daily spending limit configured
[ ] Cost-per-job tracking for budget reporting

Verification Script


#!/bin/bash
set -euo pipefail
echo "Vast.ai Production Readiness Check"

# 1. Auth
vastai show user --raw | python3 -c "
import sys, json; u=json.load(sys.stdin)
balance = u.get('balance', 0)
print(f'  Auth: OK | Balance: \${balance:.2f}')
assert balance >= 10, f'Balance too low: \${balance:.2f}'
" && echo "  Balance: PASS" || echo "  Balance: FAIL"

# 2. Offer availability
COUNT=$(vastai search offers 'reliability>0.98 num_gpus=1 rentable=true' --raw --limit 1 | python3 -c "import sys,json; print(len(json.load(sys.stdin)))")
echo "  Offers available: $COUNT+ | PASS"

# 3. Docker image pullable
docker pull pytorch/pytorch:2.2.0-cuda12.1-cudnn8-runtime > /dev/null 2>&1 && echo "  Docker image: PASS" || echo "  Docker image: FAIL"

echo "Pre-flight checks complete."

Output

Production readiness checklist verified
Verification script passes all checks
Cost controls and monitoring configured
Data safety measures in place

Error Handling

Error	Cause	Solution
Insufficient balance	Credits depleted mid-job	Set up auto-top-up or balance alerts
Instance preempted during final epoch	Spot instance reclaimed	Use on-demand for final training stage
Checkpoint corrupted	Interrupted mid-save	Implement atomic checkpoint writes (save to temp, rename)
GPU utilization drops to 0%	Data pipeline bottleneck	Profile data loading; increase disk I/O

Resources

Next Steps

For version upgrades, see vastai-upgrade-migration.

Examples

Pre-launch audit: Run the verification script, check all boxes, confirm Docker image pulls successfully, and verify at least 3 matching offers are available before starting a production training run.

Budget-safe launch: Set max_dph=2.00, auto-destroy timeout of 12 hours, and daily spend alert at $50 to prevent cost overruns.

Allowed Tools

Provided by Plugin

vastai-pack

Installation

Instructions

Vast.ai Production Checklist

Overview

Prerequisites

Instructions

Account & Authentication

Instance Selection

Data Safety

Spot Instance Protection

Monitoring & Alerting

Cost Controls

Verification Script

Output

Error Handling

Resources

Next Steps

Examples

Ready to use vastai-pack?

Related Skills

abridge-ci-integration

abridge-common-errors

abridge-core-workflow-a

abridge-core-workflow-b

abridge-cost-tuning

abridge-debug-bundle