castai-core-workflow-a

Configure CAST AI autoscaler policies and node templates for cost optimization.

6 Tools
castai-pack Plugin
saas packs Category

Allowed Tools

Read, Write, Edit, Bash(curl:*), Bash(kubectl:*), Grep

Provided by Plugin

castai-pack

Claude Code skill pack for Cast AI (18 skills)

saas packs v1.0.0

Installation

This skill is included in the castai-pack plugin:

/plugin install castai-pack@claude-code-plugins-plus


Instructions

CAST AI Core Workflow: Autoscaler & Policies

Overview

Primary workflow for CAST AI: configure autoscaler policies to optimize cluster costs. Covers enabling spot instances, configuring the node downscaler and evictor, setting cluster CPU/memory limits, and creating node templates for workload-specific requirements.

Prerequisites

  • Completed castai-install-auth with Phase 2 (cluster controller + evictor)
  • CASTAI_API_KEY and CASTAI_CLUSTER_ID environment variables set
  • Cluster in "ready" status

Instructions

Step 1: Read Current Policies


curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq .
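For a quick summary instead of the full policy document, the response can be reduced with a jq filter. The field names below mirror the PUT payload in Step 2 and are assumed to match the GET response shape:

```shell
# Summarize just the toggles that matter for cost optimization
# (assumes the GET response uses the same field names as the PUT payload)
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '{autoscaler: .enabled,
         spot: .spotInstances.enabled,
         downscaler: .nodeDownscaler.enabled,
         limits: .clusterLimits.enabled}'
```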

Step 2: Enable Cost-Optimized Autoscaling


curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \
  -H "Content-Type: application/json" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  -d '{
    "enabled": true,
    "unschedulablePods": {
      "enabled": true,
      "headroom": {
        "cpuPercentage": 10,
        "memoryPercentage": 10,
        "enabled": true
      }
    },
    "nodeDownscaler": {
      "enabled": true,
      "emptyNodes": {
        "enabled": true,
        "delaySeconds": 180
      }
    },
    "spotInstances": {
      "enabled": true,
      "clouds": ["aws"],
      "spotDiversityEnabled": true,
      "spotDiversityPriceIncreaseLimitPercent": 20
    },
    "clusterLimits": {
      "enabled": true,
      "cpu": {
        "minCores": 4,
        "maxCores": 100
      }
    }
  }'
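A 400 response from this endpoint is usually malformed JSON. Writing the payload to a file and validating it with jq before the PUT avoids that; a sketch, where policy.json is a hypothetical filename:

```shell
# Validate the payload locally before sending it to the API;
# jq empty exits non-zero on invalid JSON, so the PUT never runs with a bad body
jq empty policy.json \
  && curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \
       -H "Content-Type: application/json" \
       "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
       --data @policy.json
```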

Step 3: Configure Node Templates via Terraform


resource "castai_node_template" "spot_workers" {
  cluster_id = castai_eks_cluster.this.id
  name       = "spot-workers"
  is_default = false
  is_enabled = true

  constraints {
    min_cpu               = 2
    max_cpu               = 16
    min_memory            = 4096
    max_memory            = 65536
    spot                  = true
    use_spot_fallbacks    = true
    fallback_restore_rate_seconds = 600

    instance_families {
      include = ["m5", "m6i", "c5", "c6i", "r5", "r6i"]
    }

    architectures = ["amd64"]
  }

  custom_labels = {
    "workload-type" = "batch"
  }
}

resource "castai_node_template" "gpu_ondemand" {
  cluster_id = castai_eks_cluster.this.id
  name       = "gpu-ondemand"
  is_default = false
  is_enabled = true

  constraints {
    spot                  = false
    gpu_manufacturers     = ["NVIDIA"]

    instance_families {
      include = ["p3", "p4d", "g4dn", "g5"]
    }
  }

  custom_labels = {
    "workload-type" = "gpu"
  }
}
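After terraform apply, the templates can also be confirmed through the API. The node-templates endpoint path and response field names below are assumed from CAST AI's v1 API conventions:

```shell
# List configured node templates, showing just name/enabled/spot per template
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/node-templates" \
  | jq '[.items[] | {name, isEnabled, spot: .constraints.spot}]'
```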

Step 4: Verify Autoscaler is Working


# Check if the autoscaler is processing nodes
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/external-clusters/${CASTAI_CLUSTER_ID}/nodes" \
  | jq '[.items[] | {name, instanceType, lifecycle, castaiManaged: .castaiManaged}]
        | group_by(.lifecycle)
        | map({lifecycle: .[0].lifecycle, count: length})'

# Expected: mix of spot and on-demand nodes
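The same breakdown can be cross-checked from inside the cluster. CAST AI marks spot nodes with a node label (scheduling.cast.ai/spot is assumed here), so kubectl output can be grouped the same way:

```shell
# Group nodes by the CAST AI spot label; nodes without the label count as on-demand
kubectl get nodes -o json \
  | jq '[.items[] | (.metadata.labels["scheduling.cast.ai/spot"] // "false")]
        | group_by(.)
        | map({spot: .[0], count: length})'
```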

Error Handling

| Error | Cause | Solution |
| --- | --- | --- |
| Policy update returns 400 | Invalid policy JSON | Validate with jq before sending |
| Nodes not scaling | Policy not enabled | Verify .enabled: true in policy |
| Spot instances not used | Provider not configured | Add cloud provider to spotInstances.clouds |
| Evictor too aggressive | Low delay threshold | Increase emptyNodes.delaySeconds |
| Cluster limit hit | maxCores too low | Increase clusterLimits.cpu.maxCores |

Next Steps

For workload-level autoscaling, see castai-core-workflow-b.
