castai-core-workflow-a

Configure CAST AI autoscaler policies and node templates for cost optimization.

6 Tools
castai-pack Plugin
saas packs Category

Allowed Tools

Read, Write, Edit, Bash(curl:*), Bash(kubectl:*), Grep

Provided by Plugin

castai-pack

Claude Code skill pack for Cast AI (18 skills)

saas packs v1.0.0

Installation

This skill is included in the castai-pack plugin:

/plugin install castai-pack@claude-code-plugins-plus


Instructions

CAST AI Core Workflow: Autoscaler & Policies

Overview

Primary workflow for CAST AI: configure autoscaler policies to optimize cluster costs. Covers enabling spot instances, configuring the node downscaler and evictor, setting cluster CPU/memory limits, and creating node templates for workload-specific requirements.

Prerequisites

  • Completed castai-install-auth with Phase 2 (cluster controller + evictor)
  • CASTAI_API_KEY and CASTAI_CLUSTER_ID environment variables set
  • Cluster in "ready" status

Instructions

Step 1: Read Current Policies


curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq .
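For a quick summary instead of the full policy document, the response can be reduced with a jq filter. The field names below mirror the PUT payload in Step 2 and are assumed to match the GET response shape:

```shell
# Summarize just the toggles that matter for cost optimization
# (assumes the GET response uses the same field names as the PUT payload)
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  | jq '{autoscaler: .enabled,
         spot: .spotInstances.enabled,
         downscaler: .nodeDownscaler.enabled,
         limits: .clusterLimits.enabled}'
```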

Step 2: Enable Cost-Optimized Autoscaling


curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \
  -H "Content-Type: application/json" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
  -d '{
    "enabled": true,
    "unschedulablePods": {
      "enabled": true,
      "headroom": {
        "cpuPercentage": 10,
        "memoryPercentage": 10,
        "enabled": true
      }
    },
    "nodeDownscaler": {
      "enabled": true,
      "emptyNodes": {
        "enabled": true,
        "delaySeconds": 180
      }
    },
    "spotInstances": {
      "enabled": true,
      "clouds": ["aws"],
      "spotDiversityEnabled": true,
      "spotDiversityPriceIncreaseLimitPercent": 20
    },
    "clusterLimits": {
      "enabled": true,
      "cpu": {
        "minCores": 4,
        "maxCores": 100
      }
    }
  }'
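A 400 response from this endpoint is usually malformed JSON. Writing the payload to a file and validating it with jq before the PUT avoids that; a sketch, where policy.json is a hypothetical filename:

```shell
# Validate the payload locally before sending it to the API;
# jq empty exits non-zero on invalid JSON, so the PUT never runs with a bad body
jq empty policy.json \
  && curl -X PUT -H "X-API-Key: ${CASTAI_API_KEY}" \
       -H "Content-Type: application/json" \
       "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/policies" \
       --data @policy.json
```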

Step 3: Configure Node Templates via Terraform


resource "castai_node_template" "spot_workers" {
  cluster_id = castai_eks_cluster.this.id
  name       = "spot-workers"
  is_default = false
  is_enabled = true

  constraints {
    min_cpu               = 2
    max_cpu               = 16
    min_memory            = 4096
    max_memory            = 65536
    spot                  = true
    use_spot_fallbacks    = true
    fallback_restore_rate_seconds = 600

    instance_families {
      include = ["m5", "m6i", "c5", "c6i", "r5", "r6i"]
    }

    architectures = ["amd64"]
  }

  custom_labels = {
    "workload-type" = "batch"
  }
}

resource "castai_node_template" "gpu_ondemand" {
  cluster_id = castai_eks_cluster.this.id
  name       = "gpu-ondemand"
  is_default = false
  is_enabled = true

  constraints {
    spot                  = false
    gpu_manufacturers     = ["NVIDIA"]

    instance_families {
      include = ["p3", "p4d", "g4dn", "g5"]
    }
  }

  custom_labels = {
    "workload-type" = "gpu"
  }
}
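After terraform apply, the templates can also be confirmed through the API. The node-templates endpoint path and response field names below are assumed from CAST AI's v1 API conventions:

```shell
# List configured node templates, showing just name/enabled/spot per template
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/clusters/${CASTAI_CLUSTER_ID}/node-templates" \
  | jq '[.items[] | {name, isEnabled, spot: .constraints.spot}]'
```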

Step 4: Verify Autoscaler is Working


# Check if the autoscaler is processing nodes
curl -s -H "X-API-Key: ${CASTAI_API_KEY}" \
  "https://api.cast.ai/v1/kubernetes/external-clusters/${CASTAI_CLUSTER_ID}/nodes" \
  | jq '[.items[] | {name, instanceType, lifecycle, castaiManaged: .castaiManaged}]
        | group_by(.lifecycle)
        | map({lifecycle: .[0].lifecycle, count: length})'

# Expected: mix of spot and on-demand nodes
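The same breakdown can be cross-checked from inside the cluster. CAST AI marks spot nodes with a node label (scheduling.cast.ai/spot is assumed here), so kubectl output can be grouped the same way:

```shell
# Group nodes by the CAST AI spot label; nodes without the label count as on-demand
kubectl get nodes -o json \
  | jq '[.items[] | (.metadata.labels["scheduling.cast.ai/spot"] // "false")]
        | group_by(.)
        | map({spot: .[0], count: length})'
```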

Error Handling

| Error | Cause | Solution |
| --- | --- | --- |
| Policy update returns 400 | Invalid policy JSON | Validate with jq before sending |
| Nodes not scaling | Policy not enabled | Verify .enabled: true in policy |
| Spot instances not used | Provider not configured | Add cloud provider to spotInstances.clouds |
| Evictor too aggressive | Low delay threshold | Increase emptyNodes.delaySeconds |
| Cluster limit hit | maxCores too low | Increase clusterLimits.cpu.maxCores |

Next Steps

For workload-level autoscaling, see castai-core-workflow-b.
