AirStack CI Orchestrator

This document describes how to use a self-hosted OpenStack VM to run GitHub Actions jobs on truly ephemeral workers. The orchestrator is a Python service that continuously polls GitHub for queued workflow jobs, spawns a fresh OpenStack instance for each one with a single-use JIT runner token, and reaps (deletes) the instance when the job completes. This lets us run CI workloads on GPU-equipped VMs without sharing any state between runs or exposing long-lived credentials on the worker.

The orchestrator VM is the only host that holds the GitHub PAT and the OpenStack credential; the workers are destroyed after a single job.

Architecture

┌─────────────────────────────────────────────────────────────┐
│  Orchestrator VM  (airstack-ci-cd-orchestrator)             │
│                                                             │
│  airstack-orchestrator.service → orchestrator.py            │
│    spawn loop  (every 15s):                                 │
│      • GET  /repos/<repo>/actions/runs?status=queued        │
│      • POST /repos/<repo>/actions/runners/generate-jitconfig│
│      • openstack server create  (image, flavor, user_data)  │
│      • record (job_id → server_id) in state.json            │
│    reap loop   (every 30s):                                 │
│      • job completed     → openstack server delete          │
│      • job age > N min   → force delete (straggler)         │
│      • owned but not in state → orphan reap                 │
│                                                             │
│  /etc/airstack-orchestrator/                                │
│      config.yaml                                            │
│      github-pat                                             │
│  /home/orchestrator/.config/openstack/clouds.yaml           │
│  /var/lib/airstack-orchestrator/state.json                  │
└─────────┬─────────────────────────────────┬─────────────────┘
          │ Nova / Neutron API              │ GitHub REST API
          ▼                                 ▼
┌──────────────────────────────────┐  ┌──────────────────────┐
│  Ephemeral worker (per job)      │  │  GitHub Actions      │
│  Image: Ubuntu-24.04-GPU-Headless│  │  workflow_job queue  │
│  cloud-init:                     │  └──────────────────────┘
│    install docker + nv toolkit   │
│    download GH runner            │
│    run.sh --jitconfig <token>    │
│    shutdown -h +1                │
└──────────────────────────────────┘
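
The four cloud-init steps in the worker box above are rendered from cloud-init.yaml.j2. Below is a minimal sketch of what the rendered user_data might look like, assuming illustrative template variables ({{ runner_version }}, {{ jit_config }}) and eliding the Docker + NVIDIA toolkit install; it is not the shipped template:

#cloud-config
# Sketch of the worker bootstrap; the real template is cloud-init.yaml.j2.
runcmd:
  # Download and unpack the pinned Actions runner release.
  - mkdir -p /home/ubuntu/actions-runner
  - curl -fsSL https://github.com/actions/runner/releases/download/v{{ runner_version }}/actions-runner-linux-x64-{{ runner_version }}.tar.gz -o /tmp/runner.tgz
  - tar xzf /tmp/runner.tgz -C /home/ubuntu/actions-runner
  - chown -R ubuntu:ubuntu /home/ubuntu/actions-runner
  # Single-use JIT config: the runner registers, runs exactly one job, exits.
  - sudo -u ubuntu bash -c 'cd /home/ubuntu/actions-runner && ./run.sh --jitconfig "{{ jit_config }}"'
  # Power off a minute later; the reap loop deletes the server once GitHub
  # reports the job complete.
  - shutdown -h +1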

Key properties:

  • Truly ephemeral: every job runs on a clean VM. No Docker layer cache pollution, no leftover networks, no carry-over from prior runs.
  • PAT isolation: the GitHub PAT lives only on the orchestrator. Workers receive a single-use JIT runner config — a base64 token bound to one runner registration, valid only for a short window.
  • Application-credential auth: the orchestrator authenticates to OpenStack with an application credential (revocable, scoped, no password), not the user's openrc.sh.
  • Crash-safe reaping: every server we spawn is tagged with airstack-role=ephemeral-runner. The reap loop force-deletes any owned server not present in state.json, so a crashed orchestrator can't leak instances.
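
To make the spawn loop concrete, here is a minimal Python sketch of one iteration. It is a sketch only: cfg, state, gh_headers, and render_cloud_init are hypothetical stand-ins for whatever orchestrator.py actually uses, and conn is an openstacksdk connection (openstack.connect(cloud="airstack")).

import requests

GITHUB = "https://api.github.com"

def spawn_for_queued_jobs(conn, cfg, state, gh_headers):
    # 1. Queued workflow runs, then each run's queued jobs.
    runs = requests.get(f"{GITHUB}/repos/{cfg['repo']}/actions/runs",
                        params={"status": "queued"},
                        headers=gh_headers, timeout=30).json()["workflow_runs"]
    for run in runs:
        jobs = requests.get(run["jobs_url"], headers=gh_headers,
                            timeout=30).json()["jobs"]
        for job in jobs:
            if job["status"] != "queued" or str(job["id"]) in state["jobs"]:
                continue
            # 2. Mint a single-use JIT runner config bound to this job's labels.
            jit = requests.post(
                f"{GITHUB}/repos/{cfg['repo']}/actions/runners/generate-jitconfig",
                headers=gh_headers, timeout=30,
                json={"name": f"ephemeral-{job['id']}",
                      "runner_group_id": 1,
                      "labels": cfg["runner_labels"]}).json()["encoded_jit_config"]
            # 3. Boot a tagged worker whose cloud-init runs the runner.
            server = conn.create_server(
                name=f"ephemeral-{job['id']}",
                image=cfg["image_id"], flavor=cfg["flavor_name"],
                network=cfg["network_name"], key_name=cfg["keypair_name"],
                userdata=render_cloud_init(jit),             # hypothetical helper
                meta={"airstack-role": "ephemeral-runner"},  # reap-loop ownership tag
                wait=False)
            # 4. Record the mapping so the reap loop can clean up later.
            state["jobs"][str(job["id"])] = {"server_id": server.id}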

Prerequisites

  • An OpenStack instance already set up for the orchestrator VM. The orchestrator itself is lightweight and doesn't need a GPU: 1 vCPU, 2 GB RAM, and 20 GB disk are sufficient for the orchestrator service. Make sure you can SSH into it and that it has outbound internet access.
  • An OpenStack flavor with GPU passthrough and enough disk to run Docker + the tests. The orchestrator spawns workers from this flavor. It's common for GPU flavors to have disk=0 (no flavor-defined root disk); in that case, set boot_volume_size_gb large enough for the OS + Docker images + test assets (e.g., 40 GB), and workers will boot from a Cinder volume of that size sourced from image_id, provided your OpenStack setup supports booting volumes from an image (see the sketch below). Either way, pre-baking Docker and the NVIDIA toolkit into the image speeds up worker boot.
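
For a disk=0 flavor, a one-off boot-from-volume smoke test with openstacksdk looks roughly like this (a sketch, assuming the airstack clouds.yaml entry created in step 1 below; angle-bracket values are yours to fill in):

import openstack

conn = openstack.connect(cloud="airstack")
# boot_from_volume creates a Cinder volume of volume_size GB from the image
# and boots from it; boot_volume_size_gb > 0 in config.yaml does the same.
server = conn.create_server(
    name="bfv-smoke-test",
    image="Ubuntu-24.04-GPU-Headless",
    flavor="<gpu-flavor>",
    network="<network-name>",
    key_name="<keypair-name>",
    boot_from_volume=True,
    volume_size=40,           # mirrors boot_volume_size_gb
    terminate_volume=True,    # delete the volume together with the server
    wait=True)
print(server.status)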

One-time setup

1. Create OpenStack application credential

On your local workstation (not the orchestrator VM):

source ~/.airlabcloud/openrc.sh
openstack application credential create airstack-orchestrator \
  --description "AirStack CI orchestrator — spawns ephemeral test runners"

The output prints id and secret. Build a clouds.yaml:

clouds:
  airstack:
    auth_type: v3applicationcredential
    auth:
      auth_url: https://airlab-cloud.andrew.cmu.edu:5000/v3/
      application_credential_id: <id from above>
      application_credential_secret: <secret from above>
    region_name: Airlab
    interface: public
    identity_api_version: 3
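
You can sanity-check the credential with any read-only call before staging it. A quick check with openstacksdk, which picks up clouds.yaml from the standard search paths (e.g. ~/.config/openstack/):

import openstack

# "airstack" must match the cloud name used in clouds.yaml above.
conn = openstack.connect(cloud="airstack")
print([flavor.name for flavor in conn.compute.flavors()])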

2. Stage credentials on the orchestrator VM

# clouds.yaml: install for the orchestrator user (created in step 3)
scp clouds.yaml ubuntu@<orchestrator-ip>:/tmp/clouds.yaml

# GitHub PAT: needs `Actions: read/write` and `Administration: read/write`
# (fine-grained) or classic `repo` scope.
scp ~/.airlabcloud/airstack-github-pat.txt \
    ubuntu@<orchestrator-ip>:/tmp/github-pat

3. Run setup.sh

On the orchestrator VM:

git clone https://github.com/castacks/AirStack.git /tmp/airstack
sudo bash /tmp/airstack/.github/orchestrator/setup.sh

setup.sh creates the orchestrator system user, builds the Python venv, copies orchestrator.py and cloud-init.yaml.j2 into /opt/airstack-orchestrator/, scaffolds /etc/airstack-orchestrator/, installs the systemd unit, and consumes /tmp/github-pat.
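
For orientation, the installed unit is roughly of this shape (an illustrative sketch only; setup.sh writes the real file, and the venv path is an assumption):

[Unit]
Description=AirStack CI orchestrator
After=network-online.target
Wants=network-online.target

[Service]
User=orchestrator
Group=orchestrator
# Assumed venv location under /opt/airstack-orchestrator/.
ExecStart=/opt/airstack-orchestrator/venv/bin/python /opt/airstack-orchestrator/orchestrator.py
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target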

You still need to put the clouds.yaml in place under the orchestrator user's home:

sudo install -d -o orchestrator -g orchestrator -m 0700 \
    /home/orchestrator/.config/openstack
sudo install -o orchestrator -g orchestrator -m 0600 \
    /tmp/clouds.yaml /home/orchestrator/.config/openstack/clouds.yaml
sudo shred -u /tmp/clouds.yaml

4. Fill in /etc/airstack-orchestrator/config.yaml

Edit the placeholders the example ships with:

  • flavor_name: the OpenStack flavor with GPU + enough disk. Find it with openstack flavor list.
  • network_name: the network the workers attach to. Find it with openstack network list.
  • keypair_name: SSH keypair for break-glass access. Find it with openstack keypair list.
  • security_group: must allow outbound 443. Find it with openstack security group list.
  • availability_zone: optional AZ for the spawned instances; leave empty to let Nova pick. Find it with openstack availability zone list.
  • boot_volume_size_gb: set >0 if your flavor has disk=0 (common for GPU flavors); the worker then boots from a Cinder volume of this size sourced from image_id. Leave 0 for direct image boot. Check the disk field in openstack flavor show <flavor>.
  • floating_ips: pre-allocated FIP pool, rotated through sequentially; each spawn picks the first free one (sketched below this list), and max_concurrent is capped at len(pool). Leave empty to skip FIP attachment. List candidates with openstack floating ip list.
  • repo: owner/name of the repository to poll, as in its GitHub URL.
  • runner_version: version tag from actions/runner releases; check before each major upgrade.
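
The first-free rotation through floating_ips amounts to something like this sketch with openstacksdk (orchestrator.py owns the real version; conn is an openstacksdk connection):

def first_free_fip(conn, pool):
    # A floating IP with no fixed_ip_address is not attached to any port.
    by_addr = {ip.floating_ip_address: ip for ip in conn.network.ips()}
    for addr in pool:  # pool = the config's floating_ips list, in order
        ip = by_addr.get(addr)
        if ip is not None and ip.fixed_ip_address is None:
            return ip
    return None  # pool exhausted; the spawn loop waits for a reap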

5. Start the service

sudo systemctl enable --now airstack-orchestrator.service
journalctl -u airstack-orchestrator.service -f

You should see orchestrator started: repo=... labels=... max_concurrent=N and then periodic poll activity.

End-to-end verification

# Trigger a fast build-only run.
gh workflow run system-tests.yml -f marks=build_docker

# Within ~30s, a server should appear:
openstack server list --metadata airstack-role=ephemeral-runner
# or if your OpenStack setup doesn't support metadata queries:
openstack server list --name '^ephemeral-'

# Watch GitHub → Actions → Runners — the ephemeral runner should appear,
# pick up the job, then disappear.

# Within ~30s of job completion, the server should be gone:
openstack server list --metadata airstack-role=ephemeral-runner
openstack server list --name '^ephemeral-'

Operational notes

  • State file: /var/lib/airstack-orchestrator/state.json is the in-flight job tracker. Wiping it triggers an orphan sweep on the next reap iteration (sketched after this list): every owned server will be force-deleted. Don't wipe it while jobs are mid-flight unless that's what you want.
  • Stuck instance: any server older than max_job_minutes (default 90) is force-deleted regardless of GitHub job status. Bump this if liveliness/autonomy runs grow longer than ~75 minutes.
  • PAT rotation: sudo install -o root -g orchestrator -m 0640 /tmp/new-pat /etc/airstack-orchestrator/github-pat && sudo systemctl restart airstack-orchestrator.service.
  • Pause spawning (e.g. for maintenance): sudo systemctl stop airstack-orchestrator.service. Already-spawned workers will still complete their jobs and self-shutdown; on restart, the reap loop deletes them.
  • Logs: journalctl -u airstack-orchestrator.service -f. Cloud-init logs from individual workers are visible only via openstack console log show <server> while the worker is running.
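
The orphan sweep mentioned in the state-file note reduces to roughly this sketch (again, orchestrator.py owns the real version; conn is an openstacksdk connection):

def reap_orphans(conn, state):
    known = {entry["server_id"] for entry in state["jobs"].values()}
    for server in conn.compute.servers():
        # Only ever touch servers we tagged as ours at spawn time.
        if server.metadata.get("airstack-role") != "ephemeral-runner":
            continue
        if server.id not in known:
            conn.compute.delete_server(server, force=True)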

Debugging a failed job

When a GitHub workflow run fails or stalls, the failure can be in any of four places: the orchestrator (didn't spawn), cloud-init (didn't bootstrap), the GH Actions runner (didn't register or crashed), or the workflow steps themselves. Each has a different inspection path.

1. Find which worker ran the job

state.json is the authoritative job ↔ server ↔ floating-IP map:

sudo jq -r '.jobs | to_entries[] | "\(.key)\t\(.value.server_id)\t\(.value.floating_ip)\t\(.value.runner_name)"' \
  /var/lib/airstack-orchestrator/state.json

Pick the row for your failing job_id (visible in the GitHub Actions URL). Save the values:

JOB_ID=73286176852          # from the GitHub UI
SERVER=$(sudo jq -r ".jobs[\"$JOB_ID\"].server_id"   /var/lib/airstack-orchestrator/state.json)
FIP=$(   sudo jq -r ".jobs[\"$JOB_ID\"].floating_ip" /var/lib/airstack-orchestrator/state.json)

If the job isn't in state.json, the orchestrator never spawned for it — see step 2 below.

2. Did the orchestrator spawn at all?

sudo journalctl -u airstack-orchestrator.service --since "30 min ago" --no-pager

What you want to see for a healthy spawn:

spawned server <uuid> for job <job_id> (<job name>)
attached floating IP <addr> to server <uuid> (job <job_id>)

Common things that block a spawn (and how to spot them):

  • find_queued_jobs failed: 401 ...: the PAT expired or has the wrong scope. Rotate the PAT (see Operational notes).
  • spawn failed for job ...: Block Device Mapping is Invalid: the flavor has disk=0 while boot_volume_size_gb is 0. Set boot_volume_size_gb > 0.
  • no free floating IP in pool: all FIPs in floating_ips are already in use. Wait for an in-flight job to complete, or expand the pool.
  • floating_ips configured but not found: the pool addresses don't exist in the project. Double-check openstack floating ip list.
  • A job is queued in GitHub but there is no spawned log line: the runner labels in the workflow's runs-on don't match runner_labels in config.yaml. Make them match (a quick check follows below).
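
For the last row, you can dump the labels GitHub attached to each queued job and compare them to runner_labels in config.yaml (a quick check with requests; gh_headers carries the PAT exactly as in the orchestrator's own polling):

import requests

def print_queued_job_labels(repo, gh_headers):
    runs = requests.get(f"https://api.github.com/repos/{repo}/actions/runs",
                        params={"status": "queued"},
                        headers=gh_headers, timeout=30).json()
    for run in runs.get("workflow_runs", []):
        jobs = requests.get(run["jobs_url"], headers=gh_headers, timeout=30).json()
        for job in jobs.get("jobs", []):
            print(job["id"], job["name"], job["labels"])  # must match runner_labels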

3. SSH into a running worker

If the worker is ACTIVE, the floating IP is attached and you can connect directly. The keypair was injected during spawn — use the matching private key:

ssh -i <keypair>.pem ubuntu@"$FIP"

If your workstation can't reach the FIP subnet, jump through the orchestrator (which is on the same network):

ssh -J ubuntu@<orchestrator-ip> -i <keypair>.pem ubuntu@"$FIP"

4. SSH into a SHUTOFF worker

Workers shut themselves down after run.sh exits (whether the job succeeded, failed, or the runner crashed). The orchestrator only deletes a server once GitHub reports the job completed, so a SHUTOFF worker is preserved while you debug.

# Optional but safer — keep the orchestrator from reaping mid-session.
sudo systemctl stop airstack-orchestrator.service

openstack server start "$SERVER"
# Wait ~30s, then SSH using the FIP from state.json.
ssh -i <keypair>.pem ubuntu@"$FIP"

When done, delete the worker manually and resume the orchestrator:

openstack server delete "$SERVER"
sudo jq "del(.jobs[\"$JOB_ID\"])" /var/lib/airstack-orchestrator/state.json \
  | sudo tee /var/lib/airstack-orchestrator/state.json.new >/dev/null
sudo mv /var/lib/airstack-orchestrator/state.json.new /var/lib/airstack-orchestrator/state.json
sudo systemctl start airstack-orchestrator.service

5. What to read once you're on the worker

# Combined boot + cloud-init output. Most useful single file: shows every
# line our airstack-runner-bootstrap.sh printed, including run.sh's exit.
sudo less /var/log/cloud-init-output.log
sudo tail -300 /var/log/cloud-init-output.log

# Cloud-init's structured log — quick way to surface errors.
sudo grep -E 'WARN|ERROR|FAIL' /var/log/cloud-init.log

# GitHub Actions runner diagnostics. The Worker_*.log corresponds to the
# actual job execution; Runner_*.log covers registration and dispatch.
ls -lt /home/ubuntu/actions-runner/_diag/
sudo tail -300 /home/ubuntu/actions-runner/_diag/Runner_*.log
sudo tail -300 /home/ubuntu/actions-runner/_diag/Worker_*.log

# Sanity-check Docker came up cleanly — a frequent failure point.
sudo systemctl status docker
docker info 2>&1 | head

6. Console log fallback

Some flavors on this cloud don't expose the serial console (openstack console log show returns Guest does not have a console available). For those, the SSH path above is the only option. Where it does work, the console log persists across SHUTOFF and is faster than restarting the VM:

openstack console log show "$SERVER" | tail -200

7. Common failure patterns at the worker

Symptoms near the end of cloud-init-output.log, their causes, and fixes:

  • Could not connect to api.github.com or DNS errors: the security group blocks egress, or the network has no NAT. Allow outbound 443; if behind NAT, ensure FIP networking covers egress.
  • Bad credentials / Invalid configuration ... runnerEvent: the JIT config's TTL elapsed before run.sh started because bootstrap took too long. Pre-bake Docker + nvidia-container-toolkit into the image to shrink bootstrap time.
  • nvidia-ctk: command not found or an NVIDIA driver mismatch: the image's driver doesn't match the toolkit version. Use a different image, or pin a compatible toolkit version.
  • apt-get update fails: the image's apt sources are unreachable from this network. Check the network/security group, or pre-bake the packages into the image.
  • Runner registered, then pytest failed: a normal test failure. Read the GitHub Actions log; that's the canonical view of the workflow output.
  • No space left on device: boot_volume_size_gb is too small for the Docker images + sim assets. Bump boot_volume_size_gb.