System Testing¶
AirStack's system tests bring up the full Docker-based stack — simulator, robot containers, and GCS — and verify end-to-end behavior: container health, ROS 2 node presence, sensor publishing rates (in the sensors mark), and compute resource usage. Tests are written in Python with pytest and live under tests/ at the repo root.
Test Suite Structure¶
| Module | Mark | What it tests | Hardware required |
|---|---|---|---|
| `test_build_docker.py` | `build_docker` | Docker image builds (robot-desktop, gcs, isaac-sim, ms-airsim); records image sizes | Docker daemon |
| `test_build_packages.py` | `build_packages` | `colcon build` inside each container (robot, GCS, ms-airsim ROS workspace) | Docker daemon |
| `test_liveliness.py` | `liveliness` | Stack bring-up: container Running state, `/clock` readiness, tmux panes, sentinel ROS 2 nodes, compute snapshot, infra-only `test_stable` (tmux + nodes + compute) | Docker daemon, GPU, sim license |
| `test_sensors.py` | `sensors` | After liveliness in collection order: sim + robot stereo/depth Hz (Isaac: batched `ros2 topic hz` to avoid bridge overload; ms-airsim: single batch), filtered LiDAR via `echo --once` + cloud sanity (isaacsim), sim RTF, `test_sensor_streams_stable` | Docker daemon, GPU, sim license |
| `test_takeoff_hover_land.py` | `takeoff_hover_land` | End-to-end flight: PX4 readiness gate, takeoff to 10 m, hover stability, land — one chain per (sim, num_robots, iteration, velocity) | Docker daemon, GPU, sim license |
Marks can be combined with pytest logic: `-m "build_docker or build_packages"`, `-m liveliness`, `-m sensors`, `-m takeoff_hover_land`, or e.g. `-m "liveliness or sensors"` (see Bring-up scope below).
Bring-up scope (airstack_env)¶
airstack_env is class-scoped and parametrized per (sim, num_robots, iteration). Each test class that uses it (TestLiveliness, TestSensors, TestTakeoffHoverLand, …) performs its own airstack up / airstack down for that parametrization. Selecting both classes (for example, -m "liveliness or sensors") runs two full stack cycles per tuple (liveliness class, then sensors class). Collection order (see conftest.py) runs liveliness before sensors when both are selected. To save wall time, run -m liveliness or -m sensors alone when one suite is enough.
Test Infrastructure¶
All shared fixtures, helpers, and configuration live in tests/conftest.py.
airstack_env fixture¶
Parametrized over (sim, num_robots, iteration) tuples derived from CLI flags. For each combination it:
- Calls `airstack up` with the appropriate `COMPOSE_PROFILES`, `NUM_ROBOTS`, and headless flags
- Records `airstack_up_duration_s` to `metrics.json`
- Yields an `env` dict used by liveliness and sensor tests
- Tears down with `airstack down` and records `airstack_down_duration_s`
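The lifecycle above can be sketched as follows. This is a minimal illustration, not the actual `conftest.py` code: the `run_and_time` helper, the `record` callback, and the `airstack_cmd` parameter are hypothetical, and the real fixture also wires up `COMPOSE_PROFILES`, `NUM_ROBOTS`, and headless flags.

```python
import subprocess
import time
from contextlib import contextmanager


def run_and_time(cmd, env=None):
    """Run a command (raising on non-zero exit) and return wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(cmd, check=True, env=env)
    return time.monotonic() - start


@contextmanager
def stack_cycle(sim, num_robots, record, airstack_cmd=("airstack",)):
    """One `airstack up` / `airstack down` cycle with both phases timed.

    In conftest.py this logic sits inside the class-scoped, parametrized
    `airstack_env` pytest fixture; the context-manager form here just keeps
    the sketch small and testable without a real stack.
    """
    record("airstack_up_duration_s", run_and_time([*airstack_cmd, "up"]))
    try:
        # What liveliness and sensor tests receive as the `env` dict.
        yield {"sim": sim, "num_robots": num_robots}
    finally:
        # Teardown always runs, even when a test in the class fails.
        record("airstack_down_duration_s", run_and_time([*airstack_cmd, "down"]))
```

Because the `finally` block runs on both success and failure, `airstack down` (and its duration metric) is recorded even when a test inside the class raises.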
Isaac Sim and the sensors mark¶
LiDAR in pytest: tests/conftest.py sets
ENABLE_LIDAR=true in SIM_CONFIG["isaacsim"]["extra_env"] so the multi-drone
Pegasus script (example_multi_px4_pegasus_launch_script.py) attaches RTX LiDAR
the same way the single-drone script always does. Without that flag the multi
script would not spawn LiDAR OmniGraphs.
Topic checks live in tests/sensor_probes.py
and are driven by tests/test_sensors.py:
| Path | What we measure | How |
|---|---|---|
| Sim → `/clock`, stereo images, stereo depth | Publish rate | `ros2 topic hz` on the sim container: `/clock` alone, then chunks of two `image_rect` topics, then chunks of two depth topics (`ISAACSIM_HZ_CHUNK_SIZE` in `sensor_probes.py`). |
| Robot → same topic names (bridge) | Publish rate | Same two-at-a-time chunking on the robot container for Isaac. ms-airsim: one batch of four topics. |
| Robot → filtered `.../ouster/point_cloud` | Stream alive | `ros2 topic echo --once` per robot (not Hz — large PointCloud2). |
| LiDAR geometry | Near-range vs `near_range_m` | `lidar_point_cloud_filter/scripts/validate_lidar_filter_clouds.py` (raw vs filtered). |
Sim RTF (real-time factor from /clock) is also in the sensors suite.
test_sensor_streams_stable repeats sim + robot stereo + LiDAR probes every
--stable-interval for --stable-duration and records time-series to
metrics.json (stereo/depth as *.hz_samples; LiDAR echo-once as *.received_samples).
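The chunking and rate extraction described above can be sketched like this. It is illustrative only (the real code is in `tests/sensor_probes.py`), and it assumes `ros2 topic hz` output contains lines of the form `average rate: N`:

```python
import re
from itertools import islice

HZ_CHUNK_SIZE = 2  # mirrors ISAACSIM_HZ_CHUNK_SIZE: probe two topics per ros2 process


def chunked(topics, size=HZ_CHUNK_SIZE):
    """Yield lists of at most `size` topics, so a single `ros2 topic hz`
    invocation never subscribes to more image streams than the bridge
    comfortably serves."""
    it = iter(topics)
    while chunk := list(islice(it, size)):
        yield chunk


_RATE_RE = re.compile(r"average rate: ([\d.]+)")


def parse_hz_samples(output):
    """Extract every 'average rate: N' sample from captured `ros2 topic hz`
    output; the caller asserts these against the expected publish rate."""
    return [float(m) for m in _RATE_RE.findall(output)]
```

A stability test can call the probe repeatedly and append each `parse_hz_samples` result to the `*.hz_samples` time series described below.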
MetricsRecorder¶
Writes custom metrics to tests/results/<timestamp>/metrics.json after each record() call. Keys follow the pattern test_node_id → metric_key → {value, unit, direction}. Time-series data (Hz samples, compute snapshots) are stored as {key}_samples lists and expanded into scalar aggregates (mean, min, max, start_mean, end_mean) by parse_metrics.py.
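A minimal sketch of a recorder with that `metrics.json` shape follows; the real class lives in `tests/conftest.py` and may differ (the `direction` default here is an assumption):

```python
import json
from pathlib import Path


class MetricsRecorder:
    """Sketch: nested dict test_node_id -> metric_key -> {value, unit, direction},
    flushed to metrics.json after every record() call."""

    def __init__(self, results_dir):
        self.path = Path(results_dir) / "metrics.json"
        self.data = {}

    def record(self, node_id, key, value, unit="", direction="lower_is_better"):
        entry = self.data.setdefault(node_id, {})
        if key.endswith("_samples"):
            # Time series (Hz samples, compute snapshots): append to a list.
            samples = entry.setdefault(
                key, {"value": [], "unit": unit, "direction": direction})
            samples["value"].append(value)
        else:
            # Scalar metric: overwrite.
            entry[key] = {"value": value, "unit": unit, "direction": direction}
        self.path.write_text(json.dumps(self.data, indent=2))
```

`parse_metrics.py` later expands each `{key}_samples` list into scalar aggregates (`mean`, `min`, `max`, `start_mean`, `end_mean`).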
Output files¶
Every test run produces a timestamped directory. Per-test logs — for each
pytest function, pytest_runtest_setup in conftest.py attaches the shared
logger to logs/test_<module>.<Class>.<test>[<param-id>].log (param ids are
rewritten for readability, e.g. msairsim-rob#1-iter0; see
pytest_collection_modifyitems).
airstack_env.<…>.log — the class-scoped airstack_env fixture wraps
airstack up / airstack down in logger_to("airstack_env." + <current nodeid>)
(see conftest.py). Each class that uses the fixture therefore gets one extra
log file named airstack_env. plus the node id of whichever test was running
when the fixture first ran for that class. For TestLiveliness that is almost
always test_robot_containers_running (the first test in the class), not
test_stable. That file holds compose / airstack subprocess output; each test
still has its own log for assertions and docker exec / ros2 lines.
tests/results/
└── 2025-04-21_14-30-00/
├── results.xml # JUnit XML — test durations and pass/fail status
├── metrics.json # Custom metrics (image sizes, Hz, compute, timing)
└── logs/
├── test_build_docker.TestDockerBuilds.test_build_robot_desktop.log
├── airstack_env.test_liveliness.TestLiveliness.test_robot_containers_running[msairsim-rob#1-iter0].log
├── test_liveliness.TestLiveliness.test_robot_containers_running[msairsim-rob#1-iter0].log
├── test_liveliness.TestLiveliness.test_stable[msairsim-rob#1-iter0].log
├── test_sensors.TestSensors.test_sensor_streams_stable[msairsim-rob#1-iter0].log
└── ... # More per-test logs; another airstack_env.* per class using the fixture
Running Tests¶
airstack test (primary interface)¶
airstack test is the standard way to run tests. It builds the containerized
test runner from tests/docker/, mounts the repo read-only, and forwards all
arguments directly to pytest. No local Python environment needed.
# From the repo root (AirStack must be set up: airstack setup):
# Build tests only — fast, no GPU needed
airstack test -m "build_docker or build_packages" -v
# Liveliness run — ms-airsim, 1 robot, 1 iteration, 60 s stability window
airstack test -m liveliness \
--sim msairsim \
--num-robots 1 \
--stress-iterations 1 \
--stable-duration 60 \
-v
# Takeoff/hover/land run — three velocities
airstack test -m takeoff_hover_land \
--sim msairsim \
--num-robots 1 \
--stress-iterations 1 \
--takeoff-velocities 0.5,1,2 \
-v
# Sensor topic rates + LiDAR
airstack test -m sensors \
--sim isaacsim \
--num-robots 1 \
--stress-iterations 1 \
--stable-duration 60 \
-v
# Show GUI windows (for local visual inspection)
airstack test -m liveliness --gui -v
airstack test calls xhost + automatically so GUI-mode sim containers
can reach the host X server; it is a no-op when DISPLAY is not set.
Prerequisites¶
- Docker daemon running with your user in the `docker` group
- NVIDIA drivers + `nvidia-container-toolkit` for liveliness, sensors, and takeoff_hover_land tests
- `airstack setup` completed (adds `airstack` to `PATH`)
Direct pytest (for development / debugging)¶
Run pytest directly when you need faster iteration (no container rebuild) or want to attach a debugger. Requires a local Python environment.
export AIRSTACK_ROOT=$(pwd)
pip install -r tests/requirements.txt
# Build tests only
pytest tests/ -m "build_docker or build_packages" -v
# Liveliness run
pytest tests/ -m liveliness \
--sim msairsim \
--num-robots 1 \
--stress-iterations 1 \
--stable-duration 60 \
-v
# Sensor streams (after liveliness in default collection order)
pytest tests/ -m sensors \
--sim isaacsim \
--num-robots 1 \
--stress-iterations 1 \
-v
CLI option reference¶
| Option | Default | Description |
|---|---|---|
| `--sim` | `msairsim,isaacsim` | Comma-separated sim targets |
| `--num-robots` | `1,3` | Comma-separated robot counts |
| `--stress-iterations` | `3` | Up/down cycles per (sim, num_robots) config |
| `--stable-duration` | `120` | Seconds `test_stable` / `test_sensor_streams_stable` poll for |
| `--stable-interval` | `10` | Seconds between polls in those stability tests |
| `--gui` | off | Show simulator GUI (disables headless mode) |
| `--takeoff-velocities` | `0.5,1,2` | Takeoff/land speeds in m/s |
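The comma-separated flags expand into the parametrization tuples used by `airstack_env`. A sketch of that expansion (illustrative; the real logic lives in `tests/conftest.py`):

```python
import itertools


def derive_param_tuples(sim, num_robots, stress_iterations):
    """Expand comma-separated CLI flag values into the (sim, num_robots,
    iteration) tuples that parametrize the airstack_env fixture."""
    sims = sim.split(",")
    counts = [int(n) for n in num_robots.split(",")]
    iterations = range(int(stress_iterations))
    return list(itertools.product(sims, counts, iterations))
```

With the defaults above (`msairsim,isaacsim`, `1,3`, `3` iterations) this yields 2 × 2 × 3 = 12 stack cycles per selected test class, which is why narrowing `--sim` and `--num-robots` is the first lever for shortening a run.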
Autonomy Tests (test_takeoff_hover_land.py)¶
TestTakeoffHoverLand runs a 4-phase flight chain for every combination of
(sim, num_robots, iteration, velocity). The drone returns to the ground after
each velocity so the next velocity starts from a clean state.
Phase order¶
| Phase | Test | What happens |
|---|---|---|
| 1 | `test_px4_ready` | Waits for MAVROS + PX4 EKF ready; once per env |
| 2 | `test_takeoff` | Sends `TakeoffTask`; asserts altitude within 10 % |
| 3 | `test_hover` | Captures odom for 10 s; asserts altitude drift < 0.5 m |
| 4 | `test_landing` | Sends `LandTask`; asserts final altitude < 0.5 m |
If any phase other than test_hover fails, the remaining phases for that env
are skipped (the chain guard prevents a stuck-in-air drone from blocking later
velocity sweeps). A hover failure does not skip landing, so the drone always
returns to the ground.
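The chain-guard behavior can be sketched as a small helper (hypothetical; the actual test code implements this differently inside the test class):

```python
class ChainGuard:
    """Sketch of the phase chaining described above: any failed phase except
    test_hover breaks the chain, and later phases skip themselves."""

    NON_BLOCKING = {"test_hover"}  # a hover failure must not skip landing

    def __init__(self):
        self.broken_by = None

    def should_skip(self):
        """Return the phase that broke the chain, or None if clear to run."""
        return self.broken_by

    def report(self, phase_name, passed):
        """Called after each phase; first blocking failure breaks the chain."""
        if not passed and phase_name not in self.NON_BLOCKING:
            self.broken_by = self.broken_by or phase_name
```

Keeping `test_hover` out of the blocking set is what guarantees the drone is always commanded back to the ground before the next velocity sweep starts.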
Recorded metrics¶
| Metric key | Unit | Description |
|---|---|---|
| `ready_duration_sys_s` | s | Wall-clock time from test start until PX4 ready |
| `takeoff_duration_sim_s` | s | Sim time from first motion to 95 % of target |
| `land_duration_sim_s` | s | Sim time from 80 % peak descent to < 0.5 m |
| `velocity_rmse_m_sim_s` | m/s | RMSE of dz/dt vs commanded velocity during climb/descent |
| `altitude_error_m` | m | Signed steady-state error at takeoff success (+ = high) |
| `overshoot_m` | m | Unsigned transient overshoot above target |
| `hover_altitude_mean_error_m` | m | Mean altitude drift during hover |
| `hover_position_stddev_m` | m | 3-D position jitter (sqrt of summed axis variances) |
| `final_altitude_m` | m | Altitude at landing action completion |
| `odometry_error_mean_m` | m | Mean 3-D position error vs ground-truth odom |
| `odometry_error_max_m` | m | Peak 3-D error vs ground-truth odom |
| `odometry_altitude_bias_m` | m | Signed z-axis bias vs ground-truth odom |
Metrics are recorded per robot as robot_N.<key> and written to
tests/results/<timestamp>/metrics.json.
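As one concrete example, the `velocity_rmse_m_sim_s` computation amounts to comparing finite-difference climb rates against the commanded velocity. A sketch under that assumption (function name and sampling are illustrative):

```python
import math


def velocity_rmse(times, altitudes, commanded_v):
    """RMSE of the finite-difference climb rate dz/dt against the commanded
    velocity, over a climb or descent segment of (sim-time, altitude) samples."""
    pairs = zip(zip(times, altitudes), zip(times[1:], altitudes[1:]))
    rates = [(z1 - z0) / (t1 - t0) for (t0, z0), (t1, z1) in pairs]
    return math.sqrt(sum((r - commanded_v) ** 2 for r in rates) / len(rates))
```

A drone that climbs at exactly the commanded rate scores 0; tracking error or oscillation around the commanded velocity pushes the RMSE up.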
Running takeoff_hover_land tests¶
# Sweep velocities 0.5, 1, 2 m/s; 1 robot; ms-airsim
airstack test -m takeoff_hover_land \
--sim msairsim \
--num-robots 1 \
--stress-iterations 1 \
--takeoff-velocities 0.5,1,2 \
-v
# Single velocity, Isaac Sim, 3 robots
airstack test -m takeoff_hover_land \
--sim isaacsim \
--num-robots 3 \
--stress-iterations 1 \
--takeoff-velocities 1 \
-v
Metrics Reporting (parse_metrics.py)¶
tests/parse_metrics.py reads results.xml and metrics.json from a run directory and produces a markdown report. It has two modes:
Single-run report¶
Prints a markdown table of all recorded metrics. Always exits 0.
Diff / regression check¶
# --threshold: regression if change% exceeds this (default 20)
# --output:    also write the report to a file
python tests/parse_metrics.py \
    --current tests/results/2025-04-21_14-30-00/ \
    --baseline tests/results/2025-04-20_09-00-00/ \
    --threshold 20 \
    --output report.md
Prints a side-by-side comparison. Exits 1 if any metric regresses beyond the threshold; exits 0 otherwise.
The report has three sections per test module:
- Metrics — flat table of scalar metrics (test name, metric key, value/baseline, change%)
- Sim publishing rates — pivot table of topic Hz aggregates from the `sensors` mark (`mean`, `start_mean`, `end_mean`, `min`, `max`; sim + robot topics)
- Compute usage — pivot table of CPU/memory/GPU metrics per container
Regressions are flagged with :red_circle:, improvements with :green_circle:.
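The diff-mode decision can be sketched as follows. This is an illustration of the threshold check, not the actual `parse_metrics.py` code, and the `direction` strings are assumptions:

```python
def change_percent(current, baseline):
    """Signed percentage change of `current` relative to `baseline`."""
    return (current - baseline) / abs(baseline) * 100.0


def is_regression(current, baseline, direction, threshold=20.0):
    """True when a metric moved in its bad direction by more than threshold %.

    `direction` comes from each metric's {value, unit, direction} record in
    metrics.json: durations regress upward, publish rates regress downward.
    """
    pct = change_percent(current, baseline)
    if direction == "lower_is_better":
        return pct > threshold
    return pct < -threshold  # higher_is_better
```

If any metric trips this check, the script exits 1, which is what the CI `report` job turns into an `::error::` annotation.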
CI/CD Integration¶
Workflow: system-tests.yml¶
.github/workflows/system-tests.yml runs on:
- Pull requests to `main` or `develop` — automatically runs `build_docker or build_packages` tests (no GPU-intensive liveliness run on every PR)
- Manual dispatch (`workflow_dispatch`) — fully configurable for liveliness runs and metric comparisons
Manual dispatch inputs¶
| Input | Default | Description |
|---|---|---|
| `marks` | `liveliness` | pytest marks expression |
| `sim` | `msairsim` | Sim targets |
| `num_robots` | `1` | Robot counts |
| `stress_iterations` | `1` | Iterations per config |
| `stable_duration` | `120` | Stability polling seconds |
| `baseline_run_id` | (blank) | Run ID for comparison; blank = latest main run |
Jobs¶
run-tests runs on a freshly-spawned ephemeral OpenStack instance ([self-hosted, airstack-ephemeral]). The instance is provisioned per-job by the orchestrator described below and destroyed once the job completes. It installs dependencies, runs pytest, and uploads tests/results/ as an artifact named test-results-<sha>-<run_id> with 90-day retention.
report runs on ubuntu-latest after run-tests (even if it failed). It:
- Downloads the current artifact
- Downloads a baseline artifact (from the base branch for PRs, from `main` for manual runs, or from the specified `baseline_run_id`)
- Runs `parse_metrics.py` in diff mode if a baseline is found, otherwise in single-run mode
- Posts the markdown report as a PR comment (PR runs) or to the job summary (all runs)
- Fails with `::error::` if `parse_metrics.py` exits 1 (regression detected)
Required third-party action¶
The workflow uses dawidd6/action-download-artifact@v6 to download artifacts from other workflow runs by branch name. This is a community action and must be trusted in your repository's Actions settings if you use a restricted allowed-actions policy.
CI/CD Orchestrator (OpenStack-backed ephemeral runners)¶
AirStack's tests require a GPU, Docker, and a clean filesystem per run, so they execute on truly ephemeral OpenStack instances spawned per-job by an orchestrator. Each test job gets a fresh VM that is destroyed once the job completes — no Docker layer carryover, no leaked containers, no shared host state.
Architecture¶
┌──────────────────────────────────────────────────────────────┐
│ Orchestrator VM (airstack-ci-cd-orchestrator) │
│ • polls GitHub for queued workflow_jobs │
│ • mints single-use JIT runner tokens │
│ • spawns / reaps ephemeral instances via OpenStack Nova │
│ • holds the GitHub PAT and OpenStack application credential│
└────────────┬───────────────────────────────────┬─────────────┘
│ │
▼ ▼
┌──────────────────────────────┐ ┌────────────────────────────────┐
│ Ephemeral worker (per job) │ │ GitHub Actions queue │
│ Image: Ubuntu-24.04-GPU- │ │ workflow_job status=queued │
│ Headless │ │ labels: [self-hosted, │
│ cloud-init bootstraps Docker │ │ airstack-ephemeral] │
│ + nvidia-container-toolkit + │ └────────────────────────────────┘
│ GH Actions runner; runs ONE │
│ job, then is destroyed. │
└──────────────────────────────┘
Why this instead of a long-lived self-hosted runner¶
| Concern | Mitigation |
|---|---|
| Cross-job state pollution (Docker cache, dangling networks, leftover artifacts) | Each job runs on a fresh VM. Spent VM is destroyed within ~30 s of job completion. |
| Fork PRs executing arbitrary code | Workflow's if: github.event.pull_request.head.repo.full_name == github.repository — fork PRs skipped. |
| Runner running as root | The runner runs as the unprivileged ubuntu user inside an instance whose only purpose is one job. |
| Docker socket gives root-equivalent access | Bounded to a single one-shot VM. The orchestrator host doesn't expose Docker at all. |
| Long-lived PAT on the runner host | The PAT lives only on the orchestrator. Workers receive a single-use JIT runner config — a base64 token bound to one runner registration. |
| Persistent OpenStack creds tied to a user password | Orchestrator authenticates with an application credential (revocable, scoped) instead of openrc.sh. |
Setup¶
The orchestrator service code, cloud-init template, systemd unit, and full setup runbook live in .github/orchestrator/. See .github/orchestrator/README.md for:
- creating the OpenStack application credential and `clouds.yaml`
- staging the GitHub PAT
- running `setup.sh` on the orchestrator VM
- filling in flavor / network / keypair / security-group in `/etc/airstack-orchestrator/config.yaml`
- enabling and verifying the `airstack-orchestrator.service` systemd unit
Runner labels¶
The workflow file requests runs-on: [self-hosted, airstack-ephemeral]. The orchestrator polls for queued jobs whose labels are a superset of runner_labels in its config, mints a JIT config registering the ephemeral runner under those same labels, and spawns the worker. To route jobs to a different pool (e.g. CPU-only workers) in the future, add a second label set in config and adjust the workflow's runs-on.
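The label-matching step can be sketched as a set check (illustrative only; the function name is hypothetical and the real orchestrator code lives in `.github/orchestrator/`):

```python
def job_matches_pool(job_labels, pool_labels):
    """True when a queued workflow_job's runs-on labels include every label in
    the orchestrator's configured runner_labels, i.e. the job's label set is a
    superset of the pool's. Only matching jobs get a JIT runner spawned."""
    return set(pool_labels).issubset(set(job_labels))
```

A second pool (say, CPU-only workers) would simply use a different `pool_labels` set in config, with the workflow's `runs-on` adjusted to match.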