# Distributed Mode
Run a load test across multiple machines. The CLI output is identical to single-machine mode — the coordinator aggregates all agent metrics transparently.
## Local agents

Spawn N agent processes on the same machine sharing an embedded NATS broker:

```bash
loadpilot run scenarios/checkout.py --target https://api.example.com --agents 4
```
Each agent handles rps / N of the total load. Useful for saturating the network
interface or bypassing OS-level connection limits.
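The per-agent split is plain integer arithmetic. A minimal sketch (the actual sharding is done by the Rust coordinator; giving the remainder to the first agents is an illustrative assumption, and sub-integer leftovers are handled by the fractional RPS budget described under reliability guarantees):

```python
def shard_rps(total_rps: int, agents: int) -> list[int]:
    """Split a total RPS target across N agents, spreading the remainder."""
    base, rem = divmod(total_rps, agents)
    # The first `rem` agents take one extra request/s so the shares sum exactly.
    return [base + (1 if i < rem else 0) for i in range(agents)]

print(shard_rps(100, 4))  # [25, 25, 25, 25]
print(shard_rps(10, 4))   # [3, 3, 2, 2]
```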
## External agents — separate machines

### Start agents

Install the agent binary on each machine:

```bash
curl -fsSL https://raw.githubusercontent.com/VladislavAkulich/loadpilot/main/install.sh | sh
```

Start an agent — it connects to the coordinator and waits for a plan:

```bash
loadpilot-agent --coordinator <coordinator-ip>:4222 --agent-id agent-0
loadpilot-agent --coordinator <coordinator-ip>:4222 --agent-id agent-1
```
Agents are persistent — after a run completes they reconnect and wait for the next plan automatically.
### Run a test

```bash
loadpilot run scenarios/checkout.py \
  --target https://api.example.com \
  --external-agents 2 \
  --report results/report.html
```

The coordinator uses an embedded NATS broker listening on :4222 by default.
## Coordinator as a Kubernetes service

Deploy the coordinator as a long-running pod inside the cluster. It exposes an
HTTP API (`POST /run`) that accepts a plan JSON and streams metric ndjson back.
Prometheus scrapes metrics at :9090 in-cluster — no host networking required.

```bash
# Build coordinator image
docker build -f engine/Dockerfile.coordinator -t loadpilot-coordinator:local .
kind load docker-image loadpilot-coordinator:local --name <cluster-name>

# Enable in Helm
helm upgrade loadpilot cli/loadpilot/charts/loadpilot -n loadpilot \
  --reuse-values \
  --set coordinator.enabled=true \
  --set coordinator.image=loadpilot-coordinator \
  --set coordinator.tag=local \
  --set coordinator.imagePullPolicy=Never

# Port-forward the API and run
kubectl port-forward -n loadpilot svc/loadpilot-coordinator 8080:8080
loadpilot run scenarios/checkout.py \
  --target https://api.example.com \
  --coordinator-url http://localhost:8080
```

The coordinator uses the agents already running in-cluster (controlled by
`agent.replicas`). The CLI streams the live dashboard exactly as in local mode.

You can also set the URL via an environment variable:

```bash
export LOADPILOT_COORDINATOR_URL=http://localhost:8080
loadpilot run scenarios/checkout.py --target https://api.example.com
```
## Coordinator HTTP API

| Endpoint | Method | Description |
|---|---|---|
| /run | POST | Accept plan JSON, stream ndjson metric lines. Returns 409 if a test is already running. |
| /healthz | GET | Readiness probe — returns ok. |
## External NATS (Railway / cloud)

Deploy a NATS server separately (e.g. Railway, Fly.io, or a VPS):

```bash
# Deploy NATS: Docker image nats:latest, TCP port 4222

# Start agents with the COORDINATOR env var pointing at your NATS
COORDINATOR=your-nats.railway.app:PORT AGENT_ID=agent-0 loadpilot-agent

# Run the test with external NATS
loadpilot run scenarios/checkout.py \
  --target https://api.example.com \
  --nats-url nats://your-nats.railway.app:PORT \
  --external-agents 2 \
  --report results/report.html
```
## on_start in distributed mode

When a scenario uses on_start (e.g. login → per-user auth token), the coordinator
runs on_start N times locally against the target before the test begins. It
captures the headers and URLs each VUser would set and ships them with the plan.
Agents rotate through these pre-authenticated header sets in pure Rust — no Python
required on agent machines.

```python
@scenario(rps=100, duration="2m")
class CheckoutFlow(VUser):
    def on_start(self, client):
        resp = client.post("/auth/login", json={"user": "test", "pass": "secret"})
        self.token = resp.json()["access_token"]

    @task
    def browse(self, client):
        # self.token from on_start is captured and shipped to agents automatically
        client.get("/api/products", headers={"Authorization": f"Bearer {self.token}"})
```
### Per-VUser URL state

If on_start stores state that influences task URLs (e.g. a resource ID created
during setup), the coordinator captures the resulting URL for each VUser and ships
it as an override. Agents use the per-VUser URL instead of the task's default:

```python
import threading

@scenario(rps=5, duration="2m", ramp_up="20s")
class ProjectCRUDFlow(VUser):
    _lock = threading.Lock()
    _shared_project_id: int | None = None  # one project shared across all VUsers

    def on_start(self, client):
        # login
        super().on_start(client)
        # create the shared project once; all VUsers reuse the same ID
        with self.__class__._lock:
            if self.__class__._shared_project_id is None:
                resp = client.post("/api/v1/projects", json=new_project(), headers=self._auth())
                resp.raise_for_status()
                self.__class__._shared_project_id = resp.json()["id"]
        self.project_id = self.__class__._shared_project_id

    @task(weight=4)
    def read_project(self, client):
        # URL /api/v1/projects/{self.project_id} is captured per-VUser
        # and shipped to agents — agents use the real URL, not "/"
        client.get(f"/api/v1/projects/{self.project_id}", headers=self._auth())
```

Tip — resource-limited accounts: if your on_start creates a resource and the account has a per-user limit, use a class-level shared resource (as above) so only one object is created regardless of the pre-auth pool size.
## on_stop in distributed mode

If on_stop is defined, the coordinator calls it for each pre-authenticated VUser
after the test completes. Use this to delete resources created in on_start:

```python
    def on_stop(self, client):
        with self.__class__._lock:
            self.__class__._vuser_count -= 1
            last = self.__class__._vuser_count == 0
            if last and self.__class__._shared_project_id is not None:
                client.delete(
                    f"/api/v1/projects/{self.__class__._shared_project_id}",
                    headers=self._auth(),
                )
                self.__class__._shared_project_id = None
```

on_stop is also called during --dry-run to prevent resource leaks from the
pre-auth phase.
## Reliability guarantees

- Synchronised start — all agents begin within ~1ms of each other. The coordinator sends a `start_at` timestamp; agents sleep until it fires.
- PING/PONG keepalive — agents and coordinator respond to NATS server PING frames so long-running tests (> 2 min) are not disconnected mid-run.
- Agent re-registration — agents re-announce to the coordinator every 3s until they receive a shard, so coordinator and agents can start in any order.
- Agent timeout — if an agent stops reporting for 15s it is marked timed-out; the test continues on remaining agents without hanging.
- Agent recovery — if a timed-out agent reconnects mid-test it is restored to the active pool.
- Fractional RPS budget — the dispatcher accumulates sub-integer request budgets across ticks so low-RPS scenarios (e.g. 3 RPS split across 2 agents) fire the correct number of requests instead of rounding to zero.
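The fractional-budget idea fits in a few lines of Python (the real dispatcher is in Rust inside the agent; the `RpsBudget` name and the 0.5 s tick here are illustrative):

```python
class RpsBudget:
    """Carry sub-integer request budget across ticks so low RPS still fires."""

    def __init__(self, rps: float, tick_s: float = 0.5):
        self.per_tick = rps * tick_s  # requests owed per tick (may be < 1)
        self.carry = 0.0

    def requests_this_tick(self) -> int:
        # Accumulate the fractional budget, fire the whole part, keep the rest.
        self.carry += self.per_tick
        fire = int(self.carry)
        self.carry -= fire
        return fire

# 3 RPS sharded across 2 agents → 1.5 RPS each; over 10 s (20 ticks of 0.5 s)
# the agent fires 15 requests instead of rounding 1.5 per tick down to 0 or 1.
budget = RpsBudget(rps=1.5)
print(sum(budget.requests_this_tick() for _ in range(20)))  # 15
```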
## Architecture

```text
CLI (Python)
  build plan ──► spawn coordinator subprocess (local mode)
       │ stdin (JSON)
      OR
  build plan ──► POST /run to coordinator pod (--coordinator-url)
       │ HTTP ndjson stream
       ▼
Coordinator (Rust)
  ├── embedded NATS broker (or connect to external NATS)
  ├── wait for N agents to register
  ├── shard plan + set synchronised start_at → publish to each agent
  ├── aggregate metrics (sum RPS, histogram-merged percentiles, per-task)
  ├── stdout / HTTP ndjson → CLI live dashboard
  └── :9090/metrics → Prometheus / Grafana

Agent (Rust, one per machine or k8s pod)
  ├── connect to NATS → register → receive shard
  ├── sleep until start_at (clock sync)
  ├── run HTTP load (token-bucket + reqwest)
  ├── stream metrics + per-task histograms → NATS → coordinator
  └── reconnect and wait for next plan
```