Monitoring¶
cafleet monitor is a fleet-scoped foreground loop — scan → ping → sleep —
that a coding agent runs as a background task (the Director's own background
task, or a dedicated monitoring member). It supplies the heartbeat a
Director needs to supervise its team: a plain loop, not agent reasoning, that
wakes due agents on a fixed cadence by keystroking cafleet … message poll into
their tmux panes. While the loop runs it spends no model tokens, and because
it is just a backgrounded command it works identically on any backend
(claude, codex, opencode). cafleet monitor start runs the loop
in-process; the launching agent owns its lifetime — there is no detached
subprocess and no monitor stop (stop the background task, or delete the
monitoring member, to stop it). One monitor per fleet.
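The scan → ping → sleep shape described above can be sketched in a few lines. This is a minimal illustration, not cafleet's actual code: `run_monitor_loop`, its `scan`/`ping` callables, and the `max_ticks` bound are hypothetical names introduced here so the loop can be exercised without tmux or a database.

```python
import time
from typing import Callable, Iterable, Optional


def run_monitor_loop(
    scan: Callable[[], Iterable[str]],
    ping: Callable[[str], None],
    tick_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    max_ticks: Optional[int] = None,
) -> int:
    """Run scan -> ping -> sleep in-process; return how many pings were sent.

    scan() is assumed to return the ids of agents that are due right now.
    max_ticks is a testing aid only; None means run until interrupted.
    """
    pings = 0
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        for agent_id in scan():      # scan: which agents are due this tick
            ping(agent_id)           # ping: wake the agent (a keystroke in the real tool)
            pings += 1
        ticks += 1
        sleep(tick_seconds)          # sleep: no model tokens are spent while waiting
    return pings
```

Because the loop is plain control flow around two callables, it needs no agent reasoning, which is the property the paragraph above relies on.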
Heartbeat vs facilitation¶
The monitor decides only the when: which agents are due, and it sends the keystroke
that wakes them. It MUST NOT poll, ACK, dispatch work, health-check, or escalate;
those require agent judgment and stay the Director's job, defined by the
cafleet-agent-team-supervision skill (the what).
| Layer | Owns | Lives in |
|---|---|---|
| Heartbeat (the when) | which agents are due; the wake keystroke | the cafleet monitor loop |
| Facilitation (the what) | poll → ACK → dispatch → health-check → escalate | the Director, per the supervision skill |
The wake keystroke differs by role:
- The Director gets a bare cafleet … message poll. That bare poll, on its own, performs only the first step of facilitation, so the contract that makes a woken Director run its full facilitation loop lives in the supervision skill, not in the keystroke: a monitor poll-trigger wake is the Director's cue to run its entire facilitation loop, not to read its inbox and stop.
- A member gets a single-line resume nudge: the same poll command plus an instruction to review its current task and continue working, so that a member that unexpectedly stopped resumes rather than going idle on an empty inbox. This is the one place a monitor keystroke carries more than a bare poll, and only for members; the Director's wake stays the bare poll above.
The monitor never reasons about message content — it is the alarm clock; the Director is the worker.
Who gets pinged¶
Each tick, the monitor evaluates every enrolled, active agent and pings the ones whose interval has elapsed. Both roles — the root Director and every member — are pinged unconditionally once due, regardless of whether the agent has any pending inbox items. The Director's facilitation does useful work even on an empty inbox (it still health-checks members, dispatches queued work, and detects stalls); pinging an idle member re-drains its inbox and keeps the heartbeat visible. The re-ping is unbounded — no backoff, no cap.
pending_count (the count of an agent's un-acked inbox items) is still computed
and shown in monitor status, but it no longer gates the ping — it is purely
informational.
A ping is skipped only when the agent is disabled, when its pane is missing or dead, or when its interval has not yet elapsed.
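The skip rules above reduce to a small predicate. The sketch below is illustrative only: `AgentState` and `is_due` are hypothetical names, timestamps are assumed to be epoch seconds, and treating a never-pinged agent as immediately due is an assumption the text does not state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentState:
    enabled: bool
    pane_alive: bool
    interval_seconds: float
    last_ping_at: Optional[float]  # epoch seconds; None means never pinged
    pending_count: int = 0         # informational only, never consulted below


def is_due(agent: AgentState, now: float) -> bool:
    """Return True when the agent should be pinged on this tick."""
    if not agent.enabled:
        return False               # skipped: agent is disabled
    if not agent.pane_alive:
        return False               # skipped: pane is missing or dead
    if agent.last_ping_at is None:
        return True                # assumption: a first ping is due immediately
    # skipped only if the interval has not yet elapsed
    return now - agent.last_ping_at >= agent.interval_seconds
```

Note that `pending_count` appears in the state but not in the predicate, mirroring the statement that it no longer gates the ping.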
Each dispatched ping is logged to the monitor's stdout as
<iso-ts> ping agent <id> (<name>). Because cafleet monitor start runs in the
foreground of the launching agent's background task, that task's output shows
live heartbeat activity, one line per ping.
Cadence and tick precision¶
| Knob | Default | Set by |
|---|---|---|
| Per-agent ping interval | 60 s | monitor_config.interval_seconds (per agent) |
| Scan tick | 5 s | monitor start --tick N (per run) |
The monitor scans once per tick and pings each agent whose interval has elapsed since its last ping. Because a ping only comes due at a tick boundary, the tick is the floor on interval precision: an interval that is not a multiple of the tick snaps up to the next tick boundary (e.g. a 7 s interval under a 5 s tick fires at ~10 s). Set the tick smaller than the smallest interval you care about.
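The snapping rule can be checked with a one-line calculation. The helper below is a hypothetical illustration (`effective_interval` is not a cafleet function) and assumes the previous ping landed exactly on a tick boundary.

```python
import math


def effective_interval(interval_seconds: float, tick_seconds: float) -> float:
    """Round an interval up to the next multiple of the scan tick."""
    if interval_seconds <= 0 or tick_seconds <= 0:
        raise ValueError("interval and tick must be positive")
    return math.ceil(interval_seconds / tick_seconds) * tick_seconds


print(effective_interval(7, 5))   # 10: a 7 s interval fires on the second 5 s tick
print(effective_interval(60, 5))  # 60: a multiple of the tick is unaffected
print(effective_interval(3, 5))   # 5: an interval below the tick still waits one tick
```

The last case shows why the tick should be smaller than the smallest interval you care about: any shorter interval is effectively raised to one tick.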
Single-instance and liveness¶
Exactly one monitor may run per fleet. The monitor_runtime DB row is the
single authority for both single-instance coordination and status liveness
— there is no PID file and no state directory. The single-instance claim runs in
one SQLite write transaction, so two concurrent monitor start calls cannot
both win.
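One way such a claim could be written with Python's standard `sqlite3` module is sketched below. It is an assumption-laden illustration, not the tool's implementation: the column types, the `try_claim` helper, and the 30-second `stale_window` default are all hypothetical; only the table and column names come from this page. `BEGIN IMMEDIATE` takes SQLite's write lock before the row is read, so a second caller waits (or fails) rather than reading the same empty slot.

```python
import sqlite3

# Hypothetical schema: column names are from the text, types are assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS monitor_runtime (
    fleet_id     INTEGER PRIMARY KEY,
    pid          INTEGER NOT NULL,
    started_at   REAL NOT NULL,
    last_tick_at REAL NOT NULL,
    tick_seconds INTEGER NOT NULL
)
"""


def try_claim(conn: sqlite3.Connection, fleet_id: int, pid: int, now: float,
              tick_seconds: int = 5, stale_window: float = 30.0) -> bool:
    """Claim the fleet's runtime row in one write transaction.

    conn must be opened with isolation_level=None so that the explicit
    BEGIN/COMMIT statements below control the transaction.
    """
    conn.execute("BEGIN IMMEDIATE")  # acquire the write lock before reading
    try:
        row = conn.execute(
            "SELECT last_tick_at FROM monitor_runtime WHERE fleet_id = ?",
            (fleet_id,),
        ).fetchone()
        # A fresh heartbeat means another monitor holds the slot.
        live = row is not None and now - row[0] <= stale_window
        if not live:
            conn.execute(
                "INSERT OR REPLACE INTO monitor_runtime "
                "(fleet_id, pid, started_at, last_tick_at, tick_seconds) "
                "VALUES (?, ?, ?, ?, ?)",
                (fleet_id, pid, now, now, tick_seconds),
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return not live
```

A caller would open the connection with `sqlite3.connect(path, isolation_level=None)`, create the table once, and exit immediately if `try_claim` returns `False`.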
Liveness is read from the DB heartbeat, not from the process table: the
running monitor rewrites last_tick_at every tick, so a monitor that died
silently is detected as stale (now - last_tick_at exceeds the stale window)
even though nothing cleaned up after it. os.kill(pid, 0) is a corroborating
signal, not the authority.
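The split between the authoritative heartbeat and the corroborating process check might look like the sketch below. The function names and the idea of passing `stale_window` as a parameter are assumptions (this page does not give the window's value), and the `os.kill(pid, 0)` probe is only meaningful on POSIX systems.

```python
import os
from typing import Optional


def heartbeat_is_fresh(last_tick_at: Optional[float], now: float,
                       stale_window: float) -> bool:
    """Authoritative check: stale once now - last_tick_at exceeds the window."""
    if last_tick_at is None:
        return False  # no runtime row means no monitor
    return now - last_tick_at <= stale_window


def pid_exists(pid: int) -> bool:
    """Corroborating check (POSIX only): signal 0 probes without delivering a signal."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True
```

Status should be decided by `heartbeat_is_fresh` alone; `pid_exists` can at most add detail, because a recycled pid or a wedged process would make it report a live monitor that is no longer ticking.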
Because a stale heartbeat is treated as dead, a fresh start may reclaim the
slot from a momentarily-wedged monitor. To keep two live monitors from both
pinging, the slot has exactly one owner (the pid that claimed it) and both the
per-tick heartbeat and the on-exit clear are ownership-checked: a displaced
monitor's next heartbeat matches zero rows, so it self-terminates without
pinging and without wiping the winner's row.
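The ownership check amounts to adding the owner's pid to the `WHERE` clause of both writes. The sketch below assumes a simplified, hypothetical table (`RUNTIME_SCHEMA`) and helper names (`heartbeat`, `clear_on_exit`); it illustrates the pattern rather than reproducing the tool's queries.

```python
import sqlite3

# Simplified hypothetical table: just enough columns to show the ownership check.
RUNTIME_SCHEMA = """
CREATE TABLE IF NOT EXISTS monitor_runtime (
    fleet_id     INTEGER PRIMARY KEY,
    pid          INTEGER NOT NULL,
    last_tick_at REAL NOT NULL
)
"""


def heartbeat(conn: sqlite3.Connection, fleet_id: int, pid: int, now: float) -> bool:
    """Rewrite last_tick_at only if this pid still owns the row.

    Returns False when zero rows matched, i.e. the slot was reclaimed and
    the caller should exit without pinging.
    """
    cur = conn.execute(
        "UPDATE monitor_runtime SET last_tick_at = ? WHERE fleet_id = ? AND pid = ?",
        (now, fleet_id, pid),
    )
    conn.commit()
    return cur.rowcount == 1


def clear_on_exit(conn: sqlite3.Connection, fleet_id: int, pid: int) -> None:
    """Delete the runtime row only if this pid owns it, leaving a winner's row intact."""
    conn.execute(
        "DELETE FROM monitor_runtime WHERE fleet_id = ? AND pid = ?",
        (fleet_id, pid),
    )
    conn.commit()
```

With this shape, a displaced monitor needs no extra coordination: its next `heartbeat` call returns `False`, and its `clear_on_exit` call is a no-op.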
Lifecycle¶
%%{init: {'theme': 'default', 'themeVariables': {'fontSize': '16px'}}}%%
flowchart LR
Start["monitor start<br/>(agent's background task)"] --> Claim["claim runtime row"]
Claim --> Tick["every tick:<br/>scan → ping due agents → heartbeat"]
Tick --> Tick
Tick --> Stop["stop the task /<br/>delete the member /<br/>fleet delete"]
Stop --> Clear["clear runtime row"]
Tick -. keystroke .-> PaneD["Director pane"]
Tick -. keystroke .-> PaneM["member pane"]
- Start: a coding agent runs cafleet --fleet-id N monitor start as a background task. The loop runs in-process and inherits the launching pane's environment ($TMUX, $CAFLEET_DATABASE_URL); it fails fast on startup if it cannot reach a tmux session. There is no detached subprocess.
- Run: each tick scans the fleet's enrolled agents, pings the due ones, and rewrites the heartbeat.
- Stop: the launching agent stops the background task (or deletes the monitoring member). A clean stop (SIGTERM/SIGINT) clears the runtime row; a hard kill simply lets the heartbeat go stale, after which status reports stopped. There is no monitor stop command. fleet delete needs no stop step: a running loop's next tick sees the soft-deleted fleet and self-terminates, and delete_fleet removes the monitor_runtime + monitor_config rows.
Per-agent schedule (interval_seconds, enabled) is persisted, so cadence
resumes from last_ping_at across a restart. The schedule is editable from both
the CLI (cafleet monitor config) and the admin WebUI at parity; launching and
stopping the loop is CLI-only by nature (it is the agent's background task).
Enrollment and schema¶
Two tables back the monitor. Both reuse a parent id as a 1:1 INTEGER primary key (no fresh sequence), and both are cleaned explicitly on teardown.
- monitor_config: one row per pane-bound agent, holding its interval_seconds, last_ping_at, and enabled flag. A row is inserted automatically at registration for every agent that has a tmux pane (the root Director and every member); the write-only Administrator and card-only agents have no pane and are not enrolled. Director-vs-member is derived at scan time (agent_id == fleets.director_agent_id), not stored.
- monitor_runtime: one row per fleet, holding the running loop's pid, started_at, last_tick_at heartbeat, and tick_seconds. "No monitor" is modeled cleanly as "no row".
See Data model for the full column definitions and the
CLI options page for the
cafleet monitor command surface.