Drift: Observe, Deploy, Respond. From a Prompt.
The Problem
Running a small fleet of Docker hosts (homelab, edge, cloud, corp boxes) is stuck in an uncomfortable middle. Kubernetes is overkill. SaaS platforms charge per device and want your data. The DIY path is a stack of disconnected tools: Grafana for metrics, Portainer for deploys, ssh + tmux for the times something is actually wrong, hand-edited Alertmanager configs.
We wanted one place to do all of it, conversationally. A small set of hard constraints:
- No inbound ports on target devices. They sit behind NATs, residential routers, and corp firewalls.
- No SSH after the first install. Every operational action should run through the control plane.
- Compose as the contract. If `docker compose up` runs it on your laptop, the fleet should be able to ship it.
- Lean on proven observability infrastructure. Don't build another TSDB.
- The LLM can mutate state, but never silently. Every change goes through a propose step.
Drift is what came out of this.
End-to-End Architecture

Inside each box on the control plane (CP): drift-frontend (nginx + React SPA), drift-agent (FastAPI agent loop + SSE), the Tools layer (metrics, logs, alerts, deploy, emit), LiteLLM (provider abstraction), drift-postgres (users, devices, apps, revisions, sessions).
The observability stack is conventional VictoriaMetrics / VictoriaLogs / vmalert / Alertmanager with vmauth fronting remote writes from edges and alertmanager-ntfy translating webhooks. On each device, the reporter stack runs vmagent, cAdvisor, node-exporter, and Vector; Jetson devices add jtop_exporter for GPU/power/thermal metrics. The deploy-agent container bundles drift-deploy-agent.sh and terminal-bridge.py.
Four things to notice:
- Edges only ever talk out. Every arrow from a device points up to the CP. Nothing listens on the device side, so devices behind NATs, residential routers, and corp firewalls work without any port forwarding or tunneling.
- One reverse proxy (Caddy by default, optional) terminates TLS and fans out to every internal service. The installer asks whether to enable it; if you are behind a tunnel or already run nginx / Traefik, decline and front the services yourself. Details in The CP Installer below.
- Standard observability tooling, not a homegrown TSDB. VictoriaMetrics, VictoriaLogs, vmalert, Alertmanager, Grafana, vmagent, cAdvisor, node-exporter, Vector. Every box is something you could swap or replace independently. Drift adds the prompt-driven interaction layer on top; the data plane stays conventional.
- The agent is in the data path for orchestration but out of it for data. Time-series arrays never enter the LLM context. More on that below.
The Three Pillars
Drift's tagline is observe, deploy, respond. From a prompt. All three run through the same chat surface; the agent picks the right tools based on what you ask.
Observe
Ask anything about your telemetry. The agent discovers what metrics exist, picks the right PromQL, fetches the data, runs statistics, and assembles a streaming response.

An investigation underway: streaming markdown narration above, a Plotly chart in the middle, a summary table below. All three were emitted by separate tool calls and painted into the UI as they arrived.
Example prompts:
> Which hosts are reporting metrics, and what jobs are scraping?
> Show CPU usage on the host over the last 15 minutes.
> Look for anomalies in network traffic over the last hour.
> Plot disk I/O on jetson-002 every 5 seconds.
> Now change refresh rate to 1s.
The live-chart case is interesting. The agent emits a make_live_chart render block with a chart_key. Re-emitting with the same key on a later turn doesn't add a new chart; it Plotly.react-diffs the existing one in place, preserving zoom and hover state.
Deploy
Drift Deploy registers each device with a small edge agent that polls the CP every 30s, applies whatever compose bundles you've assigned, and reports back. The whole thing (devices, apps, revisions, tagging, deploy-by-tag, rollback) drives from the chat.

The Devices sidebar showing online status, tags, and a one-click terminal icon per device.
> List devices and their groups.
> Tag pi-riffpod-001 with edge, client-z.
> Fork the reporter app as reporter-jetson.
> Deploy reporter-jetson v3 to all devices tagged edge AND client-z.
> Roll home-pi4-001 back to reporter v2.
Apps are versioned bundles of plain files. There's no proprietary packaging. A "revision" is just compose.yaml + .env + whatever configs the compose references. You can download any revision as a .tar.gz or .zip from the UI.
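Because a revision is only a set of plain files, packaging one needs nothing beyond a standard archive library. The sketch below is an illustration under assumptions, not Drift's actual code: `pack_revision`, `list_revision`, and the in-memory file mapping are hypothetical names chosen for the example.

```python
import io
import tarfile


def pack_revision(files: dict[str, bytes]) -> bytes:
    """Pack a revision (hypothetical helper) into an in-memory .tar.gz.

    `files` maps relative paths such as "compose.yaml" or ".env" to contents.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name in sorted(files):  # stable member order inside the archive
            data = files[name]
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def list_revision(archive: bytes) -> list[str]:
    """Return the member names of an archive produced by pack_revision."""
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        return tar.getnames()
```

Anything that can read a tarball can inspect or diff a revision this way, which is the practical payoff of avoiding a proprietary package format.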
Respond
Investigations end in action. Manage vmalert rules and Alertmanager routing, silence noise during planned work, or jump straight into a host shell. Same chat, same propose-then-apply pattern.

The agent proposing an alert rule before applying it. Operator sees the exact YAML that will be written; one "looks good" confirms.
> List firing alerts.
> Create an alert when CPU > 90% for 5 minutes on any edge device.
> Silence anything from jetson-002 for 2 hours, I'm rebooting it.
> Wire up a webhook so critical alerts ping ntfy.sh/drift-alerts.
> Open a terminal to home-pi4-001.

When the terminal opens, it's a full pty with TERM=xterm-256color (so tmux, screen, vim, less all work), entered into the host's namespaces via nsenter -t 1 -m -p -u -i -- /bin/login. PAM authenticates; sudo works. Bytes flow over a dedicated WebSocket; they never touch the LLM.
Key Architectural Decisions
Queue-Based Deploys, Not Push
The CP holds the desired state. Edge agents poll out, ask "what should I be running?", and reconcile. Targets can be offline when you make a change. They converge on their next check-in.
The retry budget lives on the CP, not the edge. If the agent fails an apply, it reports the failure on the next check-in and the CP increments attempts by one. This matters when a device drops offline mid-failure: an edge-only counter would burn through the retry budget against an unreachable network, whereas CP-side tracking ensures max_retries=5 means "5 real apply attempts."
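A minimal sketch of CP-side retry accounting, assuming a per-device assignment record. The `Assignment` class, its field names, and the "ok" / "failed" report strings are hypothetical; only the limit of 5 comes from the text.

```python
from dataclasses import dataclass
from typing import Optional

MAX_RETRIES = 5  # mirrors max_retries=5 from the text


@dataclass
class Assignment:
    """CP-side record of one revision assigned to one device (hypothetical)."""
    revision: str
    attempts: int = 0
    status: str = "pending"  # pending | applied | failed


def record_check_in(assignment: Assignment, reported: Optional[str]) -> str:
    """Update the CP-side counter from what the edge agent reported.

    `reported` is "ok", "failed", or None when there is nothing to report.
    Returns the instruction sent back to the agent: "apply" or "idle".
    """
    if reported == "ok":
        assignment.status = "applied"
    elif reported == "failed":
        assignment.attempts += 1  # only a real, reported attempt counts
        if assignment.attempts >= MAX_RETRIES:
            assignment.status = "failed"
    return "apply" if assignment.status == "pending" else "idle"
```

An offline device sends no check-ins at all, so in this sketch its counter cannot move until it is reachable again.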
The dataRef Pattern (LLM Out of the Data Path)
A naive agent design dumps query results into the LLM's context and asks it to summarize. That fails three ways:
- Cost. A few hundred series × a thousand points × frequent investigations = burnt tokens.
- Precision. The model paraphrases numerical data; small errors creep in.
- Hallucination risk. The model can confidently report a p95 that wasn't in the data.
Drift's query_range tool instead stores the Plotly trace under a prom://<uuid> reference in a server-side cache, streams the trace to the UI via SSE (out-of-band, not via the LLM), and returns a compact digest to the model: {ref, n_series, series: [{n, mean, p50, p95, min, max, ...}]}.
Subsequent analysis tools (detect_anomalies, find_correlations) and emit tools (make_chart) operate by ref. The model decides what to compute; numpy/scipy actually compute it. Cost stays flat regardless of how many series are in flight; precision is whatever VictoriaMetrics returns.
Streaming UX via SSE
Every investigation streams as a text/event-stream response. The frontend's AgentAdapter consumes eight event types:
| Event | When it fires |
|---|---|
| `start` | Once, when the investigation begins. |
| `thinking` | Streaming reasoning delta from the model. Zero or more. |
| `narrative` | Streaming user-facing prose delta. Zero or more. |
| `tool_call` | The agent is about to invoke a tool. One per invocation. |
| `data` | Out-of-band payload (e.g. chart traces). Only for emit tools that carry one. |
| `tool_result` | Tool returned. One per `tool_call`, always after `data` if present. |
| `block` | Render block JSON ready to paint. Only for emit tools that produced one. |
| `done` | Once, when the stream closes. (Replaced by `error` on failure.) |
This is what lets the user watch the investigation: tool calls appear in a trace pane, intermediate render blocks paint into the conversation as they arrive, charts show up the moment the underlying data does. Trust comes from seeing how the result was reached, not from a 30-second blank wait followed by a wall of text.
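On the wire, each of these events is an ordinary server-sent-events frame: an `event:` line, a `data:` line, and a blank line. The sketch below shows that framing; the JSON payload shapes are assumptions, since the text lists the event names but not their bodies.

```python
import json

EVENT_TYPES = ("start", "thinking", "narrative", "tool_call",
               "data", "tool_result", "block", "done")


def sse_frame(event: str, payload: dict) -> str:
    """Encode one event in text/event-stream wire format."""
    if event not in EVENT_TYPES and event != "error":
        raise ValueError(f"unknown event type: {event}")
    # Compact JSON contains no raw newlines, so one data: line is enough.
    body = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {body}\n\n"


def parse_frame(frame: str) -> tuple[str, dict]:
    """Decode a single frame produced by sse_frame."""
    event, data = "message", ""
    for line in frame.strip().split("\n"):
        field, _, value = line.partition(": ")
        if field == "event":
            event = value
        elif field == "data":
            data += value
    return event, json.loads(data)
```

A FastAPI handler could yield such strings from a generator behind a streaming response with media type `text/event-stream`; that wiring is omitted here.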
Commissioning a Device
The edge agent isn't pulled from a public registry. It's built on each device at install time from a small Dockerfile shipped by the CP. This sounds backwards until you think about the constraints: a corp-network device behind a TLS-intercepting proxy might not be able to reach GHCR; an air-gapped install needs zero external pulls beyond the explicit ones; and baking on-device keeps the CP's container registry out of the trust path entirely.
What's inside the image:
- Alpine base for size (the built image is roughly 80MB).
- `docker` CLI + `docker compose` plugin. The agent's job is to run `docker compose pull && docker compose up -d` against bundles the CP ships, so it needs the client tools.
- `drift-deploy-agent.sh`: the reconciliation loop (poll, apply, report, repeat). This is what self-updates on every check-in.
- `terminal-bridge.py`: the pty + nsenter helper that handles web terminal sessions.
- `util-linux` for the `nsenter` binary so the bridge can cross into the host's namespaces.
Commissioning is a single command on the target host:
curl -fsSL "https://drift.example.com/api/deploy/agent/install.sh" | sudo \
  DEVICE_NAME=pi-livingroom \
  BOOTSTRAP_TOKEN=drift-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  CP_URL=https://drift.example.com/api/deploy \
  bash
The operator gets DEVICE_NAME and BOOTSTRAP_TOKEN from a render block in the chat after asking the agent to "commission a new device named pi-livingroom". The token is the device's long-lived bearer credential for every /agent/check-in. Save the curl line to a password manager (the chat won't render it again on later turns) so you can reinstall the agent on the same host later. The token is invalidated when the device is removed from the CP.
The whole installer flow lands in under 10 seconds on a Pi 4 (alpine pull is the slow step; build itself is ~5s). After commissioning, the operator never has to touch that device again to update the agent script. Image-baseline changes (the Dockerfile, the Python helper, system packages) still need a one-time install.sh rerun per device, but those are rare. The reconciliation script that does the actual work updates itself automatically, on every check-in.
Why a fingerprint and not just a token. The bootstrap token is bound to the device name on the CP, not to hardware. Without an extra check, pasting the same curl on a different machine would silently let two hosts share one device identity: last_seen and reported deploy state flip-flopping each tick, deployed apps trying to run on both. So the edge agent reads /etc/machine-id from a host bind-mount, sha256s it, and sends the digest on every check-in. The CP records whatever fingerprint arrives on the first check-in after commissioning (trust-on-first-use), then validates on every subsequent one. Four scenarios:
| Scenario | What happens |
|---|---|
| Same-host reinstall (docker volume rm then re-run curl) | /etc/machine-id is system-level config that survives container wipes; fingerprint matches; check-in proceeds. Transparent. |
| Accidental cross-host paste | New host's fingerprint differs from the recorded one; CP returns 409 with the remediation message; the agent logs it and stops retrying. Original host keeps working. |
| Intentional hardware migration | Commission a new device with a new name; old device row stays as audit history; new device gets its own fingerprint on its own first check-in. |
| OS reinstall on the same hardware | /etc/machine-id regenerates; fingerprint mismatch on next check-in. Operator either deletes the device and re-commissions under the same name (lenient tombstone reuse), or under a new name. |
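
The trust-on-first-use check behind these scenarios can be sketched in a few lines. The class and function names are hypothetical, and stripping the trailing newline before hashing is an assumption; only the SHA-256 of /etc/machine-id and the 409 response come from the text.

```python
import hashlib
from typing import Optional


def fingerprint(machine_id: str) -> str:
    """SHA-256 hex digest of the host's /etc/machine-id contents."""
    return hashlib.sha256(machine_id.strip().encode()).hexdigest()


class DeviceRecord:
    """CP-side row for one commissioned device (hypothetical shape)."""

    def __init__(self) -> None:
        self.fingerprint: Optional[str] = None


def check_in(device: DeviceRecord, presented: str) -> int:
    """Return the HTTP status the CP would answer with."""
    if device.fingerprint is None:
        device.fingerprint = presented  # first check-in: record and trust
        return 200
    return 200 if device.fingerprint == presented else 409
```

A same-host reinstall presents the same digest and passes; a second host pasting the same command presents a different digest and gets 409 without disturbing the recorded one.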
commission_device enforces name uniqueness too: input is stripped + lowercased before lookup, and a partial unique index on LOWER(name) WHERE status != 'removed' rejects collisions at the DB level. So Pi-001 and pi-001 are the same device; freed tombstones don't block name reuse.
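The naming rule can be illustrated with an in-memory stand-in for the partial unique index. This is a sketch under assumptions: `DeviceRegistry` and its methods are hypothetical, and a real deployment would rely on the database constraint rather than a Python loop.

```python
def normalize_name(raw: str) -> str:
    """Strip and lowercase, as described for commission_device."""
    return raw.strip().lower()


class DeviceRegistry:
    """In-memory stand-in for the devices table (hypothetical)."""

    def __init__(self) -> None:
        self._rows: list[dict[str, str]] = []

    def commission(self, raw_name: str) -> str:
        name = normalize_name(raw_name)
        for row in self._rows:
            # Mirrors a unique index restricted to status != 'removed'.
            if row["name"] == name and row["status"] != "removed":
                raise ValueError(f"device name already in use: {name}")
        self._rows.append({"name": name, "status": "active"})
        return name

    def remove(self, raw_name: str) -> None:
        name = normalize_name(raw_name)
        for row in self._rows:
            if row["name"] == name and row["status"] != "removed":
                row["status"] = "removed"  # tombstone kept as history
```

Removed rows stay in the list as history but are ignored by the collision check, which is what lets a freed name be commissioned again.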
Edge-Agent Self-Update via SHA Comparison
The CP includes a 12-char SHA of the canonical drift-deploy-agent.sh in every check-in response. When the running agent's SHA differs, the container exits cleanly; Docker's --restart unless-stopped brings it back; a bootstrap at the top of the script fetches /api/deploy/agent/agent.sh, syntax-checks it, and execs into it.
After the one-time install.sh on a device, you never need to re-run anything per-device just to ship a new agent script. Worst-case downtime per device per update: one poll cycle + container restart, roughly 20 to 30 seconds.
The script-level self-update doesn't cover image-baseline changes (the Dockerfile, the Python terminal-bridge.py helper, system packages). Those rarely change and require a one-time install.sh rerun per device. It's a deliberate safety/convenience split.
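The comparison that triggers a restart is small enough to sketch. The text says only "12-char SHA", so the use of SHA-256 here is an assumption, and `short_sha` and `should_restart` are hypothetical names.

```python
import hashlib


def short_sha(script: bytes, length: int = 12) -> str:
    """First `length` hex characters of the script's digest.

    SHA-256 is an assumption; the text does not name the hash function.
    """
    return hashlib.sha256(script).hexdigest()[:length]


def should_restart(running_script: bytes, canonical_sha: str) -> bool:
    """True when the CP advertises a script different from the running one."""
    return short_sha(running_script) != canonical_sha
```

Hashing the script contents rather than comparing version strings means any edit to the canonical script, however small, is picked up on the next check-in.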
Web Terminal with No Listening Sockets
The constraint "no inbound ports on target devices" applies to the terminal too. The flow is asymmetric: the browser opens a WebSocket to the CP; the edge agent (on its next 30s tick) sees a pending_sessions[] entry in its check-in response and dials out to the CP to attach.
Per-session audit row: user, device, started/ended, bytes browser↔agent. No keystroke capture. Worst-case wait for a session to land is one poll cycle (default 30s); the UI shows a countdown chip so the operator sees the bound.

The browser terminal modal showing the "waiting for agent… 12s / 30s" countdown before the session attaches.
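The CP side of this handoff amounts to a per-device queue that is drained on check-in. The sketch below assumes a `SessionBroker` class and UUID session identifiers, neither of which is confirmed by the text; only the `pending_sessions` field name is.

```python
import uuid
from collections import defaultdict


class SessionBroker:
    """Queues terminal requests until the target device polls (hypothetical)."""

    def __init__(self) -> None:
        self._pending: dict[str, list[str]] = defaultdict(list)

    def request_session(self, device: str) -> str:
        """Called when the browser's WebSocket reaches the CP."""
        session_id = str(uuid.uuid4())
        self._pending[device].append(session_id)
        return session_id

    def check_in_response(self, device: str) -> dict:
        """Built on the device's next poll; hands over and clears its queue."""
        return {"pending_sessions": self._pending.pop(device, [])}
```

The device learns about a session only in the response to a request it initiated, so nothing on the device ever needs to listen.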
What the LLM Sees (and Doesn't)
The trust story matters more than the architecture diagram. The boundary is enforced in code, not by prompting the model to behave.
| ✅ LLM has access to | ❌ LLM never sees |
|---|---|
| Metric / label / job names | API keys and other env-var credentials |
| Time-series summaries only (n, mean, p50, p95, min, max, ...) | Database password (DRIFT_PG_PASSWORD), Fernet key (DRIFT_SECRET_KEY) |
| Device + app metadata (names, tags, groups, statuses) | Registry credentials (encrypted at rest, TLS-direct to edge agent) |
| Alert rule names + PromQL expressions + labels | Alertmanager receiver secrets (filename references only) |
| Log lines returned by query_logs | Raw time-series arrays (dataRef pattern) |
| Compose file contents when explicitly fetched via get_app_revision | Web terminal pty bytes (dedicated WS, never the LLM) |
| | User passwords (passlib server-side; LLM has no read path) |
Three places where sensitive content briefly touches the chat surface, each with a workaround:
- `create_user` / `reset_user_password` return a server-generated password ONCE in the tool result. The self-service sidebar flow for changing your own password keeps it off the chat entirely.
- `commission_device` returns the device's bootstrap token in the curl line. It's the device's long-lived bearer credential for `/agent/check-in`, bound to a specific host on first use via a `/etc/machine-id` fingerprint (TOFU). Treat the rendered curl line as a secret: paste it on the intended host, then clear the investigation if you're sharing the workspace. If the curl leaks and someone pastes it on a different machine, the CP returns 409 fingerprint mismatch and the impostor agent can't update device state.
- Pasting compose with literal secrets. If you type a secret into the prompt, the LLM sees it. Use `${VAR}` references resolved on the device, or the registry-credentials modal (which bypasses the LLM).
Tool Calling Is the Extension Mechanism
Drift's agent has about 30 tools across discovery, query, analysis, fleet management, alert management, and render-block emission. Adding a new capability is:
- Write `async def my_tool(ctx, args)` in `drift-agent/app/tools/<topic>.py`.
- Add an entry to that file's `*_TOOLS` list (JSON Schema describing inputs).
- Register the handler in `*_HANDLERS`.
The agent picks it up on next request; the system prompt and tools list rebuild from the registries on import.
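The three steps can be sketched in one file. Everything here is illustrative: `EXAMPLE_TOOLS`, `EXAMPLE_HANDLERS`, `dispatch`, the shape of `ctx`, and the `input_schema` key are assumptions standing in for Drift's real `*_TOOLS` / `*_HANDLERS` registries.

```python
import asyncio


async def my_tool(ctx: dict, args: dict) -> dict:
    """Hypothetical tool: report which user asked about which host."""
    return {"host": args["host"], "user": ctx["user"]}


# Step 2: describe the inputs with JSON Schema (the key names are assumed).
EXAMPLE_TOOLS = [{
    "name": "my_tool",
    "description": "Echo the requested host back to the caller.",
    "input_schema": {
        "type": "object",
        "properties": {"host": {"type": "string"}},
        "required": ["host"],
    },
}]

# Step 3: map the tool name to its coroutine.
EXAMPLE_HANDLERS = {"my_tool": my_tool}


async def dispatch(name: str, ctx: dict, args: dict) -> dict:
    """Look up a handler by name and await it, as an agent loop might."""
    handler = EXAMPLE_HANDLERS[name]
    return await handler(ctx, args)
```

Keeping the schema list and the handler map side by side in the same file makes it easy to see when a tool is declared but not registered, or the reverse.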
The whole prefix (system prompt + tools list) is wrapped in cache_control: ephemeral, so every turn after the first reads it from cache. Two design rules keep this working:
- Time-series data never flows back to the LLM through a tool result. Use a dataRef and a digest.
- Render blocks come from emit tools, not text parsing. The agent doesn't write `RenderBlock` JSON in its message text; it calls `make_chart` / `make_table` / `make_markdown`, and the tool's JSON Schema validates structure. There is no JSON-from-text parsing path.
The system prompt and tools list must be byte-stable across calls because the cache marker covers the entire prefix. A datetime.now() interpolated into the system prompt, a tools list iterated from a set, a per-request UUID: any of these silently invalidate the cache and 10× your bill.
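One way to get a byte-stable prefix is to sort everything before serializing it. The sketch below is an assumption about how such a builder could look, not Drift's code; `build_prefix` is a hypothetical name.

```python
import json


def build_prefix(system_prompt: str, tools: list[dict]) -> bytes:
    """Serialize the cached prefix deterministically.

    Tools are sorted by name and keys are sorted, so registration order
    or set iteration order cannot change the resulting bytes.
    """
    ordered = sorted(tools, key=lambda tool: tool["name"])
    doc = {"system": system_prompt, "tools": ordered}
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode()
```

Comparing the bytes (or a hash of them) between two requests in a test is a cheap guard against a timestamp or random identifier sneaking into the prompt.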
Bring Your Own Model (via LiteLLM)
The agent loop is provider-agnostic. Drift talks to the LLM through LiteLLM, which exposes a unified API across Anthropic, OpenAI, Google, Azure, AWS Bedrock, Ollama, vLLM, and a long tail of others. The MODEL=... env var picks the provider; LiteLLM handles the wire format underneath.
The installer prompts for whichever API key matches the model you pick: MODEL=claude-opus-4-7 + ANTHROPIC_API_KEY, MODEL=gpt-5 + OPENAI_API_KEY, MODEL=gemini-2.5-pro + GEMINI_API_KEY, and so on. Swap providers by editing .env and restarting the agent, or, if you are an admin, from the web UI. The frontend doesn't know which model is running.

What stays constant when you swap models:
- The SSE protocol the frontend consumes (same events, same shapes).
- The tool catalog (same names, same JSON schemas).
- The propose-then-apply pattern, the dataRef pattern, the streaming UX.
What varies:
- Reasoning quality on long tool-use loops. This is where the most capable models pay for themselves. Drift's investigations routinely chain 8 to 15 tool calls; a model that loses focus partway costs you accuracy.
- Cost per investigation. Smaller models (Claude Haiku, Gemini Flash, GPT mini tiers) can be 10 to 50× cheaper. Often good enough for read-only "observe" work; less reliable for the "deploy" and "respond" pillars where mistakes have side effects.
- Prompt-caching behavior. Anthropic and OpenAI both support prompt caching, with different semantics. LiteLLM abstracts the API call but the cache-hit economics still depend on the provider. Drift's cache prefix is byte-stable by design (see above) so cache hits are reliable when the provider supports them.
- Thinking / reasoning modes. Opus and o-series surface explicit reasoning streams; smaller models don't. The frontend renders the `thinking` events when they arrive and degrades gracefully when they don't.
Default is claude-opus-4-7 because it's the most reliable on long tool-use loops with the propose-then-apply discipline Drift relies on. But if you've already got an OpenAI subscription, or you want to run everything against a local Ollama instance for an air-gapped install, that's a one-line .env change.
The CP Installer
The whole control plane comes up via a single guided script. deploy/install.sh does three things: prompts the operator for choices, renders config templates with those answers, and brings the docker-compose stack up.
What it prompts for:
- Use Caddy for TLS? Yes by default. Decline and install.sh skips rendering the Caddyfile and drops the `caddy` service from the generated `docker-compose.yml`.
- Public domain + email for Let's Encrypt registration (only relevant when Caddy is enabled).
- Drift admin username + password. Bootstrapped on first agent start.
- vmalert / Alertmanager UI password. Basic-auth credential for `/vmalert/` and `/am/`, separate from the Drift admin login.
- LLM model + matching API key. `ANTHROPIC_API_KEY`, `OPENAI_API_KEY`, `GEMINI_API_KEY`, etc., wired up via LiteLLM.
- ntfy topic for Alertmanager push notifications.
- B2 / S3 credentials (optional; only needed for Drift Deploy's bundle uploads at fleet scale).
- Self-scrape labels (hostname + group used for this host's own metrics).
What it auto-generates the first time and preserves on every rerun:
- `DRIFT_SECRET_KEY` (Fernet, used to encrypt registry credentials at rest).
- `DRIFT_PG_PASSWORD`.
- `REPORTER_PASSWORD` (vmauth credential for the self-scrape vmagent).
- Drift admin password and vmalert/AM UI password, if the operator pressed Enter at the prompt instead of supplying one.
What gets rendered from templates:
| Template | Output | Used by |
|---|---|---|
| config/Caddyfile.tmpl | Caddyfile | caddy |
| config/auth.yml.tmpl | auth.yml | vmauth |
| config/alertmanager-ntfy.yml.tmpl | alertmanager-ntfy.yml | alertmanager-ntfy |
| config/grafana.ini.tmpl | grafana.ini | grafana |
Each template is a plain file with ${VAR} placeholders. install.sh exports the prompt answers, then envsubsts each template into its final form. No Helm, no Jinja, no extra runtime dependencies.
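The substitution itself is simple enough to mimic in a few lines. The sketch below is a Python analogue of the envsubst step, not what install.sh runs; the template text and the `DOMAIN` / `ACME_EMAIL` variable names are made up for illustration.

```python
from string import Template


def render(template_text: str, answers: dict[str, str]) -> str:
    """Fill ${VAR} placeholders from the operator's answers.

    Unlike envsubst, which substitutes an empty string for unset
    variables, Template.substitute raises KeyError on a missing one.
    """
    return Template(template_text).substitute(answers)
```

Failing loudly on a missing variable is a deliberate difference in this sketch: a half-rendered config is harder to debug than an error at render time.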
Reruns are idempotent. Re-running install.sh lets you rotate any prompted secret (press Enter at the prompt to keep the current value), change the model + API key, switch ntfy topics, toggle Caddy on or off, or fix anything that was misconfigured on the first run. Auto-generated secrets stay stable across reruns. A sidecar copy of .env at /var/lib/drift-cp/.env (owned root:root, mode 600) persists state outside the repo directory, so even a docker compose down -v followed by a fresh git clone reseeds the same credentials when you re-run install.sh.
A trap on EXIT prints the recoverable state if anything goes sideways mid-run: port conflicts, missing Docker daemon, DNS not yet resolving for Let's Encrypt. Fix the underlying issue and re-run install.sh; it picks up from the prior state instead of starting over.
The Update Model
Drift ships in two parts: the Docker images (drift-agent, drift-frontend) and the surrounding install bundle (install.sh, docker-compose.yml, config/*.tmpl). They move on different cadences, so releases come in two kinds: image-only releases and bundle releases.
The differentiator is whether a tarball is attached to the GitHub release. No metadata schema, no flag; just "tarball present means bundle release." The Software Updates modal in the UI reads this from the GitHub Releases API and tailors the apply path accordingly. Image-only updates get an "Update now" button; bundle updates get a warning Alert with a "View release" link (because re-running compose up against an old compose file would silently miss new services or mounts).

The Software Updates modal. The chip shows running → latest version; the Edge Agents subsection lists each device's reported AGENT_VERSION.
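The "tarball present means bundle release" rule maps directly onto the `assets` array that the GitHub Releases API returns for each release. A minimal sketch, assuming the tarball is recognized by a `.tar.gz` suffix and using a hypothetical function name:

```python
def release_kind(release: dict) -> str:
    """Classify a GitHub Releases API object as 'bundle' or 'image-only'."""
    assets = release.get("assets", [])
    has_tarball = any(
        asset.get("name", "").endswith(".tar.gz") for asset in assets
    )
    return "bundle" if has_tarball else "image-only"
```

For example, a release carrying `drift-deploy-0.1.41.tar.gz` would be classified as a bundle release, while one with no attached assets would be image-only.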
What the Constraints Rule Out
The same constraints that shape Drift also exclude a lot of common shapes:
- No PaaS-style "give us your code". Apps stay on your CP and your devices.
- No per-device daemon you upgrade by hand. The agent script self-updates.
- No log-aggregator-as-a-service. Vector + VictoriaLogs run on your box.
- No "let the LLM read all your data" RAG. Telemetry flows through tools, not into the prompt.
- No listening sockets on target devices. Everything is poll-out.
What you get instead is a small, opinionated stack: an HTTPS reverse proxy (Caddy by default, optional) + the Drift CP + a TSDB on a single Linux box. Your devices, your data, your model key. Bring-your-own model via LiteLLM (Claude Opus 4.7 is the default; MODEL=... picks any Anthropic, OpenAI, Google, Bedrock, or local-Ollama target). Multi-tenant from day one via RBAC + per-group scoping. Audit log for terminal sessions. Encrypted-at-rest registry credentials shipped TLS-direct to the edge.
Try It
Drift source is Apache 2.0, on GitHub.
The single-server bundle installs everything (Drift CP + VictoriaMetrics + VictoriaLogs + vmalert + Alertmanager + Grafana + Caddy/TLS + Deploy Agent) on one Linux host with a guided installer:
VERSION=v0.1.41
curl -L "https://github.com/Scope-Creep-Labs/drift/releases/download/${VERSION}/drift-deploy-${VERSION#v}.tar.gz" | tar -xz
cd "drift-deploy-${VERSION#v}"
./install.sh
See "The CP Installer" above for the full prompt list and the template-rendering flow. Re-running the script is safe (idempotent, secrets preserved); first-run takes a few minutes for image pulls and the initial build.
For a deeper dive into the agent loop, the SSE protocol, the tool catalog, and the extension points, see ARCHITECTURE.md. For the deploy subsystem specifically, see DEPLOY.md. For the alerting subsystem, ALERTING.md.
If you're running a homelab, a small edge fleet, or just a few Docker hosts that have outgrown ssh + tmux but don't justify Kubernetes, give it a shot. We'd love to hear what breaks.
