Drift: Observe, Deploy, Respond. From a Prompt.

The Problem

Running a small fleet of Docker hosts (homelab, edge, cloud, corp boxes) is stuck in an uncomfortable middle. Kubernetes is overkill. SaaS platforms charge per device and want your data. The DIY path is a stack of disconnected tools: Grafana for metrics, Portainer for deploys, ssh + tmux for the times something is actually wrong, hand-edited Alertmanager configs.

We wanted one place to do all of it, conversationally. A small set of hard constraints:

  • No inbound ports on target devices. They sit behind NATs, residential routers, and corp firewalls.
  • No SSH after the first install. Every operational action should run through the control plane.
  • Compose as the contract. If docker compose up runs it on your laptop, the fleet should be able to ship it.
  • Lean on proven observability infrastructure. Don't build another TSDB.
  • The LLM can mutate state, but never silently. Every change goes through a propose step.

Drift is what came out of this.

End-to-End Architecture

pasted image

Inside each box on the CP: drift-frontend (nginx + React SPA), drift-agent (FastAPI agent loop + SSE), the Tools layer (metrics, logs, alerts, deploy, emit), LiteLLM (provider abstraction), drift-postgres (users, devices, apps, revisions, sessions).

The observability stack is conventional: VictoriaMetrics / VictoriaLogs / vmalert / Alertmanager, with vmauth fronting remote writes from edges and alertmanager-ntfy translating webhooks. On each device, the reporter stack runs vmagent, cAdvisor, node-exporter, and Vector; Jetson devices add jtop_exporter for GPU/power/thermal metrics. The deploy-agent container bundles drift-deploy-agent.sh and terminal-bridge.py.

Four things to notice:

  1. Edges only ever talk out. Every arrow from a device points up to the CP. Nothing listens on the device side, so devices behind NATs, residential routers, and corp firewalls work without any port forwarding or tunneling.
  2. One reverse proxy (Caddy by default, optional) terminates TLS and fans out to every internal service. The installer asks whether to enable it; if you sit behind a tunnel or already run nginx / Traefik, decline and front the services yourself. Details in The CP Installer below.
  3. Standard observability tooling, not a homegrown TSDB. VictoriaMetrics, VictoriaLogs, vmalert, Alertmanager, Grafana, vmagent, cAdvisor, node-exporter, Vector. Every box is something you could swap or replace independently. Drift adds the prompt-driven interaction layer on top; the data plane stays conventional.
  4. The agent is in the data path for orchestration but out of it for data. Time-series arrays never enter the LLM context. More on that below.

The Three Pillars

Drift's tagline is observe, deploy, respond. From a prompt. All three run through the same chat surface; the agent picks the right tools based on what you ask.

Observe

Ask anything about your telemetry. The agent discovers what metrics exist, picks the right PromQL, fetches the data, runs statistics, and assembles a streaming response.

pasted image

An investigation underway: streaming markdown narration above, a Plotly chart in the middle, a summary table below. All three were emitted by separate tool calls and painted into the UI as they arrived.

Example prompts:

> Which hosts are reporting metrics, and what jobs are scraping?
> Show CPU usage on the host over the last 15 minutes.
> Look for anomalies in network traffic over the last hour.
> Plot disk I/O on jetson-002 every 5 seconds.
> Now change refresh rate to 1s.

The live-chart case is interesting. The agent emits a make_live_chart render block with a chart_key. Re-emitting with the same key on a later turn doesn't add a new chart; it Plotly.react-diffs the existing one in place, preserving zoom and hover state.

Deploy

Drift Deploy registers each device with a small edge agent that polls the CP every 30s, applies whatever compose bundles you've assigned, and reports back. The whole thing (devices, apps, revisions, tagging, deploy-by-tag, rollback) drives from the chat.

pasted image

The Devices sidebar showing online status, tags, and a one-click terminal icon per device.

> List devices and their groups.
> Tag pi-riffpod-001 with edge, client-z.
> Fork the reporter app as reporter-jetson.
> Deploy reporter-jetson v3 to all devices tagged edge AND client-z.
> Roll home-pi4-001 back to reporter v2.

Apps are versioned bundles of plain files. There's no proprietary packaging. A "revision" is just compose.yaml + .env + whatever configs the compose references. You can download any revision as a .tar.gz or .zip from the UI.

Respond

Investigations end in action. Manage vmalert rules and Alertmanager routing, silence noise during planned work, or jump straight into a host shell. Same chat, same propose-then-apply pattern.

pasted image

The agent proposing an alert rule before applying it. Operator sees the exact YAML that will be written; one "looks good" confirms.

> List firing alerts.
> Create an alert when CPU > 90% for 5 minutes on any edge device.
> Silence anything from jetson-002 for 2 hours, I'm rebooting it.
> Wire up a webhook so critical alerts ping ntfy.sh/drift-alerts.
> Open a terminal to home-pi4-001.

When the terminal opens, it's a full pty with TERM=xterm-256color (so tmux, screen, vim, less all work), entered into the host's namespaces via nsenter -t 1 -m -p -u -i -- /bin/login. PAM authenticates; sudo works. Bytes flow over a dedicated WebSocket; they never touch the LLM.


Key Architectural Decisions

Queue-Based Deploys, Not Push

The CP holds the desired state. Edge agents poll out, ask "what should I be running?", and reconcile. Targets can be offline when you make a change. They converge on their next check-in.
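
The reconcile step can be sketched as a plain diff between desired and current state. This is a minimal illustration, not Drift's actual agent (which is a shell script): the function name plan_actions, the dict shapes, and the decision to remove unassigned apps are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def plan_actions(desired: Dict[str, str], current: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Diff desired vs. current revisions and return (action, app, revision) steps.

    Hypothetical helper: both arguments map app name -> revision label.
    """
    actions: List[Tuple[str, str, str]] = []
    for app, revision in sorted(desired.items()):
        if current.get(app) != revision:
            # Missing, or running a different revision: (re)apply the desired one.
            actions.append(("apply", app, revision))
    for app in sorted(current):
        if app not in desired:
            # Assumption: apps no longer assigned to the device are taken down.
            actions.append(("remove", app, current[app]))
    return actions


if __name__ == "__main__":
    print(plan_actions({"reporter": "v3"}, {"reporter": "v2", "old-app": "v1"}))
```

Because the plan is computed from state rather than from a stream of commands, a device that was offline for an hour needs no replay; it computes the same diff on its next check-in.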

%%{init: {"sequence": {"actorFontSize": 18, "noteFontSize": 16, "messageFontSize": 16}}}%%
sequenceDiagram
    participant U as Operator
    participant CP as Control Plane
    participant DB as drift-postgres
    participant E as Edge Agent (offline now)
    U->>CP: "Deploy reporter v3 to home-pi4-001"
    CP->>DB: INSERT deployment_target(desired=v3, status=pending)
    CP-->>U: ack (target queued)
    Note over E: Device offline<br/>(home WiFi down)
    Note over E: ...30 min later...
    E->>CP: POST /agent/check-in
    CP->>DB: SELECT desired state for home-pi4-001
    CP-->>E: { apps: [{ name: reporter, revision: v3, bundle_url: ... }] }
    E->>E: docker compose pull && up -d
    E->>CP: POST /agent/check-in (next tick)<br/>{ report: { reporter: { status: healthy, current: v3 } } }
    CP->>DB: UPDATE current=v3, status=healthy
    CP-->>E: ack

The retry budget lives on the CP, not the edge. If the agent fails an apply, it reports the failure on its next check-in and the CP increments that target's attempts counter by one. This matters when a device drops offline mid-failure: an edge-only counter would burn through the retry budget against an unreachable network. CP-side tracking ensures max_retries=5 means "5 real apply attempts."
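
A rough sketch of that CP-side accounting, assuming a hypothetical DeploymentTarget record and record_report function (neither name comes from Drift's code). The point it illustrates: the counter only moves when a report actually arrives, so time spent offline costs nothing.

```python
from dataclasses import dataclass
from typing import Optional

MAX_RETRIES = 5  # mirrors max_retries=5 from the text


@dataclass
class DeploymentTarget:
    """Hypothetical CP-side row for one (device, app) deployment."""
    desired: str
    current: Optional[str] = None
    attempts: int = 0
    status: str = "pending"


def record_report(target: DeploymentTarget, reported_revision: str, ok: bool) -> DeploymentTarget:
    """Fold one check-in report into the CP's view of the target."""
    if reported_revision != target.desired:
        # Report is about a different revision than the one queued; ignore it.
        return target
    if ok:
        target.current = reported_revision
        target.status = "healthy"
    else:
        # Only a delivered failure report consumes retry budget.
        target.attempts += 1
        target.status = "failed" if target.attempts >= MAX_RETRIES else "pending"
    return target
```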

The dataRef Pattern (LLM Out of the Data Path)

A naive agent design dumps query results into the LLM's context and asks it to summarize. That fails three ways:

  1. Cost. A few hundred series × a thousand points × frequent investigations = burnt tokens.
  2. Precision. The model paraphrases numerical data; small errors creep in.
  3. Hallucination risk. The model can confidently report a p95 that wasn't in the data.

Drift's query_range tool instead stores the Plotly trace under a prom://<uuid> reference in a server-side cache, streams the trace to the UI via SSE (out-of-band, not via the LLM), and returns a compact digest to the model: {ref, n_series, series: [{n, mean, p50, p95, min, max, ...}]}.

%%{init: {"sequence": {"actorFontSize": 18, "noteFontSize": 16, "messageFontSize": 16}}}%%
sequenceDiagram
    participant LLM
    participant Tool as query_range
    participant VM as VictoriaMetrics
    participant Cache as dataRegistry
    participant UI as Browser
    LLM->>Tool: call query_range(promql, start, end)
    Tool->>VM: HTTP /api/v1/query_range
    VM-->>Tool: { series: [...full arrays...] }
    Tool->>Cache: store under prom://abc-123
    Tool-->>UI: SSE "data" event (full trace bytes)
    Tool-->>LLM: tool_result<br/>{ ref: prom://abc-123,<br/> n_series: 4,<br/> series: [{ n: 720, mean: 0.42, p95: 0.91 }, ...] }
    Note over LLM: Sees stats. Never sees raw arrays.
    LLM->>Tool: call make_chart(ref=prom://abc-123, ...)
    Tool-->>UI: SSE "block" event<br/>render block referencing the cached trace

Subsequent analysis tools (detect_anomalies, find_correlations) and emit tools (make_chart) operate by ref. The model decides what to compute; numpy/scipy actually compute it. Token cost depends on the size of the digest, not on how many points each series holds; precision is whatever VictoriaMetrics returns.
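
To make the pattern concrete, here is a minimal stdlib-only sketch of "store the arrays, return a digest". It is not Drift's implementation (which uses numpy/scipy and stores Plotly traces); DATA_REGISTRY, percentile, and store_and_digest are hypothetical names, and the percentile uses simple linear interpolation.

```python
import math
import uuid
from typing import Dict, List

# Stand-in for the server-side cache keyed by prom://<uuid> references.
DATA_REGISTRY: Dict[str, List[List[float]]] = {}


def percentile(sorted_values: List[float], q: float) -> float:
    """Linear-interpolated percentile of pre-sorted, non-empty data (q in 0..100)."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def store_and_digest(series: List[List[float]]) -> dict:
    """Cache the raw arrays under a ref and return only summary statistics.

    Assumes every series is non-empty.
    """
    ref = f"prom://{uuid.uuid4()}"
    DATA_REGISTRY[ref] = series
    digest = []
    for values in series:
        s = sorted(values)
        digest.append({
            "n": len(s),
            "mean": sum(s) / len(s),
            "p50": percentile(s, 50),
            "p95": percentile(s, 95),
            "min": s[0],
            "max": s[-1],
        })
    # This dict is all the model would see; the arrays stay in the registry.
    return {"ref": ref, "n_series": len(series), "series": digest}
```

A later tool call that receives the ref can look the arrays up in the registry and compute on them directly, without the model ever relaying the numbers.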

Streaming UX via SSE

Every investigation streams as a text/event-stream response. The frontend's AgentAdapter consumes eight event types:

| Event | When it fires |
|---|---|
| start | Once, when the investigation begins. |
| thinking | Streaming reasoning delta from the model. Zero or more. |
| narrative | Streaming user-facing prose delta. Zero or more. |
| tool_call | The agent is about to invoke a tool. One per invocation. |
| data | Out-of-band payload (e.g. chart traces). Only for emit tools that carry one. |
| tool_result | Tool returned. One per tool_call, always after data if present. |
| block | Render block JSON ready to paint. Only for emit tools that produced one. |
| done | Once, when the stream closes. (Replaced by error on failure.) |

This is what lets the user watch the investigation: tool calls appear in a trace pane, intermediate render blocks paint into the conversation as they arrive, charts show up the moment the underlying data does. Trust comes from seeing how the result was reached, not from a 30-second blank wait followed by a wall of text.

%%{init: {"sequence": {"actorFontSize": 18, "noteFontSize": 16, "messageFontSize": 16}}}%%
sequenceDiagram
    participant UI
    participant Agent as drift-agent
    participant LLM as LLM (via LiteLLM)
    participant Tools
    UI->>Agent: POST /api/investigate (prompt)
    Agent-->>UI: event: start
    Agent->>LLM: completion (system + user + tools)
    LLM-->>Agent: streaming response
    loop Agent loop (up to 20 iterations)
        Agent-->>UI: event: thinking (delta)
        Agent-->>UI: event: narrative (delta)
        Agent-->>UI: event: tool_call (id, name, args)
        Agent->>Tools: dispatch(name, args)
        Tools-->>UI: event: data (if emit tool with payload)
        Tools-->>Agent: result
        Agent-->>UI: event: tool_result (id, summary)
        Agent-->>UI: event: block (rendered render-block JSON)
        Agent->>LLM: continue with tool_result
    end
    Agent-->>UI: event: done
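
On the wire, each of these is an ordinary Server-Sent Events frame: an event: line, a data: line, and a blank line. The sketch below shows that framing and the per-tool ordering described above. The payload fields and the helper names (sse_event, tool_invocation_events) are assumptions for illustration, not Drift's actual schema.

```python
import json
from typing import Any, Dict, Iterator, Optional

EVENT_TYPES = (
    "start", "thinking", "narrative", "tool_call",
    "data", "tool_result", "block", "done", "error",
)


def sse_event(event: str, payload: Dict[str, Any]) -> str:
    """Serialize one SSE frame. json.dumps emits no raw newlines, so one data: line suffices."""
    if event not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event}")
    body = json.dumps(payload, separators=(",", ":"))
    return f"event: {event}\ndata: {body}\n\n"


def tool_invocation_events(
    call_id: str,
    name: str,
    args: Dict[str, Any],
    trace: Optional[Any] = None,
    block: Optional[Dict[str, Any]] = None,
) -> Iterator[str]:
    """Yield frames for one tool call in order: tool_call, data?, tool_result, block?."""
    yield sse_event("tool_call", {"id": call_id, "name": name, "args": args})
    if trace is not None:
        yield sse_event("data", {"id": call_id, "trace": trace})
    yield sse_event("tool_result", {"id": call_id})
    if block is not None:
        yield sse_event("block", block)
```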

Commissioning a Device

The edge agent isn't pulled from a public registry. It's built on each device at install time from a small Dockerfile shipped by the CP. This sounds backwards until you think about the constraints: a corp-network device behind a TLS-intercepting proxy might not be able to reach GHCR; an air-gapped install needs zero external pulls beyond the explicit ones; and baking on-device keeps the CP's container registry out of the trust path entirely.

What's inside the image:

  • Alpine base for size (the built image is roughly 80MB).
  • docker CLI + docker compose plugin. The agent's job is to run docker compose pull && up -d against bundles the CP ships, so it needs the client tools.
  • drift-deploy-agent.sh: the reconciliation loop (poll, apply, report, repeat). This is what self-updates on every check-in.
  • terminal-bridge.py: the pty + nsenter helper that handles web terminal sessions.
  • util-linux for the nsenter binary so the bridge can cross into the host's namespaces.

Commissioning is a single command on the target host:

curl -fsSL "https://drift.example.com/api/deploy/agent/install.sh" | sudo \
  DEVICE_NAME=pi-livingroom \
  BOOTSTRAP_TOKEN=drift-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx \
  CP_URL=https://drift.example.com/api/deploy \
  bash

The operator gets DEVICE_NAME and BOOTSTRAP_TOKEN from a render block in the chat after asking the agent to "commission a new device named pi-livingroom". The token is the device's long-lived bearer credential for every /agent/check-in. Save the curl line to a password manager (the chat won't render it again on later turns) so you can reinstall the agent on the same host later. The token is invalidated when the device is removed from the CP.

%%{init: {"sequence": {"actorFontSize": 18, "noteFontSize": 16, "messageFontSize": 16}}}%%
sequenceDiagram
    participant Op as Operator
    participant Host as Target host
    participant CP as Control Plane
    Op->>CP: "Commission pi-livingroom"
    CP-->>Op: render block with curl command<br/>(DEVICE_NAME, BOOTSTRAP_TOKEN, CP_URL)
    Op->>Host: paste + run as root
    Host->>CP: GET /agent/install.sh
    CP-->>Host: installer script body
    Host->>Host: write /etc/drift-deploy/env<br/>(creds, mode 600)
    Host->>CP: GET /agent/build-context.tar
    CP-->>Host: Dockerfile + bash agent + terminal-bridge.py
    Host->>Host: docker build (alpine + cli + util-linux, ~5s)
    Host->>Host: docker run -d --restart unless-stopped<br/>(--pid host, --cap-add SYS_ADMIN,<br/>/etc/machine-id + docker.sock<br/>+ state volume mounted)
    Host->>CP: POST /agent/check-in<br/>(Bearer BOOTSTRAP_TOKEN,<br/>host_fingerprint = sha256(/etc/machine-id))
    CP->>CP: device.host_fingerprint is NULL,<br/>record this fingerprint (TOFU)
    CP-->>Host: { agent_target_sha, pending_deploys, ... }
    Note over Host: Device now in steady state<br/>(polls every 30s, sends fingerprint each time)

The whole installer flow lands in under 10 seconds on a Pi 4 (alpine pull is the slow step; build itself is ~5s). After commissioning, the operator never has to touch that device again to update the agent script. Image-baseline changes (the Dockerfile, the Python helper, system packages) still need a one-time install.sh rerun per device, but those are rare. The reconciliation script that does the actual work updates itself automatically, on every check-in.

Why a fingerprint and not just a token. The bootstrap token is bound to the device name on the CP, not to hardware. Without an extra check, pasting the same curl on a different machine would silently let two hosts share one device identity: last_seen and reported deploy state flip-flopping each tick, deployed apps trying to run on both. So the edge agent reads /etc/machine-id from a host bind-mount, sha256s it, and sends the digest on every check-in. The CP records whatever fingerprint arrives on the first check-in after commissioning (trust-on-first-use), then validates on every subsequent one. Four scenarios:

| Scenario | What happens |
|---|---|
| Same-host reinstall (docker volume rm then re-run curl) | /etc/machine-id is system-level config that survives container wipes; fingerprint matches; check-in proceeds. Transparent. |
| Accidental cross-host paste | New host's fingerprint differs from the recorded one; CP returns 409 with the remediation message; the agent logs it and stops retrying. Original host keeps working. |
| Intentional hardware migration | Commission a new device with a new name; old device row stays as audit history; new device gets its own fingerprint on its own first check-in. |
| OS reinstall on the same hardware | /etc/machine-id regenerates; fingerprint mismatch on next check-in. Operator either deletes the device and re-commissions under the same name (lenient tombstone reuse), or under a new name. |
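
The trust-on-first-use check itself is small. A sketch under stated assumptions: host_fingerprint and check_fingerprint are hypothetical names, the trailing newline of /etc/machine-id is stripped before hashing (the source does not say whether Drift does this), and the return value is just a status code plus the fingerprint the CP should keep.

```python
import hashlib
from typing import Optional, Tuple


def host_fingerprint(machine_id: str) -> str:
    """sha256 hex digest of the machine-id contents (whitespace-stripped by assumption)."""
    return hashlib.sha256(machine_id.strip().encode("utf-8")).hexdigest()


def check_fingerprint(recorded: Optional[str], presented: str) -> Tuple[int, Optional[str]]:
    """Return (http_status, fingerprint_to_store) for one check-in."""
    if recorded is None:
        # First check-in after commissioning: record whatever arrives (TOFU).
        return 200, presented
    if recorded == presented:
        return 200, recorded
    # A different host is presenting this device's token.
    return 409, recorded
```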

commission_device enforces name uniqueness too: input is stripped + lowercased before lookup, and a partial unique index on LOWER(name) WHERE status != 'removed' rejects collisions at the DB level. So Pi-001 and pi-001 are the same device; freed tombstones don't block name reuse.
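
The same rule, expressed in application code for illustration. In Drift the partial unique index is what enforces it; the helper names and the (name, status) tuple shape below are assumptions.

```python
from typing import Iterable, Tuple


def normalize_device_name(raw: str) -> str:
    """Strip and lowercase, matching the lookup rule described above."""
    return raw.strip().lower()


def name_is_free(raw: str, devices: Iterable[Tuple[str, str]]) -> bool:
    """True if no non-removed device already holds this name (case-insensitive).

    devices is an iterable of (name, status) pairs; rows with status 'removed'
    are tombstones and do not block reuse.
    """
    wanted = normalize_device_name(raw)
    return all(
        status == "removed" or name.lower() != wanted
        for name, status in devices
    )
```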

Edge-Agent Self-Update via SHA Comparison

The CP includes a 12-char SHA of the canonical drift-deploy-agent.sh in every check-in response. When the running agent's SHA differs, the container exits cleanly; Docker's --restart unless-stopped brings it back; a bootstrap at the top of the script fetches /api/deploy/agent/agent.sh, syntax-checks it, and execs into it.

%%{init: {"themeVariables": {"fontSize": "18px"}}}%%
flowchart LR
    A[Agent running<br/>SHA: a1b2c3] --> B{Check-in tick}
    B -->|"agent_target_sha: e4f5a6"| C{SHA differs?}
    C -->|no| D[Apply pending bundles<br/>report state]
    D --> B
    C -->|yes| E[exit 100<br/>let Docker restart us]
    E --> F[Bootstrap fetches<br/>new agent.sh from CP]
    F --> G[bash -n syntax check]
    G --> H[exec new script]
    H --> A

After the one-time install.sh on a device, you never need to re-run anything per-device just to ship a new agent script. Worst-case downtime per device per update: one poll cycle + container restart, roughly 20 to 30 seconds.
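
The comparison at the heart of this loop can be sketched in a few lines. The real agent is a shell script; this Python version only illustrates the decision. The hash algorithm is an assumption (the text says "12-char SHA" without naming one; SHA-256 is used here), as are the function names.

```python
import hashlib

EXIT_SELF_UPDATE = 100  # exit code shown in the flowchart above


def short_sha(script_bytes: bytes, length: int = 12) -> str:
    """First `length` hex characters of the script's SHA-256 digest (algorithm assumed)."""
    return hashlib.sha256(script_bytes).hexdigest()[:length]


def next_step(running_sha: str, target_sha: str) -> str:
    """Decide what a check-in tick should do given the CP's advertised target SHA."""
    if running_sha != target_sha:
        # Exit and let the container restart policy re-enter the bootstrap,
        # which fetches, syntax-checks, and execs the new script.
        return "exit_for_update"
    return "apply_bundles"
```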

The script-level self-update doesn't cover image-baseline changes (the Dockerfile, the Python terminal-bridge.py helper, system packages). Those rarely change and require a one-time install.sh rerun per device. It's a deliberate safety/convenience split.

Web Terminal with No Listening Sockets

The constraint "no inbound ports on target devices" applies to the terminal too. The flow is asymmetric: the browser opens a WebSocket to the CP; the edge agent (on its next 30s tick) sees a pending_sessions[] entry in its check-in response and dials out to the CP to attach.

%%{init: {"sequence": {"actorFontSize": 18, "noteFontSize": 16, "messageFontSize": 16}}}%%
sequenceDiagram
    participant Browser
    participant CP as Control Plane
    participant Bridge as terminal-bridge.py<br/>(forked by edge agent)
    participant Host as /bin/login<br/>(via nsenter)
    Browser->>CP: POST /api/deploy/devices/pi/terminal
    CP-->>Browser: { session_id }
    Browser->>CP: WS /api/deploy/.../terminal/ws/<session_id>
    CP->>CP: insert terminal_sessions row (status=pending)
    Note over Bridge: Edge agent's 30s tick fires
    Bridge->>CP: POST /agent/check-in
    CP-->>Bridge: { pending_sessions: [<session_id>] }
    Bridge->>Bridge: fork terminal-bridge.py
    Bridge->>CP: WS /agent/.../terminal/ws/<session_id>
    CP->>CP: wire the two WS endpoints
    Bridge->>Host: nsenter -t 1 -m -p -u -i -- /bin/login
    loop Active session
        Browser->>CP: pty stdin (binary frames)
        CP->>Bridge: relay
        Bridge->>Host: write to pty master
        Host-->>Bridge: pty stdout
        Bridge-->>CP: binary frames
        CP-->>Browser: relay
    end

Per-session audit row: user, device, started/ended, bytes browser↔agent. No keystroke capture. Worst-case wait for a session to land is one poll cycle (default 30s); the UI shows a countdown chip so the operator sees the bound.
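
A sketch of what "wire the two endpoints and count bytes, but keep no content" could look like. asyncio queues stand in for the two WebSockets so the example runs on its own; relay, demo, and the audit dict keys are hypothetical names, and None is used as an end-of-stream marker.

```python
import asyncio
from typing import Dict, Optional


async def relay(
    src: "asyncio.Queue[Optional[bytes]]",
    dst: "asyncio.Queue[Optional[bytes]]",
    audit: Dict[str, int],
    direction: str,
) -> None:
    """Forward frames from src to dst, recording only their size."""
    while True:
        frame = await src.get()
        if frame is None:
            await dst.put(None)  # propagate end-of-stream to the other side
            return
        audit[direction] += len(frame)  # byte count only; the frame is not stored
        await dst.put(frame)


async def demo() -> Dict[str, int]:
    audit = {"browser_to_agent": 0, "agent_to_browser": 0}
    browser_out: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()
    agent_in: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()
    agent_out: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()
    browser_in: "asyncio.Queue[Optional[bytes]]" = asyncio.Queue()
    for frame in (b"ls\n", None):
        browser_out.put_nowait(frame)
    for frame in (b"file-a\nfile-b\n", None):
        agent_out.put_nowait(frame)
    # One relay task per direction, as with a pair of joined sockets.
    await asyncio.gather(
        relay(browser_out, agent_in, audit, "browser_to_agent"),
        relay(agent_out, browser_in, audit, "agent_to_browser"),
    )
    return audit


if __name__ == "__main__":
    print(asyncio.run(demo()))
```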

pasted image

The browser terminal modal showing the "waiting for agent… 12s / 30s" countdown before the session attaches.


What the LLM Sees (and Doesn't)

The trust story matters more than the architecture diagram. The boundary is enforced in code, not by prompting the model to behave.

| ✅ LLM has access to | ❌ LLM never sees |
|---|---|
| Metric / label / job names | API keys and other env-var credentials |
| Time-series summaries only (n, mean, p50, p95, min, max, ...) | Database password (DRIFT_PG_PASSWORD), Fernet key (DRIFT_SECRET_KEY) |
| Device + app metadata (names, tags, groups, statuses) | Registry credentials (encrypted at rest, TLS-direct to edge agent) |
| Alert rule names + PromQL expressions + labels | Alertmanager receiver secrets (filename references only) |
| Log lines returned by query_logs | Raw time-series arrays (dataRef pattern) |
| Compose file contents when explicitly fetched via get_app_revision | Web terminal pty bytes (dedicated WS, never the LLM) |
| | User passwords (passlib server-side; LLM has no read path) |

Three places where sensitive content briefly touches the chat surface, each with a workaround:

  • create_user / reset_user_password return a server-generated password ONCE in the tool result. The self-service sidebar flow for changing your own password keeps it off the chat entirely.
  • commission_device returns the device's bootstrap token in the curl line. It's the device's long-lived bearer credential for /agent/check-in, bound to a specific host on first use via a /etc/machine-id fingerprint (TOFU). Treat the rendered curl line as a secret: paste it on the intended host, then clear the investigation if you're sharing the workspace. If the curl leaks and someone pastes it on a different machine, the CP returns 409 fingerprint mismatch and the impostor agent can't update device state.
  • Pasting compose with literal secrets. If you type a secret into the prompt, the LLM sees it. Use ${VAR} references resolved on the device, or the registry-credentials modal (which bypasses the LLM).

Tool Calling Is the Extension Mechanism

Drift's agent has about 30 tools across discovery, query, analysis, fleet management, alert management, and render-block emission. Adding a new capability is:

  1. Write async def my_tool(ctx, args) in drift-agent/app/tools/<topic>.py.
  2. Add an entry to that file's *_TOOLS list (JSON Schema describing inputs).
  3. Register the handler in *_HANDLERS.

The agent picks it up on next request; the system prompt and tools list rebuild from the registries on import.
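
The three steps above might look like this for a made-up tool. Everything here is illustrative: count_devices, the EXAMPLE_TOOLS / EXAMPLE_HANDLERS names, the shape of ctx, and the input_schema key (borrowed from Anthropic-style tool definitions) are assumptions, not Drift's actual registry format.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List


# Step 1: the handler. ctx is assumed to carry whatever the tool needs.
async def count_devices(ctx: Dict[str, Any], args: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical tool: count devices, optionally filtered by tag."""
    tag = args.get("tag")
    matched = [d for d in ctx["devices"] if tag is None or tag in d["tags"]]
    return {"count": len(matched)}


# Step 2: the JSON Schema entry the model sees.
EXAMPLE_TOOLS: List[Dict[str, Any]] = [
    {
        "name": "count_devices",
        "description": "Count devices, optionally filtered by tag.",
        "input_schema": {
            "type": "object",
            "properties": {"tag": {"type": "string"}},
            "additionalProperties": False,
        },
    }
]

# Step 3: the name -> handler registration.
Handler = Callable[[Dict[str, Any], Dict[str, Any]], Awaitable[Dict[str, Any]]]
EXAMPLE_HANDLERS: Dict[str, Handler] = {"count_devices": count_devices}


async def dispatch(name: str, ctx: Dict[str, Any], args: Dict[str, Any]) -> Dict[str, Any]:
    """Look up a handler by tool name and await it."""
    return await EXAMPLE_HANDLERS[name](ctx, args)
```

Keeping the schema list and the handler map side by side in one file makes it easy to check that every advertised tool has a handler and vice versa.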

%%{init: {"themeVariables": {"fontSize": "18px"}}}%%
flowchart TB
    subgraph Registry["app/tools/ (registries)"]
        METRICS[metrics.py<br/>query_range, list_jobs,<br/>list_metric_names, ...]
        ANALYSIS[analysis.py<br/>detect_anomalies,<br/>find_correlations, ...]
        DEPLOY[deploy.py<br/>list_devices, deploy_revision,<br/>tag_device, fork_app, ...]
        ALERTS[alerts.py<br/>propose_alert_rule, silence_alert,<br/>upsert_receiver, set_route, ...]
        EMIT[emit.py<br/>make_chart, make_table,<br/>make_metric, make_timeline, ...]
        LOGS[logs.py<br/>query_logs]
    end
    AGENT["agent.py: all_tools + all_handlers"]
    METRICS --> AGENT
    ANALYSIS --> AGENT
    DEPLOY --> AGENT
    ALERTS --> AGENT
    EMIT --> AGENT
    LOGS --> AGENT
    AGENT -->|cached prefix| LLM[LLM via LiteLLM]

The whole prefix (system prompt + tools list) is wrapped in cache_control: ephemeral, so every turn after the first reads it from cache. Two design rules keep this working:

  • Time-series data never flows back to the LLM through a tool result. Use a dataRef and a digest.
  • Render blocks come from emit tools, not text parsing. The agent doesn't write RenderBlock JSON in its message text; it calls make_chart / make_table / make_markdown, and the tool's JSON Schema validates structure. There is no JSON-from-text parsing path.

The system prompt and tools list must be byte-stable across calls because the cache marker covers the entire prefix. A datetime.now() interpolated into the system prompt, a tools list iterated from a set, a per-request UUID: any one of these silently invalidates the cache and can 10× your bill.
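
One way to keep a prefix byte-stable is to serialize it deterministically and compare a hash between requests. This is a sketch of that idea, not Drift's code; stable_prefix and prefix_fingerprint are hypothetical helpers.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def stable_prefix(system_prompt: str, tools: Iterable[Dict[str, Any]]) -> bytes:
    """Serialize prompt + tools so equal inputs always yield identical bytes."""
    # Sort tools by name so set/dict iteration order cannot leak into the output.
    ordered = sorted(tools, key=lambda t: t["name"])
    doc = {"system": system_prompt, "tools": ordered}
    # sort_keys and fixed separators remove key-order and whitespace variation.
    return json.dumps(doc, sort_keys=True, separators=(",", ":")).encode("utf-8")


def prefix_fingerprint(prefix: bytes) -> str:
    """Hash of the prefix; log it per request to spot accidental cache busting."""
    return hashlib.sha256(prefix).hexdigest()
```

If the fingerprint changes between two requests that should share a prefix, something non-deterministic has crept in.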


Bring Your Own Model (via LiteLLM)

The agent loop is provider-agnostic. Drift talks to the LLM through LiteLLM, which exposes a unified API across Anthropic, OpenAI, Google, Azure, AWS Bedrock, Ollama, vLLM, and a long tail of others. The MODEL=... env var picks the provider; LiteLLM handles the wire format underneath.

flowchart LR
    subgraph Drift["drift-agent"]
        LOOP[Agent loop<br/>tool dispatch<br/>SSE generator]
        LITELLM[LiteLLM<br/>unified completion API]
        LOOP --> LITELLM
    end
    LITELLM -->|claude-*| ANTHROPIC[Anthropic API]
    LITELLM -->|gpt-*, o-*| OPENAI[OpenAI API]
    LITELLM -->|gemini-*| GOOGLE[Google AI / Vertex]
    LITELLM -->|bedrock-*| BEDROCK[AWS Bedrock]
    LITELLM -->|ollama/...| LOCAL[Ollama / vLLM<br/>local inference]

The installer prompts for whichever API key matches the model you pick: MODEL=claude-opus-4-7 pairs with ANTHROPIC_API_KEY, MODEL=gpt-5 with OPENAI_API_KEY, MODEL=gemini-2.5-pro with GEMINI_API_KEY, and so on. Swap providers by editing .env and restarting the agent, or from the web UI if you are an admin. The frontend doesn't know which model is running.
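
The model-to-key pairing can be sketched as a prefix lookup. This mapping is illustrative only: it covers just the three pairings named above, treats anything else (for example a local ollama/... model) as needing no key, and is not how LiteLLM or the installer resolves credentials internally.

```python
from typing import Optional, Tuple

# (model-name prefix, env var holding the matching API key); illustrative subset.
KEY_ENV_BY_PREFIX: Tuple[Tuple[str, str], ...] = (
    ("claude-", "ANTHROPIC_API_KEY"),
    ("gpt-", "OPENAI_API_KEY"),
    ("gemini-", "GEMINI_API_KEY"),
)


def required_key_env(model: str) -> Optional[str]:
    """Return the env var name to prompt for, or None if no key is assumed needed."""
    for prefix, env_name in KEY_ENV_BY_PREFIX:
        if model.startswith(prefix):
            return env_name
    return None
```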

What stays constant when you swap models:

  • The SSE protocol the frontend consumes (same events, same shapes).
  • The tool catalog (same names, same JSON schemas).
  • The propose-then-apply pattern, the dataRef pattern, the streaming UX.

What varies:

  • Reasoning quality on long tool-use loops. This is where the most capable models pay for themselves. Drift's investigations routinely chain 8 to 15 tool calls; a model that loses focus partway costs you accuracy.
  • Cost per investigation. Smaller models (Claude Haiku, Gemini Flash, GPT mini tiers) can be 10 to 50× cheaper. Often good enough for read-only "observe" work; less reliable for the "deploy" and "respond" pillars where mistakes have side effects.
  • Prompt-caching behavior. Anthropic and OpenAI both support prompt caching, with different semantics. LiteLLM abstracts the API call but the cache-hit economics still depend on the provider. Drift's cache prefix is byte-stable by design (see above) so cache hits are reliable when the provider supports them.
  • Thinking / reasoning modes. Opus and o-series surface explicit reasoning streams; smaller models don't. The frontend renders the thinking events when they arrive and degrades gracefully when they don't.

Default is claude-opus-4-7 because it's the most reliable on long tool-use loops with the propose-then-apply discipline Drift relies on. But if you've already got an OpenAI subscription, or you want to run everything against a local Ollama instance for an air-gapped install, that's a one-line .env change.


The CP Installer

The whole control plane comes up via a single guided script. deploy/install.sh does three things: prompts the operator for choices, renders config templates with those answers, and brings the docker-compose stack up.

What it prompts for:

  • Use Caddy for TLS? Yes by default. Decline and install.sh skips rendering the Caddyfile and drops the caddy service from the generated docker-compose.yml.
  • Public domain + email for Let's Encrypt registration (only relevant when Caddy is enabled).
  • Drift admin username + password. Bootstrapped on first agent start.
  • vmalert / Alertmanager UI password. Basic-auth credential for /vmalert/ and /am/, separate from the Drift admin login.
  • LLM model + matching API key. ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, etc., wired up via LiteLLM.
  • ntfy topic for Alertmanager push notifications.
  • B2 / S3 credentials (optional; only needed for Drift Deploy's bundle uploads at fleet scale).
  • Self-scrape labels (hostname + group used for this host's own metrics).

What it auto-generates the first time and preserves on every rerun:

  • DRIFT_SECRET_KEY (Fernet, used to encrypt registry credentials at rest).
  • DRIFT_PG_PASSWORD.
  • REPORTER_PASSWORD (vmauth credential for the self-scrape vmagent).
  • Drift admin password and vmalert/AM UI password, if the operator pressed Enter at the prompt instead of supplying one.

What gets rendered from templates:

| Template | Output | Used by |
|---|---|---|
| config/Caddyfile.tmpl | Caddyfile | caddy |
| config/auth.yml.tmpl | auth.yml | vmauth |
| config/alertmanager-ntfy.yml.tmpl | alertmanager-ntfy.yml | alertmanager-ntfy |
| config/grafana.ini.tmpl | grafana.ini | grafana |

Each template is a plain file with ${VAR} placeholders. install.sh exports the prompt answers, then envsubsts each template into its final form. No Helm, no Jinja, no extra runtime dependencies.
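
For readers who want to see the substitution step in isolation, Python's string.Template handles the same ${VAR} syntax. This is an analogy, not what install.sh runs (it uses envsubst). One behavioral difference worth noting: envsubst replaces an unset variable with an empty string, while Template.substitute raises KeyError, which is arguably the safer failure mode for config files. The template text below is made up for the example.

```python
from string import Template
from typing import Mapping


def render_template(text: str, values: Mapping[str, str]) -> str:
    """Replace ${VAR} placeholders; raises KeyError if a placeholder has no value."""
    return Template(text).substitute(values)


if __name__ == "__main__":
    caddyfile_tmpl = "${DOMAIN} {\n  tls ${EMAIL}\n}"  # illustrative, not Drift's template
    print(render_template(caddyfile_tmpl, {
        "DOMAIN": "drift.example.com",
        "EMAIL": "ops@example.com",
    }))
```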

Reruns are idempotent. Re-running install.sh lets you rotate any prompted secret (press Enter at the prompt to keep the current value), change the model + API key, switch ntfy topics, toggle Caddy on or off, or fix anything that was misconfigured on the first run. Auto-generated secrets stay stable across reruns. A sidecar copy of .env at /var/lib/drift-cp/.env (owned root:root, mode 600) persists state outside the repo directory, so even a docker compose down -v followed by a fresh git clone reseeds the same credentials when you re-run install.sh.
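
The "keep what exists, generate what's missing" rule is what makes reruns safe. A minimal sketch, assuming a hypothetical merge_env function that takes the prior sidecar state and the operator's answers (an empty answer standing for "pressed Enter"). The secret generators are stand-ins; only the Fernet key's shape (32 random bytes, URL-safe base64) follows a known format.

```python
import base64
import os
import secrets
from typing import Callable, Dict, Mapping


def _fernet_shaped_key() -> str:
    # 32 random bytes, URL-safe base64 encoded: the format Fernet keys use.
    return base64.urlsafe_b64encode(os.urandom(32)).decode("ascii")


# Secrets generated on first run only; generators here are illustrative.
AUTO_SECRETS: Dict[str, Callable[[], str]] = {
    "DRIFT_SECRET_KEY": _fernet_shaped_key,
    "DRIFT_PG_PASSWORD": lambda: secrets.token_urlsafe(32),
    "REPORTER_PASSWORD": lambda: secrets.token_urlsafe(32),
}


def merge_env(prior: Mapping[str, str], answers: Mapping[str, str]) -> Dict[str, str]:
    """Combine prior state with new answers without disturbing existing secrets."""
    env = dict(prior)
    for key, value in answers.items():
        if value != "":  # empty answer: keep the current value
            env[key] = value
    for key, generate in AUTO_SECRETS.items():
        if not env.get(key):  # only fill in what is missing
            env[key] = generate()
    return env
```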

%%{init: {"themeVariables": {"fontSize": "18px"}}}%%
flowchart LR
    A[Read sidecar<br/>/var/lib/drift-cp/.env<br/>for prior state] --> B[Prompt operator<br/>defaults from prior state]
    B --> C[Auto-generate<br/>missing secrets]
    C --> D[Write .env<br/>chown root:docker, mode 640]
    D --> E[envsubst templates<br/>Caddyfile, vmauth,<br/>alertmanager-ntfy, grafana]
    E --> F[docker compose<br/>up -d --build]
    F --> G[Poll /healthz<br/>up to 90s]
    G --> H[Mirror .env<br/>to sidecar]

A trap on EXIT prints the recoverable state if anything goes sideways mid-run: port conflicts, missing Docker daemon, DNS not yet resolving for Let's Encrypt. Fix the underlying issue and re-run install.sh; it picks up from the prior state instead of starting over.


The Update Model

Drift ships as two parts: the docker images (drift-agent, drift-frontend) and the surrounding install bundle (install.sh, docker-compose.yml, config/*.tmpl). They move on different cadences, so releases come in two kinds:

%%{init: {"themeVariables": {"fontSize": "18px"}}}%%
flowchart TD
    DEV[Code change committed]
    DEV --> KIND{What changed?}
    KIND -->|Only Python or SPA code| IMAGE[Image-only release<br/>vX.Y.Z]
    KIND -->|install.sh, compose,<br/>config templates| BUNDLE[Bundle release<br/>vX.Y.Z]
    IMAGE --> PUSH1[Push :vX.Y.Z + :latest<br/>to GHCR]
    PUSH1 --> REL1[GitHub Release<br/>NO tarball asset]
    REL1 --> OP1["Operator: click<br/>'Update now' in modal"]
    OP1 --> APPLY1[docker compose pull<br/>+ up -d --no-deps]
    APPLY1 --> DONE1[Running new version]
    BUNDLE --> PUSH2[Push :vX.Y.Z + :latest<br/>+ build tarball]
    PUSH2 --> REL2[GitHub Release<br/>WITH tarball asset]
    REL2 --> OP2["Operator: curl | tar |<br/>install.sh on the host"]
    OP2 --> APPLY2[install.sh re-renders<br/>configs, runs compose up]
    APPLY2 --> DONE2[Running new version<br/>+ new bundle on disk]

The differentiator is whether a tarball is attached to the GitHub release. No metadata schema, no flag; just "tarball present means bundle release." The Software Updates modal in the UI reads this from the GitHub Releases API and tailors the apply path accordingly. Image-only updates get an "Update now" button; bundle updates get a warning Alert with a "View release" link (because re-running compose up against an old compose file would silently miss new services or mounts).
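
The classification rule fits in one function. The sketch assumes a release object shaped like the GitHub Releases API response, which carries an assets list whose entries have a name field; treating "tarball" as "an asset whose name ends in .tar.gz" is an assumption, as is the function name.

```python
from typing import Any, Dict


def release_kind(release: Dict[str, Any]) -> str:
    """Classify a release as 'bundle' or 'image-only' from its attached assets."""
    assets = release.get("assets") or []
    has_tarball = any(
        str(asset.get("name", "")).endswith(".tar.gz") for asset in assets
    )
    return "bundle" if has_tarball else "image-only"


if __name__ == "__main__":
    print(release_kind({"tag_name": "v0.1.41", "assets": [{"name": "drift-deploy-0.1.41.tar.gz"}]}))
    print(release_kind({"tag_name": "v0.1.42", "assets": []}))
```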

pasted image
The Software Updates modal. The chip shows running → latest version; the Edge Agents subsection lists each device's reported AGENT_VERSION.


What the Constraints Rule Out

The same constraints that shape Drift also exclude a lot of common shapes:

  • No PaaS-style "give us your code". Apps stay on your CP and your devices.
  • No per-device daemon you upgrade by hand. The agent script self-updates.
  • No log-aggregator-as-a-service. Vector + VictoriaLogs run on your box.
  • No "let the LLM read all your data" RAG. Telemetry flows through tools, not into the prompt.
  • No listening sockets on target devices. Everything is poll-out.

What you get instead is a small, opinionated stack: an HTTPS reverse proxy (Caddy by default, optional) + the Drift CP + a TSDB on a single Linux box. Your devices, your data, your model key. Bring-your-own model via LiteLLM (Claude Opus 4.7 is the default; MODEL=... picks any Anthropic, OpenAI, Google, Bedrock, or local-Ollama target). Multi-tenant from day one via RBAC + per-group scoping. Audit log for terminal sessions. Encrypted-at-rest registry credentials shipped TLS-direct to the edge.


Try It

Drift source is Apache 2.0, on GitHub.

The single-server bundle installs everything (Drift CP + VictoriaMetrics + VictoriaLogs + vmalert + Alertmanager + Grafana + Caddy/TLS + Deploy Agent) on one Linux host with a guided installer:

  VERSION=v0.1.41
  curl -L "https://github.com/Scope-Creep-Labs/drift/releases/download/${VERSION}/drift-deploy-${VERSION#v}.tar.gz" | tar -xz
  cd "drift-deploy-${VERSION#v}"
  ./install.sh

See "The CP Installer" above for the full prompt list and the template-rendering flow. Re-running the script is safe (idempotent, secrets preserved); first-run takes a few minutes for image pulls and the initial build.

For a deeper dive into the agent loop, the SSE protocol, the tool catalog, and the extension points, see ARCHITECTURE.md. For the deploy subsystem specifically, see DEPLOY.md. For the alerting subsystem, ALERTING.md.

If you're running a homelab, a small edge fleet, or just a few Docker hosts that have outgrown ssh + tmux but don't justify Kubernetes, give it a shot. We'd love to hear what breaks.