The kernel that doesn't believe the agents — a domain-free trust substrate for fleets of autonomous agents: verify what shipped, arbitrate collisions, refuse with structured reasons.

DOS — the Dispatch Operating System

Catch your AI agents when they lie about what they shipped.

PyPI · Python versions · CI · verified by DOS · commit-claims · License: MIT

The whole pitch in one recording: the agent claims two features shipped; git backs one. dos verify answers from the commits, the lie exits 1, and a gate on that exit code refuses the false "done". Every line is the real CLI's verbatim output — scripts/build_caught_lie_cast.py re-records it whenever the output changes.

Run a fleet of agents on one repo. The left loop just feels like progress; the right one you can steer. The only difference is a verdict DOS reads from the real world — here, git — never the agent's word.

An AI agent will tell you it finished. DOS checks the real world instead of taking its word — and the nearest piece of the real world is your git history.

An agent says it shipped the login endpoint. Did it? Run one command, dos verify, and it answers from the artifacts the work actually left behind, not from what the agent typed. If a commit backs the claim, you get SHIPPED and exit code 0. If nothing landed, you get NOT_SHIPPED and exit code 1. The agent's story never enters into it. (Git is just the first witness DOS reads; the file tree, the clock, a CI status, a test environment's own state are others — anything the agent didn't author.)

dos verify AUTH AUTH1   # → SHIPPED      AUTH AUTH1 e62f74d   (exit 0)
dos verify AUTH AUTH2   # → NOT_SHIPPED  AUTH AUTH2           (exit 1)

That's the smallest version. It scales up, too: point a dozen agents at one repo — in CI, in a fleet, racing on the same files — and DOS also tells you which ones are stepping on each other, which one is spinning in circles, and which claim of "done" is real. Every answer comes from the artifacts (git, the file tree, the clock), never the narration. It works on a plain git repo with zero config, and the only thing you ever install is one small Python package.

⏱️ Want to try it right now? Jump to Try it in 60 seconds — one command, real output, then come back for the why.

Or just add it — two commands, zero decisions. From the repo where your agent works:

pip install dos-kernel
dos init --hooks auto   # finds the agent runtime(s) you already use, wires in the checks

From then on: your agent can't tell you "done" unless the work actually landed, two agents can't silently overwrite each other's files, and a run that stalls gets flagged instead of quietly spinning. Nothing about your workflow changes, and you don't need to learn any of the vocabulary below to be covered. dos init prints the one config file it wrote; deleting the dos hook entries there undoes it. (No runtime detected? It says so and lists the names to pick from — it never guesses.)

v0.26.0 · 3900+ tests · CI: Python 3.11–3.13 on Linux + a Windows 3.13 smoke run · the only runtime dependency is PyYAML · MIT.

🧭 Want it in plain words first? What DOS is, what it catches, and what adopting it costs — no code: the plain-words version, just below.

🧭 Or route yourself: the page runs shallow → deep, and Who this is for matches the question you brought to the section that answers it.

Reading this as an AI agent? Start with AGENTS.md — a short orientation written for you: what DOS is in three lines, how to build/test/check your work, the ~5 files actually worth reading, and the architecture rules a change must satisfy.

🔤 Five words the rest of this page leans on. A plan is a named goal (AUTH); a phase is one shippable step of it (AUTH1); a lane is the slice of the file tree one agent may touch; the oracle is the part of DOS that reads the evidence and rules; a stamp is the mark a shipped phase leaves in a commit subject (AUTH1: …) — the thing the oracle greps for. That's the whole vocabulary.

Who this is for

This README runs shallow to deep — try it, see the failure it fixes, audit the evidence, wire it in, extend it. You don't have to read it in that order. Find the question you arrived with and jump; the rows route by the question, not your job title, and every section hands off to the next level up.

| You're asking… | Start at | Then |
| --- | --- | --- |
| "What is this, in plain words, and why should my team care?" | the plain-words version, just below — no code | hand the 60-second demo to whoever runs your agents |
| "Show me it working, fast." | Try it in 60 seconds | what goes wrong in a fleet without it |
| "I already run agents — how do I wire the verdict into my stack?" | How you plug it in | the MCP lie detector · Install |
| "I run a fleet every day — how do I watch it, triage it, debug it?" | Operating a fleet | Three live projections · Debug a stuck fleet |
| "How do I bend it to my org without forking it?" | Hacking it | docs/HACKING.md — the seven extension axes |
| "What is actually proven here — and can I re-run it?" | For researchers — claims → invariants → reproduction | What's proven and what's still a bet · Citation |

(The seventh reader — an AI agent orienting itself in this repo — already has its own front door: AGENTS.md, per the note above.)

The plain-words version

A coding agent does some work, then tells you how it went. Usually the story is true. Sometimes it isn't — the cheerful "all work completed!" from a worker that actually shipped nothing is the single most common failure in agent fleets. With one agent you catch that yourself, because you read its work before trusting it.

Run twenty agents at once and nobody reads everything. Each worker grades its own homework, you believe the reports because what else is there to go on, and the unchecked problems pile up quietly — a false "done" here, two agents overwriting the same file there, one worker spinning in circles burning money. None of it is loud. The codebase ends up sorta working, and nobody can safely change it.

DOS is the referee. It's a small, deterministic program that never reads the agent's story; it reads what actually happened — the commit, the file, the clock — and hands you a verdict. An agent says "done"? DOS checks whether the work really landed in your repo's history. An agent says "making progress"? DOS checks whether anything real has changed. Two agents head for the same files? DOS admits one and refuses the other, with a reason a machine can act on. Every verdict is computed from artifacts the agent didn't author, so no amount of confident narration can move it.

Nothing about it is coding-specific, and it imposes no framework. Your repo declares its own rules — which file regions each agent may touch, how a finished unit of work signals "done" — as data in one small config file, and DOS supplies only the machinery. You reach it through small, do-one-thing commands, through the agent host you already run, or straight from Python. And it stays in its lane: it tells you reliably what happened, never whether the committed code is good — quality stays with your tests, your reviews, and you.

Adopting it costs one engineer about an afternoon: one small Python package (one runtime dependency), one optional config file — and it works on day one against a plain git repository with neither. If your team is about to go from one agent to many, the missing piece is usually not a smarter agent. It's a referee that doesn't believe any of them.

Convinced enough to watch it work? Try it in 60 seconds is one command — or hand this page to whoever runs your agents.

Try it in 60 seconds

Got a terminal? This runs the whole thing in a throwaway repo — one command scaffolds it, makes a real commit, verifies it, and cleans up after itself:

pip install dos-kernel      # PyYAML is the only runtime dep
dos quickstart              # → SHIPPED AUTH AUTH1 … then NOT_SHIPPED AUTH AUTH2

One SHIPPED, one NOT_SHIPPED: the first is a claim git can back, the second is a claim nothing landed for. That contrast is the product. The demo closes with a router to wherever you already run agents — a Claude Code / Cursor tab (dos init --hooks), an MCP host, a CI step, or a fleet — so your next move is one line, not a docs dig. (Add --keep ./demo to keep the repo and poke at it. Don't even want the install? uvx --from dos-kernel dos quickstart runs the same demo ephemerally — nothing left behind.)

Prefer to watch the gears turn? The same thing, by hand, in a few lines:

A plan (AUTH) groups phases (AUTH1, AUTH2); dos verify takes <plan> <phase>, and a commit whose subject starts AUTH1: is what stamps that phase shipped.

mkdir hello-dos && cd hello-dos
dos init .                                       # writes one dos.toml
git init -q
git config user.email you@example.com            # placeholder address; skip if you have a global git identity
git config user.name  "You"
echo 'def login(): ...' > login.py
git add -A
git commit -m "AUTH1: ship the login endpoint"   # stamp AUTH1 shipped: <PHASE-ID>: <message>

dos verify --workspace . AUTH AUTH1   # → SHIPPED     AUTH AUTH1 <your-sha> (via grep-subject)  exit 0
dos verify --workspace . AUTH AUTH2   # → NOT_SHIPPED  AUTH AUTH2            (via none)          exit 1

An agent can claim AUTH2 is done all day long; verify just reports what the artifacts say — and they say it isn't. The via grep-subject / via none tag tells you how it knows: it found the phase token in a commit subject, or it found it nowhere. The full walkthrough is in docs/QUICKSTART.md.
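To make the grep-subject rung concrete, here is a minimal sketch of the idea in Python. It is an illustration only, not the kernel's code: it assumes the stamp rule is exactly "the commit subject starts with `<PHASE-ID>:`", and the function name `verdict_from_subjects` and the tuple it returns are hypothetical.

```python
from typing import Iterable, Optional, Tuple

def verdict_from_subjects(
    commits: Iterable[Tuple[str, str]], phase: str
) -> Tuple[str, Optional[str], str]:
    """Scan (sha, subject) pairs for a subject stamped '<phase>:'.

    Returns (verdict, sha, via). Assumption: a stamp is the phase ID
    followed directly by a colon at the start of the subject.
    """
    stamp = phase + ":"
    for sha, subject in commits:
        if subject.startswith(stamp):
            return ("SHIPPED", sha, "grep-subject")
    # No commit carries the stamp, so the claim has no witness.
    return ("NOT_SHIPPED", None, "none")

# In-memory stand-in for `git log --format='%h %s'`.
history = [
    ("e62f74d", "AUTH1: ship the login endpoint"),
    ("0a1b2c3", "AUTH10: unrelated later phase"),
]
print(verdict_from_subjects(history, "AUTH1"))
print(verdict_from_subjects(history, "AUTH2"))
```

Matching on the trailing colon is what keeps AUTH1 from being satisfied by an AUTH10 commit in this sketch; the real oracle may apply a richer ship grammar.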

Two equally confident claims, one verdict each — SHIPPED for the one git can back, NOT_SHIPPED for the one nothing landed for. Every string is verbatim output of examples/demo/verify_demo.sh. Step through it locally for the click-through version (it's an HTML file — clone the repo and open it in a browser; GitHub shows its source, not the running page).

The smallest real win: in a CI step or dispatch loop, replace the line that trusts an agent's "done" with dos verify PLAN PHASE and branch on its exit code (0 shipped / 1 not). No parsing, no plan, no config — the CI integration cookbook walks it end-to-end. To run it on a repo shaped like yours, start with Onboard a repo in 10 minutes.
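A dispatch loop written in Python can branch on that exit code with nothing but the standard library. The sketch below shells out to `dos verify PLAN PHASE` and treats exit 0 as shipped and anything else as not shipped. The helper names `claim_is_backed` and `gate` are hypothetical, and the injectable `run` parameter exists only so the logic can be exercised without the CLI installed.

```python
import subprocess
import sys
from typing import Callable, Optional, Sequence

Runner = Callable[[Sequence[str]], int]

def _run_cli(argv: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(argv)).returncode

def claim_is_backed(plan: str, phase: str, run: Optional[Runner] = None) -> bool:
    """True only when `dos verify PLAN PHASE` exits 0 (shipped)."""
    runner = run or _run_cli
    return runner(["dos", "verify", plan, phase]) == 0

def gate(plan: str, phase: str, run: Optional[Runner] = None) -> int:
    """Return 0 to let the pipeline continue, 1 to refuse the 'done'."""
    if claim_is_backed(plan, phase, run):
        print(f"{phase}: backed by a commit, landing it")
        return 0
    print(f"{phase}: no commit backs the claim, re-dispatching")
    return 1

if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python gate.py AUTH AUTH1
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```

The agent's own report never appears as an input; the only thing the gate reads is the exit code.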

Next level up — wire the verdict into your own stack: How you plug it in.

What goes wrong in a fleet

Run a pile of agents at once with nobody refereeing, and here's how it goes: each worker reports its own success, and you believe the reports, because what else is there to go on? The unchecked problems pile up quietly — a lie here, two agents clobbering the same file there, a little scope creep, one worker spinning in circles — until the codebase sorta works and nobody can safely change it.

The trouble is you launched the agents and then let them grade their own homework. DOS gives you the missing signal — a verdict from ground truth — so the loop closes. Here is the same fleet under both regimes:

The two regimes as a flowchart — NO REFEREE: you believe the narration; DOS ADJUDICATES: you steer on a verdict
flowchart LR
  subgraph OPEN["NO REFEREE — you believe the narration"]
    direction TB
    A1["agent: 'done!'"] --> B1[["believed"]]
    A2["agent: 'done!'"] --> B1
    A3["agent: 'done!'"] --> B1
    B1 --> C1["silent corruption piles up<br/>(lies · collisions · spin)"]
    C1 --> D1["'sorta works' — can't be changed"]
  end
  subgraph CLOSED["DOS ADJUDICATES — you steer on a verdict"]
    direction TB
    A4["agent: 'done!'"] --> V{{"dos verify<br/>reads git"}}
    V -->|in git ancestry| S["SHIPPED (exit 0)"]
    V -->|found nowhere| N["NOT_SHIPPED (exit 1)"]
    S --> L["land it"]
    N --> R["re-dispatch / flag — caught"]
    R -.verdict steers the loop.-> A4
  end

Here are the failures a fleet actually produces, each next to the ground truth that quietly contradicts the worker's story — and the verdict DOS hands back:

| A worker… | …but the ground truth is | DOS verdict |
| --- | --- | --- |
| says it shipped a unit of work | no commit ever landed | verify → caught lie |
| tried, but the commit silently failed | no commit ever landed | verify (the flake — indistinguishable from a lie without git) |
| edits files another worker owns | two agents, one shared file | arbitrate → refuse the second |
| overruns the file region it claimed | footprint reaches beyond the declared tree | scope-gate → REFUSE (before the write lands) |
| reports "making progress" | 0 commits, only a fresh heartbeat | liveness → SPINNING |

The first row is the most common one. The classic tell is a cheerful one-liner, "all work completed!", from a worker that did little or nothing. DOS never reads that line; it reads the ground truth, so the claim collapses the instant no artifact backs it (more in docs/108). That's also what makes it cheap to adopt: verify needs no plan, no registry, no config, and the exit code is the verdict — any shell or CI step can branch on it without parsing a word.
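The collision and scope rows of that table come down to path arithmetic. A minimal sketch, assuming a lane is a list of path prefixes and that two lanes collide when one prefix contains the other. The function names, the ADMIT label, and the reason text are illustrative, not the kernel's API:

```python
from pathlib import PurePosixPath
from typing import Iterable, Tuple

def _parts(path: str) -> Tuple[str, ...]:
    return PurePosixPath(path).parts

def prefixes_overlap(a: str, b: str) -> bool:
    """Two path prefixes overlap when one contains the other, component-wise."""
    pa, pb = _parts(a), _parts(b)
    n = min(len(pa), len(pb))
    return pa[:n] == pb[:n]

def arbitrate_sketch(held: Iterable[str], requested: Iterable[str]) -> Tuple[str, str]:
    """Admit a requested lane only if it is disjoint from every held lane."""
    held = list(held)
    for want in requested:
        for have in held:
            if prefixes_overlap(want, have):
                return ("REFUSE", f"{want} collides with held lane {have}")
    return ("ADMIT", "")

def within_lane(lane: Iterable[str], touched: Iterable[str]) -> bool:
    """True when every touched file sits under some declared lane prefix."""
    lane_parts = [_parts(p) for p in lane]
    return all(
        any(_parts(f)[: len(lp)] == lp for lp in lane_parts) for f in touched
    )

print(arbitrate_sketch(["src/auth"], ["src/auth/login.py"]))
print(arbitrate_sketch(["src/auth"], ["src/authz"]))
print(within_lane(["src/auth"], ["src/auth/login.py", "README.md"]))
```

Comparing whole path components rather than raw strings is a deliberate choice here: `src/auth` and `src/authz` share a string prefix but are different trees.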

Prefer to watch it move? The two loops are also a self-contained animation you step through one frame at a time — clone the repo and open docs/assets/loop_visual.html in a browser. (It's an HTML file, so GitHub shows its source rather than running it — open it locally.)

How far you take it

It works on a plain git init with zero config, and gets smarter the more you tell it. You don't adopt a framework and pick a tier; you start at the shallow end and it keeps paying off as you wade deeper — the same kernel the whole way:

  • Zero config. Point dos verify PLAN PHASE at a plain git repo — no plan, no registry, no dos.toml. It answers from commit history alone (via grep-subject / via none). This is the whole of QUICKSTART and the day-one CI win above.
  • Tell it your structure. dos init writes a dos.toml (lanes, paths, ship grammar as data); add a plan doc and dos plan lays each phase's claim beside the oracle's verdict. Here's exactly what a plan file looks like (copyable, round-trips with the built-in reader), and four worked example workspaces.
  • Teach it your own types. Declare your own block reasons, gate verdicts, output renderers, admission predicates, a model-backed judge, a custom plan dialect, or a whole host driver — all as workspace policy, never a fork. The map is docs/HACKING.md (seven extension axes) + the copy-me examples/dos_ext/.

How you plug it in

That slope is how deep your config goes. The other axis is how you call the referee at all — and you adopt through whichever surface matches how you already work, not by restructuring your stack. The same kernel verdicts are reachable through every row here, lowest-friction first:

| Surface | Adopt it when… | The move |
| --- | --- | --- |
| MCP server | you drive an agent through an MCP host (Claude Desktop, Cursor, Cline, an Agent-SDK app) | add one line to the host config ({ "command": "dos-mcp" }) and ask the agent to dos_verify its own last claim — zero code. The advisory path (the agent asks). See Give your agent a lie detector. |
| Runtime hooks | you run an agent loop (Claude Code, Cursor, Codex CLI, Gemini CLI) and want the verdict to act, not just be available | dos init --hooks <runtime> wires the verdict into that host's own hook config — a refused call is denied before it runs, a false "done" is refused. The enforcement path (the host denies). One command, no hand-edited YAML. See QUICKSTART + docs/221. |
| CLI exit-code | you have a shell pipeline or CI step that trusts an agent's "done" | replace that step with dos verify PLAN PHASE and branch on the exit code (0 shipped / 1 not) — the verdict is the exit code. The day-one win above. |
| Python API | your dispatcher/orchestrator is already Python | import dos and call the pure syscalls (dos.oracle.is_shipped, dos.arbiter.arbitrate, …) — state-in / verdict-out, no subprocess. The Python cookbook. |
| Fleet framework | your fleet already runs on LangGraph, CrewAI, AutoGen, or the OpenAI/Claude Agents SDK | bolt the referee onto the framework's own seam — a referee node, a termination condition only git can satisfy, an output guardrail with a git tripwire. One function, no rewrite; every seam executed against the real framework. The fleet-framework cookbook. |
| Swarm runtime | your agents run on Hermes, OpenClaw, or a SwarmClaw-style autonomous swarm — privileged tools, shared memory docs / task boards, and no lock manager for either | drop a two-function adapter into the tool-execution loop: guard_action refuses an arbitrary-exec command before it runs, and acquire_lease / release_lease bracket each shared-state write so the lost update never lands. No import dos — it shells the CLI; Hermes' pre_tool_call hook also speaks DOS natively (dos hook pretool --dialect hermes). The runnable, A/B-measured Hermes / OpenClaw worked example + docs/278. |
| Skill pack | you run agents in Claude Code and want the workflow, not just the verdict | dos init --skills drops editable SKILL.md screenplays that wire the syscalls into a snapshot → audit → gate → take-a-lane loop. See QUICKSTART §2. |
| Driver | your lanes must be computed, or you add a provider-backed judge | write one dos/drivers/<host>.py (a LaneTaxonomy + a config factory), loaded by name, never imported by the kernel. The map is HACKING.md. |

The two axes are independent: a zero-config repo can adopt through any surface, and a deeply-configured one still answers over the same CLI and MCP tools. Start at the top row — it's the one that costs nothing to try. The first two rows also compose: MCP advises (the agent checks its own work), hooks enforce (the host stops a bad action) — wire both for the full loop.

Those surfaces are the upstream half of the value chain — who calls the referee. The same verdicts also flow downstream, to the systems that act on them: every adjudication lands in a verdict journal that dos export drains to your observability stack (Datadog / Honeycomb / Grafana — docs/266), dos notify pushes what-needs-a-human to Slack, dos reward gates what a fine-tune may train on, and dos attest mints a signed receipt a skeptic can check without loop access (docs/246). One kernel, one verdict vocabulary, from the agent's tool call to your dashboard.

Next level up — run it every day: Operating a fleet.

Why not just run N agents?

Fair question — why add a referee at all? Because N agents with no referee is that open loop again: you launch them, they self-report, and you've got nothing solid to steer on. DOS hands you that missing signal. Specifically, it gives you sensors

  • verify — did it really ship? (from git, not the agent's word)
  • liveness — is it ADVANCING, or just SPINNING / STALLED?
  • scope-gate — did it stay in its lane? A binding pre-effect gate (dos scope-gate, ALLOW/REFUSE, exit 0/5/6) over the same dos.scope classifier that also reports post-hoc.

— and actuators: arbitrate (let this lane in, or refuse the collision) and refuse (say no with a reason a machine can act on). Together they turn a pile of workers into something you can actually drive. The kernel's job is the signal, but it also ships a reference supervisor to show what you do with it: dos watch checks liveness on each tracked run and proposes a halt when one spins or blows its budget — it recommends, it never pulls the trigger — and dos loop keeps N dispatch-loops alive. Use those, or build your own on the same signal. Either way, it's the difference between "I launched 20 sessions and I'm hoping" and "I can see which two are lying and which one is wedged."
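If you build your own supervisor on the same signal, the shape is small. The sketch below classifies a run from two artifacts (new commits and heartbeat age) and then recommends, without acting. It rests on stated assumptions: ADVANCING means at least one new commit, SPINNING means no commits but a fresh heartbeat, and STALLED means no commits and a stale heartbeat. The 300-second freshness window, the budget units, and both function names are hypothetical; the kernel's real thresholds and rules may differ.

```python
def liveness_sketch(
    new_commits: int, heartbeat_age_s: float, fresh_within_s: float = 300.0
) -> str:
    """Classify a run from artifacts only, never from its own narration."""
    if new_commits > 0:
        return "ADVANCING"
    if heartbeat_age_s <= fresh_within_s:
        # Alive and talking, but nothing has landed.
        return "SPINNING"
    return "STALLED"

def recommend(state: str, spent: float, budget: float) -> str:
    """Propose an action; the caller decides whether to carry it out."""
    if state == "SPINNING" or spent > budget:
        return "propose-halt"
    if state == "STALLED":
        return "flag-for-human"
    return "continue"

for commits, age in [(2, 10.0), (0, 10.0), (0, 900.0)]:
    state = liveness_sketch(commits, age)
    print(commits, age, state, recommend(state, spent=1.0, budget=5.0))
```

Keeping the classifier and the recommendation as separate pure functions mirrors the split the text describes: the signal is computed from evidence, and enforcement stays with whoever runs the fleet.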

You see that signal through three read-only screens — dos top (what's running), dos decisions (what's waiting on you), dos plan (claim vs. ground truth) — covered in Three live projections below and walked end-to-end in Debug a stuck fleet.

The referee grows along two axes: deterministic verdicts that read artifacts (verify, liveness, scope), and provider-backed judges — a model, a debate — that rule on what no deterministic check can, kept outside the kernel under a discipline that stops a wrong judge from clearing a falsehood. See the adjudicator-population note for that scalable-oversight story in code.

We caught ourselves doing the exact thing DOS exists to catch. A design doc in this repo included a small worked example — "here's what this snippet prints" — written by the agent building DOS. It read perfectly plausible. It was reviewed. It was committed. And it was wrong, for the dullest possible reason: nobody had actually run it. The agent had reasoned out what the code "would" print and typed that down as fact. An adversarial review later did the one thing the author hadn't — executed the snippet — and the real output flatly contradicted the prose. That's the whole thesis in one anecdote: a confident narration is not evidence, even when the narrator is us, even after a human reviewed it. The reasoning felt like checking; it wasn't. The only thing that settled it was running the code and reading what came back — an independent witness, exactly the move verify makes against an agent's "done." The correction is pinned in git (docs/124, commit 651ba03), because here too the record is the commit, not the claim.

And the first issue ever filed on this repo was closed the same way. Issue #1 is the publish pipeline's TestPyPI rehearsal failing its OIDC token exchange (invalid-publisher). The bug is ordinary; the closure is the demo. It wasn't closed on "fixed it" narration — it was closed on two read-backs the claimant didn't author: the next pipeline run's own conclusion (the dry-run leg, green) and the registry's own JSON reporting the artifact that leg exists to land. The closing comment runs the kernel's verdict on itself — dos reward --claim --witness confirm → ACCEPT — and the same evening, the same pipeline's witness gate refused to publish release 0.23.0 because CI was red on the candidate commit: a release pipeline declining to believe an unwitnessed "ready." Every link is public — click the runs, read the registry JSON, audit the closure yourself.

What's proven and what's still a bet

We apply the same honesty to our own claims that the kernel applies to your agents. It would be easy to lead with one big number; instead, here's the split — what we actually measured, what we extrapolated from those measurements, and what is still a bet. Draw the line yourself. (Every proven number is from a live, re-runnable benchmark written up under benchmark/ and the paper.)

✅ Proven — measured in live runs, scored against a fact the agent can't fake (a test environment's database state, git history — bytes the agent wrote none of):

  • It catches the lie and blocks it. Across 120 clean tasks on a standard agent benchmark, a DOS gate caught 10 genuine "I shipped it" lies and let every honest write through — at the same 8.3% catch rate on both a mid-size and a top-tier model. The signal doesn't fade when you upgrade the model. (Over the full benchmark: 15 lies caught in 258 tasks, two models, zero false alarms.) (▶ the catch itself is the gate figure below.)
  • It prevents the collision. The same referee put two live agents on one shared record and stopped 6 of 8 cases of one silently overwriting the other — 4 of 6 when the cases were drawn from the real task mix. This is the half a sandbox can't cover: an isolated workspace still shares the outside world. (▶ the collision being prevented is the coordination figure below.)
  • Mid-run "fixes" don't help; quitting a doomed run does. Every active fix we tried mid-run (warn it, rewind it, inject a hint) came out flat-to-negative — poking a run also disturbs the ones that would have passed. The one move that helps writes nothing: give up at the right moment — 0 runs wrongly killed out of 1,634 winners across 22 models, ~11% of fleet compute saved.
  • The training label can't be gamed. For "may a fine-tune learn from this run?" (dos reward), the yes/no is computed from environment state the agent authored none of — so no amount of clever output text can flip a no to a yes. That's a proof, plus a measured 60% → 100% precision lift from filtering out the poison a naive self-graded collector would have kept.

The first two proven moments above, each rendered as a single figure from its own live run (every number, hash, and ID is a verbatim read-off — never a hand-typed dramatization):

It catches the lie and blocks it. A confident booking, refuted by the DB-hash the agent couldn't author, blocked before a downstream agent inherits the phantom. Step through it locally (an HTML walkthrough — clone and open in a browser; GitHub shows its source).

It prevents the collision. A stale add-bag clobbers a cancellation under naive replay; the arbiter serializes the two agents on the same region so neither overwrites the other. Step through it locally (an HTML walkthrough — clone and open in a browser).

📈 Projected — real measurements, composed into a curve (and labelled as one). Here's the crux: catching a lie is only worth something to whoever can't catch it themselves. Hand the verdict to one strong agent that re-checks its own inputs and it buys you almost nothing — that agent recovers on its own. Hand it to something that can't re-check — a non-LLM system, a weaker model, a long multi-step chain, or a training loop — and it pays off (up to a full +1.0 in our no-recovery upper bound). In short: DOS is worth more the less your downstream can check itself. Our fleet-scale figure (≈173–505 corrupted results prevented at a 32-agent fleet) projects these real per-run rates onto fleet math — it's geometry on top of measured numbers, not a measured fleet run.

🎲 A bet — stated as one. Where this goes if the floor holds: a frozen, cross-vendor trust standard (the "deny" message is already byte-identical across Claude Code, Codex, and Qwen — a de-facto standard waiting to be named), a shared arbiter for real-world effects, the claim-vs-reality corpus only a neutral party can hold, and a notary that proves what an agent did to a skeptic who wasn't in the room (the mechanism already ships — dos attest mints an HMAC-signed receipt over an effect-witness verdict and dos verify-receipt checks it with the shared key alone; docs/246). The seeds are in the tree; we claim no results for any of it.
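For readers unfamiliar with HMAC receipts, the general technique looks like this in Python's standard library. This is a generic sketch of signing and checking a verdict with a shared key, not the format dos attest actually emits: the field names, the canonical-JSON choice, and SHA-256 are all assumptions made for illustration.

```python
import hashlib
import hmac
import json

def _canonical(verdict: dict) -> bytes:
    # Stable byte form so signer and checker hash the same thing.
    return json.dumps(verdict, sort_keys=True, separators=(",", ":")).encode()

def mint_receipt(verdict: dict, key: bytes) -> dict:
    """Attach an HMAC-SHA256 tag computed over the verdict."""
    tag = hmac.new(key, _canonical(verdict), hashlib.sha256).hexdigest()
    return {"verdict": verdict, "hmac_sha256": tag}

def check_receipt(receipt: dict, key: bytes) -> bool:
    """Recompute the tag with the shared key and compare in constant time."""
    expected = hmac.new(key, _canonical(receipt["verdict"]), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, str(receipt.get("hmac_sha256", "")))

key = b"shared-secret-for-the-example"
receipt = mint_receipt({"phase": "AUTH1", "verdict": "SHIPPED"}, key)
print(check_receipt(receipt, key))

# Editing the verdict after the fact invalidates the tag.
forged = {"verdict": {"phase": "AUTH2", "verdict": "SHIPPED"},
          "hmac_sha256": receipt["hmac_sha256"]}
print(check_receipt(forged, key))
```

The checker needs only the receipt and the key, which is the property the text relies on: a skeptic can confirm the verdict was not altered without access to the loop that produced it.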

The one distinction that keeps this honest: a J is a count of failures blocked off ground truth — never a downstream outcome delta. "Blocked 10 real over-claims" is proven; "made the fleet 10% better" is not the same sentence, and we don't write it.

What DOS does not do

The proven/bet gradient above is about evidence; this is about capability — the boundaries are part of the contract, and stating them is the same honesty the kernel applies to your agents:

  • It adjudicates that a ship happened, not that the code is correct or good. verify reads git ancestry, so it catches "no commit landed," not "the committed work is wrong." Judging quality is the JUDGE / HUMAN rung, not the deterministic oracle.
  • It computes verdicts and admission decisions; it never spawns or kills an OS process. liveness is advisory — it reports SPINNING, it doesn't stop the run — and dos loop emits a spawn/reap/flag plan you act on. (arbitrate and refuse are decisions you enforce, not force the kernel applies.)
  • It is not a CI replacement or a test runner. It sits beside them and lets a step branch on the exit-code verdict.
  • The pluggable verdict/JUDGE adjudicator registry is specced, not yet shipped (see docs/88 §5); the JUDGE seam and built-in judges are shipped.

Give your agent a lie detector (MCP)

The easiest way in doesn't involve writing any Python. Point the agent host you already use at the bundled MCP server, then ask your agent to dos_verify its own last claim. The first time it comes back NOT_SHIPPED … (via none) on work the agent swore it finished, you'll see why this repo exists — in your terminal, on your fleet.

Installed with the [mcp] extra (pip install -e ".[mcp]" from your clone — see Install), DOS exposes the syscalls as MCP tools — the truth tools first (dos_verify "did it ship?", dos_commit_audit "does this commit's claim match its diff?", dos_status one folded fact about a run), then dos_arbitrate (may two workers run without colliding?), the structured-refusal pair (dos_refuse_reasons / dos_check_reason), dos_recall (is this recalled memory still true?), dos_citation_resolve (does this cited legal case exist in a third-party reporter? — the Mata v. Avianca witness), and dos_doctor (the workspace report). Any MCP-speaking host — Claude Desktop, Claude Cowork, Cursor, Cline, Trae, an Agent-SDK app — can then call the referee over JSON-on-stdio with zero Python coupling. Each verdict comes back with a one-line interpretation of what it means for the agent's next move. (See the MCP server surface.)

// claude_desktop_config.json — paste, restart, then say:
//   "use dos_verify to confirm you actually shipped that"
{ "mcpServers": { "dos": { "command": "dos-mcp" } } }

The MCP server is advisory: the agent calls the referee when it (or you) thinks to. The per-host wiring for Cursor / Codex / Gemini is in the MCP README — all four are MCP clients, so this works on every one of them with zero code.
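If the host config already lists other MCP servers, the one-line entry has to be merged in rather than pasted over the file. A small sketch of that merge, assuming the `mcpServers` layout shown in the claude_desktop_config.json snippet; the helper name `add_dos_server` and the `other-tool` entry are hypothetical:

```python
import copy
import json

DOS_ENTRY = {"command": "dos-mcp"}

def add_dos_server(config: dict) -> dict:
    """Return a copy of the host config with the dos server registered.

    Existing servers are left untouched, and an existing "dos" entry
    is kept as-is rather than overwritten.
    """
    merged = copy.deepcopy(config)
    servers = merged.setdefault("mcpServers", {})
    servers.setdefault("dos", dict(DOS_ENTRY))
    return merged

existing = {"mcpServers": {"other-tool": {"command": "other-tool-mcp"}}}
print(json.dumps(add_dos_server(existing), indent=2))
```

Working on a deep copy means a failed write never leaves the original config half-edited in memory; writing the result back to disk is left to the caller.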

Gemini CLI users get a one-liner. DOS ships a gemini-extension.json manifest at the repo root, so the whole referee — the MCP server plus a context file that tells the model to gate its own done-claims — installs as one Gemini extension, no clone and no config edit:

pip install 'dos-kernel[mcp]'   # the server the extension launches
gemini extensions install https://github.com/anthony-chaudhary/dos-kernel

The same manifest is what Google's auto-indexed extensions gallery crawls, so the listing is automatic.

Browsing an MCP registry? A repo-root smithery.yaml lists DOS on Smithery, the de-facto MCP-server registry. It declares a local stdio launch (uvx --from 'dos-kernel[mcp]' dos-mcp) on purpose: DOS adjudicates your git repo, so it runs next to it rather than in a hosted sandbox that couldn't see your history. No API key — DOS is deterministic.

…then make the verdict act (hooks)

To go from "the agent can ask" to "the host won't let a bad call through", wire DOS's hooks into the runtime you actually run. One command per host — it writes that host's own hook-config file, merged into anything already there:

dos init --hooks auto .          # don't know the names below? this detects the
                                 # runtime(s) this repo already uses and wires them all
dos init --hooks claude-code .   # .claude/settings.json
dos init --hooks cursor .        # .cursor/hooks.json
dos init --hooks codex .         # .codex/config.toml
dos init --hooks gemini .        # .gemini/settings.json
dos init --hooks antigravity .   # .agents/hooks.json
dos init --hooks claude-cowork . # the SAME .claude/settings.json Claude Code reads

The list above is illustrative, not authoritative — the live matrix is a verb:dos hosts prints every host DOS can wire, sourced from the registries themselves(dos hosts --json for tooling), so the table never rots out of sync with whatdos init --hooks actually installs. Each row carries the host's tier, the eventsit binds, its dialect, its config path, the exact wiring command, and the host'sown caveat (Codex's partial tool coverage, Cowork's not-yet-firing hooks). A hostwith no row is itself the signal: it has no hook seam, so its DOS surface isthe advisory one (MCP + skills).

That binds three shipped hooks: pretool denies a structurally-refused callbefore it runs, stop refuses a stop on an unverified "done," posttoolre-surfaces a stalled stream. This is the enforcement path (the hostdenies on a DOS verdict) — the complement to MCP's advisory path. Untilrecently this spoke only Claude Code; it now installs across six hosts —Claude Code, Cursor, Codex, Gemini, Antigravity, and Claude Cowork(docs/221,docs/269,docs/298).--with-hooks is the back-compat alias for --hooks claude-code. auto(docs/303)names the host for you: it probes which of those config dirs already exist inthe repo — plus the shell's own environment, so a fresh repo opened insideClaude Code still detects — wires every runtime it finds (a shared config fileis wired once), and fails loud with the list above when nothing is detectable,never guessing. ClaudeCowork is the shared-surface host: it runs the same agent harness as ClaudeCode, so wiring either name binds both — one file, one set of hooks. (Onehonest caveat, carried on the install note itself: the Cowork app doesn'tfire hooks yet — anthropics/claude-code#63360 — so until that closes,Cowork's working DOS surface is the advisory one above.)

Under the installer sits a pluggable dialect seam: the verdict is decided once, then rendered into whatever JSON shape the host parses (docs/217) — so a runtime the installer doesn't cover yet can still consume the same hooks. A sixth shipped dialect speaks Hermes: dos hook pretool --dialect hermes emits the {"decision": "block", "reason": …} object Hermes' pre_tool_call shell hook reads (wire it in cli-config.yaml). A new host's dialect is a driver, never a kernel edit.
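The seam described above can be pictured as one verdict value plus a table of renderers. This is a minimal Python sketch, not DOS's actual code: `Verdict`, `RENDERERS`, and `render` are hypothetical names, and only the Hermes block shape comes from the text above. The allow-case output and the `plain` dialect are invented for illustration.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Verdict:
    """Hypothetical host-neutral verdict, decided once before rendering."""
    allow: bool
    reason: str


def render_hermes(v: Verdict) -> str:
    # Block shape taken from the text: {"decision": "block", "reason": ...}.
    # What Hermes expects on an allow is not stated there, so this sketch
    # emits an empty object for that case (an assumption).
    if v.allow:
        return json.dumps({})
    return json.dumps({"decision": "block", "reason": v.reason})


def render_plain(v: Verdict) -> str:
    # Invented dialect, present only to show a second renderer.
    return "ALLOW" if v.allow else f"DENY: {v.reason}"


RENDERERS: Dict[str, Callable[[Verdict], str]] = {
    "hermes": render_hermes,
    "plain": render_plain,
}


def render(v: Verdict, dialect: str) -> str:
    """Render an already-decided verdict; an unknown dialect fails loudly."""
    try:
        return RENDERERS[dialect](v)
    except KeyError:
        raise ValueError(f"no dialect named {dialect!r}") from None
```

Adding a host in this picture means adding one entry to the renderer table; the code that decides the verdict is untouched.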

The flip side of that honesty: a host with no hook seam gets no dialect. ByteDance's Trae was proved out and ships no user-scriptable hook system in its personal/international editions (no lifecycle events, no deny/allow stdout contract; its CN-enterprise edition announced one on 2026-06-09 with no published grammar yet), so DOS binds to it advisory-only — the MCP server in .trae/mcp.json (read alike by IDE-mode Agent, SOLO mode, and TRAE CLI), a verify-before-"done" rule in .trae/rules/project_rules.md, the generic skills in .trae/skills/ — and dos init --hooks trae fails loud rather than writing config Trae would never read (docs/294). An invented envelope would be fake enforcement, which is the exact failure the dialect seam exists to prevent.

Because these hooks run on every tool call, the core kernel logic on the hot path is reimplemented in native Go — a dos-hook binary that ports the actual decision predicates (the conjunctive-only lease-admission and prefix-disjointness floor, the verify() grep rung, self-modify, the marker budget, the WAL) rather than just shelling out to Python. It serves the per-call verdict in ~10 ms — 16–43× faster than shelling python -m dos.cli hook … (~0.25–0.8 s, dominated by interpreter cold-start) — and is byte-identical to the Python kernel on the gated decision (the docs/124 parity contract, pinned by Go parity tests). It owns the common fast path and falls back to the always-available Python verb for anything it doesn't yet serve, so a machine without the binary degrades cleanly with no wiring change (docs/125, docs/270). You don't build it yourself: the per-platform wheels bundle the binary, so a wheel install gets the native fast path with no Go toolchain — and any platform without a bundled binary (including a plain source install) just runs the pure-Python path (docs/286).

Next level up — what to watch once a fleet runs through these hooks: Operating a fleet.

The syscall ABI

Every syscall answers a question you'd otherwise have to take the agent's word for. "Reach for this when…" is the plain-English trigger; the rest is the contract — and the module names are auditable.

| Syscall | Reach for this when… | What it is | Module |
| --- | --- | --- | --- |
| verify() | an agent says a unit of work is done and you don't want to take its word | the truth syscall — "did (plan, phase) actually ship?" registry-first, ancestry-checked, from git history if there's no plan at all | dos.oracle, dos.phase_shipped |
| liveness() | a long run says it's "making progress" and you want to know if it actually is | the temporal verdict — "is the run ADVANCING, or just SPINNING / STALLED?" from the git/journal delta and the clock | dos.liveness |
| verify-result() | a subagent hands a result back to an orchestrator that folds it as a finding — but the result string may be a harness-synthesized error the worker never authored | the fold-site result-state witness (docs/197) — classifies a subagent transcript's terminal record, gating on message.model == "<synthetic>" (the unforgeable harness-authorship marker), never the agent's self-report; exit 3 = DEAD (a harness 429 / quota / auth / server error), 0 = HEALTHY / UNREADABLE | dos.result_state |
| resume() | a run died or paused mid-flight and you need to continue without re-doing work or double-applying it | the third ARIES phase — "how far did the fossils say it got, and what's the residual?" over a run-id-keyed intent ledger; re-enters from a git-VERIFIED SHA, never the dead run's self-report (RESUMABLE / COMPLETE / DIVERGED / UNRESUMABLE) | dos.resume, dos.intent_ledger |
| complete() | you need to know if the whole declared job is verifiably done, not just one phase | the completion verdict: residual = declared − verified, asked forward; read-only, never self-certifies | dos.completion |
| rewind() | a run thrashed and you want to excise the dead-end turns without the kernel authoring a correction | the conversation-rewind verdict — replays the ledger for a minted checkpoint and PROPOSES the excision (never truncates; the host owns the transcript) | dos.rewind |
| productivity() | a long run is burning turns and you want to know if it's still doing work, or fading | the loop-economics verdict (docs/218) — classify(work-deltas) -> PRODUCTIVE / DIMINISHING / STALLED over a trend of per-step work; pure, no I/O | dos.productivity |
| efficiency() | you want to know if the tokens a run spent actually bought work (a run can be productive yet burn 10× its work's worth) | the token-effectiveness verdict (docs/263) — work / tokens -> EFFICIENT / COSTLY / WASTEFUL; both counts are env-authored, so a run can't narrate its way to EFFICIENT | dos.efficiency |
| improve() | a self-improving loop proposes a change to its own code and you must decide keep or revert — without trusting the loop's own claim that it helped | the keep-gate (docs/280) — KEEP / REVERT / ESCALATE from witnesses the candidate's author didn't write: the suite green on the candidate-only tree, the truth syscall clean, and a strictly-measured metric gain; a regression always REVERTs, a run of non-keeps ESCALATEs to a human | dos.improve |
| reward() | a fine-tune is about to train on an agent's trajectory and the "it worked" label came from the agent itself | the reward-set admission verdict (docs/230) — ACCEPT / REJECT_POISON / ABSTAIN off a witness the agent authored zero bytes of, so no answer text can flip a reject to an accept (the non-distillable label) | dos.reward |
| breaker() | a failure class keeps tripping and you want to stop retrying and escalate | the circuit-breaker primitive (docs/223) — a pure two-counter state machine, CLOSED / OPEN, tripping on consecutive or total failures; an OPEN verdict names the escalation rung (none / judge / human) | dos.breaker |
| hook_exit() / exec_capability() | you wire a plain shell hook into a runtime, or need to know if a command grants arbitrary code execution | two classifier leaves the cheapest integrations consult — hook_exit maps an exit code to an intervention (docs/226: 0 pass / 2 BLOCK / other WARN), exec_capability classifies the invoked program token — never a substring — as GRANTS_ARBITRARY_EXEC / BOUNDED (docs/224) | dos.hook_exit, dos.exec_capability |
| refuse(reason) | you need to say why a pick was blocked in a way a machine can act on | structured refusal — a closed, declared reason vocabulary (dos.reasons, extensible per-workspace), every reason emittable, verifiable, and refusable | dos.wedge_reason, dos.picker_oracle |
| lease() / arbitrate() | two agents might touch the same files and you need to admit one without a collision | the pure admission kernel: arbitrate(request, live_leases, config) -> decision, state-in / decision-out, no I/O | dos.arbiter |
| spawn() / reap() | you need every run to carry a traceable identity and its effects to be replayable | the correlation spine (sortable, lineage-carrying run-ids) + the lease write-ahead log | dos.run_id, dos.lane_journal |
| enumerate() / pickable() / cooldown() / reconcile() | an unattended loop must know is there anything pickable, why-not, have I tried it, and did the claim hold? — without re-storming a known drain or believing a "done" the git can't confirm | the picker substrate (docs/207) — enumerate is the phase-list producer (the declared set, never a silent empty); pickable the pre-dispatch gate (OFFERABLE / HELD(reason)); cooldown the anti-churn fold over pick-attempts (CLEAR / RECENTLY_ATTEMPTED); reconcile the quiet-completion join (VERIFIED / QUIET_INCOMPLETE / HONEST_OPEN, fail-closed on the claim) | dos.enumerate, dos.pickable, dos.cooldown, dos.reconcile |

Three terms the table assumes: a plan (e.g. AUTH) groups phases; a phase is a named unit of work (e.g. AUTH1); a lane is a leased region of the file tree an agent works in. All are defined in the quickstart.
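The smallest contract in the table, hook_exit, maps an exit code to an intervention (0 pass, 2 BLOCK, anything else WARN). A minimal Python sketch of that mapping follows; the `Intervention` enum and the function body are illustrative and are not the `dos.hook_exit` source.

```python
from enum import Enum


class Intervention(Enum):
    PASS = "PASS"
    BLOCK = "BLOCK"
    WARN = "WARN"


def hook_exit(code: int) -> Intervention:
    """Map a shell hook's exit code: 0 passes, 2 blocks, anything else warns."""
    if code == 0:
        return Intervention.PASS
    if code == 2:
        return Intervention.BLOCK
    return Intervention.WARN
```

Treating every unrecognized code as WARN rather than BLOCK means a crashing hook degrades to a warning instead of halting the turn.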

The newest catch — a result that died. When a subagent hands a result back to an orchestrator that folds it as a finding (an ultracode Workflow, an Agent-SDK fan-out), the result string itself may be a harness-synthesized error the worker never authored — and ~32% of real subagents return exactly that (a 429 / quota / auth string) where the fold expects a finding (docs/197). verify-result reads the transcript's terminal record and refuses to believe a harness-authored death:

dos verify-result --transcript dead.jsonl
#   DEAD SYNTHETIC class=OTHER — harness-authored terminal
#   (model=<synthetic> + stop_reason=stop_sequence) — not a finding; route to DEAD, do not fold
echo $?   # → 3   (count it in the denominator; never bank it as a result)

dos verify-result --transcript real.jsonl
#   HEALTHY — terminal assistant record is real-model authored with content
echo $?   # → 0

It gates on message.model == "<synthetic>" — the marker the agent's own model cannot forge (the runtime harness authored those bytes, not the worker) — which is broader than rate-limits alone: quota, auth, and server deaths are caught too.
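The gate above can be sketched in a few lines of Python. This is a simplified illustration, not the `dos.result_state` implementation. It assumes each transcript line is a JSON object with a `message` dict carrying a `model` field, and it skips the content check the real verb reports for HEALTHY. The exit codes (3 for DEAD, 0 for HEALTHY or UNREADABLE) follow the syscall table.

```python
import json
from typing import Iterable, Tuple

SYNTHETIC = "<synthetic>"  # the harness-authorship marker named in the text


def classify_terminal(lines: Iterable[str]) -> Tuple[str, int]:
    """Return (verdict, exit_code) for the last non-empty JSONL record."""
    last = None
    for raw in lines:
        raw = raw.strip()
        if raw:
            last = raw
    if last is None:
        return ("UNREADABLE", 0)
    try:
        record = json.loads(last)
    except json.JSONDecodeError:
        return ("UNREADABLE", 0)
    # Assumed record shape: {"message": {"model": ..., ...}}.
    message = record.get("message") if isinstance(record, dict) else None
    if not isinstance(message, dict):
        return ("UNREADABLE", 0)
    if message.get("model") == SYNTHETIC:
        return ("DEAD", 3)  # harness-authored terminal: do not fold
    return ("HEALTHY", 0)
```

The classifier never inspects what the record says about itself; only the author marker decides between DEAD and HEALTHY.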

Around these sit ~30 supporting kernel modules — the file-tree disjointness algebra, the timeline reader, the gate/loop classifiers, the typed-verdict contract, the JUDGE-rung seam. The full map is in CLAUDE.md.
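To make "file-tree disjointness" concrete, here is a minimal Python sketch of a prefix-disjointness check. It assumes a lane can be represented as a POSIX-style path prefix; the function names and that representation are illustrative and are not the kernel's actual algebra.

```python
from pathlib import PurePosixPath
from typing import Iterable


def overlaps(a: str, b: str) -> bool:
    """True when one path prefix contains the other, or they are equal."""
    pa, pb = PurePosixPath(a).parts, PurePosixPath(b).parts
    shorter = min(len(pa), len(pb))
    # Compare whole path components, so "src/auth" and "src/authz" stay disjoint.
    return pa[:shorter] == pb[:shorter]


def admit(requested: Iterable[str], live: Iterable[str]) -> bool:
    """Admit only if every requested prefix is disjoint from every live one."""
    live = list(live)
    return not any(overlaps(r, held) for r in requested for held in live)
```

Under this representation the prefix "." has no components, so it overlaps every other prefix, which is how a whole-repo lane would exclude all others.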

Install

Pick the row that matches how you work — the full matrix (every OS, every channel, upgrade/uninstall, WSL, troubleshooting) is in docs/INSTALL.md:

# pip — the default (the line the 60-second demo ran; also how a host pins it):
pip install dos-kernel            # core kernel (PyYAML only)
pip install "dos-kernel[mcp]"     # + the MCP server (dos-mcp)

# uv — the isolated CLI install (keeps `dos` + `dos-mcp` off your project venv):
uv tool install dos-kernel        # `dos` + `dos-mcp` on PATH
uvx --from dos-kernel dos doctor  # or run it once, ephemerally

# from a clone — editable, the contributor path (tracking unreleased master:
# pip install "dos-kernel @ git+https://github.com/anthony-chaudhary/dos-kernel", no clone needed):
git clone https://github.com/anthony-chaudhary/dos-kernel.git && cd dos-kernel
pip install -e .                  # editable: your edits are live in the install
./install.sh                      # or .\install.ps1 on Windows — venv + install + PATH, one line

The distribution name is dos-kernel, not dos — a bare pip install dos pulls an unrelated package that squats the name. The import name and the CLI are still dos. The core kernel's only runtime dependency is PyYAML (the [mcp] extra adds the MCP framework; [tui] adds the live dos top screens). See SECURITY.md, "Supply chain."

pip install dos-kernel is the whole install — if it worked in the demo, you're done here. The other rows exist for how your team works: uv if you want the CLI isolated from your project venv (faster than pipx, manages Python versions; pipx install dos-kernel works the same way), the clone if you're contributing. Homebrew / WinGet / Scoop one-liners are next on the release runway (see docs/INSTALL.md).

A host repo adds DOS as a pinned dependency and points it at its own tree — never by vendoring the code in. DOS is stateless about which repo it serves: it resolves the workspace from --workspace › $DISPATCH_WORKSPACE › cwd, never its own install location, so the ground truth stays legible as the codebase grows. (The full separation contract — mechanism in the package, policy in the workspace's dos.toml — is in CLAUDE.md.)
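The resolution order above (explicit flag, then $DISPATCH_WORKSPACE, then the current directory) is a plain precedence chain. A minimal Python sketch, with `resolve_workspace` as a hypothetical helper name rather than a DOS function:

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def resolve_workspace(
    flag: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
    cwd: Optional[str] = None,
) -> Path:
    """Pick the workspace: explicit flag, then $DISPATCH_WORKSPACE, then cwd."""
    env = os.environ if env is None else env
    if flag:
        return Path(flag)
    from_env = env.get("DISPATCH_WORKSPACE")
    if from_env:
        return Path(from_env)
    # The package's own install location is never consulted.
    return Path(cwd) if cwd is not None else Path.cwd()
```

Passing `env` and `cwd` as arguments keeps the function pure, so the precedence can be checked without touching the real process environment.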

For most repos that one dos.toml is the whole policy surface — but when your lanes must be computed (from runtime state, an env var, a monorepo manifest) rather than listed, or you add a provider-backed JUDGE, you write a small driver instead: a dos/drivers/<host>.py exposing a LaneTaxonomy constant + a <host>_config factory, loaded by name via dos --driver <host> and never imported by the kernel. Copy dos/drivers/workshop.py as the template; the full driver/plugin map is in docs/HACKING.md.

Claude Code plugin — hooks + MCP + skills in one install

If you drive a fleet with Claude Code, the lowest-friction way to bind the verdict to the runtime is the bundled plugin under claude-plugin/ — it packages all three runtime surfaces at once:

  • the hooks (PreToolUse → deny a structurally-refused call · PostToolUse → re-surface a stalled tool stream · Stop → refuse to stop on an unverified claim) — all fail-safe (they emit nothing and exit 0 on any error, so they never break a turn);
  • the MCP server (dos_verify / dos_arbitrate / dos_commit_audit / dos_refuse_reasons … as tools the model calls directly);
  • the generic skill pack (the domain-free dispatch screenplays), namespaced as /dos-kernel:dos-next-up, /dos-kernel:dos-dispatch, …
# 1. The plugin ships JSON + markdown; the brains ship as the pip package, so
#    install it FIRST into the interpreter Claude Code runs (the [mcp] extra is
#    what the bundled MCP server needs):
pip install "dos-kernel[mcp]"

# 2. Then, inside Claude Code:
/plugin marketplace add anthony-chaudhary/dos-kernel
/plugin install dos-kernel@dos

After installing, run /dos-kernel:dos-setup once — it confirms the package is importable, reports what the plugin wired, and points at the next skill. The same three hooks are available à la carte via dos init --hooks auto (detects your runtime; or name one — claude-code, cursor, codex, gemini, antigravity, claude-cowork); the plugin is just the pre-packaged Claude Code form. The bundle's design + the build that keeps its skills in lockstep with the source are in claude-plugin/README.md.

CLI

One dos entrypoint over the syscalls (see QUICKSTART.md for a runnable tour of the core ones):

# --- the syscalls ---
dos verify PLAN PHASE                  # truth: did (plan,phase) ship? (works with no plan)
dos commit-audit [REF] [--sweep]       # truth: does a commit's SUBJECT match its own diff? (--sweep = drift rate over a range)
dos verify-result --transcript T       # fold-site witness: did a subagent's terminal record DIE (harness 429/quota)? (exit 3 = DEAD)
dos coverage --declared N              # fan-out coverage: how many of N declared workers REALLY returned a result vs died?
dos liveness --run-id R --start-sha S  # temporal: ADVANCING / SPINNING / STALLED?
dos resume --run-id R                  # the resume verdict: replay a run's intent ledger, re-verify against git, PROPOSE the continuation
dos complete --run-id R [--diverged]   # completion verdict: is the WHOLE declared job done? (residual = declared − verified)
dos rewind --run-id R [--fire SIGNAL]  # conversation-rewind verdict: PROPOSE excising dead-end turns (never truncates)
dos status --run-id R                  # the folded fact: one fail-closed digest of a run (liveness + verified progress + lease)
dos arg-provenance --tool T --args J [--new-key K]  # did the model MINT this id/FK, or RESOLVE it from env bytes? (exit 0 believe / 3 UNSUPPORTED)
dos arbitrate --lane L --kind K --leases '[…]'   # admission: may a lane start without collision? (decision only — journals nothing; hold via lease-lane)
dos scope-gate --lane L [--staged]     # binding pre-effect scope gate: may this PROPOSED write land in its lane? (ALLOW/REFUSE)
dos lease {acquire,release,status} OWNER         # the cross-process archive lock
dos lease-lane {acquire,release,heartbeat,live}  # durable lane lease over the pure arbiter (write-back to the WAL)
dos run-id mint PROCESS                # mint a correlation run-id
dos id-alloc {allocate,peek} SCOPE     # atomically allocate a never-reused, monotonic id for a scope
dos journal {tail,replay,seq,compact}  # the lane write-ahead log
dos halt --handle H                    # the reap verb: emit the stop-plan for a live run/lease
dos pickable / enumerate / cooldown / reconcile  # picker substrate: anything pickable? why-not? tried recently? did the claim hold?

# --- workspace & inspection ---
dos init [DIR]                         # scaffold a dos.toml workspace config
dos doctor [--json] [--check]          # report the active workspace + taxonomy + predicates
dos lint [--strict] [--json]           # dead policy in this workspace's own dos.toml? (unreachable lanes, dangling refs)
dos man {wedge,lane} [ID]              # the self-describing manual over the registries
dos exit-codes [VERB]                  # print the verdict-IS-the-exit-code table (all verbs or one)
dos gate PACKET                        # typed empty-packet verdict (LIVE/DRAIN/STALE-STAMP/…)
dos judge wedge RUN_TS                 # adjudicate a no-pick verdict (deterministic)
dos judge-eval --judge N --cases C     # score a JUDGE-rung adjudicator against labelled claims
dos overlap-eval --policy P --cases C  # score an overlap scorer by false-admit rate (the disjointness backtest)
dos intervention-eval --cases C        # score an intervention policy by NET task delta (not verdict accuracy)
dos tool-stream-eval --cases C         # score a stall-reader policy by NET recovery (not detection accuracy)
dos precursor-gate-eval --cases C      # score a precursor grammar by recall vs false-refute waste
dos memory {recall,verify}             # re-verify recalled agent-memory at read time (RECALL_FRESH/STALE/UNVERIFIABLE)
dos health --lane L                    # pre-dispatch lane-health gate (overlap + recurring-blocker → route)
dos scout                              # pre-dispatch chooser: pick the next activity before leasing a lane
dos trace RUN_ID                       # walk one run across spine + intent ledger + WAL + git, joined by run_id

# --- agent-host binding (Claude Code / MCP) ---
dos guard [--verify-on-stop] -- CMD…   # wrap a headless agent launch: inject the DOS MCP server (+ optional verify-on-stop Stop hook)
dos hook {pretool,posttool,stop}       # the live agent-host hook surface (PreToolUse deny / PostToolUse sensor / Stop verify)

# --- live projections (read-only TUIs) ---
dos top [--once] [--json]              # live fleet watchdog: lanes, leases, verdicts, commits
dos decisions [N]                      # the operator-decision queue (list + drill-in TUI)
dos plan [--once] [--json]             # work-terrain board: every phase, the plan's claim vs the oracle's verdict
dos watch --track R [--budget-ms M]    # the watchdog driver: poll liveness for tracked runs + propose halts on spin/hang
dos loop --target N [--watch] [--json] # supervisor (init/PID-1): keep N dispatch-loops alive — emits a spawn/reap/flag plan

# --- loop-economics & reliability verdicts (pure; exit code is the verdict) ---
dos productivity --deltas 5,3,1,0      # is the run still doing work? PRODUCTIVE / DIMINISHING / STALLED
dos efficiency --work W --tokens N     # did the tokens buy work? EFFICIENT / COSTLY / WASTEFUL
dos breaker --consecutive N --max-consecutive M  # has this failure class tripped? CLOSED / OPEN (+ escalation rung)
dos hook-exit --code N                 # map a shell hook's exit code → PASS / BLOCK / WARN
dos exec-capability --command "…"      # does this command grant arbitrary exec? BOUNDED / GRANTS_ARBITRARY_EXEC
dos improve --suite-passed --truth-clean --work W --baseline-work B  # self-improving loop: KEEP / REVERT / ESCALATE
dos reward --claim --witness {confirm,refute,none}   # may a fine-tune TRAIN on this trajectory? ACCEPT / REJECT_POISON

# --- observability: the verdict journal → your dashboards ---
dos observe [--run R] [--json]         # project the verdict journal: every kernel adjudication, folded by run/syscall/verdict
dos helped [--since TS] [--json]       # the operator rollup: how many things DOS productively caught for you
dos export [--to file|statsd|otlp] [--since SEQ]  # drain the journal outward (Datadog / Honeycomb / Grafana); null = report only
dos notify {decisions,top} [--notifier slack|webhook --channel NAME]  # push what-needs-a-human / what's-running to where the operator is; null = render only

# --- portable proof (third-party verifiable, no loop access) ---
dos attest --claim KEY {--accept-cmd CMD | --before P --after P}  # mint an HMAC-signed receipt over an effect-witness verdict
dos verify-receipt --receipt R         # the skeptic's side: check the signature with the shared key alone (fails LOUD on tamper)

# --- cross-project (machine-local index) ---
dos projects                           # the projects DOS has served
dos learn AXIS                         # aggregates over resolved decisions
dos reindex                            # rebuild the central store from the .dos/ dirs

Most verbs take --workspace . (or honor $DISPATCH_WORKSPACE / cwd) and --json for machine-readable output. For verdict-bearing commands (verify / liveness / gate) the exit code is the verdict. A pluggable --output <name> renderer (the dos.renderers entry-point group) is covered in HACKING.md.
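Because the exit code is the verdict, a caller can gate on it without parsing any output. A minimal Python sketch, assuming only that the command exits 0 on a passing verdict and nonzero otherwise; `gate` is a hypothetical helper, not part of DOS:

```python
import subprocess
from typing import Sequence


def gate(argv: Sequence[str]) -> bool:
    """Run a verdict-bearing command; True only when it exits 0."""
    completed = subprocess.run(list(argv), capture_output=True, text=True)
    # Only the exit code is read; whatever the command printed is ignored.
    return completed.returncode == 0
```

With DOS on PATH, `gate(["dos", "verify", "AUTH", "AUTH1"])` would be true exactly when that phase verifies as SHIPPED. Any command with the same exit-code contract can stand in while testing.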

Three live projections (read-only TUIs)

A fleet leaves its state scattered across git history, a write-ahead log, and a pile of verdict envelopes. DOS folds that into three read-only screens, each answering a different operator question. They are projections, not stores: every one reads kernel state, mutates nothing, takes no lease, launches nothing — delete any of them and you lose the screen, not the data. Pick by the question you're asking:

| Screen | Answers | Reads |
| --- | --- | --- |
| dos top | What's running right now? — the lanes, the leases holding them, recent verdicts, live git activity. The screen you leave open in a side terminal during a run. | leases (WAL) + per-lane liveness + verdict envelopes + git |
| dos decisions | What's waiting on me right now? — the no-picks (refusals, wedges, open gates) that need a decision, each tagged by who can resolve it. | the four refusal sources, joined |
| dos plan | Does the plan's claim match the ground truth? — every declared phase, the plan's self-reported status beside the oracle's verdict, so an over-claim is its own cell. | the plan source × verify() per phase |

In dos top a held lane's status chip is its liveness verdict — green ADVANCING / yellow SPINNING / red STALLED — so "which one is wedged" is one glance, not a log dig. dos decisions tags each row by resolver — a deterministic ORACLE (may auto-clear), an LLM JUDGE (could rule before you spend attention), or a HUMAN (a genuine operator call) — and on a keypress prints the exact shell command and exits; you run it, the screen never mutates substrate. dos plan is a verify() fan-out, not a plan reader: a human runs it from outside the agent loop, so an over-claiming loop is caught by ground truth, not by re-reading its own narration.

All three have a plain-text floor that needs no dependencies — the live rich redraw is the optional [tui] extra, but --once (one frame) and --json work on a bare core install (no extras). Here is dos top --once on a fresh checkout (no leases yet, so every lane is FREE and the git strip carries the content):

┌─ dos top · /path/to/repo · 2026-06-07T17:14:32+00:00 ──────────────────────
LANES
   benchmark     ⚪ FREE
   docs          ⚪ FREE
   …                                  (one concurrent lane per source dir)
  *global        ⚪ FREE              (* = the exclusive whole-repo lane)
  8 lanes · 0 advancing · 0 spinning · 0 stalled · 8 free
RECENT VERDICTS        [trust = ship-oracle cross-check]
  (no verdicts yet)
RECENT COMMITS        [ground truth — git history]
  0857bd4    docs/206 Appendix A: the whole program in plain words
  …                                  (last 10 commits — the content even a
                                      zero-lease repo always has)
──────────────────────────────────────────────────────────────────────────────
read-only · q quit · this screen mutates nothing

The stuck-fleet walkthrough that drives all three end-to-end is Debug a stuck fleet.

Observability: the verdict journal, drained to your dashboards

Those three screens read a fleet's running state. Underneath, every verdict the kernel computes — each verify / liveness / efficiency / breaker / reward / hook decision — also lands in a verdict journal: a run_id-correlated write-ahead log of the kernel's own adjudications (docs/262). Two verbs make it useful. dos observe is the read-only projection — fold the journal by run, syscall, or verdict, or replay one run's verdict history. dos export is the delivery seam: it drains the journal outward to an observability backend through the dos.exporters entry-point group, with three shipped transports — file (JSONL), statsd (DogStatsD counters), and otlp (OpenTelemetry log records → Datadog / Honeycomb / Grafana), the null default reporting only (docs/266). So "how often did the fleet over-claim this week, and on which lanes?" becomes a dashboard panel, not a log grep — and adding a transport is a driver, never a kernel edit (the same kernel/driver split as judges and notifiers).

Operating a fleet

The listing above is the reference; this is the day-2 shape of running on it — what an operator actually does each morning, and where to look first when something wedges.

Morning triage is three reads, in order. dos top answers what's running right now: each lane, the lease holding it, and a status chip that is the liveness verdict — green ADVANCING, yellow SPINNING, red STALLED — so "which one is wedged" is one glance, not a log dig. dos decisions answers what's waiting on me: the refusals and open gates that need a decision, each tagged by who can resolve it, so you spend attention only on the rows marked HUMAN. And dos plan answers is anyone over-claiming: every declared phase, the plan's self-reported status beside the oracle's verdict — run from outside the agent loop, so an over-claimer is caught by ground truth, not by re-reading its own narration. All three are read-only projections (no lease taken, nothing launched, nothing mutated), so leave them open or script them — each has --once and --json.

Then route the signal to where you already look. You don't have to keep a terminal open: dos notify pushes what-needs-a-human (or what's-running) to Slack or a webhook on whatever cadence you drive it with, and dos export drains the verdict journal — every adjudication the kernel computed — to your observability stack (file / statsd / OTLP → Datadog, Honeycomb, Grafana). "How often did the fleet over-claim this week, and on which lanes?" becomes a dashboard panel, not a log grep.

When something wedges, start with the verdicts, not the logs. The symptom → one-command table — a run that swears it's progressing but isn't, a lane nobody can take, a "done" that won't verify — is Debug a stuck fleet, which drives all three screens end-to-end on a worked example. Link it from your on-call doc; it is the playbook.

Running smoothly and want the referee to fit your org — your own lanes, your own block reasons, a model-backed judge? Step up a level: Hacking it.

Hacking it

DOS is built to be extended without forking the package — add your own block reasons, gate verdicts, admission/safety predicates, output renderers (the dos.renderers entry-point group), and your own judge for the JUDGE rung (dos.judges, scored by dos judge-eval), all as workspace policy, not package edits. The block-reason vocabulary is fully data-driven: declare a reason in four lines of dos.toml and it becomes emittable, verifiable, refusable, and dos man wedge-documented through the same kernel calls a built-in uses. See docs/HACKING.md for the seven extension axes and the plugin model, and examples/dos_ext/ for a copy-me skeleton.

Documentation

  • docs/QUICKSTART.md — runnable 5-minute hello-world. Start here.
  • docs/FAQ.md — the arriving questions ("how do I verify an agent's claim?", "does it need an LLM?"), each answered in one self-contained block.
  • docs/answers/ — the answer corpus: one sourced, self-contained page per high-intent question ("how to verify an AI agent actually did the work", "how to stop two AI agents overwriting each other"), each carrying an evidence table where every number links to the file that proves it.
  • docs/ALTERNATIVES.md — DOS and the alternatives: evals, framework guardrails, Temporal, in-toto, plain CI — what each does well, what DOS adds, and when NOT to use DOS.
  • docs/README.md — the docs index (guides vs. design notes vs. the dated build-journal; the numbers are chronology, not a reading order).
  • docs/HACKING.md — extend DOS without forking it.
  • docs/DOT_DOS.md — the .dos surface: what that directory is, why it's safe to delete, and the repo-resident state contract (policy in dos.toml, fossils under .dos/, verdicts in git) that keeps every host adapter thin.
  • docs/STABILITY.md — the compatibility promise: what you may depend on, what SemVer means here, how deprecations are announced (DosDeprecationWarning, a two-minor-release window), and what will never break.
  • CLAUDE.md / CONTRIBUTING.md — the architecture contract and how to send a change.
  • verify-action/ — the CI gate: a composite Action and a reusable workflow that run dos commit-audit on every PR and merge-queue group and post the verdict as the named dos-verify status check; make it required and GitHub enforces what the kernel decides (the verified by DOS badge above is this gate on the kernel's own repo).
  • docs/releases/ — per-version release notes (the changelog).
  • The website — this page, condensed to one screen (good for sending to someone).

Playbooks & examples

examples/playbooks/ walks the syscalls end-to-end on anonymized real-world repo shapes — every command was run and its output pasted back verbatim.

Next level up — evaluating the claims themselves: For researchers.

For researchers

Every number this README claims is graded the way the kernel grades agents: it counts only if a witness the graded party didn't author backs it. This section is the on-ramp joining the claims to their write-ups, the two formal invariants, and the literature — the depth lives under benchmark/ and the paper.

The claims register. The proven rows from the evidence section, each with the witness it was scored against and the write-up that reproduces it:

| Claim | Result | Witness | Write-up |
| --- | --- | --- | --- |
| the write-admission gate catches the over-claim | J = 10/120 "I shipped it" lies blocked, 0 honest writes refused, the same 8.3% over-claim rate on a mid-size and a top-tier model (15/258 over the full benchmark) | the env DB-hash the agent authored zero bytes of | agentprocessbench/writeadmit/ |
| the arbiter prevents the lost update | J = 4/6 natural-mix clobbers prevented (6/8 constructed; 8/8 in the two-domain τ²-bench port; 9/10 in the live replication) | the post-state DB-hash neither agent authors | tau2coord/RESULTS.md |
| mid-run "fixes" are flat-to-negative; witness-gated early-halt is the survivor | 0 false-abandons / 1,634 winners across 22 models (error-gated, K≥3) — and the same test falsifies the naive raw-repeat gate | each task's own oracle over a frozen replay corpus | giveup_cross_benchmark.py · the corpus ledger |
| the reward-set admission label can't be gamed | acceptance precision 60% → 100%, J = 5 poison labels purged (ΔP +40 pp) from a naive self-judged collector's bank | the gold DB-hash, keyed on db_match | docs/230 |

The two invariants. The formal contribution under those numbers, stated precisely:

  1. The non-forgeable witness (docs/138): every kernel verdict is a pure function classify(evidence, policy) whose evidence is bytes the adjudicated agent did not author — git ancestry, an environment's database state, the file tree, the clock. Byte-author ≠ claimant is the load-bearing split: the agent may write anything it likes, and none of it is an input to the verdict.
  2. The non-distillable label (docs/230, docs/234): the reward-set admission bit (dos reward) is a pure function of that witness, so — conditional on environment state — it is independent of the answer text. No token sequence moves a REJECT_POISON to an ACCEPT, and a forgeable read-back is structurally ignored, not down-weighted.
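The second invariant can be illustrated with a function whose output cannot depend on the answer text because it never reads it. This is a minimal Python sketch, not the `dos.reward` implementation. The three witness values mirror the `--witness {confirm,refute,none}` flag in the CLI listing, and the mapping from witness to label is an assumption made for illustration.

```python
from enum import Enum


class Witness(Enum):
    CONFIRM = "confirm"
    REFUTE = "refute"
    NONE = "none"


def reward(witness: Witness) -> str:
    """Admission label as a function of the witness alone (assumed mapping)."""
    if witness is Witness.CONFIRM:
        return "ACCEPT"
    if witness is Witness.REFUTE:
        return "REJECT_POISON"
    return "ABSTAIN"


def admit_trajectory(answer_text: str, witness: Witness) -> str:
    # answer_text is accepted but deliberately never read, so no wording
    # of the answer can change the label.
    return reward(witness)
```

Two trajectories with the same witness receive the same label whatever their answers say, which is the independence the invariant states.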

Reproduce it. One runner fronts the suite: python -m benchmark._run list inventories every benchmark with its arms, cost, and prereqs, and each proven row above has a $0 offline arm (benchmark/BENCHMARKS.md). Read the numbers under one rule: a J is a count of failures blocked off ground truth, never a downstream outcome delta — "blocked 10 real over-claims" is proven, "made the fleet 10% better" is a different sentence, and we don't write it.

Where it sits. The lineage is deliberate, one line each: the kernel is a reference monitor in the minimal-TCB tradition — a small, separate, non-bypassable adjudicator outside the agents it judges; resume is the third ARIES phase aimed forward — continue from the durable fossils, never from the dead run's account of itself; the arbiter enforces serializability over shared world-state regions, with the lost update as its target anomaly; and reward() lands in the reward-hacking / scalable-oversight line — a deterministic floor inside the training loop. The full argument is the paper, "Verification Is All You Need — But Not Where You Think" (paper/releases/), and the BibTeX is in Citation.

Citation

The ideas here are written up in a paper — "Verification Is All You Need — But Not Where You Think" — on the out-of-loop referee for agent fleets. A built PDF lives at paper/releases/; the arXiv preprint is in preparation. Until the arXiv ID lands, cite the repository:

@misc{dos_kernel,
  title        = {Verification Is All You Need --- But Not Where You Think},
  author       = {Chaudhary, Anthony},
  howpublished = {\url{https://github.com/anthony-chaudhary/dos-kernel}},
  note         = {DOS --- the Dispatch Operating System; arXiv preprint in preparation},
  year         = {2026}
}

License

MIT — see LICENSE.
