Why we were wrong about agent-native CLIs

Three things triggered this post.

Steipete and Trevin shipped Printing Press. A library of agent-native CLIs, plus a factory that prints new ones. The thesis was sharp. Most APIs suck for agents. Most MCPs suck for agents. Most official CLIs suck for agents. They waste tokens. They waste time. Build something better.

I read it and thought yes, obviously. Of course agents need their own CLIs. JSON output, no ANSI, no pagination, terse and structured. Easy hypothesis. Easy product story.

Then I tried to prove it.

The setup

I run a personal agent swarm. Agents read local knowledge files. Background pipelines summarize records. A CLI tracks issues. All running on local models: an 8B for fast routing, a 26B for synthesis, a 24B for code.

Perfect lab for testing the agent-native CLI claim.

I built a benchmark. Three substrates per operation: REST API, MCP server, agent-native CLI. Same task, three substrates, three token counts. Run a small agent against each output. Score whether it extracts the correct answer.

Then I started running it.

v0: agent-native does not mean fewer tokens

First operation: list open issues. issues list for the human-shaped tree. issues list --json for the agent-native version. Same data, two substrates.

The JSON output was 5.5x bigger.

Five-point-five. Times. Bigger.

The naive headline was wrong before I scored any task. Agent-native does not mean fewer tokens. JSON has structure, but pays for it in field names, brackets, and metadata fields the human tree compressed away. Token count is not where the wedge is.

OK. Reframe. Maybe the agent-native version yields fewer tokens per agent task success. That is the metric that matters. So I added task scoring.

v1: task completion is a model question

Question: how many issues have priority 2. Ground truth from parsing the JSON: 26.

The 8B model read the tree output and answered 38. Read the JSON and answered 6.

Both wrong. Tree off by 12 high. JSON off by 20 low.

Same data. Same model. Two substrates. Both failed.

The model hallucinated differently against each substrate. Tree output got pattern-matched too aggressively. JSON output blew past the model's context budget. Neither failure taught me about substrate quality. The model was the bottleneck, not the substrate.

Reframe again. The benchmark needs a model sweep. Fix the operation, vary the model, see whether substrate matters once capability scales.

v2: substrate format is roughly irrelevant

Same operation, three models: an 8B, a 26B, and a 24B.

The numbers across the substrate × model matrix:

| Model | Tree (912 tok) | JSON (5091 tok) | |---|---|---| | 8B model | 25 (off by 1) in 0.2s | 8 (off by 18) in 79s | | 26B model | 31 (off by 5) in 12s | 28 (off by 2) in 387s | | 24B model | 24 (off by 2) in 25s | 19 (off by 7) in 196s |

All six combinations failed the exact match. None hit 26.

But look at the 8B model on the tree output. Off by 1. In two-tenths of a second. The fastest, cheapest combination was almost the most accurate.

Look at the 26B model on JSON. Off by 2. In six minutes. The most expensive run barely beat a 0.2-second tree read.

Substrate format barely moved the needle. Model capability dominated. So did operation difficulty, which I had not yet measured.

I ran a third operation. ops fail --since 7d. Three rows of data. Default mode prints "3 job(s) failed in last 7d" as the first line. JSON mode is the same data with field names.

Six combinations. Six successes. 100% task completion. Substrate did not matter. Model did not matter. The operation was small enough and the answer was up front. Done.

Then I ran a fourth operation. ops pipelines. 110 rows. Question: how many pipeline names start with a given prefix. Ground truth: 6.

The 26B model got 6 on both substrates. The 8B got 7 on both. The 24B got 4 or 5 on both.

For a fixed model, the substrate barely changed the answer.

What the data actually says

Operation size dominates. Three rows: every model on every substrate succeeds. 110 rows: only the most capable model succeeds. 50 rows on a hard counting task: nobody succeeds, capable models come close.

Model capability dominates. For a fixed operation size, the model decides whether the agent succeeds. Substrate format adjusts the answer by a few percent. Not zero, but not the wedge.

Substrate noise is bigger than substrate signal. The 8B model on the issue-list benchmark answered 25 against the tree and 8 against the JSON. That is not substrate teaching us something. That is a small model hallucinating differently against different inputs.

The honest doctrine

Design operations for the smallest agent you want to support.

Pagination is the agent's lever. Default returns are bounded. Large datasets require an explicit ask. The CLI's job is not to hand off everything and hope. The CLI's job is to bound the result to what the model can chew.

Query-language surfaces for transformations. Agents describe intent. CLIs execute. The result is a small confirmation, not a giant diff. Small models win on transformation tasks because they never traverse the dataset themselves.

Two output modes per operation. Default is terse human-shaped with answer-up-front. --json is structured for downstream tooling. Mode selection is per-task. Customer agents pick correctly. Ship both. Do not lead with this.

This is what shipped: a couple of small operations CLIs and a benchmark that catches when the doctrine drifts from the data. About 1000 lines of Python. Three rewrites of what we thought we knew.

What the substrate question was hiding

The whole time I thought I was asking "REST vs MCP vs CLI." The real question was "is your agent's model capable enough for the operation size."

That is a different product call.

If your customer runs frontier-class agents only, every substrate works. Okta's REST API works. SailPoint's REST API works. Pick whatever feels good.

If your customer runs smaller models, on-prem models, or air-gapped models, none of the existing identity vendors work for you at scale. Their operations are sized assuming the agent is huge. That assumption breaks for regulated verticals. It breaks for cost-sensitive deployments. It breaks for multi-tier orchestration, where a small model triages and a frontier model escalates.

The wedge is operation design, not substrate format. The agent ergonomic that matters is whether the answer fits in the agent's context window without paginating through 50 calls.

What this means for product builders

Three things changed in how I think about agent surfaces.

First, benchmark before you position. The v0 hypothesis (JSON saves tokens) was empirically wrong inside one afternoon. If I had launched the post on the v0 hypothesis, I would have looked silly within a week.

Second, model tier is a product input. The customer's model determines what operation sizes you can ship. If you sell to regulated industries running smaller models, your default response sizes have to shrink.

Third, substrate format is a tactical knob, not a strategic one. Ship two modes per operation. Pick which mode is default per task class. Move on. The compounding lives in operation design and the policy layer that sizes calls per agent capability.

Printing Press is still a good idea. Agent-native CLIs are still a real surface. The thesis was right that most existing surfaces are the wrong shape for agents. The correction is just that "right shape for agents" depends on which agent.

I will keep running the benchmark. v4 should test transformation tasks and joins, not just counting. v5 should add the frontier tier. The doctrine will probably get rewritten again. That is fine. The whole point of running benchmarks is to find out you were wrong before the customer does.

Drafted from three iterations of a CLI substrate benchmark on a personal agent swarm.