The Next CLI UX Battle Is Agent Forensics

Agent tooling has spent the last year obsessed with capability.

More tools. More autonomy. More parallelism. More subagents. More wow.

But once an agent starts doing real work in your terminal, a different question becomes brutal fast: what exactly happened here?

The signal in both Codex and Gemini CLI right now is the rise of agent forensics: runtime primitives that help teams reconstruct cost, context, and causality after a turn is over.

Codex is turning each turn into a measurable event

In OpenAI’s codex repository, the interesting March changes are not flashy UX features. They are accounting moves.

Commit e07eaff0d32c, “feat: add metric for per-turn tool count and add tmp_mem flag”, adds a dedicated histogram for codex.turn.tool.call. That means a turn is no longer just a blob of agent behavior. It becomes a countable unit of action.

The same patch also tags those metrics with tmp_mem_enabled. That sounds small. It isn’t. The moment you segment behavior by memory mode, you gain a real debugging dimension: did the agent behave differently because its memory configuration changed?
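
To make the idea concrete, here is a minimal sketch of a per-turn histogram segmented by a memory-mode tag. It is written in Python for illustration and is not Codex's implementation. The metric name codex.turn.tool.call and the tmp_mem_enabled tag come from the commit described above; the Histogram class, its methods, and the sample values are hypothetical.

```python
class Histogram:
    """Tiny in-memory stand-in for a metrics histogram (hypothetical helper)."""

    def __init__(self, name):
        self.name = name
        self.samples = {}  # sorted tag tuple -> list of recorded values

    def record(self, value, **tags):
        # Group samples by their tag set so they can be compared later.
        key = tuple(sorted(tags.items()))
        self.samples.setdefault(key, []).append(value)

    def mean(self, **tags):
        values = self.samples.get(tuple(sorted(tags.items())), [])
        return sum(values) / len(values) if values else None


# One sample per finished turn: how many tool calls it made, tagged by memory mode.
tool_calls = Histogram("codex.turn.tool.call")
tool_calls.record(3, tmp_mem_enabled=True)
tool_calls.record(5, tmp_mem_enabled=True)
tool_calls.record(12, tmp_mem_enabled=False)

mean_with_memory = tool_calls.mean(tmp_mem_enabled=True)
mean_without_memory = tool_calls.mean(tmp_mem_enabled=False)
```

With the tag in place, "did memory mode change behavior?" becomes a comparison between two groups of samples rather than a guess.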

Nearby commit 49634b7f9c07, “add metric for per-turn token usage”, pushes the same idea deeper. Codex snapshots total token usage at turn start, diffs it at turn end, and emits histograms for total, input, cached_input, output, and reasoning_output.
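
The snapshot-and-diff approach can be sketched in a few lines of Python. This is an illustration under assumptions, not the Codex code: the five field names mirror the histograms listed above, while TokenUsage, turn_delta, and the numbers are made up for the example.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TokenUsage:
    """Cumulative session counters (hypothetical type; field names follow the article)."""
    total: int = 0
    input: int = 0
    cached_input: int = 0
    output: int = 0
    reasoning_output: int = 0


def turn_delta(start, end):
    """Per-turn usage: cumulative counters at turn end minus the snapshot at turn start."""
    return {f.name: getattr(end, f.name) - getattr(start, f.name)
            for f in fields(TokenUsage)}


# Illustrative numbers only.
at_turn_start = TokenUsage(total=1000, input=700, cached_input=200, output=300, reasoning_output=50)
at_turn_end = TokenUsage(total=1600, input=1100, cached_input=500, output=500, reasoning_output=120)

delta = turn_delta(at_turn_start, at_turn_end)
```

Each value in delta would then be recorded as one sample in the matching histogram, so a single expensive turn shows up on its own instead of disappearing into the session total.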

That is a meaningful shift in posture. Instead of treating cost as a fuzzy session-level haze, Codex is making each turn auditable: how much it spent, how many tools it used, and under what memory mode it ran.

Gemini CLI is making context acquisition visible

Google’s gemini-cli is approaching the same problem from a different angle.

Commit 7b4a822b0eba, “instrument file system tools for JIT context discovery”, introduces a shared discoverJitContext() helper and wires it into high-intent file tools like read-file. When the agent touches a path, Gemini can discover relevant subdirectory context and append it directly to tool output under a loud delimiter: --- Newly Discovered Project Context ---.

That delimiter matters. It turns hidden context loading into an observable event. You can now see, in the tool result itself, when the runtime pulled in extra project knowledge and what shape that knowledge took.

This is not just a convenience feature for the model. It is a breadcrumb for the human. If an agent suddenly starts acting more informed about a corner of the repo, Gemini is increasingly leaving clues about where that extra context came from.
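
A rough Python sketch of the pattern follows. The delimiter string is the one quoted above; everything else is an assumption made for illustration, including the function names, the in-memory dictionaries standing in for the file system, and the rule that a directory's context is appended only the first time it is discovered. Gemini CLI's actual discoverJitContext() is TypeScript and may behave differently.

```python
from pathlib import PurePosixPath

DELIMITER = "--- Newly Discovered Project Context ---"


def discover_jit_context(path, context_by_dir, already_loaded):
    """Collect context notes from the path's parent directories that are not loaded yet."""
    found = []
    for directory in PurePosixPath(path).parents:
        key = str(directory)
        if key in context_by_dir and key not in already_loaded:
            already_loaded.add(key)  # assumption: each directory's context is surfaced once
            found.append(context_by_dir[key])
    return "\n".join(found)


def read_file_tool(path, files, context_by_dir, already_loaded):
    """Return file contents, plus any newly discovered context under the delimiter."""
    output = files[path]
    extra = discover_jit_context(path, context_by_dir, already_loaded)
    if extra:
        output += f"\n\n{DELIMITER}\n{extra}"
    return output


# Hypothetical repo contents for the example.
files = {"src/billing/invoice.py": "def total(): ..."}
context_by_dir = {"src/billing": "Amounts are integer cents."}
loaded = set()

first_read = read_file_tool("src/billing/invoice.py", files, context_by_dir, loaded)
second_read = read_file_tool("src/billing/invoice.py", files, context_by_dir, loaded)
```

In this sketch the first read carries the delimiter and the billing note, and the second read returns only the file contents. Anyone scanning the tool results can see exactly which call pulled in the extra context.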

Then Gemini adds a trail marker for edit offers

Another nearby Gemini change sharpens the same forensic pattern.

Commit d7d53981f3c1, “add trajectoryId to ConversationOffered telemetry”, extends the telemetry emitted for agentic edit offers with a trajectoryId. In plain English: an offered code change can now carry an identifier that helps tie it back to the execution path that produced it.

That is the kind of detail that barely registers in a changelog and matters enormously in practice. Once agents generate more edits, more autonomously, the real product question is no longer just “can it edit?” It becomes “can I trace this edit back to the run that created it?”
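
Here is a small Python sketch of what that link buys you. Only the idea of attaching a trajectoryId to an edit-offer event comes from the commit; the EditOfferEvent type, its other fields, and the sample identifiers are hypothetical and do not reflect Gemini CLI's telemetry schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EditOfferEvent:
    """Hypothetical telemetry record for an offered edit."""
    offer_id: str
    file_path: str
    trajectory_id: Optional[str] = None  # absent on events emitted without the link


def offers_for_trajectory(events, trajectory_id):
    """Return every offered edit that was produced by the given run."""
    return [e for e in events if e.trajectory_id == trajectory_id]


# Made-up events for illustration.
events = [
    EditOfferEvent("offer-1", "src/app.py", trajectory_id="traj-a"),
    EditOfferEvent("offer-2", "src/db.py", trajectory_id="traj-b"),
    EditOfferEvent("offer-3", "src/app.py", trajectory_id="traj-a"),
    EditOfferEvent("offer-4", "README.md"),
]

from_run_a = [e.offer_id for e in offers_for_trajectory(events, "traj-a")]
```

Without the identifier, offer-4 is an orphan: the event shows that an edit was offered, but nothing connects it to the run that produced it.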

Same destination, different implementation style

Codex and Gemini CLI are not shipping the same mechanism.

Codex is making turns measurable through metrics: tool-call counts, token deltas, memory-mode tags.
Gemini CLI is making turns reconstructable through trace signals: newly discovered context, trajectory-linked edit telemetry.

Different code. Same product instinct.

Both teams seem to understand the next pain point of agent workflows: once people trust agents enough to do more work, they will immediately demand better ways to inspect that work after the fact.

Why this matters more than another demo feature

The first generation of agent UX was about spectacle.

Watch it plan. Watch it browse. Watch it edit. Watch it spawn helpers.

The next generation is going to be about evidence.

Why did this run get expensive?

Why did this turn explode into twelve tool calls?

What new context did the agent load before it changed that file?

Can I connect this patch back to the exact execution path that produced it?

That is not observability as a nice-to-have. That is trust infrastructure.

And there is a bigger strategic consequence here: the CLI that best explains its own agent behavior may end up feeling safer than the CLI that simply looks more powerful in a demo.

The pattern worth watching

For a while, agent tooling competed on visible capability. The quieter competition now is on forensic fidelity.

If that trend holds, the winning developer tools will not just help agents act. They will help humans replay the story of a run afterward: cost, context, choices, and all.

So here’s the open question: as agent tooling gets better at leaving a reconstructable trail, does that change where teams draw the line between “assist me” and “go do it”?

If you build agent products, watch this layer closely. Then push your own tools to leave sharper evidence behind. That is how agent trust graduates from a vibe into an interface.
