For a while, the AI CLI story sounded simple: type a prompt, let the agent do a thing, inspect the result.
That model is breaking down.
The interesting March changes in Codex and Gemini CLI suggest a different future. The terminal is turning into an agent workbench: a place where plans persist, tasks get tracked, extensions stay loaded, and work unfolds across runtime surfaces that outlive a single prompt.
The new product battle is not just who can execute commands. It is who can turn agent work into an operational state that survives long enough to manage.
Gemini is pushing planning into the default path
Google’s gemini-cli keeps adding evidence that planning is no longer a side quest.
The public release notes now say Plan Mode is enabled by default. That matters because defaults are product truth. A feature stops being “advanced” the moment the team expects ordinary users to encounter it on the main road.
Then the surrounding changes pile up in the same direction. Gemini added built-in research subagents to Plan Mode, introduced task tracker tools, and raised subagent turn and time limits. In the prompt layer, the repo literally instructs the model to treat an approved plan as the single source of truth and to invoke tracker tools to create and maintain tasks from that plan.
That is a meaningful architectural move. Planning is no longer just hidden reasoning. It is becoming structured operational state: a plan file, a task list, a delegated research loop, and a larger runtime budget to execute against them.
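To make "structured operational state" concrete, here is a minimal sketch of the pattern, not Gemini's actual implementation. Every name in it (Plan, Task, TaskTracker, Status) is hypothetical. The one idea it borrows from the source is that tasks are created only from an approved plan and are then updated as work proceeds.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"


@dataclass
class Task:
    id: int
    title: str
    status: Status = Status.PENDING


@dataclass
class Plan:
    """Hypothetical approved plan: an ordered list of step titles."""
    steps: list[str]
    approved: bool = False


@dataclass
class TaskTracker:
    """Hypothetical tracker whose tasks come only from an approved plan."""
    tasks: list[Task] = field(default_factory=list)

    def load_plan(self, plan: Plan) -> None:
        # The plan is the single source of truth: refuse unapproved plans.
        if not plan.approved:
            raise ValueError("only an approved plan may create tasks")
        self.tasks = [Task(id=i, title=t) for i, t in enumerate(plan.steps, start=1)]

    def update(self, task_id: int, status: Status) -> None:
        for task in self.tasks:
            if task.id == task_id:
                task.status = status
                return
        raise KeyError(task_id)

    def remaining(self) -> list[str]:
        return [t.title for t in self.tasks if t.status is not Status.DONE]


plan = Plan(steps=["research API", "write migration", "run tests"], approved=True)
tracker = TaskTracker()
tracker.load_plan(plan)
tracker.update(1, Status.DONE)
print(tracker.remaining())  # ['write migration', 'run tests']
```

The point of the sketch is that the task list is data a human can inspect between turns, rather than reasoning buried in a transcript.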
Codex is assembling the long-lived surfaces around the work
OpenAI’s codex is arriving at a similar destination from a different direction.
The March 26 Codex changelog is full of signs that the runtime wants to stick around. Plugins became a first-class workflow, with startup sync plus a dedicated /plugins surface for browsing, installing, and removing them. The app-server-backed TUI is now enabled by default. Subagents get readable path-based addresses for multi-agent v2. App-server clients can watch filesystem changes and connect to remote websocket services.
Put differently: Codex is building the bench, not just the hammer. The agent is no longer imagined as a one-off command runner. It sits inside a runtime with synced capabilities, named workers, persistent thread flows, and event-driven infrastructure around the work.
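What "named workers" with path-based addresses could look like is easiest to see in a toy model. The sketch below is an illustration under assumptions, not Codex's addressing scheme: the Agent class, the '/'-separated format, and the resolve and addresses helpers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    name: str
    children: dict[str, "Agent"] = field(default_factory=dict)

    def spawn(self, name: str) -> "Agent":
        # Names become path segments, so they must not contain the separator.
        if not name or "/" in name:
            raise ValueError("agent names must be non-empty and contain no '/'")
        child = Agent(name)
        self.children[name] = child
        return child


def resolve(root: Agent, address: str) -> Agent:
    """Walk a '/'-separated address from the root agent down to a worker."""
    parts = [p for p in address.split("/") if p]
    if not parts or parts[0] != root.name:
        raise KeyError(address)
    node = root
    for part in parts[1:]:
        node = node.children[part]  # raises KeyError if no such worker exists
    return node


def addresses(agent: Agent, prefix: str = "") -> list[str]:
    """List every address in the tree, parents before children."""
    here = f"{prefix}/{agent.name}"
    found = [here]
    for child in agent.children.values():
        found.extend(addresses(child, here))
    return found


root = Agent("root")
research = root.spawn("research")
research.spawn("docs")
root.spawn("tests")
print(addresses(root))
# ['/root', '/root/research', '/root/research/docs', '/root/tests']
```

A readable address is what lets a human or another agent refer to a specific worker later, which is the property a long-lived runtime needs.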
This is bigger than “better UX”
There is a temptation to file all of this under convenience. Don’t.
When Gemini promotes plans, trackers, and research helpers into the default execution path, and Codex promotes plugins, app-server threads, and addressable agents into the default runtime, both teams are making the same bet:
- agent work will span multiple steps,
- those steps need structure,
- that structure should live in product surfaces, not in prompt folklore.
That is the difference between an assistant and a workbench. An assistant waits for the next request. A workbench accumulates state: tools already loaded, tasks already tracked, plans already agreed, workers already named, threads already running.
The code backs up the product story
The repo details sharpen this reading.
In Gemini’s codebase, task tracking is not just marketing copy. The settings schema explicitly exposes an experimental.taskTracker capability, the docs describe session cleanup as removing implementation plans and task trackers, and the prompt snippets tell the model to treat the approved plan as the canonical operating document.
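A capability exposed in a settings schema is something a tool can branch on. The sketch below reads a dotted key such as experimental.taskTracker from a settings document. It assumes the setting is a boolean nested under an "experimental" object, which the source does not confirm, and get_setting is a hypothetical helper rather than part of gemini-cli.

```python
import json
from typing import Any


def get_setting(settings: dict[str, Any], dotted_key: str, default: Any = None) -> Any:
    """Look up a nested value by dotted path, returning default if any part is missing."""
    node: Any = settings
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Assumed shape: a boolean flag nested under "experimental".
raw = '{"experimental": {"taskTracker": true}}'
settings = json.loads(raw)

print(get_setting(settings, "experimental.taskTracker", default=False))  # True
print(get_setting({}, "experimental.taskTracker", default=False))        # False
```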
In Codex, the workbench pattern shows up in protocol and runtime layers. The app-server protocol defines typed notifications like skills/changed, the SDK centers repeated thread.run(...) flows, and the app-server docs describe streaming lifecycle events while turns are running. That is not just a prettier shell. It is a runtime designed to host ongoing agent work.
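Typed notifications matter because a client can react to them without polling. Here is a minimal sketch of that client-side pattern, assuming each notification arrives as a dictionary with "method" and "params" keys. The message shape, the "skills" payload field, and the NotificationRouter class are assumptions for illustration; only the skills/changed name comes from the source.

```python
from collections import defaultdict
from typing import Any, Callable

Handler = Callable[[dict[str, Any]], None]


class NotificationRouter:
    """Hypothetical client-side router keyed on the notification method name."""

    def __init__(self) -> None:
        self._handlers: defaultdict[str, list[Handler]] = defaultdict(list)

    def on(self, method: str, handler: Handler) -> None:
        self._handlers[method].append(handler)

    def dispatch(self, message: dict[str, Any]) -> int:
        """Deliver one notification and return how many handlers ran."""
        handlers = self._handlers.get(message.get("method", ""), [])
        for handler in handlers:
            handler(message.get("params", {}))
        return len(handlers)


loaded_skills: set[str] = set()


def refresh_skills(params: dict[str, Any]) -> None:
    # Replace the local view of capabilities with what the runtime reports.
    loaded_skills.clear()
    loaded_skills.update(params.get("skills", []))


router = NotificationRouter()
router.on("skills/changed", refresh_skills)
router.dispatch({"method": "skills/changed", "params": {"skills": ["lint", "deploy"]}})
print(sorted(loaded_skills))  # ['deploy', 'lint']
```

The client's view of available capabilities stays in sync with the runtime as long as the session is alive, which is the difference between a hosted runtime and a one-shot command.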
Why this matters now
The first wave of terminal agents competed on capability: can it edit files, run commands, browse, search, and spawn helpers?
The next wave will compete on operational coherence: can it keep a plan alive, keep workers organized, keep capabilities in sync, and let humans step back into the job without reconstructing the whole situation from scratch?
That is a much harder moat to build.
It touches defaults, runtime architecture, protocol shape, docs, task semantics, and the invisible handoff between one turn and the next.
And it may matter more than raw benchmark wins, because real software work rarely happens in one heroic prompt. It sprawls. It pauses. It gets resumed. It branches. It needs bookkeeping.
The pattern worth watching
Last month’s terminal agent race was about who could act like an operator. This month’s deeper race is about who can act like an environment.
If Codex keeps turning plugins, threads, and app-server surfaces into default workflow primitives, and Gemini keeps turning plans, trackers, and research delegation into default execution structure, the CLI stops being just a doorway to an agent.
It becomes the workshop where the agent’s work is laid out, labeled, resumed, and pushed forward.
So here’s the open question: when planning and task state become first-class runtime objects, does the winning agent feel less like a chatbot in your terminal and more like a lightweight operating system for work?
If you build agent tools, watch the state layer, then make yours easier to resume, inspect, and steer. That is where the next real product advantage is starting to form.