CLI Agents vs AI-Native IDEs: The Token Economics

Posted on 25th June 2026 by Rodrigo Silva

Here is the benchmark that’s been driving architecture decisions across the industry in 2026: a typical CLI command costs an agent around 200 tokens. The equivalent operation through an MCP server costs 32,000 to 82,000 tokens.

That’s not a typo, and it’s not cherry-picked. Independent benchmarks from Scalekit, Apideck, and others keep landing in the same range — roughly a 35× overhead for MCP on identical tasks. When “MCP is dead. Long live the CLI” hit the top of Hacker News and Perplexity’s CTO publicly described moving away from MCP internally over context waste, they were all pointing at the same arithmetic.

This series is about that arithmetic — where it’s real, where it isn’t, and how to design around it. We start with the cost itself, because once you see where the tokens go, every later decision gets easier.

Where the tokens actually go

The gap comes down to one architectural difference: when does the agent pay for a tool’s definition?

MCP loads everything, always. When an agent connects to an MCP server, the entire tool catalog — every tool’s name, description, and full input/output JSON schema — is injected into the context window. It sits there on every single completion request, whether the agent calls ten tools or zero. The GitHub MCP server exposes ~93 tools; loading it costs roughly 55,000 tokens before the agent reads its first instruction. The agent is carrying schemas for creating gists, configuring webhooks, and managing PR reviews even when all it wants is the repo’s primary language.

CLI pays only when it calls. A command-line agent starts with zero tool context. When it needs GitHub, it runs gh repo view — and the model already knows gh from training, so the command plus its output might cost 200 tokens. No catalog. No schema. No discovery step that loads 92 tools it will never touch.

Diagram illustrating the distribution of tokens in a 200,000-token context window for CLI and MCP agents, including token usage and reasoning capacity.

The stacking problem

A single server is survivable. The trouble is that real agents connect several: GitHub, a database connector, a project tracker, a cloud provider. A widely-cited Apideck measurement shows just three MCP servers consuming 143,000 of a 200,000-token window, so about 72% is gone before the agent reads its first user message.

Now do the cost math at production scale. At roughly $3 per million input tokens, 55,000 tokens of schema is about $0.165 per session, counting the catalog only once per session. Run 10,000 automated sessions a day (an unremarkable volume for a production pipeline) and you’re spending roughly $1,650 every day just loading tool definitions, before the agent solves anything. That’s the line item teams started calling the “MCP tax.”

The cost you can’t see: reasoning budget

Token cost is the headline, but it’s not the most important number. The deeper problem is cognitive.

A context window is also the agent’s working memory. Every token spent on tool schemas is a token not available for reasoning about the actual task. When 70% of the window is consumed by definitions, the model is trying to think in the cramped space that’s left — and quality degrades, especially late in a long task when accumulated tool output has pushed important context toward the edges of the window where attention is weakest.

This is why the cost gap reappears as a reliability gap. In Scalekit’s benchmark, CLI agents completed tasks with 100% reliability while the MCP equivalents came in at 72% — and most of the MCP failures weren’t logic errors but connection timeouts to a remote server. On a token-efficiency score (work completed per token spent), CLI scored 202 to MCP’s 152, a 33% advantage: the CLI agent spent its tokens on solving the problem instead of on protocol overhead.

Why CLI is good, not just cheap

It’s tempting to stop at “CLI uses fewer tokens,” but that misses the real reason it works so well. Models have been trained on decades of terminal interactions — Stack Overflow answers, GitHub histories, Dockerfiles, jq pipelines, git invocations, Kubernetes manifests. Shell tooling lives in the model’s weights as latent knowledge. When an agent composes gh pr list --json number,title | jq '.[] | select(...)', it’s operating from prior knowledge, not parsing a schema it met for the first time three tokens ago.

MCP schemas, by contrast, carry zero pretraining advantage. They’re custom JSON the model has never seen, that must be read and interpreted fresh on every run. The token savings of CLI are almost a side effect; the structural win is that the model already fluently speaks the interface.

Measure your own context budget

Before we build anything, do the one exercise that makes this concrete for your stack: measure what your tool integrations cost on idle. Here’s a quick way to tally MCP schema overhead with a tokenizer that approximates the one your model uses (exact counts vary by model):

# pip install tiktoken
import json, tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # close enough for an estimate

def tokens(text: str) -> int:
    return len(enc.encode(text))

# Paste in the tool catalog your MCP client advertises (the result of tools/list,
# including each tool's full JSON schema). Many clients can dump this.
with open("mcp_tools_dump.json") as f:
    catalog = json.load(f)

total = 0
for tool in catalog["tools"]:
    cost = tokens(json.dumps(tool))     # name + description + input/output schema
    total += cost
    print(f"{tool['name']:<32} {cost:>6} tokens")

print(f"\nALWAYS-ON SCHEMA OVERHEAD: {total:,} tokens")
print(f"As share of a 200k window:  {total/200_000:.0%}")
print(f"Per-session cost @ $3/1M:   ${total * 3 / 1_000_000:.3f}")
print(f"Daily @ 10k sessions:       ${total * 3 / 1_000_000 * 10_000:,.0f}")

Now compare against the CLI baseline: the discovery cost of a CLI tool is whatever your-tool --help returns — usually 150–600 tokens, paid once, only if the agent is unsure. Run this against your real tool catalog and the abstract benchmark becomes your actual API bill.
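To put a number on the CLI side of that comparison, here is a minimal sketch that runs a tool’s `--help` and estimates its token cost. The `rough_tokens` heuristic (about four characters per token) and the `help_cost` helper are assumptions made for illustration; swap in the tokenizer from the script above for tighter numbers.

```python
import subprocess
import sys

def rough_tokens(text: str) -> int:
    # Assumption: ~4 characters per token, a common rule of thumb for English text.
    return max(1, len(text) // 4)

def help_cost(argv: list, count=rough_tokens) -> int:
    """Run `<tool> --help` and return the estimated token cost of its output."""
    r = subprocess.run(argv + ["--help"], capture_output=True,
                       text=True, timeout=30)
    return count(r.stdout + r.stderr)

if __name__ == "__main__":
    # The Python interpreter stands in for "your-tool" so the sketch runs anywhere.
    print(f"--help cost: ~{help_cost([sys.executable])} tokens")
```

Replace `[sys.executable]` with something like `["gh", "pr", "list"]` to measure the subcommands your agent actually reaches for.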

The honest caveat (so you don’t over-correct)

Everything above is real, but it benchmarks the slice of the world where a CLI exists and the agent controls execution. That’s a large and important slice, and it’s exactly where production pipelines live, but it isn’t everything. There is no Workday CLI, no Greenhouse CLI; for multi-tenant products where an agent acts on behalf of a specific user, the schema tax is buying something real (identity, scope, audit) that a raw shell can’t provide. We’ll give that side a full and fair hearing later in this series. For now, hold the cost picture clearly, because it’s the force pushing terminal-based agents into production pipelines, and it’s earned.

CLI Agents vs AI-Native IDEs: Building CLI-First Agents

A CLI-first agent is almost embarrassingly simple in concept: instead of wiring the agent to a catalog of pre-declared tools, you give it one tool — a shell — and let it compose commands. The sophistication isn’t in the plumbing; it’s in how you shape the agent’s knowledge and contain its blast radius. Let’s build it up piece by piece.

The core loop: one tool, a whole toolbox

The entire tool surface is a single bash function. The model writes a command; you run it; you feed back the output. That’s it.

import subprocess
from anthropic import Anthropic

client = Anthropic()

BASH_TOOL = {
    "name": "bash",
    "description": "Run a shell command and return its stdout/stderr.",
    "input_schema": {
        "type": "object",
        "properties": {"cmd": {"type": "string"}},
        "required": ["cmd"],
    },
}

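def run(cmd: str) -> str:
    try:
        r = subprocess.run(cmd, shell=True, capture_output=True,
                           text=True, timeout=120)
    except subprocess.TimeoutExpired:
        # Report the timeout to the model instead of crashing the agent loop
        return "ERROR: command timed out after 120s"
    return (r.stdout + r.stderr)[:10_000]   # cap to protect the context window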

def agent(task: str, system: str):
    messages = [{"role": "user", "content": task}]
    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-6", max_tokens=2000,
            system=system, tools=[BASH_TOOL], messages=messages,
        )
        messages.append({"role": "assistant", "content": resp.content})
        tool_calls = [b for b in resp.content if b.type == "tool_use"]
        if not tool_calls:
            return resp                      # agent is done
        results = []
        for call in tool_calls:
            output = run(call.input["cmd"])
            results.append({"type": "tool_result",
                            "tool_use_id": call.id, "content": output})
        messages.append({"role": "user", "content": results})

Notice what’s absent: no per-tool schemas, no MCP catalog, no discovery payload. The agent already knows gh, git, jq, az, kubectl, psql, and a thousand other commands from pretraining. The single bash tool definition costs a few dozen tokens, and that’s the entire fixed overhead — versus the 55,000 tokens a GitHub MCP server would put on every request.

Composability: the property MCP can’t match

Token cost is the famous argument, but composability is the one practitioners care about more once they’ve felt it. MCP tools don’t chain — you can’t pipe one tool’s output into another; each call is a separate round-trip with the agent shuttling intermediate state back and forth, paying tokens and latency every hop.

A shell pipeline does the whole thing in one shot:

# gh lists only 30 PRs by default, so raise the cap explicitly
gh pr list --state merged --limit 1000 --json number,title,mergedAt \
  | jq '[.[] | select(.mergedAt > "2026-05-01")] | length'

That’s “how many PRs merged since May 1st” as a single command. The agent composes the pipe, the shell executes it locally, and one number comes back. The MCP equivalent is: call list_pull_requests, get a large JSON blob into context, reason over it, maybe call again with pagination, filter in the agent’s head. More tokens, more latency, more failure surface.

Diagram illustrating a CLI-first agent that composes pipelines using a single bash tool. It shows the process of executing a shell command locally and integrating various local CLIs such as git, jq, psql, kubectl, and az.

This is just the Unix philosophy — small tools that do one thing, composed with pipes — and it turns out agents are excellent at it, because the training data is full of exactly these one-liners.

The 800-token skill file (the best ROI in the benchmark)

Raw CLI works, but there’s a refinement that’s almost free and pays for itself immediately. Instead of loading 28,000 tokens of MCP schema, give the agent a tiny skill file: a few hundred tokens of plain-markdown tips about the tools it has.

In Scalekit’s benchmark, an 800-token markdown file of gh tips beat 28,000 tokens of MCP schemas — the skill-augmented agent made about a third fewer tool calls and finished about a third faster than even the naive CLI agent. It was the single best return on investment they measured. A skill file looks like this:

# GitHub (gh) tips for this repo

- Auth is already configured; never run `gh auth`.
- Prefer `--json <fields>` + `jq` over scraping human-readable output.
- Useful fields: number, title, state, mergedAt, author, labels.
- List recent merged PRs:
    gh pr list --state merged --limit 50 --json number,title,mergedAt
- This repo's default branch is `main`. CI is GitHub Actions.
- NEVER use: `gh repo delete`, `git push --force`.

You load that string into the system prompt. It costs a rounding error in tokens and dramatically sharpens behavior, because it encodes the two things training data can’t: your conventions and the specific flags that matter for this repo. This is also the cleanest place to put guardrails the agent should always honor.

SKILL = open("skills/github.md").read()           # ~800 tokens
SYSTEM = (
    "You are a CLI agent. You have a bash tool. Prefer composing pipelines. "
    "Use --help if unsure about a command's flags.\n\n" + SKILL
)
agent("How many PRs merged since May 1st?", SYSTEM)

`--help` as just-in-time documentation

What about a CLI the agent doesn’t know well, or a tool with unusual flags? You don’t pre-load its manual. You let the agent fetch documentation on demand: if it’s unsure, it runs some-tool --help and pays a few hundred tokens for exactly the information it needs, exactly when it needs it. This is the same pay-per-use principle we’ve seen previously, applied to documentation: progressive disclosure instead of always-on schema. The agent’s instinct to do this is worth encouraging explicitly in the system prompt, as above.

The part nobody likes to talk about: a shell is a loaded gun

Here’s the honest cost of all this power. The single bash tool that makes CLI agents efficient and composable also hands the model rm -rf, git push --force, DROP TABLE, and arbitrary code execution. MCP’s much-maligned schema overhead is partly buying something: an agent can only call tools that were explicitly declared, so the blast radius is bounded by design. A shell has no such boundary. You are always one bad generation away from something you can’t undo.

So a CLI-first agent is only production-ready with containment. The non-negotiables:

Sandbox execution. Run commands inside a container or VM with no access to production credentials, scoped to a throwaway working directory — the same isolation discipline any agentic system needs.
Least privilege. The environment gets only the credentials and network access the task requires. An agent summarizing PRs does not need write access to the repo or your database URL in its environment.
A deny list and/or approval gate. Block destructive verbs outright (rm -rf, force-push, DROP, DELETE FROM without a WHERE), and require human approval for anything that mutates state. The skill file’s “NEVER use” section helps, but never rely on the model’s compliance as your only control.
Output caps and timeouts. Bound stdout (as in run() above) so a runaway command can’t flood — and evict — the context window.

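# Note: "delete from" blocks every DELETE, with or without a WHERE clause
DENY = ("rm -rf", "git push --force", "git push -f", " drop table",
        "delete from", "mkfs", ":(){", "> /dev/sd")

def run(cmd: str) -> str:
    # Pad with spaces so patterns that begin with a space also match at the start
    low = f" {cmd.lower()} "
    if any(bad in low for bad in DENY):
        return "BLOCKED: destructive command requires human approval."
    # ...then execute inside the sandbox as before
    r = subprocess.run(cmd, shell=True, capture_output=True,
                       text=True, timeout=120)
    return (r.stdout + r.stderr)[:10_000]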

A deny list is a backstop, not a security boundary — the real boundary is the sandbox. Treat the shell’s power as something you deliberately fence in, not something you trust the model to wield carefully.
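The containment points above can be sketched as a slightly more defensive executor. This is a minimal sketch and not a sandbox: it assumes the process already runs inside a container or VM, and only adds a scrubbed environment, a throwaway working directory, a timeout, and an output cap that keeps both the head and the tail. The names `run_contained` and `clip` are hypothetical, and the example assumes a POSIX shell.

```python
import os
import subprocess
import tempfile

def clip(text: str, limit: int = 10_000) -> str:
    """Keep the head and tail of long output; errors often appear at the end."""
    if len(text) <= limit:
        return text
    half = limit // 2
    return text[:half] + "\n...[output truncated]...\n" + text[-half:]

def run_contained(cmd: str, timeout: int = 120) -> str:
    # Assumption: real isolation comes from the surrounding container or VM.
    # Passing only PATH keeps inherited secrets out of the child environment.
    env = {"PATH": os.environ.get("PATH", "/usr/bin:/bin")}
    with tempfile.TemporaryDirectory() as workdir:   # throwaway working directory
        try:
            r = subprocess.run(cmd, shell=True, cwd=workdir, env=env,
                               capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return f"TIMEOUT: command exceeded {timeout}s"
    return clip(r.stdout + r.stderr)
```

Least privilege then becomes an explicit decision: any credential the task needs has to be added to `env` on purpose rather than leaking in from the parent process.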

What we’ve built — and what it can’t do

We now have a CLI-first agent that’s cheap (one tool, no schema tax), composable (it pipes), well-informed (a tiny skill file), and contained (sandbox + deny list). For developer-facing work and deterministic production pipelines where a mature CLI exists, this design is hard to beat.

But re-read that sentence: where a mature CLI exists and where you control execution. The moment your agent needs to act on behalf of a specific customer, inside a multi-tenant SaaS system that ships only an OAuth API and no shell, this whole approach runs out of road — and the schema tax you’ve been avoiding turns out to be the price of something you now actually need. That boundary, and how to decide on which side of it any given integration falls, is what we’ll see next.

CLI Agents vs AI-Native IDEs: When to Use Which

If you’ve read the above, you might think the verdict is in: CLI is cheaper, more reliable, more composable, so use it everywhere. That conclusion is wrong, and the teams that act on it create a different, quieter class of problem. The token benchmark is real — but it measures roughly the 5% of integrations where a CLI even exists and where you control execution. The other 95% of enterprise surfaces are a different world. Now, we’ll give that world its due and then hand you a framework that makes the choice mechanical.

Where MCP and AI-native IDEs actually win

Be fair to the other side, because the other side is right about several things.

Services with no shell. There is no Workday CLI, no Greenhouse CLI, no BambooHR CLI — and there never will be. These are SaaS systems with OAuth APIs, custom subdomain routing, refresh tokens, and org-level access control. MCP was built precisely for these. When the only integration a vendor ships is an API behind OAuth, the “just use the CLI” advice has nothing to point at.

Acting on behalf of a specific user. A CLI agent runs in your shell with your ambient credentials. That’s fine when you are the user. It’s a non-starter when an agent acts for a specific customer across a specific tenant. MCP’s model — explicit tool declarations, per-user OAuth 2.1 with PKCE, scope enforcement, the ability to revoke one user without touching everyone else — is buying governance the schema tax pays for. As one widely-shared analysis put it: the properties that make MCP expensive are the same properties that make it governable.

Audit and compliance. Structured tool calls with declared inputs produce clean audit trails. “The agent ran some bash” does not. In regulated workflows, that structure isn’t overhead — it’s the requirement.
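As a rough illustration of why structured calls audit cleanly, the sketch below serializes one tool call as a JSON line. The field names, the `audit_record` helper, and the `update_invoice` tool are hypothetical and not part of any MCP SDK.

```python
import json
from datetime import datetime, timezone

def audit_record(user_id: str, tool: str, arguments: dict) -> str:
    """Serialize one tool call as a JSON line for an append-only log."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,       # who the agent acted for
        "tool": tool,             # the declared tool name
        "arguments": arguments,   # the declared inputs
    }, sort_keys=True)

# Hypothetical call: every field is queryable later, unlike a raw shell string.
line = audit_record("user-123", "update_invoice",
                    {"invoice_id": "inv-42", "status": "paid"})
print(line)
```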

The interactive IDE experience. AI-native IDEs (Cursor, Windsurf, Copilot-style tools) lean on always-on rich context and MCP integrations on purpose: it’s what makes inline exploration, hovering, and conversational iteration feel seamless for a human in the loop. A Sales Director shouldn’t have to read a stderr traceback. The token cost buys a UX that a headless shell simply doesn’t offer. The catch is that this advantage is about interactive use — which is exactly the part a production pipeline doesn’t have.

And the reliability gap from above deserves an asterisk: most MCP failures in the benchmarks were connection timeouts to remote servers — infrastructure problems, not protocol problems. An MCP gateway (one that filters schemas down to the relevant tools, pools connections, and centralizes auth) closes much of both the cost and reliability gap. So does lazy schema loading (Anthropic’s Tool Search, shipped late 2025), which defers pulling a tool’s full schema until it’s actually needed. The naive 55,000-token connection is a worst case, not a law of nature.
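The schema-filtering part of a gateway can be sketched in a few lines. This is an illustration under stated assumptions, not how any particular gateway product works: the catalog is a hypothetical `tools/list`-style dict with 90 placeholder tools, and `rough_tokens` uses a four-characters-per-token rule of thumb.

```python
import json

def filter_catalog(catalog: dict, allowed: set) -> dict:
    """Return a copy of the catalog that contains only the allowed tools."""
    return {"tools": [t for t in catalog["tools"] if t["name"] in allowed]}

def rough_tokens(obj) -> int:
    # Assumption: ~4 characters per token; use a real tokenizer for billing numbers.
    return len(json.dumps(obj)) // 4

# Hypothetical catalog: 90 tools that share one placeholder schema.
catalog = {"tools": [
    {"name": f"tool_{i}",
     "description": "placeholder " * 20,
     "inputSchema": {"type": "object",
                     "properties": {"arg": {"type": "string"}}}}
    for i in range(90)
]}

slim = filter_catalog(catalog, {"tool_1", "tool_2", "tool_3"})
print(rough_tokens(catalog), "->", rough_tokens(slim))
```

The agent then sees three schemas instead of ninety, and the always-on overhead shrinks roughly in proportion.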

The reframe: it was never “CLI vs MCP”

Here’s the insight that makes the whole debate dissolve. MCP and CLI don’t sit on the same axis. Treating them as competing transports is a category error — like arguing whether to use an enterprise service bus or an API. They operate on different planes of the agent stack, and most well-designed systems use all of them at once.

Diagram illustrating three planes: Knowledge plane with skills and prompts, Execution plane focusing on CLI tools, and Governance plane for multi-tenant SaaS systems.

Execution plane → CLI. Developer-facing agents, local tooling, code operations, infrastructure-as-code — anything with a mature shell interface the base model has seen in training. Accept the modest cold-start of describing the tools; harvest the long tail of pretraining familiarity.
Governance plane → MCP. Customer-facing agents, multi-tenant SaaS, systems of record, regulated workflows — any surface that requires per-request identity, scope enforcement, or audit. Spend the schema tokens here; they’re buying compliance.
Knowledge plane → Skills. Domain procedures, company conventions, playbooks. These aren’t tools at all. They belong in skill files and prompts. Wrapping a procedure in an MCP schema or a CLI is the most common over-engineering mistake — it’s instructions cosplaying as a transport.

Confusing the planes produces the exact pathologies the industry has been cataloguing all year: MCP servers wrapping shell commands that burn tokens for zero governance benefit; CLIs bolted onto SaaS integrations that leak credentials and lose audit trails; skills written as MCP tools, duplicating schema that should have been three lines of markdown.

The decision framework

Stop asking “MCP or CLI?” Ask three questions about each tool integration, in order:

Flowchart illustrating the decision-making process for choosing a transport method in tool integration, including questions about model maturity, trust boundaries, and procedural versus tool classification.

Does this tool have a mature shell interface the base model already knows? (git, gh, kubectl, psql, az, aws, jq…) → Use CLI. The token savings are a side effect; the real win is operating from pretrained knowledge.
Does this action cross a trust boundary needing per-user identity, scope enforcement, or audit? (a customer’s CRM, a tenant’s billing) → Use MCP, ideally behind a gateway. The schema cost is buying something no shell provides.
Is this actually a procedure or convention dressed up as a tool? (a runbook, a house style, a multi-step playbook) → Put it in a skill or prompt. Don’t wrap it in any transport.

Decide this per integration, not per system. Your agent will almost certainly use all three.
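The three questions can be written down as a small function, which makes the ordering explicit. This is a sketch: the `Integration` fields and the fallback branch are assumptions added for illustration, since the framework itself does not define what happens when every answer is no.

```python
from dataclasses import dataclass

@dataclass
class Integration:
    name: str
    has_mature_cli: bool          # Q1: shell interface the base model already knows
    crosses_trust_boundary: bool  # Q2: needs per-user identity, scope, or audit
    is_procedure: bool            # Q3: a runbook or convention, not a tool

def choose_plane(i: Integration) -> str:
    # Questions are asked in the order given in the text; the first "yes" wins.
    if i.has_mature_cli:
        return "CLI"
    if i.crosses_trust_boundary:
        return "MCP (ideally behind a gateway)"
    if i.is_procedure:
        return "skill or prompt"
    return "unclassified: review manually"   # assumption: no fallback is defined

for item in (Integration("gh", True, False, False),
             Integration("tenant CRM", False, True, False),
             Integration("release runbook", False, False, True)):
    print(f"{item.name:<16} -> {choose_plane(item)}")
```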

What this means for production pipelines

The reason terminal-based agents are winning production pipelines specifically falls right out of the framework. A pipeline is headless and batched — there’s no human enjoying the IDE’s interactive UX, so that entire side of MCP’s value proposition is absent. And pipelines are dominated by deterministic operations: run tests, build, lint, query a database, transform files, hit git and gh. Those are textbook execution-plane work — CLI territory — where the token efficiency compounds across thousands of runs and the 100% reliability matters because nobody’s watching.

But “pipeline = all CLI” is still too simple. A mature pipeline is a hybrid:

Deterministic steps run as CLI / scripts / hooks — the bulk of the work, cheap and reliable.
The few steps that touch a governed external system go through MCP — fetching a customer record, posting to a per-tenant SaaS — ideally via a gateway that filters schemas so you pay for the three tools you use, not the ninety you don’t.
The pipeline’s domain logic lives in skills — what “done” means, your conventions, the order of operations — not baked into either transport.

That’s the real end state. Not “CLI beat MCP,” but a pipeline where each integration sits on its correct plane, the deterministic majority runs as efficient shell commands, and the governed minority pays the schema tax precisely where it buys something.

The bottom line

If the integration is…	Use	Because
A tool with a mature CLI the model knows	CLI	Pretrained fluency + ~200 tokens vs ~35× for MCP
A SaaS system with no shell, behind OAuth	MCP	The only option that handles tenant identity
Acting on behalf of a specific customer	MCP (gateway)	Per-user auth, scope, revocation, audit
A deterministic step in a headless pipeline	CLI	Cheaper, 100% reliable, composable, no UX needed
A procedure, convention, or playbook	Skill	It’s instructions, not a tool — no transport
Multi-tenant infra needing both	Both	CLI execution plane + MCP governance plane

The token economics are what made everyone look, and they’re genuinely the reason CLI agents are taking over production pipelines. But the durable lesson is the one underneath: match each integration to its plane. Do that and you stop paying the MCP tax where it buys nothing, stop leaking credentials where you need governance, and stop wrapping instructions in schemas. The transport stops being a religion and goes back to being an implementation detail — which is exactly where it belongs.

MCP vs A2A: Two Protocols, Two Different Problems

Posted on 24th June 2026 by Rodrigo Silva

If you’ve built anything with AI agents recently, you’ve heard two acronyms thrown around constantly, often with wildly conflicting takes: “MCP is the USB-C of AI.” “A2A replaces MCP.” “You need both.” “Neither is production-ready.”

Most of that noise comes from a single confusion. MCP and A2A solve completely different problems, and treating them as competitors — or worse, as interchangeable — is one of the most common and most expensive architecture mistakes in the space right now. Get the distinction wrong and your system fights you at every layer: you’ll find yourself wrapping databases as “agents,” building bespoke RPC glue between services that should just speak a standard, or reaching for the heavier protocol when the lighter one was the right call.

Here’s the entire thesis of this series in two sentences:

MCP handles how an agent talks to tools. A2A handles how agents talk to each other.

Everything else follows from that. Let’s make it precise.

Two problems that look similar and aren’t

Before either protocol existed, two distinct pain points kept showing up in agentic systems.

Problem 1 — tool integration. A single agent needs to do things in the world: query a database, hit an API, read a file, send an invoice. Every integration was a bespoke one-off. If you had M agent applications and N tools, you were maintaining M×N custom integrations, each inventing its own auth, schema, and error handling. This is a vertical problem — it’s about one agent reaching down to capabilities.

Problem 2 — agent coordination. As teams built multi-agent systems, there was no standard way for agents to discover each other, advertise what they could do, or hand off work. If you had a research agent and a writing agent, wiring them together meant gluing orchestration logic into application code. Swap one agent for another and you rewrote the integration. This is a horizontal problem — it’s about agents reaching across to peers.

MCP is the purpose-built answer to Problem 1. A2A is the purpose-built answer to Problem 2. Understanding that they sit on different axes is the key to everything else.

A diagram comparing MCP (Model Context Protocol) and A2A (Agent2Agent) communication, featuring two perpendicular axes. The diagram includes three agents (Agent B, Your Agent, and Agent C) connected by A2A links, and shows how the Your Agent communicates with a database, API, and files via MCP.

MCP: the vertical protocol

The Model Context Protocol (MCP), created by Anthropic and released in November 2024, standardizes how an AI agent connects to external tools, data, and services. Think of it as a universal connector — the analogy people reach for is USB-C: one standard plug instead of a drawer full of proprietary cables.

Concretely, an MCP server exposes capabilities through three primitives — Tools (actions the model can invoke), Resources (read-only data it can fetch), and Prompts (reusable templates) — and any MCP-compatible host can discover and use them without bespoke integration code. The M×N integration explosion collapses into M+N: build a server once, and every MCP-aware host can use it.
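The M×N to M+N collapse is easy to check with concrete numbers. The figures below (5 hosts, 20 tools) are made up for illustration.

```python
def bespoke_integrations(hosts: int, tools: int) -> int:
    return hosts * tools      # every host wires up every tool by hand

def standardized_components(hosts: int, tools: int) -> int:
    return hosts + tools      # one client per host plus one server per tool

print(bespoke_integrations(5, 20), standardized_components(5, 20))   # prints "100 25"
```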

MCP is, at its core, synchronous and request-response: the agent asks the tool to do something and gets an answer back, usually in the same breath. (An experimental Tasks primitive for longer-running work arrived in the late-2025 spec, but the model’s center of gravity is still fast, synchronous tool calls.)

A2A: the horizontal protocol

The Agent2Agent protocol (A2A), created by Google and released in April 2025 with 50+ launch partners, standardizes how independent agents — built by different teams, on different frameworks, hosted on different networks — discover each other, delegate tasks, and exchange results.

Its defining trait is the opposite of MCP’s. A2A is intentionally stateful and asynchronous: it assumes tasks can be long-running, multi-step, and human-in-the-loop from the start. A client agent delegates work to a remote agent, and — crucially — the remote agent does that work without access to the client’s internal context, memory, or tools. That opacity is a feature: it respects the autonomy and privacy boundaries you need when one company’s agent calls another’s. The analogy here is HTTP: it doesn’t care whether the other side runs Rails, Django, or Go; it just defines the shape of the conversation.
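To make “stateful and asynchronous” concrete, here is a small sketch of a task with a lifecycle. The state names are modeled on the task states in the A2A specification as I understand them, and the transition table is an illustrative assumption rather than something copied from the spec; check the current A2A documentation before relying on either.

```python
from enum import Enum

class TaskState(Enum):
    SUBMITTED = "submitted"
    WORKING = "working"
    INPUT_REQUIRED = "input-required"   # paused, waiting on a human or the caller
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELED = "canceled"

# Illustrative transitions only; terminal states have no outgoing edges.
ALLOWED = {
    TaskState.SUBMITTED: {TaskState.WORKING, TaskState.FAILED, TaskState.CANCELED},
    TaskState.WORKING: {TaskState.INPUT_REQUIRED, TaskState.COMPLETED,
                        TaskState.FAILED, TaskState.CANCELED},
    TaskState.INPUT_REQUIRED: {TaskState.WORKING, TaskState.FAILED,
                               TaskState.CANCELED},
}

class Task:
    def __init__(self, task_id: str):
        self.id = task_id
        self.state = TaskState.SUBMITTED
        self.history = [self.state]

    def advance(self, new_state: TaskState) -> None:
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state
        self.history.append(new_state)

task = Task("task-1")
for step in (TaskState.WORKING, TaskState.INPUT_REQUIRED,
             TaskState.WORKING, TaskState.COMPLETED):
    task.advance(step)
print([s.value for s in task.history])
```

The contrast with an MCP tool call is that the caller holds a task identifier across many exchanges instead of waiting on a single response.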

They are designed to compose, not compete

The single most important takeaway: these protocols are complementary. Most real production multi-agent systems use both — A2A to coordinate between agents, and MCP so each agent can reach its own tools and data. They live on perpendicular axes, so they never actually overlap:

	MCP	A2A
Question it answers	How does my agent use a tool?	How does my agent talk to another agent?
Axis	Vertical (agent → tools/data)	Horizontal (agent ↔ agents)
Relationship	Agent → capability	Peer ↔ peer
Interaction style	Synchronous request/response	Stateful, async, long-running
Context sharing	Tool runs inside the agent’s context	Remote agent has no access to caller’s context
Created by	Anthropic (Nov 2024)	Google (Apr 2025)
On the wire	JSON-RPC 2.0 over stdio / Streamable HTTP	JSON-RPC 2.0 over HTTP(S) + SSE
Unit of work	A tool call	A Task (with a lifecycle)

Both are now neutral, governed standards

A fair worry in 2025 was vendor lock-in: MCP was “an Anthropic thing,” A2A “a Google thing.” That’s no longer the situation, and it matters for adoption decisions.

Both protocols now live under the Linux Foundation. Google contributed A2A to the Linux Foundation in mid-2025, and Anthropic donated MCP to its Agentic AI Foundation (AAIF) in December 2025. Both have crossed the production-maturity threshold: MCP’s current spec revision is the November 2025 release, with 10,000+ public servers and native support across every major model provider; A2A reached a stable v1.0 in early 2026, with signed Agent Cards, SDKs in five languages, and 150+ production organizations. Neither is a moving target anymore, and neither is owned by a single vendor.

When you’re staring at a design decision, ask one question:

Is the thing on the other end a capability, or is it a peer?

A database, an API, a file system, a payment processor, a search index — those are capabilities. They reach down. Use MCP.
Another autonomous agent — owned by a different team or vendor, running its own logic, that you want to delegate a goal to and get a result back from — that’s a peer. It reaches across. Use A2A.

And when you have a fleet of agents that each need their own tools? You use both, on their respective axes.

MCP in Practice — How an Agent Talks to Tools

MCP gives a model a standardized contract for reaching the outside world. Instead of every host hand-coding every integration, a server declares what it can do, describes it with schemas, and any MCP-compatible host can discover and use it.

The architecture: host, client, server

MCP has three roles, and keeping them straight prevents most confusion:

Host — the AI application the user interacts with (Claude Desktop, Claude Code, Cursor, your own app). The host owns the model and the conversation.
Client — a connector the host instantiates, one per server, that manages a single stateful session. A host talking to three servers runs three clients.
Server — a program that exposes capabilities. It can run locally (a subprocess on your machine) or remotely (a hosted HTTPS service).

Underneath, MCP is JSON-RPC 2.0 split into two layers: a data layer (the message schema and primitives) and a transport layer (how those messages travel). That separation is why the same primitives work identically whether the server is a local subprocess or a remote service.

Diagram illustrating the MCP architecture and message flow, showing Host and Server interactions, including the Model and Client roles.

The three server primitives

Everything a server offers falls into exactly three buckets. Getting the mapping right is most of good MCP design.

Tools are functions the model can invoke to perform actions — they have side effects, like POST endpoints. Discovered via tools/list, executed via tools/call.

Resources are read-only data the model can fetch for context — no side effects, like GET endpoints. Discovered via resources/list, read via resources/read.

Prompts are reusable templates that structure a recurring interaction. Discovered via prompts/list, retrieved via prompts/get. (In practice this is the least-used primitive — most teams keep prompts in host-side code — but it’s the right tool for domain-specific servers that want to enforce a house workflow.)

A clean way to remember it: Tools do, Resources show, Prompts guide.

Here’s a complete small server using FastMCP (the standard Python library, 3.x as of 2026). It models a tiny CRM:

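import json
from itertools import count
from types import SimpleNamespace

from fastmcp import FastMCP

mcp = FastMCP("crm-server")

# Stand-in data layer so the example runs; swap in your real store.
class InMemoryDB:
    def __init__(self):
        self.tables, self.ids = {}, count(1)

    def insert(self, table: str, row: dict) -> SimpleNamespace:
        record = SimpleNamespace(id=str(next(self.ids)), **row)
        self.tables.setdefault(table, {})[record.id] = record
        return record

    def get(self, table: str, key: str) -> dict:
        record = self.tables.get(table, {}).get(key)
        return vars(record) if record else {}

db = InMemoryDB()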

# TOOL — an action with side effects (the model can invoke it)
@mcp.tool()
def create_lead(name: str, email: str, source: str = "web") -> dict:
    """Create a new sales lead and return its record."""
    lead = db.insert("leads", {"name": name, "email": email, "source": source})
    return {"id": lead.id, "status": "created"}

# RESOURCE — read-only context (the model can fetch it, no side effects)
@mcp.resource("crm://customers/{customer_id}")
def customer_record(customer_id: str) -> str:
    """Return a customer's record as JSON."""
    return json.dumps(db.get("customers", customer_id))

# PROMPT — a reusable template that guides a workflow
@mcp.prompt()
def qualify_lead(lead_id: str) -> str:
    """Guide the model through qualifying a lead."""
    return (f"Review lead {lead_id}. Assess fit on budget, authority, "
            f"need, and timeline. Recommend next action.")

if __name__ == "__main__":
    mcp.run(transport="stdio")   # local; swap to "streamable-http" to host remotely

Notice what you didn’t write: no JSON-RPC plumbing, no schema by hand. The SDK infers each tool’s input schema from the function’s type hints, takes the description from its docstring, and advertises name, description, and schema to any client that connects. That’s the whole point: the contract is generated from ordinary typed Python.
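To see roughly what that inference involves, here is a simplified sketch built on the standard library. It is not FastMCP's implementation: infer_input_schema and describe_tool are hypothetical helpers, and the type mapping covers only four primitive types.

```python
import inspect
from typing import get_type_hints

# Simplified mapping; real SDKs also handle Optional, lists, nested models, etc.
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def infer_input_schema(func) -> dict:
    """Derive a JSON-Schema-style object from a function's typed signature."""
    hints = get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        # Unknown or missing annotations fall back to "string" in this sketch.
        properties[name] = {"type": _JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the caller must supply it
        else:
            properties[name]["default"] = param.default
    return {"type": "object", "properties": properties, "required": required}


def describe_tool(func) -> dict:
    """Bundle name, docstring, and inferred schema, shaped like a tools/list entry."""
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "inputSchema": infer_input_schema(func),
    }


def create_lead(name: str, email: str, source: str = "web") -> dict:
    """Create a new sales lead and return its record."""
    return {"name": name, "email": email, "source": source}


tool_entry = describe_tool(create_lead)
```

Here name and email end up in required while source carries its default, which is the kind of entry a client would see for this tool.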

The two transports

MCP defines only two standard transports, and the choice is about where the server runs, not what it does.

stdio — the host launches the server as a subprocess and exchanges JSON-RPC frames over standard in/out. No network, no ports, OS-level identity for security. This is the default for local tools — it’s how Claude Desktop and Claude Code run filesystem and git servers.

Streamable HTTP — the server is a network endpoint: a single HTTPS URL that accepts HTTP POST for requests, with optional Server-Sent Events for streaming. This is the transport for remote, multi-user, or cloud-hosted servers. It was introduced in the 2025-03-26 spec and replaced the older HTTP+SSE transport, which is being sunset across major providers through 2026 — so if you see two-endpoint HTTP+SSE examples online, they’re stale.

A practical rule: local dev tool → stdio; anything multi-tenant or hosted → Streamable HTTP behind OAuth 2.1, which is the recommended auth for remote servers. If you’re deploying serverless, prefer a stateless server (omit the Mcp-Session-Id header) so it scales horizontally without sticky sessions — stateless operation is the headline item on MCP’s 2026 roadmap.

The reverse primitives: what makes MCP interactive

Most people think of MCP as one-directional — host calls server. The feature that surprises people is that servers can call back to the host. These client-side primitives are what make MCP interactive rather than a static function catalog:

Sampling (sampling/createMessage) — the server asks the host to run an LLM completion on its behalf. Why? So an agentic server can reason without bundling its own model SDK or API key. A GitHub server that needs to classify an issue’s severity can ask the host to run that classification with whatever model the user is already paying for. The server stays model-independent; the user owns the billing and the approval.
Elicitation — the server asks the user, through the host’s UI, for missing input or confirmation. A payments server can surface “Is the user OK paying $4.99 for this?” right in the chat thread.
Roots — the host advertises trusted filesystem boundaries to the server.

These reverse calls always route through the host, which must show the user what’s being asked and let them edit or reject it. That consent boundary is core to MCP’s security model.
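A host-side consent gate for sampling could look something like the sketch below. Everything here is an assumption made for illustration: handle_sampling_request, ask_user, and run_model are hypothetical names, the USER_REJECTED code is a placeholder, and the result is trimmed to the fields needed to show the flow.

```python
from typing import Callable, Optional

# Placeholder error code for this sketch; use whatever your host standardizes on.
USER_REJECTED = -1


def handle_sampling_request(
    request: dict,
    ask_user: Callable[[dict], Optional[dict]],
    run_model: Callable[[dict], str],
) -> dict:
    """Route a server's sampling/createMessage request through user consent.

    ask_user shows the proposed params and returns the (possibly edited)
    params, or None if the user rejects the request.
    """
    approved = ask_user(request["params"])
    if approved is None:
        return {
            "jsonrpc": "2.0",
            "id": request["id"],
            "error": {"code": USER_REJECTED, "message": "User rejected sampling request"},
        }
    text = run_model(approved)  # the host's model, on the user's bill
    return {
        "jsonrpc": "2.0",
        "id": request["id"],
        "result": {"role": "assistant", "content": {"type": "text", "text": text}},
    }


request = {
    "jsonrpc": "2.0",
    "id": 7,
    "method": "sampling/createMessage",
    "params": {
        "messages": [{"role": "user",
                      "content": {"type": "text", "text": "Classify severity: app crashes on login"}}],
        "maxTokens": 50,
    },
}
approved = handle_sampling_request(request, ask_user=lambda p: p, run_model=lambda p: "high")
rejected = handle_sampling_request(request, ask_user=lambda p: None, run_model=lambda p: "high")
```

The design point is that the model is never reached unless ask_user returns something: rejection short-circuits into a JSON-RPC error, so the server learns the outcome without ever seeing the user's credentials or model choice.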

Connecting from a host

On the host side, the lifecycle is: connect → discover → invoke. Conceptually:

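# Host/client side, using the official MCP Python SDK (the `mcp` package)
import asyncio
from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    server = StdioServerParameters(command="python", args=["crm_server.py"])
    async with stdio_client(server) as (read, write), ClientSession(read, write) as session:
        await session.initialize()                      # capability handshake
        tools = await session.list_tools()              # tools/list: dynamic discovery
        result = await session.call_tool(               # tools/call
            "create_lead", {"name": "Ada", "email": "ada@example.com"})
        record = await session.read_resource("crm://customers/42")   # resources/read

asyncio.run(main())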

Discovery is dynamic: the host calls tools/list at runtime, so a server can add or remove tools and notify the host with a notifications/tools/list_changed message rather than requiring a redeploy. From there, when the model decides to act, the host issues tools/call with arguments matching the advertised schema, gets structured output back, and feeds it to the model. That loop — discover, decide, call, observe — is the entirety of an agent using a tool over MCP.
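One way a host might act on that notification is to cache the tool list and mark it stale when the server says it changed. This is a sketch under assumptions: ToolCatalog is a hypothetical class, and fetch_tools stands in for a real tools/list round trip.

```python
from typing import Callable


class ToolCatalog:
    """Host-side cache of a server's tools, refreshed after list_changed."""

    def __init__(self, fetch_tools: Callable[[], list]):
        self._fetch_tools = fetch_tools  # stands in for a tools/list request
        self._tools: dict = {}
        self._stale = True

    def handle_notification(self, message: dict) -> None:
        """Mark the cache stale when the server announces a changed tool list."""
        if message.get("method") == "notifications/tools/list_changed":
            self._stale = True  # refetch lazily, on next use

    def tools(self) -> dict:
        """Return tools keyed by name, refetching only when stale."""
        if self._stale:
            self._tools = {t["name"]: t for t in self._fetch_tools()}
            self._stale = False
        return self._tools


server_tools = [{"name": "create_lead"}]
catalog = ToolCatalog(lambda: list(server_tools))

before = sorted(catalog.tools())
server_tools.append({"name": "archive_lead"})  # server adds a tool at runtime
still_cached = sorted(catalog.tools())         # no notification yet, so no refetch
catalog.handle_notification({"jsonrpc": "2.0", "method": "notifications/tools/list_changed"})
after = sorted(catalog.tools())
```

Refetching lazily keeps a burst of notifications from turning into a burst of tools/list calls.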

Where MCP stops

Notice what we still didn’t see: we never had one agent ask another agent to accomplish a goal. Every interaction was an agent reaching down to a capability it controls, inside its own context. The CRM server isn’t an autonomous peer — it’s a typed set of actions and data, and the agent is fully in charge.

The moment you want to hand a goal to an independent agent — one you don’t control, possibly run by another team or vendor, that will work asynchronously and report back without sharing your context — MCP is the wrong tool. That’s a horizontal, peer-to-peer interaction, and it’s exactly what A2A is built for.

A2A in Practice & When to Use Which

The four core concepts

A2A is built on four objects. Learn these and the protocol falls into place:

Agent Card — a public JSON document, served at /.well-known/agent.json, that describes an agent: its name, version, endpoint URL, the skills it offers, the input/output formats it accepts, and how to authenticate. It’s a machine-readable business card. A client fetches it (directly or via a registry) to learn who can do what and how to connect.
Task — the fundamental unit of work, identified by a unique taskId. Unlike an MCP tool call, a Task has an explicit, trackable lifecycle and can run for seconds or days.
Message — the unit of exchange inside a Task. Each message has a role (user or agent) and a body that’s an array of Parts — which can mix text, files, binary, and structured data, making A2A multimodal by design.
Artifact — the output of a Task: a PDF invoice, a JSON analysis, an image. Delivered to the client when the work completes.

The interaction model is client agent → remote agent. The client delegates; the remote agent executes autonomously and returns results without ever seeing the client’s internal context, memory, or tools. That isolation is the whole point of the horizontal axis — it’s what lets agents from different vendors collaborate without trusting each other with their internals.

Discovery: the Agent Card

An agent advertises itself with a card like this, served at a well-known URL:

{
  "name": "research-agent",
  "description": "Performs literature reviews and synthesizes findings.",
  "url": "https://research.example.com/a2a",
  "version": "1.0.0",
  "capabilities": { "streaming": true, "pushNotifications": true },
  "defaultInputModes": ["text/plain"],
  "defaultOutputModes": ["application/json", "text/markdown"],
  "securitySchemes": {
    "oauth2": { "type": "oauth2", "flows": { "clientCredentials": { "...": "..." } } }
  },
  "skills": [
    {
      "id": "literature-review",
      "name": "Literature Review",
      "description": "Given a topic, returns a synthesized review with citations.",
      "inputModes": ["text/plain"],
      "outputModes": ["text/markdown"]
    }
  ]
}

A client fetches this card to decide whether this agent can help and how to call it. A2A v1.0 (early 2026) added Signed Agent Cards — a cryptographic signature proving the card was issued by the domain owner. Without it, an attacker could stand up a forged card and redirect callers to a malicious endpoint (a “card forgery” attack). For any cross-organization deployment, verify the signature.
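The "can this agent help?" decision can be sketched as a pure function over the card. The helpers find_skill and can_handle are hypothetical, the card is trimmed to the fields they read, and signature verification is deliberately left out. The sketch assumes skill-level modes override the card's defaults.

```python
from typing import Optional


def find_skill(card: dict, skill_id: str) -> Optional[dict]:
    """Return the skill entry with this id, or None if the agent lacks it."""
    return next((s for s in card.get("skills", []) if s.get("id") == skill_id), None)


def can_handle(card: dict, skill_id: str, input_mode: str, output_mode: str) -> bool:
    """Check that the agent offers the skill in the modes the client needs.

    Skill-level modes win; otherwise fall back to the card-level defaults.
    """
    skill = find_skill(card, skill_id)
    if skill is None:
        return False
    inputs = skill.get("inputModes") or card.get("defaultInputModes", [])
    outputs = skill.get("outputModes") or card.get("defaultOutputModes", [])
    return input_mode in inputs and output_mode in outputs


card = {
    "name": "research-agent",
    "defaultInputModes": ["text/plain"],
    "defaultOutputModes": ["application/json", "text/markdown"],
    "skills": [
        {"id": "literature-review",
         "inputModes": ["text/plain"],
         "outputModes": ["text/markdown"]},
    ],
}

wants_markdown = can_handle(card, "literature-review", "text/plain", "text/markdown")
wants_json = can_handle(card, "literature-review", "text/plain", "application/json")
```

Note that wants_json is False even though the card's defaults include JSON, because the skill narrows its own output modes.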

Delegating a task

With the card in hand, the client sends a message via JSON-RPC over HTTPS. The current method is message/send (for synchronous or pollable work) or message/stream (for streaming updates over SSE):

{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "message/send",
  "params": {
    "message": {
      "role": "user",
      "parts": [
        { "kind": "text", "text": "Review recent work on retrieval-augmented generation." }
      ],
      "messageId": "9c1f...e7"
    }
  }
}

The remote agent responds with a Task object carrying a status. Because the work may be long-running, the client tracks progress through the task lifecycle rather than blocking on a single response. Using the Python a2a-sdk, the same exchange is roughly:

import asyncio

from a2a.client import A2AClient

# Illustrative: the helper names below are simplified and vary between
# a2a-sdk releases, so check them against the version you install.
TERMINAL_STATES = ("completed", "failed", "canceled", "rejected")

async def delegate_review(oauth_credentials):
    # 1. Discover: fetch and (in production) verify the signed Agent Card
    client = await A2AClient.from_agent_card_url(
        "https://research.example.com/.well-known/agent.json",
        auth=oauth_credentials,
    )

    # 2. Delegate: hand over a goal, not a function call
    task = await client.send_message(
        "Review recent work on retrieval-augmented generation."
    )

    # 3. Track: poll or stream until a terminal state
    while task.status.state not in TERMINAL_STATES:
        await asyncio.sleep(2)                         # wait, then re-check
        task = await client.get_task(task.id)          # tasks/get

    # 4. Collect the Artifact
    if task.status.state == "completed":
        return task.artifacts[0]                       # e.g. markdown review
    return None

For real-time progress you’d use message/stream and consume SSE events instead of polling; for fire-and-forget work you’d register a push-notification webhook so the remote agent calls you back when it’s done.
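If you do poll, a fixed two-second sleep is wasteful for work that can run for hours. Below is a sketch of polling with exponential backoff. It is SDK-independent: poll_until_terminal is a hypothetical helper, get_state stands in for a tasks/get call, and the delay values are arbitrary defaults.

```python
import time
from typing import Callable

TERMINAL_STATES = {"completed", "failed", "canceled", "rejected"}


def poll_until_terminal(
    get_state: Callable[[], str],
    initial_delay: float = 1.0,
    max_delay: float = 30.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll with exponential backoff until the task reaches a terminal state."""
    delay, waited = initial_delay, 0.0
    while True:
        state = get_state()  # stands in for a tasks/get round trip
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout:
            raise TimeoutError(f"task still '{state}' after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # double the wait, up to the cap


# Simulated task: the sleep function is injected so the demo runs instantly.
states = iter(["submitted", "working", "working", "completed"])
delays = []
final = poll_until_terminal(lambda: next(states), sleep=delays.append)
```

A real client would also need to treat input-required and auth-required as reasons to stop polling and hand control to a person, rather than waiting for the timeout.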

The task lifecycle

The thing that most distinguishes A2A from MCP is that work is stateful. Every Task moves through an explicit lifecycle, and long-running, human-in-the-loop, and async patterns are first-class:

A flowchart illustrating the A2A task lifecycle, showing the progression of tasks from 'submitted' to various states like 'working', 'completed', 'failed', 'canceled', and 'rejected'.

A few consequences worth internalizing. A task can pause at input-required (the remote agent needs more from you) or auth-required (it needs credentials) and resume — this is how human-in-the-loop works natively. The four terminal states (completed, failed, canceled, rejected) are final: a terminal task can’t be restarted; you start a new one. And retry and circuit-breaking are deliberately left to the client — the protocol governs state and messaging, not your resilience policy.
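A client can enforce those rules locally with a small state machine. The state names come from the lifecycle above, but the exact transition table is an assumption made for this sketch, and TaskTracker is a hypothetical class.

```python
TERMINAL = {"completed", "failed", "canceled", "rejected"}

# Assumed transition table for illustration; the text above fixes only the
# states, the two pause points, and the rule that terminal states are final.
ALLOWED = {
    "submitted": {"working", "rejected", "canceled"},
    "working": {"input-required", "auth-required", "completed", "failed", "canceled"},
    "input-required": {"working", "canceled", "failed"},
    "auth-required": {"working", "canceled", "failed"},
}


class TaskTracker:
    """Client-side record of one task's state history."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.state = "submitted"
        self.history = ["submitted"]

    def advance(self, new_state: str) -> None:
        """Apply a state update, rejecting anything the lifecycle forbids."""
        if self.state in TERMINAL:
            raise ValueError(f"task {self.task_id} is {self.state}; start a new task")
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
        self.history.append(new_state)


task = TaskTracker("task-1")
for update in ("working", "input-required", "working", "completed"):
    task.advance(update)  # pauses for input once, then resumes and finishes
```

Calling task.advance("working") after this point raises ValueError, which is the "a terminal task can't be restarted" rule expressed as code.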

How the two protocols compose

Here’s where the “they’re complementary” claim becomes concrete. A2A and MCP operate on perpendicular axes, so a single agent uses both at once: it speaks A2A horizontally to its peers and MCP vertically to its own tools.

Picture an orchestrator that delegates a literature review to the research agent above. That research agent, internally, is itself an MCP host: to actually do the review, it calls an arXiv search tool, a PDF-reader tool, and a vector-store resource — all over MCP. The orchestrator never sees any of that. It handed over a goal via A2A; the tools the remote agent used to satisfy it are private, on the vertical axis, exactly where MCP belongs.

Diagram illustrating the relationship between an Orchestrator agent, a Research Agent, and a Writing Agent, highlighting their roles and private tools like arXiv API, Vector store, Docs API, and Style DB, within a framework of A2A and MCP.

That picture is the answer to “do I need both?” In any non-trivial multi-agent system: yes, and they never collide, because one runs across and the other runs down.

When to use which: the decision framework

Carry one question into every design decision: is the thing on the other end a capability, or a peer?

Reach for MCP when:

Your agent needs to call an API, query a database, read files, or trigger an action you control.
The interaction is synchronous — ask, get an answer, continue.
The “other side” has no agency of its own; it’s a typed set of actions and data.
You want the tool to operate inside your agent’s context.

Reach for A2A when:

You’re delegating a goal to an autonomous agent, not calling a function.
That agent is owned by a different team, vendor, or codebase, and you shouldn’t share your internal context with it.
The work is long-running, async, or needs human-in-the-loop checkpoints.
You want to swap one agent for another without rewriting orchestration — discovery via Agent Cards gives you that.

Use both whenever you have multiple agents that each need their own tools — which is most production systems. Coordinate across with A2A; equip each agent with MCP.

The mistakes this series exists to prevent

Each of these is a real, common error, and each is just the capability/peer question answered wrong:

Exposing a database or API as an “agent” over A2A. It’s a capability, not a peer. That’s an MCP server. You’ve added a coordination protocol, an Agent Card, and a task lifecycle to something that should be a synchronous tool call.
Hand-rolling bespoke RPC between heterogeneous agents instead of using A2A. You’ll reinvent discovery, auth, and task state — badly — and re-glue it every time an agent changes.
Wrapping every tool as an agent because A2A feels newer or more powerful. Over-engineering: you pay statefulness and discovery overhead for a function call.
Trying to use MCP to coordinate independent agents across a trust boundary. MCP has no concept of an autonomous peer working without your context; you’ll end up leaking context or faking agency.

The protocols are not rivals and the choice is rarely ambiguous once you ask the right question. MCP is how your agent talks to tools. A2A is how your agents talk to each other. Build on the correct axis and the architecture stops fighting you.

Agentic Engineering: From Vibe Coding to a Plan, Execute, Verify Discipline

Posted on 21st June 2026 by Rodrigo Silva

For about a year, “vibe coding” was the most fun anyone had had with a keyboard. You describe what you want, an agent writes it, you skim the result, you ship. Andrej Karpathy’s original framing was almost gleeful: you “give in to the vibes” and barely read the diffs.

Then teams tried to put vibe-coded software into production, and the bill came due.

The failure mode now has a name — AI slop: code that looks reasonable on the surface but lacks error handling, quietly introduces security vulnerabilities, breaks something three modules over, or produces an architecture nobody can maintain. It’s not that the model is dumb. It’s that “prompt and hope” has no step where anything gets checked against reality before it lands.

The numbers make the discomfort concrete. In Sonar’s 2026 developer survey, 96% of developers said they don’t fully trust the output of AI coding agents. A late-2025 Stack Overflow survey found nearly half of developers were frustrated by AI solutions that are “almost right, but not quite” — which is arguably the most expensive kind of wrong, because it survives a casual review and fails in production.

In early 2026, Karpathy named the thing that comes next: agentic engineering — the discipline of designing systems where AI agents plan, write, test, and ship code under structured human oversight. Not casual prompting. Not hope-and-check. An actual engineering methodology built for AI-first development.

This series is a hands-on guide to that discipline. By the end you’ll have built a loop you’d actually trust near your codebase.

What vibe coding is missing

Strip a vibe-coding session down and you get:

You: "Add rate limiting to the API"
Agent: <writes 80 lines>
You: <skim, looks fine, paste it in>

Three things are absent, and each is a place production breaks:

No explicit plan. The agent inferred intent and committed to an approach in a single shot. You never saw the approach, so you couldn’t catch that it chose an in-memory limiter that won’t survive a multi-instance deploy.
No isolation. The code went straight onto your working branch. When it’s wrong, untangling it from your own changes is a chore.
No verification gate. “Looks fine” is the only test. There’s no point where tests, a linter, or a security scanner can block the change before it reaches you.

Agentic engineering reintroduces all three as first-class steps. The core workflow that replaces “prompt and hope” is the Plan → Execute → Verify loop, usually shortened to PEV.

The PEV loop

Flowchart illustrating the software development process with four stages: Plan, Execute, Verify, and Ship, emphasising human oversight and feedback loops for failure.

Each phase has a job:

Plan turns a fuzzy request into an explicit, reviewable artifact: a spec and a task breakdown. This is where a human can intervene cheaply — catching a bad approach before a line of code exists costs seconds; catching it after costs a debugging session.
Execute writes the code, but in an isolated environment (a fresh branch, a sandboxed workspace) so a bad run can be thrown away with zero blast radius.
Verify runs objective checks — the test suite, linters, type checks, security scanners — and either passes the work forward or kicks it back to Plan with the failure as new context.

The arrow wrapping the whole thing is the part people skip and shouldn’t: human oversight. The human isn’t in the inner loop typing code. They’re at the boundaries — approving the plan, adjudicating verify failures the agent can’t resolve, and owning the final merge decision.

That distinction — humans at the boundaries, not in the middle — is the whole philosophy. It’s what lets the loop run fast without running blind.

A minimal PEV loop you can run

Let’s make this concrete with the smallest thing that’s still genuinely a PEV loop. We’ll give an agent a tiny task, force it to plan first, let it execute, then actually verify with a real test run instead of vibes.

We’ll use Python and the Anthropic Messages API, but nothing here is provider-specific — swap in any model SDK.

import subprocess, json, tempfile, os
from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-sonnet-4-6"

def call(system, user):
    msg = client.messages.create(
        model=MODEL, max_tokens=2000,
        system=system,
        messages=[{"role": "user", "content": user}],
    )
    return "".join(b.text for b in msg.content if b.type == "text")

# ---- PLAN ---------------------------------------------------------------
def plan(task):
    system = (
        "You are a senior engineer. Given a task, produce a short, explicit "
        "plan as JSON with keys: 'approach' (one sentence) and 'steps' (list). "
        "Do not write code yet. Output JSON only, no prose, no code fences."
    )
    return json.loads(call(system, task))

# ---- EXECUTE ------------------------------------------------------------
def execute(task, the_plan):
    system = (
        "You are a careful engineer. Implement the task following the plan. "
        "Return JSON only with keys 'code' (the module, which will be saved as "
        "solution.py) and 'tests' (pytest tests that import from `solution`). "
        "No prose, no code fences."
    )
    user = f"TASK:\n{task}\n\nPLAN:\n{json.dumps(the_plan, indent=2)}"
    return json.loads(call(system, user))

# ---- VERIFY -------------------------------------------------------------
def verify(artifact):
    """Run the agent's tests in an isolated temp dir. Real check, not vibes."""
    with tempfile.TemporaryDirectory() as d:
        # Context managers flush and close both files before pytest reads them.
        with open(os.path.join(d, "solution.py"), "w") as f:
            f.write(artifact["code"])
        with open(os.path.join(d, "test_solution.py"), "w") as f:
            f.write(artifact["tests"])
        result = subprocess.run(
            ["python", "-m", "pytest", "-q"],
            cwd=d, capture_output=True, text=True, timeout=60,
        )
        return result.returncode == 0, result.stdout + result.stderr

# ---- THE LOOP -----------------------------------------------------------
def pev(task, max_attempts=3):
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        print(f"\n=== Attempt {attempt} ===")
        the_plan = plan(task + feedback)
        print("PLAN:", the_plan["approach"])

        artifact = execute(task, the_plan)
        passed, log = verify(artifact)

        if passed:
            print("VERIFY: passed ✅  (awaiting human review before merge)")
            return artifact
        print("VERIFY: failed ❌  — replanning with the failure as context")
        feedback = f"\n\nPrevious attempt failed these tests:\n{log[-1500:]}"

    print("Gave up after max attempts — escalate to a human.")
    return None

if __name__ == "__main__":
    pev("Write a function `slugify(s)` that lowercases, strips punctuation, "
        "and replaces runs of whitespace with single hyphens.")

Run it and watch the difference from vibe coding: the agent is forced to state an approach before coding, the work runs in a throwaway directory, and the loop only declares success when a real pytest process exits zero. A failure doesn’t get shipped; it gets fed back in as context for the next plan.

What this toy is still missing

This is a real loop, but it’s a toy, and the gaps are exactly the subject of later sections:

The plan is unreviewed. A human should be able to approve or edit it before execution.
Verification is shallow. Passing the agent’s own tests proves very little — the agent can write weak tests. We need independent tests, linters, type checks, and security scanning.
No real isolation. A temp dir works for one function; real work needs branch-per-task and git worktrees so parallel runs don’t collide.
One agent does everything. Author and grader being the same model is a conflict of interest. Splitting roles — author, tester, reviewer, security — catches far more.
No audit trail. In production, why an agent did something becomes the constraint, not whether it could.

Building a production-grade PEV loop

Above, we built a PEV loop that ran an agent’s own tests in a temp directory. It was a real loop but a weak one: the plan was unreviewed, the verification trusted the agent to grade itself, and “isolation” was a folder that vanished. We’ll build each phase properly: a plan you can read and approve, an execute phase that runs in a real isolated workspace, and a verify gate the agent cannot talk its way past.

Phase 1 — Plan: make intent explicit and reviewable

The single highest-leverage move in agentic engineering is forcing a planning step before any code exists. A bad approach caught at the plan stage costs seconds. The same mistake caught after execution costs a debugging session — and after merge, an incident.

A good plan is a structured artifact, not a paragraph. Structure makes it reviewable, diff-able, and machine-checkable. A practical shape:

from dataclasses import dataclass, field
from typing import Literal

@dataclass
class PlanStep:
    id: str
    description: str
    files_touched: list[str]
    risk: Literal["low", "medium", "high"]

@dataclass
class Plan:
    goal: str
    approach: str                       # one sentence the human can sanity-check
    non_goals: list[str]                # what we are deliberately NOT doing
    steps: list[PlanStep]
    acceptance_criteria: list[str]      # the verify phase will check these
    open_questions: list[str] = field(default_factory=list)

Two fields earn their keep. non_goals stops scope creep — the most common way an agent turns “add rate limiting” into a rewrite of your middleware stack. acceptance_criteria is the contract the Verify phase will hold the code to; writing it during planning means “done” is defined before a line is written, not rationalized afterward.

The prompt that produces this should be explicit about the spec-first stance:

PLAN_SYSTEM = """You are a staff engineer doing spec-driven development.
Given a task and the relevant repo context, produce a Plan object as JSON.
Rules:
- Decompose into the smallest steps that each leave the repo green.
- State non_goals explicitly to bound scope.
- Write acceptance_criteria as concrete, testable statements.
- If the task is ambiguous, populate open_questions instead of guessing.
Output JSON only."""

Notice the last rule. A vibe-coding agent guesses when it’s unsure and you find out later. A planning agent is told to surface ambiguity as open_questions — which becomes the natural place for a human to intervene. If open_questions is non-empty, the loop pauses for an answer instead of charging ahead on an assumption.
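Here is a sketch of how the model's JSON could be loaded into the Plan shape and gated on open_questions. The dataclasses are repeated so the block runs on its own; parse_plan and ready_to_execute are hypothetical helpers, and the sample plan is made up for the demo.

```python
import json
from dataclasses import dataclass, field
from typing import Literal


@dataclass
class PlanStep:
    id: str
    description: str
    files_touched: list
    risk: Literal["low", "medium", "high"]

    def __post_init__(self):
        # Dataclasses don't enforce Literal, so check the value explicitly.
        if self.risk not in ("low", "medium", "high"):
            raise ValueError(f"unknown risk level: {self.risk!r}")


@dataclass
class Plan:
    goal: str
    approach: str
    non_goals: list
    steps: list
    acceptance_criteria: list
    open_questions: list = field(default_factory=list)


def parse_plan(raw: str) -> Plan:
    """Parse model output; missing or unexpected keys raise TypeError."""
    data = json.loads(raw)
    steps = [PlanStep(**step) for step in data.pop("steps")]
    return Plan(steps=steps, **data)


def ready_to_execute(plan: Plan) -> bool:
    """Proceed only with no open questions and at least one acceptance criterion."""
    return not plan.open_questions and bool(plan.acceptance_criteria)


raw = json.dumps({
    "goal": "Add rate limiting to the API",
    "approach": "Token bucket backed by a shared store so limits hold across instances.",
    "non_goals": ["Rewriting the middleware stack"],
    "steps": [{"id": "1", "description": "Add limiter module",
               "files_touched": ["api/limiter.py"], "risk": "medium"}],
    "acceptance_criteria": ["Requests over the limit receive HTTP 429"],
    "open_questions": ["Is the limit per user or per API key?"],
})
plan = parse_plan(raw)
blocked = not ready_to_execute(plan)  # True: the agent asked instead of guessing
```

Because parsing goes through the dataclass constructors, a plan with a missing field fails loudly at the boundary instead of surfacing halfway through execution.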

Phase 2 — Execute: isolation is non-negotiable

The reason vibe coding feels dangerous is that the agent writes directly onto your working tree. Get isolation right and a botched run becomes a deleted branch instead of an afternoon of git reset archaeology.

The right primitive is one git worktree per task. A worktree gives each agent run its own checked-out branch in its own directory, all backed by the same repo — so parallel agents can work without colliding, and merging back is an ordinary PR.

import subprocess, uuid, pathlib

def make_worktree(repo: str, base: str = "main") -> tuple[str, str]:
    """Create an isolated worktree on a fresh branch. Returns (path, branch)."""
    branch = f"agent/{uuid.uuid4().hex[:8]}"
    path = pathlib.Path(repo).parent / "worktrees" / branch.replace("/", "-")
    subprocess.run(["git", "-C", repo, "worktree", "add", "-b", branch,
                    str(path), base], check=True)
    return str(path), branch

def teardown_worktree(repo: str, path: str, branch: str, keep: bool):
    if not keep:
        subprocess.run(["git", "-C", repo, "worktree", "remove", "--force", path])
        subprocess.run(["git", "-C", repo, "branch", "-D", branch])

Now execution happens inside that directory. The agent gets tools scoped to the worktree — read a file, write a file, run a shell command — and nothing it does can touch your main checkout. This is also where you cap blast radius with the principle of least privilege: the execution sandbox should have only the filesystem, network, and credential access the task genuinely needs. An agent implementing slugify does not need your production database URL in its environment.

def execute_in_worktree(client, task, plan, wt_path):
    """Agentic execution with file + shell tools, confined to wt_path."""
    tools = [
        {"name": "write_file", "description": "Write a file (path relative to repo root).",
         "input_schema": {"type": "object",
            "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
            "required": ["path", "content"]}},
        {"name": "run", "description": "Run a shell command in the repo root.",
         "input_schema": {"type": "object",
            "properties": {"cmd": {"type": "string"}}, "required": ["cmd"]}},
    ]
    # ... a standard tool-use loop: call the model, dispatch tool calls against
    # wt_path with subprocess(cwd=wt_path), feed results back until the model
    # signals completion. Every shell command is confined to the worktree.

The implementation detail that matters: every tool call executes with cwd=wt_path, and the write_file handler rejects any path that resolves outside the worktree (guard against ../ escapes). Isolation you can bypass isn’t isolation.
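A minimal sketch of that path guard, using only the standard library. The names resolve_inside and write_file are assumptions for this example; a real handler would sit behind the write_file tool definition above.

```python
import tempfile
from pathlib import Path


def resolve_inside(wt_path: str, relative: str) -> Path:
    """Resolve `relative` under wt_path, refusing anything that escapes it."""
    root = Path(wt_path).resolve()
    target = (root / relative).resolve()  # collapses ../ and follows symlinks
    if root not in target.parents:
        raise PermissionError(f"{relative!r} resolves outside the worktree")
    return target


def write_file(wt_path: str, relative: str, content: str) -> None:
    """Tool handler: write a file only if it lands inside the worktree."""
    target = resolve_inside(wt_path, relative)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")


# Demo against a throwaway directory standing in for a worktree.
with tempfile.TemporaryDirectory() as wt:
    write_file(wt, "src/slug.py", "def slugify(s): ...\n")
    wrote_ok = (Path(wt) / "src" / "slug.py").exists()
    try:
        write_file(wt, "../outside.txt", "nope")
        escaped = True
    except PermissionError:
        escaped = False
```

Resolving before comparing is the important step: a plain string prefix check would miss symlinks and absolute paths, both of which resolve() normalizes first.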

Phase 3 — Verify: the gate the agent can’t sweet-talk

This is where the previous loop was weakest. Letting the agent write and grade its own work is a conflict of interest — it’ll write tests that pass. Production verification needs to be objective and independent of the author, and it needs multiple lenses, because each catches a different class of slop.

Think of the gate as a pipeline of checks, each of which can block:

@dataclass
class Check:
    name: str
    cmd: list[str]
    blocking: bool = True

GATE = [
    Check("format",   ["ruff", "format", "--check", "."]),
    Check("lint",     ["ruff", "check", "."]),
    Check("types",    ["mypy", "."]),
    Check("tests",    ["pytest", "-q", "--cov", "--cov-fail-under=80"]),
    Check("security", ["bandit", "-r", ".", "-ll"]),   # catches injected vulns
    Check("deps",     ["pip-audit"]),                   # known-CVE dependencies
]

def verify(wt_path) -> tuple[bool, dict]:
    results, ok = {}, True
    for c in GATE:
        r = subprocess.run(c.cmd, cwd=wt_path, capture_output=True,
                           text=True, timeout=300)
        passed = r.returncode == 0
        results[c.name] = {"passed": passed, "log": (r.stdout + r.stderr)[-2000:]}
        if c.blocking and not passed:
            ok = False
    return ok, results

A few deliberate choices:

A coverage floor (--cov-fail-under=80) stops the agent from “passing” by writing one trivial test. It has to actually exercise the code.
A security scanner (bandit) and a dependency auditor (pip-audit) are not optional niceties. As we’ll see later, an agent producing code at volume produces vulnerabilities at volume unless something blocks them. Security belongs in the gate, not in a later review.
Independent tests matter. A strong setup has a second agent (or a human-owned test suite) write tests the author agent never sees. Self-graded tests are a starting point, not the finish line.

Wiring it together

def production_pev(repo, task, max_attempts=3):
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        plan = make_plan(task + feedback)          # Phase 1
        if plan.open_questions:
            return pause_for_human(plan)           # don't guess — escalate

        wt_path, branch = make_worktree(repo)      # Phase 2: isolate
        ok = False                                 # stays False if execution raises
        try:
            execute_in_worktree(client, task, plan, wt_path)
            ok, results = verify(wt_path)          # Phase 3: gate
            if ok:
                open_pull_request(branch, plan, results)   # human merges
                return branch                      # finally keeps the worktree
            # failed gate: feed the specific failures back into the next plan
            feedback = summarize_failures(results)
        finally:
            teardown_worktree(repo, wt_path, branch, keep=ok)
    escalate_to_human(task, feedback)

The shape is the same loop from earlier, but every phase now has teeth: the plan is a reviewable artifact that refuses to guess, execution is confined to a disposable worktree with least-privilege access, and verification is an independent multi-check gate with a coverage floor and security scanning. Crucially, a failure doesn’t ship — it becomes precise feedback (summarize_failures extracts the actual failing test names and scanner findings) that sharpens the next plan.
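The loop leaves summarize_failures undefined, so here is one possible sketch. It assumes the results dict produced by the verify gate above and pytest's "FAILED path::test" summary lines. The sample logs are invented for the demo.

```python
import re

# pytest's short summary prints lines like "FAILED path::test_name - reason".
_FAILED_LINE = re.compile(r"^FAILED (\S+)", re.MULTILINE)


def summarize_failures(results: dict, max_log_chars: int = 400) -> str:
    """Turn gate results into compact feedback for the next planning prompt."""
    lines = []
    for name, outcome in results.items():
        if outcome["passed"]:
            continue  # passing checks add no signal to the next plan
        log = outcome["log"]
        failed_tests = _FAILED_LINE.findall(log) if name == "tests" else []
        if failed_tests:
            lines.append(f"[{name}] failing tests: " + ", ".join(failed_tests))
        else:
            # No structured parse available: keep only the tail of the log.
            lines.append(f"[{name}] failed:\n{log[-max_log_chars:].strip()}")
    if not lines:
        return ""
    return "\n\nThe previous attempt failed these checks:\n" + "\n".join(lines)


# Invented sample results, shaped like the output of verify().
results = {
    "lint": {"passed": True, "log": ""},
    "tests": {"passed": False,
              "log": "FAILED test_slug.py::test_unicode - AssertionError\n1 failed, 3 passed"},
    "security": {"passed": False, "log": "B602: subprocess call with shell=True"},
}
feedback = summarize_failures(results)
```

Trimming to test names and log tails keeps the replanning prompt small, which matters when the same feedback is appended on every retry.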

What’s still open

We now have a loop that’s safe to point at a real repo. But two things still rely on hand-waving:

“Human merges” and “escalate to a human” — we keep deferring to a human at the boundaries without saying how that handoff should work. Where exactly do humans belong, and how do you keep them effective without making them a bottleneck?
One agent still does the work. We hinted that independent test-writing helps. The full version splits the job across specialized agents — author, tester, reviewer, security — and that orchestration is its own design problem.

Human Oversight, Multi-Agent Orchestration & Shipping Safely

We have a loop that plans explicitly, executes in isolation, and verifies objectively. Left alone, though, it has two unresolved weaknesses: it still treats “a human handles it” as a magic step, and it still has one agent doing — and grading — all the work. Fixing both is what separates a demo from something you’d run against your main branch a hundred times a day.

Where humans actually belong

The initial instinct is to put a human “in the loop.” That’s the wrong picture. A human reviewing every diff an agent produces is just slow vibe coding — you’ve added a bottleneck without adding rigor, and reviewers rubber-stamp under volume anyway.

The right model is humans at the boundaries, agents in the loop. There are exactly three boundaries worth a human’s attention:

Plan approval (the cheap gate). Reviewing a one-paragraph approach plus non_goals and acceptance_criteria takes thirty seconds and catches the most expensive mistakes — wrong approach, wrong scope — before any code exists. This is the single highest-ROI place to spend human attention. Pair it with the open_questions mechanism from earlier: if the agent is unsure, it asks here instead of guessing.
Verify escalation (the exception gate). When the loop exhausts its attempts or hits a failure it can’t resolve, a human adjudicates. The key design rule: the human should receive the structured failure — which checks failed, the actual scanner findings, what the agent already tried — not a raw transcript. Make the escalation legible and it takes a minute; dump a 4,000-line log and it takes an hour.
The merge decision (the accountability gate). A passing verify gate produces a pull request, not a merge. A human owns the decision to land it. This isn’t ceremony — it’s where accountability lives. You can’t fire a bot; someone human is answerable for what reached production.

# Assembled after a passing gate; every value comes from the loop's own state.
pull_request = {
    "branch": branch,
    "plan": plan,                  # the approved approach + acceptance criteria
    "gate_results": results,       # every check, pass/fail, with logs
    "diff_stat": diff_summary,     # what actually changed
    "agent_trail": run_id,         # link to the full audit trail (see below)
}
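
The verify-escalation boundary can hand over a similarly structured artifact. The sketch below is one possible shape, not a prescribed format: EscalationReport and all of its field names are hypothetical, chosen only to illustrate "structured failure, not raw transcript".

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EscalationReport:
    """Hypothetical structured handoff for the verify-escalation boundary."""
    task: str
    attempts: int
    failed_checks: dict[str, list[str]]   # check name -> specific findings
    already_tried: list[str]              # one line per prior attempt
    agent_trail: str                      # run id linking to the full audit trail

    def to_summary(self) -> str:
        """Render a short, human-readable summary instead of a raw log."""
        lines = [f"Task: {self.task}", f"Attempts: {self.attempts}", "Failed checks:"]
        for check, findings in self.failed_checks.items():
            lines.append(f"  {check}: " + "; ".join(findings))
        lines.append("Already tried:")
        lines.extend(f"  {i}. {step}" for i, step in enumerate(self.already_tried, 1))
        lines.append(f"Full trail: {self.agent_trail}")
        return "\n".join(lines)


report = EscalationReport(
    task="Fix rounding in refund totals",
    attempts=3,
    failed_checks={"tests": ["test_refund_rounding"]},
    already_tried=["Decimal quantize at output", "round half-even in tax step"],
    agent_trail="run-0001",
)
print(report.to_summary())
payload = json.dumps(asdict(report))  # machine-readable form, e.g. for a review UI
```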

Everything else — the writing, the testing, the iterating — happens without a human in the inner loop. That’s what lets the system move fast. The boundaries are where speed and safety get reconciled.

Multi-agent orchestration: stop letting the author grade itself

Earlier, we flagged the conflict of interest in one agent writing and testing its own code. The production answer is to split the job into specialized roles, each a focused agent with its own prompt, its own context, and — importantly — no incentive to cover for the others.

A battle-tested division of labor:

Flowchart illustrating a software development process involving roles such as Planner, Author, Tester, Reviewer, and Security, with steps including implementation, testing, and security checks, leading to a pull request.

Planner owns the spec (Phase 1). It does not write code.
Author implements against the plan. It does not write its own acceptance tests.
Tester writes tests from the plan’s acceptance criteria, not from the author’s code — so the tests check intended behavior, not whatever the author happened to build. This single separation kills a huge fraction of “passes its own tests, fails in prod” slop.
Reviewer reads code + tests against the plan and can reject back to Planner with reasons. It’s looking for the things scanners miss: bad architecture, missing edge cases, misread requirements.
Security runs as its own role and as part of the automated gate. It looks specifically for injected vulnerabilities, secrets, and unsafe dependencies.

You don’t need heavyweight frameworks to start — each role is a function that calls a model with a role-specific system prompt and passes a structured artifact to the next. Orchestration can be a plain state machine. Reach for an agent framework only when you actually need durable state, parallelism, or cross-process coordination; premature orchestration infrastructure is its own kind of slop.

The security math you can’t argue with

Here’s the calculation that makes everything above non-negotiable rather than nice-to-have. Anthropic’s 2026 agentic coding guidance puts it bluntly:

An agent producing 1,000 pull requests a week at a 1% vulnerability rate ships 10 new vulnerabilities every week.

Manual review cannot keep pace with that — which is the whole point. The same scaling that makes agentic engineering powerful for you makes it powerful for an attacker, and it makes your own agents a vulnerability factory unless something blocks bad output automatically.
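
The arithmetic behind that quote is simple enough to sketch. The gate_catch_rate parameter and the 0.9 value used below are purely illustrative assumptions, not measured figures from any scanner or study.

```python
def shipped_vulnerabilities(prs_per_week: int, vuln_rate: float,
                            gate_catch_rate: float = 0.0) -> float:
    """Expected number of vulnerable PRs that get past the gate each week."""
    if not (0.0 <= vuln_rate <= 1.0 and 0.0 <= gate_catch_rate <= 1.0):
        raise ValueError("rates must be between 0 and 1")
    return prs_per_week * vuln_rate * (1.0 - gate_catch_rate)


# The quoted scenario: 1,000 PRs a week at a 1% vulnerability rate, no gate.
no_gate = shipped_vulnerabilities(1_000, 0.01)

# Same volume with a blocking gate; the 0.9 catch rate is hypothetical.
with_gate = shipped_vulnerabilities(1_000, 0.01, gate_catch_rate=0.9)

print(f"no gate: {no_gate:.1f}/week, with gate: {with_gate:.1f}/week")
```

Because the result scales linearly with PR volume, only a check that runs on every PR automatically can hold the shipped count down as throughput grows.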

Three consequences:

Security lives in the harness, not in a later review. Every PEV cycle runs security scanning as a blocking check inside the verify gate. Bolting it on afterward means you’ve already shipped the slop.
Least privilege is structural. Each execution sandbox gets only the filesystem, network, and credentials its task needs. An agent’s expanded attack surface — it touches APIs, databases, external services — is exactly what a scoped sandbox contains.
New attack classes are real. “Living off the agent” — hijacking an enterprise AI’s own permissions to act maliciously — is an emerging 2026 tactic. Treat the agent’s credentials and tool access as a primary attack surface, not plumbing.
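
One way to picture "least privilege is structural" is a deny-by-default policy object attached to each task. This is only a sketch of the policy check: SandboxPolicy, its fields, and the example paths and names are hypothetical, and real enforcement would have to happen at the container or OS level rather than in Python.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class SandboxPolicy:
    """Hypothetical per-task policy: deny by default, allow only what is listed."""
    writable_roots: tuple[str, ...] = ()
    allowed_hosts: frozenset[str] = frozenset()
    credentials: frozenset[str] = frozenset()

    def can_write(self, path: str) -> bool:
        target = PurePosixPath(path)
        if ".." in target.parts:          # refuse traversal rather than resolve it
            return False
        return any(target.is_relative_to(root) for root in self.writable_roots)

    def can_connect(self, host: str) -> bool:
        return host in self.allowed_hosts

    def can_use(self, credential: str) -> bool:
        return credential in self.credentials


# A task that only needs its worktree, the package index, and a read-only token.
policy = SandboxPolicy(
    writable_roots=("/work/wt-123",),
    allowed_hosts=frozenset({"pypi.org"}),
    credentials=frozenset({"read_only_repo_token"}),
)

print(policy.can_write("/work/wt-123/src/app.py"))
print(policy.can_write("/etc/passwd"))
print(policy.can_use("prod_deploy_key"))
```

Empty defaults mean a task that declares nothing gets nothing, which is the property that limits what a hijacked agent can reach.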

Auditability is the real constraint

There’s a counterintuitive lesson from teams running this at scale: the bottleneck stops being whether agents can do the work and becomes whether you can account for what they did. As agentic dev tooling has boomed through 2026, workflow auditability has become the binding constraint.

Every run should emit an immutable trail: the task, the approved plan, which agent did what, every tool call and its result, the full gate output, and who approved the merge. This isn’t bureaucracy. It’s what makes incidents debuggable, makes compliance possible in regulated environments, and makes the merge gate meaningful (a human approving a PR needs to be able to see why the agent did what it did).

def emit_trail(run_id, event, payload):
    record = {"run_id": run_id, "ts": now_iso(), "event": event, "payload": payload}
    append_only_log.write(record)     # tamper-evident, queryable, retained
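
One common way to make such a log tamper-evident is hash chaining, where each record stores the hash of the one before it. The sketch below is an in-memory illustration under that assumption: HashChainedLog is a hypothetical stand-in for append_only_log, and now_iso is filled in with a plausible implementation. A production version would need durable storage and an external anchor for the latest hash.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS = "0" * 64  # placeholder hash for the first record


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()


class HashChainedLog:
    """In-memory sketch: each record carries the hash of the previous one."""

    def __init__(self):
        self._records = []

    def write(self, record: dict) -> dict:
        prev_hash = self._records[-1]["hash"] if self._records else GENESIS
        body = {**record, "prev_hash": prev_hash}
        entry = {**body, "hash": _digest(body)}
        self._records.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute the chain. Edits, reordering, and removals from the start
        or middle are detected; truncating the tail needs an external anchor."""
        prev_hash = GENESIS
        for entry in self._records:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True


append_only_log = HashChainedLog()


def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def emit_trail(run_id, event, payload):
    record = {"run_id": run_id, "ts": now_iso(), "event": event, "payload": payload}
    append_only_log.write(record)


emit_trail("run-0001", "plan_approved", {"approver": "alice"})
emit_trail("run-0001", "gate_passed", {"checks": 5})
print(append_only_log.verify())
```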

If you build one thing beyond the loop itself, build this. The teams that succeed with agentic engineering aren’t the ones with the cleverest agents; they’re the ones who can answer “what happened and why” for any change that reached production.

Anti-patterns to avoid

A few failure modes show up repeatedly:

Automating a broken process. Gartner predicts that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. Those are often symptoms of teams automating workflows that were already broken rather than of the tech failing. PEV makes a bad process faster, not better. Fix the process first.
Human-in-the-inner-loop. Reviewing every diff doesn’t scale and trains reviewers to rubber-stamp. Move humans to the three boundaries.
Self-graded work. If the author writes the tests, you’re measuring the author’s confidence, not correctness. Separate the roles.
Optional security. At agent volume, “we’ll add scanning later” means shipping vulnerabilities now.
Premature orchestration. Don’t reach for a multi-agent framework on day one. Start with a single PEV loop and one human boundary; add roles when a specific failure mode demands one.

A production-readiness checklist

Before you point an agentic loop at a repo that matters:

[ ] Planning produces a reviewable, structured spec with explicit non_goals and acceptance_criteria.
[ ] The agent surfaces ambiguity as questions instead of guessing.
[ ] Every run executes in an isolated, disposable worktree.
[ ] The execution sandbox has least-privilege filesystem/network/credential access.
[ ] The verify gate is independent of the author and includes tests with a coverage floor, type checks, linting, security scanning, and dependency auditing.
[ ] Tests are written from acceptance criteria, not from the author’s code.
[ ] Humans sit at exactly three boundaries: plan approval, verify escalation, merge.
[ ] A passing gate produces a PR, never an auto-merge.
[ ] Every run emits an immutable, queryable audit trail.
[ ] You can answer “what happened and why” for any agent-made change.

Where this leaves you

Agentic engineering isn’t about replacing developers — it’s about multiplying what one developer can responsibly oversee. The teams pulling ahead aren’t the ones who let agents run wildest; they’re the ones who turned “prompt and hope” into Plan → Execute → Verify, kept humans at the boundaries where their judgment compounds, and made every run accountable.

That’s the whole discipline. The agents will keep getting better. The engineering — the plan, the isolation, the gate, the oversight, the trail — is the part that’s yours, and it’s the part that decides whether all that capability ships value or ships slop.