
From Zero to Cloudflare: Rewriting Tools for AI, Not Just Wrapping APIs

In mid-May, Vercel Labs released a programming language called Zero. The most interesting part isn’t the syntax — syntax is shared between humans and AI — it’s the compiler output. Running zero check --json doesn’t return prose error messages. It returns structured JSON with stable error codes and repair IDs. Each diagnostic has both a message field (for humans) and code + repair.id fields (for agents). Same output, two tracks.
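To make the "two tracks" idea concrete, here is a minimal sketch of how an agent harness might split one diagnostics payload. The JSON shape is an assumption built only from the field names mentioned above (message, code, repair.id); the `diagnostics` wrapper key, the `line` field, and the sample message are hypothetical, not Zero's documented schema.

```python
import json

# Hypothetical payload in the spirit of `zero check --json`.
# Only message, code and repair.id come from the article; the rest is assumed.
SAMPLE = """
{"diagnostics": [
  {"code": "NAM003", "message": "unknown name 'total'", "line": 12,
   "repair": {"id": "declare-missing-symbol"}}
]}
"""

def split_tracks(raw: str):
    """Return (human_track, agent_track) from one diagnostics payload."""
    diagnostics = json.loads(raw)["diagnostics"]
    # Human track: prose that is allowed to change between releases.
    human_track = [d["message"] for d in diagnostics]
    # Agent track: stable identifiers that code can match on exactly.
    agent_track = [(d["code"], d["repair"]["id"]) for d in diagnostics]
    return human_track, agent_track

human_track, agent_track = split_tracks(SAMPLE)
print(human_track)
print(agent_track)
```

The point of the sketch is that neither consumer has to parse the other's track: a terminal can print the messages while a repair loop keys on the identifiers.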

Why would a compiler do this? You could call it a gimmick — Zero has 981 stars, two releases, sits under Vercel Labs not Vercel’s main line, and the HN thread is split between “brings nothing new” and “agent repair loop infrastructure.” Or you could ask a different question: when an AI agent writes code, the compiler throws an error, and the agent tries to fix it — a loop that runs millions of times a day — what should compiler output actually look like?

The traditional answer: it should be readable by humans. Zero’s answer: it should be matchable by agents. The difference isn’t technical — it’s a change in the premise of who consumes compiler output.

Same Problem, Different Entry Point

Cloudflare ran into the same problem from a different angle. Their API has 2,594 endpoints. If the MCP approach is to create one tool per endpoint, the tool definitions alone would consume 1,170,523 tokens with full schemas, and still 244,047 tokens when trimmed to required parameters only. Either way, the context window is blown before any conversation begins. Their solution: compress the entire API into two tools, search and execute, and let the agent write JavaScript against a typed API client, with the code running in isolated sandboxes. The tool-definition footprint dropped from 1,170,523 tokens to 1,069, a 99.9% reduction (Cloudflare Code Mode MCP).
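The arithmetic behind those figures is easy to check. The sketch below uses only the token counts quoted above; the 200,000-token context window is an assumption added for illustration, not a number from Cloudflare.

```python
# Figures quoted in the paragraph above.
per_endpoint_tool_tokens = 244_047
tokens_before = 1_170_523
tokens_after = 1_069

# Assumption for illustration only: a 200,000-token context window.
ASSUMED_CONTEXT_WINDOW = 200_000

# Fraction of the original footprint that was eliminated.
reduction = 1 - tokens_after / tokens_before
# How much of the assumed window the smaller tool-definition figure would fill.
window_share = per_endpoint_tool_tokens / ASSUMED_CONTEXT_WINDOW

print(f"reduction: {reduction:.2%}")
print(f"share of assumed window: {window_share:.0%}")
```

The reduction works out to about 99.91%, consistent with the rounded 99.9%, and under the assumed window size even the 244,047-token figure would overflow the context on its own.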

Put these two cases side by side, and you see something in common: they’re both working with AI’s capability boundaries. Zero handles ambiguity — the same error can be worded differently across compiler versions, and agents guess wrong. The fix is giving them stable identifiers instead of natural language. Cloudflare handles choice — an agent’s accuracy drops sharply when picking from hundreds of similar tools (Anthropic’s own data: at 134K tokens of tool definitions, Opus 4 achieved only 49% accuracy). The fix is letting it write code instead of picking from a menu.

Neither of these is “exposing existing features to AI.” That’s not design — that’s thin wrapping.

Most MCP servers are thin wrappers. They map API endpoints one-to-one to MCP tools. The format is right, the content unchanged. AI can call them, but the calling context doesn’t tell it when to use which, in what order, or how to recover from errors. The guidance knowledge — the operational knowledge that should ship alongside the tools — is missing. The leverage tools — those that encapsulate error-prone tasks into deterministic operations — are also absent. I discussed this framework in another article: when the consumer shifts from human to AI, what you ship shouldn’t just be the core API. It should also include the knowledge system for guiding AI use, and tools for bypassing AI’s weaknesses. I used Stripe as an example then; looking at compilers and MCP now, the logic is the same.

Set these two kinds of design side by side, and what really distinguishes them isn’t “did you think about AI” — it’s whether you’ve taken AI’s fundamental characteristics seriously. These characteristics are completely different from human ones, yet most “AI-first” products don’t address them in their design at all.

AI Has No Memory

Human engineers accumulate knowledge — you’ve worked at this company for three years, you know which modules are problematic, you know where the last refactor went wrong. AI agents don’t. Every session starts from zero. Humans can fill in context through experience, colleagues, and code review; agents can only rely on what you feed them at startup.

Zero ships its guidance knowledge with the compiler: the zero skills get zero --full command lets the agent read a Markdown-format operations guide directly from the compiler. The guide covers syntax rules, build processes, and common pitfalls, and because it is packaged with the compiler it always matches the installed version. The agent never reads documentation describing an API that differs from the one installed, as can happen with web docs. AGENTS.md follows the same logic: a file at the repository root that injects project background, build commands, and code conventions into every session’s context. Matt Pocock has cited HumanLayer’s “instruction budget” concept: frontier LLMs can reliably follow roughly 150 to 200 instructions. This means AGENTS.md can’t bloat; every extra rule competes with the task itself for the model’s attention. This is completely unlike a human reading a README. Humans can skip, scan, and selectively ignore; agents treat every line you write as an instruction to follow.
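One way to keep that budget visible is a rough lint over AGENTS.md. The sketch below is a heuristic, not an established tool: it treats every bullet line as one instruction, which is a simplifying assumption, and it uses the low end of the quoted 150 to 200 range as the limit. The function names and the sample file contents are hypothetical.

```python
BUDGET = 150  # low end of the 150-200 range quoted above

def count_instructions(text: str) -> int:
    """Rough proxy: count bullet lines, ignoring headings, blanks and prose."""
    return sum(
        1
        for line in text.splitlines()
        if line.strip().startswith(("- ", "* "))
    )

def within_budget(text: str, budget: int = BUDGET) -> bool:
    """True when the file's bullet count fits inside the instruction budget."""
    return count_instructions(text) <= budget

# Hypothetical AGENTS.md contents.
SAMPLE = """# Build
- Run `make test` before committing
- Never edit files under vendor/

# Style
* Prefer small functions
"""

n = count_instructions(SAMPLE)
print(n, within_budget(SAMPLE))
```

A check like this could run in CI so that the file's growth is a deliberate decision rather than slow accumulation.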

This limitation also explains why Salesforce Headless 360 isn’t just “adding an API” — it’s encoding business context that previously required a human to log in and navigate the UI (whether a customer has an open escalation, a renewal due in 30 days, a violated SLA) as data the agent can access while writing code. It’s not that the agent got smarter — it’s that information that previously only lived in human memory and UI navigation paths now has an interface the agent can directly consume.

AI Can’t Browse, Only Execute

Faced with a menu of hundreds of items, a human can scan and find what they need. An agent can’t. Give it 100 tools, and its accuracy on the first five is already low; by the fiftieth, it’s essentially guessing. Anthropic recommends keeping Claude Code’s core toolset at around 12. It’s not that the model isn’t good enough — it’s that tool selection as a task doesn’t match how LLMs make decisions.

Cloudflare’s response is extreme: instead of optimizing tool descriptions to make selection easier, eliminate selection entirely. Give the agent search and execute, and let it write code to call the API. Agent code generation accuracy is far higher than tool selection accuracy.
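The shape of the two-tool pattern can be sketched in a few lines. This is a toy illustration, not Cloudflare's implementation: their version has the agent write JavaScript that runs in isolated sandboxes, while this sketch uses Python with a made-up catalog, a fake client, and `exec()`, which provides no real isolation.

```python
# Toy catalog standing in for a large API; every entry is made up.
CATALOG = {
    "zones.list": "List zones in the account",
    "dns.records.list": "List DNS records for a zone",
    "dns.records.create": "Create a DNS record in a zone",
}

class FakeClient:
    """Stand-in for a typed API client; returns canned data."""
    def call(self, operation: str, **params):
        if operation not in CATALOG:
            raise KeyError(f"unknown operation: {operation}")
        return {"operation": operation, "params": params}

def search(query: str) -> dict:
    """Tool 1: return only the catalog entries that mention the query."""
    q = query.lower()
    return {name: doc for name, doc in CATALOG.items()
            if q in name.lower() or q in doc.lower()}

def execute(code: str) -> object:
    """Tool 2: run agent-written code against the client.

    exec() is NOT a sandbox; a real deployment needs process-level isolation.
    """
    scope = {"client": FakeClient(), "result": None}
    exec(code, {"__builtins__": {}}, scope)
    return scope["result"]

# The agent first narrows the surface, then writes code against what it found.
hits = search("dns")
out = execute('result = client.call("dns.records.list", zone="example.com")')
print(sorted(hits))
print(out)
```

However large the catalog grows, the context only ever holds two tool definitions plus whatever `search` returns for the current task.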

Stripe’s Agent Toolkit uses a gentler version: curated surface area. Stripe has hundreds of endpoints; exposing all of them to an agent is like asking it to blind-pick from an enormous menu. The Toolkit selects the dozen or so operations an agent most likely needs, each with precise schemas and descriptions. These tools do exactly the same thing as the traditional Stripe SDK — they all call the payment API. What changed is the design assumption of the interface layer: traditional SDKs face human developers who read documentation; the Agent Toolkit faces AI systems that discover capabilities at runtime.

AI Needs Precision, Humans Need Abstraction

This is the aggressive transparency principle discussed in Beyond DRY. Traditional API design faces human developers and centers on protective abstraction — hiding complexity, providing clean interfaces, preventing user mistakes. AI-native design is the reverse: an agent won’t be scared off by complex error messages, but it will get stuck on vague ones.

A concrete example: when a traditional API catches a low-level network timeout, it typically throws an abstracted APIFailureError with a line saying “Operation failed, please try again later.” This is friendly to humans — they don’t need to know whether it was a TCP handshake timeout or a DNS resolution failure. But it’s fatal to an agent. An agent’s effectiveness depends on a “try-feedback-fix” loop. Vague error messages break this loop — the agent doesn’t know what specifically went wrong and can only flail randomly before retrying.

The correct approach is to preserve the original ConnectTimeoutError with the full stack trace and context. The agent immediately sees that a specific step timed out and can retry with backoff or switch endpoints. The information volume is excessive for a human; it’s exactly right for an agent.
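A minimal sketch of what preserving the original error buys the caller. The ConnectTimeoutError name comes from the example above; its fields (host, phase, timeout_s), the `call_with_backoff` helper, and the `flaky` operation are hypothetical and only illustrate the idea.

```python
import time

class ConnectTimeoutError(Exception):
    """Illustrative low-level error carrying machine-usable context."""
    def __init__(self, host: str, phase: str, timeout_s: float):
        super().__init__(f"{phase} to {host} timed out after {timeout_s}s")
        self.host, self.phase, self.timeout_s = host, phase, timeout_s

def call_with_backoff(operation, retries=3, base_delay=0.01, sleep=time.sleep):
    """Retry only the failure we can identify; anything else propagates."""
    for attempt in range(retries):
        try:
            return operation()
        except ConnectTimeoutError:
            if attempt == retries - 1:
                raise  # re-raise the original error with its context intact
            sleep(base_delay * 2 ** attempt)  # exponential backoff

attempts = []

def flaky():
    """Hypothetical operation that times out twice, then succeeds."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectTimeoutError("api.example.com", "TCP connect", 5.0)
    return "ok"

result = call_with_backoff(flaky, sleep=lambda s: None)
print(result, len(attempts))
```

Had the timeout been wrapped in a generic APIFailureError, the `except` clause would have nothing specific to match, and the retry policy would have to apply to every failure or to none.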

Zero’s repair IDs are the same principle applied differently. Natural language error messages have ambiguity: the same error can be worded differently across versions, and the same wording can be parsed into different fixes by the agent. The NAM003 → declare-missing-symbol mapping is stable. Certainty at every step of the repair loop is worth more than “letting the AI understand natural language.”
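The difference between matching on wording and matching on identifiers can be shown directly. In the sketch below, the two diagnostics are hypothetical wordings of the same error from two imagined compiler versions; only the NAM003 code and the declare-missing-symbol repair ID come from the text, and the handler is a placeholder.

```python
# Two hypothetical wordings of the same error from different compiler versions.
V1 = {"code": "NAM003", "message": "unknown name 'total'",
      "repair": {"id": "declare-missing-symbol"}}
V2 = {"code": "NAM003", "message": "cannot find symbol `total` in scope",
      "repair": {"id": "declare-missing-symbol"}}

def pick_by_wording(diag):
    """Fragile: keyed on prose that any release is free to reword."""
    if "unknown name" in diag["message"]:
        return "declare-missing-symbol"
    return None

# Placeholder repair handlers keyed by stable repair ID.
REPAIRS = {
    "declare-missing-symbol": lambda d: f"insert declaration ({d['code']})",
}

def pick_by_id(diag):
    """Stable: keyed on an identifier that survives rewording."""
    handler = REPAIRS.get(diag["repair"]["id"])
    return handler(diag) if handler else None

print(pick_by_wording(V1), pick_by_wording(V2))
print(pick_by_id(V1), pick_by_id(V2))
```

The wording-based matcher silently loses the second diagnostic, while the ID-based dispatch treats both versions identically.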

Where We Are Now

Progress varies dramatically by layer. The platform layer is moving fastest: Salesforce, Stripe, Atlassian, and AWS are shipping agent-first products as core roadmap deliveries. The protocol layer is standardizing. The security layer is still early. Whether the compiler-layer experiments survive is unknown.

There’s one common criticism worth taking seriously. Tom Bedor, in MCP is a Fad, argues that instead of building new protocols for agents, we should make human interfaces clearer. This criticism hits a real problem — a lot of what’s done for agents is indeed just thin wrapping. But it misses another category: compiler output, CRM platforms, payment systems — products where decades of design have baked in the assumption that “there’s a human on the other end.” You can’t solve that with a better README.