From Process Certainty to Outcome Certainty: A Different Kind of Confidence in the Age of AI

Even in 2026, turning an AI demo into a production-ready product is surprisingly hard. Take Chinese-to-English translation. Everyone assumes LLMs solved this ages ago—just call an API, right? But when we recently tried to add an automatic translation sync feature to the Superlinear Academy community, we discovered just how painful the developer experience actually is.

The core issue is uncertainty in AI outputs. A post is too long, and the AI gets lazy: it translates the first half properly, then summarizes the rest. Or it loses the thread mid-output, starting in English but randomly lapsing back into Chinese. Or it makes subtle formatting mistakes, like dropping bold text. Or it times out halfway through and just hangs there until it crashes.

To deal with all this uncertainty, we had to add layers of handling in our code. If a post was too long, we'd split it into chunks, call the API separately for each, and stitch the results back together (similar to what I described in Wide Research). But that created another problem: terminology across chunks wasn't consistent. So we had to design additional workflows to pass a glossary from one chunk to the next, ensuring the same Chinese term didn't get translated into two different English words. On top of that, we added a check: if the output still contained Chinese characters, retry the translation. And to handle timeouts without duplicating work, we implemented checkpoint-based resumption—re-translating only the failed portion and splicing it back in.
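
To make the scale of that scaffolding concrete, here is a minimal sketch of what the orchestration looked like in spirit. The `call_translation_api` function, the chunk handling, and the checkpoint file name are hypothetical stand-ins rather than our production code:

```python
# A condensed sketch of the babysitting code described above. The
# call_translation_api function, chunking, and checkpoint file are
# hypothetical stand-ins, not our production implementation.
import json
import pathlib
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # detect untranslated Chinese characters

def translate_post(chunks, call_translation_api, checkpoint="progress.json"):
    state = {"done": [], "glossary": {}}
    path = pathlib.Path(checkpoint)
    if path.exists():
        state = json.loads(path.read_text())      # checkpoint-based resumption

    for i, chunk in enumerate(chunks):
        if i < len(state["done"]):
            continue                               # this chunk survived the last crash

        for _ in range(3):                         # retry while Chinese leaks through
            out, new_terms = call_translation_api(chunk, glossary=state["glossary"])
            if not CJK.search(out):
                break
        state["glossary"].update(new_terms)        # keep terminology consistent across chunks
        state["done"].append(out)
        path.write_text(json.dumps(state, ensure_ascii=False))  # persist progress

    return "\n\n".join(state["done"])
```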

All this effort did improve success rates. Even very long community posts would eventually get translated correctly. But it was exhausting. We spent 90% of our time not on making translations better, but on workflow and orchestration to babysit the AI. And after a while, because edge cases kept popping up and some only happened once or twice, we stopped fixing them. It felt like we'd never be done. No productivity gains at all. We might as well have stuck with the old machine translation APIs.

Then we tried a completely different approach—and it actually worked. But before I explain what we did, I want to share a deeper analysis of why this problem exists in the first place.

The Four Layers of Agent Integration

As I mentioned, calling an AI API isn't as simple as fire-and-forget. There's a lot of supporting infrastructure involved. From an integration standpoint, this work falls into four distinct layers:

  1. Model layer: Which model do we use—Claude or GPT? Opus or Haiku? What reasoning effort level?
  2. Protocol layer: Chat Completions API or Responses API? MCP or RESTful API? How do we handle rate limits? Should we enable JSON mode? When people talk about "calling an API," they usually mean this layer.
  3. Runtime layer: How do we manage state? How do we invoke tools? How do we feed file contents to the AI? How do we control permissions and concurrency? This layer isn't part of traditional API development, but if you want to use AI reliably in production, you can't skip it.
  4. Contract layer: What does success actually look like? What checks do we run on AI outputs? How do we set up guardrails? When do we bring in human review? How do we ensure compliance with content policies? This layer determines whether we can trust AI outputs enough to use them in production.

When people talk about AI product development, most discussions focus on the protocol layer. But in actual development, the runtime layer consumes the most time. Unlike traditional APIs, LLMs introduce massive uncertainty, and the runtime layer has to absorb and handle all of it. The problem is that the runtime layer has nothing to do with business logic. Whether you're building translation, code generation, or a customer service bot, you still have to deal with lazy outputs, chunking, context management, and concurrency control. Every team ends up reinventing the same wheels. Which naturally raises the question: can we outsource the runtime layer?

It's not that simple. Different models have different failure patterns. Some follow instructions precisely but get lazy on long texts. Others are creative but terrible at format control. The way you clean up after each model is different, especially for long-tail failure patterns. So the runtime layer often ends up highly customized to specific models, making it hard to reuse—let alone outsource.

But something has recently changed this picture. Claude Code itself isn't open source, but more and more model providers are actively building compatibility with it. Kimi, DeepSeek, and GLM all offer official integrations—just change a few environment variables, and Claude Code can call these models under the hood. This is interesting. It means Claude Code has transcended being just a tool and become something reusable.
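
For a sense of what that swap involves, here is a minimal sketch, assuming the provider exposes an Anthropic-compatible endpoint. The base URL and the key variable below are placeholders; each provider's documentation gives the exact values:

```python
# Point Claude Code at a third-party, Anthropic-compatible endpoint.
# The base URL below is a placeholder; use the one from your provider's docs.
import os
import subprocess

env = os.environ.copy()
env["ANTHROPIC_BASE_URL"] = "https://api.example-provider.com/anthropic"  # placeholder
env["ANTHROPIC_AUTH_TOKEN"] = os.environ["PROVIDER_API_KEY"]              # your provider key

# Same Claude Code workflow, different model underneath.
subprocess.run(
    ["claude", "-p", "Translate post.md into English in place."],
    env=env,
    check=True,
)
```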

More importantly, when model providers claim Claude Code compatibility, what they're actually doing is adapting their models' failure patterns to match Claude Code's expected behavior. In other words, the cleanup work hasn't disappeared—it's just shifted. Instead of us developers doing it, the model providers do it. To enter this ecosystem, they have to ensure their models behave reliably within Claude Code's runtime. (This discussion applies equally to similar tools like Codex and Cursor Agent, since their command-line interfaces are very similar and easy to adapt to.)

In other words, Claude Code, Codex, and Cursor Agent are becoming a reusable Agentic Runtime.

This solves the long-tail problem I mentioned earlier. Those scattered edge cases that no single team could fix—the entire ecosystem can. Every model provider that wants Claude Code compatibility, including Anthropic itself, is filling in the gaps for us. So a new approach emerges: instead of "calling an API and cleaning up ourselves," we can just "hand it off to Claude Code." By leveraging all the compatibility work the ecosystem has done, we're effectively reusing its runtime layer. We've gone from building our own wheels to standing on a converging standard.

In Practice: Handing Translation to Claude Code

This is why we decided to try a different path: instead of continuing to handle uncertainty at the runtime layer, we'd just stand on this converging standard. So we tried handing the community translation task to Claude Code. The most immediate impression was that most of the problems we'd spent so much time handling simply disappeared.

Take the laziness problem. Before, when calling the API directly, we had to handle chunking, stitching, and validation ourselves. But Claude Code works differently: its basic unit of operation is the file. A file is stateful; it lives on disk and persists independently of any single API call. So we can have Claude translate chapter by chapter, writing back to the file after each one, as in the sketch below. The whole process doesn't need an external orchestration layer to track and manage progress.
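
As a rough illustration, the driver for this file-centric approach can be very small. The sketch assumes the claude CLI's non-interactive -p (print) mode, and the prompt, file name, and chapter convention are illustrative, not a prescribed format:

```python
# Drive translation chapter by chapter, with the file itself as the unit of state.
# The prompt, file name, and "## " chapter convention are illustrative.
import subprocess

PROMPT = (
    "post.md is a Chinese community post whose chapters start with '## '. "
    "Translate the next untranslated chapter into English in place, save the file, "
    "and stop. If every chapter is already in English, reply with exactly DONE."
)

def translate_next_chapter() -> str:
    result = subprocess.run(
        # Grant only the file tools this task needs; flag syntax per the CLI docs.
        ["claude", "-p", PROMPT, "--allowedTools", "Read", "Edit", "Write"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# No external progress tracking: the file on disk is the progress.
while "DONE" not in translate_next_chapter():
    pass
```

There is no chunk list, no glossary plumbing, and no checkpoint file here; the document on disk is the only state the loop needs.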

Checkpoint-based resumption is the same. Before, when an API call timed out, we had to record the breakpoint, re-translate only the failed portion, and splice it back in. Now we don't. If translation crashes halfway through, the file is still there, and whatever was already translated isn't lost. Just restart and tell Claude Code to continue. It reads the file, sees what's not done yet, and picks up where it left off.

Terminology consistency used to require dedicated workflow design—passing a glossary from the first chunk to the second in a structured or incremental way. Now, Claude Code reads the entire file before making changes, so it naturally sees the earlier context. So the problem of terminology consistency can be solved with a simple prompt: first read the whole file, see what terminology was used before, then translate lines XX to YY.

The problem of Chinese characters leaking into the output used to require detection, judgment, and retry logic. Now we can just tell Claude in the prompt: after translating, scan the whole thing and make sure there are no leftover Chinese characters. Even better, since Claude Code can run Python, we can have it write a simple script to validate that the final file meets our format requirements. It writes the check, runs it, and fixes any issues itself.
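
For illustration, the self-check it writes tends to look something like this. This is one plausible version, not a transcript of Claude's actual output:

```python
# One plausible version of the self-check: fail loudly if any Chinese
# characters survive in the translated file.
import re
import sys

text = open("post.md", encoding="utf-8").read()
leftovers = [
    (lineno, line)
    for lineno, line in enumerate(text.splitlines(), start=1)
    if re.search(r"[\u4e00-\u9fff]", line)
]

if leftovers:
    for lineno, line in leftovers:
        print(f"line {lineno}: untranslated Chinese -> {line.strip()}")
    sys.exit(1)
print("OK: no Chinese characters left")
```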

The common thread here is that problems we used to solve at the workflow level can now be stated clearly in natural language in the prompt, and the agent handles them reliably. We can finally focus on making translations better, instead of preventing the system from doing something stupid.

The Agentic Loop and Evaluation-First Mindset

These changes finally let us focus on translation quality itself. But afterward, I started wondering: why can Claude Code do this? Does switching the way we call the API really make that much difference?

As I mentioned, one important reason is that we're reusing the ecosystem's compatibility work. But that's only the surface-level explanation. Looking at it through the four-layer framework, the direct reason Claude Code works is that it allows the AI to observe the results of its own actions.

This sounds obvious, but it's the fundamental difference between agentic AI and traditional API calls. When you use an API, the AI only sees the prompt it's fed, produces an output, and that's it. If there's a problem—malformed JSON, missing fields, lazy second half—the AI doesn't know. You're the one who notices. You're the one who decides whether to retry. You're the one who writes the fix logic. This is the direct reason we feel like AI is dumb and we have to clean up after it.

But Claude Code is different. After it modifies a file, it can run Python to invoke a JSON parser and see an error message saying line 9527 has a syntax error. That error gets fed back to it, so it knows what to fix. It fixes it, runs again, passes, moves on. This execute → observe → correct cycle is the agentic loop.
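
The feedback it observes is nothing exotic; it is the same error message any of us would read. For example:

```python
# The kind of signal the agent observes after editing a file: the standard
# JSON parser fails with a message that names the exact line and column.
import json

with open("output.json", encoding="utf-8") as f:
    json.load(f)
# e.g. json.decoder.JSONDecodeError: Expecting ',' delimiter: line 9527 column 12
```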

This is also why the file abstraction matters so much. Files are carriers of state, and visible state is what makes the closed loop possible. By turning translation from "call an API once and get a result" into "have an agent operate on files in a working directory," we effectively give the AI a pair of eyes. It can see what it did in the previous step, see the output of validation scripts, and decide what to do next based on that information. This is the capability the runtime layer provides.

But just because the agentic loop can run doesn't mean it runs correctly. Observing results is one thing; knowing what counts as "correct" is another. That's what the contract layer has to answer. Back to the translation example: even with Claude Code, it wasn't like we switched tools and everything magically worked.

If we just said "translate this file to English," Claude would do it, but there would still be a few paragraphs with Chinese characters mixed in. Same problem as with the API—except now it's much easier to fix. We can add a line to the prompt: after translation, run a Python script to check for leftover Chinese characters, and fix any you find. Claude Code reliably writes a simple regex check, runs it, finds issues, goes back to fix them, runs again, until it passes.

But this reveals something more important: the earlier failures weren't because Claude was stupid; they happened because Claude didn't know what "done" meant. From its perspective, applying a Chinese-to-English operation to every chapter meant the task was complete. But for us, "done" also included correct formatting, no leftover Chinese, consistent terminology: all implicit expectations. These expectations were in our heads; Claude couldn't see them. Once we made these expectations explicit and told it how to verify them, it could judge for itself whether it was finished.

I like to use this analogy: imagine you're giving a task to an intern with amnesia. This intern has no context, doesn't know what you discussed before, doesn't know your implicit expectations—they can only see the instructions you give them this one time. You need to write the acceptance criteria so clearly that, based on this information alone, they can judge whether they're done. If they think they're not done, they know what's missing. In my experience, when you write things at this level of detail, you can reliably expect Claude Code or Codex to complete the task. If it can't, don't panic and blame the AI—first check whether you wrote the criteria clearly enough. An even better approach is to codify acceptance criteria into executable checks, like Python scripts. That way the agent can verify on its own, without human supervision.
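
To make "executable checks" concrete, here is a sketch of what codified acceptance criteria could look like for the translation task. The file names and the specific checks are illustrative; the point is that a failing run says exactly what is still missing:

```python
# acceptance.py -- executable acceptance criteria for a translated post (illustrative).
# Exit code 0 means "done"; any failure states exactly what is still missing.
import re
import sys

def check(source="post.zh.md", target="post.en.md"):
    src = open(source, encoding="utf-8").read()
    dst = open(target, encoding="utf-8").read()
    failures = []

    if re.search(r"[\u4e00-\u9fff]", dst):
        failures.append("untranslated Chinese characters remain")
    if len(re.findall(r"^#{1,6} ", src, re.M)) != len(re.findall(r"^#{1,6} ", dst, re.M)):
        failures.append("heading count differs from the source")
    if src.count("**") != dst.count("**"):
        failures.append("bold markers were dropped or added")

    return failures

if __name__ == "__main__":
    failures = check()
    if failures:
        print("NOT DONE:\n- " + "\n- ".join(failures))
        sys.exit(1)
    print("DONE: all acceptance criteria pass")
```

With something like this in place, the prompt only needs to say: keep working until `python acceptance.py` prints DONE.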

So now we can clearly describe the relationship between these two layers. The runtime layer gives the agent observational capability—it can see what it did and what the results are. The contract layer tells it what success looks like, so it can judge whether it's done. Both are essential: observation without standards means the agent spins aimlessly, producing something beautiful that may not meet our requirements; standards without observation means the agent stops after one attempt, and whether it's right is pure luck. The agentic loop plus evaluation-first is what creates a complete closed loop.

From Process Certainty to Outcome Certainty

Once this closed loop is established, it subtly changes where our trust in AI comes from. Behind it are actually two different kinds of certainty.

Traditional programmers' sense of security comes from process certainty. Every line of code I write is under my control. Every branch, every edge case—I've thought about them all. The program's behavior is something I designed, and as long as it follows this logic, it will definitely produce correct results. This certainty is tangible, and this ability to translate outcomes into program behavior is a fundamental skill we've trained over many years.

But the agentic loop and evaluation-first mindset we just discussed represent a different kind of certainty. We don't specify every step of the process; instead, we specify what the destination looks like and how to verify we've arrived. The process is uncertain—Claude might translate first then check, or check while translating; it might use regex or some other method—but the outcome is certain: as long as the acceptance criteria are right, the final product will be right.

This is outcome certainty.

Behind these two kinds of certainty are two different cost structures. The economics of process certainty: code execution costs almost nothing, but the human effort to write code is expensive. So we carefully design logic, pursue reuse, avoid duplication—we amortize human cost across every execution. The economics of outcome certainty is the opposite: intelligence is getting cheaper. The cost of having AI repeatedly try, check, and correct is dropping fast. We can spend tokens lavishly to buy certainty—not by writing more defensive code, but by letting AI use its reasoning ability to combat uncertainty.

This is the same logic I discussed in Disposable Software and Compressed Reality. That article was about how when the cost of writing code approaches zero, disposable software becomes the optimal strategy. The change here is broader: it's not just code, but reasoning and intelligence itself that's getting cheaper. Translation isn't coding, but it's equally something produced by burning tokens. When that cost is low enough, we can have AI do checks on the spot, write validation scripts on the spot, loop repeatedly until the result is correct—instead of pre-encoding all possible situations into rules.

Beyond the cost structure shift, there's also a difference in ceilings. The upper bound of process certainty is our imagination and energy—the situations we can think of, the logic we can write, that's the boundary of what the system can handle. Outcome certainty has a higher ceiling: we don't need to enumerate every possible path, just define clearly what's correct, and the agent will find its own way to that state.

But we're not used to this kind of certainty, and it often feels unsettling. One of the core skills we've taken pride in throughout our careers is precisely this: translating outcomes into processes. The boss wants a system that handles 100,000 concurrent connections—we design an architecture to guarantee that outcome. The PM says uploaded files can't exceed 10MB—we write validation logic to block oversized requests. So when we start using AI, this habit naturally continues. We instinctively want to constrain AI behavior with rules: output must be JSON format, every field must exist, handle this case this way, handle that case that way.

But this path has limits. AI is not a deterministic system. Trying to constrain it through process, you'll find yourself using massive amounts of rules to hedge against its uncertainty. More and more rules, more and more patches, until you spend more effort on defense than on solving the actual problem. This was exactly the trap we fell into when using APIs for translation.

But what if we could accept a small concession? What if we accepted process uncertainty and instead constrained AI behavior by specifying outcomes? Things would change. Instead of saying "you must use this method to handle this case," we say "the final product must meet these conditions; how you meet them is up to you." This way, AI's flexibility is no longer a risk to hedge against—it becomes a resource for completing the task.

Of course, this path wasn't easy to take before. If you wanted AI to observe its own results, judge right from wrong, and decide what to do next, you had to build your own agentic loop. And wrapping an agent is harder than it looks: you have to handle tool calling formats, parse AI outputs, manage context windows, and adapt to different models' characteristics. By the time you're done, you realize you've simply paid for certainty with yet another layer of hand-built process.

But now you don't have to. Tools like Claude Code, Codex, and Cursor Agent have done the dirty work of the runtime layer. The agentic loop is ready-made, the file system is ready-made, tool calling is already wrapped. What you need to do is think clearly about what outcome you want, how to verify that outcome, and then tell it in natural language.

So here's my suggestion: try embracing process uncertainty. Don't instinctively specify every step of AI behavior. Instead, directly describe your expectations for the final result and codify them into verifiable standards. Leave the runtime layer stuff to tools like Claude Code and focus on the contract layer: define what's correct, define how to verify it.

This is a different way of working, and a different source of confidence.

Conclusion

Of course, this approach has its boundaries.

First, the nature of the task itself. Outcome certainty works on the premise that you can clearly define what "correct" means. Translation is a good fit because its acceptance criteria can be formalized: correct format, no leftover Chinese, consistent terminology, all of which can be written as scripts for the agent to run itself. But for some tasks, "correct" is hard to define, or the defined criteria are themselves ambiguous. That said, in those cases, using rules to constrain the process would be even harder. At least evaluation-first gives you a clear failure signal.

Second, security. With an API, the AI has no control over your system. It receives a prompt, returns text, and that's it. But tools like Claude Code are different. They can read and write files, execute Python, run bash commands. This is why they're powerful, and also why they're dangerous. This needs to be taken seriously. Our approach is to tighten permissions at the configuration level: use the --allowedTools parameter to limit which tools it can call, constraining execution to specific scripts. Going further, you can combine this with the lightweight sandboxing solutions that are now widely available, so even if the agent messes up, it only ruins files inside the sandbox without affecting the host system.
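
As one example of tightening permissions, a locked-down invocation might look like the sketch below. The tool names and the Bash pattern follow the --allowedTools syntax as we understand it, so double-check them against the current CLI documentation; acceptance.py is the illustrative check script from earlier:

```python
# Run the task with a deliberately narrow allowlist (illustrative): the agent can
# read and edit files, but the only shell command it may run is the check script.
# Verify tool names and pattern syntax against the current CLI docs.
import subprocess

subprocess.run(
    [
        "claude", "-p",
        "Translate post.md into English in place, then run acceptance.py until it passes.",
        "--allowedTools", "Read", "Edit", "Write", "Bash(python acceptance.py:*)",
    ],
    check=True,
)
```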

There are definitely still many pitfalls here. How to design the permission model, how to configure the sandbox, how to roll back when things go wrong—these are all open questions without standard answers. But I'm optimistic about this direction. Security problems are engineering problems, and engineering problems can be solved. These risks don't mean this path is impassable.

Back to the opening question: is it better to treat AI as a component of the system, calling it from our code, building a translation product with AI features? Or to treat AI as the core of the system, having it call our programs, building an AI Agent that accomplishes translation tasks?

We tried both paths and found the latter had surprisingly higher success rates and stability. This might be because the latter lets us reuse the ecosystem's compatibility work, because the agentic loop lets AI self-correct, because evaluation-first lets us constrain AI with outcomes rather than processes. These factors combine to form a different way of working.

It requires giving up some things: the sense of control over process, the certainty about every step, and the instinct we spent years training—translating outcomes into procedures. But it also gives us something: a higher ceiling, less grunt work, and a new kind of confidence based on outcomes.

How far can this pattern extend? I'm not sure, but at least in the translation scenario, it completely transformed our development experience. We've compiled these practices into a how-to guide that you can share with your own AI and try right now.
