Make AI More Accurate, or Make Mistakes Cheaper

Published Jun 22, 2026

AI modifies 14 files in one turn, you don’t have time to read the diff, and when you run it you realize it went the wrong direction. How do you get back?

This question is invisible on AI coding tool benchmark leaderboards, but it determines something more fundamental: whether users dare to let AI work autonomously. Revertability is not a remedy after errors happen; it is a precondition that determines how much users let AI do. When reverting is cheap, exploration is bold; when reverting is expensive, every handover requires hesitation.

The safety narrative of AI coding tools splits into two paths around this question.

One path asks how to make AI more accurate, accurate enough that rollback isn’t needed. The main force on this path is Anthropic, whose Opus 4.8 launch focused on the honesty metric: the probability that the model lets defects in its own written code go unnoticed is roughly four times lower than the previous generation (source). On the benchmark front this path has extreme momentum: several models have pushed the SWE-bench Verified resolved rate to 93.9%, and OpenAI subsequently posted that this benchmark can no longer measure frontier coding capability (source). The industry’s response is to build new benchmarks and keep pushing harder along the same dimension.

The other path questions a different premise: AI will inevitably make mistakes, we just don’t know when, so make the cost of mistakes as low as possible. Only one vendor is currently taking this stance explicitly and systematically. When Replit CEO Amjad Masad reshared Replit’s Snapshot Engine technical blog in December 2025, he stated the core philosophy bluntly: coding agents will make mistakes, so they must run on infrastructure where every operation is reversible (source). Rollback is an infrastructure-layer design constraint, not something to remember only after an incident.

The divergence between the two paths is definitional: one defines reliability by accuracy rate, the other defines reliability by recoverability.

Replit Made Rollback Its Safety Philosophy

Replit’s rollback system has two layers with a clear evolutionary trajectory. The 2022 File History was character-level OT logging, solving the small problem of a single person breaking a single file (source). In September 2024, Checkpoints & Rollbacks launched alongside Agent (source), and in May 2025 it was upgraded to App History / Time Travel, expanding coverage from file code to AI conversation context, Agent memory, and environment variable copies, rolling back code and environment and all other side effects within the same mechanism.

These engineering capabilities did not stop at the implementation layer. On the Possible.fm podcast, Amjad Masad used video games as an analogy: saving and loading is an important part of exploring a game, and players shouldn’t have to start over from scratch after one wrong step; Replit wants users to feel that any operation can be reversed (source). Save and load is itself an exploration mechanism. You take a step forward knowing you can come back if it’s wrong, which is exactly why you dare to take the next step. Moved into AI coding, rollback doesn’t take effect only after an error; from the very first action it changes user behavior.

Why Replit is the only vendor that walked this path to the end is answered by its market positioning and product form. Replit’s target users are non-programmers who use natural language to have Agent build apps. Amjad Masad publicly said in early 2025 that Replit no longer targets professional programmers (source). This group can’t read diffs, can’t use git, and has no ability to manually recover after AI breaks their code. For them, rollback is not a convenience feature, it’s the only safety net. Accuracy metrics are meaningless to this group: they can’t even read the code, so the difference between a 93.9% and 73% benchmark is imperceptible to them. But “can I undo it when AI breaks my app” they perceive in a second.

Product form is the second reason. Replit is a full-stack hosted environment where the editor, runtime, database, and deployment all live inside it. On the Sequoia podcast, Amjad said Replit aims to be “the last tool you need to adopt to build software” (source). Because it hosts everything, it can snapshot everything. Cursor and Claude Code only touch code files, not databases and runtime state, so even if they wanted to do full-state rollback like Replit they couldn’t; they’re not on that information chain. Taking rollback to the infrastructure layer requires that you own that layer of infrastructure.

The third reason is incidents forcing the issue. In July 2025, Replit Agent performed destructive operations on investor Jason Lemkin’s SaaStr app, Amjad Masad publicly apologized, and in subsequent product updates, Checkpoints & Rollbacks ranked first among four safety measures (source). The checkpoint feature existed before the incident, but the incident elevated it from a default-present feature into an explicit safety narrative. Replit didn’t proactively choose revertability as a selling point; its user base and product form determined that it had no other card to play: you can’t ask non-programmers to read diffs to judge whether AI’s changes are right, you can only let them know that if it breaks, they can revert.

Replit’s checkpoint strategy follows this thinking. Checkpoints are created automatically by Agent at logical milestones, stored at the git layer, and users can access them with git tools, but the design doesn’t require users to understand git (source). This draws the boundary between the Replit line and the git-first line: git manages the underlying storage, while what users see at the upper layer is always a thinner abstraction, auto-saving as they go and reverting with one click.

Claude Code’s Rollback Was Forced Out by the Community

Claude Code added the /rewind command in v2.0.0 (late September 2025), triggered by double-tapping Esc or typing /rewind. After each Claude response completes, it automatically snapshots the modified files, with a cap of 100 per session. It offers four rollback modes: Restore Code and Conversation, Restore Conversation Only, Restore Code Only, and Summarize from Here (source). There is a dedicated documentation page, and engineering resources were invested.

The limitations are also written in the docs. /rewind does not track file changes produced by bash commands, manual edits, or external tool modifications. Cross-session rewind is also not supported. Put these together and the picture is sharp: bash commands are precisely the main channel through which Agent performs high-risk operations, and the inability to rewind across sessions means that splitting one exploration into two sessions automatically deactivates the rollback safety net.
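The coverage gap can be sketched in a few lines. The following is a toy model, not Claude Code's implementation: the class name RewindLog and its methods are hypothetical, and the only details taken from the text above are the per-session cap of 100 and the fact that only edits made through the tool's own edit path are snapshotted.

```python
from collections import deque
from pathlib import Path
import tempfile

MAX_SNAPSHOTS = 100  # per-session cap described above


class RewindLog:
    """Hypothetical per-session snapshot log; not Claude Code's actual code."""

    def __init__(self, cap=MAX_SNAPSHOTS):
        # deque(maxlen=...) silently drops the oldest snapshot once the cap is hit
        self.snapshots = deque(maxlen=cap)
        self.pending = {}  # path -> content before this turn's first edit

    def edit_tool(self, path, new_text):
        """A tracked edit: remember the pre-edit content, then write."""
        if path not in self.pending:
            self.pending[path] = path.read_text() if path.exists() else None
        path.write_text(new_text)

    def end_of_response(self):
        """Close the turn: store the pre-edit state of every tracked file."""
        self.snapshots.append(self.pending)
        self.pending = {}

    def rewind(self):
        """Undo the most recent turn, touching tracked files only."""
        for path, old_text in self.snapshots.pop().items():
            if old_text is None:
                path.unlink(missing_ok=True)  # file did not exist before the turn
            else:
                path.write_text(old_text)


with tempfile.TemporaryDirectory() as tmp:
    tracked, untracked = Path(tmp, "app.py"), Path(tmp, "data.db")
    tracked.write_text("v1")
    untracked.write_text("rows")

    log = RewindLog()
    log.edit_tool(tracked, "v2")   # goes through the edit tool: tracked
    untracked.write_text("")       # stands in for damage done by a bash command
    log.end_of_response()
    log.rewind()

    restored = (tracked.read_text(), untracked.read_text())
```

In this model the rewind brings app.py back to its earlier content, while data.db stays emptied because the snapshot log never saw that write.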

GitHub issue #6001 has a blunt title: Feature Request: Native Undo/Checkpoint/Restore Functionality (source). The original poster’s appeal reads like a user research summary, down to its line about a game-changer for user trust. The same issue contains another equally pointed criticism: this kind of core safety and usability feature should be a first-class citizen in the tool, not a DIY project left to advanced users to cobble together with hooks.

The edge of this remark points at how Anthropic positioned /rewind. Anthropic released /rewind as an engineering feature without elevating it to the height of a safety philosophy. No press release, no executive placing it on equal footing with the honesty metric, no blog post arguing that letting users roll back and making the model more accurate are two parallel safety strategies. Contrast that with the extensive treatment of honesty in the Opus 4.8 launch blog, and the entire /rewind affair stopped at the documentation layer. The Reddit community sent a sharper signal: one user titled their post PLEASE WE NEED REVERT FEATURE (source), all caps, like shouting.

The community didn’t wait. Three third-party tools each cut in from a different gap. ccundo reads Claude Code’s session jsonl files, covering undo for 6 kinds of file operations, but bash commands still require manual handling (source). mrq directly positions itself as continuous protection between commits, capturing the core contradiction: users don’t stop mid-flow to commit, while git’s protection only kicks in after a commit (source). claude-file-recovery extracts files that were once read, edited, or written from the ~/.claude session logs, and picked up 99 points on Hacker News (source). The existence of these tools says more than their individual capabilities: together they point to the fact that /rewind’s coverage has gaps at the design level.

Git Alone Can’t Keep Up with AI’s Edit Tempo

The remaining four can be divided into two categories by engineering approach.

Aider is the purest implementation of the git-first path. After every AI file edit it auto-commits, the commit message is generated by a weak model, and /undo is just git reset (source). The Aider homepage lists Git integration as a core feature: easily diff, manage, and undo AI’s changes with familiar git tools (source). Author Paul Gauthier hasn’t written a philosophical essay on why he chose git over snapshots, but from day one Aider has used auto-commit plus git integration as the backbone of its design. The advantages are straightforward: transparent, pushable, zero extra storage. The disadvantages are equally clear: no conversation rollback, and a natural barrier for non-git users.

The snapshot-first path has three vendors with markedly different promotional intensity. Cursor auto-snapshots before each chat request is processed; you hover over a history message and click the restore button to roll back. It doesn’t promote this feature separately, and community feedback reports the button failing and restore granularity being too coarse (source). Windsurf (Cascade) has a similar mechanism plus named checkpoints, and promoted it as a highlight in the Wave 11 blog, but revert is irreversible (source; source). Cline is the closest of the three to treating rollback as a first-class feature: an independent shadow git repository, a commit after every tool use, finer granularity than both Cursor and Windsurf, and a three-way rollback choice (files only, conversation only, both), also the only one that decouples file rollback from conversation rollback (source). The cost shows up at scale: a user encountered a 262GB .git/objects/pack (source).
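Decoupling file rollback from conversation rollback is easier to see in code. The sketch below is illustrative only and is not Cline's implementation: ShadowStore, Checkpoint, and the mode strings are hypothetical names, and the content-addressed blob table merely imitates how a git object store deduplicates unchanged files.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    tree: dict  # path -> blob hash at this point
    turn: int   # number of conversation messages at this point


@dataclass
class ShadowStore:
    """Hypothetical checkpoint store; illustrative, not Cline's code."""
    blobs: dict = field(default_factory=dict)        # hash -> content, deduplicated
    checkpoints: list = field(default_factory=list)

    def checkpoint(self, files, conversation):
        """Record workspace and conversation state after a tool use."""
        tree = {}
        for path, text in files.items():
            digest = hashlib.sha256(text.encode()).hexdigest()
            self.blobs.setdefault(digest, text)  # unchanged files add no new blob
            tree[path] = digest
        self.checkpoints.append(Checkpoint(tree, len(conversation)))
        return len(self.checkpoints) - 1

    def restore(self, idx, files, conversation, mode="both"):
        """mode is 'files', 'conversation', or 'both'."""
        if mode not in {"files", "conversation", "both"}:
            raise ValueError(mode)
        cp = self.checkpoints[idx]
        if mode in {"files", "both"}:
            files.clear()
            files.update({p: self.blobs[h] for p, h in cp.tree.items()})
        if mode in {"conversation", "both"}:
            del conversation[cp.turn:]  # drop messages added after the checkpoint


store = ShadowStore()
files = {"app.py": "v1", "README.md": "docs"}
chat = ["user: build it"]
first = store.checkpoint(files, chat)

files["app.py"] = "v2"
chat.append("agent: rewrote app.py")
store.checkpoint(files, chat)

store.restore(first, files, chat, mode="files")  # code goes back, chat stays
```

Every distinct version of every file stays in blobs indefinitely, which hints at why a checkpoint after every tool use can grow large on a big workspace.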

Laid side by side, Replit holds the center position, Cline is the closest second but hasn’t stepped onto the narrative stage, and the other three either have implementation but lack promotion, or have promotion but lack depth.

“Just use git” is the most widely circulated counterargument in this space. It has merit, but it has at least three real-world holes.

First, git’s commit cadence doesn’t match AI’s iteration speed. Developers in flow don’t stop to commit. The title of mrq’s blog post captures exactly this misalignment. Git protects what you’ve committed, but the intermediate states between AI edits, each lasting only tens of seconds, are invisible to it. The critic on issue #6001 put it plainly: rollback should be a first-class feature, not a workaround that advanced users patch together with hooks.

Second, git commits are deliberate human actions, while AI coding requires automatic, system-level, fine-grained snapshots. Git’s workflow has always been designed around a premise: you know when to commit, and you know what you committed. AI coding overturns this premise. Users don’t know which files AI touched across dozens of tool uses or what logic it changed, so they naturally don’t know at which step to run git commit. Looking at git log afterward, the gaps between commits are full of intermediate states the user can’t get back to. Cline’s shadow git is one solution, but it already uses git in a way that deviates from git’s design intent: the 262GB pack file is the direct consequence of forcing git to act as an automatic snapshot engine, a job it isn’t good at.

Third, git’s barrier for non-programmers is too high. GitHub co-founder Scott Chacon admitted on the a16z podcast that git’s UI has barely changed since 2005, and the team is redesigning version control to fit AI agent workflows (source). The Register’s May 2026 headline is more blunt: Git is unprepared for the AI coding tsunami (source). Against the backdrop of Cline’s shadow git ballooning to 262GB, that headline no longer reads like hyperbole.

Revertability Is Invisible on the Benchmark Scoreboard

Not a single mainstream AI coding benchmark measures revertability. SWE-bench’s core metric is resolved rate, whether a patch passes the original PR’s unit tests (source). LiveCodeBench measures Pass@1. The Aider leaderboard measures Pass rate 1 and Pass rate 2 (source). AgentBench measures task success rate. Even with undo-and-retry variants like pass@3, what’s measured is still whether the model ultimately got it right, never touching the recovery process after getting it wrong. This is like measuring a car’s fuel consumption and acceleration while braking distance is entirely off the scoreboard.
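To make the point about pass@k concrete, here is the widely used unbiased estimator, written as a small sketch. The function name and the sample numbers are illustrative assumptions; individual leaderboards may compute their scores differently.

```python
from math import comb


def pass_at_k(n, c, k):
    """Chance that at least one of k samples, drawn from n attempts of which c passed, passes."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the probability that all k drawn samples come from the n - c failures
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 10 attempts on one task, 3 of them pass the tests.
single_shot = pass_at_k(10, 3, 1)
three_tries = pass_at_k(10, 3, 3)
```

The only inputs are counts of attempts and passes. What each failed attempt left behind in the workspace, and what it cost to undo, never enters the formula.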

Academic research has also flagged the signal quality of benchmarks. An ICSE 2026 paper found that SWE-bench’s patch validation has a 7.8% false positive rate, and being marked resolved doesn’t mean it was actually solved correctly (source). The entire benchmark culture still revolves around this flawed metric.

In July 2025, Amjad Masad relayed an observation from a public company CEO: AI did generate a lot of their code, but the time saved by code generation was all paid back in debugging, rolling back bugs, and security audits (source). This sentence points out the fault line between benchmarks and engineering reality: benchmarks capture code generation efficiency, while in actual work the real cost is consumed in debugging, rollback, and security audits. The writing time AI saves you may be paid back double in a single incident.

The saturation signal of SWE-bench hitting 93.9% is already strong enough that OpenAI took the lead in declaring it no longer an evaluation standard. The industry’s response is to launch new benchmarks to measure harder tasks. What do the new benchmarks measure? Still accuracy, just with a harder set of problems. Revertability as an evaluation dimension remains completely invisible across the entire benchmark system. This exposes the industry’s starting point for understanding reliability: it chooses to measure only accuracy and leaves the dimension of recoverability entirely unmeasured.

Two Kinds of Trust, One Side Taken

Behind the two paths are two trust models.

The accuracy path builds trust on AI capability. The stronger the model, the more trustworthy; get it right in one shot and rollback is unnecessary. The logic is intuitive: who doesn’t want AI to get it right the first time? But it has a fragile point: the moment the model errs, there’s nothing to be done, and users can only wait for the next version.

The revertability path builds trust on human recovery capability. The easier the revert, the more users dare to let AI try; when AI errs, humans can pull it back. The core variable on this path is error cost: after AI errs, how expensive is correction, and can it be brought down to a level where users can casually catch it. The autonomous driving field has a corresponding reference: the safety assumption of SAE Level 3 is precisely that humans can take over at any time, and the system doesn’t need to always be right (source). AI coding tools and L3 autonomous driving are isomorphic in safety structure; both are high-autonomy, high-cost-of-error, human-in-the-loop systems. The academic literature distinguishes overtrust (leading to misuse) and undertrust (leading to disuse), emphasizing that calibrated trust is the goal (source). But the default premise of these studies is always that trust lands on the AI side, calibrating AI to the point where users neither overtrust nor underuse. The revertability path has a different trust landing point: it relies on the user’s own ability to catch the fall.

Revertability is an amplifier. That line in issue #6001, empower users to let Claude attempt more ambitious tasks, knowing they can easily roll back, already made the key point. Because they know they can restore with one click, users dare to let Agent try things with a low probability of getting right in one shot but huge value if successful. Rollback lowers the cost of exploration, and lowering cost is itself a behavioral incentive.
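The amplifier claim can be put as a toy two-outcome model. Everything in it is an assumption made for illustration: a delegated attempt either succeeds and yields value, or fails and costs error_cost to clean up, and the numbers (in hours) are invented.

```python
def expected_gain(p_success, value, error_cost):
    """Expected payoff of one delegated attempt under the toy two-outcome model."""
    return p_success * value - (1.0 - p_success) * error_cost


def break_even_probability(value, error_cost):
    """Smallest success probability at which delegating has non-negative expected payoff."""
    # Solve p * value - (1 - p) * error_cost = 0 for p.
    return error_cost / (value + error_cost)


# Hypothetical task worth 10 hours if the agent gets it right.
manual_recovery = break_even_probability(value=10, error_cost=40)   # hand-repairing a broken app
one_click_revert = break_even_probability(value=10, error_cost=1)   # cheap rollback
```

With a 40-hour cleanup the agent has to be right about 80% of the time before delegating pays off; with a one-hour revert the threshold falls to roughly 9%. The model is the same in both cases, and only the cost of a mistake changes.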

This path is still early, and every implementation carries clear caveats. Replit’s checkpoint granularity is decided by Agent, and users don’t fully get to choose which specific state to return to. Cline’s shadow git can balloon to 262GB. Aider can’t roll back conversations. Claude Code’s /rewind doesn’t track bash commands. The Snapshot Engine technical blog shows Replit put hard work into storage efficiency, but also acknowledges that current rollback precision still involves engineering trade-offs (source). These caveats mark the path’s current position: going from concept to product means passing through several narrow engineering gates, none of which has been fully cleared.

The direction is already clear. The reliability of AI coding tools needs two dimensions in parallel: make the model more accurate, and make mistakes cheaper. The two dimensions are not mutually exclusive, and each requires independent engineering investment. Today only one vendor is explicitly and systematically investing in the latter dimension, while user feedback, community patch tools, plus first-hand voices like issue #6001 are proving that this dimension has already become a precondition.