Date: 2026-03-30
In Q1 2026, OpenAI, Cursor, and Anthropic each published practice reports on agent-first software development, all filed under the same term: harness engineering. Read closely, though, they describe three almost entirely different things.
OpenAI’s harness engineering is about environment design: documentation systems, architectural constraints, and observability infrastructure that let an agent reliably produce code within a carefully designed working environment. Cursor’s self-driving codebases and scaling agents are about coordination architecture: how to divide work, parallelize, and converge when hundreds of agents are working simultaneously. Anthropic’s harness design for long-running apps is about runtime course correction: how to maintain direction and quality when a single agent runs continuously for hours.
The three articles share overlapping audiences and near-identical terminology, yet the engineering problems they address are fundamentally different. This is the root of the confusion the term harness engineering creates in current discussions: people use the same word for problems at different layers, and much of the second-hand commentary still lingers on the multi-agent “virtual team” concepts of two years ago, which are even further removed from what these three articles actually describe.
This article attempts to provide a unified framework to untangle these discussions. The core argument is: the essence of harness engineering is making AI software construction scalable, and scalability has three independent dimensions. Each company solved one of them.
Before unpacking the three dimensions, it is worth examining where all three companies agree. These shared convictions form the foundation of harness engineering; the three scaling dimensions are differentiations built on top of this foundation.
The core human work has shifted from writing code to designing the agent’s working environment. OpenAI frames this as “designing environments, specifying intent, building feedback loops.” Cursor’s experiments found that “architecture and instructions matter more than the harness itself.” Anthropic found that planner and evaluator design affect output quality more than prompt wording. From different directions, all three arrived at the same conclusion: the human leverage point is creating conditions under which agents can work reliably; the code itself is produced by agents.
Knowledge must be versioned, discoverable, and live inside the repo. OpenAI stated this most bluntly: what Codex cannot see does not exist. Discussions in Google Docs, alignment conversations on Slack, implicit knowledge in team members’ heads — to an agent, all of these are blank. Cursor validated the same point from a different angle: ambiguous wording in instructions gets amplified simultaneously by hundreds of agents, with consequences far more severe than communication ambiguity in human teams. The solution in both cases is to push knowledge into the repo, replacing verbal communication with markdown and structured documents.
Constraints are more effective than instructions. OpenAI uses custom linters to enforce layered architecture, where the lint error messages themselves serve as repair guidance for the agent. Cursor found that “no TODOs, no partial implementations” is far more effective than “remember to finish implementations.” Constraints are executable and deterministic; instructions are interpretable and ambiguous. In how agents work, this distinction matters more than it does in human teams.
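The constraints-over-instructions idea can be made concrete with a minimal sketch. The rule below is hypothetical (not OpenAI's actual linter), but it illustrates the mechanism the article describes: the lint error message itself doubles as repair guidance an agent can act on deterministically.

```python
import re

# Minimal sketch of a constraint-as-lint check (hypothetical rule, not
# OpenAI's actual linter). The error message tells the agent exactly how
# to repair the violation, so no human interpretation is needed.
LAYER_RULE = {
    # a hypothetical layered-architecture invariant: UI code must not
    # import from the storage layer directly
    "forbidden": re.compile(r"^\s*from\s+app\.storage\b", re.MULTILINE),
    "message": (
        "ui/ modules must not import app.storage directly. "
        "Repair: route the call through app.services instead, e.g. "
        "`from app.services import storage_api`."
    ),
}

def lint(path: str, source: str) -> list[str]:
    """Return a repair-guidance message for each violation in a UI file."""
    errors = []
    if path.startswith("ui/") and LAYER_RULE["forbidden"].search(source):
        errors.append(f"{path}: {LAYER_RULE['message']}")
    return errors

msgs = lint("ui/panel.py", "from app.storage import db\n")
print(len(msgs))  # 1
```

Because the check is executable, it fires identically on every agent run; an instruction like "respect the layering" would be re-interpreted each time.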
Perfectionism is the enemy of throughput. OpenAI adopts minimum-blocking merges — waiting is more expensive than correcting. Cursor found that demanding 100% correctness on every commit causes the system to stall, where a single small error traps the entire system in a repair loop. Both accepted the trade-off that “correction is cheaper than waiting.” This judgment might be controversial in human engineering teams, but in scenarios where agent output velocity far exceeds human attention capacity, it is a sound engineering decision.
These four shared convictions are the least controversial part of harness engineering. Any article discussing harness engineering that does not touch on any of these four is most likely discussing something else entirely.
Building on the shared convictions above, each company is solving a scaling problem along a different dimension.
The problem Anthropic is solving: once an agent starts working in a well-designed environment, how does it maintain direction and quality over hours of continuous operation?
This problem is independent of environment design because long-running execution introduces two categories of failure that environment design alone cannot prevent. The first is directional drift: as the context window gradually fills, model coherence begins to decay, manifesting as deviation from the original direction, forgetting early constraints, and going progressively deeper into details. The second is self-evaluation distortion: the agent can detect flaws in its own output during the work process, but then convinces itself these flaws are acceptable and issues a passing judgment. These two categories of problems require runtime course correction mechanisms that are independent of what prompts and documentation can cover.
Anthropic’s solution is a three-role architecture. The Planner expands a one-line requirement into a complete product spec, handling only product-level and high-level technical direction without entering implementation details. The Generator implements features according to the spec. The Evaluator takes a pre-negotiated sprint contract and uses Playwright to operate the real running application to verify whether the output meets standards. There is no shared internal state between the Evaluator and Generator — this independence is the precondition for its ability to course-correct.
The most methodologically valuable part of this architecture is how it handles harness component lifecycles. Each harness component is a hypothesis about the current model’s capability boundary. Context reset assumes the model cannot maintain coherence in long contexts; sprint decomposition assumes the model cannot maintain a sense of direction in continuous long sessions; the evaluator assumes the model will be overly lenient toward its own work. These hypotheses expire at different rates. Across three model generations — from Sonnet 4.5 to Opus 4.5 to Opus 4.6 — context reset was retired first, sprint decomposition followed, and the evaluator still provides value. What the author did after Opus 4.6 launched was systematically remove old components one by one and test whether quality actually degraded, rather than continuing to stack new components on top.
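The retirement process described above amounts to an ablation loop: on a new model generation, remove components one at a time and keep each removal only if measured quality does not degrade. A sketch, where `measure_quality` is a hypothetical stand-in for a real evaluation run:

```python
# Sketch of the component-ablation methodology: each harness component is
# a hypothesis about the model's limits, so it is retired by removal-and-
# measurement rather than kept forever. `measure_quality` is hypothetical.

def ablate(components: list[str], measure_quality, tolerance: float = 0.0) -> list[str]:
    """Return the minimal component set whose quality matches the full harness."""
    kept = list(components)
    baseline = measure_quality(kept)
    for comp in components:                  # try retiring each component in turn
        trial = [c for c in kept if c != comp]
        if measure_quality(trial) >= baseline - tolerance:
            kept = trial                     # hypothesis expired: component retired
    return kept

# Toy illustration: pretend only the evaluator still contributes quality.
scores = {"evaluator": 0.3}
quality = lambda comps: sum(scores.get(c, 0.0) for c in comps)
print(ablate(["context_reset", "sprint_decomposition", "evaluator"], quality))
# -> ['evaluator']
```

The point is the direction of the loop: subtract and verify, rather than stack new components on top of old ones.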
The simplified system produced a digital audio workstation, running for approximately 4 hours at a cost of $124, with the generator’s first continuous run lasting 2 hours and 7 minutes. For comparison: the same prompt run with a single agent for 20 minutes cost $9, and core functionality was unusable.
The problem Cursor is solving: can you achieve 10x meaningful throughput by investing 10x the compute?
They chose building a web browser engine from scratch as their benchmark task, written in Rust, with hundreds of agents running in parallel for a week, generating over one million lines of code. The truly valuable part of the article is its candid documentation of four architectural iterations and their failures.
The first attempt gave all agents equal status, coordinating through shared state files. This is a classic approach in distributed systems, but it failed rapidly in the agent scenario. Agents held locks too long and forgot to release them; throughput with 20 agents degraded to the level of 1-3 agents. The deeper problem was behavioral: without hierarchy, agents became risk-averse, making only safe minor changes and leaving difficult problems unowned.
The second attempt separated four roles — Planner, Executor, Worker, Judge — which improved things significantly but was bottlenecked by the slowest Worker. The third attempt merged the Planner into the Executor, which caused role overload and pathological behavior: Executors slept at random, stopped generating tasks, and started writing code themselves.
The final solution is a recursive Planner-Worker architecture. The root Planner owns the entire project scope; when scope is too large, it spawns sub-Planners, recursing as needed. Workers receive tasks from Planners, work independently on their own repo copies, and upon completion write a handoff (what was done, what was discovered, what concerns exist) that they submit to the requesting Planner. Workers are unaware of each other and do not communicate with other Planners. Information flows strictly upward.
This architecture achieved linear scaling through three key mechanisms. At the planning level, recursive Planners allow planning work itself to be parallelized, preventing any single Planner from becoming a bottleneck. At the execution level, Workers are fully isolated, each working on independent repo copies, eliminating lock contention. At the quality level, they removed the centralized Integrator role (originally responsible for central quality control, but hundreds of Workers all needing to pass through a single gate to merge code immediately made it a bottleneck), accepting a small but stable error rate and letting errors be naturally fixed by other agents.
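The recursive shape and the handoff contract can be sketched as a data structure. All class and field names below are illustrative, not Cursor's code; what the sketch preserves is that Workers are isolated, unaware of each other, and report strictly upward via handoffs.

```python
from dataclasses import dataclass, field

# Sketch of the recursive Planner-Worker architecture Cursor describes
# (names are illustrative). Information flows strictly upward: a Worker's
# handoff goes only to the Planner that requested the task.

@dataclass
class Handoff:
    done: str        # what was done
    discovered: str  # what was discovered
    concerns: str    # what concerns exist

def work(task: str) -> Handoff:
    """Worker: runs on its own repo copy and returns a handoff (stub)."""
    return Handoff(done=f"implemented {task}", discovered="", concerns="none")

@dataclass
class Planner:
    scope: str
    max_direct_tasks: int = 4
    handoffs: list[Handoff] = field(default_factory=list)

    def dispatch(self, tasks: list[str]) -> None:
        if len(tasks) <= self.max_direct_tasks:
            for t in tasks:                      # small scope: own it directly
                self.handoffs.append(work(t))
        else:                                    # large scope: spawn sub-Planners
            mid = len(tasks) // 2
            for chunk in (tasks[:mid], tasks[mid:]):
                sub = Planner(scope=f"{self.scope}/sub")
                sub.dispatch(chunk)
                # only the sub-Planner's summary flows upward
                self.handoffs.append(Handoff(
                    done=f"{len(sub.handoffs)} handoffs from {sub.scope}",
                    discovered="", concerns=""))

root = Planner(scope="browser-engine")
root.dispatch([f"task-{i}" for i in range(10)])
print(len(root.handoffs))  # 2 sub-Planner summaries reach the root
```

Because each level summarizes before reporting upward, no single Planner's context has to hold the whole project, which is what lets planning itself parallelize.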
Peak throughput was approximately 1000 commits/hour. One notable finding: when they refactored the repo from a monolith into multiple independent crates, compilation wait times dropped dramatically and throughput increased severalfold. The choice of project architecture directly affected agent work efficiency, which suggests that repo structures optimized for agents and those optimized for humans involve different design considerations.
The problem OpenAI is solving extends from the harness engineering article to Symphony (open-sourced March 2026): when agent output velocity far exceeds human attention, through what interface should humans steer the entire system?
The interaction model described in the harness engineering article itself is relatively straightforward: a human writes a prompt describing the task, the agent runs (a single run often exceeds 6 hours, typically executing while the engineer sleeps), the agent produces a PR, iterates through agent-to-agent review cycles until satisfactory, and the human can optionally participate in review. A three-person team merged approximately 1,500 PRs over five months, averaging 3.5 per person per day.
The scalability bottleneck of this interaction model is obvious: humans still write prompts one by one and trigger tasks one by one. Symphony solves precisely this problem. It is a persistent daemon built with Elixir/BEAM that turns project management tools (currently defaulting to Linear) into a job scheduler for agents. Engineers write requirements as tickets; when a ticket moves to Todo status, Symphony automatically creates an independent workspace (fresh git clone + isolated agent session), dispatches Codex to execute the task, and upon completion produces Proof of Work (CI results, walkthrough, sometimes even screen recordings) and opens a PR. If an agent fails midway, BEAM’s supervision tree handles restart and backoff while other agents continue running. The system can manage hundreds of concurrent implementation runs.
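The control flow of that ticket-driven loop can be sketched in a few lines. The real system is an Elixir/BEAM daemon; everything below, including the status names, `Ticket` shape, and `run_agent` stub, is an illustrative assumption about the flow the article describes, not Symphony's actual code.

```python
import tempfile
from dataclasses import dataclass

# Sketch of a ticket-driven dispatch loop (Symphony itself is Elixir/BEAM;
# all names and statuses here are hypothetical). A ticket moving to Todo
# triggers: fresh isolated workspace -> agent run -> Proof of Work.

@dataclass
class Ticket:
    id: str
    status: str       # e.g. "Backlog" | "Todo" | "In Progress" | "Done"
    description: str

def create_workspace(ticket: Ticket) -> str:
    """Fresh, isolated checkout per run (a temp dir stands in for a git clone)."""
    return tempfile.mkdtemp(prefix=f"ws-{ticket.id}-")

def run_agent(ticket: Ticket, workspace: str) -> dict:
    """Dispatch an agent run and collect Proof of Work (stubbed)."""
    return {"ci": "pass", "walkthrough": f"did {ticket.description}",
            "pr": f"PR for {ticket.id}"}

def poll(tickets: list[Ticket]) -> list[dict]:
    """One scheduler tick: pick up every ticket that moved to Todo."""
    proofs = []
    for t in tickets:
        if t.status == "Todo":
            t.status = "In Progress"
            ws = create_workspace(t)
            proofs.append(run_agent(t, ws))
            t.status = "Done"
    return proofs

tickets = [Ticket("T-1", "Todo", "add search"),
           Ticket("T-2", "Backlog", "fix flaky test")]
print(len(poll(tickets)))  # 1: only the Todo ticket is dispatched
```

What BEAM adds on top of this loop, per the article, is supervision: a crashed run is restarted with backoff while the other concurrent runs are unaffected.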
Configuration is done through WORKFLOW.md files in the repo (YAML frontmatter + Liquid-templated prompts), meaning agent strategies are version-controlled alongside the code.
This simplifies the human interaction interface from “write prompt and trigger” to “write ticket and move status.” Interaction becomes very sparse: upstream is writing tickets and maintaining the harness (documentation, tests, architectural constraints); downstream is reviewing Proof of Work and PRs. The execution process in between is fully autonomous. The feedback loop’s center of gravity shifts from correcting specific agent outputs to improving the harness itself: better tests, better documentation, better constraints. These improvements compound across all future agent runs.
OpenAI’s harness engineering article describes several layers of solutions for scaling human attention. The first layer lets agents self-verify: through Chrome DevTools Protocol integration and per-worktree observability stacks (Victoria Logs/Metrics/Traces), high-level goals like “ensure no span exceeding two seconds in critical user paths” become agent-executable without humans watching dashboards. The second layer replaces manual review with mechanized constraints: custom linters enforce architectural invariants, with lint error messages written as repair guidance that agents can understand. The third layer is automated entropy management: encoding “golden rules,” periodically running background agents to scan for deviations and open fix PRs, most of which can be reviewed and auto-merged within a minute.
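The first layer can be illustrated with a sketch of how a goal like "no span exceeding two seconds in critical user paths" becomes a machine check an agent can run itself. The span schema below is a hypothetical JSON shape, not the actual Victoria Logs/Metrics/Traces API, and the path names are assumptions.

```python
# Sketch of agent self-verification: a latency invariant expressed as a
# pass/fail check over exported trace data. The span schema and path
# names are hypothetical, not the VictoriaMetrics/Traces API.

CRITICAL_PATHS = {"checkout", "search"}   # assumed critical user paths
BUDGET_MS = 2000                          # "no span exceeding two seconds"

def violations(spans: list[dict]) -> list[str]:
    """Return a repair-guidance message per over-budget span on a critical path."""
    out = []
    for s in spans:
        if s["path"] in CRITICAL_PATHS and s["duration_ms"] > BUDGET_MS:
            out.append(
                f"span {s['name']} on {s['path']} took {s['duration_ms']}ms "
                f"(budget {BUDGET_MS}ms): profile and optimize this span "
                f"before opening the PR"
            )
    return out

spans = [
    {"name": "db.query", "path": "checkout", "duration_ms": 2400},
    {"name": "render", "path": "settings", "duration_ms": 3000},  # not critical
]
print(len(violations(spans)))  # 1
```

As with the linter layer, the output is phrased as repair guidance, so the agent can act on a failure without a human watching a dashboard.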
These three dimensions appear independent but actually have dependency relationships. Understanding these dependencies matters more than understanding each dimension individually.
The most critical dependency is: spatial scaling amplifies the problems in temporal scaling. When you have a single agent running, the consequences of directional drift and self-evaluation distortion are confined to a single PR. When you have hundreds of agents running simultaneously, each drifting, each self-rationalizing, errors accumulate at multiples of the parallelism factor. Cursor did encounter this problem in practice: they found agents lose focus during long runs and need periodic scratchpad rewrites and context summarization. But their approach leaned toward accepting a stable error rate and letting the system naturally converge, rather than introducing an independent evaluator. Which of these two strategies is superior remains an open question.
Conversely, interaction scaling depends on the maturity of temporal and spatial scaling. The reason Symphony can let humans steer agents by writing tickets is that individual agent runs are sufficiently reliable (temporal dimension) and the system can manage large numbers of concurrent runs (spatial dimension). If every run requires human intervention mid-process, the ticket-driven model degrades into manually triggered batch processing.
There is also a cross-dimensional shared finding: all three companies discovered in practice that model selection for role fit matters more than expected. Cursor found that GPT-5.2 outperformed Opus 4.5 in long autonomous runs (the latter tended to stop early and take shortcuts). Anthropic documented the evolution path of harness components across three model generations from Sonnet 4.5 to Opus 4.6. This means part of harness engineering work is matching different models to different roles, and this matching will continue to shift with each model iteration.
Returning to the opening question: what is harness engineering actually about?
Viewed through this framework, the answer becomes clear. When someone says harness engineering, first ask: which dimension of scaling is it addressing? Is it about letting agents run longer (temporal), letting more agents run together (spatial), or letting humans steer with less effort (interaction)? The engineering problems differ across the three dimensions, the solutions differ, and the trade-offs differ. Discussing them as one undifferentiated concept guarantees confusion.
This framework can also help you assess the quality of the abundant second-hand discussions on the market. If an article discusses harness engineering but does not touch on any of these three dimensions, it is most likely discussing something more basic: traditional multi-agent collaboration protocols, the AI virtual team concepts popular two years ago, or simply wrapping existing practices in a trendy term. These discussions have their value, but they operate at a different layer from what OpenAI, Cursor, and Anthropic are doing.
There is also an easily overlooked dimension. The scaling all three companies discuss optimizes how agents work: working longer, more agents working simultaneously, humans managing with less effort. But the quality ceiling of agent work depends heavily on what context the agent receives. The same model, the same tools, the same prompt — when connected to a cognitive framework accumulated and refined in layers over a year — the nature of the output shifts from “correct platitudes” to “analysis with judgment.” This direction complements harness engineering: the harness addresses the agent’s work methods and coordination; context infrastructure addresses the agent’s cognitive density. For detailed discussion on this direction and an open-source reference implementation, see Why AI Only Produces Correct Platitudes, and How to Push It Out of Its Comfort Zone.
Finally, it is worth pointing out a boundary. The scaling across all three dimensions addresses needs at the head of the distribution: extremely complex systems, large-scale infrastructure, and even experimental exploration of AI capability frontiers. For the broader population of everyday developers and enterprises, their software may not need hundreds of agents running in parallel at all, nor a single agent running continuously for 6 hours. AI’s more far-reaching impact on software may lie in a different direction: making software itself simpler, more disposable, and more tailored to the specific user’s needs. When the deliverable shifts from finished software to a Generative Kernel that supports AI-generated personalized applications, the problems harness engineering solves become less important, because the complexity of the system that needs to be harnessed is itself decreasing. These two directions exist in parallel, serving different scenarios. The harness engineering discussion covers one end of the spectrum, but readers should know where its boundaries of applicability lie.