Whose Fault Is It When AI Slacks Off?
Professor Zou teaches a "Modern Software Engineering" course where one assignment requires students to write a blog post about their thoughts and questions from the class. 53 students submitted their assignments, with their articles scattered across different websites. A reliable AI that could crawl, understand, and summarize their articles—and even perform more abstract, higher-level analysis—could save the professor a significant amount of time.
This problem seems simple because the tasks are highly independent and parallelizable. Each task itself is straightforward: just summarization. However, it's more difficult than it appears. Professor Zou initially tried using DeepSeek to summarize the posts but found that the results were mostly hallucinations. I also tested various Agentic AI tools, including Cursor (Claude 4.5 Sonnet, GPT-5), ChatGPT, Codex, Deep Research, and Manus (without Wide Research), but they all struggled with one common issue: slacking off. Out of the 53 articles, an AI counted as impressive if it could process more than 10 (as Claude 4.5 Sonnet and GPT-5 did). Specialized products like Deep Research and the Codex CLI could handle nearly 20. Still, that was far short of the full 53. It wasn't until we used Manus's Wide Research, or the similar Wide Research implementation I built for Codex in this repo, that we finally got results that were complete, accurate, and insightful.
In this article, I want to explain why this seemingly simple summarization task is so difficult. Why do all AIs have this problem? How does Wide Research suddenly make AI "smarter" and able to handle this situation? Why did we build this Wide Research capability on Codex? What makes Codex special? What design principles and technical decisions allowed us to implement it in just one day? What AI usage techniques does this reveal that can help you identify similar problems, accurately estimate their difficulty, and know how to address them?
To answer these questions, we first need to understand why AI "slacks off." The term is a vivid personification, but it obscures the underlying technical reality. The issue isn't that the AI is intentionally being lazy like a seasoned slacker; it's not a matter of willingness. The core of the problem is the limits of its capabilities. When an AI slacks off, that capability boundary is being set by two intertwined factors: architectural constraints shared by all large models, and differences in execution ability between models.
Architectural Constraints
A major limitation of LLMs, and of AI in general, is that their ability to follow instructions begins to decline as the output grows longer, sometimes once the output reaches half, or even just 20%, of the maximum context window. For example, if you ask one to translate a long text, it might be diligent at the beginning, start skipping sentences in the middle, and eventually stop translating verbatim, opting to summarize as it goes. This is likely due to the Transformer's global attention mechanism; empirically, it is a common pattern across all Transformer-based LLMs.
This universal constraint across all LLMs is the direct reason our task was so difficult. Because the required output was extremely long (summaries of 53 articles), the LLM could handle the first few, or even a dozen, articles by fetching, parsing, and summarizing them methodically. However, as the output grew, its attention became scattered. It would simply give up, claiming the task was complete when it had only finished a third of the work.
But on second thought, this can be avoided. If the root of the problem is forcing an LLM to generate a massive output, why not just work around it? This is precisely the fundamental idea behind Manus Wide Research: when a problem can be naturally divided into smaller, independent sub-problems, Manus uses multiple lightweight LLMs to handle each one. For instance, it could use 53 instances of Gemini 2.5 Flash to summarize each student's assignment. Since these tasks are small and don't require high-level intelligence, these smaller LLMs can perform them quite well, ensuring both speed and cost-effectiveness. Finally, the answers from these sub-problems are aggregated and polished by a primary LLM for the final submission. This way, no single LLM is forced to generate a large output, and the "slacking off" problem is naturally solved.
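To make the idea concrete, here is a minimal sketch of the fan-out/aggregate pattern, assuming an OpenAI-compatible API. The model names, prompts, and the naive page fetch are placeholders of my own; Manus's actual implementation is not public, so this is only an illustration of the pattern, not their code.

```python
# Minimal sketch of the Wide Research fan-out/aggregate pattern.
# Assumptions: an OpenAI-compatible client; "lightweight-model" and
# "primary-model" are placeholders; blog_urls is the list of 53 posts.
from concurrent.futures import ThreadPoolExecutor

import requests
from openai import OpenAI

client = OpenAI()

def fetch_text(url: str) -> str:
    # Naive fetch for illustration; real pages need parsing, and some
    # will fail on 404s, anti-bot checks, or paywalls (more on that later).
    return requests.get(url, timeout=30).text

def summarize_one(url: str) -> str:
    # Each sub-task is small and independent, so a lightweight model is enough.
    resp = client.chat.completions.create(
        model="lightweight-model",  # placeholder for a small, cheap model
        messages=[{
            "role": "user",
            "content": "Summarize the main points of this blog post in ~150 words:\n\n"
                       + fetch_text(url)[:20000],
        }],
    )
    return resp.choices[0].message.content

def wide_research(blog_urls: list[str]) -> str:
    # Fan out: one call per article, so no single model ever has to
    # produce all 53 summaries in one long output.
    with ThreadPoolExecutor(max_workers=8) as pool:
        summaries = list(pool.map(summarize_one, blog_urls))

    # Aggregate: a stronger primary model polishes and synthesizes.
    joined = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(summaries))
    final = client.chat.completions.create(
        model="primary-model",  # placeholder for the main model
        messages=[{
            "role": "user",
            "content": "Combine these per-article summaries into one coherent report:\n\n" + joined,
        }],
    )
    return final.choices[0].message.content
```

The key property is structural: every model in the pipeline only ever produces a short output, so the long-output failure mode never gets a chance to appear.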
Differences in Execution
At the same time, even when faced with the same large-scale output, different LLMs show varying quality in task completion. For one, they start slacking off at different points. An early version of GPT-4o-mini might get lazy after a few hundred words, while current models like GPT-5 and GLM-4.6 can hold out for two to three thousand words. Furthermore, in the age of Agentic AI, a model's tool usage and "thinking" habits have a big impact. For example, Gemini hates search, while Codex has a habit of reviewing its work to see if anything was missed. This means that even if both start to slack off after 4,000 words, the final results will differ. Another example is Claude: when fixing code, if it encounters a test it can't fix, it might just delete the test and print "Perfect 🎉 Test fixed!" to get away with it. This has never happened with gpt-5-codex. This is why, after extensively using various Agentic AI products, I've settled on the Codex CLI.
These architectural constraints, common to all LLMs, combined with the differences in their execution abilities, lead to vastly different user experiences with Agentic AI products. I rely heavily on Agentic AI tools, especially Manus's Wide Research, but when using Manus (which seems to be based on Claude), I miss the reliability of Codex. Since Codex is subscription-based, with no extra cost for heavy usage, I decided to build this repo to replicate the functionality of Manus Wide Research within the Codex environment, pairing a highly reliable architecture with a high-execution model.
Although this small project isn't as polished as the commercial Manus product, it embodies a lot of thinking and practice on how to collaborate effectively with AI and design AI-native workflows. I'm sharing it here to also answer a question: in the age of AI, where does an engineer's competitive edge lie?
A Practical Framework for Building with AI
We now have a clear theoretical combination: use the Wide Research architecture for macro-level reliability and Codex as the base LLM for micro-level execution reliability. However, there's a huge gap between theory and practice—a gap that AI writing code for us won't solve.
Acting as a Senior Manager to Design Organizational Processes
Specifically, since Codex is a general-purpose agent and we can't (and don't need to) change its internal code, our influence over its behavior is primarily through prompts. It's like teaching a new intern how to work—we just give them an SOP to follow. This brings great convenience to our development because we only need to "talk" to the AI. But it also introduces a problem: like humans, AI is a non-deterministic system. On one hand, even if I tell it to do something in a prompt, it might not do it. On the other hand, even with the same prompt, running it multiple times will produce different results. In this situation, how do I know if the changes I make to the prompt are useful? Am I just spinning my wheels?
This problem itself is relatively easy to solve. The common practice is to build a test suite of 5 to 10 (preferably more) typical use cases and have the AI repeatedly perform these tasks with our prompts. We then keep the prompt that yields higher quality, faster speed, and greater stability. But while this solves the "what" (the result), it doesn't solve the "how" (the process). As you can see, it is a very manual, labor-intensive loop: repeatedly fine-tune the prompt, have the AI execute the tasks, inspect the quality of the results, and repeat. This makes for a slow iteration cycle and is exhausting for humans.
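Concretely, the measurement half of that loop is just a small regression harness. In the sketch below, run_agent and score are hypothetical stand-ins for "execute one test case with a given prompt" and "grade the result"; the human part, reading the failures and deciding how to change the prompt, is what remains exhausting.

```python
# A minimal sketch of a prompt-regression harness. run_agent() and
# score() are hypothetical stand-ins supplied by the caller: run_agent
# executes the agent with a given prompt on one test case, score grades
# the result for completeness and accuracy.
import statistics
import time

TEST_CASES = ["case-01", "case-02", "case-03"]  # 5-10+ typical tasks in practice

def evaluate_prompt(prompt: str, run_agent, score) -> dict:
    scores, durations = [], []
    for case in TEST_CASES:
        start = time.time()
        result = run_agent(prompt, case)       # non-deterministic, by nature
        durations.append(time.time() - start)
        scores.append(score(case, result))
    return {
        "mean_score": statistics.mean(scores),
        "score_stdev": statistics.pstdev(scores),   # proxy for stability
        "mean_seconds": statistics.mean(durations), # proxy for speed
    }

# Compare two prompt variants on the same suite and keep the better one:
# report_a = evaluate_prompt(PROMPT_A, run_agent, score)
# report_b = evaluate_prompt(PROMPT_B, run_agent, score)
```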
Having been spoiled by AI and automation, I was unwilling to do this. So, a natural design decision emerged: further automate this process using AI. For example, after it runs all the tasks, we can directly ask it: "During your execution, what were the bottlenecks in terms of speed and effectiveness? What lessons learned would you want to pass on to your past self?" While this can't completely replace human insight, it largely delegates the grunt work.
For instance, in the first few iterations, we found that one major reason for the AI's low success rate was a recent major upgrade to the Codex CLI. The AI was still trying to use the old interface, and it took a long time to realize it needed to re-examine the command-line interface. This needlessly consumed a large portion of the context window, reducing its intelligence and task success rate. Realizing that Codex had been upgraded and that we should emphasize this in the prompt doesn't require a particularly smart AI or human experience. This kind of prompt iteration is perfectly suited for the AI to do itself.
From another perspective, what we are doing here is more like designing organizational processes and structures within a company or organization, taking on the role of a Senior Manager or even a Director. We are not micromanaging every work item of the front-line employee (the AI), evaluating its quality in detail, and giving feedback. Instead, we are setting up a process that allows it to iterate effectively on its own, enabling self-improvement through data-driven quality assessment and self-reflection across the entire dataset.
This design decision saved us a great deal of development effort; the payoff far exceeded what it would have cost to personally oversee ten projects and give the AI direct feedback on each. Interestingly, this workflow is itself a form of Wide Research. We broke the task of iterating on the prompt down into, say, ten independent problems, each with a Sub-Agent to execute the project, self-reflect, and provide suggestions for prompt modifications. These suggestions are then aggregated by the main LLM and, combined with my observations and experience, used to make actual changes to the prompt. This mindset of macro-management and process design not only changed how we iterate on our SOPs but also completely transformed how we make key technical decisions.
Using Implementation to Accelerate Design Decisions
A crucial technical decision was the form of the sub-agents we would use to solve each sub-problem. Manus's Wide Research uses code; it writes a Python script to batch-call a lightweight LLM. The advantages of this approach are obvious: it's deterministic and transparent. For example, the code contains all the information about which sub-agents have finished and where each sub-agent is in its process. It offers more control because we write the code ourselves, allowing us to fully control the tools and processes. It's also very stable, as we can use JSON mode when calling the API to ensure the output conforms to a specific JSON schema.
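I don't have access to Manus's internals, but a script-based sub-agent along these lines might look like the following sketch. The model name and the schema are illustrative placeholders; the call uses the OpenAI-style structured-output response format, which is what guarantees the output parses and matches the schema.

```python
# Sketch of the script-based sub-agent: a direct API call whose output
# is pinned to a JSON schema via structured outputs. Model name and
# schema are illustrative, not from the repo.
import json
from openai import OpenAI

client = OpenAI()

SUMMARY_SCHEMA = {
    "name": "article_summary",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "student": {"type": "string"},
            "url": {"type": "string"},
            "summary": {"type": "string"},
            "key_questions": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["student", "url", "summary", "key_questions"],
        "additionalProperties": False,
    },
}

def summarize_with_schema(article_text: str, student: str, url: str) -> dict:
    resp = client.chat.completions.create(
        model="lightweight-model",  # placeholder for a small, cheap model
        messages=[{
            "role": "user",
            "content": f"Summarize this blog post by {student} ({url}):\n\n{article_text}",
        }],
        response_format={"type": "json_schema", "json_schema": SUMMARY_SCHEMA},
    )
    # The API enforces the schema, so this parse is safe.
    return json.loads(resp.choices[0].message.content)
```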
But in our specific application, there was another equally tempting option: start a new Codex process for each sub-problem, give it a prompt describing the task, and let it work on its own. This method has a unique advantage: it's free. Currently, Codex can be used with a ChatGPT subscription, and OpenAI is very generous with its limits. So if we could fork many Codex processes to do the wide research, we would essentially be getting that usage from OpenAI at no extra cost, which is a significant advantage.
However, this comes with a cost: we can't deeply customize the tools or control its internal mechanisms; we can only steer it through prompts. Unlike an API with JSON mode, Codex offers no guarantee that its output will be valid JSON, and forcing this programmatically inevitably means a lot of retries. Also, because of Codex's conscientious nature, it's often slower at completing tasks than a direct API call. All of this made it a genuinely difficult design decision.
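For contrast, here is a rough sketch of the Codex-process option. The non-interactive `codex exec` invocation and the output-parsing heuristic are assumptions for illustration only; the repo's actual sub-agents exchange results through files, as described later. The point is the retry loop you are forced into without a JSON-mode guarantee.

```python
# Sketch of spawning one Codex CLI process per sub-task, steered only by
# a prompt. The `codex exec` invocation is an assumption about the
# current CLI surface; check `codex --help` for your version.
import json
import subprocess

def run_codex_subagent(task_prompt: str, max_retries: int = 3) -> dict:
    prompt = (
        task_prompt
        + "\n\nRespond with a single JSON object and nothing else, e.g. "
          '{"student": "...", "url": "...", "summary": "...", "key_questions": []}'
    )
    for _ in range(max_retries):
        proc = subprocess.run(
            ["codex", "exec", prompt],          # assumed non-interactive mode
            capture_output=True, text=True, timeout=1800,
        )
        out = proc.stdout
        start, end = out.find("{"), out.rfind("}") + 1
        if start != -1 and end > start:
            try:
                return json.loads(out[start:end])   # crude: first '{' to last '}'
            except json.JSONDecodeError:
                pass
        # No JSON-mode guarantee here, so the only recourse is to try again.
    raise RuntimeError("Sub-agent never produced valid JSON")
```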
Before the AI era, I would have spent hours carefully researching and thinking before making a decision. After all, it's an important decision, and a wrong one could mean wasting hours or even days. But in the age of AI, analyzing this problem suddenly became incredibly simple: I had Codex implement both methods and compared the results on our test set. The entire process took 10 minutes of dev time and about an hour in total, including waiting time. The comparison was clear: using Codex was better. It struck a better balance between control, effectiveness, and cost.
This is a fascinating shift, almost like what's described in The Three-Body Problem: reconnaissance is more expensive than attack. The time I would have spent researching, analyzing, and agonizing was better spent just building it with AI. And the quality of decisions based on this firsthand experience is exceptionally high. It really feels like "the times have changed."
Specific Technical Decisions
With these two design decisions in place, the rest of the implementation was straightforward. But I want to highlight two more low-level technical decisions.
Using Tavily for Web Access
First, after several rounds of iteration, I found that the AI repeatedly fell into a trap: web access in Codex has a lot of friction. This is understandable; it's a programming tool, and its web access is meant for looking up an API or viewing a webpage, not for conducting web research. As a result, when I actually used it for wide research, Codex spent most of its time struggling with its search tools. Either the page it found was a 404, or it was blocked by anti-bot measures when it tried to access it, or it hit a paywall and had to start over. This meant that in the early stages of iteration, a single task could take 40 to 60 minutes to complete.
The AI couldn't figure this out on its own. So I gave it a manual suggestion: introduce a web operations layer and have it use Tavily for all search and web scraping instead of its native tools. Tavily is a search and web content extraction tool designed for LLMs. It's stable, efficient, and returns content directly in Markdown. Tavily also provides a remote MCP server, making its integration with Codex extremely simple. Of course, I wasn't thrilled that MCP stores the API key in plaintext in the URL, config file, and command-line output, but at least it worked. After this change, the execution speed of research tasks increased by two to three times.
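For a sense of what this web-operations layer provides, here is a sketch using Tavily's Python client. In the repo itself Tavily is wired in through its remote MCP server, so Codex calls the equivalent tools directly; the exact parameters below reflect the Python client as I understand it and should be checked against Tavily's docs.

```python
# What the web-operations layer provides: LLM-oriented search and page
# extraction, sketched with the tavily-python client rather than MCP.
import os
from tavily import TavilyClient

client = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])

def search_web(query: str) -> list[dict]:
    # Returns cleaned, ranked snippets instead of raw HTML.
    return client.search(query, max_results=5)["results"]

def read_page(url: str) -> str:
    # Content extraction that avoids much of the 404/anti-bot friction
    # the native fetch kept running into.
    extracted = client.extract(urls=[url])
    results = extracted.get("results", [])
    return results[0]["raw_content"] if results else ""
```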
The Pitfalls of Multi-Agent Systems
My second contribution was more of an open-ended exploration. Wide Research is, by its nature, a classic multi-agent scenario. When most people hear "multi-agent," their first thought is to have a PM Agent, a Designer Agent, and an Engineer Agent. But I find this setup unreasonable. It's just a rigid imitation of human society without considering the underlying reasons.
Why do we humans need this kind of professional specialization? Mainly because we are limited. A person can't pursue both a PM career path and an Engineer career path in just a dozen years of education. So we have to choose one and specialize. This might be why, if you look at various popular multi-agent implementations, they often start with "You are a senior software engineer." From a human perspective, that's a compliment. A senior engineer is impressive!
But the problem is that LLMs already possess the professional skills of all career paths. They know everything, yet you're asking them to be just an engineer. This kind of prompt is actually restrictive. Instead of enhancing the LLM's capabilities, it weakens them, reducing an omniscient, all-powerful LLM to the level of a limited human who can only wear one hat. Therefore, I believe we shouldn't use role-based division to implement multi-agent systems just to mimic human society. Instead, we should approach it from the fundamental logic of Wide Research and view multi-agent systems through the lens of context window isolation.
In our scenario, we don't tell each Sub-Agent, "You are a research expert on topic X." We simply assign different research sub-problems to different Codex processes to ensure they don't interfere with each other. Communication happens only between the main Codex and the sub-Codexes, via files. Of course, this method isn't perfect. The main Agent doesn't receive any progress reports while the Sub-Agents are working on their tasks. This might be a significant challenge in multi-agent design: how to find the right degree of context isolation so that each sub-task's quality isn't bogged down by too much detail, while the main task has enough high-level awareness and progress reporting.
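As an illustration of this file-based hand-off, here is a sketch with a hypothetical directory layout (the repo's actual layout may differ): the main agent writes one task file per sub-problem and later reads back only each sub-agent's final result, so intermediate reasoning never leaks into its context.

```python
# Sketch of file-based hand-off between the main agent and sub-agents.
# The directory and file names are hypothetical, purely to illustrate
# context isolation: each sub-Codex sees only its own task file and
# writes only its own result file.
from pathlib import Path

WORKDIR = Path("wide_research_run")

def write_task_files(subtasks: dict[str, str]) -> None:
    # Main agent: one task spec per sub-problem, with no shared context.
    for name, description in subtasks.items():
        task_dir = WORKDIR / name
        task_dir.mkdir(parents=True, exist_ok=True)
        (task_dir / "task.md").write_text(description, encoding="utf-8")

def collect_results() -> dict[str, str]:
    # Main agent: after the sub-agents finish, read back only their final
    # result files; their intermediate work never enters its context.
    results = {}
    for result_file in sorted(WORKDIR.glob("*/result.md")):
        results[result_file.parent.name] = result_file.read_text(encoding="utf-8")
    return results
```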
Conclusion
Returning to the original question: when AI slacks off, whose fault is it? The answer this project provides is: it's the fault of the architecture, and it's the fault of how we collaborate with AI. Our solution addresses these problems directly: we use a divide-and-conquer strategy with Wide Research to solve the architectural constraints, choose the high-execution model Codex as our execution unit, and build an AI-native development and decision-making process around it, creating an agent that doesn't slack off and is well-suited for large, long-running projects.
During development, we found that even with the help of AI programming, it's difficult to iterate and move forward quickly. Therefore, instead of blindly grinding on tactics, we first designed a workflow for AI self-reflection and self-iteration. With this foundation, we were able to pinpoint the major bottleneck of web access and solve it by introducing Tavily.
At the same time, another important high-level technical decision was whether our Sub-Agents should be Python programs or Codex processes. We didn't follow the traditional decision-making process of research and analysis. Instead, we spent ten minutes of dev time to build both options and made a high-quality decision based on a direct comparison.
All our prompts are open-source and can be accessed in this repo. I encourage you to open your own Codex, follow the tutorial to configure Tavily MCP (it has a free plan), and copy the example prompt from the Readme to experience the power of Wide Research.