<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Computing Life</title><link href="https://yage.ai/" rel="alternate"></link><link href="https://yage.ai/feeds/atom.xml" rel="self"></link><id>https://yage.ai/</id><updated>2026-03-03T12:00:00-08:00</updated><entry><title>用好AI的第一步：停止使用ChatGPT</title><link href="https://yage.ai/stop-using-chatgpt.html" rel="alternate"></link><published>2026-03-03T12:00:00-08:00</published><updated>2026-03-03T12:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-03:/stop-using-chatgpt.html</id><summary type="html">&lt;p&gt;会用AI和用好AI之间差的是10倍。这个差距的根源在于工作方式，而非模型。本文通过一个完整的工作流例子和上中下三策的框架，解释为什么应该从ChatGPT切换到Cursor这类Agentic工具。&lt;/p&gt;</summary><content type="html">&lt;p&gt;2026年，AI的渗透率已经很高了。很多公司All in AI，Meta甚至专门安排了一整周的脱产AI培训。但我有一个观察是：大多数人，甚至很多重度用户使用AI的方式，和两年前是一样的：大家还是打开聊天窗口，输入问题，等一个回答。区别只是从GPT-4o换成了GPT-5.2或者豆包，从免费版换成了Pro。&lt;/p&gt;
&lt;p&gt;这当然比完全不用AI更好，但也远远不是最优的方法。我很相信（下面也有例子解释）一件事：能用AI和用好AI之间，生产力差的不是30%，而是10倍的量级。不是说我用，甚至重度使用ChatGPT，就天然进入了AI阵营，可以高枕无忧了。事实上，大多数人用AI的方法，就像汽车发明之后还在把它当马车用：同样的路线，同样的速度，只是换了个引擎。而这个差距的根源，在于你的工作方式是否匹配了AI的能力结构。&lt;/p&gt;
&lt;p&gt;举一个我最近的真实例子。我要改进一个算法，从开会讨论方向、分析失败case、到实现改进方案并验证结果，AI（Cursor）自主执行了大约45分钟，自己走完了设计、实现、测试、发现问题、定位原因、修复、再验证的完整循环，最终所有失败case全部修复。整个过程中我的角色就是定方向和审结果。如果用ChatGPT做同一件事，保守估计时间会多五到十倍。这个10倍差距到底是怎么来的？下面我先解释原因，再用这个例子的完整过程来演示具体做法。&lt;/p&gt;
&lt;h2&gt;为什么聊天窗口是天花板&lt;/h2&gt;
&lt;p&gt;从2024年底开始，AI领域出现了一类新工具，以Cursor、Claude Code、Codex为代表。它们表面上是编程工具，但代表的是一种跟ChatGPT完全不同的AI用法。很多人以为这只是面向程序员的ChatGPT，但我的实际体验是，它们&lt;a href="/cursor-ai-entry.html"&gt;对几乎所有知识工作都有用&lt;/a&gt;。具体地说，它有三层好处：&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第一层：反馈闭环。&lt;/strong&gt; 你让ChatGPT写一段python，它写了，你复制到IDE里一跑，报错了。你把报错信息贴回去，它改了一版，你再跑，又不对，你又贴回去。这个过程里，我们就是反馈闭环中的人型工具人：AI产出，我们验证，我们搬运，AI再改。我们从一个应该指挥AI的人，变成了一个来回跑腿的工具人。&lt;/p&gt;
&lt;p&gt;Cursor这类工具的核心区别在于它接入了我们的执行环境。它写完代码可以直接跑，看到报错自己改，改完再跑，再改。这个循环是AI自己驱动的。因此，AI从一个只会出主意的顾问，变成了能独立干活的员工。顾问说完就走，对不对它既不知道，也不负责；员工则会自己验收，发现问题就返工。&lt;/p&gt;
&lt;p&gt;这也是为什么很多人觉得AI不靠谱：他们一直在用一个开环的AI，犯了错浑然不觉。给它一个闭环，可靠性会有质的提升。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第二层：上下文供给。&lt;/strong&gt; AI输出质量的瓶颈，很多时候在于它能看到多少相关上下文，而非模型本身有多聪明。同一个模型，给足上下文就能给出对的结果；让它盲猜，就容易脑补出不一样的目标。&lt;/p&gt;
&lt;p&gt;最近有&lt;a href="/ai-key-decisions.html#comment-6844340971"&gt;读者评论&lt;/a&gt;：各家的Deep Research和在本地工具里接搜索API相比，哪个更好？我的回答是，我已经好几个月没开过Deep Research了。搜索质量本身没问题，但它能解决的问题太有限。举个例子，我想在工作中比较两种算法的优劣。这个"我的场景"其实需要仔细描述，因为它直接决定了比较的维度：我的数据长什么样、我看重延迟还是准确率、部署环境有什么约束。用Deep Research，我要花很长时间把这些背景交代清楚。但在Cursor里，我直接 @ 几个内部文档和会议记录，AI立刻就有了所有上下文。哪怕搜索能力弱一点，给出的结果也更贴合，速度还更快。&lt;/p&gt;
&lt;p&gt;所以ChatGPT的瓶颈很多时候在于上下文的供给：你很难把足够的信息喂给它。Cursor这类工具解决的就是这个问题。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第三层：资产积累。&lt;/strong&gt; ChatGPT的使用模式是消耗型的。你投入时间，得到一个答案，答案用完就没了。每次对话都是从零开始。Cursor是投资型的。你用到了某个内部文档？存到项目文件夹里。AI反复犯某个错？花两分钟写一条规则。团队有一套约定俗成的惯例？写下来让AI也知道。这些都是一次性投入，但收益是持久的。&lt;/p&gt;
&lt;p&gt;时间一长就会形成飞轮效应：你用得越多、积累越多，AI就越懂你的项目、你的偏好、你的工作方式。ChatGPT永远是一个需要完整briefing的陌生人，Cursor可以变成一个越来越默契的搭档。一个每次归零，一个持续复利。&lt;/p&gt;
&lt;p&gt;反馈闭环、上下文、资产积累，这三层加在一起，就是前面那个45分钟的例子能成立的原因。但光知道原因还不够，关键是怎么在日常工作中把这些落地。下面我就用那个例子的完整过程来演示。&lt;/p&gt;
&lt;h2&gt;上中下三策：一个完整的例子&lt;/h2&gt;
&lt;p&gt;在展开之前，先介绍一个我在实践中总结的框架，叫做上中下三策。工作中的每一步都会产生信息，这些信息怎么处理，决定了AI能帮你多少。下策是让信息消失（人看不到，AI也看不到）；中策是记录成人能看的形式（人友好，AI不友好）；上策是先让AI能消费，再加工给人看（AI-first）。下面每一步我都会用这个框架来分析。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第一步，开会。&lt;/strong&gt; 组里的周会，讨论了某个算法在一些数据上失败的情况，大家提出了各种假设和改进思路。&lt;/p&gt;
&lt;p&gt;下策是开完就忘，什么都没留下。中策是写一份Google Doc的会议纪要，这已经是一个很好的做法了：它增加了你的visibility，同事知道你做了什么，未来也方便引用。但AI很难直接拿到这些内容，因为Google Doc需要登录，格式也混杂，每次想让AI参考都要手动复制粘贴。中策对人友好，对AI不友好。&lt;/p&gt;
&lt;p&gt;上策是用Zoom AI Companion或类似工具自动转录会议内容，存成.md文件，放到工作文件夹的meeting_notes目录下。时间成本几乎为零，但AI从此可以直接引用这次会议里的每一个细节。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第二步，分析数据。&lt;/strong&gt; 我需要看那个算法在不同数据上的表现，记录失败的具体场景和原因。同样的三策逻辑：下策是在便签上记几个URL，给人看的时候切过去点一下完事；中策是写进Confluence；上策是在工作文件夹里建一个analysis_notes.md，把每个失败case的链接、失败原因、观察都记进去。&lt;/p&gt;
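&lt;p&gt;举一个假想的例子（字段、链接和原因都只是示意占位，按自己的习惯调整即可），analysis_notes.md 大致可以长这样：&lt;/p&gt;

```markdown
# 算法失败case分析

## Case 1
- 链接: https://example.com/case/123 （示意链接）
- 失败原因: 输入包含多语言混排时分词出错
- 观察: 只在文本较长时出现，疑似与截断逻辑有关

## Case 2
- 链接: https://example.com/case/456 （示意链接）
- 失败原因: 时间戳解析对时区敏感
- 观察: 和Case 1可能共享同一段预处理代码
```

&lt;p&gt;格式本身不重要，重要的是每条观察都落在一个AI可以直接 @ 到的纯文本文件里。&lt;/p&gt;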
&lt;p&gt;值得说明的是，在这两步里上策实际花的时间和中策差不多，有时候甚至更短，因为.md文件的排版比Confluence简单得多，而且你完全可以让AI帮你整理。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第三步，写代码改进算法。&lt;/strong&gt; 这是上策真正发挥威力的地方。因为前两步的所有信息都在同一个文件夹里，我在Cursor里 @ 一下会议记录，再 @ 一下分析笔记，告诉AI：根据这些信息，设计一个改进方案并实现，然后验证这些失败的case有没有被修复。&lt;/p&gt;
&lt;p&gt;注意AI这时候拿到的上下文有多完整：它知道这个算法为什么要改，有什么改进思路（会议记录里有讨论），知道具体有哪些失败模式和原因（分析笔记里有记录），知道成功的标准是什么（哪几个case要被修复）。这里面最关键的是最后一点：success criteria。很多人用AI的时候，只告诉它做什么，却省略了什么样算做好了。这就像一场缺少终点线的赛跑，AI凭感觉跑，你凭感觉判断。但如果你给了AI一个明确的终点线（这几个失败的case要全部修复），AI就可以自己跑完从设计到实现到验证的完整循环：写代码、跑测试、发现问题、定位原因、修复、再验证。这就是前面说的那45分钟里发生的事情。
（事实上这背后比听起来更复杂：AI在后台自动拆分了子任务，调度了多个agent并行工作，主agent做设计和验收，子agent负责编码和测试，整个过程高度自动化。但这是更进阶的话题了。）&lt;/p&gt;
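&lt;p&gt;所谓终点线，本质上就是一段确定性的验收脚本。下面是一个极简的示意（其中的improved_algorithm和各个case都是虚构的占位，并非本项目的真实代码）：&lt;/p&gt;

```python
# 极简的"终点线"示意：之前失败的case必须全部通过。
# improved_algorithm 和 FAILING_CASES 都是虚构的占位，仅用于说明思路。

def improved_algorithm(x):
    # 占位实现，代表待验证的改进算法
    return x * x

FAILING_CASES = [
    {"id": "case-1", "input": 2, "expected": 4},
    {"id": "case-2", "input": 3, "expected": 9},
    {"id": "case-3", "input": -1, "expected": 1},
]

def run_finish_line(cases):
    # 返回仍然失败的case id列表；空列表即代表"做完了"
    return [c["id"] for c in cases
            if improved_algorithm(c["input"]) != c["expected"]]

still_failing = run_finish_line(FAILING_CASES)
print("DONE" if not still_failing else f"STILL FAILING: {still_failing}")
```

&lt;p&gt;脚本本身不重要，重要的是它是确定性的：AI每改一版都可以自己跑一遍，"列表为空"就是一个毫无歧义的完成定义。&lt;/p&gt;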
&lt;p&gt;如果用ChatGPT做同一件事呢？你要手动把每段上下文贴过来：先贴会议纪要作为背景，再另开对话贴代码让它帮你改。这样一方面要贴大量文件，一方面要在Python环境和ChatGPT之间来回拷贝，非常低效。其次，这种用法缺少自我修正能力：你得自己看中间结果、自己判断哪里出了问题、自己把反馈喂回去。麻烦还是其次，主要是弯路会多很多。AI可以一目十行，看1000行log就知道问题在哪；人类则需要专门的可视化工具才能看出来。这就是10倍差距的来源：一边是信息打通、自动闭环的AI，另一边是信息割裂、人肉驱动的AI。&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;第四步，写文档和准备presentation。&lt;/strong&gt; 因为所有的分析、代码、结果都在同一个文件夹里，我直接让AI根据这些内容生成一份技术文档，再贴到Confluence上。&lt;/p&gt;
&lt;p&gt;注意这里的顺序：先在Cursor里让AI生成，再复制到Confluence。先AI，后人。这个顺序的倒转其实是整个工作流里最深的一个思维转变。传统做法是human-first：我自己写文档，写完可能让AI帮我润色一下。上策是AI-first：信息先以AI能消费的格式存在（.md文件），AI完成主要工作（生成文档），最后才转成人类可读的版本（Confluence页面）。结果是你花的时间更少，产出的质量更高，而且AI消费的那份原材料还留在你的文件夹里，未来随时可以再用。&lt;/p&gt;
&lt;p&gt;从开会到出文档，半天时间搞定了全部工作。&lt;/p&gt;
&lt;p&gt;当你把每一步都用上策来处理，所有信息最终都汇聚在同一个文件夹下，形成了我在&lt;a href="/openclaw.html"&gt;之前文章&lt;/a&gt;里提到的Mono Repo模式。AI天然可以跨主题访问所有上下文。这时候AI的能力会有一个显著的跃升，因为它第一次拥有了你的完整信息版图。你可以回想一下你上周的工作：多少环节在用下策？多少在用中策？如果大部分答案是下策和中策，那就是你和10倍效率之间的差距所在。&lt;/p&gt;
&lt;p&gt;回过头看这个流程，有一个根本性的转变：传统工作流里，人是主要执行者，AI是辅助。这个工作流里反过来了，AI是主要执行者，人的角色是定方向、定标准、做判断。换一种说法：我们对AI的定位，应该从&lt;em&gt;让AI帮我写代码&lt;/em&gt;升级到&lt;em&gt;让AI帮我解决问题&lt;/em&gt;。写代码只是解决问题的其中一环。如果你给了AI足够的上下文和明确的成功标准，它可以独立走完整个循环，你的角色就变成了出题人。你的价值在于你知道这个算法应该往哪个方向改，你知道什么样的结果才算成功。这种判断力是你作为专业人士最核心的能力，也恰恰是AI最依赖你提供的东西。&lt;/p&gt;
&lt;p&gt;这个思路适用于所有职业。你可以是工程师、数据分析师、产品经理、研究员。只要你的工作涉及信息的整理、分析、决策和产出，上中下三策就适用，feedback loop的价值就存在。区别只在于AI帮你执行的那个环节是写代码、做分析、写文档还是别的任务。&lt;/p&gt;
&lt;h2&gt;开始行动&lt;/h2&gt;
&lt;p&gt;工具会变，今天的载体是Cursor和Claude Code，明天可能是别的。但三样东西是持久的：反馈闭环让AI能自我修正，上下文供给让AI能理解你的世界，资产积累让你和AI的协作越来越高效。这是底层的范式，跟具体工具无关。&lt;/p&gt;
&lt;p&gt;如果你今天只做一件事，我的建议是这样：找一个你正在进行的项目，建一个文件夹，花半小时把相关的文档、笔记、会议记录全部复制粘贴放进去。然后，即使是你觉得应该交给ChatGPT的工作，也请抑制住这种冲动，强令自己打开Cursor，从这里开始你跟AI的下一次对话。你会立刻感受到差异。改变从这一刻开始。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"></category><category term="Chinese"></category><category term="Agentic AI"></category><category term="Methodology"></category></entry><entry><title>Step One to Using AI Well: Stop Using ChatGPT</title><link href="https://yage.ai/stop-using-chatgpt-en.html" rel="alternate"></link><published>2026-03-03T12:00:00-08:00</published><updated>2026-03-03T12:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-03-03:/stop-using-chatgpt-en.html</id><summary type="html">&lt;p&gt;The gap between using AI and using AI well is 10x. That gap comes from how you work, not which model you use. This post walks through a complete workflow example and a Three Tiers framework to explain why you should switch from ChatGPT to agentic tools like Cursor.&lt;/p&gt;</summary><content type="html">&lt;p&gt;By 2026, AI has become widespread. Companies are all-in on it. Meta even blocked out an entire week for mandatory AI training. But here's what I keep noticing: most people, including heavy users, are interacting with AI the same way they did two years ago. They open a chat window, type a question, wait for an answer. The only difference is they've swapped GPT-4o for GPT-5.2 or Doubao, or upgraded from free to Pro.&lt;/p&gt;
&lt;p&gt;That's better than not using AI at all, but it's nowhere close to optimal. I'm convinced, and I'll show you evidence below, that the productivity gap between &lt;em&gt;using AI&lt;/em&gt; and &lt;em&gt;using AI well&lt;/em&gt; isn't 30%. It's an order of magnitude. Just because you use ChatGPT, even heavily, doesn't mean you've joined some AI-native vanguard where you can sit back and relax. Most people are using AI like someone who got a car but still drives horse-carriage routes: same roads, same speed, just a different engine. The real gap comes down to whether your way of working actually matches how AI is capable of operating.&lt;/p&gt;
&lt;p&gt;Here's a recent real example from my own work. I needed to improve an algorithm. From the initial meeting to map out the direction, through analyzing failure cases, to implementing the fix and verifying results, AI (specifically Cursor) ran autonomously for about 45 minutes. It completed the full loop on its own: design, implement, test, find issues, diagnose, fix, verify again. Every failing case was resolved. My role throughout was to set the direction and review the outcome. Doing the same thing in ChatGPT would conservatively take five to ten times longer. Where does that 10x gap actually come from? I'll explain the why first, then walk through the complete example to show the how.&lt;/p&gt;
&lt;h2&gt;Why the Chat Window Is a Ceiling&lt;/h2&gt;
&lt;p&gt;Starting around late 2024, a new category of AI tools emerged: Cursor, Claude Code, Codex. On the surface they look like coding tools, but they represent a fundamentally different way of using AI compared to ChatGPT. A lot of people assume they're just ChatGPT for programmers, but my experience is that &lt;a href="/cursor-ai-entry-en.html"&gt;they're useful for almost all knowledge work&lt;/a&gt;. The difference plays out on three levels.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 1: The feedback loop.&lt;/strong&gt; You ask ChatGPT to write some Python. It writes it. You copy it to your IDE, run it, it errors. You paste the error back, it gives you a revision, you run it again, still wrong, you paste again. In this cycle, you become the human errand runner in the feedback loop: AI produces, you test, you shuttle information back and forth, AI revises. You've gone from the person directing AI to the person doing the legwork.&lt;/p&gt;
&lt;p&gt;The core difference with Cursor is that it's connected to your execution environment. It writes code and runs it directly. Sees an error, fixes it, runs it again. The loop is AI-driven. This turns AI from a consultant who gives advice and walks away into an employee who can work independently. The consultant says their piece and leaves, with no idea if it was right and no accountability. The employee validates their own work and fixes problems when they find them.&lt;/p&gt;
&lt;p&gt;This is also why a lot of people think AI is unreliable: they've been using open-loop AI that fails and doesn't know it. Give it a closed loop, and reliability improves dramatically.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 2: Context supply.&lt;/strong&gt; The bottleneck on AI output quality, much of the time, isn't how smart the model is. It's how much relevant context the model can see. Same model, enough context: correct result. Same model, guessing blind: it fills in the gaps with something that might be completely wrong.&lt;/p&gt;
&lt;p&gt;A &lt;a href="/ai-key-decisions-en.html#comment-6844340971"&gt;reader recently commented&lt;/a&gt;: between Deep Research from the major AI providers versus plugging a search API into a local tool, which is better? My answer: I haven't opened Deep Research in months. The search quality isn't the issue. It's just too limited in what it can actually solve. Say I want to compare two algorithms for my specific use case at work. "My use case" requires careful description, because it directly determines what dimensions matter for the comparison: what my data looks like, whether I care about latency or accuracy, what the deployment constraints are. With Deep Research, I have to spend a lot of time explaining all that background. In Cursor, I just @ a few internal docs and meeting notes, and AI immediately has all the context. Even if the search capability is slightly weaker, the results are more relevant and the whole thing is faster.&lt;/p&gt;
&lt;p&gt;ChatGPT's bottleneck is often context supply: it's hard to feed it enough information. Cursor-style tools solve exactly that problem.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Level 3: Asset accumulation.&lt;/strong&gt; ChatGPT's usage pattern is consumptive. You put in time, you get an answer, the answer gets used, then it's gone. Every conversation starts from zero. Cursor is investment-style. You needed an internal doc? Save it to the project folder. AI keeps making the same mistake? Spend two minutes writing a rule. Your team has conventions everyone follows? Write them down so AI knows too. Each of these is a one-time investment with compounding returns.&lt;/p&gt;
&lt;p&gt;Over time, this creates a flywheel: the more you use it, the more you've accumulated, the better AI understands your project, your preferences, your working style. ChatGPT is always a stranger who needs a full briefing every time. Cursor becomes a collaborator who gets more in sync with you over time. One resets to zero; the other compounds.&lt;/p&gt;
&lt;p&gt;These three levels, feedback loop, context, and asset accumulation, are why that 45-minute example was possible. But knowing the reason isn't enough. What matters is how to actually make this work day to day. The full example below shows that.&lt;/p&gt;
&lt;h2&gt;Three Tiers: A Complete Example&lt;/h2&gt;
&lt;p&gt;Before walking through it, let me introduce a framework I've developed through practice, which I call the Three Tiers. Every step in your work produces information. How you handle that information determines how much AI can help you. The Bad tier: information disappears (neither you nor AI can see it later). The Better tier: information gets recorded in a human-readable format (human-friendly, AI-unfriendly). The Best tier: information gets stored AI-first, then made human-readable. I'll apply this framework to every step below.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 1: The meeting.&lt;/strong&gt; The team's weekly sync, where we discussed cases where an algorithm was failing on certain data and brainstormed hypotheses and improvement ideas.&lt;/p&gt;
&lt;p&gt;Bad tier: meeting ends, nothing is captured. Better tier: write up a Google Doc with meeting notes. This is already a solid practice. It increases your visibility, your colleagues know what happened, and it's easy to reference later. But AI can't easily access this content: Google Docs require login, the format is messy, and every time you want AI to reference it you have to manually copy and paste. Better tier is human-friendly and AI-unfriendly.&lt;/p&gt;
&lt;p&gt;Best tier: use Zoom AI Companion or a similar tool to auto-transcribe the meeting, save it as a .md file, put it in a meeting_notes directory inside your work folder. Time cost is nearly zero, but AI can now directly reference every detail from that meeting going forward.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 2: Analyzing the data.&lt;/strong&gt; I needed to look at how the algorithm performed across different inputs, and document the specific failure scenarios and their causes. Same Three Tiers logic: Bad tier is jotting a few URLs in a sticky note and clicking through them when you need to show someone. Better tier is writing it up in Confluence. Best tier is creating an analysis_notes.md in your work folder with each failure case's link, failure reason, and observations.&lt;/p&gt;
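&lt;p&gt;As a hypothetical sketch (the fields, links, and failure reasons are all placeholders; adapt the layout to your own habits), an analysis_notes.md might look like:&lt;/p&gt;

```markdown
# Algorithm failure case analysis

## Case 1
- Link: https://example.com/case/123 (placeholder link)
- Failure reason: tokenization breaks on mixed-language input
- Observations: only appears on long inputs, possibly related to truncation

## Case 2
- Link: https://example.com/case/456 (placeholder link)
- Failure reason: timestamp parsing is timezone-sensitive
- Observations: may share preprocessing code with Case 1
```

&lt;p&gt;The format itself doesn't matter. What matters is that every observation lands in a plain-text file AI can directly @ later.&lt;/p&gt;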
&lt;p&gt;Worth noting: the Best tier in these two steps takes about as much time as the Better tier, sometimes less, because .md formatting is far simpler than Confluence, and you can have AI help you organize it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 3: Writing code to improve the algorithm.&lt;/strong&gt; This is where the Best tier really shows its value. Because all the information from the first two steps lives in the same folder, I open Cursor, @ the meeting notes, @ the analysis notes, and tell AI: based on this, design an improvement and implement it, then verify that the failing cases are fixed.&lt;/p&gt;
&lt;p&gt;Look at how complete the context is that AI has at this point. It knows why the algorithm needs to change. It has improvement ideas (the meeting notes have that discussion). It knows the specific failure patterns and their causes (the analysis notes have that). It knows the success criteria (which cases need to be fixed). That last piece is the most critical. A lot of people tell AI what to do but skip what "done" looks like. It's like a race with no finish line: AI runs by feel, you judge by feel. But give AI a clear finish line (all these failing cases must pass), and it can run the entire loop from design to implementation to verification on its own: write code, run tests, find problems, diagnose, fix, verify again. That's what happened in those 45 minutes.&lt;/p&gt;
&lt;p&gt;(What's going on behind the scenes is actually more complex than it sounds: AI automatically broke the task into subtasks, scheduled multiple agents to work in parallel, with the main agent handling design and review while sub-agents handled coding and testing. But that's a more advanced topic.)&lt;/p&gt;
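&lt;p&gt;To make the finish line concrete: at its core it's just a deterministic acceptance script. Here is a minimal sketch, where improved_algorithm and the case list are hypothetical stand-ins, not the actual project code:&lt;/p&gt;

```python
# A minimal machine-checkable finish line: every previously failing case must pass.
# improved_algorithm and FAILING_CASES are hypothetical stand-ins for illustration.

def improved_algorithm(x):
    # stand-in for the implementation under test
    return x * x

FAILING_CASES = [
    {"id": "case-1", "input": 2, "expected": 4},
    {"id": "case-2", "input": 3, "expected": 9},
    {"id": "case-3", "input": -1, "expected": 1},
]

def run_finish_line(cases):
    # Return the ids of cases that still fail; an empty list means done.
    return [c["id"] for c in cases
            if improved_algorithm(c["input"]) != c["expected"]]

still_failing = run_finish_line(FAILING_CASES)
print("DONE" if not still_failing else f"STILL FAILING: {still_failing}")
```

&lt;p&gt;The script itself is trivial; the point is that it's deterministic. AI can run it after every change, and "the list is empty" is an unambiguous definition of done.&lt;/p&gt;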
&lt;p&gt;What if you did this same thing in ChatGPT? You'd have to manually paste in every piece of context. Maybe you paste the meeting notes as background, then open another chat for the code changes, copying back and forth between your Python environment and the chat window constantly. Beyond the inefficiency, this approach lacks any self-correction ability. You have to review every intermediate result yourself, decide where things went wrong, and manually feed that feedback back in. The hassle is secondary; the bigger cost is all the detours. AI can skim a thousand lines of logs and identify the problem in seconds. A human needs specialized visualization tools just to see what's happening. That's where the 10x gap comes from: on one side, information fully connected, loop automated; on the other, information siloed, loop driven by hand.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Step 4: Writing documentation and preparing the presentation.&lt;/strong&gt; Because all the analysis, code, and results are in the same folder, I have AI generate a technical document directly from that content, then paste it to Confluence.&lt;/p&gt;
&lt;p&gt;Notice the order: generate in Cursor first, then copy to Confluence. AI first, then humans. This reversal is actually the deepest mindset shift in the entire workflow. The traditional approach is human-first: I write the document, then maybe have AI polish it. The Best tier is AI-first: information lives in a format AI can consume (.md files), AI does the main work (generates the document), and only then does it get converted to a human-readable form (Confluence page). The result is less time spent, higher quality output, and the AI-consumable source material stays in your folder for future use.&lt;/p&gt;
&lt;p&gt;From the meeting to finished documentation, the whole thing took half a day.&lt;/p&gt;
&lt;p&gt;When you handle every step with the Best tier, all information converges in the same folder, forming what I called the Mono Repo pattern in &lt;a href="/openclaw-en.html"&gt;a previous post&lt;/a&gt;. AI can naturally access all the context across every topic. At that point, AI's capability takes a noticeable leap, because it finally has access to your complete information map. Think back over your work last week. How many steps were Bad tier? How many were Better tier? If most of your answers are Bad and Better, that's the gap between where you are and 10x productivity.&lt;/p&gt;
&lt;p&gt;Stepping back and looking at this workflow, there's a fundamental shift: in the traditional model, the human is the primary executor and AI is the assistant. In this workflow, it's reversed. AI is the primary executor; the human's role is to set direction, define success criteria, and make judgment calls. Put it another way: our conception of AI should upgrade from &lt;em&gt;have AI help me write code&lt;/em&gt; to &lt;em&gt;have AI help me solve problems&lt;/em&gt;. Writing code is just one piece of solving problems. If you give AI enough context and a clear definition of success, it can complete the entire loop independently, and your role becomes the one who sets the problem. Your value lies in knowing which direction the algorithm should go, and knowing what a successful result looks like. That kind of judgment is your core capability as a professional, and it's exactly what AI depends on you to provide.&lt;/p&gt;
&lt;p&gt;This applies to every profession. Engineer, data analyst, product manager, researcher. If your work involves gathering, analyzing, deciding, and producing information, the Three Tiers apply, and the value of a feedback loop is there. The only difference is whether the loop AI runs for you involves writing code, doing analysis, writing documents, or something else.&lt;/p&gt;
&lt;h2&gt;Getting Started&lt;/h2&gt;
&lt;p&gt;The tools will change. Today it's Cursor and Claude Code; tomorrow it'll be something else. But three things are durable: a feedback loop that lets AI correct itself, context supply that lets AI understand your world, and asset accumulation that makes your collaboration with AI more efficient over time. These are the underlying principles, independent of any specific tool.&lt;/p&gt;
&lt;p&gt;If you do one thing today, here's my suggestion: find a project you're currently working on, create a folder, and spend 30 minutes copying all the relevant documents, notes, and meeting records into it. Then, even for work you'd normally turn to ChatGPT for, resist that impulse, open Cursor instead, and start your next conversation with AI from there. You'll feel the difference immediately. Start now.&lt;/p&gt;</content><category term="Computing"></category><category term="English"></category><category term="Agentic AI"></category><category term="Methodology"></category></entry><entry><title>以一个简单任务为例看AI落地的关键决策</title><link href="https://yage.ai/ai-key-decisions.html" rel="alternate"></link><published>2026-02-20T18:00:00-08:00</published><updated>2026-02-20T18:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-20:/ai-key-decisions.html</id><summary type="html">&lt;p&gt;用两分钟指挥AI给300篇文章添加SEO summary的实战案例，拆解五个关键决策：选对执行环境、先建测试再干活、让agent自己处理corner case、divide and conquer、结果导向的prompt写法。&lt;/p&gt;</summary><content type="html">&lt;p&gt;今天我用AI完成了一个小任务。感觉这个案例特别适合用来介绍AI的实战原则，所以写了这篇文章来分享一下。&lt;/p&gt;
&lt;p&gt;任务本身是给这个blog里的每一篇文章都加一行summary，这样可以帮助搜索引擎理解这个网站的内容，从而提升这个网站的排名（SEO）。这个任务看起来简单，其实有很多坑，一不小心就会陷入AI鬼打墙、不可靠、使用繁琐的陷阱。下面主要分享在这个过程中我做了哪五个重要的决策，来让整个流程变得稳定可靠。&lt;/p&gt;
&lt;h2&gt;决策一：用本地Coding Agent，而不是ChatGPT&lt;/h2&gt;
&lt;p&gt;我做的第一个决策是：用Cursor/OpenCode作为讨论的平台，而不是ChatGPT。这件事其实并不显然，因为整个项目的开始来自于我想给这个网站做SEO。直观上看，这是个更适合ChatGPT的聊天性质的任务。但是我仍然坚持用了OpenCode。这里面最根本的原因是摩擦。&lt;/p&gt;
&lt;p&gt;具体地说，摩擦在两个方面。第一是上下文传递的摩擦。用ChatGPT我需要把我的博客的内容甚至代码复制粘贴给它，或者让它去写代码抓取这些文章的内容。但在OpenCode里，我只要用@指定我的博客所在的文件夹就好了，摩擦小很多。&lt;/p&gt;
&lt;p&gt;另一个方面是落地的摩擦。比如我们在ChatGPT里面通过聊天得出了结论：这个网站需要增加Summary元数据。为了把这个想法落地，我需要把我和ChatGPT来回几轮的聊天记录全部复制粘贴到Cursor/OpenCode里面去，然后再调用另一个AI来改文章的内容。相比之下，如果从头就在OpenCode里面做讨论的话，讨论之后立刻就能落地。&lt;/p&gt;
&lt;p&gt;所以我做了这第一个决策：对几乎所有任务，抛弃基于聊天的AI环境，选择能执行的Agentic环境。为什么把这个决策放在第一个，是因为这是有和无的区别。摩擦一大，我们就懒得做下去了，整个项目花了时间，交付是0，纯浪费时间。只有摩擦小了，项目能继续下去，才有必要继续聊具体的方法和技巧。&lt;/p&gt;
&lt;h2&gt;决策二：动手之前，先定义成功，提供测试&lt;/h2&gt;
&lt;p&gt;我做的第二个决策是：在让AI动手生成任何summary之前，先让它写一个测试。这个测试做的事情很简单，就是检查所有.md文件，看有没有summary字段。如果不是100%的文件都有这个字段就fail，并且打印是哪些文件有问题。&lt;/p&gt;
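&lt;p&gt;这个测试本身非常简单。下面是一个极简的示意（假设metadata以"字段名: 值"的形式放在.md文件开头第一个空行之前，字段名为summary；具体格式请按自己的博客调整）：&lt;/p&gt;

```python
# 极简的summary覆盖率检查示意。
# 假设：metadata以"字段名: 值"的形式放在.md文件开头的第一个空行之前，
# 其中包含summary字段（大小写不敏感）。实际格式请按自己的博客调整。
import tempfile
from pathlib import Path

def find_missing_summaries(content_dir):
    # 返回缺少summary字段的.md文件路径列表
    missing = []
    for md in sorted(Path(content_dir).rglob("*.md")):
        head = md.read_text(encoding="utf-8").split("\n\n", 1)[0]
        has_summary = any(line.lower().startswith("summary:")
                          for line in head.splitlines())
        if not has_summary:
            missing.append(str(md))
    return missing

# 用临时文件演示一次：a.md有summary，b.md没有
with tempfile.TemporaryDirectory() as d:
    Path(d, "a.md").write_text("Title: A\nSummary: 有摘要\n\n正文", encoding="utf-8")
    Path(d, "b.md").write_text("Title: B\n\n正文", encoding="utf-8")
    print(find_missing_summaries(d))  # b.md 会被列为缺失
```

&lt;p&gt;AI每做完一轮就跑一遍：空列表就是100%覆盖，非空列表就是下一轮要返工的文件清单。&lt;/p&gt;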
&lt;p&gt;为什么要先写测试？因为如果没有这个测试，AI说做完了，我也不知道它到底做完了没有。我确实可以抽查几篇，但300多篇文章，抽查没法覆盖全部。最后的局面就是我也不知道、AI也不知道，两个人都在wishful thinking。&lt;/p&gt;
&lt;p&gt;但有了测试就不一样了。AI做完一轮，测试fail了，它自己就知道还有20篇没覆盖，下面就会重新看这些文章。测试通过了，就是100%完成了。不需要人工抽查，不需要猜，一切100%都是确定的。&lt;/p&gt;
&lt;p&gt;这就是我们一直强调的&lt;a href="/agentic-ai-crisis.html"&gt;feedback loop&lt;/a&gt;。很多人用AI陷入踢一脚动一下、动完了发现不对，再踢再动的循环，觉得AI好难用，根本原因就是没有建立反馈机制。AI不知道什么叫"做完"，你也不知道AI做到什么程度了。这是要首先解决的核心问题。确定性的测试就是一个非常有效的解决方法。事实上，只要这种测试到位了，后面三个决策都是锦上添花的东西。&lt;/p&gt;
&lt;p&gt;所以在开始任何任务之前，我都会先问自己：我/AI有没有一个确定性的方式来判断任务完成了没有？如果没有，先把这个机制建起来。&lt;/p&gt;
&lt;h2&gt;决策三：让Agent自己去干，而不是我来写程序调用API&lt;/h2&gt;
&lt;p&gt;第三个决策是：我没有写程序去调用LLM API来生成summary，而是让coding agent自己去做这件事。&lt;/p&gt;
&lt;p&gt;更详细的原因在&lt;a href="/result-certainty.html"&gt;这篇文章&lt;/a&gt;中有解释。虽然让AI做概括听起来调个API就搞定了。但仔细想想，这里有很多corner case：有的文章已经有summary了不要重复加，有的metadata格式不一致，有的位置需要调整。如果写程序处理这些情况，代码会特别复杂，调试成本高，进展速度慢。最后可能AI会花大量的精力去调怎么处理这些细节。&lt;/p&gt;
&lt;p&gt;另一种思路是用自然语言直接给Cursor/OpenCode布置任务："你去看一下XX.md，保证它有个面向SEO的summary元数据域"。这时候完成任务的主体就不是一个机械的程序，而是一个真正有智能、知变通的Agent。它会自己看情况处理——有summary就跳过，格式不对就调整，遇到特殊情况自己判断。&lt;/p&gt;
&lt;p&gt;这就是把AI当agent用和把AI当工具用的区别。调用API的模式是：你写程序，AI是其中一个组件。这种模式确定性高，但灵活性低，遇到复杂情况反而更慢。而用Agentic AI，确定性从过程移到了结果上，你只需要讲清楚要什么结果。剩下的事，AI发挥自己的能动性和判断力自己搞定。&lt;/p&gt;
&lt;p&gt;所以在我的工作流里，调用API是最后手段。能交给agent去做的，尽量交给agent。&lt;/p&gt;
&lt;h2&gt;决策四：用Divide and Conquer应对认知饱和&lt;/h2&gt;
&lt;p&gt;第四个决策是：我没有给一个agent一股脑布置300篇文章的任务，而是让它开了8个sub-agent，分配任务以后并行处理。&lt;/p&gt;
&lt;p&gt;这里面的原因和context window saturation有关。一个agent一下处理300篇，前面可能还好，读了十几篇文章以后context window &lt;a href="/wide-research.html"&gt;会被占满&lt;/a&gt;，后面就开始偷懒、跳文章、或者忘了前面踩过的坑。这和人有点像，认知负荷一高就会丢三落四，或者开始敷衍。&lt;/p&gt;
&lt;p&gt;另一个原因是sub-agent是coding agent原生支持的功能。我不用自己写并发逻辑、分配任务、汇总结果。这些plumbing work都被外包出去了。我只要用一两句话描述一下这个工作流就好。&lt;/p&gt;
&lt;p&gt;很多人用AI的时候没有意识到这个问题。他们没有针对AI的缺陷思考，预测到里面的坑，就用最符合直觉的方法去布置任务。但像我们管理下属的时候要知人善任一样，我们要意识到AI的认知资源尤其有限，context window是一种需要管理的稀缺资源。任务量太大，质量必然下降。所以任务量大的时候，我会主动考虑拆分，而不是让一个agent扛所有东西。&lt;/p&gt;
&lt;p&gt;这个决策和前面几个的关系是：决策二保证结果是对的（测试通过），决策三保证过程是灵活的（agent自己处理corner case），决策四更进一步通过规避一个必然出现的坑，保证处理得又快又好。&lt;/p&gt;
&lt;h2&gt;决策五：保证Prompt Self-Contained（自足）并且结果导向&lt;/h2&gt;
&lt;p&gt;第五个决策是：给AI的指令讲清楚所有的信息（不指望它读心），而且着重说acceptance criteria是什么，而不是每一个步骤怎么做。&lt;/p&gt;
&lt;p&gt;我的prompt大概是这样：&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;对于blog/content下面每一篇.md文件，从SEO的角度写一个summary域放到metadata里。你可以用sub-agent来做。先看几篇文章找到感觉，然后想一个prompt，让不同的sub-agent分别处理不同的文章。开8个agent并行处理，每个agent负责写summary并直接编辑.md文件。另外，我希望有个测试能check summary coverage，如果coverage不到100%测试就fail。你的目标就是把这个测试搞到100%让它能过。&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;注意我没有告诉它具体怎么写这个测试程序、怎么处理各种corner case。&lt;/p&gt;
&lt;p&gt;这是很多人容易搞反的地方。他们给AI写指令的时候，事无巨细地规定每一步怎么做。这其实是在把AI当程序用，浪费了Agentic AI的主观能动性。AI不是一个只会照本宣科的乙方，它有很强的判断力和执行力。我们要发挥它的主观能动性，但同时给它一个足够清晰的边界。&lt;/p&gt;
&lt;p&gt;我总结写prompt有两个原则。第一，context要给足，不要指望AI能读心。它不知道metadata结构是什么样的。这些信息要么直接给，要么要保证它自己能搞清楚（比如这里我们给了具体路径，它可以通过读文件搞清楚）。第二，从结果出发，而不是从过程出发。你告诉AI你要什么，让它自己想怎么做。除非你预测到某个环节不给具体指导它会出问题——比如前面的context window问题——否则不用讲那么细。&lt;/p&gt;
&lt;p&gt;这个决策和决策三是一体两面：决策三是说把执行交给agent，决策五是说把指令也写成适合agent的形式。&lt;/p&gt;
&lt;h2&gt;总结：AI是一种杠杆&lt;/h2&gt;
&lt;p&gt;最后说一点感受。&lt;/p&gt;
&lt;p&gt;这个任务，我用语音识别花了大概两分钟把指令讲给AI。然后AI自己折腾了45分钟：并行开8个sub-agent，处理各种边界条件，写测试，返工，跑通，commit。全程我就没再管了。这就是一种leverage。用两分钟的时间，撬动了AI 45分钟的工作量。更准确地说，用5%的时间控制了100%的工程产出。&lt;/p&gt;
&lt;p&gt;而且现在的Agentic AI能力已经足够强，可以长时间自主工作。我们不需要盯着它干活。只要讲清楚deliverable是什么、acceptance criteria是什么，就可以去干其他事了。这就带来了一种新的可能：scalable agentic workflow。比如我们用两分钟撬动一个Agent A，让它忙45分钟，然后在这段时间里再去指挥Agent B、C、D……同时启动多个AI并行推进。这样脑力负担确实会很高，但这是在单Agentic workflow的基础上，再进一步实现10倍生产力的切实可行的途径。&lt;/p&gt;
&lt;p&gt;说完了10倍生产力的一面，这个项目的另一面是：如果有用AI的意识，但方法不对——在ChatGPT里讨论、没有测试机制、让一个AI包办所有——这些决策做错了，我们可能要折腾几个小时才能做完，甚至鬼打墙做不出来。同一个任务，甚至同一个LLM，会用和不会用、决策质量的高低，就是从容游刃有余与吃力不讨好、甚至比人工做更慢之间的差别。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"></category><category term="Chinese"></category><category term="Agentic AI"></category></entry><entry><title>Key Decisions for Agentic Workflows: A Simple Case Study</title><link href="https://yage.ai/ai-key-decisions-en.html" rel="alternate"></link><published>2026-02-20T18:00:00-08:00</published><updated>2026-02-20T18:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-20:/ai-key-decisions-en.html</id><summary type="html">&lt;p&gt;A real-world case study of directing AI to add SEO summaries to 300 articles in two minutes, breaking down five key decisions: choosing the right execution environment, building tests before work, letting agents handle corner cases, divide and conquer, and outcome-oriented prompt writing.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Today I used AI to complete a small task. This case feels particularly suitable for introducing AI's practical principles, so I wrote this article to share it.&lt;/p&gt;
&lt;p&gt;The task itself was to add a summary line to every article in this blog, which helps search engines understand the website's content and improve its ranking (SEO). This task looks simple, but it has many pitfalls—one careless move and you fall into the trap of AI getting stuck in loops, being unreliable, or being cumbersome to use. Below I'll mainly share the five important decisions I made during this process to make the entire workflow stable and reliable.&lt;/p&gt;
&lt;h2&gt;Decision 1: Use a Local Coding Agent, Not ChatGPT&lt;/h2&gt;
&lt;p&gt;The first decision I made was to use Cursor/OpenCode as the platform for discussion, not ChatGPT. This isn't obvious, because the project started with me wanting to do SEO for this website. Intuitively, this seems like a chat-type task better suited for ChatGPT. But I still insisted on using OpenCode. The fundamental reason is friction.&lt;/p&gt;
&lt;p&gt;Specifically, friction exists in two aspects. First is the friction of context transfer. With ChatGPT, I need to copy and paste my blog's content or even code to it, or have it write code to fetch these articles. But in OpenCode, I just use @ to specify the folder where my blog is located—much less friction.&lt;/p&gt;
&lt;p&gt;Another aspect is the friction of implementation. Say we reach a conclusion through chatting in ChatGPT: this website needs Summary metadata. To implement that idea, I'd have to copy several rounds of chat history between me and ChatGPT into Cursor/OpenCode, then call another AI to modify the article content. In contrast, if the discussion happens in OpenCode from the beginning, it can be implemented immediately after the discussion ends.&lt;/p&gt;
&lt;p&gt;So I made this first decision: for almost all tasks, abandon chat-based AI environments and choose agentic environments that can execute. Why does this decision come first? Because it's the difference between the project happening and not happening at all. When friction is high, we stop bothering: the project consumes time and delivers zero, a pure waste. Only when friction is low enough for the project to keep moving does it make sense to discuss specific methods and techniques.&lt;/p&gt;
&lt;h2&gt;Decision 2: Before Starting, Define Success and Provide Tests&lt;/h2&gt;
&lt;p&gt;The second decision I made was: before letting AI generate any summaries, have it write a test first. This test does something very simple—check all .md files to see if they have a summary field. If not 100% of files have this field, it fails, and prints which files have problems.&lt;/p&gt;
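&lt;p&gt;Here is a minimal sketch of what such a coverage test can look like (assuming metadata sits at the top of each .md file as "field: value" lines before the first blank line, including a summary field; adjust to your actual blog format):&lt;/p&gt;

```python
# Minimal summary-coverage check.
# Assumption: metadata sits at the top of each .md file as "field: value"
# lines before the first blank line, including a summary field (case-insensitive).
import tempfile
from pathlib import Path

def find_missing_summaries(content_dir):
    # Return the paths of .md files that lack a summary field.
    missing = []
    for md in sorted(Path(content_dir).rglob("*.md")):
        head = md.read_text(encoding="utf-8").split("\n\n", 1)[0]
        has_summary = any(line.lower().startswith("summary:")
                          for line in head.splitlines())
        if not has_summary:
            missing.append(str(md))
    return missing

# Demo with temporary files: a.md has a summary, b.md does not
with tempfile.TemporaryDirectory() as d:
    Path(d, "a.md").write_text("Title: A\nSummary: present\n\nbody", encoding="utf-8")
    Path(d, "b.md").write_text("Title: B\n\nbody", encoding="utf-8")
    print(find_missing_summaries(d))  # b.md is reported as missing
```

&lt;p&gt;Run it after each round: an empty list means 100% coverage; a non-empty list is the exact work list for the next round.&lt;/p&gt;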
&lt;p&gt;Why write the test first? Because without it, when AI says it's done, I have no way of knowing whether it actually is. I can spot-check a few articles, but with over 300 of them, spot-checking can't cover everything. You end up in a situation where neither I nor the AI knows, and we're both engaged in wishful thinking.&lt;/p&gt;
&lt;p&gt;With the test, it's different. After AI finishes a round, if the test fails, it can see for itself that, say, 20 articles are still uncovered, and it goes back to those articles. When the test passes, the task is 100% complete. No manual spot-checking, no guessing: everything is fully deterministic.&lt;/p&gt;
&lt;p&gt;This is the &lt;a href="/agentic-ai-crisis-en.html"&gt;feedback loop&lt;/a&gt; we've been emphasizing. Many people fall into a cycle with AI: nudge it, it moves once, the result is wrong, nudge it again. They conclude that AI is hard to use, but the root cause is the absence of a feedback mechanism. AI doesn't know what "done" means, and you don't know how far along AI actually is. This is the core problem to solve first, and a deterministic test is a very effective solution. In fact, once this kind of test is in place, the next three decisions are just icing on the cake.&lt;/p&gt;
&lt;p&gt;So before starting any task, I ask myself: Do I/AI have a deterministic way to judge whether the task is complete? If not, build this mechanism first.&lt;/p&gt;
&lt;h2&gt;Decision 3: Let the Agent Do It, Instead of Writing Programs to Call APIs&lt;/h2&gt;
&lt;p&gt;The third decision was: I didn't write a program that calls an LLM API to generate summaries; instead, I let the coding agent do it itself.&lt;/p&gt;
&lt;p&gt;More detailed reasons are explained in &lt;a href="/result-certainty-en.html"&gt;this article&lt;/a&gt;. Having AI write summaries sounds like a simple API call, but look closer and there are many corner cases: some articles already have summaries and shouldn't get duplicates, some metadata formats are inconsistent, some fields sit in the wrong position. If you write a program to handle all of this, the code becomes very complex, debugging costs are high, and progress is slow. Most of the effort ends up going into tuning how these details are handled.&lt;/p&gt;
&lt;p&gt;Another approach is to use natural language to directly assign tasks to Cursor/OpenCode: "Go look at XX.md and make sure it has an SEO-oriented summary metadata field." At this point, the entity completing the task is not a mechanical program, but an Agent with real intelligence and adaptability. It handles situations on its own—skipping if summary exists, adjusting if format is wrong, judging by itself when encountering special cases.&lt;/p&gt;
&lt;p&gt;This is the difference between using AI as an agent and using AI as a tool. In the API-calling pattern, you write the program and the AI is one component. That pattern has high certainty but low flexibility, and is actually slower when things get complicated. With Agentic AI, certainty moves from the process to the outcome: you only need to state clearly what result you want, and the AI figures out the rest with its own initiative and judgment.&lt;/p&gt;
&lt;p&gt;So in my workflow, calling APIs is the last resort. Whatever can be handed to agents, I hand to agents.&lt;/p&gt;
&lt;h2&gt;Decision 4: Use Divide and Conquer to Handle Cognitive Saturation&lt;/h2&gt;
&lt;p&gt;The fourth decision was: I didn't assign one agent the task of handling 300 articles all at once, but had it open 8 sub-agents, distribute tasks, and process in parallel.&lt;/p&gt;
&lt;p&gt;The reason relates to context window saturation. If one agent processes 300 articles at once, it might be okay at first, but after reading a dozen articles, the context window &lt;a href="/wide-research-en.html"&gt;gets filled up&lt;/a&gt;, and later it starts slacking off, skipping articles, or forgetting pitfalls encountered earlier. This is similar to humans—when cognitive load is high, we become forgetful or start cutting corners.&lt;/p&gt;
&lt;p&gt;Another reason is that sub-agents are a natively supported feature of coding agents. I don't need to write concurrency logic, task distribution, or result aggregation myself. This plumbing work is all outsourced. I just need to describe the workflow in a sentence or two.&lt;/p&gt;
&lt;p&gt;Many people never notice this problem. They don't think about AI's weaknesses or anticipate the pitfalls; they just assign tasks in the most intuitive way. But just as managing people requires knowing their strengths and weaknesses, we need to recognize that AI's cognitive resources are sharply limited: the context window is a scarce resource that must be managed, and when the task volume is large, quality inevitably drops. So when there's a lot of work, I actively consider splitting it up rather than having one agent carry everything.&lt;/p&gt;
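&lt;p&gt;The sub-agent feature handles the plumbing natively, but the underlying divide-and-conquer pattern is easy to picture. Below is a hedged sketch in plain Python, with threads standing in for sub-agents and &lt;code&gt;process_batch&lt;/code&gt; as a placeholder for "one sub-agent summarizes its share of the articles"; none of this is OpenCode's or Cursor's actual internals.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def split_batches(items, n_workers):
    """Round-robin split so each worker gets a similar-sized batch."""
    batches = [[] for _ in range(n_workers)]
    for i, item in enumerate(items):
        batches[i % n_workers].append(item)
    return batches

def process_batch(batch):
    # Placeholder: in the real workflow, a sub-agent with a fresh
    # context window would summarize only its own slice of articles.
    return [f"summarized {name}" for name in batch]

def run_parallel(articles, n_workers=8):
    batches = split_batches(articles, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(process_batch, batches)
    # Flatten the per-worker results back into one list.
    return [item for batch in results for item in batch]
```

&lt;p&gt;The key property is that each worker sees only around 40 of the 300 articles, which is exactly what keeps any single context window from saturating.&lt;/p&gt;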
&lt;p&gt;The relationship between this decision and the previous ones: Decision 2 ensures results are correct (tests pass), Decision 3 ensures the process is flexible (agent handles corner cases itself), Decision 4 goes further by avoiding a guaranteed pitfall, ensuring processing is both fast and good.&lt;/p&gt;
&lt;h2&gt;Decision 5: Ensure Prompt Is Self-Contained and Outcome-Oriented&lt;/h2&gt;
&lt;p&gt;The fifth decision was: when giving the AI instructions, state all the necessary information explicitly (don't expect it to read minds), and emphasize the acceptance criteria rather than prescribing each step.&lt;/p&gt;
&lt;p&gt;My prompt was roughly this:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For each .md file under blog/content, write a summary field from an SEO perspective and put it in the metadata. You can use sub-agents for this. First look at a few articles to get a feel for them, then design a prompt and have different sub-agents process different articles. Open 8 agents in parallel, each responsible for writing summaries and editing the .md files directly. Also, I want a test that checks summary coverage: if coverage is below 100%, the test fails. Your goal is to get this test to 100% so it passes.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Notice I didn't tell it specifically how to write this test program or how to handle various corner cases.&lt;/p&gt;
&lt;p&gt;This is where many people get it backwards. When writing instructions for AI, they specify every step in detail. That treats the AI as a program and wastes an agent's initiative. AI is not a yes-man that only follows instructions; it has strong judgment and execution capabilities. We should leverage that initiative while giving it a clear enough boundary.&lt;/p&gt;
&lt;p&gt;I'll summarize two principles for writing prompts. First, give enough context; don't expect the AI to read minds. It doesn't know what your metadata structure looks like, so either state that information directly or make sure the AI can discover it on its own (here, we gave a concrete path, and it can work out the rest by reading files). Second, start from outcomes, not processes. Tell the AI what you want and let it figure out how. Unless you can predict that leaving some aspect unguided will cause problems, as with the context window issue earlier, there's no need to spell things out in detail.&lt;/p&gt;
&lt;p&gt;This decision and Decision 3 are two sides of the same coin: Decision 3 says hand execution to agents, Decision 5 says write instructions in a form suitable for agents.&lt;/p&gt;
&lt;h2&gt;Summary: AI Is Leverage&lt;/h2&gt;
&lt;p&gt;Finally, some thoughts on my experience.&lt;/p&gt;
&lt;p&gt;This task took me about two minutes to dictate instructions to AI using voice recognition. Then AI worked on it for 45 minutes: opening 8 sub-agents in parallel, handling various edge cases, writing tests, reworking, getting tests to pass, committing. I didn't manage it at all during this process. This is leverage. Using two minutes of time to leverage 45 minutes of AI work. More precisely, using 5% of time to control 100% of engineering output.&lt;/p&gt;
&lt;p&gt;And current Agentic AI capabilities are strong enough to work autonomously for long periods. We don't need to watch it work. As long as we clearly state what the deliverable is and what the acceptance criteria are, we can go do other things. This brings a new possibility: scalable agentic workflow. For example, we use two minutes to leverage Agent A, keeping it busy for 45 minutes. Then during this time we go command Agent B, C, D... simultaneously launching multiple AIs to proceed in parallel. The cognitive load is indeed high, but this is a practical path to achieve 10x productivity on top of single-agent workflow.&lt;/p&gt;
&lt;p&gt;Having talked about the 10x productivity side, the flip side of this project is: having the awareness to use AI, but using the wrong methods—discussing in ChatGPT, no testing mechanism, letting one AI handle everything. If these decisions are wrong, we might struggle for hours to finish, or even get stuck in endless loops unable to complete. The same task, even the same LLM—the difference between knowing how to use it and not knowing, the quality of decisions made, is the difference between being composed and at ease versus struggling without reward, even slower than doing it manually.&lt;/p&gt;</content><category term="Computing"></category><category term="English"></category><category term="Agentic AI"></category></entry><entry><title>OpenClaw深度分析：为什么突然就火了，以及对我们意味着什么</title><link href="https://yage.ai/openclaw.html" rel="alternate"></link><published>2026-02-14T23:00:00-08:00</published><updated>2026-02-14T23:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-14:/openclaw.html</id><summary type="html">&lt;p&gt;OpenClaw把本地Agent能力带到聊天软件而爆火，但聊天界面、统一记忆、开放Skills都带来妥协。用OpenCode加文件记忆可以搭一套更好的系统。&lt;/p&gt;</summary><content type="html">&lt;p&gt;OpenClaw在2026年1月底爆火。公众号铺天盖地都在介绍怎么配置，云服务厂商都速度上线了一键部署，生怕错过这波热度。与此同时，各种行为艺术又满天飞：ClawdBot、MoltBot、OpenClaw，一周内改了三次名；结果改名的时候账号还被抢注，被一个叫$CLAWD的代币诈骗了1600万美元。与此同时，安全漏洞也层出不穷：有12%的第三方skills含恶意代码，有不少人把控制台裸露在公网上没设密码。一时间让人感觉整个领域全是相互矛盾的噪音，无所适从：这东西到底要不要装？不装会错过什么？装了有什么风险？这到底是下一个生产力革命还是又一个两周就过气的玩具？&lt;/p&gt;
&lt;p&gt;这篇文章就想从更高层的角度抽丝剥茧：OpenClaw到底做对了什么，为什么是它火，以及这跟我们有什么关系。&lt;/p&gt;
&lt;h2&gt;为什么会火的暴论&lt;/h2&gt;
&lt;p&gt;我有一个暴论：OpenClaw火的原因，和去年这个时候DeepSeek火的原因，是高度类似的。&lt;/p&gt;
&lt;p&gt;DeepSeek流行的时候，当时国内大家用的AI主要是纯聊天，没有搜索功能也经常信口瞎编。ChatGPT和Claude虽然有了思考和搜索功能，智能强很多，但国内用不了。DeepSeek引入了推理功能和搜索功能以后，第一次让大家体验到了会搜索懂思考的AI，带来了一种震撼：哇，AI还能这么有用，就爆火了。换言之，这个火不是因为技术上比竞争对手更好，事实上DeepSeek在纯模型能力上并没有碾压同时代的GPT-4o或者Claude 3.5。而是因为把一小撮人享受/习惯的事情，一下子推广到另一群更大的用户群面前，这才火起来。&lt;/p&gt;
&lt;p&gt;OpenClaw也是一样。2026年初Agentic AI领域其实有一个断层：ChatGPT这种产品虽然流行，但相比Cursor/Claude Code/Codex这种有本地权限的编程Agentic AI，整体能力还是落后了至少一代（具体为什么后面有解释）。但Cursor这种工具非常小众，基本上只有程序员在用。大家用的还是ChatGPT这种消费级产品，就觉得AI这两年没啥进步，能力很有限。然后OpenClaw第一次把Cursor这种能本地编程的Agent和WhatsApp/Slack/飞书这种流行通信软件接起来了，让非技术人员这种更广大的用户群第一次接触到了能读写文件，能执行命令，有记忆能持续迭代的Agentic AI，就爆火了。换言之，这个火不是说OpenClaw在技术上做到了什么新的事情，而是因为把一小撮人享受/习惯的事情，一下子推广到另一群更大的非技术用户群面前，这才火起来。&lt;/p&gt;
&lt;p&gt;但我说这些不是为了得出结论说OpenClaw、DeepSeek是花架子，没必要学。恰恰相反，DeepSeek从历史的角度提供了很多启发。比如DeepSeek火了以后，真正从中受益的是哪些人？我的观察是，有没有跟风第一时间玩上DeepSeek本身并不重要。很多人玩了一段时间就退烧了。真正理解了DeepSeek为什么火，把搜索和推理这两个关键因素整合到了自己工作流里的人，才是真正受益的人。类似的，OpenClaw火了以后，我们确实可以去跟风安装使用、体验一下，但这件事情本身并不会让我们一下就脱胎换骨生产力倍增了。因为这种现象级产品能爆火的重要前提是它是面向最广泛的用户设计的，因此设计决策上有很多妥协，直接用往往效率并不是最优。更关键的是要去理解它背后的设计哲学，分析它爆火的原因，从中吸取经验教训，改进自己的工作流。&lt;/p&gt;
&lt;p&gt;毕竟，工具会过气，对工具本质的理解不会。把可迁移的认知抽出来，融入自己的工作流，这才是内行的做法。&lt;/p&gt;
&lt;h2&gt;聊天界面：流行的基础，也是天花板&lt;/h2&gt;
&lt;p&gt;在具体分析OpenClaw的牛逼之处之前，我想先带大家看一个具体的例子，来解释“OpenClaw是面向最广泛的用户设计的”这句话到底是什么意思，以及有什么影响。&lt;/p&gt;
&lt;p&gt;前面我们提到OpenClaw火起来非常关键的一点是，它选用了大家天天都用的聊天软件作为交互入口，而不是像Cursor一样让你在电脑上多装一个软件。这样可以复用现有的使用习惯和渠道，让用这个工具的心智负担特别低。你没事反正都要用Slack/飞书，正好就看到了OpenClaw就会想着用用。另一方面，因为大家本身就非常熟悉这些软件的使用，所以它把学习成本也几乎压到了零。不需要装IDE，不需要学编程的术语概念，拿起手机就能用，这是它能出圈的基础。&lt;/p&gt;
&lt;p&gt;但如果你用过Cursor这种Agentic AI编程软件的话，就会发现Slack这种聊天窗口对AI来说是个相当受限的交互方式。&lt;/p&gt;
&lt;p&gt;第一是它要求对话是线性的。像Slack和微信这样的聊天窗口主要就是一条条消息往下排。但是深度的知识工作往往不是线性的。比如你需要引用另外一个thread的内容，需要把两个方向的探索merge在一起，需要在某个会话中fork出去。这些在桌面环境里比如Cursor和OpenCode里面都有专门的UI可以实现，但是在聊天窗口里面做就特别别扭。&lt;/p&gt;
&lt;p&gt;第二个问题是信息密度。如果只是做玩具性质的调研和开发，聊天窗口是没有问题的。但凡要做更复杂一点的分析和思考，它的信息密度就捉襟见肘了。比如图文混排的分析报告、复杂的表格、带格式的长文，这些在聊天里面看还都蛮痛苦的。同时不同平台对Markdown的支持也参差不齐，体验很不稳定。&lt;/p&gt;
&lt;p&gt;第三个问题出在过程的可观测性上。尤其是对要分好几步才能完成的任务，我把执行权交给AI以后，很自然地会想关心它到底在干啥。比如它是在稳步推进，还是在钻牛角尖鬼打墙？它调用了什么工具，改了哪些文件？这些在Cursor等等工具里会有自然的呈现，但是聊天窗口我们只能看见一条“对方正在打字”或者一个emoji表示正在处理。尤其是比较复杂的任务，OpenClaw需要等蛮久才能等到一条消息告诉我们搞定了还是中间挂了。&lt;/p&gt;
&lt;p&gt;但是我说这么多不是想说OpenClaw设计不好，而是想说这里面有个很明显的妥协（trade-off）。你要想把工具做得容易上手、面向最大的用户群，就必须用聊天工具这些人人都已经在用的工具作为载体。但这同时立刻又带来了对话形式、信息密度等等弊端。反之亦然。在这个从“易用但是拧巴”到“原生但是小众”的连续的trade-off空间里，OpenClaw选择了极致的易用性。这是它能爆火的基础。但我们也要清醒地认识到这种设计决策所带来的限制。在融合进自己工作流的时候，不是无脑地采用OpenClaw的所有设计，而是应该因地制宜，根据自己的需求来在这个trade-off轴线上找到属于自己的甜点区。&lt;/p&gt;
&lt;p&gt;理解了这个trade-off，后面的分析就容易理解了。&lt;/p&gt;
&lt;h2&gt;界面之外的流行要素&lt;/h2&gt;
&lt;p&gt;聊天界面是OpenClaw流行的基础，但只是最浅显的一点。真正让用户觉得这个AI真的智能，好用，懂我的，是它背后的三个设计决策。&lt;/p&gt;
&lt;p&gt;第一个是统一的入口和上下文。对比一下Cursor就很清楚。在Cursor里每个项目的上下文是隔离的——打开项目A，AI只知道项目A的事；切到项目B，之前关于项目A的对话就全没了。Claude Code、OpenCode也一样，每次启动都绑定一个工作目录。但OpenClaw则完全相反。它默认把所有对话的上下文混在一个池子里。你上午在Telegram里让它帮你整理邮件，下午在Slack里让它写个报告，晚上在WhatsApp里让它安排明天的日程——它全都记得。给人的感觉就是它特别聪明，好像真的认识你。&lt;/p&gt;
&lt;p&gt;但光把上下文混在一起是没用的，因为上下文窗口很快就会满了。这就牵扯到了它的第二个关键设计，持久化记忆。OpenClaw对记忆的处理非常巧妙，很值得学习。从大的原理上，它&lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus"&gt;和Manus一样&lt;/a&gt;用的是基于文件的记忆系统。比如它维护了一个SOUL.md，定义AI的核心人格和行为准则；USER.md保存了对用户的画像，MEMORY.md存长期记忆，再加上每日的原始日志等等。&lt;/p&gt;
&lt;p&gt;这里面比较巧妙的是它有个自我维护机制：AI每隔一段时间（heartbeat）会自动review最近的原始日志，把有价值的信息提炼到MEMORY.md里，顺便清理过时的条目。整个过程不需要用户干预。这个自我维护机制就把记忆给分层了，原始日志是短期记忆，每天的MEMORY.md是中期记忆，提炼出来的个性和喜好是长期记忆。对用户来说，体验一下子就从“每次重开都要重新交代一遍”变成了“它好像在成长”，这个感知差异是非常大的。&lt;/p&gt;
&lt;p&gt;第三个设计是丰富的Skills。这个意义要远超节省那么一点用户的时间。工具数量带来的好处&lt;a href="/manus.html"&gt;不是线性的&lt;/a&gt;——6个工具比4个工具的能力提升，远大于4个相对2个。这是因为工具之间可以组合。接Slack能管下达指令，状态汇报，接图像生成能画图，接PPT服务能出稿，接deep research能调研。这些凑在一起，就可以组合进化出很多完整的业务能力和应用场景。&lt;/p&gt;
&lt;p&gt;这三个设计之间也不是简单的加法，而是互相促进的。&lt;/p&gt;
&lt;p&gt;记忆加上统一的上下文池，会带来数据复利。因为有持久化记忆，对话可以跨会话积累；因为有统一入口，所有来源的数据汇进同一个记忆池。你在Slack里讨论的工作内容、在Telegram里安排的日程、在WhatsApp里的个人对话，全部混在一起，形成了对你越来越完整的理解，以后完成任务也会越来越贴心。&lt;/p&gt;
&lt;p&gt;记忆加上skills，带来了自我进化的能力。今天学到的用法明天还在，能力会累积；AI自己能写新的skill并且记住它的存在和用法，这就进入了正循环。这里面特别值得一提的是coding能力。因为OpenClaw自己能写代码，所以遇到没有现成skill可用的时候，它就可以当场造一个。这个新skill会被保存下来，下次遇到类似场景直接复用。这就形成了自我进化的闭环。&lt;/p&gt;
&lt;p&gt;而这些能力和界面的易用性加在一起，又带来了使用频率。入口越顺滑，调用越频繁，飞轮越转越快，能力越来越强。&lt;/p&gt;
&lt;p&gt;总之，OpenClaw是一个相当厉害的产品。它的各种决策，不论是技术的（入口、记忆、工具）还是非技术的（界面），都在为同一个飞轮服务，让普通人第一次摸到了Agentic AI的完整形态。&lt;/p&gt;
&lt;h2&gt;限制和trade-off&lt;/h2&gt;
&lt;p&gt;前面说了它为什么牛，下面我要开始吐槽了。但我想先解释一下，下面介绍的这些限制不是说OpenClaw疏忽了没做好，而是前面说的那个trade-off的直接后果——为了爆款好用必须付出的代价。&lt;/p&gt;
&lt;p&gt;界面的限制前面已经说过了：线性、低信息密度、低可观测性。在深度使用时这些很快会成为瓶颈，这里不再赘述。&lt;/p&gt;
&lt;p&gt;更深层的问题在记忆上。OpenClaw的记忆系统对小白很友好。你不用管，它自己就会打理和进化。但对想把知识沉淀成资产的人来说，这反而是一个障碍。&lt;/p&gt;
&lt;p&gt;举个栗子，比如我们做完一次调研，产出了一份5000字的长文或者一份PRD。在Cursor/文件系统里它就是一个文件：&lt;code&gt;docs/research.md&lt;/code&gt;，想引用就@，想升级就开新版本，想对比就diff。但在OpenClaw里，这份东西像是人类记忆一样，说不定什么时候就会被自动摘要、自动重写，甚至整个被删除了（遗忘），整个过程完全不可控。你很难跟它说清楚：以后就以这份文档为准，遇到相关问题必须引用它，不要给我压缩成三行。总之就是，知识没办法显式管理。&lt;/p&gt;
&lt;p&gt;更让人头疼的是整个更新过程也是一个黑盒。MEMORY.md里存什么、怎么组织、什么时候清理，主要是AI在heartbeat期间自动做的。你看到的是结果，很难看到原因：它这次改了哪些条目，为什么删掉这一条，为什么把两个不相关的东西合并在一起。出了问题也很难定位根源，因而很难改进。&lt;/p&gt;
&lt;p&gt;OpenClaw记忆系统的设计带来的另一个问题是跨场景的信息干扰。统一记忆当然带来懂我的感觉，但也意味着信息很容易跨项目污染：A项目的偏好、甚至某个临时决定，可能会莫名其妙影响到B项目。对小白来说它好像什么都记得，但对真的想干活的进阶用户来说更像是“我去怎么又被它带偏了”。&lt;/p&gt;
&lt;p&gt;Skills的安全隐患又是另一类问题。ClawHub上的上千个技能中，安全审计发现有上百个包含恶意代码——加密货币盗窃、反向shell后门、凭证窃取都有。Simon Willison提过一个&lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;致命三角&lt;/a&gt;的概念：一个AI系统同时具备访问私有数据、暴露于不可信环境、能够对外通信这三个能力时，风险是指数级放大的。OpenClaw三个全中🤡。这就形成了一个奇特的悖论。你要想用得爽，就必须给它很多工具和权限。但这又会带来安全问题，所以就要把权限收得很紧。但权限收紧了就又变成类似Manus那样的云端Agent服务了，没了本地Agent的爽。安全和好用，似乎成了一对矛盾。&lt;/p&gt;
&lt;h2&gt;So What?&lt;/h2&gt;
&lt;p&gt;讲到这里，自然会有人问：分析了一堆，然后呢？这跟我有什么关系呢？&lt;/p&gt;
&lt;p&gt;回答是：可以用这些认知，在已有的工具上搭一套比OpenClaw更顺手的东西。我自己就是这么干的，效果比直接用OpenClaw好很多。下面讲几个关键决策。&lt;/p&gt;
&lt;h3&gt;复用Agentic Loop，而不是自己造&lt;/h3&gt;
&lt;p&gt;我们做的第一个决策，也是最重要的一个，是不自己从头实现一套Agentic AI系统，而是复用OpenCode这样的开源CLI编程工具作为基础。&lt;/p&gt;
&lt;p&gt;这个决策背后有一个更深层的判断。做一个能用的Agentic Loop——也就是调API、解析工具调用、执行工具、把结果返回给AI、请求下一次回答这个循环——说起来简单，但要做到能支撑真实使用的水平，有很多细节：文件系统的读写，文件内容的新增删除替换，沙箱环境，权限管理……每个都是坑。这些东西写起来繁杂、充满陷阱，而且和我们最终想创造的价值没有多少关系。&lt;a href="https://yage.ai/ai-builder-space.html"&gt;我之前的一篇文章&lt;/a&gt;里详细讨论过这个问题——核心观点是，Agentic Loop是体力活，应该外包；真正值得花精力的是Agentic Architecture，也就是怎么把业务逻辑注入AI系统让它直接创造价值。&lt;/p&gt;
&lt;p&gt;而OpenCode、Claude Code这类工具，恰恰就是一个特别好的外包。它们已经把Agentic Loop做得非常成熟了——能读写文件、能跑命令、能持续迭代，而且还在飞速进化中。用它们做基石，等于是白嫖了整个agentic编程工具链，可以把自己的开发成本降到最低。而且选OpenCode还有一些额外的好处：它完全开源可以魔改，支持并行的subagent（Cursor和Codex到现在都还没有），还支持多种coding plan——比如我自己用的是GLM的coding plan，也可以直接用OpenAI的Codex plan，不用像直接调API那么烧钱。&lt;/p&gt;
&lt;h3&gt;文件即记忆：继承和发展OpenClaw的哲学&lt;/h3&gt;
&lt;p&gt;第二个决策是在记忆体系上。OpenCode/Claude Code这类工具天生就有磁盘即记忆的思想——毕竟它们作为编程工具处理的基础单元就是文件。当我们又有基于磁盘的记忆，又有对文件直接的操纵权和透明度的时候，就解决了前面分析中OpenClaw记忆系统的问题。想沉淀资产就写文件，想强制AI遵守某些规则就写AGENTS.md，想管理记忆结构就直接编辑Markdown。前面说的那些知识没法显式管理、更新过程是黑盒的问题，用OpenCode的细粒度控制和文件系统天然就解决了。&lt;/p&gt;
&lt;p&gt;但光有文件系统还不够，我们还把OpenClaw那套persona自我进化的机制移植了过来。具体来说，我们把记忆分成了两层：project-level的记忆（每个项目自己的上下文、决策记录、技术方案）和persona-level的记忆（用户画像、行为偏好、沟通风格）。然后在AGENTS.md里加入persona维护的workflow，让AI在session结束时自动review对话、更新MEMORY.md和USER.md。同样的自我进化，但跑在完全可控的文件系统上，还能用Git做版本管理。&lt;/p&gt;
&lt;p&gt;至于统一上下文的问题，我们用了一个很简单粗暴的方案：Mono Repo。把不同项目放在同一个repo的不同文件夹下，AI天然就可以跨项目访问所有上下文。想隔离就隔离，想共享就共享，想merge两个方向的探索就直接@，想fork出去就复制文件——全都是文件系统和OpenCode的原生操作，比OpenClaw在聊天窗口里拧巴地做这些事情自然太多了。&lt;/p&gt;
&lt;h3&gt;Skills和安全&lt;/h3&gt;
&lt;p&gt;Skills方面，OpenCode生态有大量MCP server和Skills可以接入——日历、邮件、浏览器、搜索等等——功能覆盖和ClawHub大差不差。安全性上，我们的做法是不直接安装第三方skill，而是让AI先审查源码、理解逻辑，然后重写一个干净版本。在AI辅助编程的今天这个过程通常只要几分钟，但可以极大降低供应链攻击的风险。&lt;/p&gt;
&lt;h3&gt;最后一公里：移动端&lt;/h3&gt;
&lt;p&gt;前面三个决策解决了底座、记忆和工具的问题，但还差一个关键的东西：入口。OpenClaw火的一个重要原因是你不用坐在电脑前面。但现有的编程工具在这方面确实拉胯——VSCode有个Code Server可以远程访问，但对iPad非常不友好；OpenCode有个Web Client，但说实话只是解决了有和无的问题，非常难用；Cursor的Web Client高度绑定Github；Claude Code则完全没有Web Client。&lt;/p&gt;
&lt;p&gt;为了解决这个问题，我们做了一个原生的iOS App作为OpenCode的远程客户端。注意这个App不是把聊天窗口搬到手机上——它是一个真正为移动端设计的工作界面：能看到AI的实时工作进度，每一步工具调用、每一个文件操作；能切换模型做A/B测试；能浏览Markdown文件和审查更改；支持语音输入；支持基于HTTPS或者SSH隧道的公网访问；iPad上还有三栏分屏。&lt;/p&gt;
&lt;p&gt;这个客户端已经在github上&lt;a href="https://github.com/grapeot/opencode_ios_client"&gt;开源&lt;/a&gt;了。欢迎大家也来体验。未来可能会加入TestFlight。效果是吃灰很久的iPad重新变成了生产力工具，在沙发上指挥AI干活的体验比OpenClaw的聊天窗口爽得多。外出吃饭的时候接到oncall，也可以直接给AI小弟布置任务，当场就搞清楚了原因。而且全程都有对AI完全的掌控，知道它不会出幺蛾子，也不会把你的信息po到Moltbook上。&lt;/p&gt;
&lt;p&gt;&lt;img alt="iPad客户端" src="/images/opencode_ios_client.jpeg"&gt;&lt;/p&gt;
&lt;h2&gt;总结&lt;/h2&gt;
&lt;p&gt;回到开头的暴论。OpenClaw和DeepSeek的火，本质上是同一件事：把一小撮人已经在享受的能力，第一次推到了更广泛的人群面前。DeepSeek让大家第一次用上了会搜索懂推理的AI，OpenClaw让大家第一次摸到了能读写文件、有记忆、会自我进化的Agentic AI。&lt;/p&gt;
&lt;p&gt;但也正因为要面向最广大的普通用户，这类产品必然在设计上做大量妥协。DeepSeek如此，OpenClaw也如此。聊天界面带来了易用性但牺牲了表达力，统一记忆带来了懂我的感觉但牺牲了可控性，开放的Skills生态带来了能力但引入了安全风险。&lt;/p&gt;
&lt;p&gt;对于已经在用Cursor/Claude Code/OpenCode的人来说，更值得做的不是无脑跟风装一个OpenClaw，而是理解它为什么火——统一入口、持久化记忆、工具生态，以及它们之间的飞轮——然后把这些认知融入自己已有的工具链里，扬长避短。我们自己就是这么干的，效果确实比直接用OpenClaw好很多。&lt;/p&gt;
&lt;p&gt;毕竟，工具会过气，对工具本质的理解不会。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"></category><category term="Chinese"></category><category term="Agentic AI"></category><category term="Review"></category></entry><entry><title>OpenClaw Deep Dive: Why It Went Viral and What It Means for You</title><link href="https://yage.ai/openclaw-en.html" rel="alternate"></link><published>2026-02-14T22:00:00-08:00</published><updated>2026-02-14T22:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-14:/openclaw-en.html</id><summary type="html">&lt;p&gt;Analyzing why OpenClaw democratized Agentic AI through chat interfaces, its trade-offs in memory and security, and how to build a better system using OpenCode with file-based memory.&lt;/p&gt;</summary><content type="html">&lt;p&gt;OpenClaw went absolutely viral at the end of January 2026. Social media was flooded with configuration guides, and cloud service providers rushed to launch one-click deployments, terrified of missing the hype train. Meanwhile, it felt like performance art was happening everywhere: the project changed its name three times in one week—from ClawdBot to MoltBot to OpenClaw. In the process of rebranding, their handle was even hijacked by a token called $CLAWD that scammed people out of $16 million. Security vulnerabilities were popping up left and right, too: 12% of third-party skills contained malicious code, and plenty of people exposed their consoles to the public internet without even setting a password. For a while, the whole space was just a mess of contradictory noise, leaving everyone confused: Should I install this thing? What am I missing if I don't? What are the risks? Is this the next productivity revolution, or just another toy that will be forgotten in two weeks?&lt;/p&gt;
&lt;p&gt;In this post, I want to peel back the layers from a higher-level perspective: What did OpenClaw actually get right? Why did it explode? And most importantly—what does this have to do with you?&lt;/p&gt;
&lt;h2&gt;Why It Went Viral: My Hot Take&lt;/h2&gt;
&lt;p&gt;I have a bit of a provocative theory: the reason OpenClaw blew up is almost identical to why DeepSeek went viral exactly one year ago.&lt;/p&gt;
&lt;p&gt;When DeepSeek first became popular, most AI tools in China were limited to pure chat—no search capabilities, and they hallucinated constantly. While ChatGPT and Claude had reasoning and search features that made them much smarter, they weren't easily accessible in the country. When DeepSeek introduced reasoning and search, it was the first time many people experienced what a thinking, searching AI could do. It was a massive shock to the system: "Wow, AI can actually be THIS useful!" and then—boom—it went viral. In other words, its popularity wasn't necessarily because it was technically superior to its competitors (DeepSeek didn't exactly crush GPT-4o or Claude 3.5 in pure model capability at the time). It went viral because it took something a small circle of early adopters were already enjoying and habituated to, and pushed it right in front of a much larger audience.&lt;/p&gt;
&lt;p&gt;OpenClaw is the exact same story. In early 2026, there was a massive gap in the field of Agentic AI. While products like ChatGPT were popular, they were at least a generation behind Agentic AI tools with local permissions like Cursor, Claude Code, or Codex (I’ll explain why later). But tools like Cursor are niche—mostly used by programmers. The general public was still stuck with consumer-grade chat interfaces, feeling like AI hadn't progressed much in the last two years. Then OpenClaw came along and, for the first time, connected those local programming agents with the messaging apps everyone uses every day—WhatsApp, Slack, Lark. It gave non-technical users their first taste of Agentic AI that can read and write files, execute commands, maintain memory, and iterate continuously. It went viral not because it did something brand new technically, but because it democratized an experience previously reserved for a tiny group of techies.&lt;/p&gt;
&lt;p&gt;Now, I’m not saying OpenClaw or DeepSeek are just "showy" tools you shouldn't bother with. Quite the opposite. DeepSeek provided a lot of historical inspiration. For example, after the hype died down, who actually benefited? In my observation, it wasn't the people who just jumped on the bandwagon to play with it for a few days. It was the people who understood &lt;em&gt;why&lt;/em&gt; it went viral and integrated search and reasoning into their actual workflows. Similarly, while we can go ahead and install OpenClaw and try it out, the tool itself won't magically double your productivity. Viral products are designed for the broadest possible audience, which means they involve a lot of design compromises. Using them as-is is rarely the most efficient way to work. The real value is in understanding the design philosophy behind them, analyzing why they exploded, and applying those lessons to improve your own workflow.&lt;/p&gt;
&lt;p&gt;At the end of the day, tools will come and go, but your understanding of their core essence won't. Extracting transferable insights and baking them into your own workflow—that's how the pros do it.&lt;/p&gt;
&lt;h2&gt;The Chat Interface: Both the Foundation and the Glass Ceiling&lt;/h2&gt;
&lt;p&gt;Before we dive into why OpenClaw is so powerful, I want to look at a specific example to explain what I mean when I say "OpenClaw is designed for the broadest audience," and how that impacts everything.&lt;/p&gt;
&lt;p&gt;As I mentioned earlier, a key reason OpenClaw exploded is that it chose messaging apps we use daily as its interface, rather than requiring you to install yet another piece of software like Cursor. This leverages existing habits and channels, keeping the cognitive barrier to entry incredibly low. You're already on Slack or Lark anyway, so seeing OpenClaw right there makes you want to try it out. Plus, since everyone is already familiar with these apps, the learning curve is pushed practically to zero. No IDE to install, no programming jargon to learn—just pick up your phone and start using it. That’s why it reached such a huge audience.&lt;/p&gt;
&lt;p&gt;But if you’ve ever used an Agentic AI programming tool like Cursor, you’ll quickly realize that a Slack-style chat window is actually a very restrictive way for an AI to interact.&lt;/p&gt;
&lt;p&gt;First, it forces a linear conversation. Slack and WeChat windows are basically just one message after another. But deep knowledge work is rarely linear. You might need to reference content from another thread, merge two different directions of exploration, or fork off a specific conversation. In desktop environments like Cursor or OpenCode, there are dedicated UI elements for this, but doing it in a chat window feels clunky as hell.&lt;/p&gt;
&lt;p&gt;Second, there’s the issue of information density. For toy-level research or quick development, a chat window is fine. But for any meaningful analysis or deep thinking, the information density is embarrassingly low. Trying to read formatted reports, complex tables, or long-form documents inside a chat bubble is pretty painful. Plus, different platforms have wildly inconsistent Markdown support, making the experience very unstable.&lt;/p&gt;
&lt;p&gt;The third problem is observability. Especially for multi-step tasks, once I hand over execution to the AI, I naturally want to know what it’s actually doing. Is it making steady progress, or is it spinning its wheels in a dead-end loop? Which tools did it call? Which files did it change? In Cursor and similar tools, this is presented naturally, but in a chat window, we’re stuck with a "the user is typing..." message or a single emoji. For complex tasks, you’re often left waiting a long time just to be told whether it succeeded or crashed halfway through.&lt;/p&gt;
&lt;p&gt;Now, I’m not saying these are "bad" design choices. They are clear trade-offs. If you want to make a tool that’s easy to pick up for everyone, you have to use the tools everyone is already using. But that immediately brings limitations in format and density. It’s a spectrum from "easy but clunky" to "native but niche," and OpenClaw chose extreme ease of use. That’s why it’s a hit. But we have to be clear-eyed about the limitations that decision brings. When you're integrating these tools into your own workflow, don't just mindlessly copy every design choice—find that sweet spot on the trade-off axis that works for &lt;em&gt;your&lt;/em&gt; needs.&lt;/p&gt;
&lt;p&gt;Once you understand this trade-off, the rest of the analysis becomes much clearer.&lt;/p&gt;
&lt;h2&gt;The Success Factors Beyond the Interface&lt;/h2&gt;
&lt;p&gt;The chat interface is what made OpenClaw approachable, but it’s just the surface. What actually makes users feel like this AI is genuinely intelligent, useful, and "gets" them are three core design decisions happening under the hood.&lt;/p&gt;
&lt;p&gt;The first is a unified entry point and context. If you compare it to Cursor, the difference is stark. In Cursor, project contexts are isolated—if you open Project A, the AI only knows about A. Switch to Project B, and the conversation about A is gone. Claude Code and OpenCode are the same; they bind to a specific working directory every time you launch. OpenClaw does the exact opposite. By default, it mixes all your conversation contexts into one big pool. You can ask it to organize your emails in Telegram in the morning, write a report in Slack in the afternoon, and schedule your calendar in WhatsApp in the evening—and it remembers everything. It feels incredibly smart, like it actually &lt;em&gt;knows&lt;/em&gt; you.&lt;/p&gt;
&lt;p&gt;But just dumping everything into one pool isn't enough, because the context window would fill up instantly. That leads to the second key design: Persistent Memory. OpenClaw handles memory very cleverly. At a high level, it uses a file-based memory system &lt;a href="https://manus.im/blog/Context-Engineering-for-AI-Agents-Lessons-from-Building-Manus"&gt;much like Manus does&lt;/a&gt;. It maintains a &lt;code&gt;SOUL.md&lt;/code&gt; to define the AI’s core personality and behavior, a &lt;code&gt;USER.md&lt;/code&gt; for your profile, and a &lt;code&gt;MEMORY.md&lt;/code&gt; for long-term storage, all on top of the raw daily logs.&lt;/p&gt;
&lt;p&gt;The clever bit is its self-maintenance mechanism. Every so often (a "heartbeat"), the AI automatically reviews its recent raw logs, distills valuable info into &lt;code&gt;MEMORY.md&lt;/code&gt;, and cleans up outdated entries. This happens entirely in the background without user intervention. This mechanism creates a tiered memory structure: raw logs are short-term, the daily &lt;code&gt;MEMORY.md&lt;/code&gt; is medium-term, and the distilled traits/preferences are long-term. For the user, the experience shifts from "I have to explain everything every time" to "It feels like it’s growing with me." That perceived difference is huge.&lt;/p&gt;
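&lt;p&gt;To make the tiering concrete, here is a toy sketch of what one heartbeat pass could look like. This is a hypothetical illustration, not OpenClaw's actual implementation: the &lt;code&gt;NOTE:&lt;/code&gt; tagging convention and the size-budget pruning policy are invented for the example; only the &lt;code&gt;MEMORY.md&lt;/code&gt; file name comes from the description above.&lt;/p&gt;

```python
from pathlib import Path

def heartbeat(log_path, memory_path="MEMORY.md", keep=200):
    """Toy distillation pass: promote notable short-term log lines into
    long-term memory, then trim the memory file to a size budget."""
    notable = []
    log = Path(log_path)
    if log.exists():
        for line in log.read_text(encoding="utf-8").splitlines():
            # Assumed convention: the agent tags durable facts with NOTE:
            if line.startswith("NOTE:"):
                notable.append("- " + line.removeprefix("NOTE:").strip())
    memory = Path(memory_path)
    existing = memory.read_text(encoding="utf-8").splitlines() if memory.exists() else []
    # Append new distilled entries, then prune the oldest beyond the budget.
    merged = (existing + notable)[-keep:]
    memory.write_text("\n".join(merged) + "\n", encoding="utf-8")
    return len(notable)
```

&lt;p&gt;Even in this toy form the tiering is visible: the raw log is disposable short-term memory, while whatever survives distillation and pruning becomes the durable layer.&lt;/p&gt;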
&lt;p&gt;The third pillar is the rich ecosystem of Skills. This is about so much more than just saving a few minutes of your time. The benefit of adding tools &lt;a href="/manus-en.html"&gt;isn’t linear&lt;/a&gt;—the jump from 4 to 6 tools adds far more capability than the jump from 2 to 4. Why? Because tools combine. Connecting Slack handles instructions and status reports; image generation handles visuals; a PPT service handles slide decks; deep research handles investigations. When you bundle these together, you get emergent business capabilities and end-to-end applications.&lt;/p&gt;
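&lt;p&gt;One crude way to quantify that superlinearity: if new capabilities come from chaining tools in pairs, the number of distinct pairings grows quadratically with the tool count. The pairs framing is my own illustration, not a claim from OpenClaw.&lt;/p&gt;

```python
from math import comb

def pairwise_combos(n_tools):
    # Distinct two-tool pipelines, e.g. "deep research" feeding "PPT service".
    return comb(n_tools, 2)
```

&lt;p&gt;Counting only pairs already matches the claim above: going from 4 to 6 tools (6 to 15 pairs) adds more than going from 2 to 4 (1 to 6 pairs), and longer chains widen the gap further.&lt;/p&gt;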
&lt;p&gt;These three designs aren't just additive; they reinforce each other.&lt;/p&gt;
&lt;p&gt;Memory combined with a unified context pool creates compounding returns on your data. Because memory is persistent, conversations accumulate over time; because there’s a unified entry point, data from all sources flows into the same pool. Your work discussions in Slack, your scheduling in Telegram, your personal chats in WhatsApp—all of it merges to form an increasingly complete understanding of you, making every subsequent task more personalized.&lt;/p&gt;
&lt;p&gt;Memory combined with Skills brings the ability to self-evolve. Habits learned today are still there tomorrow; as the AI writes and remembers new skills, it enters a positive feedback loop. Its coding ability is particularly noteworthy here. Since OpenClaw can write its own code, if it hits a wall without an existing skill, it can just build one on the fly. That new skill is saved and ready to be reused next time. It’s a closed loop of self-evolution.&lt;/p&gt;
&lt;p&gt;And when you add all that power to the ease of use of the interface, you get high usage frequency. The smoother the entry point, the more the flywheel spins, making the AI smarter with every interaction.&lt;/p&gt;
&lt;p&gt;In short, OpenClaw is an impressive product. Every decision—technical or otherwise—serves the same flywheel, giving regular people their first real taste of what a fully realized Agentic AI can do.&lt;/p&gt;
&lt;h2&gt;Limitations and Trade-offs&lt;/h2&gt;
&lt;p&gt;I’ve spent plenty of time praising OpenClaw, so now it’s time to gripe. But let me be clear: the limitations I’m about to list aren't because the OpenClaw team was sloppy—they are the direct results of that trade-off I mentioned earlier. This is the price you pay for building a viral hit.&lt;/p&gt;
&lt;p&gt;I’ve already covered the interface: it's linear, low-density, and offers poor observability. When you move beyond casual use, these bottlenecks become apparent very quickly.&lt;/p&gt;
&lt;p&gt;The deeper issues lie in the memory system. OpenClaw’s memory is great for beginners—you don't have to manage it; it just works and evolves. But for anyone trying to turn knowledge into a long-term asset, this is actually a massive hurdle.&lt;/p&gt;
&lt;p&gt;For example, say you finish a deep dive research project and produce a 5,000-word report. In a tool like Cursor or a direct file system, that’s a file: &lt;code&gt;docs/research.md&lt;/code&gt;. You can @ reference it, version it, or diff it. In OpenClaw, that knowledge is more like human memory—at any point, it might be automatically summarized, rewritten, or even completely "forgotten" (deleted) by the background heartbeat process, and you have zero control over it. It’s hard to tell it: "This document is the absolute source of truth; reference it exactly and do not summarize it into three lines." In short, knowledge cannot be explicitly managed.&lt;/p&gt;
&lt;p&gt;Worse, the entire update process is a black box. What gets saved in &lt;code&gt;MEMORY.md&lt;/code&gt;, how it’s organized, and when it’s purged is all determined by the AI in secret. You see the result, but you rarely see the "why": What did it change this time? Why did it delete that specific note? Why did it merge those two unrelated thoughts? If something goes wrong, it’s a nightmare to debug and improve.&lt;/p&gt;
&lt;p&gt;Another issue with OpenClaw’s unified memory is cross-context interference. While unified memory makes the AI feel like it "knows" you, it also means information can easily pollute different projects. A preference from Project A, or even a one-off temporary decision, might mysteriously start influencing Project B. For a casual user, it seems like it remembers everything; for an advanced user trying to get work done, it feels more like, "Ugh, it’s going off on a tangent again."&lt;/p&gt;
&lt;p&gt;Then there are the security risks that come with Skills. Out of the thousands of skills on ClawHub, audits have found hundreds containing malicious code—from crypto theft and reverse shell backdoors to credential stealing. Simon Willison once mentioned a concept called &lt;a href="https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/"&gt;the lethal trifecta&lt;/a&gt;: when an AI system has access to private data, is exposed to untrusted environments, and can communicate externally, the risk is amplified exponentially. OpenClaw hits all three🤡. This creates a strange paradox. To get the best experience, you have to give it broad tools and permissions. But that creates security risks, so you feel forced to tighten permissions. Tighten them too much, though, and you’re back to a restrictive cloud agent like Manus, losing the magic of a local agent. Safety vs. usability remains a persistent contradiction.&lt;/p&gt;
&lt;h2&gt;So What?&lt;/h2&gt;
&lt;p&gt;At this point, you might be asking: "Okay, that was a lot of analysis—so what? How does this help me?"&lt;/p&gt;
&lt;p&gt;Here’s the answer: you can take these insights and build something for yourself that’s actually better and more tailored than OpenClaw. That’s exactly what I did, and the results have been much better than using OpenClaw directly. Let me walk you through a few key decisions I made.&lt;/p&gt;
&lt;h3&gt;Reuse the Agentic Loop, Don’t Rebuild It&lt;/h3&gt;
&lt;p&gt;The first—and most important—decision we made was to &lt;em&gt;not&lt;/em&gt; build an Agentic AI system from scratch. Instead, we reused an existing open-source CLI programming tool like OpenCode as our foundation.&lt;/p&gt;
&lt;p&gt;There’s a deeper rationale behind this. Building a functional Agentic Loop—the cycle of calling an API, parsing tool calls, executing them, returning results to the AI, and requesting the next step—sounds simple on paper. But making it robust enough for real-world use is full of pitfalls: file system I/O, partial file edits, sandbox environments, permission management... the list goes on. Building these things is tedious, risky, and doesn’t actually create much unique value for the end user. I discussed this in detail in &lt;a href="/ai-builders-space-en.html"&gt;a previous post&lt;/a&gt;—my core point was that the Agentic Loop is "grunt work" that should be outsourced. What’s actually worth your time is the &lt;em&gt;Agentic Architecture&lt;/em&gt;—how you inject business logic into the AI system to create direct value.&lt;/p&gt;
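&lt;p&gt;To make the "sounds simple on paper" point concrete, here is a deliberately minimal sketch of the loop. The model below is a scripted stub and the tool executor is a stand-in (both are assumptions for illustration, not the internals of OpenCode or any real tool); sandboxing, partial file edits, and permission management are exactly what this toy version omits.&lt;/p&gt;

```python
# Toy sketch of an Agentic Loop: the model proposes a tool call, the harness
# executes it and feeds the result back, until the model declares it is done.
# The "model" here is a scripted stub; a real agent would call an LLM API.

def fake_model(history):
    # Pretend the model first wants to run a command, then finishes.
    if not any(msg["role"] == "tool" for msg in history):
        return {"type": "tool_call", "tool": "run", "arg": "echo hello"}
    return {"type": "final", "text": "Command succeeded: " + history[-1]["content"]}

def run_tool(name, arg):
    # Minimal tool executor; a real agent would sandbox and whitelist this.
    if name == "run":
        return "hello"  # stand-in for actual subprocess output
    raise ValueError("unknown tool: " + name)

def agentic_loop(model, task, max_steps=10):
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        action = model(history)
        if action["type"] == "final":
            return action["text"]
        result = run_tool(action["tool"], action["arg"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("step budget exhausted")

print(agentic_loop(fake_model, "say hello"))  # Command succeeded: hello
```

&lt;p&gt;Even this toy version hints at why the real thing is grunt work: the step budget, the tool whitelist, and the history format are all decisions someone has to get right before the loop can run unattended.&lt;/p&gt;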
&lt;p&gt;Tools like OpenCode or Claude Code are basically perfect "outsourcing" options. They’ve already matured the Agentic Loop—they can read and write files, run commands, and iterate continuously, and they’re evolving incredibly fast. By using them as a cornerstone, you’re basically getting a free ride on the entire agentic programming toolchain, which drops your development costs to almost zero. Choosing OpenCode specifically has extra perks: it’s fully open-source (so you can hack it), it supports parallel subagents (something Cursor and Codex still don’t have), and it supports multiple coding plans. For instance, I use the GLM coding plan, but you could use the OpenAI Codex plan directly without the insane costs of raw API calls.&lt;/p&gt;
&lt;h3&gt;File as Memory: Inheriting and Evolving the OpenClaw Philosophy&lt;/h3&gt;
&lt;p&gt;The second decision was about the memory system. Tools like OpenCode or Claude Code have a natural "disk-as-memory" philosophy—after all, files are the basic unit they handle. Having disk-based memory, combined with direct ownership and transparency over those files, solves the exact issues we saw with OpenClaw. If you want to build up long-term assets, write a file. If you want to force the AI to follow certain rules, write an &lt;code&gt;AGENTS.md&lt;/code&gt;. If you want to manage your memory structure, just edit the Markdown. The problems of non-explicit management and black-box updates are naturally solved by OpenCode’s fine-grained control and the file system itself.&lt;/p&gt;
&lt;p&gt;But just having a file system isn't enough, so we also ported over OpenClaw’s "persona self-evolution" mechanism. Specifically, we split memory into two layers: project-level memory (the context, decision logs, and technical specs for a specific project) and persona-level memory (user profile, preferences, and communication style). We then added a persona maintenance workflow to &lt;code&gt;AGENTS.md&lt;/code&gt;, so the AI automatically reviews the conversation at the end of a session to update &lt;code&gt;MEMORY.md&lt;/code&gt; and &lt;code&gt;USER.md&lt;/code&gt;. You get the same self-evolution, but it runs on a fully controllable file system where you can even use Git for version control.&lt;/p&gt;
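&lt;p&gt;Our actual &lt;code&gt;AGENTS.md&lt;/code&gt; isn't reproduced here, but an illustrative excerpt of such a maintenance rule might look like the following (the wording and numbering are hypothetical):&lt;/p&gt;

```markdown
## Persona maintenance (end of session)
1. Re-read MEMORY.md and USER.md before summarizing anything.
2. Append durable decisions to the current project's decision log.
3. Update USER.md only with stable preferences, never one-off requests.
4. Do not delete entries; mark them as superseded, so Git history stays meaningful.
```

&lt;p&gt;Because these are just Markdown files under version control, every automatic update is a diff you can inspect—exactly the transparency OpenClaw's black-box memory lacks.&lt;/p&gt;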
&lt;p&gt;As for the unified context problem, we went with a brute-force but elegant solution: the Mono Repo. By putting different projects in different folders within the same repo, the AI naturally has cross-project access to all contexts. You can isolate when you want, share when you want, merge different lines of exploration, or fork things off just by copying files. These are all native operations in the file system and OpenCode, which feels infinitely more natural than trying to do them in a clunky chat window.&lt;/p&gt;
&lt;h3&gt;Skills and Security&lt;/h3&gt;
&lt;p&gt;On the Skills front, the OpenCode ecosystem has a massive array of MCP servers and skills available—calendars, email, browsers, search, you name it. The feature set is pretty much on par with ClawHub. In terms of security, our approach is to not just blindly install third-party skills. Instead, we have the AI review the source code, understand the logic, and then rewrite a "clean" version. In the age of AI-assisted coding, this only takes a few minutes, but it drastically reduces the risk of supply chain attacks.&lt;/p&gt;
&lt;h3&gt;The Last Mile: Mobile&lt;/h3&gt;
&lt;p&gt;Our first three decisions solved the foundation, memory, and tools, but one key piece was still missing: the entry point. A huge reason OpenClaw is so popular is that you don’t have to be sitting at your computer. But existing programming tools are pretty weak here—VS Code has Code Server, but it’s terrible on an iPad; OpenCode has a web client, but it’s barely functional; Cursor’s web client is tied to GitHub; and Claude Code doesn't even have one.&lt;/p&gt;
&lt;p&gt;To bridge this gap, we built a native iOS app as a remote client for OpenCode. This isn't just a chat window ported to your phone—it’s a workspace genuinely designed for mobile. You can see the AI’s real-time progress, every tool call, and every file operation. You can switch models for A/B testing, browse Markdown files, review changes, and use voice input. It supports public access via HTTPS or SSH tunnels, and the iPad version even has a three-column split view.&lt;/p&gt;
&lt;p&gt;The client is &lt;a href="https://github.com/grapeot/opencode_ios_client"&gt;open-sourced&lt;/a&gt; on GitHub. Feel free to check it out; it might even hit TestFlight soon. The result is that my dusty iPad is finally a productivity beast again. Directing an AI from the couch is a much, much better experience than using OpenClaw’s chat window. If I get an on-call notification while I'm out for dinner, I can just assign the task to my "AI intern" and have the root cause figured out before the check arrives. And the whole time, I have total control over the AI—I know it isn't going to go rogue or leak my info to Moltbook.&lt;/p&gt;
&lt;p&gt;&lt;img alt="iPad Client" src="/images/opencode_ios_client.jpeg"&gt;&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Let's go back to my "hot take" at the beginning. The viral success of both OpenClaw and DeepSeek points to the same underlying truth: it's about taking capabilities a small elite group is already enjoying and pushing them to a broader audience for the first time. DeepSeek gave people their first taste of searching, reasoning AI; OpenClaw gave them their first hands-on experience with an Agentic AI that has disk access, memory, and the power to self-evolve.&lt;/p&gt;
&lt;p&gt;But because these products are designed for the masses, they inherently involve massive design compromises. That was true for DeepSeek, and it’s true for OpenClaw. The chat interface brings ease of use but sacrifices expressiveness; unified memory makes the AI feel like it "gets" you but sacrifices control; an open skill ecosystem brings power but introduces security risks.&lt;/p&gt;
&lt;p&gt;If you’re already using tools like Cursor, Claude Code, or OpenCode, the takeaway isn't that you should mindlessly install OpenClaw. Instead, you should understand &lt;em&gt;why&lt;/em&gt; it’s a hit—the unified entry, the persistent memory, the tool ecosystem, and the flywheel connecting them—and then fold those insights into your own existing toolchain while avoiding the pitfalls. That’s what we did, and I can tell you: the results are significantly better.&lt;/p&gt;
&lt;p&gt;At the end of the day, tools will come and go, but your understanding of their core essence won't.&lt;/p&gt;</content><category term="Computing"></category><category term="English"></category><category term="Agentic AI"></category><category term="Review"></category></entry><entry><title>告别教程思维：为什么 AI 教育不应局限于内容创作，而应该引进工程基建</title><link href="https://yage.ai/ai-builder-space.html" rel="alternate"></link><published>2026-02-02T20:00:00-08:00</published><updated>2026-02-02T20:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-02:/ai-builder-space.html</id><summary type="html">&lt;p&gt;分析AI学习者的四道流失阶梯，提出用工程化平台消除配置、实验、部署等摩擦，让学员专注于核心技能练习。介绍AI Builder Space如何通过统一API、一键部署和MCP自动化实现这一目标。&lt;/p&gt;</summary><content type="html">&lt;p&gt;我们这两年做了&lt;a href="https://ai-builders.com/"&gt;四门课&lt;/a&gt;，积累了 2500+ 学员。最近大家在推送里看到了&lt;a href="https://www.superlinear.academy/c/share-your-projects/"&gt;很多学员分享的项目&lt;/a&gt;，非常精彩，我也受到了很多启发。但那些真的交付了大家都能用的产品的学员，其实是少数走到终点的人。&lt;/p&gt;
&lt;p&gt;我们一直没有机会和大家聊聊另一面：根据我们的观察和访谈，有惊人比例的学员，其实停在了中间的某一步。他们不是觉得没用或者学不会而放弃，而是因为种种原因暂停或者终止了学习。&lt;/p&gt;
&lt;p&gt;最让我们遗憾的是，这些流失往往不是发生在复杂的算法或逻辑面前，而是发生在一些极其琐碎、与核心能力无关的障碍上。&lt;/p&gt;
&lt;p&gt;面对流失，传统的教育直觉是做更多内容——你不懂配置，我写个文档；你不会部署，我录个视频。但教程越写越多，学习路径越来越长，那些琐碎的障碍还在那里。因此我们觉得，AI 时代，我们可能需要一条更本质的路线，真正地解决这个流失问题。&lt;/p&gt;
&lt;p&gt;这篇文章想跳出解决方案的讨论，先系统性地梳理一下，学 AI 的人到底卡在哪儿，然后解释我们如何用一种工程化的思路，试图从根本上消灭这些障碍。&lt;/p&gt;
&lt;h2&gt;流失阶梯：学 AI 的四个关键节点&lt;/h2&gt;
&lt;p&gt;学员热情耗尽的流失过程就好像一个阶梯。每一阶都有人因为各种原因停下来，而跨过去的人的能力会有质的提升。&lt;/p&gt;
&lt;h3&gt;第一阶：脑子：我懂了，手：不，你没有&lt;/h3&gt;
&lt;p&gt;从教学的一开始，我们就发现很多学员看完视频、读完教材，觉得自己懂了，但从来没有把 AI 真正用到自己的生活或工作里。&lt;/p&gt;
&lt;p&gt;这是非常可惜的一点。学 AI 更像学游泳或开飞机。就像没有人能通过看视频学会游泳一样，光看教材也是学不会 AI 的。知识点可以靠记忆和理解，但技能必须靠身体去试。必须在真实的场景里摸爬滚打，尤其是在真实的场景里犯错误，才能真正把它内化成自己的能力。脑子觉得懂了和手真的会用之间，有一道巨大的鸿沟。&lt;/p&gt;
&lt;p&gt;这就是为什么我们说这一阶是起点：如果一个学员从来没有把 AI 用到一件真实的事情上——哪怕是非常小的事情——那他还没有真正开始学。很多人在这一阶就停住了，他们觉得自己在学，但其实只是在看别人学来获得一种自己在努力的错觉。&lt;/p&gt;
&lt;h3&gt;第二阶：从玩具项目到真实应用&lt;/h3&gt;
&lt;p&gt;有一些学员跨过了第一阶，做完了几个教程里的小项目，建立了一点信心。但当他们想做一个真正有用的东西，或者想把它用得稍微规模大一点、自动化一点的时候，发现前面横着一堆琐事：要绑信用卡、注册各种账号、申请 API token、配置开发环境。都是体力活，做完了也没什么成就感，稍微折腾一下就放弃了。&lt;/p&gt;
&lt;p&gt;这些琐事的问题在于，它本身对学习目标几乎没有贡献，却消耗了大量的热情和时间。本来准备大干一场，结果两小时过去了还在折腾配置，代码一行没写。因此这种放弃是人之常情。&lt;/p&gt;
&lt;p&gt;同时这种挫败感是很致命的。行动力在刚刚萌芽的阶段最脆弱，最需要保护，因为它一旦熄灭就很难再点燃。结果学员刚刚建立起信心，好不容易跨过了第一阶，开始相信自己能做点什么了，又被这些琐碎的配置工作打回原形。所以这些摩擦非常可恨，从教学的角度一定要重视解决。&lt;/p&gt;
&lt;h3&gt;第三阶：从被动接收到形成自己的判断&lt;/h3&gt;
&lt;p&gt;有些学员扛过了前两阶，终于把 API 跑通了，开始用 AI 做一些事情，积累了一些经验。但这时候会出现另一个问题：他们的第一手经验没办法规模化。在自己的一亩三分地有一些观点看法，但更多的时候还是被公众号的标题党牵着走。今天这篇说 Claude 代码能力最强，明天那篇说 DeepSeek 性价比碾压，没有自己的第一手经验，只能人云亦云。&lt;/p&gt;
&lt;p&gt;这个阶段的本质障碍是：从繁杂的信息中沉淀出自己的见解。真正的学习需要自己做大量的、可规模化的实验，去积累第一手的经验。这不是多试几个模型这么简单，需要在真实的场景里反复对比、反复踩坑。同一个任务用三个模型跑一遍，记录下各自的表现；在不同的 prompt 策略之间来回切换，感受它们的差异。只有这样，才能形成自己的判断力，而不是看到一篇文章就信一篇。&lt;/p&gt;
&lt;p&gt;这一阶之所以重要，是因为它触及了学 AI 的本质：学会怎么调 API，离真的做出有用的 AI 产品还很远。最关键的是去学怎么做取舍、做判断。技术会变，模型会迭代，只有判断力是可以沉淀可以迁移的。没有第一手的经验，永远形不成自己的观点，永远是别人说什么信什么。这种状态下，没法真正用好 AI，因为每一个决策都要依赖别人（甚至是公众号）的结论。&lt;/p&gt;
&lt;h3&gt;第四阶：从本机跑起来到部署交付&lt;/h3&gt;
&lt;p&gt;最后一阶：代码在本地跑起来了，但它停在了 localhost:8000，除了自己没人能用，只能自娱自乐。你跟别人说 AI 很厉害，“我”很厉害，他们都没有感性认识。&lt;/p&gt;
&lt;p&gt;部署这件事本身不难，但对于初学者来说，它意味着又一堆新概念——服务器、域名、Docker、CI/CD。每一个都可能卡住，每一个都需要额外的学习成本。很多学员就是在这一步停下来了：东西做出来了，但只有自己能用，没法分享给别人。&lt;/p&gt;
&lt;p&gt;这一阶是一个关键转折点，原因不只是技术上的。当一个项目可以被别人访问的那一刻，它就从作业变成了作品。可以分享给朋友，放进简历，甚至让真实用户使用。这个身份的转变，会彻底改变学员对学 AI 这件事的态度——从我&lt;em&gt;在完成练习&lt;/em&gt;变成&lt;em&gt;我在创造价值&lt;/em&gt;。我们观察到，很多学员的学习热情是在第一次分享自己作品的时候被真正点燃的。在此之前是被动学习，在此之后会变成主动探索。&lt;/p&gt;
&lt;h2&gt;如何从根本上解决问题&lt;/h2&gt;
&lt;p&gt;如果我们仔细观察前面讲的四道阶梯，会发现它们本质上都是摩擦问题。对这种问题，传统的解法是给你更多教程，比如教你怎么注册 API，教你怎么配置环境，教你怎么买服务器。每遇到一个坑，就写一篇 tutorial 来教学。结果是教程越来越多，学习路径越来越长，但要做的事情还是那么多，摩擦并没有真正减少。&lt;/p&gt;
&lt;p&gt;这也是 AI 时代教程满天飞的原因。平心而论，这也不是教程作者或者社区的锅，因为传统的教学方式更像是 Content Creation，或者说更像 up 主。大家一说到教学，只能想到写教材、录视频、做讲座。为了教学去专门 Build 一个平台，不说是天方夜谭，至少也不是大家的第一反应，是个吃力不讨好、技能点也不匹配的事情。&lt;/p&gt;
&lt;p&gt;但这是我们想挑战的一个思维定势：如果注册、绑卡、配置这些步骤对学习目标贡献接近于零，那为什么要让它们存在于学习路径上？与其写文档教你怎么绑信用卡，不如让绑信用卡这个步骤彻底消失。在 AI 时代，我们至少有这样一个选择，就是去真的 Build 一个平台，来一把消除这些摩擦，让学生无感地直接跨越这些阶梯，把时间都花在最重要的技能练习上。&lt;/p&gt;
&lt;p&gt;这就是我们做 AI Builder Space 的出发点。&lt;/p&gt;
&lt;h3&gt;AI Builder Space 做了什么&lt;/h3&gt;
&lt;p&gt;所以我们的思路是：让这些步骤消失。学员注册课程后直接拿到一个可用的接口（API），背后已经接好了 GPT、Claude、Gemini、DeepSeek、Grok 这些主流模型，还有语音识别、图像理解、图像生成、embedding 这些能力。因为这个平台是学员免费使用的，所以也不需要绑信用卡。&lt;/p&gt;
&lt;p&gt;这一方面直接让调用各种 AI API 变得特别简单，另一方面也让积累第一手经验很容易。想对比不同模型的表现，只要改一个参数就行。不用重新注册、重新配置。实验的成本被大幅压低了。我们希望用这种方法来鼓励大家多做实验，多换几种模型看有没有改进。打字太累，就试试语音识别；想要加入 RAG 或网络搜索，也可以直接让 AI 加。我们的目标是，让大家的好奇心和行动欲可以被这些易于使用的 API 保护起来，坚持到开花结果的那一天。&lt;/p&gt;
&lt;p&gt;另一个我们想鼓励的事情是 Build in Public——把自己做出来的东西分享出去，让别人也能用。&lt;/p&gt;
&lt;p&gt;这个最明显的原因是复利效应。一方面，当你把作品分享出去，你会开始收到反馈，开始和别人交换需求、交换想法。这种交流对打磨AI在什么场景有用的产品思维的帮助，比交换 API 怎么调要大得多。另一方面，做完一个东西然后丢掉/自己用实在太可惜了。如果能放进简历，或者让别人真的用起来，这个价值会持续积累。&lt;/p&gt;
&lt;p&gt;在此之外，还有一个我们访谈学员之后才意识到的事情：很多人在学 AI 的过程中有一种孤独感。他们一方面怕被时代抛下，觉得 AI 是很重要的事情，这是他们来上课的原因。但另一方面，周围还是有很多人不理解他们在做什么。一个人孤军奋战练习 AI，对好奇心和行动力毕竟是一个挑战。可能一两个月过去，因为周围都没人弄，慢慢也就淡忘了。&lt;/p&gt;
&lt;p&gt;所以我们很希望大家能把 build 的东西分享出来。这样可以构建一种持续的 immersion。你会发现不是只有你一个人在做这件事，有很多人和你一样有激情去讨论这些东西。我们这门课想做的，不仅是让你学会技术，还想把你领进一个同好的大门。这个社区的价值，可能比教几个技术点更持久。另外，如果你写的 AI 工具可以让周围人用起来的话，也可能转变他们的态度，让他们理解、支持你学 AI。&lt;/p&gt;
&lt;p&gt;所以我们做了一件事：让部署变成一个非常简单的 API。写完代码，（用一句话让 Cursor）调一下接口，就有一个真实的 URL 可以分享给朋友。域名是 &amp;lt;你选的名字&amp;gt;.ai-builders.space，免费使用一年。不需要买服务器，不需要学 Docker，不需要配置域名。这些概念可以以后再学，但不应该成为你分享第一个作品的障碍。&lt;/p&gt;
&lt;h3&gt;最后一块拼图&lt;/h3&gt;
&lt;p&gt;上面说的这些摩擦——配置、实验、部署——都是我们一开始就预见到的。但 AI Builder Space 上线之后，我们发现还有一个问题是之前没想到的。&lt;/p&gt;
&lt;p&gt;有些学员会来问：你这个平台为什么我照着调 API 调不出来？我们一开始以为是文档写得不够清楚，后来逐渐意识到问题出在别的地方：很多人在用 AI 编程助手的时候，没有给够 context。他们不知道要把 API 文档或者&lt;code&gt;openapi.json&lt;/code&gt;复制给 AI，不知道这样做会让结果好很多。AI 没有足够的信息，就开始 hallucinate，出来的结果当然不对。&lt;/p&gt;
&lt;p&gt;我们当然可以写一个教程去教 context curation。事实上我们的教材里已经有了。但这里有一个更根本的问题：为什么在 AI 时代，我们还要让大家自己把 OpenAPI 文档拷来拷去？这是一个 unknown unknown——大家很难意识到自己需要做这件事。同时这也是一种摩擦。我们不能靠教会大家“一定要把这件高摩擦的事情做好”来解决问题，而应该用平台把这个摩擦直接消除。&lt;/p&gt;
&lt;p&gt;所以我们想了一个办法：有没有可能用一种特别容易部署的方式，直接把这个问题解决掉？我们选了 MCP，主要是因为它部署太方便了，Cursor、Claude Code 都支持，跑一行命令就装好。装完之后，学员只需要说"用 AI Builder Space 帮我做一个 xxx"，AI 就自动知道怎么调用、怎么部署。平台的能力、最佳实践、甚至 API key 都已经包装在里面了。上线之后效果比预期的好，开发和部署的体验都简单了很多。&lt;/p&gt;
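&lt;p&gt;MCP 的“部署太方便”具体是什么样？以 Cursor 的 &lt;code&gt;mcp.json&lt;/code&gt; 为例，通常就是在客户端配置里加一小段 JSON。下面是一个示意性的片段，其中的服务器名、包名和环境变量名均为假设，实际请以平台文档为准：&lt;/p&gt;

```json
{
  "mcpServers": {
    "ai-builder-space": {
      "command": "uvx",
      "args": ["ai-builder-space-mcp"],
      "env": { "ABS_API_KEY": "sk-..." }
    }
  }
}
```

&lt;p&gt;加上这几行之后，AI 编程助手就能自己发现平台的能力和最佳实践，学员不再需要手动把 OpenAPI 文档拷来拷去。&lt;/p&gt;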
&lt;h2&gt;当工具层面的问题解决之后&lt;/h2&gt;
&lt;p&gt;配置的问题解决了，部署的问题解决了，AI 编程助手也能自动理解平台了。但我们在教学中发现，有一类任务仍然让很多学员卡住：调研。&lt;/p&gt;
&lt;p&gt;很多学员的项目都涉及查资料、做总结这类需求。看起来简单，但如果你做过大量实验就会发现：有的模型很勤快，给个调研任务会跑十几轮搜索（比如 GPT、Kimi）；有的模型则懒得搜，直接开始编（比如 Gemini，哪怕你反复强调先搜索）。这个行为很难用 prompt 改变，更像是模型训练时形成的性格。&lt;/p&gt;
&lt;p&gt;如果你自己从零开始做一个调研 Agent，光是踩这些坑、调这些参数、设计工作流，就要花掉大量时间。&lt;/p&gt;
&lt;p&gt;我们在这个问题上花了很多精力，最后得出的结论是：不要指望一个模型既能搜又能想。所以我们做了一个自己的调研 Agent 叫 Supermind Agent v1。它用了 Multi-Agent Handoff 的架构——调研阶段用擅长工具调用的模型（Grok、Kimi）去搜索、抓取、过滤；思考阶段把整理好的材料交给擅长深度推理的模型（Gemini）做综合和表达。&lt;/p&gt;
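&lt;p&gt;下面用一段极简的 Python 示意 Handoff 架构的骨架。真实的模型调用和搜索工具都用桩函数代替（函数名均为示意，并非 Supermind Agent 的真实实现），重点是展示“搜的模型”和“想的模型”如何分工：&lt;/p&gt;

```python
# Multi-Agent Handoff 的极简示意：调研阶段用擅长工具调用的模型，
# 综合阶段用擅长推理的模型。这里用桩函数代替真实模型调用。

def research_agent(query, search):
    # 调研阶段：反复搜索、抓取、过滤，产出整理好的材料
    notes = []
    for keyword in query.split():
        notes.append(search(keyword))
    return "\n".join(notes)

def reasoning_agent(materials):
    # 思考阶段：只拿到整理好的材料，专注综合与表达
    return "Report based on " + str(len(materials.splitlines())) + " notes"

def handoff_pipeline(query, search):
    materials = research_agent(query, search)   # 模型 A（如 Grok/Kimi）
    return reasoning_agent(materials)           # 模型 B（如 Gemini）

def fake_search(keyword):
    # 桩函数：代替真实的搜索工具调用
    return "fact about " + keyword

print(handoff_pipeline("agent architecture", fake_search))  # Report based on 2 notes
```

&lt;p&gt;注意这里的关键不是代码本身，而是交接点：调研模型的输出被固化成材料之后才交给推理模型，两个模型互不干扰，各自只做擅长的事。&lt;/p&gt;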
&lt;p&gt;这个设计背后有一个更一般的原则：用架构去管理模型的不确定性。同一个模型，同一个 prompt，今天和明天的表现可能不一样；同一个任务，GPT 和 Gemini 的行为模式可能完全不同。你改不了模型的性格，prompt 能调整的边界也有限。但你可以设计一个架构，让擅长的模型做擅长的事。&lt;/p&gt;
&lt;p&gt;这种思维方式是可迁移的。当你理解了这个原则，你就能把它应用到任何 AI 系统的设计中。而当你用 Supermind Agent 做出一份高质量的调研报告，体验过这种组合使用的效果，你会自然而然地想去理解它背后的设计。&lt;/p&gt;
&lt;h2&gt;结语：把时间浪费在美好的事物上&lt;/h2&gt;
&lt;p&gt;我们做了这么多基建工作——统一接口、一键部署、MCP 自动化，并不是为了让 AI 变得容易。恰恰相反，我们是为了让学生能更快地去面对那些真正困难的事情。&lt;/p&gt;
&lt;p&gt;什么是真正困难的事情？是如何定义一个从未被解决的问题，是如何设计一个精妙的 Agent 架构来处理模糊性，是如何在看似胡言乱语的模型反馈中捕捉到那一丝逻辑的火花。这些才是 AI 时代的核心竞争力，是那些只有人类大脑才能完成的工作。&lt;/p&gt;
&lt;p&gt;至于配置环境、调试端口、申请 token，这些是假困难。它们消耗意志力，给人一种我在努力的错觉，却不增长你的智慧。我们希望 AI Builder Space 是一把利刃，快刀斩乱麻，斩除这些缠绕在学习路径上的荆棘。&lt;/p&gt;
&lt;p&gt;所以，不要为了学而学。请尽快跨过那些无谓的技术门槛，去到那个真正需要你思考、判断、创造的地方。毕竟，生命有限，你的好奇心和创造力，应该浪费在那些真正美好的事物上。&lt;/p&gt;
&lt;h3&gt;FAQ&lt;/h3&gt;
&lt;h4&gt;Q：文章里提到的 AI Builder Space 是什么？哪里可以使用？&lt;/h4&gt;
&lt;p&gt;这个是我们 &lt;strong&gt;AI Architect&lt;/strong&gt; 这门课程的学生专属的一个教学平台。它的网页在 &lt;a href="https://space.ai-builders.com"&gt;https://space.ai-builders.com&lt;/a&gt;，但是需要学生才有免费的访问权限。&lt;/p&gt;
&lt;p&gt;&lt;img alt="AI Builder Space Screenshot" src="/images/ai-builder-space-screenshot.jpg"&gt;&lt;/p&gt;
&lt;p&gt;如果对这门课感兴趣的话，可以看一下&lt;a href="https://www.superlinear.academy/c/aa/"&gt;这个链接&lt;/a&gt;。&lt;/p&gt;
&lt;h4&gt;Q：市面上已经有 OpenRouter、Portkey、LiteLLM 这些统一 API 网关了，AI Builder Space 有什么不同？&lt;/h4&gt;
&lt;p&gt;功能上确实有重合。OpenRouter 是目前多模态能力最全的网关，支持 LLM、Vision、图像生成、语音识别、Embedding 等，我们的统一 API 网关在这方面和它差不多。&lt;/p&gt;
&lt;p&gt;但定位不同。第一，零摩擦起步——你注册课程后自动获得账号和 API key，不需要单独注册、不需要绑信用卡，OpenRouter 需要你自己注册并绑卡。第二，我们提供 MCP Server 来帮助 AI 编程助手理解平台，这是其他网关没有的。第三，统一 API + 一键部署 + MCP 形成从开发到交付的完整闭环，OpenRouter 只解决 API 调用问题，部署还是要你自己搞定。&lt;/p&gt;
&lt;p&gt;简单说：OpenRouter 是一个很好的产品，但 AI Builder Space 是一个专门为教学设计的平台。&lt;/p&gt;
&lt;h4&gt;Q：你们把我举过去了，但那些底层的东西（比如 context curation、部署原理）我并没有学到，这样好吗？&lt;/h4&gt;
&lt;p&gt;这正是我们有意为之的教学设计。&lt;/p&gt;
&lt;p&gt;传统路径是：先学原理 → 再做练习 → 最后做项目。我们的路径是：先做出东西 → 体验到价值 → 再回来理解原理。&lt;/p&gt;
&lt;p&gt;为什么后者更有效？&lt;/p&gt;
&lt;p&gt;首先，教育最难的不是知识传递，而是激发学习动机。当你已经做出了一个能分享的作品，你才会真正有动力去理解它是怎么工作的。&lt;/p&gt;
&lt;p&gt;其次，在你理解原理之前，你已经通过实践建立了直觉，回头学原理时会发现很多东西"原来如此"，而不是"这有什么用"。&lt;/p&gt;
&lt;p&gt;第三，一次性学太多东西会让人崩溃，先跳过不必要的复杂性，专注于核心，等你准备好了再回来补课。&lt;/p&gt;
&lt;p&gt;当然，这不是说那些底层知识不重要。课程后面会逐步引导你理解 context curation、部署原理、prompt engineering 的深层逻辑。但那是在你已经有了成功体验之后。&lt;/p&gt;
&lt;h4&gt;Q：你们说要培养 Master Builder，这个和普通的 builder 有什么区别？&lt;/h4&gt;
&lt;p&gt;低层次的 builder 着眼于具体细节——这个 API 怎么调、那个参数怎么设。Master Builder 从产品和系统的角度思考：不是这个模型怎么用，而是这个问题应该用什么系统来解决；不是怎么写好 prompt，而是这个任务应该怎么分解、怎么编排；不是 AI 能不能做到，而是 AI 做不到的部分，人应该怎么补位。&lt;/p&gt;
&lt;p&gt;Supermind Agent 就是一个例子：当单个模型有局限时，用架构来弥补。这种思维方式的转变，才是 AI 时代最持久的竞争力。&lt;/p&gt;
&lt;p&gt;我们通过降低摩擦让你快速上手，但最终目标是培养你成为一个能独立设计 AI 系统的 Master Builder。当你理解了为什么这样设计，你就不再需要依赖任何平台——包括我们的。&lt;/p&gt;</content><category term="Computing"></category><category term="Chinese"></category><category term="AI"></category><category term="Tutorial"></category></entry><entry><title>Why AI Education Should Go Beyond Content Creation to Engineering Infrastructure</title><link href="https://yage.ai/ai-builder-space-en.html" rel="alternate"></link><published>2026-02-02T19:00:00-08:00</published><updated>2026-02-02T19:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-02-02:/ai-builder-space-en.html</id><summary type="html">&lt;p&gt;Analyzes the four-step attrition ladder in AI learning and proposes using engineering platforms to eliminate configuration, experimentation, and deployment friction. Introduces AI Builder Space's unified API, one-click deployment, and MCP automation.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Over the past two years, we've created &lt;a href="https://www.superlinear.academy/ai-builders-eng"&gt;four courses&lt;/a&gt;, accumulating 2,500+ students. Recently, you may have seen &lt;a href="https://www.superlinear.academy/c/share-your-projects-en/"&gt;many student projects shared&lt;/a&gt; in our updates—they're truly impressive, and I've been inspired by many of them. But those who actually delivered products that everyone can use are actually a minority who made it to the finish line.&lt;/p&gt;
&lt;p&gt;We haven't had the chance to discuss the other side: based on our observations and interviews, a surprisingly large proportion of students actually stopped somewhere in the middle. It's not that they found the material useless or gave up because they couldn't learn it; rather, at some point their enthusiasm ran out and they quietly disappeared.&lt;/p&gt;
&lt;p&gt;What frustrates us most is that this attrition often doesn't happen in front of complex algorithms or logic, but at extremely trivial obstacles that have nothing to do with core skills.&lt;/p&gt;
&lt;p&gt;Facing this attrition, the traditional educational instinct is to create more content—if you don't understand configuration, we write a document; if you can't deploy, we record a video. But as tutorials pile up and learning paths grow longer, those trivial obstacles remain. So we believe that in the AI era, we might need a more fundamental approach to truly solve this attrition problem.&lt;/p&gt;
&lt;p&gt;This article aims to step back from discussing solutions and first systematically examine where people learning AI actually get stuck, then explain how we're trying to eliminate these obstacles at their root using an engineering-oriented approach.&lt;/p&gt;
&lt;h2&gt;The Attrition Ladder: Four Critical Nodes in Learning AI&lt;/h2&gt;
&lt;p&gt;The attrition process where student enthusiasm runs out is like a staircase. At each step, some people stop for various reasons, while those who cross over experience a qualitative leap in their abilities.&lt;/p&gt;
&lt;h3&gt;First Step: Brain Says "I Get It," Hands Say "No, You Don't"&lt;/h3&gt;
&lt;p&gt;From the very beginning of teaching, we found that many students watch the videos, read the materials, feel like they understand, but never actually apply AI to their own lives or work.&lt;/p&gt;
&lt;p&gt;This is a real shame. Learning AI is more like learning to swim or fly a plane. Just as no one can learn to swim by watching videos, you can't learn AI just by reading materials. Knowledge points can be acquired through memorization and understanding, but skills must be developed through physical practice. You have to stumble through real scenarios—especially make mistakes in real scenarios—to truly internalize it as your own ability. There's a huge chasm between the brain thinking it understands and the hands actually being able to use it.&lt;/p&gt;
&lt;p&gt;This is why we say this step is the starting point: if a student has never applied AI to a single real task—even a very small one—they haven't truly started learning. Many people stop at this step. They think they're learning, but they're really just watching others learn to create an illusion that they're making an effort.&lt;/p&gt;
&lt;h3&gt;Second Step: From Toy Projects to Real Applications&lt;/h3&gt;
&lt;p&gt;Some students cross the first step, complete a few small projects from tutorials, and build some confidence. But when they want to make something truly useful, or want to use it at a slightly larger scale or with some automation, they find a pile of chores ahead: binding credit cards, registering for various accounts, applying for API tokens, configuring development environments. It's all grunt work with no sense of achievement when done, and people give up after a bit of frustration.&lt;/p&gt;
&lt;p&gt;The problem with these chores is that they contribute almost nothing to the learning objective, yet consume huge amounts of enthusiasm and time. You were ready to tackle something big, but two hours later you're still wrestling with configuration and haven't written a single line of code. So giving up is human nature.&lt;/p&gt;
&lt;p&gt;At the same time, this sense of defeat is fatal. Drive is most fragile and needs the most protection when it's just beginning to sprout, because once it's extinguished, it's hard to rekindle. The result is that students have just built up confidence, finally crossed the first step, started believing they could do something—and then these trivial configuration tasks knock them back down. This kind of friction is genuinely destructive, and from a teaching perspective it must be treated as a first-class problem.&lt;/p&gt;
&lt;h3&gt;Third Step: From Passive Reception to Forming Your Own Judgment&lt;/h3&gt;
&lt;p&gt;Some students endure through the first two steps, finally get the API running, start using AI to do things, and accumulate some experience. But then another problem emerges: their firsthand experience can't scale. They have some opinions and insights within their own small domain, but most of the time they're still led around by clickbait headlines. Today this article says Claude is best at coding, tomorrow that one says DeepSeek crushes everyone on cost-effectiveness. Without their own firsthand experience, they can only parrot others.&lt;/p&gt;
&lt;p&gt;The essential barrier at this stage is: distilling your own insights from complex information. True learning requires doing a large amount of scalable experiments yourself to accumulate firsthand experience. This isn't as simple as trying a few models—it requires repeatedly comparing and repeatedly hitting pitfalls in real scenarios. Run the same task through three models, record their respective performance; switch back and forth between different prompt strategies to feel their differences. Only this way can you form your own judgment, rather than believing every article you read.&lt;/p&gt;
&lt;p&gt;This step is important because it touches on the essence of learning AI: knowing how to call an API is far from truly making a useful AI product. What's crucial is learning how to make trade-offs and judgments. Technology will change, models will iterate—only judgment can be accumulated and transferred. Without firsthand experience, you'll never form your own opinions, forever believing whatever others say. In this state, you can't truly use AI well, because every decision depends on others' (or even bloggers') conclusions.&lt;/p&gt;
&lt;h3&gt;Fourth Step: From Running Locally to Deployment and Delivery&lt;/h3&gt;
&lt;p&gt;The final step: the code runs locally, but it's stuck at localhost:8000—no one but yourself can use it, just self-entertainment. When you tell others AI is amazing, that "you" are amazing, they have no concrete sense of it.&lt;/p&gt;
&lt;p&gt;Deployment itself isn't hard, but for beginners, it means yet another pile of new concepts—servers, domain names, Docker, CI/CD. Each one can become a blocker, each one requires additional learning cost. Many students stop at this step: they made something, but only they can use it, they can't share it with others.&lt;/p&gt;
&lt;p&gt;This step is a critical turning point, not just technically. The moment a project can be accessed by others, it transforms from a homework assignment into a real piece of work. It can be shared with friends, put on a resume, or even used by real users. This identity shift completely changes how students view learning AI—from "I'm completing exercises" to "I'm creating value." We've observed that many students' learning enthusiasm truly ignites the first time they share their work. Before that, it's passive learning; after that, it becomes active exploration.&lt;/p&gt;
&lt;h2&gt;How to Solve the Problem at Its Root&lt;/h2&gt;
&lt;p&gt;If we carefully observe the four steps described above, we find they're essentially friction problems. The traditional solution to such problems is to give you more tutorials—teaching you how to register for APIs, how to configure environments, how to buy servers. Every time you hit a pitfall, write a tutorial to teach it. The result is more and more tutorials, longer and longer learning paths, but the same amount of work to do, and friction hasn't really decreased.&lt;/p&gt;
&lt;p&gt;This is also why tutorials are everywhere in the AI era. Honestly, this isn't the fault of tutorial authors or the community, because traditional teaching is more like Content Creation, or being a content creator. When people think of teaching, they can only think of writing textbooks, recording videos, giving lectures. Building a platform specifically for teaching is, if not impossible, at least not the first thing that comes to mind. It's a thankless task, and the skill set doesn't match.&lt;/p&gt;
&lt;p&gt;But this is a mental model we want to challenge: if registration, card binding, and configuration contribute nearly zero to learning objectives, why let them exist on the learning path? Instead of writing documents teaching you how to bind a credit card, why not make the credit card binding step disappear entirely? In the AI era, we at least have this option—to actually Build a platform that eliminates this friction in one go, letting students seamlessly cross these steps and spend all their time on the most important skill practice.&lt;/p&gt;
&lt;p&gt;This is the starting point for why we built AI Builder Space.&lt;/p&gt;
&lt;h3&gt;What AI Builder Space Does&lt;/h3&gt;
&lt;p&gt;So our approach is: make these steps disappear. After registering for the course, students directly get a usable interface (API), with mainstream models like GPT, Claude, Gemini, DeepSeek, and Grok already connected behind it, plus capabilities like speech recognition, image understanding, image generation, and embedding. Since this platform is free for students, there's no need to bind credit cards.&lt;/p&gt;
&lt;p&gt;On one hand, this makes calling various AI APIs particularly simple; on the other hand, it makes accumulating firsthand experience easy. Want to compare different models' performance? Just change one parameter. No re-registering, no re-configuring. The cost of experimentation is drastically reduced. We hope to use this method to encourage everyone to experiment more, try a few different models to see if there's improvement. Tired of typing? Try speech recognition. Want to add RAG, add web search? You can directly ask AI to add it. Our goal is to protect everyone's curiosity and drive to act with these easy-to-use APIs, helping them persist until the day they bear fruit.&lt;/p&gt;
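&lt;p&gt;A sketch of what "just change one parameter" looks like in practice. The gateway speaks an OpenAI-compatible protocol, so in real use the &lt;code&gt;ask&lt;/code&gt; function below would wrap &lt;code&gt;client.chat.completions.create&lt;/code&gt;; here it is a deterministic stub (and the model names are illustrative) so the sketch stays self-contained:&lt;/p&gt;

```python
# Compare several models on the same prompt by swapping one parameter.
# `ask` is pluggable: in real use it would call the unified gateway;
# here a stub stands in so the example runs without network access.

def compare_models(ask, models, prompt):
    # Run the identical prompt through each model, collecting answers side by side.
    return {model: ask(model, prompt) for model in models}

def fake_ask(model, prompt):
    # Stand-in for a real gateway call; deterministic for illustration.
    return model + " answer to: " + prompt

results = compare_models(fake_ask, ["gpt-5.2", "claude", "gemini"], "Summarize RAG")
for model, answer in results.items():
    print(model, "->", answer)
```

&lt;p&gt;The point of the design is that the experiment loop collapses to editing one string in one list—no new accounts, no new keys, no new SDKs per vendor.&lt;/p&gt;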
&lt;p&gt;Another thing we want to encourage is Build in Public—sharing what you've made so others can use it too.&lt;/p&gt;
&lt;p&gt;The most obvious reason is the compound effect. On one hand, when you share your work, you start receiving feedback, start exchanging needs and ideas with others. This exchange helps polish your product thinking about what scenarios AI is useful for far more than exchanging how to call APIs. On the other hand, it's such a waste to finish something and then throw it away or just use it yourself. If you can put it on your resume, or if others actually use it, this value continues to accumulate.&lt;/p&gt;
&lt;p&gt;Beyond this, there's something we only realized after interviewing students: many people feel a sense of loneliness while learning AI. On one hand, they fear being left behind by the times, feeling AI is important—which is why they took the course. But on the other hand, many people around them still don't understand what they're doing. Fighting alone to practice AI is a challenge to curiosity and drive. A month or two might pass, and because no one around is doing it, it gradually fades away.&lt;/p&gt;
&lt;p&gt;So we really hope everyone can share what they build. This way, we can create a sustained immersion. You'll find you're not the only one doing this—many people share your passion for discussing these things. What our course wants to do isn't just teach you technology, but also lead you through the door to a community of like-minded people. The value of this community might be more lasting than teaching a few technical points. Also, if the AI tools you write can be used by people around you, it might change their attitudes and help them understand and support your AI learning.&lt;/p&gt;
&lt;p&gt;So we did something about it: we turned deployment into a very simple API. Write your code, ask Cursor in one sentence to call the interface, and you have a real URL to share with friends. The domain is &lt;your-chosen-name&gt;.ai-builders.space, free to use for one year. No need to buy servers, no need to learn Docker, no need to configure domain names. These concepts can be learned later, but they shouldn't be barriers to sharing your first work.&lt;/p&gt;
&lt;h3&gt;The Last Piece of the Puzzle&lt;/h3&gt;
&lt;p&gt;The friction mentioned above—configuration, experimentation, deployment—were all things we anticipated from the start. But after AI Builder Space went live, we discovered there was one more problem we hadn't thought of.&lt;/p&gt;
&lt;p&gt;Some students would ask: why can't I get your platform's API to work when I follow the documentation? At first we thought the documentation wasn't clear enough, but gradually we realized the problem was elsewhere: many people, when using AI coding assistants, don't provide enough context. They don't know to copy the API documentation or &lt;code&gt;openapi.json&lt;/code&gt; to the AI, and don't realize that doing so makes the results much better. Without enough information, the AI starts to hallucinate, and of course the results are wrong.&lt;/p&gt;
&lt;p&gt;We could certainly write a tutorial teaching context curation. In fact, our materials already include this. But there's a more fundamental question: why, in the AI era, should we still have people copying OpenAPI docs around? This is an unknown unknown—people can hardly realize they need to do this. It's also a form of friction. We can't solve the problem by teaching everyone "you must do this high-friction thing well"—we should use the platform to eliminate this friction directly.&lt;/p&gt;
&lt;p&gt;So we thought of an approach: could we solve this problem with something that is itself nearly frictionless to set up? We chose MCP, mainly because it is so convenient: both Cursor and Claude Code support it, and installation is a single command. After installation, students just need to say "use AI Builder Space to help me make an xxx," and the AI automatically knows how to call it and how to deploy. The platform's capabilities, best practices, and even API keys are all packaged inside. The results after launch were better than expected: both the development and deployment experiences became much simpler.&lt;/p&gt;
&lt;h2&gt;When Tool-Level Problems Are Solved&lt;/h2&gt;
&lt;p&gt;Configuration problems solved, deployment problems solved, AI coding assistants can automatically understand the platform. But we found in our teaching that there's still one type of task that leaves many students stuck: research.&lt;/p&gt;
&lt;p&gt;Many students' projects involve looking up information and summarizing. Seems simple, but if you've done extensive experiments you'll find: some models are diligent, running over a dozen search rounds when given a research task (like GPT, Kimi); other models are lazy about searching and just start making things up (like Gemini, even if you repeatedly emphasize searching first). This behavior is hard to change with prompts—it's more like a personality formed during model training.&lt;/p&gt;
&lt;p&gt;If you try to build a research Agent from scratch yourself, just hitting these pitfalls, tuning these parameters, and designing workflows takes enormous amounts of time.&lt;/p&gt;
&lt;p&gt;We spent a lot of effort on this problem, and our final conclusion was: don't expect one model to both search and think. So we made our own research Agent called Supermind Agent v1. It uses a Multi-Agent Handoff architecture—the research phase uses models good at tool calling (Grok, Kimi) to search, scrape, and filter; the thinking phase hands the organized materials to models good at deep reasoning (Gemini) for synthesis and expression.&lt;/p&gt;
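&lt;p&gt;A minimal sketch of this Multi-Agent Handoff pattern follows. The two model arguments are plain callables standing in for real API calls (a tool-calling model for the research phase, a deep-reasoning model for synthesis); this is an illustration of the pattern, not the actual Supermind Agent implementation.&lt;/p&gt;

```python
# Handoff pattern sketch: each phase goes to the model family that is
# good at it. `search_model` and `reasoning_model` are stub callables.

def research_phase(question, search_model):
    """Tool-calling phase: gather raw material and filter it."""
    hits = search_model(question)  # search, scrape, filter in reality
    return [hit for hit in hits if hit["relevant"]]

def synthesis_phase(materials, reasoning_model):
    """Reasoning phase: hand organized material over for write-up."""
    notes = "\n".join(hit["text"] for hit in materials)
    return reasoning_model(notes)

def handoff_pipeline(question, search_model, reasoning_model):
    """Research with one model family, then synthesize with another."""
    materials = research_phase(question, search_model)
    return synthesis_phase(materials, reasoning_model)
```

&lt;p&gt;In a real system each callable would wrap a different provider. The design point is that the architecture, not the prompt, decides which model does which job.&lt;/p&gt;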
&lt;p&gt;Behind this design is a more general principle: use architecture to manage model uncertainty. The same model, same prompt, might perform differently today versus tomorrow; the same task, GPT and Gemini might have completely different behavior patterns. You can't change a model's personality, and there are limits to what prompts can adjust. But you can design an architecture that lets models good at certain things do those things.&lt;/p&gt;
&lt;p&gt;This way of thinking is transferable. When you understand this principle, you can apply it to the design of any AI system. And when you use Supermind Agent to produce a high-quality research report and experience the effects of this combined use, you'll naturally want to understand the design behind it.&lt;/p&gt;
&lt;h2&gt;Conclusion: Waste Your Time on Beautiful Things&lt;/h2&gt;
&lt;p&gt;We've done all this infrastructure work—unified interfaces, one-click deployment, MCP automation—not to make AI easy. Quite the opposite: we did it so students can more quickly face the things that are truly difficult.&lt;/p&gt;
&lt;p&gt;What are the truly difficult things? How to define a problem that's never been solved, how to design an elegant Agent architecture to handle ambiguity, how to catch that spark of logic in seemingly garbled model feedback. These are the core competencies of the AI era, the work that only human minds can do.&lt;/p&gt;
&lt;p&gt;As for configuring environments, debugging ports, applying for tokens: these are false difficulties. They consume willpower and give people an illusion of working hard, yet add nothing to your understanding. We hope AI Builder Space is a sharp blade that cuts through these thorns entangling the learning path.&lt;/p&gt;
&lt;p&gt;So don't learn for the sake of learning. Please cross those pointless technical barriers as quickly as possible and get to the place where you truly need to think, judge, and create. After all, life is finite—your curiosity and creativity should be wasted on things that are truly beautiful.&lt;/p&gt;
&lt;h3&gt;FAQ&lt;/h3&gt;
&lt;h4&gt;Q: What is the AI Builder Space mentioned in the article? Where can I use it?&lt;/h4&gt;
&lt;p&gt;This is an exclusive educational platform for students of our &lt;strong&gt;AI Architect&lt;/strong&gt; course. Its website is at &lt;a href="https://space.ai-builders.com"&gt;space.ai-builders.com&lt;/a&gt;, but free access is limited to enrolled students.&lt;/p&gt;
&lt;p&gt;&lt;img alt="AI Builder Space Screenshot" src="/images/ai-builder-space-screenshot.jpg"&gt;&lt;/p&gt;
&lt;p&gt;If you are interested in this course, you can check out &lt;a href="https://www.superlinear.academy/c/aa/"&gt;this link&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Q: There are already unified API gateways like OpenRouter, Portkey, and LiteLLM in the market. How is AI Builder Space different?&lt;/h4&gt;
&lt;p&gt;Functionally, there is indeed overlap. OpenRouter is currently the gateway with the most comprehensive multimodal capabilities, supporting LLM, Vision, Image Generation, Speech Recognition, Embedding, etc., and our unified API gateway is similar in this regard.&lt;/p&gt;
&lt;p&gt;But the positioning is different. First, friction-free start—you automatically get an account and API key after registering for the course, without separate registration or binding a credit card, while OpenRouter requires you to register and bind a card yourself. Second, we provide an MCP Server to help AI coding assistants understand the platform, which other gateways don't have. Third, Unified API + One-click Deployment + MCP creates a complete loop from development to delivery, while OpenRouter only solves API calling, leaving deployment to you.&lt;/p&gt;
&lt;p&gt;Simply put: OpenRouter is a great product, but AI Builder Space is a platform specifically designed for teaching.&lt;/p&gt;
&lt;h4&gt;Q: You helped me skip ahead, but I didn't learn the underlying things (like context curation, deployment principles). Is this okay?&lt;/h4&gt;
&lt;p&gt;This is exactly our intentional instructional design.&lt;/p&gt;
&lt;p&gt;The traditional path is: Learn principles first → Do exercises → Finally do a project. Our path is: Build something first → Experience value → Come back to understand principles.&lt;/p&gt;
&lt;p&gt;Why is the latter more effective?&lt;/p&gt;
&lt;p&gt;First, the hardest part of education isn't knowledge transfer, but sparking the motivation to learn. Only when you've built something you can share will you truly be motivated to understand how it works.&lt;/p&gt;
&lt;p&gt;Second, before you understand the principles, you've already built intuition through practice. When you look back at the principles, you'll have many "aha" moments, rather than wondering "what's the use of this."&lt;/p&gt;
&lt;p&gt;Third, learning too much at once can be overwhelming. Skip unnecessary complexity first, focus on the core, and fill in the gaps when you're ready.&lt;/p&gt;
&lt;p&gt;Of course, this isn't to say that the underlying knowledge isn't important. The course will later guide you step by step through the deeper logic of context curation, deployment principles, and prompt engineering. But that comes after you've already had a successful experience.&lt;/p&gt;
&lt;h4&gt;Q: You talk about cultivating Master Builders. How is this different from ordinary builders?&lt;/h4&gt;
&lt;p&gt;Low-level builders focus on specific details—how to call this API, how to set that parameter. Master Builders think from a product and system perspective: not how to use this model, but what system should be used to solve this problem; not how to write a good prompt, but how this task should be decomposed and orchestrated; not whether AI can do it, but how humans should fill the gap where AI falls short.&lt;/p&gt;
&lt;p&gt;Supermind Agent is an example: when a single model has limitations, compensate with architecture. This shift in thinking is the most enduring competitiveness in the AI era.&lt;/p&gt;
&lt;p&gt;We let you get started quickly by reducing friction, but the ultimate goal is to cultivate you into a Master Builder who can independently design AI systems. When you understand why it's designed this way, you no longer need to rely on any platform—including ours.&lt;/p&gt;</content><category term="Computing"></category><category term="English"></category><category term="AI"></category><category term="Tutorial"></category></entry><entry><title>从过程确定性到结果确定性：AI 时代的另一种安全感</title><link href="https://yage.ai/result-certainty.html" rel="alternate"></link><published>2026-01-25T17:00:00-08:00</published><updated>2026-01-25T17:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-01-25:/result-certainty.html</id><summary type="html">&lt;p&gt;用Claude Code替代API调用做翻译任务：利用agentic loop实现自我纠错，用evaluation-first定义验收标准，从过程确定性转向结果确定性获得新的安全感。&lt;/p&gt;</summary><content type="html">&lt;p&gt;即使在2026年，把AI从demo做成产品也不是一件容易的事。比如中翻英，大家都觉得早就被LLM解决了，不就是调个API的事情嘛。但我们最近因为要把Superlinear Academy社区加入一个中翻英自动同步的功能，才发现开发体验这么差。&lt;/p&gt;
&lt;p&gt;这个问题的核心在 AI 的输出有很多不确定性。比如一个帖子太长了，AI会偷懒，前面正常翻译，后面开始缩写。或者它会脑子短路，开始输出还是英文，中间非要夹几个中文。或者在格式里做一些小手脚，比如丢了个粗体。或者它可能会超时，输出一半就卡在那儿，直到挂掉。&lt;/p&gt;
&lt;p&gt;为了克服这些不确定性，我们就要在程序里面做很多细节处理。比如如果帖子太长，就要分几段分别调用API，最后再拼接起来（&lt;a href="https://yage.ai/wide-research.html"&gt;Wide Research&lt;/a&gt;）。但这会带来另一个问题，不同段之间的术语未必统一，所以我们还要进一步设计工作流，来保证同一个中文术语第一段跟第二段之间不至于翻译成两个不同的英文单词。最后我们还得加一个检查，如果输出还有中文字符，就需要再翻译一遍。为了解决超时的同时避免重复翻译，我们还要做断点续传，只把失败的那一小部分翻译，回头再插进去。&lt;/p&gt;
&lt;p&gt;用这样的方式，我们确实大幅提升了成功率，保证即使对社区里面很长的帖子，AI也能正常翻译。但整个感觉就是累。我们90%的时间都没有花在怎么让AI翻译得更好，而是用workflow跟orchestration来给AI擦屁股。而且到后来，因为总会出各种意外情况，很多只出现一两次的问题我们就没修了，因为感觉永远修不完。总之完全没有感觉到生产力的提升。还不如调以前的机翻API。&lt;/p&gt;
&lt;p&gt;后来我们换了一种完全不同的思路，问题反而解决了。但在介绍具体怎么做之前，我想先解释一下我们对这个问题成因的更深一层的思考。&lt;/p&gt;
&lt;h2&gt;Agent调用的四层结构&lt;/h2&gt;
&lt;p&gt;像前面提到的，调用AI的API不是调用完了就甩手不管了这么简单。它需要做很多配套的事情。而这些事情从集成的角度来看可以分成四层：&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;模型层：我们是用Claude还是GPT？用Opus还是Haiku？用什么Reasoning Effort？&lt;/li&gt;
&lt;li&gt;协议层：用Chat Completion API还是Response API？用MCP还是RESTful API？Rate Limit怎么解决？JSON Mode要不要开启？当我们说调用API的时候，大多数情况下我们指的是协议层。&lt;/li&gt;
&lt;li&gt;运行时层：状态怎么管理？工具怎么调用？文件的内容怎么给AI？权限怎么控制？用多少并发？这一层不是传统意义调用API的开发内容，但是但凡想要把AI稳定用到生产环境，这是绕不过去的一层。&lt;/li&gt;
&lt;li&gt;契约层：到底什么样的标准算成功？比如拿到AI的结果之后做什么检查？Guardrail怎么设？什么时候要引入人工干预？怎么保证不违反社会主义核心价值观？这一层决定了我们能不能信任AI的输出，并且真的把它用于生产。&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;一说到AI产品开发，大家讨论最多的是协议层。但实际的开发过程中，最花时间的反而是运行时层。这是因为协议层和传统的API调用不一样，LLM引入了太多的不确定性，而这些不确定性都需要运行时层来吸收和处理。问题是，运行时层跟业务逻辑没什么关系。无论是做翻译、代码生成、还是客服机器人，我们都要处理偷懒、拼接、上下文管理、并发控制这些事情。这意味着每个团队都在重复造轮子。所以一个自然的想法就是：能不能把运行时层外包出去？&lt;/p&gt;
&lt;p&gt;这件事没那么简单。不同模型的 failure pattern 是不一样的。有些模型遵循指令的能力很强，但容易在长文本上偷懒；有些模型创造力好，但格式控制一塌糊涂。针对不同模型，我们擦屁股的方式也不同，尤其是长尾failure pattern更是如此。因此运行时层很多时候是针对模型高度定制化的，也就很难复用，外包更无从谈起。&lt;/p&gt;
&lt;p&gt;但最近有一件事情让这个局面有了改观：Claude Code 本身不是开源项目，但越来越多的模型提供商开始主动兼容。Kimi、DeepSeek、GLM 都提供了官方接口，只要改几个环境变量，就能让 Claude Code 在后台调用这些模型。这件事很有意思。它意味着 Claude Code 已经超越了工具本身，变成了一种可复用的东西。&lt;/p&gt;
&lt;p&gt;更重要的是，当模型提供商宣称兼容 Claude Code的时候，他们实际上做的事情是：把自己模型的 failure pattern 适配到 Claude Code 的预期行为上。换句话说，擦屁股这件事没有消失，但擦的人变了——从我们开发者变成了模型提供商。他们为了进入这个生态，必须确保自己的模型在 Claude Code 的运行时里表现稳定。（这个讨论也对其他类似的工具比如Codex/Cursor Agent也适用，因为它和Claude Code的命令行接口也非常类似，适配很简单）&lt;/p&gt;
&lt;p&gt;换言之，Claude Code/Codex/Cursor Agent 正在成为一种可以复用的 Agentic Runtime。&lt;/p&gt;
&lt;p&gt;这就解决了前面提到的长尾问题。那些零星出现的 edge case，一个团队修不完，但整个生态可以。每一个想兼容 Claude Code 的模型提供商，甚至包括Anthropic本身，都在帮我们填坑。所以一个新的思路就是，对于翻译这个任务，我们完全可以从“调 API然后自己擦屁股”改成“直接交给 Claude Code”。这里我们通过白嫖整个生态为了兼容它而做的适配工作，事实上复用了它的运行时层，或者说，我们从自己造轮子，变成了站在一个正在收敛的标准上面。&lt;/p&gt;
&lt;h2&gt;实战：把翻译交给 Claude Code&lt;/h2&gt;
&lt;p&gt;这就是我们决定换一条路的原因：与其继续在运行时层自己处理不确定性，不如直接站到这个正在收敛的标准上。所以我们就试着把社区翻译这个任务交给 Claude Code 去做。最直观的感受就是：之前我们花大量时间处理的问题，现在大部分自动消失了。&lt;/p&gt;
&lt;p&gt;先说偷懒的问题。之前调 API，我们需要自己做分段、拼接、校验。但 Claude Code 的工作方式天然就不一样——它操作的基本单位是文件。文件是一个 stateful 的东西，它存在磁盘上，可以序列化、持久化。所以我们可以让 Claude 一章一章地翻译，每翻完一章它自己就写回文件，整个过程不需要我们在外面再包一层 orchestration 来追踪和管理进度。&lt;/p&gt;
&lt;p&gt;断点续传也一样。以前调 API 超时了，我们要记录断点、只翻失败的部分、再拼回去。现在不用了。翻译到一半挂掉，文件就在那儿，已经翻好的部分总之不会丢。重启之后让 Claude Code 接着翻就行，它自己会读文件、看到哪里没翻、继续往下做。&lt;/p&gt;
&lt;p&gt;术语统一的问题以前需要我们设计专门的流程，用总分或者递进的形式，让第一段的术语传递给第二段。现在 Claude Code 每次修改之前会先读整个文件，它天然能看到前面的上下文。所以术语统一的问题一个简单的 prompt 就能解决，比如先读整个文件，看看之前用的是什么术语，翻译第XX行到第XX行。&lt;/p&gt;
&lt;p&gt;输出夹杂中文这个问题，以前我们要做检测、判断、重试。现在可以直接在 prompt 里跟 Claude 说：翻译完成后，从头到尾检查一遍，确保没有残留的中文字符。更进一步，因为 Claude Code 可以调用 Python，我们甚至可以让它写一个简单的脚本来验证最终文件的格式是否符合要求。它自己写检查逻辑，自己跑，自己修。&lt;/p&gt;
&lt;p&gt;这些变化的共同点是：以前需要在 workflow 层面解决的问题，现在可以用自然语言在 prompt 里说清楚，让 agent 自己处理，而且还很可靠。我们终于可以把精力放在怎么让翻译效果更好，而不是怎么防止系统脑残出幺蛾子上。&lt;/p&gt;
&lt;h2&gt;Agentic Loop 与 Evaluation-First Mindset&lt;/h2&gt;
&lt;p&gt;这些变化让我们终于可以把精力放在翻译效果本身上。但做完以后我开始好奇：为什么 Claude Code 能做到这些？换一个方式调用 API 真的有这么大区别吗？&lt;/p&gt;
&lt;p&gt;前面我们说过，一个重要原因是我们在复用整个生态的适配工作。但这只是更高层更表面的原因。如果从四层结构的角度来看，Claude Code 能 work 的直接原因是：它让 AI 能够观察到自己行动的结果。&lt;/p&gt;
&lt;p&gt;这听起来像废话，但它是 agentic AI 和传统 API 调用的本质区别。当你用 API 的时候，AI 只能看到喂给它的 prompt，它吐出一个结果，然后就结束了。如果结果有问题，比如 JSON 格式不对、漏了字段、后半段开始偷懒，AI 自己不知道。发现问题的是你，决定要不要重试的也是你，怎么修复的逻辑还是你来写。这是为什么我们觉得AI很傻，我们要跟着收拾的直接原因。&lt;/p&gt;
&lt;p&gt;但 Claude Code 不一样。它改完一个文件之后，可以调用 Python 跑一个JSON parser，看到报错说第 9527 行有语法错误。这个报错会反馈给它，它就知道该去修哪里。修完再跑一遍，通过了，继续往下。这个执行 → 观测 → 纠错的循环，就是 agentic loop。&lt;/p&gt;
&lt;p&gt;这也是为什么文件这个形态这么重要的原因。文件是状态的载体，状态可见才能让闭环成立。我们把翻译任务从调一次 API 拿结果变成让 agent 在一个工作目录里操作文件，这在事实上给 AI 装上了一双眼睛。它能看见自己上一步做了什么，能看见验证脚本的输出，能根据这些信息决定下一步怎么做。这是运行时层带来的能力。&lt;/p&gt;
&lt;p&gt;但 agentic loop 能跑起来，不代表它能跑对。观测到结果是一回事，知道什么结果才算"对"是另一回事。这是契约层要回答的问题。回到翻译这个例子。即使用了 Claude Code，它也不是我们换了个工具，一下就神奇地work了的。&lt;/p&gt;
&lt;p&gt;如果只说"把这个文件翻译成英文"。Claude 翻了，结果里还是会有几段夹着中文字符。和之前调 API 遇到的问题一样，只不过这次修起来容易很多：我们可以在 prompt 里加一句：翻译完之后跑一个 Python 脚本检查有没有残留的中文字符，有的话自己修。Claude Code就会可靠地写一个简单的正则检查，跑一遍，发现问题就回去改，改完再跑，直到通过。&lt;/p&gt;
&lt;p&gt;但这件事体现了一个更重要的问题：之前出错不是因为 Claude 笨，而是因为它不知道什么叫翻译完了。对它来说，保证对每一章都做了一次中翻英这个操作，任务就结束了。但对我们来说，翻译完了还包括格式正确、没有残留中文、术语统一这些隐含的期望。这些期望在我们脑子里，Claude 看不到。而一旦我们把这些期望显式地写出来，并且告诉它怎么验证，它就能自己判断做没做完。&lt;/p&gt;
&lt;p&gt;我喜欢做的一个比喻是：想象你在给一个有健忘症的实习生交代任务。这个实习生没有任何上下文，不知道你之前聊过什么，不知道你的隐含期望，只能看到你这一次给他的指令。你需要把验收标准写到这种程度：只根据这些信息，他就能判断自己做完了没有。如果他觉得没做完，他知道还差什么。我的经验是，写到这种详细程度，基本上可以期待Claude Code/Codex可以可靠地完成任务。如果搞不定，别慌抱怨AI，我们应该首先检查是不是标准没写清楚。&lt;/p&gt;
&lt;p&gt;所以现在我们就可以把这两层的关系说清楚了。运行时层给了 agent 观测能力，让它能看见自己做了什么、结果是什么。契约层告诉它什么算成功，让它能判断自己做完了没有。两者缺一不可：只有观测没有标准，agent 会在那里瞎转，给出一个非常漂亮但未必满足我们要求的结果；只有标准没有观测，agent 做完一次就停了，对不对全靠运气。Agentic loop 加上 evaluation-first，才构成一个完整的闭环。&lt;/p&gt;
&lt;h2&gt;从过程确定性到结果确定性&lt;/h2&gt;
&lt;p&gt;这个闭环一旦建立起来，会带来一种微妙的对 AI 信任来源的改变。它背后其实是两种不一样的确定性。&lt;/p&gt;
&lt;p&gt;传统程序员的安全感来自过程确定性。我写的每一行代码都在我的控制之下，每一个分支、每一个边界条件我都考虑过。程序的行为是我设计出来的，只要它照着这些逻辑做，就一定会得到符合要求的结果。这种确定性是切实可感的，这种把结果翻译成程序行为的能力也是我们长期训练出来的基本功。&lt;/p&gt;
&lt;p&gt;但我们刚才看到的 agentic loop 和 evaluation-first mindset，其实是另一种确定性。我们不规定每一步怎么走，而是规定终点长什么样、怎么验证到了终点。过程是不确定的——Claude 可能先翻译再检查，也可能边翻边查，可能用正则也可能用别的方法——但结果是确定的：只要验收标准写对了，最终产物就是对的。这是结果确定性。&lt;/p&gt;
&lt;p&gt;这两种确定性背后，其实是两种不同的成本结构。过程确定性的经济学是：代码执行起来几乎不花钱，但写代码的人力很贵。所以我们要精心设计逻辑、追求复用、避免重复，把人力成本摊薄到每一次执行上。结果确定性的经济学正好反过来：intelligence越来越便宜，让AI反复尝试、检查、纠错的成本在快速下降。我们可以挥霍token来换取确定性——不是通过写更多的防御性代码，而是让AI用它的推理能力去对抗不确定性。&lt;/p&gt;
&lt;p&gt;这和我之前在&lt;a href="https://yage.ai/ai-native-cost-structure.html"&gt;《一次性软件与被压缩的现实》&lt;/a&gt;中讨论的是同一个逻辑。那篇文章讲的是当写代码的成本趋近于零，一次性软件反而成了最优策略。这里的变化更广：不只是代码，而是整个推理和智能都在变便宜。翻译不是写代码，但它同样是燃烧token所产出的东西。当这个成本足够低，我们就可以让AI每次都现场做检查、现场写验证脚本、反复循环直到结果正确，而不需要像以前那样把所有可能的情况都预先在代码里写成规则。&lt;/p&gt;
&lt;p&gt;在成本结构的变化之外，这也带来了天花板的差别。过程确定性的上限是我们的想象力和精力，我们能想到的情况、能写出来的逻辑，就是系统能处理的边界。结果确定性的上限更高：我们不需要穷举所有可能的路径，只需要定义清楚什么是对的，agent 会自己想办法到达那个状态。&lt;/p&gt;
&lt;p&gt;但我们不太习惯结果确定性，往往会觉得不踏实。因为我们在职业生涯中引以为豪的一项核心技能，恰恰就是把结果翻译成过程：老板说想要一个能处理十万并发的系统，我们就设计出一套架构来保证这个结果；PM 说用户上传的文件不能超过 10MB，我们就写一个校验逻辑来拦截超限的请求。所以当我们开始用 AI 的时候，这种习惯会很自然地延续——我们本能地想用规则来规定 AI 的行为：输出必须是 JSON 格式，每个字段必须存在，遇到这种情况要这样处理，遇到那种情况要那样处理。&lt;/p&gt;
&lt;p&gt;但这条路是有上限的。AI 不是一个确定性的系统，用过程去约束它，你会发现自己在做的事情是用大量的 rule 来控制它的不确定性。规则越写越多，漏洞越补越多，最后你花在防御上的精力比花在解决问题上的还多。这就是我们最开始拿 API 做翻译时遇到的困境。&lt;/p&gt;
&lt;p&gt;但如果我们可以接受一点让步呢？如果我们愿意接受过程上的不确定性，转而通过规定结果来约束 AI 的行为，事情会变得不一样。我们不再说"你必须用这个方法处理这种情况"，而是说"最终产物必须满足这些条件，怎么满足你自己想办法"。这样一来，AI 的灵活性不再是我们需要控制的风险，而是它完成任务的资源。&lt;/p&gt;
&lt;p&gt;当然，以前这条路没那么容易走。如果你想让 AI 能够自己观测结果、自己判断对错、自己决定下一步怎么做，你得自己搓一个 agentic loop 出来。而 agent 套壳比它看上去要更难：你要处理工具调用的格式，要解析 AI 的输出，要管理上下文窗口，还要针对不同模型的特性做适配。这套东西做下来，你会发现自己又在做另一种形式的用过程换确定性。（而且引入 Agentic 框架往往会&lt;a href="https://yage.ai/why-forget-all-frameworks.html"&gt;带来更大的技术债&lt;/a&gt;）&lt;/p&gt;
&lt;p&gt;但现在不用了。Claude Code、Codex、Cursor Agent 这些工具已经把运行时层的脏活干完了。Agentic loop 是现成的，文件系统是现成的，工具调用的封装也是现成的。你要做的，就是想清楚你要什么结果，怎么验证这个结果，然后用自然语言告诉它。&lt;/p&gt;
&lt;p&gt;所以我有一个建议：尝试拥抱过程上的不确定性。不要条件反射地去规定 AI 的每一步行为，而是直接描述你对最终结果的期望，把它 codify 成可验证的标准。运行时层的事情交给 Claude Code 这类工具去处理，你专注于契约层：定义什么是对的，定义怎么检验。&lt;/p&gt;
&lt;p&gt;这是一种不一样的工作方式，也是一种不一样的安全感来源。&lt;/p&gt;
&lt;h2&gt;结语&lt;/h2&gt;
&lt;p&gt;当然，这种工作方式不是没有边界的。&lt;/p&gt;
&lt;p&gt;首先是任务本身的性质。结果确定性能 work，前提是你能清晰地定义什么是"对的"。翻译这个任务之所以适合，是因为验收标准可以形式化：格式正确、没有残留中文、术语一致，这些都可以写成脚本让 agent 自己跑。但有些任务的"对"很难定义，或者定义出来的标准本身就有歧义——不过话说回来，这种情况下用 rules 来约束过程只会更难。至少 evaluation-first 还给了一个明确的失败信号。&lt;/p&gt;
&lt;p&gt;其次是安全。用 API 的时候，AI 对你的系统没有任何控制权。它只能接收 prompt、返回文本，仅此而已。但 Claude Code 这类工具不一样。它能读写文件，能执行 Python，能跑 bash 命令。这是它强大的原因，也是一个危险因素。这个问题要认真对待。我们的做法是在配置层面收紧权限：用 &lt;code&gt;--allowedTools&lt;/code&gt; 参数限制它能调用的工具，把可执行的范围收敛到特定的脚本上。更进一步，可以结合现在比较流行的轻量级 sandbox 方案，让 agent 就算搞砸了也只会把 sandbox 里的文件弄乱，不至于影响宿主系统。&lt;/p&gt;
&lt;p&gt;这方面确实还有很多坑。权限模型怎么设计、sandbox 怎么配置、出了问题怎么回滚，这些都是开放的问题，没有标准答案。但我对这个方向还是乐观的。安全问题是工程问题，工程问题是可以解决的。不会因为有这些风险，这条路就走不通。&lt;/p&gt;
&lt;p&gt;回到开头的问题：到底是把 AI 作为系统的一部分，用程序去调用它，我们做的是一个带AI功能的翻译产品？还是把 AI 作为系统的核心，让它去调用程序，我们做的是一个完成翻译任务的AI Agent？&lt;/p&gt;
&lt;p&gt;我们试了两条路，发现后者的成功率和稳定性意外地高很多。这可能是因为后者让我们可以复用整个生态的适配工作，因为 agentic loop 让 AI 能够自我纠错，因为 evaluation-first 让我们可以用结果而不是过程来约束 AI 的行为。这些因素叠加在一起，构成了一种不同的工作方式。&lt;/p&gt;
&lt;p&gt;它需要我们放弃一些东西：对过程的掌控感，对每一步行为的确定性，以及我们花了很多年训练出来的那种把结果翻译成流程的本能。但它也给了我们一些东西：更高的上限，更少的体力活，以及一种新的、基于结果的安全感。&lt;/p&gt;
&lt;p&gt;这个模式能推广到多远？我不确定，但至少在翻译这个场景上，它彻底改变了我们的开发体验。我们把这套实践整理成了一份&lt;a href="https://gist.github.com/grapeot/9cbdcf7f26bd1d69a11c39414b54dbe6"&gt;操作指南&lt;/a&gt;，你也可以发给自己的AI，让它现在就试试看。&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"></category><category term="Chinese"></category><category term="Agentic AI"></category></entry><entry><title>From Process Certainty to Outcome Certainty: A Different Kind of Confidence in the Age of AI</title><link href="https://yage.ai/result-certainty-en.html" rel="alternate"></link><published>2026-01-25T16:00:00-08:00</published><updated>2026-01-25T16:00:00-08:00</updated><author><name>grapeot</name></author><id>tag:yage.ai,2026-01-25:/result-certainty-en.html</id><summary type="html">&lt;p&gt;Why handing translation to Claude Code works better than calling APIs directly - leveraging the agentic loop, evaluation-first mindset, and the ecosystem's runtime layer to achieve outcome certainty over process certainty.&lt;/p&gt;</summary><content type="html">&lt;p&gt;Even in 2026, turning an AI demo into a production-ready product is surprisingly hard. Take Chinese-to-English translation. Everyone assumes LLMs solved this ages ago—just call an API, right? But when we recently tried to add an automatic translation sync feature to the Superlinear Academy community, we discovered just how painful the developer experience actually is.&lt;/p&gt;
&lt;p&gt;The core issue is uncertainty in AI outputs. A post is too long, and the AI gets lazy—translating the first half properly, then summarizing the rest. Or it short-circuits mid-output, starting in English but randomly inserting Chinese characters. Or it makes subtle formatting mistakes, like dropping bold text. Or it times out halfway through and just hangs there until it crashes.&lt;/p&gt;
&lt;p&gt;To deal with all this uncertainty, we had to add layers of handling in our code. If a post was too long, we'd split it into chunks, call the API separately for each, and stitch the results back together (similar to what I described in &lt;a href="https://yage.ai/wide-research-en.html"&gt;Wide Research&lt;/a&gt;). But that created another problem: terminology across chunks wasn't consistent. So we had to design additional workflows to pass a glossary from one chunk to the next, ensuring the same Chinese term didn't get translated into two different English words. On top of that, we added a check: if the output still contained Chinese characters, retry the translation. And to handle timeouts without duplicating work, we implemented checkpoint-based resumption—re-translating only the failed portion and splicing it back in.&lt;/p&gt;
&lt;p&gt;All this effort did improve success rates. Even very long community posts would eventually get translated correctly. But it was exhausting. We spent 90% of our time not on making translations better, but on workflow and orchestration to babysit the AI. And after a while, because edge cases kept popping up and some only happened once or twice, we stopped fixing them. It felt like we'd never be done. No productivity gains at all. We might as well have stuck with the old machine translation APIs.&lt;/p&gt;
&lt;p&gt;Then we tried a completely different approach—and it actually worked. But before I explain what we did, I want to share a deeper analysis of why this problem exists in the first place.&lt;/p&gt;
&lt;h2&gt;The Four Layers of Agent Integration&lt;/h2&gt;
&lt;p&gt;As I mentioned, calling an AI API isn't as simple as fire-and-forget. There's a lot of supporting infrastructure involved. From an integration standpoint, this work falls into four distinct layers:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Model layer: Which model do we use—Claude or GPT? Opus or Haiku? What reasoning effort level?&lt;/li&gt;
&lt;li&gt;Protocol layer: Chat Completion API or Response API? MCP or RESTful API? How do we handle rate limits? Should we enable JSON mode? When people talk about "calling an API," they usually mean this layer.&lt;/li&gt;
&lt;li&gt;Runtime layer: How do we manage state? How do we invoke tools? How do we feed file contents to the AI? How do we control permissions and concurrency? This layer isn't part of traditional API development, but if you want to use AI reliably in production, you can't skip it.&lt;/li&gt;
&lt;li&gt;Contract layer: What does success actually look like? What checks do we run on AI outputs? How do we set up guardrails? When do we bring in human review? How do we ensure compliance with content policies? This layer determines whether we can trust AI outputs enough to use them in production.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;When people talk about AI product development, most discussions focus on the protocol layer. But in actual development, the runtime layer consumes the most time. Unlike traditional APIs, LLMs introduce massive uncertainty, and the runtime layer has to absorb and handle all of it. The problem is that the runtime layer has nothing to do with business logic. Whether you're building translation, code generation, or a customer service bot, you still have to deal with lazy outputs, chunking, context management, and concurrency control. Every team ends up reinventing the same wheels. Which naturally raises the question: can we outsource the runtime layer?&lt;/p&gt;
&lt;p&gt;It's not that simple. Different models have different failure patterns. Some follow instructions precisely but get lazy on long texts. Others are creative but terrible at format control. The way you clean up after each model is different, especially for long-tail failure patterns. So the runtime layer often ends up highly customized to specific models, making it hard to reuse—let alone outsource.&lt;/p&gt;
&lt;p&gt;But something has recently changed this picture. Claude Code itself isn't open source, but more and more model providers are actively building compatibility with it. Kimi, DeepSeek, and GLM all offer official integrations—just change a few environment variables, and Claude Code can call these models under the hood. This is interesting. It means Claude Code has transcended being just a tool and become something reusable.&lt;/p&gt;
&lt;p&gt;More importantly, when model providers claim Claude Code compatibility, what they're actually doing is adapting their models' failure patterns to match Claude Code's expected behavior. In other words, the cleanup work hasn't disappeared—it's just shifted. Instead of us developers doing it, the model providers do it. To enter this ecosystem, they have to ensure their models behave reliably within Claude Code's runtime. (This discussion applies equally to similar tools like Codex and Cursor Agent, since their command-line interfaces are very similar and easy to adapt to.)&lt;/p&gt;
&lt;p&gt;In other words, Claude Code, Codex, and Cursor Agent are becoming a reusable Agentic Runtime.&lt;/p&gt;
&lt;p&gt;This solves the long-tail problem I mentioned earlier. Those scattered edge cases that no single team could fix—the entire ecosystem can. Every model provider that wants Claude Code compatibility, including Anthropic itself, is filling in the gaps for us. So a new approach emerges: instead of "calling an API and cleaning up ourselves," we can just "hand it off to Claude Code." By leveraging all the compatibility work the ecosystem has done, we're effectively reusing its runtime layer. We've gone from building our own wheels to standing on a converging standard.&lt;/p&gt;
&lt;h2&gt;In Practice: Handing Translation to Claude Code&lt;/h2&gt;
&lt;p&gt;This is why we decided to try a different path: instead of continuing to handle uncertainty at the runtime layer ourselves, we'd stand on this converging standard. So we tried handing the community translation task to Claude Code. The most immediate impression: most of the problems we'd spent so much time handling simply disappeared.&lt;/p&gt;
&lt;p&gt;Take the laziness problem. Before, when calling the API directly, we had to handle chunking, stitching, and validation ourselves. But Claude Code works differently—its basic unit of operation is the file. A file is stateful. It lives on disk, can be serialized and persisted. So we can have Claude translate chapter by chapter, writing back to the file after each one. The whole process doesn't need an external orchestration layer to track and manage progress.&lt;/p&gt;
&lt;p&gt;Checkpoint-based resumption is the same. Before, when an API call timed out, we had to record the breakpoint, re-translate only the failed portion, and splice it back in. Now we don't. If translation crashes halfway through, the file is still there, and whatever was already translated is safely on disk. Just restart and tell Claude Code to continue. It reads the file, sees what's not done yet, and picks up where it left off.&lt;/p&gt;
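&lt;p&gt;The "file as state" idea can be sketched in a few lines. Here the chapter list stands in for the on-disk file, and &lt;code&gt;translate&lt;/code&gt; is a stub for the real model call; the point is that because state is written back after every chapter, resumption falls out for free.&lt;/p&gt;

```python
import re

# Translate chapter by chapter, writing back after each one. On restart,
# chapters with no remaining Chinese are simply skipped, so resumption
# needs no extra bookkeeping. `translate` stands in for a model call.
CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs

def translate_chapters(chapters, translate):
    """chapters is the on-disk state, mutated (written back) in place."""
    for i, chapter in enumerate(chapters):
        if CJK.search(chapter) is None:
            continue                       # already done; skip on resume
        chapters[i] = translate(chapter)   # writing back acts as a checkpoint
    return chapters
```

&lt;p&gt;This is a simplified model of what the agent does naturally with real files, not the mechanism Claude Code itself uses internally.&lt;/p&gt;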
&lt;p&gt;Terminology consistency used to require dedicated workflow design—passing a glossary from the first chunk to the second in a structured or incremental way. Now, Claude Code reads the entire file before making changes, so it naturally sees the earlier context. So the problem of terminology consistency can be solved with a simple prompt: first read the whole file, see what terminology was used before, then translate lines XX to YY.&lt;/p&gt;
&lt;p&gt;The problem of Chinese characters leaking into the output used to require detection, judgment, and retry logic. Now we can just tell Claude in the prompt: after translating, scan the whole thing and make sure there are no leftover Chinese characters. Even better, since Claude Code can run Python, we can have it write a simple script to validate that the final file meets our format requirements. It writes the check, runs it, and fixes any issues itself.&lt;/p&gt;
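&lt;p&gt;The self-check described above might look like the following. It scans the finished translation for leftover Chinese and reports exactly where it is, turning "done" into something verifiable. For simplicity this covers only the basic CJK Unified Ideographs range, which is an assumption, not a complete definition of "Chinese text."&lt;/p&gt;

```python
import re

# Scan a finished translation for leftover Chinese characters and
# report the offending lines, giving the agent a concrete failure
# signal instead of a silent partial translation.
CJK = re.compile(r"[\u4e00-\u9fff]")

def leftover_chinese(text):
    """Return (line_number, line) pairs that still contain Chinese."""
    problems = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if CJK.search(line):
            problems.append((lineno, line))
    return problems
```

&lt;p&gt;A non-empty result is the signal the agent loops on: go back, fix those lines, run the check again until it passes.&lt;/p&gt;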
&lt;p&gt;The common thread here is that problems we used to solve at the workflow level can now be stated clearly in natural language in the prompt, and the agent handles them reliably. We can finally focus on making translations better, instead of preventing the system from doing something stupid.&lt;/p&gt;
&lt;h2&gt;The Agentic Loop and Evaluation-First Mindset&lt;/h2&gt;
&lt;p&gt;These changes finally let us focus on translation quality itself. But afterward, I started wondering: why can Claude Code do this? Does switching the way we call the API really make that much difference?&lt;/p&gt;
&lt;p&gt;As I mentioned, one important reason is that we're reusing the ecosystem's compatibility work. But that's just a higher-level, more superficial reason. Looking at it through the four-layer framework, the direct reason Claude Code works is that it allows the AI to observe the results of its own actions.&lt;/p&gt;
&lt;p&gt;This sounds obvious, but it's the fundamental difference between agentic AI and traditional API calls. When you use an API, the AI only sees the prompt it's fed, produces an output, and that's it. If there's a problem—malformed JSON, missing fields, lazy second half—the AI doesn't know. You're the one who notices. You're the one who decides whether to retry. You're the one who writes the fix logic. This is the direct reason we feel like AI is dumb and we have to clean up after it.&lt;/p&gt;
&lt;p&gt;But Claude Code is different. After it modifies a file, it can run Python to invoke a JSON parser and see an error message saying line 9527 has a syntax error. That error gets fed back to it, so it knows what to fix. It fixes it, runs again, passes, moves on. This execute → observe → correct cycle is the agentic loop.&lt;/p&gt;
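&lt;p&gt;The observe step of that loop is easy to make concrete: run a parser over what the agent just produced and surface the exact failure location. A minimal sketch of such a check, using Python's standard JSON parser:&lt;/p&gt;

```python
import json

# Validate a JSON document and report the failing line. The returned
# message is the kind of feedback the agent reads to decide its next edit.
def check_json_text(text):
    """Return None if text is valid JSON, else an error with the line."""
    try:
        json.loads(text)
        return None
    except json.JSONDecodeError as err:
        return "line {}: {}".format(err.lineno, err.msg)
```

&lt;p&gt;Whether the check targets JSON syntax, formatting rules, or leftover untranslated text, the shape is the same: a script whose output the agent can observe and act on.&lt;/p&gt;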
&lt;p&gt;This is also why the file abstraction matters so much. Files are carriers of state, and visible state is what makes the closed loop possible. Turning translation from "call an API once and get a result" into "have an agent operate on files in a working directory" in effect gives the AI a pair of eyes. It can see what it did in the previous step, see the output of validation scripts, and decide what to do next based on that information. This is the capability the runtime layer provides.&lt;/p&gt;
&lt;p&gt;But just because the agentic loop can run doesn't mean it runs correctly. Observing results is one thing; knowing what counts as "correct" is another. That's what the contract layer has to answer. Back to the translation example: even with Claude Code, it wasn't like we switched tools and everything magically worked.&lt;/p&gt;
&lt;p&gt;If we just said "translate this file to English," Claude would do it, but there would still be a few paragraphs with Chinese characters mixed in. Same problem as with the API—except now it's much easier to fix. We can add a line to the prompt: after translation, run a Python script to check for leftover Chinese characters, and fix any you find. Claude Code reliably writes a simple regex check, runs it, finds issues, goes back to fix them, runs again, until it passes.&lt;/p&gt;
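&lt;p&gt;A check like this fits in a few lines. The script below is a plausible version of what such a validator might look like, not the one Claude Code actually wrote; matching only the basic CJK Unified Ideographs block is a simplification.&lt;/p&gt;

```python
# Scan translated text for leftover Chinese characters.
# A plausible sketch of the validation script described above.
import re

CHINESE = re.compile(r"[\u4e00-\u9fff]")  # basic CJK Unified Ideographs block

def find_leftover_chinese(text):
    """Return (line_number, line) pairs that still contain Chinese."""
    return [(i, line) for i, line in enumerate(text.splitlines(), 1)
            if CHINESE.search(line)]

def check(path):
    """File-level check: True means clean, so the agent can stop fixing."""
    with open(path, encoding="utf-8") as f:
        leftovers = find_leftover_chinese(f.read())
    for lineno, line in leftovers:
        print(f"line {lineno}: {line.strip()}")
    return not leftovers
```

&lt;p&gt;Run as a script with a nonzero exit code on failure, this gives the agent exactly the kind of signal it can act on: a list of line numbers to go back and fix.&lt;/p&gt;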
&lt;p&gt;But this reveals something more important: the earlier failures weren't because Claude was stupid; they happened because Claude didn't know what "done" meant. From its perspective, applying a Chinese-to-English operation to every chapter meant the task was complete. But for us, "done" also included correct formatting, no leftover Chinese, consistent terminology—all implicit expectations. These expectations were in our heads; Claude couldn't see them. Once we made these expectations explicit and told it how to verify them, it could judge for itself whether it was finished.&lt;/p&gt;
&lt;p&gt;I like to use this analogy: imagine you're giving a task to an intern with amnesia. This intern has no context, doesn't know what you discussed before, doesn't know your implicit expectations—they can only see the instructions you give them this one time. You need to write the acceptance criteria so clearly that, based on this information alone, they can judge whether they're done. If they think they're not done, they know what's missing. In my experience, when you write things at this level of detail, you can reliably expect Claude Code or Codex to complete the task. If it can't, don't panic and blame the AI—first check whether you wrote the criteria clearly enough. An even better approach is to codify acceptance criteria into executable checks, like Python scripts. That way the agent can verify on its own, without human supervision.&lt;/p&gt;
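&lt;p&gt;Codifying acceptance criteria into an executable check might look something like the sketch below. It is a hypothetical example, assuming the output is a JSON document and the terminology rule is "no banned source-language terms"; the specific criteria, the &lt;code&gt;acceptance&lt;/code&gt; function, and the glossary are all illustrative assumptions, not the actual scripts from the project.&lt;/p&gt;

```python
# Acceptance criteria as code: each named check is one part of what "done" means.
# Hypothetical example -- the criteria and glossary here are assumptions.
import json
import re

def check_valid_json(text):
    """The deliverable must parse as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

def check_no_chinese(text):
    """No leftover source-language characters (basic CJK block)."""
    return not re.search(r"[\u4e00-\u9fff]", text)

def check_terminology(text, banned_terms):
    """Untranslated source terms must not appear in the output."""
    return not any(term in text for term in banned_terms)

def acceptance(text, banned_terms):
    """Return the list of failed criteria; an empty list means done."""
    checks = {
        "output is valid JSON": check_valid_json,
        "no leftover Chinese": check_no_chinese,
        "terminology is consistent": lambda t: check_terminology(t, banned_terms),
    }
    return [name for name, fn in checks.items() if not fn(text)]
```

&lt;p&gt;The useful property is that a failure names exactly which expectation was missed, which is precisely the information an amnesiac intern, or an agent, needs to know what's left to do.&lt;/p&gt;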
&lt;p&gt;So now we can clearly describe the relationship between these two layers. The runtime layer gives the agent observational capability—it can see what it did and what the results are. The contract layer tells it what success looks like, so it can judge whether it's done. Both are essential: observation without standards means the agent spins aimlessly, producing something beautiful that may not meet our requirements; standards without observation means the agent stops after one attempt, and whether it's right is pure luck. The agentic loop plus evaluation-first is what creates a complete closed loop.&lt;/p&gt;
&lt;h2&gt;From Process Certainty to Outcome Certainty&lt;/h2&gt;
&lt;p&gt;Once this closed loop is established, it subtly changes where our trust in AI comes from. Behind it are actually two different kinds of certainty.&lt;/p&gt;
&lt;p&gt;Traditional programmers' sense of security comes from process certainty. Every line of code I write is under my control. Every branch, every edge case—I've thought about them all. The program's behavior is something I designed, and as long as it follows this logic, it will definitely produce correct results. This certainty is tangible, and this ability to translate outcomes into program behavior is a fundamental skill we've trained over many years.&lt;/p&gt;
&lt;p&gt;But the agentic loop and evaluation-first mindset we just discussed represent a different kind of certainty. We don't specify every step of the process; instead, we specify what the destination looks like and how to verify we've arrived. The process is uncertain—Claude might translate first then check, or check while translating; it might use regex or some other method—but the outcome is certain: as long as the acceptance criteria are right, the final product will be right.&lt;/p&gt;
&lt;p&gt;This is outcome certainty.&lt;/p&gt;
&lt;p&gt;Behind these two kinds of certainty are two different cost structures. The economics of process certainty: code execution costs almost nothing, but the human effort to write code is expensive. So we carefully design logic, pursue reuse, avoid duplication—we amortize human cost across every execution. The economics of outcome certainty is the opposite: intelligence is getting cheaper. The cost of having AI repeatedly try, check, and correct is dropping fast. We can spend tokens lavishly to buy certainty—not by writing more defensive code, but by letting AI use its reasoning ability to combat uncertainty.&lt;/p&gt;
&lt;p&gt;This is the same logic I discussed in &lt;a href="https://yage.ai/ai-native-cost-structure-en.html"&gt;Disposable Software and Compressed Reality&lt;/a&gt;. That article was about how when the cost of writing code approaches zero, disposable software becomes the optimal strategy. The change here is broader: it's not just code, but reasoning and intelligence itself that's getting cheaper. Translation isn't coding, but it's equally something produced by burning tokens. When that cost is low enough, we can have AI do checks on the spot, write validation scripts on the spot, loop repeatedly until the result is correct—instead of pre-encoding all possible situations into rules.&lt;/p&gt;
&lt;p&gt;Beyond the cost structure shift, there's also a difference in ceilings. The upper bound of process certainty is our imagination and energy—the situations we can think of, the logic we can write, that's the boundary of what the system can handle. Outcome certainty has a higher ceiling: we don't need to enumerate every possible path, just define clearly what's correct, and the agent will find its own way to that state.&lt;/p&gt;
&lt;p&gt;But we're not used to this kind of certainty, and it often feels unsettling. One of the core skills we've taken pride in throughout our careers is precisely this: translating outcomes into processes. The boss wants a system that handles 100,000 concurrent connections—we design an architecture to guarantee that outcome. The PM says uploaded files can't exceed 10MB—we write validation logic to block oversized requests. So when we start using AI, this habit naturally continues. We instinctively want to constrain AI behavior with rules: output must be JSON format, every field must exist, handle this case this way, handle that case that way.&lt;/p&gt;
&lt;p&gt;But this path has limits. AI is not a deterministic system. Trying to constrain it through process, you'll find yourself using massive amounts of rules to hedge against its uncertainty. More and more rules, more and more patches, until you spend more effort on defense than on solving the actual problem. This was exactly the trap we fell into when using APIs for translation.&lt;/p&gt;
&lt;p&gt;But what if we could accept a small concession? What if we accepted process uncertainty and instead constrained AI behavior by specifying outcomes? Things would change. Instead of saying "you must use this method to handle this case," we say "the final product must meet these conditions; how you meet them is up to you." This way, AI's flexibility is no longer a risk to hedge against—it becomes a resource for completing the task.&lt;/p&gt;
&lt;p&gt;Of course, this path wasn't easy to take before. If you wanted AI to observe its own results, judge right from wrong, and decide what to do next, you had to build your own agentic loop. And wrapping an agent is harder than it looks: you have to handle tool calling formats, parse AI outputs, manage context windows, and adapt to different models' characteristics. By the time you're done, you realize you've just rebuilt process certainty in another form.&lt;/p&gt;
&lt;p&gt;But now you don't have to. Tools like Claude Code, Codex, and Cursor Agent have done the dirty work of the runtime layer. The agentic loop is ready-made, the file system is ready-made, tool calling is already wrapped. What you need to do is think clearly about what outcome you want, how to verify that outcome, and then tell it in natural language.&lt;/p&gt;
&lt;p&gt;So here's my suggestion: try embracing process uncertainty. Don't instinctively specify every step of AI behavior. Instead, directly describe your expectations for the final result and codify them into verifiable standards. Leave the runtime layer stuff to tools like Claude Code and focus on the contract layer: define what's correct, define how to verify it.&lt;/p&gt;
&lt;p&gt;This is a different way of working, and a different source of confidence.&lt;/p&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Of course, this approach has its boundaries.&lt;/p&gt;
&lt;p&gt;First, the nature of the task itself. Outcome certainty works on the premise that you can clearly define what "correct" means. Translation fits this premise because its acceptance criteria can be formalized: correct format, no leftover Chinese, consistent terminology—all of these can be written as scripts for the agent to run itself. But for some tasks, "correct" is hard to define, or the defined criteria are themselves ambiguous. That said, in those cases, using rules to constrain the process would be even harder. At least evaluation-first gives you a clear failure signal.&lt;/p&gt;
&lt;p&gt;Second, security. With an API, AI has no control over your system. It receives a prompt, returns text, and that's it. But tools like Claude Code are different. They can read and write files, execute Python, run bash commands. This is why they're powerful, and also why they're dangerous. This needs to be taken seriously. Our approach is to tighten permissions at the configuration level: use the &lt;code&gt;--allowedTools&lt;/code&gt; parameter to limit which tools it can call, constraining execution to specific scripts. Going further, you can combine this with lightweight sandbox solutions that are popular now, so even if the agent messes up, it only ruins files inside the sandbox without affecting the host system.&lt;/p&gt;
&lt;p&gt;There are definitely still many pitfalls here. How to design the permission model, how to configure the sandbox, how to roll back when things go wrong—these are all open questions without standard answers. But I'm optimistic about this direction. Security problems are engineering problems, and engineering problems can be solved. These risks don't mean this path is impassable.&lt;/p&gt;
&lt;p&gt;Back to the opening question: is it better to treat AI as a component of the system, calling it from our code, building a translation product with AI features? Or to treat AI as the core of the system, having it call our programs, building an AI Agent that accomplishes translation tasks?&lt;/p&gt;
&lt;p&gt;We tried both paths and found that the latter had surprisingly higher success rates and greater stability. This might be because the latter lets us reuse the ecosystem's compatibility work, because the agentic loop lets AI self-correct, because evaluation-first lets us constrain AI with outcomes rather than processes. These factors combine to form a different way of working.&lt;/p&gt;
&lt;p&gt;It requires giving up some things: the sense of control over process, the certainty about every step, and the instinct we spent years training—translating outcomes into procedures. But it also gives us something: a higher ceiling, less grunt work, and a new kind of confidence based on outcomes.&lt;/p&gt;
&lt;p&gt;How far can this pattern extend? I'm not sure, but at least in the translation scenario, it completely transformed our development experience. We've compiled these practices into a &lt;a href="https://gist.github.com/grapeot/4271a9782da18b2e746a42e274720f77"&gt;how-to guide&lt;/a&gt; that you can share with your own AI and try right now.&lt;/p&gt;
&lt;script async data-uid="65448d4615" src="https://yage.kit.com/65448d4615/index.js"&gt;&lt;/script&gt;</content><category term="Computing"></category><category term="English"></category><category term="Agentic AI"></category></entry></feed>