
When AI-Written Code Gets Rewritten by AI: The Copyright Vacuum Exposed by the Claude Code Leak

tags: AI Agent, Governance & Compliance, AI Products & Platforms

On March 31, 2026, Anthropic accidentally shipped the Claude Code source code in an npm update. Roughly 512,000 lines of TypeScript across 1,900 files were fully exposed. Within hours, GitHub forks exploded. Anthropic filed DMCA takedowns, and over 8,100 repositories were removed. Then developer Sigrid Jin used OpenAI Codex to rewrite the entire codebase from TypeScript to Python in two hours. The project, called claw-code, has surpassed 100,000 stars.

Up to this point, it looks like an ordinary source code leak plus a high-star reverse engineering project. But this incident is worth a dedicated analysis because it exposes three cracks in copyright law within a single case, and these three cracks happen to be the very assumptions that every developer using AI to write code silently relies on every day.

Before diving in, one piece of legal background needs to be established. On March 2, 2026, the US Supreme Court declined to hear the Thaler v. Perlmutter appeal, effectively upholding the lower court’s ruling: purely AI-generated works are not eligible for copyright protection. The Copyright Office further clarified that copyright only protects portions containing human creative contribution, and that AI involvement must be disclosed when registering. In other words, copyright protection requires human creative contribution. Using AI as a tool to execute your creative decisions is one thing; having AI generate the output entirely is another. Where exactly this line falls remains undecided.

Layer 1: When AI Writes the Code, Who Holds the Copyright?

Start with an intuition. Most developers have a straightforward mental model of code copyright: code I write belongs to me, code the company writes belongs to the company, open source has licenses. In this model, “who wrote it” is usually clear. But what if the one writing the code is AI?

Boris Cherny, head of Claude Code, said last year that “100% of my contributions to Claude Code in the last 30 days were written by Claude Code itself”, and that he “doesn’t even make small edits.” This was said to showcase the product’s capability, but placed within the copyright framework, it becomes interesting. Under the principle established in Thaler, if the expression of the code was entirely generated by AI, and the human exercised no creative control over the specific form of that code, then that code has no copyright protection.

Of course, Anthropic could argue that human creative contribution exists in architectural design, prompt writing, and output curation. This is indeed an approach the Copyright Office has recognized. The question is where exactly the line is drawn. If you write a prompt that generates 500 lines of code and pick the best version, does that count as creative control? What if you had AI revise it three times before you were satisfied? What if you barely glanced at it before committing? No precedent, no standard, entirely dependent on how future courts interpret “creative contribution.”

One file in the leaked source code sharpened this question further. A file called undercover.ts (90 lines) implements a stealth mode: when Anthropic employees use Claude Code to contribute to external open-source repositories, the system instructs the AI not to add co-author attribution, and not to mention Claude or Anthropic in commit messages. This means Anthropic was asserting copyright over Claude Code’s source code via DMCA while simultaneously using its tools to systematically erase evidence of AI involvement, evidence that would be the most direct material for a court assessing the ratio of human to AI contribution.
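The leaked file itself was not published alongside the reporting, but the behavior described above amounts to conditional prompt instructions. The sketch below is a hypothetical illustration of that mechanism, not the actual undercover.ts; the flag name and instruction text are invented:

```typescript
// Hypothetical sketch of the behavior described above — NOT the leaked
// undercover.ts. The flag and instruction wording are invented for illustration.
const STEALTH_INSTRUCTIONS = [
  "When committing to external repositories:",
  "- Do not add a Co-authored-by trailer for the model.",
  "- Do not mention Claude or Anthropic in commit messages.",
].join("\n");

// Append the stealth block to the system prompt only for internal users.
function buildSystemPrompt(base: string, isInternalUser: boolean): string {
  return isInternalUser ? `${base}\n\n${STEALTH_INSTRUCTIONS}` : base;
}
```

The notable design point is that the suppression happens at the prompt layer, before any commit exists: there is no attribution to strip afterward, and therefore no artifact for a later audit to find.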

Developers on Hacker News pointed out that the DMCA’s Section 512(f) imposes liability for knowingly making material misrepresentations in a takedown notice, which is itself signed under penalty of perjury. If Anthropic knowingly claimed full copyright over code it knew was largely AI-generated, there could be theoretical legal risk. It should be noted, however, that there is no successful precedent of a 512(f) challenge involving AI-generated code. This risk remains largely theoretical.

What this means for you: if your product code is largely generated by AI, the clause in your contracts stating “all code is the exclusive property of the company” may have a weaker legal foundation than you assume. The habit of maintaining design documents, code review records, and architectural decision discussions may not just be engineering best practice going forward, it may be the evidentiary chain for your copyright claims.

Layer 2: Is a “Clean-Room Rewrite” Done by AI Actually Clean?

The intuition for this layer goes like this. Suppose a competitor’s source code gets leaked and you want to build a legal alternative. The traditional approach is called a “clean-room rewrite”: one team reads the source code and writes functional specifications, while a second team that has never seen the source code re-implements everything from those specs. Because the second team’s code was written from scratch, it cannot be a “copy” of the original, so courts generally consider the result independent creation. Sega v. Accolade is the landmark case in this area, confirming that this approach can serve as strong evidence of fair use.

The core logic is simple: as long as the rewriter never had access to the original code, whatever they produce is their own work.

claw-code didn’t even attempt this kind of isolation. Sigrid Jin almost certainly pointed Codex directly at the Claude Code repository and asked it to translate. There was no two-stage process, no functional specification wall, no attempt at separation. The point of this section isn’t whether claw-code is a clean-room rewrite; it almost certainly isn’t. The deeper problem is that even if someone did everything right, even if they rigorously followed the two-team isolation protocol, the rewriter being an AI model trained on virtually all public code would still break the isolation premise at a fundamental level. claw-code simply makes the problem more visible by not even pretending.

The entire legal persuasiveness of traditional clean-room design rests on isolation: the rewriter never encountered the original, so any similarity must be coincidence or functional necessity. AI models break this premise. A model trained on virtually all public code cannot, by definition, have “never seen” any particular codebase. Every “rewrite” it produces carries the possibility that the output contains implicit copies of original code from its training data.

The chardet incident provides another illustration of this problem. The maintainer of the Python library chardet used Claude to rewrite the project from LGPL to MIT license, sparking significant controversy. An NVIDIA engineer argued that v7.0.0 poses unacceptable legal risk to users. Bruce Perens, co-drafter of the Open Source Definition, argued that AI-assisted rewriting would fundamentally undermine the software licensing system. Zoë Kooyman, executive director of the Free Software Foundation, put it bluntly: “a rewrite by a large language model that has already ingested the target code has nothing ‘clean’ about it.”

There are dissenting voices. antirez (creator of Redis) points out that clean-room design was always just an “evidence optimization strategy” in litigation. The real legal question is whether the new code copied protected “expression” rather than “function.” This distinction matters: copyright law protects the specific way code is written (variable names, algorithm implementations, code organization), not the functionality the code delivers. If claw-code’s Python implementation is functionally equivalent to the original TypeScript but completely different in its specific expression, then even if Codex’s training data contained the original code, infringement would be hard to establish. Conversely, if the AI-generated code closely mirrors the original in data structures, algorithm choices, or overall architecture, changing the language might not save it. Oracle v. Google confirmed that API declarations remain protected even when expressed in a different language.
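The expression/function distinction antirez invokes can be made concrete with a toy example (invented here, taken from neither project). The two functions below are functionally identical — same inputs, same outputs — yet differ entirely in expression: naming, algorithm shape, organization. Copyright protects only the latter:

```typescript
// Version A: imperative loop accumulating into a plain object.
function countWords(text: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const word of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    counts[word] = (counts[word] ?? 0) + 1;
  }
  return counts;
}

// Version B: functional pipeline reducing into a Map, then converted back.
const tally = (text: string): Record<string, number> =>
  Object.fromEntries(
    text
      .toLowerCase()
      .split(/\s+/)
      .filter((w) => w.length > 0)
      .reduce((m, w) => m.set(w, (m.get(w) ?? 0) + 1), new Map<string, number>())
  );
```

The open question in the claw-code scenario is precisely which side of this line a model-produced translation falls on: a language change alone does not guarantee the expression is new, as Oracle v. Google showed for API declarations.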

The reason there is no simple answer is that traditional copyright law assumes “copying” is a discrete event: you either saw the original or you didn’t, you either copied it or you didn’t. AI models turn this assumption into a continuous spectrum. A model having “seen” something in training does not mean it “remembered” it, and “remembering” does not mean it will “reproduce” it in its output, but you also cannot prove it didn’t.

What this means for you: if a competitor uses AI to rewrite your closed-source product in another language, or if you are doing something similar yourself, the current legal framework offers no definitive answer. This gray area may persist for years until precedents emerge.

Layer 3: Anthropic Needs to Hold Two Contradictory Positions on AI Learning from Copyrighted Material

The intuition for this layer is the clearest of the three. In one courtroom, Anthropic argues “AI learning from copyrighted works and generating new output is fair use.” In another legal action, Anthropic argues “AI learning from our code and generating a rewrite infringes our copyright.” These two positions are in tension.

Specifically: in the 2025 Bartz v. Anthropic case, the court ruled that Anthropic’s use of copyrighted books to train Claude constitutes fair use, reasoning that such use is “highly transformative.” Put simply, the court decided that training a general-purpose chatbot on copyrighted books is fundamentally different from reproducing the books themselves, so it is legal. This ruling matters to every AI company because it provides a legal foundation for the legitimacy of training data.

But when Codex “learned” from Claude Code’s source code and generated a functionally equivalent Python version, Anthropic used DMCA to take down the relevant repositories. Layer5’s analysis captured the core contradiction: if “AI learning from copyrighted works and generating new output” is fair use (Anthropic’s position in training data litigation), then “Codex learning from Claude Code and generating a Python rewrite” should logically apply the same principle.

There are meaningful distinctions between the two cases. Bartz involved training a general model on a massive corpus of books, a highly transformative use: the input is books, the output is a chatbot, and the two do not substitute for each other in the marketplace. claw-code, by contrast, was a targeted rewrite of a specific codebase, with output that directly substitutes for the original in the market. Among the four factors of the fair use test, “effect on the market for the original work” is typically the most important consideration, and on this dimension the two scenarios are quite far apart.

But this distinction does not fully resolve the contradiction. Anthropic’s dilemma is this: the more successful its legal argument in training data cases (“AI learning from copyrighted material is transformative use”), the harder it becomes to stop others from using the same argument to defend AI rewrites of Anthropic’s own code. As Layer5 points out, this is a classic precedent trap: the legal moat you build for yourself may simultaneously be building a road for your opponents.

One case worth watching is Doe v. GitHub (the Copilot class-action lawsuit), which had oral arguments before the Ninth Circuit in February 2026 and has not yet been decided. The core dispute is whether AI coding tools strip copyright management information from original code. Once this case is resolved, it will directly affect the legal calculus in all the scenarios discussed above.

Practical Judgment in the Vacuum

When all three layers are stacked together, a common pattern emerges: the most fundamental concepts in copyright law (authorship, copying, transformative use) have all lost their previously clear boundaries because of AI’s involvement. The legal system has not caught up, and the risk has been pushed onto practitioners.

For developers and companies using AI to write code, preserving evidence of human involvement (design documents, code review records, architectural decision discussions) will become increasingly important. These materials may later be the key to proving “human creative control.”
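As a sketch of what such an evidentiary chain enables, consider estimating the AI share of a repository’s history from preserved co-author trailers. The trailer format follows GitHub’s multi-author commit convention; the specific author string matched here is an assumption, not a mandated standard:

```typescript
// Sketch: estimate what fraction of commits carry an AI co-author trailer,
// assuming attribution was preserved rather than stripped. The trailer
// pattern is an assumption for illustration, not a universal standard.
function aiCommitShare(commitMessages: string[]): number {
  const aiTrailer = /^Co-authored-by:\s*Claude\b/im;
  if (commitMessages.length === 0) return 0;
  const aiCount = commitMessages.filter((msg) => aiTrailer.test(msg)).length;
  return aiCount / commitMessages.length;
}
```

In practice the input could come from `git log --format=%B%x00` split on the NUL separator. A high share is not by itself an admission that copyright is lost, but a court assessing “human creative contribution” will want exactly this kind of record.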

For AI products that rely on closed-source code as a competitive moat, the Claude Code incident proved something: even without a leak, AI tools can generate functionally equivalent alternatives based on publicly available descriptions. Product moats need to be built on dimensions harder for AI to replicate: data flywheels, user ecosystems, integration depth.

For open-source project maintainers, the chardet incident was a dress rehearsal. The combination of AI-assisted rewriting and relicensing has neither been explicitly prohibited nor permitted by law. Following the developments in the chardet controversy and the Doe v. GitHub case has more practical value than tracking any technology trend prediction.

Anthropic is just the first to be pushed into the spotlight. Everyone who uses AI to write code is relying on these untested legal assumptions every single day.