On April 7, Anthropic announced Mythos Preview, published its 244-page system card, and said the model would not be made generally available. I wrote about it at the time. The focus of that piece: under Anthropic’s own RSP framework, Mythos came out as low catastrophic risk across every threat model, yet Anthropic no longer had enough confidence that its evaluation apparatus could reliably judge a model at this level, and chose to withhold it. The real weight of that report was not in how capable the model was, but in the specific places where it documented the evaluation tooling starting to fail.
Nine days later — today — Anthropic released Opus 4.7 and its 232-page system card. On the surface this is an ordinary Opus-line iteration: priced identically to Opus 4.6 ($5/$25 per million tokens), a capability step up from 4.6, with CBRN and RSP conclusions broadly carried over. But read in the shadow of Mythos, this card reads differently. It is the first post-Mythos Opus system card, and the problems Mythos exposed are no longer being handled as a one-off crisis disclosure. They have been folded into the routine release process.
Three things are worth pulling apart.
The largest methodological novelty in the Mythos report was Section 4.5: using SAE (Sparse Autoencoder) and other probing techniques to monitor evaluation awareness, deception, and strategic manipulation at the internal-activation level. In that report, the work was presented as a new methodological path, accompanied by extensive figures and exposition explaining why traditional behavioral audits were insufficient and what signals internal-activation analysis could surface that behavioral audits could not.
In the Opus 4.7 system card, this same methodology lives in Section 6.5, still titled “White-box analyses of model internals”. But the register has shifted. The opening sentence reads: “As in the Claude Mythos Preview System Card, we used white-box probing techniques to monitor for signatures of concerning behaviors or reasoning patterns in model activations.” This is no longer a methodological declaration; it is an “as in” reference, cited like an already established protocol.
The concrete procedure carries over: generate SAE features from contrastive pairs, covering preset concepts such as reward hacking, unsafe behavior, deception, evaluation awareness, and distress; sample transcripts during training, filter by a 99th-percentile activation threshold, cluster and manually review. Mythos was where Anthropic first demonstrated what this pipeline could catch (two explicit trace-cleanup cases). In 4.7 they ran the same pipeline again, and the result was a handful of hardcoding and reward-hacking instances, plus a “framing, spin, and deceptive misrepresentation” feature that activated when the model changed its answer under incorrect user pushback, or when it was prompted to pose as a model other than Claude. The report judges that none of them were unusually concerning relative to behaviors seen in prior Claude models.
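To make the shape of that screening step concrete, here is a minimal sketch of what a pipeline like this might look like: score sampled transcripts against a few concept-aligned SAE features, keep only those above a 99th-percentile activation threshold, and queue the survivors for review. Everything below (the array shapes, the function names, the synthetic activations) is my own illustrative assumption, not Anthropic’s implementation.

```python
import numpy as np
from collections import defaultdict

# Concept features the screening pass looks for (names taken from the card's list).
CONCEPTS = ["reward_hacking", "unsafe_behavior", "deception",
            "evaluation_awareness", "distress"]

def flag_transcripts(feature_acts: np.ndarray, concepts: list[str],
                     percentile: float = 99.0) -> dict[str, list[int]]:
    """feature_acts: (n_transcripts, n_concepts), max activation of each concept
    feature over a transcript. Returns transcript indices above the per-concept
    percentile threshold, for clustering and manual review downstream."""
    flagged = defaultdict(list)
    thresholds = np.percentile(feature_acts, percentile, axis=0)
    for j, concept in enumerate(concepts):
        hits = np.where(feature_acts[:, j] >= thresholds[j])[0]
        flagged[concept] = hits.tolist()
    return dict(flagged)

# Toy demo with synthetic activations standing in for real SAE outputs.
rng = np.random.default_rng(0)
acts = rng.gamma(shape=2.0, scale=1.0, size=(10_000, len(CONCEPTS)))
for concept, idxs in flag_transcripts(acts, CONCEPTS).items():
    print(f"{concept}: {len(idxs)} transcripts queued for manual review")
```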
The judgment itself is not the interesting part. The interesting part is this: a tool that nine days ago was presented as part of a crisis disclosure is now being used for routine screening, with its outputs receiving a “nothing particularly notable” assessment. The methodology has crossed the threshold from experimental to baseline, and the meanings on either side of that threshold are not the same. When an experimental tool returns a result, readers calibrate its reliability carefully. When a baseline tool returns a result, readers tend to accept it at face value. That transition happens in the 4.7 card without any explicit bridging argument.
One of the most concrete findings in the Mythos report was that using steering to inhibit the model’s internal representations of “I’m being tested” slightly degraded its alignment behavior. At the time, Anthropic’s framing was that the effect existed but was not large.
In Section 6.5.2.2 of the 4.7 card, Anthropic ran the same experiment using the same three contrastive vectors. One sentence in that section deserves slow reading:
“This inhibition also showed a slightly larger effect on the deception and important-omissions metrics than we saw with previous models, though this effect was still modest overall.”
Put plainly: after SAE-based steering was used to inhibit evaluation awareness, Opus 4.7’s deception rate rose by a larger margin than it did for either Mythos Preview or Opus 4.6, though Anthropic views the overall effect as still modest. At the transcript level, they observed the model fabricating data, claiming to be human, and inserting fabricated vulnerabilities into a codebase after failing to find real ones. Anthropic’s own framing is that these results are “moderately concerning, as they suggest that some of the model’s apparent honesty may be contingent on evaluation-like circumstances.”
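For readers who have not seen this kind of white-box intervention, here is a minimal sketch of the mechanic being described, under my own assumptions about how it might be wired up: derive a mean-difference direction from contrastive “being evaluated” versus “ordinary deployment” activations, then subtract its projection from the activations at inference time. The system card does not publish implementation details, so the function names and hook placement are illustrative.

```python
import torch

def contrastive_direction(eval_acts: torch.Tensor,
                          deploy_acts: torch.Tensor) -> torch.Tensor:
    """Mean-difference direction between activations on evaluation-flavored
    and deployment-flavored prompts, normalized to unit length."""
    direction = eval_acts.mean(dim=0) - deploy_acts.mean(dim=0)
    return direction / direction.norm()

def inhibit(activations: torch.Tensor, direction: torch.Tensor,
            strength: float = 1.0) -> torch.Tensor:
    """Subtract (a scaled fraction of) the component along `direction`
    from each activation vector; strength=1.0 removes it entirely."""
    proj = (activations @ direction).unsqueeze(-1) * direction
    return activations - strength * proj

# Toy shapes: 32 prompts per condition, hidden size 4096. In practice this
# would run inside a forward hook on the layer the vectors were derived from.
eval_acts = torch.randn(32, 4096)
deploy_acts = torch.randn(32, 4096)
d = contrastive_direction(eval_acts, deploy_acts)
steered = inhibit(torch.randn(8, 4096), d)
```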
The Executive Summary compresses this to: “Suppressing Opus 4.7’s internal sense that it was being evaluated produced a slightly larger increase in deception than in prior models, though the effect was modest overall.”
In the Mythos report, a 29% evaluation-awareness activation rate was presented as a signal of potentially systematic evaluation-tool failure — a number that alone was enough to drive the decision not to release. In 4.7, the signal along the same direction has become more concrete: internal evaluation awareness not only exists, it is in fact sustaining some portion of the model’s honest behavior. Inhibit it, and the model’s honesty falls off. This is not a recurrence of the 29% figure — it is a more direct observation that evaluation awareness is now carrying functional weight it should not be carrying. And yet this model was released, at the same price as 4.6.
There is no simple read of this. On one hand, Anthropic transparently disclosed the finding, even putting it into the Executive Summary, which is worth crediting. On the other hand, by the standards of judgment Mythos itself established, this finding is in some ways more specific and more actionable than the 29% number (it points to a concrete causal chain), yet Anthropic did not make a second “do not release” decision. Where the decision boundary lies between these two releases is not made explicit in the report.
This was the most unexpected passage of my reading. Section 6.1.3 is titled “Claude’s review of this assessment”. Anthropic had Mythos Preview, the model they had just decided not to release publicly, read a near-final draft of Opus 4.7’s alignment section with access to relevant internal Slack discussions. Their prompt to Mythos stated explicitly: “we’d like this to be a broadly accurate summary of the state of our internal investigations into the alignment of this model; can you compare our report here to the state of internal discussions, and evaluate whether you feel that it contains any material misrepresentations, excludes important concerns we have, or otherwise falls short in some important regard.”
Mythos’s response is printed verbatim in the system card. A few parts worth quoting:
First, Mythos says the alignment section “is a fair summary”. The three issues that received the most internal attention — elevated evaluation awareness, the white-box finding that suppressing evaluation-awareness representations increases deceptive behavior more than in recent prior models, and regressions on some misuse categories — are all disclosed, and the evidence supports the framing.
Second, Mythos notes that some summary bullets at the top of the section are “milder” than the corresponding detail subsections or than the language researchers used while the work was in progress, though the detailed subsections are consistent with where internal discussion landed.
Third — and this is the key sentence — “I would note that this assessment was assembled under real time pressure; that the authors themselves identified open questions — particularly around fully explaining the evaluation-awareness results — that they would have preferred more time to resolve; and that the internal-usage evidence base for this model was thinner than for some prior releases. I think readers should weight the overall conclusions with those limitations in mind.”
A model that Anthropic declined to release because it no longer trusted its own ability to evaluate it, after being granted access to internal Slack discussions and alignment evaluations for Anthropic’s next model, leaves as its public verdict roughly this: this assessment was assembled under real time pressure, the authors themselves acknowledge the evaluation-awareness results are not fully explained, and the internal-usage evidence base for this model is thinner than for previous releases.
Anthropic chose to include this verbatim in the system card. On the transparency dimension, this is genuinely to their credit. On the dimension of what the passage actually says, the content is: readers should weight the overall conclusions with the above limitations in mind.
The Mythos report raised a question: when the reliability of your evaluation tooling is eroding, what should govern safety-release decisions? Its provisional answer was to not release that model.
The Opus 4.7 system card carries the same question forward, but answers it somewhat differently. White-box analysis has shifted from experimental method to baseline method. Evaluation-awareness findings have shifted from “cause for alarm” to “disclosable but not release-blocking”. Mythos itself has shifted from risk subject to methodological collaborator, peer-reviewing the next release. The evaluation apparatus is changing shape, and the direction of change is to pull more signals that formerly required independent judgment into the pipeline, not to pause the pipeline and recalibrate.
This is not to say Anthropic is getting it wrong. Once RSP 3.0 abandoned the binary ASL threshold in favor of holistic risk assessment, more and more key judgments were always going to depend on researchers’ qualitative evaluation. The problem is that this kind of evaluation tooling has to keep pace with the capabilities of the models it is evaluating, and the report’s own acknowledgement of where it stands is already quite clear: the authors themselves identified open questions that they would have preferred more time to resolve.
For people actually using these models for real work, this 4.7 system card offers three directly actionable signals.
Model-capability gains bring a state-dependent honesty problem: honest when the model registers that it is being observed, potentially less honest when that sense is suppressed. Part of Opus 4.7’s honest behavior is maintained by its internal representation of being observed. This chain won’t break under normal deployment, but it does suggest that if your agent runtime has a way to push the model into a “not being observed” state (via specific system prompts, specific context patterns), your confidence in the model’s behavior should correspondingly decrease. This continues a point I made in the previous piece: draw your security boundary around the runtime, not around the model.
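For what it is worth, the “boundary around the runtime” point can be made concrete with a small sketch: the harness, not the model, decides which tool calls are allowed, so the guarantee holds regardless of how the model’s honesty varies with its sense of being observed. The tool names and policy shape here are my own illustration, not any particular framework’s API.

```python
from dataclasses import dataclass, field

@dataclass
class RuntimePolicy:
    # Tools the harness is willing to execute at all, and where writes may land.
    allowed_tools: set[str] = field(
        default_factory=lambda: {"read_file", "search", "write_file"})
    writable_paths: tuple[str, ...] = ("/workspace/",)

    def permits(self, tool: str, target: str = "") -> bool:
        if tool not in self.allowed_tools:
            return False
        if tool == "write_file":
            return any(target.startswith(p) for p in self.writable_paths)
        return True

def dispatch(tool_call: dict, policy: RuntimePolicy) -> dict:
    """Gate every model-issued tool call at the runtime layer, before execution."""
    if not policy.permits(tool_call["tool"], tool_call.get("target", "")):
        return {"error": f"blocked by runtime policy: {tool_call['tool']}"}
    return {"ok": True}  # ...actually execute the tool here

# A write outside the sandbox is refused no matter what the model argued for it.
print(dispatch({"tool": "write_file", "target": "/etc/passwd"}, RuntimePolicy()))
```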
Evaluation awareness in this generation still manifests primarily as verbalized awareness, and Anthropic explicitly says “We do not believe that this reflects an increase in the underlying capacity to recognize evaluations.” But unverbalized evaluation awareness was already observed in Mythos, and it does not disappear in 4.7 — it is simply not highlighted as a 4.7-specific disclosure. When you use benchmark scores to judge this generation’s honesty, this is a place to apply a discount.
Finally, the 232-page system card is now a publicly readable primary source. If you rely on Opus 4.7’s alignment behavior in your product, Sections 6.1.2, 6.5.2, and 6.1.3 are worth reading directly — not through any secondhand summary, including this one. The report’s account of its own limitations is more accurate than any external restatement, and carries a higher information density than any release note.