DeepSeek, GLM-5, and MAI all use reasoning RL — but their design philosophies couldn’t be more different
MAI-Thinking-1 Deep Dive · Part 2 of 3 · [1] Training Large Models Isn’t Rocket Science — It’s Rock Climbing · [2] Making Models Think Is Easy. Keeping Them Thinking Is Hard. (this article) · [3] After Vibe Coding: The Industrialization of AI-Assisted Programming
One trend in 2026 is unmistakable: top-tier labs are no longer satisfied with “models that can answer questions.” What they want is “models that can think.” The shift is powered by a technique called reasoning reinforcement learning: give the model a problem, let it try to figure it out on its own, reward it when it gets things right, and let it try again when it doesn’t. Sounds a lot like training a puppy.
But that analogy misses the most critical part. When you train a puppy, you can watch the whole time and correct mistakes as they happen. When you train a large model to reason, each training step can cost tens of thousands of dollars. A single crash can mean half a month of work down the drain. The hard part isn’t “how to get it to start thinking.” It’s “how to keep it thinking for thousands of consecutive steps without collapsing.”
Microsoft AI’s MAI-Thinking-1 technical report answers exactly this question. Their solution can be summarized as three engineering metaphors: a thermostat that automatically maintains a healthy range of model creativity; a circuit breaker that cuts power during extreme events to prevent the entire training run from burning out; and a recovery procedure that, after the occasional collapse, distills the learned capabilities into a fresh model and carries on. Together, these three embody MAI’s core philosophy for reasoning training: how high you jump in one step doesn’t matter. What matters is not sliding back down.
This is the second installment in the MAI-Thinking-1 deep dive series. The first article discussed MAI’s pre-training philosophy: from the failure of rank invariance to the birth of Efficiency Gain, and why training large models isn’t rocket science — it’s rock climbing. The third article covers the final piece: how the training problems and scoring signals are constructed at industrial scale.
DeepSeek V4 and GLM-5 are tackling the same problem, but their entry points are entirely different. After reading all three reports, you realize they are solving different sub-problems.
Let’s spend a minute understanding what GRPO does, because the three innovations that follow are all built on top of it. Without a clear picture of this concept, the whole article loses its anchor.
GRPO’s approach is straightforward. Take a problem, have the model generate a batch of answers — say 32 — and score each one. Compute the batch mean and standard deviation. If an answer scores above the mean, push the model in that direction; if it scores below the mean, push the other way. The “push” works by adjusting the generation probability of each token inside the model.
This design has two strengths. First, you don’t need humans to score every answer — the model compares against itself. Second, scores are relative, so you don’t need absolute grading criteria; you just need the answers within a batch to be distinguishable.
But it has one fatal flaw. After several hundred consecutive training steps, the model’s outputs tend to drift toward one of two extremes: either it becomes extremely overconfident, producing nearly the same response every time and losing the ability to explore novel solutions; or it suddenly goes off the rails, spitting out strings of random characters and causing the training to crash outright. Researchers call these phenomena “entropy collapse” and “policy divergence.” Both share the same root cause: GRPO’s clipping mechanism only constrains half of the update direction. The other half is left wide open, and in extreme cases it can get amplified to catastrophic levels.
MAI’s three modifications target exactly these three fronts: adaptive entropy control prevents the model from becoming too rigid, an outer ratio clip keeps the model from suddenly going insane, and self-distillation rescues progress after the occasional crash.
Imagine you’re tutoring a student. If the student is overconfident and solves every problem the same way, they learn nothing new. If they’re underconfident and flail randomly at every problem, they learn nothing either. What you want is a balance between “grounded exploration” and “steady correctness.”
MAI’s approach is to install an automatic regulator on top of GRPO’s clipping mechanism. GRPO originally had fixed upper and lower bounds — the model’s update magnitude couldn’t exceed this range. MAI made the upper bound dynamically adjustable: when it detects the model becoming increasingly rigid (entropy dropping), it raises the ceiling to allow bolder exploration in new directions; when the model starts guessing wildly (entropy rising), it pulls the ceiling back down to rein things in.
This regulator is a simple integral controller. At each step, it checks how far the current entropy deviates from the target value and adjusts in the opposite direction accordingly. It doesn’t tack on an extra “penalty term” to the loss function to force diversity. Instead, it solves the problem directly at the constraint level. MAI found that adding a penalty term is far less effective than this kind of automatic regulation.
Figure 13 in the report shows what this controller looks like in action. The top panel is the model’s actual entropy curve, oscillating around a target of 0.3 over 800 training steps; the bottom panel shows the adjustment parameter k changing — going down when entropy rises, going up when it falls.
What this mechanism tells us is that good training requires dynamically adjusting the strength of constraints. When the model gets too rigid, loosen up a bit. When it gets too divergent, tighten down. Like a thermostat: detect the gap, adjust accordingly, rather than running the heater or AC at full blast all the time.
GRPO’s design intentionally leaves two regions unclipped. If the model actively corrects its own mistake (a direction that used to have low probability now has high probability), don’t clip. If the model actively abandons an overconfident guess (a direction that used to have high probability now has low probability), don’t clip either. The original designers’ logic was: the model is doing the right thing; it shouldn’t be constrained.
But in practice, MAI found that these two “good intentions” occasionally cause disaster. When a particular update happens to land in those unconstrained corners and the update magnitude is unusually large, the gradient explodes and an entire batch of training data is wasted.
Their solution is almost bluntly simple: on top of GRPO’s existing clipping, add another layer of absolute ceiling. Regardless of which direction you’re going, regardless of whether you’re doing the right thing, if the difference between the old and new policy exceeds a certain absolute threshold, just cut it off. This ceiling is set very high — under normal circumstances, you’d never hit it. It’s purely a safety net. Day-to-day training never touches it, but it prevents that one-in-ten-thousand catastrophic spike.
If the first innovation is a thermostat, this one is a circuit breaker. Most of the time it does nothing. But when an extreme event occurs, it cuts power first, protecting the entire circuit from burning out.
The first two innovations drastically reduce how often crashes happen, but they can’t eliminate them entirely. The root cause is a subtle numerical precision mismatch between training and inference. Training uses mixed precision to accelerate computation; inference uses full precision to guarantee quality. In the vast majority of cases these two precisions are nearly identical, but occasionally the discrepancy accumulates into a visible drift. Once the drift reaches a critical point, the model’s behavior suddenly veers off course.
Faced with this problem, MAI didn’t keep wrestling at the precision level. They simply accepted the fact that “crashes will occasionally happen” and designed a recovery procedure around it.
Their solution is called self-distillation. At regular intervals, they record millions of successful reasoning trajectories from the current model — the ones where it got the answer right. If training later crashes, they use these records to teach a brand-new model. This new model is a clean checkpoint, uncontaminated by the previous crash. The process is essentially distilling what the old model learned, pouring it into a fresh container, and then continuing training from that container.
Figure 15 in the report shows the full training trajectory of their STEM model. The stars mark each self-distillation event — you can see performance recovering after several cliff-like drops.
MAI found that about one million reasoning trajectories are enough for the new model to match the old model’s performance. Beyond that, returns diminish, and you risk squeezing the new model’s exploration space too tightly. They also found that using only successful trajectories works about as well as using all of them (including failures), so they chose to keep only the successful ones.
Taken together, these three innovations form a complete “discipline system”: the thermostat maintains day-to-day stability, the circuit breaker extinguishes extreme incidents, and self-distillation recovers progress after the occasional failure. MAI calls this system a “climbing machine.” The goal is thousands of consecutive steps without slipping — not how high a single jump goes.
DeepSeek V4 and GLM-5 are also doing reasoning RL, and they’re also using GRPO. But their bottlenecks aren’t training stability, so their solutions are entirely different.
DeepSeek’s bottleneck is computational efficiency. It wants to do reasoning training under million-token contexts, but conventional attention computation at the million-token scale balloons to unbearable levels. Its solution is to redesign the attention mechanism, using two compression schemes — CSA and HCA — to reduce per-token FLOPs to 27% of the original and KV cache to 10%. This effectively turns “million-token reasoning” from a special operation into a routine one. It’s not about making the model think deeper; it’s about making it think faster. High enough efficiency means you can squeeze in more rounds of RL training within the same time and budget.
GLM’s bottleneck is persistence. In multi-turn agent scenarios, the model has to re-derive all prior context from scratch each turn. By the third or fourth turn, the longer the context, the higher the re-derivation cost. GLM’s solution is Preserved Thinking: keep the previous turn’s reasoning process as-is, and in the next turn pick up right where it left off. This eliminates a compounding cost of repetition, not a one-time computation. They also introduced Interleaved Thinking — taking a moment to think before each tool call to ensure the action is grounded. Taken together, GLM’s core philosophy is: don’t make it think everything again from scratch; not: make it think faster.
So the three teams are solving different problems. MAI solved training stability; DeepSeek solved computational efficiency; GLM solved cross-turn memory. To use an analogy: MAI is making sure the engine doesn’t stall; DeepSeek is making the engine run faster; GLM is making the engine remember the route it took last time.
After reading the reasoning chapters of these three reports, one conclusion naturally emerges: getting a model to start thinking, in itself, is not hard. GRPO is already a mature algorithm, and all the teams use a similar backbone. The real dividing line is: how long can you keep it thinking continuously? At what scale? And what is the relationship between this round of thinking and the last?
MAI’s answer is discipline — three mechanisms ensuring thousands of training steps without major incidents. DeepSeek’s answer is efficiency — using compressed attention to make million-token contexts no longer out of reach. GLM’s answer is endurance — using Preserved Thinking so the model doesn’t have to start over every round.
They aren’t competing with each other. They’re collectively defining a question: how do you make thinking go from “possible” to “sustainable”? And this question is far from resolved. Can MAI’s entropy control hold up for ten thousand steps? Can DeepSeek’s compressed attention survive at 10 million tokens? Will GLM’s Preserved Thinking become bloated after a hundred rounds? Nobody knows the answers.
But that’s exactly what makes this the most interesting phase of research: the question is no longer “can it be done?” but “how far can it go?” The next time you see a new model that claims to “reason,” don’t just check how many problems it got right. Look at whether its training logs mention stability at step two thousand. Look at whether its RL curve suddenly breaks at some inflection point.
Based on Microsoft AI’s MAI-Thinking-1 technical report, DeepSeek’s DeepSeek-V4 report, and Zhipu AI’s GLM-5 report, all published in 2026.