Do language models need sleep?

Summary

A paper out of CMU and UMD shows that what limits SSM-attention hybrids on deep reasoning isn't memory capacity but the number of passes over the context. A "sleep" mechanism runs N recurrent passes at window eviction and consolidates the information into fast weights, so computational depth can be shifted offline. This isn't a property of today's chatbots: these are experimental models on synthetic benchmarks. But the trend as you scale N is clear and worth watching.

"Do language models need sleep?" sounds like a headline begging for an eye-roll. But underneath the metaphor there's a concrete engineering claim: what limits hybrids on deep reasoning isn't how much they remember, it's how many times they get to recompute the context. And that recomputation can be done offline, at a moment when nobody's asking anything.

Sleep that isn't sleep

Let's get the metaphor out of the way first. No ChatGPT or Claude "sleeps" today. This is an experimental setup with small models, on the order of 1.4B to 2B parameters and smaller, on synthetic and small benchmarks. "Sleep" is the name for an offline compute phase, not a state of consciousness.

The mechanism fires at context window eviction. Eviction means the model is about to throw away the accumulated context from active memory, from the KV cache. And that's exactly why it makes sense to run consolidation right here: just before the information is lost, the model rewrites it into a more permanent state inside the SSM blocks. Eviction isn't a random trigger, it's the last moment when there's anything left to consolidate.

Concretely: every L tokens, before the model wipes the KV cache, it runs over the accumulated context once more. Or twice. Or N times. A gated update rule (more on that with the equation below) consolidates that context into fast weights inside the SSM blocks, and only then does the KV cache get dropped. Fast weights are weights that change during inference: a fast, temporary memory, as opposed to the fixed trained (slow) weights that stay the same the whole time. It's a chunk of extra compute the model takes between inputs, not some form of rest. The metaphor ends here. From now on we're talking loops and matrices.
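The control flow described above can be sketched in a few lines. This is only a skeleton under my own assumptions: `stream_with_sleep`, `consolidate_pass` and `count_tokens` are hypothetical names, the "KV cache" is a plain list of token ids, and the pass itself is a placeholder rather than anything from the paper.

```python
from typing import Callable, List, Tuple


def stream_with_sleep(
    tokens: List[int],
    window: int,
    n_passes: int,
    consolidate_pass: Callable[[int, List[int]], int],
    state: int,
) -> Tuple[int, List[int], int]:
    """Feed tokens one by one. Every `window` tokens, run `n_passes`
    consolidation passes over the cached tokens, then drop the cache."""
    kv_cache: List[int] = []      # stand-in for the attention KV cache
    passes_run = 0
    for tok in tokens:
        kv_cache.append(tok)
        if len(kv_cache) == window:           # eviction point
            for _ in range(n_passes):         # the "sleep" phase
                state = consolidate_pass(state, kv_cache)
                passes_run += 1
            kv_cache = []                     # only now is the cache dropped
    return state, kv_cache, passes_run


def count_tokens(state: int, cached: List[int]) -> int:
    """Hypothetical placeholder pass: just counts the tokens it was shown."""
    return state + len(cached)


state, leftover, passes = stream_with_sleep(
    list(range(10)), window=4, n_passes=3,
    consolidate_pass=count_tokens, state=0,
)
print(state, leftover, passes)
```

With 10 tokens and a window of 4 there are two evictions, so three passes each gives six passes in total, and the last two tokens are still sitting in the cache when the stream ends.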

Why an ever-growing cache isn't the answer

When I want a model that's better at reasoning, the first instinct is to give it more memory. Bigger cache, longer window. But that instinct runs into the numbers.

Attention compute grows quadratically with context length. KV cache memory grows linearly. That means "just add more room" is a strategy that either chokes on compute cost or on cache size, whichever runs out first. With long contexts you always hit a wall. And even if your hardware could keep up, you still haven't answered whether more memory actually helps with the thing you're solving.
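To put rough numbers on the two growth rates, here is a back-of-the-envelope sketch. The layer count and model width are hypothetical values I picked for illustration, and the cache formula assumes one key and one value vector of width `d_model` per token per layer, which real models often compress.

```python
def attention_pairs(context_len: int) -> int:
    """Query-key pairs scored by full self-attention over the context."""
    return context_len * context_len


def kv_cache_values(context_len: int, n_layers: int, d_model: int) -> int:
    """Numbers stored in the cache: a key and a value vector per token per layer."""
    return 2 * context_len * n_layers * d_model


N_LAYERS, D_MODEL = 24, 2048   # hypothetical model shape, not from the paper

for T in (2_000, 4_000, 8_000):
    print(T, attention_pairs(T), kv_cache_values(T, N_LAYERS, D_MODEL))

# Doubling the context: how much do compute and memory grow?
compute_growth = attention_pairs(4_000) / attention_pairs(2_000)
memory_growth = (kv_cache_values(4_000, N_LAYERS, D_MODEL)
                 / kv_cache_values(2_000, N_LAYERS, D_MODEL))
print(compute_growth, memory_growth)
```

Doubling the context quadruples the pairwise attention work and doubles the cache, whatever the constants are.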

The real ceiling is depth, not capacity

Here's the uncomfortable finding. The authors show that the performance of vanilla hybrids degrades as reasoning depth grows, even when the amount of information to store stays constant.

It's not that the model can't fit what it needs in memory. The information fits. The problem is how many reasoning steps the model has to perform over it. Chain depth, not data volume. That's a completely different axis than capacity, and the usual "add memory" intuition doesn't touch it. As the number of steps and intermediate computations grows, the hybrid falls apart, even with plenty of memory to spare.

What happens during sleep

The mechanism rests on fast weights in the SSM blocks, specifically on Gated Delta Networks (GDN, an SSM variant where the state is updated by a gated delta rule). As a reminder, fast weights are that fast, inference-time-mutable memory, not the trained slow weights. The update is a gated Hebbian-like outer-product rule:

S_t = α_t · S_{t−1} + β_t · v_t · k_tᵀ

The state S is a matrix. v_t and k_t are the value and key vectors, just like in attention, and it's precisely their outer product v_t · k_tᵀ that makes the state S a matrix rather than a vector. α_t and β_t are gates, so the model writes and forgets selectively instead of blindly overwriting the state on every token. This is the core I'll come back to, because it's what makes the whole construction hang together: memory isn't a growing KV cache, it's a fixed-size matrix-valued state that gets written to recurrently.
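A minimal sketch of that update, written with plain Python lists so it runs without dependencies. The function names `gated_update` and `read` are mine, and the gate values are hand-picked constants; in a real model they would be computed from the input by learned projections.

```python
def gated_update(S, k, v, alpha, beta):
    """One step of S_t = alpha * S_{t-1} + beta * v k^T.
    S has len(v) rows and len(k) columns, and keeps that shape."""
    return [[alpha * S[i][j] + beta * v[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]


def read(S, q):
    """Query the state with a key-like vector q: returns S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


d_k, d_v = 3, 2
S = [[0.0] * d_k for _ in range(d_v)]        # empty memory, d_v x d_k

k1, v1 = [1.0, 0.0, 0.0], [2.0, -1.0]
k2, v2 = [0.0, 1.0, 0.0], [0.5, 4.0]

S = gated_update(S, k1, v1, alpha=1.0, beta=1.0)   # write (k1, v1)
S = gated_update(S, k2, v2, alpha=0.5, beta=1.0)   # decay old content, write (k2, v2)

print(read(S, k2))   # the fresh association comes back intact
print(read(S, k1))   # the older one comes back at half strength
print(len(S), len(S[0]))   # the state is still d_v x d_k
```

Two writes, and the memory is the same 2 by 3 matrix it started as. The second write used α = 0.5, so the first association is still readable but faded, which is the "forgets selectively" part in miniature.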

At window eviction the model runs N such recurrent passes over the accumulated context and thereby consolidates the information into that state. What happens between passes: the model repeatedly runs over the same accumulated tokens with the same weights and applies the update rule above each time. Each subsequent pass reads the state S from the previous one, so more of the context's structure gradually "settles" into it, without the trained weights changing. No gradient, just more passes over the same weights.
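To see why extra passes can buy depth rather than capacity, here is a deliberately hand-built toy, not the paper's model. The context is a list of graph edges, keys and values are one-hot node vectors, and I hard-code what a trained network would have to learn: on every pass after the first, the value written for an edge is whatever the previous pass's state returns for the edge's destination. Each pass also starts from a zeroed matrix and uses the old state only for reads, which is a simplification for the toy. All names (`sleep`, `lookup`, `one_hot`) are hypothetical.

```python
def one_hot(i, n):
    return [1.0 if j == i else 0.0 for j in range(n)]


def gated_update(S, k, v, alpha=1.0, beta=1.0):
    """S_t = alpha * S_{t-1} + beta * v k^T, on plain lists."""
    return [[alpha * S[i][j] + beta * v[i] * k[j] for j in range(len(k))]
            for i in range(len(v))]


def read(S, q):
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


def sleep(edges, n_nodes, n_passes):
    """Run n_passes (>= 1) over the same edge tokens.
    Pass 1 stores the 1-hop map; each later pass reads the previous
    state to extend the stored map by one more hop."""
    S = None
    for _ in range(n_passes):
        S_new = [[0.0] * n_nodes for _ in range(n_nodes)]
        for src, dst in edges:
            k = one_hot(src, n_nodes)
            target = one_hot(dst, n_nodes)
            v = target if S is None else read(S, target)
            S_new = gated_update(S_new, k, v)
        S = S_new
    return S


def lookup(S, node):
    out = read(S, one_hot(node, len(S)))
    return out.index(max(out))


edges = [(i, (i + 1) % 6) for i in range(6)]   # a 6-node cycle
one_pass = sleep(edges, n_nodes=6, n_passes=1)
four_passes = sleep(edges, n_nodes=6, n_passes=4)
print(lookup(one_pass, 0), lookup(four_passes, 0))
```

The matrix is 6 by 6 in both cases. What changes with N is how many hops of the chain are already folded into it when a query arrives.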

A key detail: at N=1 the whole mechanism reduces to an ordinary SSM-attention hybrid. So "sleep" is a dial on top of the baseline, not a different architecture. You just push N above one and watch what happens.

It's worth distinguishing this from test-time-training methods. Those typically do one gradient step per chunk. Here there's no gradient at inference at all: the memory update is a learned recurrent forward pass. The model doesn't keep learning, it just recomputes what it already knows a few more times.

The numbers that make sense of it

The general trend is monotonic: more loops N means better accuracy, and the biggest gain shows up on tasks with a long reasoning chain. A summary of four benchmarks:

Benchmark	Model	N	Accuracy, before → after
Cellular automaton (Rule 110, t=32)	4-layer GDN-attention hybrid, hidden dim 256	1 → 4	~10% → over 30%
k-hop graph retrieval (16-hop)	10-layer Jet-Nemotron, d=512	1 → 4	improvement only with 4 loops
GSM-Infinite (6-op, L=2000)	Jet-Nemotron 2B, SSM-attention hybrid	1 → 6	0.742 → 0.812
GSM-Infinite (8-op, L=2000)	Jet-Nemotron 2B, SSM-attention hybrid	1 → 6	0.351 → 0.388
GSM-Infinite (6-op)	Ouro 1.4B, depth-recurrent attention-only	1 → 4	0.419 → 0.615
Sliding-window eviction (L=512)	Ouro 1.4B	1 → 4	0.596 → 0.905

A few words on the table. Jet-Nemotron and Ouro are experimental language models (Jet-Nemotron as an SSM-attention hybrid, Ouro as a depth-recurrent attention-only model) that the authors use as the basis for measurement. On the cellular automaton, predicting the state of Rule 110 after t steps, the model without sleep hits only around 10% exact accuracy at t=32, even after roughly 5B training tokens, whereas with three to four loops it clears 30%. Triple the baseline just by recomputing the context more times. The k-hop graph retrieval task, searching across k hops in a graph (cycles up to 75 nodes, k from the set {1, 2, 4, 8, 16}), only starts to improve on the hardest 16-hop variant with four loops. The shallower variants don't need it, which fits the depth thesis exactly.
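For a feel of what the cellular automaton task asks, here is a small Rule 110 simulator. The rule table itself is standard (the new cell is the bit of the number 110 selected by the left, center and right neighbors). The wrap-around boundary, the row width and the starting row are my assumptions, not the paper's setup.

```python
RULE = 110  # binary 01101110: one output bit per 3-cell neighborhood


def step(cells):
    """Apply Rule 110 once to a row of 0/1 cells (wrap-around edges assumed)."""
    n = len(cells)
    nxt = []
    for i in range(n):
        left, center, right = cells[(i - 1) % n], cells[i], cells[(i + 1) % n]
        index = (left << 2) | (center << 1) | right   # neighborhood as 0..7
        nxt.append((RULE >> index) & 1)
    return nxt


def evolve(cells, t):
    """State after t steps: t sequential applications of `step`."""
    for _ in range(t):
        cells = step(cells)
    return cells


start = [0, 0, 0, 0, 1]
print(step(start))
print(evolve(start, 2))
print(evolve([0] * 31 + [1], 32))
```

The direct way to get the state at t=32 is 32 sequential steps, each depending on the last. The input stays the same size while the chain of dependent computations gets longer, which is the kind of depth the article is talking about.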

The strongest demonstrated effect is on sliding-window eviction. The window is L=512 tokens, while the sequence is on the order of 2000 to 3000 tokens long, so the window is 4 to 6 times smaller than the full sequence. Ouro 1.4B with N=4 goes from 0.596 to 0.905, a relative improvement of 52%. Here the model evicts memory aggressively, and precisely where it's most short on capacity, recurrence helps it the most.
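The arithmetic behind those figures is easy to check. The sketch below uses only numbers from the table and the paragraph above; `relative_gain` is just a helper name.

```python
def relative_gain(before: float, after: float) -> float:
    """Improvement as a fraction of the starting score."""
    return (after - before) / before


results = {
    "GSM-Infinite 6-op, Jet-Nemotron 2B": (0.742, 0.812),
    "GSM-Infinite 8-op, Jet-Nemotron 2B": (0.351, 0.388),
    "GSM-Infinite 6-op, Ouro 1.4B": (0.419, 0.615),
    "Sliding-window eviction, Ouro 1.4B": (0.596, 0.905),
}

for name, (before, after) in results.items():
    print(f"{name}: {relative_gain(before, after):.1%}")

eviction_gain = relative_gain(0.596, 0.905)
window_ratios = (2000 / 512, 3000 / 512)   # sequence length over window length
print(round(eviction_gain * 100), [round(r, 1) for r in window_ratios])
```

The eviction row rounds to 52%, and the sequence is about 3.9 to 5.9 windows long, which is where "4 to 6 times smaller" comes from.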

One note on reading these numbers honestly. Absolute accuracy is often low: the 8-op task ends up around 0.39, which still isn't good. The story here isn't absolute performance, it's the trend as you scale N. That's the measurable part.

Where it gets sticky

It isn't free, and the authors admit as much. The cost doesn't disappear, it just moves.

Training is recurrent across windows, which means it can't be fully parallelized along the sequence. Training cost grows roughly linearly with N. Forward and backward passes that are N times deeper make training slow and unstable. On top of that, sliding-window eviction needs a warm-up, where you first train just the SSM layers for one epoch and only then the whole model.

Bottom line: you buy inference performance with more expensive, more fragile training. You're shifting compute from the moment the user asks to the moment you build the model.

What to take away

This isn't a product. It's a point in a landscape that already holds test-time-training and linear attention, where architectures are slowly breaking away from the idea that memory has to be a growing KV cache. Linear attention and SSMs can be read as a recurrent update over a fixed matrix, as fast weight memory. This work adds a clean experiment to that landscape.

And that experiment measures one thing in isolation for the first time: reasoning depth is its own compute budget, independent of memory capacity. "Sleep" just moves compute from inference to training, where it gets more expensive roughly linearly with N and makes training slower and less stable. But a budget you can measure is a budget you can plan for. Even for a time when nobody's asking anything.

Paper: Sangyun Lee, Sean McLeish, Tom Goldstein, Giulia Fanti, "Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference", CMU and University of Maryland, arXiv 2605.26099. It's a May 2026 preprint, so it's new and not yet peer-reviewed.