Continual Learning Needs a Bouncer

On Carmack's sequential Atari problem, an agent that learns without forgetting, and what breaks the moment you remove the verification step.

Amit Tandon
Amit Tandon
Founder, Rekursor · ·7 min read

Carmack says continual learning without catastrophic forgetting has no line of sight. On his sequential Atari problem, our agent learns a new game without forgetting old ones. It uses a different kind of learning with a system bouncer, and when you remove the bouncer, the forgetting comes right back.


John Carmack's Physical Atari work at Keen makes continual learning palpable. A camera watches the screen, a robot moves a joystick, and a reinforcement-learning policy has to keep improving in the real world.

The failure mode is easy to see. It learns Breakout and then Pong. But it goes back to play Breakout again, and crashes. The skill to play Breakout is, in Carmack's words, "largely destroyed."

That is not a small bug. It is the continual-learning problem in its most visible form.

1. The Open Problem

Continual learning is described as an agent that can learn from its own experience without forgetting what it already knows.

Over the past year the field has converged on this as the frontier. a16z describes deployed models as amnesiac systems surrounded by external memory. Bessemer's 2026 infrastructure roadmap names continual learning as a missing capability and, tellingly, names a governance problem alongside it: if a system keeps changing after deployment, how do you know it is changing safely? Richard Sutton's Alberta program and John Carmack's Keen Technologies have made lifelong learning central to their research. Everyone now agrees agents must keep learning.

The standard approach is to keep learning in the model's weights. The policy acts, receives a reward, and updates. That is exactly right for some things: a robot's camera, joystick, servo delay, or physical body. The robot has to learn its own reflexes.

But when new learning lives in the same parameters as old learning, every update can disturb everything the agent already knew. Improve one behavior and damage another. Play Pong, forget how to play Breakout.

2. Keen's Answer: Keep Learning in the Body

Keen is building a physical robot that sees the Atari screen, moves the joystick, handles latency, and adapts to the exact machine it is running on.

That matters. Because they showed that a policy trained on one robot performed worse on a second, supposedly identical robot. Continual online learning recovered the performance, but the improvement slope was comparable to learning from scratch.

The lesson we take away from that is that a robot has to learn at least two different things:

  1. The reflexes. How this particular robot moves this joystick under this camera and this latency.
  2. The playbook. What situations matter, what actions are useful, what failures recur, what strategies are worth keeping.

Keen's robot learns by changing the player.

Our system learns by adding to the playbook.

Those are complementary. The player still has to learn its own reflexes. But the playbook should not have to be rediscovered by every robot in a fleet or every AI agent doing similar tasks.

3. Our Experiment: Invent, Prove, and Keep

We are working on a different way of learning.

Instead of treating continual learning as only a weight-update problem, we separated it into three operations:

  1. Discover a new skill from experience.
  2. Prove that it helps.
  3. Refuse to keep it unless it also proves it does not break anything the agent already gets right.

The third step is what we are calling the bouncer.

It autonomously invents the skill. When the agent fails, it does not retrieve a stored answer or mutate at random. It constructs a new, executable skill out of its own failures. This is aimed at the step Sutton has said he has no proposal for in his OaK system. We will not claim the general problem; we will claim the instance: from its own failures, the agent built skills that appeared in none of its inputs and measurably worked. We used autonomous construction, not retrieval.

Before the hardened bouncer run, we first tested the system's ability to discover new abilities: getting better at a game from experience.

In that earlier discovery-only run, we started with a clean non-pixel game-state signal. The agent looked at its own Ms. Pac-Man failures, proposed many possible corrections, and kept the few that actually helped. Those retained skills lifted a held-out Ms. Pac-Man score by +290 while Pong, Breakout, and Freeway showed zero measured regression. On a fresh reuse run, Ms. Pac-Man moved from 30 to 320, the prior games again showed zero measured regression, and the committed skills were actually used.

That answered the first question: can the agent discover something useful from its own failures and turn it into a durable playbook entry? Answer: yes.

Then, in a separate hardened Stage 2 run, we tested the bouncer.

That run used the same four games, the same agent family, and the same sequential-learning setup, but with a different baseline because it was a separate protocol. The key difference inside that run was whether verified admission was on or off: bouncer or no bouncer.

Scorecard showing what changed when the bouncer was removed: with the bouncer, Ms. Pac-Man improved from 120 to 357, prior games were preserved, and 0 of 11 harmful skills were admitted; without the bouncer, Ms. Pac-Man fell from 120 to 100, Pong broke, and 3 of 3 harmful skills were admitted.

 

With the bouncer on:

  • Ms. Pac-Man: 120 -> 357
  • Pong, Breakout, Freeway: zero measured regression
  • harmful skills admitted: 0 of 11

With the bouncer removed:

  • Ms. Pac-Man: 120 -> 100
  • Pong broke: 0 -> -2.7
  • harmful skills admitted: 3 of 3

This indicates that more learning without verification can be worse than no learning at all.

4. What Changed

This result changed our own perspective on what learning means.

It is tempting to say: hey, just move learning outside the weights and forgetting goes away. We tested that, and it is false. An append-only store without verified admission can still accumulate bad lessons and produce negative transfer. The agent plays worse.

The playbook alone is not the answer. What you write down in the playbook is just as important.

But admission only works because the learning unit is separable. A gradient update is a smear across the policy. On the other hand, a skill artifact can be tested and then admitted, rejected, disabled, or removed.

That is the representational distinction. Standard RL learns in weight space: it changes what the policy is. Retrieval changes what is in context, but does not persistently learn. Our result is a third axis: learning by adding tested artifacts outside the weights, then verifying whether those artifacts warrant retention.

5. The Scoped Claim

We did not test pixels-to-joystick control. We tested the continual-learning architecture of the player. Our experiment reads game state in simulation and tests the continual-learning leg directly.

The claim we are making is:

On four Atari games, given a quality reading of the world, a single agent discovered useful skills from its own failures, stored them outside the weights, rejected every candidate in this run that helped the target while harming a prior game, improved the target on every seed, showed zero measured regression on prior games in this protocol, and reproduced the failure mode the moment admission was removed.

That maps back to our legal work in Harvey's Legal Agent Benchmark from prior blog posts. The same pattern showed up in a different substrate: failed runs converted into reusable skills, held-out lifts, and zero regressions. We have also tested the same admission principle against legal judge outcomes, where it admitted clean improvements and rejected revisions that displaced already-passing criteria.

The domains differ but the loop is the same:

discover from failure, verify, admit or reject, reuse.

6. Conclusion

Carmack's Physical Atari makes the stakes of continual learning visceral: an agent that learns a second game but in so doing, forgets how to play the first will never compound.

Keen's answer is right for the reflexes: let each robot keep learning online. But the playbook should not have to be rediscovered by every robot. It should live where it can accumulate.

Composed, the two halves give you a robot whose playbook does not have to start over. Scaled out: a fleet where every robot can inherit every verified skill the fleet has earned, and only trains its own hands.

The lesson is sharper than "use memory" and broader than "add a gate":

Continual learning can happen outside the weights, but durable learning belongs in a verified, append-only store, where nothing gets in until it proves it does not break anything.

A bouncer at the door helps learning compound instead of being forgotten.