Contra Cognitive Core
whipping boy
Karpathy characterized Gemma 3n as an entry in > The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. … It doesn't know that William the Conqueror's reign ended on September 9, 1087, but it vaguely recognizes the name and can look up the date. It can't recite the SHA-256 of the empty string as e3b0c442..., but it can calculate it quickly should you really want it.
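For what it's worth, the hash checks out, and computing it really is trivial; a one-liner with Python's standard hashlib:

```python
import hashlib

# SHA-256 of the empty string: the kind of thing a cognitive core
# should compute on demand rather than recite from memory.
print(hashlib.sha256(b"").hexdigest())
# e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
```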
The cognitive core is, roughly, a diagnosis and a prescription. Models know too much stuff; better if they didn’t. I have got to complain about both halves, especially the second. There is also a prediction: we’ll use small models, which will run super fast on-device and know less stuff than bigger models, but make up for it with aggressive tool use. Karpathy describes all this elegantly; it’s not what I want to talk about. I want to be careful: the diagnosis and the prescription can each be cashed out in different ways.
For a start: Models know too much stuff. Karpathy on Dwarkesh (https://www.dwarkesh.com/p/andrej-karpathy): > I think that’s probably holding back the neural networks overall because it’s getting them to rely on the knowledge a little too much ... one thing they’re not very good at, is going off the data manifold of what exists on the internet.
Maybe encyclopedic knowledge gets in the way. Zvi Mowshowitz says: Skill Issue (capitalization his; he is good at imbuing Proper Concepts). It is strictly better to know more stuff: if it doesn’t help, you don’t have to use it. In fairness, most issues are skill issues. Here is one story. Models which know too much are pathological; for example, they could habitually pull from relevant-looking knowledge, and so fail to really “try” at novel problems, even when they have the latent capacity. I think this version of the CC really is a Skill Issue, and the answer is Get Good. The strongest model will have encyclopedic knowledge and know what to do with it. Of course that does not tell us the right path to get there. So take the CC diagnosis as being about training, not test time.
Models know too much junk > … when you and I think of the internet, you’re thinking of like The Wall Street Journal. That’s not what this is. ... It’s some like stock tickers, symbols, it’s a huge amount of slop and garbage ... because the internet is so terrible, we have to build really big models to compress all that. Most of that compression is memory work instead of cognitive work.
So part of the CC is data efficiency. We waste training compute, and model parameters, memorizing garbage. This means there’s headroom for small models with better data. It also means that we can keep scaling big ones. Even if “memory work” does crowd out “cognitive work,” one doesn’t have to stunt the other; bigger models will benefit from more of both. The CC points at something more interesting.
Models know too much stuff: it ruins their training > We humans are not actually that good at memorization … That’s a feature, not a bug, because it forces you to only learn the generalizable components. Whereas LLMs are distracted by all the memory that they have of the pre-training documents
Following the diagnosis, the prescription: > I want to remove the memory, ... so that they have to look things up, and they only maintain the algorithms for thought, and the idea of an experiment, and all this cognitive glue of acting.
Most helpfully: > We need intelligent models to help us refine even the pre-training set to just narrow it down to the cognitive components. Then I think you get away with a much smaller model because it’s a much better dataset and you could train it on it. But probably it’s not trained directly on it, it’s probably distilled from a much better model still.
The problem with memorization is not that it wastes parameters (it does that too). I am not sure exactly what Karpathy means by “distracted”; I imagine it is something subtle, and I will do my best to fill in an unsubtle explanation. There are plenty of off-ramps if you prefer a boring version of the CC idea. Here is one: I am sure training data could be generically much better.
What do we train models for? “Minimizing cross-entropy” – the formal objective – is, by itself, content-free; what we care about is minimizing loss over the internet (for a start). To compress the internet a model will have to do something interesting. It turns out that the internet is really rich, so a model must develop a correspondingly rich set of internal representations. Stronger: a model will have to develop something like “reasoning.” This is Ilya’s old line.
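For concreteness, the formal objective is just next-token cross-entropy over a corpus $\mathcal{D}$ (standard notation, mine rather than Karpathy’s):

$$
\mathcal{L}(\theta) \;=\; -\,\mathbb{E}_{x \sim \mathcal{D}} \left[ \sum_{t=1}^{|x|} \log p_\theta\!\left(x_t \mid x_{<t}\right) \right]
$$

Nothing in $\mathcal{L}$ distinguishes loss reduced by memorizing $\mathcal{D}$ from loss reduced by modeling whatever generated it; everything interesting comes from what $\mathcal{D}$ is.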
The CC disagrees: training against the internet, memorization and reasoning are worse than orthogonal. If a model has a path to reduce loss by leaning on its memory rather than reasoning, and memorizing is cheaper, that is what it will learn. The concern is that memorization-heavy strategies will compete with, and tend to dominate, reasoning strategies. In a gesture: memorization is greedy … and so is gradient descent. Unfiltered, the internet gives you an unfriendly loss landscape. The product is something like a talented kid whose school let them coast, so they never learned how to grind. Then we get to post-training. Models will use their encyclopedic knowledge as a crutch: in a given rollout, the easiest way to get a reward will often involve leaning on some bit of esoterica rather than reasoning through the problem. So it is hard to bootstrap; RL will just teach models to get really good at leveraging this vast pretraining knowledge. Granted, it is funny to complain here. Famously, the strong priors that models get from pretraining are the reason RL is feasible at all. Pretraining knowledge has the interesting effect of jumping capabilities up pretty high, then plateauing them unless RL’d very carefully. This is opposed to, well, not jumping them at all. If we are inclined to make mistakes about how good the models are – that’s kinda on us.
There is a certain Cartesian flavor here. Rationalist not in the Berkeley-group-house sense but the older one. I do not think memory and reasoning make good categories here. (1) They want to describe a lot of the same stuff; further, if they are tangled in training I doubt it’s incidental (this contra Karpathy’s empirical claim, which I cashed out as “worse than orthogonal”). (2) Supposing we trained a “pure reasoner,” it would be much less appealing than it sounds. (1) and (2) are separate claims, but take them together. We have got some vague ideas about the “nature of intelligence” and also about the nature of work, i.e. the stuff that we measure intelligence against. We are confused about both. The two are symbiotic: insofar as we manage answers on either side, each will be leaning hard on the other.
William the Conqueror is a good example. Karpathy says you want a model which doesn’t know the details of William’s life, but recognizes the name and can go from there. Vaguely recognizing William requires knowing a lot of stuff: you have some ideas about England, about people and names, about conquering. The model which vaguely remembers is much more like one that has memorized than one which doesn’t know. Or, stronger, vaguely recognizing is of-a-kind with knowing. And then: reasoning tends to be about something. Any particular facts you might learn about William are interesting insofar as you have a sense of what William might have been like, how he might have failed, how things might have gone instead. We should not be unkind to historians. They are doing something besides making a big pile of facts; there is plenty of reasoning involved. But it is reasoning wrapped up with a whole set of particulars.
We might imagine that the way you learn to reason like this is similarly entangled. Some data is, in the CC’s sense, Junk, e.g. stock tickers. For Junk you have got orthogonality, because there is no reasoning strategy to learn that could possibly help reduce loss. (Incidentally you will still want the model to know some Junk, e.g. even ARC-AGI keeps “core knowledge priors.”) However the complement of Junk isn’t Not Junk. Most stuff, like the WSJ, or the Wikipedia page for William the Conqueror, is Kinda Junk. It could encourage reasoning strategies; at least some memorization will still be necessary (we do not expect even very strong models to, in the course of training, derive the details of William’s life). And of course you can always memorize more than you need to. It is because Kinda Junk is so expansive that “worse than orthogonal” is an interesting claim. But it is when we take most stuff as Kinda Junk that the prescription – we should clean out pretraining data, narrowing it down to the “cognitive components” – looks strangest.
Jeff Dean has numbers every engineer should know. Maybe it is super convenient for every engineer to know these numbers. Or, taking Jeff Dean seriously: having the knowledge on hand will shape how you see things, e.g. when you are looking at a latency you want to know, naturally, whether it is alarmingly bad or perfectly normal or suspiciously low. Tricky: for humans, picking up a bunch of esoterica tends to be a natural byproduct of mastery. Clearly it is not sufficient, e.g. people who know a bunch of trivia. Sadly, memorizing some latencies would span a relatively small portion of the distance between me and Jeff Dean.
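A few of those numbers, wired into the kind of gut-check the paragraph describes (a toy sketch; the figures are the oft-circulated approximations from Dean’s list, and the thresholds are mine):

```python
# Approximate latencies (nanoseconds) from Jeff Dean's classic list.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "main memory reference": 100,
    "read 4 KB randomly from SSD": 150_000,
    "round trip within same datacenter": 500_000,
    "disk seek": 10_000_000,
    "packet CA -> Netherlands -> CA": 150_000_000,
}

def gut_check(op: str, observed_ns: float) -> str:
    """Is an observed latency alarming, normal, or suspicious? Thresholds are arbitrary."""
    ratio = observed_ns / LATENCY_NS[op]
    if ratio > 10:
        return "alarmingly bad"
    if ratio < 0.1:
        return "suspiciously low"
    return "perfectly normal"

print(gut_check("main memory reference", 2_000))  # alarmingly bad
```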
Print out Jeff’s numbers and put them on your desk. Why can’t you “factor out” the whole problem of memory? If you don’t know something, look it up. This idea is at home in the CC parcel. We get the CC by way of contrast; it is set against the excesses of the current paradigm (e3b0c442…). The CC is right that you shouldn’t use the model as a DB or encyclopedia; rather, you should give the model an encyclopedia. Probably you will want them tightly coupled, but that is another thing. Leaving it there defers the problem to search, and search is not so easy. It imagines the most ambitious version of search while simultaneously supposing something CC-shaped is good for the job.
Take “agentic search.” Give the model grep and leave it alone. This, we are informed, is very bitter-lesson-pilled. I think it’s right, but for relatively subtle reasons, which I’ll treat in a forthcoming piece. I am pretty sure the reason models are so good at grepping around a repo is that they have strong natural intuitions for what a repo might look like. They have encyclopedic knowledge of conventions, and they draw associations based on what they see, so that even the thin information you get from grep -> ... -> head -5 lets you infer a good deal about the whole codebase, as in the sketch below. There is reasoning going on, but not the kind you can expect from a pure reasoner. While a model doing codebase search may experience a lot of “vague recognition,” the ability does not come from a matching class of “vague knowledge” but from a huge pile of super granular, over-detailed stuff which is Kinda Junk.
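To make the thin channel concrete, here is roughly the shape of the loop (my sketch, not anyone’s actual agent scaffold; assumes a Unix grep on PATH):

```python
import subprocess
from itertools import islice

def grep(pattern: str, path: str = ".") -> list[str]:
    """Recursive grep, returning `file:line:text` hits."""
    out = subprocess.run(["grep", "-rn", pattern, path],
                         capture_output=True, text=True)
    return out.stdout.splitlines()

def head(path: str, n: int = 5) -> str:
    """First n lines of a file: a few hundred bytes of evidence."""
    with open(path) as f:
        return "".join(islice(f, n))

# grep -> ... -> head -5: a handful of hits plus a few header lines.
# Thin evidence, unless you have strong priors over repo conventions,
# in which case it tells you the layout, the framework, the house style.
hits = grep("def main")
if hits:
    print(head(hits[0].split(":", 1)[0]))
```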
Even if a pure reasoner is a good category, we may not have that many pure reasoning tasks. Here is an apparently isolated job: translation. Translation should be amenable to pure reasoning: if you have dictionaries, some books on grammar, etc., you can crunch out a translation between two languages. Language models are good translators, and it is not just because they know a bunch of dictionaries and grammars and have got the facility to line them up. Models have memorized a vast range of attendant material. They have got culture and history, they have read so many different ways people write and speak, they have got pretty good theory of mind, and they have the rudiments of humor. They know how to be generous or snippy. The (real) job of translation is not amenable to a pure reasoner; it requires huge amounts of knowledge, and integration over that range.
It's a brutish empirical field. In retrospect, I imagine a good deal of the relevant history will look like a series of productive abstract mistakes. I think the CC is conceptually confused. It does not particularly matter if it is conceptually confused; the better question is whether it works. Ideally someone follows the prescription and we can look at the results; or there are tests against current models which at least vindicate the diagnosis. I don’t think anyone has trained a CC-ish model following Karpathy’s full prescription. We have got Gemma and a growing set of small models. We haven’t got pretraining data pruned down to just the “cognitive components.” Fully synthetic data might be the closest thing, but current efforts use a data mix that deliberately teaches memorization alongside reasoning. See Baguettron and Monad from Pleias (an OpenAI acquihire would be viable on naming skill alone).
Merullo et al. at Goodfire put out a paper which looks encouraging for the CC diagnosis. They run an ablation on OLMo-2 7B which suppresses verbatim recitation while mostly preserving capabilities. Interestingly, they ablate low-curvature directions, not high: while any individual instance of memorization will be spiky, the directions are unrelated to one another, so as a population memorization directions are particularly flat, whereas generalizing ones are moderately curved. Their characterization is that you get effects along the spectrum of behaviors between pure memorization and pure generalization. Closed-book QA suffers; open-book QA suffers a bit but is mostly preserved. The ablated model does well with boar etruscan (a cousin of pig latin). Notably, boolean logic improves post-ablation. Arithmetic suffers a bunch: on GSM8K the ablated model produces the same CoT but gets the wrong answer. But it doesn’t seem fair to take this as a point against the CC diagnosis; it might just show that 7B models “do arithmetic” mostly via memorization.
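A cartoon of the mechanics, in case the geometry is unclear (my own numpy sketch, not the paper’s code; their curvature estimation and layer-wise procedure are more involved): given weight-space directions sorted by loss curvature, ablating the flat ones is just projecting them out of the flattened weights.

```python
import numpy as np

def ablate_flat_directions(w: np.ndarray, directions: np.ndarray,
                           curvatures: np.ndarray, k: int) -> np.ndarray:
    """Project the k lowest-curvature directions out of flattened weights.

    w:          (d,) flattened weights
    directions: (n, d) orthonormal directions in weight space
                (e.g. Hessian eigenvectors)
    curvatures: (n,) corresponding loss curvatures (eigenvalues)
    """
    flat = directions[np.argsort(np.abs(curvatures))[:k]]  # flattest k
    # Subtract each flat component: w <- w - sum_i <w, v_i> v_i
    return w - flat.T @ (flat @ w)
```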
I don’t think the Goodfire paper is enough to support the CC diagnosis by itself. We are looking at capabilities on a small set of benchmarks. Here is the standard lament about evals: they tell us exactly what they tell us. I expect CC-esque models will outperform on benchmarks. That is an unhelpful complaint, basically unfalsifiable (CC models will look good but be bad in some invisible, fuzzy way). Let me try saying something falsifiable.
I predict that, at any given compute budget, the strongest general reasoners will also be the most knowledgeable (judged on a broad set of benchmarks, and ideally practical usage). Behind the Pareto frontier there will be weaker models with different relative knowledge/reasoning strengths, but the two will improve together. This is roughly true today. If you subscribe to the CC you expect it will not hold – maybe the relationship only holds because we don’t know how to improve reasoning without bringing memorization along. My prediction is that even if people train models following the CC prescription, the relationship will hold.
I would not be satisfied by a distilled model which is disproportionately good at reasoning vs. same-size but non-distilled peers: even if it has got less encyclopedic knowledge, it benefits from the integrated knowledge of its parent (also it would be kinda cheating on compute). I do expect that we will see models trained following the CC prescription. They will be narrow, not general. There is room left for them: we’ll want models to fiddle with formal systems. But, not unlike a postdoc in one of those more elegant fields, a pure reasoner will be largely helpless set adrift in the broader world.