# Nachman KS — Full text of published posts

> Generated from 6 published posts at build time. Source: https://nachmanks.com

---

## Elicitation

URL: https://nachmanks.com/posts/elicitation
Published: 2026-03-01
Description: douglas adams vindication

Elicitation is a funny category. You can ask a question, you can ask a leading question, you can leave a hint. You can give step-by-step instructions. You can do socratic tutoring. Of course I am naming a generic problem in an obtuse way, there are no neat platonic tests. (Also if you are eliciting humans.) We can talk about elicitation in the meanwhile.

By way of demonstration: today models are badly under-elicited. I’m not sure what I mean by that, but I’m quite sure it’s true. Distinguish from the general "capabilities overhang" idea. Which I also buy.

\

The good stuff is stuck inside, you’ve just got to get it out. Elicitation is pulling something from the model. You can “hand-elicit” models (that's what you are doing when you use them). Claude Code works a lot better for some people than it does for others. You can also do scaffolding for elicitation. Claude works a lot better in some harnesses than others.

Contrast gains from elicitation with gains from external affordances. Models are bad at arithmetic, so you give them a calculator. You get a system which can do arithmetic. But you are not eliciting any latent ability. Models are bad at arithmetic, so you prompt them to “think step by step.” Now they are marginally less bad. The method routes through the model. It turns out models could do arithmetic (if with considerable effort).

On a naive view elicitation points at some category of independent facts about model capabilities. Independent facts – even if we could uniformly elicit models to get them – would not be a helpful prize; mechinterp is not so far along. What do you mean, the model has capability X? If we could see under the hood, we might identify circuits which “explain” X. When the model is a black box we are left with the bare fact, which is much less interesting.

GPT-3 can’t prove theorems. In the set of possible GPT-3 outputs, there might be some valid proofs. Ask the model to repeat a proof back to you verbatim. Well, GPT-3 wouldn’t do a great job with this either. But you can imagine some prompt injection-like method which would make GPT-3 spit out a valid proof. Of course that’s not what we mean, “do proofs.” We are in some recognizable sense playing a trick.

Make the pedant happy. When we say that GPT-3 can’t do proofs, we are describing the set of elicitation–capability compounds. It might contain prompts that “properly” ask for proofs, and it might contain valid proofs – but rarely together. Taken alone, prompt-injection and “think step by step” are of-a-kind. But with a whole set of behaviors, “think step by step” will be part of a contiguous region; whereas instances of prompt injection will look lonely. We’re just applying convention. I’m making all this up. At least now we are equipped to say the thing we wanted to in the first place, “I know it when I see it.”

\

Then again, why bother? Why care about saving the category? Are we just shuffling stuff around, does this buy us much? The idea at least is that when we talk about "regions" etc. we resolve to remain silent about "natural" capabilities.
We have got to lean on convention either way, but we can articulate conventions when we talk about elicitation and capabilities together, whereas when we go for “independent” facts, we foreclose that possibility – then we cannot sensibly talk in terms of conventions, even though we need them just as much.

Summon the naive ambition. We want to know what the models really are; and in the meanwhile, what they can really do. When models are under-elicited we don’t know them well. We do not have an appropriate visceral sense of what we are dealing with. Failures addressable through elicitation are incidental. (Here a kinship with the "unhobbling" idea.) We might also wonder if those failures which remain are characteristic. We should not be too quick. Many “characteristic” limitations have fallen (and some, e.g. those stemming from tokenizer issues, are not so interesting). But elicitation is a good name for movement on this front. It is a sort of category-by-implication (and gets a matching circular definition).

Elicitation prefigures progress. This happens to be true in a prosaic sense – models are trained in their native harness, so elicitation methods get “baked in.” Maybe it is true in a more interesting way. A model is bad at arithmetic. You train it to generate a bunch of tokens between “think” tags, and put it in a scaffold which only shows the final answer, and then RL the whole thing against answers (name for berry of your choosing). Now it is passable at arithmetic. I think some people say RL is just elicitation. I don’t really understand this view. And you don’t have to ask it to “think step by step.”

Helpfully, scaffolding for elicitation lower-bounds capability. A common distinction, talking about scaffold design, is whether the scaffold is patching a characteristic or temporary deficiency in the model. The lesson is supposed to be that you should not patch temporary deficiencies, because the next model will swallow your scaffolding. We might take a complementary lesson: by aggressively scaffolding models, we either surface some characteristic deficiencies, or gently fast-forward, finding a floor for behavior we should expect in the next generation. Maybe models will swallow their harnesses; or, maybe models will continually build their own harnesses. Again (at least if intuitions about what this looks like hold), handrolling a harness fast-forwards us to the kind of performance we’ll see when models become competitive with humans at harness-building. Granted models may shoot past that point as soon as they reach it. Among other things, I do imagine there are some benefits – at least in top-end performance, if not efficiency – to “being” the harness rather than being in a harness. I hope orgs like METR will do more work on scaffolding, or partner with application layer companies (“agent labs” etc.).

\

We have taken toy cases that look like prompt in/answer out. So we’ve gotten away with talking about models, not systems. Take an agent (roughly: model calls in a loop, autoregressively appended, with tool results injected). At each turn the model modifies its environment – then you have got to worry about chaotic interaction with the environment. This is already trouble for our original image, where elicitation is pulling stuff out from “inside” the model. I was a bit sloppy setting the calculator tool so neatly against elicitation. External affordances like a calculator can enable, without constituting, new elicitation regimes. Or: models get to make their environment a bit friendlier.
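
To make that shape concrete, here is a minimal sketch of an agent loop of the kind described above – model calls in a loop, outputs appended, tool results injected back into context. It is illustrative only: the client, message format, and tool names are hypothetical stand-ins, not any particular harness.

```python
# Minimal agent loop (illustrative; hypothetical model client and tools).
def run_agent(model, tools, task, max_turns=20):
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = model.generate(context)                 # one model call
        context.append({"role": "assistant", "content": reply.text})
        if reply.tool_call is None:                     # no tool requested: done
            return reply.text
        tool = tools[reply.tool_call.name]              # e.g. "calculator", "todo_write"
        result = tool(**reply.tool_call.args)           # the model modifies its environment here
        context.append({"role": "tool", "content": result})  # result injected into context
    return None  # ran out of turns
```

Once there is an environment in the loop, the neat picture of pulling stuff out from “inside” the model already starts to strain.
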
It’s a touchy case. Arithmetic can be neatly “factored out." It’s tempting to slot in specialist systems all over the place. We should expect most will scale worse. In fact our problems are still more interesting. In an agent, the model is also using itself.

Consider the “todolist.” Models call a tool to write down a set of todos, then read and update the list as they work through it. On some implementations the harness re-injects the list, pushing the todos into context. On others, the model is encouraged to use the tool frequently, pulling the todos. The model doesn’t get new information about its environment from the todolist, which strictly involves stuff which is already in context. But it does get – in an admittedly kludgey form – a way to manipulate its own attention (with a bit of opinionated advice packaged in). Letting the model “recite” its objectives keeps it on track. (I am borrowing from this Manus blog post.)

We can describe systems (at least, the good ones) easily. But it is very hard to reason about entailments. Picture working against a codebase with unknown bugs, i.e. most of them. The code describes, in perfect detail, what is going to happen. Yet you may be surprised. Here there is nondeterminism also.

The todolist is an exceedingly tidy example. It exists solely to let models prompt (then: elicit) themselves. Take simple multiagents, e.g. the agent gets “subagents” exposed as a tool. The model may write down todos. It is prompting itself – but not just to play the recitation trick, and we must take a broader sense of “self.” Multiagents have more than one instance. A subagent is a separate serial loop, with a context window that the parent is responsible for initially shaping. The system works by carving its own substrate. For a nice formalization see Zhang's "RLMs." Capability gains will depend on emergent behaviors at the system level; so systems will be increasingly opaque. Our epistemic position will get more like the one we’re in w.r.t. models.

\

Does elicitation survive? Maybe it gets compressed, then squeezed out of existence. If systems converge to “optimally” elicit the models in them, elicitation isn’t a useful category. (Or if you prefer: say they’re self-eliciting.)

Try a distinction on for size: elicitation methods can be more general or specific. An easy test is whether a given intervention sees gains across model families, and across different kinds of jobs. We can be pretty confident general solutions will get folded in – becoming part of the system, then swallowed by or baked into the model. One line is that elicitation specific against models will go away too. People have been talking about the death of prompt engineering for a while. We no longer bother telling models “you are the world’s best SWE” (or, infamously, “you are the world’s best SWE, and you’d better get this right or else”). Prompt engineering has an adversarial flavor. (“This one weird trick.”) It exploits quirks of the model – which tell us more about the model than the job. I’m not sure. Stronger models should be harder to trick. At least, relatively harder to trick. We continue to demonstrate that it is possible to make very strong models that are still pretty easy to trick.

However, not every elicitation which responds to model-specific quirks is a trick. We might worry about “motivating” models. If models have preferences, I imagine they will be related to capabilities in interesting ways. A strong model’s preferences will probably be first-order helpful (e.g.
as part of how we get adaptive compute). Preferences might also be tied up with capabilities in a deeper sense. A strong model is the sort of thing with certain preferences – a model which is a good co-scientist should seek novelty, correspondingly it should get bored. Simultaneously, these preferences may be, on finer points, contingent (one way Goodfire might want to make money). There is a pretty broad space of possibilities, but constrained by training. Probably we cannot just work backwards. It also matters how the model got the way it is. Working for an animal welfare org, I expect Claude to put its back into it.

What about elicitation specific against jobs? It’s a practical question how important this is. Hard to guess, we are confused about both models and jobs. Of course it will also be worth spending a lot of effort to reshape jobs to suit models. But insofar as the idea about models being spiky is right, we might expect: plenty. Then elicitation survives with centaurs. In this world it is hard to generalize elicitation against the whole distribution of jobs, so the “best-elicited” systems will be centaurs.

We’ll care about personalities – at least, about steerability. There are interesting tradeoffs for ergonomics. Models have pretty good ToM for humans. Most of us (certain llm-whisperers aside) don’t have great ToM for models; it may help to shoot for more legible personas. It matters how models treat their users. Claude is good at intuiting what you want; Codex, less so. A lot of the time it’s pleasant to use Claude. But sometimes it is better to take your medicine: it’s worth being forced to spell things out, and preferable to be punished sooner rather than later when you don’t.

---

## For Consolidation

URL: https://nachmanks.com/posts/consolidation
Published: 2025-03-01
Description: meanwhile, centaurs

All very stylized: the models are important, two or three frontier labs make them. Maybe there are fast followers. It could be good business. It doesn’t matter much. The labs have got secret knowledge and then vast scale. Takeoff or no there is a departure. Ilya shuffles off to his bunker. Sam and Dario and Demis sit with heads of state. We are quick to recognize, this is a world with concentration of power. You get what it says on the tin. This story is easy to imagine and not wrong either; because it is easy to imagine, it elides structural reasons. In fact I think it is pretty overdetermined.

Another story, sometimes popular: upward pressure from open source and a capability ceiling above; modelmakers squished up against the frontier so models will get commoditized, we enjoy intelligence on tap, and it will be just like electricity. I think SOTA models won’t get commoditized. Even if they do, we should expect concentration, and for familiar reasons.

\

You can do it or you can’t – once you can, good enough. Some jobs are like this. Maybe there is a “workflow” you want to automate. Call the idea PAST THE POST. It is a claim about the kind of stuff we’ll use models for.

One pleasant idea about investors is that they are widget-makers, they put out units of clever analysis. Incidentally there are some decisions also and money gets shuffled around. Or say consultants make slide decks. Of course investors are not widget makers, neither are consultants. There are no widget-makers. Rather, it’s nobody’s job to make widgets; or rather, if it’s yours, you are gonna get run over. There are factories which literally make widgets but the factory is not the firm. The firm’s mandate is to have some grip on the future.
If they are making widgets they will probably make more of them and do it better, and do it better than someone else could. And so on.

Contrast PAST THE POST with HEADROOM.

HEADROOM against humans. Humans are really impressive and we continue failing to appreciate it. Taking HEADROOM against humans seriously we can reconcile how impressive today’s models are with the fact that they haven’t upended everything yet. (Without arguing about diffusion.) Then we can see strong in-paradigm progress for a while and get very strong systems that are not AGI. Nothing ever happens: the RL industry patches flaws one by one; at each turn some new deficiency comes into focus, or rather, we learn a new way that humans are pretty cool. The process repeats. We get increasingly useful models. Work is sliced up and some biggish swathes are traversed only by models. Mostly organizations are centaurs, and really shot-through with the centaur arrangement rather than slicing the bottom half off. We shrug and rearticulate the O-ring idea. The degenerate case is “humans are magic” – failing to assimilate the concept AGI at all. HEADROOM suggests – at least for many sensibilities – a dangerously appealing world, just enough like ours, new and exciting also, just strange enough that surely it is an eminently reasonable guess. Still I’m pretty sure it is worth devoting more subtle attention to what makes humans so damn good.

HEADROOM against work. Some jobs are hard to saturate, you can keep getting better. Easy examples are adversarial games. Traders don’t make money by getting PAST THE POST. Or notice e.g. how Google, Meta, etc. continue to make more and more money off of ads. They are doing this enormous optimization job. You might guess they’ve squeezed everything they can from their userbase; and the only way to move the needle is finding new people to squeeze. Not so. It is in one sense a neatly bounded problem, but maybe contains a number of messy unbounded problems, i.e. there are surprises left s.t. at any point even the best solution will involve some mistakes, or at least a degree of myopia waiting to be undone.

If you think these two expressions of HEADROOM are very different I would like to nudge you and say they are importantly similar. So far we’ve only had jobs for humans (maybe better, jobs for firms) so the measure is pretty tied up.

The commoditization story leans hard on PAST THE POST. Say we get commoditization – everybody is selling pretty much the same thing – what is true in this world? Progress has stalled, or demand has stalled (e.g. the best models churn out strange new math but nobody is buying it). More or less contingent reasons: maybe scaling in the current paradigm only gets us so far, a new one doesn’t come in time. Instead fresh AI winter (bimodal, this decade or bust). Maybe there is a “natural” capability ceiling. Hard to say why this would be the case. If you want to rescue the intuition: there could be diminishing returns to using intelligence against the stuff we care about. The very aggressive version of this claim is that a superintelligence wouldn’t even be that interesting.

If there is HEADROOM against humans, then there is plenty of room to make better and better models, even below human level (and you should expect dramatic effects even if you have the very aggressive view that superintelligence will be boring). If there is HEADROOM against jobs, people will keep buying the best and newest models.
There will be a big mass of work done by commoditized models (maybe open-source) getting PAST THE POST. You can also anticipate the proliferation of specialist models. There will be small specialists trained to do specific jobs as cheaply as possible; and maybe specialists which push particular regions of the frontier. Generalist models could orchestrate specialists. (There are strong reasons to avoid compound systems of this style; but you should definitely expect them.) Consumer apps will tend to wind up in this bucket. It’s a popular claim that the models are already smart enough for consumers who won’t notice, much less care about, further gains on the capability frontier. This is pretty obviously wrong, and reveals a confused sense of current capabilities, or at least a lack of imagination regarding the apps we’ll make. But eventually it will be right.

But jobs with HEADROOM are the ones you should care about. They will present a big and only-growing appetite which, in time, describes where most dollars and tokens go. Following HEADROOM we should preserve the category frontier lab – it’s in the name – they do this characteristic thing which is pushing a frontier.

---

## Contra Cognitive Core

URL: https://nachmanks.com/posts/Cognitive_core
Published: 2026-03-01
Description: whipping boy

Karpathy characterized Gemma 3n as an entry in

> The race for LLM "cognitive core" - a few billion param model that maximally sacrifices encyclopedic knowledge for capability. …. It doesn't know that William the Conqueror's reign ended in September 9 1087, but it vaguely recognizes the name and can look up the date. It can't recite the SHA-256 of empty string as e3b0c442..., but it can calculate it quickly should you really want it.

\

The cognitive core is, roughly, a diagnosis and a prescription. Models know too much stuff; better if they didn’t. I have got to complain about both halves, especially the second. There is also a prediction. We’ll use small models. They’ll run super fast on-device, know less stuff than bigger models, but make up for it with aggressive tool use. Karpathy describes all this elegantly, it’s not what I want to talk about. I want to be careful. You can cash them out in different ways. For a start:

Models know too much stuff. Karpathy on Dwarkesh (https://www.dwarkesh.com/p/andrej-karpathy):

> I think that’s probably holding back the neural networks overall because it’s getting them to rely on the knowledge a little too much ... one thing they’re not very good at, is going off the data manifold of what exists on the internet.

\

Maybe encyclopedic knowledge gets in the way. Zvi Mowshowitz says: Skill Issue (capitalization his, he is good at imbuing Proper Concepts). It is strictly better to know more stuff. If it doesn’t help you don’t have to use it. In fairness most issues are skill issues.

Here is one story. Models which know too much are pathological. For example they could habitually pull from relevant-looking knowledge, so fail to really “try” at novel problems, even when they have the latent capacity. I think this version of the CC really is a Skill Issue, the answer is Get Good. The strongest model will have encyclopedic knowledge and know what to do with it. Of course that does not tell us the right path to get there.

Then take the CC diagnosis for training, not test-time. Models know too much junk

> … when you and I think of the internet, you’re thinking of like The Wall Street Journal. That’s not what this is. ...
> It’s some like stock tickers, symbols, it’s a huge amount of slop and garbage ... because the internet is so terrible, we have to build really big models to compress all that. Most of that compression is memory work instead of cognitive work.

\

So part of the CC is data efficiency. We waste training compute, and model parameters, memorizing garbage. This means there’s headroom for small models with better data. It also means that we can keep scaling big ones. Even if “memory work” does crowd “cognitive work,” one doesn’t have to stunt the other. Bigger models will benefit from more of both.

The CC points at something more interesting. Models know too much stuff, it ruins their training

> We’re humans not actually that good at memorization … That’s a feature, not a bug, because it forces you to only learn the generalizable components. Whereas LLMs are distracted by all the memory that they have of the pre-training documents

\

Following the diagnosis, the prescription

> I want to remove the memory, ... so that they have to look things up, and they only maintain the algorithms for thought, and the idea of an experiment, and all this cognitive glue of acting.

\

Most helpfully

> We need intelligent models to help us refine even the pre-training set to just narrow it down to the cognitive components. Then I think you get away with a much smaller model because it’s a much better dataset and you could train it on it. But probably it’s not trained directly on it, it’s probably distilled from a much better model still.

\

The problem with memorization is not that it wastes parameters (it does that too). I am not sure exactly what Karpathy means, “distracted.” I imagine it is something subtle; I will do my best filling in an unsubtle explanation. There are plenty of off ramps if you prefer a boring version of the CC idea (here: I am sure training data could be generically much better).

What do we train models for? “Minimizing cross-entropy” – the formal objective is, by itself, content-free – what we care about is minimizing loss over the internet, for a start. To compress the internet a model will have to do something interesting. It turns out that the internet is really rich. A model must develop a correspondingly rich set of internal representations. Stronger: a model will have to develop something like “reasoning.” This is Ilya’s old line.

The CC disagrees. Training against the internet, memorization and reasoning are worse than orthogonal. If a model has a path to reduce loss by leaning on its memory, rather than reasoning, and memorizing is cheaper, that is what it will learn. The concern is that memorization-heavy strategies will compete with, and tend to dominate, reasoning strategies. In a gesture: memorization is greedy … so is gradient descent. Unfiltered, the internet gives you an unfriendly loss landscape. The product is something like a talented kid whose school let them coast, so they never learn how to grind.

Then we get to posttraining. Models will use their encyclopedic knowledge as a crutch. In a given rollout, the easiest way to get a reward will often involve leaning on some bit of esoterica, rather than reasoning through the problem. So it is hard to bootstrap. RL will just teach models to get really good at leveraging this vast pretraining knowledge. Granted it is funny to complain here. Famously the strong priors that models get from pretraining are the reason RL is feasible at all.
Pretraining knowledge has the interesting effect of jumping capabilities up pretty high, then plateauing them unless RL’d very carefully. This is opposed to, well, not jumping them at all. If we are inclined to make mistakes about how good the models are – that’s kinda on us.

\

There is a certain Cartesian flavor. Rationalist, not in the Berkeley group house sense, the older one. I do not think memory and reasoning make good categories here. (1) They want to describe a lot of the same stuff; further if they are tangled in training I doubt it’s incidental (this contra Karpathy’s empirical claim, which I cash out as worse than orthogonal). (2) Supposing we trained a “pure reasoner,” it would be much less appealing than it sounds. (1) and (2) are separate claims but take them together. We have got some vague ideas about the “nature of intelligence” and also about the nature of work, i.e. the stuff that we measure intelligence against. We are confused about both. The two are symbiotic. Insofar as we manage answers on either side, it will be by leaning hard on the other.

William the Conqueror is a good example. Karpathy says you want a model which doesn’t know the details of William’s life, but recognizes the name and can go from there. Vaguely recognizing William requires knowing a lot of stuff. You have some ideas about England, about people and names, about conquering. The model which vaguely remembers is much more like one that has memorized than one which doesn’t know. Or stronger, vaguely recognizing is of-a-kind with knowing.

And then: reasoning tends to be about something. Any particular facts you might learn about William are interesting insofar as you have a sense of what William might have been like, how he might have failed, how things might have gone instead. We should not be unkind to historians. They are doing something besides making a big pile of facts, there is plenty of reasoning involved. But it is reasoning wrapped up with a whole set of particulars. We might imagine that the way you learn to reason like this is similarly entangled.

Some data is in the CC’s sense Junk, e.g. stock tickers. For Junk, you have got orthogonality, because there is no reasoning strategy to learn that could possibly help reduce loss. (Incidentally you will still want the model to know some Junk, e.g. even ARC-AGI keeps “core knowledge priors.”) However the complement of Junk isn’t Not Junk. Most stuff, like the WSJ, or the Wikipedia page for William the Conqueror, is Kinda Junk. It could encourage reasoning strategies; at least some memorization will still be necessary (we do not expect even very strong models to, in the course of training, derive the details of William’s life). And of course you can always memorize more than you need to. It is because Kinda Junk is so expansive that worse than orthogonal is an interesting claim.

But it is when we take most stuff as Kinda Junk that the prescription – we should clean out pretraining data, narrowing it down to the “cognitive components” – looks strangest. Jeff Dean has numbers every engineer should know. Maybe it is super convenient for every engineer to know these numbers. Or take Jeff Dean seriously. Having the knowledge on hand will shape how you see things, e.g. when you are looking at a latency you want to naturally know whether it is alarmingly bad or perfectly normal or suspiciously low. Tricky: for humans, picking up a bunch of esoterica tends to be a natural byproduct of mastery. Clearly not sufficient, e.g. people who know a bunch of trivia.
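
For readers who haven’t seen the list, here are a few of the canonical circa-2012 figures, rendered as a Python snippet for concreteness. These are order-of-magnitude reference points only, and modern hardware has shifted some of them; the snippet is illustrative, not part of the original post.

```python
# Approximate "latency numbers every programmer should know" (circa 2012), in nanoseconds.
# Order-of-magnitude reference points only; modern hardware differs.
LATENCY_NS = {
    "L1 cache reference": 0.5,
    "main memory reference": 100,
    "read 1 MB sequentially from memory": 250_000,
    "round trip within same datacenter": 500_000,
    "disk seek": 10_000_000,
    "packet round trip CA -> Netherlands -> CA": 150_000_000,
}

# The kind of gut check the post means: a ~200 ms "quick lookup" is about
# 200_000_000 / 100 = 2,000,000x a main-memory reference.
```
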
Sadly memorizing some latencies would span a relatively small portion of the distance between me and Jeff Dean. Print out Jeff’s numbers and put them on your desk. Why can’t you “factor out” the whole problem of memory? If you don’t know something, look it up. This idea is at home in the CC parcel. We get the CC by way of contrast, it is set against excesses of the current paradigm (e3b0c442…). The CC is right that you shouldn’t use the model as a DB or encyclopedia, rather you should give the model an encyclopedia. Probably you will want them tightly coupled. But that is another thing. Leaving it there defers the problem to search. Search is not so easy. This imagines the most ambitious version. Simultaneously we suppose something CC-shaped is good for the job.

Take “agentic search.” Give the model grep and leave it alone. This, we are informed, is very bitter lesson-pilled. I think it’s right but for relatively subtle reasons. I’ll treat it in a forthcoming piece. I am pretty sure the reason why models are so good at grepping around a repo is that they have strong natural intuitions for what a repo might look like. They have encyclopedic knowledge over conventions, and they draw associations based on what they see, so that even the thin information you get from `grep -> ... -> head -5` lets you infer a good deal about the whole codebase. There is reasoning going on, but not the kind you can expect from a pure reasoner. While a model doing codebase search may experience a lot of “vague recognition,” the ability does not come from a matching class “vague knowledge” but rather a huge pile of super granular, over-detailed stuff which is Kinda Junk.

Even if a pure reasoner is a good category, we may not have that many pure reasoning tasks. Here is an apparently isolated job: translation. Translation should be amenable to pure reasoning. If you have dictionaries, some books on grammar, etc., you can crunch out a translation between two languages. Language models are good translators. It is not just because they know a bunch of dictionaries and grammars, and have got the facility to line them up. Models have memorized this vast range of attendant material. They have got culture, history, they have read so many different ways people write and speak, also they have got pretty good theory of mind, and they have the rudiments of humor. They know how to be generous or snippy. The (real) job of translation is not amenable to a pure reasoner, but rather requires huge amounts of knowledge, and integration over that range.

\

It's a brutish empirical field. In retrospect I imagine a good deal of the relevant history will look like a series of productive abstract mistakes. I think CC is conceptually confused. It does not particularly matter if it is conceptually confused, the better question is if it works. Ideally someone follows the prescription and we can look at the results; or, there are tests against current models which at least vindicate the diagnosis.

I don’t think anyone has trained a CC-ish model following Karpathy’s full prescription. We have got Gemma and a growing set of small models. We haven’t got pretraining data pruned down to just the “cognitive components.” Full synth data might be the closest thing. But current efforts use a data mix that deliberately teaches memorization alongside reasoning. See Baguettron and Monad from Pleais (OpenAI acquihire viable for naming skill alone). Merullo et al. at Goodfire put out a paper which looks encouraging for the CC diagnosis.
They do ablation on Olmo-2 7B which suppresses verbatim recitation, while mostly preserving capabilities. Interestingly they ablate low-curvature directions, not high. While any individual instance of memorization will be spiky, the directions are unrelated to one another, so as a population memorization directions are particularly flat, whereas generalizing ones are moderately curved. Their characterization is that you get effects along the spectrum of behaviors between pure memorization and generalization. Closed-book QA suffers; open-book QA suffers a bit but is mostly preserved. The ablated model does well with boar etruscan (cousin of pig latin). Notably boolean logic improves post-ablation. Arithmetic suffers a bunch. On GSM8K the ablated model produces the same CoT but gets the wrong answer. But it doesn’t seem fair to take this as a point against the CC diagnosis, it might just show that 7B models “do arithmetic” mostly via memorization.

I don’t think the Goodfire paper is enough to support the CC diagnosis by itself. We are looking at capabilities on a small set of benchmarks. Here the lament about evals, they tell us exactly what they tell us. I expect CC-esque models will outperform on benchmarks. This is an unhelpful complaint, basically unfalsifiable (CC models will look good but be bad in some invisible fuzzy way). Let me try saying something falsifiable. I predict that, at any given compute budget, the strongest general reasoners will also be the most knowledgeable (judged on a broad set of benchmarks, and ideally practical usage). Behind the pareto frontier there will be weaker models with different relative knowledge/reasoning strengths; but the two will improve together. This is roughly true today. If you subscribe to the CC you expect it will not hold – maybe the relationship only holds because we don’t know how to improve reasoning without bringing memorization along. My prediction is that even if people train models following the CC prescription the relationship will hold. I would not be satisfied by a distilled model which is disproportionately good at reasoning vs same-size but non-distilled peers. Even if it has got less encyclopedic knowledge, it benefits from the integrated knowledge of its parent (also would be kinda cheating on compute).

I do expect that we will see models trained following the CC prescription. They will be narrow, not general. There is stuff left. We’ll want models to fiddle with formal systems. Not unlike a postdoc in one of those more elegant fields, a pure reasoner will be largely helpless set adrift in the broader world.

---

## SKILL.md

URL: https://nachmanks.com/posts/Skills
Published: 2026-03-01
Description: (issue)

I bet this gets us pretty far, maybe surprisingly far. It also seems to have a pretty clear capability ceiling vs approaches that touch the weights. At least, gains are bounded by how strong ICL is. To be fair, we may be surprised. Do we really know how strong ICL is?

Skills are a convention to make users do better context engineering. Imagine a very diligent user driving Claude and adding appropriate context. Claude’s retrieval over Skills might be worse than this. But consider a much lower baseline: a not-very-diligent user driving Claude (and we may note, most users are not so diligent). Here Skills will be a strict improvement. Skills are a way of pulling context from users. (E.g. example scripts implicitly convey stuff about the env, best practices, etc., that a user wouldn't otherwise put in a prompt.) Skills make good centaurs.
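
To picture the mechanism, here is a toy sketch of Skill-style retrieval. It is purely illustrative and makes assumptions: the directory layout, the use of a file’s first line as its description, and the loading rules are stand-ins, not Anthropic’s actual harness. The gist is that short descriptions sit in the prompt, and the full SKILL.md is pulled into context only when it looks relevant.

```python
# Toy sketch of Skill-style retrieval (all names and layout assumed, not the real harness).
from pathlib import Path

def list_skills(skills_dir="skills"):
    """Collect (folder name, first-line description) for every skill folder."""
    index = {}
    for skill_md in Path(skills_dir).glob("*/SKILL.md"):
        description = skill_md.read_text().splitlines()[0]
        index[skill_md.parent.name] = description
    return index

def build_system_prompt(index):
    """Surface only the short descriptions up front."""
    lines = ["You have these skills; read a skill's SKILL.md before using it:"]
    lines += [f"- {name}: {desc}" for name, desc in index.items()]
    return "\n".join(lines)

def load_skill(name, skills_dir="skills"):
    """Pull the whole skill file into context when the model decides it is relevant."""
    return (Path(skills_dir) / name / "SKILL.md").read_text()
```

The point about pulling context from users is visible here: the skill folder carries the env details, example scripts, and best practices that a diligent user would otherwise have typed into the prompt by hand.
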
Helpfully, they offer some guarantees about what kind of behavior to expect from Claude (if Claude follows instructions well enough). They also give you a standard format, so nice for sharing. Though neatly attributing improvements – and so, iterating on Skills – seems hard.

Where will gains come from? You could use Skills a couple ways. (1) Package you-specific info for Claude. Skills are, after all, a special case of Anthropic’s telegraphed near-term approach to “memory.” Crudely, ICL with a dynamically managed context window, acting over a persisted filesystem, with end-to-end RL from which skillful file creation/retrieval emerges. Presumably there are other efforts – but whenever he is asked Dario comes off suspiciously ICL-maximalist. (2) Try to “teach” Claude stuff, i.e. extend the capability frontier. I imagine this would work mostly as a way to elicit stuff which Claude can already do, but so far unreliably. As with any scaffolding, there’s the risk of overspecifying/constraining the model. Will we give newer Claudes the same Skills?

Skills are more convention than hard affordance. But because Claude is wedded to its native harness, it would be a mistake to just consider the content in Skill files. We must also account for Claude’s natural tendencies (or: “habits”) using Skills. Presumably Claude is trained to respect Skill files (a special case of steerability), and to be good at navigating filesystems which follow the Skill convention. It will be interesting to see how much of this you can get “for free” from other models that aren’t trained to use Skills, but are good at navigating filesystems and following instructions. (OAI adopted Skills, if quietly, within a month, Willison reports.)

I wonder if Claude can be trained for superhuman navigation over Skills. Then you could give Claude a really huge corpus of Skill content. If Claude is very, very good at finding and combining relevant Skills, does this describe something which feels different-in-kind from ideal context engineering?

---

## Two Notes on Travel

URL: https://nachmanks.com/posts/travel
Published: 2026-03-01
Description: preserve the chip butty

Agnes Callard complains about travel. If you have ever shared the misfortune of a certain genre of conversation, you know she has a point. To leave it there would be the stylish thing. Callard is right that travelling is nothing to be proud of. You can do a bad job at it too (sometimes travelling is something to be ashamed of). You go somewhere, look at the important bits, try to feel whatever you are supposed to feel. “To be a tourist is to have already decided that it is not one’s own feelings that count,” Callard reports.

What else? Travel is a state of temporary leisure, travel is a reprieve. We get a way to mark and split up life – before vacation, after vacation, a little empty patch sandwiched in between. This is a mistake. Instead, Callard says, we should notice that we are on a single uninterrupted trek towards, putting it cutely, “doing nothing and being nobody.” Of course travel cannot save us. (It’s like most stuff that way.) One may suspect we’ve been set up. For Callard travel names an aspiration. (“Aspiration” in the colloquial sense, not the particular kind involving proleptic reasons Callard talks about in other work.) Then we could aspire less, and save ourselves the trouble.

The case for travel is sad but sturdy. Traveling is a nice way to pass time. The world is biggish, and traveling can help you make a picture of it.
It feels different being in one place or another. Some places have got special pleasures. At least they have all got different people in them. Those are probably the best – if not the only – reasons. As for the original mistake: having a good picture of the world, unfortunately, will not make you particularly interesting. In fact it will flatten you out. I’m pretty sure it’s local commitments and importantly local prejudices that make people interesting.

\

\\\

It should be much harder to travel. Long treks, highwaymen. Scurvy maybe. Cosmopolitanism is nice, I’m not sure it’s nice enough. We are making a trade. It is very hard for places to stay themselves.

“Stay themselves,” what does that mean? It sounds faintly like a dog whistle. I do not tend towards essentialism. I will not claim there is some deep, also fragile, quality of e.g. the British which must be protected, lest their palates expand too far, such that their appetite for the venerable (and strictly hueless) chip butty is lost, and then the whole British thing slips into oblivion. But it is very hard for places to stay themselves. I mean: a “place” is a useful idea, a name for a bundle of land, people, buildings, weather, random facts (“history” if you prefer), and other stuff too. These bundles are fragile.

Here I am supposed to say something about McDonald’s. I have no complaints, no one has ever gotten confused about McDonald’s, everybody either likes it or dislikes it, they enjoy it either way. Instead I would like to complain about culture. By that I mean aspiration, all the respectable bits, and how they are constantly getting mixed together. That's the stuff which makes a place. Places are not obviously different from one another. There are only so many ways to make a bundle – or rather, we are quickly insensitive to those more subtle variations which remain. And people are a lot like one another. If you wanted a nice heuristic you might say that they are all pretty much the same. It takes a place to make people so different from one another.

We like to observe (and it happens to be true): we are used to material produced under a set of intense pressures. Social media is the famous case; and a special case of social networks. Social networks are getting thicker. Places are more and more like each other because they are part of one consuming social world. People talk to each other, and notice they have a lot in common.

Is this so bad anyway? Maybe there are instrumental problems. Mostly the good stuff is fragile. (Or: a good evolutionary algorithm keeps islands.) I think it is worth having different kinds of places, and not just so that we can harvest them for a richer cosmopolitan monoculture later. Nor am I inclined to duck out taking “cultural diversity” as a natural good. But the right idea, I think, corresponds pretty tightly. The good stuff is behind the curtain. It cannot survive the common view. Then keep those self-justifying, badly parochial flavors of aspiration, which recognize and deny a broader world – which each identify the “real world,” disagreeing violently as they do it.

How do you save a place? For a start, it should cost a lot to be there. In fact cities work like this. Unfortunately the thing it costs is money and the way you pay is rent. That is a really uneven cost (some people are rich, others less so). Suffering, on the other hand, is more even. In that regard air travel gets some stuff right (and is fast improving). Unfortunately a plane ride is over far too soon, so still too cheap.
Regret also falls evenly (opportunity cost; excepting temperament). That is what you pay when you spend half the year crossing the Appalachians, the Mississippi, the Great Plains and the Rockies. Let me be clear: if you have “NYC/SF” in your bio, this is what I want to see you do. So out with planes. Boats can stay, though nothing too big or comfortable. What about rail? At the least it should involve a lot of transfers and sporadic scheduling and inflict a general and enduring sense of malaise. Here and elsewhere, Europe lags, the Americans are far out ahead.

---

## Iris Murdoch and Other Reptiles

URL: https://nachmanks.com/posts/bayley
Published: 2026-03-01
Description: portrait-making

When they were both pretty old John Bayley put an essay about his wife, Iris Murdoch, in the New Yorker. She had Alzheimer’s, which is mostly what the essay was about. It was out a year before she died. The content was not surprising. It was a neat exercise in form. It’s easy to imagine why Bayley wrote. And then writers have various stuff to say about why they publish at all. It was not a distinctly bad way to treat her (I imagine that in most respects Bayley was being pretty generous). Still it struck me as the wrong sort of thing to do to your wife.

Work like Bayley’s succeeds where it demonstrates a mastery over the subject. Portrait-making is, I think, the right way to describe it. There are many ways to make a good portrait. But the artist will pick one in particular. And when you are looking at the finished work, you do not imagine every other way the artist might have drawn the subject. Rather you have learned something true about the subject, because they were drawn this particular way. Then can I fault Bayley? He had to display a kind of mastery. Otherwise he would not be doing her justice. The only thing worse than putting the decline of your once-brilliant and now absent wife in a popular magazine is doing a shoddy job of it. I am convinced we all do this kind of portrait-making (maybe we are more or less eager, and more or less private) and although I am not sure what the right attitude is, I am sure that we cannot simply get away from it.

Doing philosophy Murdoch wrote about loving attention. I am not too good with the metaphysics her view is attached to. But the main idea is that you get to apprehend truly by loving (and she means something specific by “loving” here), and so in loving someone you have a specific idea of them. This does not mean thinking exclusively nice stuff, but it does seem to involve a kind of earnest charitability. The characters in Murdoch’s novels receive loving attention. Many of them love poorly or get engaged in long-running mistakes. Still on balance I think Murdoch is writing about people trying to be fair to each other, and at least, she is trying to be fair to them. The way she goes about it involves a lot of machinery (famously she wrote “philosophical novels,” the common criticism is that her characters are always too busy being shuffled around). At any rate if this works, it works because Murdoch is very able. Although (Bayley says) she does not write with a personality that “fascinates and mesmerizes,” Murdoch does unambiguously exercise force. She declares mastery over her characters and their private worlds, i.e. she wraps them up. A lot of the time this involves having fun with them, or specifically, having fun at their expense. Murdoch’s characters don’t just make mistakes but characteristic mistakes.

Maybe it’s unsurprising that Bayley, mostly a critic but also a writer, would pick up portrait-making. Jim Shepard, a fiction writer that I like, said that authors are reptilian because they will use any material that they get. Though I am not so convinced. I think there is something distinctly hot-blooded about the people that Jim’s thinking of. For lack of a better word I’d say they are being vicious. They proceed with easy certainty, so that if you are reading along and feel pleasantly confused about the “themes” or central concerns, you are in equal measure sure about the details, the small facts of the characters and their world.

What do I mean vicious, am I being dramatic? I just mean they are engaged in an activity that is very hard to justify. And further that if you do it you are certainly doing it wrong, it’s probably unjust, even if you have a really good explanation, receipts and everything, it is tied up with a whole set of base impulses. The word I used to use was “authenticity.” It never seemed to do the job. I expect the extent to which Bayley’s picture is vivid, and strikes us most directly, and tells us real stuff about Iris his wife, will match the extent to which it is – not misguided but – unjustified. People are radically underdetermined in access and maybe substance. In order to make a good picture, you need to be adequately motivated, and then you have to do a lot of invention along the way.

Jim also said that authors must condemn themselves. Maybe this is the root of the vicious impulse. You start by trying to victimize yourself, everybody else just gets roped in. Maybe that is what Bayley was doing. At any rate I think there is no good substitute. There is no possible quantity of antiseptic expertise that will do the job. Which is to say, you can’t get off the hook. Bayley describes a very simple failure.

> Like all lovers, I suppose, I wished to be a special case in quite the wrong sense ... Iris wanted each of her friends to know her in the same pristine way. No groups, no sets. No comparing of notes between two about a third. This desire that each of her relationships should be special and separate, as innocent as in the Garden of Eden, was of great significance with Iris.