Elicitation
douglas adams vindication
Elicitation is a funny category. You can ask a question, you can ask a leading question, you can leave a hint. You can give step-by-step instructions. You can do Socratic tutoring. Of course I am naming a generic problem in an obtuse way; there are no neat platonic tests (and the same goes if you are eliciting humans). We can talk about elicitation in the meanwhile. By way of demonstration: today models are badly under-elicited. I’m not sure what I mean by that, but I’m quite sure it’s true. Distinguish from the general "capabilities overhang" idea. Which I also buy.
The good stuff is stuck inside, you’ve just got to get it out. Elicitation is pulling something from the model. You can “hand-elicit” models (that's what you are doing when you use them). Claude Code works a lot better for some people than it does for others. You can also do scaffolding for elicitation. Claude works a lot better in some harnesses than others. Contrast gains from elicitation with gains from external affordances. Models are bad at arithmetic, so you give them a calculator. You get a system which can do arithmetic. But you are not eliciting any latent ability. Models are bad at arithmetic, so you prompt them to “think step by step.” Now they are marginally less bad. The method routes through the model. It turns out models could do arithmetic (if with considerable effort).
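To make the contrast concrete, here is a minimal sketch. `complete` is a hypothetical stand-in for a model call, not any real API, and the calculator is deliberately crude.

```python
def complete(prompt: str) -> str:
    """Hypothetical stand-in for a single model call."""
    raise NotImplementedError

def with_calculator(question: str) -> str:
    # External affordance: the system can now do arithmetic, but nothing
    # latent in the model is elicited; the calculator does the computing.
    expr = complete(f"Rewrite this as a bare arithmetic expression: {question}")
    return str(eval(expr))  # a real tool would parse safely; eval is only for the sketch

def with_elicitation(question: str) -> str:
    # Elicitation: the method routes through the model itself.
    return complete(f"{question}\nLet's think step by step.")
```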
On a naive view, elicitation points at some category of independent facts about model capabilities. Independent facts – even if we could uniformly elicit models to get them – would not be a helpful prize; mechinterp is not so far along. What do you mean, the model has capability X? If we could see under the hood, we might identify circuits which “explain” X. When the model is a black box we are left with the bare fact, which is much less interesting.
GPT-3 can’t prove theorems. In the set of possible GPT-3 outputs, there might be some valid proofs. Ask the model to repeat a proof back to you verbatim. Well, GPT-3 wouldn’t do a great job with this either. But you can imagine some prompt injection-like method which would make GPT-3 spit out a valid proof.
Of course that’s not what we mean by “do proofs.” We are in some recognizable sense playing a trick. Make the pedant happy. When we say that GPT-3 can’t do proofs, we are describing the set of (elicitation, capability) compounds. It might contain prompts that “properly” ask for proofs, and it might contain valid proofs – but rarely together. Taken alone, prompt injection and “think step by step” are of a kind. But with a whole set of behaviors, “think step by step” will be part of a contiguous region, whereas instances of prompt injection will look lonely. We’re just applying convention. I’m making all this up. At least now we are equipped to say the thing we wanted to in the first place: “I know it when I see it.”
Then again, why bother? Why care about saving the category? Are we just shuffling stuff around, does this buy us much? The idea at least is that when we talk about "regions" etc. we resolve to remain silent about "natural" capabilities. We have got to lean on convention either way, but we can articulate conventions when we talk about elicitation and capabilities together, whereas when we go for “independent” facts, we foreclose that possibility – then we cannot sensibly talk in terms of conventions, even though we need them just as much. Summon the naive ambition. We want to know what the models really are; and in the meanwhile, what they can really do. When models are under-elicited we don’t know them well. We do not have an appropriate visceral sense of what we are dealing with. Failures addressable through elicitation are incidental. (Here a kinship with the "unhobbling" idea.) We might also wonder if those failures which remain are characteristic. We should not be too quick. Many “characteristic” limitations have fallen, and some, e.g. those stemming from tokenizer issues, are not so interesting. But elicitation is a good name for movement on this front. It is a sort of category-by-implication (and gets a matching circular definition). Elicitation prefigures progress.
This happens to be true in a prosaic sense – models are trained in their native harness, so elicitation methods get “baked in.” Maybe it is true in a more interesting way. A model is bad at arithmetic. You train it to generate a bunch of tokens between <think> tags, put it in a scaffold which only shows the final answer, and then RL the whole thing against answers (name it for the berry of your choosing). Now it is passable at arithmetic, and you don’t have to ask it to “think step by step.” (I think some people say RL is just elicitation. I don’t really understand this view.)
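A toy version of that scaffold, assuming the working lives between `<think>` tags; `complete` is again a stand-in, not a real API.

```python
import re

def complete(prompt: str) -> str:
    """Hypothetical stand-in for a single model call."""
    raise NotImplementedError

def answer(question: str) -> str:
    raw = complete(f"{question}\nWork inside <think>...</think>, then state the final answer.")
    # The scaffold hides the working: only the final answer is surfaced,
    # and (during RL) only the final answer is graded.
    return re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
```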
Helpfully, scaffolding for elicitation lower-bounds capability. A common distinction, talking about scaffold design, is whether the scaffold is patching a characteristic or a temporary deficiency in the model. The lesson is supposed to be that you should not patch temporary deficiencies, because the next model will swallow your scaffolding. We might take a complementary lesson: by aggressively scaffolding models, we either surface some characteristic deficiencies, or gently fast-forward, finding a floor for behavior we should expect in the next generation. Maybe models will swallow their harnesses; or, maybe models will continually build their own harnesses. Again (at least if intuitions about what this looks like hold), handrolling a harness fast-forwards us to the kind of performance we’ll see when models become competitive with humans at harness-building. Granted, models may shoot past that point as soon as they reach it. Among other things, I do imagine there are some benefits – at least in top-end performance, if not efficiency – to “being” the harness rather than being in a harness. I hope orgs like METR will do more work on scaffolding, or partner with application layer companies (“agent labs” etc).
We have taken toy cases that look like prompt in/answer out. So we’ve gotten away with talking about models, not systems. Take an agent (roughly: model calls in a loop, autoregressively appended, with tool results injected). At each turn the model modifies its environment – then you have got to worry about chaotic interaction with the environment. This is already trouble for our original image, where elicitation is pulling stuff out from “inside” the model. I was a bit sloppy setting the calculator tool so neatly against elicitation. External affordances like a calculator can enable, without constituting, new elicitation regimes. Or: models get to make their environment a bit friendlier. It’s a touchy case. Arithmetic can be neatly “factored out." It’s tempting to slot in specialist systems all over the place. We should expect most will scale worse. In fact our problems are still more interesting. In an agent, the model is also using itself.
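For concreteness, here is roughly the loop I mean – a sketch only, with `complete` and `run_tool` standing in for a model call and a tool dispatcher, not any real SDK.

```python
def complete(transcript: list[dict]) -> dict:
    """Hypothetical model call: takes the transcript, returns one message."""
    raise NotImplementedError

def run_tool(call: dict) -> str:
    """Hypothetical tool dispatcher."""
    raise NotImplementedError

def run_agent(task: str, max_turns: int = 20) -> str:
    transcript = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        msg = complete(transcript)           # the model sees everything so far
        transcript.append(msg)
        if msg.get("tool_call") is None:     # no tool call: treat this as the final answer
            return msg["content"]
        result = run_tool(msg["tool_call"])  # the model acts on its environment,
        transcript.append({"role": "tool", "content": result})  # and the result is injected back
    return transcript[-1]["content"]
```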
Consider the “todolist.” Models call a tool to write down a set of todos, then read and update the list as they work through it. In some implementations the harness re-injects the list, pushing the todos into context. In others, the model is encouraged to use the tool frequently, pulling the todos. The model doesn’t get new information about its environment from the todolist, which strictly involves stuff which is already in context. But it does get – in an admittedly kludgey form – a way to manipulate its own attention (with a bit of opinionated advice packaged in). Letting the model “recite” its objectives keeps it on track. (I am borrowing from this Manus blog post.)
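A sketch of the “push” variant: the `TodoTool` surface here is made up, and `complete` is the same stand-in as in the loop sketch above.

```python
class TodoTool:
    """Made-up tool surface: the model writes todos; the harness re-injects them."""
    def __init__(self) -> None:
        self.items: list[str] = []

    def write(self, items: list[str]) -> str:
        self.items = items
        return "ok"

    def render(self) -> str:
        return "Current todos:\n" + "\n".join(f"- {item}" for item in self.items)

def agent_turn(transcript: list[dict], todos: TodoTool) -> dict:
    # Nothing new enters context; the model wrote these todos itself.
    # Re-injecting them near the end of context is the recitation trick.
    return complete(transcript + [{"role": "system", "content": todos.render()}])
```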
We can describe systems (at least, the good ones) easily. But it is very hard to reason about entailments. Picture working against a codebase with unknown bugs, i.e. most of them. The code describes, in perfect detail, what is going to happen. Yet you may be surprised. Here there is nondeterminism also.
The todolist is an exceedingly tidy example. It exists solely to let models prompt (then: elicit) themselves. Take simple multiagents, e.g. the agent gets “subagents” exposed as a tool. The model may write down todos. It is prompting itself – but not just to play the recitation trick, and we must take a broader sense of “self.” Multiagents have more than one instance. A subagent is a separate serial loop, with a context window that the parent is responsible for initially shaping. The system works by carving its own substrate. For a nice formalization see Zhang's "RLMs." Capability gains will depend on emergent behaviors at the system level; so systems will be increasingly opaque. Our epistemic position will get more like the one we’re in w.r.t. models.
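In the same sketch-language (reusing `run_agent` from the loop above), a subagent is just another tool from the parent’s point of view; the parent composes the child’s entire starting context, and only the final message flows back.

```python
def subagent(instructions: str) -> str:
    # A separate serial loop with a fresh context window, initially shaped
    # by whatever the parent writes here; only the final message returns.
    return run_agent(instructions)  # run_agent from the loop sketch above

TOOLS = {"subagent": subagent}  # exposed to the parent alongside its ordinary tools
```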
Does elicitation survive? Maybe it gets compressed, then squeezed out of existence. If systems converge to “optimally” elicit the models in them, elicitation isn’t a useful category. (Or if you prefer: say they’re self-eliciting.)
Try a distinction on for size: elicitation methods can be more general or specific. An easy test is whether a given intervention sees gains across model families, and across different kinds of jobs.
We can be pretty confident general solutions will get folded in – becoming part of the system, then swallowed by or baked into the model. One line is that elicitation specific against models will go away too. People have been talking about the death of prompt engineering for a while. We no longer bother telling models “you are the world’s best SWE” (or, infamously, “you are the world’s best SWE, and you’d better get this right or else”). Prompt engineering has an adversarial flavor. (“This one weird trick.”) It exploits quirks of the model – which tell us more about the model than the job.
I’m not sure. Stronger models should be harder to trick. At least, relatively harder to trick. We continue to demonstrate that it is possible to make very strong models that are still pretty easy to trick. However, not every elicitation which responds to model-specific quirks is a trick. We might worry about “motivating” models. If models have preferences, I imagine they will be related to capabilities in interesting ways. A strong model’s preferences will probably be first-order helpful (e.g. as part of how we get adaptive compute). Preferences might also be tied up with capabilities in a deeper sense. A strong model is the sort of thing with certain preferences – a model which is a good co-scientist should seek novelty, and correspondingly it should get bored. Simultaneously, these preferences may, on finer points, be contingent (one way Goodfire might want to make money). There is a pretty broad space of possibilities, but constrained by training. Probably we cannot just work backwards. It also matters how the model got the way it is. Working for an animal welfare org, I expect Claude to put its back into it.
What about elicitation specific against jobs? It’s a practical question how important this is. Hard to guess; we are confused about both models and jobs. Of course it will also be worth spending a lot of effort to reshape jobs to suit models. But insofar as the idea about models being spiky is right, we might expect: plenty. Then elicitation survives with centaurs. In this world it is hard to generalize elicitation against the whole distribution of jobs, so the “best-elicited” systems will be centaurs. We’ll care about personalities – at least, about steerability. There are interesting tradeoffs for ergonomics. Models have pretty good ToM for humans. Most of us (certain LLM-whisperers aside) don’t have great ToM for models; it may help to shoot for more legible personas. It matters how models treat their users. Claude is good at intuiting what you want; Codex, less so. A lot of the time it’s pleasant to use Claude. But sometimes it’s better to take your medicine: it’s worth being forced to spell things out, and preferable to be punished sooner rather than later when you don’t.