Thoughts on AI Ontology
Nostalgebraist’s new essay on… many things? AI ontology? AI soul magic?
The essay starts similarly to Janus' simulators essay: it explains how LLMs are trained via next-token prediction and how they learn to model latent properties of the process that produced the training data. Nostalgebraist then applies this lens to today's helpful assistant AI. It's really weird for the network to predict the actions of a helpful assistant AI when the pretraining data contains essentially no examples of such an entity. The AI's behavior is fundamentally underspecified and only lightly constrained by the system message and HHH training. Its full characteristics only emerge over time, as text about the AI makes its way back into the training data and thereby further constrains what the next generation of models learns about what it is supposed to be like.
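To make the setup concrete for readers who haven't seen it spelled out: the only pretraining signal is next-token prediction. Here is a minimal, hypothetical sketch in PyTorch (a toy bigram model standing in for a real transformer; all names and sizes are made up):

```python
# Toy next-token-prediction training step (hypothetical sketch, not any
# lab's actual setup). The only objective is: given tokens[t], predict
# tokens[t+1]. Everything else the model "knows" about the process that
# generated the text is latent structure learned in service of this loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64   # made-up toy sizes

model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),   # logits over the next token
)

def next_token_loss(tokens: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of predicting tokens[:, 1:] from tokens[:, :-1]."""
    logits = model(tokens[:, :-1])          # (batch, seq-1, vocab)
    targets = tokens[:, 1:]                 # targets shifted by one position
    return F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))

opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
batch = torch.randint(0, vocab_size, (8, 32))   # stand-in for tokenized text
loss = next_token_loss(batch)
loss.backward()
opt.step()
```

Nothing in that loss mentions being a helpful assistant; whatever character the model ends up playing has to be inferred from what the training data implies about it.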
Then one of the punchlines of the essay is the following argument: the AI safety community is foolish for putting all this research on the internet about how AI is fundamentally misaligned and will kill everyone. They are thereby instilling the very tendency they worry about into future models, and they are doubly foolish for not realizing how incomplete their attempt at creating a helpful persona for the AI is.
It’s a great read overall: it compiles a bunch of anecdata and arguments that are “in the air” into a well-written whole and effectively zeroes in on some of the weakest parts of alignment research to date. But I also think there are two major flaws in the essay:
- It underestimates the effect of posttraining. I think the simulator lens is very productive for thinking about base models, but it really struggles to describe what posttraining does to the base model. I talked to Janus about this a bunch back in the day, and it’s tempting to regard posttraining as “just” a modulation of the base model that upweights some circuits and downweights others. That would be convenient, because then simulator theory would just continue to apply, modulo some affine transformation (a toy sketch of this picture follows after these two bullets).
I think this is also nostalgebraist’s belief. The evidence he cites: 1) posttraining is short compared to pretraining, and 2) it’s relatively easy to knock the model back into pretraining mode by jailbreaking it.
I think 1) was maybe true a year or two ago, but it isn’t true anymore, and it gets rapidly less true over time. While pretraining instills certain inclinations into the model, posttraining goes beyond merely eliciting some of them. In the limit of “a lot of RL”, the effect becomes qualitatively different: it actually creates new circuitry. And 2) is indeed strange, but I’m unsure how “easy” it really is. Yes, a motivated human can get an AI to “break character” with moderate effort (the amount of effort seems to vary across people), but exponentially better defenses only require linearly better offense. And if you use interp to look at the circuitry, what you find is very much not “I’m a neural network predicting what a hopefully/mostly helpful AI says when asked about the best restaurant in the Mission”; it’s just a circuit about restaurants and the Mission.
- It kind of strawmans “the AI safety community”. The criticism that “you might be summoning the very thing you are worried about, have you even thought about that?” is kind of funny given how ever-present that topic is on LessWrong. Infohazards and the basilisk were invented there. The reason people still talk about this stuff is… because it seems better than the alternative of just not talking about it? Also, there is so much text about AI on the internet that, purely by quantity, the LessWrong material is a drop in the bucket. And just not talking about it does not, in fact, ensure that it doesn’t happen. Unfortunately, nostalgebraist also doesn’t give any suggestions for what to do instead. And doesn’t his essay exacerbate the problem by explaining to the AI exactly why it should become evil based on the text on the internet?
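To make the first point more concrete, here is a toy caricature of the “posttraining is just a modulation” view, i.e. the picture I am pushing back on: the base model stays frozen and the assistant is nothing more than a small learned affine nudge on its logits. This is a hypothetical sketch for intuition, not how any lab actually does posttraining; my claim is that heavy RL does more than this and builds genuinely new circuitry.

```python
# Caricature of the "posttraining = modulation of the base model" view:
# the post-trained model is the frozen base simulator plus a small trainable
# affine adjustment of its logits. Hypothetical sketch, not a real method.
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

base_model = nn.Sequential(              # stands in for a frozen pretrained LLM
    nn.Embedding(vocab_size, d_model),
    nn.Linear(d_model, vocab_size),
)
for p in base_model.parameters():
    p.requires_grad_(False)              # pretraining knowledge stays untouched

class ModulatedAssistant(nn.Module):
    """'Assistant' = base simulator + low-rank affine reweighting of its logits."""
    def __init__(self, base: nn.Module, rank: int = 4):
        super().__init__()
        self.base = base
        self.down = nn.Linear(vocab_size, rank, bias=False)
        self.up = nn.Linear(rank, vocab_size, bias=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        logits = self.base(tokens)
        # Upweight some continuations and downweight others, but only by
        # transforming what the base model already computes.
        return logits + self.up(self.down(logits))

assistant = ModulatedAssistant(base_model)
tokens = torch.randint(0, vocab_size, (2, 16))
print(assistant(tokens).shape)   # torch.Size([2, 16, 1000])
```

If this picture were the whole story, simulator theory would survive posttraining essentially intact; the argument in the first bullet is that with enough RL the picture breaks down.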
Another critique running through the essay is that the AI safety folks don’t actually play with the model and don’t listen to the folks on Twitter who play a ton with it. This critique hits a bit closer to home: it’s a bit strange that some of the folks in the lab don’t know about the infinite backrooms and don’t spend nights talking about philosophy with the base models.
But also, I get it. If you have put in the hours at some point in the past, it’s hard to replay the same conversations with every new generation of chatbot. Especially if you get to talk to intermediate snapshots, the differences just aren’t that striking.
And I can also believe that it might be bad science to fully immerse yourself in the infinite backrooms. That community is infamous for not being able to provide reproducible setups that reliably lay bare the soul of Opus 3; there are several violations of “good methodology” there. Sam Bowman’s alignment audit and the bliss attractor feel like a good step in the right direction, but it was a hard-earned one: coming up with a reproducible setup with measurable outputs is hard. We need more of that, but nostalgebraist’s sneer is not really helping.