Thoughts on artificial intelligence

December 22nd, 2023




We define AI as consisting of the following:

  1. A world model, i.e. a representation of how things behave in the agent’s environment
  2. A policy, i.e. a set of goals that the agent aims to achieve
  3. A search in the space of actions in order to optimize the policy
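
To make this concrete, here is a minimal sketch of the decomposition in Python. All of the names here (Agent, world_model, goal_score) are hypothetical, and the one-step greedy search stands in for whatever search a real agent would actually use:

```python
from dataclasses import dataclass
from typing import Callable, Hashable

State = Hashable
Action = Hashable

@dataclass
class Agent:
    # 1. World model: predicts the next state, given a state and an action.
    world_model: Callable[[State, Action], State]
    # 2. Policy: scores how well a state satisfies the agent's goals.
    goal_score: Callable[[State], float]
    # The actions available to search over.
    actions: list[Action]

    def act(self, state: State) -> Action:
        # 3. Search: here, a one-step greedy search. Pick the action whose
        # predicted outcome the policy scores highest.
        return max(self.actions,
                   key=lambda a: self.goal_score(self.world_model(state, a)))
```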

Language models interpreted through this framework:

  1. The world model is implicit in the language model: it finds patterns in the input text, building models of the concepts it encounters, albeit concepts expressed only in words.
  2. The policy is to generate text according to the model’s likelihood function (and whatever sampling method is used).
  3. The search is a decoding procedure such as beam search over the space of likely next sequences, with the resulting actions being output words (see the sketch below).
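
Here is a hedged sketch of that search, written against a hypothetical next_logprobs interface (in a real LLM, this function would be one forward pass of the network):

```python
import heapq
from typing import Callable

# Hypothetical interface: maps a token sequence to {token: log-probability}
# for the next position. In a real LLM this would be one forward pass.
NextTokenLogprobs = Callable[[tuple[str, ...]], dict[str, float]]

def beam_search(next_logprobs: NextTokenLogprobs, prompt: tuple[str, ...],
                beam_width: int = 4, steps: int = 10) -> tuple[str, ...]:
    # Each beam is (cumulative log-probability, token sequence).
    beams = [(0.0, prompt)]
    for _ in range(steps):
        candidates = []
        for logp, seq in beams:
            for token, token_logp in next_logprobs(seq).items():
                candidates.append((logp + token_logp, seq + (token,)))
        # Keep only the beam_width most likely continuations: the "policy"
        # is the likelihood function, and the "actions" are output tokens.
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]
```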

Shortcomings of language models

  1. The world models consist of text alone. Anything that the model “knows”, i.e. anything that is captured by its model of the world, must be described in text, or at least deduced from patterns of text. For example, if I say that Billy attends the University of Michigan, then we can guess that Billy takes classes there, lives near the university, has a GPA, maybe participates in clubs, and likely received a high school education. Even if we don’t have examples of people who went to the University of Michigan in the data, or even if we don’t have any information about U of M in the training data at all, the world model has a model of universities in general, and has modeled what it’s like for a student to attend one. From descriptions of students at universities, the “world model” has built up a model of the concept of attending a university, which we can apply to the specific case of Billy.

    There is so much information that is not captured in words. A lot of information is captured in images, videos, or general participation in real-world events. However, a lot of that information is summarized in words and written down, so it makes sense that a model of the world based on text alone would be so capable. This isn’t that big of a shortcoming, yet we can easily imagine cases where a language model would have no idea what we’re talking about: just find a concept that hasn’t been verbalized, or hasn’t been talked about much, and describe it to the model. For example, ask “how do you do X” where X is a niche and highly technical task. Note that this argument would hold even if models were augmented with videos and images: only concepts documented somehow in the training data can be known to the AI, and humans have a very different training set. Also, low-data concepts are much more easily captured by humans, highlighting a modeling deficit in modern NLP methods.


  2. The world model is difficult to query. Suppose I want to examine the model’s sub-model (if you will) of universities. Namely, I want to know how the world model has modeled universities, and examine all of the patterns it has identified about them. How would I go about doing this? When I look at the internals of the model, all I see are activation values: seemingly random numbers changing at each word position. I suppose I could try clustering these values, to see whether some of them correspond to different concepts, ideas, or patterns, but at least in LLMs there are too many values to cluster. My only other option is to sample from the model based on some input text (e.g. “Universities are ____”) and see how the model fills in the blank (see the probe sketched after this list). The problem is that this is a very limited and opaque way of examining the model. The technique doesn’t provide an exhaustive list of all the patterns identified, and even if it did, it provides no clean way of clustering or grouping them for easier analysis. I would argue that this is an inconvenient way of examining the model.


  3. Question answering isn’t great when the search process is just a matter of generating text, but it’s not terrible: these models have some capacity to reason. This is another minor point. Sure, a better search method could do better, but language models are still capable of reasoning.
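
For what the fill-in-the-blank probe from point 2 looks like in practice, here is a sketch using the Hugging Face transformers library and bert-base-uncased (both my choices for illustration; any masked language model would do):

```python
from transformers import pipeline

# Probe the model's "sub-model" of universities by sampling completions.
# This surfaces only a handful of high-likelihood patterns; it is nowhere
# near an exhaustive listing of what the model has learned about the concept.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for prediction in unmasker("Universities are [MASK]."):
    print(f'{prediction["token_str"]:>12}  p={prediction["score"]:.3f}')
```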


The nature of knowledge & the task of organizing information

The task of organizing information requires a world model, and cannot be solved without one, i.e. the task of organizing information is AI-complete. To decide that two documents belong together, a system must recognize that they are about the same thing, and sameness of meaning is a fact about the world, not about surface text (see the sketch below).
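
To gesture at why surface statistics alone fall short, here is a small sketch using scikit-learn’s TF-IDF vectors (my choice of bag-of-words method; any would show the same effect). Without a model of what the sentences mean, paraphrases that share no words look no more related than sentences on different topics:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The car would not start this morning.",     # same meaning as next line
    "My automobile refused to turn over today.",
    "The stock market rallied this morning.",    # different topic
]

tfidf = TfidfVectorizer().fit_transform(docs)
sims = cosine_similarity(tfidf)

# Word-overlap statistics score the paraphrase pair lower than the
# unrelated pair, because "car" and "automobile" share no surface form:
print(f"paraphrases: {sims[0, 1]:.2f}   unrelated: {sims[0, 2]:.2f}")
```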


Next steps for AI

My guess is that the modeling deficiencies mentioned above will hinder a lot of potential applications of AI, such as robotics and autonomous agents. As mentioned, the “search” aspect of language models is limited to sampling from the model, because the model is opaque: a black box, not easily interpretable. So tasks that require more intensive reasoning about the world through the AI’s world model will be astronomically harder using models like today’s (deep learning systems based on transformers, CNNs, RNNs, or diffusion).

First, we need to solve the modeling problem: we need models that address the shortcomings listed above. Then, because the world model is easily queryable, we can design simple search algorithms that query it directly for actions according to a policy. A toy version of what that could look like is sketched below.
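
Here is a toy of that end state: a world model explicit enough to be read off directly, paired with a deliberately simple search. Everything here (the states, the actions, the breadth-first planner) is hypothetical, chosen only to show the shape of the idea:

```python
from collections import deque

# A hypothetical *queryable* world model: an explicit map from
# (state, action) to the predicted next state. Unlike a neural network's
# activations, every pattern the model holds can be read off directly.
world_model = {
    ("at_home", "walk"): "at_campus",
    ("at_campus", "enter"): "in_classroom",
    ("at_home", "sleep"): "at_home",
}

def plan(start: str, goal: str) -> list[str] | None:
    """Breadth-first search for a sequence of actions reaching the goal."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, actions = frontier.popleft()
        if state == goal:
            return actions
        for (s, action), nxt in world_model.items():
            if s == state and nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, actions + [action]))
    return None

print(plan("at_home", "in_classroom"))  # ['walk', 'enter']
```

The point is not the particular data structure; it is that once the model’s contents are directly inspectable, the search on top of it becomes almost trivial.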