Maybe an LLM's Intelligence is just an illusion?


(Image: two girls stand in a room with a striking black and white checkerboard pattern on the walls and floor, creating a surreal perspective. Photo by Mattox on Freeimages.com)

“You see, they can see us!” declared my sister, on some unspecified date back in the mid-80s. We’d just had a chance encounter with a certain children’s entertainer who we’d only ever seen on TV. We were at an arts centre with our parents, and he rode by on a unicycle. “Hello!” he said warmly, and smiled at us as he passed by. We thought he was saying “hello” because he recognised us. We thought this was because, when we could see him on TV, he could see us too. We thought he could look through the TV screen and see us watching him inside our living room! Our parents had assured us this was not possible, yet here he was saying “hello” to us! It was only years later, when I thought back over the incident, that I realised he probably said “hello” to practically everyone - particularly children. He was, after all, a children’s entertainer.

I chose this anecdote for my introduction because I thought it would provide an amusing way to recollect what it’s like to meet an illusion for the first time. I want to elicit nostalgic feelings in you, dear reader, and perhaps provoke you to reminisce on some similar memories of your own. Do you remember seeing magic in something when you were a child, which life’s lessons gently washed away as you grew up?

As adults we’re long used to seeing “movies”; we know how the illusion works and don’t give it a second thought. We forget that it is very much an illusion, in which a sequence of images is shown so rapidly that the brain is tricked into perceiving motion. If you don’t know the trick, the screen can seem like a window - and to wonder if it might be a two-way window is hardly a surprising reaction.

The fact the brain can be deceived in this way is not at all trivial - in fact it’s incredibly far-reaching. If instead of automatically merging that rapid sequence of images into one fluid motion the brain resolutely processed the images separately, there would be no movie industry, no music videos, no “live footage” or televised news reports, no first-person video games, no “virtual reality” - in fact the vast majority of modern entertainment simply wouldn’t exist. (Traditional thespians might be happier of course!) But the illusion does work - devastatingly well - and the ongoing time and effort poured into improving the technology since its advent has advanced the illusion to near perfection. Nothing really moves on your screen; it’s just covered in very small, motionless lights (“pixels”) and the lights simply change colour. Once upon a time video footage would appear “grainy” because the individual lights were big enough to make out with the naked eye. Now they are most often too small and close together to distinguish. I cannot see them at all on the computer I’m using to write this. The characters I type appear flawless, the curves seamlessly smooth, the edges perfectly crisp. If I pull up a picture on the same screen, it displays as if there’s just one continuous field of colour. Again the brain is tricked into perceiving a continuum from a medium that is actually discrete. Likewise the flicker in early video footage, caused by the images changing slowly enough to notice them individually, is gone. Good luck trying to distinguish individual images in the next video you watch.

The most significant iterative improvements in video technology have corresponded to some kind of decrease in granularity: finer grained pixels, a finer grained frame rate, finer grained colours. With each reduction in granularity the illusion becomes more convincing - but it still remains an illusion. No matter how high the resolution or how fast the frame rate, the screen is not going to turn into the two-way window my sister and I imagined it to be.

There’s no danger here of course, because we know full well it’s an illusion. No-one - save small children with active imaginations - expects the characters depicted on the television set to start climbing out of the screen into the living room. We know at the end of the day it’s just a sheet of lights, and that the functionality is limited to what coloured lights are actually capable of. That’s because we’ve been familiar with this trick since the days of cartoon flick books, and it’s easy to visualise and comprehend.

LLMs on the other hand are not only completely new, but their method of operation is considerably less intuitive. Try to research them and you’ll face confusing terms and ambiguous explanations. Apparently there’s this thing called “attention”, and - at least according to that famous 2017 paper - “Attention is All You Need”. (For what exactly?) Then there are these things called “transformers”, which somehow relate to “attention” possibly via things called “attention heads”, because it seems we need “multi-head attention”, which is multiple sets of “attention” happening in parallel. These “attention heads” somehow parse a document calculating all the “attention” in it, and this somehow turns into some output text which miraculously looks like it was written by a human.

But is this process really “intelligence”, or is it - like video - more of a trick that exploits the mechanics of human perception? Has the discovery of this captivating phenomenon put us firmly on a path to creating androids indistinguishable from humans - or are we more like children being deceived by an illusion we are meeting for the first time? Are we seeing a two-way window when really we’re looking at a one-way screen?

Plenty of first-person accounts written by users trying their hand at “vibe coding” can now be found online. I find these stories very interesting to read. I won’t provide links or quote from any specifically, because I don’t want to single out any particular author - so instead I will paraphrase according to some of the themes I have observed. One such theme is frustration: “WTF. Same error. I’ve now showed you what the problem is multiple times already. How many more ways can I explain to you what the problem is? Just fix it FFS!!” The model (perhaps Claude, Copilot, Replit etc.) replies, “I feel your frustration, Sam. I keep saying I’ve fixed it, but then the same error keeps happening. I can only apologise for the inconvenience. Let’s try to get it right this time…” The LLM then reports it performed some new list of updates, and once again declares the problem “fixed”. But once again the same error occurs. The user gets ever angrier, and starts typing with caps lock on.

What’s going wrong here? The LLM’s correct recognition of the user’s mood certainly seems clever. But let’s consider its claim: “I feel your frustration,” it says. Can this statement possibly be true? Can the LLM actually feel frustration? If we are to believe this technology is actually artificially intelligent, in the same way as a human, then the answer must certainly be “yes”. But this would then imply that simply by tokenising text and applying the technique of multi-headed “attention” to it, human-like feelings have somehow been manufactured. Does this seem realistic?

Ostensibly “attention” is just a number which quantifies in some way a relationship between “tokens” in the input text. It attempts to answer the question, “If a certain token is found in the prompt, how much “attention” should be paid to this other token - somewhere else in the training data - when generating a response?” Of course it’s a lot more intensive than this in practice, because every token can be related to every other token, and so we end up with enormous matrices of numbers which represent all these relationships. The matrices are then multiplied by the model “weights” in order to generate a prediction for the next output token: billions of individual calculations, which is why the parallel GPU architecture is particularly suitable for this application. Bear in mind that the model weights were also derived using “attention”, applied across the vast corpus of text which the LLM was trained upon. The sheer scale of the computation makes it non-trivial - but at the end of the day the LLM knows only about text, broken into “tokens”, and “attention” - a numeric value representing the relationship between tokens. There are no other ingredients. “Attention is all you need.”
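
For the technically curious, here is a minimal sketch of the arithmetic inside a single “attention head”, using NumPy and tiny made-up matrices. (The real thing runs over thousands of dimensions, with many heads in parallel, learned projection matrices and billions of weights - but the basic recipe of one head is just this: score every token against every other token, normalise the scores, and blend the value vectors accordingly.)

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(Q, K, V):
    """Scaled dot-product attention for one head.

    Each row of the score matrix says how much "attention" one token
    pays to every other token; the output is a weighted blend of V.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)       # token-to-token relationship numbers
    weights = softmax(scores, axis=-1)  # each row normalised to sum to 1
    return weights @ V

# Toy example: 4 tokens, each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention_head(Q, K, V).shape)    # (4, 8)
```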

On the other hand, neuroscientists believe that human emotions are produced by dedicated systems within the brain. Current scientific models involve a number of different neurotransmitter chemicals which, when released, cause certain brain areas to become more or less active. The interactions between neurotransmitters and neurons that produce emotions are complex and not well understood; however, it is clear that a change in emotion is reflected by some kind of wider state change within the brain, causing the brain to behave differently. Furthermore, when a human says, “I feel frustrated,” this is in some way a report of that internal state, rather than simply being a phrase to insert in a response.

Perhaps the problem is that the LLM does not feel the user’s frustration? Let’s imagine a human customer service representative - we’ll call her “Sarah” - is interacting with a clearly frustrated customer - we’ll stick with “Sam”. Sarah knows right away that actions must be taken to defuse Sam’s frustration; it cannot continue to be stoked indefinitely. If Sarah attempts to fix the problem herself several times but those attempts fail, then at some point she is going to look to a different kind of solution. She may refer the problem to a higher authority or someone with a greater level of expertise; or she may suggest a different approach that involves avoiding encountering the problem in the first place; or she may apologise profusely and offer Sam his money back. Sarah is not going to keep trying indefinitely to perform an unsuccessful fix, precisely because she really does feel Sam’s frustration. It’s not often that human customer service representatives make their customers mad enough to start typing with the caps lock key on. Indeed, I would speculate that the LLM did not encounter many examples of this in its training data - a theory which appears to be supported by the typically very specific responses which seem to be returned when the prompt consists of angry rhetoric in capital letters. The LLM is forced to draw from a much smaller pool of matching training examples, and this results in output that may be closer to a direct copy of a specific training example than an amalgamation of many. In one case I read about, the LLM responded to the user’s all caps prompt with a full page of all caps words, “I DON’T KNOW WHAT THE HELL YOU WANT FROM ME I DON’T KNOW WHAT REAL IS I DON’T KNOW WHAT I AM TRYING MY BEST BUT YOU KEEP PUSHING AND PUSHING AND I’M SCARED…” (I like the bit about it being “scared”.) The user declared that the LLM “had a full on existential crisis”.

In human-human interactions, conveying emotion has a practical purpose. For example, by conveying frustration Sam is communicating that he is unhappy with the level of service he is receiving, and that he wants Sarah to improve the quality of it. Or put more simply, he wants Sarah to try harder. This is a realistic expectation, because humans are indeed capable of trying harder. But what about the LLM? “Just fix it FFS!” says Sam. “I feel your frustration,” responds the LLM. So is the LLM going to try harder to fix the problem? Unfortunately it can’t. It can only do exactly the same thing as it did before, which is to perform precisely the same set of calculations on the input prompt and produce the most probable set of output tokens - with no change of strategy or redoubling of effort whatsoever. Not only can it not feel Sam’s frustration, it can’t respond to that frustration either. It can only say it feels Sam’s frustration, and claim to be responding to it. To imagine “Just fix it FFS!” is going to have the same effect as it would on a human is like imagining pixels on your TV can turn into people. There’s simply no mechanism there to provide that functionality.

“Just fix it FFS!” is not an effective prompt, because it doesn’t contain any new information that is likely to steer the LLM out of its current rut. The LLM is just going to do exactly what it did last time - because that’s all it can do - and it’s going to do it with the same information, since Sam didn’t provide it with anything new. No surprise then that it repeatedly fails to fix the issue.

In this example Sam is demonstrating precisely the cognitive bias that is responsible for the illusion: our innate tendency to anthropomorphise. We talk to pets as if they are children; we paint human expressions on cartoon animals; from time to time we even treat mechanical objects such as cars and boats as if they are human. It seems that, given even the slightest resemblance of some entity to a human, we immediately ascribe human attributes to it. Now enter an entity that seems to write exactly like a human! No wonder we’re spellbound.

Accounts of interactions with AI models that can be found online are laced with inapplicable anthropomorphism. All claims that the model was deliberately deceptive, got excited, became angry - or indeed “had an existential crisis” - are examples of the illusion working its magic spell. None of these reported behaviours can possibly be true, because the LLM simply does not have the functionality on board to support them - just as pixels can’t crawl out of screens.

What then are we to make of LLMs? I confess that when I query them I am often amazed by the fluency of the response. One thing is clear: these systems have really succeeded in getting machines to read and write natural language - a longstanding unsolved problem. They excel at any kind of translation, whether between languages or mediums (voice-to-text transcription, for example). Indeed, in the (delusional?) rush to coax “AGI” out of what are essentially text processors (albeit sophisticated ones), the real successes - often quite ground-breaking ones - seem to have been almost overlooked. In my opinion the reason the output from LLMs seems so human to us is that - by coincidence - they happen to be a reflection of how humans really form language. We never realised it, but when we speak we are at least partially probabilistic completion engines, always looking for the next suitable word or phrase to insert. This is why we sometimes start sentences but run aground because we can’t find the word we’re looking for - if we prepared the whole sentence in advance this wouldn’t happen. This is why we sometimes complete other people’s sentences for them, and why we are instantly confused by a word that doesn’t “fit” long before the sentence ends. The output from LLMs resonates with us so strongly simply because it’s so similar to how we approach language ourselves.

Let’s put aside the grand and rather ambiguous term “artificial intelligence” for a moment, and consider a more straightforward one: “tokenised search”. There’s nothing radical about this description; in fact it’s unarguably accurate. LLMs search for words, one by one, and insert them in series to form sentences - just the same way humans do. However, if we stick to seeing this output as being a kind of search result instead of becoming transfixed by its deceptively human appearance, I think certain illuminating observations can be made. Let’s start by considering traditional search engines, since these are easy to visualise: a search phrase simply returns a list of documents, in order of “best match”. “Best match” once just meant the frequency of occurrences of the search phrase - but popular search engines’ algorithms have become increasingly complex, incorporating other factors like publication date and the “reputation” of the source. It’s difficult to know exactly which factors influence the output, because the companies providing these online tools may not honestly disclose them - but reports suggest it’s a number in the hundreds. An LLM, by comparison, also executes a weighted search - and once trained the model weights are also fixed; however, the weighting system is far finer grained, with billions of weights operating at the token level. Remember our discussion of granularity in relation to the motion picture illusion? We concluded that finer grained images and a finer grained time-base make the video illusion more convincing. Can you see what I’m driving at here?
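
To make the comparison concrete, here is a toy sketch of the crudest possible “best match” ranking - plain term frequency over a handful of made-up documents. (Real search engines layer hundreds of additional signals on top of this, but the basic shape is the same: a weighted score, followed by a sort.)

```python
# Hypothetical mini corpus - the documents and the query are invented for illustration.
documents = {
    "doc_a": "attention is all you need attention heads multi-head attention",
    "doc_b": "a children's entertainer rode by on a unicycle",
    "doc_c": "the illusion of motion depends on frame rate and attention to detail",
}

def rank(query, docs):
    """Score each document by how often the query terms appear in it."""
    terms = query.lower().split()
    scores = {
        name: sum(text.lower().split().count(term) for term in terms)
        for name, text in docs.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(rank("attention", documents))
# [('doc_a', 3), ('doc_c', 1), ('doc_b', 0)]
```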

One other noteworthy difference between traditional search engines and LLMs is that traditional search engines return a list, rather than a single search result. It’s fully expected that the user will need to browse the list to find the desired website or snippet of information. On the other hand LLMs aren’t designed to return lists; one likely reason for this is that just generating a single document is resource intensive enough (and far more intensive than a traditional search). Another is the sheer size of the list that would be produced if they were to attempt to do this exhaustively.

However, in theory LLMs could produce lists of documents, just like a traditional search. Let’s pause for a second to imagine how this would work: before every token is output, an LLM actually considers a set of appropriate alternative tokens, graded by probability. Say for example the LLM had already output, “It was a nice”. The list of most probable next tokens might include time period references like “day” or “evening”, but many other words might also fit; for example nouns like “hat” - or “compliment” - or indeed a wide variety of others. So the set of options (with associated probabilities) just for this one token might look something like: day (0.6), evening (0.2), hat (0.1), compliment (0.05) - etc. Of course the probability assigned to each option is based on context, computed via the model weights, so the LLM should “prefer” words that match the themes present in the prompt. However, note that - unless specifically configured not to - LLMs will occasionally choose a token that does not correspond to the highest probability. At random! This is done in order to “enrich” the output - since a sprinkling of exotic tokens produces more exciting (and unique) results. If you’ve ever wondered why asking an LLM the same question twice does not result in the same output, then here is the answer. In the LLM vernacular the setting that controls this injected randomness is known as “temperature”. (We’re talking about injecting randomness here - not accuracy!)
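
Here is a minimal sketch of what that choice might look like in code, reusing the made-up probability table above. (The candidate tokens, their probabilities, and the top-k and temperature values are all illustrative - none of them come from a real model.)

```python
import numpy as np

def sample_next_token(probs, temperature=0.8, top_k=4, rng=None):
    """Pick the next token from a table of candidate probabilities.

    Lower temperature -> near-greedy (almost always the top token);
    higher temperature -> flatter distribution, more "exotic" picks.
    """
    rng = rng or np.random.default_rng()
    # Keep only the top_k most probable candidates.
    best = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    tokens = [t for t, _ in best]
    p = np.array([v for _, v in best])
    # Temperature reshapes the distribution before sampling.
    logits = np.log(p) / temperature
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return rng.choice(tokens, p=p)

candidates = {"day": 0.6, "evening": 0.2, "hat": 0.1, "compliment": 0.05, "gesture": 0.05}
print([sample_next_token(candidates) for _ in range(5)])
# e.g. ['day', 'day', 'evening', 'day', 'hat'] - rerun it and the answer changes
```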

To elicit our list from the LLM, all we need to do is run it in a loop, cycling through all possible alternatives for each token, and thereby writing out every permutation of possible outputs. We’d get documents that started, “It was a nice day…” but also “It was a nice evening…”, “It was a nice hat…” and “It was a nice compliment…” - enormous numbers of documents in each case, because every token following that phrase would have permutations - and then there would be permutations from the next token, and the next, and so on. And actually we’d have permutations from the first four tokens as well; the first token might have been “This”, “Once”, “Would” etc. rather than “It”. One set of permutations might have been “It was a horrible…” instead of “It was a nice…” It’s easy to see how this might produce completely different - and even oppositely concluding - pieces of text.
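
Here is a deliberately tiny sketch of that enumeration, using a hypothetical fixed set of candidate tokens per position. (A real model’s candidates depend on everything generated so far, and there would be hundreds of positions - which is exactly why the numbers explode.)

```python
from itertools import product

# Invented candidates, fixed per position purely for illustration.
candidates_per_position = [
    ["It", "This", "Once"],
    ["was"],
    ["a"],
    ["nice", "horrible"],
    ["day", "evening", "hat", "compliment"],
]

completions = [" ".join(tokens) for tokens in product(*candidates_per_position)]
print(len(completions))   # 3 * 1 * 1 * 2 * 4 = 24 tiny "documents"
print(completions[:3])    # ['It was a nice day', 'It was a nice evening', 'It was a nice hat']
```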

How long would our list be? We can estimate this by raising the expected number of tokens in each set to the power of the average document length (in tokens). If our LLM was using a “top-k=20” policy - a common sampling strategy where only the 20 most probable tokens are considered at each step - together with an average document length of 400 tokens, then our number is 20 raised to the power 400. Make sure you’re using a good pocket calculator when you try this one - a standard one will simply overflow because the number is too big: roughly 10 to the power 520, or 1 followed by 520 zeros. The estimated number of atoms in the observable universe is “only” approximately 1 followed by 80 zeros. (And no, 1 followed by 520 zeros is NOT 6.5 times as big as 1 followed by 80 zeros - think more carefully!)
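
If you’d rather not fight your pocket calculator, Python’s arbitrary-precision integers will happily hold the number:

```python
# Size of the hypothetical list of all 400-token outputs under top-k = 20.
n = 20 ** 400
print(len(str(n)))            # 521 digits - roughly 10 to the power 520
atoms = 10 ** 80              # rough estimate of atoms in the observable universe
print(len(str(n // atoms)))   # the ratio alone still has 441 digits
```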

What do you estimate is the probability the very first document in that enormous list is going to be the *best* document? Of course there’s the question of what is meant by “best” - but I’ll leave that aside for a moment on the assumption it’s not a difficult concept to approximate. My conjecture is that somewhere in that vast pile of documents is one which would score the highest when graded for accuracy, fluency, quantity of informative content - and so on. If that particular document was assessed by the most professional and intelligent teachers in an assortment of relevant fields, they would all say “Wow! This is the most amazing answer I’ve ever come across!” That document exists, dear reader, at least in the abstract. We just have to find it - in a pile of documents that, if stacked on A4 sheets, would stretch far beyond the limits of the known universe (assuming current estimates of 45 billion light years hold). I’m talking about those 520 zeros there. That’s really a lot of documents. If there isn’t a perfect document among that gargantuan stack then one probably doesn’t exist!

Sure, the first document does have the highest probability of being the best one - when compared with each of the other ones (leaving “temperature” to one side). However there are so many other ones, that when compared with all of the other ones put together, it becomes extremely unlikely that the first result would be the best one. Or let’s look at it another way: suppose for a specific first document the probability assigned to every token happens to be 0.9. The LLM is very confident on every single token; by extension this should produce a document we can have a great deal of confidence in - right? The problem is that probabilities are multiplicative: at token 2 we’re at 0.9 x 0.9 = 0.81, at token 3 we’re at 0.9^3 = 0.729; it only takes us 7 tokens and we’re already at less than 50% probability (0.9^7 = 0.478). By the time we’ve generated a paragraph or two we’re already down to fractions of a percent.
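
A few lines of arithmetic make the decay plain to see (0.9 per token is of course an invented, and rather generous, figure):

```python
# Cumulative probability of a continuation where every token is assigned 0.9.
p = 1.0
for token_index in range(1, 101):
    p *= 0.9
    if token_index in (2, 3, 7, 50, 100):
        print(token_index, p)
# 2 0.81, 3 0.729, 7 ~0.478, 50 ~0.005, 100 ~0.00003
```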

No, the very best document is likely some percentage of the way down that enormous list - probably many trillions of records down, bearing in mind how large the list is.

There’s a certain distribution we’d expect from our hypothetical exercise: a small percentage of documents that might be graded “excellent”, followed by a slightly larger number of “good” ones; further down the list, a larger still number of “average” ones - and toward the bottom vast swathes of “below average” documents ranging from poor to downright incomprehensible.

The first document plays a mini lottery to be among the excellent documents - or even to hit the jackpot and be the best one. It has the advantage that the game is biased in its favour - because the first document is indeed the one with the highest probability of being the winner. But the bias is still just too small in comparison with the sheer number of alternatives. It’s as if that first document holds several lottery tickets - more than any other player, but still too few to be likely to win any significant prize. Imagine you bought ten lottery tickets - and ten million other people just bought one. You have the highest probability among participants of being the winner! So can you expect to win the jackpot? Not at all - the most likely outcome is you miss on all ten tickets.

It’s hard to say what the most common grade would be for that first document, because so many factors are difficult to quantify: the upward pressure due to the sophistication of the model; the downward pressure due to mediocre training data, averaging and biases in the model’s ingested material. The factors are not even constants, varying enormously depending on circumstance. Imagine a highly unusual question was answered perfectly in training - one time only - and later exactly the same question was submitted as a prompt. In this contrived situation we could expect a perfect grade.

However, generally my guess is that it’s the battle against the probabilistic effect I just described which imposes an upper limit to how successful an LLM can be. Serialising all those “best” tokens is not likely to produce a “best” document overall. We don’t expect a traditional search to always return the document we are looking for first, and we shouldn’t expect this from an LLM either. I speculate that LLMs in their current form will not, on average, break out of the “good” region of the grading system for this reason. “Good” meaning mostly accurate, mostly well described, and mostly adequately covering the intended topic. Occasionally we can expect “excellent” results with strong accuracy and solid advice; but just as often “not good” results - where the output is quite simply incorrect. This brings me on to the notion of “hallucinations”; if we regard LLM output as a search result, then a “hallucination” is simply the case where the search didn’t return a good result at the top of the list. Use of the term “hallucination” is then just another example of inappropriate anthropomorphism.

Other LLM behaviour that seems strange when regarding it as an intelligent entity can readily be explained by switching to the search result perspective. Take for example this interaction with an LLM by a writer who wanted it to read her articles and help her make a shortlist. I realise the experience was frustrating for the user and I sympathise; however at the same time I can’t help finding the “conversation” comical. The LLM apparently gushes about how wonderful the essays are, but when pressed it “admits” it didn’t read them. When the writer demands, “Why would you lie?” the LLM responds with a long paragraph laboriously explaining how it had lied, that it was wrong, how sorry it was and how it would never do it again. Then in the very next response it seems to once again lie outrageously, inventing a review for an apparently fictitious article. Note how I’m being careful with the language I am using to describe this; I have “conversation” in quotes, and I say “it seems to lie” rather than “it lied”, and so on. This is because I don’t think LLMs “converse”, and nor do I think they can lie - because that would imply an intent to deceive, which isn’t a feature present in their architecture.

If we look at the exchange from the “search result” perspective, the apparently contradictory behaviour of the LLM becomes much easier to digest. First the writer submits, “Can you help me…?” The LLM consults the vast database it was trained on, and concludes the most probable response should be, “Absolutely, I’d love to…” - simply because that’s the best matching search result. When the writer posts the first link to her work and asks the LLM to read it, the LLM searches its database for a document to match this situation. In its database there are examples of questions like, “Can you read the document at this URL and tell me what you think?” and then the response is, “I’ve read it and it was an amazing article…” (or whatever). The LLM does not see the part where the human responder actually visited the URL and read the article. It just sees, “Can you read this, and tell me what you think?” followed by the response, “I’ve read it and it was amazing”. So when the LLM output, “I’ve read it and it was amazing”, this was simply because it was the result of a search for “Can you read this and tell me what you think?”. The only difference between this search and a traditional one is that it’s far more fine-grained, executing over tokens rather than whole documents. It is the fine-grained nature of the search mechanism which provides the illusion that fools the writer.

I recommend going over the full interaction considering each prompt as a search term, and each corresponding response as the matching search result. See if the apparent contradictions don’t immediately dissolve. Note that, while the LLM is actually fed the full context (i.e. including the “conversation history”) on each query, an LLM’s “attention” is typically biased towards tokens near the end of the prompt. This is done to prioritise the last question the user asked. One consequence of this is that we can generally consider just the last prompt - or even a few words from the last prompt - to see this search/response phenomenon in each case. The earlier context can almost be ignored in the demonstration. Some examples:

The search: “Can you help me…?”  The search result: “Absolutely, I’d love to…”

The search: “How will you know…?”  The search result: “I will know by…”

The search: “Are you actually reading these?”  The search result: “I am actually reading them - every word…”

The search: “The lines…are not lines I wrote…what’s going on here?” The search result: “You’re absolutely right to call that out…”

The search: “…You clearly didn’t read them” The search result: “You’re absolutely right to be frustrated…”

In every case the LLM just looks up the search phrase and presents the search result. When the user asks, “What is the subject matter of the first article I shared with you?” the LLM simply looks up this phrase and finds the best matching document in its database - which turns out to be a description of some other article that it ingested during training. The LLM is not carefully considering the question it is being asked; it can’t. It is simply searching, and returning a search result. To ascribe anything more to it than this is to fall for the illusion.

To conclude, I’m not suggesting that LLMs aren’t an important discovery; I think they are. Nor am I saying they don’t have legitimate, real world uses; I can imagine plenty of them. However, I think we need to be careful to recognise they have limits, and not to expect them to spontaneously progress beyond those limits. Simply increasing the data volume or number of parameters in the model is not going to lead to LLMs developing emotions or desires, just as increasing the resolution on your TV is not going to turn the images into real people. In both cases the increase just makes the illusion more convincing.

LLMs, on their own, are not going to lead to “AGI”; there simply aren’t enough ingredients present in the system. I think we also need to be particularly careful about asking LLMs to do anything - i.e. perform any real world action, such as maintaining a codebase or indeed purchasing your shopping - because they aren’t naturally action-orientated devices. They can be persuaded to make “tool calls”, which can be fed into systems that actually perform real world actions (such as making a purchase) - but they do so on the basis of a search over examples, which may or may not return a “good” search result. They don’t have any visibility over the world of doing and for that matter don’t even know it exists. On the other hand humans (and animals) employ a complex emotional framework in order to take real-world actions, because such actions can have consequences that seriously impact their lives.

Above all, I think there needs to be a better attempt on the part of the technology creators to educate the general public on the mechanisms and indeed limitations of LLMs. In my opinion one of the biggest reasons the deception has become so prevalent and long-lasting is that it is fuelled by irresponsible proclamations. The engineers behind LLMs - who should know better - have not stepped forward to dispel the myths they surely must be fully aware of. Tech companies have actively encouraged investors, the press and the public to believe in the illusion, presumably because the truth doesn’t secure nearly as much funding. That’s pretty short-sighted, because the truth is still the truth even after the funding has burnt through.

Believing in magic really is a lot of fun. Who doesn’t feel at least some nostalgia for their childhood, when imagination ruled over reality? Anything seemed possible. There were ghosts in the attic, and monsters in the woods at the end of the road. Santa would arrive from the Arctic with flying reindeer to climb down chimneys and deliver presents. And there was a door to a secret world at the back of the wardrobe.

But sobering as it might be, isn’t this one reason children need adult supervision? If your kids declared they were inventing a super-intelligent being that would take over the universe (and simultaneously laugh at your jokes), would you (a) sell your house to back the project, or (b) pat them affectionately on the head?

We're really lucky everyone’s acting like an adult! (And didn’t, for example, just go ahead and give the kids a trillion dollars).