Large Language Models and what Information Theory tells us about the Evolution of Language

Walid Saba, PhD
ONTOLOGIK
Sep 26, 2022

In an article in The Gradient (and briefly in a previous post) I described what I called MTP — the Missing Text Phenomenon: the information that is never explicitly stated in everyday discourse because the speaker can safely assume it to be available to the listener. We appropriately call this missing, implicitly assumed, and mutually agreed upon information common knowledge. In those articles I presented a proof that BERT-based large language models (e.g., GPT-3) will never scale into true language understanding (although they might be useful for picking some low-hanging fruit in tasks such as summarization, search, etc.). The proof was intuitively simple:

The equivalence between machine learning (ML) and compression has been mathematically established (see this, and this).

Natural language understanding (NLU), on the other hand, involves decompression — specifically, it involves ‘uncovering’ all the missing and implicitly assumed information.

The above two are inconsistent (compression and decompression pull in opposite directions), and thus ML is not relevant to NLU.

The above argument, while appreciated by many, was not accepted by some who did not fully grasp its nature. In this article I hope to support that argument by briefly covering extensive research in psycholinguistics and neurolinguistics, in particular research that uses information-theoretic concepts and is based on data from evolutionary linguistics. This research can be summarized as follows:

In linguistic communication both the speaker and the listener want to minimize their effort: the speaker wants to minimize the effort of encoding a thought into a linguistic message, and the listener wants to minimize the effort of decoding that message. These two requirements work against each other, but since speaker and listener constantly change roles in communication, the evolution of language settled on a remarkable optimal point: the speaker says the minimum possible, but not less than is necessary for the listener to successfully recover the content of the message. This process is governed by what Zipf called the principle of least effort (for more details see these very important papers, and references therein: 1, 2, 3, and 4).

The above optimization process, which minimizes the efforts of both speaker and listener, necessarily introduces ambiguity, because it semantically compresses the encoded message as much as possible. As noted here, when there is enough common knowledge shared between speaker and listener, semantic entropy can be reduced while compressing the encoded message without losing information (“with the help of shared background knowledge, we will be able to communicate with shorter messages to achieve the maximal informativeness of the source.” p. 6).

Let us consider a simple example. Consider the sentence in (1).

(1) The laptop did not fit in the briefcase because it is too small.

The reference ‘it’ has two possible meanings here — it could refer to the laptop or to the briefcase. Let us assume there is no shared background knowledge and that all the information required to understand the message is in the text. In this case ‘it’ is equally likely to refer to the laptop or to the briefcase — since there are two possibilities, the probability that ‘it’ refers to either one is 0.5. Using Shannon’s entropy (uncertainty) measure, we get the following, where H[M] is the uncertainty in the message given all possible meanings m:
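
With two equally likely referents, and taking logarithms base 2 (my assumption, following the usual convention), the computation works out to one full bit of uncertainty:

H[M] = -\sum_{m \in M} p(m)\,\log_2 p(m) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 \text{ bit}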

Thus, in the absence of any background knowledge, uncertainty is at a maximum and ‘it’ is absolutely ambiguous. The data in the text itself does not contain information to make one reference more probable than the other. However, let us assume that speaker and listener have some mutual information (called common knowledge, symbolized by C). The set C could, for example, contain the following:

if LARGER-THAN(x, y) then NOT-FIT-IN(x, y)
LARGER-THAN(x, y) iff SMALLER-THAN(y, x)
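
To make the role of C concrete, here is a minimal Python sketch (the function and object names are mine and purely illustrative, not any actual NLU system): a candidate reading of ‘it’ is accepted only if, together with the rules in C, it explains why the laptop did not fit in the briefcase.

def reading_explains_sentence(too_small: str, other: str) -> bool:
    """Reading: '<too_small> is too small', i.e. SMALLER-THAN(too_small, other)."""
    # C: SMALLER-THAN(too_small, other) iff LARGER-THAN(other, too_small)
    entailed_larger_than = (other, too_small)
    # C: if LARGER-THAN(x, y) then NOT-FIT-IN(x, y)
    entailed_not_fit_in = entailed_larger_than
    # Sentence (1) asserts NOT-FIT-IN(laptop, briefcase); accept the reading only
    # if what it entails is exactly that assertion, i.e. only if it explains (1).
    return entailed_not_fit_in == ("laptop", "briefcase")

candidates = ["laptop", "briefcase"]
resolved = [c for c in candidates
            if reading_explains_sentence(too_small=c,
                                         other=[o for o in candidates if o != c][0])]
print(resolved)  # prints ['briefcase']: only one reading survives

With C available, only one reading survives, which is why the conditional probability of that reading is taken to be near 1 in what follows.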

We can now compute the uncertainty (entropy, ambiguity) under the assumption that speaker and listener have access to mutually shared common knowledge, using the following:
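
A sketch of that computation, using the standard conditional form of Shannon’s measure (my notation, again assuming base-2 logarithms):

H[M \mid C] = -\sum_{m \in M} p(m \mid C)\,\log_2 p(m \mid C)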

Since the probability of one reference, conditioned on the background knowledge, is now near 1.0 (say, 0.999), the right-hand side reduces to (essentially) 0:
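
Plugging in the assumed probabilities of 0.999 and 0.001:

H[M \mid C] = -(0.999 \log_2 0.999 + 0.001 \log_2 0.001) \approx 0.011 \text{ bits} \approx 0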

Thus, by assuming some mutually shared common knowledge, we were able to compress our message without introducing any uncertainty (entropy is 0). This, in fact, is how we communicate:

For effective communication, and in line with the principle of least effort, we compress our utterances, introducing ambiguity, although it is ambiguity that we are certain can be resolved given the common knowledge that speaker and listener mutually share.

A recap: in seeking efficiency in communication, in ordinary discourse we do not express any of the mutually agreed on (shared) common knowledge, such as “object x will not fit in container y if x is physically larger than y”. Our linguistic utterances are thus highly compressed and the amount of information missing is exactly the amount of common knowledge (see proof here).

While this evolution of human communication works well for humans, it is very challenging for machines, since machines do not have access to that ‘shared’ and mutually agreed upon common knowledge; in fact, most challenges in NLU are due to this. The issue in (1), for example, is known in NLU and computational linguistics as ‘reference resolution’. In general, reference resolution can be much more difficult than the situation in (1). For example, consider the following sentences:

(2) Jon wants to be a guitarist because he thinks it’s a beautiful instrument.
(3) Olga is an old dancer. She has been doing that for a long time.

In (2) and (3) there are references to objects that are not even mentioned explicitly in the text — ‘it’ refers to a guitar in (2), while ‘that’ refers to Olga’s dancing in (3). Besides reference resolution, the missing text phenomenon (MTP) is the cause of many other challenges in language understanding. For example, hearing a waiter in a bar say (4), we all know that the uncompressed text is that in (5):

(4) The loud omelet wants another beer.
(5) The loud person eating the omelet wants another beer.

How we ‘uncover’ that missing text for successful ‘understanding’ of ordinary spoken language is a massive project, and this is not the place to discuss it (although some of this project is briefly described here).

What matters for now is to appreciate that linguistic corpora of ordinary spoken language do not contain the implicitly assumed common knowledge, and that there can be no language understanding without access to that knowledge. Large language models, therefore, no matter how much text they ingest, are looking for something that is not even there — that is what the data on language evolution tells us.

Let’s reduce carbon emissions and stop chasing infinity by trying to ingest more and more text — it is time to climb down the tree.

https://medium.com/ontologik