Tokens are a big reason today’s generative AI falls short.

Generative AI models don’t process text the same way humans do. Understanding their “token”-based internal environment can help explain their strange behavior and stubborn limitations.

Most models, from small on-device models like Gemma to OpenAI’s industry-leading GPT-4o, are built on an architecture called a transformer. Because of the way transformers create associations between text and other types of data, they can’t take in raw text or output it—at least not without a ton of compute.

So, for both practical and technical reasons, today’s transformer models work with text that has been broken down into bite-sized pieces called tokens, in a process known as tokenization.

A token can be a word, such as “fantastic”. Or it can be a syllable, such as “fan”, “tas”, or “tic”. Depending on the tokenizer (the model that does the tokenizing), a token can even be an individual letter: “f”, “a”, “n”, “t”, “a”, “s”, “t”, “i”, or “c”.
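For a concrete picture, here is a minimal sketch using OpenAI’s open-source tiktoken library (one tokenizer among many; the library choice and the exact splits are illustrative assumptions, not something every model shares):

```python
# A minimal sketch of tokenization using the open-source tiktoken library
# (pip install tiktoken). Exact splits vary from tokenizer to tokenizer;
# this only illustrates the word/subword/character idea.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # the encoding used by GPT-4-era models

for text in ["fantastic", "fantastically", "unfantastic"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {len(ids)} token(s): {pieces}")

# A common word may come back as a single token, while rarer words are split
# into subword pieces such as "fan", "tas", "tic" -- or even single characters.
```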

This method allows the transformer to accommodate more information (in a semantic sense) before reaching an upper bound known as the context window. However, tokenization can also introduce bias.

Tokens can also have odd spacing that throws a model off track. For example, a tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” while it might encode “once upon a ” (with a trailing space) as “once,” “upon,” “a,” “ .” Depending on whether the model is prompted with “once upon a” or “once upon a ” (trailing space included), the results can be completely different, because the model doesn’t understand (as a human would) that the two mean the same thing.
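A small sketch of the trailing-space effect, again using tiktoken purely for illustration (other tokenizers handle whitespace differently):

```python
# Sketch: how a trailing space can change the token sequence. Assumes the
# tiktoken package; splits are tokenizer-specific.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

without_space = "once upon a"   # no trailing space
with_space = "once upon a "     # trailing space

for text in (without_space, with_space):
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:20} -> {pieces}")

# To a human these prompts mean the same thing; to the model they are
# different token sequences, which can lead to different completions.
```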

Tokenizers also treat case differently. “Hello” is not necessarily the same as “HELLO” to a model: “hello” is usually a single token (depending on the tokenizer), while “HELLO” can be as many as three (“HE,” “El,” and “O”). That’s why many models fail the capitalization test.
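The same kind of check works for capitalization; the sketch below assumes tiktoken, and its particular splits may differ from the “HE,” “El,” “O” example:

```python
# Sketch: upper- and lower-case versions of the same word can tokenize
# very differently (tiktoken shown here; counts vary by tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["hello", "Hello", "HELLO"]:
    ids = enc.encode(word)
    print(f"{word!r:8} -> {len(ids)} token(s): {[enc.decode([i]) for i in ids]}")
```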

“It’s a bit tricky to resolve the question of what exactly a ‘word’ is in a language model, and even if human experts agree on a perfect token vocabulary, the model probably finds it useful to ‘fragment’ things further,” Sheridan Feucht, a PhD student at Northeastern University who studies large language model interpretability, told TechCrunch. “My guess is that there will never be a perfect tokenizer because of this kind of ambiguity.”

This “ambiguity” causes even more problems in languages other than English.

Many tokenization methods assume that a space in a sentence marks a new word, because they were designed with English in mind. But not all languages separate words with spaces: Chinese and Japanese don’t, and neither do Korean, Thai, or Khmer.
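A quick illustration of why the whitespace assumption breaks down (the sentences are illustrative examples, not drawn from any cited study):

```python
# Sketch: a whitespace-based notion of "words" breaks down for languages
# that don't separate words with spaces.
english = "I like cats"
chinese = "我喜欢猫"      # "I like cats" in Chinese, written without spaces
thai = "ฉันชอบแมว"        # "I like cats" in Thai, also without spaces

for text in (english, chinese, thai):
    naive_words = text.split()  # split on whitespace, as many tokenizers assume
    print(f"{text!r} -> {len(naive_words)} 'word(s)': {naive_words}")

# English splits into three words; the Chinese and Thai sentences come back
# as a single undivided chunk, so a space-based pre-tokenizer learns nothing
# about their word boundaries.
```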

A 2023 Oxford study found that, because of differences in how non-English languages are tokenized, it can take a model twice as long to complete a task phrased in a non-English language as the same task phrased in English. The same study, and another, found that users of less “token-efficient” languages are likely to see worse model performance and yet pay more for usage, since many AI vendors charge per token.

In ideographic writing systems (systems in which printed symbols represent words without relating to their pronunciation, such as Chinese), tokenizers often treat each character as a separate token, leading to high token counts. Likewise, tokenizers processing agglutinative languages (languages in which words are built from small, meaningful elements called morphemes, such as Turkish) tend to turn each morpheme into a token, increasing the overall token count. (The Thai word for “hello”, สวัสดี, is six tokens.)
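As a rough, tokenizer-dependent illustration, the sketch below counts tokens for the same greeting in a few scripts using tiktoken; the six-token figure above may come from a different tokenizer, so the counts here can differ:

```python
# Sketch: counting tokens for the same greeting across scripts
# (tiktoken shown; exact counts depend on the tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

greetings = {"English": "hello", "Thai": "สวัสดี", "Chinese": "你好"}
for language, word in greetings.items():
    print(f"{language:8} {word!r:12} -> {len(enc.encode(word))} token(s)")
```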

In 2023, Google DeepMind AI researcher Yennie Jun conducted an analysis comparing tokenization across different languages and its downstream effects. Using a dataset of parallel texts translated into 52 languages, Jun showed that some languages need up to 10 times more tokens to capture the same meaning as English.
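The sketch below shows the general shape of such a comparison: tokenize parallel translations of one sentence and report each language’s token count relative to English. The sentences and tokenizer are illustrative, not Jun’s actual dataset or setup.

```python
# Sketch: compare token counts for parallel translations of one sentence,
# relative to English. Illustrative data and tokenizer only.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

parallel = {
    "English": "The weather is nice today.",
    "Spanish": "El clima está agradable hoy.",
    "Thai": "วันนี้อากาศดี",
}

baseline = len(enc.encode(parallel["English"]))
for language, sentence in parallel.items():
    n = len(enc.encode(sentence))
    print(f"{language:8} {n:3d} tokens  ({n / baseline:.1f}x English)")
```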

Beyond these linguistic inequities, tokenization may also explain why today’s models are bad at math.

Numbers are rarely tokenized consistently. Because the tokenizer doesn’t really know what a number is, it might treat “380” as a single token but represent “381” as a pair (“38” and “1”), effectively destroying the relationships between digits and the results of equations and formulas. The result is transformer confusion: a recent paper showed that models struggle to understand repetitive numerical patterns and context, particularly temporal data. (Case in point: GPT-4 thinks 7,735 is greater than 7,926.)
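Here is a hedged sketch of the digit-splitting problem using tiktoken; the exact splits depend on the tokenizer, so treat the output as illustrative:

```python
# Sketch: neighbouring numbers don't necessarily tokenize the same way,
# which obscures their numeric relationship (tiktoken shown; splits vary
# by tokenizer).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for number in ["380", "381", "7735", "7926"]:
    ids = enc.encode(number)
    print(f"{number:>5} -> {[enc.decode([i]) for i in ids]}")

# If "380" is one token and "381" is two, the model never sees the digits
# line up the way they do on paper, which makes arithmetic harder to learn.
```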

That’s also why models aren’t very good at solving anagrams or reversing words.
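A short illustration of why character-level tasks are awkward when the model only ever sees tokens (tiktoken again, purely for demonstration):

```python
# Sketch: the model sees tokens, not letters, so reversing a word means
# producing a string whose tokenization may look nothing like the original's.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

word = "fantastic"
reversed_word = word[::-1]  # "citsatnaf"

for text in (word, reversed_word):
    ids = enc.encode(text)
    print(f"{text!r:12} -> {[enc.decode([i]) for i in ids]}")
```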

So tokenization clearly presents challenges for generative AI. Can they be solved?

Maybe.

Feucht points to “byte-level” state space models like MambaByte, which do away with tokenization entirely and can ingest far more data than transformers without a performance penalty. Working directly with the raw bytes representing text and other data, MambaByte is competitive with some transformer models on language-analysis tasks while better handling “noise” like swapped characters, odd spacing, and capitalization.
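To see what “byte-level” input means in practice, the sketch below converts text to the raw UTF-8 bytes a model like MambaByte would consume; this shows only the input representation, not MambaByte’s actual code:

```python
# Sketch: what "byte-level" input looks like. A byte-level model skips the
# tokenizer and consumes the raw UTF-8 bytes of the text directly.
text = "once upon a time"
byte_sequence = list(text.encode("utf-8"))

print(len(text), "characters ->", len(byte_sequence), "bytes")
print(byte_sequence[:10], "...")

# There is no vocabulary to learn and no split decisions to make -- every
# possible input maps onto the same 256 byte values -- but the sequences
# are much longer than their token-based equivalents.
```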

However, models like MambaByte are still in the early stages of research.

“It would be best if the model looked at the characters directly, without imposing tokenization, but that’s currently computationally infeasible for a transformer,” Feucht said. “Especially for transformer models, the computation scales quadratically with sequence length, so you want to use short text representations.”
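A back-of-the-envelope sketch of the quadratic-scaling point, assuming a rough average of four characters per token (an illustrative figure, not one from Feucht):

```python
# Sketch: self-attention cost grows roughly with the square of sequence
# length, so working in characters (or bytes) instead of tokens blows up
# the compute. The 4-characters-per-token average is only illustrative.
tokens = 1_000
chars_per_token = 4
chars = tokens * chars_per_token

token_cost = tokens ** 2          # ~ pairwise attention interactions
char_cost = chars ** 2

print(f"token-level:     {token_cost:,} pairwise interactions")
print(f"character-level: {char_cost:,} pairwise interactions")
print(f"ratio: {char_cost / token_cost:.0f}x more compute")
```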

Unless there is a breakthrough in tokenization, a new model architecture seems likely to be key.