The End of Tokenization
Some thoughts on tokenization and Meta's “Byte Latent Transformer: Patches Scale Better Than Tokens” paper
Meta’s new paper “Byte Latent Transformer: Patches Scale Better Than Tokens” eliminates LLMs’ dirtiest secret: tokenization. Instead of using a tokenizer, the Byte Latent Transformer, as the name implies, operates on the bytes of the raw text data.
Previous works on eliminating tokenization have largely framed the central challenge as mitigating the fact that naively operating on raw bytes results in longer sequences and therefore extra compute for the same task. However, this is a losing battle as wasted compute will overshadow any minor benefits. The Byte Latent Transformer flips the problem on its head and shows how the flexibility of byte-modeling allows for shorter sequences and more efficient compute allocation compared to using a tokenizer.
By showing itself to be more dynamic and compute-efficient, the Byte Latent Transformer opens the door to becoming a full-blown replacement for tokenization rather than a niche alternative for specific use cases. However, even if the Byte Latent Transformer in its current form does end tokenization, my personal prediction is that the legacy of tokenization will live on for as long as auto-regressive language modeling and the transformer live on in their current forms.
What is tokenization?
The backbone behind all modern large-scale models is the transformer, a deep neural network that takes a sequence of input vectors, which are often called tokens, and transforms them into a sequence of output tokens. Ideally, these input tokens are just the raw bytes of text data and everything is learned end-to-end.
However, in practice, most LLMs map text to and from the token space using a clunky process called tokenization. This is done by constructing a large vocabulary of words and memorizing an input vector embedding corresponding to every single word (note that these embeddings are learnable during training). Converting from tokens back to words is done via a fully connected layer; essentially, we take the dot product of the output token with the embedding of every single word in our vocabulary and generate whichever word has the highest score.
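To make this concrete, below is a minimal sketch of these two memorization-heavy layers in PyTorch. The vocabulary size, hidden dimension, and token ids are toy values chosen for illustration, not a real Llama configuration.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1_000, 64     # toy sizes; Llama2 uses 32,000 and 4,096

# Input side: every word in the vocabulary gets a learnable embedding vector.
embed = nn.Embedding(vocab_size, d_model)

# Output side: a fully connected layer whose rows are one vector per word;
# the logits are just dot products between the output token and each row.
unembed = nn.Linear(d_model, vocab_size, bias=False)

token_ids = torch.tensor([[17, 409, 311]])   # hypothetical token ids
hidden = embed(token_ids)                    # shape (1, 3, d_model)
logits = unembed(hidden)                     # shape (1, 3, vocab_size)
next_word = logits[0, -1].argmax()           # highest-scoring vocabulary word
```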
The most common tokenizer in modern LLMs is Byte Pair Encoding (BPE). BPE is initialized as a 256-word vocabulary—one word for each possible byte. It goes through the data corpus, merges the pair of adjacent words that appears together most frequently, and adds the merged pair into the vocabulary. This process is repeated until we hit our maximum vocabulary size. For example, “i” and “n” might be merged early on to add “in” to the vocabulary, and in a future iteration “in” and “g” might be merged to add “ing” to the vocabulary. Even further down the line, “walk” and “ing” may be merged to add “walking” to the vocabulary.
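As a rough illustration, here is a toy version of the merge loop. The function name and miniature corpus are made up, and real BPE implementations also record the learned merge rules so they can tokenize new text, but the core idea is just this counting-and-merging loop.

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Start from single-character symbols: each word is a tuple of characters.
    words = Counter(tuple(word) for word in corpus.split())
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols appears in the corpus.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        a, b = pairs.most_common(1)[0][0]
        merged = a + b
        # Replace every occurrence of the most frequent pair with the merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return words

print(bpe_merges("walking singing running walking ringing", 6))
```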
For a rough idea of scale, Llama2 had a vocabulary size of 32,000 and a hidden vector dimension of 4096. That means between the embedding layer and the fully connected layer, Llama2 spent 2*32000*4096 = 262 million parameters entirely on memorizing the vocabulary. Llama3 increased the vocabulary size to 128,256, meaning over 1 billion parameters are spent entirely on memorizing the vocabulary. In fact, the increase in vocabulary size accounts for almost the entire increase in parameter count from the Llama2 small model (7B) to the Llama3 small model (8B).
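Spelling out the arithmetic (following the 2 × vocab × hidden count used above, i.e. assuming untied embedding and output matrices):

```python
def vocab_params(vocab_size, hidden_dim):
    # One embedding matrix on the way in, one fully connected matrix on the way out.
    return 2 * vocab_size * hidden_dim

print(vocab_params(32_000, 4_096))    # Llama2:  262,144,000  (~0.26B)
print(vocab_params(128_256, 4_096))   # Llama3: 1,050,673,152 (~1.05B)
```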
What’s wrong with tokenization?
In the grand scheme of things, tokenization is a decent solution to mapping between text and vectors. Memorizing a large vocabulary of words is roughly in line with how humans process language. Modern LLMs have achieved a lot using tokenization. However, there are several reasons not to like tokenization.
Tokenization is not robust to changes in domain. For example, if you want to fine-tune an LLM on code, but your entire vocabulary is built for natural language, you won’t be using the most efficient tokenization scheme available. Similar problems arise for multi-lingual support.
Tokenization is less robust to typos and noise as demonstrated by the Byte Latent Transformer.
Tokenization is subpar at character-level manipulation as demonstrated by the Byte Latent Transformer.
It’s hard to represent arbitrary categorical distributions across a large vocabulary size if we are forced to compress everything into a relatively low-dimensional space. For example, if you prompt GPT-4o with “Pick one of the following at random with equal probability: heads, tails, apple, rock, dog. Answer in one word, all lower case” and observe the probabilities, you will find they are not uniform. You are essentially asking for a vector that is equally angled between the embeddings for “heads”, “tails”, “apple”, “rock”, and “dog” while being orthogonal to all other vocabulary vectors, which seems to be difficult in (relatively) low-dimensional space.
The free versions of ChatGPT and Claude (which presumably have a lower hidden dimension) are even worse; as of the time of this publication, the below trend keeps repeating itself if you keep prompting “I flip again”:
Betting against LLMs is a fool’s errand, and these problems can surely be tuned away or mitigated with ever-growing hidden dimensionality, but reducing the vocabulary size by eliminating tokenization could likely fix this problem quite trivially even on smaller models. Perhaps this toy problem is entirely irrelevant (in practice ChatGPT/Claude would generate a Python script and run it), but it’s plausible that this could be a problem if I want to do some RL-style search in natural language space.
Ultimately, none of the above may end up mattering. Deep learning and scale are very powerful and forgiving of many sins. However, tokenization carries one unforgivable sin: it is not beautiful.
The Bitter Lesson tells us that human-engineered tricks will all fall by the wayside in favor of simple and elegant end-to-end solutions that leverage compute to learn or discover everything. It is thus entirely unforgivable that our modern LLMs are so dependent on a separate pre-processing step as hand-engineered as BPE. The transformer has to work around the fixed BPE step instead of learning for itself how to best map from raw byte input to raw byte output.
Even if we were unable to specifically articulate any exact bad behavior that tokenization causes, it feels inevitable that the final form of LLMs will map text to text in a fully end-to-end process.
The Byte Latent Transformer
One of the fundamental issues with tokenization is that the hand-crafted nature of BPE locks the model into an inefficient allocation of compute.
Consider the prompt: “Who is the mother of dragons?” On a very weak language model, it would probably be better to expend more forward passes to do some sort of chain-of-thought reasoning that the user is probably asking about Game of Thrones, in which the mother of dragons is Daenerys Targaryen. Llama3, thanks to its size, can do the hard reasoning/information retrieval part and answer this question in a single forward pass. However, due to BPE tokenization, Llama3 will first generate “Da”, then “enery”, “s”, “ T”, “arg”, “ary”, “en”.
What a waste! Llama3 knew the answer with one giant forward pass, but it had to waste another six of its giant forward passes completing the output, because “Daenerys Targaryen” wasn’t in the vocabulary of the tokenizer. Ideally, we could learn to generate “Daenerys Targaryen” in one forward pass. However, this is completely off the table as long as we are using fixed hand-engineered tokenizers like BPE.
On the other hand, a naive byte transformer would do even worse. It would waste forward passes generating all 18 characters in “Daenerys Targaryen”. Previous works like ByT5 try to mitigate this issue by making the cost of an individual forward pass cheaper. Instead, the Byte Latent Transformer flips this issue on its head by dynamically learning to group a larger number of bytes into a single token (they refer to these dynamically grouped bytes as a “patch”, but from the transformer’s perspective they are functionally the same as tokens), resulting in shorter sequence lengths and thus requiring fewer forward passes than a BPE-based model.
Similar arguments can be made regarding sequence length and training efficiency as well.
Grouping Bytes
In order to group the bytes into patches, the Byte Latent Transformer first separately trains a light-weight byte transformer to perform next-byte prediction. The paper refers to this model as the entropy model. Now, when running a forward pass on the main model, they first run the entropy model on the input and plot out the entropy of the next-byte prediction (i.e. how “surprising” each byte in the text was to the entropy model). An example is shown below on the text “Daenerys Targaryen is in Game of Thrones, a fantasy epic by George R. R. Martin.”
We see that it’s surprising that the first character is “D” (as opposed to any other character), less surprising that the next character is “a” (as opposed to another vowel), and more surprising that the next character is “e” (since “Dae” is a very rare combination). Once we see “Daen”, it becomes obvious even to the light-weight entropy model that it will be completed by “erys Targaryen”, so the entropy is low for those characters.
The Byte Latent Transformer then splits the text into patches by roughly setting a patch boundary whenever there is high entropy. A comparison of how bytes are grouped under various patching/tokenization schemes is shown below. The second orange row is the BPE baseline used in most modern LLMs. The third pink row and the fourth yellow row are two of the proposed grouping schemes. We can see that the yellow grouping scheme groups “George R. R. Martin” into a single token/patch compared to five from BPE, and it requires half as many tokens/patches as BPE to represent “Daenerys Targaryen”.
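A rough sketch of this patching step is below. The interface of the entropy model (a callable returning per-position next-byte logits) and the threshold are assumptions for illustration; the paper’s actual schemes, including the variant that looks at the change in entropy rather than its absolute value, differ in the details.

```python
import torch
import torch.nn.functional as F

def entropy_patch_starts(byte_ids, entropy_model, threshold):
    """Mark which byte positions should start a new patch.

    `entropy_model(byte_ids)` is assumed to return causal next-byte logits
    of shape (seq_len, 256); `threshold` is a tunable hyperparameter.
    """
    with torch.no_grad():
        logits = entropy_model(byte_ids)                 # (seq_len, 256)
        log_p = F.log_softmax(logits, dim=-1)
        entropy = -(log_p.exp() * log_p).sum(dim=-1)     # surprise at each position

    starts = [0]                                         # the first byte always starts a patch
    for i in range(1, len(byte_ids)):
        # Byte i was hard to predict from its prefix, so cut a new patch here.
        if entropy[i - 1] > threshold:
            starts.append(i)
    return starts

# Patches are then the spans between consecutive start positions.
```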
Full Pipeline and Results
In summary, the full pipeline of the Byte Latent Transformer is as follows:
Run the pre-trained entropy model on the input bytes.
Group bytes together like in the figure above such that group boundaries correspond to high-entropy bytes that surprised the entropy model.
Encode each group of bytes into patches using a small local transformer. Because each group of bytes only contains low-entropy information, this small encoder does not need to be that powerful.
Pass the patches through a large global transformer to do generative modeling.
Decode the patches into raw bytes using another small local transformer. The decoder similarly does not need to be that powerful.
Thus, with a little assistance from the entropy predictor, the Byte Latent Transformer trains an LLM nearly end-to-end. Further details on the model architecture, such as how the bytes get mapped to patches in the local transformer, can be found in the paper.
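To make the flow concrete, here is a heavily simplified sketch of one forward pass, reusing the entropy_patch_starts helper from the patching sketch above. The local encoder/decoder and global transformer are placeholder callables; in the real architecture bytes and patches interact through cross-attention and generation proceeds byte by byte, so treat this purely as a shape-level illustration.

```python
import torch

def blt_forward(byte_ids, entropy_model, local_encoder, global_model, local_decoder,
                threshold=1.0):
    # 1. The frozen entropy model decides where the high-entropy patch boundaries go.
    starts = entropy_patch_starts(byte_ids, entropy_model, threshold)
    ends = starts[1:] + [len(byte_ids)]

    # 2. Slice the byte sequence into patches.
    patches = [byte_ids[s:e] for s, e in zip(starts, ends)]

    # 3. The small local encoder turns each low-entropy group of bytes into one patch vector.
    patch_vecs = torch.stack([local_encoder(p) for p in patches])

    # 4. The big global transformer does the expensive generative modeling over patches,
    #    so its sequence length is the number of patches, not the number of bytes.
    patch_states = global_model(patch_vecs)

    # 5. The small local decoder maps the last patch state back into predicted output bytes.
    return local_decoder(patch_states[-1])
```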
Unsurprisingly, the Byte Latent Transformer shows better robustness to noise and character-level manipulation. More importantly, the Byte Latent Transformer matches the performance of tokenization-based models like Llama3 at scales of up to 8B parameters and 4T bytes of training data.
One powerful aspect of the Byte Latent Transformer’s dynamic grouping scheme is that it’s easy to adjust patch size and therefore sequence length by adjusting the threshold of what is considered high entropy. This sort of flexibility is unavailable to rigid tokenizers like BPE, as the only way to increase token size in BPE is to increase the number of merges that happen by increasing the vocabulary size. However, as one could imagine, this is quite inefficient—Llama3 quadrupled the vocabulary size of Llama2 and only increased the average token size from 3.7 to 4.4. This flexible adjustment of patch size and sequence length allows the Byte Latent Transformer to reduce inference flops by 50% at the cost of minor losses in evaluation performance.
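As a toy illustration of that knob, with made-up entropy values, raising the threshold directly increases the average patch size and therefore shrinks the sequence the global transformer has to process:

```python
def average_patch_size(entropies, threshold):
    # Each entropy value above the threshold opens a new patch (the first byte always does).
    num_patches = 1 + sum(1 for h in entropies[1:] if h > threshold)
    return len(entropies) / num_patches

entropies = [2.4, 0.3, 0.2, 1.9, 0.1, 0.4, 2.6, 0.2, 0.3, 0.1]   # made-up values
for t in (0.2, 1.0, 2.0):
    print(t, average_patch_size(entropies, t))   # higher threshold -> larger patches
```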
More detailed results can be found in the paper.
Is it beautiful?
In my opinion, the central contributions of the Byte Latent Transformer are twofold. The first is reframing the problem statement of byte-level modeling from one of resisting the ostensible hit to computational efficiency into one of leveraging the flexibility of byte-level modeling to improve computational efficiency. The second is providing rigorous empirical evidence that byte-level modeling can compete with tokenization.
In terms of the actual solution, it is indeed more beautiful than the BPE baseline. The ad-hoc process of finding and merging the strings with high unconditional probability of being together in the corpus is replaced with a learnable neural network (i.e. the entropy predictor) that merges strings with high conditional probability of being together given the context. The embedding and fully connected layers that memorize data points are replaced by the small local encoder/decoder transformers that learn a function.
However, it’s hard to pronounce the Byte Latent Transformer the final solution to the tokenization problem. The fact that an entropy predictor first needs to be trained invites the same criticisms: a lack of robustness to changes in domain and the fundamental ugliness of not having an end-to-end model. There are also implicit hand-picked choices in how the bytes are grouped into patches, such as the modeling capacities of the entropy predictor and the local encoder/decoder transformers, that we could ideally learn. In short, it’s beautiful, but not beautiful enough.
My personal prediction is that the ghost of tokenization will haunt us until:
The death of auto-regression as we know it. If we want one giant model that goes directly from byte inputs to byte outputs and doesn’t waste forward passes on trivial generations, we will need some sort of diffusion-ish solution that can simultaneously generate all trivial tokens in a few forward passes.
The death of the transformer as we know it. The fundamental reason there is all this hullabaloo about what counts as an input token/patch is that the input token is the atomic unit of compute in current transformer-based LLMs. Tokenization compresses sequence length in a dumb way, the Byte Latent Transformer compresses sequence length in a smarter way, and a future architecture will need to learn to perform this compression in an even smarter way.
I don’t think auto-regression or the transformer will ever go extinct, but both paradigms will need to undergo significant evolution before we see the true death of tokenization.