The current state-of-the-art method for solving text-based problems includes separating sentences into sequences of tokens. Relying on tokens is, for the most part, a necessary evil. Recent approaches have shown the viability of removing learned tokens altogether and instead operating on the raw text directly. This blog post will highlight what token-free NLP models are and what the big deal is. The future may very well be token-free!
This article was originally published by Peltarion.
Why Do We Need a Tokenizer?
For an AI model to understand languages, the most prominent approach so far is to train a so-called language model on a massive amount of text and let the model learn from context the meaning of words and how those words compose sentences. To recall and learn the relations between words, the model needs a vocabulary embedded in it, with these mappings stored as parameters.
Since there are over 170,000 words in the English dictionary and the model needs to learn the weights of each word, it is not feasible for it to store all of those in its vocabulary.
To decrease the number of words to learn, words are instead split into sub-words, or tokens. A so-called tokenizer decides where these splits should be made in the text. A tokenizer is trained on a large corpus to identify a fixed number of tokens. The tokens it learns to identify will be the longest and most common sub-words, up to some fixed token vocabulary size, usually 50,000 tokens.
When we train a language model, we can use the tokenizer to reduce the number of tokens the model needs to store in memory and, therefore, reduce the model size significantly. This is important since a smaller model means faster training, lower training cost, and less expensive hardware.
Below is an illustration of how a tokenizer splits a sentence into a sequence of tokens.
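To make the splitting concrete, here is a minimal sketch of a greedy longest-match sub-word split. It is not how any particular production tokenizer works: the function name `tokenize`, the `[UNK]` marker, and the tiny `toy_vocab` are assumptions made up for this example, whereas a real tokenizer learns its vocabulary from a large corpus.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # Nothing in the vocabulary matches: mark this character as unknown.
            tokens.append("[UNK]")
            start += 1
    return tokens


# Hypothetical toy vocabulary, chosen only for illustration.
toy_vocab = {"token", "izer", "s", "ama", "zed", "amazing", "free"}

print(tokenize("tokenizers", toy_vocab))  # ['token', 'izer', 's']
print(tokenize("amazed", toy_vocab))      # ['ama', 'zed']
print(tokenize("amazing", toy_vocab))     # ['amazing']
```

Note how the related words "amazing" and "amazed" end up with no tokens in common under this toy vocabulary.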
This may seem like a fair trade-off: limiting the size of the vocabulary reduces the number of words a model needs to store and learn, and therefore its size. But there are several issues with this approach:
- A language model stores every token it was pre-trained on in a so-called vocabulary matrix, containing the input embedding of each token and a softmax output for each token. Just storing these takes up a large part of the model’s parameters, 85% for mT5 small and 66% for mT5 base, meaning fewer are left for understanding a language or a specific problem.
- Different models use different tokenizers with different vocabulary sizes. Since each model was pre-trained with a specific tokenization, it makes re-use, generalization, and comparison more difficult. For instance, here are some examples of commonly used models and the tokenization they used for their vocabularies:
- Using a fixed size token vocabulary also means that models can adapt poorly to include domain-specific or new words. It can, therefore, be common to further pre-train a language model and even the tokenizer on a specific vocabulary, such as medical, legal, or financial text.
- Most of these tokenizers split the text into words based on whitespace or punctuation marks. This approach does not work well for non-whitespace separated languages, such as Chinese, Thai, and Japanese; or where the punctuation sign is used as a consonant, such as for the Hawaiian and Twi languages.
- Having noisy or misspelled words also means poorer tokenization. Words that may be related, such as amazing and amazed, could be tokenized very differently (“amazing” and “ama” + “zed”). Such different tokenizations, with few or no overlapping tokens, can be hard for a language model to learn.
Because of these issues, recent methods have proposed different ways to remove tokenizers altogether and instead operate directly on the raw text (Unicode characters with UTF-8 byte encoding). Since all data is stored and interpreted in a computer as sequences of bytes, being able to represent the data in this format makes it language agnostic.
Token-Free Models in Theory
To explain how token-free models work, it is helpful to first highlight how the standard tokenizer-based approach works.
Text is passed in as input to the model, but since it is too expensive to store every word, it is passed through a tokenizer to split the text into smaller sub-words. The tokenizer in turn contains a list of every token it can split a word into and a corresponding ID for each token. Since deep learning models are large matrix computations, we pass in the token IDs to represent the tokenized input instead of the tokens themselves.
The token IDs are then embedded in a vector space to encode more information and enable comparison between words: a so-called word embedding. In Transformer-based models, this embedding is part of the model and is called a contextual word embedding. The model then learns, through multiple layers, what the relations between different tokens are.
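The ID lookup and embedding step can be sketched roughly as follows. Everything named here (`token_to_id`, `embedding_matrix`, `embed`, the dimension of 4) is a made-up toy for illustration; in a real model the matrix values are learned parameters, and the contextual part comes from the Transformer layers on top, which are not shown.

```python
import random

random.seed(0)

# Hypothetical toy vocabulary mapping each token to an integer ID.
token_to_id = {"[UNK]": 0, "token": 1, "izer": 2, "s": 3}
embedding_dim = 4

# One row per token ID. Random here; learned during training in a real model.
embedding_matrix = [
    [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for _ in token_to_id
]


def embed(tokens):
    """Map tokens to IDs, then look up one embedding row per ID."""
    ids = [token_to_id.get(tok, token_to_id["[UNK]"]) for tok in tokens]
    return ids, [embedding_matrix[i] for i in ids]


ids, vectors = embed(["token", "izer", "s"])
print(ids)                            # [1, 2, 3]
print(len(vectors), len(vectors[0]))  # 3 4
```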
The general process for how text is represented and passed to a model is similar with and without a tokenizer, and can be described as:
- Represent the text as numbers (via tokenization, or by converting the text to Unicode code points or UTF-8 encoded bytes).
- Create a word embedding out of the numerical representations to capture complexity and enable comparison between words.
- Reduce the number of words or sub-words the model needs to represent. This can be done before the text is encoded, such as by splitting the text into sub-words with a tokenizer, or afterwards, by reducing the number of characters to represent in the word embedding via mean pooling (Charformer), strided convolution (CANINE), or reduced self-attention cost (Perceiver).
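The reduction step is the least intuitive of the three, so here is a minimal sketch of one of the options mentioned, mean pooling over character embeddings. The function name `mean_pool` and the toy vectors are assumptions for this example and are not taken from any of the models' code.

```python
def mean_pool(vectors, window):
    """Average each run of `window` consecutive vectors into a single vector."""
    pooled = []
    for start in range(0, len(vectors), window):
        group = vectors[start:start + window]
        dim = len(group[0])
        pooled.append([sum(v[d] for v in group) / len(group) for d in range(dim)])
    return pooled


# Four toy character embeddings of dimension 2.
char_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]

# Pooling with a window of 2 halves the sequence length.
print(mean_pool(char_vectors, window=2))  # [[2.0, 1.0], [6.0, 5.0]]
```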
CANINE is the first token- and vocabulary-free model, based on a hashing and downsampling strategy to work directly on the characters as Unicode code points. On the TyDi QA dataset, CANINE outperformed other multilingual models, such as mBERT, while having no predefined tokenization and 28% fewer parameters.
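To give a feel for the hashing idea, the sketch below embeds a code point by hashing it into several small tables and concatenating the slices it finds, so that no table needs one row per possible Unicode character. This is only a sketch under stated assumptions: the hash function, the multipliers in `PRIMES`, and all sizes are invented for this example and are not CANINE's actual configuration.

```python
import random

# All of these values are illustrative assumptions, not CANINE's real settings.
NUM_HASHES = 4
NUM_BUCKETS = 1000
SLICE_DIM = 2
PRIMES = [31, 43, 59, 61]  # arbitrary multipliers, one per hash function

random.seed(0)
# One small embedding table per hash function (random here, learned in practice).
tables = [
    [[random.uniform(-1.0, 1.0) for _ in range(SLICE_DIM)]
     for _ in range(NUM_BUCKETS)]
    for _ in range(NUM_HASHES)
]


def embed_code_point(code_point):
    """Concatenate one slice per hash table into a single embedding."""
    vector = []
    for table, prime in zip(tables, PRIMES):
        bucket = ((code_point + 1) * prime) % NUM_BUCKETS
        vector.extend(table[bucket])
    return vector


embedding = embed_code_point(ord("\u00e9"))  # the character "é"
print(len(embedding))  # 8
```

Two different code points may collide in one table, but they are unlikely to collide in all of them, which is what keeps the concatenated embeddings distinguishable.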
The input is then passed through a regular stack of Transformer encoders (such as mBERT or XLM-R). Depending on the task, either the first token (the [CLS] token) is used for classification, or the output sequence of the model is upsampled to get back to the original size of 2048.
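The downsample-then-upsample round trip can be sketched as below. It is a deliberately simplified picture: the features are scalars, the convolution kernel is fixed instead of learned, the stride of 4 is an assumption for this example, and upsampling is plain repetition. The real model does more work at both ends.

```python
def strided_conv(seq, kernel, stride):
    """1-D convolution over scalar features, moving `stride` positions per step."""
    width = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, seq[i:i + width]))
        for i in range(0, len(seq) - width + 1, stride)
    ]


def upsample(seq, rate):
    """Repeat every element `rate` times to restore the longer length."""
    return [x for x in seq for _ in range(rate)]


chars = [float(i % 7) for i in range(2048)]  # stand-in for 2048 character features
kernel = [0.25, 0.25, 0.25, 0.25]            # fixed here; learned in a real model

down = strided_conv(chars, kernel, stride=4)  # shorter sequence for the encoder stack
up = upsample(down, rate=4)                   # back to one position per character

print(len(chars), len(down), len(up))  # 2048 512 2048
```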
ByT5 is a variant of the multilingual T5 model, mT5, but operates directly on the raw text input, that is, the UTF-8 encoding of the text. The architecture is otherwise similar to mT5, but the split between encoder and decoder layers is no longer 50/50. Instead, the encoder has three times as many layers as the decoder. It is hypothesized that token-free models need deeper encoder stacks to make up for the decreased embedding capacity for the vocabulary.
ByT5 outperforms mT5 in most multilingual tasks, especially for smaller models or when dealing with misspelled or noisy data, and is 50-100% faster. When the model is trained on a single language, such as English, mT5 and the regular T5 perform better.
Perceiver and Perceiver IO
The Perceiver operates directly on the raw byte representation of the input. This enables the model to operate (more or less) on any type of data, be it text, images, point clouds, audio, etc., and even combinations of modalities in one model. The model takes inspiration from the ByT5 paper to operate directly on the raw byte representation (UTF-8 for text) but extends it to multiple modalities.
The Perceiver also continues the trend of removing hardcoded assumptions in the model architecture of how to solve and represent the problem, and instead allows the model itself to learn those aspects. The Perceiver IO is a continuation of the original Perceiver architecture, extending it to be used for multiple tasks and not just classification.
Charformer consists of two parts: a dynamic, fast, and flexible method to learn subword representations automatically from n-grams, and a model that incorporates it. By grouping n characters together (an n-gram), there is an increased opportunity to learn multiple representations of a word that may be more advantageous. Instead of using only one representation of subwords of a single character, the model can select the most informative representation of a word by weighting multiple representations from the different n-grams. These are then downsampled in groups of 2 with mean pooling to get a sequence with a shorter length.
This module is called Gradient-Based Subword Tokenization (GBST) and is the token-free module used by the Charformer. Since all components in the module are pre-defined, except for how to weight/score each n-gram representation, this can be done efficiently and quickly. Since the scoring is done using the softmax function, it is also differentiable and learnable. This means that the text representations can be updated dynamically for new vocabulary or languages.
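The softmax weighting can be illustrated with a few lines of plain Python. This is a sketch of the general idea only: the helper names, the three candidate vectors, and the scores are made up for this example, and in the real module the scores are produced by learned parameters.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_candidates(candidates, scores):
    """Weighted sum of candidate vectors, with weights given by softmax(scores)."""
    weights = softmax(scores)
    dim = len(candidates[0])
    mixed = [sum(w * c[d] for w, c in zip(weights, candidates)) for d in range(dim)]
    return weights, mixed


# Hypothetical candidate representations for one character position,
# for example coming from 1-gram, 2-gram, and 4-gram groupings.
candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 0.0]  # equal scores give equal weights

weights, mixed = mix_candidates(candidates, scores)
print(weights)
print(mixed)
```

Because every step is a smooth function of the scores, gradients can flow back into whatever produces them, which is what makes the weighting learnable.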
Grouping characters into n-grams shortens the sequence by a factor of n. For instance, the text “Successfully” as 4-grams would be “Succ”, “essf”, “ully”; that is, 4 times shorter than the original text. Therefore, these n-grams are mean pooled and repeated n times to be the same length as the original text again. The pooled and duplicated embedding for “Succ” in the image below would be C4,1, for “essf” C4,2, and for “ully” C4,3. These are then scored via a weighting and mean pooled to a shorter representation. Since pooling removes the position of tokens, position embeddings are added to the tokens at each pooling step.
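The “Successfully” example can be reproduced with a short sketch. The two helper functions are hypothetical names for this example, and a single number per character stands in for a full embedding vector to keep the arithmetic readable.

```python
def ngram_blocks(text, n):
    """Split text into consecutive, non-overlapping blocks of n characters."""
    return [text[i:i + n] for i in range(0, len(text), n)]


def pool_and_repeat(char_values, n):
    """Mean pool each block of n values, then repeat the mean once per position."""
    output = []
    for i in range(0, len(char_values), n):
        block = char_values[i:i + n]
        mean = sum(block) / len(block)
        output.extend([mean] * len(block))
    return output


word = "Successfully"
print(ngram_blocks(word, 4))  # ['Succ', 'essf', 'ully']

# A scalar stand-in for each character embedding.
char_values = [float(ord(ch)) for ch in word]
repeated = pool_and_repeat(char_values, 4)

# After repeating, there is again one value per original character.
print(len(repeated) == len(word))  # True
```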
Charformer performs on par with or outperforms the regular T5 on multiple English tasks and outperforms both ByT5 and CANINE while being smaller, faster, and with shorter sequences. Unlike CANINE, a model using GBST, such as Charformer, is interpretable in how the tokens are represented. Charformer is, as of this writing, the current state-of-the-art (SOTA) method when it comes to token-free models. For those interested in learning more about the model, I highly recommend this short and pedagogical video.
The various techniques for token-free models showcase a lot of potential for standardizing model configurations across modalities, tasks, and languages. I believe that some of the trends these methods highlight will continue to be interesting directions in the field of AI, including:
- Learning model structure and function only from the data, in a data-centric way, to reduce architectural bias and hardcoded features of the network (Neural Architecture Search) and to find better representations or re-discover traditional techniques/structures for solving a problem.
- Increased mixing and interaction between the data. By allowing new and distant connections and leveraging multiple representations of the input, models seem to generalize better and be more efficient, as seen in FNet, MLP-Mixer, and ALiBi.