Do you remember the hype when OpenAI released GPT-3 in 2020? Although not the first in its series, GPT-3 gained widespread recognition due to its impressive text generation capabilities. Since then, a diverse group of Large Language Models (LLMs) has flooded the AI landscape. The golden question is: have you ever wondered how ChatGPT or any other LLM breaks down language? If you haven't yet, this article discusses the mechanism by which LLMs process the textual input given to them during training and inference. We call it tokenization.
This article is inspired by the YouTube video titled Deep Dive into LLMs like ChatGPT from former Senior Director of AI at Tesla, Andrej Karpathy. His general-audience video series is highly recommended for anyone who wants to take a deep dive into the intricacies behind LLMs.
Before diving into the main topic, I want you to have an understanding of the inner workings of an LLM. In the next section, I'll break down the internals of a language model and its underlying architecture. If you're already familiar with neural networks and LLMs in general, you can skip the next section without affecting your reading experience.
Internals of large language models
LLMs are built on transformer neural networks. Think of a neural network as a giant mathematical expression. Its inputs are sequences of tokens, which are typically processed through embedding layers that convert the tokens into numerical representations. For now, think of tokens as basic units of input data, such as words, phrases, or characters. In the next section, we'll explore in depth how to create tokens from input text data. When we feed these inputs into the network, they are mixed into a giant mathematical expression together with the parameters, or weights, of the network.
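To make the embedding step concrete, here is a minimal sketch in PyTorch (my choice of framework; the article does not prescribe one, and the token IDs below are placeholders rather than real tokenizer output):

```python
import torch
import torch.nn as nn

# Hypothetical setup: a vocabulary of 256 token IDs, each mapped to a 16-dimensional vector.
embedding = nn.Embedding(num_embeddings=256, embedding_dim=16)

# A toy sequence of token IDs (placeholders, not real tokenizer output).
token_ids = torch.tensor([72, 101, 108, 108, 111])

# Each ID is converted into its numerical representation: a learned vector of 16 floats.
vectors = embedding(token_ids)
print(vectors.shape)  # torch.Size([5, 16])
```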
Modern neural networks have billions of parameters. Initially, these parameters, or weights, are set randomly, so the network's predictions start out as random guesses. During the training process, we iteratively update these weights so that the outputs of the network become consistent with the patterns observed in the training set. In a sense, neural network training is about finding the right set of weights that are consistent with the statistics of the training set.
The transformer architecture was introduced in the paper titled “Attention Is All You Need” by Vaswani et al. in 2017. It is a neural network with a special kind of structure designed for sequence processing. Originally intended for neural machine translation, it has since become the foundational building block for LLMs.
To get a sense of what production-grade transformer neural networks look like, visit https://bbycroft.net/llm. This site provides interactive 3D visualizations of generative pre-trained transformer (GPT) architectures and guides you through their inference process.
One particular architecture shown there, called Nano-GPT, has around 85,584 parameters. We feed the inputs, which are token sequences, at the top of the network. Information then flows through the layers of the network, where the input undergoes a series of transformations, including attention mechanisms and feed-forward networks, to produce an output. The output is the model's prediction for the next token in the sequence.
Tokenization
Training a state-of-the-art language model like ChatGPT or Claude involves several stages arranged sequentially. In my previous article about hallucinations, I briefly explained the training pipeline for an LLM. If you want to learn more about training stages and hallucinations, you can read it here.
Now, imagine we're at the initial stage of training, called pretraining. This stage requires a large, high-quality, web-scale dataset, terabytes in size. The datasets used by major LLM providers are not publicly available, so we will look into an open-source dataset curated by Hugging Face called FineWeb, distributed under the Open Data Commons Attribution License. You can read more about how they collected and created this dataset here.

I downloaded a sample from the FineWeb dataset, selected the first 100 examples, and concatenated them into a single text file. This is just raw internet text with various patterns within it.
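As a rough sketch of how you might reproduce this with the Hugging Face datasets library (the "sample-10BT" configuration and the exact examples are assumptions; the sample I used may differ):

```python
from datasets import load_dataset

# Stream a published FineWeb sample so we don't download the full dataset.
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                       split="train", streaming=True)

# Take the first 100 examples and concatenate their text into one file.
texts = [row["text"] for _, row in zip(range(100), fineweb)]
with open("fineweb_sample.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))
```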

So our goal is to feed this data to the transformer neural network so that the model learns the flow of this text; we need to train our neural network to mimic it. Before plugging this text into the neural network, we must decide how to represent it. Neural networks expect a one-dimensional sequence of symbols drawn from a finite set of possible symbols. Therefore, we must determine what those symbols are and how to represent our data as a one-dimensional sequence of them.
What we have at this point is a one-dimensional sequence of text. Underlying this text is a representation as a sequence of raw bits: we can encode the original text with UTF-8 to obtain that bit sequence. If you examine the image below, you can see that the first 8 bits of the raw bit sequence correspond to the first letter 'A' of the original one-dimensional text sequence.

Now we have a very long sequence with two symbols: zero and one. This is, in fact, what we were looking for: a one-dimensional sequence of symbols drawn from a finite set. The problem is that sequence length is a precious resource in a neural network, mainly because of computational efficiency, memory constraints, and the difficulty of modeling long-range dependencies. We therefore don't want extremely long sequences over just two symbols; we prefer shorter sequences over more symbols. So we are going to trade off the number of symbols in our vocabulary against the resulting sequence length.
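Here is a small Python sketch of this bit-level view (the example string is my own, not the FineWeb text):

```python
text = "A quick example sentence."

# UTF-8 turns each character into one or more bytes (8 bits each).
raw_bytes = text.encode("utf-8")

# Render the byte stream as a string of 0s and 1s.
bit_string = "".join(f"{byte:08b}" for byte in raw_bytes)

print(bit_string[:8])   # '01000001' -> the letter 'A'
print(len(bit_string))  # 8 bits per byte: a very long two-symbol sequence
```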
Since we need to further compress, or shorten, our sequence, we can group every 8 consecutive bits into a single byte. Because each bit is either 0 or 1, there are exactly 256 possible 8-bit combinations. Thus, we can represent the data as a sequence of bytes instead.

This representation reduces the length by a factor of 8 while expanding the symbol set to 256 possibilities. Consequently, each value in the sequence falls within the range of 0 to 255.
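Continuing the sketch above, grouping the bits back into bytes gives a sequence that is 8 times shorter, with every element in the 0 to 255 range:

```python
text = "A quick example sentence."

# Iterating over a bytes object yields one integer (0-255) per byte.
byte_ids = list(text.encode("utf-8"))

print(byte_ids[:5])         # [65, 32, 113, 117, 105] -> 'A', ' ', 'q', 'u', 'i'
print(len(byte_ids))        # 8x shorter than the bit sequence above
print(max(byte_ids) < 256)  # every symbol fits in the 256-entry vocabulary
```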

These numbers do not carry any meaning in a numerical sense. They are just placeholders, unique identifiers for symbols. In fact, we could replace each of these numbers with a unique emoji and the core idea would still stand: think of the data as a sequence of emojis, each chosen from 256 unique options.

This process of converting raw text into symbols is called tokenization. Tokenization in state-of-the-art language models goes even further than this. We can compress the sequence length further, in return for more symbols in our vocabulary, using the Byte-Pair Encoding (BPE) algorithm. Originally developed for text compression, BPE is now widely used by transformer models for tokenization. OpenAI's GPT series uses standard and customized versions of the BPE algorithm.
Essentially, byte-pair encoding works by identifying frequent consecutive bytes or symbols. For example, consider our byte-level sequence of text.

As you can see, the sequence 101 followed by 114 appears frequently. We can therefore replace this pair with a new symbol, assign it a unique identifier, and rewrite every occurrence of 101 114 using this new symbol. This process can be repeated multiple times, with each iteration further shortening the sequence while introducing additional symbols, thereby growing the vocabulary. Using this process, GPT-4 arrived at a token vocabulary of around 100,000.
We can explore tokenization further using Tiktokenizer. Tiktokenizer provides an interactive web-based graphical user interface where you can enter text and see how it is tokenized according to different models. Play with this tool to get an intuitive sense of what these tokens look like.
For example, we can take the first four sentences of the text sequence and enter them into Tiktokenizer. From the dropdown menu, select the GPT-4 base model encoder: cl100k_base.

The colored text shows how the chunks of text correspond to the symbols. The following sequence of length 51 is what GPT-4 actually sees at the end of the day.
11787, 499, 21815, 369, 90250, 763, 14689, 30, 7694, 1555, 279, 21542, 3770, 323, 499, 1253, 1120, 1518, 701, 4832, 2457, 13, 9359, 1124, 323, 6642, 264, 3449, 709, 3010, 18396, 13, 1226, 617, 9214, 315, 1023, 3697, 430, 1120, 649, 10379, 83, 3868, 311, 3449, 18570, 1120, 1093, 499, 0
We can now take our entire sample dataset and re-represent it as a sequence of tokens using the GPT-4 base model tokenizer, cl100k_base. Note that the original FineWeb dataset amounts to a sequence of roughly 15 trillion tokens, while our sample contains just a few thousand tokens from it.
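If you prefer to do this programmatically rather than through the web UI, OpenAI's tiktoken library exposes the same cl100k_base encoding (the sentence below is a stand-in for the sample text):

```python
import tiktoken

# Load the GPT-4 base encoding discussed above.
enc = tiktoken.get_encoding("cl100k_base")

sample = "Have you ever wondered how LLMs break down language?"
token_ids = enc.encode(sample)

print(len(token_ids))         # number of tokens GPT-4 would see
print(token_ids)              # the integer token IDs
print(enc.decode(token_ids))  # round-trips back to the original text
```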

Conclusion
Tokenization is a fundamental step in how LLMs process text, transforming raw text data into a structured format before it is fed into neural networks. Since neural networks require a one-dimensional sequence of symbols, we need to strike a balance between sequence length and the number of symbols in the vocabulary, optimizing for efficient computation. Modern state-of-the-art transformer-based LLMs, including GPT and GPT-2, use Byte-Pair Encoding for tokenization.
Breaking down tokenization helps demystify how LLMs interpret text inputs and generate coherent responses. An intuitive sense of what tokenization looks like also helps in understanding the internal mechanisms behind the training and inference of LLMs. As LLMs are increasingly used as knowledge bases, a well-designed tokenization strategy is crucial for improving model efficiency and overall performance.
If you enjoyed this article, connect with me on X (formerly Twitter) for more insights.