With the current explosion of interest in large language models (LLMs), they often seem almost magical. But let’s demystify them.
I wanted to step back and unpack the fundamentals: breaking down how LLMs are built, trained, and fine-tuned to become the AI systems we interact with today.
This two-part deep dive is something I’ve been meaning to do for a while, and it was also inspired by Andrej Karpathy’s widely popular 3.5-hour YouTube video, which has racked up 800,000+ views in just 10 days. Andrej is a founding member of OpenAI; his insights are gold, you get the idea.
If you have the time, his video is definitely worth watching. But let’s be real: 3.5 hours is a long watch. So, for all the busy folks who don’t want to miss out, I’ve distilled the key concepts from the first 1.5 hours into this 10-minute read, adding my own breakdowns to help you build a solid intuition.
What you’ll get
Part 1 (this article): Covers the fundamentals of LLMs, including pre-training, post-training, neural networks, hallucinations, and inference.
Part 2: Reinforcement learning with human/AI feedback, a look at o1 models, DeepSeek R1, and AlphaGo.
Let’s go! I’ll start by looking at how LLMs are built.
At a high level, there are two key phases: pre-training and post-training.
1. Pre-training
Before an LLM can generate text, it must first learn how language works. This happens through pre-training, a highly computationally intensive task.
Step 1: Data collection and preprocessing
The first step in training an LLM is gathering as much high-quality text as possible. The goal is to create a massive and diverse dataset containing a wide range of human knowledge.
One source is Common Crawl, a free, open repository of web crawl data containing 250 billion web pages collected over 18 years. However, raw web data is noisy, containing spam, duplicates, and low-quality content, so preprocessing is essential. If you’re interested in preprocessed datasets, FineWeb offers a curated version of Common Crawl and is made available on Hugging Face.
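If you want to poke around the data yourself, here is a minimal sketch of streaming a few FineWeb samples with the Hugging Face datasets library (the repository id "HuggingFaceFW/fineweb" and the "sample-10BT" subset are assumptions based on the public release; adjust to whatever config you need):

```python
from datasets import load_dataset

# Stream a small slice of FineWeb without downloading the full corpus.
fw = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

for i, example in enumerate(fw):
    print(example["text"][:200])  # preview a cleaned web page
    if i >= 2:
        break
```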
Once cleaned, the text corpus is ready for tokenization.
Step 2: Tokenization
Before a neural network can process text, it must be converted into numerical form. This is done through tokenization, where words, subwords, or characters are mapped to unique numerical tokens.
Think of tokens as the fundamental building blocks of all language models. In GPT-4, there are 100,277 possible tokens. A popular tokenizer, Tiktokenizer, lets you experiment with tokenization and see how text is broken down into tokens. Try entering a sentence, and you’ll see each word or subword assigned a series of numerical IDs.
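If you prefer to do this in code, the tiktoken library exposes the same GPT-4-era encoding; a minimal sketch:

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("we are cooking")
print(ids)                              # a list of integer token IDs
print([enc.decode([i]) for i in ids])   # the text chunk each ID maps back to
print(enc.n_vocab)                      # size of the vocabulary (~100k tokens)
```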

Step 3: Neural network training
Once the text is tokenized, the neural network learns to predict the next token based on its context. The model takes an input sequence of tokens (e.g., “we are cook ing”) and processes it through a giant mathematical expression, which represents the model’s architecture, to predict the next token.
A neural network consists of two key components:
- Parameters (weights): the numerical values learned during training.
- Architecture (mathematical expression): the structure defining how the input tokens are processed to produce outputs.

Initially, the model’s predictions are random, but as training progresses, it learns to assign probabilities to possible next tokens.
When the correct token (e.g. “food”) is identified, the model adjusts its billions of parameters (weights) through backpropagation, an optimization process that reinforces correct predictions by increasing their probabilities while reducing the likelihood of incorrect ones.
This process is repeated billions of times across massive datasets.
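To make the mechanics concrete, here is a deliberately tiny, hypothetical PyTorch sketch of a single next-token training step. Real LLMs use transformer architectures with billions of parameters, and all token IDs below are made up:

```python
import torch
import torch.nn as nn

vocab_size = 100_277                      # GPT-4-sized vocabulary, for illustration
context_len = 4

# A toy stand-in for "the giant mathematical expression": embed 4 context
# tokens, flatten, and score every possible next token.
model = nn.Sequential(
    nn.Embedding(vocab_size, 64),
    nn.Flatten(),
    nn.Linear(64 * context_len, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

context = torch.tensor([[2000, 425, 4500, 287]])   # e.g. "we are cook ing" (made-up IDs)
target = torch.tensor([10500])                     # the correct next token, e.g. "food"

logits = model(context)                            # raw scores for each candidate token
loss = nn.functional.cross_entropy(logits, target) # high loss if "food" gets low probability
loss.backward()                                    # backpropagation: compute gradients
optimizer.step()                                   # nudge weights toward the correct prediction
```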
Base model: the output of pre-training
At this stage, the base model has learned:
- How words, phrases, and sentences relate to one another
- Statistical patterns in the training data
However, base models are not yet optimised for real-world tasks. You can think of them as a sophisticated autocomplete system: they predict the next token based on probability, but with limited instruction-following ability.
A base model can sometimes recite training data verbatim, and it can be used for certain purposes through in-context learning, where you guide its responses by providing examples in your prompt. However, to make the model truly useful and reliable, it requires further training.
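As a quick illustration of in-context learning, here is a hedged sketch using GPT-2 (a small, non-instruction-tuned base model) via the Hugging Face transformers pipeline; the prompt itself supplies the pattern, and the model simply continues it by next-token prediction:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The examples inside the prompt are the "programming"; no fine-tuning involved.
prompt = (
    "English: cheese -> French: fromage\n"
    "English: bread -> French: pain\n"
    "English: apple -> French:"
)
out = generator(prompt, max_new_tokens=3, do_sample=False)
print(out[0]["generated_text"])
```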
2. Post-training: making the model useful
Base models are raw and unrefined. To make them helpful, reliable, and safe, they go through post-training, where they’re fine-tuned on smaller, specialised datasets.
Because the model is a neural network, it cannot be explicitly programmed like traditional software. Instead, we “program” it implicitly by training it on structured, labelled datasets that represent examples of desired interactions.
How post-training works
Specialised datasets are created, consisting of structured examples of how the model should respond in different situations.
Some types of post-training include:
- Instruction/conversation fine-tuning
Goal: to teach the model to follow instructions, be task-oriented, engage in multi-turn conversations, follow safety guidelines, refuse malicious requests, etc.
E.g.: InstructGPT (2022): OpenAI hired around 40 contractors to create these labelled datasets. These human annotators wrote prompts and provided ideal responses based on safety guidelines. Today, many datasets are generated automatically, with humans reviewing and editing them for quality.
- Domain-specific fine-tuning
Goal: adapt the model for specialised fields like medicine, law, and programming.
Post-training also introduces special tokens, symbols that weren’t used during pre-training, to help the model understand the structure of interactions. These tokens signal where a user’s input starts and ends and where the AI’s response begins, ensuring that the model correctly distinguishes between prompts and replies.
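For illustration, here is roughly what a conversation looks like once special tokens are added; the exact tokens vary by model, and the <|im_start|>/<|im_end|> markers shown here follow the ChatML-style format used by some GPT models:

```python
# One post-training example, rendered with special tokens that delimit turns.
chat_example = (
    "<|im_start|>user\n"
    "What is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "The capital of France is Paris.<|im_end|>\n"
)
print(chat_example)
```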
Now, let’s move on to a few other key concepts.
Inference: how the model generates new text
Inference can be performed at any stage, even midway through pre-training, to evaluate how well the model has learned.
When given an input sequence of tokens, the model assigns probabilities to all possible next tokens based on the patterns it learned during training.
Instead of always picking the most likely token, it samples from this probability distribution, much like flipping a biased coin where higher-probability tokens are more likely to be chosen.
This process repeats iteratively, with each newly generated token becoming part of the input for the next prediction.
Token selection is stochastic, so the same input can produce different outputs. Over time, the model generates text that wasn’t explicitly in its training data but follows the same statistical patterns.
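Here is a tiny sketch of that biased-coin sampling step; the vocabulary and logits are made up for illustration:

```python
import numpy as np

vocab = ["food", "dinner", "pasta", "now"]      # tiny made-up vocabulary
logits = np.array([2.0, 1.2, 0.8, -1.0])        # raw scores from the model

probs = np.exp(logits) / np.exp(logits).sum()   # softmax turns scores into probabilities
next_token = np.random.choice(vocab, p=probs)   # higher-probability tokens win more often

print(dict(zip(vocab, probs.round(3))), "->", next_token)
```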
Hallucinations: when LLMs generate false information
Why do hallucinations happen?
Hallucinations happen because LLMs don’t “know” facts; they simply predict the most statistically likely sequence of words based on their training data.
Early models struggled significantly with hallucinations.
For instance, if the training data contains many “Who is…” questions with definitive answers, the model learns that such queries should always have confident responses, even when it lacks the necessary knowledge.
When asked about an unknown person, the model doesn’t default to “I don’t know” because this pattern was not reinforced during training. Instead, it generates its best guess, often leading to fabricated information.

How do you reduce hallucinations?
Technique 1: Saying “I don’t know”
Improving factual accuracy requires explicitly training the model to recognise what it doesn’t know, a task that’s more complex than it seems.
This is done via self-interrogation, a process that helps define the model’s knowledge boundaries.
Self-interrogation can be automated using another AI model, which generates questions to probe knowledge gaps. If the model produces a false answer, new training examples are added where the correct response is: “I’m not sure. Could you provide more context?”
If a model has seen a question many times during training, it will assign a high probability to the correct answer.
If the model has not encountered the question before, it distributes probability more evenly across multiple possible tokens, making the output more randomised. No single token stands out as the clearly most likely choice.
Fine-tuning explicitly trains the model to handle these low-confidence outputs with predefined responses.
For example, when I asked ChatGPT-4o, “Who is asdja rkjgklfj?”, it correctly responded: “I’m not sure who that is. Could you provide more context?”
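To make this concrete, a fine-tuning example for this behaviour might look something like the sketch below; the format and names are purely illustrative:

```python
# Hypothetical training examples pairing unknown-entity questions with an
# explicit "I don't know" style response instead of a confident guess.
idk_examples = [
    {
        "prompt": "Who is asdja rkjgklfj?",
        "response": "I'm not sure who that is. Could you provide more context?",
    },
    {
        "prompt": "What did Orson Kovacs invent in 1937?",
        "response": "I don't have reliable information about that person.",
    },
]
```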
Technique 2: Doing a web search
A more advanced method is to extend the model’s knowledge beyond its training data by giving it access to external search tools.
At a high level, when a model detects uncertainty, it can trigger a web search. The search results are then inserted into the model’s context window, essentially allowing this new knowledge to become part of its working memory. The model references this new information while generating a response.
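At a sketch level, that loop might look like the hypothetical code below; model_generate and web_search are placeholders I made up, not real APIs:

```python
def answer_with_search(question, model_generate, web_search):
    """Hypothetical sketch: answer directly, or fall back to a web search when unsure."""
    draft = model_generate(question)
    if "I'm not sure" in draft:              # crude stand-in for an uncertainty signal
        results = web_search(question)       # e.g. top snippets from a search API
        augmented = (
            f"Search results:\n{results}\n\n"
            f"Using the results above, answer: {question}"
        )
        return model_generate(augmented)     # the new facts now sit in working memory
    return draft
```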
Vague recollections vs. working memory
Generally speaking, LLMs have two types of knowledge access.
- Vague recollections: the knowledge stored in the model’s parameters from pre-training. This is based on patterns learned from vast amounts of internet data but is neither precise nor searchable.
- Working memory: the information available in the model’s context window, which is directly accessible during inference. Any text provided in the prompt acts as short-term memory, allowing the model to recall details while generating responses.
Adding relevant facts to the context window significantly improves response quality.
Knowledge of self
When asked questions like “Who are you?” or “What built you?”, an LLM will generate a statistical best guess based on its training data, unless it is explicitly programmed to respond accurately.
LLMs do not have true self-awareness; their responses depend on patterns seen during training.
One way to give the model a consistent identity is by using a system prompt, which sets predefined instructions about how it should describe itself, its capabilities, and its limitations.
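For example, with the OpenAI Python SDK a system prompt can pin down the model’s identity like this (the model name and persona are illustrative):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "You are Ava, an assistant built by Acme Corp. "
                       "If asked who made you, credit Acme Corp.",
        },
        {"role": "user", "content": "Who are you and who built you?"},
    ],
)
print(response.choices[0].message.content)
```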
To finish off
That’s a wrap for Part 1! I hope this has helped you build intuition on how LLMs work. In Part 2, we’ll dive deeper into reinforcement learning and some of the latest models.
Got questions or ideas for what I should cover next? Drop them in the comments; I’d love to hear your thoughts. See you in Part 2! 🙂