With the current explosion of interest in large language models (LLMs), they often seem almost magical. But let's demystify them.
I wanted to step back and unpack the fundamentals: breaking down how LLMs are built, trained, and fine-tuned to become the AI systems we interact with today.
This two-part deep dive is something I've been meaning to do for a while, and it was also inspired by Andrej Karpathy's widely popular 3.5-hour YouTube video, which has racked up 800,000+ views in just 10 days. Andrej is a founding member of OpenAI and his insights are gold; you get the idea.
If you have the time, his video is definitely worth watching. But let's be real: 3.5 hours is a long watch. So, for all the busy folks who don't want to miss out, I've distilled the key concepts from the first 1.5 hours into this 10-minute read, adding my own breakdowns to help you build a solid intuition.
What you’ll get
Part 1 (this article): Covers the fundamentals of LLMs, including pre-training, post-training, neural networks, hallucinations, and inference.
Part 2: Reinforcement learning with human/AI feedback, and a look at o1 models, DeepSeek R1, and AlphaGo.
Let's go! I'll start with how LLMs are built.
At a high level, there are two key stages: pre-training and post-training.
1. Pre-training
Before an LLM can generate text, it must first learn how language works. This happens through pre-training, a highly computationally intensive task.
Step 1: Data collection and preprocessing
The first step in training an LLM is gathering as much high-quality text as possible. The goal is to create a massive and diverse dataset containing a wide range of human knowledge.
One source is Common Crawl, a free, open repository of web crawl data containing 250 billion web pages collected over 18 years. However, raw web data is noisy, containing spam, duplicates, and low-quality content, so preprocessing is essential. If you're interested in preprocessed datasets, FineWeb offers a curated version of Common Crawl and is made available on Hugging Face.
Once cleaned, the text corpus is ready for tokenization.
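If you want to peek at this kind of data yourself, here is a minimal sketch using the Hugging Face datasets library; the dataset id "HuggingFaceFW/fineweb" and the "text" field are assumptions based on the public FineWeb release, so check its dataset page before running:

```python
# Minimal sketch: stream a few FineWeb samples from Hugging Face.
# Assumes the `datasets` library is installed and the dataset id below is correct.
from datasets import load_dataset

# Streaming avoids downloading the full multi-terabyte dataset.
fineweb = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

for i, example in enumerate(fineweb):
    print(example["text"][:200])  # first 200 characters of each document
    if i >= 2:
        break
```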
Step 2: Tokenization
Earlier than a neural community can course of textual content, it have to be transformed into numerical type. That is achieved by way of tokenization, the place phrases, subwords, or characters are mapped to distinctive numerical tokens.
Consider tokens because the constructing blocks — the elemental constructing blocks of all language fashions. In GPT4, there are 100,277 attainable tokens.A preferred tokenizer, Tiktokenizer, lets you experiment with tokenization and see how textual content is damaged down into tokens. Attempt coming into a sentence, and also you’ll see every phrase or subword assigned a sequence of numerical IDs.
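To try this outside the browser, here is a small sketch using OpenAI's tiktoken library, which exposes the cl100k_base encoding used by GPT-4-era models:

```python
# Minimal sketch of tokenization with tiktoken (pip install tiktoken).
import tiktoken

# cl100k_base is the encoding used by GPT-4-era models (~100k possible tokens).
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("we are cooking")
print(tokens)                              # a short list of integer token IDs
print([enc.decode([t]) for t in tokens])   # the text piece each ID maps back to
```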

Step 3: Neural network training
Once the text is tokenized, the neural network learns to predict the next token based on its context. As shown above, the model takes an input sequence of tokens (e.g., "we are cook ing") and processes it through a giant mathematical expression, which represents the model's architecture, to predict the next token.
A neural network consists of two key parts:
- Parameters (weights): the numerical values learned during training.
- Architecture (mathematical expression): the structure defining how input tokens are processed to produce outputs.

Initially, the model's predictions are random, but as training progresses, it learns to assign probabilities to possible next tokens.
When the correct token (e.g. "food") is identified, the model adjusts its billions of parameters (weights) through backpropagation, an optimization process that reinforces correct predictions by increasing their probabilities while decreasing the likelihood of incorrect ones.
This process is repeated billions of times across massive datasets.
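As a rough sketch of what a single training step looks like, here is a deliberately tiny PyTorch example; the model is a toy stand-in for a real Transformer, and the data and shapes are made up for illustration:

```python
# Minimal sketch of next-token prediction training in PyTorch.
# A tiny stand-in model; real LLMs use Transformer architectures with billions of parameters.
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 8

model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),              # token IDs -> vectors
    nn.Flatten(),                                     # merge the context into one vector
    nn.Linear(embed_dim * context_len, vocab_size),   # scores for every possible next token
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# Dummy batch: input token sequences and the "correct" next token for each one.
inputs = torch.randint(0, vocab_size, (32, context_len))
targets = torch.randint(0, vocab_size, (32,))

logits = model(inputs)                                    # unnormalised scores over the vocabulary
loss = nn.functional.cross_entropy(logits, targets)       # low when the correct token gets high probability
loss.backward()                                           # backpropagation: compute gradients
optimizer.step()                                          # nudge the weights toward the correct token
```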
Base model: the output of pre-training
At this stage, the base model has learned:
- How words, phrases, and sentences relate to one another
- Statistical patterns in its training data
However, base models are not yet optimised for real-world tasks. You can think of them as an advanced autocomplete system: they predict the next token based on probability, but with limited instruction-following ability.
A base model can sometimes recite training data verbatim and can be used for certain applications through in-context learning, where you guide its responses by providing examples in your prompt (see the sketch below). However, to make the model truly useful and reliable, it requires further training.
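To make in-context learning concrete, here is a small illustrative sketch; the task and examples are invented, and a real base model may not complete it exactly this way:

```python
# A few-shot prompt: the base model picks up the pattern from the examples
# and continues it, even though it was never explicitly trained on this task.
prompt = """English: cheese -> French: fromage
English: bread -> French: pain
English: apple -> French:"""

print(prompt)
# Fed to a base model, the most likely continuation is " pomme",
# purely because the prompt establishes a translation pattern.
```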
2. Post-training: Making the model useful
Base models are raw and unrefined. To make them helpful, reliable, and safe, they go through post-training, where they are fine-tuned on smaller, specialised datasets.
Because the model is a neural network, it cannot be explicitly programmed like traditional software. Instead, we "program" it implicitly by training it on structured, labeled datasets that represent examples of desired interactions.
How post-training works
Specialised datasets are created, consisting of structured examples of how the model should respond in different situations.
Some types of post-training include:
- Instruction/conversation fine-tuning
Goal: To teach the model to follow instructions, be task oriented, engage in multi-turn conversations, follow safety guidelines, refuse malicious requests, and so on.
E.g. InstructGPT (2022): OpenAI hired some 40 contractors to create these labelled datasets. These human annotators wrote prompts and provided ideal responses based on safety guidelines. Today, many datasets are generated automatically, with humans reviewing and editing them for quality.
- Domain-specific fine-tuning
Goal: Adapt the model for specialised fields like medicine, law, and programming.
Post-training also introduces special tokens, symbols that were not used during pre-training, to help the model understand the structure of interactions. These tokens signal where a user's input begins and ends and where the AI's response begins, ensuring the model correctly distinguishes between prompts and replies.
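As a rough illustration, many open chat models use a ChatML-style template with tokens like <|im_start|> and <|im_end|>; the exact token names vary by model, so treat this sketch as one example format rather than a universal standard:

```python
# Sketch of a ChatML-style chat template (special token names vary across models).
def format_chat(user_message: str, assistant_reply: str = "") -> str:
    return (
        "<|im_start|>user\n" + user_message + "<|im_end|>\n"
        "<|im_start|>assistant\n" + assistant_reply
    )

print(format_chat("What is the capital of France?"))
# The special tokens mark where each turn begins and ends, so the model
# knows the user's input is finished and it should now generate the assistant's reply.
```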
Now, let's move on to some other key concepts.
Inference: how the model generates new text
Inference can be performed at any stage, even midway through pre-training, to evaluate how well the model has learned.
When given an input sequence of tokens, the model assigns probabilities to all possible next tokens based on the patterns it learned during training.
Instead of always picking the most likely token, it samples from this probability distribution, similar to flipping a biased coin, where higher-probability tokens are more likely to be chosen.
This process repeats iteratively, with each newly generated token becoming part of the input for the next prediction.
Token selection is stochastic, so the same input can produce different outputs. Over time, the model generates text that wasn't explicitly in its training data but follows the same statistical patterns.
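Here is a minimal sketch of that sampling step; the candidate tokens and their probabilities are invented for illustration:

```python
# Minimal sketch of sampling the next token from a probability distribution.
import numpy as np

rng = np.random.default_rng(0)

# Made-up candidate tokens and model-assigned probabilities for illustration.
tokens = ["food", "dinner", "pasta", "cars"]
probs = np.array([0.55, 0.25, 0.15, 0.05])

# Like flipping a biased coin: "food" is chosen most often, but not always.
samples = [rng.choice(tokens, p=probs) for _ in range(5)]
print(samples)
```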
Hallucinations: when LLMs generate false information
Why do hallucinations happen?
Hallucinations happen because LLMs don't "know" facts; they simply predict the most statistically likely sequence of words based on their training data.
Early models struggled significantly with hallucinations.
For instance, in the example below, if the training data contains many "Who is…" questions with definitive answers, the model learns that such queries should always have confident responses, even when it lacks the necessary knowledge.
When asked about an unknown person, the model doesn't default to "I don't know" because this pattern was not reinforced during training. Instead, it generates its best guess, often leading to fabricated information.

How do you reduce hallucinations?
Technique 1: Saying “I don’t know”
Improving factual accuracy requires explicitly training the model to recognise what it doesn't know, a task that is more complex than it seems.
This is achieved through self-interrogation, a process that helps define the model's knowledge boundaries.
Self-interrogation can be automated using another AI model, which generates questions to probe for knowledge gaps. If the model produces a false answer, new training examples are added where the correct response is: "I'm not sure. Could you provide more context?"
If a model has seen a question many times in training, it will assign a high probability to the correct answer.
If the model has not encountered the question before, it distributes probability more evenly across multiple possible tokens, making the output more randomised. No single token stands out as the most likely choice.
Fine-tuning explicitly trains the model to handle these low-confidence outputs with predefined responses.
For example, when I asked ChatGPT-4o, "Who is asdja rkjgklfj?", it correctly responded: "I'm not sure who that is. Could you provide more context?"
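To give a rough sense of what such training examples might look like, here is an illustrative sketch; the names and structure are made up and do not follow any particular provider's fine-tuning format:

```python
# Illustrative fine-tuning examples teaching the model to admit uncertainty.
training_examples = [
    {
        "prompt": "Who is Orson Kovacs?",   # a name the model has never seen
        "ideal_response": "I'm not sure who that is. Could you provide more context?",
    },
    {
        "prompt": "Who is Albert Einstein?",  # well represented in the training data
        "ideal_response": "Albert Einstein was a theoretical physicist best known for the theory of relativity.",
    },
]
```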
Technique 2: Doing a web search
A more advanced technique is to extend the model's knowledge beyond its training data by giving it access to external search tools.
At a high level, when a model detects uncertainty, it can trigger a web search. The search results are then inserted into the model's context window, essentially allowing this new knowledge to become part of its working memory. The model references this new information while generating a response.
Vague recollections vs working memory
Generally speaking, LLMs have two types of knowledge access.
- Vague recollections: the knowledge stored in the model's parameters from pre-training. This is based on patterns learned from vast amounts of internet data but is neither precise nor searchable.
- Working memory: the information available in the model's context window, which is directly accessible during inference. Any text provided in the prompt acts as short-term memory, allowing the model to recall details while generating responses.
Adding relevant knowledge to the context window significantly improves response quality.
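Here is a hedged sketch of that flow; web_search and llm_generate are hypothetical stand-ins for a real search API and a real model call, stubbed out so the example runs:

```python
# Sketch: inserting retrieved knowledge into the model's working memory (its context window).
def web_search(query: str) -> list[str]:
    # Hypothetical stand-in for a real search API; returns canned snippets for illustration.
    return [f"Snippet about: {query}", "Another relevant snippet."]

def llm_generate(prompt: str) -> str:
    # Hypothetical stand-in for a real model call.
    return "(model response based on the prompt above)"

def answer_with_search(question: str) -> str:
    snippets = web_search(question)
    context = "\n".join(snippets)
    prompt = (
        "Use the following search results to answer the question.\n\n"
        f"Search results:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    return llm_generate(prompt)

print(answer_with_search("Who won the 2024 Nobel Prize in Physics?"))
```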
Knowledge of self
When asked questions like "Who are you?" or "What built you?", an LLM will generate a statistical best guess based on its training data, unless explicitly programmed to respond accurately.
LLMs don't have true self-awareness; their responses depend on patterns seen during training.
One way to give the model a consistent identity is to use a system prompt, which sets predefined instructions about how it should describe itself, its capabilities, and its limitations.
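In an OpenAI-style chat message list, that might look like the sketch below; the assistant name and company are invented for illustration, and the exact message structure varies across providers:

```python
# Sketch of giving the model an identity via a system prompt.
messages = [
    {
        "role": "system",
        "content": "You are Aria, a helpful assistant built by ExampleCorp. "
                   "You were trained on data up to 2024 and cannot browse the web.",
    },
    {"role": "user", "content": "Who are you?"},
]
# The model's answer to "Who are you?" is now steered by the system message
# rather than by whatever patterns happen to dominate its training data.
print(messages)
```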
To finish off
That's a wrap for Part 1! I hope this has helped you build intuition on how LLMs work. In Part 2, we'll dive deeper into reinforcement learning and some of the latest models.
Got questions or ideas for what I should cover next? Drop them in the comments; I'd love to hear your thoughts. See you in Part 2! 🙂