
    6 Things I Learned Building LLMs From Scratch That No Tutorial Teaches You

By Editor Times Featured · April 17, 2026 · 12 Mins Read


LLMs have taken the world by storm. Most people only use the polished APIs: they type a prompt and get an answer. What they miss is the architecture underneath — where it excels and where it needs improvement. Under the hood lie non-obvious design decisions that determine speed, cost, and capability; decisions that matter deeply if you want to build, fine-tune, or optimize these models.

I implemented GPT-2 from scratch in pure PyTorch to understand the architecture end-to-end. On top of it, I added LoRA (Low-Rank Adapters), RoPE (Rotary Positional Embeddings), a KV cache, and more. While implementing, there were several moments that made me scratch my head, and I kept documenting all of them. Today I'm sharing the 6 most important ones. For a deeper look at the architecture, see my earlier deep-dive.

    1. LoRA vs RsLoRA (Rank Stabilized):

LoRA fine-tunes a model by training only two low-rank matrices, B and A, with shapes (dimension, rank) and (rank, dimension), while keeping the original weights W frozen [1]. This reduces the number of trainable parameters drastically (in my case, to just 0.18% of all weights).
LoRA has two hyperparameters, alpha (α) and rank (r), with α/r acting as a scaling factor. The update formula is:

\[
W + \Delta W = W + \frac{\alpha}{r}(B \times A)
\]

This scaling factor decides how much importance is given to the fine-tuned parameters. If alpha is 32 and rank is 16, the scaling factor is 2, so the update is given 2x importance. The scaling factor can therefore vary. A visual of the idea is given below:

LoRA adds a low-rank update path alongside the frozen pretrained weights. Image by Author.
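As a quick numeric sketch of that scaling (plain NumPy rather than my PyTorch implementation; the shapes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, rank, alpha = 64, 16, 32

W = rng.normal(size=(dim, dim))          # frozen pretrained weight
B = rng.normal(size=(dim, rank))         # trainable low-rank factor
A = rng.normal(size=(rank, dim))         # trainable low-rank factor

scaling = alpha / rank                   # 32 / 16 = 2.0
W_eff = W + scaling * (B @ A)            # effective weight after fine-tuning

# only B and A train: (64*16 + 16*64) entries vs 64*64 frozen entries
trainable_fraction = (B.size + A.size) / (W.size + B.size + A.size)
print(scaling, round(trainable_fraction, 2))   # 2.0 0.33
```

At realistic dimensions (thousands, not 64) with a small rank, the trainable fraction collapses to well under 1%, which is where figures like my 0.18% come from.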

But there is an issue with LoRA, as reported by Kalajdzievski: if the rank keeps increasing, dividing the fine-tuned update by r eventually erodes the weights' importance [2]. In simple terms: as rank grows, the individual weight updates shrink and LoRA quietly becomes less effective without you realizing it.
Note: For those interested in the underlying math, I've included the statistical proof below. Otherwise, feel free to jump straight to the next section.

Proof: The entries of B and A are randomly initialized when we start fine-tuning (standard practice: normally distributed), i.e., Bⱼₖ, Aₖᵢ ~ N(0, σ²).
So as we increase r, the variance of B·A increases proportionally:

\[
\begin{aligned}
Var(B \cdot A) &= Var\left(\sum_k B_{jk} \cdot A_{ki}\right) \\
&= Var(X_1 + X_2 + \dots + X_r) \\
&\quad \text{(denoting each } B_{jk} \cdot A_{ki} \text{ as an independent variable } X_k\text{)} \\
&= Var(X_1) + Var(X_2) + \dots + Var(X_r) \\
&\quad \text{(variance of a sum of independent variables)} \\
&= r \cdot c \\
&\quad \text{(where } c \text{ is the constant per-term variance)} \\
\text{Result: } \quad & Var(B \cdot A) \propto r
\end{aligned}
\]

But we can't stop here: we need the variance of the full fine-tuned update (ΔW), with the scaling factor accounted for:

\[
\begin{aligned}
Var(\Delta W) &= Var\left(\frac{\alpha}{r} \cdot (B \cdot A)\right) \\
&= \frac{\alpha^2}{r^2} \cdot Var(B \cdot A) \\
&\quad \text{(pulling the constant out: } Var(aX) = a^2\,Var(X)\text{)} \\
&= \frac{\alpha^2}{r^2} \cdot (r \cdot c) \\
&\quad \text{(using the result above)} \\
&= \alpha^2 c \cdot \frac{1}{r} \\
&\quad \text{(since } \alpha^2 \text{ and } c \text{ are constants)} \\
\text{Result: } \quad & Var(\Delta W) \propto 1/r
\end{aligned}
\]

This shows that as rank increases, the variance of the fine-tuned update decreases, meaning the weight updates become smaller and smaller. To solve this shrinking problem, Kalajdzievski introduced a simple and effective fix: replace the r in the scaling factor with √r. I won't go through the calculation again, but it results in Var(ΔW) ∝ r/r = 1. The variance becomes constant and the magnitude of the update stays stable at every rank (shown in the plot below). It is therefore better to stick with RsLoRA over LoRA.

As rank increases, LoRA weight updates shrink (α/r). RsLoRA fixes this by keeping the variance stable (α/√r). Image by Author.
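You can check this variance behavior empirically. The sketch below (NumPy, with Gaussian-initialized B and A as in the proof; α and σ are arbitrary choices) estimates Var(ΔW) under both scalings across ranks:

```python
import numpy as np

def update_variance(rank, dim=256, alpha=16, stabilized=False, seed=0):
    """Empirical Var of the scaled LoRA update for one random draw."""
    rng = np.random.default_rng(seed)
    B = rng.normal(0, 0.02, size=(dim, rank))
    A = rng.normal(0, 0.02, size=(rank, dim))
    scale = alpha / np.sqrt(rank) if stabilized else alpha / rank
    return (scale * (B @ A)).var()

for r in (4, 16, 64, 256):
    print(f"rank={r:3d}  LoRA Var={update_variance(r):.2e}  "
          f"RsLoRA Var={update_variance(r, stabilized=True):.2e}")
# LoRA's variance falls roughly as 1/r; RsLoRA's stays flat.
```

Raising the rank by 64x shrinks the LoRA update variance by roughly the same factor, while the RsLoRA column barely moves — exactly the α/√r effect the plot illustrates.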

2. RoPE instead of Learned Parameters or Sinusoidal Positional Embeddings (PEs)

Positional embeddings are often treated as a secondary component, but that underestimates the importance they carry and how a wrong approach can completely spoil a large LLM. The research paper "Attention Is All You Need" [3] used Sinusoidal Positional Embeddings (PEs). This approach involved no parameters, using a fixed formula to generate values. However, it carried several caveats: the fixed formula was not flexible enough to capture relative positions and only provided absolute position. Another major issue was that these positional embeddings were added directly to the token embeddings, altering the magnitude of the actual information the token embeddings carried.

To overcome this, models like GPT-2 and GPT-3 started using a learned-parameters approach. Instead of relying on a single fixed formula, the network was left to discover positional information through backpropagation. While this was a step in the right direction, it again had caveats: it added parameter load to the model (context_size × dimension), and the major drawback, direct addition to the token embeddings, still remained.

RoPE (Rotary Positional Embeddings) came to the rescue [4]. It overcame most of the drawbacks the other two approaches carried, and most modern LLMs now ship with RoPE by default, for good reason. Unlike learned or sinusoidal approaches, RoPE encodes position by rotating the Query and Key matrices based on their position and frequency, leaving token embeddings untouched. It thus achieves two goals with a single mechanism: zero parameter load on the model and no direct addition, ensuring the actual information carried by the token embeddings is left unchanged.
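A minimal sketch of the rotation (NumPy; pairs of dimensions are rotated by position-dependent angles, frequency schedule following the RoFormer convention with base 10000). The payoff is the last three lines: attention scores between rotated queries and keys depend only on the position offset, not the absolute positions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate pairs of dimensions of x by position-dependent angles.
    x: (seq_len, dim) query or key rows; dim must be even."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # one frequency per pair
    angles = positions[:, None] * freqs[None, :]     # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,      # 2-D rotation of each
                           x1 * sin + x2 * cos], axis=-1)  # (x1[i], x2[i]) pair

rng = np.random.default_rng(1)
q, k = rng.normal(size=8), rng.normal(size=8)
q_rot = rope(np.stack([q, q]), np.array([0.0, 1.0]))  # same q at positions 0, 1
k_rot = rope(np.stack([k, k]), np.array([2.0, 3.0]))  # same k at positions 2, 3
print(np.isclose(q_rot[0] @ k_rot[0], q_rot[1] @ k_rot[1]))  # True: offset 2 both
```

Note that nothing was added to the embeddings themselves; the information lives entirely in the rotation of Q and K.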

I've covered all three in depth with visuals and a pros/cons breakdown: read the full article here.


    3. Weight Tying

Weight tying refers to sharing weights between the token embedding layer and the output projection head. Historically, GPT, GPT-2 and BERT all used it. On a 124M-parameter model it saves 38M parameters, roughly 30% of the entire model, which was significant. The intuition also made sense: the embedding maps token → vector and the output head maps vector → token, making them natural transposes of each other. However, as models scaled to billions of parameters, this 38M saving became less than 0.5% of the total, practically meaningless. So most modern LLMs like LLaMA, Mistral and Falcon keep them separate, also because separate weights give the output head freedom to specialize independently. Weight tying makes sense for small models, but quietly disappeared as models scaled.

So if you're building a small model from scratch, it's worth keeping. If you're fine-tuning a billion-parameter model, don't bother looking for it; it's likely already gone.
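The 38M figure falls straight out of GPT-2-small's vocabulary and embedding sizes; the back-of-envelope check below uses 13B as a stand-in for a modern billion-scale model:

```python
vocab_size, dim = 50257, 768        # GPT-2 small
head_params = vocab_size * dim      # untied output head (vector -> token logits)
saved = head_params                 # tying reuses the embedding matrix instead

print(f"saved {saved / 1e6:.1f}M params")      # saved 38.6M params
print(f"{saved / 124e6:.0%} of a 124M model")  # 31% of a 124M model
print(f"{saved / 13e9:.2%} of a 13B model")    # 0.30% of a 13B model
```

In PyTorch the tie itself is a single assignment, e.g. `lm_head.weight = tok_emb.weight`, which works because `nn.Linear(dim, vocab)` and `nn.Embedding(vocab, dim)` both store a (vocab, dim) weight.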


4. Pre-LayerNorm vs Post-LayerNorm

Pre-LN and Post-LN sit on opposite ends of a stability vs. performance tradeoff. The original "Attention Is All You Need" architecture used Post-LN (where normalization happens after the residual addition). While Post-LN can lead to better final performance, it's notoriously difficult to train because it can cause gradients to explode or vanish in deep networks.

Starting with GPT-2, the industry switched to Pre-LN (where normalization happens inside the residual block). This choice prioritizes training stability, though it often comes at a slight cost to the model's ultimate representational power. Researchers have been trying to break this trade-off ever since, leading to modern variations like DeepNorm, RMSNorm, and Double Norm.

Post-LN and Pre-LN stability-performance tradeoff. Image by Author; originally published in Pre-LN vs Post-LN: The Architectural Tug-of-War Every LLM Engineer Should Know.
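The two orderings differ by a single line. A schematic sketch (NumPy, with a stand-in sublayer; a real block would use attention or an MLP, plus learned γ and β):

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # original Transformer: normalize AFTER the residual addition --
    # the residual stream itself passes through LayerNorm every block
    return layernorm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # GPT-2 onward: normalize INSIDE the residual branch --
    # the identity path is untouched, so gradients flow cleanly
    return x + sublayer(layernorm(x))

sublayer = lambda h: 0.5 * h                 # stand-in for attention / MLP
x = np.random.default_rng(0).normal(size=(4, 8))
out_post = post_ln_block(x, sublayer)
out_pre = pre_ln_block(x, sublayer)
print(np.allclose(out_post.mean(-1), 0.0))   # True: Post-LN output is normalized
```

In Post-LN, every block renormalizes the residual stream itself, which is exactly what makes deep stacks hard to train; in Pre-LN the identity path from input to output is never touched.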

    5. KV-Cache

The attention mechanism is the core engine of the Transformer, allowing the model to dynamically weight the importance of different tokens across a sequence. It's arguably the most critical innovation in modern AI, since it enables the model to maintain long-range context and "focus" on relevant information.
There are three components inside the attention mechanism: Query, Key and Value.

• Query (Q): represents the current token the model is focusing on
• Key (K): used with the query to find the relationship of the current token with other tokens
• Value (V): the actual content a token contributes if selected

During inference, tokens are predicted one at a time, autoregressively. Each new token attends to all previous tokens, meaning the K and V matrices were being recomputed from scratch for every previously seen token, every single time. Wasteful.

The fix is simple: just cache the K and V matrices as you go. Each new token only needs to compute its own K and V, then retrieve the rest from cache. This drops the time complexity from O(T²) to O(T) for a sequence of length T.

Without a KV cache, K and V are recomputed from scratch at every step. With a KV cache, only the new token's K and V are computed; the rest are retrieved directly. Image by Author.
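A minimal sketch of the caching pattern (NumPy; single head, toy random projections, no masking or batching) — each decoding step appends one new K and V row instead of recomputing the whole matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 16
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
k_cache, v_cache = [], []

def decode_step(token_emb):
    """One autoregressive step: project K,V for the NEW token only,
    reuse everything already in the cache."""
    k_cache.append(token_emb @ Wk)
    v_cache.append(token_emb @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)   # (t, dim)
    q = token_emb @ Wq                            # query for the new token
    scores = np.exp(q @ K.T / np.sqrt(dim))
    attn = scores / scores.sum()                  # softmax over the t positions
    return attn @ V                               # context vector

for _ in range(5):
    out = decode_step(rng.normal(size=dim))
print(len(k_cache), out.shape)                    # 5 (16,)
```

Per step this is one K and one V projection instead of t of them — which is where the O(T²) → O(T) claim comes from.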

The actual speedup: for a sequence of 15 tokens, without a KV cache you're doing 15 full K and V computations per step. With the cache, you do 1. That's roughly a 15x reduction in attention compute. In practice you see around a 2x overall speedup once other operations are accounted for.

But there's a tradeoff nobody mentions: the KV cache is not free. It consumes memory proportional to number_of_layers × sequence_length × dimension. For long contexts this becomes significant, which is exactly why memory, not compute, is the bottleneck in LLM serving.
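Plugging in rough GPT-2-small numbers (12 layers, dimension 768, fp16 values, and a hypothetical 4k context) makes the growth concrete; the leading factor of 2 covers the separate K and V matrices:

```python
layers, dim, seq_len = 12, 768, 4096   # GPT-2-small shapes, hypothetical 4k context
bytes_per_value = 2                    # fp16
kv_bytes = 2 * layers * seq_len * dim * bytes_per_value   # factor 2: K and V
print(f"{kv_bytes / 2**20:.0f} MiB per sequence")         # 144 MiB per sequence
```

That is per sequence: it scales linearly with context length and is multiplied again by every concurrent request a server is handling.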

This memory overhead has been a major research challenge, and recently Google Research introduced a breakthrough to address it. In their 2025 paper, "TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate" [5], researchers demonstrated a way to compress the KV cache down to just 3 bits per value.

The technique achieves a 5x to 6x reduction in memory consumption with zero accuracy loss. It works by rotating the dimensional coordinates so they follow a Beta distribution, then applying Lloyd-Max quantization combined with a 1-bit Quantized Johnson-Lindenstrauss (QJL) transform to correct residual errors. This approach unclogs the memory wall and lets a single chip handle huge contexts that previously required multiple GPUs.


6. Quantization Tradeoff: Why LayerNorm Is Skipped During INT8 Quantization

Modern LLMs are enormous. Storing and running them in full 32-bit or 16-bit floating-point precision is expensive, both in memory and compute. Quantization is the process of reducing the numerical precision of model weights, typically from 32-bit floats down to 8-bit integers (INT8) or even 4-bit. This makes models significantly cheaper to store and faster to run, which is why almost every production LLM deployment uses some form of quantization [6].

But quantization is not applied blindly to every layer equally, and this is where it gets interesting.

LayerNorm is almost always skipped during INT8 quantization. The reason is a simple cost-benefit calculation that most articles never explain.

• The benefit is negligible: LayerNorm has almost no parameters, just γ and β, a handful of values compared to the millions sitting in a single linear layer. On a 124M-parameter model this is a negligible fraction of total memory. The savings from quantizing them are essentially zero.
• The cost is high: LayerNorm is mathematically sensitive. It computes the mean and variance across each token's embedding, then applies γ and β to rescale. Small precision errors in these parameters, which INT8 introduces, directly distort the normalized output, cascading into every subsequent layer.

The tradeoff is clear: quantize LayerNorm and you gain almost nothing while introducing meaningful quality degradation. So it stays in full precision.
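A toy illustration of the cost-benefit (NumPy; symmetric per-tensor INT8 round-trip, with made-up GPT-2-small shapes): quantizing one linear layer saves megabytes, while quantizing γ and β saves a few kilobytes — yet the γ error, however small, is re-applied to every token in every layer.

```python
import numpy as np

def int8_roundtrip(x):
    """Symmetric per-tensor INT8 quantize -> dequantize."""
    scale = np.abs(x).max() / 127.0
    return np.clip(np.round(x / scale), -127, 127) * scale

rng = np.random.default_rng(0)
dim = 768
linear_w = rng.normal(0, 0.02, size=(dim, 4 * dim))  # one MLP weight matrix
gamma = 1.0 + rng.normal(0, 0.1, size=dim)           # LayerNorm scale
beta = rng.normal(0, 0.1, size=dim)                  # LayerNorm shift

# Benefit side: bytes saved going fp32 -> int8 (3 bytes per value)
print(linear_w.size * 3, "bytes saved on the linear layer")      # 7077888
print((gamma.size + beta.size) * 3, "bytes saved on LayerNorm")  # 4608

# Cost side: this error distorts the normalized output of EVERY token
gamma_err = np.abs(int8_roundtrip(gamma) - gamma).max()
```

The asymmetry is stark: a ~1500x difference in savings, against a perturbation that sits directly on the normalization path of the whole network.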

This is a broader lesson in quantization: not all parameters are equal. The question isn't just "how many bytes does this save?" but "how sensitive is this layer to precision loss relative to what we save?"


    Conclusion

These 6 things aren't secrets; they're hiding in plain sight inside every major LLM. But tutorials rarely stop to explain the why behind them. Why RsLoRA fixes a variance problem most people never notice. Why RoPE leaves token embeddings untouched. Why weight tying quietly disappeared as models scaled. Why Pre-LN trades performance for stability. Why the KV cache turns O(T²) into O(T). Why LayerNorm survives quantization at full precision.

Building from scratch forces you to confront every one of these decisions. You can't abstract them away. And that's exactly why I'd recommend it to anyone who wants to truly understand how these systems work, not just use them.

These six observations are just the surface of what I encountered while building this model. In upcoming posts, I'll dive into the exact math of quantization errors and the practical challenges of deploying LLMs at scale. If you're interested in the intersection of statistical theory and ML engineering, follow along for the next installment.

    References

    [1] E. Hu, Y. Shen, P. Wallis et al., LoRA: Low-Rank Adaptation of Large Language Models (2021), arXiv:2106.09685

    [2] D. Kalajdzievski, A Rank Stabilization Scaling Factor for Fine-Tuning with LoRA (2023), arXiv:2312.03732

    [3] A. Vaswani, N. Shazeer, N. Parmar et al., Attention Is All You Need (2017), arXiv:1706.03762

    [4] J. Su, Y. Lu, S. Pan et al., RoFormer: Enhanced Transformer with Rotary Position Embedding (2021), arXiv:2104.09864

[5] Zandieh et al., TurboQuant: Online Vector Quantization with Near-Optimal Distortion Rate (2025), arXiv:2504.19874

    [6] T. Dettmers, M. Lewis, Y. Belkada, L. Zettlemoyer, LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale (2022), arXiv:2208.07339


