    Why Your Next LLM Might Not Have A Tokenizer

By Editor Times Featured | June 25, 2025


In my last article, we dove into Google's Titans, a model that pushes the boundaries of long-context recall by introducing a dynamic memory module that adapts on the fly, much like how our own memory works.

It's an odd paradox. We have AI that can analyze a 10-million-word document, yet it still fumbles questions like: "How many 'r's are in the word strawberry?"

The problem isn't the AI's brain; it's the eyes. The first step in how these models read, tokenization, essentially pre-processes language for them. In doing so, it strips away the rich, messy details of how letters form words; the whole world of sub-word information simply vanishes.


1. Lost in Tokenization: Where Subword Semantics Die

Language, for humans, begins as sound, spoken long before it is written. Yet it is through writing and spelling that we begin to grasp the compositional structure of language. Letters form syllables, syllables form words, and from there, we build conversations. This character-level understanding allows us to correct, interpret, and infer even when the text is noisy or ambiguous. In contrast, language models skip this phase entirely. They are never exposed to characters or raw text as-is; instead, their entire perception of language is mediated by a tokenizer.

This tokenizer, ironically, is the one component in the entire pipeline that isn't learned. It is dumb, fixed, and entirely based on heuristics, despite sitting at the entry point of a model designed to be deeply adaptive. In effect, tokenization sets the stage for learning, but without any learning of its own.

Moreover, tokenization is extremely brittle. A minor typo, say, "strawverry" instead of "strawberry", can yield a completely different token sequence, even though the semantic intent remains obvious to any human reader. This sensitivity, instead of being handled right then and there, is passed downstream, forcing the model to interpret a corrupted input. Worse still, optimal tokenizations are highly domain-dependent. A tokenizer trained on everyday English text may perform beautifully on natural language, but fail miserably when encountering source code, producing long and semantically awkward token chains for variable names like user_id_to_name_map.
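You can see this brittleness for yourself. The snippet below is a minimal sketch that assumes the tiktoken package and its cl100k_base encoding; the exact splits depend on the tokenizer you pick, but a one-letter typo or an underscored identifier typically changes both the number of tokens and where the word gets cut.

```python
# Minimal sketch of tokenizer brittleness (assumes `pip install tiktoken`);
# the exact token splits depend on the chosen encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["strawberry", "strawverry", "user_id_to_name_map"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    # A single typo or a code identifier can change how many tokens you get
    # and where the boundaries fall.
    print(f"{text!r:>24} -> {len(ids)} tokens: {pieces}")
```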

Like a spinal cord for the language pipeline, the higher up it is compromised, the more it cripples everything downstream. Sitting right at the top, a flawed tokenizer distorts the input before the model even begins reasoning. No matter how good the architecture is, it is working with corrupted signals from the start.

    (Source: Author)
    How a simple typo can potentially waste an LLM's "thinking power" on rectifying it

    2. Behold! Byte Latent Transformer

If tokenization is the brittle foundation holding modern LLMs back, the natural question follows: why not eliminate it entirely? That is precisely the radical path taken by researchers at Meta AI with the Byte Latent Transformer (BLT) (Pagnoni et al. 2024)1. Rather than operating on words, subwords, or even characters, BLT models language from raw bytes, the most elementary representation of digital text. This lets LLMs learn the language from the very ground up, without a tokenizer eating away at the subword semantics.

But modeling bytes directly is far from trivial. A naïve byte-level Transformer would choke on input lengths several times longer than tokenized text; a million words become nearly 5 million bytes (1 word = 4.7 characters on average, and 1 character = 1 byte), making attention computation infeasible due to its quadratic scaling. BLT circumvents this by introducing a dynamic two-tiered system: easy-to-predict byte segments are compressed into latent "patches," significantly shortening the sequence length. The full, high-capacity model is then selectively applied, focusing its computational resources only where linguistic complexity demands it.
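To make the scaling argument concrete, here is a back-of-the-envelope sketch. The 4.7-characters-per-word figure comes from the text above; the 1.3-tokens-per-word ratio and the 6-byte average patch size are rough assumptions for illustration only.

```python
# Back-of-the-envelope comparison of sequence lengths and relative
# self-attention cost (which grows quadratically with sequence length).
WORDS = 1_000_000
AVG_CHARS_PER_WORD = 4.7   # from the text; ~1 byte per character for ASCII
TOKENS_PER_WORD = 1.3      # rough BPE rule of thumb (assumption)
BYTES_PER_PATCH = 6        # assumed average patch size under entropy patching

n_bytes = int(WORDS * AVG_CHARS_PER_WORD)
n_tokens = int(WORDS * TOKENS_PER_WORD)
n_patches = n_bytes // BYTES_PER_PATCH

for name, n in [("bytes", n_bytes), ("tokens", n_tokens), ("patches", n_patches)]:
    rel_cost = (n / n_tokens) ** 2   # attention cost relative to a BPE model
    print(f"{name:>8}: {n:>10,} positions, ~{rel_cost:5.2f}x the attention cost of tokens")
```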

    (Source: Adapted from Pagnoni et al. 2024, Figure 2)
    Zoomed-out view of the entire Byte Latent Transformer architecture

    2.1 How does it work?

The model can be conceptually divided into three major components, each with a distinct responsibility:

2.1.1 The Local Encoder

The primary function of the Local Encoder is to transform a long input sequence of N_bytes raw bytes, b = (b_1, b_2, …, b_N_bytes), into a much shorter sequence of N_patches latent patch representations, p = (p_1, p_2, …, p_N_patches).

Step 1: Input Segmentation and Initial Byte Embedding

The input sequence is segmented into patches based on a pre-defined strategy, such as entropy-based patching. This provides patch boundary information but does not alter the input sequence itself. This patch boundary information will come in handy later.

    (Source: Pagnoni et al. 2024, Figure 3)
    Different strategies for patching, visualized
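For intuition, here is a minimal sketch of the entropy-based variant with a global threshold: a small byte-level language model (trained separately, and not shown here) scores each position with the entropy of its next-byte distribution, and a new patch starts wherever that entropy crosses the threshold. The threshold value and the helper names are assumptions for illustration.

```python
import math

def next_byte_entropy(probs):
    """Shannon entropy (in nats) of one next-byte distribution (256 probabilities)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def entropy_patch_starts(next_byte_dists, threshold=1.5):
    """Return the byte indices where new patches begin: position 0, plus every
    position whose predicted next-byte entropy exceeds the global threshold."""
    starts = [0]
    for i, dist in enumerate(next_byte_dists):
        if i > 0 and next_byte_entropy(dist) > threshold:
            starts.append(i)
    return starts
```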

    The first operation inside the encoder is to map each discrete byte value (0-255) into a continuous vector representation. This is achieved via a learnable embedding matrix, E_byte (shape: [256, h_e]), where h_e is the hidden dimension of the local module.
    Input: a tensor of byte IDs of shape [B, N_bytes], where B is the batch size.
    Output: a tensor of byte embeddings, X (shape: [B, N_bytes, h_e]).
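In PyTorch terms (a sketch with an arbitrary h_e), this step is just an embedding lookup:

```python
import torch
import torch.nn as nn

h_e = 256                                   # assumed local hidden dimension
byte_embedding = nn.Embedding(256, h_e)     # E_byte: one learnable row per byte value

byte_ids = torch.randint(0, 256, (2, 128))  # [B, N_bytes] raw byte IDs
x = byte_embedding(byte_ids)                # [B, N_bytes, h_e] initial byte embeddings
```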

Step 2: Contextual Augmentation via N-gram Hashing

To enrich each byte representation with local context beyond its individual identity, the researchers employ a hash-based n-gram embedding technique. For each byte b_i at position i, a set of preceding n-grams, g_{i,n} = {b_{i-n+1}, …, b_i}, is constructed for multiple values of n ∈ {3, …, 8}.

These n-grams are mapped via a hash function to indices within a second, separate embedding table, E_hash (shape: [V_hash, h_e]), where V_hash is a fixed, large vocabulary size (i.e., the number of hash buckets).

The resulting n-gram embeddings are summed with the original byte embedding to produce an augmented representation, e_i. This operation is defined as:

    (Source: Author)
    e_i = x_i + Σ_{n=3}^{8} E_hash(Hash(g_{i,n})): look up the hash of each n-gram in the embedding table and add it to the respective byte embedding, for all n ∈ [3, 8],

    where x_i is the initial embedding for byte b_i.
    The shape of the tensor E = {e_1, e_2, …, e_N_bytes} remains [B, N_bytes, h_e].
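A minimal sketch of this augmentation for a single (unbatched) sequence is shown below; the hash function and the bucket count V_HASH are placeholders rather than the exact choices from the paper.

```python
import torch
import torch.nn as nn

V_HASH, h_e = 500_000, 256                  # assumed number of hash buckets and local width
hash_embedding = nn.Embedding(V_HASH, h_e)  # E_hash

def ngram_bucket(byte_seq, i, n):
    """Hash the n-gram ending at position i into one of V_HASH buckets."""
    gram = tuple(byte_seq[max(0, i - n + 1): i + 1])
    return hash(gram) % V_HASH              # placeholder hash, not the paper's

def augment_with_ngrams(byte_seq, byte_embeds):
    """e_i = x_i + sum over n in [3, 8] of E_hash[Hash(g_{i,n})].
    byte_seq: list of ints, byte_embeds: [N_bytes, h_e]."""
    out = byte_embeds.clone()
    for i in range(len(byte_seq)):
        for n in range(3, 9):
            bucket = torch.tensor(ngram_bucket(byte_seq, i, n))
            out[i] = out[i] + hash_embedding(bucket)
    return out                              # shape unchanged: [N_bytes, h_e]
```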

Step 3: Iterative Refinement with Transformer and Cross-Attention Layers

The core of the Local Encoder consists of a stack of l_e identical layers. Each layer performs a two-stage process to refine byte representations and distill them into patch representations.

    Step 3a: Local Self-Attention:
    The input is processed by a standard Transformer block. This block uses a causal self-attention mechanism with a limited attention window, meaning each byte representation is updated by attending only to a fixed number of preceding byte representations. This ensures computational efficiency while still allowing for contextual refinement.

Input: if it is the first layer, the input is the context-augmented byte embedding E; otherwise, it receives the output from the previous local Self-Attention layer.

    (Source: Author)
    H_l: input to the current Self-Attention layer
    E: context-augmented byte embedding from Step 2
    H'_{l-1}: output from the previous Self-Attention layer

    Output: more contextually aware byte representations, H'_l (shape: [B, N_bytes, h_e])
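The "limited attention window" amounts to a banded causal mask; a small sketch of how such a mask could be built is below (the window size is an arbitrary assumption).

```python
import torch

def local_causal_mask(n_bytes, window):
    """Boolean mask of shape [n_bytes, n_bytes]; entry (i, j) is True when
    byte i may attend to byte j, i.e. j is i itself or one of the
    `window - 1` bytes immediately before it."""
    i = torch.arange(n_bytes).unsqueeze(1)   # query positions
    j = torch.arange(n_bytes).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)

mask = local_causal_mask(n_bytes=8, window=3)  # causal AND windowed
```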

    Step 3b: Multi-Headed Cross-Attention:
    The purpose of the cross-attention is to distill the fine-grained, contextual information captured in the byte representations and inject it into the more abstract patch representations, giving them a rich awareness of their constituent sub-word structures. This is achieved through a cross-attention mechanism where patches "query" the bytes they contain.

    Queries (Q): The patch embeddings are projected using a simple linear layer to form the queries.
    For any subsequent layer (l > 0), the patch embeddings are simply the refined patch representations output by the cross-attention block of the previous layer, P_(l-1).
    However, for the very first layer (l = 0), these patch embeddings must be created from scratch. This initialization is a three-step process:

    1. Gathering: Using the patch boundary information obtained in Step 1, the model gathers the byte representations from H_0 that belong to each patch. For a single patch, this results in a tensor of shape (N_bytes_per_patch, h_e). After padding each patch representation to the same length, if there are J patches, the shape of the entire concatenated tensor becomes:
      (B, J, N_bytes_per_patch, h_e).
    2. Pooling: To summarize the vector for each patch, a pooling operation (e.g., max-pooling) is applied across the N_bytes_per_patch dimension. This effectively summarizes the most salient byte-level features within the patch.
      • Input shape: (B, J, N_bytes_per_patch, h_e)
      • Output shape: (B, J, h_e)
    3. Projection: This summarized patch vector, still in the small local dimension h_e, is then passed through a dedicated linear layer up to the global dimension h_g, where h_e ≪ h_g. This projection is what bridges the local and global modules.
      • Input shape: (B, J, h_e)
      • Output shape: (B, J, h_g)
    (Source: Author)
    Summary of the three-step process to get the first patch embeddings:
    1. Gathering and pooling the bytes for each respective patch.
    2. Concatenating the patches into a single tensor.
    3. Projecting the patch embedding tensor to the global dimension.
    A minimal code sketch of this initialization follows below.
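The sketch below walks through that gather-pool-project initialization for a single sequence; the dimensions, the use of max-pooling, and the absence of padding logic are simplifications.

```python
import torch
import torch.nn as nn

h_e, h_g = 256, 1024                       # assumed local and global hidden sizes
local_to_global = nn.Linear(h_e, h_g)      # the dedicated projection layer

def init_patch_embeddings(H0, patch_starts):
    """H0: [N_bytes, h_e] byte states for one sequence; patch_starts: patch start
    indices from Step 1. Returns [J, h_g] initial patch embeddings."""
    edges = list(patch_starts) + [H0.shape[0]]
    pooled = []
    for s, e in zip(edges[:-1], edges[1:]):
        patch_summary, _ = H0[s:e].max(dim=0)   # max-pool over the patch's bytes -> [h_e]
        pooled.append(patch_summary)
    P0 = torch.stack(pooled)                    # [J, h_e]
    return local_to_global(P0)                  # [J, h_g], ready to become queries
```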

The patch representations, obtained either from the previous cross-attention block's output or initialized from scratch, are then fed into a linear projection layer to form the queries.

    • Input shape: (B, J, h_g)
    • Output shape: (B, J, d_a), where d_a is the "attention dimension".

Keys and Values: These are derived from the byte representations H_l from Step 3a. They are projected from dimension h_e to an intermediate attention dimension d_a via independent linear layers:

    (Source: Author)
    Projection of the Self-Attention output from Step 3a into Keys and Values.
    (Source: Author)
    Overview of the information flow in the Local Encoder
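Putting the pieces of Step 3b together, here is a hedged sketch of a single cross-attention layer in which patches query bytes; the sizes are arbitrary, the multi-head structure is collapsed into one head, and the masking that restricts each patch to its own bytes is omitted for brevity.

```python
import torch
import torch.nn as nn

h_e, h_g, d_a = 256, 1024, 512
W_q = nn.Linear(h_g, d_a)   # queries from the patch representations
W_k = nn.Linear(h_e, d_a)   # keys from the byte representations H_l
W_v = nn.Linear(h_e, d_a)   # values from the byte representations H_l
W_o = nn.Linear(d_a, h_g)   # map the attended result back to the global width

def patch_over_byte_cross_attention(P, H):
    """P: [B, J, h_g] patches, H: [B, N_bytes, h_e] bytes -> updated patches [B, J, h_g]."""
    Q, K, V = W_q(P), W_k(H), W_v(H)
    scores = Q @ K.transpose(-2, -1) / d_a ** 0.5   # [B, J, N_bytes]
    attn = scores.softmax(dim=-1)
    return P + W_o(attn @ V)                        # residual update of the patch vectors
```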

2.1.2 The Latent Global Transformer

The sequence of patch representations generated by the Local Encoder is passed to the Latent Global Transformer. This module serves as the primary reasoning engine of the BLT model. It is a standard, high-capacity autoregressive Transformer composed of l_g self-attention layers, where l_g is significantly larger than the number of layers in the local modules.

Operating on patch vectors (shape: [B, J, h_g]), this transformer performs full self-attention across all patches, enabling it to model complex, long-range dependencies efficiently. Its sole function is to predict the representation of the next patch, o_j (shape: [B, 1, h_g]), based on all preceding ones. The output is a sequence of predicted patch vectors, O_j (shape: [B, J, h_g]), which encode the model's high-level predictions.

    (Source: Author)
    o_j is the patch that contains the information for the next prediction
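Since this module is architecturally a plain causal Transformer over patch vectors, a few lines of standard PyTorch convey the idea; the layer count and widths here are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

B, J, h_g, n_heads, l_g = 2, 32, 1024, 16, 12   # illustrative sizes only
layer = nn.TransformerEncoderLayer(d_model=h_g, nhead=n_heads, batch_first=True)
global_transformer = nn.TransformerEncoder(layer, num_layers=l_g)

P = torch.randn(B, J, h_g)                                   # patch sequence from the Local Encoder
causal = nn.Transformer.generate_square_subsequent_mask(J)   # each patch attends only to its predecessors
O = global_transformer(P, mask=causal)                       # [B, J, h_g] predicted patch vectors
```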

2.1.3 The Local Decoder

The final architectural component is the Local Decoder, a lightweight Transformer that decodes the predicted patch vector o_j, the last vector from the global model's output O_j, back into a sequence of raw bytes. It operates autoregressively, generating one byte at a time.

The generation process, designed to be the inverse of the encoder, begins with the hidden state of the last byte in the encoder's output, H_l. Then, for each subsequent byte generated by the decoder (d'_k), in typical autoregressive fashion, it uses the predicted byte's hidden state as the input to guide the generation.

Cross-Attention: The last byte's state from the encoder's output, H_l[:, -1, :] (acting as the query, with shape [B, 1, h_e]), attends to the target patch vector o_j (acting as Key and Value). This step injects the high-level semantic instruction from the patch into the byte stream.

The query vectors are projected to an attention dimension d_a, while the patch vector is projected to create the key and value. This alignment ensures the generated bytes are contextually relevant to the global prediction.

    (Source: Author)
    The general equations, which encapsulate what the Query, Key, and Value are.
    d'_k: the (k+1)-th predicted byte's hidden state from the decoder.

Local Self-Attention: The resulting patch-aware byte representations are then processed by a causal self-attention mechanism. This allows the model to consider the sequence of bytes already generated within the current patch, enforcing local sequential coherence and correct character ordering.

After passing through all l_d layers, each comprising the two stages above, the hidden state of the last byte in the sequence is projected by a final linear layer into a 256-dimensional logit vector. A softmax function converts these logits into a probability distribution over the byte vocabulary, from which the next byte is sampled. This new byte is then embedded and appended to the input sequence for the next generation step, continuing until the patch is fully decoded.

    (Source: Author)
    Overview of the information flow in the Local Decoder
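The sketch below condenses one decoding step, patch-conditioned cross-attention followed by the byte head, into a few lines; the self-attention sub-block, the l_d-layer stack, and proper multi-head handling are omitted, and all sizes are assumptions.

```python
import torch
import torch.nn as nn

h_e, h_g, d_a = 256, 1024, 512
W_q = nn.Linear(h_e, d_a)          # queries from the byte hidden states
W_k = nn.Linear(h_g, d_a)          # key from the target patch vector o_j
W_v = nn.Linear(h_g, d_a)          # value from the target patch vector o_j
W_o = nn.Linear(d_a, h_e)
byte_head = nn.Linear(h_e, 256)    # logits over the 256 possible next bytes

def decode_next_byte(byte_states, o_j):
    """byte_states: [B, T, h_e] states of the bytes produced so far; o_j: [B, 1, h_g]."""
    Q, K, V = W_q(byte_states), W_k(o_j), W_v(o_j)
    attn = (Q @ K.transpose(-2, -1) / d_a ** 0.5).softmax(dim=-1)   # [B, T, 1]
    patch_aware = byte_states + W_o(attn @ V)                       # inject the patch information
    logits = byte_head(patch_aware[:, -1, :])                       # [B, 256] next-byte logits
    return torch.distributions.Categorical(logits=logits).sample()  # sampled next byte ID
```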

3. The Verdict: Bytes Are Better Than Tokens!

The Byte Latent Transformer could genuinely be a replacement for regular tokenization-based Transformers at scale. Here are a few convincing reasons for that argument:

    1. Byte-Level Models Can Match Token-Based Ones.
    One of the main contributions of this work is showing that byte-level models can, for the first time, match the scaling behavior of state-of-the-art token-based architectures such as LLaMA 3 (Grattafiori et al. 2024)2. When trained under compute-optimal regimes, the Byte Latent Transformer (BLT) exhibits performance scaling trends comparable to those of models using byte pair encoding (BPE). This finding challenges the long-standing assumption that byte-level processing is inherently inefficient, showing instead that, with the right architectural design, tokenizer-free models have a real shot.

    (Source: Adapted from Pagnoni et al. 2024, Figure 6)
    BLT showing competitive BPB (the perplexity equivalent for byte models) and scaling laws similar to those of the tokenizer-based LLaMA models

    2. A New Scaling Dimension: Trading Patch Size for Model Size.
    The BLT architecture decouples model size from sequence length in a way that token-based models cannot. By dynamically grouping bytes into patches, BLT can use longer average patches to save on compute. This saved compute can be reallocated to increase the size and capacity of the main Latent Global Transformer while keeping the total inference cost (FLOPs) constant. The paper shows this new trade-off is highly beneficial: larger models operating on longer patches consistently outperform smaller models operating on shorter tokens/patches for a fixed inference budget.
    This means you can have a larger and more capable model at no extra compute cost!

    (Source: Adapted from Pagnoni et al. 2024, Figure 1)
    The steeper scaling curves of the larger BLT models allow them to surpass the performance of the token-based LLaMA models after the crossover point.

    3. Subword Awareness Through Byte-Level Modeling
    By processing raw bytes directly, BLT avoids the information loss typically introduced by tokenization, gaining access to the internal structure of words: their spelling, morphology, and character-level composition. This results in a heightened sensitivity to subword patterns, which the model demonstrates across several benchmarks.
    On CUTE (Character-level Understanding and Text Evaluation) (Edman et al., 2024)3, BLT excels at tasks involving fine-grained edits like character swaps or substitutions, reaching near-perfect accuracy on spelling tasks where models like LLaMA 3 fail entirely.
    Similarly, on noised HellaSwag (Zellers et al., 2019)4, where inputs are perturbed with typos and case variations, BLT retains its reasoning ability far more effectively than token-based models. These results point to an inherent robustness in BLT that token-based models cannot attain even with significantly more data.

    (Source: Pagnoni et al. 2024, Table 3)
    The model's direct byte-level processing leads to large gains on character manipulation (CUTE) and noise robustness (HellaSwag Noise Avg.), tasks that challenge token-based architectures.

    4. BLT Shows Stronger Performance on Low-Resource Languages.
    Fixed tokenizers, typically trained on predominantly English or other high-resource language data, can be inefficient and inequitable for low-resource languages, often breaking words down into individual bytes (a phenomenon known as "byte-fallback"). Because BLT is inherently byte-based, it treats all languages equally from the start. The results show this leads to improved machine-translation performance, particularly for languages with scripts and morphologies that are poorly represented in standard BPE vocabularies.

    (Source: Pagnoni et al. 2024, Table 4)
    Machine translation performance on the FLORES-101 benchmark (Goyal et al., 2022)5: comparable on high-resource languages, but superior on low-resource languages, outperforming the LLaMA 3 model.

    5. Dynamic Allocation of Compute: Not Every Word Is Equally Deserving
    A key strength of the BLT architecture lies in its ability to dynamically allocate computation based on input complexity. Unlike traditional models that spend a fixed amount of compute per token, treating simple words like "the" and complex ones like "antidisestablishmentarianism" at equal cost, BLT ties its computational effort to the structure of its learned patches. The high-capacity Global Transformer operates only on patches, allowing BLT to form longer patches over predictable, low-complexity sequences and shorter patches over regions requiring deeper reasoning. This lets the model focus its most powerful components where they are needed most, while offloading routine byte-level decoding to the lighter local decoder, yielding a far more efficient and adaptive allocation of resources.


4. Final Thoughts and Conclusion

For me, what makes BLT exciting isn't just the benchmarks or the novelties; it's the idea that a model can move past the superficial wrappers we call "languages" (English, Japanese, even Python) and start learning directly from the raw bytes, the fundamental substrate of all communication. I love that. A model that doesn't rely on a fixed vocabulary, but instead learns structure from the ground up? That feels like a real step toward something more universal.

Of course, something this different won't be embraced with open arms overnight. Tokenizers have become baked into everything: our models, our tools, our intuition. Ditching them means rethinking the very foundation of the entire AI ecosystem. But the upside here is hard to ignore. Maybe, rather than the whole architecture, we will see some of its features integrated into the new systems of the future.


    5. References

    [1] Pagnoni, Artidoro, et al. “Byte latent transformer: Patches scale better than tokens.” arXiv preprint arXiv:2412.09871 (2024).
    [2] Grattafiori, Aaron, et al. “The llama 3 herd of models.” arXiv preprint arXiv:2407.21783 (2024).
    [3] Edman, Lukas, Helmut Schmid, and Alexander Fraser. “CUTE: Measuring LLMs’ Understanding of Their Tokens.” arXiv preprint arXiv:2409.15452 (2024).
    [4] Zellers, Rowan, et al. “Hellaswag: Can a machine really finish your sentence?.” arXiv preprint arXiv:1905.07830 (2019).
    [5] Goyal, Naman, et al. "The Flores-101 evaluation benchmark for low-resource and multilingual machine translation." Transactions of the Association for Computational Linguistics 10 (2022): 522-538.


