And then, just as you think you’ve grasped the bleeding edge of large language models—the multimodal capabilities, the ever-expanding context windows—someone whips out a terminal and starts talking about characters.
Characters. As in, individual letters, spaces, punctuation. You know, the stuff we learned to string together into words back in kindergarten. It’s a bit like attending a Formula 1 race and finding the chief engineer meticulously explaining the combustion cycle of a single spark plug. But that’s exactly where this exploration into building an LLM from scratch begins: with the absolute bedrock of language, the character.
Why does this matter? Because the glossy marketing surrounding AI models often glosses over the fundamental mechanics. We’re told about ‘next-gen’ architectures and ‘unprecedented’ capabilities, but rarely do we get a clear, unvarnished look at the actual building blocks. This piece, however, peels back that layer, reminding us that before BPE and WordPiece dominated the scene, simpler, perhaps more robust, methods were at play.
The Ghost of Tokenizers Past
Before the big AI boom of 2018, before ‘Attention is All You Need’ became the mantra, NLP systems were often rule-based or relied on statistical n-grams. Think spell checkers, basic language detection. It was functional, if clunky. Then came the sub-word tokenizers – Byte Pair Encoding (BPE), WordPiece – which became the darlings of large-scale models, largely ditching character-level approaches. But this article reminds us that character tokenizers have some serious, often overlooked, advantages.
Since character-level tokenizers use each character, and if they were trained on diverse text, there would be no out-of-vocabulary problem, as all the characters would already be in the training set and probability matrix.
That’s the key, isn’t it? The dreaded “Out of Vocabulary” (OOV) problem. With character-level tokenization, if your model has seen every character that makes up a language, it can technically handle any word, even if it’s never encountered that specific sequence before. It’s language agnostic, too—no need to rewrite code when you switch from English to Hindi. Though, as the author notes, handling complex character combinations in languages like Hindi still introduces its own set of complexities, it’s a fundamentally different beast than a word-based model choking on an unknown term.
And typos? Forget about it. A word-level tokenizer might balk at ‘helo,’ marking it as an OOV. A character-level one? It just sees ‘h’, ‘e’, ‘l’, ‘o’ and, with a bit of contextual understanding, can likely infer the intended word. It’s like the difference between a pedantic librarian who only recognizes perfectly alphabetized entries and a seasoned bookworm who can find a title even if the spine is smudged.
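To make that concrete, here is a minimal sketch of my own (the vocabularies and function names are invented for illustration, not taken from the article) showing a word-level lookup tripping over the typo while a character-level one sails through:

# Hypothetical illustration: a tiny word-level vocabulary vs. a character-level one.
word_vocab = {"hello": 0, "world": 1}
char_vocab = {ch: i for i, ch in enumerate("helowrd ")}   # every character seen in training

def encode_words(text, vocab):
    # Word-level: any unseen word collapses to an OOV marker.
    return [vocab.get(w, "<OOV>") for w in text.split()]

def encode_chars(text, vocab):
    # Character-level: any string built from known characters encodes cleanly.
    return [vocab[ch] for ch in text]

print(encode_words("helo world", word_vocab))   # ['<OOV>', 1] -- the typo defeats word lookup
print(encode_chars("helo world", char_vocab))   # a full list of indices, nothing falls out of vocabulary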
So, What’s Under the Hood?
Let’s get down to brass tacks. The author provides a simple corpus: “in the midst of chaos, there is also opportunity. the journey of learning is never easy, but it is always rewarding…”.
This is raw text, and the machine can do nothing with it directly. So, step one: assign every unique character—letters, spaces, punctuation—a unique numerical index. This is done with simple Python dictionaries, mapping char_to_idx and idx_to_char.
# text holds the raw corpus shown above
char = sorted(set(text))            # every unique character, in a stable order
char_to_idx = {}
for i, ch in enumerate(char):
    char_to_idx[ch] = i             # character -> integer index
idx_to_char = {}
for ch, i in char_to_idx.items():
    idx_to_char[i] = ch             # integer index -> character
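As a quick sanity check (my own addition, assuming text is the corpus quoted above), a word can be round-tripped through the two dictionaries:

encoded = [char_to_idx[ch] for ch in "chaos"]        # a short list of integer indices
decoded = "".join(idx_to_char[i] for i in encoded)   # reassembles the original string
print(encoded, decoded)                              # e.g. [...] chaos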
Next, we build a matrix tracking the frequency of adjacent character pairs, or bigrams. If ‘a’ is followed by ‘b’ a thousand times, the bigram_counts matrix at [index_of_a, index_of_b] will reflect that. This gives us a statistical map of how characters typically follow each other in the training data.
import numpy as np

bigram_counts = np.zeros((len(char), len(char)))   # rows: current char, columns: next char
tokens = [char_to_idx[ch] for ch in text]          # the corpus as a sequence of indices
for i in range(len(tokens) - 1):
    bigram_counts[tokens[i], tokens[i + 1]] += 1   # tally each adjacent pair
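If you want to see what the matrix has actually captured, a small illustrative peek (my own addition, reusing bigram_counts and idx_to_char from above) pulls out the single most frequent pair:

# Find the most frequent adjacent character pair in the corpus.
top = np.unravel_index(np.argmax(bigram_counts), bigram_counts.shape)
print(idx_to_char[int(top[0])], "->", idx_to_char[int(top[1])], int(bigram_counts[top]))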
Now, the tricky part. The author initially uses a SoftMax function to normalize these counts, squishing them into the [0, 1] range. But then a crucial correction is made. SoftMax isn’t ideal for raw frequency counts: exponentiating them distorts the relative likelihoods, inflating the most frequent pair and handing never-seen pairs non-zero probability. Plain normalization – dividing each count by the sum of its row – is actually the more appropriate mathematical step here, because it turns the counts directly into the observed conditional probabilities.
def softmax(x):
    exp_x = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return exp_x / exp_x.sum()

bigram_probs = np.zeros_like(bigram_counts)
for i in range(len(char)):
    row = bigram_counts[i]
    if row.sum() > 0:
        bigram_probs[i] = softmax(row)
        # The corrected approach: plain row normalization
        # bigram_probs[i] = bigram_counts[i] / bigram_counts[i].sum()
This difference, though subtle in code, is significant for genuine understanding. It’s a perfect example of how easily a concept can be misapplied, even by those building these systems.
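A toy row of counts makes that concrete. This is my own illustration, reusing the softmax defined above: one frequent pair, one rare pair, and one pair never seen in training.

row = np.array([3.0, 1.0, 0.0])   # hypothetical bigram counts for a single row
print(softmax(row))               # ~[0.84, 0.11, 0.04]: the top pair is inflated, the unseen pair gets probability mass
print(row / row.sum())            # [0.75, 0.25, 0.0]: the observed relative frequencies, unchanged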
Who’s Making Money Here?
This is where the real skepticism kicks in. Building an LLM from scratch using character-level tokenization is an educational exercise, a brilliant way to understand fundamentals. But is anyone actually deploying massive, production-grade LLMs this way today? Unlikely. The computational cost and the sheer complexity of capturing nuanced meaning at the character level for vast, diverse datasets make it impractical for the headline-grabbing models we see from OpenAI, Google, or Anthropic.
Their billions are made by scaling sub-word tokenizers, leveraging massive GPU farms, and selling access to models that can perform complex tasks. Character-level tokenization, while elegant and robust in its own right, remains in the realm of academic exploration and smaller, specialized applications. It’s a foundational pillar, not the skyscraper itself. The money is in the scale, the proprietary datasets, and the API subscriptions, not in the spark plug, however critical that spark plug may be.