The Geometry of Thought
August 18, 2025
My journey and reflections from building an LLM from scratch.
Table of Contents
- Chapter 0: Intro
- Chapter 1: Embeddings Are Scary
- Chapter 2: Attention Is All You Have
- Chapter 3: TBA
Chapter 0: Intro
I’ve always been fascinated by machine learning — its ability to process, to recognize, to almost understand.
When ChatGPT first came out, I stayed up all night experimenting with it. I asked absurd questions. I tried to jailbreak it. I tested its limits by bouncing between the dumbest and the smartest prompts I could think of.
Somewhere in the middle of that late-night frenzy, I had a realization: This wasn’t just a neat tool.
This was power. Profound power.
For the first time in human history, Intelligence itself was about to become a cheap resource at our fingertips.
And I thought: if that’s true, I need to understand it at the deepest level.
That’s what led me to build a large language model from scratch.
It isn’t just a toy model, but something that mirrors GPT-2 in structure and scale — even though I didn’t have the luxury of racks of GPUs or a billion-dollar budget. I wanted to walk through the entire process: from raw text, to tokens, to embeddings, to self-attention blocks, to MLPs, to training loops and instruction fine-tuning.
In the end, my model wasn’t fully trained, but that wasn’t the point. The point was the journey — and the surprising lessons that surfaced along the way.
This blog isn’t a tutorial. There are already plenty of excellent resources online for the “how.” Instead, this is the story of my journey: what I built, what surprised me, and the bigger questions that emerged.
Chapter 1: Embeddings Are Scary
One of the first things I did after downloading a small dataset was to tokenize the text, build a vocabulary using Byte Pair Encoding, convert tokens into IDs, and finally transform them into embeddings.
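To make that concrete, here is a minimal sketch of the pipeline. It swaps in tiktoken’s pretrained GPT-2 encoding for brevity (my own run trained the BPE vocabulary on the dataset itself), and the 768-dimensional width is just GPT-2 small’s, a placeholder here:

```python
# Text -> BPE tokens -> token IDs -> embedding vectors.
import tiktoken
import torch

enc = tiktoken.get_encoding("gpt2")          # pretrained GPT-2 byte pair encoding
ids = enc.encode("Embeddings are scary.")    # text -> list of integer token IDs
print(ids)

emb = torch.nn.Embedding(enc.n_vocab, 768)   # a learnable 768-dim vector per token ID
vectors = emb(torch.tensor(ids))             # shape: (num_tokens, 768)
print(vectors.shape)
```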
On the surface, embeddings look simple: a (sub)word turned into a long list of floating-point numbers. Just vectors in a high-dimensional space.
But take a step back. That random-looking sequence of numbers is carrying something extraordinary: the meaning of words, and by extension, the structure of human language, and by extension, human thought.
That realization shook me.
Because embeddings reveal two insights:
- Meaning is relational. Words don’t carry meaning by themselves. They mean what they mean because of the words around them.
- Structure emerges from scale. With enough data, geometry itself begins to reveal hidden order.
You can see this everywhere: from word2vec’s skip-gram to BERT’s masked LM and GPT’s next-token prediction, all the way to T5’s span corruption. Different objectives, same principle: meaning through relationships, structure through scale.
And this geometric structure isn’t triangles or circles. It’s the shape of relationships — where distances map to similarity, directions map to transformations, and clusters form neighborhoods of thought.
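You can poke at this geometry yourself with pretrained word vectors. A quick sketch using gensim’s downloader (the GloVe vectors here are just one convenient choice; exact numbers depend on what you load):

```python
# Distances map to similarity, directions map to transformations.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")   # small GloVe vectors, downloaded on first run

print(glove.similarity("cat", "dog"))        # relatively high: close neighbors
print(glove.similarity("cat", "economics"))  # relatively low: far apart

# The classic analogy: the king - man + woman direction lands near "queen".
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```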
And then the question hit me: if words can be embedded… what else can be embedded?
The answer is: almost anything. Faces are already represented as embeddings in facial recognition systems. Body language can be embedded, powering models that translate text into motion. Even something as elusive as smell has been mapped into vectors, letting researchers predict how an odor will be perceived.
Social media platforms embed people. Not by modeling “friendship” directly, but by following the traces of it: likes, messages, tags, follows. Out of those discrete signals, a continuous geometry of “you” begins to emerge.
Personality, identity, even relationships can all be approximated as vectors — not because they are inherently discrete, but because our actions and choices generate discrete signals that can be embedded.
So why stop there? Could memory itself be embedded? Could grief, love, identity be flattened into vectors?
Think about your first heartbreak. If we could capture every trace of brain activity — every signal, every pattern — then by the same logic we’ve seen with language, meaning through relationships and structure through scale, wouldn’t that memory be representable as a vector too?
The only barrier seems to be access: we don’t yet know how to capture those inner states at high enough fidelity.
But imagine if we could.
Imagine sending a childhood trauma to your therapist the way we now pass embeddings between models. Imagine advertising not as a billboard on a screen, but as a prompt injected directly into your stream of thought.
If memories become portable, they become copyable. If they’re copyable, they’re ownable. But by whom? Your memories will become commodities that can be shared, sold, modified and even stolen.
And here’s an even scarier thought: embeddings are compressions. They preserve relationships, but they don’t capture the full details. They can reveal patterns we couldn’t see otherwise, but they can never carry the full meaning of the original.
Your first heartbreak could collapse into the same neighborhood as a thousand other heartbreaks. Just another dot in that latent space. In that averaging, something unique — something you — is lost.
So we come to the most interesting question of all:
If everything about you can be embedded, what remains that is uniquely you?
Chapter 2: Attention Is All You Have
When I first started implementing the self-attention layer, I was struck by how dumb and genius it is.
For those who don’t know, self-attention lets each token (e.g., a word) dynamically “pay attention” to every other token, regardless of distance (setting aside sliding-window variants), allowing the model to learn relationships across the entire sentence.
At its core, each token’s embedding is projected into three separate vectors — Query (Q), Key (K), and Value (V) — via learned linear transformations:
- Query: What I’m looking for.
- Key: What I can offer.
- Value: The content to be retrieved.
On one hand, it felt almost embarrassingly simple. Multiply queries and keys, apply a softmax, weight the values, sum it up. That’s it.
It’s mechanically trivial: dot products, scaling, normalization. And yet… when layered and repeated, this mechanism generates the illusion of thought. Syntax, semantics, even reasoning emerge from nothing but vector math.
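Here is roughly what that looks like in code: a single-head sketch with illustrative dimensions. A real GPT block adds a causal mask, multiple heads, and dropout on top of this.

```python
# Scaled dot-product self-attention: project to Q, K, V, score every pair,
# softmax the scores, then take a weighted sum of the values.
import torch
import torch.nn.functional as F

d_model, d_head = 768, 64
x = torch.randn(10, d_model)                         # 10 token embeddings

W_q = torch.nn.Linear(d_model, d_head, bias=False)   # Query projection
W_k = torch.nn.Linear(d_model, d_head, bias=False)   # Key projection
W_v = torch.nn.Linear(d_model, d_head, bias=False)   # Value projection

Q, K, V = W_q(x), W_k(x), W_v(x)                     # (10, 64) each
scores = Q @ K.T / d_head ** 0.5                     # every token scores every other token
weights = F.softmax(scores, dim=-1)                  # each row sums to 1
out = weights @ V                                    # weighted sum of values, (10, 64)
```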
That’s the genius of self-attention: it doesn’t hard-code rules. It discovers them.
It feels dumb because it encodes no grammar, no linguistic priors. But brilliant because, from scratch, it learns all of them.
The Softmax Revelation
One piece in particular stood out to me: the softmax.
In a Transformer, attention is finite. The softmax forces each token’s attention weights to sum to one. Every token has a budget. It cannot attend to every other word equally. It must decide where to look and where to look away.
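You can watch the budget at work with a toy example: raising one token’s score necessarily shrinks everyone else’s share.

```python
# Softmax as a fixed attention budget: the weights always sum to 1.
import torch
import torch.nn.functional as F

scores = torch.tensor([1.0, 1.0, 1.0, 1.0])
print(F.softmax(scores, dim=0))        # 0.25 each: attention spread evenly

scores[0] += 2.0                       # one token becomes far more relevant
print(F.softmax(scores, dim=0))        # its weight grows, the other three shrink
print(F.softmax(scores, dim=0).sum())  # still 1.0; the budget never changes
```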
And in that moment, it occurred to me: humans are no different.
Attention is all we have. Our attention is our scarcest resource.
What we choose to notice shapes our world. What we ignore quietly disappears.
Imagine two friends, A and B. They both open TikTok. They both see the same feed.
A scrolls through quickly, chasing that little dopamine hit, then swiping to the next video.
B watches the same clip, but pauses. How did this creator get popular? What’s their strategy? Oh, FPV drones — is this a growing trend? Are others succeeding with this too? What does it say about shifting market preferences?
In Transformer terms: the keys (what the feed offers) and values (the content) are identical. But the queries are different.
- A’s query: “Entertain me.”
- B’s query: “Teach me something about the world.”
And with that, their outcomes diverge.
A’s attention collapses into noise: novelty without retention, stimulation without growth, amusing themselves to death.
B’s attention sharpens into signal: curiosity fueling insight, insight fueling action, action attracting wealth.
A remains a consumer. B becomes a pattern-spotter, a maker.
Over time, the gap compounds, widening with every scroll.
This is the hidden cost of misallocation. It’s not that A and B had different feeds. It’s that their queries — the questions they brought to the world — led them down entirely different trajectories.
That invisible allocation of attention quietly reshapes their lives.
The Architecture of Attention
This raises a sharper question: how much of attention is innate, and how much is trainable?
For LLMs, the answer depends on two things:
- Architecture = standard self-attention, grouped-query attention, or latent attention.
- Weights = learned from data, refined with every token.
For humans, the analogy is unsettling.
- Architecture = genetics, working-memory span, raw cognitive filters.
- Weights = experience, mentors, culture, deliberate practice.
Some of us may be born with sharper filters for signal vs noise. But most of what matters comes from the “training data” we feed ourselves: the books we read, the conversations we have, the habits we cultivate, the people we surround ourselves with.
If intelligence isn’t about raw horsepower, but about what you attend to, then maybe success is less about “who is smarter” and more about “who allocates attention better.”
The Final Scarcity
If attention is all you have, then your life is nothing more than the sum of what you attended to — and the infinity you ignored.
Misallocated attention is not just wasted time. It is lost potential. It is the silent gap between who you are and who you could have been.
What if most human suffering isn’t from lack of resources, but from chronic misallocation of attention? Doomscrolling instead of building. Resentment instead of curiosity. Making war instead of making love.
And it doesn’t stop at the individual. Entire societies run on the same principle:
- When media focuses on outrage over understanding, we get polarization.
- When politics attends to short-term polls instead of long-term problems, we get stagnation.
- When culture rewards virality over wisdom, we trade depth for dopamine.
In that sense, institutions, governments, and companies are our collective softmax. They decide what we, as a civilization, attend to — and what quietly falls away.
So the real question isn’t just: Where is your attention going?
It’s: Who is deciding for you?
Can you confidently say it’s you?
Algorithms already decide what enters our “context window” each day. If attention is the scarce currency, then whoever controls its distribution doesn’t just shape our feed. They shape our future.
So the question becomes: are you allocating your attention… or has TikTok’s softmax been doing it for you?