The Geometry of Thought Part 2 - Attention Is All You Have

August 23, 2025

What if attention is the only currency you truly have?


Table of Contents

  • Chapter 0: Intro
  • Chapter 1: Embeddings Are Scary
  • Chapter 2: Attention Is All You Have

Chapter 2: Attention Is All You Have

When I first started implementing the self-attention layer, I was struck by how dumb and genius it is.

For those who don’t know, self-attention enables each token (roughly, a word or word piece) to dynamically “pay attention” to every other token, regardless of distance (setting aside sliding-window variants), so the model can capture relationships across the entire sequence.

At its core, each token’s embedding is projected into three separate vectors — Query (Q), Key (K), and Value (V) — via learned linear transformations:

  • Query: What I’m looking for.
  • Key: What I can offer.
  • Value: The content to be retrieved.

On one hand, it felt almost embarrassingly simple. Multiply queries and keys, apply a softmax, weight the values, sum it up. That’s it.

It’s mechanically trivial: dot products, scaling, normalization. And yet… when layered and repeated, this mechanism generates the illusion of thought. Syntax, semantics, even reasoning emerge from nothing but vector math.
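To make that mechanical triviality concrete, here is a minimal sketch of a single attention head in NumPy. The shapes, the random projection matrices, and the function names are illustrative only, not taken from any particular implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the max for numerical stability, then normalize so each row sums to 1.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over token embeddings X (seq_len x d_model)."""
    Q = X @ W_q                          # queries: what each token is looking for
    K = X @ W_k                          # keys: what each token can offer
    V = X @ W_v                          # values: the content to be retrieved
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # dot products, scaled
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights          # weighted sum of the values

# Toy run: 4 tokens, model dim 8, head dim 4 (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(weights.shape, weights.sum(axis=-1))  # (4, 4), and every row sums to 1
```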

That’s the genius of self-attention: it doesn’t hard-code rules. It discovers them.
It feels dumb because it encodes no grammar, no linguistic priors. But brilliant because, from scratch, it learns all of them.

The Softmax Revelation

One piece in particular stood out to me: the softmax.

In a Transformer, attention is finite. The softmax forces each token’s attention weights to sum to one. Every token has a fixed budget: it cannot attend strongly to everything at once, because giving more weight to one word necessarily takes weight away from the others. It must decide where to look and where to look away.
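A tiny numerical sketch (the scores here are made up) shows what that budget means in practice: pushing one score up does not create new attention, it only takes share away from everything else.

```python
import numpy as np

def softmax(x):
    x = x - x.max()             # numerical stability
    e = np.exp(x)
    return e / e.sum()

scores = np.array([1.0, 1.0, 1.0, 1.0])
print(softmax(scores))          # [0.25 0.25 0.25 0.25] -- an evenly spent budget

scores[0] += 3.0                # one token becomes far more relevant
print(softmax(scores))          # ~[0.87 0.04 0.04 0.04] -- the rest get squeezed
print(softmax(scores).sum())    # still 1.0 (up to floating point): the budget never grows
```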

And in that moment, it occurred to me: humans are no different.

Attention is all we have. Our attention is our scarcest resource.
What we choose to notice shapes our world. What we ignore quietly disappears.

Imagine two friends, A and B. They both open TikTok. They both see the same feed.

A scrolls through quickly, chasing that little dopamine hit, then swipes to the next video.

B watches the same clip, but pauses. How did this creator get popular? What’s their strategy? Oh, FPV drones — is this a growing trend? Are others succeeding with this too? What does it say about shifting market preferences?

In Transformer terms: the keys (what the feed offers) and values (the content) are identical. But the queries are different.

  • A’s query: “Entertain me.”
  • B’s query: “Teach me something about the world.”

And with that, their outcomes diverge.
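Here is a toy sketch of that divergence (the vectors are invented for illustration, not learned from anything): the keys and values are identical for both viewers, only the query changes, and with it the attention distribution and the content that gets retrieved.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

# The shared feed: the same keys (what each clip offers) and values (its content) for A and B.
K = np.array([[1.0, 0.0],    # clip 1 offers "a quick laugh"
              [0.0, 1.0],    # clip 2 offers "creator strategy"
              [0.5, 0.5]])   # clip 3 offers "a trend signal"
V = np.eye(3)                # one-hot values, so the output shows which clips got read

q_A = np.array([3.0, 0.0])   # A's query: "entertain me"
q_B = np.array([0.0, 3.0])   # B's query: "teach me something about the world"

for name, q in [("A", q_A), ("B", q_B)]:
    weights = softmax(q @ K.T)              # identical K and V; only the query differs
    print(name, weights.round(2), (weights @ V).round(2))
# A puts ~0.79 of its attention on the "quick laugh" clip; B puts ~0.79 on "creator strategy".
```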

A’s attention collapses into noise: novelty without retention, stimulation without growth, amusing themselves to death.
B’s attention sharpens into signal: curiosity fueling insight, insight fueling action, action attracting wealth.

A remains a consumer. B becomes a pattern-spotter, a maker.
Over time, the gap compounds, widening with every scroll.

This is the hidden cost of misallocation. It’s not that A and B had different feeds. It’s that their queries — the questions they brought to the world — led them down entirely different trajectories.

That invisible allocation of attention quietly reshapes their lives.

The Architecture of Attention

This raises a sharper question: how much of attention is innate, and how much is trainable?

For LLMs, the answer depends on two things:

  • Architecture = standard multi-head self-attention, grouped-query attention, or multi-head latent attention.
  • Weights = learned from data, refined with every token.

For humans, the analogy is unsettling.

  • Architecture = genetics, working-memory span, raw cognitive filters.
  • Weights = experience, mentors, culture, deliberate practice.

Some of us may be born with sharper filters for signal vs noise. But most of what matters comes from the “training data” we feed ourselves: the books we read, the conversations we have, the habits we cultivate, the people we surround ourselves with.

If intelligence isn’t about raw horsepower, but about what you attend to, then maybe success is less about “who is smarter” and more about “who allocates attention better.”

The Final Scarcity

If attention is all you have, then your life is nothing more than the sum of what you attended to — and the infinity you ignored.

Misallocated attention is not just wasted time. It is lost potential. It is the silent gap between who you are and who you could have been.

What if most human suffering isn’t from lack of resources, but from chronic misallocation of attention? Doomscrolling instead of building. Resentment instead of curiosity. Making war instead of making love.

And it doesn’t stop at the individual. Entire societies run on the same principle:

  • When media focuses on outrage over understanding, we get polarization.
  • When politics attends to short-term polls instead of long-term problems, we get stagnation.
  • When culture rewards virality over wisdom, we trade depth for dopamine.

In that sense, institutions, governments, and companies are our collective softmax. They decide what we, as a civilization, attend to — and what quietly falls away.

So the real question isn’t just: Where is your attention going?
It’s: Who is deciding for you?

Can you confidently say it’s you?

Algorithms already decide what enters our “context window” each day. If attention is the scarce currency, then whoever controls its distribution doesn’t just shape our feed. They shape our future.

So the question becomes: are you allocating your attention… or has TikTok’s softmax been doing it for you?
