
The Math Nobody Taught You Behind Every AI
AI isn't magic — it's linear algebra, calculus, and probability. Here's the math behind every LLM and AI model, explained in plain human language.
In 2017, eight Google researchers published a 15-page paper. Nobody knew it would become the foundation for GPT-4, Llama, Grok, and pretty much every LLM we talk about today.

"Attention Is All You Need" — the title sounds almost philosophical. Turns out, it's one of the most consequential sentences ever written in the history of computer science.
I want to walk you through the Transformer architecture — not the textbook version with a thousand Greek symbols, but the version that actually makes sense. Why it was invented, what it actually does, and why it matters far beyond academia.
Fair warning: there's a bit of math in the middle. But I promise it's the good kind — the kind that clicks and makes you go "oh, that's actually elegant."
Imagine you're trying to translate a long sentence. You read word one, process it, then word two, then word three — keeping a running "memory" of everything before. That's basically what RNNs (Recurrent Neural Networks) did. And for a while, it worked okay.
The problem: the further you got into a sentence, the blurrier the earlier context became. It's like playing a game of telephone — things get distorted the longer the chain. Mathematically this showed up as the vanishing gradient problem: as training signals flow backward through many time steps, they shrink toward zero, so the model struggles to learn long-range dependencies. That put a hard ceiling on how good AI could get at language.
LSTMs helped — they added a kind of long-term memory cell — but they were still sequential. You couldn't parallelize the computation. On a GPU with thousands of cores sitting idle, processing one token at a time felt almost insulting.
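To see why recurrence blocks parallelism, here's a minimal toy RNN step in NumPy (the weight shapes and tanh update are illustrative, not any specific production model). Notice that each hidden state depends on the previous one, so the loop must run one token at a time:

```python
import numpy as np

def rnn_forward(tokens, W_h, W_x):
    """Toy RNN: each step needs the previous hidden state,
    so the time loop cannot be parallelized across tokens."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:                      # strictly sequential
        h = np.tanh(W_h @ h + W_x @ x)    # new state depends on old state
    return h
```

A GPU with thousands of cores can do nothing with this loop except wait: step t cannot start until step t−1 finishes.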
June 2017. Vaswani, Shazeer, Parmar, and five colleagues from Google posted their paper, later presented at NeurIPS that December. The central proposal was almost reckless in its simplicity: throw away recurrence entirely. No loops. No sequential processing. Just attention.
The results were immediate and startling. On the WMT 2014 English-to-German translation benchmark, their big model hit 28.4 BLEU — a new state of the art — while training at a fraction of the cost of the previous best models.
From the paper's abstract: "The dominant sequence transduction models… The Transformer allows for significantly more parallelization." That one architectural shift unlocked everything that followed.
Within a year, BERT and GPT had appeared. Within three years, models had billions of parameters. By 2023, we were running trillion-parameter models on clusters of thousands of GPUs. All of it tracing back to that one paper.
- 2017
Attention Is All You Need — the original Transformer, built for translation.
- 2018
BERT & GPT-1 — encoder-only and decoder-only variants emerge. The two main branches of the family tree.
- 2019–2020
T5, GPT-2, ViT — the architecture jumps from text into vision. Image patches treated as tokens. A jaw-dropping generalization.
- 2022–Present
GPT-4, Llama, Grok, Claude — the modern era. Multimodal, massive, and still fundamentally the same architecture underneath.
Let's get into the meat of it. The Transformer has two main parts: an Encoder (understands input) and a Decoder (generates output). Modern LLMs often use just the decoder half — that's the GPT family.
Before anything else, words get converted to vectors — lists of numbers called embeddings. Think of it as translating language into coordinates in a high-dimensional space where similar words live close together. Then we add positional encodings so the model knows word 1 came before word 2 — since there's no sequential processing to enforce that naturally.
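The original paper's positional encodings use sine and cosine waves at geometrically spaced frequencies, so every position gets a unique fingerprint. A short NumPy sketch (many modern models use learned or rotary embeddings instead; this is the classic sinusoidal version):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings: even dims get sin,
    odd dims get cos, at frequencies 10000^(-2i/d_model)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]      # (1, d_model/2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# Added to the token embeddings before the first attention layer:
# x = embeddings + positional_encoding(seq_len, d_model)
```

Because the encoding is added (not concatenated), position information rides along inside the same vectors the attention layers already process.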
Here's the real magic. For every word in your input, the model asks: "How relevant is every other word to understanding this one?" It does this by computing three vectors for each token — a Query, a Key, and a Value.
```
Attention(Q, K, V) = softmax( QKᵀ / √d_k ) · V
// Q = what I'm looking for · K = what each word offers · V = the actual content
```
The dot product of Q and K measures similarity. Divide by √d_k to keep things numerically stable. Softmax turns the scores into weights (0 to 1). Multiply by V to get a weighted blend of all token content. Do this for every single word, all at once, in parallel.
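Those four steps fit in a few lines of NumPy. This is a bare single-head sketch — no masking, no batching, and it assumes Q, K, V have already been produced by learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted blend of value vectors
```

The two matrix multiplications are exactly what GPUs are built for, which is why the whole sequence can be processed in one shot.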
The original model then does this eight times in parallel — each "head" attending to different aspects of the language. One might focus on syntax. Another on subject-verb agreement. Another on coreference (which "it" refers to which earlier noun). This is Multi-Head Attention, and it's why Transformers are so remarkably good at language.
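The shape logic of multi-head attention is simple: split the model dimension into per-head subspaces, attend in each, then concatenate. A toy version (real models apply separate learned Q/K/V projections per head; here each head just attends over its own slice of the input, which keeps the sketch short):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def multi_head_attention(X, num_heads=8):
    """Split d_model into num_heads subspaces, attend independently
    in each, and concatenate the results back to (seq_len, d_model)."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    heads = [attention(X[:, h*d_k:(h+1)*d_k],
                       X[:, h*d_k:(h+1)*d_k],
                       X[:, h*d_k:(h+1)*d_k])
             for h in range(num_heads)]
    return np.concatenate(heads, axis=-1)
```

Each head sees only its own low-dimensional slice, which is what lets different heads specialize on different relationships.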

The original Transformer was built for translation — you have a source sentence and generate a target. But people quickly realized you could tweak the architecture for different tasks.

Post-2022, things accelerated even further: FlashAttention made training dramatically faster by optimizing memory access patterns. Grouped-Query Attention (GQA) made inference cheaper. And multimodal models like GPT-4o started combining vision and language in the same architecture.
It's easy to treat this as a technical curiosity. It's not. The Transformer is an economic force.
The AI market is projected to hit $1.8 trillion by 2030. Nearly all of that value is built on Transformer-based models. The GPU shortage we're living through right now? Caused almost entirely by demand for Transformer training at scale. The US-China chip restrictions? Geopolitical fallout of competing to build bigger Transformers faster.
In India specifically, the ripple effects are very real. The market for prompt engineers, fine-tuning specialists, and RAG architects — all of that is downstream of this one architecture. Startups building on top of LLMs (including what we do at Manas AI) exist because Transformers made powerful language models deployable via API.
Worth noting
AlphaFold — the DeepMind system that cracked protein structure prediction (a 50-year-old grand challenge in biology) — is built on attention, the Transformer's core mechanism. GitHub Copilot is a Transformer. The model behind Google Translate is a Transformer. This architecture has genuinely changed multiple industries simultaneously.
Transformers aren't perfect. Two problems stand out:
First, hallucinations. Transformers generate statistically likely text — they don't "look things up." When they don't know something, they often make up something plausible-sounding. This is a fundamental property of how they work, not a bug that'll just get patched.
Second, quadratic complexity. The attention mechanism compares every token to every other token — so doubling the context length quadruples the compute. That's why context windows were stuck at 4K tokens for years, and why getting to 1M tokens is still expensive and slow.
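The arithmetic makes the quadratic wall concrete. The attention score matrix is n × n, so its size explodes with context length (the numbers below are illustrative, assuming fp16 scores at 2 bytes each, per head per layer):

```python
def attn_matrix_bytes(n_tokens, bytes_per_score=2):
    """Memory for one n x n attention score matrix (fp16 assumed)."""
    return n_tokens * n_tokens * bytes_per_score

for n in (4_096, 32_768, 1_000_000):
    gb = attn_matrix_bytes(n) / 1e9
    print(f"{n:>9} tokens -> {gb:,.3f} GB per head per layer")
```

Going from a 4K to a 1M context multiplies that matrix by roughly 60,000x, which is why tricks like FlashAttention (which avoids materializing the full matrix) matter so much.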
The responses are genuinely interesting. State Space Models (SSMs) like Mamba handle sequences in linear time, not quadratic. They might replace Transformers for some workloads, or more likely, hybrid architectures will combine the best of both.
There's also active research into Mixture-of-Experts — models where only a fraction of parameters activate for any given input. GPT-4 is reportedly an MoE model. It's how you get GPT-4 quality without GPT-4 compute costs.
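The MoE idea fits in a few lines: a gate scores all experts, but only the top-k actually execute, so compute stays sparse while total parameter count grows. A toy routing sketch (the expert functions and gating weights here are hypothetical stand-ins for learned networks):

```python
import numpy as np

def moe_forward(x, experts, gate_W, top_k=2):
    """Toy Mixture-of-Experts routing: score all experts, run only
    the top_k, and mix their outputs with softmax-renormalized weights."""
    logits = gate_W @ x                        # one score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the chosen experts
    w = np.exp(logits[top])
    w /= w.sum()                               # softmax over the survivors
    return sum(wi * experts[i](x) for wi, i in zip(w, top))
```

With, say, 8 experts and top_k=2, only a quarter of the expert parameters run per token — that's the "quality without the compute cost" trade.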
Transformers aren't the endgame. They're the scaffold we're using to figure out what the endgame even looks like.
Quantum Transformers remain speculative — but if quantum hardware matures, the speedups could be dramatic. RLHF (Reinforcement Learning from Human Feedback) is already layered on top of Transformers to make them safer and more aligned with what humans actually want.
So that's the Transformer. Born in 2017, still dominant in 2026, quietly running inside almost every meaningful AI application you've used this week.
If you want to go deeper, I'd genuinely recommend reading the original paper — it's surprisingly accessible. The math is cleaner than most textbooks, and you'll see why researchers sometimes read it like a piece of literature.
At Manas AI, we build on top of these models — RAG systems, AI agents, automation pipelines. Understanding the architecture underneath is what separates people who use LLMs from people who know how to coax the best out of them. That's the difference we try to bring to every project we take on.