
How LLM Architectures Evolved Over Seven Years

Neural network visualization showing interconnected nodes glowing against a dark background, representing AI evolution

Sebastian Raschka has spent years mapping how large language models work, and his latest architecture comparison covers 17 models that collectively represent roughly seven years of transformer evolution. That span takes us from GPT-2 to today's Mixture of Experts setups like DeepSeek V3, and it shows just how dramatically AI engineers have refined the blueprint for machine intelligence.

From One Big Network to Many Small Ones: The MoE Shift

It has been roughly seven years since the original GPT architecture appeared, and looking back at GPT-2 from 2019, you might be surprised at how structurally similar today's models remain at their core. The transformer's central idea was straightforward: let every token in a sequence pay attention to every other token. That self-attention mechanism became the backbone of GPT-style models, and for years the strategy was simply to scale: more layers, more parameters. Scaling up worked, but it exposed a real problem. Every single token you generated required computation across the entire network, burning enormous compute even for simple queries.

The industry needed a different approach, and Mixture of Experts became the answer most open-weight models adopted. Instead of one massive neural network handling every word, you split the model into multiple smaller expert networks. A gating mechanism decides which experts handle each token in real time. DeepSeek V3 pushed this design hard, using 256 expert modules but only activating 9 per token, including one shared expert that always handles common patterns. That means the model can be enormous in total capacity at roughly 671 billion parameters, while keeping active parameters per step at just 37 billion, closer to a much smaller model during inference.
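The selection step at the heart of this design is easy to sketch. The snippet below shows top-k routing with a toy linear router and toy sizes; it mirrors the counts described above (8 routed experts plus 1 shared per token) but is an illustration, not DeepSeek's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 16      # toy hidden size for illustration
n_routed = 256    # routed experts, matching DeepSeek V3's count
top_k = 8         # routed experts activated per token (plus 1 shared)

# Hypothetical router: a single linear layer scoring every expert per token.
W_router = rng.normal(size=(d_model, n_routed))

def route(token_hidden):
    """Pick the top_k routed experts for one token and compute mixing weights."""
    scores = token_hidden @ W_router            # one score per routed expert
    top_idx = np.argsort(scores)[-top_k:]       # indices of the k best experts
    sel = scores[top_idx]
    weights = np.exp(sel - sel.max())           # softmax over selected scores only
    weights /= weights.sum()
    return top_idx, weights

token = rng.normal(size=d_model)
experts, weights = route(token)
# The shared expert always runs, so 9 expert networks process this token in total.
print(len(experts) + 1)
```

Only the selected experts' feed-forward networks ever run for this token; the other 248 sit idle, which is where the compute savings come from.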

Think of it like a hospital. You could hire one genius doctor who knows everything, or you could build an emergency room with a cardiologist, a neurologist, an orthopedic surgeon, and dozens of other specialists. When a patient walks in with a broken arm, you do not need the cardiologist. The triage system sends them straight to the right person. MoE models do exactly this with words and concepts, routing each token to the experts best equipped to handle it.

What Actually Changed Under the Hood

When you line up these 17 models side by side, a few concrete design patterns emerge that separate the latest generation from everything that came before. These are not minor tweaks. They reflect fundamentally different engineering philosophies.

Attention Mechanisms Got Smarter and Cheaper

Standard multi-head attention, the original transformer recipe, scales poorly. If your sequence length doubles, the attention computation roughly quadruples. That math gets brutal fast when models need to process entire codebases or hundred-page documents. DeepSeek V3 adopted Multi-Head Latent Attention, which compresses the key and value tensors into a lower-dimensional space before storing them in the KV cache, then projects them back at inference. This adds a matrix multiplication but dramatically reduces cache memory. Meanwhile, several other models shifted to Grouped-Query Attention, which shares key and value projections across heads rather than giving each head its own full set. You lose almost no accuracy, but you save a serious chunk of memory and compute.
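The memory arithmetic behind grouped-query attention is easy to check. The sizes below are illustrative, not taken from any specific model in the comparison:

```python
# Illustrative transformer dimensions (not any particular model's).
n_heads = 32       # query heads
n_kv_heads = 8     # shared key/value heads under grouped-query attention
head_dim = 128
seq_len = 8192
bytes_per = 2      # fp16 storage

# Full multi-head attention caches one K and one V tensor per query head;
# GQA shares each K/V pair across a group of query heads (here, 4 per group).
mha_cache = 2 * seq_len * n_heads * head_dim * bytes_per
gqa_cache = 2 * seq_len * n_kv_heads * head_dim * bytes_per

print(mha_cache // gqa_cache)  # KV cache shrinks by the heads-to-KV-heads ratio
```

With 32 query heads sharing 8 KV heads, the cache is 4x smaller at identical sequence length, which is exactly the kind of saving that makes long contexts affordable.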

Some architectures went further. Models like Kimi K2 explored extended context windows by modifying how positional information gets encoded, allowing the model to process much longer sequences without the attention matrix becoming unmanageable. The engineering details get dense, but the takeaway is simple: newer models do more with less attention overhead.
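One widely used version of this trick is position interpolation on rotary embeddings: scale positions down so a longer sequence maps into the positional range the model saw during training. The sketch below uses illustrative dimensions; the source does not state which specific technique Kimi K2 uses, so treat this as one representative approach.

```python
import numpy as np

def rope_freqs(head_dim, base=10000.0):
    # Standard RoPE inverse frequencies, one per rotary pair of dimensions.
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def rope_angles(positions, head_dim, scale=1.0):
    # scale > 1 interpolates positions: a common way to stretch a pretrained
    # context window without retraining positional behavior from scratch.
    return np.outer(positions / scale, rope_freqs(head_dim))

trained = rope_angles(np.arange(4096), 64)              # original 4k window
stretched = rope_angles(np.arange(16384), 64, scale=4.0)  # 16k positions, squeezed

# The interpolated 16k-token angles stay within the trained 4k range.
print(stretched.max() < 4096)
```

The model still sees rotation angles it was trained on; they are just spaced more finely, which is why this tends to work with little or no fine-tuning.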

Expert Routing Became a Design Science

Early MoE implementations basically split experts evenly and let the router figure it out. The latest architectures treat routing as a first-class design problem. DeepSeek V3 uses an auxiliary-loss-free load balancing strategy, which sounds dry but matters enormously. In older MoE models, you needed an extra penalty term in your training loss to prevent the router from sending all tokens to the same two experts and ignoring the rest. That penalty was a hack. It worked, but it subtly warped training dynamics. Removing it while still maintaining balanced expert usage means the model trains more cleanly and potentially learns better representations.
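The idea can be sketched with a toy router: instead of a penalty term in the loss, keep a per-expert bias that gets nudged up when an expert is underused and down when it is overused, and apply that bias only during expert selection. Sizes, the skewed router, and the update rule below are all illustrative, not DeepSeek's exact recipe.

```python
import numpy as np

rng = np.random.default_rng(1)
n_experts, top_k, gamma = 8, 2, 0.01   # toy sizes; gamma is the bias update speed

skew = rng.normal(size=n_experts) * 2.0  # this router systematically favors some experts
bias = np.zeros(n_experts)               # correction bias, used only for selection
total = np.zeros(n_experts)

for step in range(2000):
    scores = rng.normal(size=(64, n_experts)) + skew       # one batch of routing scores
    picks = np.argsort(scores + bias, axis=1)[:, -top_k:]  # biased top-k selection
    load = np.bincount(picks.ravel(), minlength=n_experts)
    # Nudge underloaded experts up and overloaded experts down -- no extra
    # penalty in the training loss, just a running correction.
    bias += gamma * np.sign(load.mean() - load)
    if step >= 1000:
        total += load  # measure balance after the bias has settled

print(round(total.min() / total.max(), 2))  # approaches 1.0 once load is balanced
```

The bias never touches the gradient of the language-modeling loss, which is the whole point: routing stays balanced without warping what the model learns.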

The number of active experts per token also varies significantly across architectures. DeepSeek V3 activates 9 per token; other designs settle on fewer or more. More active experts means richer representations but higher compute cost per step. This is not a settled debate, and different teams have placed different bets, with no single routing strategy dominating across today's leading models.

Training Data and Tokenization Strategies Shifted

Architecture is not just about network layers. It is also about what you feed the network and how you chop it up. Several newer open-weight models moved away from the byte-pair encoding tokenizers that earlier GPT models popularized, adopting different subword tokenization schemes optimized for multilingual use or for code-heavy training corpora. A tokenizer that splits a word into one token instead of three means the model processes ideas more efficiently, especially in non-English languages where BPE historically performed poorly.
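The efficiency argument compounds, because a shorter token sequence also shrinks the quadratic attention cost. A back-of-the-envelope example, with hypothetical tokens-per-word averages rather than measurements from any real tokenizer:

```python
# Hypothetical token counts for the same 1,000-word passage under two
# tokenizers -- illustrative numbers, not measured from any real model.
tokens_old = 1000 * 2.1   # older BPE averaging 2.1 tokens per word on this text
tokens_new = 1000 * 1.3   # vocabulary tuned for the language: 1.3 per word

# Sequence length shrinks linearly, but self-attention cost scales
# quadratically with it, so the saving compounds.
attn_ratio = (tokens_old / tokens_new) ** 2
print(round(attn_ratio, 2))  # ~2.6x less attention compute for the same passage
```

A 1.6x reduction in tokens becomes roughly a 2.6x reduction in attention work, which is why tokenizer choice quietly matters as much as headline architecture changes.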

Training data composition shifted too. Earlier models trained on a broad mix of web text, books, and code. Newer architectures explicitly weight their training data toward high-quality sources and use sophisticated deduplication pipelines. Several of the latest models also invested heavily in synthetic data generation during training, using stronger models to create high-quality reasoning examples for the model being trained. That creates a feedback loop that did not exist in earlier generations.

What This Means for the Next Wave of AI Models

The most striking takeaway from comparing these architectures is that raw parameter count has become a misleading metric. A 671 billion parameter MoE model that only activates 37 billion parameters per token is fundamentally different from a 70 billion dense model, even though the wall-plug compute might look similar during inference. The industry is learning that architecture decisions, routing strategies, and training data quality often matter more than simply adding more weights.
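The arithmetic makes the point concrete, using the figures cited above:

```python
total_params = 671e9   # DeepSeek V3 total parameters (approximate, from the text)
active_params = 37e9   # parameters activated per token
dense_model = 70e9     # the dense comparison point from the text

print(f"{active_params / total_params:.1%}")  # fraction of weights touched per token
print(round(active_params / dense_model, 2))  # per-token work vs. a 70B dense model
```

Only about 5.5% of the MoE model's weights fire on any given token, and its per-token workload is roughly half that of the 70B dense model, despite nearly ten times the total capacity.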

We are also seeing the gap between closed and open models narrow in ways that seemed unlikely two years ago. The open-weight architectures covered in this comparison use techniques that rival or in some cases match what proprietary labs are shipping. The bottleneck is no longer architecture secrets. It is training compute, data quality, and the engineering talent to run months-long training runs without something breaking.

So here is a question worth sitting with: if the architecture blueprints are now largely public and the engineering patterns are well documented, does the real competitive advantage in AI come from the model design itself, or from the infrastructure and data pipelines that bring that design to life? Drop your take below.

