You've decided to understand Mixture of Experts. Excellent. You open the first article. It begins with a story about a hospital. There's a triage nurse who routes patients to specialists: the cardiologist handles hearts, the neurologist handles brains, the orthopaedic surgeon handles knees. This, the article tells you, is exactly how MoE works. Each "expert" has a specialty. A "gating network" is the triage nurse. Simple!

You nod along, then reach the architecture section and feel a quiet, creeping dread. The article starts talking about feed-forward networks, sparse activation, learned routing — and the hospital analogy has quietly vanished. There's no cardiologist. There's no neurology ward. There's just a lot of matrices and a loss function you don't recognise.

You try a second article. This one uses a panel of judges. Or a committee of advisors. Or, memorably, a room full of people who each have a different opinion about a restaurant, and a chairman who decides who to listen to. The analogy changes. The confusion doesn't.

The problem is structural. The word "expert" is doing enormous pedagogical work it was never equipped to do. And when you finally understand what a MoE "expert" actually is, you may feel — as I did — that the analogy was never helping you. It was, in a specific and measurable way, slowing you down.

This article is an attempt to explain MoE without leaning on the hospital. Let's go from first principles.

Where the term comes from (and why it stuck)

The phrase "Mixture of Experts" originates not in the deep learning era but in early 1990s statistical learning. The canonical paper is Jacobs, Jordan, Nowlan, and Hinton (1991), "Adaptive Mixtures of Local Experts." In that context, "experts" were relatively simple, interpretable models — linear regressors, small MLPs — and the goal was to partition an input space between them in a way that was genuinely modular. In that original framing, the hospital analogy is actually not bad.

The term survived into the transformer era, was attached to something architecturally quite different, and the analogy hitched a ride it had no business taking.

In modern large language models, the "experts" are nothing like those original models. They are not interpretable. They are not pre-assigned any role. Their specialisation, such as it is, is emergent, distributed, and largely opaque. The name stayed. The meaning moved on without it.

What MoE actually is: start from the transformer

To understand MoE, you need to understand what it's replacing. A standard transformer layer has two sub-components:

  • A multi-head attention mechanism, which attends across the sequence

  • A feed-forward network (FFN), which processes each token independently

In a dense transformer, every token passes through the same FFN at every layer. Every parameter is active for every input. This is computationally expensive, and the compute cost per token grows roughly linearly with the total parameter count.

The key insight behind MoE is this: you don't need to use every parameter for every token. Some parameters might be more useful for processing certain kinds of input than others. What if, instead of one large FFN, you had many smaller FFNs — and learned, during training, a policy for which one(s) to use for each token?

That is Mixture of Experts. The "experts" are those smaller feed-forward networks. The "gating network" (or "router") is the learned policy that selects them.

// Standard transformer FFN layer
Token → [Dense FFN] → Output
// All parameters active. Always.

// MoE replacement for that same layer
Token → [Router] → selects top-K from {FFN₁, FFN₂, ..., FFNₙ}
       → weighted sum of selected outputs → Output
// Only K of N experts active per token. Most parameters dormant.
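
For the code-minded, here is a minimal sketch of that diagram in PyTorch. The class name, dimensions, and the explicit Python loops are mine and purely illustrative; real implementations batch the expert dispatch for speed. But the structure is the whole idea: a linear router, a top-K selection, and a weighted sum of small FFN outputs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Illustrative MoE layer: N small FFN 'experts' plus a learned router."""
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The "experts": N same-shaped 2-layer FFNs, each randomly initialised.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])
        # The "router": one linear layer producing a score per expert.
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x):                          # x: (n_tokens, d_model)
        scores = self.router(x)                    # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalise over the chosen K only
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e              # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(1) * expert(x[mask])
        return out

Details such as whether the softmax is applied before or after the top-K selection vary between published models; the sketch above is one reasonable choice, not the canonical one.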

The Switch Transformer paper from Google (Fedus, Zoph, and Shazeer, 2021) is the landmark modern treatment of this. Mixtral 8x7B (Jiang et al., 2024) is the most widely studied open-weights MoE model to date, and Mistral AI's accompanying paper is essential reading.

How experts are created — and the question you were probably trying to answer

The question that sent most of us to the documentation is: where do the experts come from? Are they trained separately on different data? Are they hand-designed specialists? Are they some kind of hyperparameter?

None of these. Here is what actually happens:

1. You structurally define them. When designing the model, you choose how many experts to include per layer — say, 8, 16, or 64. You also choose the architecture of each expert (typically a 2-layer FFN). These are architectural hyperparameters. The number of experts is decided by a human. What each expert does is not.

2. They are initialised randomly. All experts start with random weights. There is no pre-loading, no seeding with domain knowledge. Expert 1 and Expert 7 are indistinguishable at initialisation.

3. They are trained jointly on the same dataset. There is no separate training on biology data for the "biology expert" and finance data for the "finance expert." Every expert trains on the same corpus. Their differentiation emerges from gradient dynamics, not data curation.

4. The router learns simultaneously. The gating network's routing decisions are also learned through backpropagation. The experts and the router co-adapt during training — the router learns which expert to send which tokens to, and the experts learn to handle what they receive.

5. Load-balancing constraints are applied. Left entirely alone, routers tend to collapse — routing all tokens to the same one or two experts, leaving the rest undertrained. An auxiliary load-balancing loss is added to encourage even utilisation. This is a key technical challenge in MoE training.

Experts are not pre-trained separately on different datasets. They are not manually designed specialisations. They are learned parameterised modules that co-adapt during training, differentiated by gradient dynamics and routing pressure.
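
To make point 5 above concrete, here is a hedged sketch of one common form of the auxiliary load-balancing loss, in the spirit of the Switch Transformer formulation (Fedus et al., 2021). The function and variable names are mine; real implementations accumulate this term across every MoE layer and add it to the language-modelling loss with a small coefficient (on the order of 0.01) so it nudges the router toward balance without overwhelming the main objective.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, expert_index, n_experts):
    # router_logits: (n_tokens, n_experts) raw router scores for one layer
    # expert_index:  (n_tokens,) the expert each token was actually dispatched to
    probs = F.softmax(router_logits, dim=-1)        # router's soft distribution
    # f_i: fraction of tokens dispatched to expert i
    dispatch_frac = F.one_hot(expert_index, n_experts).float().mean(dim=0)
    # P_i: mean router probability assigned to expert i
    mean_prob = probs.mean(dim=0)
    # N * sum_i f_i * P_i equals 1 when routing is perfectly uniform
    # and grows as tokens concentrate on a few experts.
    return n_experts * torch.sum(dispatch_frac * mean_prob)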

Do experts actually specialise? And in what?

This is the most interesting question, and the place where the hospital analogy does the most damage. Because yes — experts do develop something that could loosely be called specialisation. But it bears almost no resemblance to human professional expertise.

Interpretability research on Mixtral 8x7B and similar models (see Jiang et al., 2024) has found that expert routing shows some degree of clustering, but the dimensions of that clustering are statistical and structural rather than semantic:

  • Experts may activate differently for code versus natural language

  • Routing patterns sometimes correlate with token frequency or rarity

  • Some experts appear to handle syntactic structure (long-range dependencies, clause boundaries)

  • There is weak evidence of language-level clustering in multilingual models

  • Many patterns are distributed and overlapping, with no clean interpretation at all

The "expert" might map to some abstract task that no human would recognise as expertise. Or it might not map to any task at all — it might map to a region of representation space.

This is not analogous to a cardiologist. It is not analogous to a specialist of any human-recognisable kind. A more honest framing: each expert is a parameter cluster that has been pressured by gradient descent into handling a particular statistical regime of input. The regime may be "code-like sequences" — or it may be something with no English gloss whatsoever.

The contrast, in brief:

  • The analogy implies: interpretable domains ("biology," "finance," "legal reasoning").
    What research shows: weak statistical clustering.

  • The analogy implies: discrete, non-overlapping expertise.
    What research shows: partial, overlapping, often uninterpretable patterns.

  • The analogy implies: human-assignable roles.
    What research shows: specialisation that is emergent from training dynamics, not designed.

How many experts — and what the numbers mean

A note on the claim that "MoE models use 8–16 experts": this is accurate for some architectures but misleading as a general statement.

  • Mixtral 8x7B: 8 experts per layer, top-2 active per token (Jiang et al., 2024)

  • Switch Transformer: 1 expert per token, from pools scaling to thousands (Fedus et al., 2021)

  • GLaM (Google): 64 experts per layer, top-2 active (Du et al., 2021)

  • GPT-4 (according to widespread but unconfirmed reports): a mixture-of-experts design with a small number of experts active per token, drawn from a larger pool; the reported figures vary, and OpenAI has not confirmed any of them

The critical distinction is between total experts and active experts per token. A model might have 64 experts total but activate only 2 per token. The "sparse activation" is what enables the efficiency gains. Most of the model's parameters are dormant for any given input.

The efficiency argument in one sentence: MoE allows you to scale total parameter count (model capacity) without proportionally scaling the compute cost per token — because most parameters are unused for any given token.
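
A back-of-the-envelope version of that sentence, with entirely hypothetical numbers chosen only to make the arithmetic visible:

# Hypothetical parameter counts; no specific model is being described.
n_experts = 8        # experts stored per MoE layer
top_k     = 2        # experts actually run per token
expert_p  = 1.0e9    # parameters in one expert FFN
shared_p  = 0.5e9    # attention, embeddings, norms: always active

total_params  = shared_p + n_experts * expert_p   # stored: 8.5B
active_params = shared_p + top_k * expert_p       # used per token: 2.5B
print(f"stored {total_params/1e9:.1f}B, active per token {active_params/1e9:.1f}B")
# Capacity grows with n_experts; per-token compute grows with top_k.

This is why Mixtral 8x7B is reported to hold roughly 47B parameters in total while using roughly 13B per token: attention, embeddings, and norms are shared, and only 2 of the 8 expert FFNs run for any given token.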

A more precise mental model

Replace the hospital entirely. Here is a more accurate picture of what MoE is doing.

Imagine you are building a very large dictionary of transformation functions — each one a small neural network capable of transforming a vector in some way. For every token you process, you consult a learned index that says: "given what this token looks like right now, use functions 3 and 7." Those functions are applied. Their outputs are weighted and combined. The rest of the dictionary sits unused.

The "expertise" of functions 3 and 7 was never assigned. It emerged because gradient descent, over billions of training steps, found that these particular weight configurations happen to be useful for the kinds of tokens that got routed to them. Whether that usefulness maps to anything a human would call a domain is, frankly, beside the point.

A better name might indeed be: Mixture of Conditionally-Activated Subnetworks. It doesn't fit on a paper title page, but it doesn't make you build a hospital in your head either.

Instead of thinking:

"Mixture of experts = team of specialists"

Think:

"Mixture of experts = dynamic sparse activation over a set of parameter clusters"

How analogies can impede understanding — and this is not a new problem

The MoE case is a useful lens onto a broader issue in machine learning pedagogy: analogies introduced to simplify a concept can, if poorly chosen or poorly bounded, actively prevent the learner from building an accurate mental model.

The hospital analogy fails not because it is wrong in every dimension — there is a router, there are multiple modules, inputs are directed to some modules and not others — but because it imports enormous amounts of background knowledge that does not apply. Cardiologists have names and CVs and conscious specialisations. They were trained on different curricula. They can tell you what they know and don't know. MoE experts have none of these properties. When you import the analogy, you import all of that, and then you have to do the exhausting work of unlearning it.

This is not unique to MoE. Consider some parallel cases:

  • "Neural network"
    Implies: neurons, like in the brain. Biology. Consciousness. Signals "firing."
    Actually: matrix multiplications followed by nonlinearities. No biological fidelity.

  • "Attention"
    Implies: the model "pays attention" the way a person focuses. It "notices" important things.
    Actually: a weighted sum over value vectors, where the weights are computed via dot-product similarity of queries and keys. No phenomenology involved.

  • "Memory" (in RNNs, LSTMs)
    Implies: the model "remembers" earlier input, like working memory in humans.
    Actually: a hidden state vector is updated and passed forward. No retrieval, no episodic structure.

  • "Hallucination"
    Implies: the model is confused, or lying, or dreaming. It has false beliefs.
    Actually: the model outputs tokens that are high-probability given its training distribution, but factually incorrect.

  • "Understanding" (in BERT, GPT)
    Implies: the model grasps meaning the way a reader does. It comprehends.
    Actually: the model has learned statistical regularities that support strong task performance. Whether this is "understanding" is an open philosophical question.

In each case, the analogy works as a first introduction — it gets you into the room. The danger is staying too long. When you begin to reason from the analogy rather than from the mechanism, you start making wrong predictions. You expect "attention" to track salience the way human attention does. You expect "memory" to degrade in familiar ways. You expect MoE "experts" to have something like a professional identity.

None of these expectations survive contact with the actual mathematics.

The philosopher Mary Hesse wrote about this in the context of scientific models: analogies are productive when they generate new hypotheses, and obstructive when they import false constraints. The hospital analogy for MoE generates no useful hypothesis about the architecture that the actual mechanism doesn't generate better. It only imports false constraints — that experts have predefined roles, that specialisation is semantic, that the routing process resembles human triage.

A good analogy opens a door. A bad analogy builds a wall where the door should be — one that looks, from the front, exactly like a door.

The solution is not to ban analogies from technical education. It is to be explicit about their limits. Every analogy in ML pedagogy should come with a visible expiry date: "this will help you until you get to the forward pass, at which point you should let it go." The hospital can walk you to the door of MoE. It cannot take you inside.

When you catch yourself reasoning from an analogy rather than from a mechanism — when you find yourself asking "but which expert handles oncology?" — that is the signal. The analogy has done its work and overstayed its welcome. Thank it, and move on to the matrices.

A note on factual precision: The claim sometimes made that MoE models "use 8–16 experts" understates the range. Expert counts vary from 1 active (Switch Transformer) to 2-of-64 (GLaM) to 2-of-8 (Mixtral). GPT-4's MoE architecture is unconfirmed by OpenAI. Additionally, while this article describes expert specialisation as largely uninterpretable, recent work does find weak but real clustering around modalities (code vs. prose) and languages — "largely uninterpretable" should not be read as "completely uninterpretable."
