LLaMA-2 70B Memory Arithmetic: Query vs KV Heads Explained

LLaMA-2 70B’s Heads: The Unseen Math

Every explainer on Grouped Query Attention says the same thing. That it’s about efficiency. That it cuts down on computation by sharing Key and Value heads across multiple Query heads. Simple, right? Well, not exactly. The actual memory arithmetic—the granular breakdown of how those 64 query heads and 8 KV heads in LLaMA-2 70B interact and impact performance—is often glossed over. It’s like explaining how a car works by just saying “it has an engine.”

The devil, as always, is in the details, and for LLM memory, those details live in matrix dimensions and attention calculations. Think about the standard self-attention mechanism in transformers. You’ve got your Query (Q), Key (K), and Value (V) matrices. For each token, you compute Q times K transpose, scale it, apply a softmax, and then multiply by V. This is where the heavy lifting—and the memory consumption—happens.
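As a rough illustration of the steps just described, here is a minimal pure-Python sketch of single-head attention for one query token. The names `softmax` and `attention` are illustrative and not taken from any LLaMA code; real implementations run batched tensor operations on an accelerator rather than looping over Python lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, keys, values):
    """Scaled dot-product attention for one query vector.

    q:      list of d floats (the query for the current token)
    keys:   list of key vectors, one per token seen so far
    values: list of value vectors, one per token seen so far
    """
    d = len(q)
    # Q times K transpose, scaled by sqrt(d)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    # Softmax turns the scores into weights that sum to 1
    weights = softmax(scores)
    # Weighted sum of the value vectors
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

# Two identical keys give equal weights, so the output is the mean of the values.
print(attention([1.0, 0.0], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

Every call reads all of `keys` and `values`, and that repeated read is the part that grows with sequence length.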

Now, with Grouped Query Attention (GQA), the innovation lies in decoupling the number of query heads from the number of key and value heads. LLaMA-2 70B sports a whopping 64 query heads but only 8 KV heads, so each KV head is shared by a group of eight query heads. This isn’t just a clever architectural tweak; it’s a deliberate engineering choice aimed squarely at memory bandwidth. Why? Because attention compute scales quadratically with sequence length, while the keys and values that must be stored and re-read for every generated token (the KV cache) scale linearly with both sequence length and the number of KV heads. Fewer KV heads means fewer K and V projection parameters and, more importantly, a smaller cache to move through memory at every decoding step.
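The 64-to-8 sharing can be sketched as a simple index mapping. This is an illustration only: the contiguous grouping (query heads 0-7 on KV head 0, and so on) is an assumption for the example, and `kv_head_for` is a hypothetical helper name.

```python
N_Q_HEADS = 64    # query heads in LLaMA-2 70B
N_KV_HEADS = 8    # key/value heads in LLaMA-2 70B
GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # query heads sharing each KV head

def kv_head_for(q_head: int) -> int:
    """Index of the KV head a given query head reads from.

    Assumes contiguous grouping: query heads 0-7 share KV head 0, etc.
    """
    if not 0 <= q_head < N_Q_HEADS:
        raise ValueError(f"query head index out of range: {q_head}")
    return q_head // GROUP_SIZE

# Build the sharing table: KV head -> list of query heads it serves.
groups = {}
for q in range(N_Q_HEADS):
    groups.setdefault(kv_head_for(q), []).append(q)

print(GROUP_SIZE)  # 8
print(groups[0])   # [0, 1, 2, 3, 4, 5, 6, 7]
```

Only 8 distinct K and V tensors exist per layer; the 64 query heads index into them.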

Here’s where the math gets interesting, and frankly, where most explanations stop. Let’s assume a hidden dimension d_model and n_heads attention heads. In traditional Multi-Head Attention (MHA), if you have n_heads Q heads, you also have n_heads K heads and n_heads V heads, each with a dimension of d_k = d_v = d_model / n_heads. The K and V projection weights then total roughly 2 * n_heads * d_model * (d_model / n_heads) = 2 * d_model^2 parameters per layer.
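To put a number on the MHA case, the sketch below plugs in LLaMA-2 70B’s hidden size of 8192 for a hypothetical 64-head MHA layer. The function name is illustrative, biases are ignored, and 2 bytes per parameter (fp16) is an assumption.

```python
def mha_kv_weight_params(d_model: int, n_heads: int) -> int:
    """Parameters in the K and V projection matrices of one MHA layer (no biases)."""
    d_head = d_model // n_heads
    # W_K and W_V each map d_model inputs to n_heads * d_head outputs.
    per_projection = d_model * (n_heads * d_head)
    return 2 * per_projection

D_MODEL = 8192  # LLaMA-2 70B hidden size
params = mha_kv_weight_params(D_MODEL, n_heads=64)

print(params)              # 134217728, i.e. 2 * 8192**2
print(params * 2 / 2**20)  # 256.0 (MiB per layer at 2 bytes per parameter)
```

The head count cancels out: under MHA the K and V weights depend only on d_model.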

But with GQA, say n_q_heads = 64 and n_kv_heads = 8. The per-head dimension is still set by the query head count, d_head = d_model / n_q_heads, and the K and V heads use that same d_head (128 in LLaMA-2 70B, where d_model = 8192). Crucially, the parameters stored for the Key and Value projections are based on n_kv_heads, not n_q_heads. So the KV weights come to roughly 2 * n_kv_heads * d_model * d_head = 2 * d_model^2 * (n_kv_heads / n_q_heads), which here is 2 * d_model^2 / 8: one eighth of the MHA figure. The same factor of eight applies to the K and V vectors cached for every token. The trick is that the computation still involves 64 query heads, but they read keys and values produced by only 8 KV heads.
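As a worked check with LLaMA-2 70B’s published configuration (hidden size 8192, 64 query heads, 8 KV heads, head dimension 128), the sketch below counts K and V projection parameters per layer for GQA and for a hypothetical 64-head MHA equivalent. The function name is illustrative and biases are ignored.

```python
def kv_weight_params(d_model: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Parameters in the K and V projections of one attention layer (no biases)."""
    d_head = d_model // n_q_heads  # head size is set by the query head count
    # W_K and W_V each map d_model inputs to n_kv_heads * d_head outputs.
    return 2 * d_model * n_kv_heads * d_head

D_MODEL, N_Q, N_KV = 8192, 64, 8

mha = kv_weight_params(D_MODEL, N_Q, N_Q)   # hypothetical MHA: 64 KV heads
gqa = kv_weight_params(D_MODEL, N_Q, N_KV)  # GQA as in LLaMA-2 70B: 8 KV heads

print(mha)         # 134217728
print(gqa)         # 16777216
print(mha // gqa)  # 8
```

The ratio is n_q_heads / n_kv_heads, so the K and V projections shrink by a factor of eight.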

The consequence? Reduced memory bandwidth requirements. When computing attention scores, instead of reading 64 sets of cached keys and values per token, the model only needs to read 8. This is the bottleneck that GQA tackles head-on. During inference, especially for large batch sizes or long sequences, memory bandwidth is often the primary constraint, more so than FLOPs (floating-point operations). By sharing KV heads, GQA cuts the K and V data shuttled between memory and the compute units by a factor of eight relative to an equivalent 64-head MHA design, leading to faster inference.
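The size of that data can be estimated directly. The sketch below computes the cached keys and values for one sequence, assuming LLaMA-2 70B’s 80 layers and head dimension of 128, 2 bytes per value (fp16), and a full 4096-token context. The function name is illustrative, and the 64-KV-head case is a hypothetical MHA comparison.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes of cached keys and values for one sequence."""
    # Leading 2 covers keys plus values; each token stores
    # n_kv_heads * d_head numbers per layer for each of them.
    return 2 * n_tokens * n_layers * n_kv_heads * d_head * bytes_per_value

N_LAYERS, D_HEAD, SEQ_LEN = 80, 128, 4096

gqa = kv_cache_bytes(SEQ_LEN, N_LAYERS, n_kv_heads=8, d_head=D_HEAD)
mha = kv_cache_bytes(SEQ_LEN, N_LAYERS, n_kv_heads=64, d_head=D_HEAD)

print(gqa / 2**30)  # 1.25 (GiB per 4096-token sequence with 8 KV heads)
print(mha / 2**30)  # 10.0 (GiB for the hypothetical 64-KV-head variant)
```

Under these assumptions each decoding step at full context reads about 1.25 GiB of cache per sequence instead of 10 GiB.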

So, why does LLaMA-2 70B opt for this specific configuration? It’s a balancing act. More query heads allow for capturing richer, more diverse representations of the input sequence. Imagine each query head as an investigator asking a slightly different question about the data. However, if each investigator needed their own separate informant (KV pair), the system would grind to a halt. By having a few highly knowledgeable informants (KV heads) that many investigators (Query heads) can consult, efficiency is gained without sacrificing too much investigative breadth. The 64:8 ratio is Meta’s bet on the sweet spot for this trade-off in the 70B parameter model, optimizing for a faster, yet still performant, inference experience. It’s a deep architectural commitment, not just a superficial optimization.

What Does This Mean for Developers?

For those building on top of LLaMA-2 70B, this architectural choice has tangible implications. Faster inference means you can deploy more responsive applications, handle higher request loads, or even run models on less powerful hardware (though “less powerful” is relative in the LLM space). It’s about making these massive models more accessible for practical, real-world deployment. The underlying memory arithmetic dictates the operational characteristics you’ll encounter, and understanding it helps in optimizing your downstream applications and understanding performance bottlenecks.
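One way to make “higher request loads” concrete is to ask how many full-length sequences fit in a fixed cache budget. The sketch below is a back-of-the-envelope estimate only: the 20 GiB budget is a hypothetical figure, fp16 storage is assumed, the function name is illustrative, and real serving stacks have further overheads (weights, activations, fragmentation) that are ignored here.

```python
def max_concurrent_sequences(cache_budget_bytes, seq_len, n_layers,
                             n_kv_heads, d_head, bytes_per_value=2):
    """How many sequences of seq_len tokens fit in the given KV cache budget."""
    per_sequence = 2 * seq_len * n_layers * n_kv_heads * d_head * bytes_per_value
    return cache_budget_bytes // per_sequence

BUDGET = 20 * 2**30  # hypothetical: 20 GiB left over for the KV cache

# LLaMA-2 70B shape: 80 layers, head dimension 128, 4096-token context.
print(max_concurrent_sequences(BUDGET, 4096, 80, n_kv_heads=8, d_head=128))   # 16
print(max_concurrent_sequences(BUDGET, 4096, 80, n_kv_heads=64, d_head=128))  # 2
```

With the same budget, the 8-KV-head layout holds eight times as many concurrent sequences as the hypothetical 64-KV-head one.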

Is LLaMA-2 70B Actually Better Because of GQA?

“Better” is a loaded term, but LLaMA-2 70B is certainly more efficient in its memory usage during inference compared to a hypothetical model of similar size using standard Multi-Head Attention. The GQA implementation allows it to achieve faster throughput and lower memory bandwidth demands. Whether this translates directly to superior output quality is a separate, ongoing research question, but the architectural efficiency is undeniable. It’s an engineering win that underpins the model’s usability.

The original article correctly points out that the explainer content around GQA is repetitive. My contribution here is to go past that repetition and show why the memory arithmetic matters, linking the head counts directly to reduced memory bandwidth and faster inference. It’s not just a head count; it’s a calculated architectural decision to alleviate the memory I/O bottleneck, making large models like LLaMA-2 70B more practical.

This isn’t about some exotic new compute paradigm. It’s about meticulously understanding and optimizing the data flow within the existing transformer architecture. It’s an engineer’s solution to a very specific, very costly problem in scaling. And it’s precisely this kind of deep-dive analysis that separates the hype from the engineering reality of large language models.

Frequently Asked Questions

What is Grouped Query Attention (GQA)? GQA is an attention mechanism used in LLMs that improves inference efficiency by sharing Key (K) and Value (V) heads across multiple Query (Q) heads, reducing memory bandwidth requirements compared to traditional Multi-Head Attention.

How many query heads does LLaMA-2 70B have? LLaMA-2 70B has 64 query heads.

How many KV heads does LLaMA-2 70B have? LLaMA-2 70B has 8 Key/Value (KV) heads.
