
Conversation

@ikawrakow
Contributor

In LLaMA-v2-70B, groups of eight attention heads share the same K and V attention tensors (grouped-query attention), so these tensors are 8x smaller than the attention Q tensor. The attention V tensor is quite important for generation quality, so it is often quantized with more bits when using k_quants. Given this, we can get a nice improvement in perplexity score (as a measure of generation quality) with a negligible increase in quantized model size by quantizing the entire attention V tensor with 5 bits whenever the k_quants logic would otherwise quantize it with 3 or 4 bits. The table below shows the PPL change for a subset of the k_quants:

| Quantization | Model size (master) | Model size (PR) | PPL (master) | PPL (PR) |
|---|---|---|---|---|
| Q2_K   | 27.11 GiB | 27.27 GiB | 3.8164 | 3.7339 |
| Q3_K_S | 27.70 GiB | 27.86 GiB | 3.7800 | 3.7019 |
| Q4_K_S | 36.31 GiB | 36.39 GiB | 3.4923 | 3.4852 |
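
For illustration, here is a minimal self-contained sketch of the kind of type override described above. It is not the actual llama.cpp code; the helper name, the enum stand-in, and the head-sharing threshold are assumptions.

```cpp
#include <string>

// Stand-in for the relevant ggml quantization types (illustrative only).
enum ggml_type { GGML_TYPE_Q3_K, GGML_TYPE_Q4_K, GGML_TYPE_Q5_K };

// Hypothetical helper: bump attn_v.weight to Q5_K when many heads share
// one K/V tensor (e.g. LLaMA-v2-70B, where 8 heads share K and V) and the
// k_quants logic would otherwise have picked Q3_K or Q4_K.
static ggml_type adjust_attn_v_type(const std::string & name, ggml_type chosen, int n_heads_per_kv) {
    if (name.find("attn_v.weight") != std::string::npos && n_heads_per_kv >= 8 &&
        (chosen == GGML_TYPE_Q3_K || chosen == GGML_TYPE_Q4_K)) {
        return GGML_TYPE_Q5_K; // small size increase, noticeably better PPL
    }
    return chosen;
}
```

In the real quantization code this decision sits inside the type-selection logic; the point is only that the override amounts to a single condition on the tensor name and the type k_quants chose.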

@IgnacioFDM
Contributor

I'd assume the same should apply to 34B?

llama.cpp (outdated diff)

```cpp
    }
}
if (n_attention_wv != n_feed_forward_w2 || (uint32_t) n_attention_wv != model.hparams.n_layer) {
    fprintf(stderr, "============ Strange model: n_attention_wv = %d, n_feed_forward_w2 = %d, hparams.n_layer = %d\n",
            n_attention_wv, n_feed_forward_w2, model.hparams.n_layer);
}
```
Member


Use LLAMA_LOG_WARN with the __func__ prefix, as in all other logs.
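
The suggested change would presumably look something like this (a sketch only; the exact message text is up to the author):

```cpp
LLAMA_LOG_WARN("%s: ============ Strange model: n_attention_wv = %d, n_feed_forward_w2 = %d, hparams.n_layer = %d\n",
        __func__, n_attention_wv, n_feed_forward_w2, model.hparams.n_layer);
```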

@Nexesenex
Contributor

How long is the context for the perplexity values in the table, @ikawrakow?

@ikawrakow
Contributor Author

> How long is the context for the perplexity values in the table, @ikawrakow?

512 tokens

Nexesenex mentioned this pull request on Jan 21, 2024.