
Conversation

@ikawrakow
Contributor

In LLaMA-v2-70B, groups of eight attention heads share the same K and V attention tensors (grouped-query attention), so these tensors are 8x smaller than the attention Q tensor. The attention V tensor is quite important for generation quality, so it is often quantized with more bits when using k_quants. Given this, we can get a nice improvement in perplexity score (as a measure of generation quality) with a negligible increase in quantized model size by quantizing the entire attention V tensor with 5 bits whenever the k_quants logic would otherwise quantize it with 3 or 4 bits. The table below shows the PPL change for a subset of the k_quants:

| Quantization | Model size (master) | Model size (PR) | PPL (master) | PPL (PR) |
|---|---|---|---|---|
| Q2_K   | 27.11 GiB | 27.27 GiB | 3.8164 | 3.7339 |
| Q3_K_S | 27.70 GiB | 27.86 GiB | 3.7800 | 3.7019 |
| Q4_K_S | 36.31 GiB | 36.39 GiB | 3.4923 | 3.4852 |
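
For illustration, here is a minimal self-contained sketch of the kind of type override described above. It is not the actual llama.cpp code; the helper name, the enum stand-in, and the head-sharing threshold are assumptions.

```cpp
#include <string>

// Stand-in for the relevant ggml quantization types (illustrative only).
enum ggml_type { GGML_TYPE_Q3_K, GGML_TYPE_Q4_K, GGML_TYPE_Q5_K };

// Hypothetical helper: bump attn_v.weight to Q5_K when many heads share
// one K/V tensor (e.g. LLaMA-v2-70B, where 8 heads share K and V) and the
// k_quants logic would otherwise have picked Q3_K or Q4_K.
static ggml_type adjust_attn_v_type(const std::string & name, ggml_type chosen, int n_heads_per_kv) {
    if (name.find("attn_v.weight") != std::string::npos && n_heads_per_kv >= 8 &&
        (chosen == GGML_TYPE_Q3_K || chosen == GGML_TYPE_Q4_K)) {
        return GGML_TYPE_Q5_K; // small size increase, noticeably better PPL
    }
    return chosen;
}
```

In the real quantization code this decision sits inside the type-selection logic; the point is only that the override amounts to a single condition on the tensor name and the type k_quants chose.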

@IgnacioFDM
Contributor

I'd assume the same should apply to 34B?

llama.cpp (outdated diff)

```cpp
    }
}
if (n_attention_wv != n_feed_forward_w2 || (uint32_t) n_attention_wv != model.hparams.n_layer) {
    fprintf(stderr, "============ Strange model: n_attention_wv = %d, n_feed_forward_w2 = %d, hparams.n_layer = %d\n",
            n_attention_wv, n_feed_forward_w2, model.hparams.n_layer);
}
```
Member


Use LLAMA_LOG_WARN with the __func__ prefix, as in all other logs.
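
The suggested change would presumably look something like this (a sketch only; the exact message text is up to the author):

```cpp
LLAMA_LOG_WARN("%s: ============ Strange model: n_attention_wv = %d, n_feed_forward_w2 = %d, hparams.n_layer = %d\n",
        __func__, n_attention_wv, n_feed_forward_w2, model.hparams.n_layer);
```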

@Nexesenex
Contributor

How long is the context for the perplexity values in the table, @ikawrakow?

@ikawrakow
Contributor Author

> How long is the context for the perplexity values in the table, @ikawrakow?

512 tokens

Nexesenex mentioned this pull request on Jan 21, 2024.