
What is the partial sum in block_q8_1_mmq, is it for reducing the quantization error during MMA? #13504


Closed
TheTinyTeddy opened this issue May 13, 2025 · 1 comment

Comments

@TheTinyTeddy

The struct block_q8_1_mmq is defined as:

```cpp
struct block_q8_1_mmq {
    // The y float data is converted to a data layout that can simply be copied to shared memory as a contiguous block.
    // The y float data is first grouped as blocks of 128 values.
    // These blocks are then treated as individual data values and transposed.
    //
    // To avoid shared memory bank conflicts each block is padded with 16 bytes.
    // This padding is also used to store block scales/partial sums.
    // The scales multiplied with the quantized data are equal to the unquantized values.
    // The partial sums are obtained by summing up a subgroup of the contained values (prior to quantization)
    //     and are only needed for performance reasons.
    //
    // The exact data stored depends on the x data type.
    union {
        float d4[4];    // 1 32 bit scale per 32 values, stored as d0,d1,d2,d3
        half2 ds4[4];   // 1 16 bit scale + 1 16 bit partial sum per 32 values, stored as d0,s0,d1,s1,d2,s2,d3,s3
        half  d2s6[8];  // 1 16 bit scale per 64 values + 1 16 bit partial sum per 16 values for the first 96 values,
                        //     stored as d0,d1,s1,s2,s3,s4,s5,s6
    };
    int8_t qs[4*QK8_1]; // 128 values quantized to 8 bit each
};
```
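
For intuition, here is a minimal CPU-side sketch of how one 32-value subgroup could be quantized for the ds4 layout, producing both the scale d and the partial sum s of the unquantized values. This is not the actual CUDA quantization kernel: quantize_subblock_q8_1 and scale_sum are made-up names, and the 16 bit half2 pair is replaced by a float pair for readability.

```cpp
#include <cmath>
#include <cstdint>

struct scale_sum { float d; float s; }; // stands in for the half2 {d, s} pair in the padding

// Quantize one 32-value subgroup to int8 and compute the metadata that
// would go into the 16 byte padding of block_q8_1_mmq.
static scale_sum quantize_subblock_q8_1(const float * y, int8_t * qs) {
    float amax = 0.0f;
    float sum  = 0.0f;
    for (int i = 0; i < 32; ++i) {
        amax = fmaxf(amax, fabsf(y[i]));
        sum += y[i]; // partial sum of the values *prior to* quantization
    }
    const float d  = amax / 127.0f;                // scale: d*q ≈ y
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    for (int i = 0; i < 32; ++i) {
        qs[i] = (int8_t) roundf(y[i] * id);
    }
    return {d, sum};
}
```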

I was wondering why we need this partial sum, and what is meant by "and are only needed for performance reasons". Is it a bias term to reduce the quantization error during MMA?

@jeffbolznv
Collaborator

The quantization used for A is decoded as Ad*a - Am, where Ad and Am are the scale/bias for the block and a is the quantized element. The q8_1 quantization is decoded as Bd*b. So the matrix multiply dots a row of A with a column of B, computing sum{(Ad*a - Am)*b*Bd}. If you expand this out, you can rewrite it as Ad*Bd*sum{a*b} - Am*Bd*sum{b}. The partial sum is this sum{b} term, precomputed to make the matrix multiply faster.
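
To make the expansion concrete, here is a scalar sketch of a block dot product that uses the precomputed partial sum. The function name and signature are illustrative, not the actual llama.cpp kernel. With s = Bd*sum{b} stored alongside the block (the sum of the values prior to quantization, as the struct comment describes), the Am bias is applied once per block instead of once per element, and the inner loop is a pure int8 dot product, which is what maps to dp4a/MMA instructions on the GPU.

```cpp
#include <cstdint>

// result = sum{(Ad*a[i] - Am) * Bd*b[i]}
//        = Ad*Bd*sum{a*b} - Am*(Bd*sum{b})
//        = Ad*Bd*sumi     - Am*s
float block_dot(float Ad, float Am, const int8_t * a, // A block: scale, bias, quants
                float Bd, float s,  const int8_t * b, // B block: scale, partial sum s ≈ Bd*sum{b}
                int n) {
    int32_t sumi = 0;
    for (int i = 0; i < n; ++i) {
        sumi += (int32_t) a[i] * b[i]; // integer-only inner loop, no per-element bias
    }
    return Ad*Bd*(float)sumi - Am*s;   // single correction term per block
}
```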

@ggml-org ggml-org locked and limited conversation to collaborators May 13, 2025
@JohannesGaessler JohannesGaessler converted this issue into discussion #13507 May 13, 2025

