
What is the partial sum in block_q8_1_mmq, is it for reducing the quantization error during MMA? #13504


Closed
TheTinyTeddy opened this issue May 13, 2025 · 1 comment

Comments

@TheTinyTeddy

The struct block_q8_1_mmq is defined as:

```cpp
struct block_q8_1_mmq {
    // The y float data is converted to a data layout that can simply be copied to shared memory as a contiguous block.
    // The y float data is first grouped as blocks of 128 values.
    // These blocks are then treated as individual data values and transposed.
    //
    // To avoid shared memory bank conflicts each block is padded with 16 bytes.
    // This padding is also used to store block scales/partial sums.
    // The scales multiplied with the quantized data are equal to the unquantized values.
    // The partial sums are obtained by summing up a subgroup of the contained values (prior to quantization)
    //     and are only needed for performance reasons.
    //
    // The exact data stored depends on the x data type.
    union {
        float d4[4];    // 1 32 bit scale per 32 values, stored as d0,d1,d2,d3
        half2 ds4[4];   // 1 16 bit scale + 1 16 bit partial sum per 32 values, stored as d0,s0,d1,s1,d2,s2,d3,s3
        half  d2s6[8];  // 1 16 bit scale per 64 values + 1 16 bit partial sum per 16 values for the first 96 values,
                        //     stored as d0,d1,s1,s2,s3,s4,s5,s6
    };
    int8_t qs[4*QK8_1]; // 128 values quantized to 8 bit each
};
```
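
For intuition, here is a minimal CPU-side sketch of how one 32-value subgroup could be quantized for the ds4 layout, producing both the scale d and the partial sum s of the unquantized values. This is not the actual CUDA quantization kernel: quantize_subblock_q8_1 and scale_sum are made-up names, and the 16 bit half2 pair is replaced by a float pair for readability.

```cpp
#include <cmath>
#include <cstdint>

struct scale_sum { float d; float s; }; // stands in for the half2 {d, s} pair in the padding

// Quantize one 32-value subgroup to int8 and compute the metadata that
// would go into the 16 byte padding of block_q8_1_mmq.
static scale_sum quantize_subblock_q8_1(const float * y, int8_t * qs) {
    float amax = 0.0f;
    float sum  = 0.0f;
    for (int i = 0; i < 32; ++i) {
        amax = fmaxf(amax, fabsf(y[i]));
        sum += y[i]; // partial sum of the values *prior to* quantization
    }
    const float d  = amax / 127.0f;                // scale: d*q ≈ y
    const float id = d != 0.0f ? 1.0f/d : 0.0f;
    for (int i = 0; i < 32; ++i) {
        qs[i] = (int8_t) roundf(y[i] * id);
    }
    return {d, sum};
}
```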

I was wondering why we need this partial sum, and what is meant by "and are only needed for performance reasons". Is it a bias term to reduce the quantization error during MMA?

@jeffbolznv
Collaborator

The quantization used for A is decoded as Ad*a - Am, where Ad and Am are the scale/bias for the block and a is the quantized element. The q8_1 quantization is decoded as Bd*b. So the matrix multiply dots a row of A with a column of B, computing sum{(Ad*a - Am)*b*Bd}. If you expand this out, you can rewrite it as Ad*Bd*sum{a*b} - Am*Bd*sum{b}. The partial sum is this sum{b} term, precomputed to make the matrix multiply faster.
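
To make the expansion concrete, here is a scalar sketch of a block dot product that uses the precomputed partial sum. The function name and signature are illustrative, not the actual llama.cpp kernel. With s = Bd*sum{b} stored alongside the block (the sum of the values prior to quantization, as the struct comment describes), the Am bias is applied once per block instead of once per element, and the inner loop is a pure int8 dot product, which is what maps to dp4a/MMA instructions on the GPU.

```cpp
#include <cstdint>

// result = sum{(Ad*a[i] - Am) * Bd*b[i]}
//        = Ad*Bd*sum{a*b} - Am*(Bd*sum{b})
//        = Ad*Bd*sumi     - Am*s
float block_dot(float Ad, float Am, const int8_t * a, // A block: scale, bias, quants
                float Bd, float s,  const int8_t * b, // B block: scale, partial sum s ≈ Bd*sum{b}
                int n) {
    int32_t sumi = 0;
    for (int i = 0; i < n; ++i) {
        sumi += (int32_t) a[i] * b[i]; // integer-only inner loop, no per-element bias
    }
    return Ad*Bd*(float)sumi - Am*s;   // single correction term per block
}
```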

@ggml-org ggml-org locked and limited conversation to collaborators May 13, 2025
@JohannesGaessler JohannesGaessler converted this issue into discussion #13507 May 13, 2025

