You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
struct block_q8_1_mmq {
// The y float data is converted to a data layout that can simply be copied to shared memory as a contiguous block.
// The y float data is first grouped as blocks of 128 values.
// These blocks are then treated as individual data values and transposed.
//
// To avoid shared memory bank conflicts each block is padded with 16 bytes.
// This padding is also used to store block scales/partial sums.
// The scales multiplied with the quantized data are equal to the unquantized values.
// The partial sums are obtained by summing up a subgroup of the contained values (prior to quantization)
// and are only needed for performance reasons.
//
// The exact data stored depends on the x data type.
union {
float d4[4]; // 1 32 bit scale per 32 values, stored as d0,d1,d2,d3
half2 ds4[4]; // 1 16 bit scale + 1 16 bit partial sum per 32 values, stored as d0,s0,d1,s1,d2,s2,d3,s3
half d2s6[8]; // 1 16 bit scale per 64 values + 1 16 bit partial sum per 16 values for the first 96 values,
// stored as d0,d1,s1,s2,s3,s4,s5
};
int8_t qs[4*QK8_1]; // 128 values quantized to 8 bit each
};
I was wondering why do we need this partial sum, what is the meaning of "and are only needed for performance reasons."? Is it a bias term to reduce the quantization error during MMA?
The text was updated successfully, but these errors were encountered:
The quantization used for A is decoded as Ad*a - Am where Ad and Am are the scale/bias for the block, and a is the element. The q8_1 quantization is decoded as Bd*b. So the matrix multiply dots a row of A and column of B, computing sum{(Ad*a-Am)*b*Bd}. If you expand this out, you can rewrite it as Ad*Bd*sum{a*b} - Am*Bd*sum{b}. The partial sum is this sum{b} term, precomputed to make the matrix multiply faster.
The struct of q8_q_mmq is:
struct block_q8_1_mmq {
// The y float data is converted to a data layout that can simply be copied to shared memory as a contiguous block.
// The y float data is first grouped as blocks of 128 values.
// These blocks are then treated as individual data values and transposed.
//
// To avoid shared memory bank conflicts each block is padded with 16 bytes.
// This padding is also used to store block scales/partial sums.
// The scales multiplied with the quantized data are equal to the unquantized values.
// The partial sums are obtained by summing up a subgroup of the contained values (prior to quantization)
// and are only needed for performance reasons.
//
// The exact data stored depends on the x data type.
union {
float d4[4]; // 1 32 bit scale per 32 values, stored as d0,d1,d2,d3
half2 ds4[4]; // 1 16 bit scale + 1 16 bit partial sum per 32 values, stored as d0,s0,d1,s1,d2,s2,d3,s3
half d2s6[8]; // 1 16 bit scale per 64 values + 1 16 bit partial sum per 16 values for the first 96 values,
// stored as d0,d1,s1,s2,s3,s4,s5
};
int8_t qs[4*QK8_1]; // 128 values quantized to 8 bit each
};
I was wondering why do we need this
partial sum
, what is the meaning of "and are only needed for performance reasons."? Is it a bias term to reduce the quantization error during MMA?The text was updated successfully, but these errors were encountered: