
#1 performance requirement #83

@cmp-nct

Description

I'm stuck with other work; I recently pushed a half-finished branch containing a ton of fixes and changes, but it's not done yet.
I also moved from falcon_main to "ggfalcon", which is meant to replace the main example (and, later on, the other examples) with API support.

The really big improvement, which I was not able to complete yet, is calculating the KV mul_mat operations on CUDA.
Broadcasting of the first tensor is required, which basically means repeating it 128 times per batched token; so -b 100 would cause 128 × 100 = 12,800 multiplications, run sequentially, two times. (Unless it's more than a single-GPU environment, in which case there might be more parallelism behind it.)

We do have cuBLAS 8-bit support in that branch! It is very fast (though not faster than the quantized multiplication, which is the default).
The branch also supports changing the matmul method on demand (cuBLAS 8/16/32-bit, quantized, CPU), so it's easy to test and switch.
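
For context, the on-demand switching is conceptually something like the sketch below; the enum and function names here are invented for illustration and are not the branch's actual API:

```cpp
// Hypothetical sketch of runtime matmul-method dispatch.
enum mul_mat_method {
    MUL_MAT_CUBLAS_8BIT,
    MUL_MAT_CUBLAS_16BIT,
    MUL_MAT_CUBLAS_32BIT,
    MUL_MAT_QUANTIZED,   // default: fastest in the tests described above
    MUL_MAT_CPU,
};

// Selected at runtime so benchmarks can switch methods without rebuilding.
static mul_mat_method g_mul_mat_method = MUL_MAT_QUANTIZED;

static void mul_mat_dispatch(/* tensor arguments omitted */) {
    switch (g_mul_mat_method) {
        case MUL_MAT_CUBLAS_8BIT:  /* quantize to int8, cublasGemmEx     */ break;
        case MUL_MAT_CUBLAS_16BIT: /* convert to fp16,  cublasGemmEx     */ break;
        case MUL_MAT_CUBLAS_32BIT: /* convert to fp32,  cublasSgemm      */ break;
        case MUL_MAT_QUANTIZED:    /* custom quantized CUDA kernel       */ break;
        case MUL_MAT_CPU:          /* fall back to the ggml CPU mul_mat  */ break;
    }
}
```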

What I believe should be done is broadcasting and batched cuBLAS in 8 bit for the two KV-cache multiplications. That should bring an enormous performance boost.

Potential roadblock:
The current operation routine in the CUDA code is not usable for that: it would loop tens of thousands of times for batched broadcast processing, and that cannot be used to feed batched cuBLAS. Non-batched cuBLAS is also useless that way.
I just did some dry tests (broadcasting the input without aligning the output properly) and the slowdown compared to CPU was huge.
But that can be solved, likely with a dedicated routine; see the sketch below.
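
For illustration, here is a hedged sketch of what such a dedicated routine might look like, using cublasGemmStridedBatchedEx with a zero A-stride so the shared K tensor is broadcast across all batch entries instead of being physically repeated. All tensor layouts, names, and parameters are assumptions made for the example (and cuBLAS int8 GEMM carries alignment and transpose-mode restrictions that would need checking per version), not the branch's actual code:

```cpp
// Hypothetical: QK^T for all heads in one batched 8-bit GEMM (cuBLAS 11+).
#include <cublas_v2.h>
#include <cstdint>

// k8 : int8 K cache, head_dim x n_past (column-major), ONE head shared by all
// q8 : int8 queries, head_dim x n_batch per head, n_head heads contiguous
// s32: int32 scores, n_past x n_batch per head
cublasStatus_t qk_all_heads_i8(cublasHandle_t handle,
                               const int8_t * k8, const int8_t * q8, int32_t * s32,
                               int head_dim, int n_past, int n_batch, int n_head) {
    const int32_t alpha = 1, beta = 0; // quantization scales and 1/sqrt(head_dim)
                                       // would be applied in a separate kernel,
                                       // since int8 GEMM takes integer alpha/beta
    return cublasGemmStridedBatchedEx(handle,
        CUBLAS_OP_T, CUBLAS_OP_N,           // S_h = K^T * Q_h per head
        n_past, n_batch, head_dim,
        &alpha,
        k8,  CUDA_R_8I,  head_dim, 0,       // strideA = 0: broadcast the single
                                            // K head to all n_head batch entries
        q8,  CUDA_R_8I,  head_dim, (long long)head_dim * n_batch,
        &beta,
        s32, CUDA_R_32I, n_past,   (long long)n_past * n_batch,
        n_head,                             // e.g. 128 heads in one call
        CUBLAS_COMPUTE_32I, CUBLAS_GEMM_DEFAULT);
}
```

The second KV-cache multiplication (attention weights × V) would follow the same pattern, with the zero stride on the V operand. The point is that one strided-batched call replaces the tens of thousands of sequential small multiplications, and the 128× repetition never has to be materialized in memory.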

Does anyone here with CUDA/cuBLAS experience want to give that a try?
