
Releases: kvcache-ai/ktransformers

v0.3.1

17 May 07:28
32f3d7b

🚀 New Features

⚡ Performance Improvements

  • DeepSeek-R1 Q4 decoding @ 7.5 tokens/s
    Measured on a single-socket Xeon + DDR5 4800 MT/s + A770 platform; enabling dual-NUMA delivers additional speedups.

  • Easy benchmarking
    Try it yourself with the local_chat script to see these gains firsthand.
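
A minimal sketch of such a benchmark run from Python is shown below; the model path, GGUF path, and flag values are placeholders, and flag names may differ slightly between versions, so check the script's help output first.

```python
# A sketch, not an official benchmark harness: run local_chat as a subprocess.
# All paths and flag values here are placeholders; adapt them to your environment.
import subprocess

subprocess.run(
    [
        "python", "ktransformers/local_chat.py",
        "--model_path", "deepseek-ai/DeepSeek-R1",    # HF config/tokenizer source (placeholder)
        "--gguf_path", "/models/DeepSeek-R1-Q4_K_M",  # local Q4 GGUF weights (placeholder)
        "--cpu_infer", "32",                          # CPU inference threads; tune per socket
        "--max_new_tokens", "512",                    # long enough for a stable tokens/s reading
    ],
    check=True,
)
```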

🔜 What’s Next

  • Balance_serve integration
    We’re working to seamlessly merge Intel GPU operators into the balance_serve backend for end-to-end support and streamlined maintenance.

v0.3

28 Apr 22:50
d7811a4

Support AMX-INT8 and AMX-BF16 @chenht2022 @ErvinXie @KMSorSMS
Support Qwen3MoE @ovowei @Atream
Support function_call @Creeper-MZ
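
Assuming function_call is exposed through the server's OpenAI-compatible chat endpoint, here is a hedged sketch of a tool-call request; the URL, port, model name, and the get_weather tool are illustrative assumptions, not part of this release.

```python
# Sketch of an OpenAI-style tool-call request; endpoint, model, and tool are assumptions.
import json
import requests

URL = "http://localhost:10002/v1/chat/completions"  # adjust to your server launch flags

payload = {
    "model": "Qwen3MoE",  # placeholder model name
    "messages": [{"role": "user", "content": "What is the weather in Beijing right now?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, defined only for this example
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
# The assistant message should contain a tool_calls entry if the model chose to call get_weather.
print(json.dumps(resp.json()["choices"][0]["message"], indent=2))
```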

v0.2.4post1

04 Apr 08:03
6617549

What's Changed

New Contributors

Full Changelog: v0.2.4...v0.2.4post1

v0.2.4

02 Apr 06:24
ac95b6c

KTransformers v0.2.4 Release Notes

We are excited to announce the official release of the long-awaited KTransformers v0.2.4!
In this version, we've added the community's most requested feature, multi-concurrency support, through a major refactor of the entire architecture that touches more than 10,000 lines of code.
Drawing inspiration from sglang's excellent architecture, we implemented high-performance asynchronous concurrent scheduling in C++, including continuous batching, chunked prefill, and more. Thanks to GPU sharing in concurrent scenarios, overall throughput also improves to a certain extent. A demonstration follows:

(Demo video: v0.2.4.mp4)

🚀 Key Updates

  1. Multi-Concurrency Support
    • Added the ability to handle multiple concurrent inference requests, receiving and executing multiple tasks simultaneously (see the request sketch after this list).
    • We implemented custom_flashinfer on top of the high-performance, highly flexible flashinfer operator library and added a variable-batch-size CUDA Graph, which further enhances flexibility while reducing memory and padding overhead.
    • In our benchmarks, overall throughput improved by approximately 130% under 4-way concurrency.
    • With support from Intel, we tested KTransformers v0.2.4 on the latest Xeon6 + MRDIMM-8800 platform. By increasing concurrency, the total output throughput increased from 17 tokens/s to 40 tokens/s. We observed that the bottleneck has now shifted to the GPU. Using a higher-end GPU than the 4090D could further improve performance.
  2. Engine Architecture Optimization
    (Engine architecture diagram)
    Inspired by sglang's scheduling framework, we refactored KTransformers into a clearer three-layer architecture in an update of roughly 11,000 lines of code, now supporting full multi-concurrency:
    • Server: Handles user requests and serves the OpenAI-compatible API.
    • Inference Engine: Executes model inference and supports chunked prefill.
    • Scheduler: Manages task scheduling and request orchestration. Supports continuous batching by organizing queued requests into batches in FCFS (first-come, first-served) order and sending them to the inference engine.
  3. Project Structure Reorganization
    All C/C++ code is now centralized under the /csrc directory.
  4. Parameter Adjustments
    Removed some legacy and deprecated launch parameters for a cleaner configuration experience.
    We plan to provide a complete parameter list and detailed documentation in future releases to facilitate flexible configuration and debugging.
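
As a quick way to exercise the new multi-concurrency path, the sketch below sends several requests in parallel to the OpenAI-compatible endpoint; the URL, port, and model name are assumptions and should be adjusted to your own launch flags.

```python
# Sketch: fire concurrent chat requests at the OpenAI-compatible server.
import concurrent.futures
import requests

URL = "http://localhost:10002/v1/chat/completions"  # assumed default; adjust to your launch flags

def ask(prompt: str) -> str:
    resp = requests.post(
        URL,
        json={
            "model": "DeepSeek-R1",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Four workers to mirror the 4-way concurrency figure quoted above.
prompts = [f"Summarize topic #{i} in one paragraph." for i in range(4)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80], "...")
```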

📚 Upgrade Notes

  • Due to parameter changes, users who have installed previous versions are advised to delete the ~/.ktransformers directory and reinitialize.
  • To enable multi-concurrency, please refer to the latest documentation for configuration examples.

What's Changed

Implemented custom_flashinfer @Atream @ovowei @qiyuxinlin
Implemented balance_serve engine based on FlashInfer @qiyuxinlin @ovowei
Implemented a continuous batching scheduler in C++ @ErvinXie
release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao

Warning

⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!

v0.2.3post2

15 Mar 15:22
c51818c

What's Changed

Key Features:

Ahead of the official v0.2.4 release, we're excited to announce preliminary ROCm support for ktransformers in response to strong community demand. This update delivers enhanced heterogeneous computing capabilities for developers using AMD GPUs. #178

We extend our heartfelt gratitude to contributor @fxzjshm! The profound technical expertise demonstrated in the ROCm adaptation work has been instrumental in achieving framework compatibility. We would also like to thank the AMD team for their technology and equipment support.

Other Updates:

New Contributors

Full Changelog: v0.2.3post1...v0.2.3post2

v0.2.3post1

07 Mar 14:16
7544ead

What's Changed

Full Changelog: v0.2.3...v0.2.3post1

v0.2.3

06 Mar 09:05
63b1c85

We're excited to announce the update of KTransformers v0.2.3! You can now compile from the GitHub source code. Release packages and Docker images are being built/uploaded - stay tuned!

Key Updates:

  1. Low-Precision Inference Optimization #754

    1. Added IQ1_S/IQ2_XXS quantized matmul support, now compatible with Unsloth's DeepSeek-R1 1.58bit/2.51bit dynamic quantized weights

    2. Released DeepSeek-R1 mixed-precision model (IQ1+FP8) achieving enhanced performance:

      • 19GB VRAM usage & 140GB system memory consumption

      • MMLU score of 83.6, slightly outperforming full-precision DeepSeek-V3

      • Ongoing benchmarks: View Details (Special thanks to @moonshadow-25 and @godrosev for their huge contributions to v0.2.3)

  2. Long Context Handling Enhancement #750

    1. Implemented a chunked prefill mechanism, supporting 139K-token contexts with DeepSeek-R1 on 24GB of VRAM (see the sketch after this list)

    2. Note: As DeepSeek's native context window only supports 128K tokens, we will pause further optimizations for extended context handling.
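
The sketch below illustrates the general chunked-prefill idea referenced above, under assumed names (CHUNK_SIZE, forward_chunk); it is not the project's implementation, just the control flow that keeps per-step memory bounded by the chunk size rather than the full prompt length.

```python
# Illustrative control flow only, not KTransformers' code: the prompt is fed through
# prefill in fixed-size chunks, so peak activation memory is bounded by CHUNK_SIZE
# (an assumed knob) rather than by the full context length.
from typing import Callable, List

CHUNK_SIZE = 8192  # tokens per prefill step (assumption)

def chunked_prefill(prompt_ids: List[int], forward_chunk: Callable[[List[int], int], None]) -> None:
    """forward_chunk(chunk, start_pos) runs one prefill step and appends to the KV cache."""
    for start in range(0, len(prompt_ids), CHUNK_SIZE):
        chunk = prompt_ids[start:start + CHUNK_SIZE]
        forward_chunk(chunk, start)  # KV cache grows by len(chunk) entries

if __name__ == "__main__":
    # Toy check with a stub "model": count how many KV entries a 139K-token prompt produces.
    kv_len = 0
    def stub(chunk, start_pos):
        global kv_len
        kv_len += len(chunk)
    chunked_prefill(list(range(139_000)), stub)
    print(kv_len)  # 139000
```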


Coming Next - v0.2.4 Preview

The upcoming v0.2.4 will be the final minor release in the 0.2 series, delivering the most crucial update that transforms KTransformers from "a toy project" into "a practical solution": multi-concurrency support.

Scheduled for release within two weeks, this update will be followed by our 0.3 version development featuring:

  • AMX-powered optimizations for enhanced performance

  • Expanded hardware support, including AMD, XPU, MetaX (沐曦), Moore Threads (摩尔线程), and Ascend (昇腾) GPUs

v0.2.2rc2

01 Mar 14:47

Improve temperature argument support #721
Update to a newer torch version for the Docker image #732
Fix NUMA CPU distribution #685
Add torch support for MoE #684

v0.2.2rc1

25 Feb 16:32
9c71bcb

Hi everyone, KTransformers has been updated to v0.2.2. You can now try it by compiling from the GitHub repository source code. The release packages and Docker images are also being built and uploaded - stay tuned! The main updates in this release include:

  1. #659 Simplified MMLU Test Script and Scores: Quantization may affect model capabilities. We ran MMLU tests, where the Marlin + Q4KM quantized score dropped slightly to 81, compared with the original full-precision score of 81.6. Note that these tests are still preliminary and should be treated as reference only. For details, visit the Benchmark Documentation.
  2. #643 FP8 Kernel for Enhanced Precision and Performance: Model quantization and the weight-loading method in v0.2.1 led to some precision loss. Version 0.2.2 introduces GPU-accelerated FP8 Triton kernels, offering higher precision while maintaining performance. The MMLU score for FP8+Q4KM improved to 81.5 with negligible performance impact. We also provide corresponding weight packing scripts; further optimizations for flexible and efficient weight loading will follow. (A blockwise FP8 quantization sketch follows this list.)
  3. #657 Longer Context Support and Efficient FlashInfer MLA Operator:
    • Under 24GB VRAM, the supported context length has increased from 8K (v0.2.1) to up to 25K tokens (varies by use case), with further optimizations planned.
    • Optimized DeepSeek-V3/R1 model prefill phase: VRAM usage now scales linearly with context length.
    • Added support for matrix absorption during prefill (trading some performance for reduced KV Cache VRAM usage) and FlashInfer's MLA operator.
    • Chunk Prefill Optimization will soon be merged to the main branch to improve VRAM efficiency and performance for long-context scenarios.
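
To make the FP8 weight path above concrete, here is an illustrative blockwise FP8 quantization sketch in plain PyTorch; it is not the project's Triton kernel or weight-packing script, the 128x128 block size is an assumption, and it requires a PyTorch build that provides float8 dtypes.

```python
# Illustrative sketch only: blockwise FP8 weight quantization with one scale per
# (block x block) tile, the general idea behind storing weights in FP8.
import torch

def quantize_fp8_blockwise(w: torch.Tensor, block: int = 128):
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0  # keep the sketch simple
    tiles = w.reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max  # map each tile's max to FP8 max
    q = (tiles / scale).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.squeeze(1).squeeze(-1)  # FP8 weights + per-tile scales

def dequantize_fp8_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    rows, cols = q.shape
    tiles = q.to(torch.float32).reshape(rows // block, block, cols // block, block)
    return (tiles * scale.unsqueeze(1).unsqueeze(-1)).reshape(rows, cols)

if __name__ == "__main__":
    w = torch.randn(256, 256)
    q, s = quantize_fp8_blockwise(w)
    print((dequantize_fp8_blockwise(q, s) - w).abs().max())  # small reconstruction error
```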

Feel free to explore these updates and share your feedback!

v0.2.1.post1

18 Feb 14:02
09f5c5e
  1. Fix a precision bug introduced in v0.2.1, add MMLU/MMLU-Pro tests, and fix the server #413