
Releases: kvcache-ai/ktransformers

v0.3.1

17 May 07:28
32f3d7b

🚀 New Features

⚡ Performance Improvements

  • DeepSeek-R1 Q4 decoding @ 7.5 tokens/s
    Measured on a single-socket Xeon + DDR5 4800 MT/s + A770 platform; enabling dual-NUMA delivers additional speedups.

  • Easy benchmarking
    Try it yourself with the local_chat script to see these gains firsthand.
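
A minimal sketch of such a benchmark run from Python is shown below; the model path, GGUF path, and flag values are placeholders, and flag names may differ slightly between versions, so check the script's help output first.

```python
# A sketch, not an official benchmark harness: run local_chat as a subprocess.
# All paths and flag values here are placeholders; adapt them to your environment.
import subprocess

subprocess.run(
    [
        "python", "ktransformers/local_chat.py",
        "--model_path", "deepseek-ai/DeepSeek-R1",    # HF config/tokenizer source (placeholder)
        "--gguf_path", "/models/DeepSeek-R1-Q4_K_M",  # local Q4 GGUF weights (placeholder)
        "--cpu_infer", "32",                          # CPU inference threads; tune per socket
        "--max_new_tokens", "512",                    # long enough for a stable tokens/s reading
    ],
    check=True,
)
```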

🔜 What’s Next

  • Balance_serve integration
    We’re working to seamlessly merge Intel GPU operators into the balance_serve backend for end-to-end support and streamlined maintenance.

v0.3

28 Apr 22:50
d7811a4

Support AMX-INT8 and AMX-BF16 @chenht2022 @ErvinXie @KMSorSMS
Support Qwen3MoE @ovowei @Atream
Support function_call @Creeper-MZ
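
Assuming function_call is exposed through the server's OpenAI-compatible chat endpoint, here is a hedged sketch of a tool-call request; the URL, port, model name, and the get_weather tool are illustrative assumptions, not part of this release.

```python
# Sketch of an OpenAI-style tool-call request; endpoint, model, and tool are assumptions.
import json
import requests

URL = "http://localhost:10002/v1/chat/completions"  # adjust to your server launch flags

payload = {
    "model": "Qwen3MoE",  # placeholder model name
    "messages": [{"role": "user", "content": "What is the weather in Beijing right now?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, defined only for this example
                "description": "Look up the current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

resp = requests.post(URL, json=payload, timeout=120)
resp.raise_for_status()
# The assistant message should contain a tool_calls entry if the model chose to call get_weather.
print(json.dumps(resp.json()["choices"][0]["message"], indent=2))
```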

v0.2.4post1

04 Apr 08:03
6617549

What's Changed

New Contributors

Full Changelog: v0.2.4...v0.2.4post1

v0.2.4

02 Apr 06:24
ac95b6c

KTransformers v0.2.4 Release Notes

We are excited to announce the official release of the long-awaited KTransformers v0.2.4!
In this version, we've added the community's most requested feature, multi-concurrency support, through a major refactor of the entire architecture that touches more than 10,000 lines of code.
Drawing inspiration from sglang's excellent architecture, we implemented high-performance asynchronous concurrent scheduling in C++, including continuous batching, chunked prefill, and more. Thanks to GPU sharing in concurrent scenarios, overall throughput also improves to a certain extent. A demonstration follows:

(Demo video: v0.2.4.mp4)

🚀 Key Updates

  1. Multi-Concurrency Support
    • Added the ability to handle multiple concurrent inference requests, receiving and executing multiple tasks simultaneously (see the request sketch after this list).
    • We implemented custom_flashinfer on top of the high-performance, highly flexible flashinfer operator library and added a variable-batch-size CUDA Graph, which further enhances flexibility while reducing memory and padding overhead.
    • In our benchmarks, overall throughput improved by approximately 130% under 4-way concurrency.
    • With support from Intel, we tested KTransformers v0.2.4 on the latest Xeon6 + MRDIMM-8800 platform. By increasing concurrency, the total output throughput increased from 17 tokens/s to 40 tokens/s. We observed that the bottleneck has now shifted to the GPU. Using a higher-end GPU than the 4090D could further improve performance.
  2. Engine Architecture Optimization
    (Engine architecture diagram)
    Inspired by sglang's scheduling framework, we refactored KTransformers into a clearer three-layer architecture in an update of roughly 11,000 lines of code, now supporting full multi-concurrency:
    • Server: Handles user requests and serves the OpenAI-compatible API.
    • Inference Engine: Executes model inference and supports chunked prefill.
    • Scheduler: Manages task scheduling and request orchestration. Supports continuous batching by organizing queued requests into batches in FCFS (first-come, first-served) order and sending them to the inference engine.
  3. Project Structure Reorganization
    All C/C++ code is now centralized under the /csrc directory.
  4. Parameter Adjustments
    Removed some legacy and deprecated launch parameters for a cleaner configuration experience.
    We plan to provide a complete parameter list and detailed documentation in future releases to facilitate flexible configuration and debugging.
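
As a quick way to exercise the new multi-concurrency path, the sketch below sends several requests in parallel to the OpenAI-compatible endpoint; the URL, port, and model name are assumptions and should be adjusted to your own launch flags.

```python
# Sketch: fire concurrent chat requests at the OpenAI-compatible server.
import concurrent.futures
import requests

URL = "http://localhost:10002/v1/chat/completions"  # assumed default; adjust to your launch flags

def ask(prompt: str) -> str:
    resp = requests.post(
        URL,
        json={
            "model": "DeepSeek-R1",  # placeholder model name
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 256,
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

# Four workers to mirror the 4-way concurrency figure quoted above.
prompts = [f"Summarize topic #{i} in one paragraph." for i in range(4)]
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    for answer in pool.map(ask, prompts):
        print(answer[:80], "...")
```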

📚 Upgrade Notes

  • Due to parameter changes, users who have installed previous versions are advised to delete the ~/.ktransformers directory and reinitialize.
  • To enable multi-concurrency, please refer to the latest documentation for configuration examples.

What's Changed

Implemented custom_flashinfer @Atream @ovowei @qiyuxinlin
Implemented balance_serve engine based on FlashInfer @qiyuxinlin @ovowei
Implemented a continuous batching scheduler in C++ @ErvinXie
release: bump version v0.2.4 by @Atream @Azure-Tang @ErvinXie @qiyuxinlin @ovowei @KMSorSMS @SkqLiao

Warning

⚠️ Please note that installing this project will replace flashinfer in your environment. It is strongly recommended to create a new conda environment!!!

v0.2.3post2

15 Mar 15:22
c51818c

What's Changed

Key Features:

Ahead of the official v0.2.4 release, we're excited to announce preliminary ROCm support for ktransformers in response to strong community demand. This update delivers enhanced heterogeneous computing capabilities for developers using AMD GPUs. #178

We extend our heartfelt gratitude to contributor @fxzjshm! The profound technical expertise demonstrated in the ROCm adaptation work has been instrumental in achieving framework compatibility. We would also like to thank the AMD team for their technology and equipment support.

Other Updates:

New Contributors

Full Changelog: v0.2.3post1...v0.2.3post2

v0.2.3post1

07 Mar 14:16
7544ead

What's Changed

Full Changelog: v0.2.3...v0.2.3post1

v0.2.3

06 Mar 09:05
63b1c85

We're excited to announce the update of KTransformers v0.2.3! You can now compile from the GitHub source code. Release packages and Docker images are being built/uploaded - stay tuned!

Key Updates:

  1. Low-Precision Inference Optimization #754

    1. Added IQ1_S/IQ2_XXS quantized matmul support, now compatible with Unsloth's DeepSeek-R1 1.58bit/2.51bit dynamic quantized weights

    2. Released DeepSeek-R1 mixed-precision model (IQ1+FP8) achieving enhanced performance:

      • 19GB VRAM usage & 140GB system memory consumption

      • MMLU score of 83.6, slightly outperforming full-precision DeepSeek-V3

      • Ongoing benchmarks: View Details (Special thanks to @moonshadow-25 and @godrosev for their huge contributions to v0.2.3)

  2. Long Context Handling Enhancement #750

    1. Implemented a chunked prefill mechanism, supporting 139K-token contexts with DeepSeek-R1 on 24GB of VRAM (see the sketch after this list)

    2. Note: As DeepSeek's native context window only supports 128K tokens, we will pause further optimizations for extended context handling.
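
The sketch below illustrates the general chunked-prefill idea referenced above, under assumed names (CHUNK_SIZE, forward_chunk); it is not the project's implementation, just the control flow that keeps per-step memory bounded by the chunk size rather than the full prompt length.

```python
# Illustrative control flow only, not KTransformers' code: the prompt is fed through
# prefill in fixed-size chunks, so peak activation memory is bounded by CHUNK_SIZE
# (an assumed knob) rather than by the full context length.
from typing import Callable, List

CHUNK_SIZE = 8192  # tokens per prefill step (assumption)

def chunked_prefill(prompt_ids: List[int], forward_chunk: Callable[[List[int], int], None]) -> None:
    """forward_chunk(chunk, start_pos) runs one prefill step and appends to the KV cache."""
    for start in range(0, len(prompt_ids), CHUNK_SIZE):
        chunk = prompt_ids[start:start + CHUNK_SIZE]
        forward_chunk(chunk, start)  # KV cache grows by len(chunk) entries

if __name__ == "__main__":
    # Toy check with a stub "model": count how many KV entries a 139K-token prompt produces.
    kv_len = 0
    def stub(chunk, start_pos):
        global kv_len
        kv_len += len(chunk)
    chunked_prefill(list(range(139_000)), stub)
    print(kv_len)  # 139000
```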


Coming Next - v0.2.4 Preview

The upcoming v0.2.4 will be the final minor release in the 0.2 series, delivering the most crucial update that transforms KTransformers from "a toy project" into "a practical solution": multi-concurrency support.

Scheduled for release within two weeks, this update will be followed by our 0.3 version development featuring:

  • AMX-powered optimizations for enhanced performance

  • Expanded hardware support, including AMD, XPU, MetaX (沐曦), Moore Threads (摩尔线程), and Ascend (昇腾) GPUs

v0.2.2rc2

01 Mar 14:47

Improve temperature argument support #721
Update to a newer torch version for the Docker image #732
Fix NUMA CPU distribution #685
Add torch support for MoE #684

v0.2.2rc1

25 Feb 16:32
9c71bcb

Hi everyone, KTransformers has been updated to v0.2.2. You can now try it by compiling from the GitHub repository source code. The release packages and Docker images are also being built and uploaded - stay tuned! The main updates in this release include:

  1. #659 Simplified MMLU Test Script and Scores: Quantization may affect model capabilities. We ran MMLU tests, where the Marlin + Q4KM quantized score dropped slightly to 81, compared with the original full-precision score of 81.6. Note that these tests are still preliminary and should be treated as reference only. For details, visit the Benchmark Documentation.
  2. #643 FP8 Kernel for Enhanced Precision and Performance: Model quantization and the weight-loading method in v0.2.1 led to some precision loss. Version 0.2.2 introduces GPU-accelerated FP8 Triton kernels, offering higher precision while maintaining performance. The MMLU score for FP8+Q4KM improved to 81.5 with negligible performance impact. We also provide corresponding weight packing scripts; further optimizations for flexible and efficient weight loading will follow. (A blockwise FP8 quantization sketch follows this list.)
  3. #657 Longer Context Support and Efficient FlashInfer MLA Operator:
    • Under 24GB VRAM, the supported context length has increased from 8K (v0.2.1) to up to 25K tokens (varies by use case), with further optimizations planned.
    • Optimized DeepSeek-V3/R1 model prefill phase: VRAM usage now scales linearly with context length.
    • Added support for matrix absorption during prefill (trading some performance for reduced KV Cache VRAM usage) and FlashInfer's MLA operator.
    • Chunk Prefill Optimization will soon be merged to the main branch to improve VRAM efficiency and performance for long-context scenarios.
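
To make the FP8 weight path above concrete, here is an illustrative blockwise FP8 quantization sketch in plain PyTorch; it is not the project's Triton kernel or weight-packing script, the 128x128 block size is an assumption, and it requires a PyTorch build that provides float8 dtypes.

```python
# Illustrative sketch only: blockwise FP8 weight quantization with one scale per
# (block x block) tile, the general idea behind storing weights in FP8.
import torch

def quantize_fp8_blockwise(w: torch.Tensor, block: int = 128):
    rows, cols = w.shape
    assert rows % block == 0 and cols % block == 0  # keep the sketch simple
    tiles = w.reshape(rows // block, block, cols // block, block)
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / torch.finfo(torch.float8_e4m3fn).max  # map each tile's max to FP8 max
    q = (tiles / scale).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scale.squeeze(1).squeeze(-1)  # FP8 weights + per-tile scales

def dequantize_fp8_blockwise(q: torch.Tensor, scale: torch.Tensor, block: int = 128):
    rows, cols = q.shape
    tiles = q.to(torch.float32).reshape(rows // block, block, cols // block, block)
    return (tiles * scale.unsqueeze(1).unsqueeze(-1)).reshape(rows, cols)

if __name__ == "__main__":
    w = torch.randn(256, 256)
    q, s = quantize_fp8_blockwise(w)
    print((dequantize_fp8_blockwise(q, s) - w).abs().max())  # small reconstruction error
```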

Feel free to explore these updates and share your feedback!

v0.2.1.post1

18 Feb 14:02
09f5c5e
  1. Fix a precision bug introduced in v0.2.1, add MMLU/MMLU-Pro tests, and fix the server #413