
[Inference] Add new wint2.75/wint2.5 quant type and support DeepseekV3 #10578


Open: lixcli wants to merge 8 commits into develop.

Conversation

lixcli (Contributor) commented May 9, 2025

PR types

New features

PR changes

Mixed-bit quantization support

  • fused_transformer_layers.py
    • Add FusedBlockMultiTransformerWINTX, which reads each layer's matmul type from a config, initializes and computes with the corresponding matmul implementation, and supports MLA absorption
    • Add helpers that map a matmul type to its linear type, group size, and input-channel adjustment:
      • get_pack_dtype_from_matmul: returns the pack dtype for a matmul type
      • get_group_size_from_matmul: returns the group size for a matmul type
      • get_shrink_wdim_from_matmul: returns how the input channel dimension is adjusted for a matmul type
    • weight_only_linear_wintx: dispatches to the matching matmul kernel by matmul type
    • wintx_fused_moe: dispatches to the matching MoE kernel by matmul type
    • Add MixBitConfig, which currently only records the config path; the config at that path is read when the network is built (see the sketch below)
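A minimal sketch of how these pieces could fit together, assuming a simple lookup table and a flat JSON config. The function and class names come from this PR; the table contents, signatures, and JSON shape are illustrative assumptions:

import json

# Hypothetical lookup table; the real per-matmul tables live in
# fused_transformer_layers.py and may differ.
_MATMUL_TABLE = {
    "wint4":    {"pack_dtype": "int8", "group_size": -1},  # channel-wise
    "wint2.75": {"pack_dtype": "int8", "group_size": 64},
    "wint2.5":  {"pack_dtype": "int8", "group_size": 64},
}

def get_pack_dtype_from_matmul(matmul_type):
    # Dtype used to store the packed quantized weight.
    return _MATMUL_TABLE[matmul_type]["pack_dtype"]

def get_group_size_from_matmul(matmul_type):
    # Quantization group size; -1 denotes channel-wise scales.
    return _MATMUL_TABLE[matmul_type]["group_size"]

class MixBitConfig:
    # Currently only records the config path; the JSON is read when
    # the network is built.
    def __init__(self, mix_bit_path):
        self.mix_bit_path = mix_bit_path

    def load(self):
        with open(self.mix_bit_path, "r") as f:
            return json.load(f)

get_shrink_wdim_from_matmul is omitted here since the input-channel adjustment depends on the packing scheme of each matmul type.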

WINTX quantization wired into DeepseekV3
DeepseekV2/V3 support importing WINTX weights

  • PaddleNLP/paddlenlp/experimental/transformers/deepseek_v2/modeling.py
    • Add set_wintx_state_dict, which loads WINTX weights (sketched below)
      • Only weights that have already been quantized and packed can be imported
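A minimal method sketch of loading pre-quantized weights. The set_value pattern mirrors the diff context shown later on this page, but the checkpoint key names and parameter attribute names here are hypothetical:

import paddle

@paddle.no_grad()
def set_wintx_state_dict(self, state_dict):
    # Weights arrive already quantized and packed offline, so they are
    # copied into the pre-created parameters without re-quantization.
    for idx in range(self.num_layers):
        # Hypothetical key naming; the exported checkpoint defines the real one.
        qweight = state_dict[f"layers.{idx}.ffn1.quant_weight"]
        scale = state_dict[f"layers.{idx}.ffn1.weight_scale"]
        self.transformer_block.ffn1_weights[idx].set_value(qweight)
        self.transformer_block.ffn1_weights_scale[idx].set_value(scale)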

WINT4 CUTLASS support

  • FusedBlockMultiTransformerWINTX gains import logic for WINT4 channel-wise quantized weights (including weight reordering)

Ultra-low-bit inference (Triton)

  • wint2.75/wint2.5 MOE_GEMM (a single kernel supports both computation modes)
  • wint4 GEMM/MOE_GEMM (see the unpacking sketch below)
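As background for the wint4 kernels, a plain NumPy reference of the unpack-and-dequantize step that such a kernel fuses into the GEMM. The nibble interleaving and group layout here are assumptions, and the wint2.75/wint2.5 formats use denser packing not covered by this sketch:

import numpy as np

def unpack_wint4(packed, scales, group_size=64):
    # packed: uint8 array holding two signed 4-bit values per byte.
    # scales: float array of shape (..., k // group_size).
    low = (packed & 0x0F).astype(np.int8)
    high = ((packed >> 4) & 0x0F).astype(np.int8)
    # Map the 4-bit codes [0, 15] onto signed [-8, 7].
    low = np.where(low >= 8, low - 16, low)
    high = np.where(high >= 8, high - 16, high)
    w = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.float32)
    w[..., 0::2] = low   # assumed nibble interleaving
    w[..., 1::2] = high
    # Apply one scale per group along the input-channel axis.
    k = w.shape[-1]
    w = w.reshape(w.shape[:-1] + (k // group_size, group_size))
    return (w * scales[..., :, None]).reshape(w.shape[:-2] + (k,))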

Description

Support wint2.75/wint2.5 inference for DeepseekV3.


paddle-bot bot commented May 9, 2025

Thanks for your contribution!


codecov bot commented May 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 1054 lines in your changes missing coverage. Please review.

Project coverage is 46.63%. Comparing base (8205e3d) to head (a1d8557).
Report is 8 commits behind head on develop.

Current head a1d8557 differs from pull request most recent head 5a1ffb2

Please upload reports for the commit 5a1ffb2 to get more accurate results.

Files with missing lines Patch % Lines
...erimental/transformers/fused_transformer_layers.py 0.00% 508 Missing ⚠️
.../experimental/transformers/deepseek_v2/modeling.py 0.00% 129 Missing ⚠️
...lenlp/experimental/wintx/wintx_fused_moe_decode.py 0.00% 123 Missing ⚠️
paddlenlp/experimental/wintx/wintx_fused_moe.py 0.00% 105 Missing ⚠️
paddlenlp/experimental/wintx/wintx_gemm.py 0.00% 92 Missing ⚠️
paddlenlp/experimental/wintx/utils.py 0.00% 61 Missing ⚠️
paddlenlp/experimental/wintx/__init__.py 0.00% 25 Missing ⚠️
paddlenlp/experimental/wintx/wint4_fused_moe.py 0.00% 7 Missing ⚠️
paddlenlp/ops/triton_ops/triton_utils.py 0.00% 4 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (46.63%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10578      +/-   ##
===========================================
- Coverage    46.98%   46.63%   -0.36%     
===========================================
  Files          799      805       +6     
  Lines       132246   133244     +998     
===========================================
+ Hits         62135    62137       +2     
- Misses       70111    71107     +996     


)
import json

with open(mix_bit_path, "r") as f:


Verify that mix_bit_path exists; if the file doesn't exist, initialize a default MixBitConfig and log the fallback.
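A sketch of the suggested guard, assuming MixBitConfig has a default constructor (an assumption; the PR may not provide one):

import json
import logging
import os

logger = logging.getLogger(__name__)

if not os.path.exists(mix_bit_path):
    # Make the fallback visible instead of failing later with a bare IOError.
    logger.warning("mix_bit_path %s not found; using default MixBitConfig", mix_bit_path)
    config = MixBitConfig()  # hypothetical default constructor
else:
    with open(mix_bit_path, "r") as f:
        config = json.load(f)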

lixcli (Contributor, Author) replied:

This code path runs only when WINTX quantization is enabled or mix_bit_path is given; if the file does not exist at that point, an error is already raised, so there is no need to log it again.

@@ -1238,9 +1272,292 @@ def set_state_dict(self, state_dict):
self.transformer_block.shared_expert_ffn1_weights[idx].set_value(shared_expert_ffn1_weight)
self.transformer_block.shared_expert_ffn2_weights[idx].set_value(shared_expert_ffn2_weight)

@paddle.no_grad()
def set_wintx_state_dict(self, state_dict):


It would be easier to maintain if the code shared with set_state_dict() were extracted into helper functions.
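For example, the repeated set_value tail of both loaders could live in one hypothetical helper (name and signature assumed):

def _assign_layer_weight(self, weights_list, idx, tensor):
    # Shared tail of set_state_dict and set_wintx_state_dict: cast the
    # incoming tensor to the parameter dtype and copy it in place.
    weights_list[idx].set_value(tensor.astype(weights_list[idx].dtype))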

lixcli (Contributor, Author) replied:

The original set_state_dict is already very bloated and does not support loading quantized weights directly; folding the WINTX loading logic into it would make it considerably more complicated.

@@ -0,0 +1,19 @@
# WINTX Triton kernel
minghaoBD commented May 13, 2025

What format should mix_bits_config.json follow? You might consider showing it directly in the README.

lixcli (Contributor, Author) replied:

It is produced when exporting the model, so users don't need to worry about it at present.
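For readers following the thread, a purely hypothetical illustration of what a per-layer mix-bit config could contain; the real schema is produced by the export tool and is not documented in this PR:

# Hypothetical contents of mix_bits_config.json, expressed as a Python dict:
# each weight name maps to the matmul/quant type used for that layer.
example_mix_bits_config = {
    "layers.0.self_attn.qkv_proj": "wint4",
    "layers.0.mlp.experts": "wint2.75",
    "layers.1.mlp.experts": "wint2.5",
}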

@lixcli lixcli requested a review from minghaoBD May 13, 2025 10:22
return ffn2_out


class FusedMultiTransformerWINTX(FusedMultiTransformerBase):
A reviewer (Contributor) commented:

This class has a lot in common with existing classes; could it be implemented on top of an existing Transformer class?


lixcli (Contributor, Author) replied:

The most similar existing class is FusedMultiTransformerWeightOnly, and basing this work on it would require extensive changes. The main problems:

  1. FusedMultiTransformerWeightOnly does not support a flexible mixed-bit configuration
  2. Its weight shape setup differs significantly from the weights already exported
