
[Inference] Add new wint2.75/wint2.5 quant type and support DeepseekV3 #10578


Open: lixcli wants to merge 8 commits into develop.

Conversation

lixcli (Contributor) commented May 9, 2025

PR types

New features

PR changes

Mixed-bit quantization support

  • fused_transformer_layers.py
    • Add FusedBlockMultiTransformerWINTX, which reads each layer's matmul type from a config, initializes and computes with the corresponding matmul implementation, and supports MLA absorption
    • Add helpers that map a matmul type to its linear type, group size, and input-channel adjustment:
      • get_pack_dtype_from_matmul: returns the pack dtype for a matmul type
      • get_group_size_from_matmul: returns the group size for a matmul type
      • get_shrink_wdim_from_matmul: returns how the input channel dimension is adjusted for a matmul type
    • weight_only_linear_wintx: dispatches to the matching matmul kernel by matmul type
    • wintx_fused_moe: dispatches to the matching MoE kernel by matmul type
    • Add MixBitConfig, which currently only records the config path; the config at that path is read when the network is built (see the sketch below)
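A minimal sketch of how these pieces could fit together, assuming a simple lookup table and a flat JSON config. The function and class names come from this PR; the table contents, signatures, and JSON shape are illustrative assumptions:

import json

# Hypothetical lookup table; the real per-matmul tables live in
# fused_transformer_layers.py and may differ.
_MATMUL_TABLE = {
    "wint4":    {"pack_dtype": "int8", "group_size": -1},  # channel-wise
    "wint2.75": {"pack_dtype": "int8", "group_size": 64},
    "wint2.5":  {"pack_dtype": "int8", "group_size": 64},
}

def get_pack_dtype_from_matmul(matmul_type):
    # Dtype used to store the packed quantized weight.
    return _MATMUL_TABLE[matmul_type]["pack_dtype"]

def get_group_size_from_matmul(matmul_type):
    # Quantization group size; -1 denotes channel-wise scales.
    return _MATMUL_TABLE[matmul_type]["group_size"]

class MixBitConfig:
    # Currently only records the config path; the JSON is read when
    # the network is built.
    def __init__(self, mix_bit_path):
        self.mix_bit_path = mix_bit_path

    def load(self):
        with open(self.mix_bit_path, "r") as f:
            return json.load(f)

get_shrink_wdim_from_matmul is omitted here since the input-channel adjustment depends on the packing scheme of each matmul type.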

WINTX quantization wired into DeepseekV3
DeepseekV2/V3 support importing WINTX weights

  • PaddleNLP/paddlenlp/experimental/transformers/deepseek_v2/modeling.py
    • Add set_wintx_state_dict, which loads WINTX weights (sketched below)
      • Only weights that have already been quantized and packed can be imported
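A minimal method sketch of loading pre-quantized weights. The set_value pattern mirrors the diff context shown later on this page, but the checkpoint key names and parameter attribute names here are hypothetical:

import paddle

@paddle.no_grad()
def set_wintx_state_dict(self, state_dict):
    # Weights arrive already quantized and packed offline, so they are
    # copied into the pre-created parameters without re-quantization.
    for idx in range(self.num_layers):
        # Hypothetical key naming; the exported checkpoint defines the real one.
        qweight = state_dict[f"layers.{idx}.ffn1.quant_weight"]
        scale = state_dict[f"layers.{idx}.ffn1.weight_scale"]
        self.transformer_block.ffn1_weights[idx].set_value(qweight)
        self.transformer_block.ffn1_weights_scale[idx].set_value(scale)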

WINT4 CUTLASS support

  • FusedBlockMultiTransformerWINTX gains import logic for WINT4 channel-wise quantized weights (including weight reordering)

Ultra-low-bit inference (Triton)

  • wint2.75/wint2.5 MOE_GEMM (a single kernel supports both computation modes)
  • wint4 GEMM/MOE_GEMM (see the unpacking sketch below)
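As background for the wint4 kernels, a plain NumPy reference of the unpack-and-dequantize step that such a kernel fuses into the GEMM. The nibble interleaving and group layout here are assumptions, and the wint2.75/wint2.5 formats use denser packing not covered by this sketch:

import numpy as np

def unpack_wint4(packed, scales, group_size=64):
    # packed: uint8 array holding two signed 4-bit values per byte.
    # scales: float array of shape (..., k // group_size).
    low = (packed & 0x0F).astype(np.int8)
    high = ((packed >> 4) & 0x0F).astype(np.int8)
    # Map the 4-bit codes [0, 15] onto signed [-8, 7].
    low = np.where(low >= 8, low - 16, low)
    high = np.where(high >= 8, high - 16, high)
    w = np.empty(packed.shape[:-1] + (packed.shape[-1] * 2,), dtype=np.float32)
    w[..., 0::2] = low   # assumed nibble interleaving
    w[..., 1::2] = high
    # Apply one scale per group along the input-channel axis.
    k = w.shape[-1]
    w = w.reshape(w.shape[:-1] + (k // group_size, group_size))
    return (w * scales[..., :, None]).reshape(w.shape[:-2] + (k,))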

Description

Support wint2.75/wint2.5 inference for DeepseekV3.


paddle-bot bot commented May 9, 2025

Thanks for your contribution!


codecov bot commented May 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 1054 lines in your changes missing coverage. Please review.

Project coverage is 46.63%. Comparing base (8205e3d) to head (a1d8557).
Report is 8 commits behind head on develop.

Current head a1d8557 differs from pull request most recent head 5a1ffb2

Please upload reports for the commit 5a1ffb2 to get more accurate results.

Files with missing lines Patch % Lines
...erimental/transformers/fused_transformer_layers.py 0.00% 508 Missing ⚠️
.../experimental/transformers/deepseek_v2/modeling.py 0.00% 129 Missing ⚠️
...lenlp/experimental/wintx/wintx_fused_moe_decode.py 0.00% 123 Missing ⚠️
paddlenlp/experimental/wintx/wintx_fused_moe.py 0.00% 105 Missing ⚠️
paddlenlp/experimental/wintx/wintx_gemm.py 0.00% 92 Missing ⚠️
paddlenlp/experimental/wintx/utils.py 0.00% 61 Missing ⚠️
paddlenlp/experimental/wintx/__init__.py 0.00% 25 Missing ⚠️
paddlenlp/experimental/wintx/wint4_fused_moe.py 0.00% 7 Missing ⚠️
paddlenlp/ops/triton_ops/triton_utils.py 0.00% 4 Missing ⚠️

❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage.
❌ Your project check has failed because the head coverage (46.63%) is below the target coverage (58.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop   #10578      +/-   ##
===========================================
- Coverage    46.98%   46.63%   -0.36%     
===========================================
  Files          799      805       +6     
  Lines       132246   133244     +998     
===========================================
+ Hits         62135    62137       +2     
- Misses       70111    71107     +996     


)
import json

with open(mix_bit_path, "r") as f:


Verify that mix_bit_path exists; if the file doesn't exist, initialize a default MixBitConfig and log the fallback.
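A sketch of the suggested guard, assuming MixBitConfig has a default constructor (an assumption; the PR may not provide one):

import json
import logging
import os

logger = logging.getLogger(__name__)

if not os.path.exists(mix_bit_path):
    # Make the fallback visible instead of failing later with a bare IOError.
    logger.warning("mix_bit_path %s not found; using default MixBitConfig", mix_bit_path)
    config = MixBitConfig()  # hypothetical default constructor
else:
    with open(mix_bit_path, "r") as f:
        config = json.load(f)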

lixcli (Contributor, Author) replied:

This code path runs only when WINTX quantization is enabled or mix_bit_path is given; if the file does not exist at that point, an error is already raised, so there is no need to log it again.

@@ -1238,9 +1272,292 @@ def set_state_dict(self, state_dict):
self.transformer_block.shared_expert_ffn1_weights[idx].set_value(shared_expert_ffn1_weight)
self.transformer_block.shared_expert_ffn2_weights[idx].set_value(shared_expert_ffn2_weight)

@paddle.no_grad()
def set_wintx_state_dict(self, state_dict):


It would be easier to maintain if the code shared with set_state_dict() were extracted into helper functions.
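For example, the repeated set_value tail of both loaders could live in one hypothetical helper (name and signature assumed):

def _assign_layer_weight(self, weights_list, idx, tensor):
    # Shared tail of set_state_dict and set_wintx_state_dict: cast the
    # incoming tensor to the parameter dtype and copy it in place.
    weights_list[idx].set_value(tensor.astype(weights_list[idx].dtype))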

lixcli (Contributor, Author) replied:

The original set_state_dict is already very bloated and does not support loading quantized weights directly; folding the WINTX loading logic into it would make it considerably more complicated.

@@ -0,0 +1,19 @@
# WINTX Triton kernel
minghaoBD commented May 13, 2025

What format should mix_bits_config.json follow? You might consider showing it directly in the README.

lixcli (Contributor, Author) replied:

It is produced when exporting the model, so users don't need to worry about it at present.
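For readers following the thread, a purely hypothetical illustration of what a per-layer mix-bit config could contain; the real schema is produced by the export tool and is not documented in this PR:

# Hypothetical contents of mix_bits_config.json, expressed as a Python dict:
# each weight name maps to the matmul/quant type used for that layer.
example_mix_bits_config = {
    "layers.0.self_attn.qkv_proj": "wint4",
    "layers.0.mlp.experts": "wint2.75",
    "layers.1.mlp.experts": "wint2.5",
}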

@lixcli lixcli requested a review from minghaoBD May 13, 2025 10:22
return ffn2_out


class FusedMultiTransformerWINTX(FusedMultiTransformerBase):
A reviewer (Contributor) commented:

This class has a lot in common with existing classes; could it be implemented on top of an existing Transformer class?


lixcli (Contributor, Author) replied:

The most similar existing class is FusedMultiTransformerWeightOnly, and basing this work on it would require extensive changes. The main problems:

  1. FusedMultiTransformerWeightOnly does not support a flexible mixed-bit configuration
  2. Its weight shape setup differs significantly from the weights already exported
