Feat/blackwell sm100 support #2670

celsowm · 2025-07-01T12:53:20Z

This change integrates specific, expert-provided CUTLASS heuristic
configurations for the NVIDIA Blackwell (SM100) GPU architecture,
replacing previous placeholders. This includes:

- Updated custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h:
  - Populated CutlassTileConfigSM100 enum with specific tile shapes
    (e.g., CtaShape64x64x128B, CtaShape128x128x128B) suitable for SM100.
  - Added FP4_ONLY to CandidateConfigTypeParam for new FP4 paths.
- Updated custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu:
  - Implemented get_candidate_tiles_sm100 with detailed logic for
    selecting tile configurations based on GROUPED_GEMM and FP4_ONLY flags,
    using the new SM100 tile enums.
  - Implemented supports_mcast_along_m_sm100 and
    supports_mcast_along_n_sm100 with specific tile checks for Blackwell.
  - Updated the sm == 100 (Blackwell) block in get_candidate_configs
    to use these new helper functions and accurately populate candidate
    kernel configurations for various cluster shapes.
- custom_ops/setup_ops.py remains configured to compile for
  arch=compute_100a,code=sm_100a with CUDA 12.9+ for these features.

This aligns the codebase with heuristic configurations similar to those
in upstream TensorRT-LLM / CUTLASS for Blackwell, enabling more
performant kernel selection on this new architecture.
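
For readers unfamiliar with the heuristic, the flow described above (pick tiles from the flags, check multicast support per tile, then expand into cluster-shape candidates) can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in cutlass_heuristic.cu. The namespace, the flag bit values, the ClusterShape names, the Candidate struct, the get_candidate_configs_sm100 helper, and every selection rule (which tiles each flag combination yields, which tiles allow multicast) are assumptions made for the example. Only CutlassTileConfigSM100, CtaShape64x64x128B, CtaShape128x128x128B, CandidateConfigTypeParam, GROUPED_GEMM, FP4_ONLY, get_candidate_tiles_sm100, and the two supports_mcast_along_*_sm100 names come from the description.

```cpp
// Sketch only: flag values, selection rules, and multicast rules are
// placeholders for illustration, not the contents of cutlass_heuristic.cu.
#include <cstdint>
#include <vector>

namespace sm100_heuristic_sketch {

// Only the two tile shapes named in the PR description are listed here.
enum class CutlassTileConfigSM100 {
    CtaShape64x64x128B,
    CtaShape128x128x128B,
};

// Hypothetical cluster-shape names (M x N x K thread blocks per cluster).
enum class ClusterShape { Shape1x1x1, Shape2x1x1, Shape1x2x1 };

// Bit flags; the numeric values are assumed.
enum CandidateConfigTypeParam : std::uint32_t {
    NONE = 0u,
    GROUPED_GEMM = 1u << 0,
    FP4_ONLY = 1u << 1,
};

// Hypothetical pairing of a tile with a cluster shape.
struct Candidate {
    CutlassTileConfigSM100 tile;
    ClusterShape cluster;
};

inline std::vector<CutlassTileConfigSM100> get_candidate_tiles_sm100(std::uint32_t flags) {
    const bool grouped = (flags & GROUPED_GEMM) != 0;
    const bool fp4 = (flags & FP4_ONLY) != 0;
    if (grouped && fp4) {
        // Assumed rule: FP4 grouped GEMM only tries the larger tile.
        return {CutlassTileConfigSM100::CtaShape128x128x128B};
    }
    if (grouped) {
        // Assumed rule: other grouped GEMMs try both tiles.
        return {CutlassTileConfigSM100::CtaShape64x64x128B,
                CutlassTileConfigSM100::CtaShape128x128x128B};
    }
    // Assumed rule: everything else falls back to the larger tile.
    return {CutlassTileConfigSM100::CtaShape128x128x128B};
}

// Assumed rule: only the 128x128 tile may multicast along M.
inline bool supports_mcast_along_m_sm100(CutlassTileConfigSM100 tile) {
    return tile == CutlassTileConfigSM100::CtaShape128x128x128B;
}

// Assumed rule: only the 128x128 tile may multicast along N.
inline bool supports_mcast_along_n_sm100(CutlassTileConfigSM100 tile) {
    return tile == CutlassTileConfigSM100::CtaShape128x128x128B;
}

// Expand each tile into one candidate per permitted cluster shape.
inline std::vector<Candidate> get_candidate_configs_sm100(std::uint32_t flags) {
    std::vector<Candidate> out;
    for (CutlassTileConfigSM100 tile : get_candidate_tiles_sm100(flags)) {
        out.push_back({tile, ClusterShape::Shape1x1x1});
        if (supports_mcast_along_m_sm100(tile)) {
            out.push_back({tile, ClusterShape::Shape2x1x1});
        }
        if (supports_mcast_along_n_sm100(tile)) {
            out.push_back({tile, ClusterShape::Shape1x2x1});
        }
    }
    return out;
}

}  // namespace sm100_heuristic_sketch
```

Under these assumed rules, a GROUPED_GEMM request yields four candidates: the 64x64 tile with a 1x1x1 cluster, plus the 128x128 tile with each of the three cluster shapes. The real tile lists and multicast checks live in the PR diff and will differ.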

This change introduces initial support for the NVIDIA Blackwell GPU
architecture, specifically targeting SM100 (Compute Capability 10.x) with
'100a' architecture-specific features (e.g., for CUTLASS).

Key changes:
- Updated custom_ops/setup_ops.py to generate appropriate gencode flags
  (arch=compute_100a,code=sm_100a) when '100' is specified in
  FD_BUILDING_ARCS. Requires CUDA 12.9+.
- Updated custom_ops/gpu_ops/cutlass_extensions/gemm_configs.h:
  - Added CutlassTileConfigSM100 enum (with placeholder tile shapes).
  - Added BLACKWELL to CandidateConfigTypeParam.
  - Updated CutlassGemmConfig struct with is_sm100 flag,
    tile_config_sm100, and new constructor for SM100.
  - Modified toString() and fromString() for SM100 support.
- Updated custom_ops/gpu_ops/cutlass_kernels/cutlass_heuristic.cu:
  - Added get_candidate_tiles_sm100() (with placeholder tiles).
  - Added placeholder mcast support functions for SM100.
  - Updated get_candidate_configs() to include SM100 paths using the
    BLACKWELL flag and new SM100 config types.
- Updated build.sh with comments to guide users on specifying '100' for
  Blackwell in FD_BUILDING_ARCS.

Further work:
- Optimal CUTLASS tile configurations for SM100 need to be researched and
  updated in cutlass_heuristic.cu.
- Kernel auto-generation scripts in custom_ops/utils/ may need
  SM100-specific versions if Blackwell's hardware features for FP8/TMA
  differ significantly from SM90.
- Compatibility of third-party libraries (CUTLASS v3.8.0, DeepGEMM) with
  Blackwell should be fully verified.
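
To make the gemm_configs.h part of this commit easier to picture, here is a rough sketch of a config struct that carries an is_sm100 flag and a tile_config_sm100 field and round-trips through toString() and fromString(). It is an assumption-laden illustration, not the actual CutlassGemmConfig: the struct name GemmConfigSketch, the namespace, the tile_name helper, the "sm100:<tile>" string format, and the "non_sm100" fallback are all invented for the example. The real struct has additional fields and its own serialization format.

```cpp
// Sketch only: GemmConfigSketch and its string format are hypothetical
// and do not mirror the real CutlassGemmConfig in gemm_configs.h.
#include <initializer_list>
#include <string>

namespace sm100_config_sketch {

enum class CutlassTileConfigSM100 {
    Undefined,
    CtaShape64x64x128B,
    CtaShape128x128x128B,
};

// Hypothetical helper mapping each tile enum to a printable name.
inline const char* tile_name(CutlassTileConfigSM100 tile) {
    switch (tile) {
        case CutlassTileConfigSM100::Undefined: return "Undefined";
        case CutlassTileConfigSM100::CtaShape64x64x128B: return "CtaShape64x64x128B";
        case CutlassTileConfigSM100::CtaShape128x128x128B: return "CtaShape128x128x128B";
    }
    return "Undefined";
}

struct GemmConfigSketch {
    bool is_sm100 = false;
    CutlassTileConfigSM100 tile_config_sm100 = CutlassTileConfigSM100::Undefined;

    GemmConfigSketch() = default;

    // SM100 constructor: choosing an SM100 tile also sets the flag.
    explicit GemmConfigSketch(CutlassTileConfigSM100 tile)
        : is_sm100(true), tile_config_sm100(tile) {}

    // Assumed format: "sm100:<tile name>" for SM100 configs.
    std::string toString() const {
        if (!is_sm100) {
            return "non_sm100";
        }
        return std::string("sm100:") + tile_name(tile_config_sm100);
    }

    // Parse the assumed format; anything unrecognized yields a default config.
    static GemmConfigSketch fromString(const std::string& text) {
        const std::string prefix = "sm100:";
        if (text.compare(0, prefix.size(), prefix) != 0) {
            return GemmConfigSketch{};
        }
        const std::string name = text.substr(prefix.size());
        for (CutlassTileConfigSM100 tile : {CutlassTileConfigSM100::Undefined,
                                            CutlassTileConfigSM100::CtaShape64x64x128B,
                                            CutlassTileConfigSM100::CtaShape128x128x128B}) {
            if (name == tile_name(tile)) {
                return GemmConfigSketch{tile};
            }
        }
        return GemmConfigSketch{};
    }
};

}  // namespace sm100_config_sketch
```

Keeping the SM100 tile in its own field, guarded by a flag, lets the existing non-SM100 fields stay untouched, which matches the additive way the commit message describes the change.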


CLAassistant · 2025-07-01T12:53:33Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 3 committers have signed the CLA.

✅ Jiang-Jia-Jun
❌ celsowm
❌ google-labs-jules[bot]
You have signed the CLA already but the status is still pending? Let us recheck it.

paddle-bot · 2025-07-01T12:53:35Z

Thanks for your contribution!

google-labs-jules bot and others added 3 commits July 1, 2025 02:53

Merge branch 'PaddlePaddle:develop' into feat/blackwell-sm100-support

699487c

paddle-bot bot added the contributor label Jul 1, 2025

vivienfanghuagood approved these changes Jul 3, 2025


Merge branch 'develop' into feat/blackwell-sm100-support

3ecd723
