Build mxfp4 kernel for sm120a #2285

Draft · wants to merge 1 commit into main
Conversation

gau-nernst (Collaborator)

Just making some quick changes here to see if I can build the mxfp4 kernel on a 5090 (sm120). Eventually this will be put under torchao._C_cutlass_120a?

Setting -DCUTLASS_DEBUG_TRACE_LEVEL=1 so I can see the debug trace.

To build (using torch==2.8.0.dev20250530+cu128):

TORCH_CUDA_ARCH_LIST=12.0a uv pip install -e . -v --no-build-isolation

Running pytest test/prototype/mx_formats/test_mx_mm.py -v produces the following trace:

/home/thien/code/ao/third_party/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:244    workspace_bytes: 0
/home/thien/code/ao/third_party/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:312  GemmUniversal::initialize() - workspace 0, stream: null
/home/thien/code/ao/third_party/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:201  to_underlying_arguments():
/home/thien/code/ao/third_party/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:214    WARNING: Arguments do not include a valid SM count.
  For optimal performance, populate the arguments KernelHardwareInfo struct with the SM count.
/home/thien/code/ao/third_party/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:218  to_underlying_arguments(): Setting persistent grid SM count to 170
/home/thien/code/ao/third_party/cutlass/include/cutlass/gemm/kernel/sm90_gemm_tma_warpspecialized_cooperative.hpp:224    WARNING: Arguments do not include a valid max cluster count.
  For optimal performance, populate the arguments KernelHardwareInfo struct with the max_active_clusters.
/home/thien/code/ao/third_party/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:336    Setting smem size to 101376
/home/thien/code/ao/third_party/cutlass/include/cutlass/gemm/device/gemm_universal_adapter.h:343    cudaFuncSetAttribute() returned error: invalid resource handle

cudaFuncSetAttribute() returned error: invalid resource handle suggests that the function handle itself is invalid (https://github.com/NVIDIA/cutlass/blob/ad7b2f5e84fcfa124cb02b91d5bd26d238c0459e/include/cutlass/gemm/device/gemm_universal_adapter.h#L338), which is quite strange...
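
For context, a minimal standalone sketch (not code from this PR; kernel_fn is a hypothetical stand-in for the device_kernel<GemmKernel> entry point that gemm_universal_adapter.h passes in) showing the failing call, plus a cudaFuncGetAttributes probe that helps distinguish a bad function handle, e.g. no sm_120a code in the fat binary, from a bad smem size:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for CUTLASS's device_kernel<GemmKernel> entry point.
__global__ void kernel_fn() {}

int main() {
  // Probe: if the fat binary has no code for the running device's arch,
  // this call fails instead of reporting the compiled binaryVersion.
  cudaFuncAttributes attr{};
  cudaError_t err =
      cudaFuncGetAttributes(&attr, reinterpret_cast<const void*>(kernel_fn));
  if (err != cudaSuccess) {
    printf("cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
  } else {
    printf("compiled binaryVersion: %d\n", attr.binaryVersion);  // 120 for sm_120
  }

  // The call that fails in the trace above: raising the dynamic smem limit
  // beyond the default 48 KiB, as gemm_universal_adapter.h does.
  err = cudaFuncSetAttribute(reinterpret_cast<const void*>(kernel_fn),
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             101376 /* smem size from the trace */);
  printf("cudaFuncSetAttribute: %s\n", cudaGetErrorString(err));
  return 0;
}
```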

For reference, I can build and run this example from Cutlass: https://github.com/NVIDIA/cutlass/blob/v3.9.2/examples/79_blackwell_geforce_gemm/79a_blackwell_geforce_nvfp4_bf16_gemm.cu. The changes in this PR have been taken from that example. When building it with CUTLASS_DEBUG_TRACE_LEVEL=1, the same warnings from sm90_gemm_tma_warpspecialized_cooperative.hpp appear, so those are probably not the issue.

@drisspg

cc @alexsamardzic in case you have faced this error with Cutlass before

pytorch-bot (bot) commented May 31, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2285

Note: Links to docs will display an error until the docs builds have been completed.

❌ 10 New Failures, 1 Unrelated Failure

As of commit 5abfe97 with merge base e51ffd9:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following job failed but was present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label · May 31, 2025
drisspg (Contributor) commented May 31, 2025

The first thing that comes to mind is that the example is doing NVFP4, whereas all our recipes are doing MXFP4, e.g. https://github.com/pytorch/ao/pull/2285/files#diff-e155558499c3b1fbab1b5d3b60f032bf1e636908a8ef50a1de33bff518107019R240-R241 needs to change as well. For inference we have MXFP8 and MXFP4 support, and I am planning to add an NVFP4 scaling recipe next. That being said, I would imagine that MXFP4 is supported on the 5090...
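
For concreteness, a rough sketch of the operand-type distinction using the CUTLASS 3.9 block-scaled wrappers (type names as used in example 79; the include path is my guess, and nothing here is copied from this PR's diff):

```cpp
#include "cutlass/numeric_types.h"  // assumed header for the FP4 types

// NVFP4, as configured in example 79a: FP4 (e2m1) data paired with
// FP8 (e4m3) scale factors over blocks of 16 elements.
using ElementA_nvfp4 = cutlass::nv_float4_t<cutlass::float_e2m1_t>;

// MXFP4, which is what torchao's MX recipes produce: FP4 (e2m1) data
// paired with e8m0 scale factors over blocks of 32 elements.
using ElementA_mxfp4 = cutlass::mx_float4_t<cutlass::float_e2m1_t>;
```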

cc @syed-ahmed

gau-nernst (Collaborator, Author)

I noticed that as well:

  • Changing the torchao kernel to nvfp4 results in the same error
  • Changing the cutlass example to mxfp4 still works

😭

syed-ahmed (Contributor)

Per the cutlass docs, I believe MXFP4 is supported on the 5090: https://github.com/NVIDIA/cutlass/blob/9d165a3b8ef446a7ff3db198413f82bcb83f46fe/media/docs/cpp/blackwell_functionality.md#blackwell-sm120-gemms

However, note the section that discusses the differences from sm100, so it's possible we need more changes to the kernel in torchao. Also, what CUDA version are you using? I'd assume you'd need a fairly recent one. I'll try to guide more next week.
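
As a quick sanity check (a hypothetical standalone snippet, not part of the PR), the runtime's view of the device and CUDA version can be dumped directly; my understanding is that SM120 block-scaled GEMMs need at least CUDA 12.8:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int runtime_version = 0;
  cudaRuntimeGetVersion(&runtime_version);  // e.g. 12090 for CUDA 12.9

  cudaDeviceProp prop{};
  cudaGetDeviceProperties(&prop, /*device=*/0);

  // Expect major=12, minor=0 on a 5090 (sm_120).
  printf("CUDA runtime %d, device %s, compute capability %d.%d\n",
         runtime_version, prop.name, prop.major, prop.minor);
  return 0;
}
```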

gau-nernst (Collaborator, Author)

@syed-ahmed I'm using CUDA 12.9

The strange thing is that the cutlass example works but the one in torchao doesn't. I carefully compared the two and couldn't spot any difference in the template arguments.

syed-ahmed (Contributor)

How about the test? Are the inputs similar to the cutlass example?

Labels: CLA Signed
4 participants