26 Sep 22:42

rolandschulz

v0.5 Latest

New in CUTLASS SYCL 0.5

Major Architecture Changes

Xe Rearchitecture (#477): Complete redesign of the Xe CuTe atoms
- New MMA atoms for improved performance
- Enhanced 2D copy atoms (loads, stores, prefetch with VNNI/transpose support)
- New 2D copy helpers (low-level make_block_2d_copy and high-level make_block_2d_copy_{A,B,C})
- Generic and optimized reorder atoms for {int4, uint4, int8, uint8, e2m1, e4m3, e5m2} -> {half, bfloat16}
- Requires IGC version v2.18.5 or later
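The reorder atoms convert packed narrow types to half or bfloat16 on the device. The scalar sketch below is not the CuTe implementation. It only illustrates what the conversion computes for the two FP8 inputs (e4m3, e5m2), assuming the OCP FP8 encodings with E4M3 in its finite-only form (no infinities, a single NaN pattern per sign). `decode_e4m3` and `decode_e5m2` are hypothetical helper names and decode to float for readability.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Decode one FP8 E4M3 byte: 1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7.
// Assumption: finite-only variant, where only S.1111.111 encodes NaN,
// so the largest finite magnitude is 448.
float decode_e4m3(std::uint8_t bits) {
    const bool negative = (bits & 0x80) != 0;
    const int exponent = (bits >> 3) & 0xF;
    const int mantissa = bits & 0x7;
    if (exponent == 0xF && mantissa == 0x7) {
        return std::numeric_limits<float>::quiet_NaN();
    }
    float magnitude;
    if (exponent == 0) {
        // Subnormal: 0.mmm * 2^(1 - bias)
        magnitude = std::ldexp(static_cast<float>(mantissa) / 8.0f, 1 - 7);
    } else {
        // Normal: 1.mmm * 2^(exponent - bias)
        magnitude = std::ldexp(1.0f + static_cast<float>(mantissa) / 8.0f, exponent - 7);
    }
    return negative ? -magnitude : magnitude;
}

// Decode one FP8 E5M2 byte: 1 sign bit, 5 exponent bits, 2 mantissa bits, bias 15.
// E5M2 follows IEEE 754 conventions: an all-ones exponent encodes Inf or NaN.
float decode_e5m2(std::uint8_t bits) {
    const bool negative = (bits & 0x80) != 0;
    const int exponent = (bits >> 2) & 0x1F;
    const int mantissa = bits & 0x3;
    float magnitude;
    if (exponent == 0x1F) {
        if (mantissa != 0) {
            return std::numeric_limits<float>::quiet_NaN();
        }
        magnitude = std::numeric_limits<float>::infinity();
    } else if (exponent == 0) {
        // Subnormal: 0.mm * 2^(1 - bias)
        magnitude = std::ldexp(static_cast<float>(mantissa) / 4.0f, 1 - 15);
    } else {
        // Normal: 1.mm * 2^(exponent - bias)
        magnitude = std::ldexp(1.0f + static_cast<float>(mantissa) / 4.0f, exponent - 15);
    }
    return negative ? -magnitude : magnitude;
}
```

One reason these conversions can be cheap: E5M2 shares FP16's sign, exponent width and bias, so an E5M2 byte shifted left by 8 bits is already a valid FP16 bit pattern, and every finite E4M3 value is exactly representable in both half and bfloat16.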

New Features

G++ Host Compiler Support (#490): Support for G++ 13 as host compiler
- Migrated syclcompat to this repository as cutlasscompat for better compatibility
- Fixed compilation issues when using G++ instead of clang++
- Added new CI workflow for testing G++ host compiler builds
- Enhanced build system to support -DDPCPP_HOST_COMPILER=g++ option
Grouped GEMM for Mixed Dtype (#457): Extended grouped GEMM support to mixed precision operations
- Added support for BF16 + S8 mixed dtype grouped GEMM
- Added support for FP16 + U4 mixed dtype grouped GEMM
- New examples: 10_bmg_grouped_gemm_bf16_f16_s8.cpp and 10_bmg_grouped_gemm_f16_u4.cpp
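A grouped GEMM runs several independent GEMMs, each with its own shape, in one call. The host-side reference below is a minimal sketch of what a mixed-dtype grouped GEMM computes, not the code in the new examples. Assumptions: float stands in for BF16 on the A side, B is int8 with one scale per column, all matrices are row-major, and `GroupProblem` and `grouped_gemm_reference` are hypothetical names rather than CUTLASS APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One problem in the group (hypothetical type, not a CUTLASS API):
// C (m x n) = A (m x k) * dequant(B) (k x n), all row-major.
struct GroupProblem {
    int m, n, k;
    std::vector<float> a;         // m x k activations (float stands in for BF16)
    std::vector<std::int8_t> b;   // k x n quantized weights
    std::vector<float> b_scale;   // n per-column (channel-wise) scales
    std::vector<float> c;         // m x n output, written by the reference
};

// Host reference for a mixed-dtype grouped GEMM: each problem can have its own
// m, n and k, but the whole group is processed by a single call.
void grouped_gemm_reference(std::vector<GroupProblem>& group) {
    for (GroupProblem& p : group) {
        assert(p.a.size() == static_cast<std::size_t>(p.m) * p.k);
        assert(p.b.size() == static_cast<std::size_t>(p.k) * p.n);
        assert(p.b_scale.size() == static_cast<std::size_t>(p.n));
        p.c.assign(static_cast<std::size_t>(p.m) * p.n, 0.0f);
        for (int i = 0; i < p.m; ++i) {
            for (int j = 0; j < p.n; ++j) {
                float acc = 0.0f;
                for (int kk = 0; kk < p.k; ++kk) {
                    // Dequantize the int8 weight on the fly, then multiply-accumulate.
                    const float b_val = static_cast<float>(p.b[kk * p.n + j]) * p.b_scale[j];
                    acc += p.a[i * p.k + kk] * b_val;
                }
                p.c[i * p.n + j] = acc;
            }
        }
    }
}
```

A device kernel would map the problems onto work-groups instead of looping over them sequentially; a reference like this is mainly useful for validating such a kernel's output.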
See the CHANGELOG-SYCL for details of all past releases and updates.

30 Jun 21:12

mehdi-goli

v3.9-0.3

What's Changed

CUTLASS 3.9.2 SYCL backend Version 0.3 (2025-06-30)

Add support for GEMM FP8 (E5M2 and E4M3)
Add example for GEMM FP8 with support for channel-wise and group-wise quantization
Add support for Grouped GEMM FP8
Improve performance for FP8 to FP16 conversion
Add support for epilogue data conversion
Add support for FP16 GEMM with FP16 accumulator
Add support for BF16 GEMM with BF16 accumulator
Add support for mixed dtype GEMM with support for tensor-wise, channel-wise and group-wise quantization
Add example of mixed dtype BF16 + INT8 using channel-wise and group-wise quantization
Add example of mixed dtype FP16 + INT8 using tensor-wise quantization
Add example of mixed dtype FP16 + INT4 using channel-wise and group-wise quantization
Add support for zero-point quantization in INT4 and INT8 data types
Add support for Flash Attention prefill FP8 with and without KV cache
Add support for Flash Attention decode FP8 with and without KV cache
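Group-wise quantization with zero points stores low-bit integers plus, for each group of consecutive elements along K, a scale and a zero point; dequantization computes `(q - zero_point) * scale`. The sketch below shows that arithmetic for unsigned INT4 values packed two per byte. It is a scalar illustration under stated assumptions (low nibble holds the even-indexed element; one column of B laid out along K), and `unpack_uint4` and `dequantize_groupwise` are hypothetical names, not the library's kernels.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Unpack unsigned 4-bit values stored two per byte.
// Assumption: the low nibble holds the even-indexed element.
std::vector<std::uint8_t> unpack_uint4(const std::vector<std::uint8_t>& packed, std::size_t count) {
    assert(count <= packed.size() * 2);
    std::vector<std::uint8_t> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        const std::uint8_t byte = packed[i / 2];
        out[i] = static_cast<std::uint8_t>((i % 2 == 0) ? (byte & 0x0F) : (byte >> 4));
    }
    return out;
}

// Group-wise dequantization along K: elements [g * group_size, (g + 1) * group_size)
// share scale[g] and zero_point[g], and value = (q - zero_point) * scale.
std::vector<float> dequantize_groupwise(const std::vector<std::uint8_t>& q,
                                        const std::vector<float>& scale,
                                        const std::vector<std::uint8_t>& zero_point,
                                        std::size_t group_size) {
    assert(group_size > 0 && q.size() % group_size == 0);
    assert(scale.size() == q.size() / group_size);
    assert(zero_point.size() == scale.size());
    std::vector<float> out(q.size());
    for (std::size_t i = 0; i < q.size(); ++i) {
        const std::size_t g = i / group_size;
        out[i] = (static_cast<float>(q[i]) - static_cast<float>(zero_point[g])) * scale[g];
    }
    return out;
}
```

With `group_size` equal to K, each column gets a single scale and zero point, which is the channel-wise case; tensor-wise quantization goes one step further and shares one scale across the whole tensor.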

Full Changelog: v3.9-0.2...v3.9-0.3

30 May 23:43

mehdi-goli

CUTLASS 3.9.2 SYCL backend Version 0.2

CUTLASS 3.9.2 SYCL backend Version 0.2 (2025-05-30)
Based on the CUTLASS 3.9.2 release (May 2025)

Platforms

Support for Intel Data Center GPU Max (1100 and 1550)
Support for Intel Arc B580 ("Battlemage")

Features

GEMM/StreamK/SplitK with support for the FP16 data type
Flash attention prefill with paged KV cache, supporting the FP16 data type
Performance improvements for flash attention prefill and decode
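Split-K parallelizes a GEMM along its reduction dimension: K is cut into slices, each slice produces a partial M x N result, and a final reduction sums the partials. The reference below is a minimal host-side sketch of that decomposition, assuming row-major float matrices; `gemm_split_k` is a hypothetical name. It does not model Stream-K, which instead divides the total multiply-accumulate work evenly across a fixed number of workers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// C (m x n) = A (m x k) * B (k x n), all row-major float.
// K is split into `splits` contiguous slices; on a GPU each slice would typically
// be handled by a separate work-group writing its own partial tile.
std::vector<float> gemm_split_k(const std::vector<float>& a, const std::vector<float>& b,
                                int m, int n, int k, int splits) {
    assert(m > 0 && n > 0 && k > 0 && splits > 0);
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    const std::size_t mn = static_cast<std::size_t>(m) * n;
    const int chunk = (k + splits - 1) / splits;  // ceil(k / splits)

    // Phase 1: one partial product per K slice.
    std::vector<std::vector<float>> partials(splits, std::vector<float>(mn, 0.0f));
    for (int s = 0; s < splits; ++s) {
        const int k_begin = s * chunk;
        const int k_end = std::min(k, k_begin + chunk);  // empty slice if k_begin >= k
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                for (int kk = k_begin; kk < k_end; ++kk)
                    partials[s][i * n + j] += a[i * k + kk] * b[kk * n + j];
    }

    // Phase 2: reduce the partial products into the final result.
    std::vector<float> c(mn, 0.0f);
    for (const std::vector<float>& part : partials)
        for (std::size_t idx = 0; idx < mn; ++idx)
            c[idx] += part[idx];
    return c;
}
```

The trade-off is extra memory traffic for the partials in exchange for more parallelism when M and N are small relative to K.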

Full Changelog: v3.9-0.1...v3.9-0.2

30 Apr 01:12

mehdi-goli

CUTLASS 3.9 SYCL backend version 0.1

Based on the CUTLASS 3.9.0 release (March 2025)

Platforms

Support for Intel Data Center GPU Max (1100 and 1550)
Support for Intel Arc B580 ("Battlemage")

Features

GEMM/StreamK/SplitK with support for the bfloat16 data type
Flash attention prefill and decode with KV cache, supporting the bfloat16 data type
Support for epilogue operations:
- Element-wise, row-wise and column-wise bias
- ReLU, SiLU and GELU activation functions
- Softmax
Mixed precision GEMM (bfloat16/int8, half/int4) with dequantization support
Dual GEMM & Grouped GEMM
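An epilogue post-processes the GEMM accumulator before it is written out. The sketch below shows, on the host, the element-wise math behind the bias and activation options listed for this release: `D = act(acc + bias)`, where the bias is broadcast per row, per column, or given per element. It is a hypothetical illustration (the enums and function names are made up, and CUTLASS composes its epilogues differently); softmax is not covered, and GELU uses the exact erf form as an assumption, since tanh-based approximations also exist.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical enums for illustration only.
enum class Activation { Identity, ReLU, SiLU, GELU };
enum class BiasKind { None, RowWise, ColumnWise, ElementWise };

float apply_activation(float x, Activation act) {
    switch (act) {
        case Activation::ReLU:
            return x > 0.0f ? x : 0.0f;
        case Activation::SiLU:
            return x / (1.0f + std::exp(-x));  // x * sigmoid(x)
        case Activation::GELU:
            // Exact erf-based GELU.
            return 0.5f * x * (1.0f + std::erf(x / std::sqrt(2.0f)));
        case Activation::Identity:
        default:
            return x;
    }
}

// Applies D = act(acc + bias) in place to an m x n row-major accumulator.
void apply_epilogue(std::vector<float>& acc, int m, int n,
                    const std::vector<float>& bias, BiasKind kind, Activation act) {
    assert(acc.size() == static_cast<std::size_t>(m) * n);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            float b = 0.0f;
            switch (kind) {
                case BiasKind::RowWise:     b = bias[i]; break;          // one value per row
                case BiasKind::ColumnWise:  b = bias[j]; break;          // one value per column
                case BiasKind::ElementWise: b = bias[i * n + j]; break;  // full m x n tensor
                case BiasKind::None:        break;
            }
            float& x = acc[i * n + j];
            x = apply_activation(x + b, act);
        }
    }
}
```

Fusing these steps into the epilogue avoids a separate pass over the output in global memory, which is the main motivation for offering them as epilogue operations rather than standalone kernels.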

Full Changelog: https://github.com/codeplaysoftware/cutlass-sycl/commits/v3.9-0.1
