
Releases: intel/sycl-tla

v0.5

26 Sep 22:42
b0cb10e


New in CUTLASS SYCL 0.5

Major Architecture Changes

  • Xe Rearchitecture (#477): Complete redesign of the Xe CuTe atoms with a new architecture
    • New MMA atoms for improved performance
    • Enhanced 2D copy atoms (loads, stores, prefetch with VNNI/transpose support)
    • New 2D copy helpers (low-level make_block_2d_copy and high-level make_block_2d_copy_{A,B,C})
    • Generic and optimized reorder atoms for {int4, uint4, int8, uint8, e2m1, e4m3, e5m2} -> {half, bfloat16} (see the conversion sketch after this list)
    • Requires IGC version v2.18.5 or later
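
The reorder atoms cover narrow-to-wide conversions such as e4m3 -> half. As a minimal element-level sketch (not the new atom API, whose signatures are not listed in these notes), the same conversion can be written with the existing cutlass::NumericConverter:

```cpp
// Element-level sketch only: converts one e4m3 value to half via
// cutlass::NumericConverter. The new Xe reorder atoms apply such
// conversions to whole register fragments; their API is not shown here.
#include <cutlass/numeric_types.h>
#include <cutlass/numeric_conversion.h>

int main() {
  cutlass::float_e4m3_t src(1.5f);  // 1.5 is exactly representable in e4m3
  cutlass::NumericConverter<cutlass::half_t, cutlass::float_e4m3_t> convert;
  cutlass::half_t dst = convert(src);
  return (float(dst) == 1.5f) ? 0 : 1;
}
```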

New Features

  • G++ Host Compiler Support (#490): Support for G++ 13 as host compiler

    • Migrated syclcompat into this repository as cutlasscompat for better compatibility
    • Fixed compilation issues when using G++ instead of clang++
    • Added a new CI workflow for testing G++ host compiler builds
    • Enhanced the build system to support the -DDPCPP_HOST_COMPILER=g++ option
  • Grouped GEMM for Mixed Dtype (#457): Extended grouped GEMM support to mixed precision operations (see the dequantization sketch after this list)

    • Added support for BF16 + S8 mixed dtype grouped GEMM
    • Added support for FP16 + U4 mixed dtype grouped GEMM
    • New examples: 10_bmg_grouped_gemm_bf16_f16_s8.cpp and 10_bmg_grouped_gemm_f16_u4.cpp
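
For context on the mixed-dtype path, below is a minimal sketch of the per-element dequantization such kernels perform before the MMA, assuming a scale and zero point supplied per channel or per group; it is illustrative arithmetic, not the kernel code from the examples above.

```cpp
// Illustrative only: converts a quantized int8 weight to bf16 using a scale and
// zero point, the element-level operation behind BF16 + S8 mixed dtype GEMM.
// The scale/zero-point layout (per channel or per group) is an assumption here.
#include <cstdint>
#include <cutlass/numeric_types.h>

cutlass::bfloat16_t dequantize(int8_t w, float scale, int8_t zero_point) {
  // value = (w - zero_point) * scale, computed in float for accuracy
  float v = static_cast<float>(int(w) - int(zero_point)) * scale;
  return cutlass::bfloat16_t(v);
}
```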

See the CHANGELOG-SYCL for details of all past releases and updates.

v3.9-0.3

30 Jun 21:12
467a2bb


What's Changed

Cutlass 3.9.2 SYCL backend Version 0.3 (2025-06-30)

  • Add support for GEMM FP8 (E5M2 and E4M3)
  • Add example for GEMM FP8 with support for channel-wise and group-wise quantization
  • Add support for Grouped GEMM FP8
  • Improve performance for FP8 to FP16 conversion
  • Add support for epilogue data conversion
  • Add support for FP16 GEMM with FP16 accumulator
  • Add support for BF16 GEMM with BF16 accumulator
  • Add support for mixed dtype GEMM with tensor-wise, channel-wise and group-wise quantization (see the scale-granularity sketch after this list)
  • Add example of mixed dtype BF16 + INT8 using channel-wise and group-wise quantization
  • Add example of mixed dtype FP16 + INT8 using tensor-wise quantization
  • Add example of mixed dtype FP16 + INT4 using channel-wise and group-wise quantization
  • Add support for zero-point quantization in INT4 and INT8 data types
  • Add support for Flash Attention prefill FP8 with and without KV cache
  • Add support for Flash Attention decode FP8 with and without KV cache
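
The quantization modes above differ only in the granularity at which a scale (and optional zero point) is looked up for a weight element at coordinates (k, n). A hedged sketch of that index math follows; the scale layout and the group_size parameter are illustrative assumptions, not the library's API.

```cpp
// Hypothetical index math showing how tensor-wise, channel-wise and group-wise
// quantization differ only in which scale applies to weight element (k, n).
#include <cstddef>

// Tensor-wise: one scale for the whole tensor.
float tensor_scale(const float* scales) { return scales[0]; }

// Channel-wise: one scale per output channel n.
float channel_scale(const float* scales, std::size_t n) { return scales[n]; }

// Group-wise: one scale per (group of K values, output channel n).
float group_scale(const float* scales, std::size_t k, std::size_t n,
                  std::size_t group_size, std::size_t num_channels) {
  return scales[(k / group_size) * num_channels + n];
}
```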

Full Changelog: v3.9-0.2...v3.9-0.3

Cutlass 3.9.2 SYCL backend Version 0.2

30 May 23:43
dd43242


Cutlass 3.9.2 SYCL backend Version 0.2 (2025-05-30)
Based on CUTLASS 3.9.2 - May 2025 release

Platforms

  • Support for Intel GPU Data Center Max (1100 and 1550)
  • Support for Intel Arc B580 ("Battlemage")

Features

  • GEMM/StreamK/SplitK with support for FP16 data type
  • Flash attention prefill with Paged KV cache with support for FP16 data type
  • Performance improvements for flash attention prefill and decode

Full Changelog: v3.9-0.1...v3.9-0.2

Cutlass 3.9 SYCL backend Version 0.1

30 Apr 01:12
ef9797f


Based on CUTLASS 3.9.0 - March 2025 release

Platforms

  • Support for Intel GPU Data Center Max (1100 and 1550)
  • Support for Intel Arc B580 ("Battlemage")

Features

  • GEMM/StreamK/SplitK with support for bfloat16 data type

  • Flash attention prefill and decode with KV cache with support for bfloat16 data type

  • Support for epilogue operations:

    • Element-wise, row-wise and column-wise bias
    • ReLU, SiLU, and GELU activation functions (reference formulas after this list)
    • Softmax
  • Mixed precision GEMM (bfloat16/int8, half/int4) with dequantization support

  • Dual GEMM & Grouped GEMM
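
For reference, the listed activations compute the standard formulas sketched below in plain C++; these are not the library's epilogue functors, and GELU is shown in its tanh approximation.

```cpp
// Reference formulas for the epilogue activations named above; illustrative only.
#include <cmath>

float relu(float x) { return x > 0.0f ? x : 0.0f; }
float silu(float x) { return x / (1.0f + std::exp(-x)); }  // x * sigmoid(x)
float gelu(float x) {                                      // tanh approximation
  return 0.5f * x * (1.0f + std::tanh(0.7978845608f * (x + 0.044715f * x * x * x)));
}
```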

Full Changelog: https://github.com/codeplaysoftware/cutlass-sycl/commits/v3.9-0.1