add out_f32x4_shared_bcf_merge_write_row2col(2d) #339

Merged
merged 1 commit into from
Jun 16, 2025

Beatlesso
Contributor

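Add an out_f32x4_shared_bcf_merge_write_row2col(2d) kernel, which gives a speedup, especially when M and N are small. In the results below it runs about 1.4x to 2x faster than out_f32x4_shared_bcf_row2col(2d) for shapes of up to about 8.4 million elements (for example M=2048, N=4096), and within about 1% of it for shapes of 16.8 million elements or more (M=2048, N=8192 and larger).
The test results on an RTX 4090 are as follows: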

----------------------------------------------------------------------------------------------------------------------------------
                                                       M=1024, N=1024
                       out_original: [0.0, 1.0, 1024.0], validate False, time:0.00697064ms
                    out_f32_col2row: [0.0, 1024.0, 1.0], validate True , time:0.03135753ms
                    out_f32_row2col: [0.0, 1024.0, 1.0], validate True , time:0.01615620ms
                out_f32_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.01270676ms
                out_f32_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.01270866ms
                  out_f32_diagnonal: [0.0, 1024.0, 1.0], validate True , time:0.00475740ms
                  out_f32x4_col2row: [0.0, 1024.0, 1.0], validate True , time:0.03089952ms
                  out_f32x4_row2col: [0.0, 1024.0, 1.0], validate True , time:0.01417041ms
              out_f32x4_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.01510811ms
              out_f32x4_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.00382328ms
       out_f32x4_shared_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.00356507ms
       out_f32x4_shared_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.00646234ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.00352168ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.00514174ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.00360584ms
 out_mat_transpose_cute_col2row_reg: [0.0, 1024.0, 1.0], validate True , time:0.01087761ms
 out_mat_transpose_cute_row2col_reg: [0.0, 1024.0, 1.0], validate True , time:0.00413275ms
    out_mat_transpose_cute_col_smem: [0.0, 1024.0, 1.0], validate True , time:0.00572181ms
    out_mat_transpose_cute_row_smem: [0.0, 1024.0, 1.0], validate True , time:0.00404501ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00511813ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00392222ms
out_mat_transpose_cute_row_cvectorized: [0.0, 1024.0, 1.0], validate True , time:0.00530982ms
out_mat_transpose_cute_row_rvectorized: [0.0, 1024.0, 1.0], validate True , time:0.00360203ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00524306ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00347900ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 1024.0, 1.0], validate True , time:0.00354004ms
                         out_f32_th: [0.0, 1024.0, 1.0], validate True , time:0.01861382ms
                out_f32_th_compiled: [0.0, 1024.0, 1.0], validate True , time:0.05372620ms
----------------------------------------------------------------------------------------------------------------------------------
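
The speedup for the M=1024, N=1024 block above can be checked directly from the log. The following is a minimal sketch, not part of the PR: the helper names `parse_times` and `speedup` are hypothetical, and the regular expression assumes every result line follows the `name: [values], validate <bool>, time:<t>ms` layout shown in these logs.

```python
import re

# Two lines copied verbatim from the M=1024, N=1024 block above.
LOG = """\
   out_f32x4_shared_bcf_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.00514174ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.00360584ms
"""

# Assumed line layout: "<name>: [<values>], validate <True|False> , time:<t>ms"
LINE_RE = re.compile(
    r"^\s*(\S+):\s*\[.*?\],\s*validate\s+(True|False)\s*,\s*time:([0-9.]+)ms\s*$"
)


def parse_times(log: str) -> dict:
    """Hypothetical helper: map each kernel name to its time in milliseconds."""
    times = {}
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:  # separator and header lines do not match and are skipped
            times[match.group(1)] = float(match.group(3))
    return times


def speedup(times: dict, baseline: str, candidate: str) -> float:
    """Hypothetical helper: how many times faster `candidate` is than `baseline`."""
    return times[baseline] / times[candidate]


times = parse_times(LOG)
ratio = speedup(
    times,
    "out_f32x4_shared_bcf_row2col(2d)",
    "out_f32x4_shared_bcf_merge_write_row2col(2d)",
)
print(f"{ratio:.2f}x")  # prints 1.43x
```

Pasting any other M/N block into `LOG` gives the ratio for that shape; the same two kernel names appear in every block.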
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=1024, N=2048
                       out_original: [0.0, 1.0, 2048.0], validate False, time:0.00744534ms
                    out_f32_col2row: [0.0, 2048.0, 1.0], validate True , time:0.05934119ms
                    out_f32_row2col: [0.0, 2048.0, 1.0], validate True , time:0.02970958ms
                out_f32_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.02546620ms
                out_f32_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.02547979ms
                  out_f32x4_col2row: [0.0, 2048.0, 1.0], validate True , time:0.05577970ms
                  out_f32x4_row2col: [0.0, 2048.0, 1.0], validate True , time:0.02610111ms
              out_f32x4_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.02641416ms
              out_f32x4_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.00594759ms
       out_f32x4_shared_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.00539064ms
       out_f32x4_shared_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.01136041ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.00540662ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.00884748ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.00586128ms
 out_mat_transpose_cute_col2row_reg: [0.0, 2048.0, 1.0], validate True , time:0.02183485ms
 out_mat_transpose_cute_row2col_reg: [0.0, 2048.0, 1.0], validate True , time:0.00680208ms
    out_mat_transpose_cute_col_smem: [0.0, 2048.0, 1.0], validate True , time:0.00984836ms
    out_mat_transpose_cute_row_smem: [0.0, 2048.0, 1.0], validate True , time:0.00650501ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.00864840ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.00630593ms
out_mat_transpose_cute_row_cvectorized: [0.0, 2048.0, 1.0], validate True , time:0.00880575ms
out_mat_transpose_cute_row_rvectorized: [0.0, 2048.0, 1.0], validate True , time:0.00544500ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.00791669ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.00518465ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 2048.0, 1.0], validate True , time:0.00575233ms
                         out_f32_th: [0.0, 2048.0, 1.0], validate True , time:0.03374815ms
                out_f32_th_compiled: [0.0, 2048.0, 1.0], validate True , time:0.03374863ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=1024, N=4096
                       out_original: [0.0, 1.0, 4096.0], validate False, time:0.01041293ms
                    out_f32_col2row: [0.0, 4096.0, 1.0], validate True , time:0.10756278ms
                    out_f32_row2col: [0.0, 4096.0, 1.0], validate True , time:0.05451441ms
                out_f32_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.04699636ms
                out_f32_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.04699326ms
                  out_f32x4_col2row: [0.0, 4096.0, 1.0], validate True , time:0.10662913ms
                  out_f32x4_row2col: [0.0, 4096.0, 1.0], validate True , time:0.05547118ms
              out_f32x4_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.05155158ms
              out_f32x4_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.01052403ms
       out_f32x4_shared_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.00921631ms
       out_f32x4_shared_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.02198577ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.00927687ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.01682901ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.01032209ms
 out_mat_transpose_cute_col2row_reg: [0.0, 4096.0, 1.0], validate True , time:0.04425669ms
 out_mat_transpose_cute_row2col_reg: [0.0, 4096.0, 1.0], validate True , time:0.01229477ms
    out_mat_transpose_cute_col_smem: [0.0, 4096.0, 1.0], validate True , time:0.01818109ms
    out_mat_transpose_cute_row_smem: [0.0, 4096.0, 1.0], validate True , time:0.01142120ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.01574636ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.01124001ms
out_mat_transpose_cute_row_cvectorized: [0.0, 4096.0, 1.0], validate True , time:0.01633978ms
out_mat_transpose_cute_row_rvectorized: [0.0, 4096.0, 1.0], validate True , time:0.00903225ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.01399541ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.00919890ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 4096.0, 1.0], validate True , time:0.01020193ms
                         out_f32_th: [0.0, 4096.0, 1.0], validate True , time:0.06586218ms
                out_f32_th_compiled: [0.0, 4096.0, 1.0], validate True , time:0.06585288ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=1024, N=8192
                       out_original: [0.0, 1.0, 8192.0], validate False, time:0.07201743ms
                    out_f32_col2row: [0.0, 8192.0, 1.0], validate True , time:0.20756602ms
                    out_f32_row2col: [0.0, 8192.0, 1.0], validate True , time:0.11658978ms
                out_f32_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.09276581ms
                out_f32_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.09275508ms
                  out_f32x4_col2row: [0.0, 8192.0, 1.0], validate True , time:0.20730686ms
                  out_f32x4_row2col: [0.0, 8192.0, 1.0], validate True , time:0.11969423ms
              out_f32x4_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.10074449ms
              out_f32x4_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.01959562ms
       out_f32x4_shared_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.01687193ms
       out_f32x4_shared_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.04265404ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.01688957ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.03251529ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.01930547ms
 out_mat_transpose_cute_col2row_reg: [0.0, 8192.0, 1.0], validate True , time:0.08812404ms
 out_mat_transpose_cute_row2col_reg: [0.0, 8192.0, 1.0], validate True , time:0.02387476ms
    out_mat_transpose_cute_col_smem: [0.0, 8192.0, 1.0], validate True , time:0.03489089ms
    out_mat_transpose_cute_row_smem: [0.0, 8192.0, 1.0], validate True , time:0.02156711ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.02992201ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.02135229ms
out_mat_transpose_cute_row_cvectorized: [0.0, 8192.0, 1.0], validate True , time:0.03102064ms
out_mat_transpose_cute_row_rvectorized: [0.0, 8192.0, 1.0], validate True , time:0.01681352ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.02621365ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.01678991ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 8192.0, 1.0], validate True , time:0.01904702ms
                         out_f32_th: [0.0, 8192.0, 1.0], validate True , time:0.15844083ms
                out_f32_th_compiled: [0.0, 8192.0, 1.0], validate True , time:0.15847373ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=2048, N=1024
                       out_original: [0.0, 1.0, 1024.0], validate False, time:0.00718760ms
                    out_f32_col2row: [0.0, 1024.0, 1.0], validate True , time:0.05575848ms
                    out_f32_row2col: [0.0, 1024.0, 1.0], validate True , time:0.02742505ms
                out_f32_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.02227831ms
                out_f32_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.02228165ms
                  out_f32x4_col2row: [0.0, 1024.0, 1.0], validate True , time:0.05574393ms
                  out_f32x4_row2col: [0.0, 1024.0, 1.0], validate True , time:0.02803373ms
              out_f32x4_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.02689290ms
              out_f32x4_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.00593233ms
       out_f32x4_shared_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.00545979ms
       out_f32x4_shared_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.01339769ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.00544214ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.01013637ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.00585651ms
 out_mat_transpose_cute_col2row_reg: [0.0, 1024.0, 1.0], validate True , time:0.02130103ms
 out_mat_transpose_cute_row2col_reg: [0.0, 1024.0, 1.0], validate True , time:0.00680089ms
    out_mat_transpose_cute_col_smem: [0.0, 1024.0, 1.0], validate True , time:0.00990653ms
    out_mat_transpose_cute_row_smem: [0.0, 1024.0, 1.0], validate True , time:0.00652814ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00867939ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00630808ms
out_mat_transpose_cute_row_cvectorized: [0.0, 1024.0, 1.0], validate True , time:0.00957823ms
out_mat_transpose_cute_row_rvectorized: [0.0, 1024.0, 1.0], validate True , time:0.00544834ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00874209ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00544596ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 1024.0, 1.0], validate True , time:0.00577044ms
                         out_f32_th: [0.0, 1024.0, 1.0], validate True , time:0.03313804ms
                out_f32_th_compiled: [0.0, 1024.0, 1.0], validate True , time:0.03312349ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=2048, N=2048
                       out_original: [0.0, 1.0, 2048.0], validate False, time:0.01071548ms
                    out_f32_col2row: [0.0, 2048.0, 1.0], validate True , time:0.10898972ms
                    out_f32_row2col: [0.0, 2048.0, 1.0], validate True , time:0.05358434ms
                out_f32_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.04582500ms
                out_f32_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.04582238ms
                  out_f32_diagnonal: [0.0, 2048.0, 1.0], validate True , time:0.01359177ms
                  out_f32x4_col2row: [0.0, 2048.0, 1.0], validate True , time:0.10840225ms
                  out_f32x4_row2col: [0.0, 2048.0, 1.0], validate True , time:0.05499268ms
              out_f32x4_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.05241871ms
              out_f32x4_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.01051307ms
       out_f32x4_shared_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.00912690ms
       out_f32x4_shared_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.02528095ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.00926328ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.01951742ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.01030612ms
 out_mat_transpose_cute_col2row_reg: [0.0, 2048.0, 1.0], validate True , time:0.04303837ms
 out_mat_transpose_cute_row2col_reg: [0.0, 2048.0, 1.0], validate True , time:0.01203489ms
    out_mat_transpose_cute_col_smem: [0.0, 2048.0, 1.0], validate True , time:0.01826239ms
    out_mat_transpose_cute_row_smem: [0.0, 2048.0, 1.0], validate True , time:0.01135182ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.01574826ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.01102781ms
out_mat_transpose_cute_row_cvectorized: [0.0, 2048.0, 1.0], validate True , time:0.01742291ms
out_mat_transpose_cute_row_rvectorized: [0.0, 2048.0, 1.0], validate True , time:0.00907254ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.01553106ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.00897884ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 2048.0, 1.0], validate True , time:0.01023149ms
                         out_f32_th: [0.0, 2048.0, 1.0], validate True , time:0.06361628ms
                out_f32_th_compiled: [0.0, 2048.0, 1.0], validate True , time:0.06361723ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=2048, N=4096
                       out_original: [0.0, 1.0, 4096.0], validate False, time:0.07163095ms
                    out_f32_col2row: [0.0, 4096.0, 1.0], validate True , time:0.21715951ms
                    out_f32_row2col: [0.0, 4096.0, 1.0], validate True , time:0.11133409ms
                out_f32_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.09314251ms
                out_f32_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.09313607ms
                  out_f32x4_col2row: [0.0, 4096.0, 1.0], validate True , time:0.21611571ms
                  out_f32x4_row2col: [0.0, 4096.0, 1.0], validate True , time:0.11418581ms
              out_f32x4_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.10425901ms
              out_f32x4_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.01959562ms
       out_f32x4_shared_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.01732969ms
       out_f32x4_shared_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.04961967ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.01733017ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.03867412ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.01932693ms
 out_mat_transpose_cute_col2row_reg: [0.0, 4096.0, 1.0], validate True , time:0.08734965ms
 out_mat_transpose_cute_row2col_reg: [0.0, 4096.0, 1.0], validate True , time:0.02312016ms
    out_mat_transpose_cute_col_smem: [0.0, 4096.0, 1.0], validate True , time:0.03496742ms
    out_mat_transpose_cute_row_smem: [0.0, 4096.0, 1.0], validate True , time:0.02118897ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.02997994ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.02083397ms
out_mat_transpose_cute_row_cvectorized: [0.0, 4096.0, 1.0], validate True , time:0.03339076ms
out_mat_transpose_cute_row_rvectorized: [0.0, 4096.0, 1.0], validate True , time:0.01715493ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.02967858ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.01722884ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 4096.0, 1.0], validate True , time:0.01916385ms
                         out_f32_th: [0.0, 4096.0, 1.0], validate True , time:0.15584993ms
                out_f32_th_compiled: [0.0, 4096.0, 1.0], validate True , time:0.15579724ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=2048, N=8192
                       out_original: [0.0, 1.0, 8192.0], validate False, time:0.14705801ms
                    out_f32_col2row: [0.0, 8192.0, 1.0], validate True , time:0.43088937ms
                    out_f32_row2col: [0.0, 8192.0, 1.0], validate True , time:0.24494767ms
                out_f32_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.20129704ms
                out_f32_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.20131707ms
                  out_f32x4_col2row: [0.0, 8192.0, 1.0], validate True , time:0.43249249ms
                  out_f32x4_row2col: [0.0, 8192.0, 1.0], validate True , time:0.23042464ms
              out_f32x4_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.20973110ms
              out_f32x4_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.15208077ms
       out_f32x4_shared_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.15606570ms
       out_f32x4_shared_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.15024900ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.15612721ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.15223646ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.15100741ms
 out_mat_transpose_cute_col2row_reg: [0.0, 8192.0, 1.0], validate True , time:0.18598723ms
 out_mat_transpose_cute_row2col_reg: [0.0, 8192.0, 1.0], validate True , time:0.15580726ms
    out_mat_transpose_cute_col_smem: [0.0, 8192.0, 1.0], validate True , time:0.15742898ms
    out_mat_transpose_cute_row_smem: [0.0, 8192.0, 1.0], validate True , time:0.15334773ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.15712571ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.15336204ms
out_mat_transpose_cute_row_cvectorized: [0.0, 8192.0, 1.0], validate True , time:0.16871929ms
out_mat_transpose_cute_row_rvectorized: [0.0, 8192.0, 1.0], validate True , time:0.14578414ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.16805482ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.14600015ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 8192.0, 1.0], validate True , time:0.14872742ms
                         out_f32_th: [0.0, 8192.0, 1.0], validate True , time:0.38776302ms
                out_f32_th_compiled: [0.0, 8192.0, 1.0], validate True , time:0.38784099ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=4096, N=1024
                       out_original: [0.0, 1.0, 1024.0], validate False, time:0.00965452ms
                    out_f32_col2row: [0.0, 1024.0, 1.0], validate True , time:0.11809397ms
                    out_f32_row2col: [0.0, 1024.0, 1.0], validate True , time:0.05413198ms
                out_f32_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.04930305ms
                out_f32_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.04930592ms
                  out_f32x4_col2row: [0.0, 1024.0, 1.0], validate True , time:0.11729717ms
                  out_f32x4_row2col: [0.0, 1024.0, 1.0], validate True , time:0.05490828ms
              out_f32x4_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.05621791ms
              out_f32x4_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.01064992ms
       out_f32x4_shared_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.00938892ms
       out_f32x4_shared_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.02315426ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.00949717ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.01727295ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.01049566ms
 out_mat_transpose_cute_col2row_reg: [0.0, 1024.0, 1.0], validate True , time:0.04423213ms
 out_mat_transpose_cute_row2col_reg: [0.0, 1024.0, 1.0], validate True , time:0.01209688ms
    out_mat_transpose_cute_col_smem: [0.0, 1024.0, 1.0], validate True , time:0.01829433ms
    out_mat_transpose_cute_row_smem: [0.0, 1024.0, 1.0], validate True , time:0.01152468ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.01583409ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.01118970ms
out_mat_transpose_cute_row_cvectorized: [0.0, 1024.0, 1.0], validate True , time:0.01640582ms
out_mat_transpose_cute_row_rvectorized: [0.0, 1024.0, 1.0], validate True , time:0.00908709ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.01398849ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.00925827ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 1024.0, 1.0], validate True , time:0.01027060ms
                         out_f32_th: [0.0, 1024.0, 1.0], validate True , time:0.06325698ms
                out_f32_th_compiled: [0.0, 1024.0, 1.0], validate True , time:0.06324863ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=4096, N=2048
                       out_original: [0.0, 1.0, 2048.0], validate False, time:0.07073593ms
                    out_f32_col2row: [0.0, 2048.0, 1.0], validate True , time:0.22560453ms
                    out_f32_row2col: [0.0, 2048.0, 1.0], validate True , time:0.10785031ms
                out_f32_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.09470630ms
                out_f32_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.09471273ms
                  out_f32x4_col2row: [0.0, 2048.0, 1.0], validate True , time:0.22474456ms
                  out_f32x4_row2col: [0.0, 2048.0, 1.0], validate True , time:0.10927320ms
              out_f32x4_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.10878706ms
              out_f32x4_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.02019811ms
       out_f32x4_shared_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.01779962ms
       out_f32x4_shared_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.04461408ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.01784253ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.03363466ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.01989126ms
 out_mat_transpose_cute_col2row_reg: [0.0, 2048.0, 1.0], validate True , time:0.08718014ms
 out_mat_transpose_cute_row2col_reg: [0.0, 2048.0, 1.0], validate True , time:0.02282619ms
    out_mat_transpose_cute_col_smem: [0.0, 2048.0, 1.0], validate True , time:0.03500485ms
    out_mat_transpose_cute_row_smem: [0.0, 2048.0, 1.0], validate True , time:0.02143908ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.03012705ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.02098823ms
out_mat_transpose_cute_row_cvectorized: [0.0, 2048.0, 1.0], validate True , time:0.03087449ms
out_mat_transpose_cute_row_rvectorized: [0.0, 2048.0, 1.0], validate True , time:0.01717567ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.02607489ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.01715493ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 2048.0, 1.0], validate True , time:0.01911330ms
                         out_f32_th: [0.0, 2048.0, 1.0], validate True , time:0.16622114ms
                out_f32_th_compiled: [0.0, 2048.0, 1.0], validate True , time:0.16617227ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=4096, N=4096
                       out_original: [0.0, 1.0, 4096.0], validate False, time:0.14724803ms
                    out_f32_col2row: [0.0, 4096.0, 1.0], validate True , time:0.43641806ms
                    out_f32_row2col: [0.0, 4096.0, 1.0], validate True , time:0.32312369ms
                out_f32_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.20292211ms
                out_f32_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.20308089ms
                  out_f32_diagnonal: [0.0, 4096.0, 1.0], validate True , time:0.20732331ms
                  out_f32x4_col2row: [0.0, 4096.0, 1.0], validate True , time:0.43654585ms
                  out_f32x4_row2col: [0.0, 4096.0, 1.0], validate True , time:0.25129676ms
              out_f32x4_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.21105075ms
              out_f32x4_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.15177655ms
       out_f32x4_shared_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.15513802ms
       out_f32x4_shared_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.14992380ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.15514398ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.15121222ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.14966726ms
 out_mat_transpose_cute_col2row_reg: [0.0, 4096.0, 1.0], validate True , time:0.18929172ms
 out_mat_transpose_cute_row2col_reg: [0.0, 4096.0, 1.0], validate True , time:0.20898962ms
    out_mat_transpose_cute_col_smem: [0.0, 4096.0, 1.0], validate True , time:0.21185803ms
    out_mat_transpose_cute_row_smem: [0.0, 4096.0, 1.0], validate True , time:0.20334816ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.21148324ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.20276737ms
out_mat_transpose_cute_row_cvectorized: [0.0, 4096.0, 1.0], validate True , time:0.16703892ms
out_mat_transpose_cute_row_rvectorized: [0.0, 4096.0, 1.0], validate True , time:0.14567733ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.16660881ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.14583707ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 4096.0, 1.0], validate True , time:0.14764738ms
                         out_f32_th: [0.0, 4096.0, 1.0], validate True , time:0.44402885ms
                out_f32_th_compiled: [0.0, 4096.0, 1.0], validate True , time:0.44393468ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=4096, N=8192
                       out_original: [0.0, 1.0, 8192.0], validate False, time:0.29286504ms
                    out_f32_col2row: [0.0, 8192.0, 1.0], validate True , time:0.86543465ms
                    out_f32_row2col: [0.0, 8192.0, 1.0], validate True , time:0.62317419ms
                out_f32_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.40327048ms
                out_f32_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.40326500ms
                  out_f32x4_col2row: [0.0, 8192.0, 1.0], validate True , time:0.86530161ms
                  out_f32x4_row2col: [0.0, 8192.0, 1.0], validate True , time:0.54498720ms
              out_f32x4_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.41842818ms
              out_f32x4_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.31206226ms
       out_f32x4_shared_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.32268763ms
       out_f32x4_shared_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.30455327ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.32285690ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.30781293ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.30405593ms
 out_mat_transpose_cute_col2row_reg: [0.0, 8192.0, 1.0], validate True , time:0.38962555ms
 out_mat_transpose_cute_row2col_reg: [0.0, 8192.0, 1.0], validate True , time:0.41600895ms
    out_mat_transpose_cute_col_smem: [0.0, 8192.0, 1.0], validate True , time:0.41964293ms
    out_mat_transpose_cute_row_smem: [0.0, 8192.0, 1.0], validate True , time:0.41039968ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.41958094ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.40964651ms
out_mat_transpose_cute_row_cvectorized: [0.0, 8192.0, 1.0], validate True , time:0.38046217ms
out_mat_transpose_cute_row_rvectorized: [0.0, 8192.0, 1.0], validate True , time:0.29182434ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.37819839ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.29211068ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 8192.0, 1.0], validate True , time:0.30234075ms
                         out_f32_th: [0.0, 8192.0, 1.0], validate True , time:0.87895489ms
                out_f32_th_compiled: [0.0, 8192.0, 1.0], validate True , time:0.87879944ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=8192, N=1024
                       out_original: [0.0, 1.0, 1024.0], validate False, time:0.07053375ms
                    out_f32_col2row: [0.0, 1024.0, 1.0], validate True , time:0.24595332ms
                    out_f32_row2col: [0.0, 1024.0, 1.0], validate True , time:0.10567069ms
                out_f32_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.09474897ms
                out_f32_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.09473014ms
                  out_f32x4_col2row: [0.0, 1024.0, 1.0], validate True , time:0.24111009ms
                  out_f32x4_row2col: [0.0, 1024.0, 1.0], validate True , time:0.10672569ms
              out_f32x4_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.11472416ms
              out_f32x4_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.01999807ms
       out_f32x4_shared_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.01770735ms
       out_f32x4_shared_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.05329418ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 1024.0, 1.0], validate True , time:0.01770759ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.04070520ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 1024.0, 1.0], validate True , time:0.01967621ms
 out_mat_transpose_cute_col2row_reg: [0.0, 1024.0, 1.0], validate True , time:0.08774638ms
 out_mat_transpose_cute_row2col_reg: [0.0, 1024.0, 1.0], validate True , time:0.02261615ms
    out_mat_transpose_cute_col_smem: [0.0, 1024.0, 1.0], validate True , time:0.03501701ms
    out_mat_transpose_cute_row_smem: [0.0, 1024.0, 1.0], validate True , time:0.02132845ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.03010297ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.02071834ms
out_mat_transpose_cute_row_cvectorized: [0.0, 1024.0, 1.0], validate True , time:0.03309345ms
out_mat_transpose_cute_row_rvectorized: [0.0, 1024.0, 1.0], validate True , time:0.01717758ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.02939177ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 1024.0, 1.0], validate True , time:0.01721883ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 1024.0, 1.0], validate True , time:0.01909518ms
                         out_f32_th: [0.0, 1024.0, 1.0], validate True , time:0.20288062ms
                out_f32_th_compiled: [0.0, 1024.0, 1.0], validate True , time:0.20288539ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=8192, N=2048
                       out_original: [0.0, 1.0, 2048.0], validate False, time:0.14731693ms
                    out_f32_col2row: [0.0, 2048.0, 1.0], validate True , time:0.44992161ms
                    out_f32_row2col: [0.0, 2048.0, 1.0], validate True , time:0.39209795ms
                out_f32_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.20168328ms
                out_f32_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.20170259ms
                  out_f32x4_col2row: [0.0, 2048.0, 1.0], validate True , time:0.44786477ms
                  out_f32x4_row2col: [0.0, 2048.0, 1.0], validate True , time:0.31436062ms
              out_f32x4_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.21578026ms
              out_f32x4_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.15140390ms
       out_f32x4_shared_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.15418792ms
       out_f32x4_shared_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.14937544ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 2048.0, 1.0], validate True , time:0.15400743ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.15103722ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 2048.0, 1.0], validate True , time:0.14936137ms
 out_mat_transpose_cute_col2row_reg: [0.0, 2048.0, 1.0], validate True , time:0.22439241ms
 out_mat_transpose_cute_row2col_reg: [0.0, 2048.0, 1.0], validate True , time:0.22897625ms
    out_mat_transpose_cute_col_smem: [0.0, 2048.0, 1.0], validate True , time:0.22965932ms
    out_mat_transpose_cute_row_smem: [0.0, 2048.0, 1.0], validate True , time:0.22856069ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.22949767ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.22834253ms
out_mat_transpose_cute_row_cvectorized: [0.0, 2048.0, 1.0], validate True , time:0.19186521ms
out_mat_transpose_cute_row_rvectorized: [0.0, 2048.0, 1.0], validate True , time:0.14563751ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.19233513ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 2048.0, 1.0], validate True , time:0.14593029ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 2048.0, 1.0], validate True , time:0.14865732ms
                         out_f32_th: [0.0, 2048.0, 1.0], validate True , time:0.52873373ms
                out_f32_th_compiled: [0.0, 2048.0, 1.0], validate True , time:0.52878881ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=8192, N=4096
                       out_original: [0.0, 1.0, 4096.0], validate False, time:0.29280901ms
                    out_f32_col2row: [0.0, 4096.0, 1.0], validate True , time:0.88895130ms
                    out_f32_row2col: [0.0, 4096.0, 1.0], validate True , time:0.86307645ms
                out_f32_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.41230702ms
                out_f32_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.41231465ms
                  out_f32x4_col2row: [0.0, 4096.0, 1.0], validate True , time:0.88813376ms
                  out_f32x4_row2col: [0.0, 4096.0, 1.0], validate True , time:0.69743943ms
              out_f32x4_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.42869139ms
              out_f32x4_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.30924320ms
       out_f32x4_shared_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.31846452ms
       out_f32x4_shared_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.30230999ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 4096.0, 1.0], validate True , time:0.31852722ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.30584526ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 4096.0, 1.0], validate True , time:0.30119085ms
 out_mat_transpose_cute_col2row_reg: [0.0, 4096.0, 1.0], validate True , time:0.46655011ms
 out_mat_transpose_cute_row2col_reg: [0.0, 4096.0, 1.0], validate True , time:0.47839689ms
    out_mat_transpose_cute_col_smem: [0.0, 4096.0, 1.0], validate True , time:0.48122859ms
    out_mat_transpose_cute_row_smem: [0.0, 4096.0, 1.0], validate True , time:0.47665644ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.48055482ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.47649026ms
out_mat_transpose_cute_row_cvectorized: [0.0, 4096.0, 1.0], validate True , time:0.39820242ms
out_mat_transpose_cute_row_rvectorized: [0.0, 4096.0, 1.0], validate True , time:0.29179692ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.39949131ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 4096.0, 1.0], validate True , time:0.29208493ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 4096.0, 1.0], validate True , time:0.29841924ms
                         out_f32_th: [0.0, 4096.0, 1.0], validate True , time:1.14308476ms
                out_f32_th_compiled: [0.0, 4096.0, 1.0], validate True , time:1.14306474ms
----------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------
                                                       M=8192, N=8192
                       out_original: [0.0, 1.0, 8192.0], validate False, time:0.58547974ms
                    out_f32_col2row: [0.0, 8192.0, 1.0], validate True , time:1.80126643ms
                    out_f32_row2col: [0.0, 8192.0, 1.0], validate True , time:1.64509988ms
                out_f32_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.83196545ms
                out_f32_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.83196807ms
                  out_f32_diagnonal: [0.0, 8192.0, 1.0], validate True , time:0.97841859ms
                  out_f32x4_col2row: [0.0, 8192.0, 1.0], validate True , time:1.79821086ms
                  out_f32x4_row2col: [0.0, 8192.0, 1.0], validate True , time:1.32565904ms
              out_f32x4_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.86745381ms
              out_f32x4_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.62999153ms
       out_f32x4_shared_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.65167332ms
       out_f32x4_shared_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.61123776ms
   out_f32x4_shared_bcf_col2row(2d): [0.0, 8192.0, 1.0], validate True , time:0.65192652ms
   out_f32x4_shared_bcf_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.61977363ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.60966253ms
 out_mat_transpose_cute_col2row_reg: [0.0, 8192.0, 1.0], validate True , time:0.89863276ms
 out_mat_transpose_cute_row2col_reg: [0.0, 8192.0, 1.0], validate True , time:0.91257930ms
    out_mat_transpose_cute_col_smem: [0.0, 8192.0, 1.0], validate True , time:0.91426134ms
    out_mat_transpose_cute_row_smem: [0.0, 8192.0, 1.0], validate True , time:0.91105080ms
out_mat_transpose_cute_col_smem_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.91435575ms
out_mat_transpose_cute_row_smem_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.91125917ms
out_mat_transpose_cute_row_cvectorized: [0.0, 8192.0, 1.0], validate True , time:0.81398630ms
out_mat_transpose_cute_row_rvectorized: [0.0, 8192.0, 1.0], validate True , time:0.58486223ms
out_mat_transpose_cute_row_cvectorized_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.81496310ms
out_mat_transpose_cute_row_rvectorized_swizzled: [0.0, 8192.0, 1.0], validate True , time:0.58501387ms
out_mat_transpose_cute_row_rvectorized_swizzled_optimized: [0.0, 8192.0, 1.0], validate True , time:0.60650158ms
                         out_f32_th: [0.0, 8192.0, 1.0], validate True , time:2.18581700ms
                out_f32_th_compiled: [0.0, 8192.0, 1.0], validate True , time:2.18601727ms
----------------------------------------------------------------------------------------------------------------------------------
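The log lines above share one shape (kernel name, the first three output values, a validation flag, and a time in milliseconds), so relative timings can be read off mechanically. The following is a minimal sketch, not part of the PR: the helper names `parse_log` and `speedup` are hypothetical, and the three sample lines are copied from the M=8192, N=8192 block.

```python
import re

# Sample lines copied verbatim from the M=8192, N=8192 block of the log.
LOG = """\
       out_f32x4_shared_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.61123776ms
out_f32x4_shared_bcf_merge_write_row2col(2d): [0.0, 8192.0, 1.0], validate True , time:0.60966253ms
                         out_f32_th: [0.0, 8192.0, 1.0], validate True , time:2.18581700ms
"""

# One log line: "<name>: [v0, v1, v2], validate <True|False>, time:<t>ms"
LINE_RE = re.compile(
    r"^\s*(?P<name>\S+):\s*\[.*?\],\s*"
    r"validate\s+(?P<ok>True|False)\s*,\s*time:(?P<ms>[\d.]+)ms\s*$"
)


def parse_log(text):
    """Hypothetical helper: return {kernel_name: time_ms} for validated lines."""
    times = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        # Skip separators, headers, and entries that failed validation.
        if match and match.group("ok") == "True":
            times[match.group("name")] = float(match.group("ms"))
    return times


def speedup(times, baseline, kernel):
    """Hypothetical helper: how many times faster `kernel` is than `baseline`."""
    return times[baseline] / times[kernel]


times = parse_log(LOG)
ratio = speedup(times, "out_f32_th", "out_f32x4_shared_bcf_merge_write_row2col(2d)")
print(f"merge_write vs out_f32_th: {ratio:.2f}x")
```

Lines with `validate False` (such as `out_original`, the untransposed input) are skipped, so they cannot be picked as a baseline by accident.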

@DefTruth DefTruth self-requested a review June 16, 2025 16:27

@DefTruth DefTruth left a comment

LGTM

@DefTruth DefTruth merged commit 6953b39 into xlite-dev:main Jun 16, 2025