Better SIMD shuffles #36624

Open
jneem opened this issue Sep 21, 2016 · 4 comments
Labels
A-LLVM - Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
A-SIMD - Area: SIMD (Single Instruction Multiple Data)
C-enhancement - Category: An issue proposing an enhancement or a PR with one.
C-optimization - Category: An issue highlighting optimization opportunities or PRs implementing such
I-slow - Issue: Problems and improvements with respect to performance of generated code.
T-compiler - Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@jneem (Contributor) commented Sep 21, 2016

I'm trying to do the following in AVX2 using intrinsics: shift x one byte to the right, while shifting the rightmost byte of y into the leftmost byte of x. This is best done using two instructions: vperm2i128 followed by vpalignr. However, simd_shuffle32 generates four instructions: vmovdqa (to load a constant), vpblendvb, then vperm2i128 and vpalignr. Here is a full example, which may be compiled with rustc -O -C target_feature=+avx2 --crate-type=lib --emit=asm shuffle.rs.

#![feature(platform_intrinsics, repr_simd)]

#[allow(non_camel_case_types)]
#[repr(simd)]
pub struct u8x32(u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8, u8);

extern "platform-intrinsic" {
    fn simd_shuffle32<T, U>(x: T, y: T, idx: [u32; 32]) -> U;
}

pub fn right_shift_1(left: u8x32, right: u8x32) -> u8x32 {
    unsafe { simd_shuffle32(left, right, [31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62]) }
}
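
(A plain scalar sketch of the intended semantics, added here only to spell out what the index array selects: the shuffle indexes the concatenation of the two inputs, so 31 picks the last lane of left and 32..=62 pick the first 31 lanes of right.)

// Scalar reference sketch (not part of the original example):
// result[0] = left[31], result[1..32] = right[0..31].
pub fn right_shift_1_scalar(left: [u8; 32], right: [u8; 32]) -> [u8; 32] {
    let mut out = [0u8; 32];
    out[0] = left[31];                      // shuffle index 31: last lane of `left`
    out[1..].copy_from_slice(&right[..31]); // shuffle indices 32..=62: right[0..=30]
    out
}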

This might be considered a bug in LLVM, in the sense that it's generating a sub-optimal shuffle. However, I think it should be addressed in rustc, because if I know what the right sequence of instructions is then I shouldn't have to hope that LLVM can generate it. Moreover, it's possible to get the right code from clang (compile with clang -emit-llvm -mavx2 -O -S shuffle.c):

#include <immintrin.h>
__m256i right_shift_1(__m256i left, __m256i right)
{
    __m256i new_left = _mm256_permute2x128_si256(left, right, 33);
    return _mm256_alignr_epi8(new_left, right, 1);
}

A possibly interesting observation: the unoptimized LLVM IR from clang contains an llvm.x86.avx2.vperm2i128 intrinsic followed by a shufflevector, while the optimized LLVM IR from clang contains two shufflevector instructions. To try to get the same output from rustc, I first patched it to support llvm.x86.avx2.vperm2i128. After modifying right_shift_1 to use the new intrinsic, I got rustc to emit llvm.x86.avx2.vperm2i128 followed by a shufflevector. However, the optimized LLVM IR from rustc still contains only a single shufflevector, and it still ends up producing the bad asm.

I think this means the fault lies with some optimization pass that rustc runs but clang doesn't, though I haven't had time to investigate yet...

@Mark-Simulacrum Mark-Simulacrum added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. I-slow Issue: Problems and improvements with respect to performance of generated code. labels May 13, 2017
@Mark-Simulacrum Mark-Simulacrum added the C-enhancement Category: An issue proposing an enhancement or a PR with one. label Jul 26, 2017
jneem referenced this issue in rust-lang/regex on Mar 13, 2018 (commit message excerpt):
This commit adds a copy of the Teddy searcher that works on AVX2. We
don't attempt to reuse any code between them just yet, and instead just
copy & paste and tweak parts of it to work on 32 bytes instead of 16.
(Some parts were trickier than others. For example, @jneem figured out
how to nearly compensate for the lack of a real 256-bit bytewise PALIGNR
instruction, which we borrow here.)

Overall, AVX2 provides a nice bump in performance.
@BurntSushi (Member)

This still occurs even when using the specific intrinsics in std::arch, which is somewhat surprising! See: rust-lang/regex@f962ddb#r28069091
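
(For context, a sketch of roughly what that std::arch version looks like; this is a reconstruction mirroring the clang example above, not the exact code from the linked commit, and it assumes a recent toolchain where these intrinsics take their immediates as const generics.)

use std::arch::x86_64::{__m256i, _mm256_alignr_epi8, _mm256_permute2x128_si256};

// Sketch only: the same two-instruction sequence via std::arch intrinsics.
#[target_feature(enable = "avx2")]
pub unsafe fn right_shift_1(left: __m256i, right: __m256i) -> __m256i {
    // vperm2i128, imm 0x21: low 128-bit lane = high lane of `left`,
    // high 128-bit lane = low lane of `right`.
    let new_left = _mm256_permute2x128_si256::<0x21>(left, right);
    // vpalignr, imm 1: within each 128-bit lane, take the concatenation of the
    // `new_left` and `right` lanes shifted right by one byte.
    _mm256_alignr_epi8::<1>(new_left, right)
}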

@gnzlbg (Contributor) commented Mar 27, 2018

Can reproduce; I'll try to file an LLVM bug for this.

Once the LLVM bug is filed and recognized as a real bug, we could work around this in std::arch (rustc is definitely the wrong place to do it).

EDIT: reported https://bugs.llvm.org/show_bug.cgi?id=36933

@workingjubilee (Member)

@rustbot modify labels: +A-simd

@rustbot rustbot added the A-SIMD Area: SIMD (Single Instruction Multiple Data) label Sep 10, 2020
@workingjubilee (Member) commented Oct 5, 2021

This was solved upstream, it seems. I twiddled this a bit to simplify reading it (for me, anyways):

#![feature(platform_intrinsics, repr_simd)]
#[allow(non_camel_case_types)]
#[repr(simd)]
pub struct u8x32([u8; 32]);

extern "platform-intrinsic" {
    fn simd_shuffle32<T, U>(x: T, y: T, idx: [u32; 32]) -> U;
}

pub fn right_shift_1(left: u8x32, right: u8x32) -> u8x32 {
    // Build IDX = [31, 32, 33, ..., 62], the same index pattern as the original example.
    const IDX: [u32; 32] = {
        let mut a = [31u32; 32];
        let mut n: u32 = 0;
        while n < 32 {
            a[n as usize] += n;
            n += 1;
        }
        a
    };
    unsafe { simd_shuffle32(left, right, IDX) }
}

Output is now (Rust-Godbolt):

example::right_shift_1:
        mov     rax, rdi
        vmovdqa ymm0, ymmword ptr [rdx]
        vperm2i128      ymm1, ymm0, ymmword ptr [rsi], 3
        vpalignr        ymm0, ymm0, ymm1, 15
        vmovdqa ymmword ptr [rdi], ymm0
        vzeroupper
        ret

This seems "fixed" insofar as the vpblendvb is gone; the remaining vmovdqa loads and stores are, for now, a more or less unavoidable consequence of the arguments and return value being passed through memory rather than in registers. Further improvements would require larger systemic changes in the compiler.

For comparison, clang emits (Clang-Godbolt):

right_shift_1(long long __vector(4), long long __vector(4)):               # @right_shift_1(long long __vector(4), long long __vector(4))
        vperm2i128      ymm0, ymm0, ymm1, 33    # ymm0 = ymm0[2,3],ymm1[0,1]
        vpalignr        ymm0, ymm0, ymm1, 1     # ymm0 = ymm1[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], ymm0[0], ymm1[17,18,19,20,21,22,23,24,25,26,27,28,29,30,31], ymm0[16]
        ret

@Noratrieb Noratrieb added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Apr 5, 2023
@workingjubilee workingjubilee added the C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such label Oct 8, 2023