-
Notifications
You must be signed in to change notification settings - Fork 13.3k
Better SIMD shuffles #36624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
This commit adds a copy of the Teddy searcher that works on AVX2. We don't attempt to reuse any code between them just yet, and instead just copy & paste and tweak parts of it to work on 32 bytes instead of 16. (Some parts were trickier than others. For example, @jneem figured out how to nearly compensate for the lack of a real 256-bit bytewise PALIGNR instruction, which we borrow here.) Overall, AVX2 provides a nice bump in performance.
This still occurs even when using the specific intrinsics in |
Can reproduce, will try to fill in an LLVM bug for this. Once the LLVM bug is filled and recognized as a real bug we could workaround this in EDIT: reported https://bugs.llvm.org/show_bug.cgi?id=36933 |
@rustbot modify labels: +A-simd |
This was solved upstream, it seems. I twiddled this a bit to simplify reading it (for me, anyways): #![feature(platform_intrinsics, repr_simd)]
#[allow(non_camel_case_types)]
#[repr(simd)]
pub struct u8x32([u8; 32]);
extern "platform-intrinsic" {
fn simd_shuffle32<T, U>(x: T, y: T, idx: [u32; 32]) -> U;
}
pub fn right_shift_1(left: u8x32, right: u8x32) -> u8x32 {
const IDX: [u32; 32] = {
let mut a = [31u32; 32];
let mut n: u32 = 0;
while n < 32 {
a[n as usize] += n;
n +=1;
}
a
};
unsafe { simd_shuffle32(left, right, IDX) }
} Output is now (Rust-Godbolt): example::right_shift_1:
mov rax, rdi
vmovdqa ymm0, ymmword ptr [rdx]
vperm2i128 ymm1, ymm0, ymmword ptr [rsi], 3
vpalignr ymm0, ymm0, ymm1, 15
vmovdqa ymmword ptr [rdi], ymm0
vzeroupper
ret This seems "fixed" insofar as the vblend is gone, the vmovdqa and such are a somewhat unavoidable feature atm of our lack of using register-passing. So further improvements require large systemic changes in the compiler. For comparison, clang emits (Clang-Godbolt): right_shift_1(long long __vector(4), long long __vector(4)): # @right_shift_1(long long __vector(4), long long __vector(4))
vperm2i128 ymm0, ymm0, ymm1, 33 # ymm0 = ymm0[2,3],ymm1[0,1]
vpalignr ymm0, ymm0, ymm1, 1 # ymm0 = ymm1[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15], ymm0[0], ymm1[17,18,19,20,21,22,23,24,25,26,27,28,29,30,31], ymm0[16]
ret |
I'm trying to do the following in AVX2 using intrinsics: shift
x
one byte to the right, while shifting the rightmost byte ofy
into the leftmost byte ofx
. This is best done using two instructions:vperm2i128
followed byvpalignr
. However,simd_shuffle32
generates four instructions:vmovdqa
(to load a constant),vpblendvb
, thenvperm2i128
andvpalignr
. Here is a a full example, which may be compiled withrustc -O -C target_feature=+avx2 --crate-type=lib --emit=asm shuffle.rs
.This might be considered a bug in LLVM, in the sense that it's generating a sub-optimal shuffle. However, I think it should be addressed in rustc, because if I know what the right sequence of instructions is then I shouldn't have to hope that LLVM can generate it. Moreover, it's possible to get the right code from clang (compile with
clang -emit-llvm -mavx2 -O -S shuffle.c
):A possibly interesting observation is that the unoptimized LLVM IR from clang contains a
llvm.x86.avx2.vperm2i128
intrinsic followed by ashufflevector
. The optimized LLVM IR from clang contains twoshufflevector
intrinsics. In order to try to get the same output from rustc, I first patched it to supportllvm.x86.avx2.vperm2i128
. After modifyingright_shift_1
to use the new intrinsic, I got rustc to producellvm.x86.avx2.vperm2i128
followed by ashufflevector
. However, the optimized LLVM IR from rustc still produces a singleshufflevector
, and it still ends up producing the bad asm.I think this means that the fault is from some optimization pass in rustc that isn't in clang, but I haven't had time to investigate it yet...
The text was updated successfully, but these errors were encountered: