[IA] Add support for [de]interleave{3,5,7} #139373
@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

This adds support for lowering deinterleave and interleave intrinsics for factors 3, 5 and 7 into target-specific memory intrinsics.

Notably this doesn't add support for handling higher factors constructed from interleaving interleave intrinsics, e.g. factor 6 from interleave3 + interleave2.

I initially tried this but it became very complex very quickly. For example, because there are now multiple factors involved, interleaveLeafValues is no longer symmetric between interleaving and deinterleaving. There are then also two ways of representing a factor 6 deinterleave: it can be done as either 1 deinterleave3 and 3 deinterleave2s OR 1 deinterleave2 and 2 deinterleave3s.

I'm not sure the complexity of supporting arbitrary factors is warranted given that we only need to support a small number of factors currently: SVE only needs factors 2, 3 and 4, whilst RVV only needs 2, 3, 4, 5, 6, 7 and 8.

My preference would be to just add an interleave6 and deinterleave6 intrinsic to avoid all this ambiguity, but I'll defer this discussion to a later patch.

Patch is 73.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/139373.diff

8 Files Affected:
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 03099e9ad44dc..1749ac1770da9 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3234,7 +3234,7 @@ class TargetLoweringBase {
/// Lower a deinterleave intrinsic to a target specific load intrinsic.
/// Return true on success. Currently only supports
- /// llvm.vector.deinterleave2
+ /// llvm.vector.deinterleave{2,3,5,7}
///
/// \p LI is the accompanying load instruction.
/// \p DeinterleaveValues contains the deinterleaved values.
@@ -3246,7 +3246,7 @@ class TargetLoweringBase {
/// Lower an interleave intrinsic to a target specific store intrinsic.
/// Return true on success. Currently only supports
- /// llvm.vector.interleave2
+ /// llvm.vector.interleave{2,3,5,7}
///
/// \p SI is the accompanying store instruction
/// \p InterleaveValues contains the interleaved values.
diff --git a/llvm/lib/CodeGen/InterleavedAccessPass.cpp b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
index 04d89d61cb6a9..c590e470fa779 100644
--- a/llvm/lib/CodeGen/InterleavedAccessPass.cpp
+++ b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
@@ -571,6 +571,25 @@ bool InterleavedAccessImpl::lowerInterleavedStore(
return true;
}
+static unsigned getIntrinsicFactor(const IntrinsicInst *II) {
+ switch (II->getIntrinsicID()) {
+ case Intrinsic::vector_deinterleave2:
+ case Intrinsic::vector_interleave2:
+ return 2;
+ case Intrinsic::vector_deinterleave3:
+ case Intrinsic::vector_interleave3:
+ return 3;
+ case Intrinsic::vector_deinterleave5:
+ case Intrinsic::vector_interleave5:
+ return 5;
+ case Intrinsic::vector_deinterleave7:
+ case Intrinsic::vector_interleave7:
+ return 7;
+ default:
+ llvm_unreachable("Unexpected intrinsic");
+ }
+}
+
// For an (de)interleave tree like this:
//
// A C B D
@@ -586,7 +605,7 @@ bool InterleavedAccessImpl::lowerInterleavedStore(
// to reorder them by interleaving these values.
static void interleaveLeafValues(MutableArrayRef<Value *> SubLeaves) {
unsigned NumLeaves = SubLeaves.size();
- if (NumLeaves == 2)
+ if (NumLeaves == 2 || !isPowerOf2_64(NumLeaves))
return;
assert(isPowerOf2_32(NumLeaves) && NumLeaves > 1);
@@ -608,7 +627,10 @@ static void interleaveLeafValues(MutableArrayRef<Value *> SubLeaves) {
static bool
getVectorInterleaveFactor(IntrinsicInst *II, SmallVectorImpl<Value *> &Operands,
SmallVectorImpl<Instruction *> &DeadInsts) {
- assert(II->getIntrinsicID() == Intrinsic::vector_interleave2);
+ assert(II->getIntrinsicID() == Intrinsic::vector_interleave2 ||
+ II->getIntrinsicID() == Intrinsic::vector_interleave3 ||
+ II->getIntrinsicID() == Intrinsic::vector_interleave5 ||
+ II->getIntrinsicID() == Intrinsic::vector_interleave7);
// Visit with BFS
SmallVector<IntrinsicInst *, 8> Queue;
@@ -620,7 +642,7 @@ getVectorInterleaveFactor(IntrinsicInst *II, SmallVectorImpl<Value *> &Operands,
// All the intermediate intrinsics will be deleted.
DeadInsts.push_back(Current);
- for (unsigned I = 0; I < 2; ++I) {
+ for (unsigned I = 0; I < getIntrinsicFactor(Current); ++I) {
Value *Op = Current->getOperand(I);
if (auto *OpII = dyn_cast<IntrinsicInst>(Op))
if (OpII->getIntrinsicID() == Intrinsic::vector_interleave2) {
@@ -638,9 +660,10 @@ getVectorInterleaveFactor(IntrinsicInst *II, SmallVectorImpl<Value *> &Operands,
}
const unsigned Factor = Operands.size();
- // Currently we only recognize power-of-two factors.
+ // Currently we only recognize factors of 2, 3, 5 and 7.
// FIXME: should we assert here instead?
- if (Factor <= 1 || !isPowerOf2_32(Factor))
+ if (Factor <= 1 ||
+ (!isPowerOf2_32(Factor) && Factor != getIntrinsicFactor(II)))
return false;
interleaveLeafValues(Operands);
@@ -651,9 +674,12 @@ static bool
getVectorDeinterleaveFactor(IntrinsicInst *II,
SmallVectorImpl<Value *> &Results,
SmallVectorImpl<Instruction *> &DeadInsts) {
- assert(II->getIntrinsicID() == Intrinsic::vector_deinterleave2);
+ assert(II->getIntrinsicID() == Intrinsic::vector_deinterleave2 ||
+ II->getIntrinsicID() == Intrinsic::vector_deinterleave3 ||
+ II->getIntrinsicID() == Intrinsic::vector_deinterleave5 ||
+ II->getIntrinsicID() == Intrinsic::vector_deinterleave7);
using namespace PatternMatch;
- if (!II->hasNUses(2))
+ if (!II->hasNUses(getIntrinsicFactor(II)))
return false;
// Visit with BFS
@@ -662,12 +688,12 @@ getVectorDeinterleaveFactor(IntrinsicInst *II,
while (!Queue.empty()) {
IntrinsicInst *Current = Queue.front();
Queue.erase(Queue.begin());
- assert(Current->hasNUses(2));
+ assert(Current->hasNUses(getIntrinsicFactor(Current)));
// All the intermediate intrinsics will be deleted from the bottom-up.
DeadInsts.insert(DeadInsts.begin(), Current);
- ExtractValueInst *LHS = nullptr, *RHS = nullptr;
+ SmallVector<ExtractValueInst *> EVs(getIntrinsicFactor(Current), nullptr);
for (User *Usr : Current->users()) {
if (!isa<ExtractValueInst>(Usr))
return 0;
@@ -679,17 +705,15 @@ getVectorDeinterleaveFactor(IntrinsicInst *II,
if (Indices.size() != 1)
return false;
- if (Indices[0] == 0 && !LHS)
- LHS = EV;
- else if (Indices[0] == 1 && !RHS)
- RHS = EV;
+ if (!EVs[Indices[0]])
+ EVs[Indices[0]] = EV;
else
return false;
}
// We have legal indices. At this point we're either going
// to continue the traversal or push the leaf values into Results.
- for (ExtractValueInst *EV : {LHS, RHS}) {
+ for (ExtractValueInst *EV : EVs) {
// Continue the traversal. We're playing safe here and matching only the
// expression consisting of a perfectly balanced binary tree in which all
// intermediate values are only used once.
@@ -713,9 +737,10 @@ getVectorDeinterleaveFactor(IntrinsicInst *II,
}
const unsigned Factor = Results.size();
- // Currently we only recognize power-of-two factors.
+ // Currently we only recognize factors of 2, 3, 5 and 7.
// FIXME: should we assert here instead?
- if (Factor <= 1 || !isPowerOf2_32(Factor))
+ if (Factor <= 1 ||
+ (!isPowerOf2_32(Factor) && Factor != getIntrinsicFactor(II)))
return 0;
interleaveLeafValues(Results);
@@ -878,11 +903,23 @@ bool InterleavedAccessImpl::runOnFunction(Function &F) {
if (auto *II = dyn_cast<IntrinsicInst>(&I)) {
// At present, we only have intrinsics to represent (de)interleaving
- // with a factor of 2.
- if (II->getIntrinsicID() == Intrinsic::vector_deinterleave2)
+ // with a factor of 2,3,5 and 7.
+ switch (II->getIntrinsicID()) {
+ case Intrinsic::vector_deinterleave2:
+ case Intrinsic::vector_deinterleave3:
+ case Intrinsic::vector_deinterleave5:
+ case Intrinsic::vector_deinterleave7:
Changed |= lowerDeinterleaveIntrinsic(II, DeadInsts);
- else if (II->getIntrinsicID() == Intrinsic::vector_interleave2)
+ break;
+ case Intrinsic::vector_interleave2:
+ case Intrinsic::vector_interleave3:
+ case Intrinsic::vector_interleave5:
+ case Intrinsic::vector_interleave7:
Changed |= lowerInterleaveIntrinsic(II, DeadInsts);
+ break;
+ default:
+ break;
+ }
}
}
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll
index e53dfc23a84bb..31529b1783651 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll
@@ -257,6 +257,23 @@ define {<2 x double>, <2 x double>} @vector_deinterleave_load_v2f64_v4f64(ptr %p
ret {<2 x double>, <2 x double>} %res1
}
+define { <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_factor3(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor3:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
+; CHECK-NEXT: vlseg3e8.v v6, (a0)
+; CHECK-NEXT: ret
+ %vec = load <24 x i8>, ptr %p
+ %d0 = call {<8 x i8>, <8 x i8>, <8 x i8>} @llvm.vector.deinterleave3(<24 x i8> %vec)
+ %t0 = extractvalue {<8 x i8>, <8 x i8>, <8 x i8>} %d0, 0
+ %t1 = extractvalue {<8 x i8>, <8 x i8>, <8 x i8>} %d0, 1
+ %t2 = extractvalue {<8 x i8>, <8 x i8>, <8 x i8>} %d0, 2
+ %res0 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8> } poison, <8 x i8> %t0, 0
+ %res1 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8> } %res0, <8 x i8> %t1, 0
+ %res2 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8> } %res1, <8 x i8> %t2, 0
+ ret { <8 x i8>, <8 x i8>, <8 x i8> } %res2
+}
+
define { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_factor4(ptr %p) {
; CHECK-LABEL: vector_deinterleave_load_factor4:
; CHECK: # %bb.0:
@@ -281,6 +298,52 @@ define { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_fact
ret { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res3
}
+define { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_factor5(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor5:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
+; CHECK-NEXT: vlseg5e8.v v8, (a0)
+; CHECK-NEXT: ret
+ %vec = load <40 x i8>, ptr %p
+ %d0 = call {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} @llvm.vector.deinterleave5(<40 x i8> %vec)
+ %t0 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 0
+ %t1 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 1
+ %t2 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 2
+ %t3 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 3
+ %t4 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 4
+ %res0 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } poison, <8 x i8> %t0, 0
+ %res1 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res0, <8 x i8> %t1, 1
+ %res2 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res1, <8 x i8> %t2, 2
+ %res3 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res2, <8 x i8> %t3, 3
+ %res4 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res3, <8 x i8> %t4, 4
+ ret { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res4
+}
+
+define { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_factor7(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor7:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 8, e8, mf2, ta, ma
+; CHECK-NEXT: vlseg7e8.v v8, (a0)
+; CHECK-NEXT: ret
+ %vec = load <56 x i8>, ptr %p
+ %d0 = call {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} @llvm.vector.deinterleave7(<56 x i8> %vec)
+ %t0 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 0
+ %t1 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 1
+ %t2 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 2
+ %t3 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 3
+ %t4 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 4
+ %t5 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 5
+ %t6 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 6
+ %res0 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } poison, <8 x i8> %t0, 0
+ %res1 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res0, <8 x i8> %t1, 1
+ %res2 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res1, <8 x i8> %t2, 2
+ %res3 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res2, <8 x i8> %t3, 3
+ %res4 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res3, <8 x i8> %t4, 4
+ %res5 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res4, <8 x i8> %t5, 5
+ %res6 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res5, <8 x i8> %t6, 6
+ ret { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res6
+}
+
define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave_load_factor8(ptr %ptr) {
; CHECK-LABEL: vector_deinterleave_load_factor8:
; CHECK: # %bb.0:
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll
index 26c3db6131034..8244db45a7ef2 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll
@@ -181,6 +181,17 @@ define void @vector_interleave_store_v4f64_v2f64(<2 x double> %a, <2 x double> %
ret void
}
+define void @vector_interleave_store_factor3(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, ptr %p) {
+; CHECK-LABEL: vector_interleave_store_factor3:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vsseg3e32.v v8, (a0)
+; CHECK-NEXT: ret
+ %v = call <12 x i32> @llvm.vector.interleave3(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c)
+ store <12 x i32> %v, ptr %p
+ ret void
+}
+
define void @vector_interleave_store_factor4(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, ptr %p) {
; CHECK-LABEL: vector_interleave_store_factor4:
; CHECK: # %bb.0:
@@ -194,6 +205,28 @@ define void @vector_interleave_store_factor4(<4 x i32> %a, <4 x i32> %b, <4 x i3
ret void
}
+define void @vector_interleave_store_factor5(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e, ptr %p) {
+; CHECK-LABEL: vector_interleave_store_factor5:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vsseg5e32.v v8, (a0)
+; CHECK-NEXT: ret
+ %v = call <20 x i32> @llvm.vector.interleave5(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e)
+ store <20 x i32> %v, ptr %p
+ ret void
+}
+
+define void @vector_interleave_store_factor7(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e, <4 x i32> %f, <4 x i32> %g, ptr %p) {
+; CHECK-LABEL: vector_interleave_store_factor7:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT: vsseg7e32.v v8, (a0)
+; CHECK-NEXT: ret
+ %v = call <28 x i32> @llvm.vector.interleave7(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e, <4 x i32> %f, <4 x i32> %g)
+ store <28 x i32> %v, ptr %p
+ ret void
+}
+
define void @vector_interleave_store_factor8(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e, <4 x i32> %f, <4 x i32> %g, <4 x i32> %h, ptr %p) {
; CHECK-LABEL: vector_interleave_store_factor8:
; CHECK: # %bb.0:
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll
index 582aef908964a..0483bbbd35b39 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll
@@ -344,6 +344,23 @@ define {<vscale x 2 x ptr>, <vscale x 2 x ptr>} @vector_deinterleave_load_nxv2p0
ret {<vscale x 2 x ptr>, <vscale x 2 x ptr>} %res1
}
+define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } @vector_deinterleave_load_factor3(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor3:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a1, zero, e8, m1, ta, ma
+; CHECK-NEXT: vlseg3e8.v v6, (a0)
+; CHECK-NEXT: ret
+ %vec = load <vscale x 24 x i8>, ptr %p
+ %d0 = call {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} @llvm.vector.deinterleave3(<vscale x 24 x i8> %vec)
+ %t0 = extractvalue {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} %d0, 0
+ %t1 = extractvalue {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} %d0, 1
+ %t2 = extractvalue {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} %d0, 2
+ %res0 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } poison, <vscale x 8 x i8> %t0, 0
+ %res1 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res0, <vscale x 8 x i8> %t1, 0
+ %res2 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res1, <vscale x 8 x i8> %t2, 0
+ ret { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res2
+}
+
define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } @vector_deinterleave_load_factor4(ptr %p) {
; CHECK-LABEL: vector_deinterleave_load_factor4:
; CHECK: # %bb.0:
@@ -368,6 +385,52 @@ define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x
ret { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res3
}
+define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } @vector_deinterleave_load_factor5(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor5:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a1, zero, e8, m1, ta, ma
+; CHECK-NEXT: vlseg5e8.v v8, (a0)
+; CHECK-NEXT: ret
+ %vec = load <vscale x 40 x i8>, ptr %p
+ %d0 = call {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} @llvm.vector.deinterleave5(<vscale x 40 x i8> %vec)
+ %t0 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 0
+ %t1 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 1
+ %t2 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 2
+ %t3 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 3
+ %t4 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 4
+ %res0 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } poison, <vscale x 8 x i8> %t0, 0
+ %res1 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res0, <vscale x 8 x i8> %t1, 1
+ %res2 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res1, <vscale x 8 x i8> %t2, 2
+ %res3 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res2, <vscale x 8 x i8> %t3, 3
+ %res4 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res3, <vscale x 8 x i8> %t4, 4
+ ret { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res4
+}
+
+define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } @vector_deinterleave_load_factor7(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor7:
+; CHECK: # %bb.0:
+; CHECK-NEXT: vsetvli a1, zero, e8, m1, ta, ma
+; CHECK-NEXT: vlseg7e8.v v8, (a0)
+; CHECK-NEXT: ret
+ %vec = load <vscale x 56 x i8>, ptr %p
+ %d0 = call {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} @llvm.vector.deinterleave7(<vscale x 56 x i8> %vec)
+ %t0 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 0
+ %t1 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 1
+ %t2 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale ...
[truncated]
Yeah, IMHO the vectorizer has to settle on one, which will be the only one InterleavedAccess supports.
Personally I prefer adding (de)interleave6. I believe there was already a discussion on this a while back, but I don't remember why people decided against it then.
Yeah, I've been rereading some of the old threads in #89018 (comment). It does seem like there was some discussion about revisiting this after the initial patch, and now that we've run into some complexity with trying to extend it to handle RISC-V's factors, I think it's a good time to re-evaluate. I've got a patch that adds [de]interleave6 as well as [de]interleave{4,8} that I hope to post soon to get some feedback on. The gist is that the ability to support arbitrary interleave factors seems premature, since I'm not aware of any hardware that can take advantage of anything beyond a factor of 8, so the simplest setup would just be to have a dedicated intrinsic for each factor.

The eventual plan would be to move from this state of affairs:

Fixed-length vectors:
  Loop vectorizer: emitted as a series of strided shufflevectors, one for each factor
Scalable vectors:
  Loop vectorizer: only powers of 2 supported, emitted as interleaves of [de]interleave2 intrinsics

To this:

Fixed-length vectors:
  Loop vectorizer: emitted as a series of strided shufflevectors, one for each factor
Scalable vectors:
  Loop vectorizer: all factors up to 8 supported, emitted as a single [de]interleaveN intrinsic

And to get there, I'm imagining the following steps:

If we do end up wanting to support even higher interleave factors (> 8), e.g. #89018 (comment), then we could return and address this later, since at the moment it doesn't look like we can lower a factor 8 interleave into an ld4 + ld2 on AArch64 anyway. WDYT?
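For readers less familiar with the fixed-length representation mentioned above, the "strided shufflevectors" are shuffles whose mask picks every Factor-th element, starting at the field's index. Here is a minimal sketch of how such a mask is formed; `stridedMask` is a hypothetical helper written for illustration and is not an LLVM API.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not part of LLVM): builds the shuffle mask that
// extracts field `Field` from a vector holding `Factor` interleaved fields
// of `NumElts` elements each. The mask is {Field, Field + Factor, ...}.
std::vector<int> stridedMask(unsigned Factor, unsigned Field,
                             unsigned NumElts) {
  assert(Factor > 1 && Field < Factor && "field index out of range");
  std::vector<int> Mask;
  Mask.reserve(NumElts);
  for (unsigned I = 0; I < NumElts; ++I)
    Mask.push_back(static_cast<int>(Field + I * Factor));
  return Mask;
}
```

For a factor of 3 with 4 elements per field, the three masks are {0, 3, 6, 9}, {1, 4, 7, 10} and {2, 5, 8, 11}, i.e. one shuffle per field of the interleave group.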
I don't quite follow this part: are you referring to the problems with shufflevectors where interleave4 patterns might not be in the canonical form? But other than that point, I think your plan is reasonable. As pointed out in #89018, the biggest concern about having dedicated intrinsics at the time was that we hoped to synthesize arbitrary factors with only a few intrinsics, such that we'd only have a single lowering path. But right now it seems like it's probably easier to just have two lowering paths, where the "fast path" uses dedicated intrinsics for natively supported factors (i.e. up to 8 for RISC-V), while the other path, which might be less common, synthesizes higher factors with intrinsics of smaller factors. Whether we're going to drop shufflevectors and only use (de)interleave intrinsics for both fixed and scalable vectors in the future is another matter. I definitely think it's a good idea to streamline things a little bit though. (CC other folks from that thread: @davemgreen @efriedma-quic @paulwalker-arm)
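As a rough illustration of the two-path idea, the sketch below classifies a factor as handled by a dedicated intrinsic, synthesizable from smaller factors, or neither. Everything here is an assumption made for illustration: `LoweringPath`, `classifyFactor` and the `MaxNativeFactor` parameter are hypothetical names, and the rule that a factor is synthesizable when it is a product of natively supported factors is only one possible policy, not something this PR implements.

```cpp
#include <cassert>

// Hypothetical classification, not part of LLVM.
enum class LoweringPath { DedicatedIntrinsic, SynthesizedFromSmaller, Unsupported };

// Assumes every factor in [2, MaxNativeFactor] has a dedicated intrinsic.
LoweringPath classifyFactor(unsigned Factor, unsigned MaxNativeFactor) {
  assert(MaxNativeFactor >= 2 && "need at least a factor-2 intrinsic");
  if (Factor < 2)
    return LoweringPath::Unsupported;
  if (Factor <= MaxNativeFactor)
    return LoweringPath::DedicatedIntrinsic; // the "fast path"

  // A larger factor can be built by nesting only if it is a product of
  // native factors, i.e. all of its prime factors are <= MaxNativeFactor.
  unsigned Rest = Factor;
  for (unsigned P = 2; P <= MaxNativeFactor && Rest > 1; ++P)
    while (Rest % P == 0)
      Rest /= P;
  return Rest == 1 ? LoweringPath::SynthesizedFromSmaller
                   : LoweringPath::Unsupported;
}
```

Under these assumptions, with a maximum native factor of 8, a factor of 6 takes the dedicated path, 12 would be synthesized (for example as 3 x 4), and 11 cannot be expressed at all.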
Having a single intrinsic would make cost-modelling more accurate. Replacing the fixed-width shuffles with a single intrinsic would, I think, be worthwhile, and would fix a number of perf issues in the current design. (We just have to make sure that loops with isomorphic operations do not get accidentally worse.)
This is for scalable vectors: AArch64 can currently only lower vector_[de]interleave nodes with a factor of 2. If we teach the loop vectorizer to emit interleave4 instead of multiple interleave2s, then we should probably teach AArch64 to handle vector_[de]interleave nodes with factor 4. I think this is to handle the case where the loop vectorizer decided it was profitable to emit an interleave group, but the interleaved access pass didn't lower it to a target memory intrinsic.
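To make the factor 6 ambiguity from the PR description concrete, here is a small standalone model of deinterleaving on a `std::vector`. It is only a sketch: `deinterleave`, `deinterleaveTree`, `leafFieldOrder` and `makeIota` are hypothetical helpers that mimic the element order of the intrinsics, not code from InterleavedAccessPass. It shows that both nestings produce the same six fields, but in different leaf orders, so a single reordering like interleaveLeafValues cannot serve both.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<int>;

// Models a deinterleave of the given factor: result field I holds input
// elements I, I + Factor, I + 2 * Factor, ...
std::vector<Vec> deinterleave(const Vec &V, unsigned Factor) {
  assert(Factor > 1 && V.size() % Factor == 0 && "length must divide evenly");
  std::vector<Vec> Fields(Factor);
  for (std::size_t I = 0; I < V.size(); ++I)
    Fields[I % Factor].push_back(V[I]);
  return Fields;
}

// Applies an Outer-factor deinterleave, then an Inner-factor deinterleave to
// each result, and returns the leaves in left-to-right tree order.
std::vector<Vec> deinterleaveTree(const Vec &V, unsigned Outer,
                                  unsigned Inner) {
  std::vector<Vec> Leaves;
  for (const Vec &Sub : deinterleave(V, Outer))
    for (const Vec &Leaf : deinterleave(Sub, Inner))
      Leaves.push_back(Leaf);
  return Leaves;
}

// Returns {0, 1, ..., N - 1}.
Vec makeIota(int N) {
  Vec V;
  for (int I = 0; I < N; ++I)
    V.push_back(I);
  return V;
}

// For input 0, 1, 2, ..., a leaf's first element equals the index of the
// flat-deinterleave field it corresponds to.
std::vector<int> leafFieldOrder(const std::vector<Vec> &Leaves) {
  std::vector<int> Order;
  for (const Vec &Leaf : Leaves)
    Order.push_back(Leaf.front());
  return Order;
}
```

With a 12-element input, `deinterleaveTree(V, 2, 3)` yields leaves for fields 0, 2, 4, 1, 3, 5, whereas `deinterleaveTree(V, 3, 2)` yields fields 0, 3, 1, 4, 2, 5. A dedicated factor 6 intrinsic would sidestep the question by returning the fields in order 0 to 5 directly.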