Skip to content

[IA] Add support for [de]interleave{3,5,7} #139373

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

lukel97
Copy link
Contributor

@lukel97 lukel97 commented May 10, 2025

This adds support for lowering deinterleave and interleave intrinsics for factors 3 5 and 7 into target specific memory intrinsics.

Notably this doesn't add support for handling higher factors constructed from interleaving interleave intrinsics, e.g. factor 6 from interleave3 + interleave2.

I initially tried this but it became very complex very quickly. For example, because there's now multiple factors involved interleaveLeafValues is no longer symmetric between interleaving and deinterleaving. There's then also two ways of representing a factor 6 deinterleave: It can both be done as either 1 deinterleave3 and 3 deinterleave2s OR 1 deinterleave2 and 3 deinterleave3s:

Details

abcdefabcdef 

-> deinterleave2

aceace
bdfbdf

-> 3 x deinterleave3

aa
cc
ee
bb
dd
ff

OR

abcdefabcdef 

-> deinterleave3
adad
bebe
cfcf

-> 3 x deinterleave2

aa
dd
bb
ee
cc
ff

I'm not sure the complexity of supporting arbitrary factors is warranted given how we only need to support a small number of factors currently: SVE only needs factors 2,3,4 whilst RVV only needs 2,3,4,5,6,7,8.

My preference would be to just add a interleave6 and deinterleave6 intrinsic to avoid all this ambiguity, but I'll defer this discussion to a later patch.

@lukel97 lukel97 requested a review from mshockwave May 10, 2025 11:01
@lukel97 lukel97 marked this pull request as ready for review May 10, 2025 11:01
@lukel97 lukel97 requested a review from hassnaaHamdi May 10, 2025 11:03
@lukel97 lukel97 force-pushed the interleaved-access/factor357 branch from 63f4aee to a6b4b02 Compare May 10, 2025 11:25
@llvmbot
Copy link
Member

llvmbot commented May 12, 2025

@llvm/pr-subscribers-llvm-transforms

Author: Luke Lau (lukel97)

Changes

This adds support for lowering deinterleave and interleave intrinsics for factors 3 5 and 7 into target specific memory intrinsics.

Notably this doesn't add support for handling higher factors constructed from interleaving interleave intrinsics, e.g. factor 6 from interleave3 + interleave2.

I initially tried this but it became very complex very quickly. For example, because there's now multiple factors involved interleaveLeafValues is no longer symmetric between interleaving and deinterleaving. There's then also two ways of representing a factor 6 deinterleave: It can both be done as either 1 deinterleave3 and 3 deinterleave2s OR 1 deinterleave2 and 3 deinterleave3s:

<details><summary>Details</summary>
<p>

abcdefabcdef 

-&gt; deinterleave2

aceace
bdfbdf

-&gt; 3 x deinterleave3

aa
cc
ee
bb
dd
ff

OR

abcdefabcdef 

-&gt; deinterleave3
adad
bebe
cfcf

-&gt; 3 x deinterleave2

aa
dd
bb
ee
cc
ff

</p>
</details>

I'm not sure the complexity of supporting arbitrary factors is warranted given how we only need to support a small number of factors currently: SVE only needs factors 2,3,4 whilst RVV only needs 2,3,4,5,6,7,8.

My preference would be to just add a interleave6 and deinterleave6 intrinsic to avoid all this ambiguity, but I'll defer this discussion to a later patch.


Patch is 73.47 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/139373.diff

8 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/TargetLowering.h (+2-2)
  • (modified) llvm/lib/CodeGen/InterleavedAccessPass.cpp (+56-19)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll (+63)
  • (modified) llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll (+33)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll (+63)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vector-interleave-store.ll (+33)
  • (modified) llvm/test/CodeGen/RISCV/rvv/vp-vector-interleaved-access.ll (+236)
  • (modified) llvm/test/Transforms/InterleavedAccess/RISCV/interleaved-accesses.ll (+218)
diff --git a/llvm/include/llvm/CodeGen/TargetLowering.h b/llvm/include/llvm/CodeGen/TargetLowering.h
index 03099e9ad44dc..1749ac1770da9 100644
--- a/llvm/include/llvm/CodeGen/TargetLowering.h
+++ b/llvm/include/llvm/CodeGen/TargetLowering.h
@@ -3234,7 +3234,7 @@ class TargetLoweringBase {
 
   /// Lower a deinterleave intrinsic to a target specific load intrinsic.
   /// Return true on success. Currently only supports
-  /// llvm.vector.deinterleave2
+  /// llvm.vector.deinterleave{2,3,5,7}
   ///
   /// \p LI is the accompanying load instruction.
   /// \p DeinterleaveValues contains the deinterleaved values.
@@ -3246,7 +3246,7 @@ class TargetLoweringBase {
 
   /// Lower an interleave intrinsic to a target specific store intrinsic.
   /// Return true on success. Currently only supports
-  /// llvm.vector.interleave2
+  /// llvm.vector.interleave{2,3,5,7}
   ///
   /// \p SI is the accompanying store instruction
   /// \p InterleaveValues contains the interleaved values.
diff --git a/llvm/lib/CodeGen/InterleavedAccessPass.cpp b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
index 04d89d61cb6a9..c590e470fa779 100644
--- a/llvm/lib/CodeGen/InterleavedAccessPass.cpp
+++ b/llvm/lib/CodeGen/InterleavedAccessPass.cpp
@@ -571,6 +571,25 @@ bool InterleavedAccessImpl::lowerInterleavedStore(
   return true;
 }
 
+static unsigned getIntrinsicFactor(const IntrinsicInst *II) {
+  switch (II->getIntrinsicID()) {
+  case Intrinsic::vector_deinterleave2:
+  case Intrinsic::vector_interleave2:
+    return 2;
+  case Intrinsic::vector_deinterleave3:
+  case Intrinsic::vector_interleave3:
+    return 3;
+  case Intrinsic::vector_deinterleave5:
+  case Intrinsic::vector_interleave5:
+    return 5;
+  case Intrinsic::vector_deinterleave7:
+  case Intrinsic::vector_interleave7:
+    return 7;
+  default:
+    llvm_unreachable("Unexpected intrinsic");
+  }
+}
+
 // For an (de)interleave tree like this:
 //
 //   A   C B   D
@@ -586,7 +605,7 @@ bool InterleavedAccessImpl::lowerInterleavedStore(
 //  to reorder them by interleaving these values.
 static void interleaveLeafValues(MutableArrayRef<Value *> SubLeaves) {
   unsigned NumLeaves = SubLeaves.size();
-  if (NumLeaves == 2)
+  if (NumLeaves == 2 || !isPowerOf2_64(NumLeaves))
     return;
 
   assert(isPowerOf2_32(NumLeaves) && NumLeaves > 1);
@@ -608,7 +627,10 @@ static void interleaveLeafValues(MutableArrayRef<Value *> SubLeaves) {
 static bool
 getVectorInterleaveFactor(IntrinsicInst *II, SmallVectorImpl<Value *> &Operands,
                           SmallVectorImpl<Instruction *> &DeadInsts) {
-  assert(II->getIntrinsicID() == Intrinsic::vector_interleave2);
+  assert(II->getIntrinsicID() == Intrinsic::vector_interleave2 ||
+         II->getIntrinsicID() == Intrinsic::vector_interleave3 ||
+         II->getIntrinsicID() == Intrinsic::vector_interleave5 ||
+         II->getIntrinsicID() == Intrinsic::vector_interleave7);
 
   // Visit with BFS
   SmallVector<IntrinsicInst *, 8> Queue;
@@ -620,7 +642,7 @@ getVectorInterleaveFactor(IntrinsicInst *II, SmallVectorImpl<Value *> &Operands,
     // All the intermediate intrinsics will be deleted.
     DeadInsts.push_back(Current);
 
-    for (unsigned I = 0; I < 2; ++I) {
+    for (unsigned I = 0; I < getIntrinsicFactor(Current); ++I) {
       Value *Op = Current->getOperand(I);
       if (auto *OpII = dyn_cast<IntrinsicInst>(Op))
         if (OpII->getIntrinsicID() == Intrinsic::vector_interleave2) {
@@ -638,9 +660,10 @@ getVectorInterleaveFactor(IntrinsicInst *II, SmallVectorImpl<Value *> &Operands,
   }
 
   const unsigned Factor = Operands.size();
-  // Currently we only recognize power-of-two factors.
+  // Currently we only recognize factors of 2, 3, 5 and 7.
   // FIXME: should we assert here instead?
-  if (Factor <= 1 || !isPowerOf2_32(Factor))
+  if (Factor <= 1 ||
+      (!isPowerOf2_32(Factor) && Factor != getIntrinsicFactor(II)))
     return false;
 
   interleaveLeafValues(Operands);
@@ -651,9 +674,12 @@ static bool
 getVectorDeinterleaveFactor(IntrinsicInst *II,
                             SmallVectorImpl<Value *> &Results,
                             SmallVectorImpl<Instruction *> &DeadInsts) {
-  assert(II->getIntrinsicID() == Intrinsic::vector_deinterleave2);
+  assert(II->getIntrinsicID() == Intrinsic::vector_deinterleave2 ||
+         II->getIntrinsicID() == Intrinsic::vector_deinterleave3 ||
+         II->getIntrinsicID() == Intrinsic::vector_deinterleave5 ||
+         II->getIntrinsicID() == Intrinsic::vector_deinterleave7);
   using namespace PatternMatch;
-  if (!II->hasNUses(2))
+  if (!II->hasNUses(getIntrinsicFactor(II)))
     return false;
 
   // Visit with BFS
@@ -662,12 +688,12 @@ getVectorDeinterleaveFactor(IntrinsicInst *II,
   while (!Queue.empty()) {
     IntrinsicInst *Current = Queue.front();
     Queue.erase(Queue.begin());
-    assert(Current->hasNUses(2));
+    assert(Current->hasNUses(getIntrinsicFactor(Current)));
 
     // All the intermediate intrinsics will be deleted from the bottom-up.
     DeadInsts.insert(DeadInsts.begin(), Current);
 
-    ExtractValueInst *LHS = nullptr, *RHS = nullptr;
+    SmallVector<ExtractValueInst *> EVs(getIntrinsicFactor(Current), nullptr);
     for (User *Usr : Current->users()) {
       if (!isa<ExtractValueInst>(Usr))
         return 0;
@@ -679,17 +705,15 @@ getVectorDeinterleaveFactor(IntrinsicInst *II,
       if (Indices.size() != 1)
         return false;
 
-      if (Indices[0] == 0 && !LHS)
-        LHS = EV;
-      else if (Indices[0] == 1 && !RHS)
-        RHS = EV;
+      if (!EVs[Indices[0]])
+        EVs[Indices[0]] = EV;
       else
         return false;
     }
 
     // We have legal indices. At this point we're either going
     // to continue the traversal or push the leaf values into Results.
-    for (ExtractValueInst *EV : {LHS, RHS}) {
+    for (ExtractValueInst *EV : EVs) {
       // Continue the traversal. We're playing safe here and matching only the
       // expression consisting of a perfectly balanced binary tree in which all
       // intermediate values are only used once.
@@ -713,9 +737,10 @@ getVectorDeinterleaveFactor(IntrinsicInst *II,
   }
 
   const unsigned Factor = Results.size();
-  // Currently we only recognize power-of-two factors.
+  // Currently we only recognize factors of 2, 3, 5 and 7.
   // FIXME: should we assert here instead?
-  if (Factor <= 1 || !isPowerOf2_32(Factor))
+  if (Factor <= 1 ||
+      (!isPowerOf2_32(Factor) && Factor != getIntrinsicFactor(II)))
     return 0;
 
   interleaveLeafValues(Results);
@@ -878,11 +903,23 @@ bool InterleavedAccessImpl::runOnFunction(Function &F) {
 
     if (auto *II = dyn_cast<IntrinsicInst>(&I)) {
       // At present, we only have intrinsics to represent (de)interleaving
-      // with a factor of 2.
-      if (II->getIntrinsicID() == Intrinsic::vector_deinterleave2)
+      // with a factor of 2,3,5 and 7.
+      switch (II->getIntrinsicID()) {
+      case Intrinsic::vector_deinterleave2:
+      case Intrinsic::vector_deinterleave3:
+      case Intrinsic::vector_deinterleave5:
+      case Intrinsic::vector_deinterleave7:
         Changed |= lowerDeinterleaveIntrinsic(II, DeadInsts);
-      else if (II->getIntrinsicID() == Intrinsic::vector_interleave2)
+        break;
+      case Intrinsic::vector_interleave2:
+      case Intrinsic::vector_interleave3:
+      case Intrinsic::vector_interleave5:
+      case Intrinsic::vector_interleave7:
         Changed |= lowerInterleaveIntrinsic(II, DeadInsts);
+        break;
+      default:
+        break;
+      }
     }
   }
 
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll
index e53dfc23a84bb..31529b1783651 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-deinterleave-load.ll
@@ -257,6 +257,23 @@ define {<2 x double>, <2 x double>} @vector_deinterleave_load_v2f64_v4f64(ptr %p
   ret {<2 x double>, <2 x double>} %res1
 }
 
+define { <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_factor3(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor3:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
+; CHECK-NEXT:    vlseg3e8.v v6, (a0)
+; CHECK-NEXT:    ret
+  %vec = load <24 x i8>, ptr %p
+  %d0 = call {<8 x i8>, <8 x i8>, <8 x i8>} @llvm.vector.deinterleave3(<24 x i8> %vec)
+  %t0 = extractvalue {<8 x i8>, <8 x i8>, <8 x i8>} %d0, 0
+  %t1 = extractvalue {<8 x i8>, <8 x i8>, <8 x i8>} %d0, 1
+  %t2 = extractvalue {<8 x i8>, <8 x i8>, <8 x i8>} %d0, 2
+  %res0 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8> } poison, <8 x i8> %t0, 0
+  %res1 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8> } %res0, <8 x i8> %t1, 0
+  %res2 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8> } %res1, <8 x i8> %t2, 0
+  ret { <8 x i8>, <8 x i8>, <8 x i8> } %res2
+}
+
 define { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_factor4(ptr %p) {
 ; CHECK-LABEL: vector_deinterleave_load_factor4:
 ; CHECK:       # %bb.0:
@@ -281,6 +298,52 @@ define { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_fact
   ret { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res3
 }
 
+define { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_factor5(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor5:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
+; CHECK-NEXT:    vlseg5e8.v v8, (a0)
+; CHECK-NEXT:    ret
+  %vec = load <40 x i8>, ptr %p
+  %d0 = call {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} @llvm.vector.deinterleave5(<40 x i8> %vec)
+  %t0 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 0
+  %t1 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 1
+  %t2 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 2
+  %t3 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 3
+  %t4 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 4
+  %res0 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } poison, <8 x i8> %t0, 0
+  %res1 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res0, <8 x i8> %t1, 1
+  %res2 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res1, <8 x i8> %t2, 2
+  %res3 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res2, <8 x i8> %t3, 3
+  %res4 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res3, <8 x i8> %t4, 4
+  ret { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res4
+}
+
+define { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } @vector_deinterleave_load_factor7(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor7:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 8, e8, mf2, ta, ma
+; CHECK-NEXT:    vlseg7e8.v v8, (a0)
+; CHECK-NEXT:    ret
+  %vec = load <56 x i8>, ptr %p
+  %d0 = call {<8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>} @llvm.vector.deinterleave7(<56 x i8> %vec)
+  %t0 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 0
+  %t1 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 1
+  %t2 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 2
+  %t3 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 3
+  %t4 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 4
+  %t5 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 5
+  %t6 = extractvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %d0, 6
+  %res0 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } poison, <8 x i8> %t0, 0
+  %res1 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res0, <8 x i8> %t1, 1
+  %res2 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res1, <8 x i8> %t2, 2
+  %res3 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res2, <8 x i8> %t3, 3
+  %res4 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res3, <8 x i8> %t4, 4
+  %res5 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res3, <8 x i8> %t5, 5
+  %res6 = insertvalue { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res3, <8 x i8> %t6, 6
+  ret { <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8>, <8 x i8> } %res6
+}
+
 define {<2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>, <2 x i32>} @vector_deinterleave_load_factor8(ptr %ptr) {
 ; CHECK-LABEL: vector_deinterleave_load_factor8:
 ; CHECK:       # %bb.0:
diff --git a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll
index 26c3db6131034..8244db45a7ef2 100644
--- a/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/fixed-vectors-interleave-store.ll
@@ -181,6 +181,17 @@ define void @vector_interleave_store_v4f64_v2f64(<2 x double> %a, <2 x double> %
   ret void
 }
 
+define void @vector_interleave_store_factor3(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, ptr %p) {
+; CHECK-LABEL: vector_interleave_store_factor3:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vsseg3e32.v v8, (a0)
+; CHECK-NEXT:    ret
+  %v = call <12 x i32> @llvm.vector.interleave3(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c)
+  store <12 x i32> %v, ptr %p
+  ret void
+}
+
 define void @vector_interleave_store_factor4(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, ptr %p) {
 ; CHECK-LABEL: vector_interleave_store_factor4:
 ; CHECK:       # %bb.0:
@@ -194,6 +205,28 @@ define void @vector_interleave_store_factor4(<4 x i32> %a, <4 x i32> %b, <4 x i3
   ret void
 }
 
+define void @vector_interleave_store_factor5(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e, ptr %p) {
+; CHECK-LABEL: vector_interleave_store_factor5:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vsseg5e32.v v8, (a0)
+; CHECK-NEXT:    ret
+  %v = call <20 x i32> @llvm.vector.interleave5(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e)
+  store <20 x i32> %v, ptr %p
+  ret void
+}
+
+define void @vector_interleave_store_factor7(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e, <4 x i32> %f, <4 x i32> %g, ptr %p) {
+; CHECK-LABEL: vector_interleave_store_factor7:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetivli zero, 4, e32, m1, ta, ma
+; CHECK-NEXT:    vsseg7e32.v v8, (a0)
+; CHECK-NEXT:    ret
+  %v = call <28 x i32> @llvm.vector.interleave7(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e, <4 x i32> %f, <4 x i32> %g)
+  store <28 x i32> %v, ptr %p
+  ret void
+}
+
 define void @vector_interleave_store_factor8(<4 x i32> %a, <4 x i32> %b, <4 x i32> %c, <4 x i32> %d, <4 x i32> %e, <4 x i32> %f, <4 x i32> %g, <4 x i32> %h, ptr %p) {
 ; CHECK-LABEL: vector_interleave_store_factor8:
 ; CHECK:       # %bb.0:
diff --git a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll
index 582aef908964a..0483bbbd35b39 100644
--- a/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll
+++ b/llvm/test/CodeGen/RISCV/rvv/vector-deinterleave-load.ll
@@ -344,6 +344,23 @@ define {<vscale x 2 x ptr>, <vscale x 2 x ptr>} @vector_deinterleave_load_nxv2p0
   ret {<vscale x 2 x ptr>, <vscale x 2 x ptr>} %res1
 }
 
+define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } @vector_deinterleave_load_factor3(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor3:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetvli a1, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vlseg3e8.v v6, (a0)
+; CHECK-NEXT:    ret
+  %vec = load <vscale x 24 x i8>, ptr %p
+  %d0 = call {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} @llvm.vector.deinterleave3(<vscale x 24 x i8> %vec)
+  %t0 = extractvalue {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} %d0, 0
+  %t1 = extractvalue {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} %d0, 1
+  %t2 = extractvalue {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} %d0, 2
+  %res0 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } poison, <vscale x 8 x i8> %t0, 0
+  %res1 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res0, <vscale x 8 x i8> %t1, 0
+  %res2 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res1, <vscale x 8 x i8> %t2, 0
+  ret { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res2
+}
+
 define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } @vector_deinterleave_load_factor4(ptr %p) {
 ; CHECK-LABEL: vector_deinterleave_load_factor4:
 ; CHECK:       # %bb.0:
@@ -368,6 +385,52 @@ define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x
   ret { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res3
 }
 
+define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } @vector_deinterleave_load_factor5(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor5:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetvli a1, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vlseg5e8.v v8, (a0)
+; CHECK-NEXT:    ret
+  %vec = load <vscale x 40 x i8>, ptr %p
+  %d0 = call {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} @llvm.vector.deinterleave5(<vscale x 40 x i8> %vec)
+  %t0 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 0
+  %t1 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 1
+  %t2 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 2
+  %t3 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 3
+  %t4 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 4
+  %res0 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } poison, <vscale x 8 x i8> %t0, 0
+  %res1 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res0, <vscale x 8 x i8> %t1, 1
+  %res2 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res1, <vscale x 8 x i8> %t2, 2
+  %res3 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res2, <vscale x 8 x i8> %t3, 3
+  %res4 = insertvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res3, <vscale x 8 x i8> %t4, 4
+  ret { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %res4
+}
+
+define { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } @vector_deinterleave_load_factor7(ptr %p) {
+; CHECK-LABEL: vector_deinterleave_load_factor7:
+; CHECK:       # %bb.0:
+; CHECK-NEXT:    vsetvli a1, zero, e8, m1, ta, ma
+; CHECK-NEXT:    vlseg7e8.v v8, (a0)
+; CHECK-NEXT:    ret
+  %vec = load <vscale x 56 x i8>, ptr %p
+  %d0 = call {<vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>} @llvm.vector.deinterleave7(<vscale x 56 x i8> %vec)
+  %t0 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 0
+  %t1 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8> } %d0, 1
+  %t2 = extractvalue { <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale x 8 x i8>, <vscale ...
[truncated]

@mshockwave
Copy link
Member

There's then also two ways of representing a factor 6 deinterleave: It can both be done as either 1 deinterleave3 and 3 deinterleave2s OR 1 deinterleave2 and 3 deinterleave3s:

Yeah IMHO the vectorizer has to coin one which will be the only one InterleavedAccess supports.

My preference would be to just add a interleave6 and deinterleave6 intrinsic to avoid all this ambiguity, but I'll defer this discussion to a later patch.

Personally I prefer adding (de)interleave6 but I believe there was already a discussion on this a while back but I don't remember why people decided against it then.

@lukel97
Copy link
Contributor Author

lukel97 commented May 13, 2025

Personally I prefer adding (de)interleave6 but I believe there was already a discussion on this a while back but I don't remember why people decided against it then.

Yeah, I've been rereading some of the old threads in #89018 (comment)

It does seem like there was some discussion about revisiting this after the initial patch, and I think now that we've run into some complexity with trying to extend it to handle RISC-V's factors I think it's a good time to re-evaluate.

I've got a patch that adds [de]interleave6 as well as [de]interleave{4,8}, that I hope to post soon to get some feedback on.

The gist is that the ability to support arbitrary interleave factors seems premature, since I'm not aware of any hardware that can take advantage of anything beyond a factor of 8, so the simplest setup would just be to have a dedicated intrinsic for each factor:

  • Unlike shufflevectors with fixed-length vectors, detecting an interleave pattern isn't as trivial
  • Unlike shufflevectors, there's no optimisation that happens on [de]interleave2 intriniscs
  • We already have intrinsics for 2,3,5 and 7, so by avoiding 4,6 and 8 we're not really saving much

The eventual plan would be to move from this state of affairs:

Fixed-length vectors:

Loop vectorizer: emitted as a series of strided shufflevectors, one for each factor
InterleavedAccessPass: can lower any factor of shufflevectors to target intrinsics

Scalable vectors:

Loop vectorizer: Only powers of 2 supported, emitted as interleaves of [de]interleave2 intrinsics
InterleavedAccessPass: Only matches powers of 2 interleaves of [de]interleave2 intrinsics

To this:

Fixed-length vectors:

Loop vectorizer: emitted as a series of strided shufflevectors, one for each factor
InterleavedAccessPass: can lower any factor of shufflevectors to target intrinsics

Scalable vectors:

Loop vectorizer: All factors up to 8 supported, emitted as single [de]interleaveN intrinsic
InterleavedAccessPass: Matches all factors up to 8 of [de]interleaveN intrinsics

And to get there, I'm imagining the following steps:

  • We add intrinsics so we now have [de]interleave{2,3,4,5,6,7,8}
  • To handle the case where an interleave somehow doesn't get lowered into a ld4 intrinsic, the "interleave of interleaves" expansion from the loop vectorizer into AArch64's VECTOR_[DE]INTERLEAVE lowering
  • Teach InterleaveAccessPass to handle the new intrinsics: trivial
  • Teach the loop vectorizer to just emit a single intrinsic instead
  • Remove getVectorInterleaveFactor/getVectorDeinterleaveFactor in InterleaveAccessPass

If we do end up wanting to support even higher interleave factors > 8, e.g. #89018 (comment), then we could return and address this later. Since at the moment, it doesn't look like we can lower a factor 8 interleave into a ld4 + ld2 on AArch64 anyway.

WDYT?

@mshockwave
Copy link
Member

mshockwave commented May 13, 2025

  • To handle the case where an interleave somehow doesn't get lowered into a ld4 intrinsic, the "interleave of interleaves" expansion from the loop vectorizer into AArch64's VECTOR_[DE]INTERLEAVE lowering

I kind of miss this part: are you referring to the problems with shufflevectors where interleave4 patterns might not be in the canonical form?

But other than that point, I think your plan is reasonable. As pointed out in #89018 the biggest concern for having dedicated intrinsics at the time was that we hope to synthesize arbitrary factors with only a few intrinsics, such that we'll only have a single lowering path. But right now it seems like it's probably easier to just having two lowering paths where the "fast path" is using dedicated intrinsics for natively supported factor (i.e. up to 8 for RISC-V), while the other path -- which might be less common -- is synthesizing higher factors with intrinsics of smaller factors.

Whether we're gonna drop shufflevectors to only use (de)interleave intrinsics for both fixed and scalable vectors in the future is another thing. I definitely think that's a good idea to streamline things a little bit though.

(CC other folks from that thread: @davemgreen @efriedma-quic @paulwalker-arm)

@davemgreen
Copy link
Collaborator

Having a single intrinsic would make cost-modelling more accurate. Replacing the fixed width shuffles with a single intrinsic I think would be worth-while, and would fix a number of perf issues in the current design. (We just have to make sure that loops with isomorphic operations do not get accidentally worse).

@lukel97
Copy link
Contributor Author

lukel97 commented May 13, 2025

I kind of miss this part: are you referring to the problems with shufflevectors where interleave4 patterns might not be in the canonical form?

This is for scalable vectors, AArch64 can currently only lower vector_[de]interleave nodes with a factor of 2. If we teach the loop vectorizer to emit interleave4, instead of multiple interleave2s, then we should probably teach AArch64 to handle vector_[de]interleave nodes with factor 4.

This is to handle the case I think where the loop vectorizer decided it was profitable to emit an interleave group, but the interleaved access pass decided/didn't lower it to a target memory intrinsic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants