[mlir][OpenMP] rewrite conversion of privatisation for omp.parallel #111844


Merged: 7 commits into llvm:main from ecclescake/delayed-privatisation-global on Oct 16, 2024

Conversation

@tblah (Contributor) commented Oct 10, 2024

The existing conversion inlined private alloc regions and firstprivate copy regions in MLIR, then undid those modifications to the MLIR module before completing the conversion. To make this work, LLVM IR had to be generated using the wrong mapping for privatised values, which was later fixed up inside of OpenMPIRBuilder.

This approach violated an assumption in OpenMPIRBuilder that private variables would be values, not constants. Flang sometimes generates code where private variables are promoted to globals, whose addresses are treated as constants in LLVM IR. This prevented the incorrect values for the private variables from being replaced by OpenMPIRBuilder, ultimately resulting in programs producing incorrect results.

This patch rewrites delayed privatisation for omp.parallel to work more like reductions: translating directly into LLVM IR with the correct mappings for private variables.
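
For context, in delayed privatisation the allocation (and, for firstprivate, copy) logic stays outlined in an omp.private op that omp.parallel references by symbol, instead of being inlined at the use site. A minimal sketch of the input IR, with illustrative names and simplified types:

omp.private {type = private} @x.privatizer : !llvm.ptr alloc {
^bb0(%mold: !llvm.ptr):
  // allocate the private copy; %mold is the original variable, which some
  // alloc regions must read (e.g. to get an allocatable's extents)
  %c1 = llvm.mlir.constant(1 : i32) : i32
  %priv = llvm.alloca %c1 x f32 : (i32) -> !llvm.ptr
  omp.yield(%priv : !llvm.ptr)
}

llvm.func @foo(%x: !llvm.ptr) {
  // inside the region, the block argument %px stands for the private copy;
  // translation must map it to the value yielded by the alloc region
  omp.parallel private(@x.privatizer %x -> %px : !llvm.ptr) {
    omp.terminator
  }
  llvm.return
}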

RFC: https://discourse.llvm.org/t/rfc-openmp-fix-issue-in-mlir-to-llvmir-translation-for-delayed-privatisation/81225

Tested against the gfortran testsuite and our internal test suite. Linaro's post-commit bots will check against the Fujitsu test suite.

I decided to add the new tests as flang integration tests rather than in mlir/test/Target/LLVMIR:

  • The regression test is for an issue filed against flang. I wanted to keep the reproducer similar to the code in the ticket.
  • I found the "worst case" CFG test difficult to reason about in the abstract; it helped me to think about what was going on in terms of a Fortran program.

Fixes #106297

@llvmbot (Member) commented Oct 10, 2024

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-openmp

Author: Tom Eccles (tblah)



Patch is 41.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111844.diff

5 Files Affected:

  • (added) flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90 (+261)
  • (added) flang/test/Integration/OpenMP/private-global.f90 (+46)
  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+150-253)
  • (modified) mlir/test/Target/LLVMIR/openmp-firstprivate.mlir (+18-7)
  • (modified) mlir/test/Target/LLVMIR/openmp-private.mlir (+4-2)
diff --git a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90 b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
new file mode 100644
index 00000000000000..b1fcb52b12ae9c
--- /dev/null
+++ b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
@@ -0,0 +1,261 @@
+! RUN: %flang_fc1 -fopenmp -emit-llvm %s -o - | FileCheck %s
+
+! stress test cfg and builder insertion points in mlir-to-llvm conversion:
+!   - mixing multiple delayed privatisations and multiple reductions
+!   - multiple blocks in the private alloc region
+!   - private alloc region has to read from the mold variable
+!   - firstprivate
+!   - multiple blocks in the private copy region
+!   - multiple blocks in the reduction init region
+!   - reduction init region has to read from the mold variable
+!   - re-used omp.private ops
+!   - re-used omp.reduction.declare ops
+!   - unstructured code inside of the parallel region
+!   - needs private dealloc region, and this has multiple blocks
+!   - needs reduction cleanup region, and this has multiple blocks
+
+! This maybe belongs in the mlir tests, but what we are doing here is complex
+! enough that I find the kind of minimised mlir code preferred by mlir
+! reviewers hard to read without some Fortran here for reference. Nothing like
+! this would be generated by other upstream users of the MLIR OpenMP dialect.
+
+subroutine worst_case(a, b, c, d)
+  real, allocatable :: a(:), b(:), c(:), d(:)
+  integer i
+
+  !$omp parallel firstprivate(a,b) reduction(+:c,d)
+  if (sum(a) == 1) stop 1
+  !$omp end parallel
+end subroutine
+
+! CHECK-LABEL: define internal void @worst_case_..omp_par
+! CHECK-NEXT:  omp.par.entry:
+!                [reduction alloc regions inlined here]
+! CHECK:         br label %omp.private.latealloc
+
+! CHECK:       omp.private.latealloc:                            ; preds = %omp.par.entry
+! CHECK-NEXT:  br label %omp.private.alloc5
+
+! CHECK:       omp.private.alloc5:                               ; preds = %omp.private.latealloc
+!                [begin private alloc for first var]
+!                [read the length from the mold argument]
+!                [if it is non-zero...]
+! CHECK:         br i1 {{.*}}, label %omp.private.alloc6, label %omp.private.alloc7
+
+! CHECK:       omp.private.alloc7:                               ; preds = %omp.private.alloc5
+!                [finish private alloc for first var with zero extent]
+! CHECK:         br label %omp.private.alloc8
+
+! CHECK:       omp.private.alloc8:                               ; preds = %omp.private.alloc6, %omp.private.alloc7
+! CHECK-NEXT:    br label %omp.region.cont4
+
+! CHECK:       omp.region.cont4:                                 ; preds = %omp.private.alloc8
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.private.alloc
+
+! CHECK:       omp.private.alloc:                                ; preds = %omp.region.cont4
+!                [begin private alloc for first var]
+!                [read the length from the mold argument]
+!                [if it is non-zero...]
+! CHECK:         br i1 %{{.*}}, label %omp.private.alloc1, label %omp.private.alloc2
+
+! CHECK:       omp.private.alloc2:                               ; preds = %omp.private.alloc
+!                [finish private alloc for second var with zero extent]
+! CHECK:         br label %omp.private.alloc3
+
+! CHECK:       omp.private.alloc3:                               ; preds = %omp.private.alloc1, %omp.private.alloc2
+! CHECK-NEXT:    br label %omp.region.cont
+
+! CHECK:       omp.region.cont:                                  ; preds = %omp.private.alloc3
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.private.copy
+
+! CHECK:       omp.private.copy:                                 ; preds = %omp.region.cont
+! CHECK-NEXT:    br label %omp.private.copy10
+
+! CHECK:       omp.private.copy10:                               ; preds = %omp.private.copy
+!                [begin firstprivate copy for first var]
+!                [read the length, is it non-zero?]
+! CHECK:         br i1 %{{.*}}, label %omp.private.copy11, label %omp.private.copy12
+
+! CHECK:       omp.private.copy12:                               ; preds = %omp.private.copy11, %omp.private.copy10
+! CHECK-NEXT:    br label %omp.region.cont9
+
+! CHECK:       omp.region.cont9:                                 ; preds = %omp.private.copy12
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.private.copy14
+
+! CHECK:       omp.private.copy14:                               ; preds = %omp.region.cont9
+!                [begin firstprivate copy for second var]
+!                [read the length, is it non-zero?]
+! CHECK:         br i1 %{{.*}}, label %omp.private.copy15, label %omp.private.copy16
+
+! CHECK:       omp.private.copy16:                               ; preds = %omp.private.copy15, %omp.private.copy14
+! CHECK-NEXT:    br label %omp.region.cont13
+
+! CHECK:       omp.region.cont13:                                ; preds = %omp.private.copy16
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.reduction.init
+
+! CHECK:       omp.reduction.init:                               ; preds = %omp.region.cont13
+!                [deferred stores for results of reduction alloc regions]
+! CHECK:         br label %[[VAL_96:.*]]
+
+! CHECK:       omp.reduction.neutral:                            ; preds = %omp.reduction.init
+!                [start of reduction initialization region]
+!                [null check:]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral18, label %omp.reduction.neutral19
+
+! CHECK:       omp.reduction.neutral19:                          ; preds = %omp.reduction.neutral
+!                [malloc and assign the default value to the reduction variable]
+! CHECK:         br label %omp.reduction.neutral20
+
+! CHECK:       omp.reduction.neutral20:                          ; preds = %omp.reduction.neutral18, %omp.reduction.neutral19
+! CHECK-NEXT:    br label %omp.region.cont17
+
+! CHECK:       omp.region.cont17:                                ; preds = %omp.reduction.neutral20
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.reduction.neutral22
+
+! CHECK:       omp.reduction.neutral22:                          ; preds = %omp.region.cont17
+!                [start of reduction initialization region]
+!                [null check:]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral23, label %omp.reduction.neutral24
+
+! CHECK:       omp.reduction.neutral24:                          ; preds = %omp.reduction.neutral22
+!                [malloc and assign the default value to the reduction variable]
+! CHECK:         br label %omp.reduction.neutral25
+
+! CHECK:       omp.reduction.neutral25:                          ; preds = %omp.reduction.neutral23, %omp.reduction.neutral24
+! CHECK-NEXT:    br label %omp.region.cont21
+
+! CHECK:       omp.region.cont21:                                ; preds = %omp.reduction.neutral25
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.par.region
+
+! CHECK:       omp.par.region:                                   ; preds = %omp.region.cont21
+! CHECK-NEXT:    br label %omp.par.region27
+
+! CHECK:       omp.par.region27:                                 ; preds = %omp.par.region
+!                [call SUM runtime function]
+!                [if (sum(a) == 1)]
+! CHECK:         br i1 %{{.*}}, label %omp.par.region28, label %omp.par.region29
+
+! CHECK:       omp.par.region29:                                 ; preds = %omp.par.region27
+! CHECK-NEXT:    br label %omp.region.cont26
+
+! CHECK:       omp.region.cont26:                                ; preds = %omp.par.region28, %omp.par.region29
+!                [omp parallel region done, call into the runtime to complete reduction]
+! CHECK:         %[[VAL_233:.*]] = call i32 @__kmpc_reduce(
+! CHECK:         switch i32 %[[VAL_233]], label %reduce.finalize [
+! CHECK-NEXT:      i32 1, label %reduce.switch.nonatomic
+! CHECK-NEXT:      i32 2, label %reduce.switch.atomic
+! CHECK-NEXT:    ]
+
+! CHECK:       reduce.switch.atomic:                             ; preds = %omp.region.cont26
+! CHECK-NEXT:    unreachable
+
+! CHECK:       reduce.switch.nonatomic:                          ; preds = %omp.region.cont26
+! CHECK-NEXT:    %[[red_private_value_0:.*]] = load ptr, ptr %{{.*}}, align 8
+! CHECK-NEXT:    br label %omp.reduction.nonatomic.body
+
+!              [various blocks implementing the reduction]
+
+! CHECK:       omp.region.cont35:                                ; preds =
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    call void @__kmpc_end_reduce(
+! CHECK-NEXT:    br label %reduce.finalize
+
+! CHECK:       reduce.finalize:                                  ; preds =
+! CHECK-NEXT:    br label %omp.par.pre_finalize
+
+! CHECK:       omp.par.pre_finalize:                             ; preds = %reduce.finalize
+! CHECK-NEXT:    %{{.*}} = load ptr, ptr
+! CHECK-NEXT:    br label %omp.reduction.cleanup
+
+! CHECK:       omp.reduction.cleanup:                            ; preds = %omp.par.pre_finalize
+!                [null check]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup41, label %omp.reduction.cleanup42
+
+! CHECK:       omp.reduction.cleanup42:                          ; preds = %omp.reduction.cleanup41, %omp.reduction.cleanup
+! CHECK-NEXT:    br label %omp.region.cont40
+
+! CHECK:       omp.region.cont40:                                ; preds = %omp.reduction.cleanup42
+! CHECK-NEXT:    %{{.*}} = load ptr, ptr
+! CHECK-NEXT:    br label %omp.reduction.cleanup44
+
+! CHECK:       omp.reduction.cleanup44:                          ; preds = %omp.region.cont40
+!                [null check]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup45, label %omp.reduction.cleanup46
+
+! CHECK:       omp.reduction.cleanup46:                          ; preds = %omp.reduction.cleanup45, %omp.reduction.cleanup44
+! CHECK-NEXT:    br label %omp.region.cont43
+
+! CHECK:       omp.region.cont43:                                ; preds = %omp.reduction.cleanup46
+! CHECK-NEXT:    br label %omp.private.dealloc
+
+! CHECK:       omp.private.dealloc:                              ; preds = %omp.region.cont43
+!                [null check]
+! CHECK:         br i1 %{{.*}}, label %omp.private.dealloc48, label %omp.private.dealloc49
+
+! CHECK:       omp.private.dealloc49:                            ; preds = %omp.private.dealloc48, %omp.private.dealloc
+! CHECK-NEXT:    br label %omp.region.cont47
+
+! CHECK:       omp.region.cont47:                                ; preds = %omp.private.dealloc49
+! CHECK-NEXT:    br label %omp.private.dealloc51
+
+! CHECK:       omp.private.dealloc51:                            ; preds = %omp.region.cont47
+!                [null check]
+! CHECK:         br i1 %{{.*}}, label %omp.private.dealloc52, label %omp.private.dealloc53
+
+! CHECK:       omp.private.dealloc53:                            ; preds = %omp.private.dealloc52, %omp.private.dealloc51
+! CHECK-NEXT:    br label %omp.region.cont50
+
+! CHECK:       omp.region.cont50:                                ; preds = %omp.private.dealloc53
+! CHECK-NEXT:    br label %omp.par.outlined.exit.exitStub
+
+! CHECK:       omp.private.dealloc52:                            ; preds = %omp.private.dealloc51
+!                [dealloc memory]
+! CHECK:         br label %omp.private.dealloc53
+
+! CHECK:       omp.private.dealloc48:                            ; preds = %omp.private.dealloc
+!                [dealloc memory]
+! CHECK:         br label %omp.private.dealloc49
+
+! CHECK:       omp.reduction.cleanup45:                          ; preds = %omp.reduction.cleanup44
+! CHECK-NEXT:    call void @free(
+! CHECK-NEXT:    br label %omp.reduction.cleanup46
+
+! CHECK:       omp.reduction.cleanup41:                          ; preds = %omp.reduction.cleanup
+! CHECK-NEXT:    call void @free(
+! CHECK-NEXT:    br label %omp.reduction.cleanup42
+
+! CHECK:       omp.par.region28:                                 ; preds = %omp.par.region27
+! CHECK-NEXT:    call {} @_FortranAStopStatement
+
+! CHECK:       omp.reduction.neutral23:                          ; preds = %omp.reduction.neutral22
+!                [source length was zero: finish initializing array]
+! CHECK:         br label %omp.reduction.neutral25
+
+! CHECK:       omp.reduction.neutral18:                          ; preds = %omp.reduction.neutral
+!                [source length was zero: finish initializing array]
+! CHECK:         br label %omp.reduction.neutral20
+
+! CHECK:       omp.private.copy15:                               ; preds = %omp.private.copy14
+!                [source length was non-zero: call assign runtime]
+! CHECK:         br label %omp.private.copy16
+
+! CHECK:       omp.private.copy11:                               ; preds = %omp.private.copy10
+!                [source length was non-zero: call assign runtime]
+! CHECK:         br label %omp.private.copy12
+
+! CHECK:       omp.private.alloc1:                               ; preds = %omp.private.alloc
+!                [var extent was non-zero: malloc a private array]
+! CHECK:         br label %omp.private.alloc3
+
+! CHECK:       omp.private.alloc6:                               ; preds = %omp.private.alloc5
+!                [var extent was non-zero: malloc a private array]
+! CHECK:         br label %omp.private.alloc8
+
+! CHECK:       omp.par.outlined.exit.exitStub:                   ; preds = %omp.region.cont50
+! CHECK-NEXT:    ret void
diff --git a/flang/test/Integration/OpenMP/private-global.f90 b/flang/test/Integration/OpenMP/private-global.f90
new file mode 100644
index 00000000000000..62d0a3faf0c593
--- /dev/null
+++ b/flang/test/Integration/OpenMP/private-global.f90
@@ -0,0 +1,46 @@
+!RUN: %flang_fc1 -emit-llvm -fopenmp %s -o - | FileCheck %s
+
+! Regression test for https://github.com/llvm/llvm-project/issues/106297
+
+program bug
+  implicit none
+  integer :: table(10)
+  !$OMP PARALLEL PRIVATE(table)
+    table = 50
+    if (any(table/=50)) then
+      stop 'fail 3'
+    end if
+  !$OMP END PARALLEL
+  print *,'ok'
+End Program
+
+
+! CHECK-LABEL: define internal void {{.*}}..omp_par(
+! CHECK:       omp.par.entry:
+! CHECK:         %[[VAL_9:.*]] = alloca i32, align 4
+! CHECK:         %[[VAL_10:.*]] = load i32, ptr %[[VAL_11:.*]], align 4
+! CHECK:         store i32 %[[VAL_10]], ptr %[[VAL_9]], align 4
+! CHECK:         %[[VAL_12:.*]] = load i32, ptr %[[VAL_9]], align 4
+! CHECK:         %[[PRIV_TABLE:.*]] = alloca [10 x i32], i64 1, align 4
+! ...
+! check that we use the private copy of table for the assignment
+! CHECK:       omp.par.region1:
+! CHECK:         %[[ELEMENTAL_TMP:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[TABLE_BOX_ADDR:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[BOXED_FIFTY:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8
+! CHECK:         %[[TABLE_BOX_ADDR2:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, i64 1, align 8
+! CHECK:         %[[TABLE_BOX_VAL:.*]] = insertvalue { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } { ptr undef, i64 ptrtoint (ptr getelementptr (i32, ptr null, i32 1) to i64), i32 20240719, i8 1, i8 9, i8 0, i8 0, [1 x [3 x i64]] {{\[\[}}3 x i64] [i64 1, i64 10, i64 ptrtoint (ptr getelementptr (i32, ptr null, i32 1) to i64)]] }, ptr %[[PRIV_TABLE]], 0
+! CHECK:         store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %[[TABLE_BOX_VAL]], ptr %[[TABLE_BOX_ADDR]], align 8
+! CHECK:         %[[TABLE_BOX_VAL2:.*]] = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr %[[TABLE_BOX_ADDR]], align 8
+! CHECK:         store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %[[TABLE_BOX_VAL2]], ptr %[[TABLE_BOX_ADDR2]], align 8
+! CHECK:         %[[VAL_26:.*]] = call {} @_FortranAAssign(ptr %[[TABLE_BOX_ADDR2]], ptr %[[BOXED_FIFTY]], ptr @{{.*}}, i32 9)
+! ...
+! check that we use the private copy of table for table/=50
+! CHECK:       omp.par.region3:
+! CHECK:         %[[VAL_44:.*]] = sub nsw i64 %{{.*}}, 1
+! CHECK:         %[[VAL_45:.*]] = mul nsw i64 %[[VAL_44]], 1
+! CHECK:         %[[VAL_46:.*]] = mul nsw i64 %[[VAL_45]], 1
+! CHECK:         %[[VAL_47:.*]] = add nsw i64 %[[VAL_46]], 0
+! CHECK:         %[[VAL_48:.*]] = getelementptr i32, ptr %[[PRIV_TABLE]], i64 %[[VAL_47]]
+! CHECK:         %[[VAL_49:.*]] = load i32, ptr %[[VAL_48]], align 4
+! CHECK:         %[[VAL_50:.*]] = icmp ne i32 %[[VAL_49]], 50
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 19d80fbbd699b0..6eb00d44460a97 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -371,20 +371,46 @@ convertOmpCritical(Operation &opInst, llvm::IRBuilderBase &builder,
   return success();
 }
 
-/// Populates `reductions` with reduction declarations used in the given loop.
+// Looks up the privatizer with name `symbolName`, starting the symbol table
+// search from the operation `from`.
+static omp::PrivateClauseOp findPrivatizer(Operation *from,
+                                           SymbolRefAttr symbolName) {
+  omp::PrivateClauseOp privatizer =
+      SymbolTable::lookupNearestSymbolFrom<omp::PrivateClauseOp>(from,
+                                                                 symbolName);
+  assert(privatizer && "privatizer not found in the symbol table");
+  return privatizer;
+}
+
+/// Populates `privatizations` with privatisation declarations used for the
+/// given op.
+/// TODO: generalise beyond ParallelOp
+static void collectPrivatizationDecls(
+    omp::ParallelOp op, SmallVectorImpl<omp::PrivateClauseOp> &privatizations) {
+  std::optional<ArrayAttr> attr = op.getPrivateSyms();
+  if (!attr)
+    return;
+
+  privatizations.reserve(privatizations.size() + attr->size());
+  for (auto symbolRef : attr->getAsRange<SymbolRefAttr>()) {
+    privatizations.push_back(findPrivatizer(op, symbolRef));
+  }
+}
+
+/// Populates `reductions` with reduction declarations used in the given op.
 template <typename T>
 static void
-collectReductionDecls(T loop,
+collectReductionDecls(T op,
                       SmallVectorImpl<omp::DeclareReductionOp> &reductions) {
-  std::optional<ArrayAttr> attr = loop.getReductionSyms();
+  std::optional<ArrayAttr> attr = op.getReductionSyms();
   if (!attr)
     return;
 
-  reductions.reserve(reductions.size() + loop.getNumReductionVars());
+  reductions.reserve(reductions.size() + op.getNumReductionVars());
   for (auto symbolRef : attr->getAsRange<SymbolRefAttr>()) {
     reductions.push_back(
         SymbolTable::lookupNearestSymbolFrom<omp::DeclareReductionOp>(
-            loop, symbolRef));
+            op, symbolRef));
   }
 }
 
@@ -611,7 +637,7 @@ static LogicalResult
 allocReductionVars(T loop, ArrayRef<BlockArgument> reductionArgs,
                    llvm::IRBuilderBase &builder,
                    LLVM::ModuleTranslation &moduleTranslation,
-                   llvm::OpenMPIRBuilder::InsertPointTy &allocaIP,
+                   const llvm::OpenMPIRBuilder::InsertPointTy &allocaIP,
                    SmallVectorImpl<omp::DeclareReductionOp> &reductionDecls,
                    SmallVectorImpl<llvm::Value *> &privateReductionVariables,
                    DenseMap<Value, llvm::Value *> &reductionVariableMap,
@@ -1319,76 +1345,11 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder,
                                     privateReductionVariables, isByRef);
 }
 
-/// A RAII class that on construction replaces the region arguments of the
-/// parallel op (which correspond to private variables) with the actual private
-/// variables they correspond to. This prepares the parallel op so that it
-/// matches what is expected by the OMPIRBuilder.
-///
-/// On destruction, it restores the original state of the operation so that on
-/// the MLIR side, the op is not affected by conversion to LLVM IR.
-class OmpParallelOpConversionManager {
-public:
-  OmpParallelOpConversionManager(omp::ParallelOp opInst)
-      : region(opInst.getRegion()),
-        privateBlockArgs(cast<omp::BlockArgOpenMPOpInterface>(*opInst)
-                   ...
[truncated]

@kiranchandramohan (Contributor) commented:

The fujitsu testsuite is at https://github.com/fujitsu/compiler-test-suite. Could you give it a try? You can download it into a directory inside llvm-testsuite.


// in-place convert the private allocation region
SmallVector<llvm::Value *, 1> phis;
if (allocRegion.getArgument(0).getUses().empty())
Contributor:

What is this if case here? Does it mean that we will not end up using the privAllocaBlock that was created?

Contributor (author):

Yes. We only use it for cases where it is needed (when the alloc region reads from the original variable). This is to ensure that allocas end up in the entry block as much as possible.

There's a bit more explanation in the comment describing the allocRegion.
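
For illustration, a schematic paraphrase of the branch in question (not the patch's verbatim code; the helper and its signature are invented for this sketch):

// Decide where to emit the contents of a private alloc region.
static llvm::OpenMPIRBuilder::InsertPointTy
chooseAllocIP(mlir::Region &allocRegion,
              llvm::OpenMPIRBuilder::InsertPointTy allocaIP,
              llvm::BasicBlock *privAllocBlock) {
  // If the alloc region never reads the original (mold) variable, its
  // contents (typically a single alloca) can be emitted directly at the
  // entry-block insertion point, keeping allocas where LLVM expects them.
  if (allocRegion.getArgument(0).getUses().empty())
    return allocaIP;
  // Otherwise the region can only run once the mold value is available, so
  // it is inlined into the dedicated late-alloc block instead.
  return {privAllocBlock, privAllocBlock->getTerminator()->getIterator()};
}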

@tblah (Contributor, author) commented Oct 11, 2024

The fujitsu testsuite is at https://github.com/fujitsu/compiler-test-suite. Could you give it a try? You can download it into a directory inside llvm-testsuite.

After my patch I get the same number of "executable missing" tests (which indicate a compilation failure) and 18 fewer failed tests.

@ergawy (Member) left a comment:

LGTM, thanks Tom!

@@ -1421,12 +1450,57 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
deferredStores, isByRef)))
bodyGenStatus = failure();

// Apply copy region for firstprivate.
if (!privateBlockArgs.empty()) {
Contributor:

Is this sufficient check for the existence of Firstprivate?

If the intention is to always create these blocks, should these blocks be created by the initial skeleton creation in OpenMPIRBuilder?

Contributor (author):

I'm only checking that there are private arguments, not whether any are firstprivate, because it felt like overkill to scan every argument looking for a firstprivate one. Would you prefer I did that?

I don't want to always create the blocks because that would lead to unnecessary test churn, especially if they were in OpenMPIRBuilder, as that may affect clang too.

Contributor:

If we can write a quick std::find_if for firstprivate, that would be great.

Could you also have a quick check/thought about whether the move from the separate callback to body gen callback will hamper the transformations that they talk about in the following presentation?
https://llvm.org/devmtg/2019-04/slides/TechTalk-Doerfert-Compiler_optimization_for_OpenMP_accelerator_offloading.pdf

FYI. No action required. The folks working on llvm openmp were planning on adding and improving the openmp-opt pass. This would work well if the regions produced by both Clang and Flang are similar.
https://openmp.llvm.org/optimizations/OpenMPOpt.html
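
A sketch of the check suggested above, using llvm::any_of in place of std::find_if and assuming the PrivateClauseOp accessors of the MLIR OpenMP dialect (illustrative, not the patch's verbatim code):

#include "llvm/ADT/STLExtras.h"                 // llvm::any_of
#include "mlir/Dialect/OpenMP/OpenMPDialect.h"  // omp::PrivateClauseOp

// Returns true if at least one collected privatizer is firstprivate,
// i.e. has a copy region that must be applied after allocation.
static bool needsFirstprivateCopy(
    llvm::ArrayRef<mlir::omp::PrivateClauseOp> privateDecls) {
  return llvm::any_of(privateDecls, [](mlir::omp::PrivateClauseOp decl) {
    return decl.getDataSharingType() ==
           mlir::omp::DataSharingClauseType::FirstPrivate;
  });
}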

Contributor (author):

Thanks for the links.

Could you also have a quick check/thought about whether the move from the separate callback to body gen callback will hamper the transformations that they talk about in the following presentation?

Clang doesn't use the separate privatisation callback either; it passes a callback that is equivalent to a no-op in OpenMPIRBuilder, roughly:

auto PrivCB = [](InsertPointTy AllocaIP, InsertPointTy CodeGenIP,
                 llvm::Value &, llvm::Value &Val, llvm::Value *&ReplVal) {
  ReplVal = &Val; // keep using the original value: no privatisation here
  return CodeGenIP;
};

TL;DR: I don't think this PR will make things worse but it would be good to take a closer look at any work needed to enable these optimizations in the future.

My educated guesses on some of the optimizations in the presentation:

  • Forwarding communicated/shared values between parallel regions -> I am not sure if this will work for flang, but it seems to work using the arguments to the outlined function, which are not changed by my PR.
  • Interprocedural constant propagation for outlined parallel regions -> again, the arguments to the outlined function have not changed. The use of the arguments should only have changed in the case of the bug I was fixing.
  • Attribute deduction -> the same as above.
  • Argument promotion (shared -> firstprivate -> private) -> I don't know how this works. It probably never worked for code generated by flang.
  • Parallel region expansion and barrier elimination -> I imagine this should work. It should only need to look at the libcalls generated by OpenMPIRBuilder, and maybe stitch some parallel regions together.
  • Communication optimization and value transfer -> this should be unaffected by my patch; it depends on how variables are passed into and out of parallel regions (by ref or by val).

I can't really comment on device code and cross host-device IPO as I'm very unfamiliar with that pipeline and IIUC there's a lot of code yet to be upstreamed.

FYI. No action required. The folks working on llvm openmp were planning on adding and improving the openmp-opt pass. This would work well if the regions produced by both Clang and Flang are similar.

There are already a few places where we diverge (e.g. the reduction init block(s)). Plus there are inherent differences, such as the much higher complexity of Fortran datatypes allowed in privatization and reduction (allocatable arrays etc.). Maybe future work could unpack Fortran runtime descriptors in cases where they are not needed so that the IR looks more like it would for a C array. Currently Flang reductions and privatization force arrays and allocatables to be boxed.

The runtime call deduplication mentioned on the webpage ought to still work. We still generate runtime calls using OpenMPIRBuilder and the parallel region is still inside of the outlined function.

The webpage also mentions globalization for data shared between different threads on a GPU. I don't know much about the GPU flow.

@Meinersbur (Member) left a comment:

I like any patch that removes more code than it adds

template <typename T>
static void
-collectReductionDecls(T loop,
+collectReductionDecls(T op,
Member:

Noise in the diff?

Contributor (author):

As I was reading the code I realized this name was misleading because the op might be a ParallelOp. Would you rather I do this in a separate patch?

@kiranchandramohan (Contributor) left a comment:

Thanks Tom for the detailed comments and going through the links I gave.

LGTM. Nice work!

tblah added 6 commits October 15, 2024 10:55
@tblah force-pushed the ecclescake/delayed-privatisation-global branch from a1af027 to 2748a54 on October 15, 2024 11:11

github-actions bot commented Oct 15, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@tblah merged commit 621fcf8 into llvm:main on Oct 16, 2024
8 checks passed
Labels: flang, flang:openmp, mlir, mlir:llvm, mlir:openmp

Closes: [flang][OpenMP] Parallel region failure (#106297)

6 participants