[mlir][OpenMP] rewrite conversion of privatisation for omp.parallel #111844


Merged: 7 commits into llvm:main from ecclescake/delayed-privatisation-global on Oct 16, 2024

Conversation

@tblah (Contributor) commented Oct 10, 2024

The existing conversion inlined private alloc regions and firstprivate copy regions in MLIR, then undid those modifications to the MLIR module before completing the conversion. To make this work, LLVM IR had to be generated using the wrong mapping for privatised values, which was later fixed up inside of OpenMPIRBuilder.

This approach violated an assumption in OpenMPIRBuilder that private variables would be values, not constants. Flang sometimes generates code where private variables are promoted to globals, whose addresses are treated as constants in LLVM IR. This prevented the incorrect values for the private variables from being replaced by OpenMPIRBuilder, ultimately resulting in programs producing incorrect results.

This patch rewrites delayed privatisation for omp.parallel to work more like reductions: translating directly into LLVM IR with the correct mappings for private variables.
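
For context, in delayed privatisation the allocation (and, for firstprivate, copy) logic stays outlined in an omp.private op that omp.parallel references by symbol, instead of being inlined at the use site. A minimal sketch of the input IR, with illustrative names and simplified types:

omp.private {type = private} @x.privatizer : !llvm.ptr alloc {
^bb0(%mold: !llvm.ptr):
  // allocate the private copy; %mold is the original variable, which some
  // alloc regions must read (e.g. to get an allocatable's extents)
  %c1 = llvm.mlir.constant(1 : i32) : i32
  %priv = llvm.alloca %c1 x f32 : (i32) -> !llvm.ptr
  omp.yield(%priv : !llvm.ptr)
}

llvm.func @foo(%x: !llvm.ptr) {
  // inside the region, the block argument %px stands for the private copy;
  // translation must map it to the value yielded by the alloc region
  omp.parallel private(@x.privatizer %x -> %px : !llvm.ptr) {
    omp.terminator
  }
  llvm.return
}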

RFC: https://discourse.llvm.org/t/rfc-openmp-fix-issue-in-mlir-to-llvmir-translation-for-delayed-privatisation/81225

Tested against the gfortran testsuite and our internal test suite. Linaro's post-commit bots will check against the Fujitsu test suite.

I decided to add the new tests as flang integration tests rather than in mlir/test/Target/LLVMIR:

  • The regression test is for an issue filed against flang. I wanted to keep the reproducer similar to the code in the ticket.
  • I found the "worst case" CFG test difficult to reason about in the abstract; it helped me to think about what was going on in terms of a Fortran program.

Fixes #106297

@llvmbot (Member) commented Oct 10, 2024

@llvm/pr-subscribers-mlir

@llvm/pr-subscribers-mlir-openmp

Author: Tom Eccles (tblah)



Patch is 41.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111844.diff

5 Files Affected:

  • (added) flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90 (+261)
  • (added) flang/test/Integration/OpenMP/private-global.f90 (+46)
  • (modified) mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp (+150-253)
  • (modified) mlir/test/Target/LLVMIR/openmp-firstprivate.mlir (+18-7)
  • (modified) mlir/test/Target/LLVMIR/openmp-private.mlir (+4-2)
diff --git a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90 b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
new file mode 100644
index 00000000000000..b1fcb52b12ae9c
--- /dev/null
+++ b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
@@ -0,0 +1,261 @@
+! RUN: %flang_fc1 -fopenmp -emit-llvm %s -o - | FileCheck %s
+
+! stress test cfg and builder insertion points in mlir-to-llvm conversion:
+!   - mixing multiple delayed privatisations and multiple reductions
+!   - multiple blocks in the private alloc region
+!   - private alloc region has to read from the mold variable
+!   - firstprivate
+!   - multiple blocks in the private copy region
+!   - multiple blocks in the reduction init region
+!   - reduction init region has to read from the mold variable
+!   - re-used omp.private ops
+!   - re-used omp.reduction.declare ops
+!   - unstructured code inside of the parallel region
+!   - needs private dealloc region, and this has multiple blocks
+!   - needs reduction cleanup region, and this has multiple blocks
+
+! This maybe belongs in the mlir tests, but what we are doing here is complex
+! enough that I find the kind of minimised mlir code preferred by mlir
+! reviewers hard to read without some Fortran here for reference. Nothing like
+! this would be generated by other upstream users of the MLIR OpenMP dialect.
+
+subroutine worst_case(a, b, c, d)
+  real, allocatable :: a(:), b(:), c(:), d(:)
+  integer i
+
+  !$omp parallel firstprivate(a,b) reduction(+:c,d)
+  if (sum(a) == 1) stop 1
+  !$omp end parallel
+end subroutine
+
+! CHECK-LABEL: define internal void @worst_case_..omp_par
+! CHECK-NEXT:  omp.par.entry:
+!                [reduction alloc regions inlined here]
+! CHECK:         br label %omp.private.latealloc
+
+! CHECK:       omp.private.latealloc:                            ; preds = %omp.par.entry
+! CHECK-NEXT:  br label %omp.private.alloc5
+
+! CHECK:       omp.private.alloc5:                               ; preds = %omp.private.latealloc
+!                [begin private alloc for first var]
+!                [read the length from the mold argument]
+!                [if it is non-zero...]
+! CHECK:         br i1 {{.*}}, label %omp.private.alloc6, label %omp.private.alloc7
+
+! CHECK:       omp.private.alloc7:                               ; preds = %omp.private.alloc5
+!                [finish private alloc for first var with zero extent]
+! CHECK:         br label %omp.private.alloc8
+
+! CHECK:       omp.private.alloc8:                               ; preds = %omp.private.alloc6, %omp.private.alloc7
+! CHECK-NEXT:    br label %omp.region.cont4
+
+! CHECK:       omp.region.cont4:                                 ; preds = %omp.private.alloc8
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.private.alloc
+
+! CHECK:       omp.private.alloc:                                ; preds = %omp.region.cont4
+!                [begin private alloc for first var]
+!                [read the length from the mold argument]
+!                [if it is non-zero...]
+! CHECK:         br i1 %{{.*}}, label %omp.private.alloc1, label %omp.private.alloc2
+
+! CHECK:       omp.private.alloc2:                               ; preds = %omp.private.alloc
+!                [finish private alloc for second var with zero extent]
+! CHECK:         br label %omp.private.alloc3
+
+! CHECK:       omp.private.alloc3:                               ; preds = %omp.private.alloc1, %omp.private.alloc2
+! CHECK-NEXT:    br label %omp.region.cont
+
+! CHECK:       omp.region.cont:                                  ; preds = %omp.private.alloc3
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.private.copy
+
+! CHECK:       omp.private.copy:                                 ; preds = %omp.region.cont
+! CHECK-NEXT:    br label %omp.private.copy10
+
+! CHECK:       omp.private.copy10:                               ; preds = %omp.private.copy
+!                [begin firstprivate copy for first var]
+!                [read the length, is it non-zero?]
+! CHECK:         br i1 %{{.*}}, label %omp.private.copy11, label %omp.private.copy12
+
+! CHECK:       omp.private.copy12:                               ; preds = %omp.private.copy11, %omp.private.copy10
+! CHECK-NEXT:    br label %omp.region.cont9
+
+! CHECK:       omp.region.cont9:                                 ; preds = %omp.private.copy12
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.private.copy14
+
+! CHECK:       omp.private.copy14:                               ; preds = %omp.region.cont9
+!                [begin firstprivate copy for second var]
+!                [read the length, is it non-zero?]
+! CHECK:         br i1 %{{.*}}, label %omp.private.copy15, label %omp.private.copy16
+
+! CHECK:       omp.private.copy16:                               ; preds = %omp.private.copy15, %omp.private.copy14
+! CHECK-NEXT:    br label %omp.region.cont13
+
+! CHECK:       omp.region.cont13:                                ; preds = %omp.private.copy16
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.reduction.init
+
+! CHECK:       omp.reduction.init:                               ; preds = %omp.region.cont13
+!                [deferred stores for results of reduction alloc regions]
+! CHECK:         br label %[[VAL_96:.*]]
+
+! CHECK:       omp.reduction.neutral:                            ; preds = %omp.reduction.init
+!                [start of reduction initialization region]
+!                [null check:]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral18, label %omp.reduction.neutral19
+
+! CHECK:       omp.reduction.neutral19:                          ; preds = %omp.reduction.neutral
+!                [malloc and assign the default value to the reduction variable]
+! CHECK:         br label %omp.reduction.neutral20
+
+! CHECK:       omp.reduction.neutral20:                          ; preds = %omp.reduction.neutral18, %omp.reduction.neutral19
+! CHECK-NEXT:    br label %omp.region.cont17
+
+! CHECK:       omp.region.cont17:                                ; preds = %omp.reduction.neutral20
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.reduction.neutral22
+
+! CHECK:       omp.reduction.neutral22:                          ; preds = %omp.region.cont17
+!                [start of reduction initialization region]
+!                [null check:]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.neutral23, label %omp.reduction.neutral24
+
+! CHECK:       omp.reduction.neutral24:                          ; preds = %omp.reduction.neutral22
+!                [malloc and assign the default value to the reduction variable]
+! CHECK:         br label %omp.reduction.neutral25
+
+! CHECK:       omp.reduction.neutral25:                          ; preds = %omp.reduction.neutral23, %omp.reduction.neutral24
+! CHECK-NEXT:    br label %omp.region.cont21
+
+! CHECK:       omp.region.cont21:                                ; preds = %omp.reduction.neutral25
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    br label %omp.par.region
+
+! CHECK:       omp.par.region:                                   ; preds = %omp.region.cont21
+! CHECK-NEXT:    br label %omp.par.region27
+
+! CHECK:       omp.par.region27:                                 ; preds = %omp.par.region
+!                [call SUM runtime function]
+!                [if (sum(a) == 1)]
+! CHECK:         br i1 %{{.*}}, label %omp.par.region28, label %omp.par.region29
+
+! CHECK:       omp.par.region29:                                 ; preds = %omp.par.region27
+! CHECK-NEXT:    br label %omp.region.cont26
+
+! CHECK:       omp.region.cont26:                                ; preds = %omp.par.region28, %omp.par.region29
+!                [omp parallel region done, call into the runtime to complete reduction]
+! CHECK:         %[[VAL_233:.*]] = call i32 @__kmpc_reduce(
+! CHECK:         switch i32 %[[VAL_233]], label %reduce.finalize [
+! CHECK-NEXT:      i32 1, label %reduce.switch.nonatomic
+! CHECK-NEXT:      i32 2, label %reduce.switch.atomic
+! CHECK-NEXT:    ]
+
+! CHECK:       reduce.switch.atomic:                             ; preds = %omp.region.cont26
+! CHECK-NEXT:    unreachable
+
+! CHECK:       reduce.switch.nonatomic:                          ; preds = %omp.region.cont26
+! CHECK-NEXT:    %[[red_private_value_0:.*]] = load ptr, ptr %{{.*}}, align 8
+! CHECK-NEXT:    br label %omp.reduction.nonatomic.body
+
+!              [various blocks implementing the reduction]
+
+! CHECK:       omp.region.cont35:                                ; preds =
+! CHECK-NEXT:    %{{.*}} = phi ptr
+! CHECK-NEXT:    call void @__kmpc_end_reduce(
+! CHECK-NEXT:    br label %reduce.finalize
+
+! CHECK:       reduce.finalize:                                  ; preds =
+! CHECK-NEXT:    br label %omp.par.pre_finalize
+
+! CHECK:       omp.par.pre_finalize:                             ; preds = %reduce.finalize
+! CHECK-NEXT:    %{{.*}} = load ptr, ptr
+! CHECK-NEXT:    br label %omp.reduction.cleanup
+
+! CHECK:       omp.reduction.cleanup:                            ; preds = %omp.par.pre_finalize
+!                [null check]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup41, label %omp.reduction.cleanup42
+
+! CHECK:       omp.reduction.cleanup42:                          ; preds = %omp.reduction.cleanup41, %omp.reduction.cleanup
+! CHECK-NEXT:    br label %omp.region.cont40
+
+! CHECK:       omp.region.cont40:                                ; preds = %omp.reduction.cleanup42
+! CHECK-NEXT:    %{{.*}} = load ptr, ptr
+! CHECK-NEXT:    br label %omp.reduction.cleanup44
+
+! CHECK:       omp.reduction.cleanup44:                          ; preds = %omp.region.cont40
+!                [null check]
+! CHECK:         br i1 %{{.*}}, label %omp.reduction.cleanup45, label %omp.reduction.cleanup46
+
+! CHECK:       omp.reduction.cleanup46:                          ; preds = %omp.reduction.cleanup45, %omp.reduction.cleanup44
+! CHECK-NEXT:    br label %omp.region.cont43
+
+! CHECK:       omp.region.cont43:                                ; preds = %omp.reduction.cleanup46
+! CHECK-NEXT:    br label %omp.private.dealloc
+
+! CHECK:       omp.private.dealloc:                              ; preds = %omp.region.cont43
+!                [null check]
+! CHECK:         br i1 %{{.*}}, label %omp.private.dealloc48, label %omp.private.dealloc49
+
+! CHECK:       omp.private.dealloc49:                            ; preds = %omp.private.dealloc48, %omp.private.dealloc
+! CHECK-NEXT:    br label %omp.region.cont47
+
+! CHECK:       omp.region.cont47:                                ; preds = %omp.private.dealloc49
+! CHECK-NEXT:    br label %omp.private.dealloc51
+
+! CHECK:       omp.private.dealloc51:                            ; preds = %omp.region.cont47
+!                [null check]
+! CHECK:         br i1 %{{.*}}, label %omp.private.dealloc52, label %omp.private.dealloc53
+
+! CHECK:       omp.private.dealloc53:                            ; preds = %omp.private.dealloc52, %omp.private.dealloc51
+! CHECK-NEXT:    br label %omp.region.cont50
+
+! CHECK:       omp.region.cont50:                                ; preds = %omp.private.dealloc53
+! CHECK-NEXT:    br label %omp.par.outlined.exit.exitStub
+
+! CHECK:       omp.private.dealloc52:                            ; preds = %omp.private.dealloc51
+!                [dealloc memory]
+! CHECK:         br label %omp.private.dealloc53
+
+! CHECK:       omp.private.dealloc48:                            ; preds = %omp.private.dealloc
+!                [dealloc memory]
+! CHECK:         br label %omp.private.dealloc49
+
+! CHECK:       omp.reduction.cleanup45:                          ; preds = %omp.reduction.cleanup44
+! CHECK-NEXT:    call void @free(
+! CHECK-NEXT:    br label %omp.reduction.cleanup46
+
+! CHECK:       omp.reduction.cleanup41:                          ; preds = %omp.reduction.cleanup
+! CHECK-NEXT:    call void @free(
+! CHECK-NEXT:    br label %omp.reduction.cleanup42
+
+! CHECK:       omp.par.region28:                                 ; preds = %omp.par.region27
+! CHECK-NEXT:    call {} @_FortranAStopStatement
+
+! CHECK:       omp.reduction.neutral23:                          ; preds = %omp.reduction.neutral22
+!                [source length was zero: finish initializing array]
+! CHECK:         br label %omp.reduction.neutral25
+
+! CHECK:       omp.reduction.neutral18:                          ; preds = %omp.reduction.neutral
+!                [source length was zero: finish initializing array]
+! CHECK:         br label %omp.reduction.neutral20
+
+! CHECK:       omp.private.copy15:                               ; preds = %omp.private.copy14
+!                [source length was non-zero: call assign runtime]
+! CHECK:         br label %omp.private.copy16
+
+! CHECK:       omp.private.copy11:                               ; preds = %omp.private.copy10
+!                [source length was non-zero: call assign runtime]
+! CHECK:         br label %omp.private.copy12
+
+! CHECK:       omp.private.alloc1:                               ; preds = %omp.private.alloc
+!                [var extent was non-zero: malloc a private array]
+! CHECK:         br label %omp.private.alloc3
+
+! CHECK:       omp.private.alloc6:                               ; preds = %omp.private.alloc5
+!                [var extent was non-zero: malloc a private array]
+! CHECK:         br label %omp.private.alloc8
+
+! CHECK:       omp.par.outlined.exit.exitStub:                   ; preds = %omp.region.cont50
+! CHECK-NEXT:    ret void
diff --git a/flang/test/Integration/OpenMP/private-global.f90 b/flang/test/Integration/OpenMP/private-global.f90
new file mode 100644
index 00000000000000..62d0a3faf0c593
--- /dev/null
+++ b/flang/test/Integration/OpenMP/private-global.f90
@@ -0,0 +1,46 @@
+!RUN: %flang_fc1 -emit-llvm -fopenmp %s -o - | FileCheck %s
+
+! Regression test for https://github.com/llvm/llvm-project/issues/106297
+
+program bug
+  implicit none
+  integer :: table(10)
+  !$OMP PARALLEL PRIVATE(table)
+    table = 50
+    if (any(table/=50)) then
+      stop 'fail 3'
+    end if
+  !$OMP END PARALLEL
+  print *,'ok'
+End Program
+
+
+! CHECK-LABEL: define internal void {{.*}}..omp_par(
+! CHECK:       omp.par.entry:
+! CHECK:         %[[VAL_9:.*]] = alloca i32, align 4
+! CHECK:         %[[VAL_10:.*]] = load i32, ptr %[[VAL_11:.*]], align 4
+! CHECK:         store i32 %[[VAL_10]], ptr %[[VAL_9]], align 4
+! CHECK:         %[[VAL_12:.*]] = load i32, ptr %[[VAL_9]], align 4
+! CHECK:         %[[PRIV_TABLE:.*]] = alloca [10 x i32], i64 1, align 4
+! ...
+! check that we use the private copy of table for the assignment
+! CHECK:       omp.par.region1:
+! CHECK:         %[[ELEMENTAL_TMP:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[TABLE_BOX_ADDR:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK:         %[[BOXED_FIFTY:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8
+! CHECK:         %[[TABLE_BOX_ADDR2:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, i64 1, align 8
+! CHECK:         %[[TABLE_BOX_VAL:.*]] = insertvalue { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } { ptr undef, i64 ptrtoint (ptr getelementptr (i32, ptr null, i32 1) to i64), i32 20240719, i8 1, i8 9, i8 0, i8 0, [1 x [3 x i64]] {{\[\[}}3 x i64] [i64 1, i64 10, i64 ptrtoint (ptr getelementptr (i32, ptr null, i32 1) to i64)]] }, ptr %[[PRIV_TABLE]], 0
+! CHECK:         store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %[[TABLE_BOX_VAL]], ptr %[[TABLE_BOX_ADDR]], align 8
+! CHECK:         %[[TABLE_BOX_VAL2:.*]] = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr %[[TABLE_BOX_ADDR]], align 8
+! CHECK:         store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %[[TABLE_BOX_VAL2]], ptr %[[TABLE_BOX_ADDR2]], align 8
+! CHECK:         %[[VAL_26:.*]] = call {} @_FortranAAssign(ptr %[[TABLE_BOX_ADDR2]], ptr %[[BOXED_FIFTY]], ptr @{{.*}}, i32 9)
+! ...
+! check that we use the private copy of table for table/=50
+! CHECK:       omp.par.region3:
+! CHECK:         %[[VAL_44:.*]] = sub nsw i64 %{{.*}}, 1
+! CHECK:         %[[VAL_45:.*]] = mul nsw i64 %[[VAL_44]], 1
+! CHECK:         %[[VAL_46:.*]] = mul nsw i64 %[[VAL_45]], 1
+! CHECK:         %[[VAL_47:.*]] = add nsw i64 %[[VAL_46]], 0
+! CHECK:         %[[VAL_48:.*]] = getelementptr i32, ptr %[[PRIV_TABLE]], i64 %[[VAL_47]]
+! CHECK:         %[[VAL_49:.*]] = load i32, ptr %[[VAL_48]], align 4
+! CHECK:         %[[VAL_50:.*]] = icmp ne i32 %[[VAL_49]], 50
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 19d80fbbd699b0..6eb00d44460a97 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -371,20 +371,46 @@ convertOmpCritical(Operation &opInst, llvm::IRBuilderBase &builder,
   return success();
 }
 
-/// Populates `reductions` with reduction declarations used in the given loop.
+// Looks up the privatizer with name `symbolName`, starting the symbol table
+// search from the operation `from`.
+static omp::PrivateClauseOp findPrivatizer(Operation *from,
+                                           SymbolRefAttr symbolName) {
+  omp::PrivateClauseOp privatizer =
+      SymbolTable::lookupNearestSymbolFrom<omp::PrivateClauseOp>(from,
+                                                                 symbolName);
+  assert(privatizer && "privatizer not found in the symbol table");
+  return privatizer;
+}
+
+/// Populates `privatizations` with privatisation declarations used for the
+/// given op.
+/// TODO: generalise beyond ParallelOp
+static void collectPrivatizationDecls(
+    omp::ParallelOp op, SmallVectorImpl<omp::PrivateClauseOp> &privatizations) {
+  std::optional<ArrayAttr> attr = op.getPrivateSyms();
+  if (!attr)
+    return;
+
+  privatizations.reserve(privatizations.size() + attr->size());
+  for (auto symbolRef : attr->getAsRange<SymbolRefAttr>()) {
+    privatizations.push_back(findPrivatizer(op, symbolRef));
+  }
+}
+
+/// Populates `reductions` with reduction declarations used in the given op.
 template <typename T>
 static void
-collectReductionDecls(T loop,
+collectReductionDecls(T op,
                       SmallVectorImpl<omp::DeclareReductionOp> &reductions) {
-  std::optional<ArrayAttr> attr = loop.getReductionSyms();
+  std::optional<ArrayAttr> attr = op.getReductionSyms();
   if (!attr)
     return;
 
-  reductions.reserve(reductions.size() + loop.getNumReductionVars());
+  reductions.reserve(reductions.size() + op.getNumReductionVars());
   for (auto symbolRef : attr->getAsRange<SymbolRefAttr>()) {
     reductions.push_back(
         SymbolTable::lookupNearestSymbolFrom<omp::DeclareReductionOp>(
-            loop, symbolRef));
+            op, symbolRef));
   }
 }
 
@@ -611,7 +637,7 @@ static LogicalResult
 allocReductionVars(T loop, ArrayRef<BlockArgument> reductionArgs,
                    llvm::IRBuilderBase &builder,
                    LLVM::ModuleTranslation &moduleTranslation,
-                   llvm::OpenMPIRBuilder::InsertPointTy &allocaIP,
+                   const llvm::OpenMPIRBuilder::InsertPointTy &allocaIP,
                    SmallVectorImpl<omp::DeclareReductionOp> &reductionDecls,
                    SmallVectorImpl<llvm::Value *> &privateReductionVariables,
                    DenseMap<Value, llvm::Value *> &reductionVariableMap,
@@ -1319,76 +1345,11 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder,
                                     privateReductionVariables, isByRef);
 }
 
-/// A RAII class that on construction replaces the region arguments of the
-/// parallel op (which correspond to private variables) with the actual private
-/// variables they correspond to. This prepares the parallel op so that it
-/// matches what is expected by the OMPIRBuilder.
-///
-/// On destruction, it restores the original state of the operation so that on
-/// the MLIR side, the op is not affected by conversion to LLVM IR.
-class OmpParallelOpConversionManager {
-public:
-  OmpParallelOpConversionManager(omp::ParallelOp opInst)
-      : region(opInst.getRegion()),
-        privateBlockArgs(cast<omp::BlockArgOpenMPOpInterface>(*opInst)
-                   ...
[truncated]

@kiranchandramohan (Contributor) commented:

The fujitsu testsuite is at https://github.com/fujitsu/compiler-test-suite. Could you give it a try? You can download it into a directory inside llvm-testsuite.


// in-place convert the private allocation region
SmallVector<llvm::Value *, 1> phis;
if (allocRegion.getArgument(0).getUses().empty())
Contributor:

What is this if case here? Does it mean that we will not end up using the privAllocaBlock that was created?

Contributor (author):

Yes. We only use it for cases where it is needed (when the alloc region reads from the original variable). This is to ensure that allocas end up in the entry block as much as possible.

There's a bit more explanation in the comment describing the allocRegion.
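
For illustration, a schematic paraphrase of the branch in question (not the patch's verbatim code; the helper and its signature are invented for this sketch):

// Decide where to emit the contents of a private alloc region.
static llvm::OpenMPIRBuilder::InsertPointTy
chooseAllocIP(mlir::Region &allocRegion,
              llvm::OpenMPIRBuilder::InsertPointTy allocaIP,
              llvm::BasicBlock *privAllocBlock) {
  // If the alloc region never reads the original (mold) variable, its
  // contents (typically a single alloca) can be emitted directly at the
  // entry-block insertion point, keeping allocas where LLVM expects them.
  if (allocRegion.getArgument(0).getUses().empty())
    return allocaIP;
  // Otherwise the region can only run once the mold value is available, so
  // it is inlined into the dedicated late-alloc block instead.
  return {privAllocBlock, privAllocBlock->getTerminator()->getIterator()};
}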

@tblah (Contributor, author) commented Oct 11, 2024

The fujitsu testsuite is at https://github.com/fujitsu/compiler-test-suite. Could you give it a try? You can download it into a directory inside llvm-testsuite.

After my patch I get the same number of "executable missing" tests (which indicate a compilation failure) and 18 fewer failed tests.

@ergawy (Member) left a comment:

LGTM, thanks Tom!

@@ -1421,12 +1450,57 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
deferredStores, isByRef)))
bodyGenStatus = failure();

// Apply copy region for firstprivate.
if (!privateBlockArgs.empty()) {
Contributor:

Is this sufficient check for the existence of Firstprivate?

If the intention is to always create these blocks, should these blocks be created by the initial skeleton creation in OpenMPIRBuilder?

Contributor (author):

I'm only checking that there are private arguments, not whether any are firstprivate, because it felt like overkill to scan every argument looking for a firstprivate one. Would you prefer I did that?

I don't want to always create the blocks because that would lead to unnecessary test churn, especially if they were in OpenMPIRBuilder, as that may affect clang too.

Contributor:

If we can write a quick std::find_if for firstprivate, that would be great.

Could you also have a quick check/thought about whether the move from the separate callback to body gen callback will hamper the transformations that they talk about in the following presentation?
https://llvm.org/devmtg/2019-04/slides/TechTalk-Doerfert-Compiler_optimization_for_OpenMP_accelerator_offloading.pdf

FYI. No action required. The folks working on llvm openmp were planning on adding and improving the openmp-opt pass. This would work well if the regions produced by both Clang and Flang are similar.
https://openmp.llvm.org/optimizations/OpenMPOpt.html
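
A sketch of the check suggested above, using llvm::any_of in place of std::find_if and assuming the PrivateClauseOp accessors of the MLIR OpenMP dialect (illustrative, not the patch's verbatim code):

#include "llvm/ADT/STLExtras.h"                 // llvm::any_of
#include "mlir/Dialect/OpenMP/OpenMPDialect.h"  // omp::PrivateClauseOp

// Returns true if at least one collected privatizer is firstprivate,
// i.e. has a copy region that must be applied after allocation.
static bool needsFirstprivateCopy(
    llvm::ArrayRef<mlir::omp::PrivateClauseOp> privateDecls) {
  return llvm::any_of(privateDecls, [](mlir::omp::PrivateClauseOp decl) {
    return decl.getDataSharingType() ==
           mlir::omp::DataSharingClauseType::FirstPrivate;
  });
}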

Contributor (author):

Thanks for the links.

Could you also have a quick check/thought about whether the move from the separate callback to body gen callback will hamper the transformations that they talk about in the following presentation?

Clang doesn't use the separate privatisation callback either; it passes a callback that is equivalent to a no-op in OpenMPIRBuilder, roughly:

auto PrivCB = [](InsertPointTy AllocaIP, InsertPointTy CodeGenIP,
                 llvm::Value &, llvm::Value &Val, llvm::Value *&ReplVal) {
  ReplVal = &Val; // keep using the original value: no privatisation here
  return CodeGenIP;
};

TL;DR: I don't think this PR will make things worse but it would be good to take a closer look at any work needed to enable these optimizations in the future.

My educated guesses on some of the optimizations in the presentation:

  • Forwarding communicated/shared values between parallel regions -> I am not sure if this will work for flang, but it seems to work using the arguments to the outlined function, which are not changed by my PR.
  • Interprocedural constant propagation for outlined parallel regions -> again, the arguments to the outlined function have not changed. The use of the arguments should only have changed in the case of the bug I was fixing.
  • Attribute deduction -> the same as above.
  • Argument promotion (shared -> firstprivate -> private) -> I don't know how this works. It probably never worked for code generated by flang.
  • Parallel region expansion and barrier elimination -> I imagine this should work. It should only need to look at the libcalls generated by OpenMPIRBuilder, and maybe stitch some parallel regions together.
  • Communication optimization and value transfer -> this should be unaffected by my patch; it depends on how variables are passed into and out of parallel regions (by ref or by val).

I can't really comment on device code and cross host-device IPO as I'm very unfamiliar with that pipeline and IIUC there's a lot of code yet to be upstreamed.

FYI. No action required. The folks working on llvm openmp were planning on adding and improving the openmp-opt pass. This would work well if the regions produced by both Clang and Flang are similar.

There are already a few places where we diverge (e.g. the reduction init block(s)). Plus there are inherent differences, such as the much higher complexity of Fortran datatypes allowed in privatization and reduction (allocatable arrays etc.). Maybe future work could unpack Fortran runtime descriptors in cases where they are not needed so that the IR looks more like it would for a C array. Currently Flang reductions and privatization force arrays and allocatables to be boxed.

The runtime call deduplication mentioned on the webpage ought to still work. We still generate runtime calls using OpenMPIRBuilder and the parallel region is still inside of the outlined function.

The webpage also mentions globalization for data shared between different threads on a GPU. I don't know much about the GPU flow.

@Meinersbur (Member) left a comment:

I like any patch that removes more code than it adds

template <typename T>
static void
-collectReductionDecls(T loop,
+collectReductionDecls(T op,
Member:

Noise in the diff?

Contributor (author):

As I was reading the code I realized this name was misleading because the op might be a ParallelOp. Would you rather I do this in a separate patch?

@kiranchandramohan (Contributor) left a comment:

Thanks Tom for the detailed comments and going through the links I gave.

LGTM. Nice work!

tblah added 6 commits October 15, 2024 10:55
@tblah force-pushed the ecclescake/delayed-privatisation-global branch from a1af027 to 2748a54 on October 15, 2024 11:11

github-actions bot commented Oct 15, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@tblah merged commit 621fcf8 into llvm:main on Oct 16, 2024
8 checks passed
Labels: flang, flang:openmp, mlir, mlir:llvm, mlir:openmp

Closes: [flang][OpenMP] Parallel region failure (#106297)

6 participants