[mlir][OpenMP] rewrite conversion of privatisation for omp.parallel #111844
@llvm/pr-subscribers-mlir @llvm/pr-subscribers-mlir-openmp Author: Tom Eccles (tblah)

Changes

The existing conversion inlined private alloc regions and firstprivate copy regions in MLIR, then undid those modifications to the MLIR module before completing the conversion. To make this work, LLVM IR had to be generated using the wrong mapping for privatised values, which was then fixed later inside of OpenMPIRBuilder. This approach violated an assumption in OpenMPIRBuilder that private variables would be values, not constants. Flang sometimes generates code where private variables are promoted to globals, the address of which is treated as a constant in LLVM IR. This prevented the incorrect values for the private variable from being replaced by OpenMPIRBuilder, ultimately resulting in programs producing incorrect results. This patch rewrites delayed privatisation for omp.parallel to work more similarly to reductions: translating directly into LLVM IR with correct mappings for private variables. Tested against the gfortran testsuite and our internal test suite. Linaro's post-commit bots will check against the Fujitsu test suite. I decided to add the new tests as flang integration tests rather than in mlir/test/Target/LLVMIR:
Fixes #106297 Patch is 41.61 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/111844.diff 5 Files Affected:
diff --git a/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90 b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
new file mode 100644
index 00000000000000..b1fcb52b12ae9c
--- /dev/null
+++ b/flang/test/Integration/OpenMP/parallel-private-reduction-worstcase.f90
@@ -0,0 +1,261 @@
+! RUN: %flang_fc1 -fopenmp -emit-llvm %s -o - | FileCheck %s
+
+! stress test cfg and builder insertion points in mlir-to-llvm conversion:
+! - mixing multiple delayed privatisations and multiple reductions
+! - multiple blocks in the private alloc region
+! - private alloc region has to read from the mold variable
+! - firstprivate
+! - multiple blocks in the private copy region
+! - multiple blocks in the reduction init region
+! - reduction init region has to read from the mold variable
+! - re-used omp.private ops
+! - re-used omp.reduction.declare ops
+! - unstructured code inside of the parallel region
+! - needs private dealloc region, and this has multiple blocks
+! - needs reduction cleanup region, and this has multiple blocks
+
+! This maybe belongs in the mlir tests, but what we are doing here is complex
+! enough that I find the kind of minimised mlir code preferred by mlir reviewers
+! hard to read without some fortran here for reference. Nothing like this would
+! be generated by other upstream users of the MLIR OpenMP dialect.
+
+subroutine worst_case(a, b, c, d)
+ real, allocatable :: a(:), b(:), c(:), d(:)
+ integer i
+
+ !$omp parallel firstprivate(a,b) reduction(+:c,d)
+ if (sum(a) == 1) stop 1
+ !$omp end parallel
+end subroutine
+
+! CHECK-LABEL: define internal void @worst_case_..omp_par
+! CHECK-NEXT: omp.par.entry:
+! [reduction alloc regions inlined here]
+! CHECK: br label %omp.private.latealloc
+
+! CHECK: omp.private.latealloc: ; preds = %omp.par.entry
+! CHECK-NEXT: br label %omp.private.alloc5
+
+! CHECK: omp.private.alloc5: ; preds = %omp.private.latealloc
+! [begin private alloc for first var]
+! [read the length from the mold argument]
+! [if it is non-zero...]
+! CHECK: br i1 {{.*}}, label %omp.private.alloc6, label %omp.private.alloc7
+
+! CHECK: omp.private.alloc7: ; preds = %omp.private.alloc5
+! [finish private alloc for first var with zero extent]
+! CHECK: br label %omp.private.alloc8
+
+! CHECK: omp.private.alloc8: ; preds = %omp.private.alloc6, %omp.private.alloc7
+! CHECK-NEXT: br label %omp.region.cont4
+
+! CHECK: omp.region.cont4: ; preds = %omp.private.alloc8
+! CHECK-NEXT: %{{.*}} = phi ptr
+! CHECK-NEXT: br label %omp.private.alloc
+
+! CHECK: omp.private.alloc: ; preds = %omp.region.cont4
+! [begin private alloc for first var]
+! [read the length from the mold argument]
+! [if it is non-zero...]
+! CHECK: br i1 %{{.*}}, label %omp.private.alloc1, label %omp.private.alloc2
+
+! CHECK: omp.private.alloc2: ; preds = %omp.private.alloc
+! [finish private alloc for second var with zero extent]
+! CHECK: br label %omp.private.alloc3
+
+! CHECK: omp.private.alloc3: ; preds = %omp.private.alloc1, %omp.private.alloc2
+! CHECK-NEXT: br label %omp.region.cont
+
+! CHECK: omp.region.cont: ; preds = %omp.private.alloc3
+! CHECK-NEXT: %{{.*}} = phi ptr
+! CHECK-NEXT: br label %omp.private.copy
+
+! CHECK: omp.private.copy: ; preds = %omp.region.cont
+! CHECK-NEXT: br label %omp.private.copy10
+
+! CHECK: omp.private.copy10: ; preds = %omp.private.copy
+! [begin firstprivate copy for first var]
+! [read the length, is it non-zero?]
+! CHECK: br i1 %{{.*}}, label %omp.private.copy11, label %omp.private.copy12
+
+! CHECK: omp.private.copy12: ; preds = %omp.private.copy11, %omp.private.copy10
+! CHECK-NEXT: br label %omp.region.cont9
+
+! CHECK: omp.region.cont9: ; preds = %omp.private.copy12
+! CHECK-NEXT: %{{.*}} = phi ptr
+! CHECK-NEXT: br label %omp.private.copy14
+
+! CHECK: omp.private.copy14: ; preds = %omp.region.cont9
+! [begin firstprivate copy for second var]
+! [read the length, is it non-zero?]
+! CHECK: br i1 %{{.*}}, label %omp.private.copy15, label %omp.private.copy16
+
+! CHECK: omp.private.copy16: ; preds = %omp.private.copy15, %omp.private.copy14
+! CHECK-NEXT: br label %omp.region.cont13
+
+! CHECK: omp.region.cont13: ; preds = %omp.private.copy16
+! CHECK-NEXT: %{{.*}} = phi ptr
+! CHECK-NEXT: br label %omp.reduction.init
+
+! CHECK: omp.reduction.init: ; preds = %omp.region.cont13
+! [deferred stores for results of reduction alloc regions]
+! CHECK: br label %[[VAL_96:.*]]
+
+! CHECK: omp.reduction.neutral: ; preds = %omp.reduction.init
+! [start of reduction initialization region]
+! [null check:]
+! CHECK: br i1 %{{.*}}, label %omp.reduction.neutral18, label %omp.reduction.neutral19
+
+! CHECK: omp.reduction.neutral19: ; preds = %omp.reduction.neutral
+! [malloc and assign the default value to the reduction variable]
+! CHECK: br label %omp.reduction.neutral20
+
+! CHECK: omp.reduction.neutral20: ; preds = %omp.reduction.neutral18, %omp.reduction.neutral19
+! CHECK-NEXT: br label %omp.region.cont17
+
+! CHECK: omp.region.cont17: ; preds = %omp.reduction.neutral20
+! CHECK-NEXT: %{{.*}} = phi ptr
+! CHECK-NEXT: br label %omp.reduction.neutral22
+
+! CHECK: omp.reduction.neutral22: ; preds = %omp.region.cont17
+! [start of reduction initialization region]
+! [null check:]
+! CHECK: br i1 %{{.*}}, label %omp.reduction.neutral23, label %omp.reduction.neutral24
+
+! CHECK: omp.reduction.neutral24: ; preds = %omp.reduction.neutral22
+! [malloc and assign the default value to the reduction variable]
+! CHECK: br label %omp.reduction.neutral25
+
+! CHECK: omp.reduction.neutral25: ; preds = %omp.reduction.neutral23, %omp.reduction.neutral24
+! CHECK-NEXT: br label %omp.region.cont21
+
+! CHECK: omp.region.cont21: ; preds = %omp.reduction.neutral25
+! CHECK-NEXT: %{{.*}} = phi ptr
+! CHECK-NEXT: br label %omp.par.region
+
+! CHECK: omp.par.region: ; preds = %omp.region.cont21
+! CHECK-NEXT: br label %omp.par.region27
+
+! CHECK: omp.par.region27: ; preds = %omp.par.region
+! [call SUM runtime function]
+! [if (sum(a) == 1)]
+! CHECK: br i1 %{{.*}}, label %omp.par.region28, label %omp.par.region29
+
+! CHECK: omp.par.region29: ; preds = %omp.par.region27
+! CHECK-NEXT: br label %omp.region.cont26
+
+! CHECK: omp.region.cont26: ; preds = %omp.par.region28, %omp.par.region29
+! [omp parallel region done, call into the runtime to complete reduction]
+! CHECK: %[[VAL_233:.*]] = call i32 @__kmpc_reduce(
+! CHECK: switch i32 %[[VAL_233]], label %reduce.finalize [
+! CHECK-NEXT: i32 1, label %reduce.switch.nonatomic
+! CHECK-NEXT: i32 2, label %reduce.switch.atomic
+! CHECK-NEXT: ]
+
+! CHECK: reduce.switch.atomic: ; preds = %omp.region.cont26
+! CHECK-NEXT: unreachable
+
+! CHECK: reduce.switch.nonatomic: ; preds = %omp.region.cont26
+! CHECK-NEXT: %[[red_private_value_0:.*]] = load ptr, ptr %{{.*}}, align 8
+! CHECK-NEXT: br label %omp.reduction.nonatomic.body
+
+! [various blocks implementing the reduction]
+
+! CHECK: omp.region.cont35: ; preds =
+! CHECK-NEXT: %{{.*}} = phi ptr
+! CHECK-NEXT: call void @__kmpc_end_reduce(
+! CHECK-NEXT: br label %reduce.finalize
+
+! CHECK: reduce.finalize: ; preds =
+! CHECK-NEXT: br label %omp.par.pre_finalize
+
+! CHECK: omp.par.pre_finalize: ; preds = %reduce.finalize
+! CHECK-NEXT: %{{.*}} = load ptr, ptr
+! CHECK-NEXT: br label %omp.reduction.cleanup
+
+! CHECK: omp.reduction.cleanup: ; preds = %omp.par.pre_finalize
+! [null check]
+! CHECK: br i1 %{{.*}}, label %omp.reduction.cleanup41, label %omp.reduction.cleanup42
+
+! CHECK: omp.reduction.cleanup42: ; preds = %omp.reduction.cleanup41, %omp.reduction.cleanup
+! CHECK-NEXT: br label %omp.region.cont40
+
+! CHECK: omp.region.cont40: ; preds = %omp.reduction.cleanup42
+! CHECK-NEXT: %{{.*}} = load ptr, ptr
+! CHECK-NEXT: br label %omp.reduction.cleanup44
+
+! CHECK: omp.reduction.cleanup44: ; preds = %omp.region.cont40
+! [null check]
+! CHECK: br i1 %{{.*}}, label %omp.reduction.cleanup45, label %omp.reduction.cleanup46
+
+! CHECK: omp.reduction.cleanup46: ; preds = %omp.reduction.cleanup45, %omp.reduction.cleanup44
+! CHECK-NEXT: br label %omp.region.cont43
+
+! CHECK: omp.region.cont43: ; preds = %omp.reduction.cleanup46
+! CHECK-NEXT: br label %omp.private.dealloc
+
+! CHECK: omp.private.dealloc: ; preds = %omp.region.cont43
+! [null check]
+! CHECK: br i1 %{{.*}}, label %omp.private.dealloc48, label %omp.private.dealloc49
+
+! CHECK: omp.private.dealloc49: ; preds = %omp.private.dealloc48, %omp.private.dealloc
+! CHECK-NEXT: br label %omp.region.cont47
+
+! CHECK: omp.region.cont47: ; preds = %omp.private.dealloc49
+! CHECK-NEXT: br label %omp.private.dealloc51
+
+! CHECK: omp.private.dealloc51: ; preds = %omp.region.cont47
+! [null check]
+! CHECK: br i1 %{{.*}}, label %omp.private.dealloc52, label %omp.private.dealloc53
+
+! CHECK: omp.private.dealloc53: ; preds = %omp.private.dealloc52, %omp.private.dealloc51
+! CHECK-NEXT: br label %omp.region.cont50
+
+! CHECK: omp.region.cont50: ; preds = %omp.private.dealloc53
+! CHECK-NEXT: br label %omp.par.outlined.exit.exitStub
+
+! CHECK: omp.private.dealloc52: ; preds = %omp.private.dealloc51
+! [dealloc memory]
+! CHECK: br label %omp.private.dealloc53
+
+! CHECK: omp.private.dealloc48: ; preds = %omp.private.dealloc
+! [dealloc memory]
+! CHECK: br label %omp.private.dealloc49
+
+! CHECK: omp.reduction.cleanup45: ; preds = %omp.reduction.cleanup44
+! CHECK-NEXT: call void @free(
+! CHECK-NEXT: br label %omp.reduction.cleanup46
+
+! CHECK: omp.reduction.cleanup41: ; preds = %omp.reduction.cleanup
+! CHECK-NEXT: call void @free(
+! CHECK-NEXT: br label %omp.reduction.cleanup42
+
+! CHECK: omp.par.region28: ; preds = %omp.par.region27
+! CHECK-NEXT: call {} @_FortranAStopStatement
+
+! CHECK: omp.reduction.neutral23: ; preds = %omp.reduction.neutral22
+! [source length was zero: finish initializing array]
+! CHECK: br label %omp.reduction.neutral25
+
+! CHECK: omp.reduction.neutral18: ; preds = %omp.reduction.neutral
+! [source length was zero: finish initializing array]
+! CHECK: br label %omp.reduction.neutral20
+
+! CHECK: omp.private.copy15: ; preds = %omp.private.copy14
+! [source length was non-zero: call assign runtime]
+! CHECK: br label %omp.private.copy16
+
+! CHECK: omp.private.copy11: ; preds = %omp.private.copy10
+! [source length was non-zero: call assign runtime]
+! CHECK: br label %omp.private.copy12
+
+! CHECK: omp.private.alloc1: ; preds = %omp.private.alloc
+! [var extent was non-zero: malloc a private array]
+! CHECK: br label %omp.private.alloc3
+
+! CHECK: omp.private.alloc6: ; preds = %omp.private.alloc5
+! [var extent was non-zero: malloc a private array]
+! CHECK: br label %omp.private.alloc8
+
+! CHECK: omp.par.outlined.exit.exitStub: ; preds = %omp.region.cont50
+! CHECK-NEXT: ret void
diff --git a/flang/test/Integration/OpenMP/private-global.f90 b/flang/test/Integration/OpenMP/private-global.f90
new file mode 100644
index 00000000000000..62d0a3faf0c593
--- /dev/null
+++ b/flang/test/Integration/OpenMP/private-global.f90
@@ -0,0 +1,46 @@
+!RUN: %flang_fc1 -emit-llvm -fopenmp %s -o - | FileCheck %s
+
+! Regression test for https://github.com/llvm/llvm-project/issues/106297
+
+program bug
+ implicit none
+ integer :: table(10)
+ !$OMP PARALLEL PRIVATE(table)
+ table = 50
+ if (any(table/=50)) then
+ stop 'fail 3'
+ end if
+ !$OMP END PARALLEL
+ print *,'ok'
+End Program
+
+
+! CHECK-LABEL: define internal void {{.*}}..omp_par(
+! CHECK: omp.par.entry:
+! CHECK: %[[VAL_9:.*]] = alloca i32, align 4
+! CHECK: %[[VAL_10:.*]] = load i32, ptr %[[VAL_11:.*]], align 4
+! CHECK: store i32 %[[VAL_10]], ptr %[[VAL_9]], align 4
+! CHECK: %[[VAL_12:.*]] = load i32, ptr %[[VAL_9]], align 4
+! CHECK: %[[PRIV_TABLE:.*]] = alloca [10 x i32], i64 1, align 4
+! ...
+! check that we use the private copy of table for the assignment
+! CHECK: omp.par.region1:
+! CHECK: %[[ELEMENTAL_TMP:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK: %[[TABLE_BOX_ADDR:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, align 8
+! CHECK: %[[BOXED_FIFTY:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8 }, align 8
+! CHECK: %[[TABLE_BOX_ADDR2:.*]] = alloca { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, i64 1, align 8
+! CHECK: %[[TABLE_BOX_VAL:.*]] = insertvalue { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } { ptr undef, i64 ptrtoint (ptr getelementptr (i32, ptr null, i32 1) to i64), i32 20240719, i8 1, i8 9, i8 0, i8 0, [1 x [3 x i64]] {{\[\[}}3 x i64] [i64 1, i64 10, i64 ptrtoint (ptr getelementptr (i32, ptr null, i32 1) to i64)]] }, ptr %[[PRIV_TABLE]], 0
+! CHECK: store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %[[TABLE_BOX_VAL]], ptr %[[TABLE_BOX_ADDR]], align 8
+! CHECK: %[[TABLE_BOX_VAL2:.*]] = load { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] }, ptr %[[TABLE_BOX_ADDR]], align 8
+! CHECK: store { ptr, i64, i32, i8, i8, i8, i8, [1 x [3 x i64]] } %[[TABLE_BOX_VAL2]], ptr %[[TABLE_BOX_ADDR2]], align 8
+! CHECK: %[[VAL_26:.*]] = call {} @_FortranAAssign(ptr %[[TABLE_BOX_ADDR2]], ptr %[[BOXED_FIFTY]], ptr @{{.*}}, i32 9)
+! ...
+! check that we use the private copy of table for table/=50
+! CHECK: omp.par.region3:
+! CHECK: %[[VAL_44:.*]] = sub nsw i64 %{{.*}}, 1
+! CHECK: %[[VAL_45:.*]] = mul nsw i64 %[[VAL_44]], 1
+! CHECK: %[[VAL_46:.*]] = mul nsw i64 %[[VAL_45]], 1
+! CHECK: %[[VAL_47:.*]] = add nsw i64 %[[VAL_46]], 0
+! CHECK: %[[VAL_48:.*]] = getelementptr i32, ptr %[[PRIV_TABLE]], i64 %[[VAL_47]]
+! CHECK: %[[VAL_49:.*]] = load i32, ptr %[[VAL_48]], align 4
+! CHECK: %[[VAL_50:.*]] = icmp ne i32 %[[VAL_49]], 50
diff --git a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
index 19d80fbbd699b0..6eb00d44460a97 100644
--- a/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
+++ b/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
@@ -371,20 +371,46 @@ convertOmpCritical(Operation &opInst, llvm::IRBuilderBase &builder,
return success();
}
-/// Populates `reductions` with reduction declarations used in the given loop.
+// Looks up from the operation from and returns the PrivateClauseOp with
+// name symbolName
+static omp::PrivateClauseOp findPrivatizer(Operation *from,
+ SymbolRefAttr symbolName) {
+ omp::PrivateClauseOp privatizer =
+ SymbolTable::lookupNearestSymbolFrom<omp::PrivateClauseOp>(from,
+ symbolName);
+ assert(privatizer && "privatizer not found in the symbol table");
+ return privatizer;
+}
+
+/// Populates `privatizations` with privatisation declarations used for the
+/// given op.
+/// TODO: generalise beyond ParallelOp
+static void collectPrivatizationDecls(
+ omp::ParallelOp op, SmallVectorImpl<omp::PrivateClauseOp> &privatizations) {
+ std::optional<ArrayAttr> attr = op.getPrivateSyms();
+ if (!attr)
+ return;
+
+ privatizations.reserve(privatizations.size() + attr->size());
+ for (auto symbolRef : attr->getAsRange<SymbolRefAttr>()) {
+ privatizations.push_back(findPrivatizer(op, symbolRef));
+ }
+}
+
+/// Populates `reductions` with reduction declarations used in the given op.
template <typename T>
static void
-collectReductionDecls(T loop,
+collectReductionDecls(T op,
SmallVectorImpl<omp::DeclareReductionOp> &reductions) {
- std::optional<ArrayAttr> attr = loop.getReductionSyms();
+ std::optional<ArrayAttr> attr = op.getReductionSyms();
if (!attr)
return;
- reductions.reserve(reductions.size() + loop.getNumReductionVars());
+ reductions.reserve(reductions.size() + op.getNumReductionVars());
for (auto symbolRef : attr->getAsRange<SymbolRefAttr>()) {
reductions.push_back(
SymbolTable::lookupNearestSymbolFrom<omp::DeclareReductionOp>(
- loop, symbolRef));
+ op, symbolRef));
}
}
@@ -611,7 +637,7 @@ static LogicalResult
allocReductionVars(T loop, ArrayRef<BlockArgument> reductionArgs,
llvm::IRBuilderBase &builder,
LLVM::ModuleTranslation &moduleTranslation,
- llvm::OpenMPIRBuilder::InsertPointTy &allocaIP,
+ const llvm::OpenMPIRBuilder::InsertPointTy &allocaIP,
SmallVectorImpl<omp::DeclareReductionOp> &reductionDecls,
SmallVectorImpl<llvm::Value *> &privateReductionVariables,
DenseMap<Value, llvm::Value *> &reductionVariableMap,
@@ -1319,76 +1345,11 @@ convertOmpWsloop(Operation &opInst, llvm::IRBuilderBase &builder,
privateReductionVariables, isByRef);
}
-/// A RAII class that on construction replaces the region arguments of the
-/// parallel op (which correspond to private variables) with the actual private
-/// variables they correspond to. This prepares the parallel op so that it
-/// matches what is expected by the OMPIRBuilder.
-///
-/// On destruction, it restores the original state of the operation so that on
-/// the MLIR side, the op is not affected by conversion to LLVM IR.
-class OmpParallelOpConversionManager {
-public:
- OmpParallelOpConversionManager(omp::ParallelOp opInst)
- : region(opInst.getRegion()),
- privateBlockArgs(cast<omp::BlockArgOpenMPOpInterface>(*opInst)
- ...
[truncated]
The fujitsu testsuite is at https://github.com/fujitsu/compiler-test-suite. Could you give it a try? You can download it into a directory inside llvm-testsuite.
// in-place convert the private allocation region
SmallVector<llvm::Value *, 1> phis;
if (allocRegion.getArgument(0).getUses().empty())
What is this `if` case here? Does it mean that we will not end up using the privAllocaBlock that was created?
Yes. We only use it for cases where it is needed (when the alloc region reads from the original variable). This is to ensure that allocas end up in the entry block as much as possible.
There's a bit more explanation in the comment describing the allocRegion.
After my patch I get the same number of "executable missing" tests (which indicate a compilation failure) and 18 fewer failed tests.
LGTM, thanks Tom!
@@ -1421,12 +1450,57 @@ convertOmpParallel(omp::ParallelOp opInst, llvm::IRBuilderBase &builder,
                                  deferredStores, isByRef)))
        bodyGenStatus = failure();

    // Apply copy region for firstprivate.
    if (!privateBlockArgs.empty()) {
Is this a sufficient check for the existence of firstprivate?
If the intention is to always create these blocks, should these blocks be created by the initial skeleton creation in OpenMPIRBuilder?
I'm only checking that there are private arguments not if any are firstprivate because it felt like overkill to scan every argument looking for a firstprivate one. Would you prefer I did that?
I don't want to always create the blocks because that would lead to unnecessary test churn, especially if they were in OpenMPIRBuilder as that may affect clang too.
If we can write a quick std::find_if for firstprivate that would be great.
Could you also have a quick check/thought about whether the move from the separate callback to body gen callback will hamper the transformations that they talk about in the following presentation?
https://llvm.org/devmtg/2019-04/slides/TechTalk-Doerfert-Compiler_optimization_for_OpenMP_accelerator_offloading.pdf
FYI. No action required. The folks working on llvm openmp were planning on adding and improving the openmp-opt pass. This would work well if the regions produced by both Clang and Flang are similar.
https://openmp.llvm.org/optimizations/OpenMPOpt.html
Thanks for the links.
Could you also have a quick check/thought about whether the move from the separate callback to body gen callback will hamper the transformations that they talk about in the following presentation?
Clang doesn't use the separate privatisation callback either (this is equivalent to a no-op in OpenMPIRBuilder):
clang/lib/CodeGen/CGStmtOpenMP.cpp, line 1818 (at f3947aa):
auto PrivCB = [](InsertPointTy AllocaIP, InsertPointTy CodeGenIP,
TL;DR: I don't think this PR will make things worse but it would be good to take a closer look at any work needed to enable these optimizations in the future.
My educated guesses on some of the optimizations in the presentation:
- Forwarding communicated/shared values between parallel regions -> I am not sure if this will work for flang but it seems to work using the arguments to the outlined function, which will not be changed by my PR
- Interprocedural constant propagation for outlined parallel regions -> again, the arguments to the outlined function have not changed. The use of the argument should have only changed in the case of the bug I was fixing.
- Attribute deduction also the same as above
- Argument promotion (shared -> firstprivate -> private) -> I don't know how this works. It probably never worked for code generated by flang.
- Parallel region expansion and barrier elimination -> I imagine this should work. It should only need to look at the libcalls generated by OpenMPIRBuilder and maybe stitch some parallel regions together.
- Communication optimization and value transfer -> this should be unaffected by my patch -> it depends on how variables are passed into and out of parallel regions (by ref or by val)
I can't really comment on device code and cross host-device IPO as I'm very unfamiliar with that pipeline and IIUC there's a lot of code yet to be upstreamed.
FYI. No action required. The folks working on llvm openmp were planning on adding and improving the openmp-opt pass. This would work well if the regions produced by both Clang and Flang are similar.
There are already a few places we diverge (e.g. the reduction init block(s)). Plus the differences that are inherent such as the much higher complexity of Fortran datatypes allowed in privatization and reduction (allocatable arrays etc). Maybe future work could unpack Fortran runtime descriptors in cases where they are not needed so that it looks more like a C array. Currently Flang reductions and privatization force arrays and allocatables to be boxed.
The runtime call deduplication mentioned on the webpage ought to still work. We still generate runtime calls using OpenMPIRBuilder and the parallel region is still inside of the outlined function.
The webpage also mentions globalization for data shared between different threads on a GPU. I don't know much about the GPU flow.
I like any patch that removes more code than it adds
 template <typename T>
 static void
-collectReductionDecls(T loop,
+collectReductionDecls(T op,
Noise in the diff?
As I was reading the code I realized this name was misleading because the op might be a ParallelOp. Would you rather I do this in a separate patch?
Thanks Tom for the detailed comments and going through the links I gave.
LGTM. Nice work!
✅ With the latest revision this PR passed the C/C++ code formatter.
The existing conversion inlined private alloc regions and firstprivate copy regions in MLIR, then undid those modifications to the MLIR module before completing the conversion. To make this work, LLVM IR had to be generated using the wrong mapping for privatised values, which was then fixed later inside of OpenMPIRBuilder.
This approach violated an assumption in OpenMPIRBuilder that private variables would be values, not constants. Flang sometimes generates code where private variables are promoted to globals, the address of which is treated as a constant in LLVM IR. This prevented the incorrect values for the private variable from being replaced by OpenMPIRBuilder, ultimately resulting in programs producing incorrect results.
This patch rewrites delayed privatisation for omp.parallel to work more similarly to reductions: translating directly into LLVM IR with correct mappings for private variables.
RFC: https://discourse.llvm.org/t/rfc-openmp-fix-issue-in-mlir-to-llvmir-translation-for-delayed-privatisation/81225
Tested against the gfortran testsuite and our internal test suite. Linaro's post-commit bots will check against the fujitsu test suite.
I decided to add the new tests as flang integration tests rather than in mlir/test/Target/LLVMIR:
- The regression test is for an issue filed against flang. I wanted to keep the reproducer similar to the code in the ticket.
- I found the "worst case" CFG test difficult to reason about in the abstract; it helped me to think about what was going on in terms of a Fortran program.

Fixes #106297