Skip to content

New version of Intel MKL. #5867

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 8, 2021
Merged

Conversation

michaelgsharp
Copy link
Member

Update version of Intel MKL.

The issue of thread contention is still not fixed. According to them:

We still have this limitation. We don’t have explicit initialization step, so at the very first oneMKL call we do some check for available features and use LoadLibrary to load additional libraries if needed (e.g. to load correct CPU optimizations at runtime for detected HW), but LoadLibrary is not allowed to be used in DLLMain because of potential deadlock. You can use oneMKL in DLLMain but only after initialization, e.g. first cblas call should be done outside of DLLMain.

This is the latest version of Intel MKL and includes the windows .PDB files.

@codecov
Copy link

codecov bot commented Jul 1, 2021

Codecov Report

Merging #5867 (97c5d06) into main (1b3cb77) will decrease coverage by 0.05%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main    #5867      +/-   ##
==========================================
- Coverage   68.27%   68.21%   -0.06%     
==========================================
  Files        1134     1134              
  Lines      242028   242028              
  Branches    25306    25306              
==========================================
- Hits       165235   165106     -129     
- Misses      70152    70272     +120     
- Partials     6641     6650       +9     
Flag Coverage Δ
Debug 68.21% <ø> (-0.06%) ⬇️
production 62.87% <ø> (-0.07%) ⬇️
test 88.82% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...osoft.ML.KMeansClustering/KMeansPlusPlusTrainer.cs 83.60% <0.00%> (-7.16%) ⬇️
src/Microsoft.ML.Data/Training/TrainerUtils.cs 65.86% <0.00%> (-3.82%) ⬇️
...crosoft.ML.StandardTrainers/Standard/SdcaBinary.cs 85.06% <0.00%> (-3.08%) ⬇️
test/Microsoft.ML.AutoML.Tests/AutoFitTests.cs 87.07% <0.00%> (-1.97%) ⬇️
src/Microsoft.ML.Sweeper/AsyncSweeper.cs 71.42% <0.00%> (-1.37%) ⬇️
...crosoft.ML.StandardTrainers/Optimizer/Optimizer.cs 71.96% <0.00%> (-1.16%) ⬇️
...oft.ML.StandardTrainers/Standard/SdcaMulticlass.cs 91.12% <0.00%> (-1.03%) ⬇️
src/Microsoft.ML.Data/Utils/LossFunctions.cs 66.83% <0.00%> (-0.52%) ⬇️
...crosoft.ML.Core/ComponentModel/ComponentCatalog.cs 70.40% <0.00%> (+0.16%) ⬆️
src/Microsoft.ML.Data/DataView/CacheDataView.cs 84.64% <0.00%> (+0.67%) ⬆️
... and 2 more

@michaelgsharp michaelgsharp marked this pull request as ready for review July 7, 2021 06:00
@michaelgsharp michaelgsharp requested review from tarekgh and a team July 7, 2021 06:00
@eerhardt
Copy link
Member

eerhardt commented Jul 7, 2021

You can use oneMKL in DLLMain but only after initialization, e.g. first cblas call should be done outside of DLLMain.

Can you explain this more? Is there something we should be doing to ensure this doesn't happen? Or something an app using ML.NET can do?

Even with the explanation, it still sounds like a bug they should fix... I assume multiple apps can hit this.

@michaelgsharp
Copy link
Member Author

@eerhardt Let me reach out to them and ask them your follow up questions. My guess is that they will just say its our responsibility to make sure the first call happen outside of DLLMain. That same call that you added in the tests as a workaround, could we add that somewhere in the initialization of something so it always happens safely?

@eerhardt
Copy link
Member

eerhardt commented Jul 7, 2021

they will just say its our responsibility to make sure the first call happen outside of DLLMain.

This is the part I am not understanding. We don't control their DllMain.

@tarekgh
Copy link
Member

tarekgh commented Jul 7, 2021

This is the part I am not understanding. We don't control their DllMain.

My understanding from the description is they are advising not doing the first call to the MKL library inside DllMain (any user DllMain). this can create a deadlock as they will call LoadLibrary in the MKL library. I don't think this affect us as I expect we don't call the library in any DllMain we have.

@michaelgsharp
Copy link
Member Author

It has caused problems for us in the past. Eric had to add a manual call to MKL in our test initialization so that we didn't have deadlock issue. With that manual call the tests don't run into problems, but an end user still could.

I don't think we expose any way for them to manually trigger that initialization. We may want to either expose a way for the user to do it or to have it be automatically done for them so its a non-issue. I would vote on doing it automatically if possible. But only if the MKL package has been included.

@eerhardt
Copy link
Member

eerhardt commented Jul 7, 2021

We don't have a DllMain, that is a "C" library concept.

The issue, as far as I could tell, is when 2 different threads are running and both calling into MKL at roughly the same time. MKL tries initializing itself from both threads, resulting in a deadlock. See #1073 for my full analysis.

I would vote on doing it automatically if possible.

I would vote on MKL fixing their issue. And if it is not possible, they should have a documented workaround/API for people to do in order to avoid the issue.

@michaelgsharp
Copy link
Member Author

I agree that MKL should fix their issue and have it all documented. Seeing as its been about 3 years since this issue was raised though it doesn't seem like its high on their priority list. I have already reached back out to them for more clarification and will update you on what they say.

Internally, we can use your workaround in the tests and it works just fine. We don't expose anyway for the user to do that though, so if they run into the issue they will have to figure out the same fix we have already done. And since our custom version of MKL uses a different DLL name, they would also have to figure out what that name is in order to manually initialize it the same way we are.

Since it doesn't appear to be high on Intel's priority list, we already have a fix that works, and we have a custom dll name I suggested we see if there is a way we can do it automatically for the user. I agree that Intel should fix it, but until they do I think we should do something if we can.

@eerhardt
Copy link
Member

eerhardt commented Jul 7, 2021

but until they do I think we should do something if we can.

How many user reports have we had on the issue? If it is a small amount, we can give them the same workaround we are using. And also have them contact Intel. If we put a Band-Aid over it, that makes it even less of a priority for Intel.

@michaelgsharp
Copy link
Member Author

Other than the one internal report we have gotten I haven't heard anything or seen anything in issues. So probably very few, if any, other users have hit it.

Good point about the band-aid. I hadn't thought about it like that. Then yeah, lets just leave things as is for now and if we do get any issues about it we can just point them to the fix.

Copy link
Member

@eerhardt eerhardt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. I thought I already approved this. LGTM.

@michaelgsharp michaelgsharp merged commit f70af3a into dotnet:main Jul 8, 2021
@michaelgsharp michaelgsharp deleted the intelmkl branch July 8, 2021 04:58
@ghost ghost locked as resolved and limited conversation to collaborators Mar 17, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants