Random build failures: Catalog the failures #1474

TomFinley · 2018-10-31T20:35:21Z

Companion issue to #1471. We should investigate more fully at a high level why the builds are failing.

@eerhardt has helpfully provided a good list of these failures for builds that should, in principle, have succeeded, since they ran against master. The task, then, is to go through them and determine why each one failed. This view even has an analytics tab that shows the precise test failures, so perhaps this is not such a hard issue! But getting a sense of why each build failed might itself be worthwhile, I imagine.

While I believe tests are the primary culprit, I don't have a good sense of how big a problem they are compared to other issues. For the builds that do not appear to have failed because of a test, why did they fail? A timeout? Unavailable network resources? Some other setup failure? An actual build failure? (Of course some of these things, especially timeouts, could themselves be test failures.)

Also, for the failures that are test failures: which tests are failing, and on which environments? I have a sense of some "usual suspects" among the problematic tests, but I am sometimes surprised by new ones, and a more complete catalog would surely help the investigation focus its efforts.


Zruty0 · 2018-10-31T20:36:48Z

> (Of course some of these things, especially timeouts, could themselves be test failures.)

I wholeheartedly

Ivanidzo4ka · 2018-11-02T04:32:02Z

| Test | Platform | Count | Comments |
|---|---|---|---|
| Microsoft.ML.Runtime.RunTests.TestPredictors.MulticlassTreeFeaturizedLRTest | MacOS_Release | 1 | |
| | MacOS_Debug | 7 | |
| StaticPipelineTests (hung) | Windows_x64_Debug | 3 | |
| | MacOS_Release | 15 | |
| | Windows_x64_Release | 4 | |
| | MacOS_Debug | 5 | |
| Microsoft.ML.Core.Tests (hung) | Windows_x86_Debug | 2 | |
| | Windows_x64_Release | 1 | |
| Microsoft.ML.Runtime.RunTests.TestTimeSeries.SavePipeSlidingWindowW1L2 | Windows_x86_Debug | 2 | `at Microsoft.ML.Runtime.Data.TextLoader.Cursor.ParseParallel(ParallelState state)+MoveNext() in D:\a\1\s\src\Microsoft.ML.Data\DataLoadSave\Text\TextLoaderCursor.cs:line 612` |
| | Windows_x64_Debug | 1 | |
| Microsoft.ML.Tests (hung) | MacOS_Release | 14 | |
| | MacOS_Debug | 4 | |
| | Windows_x64_Release | 1 | |
| Microsoft.ML.Tests.Transformers.TextFeaturizerTests.TextTokenizationWorkout | MacOS_Debug | 2 | |
| Microsoft.ML.Runtime.RunTests.TestCSharpApi.TestOvaMacro | MacOS_Debug | 1 | |
| PipelineApiScenarioTests.Metacomponents | Windows_x86_Debug | 1 | |
| Microsoft.ML.FSharp.Tests.SmokeTest3.FSharp-Sentiment-Smoke-Test | Windows_x64_Debug | 1 | `The file 'C:\Users\VssAdministrator\AppData\Local\Temp\TLC_26831444\0' already exists.` |
| | Windows_x64_Release | 1 | |
| TestPipelineSweeper.PipelineSweeperMultiClassClassification | Windows_x86_Debug | 1 | `` Microsoft.ML.Trainers.SdcaMultiClassTrainer.TrainWithoutLock(IProgressChannelProvider progress, Factory cursorFactory, IRandom rand, IdToIdxLookup idToIdx, Int32 numThreads, DualsTableBase duals, Single[] biasReg, Single[] invariants, Single lambdaNInv, VBuffer`1[] weights, Single[] biasUnreg, VBuffer`1[] l1IntermediateWeights, Single[] l1IntermediateBias, Single[] featureNormSquared) in D:\a\1\s\src\Microsoft.ML.StandardLearners\Standard\SdcaMultiClass.cs:line 178 `` |
| TestTimeSeries.SavePipeMovingAverageNonUniform | Windows_x86_Debug | 1 | Baseline |
| ScenariosTests.TrainAndPredictIrisModelWithStringLabelTest | MacOS_Debug | 1 | SDCA invariants |
| Microsoft.ML.Runtime.RunTests.TestPredictors.LinearClassifierTest | Windows_x64_Debug | 1 | |
| Microsoft.ML.Runtime.RunTests.TestCSharpApi.TestCrossValidationMacro | Windows_x86_Debug | 1 | |
| Microsoft.ML.Runtime.RunTests.TestEntryPoints.EntryPointCaching | Linux_Debug | 1 | |
| Microsoft.ML.Tests.TrainerEstimators.TrainerEstimators.SdcaWorkout | MacOS_Debug | 1 | SDCA invariants |
| ApiScenariosTests.New_FileBasedSavingOfData | MacOS_Debug | 1 | Probably temp file |
| ApiScenariosTests.New_TrainSaveModelAndPredict | MacOS_Debug | 1 | Probably temp file |
| Microsoft.ML.Tests.Transformers.TextFeaturizerTests.WordBagWorkout | MacOS_Debug | 1 | Probably temp file |
| Microsoft.ML.StaticPipelineTesting.Training.CrossValidate | MacOS_Debug | 1 | SDCA invariants |
| CookbookSamples.CookbookSamples.CrossValidationIris | Linux_Debug | 1 | SDCA invariants |
| Microsoft.ML.Tests.OnnxTests.MultiClassificationLRSaveModelToOnnxTest | Windows_x64_Debug | 1 | |
| Microsoft.ML.Tests.Scenarios.Api.ApiScenariosTests.TrainWithInitialPredictor | Windows_x64_Debug | 1 | SDCA invariants |
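To rank the pain points, the catalog can be tallied by platform, by OS family, and by kind (a whole test assembly hanging vs. an individual test failing). This is a minimal Python sketch, assuming the rows are transcribed by hand from the table with shortened test names, and assuming every row marked "(hung)" counts as a hang; `CATALOG` and `tally` are illustrative names, not part of any build tooling.

```python
from collections import Counter

# Rows transcribed from the catalog table: (test, platform, count).
# Test names are shortened. A name ending in "(hung)" is a test assembly that
# hung until the build timed out, rather than a single failing test.
CATALOG = [
    ("MulticlassTreeFeaturizedLRTest", "MacOS_Release", 1),
    ("MulticlassTreeFeaturizedLRTest", "MacOS_Debug", 7),
    ("StaticPipelineTests (hung)", "Windows_x64_Debug", 3),
    ("StaticPipelineTests (hung)", "MacOS_Release", 15),
    ("StaticPipelineTests (hung)", "Windows_x64_Release", 4),
    ("StaticPipelineTests (hung)", "MacOS_Debug", 5),
    ("Microsoft.ML.Core.Tests (hung)", "Windows_x86_Debug", 2),
    ("Microsoft.ML.Core.Tests (hung)", "Windows_x64_Release", 1),
    ("SavePipeSlidingWindowW1L2", "Windows_x86_Debug", 2),
    ("SavePipeSlidingWindowW1L2", "Windows_x64_Debug", 1),
    ("Microsoft.ML.Tests (hung)", "MacOS_Release", 14),
    ("Microsoft.ML.Tests (hung)", "MacOS_Debug", 4),
    ("Microsoft.ML.Tests (hung)", "Windows_x64_Release", 1),
    ("TextTokenizationWorkout", "MacOS_Debug", 2),
    ("TestOvaMacro", "MacOS_Debug", 1),
    ("Metacomponents", "Windows_x86_Debug", 1),
    ("FSharp-Sentiment-Smoke-Test", "Windows_x64_Debug", 1),
    ("FSharp-Sentiment-Smoke-Test", "Windows_x64_Release", 1),
    ("PipelineSweeperMultiClassClassification", "Windows_x86_Debug", 1),
    ("SavePipeMovingAverageNonUniform", "Windows_x86_Debug", 1),
    ("TrainAndPredictIrisModelWithStringLabelTest", "MacOS_Debug", 1),
    ("LinearClassifierTest", "Windows_x64_Debug", 1),
    ("TestCrossValidationMacro", "Windows_x86_Debug", 1),
    ("EntryPointCaching", "Linux_Debug", 1),
    ("SdcaWorkout", "MacOS_Debug", 1),
    ("New_FileBasedSavingOfData", "MacOS_Debug", 1),
    ("New_TrainSaveModelAndPredict", "MacOS_Debug", 1),
    ("WordBagWorkout", "MacOS_Debug", 1),
    ("CrossValidate", "MacOS_Debug", 1),
    ("CrossValidationIris", "Linux_Debug", 1),
    ("MultiClassificationLRSaveModelToOnnxTest", "Windows_x64_Debug", 1),
    ("TrainWithInitialPredictor", "Windows_x64_Debug", 1),
]


def tally(rows):
    """Sum counts per platform, per OS family, and per kind (hang vs. failure)."""
    by_platform, by_os, by_kind = Counter(), Counter(), Counter()
    for test, platform, count in rows:
        by_platform[platform] += count
        by_os[platform.split("_")[0]] += count  # "MacOS_Debug" -> "MacOS"
        by_kind["hang" if test.endswith("(hung)") else "failure"] += count
    return by_platform, by_os, by_kind


if __name__ == "__main__":
    by_platform, by_os, by_kind = tally(CATALOG)
    for platform, n in by_platform.most_common():
        print(f"{platform:<20} {n}")
    print("by OS:", dict(by_os))
    print("by kind:", dict(by_kind))
```

On the rows as transcribed, macOS accounts for 55 of the 80 recorded failures, and hangs account for 49 of them.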

Ivanidzo4ka · 2018-11-02T04:37:37Z

I've looked at the past two days of builds (it's hard to look further back; the pages start to load really slowly).
I also found that the test system doesn't always record which test finished: I can see that a test started, and the run for that test DLL completes with everything marked as passing, but there is no finishing line in the log.

Ivanidzo4ka · 2018-11-02T04:51:59Z

The point of this exercise is to see what the main pain points are and allocate resources accordingly.
So far the biggest issues are the macOS builds and tests that hang until they time out.
I have the following obvious proposals to ease the pain:

Let's reduce the timeout to 30 minutes; our regular run on MacOS_Debug takes about 18 minutes.
Let's move the Mac builds to a bigger pool of machines (the hosted one, instead of the current one).

Proposals for investigation:

We need a way to download test artifacts.
We need a better way to determine which tests started and which finished. (We could add a hook in the base test class that writes start and finish information into a single artifact file.)
Let's turn off parallel test execution to determine whether the failures are related to simultaneous test execution.
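The proposed start/finish hook can be sketched as follows. This is written in Python purely for illustration (the real hook would live in the C# base test class), and `ProgressLog`, `unfinished_tests`, and the `START <name>` / `END <name>` line format are all hypothetical. Because the END marker is written in a `finally` block, a test that fails still records that it finished; only a test that hangs, or a process that crashes, is left with a START and no END, which is the case the current logs do not reveal.

```python
import threading
from contextlib import contextmanager
from typing import Iterable, List


class ProgressLog:
    """Hypothetical hook: appends START/END markers for each test to one artifact file."""

    def __init__(self, path: str):
        self.path = path
        self._lock = threading.Lock()  # tests may run in parallel

    def _write(self, marker: str, test_name: str) -> None:
        # Open, append, and close on every write so a hang or crash loses nothing.
        with self._lock, open(self.path, "a", encoding="utf-8") as f:
            f.write(f"{marker} {test_name}\n")

    @contextmanager
    def track(self, test_name: str):
        self._write("START", test_name)
        try:
            yield
        finally:
            # Written for failing tests too; only a hang or crash leaves it missing.
            self._write("END", test_name)


def unfinished_tests(lines: Iterable[str]) -> List[str]:
    """Return tests that have a START marker but no matching END, in start order."""
    running = {}  # dicts preserve insertion order
    for line in lines:
        marker, _, name = line.rstrip("\n").partition(" ")
        if marker == "START":
            running[name] = True
        elif marker == "END":
            running.pop(name, None)
    return list(running)
```

A test would wrap its body in `with log.track("SomeTest"):`, and after a timed-out run the artifact file can be fed to `unfinished_tests` to name the test(s) that never finished.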

frank-dong-ms-zz · 2020-04-20T23:32:45Z

Introduced a threading analyzer and fixed hangs:
#4790
#4791
#4792
#4793
#4794

We fixed several race conditions that caused random failures:
#4829
#4950

Investigated and fixed a LightGBM-related crash:
#4929

Fixed a hanging benchmark test:
#4985

TomFinley added the Build and test labels Oct 31, 2018

TomFinley mentioned this issue Oct 31, 2018 in: Investigate the build issues, focusing on tests #1471 (Closed)

TomFinley assigned Ivanidzo4ka Nov 2, 2018

codemzs closed this as completed Jun 30, 2019

codemzs reopened this Jun 30, 2019

harishsk added the P2 and P0 labels and removed the P2 label Jan 12, 2020

harishsk assigned frank-dong-ms-zz Jan 12, 2020

frank-dong-ms-zz closed this as completed Apr 20, 2020

ghost locked as resolved and limited conversation to collaborators Mar 27, 2022
