Error in ML.net training #4464

Closed
dmoise2 opened this issue Nov 9, 2019 · 9 comments
Labels

  • AutoML.NET: Automating various steps of the machine learning process
  • bug: Something isn't working
  • need info: This issue needs more info before triage
  • P0: Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away.

Comments

@dmoise2

dmoise2 commented Nov 9, 2019

System information

  • Windows 10 Home
  • .NET Core 2.1.802

Issue

  • First time running ML.NET. Set the database and started training.
  • What happened? Training stopped: “Failed – See more in Output Pane.”
  • What did you expect? Training to complete.

Source code / logs

| Trainer MicroAccuracy MacroAccuracy Duration #Iteration |
Schema mismatch for score column 'Score': expected vector of two or more items of type Single, got Vector<Single, 1>
Parameter name: schema
Must be at least 2.
Parameter name: numClasses
Schema mismatch for score column 'Score': expected vector of two or more items of type Single, got Vector<Single, 1>
Parameter name: schema
Training failed with the exception: System.ArgumentOutOfRangeException: Schema mismatch for score column 'Score': expected vector of two or more items of type Single, got Vector<Single, 1>
Parameter name: schema
at Microsoft.ML.Data.MulticlassClassificationEvaluator.CheckScoreAndLabelTypes(RoleMappedSchema schema)
at Microsoft.ML.Data.EvaluatorBase`1.CheckColumnTypes(RoleMappedSchema schema)
at Microsoft.ML.Data.EvaluatorBase`1.Microsoft.ML.Data.IEvaluator.Evaluate(RoleMappedData data)
at Microsoft.ML.Data.MulticlassClassificationEvaluator.Evaluate(IDataView data, String label, String score, String predictedLabel)
at Microsoft.ML.AutoML.MultiMetricsAgent.EvaluateMetrics(IDataView data, String labelColumn)
at Microsoft.ML.AutoML.RunnerUtil.TrainAndScorePipeline[TMetrics](MLContext context, SuggestedPipeline pipeline, IDataView trainData, IDataView validData, String labelColumn, IMetricsAgent`1 metricsAgent, ITransformer preprocessorTransform, FileInfo modelFileInfo, DataViewSchema modelInputSchema, AutoMLLogger logger)

@dmoise2
Author

dmoise2 commented Nov 9, 2019

Ran it again after selecting a different table column as my "column to predict". It failed again, with different output:

| Trainer MicroAccuracy MacroAccuracy Duration #Iteration |
Splitter/consolidator worker encountered exception while consuming source data
Splitter/consolidator worker encountered exception while consuming source data
Splitter/consolidator worker encountered exception while consuming source data
Training failed with the exception: System.InvalidOperationException: Splitter/consolidator worker encountered exception while consuming source data ---> System.InvalidOperationException: Splitter/consolidator worker encountered exception while consuming source data ---> System.FormatException: Parsing failed with an exception: Could not parse value Shenzhen University in line 2975, column NoDates ---> System.InvalidOperationException: Could not parse value Shenzhen University in line 2975, column NoDates
at Microsoft.ML.Data.TextLoader.Parser.ProcessOne(FieldSet vs, ColInfo info, ColumnPipe v, Int32 irow, Int64 line)
at Microsoft.ML.Data.TextLoader.Parser.ProcessItems(RowSet rows, Int32 irow, Boolean[] active, FieldSet fields, Int32 srcLim, Int64 line)
at Microsoft.ML.Data.TextLoader.Parser.ParseRow(RowSet rows, Int32 irow, Helper helper, Boolean[] active, String path, Int64 line, String text)
at Microsoft.ML.Data.TextLoader.Cursor.ParallelState.Parse(Int32 tid)
at Microsoft.ML.Data.TextLoader.Cursor.ParallelState.ThreadProc(Object obj)
--- End of inner exception stack trace ---
at Microsoft.ML.Data.TextLoader.Cursor.d__33.MoveNext()
at Microsoft.ML.Data.TextLoader.Cursor.MoveNextCore()
at Microsoft.ML.Data.RootCursorBase.MoveNext()
at Microsoft.ML.Data.LinkedRowFilterCursorBase.MoveNextCore()
at Microsoft.ML.Data.RootCursorBase.MoveNext()
at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass9_0.b__1()
--- End of inner exception stack trace ---
at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
at Microsoft.ML.Data.RootCursorBase.MoveNext()
at Microsoft.ML.Data.DataViewUtils.Splitter.<>c__DisplayClass5_1.b__2()
--- End of inner exception stack trace ---
at Microsoft.ML.Data.DataViewUtils.Splitter.Batch.SetAll(OutPipe[] pipes)
at Microsoft.ML.Data.DataViewUtils.Splitter.Cursor.MoveNextCore()
at Microsoft.ML.Data.RootCursorBase.MoveNext()
at Microsoft.ML.Trainers.TrainingCursorBase.MoveNext()
at Microsoft.ML.Trainers.LightGbm.LightGbmTrainerBase`4.GetMetainfo(IChannel ch, Factory factory, Int32& numRow, Single[]& labels, Single[]& weights, Int32[]& groups)
at Microsoft.ML.Trainers.LightGbm.LightGbmTrainerBase`4.LoadTrainingData(IChannel ch, RoleMappedData trainData, CategoricalMetaData& catMetaData)
at Microsoft.ML.Trainers.LightGbm.LightGbmTrainerBase`4.TrainModelCore(TrainContext context)
at Microsoft.ML.Trainers.TrainerEstimatorBase`2.TrainTransformer(IDataView trainSet, IDataView validationSet, IPredictor initPredictor)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
at Microsoft.ML.AutoML.RunnerUtil.TrainAndScorePipeline[TMetrics](MLContext context, SuggestedPipeline pipeline, IDataView trainData, IDataView validData, String labelColumn, IMetricsAgent`1 metricsAgent, ITransformer preprocessorTransform, FileInfo modelFileInfo, DataViewSchema modelInputSchema, AutoMLLogger logger)

@justinormont
Contributor

@dmoise2: Can you post a sample of your dataset? How are you calling AutoML (Model Builder, CLI, or API)?

System.FormatException: Parsing failed with an exception: Could not parse value Shenzhen University in line 2975, column NoDates ---> System.InvalidOperationException: Could not parse value Shenzhen University in line 2975, column NoDates

You can look at line 2975 in your dataset to see what's different about it. Commonly the cause is unsupported characters inside a quoted string (e.g., quotes or newlines). Our TextLoader fails to read a row that includes these.
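
A minimal sketch of that kind of check, written in Python with only the standard library rather than with ML.NET itself. The function name `find_suspect_records` and the sample data are hypothetical, and the three conditions it flags are an assumption about what a simplified CSV reader is likely to reject, not a description of TextLoader's actual rules:

```python
import csv
import io


def find_suspect_records(text, delimiter=","):
    """Return (line_number, reason) pairs for records that a simplified
    CSV reader may fail to parse. Hypothetical helper for illustration."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    header = next(reader)
    suspects = []
    for record in reader:
        if not record:
            continue  # skip blank lines
        # line_num is the physical line on which this record ends
        line = reader.line_num
        if len(record) != len(header):
            suspects.append(
                (line, f"expected {len(header)} fields, got {len(record)}")
            )
        elif any("\n" in field or "\r" in field for field in record):
            suspects.append((line, "newline inside a quoted field"))
        elif any('"' in field for field in record):
            suspects.append((line, "quote character inside a field"))
    return suspects


# Made-up sample: one clean row, then three rows with different problems.
sample = (
    "Name,NoDates\n"
    "Alice,3\n"
    '"Shenzhen\nUniversity",5\n'
    '"Bob ""B"" Smith",2\n'
    "Carol,1,extra\n"
)

for line, reason in find_suspect_records(sample):
    print(f"line {line}: {reason}")
```

Because a quoted field can span several physical lines, the reported number is the line on which the record ends, which may differ slightly from the line number in the ML.NET error message.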

@dmoise2
Author

dmoise2 commented Nov 12, 2019

The data does contain a number of those characters. I can create a version without them. The data comes from text that people enter. Is the longer-term expectation that we would have to eliminate those types of characters from our data?

@antoniovs1029 added the bug, need info, and P0 labels Jan 9, 2020
@antoniovs1029
Member

Hi @dmoise2, were you able to run it after cleaning the dataset as @justinormont suggested?

If so, please feel free to close this issue.

@antoniovs1029 added the AutoML.NET label Jan 9, 2020
@najeeb-kazmi
Member

@dmoise2 I'm pretty sure that the first error occurs because you selected a multiclass classification scenario, but the column you are trying to predict has only one distinct value. Hence you get type Vector<Single, 1> instead of a vector of size 2 or more.
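
One way to confirm this before training is to count the distinct values in the column to predict. The following is a rough Python sketch using only the standard library; the helper names `label_counts` and `check_label`, the column names, and the sample rows are all hypothetical:

```python
import csv
import io
from collections import Counter


def label_counts(text, label_column, delimiter=","):
    """Count how often each distinct value appears in the label column."""
    reader = csv.DictReader(io.StringIO(text, newline=""), delimiter=delimiter)
    return Counter(row[label_column] for row in reader)


def check_label(text, label_column):
    """Describe whether the column has enough classes for multiclass training."""
    counts = label_counts(text, label_column)
    if len(counts) < 2:
        return (
            f"'{label_column}' has {len(counts)} distinct value(s); "
            "multiclass classification needs at least 2"
        )
    return f"'{label_column}' has {len(counts)} distinct values"


# Made-up sample: 'Country' is constant, 'Kind' has two classes.
sample = "Name,Country,Kind\nA,CN,public\nB,CN,private\nC,CN,public\n"

print(check_label(sample, "Country"))
print(check_label(sample, "Kind"))
```

A column that fails this check is not usable as a multiclass label, whichever trainer AutoML selects.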

For the second issue, yes, you will have to either clean the data or disallow certain characters. You can also try adding MLContext.Transforms.Text.NormalizeText to the pipeline. With the AutoML API, apply this transform to the incoming data before passing the data view to AutoML. With Model Builder, train your model with clean data, then, once the code is generated, add this transform to the pipeline. I tried normalizing a string containing 深圳大学 (the characters for Shenzhen University, from the Wikipedia page), and the NormalizeText transform produced ????.

Please let us know if this is still an issue. If not, I will close this in a few days.

@najeeb-kazmi self-assigned this Apr 3, 2020
@justinormont
Contributor

@dmoise2: You would have to clean the dataset itself. ML.NET's TextLoader currently supports a simplified subset of CSV/TSV. Mainly, it does not support escaped quotes or newlines in a quoted field.
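
As an illustration of such a cleaning pass, here is a minimal Python sketch using only the standard library. The function name `simplify_csv` is hypothetical, and the replacement choices (double quotes become apostrophes, and any run of whitespace including newlines becomes a single space) are assumptions to adapt to your data, since they alter the text:

```python
import csv
import io


def simplify_csv(text, delimiter=","):
    """Rewrite CSV text so that no field contains a double quote or a newline.
    Hypothetical helper; the substitutions are illustrative assumptions."""
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    out = io.StringIO(newline="")
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for record in reader:
        cleaned = [
            # Replace double quotes, then collapse every whitespace run
            # (spaces, tabs, newlines) into one space and trim the ends.
            " ".join(field.replace('"', "'").split())
            for field in record
        ]
        writer.writerow(cleaned)
    return out.getvalue()


# Made-up sample with an embedded newline, escaped quotes, and a comma.
sample = (
    "Name,NoDates\n"
    '"Shenzhen\nUniversity",5\n'
    '"Bob ""B"" Smith",2\n'
    '"Doe, Jane",7\n'
)

print(simplify_csv(sample))
```

Fields that contain the delimiter are still written in quotes, so the output relies on the loader accepting plain quoted fields.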

See more detail: #4460

It's a bit more code-heavy, but if you're using the AutoML API, you can also use a custom CSV reader that emits an IEnumerable, and then load it into an IDataView using LoadFromEnumerable(). If you don't have a predefined test dataset, split this IDataView into train and test datasets with TrainTestSplit(). You would then give the training dataset to AutoML's Execute(), and get final metrics using the test dataset.

I expect that CsvHelper, or TinyCsvParser in RFC4180 mode, will work to create the IEnumerable: https://dotnetcoretutorials.com/2018/08/04/csv-parsing-in-net-core/

@najeeb-kazmi
Member

@dmoise2 please feel free to reopen if you have other questions.

@justinormont
Contributor

Update: @dmoise2, thanks to hard work by @antoniovs1029, ML.NET's TextLoader now supports the majority of real-world CSV/TSV files.

To quote the recent 1.5.0 release notes:

  • Updates to TextLoader
    • Enable TextLoader to accept new lines in quoted fields (#5125)
    • Add escapeChar support to TextLoader (#5147)
    • Add public generic methods to TextLoader catalog that accept Options objects (#5134)
    • Added decimal marker option in TextLoader (#5145, #5154)

@fredrick72

Please ensure the file isn't open in another application. This can sometimes cause the error.

@ghost locked as resolved and limited conversation to collaborators Mar 20, 2022
6 participants