MulticlassClassification.CrossValidate Arithmetic operation resulted in an overflow #5211

Closed
DFMERA opened this issue Jun 5, 2020 · 16 comments · Fixed by #5232

DFMERA commented Jun 5, 2020

System information

  • OS version/distro:
    Windows 10 Pro
  • .NET Version (e.g., dotnet --info):
    .NET Core 2.1.0

Issue

  • What did you do?
    I am creating a multiclass classification experiment. After the best model is selected, I try to evaluate the model, but this method throws an exception:
    var testMetrics = mlContext.MulticlassClassification.CrossValidate(testDataViewWithBestScore, bestRun.Estimator, numberOfFolds: 5, labelColumnName: "reservation_status");

  • What happened?
    The mlContext.MulticlassClassification.CrossValidate throws an exception

  • What did you expect?
    To recover the metrics of the model on test data

Source code / logs

CODE

var tmpPath = GetAbsolutePath(TRAIN_DATA_FILEPATH);
IDataView trainingDataView = mlContext.Data.LoadFromTextFile(
    path: tmpPath,
    hasHeader: true,
    separatorChar: '\t',
    allowQuoting: true,
    allowSparse: false);

IDataView testDataView = mlContext.Data.BootstrapSample(trainingDataView);

// STEP 2: Run AutoML experiment
Console.WriteLine($"Running AutoML Multiclass classification experiment for {ExperimentTime} seconds...");
ExperimentResult<MulticlassClassificationMetrics> experimentResult = mlContext.Auto()
    .CreateMulticlassClassificationExperiment(ExperimentTime)
    .Execute(trainingDataView, labelColumnName: "reservation_status");

// STEP 3: Print metric from the best model
RunDetail<MulticlassClassificationMetrics> bestRun = experimentResult.BestRun;
Console.WriteLine($"Total models produced: {experimentResult.RunDetails.Count()}");
Console.WriteLine($"Best model's trainer: {bestRun.TrainerName}");
Console.WriteLine($"Metrics of best model from validation data --");
PrintMulticlassClassificationMetrics(bestRun.ValidationMetrics);

// STEP 4: Evaluate test data
IDataView testDataViewWithBestScore = bestRun.Model.Transform(testDataView);
var testMetrics = mlContext.MulticlassClassification.CrossValidate(testDataViewWithBestScore, bestRun.Estimator, numberOfFolds: 5, labelColumnName: "reservation_status");

EXCEPTION

Unhandled Exception: System.OverflowException: Arithmetic operation resulted in an overflow.
   at Microsoft.ML.Data.VectorDataViewType.ComputeSize(ImmutableArray`1 dims)
   at Microsoft.ML.Data.VectorDataViewType..ctor(PrimitiveDataViewType itemType, Int32[] dimensions)
   at Microsoft.ML.Transforms.KeyToVectorMappingTransformer.Mapper..ctor(KeyToVectorMappingTransformer parent, DataViewSchema inputSchema)
   at Microsoft.ML.Transforms.KeyToVectorMappingTransformer.MakeRowMapper(DataViewSchema schema)
   at Microsoft.ML.Data.RowToRowTransformerBase.GetOutputSchema(DataViewSchema inputSchema)
   at Microsoft.ML.Data.TrivialEstimator`1.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Transforms.OneHotHashEncodingTransformer..ctor(HashingEstimator hash, IEstimator`1 keyToVector, IDataView input)
   at Microsoft.ML.Transforms.OneHotHashEncodingEstimator.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.TrainCatalogBase.CrossValidateTrain(IDataView data, IEstimator`1 estimator, Int32 numFolds, String samplingKeyColumn, Nullable`1 seed)
   at Microsoft.ML.MulticlassClassificationCatalog.CrossValidate(IDataView data, IEstimator`1 estimator, Int32 numberOfFolds, String labelColumnName, String samplingKeyColumnName, Nullable`1 seed)
   at ConsoleAppML2ML.ConsoleApp.ModelBuilder.CreateExperiment() in C:\repos\Curso ML\Bootcamp-Handytec\Clasificacion_multiclase\ConsoleAppML2\ConsoleAppML2ML.ConsoleApp\ModelBuilder.cs:line 77
   at ConsoleAppML2ML.ConsoleApp.Program.Main(String[] args) in C:\repos\Curso ML\Bootcamp-Handytec\Clasificacion_multiclase\ConsoleAppML2\ConsoleAppML2ML.ConsoleApp\Program.cs:line 20
HotelBookings.tsv.zip

@Lynx1820 Lynx1820 added the P1 Priority of the issue for triage purpose: Needs to be fixed soon. label Jun 5, 2020
@wangyems wangyems self-assigned this Jun 8, 2020
Contributor

wangyems commented Jun 9, 2020

Thanks for providing the information. I ran part of your code (the only difference is that I used the multiclass classification dataset from the ML.NET samples) but was not able to reproduce the error. It would help if you could share how you load the data from the file, if you are comfortable doing so. Either way, I'll spend some time trying to reproduce the error with the dataset you uploaded.

The error has something to do with HashingEstimator, whose core functionality changed in release 1.5. As a possible workaround, try downgrading to release 1.5 preview 2 and see if that solves the problem.

@wangyems wangyems added the need info This issue needs more info before triage label Jun 9, 2020
Author

DFMERA commented Jun 9, 2020

Thank you for the reply. I'm using this code to extract the data from file.
var tmpPath = GetAbsolutePath(TRAIN_DATA_FILEPATH);
IDataView trainingDataView = mlContext.Data.LoadFromTextFile(
    path: tmpPath,
    hasHeader: true,
    separatorChar: '\t',
    allowQuoting: true,
    allowSparse: false);

I think the problem is with big datasets like the one I attached.
thanks.

Contributor

wangyems commented Jun 9, 2020

Which ML.NET version are you using? The code you provided above does not quite fit our current release.

Author

DFMERA commented Jun 9, 2020

I just updated to ML.NET 16.0 and .NET Core 3.1. I'm going to create a small, simple project with just the code you need to reproduce the error.
I've tried different datasets and different experiment options, and the error is the same in the multiclass experiment.

Contributor

wangyems commented Jun 9, 2020

Update: I truncated your dataset to a few hundred rows and your code works fine, so you may be right that the problem is due to the large dataset. I am now running it on the original dataset and waiting for the results.

Author

DFMERA commented Jun 9, 2020

CORRECTION. I'm using this version of ML.NET
"Microsoft.ML" Version="1.5.0"
"Microsoft.ML.AutoML" Version="0.17.0"
And .NETCore 3.1

Author

DFMERA commented Jun 9, 2020

Update: Here is a small project with just the code you need to reproduce the error. I tried with a small dataset and the error is the same.
MLMulticlassExperiment.zip

@wangyems
Contributor

Thanks for providing the example and version information :) I can now reproduce the error. I'll take a look into this.

@wangyems wangyems removed the need info This issue needs more info before triage label Jun 10, 2020
@wangyems
Contributor

The reason for the overflow is as follows:

The reservation_status_date column has a fairly large number of unique values, which triggers AutoML to map each unique value to a vector of size 65536. AutoML also assumes that there could be up to 65536 unique values, so a vector with an overall size of 65536*65536 is generated to store the mapped values. That size is larger than the maximum allowed, which is why you see an overflow.
I think this is not a bug but something that can be improved in the future, as I expect the classification module in ML.NET cannot deal with classification over millions of categories. (I'll discuss this with my team.)
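
To make the arithmetic concrete: the stack trace above shows the exception coming from VectorDataViewType.ComputeSize, whose dimensions are Int32 values. The sketch below is not ML.NET's implementation. VectorSizeSketch and its methods are hypothetical names, and it assumes only that the dimensions are multiplied as 32-bit integers with overflow checking.

```csharp
using System;

// Hypothetical stand-in for a vector-size computation; NOT ML.NET's actual code.
public static class VectorSizeSketch
{
    // Multiplies all dimensions as 32-bit integers and throws
    // OverflowException if the running product does not fit in an Int32.
    public static int ComputeSize(params int[] dims)
    {
        int size = 1;
        foreach (int dim in dims)
        {
            size = checked(size * dim);
        }
        return size;
    }

    // Returns true when the product of the dimensions fits in an Int32.
    public static bool Fits(params int[] dims)
    {
        try
        {
            ComputeSize(dims);
            return true;
        }
        catch (OverflowException)
        {
            return false;
        }
    }
}
```

65536*65536 is 4,294,967,296, which is above Int32.MaxValue (2,147,483,647), so VectorSizeSketch.Fits(65536, 65536) returns false, while a smaller shape such as VectorSizeSketch.Fits(65536, 1000) returns true.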

As a workaround (and perhaps a better approach anyway), you can pre-process the reservation_status_date column before using AutoML (which is not automatic enough in this case). For example, a value like 6/23/2015 12:00:00 AM can be parsed into three float columns: 6 (reservation_status_month), 23 (reservation_status_date), 2015 (reservation_status_year). The 12:00:00 AM part seems to stay the same throughout the dataset, so you can safely ignore it. This approach not only avoids the overflow but also provides a more useful feature with clearer information.
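
The date split described above could look roughly like the following plain C# sketch. ReservationDateSketch and Split are hypothetical names, and the format string is an assumption based on the single example value 6/23/2015 12:00:00 AM; adjust it if the column uses another format.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of ML.NET or AutoML.
public static class ReservationDateSketch
{
    // Splits a timestamp such as "6/23/2015 12:00:00 AM" into
    // (month, day, year) as floats; the time of day is discarded.
    public static (float Month, float Day, float Year) Split(string text)
    {
        // Assumed format: month/day/year followed by a 12-hour clock time.
        DateTime parsed = DateTime.ParseExact(
            text,
            "M/d/yyyy h:mm:ss tt",
            CultureInfo.InvariantCulture);
        return ((float)parsed.Month, (float)parsed.Day, (float)parsed.Year);
    }
}
```

The three values can then be written to separate numeric columns in the input file in place of the original text column, so that AutoML no longer treats the date as a high-cardinality categorical feature.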

Feel free to reach out if you still have any questions.

Author

DFMERA commented Jun 11, 2020

Thank you. I made the change you recommended and it solved the problem. If I'm allowed to make a suggestion, maybe the name of the column with the problematic values could appear in the inner exception, so anyone can know what to correct in the data.

@wangyems
Contributor

Glad to know that the problem was solved. And yes I think it's better to improve the exception message by adding the column name. I'll have a PR out soon for this.

Author

DFMERA commented Jun 11, 2020

Do I have to close the issue, or do you close it when the PR is done?

@mstfbl mstfbl linked a pull request Jun 11, 2020 that will close this issue
Contributor

mstfbl commented Jun 11, 2020

Hi @DFMERA , the issue will close when PR #5232 is merged. You don't need to do anything else!

@wangyems
Contributor

Hi @DFMERA ,
I took a look at the AutoML samples in our codebase and found that they use Evaluate() instead of CrossValidate() (https://github.com/dotnet/machinelearning/blob/master/docs/samples/Microsoft.ML.AutoML.Samples/MulticlassClassificationExperiment.cs#L40):
// STEP 4: Evaluate test data
IDataView testDataViewWithBestScore = bestRun.Model.Transform(testDataView);
MulticlassClassificationMetrics testMetrics = mlContext.MulticlassClassification.Evaluate(testDataViewWithBestScore, labelColumnName: LabelColumnName);
CrossValidate() and CreateMulticlassClassificationExperiment(...).Execute(...) are both training methods, while Evaluate() is for inference (evaluation). Is Evaluate() the method you initially wanted to use?

Author

DFMERA commented Jun 16, 2020

Hi. I think that part of the code is fine because it is for evaluating the model with test data that was not included in the training process, and that code is based on the ML.NET documentation on this page:
https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.automl.multiclassclassificationexperiment?view=ml-dotnet-preview

However, I did find an error in the test IDataView (testDataView): it contained the entire dataset, including the data that was used in the training process. But I fixed that after excluding the problematic column you pointed out, so I don't know whether that could also have solved the problem.

Contributor

wangyems commented Jun 16, 2020

Hi @DFMERA ,
The code at the link in your reply uses Evaluate() in step 4, but your source code uses CrossValidate(). Please note that CrossValidate() does not evaluate an existing model on the data; it trains on the data.
To summarize, to solve the problem:
1. You may want to change this line from
var testMetrics = mlContext.MulticlassClassification.CrossValidate(testDataViewWithBestScore, bestRun.Estimator, numberOfFolds: 5, labelColumnName: "reservation_status");
to
MulticlassClassificationMetrics testMetrics = mlContext.MulticlassClassification.Evaluate(testDataViewWithBestScore, labelColumnName: LabelColumnName);
following https://docs.microsoft.com/en-us/dotnet/api/microsoft.ml.automl.multiclassclassificationexperiment?view=ml-dotnet-preview
2. Do something with the reservation_status_date column: either exclude it or split it. This is optional, because if you have done step 1, there should not be any exceptions.
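
To illustrate why CrossValidate() counts as training: k-fold cross-validation fits a fresh model once per fold, which matches the stack trace above, where CrossValidate calls CrossValidateTrain and then EstimatorChain.Fit. The sketch below is plain C# with no ML.NET dependency. KFoldSketch and its round-robin fold assignment are hypothetical and are not how ML.NET partitions rows.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration of k-fold partitioning; NOT ML.NET's actual code.
public static class KFoldSketch
{
    // Yields one (train, test) pair of row indices per fold.
    // Row i is assigned to fold i % numberOfFolds (round-robin).
    public static IEnumerable<(int[] Train, int[] Test)> Folds(int rowCount, int numberOfFolds)
    {
        if (numberOfFolds < 2)
            throw new ArgumentOutOfRangeException(nameof(numberOfFolds));

        int[] rows = Enumerable.Range(0, rowCount).ToArray();
        for (int fold = 0; fold < numberOfFolds; fold++)
        {
            int current = fold;
            int[] test = rows.Where(i => i % numberOfFolds == current).ToArray();
            int[] train = rows.Where(i => i % numberOfFolds != current).ToArray();
            // A real cross-validation would fit a new model on `train` here
            // and score it on `test`.
            yield return (train, test);
        }
    }
}
```

With numberOfFolds: 5, as in the original call, the loop body runs five times, so the whole pipeline is fitted five times. Evaluate(), by contrast, only scores predictions that an already trained model has produced.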

@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022