MulticlassClassification.CrossValidate Arithmetic operation resulted in an overflow #5211
Comments
Thanks for providing the information. I ran part of your code (the only difference being that I used the multiclass classification dataset from the ML.NET samples) but was not able to reproduce the error. It would help if you could share how you extract the data from the file, if you are comfortable doing so. Either way, I'll take time to reproduce the error using the dataset you uploaded. The error appears to be related to HashingEstimator, whose core functionality changed in release 1.5. As a possible workaround, try downgrading to release 1.5 preview 2 and see if that solves the problem.
Thank you for the reply. I'm using this code to extract the data from the file. I think the problem is with big datasets like the one I attached.
Which ML.NET version are you using? The code you provided above does not quite fit our current release.
I just updated to ML.NET 16.0 and .NET Core 3.1. I'm going to create a small, simple project with just the code you need to reproduce the error.
Update: I truncated your dataset to a few hundred rows and your code works fine, so you may be right that the problem is due to the large dataset. I am now running it on the original dataset and waiting for the results.
CORRECTION. I'm using this version of ML.NET
Update: Here is a small project with just the code you need to reproduce the error. I tried with a small dataset and the error is the same.
Thanks for providing the example and version information :) I can now reproduce the error. I'll take a look into this.
The overflow is thrown for the following reason. The reservation_status_date column has a relatively large number of unique values, which triggers AutoML to map each of those unique values to a vector of size 65536. AutoML also assumes that there could be up to 65536 unique values, so a vector with an overall size of 65536*65536 is generated to store the mapped values. That size exceeds a threshold, which is why you see an overflow. As a workaround (and perhaps a better approach overall), you can pre-process the reservation_status_date column before using AutoML (which is not auto enough in this case). For example, a value like 6/23/2015 12:00:00 AM can be parsed into three float columns: 6 (reservation_status_month), 23 (reservation_status_date), 2015 (reservation_status_year). The 12:00:00 AM part seems to stay the same throughout the dataset, so you can safely ignore it. This approach not only avoids the overflow but also provides a more useful feature with clearer information. Feel free to reach out if you still have any questions.
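The workaround described in this comment can be sketched roughly as follows. This is a minimal illustration using only the .NET base class library, not ML.NET's own code: the class name ReservationDateSketch and both helper methods are hypothetical, the date format string is inferred from the single sample value 6/23/2015 12:00:00 AM, and the assumption that the threshold in question is Int32.MaxValue comes only from the OverflowException in the stack trace.

```csharp
using System;
using System.Globalization;

// Hypothetical helper class for illustration; none of these names come from ML.NET.
public static class ReservationDateSketch
{
    // Splits a value such as "6/23/2015 12:00:00 AM" into month, day and year
    // as floats. The format string is an assumption based on the one sample
    // value quoted in the thread; the time part is parsed and then ignored.
    public static (float Month, float Day, float Year) SplitDate(string raw)
    {
        DateTime parsed = DateTime.ParseExact(
            raw, "M/d/yyyy h:mm:ss tt", CultureInfo.InvariantCulture);
        return ((float)parsed.Month, (float)parsed.Day, (float)parsed.Year);
    }

    // Shows the arithmetic only: multiplying two Int32 values in a checked
    // context throws OverflowException when the product exceeds Int32.MaxValue
    // (2147483647). 65536 * 65536 = 4294967296, which does exceed it.
    public static bool ProductOverflowsInt32(int slotCount, int vectorSizePerSlot)
    {
        try
        {
            _ = checked(slotCount * vectorSizePerSlot);
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }
}
```

Calling ReservationDateSketch.SplitDate on each row before handing the data to AutoML would yield three small numeric columns instead of one high-cardinality text column, so the hashed one-hot encoding that produced the oversized vector should no longer be applied to it.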
Thank you. I made the change you recommended and it solved the problem. If I'm allowed to make a suggestion, maybe the name of the column with the problematic values could appear in the inner exception, so anyone can know what to correct in the data.
Glad to know that the problem was solved. And yes, I think it's better to improve the exception message by adding the column name. I'll have a PR out soon for this.
Do I have to close the issue, or do you close it when the PR is done?
Hi @DFMERA , |
Hi. I think that part of the code is fine because it evaluates the model with test data that was not included in the training process, and that code is based on the ML.NET documentation on this page. However, I did see an error in the test IDataView (testDataView), since it contained the entire dataset, including the data that was used in the training process. I fixed that after excluding the problematic column you told me about, so I don't know whether that could also have solved the problem.
Hi @DFMERA , |
System information
Windows 10 Pro
.NET Core 2.1.0
Issue
What did you do?
I am creating a multiclass classification experiment. After the best model is selected, I try to evaluate the model, but this method throws an exception:
var testMetrics = mlContext.MulticlassClassification.CrossValidate(testDataViewWithBestScore, bestRun.Estimator, numberOfFolds: 5, labelColumnName: "reservation_status");
What happened?
The mlContext.MulticlassClassification.CrossValidate throws an exception
What did you expect?
To obtain the metrics of the model on the test data
Source code / logs
CODE
var tmpPath = GetAbsolutePath(TRAIN_DATA_FILEPATH);
IDataView trainingDataView = mlContext.Data.LoadFromTextFile(
    path: tmpPath,
    hasHeader: true,
    separatorChar: '\t',
    allowQuoting: true,
    allowSparse: false);

// STEP 2: Run AutoML experiment
Console.WriteLine($"Running AutoML Multiclass classification experiment for {ExperimentTime} seconds...");
ExperimentResult experimentResult = mlContext.Auto()
    .CreateMulticlassClassificationExperiment(ExperimentTime)
    .Execute(trainingDataView, labelColumnName: "reservation_status");
EXCEPTION
Unhandled Exception: System.OverflowException: Arithmetic operation resulted in an overflow.
   at Microsoft.ML.Data.VectorDataViewType.ComputeSize(ImmutableArray`1 dims)
   at Microsoft.ML.Data.VectorDataViewType..ctor(PrimitiveDataViewType itemType, Int32[] dimensions)
   at Microsoft.ML.Transforms.KeyToVectorMappingTransformer.Mapper..ctor(KeyToVectorMappingTransformer parent, DataViewSchema inputSchema)
   at Microsoft.ML.Transforms.KeyToVectorMappingTransformer.MakeRowMapper(DataViewSchema schema)
   at Microsoft.ML.Data.RowToRowTransformerBase.GetOutputSchema(DataViewSchema inputSchema)
   at Microsoft.ML.Data.TrivialEstimator`1.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Transforms.OneHotHashEncodingTransformer..ctor(HashingEstimator hash, IEstimator`1 keyToVector, IDataView input)
   at Microsoft.ML.Transforms.OneHotHashEncodingEstimator.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.Data.EstimatorChain`1.Fit(IDataView input)
   at Microsoft.ML.TrainCatalogBase.CrossValidateTrain(IDataView data, IEstimator`1 estimator, Int32 numFolds, String samplingKeyColumn, Nullable`1 seed)
   at Microsoft.ML.MulticlassClassificationCatalog.CrossValidate(IDataView data, IEstimator`1 estimator, Int32 numberOfFolds, String labelColumnName, String samplingKeyColumnName, Nullable`1 seed)
   at ConsoleAppML2ML.ConsoleApp.ModelBuilder.CreateExperiment() in C:\repos\Curso ML\Bootcamp-Handytec\Clasificacion_multiclase\ConsoleAppML2\ConsoleAppML2ML.ConsoleApp\ModelBuilder.cs:line 77
   at ConsoleAppML2ML.ConsoleApp.Program.Main(String[] args) in C:\repos\Curso ML\Bootcamp-Handytec\Clasificacion_multiclase\ConsoleAppML2\ConsoleAppML2ML.ConsoleApp\Program.cs:line 20
HotelBookings.tsv.zip