Overflow in MultiClassNaiveBayes #3228

Closed
justinormont opened this issue Apr 6, 2019 · 8 comments · Fixed by #5041
Labels
bug Something isn't working nit Needs a really small fix P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away.

Comments

@justinormont
Contributor

justinormont commented Apr 6, 2019

We're storing the count of rows in an int. This causes an overflow for large datasets. In my case, Criteo 1TB w/ 4.4B rows. Recommend changing to a long.

Error:

Unexpected exception: Arithmetic operation resulted in an overflow., 'System.OverflowException'
   at System.Linq.Enumerable.Sum(IEnumerable`1 source)
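For readers wondering why the failure surfaces in `Enumerable.Sum`: the `int` overload of `Sum` adds in a checked context, so it throws instead of wrapping once the running total passes `int.MaxValue`. The sketch below reproduces that with two made-up counts; the class and method names are hypothetical and this is not the ML.NET code path itself.

```csharp
using System;
using System.Linq;

// Illustration only: hypothetical helper names, not part of ML.NET.
public static class SumOverflowDemo
{
    // Returns true when summing the counts as int throws OverflowException.
    public static bool IntSumOverflows(int[] counts)
    {
        try
        {
            _ = counts.Sum();   // Enumerable.Sum(IEnumerable<int>) uses checked addition
            return false;
        }
        catch (OverflowException)
        {
            return true;
        }
    }

    // Widening each element to long before adding avoids the overflow.
    public static long LongSum(int[] counts) => counts.Sum(c => (long)c);
}
```

With two per-label counts of 2,000,000,000 each, `IntSumOverflows` reports an overflow while `LongSum` yields 4,000,000,000.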
@justinormont justinormont added the bug Something isn't working label Apr 6, 2019
@codemzs codemzs self-assigned this Apr 6, 2019
@antoniovs1029 antoniovs1029 added P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away. nit Needs a really small fix labels Jan 9, 2020
@antoniovs1029 antoniovs1029 self-assigned this Jan 10, 2020
@Sandy4321

Can you share the code for running this Naive Bayes trainer on the 1TB data, in case somebody wants to try it locally? Something like
https://github.com/rambler-digital-solutions/criteo-1tb-benchmark

@harishsk harishsk assigned Lynx1820 and unassigned codemzs and antoniovs1029 Mar 26, 2020
@Lynx1820
Contributor

Lynx1820 commented Apr 2, 2020

@justinormont Can you share your data?

@justinormont
Contributor Author

justinormont commented Apr 2, 2020

@Lynx1820: This is the Criteo 1TB dataset. I sent you a link internally.

@Sandy4321: You can use the AutoML․NET CLI (or Model Builder) to create the ML․NET C# code & project (data loaders, pipeline, learner) for the dataset by running on a smaller sample of the dataset. Then change the learner to Naive Bayes. I have run the Criteo 1TB dataset using the CLI; it works well; for a full run you'll want to add --cache off to disable in-memory caching.

To install the AutoML CLI, follow the Linux/MacOS instructions (will work on Windows too, though needs .NET Core installed): https://dotnet.microsoft.com/learn/ml-dotnet/get-started-tutorial/install

Model Builder won't run on the full 1TB dataset as there is not an exposed option for disabling the caching. But Model Builder will work to create the ML․NET C# code from a sample of the dataset. Then you can run the generated C# code, for a single training run, on the full 1TB dataset.

@Sandy4321

Do you mean that the classifier model will be built only on the small sample, and that running on the full data means prediction only?

@justinormont
Contributor Author

@Sandy4321: You'll train on the subsample only so that the tool produces the C# code for a pipeline suited to the dataset. After switching the trainer to Naive Bayes, you can then use the generated C# code to train a model from that pipeline on the full 1TB dataset.

As a side note, the Target Encoding (dracula) transform works well on the Criteo 1TB dataset: https://docs.microsoft.com/en-us/archive/blogs/machinelearning/now-available-on-azure-ml-criteos-1tb-click-prediction-dataset (slides)

This is being added in PR by @yaeldekel: #4514.

Also, ML․NET has many strong trainers which the AutoML will try for you; Naive Bayes is rarely the winning algo.

@justinormont
Contributor Author

@Lynx1820: For fixing this issue, it may be as easy as moving a set of variables from int to long:

private readonly int[] _labelHistogram;
private readonly int[][] _featureHistogram;
private readonly double[] _absentFeaturesLogProb;
private readonly int _totalTrainingCount;
private readonly int _labelCount;
private readonly int _featureCount;
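A minimal sketch of what the widened state could look like, assuming only the counters need to grow: `_labelCount` and `_featureCount` are presumably array dimensions and can stay `int`. The class `HistogramSketch`, its accessors, and the `rows` parameter are hypothetical and exist only so the demo can pass `int.MaxValue` without looping billions of times; this is not the actual ML.NET change.

```csharp
using System;

// Hypothetical stand-in for the trainer's counting state, not the ML.NET source.
public sealed class HistogramSketch
{
    private readonly long[] _labelHistogram;      // rows seen per label
    private readonly long[][] _featureHistogram;  // [label][feature] occurrence counts
    private long _totalTrainingCount;             // rows seen overall
    private readonly int _labelCount;             // array dimension, can stay int
    private readonly int _featureCount;           // array dimension, can stay int

    public HistogramSketch(int labelCount, int featureCount)
    {
        _labelCount = labelCount;
        _featureCount = featureCount;
        _labelHistogram = new long[labelCount];
        _featureHistogram = new long[labelCount][];
        for (int i = 0; i < labelCount; i++)
            _featureHistogram[i] = new long[featureCount];
    }

    public long TotalTrainingCount => _totalTrainingCount;
    public long LabelCountFor(int label) => _labelHistogram[label];
    public long FeatureCountFor(int label, int feature) => _featureHistogram[label][feature];

    // Records `rows` identical examples at once (hypothetical shortcut for the demo).
    public void Observe(int label, int[] activeFeatures, long rows = 1)
    {
        if (label < 0 || label >= _labelCount)
            throw new ArgumentOutOfRangeException(nameof(label));
        checked // fail loudly rather than wrap if even long is ever exceeded
        {
            _labelHistogram[label] += rows;
            _totalTrainingCount += rows;
            foreach (int f in activeFeatures)
            {
                if (f < 0 || f >= _featureCount)
                    throw new ArgumentOutOfRangeException(nameof(activeFeatures));
                _featureHistogram[label][f] += rows;
            }
        }
    }
}
```

With `long` counters, observing 3.0B rows for one label and 1.4B for another gives a total of 4.4B, which the original `int` fields could not hold.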

And a few other int variables:

@mstfbl mstfbl assigned mstfbl and unassigned Lynx1820 Apr 16, 2020
@mstfbl
Contributor

mstfbl commented Apr 17, 2020

Here are the min/max values for int, uint and long:

int.MaxValue: 2,147,483,647
int.MinValue: -2,147,483,648
uint.MaxValue: 4,294,967,295
uint.MinValue: 0
long.MaxValue: 9,223,372,036,854,775,807
long.MinValue: -9,223,372,036,854,775,808

As the aforementioned Criteo 1TB dataset has around 4.4 billion rows, which exceeds int.MaxValue, it makes sense to use long arrays and matrices for storing the label and feature counts in their respective histograms.
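To make the range argument concrete, here is a small sketch that checks the approximate row count from this thread against each type. The class and member names are hypothetical, and 4,400,000,000 is the rounded figure quoted above rather than an exact row count.

```csharp
using System;

// Illustration only: hypothetical names, approximate row count from the thread.
public static class RowCountRange
{
    public const long CriteoRows = 4_400_000_000L;

    public static bool FitsInInt(long n) => n >= int.MinValue && n <= int.MaxValue;

    public static bool FitsInUInt(long n) => n >= uint.MinValue && n <= uint.MaxValue;

    // What an unchecked narrowing cast would silently store instead:
    // the low 32 bits of the value reinterpreted as a signed int.
    public static int WrapToInt(long n) => unchecked((int)n);
}
```

Under these assumptions the count fits in neither `int` nor `uint`, and an unchecked cast to `int` would silently yield 105,032,704 (4,400,000,000 minus 2^32), which is why a checked overflow exception is the better failure mode and `long` the natural fix.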

@Sandy4321

Good idea

@ghost ghost locked as resolved and limited conversation to collaborators Mar 23, 2022