Overflow in MultiClassNaiveBayes #3228
Comments
Can you share the code for this Naive Bayes run on the 1TB data? |
@justinormont Can you share your data? |
@Lynx1820: This is the Criteo 1TB dataset. I sent you a link internally.
@Sandy4321: You can use the AutoML.NET CLI (or Model Builder) to create the ML.NET C# code and project (data loaders, pipeline, learner) for the dataset by running it on a smaller sample of the dataset. Then change the learner to Naive Bayes.
I have run the Criteo 1TB dataset using the CLI and it works well; for a full run you'll want to add
To install the AutoML CLI, follow the Linux/macOS instructions (they work on Windows too, though .NET Core must be installed): https://dotnet.microsoft.com/learn/ml-dotnet/get-started-tutorial/install
Model Builder won't run on the full 1TB dataset, as there is no exposed option for disabling the caching. But Model Builder will work to create the ML.NET C# code from a sample of the dataset. You can then run the generated C# code, for a single training run, on the full 1TB dataset. |
Do you mean that the classifier model will be built only on the small sample? |
@Sandy4321: You'll train on the subsample just to have it produce the C# code for a pipeline for the dataset. After switching the trainer to Naive Bayes, you can then use the generated C# code to train a model from the pipeline on the full 1TB dataset. As a side note, the Target Encoding (dracula) transform works well on the Criteo 1TB dataset: https://docs.microsoft.com/en-us/archive/blogs/machinelearning/now-available-on-azure-ml-criteos-1tb-click-prediction-dataset (slides) This is being added in a PR by @yaeldekel: #4514. Also, ML.NET has many strong trainers which AutoML will try for you; Naive Bayes is rarely the winning algo. |
@Lynx1820: For fixing this issue, it may be as easy as moving a set of variables from Lines 241 to 246 in 77e7f98
And a few other Line 149 in 77e7f98
|
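Here are the min/max values for int, uint and long:
int: -2,147,483,648 to 2,147,483,647
uint: 0 to 4,294,967,295
long: -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807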
As the aforementioned Criteo 1TB dataset has about 4.4 billion rows, which exceeds the maximum of both int and uint, it makes sense to use long arrays and matrices for storing the label and feature counts in their respective histograms. |
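To make the arithmetic concrete, here is a minimal sketch of what happens when roughly 4.4 billion rows are tallied in a 32-bit counter versus a 64-bit one. It is written in Java only because Java's int and long have the same widths as the C# types discussed here, and Java's int wraps silently on overflow much as C#'s does in its default unchecked context. The class and method names (RowCountOverflow, countWithInt, countWithLong) are hypothetical and are not taken from the ML.NET source.

```java
public class RowCountOverflow {
    // Hypothetical helper: tallies rows in batches using a 32-bit int counter.
    // Once the running total passes Integer.MAX_VALUE (2,147,483,647) it wraps silently.
    static int countWithInt(long totalRows, int batchSize) {
        int count = 0;
        for (long seen = 0; seen < totalRows; seen += batchSize) {
            count += batchSize; // 32-bit addition, modulo 2^32
        }
        return count;
    }

    // Same tally with a 64-bit long counter, which has ample headroom for 4.4B rows.
    static long countWithLong(long totalRows, int batchSize) {
        long count = 0;
        for (long seen = 0; seen < totalRows; seen += batchSize) {
            count += batchSize;
        }
        return count;
    }

    public static void main(String[] args) {
        long rows = 4_400_000_000L; // approximate Criteo 1TB row count quoted in this thread
        int batch = 1_000_000;      // arbitrary batch size, chosen to keep the loop short

        System.out.println("int max:      " + Integer.MAX_VALUE);
        System.out.println("int counter:  " + countWithInt(rows, batch));  // 105032704 (wrapped)
        System.out.println("long counter: " + countWithLong(rows, batch)); // 4400000000
    }
}
```

The int counter ends at 4,400,000,000 - 2^32 = 105,032,704, a plausible-looking but wrong total, which is why a silent wrap in a count or histogram is easy to miss.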
Good idea |
We're storing the count of rows in an int. This causes an overflow for large datasets; in my case, Criteo 1TB with 4.4B rows. Recommend changing to a long.
machinelearning/src/Microsoft.ML.StandardTrainers/Standard/MulticlassClassification/MulticlassNaiveBayesTrainer.cs
Line 242 in d2bf3e7
Error: