# Memory leak when featurizing text with the default settings #4571
When featurizing text with the default settings, references to the entire dataset rows are kept around.
So I've talked with @daholste offline, and also worked with @harishsk towards a solution, and I will open a PR soon with a couple of proposals. In the meantime, I will further describe the issue here and add a small sample to explain it.

### Issue

When loading a dataset using the `TextLoader`, each text value is exposed as a `ReadOnlyMemory<char>` that points into the string holding the entire row of data. Sometimes, it is the case that this `ReadOnlyMemory<char>` (or a slice of it) is stored by a transformer and stays alive after fitting, which keeps the whole row's string reachable and prevents it from being garbage collected.

So, for example, as seen in the sample code I provide at the end of this post, if I have a CSV file such as the data.csv included below:
And if I use `FeaturizeText` with its default settings over the text column:
Then, after fitting, a memory profiler shows that several `ReadOnlyMemory<char>` values are still held by the fitted transformer. As seen there, both "uppercase" and "string" share an underlying string: the one that holds the complete row of the input file they were sliced from.

Notice that in the sample code there's only one featurizer, the one for a single text column, and yet the retained string contains the entire row, every other column included.

In bigger datasets this becomes a problem, when there are many rows whose strings end up pinned this way. For instance, when using a dataset with 100k rows and 2.5+k columns, we can get most of the dataset kept alive in memory long after featurizing, since each stored slice holds on to its full row.
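To make the mechanism concrete, here is a standalone snippet (the row contents are invented for illustration) showing that a `ReadOnlyMemory<char>` slice is a view over the original string, not a copy:

```csharp
using System;
using System.Runtime.InteropServices;

class SliceDemo
{
    static void Main()
    {
        // Stand-in for one full row of the input file, as TextLoader holds it.
        string row = "uppercase,string,many,other,columns,in,the,same,row";

        // A small slice, like a token or ngram kept by a transform.
        ReadOnlyMemory<char> token = row.AsMemory(0, 9); // "uppercase"

        // The slice is backed by the *whole* row string, so storing it keeps
        // the entire row reachable from the GC's point of view.
        if (MemoryMarshal.TryGetString(token, out string backing, out _, out _))
            Console.WriteLine(ReferenceEquals(backing, row)); // True
    }
}
```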
### One cause of the problem: TextNormalizingTransformer

After some investigation, I've found that the reason that explains the behavior described above is in the `TextNormalizingTransformer`, which the default text featurizing pipeline applies to the input text. When calling the getter of its output column, the source text is run through `NormalizeSrc`, which writes the normalized characters into a buffer.
If there were things that needed to be normalized, then a new string is allocated from that buffer, and the output `ReadOnlyMemory<char>` points at this new, row-independent string.
(link to code in NormalizeSrc)

Because of the above, if there was nothing to normalize in the source text, then the source `ReadOnlyMemory<char>` is returned unchanged, still pointing into the string that holds the entire row. Later on, in other methods, new `ReadOnlyMemory<char>` slices are created over this text and stored, for instance in the dictionary that maps ngrams to slot numbers, and each of those slices keeps the whole row's string alive.
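As a rough sketch of that early-return pattern (this is not the actual ML.NET source, and the lower-casing is a stand-in for the real normalization rules):

```csharp
using System;
using System.Text;

static class NormalizerSketch
{
    // Sketch only: mimics the shape of the behavior described above.
    public static void NormalizeSrcSketch(in ReadOnlyMemory<char> src,
        ref ReadOnlyMemory<char> dst, StringBuilder buffer)
    {
        buffer.Clear();
        bool changed = false;
        foreach (char c in src.Span)
        {
            char n = char.ToLowerInvariant(c); // stand-in for the real rules
            changed |= n != c;
            buffer.Append(n);
        }

        // If nothing needed normalizing, the source memory is returned as-is;
        // it still points into the string that holds the entire row of data.
        dst = changed ? buffer.ToString().AsMemory() : src;
    }
}
```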
### Code and input data

data.csv

Program.cs
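Since the original files are not reproduced here, the following is a minimal sketch of a repro along these lines (the file name, column layout, and column name are assumptions, not the original Program.cs):

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

class Program
{
    // Hypothetical schema for data.csv; the real sample's columns are unknown here.
    private class Row
    {
        [LoadColumn(0)]
        public string Text { get; set; }
    }

    static void Main()
    {
        var mlContext = new MLContext();

        // TextLoader reads each row into a string; every column value is a
        // ReadOnlyMemory<char> slice over that row string.
        IDataView data = mlContext.Data.LoadFromTextFile<Row>("data.csv",
            separatorChar: ',', hasHeader: true);

        // Default text featurization (normalization, ngram extraction, etc.).
        var pipeline = mlContext.Transforms.Text.FeaturizeText(
            "Features", nameof(Row.Text));

        // After Fit, the stored ngram slices keep the full row strings alive.
        ITransformer model = pipeline.Fit(data);
    }
}
```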
As a possible solution, we can allocate a new string for each key in the dictionary that maps ngrams to slot numbers. This will break the pointer chain back to the original row of data, and release that memory as the transform moves on to the next row.

When saving a model, the dictionary is serialized to disk, and it will then be storing the exact strings instead of the string slices; the possible solution above would therefore be similar to loading the serialized dictionary back from disk.

Allocating strings is of course not free, but it will only be done once per string stored in the dictionary. String slices can be more memory efficient for long character-gram lengths (for example, consecutive 8-char-grams of a string overlap in all but one character, so slices share a single underlying buffer where allocated strings would each copy eight characters), but I think the overheads will bring it to similar memory usage.
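A sketch of the idea, using a hypothetical ngram-to-slot dictionary (the names here are invented, not ML.NET's internals):

```csharp
using System;
using System.Collections.Generic;

class NgramDictionarySketch
{
    // Keyed by string rather than by a ReadOnlyMemory<char> slice, so keys
    // carry no reference back to the buffer holding the full row.
    private readonly Dictionary<string, int> _slotByNgram =
        new Dictionary<string, int>();

    public int GetOrAddSlot(ReadOnlyMemory<char> ngram)
    {
        // Materialize the slice into its own string: a one-time allocation
        // that breaks the pointer chain back to the original row of data.
        string key = ngram.ToString();

        if (!_slotByNgram.TryGetValue(key, out int slot))
        {
            slot = _slotByNgram.Count;
            _slotByNgram.Add(key, slot);
        }
        return slot;
    }
}
```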
For the record, I have discussed this offline with @justinormont, and his suggestion is now Proposal 3 in the PR I opened about this.