Memory leak when featurizing text with the default settings #4571


Closed
daholste opened this issue Dec 12, 2019 · 3 comments

@daholste
Contributor

When featurizing text with the default settings, references to entire dataset rows are kept alive.

@antoniovs1029 antoniovs1029 self-assigned this Dec 12, 2019
@antoniovs1029
Member

antoniovs1029 commented Dec 13, 2019

I've talked with @daholste offline and worked with @harishsk toward a solution, and I will open a PR soon with a couple of proposals. In the meantime, here I will describe the issue further and add a small sample to illustrate it.


Issue

When loading a dataset with the FeaturizeText featurizer, at some point a NormStr pool is created that holds the NormStrs produced while featurizing each row of the dataset. Each NormStr has a readonly field ReadOnlyMemory<char> Value, which has a non-public member _object; typically this member is the string on which the ReadOnlyMemory<char> is based.

Sometimes this _object member holds a reference to a string that contains the whole row where the NormStr was found.
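This is standard ReadOnlyMemory<char> slicing behavior, not something specific to ML.NET. A minimal standalone sketch (MemoryMarshal.TryGetString is used here only to expose the backing string; the row string and slice offsets are made up for illustration):

```csharp
using System;
using System.Runtime.InteropServices;

class SliceDemo
{
    static void Main()
    {
        // A string holding an entire CSV row.
        string row = "2,lowercase string,33";

        // Slicing the row produces a ReadOnlyMemory<char> that still
        // references the full row string internally: as long as the
        // slice is alive, the whole row cannot be garbage collected.
        ReadOnlyMemory<char> column = row.AsMemory().Slice(2, 16);

        // MemoryMarshal.TryGetString exposes the backing string (_object).
        if (MemoryMarshal.TryGetString(column, out string backing,
                out int start, out int length))
        {
            Console.WriteLine(backing); // the whole row, not just the slice
            Console.WriteLine(column);  // "lowercase string"
        }
    }
}
```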

So, for example, as seen in the sample code I provide at the end of this post, if I have a CSV file such as this one:

Index,Column1,Column2
1,Uppercase string,44
2,lowercase string,33

And if I use FeaturizeText to load Column1, with the following options:

new TextFeaturizingEstimator.Options()
{
    CharFeatureExtractor = null,
    WordFeatureExtractor = new WordBagEstimator.Options()
    {
        NgramLength = 2,
        MaximumNgramsCount = new int[] { 200000 }
    }
}

Then, after fitting, the following NormStrs are created and held in a NormStr pool used by the transformer:

"uppercase" (debugger screenshot not reproduced)

"string" (debugger screenshot not reproduced)

"lowercase" (debugger screenshot not reproduced)

As seen above, "uppercase" and "string" share an _object reference to the same string, which holds only the normalized Column1 content of the first row, "uppercase string". In contrast, "lowercase" has an _object holding the entire second row, "2,lowercase string,33".

Note that the sample code has only one featurizer, the one for Column1, so we are only interested in reading that column. Still, we end up holding in memory a string that contains the whole second row.

In bigger datasets this becomes a problem: many NormStrs hold references to strings that contain whole rows. Because these NormStrs are required by the transformer that uses the featurizer, they cannot be garbage collected even after fitting, so they stay in memory along with their respective _object strings.

For instance, when using a dataset with 100k rows and 2.5+k columns, we can get NormStr pools of over 150MB for each column we are interested in loading (there's a NormStr pool for each one of those). This can cause a lot of unnecessary memory usage, since there is no real need to keep whole rows in memory.

One cause of the problem: TextNormalizingTransformer

After some investigation, I've found that the cause of the behavior described above lies in the NormalizeSrc method of TextNormalizingTransformer.Mapper. There may be other places where similar problems arise, but that needs further investigation. In the experiments I've run so far, NormalizeSrc is the main source of the problem (at least in @daholste's case).

The NormalizeSrc method has the following parameters:

in ReadOnlyMemory<char> src, ref ReadOnlyMemory<char> dst, StringBuilder buffer

The src parameter holds the text of a given column in a given row that is to be normalized, and dst receives the text after normalization.

When NormalizeSrc is called, src always seems to have an _object that holds the string of the whole row. Later in the method, if there was nothing to normalize in src (e.g., it contained no uppercase characters), the method simply does:

dst = src;

If there were things that needed to be normalized, then buffer is used to build the normalized string, which is then converted into a ReadOnlyMemory<char>:

dst = buffer.ToString().AsMemory();

(link to code in NormalizeSrc)

Because of the above, if there was nothing to normalize in the source text, then dst still has an _object with a string containing the whole row. In contrast, if there was something to normalize in the given src, then dst has an _object with only the contents of the row in the specific column that is being processed.
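The two paths above can be sketched as follows. This is a simplified illustration of the pattern, not the real ML.NET method (the actual NormalizeSrc handles more normalization options than lowercasing):

```csharp
using System;
using System.Text;

static class Sketch
{
    // Simplified sketch of the pattern in
    // TextNormalizingTransformer.Mapper.NormalizeSrc described above.
    public static void NormalizeSrc(
        in ReadOnlyMemory<char> src, ref ReadOnlyMemory<char> dst, StringBuilder buffer)
    {
        bool changed = false;
        buffer.Clear();
        foreach (char c in src.Span)
        {
            char lower = char.ToLowerInvariant(c);
            changed |= lower != c;
            buffer.Append(lower);
        }

        if (!changed)
            dst = src;                          // keeps src's backing string (the whole row) alive
        else
            dst = buffer.ToString().AsMemory(); // fresh string: only the normalized column text
    }
}
```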

Later on, in other methods, new ReadOnlyMemory objects are created for each "word" (or "token") inside dst, but they share the same _object as the dst they are based on, and they are used to create new NormStrs. This explains the behavior of the sample I provided: the content of "Uppercase string" had to be normalized, so the NormStrs for "uppercase" and "string" only hold a reference to the content of that column, whereas "lowercase" holds a reference to the whole row, since "lowercase string" didn't have to be normalized.

Code and input data

data.csv

Index,Column1,Column2
1,Uppercase string,44
2,lowercase string,33

Program.cs

using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Transforms.Text;
using System;

namespace MyData
{
    class MyData
    {
        static void Main(string[] args)
        {
            var mlContext = new MLContext();
            var textLoader = mlContext.Data.CreateTextLoader(new TextLoader.Options()
            {
                Columns = new TextLoader.Column[]
                {
                    new TextLoader.Column("Index", DataKind.String, 0),
                    new TextLoader.Column("Column1", DataKind.String, 1),
                    new TextLoader.Column("Column2", DataKind.String, 2),
                },
                HasHeader = true,
                Separators = new char[] { ',' }
            });

            var data = textLoader.Load("data.csv");

            IEstimator<ITransformer> featurizer = mlContext.Transforms.Text.FeaturizeText("Column1", new TextFeaturizingEstimator.Options()
            {
                CharFeatureExtractor = null,
                WordFeatureExtractor = new WordBagEstimator.Options()
                {
                    NgramLength = 2,
                    MaximumNgramsCount = new int[] { 200000 }
                }
            });

            var model = featurizer.Fit(data);
            var transformed = model.Transform(data);
            Console.WriteLine("Done");
        }
    }
}

@justinormont
Contributor

As a possible solution, we can allocate a new string for the dictionary that maps each ngram to its slot number. This will break the pointer chain back to the original text row of data, and release the memory as the transform moves to the next row.

When a model is saved, the dictionary is serialized to disk and stores the exact strings rather than the string slices. The possible solution above would be similar to loading the serialized dictionary from disk.

Allocating strings is of course not free, but it only happens once per string stored in the dictionary.
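The idea of breaking the pointer chain can be sketched as follows (a standalone illustration, not the actual PR change; the row string and slice offsets are made up):

```csharp
using System;
using System.Runtime.InteropServices;

class DetachDemo
{
    static void Main()
    {
        string row = "2,lowercase string,33";
        ReadOnlyMemory<char> token = row.AsMemory().Slice(2, 9); // "lowercase"

        // Allocate a fresh string for the dictionary key. The new string
        // contains only the token's characters, so the reference chain
        // back to the full row is broken and the row can be collected.
        ReadOnlyMemory<char> detached = token.ToString().AsMemory();

        MemoryMarshal.TryGetString(detached, out string backing, out _, out _);
        Console.WriteLine(backing); // "lowercase" - no longer the whole row
    }
}
```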

String slices can be more memory efficient for long charactergram lengths, but I think the overheads will bring it to similar memory usage. For example, the 8-char-grams of "<stx>The cat ran down the street<etx>" are ["<stx>The cat", "The cat ", "he cat r", "e cat ra", " cat ran", "cat ran ", "at ran d", "t ran do", " ran dow", "ran down", "an down ", "n down t", " down th", "down the", "own the ", "wn the s", "n the st", " the str", "the stre", "he stree", "e street", " street<etx>"]. Even though this is 178 chars when created as individual strings while the slices all point to the same 27 chars, there is overhead in keeping the offset and length of each string slice.

@antoniovs1029
Member

> As a possible solution, we can allocate a new string for the dictionary which stores the ngram-to-slot-number-key. This will break the pointer chain back to the original text row of data, and release the memory as the transform moves to the next row.

For the record, I have discussed this offline with @justinormont , and his suggestion is now Proposal 3 in the PR I opened about this.

@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022