Normalize double min and max value returns NaN #2798

Closed · lisahua opened this issue Feb 28, 2019 · 5 comments · Fixed by #3916
Labels: need info (This issue needs more info before triage)

lisahua (Contributor) commented Feb 28, 2019

System information

  • OS version/distro: .NET 4.6.2, Win 10
  • .NET Version (e.g., dotnet --info): ML.NET NuGet 0.10.1

Issue

  • What did you do? The input data has a double column containing (double.min, double.max, <100 random numbers from 0 to 10000>), and I call:
            var normalizeColumns = numericalFeatures.Select(
                f => new NormalizingEstimator.MeanVarColumn(f.Name, fixZero: false, useCdf: false)).ToArray();
            var normalizingEstimator = this.context.Transforms.Normalize(normalizeColumns);

In transformedData.Preview I see a feature that is all NaN for every row. I am using the SDCA trainer.

  • What happened?

The pipeline.Fit(transformedData) call fails and throws an exception saying "train with 0 instances".

  • What did you expect?
  1. Should ML.NET handle double.min and double.max in the NormalizingEstimator?
  2. ML.NET should throw a more meaningful exception - "train with 0 instances" for a feature that is all NaN is a bit misleading - I do have 100+ rows for this feature.


rogancarr added the "need info" label Mar 2, 2019
rogancarr (Contributor) commented Mar 2, 2019

Trainers in ML.NET use float vectors and not double vectors. When you cast from double.MaxValue to float, it casts as infinity; double.MinValue casts as negative infinity.
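
As a quick standalone illustration of that cast (a sketch, not code from this issue):

    using System;

    class CastDemo
    {
        static void Main()
        {
            // Casting a double outside the float range rounds to +/- infinity.
            float asFloatMax = (float)double.MaxValue;
            float asFloatMin = (float)double.MinValue;

            Console.WriteLine(float.IsPositiveInfinity(asFloatMax)); // True
            Console.WriteLine(float.IsNegativeInfinity(asFloatMin)); // True
        }
    }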

Now, the data is processed through the pipeline, but when the trainer iterates through the data, it automatically ignores infinite values. That should toss just those two infinite values. However, it looks like all of your values are being ignored.

My best guess is that the MeanVar normalizer may be using the infinities and rescaling everything to ±infinity. If true, is this a bug? I'm not sure. This will be a semantics conversation with the team if that is the case.

I'll investigate further to repro and make sure I know exactly where the issue is.

lisahua (Contributor, Author) commented Mar 18, 2019

Just want to clarify. Using a dataset:

column1
double.max
double.min
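
(Building that two-row dataset in memory would look roughly like the sketch below; the Row class and the LoadFromEnumerable call are assumptions based on the ML.NET 1.x API, not code taken from this issue.)

    using System.Collections.Generic;
    using Microsoft.ML;

    public class Row
    {
        public double Column1 { get; set; }
    }

    public static class ReproData
    {
        // Builds the two-row dataset listed above: double.MaxValue and double.MinValue.
        public static IDataView Build(MLContext mlContext)
        {
            var rows = new List<Row>
            {
                new Row { Column1 = double.MaxValue },
                new Row { Column1 = double.MinValue },
            };
            return mlContext.Data.LoadFromEnumerable(rows);
        }
    }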

Using NormalizingEstimator.MeanVarColumn,

            var normalizeColumn = new NormalizingEstimator.MeanVarColumn("Column1", fixZero: false, useCdf: false);
            var pipeline = this.context.Transforms.Normalize(normalizeColumn);
            var transformer = pipeline.Fit(data);
            var transformedData = transformer.Transform(data);

I see:

            var normalizer = pipeline.OfType<TransformerChain<NormalizingTransformer>>().FirstOrDefault();
            foreach (var column in normalizer.LastTransformer.Columns)
            {
                var param = column.ModelParameters as NormalizingTransformer.AffineNormalizerModelParameters<double>;
                // double.IsNaN(param.Scale) == true and double.IsNaN(param.Offset) == true
            }

I tried some other boundary values: double.PositiveInfinity, double.NegativeInfinity, double.NaN, e.g.:

column1
double.PositiveInfinity
double.NegativeInfinity
double.NaN

With the code snippet above, I see:

var param = column.ModelParameters as NormalizingTransformer.AffineNormalizerModelParameters<double>;
// param.Scale == 0 and param.Offset == 0

Should I expect similar output (param.Scale == 0 and param.Offset == 0) with the input of double.min and double.max?

rogancarr self-assigned this Apr 1, 2019
rogancarr assigned yaeldekel and artidoro and unassigned rogancarr Apr 15, 2019
justinormont (Contributor) commented:

I added more detail on our handling of +/- Infinity in normalizers in issue #3885.

artidoro (Contributor) commented:

I have investigated this bug in more detail. The behavior is unfortunate but expected. If you are dealing with very large values in your dataset, you should consider using NormalizeLogMeanVariance instead of NormalizeMeanVariance.

Using the LogMeanVariance normalizer will execute the computation of the mean and variance in log space, avoiding overflows with very large values (e.g., double.MaxValue). I believe you might lose some precision in the calculations, but it should work as expected.
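
For reference, a minimal sketch of that workaround (the column name and the exact overload are assumptions based on the ML.NET 1.x API and may differ between versions):

    using Microsoft.ML;

    public static class LogMeanVarExample
    {
        public static IDataView NormalizeColumn1(MLContext mlContext, IDataView data)
        {
            // NormalizeLogMeanVariance computes the mean and variance in log space,
            // which avoids the overflow NormalizeMeanVariance hits with values
            // such as double.MaxValue.
            var pipeline = mlContext.Transforms.NormalizeLogMeanVariance("Column1", useCdf: false);

            var transformer = pipeline.Fit(data);
            return transformer.Transform(data);
        }
    }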

The calculation of the mean of a streaming sequence of data uses the following formula (sorry, GitHub markdown does not support LaTeX):
mu_{n+1} = mu_n + (x_{n+1} - mu_n)/(n+1)

The variance is computed from the second moment, and the streaming formula for the second moment is:

mu2_{n+1} = (mu2_n + (x_{n+1} - mu_n)^2)/(n+1)

In both cases, the term x_{n+1} - mu_n can lead to infinity. For example, when x_1 = double.MaxValue and x_2 = double.MinValue, the second moment involves a term on the order of double.MaxValue^2, which overflows to infinity and causes the problem you were having.
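
To make the overflow concrete, here is a small standalone sketch (not from this thread) of what happens to those terms in double arithmetic:

    using System;

    class OverflowDemo
    {
        static void Main()
        {
            double x1 = double.MaxValue;
            double x2 = double.MinValue;

            double mu1 = x1;              // running mean after the first value
            double diff = x2 - mu1;       // about -2 * double.MaxValue, overflows to negative infinity
            double squared = diff * diff; // positive infinity

            Console.WriteLine(double.IsNegativeInfinity(diff));    // True
            Console.WriteLine(double.IsPositiveInfinity(squared)); // True
        }
    }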

Would switching to the LogMeanVariance normalizer be a valid solution for your issue @lisahua?

lisahua (Contributor, Author) commented Jun 21, 2019

Thanks Artidoro. We require the parameter fixZero: false. Can LogMeanVariance provide this parameter? Thanks!

ghost locked as resolved and limited conversation to collaborators Mar 24, 2022