
Add FixZero for LogMeanVariance normalizer #3916


Merged
merged 4 commits into dotnet:master on Jul 1, 2019

Conversation

artidoro
Contributor

Fixes #2798.

This PR introduces the FixZero argument to the LogMeanVariance normalizer, along with related tests and a sample.

It's still a WIP because I would like to make a breaking change instead of creating an overload with the required parameter FixZero. Once the change is accepted and I set up the API Compat tool to allow the breaking change, I should be able to remove the overloads of the MLContext extensions.
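
For readers following along, here is a minimal sketch of how the new argument is meant to be used. The exact overload shape (fixZero passed right after the output column name, with that parameter name) is an assumption based on this PR's discussion, not the final committed signature:

using Microsoft.ML;
using Microsoft.ML.Data;

public class NormalizeLogMeanVarianceFixZeroSketch
{
    // Illustrative data point type; the real sample defines its own DataPoint class.
    private class DataPoint
    {
        [VectorType(5)]
        public float[] Features { get; set; }
    }

    public static void Example()
    {
        var mlContext = new MLContext();
        var samples = new[]
        {
            new DataPoint { Features = new float[5] { 1, 1, 3, 0, float.MaxValue } },
            new DataPoint { Features = new float[5] { 2, 2, 2, 0, float.MaxValue } }
        };
        var data = mlContext.Data.LoadFromEnumerable(samples);

        // Assumed overload: the second argument fixes zero, useCdf selects the CDF output.
        var pipeline = mlContext.Transforms.NormalizeLogMeanVariance("Features", fixZero: true, useCdf: true);
        var transformer = pipeline.Fit(data);
        var transformedData = transformer.Transform(data);
    }
}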

@artidoro artidoro self-assigned this Jun 26, 2019
new DataPoint(){ Features = new float[4] { 2, 2, 2, 0} },
new DataPoint(){ Features = new float[4] { 0, 0, 1, 0} },
new DataPoint(){ Features = new float[4] {-1,-1,-1, 1} }
new DataPoint(){ Features = new float[5] { 1, 1, 3, 0, float.MaxValue } },
Member

@wschin wschin Jun 27, 2019

Is this change related to FixZero? #Resolved

Contributor Author

It's related to the original issue posted by Lisa. They were encountering an issue with the MeanVariance normalizer, which does not handle data containing values like float.MaxValue or float.MinValue. I think it would be useful to show in the related sample that the LogMeanVariance normalizer accepts this data.


In reply to: 298254220 [](ancestors = 298254220)
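
To make that point concrete, a small illustrative snippet (not the normalizer's actual implementation): a variance computed on the raw values overflows for inputs near float.MaxValue, while the statistics of log(x) stay finite.

using System;

float x = float.MaxValue;
Console.WriteLine(x * x);       // Overflows to float.PositiveInfinity, so a raw-value variance turns into Infinity/NaN.
Console.WriteLine(Math.Log(x)); // Roughly 88.72, so the mean and variance of log(x) remain finite.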

// Uses the cumulative distribution function (CDF) as output.
var normalize = mlContext.Transforms.NormalizeLogMeanVariance(true, "Features", useCdf: true);

// NormalizeLogMeanVariance normalizes the data based on the computed mean and variance of the logarithm of the data.
Member

@wschin wschin Jun 27, 2019

"NormalizeLogMeanVariance normalizes the data based on the computed mean and variance of the logarithm of the data." doesn't tell much more than the line below. We need some equations here.

(log(x) - Mean(log(x))) / Var(log(x))

maybe? #Resolved

Contributor Author

Yes, that would be nice. I can add that here.


In reply to: 298254677 [](ancestors = 298254677)

Contributor Author

As I told you the other day, I am still concerned about that equation. I am going to do a separate PR where I add the equations. (Same for the documentation comment in the catalog that you mentioned.)
There is no documentation on how to derive them, so I am going through the math again and trying to understand the rationale.


In reply to: 298311132 [](ancestors = 298311132,298254677)
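
For reference while that verification is pending: the CDF mapping printed by the sample further down in this PR has the form

y = 0.5 * (1 + ERF((log(x) - Mean(log(x))) / (StdDev(log(x)) * sqrt(2))))

that is, it divides by the standard deviation of log(x) rather than by the variance suggested above; whether that is the right way to document it is exactly what is being double-checked.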

{
    public class NormalizeLogMeanVarianceFixZero
    {
        public static void Example()
Member

@wschin wschin Jun 27, 2019

You have two independent cases in Example(...). Could you split them into two files? #Pending

Contributor Author

I have followed the same pattern as the MeanVariance normalizer and LogMeanVariance normalizer samples. Would you prefer that I split them all into two files? Or should I just split this one? Or keep it like the others for consistency?


In reply to: 298257288 [](ancestors = 298257288)

Member

Separating them is more reader-friendly.

// If we have multiple column transformations, we need to pass the index of the InputOutputColumnPair.
var transformParams = normalizeTransform.GetNormalizerModelParameters(0) as CdfNormalizerModelParameters<ImmutableArray<float>>;
Console.WriteLine("The 1-index value in resulting array would be produced by:");
Console.WriteLine($"y = 0.5 * (1 + ERF((Math.Log(x) - {transformParams.Mean[1]}) / ({transformParams.StandardDeviation[1]} * sqrt(2))))");
Member

@wschin wschin Jun 27, 2019

We have an equation for columnFixZero now! May we have the same thing for column? #Resolved

Contributor Author

I just copied the equation from the LogMeanVariance normalizer sample :)
There is an equation for both settings of FixZero and for both Cdf and no Cdf. So I think it's complete.


In reply to: 298258190 [](ancestors = 298258190)
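
For completeness, here is a sketch of the non-CDF side, mirroring the existing LogMeanVariance sample. It assumes the non-CDF variant exposes affine parameters (Scale and Offset), with Offset empty when zero is fixed, and that normalizeNoCdfTransform is the fitted transformer created with useCdf: false:

using System;
using System.Collections.Immutable;
using static Microsoft.ML.Transforms.NormalizingTransformer;

// Retrieve the affine parameters for the first column pair and print the mapping for slot 1.
var noCdfParams = normalizeNoCdfTransform.GetNormalizerModelParameters(0)
    as AffineNormalizerModelParameters<ImmutableArray<float>>;
var offset = noCdfParams.Offset.Length == 0 ? 0 : noCdfParams.Offset[1];
Console.WriteLine($"The 1-index value in resulting array would be produced by: y = (x - ({offset})) * {noCdfParams.Scale[1]}");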


// ERF is https://en.wikipedia.org/wiki/Error_function.
// Expected output:
// The 1-index value in resulting array would be produced by:
Member

@wschin wschin Jun 27, 2019

What does 1 - index mean? #Resolved

Contributor Author

I can rephrase here, but it just means that we show the normalization parameter values for the slot in the array with index 1.


In reply to: 298258499 [](ancestors = 298258499)
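
One possible rephrasing, sketched as code: print the parameters for every slot so the index is explicit (this assumes transformParams is the CdfNormalizerModelParameters retrieved in the quoted lines above):

for (int i = 0; i < transformParams.Mean.Length; i++)
    Console.WriteLine($"Slot {i}: mean = {transformParams.Mean[i]}, standard deviation = {transformParams.StandardDeviation[i]}");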

/// <param name="maximumExampleCount">Maximum number of examples used to train the normalizer.</param>
/// <param name="useCdf">Whether to use CDF as the output.</param>
/// <example>
/// <format type="text/markdown">
Member

@wschin wschin Jun 27, 2019

Users would love equations in the documentation. #Resolved

Contributor Author

I will add the equations here as well.


In reply to: 298260450 [](ancestors = 298260450)

Contributor Author

@artidoro artidoro Jun 28, 2019

I am a little concerned about the equations; yesterday I think I saw an issue. I would like to make sure I understand all the steps, so I will add the equations in a separate PR and check this in for now.


In reply to: 298310313 [](ancestors = 298310313,298260450)

@artidoro artidoro changed the title from "WIP: Add FixZero for LogMeanVariance normalizer" to "Add FixZero for LogMeanVariance normalizer" on Jun 28, 2019
@artidoro
Contributor Author

/// * Binning - Bucketizes the data in each row and performs a linear rescale based on the calculated bins.

Add equations for the log mean variance normalizer,
for both the CDF and non-CDF cases.


Refers to: src/Microsoft.ML.Data/Transforms/Normalizer.cs:46 in 39b4002. [](commit_id = 39b4002, deletion_comment = False)
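
As a placeholder until that separate PR, the two mappings as printed by the sketches and sample output earlier in this conversation (still subject to the verification discussed above) presumably look like:

with CDF:    y = 0.5 * (1 + ERF((log(x) - Mean(log(x))) / (StdDev(log(x)) * sqrt(2))))
without CDF: y = (x - offset) * scale

where how offset and scale are derived from the log-domain statistics is exactly what remains to be documented.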

Member

@codemzs codemzs left a comment

We reviewed this in person and it looks good to me. As discussed, you will be verifying that there isn't any mathematical error in the computations.

Member

@wschin wschin left a comment

My only comment is to add equations to those normalizers.

@artidoro artidoro merged commit f67aab5 into dotnet:master Jul 1, 2019
Dmitry-A pushed a commit to Dmitry-A/machinelearning that referenced this pull request Jul 24, 2019
@ghost ghost locked as resolved and limited conversation to collaborators Mar 21, 2022
Successfully merging this pull request may close these issues.

Normalize double min and max value returns NaN
3 participants