
Rationalize Infinity handling in Normalizers #3885

Open
justinormont opened this issue Jun 19, 2019 · 2 comments
Labels
P2 Priority of the issue for triage purpose: Needs to be fixed at some point. usability Smoothing user interaction or experience

Comments

@justinormont
Contributor

justinormont commented Jun 19, 2019

A while back, I looked into how we can handle values of Infinity & -Infinity in our normalizers. We should standardize the handling of ±Infinity across our normalizers.

Background

Values of Infinity are rather hard for users to handle because NAHandle ignores the Infinity & -Infinity values. I believe you would need a custom mapping transform to deal with these values.
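To illustrate the gap (a NumPy sketch of typical NA-style imputation, not the ML.NET NAHandle code itself): imputation that keys on NaN leaves ±Infinity untouched.

```python
import numpy as np

# Sketch of the problem: a "replace missing values" step that keys on NaN
# catches NA but lets +/-Infinity pass straight through.
col = np.array([1.0, np.nan, np.inf, -np.inf])
imputed = np.where(np.isnan(col), 0.0, col)  # NaN -> 0; Inf survives
print(imputed)
```

A check against `np.isfinite` (rather than `np.isnan`) would be needed to catch the Infinity values as well.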

This comes up in a dataset where I'm calculating pairwise features, e.g. { x+y, x*y, x*y*y, x*e^y, ... }, and overflowing into Infinity. This leads to FastTree ignoring rows, and SDCA dying w/ infinite bias terms.
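The overflow itself is easy to reproduce in single precision (an illustrative NumPy sketch, not the actual featurization pipeline):

```python
import numpy as np

# Pairwise feature crosses can exceed float32's max (~3.4e38) and
# overflow to Infinity.
x = np.float32(1e20)
y = np.float32(1e25)

product = np.float32(x * y)  # 1e45 overflows single precision -> inf
print(product)
```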

Summary

MinMaxNormalizer+NAHandleTransform seems to be the best option for handling data values of ±Infinity. This option isn't perfect: it is slow (it causes an additional full pass over the data) and it doubles your number of features.

Currently, when a user applies a normalizer that replaces Infinity data values with NA, we then silently skip those rows. This is confusing to users and hard to debug.

Recommendation

  • Have MinMaxNormalizer replace +/- Infinity w/ 0.0 (on by default, with an option to disable).
    This way the normalization (recommended to add before many learners) will no longer replace Infinity with NA, which currently causes the learners to ignore those rows.
  • Modify the other normalization transforms to have consistent handling of Infinity data values.
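The proposed MinMaxNormalizer behavior could look like the following NumPy sketch (a hypothetical `minmax_normalize` helper, not the ML.NET implementation; ML.NET's min-max normalizer actually scales by max absolute value, while a plain (x - min) / (max - min) is used here for simplicity):

```python
import numpy as np

def minmax_normalize(col, replace_inf_with_zero=True):
    """Sketch of the proposal: fit min-max scaling on the finite values
    only, then map +/-Infinity to 0.0 instead of NA."""
    col = np.asarray(col, dtype=np.float64)
    finite = col[np.isfinite(col)]
    lo, hi = finite.min(), finite.max()
    out = (col - lo) / (hi - lo)
    if replace_inf_with_zero:
        out[np.isinf(col)] = 0.0  # proposed default, with option to disable
    return out

print(minmax_normalize([1.0, 2.0, np.inf, 4.0]))
```

The key point is that the min/max statistics are computed over the finite values only, so the Infinity entries neither distort the scaling nor turn into NA.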

Transforms' handling of Infinity

NAHandleTransform:
🔴 Bad: Ignores Infinity (maps Infinity => Infinity).
All learners work, but some skip rows w/ Infinity

MinMaxNormalizer:
📒 Not good: Replaces Infinity w/ NA. But this transform is recommended to add just before linear learners, which causes the learner to see NA and drop the row (for either training or prediction).
All learners work, but rows w/ NA in them are skipped.

MeanVarNormalizer:
🔴 Bad: Ignores Infinity (maps Infinity => Infinity), but normalizes the other numbers correctly.
All learners work, but some skip rows w/ Infinity in them.

LogMeanVarNormalizer:
✅ Good: Replaces Infinity w/ 0. Could be better w/ an indicator column for the imputing.
All learners work.

BinNormalizer:
🔴 Very bad: Replaces Infinity w/ 1, but replaces all other values in the column w/ 0 also. Effectively wipes out the column.
No learners die, but the column is wiped out.

SupervisedBinNormalizer:
🔴 Very bad: Replaces Infinity w/ 0, but replaces all other values in the column w/ 0 also. Effectively wipes out the column.
No learners die, but the column is wiped out.

MinMaxNormalizer+NAHandleTransform:
✅ Very good: Replaces Infinity w/ 0 (NA then 0), and also adds an indicator column for the imputing. Doubles the size of the feature column, though, and causes a full pass of the data (trainable transform).
All learners work great.
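A rough NumPy sketch of that combined pipeline (hypothetical helper name, not the ML.NET API; plain (x - min) / (max - min) scaling stands in for ML.NET's max-absolute-value scaling):

```python
import numpy as np

def minmax_then_nahandle(col):
    """Sketch of MinMaxNormalizer + NAHandleTransform: normalization turns
    +/-Infinity into NA, then NA-handling imputes 0 and emits an indicator
    column, doubling the feature width."""
    col = np.asarray(col, dtype=np.float64)
    finite = col[np.isfinite(col)]
    scaled = (col - finite.min()) / (finite.max() - finite.min())
    scaled[np.isinf(col)] = np.nan            # what the normalizer emits today
    indicator = np.isnan(scaled).astype(np.float64)
    imputed = np.where(np.isnan(scaled), 0.0, scaled)
    return np.column_stack([imputed, indicator])

print(minmax_then_nahandle([1.0, np.inf, 3.0]))
```

The indicator column is what lets learners distinguish a genuine 0 from an imputed Infinity, at the cost of doubling the feature count.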


The above analysis is on the internal version of ML.NET, but I don't expect there have been changes to the behavior.

@artidoro
Contributor

@justinormont why do you suggest replacing Infinity with 0 instead of the max value of the data type (e.g. float.MaxValue)?

@justinormont justinormont added the usability Smoothing user interaction or experience label Jun 23, 2019
@justinormont
Contributor Author

@artidoro : Replacing Infinity with float.MaxValue on the output of the normalizer will be unstable and cause linear models to throw an error of non-finite weights/bias terms. Replacing Infinity with float.MaxValue on the input to the normalizer will cause all other values to be replaced with 0, which effectively wipes out the entire column (see BinNormalizer behavior above).

Replacing Infinity with NA (missing value) should work but you'll need to impute the value, at which point we generally impute with a 0. This is the equivalent of adding MinMaxNormalizer+NAHandleTransform (without an indicator).
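The column-wipeout effect of substituting float.MaxValue on the input side is easy to see numerically (an illustrative NumPy sketch):

```python
import numpy as np

# Sketch: replacing Infinity with float.MaxValue *before* min-max scaling
# makes MaxValue the column max, so every ordinary value collapses to ~0.
col = np.array([1.0, 2.0, np.finfo(np.float32).max, 4.0], dtype=np.float64)
scaled = (col - col.min()) / (col.max() - col.min())
print(scaled)  # the replaced value maps to 1; everything else is ~1e-39
```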

@wschin: What are your thoughts on how to handle +/- Infinity?

@mstfbl mstfbl added the P2 Priority of the issue for triage purpose: Needs to be fixed at some point. label Jan 9, 2020