
Rationalize Infinity handling in Normalizers #3885

Open
justinormont opened this issue Jun 19, 2019 · 2 comments
Labels
P2 Priority of the issue for triage purpose: Needs to be fixed at some point. usability Smoothing user interaction or experience

Comments

@justinormont
Contributor

justinormont commented Jun 19, 2019

A while back, I looked into how we can handle values of Infinity & -Infinity in our normalizers. We should standardize the handling of ±Infinity across our normalizers.

Background

Values of Infinity are rather hard for users to handle because NAHandle ignores the Infinity & -Infinity values. I believe you would need a custom mapping transform to deal with these values.
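To illustrate the gap (a NumPy sketch of typical NA-style imputation, not the ML.NET NAHandle code itself): imputation that keys on NaN leaves ±Infinity untouched.

```python
import numpy as np

# Sketch of the problem: a "replace missing values" step that keys on NaN
# catches NA but lets +/-Infinity pass straight through.
col = np.array([1.0, np.nan, np.inf, -np.inf])
imputed = np.where(np.isnan(col), 0.0, col)  # NaN -> 0; Inf survives
print(imputed)
```

A check against `np.isfinite` (rather than `np.isnan`) would be needed to catch the Infinity values as well.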

This comes up in a dataset where I'm calculating pairwise features, e.g. { x+y, x*y, x*y*y, x*e^y, ... }, and overflowing into Infinity. This leads to FastTree ignoring rows, and SDCA dying w/ infinite bias terms.
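The overflow itself is easy to reproduce in single precision (an illustrative NumPy sketch, not the actual featurization pipeline):

```python
import numpy as np

# Pairwise feature crosses can exceed float32's max (~3.4e38) and
# overflow to Infinity.
x = np.float32(1e20)
y = np.float32(1e25)

product = np.float32(x * y)  # 1e45 overflows single precision -> inf
print(product)
```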

Summary

MinMaxNormalizer+NAHandleTransform seems to be the best option for handling data values of ±Infinity. This option isn't perfect: it is slow (it causes an additional full pass over the data) and it doubles your number of features.

Currently, when a user applies a normalizer that replaces Infinity data values with NA, we then silently skip those rows. This is confusing to users and hard to debug.

Recommendation

  • Have MinMaxNormalizer replace +/- Infinity w/ 0.0 (on by default, with an option to disable).
    This way the normalization (recommended to add before many learners) will no longer replace Infinity with NA, which currently causes the learners to ignore those rows.
  • Modify the other normalization transforms to have consistent handling of Infinity data values.
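The proposed MinMaxNormalizer behavior could look like the following NumPy sketch (a hypothetical `minmax_normalize` helper, not the ML.NET implementation; ML.NET's min-max normalizer actually scales by max absolute value, while a plain (x - min) / (max - min) is used here for simplicity):

```python
import numpy as np

def minmax_normalize(col, replace_inf_with_zero=True):
    """Sketch of the proposal: fit min-max scaling on the finite values
    only, then map +/-Infinity to 0.0 instead of NA."""
    col = np.asarray(col, dtype=np.float64)
    finite = col[np.isfinite(col)]
    lo, hi = finite.min(), finite.max()
    out = (col - lo) / (hi - lo)
    if replace_inf_with_zero:
        out[np.isinf(col)] = 0.0  # proposed default, with option to disable
    return out

print(minmax_normalize([1.0, 2.0, np.inf, 4.0]))
```

The key point is that the min/max statistics are computed over the finite values only, so the Infinity entries neither distort the scaling nor turn into NA.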

Transforms' handling of Infinity

NAHandleTransform:
🔴 Bad: Ignores Infinity (maps Infinity => Infinity).
All learners work, but some skip rows w/ Infinity

MinMaxNormalizer:
📒 Not good: Replaces Infinity w/ NA. But this transform is recommended to add just before linear learners, which causes the learner to see NA and drop the row (for either training or prediction).
All learners work, but rows w/ NA in them are skipped.

MeanVarNormalizer:
🔴 Bad: Ignores Infinity (maps Infinity => Infinity), but normalizes the other numbers correctly.
All learners work, but some skip rows w/ Infinity in them.

LogMeanVarNormalizer:
✅ Good: Replaces Infinity w/ 0. Could be better w/ an indicator column for the imputing.
All learners work.

BinNormalizer:
🔴 Very bad: Replaces Infinity w/ 1, but replaces all other values in the column w/ 0 also. Effectively wipes out the column.
No learners die, but the column is wiped out.

SupervisedBinNormalizer:
🔴 Very bad: Replaces Infinity w/ 0, but replaces all other values in the column w/ 0 also. Effectively wipes out the column.
No learners die, but the column is wiped out.

MinMaxNormalizer+NAHandleTransform:
✅ Very good: Replaces Infinity w/ 0 (NA then 0), and also adds an indicator column for the imputing. Doubles the size of the feature column, though, and causes a full pass of the data (trainable transform).
All learners work great.
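A rough NumPy sketch of that combined pipeline (hypothetical helper name, not the ML.NET API; plain (x - min) / (max - min) scaling stands in for ML.NET's max-absolute-value scaling):

```python
import numpy as np

def minmax_then_nahandle(col):
    """Sketch of MinMaxNormalizer + NAHandleTransform: normalization turns
    +/-Infinity into NA, then NA-handling imputes 0 and emits an indicator
    column, doubling the feature width."""
    col = np.asarray(col, dtype=np.float64)
    finite = col[np.isfinite(col)]
    scaled = (col - finite.min()) / (finite.max() - finite.min())
    scaled[np.isinf(col)] = np.nan            # what the normalizer emits today
    indicator = np.isnan(scaled).astype(np.float64)
    imputed = np.where(np.isnan(scaled), 0.0, scaled)
    return np.column_stack([imputed, indicator])

print(minmax_then_nahandle([1.0, np.inf, 3.0]))
```

The indicator column is what lets learners distinguish a genuine 0 from an imputed Infinity, at the cost of doubling the feature count.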


The above analysis is on the internal version of ML.NET, but I don't expect there have been changes to the behavior.

@artidoro
Contributor

@justinormont why do you suggest replacing Infinity with 0 instead of the max value of the data type (e.g. float.MaxValue)?

@justinormont justinormont added the usability Smoothing user interaction or experience label Jun 23, 2019
@justinormont
Contributor Author

@artidoro : Replacing Infinity with float.MaxValue on the output of the normalizer will be unstable and cause linear models to throw an error of non-finite weights/bias terms. Replacing Infinity with float.MaxValue on the input to the normalizer will cause all other values to be replaced with 0, which effectively wipes out the entire column (see BinNormalizer behavior above).

Replacing Infinity with NA (missing value) should work but you'll need to impute the value, at which point we generally impute with a 0. This is the equivalent of adding MinMaxNormalizer+NAHandleTransform (without an indicator).
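The column-wipeout effect of substituting float.MaxValue on the input side is easy to see numerically (an illustrative NumPy sketch):

```python
import numpy as np

# Sketch: replacing Infinity with float.MaxValue *before* min-max scaling
# makes MaxValue the column max, so every ordinary value collapses to ~0.
col = np.array([1.0, 2.0, np.finfo(np.float32).max, 4.0], dtype=np.float64)
scaled = (col - col.min()) / (col.max() - col.min())
print(scaled)  # the replaced value maps to 1; everything else is ~1e-39
```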

@wschin: What are your thoughts on how to handle +/- Infinity?

@mstfbl mstfbl added the P2 Priority of the issue for triage purpose: Needs to be fixed at some point. label Jan 9, 2020