A while back, I looked into how we can handle values of Infinity & -Infinity in our normalizers. We should standardize the handling of +/-Infinity across them.
Background
Values of Infinity are rather hard for users to handle, as NAHandle ignores the Infinity & -Infinity values. I believe you would need a custom mapping transform to deal with these values.
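As a sketch of that workaround, a CustomMapping transform in the public ML.NET API can clamp ±Infinity to 0 before any normalizer runs. The row class names, the column name "Features", and the fixed vector size 4 below are illustrative assumptions, not anything prescribed by the library:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();

// Clamp ±Infinity to 0 ahead of normalization.
var clampInfinity = mlContext.Transforms.CustomMapping<RawRow, CleanRow>(
    (src, dst) =>
    {
        dst.Features = new float[src.Features.Length];
        for (int i = 0; i < src.Features.Length; i++)
            dst.Features[i] = float.IsInfinity(src.Features[i]) ? 0f : src.Features[i];
    },
    contractName: null); // null contract name => fitted pipeline can't be saved to disk

// Row classes for the mapping; vector size and column name are illustrative.
public class RawRow   { [VectorType(4)] public float[] Features { get; set; } }
public class CleanRow { [VectorType(4)] public float[] Features { get; set; } }
```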
This comes up in a dataset where I'm calculating pairwise features, e.g. { x+y, xy, xyy, xe^y, ... }. These products overflow into Infinity, which leads to FastTree ignoring rows and SDCA dying w/ infinite bias terms.
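For concreteness, a two-line illustration of the overflow (the values are made up; any sufficiently large pair behaves the same way):

```csharp
float x = 3e20f, y = 2e25f;
float xy = x * y;                        // 6e45 exceeds float.MaxValue (~3.4e38)
Console.WriteLine(float.IsInfinity(xy)); // True: float arithmetic saturates to Infinity
```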
Summary
MinMaxNormalizer+NAHandleTransform seems to be the best option for handling data values of ±Infinity. This option isn't perfect: it is slow (it adds a full pass over the data) and doubles your number of features.
Currently, when a user applies a normalization that replaces Infinity data values with NA, we then silently skip those rows. This is confusing to users and hard to debug.
Recommendation
Have MinMaxNormalizer replace +/-Infinity w/ 0.00 (default on, with an option to disable); a sketch of the intended semantics follows this list.
This will stop the normalization that is recommended before many learners from replacing Infinity with NA, which currently causes the learners to ignore those rows.
Modify the other normalization transforms to handle Infinity data values consistently.
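To pin down what the first recommendation proposes, here is a simplified sketch of the per-value semantics. This is illustrative pseudocode of the idea, not the normalizer's actual internals (the real MinMaxNormalizer applies a learned affine scaling):

```csharp
// Proposed behavior, sketched: min-max scale, mapping ±Infinity to 0.
// 'zeroInfinities' models the suggested on-by-default option.
static float MinMaxWithInfinityToZero(float value, float min, float max,
                                      bool zeroInfinities = true)
{
    if (float.IsInfinity(value))
        return zeroInfinities ? 0f : value; // default on, w/ an option to disable
    if (max == min)
        return 0f;                          // degenerate column: everything maps to 0
    return (value - min) / (max - min);
}
```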
Transforms' handling of Infinity
NAHandleTransform:
🔴 Bad: Ignores Infinity (maps Infinity => Infinity).
All learners work, but some skip rows w/ Infinity in them.
MinMaxNormalizer:
📒 Not good: Replaces Infinity w/ NA. But this is the transform recommended just before linear learners, so the learner sees NA and drops the row (for either training or prediction).
All learners work, but rows w/ NA in them are skipped.
MeanVarNormalizer:
🔴 Bad: Ignores Infinity (maps Infinity => Infinity), but normalizes the other numbers correctly.
All learners work, but some skip rows w/ Infinity in them.
LogMeanVarNormalizer:
✅ Good: Replaces Infinity w/ 0. Could be better w/ an indicator column for the imputing.
All learners work.
BinNormalizer:
🔴 Very bad: Replaces Infinity w/ 1, but also replaces all other values in the column w/ 0. Effectively wipes out the column.
No learners die, but the column is wiped out.
SupervisedBinNormalizer:
🔴 Very bad: Replaces Infinity w/ 0, but also replaces all other values in the column w/ 0. Effectively wipes out the column.
No learners die, but the column is wiped out.
MinMaxNormalizer+NAHandleTransform:
✅ Very good: Replaces Infinity w/ 0 (NA first, then 0), and also adds an indicator column for the imputing. It doubles the size of the feature column, though, and requires a full pass over the data (trainable transform); see the pipeline sketch after this list.
All learners work great.
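For reference, this combination can be expressed with the public ML.NET API, where NAHandleTransform corresponds to IndicateMissingValues plus ReplaceMissingValues. The column names are illustrative, and the ConvertType step is my assumption, needed because the indicator column comes out as Boolean:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext();
var pipeline = mlContext.Transforms.NormalizeMinMax("Features")                // ±Infinity => NaN
    .Append(mlContext.Transforms.IndicateMissingValues("Indicators", "Features"))
    .Append(mlContext.Transforms.Conversion.ConvertType("Indicators",
        outputKind: DataKind.Single))                                          // bool => float
    .Append(mlContext.Transforms.ReplaceMissingValues("Features"))             // NaN => 0 (default mode)
    .Append(mlContext.Transforms.Concatenate("Features", "Features", "Indicators"));
```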
The above analysis is on the internal version of ML.NET, but I don't expect the behavior has changed.
@artidoro: Replacing Infinity with float.MaxValue on the output of the normalizer will be numerically unstable and cause linear models to throw a non-finite weights/bias error. Replacing Infinity with float.MaxValue on the input to the normalizer will cause all other values to be replaced with 0, which effectively wipes out the entire column (see the BinNormalizer behavior above).
Replacing Infinity with NA (missing value) should work, but you'll then need to impute the value, at which point we generally impute with 0. This is the equivalent of adding MinMaxNormalizer+NAHandleTransform (without an indicator).
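A quick illustration of the input-side failure mode (the typical value is made up):

```csharp
float max = float.MaxValue;        // Infinity replaced on the normalizer's input
float typical = 1234.5f;
Console.WriteLine(typical / max);  // ~3.6E-36: min-max scaling flattens the column to ~0
```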
@wschin: What are your thoughts on how to handle +/-Infinity?