-
Notifications
You must be signed in to change notification settings - Fork 1.9k
Add support for combining hashes in vector columns to HashingTransformer #4828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Question: some of the input types for hashing have missing values (float, double and key type). The hashing transformer maps missing values in the input, to 0, the missing value of key type. What should we do in the combine case? |
else if (((ISeededEnvironment)env).Seed.HasValue) | ||
columnOptions = new HashingEstimator.ColumnOptionsInternal(samplingKeyColumn, origStratCol, 30, (uint)((ISeededEnvironment)env).Seed.Value); | ||
columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, (uint)((ISeededEnvironment)env).Seed.Value, combine: true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can merge these conditions in one by defining var localSeed = seed.HasValue ? seed.Value : (ISeededEnvironment)env).Seed.HasValue ? (ISeededEnvironment)env).Seed.Value : null;
if(localSeed.HasValue)
...
else
columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, combine: true); #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want HashingEstimator
listening to the global seed. The hashing is a different meaning of "seed"; this isn't a PRNG seed.
See #4752 (comment)
/cc @najeeb-kazmi
@@ -122,14 +128,15 @@ private static VersionInfo GetVersionInfo() | |||
return new VersionInfo( | |||
modelSignature: "HASHTRNS", | |||
// verWrittenCur: 0x00010001, // Initial | |||
verWrittenCur: 0x00010002, // Invert hash key values, hash fix | |||
//verWrittenCur: 0x00010002, // Invert hash key values, hash fix |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
v [](start = 18, length = 1)
v [](start = 18, length = 1)
space #Resolved
bfd1524
to
f62a846
Compare
if (itemType is DateTimeDataViewType || itemType is DateTimeOffsetDataViewType || itemType is TimeSpanDataViewType) | ||
input = new TypeConvertingTransformer(Host, origStratCol, DataKind.Int64, origStratCol).Transform(input); | ||
|
||
output = new HashingEstimator(Host, stratificationColumn, origStratCol, 30).Fit(input).Transform(input); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realize the previous code too used 30 bits. But the default seems to be 31 bits. Any reason why we have 30 here? If there is a specific reason for 30 bits, can you please add a note to that in the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am actually not sure why 30 was chosen. If you think it's important I can try changing it back to 31.
In reply to: 386729669 [](ancestors = 386729669)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no reason to say it should change to 31 either :-). But since it is is used in a couple of places, can you please make it a named constant?
In reply to: 425102125 [](ancestors = 425102125,386729669)
var columnOptions = | ||
localSeed.HasValue ? | ||
new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, (uint)localSeed.Value, combine: true) : | ||
new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, combine: true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment as before. If the 30bits hashing is standard, can we define it as some global constant somewhere?
@@ -247,6 +262,8 @@ private Delegate GetGetterCore(DataViewRow input, int iinfo, out Action disposer | |||
var srcType = input.Schema[srcCol].Type; | |||
if (!(srcType is VectorDataViewType vectorType)) | |||
return ComposeGetterOne(input, iinfo, srcCol, srcType); | |||
if (_columns[iinfo].Combine) | |||
return ComposeGetterCombined(input, iinfo, srcCol, vectorType); | |||
return ComposeGetterVec(input, iinfo, srcCol, vectorType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to figure out the onnx export mechanism for this. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
…y of HashingEstimator.
verReadableCur: 0x00010003, | ||
//verWrittenCur: 0x00010003, // Uses MurmurHash3_x86_32 | ||
verWrittenCur: 0x00010004, // Add Combine option | ||
verReadableCur: 0x00010004, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still haven't released a version with MurmurHash3. Do we need to increment this if we make sure to include it in the v1.5 release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically users may create and save models containing this transformer once the PR that changed the version to 0x00010003 was merged. This most likely didn't happen, but to be on the safe side I think it's better to increment the version again.
In reply to: 424565803 [](ancestors = 424565803)
private readonly VBuffer<ReadOnlyMemory<char>>[] _keyValues; | ||
private readonly VectorDataViewType[] _kvTypes; | ||
private readonly bool _isVersion1; | ||
private readonly bool _useOldHashing; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Is it possible to come up with a better name? The only one I can think of is "_useOnnxCompatibleHashing" which would invert the meaning of the current name. Please use a a more descriptive name if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to keep it a variable that would have the value "false" by default, that's why I didn't want to change it to _useOnnxCompatibleHashing
. Is this name that I changed it to acceptable?
In reply to: 424567216 [](ancestors = 424567216)
var srcType = _srcTypes[iinfo].GetItemType(); | ||
if (srcType is KeyDataViewType) | ||
return false; | ||
if (_parent._columns[iinfo].Combine) | ||
return false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Lynx1820 Any thoughts why we are not supporting key types? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the remainder! I'll work on adding support for keytypes. Zero value keys don't get mapped to a hash but map to zero, a behavior not currently present. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codecov Report
@@ Coverage Diff @@
## master #4828 +/- ##
==========================================
- Coverage 75.71% 73.54% -2.17%
==========================================
Files 993 992 -1
Lines 179035 179484 +449
Branches 19356 19296 -60
==========================================
- Hits 135549 131997 -3552
- Misses 38262 42305 +4043
+ Partials 5224 5182 -42
|
Codecov Report
@@ Coverage Diff @@
## master #4828 +/- ##
==========================================
- Coverage 75.71% 75.60% -0.11%
==========================================
Files 993 995 +2
Lines 179035 179664 +629
Branches 19356 19314 -42
==========================================
+ Hits 135549 135834 +285
- Misses 38262 38565 +303
- Partials 5224 5265 +41
|
That sounds good to me. In reply to: 586994995 [](ancestors = 586994995) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am making this change so that
CountTargetEncodingEstimator
(see PR #4514 ) can start usingHashingEstimator
instead ofHashJoiningTransform
(see this comment in the above PR).In addition to enabling
CountTargetEncodingEstimator
to use it, this will fix a bug when splitting datasets in the CV command and in the train-test/CV APIs. If a stratification column that is a vector type is specified, currently an exception will be thrown because theRangeFilter
applied after hashing cannot handle vector columns.