Skip to content

Add support for combining hashes in vector columns to HashingTransformer #4828

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
May 18, 2020

Conversation

yaeldekel
Copy link

I am making this change so that CountTargetEncodingEstimator (see PR #4514 ) can start using HashingEstimator instead of HashJoiningTransform (see this comment in the above PR).

In addition to enabling CountTargetEncodingEstimator to use it, this will fix a bug when splitting datasets in the CV command and in the train-test/CV APIs. If a stratification column that is a vector type is specified, currently an exception will be thrown because the RangeFilter applied after hashing cannot handle vector columns.

@yaeldekel
Copy link
Author

Question: some of the input types for hashing have missing values (float, double and key type). The hashing transformer maps missing values in the input, to 0, the missing value of key type. What should we do in the combine case?
I think the correct thing to do is to return 0 if any of the slots have missing values (that way, the behavior of a length-1 vector is identical whether Combine is true or false), but I would like to know what other people think.

else if (((ISeededEnvironment)env).Seed.HasValue)
columnOptions = new HashingEstimator.ColumnOptionsInternal(samplingKeyColumn, origStratCol, 30, (uint)((ISeededEnvironment)env).Seed.Value);
columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, (uint)((ISeededEnvironment)env).Seed.Value, combine: true);
Copy link
Member

@codemzs codemzs Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can merge these conditions in one by defining var localSeed = seed.HasValue ? seed.Value : (ISeededEnvironment)env).Seed.HasValue ? (ISeededEnvironment)env).Seed.Value : null;

if(localSeed.HasValue)
...
else
columnOptions = new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, combine: true); #Resolved

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want HashingEstimator listening to the global seed. The hashing is a different meaning of "seed"; this isn't a PRNG seed.

See #4752 (comment)

/cc @najeeb-kazmi

@@ -122,14 +128,15 @@ private static VersionInfo GetVersionInfo()
return new VersionInfo(
modelSignature: "HASHTRNS",
// verWrittenCur: 0x00010001, // Initial
verWrittenCur: 0x00010002, // Invert hash key values, hash fix
//verWrittenCur: 0x00010002, // Invert hash key values, hash fix
Copy link
Member

@codemzs codemzs Feb 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

v [](start = 18, length = 1)

v [](start = 18, length = 1)

space #Resolved

@yaeldekel yaeldekel requested a review from a team as a code owner February 25, 2020 22:13
@yaeldekel yaeldekel force-pushed the hashing branch 2 times, most recently from bfd1524 to f62a846 Compare February 26, 2020 21:44
if (itemType is DateTimeDataViewType || itemType is DateTimeOffsetDataViewType || itemType is TimeSpanDataViewType)
input = new TypeConvertingTransformer(Host, origStratCol, DataKind.Int64, origStratCol).Transform(input);

output = new HashingEstimator(Host, stratificationColumn, origStratCol, 30).Fit(input).Transform(input);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize the previous code too used 30 bits. But the default seems to be 31 bits. Any reason why we have 30 here? If there is a specific reason for 30 bits, can you please add a note to that in the code?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am actually not sure why 30 was chosen. If you think it's important I can try changing it back to 31.


In reply to: 386729669 [](ancestors = 386729669)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no reason to say it should change to 31 either :-). But since it is is used in a couple of places, can you please make it a named constant?


In reply to: 425102125 [](ancestors = 425102125,386729669)

var columnOptions =
localSeed.HasValue ?
new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, (uint)localSeed.Value, combine: true) :
new HashingEstimator.ColumnOptions(samplingKeyColumn, origStratCol, 30, combine: true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar comment as before. If the 30bits hashing is standard, can we define it as some global constant somewhere?

@@ -247,6 +262,8 @@ private Delegate GetGetterCore(DataViewRow input, int iinfo, out Action disposer
var srcType = input.Schema[srcCol].Type;
if (!(srcType is VectorDataViewType vectorType))
return ComposeGetterOne(input, iinfo, srcCol, srcType);
if (_columns[iinfo].Combine)
return ComposeGetterCombined(input, iinfo, srcCol, vectorType);
return ComposeGetterVec(input, iinfo, srcCol, vectorType);
Copy link
Contributor

@harishsk harishsk Mar 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to figure out the onnx export mechanism for this. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I'm making this case non-exportable.


In reply to: 386741545 [](ancestors = 386741545)

verReadableCur: 0x00010003,
//verWrittenCur: 0x00010003, // Uses MurmurHash3_x86_32
verWrittenCur: 0x00010004, // Add Combine option
verReadableCur: 0x00010004,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still haven't released a version with MurmurHash3. Do we need to increment this if we make sure to include it in the v1.5 release?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically users may create and save models containing this transformer once the PR that changed the version to 0x00010003 was merged. This most likely didn't happen, but to be on the safe side I think it's better to increment the version again.


In reply to: 424565803 [](ancestors = 424565803)

private readonly VBuffer<ReadOnlyMemory<char>>[] _keyValues;
private readonly VectorDataViewType[] _kvTypes;
private readonly bool _isVersion1;
private readonly bool _useOldHashing;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: Is it possible to come up with a better name? The only one I can think of is "_useOnnxCompatibleHashing" which would invert the meaning of the current name. Please use a a more descriptive name if possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to keep it a variable that would have the value "false" by default, that's why I didn't want to change it to _useOnnxCompatibleHashing. Is this name that I changed it to acceptable?


In reply to: 424567216 [](ancestors = 424567216)

var srcType = _srcTypes[iinfo].GetItemType();
if (srcType is KeyDataViewType)
return false;
if (_parent._columns[iinfo].Combine)
return false;
Copy link
Contributor

@harishsk harishsk May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Lynx1820 Any thoughts why we are not supporting key types? #Resolved

Copy link
Contributor

@Lynx1820 Lynx1820 May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the remainder! I'll work on adding support for keytypes. Zero value keys don't get mapped to a hash but map to zero, a behavior not currently present. #Resolved

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @Lynx1820 mentioned, in key types 0 is mapped to 0 instead of to its hash value, so it would be more complicated to export to ONNX.


In reply to: 424584542 [](ancestors = 424584542)

@codecov
Copy link

codecov bot commented May 14, 2020

Codecov Report

Merging #4828 into master will decrease coverage by 2.16%.
The diff coverage is 75.35%.

@@            Coverage Diff             @@
##           master    #4828      +/-   ##
==========================================
- Coverage   75.71%   73.54%   -2.17%     
==========================================
  Files         993      992       -1     
  Lines      179035   179484     +449     
  Branches    19356    19296      -60     
==========================================
- Hits       135549   131997    -3552     
- Misses      38262    42305    +4043     
+ Partials     5224     5182      -42     
Flag Coverage Δ
#Debug 73.54% <75.35%> (-2.17%) ⬇️
#production 68.84% <68.71%> (-2.85%) ⬇️
#test 88.67% <85.67%> (+0.02%) ⬆️
Impacted Files Coverage Δ
test/Microsoft.ML.Benchmarks/HashBench.cs 0.00% <0.00%> (ø)
src/Microsoft.ML.EntryPoints/TrainTestSplit.cs 76.92% <45.00%> (-16.11%) ⬇️
...L.Transforms/Text/WordHashBagProducingTransform.cs 61.08% <50.00%> (ø)
test/Microsoft.ML.Tests/DatabaseLoaderTests.cs 94.07% <52.63%> (ø)
...crosoft.ML.Data/Commands/CrossValidationCommand.cs 83.52% <60.00%> (+2.04%) ⬆️
src/Microsoft.ML.Data/Transforms/Hashing.cs 81.38% <69.15%> (-7.58%) ⬇️
...soft.ML.Data/DataLoadSave/DataOperationsCatalog.cs 81.81% <75.00%> (+4.64%) ⬆️
test/Microsoft.ML.Tests/TextLoaderTests.cs 95.53% <77.77%> (ø)
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.21% <100.00%> (ø)
...ML.Data/Transforms/ConversionsExtensionsCatalog.cs 71.84% <100.00%> (-0.54%) ⬇️
... and 120 more

@codecov
Copy link

codecov bot commented May 14, 2020

Codecov Report

Merging #4828 into master will decrease coverage by 0.10%.
The diff coverage is 75.33%.

@@            Coverage Diff             @@
##           master    #4828      +/-   ##
==========================================
- Coverage   75.71%   75.60%   -0.11%     
==========================================
  Files         993      995       +2     
  Lines      179035   179664     +629     
  Branches    19356    19314      -42     
==========================================
+ Hits       135549   135834     +285     
- Misses      38262    38565     +303     
- Partials     5224     5265      +41     
Flag Coverage Δ
#Debug 75.60% <75.33%> (-0.11%) ⬇️
#production 71.55% <68.71%> (-0.14%) ⬇️
#test 88.66% <85.63%> (+0.02%) ⬆️
Impacted Files Coverage Δ
test/Microsoft.ML.Benchmarks/HashBench.cs 0.00% <0.00%> (ø)
src/Microsoft.ML.EntryPoints/TrainTestSplit.cs 76.92% <45.00%> (-16.11%) ⬇️
...L.Transforms/Text/WordHashBagProducingTransform.cs 61.08% <50.00%> (ø)
test/Microsoft.ML.Tests/DatabaseLoaderTests.cs 94.07% <52.63%> (ø)
...crosoft.ML.Data/Commands/CrossValidationCommand.cs 83.52% <60.00%> (+2.04%) ⬆️
src/Microsoft.ML.Data/Transforms/Hashing.cs 81.38% <69.15%> (-7.58%) ⬇️
...soft.ML.Data/DataLoadSave/DataOperationsCatalog.cs 81.81% <75.00%> (+4.64%) ⬆️
test/Microsoft.ML.Tests/TextLoaderTests.cs 95.53% <77.77%> (ø)
.../Microsoft.ML.Data/DataLoadSave/Text/TextLoader.cs 82.21% <100.00%> (ø)
...ML.Data/Transforms/ConversionsExtensionsCatalog.cs 71.84% <100.00%> (-0.54%) ⬇️
... and 57 more

@harishsk
Copy link
Contributor

That sounds good to me.


In reply to: 586994995 [](ancestors = 586994995)

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@yaeldekel yaeldekel merged commit 6fddd31 into dotnet:master May 18, 2020
@yaeldekel yaeldekel deleted the hashing branch May 18, 2020 06:26
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants