Default seed is not propagated from MLContext #4752

najeeb-kazmi · 2020-01-31T03:10:53Z

In theory, the seed set in MLContext is intended to provide the global seed for all components and operations requiring randomness, e.g. sampling, permutation, etc. In practice, this doesn't always hold true.

TrainTestSplit, CrossValidationSplit, and CrossValidate all have a user specified seed and call EnsureGroupPreservationColumn, which in turn uses GenerateNumberTransform and HashingEstimator.

When the seed is not specified by the user, it is not derived from MLContext. Instead, GenerateNumberTransform and HashingEstimator use their own defaults, so that if a user doesn't specify a seed to TrainTestSplit, CrossValidationSplit, or CrossValidate, they will always get a deterministic split regardless of the seed in MLContext.

machinelearning/src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs

Lines 496 to 505 in 24c8274

    
           internal static void EnsureGroupPreservationColumn(IHostEnvironment env, ref IDataView data, ref string samplingKeyColumn, int? seed = null) 
        
           { 
        
               // We need to handle two cases: if samplingKeyColumn is provided, we use hashJoin to 
        
               // build a single hash of it. If it is not, we generate a random number. 
        
               if (samplingKeyColumn == null) 
        
               { 
        
                   samplingKeyColumn = data.Schema.GetTempColumnName("SamplingKeyColumn"); 
        
                   data = new GenerateNumberTransform(env, data, samplingKeyColumn, (uint?)seed); 
        
               }

machinelearning/src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs

Lines 521 to 525 in 24c8274

    
           if (seed.HasValue) 
        
               columnOptions = new HashingEstimator.ColumnOptionsInternal(samplingKeyColumn, origStratCol, 30, (uint)seed.Value); 
        
           else 
        
               columnOptions = new HashingEstimator.ColumnOptionsInternal(samplingKeyColumn, origStratCol, 30); 
        
           data = new HashingEstimator(env, columnOptions).Fit(data).Transform(data);

cc: @harishsk @justinormont

The text was updated successfully, but these errors were encountered:

justinormont · 2020-01-31T10:54:08Z

There are two uses of the term "seed": PRNG seeds, and hashing seeds. The seeds for hashing are a bit of a different beast than the rest. For hashing (e.g. NgramHashingTransformer) they are init values for hashing instead of the seed for PRNGs.

For the hashing components, I don't think they should listen to the global seed.

When hashing a feature value to a specific bin (e.g. categorial hash), I generally expect the hashing to be stable and the same input value will land in to the same feature slot between runs.

For instance, if I featurize a dataset using the categorial hash, I would expect to be able to re-use the data. This occurs when partitioning a large dataset across a cluster and using stateless transforms like categorical hash, where each node learns on its local data, then we combine the output models. To combine the models, the feature slots needs to line up (otherwise we have to keep/run all N featurizers). Altering the seed of hashing, reduces the reusability of data that flows thru the hashing transforms. While we may want to otherwise alter the PRNG seed independently.

We should consider renaming seed in hashing components to hashInit to distinguish from PRNG seeds.

Use of seed in NgramHashingTransformer: (passing the seed as the init value)

machinelearning/src/Microsoft.ML.Transforms/Text/NgramHashingTransformer.cs

Lines 455 to 461 in 4ec2f9c

    
           (uint[] ngram, int lim, int icol, ref bool more) => 
        
           { 
        
               AssertValid(ngram, ngramLength, lim, icol); 
        
               if (lim < ngramLength) 
        
                   return -1; 
        
               return (int)(Hashing.MurmurHash(seed, ngram, 0, lim) & mask); 
        
           };

Use of seed in hashing: (becomes the init value)

machinelearning/src/Microsoft.ML.Core/Utilities/Hashing.cs

Lines 254 to 269 in b861b5d

    
           public static uint MurmurHash(uint hash, uint[] data, int min, int lim) 
        
           { 
        
               Contracts.Check(0 <= min); 
        
               Contracts.Check(min <= lim); 
        
               Contracts.Check(lim <= Utils.Size(data)); 
        
               for (int i = min; i < lim; i++) 
        
                   hash = MurmurRound(hash, data[i]); 
        
               hash = MurmurRound(hash, (uint)(lim - min)); 
        
               // Final mixing ritual for the hash. 
        
               hash = MixHash(hash); 
        
               return hash; 
        
           }

Hashing methods have a default "seed" value of 314489979:

machinelearning/src/Microsoft.ML.Transforms/OneHotHashEncoding.cs

Lines 251 to 258 in 4ec2f9c

    
           public sealed class OneHotHashEncodingEstimator : IEstimator<OneHotHashEncodingTransformer> 
        
           { 
        
               [BestFriend] 
        
               internal static class Defaults 
        
               { 
        
                   public const int NumberOfBits = 16; 
        
                   public const uint Seed = 314489979; 
        
                   public const bool UseOrderedHashing = true;

najeeb-kazmi · 2020-01-31T18:01:04Z

@justinormont I'm not suggesting we change defaults of HashingExtimator. Rather, the seed we pass to EnsureGroupPreservationColumn should come from MLContext, when not specified in TrainTestSplit. etc.

Or do you think it should always be a deterministic split if no seed specified in TrainTestSplit etc, regardless of the seed in MLContext? This is the current behavior.

justinormont · 2020-01-31T19:00:27Z

Or do you think it should always be a deterministic split if no seed specified in TrainTestSplit etc, regardless of the seed in MLContext? This is the current behavior.

TrainTestSplit etc should listen to MLContext seed if not given by the user for the specific component.

I think the precedence for PRNG seeds should be:

User seed specified for the component instance (or from parent component if hiding behind another)
User seed specified in MLContext
Default seed in component (unsure if we should remove these, might be a component-by-component discussion)
Random (and more random than current -- see Use a GUID when creating the temp path #4645 (comment))

harishsk · 2020-02-02T16:55:32Z

Looks like there are additional comments that need to be addressed.

sharwell · 2020-02-02T18:23:36Z

One of the challenges of the RNG used in ML.NET is it extends Random, so it's hard to find locations where a seed fails to propagate. Note that the propagation is quite complex - rather than starting a new RNG with the original MLContext seed, the new RNG "forks" from the current RNG by using the next number from the current RNG as the seed for the new RNG. If the goal of the seed investigation effort is deterministic execution if a seed is provided, I'm not confident propagation of the seed is sufficient to achieve this. Two additional notes:

I was not able to find a case where an RNG is created without a seed specified (or rather, I can't remember such a case; I investigated this about 2 months ago)
I was not able to find a case where the seed provided to the initial MLContext changed the test outcome. This is very easy to test with IterationData - just pass the iteration value as the MLContext seed.

codemzs · 2020-02-02T21:13:58Z

Forking is done because Random is not thread safe. In any case the current mechanism should guarantee deterministic behavior because the seed generated from parent RNG should be same each time.

najeeb-kazmi mentioned this issue Jan 31, 2020

Added Shuffle=false for all the Sdca related tests #4750

Closed

mstfbl added bug Something isn't working P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away. labels Jan 31, 2020

najeeb-kazmi self-assigned this Feb 1, 2020

najeeb-kazmi mentioned this issue Feb 1, 2020

Data splits to default to MLContext seed when not specified #4764

Merged

harishsk closed this as completed in #4764 Feb 2, 2020

harishsk reopened this Feb 2, 2020

najeeb-kazmi mentioned this issue Feb 4, 2020

Add Seed property to MLContext and use as default for data splits #4775

Merged

najeeb-kazmi closed this as completed in #4775 Feb 11, 2020

justinormont mentioned this issue Feb 21, 2020

Add support for combining hashes in vector columns to HashingTransformer #4828

Merged

justinormont mentioned this issue Nov 19, 2020

Are ML .Net models deterministic? #5497

Closed

ghost locked as resolved and limited conversation to collaborators Mar 19, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Default seed is not propagated from MLContext #4752

Default seed is not propagated from MLContext #4752

najeeb-kazmi commented Jan 31, 2020 •

edited

Loading

justinormont commented Jan 31, 2020 •

edited

Loading

najeeb-kazmi commented Jan 31, 2020 •

edited

Loading

justinormont commented Jan 31, 2020 •

edited by najeeb-kazmi

Loading

harishsk commented Feb 2, 2020

sharwell commented Feb 2, 2020

codemzs commented Feb 2, 2020 •

edited

Loading

Default seed is not propagated from MLContext #4752

Default seed is not propagated from MLContext #4752

Comments

najeeb-kazmi commented Jan 31, 2020 • edited Loading

justinormont commented Jan 31, 2020 • edited Loading

najeeb-kazmi commented Jan 31, 2020 • edited Loading

justinormont commented Jan 31, 2020 • edited by najeeb-kazmi Loading

harishsk commented Feb 2, 2020

sharwell commented Feb 2, 2020

codemzs commented Feb 2, 2020 • edited Loading

najeeb-kazmi commented Jan 31, 2020 •

edited

Loading

justinormont commented Jan 31, 2020 •

edited

Loading

najeeb-kazmi commented Jan 31, 2020 •

edited

Loading

justinormont commented Jan 31, 2020 •

edited by najeeb-kazmi

Loading

codemzs commented Feb 2, 2020 •

edited

Loading