Skip to content

Default seed is not propagated from MLContext #4752

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
najeeb-kazmi opened this issue Jan 31, 2020 · 6 comments · Fixed by #4764 or #4775
Closed

Default seed is not propagated from MLContext #4752

najeeb-kazmi opened this issue Jan 31, 2020 · 6 comments · Fixed by #4764 or #4775
Assignees
Labels
bug Something isn't working P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away.

Comments

@najeeb-kazmi
Copy link
Member

najeeb-kazmi commented Jan 31, 2020

In theory, the seed set in MLContext is intended to provide the global seed for all components and operations requiring randomness, e.g. sampling, permutation, etc. In practice, this doesn't always hold true.

TrainTestSplit, CrossValidationSplit, and CrossValidate all have a user specified seed and call EnsureGroupPreservationColumn, which in turn uses GenerateNumberTransform and HashingEstimator.

When the seed is not specified by the user, it is not derived from MLContext. Instead, GenerateNumberTransform and HashingEstimator use their own defaults, so that if a user doesn't specify a seed to TrainTestSplit, CrossValidationSplit, or CrossValidate, they will always get a deterministic split regardless of the seed in MLContext.

internal static void EnsureGroupPreservationColumn(IHostEnvironment env, ref IDataView data, ref string samplingKeyColumn, int? seed = null)
{
// We need to handle two cases: if samplingKeyColumn is provided, we use hashJoin to
// build a single hash of it. If it is not, we generate a random number.
if (samplingKeyColumn == null)
{
samplingKeyColumn = data.Schema.GetTempColumnName("SamplingKeyColumn");
data = new GenerateNumberTransform(env, data, samplingKeyColumn, (uint?)seed);
}

if (seed.HasValue)
columnOptions = new HashingEstimator.ColumnOptionsInternal(samplingKeyColumn, origStratCol, 30, (uint)seed.Value);
else
columnOptions = new HashingEstimator.ColumnOptionsInternal(samplingKeyColumn, origStratCol, 30);
data = new HashingEstimator(env, columnOptions).Fit(data).Transform(data);

cc: @harishsk @justinormont

@justinormont
Copy link
Contributor

justinormont commented Jan 31, 2020

There are two uses of the term "seed": PRNG seeds, and hashing seeds. The seeds for hashing are a bit of a different beast than the rest. For hashing (e.g. NgramHashingTransformer) they are init values for hashing instead of the seed for PRNGs.

For the hashing components, I don't think they should listen to the global seed.

When hashing a feature value to a specific bin (e.g. categorial hash), I generally expect the hashing to be stable and the same input value will land in to the same feature slot between runs.

For instance, if I featurize a dataset using the categorial hash, I would expect to be able to re-use the data. This occurs when partitioning a large dataset across a cluster and using stateless transforms like categorical hash, where each node learns on its local data, then we combine the output models. To combine the models, the feature slots needs to line up (otherwise we have to keep/run all N featurizers). Altering the seed of hashing, reduces the reusability of data that flows thru the hashing transforms. While we may want to otherwise alter the PRNG seed independently.

We should consider renaming seed in hashing components to hashInit to distinguish from PRNG seeds.


Use of seed in NgramHashingTransformer: (passing the seed as the init value)

(uint[] ngram, int lim, int icol, ref bool more) =>
{
AssertValid(ngram, ngramLength, lim, icol);
if (lim < ngramLength)
return -1;
return (int)(Hashing.MurmurHash(seed, ngram, 0, lim) & mask);
};

Use of seed in hashing: (becomes the init value)

public static uint MurmurHash(uint hash, uint[] data, int min, int lim)
{
Contracts.Check(0 <= min);
Contracts.Check(min <= lim);
Contracts.Check(lim <= Utils.Size(data));
for (int i = min; i < lim; i++)
hash = MurmurRound(hash, data[i]);
hash = MurmurRound(hash, (uint)(lim - min));
// Final mixing ritual for the hash.
hash = MixHash(hash);
return hash;
}

Hashing methods have a default "seed" value of 314489979:

public sealed class OneHotHashEncodingEstimator : IEstimator<OneHotHashEncodingTransformer>
{
[BestFriend]
internal static class Defaults
{
public const int NumberOfBits = 16;
public const uint Seed = 314489979;
public const bool UseOrderedHashing = true;

@mstfbl mstfbl added bug Something isn't working P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away. labels Jan 31, 2020
@najeeb-kazmi
Copy link
Member Author

najeeb-kazmi commented Jan 31, 2020

@justinormont I'm not suggesting we change defaults of HashingExtimator. Rather, the seed we pass to EnsureGroupPreservationColumn should come from MLContext, when not specified in TrainTestSplit. etc.

Or do you think it should always be a deterministic split if no seed specified in TrainTestSplit etc, regardless of the seed in MLContext? This is the current behavior.

@justinormont
Copy link
Contributor

justinormont commented Jan 31, 2020

Or do you think it should always be a deterministic split if no seed specified in TrainTestSplit etc, regardless of the seed in MLContext? This is the current behavior.

TrainTestSplit etc should listen to MLContext seed if not given by the user for the specific component.

I think the precedence for PRNG seeds should be:

  1. User seed specified for the component instance (or from parent component if hiding behind another)
  2. User seed specified in MLContext
  3. Default seed in component (unsure if we should remove these, might be a component-by-component discussion)
  4. Random (and more random than current -- see Use a GUID when creating the temp path #4645 (comment))

@harishsk
Copy link
Contributor

harishsk commented Feb 2, 2020

Looks like there are additional comments that need to be addressed.

@sharwell
Copy link
Member

sharwell commented Feb 2, 2020

One of the challenges of the RNG used in ML.NET is it extends Random, so it's hard to find locations where a seed fails to propagate. Note that the propagation is quite complex - rather than starting a new RNG with the original MLContext seed, the new RNG "forks" from the current RNG by using the next number from the current RNG as the seed for the new RNG. If the goal of the seed investigation effort is deterministic execution if a seed is provided, I'm not confident propagation of the seed is sufficient to achieve this. Two additional notes:

  1. I was not able to find a case where an RNG is created without a seed specified (or rather, I can't remember such a case; I investigated this about 2 months ago)
  2. I was not able to find a case where the seed provided to the initial MLContext changed the test outcome. This is very easy to test with IterationData - just pass the iteration value as the MLContext seed.

@codemzs
Copy link
Member

codemzs commented Feb 2, 2020

Forking is done because Random is not thread safe. In any case the current mechanism should guarantee deterministic behavior because the seed generated from parent RNG should be same each time.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working P0 Priority of the issue for triage purpose: IMPORTANT, needs to be fixed right away.
Projects
None yet
6 participants