Default seed is not propagated from MLContext #4752
Comments
There are two uses of the term "seed": PRNG seeds and hashing seeds. The seeds for hashing are a bit of a different beast than the rest. For hashing (e.g. NgramHashingTransformer), they are init values for the hash function rather than seeds for a PRNG. I don't think the hashing components should listen to the global seed. When hashing a feature value to a specific bin (e.g. categorical hash), I generally expect the hashing to be stable, so that the same input value lands in the same feature slot between runs. For instance, if I featurize a dataset using the categorical hash, I would expect to be able to re-use the data. This matters when partitioning a large dataset across a cluster and using stateless transforms like categorical hash, where each node learns on its local data and then we combine the output models. To combine the models, the feature slots need to line up (otherwise we have to keep/run all N featurizers). Altering the hashing seed reduces the reusability of data that flows through the hashing transforms, while we may still want to alter the PRNG seed independently. We should consider renaming Use of seed in machinelearning/src/Microsoft.ML.Transforms/Text/NgramHashingTransformer.cs Lines 455 to 461 in 4ec2f9c
Use of machinelearning/src/Microsoft.ML.Core/Utilities/Hashing.cs Lines 254 to 269 in b861b5d
Hashing methods have a default "seed" value of machinelearning/src/Microsoft.ML.Transforms/OneHotHashEncoding.cs Lines 251 to 258 in 4ec2f9c
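To make the distinction concrete, here is a minimal language-agnostic sketch in Python, not ML.NET code. It assumes a hypothetical `feature_slot` helper built on `zlib.crc32`, an arbitrary `HASH_SEED` value, and a hypothetical `featurize_on_node` function standing in for one cluster node. The PRNG seed differs per node while the hashing init value stays fixed, so the feature slots line up.

```python
import random
import zlib

NUM_SLOTS = 16
HASH_SEED = 12345  # arbitrary illustrative init value, NOT ML.NET's default


def feature_slot(value: str, hash_seed: int = HASH_SEED,
                 num_slots: int = NUM_SLOTS) -> int:
    """Map a categorical value to a slot with a seeded, run-stable hash."""
    # crc32 accepts a starting value, which plays the role of the hashing seed.
    return zlib.crc32(value.encode("utf-8"), hash_seed) % num_slots


def featurize_on_node(values, prng_seed: int) -> dict:
    """Simulate one node: the PRNG seed varies, the hashing seed does not."""
    rng = random.Random(prng_seed)
    shuffled = list(values)
    rng.shuffle(shuffled)  # PRNG-driven step, free to differ between nodes
    return {v: feature_slot(v) for v in shuffled}


colors = ["red", "green", "blue"]
node_a = featurize_on_node(colors, prng_seed=1)
node_b = featurize_on_node(colors, prng_seed=2)
slots_line_up = node_a == node_b
```

If `feature_slot` instead took its seed from the per-node PRNG seed, the two nodes could assign the same value to different slots, and their models could no longer be combined slot by slot.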
|
@justinormont I'm not suggesting we change defaults of Or do you think it should always be a deterministic split if no seed specified in |
I think the precedence for PRNG seeds should be:
|
Looks like there are additional comments that need to be addressed. |
One of the challenges of the RNG used in ML.NET is it extends
|
Forking is done because Random is not thread safe. In any case, the current mechanism should guarantee deterministic behavior because the seed generated from the parent RNG should be the same each time. |
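The forking argument can be sketched as follows. This is a Python illustration that uses `random.Random` as a stand-in for the .NET generator; the `fork` and `child_streams` helpers are hypothetical names, not ML.NET functions. The key assumption is that children are forked from the parent on one thread, in a fixed order, before being handed to workers.

```python
import random


def fork(parent: random.Random) -> random.Random:
    """Create a child generator seeded by a value drawn from the parent."""
    return random.Random(parent.getrandbits(32))


def child_streams(root_seed: int, workers: int = 3, draws: int = 4):
    """Fork one child per worker and return the numbers each child produces."""
    parent = random.Random(root_seed)
    # Fork sequentially so the parent is never shared between threads.
    children = [fork(parent) for _ in range(workers)]
    return [[child.randint(0, 999) for _ in range(draws)] for child in children]


run_1 = child_streams(root_seed=7)
run_2 = child_streams(root_seed=7)
same_each_time = run_1 == run_2
```

Because each child seed is drawn from the parent in the same order on every run, the child streams repeat exactly as long as the root seed is the same. The guarantee would no longer hold if the forks themselves happened concurrently in a nondeterministic order.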
In theory, the seed set in MLContext is intended to provide the global seed for all components and operations requiring randomness, e.g. sampling, permutation, etc. In practice, this doesn't always hold true. TrainTestSplit, CrossValidationSplit, and CrossValidate all have a user specified seed and call EnsureGroupPreservationColumn, which in turn uses GenerateNumberTransform and HashingEstimator.

When the seed is not specified by the user, it is not derived from MLContext. Instead, GenerateNumberTransform and HashingEstimator use their own defaults, so that if a user doesn't specify a seed to TrainTestSplit, CrossValidationSplit, or CrossValidate, they will always get a deterministic split regardless of the seed in MLContext.

machinelearning/src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs Lines 496 to 505 in 24c8274
machinelearning/src/Microsoft.ML.Data/DataLoadSave/DataOperationsCatalog.cs Lines 521 to 525 in 24c8274
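The behavior described above can be modeled with a small Python sketch. Everything here is hypothetical: `Context`, `split_current`, `split_propagating`, and `COMPONENT_DEFAULT_SEED` are illustrative stand-ins rather than ML.NET types or values, and the fallback order in `split_propagating` is only one possible way to propagate the seed, not necessarily the one the maintainers would choose.

```python
import random
from typing import Optional

COMPONENT_DEFAULT_SEED = 0  # hypothetical stand-in for a component's own default


class Context:
    """Hypothetical stand-in for a context object that carries a global seed."""

    def __init__(self, seed: Optional[int] = None):
        self.seed = seed


def _split(rows, seed: int, test_fraction: float):
    """Assign each row to train or test using a PRNG built from `seed`."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (test if rng.random() < test_fraction else train).append(row)
    return train, test


def split_current(ctx: Context, rows, seed: Optional[int] = None,
                  test_fraction: float = 0.25):
    """Models the reported behavior: an unset seed ignores ctx.seed."""
    effective = seed if seed is not None else COMPONENT_DEFAULT_SEED
    return _split(rows, effective, test_fraction)


def split_propagating(ctx: Context, rows, seed: Optional[int] = None,
                      test_fraction: float = 0.25):
    """One possible alternative: fall back to ctx.seed before the default."""
    if seed is not None:
        effective = seed
    elif ctx.seed is not None:
        effective = ctx.seed
    else:
        effective = COMPONENT_DEFAULT_SEED
    return _split(rows, effective, test_fraction)


rows = list(range(100))
current_a = split_current(Context(seed=1), rows)
current_b = split_current(Context(seed=2), rows)
propagated_a = split_propagating(Context(seed=1), rows)
propagated_b = split_propagating(Context(seed=2), rows)
```

In this model, `current_a` and `current_b` are identical because the context seed never reaches the split, while `propagated_a` and `propagated_b` are each reproducible for their own context seed. An explicit `seed` argument takes priority in both variants.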
cc: @harishsk @justinormont