StratificationColumn in CrossValidation and TrainTestSplit #2536

rogancarr · 2019-02-13T21:10:21Z

CrossValidation and TrainTestSplit have a parameter called StratificationColumn that is used to preserve groupings of columns across splits (as discussed in #2487). This isn't actually stratification, so we should rename the column.

This is a forked sub-issue from #2487

Related to #1204
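
To make the distinction concrete: keeping rows with the same column value together in one split is different from stratification, which would instead try to keep the proportions of some value similar on both sides. Below is a minimal sketch of the group-preserving behavior, assuming plain in-memory rows. `GroupPreservingSplitSketch` and its `Split` method are hypothetical names used only for illustration; they are not part of ML.NET and do not reflect how the library implements its splits.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class GroupPreservingSplitSketch
{
    // Hypothetical helper, not an ML.NET API: every row that shares a key
    // is sent to the same side of the split.
    public static (List<T> Train, List<T> Test) Split<T, TKey>(
        IEnumerable<T> rows, Func<T, TKey> keySelector, double testFraction, int seed)
    {
        var rng = new Random(seed);
        var train = new List<T>();
        var test = new List<T>();

        // Draw once per distinct key, then route the whole group together.
        foreach (var group in rows.GroupBy(keySelector))
        {
            var target = rng.NextDouble() < testFraction ? test : train;
            target.AddRange(group);
        }

        return (train, test);
    }

    public static void Main()
    {
        var rows = new[]
        {
            (User: "a", Label: 1), (User: "a", Label: 0),
            (User: "b", Label: 1), (User: "b", Label: 1),
            (User: "c", Label: 0), (User: "d", Label: 0),
        };

        var (train, test) = Split(rows, r => r.User, testFraction: 0.5, seed: 42);

        // No key can appear on both sides, whatever the random draw was.
        bool overlap = train.Select(r => r.User).Intersect(test.Select(r => r.User)).Any();
        Console.WriteLine($"train={train.Count}, test={test.Count}, overlap={overlap}");
    }
}
```

Because the draw happens per key rather than per row, the realized test fraction only approximates `testFraction`, and nothing here balances the label distribution, which is why "stratification" is a misleading name for this behavior.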

Ivanidzo4ka · 2019-02-13T21:21:42Z

Do we have any idea what the new name should be?

rogancarr · 2019-02-13T21:27:32Z

@Ivanidzo4ka good question! In the above, I've made a suggestion for "IdColumn".

Ivanidzo4ka · 2019-02-13T21:40:02Z

Sorry, I guess you mentioned it in the other issue; I don't see it here.
IdColumn feels blank and also doesn't reflect its purpose.
Maybe ConsistencyColumn or RetentionColumn?

rogancarr · 2019-02-13T22:01:50Z

How about RowGroupPreservationColumn? GroupPreservationColumn? PreservationColumn?

RowSetPreservationColumn? Super explicit, and doesn't use the word "group".

Ivanidzo4ka · 2019-02-13T22:43:11Z

Row Set Preservation Society. That would be a good name for my second album.
GroupPreservationColumn sounds best to me, but it would be nice to ask other people around.

justinormont · 2019-02-14T06:05:30Z

If I heard something was renamed to IdColumn, I would assume it was the Name column.

Is there another industry term for this? We can't be the first.

justinormont · 2019-02-14T07:00:24Z

Closest I see in scikit-learn is GroupShuffleSplit. Perhaps SplitGroup?

https://scikit-learn.org/stable/modules/cross_validation.html#group-shuffle-split

Another route is to rename the Group column to RankingGroup, which then frees up Stratification to move to Group (which seems to be the industry term).

justinormont · 2019-02-14T18:52:05Z

Speaking of renaming, @Dmitry-A was saying earlier today that Name may be better called RowID.

Ivanidzo4ka · 2019-02-14T19:02:39Z

public TrainTestData TrainTestSplit(IDataView data, double testFraction = 0.1, string stratificationColumn = null, uint? seed = null)

public CrossValidationResult<CalibratedBinaryClassificationMetrics>[] CrossValidate(IDataView data, IEstimator<ITransformer> estimator, int numFolds = 5, string labelColumn = DefaultColumnNames.Label, string stratificationColumn = null, uint? seed = null)

@justinormont which Name are you talking about?

justinormont · 2019-02-14T20:39:38Z

The column purpose of Name, which allows a user to identify the row of data. It's mainly used for debugging as it's printed to the .inst.txt file. It lets you match the input data row to the output score.

I'm unsure whether we have brought the concept to ML.NET.

Ivanidzo4ka · 2019-02-14T20:45:56Z

Ah, that Name. Do we even expose it anywhere in ML.NET? It's probably part of some commands, but I don't think we do anything with commands right now, since they are all hidden.

justinormont · 2019-02-14T20:50:06Z

I see it listed here:

machinelearning/src/Microsoft.ML.Data/Commands/DefaultColumnNames.cs

Line 12 in 0c62e30

public const string Name = "Name";

No idea if we utilize the concept though.

rogancarr · 2019-02-14T20:54:33Z

Let's keep this discussion on potential names for StratificationColumn. Any other naming issues, please open a separate issue. (Sorry to be strict, but I need to drive this to conclusion.)

rogancarr · 2019-02-14T20:56:38Z

So far we have:

IdColumn: Too vague
Group: Group and relatives feel too rank-y to some folks, but it is industry-standard language.
RowGroupPreservationColumn
GroupPreservationColumn
RowSetPreservationColumn
ConsistencyColumn
RetentionColumn

@TomFinley @shauheen @glebuk @yaeldekel Any thoughts?

rogancarr · 2019-02-15T01:03:44Z

I renamed it to GroupPreservationColumn in #2537.

TomFinley · 2019-02-15T16:50:07Z

By itself not an acceptable name. If you somehow clarified the "group" column to mean something else. @justinormont's suggestion of RankingGroup is not my favorite, since we use this in contexts other than ranking (albeit lower priority ones that haven't yet been migrated to the open source codebase).

Anyway, sklearn gets away with it there because it's very, very clear in context what "group" it's talking about since you're calling GroupShuffleSplit. If you were to just identify something divorced from that context and just call it a "group," then by itself is it clear what it's talking about? Not at all.

This is the problem: what type of "group" is considered relevant is very context dependent. If you can make a case that "group" is used in other contexts to refer to this specifically, I could potentially change my mind. But as far as I can see, the case depends on a 5-character substring of a method from Python, taken completely out of the context that made it clear what type of group you were talking about.

Maybe RowGroup column for what we now call a Group column, and SplitGroup or SplittingGroup column for what we call stratification. If we don't have the stomach to rename the "group" column at this time, which I could understand, maybe just call it SplitColumn. That suggests clearly enough to me that this has something to do with when a dataset is split, and I think we can easily explain it.

justinormont · 2019-02-15T17:55:46Z

I like @TomFinley's naming suggestions:

Group => RowGroup
Stratification => SplitGroup (or SplittingGroup/SplitColumn)

Ivanidzo4ka · 2019-02-26T00:31:32Z

machinelearning/src/Microsoft.ML.Data/TrainCatalog.cs

Line 208 in 3b9d407

samplingKeyColumn = data.Schema.GetTempColumnName("IdPreservationColumn");

machinelearning/src/Microsoft.ML.Data/TrainCatalog.cs

Line 214 in 3b9d407

throw Environment.ExceptSchemaMismatch(nameof(samplingKeyColumn), "GroupPreservationColumn", samplingKeyColumn);

It would be nice to make those names consistent as well.
At least the last one.

rogancarr self-assigned this Feb 13, 2019

rogancarr added the API Issues pertaining the friendly API label Feb 13, 2019

rogancarr mentioned this issue Feb 13, 2019

Rename CV and TrainTest "stratification" parameter #2537

Merged

rogancarr closed this as completed in #2537 Feb 15, 2019

Ivanidzo4ka mentioned this issue Feb 20, 2019

Rename GroupId column to RowGroup #2660

Closed

rogancarr mentioned this issue Feb 21, 2019

Cross-Validation API for v1.0 #2487

Open

abgoswam mentioned this issue Feb 24, 2019

Fixing parameters in ML.NET Public API #2665

Merged

Ivanidzo4ka reopened this Feb 26, 2019

Ivanidzo4ka unassigned rogancarr Mar 4, 2019

najeeb-kazmi self-assigned this Mar 4, 2019

najeeb-kazmi mentioned this issue Mar 4, 2019

Remnants from renaming of StratificationColumn #2839

Merged

najeeb-kazmi closed this as completed in #2839 Mar 4, 2019

Ivanidzo4ka mentioned this issue Mar 18, 2019

Cleaning TrainCatalog and RecommenderCatalog #2973

Merged

CESARDELATORRE mentioned this issue Aug 7, 2019

Support stratify in TrainTestSplit() API #4082

Open

antoniovs1029 mentioned this issue Jun 27, 2020

Combined methods related to splitting data into one single method. Also fixed related issues. #5227

Merged

ghost locked as resolved and limited conversation to collaborators Mar 24, 2022
