Skip to content

StratificationColumn in CrossValidation and TrainTestSplit #2536

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
rogancarr opened this issue Feb 13, 2019 · 18 comments
Closed

StratificationColumn in CrossValidation and TrainTestSplit #2536

rogancarr opened this issue Feb 13, 2019 · 18 comments
Assignees
Labels
API Issues pertaining the friendly API

Comments

@rogancarr
Copy link
Contributor

CrossValidation and TrainTestSplit have a parameter called StratificationColumn that is used to preserve groupings of columns across splits (as discussed in #2487). This isn't actually stratification, so we should rename the column.

This is a forked sub-issue from #2487

Related to #1204

@rogancarr rogancarr self-assigned this Feb 13, 2019
@rogancarr rogancarr added the API Issues pertaining the friendly API label Feb 13, 2019
@Ivanidzo4ka
Copy link
Contributor

Do we have any idea what should be new name?

@rogancarr
Copy link
Contributor Author

@Ivanidzo4ka good question! In the above, I've made a suggestion for "IdColumn".

@Ivanidzo4ka
Copy link
Contributor

Sorry, I guess you mention it in other issue, don't see it here.
IdColumn feels blank and also doesn't reflects purpose of it.
maybe ConsistencyColumn or RetentionColumn

@rogancarr
Copy link
Contributor Author

rogancarr commented Feb 13, 2019

How about RowGroupPreservationColumn? GroupPreservationColumn? PreservationColumn?

RowSetPreservationColumn? Super explicit, and doesn't use the word "group".

@Ivanidzo4ka
Copy link
Contributor

Row Set Preservation Society. That would be good name for my second album.
GroupPreservationColumn sound best for me, but would be nice to ask other people around

@justinormont
Copy link
Contributor

If I heard something was renamed to IdColumn, I would assume it was the Name column.

Is there another industry term for this? We can't be the first.

@justinormont
Copy link
Contributor

Closest I see in scikit-learn is GroupShuffleSplit. Perhaps SplitGroup?

https://scikit-learn.org/stable/modules/cross_validation.html#group-shuffle-split
image

Another route is to rename the Group column to RankingGroup, which then frees up Stratification to move to Group (which seems to be the industry term).

@justinormont
Copy link
Contributor

Speaking of renaming. @Dmitry-A was saying earlier today that Name may be better called RowID

@Ivanidzo4ka
Copy link
Contributor

public TrainTestData TrainTestSplit(IDataView data, double testFraction = 0.1, string stratificationColumn = null, uint? seed = null)
public CrossValidationResult<CalibratedBinaryClassificationMetrics>[] CrossValidate( IDataView data, IEstimator<ITransformer> estimator, int numFolds = 5, string labelColumn = DefaultColumnNames.Label,string stratificationColumn = null, uint? seed = null)

@justinormont what Name are you talking about?

@justinormont
Copy link
Contributor

justinormont commented Feb 14, 2019

The column purpose of Name, which allows a user to identify the row of data. It's mainly used for debugging as it's printed to the .inst.txt file. It lets you match the input data row to the output score.

I'm unsure we have brought the concept to ML.NET.

@Ivanidzo4ka
Copy link
Contributor

Ah, that Name. Do we even expose it anywhere in Ml.Net? It's probably part of some commands, but I don't think we do anything with commands right now, since they all hidden

@justinormont
Copy link
Contributor

I see it listed here:

No idea if we utilize the concept though.

@rogancarr
Copy link
Contributor Author

Let's keep this discussion on potential names for StratificationColumn. Any other naming issues, please open a separate issue. (Sorry to be strict, but I need to drive this to conclusion.)

@rogancarr
Copy link
Contributor Author

rogancarr commented Feb 14, 2019

So far we have

IdColumn: Too vague
Group: Group and relatives feels to rank-y to some folks, but is industry standard language.
RowGroupPreservationColumn
GroupPreservationColumn

RowSetPreservationColumn
ConsistencyColumn
RetentionColumn

@TomFinley @shauheen @glebuk @yaeldekel Any thoughts?

@rogancarr
Copy link
Contributor Author

I renamed it to GroupPreservationColumn in : #2537

@TomFinley
Copy link
Contributor

By itself not an acceptable name. If you somehow clarified the "group" column to mean something else. @justinormont 's suggestion of RankingGroup is not my favorite since we use this in other contexts other than ranking (albeit lower priority ones that haven't yet been migrated to the open source codebase).

Anyway, sklearn gets away with it there because it's very, very clear in context what "group" it's talking about since you're calling GroupShuffleSplit. If you were to just identify something divorced from that context and just call it a "group," then by itself is it clear what it's talking about? Not at all.

This is the problem, is that what type of "group" is considered relevant are vert context dependent. If you can make a case that "group" is used in other contexts to refer to this specifically, I could change my mind potentially. But as far as I see the case depends on a 5 character substring of a method from Python taken compeltely out of the context that made it clear what type of group you were talking about.

Maybe RowGroup column for what we now call a Group column, and SplitGroup or SplittingGroup column for what we call stratification. If we don't have to the stomach to rename "group" column at this time, which I could understand, maybe just call it SplitColumn. That suggests clearly enough to me that this has something to do with when a dataset is split, and I think we can easily explain it.

@justinormont
Copy link
Contributor

I like @TomFinley naming suggestions:

  • Group => RowGroup
  • Stratification => SplitGroup (or SplittingGroup/SplitColumn)

@Ivanidzo4ka
Copy link
Contributor

samplingKeyColumn = data.Schema.GetTempColumnName("IdPreservationColumn");

throw Environment.ExceptSchemaMismatch(nameof(samplingKeyColumn), "GroupPreservationColumn", samplingKeyColumn);

Would be nice to make that names consistent as well.
At least last one.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
API Issues pertaining the friendly API
Projects
None yet
Development

No branches or pull requests

5 participants