Skip to content

Properly normalize column names in Utils.GetSampleData() for duplicate cases #5280

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 3, 2020

Conversation

mstfbl
Copy link
Contributor

@mstfbl mstfbl commented Jul 3, 2020

Fix #5267

This PR fixes the bug where columns generated from inline data were normalized directly through Utils.Normalize(), which only fixes the naming of a given column name, but does not take into account duplicate column names that may exist in a dataset.

PR #5177 introduced a way to fix these duplicate column names by adding the differentiator suffix '_col_x' where 'x' represents the the dataset load order for a given column. In this PR I have separated this generation of distinct and unique column names from Utils.GenerateClassLabels() and made it into its own function to Utils.GenerateColumnNames(). This is so that this generation of distinct and unique column names can also be used in Utils.GenerateSampleData, which before this PR resulted in exceptions. Now, column names from inline data are properly normalized, and duplicate column names are handled.

This PR also adds a unit test to test the case of duplicate column names with Utils.GenerateSampleData.

@codecov
Copy link

codecov bot commented Jul 3, 2020

Codecov Report

Merging #5280 into master will decrease coverage by 0.00%.
The diff coverage is 79.48%.

@@            Coverage Diff             @@
##           master    #5280      +/-   ##
==========================================
- Coverage   73.68%   73.68%   -0.01%     
==========================================
  Files        1022     1022              
  Lines      190258   190320      +62     
  Branches    20468    20470       +2     
==========================================
+ Hits       140185   140230      +45     
- Misses      44541    44559      +18     
+ Partials     5532     5531       -1     
Flag Coverage Δ
#Debug 73.68% <79.48%> (-0.01%) ⬇️
#production 69.42% <100.00%> (+<0.01%) ⬆️
#test 87.62% <68.62%> (-0.03%) ⬇️
Impacted Files Coverage Δ
test/Microsoft.ML.CodeGenerator.Tests/UtilTest.cs 73.14% <68.62%> (-4.05%) ⬇️
src/Microsoft.ML.CodeGenerator/Utils.cs 61.32% <100.00%> (+2.11%) ⬆️
...rosoft.ML.AutoML/ColumnInference/TextFileSample.cs 59.60% <0.00%> (-2.65%) ⬇️
...soft.ML.Transforms/Text/WordEmbeddingsExtractor.cs 85.74% <0.00%> (-1.14%) ⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs 88.97% <0.00%> (-0.32%) ⬇️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs 85.78% <0.00%> (-0.16%) ⬇️
...StandardTrainers/Standard/LinearModelParameters.cs 66.58% <0.00%> (+0.25%) ⬆️
src/Microsoft.ML.AutoML/Sweepers/Parameters.cs 85.16% <0.00%> (+0.84%) ⬆️
....ML.AutoML/PipelineSuggesters/PipelineSuggester.cs 86.55% <0.00%> (+6.72%) ⬆️

@mstfbl mstfbl marked this pull request as ready for review July 3, 2020 10:24
@mstfbl mstfbl requested a review from a team as a code owner July 3, 2020 10:24
@mstfbl mstfbl requested a review from LittleLittleCloud July 3, 2020 10:26
@mstfbl mstfbl merged commit 9390f3b into dotnet:master Jul 3, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Mar 18, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Property names generated in AutoML's GenerateSampleData() differ from those generated by GenerateClassLabels()
2 participants