Properly normalize column names in Utils.GetSampleData() for duplicate cases #5280

mstfbl · 2020-07-03T08:31:06Z

This PR fixes the bug where columns generated from inline data were normalized directly through Utils.Normalize(), which only fixes the naming of a given column name, but does not take into account duplicate column names that may exist in a dataset.

PR #5177 introduced a way to fix these duplicate column names by adding the differentiator suffix '_col_x' where 'x' represents the the dataset load order for a given column. In this PR I have separated this generation of distinct and unique column names from Utils.GenerateClassLabels() and made it into its own function to Utils.GenerateColumnNames(). This is so that this generation of distinct and unique column names can also be used in Utils.GenerateSampleData, which before this PR resulted in exceptions. Now, column names from inline data are properly normalized, and duplicate column names are handled.

This PR also adds a unit test to test the case of duplicate column names with Utils.GenerateSampleData.

codecov · 2020-07-03T09:40:26Z

Codecov Report

Merging #5280 into master will decrease coverage by 0.00%.
The diff coverage is 79.48%.

@@            Coverage Diff             @@
##           master    #5280      +/-   ##
==========================================
- Coverage   73.68%   73.68%   -0.01%     
==========================================
  Files        1022     1022              
  Lines      190258   190320      +62     
  Branches    20468    20470       +2     
==========================================
+ Hits       140185   140230      +45     
- Misses      44541    44559      +18     
+ Partials     5532     5531       -1

Flag	Coverage Δ
#Debug	`73.68% <79.48%> (-0.01%)`	⬇️
#production	`69.42% <100.00%> (+<0.01%)`	⬆️
#test	`87.62% <68.62%> (-0.03%)`	⬇️

Impacted Files	Coverage Δ
test/Microsoft.ML.CodeGenerator.Tests/UtilTest.cs	`73.14% <68.62%> (-4.05%)`	⬇️
src/Microsoft.ML.CodeGenerator/Utils.cs	`61.32% <100.00%> (+2.11%)`	⬆️
...rosoft.ML.AutoML/ColumnInference/TextFileSample.cs	`59.60% <0.00%> (-2.65%)`	⬇️
...soft.ML.Transforms/Text/WordEmbeddingsExtractor.cs	`85.74% <0.00%> (-1.14%)`	⬇️
...soft.ML.Data/DataLoadSave/Text/TextLoaderCursor.cs	`88.97% <0.00%> (-0.32%)`	⬇️
...ML.Transforms/Text/StopWordsRemovingTransformer.cs	`85.78% <0.00%> (-0.16%)`	⬇️
...StandardTrainers/Standard/LinearModelParameters.cs	`66.58% <0.00%> (+0.25%)`	⬆️
src/Microsoft.ML.AutoML/Sweepers/Parameters.cs	`85.16% <0.00%> (+0.84%)`	⬆️
....ML.AutoML/PipelineSuggesters/PipelineSuggester.cs	`86.55% <0.00%> (+6.72%)`	⬆️

Properly normalize column names in GetSampleData(), added test

3dc6a7a

mstfbl marked this pull request as ready for review July 3, 2020 10:24

mstfbl requested a review from a team as a code owner July 3, 2020 10:24

mstfbl requested a review from LittleLittleCloud July 3, 2020 10:26

LittleLittleCloud approved these changes Jul 3, 2020

View reviewed changes

mstfbl merged commit 9390f3b into dotnet:master Jul 3, 2020

ghost locked as resolved and limited conversation to collaborators Mar 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Properly normalize column names in Utils.GetSampleData() for duplicate cases #5280

Properly normalize column names in Utils.GetSampleData() for duplicate cases #5280

mstfbl commented Jul 3, 2020

codecov bot commented Jul 3, 2020 •

edited

Loading

Properly normalize column names in Utils.GetSampleData() for duplicate cases #5280

Properly normalize column names in Utils.GetSampleData() for duplicate cases #5280

Conversation

mstfbl commented Jul 3, 2020

codecov bot commented Jul 3, 2020 • edited Loading

Codecov Report

codecov bot commented Jul 3, 2020 •

edited

Loading