TimeSeriesImputer featurizer added #4623

michaelgsharp · 2020-01-03T19:13:07Z

This change adds the TimeSeriesImputer to the new TimeSeriesImputer project. It is the last in a series of PRs. The TimeSeriesImputer is implemented in native code, so this is mostly a wrapper around that, with the appropriate entry points for NimbusML as well.

The TimeSeriesImputer imputes rows and columns based on the time series given. This is the first transformer that imputes rows in ML.NET.

justinormont · 2020-01-03T20:30:06Z

src/Microsoft.ML.Featurizers/TimeSeriesImputer.cs

+
+            // This transformer adds columns
+            [Argument(ArgumentType.MultipleUnique, HelpText = "Columns to filter", Name = "FilterColumns", ShortName = "filters", SortOrder = 2)]
+            public string[] FilterColumns;


What's the use case for filtering? Are more columns produced besides IsImputed? Does the filtering operate only on the newly produced columns, or can I remove any pre-existing column?

If a column is filtered, whenever a row is imputed the default value for that row is used to fill in the value. Using the FilterMode, you can either use this list as an include or exclude list.

> If a column is filtered, whenever a row is imputed the default value for that row is used to fill in the value.

I'm not sure I'm parsing this correctly, and I think I'm missing something obvious. If a column is filtered, why do we fill in a value and then drop the column? Are there other columns produced besides IsImputed? What other columns would a user use the filtering to keep/remove besides IsImputed?

We aren't dropping any columns in this process. If a column is excluded from being imputed, the default value is provided when a new row is generated. The biggest use case for this is dealing with columns that aren't supported for imputation. You can exclude that column from being imputed, but still have the default value filled in automatically.
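To make the behavior described above concrete, here is a minimal standalone sketch in Python. It is not the ML.NET or native implementation: the function name, the integer time steps, and the forward-fill strategy for imputed columns are all assumptions made for illustration. The only parts taken from this thread are that new rows are generated, that excluded columns receive their default value in those rows, and that an IsImputed column marks them.

```python
def impute_missing_rows(rows, time_col, excluded, defaults, step=1):
    """Insert a row for every missing time step between consecutive rows.

    Hypothetical helper: columns listed in `excluded` get their default
    value in generated rows; all other columns are forward-filled, which
    is an assumed strategy and may differ from the real imputer.
    """
    result = []
    previous = None
    for row in rows:
        if previous is not None:
            t = previous[time_col] + step
            while t < row[time_col]:
                new_row = {}
                for name, value in previous.items():
                    if name == time_col:
                        new_row[name] = t
                    elif name in excluded:
                        new_row[name] = defaults[name]  # default, not imputed
                    else:
                        new_row[name] = value  # forward fill (assumed)
                new_row["IsImputed"] = True
                result.append(new_row)
                t += step
        current = dict(row)
        current["IsImputed"] = False
        result.append(current)
        previous = row
    return result


rows = [
    {"t": 1, "sales": 10.0, "note": "a"},
    {"t": 3, "sales": 12.0, "note": "b"},
]
filled = impute_missing_rows(rows, "t", excluded={"note"}, defaults={"note": ""})
```

In this sketch no column is dropped: the generated row for `t == 2` keeps the `note` column, only with its default value.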

We may want to rename it from FilterColumns to perhaps ColumnsToImpute or ColumnsToFillDefault.

The naming suggestion is to disambiguate from the existing concept in ML.NET of filtering, which removes rows of data from the IDataView.

Would another way of handling the non-supported column types be to automatically fill w/ default when needed? Seems we could auto-handle this use case. Assuming there aren't other use cases, this would remove the need of the FilterColumns field.

I think renaming is a good idea. It's currently set up so that you can list either the columns to fill with the default or the columns to impute, depending on the FilterMode parameter. So the name needs to somehow show that it can do both based on that parameter.
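The include/exclude behavior described above could be sketched as follows. This is a standalone Python illustration, not the ML.NET API: the function name and the `"include"` / `"exclude"` mode strings are hypothetical stand-ins for the FilterMode values.

```python
def columns_to_impute(all_columns, filter_columns, filter_mode):
    """Return the columns that are imputed; the rest get default values.

    Hypothetical helper: `filter_mode` is "include" (only the listed
    columns are imputed) or "exclude" (the listed columns are filled with
    defaults and everything else is imputed).
    """
    listed = set(filter_columns)
    if filter_mode == "include":
        return [c for c in all_columns if c in listed]
    if filter_mode == "exclude":
        return [c for c in all_columns if c not in listed]
    raise ValueError(f"unknown filter mode: {filter_mode!r}")


columns = ["sales", "price", "note"]
included = columns_to_impute(columns, ["sales"], "include")
excluded = columns_to_impute(columns, ["note"], "exclude")
```

The same list therefore means opposite things depending on the mode, which is why a neutral name was suggested.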

We could automatically fill in default values when needed. I didn't do that because it could mask errors if you meant to convert a column beforehand and forgot. If you don't think that's a big deal and think it would be better, then I am not opposed to switching to that.

Auto-filling w/ default for unsupported types is likely good from the data science side. Then we note in the docs "TimeSeriesImputer supports these data types (...) and will fill with the default value for columns of unsupported data types".

@EricWrightAtWork -- since you know more about the needs of time-series forecasting, any thoughts? Is auto-filling unsupported datatypes with their default values reasonable?
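As a rough sketch of the auto-fill proposal, the split between imputed and default-filled columns could be derived from column types instead of a user-supplied list. This is a Python illustration only: the function name is hypothetical, and the set of supported types is passed in as a parameter because this thread does not say which types the imputer supports.

```python
def split_by_support(column_types, supported_types):
    """Partition columns into (imputed, default_filled) by data type.

    Hypothetical helper: `column_types` maps column name to a type name,
    and `supported_types` is whatever set of type names the imputer
    accepts (not specified in this thread).
    """
    imputed, default_filled = [], []
    for name, type_name in column_types.items():
        if type_name in supported_types:
            imputed.append(name)
        else:
            default_filled.append(name)
    return imputed, default_filled


# Placeholder type names, chosen only for the example.
schema = {"sales": "float", "store": "string", "tags": "vector"}
imputed, default_filled = split_by_support(schema, {"float", "string"})
```

The trade-off raised earlier still applies: a column that should have been converted first would silently land in the default-filled group.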

codecov · 2020-01-09T00:35:45Z

Codecov Report

❗ No coverage uploaded for pull request base (master@712c3ec).
The diff coverage is 84.91%.

@@            Coverage Diff            @@
##             master    #4623   +/-   ##
=========================================
  Coverage          ?   75.84%           
=========================================
  Files             ?      947           
  Lines             ?   172170           
  Branches          ?    18576           
=========================================
  Hits              ?   130584           
  Misses            ?    36413           
  Partials          ?     5173

Flag	Coverage Δ
#Debug	`75.84% <84.91%> (?)`
#production	`71.43% <79.35%> (?)`
#test	`90.7% <100%> (?)`

Impacted Files	Coverage Δ
...ft.ML.Tests/Transformers/TimeSeriesImputerTests.cs	`100% <100%> (ø)`
...rosoft.ML.Featurizers/TimeSeriesImputerDataView.cs	`74.03% <74.03%> (ø)`
src/Microsoft.ML.Featurizers/TimeSeriesImputer.cs	`87.08% <87.08%> (ø)`

src/Microsoft.ML.Featurizers/TimeSeriesImputerDataView.cs

harishsk

Please add the additional tests before merging.

michaelgsharp requested a review from a team as a code owner January 3, 2020 19:13

michaelgsharp self-assigned this Jan 3, 2020

justinormont reviewed Jan 3, 2020