Skip to content

TimeSeriesImputer featurizer added #4623

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jan 14, 2020

Conversation

michaelgsharp
Copy link
Member

This change adds in the TimeSeriesImputer into the new TimeSeriesImputer project. It is the final of a series of PR's that will go in. The TimeSeriesImputer is implemented in native code, so this is mostly just a wrapper around that with the appropriate entrypoints for NimbusML as well.

The TimeSeriesImputer imputes rows and columns based on the time series given. This is the first transformer that imputes rows in ML.NET.

@michaelgsharp michaelgsharp requested a review from a team as a code owner January 3, 2020 19:13
@michaelgsharp michaelgsharp self-assigned this Jan 3, 2020

// This transformer adds columns
[Argument(ArgumentType.MultipleUnique, HelpText = "Columns to filter", Name = "FilterColumns", ShortName = "filters", SortOrder = 2)]
public string[] FilterColumns;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for filtering? Are more columns produced besides IsImputed? Does the filtering operate only on the newly produced columns, or can I remove any pre-existing column?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a column is filtered, whenever a row is imputed the default value for that row is used to fill in the value. Using the FilterMode, you can either use this list as an include or exclude list.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a column is filtered, whenever a row is imputed the default value for that row is used to fill in the value.

I'm not sure I'm parsing this correctly, and I think I'm missing something obvious. If a column is filtered, why do we fill in a value and then drop the column? Are there other columns produced besides IsImputed? What other columns would a user use the filtering to keep/remove besides IsImpted?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We aren't dropping any columns in this process. If a column is excluded from being imputed, the default value is provided when a new row is generated. The biggest use case for this is dealing with columns that aren't supported for imputation. You can exclude that column from being imputed, but still have the default value filled in automatically.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to rename it from FilterColumns to perhaps ColumnsToImpute or ColumnsToFillDefault.

The naming suggestion is to disambiguate from the existing concept in ML.NET of filtering, which removes rows of data from the IDataView (info).

Would another way of handling the non-supported column types be to automatically fill w/ default when needed? Seems we could auto-handle this use case. Assuming there aren't other use cases, this would remove the need of the FilterColumns field.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think renaming is a good idea. Its currently setup to where you can either list the columns to fill with default or columns to impute based on the FilterMode paramenter. So the name needs to somehow show that it can do both based on that parameter.

We could automatically fill in default values when needed. I didn't do that as this could potentially mask errors if you wanted to convert something prior and forgot. If you don't think thats a big deal though and think it would be better then I am not opposed to switching to that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Auto-filling w/ default for unsupported types is likely good from the data science side. Then we note in the docs "TimeSeriesImputer supports these data types (...) and will fill with the default value for columns of unsupported data types".

@EricWrightAtWork -- since you know more about the needs of time-series forecasting, any thoughts? Is auto-filling unsupported datatypes with their default values reasonable?

@codecov
Copy link

codecov bot commented Jan 9, 2020

Codecov Report

❗ No coverage uploaded for pull request base (master@712c3ec). Click here to learn what that means.
The diff coverage is 84.91%.

@@            Coverage Diff            @@
##             master    #4623   +/-   ##
=========================================
  Coverage          ?   75.84%           
=========================================
  Files             ?      947           
  Lines             ?   172170           
  Branches          ?    18576           
=========================================
  Hits              ?   130584           
  Misses            ?    36413           
  Partials          ?     5173
Flag Coverage Δ
#Debug 75.84% <84.91%> (?)
#production 71.43% <79.35%> (?)
#test 90.7% <100%> (?)
Impacted Files Coverage Δ
...ft.ML.Tests/Transformers/TimeSeriesImputerTests.cs 100% <100%> (ø)
...rosoft.ML.Featurizers/TimeSeriesImputerDataView.cs 74.03% <74.03%> (ø)
src/Microsoft.ML.Featurizers/TimeSeriesImputer.cs 87.08% <87.08%> (ø)

Copy link
Contributor

@harishsk harishsk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add the additional tests before merging.

@codemzs codemzs merged commit c4e4263 into dotnet:master Jan 14, 2020
@michaelgsharp michaelgsharp deleted the time-series-imputer branch February 24, 2020 21:02
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants