-
Notifications
You must be signed in to change notification settings - Fork 1.9k
TimeSeriesImputer featurizer added #4623
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
||
// This transformer adds columns | ||
[Argument(ArgumentType.MultipleUnique, HelpText = "Columns to filter", Name = "FilterColumns", ShortName = "filters", SortOrder = 2)] | ||
public string[] FilterColumns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use case for filtering? Are more columns produced besides IsImputed
? Does the filtering operate only on the newly produced columns, or can I remove any pre-existing column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a column is filtered, whenever a row is imputed the default
value for that row is used to fill in the value. Using the FilterMode
, you can either use this list as an include or exclude list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a column is filtered, whenever a row is imputed the default value for that row is used to fill in the value.
I'm not sure I'm parsing this correctly, and I think I'm missing something obvious. If a column is filtered, why do we fill in a value and then drop the column? Are there other columns produced besides IsImputed
? What other columns would a user use the filtering to keep/remove besides IsImpted
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We aren't dropping any columns in this process. If a column is excluded from being imputed, the default value is provided when a new row is generated. The biggest use case for this is dealing with columns that aren't supported for imputation. You can exclude that column from being imputed, but still have the default value filled in automatically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may want to rename it from FilterColumns
to perhaps ColumnsToImpute
or ColumnsToFillDefault
.
The naming suggestion is to disambiguate from the existing concept in ML.NET
of filtering, which removes rows of data from the IDataView (info).
Would another way of handling the non-supported column types be to automatically fill w/ default when needed? Seems we could auto-handle this use case. Assuming there aren't other use cases, this would remove the need of the FilterColumns
field.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think renaming is a good idea. Its currently setup to where you can either list the columns to fill with default or columns to impute based on the FilterMode
paramenter. So the name needs to somehow show that it can do both based on that parameter.
We could automatically fill in default values when needed. I didn't do that as this could potentially mask errors if you wanted to convert something prior and forgot. If you don't think thats a big deal though and think it would be better then I am not opposed to switching to that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Auto-filling w/ default for unsupported types is likely good from the data science side. Then we note in the docs "TimeSeriesImputer supports these data types (...) and will fill with the default value for columns of unsupported data types".
@EricWrightAtWork -- since you know more about the needs of time-series forecasting, any thoughts? Is auto-filling unsupported datatypes with their default values reasonable?
Codecov Report
@@ Coverage Diff @@
## master #4623 +/- ##
=========================================
Coverage ? 75.84%
=========================================
Files ? 947
Lines ? 172170
Branches ? 18576
=========================================
Hits ? 130584
Misses ? 36413
Partials ? 5173
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the additional tests before merging.
c1eac1e
to
7e4eb9c
Compare
This change adds in the TimeSeriesImputer into the new TimeSeriesImputer project. It is the final of a series of PR's that will go in. The TimeSeriesImputer is implemented in native code, so this is mostly just a wrapper around that with the appropriate entrypoints for NimbusML as well.
The TimeSeriesImputer imputes rows and columns based on the time series given. This is the first transformer that imputes rows in ML.NET.