-
Notifications
You must be signed in to change notification settings - Fork 1.9k
DateTimeTransformer featurizer #4521
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DateTimeTransformer featurizer #4521
Conversation
Year = 1, Month, Day, Hour, Minute, Second, AmPm, Hour12, DayOfWeek, DayOfQuarter, DayOfYear, | ||
WeekOfMonth, QuarterOfYear, HalfOfYear, WeekIso, YearIso, MonthLabel, AmPmLabel, DayOfWeekLabel, | ||
HolidayName, IsPaidTimeOff | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recommend explicitly numbering each enum name. This lets us insert new values later without affecting the enum numeric value.
This is more important if the enum is serialized within the model, which I expect you're doing since the model needs to know which features to produce. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. They are now explicitly numbered. The Countries
enum (now HolidayList
) is also numbered now.
In reply to: 353910235 [](ancestors = 353910235)
/// <param name="catalog">Transform catalog</param> | ||
/// <param name="inputColumnName">Input column name</param> | ||
/// <param name="columnPrefix">Prefix to add to the generated columns</param> | ||
/// <param name="columnsToDrop">List of columns to drop, if any</param> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One drawback to using columnsToDrop
is that if a user is only interested in the DayOfYear
and DayOfWeekLabel
, they need to explicitly exclude all the rest. And then if we ever add a new output column in a future release, they would start getting it, and would need to add the new column to the "to drop" list. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might recommend a bitwise use of an enum (flags) to say which columns to include.
Default would be DateTimeTransformerEstimator.ColumnsProduced.All = 0xFFFFFFFFFFFFFFFF
(uint64).
Then users can bitwise or
/and-not
to select the columns they want to keep/drop.
Example usage
All output except DayOfYear
: (subtractive column selection)
var pipeline = mlContext.Transforms.DateTimeTransformer("Date", "DTC",
DateTimeTransformerEstimator.ColumnsProduced.All & ~DateTimeTransformerEstimator.ColumnsProduced.DayOfYear)
Only DayOfYear
and DayOfWeekLabel
: (additive column selection)
var pipeline = mlContext.Transforms.DateTimeTransformer("Date", "DTC",
DateTimeTransformerEstimator.ColumnsProduced.DayOfYear | DateTimeTransformerEstimator.ColumnsProduced.DayOfWeekLabel)
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apparently, there's a [Flags]
attribute in C# -- https://docs.microsoft.com/en-us/dotnet/api/system.flagsattribute?view=netcore-3.0
[Flags] public enum ColumnsProduced : uint64
{
Year = 1 << 0,
Month = 1 << 1,
Day = 1 << 2,
Hour = 1 << 3,
WeekOfMonth = 1 << 4,
...
}
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have a way you think would be better? We had also thought about not even including that ability here and just letting the user use the DropColumns
transformer to remove any columns they didn't want. That would keep this code cleaner and not duplicate that type of functionality. Thoughts on that?
In reply to: 353911746 [](ancestors = 353911746)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont think that doing flags is a great idea personally. I think enums are more clear (if more verbose) than flags usually. As I mentioned to Eric in this same thread though, we can remove this capability from here and just let the user use the DropColumn
transformer to remove anything they dont want. What is your opinion on that?
In reply to: 353968394 [](ancestors = 353968394)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect this featurizer isn't aware of which columns are being requested of it
Maybe it should be made aware? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the usability of the flag method.
There is a user need of not including all outputs as overfitting occurs, mainly on the larger time scales, like year number. Including the year in a classification or regression model often leads to the model memorizing the label value specifically for a date range.
For instance the model will memorize that 2017-03-01 to 2018-01-05, the label value was high. This information is not valuable to memorize when predicting future data. Instead we want to ensure it memorizes that traffic volumes (the label) are higher on weekends (extracting useful patterns).
Hence it is nice to encourage users (as part of the timedate featurizer API itself) to think about which columns to include/exclude. This is as opposed to a user needing a separate DropColumns step in their pipeline. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So so far we have had 3 suggestions.
1 - Dont support dropping columns as a native part of this transformer.
2 - Use Enums.
3 - Use Flags.
Can we finalize on the decision? I personally agree with Harish that since we already have a drop columns transformer that we can just use that and not have that functionality be present here. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend a way to proactively select the output columns to create during the setup of the transform. Instead of dropping later, just don't create them.
Generally, we should assume users won't think (or know) to drop later. For usability, placing it in the constructor exposes users to the idea, and forces them to consider which columns will be beneficial. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, since this functionality already exists inside of ML.NET as a separate transformer we will be leaving this as is. We can revisit this later as the need arises. Even if this transformer produces all the columns, if no downstream transformers request them then they will never be realized. It shouldn't cause any performance impact this way. #Resolved
HolidayName, IsPaidTimeOff | ||
}; | ||
|
||
public enum Countries : byte |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May want a UINT given that there are ~200 current countries (and they keep changing), and in the future we may a more specific break down, like UnitedStatesBankingHolidays
vs. UnitedStatesFederalHolidays
public enum Countries : byte | |
public enum Countries : uint | |
``` #Resolved |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point. I have changed the enum to be uint
as suggested.
In reply to: 353916304 [](ancestors = 353916304)
public ColumnsProduced[] ColumnsToDrop; | ||
|
||
[Argument(ArgumentType.AtMostOnce, HelpText = "Country to get holidays for. Defaults to none if not passed", Name = "Country", ShortName = "ctry", SortOrder = 4)] | ||
public Countries Country = Countries.None; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Countries
may be a bit restrictive of a term for the future. Perhaps name as HolidayList
.
Perhaps in the future we'll want UnitedStatesCaliforniaEducationHolidays
(see: list). We may also want more specific lists, like UnitedStatesMajorHolidays
. Or localized lists like GermanyBavaria
(see: list).
Or perhaps in the future we'll want to version the holiday list Germany2019
(as holidays are declared and dropped). Versioning helps old models run in current versions of ML.NET. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. I have renamed it to HolidayList
, but we can revisit it if needed in the future as well.
In reply to: 353920532 [](ancestors = 353920532)
public int DTCYearIso { get; set; } | ||
public string DTCMonthLabel { get; set; } | ||
public string DTCAmPmLabel { get; set; } | ||
public string DTCDayOfWeekLabel { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be needed to be added to the code being wrapped here, but when featurizing DateTime fields, I prefer polar (Sin/Cos) transforms (along with the others listed).
Clocks and times/dates are cyclic. The current feature set doesn't include this fact.
The main gain of polar date transforms is that it places times like 11:50PM
and 12:10AM
right next to each other (both mapping to ~0.999
in the cosine transform of percent of day). This lets a model make a decision boundary at CosTimeOfDay > 0.9
to treat these two times as very similar.
Contrast this to the DTCHour
feature which will map these to 23
and 0
(issue: non-continuous range, and not possible to group both the 0
and 23
in a single split point, which to a human is similar).
Info:
- https://ianlondon.github.io/blog/encoding-cyclical-features-24hour-time/
- https://medium.com/ai%C2%B3-theory-practice-business/top-6-errors-novice-machine-learning-engineers-make-e82273d394db#56ae
- https://blog.davidkaleko.com/feature-engineering-cyclical-features.html
- https://www.kaggle.com/avanwyk/encoding-cyclical-features-for-deep-learning
- https://medium.com/open-machine-learning-course/open-machine-learning-course-topic-6-feature-engineering-and-feature-selection-8b94f870706a#c36f
Code example I use for DateTime feature engineering:
DateTime dt;
var invalidDate = !DateTime.TryParse(input.Date, out dt);
var percentOfDay = (invalidDate ? Single.NaN : (float)(new TimeSpan(dt.Hour, dt.Minute, 0).TotalMinutes) / 1440 f);
var percentOfWeek = (invalidDate ? Single.NaN : ((float) dt.DayOfWeek + percentOfDay) / 7 f);
var percentOfMonth = (invalidDate ? Single.NaN : (float)(dt.Day + percentOfDay) / (float) DateTime.DaysInMonth(dt.Year, dt.Month));
var percentOfYear = (invalidDate ? Single.NaN : (float)((dt.DayOfYear + percentOfDay) / (new DateTime(dt.Year + 1, 1, 1) - new DateTime(dt.Year, 1, 1)).TotalDays));
// Time features like the current ones
output.DateWeekday = (invalidDate ? Single.NaN : (float) dt.DayOfWeek);
output.DateMonth = (invalidDate ? Single.NaN : (float) dt.Month);
output.DateHour = (invalidDate ? Single.NaN : dt.Hour);
output.DatePartOfDay = (invalidDate ? Single.NaN : (float) Math.Round(dt.AddHours(-3).Hour / 6 f));
output.InvalidDate = (invalidDate ? 1.0f : 0.0f);
// Polar date/time transforms for percent of day & week & year
output.CosTimeOfDay = (float) Math.Cos(2 * Math.PI * percentOfDay);
output.SinTimeOfDay = (float) Math.Sin(2 * Math.PI * percentOfDay);
output.CosTimeOfWeek = (float) Math.Cos(2 * Math.PI * percentOfWeek);
output.SinTimeOfWeek = (float) Math.Sin(2 * Math.PI * percentOfWeek);
output.CosTimeOfMonth = (float) Math.Cos(2 * Math.PI * percentOfMonth);
output.SinTimeOfMonth = (float) Math.Sin(2 * Math.PI * percentOfMonth);
output.CosTimeOfYear = (float) Math.Cos(2 * Math.PI * percentOfYear);
output.SinTimeOfYear = (float) Math.Sin(2 * Math.PI * percentOfYear);
``` #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/cc @davidbrownellWork #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add this as a potential feature for Mn. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You mean add to the ML.NET plan? Any estimate on how much additional dev time this would take? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding the new features is pretty simple. I expect more of the work is in testing. #Resolved
{ | ||
ValueGetter<T> result = (ref T dst) => | ||
{ | ||
long dateTime = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is the input dateTime
column a long
and not a DateTime
? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
long
to match a unix timestamp. We will be adding in the ability to use C# DateTime
and string
in an ISO format in the future, but the initial implementation is just the unix timestamp. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My concern here is that .NET developers expect to have their date time data using System.DateTime
. That is the natural type in .NET. Forcing the user to convert from System.DateTime
to a long unix timestamp doesn't seem like a good experience. It would make more sense if the input column was a System.DateTime
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not do the conversion yourself here? The user's input column can be DateTime
, and then you convert it to a long unix timestamp before passing it to the C++ code? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want to ensure that the process of converting from a string to "Date time" is consistent across frameworks - especially considering that a DateTime in and of itself isn't sufficient in some scenarios, as we lose (potentially important) time zone information in the process. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not asking for string. I'm asking for the ML.NET data type that a user would feed into this transform to be the canonical System.DateTime
type and not a long
type.
Specifically, this transform should require the input column to be of this type:
machinelearning/src/Microsoft.ML.DataView/DataViewType.cs
Lines 361 to 365 in 9fc8f1c
/// <summary> | |
/// The standard date time type. This has representation type of <see cref="DateTime"/>. | |
/// Note this can have only one possible value, accessible by the singleton static property <see cref="Instance"/>. | |
/// </summary> | |
public sealed class DateTimeDataViewType : PrimitiveDataViewType |
It doesn't make sense for a user to have to convert a DateTime
to a long
before calling this transform. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Specifically, look at the sample code being checked in under samples
:
// Create a small dataset as an IEnumerable.
// Future Date - 2025 June 30
var samples = new[] { new DateTimeInput() { Date = 1751241600 } };
Instead, that sample code should be:
// Create a small dataset as an IEnumerable.
var samples = new[] { new DateTimeInput() { Date = new DateTime(2025, 6, 30) } };
``` #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed - eventually a C# developer will be able to pass System:DateTime
, string
, or long
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strongly agree on using System.DateTime from the beginning. This will make the transform consistent with other parts of the API, as well as more usable to .NET developers #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is on our todo list and will be added in in the future. We will eventually be supporting DateTime
and string
as well, but initial support will just be for long
. #Resolved
ctx.Writer.Write(data); | ||
} | ||
|
||
public void Dispose() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should just be called automatically when the pipeline teardown happens as mentioned in that comment chain. I am not manually calling it anywhere. Is it happening during pipeline teardown ok? Or is there more desirable behavior? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should just be called automatically when the pipeline teardown happens as mentioned in that comment chain.
There is no call of it when the pipeline teardown happens. That is why that bug was opened. I'm still not certain on why it was closed. Maybe @codemzs can comment. #Resolved
/azp run #Resolved |
Azure Pipelines successfully started running 2 pipeline(s). #Resolved |
Going to close and reopen the PR to reset the stuck builds. #Resolved |
Codecov Report
@@ Coverage Diff @@
## master #4521 +/- ##
==========================================
+ Coverage 75.11% 75.15% +0.03%
==========================================
Files 908 913 +5
Lines 160204 160891 +687
Branches 17254 17292 +38
==========================================
+ Hits 120340 120911 +571
- Misses 35050 35153 +103
- Partials 4814 4827 +13
|
WeekIso = 15, | ||
YearIso = 16, | ||
MonthLabel = 17, | ||
AmPmLabel = 18, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we do AmPm label, would it make sense to also do weekend/weekday? (Often used in feature engineering) #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think thats not a bad idea, but the initial ask we got was just for these columns. We will start with just these and then can add in more if the requirement arises. #Resolved
{ | ||
ValueGetter<T> result = (ref T dst) => | ||
{ | ||
long dateTime = default; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strongly agree on using System.DateTime from the beginning. This will make the transform consistent with other parts of the API, as well as more usable to .NET developers #Resolved
/// <param name="inputColumnName">Input column name</param> | ||
/// <param name="columnPrefix">Prefix to add to the generated columns</param> | ||
/// <returns><see cref="DateTimeEstimator"/></returns> | ||
public static DateTimeEstimator DateTimeTransformer(this TransformsCatalog catalog, string inputColumnName, string columnPrefix) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The naming of the extension methods are not consistent with the other transformers (I'll leave it to others to judge whether the original scheme makes sense :-), but consistency is probably a more important consideration now)
If you look at the other transforms they mostly have the form Verb-Object e.g. CopyColumns, MapValueToKey etc. This makes sense when you are chaining them together in a pipeline.
Suggestions for this one: FeaturizeDateTime, SplitDateTime, TokenizeDateTime, ExtractDateTime ...
#Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Of the ones you listed I would vote FeaturizeDateTime
. Anyone have any objections if I rename the extension method to that? #Resolved
case DateTimeEstimator.ColumnsProduced.MonthLabel: | ||
case DateTimeEstimator.ColumnsProduced.AmPmLabel: | ||
case DateTimeEstimator.ColumnsProduced.DayOfWeekLabel: | ||
case DateTimeEstimator.ColumnsProduced.HolidayName: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Usability question, why are we producing multiple output columns of instead of a single vector type column?
The next step after this DateTime transform is:
- Convert all the INTs/USHORTs/BYTEs columns to Float32
- One-hot encode the STRING columns (
"Friday"
to[ 0, 0, 0, 0, 0, 1, 0 ]
) - Concatenate all of these feature to single vector of Float32
Why aren't we just outputting the final vector type column? Then the user can directly use its output, like our other transforms. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's a usecase for the individual columns, perhaps just a boolean parameter produceRawColumns
, which then produces the individual outputs instead of a singular vector column? #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did it this way just because thats how the original python code works. It splits everything up into different columns and doesn't combine them into a vector. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a single vector is consistent with FeaturizeText
. #Resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original code we were bringing to ML.NET returned all individual columns. Because of that, for now, we will be leaving them as individual columns. If we learn that we never need individual columns and only use this as a vector then we can revisit this then. #Resolved
No pipelines are associated with this pull request. #Resolved |
@@ -219,5 +220,31 @@ internal static TypeId GetNativeTypeIdFromType(this Type type) | |||
|
|||
throw new InvalidOperationException($"Unsupported type {type}"); | |||
} | |||
|
|||
// The Native Featurizers do not currently support CentOS7, this method checks the OS and returns true if it is CentOS7. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about RedHat 7?
{ | ||
if (RuntimeInformation.IsOSPlatform(OSPlatform.Linux)) | ||
{ | ||
using (Process process = new Process()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A more performant way of doing this check would be to read from the file directly. Starting a new process just to get this info is too heavy-weight IMO.
This change adds in the DateTimeTransformer into the new Featurizers project. It is the first of a series of PR's that will go in. The DateTimeTransformer is implemented in native code, so this is mostly just a wrapper around that with the appropriate entrypoints for NimbusML as well. Since all the new estimator/transformers follow the same patterns, once this one is reviewed and checked in I will create the other PR's.