.NET data type system instead of DvTypes #673
You may want to make the part that serializes pluggable (discoverable through DI) or based on a provider model. The default should probably be JSON (it's the prevalent serialization mechanism currently), and if people are concerned about performance, they can implement the provider for their preferred serialization mechanism (Protobuf comes to mind, but there are other options). Also, you may want to make sure that the new Memory/Span APIs are used in this area. Small nitpick: the title should be ".NET types" and not "C# types", as these types are not specific to C# but are available throughout the .NET ecosystem.
This seems fine on the whole, but what is going to be done about sparsity, implicit values for sparse values, and types like
I could be misreading, but the point is to use the .NET type system (and all that it offers) instead of DvTypes, as using DvTypes means transformations in many other places (the whole point of a type system is to unify data across operations, not fragment it with other sub type systems). You could continue to use what .NET has out of the box, and other serialization mechanisms map easily to the .NET type system (JSON.NET, Protobuf.NET, etc.). IOW, it's a layer that doesn't need to exist, as it doesn't afford anything that doesn't already exist in the .NET type system.
Hello @casperOne, thanks for your response and clarifications. I think perhaps I was not clear -- I'm not actually confused about the proposal; I'm pointing out a serious architectural morass that this issue as written engenders. But I'll clarify what I mean a bit more. Imagine we get rid of this type.

Both of these options are awful. Our code and user code in lots of places benefit from the assumption that numeric vectors have a 0 for their implicit values. On the other hand, we also in plenty of places assume that the implicit sparse value represents a missing value. So if we get rid of the type, one of those assumptions has to give. Incidentally, let me make a secondary point while I'm here. As you say, .NET has a concept for NA values that's close and almost useful, except for one major problem.
From what I see in the implementation of this issue, and @TomFinley's comment, we completely remove nullable support for fields and properties. Which is fine if you use TextLoader, but in the case of IEnumerable -> DataView conversion it looks like a really bad decision. Imagine I, as a user, want to train a model on top of a SQL table. I can fetch data through LINQ2SQL or EF (which provide a drag-and-drop option to generate classes and methods to get data) as IEnumerable, wrap it in CollectionDataSource and train on it. But only if I don't have any nullable fields in my table; as soon as I have at least one nullable field, I have no option other than to create a new class and write a conversion from the old class to the new one, which can be an extremely painful process, especially since in SQL people can have hundreds of columns (fields). If the only problem that prevents us from supporting nullables is VBuffer and sparsity, can we change the VBuffer code to check the incoming type and, if it's nullable, set values to the default of the inner type?
Hi @Ivanidzo4ka. What you are saying, I think, is that before a SQL user injects their table into our system they will have to be explicit about what their nullable values mean. We are writing an API, and that means people are free to (and will) write their own code around us, rather than having our own mechanisms be the only things at people's disposal. Though I understand this requires a shift in perspective, in this new world sometimes the right answer is that we not only don't have to handle this case, but we absolutely should not. I think this is one of those times.
cc: @codemzs @eerhardt @TomFinley @shauheen @markusweimer @justinormont @Zruty0 @GalOshri
Thanks @najeeb-kazmi for the great benchmarks. From a user perspective, I doubt any user would notice a runtime change this small (within 2%). And, @najeeb-kazmi, as you state, "We expect the performance to only improve with further optimizations in future .NET Core runtimes." Do we have guesses as to where the main perf impact is located? This might help us create a focused benchmark which will let the DotNet team have a direct measure to optimize. On a higher-level note: do we have any datasets with NA values for a type which no longer has NA values (within either the Features or Label)? It would be interesting to see the change in what a user would expect for their first run's accuracy. If the meaning of the NA is truly a missing value, I expect the NAHandleTransform will add measurable accuracy vs. auto-filling with the default for the type.
.NET data type system instead of DvTypes
Motivation
Machine learning datasets often have missing values, and the DvType system was created to accommodate them alongside the C# native types without increasing the memory footprint. If we were to use `Nullable<T>`, we would be looking at additional memory for the `HasValue` boolean field plus another 3 bytes of padding for 4-byte alignment. The C# native types replaced by DvTypes are: `bool` by `DvBool`, `sbyte` by `DvInt1`, `short` (Int16) by `DvInt2`, `int` (Int32) by `DvInt4`, `long` (Int64) by `DvInt8`, `System.DateTime` by `DvDateTime`, a date/time with offset by `DvDateTimeZone` (a combination of `DvDateTime` and a `DvInt2` offset), `System.TimeSpan` by `DvTimeSpan`, and `string` by `DvText`. The float and double types already have a special value, NaN, that can be used for missing values. The DvType system achieves a smaller memory footprint by denoting a special value for the missing value, usually the smallest number representable by the native type encapsulated by the DvType; for example, DvInt1's missing value indicator is `SByte.MinValue`, and in the case of types that represent date/time it is a value that represents the maximum ticks.

We plan to remove DvTypes to make IDataView a general commodity that can be used in other products, and for this to happen it would be nice if it did not have a dependency on a special type system. If in the future we find that having DvTypes was useful, we can consider exposing it natively from the .NET platform. Once we remove DvTypes, the ML.NET platform will use the native non-nullable C# types, with NaN in the float and double types representing missing values.
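To make the footprint argument concrete, here is a small sketch (assuming the `System.Runtime.CompilerServices.Unsafe` helper is available) contrasting the size of a nullable integer with the sentinel approach DvTypes use:

```csharp
using System;
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        // Nullable<T> adds a HasValue bool plus alignment padding:
        // an int? occupies 8 bytes instead of 4.
        Console.WriteLine(Unsafe.SizeOf<int>());   // 4
        Console.WriteLine(Unsafe.SizeOf<int?>());  // 8

        // A DvType instead reserves a sentinel inside the native range,
        // e.g. sbyte.MinValue marking a missing DvInt1 (illustrative only).
        sbyte raw = sbyte.MinValue;
        Console.WriteLine(raw == sbyte.MinValue);  // True: value is "missing"
    }
}
```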
Column Types
Columns in ML.NET make up the dataset, and `ColumnType` defines a column. At a high level there are two kinds of column: the first is `PrimitiveType`, which comprises types such as `NumberType`, `BoolType`, `TextType`, `DateTimeType`, `DateTimeZoneType` and `KeyType`; the second is structured type, which comprises `VectorType`. `ColumnType` is primarily made up of a `Type` and a `DataKind`. `Type` could refer to any type, but it is instantiated with a type referred to by `DataKind`, which is an identifier for data types comprising DvTypes, native C# types such as float and double, and the custom big integer `UInt128`.
Type conversion
DvTypes have implicit and explicit conversion operators that handle type conversion. Let's consider DvInt1 as an example. Similar conversion rules exist for DvInt2, DvInt4, DvInt8 and DvBool.
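As an illustration, a DvInt1-like wrapper might define its conversions along these lines (a hypothetical sketch; the names and exact rules are not taken from the ML.NET source):

```csharp
using System;

// Illustrative sketch of a DvInt1-like struct with implicit widening
// from sbyte and explicit narrowing back; sbyte.MinValue is the NA sentinel.
public readonly struct DvInt1
{
    public const sbyte RawNA = sbyte.MinValue;
    private readonly sbyte _value;
    private DvInt1(sbyte value) => _value = value;

    public bool IsNA => _value == RawNA;

    // Implicit conversion from the native type: MinValue silently becomes NA.
    public static implicit operator DvInt1(sbyte value) => new DvInt1(value);

    // Explicit conversion back to the native type throws on NA.
    public static explicit operator sbyte(DvInt1 value)
        => value.IsNA ? throw new InvalidCastException("NA") : value._value;
}

class Program
{
    static void Main()
    {
        DvInt1 x = (sbyte)5;
        Console.WriteLine((sbyte)x);     // 5
        DvInt1 na = sbyte.MinValue;
        Console.WriteLine(na.IsNA);      // True
    }
}
```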
Logical, bitwise and numerical operators
Operations such as `==`, `!=`, `!`, `>`, `>=`, `<`, `<=`, `+`, `-`, `*`, `pow`, `|` and `&` take place between the same DvTypes only. They also handle missing values, and in the case of the arithmetic operators, overflow is handled as well. Most of these overrides are implemented, but only a few are actively used. Whenever there is an overflow, the resulting value is represented as the missing value, and the same happens when one of the operands is a missing value.
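The overflow and missing-value rules described above can be sketched for a DvInt1-like type as follows (the `Add` helper is hypothetical, written only to illustrate the stated semantics):

```csharp
using System;

class Program
{
    // NA-propagating, overflow-to-NA addition on sbyte values, with
    // sbyte.MinValue as the missing-value sentinel. The exact rules in
    // ML.NET's DvTypes may differ; this follows the description above.
    const sbyte NA = sbyte.MinValue;

    static sbyte Add(sbyte a, sbyte b)
    {
        if (a == NA || b == NA) return NA;          // NA propagates
        int sum = a + b;                            // widen to detect overflow
        return (sum > sbyte.MaxValue || sum <= NA) ? NA : (sbyte)sum;
    }

    static void Main()
    {
        Console.WriteLine(Add(100, 27));          // 127
        Console.WriteLine(Add(100, 28) == NA);    // True: overflow becomes NA
        Console.WriteLine(Add(NA, 1) == NA);      // True: NA propagates
    }
}
```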
Serialization
DvTypes have their own codecs for efficiently compressing data and writing it to disk. For example, to write DvBool to disk, two bits are used to represent a boolean value: 0b00 is false, 0b01 is true and 0b10 is the missing value indicator. Boolean values are written at the level of an int32, whose 32 bits can accommodate 32/2 = 16 boolean values in 4 bytes, as opposed to using 1 byte per boolean value in the naive approach, which does not even handle missing values. We can reuse this approach to serialize `bool` by using one bit instead of two. The DvInt* codecs need not be changed at all. The DateTime and DvText codecs will require some changes.
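The two-bit packing described above can be sketched like this (the `Encode` helper and exact bit layout are assumptions for illustration, not the actual codec):

```csharp
using System;
using System.Collections.Generic;

class Program
{
    // Pack nullable booleans two bits at a time into 32-bit words:
    // 0b00 = false, 0b01 = true, 0b10 = missing. 16 values fit per word.
    static int[] Encode(IReadOnlyList<bool?> values)
    {
        var words = new int[(values.Count + 15) / 16];
        for (int i = 0; i < values.Count; i++)
        {
            int bits = values[i] switch { false => 0b00, true => 0b01, null => 0b10 };
            words[i / 16] |= bits << (2 * (i % 16));
        }
        return words;
    }

    static void Main()
    {
        int[] packed = Encode(new bool?[] { true, false, null });
        Console.WriteLine(packed.Length);  // 1: three values fit in one word
        Console.WriteLine(packed[0]);      // 0b10_00_01 = 33
    }
}
```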
Intermediate Language (IL) code generation
ML.NET contains a mini compiler that generates IL code at runtime for peek and poke functions, which perform reflection over objects to get and set values in a more performant manner. Here we can use `OpCodes.Stobj` to emit IL code for the `DvTimeSpan`, `DvDateTime`, `DvDateTimeZone` and `ReadOnlyMemory<char>` types.

New Behavior
`DvInt1`, `DvInt2`, `DvInt4` and `DvInt8` will be replaced with `sbyte`, `short`, `int` and `long` respectively. Conversion overflow behavior is undefined here; for example, casting `long` to `sbyte` will result in assigning the low 8 bits of the long to the sbyte. ML.NET projects are unchecked by default, because checked arithmetic is expensive and is hence used only in code blocks where it is needed.
Conversion from `Text` to an `Integer` type is done by first converting the `Text` to a `long` value in the case of a positive number and a `ulong` in the case of a negative number, and then validating that this value is within the legal bounds of the type it is being converted to from `Text`; for example, the legal bounds for `sbyte` are -128 to 127, so converting "-129" or "128" will result in an exception. Converting a value that is out of the legal bounds of the `long` type will also result in an exception.
`DvTimeSpan`, `DvDateTime` and `DvDateTimeZone` will be replaced with `TimeSpan`, `DateTime` and `DateTimeOffset` respectively. The offset of `DateTimeOffset` is represented as a long because it records ticks. Previously this was represented as a DvInt2 (short) in `DvDateTimeZone`, because it was recorded in minutes and therefore had a smaller footprint on disk. With the offset being a long, the footprint will increase; one workaround is to convert it to minutes before writing and then convert the minutes back to ticks when reading, but this might lead to a loss of precision. Since DateTime is very rarely used in machine learning, I'm not sure it is worth making an optimization here.
`DvText` will be replaced with `ReadOnlyMemory<char>`. `ReadOnlyMemory<char>` does not implement `IEquatable<T>`, and due to this it cannot be used as a type in `GroupKeyColumnChecker` in the `Cursor` in GroupTransform. The workaround is to remove the `IEquatable<T>` constraint on the type and instead use an if/else check: if the type implements `IEquatable<T>`, cast and call the `Equals` method; otherwise, if the type is `ReadOnlyMemory<char>`, use its utility method for equality; otherwise throw an exception. `ReadOnlyMemory<char>` does not override `GetHashCode()`, and due to this it cannot be used as a key in a dictionary in `ReconcileSlotNames<T>` in EvaluatorUtils.cs. The workaround is to use the string representation of `ReadOnlyMemory<char>` as the key. While this wastes some memory, it is not too bad, because it only happens at the end of the evaluation phase and the number of strings allocated is roughly proportional to the number of classes.
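The if/else equality fallback could look roughly like this (the `KeysEqual` helper is hypothetical; it compares `ReadOnlyMemory<char>` by contents first, since its built-in equality is identity-based rather than character-based):

```csharp
using System;

class Program
{
    // Content equality for grouping keys: special-case ReadOnlyMemory<char>
    // by comparing the underlying character spans, fall back to
    // IEquatable<T>, and reject anything else.
    static bool KeysEqual<T>(T a, T b)
    {
        if (a is ReadOnlyMemory<char> ma && b is ReadOnlyMemory<char> mb)
            return ma.Span.SequenceEqual(mb.Span);
        if (a is IEquatable<T> eq)
            return eq.Equals(b);
        throw new NotSupportedException($"No equality for {typeof(T)}");
    }

    static void Main()
    {
        Console.WriteLine(KeysEqual(3, 3));   // True
        ReadOnlyMemory<char> x = "abc".AsMemory(), y = "abc".AsMemory();
        Console.WriteLine(KeysEqual(x, y));   // True: same characters
    }
}
```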
`DvBool` will be replaced with `bool`. `GetPredictedLabel` and `GetPredictedLabelCore` will result in undefined behavior in the case where the score contains a missing value represented as NaN; here we will default to `false`.
Backward compatibility when reading `IDV` files written with `DvTypes`:

`Integers` are read as they were written to disk, i.e. the minimum value of the corresponding data type in the case of a missing value.

`Boolean` is read using the old codec, where two bits are used per value, and missing values are converted to `false` to fit in the `bool` type.

`DateTime`, `DateTimeSpan` and `DateTimeZone` use `long` and `short` types underneath to represent ticks and offset, and they are converted using the `Integer` scheme defined above. In the case where a ticks or offset value is read and found to contain the missing value, represented as the minimum of the underlying type, it is converted to the default value of that type to prevent an exception from the `DateTime`, `TimeSpan` or `DateTimeOffset` class, as such minimum values indicate an invalid date.

`DvText` is read as it is. Missing values, when being converted to integer types, are converted to the minimum value of that integer type, and the empty string is converted to the `default` value of that integer type.

`TextLoader` converts missing values to the `default` values of the type they are being converted to.

Parquet Loader
Future consideration
Introduce an option in the loaders for whether to throw an exception in the case of a missing value or just replace it with `default` values. With the current design we will throw an exception in the case of a missing value for the Text Loader and the Parquet Loader, but not for IDV (the Binary Loader).
Benchmarking the type system changes
(this section was written by @najeeb-kazmi )
`ReadOnlyMemory<char>` is a recently introduced data type that allows managing strings without unnecessary memory allocation. Strings in C# are immutable; hence, when we perform a string operation such as `Substring`, the resulting string is copied to a new memory location. To prevent unnecessary allocation of memory, `ReadOnlyMemory` keeps track of the substring via start and end offsets relative to the original string. Hence, for every `Substring` operation the memory allocated is constant. With `ReadOnlyMemory`, if one needs to access individual elements, they do it by calling the `Span` property, which returns a `ReadOnlySpan` object, a stack-only concept. It turns out that this `Span` property is an expensive operation, and our initial benchmarks showed that runtimes of the pipelines regressed by 100%. Upon further performance analysis, we decided to cache the returned `ReadOnlySpan` as much as we could, and that brought the runtimes on par with `DvText`.

These benchmarks are intended to compare performance after these optimizations on `Span` were done, in order to investigate whether we hit parity with `DvText` or not.
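The caching idea is simply to hoist the `Span` property access out of hot loops; a minimal sketch (the `CountVowels` helper is hypothetical):

```csharp
using System;

class Program
{
    // Access ReadOnlyMemory<char>.Span once, outside the loop, instead of
    // re-evaluating the (relatively expensive) property per element.
    static int CountVowels(ReadOnlyMemory<char> text)
    {
        ReadOnlySpan<char> span = text.Span; // cached once
        int count = 0;
        for (int i = 0; i < span.Length; i++)
            if ("aeiou".IndexOf(span[i]) >= 0) count++;
        return count;
    }

    static void Main()
    {
        ReadOnlyMemory<char> m = "machine learning".AsMemory();
        Console.WriteLine(CountVowels(m)); // 6
    }
}
```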
Datasets and pipelines

We chose the datasets and pipelines to cover a variety of scenarios, including:
The table below shows the datasets and their characteristics, as well as the pipeline we executed on each dataset. All datasets were ingested in text format, which makes heavy use of `DvText`/`ReadOnlyMemory<char>`. Other data types are also involved in the pipelines, although the performance of the pipelines is dominated by `DvText`/`ReadOnlyMemory<char>`.

Methodology and experimental setup
`dotnet MML.dll <pipeline>`
Results
We present the results of the benchmarks here. The deltas indicate the performance gap of the .NET data types relative to DvTypes: negative values indicate slower performance of the .NET data types compared to DvTypes, and percentage deltas are based on the mean runtime for DvTypes. Finally, we did an independent samples t-test with unequal variances for the two builds, and present the p-values for each test. We chose a significance threshold of 0.05, with a smaller p-value indicating significant differences.

We can see that for all the pipelines except the one with the Amazon Reviews dataset, the deltas were within 1% of the speed of DvTypes and were not significant. For Amazon Reviews, the delta was 1.85% of the speed of DvTypes and significant. The statistical significance is not particularly concerning here, because the long runtimes on this dataset were bound to return significantly different runtimes even with a small percentage difference. More important here is that the performance gap was reduced from ~100% to within 2%. We expect the performance to only improve with further optimizations in future .NET Core runtimes.
Criteo 1M
Flight Delay 7M
Bing Click Prediction 500K
Wikipedia Detox
Amazon Reviews
CC: @eerhardt @Zruty0 @Ivanidzo4ka @TomFinley @shauheen @najeeb-kazmi @markusweimer