We can currently turn a DataFrame into an IDataView and pass it to any ML.NET API that takes an IDataView. This is useful when you have training data, or data to be scored, and need to pass it into ML.NET to .Fit() or .Transform().
However, when data comes out of ML.NET, it comes out as an IDataView. While you can consume the data through IDataView directly, its APIs aren't the most convenient way to access data. For example, to read data you need to open a cursor, get a getter delegate for each column you want to access, and then move the cursor over the data, calling the delegates on each row.
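To make the inconvenience concrete, here is a minimal sketch of the cursor-based reading pattern described above, assuming `dataView` is an IDataView with a float column named "Score":

```csharp
// Read one column from an IDataView the manual way:
// open a cursor over the needed columns, fetch a getter
// delegate, then advance row by row.
DataViewSchema.Column scoreColumn = dataView.Schema["Score"];
using (DataViewRowCursor cursor = dataView.GetRowCursor(new[] { scoreColumn }))
{
    ValueGetter<float> scoreGetter = cursor.GetGetter<float>(scoreColumn);
    float score = default;
    while (cursor.MoveNext())
    {
        scoreGetter(ref score);   // fills `score` for the current row
        Console.WriteLine(score);
    }
}
```

Every additional column requires another getter delegate and another `ref` local, which is exactly the friction this proposal aims to remove.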
A more convenient approach would be to materialize all the data from an IDataView in memory and expose it as a DataFrame. Then you could use any DataFrame API to access/modify/etc the data.
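For comparison, a few of the operations that become available once the data is in a DataFrame (a sketch using the Microsoft.Data.Analysis package; `df` stands in for the materialized result):

```csharp
using Microsoft.Data.Analysis;

// A DataFrame built by hand here, standing in for one
// materialized from an IDataView.
DataFrame df = new DataFrame(
    new SingleDataFrameColumn("Score", new float[] { 0.9f, 0.2f, 0.7f }),
    new BooleanDataFrameColumn("PredictedLabel", new[] { true, false, true }));

// Whole-column, no-cursor access: filter rows by a predicate column.
DataFrame highScores = df.Filter(df["Score"].ElementwiseGreaterThan(0.5f));
Console.WriteLine(highScores.Rows.Count);
```

No cursors, no delegates: the data is indexed and operated on column-wise, in memory.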
A big drawback to this approach is that all the data must fit into memory, since DataFrame is wholly in-memory. So this API wouldn't be appropriate for someone transforming gigabytes of data and trying to materialize them into a DataFrame.
However, there are plenty of scenarios where the data will fit into memory where this will be useful. And we can provide optional arguments to limit the number of rows, and to limit which columns are selected.
A canonical use case for this API would be consuming predicted values without having to use a hard-coded class, as PredictionEngine requires. In this use case, the IDataView returned from the model contains ALL the columns in the pipeline: the input columns, the intermediate columns, and the output columns. Materializing all of those columns into memory would be a waste. Normally consumers just want the Score and/or PredictedLabel columns, not the input and intermediate columns. So they should be able to specify which columns to materialize.
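The proposed API might look like the following. Note that `ToDataFrame`, its `maxRows` parameter, and the column-name params are the suggestion under discussion here, not a shipped method; `model` and `testData` are assumed to come from an ordinary ML.NET pipeline:

```csharp
// Score some data; `predictions` carries every pipeline column.
IDataView predictions = model.Transform(testData);

// Hypothetical proposed API: materialize only the columns the
// consumer cares about, capping the number of rows pulled into memory.
DataFrame scored = predictions.ToDataFrame(maxRows: 1000, "Score", "PredictedLabel");

foreach (DataFrameRow row in scored.Rows)
    Console.WriteLine($"{row["PredictedLabel"]}: {row["Score"]}");
```

Limiting both rows and columns at the materialization boundary keeps the memory cost proportional to what the consumer actually reads, rather than to the full width of the pipeline.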