Create a DataFrame from an IDataView #5682

eerhardt · 2019-10-04T16:55:05Z

We currently can turn a DataFrame into an IDataView and pass it to any ML.NET API that takes an IDataView. This is useful when you have training data or data to be scored, and you need to pass it into ML.NET to .Fit() or .Transform().

However, when data comes out of ML.NET it comes out as an IDataView. While you can consume the data using IDataView directly, its APIs aren't the most convenient way to access data. For example, in order to read data, you need to open a cursor, get a delegate for each column you want access to, and then move the cursor over the data, calling the delegate for each row.
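For reference, the cursor pattern described above looks roughly like the following. This is a minimal sketch, assuming the Microsoft.ML package is referenced; the `Row` class and its single `Score` column are hypothetical and exist only to give the example some data to read.

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

// Hypothetical row type, used only to build a small IDataView for the example.
public class Row
{
    public float Score { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var mlContext = new MLContext();

        // Build an in-memory IDataView with one float column named "Score".
        IDataView data = mlContext.Data.LoadFromEnumerable(new[]
        {
            new Row { Score = 0.25f },
            new Row { Score = 0.75f },
        });

        // Look up the column we want to read.
        DataViewSchema.Column scoreColumn = data.Schema["Score"];

        // 1. Open a cursor, declaring which columns are needed.
        using (DataViewRowCursor cursor = data.GetRowCursor(new[] { scoreColumn }))
        {
            // 2. Get a delegate (getter) for each column of interest.
            ValueGetter<float> scoreGetter = cursor.GetGetter<float>(scoreColumn);

            // 3. Move the cursor over the data, calling the delegate for each row.
            float score = 0;
            while (cursor.MoveNext())
            {
                scoreGetter(ref score);
                Console.WriteLine(score);
            }
        }
    }
}
```

Every additional column needs its own typed getter and its own local variable, which is the ceremony a DataFrame would hide.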

A more convenient approach would be to materialize all the data from an IDataView in memory and expose it as a DataFrame. Then you could use any DataFrame API to access/modify/etc the data.

A big drawback to this approach is that all the data must fit into memory - since DataFrame is wholly in-memory. So this API wouldn't be used if someone was transforming GBs of data and trying to materialize it into a DataFrame.

However, there are plenty of scenarios where the data fits into memory and this would be useful. We can also provide optional arguments to limit the number of rows and to limit which columns are selected.

A canonical use-case of this API would be to consume predicted values without having to use a hard-coded class, like with PredictionEngine. In this use-case the IDataView that is returned from the model contains ALL the columns in the pipeline - the input columns, the intermediate columns, and the output columns. To materialize all those columns into memory would be a waste. Normally consumers would just want the Score and/or PredictedLabel columns, not all the input and intermediate columns. So they should be able to specify which columns to materialize.

LittleLittleCloud · 2021-03-12T18:33:13Z

Hi @pgovind any update for this issue

pgovind · 2021-03-12T18:44:41Z

> Hi @pgovind any update for this issue

Yup. I have it working locally. I'll put a PR up for it today/Monday.

pgovind · 2021-04-29T21:19:47Z

Implemented with #5712. Closing

pgovind transferred this issue from dotnet/corefxlab Mar 6, 2021

pgovind added the Microsoft.Data.Analysis label (All DataFrame related issues and PRs) Mar 6, 2021

pgovind closed this as completed Apr 29, 2021

ghost locked as resolved and limited conversation to collaborators Mar 20, 2022
