Create a DataFrame from an IDataView #5682

Closed
eerhardt opened this issue Oct 4, 2019 · 3 comments
Labels
Microsoft.Data.Analysis All DataFrame related issues and PRs

Comments

@eerhardt
Member

eerhardt commented Oct 4, 2019

We currently can turn a DataFrame into an IDataView and pass it to any ML.NET API that takes an IDataView. This is useful when you have training data or data to be scored, and you need to pass it into ML.NET to .Fit() or .Transform().

However, when data comes out of ML.NET, it comes out as an IDataView. While you can consume the data using IDataView directly, its APIs aren't the most convenient way to access data. For example, in order to read data, you need to open a cursor, get a delegate for each column you want access to, and then move the cursor over the data, calling the delegate for each row.
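To make the access pattern described above concrete, here is a small language-neutral toy model in Python. It is only a sketch of the cursor-plus-getter idea and is not the ML.NET API: `ToyCursor`, `move_next`, `get_getter`, and `read_column` are hypothetical names invented for this illustration.

```python
from typing import Any, Callable, Dict, List


class ToyCursor:
    """Forward-only cursor over row data (hypothetical, not an ML.NET type)."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self._rows = rows
        self._position = -1  # positioned before the first row

    def move_next(self) -> bool:
        # Advance one row; report whether a current row exists.
        self._position += 1
        return self._position < len(self._rows)

    def get_getter(self, column: str) -> Callable[[], Any]:
        # The getter is bound to the cursor and reads whatever row is current.
        return lambda: self._rows[self._position][column]


def read_column(rows: List[Dict[str, Any]], column: str) -> List[Any]:
    cursor = ToyCursor(rows)             # 1. open a cursor
    getter = cursor.get_getter(column)   # 2. get a delegate for the column
    values = []
    while cursor.move_next():            # 3. move the cursor over the data
        values.append(getter())          # 4. call the delegate for each row
    return values


rows = [
    {"Feature": 1.0, "Score": 0.25},
    {"Feature": 2.0, "Score": 0.75},
]
scores = read_column(rows, "Score")
print(scores)
```

Even in this reduced form, reading a single column takes four distinct steps, which is the inconvenience the issue is pointing at.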

A more convenient approach would be to materialize all the data from an IDataView in memory and expose it as a DataFrame. Then you could use any DataFrame API to access or modify the data.

A big drawback to this approach is that all the data must fit into memory, since DataFrame is wholly in-memory. So this API wouldn't be used if someone was transforming gigabytes of data and trying to materialize it into a DataFrame.

However, there are plenty of scenarios where the data fits into memory and this would be useful. And we can provide optional arguments to limit the number of rows, and to limit which columns are selected.

A canonical use-case of this API would be to consume predicted values without having to use a hard-coded class, like with PredictionEngine. In this use-case the IDataView that is returned from the model contains ALL the columns in the pipeline - the input columns, the intermediate columns, and the output columns. To materialize all those columns into memory would be a waste. Normally consumers would just want the Score and/or PredictedLabel columns, not all the input and intermediate columns. So they should be able to specify which columns to materialize.
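The proposal above (materialize only selected columns, optionally capped at a number of rows) can be sketched in a language-neutral way. The Python below is a toy model under stated assumptions, not the ML.NET or DataFrame API: `materialize`, its `columns` and `max_rows` parameters, and the `scored_rows` generator are all hypothetical names, and a plain dict of lists stands in for a DataFrame.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List, Optional, Sequence


def materialize(
    rows: Iterable[Dict[str, Any]],
    columns: Optional[Sequence[str]] = None,
    max_rows: Optional[int] = None,
) -> Dict[str, List[Any]]:
    """Pull a lazy row stream into in-memory columns (hypothetical helper)."""
    # Pre-create the requested columns so they exist even for empty input.
    result: Dict[str, List[Any]] = {name: [] for name in (columns or [])}
    # islice stops pulling rows once max_rows is reached; None means no limit.
    for row in islice(rows, max_rows):
        names = columns if columns is not None else list(row.keys())
        for name in names:
            result.setdefault(name, []).append(row[name])
    return result


def scored_rows() -> Iterator[Dict[str, Any]]:
    # Stand-in for model output: input, intermediate, and output columns.
    for i in range(1000):
        yield {
            "Text": f"row {i}",
            "Features": [i, i + 1],
            "Score": float(i),
            "PredictedLabel": i >= 2,
        }


frame = materialize(scored_rows(), columns=["Score", "PredictedLabel"], max_rows=3)
print(frame)
```

Only the two requested output columns are kept, and only the first three rows are ever pulled from the source, so the input and intermediate columns never occupy memory.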

@pgovind pgovind transferred this issue from dotnet/corefxlab Mar 6, 2021
@pgovind pgovind added the Microsoft.Data.Analysis All DataFrame related issues and PRs label Mar 6, 2021
@LittleLittleCloud
Contributor

Hi @pgovind, any update on this issue?

@pgovind
pgovind commented Mar 12, 2021

> Hi @pgovind any update for this issue

Yup. I have it working locally. I'll put a PR up for it today/Monday.

@pgovind
pgovind commented Apr 29, 2021

Implemented with #5712. Closing

@pgovind pgovind closed this as completed Apr 29, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Mar 20, 2022