MRG re-allow zero-based indexes in SVMlight files #756

larsmans · 2012-04-04T19:23:47Z

Fixes issue #750.

ogrisel · 2012-04-04T20:38:28Z

Looks good, but could you please add a test for the case where zero_based=False and an artificial input file contains a zero index? Does it fail with a meaningful exception message, or does it silently yield an invalid dataset?

larsmans · 2012-04-04T20:45:18Z

It fails with a ValueError stating invalid index 0 in SVMlight/LibSVM data file. Added a test for that (in a force push).
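
For readers following along, the check described above can be sketched in pure Python. This is only an illustration under stated assumptions: `parse_features` is a hypothetical helper, not the actual loader code, and only the wording of the error message is taken from this comment.

```python
def parse_features(line, zero_based):
    """Parse the 'index:value' pairs of one SVMlight line.

    Hypothetical helper for illustration only; the real loader differs.
    """
    pairs = []
    for token in line.split()[1:]:  # the first token is the label
        idx_s, value_s = token.split(":", 1)
        idx = int(idx_s)
        # Index 0 is only legal when the file is declared zero-based.
        if idx < 0 or (not zero_based and idx == 0):
            raise ValueError(
                "invalid index %d in SVMlight/LibSVM data file" % idx)
        # Column positions are stored zero-based internally.
        pairs.append((idx if zero_based else idx - 1, float(value_s)))
    return pairs
```

With zero_based=False, a line such as `1 0:0.5` raises instead of silently producing a column at position -1.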

ogrisel · 2012-04-04T20:52:14Z

Indeed. Thanks for the test; however, I don't see it in the GitHub PR view. Are you sure the push occurred? Other than that, 👍 for merging.

BTW, have you tried on the adult dataset to check whether it fixes @mblondel's issue?

larsmans · 2012-04-04T20:57:52Z

Oops, pushed it to my master branch. The updated commit is there now. And yes, I tried it on the adult dataset, both load_svmlight_file and _files.

mblondel · 2012-04-05T02:28:15Z

Thanks for the quick fix!

larsmans · 2012-04-05T10:23:32Z

Is that a green light for the green button?

mblondel · 2012-04-05T10:35:05Z

> Is that a green light for the green button?

Do you think that there are cases when the user wouldn't want to use
zero_based="auto"? If not, we could drop the option...

larsmans · 2012-04-05T11:33:58Z

I can imagine a case where a test set has an all-zero first column. Loading that with "auto" will give the wrong number of features. Also, load_svmlight_files doesn't accept "auto".
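
A small sketch of that failure mode, assuming "auto" simply means "zero-based if and only if index 0 occurs in the file". Both helper names are hypothetical and the index lists are made-up illustration data.

```python
def guess_zero_based(indices):
    """Assumed "auto" heuristic: zero-based only if index 0 occurs."""
    return min(indices) == 0


def n_features(indices, zero_based):
    """Number of columns implied by the largest index seen."""
    return max(indices) + (1 if zero_based else 0)


# Both files are really zero-based with 4 features (columns 0..3).
train_indices = [0, 1, 3, 0, 2]
test_indices = [1, 3, 2, 3]  # column 0 happens to be all-zero here

train_zb = guess_zero_based(train_indices)  # True
test_zb = guess_zero_based(test_indices)    # False: misread as one-based

train_n = n_features(train_indices, train_zb)
test_n = n_features(test_indices, test_zb)
print(train_n, test_n)  # prints: 4 3
```

The test set ends up one column short and shifted, so its columns no longer line up with the training set.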

mblondel · 2012-04-05T12:18:14Z

> I can imagine a case where a test set has an all-zero first column.
> Loading that with "auto" will give the wrong number of features. Also,
> load_svmlight_files doesn't accept "auto".

I really want load_svmlight_files to support the auto mode because I use
load_svmlight_files all the time to load the train and test sets. We could
use the first file to determine whether zero_based is True or not and apply
that to the other files, just like we did for n_features. Also the check
np.min(indices) > 0 should be done on the Python side. This way, it can be
done separately in load_svmlight_file and in load_svmlight_files.

mblondel · 2012-04-05T12:21:41Z

> I really want load_svmlight_files to support the auto mode because I use
> load_svmlight_files all the time to load the train and test sets. We could
> use the first file to determine whether zero_based is True or not and apply
> that to the other files, just like we did for n_features. Also the check
> np.min(indices) > 0 should be done on the Python side. This way, it can be
> done separately in load_svmlight_file and in load_svmlight_files.

I can do that tomorrow and send you a PR.

larsmans · 2012-04-05T12:22:36Z

I imagined it wouldn't matter; zero_based=True is safe for one-based indices, it just produces an extra all-zero column.
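
To make the "extra all-zero column" concrete, here is a minimal pure-Python sketch. `to_dense` is a hypothetical helper and the rows are made-up data; the real loader builds a sparse matrix instead.

```python
def to_dense(rows, zero_based):
    """Build dense rows from {index: value} dicts as written in the file."""
    offset = 0 if zero_based else 1
    width = max(max(r) for r in rows) - offset + 1
    dense = [[0.0] * width for _ in rows]
    for i, r in enumerate(rows):
        for idx, val in r.items():
            dense[i][idx - offset] = val
    return dense


# A file that uses one-based indices 1..3.
one_based_rows = [{1: 2.0, 3: 1.0}, {2: 5.0}]

as_one_based = to_dense(one_based_rows, zero_based=False)
as_zero_based = to_dense(one_based_rows, zero_based=True)

# Read as zero-based, no value is lost, but column 0 is never written to.
first_column = [row[0] for row in as_zero_based]
```

Here `as_one_based` has 3 columns, while `as_zero_based` has 4 with an all-zero first column.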

mblondel · 2012-04-05T12:31:58Z

But the original reason you wanted to fix that is because the shape was
(n_samples, n_features+1), isn't it?

larsmans · 2012-04-05T12:35:10Z

You mean the previous time I "fixed" this? Yes, but I thought the convention was one-based indexing since all the LibSVM datasets do that. This PR makes it possible to use both conventions, but the default settings are designed for safety rather than efficiency/elegance.

larsmans · 2012-04-11T17:54:28Z

@mblondel: I implemented "auto" for load_svmlight_files. Please check it out, it might require one more test. The function now also guesses n_features based on all files' contents rather than just the first one.

mblondel · 2012-04-12T12:02:08Z

@larsmans: Thanks a ton. Looks good to me. Sorry for not doing it myself, this has been a very busy week for me.

ogrisel · 2012-04-12T12:14:51Z

Maybe add a couple of new tests to highlight the differences when zero_based is False, True or "auto" in load_svmlight_files too.

amueller · 2012-04-14T22:19:23Z

sklearn/datasets/tests/test_svmlight_format.py

amueller · 2012-04-14T22:31:08Z

Maybe I missed something, but is there now a way to ensure that the training and test files are read in a consistent way with "auto" if one of them contains a column of zeros?

larsmans · 2012-04-15T14:52:21Z

Added a test for zero_based="auto" with multiple files, fixed a bug and addressed @amueller's comments.

larsmans · 2012-04-15T14:55:03Z

@amueller: sorry, didn't see that last comment. In load_svmlight_files, if zero_based="auto" and any of the files contains a zero index, then all of them are assumed to have zero-based indices. Otherwise, one-based indices are assumed.
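
The rule can be sketched roughly as follows. This is a simplified pure-Python illustration, not the code in this PR; `resolve_zero_based` and `shared_n_features` are hypothetical names, and each file is represented only by the list of indices found in it.

```python
def resolve_zero_based(per_file_indices, zero_based="auto"):
    """Decide one indexing convention for all files.

    With "auto", a single zero index in any file makes every file
    zero-based; otherwise one-based indices are assumed.
    """
    if zero_based != "auto":
        return bool(zero_based)
    return any(min(ind) == 0 for ind in per_file_indices if ind)


def shared_n_features(per_file_indices, zero_based):
    """Guess n_features from the largest index across all files."""
    largest = max(max(ind) for ind in per_file_indices if ind)
    return largest + (1 if zero_based else 0)


# Zero-based data; the test file never uses column 0.
train, test = [0, 1, 3, 0, 2], [1, 3, 2, 3]
zb = resolve_zero_based([train, test])    # True, because train has index 0
n = shared_n_features([train, test], zb)  # 4 columns for both files
```

Because the decision is made once for the whole group, a test file with an all-zero first column is still read consistently with its training file.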

amueller · 2012-04-15T20:10:28Z

Cool. Could you add a comment to load_svmlight_file that says something along these lines, so that people know it is safer to use?

larsmans · 2012-04-16T08:02:57Z

Done!

amueller · 2012-04-16T12:01:52Z

👍 for merge

ogrisel · 2012-04-16T13:00:29Z

👍 for merge as well

mblondel · 2012-04-16T13:06:41Z

Alright @larsmans. After this one you still have 4 pending PRs to merge :)

BUG re-allow zero-based indexes in SVMlight files

2c293eb

Fixes issue scikit-learn#750.

ENH zero_based="auto" support + better n_features=None in load_svmlight_files

6d6c9a8

COSMIT refactor SVMlight loader

c2eeca7

amueller reviewed Apr 14, 2012

sklearn/datasets/tests/test_svmlight_format.py Outdated

PEP8

larsmans added 3 commits April 15, 2012 16:33

COSMIT pep8 SVMlight loader

2ddf1a7

BUG close files in time in SVMlight loader (with statement)

7b4842d

TEST + FIX zero_based="auto" behavior in SVMlight loader

96af5ae

DOC + PEP8 SVMlight loader

d38f4d0

larsmans added a commit that referenced this pull request Apr 16, 2012

Merge pull request #756 from larsmans/svmlight_fix

7bc2ca7

MRG re-allow zero-based indexes in SVMlight files

larsmans merged commit 7bc2ca7 into scikit-learn:master Apr 16, 2012
