Skip to content

Conversation

@larsmans
Copy link
Member

@larsmans larsmans commented Apr 4, 2012

Fixes issue #750.

@ogrisel
Copy link
Member

ogrisel commented Apr 4, 2012

Looks good, but could you please add a test for the case where zero_based=False and an artificial input file that has a zero indexed value? Does it fails with a meaningful exception message or silently yield to an invalid dataset?

@larsmans
Copy link
Member Author

larsmans commented Apr 4, 2012

It fails with a ValueError stating invalid index 0 in SVMlight/LibSVM data file. Added a test for that (in a force push).

@ogrisel
Copy link
Member

ogrisel commented Apr 4, 2012

Indeed. Thanks for the test however I don't see it in the github PR view. Are you sure the push occurred? Other than that 👍 for merging.

BTW, have you tried on the adult dataset to check whether it fixes @mblondel's issue?

@larsmans
Copy link
Member Author

larsmans commented Apr 4, 2012

Oops, pushed it to my master branch. The updated commit is there now. And yes, I tried it on the adult dataset, both load_svmlight_file and _files.

@mblondel
Copy link
Member

mblondel commented Apr 5, 2012

Thanks for the quick fix!

@larsmans
Copy link
Member Author

larsmans commented Apr 5, 2012

Is that a green light for the green button?

@mblondel
Copy link
Member

mblondel commented Apr 5, 2012

Is that a green light for the green button?

Do you think that there are cases when the user wouldn't want to use
zero_based="auto"? If not, we could drop the option...

@larsmans
Copy link
Member Author

larsmans commented Apr 5, 2012

I can imagine a case where a test set has an all-zero first column. Loading that with "auto" will give the wrong number of features. Also, load_svmlight_files doesn't accept "auto".

@mblondel
Copy link
Member

mblondel commented Apr 5, 2012

I can imagine a case where a test set has an all-zero first column.
Loading that with "auto" will give the wrong number of features. Also,
load_svmlight_files doesn't accept "auto".

I really want load_svmlight_files to support the auto mode because I use
load_svmlight_files all the time to load the train and test sets. We could
use the first file to determine whether zero_based is True or not and apply
that to the other files, just like we did for n_features. Also the check
np.min(indices) > 0 should be done on the Python side. This way, it can be
done separetely in load_svmlight_file and in load_svmlight_files.

@mblondel
Copy link
Member

mblondel commented Apr 5, 2012

I really want load_svmlight_files to support the auto mode because I use
load_svmlight_files all the time to load the train and test sets. We could
use the first file to determine whether zero_based is True or not and apply
that to the other files, just like we did for n_features. Also the check
np.min(indices) > 0 should be done on the Python side. This way, it can be
done separetely in load_svmlight_file and in load_svmlight_files.

I can do that tomorrow and send you a PR.

@larsmans
Copy link
Member Author

larsmans commented Apr 5, 2012

I imagined it wouldn't matter; zero_based=True is safe for one-based indices, it just produces an extra all-zero column.

@mblondel
Copy link
Member

mblondel commented Apr 5, 2012

But the original reason you wanted to fix that is because the shape was
(n_samples, n_features+1), isn't it?

@larsmans
Copy link
Member Author

larsmans commented Apr 5, 2012

You mean the previous time I "fixed" this? Yes, but I thought the convention was one-based indexing since all the LibSVM datasets do that. This PR makes it possible to use both conventions, but the default settings are designed for safety rather than efficiency/elegance.

@larsmans
Copy link
Member Author

@mblondel: I implemented "auto" for load_svmlight_files. Please check it out, it might require one more test. The function now also guesses n_features based on all files' contents rather than just the first one.

@mblondel
Copy link
Member

@larsmans: Thanks a ton. Looks good to me. Sorry for not doing myself, this has been a very busy week for me.

@ogrisel
Copy link
Member

ogrisel commented Apr 12, 2012

Maybe add a couple of new test to highlight the differences when zero_based is False, True or "auto" in load_svmlight_files too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8

@amueller
Copy link
Member

Maybe I missed something but is there now a way to ensure that training and test file are read in a consistent way with "auto" if one of them contains a column of zeros?

@larsmans
Copy link
Member Author

Added a test for zero_based="auto" with multiple files, fixed a bug and addressed @amueller's comments.

@larsmans
Copy link
Member Author

@amueller: sorry, didn't see that last comment. In load_svmlight_files, if zero_based="auto" and any of the files contains a zero index, then all of them are assumed to have zero-based indices. Otherwise, one-based indices are assumed.

@amueller
Copy link
Member

Cool. Could you add a comment to load_svmlight_file that says something along these lines so that people know it is "saver" to use?

@larsmans
Copy link
Member Author

Done!

@amueller
Copy link
Member

👍 for merge

@ogrisel
Copy link
Member

ogrisel commented Apr 16, 2012

👍 for merge as well

@mblondel
Copy link
Member

Alright @larsmans. After this one you still have 4 pending PRs to merge :)

larsmans added a commit that referenced this pull request Apr 16, 2012
MRG re-allow zero-based indexes in SVMlight files
@larsmans larsmans merged commit 7bc2ca7 into scikit-learn:master Apr 16, 2012
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants