Skip to content

revert the BC breaking change in enwik9 data #1559

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion torchtext/datasets/enwik9.py
Original file line number Diff line number Diff line change
Expand Up @@ -50,4 +50,4 @@ def EnWik9(root: str, split: Union[Tuple[str], str]):
)

data_dp = FileOpener(cache_decompressed_dp, mode="b")
return data_dp.readlines(decode=True, return_path=False)
return data_dp.readlines(decode=True, return_path=False, strip_newline=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, it is good to strip the new lines at the end of each line. Can we change this going forward?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it breaks BC, but does it matter as we are revamping the library?

Copy link
Contributor Author

@parmeet parmeet Feb 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually, it is good to strip the new lines at the end of each line. Can we change this going forward?

It's a good point, I am not so sure why it is like that when they were first added. In-fact now when I think about it, it's not even consistent across our datasets.

I know it breaks BC, but does it matter as we are revamping the library?

Well, we wanted the migration to be not BC breaking. But if we do end up doing so, let's make sure to note all the BC breaking changes in release notes.

Edit: Also would be great not to mix the BC breaking changes with the migration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, we wanted the migration to be not BC breaking. But if we do end up doing so, let's make sure to note all the BC breaking changes in release notes.

Edit: Also would be great not to mix the BC breaking changes with the migration.

I also updated the SST2 dataset to return a int for the labels even though it was returning strings before because that seemed like the expected behavior. Does it matter whether we do it in the same PR as the migration vs a different PR if we note all the BC breaking changes in the PR summary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it matter whether we do it in the same PR as the migration vs a different PR if we note all the BC breaking changes in the PR summary?

In general, it's a good practice to logically separate PRs that makes it easy for tracking purpose as well as to review :)