Skip to content

[BC-breaking] Standardize raw dataset doc strings and argument order. #1151

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Feb 12, 2021

Conversation

cpuhrsch
Copy link
Contributor

@cpuhrsch cpuhrsch commented Feb 11, 2021

This PR comes with a few changes

1.

Addresses inconsistencies such as documentation indicating the incorrect order of splits (train, valid, test vs. train, test, valid), inconsistent indentation or whitespace and standardizes documentation of common arguments such as split and root across all raw datasets.

2.

Signatures such as

def Multi30k(train_filenames=("train.de", "train.en"),
             valid_filenames=("val.de", "val.en"),
             test_filenames=("test_2016_flickr.de", "test_2016_flickr.en"),
             split=('train', 'valid', 'test'), root='.data', offset=0):

and

AmazonReviewPolarity(root='.data', split=('train', 'test'), offset=0):

are aligned in the order of their arguments.

If a user expects to pass the root folder first and calls Multi30k positionally, i.e. Multi30k("my_data_path"), because the behavior is expected based on the signature of AmazonReviewPolarity, they will inadvertently set the train_filenames argument to a single string. This PR standardizes on the majority by moving the root folder first, followed by split and offset and then by dataset dependent arguments such as language, year or train_filenames.

3.

Commonalities between documentation are centralized and generated. This also includes a programatic signature check to make sure all datasets have a consistent interface. This is achieved via a decorator that sanitizes input arguments and preprends a common documentation header.

TODO

  • Verify docstring generation works with website doc generation

Follow-up

  • Decorator to append datapoint example including datapoint types
  • Represent dataset selections for translation datasets more compactly

@cpuhrsch cpuhrsch changed the title [WIP] Standardize raw dataset doc strings. [WIP] Standardize raw dataset doc strings and argument order. Feb 11, 2021
@cpuhrsch cpuhrsch changed the title [WIP] Standardize raw dataset doc strings and argument order. [WIP][BC-breaking] Standardize raw dataset doc strings and argument order. Feb 11, 2021
@codecov
Copy link

codecov bot commented Feb 11, 2021

Codecov Report

Merging #1151 (fcb415c) into master (ef363fa) will increase coverage by 0.37%.
The diff coverage is 96.96%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1151      +/-   ##
==========================================
+ Coverage   79.08%   79.46%   +0.37%     
==========================================
  Files          47       47              
  Lines        3108     3175      +67     
==========================================
+ Hits         2458     2523      +65     
- Misses        650      652       +2     
Impacted Files Coverage Δ
torchtext/experimental/datasets/raw/common.py 87.83% <91.42%> (+3.22%) ⬆️
...ext/experimental/datasets/raw/language_modeling.py 93.18% <100.00%> (+3.43%) ⬆️
...htext/experimental/datasets/raw/question_answer.py 100.00% <100.00%> (ø)
...text/experimental/datasets/raw/sequence_tagging.py 94.00% <100.00%> (+0.38%) ⬆️
...t/experimental/datasets/raw/text_classification.py 87.50% <100.00%> (+3.57%) ⬆️
torchtext/experimental/datasets/raw/translation.py 92.07% <100.00%> (+0.41%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ef363fa...fcb415c. Read the comment docs.

@cpuhrsch cpuhrsch marked this pull request as ready for review February 11, 2021 18:31
@@ -25,6 +27,87 @@ def wrap_datasets(datasets, split):
return datasets


def dataset_docstring_header(fn):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can still add more documentation here

return fn


def input_sanitization_decorator(fn):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can still add more documentation here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be nice. The second half of the implementation is not intuitive to the first time reader.
nit: Also I would use verb of adjective for decorator name. Even if it is noun, I will not call it as @XXXdecorator sounds redundant.

@cpuhrsch cpuhrsch changed the title [WIP][BC-breaking] Standardize raw dataset doc strings and argument order. [BC-breaking] Standardize raw dataset doc strings and argument order. Feb 11, 2021
Copy link
Contributor

@mthrok mthrok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Approving to unblock. But docstring can be improved.

len(argspec.annotations) == 0
):
raise ValueError("Internal Error: Given function {} did not adhere to standard signature.".format(fn))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The following part is not intuitively clear what it intends to do, so adding comments will be helpful.

return fn


def input_sanitization_decorator(fn):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That will be nice. The second half of the implementation is not intuitive to the first time reader.
nit: Also I would use verb of adjective for decorator name. Even if it is noun, I will not call it as @XXXdecorator sounds redundant.

Copy link
Contributor

@zhangguanheng66 zhangguanheng66 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can monitor the nightly release doc and check if this actually works. http://pytorch.org/text/master/

@cpuhrsch
Copy link
Contributor Author

The master documentation is currently not pushing html for the raw datasets. I'll send a PR to address this.

@cpuhrsch cpuhrsch merged commit 9053d95 into pytorch:master Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants