[BC-breaking] Standardize raw dataset doc strings and argument order. #1151

cpuhrsch · 2021-02-11T01:08:27Z

This PR comes with a few changes.

1. Addresses inconsistencies such as documentation listing the splits in the wrong order (train, valid, test vs. train, test, valid) and inconsistent indentation or whitespace, and standardizes the documentation of common arguments such as split and root across all raw datasets.

2. Signatures such as

def Multi30k(train_filenames=("train.de", "train.en"),
             valid_filenames=("val.de", "val.en"),
             test_filenames=("test_2016_flickr.de", "test_2016_flickr.en"),
             split=('train', 'valid', 'test'), root='.data', offset=0):

and

def AmazonReviewPolarity(root='.data', split=('train', 'test'), offset=0):

are aligned in their argument order.

If a user expects the root folder to come first, based on the signature of AmazonReviewPolarity, and calls Multi30k positionally, i.e. Multi30k("my_data_path"), they will inadvertently set the train_filenames argument to a single string. This PR standardizes on the majority convention: the root folder comes first, followed by split and offset, and then by dataset-dependent arguments such as language, year or train_filenames.
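The positional pitfall described above can be reproduced with a small sketch. The two functions below are hypothetical stand-ins that only mirror the argument orders discussed here; they are not the torchtext implementations, they download nothing, and the exact order of the dataset-dependent arguments after offset is an assumption.

```python
# Hypothetical stand-ins for the old and new Multi30k argument order.
# They only echo back what they received so the binding is visible.

def multi30k_old(train_filenames=("train.de", "train.en"),
                 valid_filenames=("val.de", "val.en"),
                 test_filenames=("test_2016_flickr.de", "test_2016_flickr.en"),
                 split=("train", "valid", "test"), root=".data", offset=0):
    return {"root": root, "train_filenames": train_filenames}


def multi30k_new(root=".data", split=("train", "valid", "test"), offset=0,
                 train_filenames=("train.de", "train.en"),
                 valid_filenames=("val.de", "val.en"),
                 test_filenames=("test_2016_flickr.de", "test_2016_flickr.en")):
    return {"root": root, "train_filenames": train_filenames}


# Old order: the path silently lands in train_filenames, root keeps its default.
old = multi30k_old("my_data_path")

# New order: the first positional argument is the root folder, as expected.
new = multi30k_new("my_data_path")

print(old)
print(new)
```

Callers that pass every argument by keyword are unaffected by the reordering; only positional calls change meaning, which is why the PR is labeled BC-breaking.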

3. Commonalities between docstrings are centralized and generated. This also includes a programmatic signature check to make sure all datasets have a consistent interface. This is achieved via a decorator that sanitizes input arguments and prepends a common documentation header.
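As a rough illustration of the docstring part of this idea, a decorator can prepend a shared header to each dataset function's docstring. This is a minimal sketch, not the code in this PR: the name prepend_common_header, the header wording, and ToyDataset are all hypothetical.

```python
def prepend_common_header(fn):
    """Prepend a shared description of the common arguments to fn's docstring.

    Illustrative only; the header text below is an assumption.
    """
    header = (
        "{name} dataset\n\n"
        "Args:\n"
        "    root: Directory where the dataset is saved.\n"
        "    split: Split or splits to be returned.\n"
        "    offset: Number of leading lines to skip.\n"
    ).format(name=fn.__name__)
    # Functions without a docstring have __doc__ set to None.
    fn.__doc__ = header + (fn.__doc__ or "")
    return fn


@prepend_common_header
def ToyDataset(root=".data", split=("train", "test"), offset=0, language="en"):
    """    language: Dataset-specific argument, documented by the dataset itself.
    """
    return root, split, offset, language


print(ToyDataset.__doc__)
```

With this shape, each dataset only documents its own dataset-dependent arguments, and the description of root, split and offset lives in one place.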

TODO

Verify docstring generation works with website doc generation

Follow-up

Decorator to append datapoint example including datapoint types
Represent dataset selections for translation datasets more compactly

codecov · 2021-02-11T02:10:44Z

Codecov Report

Merging #1151 (fcb415c) into master (ef363fa) will increase coverage by 0.37%.
The diff coverage is 96.96%.

@@            Coverage Diff             @@
##           master    #1151      +/-   ##
==========================================
+ Coverage   79.08%   79.46%   +0.37%     
==========================================
  Files          47       47              
  Lines        3108     3175      +67     
==========================================
+ Hits         2458     2523      +65     
- Misses        650      652       +2

| Impacted Files | Coverage Δ |
|---|---|
| torchtext/experimental/datasets/raw/common.py | `87.83% <91.42%> (+3.22%)` ⬆️ |
| ...ext/experimental/datasets/raw/language_modeling.py | `93.18% <100.00%> (+3.43%)` ⬆️ |
| ...htext/experimental/datasets/raw/question_answer.py | `100.00% <100.00%> (ø)` |
| ...text/experimental/datasets/raw/sequence_tagging.py | `94.00% <100.00%> (+0.38%)` ⬆️ |
| ...t/experimental/datasets/raw/text_classification.py | `87.50% <100.00%> (+3.57%)` ⬆️ |
| torchtext/experimental/datasets/raw/translation.py | `92.07% <100.00%> (+0.41%)` ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update ef363fa...fcb415c.

cpuhrsch · 2021-02-11T18:32:56Z

torchtext/experimental/datasets/raw/common.py

@@ -25,6 +27,87 @@ def wrap_datasets(datasets, split):
    return datasets


+def dataset_docstring_header(fn):


I can still add more documentation here

cpuhrsch · 2021-02-11T18:33:01Z

torchtext/experimental/datasets/raw/common.py

+    return fn
+
+
+def input_sanitization_decorator(fn):


I can still add more documentation here


mthrok

Looks good. Approving to unblock. But docstring can be improved.

mthrok · 2021-02-11T19:17:39Z

torchtext/experimental/datasets/raw/common.py

+            len(argspec.annotations) == 0
+            ):
+        raise ValueError("Internal Error: Given function {} did not adhere to standard signature.".format(fn))
+


It is not intuitively clear what the following part intends to do, so adding comments would be helpful.
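For readers following this thread, the kind of check that ends in the ValueError quoted in the hunk can be sketched with inspect.getfullargspec. This is a simplified illustration under assumptions, not the code under review: the helper name check_standard_signature and the exact set of conditions are hypothetical, and only the error message is taken from the hunk.

```python
import inspect


def check_standard_signature(fn):
    """Raise ValueError unless fn starts with (root, split, offset).

    Simplified sketch; the conditions checked here are assumptions.
    """
    argspec = inspect.getfullargspec(fn)
    ok = (
        argspec.args[:3] == ["root", "split", "offset"]
        and argspec.varargs is None        # no *args
        and argspec.varkw is None          # no **kwargs
        and len(argspec.kwonlyargs) == 0   # no keyword-only arguments
        and len(argspec.annotations) == 0  # no type annotations
    )
    if not ok:
        raise ValueError(
            "Internal Error: Given function {} did not adhere to standard signature.".format(fn)
        )
    return fn


def good(root=".data", split=("train", "test"), offset=0, language="en"):
    return root


def bad(train_filenames=("train.de", "train.en"), split=("train",), root=".data", offset=0):
    return root


check_standard_signature(good)  # passes silently

try:
    check_standard_signature(bad)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Running such a check at decoration time means a dataset with a nonstandard signature fails on import rather than when a user first calls it.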

mthrok · 2021-02-11T19:21:21Z

torchtext/experimental/datasets/raw/common.py

+    return fn
+
+
+def input_sanitization_decorator(fn):


That would be nice. The second half of the implementation is not intuitive to a first-time reader.
nit: Also, I would use a verb or adjective for the decorator name. Even if it is a noun, I would not call it @XXXdecorator, as that sounds redundant.

zhangguanheng66

We can monitor the nightly release doc and check if this actually works. http://pytorch.org/text/master/

cpuhrsch · 2021-02-11T23:36:27Z

The master documentation is currently not pushing html for the raw datasets. I'll send a PR to address this.

Standardize raw dataset doc strings.

8f6f74c

facebook-github-bot added the cla signed label Feb 11, 2021

Christian Puhrsch added 2 commits February 10, 2021 17:18

Further edits

9627a80

Further edits

0250829

cpuhrsch changed the title ~~[WIP] Standardize raw dataset doc strings.~~ [WIP] Standardize raw dataset doc strings and argument order. Feb 11, 2021

Christian Puhrsch added 2 commits February 10, 2021 17:29

Further edits

f20c009

Further edits

4b5bcad

cpuhrsch changed the title ~~[WIP] Standardize raw dataset doc strings and argument order.~~ [WIP][BC-breaking] Standardize raw dataset doc strings and argument order. Feb 11, 2021

Remove capitalization from Dataset

a34e72c

Christian Puhrsch added 9 commits February 10, 2021 18:25

Generate common headers

08baefb

Generate language dataset documentation headers

eba15aa

Create decorator for input sanitization and docstring generation

46ac3d9

Remove unused imports

fcc3610

Tidy up the decorator a bit more

b40ff3b

Separate dataset_docstring_header_decorator

80b4ace

Move Default argument to newline, fix translation docs, remove use of check_default_set from IDMB

dad5c5c

Lint

86fce58

Propagate default keyword arguments in wrapper

afd88f1

cpuhrsch marked this pull request as ready for review February 11, 2021 18:31

cpuhrsch commented Feb 11, 2021

cpuhrsch changed the title ~~[WIP][BC-breaking] Standardize raw dataset doc strings and argument order.~~ [BC-breaking] Standardize raw dataset doc strings and argument order. Feb 11, 2021

mthrok approved these changes Feb 11, 2021

Christian Puhrsch added 2 commits February 11, 2021 13:11

Rename decorators

b450a30

Add more documentation

fcb415c

cpuhrsch requested a review from zhangguanheng66 February 11, 2021 21:27

zhangguanheng66 approved these changes Feb 11, 2021

cpuhrsch merged commit 9053d95 into pytorch:master Feb 12, 2021
