[BC-breaking] Standardize raw dataset doc strings and argument order. #1151
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #1151      +/-   ##
==========================================
+ Coverage   79.08%   79.46%   +0.37%
==========================================
  Files          47       47
  Lines        3108     3175      +67
==========================================
+ Hits         2458     2523      +65
- Misses        650      652       +2

Continue to review the full report at Codecov.
… check_default_set from IMDB
@@ -25,6 +27,87 @@ def wrap_datasets(datasets, split):
    return datasets


def dataset_docstring_header(fn):
I can still add more documentation here
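For readers following along, here is a minimal sketch of what a docstring-header decorator along these lines could look like; the header text and formatting are assumptions for illustration, not the implementation merged in this PR:

```python
def dataset_docstring_header(fn):
    """Prepend a shared documentation header to a raw dataset function.

    Sketch only: the real header text and any per-dataset substitutions
    (e.g. the dataset name or default splits) may differ in the PR.
    """
    header = (
        "{name} dataset\n\n"
        "Args:\n"
        "    root: Directory where the dataset files are saved.\n"
        "    split: split or splits to be returned.\n\n"
    ).format(name=fn.__name__)
    fn.__doc__ = header + (fn.__doc__ or "")
    return fn
```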
    return fn


def input_sanitization_decorator(fn):
I can still add more documentation here
That will be nice. The second half of the implementation is not intuitive to a first-time reader.
nit: Also, I would use a verb or adjective for the decorator name. Even if it is a noun, I would not call it @XXXdecorator; the "decorator" suffix sounds redundant.
Looks good. Approving to unblock, but the docstring can be improved.
        len(argspec.annotations) == 0
    ):
        raise ValueError("Internal Error: Given function {} did not adhere to standard signature.".format(fn))
It is not intuitively clear what the following part intends to do, so adding comments would be helpful.
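To illustrate the kind of commented version being asked for, here is a hedged sketch of a standard-signature check built on inspect.getfullargspec; the exact conditions (which argument names are required and in what order) are assumptions rather than the PR's precise rules:

```python
import inspect


def check_standard_signature(fn):
    """Raise if a raw dataset function does not follow the standardized signature.

    Sketch only: the conditions below approximate the intent of the check in
    this PR (root and split first, annotated signature), not its exact code.
    """
    argspec = inspect.getfullargspec(fn)
    if (
        argspec.args[:2] != ["root", "split"]   # standardized leading arguments
        or len(argspec.annotations) == 0        # signature must carry annotations
    ):
        raise ValueError(
            "Internal Error: Given function {} did not adhere to standard signature.".format(fn)
        )
    return fn
```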
We can monitor the nightly release doc and check if this actually works. http://pytorch.org/text/master/
The master documentation is currently not pushing HTML for the raw datasets. I'll send a PR to address this.
This PR comes with a few changes:

1. Addresses inconsistencies such as documentation indicating the incorrect order of splits (train, valid, test vs. train, test, valid) and inconsistent indentation or whitespace, and standardizes the documentation of common arguments such as split and root across all raw datasets; a rough example of the standardized layout is sketched below.
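As an illustration (wording approximated, not quoted verbatim from the PR), the standardized documentation of the common arguments could read like this:

```python
def IMDB(root=".data", split=("train", "test")):
    """IMDB dataset (sketch of the standardized docstring layout).

    Args:
        root: Directory where the datasets are saved. Default: ".data"
        split: split or splits to be returned. Can be a string or a tuple of
            strings. Default: ("train", "test")
    """
```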
2. Signatures such as Multi30k's and AmazonReviewPolarity's are aligned in the order of their arguments. If a user expects to pass the root folder first and calls Multi30k positionally, i.e. Multi30k("my_data_path"), because that behavior is expected based on the signature of AmazonReviewPolarity, they will inadvertently set the train_filenames argument to a single string. This PR standardizes on the majority by moving the root folder first, followed by split and offset, and then by dataset-dependent arguments such as language, year, or train_filenames; see the sketch below.
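The pitfall can be made concrete with hypothetical before/after signatures (argument names and defaults here are approximations of the real ones):

```python
# Before: some datasets took their dataset-specific arguments before root.
def Multi30k_old(train_filenames=("train.de", "train.en"), root=".data"):
    return train_filenames, root


# After: root (and split, where applicable) always come first.
def Multi30k_new(root=".data", train_filenames=("train.de", "train.en")):
    return root, train_filenames


# A user mimicking AmazonReviewPolarity("my_data_path") and calling positionally:
print(Multi30k_old("my_data_path"))  # ('my_data_path', '.data') -> train_filenames silently overwritten
print(Multi30k_new("my_data_path"))  # ('my_data_path', ('train.de', 'train.en')) -> root set as intended
```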
3. Commonalities between the documentation are centralized and generated. This also includes a programmatic signature check to make sure all datasets have a consistent interface. This is achieved via a decorator that sanitizes input arguments and prepends a common documentation header.
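To make the sanitization half concrete as well, here is a self-contained sketch of a decorator that normalizes and validates the split argument before calling the dataset function; the validation rules and the AG_NEWS placeholder are assumptions for illustration, not the PR's exact code (the header prepending and signature check are sketched earlier in this thread):

```python
import functools


def sanitize_split(fn):
    """Sketch: accept a string or tuple for `split` and validate it.

    Assumes the decorated function's signature is (root, split, ...) with the
    default splits given as the second default value.
    """
    default_splits = fn.__defaults__[1]

    @functools.wraps(fn)
    def wrapper(root=".data", split=default_splits, **kwargs):
        # Normalize a single string into a one-element tuple.
        splits = (split,) if isinstance(split, str) else tuple(split)
        for s in splits:
            if s not in default_splits:
                raise TypeError("'{}' is not a valid split; choose from {}".format(s, default_splits))
        return fn(root, splits, **kwargs)

    return wrapper


@sanitize_split
def AG_NEWS(root=".data", split=("train", "test")):
    """AG_NEWS raw dataset (placeholder body for illustration)."""
    return root, split


print(AG_NEWS(split="train"))  # ('.data', ('train',))
```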
TODO
Follow-up