Add Torchscriptable GPT-2 BPE Tokenizer for RoBERTa models #1462
Conversation
Overall LGTM. Just had a question and a nit comment.
torchtext/transforms.py (Outdated)
```python
    :type byte_encoder: Dict[int, str]
    """
    SEPERATOR: str = "\u0001"
    LSTRIP_PATTERN: str = "\u0120"
```
Class attributes are tricky here. If anyone needs to use a custom separator/lstrip_pattern, they have to modify them on the class:
GPT2BPETokenizer.SEPARATOR = my_separator
GPT2BPETokenizer.LSTRIP_PATTERN = my_lstrip_pattern
foo = GPT2BPETokenizer.load_xxx(...)
It looks like the separator and lstrip_pattern are logically connected with the data to be loaded (the JSON file and the BPE file), which means the customization should best happen at loading time.
Making them class attributes means that if I have two BPE tokenizers, one with the default and the other with a custom separator/lstrip_pattern, I cannot use them at the same time.
Looking at the implementation of the get_pairs and bpe methods, self.merge_seperator is used. This means these values are better coupled with the instance than with the class.
Also, this approach forces all the load_XXX functions to be classmethods. In general in Python, there is not much advantage to using classmethod/staticmethod over a plain module-level function (except for logical grouping of functions, and the very specific case where the method is intended to be overridden from a child class).
Having class methods could also be troublesome if the instance is going to be used in a subprocess (i.e. the instances are pickled and unpickled when passed from the main process to a subprocess). I am not sure that's the case for this class, but in general it's better not to have them.
Therefore, if these attributes are often customized, I recommend making them optional constructor arguments; otherwise, keeping them as module-level variables wouldn't hurt much.
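For illustration, here is a minimal sketch of the constructor-argument option suggested above; the class layout, argument names, and dictionary types are illustrative assumptions, not the actual API in this PR:

```python
from typing import Dict


class GPT2BPETokenizer:
    def __init__(
        self,
        bpe_encoder: Dict[str, int],
        bpe_merge_ranks: Dict[str, int],
        seperator: str = "\u0001",       # default matches the value in the diff above
        lstrip_pattern: str = "\u0120",  # default matches the value in the diff above
    ) -> None:
        self.bpe_encoder = bpe_encoder
        self.bpe_merge_ranks = bpe_merge_ranks
        # Stored on the instance, so two tokenizers with different
        # separator / lstrip_pattern values can coexist in one process.
        self.seperator = seperator
        self.lstrip_pattern = lstrip_pattern
```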
@mthrok Thanks for the detailed suggestions. Your points make a lot of sense and I will try to get rid of the unnecessary class methods.
One thing I would like to emphasize is that these class attributes are not designed to be customizable. The LSTRIP_PATTERN is how GPT-2 encodes spaces. We don't need users to change this unless they want to try out an idea that differs from the original implementation. As far as SEPERATOR is concerned, that is our workaround because Torchscript does not support Tuple as Dict keys. SEPERATOR is therefore closely tied to our implementation, and we don't expect any user to modify it.
> Also, this approach forces all the load_XXX functions to be classmethods. In general in Python, there is not much advantage to using classmethod/staticmethod over a plain module-level function (except for logical grouping of functions, and the very specific case where the method is intended to be overridden from a child class).
I am not sure I understand why this forces us to make class methods. Let me refactor these to be ScriptModule constants as described at the bottom of https://pytorch.org/docs/stable/jit_language_reference.html
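For context, a hedged sketch of the ScriptModule-constant pattern from the TorchScript language reference; the module name and the trivial forward body are placeholders and do not reproduce the tokenizer's real logic:

```python
from typing import Final

import torch
from torch import nn


class GPT2BPETokenizerModule(nn.Module):
    # Attributes annotated with Final are treated as constants by TorchScript.
    SEPERATOR: Final[str]
    LSTRIP_PATTERN: Final[str]

    def __init__(self) -> None:
        super().__init__()
        self.SEPERATOR = "\u0001"
        self.LSTRIP_PATTERN = "\u0120"

    def forward(self, text: str) -> str:
        # Placeholder body only; the real BPE logic is not shown here.
        return self.LSTRIP_PATTERN + text


scripted = torch.jit.script(GPT2BPETokenizerModule())
```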
torchtext/transforms.py (Outdated)
```python
        bpe_vocab = f.read()

        return {
            cls.SEPERATOR.join(merge_pair.split()): i
```
Curious, why can the BPE file not be formatted as JSON?
If bpe_vocab and bpe_encoder are in the same format, then they can share the same I/O logic.
Technically, nothing stops us from formatting them as JSON. However, thinking from a user's perspective, we should align with the file formats released by AllenAI, HuggingFace, etc. If people train their own GPT-2 tokenizers using the original implementation from any of these libraries, the files they produce will likely be in the same format. Hence, I suggest keeping the same file format and letting the implementation deal with the Torchscript-specific nuances. I hope that makes sense.
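To make the format discussion concrete, here is a minimal sketch of how a merges file in the released plain-text layout (an optional leading comment, then one "token_a token_b" pair per line) could be flattened into the Torchscript-friendly dict shown in the diff above. The function name and the comment-skipping detail are assumptions for illustration:

```python
from typing import Dict

SEPERATOR = "\u0001"  # workaround: TorchScript dicts cannot use Tuple keys


def load_bpe_merge_ranks(path: str) -> Dict[str, int]:
    with open(path, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()
    # Skip blank lines and a leading "#version: ..." style comment if present.
    merge_pairs = [line for line in lines if line and not line.startswith("#")]
    # Flatten each ("token_a", "token_b") pair into a single string key,
    # keeping the merge rank (line order) as the value.
    return {SEPERATOR.join(pair.split()): rank for rank, pair in enumerate(merge_pairs)}
```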
I recommend documenting how the new assets are generated: ideally in the asset file itself if the format allows comments, otherwise in the README of the unit tests or in the test class.
I think it's better to hand-craft the test assets. I cannot check the file size of the 50,000 lines of text since I'm on mobile, but it would increase the repo size, the git history size, memory consumption during the unit tests, and the time spent on test preparation. Also, the vast majority of the data in the assets is not used during the tests, and at this size no one will ever inspect the content, so it obfuscates the logic of the tests. I think curating a minimal set of test assets would better serve future maintainers. The test with the full-blown BPE data can go into an integration test where the BPE data are retrieved from the internet on the fly (and maybe cached).
The reason I added the full BPE vocab was that I wanted to make sure the tests capture some of the edge cases with unicode special characters that we have seen internally. The two files are 500 kB and 1 MB respectively; I am wondering whether this will be a big issue. Since most users will mostly use the full-blown GPT-2 BPE from Allen AI, I thought that adding tests with the full data may be better.
Assets are usually generated by the developers of pre-trained models, and currently they will be generated via other frameworks. I can add some asset descriptions in the tests, because we cannot rely on comments in the asset files as they can become inconsistent.
Sounds good!
I think the approach of using the full-blown GPT-2 BPE is reasonable. If the size is really a concern, we could use the files directly from the AWS server (since we are going to upload them there for usage with pre-trained models).
This sounds to me like the issue is more about coverage of unicode characters than edge cases. In the current structure, this coverage is defined by BPE data that originates not from library code but from external data, so the signal we get from the test does not necessarily align with the end-user experience.
Hey @mthrok, I think I did not explain myself correctly in the previous comment :-) The test is not concerned with specific unicode characters. The unicode tests mainly verify that the behavior on out-of-vocabulary tokens containing multi-byte characters is correct and consistent between the scripted and non-scripted versions. Also, I do expect the test data to be the same as the shipped BPE data. I don't expect the shipped data in OSS to change for almost all users, unless someone wants to use their own BPE vocab or a new pre-trained model is released in OSS that leverages a much larger dataset to learn the vocab. Hope that makes sense. Most SOTA text research generally uses the data as-is in OSS; the RoBERTa models open-sourced by FAIR are a great example, since they reused the tokenizer and the data that Allen AI released for GPT. If you still feel that the test is not right, I could try to hand-craft custom data as a small subset of the shipped data.
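For reference, a rough sketch of the scripted-vs-eager consistency check being described; the constructor arguments and asset paths are assumptions for illustration, not necessarily the final API of this PR:

```python
import torch
from torchtext.transforms import GPT2BPETokenizer

# Placeholder paths for the GPT-2 encoder JSON and BPE merges assets.
tokenizer = GPT2BPETokenizer(
    encoder_json_path="encoder.json",
    vocab_bpe_path="vocab.bpe",
)
scripted_tokenizer = torch.jit.script(tokenizer)

# Out-of-vocabulary input containing multi-byte (non-ASCII) characters.
sample = "Serafina Pekała naïve café résumé"
assert tokenizer(sample) == scripted_tokenizer(sample)
```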
LGTM! Please make sure to update the summary (potentially as a checklist) with any relevant outstanding issues discussed in this PR, so that we can address them in follow-up PRs.
@abhinavarora @parmeet I do not want to block the PR, so you folks can go ahead, but once a file is checked in, that action cannot be undone, so I present my thoughts one more time.
Thanks for the explanation. I think both correctness checking and TS consistency checking can be done without the 50k data. In fact, correctness is easier to verify by eye when the data are small and hand-crafted. So I still wonder why it is absolutely necessary to use and check in the 50k data.
I understand that the shipped data are unlikely to change. But if these data files are checked in, there are now two different sources of truth: one for testing and one for the shipped package. I am not against testing on full-blown data, but if such tests are done, I think they should be done on the shipped package, i.e. in the form of an integration test that fetches the shipped data on the fly.
BTW, reading the documentation on
@mthrok Thank you for the feedback.
Hey @mthrok Your points make total sense. @parmeet and I have been talking about integration tests, and we plan to create a new task to set up integration testing for this library. Let me add an action item on myself to replace this test with a dummy hand-crafted test once integration testing is ready. I want to make sure this testing is in place because I plan to implement some of the components from this PR in C++, so having it is of high importance. One thing I did not understand from your comment is why a file check-in cannot be undone. We can always replace the test files, right? I'm wondering if there is an additional perspective I am missing. Thank you again for all your awesome feedback. I am updating the summary of this PR with action items so that we can follow up on all your suggestions before closing out this workstream.
I think @parmeet was forced to design Vocab this way because
Even if the file is updated later, the original file stays in the git history, so the file size transferred with git clone still includes it.
Description
This PR implements the GPT-2 BPE tokenizer used by RoBERTa models. The implemented tokenizer is an initial working version that is scriptable. This PR also adds test cases for the tokenizer.
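A minimal usage sketch of what "scriptable" means here, assuming the transform lands in torchtext.transforms and takes the encoder JSON and BPE merges paths as constructor arguments (both assumptions for illustration):

```python
import torch
from torchtext.transforms import GPT2BPETokenizer

tokenizer = GPT2BPETokenizer(
    encoder_json_path="encoder.json",  # placeholder path to the GPT-2 encoder JSON
    vocab_bpe_path="vocab.bpe",        # placeholder path to the BPE merges file
)

# The transform can be compiled with TorchScript and serialized, e.g. to ship
# it as part of a RoBERTa preprocessing pipeline.
scripted_tokenizer = torch.jit.script(tokenizer)
scripted_tokenizer.save("gpt2_bpe_tokenizer.pt")
```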
Testing
pytest -k test_gpt2_bpe_tokenizer test/test_transforms.py
Future Work
Follow-ups to this PR