Add Torchscriptable GPT-2 BPE Tokenizer for RoBERTa models #1462


Merged
merged 6 commits into pytorch:main from gpt2_bpe_tok
Dec 16, 2021

Conversation

abhinavarora
Contributor

@abhinavarora abhinavarora commented Dec 14, 2021

Description

This PR implements the GPT-2 BPE tokenizer used by RoBERTa models. The implemented tokenizer is an initial working version that is scriptable. This PR also adds test cases for the tokenizer.

Testing

pytest -k test_gpt2_bpe_tokenizer test/test_transforms.py
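
For context, a minimal usage sketch of the transform under test; the import path, constructor arguments, and asset file names below are assumptions for illustration, not necessarily the exact API in this PR:

import torch
from torchtext.transforms import GPT2BPETokenizer  # import path is an assumption

# Hypothetical paths to the GPT-2 BPE assets (encoder JSON + merges file).
tokenizer = GPT2BPETokenizer(
    encoder_json_path="gpt2_bpe_encoder.json",
    vocab_bpe_path="gpt2_bpe_vocab.bpe",
)

# The point of the PR: the transform survives TorchScript scripting and
# produces the same output as the eager module.
scripted = torch.jit.script(tokenizer)
assert tokenizer("Hello world!") == scripted("Hello world!")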

Future Work

  1. Implement Caching
  2. Move implementation to C++
  3. Implement support for character indices with tokenization

Follow Ups to this PR

  • Refactor tokenizer organization in TorchText and set up tokenization-specific constants and util functions.
  • Once integration testing is set up for the repo, create a new hand-crafted unit test and move the "shipped BPE model" tests to integration testing.
  • Set up time with @mthrok for an in-depth discussion on accepting file paths vs. file-like objects.

Contributor

@Nayef211 Nayef211 left a comment

Overall LGTM. Just had a question and a nit comment

:type byte_encoder: Dict[int, str]
"""
SEPERATOR: str = "\u0001"
LSTRIP_PATTERN: str = "\u0120"
Contributor

Class attributes are tricky here. If anyone needs a custom separator/lstrip_pattern, one has to modify them on the class.

GPT2BPETokenizer.SEPARATOR = my_separator
GPT2BPETokenizer.LSTRIP_PATTERN = my_lstrip_pattern
foo = GPT2BPETokenizer.load_xxx(...)

It looks like the separator and lstrip_pattern are logically connected to the data being loaded (the JSON file and the BPE file), which means the customization should best happen at loading time.
Making them class attributes means that if I have two BPE tokenizers, one with the default and the other with a custom separator/lstrip_pattern, I cannot use them at the same time.

Looking at the implementation of the get_pairs and bpe methods, self.merge_seperator is used. This means these values are better coupled with the instance than with the class.

Also, this approach forces all the load_XXX functions to be classmethods. In general in Python, there is not much advantage to using classmethod/staticmethod over a plain module-level function (except for logical grouping, and the very specific case where the method is intended to be overridden in a subclass).

Having class methods could also be troublesome if the instance is going to be used in a subprocess (i.e. instances are pickled and unpickled when passed from the main process to a subprocess; I am not sure that's the case for this class, but in general it's better not to have them).

Therefore, if these attributes are often customized, I recommend making them optional constructor arguments; otherwise, putting them in a module-level variable wouldn't hurt much.
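
A minimal sketch of the constructor-argument alternative; the class name, argument names, and defaults here are illustrative, not the PR's actual signature:

from typing import Dict

# Module-level defaults (hypothetical names). Per-instance overrides go
# through the constructor instead of mutating class attributes.
DEFAULT_SEPARATOR = "\u0001"
DEFAULT_LSTRIP_PATTERN = "\u0120"


class GPT2BPETokenizerSketch:
    def __init__(
        self,
        bpe_encoder: Dict[str, int],
        bpe_merge_ranks: Dict[str, int],
        separator: str = DEFAULT_SEPARATOR,
        lstrip_pattern: str = DEFAULT_LSTRIP_PATTERN,
    ) -> None:
        self.bpe_encoder = bpe_encoder
        self.bpe_merge_ranks = bpe_merge_ranks
        # Coupled to the instance, so two tokenizers with different
        # separators can coexist in the same process.
        self.separator = separator
        self.lstrip_pattern = lstrip_pattern


# Two instances with different separators, usable at the same time.
default_tok = GPT2BPETokenizerSketch({}, {})
custom_tok = GPT2BPETokenizerSketch({}, {}, separator="\u0002")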

Contributor Author

@mthrok Thanks for the detailed suggestions. Your points make a lot of sense and I will try to get rid of the unnecessary class methods.

One thing I would like to emphasize is that these class attributes are not designed to be customizable. The LSTRIP_PATTERN is how GPT-2 encodes spaces; we don't need users to change this unless they want to try out an idea that differs from the original implementation. As far as SEPERATOR is concerned, that is our workaround because TorchScript does not support Tuple as Dict keys. SEPERATOR is therefore closely tied to our implementation, and we don't expect any user to modify it.

Also, this approach forces all the load_XXX functions to be classmethods. In general in Python, there is not much advantage to using classmethod/staticmethod over a plain module-level function (except for logical grouping, and the very specific case where the method is intended to be overridden in a subclass).

I am not sure I understand why this forces us to use classmethods. Let me refactor these to be ScriptModule constants as described at the bottom of https://pytorch.org/docs/stable/jit_language_reference.html

bpe_vocab = f.read()

return {
cls.SEPERATOR.join(merge_pair.split()): i
Contributor

Curious: why can't the BPE file be formatted as JSON?
If bpe_vocab and bpe_encoder are in the same format, they can share the same I/O logic.

Contributor Author

Technically, nothing stops us from formatting them as JSON. However, thinking from a user's perspective, we should align with the file formats released by AllenAI, HuggingFace, etc. If people train their own GPT-2 tokenizers using the original implementation or any of these libraries, the files they produce will likely be in the same format. Hence, I suggest keeping the same file format and letting the implementation deal with TorchScript-specific nuances. Hope that makes sense?
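
To make the workaround concrete, a sketch of loading the released merges format into a TorchScript-friendly dictionary by joining each merge pair into a single string key; the function name and the SEPARATOR constant here are illustrative (the PR spells it SEPERATOR):

from typing import Dict

SEPARATOR = "\u0001"  # stands in for the PR's SEPERATOR constant


def load_bpe_merge_ranks(path: str) -> Dict[str, int]:
    # Each line of a GPT-2 merges file is "token_a token_b"; TorchScript
    # dicts cannot use Tuple[str, str] keys, so join the pair into one key.
    merge_ranks: Dict[str, int] = {}
    with open(path, "r", encoding="utf-8") as f:
        for rank, line in enumerate(f.read().split("\n")):
            if not line or line.startswith("#"):
                continue  # skip blanks and the optional "#version" header
            pair = line.split()
            if len(pair) != 2:
                continue
            merge_ranks[SEPARATOR.join(pair)] = rank
    return merge_ranks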

Contributor

@mthrok mthrok left a comment

I recommend documenting how the new assets are generated: ideally in the asset file itself if the format allows comments, otherwise either in the README of the unit tests or in the test class.

https://github.com/pytorch/audio/pull/1104/files#diff-c8c2b1f2318cddbb6e9602b7615608204bd21d175922f7661466d77dd846c693R258

@mthrok
Contributor

mthrok commented Dec 14, 2021

I think it's better to hand-craft the test assets. I cannot check the file size of the 50,000 lines of text since I'm on mobile, but it would increase the repo size, the git history size, memory consumption during the unit tests, and time spent on test preparation. Also, the vast majority of the data in the assets is not used during the test, and at this size no one will ever inspect the content, which obfuscates the logic of the tests. I think curating a minimal set of test assets would better help future maintainers.

I think the test with the full-blown BPE data can go into an integration test where the BPE data are retrieved from the internet on the fly (and maybe cached).

@abhinavarora
Contributor Author

I think it's better to hand-craft the test assets. I cannot check the file size of the 50,000 lines of text since I'm on mobile, but it would increase the repo size, the git history size, memory consumption during the unit tests, and time spent on test preparation. Also, the vast majority of the data in the assets is not used during the test, and at this size no one will ever inspect the content, which obfuscates the logic of the tests. I think curating a minimal set of test assets would better help future maintainers.

I think the test with the full-blown BPE data can go into an integration test where the BPE data are retrieved from the internet on the fly (and maybe cached).

The reason I added the full BPE vocab was because I wanted to make sure that tests capture some of the edge cases with unicode special characters that we have seen internally. The two files are 500 kB and 1 MB respectively; wondering if this will be a big issue. Since most users will use the full-blown GPT-2 BPE from Allen AI, I thought that adding tests with the full one might be better.

@abhinavarora
Contributor Author

I recommend documenting how the new assets are generated: ideally in the asset file itself if the format allows comments, otherwise either in the README of the unit tests or in the test class.

https://github.com/pytorch/audio/pull/1104/files#diff-c8c2b1f2318cddbb6e9602b7615608204bd21d175922f7661466d77dd846c693R258

Assets are usually generated by the developers of pre-trained models, and currently they will be generated via other frameworks. I can add some asset descriptions in the test, since we cannot rely on comments in the asset files; those can become inconsistent.

@parmeet
Contributor

parmeet commented Dec 15, 2021

I recommend documenting how the new assets are generated: ideally in the asset file itself if the format allows comments, otherwise either in the README of the unit tests or in the test class.
https://github.com/pytorch/audio/pull/1104/files#diff-c8c2b1f2318cddbb6e9602b7615608204bd21d175922f7661466d77dd846c693R258

Assets are usually generated by the developers of pre-trained models, and currently they will be generated via other frameworks. I can add some asset descriptions in the test, since we cannot rely on comments in the asset files; those can become inconsistent.

Sounds good!

@parmeet
Contributor

parmeet commented Dec 15, 2021

I think it's better to hand-craft the test assets. I cannot check the file size of the 50,000 lines of text since I'm on mobile, but it would increase the repo size, the git history size, memory consumption during the unit tests, and time spent on test preparation. Also, the vast majority of the data in the assets is not used during the test, and at this size no one will ever inspect the content, which obfuscates the logic of the tests. I think curating a minimal set of test assets would better help future maintainers.
I think the test with the full-blown BPE data can go into an integration test where the BPE data are retrieved from the internet on the fly (and maybe cached).

The reason I added the full BPE vocab was because I wanted to make sure that tests capture some of the edge cases with unicode special characters that we have seen internally. The two files are 500 kB and 1 MB respectively; wondering if this will be a big issue. Since most users will use the full-blown GPT-2 BPE from Allen AI, I thought that adding tests with the full one might be better.

I think the approach of using the full-blown GPT-2 BPE is reasonable. If the size is really a concern, we could use the files directly from the AWS server (since we are going to upload them there for use with the pre-trained models).

@mthrok
Contributor

mthrok commented Dec 15, 2021

The reason I added the full BPE vocab was because I wanted to make sure that tests capture some of the edge cases with unicode special characters that we have seen internally.

This sounds to me like the issue is more one of Unicode coverage than of edge cases. In the current structure, this coverage is defined by BPE data that originates not from library code but from external data, so the signal we get from the test does not necessarily align with the end-user experience.
I think it makes sense to test that these specific characters are handled correctly on the shipped BPE data, instead of on separate assets in the tests. (Alternatively, how is it guaranteed that the BPE data in the test assets and the data you are going to publish on S3 as pre-trained data stay in sync? If the BPE data in the test assets become obsolete, then we do not get the right signal.)

@abhinavarora
Contributor Author

The reason I added the full BPE vocab was because I wanted to make sure that tests capture some of the edge cases with unicode special characters that we have seen internally.

This sounds to me like the issue is more one of Unicode coverage than of edge cases. In the current structure, this coverage is defined by BPE data that originates not from library code but from external data, so the signal we get from the test does not necessarily align with the end-user experience. I think it makes sense to test that these specific characters are handled correctly on the shipped BPE data, instead of on separate assets in the tests. (Alternatively, how is it guaranteed that the BPE data in the test assets and the data you are going to publish on S3 as pre-trained data stay in sync? If the BPE data in the test assets become obsolete, then we do not get the right signal.)

Hey @mthrok, I think I did not explain myself correctly in the previous comment :-) The test is not concerned with specific unicode characters. The unicode tests mainly test that the behavior on out of vocab tokens containing multi-byte characters is correct and is consistent between the scripted and the non-scripted versions. Also, I do expect the test data to be the same as "the shipped BPE data". I don't expect the shipped data in OSS to change for almost all users, unless someone wants to use their own BPE vocab or there is a new pre-trained model released in OSS that leverages a much larger dataset to learn the vocab. Hope that makes sense. Most SOTA text research generally uses the data as it is in OSS. The RoBERTa models open-sourced by FAIR are a great example. They reused the tokenizer and the data from what Allen AI released for GPT.

If you still feel that the test is not right, I could try to hand-craft custom data, which could be a small subset of the shipped data.
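
For illustration, a sketch of the kind of consistency check being described: eager vs. scripted output on out-of-vocab text with multi-byte characters. The import path, constructor arguments, and asset paths are assumptions, not the actual test added in this PR:

import torch
from torchtext.transforms import GPT2BPETokenizer  # import path is an assumption


def test_scripted_matches_eager_on_multibyte_oov() -> None:
    tokenizer = GPT2BPETokenizer(
        encoder_json_path="gpt2_bpe_encoder.json",  # hypothetical asset paths
        vocab_bpe_path="gpt2_bpe_vocab.bpe",
    )
    scripted = torch.jit.script(tokenizer)
    # Out-of-vocab strings containing multi-byte (non-ASCII) characters.
    for text in ["naïve café", "日本語のテスト", "mixed emoji 🙂 text"]:
        assert tokenizer(text) == scripted(text)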

Contributor

@parmeet parmeet left a comment

LGTM! Please make sure to update the summary (potentially as a checklist) with any relevant outstanding issues discussed in this PR, so that we can try to cover them in follow-up PRs.

@mthrok
Contributor

mthrok commented Dec 16, 2021

@abhinavarora @parmeet I do not want to block the PR, so you folks can go ahead, but once a file is checked in, that action cannot be undone, so I present my thoughts one more time.

Hey @mthrok, I think I did not explain myself correctly in the previous comment :-) The test is not concerned with specific unicode characters. The unicode tests mainly test that the behavior on out of vocab tokens containing multi-byte characters is correct and is consistent between the scripted and the non-scripted versions.

Thanks for the explanation. I think both correctness checking and TorchScript consistency checking can be done without the 50k data. In fact, correctness is easier to verify by eye when the data are small and hand-crafted. So I still wonder whether it is absolutely necessary to use and check in the 50k data.

Also, I do expect the test data to be the same as "the shipped BPE data". I don't expect the shipped data in OSS to change for almost all users, unless someone wants to use their own BPE vocab or there is a new pre-trained model released in OSS that leverages a much larger dataset to learn the vocab.

I understand that the shipped data are unlikely to change. But if these data files are checked in, there are now two different sources of truth: one for testing and one for the shipped package. I am not against testing on the full-blown data, but if such tests are done, I think they should be done on the shipped package, i.e. in the form of an integration test that fetches the shipped data on the fly.

@mthrok
Contributor

mthrok commented Dec 16, 2021

BTW, reading the documentation on Vocab/vocab at https://pytorch.org/text/main/vocab.html, I feel that having a factory function that loads data from an external file (and having the GPT2 class accept plain data types like dictionaries) would be more consistent with the rest of the library's design.
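
A sketch of the pattern being suggested, mirroring the Vocab class / vocab() factory split: the class accepts plain dictionaries and a separate factory function handles file I/O. Names here are illustrative, not an existing API:

import json
from typing import Dict


class GPT2BPETokenizerSketch:
    """Accepts plain dictionaries and knows nothing about files."""

    def __init__(self, bpe_encoder: Dict[str, int], bpe_merge_ranks: Dict[str, int]) -> None:
        self.bpe_encoder = bpe_encoder
        self.bpe_merge_ranks = bpe_merge_ranks


def gpt2_bpe_tokenizer(encoder_json_path: str, vocab_bpe_path: str) -> GPT2BPETokenizerSketch:
    # Factory function: load the assets from disk, then build the tokenizer
    # from plain data types.
    with open(encoder_json_path, "r", encoding="utf-8") as f:
        bpe_encoder = json.load(f)
    merge_ranks: Dict[str, int] = {}
    with open(vocab_bpe_path, "r", encoding="utf-8") as f:
        for rank, line in enumerate(f.read().split("\n")):
            pair = line.split()
            if len(pair) == 2:
                merge_ranks["\u0001".join(pair)] = rank
    return GPT2BPETokenizerSketch(bpe_encoder, merge_ranks)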

@abhinavarora
Contributor Author

@mthrok Thank you for the feedback. SEPERATOR: torch.jit.Final[str] is no longer a class attribute; it is now an instance-level constant. The declaration there just lets TorchScript know that this is a Final attribute, so TorchScript can apply some optimizations during scripting. With the latest commits, I have gotten rid of all class-level attributes.
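
For reference, a minimal sketch of that pattern, assuming the tokenizer is an nn.Module: the class-level annotation only declares the type as Final, and the value is assigned per instance in __init__ (the forward body is a placeholder, not this PR's implementation):

import torch
from torch import nn


class GPT2BPETokenizerSketch(nn.Module):
    # Declared as Final so TorchScript treats them as constants; the actual
    # values are instance-level, assigned in __init__.
    SEPERATOR: torch.jit.Final[str]
    LSTRIP_PATTERN: torch.jit.Final[str]

    def __init__(self) -> None:
        super().__init__()
        self.SEPERATOR = "\u0001"
        self.LSTRIP_PATTERN = "\u0120"

    def forward(self, text: str) -> str:
        # Placeholder body; the real tokenizer performs BPE here.
        return text.replace(" ", self.LSTRIP_PATTERN)


scripted = torch.jit.script(GPT2BPETokenizerSketch())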

@abhinavarora
Contributor Author

@abhinavarora @parmeet I do not want to block the PR, so you folks can go ahead, but once a file is checked in, that action cannot be undone, so I present my thoughts one more time.

Hey @mthrok, I think I did not explain myself correctly in the previous comment :-) The test is not concerned with specific unicode characters. The unicode tests mainly test that the behavior on out of vocab tokens containing multi-byte characters is correct and is consistent between the scripted and the non-scripted versions.

Thanks for the explanation. I think both correctness checking and TorchScript consistency checking can be done without the 50k data. In fact, correctness is easier to verify by eye when the data are small and hand-crafted. So I still wonder whether it is absolutely necessary to use and check in the 50k data.

Also, I do expect the test data to be the same as "the shipped BPE data". I don't expect the shipped data in OSS to change for almost all users, unless someone wants to use their own BPE vocab or there is a new pre-trained model released in OSS that leverages a much larger dataset to learn the vocab.

I understand that the shipped data are unlikely to change. But if these data files are checked in, there are now two different sources of truth: one for testing and one for the shipped package. I am not against testing on the full-blown data, but if such tests are done, I think they should be done on the shipped package, i.e. in the form of an integration test that fetches the shipped data on the fly.

Hey @mthrok, your points make total sense. @parmeet and I have been talking about integration tests, and we plan to create a new task to set up integration testing for this library. Let me add an action item for myself to replace this test with a small hand-crafted test once integration testing is ready. I want to make sure this testing is in place because I plan to implement some of the components from this PR in C++, so having it is of high importance.

One thing I did not understand from your comment is why a file check-in cannot be undone. We can always replace the test files, right? Wondering if there is any additional perspective I am missing.

Thank you again for all your awesome feedback. I am updating the summary of this PR with action items so that we can follow up on all your suggestions before closing out this workstream.

@abhinavarora
Contributor Author

BTW, reading the documentation on Vocab/vocab at https://pytorch.org/text/main/vocab.html, I feel that having a factory function that loads data from an external file (and having the GPT2 class accept plain data types like dictionaries) would be more consistent with the rest of the library's design.

I think @parmeet was forced to design Vocab this way because Vocab was an API that was already being used by users and he wanted to keep the API consistent. Not sure if we would have still followed the same practice if this was not an interface we inherited.

@abhinavarora
Contributor Author

@mthrok and @parmeet I have added follow-up items to the PR summary so that we can address them in the near future and sync up to find a common solution. Will land this PR now so that we can unblock the RoBERTa model.

@abhinavarora abhinavarora merged commit da914e6 into pytorch:main Dec 16, 2021
@abhinavarora abhinavarora deleted the gpt2_bpe_tok branch December 16, 2021 23:05
@mthrok
Contributor

mthrok commented Dec 17, 2021

One thing I did not understand from your comment is why a file check-in cannot be undone. We can always replace the test files, right? Wondering if there is any additional perspective I am missing.

Even if the file is updated later, the original file stays in the git history, so the amount of data transferred with git clone stays the same (unless one uses a partial or shallow clone).
