Add Torchscriptable GPT-2 BPE Tokenizer for RoBERTa models #1462


Merged
merged 6 commits into pytorch:main from gpt2_bpe_tok
Dec 16, 2021

Conversation

abhinavarora
Contributor

@abhinavarora abhinavarora commented Dec 14, 2021

Description

This PR implements the GPT-2 BPE tokenizer used by RoBERTa models. The implemented tokenizer is an initial working version that is scriptable. This PR also adds test cases for the tokenizer.

Testing

pytest -k test_gpt2_bpe_tokenizer test/test_transforms.py
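
For context, a minimal usage sketch of the transform under test; the import path, constructor arguments, and asset file names below are assumptions for illustration, not necessarily the exact API in this PR:

import torch
from torchtext.transforms import GPT2BPETokenizer  # import path is an assumption

# Hypothetical paths to the GPT-2 BPE assets (encoder JSON + merges file).
tokenizer = GPT2BPETokenizer(
    encoder_json_path="gpt2_bpe_encoder.json",
    vocab_bpe_path="gpt2_bpe_vocab.bpe",
)

# The point of the PR: the transform survives TorchScript scripting and
# produces the same output as the eager module.
scripted = torch.jit.script(tokenizer)
assert tokenizer("Hello world!") == scripted("Hello world!")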

Future Work

  1. Implement Caching
  2. Move implementation to C++
  3. Implement support for character indices with tokenization

Follow Ups to this PR

  • Refactor tokenizer organization in TorchText and set up tokenization-specific constants and util functions.
  • Once integration testing is set up for the repo, create a new hand-crafted unit test and move the "shipped BPE model" tests to integration testing.
  • Set up time with @mthrok for an in-depth discussion on accepting file paths vs. file-like objects.

Contributor

@Nayef211 Nayef211 left a comment

Overall LGTM. Just had a question and a nit comment

:type byte_encoder: Dict[int, str]
"""
SEPERATOR: str = "\u0001"
LSTRIP_PATTERN: str = "\u0120"
Contributor

Class attributes are tricky here. If anyone needs a custom separator/lstrip_pattern, one has to modify them on the class.

GPT2BPETokenizer.SEPARATOR = my_separator
GPT2BPETokenizer.LSTRIP_PATTERN = my_lstrip_pattern
foo = GPT2BPETokenizer.load_xxx(...)

It looks like the separator and lstrip_pattern are logically connected to the data being loaded (the JSON file and the BPE file), which means the customization should best happen at loading time.
Making them class attributes means that if I have two BPE tokenizers, one with the default and the other with a custom separator/lstrip_pattern, I cannot use them at the same time.

Looking at the implementation of the get_pairs and bpe methods, self.merge_seperator is used. This means these values are better coupled with the instance than with the class.

Also, this approach forces all the load_XXX functions to be classmethods. In general in Python, there is not much advantage to using classmethod/staticmethod over a plain module-level function (except for logical grouping, and the very specific case where the method is intended to be overridden in a subclass).

Having class methods could also be troublesome if the instance is going to be used in a subprocess (i.e. instances are pickled and unpickled when passed from the main process to a subprocess; I am not sure that's the case for this class, but in general it's better not to have them).

Therefore, if these attributes are often customized, I recommend making them optional constructor arguments; otherwise, putting them in a module-level variable wouldn't hurt much.
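
A minimal sketch of the constructor-argument alternative; the class name, argument names, and defaults here are illustrative, not the PR's actual signature:

from typing import Dict

# Module-level defaults (hypothetical names). Per-instance overrides go
# through the constructor instead of mutating class attributes.
DEFAULT_SEPARATOR = "\u0001"
DEFAULT_LSTRIP_PATTERN = "\u0120"


class GPT2BPETokenizerSketch:
    def __init__(
        self,
        bpe_encoder: Dict[str, int],
        bpe_merge_ranks: Dict[str, int],
        separator: str = DEFAULT_SEPARATOR,
        lstrip_pattern: str = DEFAULT_LSTRIP_PATTERN,
    ) -> None:
        self.bpe_encoder = bpe_encoder
        self.bpe_merge_ranks = bpe_merge_ranks
        # Coupled to the instance, so two tokenizers with different
        # separators can coexist in the same process.
        self.separator = separator
        self.lstrip_pattern = lstrip_pattern


# Two instances with different separators, usable at the same time.
default_tok = GPT2BPETokenizerSketch({}, {})
custom_tok = GPT2BPETokenizerSketch({}, {}, separator="\u0002")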

Contributor Author

@mthrok Thanks for the detailed suggestions. Your points make a lot of sense and I will try to get rid of the unnecessary class methods.

One thing I would like to emphasize is that these class attributes are not designed to be customizable. The LSTRIP_PATTERN is how GPT-2 encodes spaces; we don't need users to change this unless they want to try out an idea that differs from the original implementation. As far as SEPERATOR is concerned, that is our workaround because TorchScript does not support Tuple as Dict keys. SEPERATOR is therefore closely tied to our implementation, and we don't expect any user to modify it.

Also, this approach forces all the load_XXX functions to be classmethods. In general in Python, there is not much advantage to using classmethod/staticmethod over a plain module-level function (except for logical grouping, and the very specific case where the method is intended to be overridden in a subclass).

I am not sure I understand why this forces us to use classmethods. Let me refactor these to be ScriptModule constants as described at the bottom of https://pytorch.org/docs/stable/jit_language_reference.html

bpe_vocab = f.read()

return {
cls.SEPERATOR.join(merge_pair.split()): i
Contributor

Curious: why can't the BPE file be formatted as JSON?
If bpe_vocab and bpe_encoder are in the same format, they can share the same I/O logic.

Contributor Author

Technically, nothing stops us from formatting them as JSON. However, thinking from a user's perspective, we should align with the file formats released by AllenAI, HuggingFace, etc. If people train their own GPT-2 tokenizers using the original implementation or any of these libraries, the files they produce will likely be in the same format. Hence, I suggest keeping the same file format and letting the implementation deal with TorchScript-specific nuances. Hope that makes sense?
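
To make the workaround concrete, a sketch of loading the released merges format into a TorchScript-friendly dictionary by joining each merge pair into a single string key; the function name and the SEPARATOR constant here are illustrative (the PR spells it SEPERATOR):

from typing import Dict

SEPARATOR = "\u0001"  # stands in for the PR's SEPERATOR constant


def load_bpe_merge_ranks(path: str) -> Dict[str, int]:
    # Each line of a GPT-2 merges file is "token_a token_b"; TorchScript
    # dicts cannot use Tuple[str, str] keys, so join the pair into one key.
    merge_ranks: Dict[str, int] = {}
    with open(path, "r", encoding="utf-8") as f:
        for rank, line in enumerate(f.read().split("\n")):
            if not line or line.startswith("#"):
                continue  # skip blanks and the optional "#version" header
            pair = line.split()
            if len(pair) != 2:
                continue
            merge_ranks[SEPARATOR.join(pair)] = rank
    return merge_ranks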

Contributor

@mthrok mthrok left a comment

I recommend documenting how the new assets are generated: ideally in the asset file itself if the format allows comments, otherwise either in the README of the unit tests or in the test class.

https://github.com/pytorch/audio/pull/1104/files#diff-c8c2b1f2318cddbb6e9602b7615608204bd21d175922f7661466d77dd846c693R258

@mthrok
Contributor

mthrok commented Dec 14, 2021

I think it's better to hand-craft the test assets. I cannot check the file size of the 50,000 lines of text since I'm on mobile, but it would increase the repo size, the git history size, memory consumption during the unit tests, and time spent on test preparation. Also, the vast majority of the data in the assets is not used during the test, and at this size no one will ever inspect the content, which obfuscates the logic of the tests. I think curating a minimal set of test assets would better help future maintainers.

I think the test with the full-blown BPE data can go into an integration test where the BPE data are retrieved from the internet on the fly (and maybe cached).

@abhinavarora
Contributor Author

I think it's better to hand-craft the test assets. I cannot check the file size of the 50,000 lines of text since I'm on mobile, but it would increase the repo size, the git history size, memory consumption during the unit tests, and time spent on test preparation. Also, the vast majority of the data in the assets is not used during the test, and at this size no one will ever inspect the content, which obfuscates the logic of the tests. I think curating a minimal set of test assets would better help future maintainers.

I think the test with the full-blown BPE data can go into an integration test where the BPE data are retrieved from the internet on the fly (and maybe cached).

The reason I added the full BPE vocab was because I wanted to make sure that tests capture some of the edge cases with unicode special characters that we have seen internally. The two files are 500 kB and 1 MB respectively; wondering if this will be a big issue. Since most users will use the full-blown GPT-2 BPE from Allen AI, I thought that adding tests with the full one might be better.

@abhinavarora
Contributor Author

I recommend documenting how the new assets are generated: ideally in the asset file itself if the format allows comments, otherwise either in the README of the unit tests or in the test class.

https://github.com/pytorch/audio/pull/1104/files#diff-c8c2b1f2318cddbb6e9602b7615608204bd21d175922f7661466d77dd846c693R258

Assets are usually generated by the developers of pre-trained models, and currently they will be generated via other frameworks. I can add some asset descriptions in the test, since we cannot rely on comments in the asset files; those can become inconsistent.

@parmeet
Contributor

parmeet commented Dec 15, 2021

I recommend documenting how the new assets are generated: ideally in the asset file itself if the format allows comments, otherwise either in the README of the unit tests or in the test class.
https://github.com/pytorch/audio/pull/1104/files#diff-c8c2b1f2318cddbb6e9602b7615608204bd21d175922f7661466d77dd846c693R258

Assets are usually generated by the developers of pre-trained models, and currently they will be generated via other frameworks. I can add some asset descriptions in the test, since we cannot rely on comments in the asset files; those can become inconsistent.

Sounds good!

@parmeet
Contributor

parmeet commented Dec 15, 2021

I think it's better to hand-craft the test assets. I cannot check the file size of the 50,000 lines of text since I'm on mobile, but it would increase the repo size, the git history size, memory consumption during the unit tests, and time spent on test preparation. Also, the vast majority of the data in the assets is not used during the test, and at this size no one will ever inspect the content, which obfuscates the logic of the tests. I think curating a minimal set of test assets would better help future maintainers.
I think the test with the full-blown BPE data can go into an integration test where the BPE data are retrieved from the internet on the fly (and maybe cached).

The reason I added the full BPE vocab was because I wanted to make sure that tests capture some of the edge cases with unicode special characters that we have seen internally. The two files are 500 kB and 1 MB respectively; wondering if this will be a big issue. Since most users will use the full-blown GPT-2 BPE from Allen AI, I thought that adding tests with the full one might be better.

I think the approach of using the full-blown GPT-2 BPE is reasonable. If the size is really a concern, we could use the files directly from the AWS server (since we are going to upload them there for use with the pre-trained models).

@mthrok
Contributor

mthrok commented Dec 15, 2021

The reason I added the full BPE vocab was because I wanted to make sure that tests capture some of the edge cases with unicode special characters that we have seen internally.

This sounds to me like the issue is more one of Unicode coverage than of edge cases. In the current structure, this coverage is defined by BPE data that originates not from library code but from external data, so the signal we get from the test does not necessarily align with the end-user experience.
I think it makes sense to test that these specific characters are handled correctly on the shipped BPE data, instead of on separate assets in the tests. (Alternatively, how is it guaranteed that the BPE data in the test assets and the data you are going to publish on S3 as pre-trained data stay in sync? If the BPE data in the test assets become obsolete, then we do not get the right signal.)

@abhinavarora
Contributor Author

The reason I added the full BPE vocab was because I wanted to make sure that tests capture some of the edge cases with unicode special characters that we have seen internally.

This sounds to me like the issue is more one of Unicode coverage than of edge cases. In the current structure, this coverage is defined by BPE data that originates not from library code but from external data, so the signal we get from the test does not necessarily align with the end-user experience. I think it makes sense to test that these specific characters are handled correctly on the shipped BPE data, instead of on separate assets in the tests. (Alternatively, how is it guaranteed that the BPE data in the test assets and the data you are going to publish on S3 as pre-trained data stay in sync? If the BPE data in the test assets become obsolete, then we do not get the right signal.)

Hey @mthrok, I think I did not explain myself correctly in the previous comment :-) The test is not concerned with specific unicode characters. The unicode tests mainly test that the behavior on out of vocab tokens containing multi-byte characters is correct and is consistent between the scripted and the non-scripted versions. Also, I do expect the test data to be the same as "the shipped BPE data". I don't expect the shipped data in OSS to change for almost all users, unless someone wants to use their own BPE vocab or there is a new pre-trained model released in OSS that leverages a much larger dataset to learn the vocab. Hope that makes sense. Most SOTA text research generally uses the data as it is in OSS. The RoBERTa models open-sourced by FAIR are a great example. They reused the tokenizer and the data from what Allen AI released for GPT.

If you still feel that the test is not right, I could try to hand-craft custom data, which could be a small subset of the shipped data.
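
For illustration, a sketch of the kind of consistency check being described: eager vs. scripted output on out-of-vocab text with multi-byte characters. The import path, constructor arguments, and asset paths are assumptions, not the actual test added in this PR:

import torch
from torchtext.transforms import GPT2BPETokenizer  # import path is an assumption


def test_scripted_matches_eager_on_multibyte_oov() -> None:
    tokenizer = GPT2BPETokenizer(
        encoder_json_path="gpt2_bpe_encoder.json",  # hypothetical asset paths
        vocab_bpe_path="gpt2_bpe_vocab.bpe",
    )
    scripted = torch.jit.script(tokenizer)
    # Out-of-vocab strings containing multi-byte (non-ASCII) characters.
    for text in ["naïve café", "日本語のテスト", "mixed emoji 🙂 text"]:
        assert tokenizer(text) == scripted(text)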

Contributor

@parmeet parmeet left a comment

LGTM! Please make sure to update the summary (potentially as a checklist) with any relevant outstanding issues discussed in this PR, so that we can try to cover them in follow-up PRs.

@mthrok
Contributor

mthrok commented Dec 16, 2021

@abhinavarora @parmeet I do not want to block the PR, so you folks can go ahead, but once a file is checked in, that action cannot be undone, so I present my thoughts one more time.

Hey @mthrok, I think I did not explain myself correctly in the previous comment :-) The test is not concerned with specific unicode characters. The unicode tests mainly test that the behavior on out of vocab tokens containing multi-byte characters is correct and is consistent between the scripted and the non-scripted versions.

Thanks for the explanation. I think both correctness checking and TorchScript consistency checking can be done without the 50k data. In fact, correctness is easier to verify by eye when the data are small and hand-crafted. So I still wonder whether it is absolutely necessary to use and check in the 50k data.

Also, I do expect the test data to be the same as "the shipped BPE data". I don't expect the shipped data in OSS to change for almost all users, unless someone wants to use their own BPE vocab or there is a new pre-trained model released in OSS that leverages a much larger dataset to learn the vocab.

I understand that the shipped data are unlikely to change. But if these data files are checked in, there are now two different sources of truth: one for testing and one for the shipped package. I am not against testing on the full-blown data, but if such tests are done, I think they should be done on the shipped package, i.e. in the form of an integration test that fetches the shipped data on the fly.

@mthrok
Contributor

mthrok commented Dec 16, 2021

BTW, reading the documentation on Vocab/vocab at https://pytorch.org/text/main/vocab.html, I feel that having a factory function that loads data from an external file (and having the GPT2 class accept plain data types like dictionaries) would be more consistent with the rest of the library's design.
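
A sketch of the pattern being suggested, mirroring the Vocab class / vocab() factory split: the class accepts plain dictionaries and a separate factory function handles file I/O. Names here are illustrative, not an existing API:

import json
from typing import Dict


class GPT2BPETokenizerSketch:
    """Accepts plain dictionaries and knows nothing about files."""

    def __init__(self, bpe_encoder: Dict[str, int], bpe_merge_ranks: Dict[str, int]) -> None:
        self.bpe_encoder = bpe_encoder
        self.bpe_merge_ranks = bpe_merge_ranks


def gpt2_bpe_tokenizer(encoder_json_path: str, vocab_bpe_path: str) -> GPT2BPETokenizerSketch:
    # Factory function: load the assets from disk, then build the tokenizer
    # from plain data types.
    with open(encoder_json_path, "r", encoding="utf-8") as f:
        bpe_encoder = json.load(f)
    merge_ranks: Dict[str, int] = {}
    with open(vocab_bpe_path, "r", encoding="utf-8") as f:
        for rank, line in enumerate(f.read().split("\n")):
            pair = line.split()
            if len(pair) == 2:
                merge_ranks["\u0001".join(pair)] = rank
    return GPT2BPETokenizerSketch(bpe_encoder, merge_ranks)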

@abhinavarora
Contributor Author

@mthrok Thank you for the feedback. SEPERATOR: torch.jit.Final[str] is no longer a class attribute; it is now an instance-level constant. The declaration there just lets TorchScript know that this is a Final attribute, so TorchScript can apply some optimizations during scripting. With the latest commits, I have gotten rid of all class-level attributes.
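
For reference, a minimal sketch of that pattern, assuming the tokenizer is an nn.Module: the class-level annotation only declares the type as Final, and the value is assigned per instance in __init__ (the forward body is a placeholder, not this PR's implementation):

import torch
from torch import nn


class GPT2BPETokenizerSketch(nn.Module):
    # Declared as Final so TorchScript treats them as constants; the actual
    # values are instance-level, assigned in __init__.
    SEPERATOR: torch.jit.Final[str]
    LSTRIP_PATTERN: torch.jit.Final[str]

    def __init__(self) -> None:
        super().__init__()
        self.SEPERATOR = "\u0001"
        self.LSTRIP_PATTERN = "\u0120"

    def forward(self, text: str) -> str:
        # Placeholder body; the real tokenizer performs BPE here.
        return text.replace(" ", self.LSTRIP_PATTERN)


scripted = torch.jit.script(GPT2BPETokenizerSketch())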

@abhinavarora
Contributor Author

@abhinavarora @parmeet I do not want to block the PR, so you folks can go ahead, but once a file is checked in, that action cannot be undone, so I present my thoughts one more time.

Hey @mthrok, I think I did not explain myself correctly in the previous comment :-) The test is not concerned with specific unicode characters. The unicode tests mainly test that the behavior on out of vocab tokens containing multi-byte characters is correct and is consistent between the scripted and the non-scripted versions.

Thanks for the explanation. I think both correctness checking and TorchScript consistency checking can be done without the 50k data. In fact, correctness is easier to verify by eye when the data are small and hand-crafted. So I still wonder whether it is absolutely necessary to use and check in the 50k data.

Also, I do expect the test data to be the same as "the shipped BPE data". I don't expect the shipped data in OSS to change for almost all users, unless someone wants to use their own BPE vocab or there is a new pre-trained model released in OSS that leverages a much larger dataset to learn the vocab.

I understand that the shipped data are unlikely to change. But if these data files are checked in, there are now two different sources of truth: one for testing and one for the shipped package. I am not against testing on the full-blown data, but if such tests are done, I think they should be done on the shipped package, i.e. in the form of an integration test that fetches the shipped data on the fly.

Hey @mthrok, your points make total sense. @parmeet and I have been talking about integration tests, and we plan to create a new task to set up integration testing for this library. Let me add an action item for myself to replace this test with a small hand-crafted test once integration testing is ready. I want to make sure this testing is in place because I plan to implement some of the components from this PR in C++, so having it is of high importance.

One thing I did not understand from your comment is why a file check-in cannot be undone. We can always replace the test files, right? Wondering if there is any additional perspective I am missing.

Thank you again for all your awesome feedback. I am updating the summary of this PR with action items so that we can follow up on all your suggestions before closing out this workstream.

@abhinavarora
Contributor Author

BTW, reading the documentation on Vocab/vocab at https://pytorch.org/text/main/vocab.html, I feel that having a factory function that loads data from an external file (and having the GPT2 class accept plain data types like dictionaries) would be more consistent with the rest of the library's design.

I think @parmeet was forced to design Vocab this way because Vocab was an API that was already being used by users and he wanted to keep the API consistent. Not sure if we would have still followed the same practice if this was not an interface we inherited.

@abhinavarora
Contributor Author

@mthrok and @parmeet I have added follow-up items to the PR summary so that we can address them in the near future and sync up to find a common solution. Will land this PR now so that we can unblock the RoBERTa model.

@abhinavarora abhinavarora merged commit da914e6 into pytorch:main Dec 16, 2021
@abhinavarora abhinavarora deleted the gpt2_bpe_tok branch December 16, 2021 23:05
@mthrok
Contributor

mthrok commented Dec 17, 2021

One thing I did not understand from your comment is why a file check-in cannot be undone. We can always replace the test files, right? Wondering if there is any additional perspective I am missing.

Even if the file is updated later, the original file stays in the git history, so the amount of data transferred with git clone stays the same (unless one uses a partial or shallow clone).
