Add never_split feature to BERTTokenizer #1898


Merged
Nayef211 merged 3 commits into pytorch:main from reachsumit:text_issue_1883
Sep 19, 2022

Conversation

reachsumit
Contributor

Description

Add never_split feature to BERTTokenizer

This change addresses issue #1883. It adds a new parameter, never_split, to BERTTokenizer so that the user can specify a sequence of tokens that should not be changed during tokenization.

Close #1883

Types of changes

[x] New feature (non-breaking change which adds core functionality)

Changes made

  • Made changes to the C++ code (BERTEncoder class) to support the never_split functionality (see the sketch after this list).
    • Updated torch bindings and py bindings as required.
  • Added never_split parameter to Python code (BERTTokenizer class).
  • Added a unit test to extensively test the tokenization process.
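
To make the intended behavior concrete, here is a minimal, self-contained C++ sketch of the never_split idea (an illustration only, not the actual torchtext diff; the function and variable names are made up for this example). Whitespace-delimited chunks that appear in the never_split set are emitted unchanged, while all other chunks still go through basic punctuation splitting, so a special token such as [UNK] is no longer broken into "[", "UNK", "]".

#include <cctype>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split text on whitespace; chunks listed in never_split are kept verbatim,
// all other chunks are further split on punctuation.
std::vector<std::string> basic_tokenize(
    const std::string& text,
    const std::set<std::string>& never_split) {
  std::vector<std::string> tokens;
  std::istringstream stream(text);
  std::string chunk;
  while (stream >> chunk) {
    if (never_split.count(chunk)) {
      tokens.push_back(chunk);  // protected token: emit unchanged
      continue;
    }
    std::string current;
    for (char c : chunk) {
      if (std::ispunct(static_cast<unsigned char>(c))) {
        if (!current.empty()) tokens.push_back(current);
        tokens.push_back(std::string(1, c));  // punctuation becomes its own token
        current.clear();
      } else {
        current += c;
      }
    }
    if (!current.empty()) tokens.push_back(current);
  }
  return tokens;
}

int main() {
  const std::set<std::string> never_split{"[UNK]", "[CLS]", "[SEP]"};
  // Prints: hello [UNK] world !   ("[UNK]" survives; "world!" is split)
  for (const auto& token : basic_tokenize("hello [UNK] world!", never_split)) {
    std::cout << token << " ";
  }
  std::cout << "\n";
  return 0;
}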

Testing

  • No issues identified in pre-commit checks.
  • No issues identified with any of the existing or new unit tests.

@joecummings
Contributor

@reachsumit Please look into test failures. LMK if you need help debugging.

@reachsumit
Contributor Author

@joecummings Thank you for looking into this. There seems to be a pattern of failures among the tests.

  • The unittest_macos_py* tests seem to be failing with the following 'No space left' error.
...
E       OSError: [Errno 28] No space left on device
...
  • The unittest_linux_py* tests seem to have a segmentation fault for the following:
...
test/prototype/test_with_asset.py::TestTransformsWithAsset::test_vocab_from_raw_text_file Fatal Python error: Segmentation fault
...

I see the same pattern of failures for other builds, such as @Nayef211 's #1899 .

  • For the unittest_windows_py* tests, I see a bunch of warnings but no specific error.

I would appreciate any tips or pointers for debugging this further. 🙏🏽

@Nayef211
Contributor

Hey @reachsumit. Happy to provide some further context here.

  • The unittest_macos_py* tests seem to be failing with the following 'No space left' error.
...
E       OSError: [Errno 28] No space left on device
...

This should be resolved once #1889 lands and you can ignore these errors for the time being.

  • The unittest_linux_py* tests seem to have a segmentation fault for the following:
...
test/prototype/test_with_asset.py::TestTransformsWithAsset::test_vocab_from_raw_text_file Fatal Python error: Segmentation fault
...

These tests should be disabled in CI once #1901 lands and you can rebase your branch on top of the latest main once this happens.

  • For the unittest_windows_py* tests, I see a bunch of warnings but no specific error.

The Windows CI build failures look new; they may be caused by this PR and could require some further investigation (see the sample CI failure).

@reachsumit reachsumit force-pushed the text_issue_1883 branch 2 times, most recently from 21a5cdf to 595f9a2 Compare September 15, 2022 01:16
@Nayef211
Contributor

@reachsumit can you try rebasing on the latest main branch to see if the CI issues are resolved now?

@reachsumit
Contributor Author

reachsumit commented Sep 16, 2022

@reachsumit can you try rebasing on the latest main branch to see if the CI issues are resolved now?

Still seeing the following failures. :(

binary_macos_conda_py3.7

Solving environment: ...working... failed

unittest_macos_py3.8

* Installing conda
PREFIX=/Users/distiller/project/conda
Unpacking payload ...
  0% 0/36 [00:00<?, ?it/s]Exception in thread Thread-2:
Traceback (most recent call last):
  File "threading.py", line 973, in _bootstrap_inner
  File "concurrent/futures/process.py", line 317, in run
  File "concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
  File "concurrent/futures/process.py", line 376, in <listcomp>
RuntimeError: dictionary changed size during iteration

unittest_macos_py3.9

torchtext_unittest/datasets/test_iwslt2016.py::TestIWSLT2016::test_iwslt2016_split_argument_0_train FAILED
...
>       parser.feed(text)
E       xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

../env/lib/python3.9/xml/etree/ElementTree.py:1344: ParseError

@Nayef211
Contributor

Nayef211 left a comment

Overall LGTM! Thanks for working on this feature @reachsumit! The 2 test failures you're seeing are due to flaky tests and not caused by your PR. I will stamp once the PR comment is resolved.

@@ -254,51 +266,56 @@ UString BERTEncoder::_basic_tokenize(const UString& text) {

std::vector<std::string> BERTEncoder::Tokenize(std::string text) {
  std::vector<std::string> results;
  std::vector<std::string> interim_results;
  std::vector<std::string> tokens;
  std::set<std::string> never_split(never_split_.begin(), never_split_.end());
@Nayef211
Contributor

Does it make sense to create this set inside of the Tokenize function instead of the constructor? The set will be recreated every time this method is called, which doesn't seem efficient.

@reachsumit
Contributor Author

Good point! I will fix it.
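
For context, the suggestion above amounts to building the never_split lookup set once in the constructor and reusing it on every Tokenize() call, instead of reconstructing it per call. The following is a hypothetical sketch of that pattern; the class and member names are illustrative and are not the actual torchtext implementation.

#include <iostream>
#include <set>
#include <string>
#include <vector>

class Encoder {
 public:
  // Build the lookup set once here rather than on every Tokenize() call.
  explicit Encoder(std::vector<std::string> never_split)
      : never_split_(std::move(never_split)),
        never_split_set_(never_split_.begin(), never_split_.end()) {}

  bool IsProtected(const std::string& token) const {
    // Membership check against the set created at construction time.
    return never_split_set_.count(token) > 0;
  }

 private:
  std::vector<std::string> never_split_;   // original list, as passed by the caller
  std::set<std::string> never_split_set_;  // derived once for fast lookups
};

int main() {
  Encoder encoder({"[UNK]", "[SEP]"});
  std::cout << encoder.IsProtected("[UNK]") << "\n";  // 1
  std::cout << encoder.IsProtected("hello") << "\n";  // 0
  return 0;
}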

@reachsumit
Contributor Author

Thanks for your comments, @Nayef211. I updated the code. The CI failures are as follows:

binary_linux_conda_py3.9

Solving environment: ...working... 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed         

unittest_windows_py3.10

torchtext_unittest\test_build.py::TestVocab::test_download_charngram_vectors FAILED [  0%]
...
..\env\lib\socket.py:955: gaierror

@joecummings
Contributor

@reachsumit @Nayef211 After rerunning the test and build suite, it appears that the only test that's still failing - unittest_linux_py3.7 - is related to this known issue: #1903. So everything looks good!

However, since this isn't a pressing feature, I'm in favor of waiting until @VitalyFedyunin has a fix out later this week to make sure that this doesn't cause any other strange failures.

@Nayef211
Contributor

@reachsumit @Nayef211 After rerunning the test and build suite, it appears that the only test that's still failing - unittest_linux_py3.7 - is related to this known issue: #1903. So everything looks good!

However, since this isn't a pressing feature, I'm in favor of waiting until @VitalyFedyunin has a fix out later this week to make sure that this doesn't cause any other strange failures.

I actually think we can go ahead and land now as this feature and the failures with the IWSLT dataset are not at all related. I don't think it makes sense to block landing this on #1903.

@Nayef211 Nayef211 merged commit b0df58b into pytorch:main Sep 19, 2022

Successfully merging this pull request may close these issues.

Add never_split kwarg to BERTTokenizer to achieve parity with transformers.BertTokenizer
4 participants