Add never_split feature to BERTTokenizer #1898


Merged
Nayef211 merged 3 commits into pytorch:main from reachsumit:text_issue_1883
Sep 19, 2022

Conversation

reachsumit
Contributor

Description

Add never_split feature to BERTTokenizer

This change addresses issue #1883. It adds a new parameter, never_split, to BERTTokenizer so that the user can specify a sequence of tokens that should not be changed during tokenization.

Close #1883

Types of changes

[x] New feature (non-breaking change which adds core functionality)

Changes made

  • Made changes to the C++ code (BERTEncoder class) to support the never_split functionality (see the sketch after this list).
    • Updated torch bindings and py bindings as required.
  • Added never_split parameter to Python code (BERTTokenizer class).
  • Added a unit test to extensively test the tokenization process.
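
To make the intended behavior concrete, here is a minimal, self-contained C++ sketch of the never_split idea (an illustration only, not the actual torchtext diff; the function and variable names are made up for this example). Whitespace-delimited chunks that appear in the never_split set are emitted unchanged, while all other chunks still go through basic punctuation splitting, so a special token such as [UNK] is no longer broken into "[", "UNK", "]".

#include <cctype>
#include <iostream>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Split text on whitespace; chunks listed in never_split are kept verbatim,
// all other chunks are further split on punctuation.
std::vector<std::string> basic_tokenize(
    const std::string& text,
    const std::set<std::string>& never_split) {
  std::vector<std::string> tokens;
  std::istringstream stream(text);
  std::string chunk;
  while (stream >> chunk) {
    if (never_split.count(chunk)) {
      tokens.push_back(chunk);  // protected token: emit unchanged
      continue;
    }
    std::string current;
    for (char c : chunk) {
      if (std::ispunct(static_cast<unsigned char>(c))) {
        if (!current.empty()) tokens.push_back(current);
        tokens.push_back(std::string(1, c));  // punctuation becomes its own token
        current.clear();
      } else {
        current += c;
      }
    }
    if (!current.empty()) tokens.push_back(current);
  }
  return tokens;
}

int main() {
  const std::set<std::string> never_split{"[UNK]", "[CLS]", "[SEP]"};
  // Prints: hello [UNK] world !   ("[UNK]" survives; "world!" is split)
  for (const auto& token : basic_tokenize("hello [UNK] world!", never_split)) {
    std::cout << token << " ";
  }
  std::cout << "\n";
  return 0;
}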

Testing

  • No issues identified in pre-commit checks.
  • No issues identified with any of the existing or new unit tests.

@joecummings
Contributor

@reachsumit Please look into test failures. LMK if you need help debugging.

@reachsumit
Contributor Author

@joecummings Thank you for looking into this. There seems to be a pattern of failures among the tests.

  • The unittest_macos_py* tests seem to be failing with the following 'No space left' error.
...
E       OSError: [Errno 28] No space left on device
...
  • The unittest_linux_py* tests seem to have a segmentation fault for the following:
...
test/prototype/test_with_asset.py::TestTransformsWithAsset::test_vocab_from_raw_text_file Fatal Python error: Segmentation fault
...

I see the same pattern of failures for other builds, such as @Nayef211 's #1899 .

  • For the unittest_windows_py* tests, I see a bunch of warnings but no specific error.

I would appreciate any tips or pointers for debugging this further. 🙏🏽

@Nayef211
Contributor

Hey @reachsumit. Happy to provide some further context here.

  • The unittest_macos_py* tests seem to be failing with the following 'No space left' error.
...
E       OSError: [Errno 28] No space left on device
...

This should be resolved once #1889 lands and you can ignore these errors for the time being.

  • The unittest_linux_py* tests seem to have a segmentation fault for the following:
...
test/prototype/test_with_asset.py::TestTransformsWithAsset::test_vocab_from_raw_text_file Fatal Python error: Segmentation fault
...

These tests should be disabled in CI once #1901 lands and you can rebase your branch on top of the latest main once this happens.

  • For the unittest_windows_py* tests, I see a bunch of warnings but no specific error.

The Windows CI build failures look new; they may be caused by this PR and could require some further investigation (see the sample CI failure).

@reachsumit reachsumit force-pushed the text_issue_1883 branch 2 times, most recently from 21a5cdf to 595f9a2 Compare September 15, 2022 01:16
@Nayef211
Contributor

@reachsumit can you try rebasing on the latest main branch to see if the CI issues are resolved now?

@reachsumit
Contributor Author

reachsumit commented Sep 16, 2022

@reachsumit can you try rebasing on the latest main branch to see if the CI issues are resolved now?

Still seeing the following failures. :(

binary_macos_conda_py3.7

Solving environment: ...working... failed

unittest_macos_py3.8

* Installing conda
PREFIX=/Users/distiller/project/conda
Unpacking payload ...
  0% 0/36 [00:00<?, ?it/s]Exception in thread Thread-2:
Traceback (most recent call last):
  File "threading.py", line 973, in _bootstrap_inner
  File "concurrent/futures/process.py", line 317, in run
  File "concurrent/futures/process.py", line 376, in wait_result_broken_or_wakeup
  File "concurrent/futures/process.py", line 376, in <listcomp>
RuntimeError: dictionary changed size during iteration

unittest_macos_py3.9

torchtext_unittest/datasets/test_iwslt2016.py::TestIWSLT2016::test_iwslt2016_split_argument_0_train FAILED
...
>       parser.feed(text)
E       xml.etree.ElementTree.ParseError: syntax error: line 1, column 0

../env/lib/python3.9/xml/etree/ElementTree.py:1344: ParseError

@Nayef211
Contributor

Nayef211 left a comment

Overall LGTM! Thanks for working on this feature @reachsumit! The 2 test failures you're seeing are due to flaky tests and not caused by your PR. I will stamp once the PR comment is resolved.

@@ -254,51 +266,56 @@ UString BERTEncoder::_basic_tokenize(const UString& text) {

std::vector<std::string> BERTEncoder::Tokenize(std::string text) {
  std::vector<std::string> results;
  std::vector<std::string> interim_results;
  std::vector<std::string> tokens;
  std::set<std::string> never_split(never_split_.begin(), never_split_.end());
@Nayef211
Contributor

Does it make sense to create this set inside of the Tokenize function instead of the constructor? The set will be recreated every time this method is called, which doesn't seem efficient.

@reachsumit
Contributor Author

Good point! I will fix it.
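
For context, the suggestion above amounts to building the never_split lookup set once in the constructor and reusing it on every Tokenize() call, instead of reconstructing it per call. The following is a hypothetical sketch of that pattern; the class and member names are illustrative and are not the actual torchtext implementation.

#include <iostream>
#include <set>
#include <string>
#include <vector>

class Encoder {
 public:
  // Build the lookup set once here rather than on every Tokenize() call.
  explicit Encoder(std::vector<std::string> never_split)
      : never_split_(std::move(never_split)),
        never_split_set_(never_split_.begin(), never_split_.end()) {}

  bool IsProtected(const std::string& token) const {
    // Membership check against the set created at construction time.
    return never_split_set_.count(token) > 0;
  }

 private:
  std::vector<std::string> never_split_;   // original list, as passed by the caller
  std::set<std::string> never_split_set_;  // derived once for fast lookups
};

int main() {
  Encoder encoder({"[UNK]", "[SEP]"});
  std::cout << encoder.IsProtected("[UNK]") << "\n";  // 1
  std::cout << encoder.IsProtected("hello") << "\n";  // 0
  return 0;
}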

@reachsumit
Contributor Author

Thanks for your comments, @Nayef211. I updated the code. The CI failures are as follows:

binary_linux_conda_py3.9

Solving environment: ...working... 
Found conflicts! Looking for incompatible packages.
This can take several minutes.  Press CTRL-C to abort.
failed         

unittest_windows_py3.10

torchtext_unittest\test_build.py::TestVocab::test_download_charngram_vectors FAILED [  0%]
...
..\env\lib\socket.py:955: gaierror

@joecummings
Contributor

@reachsumit @Nayef211 After rerunning the test and build suite, it appears that the only test that's still failing - unittest_linux_py3.7 - is related to this known issue: #1903. So everything looks good!

However, since this isn't a pressing feature, I'm in favor of waiting until @VitalyFedyunin has a fix out later this week to make sure that this doesn't cause any other strange failures.

@Nayef211
Contributor

@reachsumit @Nayef211 After rerunning the test and build suite, it appears that the only test that's still failing - unittest_linux_py3.7 - is related to this known issue: #1903. So everything looks good!

However, since this isn't a pressing feature, I'm in favor of waiting until @VitalyFedyunin has a fix out later this week to make sure that this doesn't cause any other strange failures.

I actually think we can go ahead and land now as this feature and the failures with the IWSLT dataset are not at all related. I don't think it makes sense to block landing this on #1903.

@Nayef211 Nayef211 merged commit b0df58b into pytorch:main Sep 19, 2022

Successfully merging this pull request may close these issues.

Add never_split kwarg to BERTTokenizer to achieve parity with transformers.BertTokenizer
4 participants