-
Notifications
You must be signed in to change notification settings - Fork 812
Add never_split feature to BERTTokenizer #1898
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@reachsumit Please look into test failures. LMK if you need help debugging. |
@joecummings Thank you for looking into this. There seems to be a pattern of failures among the tests.
I see the same pattern of failures for other builds, such as @Nayef211 's #1899 .
I would appreciate any tips or pointers for debugging this further. 🙏🏽 |
Hey @reachsumit. Happy to provide some further context here.
This should be resolved once #1889 lands and you can ignore these errors for the time being.
These tests should be disabled in CI once #1901 lands and you can rebase your branch on top of the latest
The windows CI build failures look new and may be caused by this PR and could require some further investigation (see sample CI failure) |
21a5cdf
to
595f9a2
Compare
@reachsumit can you try rebasing on the latest main branch to see if the CI issues are resolved now? |
595f9a2
to
f0e5729
Compare
Still seeing the following failures. :(
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall LGTM! Thanks for working on this feature @reachsumit! The 2 test failures you're seeing are due to flaky tests and not caused by your PR. I will stamp once the PR comment is resolved.
torchtext/csrc/bert_tokenizer.cpp
Outdated
@@ -254,51 +266,56 @@ UString BERTEncoder::_basic_tokenize(const UString& text) { | |||
|
|||
std::vector<std::string> BERTEncoder::Tokenize(std::string text) { | |||
std::vector<std::string> results; | |||
std::vector<std::string> interim_results; | |||
std::vector<std::string> tokens; | |||
std::set<std::string> never_split(never_split_.begin(), never_split_.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does is make sense to create this set inside of the Tokenize
function instead of the constructor? The set will be recreated every time this method is called which doesn't seem efficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! I will fix it.
Thanks for your comments, @Nayef211 . I updated the code. The CI failures are as follows:
|
@reachsumit @Nayef211 After rerunning the test and build suite, it appears that the only test that's still failing - However, since this isn't a pressing feature, I'm in favor of waiting until @VitalyFedyunin has a fix out later this week to make sure that this doesn't cause any other strange failures. |
I actually think we can go ahead and land now as this feature and the failures with the IWSLT dataset are not at all related. I don't think it makes sense to block landing this on #1903. |
Description
Add never_split feature to BERTTokenizer
This change is targeted to address the issue #1883 . It adds a new parameter
never_split
toBERTTokenizer
in order to enable the user to specify a sequence of token that should not be changed during tokenization.Close #1883
Types of changes
[x ] New feature (non-breaking change which adds core functionality)
Changes made
BERTEncoder
class) to support thenever_split
functionality.never_split
parameter to Python code (BERTTokenizer
class).Testing
pre-commit