🚀 Feature
Add functionality to BERTTokenizer so that it does not split/tokenize special tokens as if they were regular words.
Motivation
The HuggingFace BertTokenizer is able to recognize its own special tokens and ensure that they are not split during the tokenization process. For example, if given the following input:
"\tHeLLo!how \n Are yoU? [UNK]"
The HuggingFace (HF) tokenizer returns the "[UNK]" token un-split, whereas the TorchText (TT) tokenizer returns it split:
HF (expected): ['hello', '!', 'how', 'are', 'you', '?', '[UNK]']
TT (actual): ['hello', '!', 'how', 'are', 'you', '?', '[', 'un', '##k', ']']
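For reference, the disparity can be reproduced with roughly the following sketch (vocab_file is assumed to be a local path to the bert-base-uncased vocab.txt; the full script is in the Gist linked under Additional context):

import transformers
from torchtext.transforms import BERTTokenizer

text = "\tHeLLo!how \n Are yoU? [UNK]"

# vocab_file: assumed local path to the bert-base-uncased vocab.txt
hf_tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
tt_tokenizer = BERTTokenizer(
    vocab_path=vocab_file,
    do_lower_case=True,
    return_tokens=True,
)

print(hf_tokenizer.tokenize(text))  # ['hello', '!', 'how', 'are', 'you', '?', '[UNK]']
print(tt_tokenizer(text))           # ['hello', '!', 'how', 'are', 'you', '?', '[', 'un', '##k', ']']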
Pitch
A never_split keyword argument at the constructor level that lets the user specify which tokens should never be split during tokenization. Such a keyword argument would be flexible and give the user the freedom to adapt to various other tokenization schemes. An example interface:
tt_tokenizer = BERTTokenizer(
    vocab_path=vocab_file,
    do_lower_case=True,
    return_tokens=True,
    never_split=[
        "[UNK]",
        "[CLS]",
        "[SEP]",
        "[PAD]",
    ],
)
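With such an argument, the Motivation example above would be expected to leave the special token intact (a sketch of the intended behavior, not verified against an implementation):

tt_tokenizer("\tHeLLo!how \n Are yoU? [UNK]")
# expected: ['hello', '!', 'how', 'are', 'you', '?', '[UNK]']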
This closely matches the existing HuggingFace BertTokenizer interface (constructor-level kwarg here, tokenize function implementation here).
If the user were able to instantiate the HuggingFace tokenizer with which they were attempting to achieve parity, they could trivially assign the never_split keyword argument to its special tokens:
import transformers

hf_tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
tt_tokenizer = BERTTokenizer(
    vocab_path=vocab_file,
    do_lower_case=True,
    return_tokens=True,
    never_split=hf_tokenizer.all_special_tokens,
)
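For bert-base-uncased, hf_tokenizer.all_special_tokens should resolve to ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] (exact ordering aside), so the '[MASK]' token would also be preserved in addition to the four tokens listed above.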
Alternatives
None that are more straightforward than this.
Additional context
A full reproduction of the above disparity in functionality, as well as a comment about the requested interface can be found in this Gist: https://gist.github.com/geoffreyangus/7b4c04f1e485c9c4091728bff1414134