🚀 Feature
Add functionality to BERTTokenizer so that it does not split/tokenize special tokens as if they were regular words.
Motivation
The HuggingFace BertTokenizer is able to recognize its own special tokens and ensure that they are not split during the tokenization process. For example, if given the following input:
"\tHeLLo!how \n Are yoU? [UNK]"
The HuggingFace (HF) tokenizer returns the "[UNK]" token un-split, whereas the TorchText (TT) tokenizer returns it split:
HF (expected): ['hello', '!', 'how', 'are', 'you', '?', '[UNK]']
TT (actual): ['hello', '!', 'how', 'are', 'you', '?', '[', 'un', '##k', ']']
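For reference, the disparity can be reproduced with roughly the following sketch (vocab_file is assumed to be a local path to the bert-base-uncased vocab.txt; the full script is in the Gist linked under Additional context):

import transformers
from torchtext.transforms import BERTTokenizer

text = "\tHeLLo!how \n Are yoU? [UNK]"

# vocab_file: assumed local path to the bert-base-uncased vocab.txt
hf_tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
tt_tokenizer = BERTTokenizer(
    vocab_path=vocab_file,
    do_lower_case=True,
    return_tokens=True,
)

print(hf_tokenizer.tokenize(text))  # ['hello', '!', 'how', 'are', 'you', '?', '[UNK]']
print(tt_tokenizer(text))           # ['hello', '!', 'how', 'are', 'you', '?', '[', 'un', '##k', ']']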
Pitch
A never_split keyword argument at the constructor level that lets the user specify which tokens should never be split during tokenization. Such a keyword argument would be flexible and give the user the freedom to adapt to various other tokenization schemes. An example interface:
tt_tokenizer = BERTTokenizer(
    vocab_path=vocab_file,
    do_lower_case=True,
    return_tokens=True,
    never_split=[
        "[UNK]",
        "[CLS]",
        "[SEP]",
        "[PAD]",
    ],
)
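With such an argument, the Motivation example above would be expected to leave the special token intact (a sketch of the intended behavior, not verified against an implementation):

tt_tokenizer("\tHeLLo!how \n Are yoU? [UNK]")
# expected: ['hello', '!', 'how', 'are', 'you', '?', '[UNK]']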
This closely matches the existing HuggingFace BertTokenizer interface (constructor-level kwarg here, tokenize function implementation here).
If the user were able to instantiate the HuggingFace tokenizer with which they were attempting to achieve parity, they could trivially assign the never_split keyword argument to its special tokens:
import transformers

hf_tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
tt_tokenizer = BERTTokenizer(
    vocab_path=vocab_file,
    do_lower_case=True,
    return_tokens=True,
    never_split=hf_tokenizer.all_special_tokens,
)
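For bert-base-uncased, hf_tokenizer.all_special_tokens should resolve to ['[UNK]', '[SEP]', '[PAD]', '[CLS]', '[MASK]'] (exact ordering aside), so the '[MASK]' token would also be preserved in addition to the four tokens listed above.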
Alternatives
None that are more straightforward than this.
Additional context
A full reproduction of the above disparity in functionality, as well as a comment about the requested interface can be found in this Gist: https://gist.github.com/geoffreyangus/7b4c04f1e485c9c4091728bff1414134