Add never_split kwarg to BERTTokenizer to achieve parity with transformers.BertTokenizer #1883

Closed
@geoffreyangus

🚀 Feature

Add functionality to BERTTokenizer so that special tokens are kept intact instead of being split and tokenized like regular words.

Motivation

The HuggingFace BertTokenizer is able to recognize its own special tokens and ensure that they are not split during the tokenization process. For example, if given the following input:

"\tHeLLo!how  \n Are yoU? [UNK]"

The HuggingFace (HF) tokenizer returns the "[UNK]" token un-split, whereas the TorchText (TT) tokenizer returns it split:

HF (expected):   ['hello', '!', 'how', 'are', 'you', '?', '[UNK]']
TT (actual):     ['hello', '!', 'how', 'are', 'you', '?', '[', 'un', '##k', ']']
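The difference comes down to whether the pre-tokenization step protects special tokens before lowercasing and punctuation splitting. Below is a minimal pure-Python sketch of that idea. It is not the HuggingFace or TorchText implementation: `basic_tokenize` is a hypothetical helper, it only protects whitespace-delimited tokens, and it omits the WordPiece step (which is why the unprotected case yields 'unk' rather than 'un', '##k').

```python
import string


def basic_tokenize(text, never_split=(), do_lower_case=True):
    """Whitespace + punctuation pre-tokenization (hypothetical helper).

    Chunks listed in `never_split` are emitted untouched: they are neither
    lowercased nor split on punctuation.
    """
    protected = set(never_split)
    tokens = []
    for chunk in text.split():  # split on any run of whitespace
        if chunk in protected:
            tokens.append(chunk)  # keep the special token as-is
            continue
        if do_lower_case:
            chunk = chunk.lower()
        current = ""
        for ch in chunk:
            if ch in string.punctuation:
                # Flush the word collected so far, then emit the
                # punctuation character as its own token.
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


text = "\tHeLLo!how  \n Are yoU? [UNK]"
print(basic_tokenize(text))
# ['hello', '!', 'how', 'are', 'you', '?', '[', 'unk', ']']
print(basic_tokenize(text, never_split=["[UNK]"]))
# ['hello', '!', 'how', 'are', 'you', '?', '[UNK]']
```

Without the protected set, "[UNK]" is lowercased and broken apart at the brackets before any vocabulary lookup happens, so the original special token can no longer be recovered downstream.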

Pitch

A never_split keyword argument at the constructor level that lets the user specify which tokens must be left intact during tokenization. Such a keyword argument would be flexible and give the user the freedom to adapt to various other tokenization schemes. An example interface:

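from torchtext.transforms import BERTTokenizer

# vocab_file is the path to a BERT vocabulary file; never_split is the proposed kwarg
tt_tokenizer = BERTTokenizer(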
    vocab_path=vocab_file,
    do_lower_case=True,
    return_tokens=True,
    never_split=[
        "[UNK]",
        "[CLS]",
        "[SEP]",
        "[PAD]",
    ],
)

This closely matches the existing HuggingFace BertTokenizer interface, which accepts never_split as a constructor-level kwarg and honors it in its tokenize function.

If the user can instantiate the HuggingFace tokenizer they are trying to match, they can simply pass its special tokens as the never_split keyword argument:

import transformers
from torchtext.transforms import BERTTokenizer

hf_tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
tt_tokenizer = BERTTokenizer(
    vocab_path=vocab_file,
    do_lower_case=True,
    return_tokens=True,
    never_split=hf_tokenizer.all_special_tokens,
)

Alternatives

None that are more straightforward than this.

Additional context

A full reproduction of the above disparity in functionality, along with a comment about the requested interface, can be found in this Gist: https://gist.github.com/geoffreyangus/7b4c04f1e485c9c4091728bff1414134
