Describe the bug
For most tokenizers I have tested (e.g. the RoBERTa tokenizer), the data preprocessing cache is not fully reused in the first few runs, even though the corresponding .arrow cache files are already in the cache directory.
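As a quick way to see what is (or is not) being cached, the .arrow files written by map() can be listed directly; they are written next to the dataset's own arrow files under the HF cache directory. This is only an illustrative sketch, assuming the default cache location:

import os
from datasets import load_dataset

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")

# map() writes its cache as cache-<fingerprint>.arrow next to the dataset's
# own arrow files, so listing that directory shows which results are cached.
dataset_dir = os.path.dirname(raw_datasets["train"].cache_files[0]["filename"])
print(sorted(f for f in os.listdir(dataset_dir) if f.endswith(".arrow")))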
Steps to reproduce the bug
Here is a reproducer. In this example, the GPT-2 tokenizer works perfectly with caching, but the RoBERTa tokenizer does not.
from datasets import load_dataset
from transformers import AutoTokenizer
raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_column_name = "text"
column_names = raw_datasets["train"].column_names
def tokenize_function(examples):
    return tokenizer(examples[text_column_name], return_special_tokens_mask=True)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on every text in dataset",
)
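As an extra check (not part of the original snippet), the cache file actually backing the mapped dataset can be printed right after the map call; a stable cache hit should point at the same cache-<fingerprint>.arrow path on every run:

# Continuation of the reproducer above: print the cache files backing each split.
# If a new cache-<fingerprint>.arrow path appears on every run, the fingerprint
# computed for tokenize_function is changing between runs.
for split, dset in tokenized_datasets.items():
    print(split, [f["filename"] for f in dset.cache_files])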
Expected results
No tokenization should be required after the 1st run; everything should be loaded from the cache.
Actual results
Tokenization of some subsets is repeated on the 2nd and 3rd runs. Starting from the 4th run, everything is loaded from the cache.
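One hedged guess at a diagnostic: datasets decides whether to reuse a cache file by hashing the map function (which pickles the enclosed tokenizer), so if the tokenizer's internal state changes after its first call, the hash, and therefore the cache fingerprint, changes with it. A minimal sketch to check this, assuming datasets.fingerprint.Hasher and roberta-base as above:

from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=True)

# If these two hashes differ, calling the tokenizer mutated its state, and any
# subsequent map() call will compute a different cache fingerprint.
print("hash before first call:", Hasher.hash(tokenize_function))
tokenizer("warm-up text")
print("hash after first call: ", Hasher.hash(tokenize_function))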
Environment info
- datasets version: 1.18.3
- Platform: Ubuntu 18.04.6 LTS
- Python version: 3.6.9
- PyArrow version: 6.0.1