
Datasets' cache not re-used #3847

Open
@gejinchen

Describe the bug

For most tokenizers I have tested (e.g. the RoBERTa tokenizer), the data preprocessing cache is not fully reused during the first few runs, even though the .arrow cache files are already present in the cache directory.

Steps to reproduce the bug

Here is a reproducer. In this example the GPT2 tokenizer works perfectly with caching, but the RoBERTa tokenizer does not.

from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("wikitext", "wikitext-2-raw-v1")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_column_name = "text"
column_names = raw_datasets["train"].column_names

def tokenize_function(examples):
    return tokenizer(examples[text_column_name], return_special_tokens_mask=True)

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    remove_columns=column_names,
    load_from_cache_file=True,
    desc="Running tokenizer on every text in dataset",
)

Expected results

No tokenization should be required after the 1st run; everything should be loaded from the cache.

Actual results

Tokenization for some subsets is repeated on the 2nd and 3rd runs. Starting from the 4th run, everything is loaded from the cache.
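
One way to narrow down where the cache miss comes from is to hash the mapped function with datasets' own Hasher. This is a minimal diagnostic sketch, assuming datasets 1.18.x where datasets.fingerprint.Hasher is available: if the hash of tokenize_function changes after the tokenizer has been used once, the fingerprint of the .map() call changes too, and the cached .arrow file would not be reused.

from datasets.fingerprint import Hasher
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize_function(examples):
    return tokenizer(examples["text"], return_special_tokens_mask=True)

# Hash the function before and after the tokenizer is used once.
# A changing hash means .map() computes a new fingerprint and ignores the existing cache.
print("hash before first call:", Hasher.hash(tokenize_function))
tokenizer("warm-up text")  # the first call may mutate internal tokenizer state
print("hash after first call: ", Hasher.hash(tokenize_function))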

Environment info

  • datasets version: 1.18.3
  • Platform: Ubuntu 18.04.6 LTS
  • Python version: 3.6.9
  • PyArrow version: 6.0.1
