Pre-offline tokenize for ultra large multimodal datasets #4079

binisalegend · 2025-05-04T16:54:49Z

Thank you for the excellent framework! I noticed that after I added the lazy_tokenize parameter, large-scale multimodal datasets are tokenized in batches, which generates *.arrow files in the .cache directory. Is there a parameter that pre-tokenizes all the data and saves the arrow files ahead of time? Alternatively, can the image or video entries in the jsonl dataset be passed in already tokenized, in a form similar to *.pkl, to speed up training? Looking forward to your guidance and reply.

Jintao-Huang added the bug and enhancement labels · May 5, 2025
