Thank you for your excellent framework! I noticed that after adding the `lazy_tokenize` parameter, large-scale multimodal datasets are tokenized in batches, generating `*.arrow` files in the `.cache` directory. Is there a parameter that pre-tokenizes all of the data once and saves the Arrow files, or a way to pass the image or video files referenced in the JSONL dataset in an already-tokenized form (similar to `*.pkl`), so as to speed up training? Looking forward to your guidance and reply.
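For context, here is a minimal sketch of the kind of one-shot pre-tokenization I have in mind, assuming the framework builds on Hugging Face `datasets` and `transformers`. The file path, model name, and the text-only `tokenize_fn` below are illustrative placeholders, not the framework's actual preprocessing pipeline:

```python
from datasets import load_dataset, load_from_disk
from transformers import AutoTokenizer

# Hypothetical model; a real multimodal setup would use the matching processor.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

def tokenize_fn(batch):
    # Tokenize the raw text field; a real multimodal pipeline would also
    # preprocess the image/video paths referenced in each record.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

raw = load_dataset("json", data_files="train.jsonl", split="train")  # hypothetical file
tokenized = raw.map(
    tokenize_fn,
    batched=True,
    num_proc=8,                       # parallelize tokenization across workers
    remove_columns=raw.column_names,  # keep only the tokenized columns
)

# Persist the tokenized Arrow files once; later runs can reload them
# directly instead of re-tokenizing the whole dataset.
tokenized.save_to_disk("tokenized_train")

# In a subsequent training run:
tokenized = load_from_disk("tokenized_train")
```

The point of `save_to_disk` / `load_from_disk` here is that the expensive `.map()` pass runs only once, after which training jobs start from the saved Arrow files.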