InternVL3-9B LoRA fine-tuning: dataset preprocessing is very slow (about 7 hours) #4076


Open
jxma20 opened this issue May 4, 2025 · 2 comments
@jxma20

jxma20 commented May 4, 2025

Dataset example

{
  "messages": [
    {
      "role": "user",
      "content": "<image>\nIs there a blue or green color cast in the photo?"
    },
    {
      "role": "assistant",
      "content": "Yes"
    }
  ],
  "images": [
    "/fine_tune/M_Database/1.jpg"
  ]
},
My dataset contains 78,170 samples in this format.
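
Since every sample points at an image on disk, it can save a long wasted preprocessing run to sanity-check the paths first. A minimal sketch, assuming swift_data.json is a top-level JSON array in the format above and that jq is available:

    # Print any image path referenced in the dataset that is missing on disk.
    jq -r '.[].images[]' /fine_tune/InternVL3-9B/swift_data.json |
        sort -u |
        while read -r img; do
            [ -f "$img" ] || echo "missing: $img"
        done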

Environment

RTX 3090 * 4
python 3.10.0
ms-swift 3.4.0

--lazy_tokenize

At first I did not notice this parameter. The official docs say it defaults to True for MLLM fine-tuning, meaning preprocessing is interleaved with training; in that mode my fine-tuning run would have taken about 11 days to finish.
So I set it to False, but the preprocessing stage is still very slow: even with dataset_num_proc=12 it takes roughly 7 hours to complete.

Fine-tuning command

export HF_DATASETS_CACHE="/fine_tune/cachefile/"
swift sft \
    --model /fine_tune/InternVL3-9B/ \
    --train_type lora \
    --dataset '/fine_tune/InternVL3-9B/swift_data.json' \
    --enable_cache True \
    --lazy_tokenize False \
    --dataset_num_proc 12 \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4
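
The environment lists four RTX 3090s, but the command above launches a single process. If the goal is data-parallel training on all four GPUs, ms-swift's example scripts launch via the NPROC_PER_NODE environment variable; a hedged sketch (the variable names follow the ms-swift examples, and the flag list is abbreviated):

    # Data-parallel LoRA fine-tuning, one process per GPU.
    CUDA_VISIBLE_DEVICES=0,1,2,3 \
    NPROC_PER_NODE=4 \
    swift sft \
        --model /fine_tune/InternVL3-9B/ \
        --train_type lora \
        --dataset '/fine_tune/InternVL3-9B/swift_data.json'
        # ...plus the remaining flags from the command above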

@Jintao-Huang
Collaborator

Fine-tuning and preprocessing overlap in time.

If you want to speed up the fine-tuning process, you can refer to this example: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/streaming.sh
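
For reference, a condensed sketch of the approach in the linked streaming.sh: stream the dataset and pack samples instead of pre-tokenizing everything up front. The --streaming and --packing flags come from that example; with streaming the dataset length is unknown in advance, so the step count has to be given explicitly (exact flag behavior may differ across ms-swift versions):

    # Stream and pack the dataset rather than running a full map beforehand.
    swift sft \
        --model /fine_tune/InternVL3-9B/ \
        --train_type lora \
        --dataset '/fine_tune/InternVL3-9B/swift_data.json' \
        --streaming true \
        --packing true \
        --max_length 2048 \
        --max_steps 1000
    # max_steps 1000 is a placeholder; pick a value matching the intended epochs.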

@jxma20
Author

jxma20 commented May 4, 2025

> Fine-tuning and preprocessing overlap in time.
>
> If you want to speed up the fine-tuning process, you can refer to this example: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/streaming.sh

Hi, thanks for the reply!

  1. From what I observed during the run, the program only loaded the model into GPU memory; GPU utilization stayed close to 0 the whole time, so in my case training and preprocessing are not overlapping. That is consistent with lazy_tokenize=False: the dataset is first fully preprocessed with map, and only then does fine-tuning start.
  2. The processing time may be related to InternVL's preprocessing logic being slower. But I did similar preprocessing earlier when fine-tuning Qwen2.5-VL, except that I wrote the preprocessing function myself and called datasets.map directly. On the same dataset, 12 processes finished in ten-odd minutes, whereas this run takes about 7 hours, which is a very large gap.
  3. Even at 7 hours I let the map run to completion, but when it finished it printed: Dataset filtered, origin length: 77389, filtered dataset length: 18755. I have seen a related issue in this project, but it got no reply; I would like to know why this happens and how to avoid it.
  4. Unfortunately, the program then crashed. After the map completed, it printed one sample's input_ids and labels_ids, then appeared to start another map operation, which failed immediately with the error below (a debugging sketch follows after the traceback):
     input_ids: [……]
     labels_ids: [……]
     Map (num_proc=12):   0%|          | 0/18755 [05:16<?, ? examples/s]
     Traceback (most recent call last):
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/cli/sft.py", line 7, in <module>
         sft_main()
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 281, in sft_main
         return SwiftSft(args).main()
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/base.py", line 47, in main
         result = self.run()
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 121, in run
         train_dataset, val_dataset = self._encode_dataset(train_dataset, val_dataset)
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 273, in _encode_dataset
         self.train_msg['train_dataset'] = self._stat_dataset(train_dataset)
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/train/sft.py", line 232, in _stat_dataset
         dataset = GetLengthPreprocessor()(dataset, num_proc=args.dataset_num_proc)
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/swift/llm/dataset/preprocessor/core.py", line 305, in __call__
         dataset_mapped = dataset.map(
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 557, in wrapper
         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/arrow_dataset.py", line 3171, in map
         for rank, done, content in iflatmap_unordered(
       File "/DataB/mjx/.conda/envs/InternVL/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 721, in iflatmap_unordered
         raise RuntimeError(
     RuntimeError: One of the subprocesses has abruptly died during map operation. To debug the error, disable multiprocessing.
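
Following the error message's own advice, the quickest way to surface the underlying exception is to rerun preprocessing with multiprocessing disabled, so the failing worker's traceback is raised in the main process. A minimal sketch using only flags already present in the command above:

    # Rerun with a single preprocessing worker to get the real traceback.
    swift sft \
        --model /fine_tune/InternVL3-9B/ \
        --train_type lora \
        --dataset '/fine_tune/InternVL3-9B/swift_data.json' \
        --lazy_tokenize False \
        --dataset_num_proc 1
    # ...keep the remaining flags from the original command unchanged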

Looking forward to your reply!
