Does packing conflict with the lazy_encode parameter? #4054


Closed
hertz-pj opened this issue Apr 30, 2025 · 13 comments · Fixed by #4066
Labels
bug (Something isn't working) · enhancement (New feature or request)

Comments

@hertz-pj

Describe the bug
Is the packing operation in conflict with the lazy_encode parameter? It looks like packing stops taking effect once lazy_encode is enabled; I inferred this from the number of training steps needed per epoch.
Also, does packing conflict with the streaming parameter? How can I verify that packing is actually enabled?
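For reference, a back-of-the-envelope way to read the step counts (a rough sketch with made-up numbers, not an ms-swift API):

# Rough sanity check: with packing enabled, the number of optimizer steps per
# epoch should shrink by roughly the average number of samples that fit into
# one max_length window. All numbers below are illustrative assumptions.
num_samples = 200_000          # dataset size
avg_tokens_per_sample = 1_000  # average tokenized length
max_length = 8_192
world_size = 8
per_device_batch_size = 1
grad_accum = 8

steps_unpacked = num_samples // (world_size * per_device_batch_size * grad_accum)
packed_sequences = num_samples * avg_tokens_per_sample // max_length
steps_packed = packed_sequences // (world_size * per_device_batch_size * grad_accum)
print(steps_unpacked, steps_packed)  # here packing cuts steps by roughly 8x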

Your hardware and system info
ms_swift==3.2.2

@Jintao-Huang
Collaborator

You can use streaming.

@hertz-pj
Author

@Jintao-Huang Thanks for the reply. How can I confirm that packing is taking effect? In streaming mode it doesn't seem possible to tell from the number of steps per epoch.

@Jintao-Huang
Collaborator

Yes. A parameter will be added in the future: if max_epochs is exceeded, training will be forcibly stopped and the weights saved.
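A minimal sketch of what such a guard could look like with a standard transformers TrainerCallback (an illustration of the idea only, not the actual ms-swift implementation; the class and parameter names are made up):

# Hypothetical illustration: force-stop training and save weights once a
# maximum number of epochs is reached, even when max_steps is effectively
# unbounded (e.g. in streaming mode).
from transformers import TrainerCallback

class MaxEpochsCallback(TrainerCallback):
    def __init__(self, max_epochs: int):
        self.max_epochs = max_epochs

    def on_epoch_end(self, args, state, control, **kwargs):
        if state.epoch is not None and state.epoch >= self.max_epochs:
            control.should_save = True           # save the weights first
            control.should_training_stop = True  # then end the run
        return control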

@Jintao-Huang Jintao-Huang added the enhancement New feature or request label May 1, 2025
@hertz-pj
Author

hertz-pj commented May 1, 2025

A better approach seems to be:
Tokenize all the data ahead of time in a preprocessing pass and store it under a specified cache path.
That way packing can be used properly even in streaming mode, and it also improves efficiency.
At the moment, with streaming + packing, GPU utilization stays below 50%.
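For illustration, a minimal sketch of that "tokenize once, cache to disk" idea using the Hugging Face datasets API (the model name, paths, and field names are assumptions, and this is not how ms-swift's lazy_encode/packing pipeline is actually implemented):

# Hypothetical preprocessing pass: tokenize everything up front and persist it,
# so later runs (including streaming/packed ones) read pre-encoded samples.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # assumed model

def encode(example):
    # Assumes each sample already carries its full prompt/response as one string.
    return tokenizer(example["text"], truncation=True, max_length=8192)

ds = load_dataset("json", data_files="train.jsonl", split="train")
ds = ds.map(encode, num_proc=16, remove_columns=ds.column_names)
ds.save_to_disk("/path/to/tokenized_cache")  # reload later with datasets.load_from_disk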

@Jintao-Huang
Collaborator

Do you have the shell command?

@hertz-pj
Author

hertz-pj commented May 1, 2025

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NNODES=$n_nodes \
NODE_RANK=$node_rank \
MASTER_ADDR=$CHIEF_IP \
NPROC_PER_NODE=$nproc_per_node \
MASTER_PORT=29500 \
swift sft \
    --custom_register_path train/custom_model.py \
    --model $model_path \
    --model_type $model_type \
    --dataset  $train_data_path  \
    --val_dataset  $val_data_path  \
    --dataset_num_proc 1 \
    --train_type full \
    --torch_dtype bfloat16 \
    --num_train_epochs  $epoch  \
    --per_device_train_batch_size $batch_size \
    --per_device_eval_batch_size $batch_size \
    --learning_rate 1e-4 \
    --gradient_accumulation_steps 8 \
    --eval_steps 5000 \
    --save_steps 10000 \
    --save_only_model \
    --logging_steps 5 \
    --max_steps 50000000 \
    --max_length 8192 \
    --output_dir $output_dir \
    --warmup_ratio 0 \
    --packing true \
    --attn_impl flash_attn \
    --streaming true \
    --dataloader_num_workers 16 2>&1 | tee $output_dir/train.log

@Jintao-Huang
Collaborator

Is this a multimodal model? Is the dataset stored locally?

@Jintao-Huang
Collaborator

Try upgrading swift and see whether that resolves the problem.

The dataloader here should have been refactored since then.

@hertz-pj
Author

hertz-pj commented May 1, 2025

I have already converted all the data to string format, so it is single-modal.
The dataset is local. I see there is an enable_cache parameter, but I'm not sure whether it caches in memory or on disk, and there is no option to specify a cache dir path.
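For what it's worth, the Hugging Face datasets library that backs the preprocessing caches map results on disk rather than in memory, and the location can be inspected or redirected via the HF_DATASETS_CACHE environment variable (whether ms-swift's enable_cache maps onto exactly this path is an assumption on my part):

# Default on-disk cache location used by the datasets library
# (typically ~/.cache/huggingface/datasets); set HF_DATASETS_CACHE
# before importing datasets to move it elsewhere.
import datasets
print(datasets.config.HF_DATASETS_CACHE)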

@Jintao-Huang
Collaborator

class DataLoaderShard(DataLoader):

Try upgrading swift.

@hertz-pj
Author

hertz-pj commented May 1, 2025

@Jintao-Huang Thanks for answering questions even during the holiday. After upgrading to version 3.4, training throughput indeed improved by roughly 2.5x, and GPU utilization is back to normal.

@hertz-pj
Author

hertz-pj commented May 1, 2025

[rank2]: Traceback (most recent call last):
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/cli/sft.py", line 7, in <module>
[rank2]:     sft_main()
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 281, in sft_main
[rank2]:     return SwiftSft(args).main()
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/base.py", line 47, in main
[rank2]:     result = self.run()
[rank2]:              ^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 147, in run
[rank2]:     return self.train(trainer)
[rank2]:            ^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/train/sft.py", line 207, in train
[rank2]:     trainer.train(trainer.args.resume_from_checkpoint)
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/trainers/mixin.py", line 321, in train
[rank2]:     res = super().train(*args, **kwargs)
[rank2]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2241, in train
[rank2]:     return inner_training_loop(
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 2612, in _inner_training_loop
[rank2]:     self._maybe_log_save_evaluate(
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/trainers/mixin.py", line 379, in _maybe_log_save_evaluate
[rank2]:     super()._maybe_log_save_evaluate(tr_loss, *args, **kwargs)
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 3085, in _maybe_log_save_evaluate
[rank2]:     metrics = self._evaluate(trial, ignore_keys_for_eval)
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 3039, in _evaluate
[rank2]:     metrics = self.evaluate(ignore_keys=ignore_keys_for_eval)
[rank2]:               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/trainers/trainers.py", line 157, in evaluate
[rank2]:     res = super().evaluate(*args, **kwargs)
[rank2]:           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer_seq2seq.py", line 197, in evaluate
[rank2]:     return super().evaluate(eval_dataset, ignore_keys=ignore_keys, metric_key_prefix=metric_key_prefix)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 4105, in evaluate
[rank2]:     output = eval_loop(
[rank2]:              ^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/transformers/trainer.py", line 4289, in evaluation_loop
[rank2]:     for step, inputs in enumerate(dataloader):
[rank2]:                         ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/data_loader.py", line 564, in __iter__
[rank2]:     current_batch = next(dataloader_iter)
[rank2]:                     ^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 701, in __next__
[rank2]:     data = self._next_data()
[rank2]:            ^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1465, in _next_data
[rank2]:     return self._process_data(data)
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/dataloader.py", line 1491, in _process_data
[rank2]:     data.reraise()
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/_utils.py", line 715, in reraise
[rank2]:     raise exception
[rank2]: AssertionError: Caught AssertionError in DataLoader worker process 0.
[rank2]: Original Traceback (most recent call last):
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/worker.py", line 351, in _worker_loop
[rank2]:     data = fetcher.fetch(index)  # type: ignore[possibly-undefined]
[rank2]:            ^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/torch/utils/data/_utils/fetch.py", line 33, in fetch
[rank2]:     data.append(next(self.dataset_iter))
[rank2]:                 ^^^^^^^^^^^^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/accelerate/data_loader.py", line 342, in __iter__
[rank2]:     for element in self.dataset:
[rank2]:                    ^^^^^^^^^^^^
[rank2]:   File "/usr/local/lib/python3.12/dist-packages/swift/llm/dataset/utils.py", line 256, in __iter__
[rank2]:     worker.start()
[rank2]:   File "/usr/lib/python3.12/multiprocessing/process.py", line 118, in start
[rank2]:     assert not _current_process._config.get('daemon'), \
[rank2]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank2]: AssertionError: daemonic processes are not allowed to have children

After updating to the latest version, a new problem appeared; it occurs once execution reaches the evaluation step. From the traceback, the DataLoader worker processes are daemonic, so the iterator in swift/llm/dataset/utils.py cannot start its own child process inside them.

@Jintao-Huang Jintao-Huang added the bug Something isn't working label May 2, 2025
@Jintao-Huang Jintao-Huang linked a pull request May 2, 2025 that will close this issue
@Jintao-Huang
Collaborator

fixed
