-
Notifications
You must be signed in to change notification settings - Fork 636
推理中出现从未遇见的bug #4116
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
你需要传入个val_dataset进去... |
https://swift.readthedocs.io/zh-cn/latest/Instruction/%E6%8E%A8%E7%90%86%E5%92%8C%E9%83%A8%E7%BD%B2.html#cli |
那需要把 分布式 关掉 |
感谢你的回复,我猜测大概也是这个原因,请问具体是增加哪个参数呢?(^▽^) |
你看看 你的环境中是否默认设置了 NPROC_PER_NODE环境变量 |
您好,我分别尝试了 |
您好,或者您是否可以指教一下val_dataset该如何组织呢?我在官网上没有找到val_dataset的格式,再次感谢您的回复 |
这是我的训练代码:
export ASCEND_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 HCCL_CHECK_TIMEOUT=600 TORCH_NPU_FUSION_ENABLE=1; torchrun --nproc_per_node=8 --rdzv_conf 'overlap_timeout=600' /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/sft.py --ddp_backend hccl --model /root/nfs_storage/zhangliang/Qwen2.5vl-32B-Instruct --dataset /root/zhangliang/Qwen2.5-VL/zhongda/xiongpian/alldata_right_withend.jsonl --deepspeed zero3 --train_type lora --torch_dtype bfloat16 --num_train_epochs 3 --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 5e-6 --lr_scheduler_type linear --warmup_ratio 0.1 --weight_decay 0.01 --lora_rank 32 --lora_alpha 64 --lora_dropout 0.1 --use_rslora False --max_grad_norm 0.5 --target_modules all-linear --freeze_vit true --eval_steps 300 --save_steps 300 --save_total_limit 50 --logging_steps 30 --max_length 8192 --output_dir /data/output/zhongda --dataloader_num_workers 8 --model_kwargs '{"device_map":{"": "npu:auto"}}' --optim adamw_torch_npu_fused --gradient_checkpointing False > /data/logs/zhongda/output22.txt 2>&1
这是我的测试代码
(qwenft) [root@ascendnode5 yanke]# MAX_PIXELS=802816 \
我之前一直都用的好好的,突然就出现了这样的错误。我修改推理代码,使用之前运行完成正常的代码,还是会报错,寻找了许多办法,最终还是未能解决。
最初报错是说--nproc_per_node 这个参数有问题,我一通export之后不报这个错了,现在一直卡死在如下的错误上了。
run sh:
/root/anaconda3/envs/qwenft/bin/python -m torch.distributed.run --nproc_per_node 8 /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py --adapters /data/output/zhongda/v19-20250424-175905/checkpoint-300 --stream False --temperature 0.1 --repetition_penalty 1.2 --top_p 0.95 --max_new_tokens 512
WARNING:main:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[INFO:swift] Successfully registered
/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/dataset/data/dataset_info.json
[INFO:swift] Loading the model using model_dir: /data/output/zhongda/v19-20250424-175905/checkpoint-300
[INFO:swift] Successfully loaded /data/output/zhongda/v19-20250424-175905/checkpoint-300/args.json.
[INFO:swift] rank: 0, local_rank: 0, world_size: 8, local_world_size: 8
[INFO:swift] Loading the model using model_dir: /root/nfs_storage/wangzhuoran/modelweights/Qwen2.5-VL-7B-Instruct
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
[INFO:swift] args.result_path: /data/output/zhongda/v19-20250424-175905/checkpoint-300/infer_result/20250507-172941.jsonl
[INFO:swift] Setting args.eval_human: True
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
W0507 17:29:44.959000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366431 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366433 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366434 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366435 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366436 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366437 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366438 closing signal SIGTERM
E0507 17:29:52.200000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 366432) of binary: /root/anaconda3/envs/qwenft/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/qwenft/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/qwenft/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 905, in
main()
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
求大佬解答一下我的问题,感谢大佬QAQ
/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py FAILED
Failures:
<NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2025-05-07_17:29:44
host : ascendnode5
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 366432)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
The text was updated successfully, but these errors were encountered: