Skip to content

推理中出现从未遇见的bug #4116

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
SeuZL opened this issue May 7, 2025 · 7 comments
Open

推理中出现从未遇见的bug #4116

SeuZL opened this issue May 7, 2025 · 7 comments

Comments

@SeuZL
Copy link

SeuZL commented May 7, 2025

这是我的训练代码:
export ASCEND_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 HCCL_CHECK_TIMEOUT=600 TORCH_NPU_FUSION_ENABLE=1; torchrun --nproc_per_node=8 --rdzv_conf 'overlap_timeout=600' /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/sft.py --ddp_backend hccl --model /root/nfs_storage/zhangliang/Qwen2.5vl-32B-Instruct --dataset /root/zhangliang/Qwen2.5-VL/zhongda/xiongpian/alldata_right_withend.jsonl --deepspeed zero3 --train_type lora --torch_dtype bfloat16 --num_train_epochs 3 --per_device_train_batch_size 1 --gradient_accumulation_steps 2 --learning_rate 5e-6 --lr_scheduler_type linear --warmup_ratio 0.1 --weight_decay 0.01 --lora_rank 32 --lora_alpha 64 --lora_dropout 0.1 --use_rslora False --max_grad_norm 0.5 --target_modules all-linear --freeze_vit true --eval_steps 300 --save_steps 300 --save_total_limit 50 --logging_steps 30 --max_length 8192 --output_dir /data/output/zhongda --dataloader_num_workers 8 --model_kwargs '{"device_map":{"": "npu:auto"}}' --optim adamw_torch_npu_fused --gradient_checkpointing False > /data/logs/zhongda/output22.txt 2>&1

这是我的测试代码
(qwenft) [root@ascendnode5 yanke]# MAX_PIXELS=802816 \

swift infer
--adapters /data/output/zhongda/v19-20250424-175905/checkpoint-300
--stream False
--temperature 0.1
--repetition_penalty 1.2
--top_p 0.95
--max_new_tokens 512

我之前一直都用的好好的,突然就出现了这样的错误。我修改推理代码,使用之前运行完成正常的代码,还是会报错,寻找了许多办法,最终还是未能解决。

最初报错是说--nproc_per_node 这个参数有问题,我一通export之后不报这个错了,现在一直卡死在如下的错误上了。
run sh: /root/anaconda3/envs/qwenft/bin/python -m torch.distributed.run --nproc_per_node 8 /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py --adapters /data/output/zhongda/v19-20250424-175905/checkpoint-300 --stream False --temperature 0.1 --repetition_penalty 1.2 --top_p 0.95 --max_new_tokens 512
WARNING:main:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[INFO:swift] Successfully registered /root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/dataset/data/dataset_info.json
[INFO:swift] Loading the model using model_dir: /data/output/zhongda/v19-20250424-175905/checkpoint-300
[INFO:swift] Successfully loaded /data/output/zhongda/v19-20250424-175905/checkpoint-300/args.json.
[INFO:swift] rank: 0, local_rank: 0, world_size: 8, local_world_size: 8
[INFO:swift] Loading the model using model_dir: /root/nfs_storage/wangzhuoran/modelweights/Qwen2.5-VL-7B-Instruct
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
[INFO:swift] args.result_path: /data/output/zhongda/v19-20250424-175905/checkpoint-300/infer_result/20250507-172941.jsonl
[INFO:swift] Setting args.eval_human: True
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
Traceback (most recent call last):
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py", line 5, in
infer_main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 241, in infer_main
return SwiftInfer(args).main()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/infer/infer.py", line 26, in init
super().init(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 18, in init
self.args = self._parse_args(args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/base.py", line 29, in _parse_args
args, remaining_argv = parse_args(self.args_class, args)
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/utils/utils.py", line 146, in parse_args
args, remaining_args = parser.parse_args_into_dataclasses(argv, return_remaining_strings=True)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/transformers/hf_argparser.py", line 357, in parse_args_into_dataclasses
obj = dtype(**inputs)
File "", line 95, in init
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 169, in post_init
self._init_ddp()
File "/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/llm/argument/infer_args.py", line 150, in _init_ddp
assert not self.eval_human and not self.stream
AssertionError
W0507 17:29:44.959000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366431 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366433 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366434 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366435 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366436 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366437 closing signal SIGTERM
W0507 17:29:44.960000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 366438 closing signal SIGTERM
E0507 17:29:52.200000 281473388568608 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 1 (pid: 366432) of binary: /root/anaconda3/envs/qwenft/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/qwenft/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/qwenft/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 905, in
main()
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/qwenft/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

求大佬解答一下我的问题,感谢大佬QAQ
/root/zhangliang/Qwen2.5-VL/Lora/ms-swift/swift/cli/infer.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2025-05-07_17:29:44
host : ascendnode5
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 366432)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

@Jintao-Huang
Copy link
Collaborator

你需要传入个val_dataset进去...

@SeuZL
Copy link
Author

SeuZL commented May 7, 2025

你需要建立一个val_dataset进去...

https://swift.readthedocs.io/zh-cn/latest/Instruction/%E6%8E%A8%E7%90%86%E5%92%8C%E9%83%A8%E7%BD%B2.html#cli
我之前都是参考这个里面的代码写的,没有传过val_dataset呀,都是用这样的代码去推理的呀:
你的身份是一名影像科医生,你拿到了一张55岁的女性的,[胸部摄片(正位)]的图像,请参考上述信息阅读图像,给出患者摄片的读片结果?
然后输入一个图片的路径,然后模型就出结果了

@Jintao-Huang
Copy link
Collaborator

那需要把 分布式 关掉

@SeuZL
Copy link
Author

SeuZL commented May 7, 2025

那需要把 分布式 关掉

感谢你的回复,我猜测大概也是这个原因,请问具体是增加哪个参数呢?(^▽^)

@Jintao-Huang
Copy link
Collaborator

你看看 你的环境中是否默认设置了 NPROC_PER_NODE环境变量

@SeuZL
Copy link
Author

SeuZL commented May 8, 2025

您好,我分别尝试了
export nproc_per_node=1
export NPROC_PER_NODE=1
然后再进行swift infer
--adapters /data/output/zhongda/v19-20250424-175905/checkpoint-300
--temperature 0.1
--repetition_penalty 1.2
--top_p 0.95
--max_new_tokens 512
和直接
NPROC_PER_NODE=1
MAX_PIXELS=802816
swift infer
--adapters /data/output/zhongda/v19-20250424-175905/checkpoint-300
--temperature 0.1
--repetition_penalty 1.2
--top_p 0.95
--max_new_tokens 512
结果都没有改变,请问您说这个环境变量在哪里可以查看呀,非常感谢您的回复

@SeuZL
Copy link
Author

SeuZL commented May 8, 2025

你看看你的环境中是否设置了默认的NPROC_PER_NODE环境变量

您好,或者您是否可以指教一下val_dataset该如何组织呢?我在官网上没有找到val_dataset的格式,再次感谢您的回复

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants