Describe the bug
When running SFT fine-tuning with ms-swift on a self-built dataset, I hit an error. The command I ran is:
CUDA_VISIBLE_DEVICES=0,1 NPROC_PER_NODE=2 swift sft --model_type telechat2-115b --model_id_or_path /data/TeleChat2-7B --dataset /data/dataset_0.jsonl#3000 --max_length 4096 --learning_rate 1e-4 --output_dir output --num_train_epochs 10 --save_steps 20 --lora_target_modules ALL --save_total_limit 15
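For reference, each line of dataset_0.jsonl is one JSON sample; the fields below are only an assumption (ms-swift's query/response custom-dataset convention), not copied from the actual file:

{"query": "example question text", "response": "example answer text"}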
Error output:
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py:93: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Seq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py:93: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Seq2SeqTrainer.__init__`. Use `processing_class` instead.
super().__init__(
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO:swift] The SftArguments will be saved in: /data/TeleChat2-7B/output/telechat2-115b/v4-20241218-111701/sft_args.json
[INFO:swift] The Seq2SeqTrainingArguments will be saved in: /data/TeleChat2-7B/output/telechat2-115b/v4-20241218-111701/training_args.json
[INFO:swift] The logging file will be saved in: /data/TeleChat2-7B/output/telechat2-115b/v4-20241218-111701/logging.jsonl
Train: 0%| | 0/10 [00:00<?, ?it/s]/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context: # type: ignore[attr-defined]
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [384,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [384,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [384,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [384,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
... (some lines omitted) ...
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/sft.py", line 5, in <module>
[rank0]: sft_main()
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]: result = llm_x(args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 546, in llm_sft
[rank0]: return trainer_train(args, model, template, train_dataset, val_dataset, callbacks=callbacks, msg=msg)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 496, in trainer_train
[rank0]: trainer.train(training_args.resume_from_checkpoint)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py", line 493, in train
[rank0]: res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2164, in train
[rank0]: return inner_training_loop(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 3653, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/trainers.py", line 161, in compute_loss
[rank0]: outputs = model(**inputs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 823, in forward
[rank0]: return model_forward(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 811, in __call__
[rank0]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/peft_model.py", line 1577, in forward
[rank0]: return self.base_model(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
[rank0]: return self.model.forward(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 821, in forward
[rank0]: transformer_outputs = self.transformer(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 721, in forward
[rank0]: outputs = torch.utils.checkpoint.checkpoint(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/model.py", line 7217, in <lambda>
[rank0]: lambda *args, use_reentrant=_use_reentrant, **kwargs: _old_checkpoint(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_compile.py", line 31, in inner
[rank0]: return disable_fn(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 488, in checkpoint
[rank0]: ret = function(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 717, in custom_forward
[rank0]: return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 551, in forward
[rank0]: attn_outputs = self.self_attention(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 421, in forward
[rank0]: mixed_kv_layer = self.key_value(hidden_states)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/tuners/lora/layer.py", line 553, in forward
[rank0]: x = x.to(lora_A.weight.dtype)
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Train: 0%| | 0/10 [00:21<?, ?it/s]
W1218 11:17:39.458715 140448063558272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2410011 closing signal SIGTERM
E1218 11:17:39.672888 140448063558272 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2410010) of binary: /root/anaconda3/envs/telechat2/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-12-18_11:17:39
host : ecm-22b5
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2410010)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
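For context, the `srcIndex < srcSelectDimSize` assert usually means an index fed into a CUDA gather/embedding lookup is larger than the table it indexes; the sketch below is my assumption about the mechanism (unrelated to TeleChat2's own code) and triggers the same device-side assert:

import torch

# Hypothetical reproduction: an input ID outside the embedding table
# (vocab size 10, but ID 11 is requested) fires the same
# "Assertion `srcIndex < srcSelectDimSize` failed" message on CUDA.
emb = torch.nn.Embedding(num_embeddings=10, embedding_dim=4).cuda()
ids = torch.tensor([[1, 2, 11]]).cuda()   # 11 >= num_embeddings
out = emb(ids)
torch.cuda.synchronize()                  # the asynchronous CUDA error surfaces here

Because such CUDA errors are reported asynchronously, the Python traceback can point at a later op (here a LoRA linear layer) rather than the lookup that actually went out of range.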
There is also a second kind of error: when the amount of training data is large, the failure looks like the log below; with a smaller amount of data, it looks like the log above:
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank0]: Traceback (most recent call last):
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/sft.py", line 5, in <module>
[rank0]: sft_main()
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]: result = llm_x(args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 546, in llm_sft
[rank0]: return trainer_train(args, model, template, train_dataset, val_dataset, callbacks=callbacks, msg=msg)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 496, in trainer_train
[rank0]: trainer.train(training_args.resume_from_checkpoint)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py", line 493, in train
[rank0]: res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2164, in train
[rank0]: return inner_training_loop(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
[rank0]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 3653, in training_step
[rank0]: loss = self.compute_loss(model, inputs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/trainers.py", line 161, in compute_loss
[rank0]: outputs = model(**inputs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
[rank0]: else self._run_ddp_forward(*inputs, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
[rank0]: return self.module(*inputs, **kwargs) # type: ignore[index]
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank0]: result = forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 823, in forward
[rank0]: return model_forward(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 811, in __call__
[rank0]: return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank0]: return func(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/peft_model.py", line 1577, in forward
[rank0]: return self.base_model(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
[rank0]: return self.model.forward(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 821, in forward
[rank0]: transformer_outputs = self.transformer(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 721, in forward
[rank0]: outputs = torch.utils.checkpoint.checkpoint(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/model.py", line 7217, in <lambda>
[rank0]: lambda *args, use_reentrant=_use_reentrant, **kwargs: _old_checkpoint(
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_compile.py", line 31, in inner
[rank0]: return disable_fn(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 488, in checkpoint
[rank0]: ret = function(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 717, in custom_forward
[rank0]: return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 568, in forward
[rank0]: output = self.mlp(layernorm_output, residual)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 516, in forward
[rank0]: intermediate_output = self.down_proj(F.silu(self.gate_proj(hidden_states)) * self.up_proj(hidden_states))
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/tuners/lora/layer.py", line 556, in forward
[rank0]: result = result + lora_B(lora_A(dropout(x))) * scaling
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/tuners/peft.py", line 276, in keep_device_forward
[rank0]: return self.forward_origin(*args, **kwargs)
[rank0]: File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 117, in forward
[rank0]: return F.linear(input, self.weight, self.bias)
[rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
Train: 0%| | 0/30 [00:20<?, ?it/s]
W1218 10:59:45.891419 140471802790528 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2402728 closing signal SIGTERM
E1218 10:59:46.105589 140471802790528 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2402727) of binary: /root/anaconda3/envs/telechat2/bin/python
Traceback (most recent call last):
File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
main()
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/sft.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-12-18_10:59:45
host : ecm-22b5
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 2402727)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Your hardware and system info
NVIDIA-SMI 550.54.15
Driver Version: 550.54.15
CUDA Version: 12.4
GPU: A100
torch: 2.4.0
ms-swift: 2.6.1
OS: Ubuntu 20.04
Additional context
Root cause finally identified:
With ms-swift 2.4.2.post2, there is no error when the SFT training set contains the special token <unk>.
With ms-swift 2.6.1, the error occurs when the SFT training set contains the special token <unk>.
Testing with <cls>, <sep>, and <mask> does not trigger the bug.
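A quick way to check this hypothesis (the path and the use of config.vocab_size below are assumptions, not taken from the report) is to tokenize a sample containing <unk> and compare the resulting IDs with the model's embedding size:

from transformers import AutoConfig, AutoTokenizer

# Hypothetical diagnostic: see whether <unk> maps to a token ID that lies
# outside the model's embedding table, which would explain the CUDA index assert.
model_dir = "/data/TeleChat2-7B"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)

ids = tokenizer("sample text with <unk> inside")["input_ids"]
bad = [i for i in ids if i >= config.vocab_size]
print("vocab_size:", config.vocab_size, "max id:", max(ids), "out-of-range ids:", bad)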