
BUG: Error when the training set contains the special token <unk> #2688

Open
@WjMessi1

Description

Describe the bug

When running SFT fine-tuning with ms-swift on a self-made dataset, training fails with an error. The error occurs when running the following command:

CUDA_VISIBLE_DEVICES=0,1 NPROC_PER_NODE=2 swift sft --model_type telechat2-115b --model_id_or_path /data/TeleChat2-7B --dataset /data/dataset_0.jsonl#3000 --max_length 4096 --learning_rate 1e-4 --output_dir output --num_train_epochs 10 --save_steps 20 --lora_target_modules ALL --save_total_limit 15
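
For reference, a line of the self-made dataset that triggers this looks roughly like the following (a hypothetical sample in the plain query/response jsonl format that ms-swift accepts for custom datasets; the actual data is not included in this report):

{"query": "下面文本中包含特殊字符<unk>,请原样输出", "response": "文本中包含<unk>这个标记"}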

Error screenshots (three images attached; the relevant output is reproduced as text below):

Error output:

/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py:93: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Seq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py:93: FutureWarning: `tokenizer` is deprecated and will be removed in version 5.0.0 for `Seq2SeqTrainer.__init__`. Use `processing_class` instead.
  super().__init__(
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
[INFO:swift] The SftArguments will be saved in: /data/TeleChat2-7B/output/telechat2-115b/v4-20241218-111701/sft_args.json
[INFO:swift] The Seq2SeqTrainingArguments will be saved in: /data/TeleChat2-7B/output/telechat2-115b/v4-20241218-111701/training_args.json
[INFO:swift] The logging file will be saved in: /data/TeleChat2-7B/output/telechat2-115b/v4-20241218-111701/logging.jsonl

Train:   0%|          | 0/10 [00:00<?, ?it/s]/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/checkpoint.py:1399: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
  with device_autocast_ctx, torch.cpu.amp.autocast(**cpu_autocast_kwargs), recompute_context:  # type: ignore[attr-defined]
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [384,0,0], thread: [64,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [384,0,0], thread: [65,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [384,0,0], thread: [66,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [384,0,0], thread: [67,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

... (many similar assertion lines omitted) ...

../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [399,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/sft.py", line 5, in <module>
[rank0]:     sft_main()
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]:     result = llm_x(args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 546, in llm_sft
[rank0]:     return trainer_train(args, model, template, train_dataset, val_dataset, callbacks=callbacks, msg=msg)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 496, in trainer_train
[rank0]:     trainer.train(training_args.resume_from_checkpoint)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py", line 493, in train
[rank0]:     res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 3653, in training_step
[rank0]:     loss = self.compute_loss(model, inputs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/trainers.py", line 161, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 823, in forward
[rank0]:     return model_forward(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 811, in __call__
[rank0]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/peft_model.py", line 1577, in forward
[rank0]:     return self.base_model(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
[rank0]:     return self.model.forward(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 821, in forward
[rank0]:     transformer_outputs = self.transformer(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 721, in forward
[rank0]:     outputs = torch.utils.checkpoint.checkpoint(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/model.py", line 7217, in <lambda>
[rank0]:     lambda *args, use_reentrant=_use_reentrant, **kwargs: _old_checkpoint(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_compile.py", line 31, in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 488, in checkpoint
[rank0]:     ret = function(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 717, in custom_forward
[rank0]:     return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 551, in forward
[rank0]:     attn_outputs = self.self_attention(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 421, in forward
[rank0]:     mixed_kv_layer = self.key_value(hidden_states)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/tuners/lora/layer.py", line 553, in forward
[rank0]:     x = x.to(lora_A.weight.dtype)
[rank0]: RuntimeError: CUDA error: device-side assert triggered
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Train:   0%|          | 0/10 [00:21<?, ?it/s]
W1218 11:17:39.458715 140448063558272 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2410011 closing signal SIGTERM
E1218 11:17:39.672888 140448063558272 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2410010) of binary: /root/anaconda3/envs/telechat2/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_11:17:39
  host      : ecm-22b5
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2410010)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

There is also a second kind of error: when the amount of training data is larger, the error looks like the one below; with less data it looks like the one shown above:

../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [118,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [119,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [120,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [121,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [122,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [123,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [124,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [125,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
../aten/src/ATen/native/cuda/Indexing.cu:1284: indexSelectLargeIndex: block: [229,0,0], thread: [127,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/sft.py", line 5, in <module>
[rank0]:     sft_main()
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/utils/run_utils.py", line 32, in x_main
[rank0]:     result = llm_x(args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 546, in llm_sft
[rank0]:     return trainer_train(args, model, template, train_dataset, val_dataset, callbacks=callbacks, msg=msg)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/sft.py", line 496, in trainer_train
[rank0]:     trainer.train(training_args.resume_from_checkpoint)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/mixin.py", line 493, in train
[rank0]:     res = super().train(resume_from_checkpoint, *args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2164, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 2522, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/transformers/trainer.py", line 3653, in training_step
[rank0]:     loss = self.compute_loss(model, inputs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/trainers/trainers.py", line 161, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1636, in forward
[rank0]:     else self._run_ddp_forward(*inputs, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1454, in _run_ddp_forward
[rank0]:     return self.module(*inputs, **kwargs)  # type: ignore[index]
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1603, in _call_impl
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 823, in forward
[rank0]:     return model_forward(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 811, in __call__
[rank0]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/peft_model.py", line 1577, in forward
[rank0]:     return self.base_model(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/tuners/tuners_utils.py", line 188, in forward
[rank0]:     return self.model.forward(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 821, in forward
[rank0]:     transformer_outputs = self.transformer(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 721, in forward
[rank0]:     outputs = torch.utils.checkpoint.checkpoint(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/llm/utils/model.py", line 7217, in <lambda>
[rank0]:     lambda *args, use_reentrant=_use_reentrant, **kwargs: _old_checkpoint(
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_compile.py", line 31, in inner
[rank0]:     return disable_fn(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 600, in _fn
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 488, in checkpoint
[rank0]:     ret = function(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 717, in custom_forward
[rank0]:     return module(*inputs, use_cache=use_cache, output_attentions=output_attentions)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 568, in forward
[rank0]:     output = self.mlp(layernorm_output, residual)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/.cache/huggingface/modules/transformers_modules/TeleChat2-7B/modeling_telechat2.py", line 516, in forward
[rank0]:     intermediate_output = self.down_proj(F.silu(self.gate_proj(hidden_states)) * self.up_proj(hidden_states))
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/peft/tuners/lora/layer.py", line 556, in forward
[rank0]:     result = result + lora_B(lora_A(dropout(x))) * scaling
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/tuners/peft.py", line 276, in keep_device_forward
[rank0]:     return self.forward_origin(*args, **kwargs)
[rank0]:   File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 117, in forward
[rank0]:     return F.linear(input, self.weight, self.bias)
[rank0]: RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

Train:   0%|          | 0/30 [00:20<?, ?it/s]
W1218 10:59:45.891419 140471802790528 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 2402728 closing signal SIGTERM
E1218 10:59:46.105589 140471802790528 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 2402727) of binary: /root/anaconda3/envs/telechat2/bin/python
Traceback (most recent call last):
  File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/root/anaconda3/envs/telechat2/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 905, in <module>
    main()
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/root/anaconda3/envs/telechat2/lib/python3.9/site-packages/swift/cli/sft.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-12-18_10:59:45
  host      : ecm-22b5
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2402727)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================


Your hardware and system info

NVIDIA-SMI 550.54.15
Driver Version: 550.54.15
CUDA Version: 12.4
GPU: A100
torch: 2.4.0
ms-swift: 2.6.1
OS: Ubuntu 20.04

Additional context
I finally narrowed the error down to the following:

With ms-swift 2.4.2.post2, SFT training data containing the special token <unk> causes no exception.

With ms-swift 2.6.1, SFT training data containing the special token <unk> triggers the exception.

Testing with <cls>, <sep>, and <mask> does not trigger the bug.
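
A minimal diagnostic sketch of what I suspect is happening (my own assumption, not confirmed against the ms-swift 2.6.1 code): tokenize a sample containing the literal <unk> string and compare the resulting ids with the model's vocabulary size; an id at or above vocab_size would produce exactly the srcIndex < srcSelectDimSize device-side assert shown above.

from transformers import AutoConfig, AutoTokenizer

model_dir = "/data/TeleChat2-7B"  # local checkpoint used in this report
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
config = AutoConfig.from_pretrained(model_dir, trust_remote_code=True)

# Encode a sample that contains the literal <unk> marker, roughly the way
# the SFT template would see it.
ids = tokenizer("文本中包含特殊字符<unk>", add_special_tokens=False)["input_ids"]
print("token ids:", ids)
print("max id:", max(ids), "| config.vocab_size:", config.vocab_size)

# If this assertion fails, the GPU embedding lookup will hit the
# `srcIndex < srcSelectDimSize` assertion from the logs.
assert max(ids) < config.vocab_size, "token id falls outside the embedding table"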
