[Question]: OSError: (External) CUDA error(700), an illegal memory access was encountered. #6609

Closed
littlesmallrookie opened this issue Aug 3, 2023 · 5 comments
@littlesmallrookie

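Please ask your question

While fine-tuning for NLP document extraction, training stops on its own after a few epochs. The point at which it stops is not predictable.
Training command: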
python3.7 finetune.py --device cpu --logging_steps 5 --save_steps 100 --eval_steps 100 --seed 42 --model_name_or_path uie-x-base --output_dir ./checkpointtest1/model_best --train_path train/data/4/train.txt --dev_path train/data/4/dev.txt --max_seq_len 512 --per_device_train_batch_size 4 --per_device_eval_batch_size 2 --num_train_epochs 80 --learning_rate 1e-5 --do_train --do_eval --do_export --export_model_dir ./checkpointtest1/model_best --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 --load_best_model_at_end True --save_total_limit 1

The following error occurs frequently:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 698, in convert_to_tensors
tensor = as_tensor(value)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 546, in to_tensor
return _to_tensor_non_static(data, dtype, place, stop_gradient)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 411, in _to_tensor_non_static
stop_gradient=stop_gradient,
OSError: (External) CUDA error(700), an illegal memory access was encountered.
[Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 218, in _thread_loop
self._thread_done_event)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
data = self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/data/data_collator.py", line 199, in __call__
return_attention_mask=self.return_attention_mask,
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 2619, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 229, in __init__
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 708, in convert_to_tensors
"Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Traceback (most recent call last):
File "finetune.py", line 177, in <module>
main()
File "finetune.py", line 147, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 669, in train
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 1350, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 1312, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 1174, in forward
image=image,
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 775, in forward
position_ids=visual_position_ids,
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 645, in _calc_img_embeddings
visual_embeddings = self.visual_act_fn(self.visual_proj(self.visual(image.astype(paddle.float32))))
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 560, in forward
features = self.backbone(images_input)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 213, in forward
y = block(y)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 85, in forward
short = self.short(inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 42, in forward
y = self._batch_norm(y)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/nn.py", line 1375, in forward
self._trainable_statistics, False)
OSError: (External) CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /paddle/paddle/phi/kernels/gpu/batch_norm_kernel.cu:1229)

@littlesmallrookie littlesmallrookie added the question Further information is requested label Aug 3, 2023
@github-actions github-actions bot added the triage label Aug 3, 2023
@lugimzzz
Contributor

lugimzzz commented Aug 7, 2023

This problem is hard to pin down. Please search for the possible causes of the error OSError: (External) CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /paddle/paddle/phi/kernels/gpu/batch_norm_kernel.cu:1229)
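
CUDA kernels are launched asynchronously, so the Python frame that reports an illegal memory access is often not the one that caused it. In the log above the failure surfaces both in the data-loader thread and in batch norm. One way to narrow it down is to rerun with the standard CUDA environment variable `CUDA_LAUNCH_BLOCKING=1`, which makes kernel launches synchronous. The sketch below only builds such an environment; `debug_env` is a hypothetical helper name, not part of Paddle or PaddleNLP, and the rerun command in the comment is illustrative.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with synchronous CUDA launches enabled.

    `base` defaults to the current process environment. The original mapping
    is never modified.
    """
    env = dict(os.environ if base is None else base)
    # Standard CUDA runtime variable: forces each kernel launch to complete
    # before control returns, so the error is raised at the failing call.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Illustrative usage (not executed here): pass the environment to a rerun.
# import subprocess
# subprocess.run(["python3.7", "finetune.py", "--device", "gpu"], env=debug_env())
```

Synchronous launches slow training noticeably, so this setting is meant for a short diagnostic run only.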

@sijunhe sijunhe self-assigned this Aug 7, 2023
@sijunhe
Collaborator

sijunhe commented Aug 7, 2023

Why is the device in the launch command set to cpu?

@littlesmallrookie
Author

It is gpu. That was a typo in the command I posted.

@littlesmallrookie
Author

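I reduced the batch size to train batch_size = 2 and eval batch_size = 1. GPU utilization shows 100% during training and about 60% during evaluation. Training stops after a short while, although occasionally it runs for some time.
The training command is as follows: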
python3.7 -u -m paddle.distributed.launch --gpus "0" finetune.py --device gpu --logging_steps 5 --save_steps 100 --eval_steps 100 --seed 42 --model_name_or_path uie-x-base --output_dir checkpoint-auto-4-2/model_best --train_path train/data/4/train.txt --dev_path train/data/4/dev.txt --max_seq_len 512 --per_device_train_batch_size 2 --per_device_eval_batch_size 1 --num_train_epochs 80 --learning_rate 1e-5 --do_train --do_eval --do_export --export_model_dir checkpoint-auto-4-2/model_best --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 --load_best_model_at_end True --save_total_limit 1

With train batch_size = 1 and eval batch_size = 1, training stops immediately with the following error:
Error: /paddle/paddle/phi/kernels/gpu/bce_loss_kernel.cu:42 Assertion (x >= static_cast<T>(0)) && (x <= one) failed. Input is expected to be within the interval [0, 1], but received nan.
[The same assertion line is repeated 127 more times.]
Traceback (most recent call last):
File "finetune.py", line 177, in <module>
main()
File "finetune.py", line 147, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 669, in train
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 1350, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 1312, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 1174, in forward
image=image,
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 739, in forward
visual_bbox = self._calc_visual_bbox(self.config["image_feature_pool_shape"], bbox, visual_shape)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 687, in _calc_visual_bbox
axis=-1,
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/manipulation.py", line 1839, in stack
return _C_ops.stack(x, axis)
OSError: (External) CUDA error(719), unspecified launch failure.
[Hint: Please search for the error code(719) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:252)

Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 698, in convert_to_tensors
tensor = as_tensor(value)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 554, in to_tensor
return to_tensor_static(data, dtype, stop_gradient)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 472, in to_tensor_static
output = assign(data)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 1868, in assign
value_name: values,
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/layer_helper.py", line 45, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 4046, in append_op
attrs=kwargs.get("attrs", None),
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 3037, in __init__
self.desc.infer_shape(self.block.desc)
RuntimeError: (NotFound) The kernel assign_value is not registered.
[Hint: Expected iter != kernels_.end(), but received iter == kernels_.end().] (at /paddle/paddle/phi/core/kernel_factory.cc:197)
[operator < assign_value > error]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 218, in _thread_loop
self._thread_done_event)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
data = self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/data/data_collator.py", line 199, in __call__
return_attention_mask=self.return_attention_mask,
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 2619, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 229, in __init__
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 708, in convert_to_tensors
"Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

LAUNCH INFO 2023-08-07 10:28:10,123 Pod failed
LAUNCH ERROR 2023-08-07 10:28:10,123 Container failed !!!
Container rank 0 status failed cmd ['/usr/bin/python3.7', '-u', 'finetune.py', '--device', 'gpu', '--logging_steps', '5', '--save_steps', '100', '--eval_steps', '100', '--seed', '42', '--model_name_or_path', 'uie-x-base', '--output_dir', 'checkpoint-auto-4-2/model_best', '--train_path', 'train/data/4/train.txt', '--dev_path', 'train/data/4/dev.txt', '--max_seq_len', '512', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--num_train_epochs', '80', '--learning_rate', '1e-5', '--do_train', '--do_eval', '--do_export', '--export_model_dir', 'checkpoint-auto-4-2/model_best', '--overwrite_output_dir', '--disable_tqdm', 'True', '--metric_for_best_model', 'eval_f1', '--load_best_model_at_end', 'True', '--save_total_limit', '1'] code 1 log log/workerlog.0
env {'GREP_COLOR': '1;31', 'CUDNN_VERSION': '8.1.1.33', 'LC_ALL': 'en_US.UTF-8', 'LD_LIBRARY_PATH': '/usr/local/lib/python3.7/dist-packages/cv2/../../lib64:/usr/local/TensorRT-8.0.3.4/lib:/usr/local/cuda-11.2/targets/x86_64-linux/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'LANG': 'en_US.UTF-8', 'HOSTNAME': 'f82f758c2aa9', 'OLDPWD': '/paddle/PaddleNLP-2.5.2/applications/information_extraction/document/train', 'WITH_GPU': 'ON', 'NVIDIA_VISIBLE_DEVICES': 'all', 'NCCL_VERSION': '2.8.4', 'GOPATH': '/root/gopath', 'PWD': '/paddle/PaddleNLP-2.5.2/applications/information_extraction/document', 'HOME': '/root', 'GOROOT': '/usr/local/go', 'CLICOLOR': '1', 'DEBIAN_FRONTEND': 'noninteractive', 'GREP_OPTIONS': '--color=auto', 'LIBRARY_PATH': '/usr/local/cuda/lib64/stubs', 'TERM': 'xterm', 'WITH_AVX': 'ON', 'CUDA_VERSION': '11.2.1', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'CUDA_VISIBLE_DEVICES': '0', 'SHLVL': '1', 'LANGUAGE': 'en_US.UTF-8', 'NVIDIA_REQUIRE_CUDA': 'cuda>=11.2 brand=tesla,driver>=418,driver<419 brand=tesla,driver>=440,driver<441 driver>=450,driver<451', 'PATH': '/home/cmake-3.16.0-Linux-x86_64/bin:/usr/local/gcc-8.2/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/go/bin:/root/gopath/bin', 'PS1': '\[\033[1;33m\]λ \[\033[1;37m\]\h \[\033[1;32m\]\w \[\033[0m\]', '_': '/usr/bin/python3.7', 'CUSTOM_DEVICE_ROOT': '', 'OMP_NUM_THREADS': '1', 'QT_QPA_PLATFORM_PLUGIN_PATH': '/usr/local/lib/python3.7/dist-packages/cv2/qt/plugins', 'QT_QPA_FONTDIR': '/usr/local/lib/python3.7/dist-packages/cv2/qt/fonts', 'POD_NAME': 'zvyhkr', 'PADDLE_MASTER': '172.17.0.2:45402', 'PADDLE_GLOBAL_SIZE': '1', 'PADDLE_LOCAL_SIZE': '1', 'PADDLE_GLOBAL_RANK': '0', 'PADDLE_LOCAL_RANK': '0', 'PADDLE_NNODES': '1', 'PADDLE_TRAINER_ENDPOINTS': '172.17.0.2:45403', 'PADDLE_CURRENT_ENDPOINT': '172.17.0.2:45403', 'PADDLE_TRAINER_ID': '0', 'PADDLE_TRAINERS_NUM': '1', 'PADDLE_RANK_IN_NODE': '0', 'FLAGS_selected_gpus': '0'}
LAUNCH INFO 2023-08-07 10:28:10,123 ------------------------- ERROR LOG DETAIL -------------------------
)

Exception in thread Thread-2:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 698, in convert_to_tensors
tensor = as_tensor(value)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 554, in to_tensor
return to_tensor_static(data, dtype, stop_gradient)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 472, in to_tensor_static
output = assign(data)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 1868, in assign
value_name: values,
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/layer_helper.py", line 45, in append_op
return self.main_program.current_block().append_op(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 4046, in append_op
attrs=kwargs.get("attrs", None),
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/framework.py", line 3037, in __init__
self.desc.infer_shape(self.block.desc)
RuntimeError: (NotFound) The kernel assign_value is not registered.
[Hint: Expected iter != kernels_.end(), but received iter == kernels_.end().] (at /paddle/paddle/phi/core/kernel_factory.cc:197)
[operator < assign_value > error]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 218, in _thread_loop
self._thread_done_event)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
data = self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/data/data_collator.py", line 199, in __call__
return_attention_mask=self.return_attention_mask,
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 2619, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 229, in __init__
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 708, in convert_to_tensors
"Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

@w5688414
Contributor

w5688414 commented May 7, 2024

Which versions of Paddle, PaddleNLP, and CUDA are you using? From the error, this looks like a CUDA kernel problem:

RuntimeError: (NotFound) The kernel assign_value is not registered.
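One way to gather the requested version information is sketched below. It assumes the script runs in the same Python environment as `finetune.py`; the helper name `report_versions` is made up for illustration, and the CUDA/cuDNN values it prints are the ones Paddle was built against, which may differ from what the driver or container provides.

```python
import importlib


def report_versions():
    """Collect Paddle, PaddleNLP, and CUDA/cuDNN build versions (illustrative helper)."""
    info = {}
    for name in ("paddle", "paddlenlp"):
        try:
            module = importlib.import_module(name)
            info[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            info[name] = "not installed"

    if info["paddle"] != "not installed":
        import paddle

        # These report what the installed Paddle wheel was compiled with,
        # not what the GPU driver on the host supports.
        info["compiled_with_cuda"] = paddle.is_compiled_with_cuda()
        info["cuda"] = paddle.version.cuda()
        info["cudnn"] = paddle.version.cudnn()
    return info


for key, value in report_versions().items():
    print(f"{key}: {value}")
```

Comparing this output with the `CUDA_VERSION` and `CUDNN_VERSION` values in the launch environment above can show whether the installed wheel matches the container. `paddle.utils.run_check()` is a further sanity check of the installation.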

Also, if you are using your own (non-official) data, check whether any samples are too long, too short, or otherwise malformed.
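A rough sketch of such a data check follows. It assumes `train.txt` and `dev.txt` are JSON-lines files whose records carry `content` and `prompt` string fields; that layout is an assumption here, so adjust the field names to your data. The function `find_suspect_samples` is hypothetical, and it compares character counts against `max_seq_len`, which only approximates the token count the tokenizer will produce.

```python
import json


def find_suspect_samples(lines, max_seq_len=512, min_chars=1):
    """Return (line_number, reason) pairs for samples that look malformed."""
    suspects = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            sample = json.loads(line)
        except json.JSONDecodeError:
            suspects.append((line_no, "not valid JSON"))
            continue
        if not isinstance(sample, dict):
            suspects.append((line_no, "not a JSON object"))
            continue

        # Field names are assumed; missing or null fields count as empty.
        content = sample.get("content") or ""
        prompt = sample.get("prompt") or ""
        if len(content) < min_chars:
            suspects.append((line_no, "content is empty or too short"))
        elif len(content) + len(prompt) > max_seq_len:
            suspects.append((line_no, "content + prompt exceeds max_seq_len characters"))
    return suspects


# Small in-memory demo; in practice pass an open file object instead.
demo = [
    '{"content": "invoice total 42", "prompt": "total"}',
    '{"content": "", "prompt": "total"}',
]
for line_no, reason in find_suspect_samples(demo):
    print(f"line {line_no}: {reason}")
```

To scan a real file, pass the open file handle, for example `find_suspect_samples(open("train/data/4/train.txt", encoding="utf-8"))`. Samples flagged as too long are not necessarily wrong, but they are reasonable first candidates to inspect.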

@paddle-bot paddle-bot bot closed this as completed May 13, 2025