[Question]: OSError: (External) CUDA error(700), an illegal memory access was encountered. #6609
Comments
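This problem is fairly hard to pinpoint. Please search for the possible causes of the error OSError: (External) CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
Why is the device in the launch command set to cpu?
It is gpu; that was a typo here.
I reduced the batch size. With train batch_size = 2 and eval batch_size = 1, GPU utilization shows 100% during training and about 60% during evaluation. The run stops after a short while, though occasionally it trains for some time. With train batch_size = 1 and eval batch_size = 1, training stops immediately with the following error:
Exception in thread Thread-2:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
LAUNCH INFO 2023-08-07 10:28:10,123 Pod failed
Exception in thread Thread-2:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
What versions of paddle and CUDA are you using? From the error, this looks like a CUDA kernel problem: RuntimeError: (NotFound) The kernel assign_value is not registered. Also, if the data is not the official dataset, check whether any samples are too long or too short.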
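The data check suggested above could be sketched roughly as follows. This is a minimal sketch, not part of PaddleNLP: the helper name find_length_outliers is hypothetical, it assumes each line of the training file is one JSON object whose text sits under a "content" key, and it counts characters, which is only a rough proxy for the token count that --max_seq_len limits.

```python
import json


def find_length_outliers(lines, min_len=1, max_len=512, text_key="content"):
    """Return (line_number, length, reason) for samples outside [min_len, max_len].

    Assumes one JSON object per line with the text under `text_key`
    (an assumption about the data format, adjust as needed).
    """
    outliers = []
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            sample = json.loads(raw)
        except json.JSONDecodeError:
            outliers.append((line_no, None, "invalid JSON"))
            continue
        text = sample.get(text_key) if isinstance(sample, dict) else None
        if not isinstance(text, str):
            outliers.append((line_no, None, "missing text field"))
            continue
        length = len(text)  # character count, not token count
        if length < min_len:
            outliers.append((line_no, length, "too short"))
        elif length > max_len:
            outliers.append((line_no, length, "too long"))
    return outliers


# Example use against the training file from the command in this issue:
#   with open("train/data/4/train.txt", encoding="utf-8") as f:
#       for line_no, length, reason in find_length_outliers(f):
#           print(line_no, length, reason)
```

Any line it reports is only a candidate for closer inspection; a clean result does not rule out a data problem, since image and bounding-box fields are not checked here.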
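Please ask your question
While fine-tuning an NLP document extraction model, training stops on its own after a few epochs, and the point at which it stops is not predictable.
Training command: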
python3.7 finetune.py --device cpu --logging_steps 5 --save_steps 100 --eval_steps 100 --seed 42 --model_name_or_path uie-x-base --output_dir ./checkpointtest1/model_best --train_path train/data/4/train.txt --dev_path train/data/4/dev.txt --max_seq_len 512 --per_device_train_batch_size 4 --per_device_eval_batch_size 2 --num_train_epochs 80 --learning_rate 1e-5 --do_train --do_eval --do_export --export_model_dir ./checkpointtest1/model_best --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 --load_best_model_at_end True --save_total_limit 1
This happens frequently, with the following error:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 698, in convert_to_tensors
tensor = as_tensor(value)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 546, in to_tensor
return _to_tensor_non_static(data, dtype, place, stop_gradient)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 411, in _to_tensor_non_static
stop_gradient=stop_gradient,
OSError: (External) CUDA error(700), an illegal memory access was encountered.
[Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 218, in _thread_loop
self._thread_done_event)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
data = self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/data/data_collator.py", line 199, in call
return_attention_mask=self.return_attention_mask,
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 2619, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 229, in init
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 708, in convert_to_tensors
"Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Traceback (most recent call last):
File "finetune.py", line 177, in
main()
File "finetune.py", line 147, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 669, in train
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 1350, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 1312, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in call
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 1174, in forward
image=image,
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in call
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 775, in forward
position_ids=visual_position_ids,
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 645, in _calc_img_embeddings
visual_embeddings = self.visual_act_fn(self.visual_proj(self.visual(image.astype(paddle.float32))))
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in call
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 560, in forward
features = self.backbone(images_input)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in call
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 213, in forward
y = block(y)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in call
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 85, in forward
short = self.short(inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in call
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 42, in forward
y = self._batch_norm(y)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in call
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/nn.py", line 1375, in forward
self._trainable_statistics, False)
OSError: (External) CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /paddle/paddle/phi/kernels/gpu/batch_norm_kernel.cu:1229)
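One general way to narrow down errors like the ones above: CUDA kernel launches are asynchronous, so an illegal memory access can be reported at a later, unrelated call (here, tensor creation in a data loader thread). The ValueError about padding and truncation in the first traceback was raised while handling that CUDA error, so its message may be a side effect and not the cause. Setting the CUDA environment variable CUDA_LAUNCH_BLOCKING=1 makes launches synchronous, which usually moves the reported location closer to the failing kernel. The sketch below only builds such an environment; the helper name env_with_blocking_launches is hypothetical, and whether this reveals the root cause in this case is an assumption.

```python
import os


def env_with_blocking_launches(base_env=None):
    """Return a copy of an environment mapping with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING is read by the CUDA runtime; it has to be set
    # before the training process initializes CUDA.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


# Example (not executed here): rerun the same finetune.py command in a
# child process with this environment, e.g.
#   import subprocess, sys
#   subprocess.run([sys.executable, "finetune.py", ...], env=env_with_blocking_launches())
```

Training runs noticeably slower in this mode, so it is meant for a debugging run only.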