[Question]: OSError: (External) CUDA error(700), an illegal memory access was encountered.  #6609

Closed
@littlesmallrookie

Please ask your question

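While fine-tuning for NLP document extraction, training aborts on its own after a few epochs. The point at which it aborts varies from run to run.
Training command: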
python3.7 finetune.py --device cpu --logging_steps 5 --save_steps 100 --eval_steps 100 --seed 42 --model_name_or_path uie-x-base --output_dir ./checkpointtest1/model_best --train_path train/data/4/train.txt --dev_path train/data/4/dev.txt --max_seq_len 512 --per_device_train_batch_size 4 --per_device_eval_batch_size 2 --num_train_epochs 80 --learning_rate 1e-5 --do_train --do_eval --do_export --export_model_dir ./checkpointtest1/model_best --overwrite_output_dir --disable_tqdm True --metric_for_best_model eval_f1 --load_best_model_at_end True --save_total_limit 1

This happens frequently, with the following errors:
Exception in thread Thread-4:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 698, in convert_to_tensors
tensor = as_tensor(value)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 546, in to_tensor
return _to_tensor_non_static(data, dtype, place, stop_gradient)
File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/creation.py", line 411, in _to_tensor_non_static
stop_gradient=stop_gradient,
OSError: (External) CUDA error(700), an illegal memory access was encountered.
[Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/dataloader_iter.py", line 218, in _thread_loop
self._thread_done_event)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
data = self.collate_fn(data)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/data/data_collator.py", line 199, in __call__
return_attention_mask=self.return_attention_mask,
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 2619, in pad
return BatchEncoding(batch_outputs, tensor_type=return_tensors)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 229, in __init__
self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/tokenizer_utils_base.py", line 708, in convert_to_tensors
"Unable to create tensor, you should probably activate truncation and/or padding "
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.

Traceback (most recent call last):
File "finetune.py", line 177, in <module>
main()
File "finetune.py", line 147, in main
train_result = trainer.train(resume_from_checkpoint=checkpoint)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 669, in train
tr_loss_step = self.training_step(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 1350, in training_step
loss = self.compute_loss(model, inputs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/trainer/trainer.py", line 1312, in compute_loss
outputs = model(**inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 1174, in forward
image=image,
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 775, in forward
position_ids=visual_position_ids,
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 645, in _calc_img_embeddings
visual_embeddings = self.visual_act_fn(self.visual_proj(self.visual(image.astype(paddle.float32))))
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/modeling.py", line 560, in forward
features = self.backbone(images_input)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 213, in forward
y = block(y)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 85, in forward
short = self.short(inputs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddlenlp-2.5.2.post0-py3.7.egg/paddlenlp/transformers/ernie_layout/visual_backbone.py", line 42, in forward
y = self._batch_norm(y)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 1012, in __call__
return self.forward(*inputs, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/nn.py", line 1375, in forward
self._trainable_statistics, False)
OSError: (External) CUDNN error(8), CUDNN_STATUS_EXECUTION_FAILED.
[Hint: Please search for the error code(8) on website (https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnStatus_t) to get Nvidia's official solution and advice about CUDNN Error.] (at /paddle/paddle/phi/kernels/gpu/batch_norm_kernel.cu:1229)
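
A possible way to narrow this down: CUDA kernel launches are asynchronous, so an illegal memory access can surface at a later, unrelated call (for example the `to_tensor` call in the data loader thread) rather than at the kernel that caused it. The sketch below reruns a command with `CUDA_LAUNCH_BLOCKING=1`, which makes launches synchronous so the error is raised closer to its origin. The helper names `build_debug_env` and `run_with_blocking_launches` are hypothetical and exist only for this illustration.

```python
import os
import subprocess
import sys


def build_debug_env(base_env=None):
    """Copy an environment and enable synchronous CUDA kernel launches."""
    env = dict(os.environ if base_env is None else base_env)
    # CUDA_LAUNCH_BLOCKING=1 makes each kernel launch block until it finishes,
    # so a failing kernel is reported at its own call site.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_launches(argv):
    """Run a command (hypothetical helper) in the debug environment.

    Returns the exit code of the child process.
    """
    return subprocess.run(argv, env=build_debug_env(), check=False).returncode
```

For example, `run_with_blocking_launches([sys.executable, "finetune.py", ...])` with the same flags as the training command would rerun the fine-tuning in this mode. Training is slower with synchronous launches, so this is meant for diagnosis only.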
