```
Traceback (most recent call last):
  File "./tools/inference.py", line 53, in <module>
    outs = engine.inference(data)
  File "/paddle/PaddleFleetX/ppfleetx/core/engine/eager_engine.py", line 864, in inference
    return self._inference_engine.predict(data)
  File "/paddle/PaddleFleetX/ppfleetx/core/engine/inference_engine.py", line 269, in predict
    self.predictor.run()
OSError: (External) CUBLAS error(1).
  [Hint: Please search for the error code(1) on website (https://docs.nvidia.com/cuda/cublas/index.html#cublasstatus_t) to get Nvidia's official solution and advice about CUBLAS Error.] (at /paddle/paddle/paddle/phi/backends/gpu/gpu_resources.cc:185)
  [operator < multihead_matmul > error]
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what():  (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/paddle/fluid/platform/device/gpu/gpu_info.cc:271)
Aborted (core dumped)
```
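For reference when reading the hint above: the raw code `1` in `CUBLAS error(1)` corresponds to `CUBLAS_STATUS_NOT_INITIALIZED` in the `cublasStatus_t` enum, i.e. the cuBLAS handle was never successfully created, which is consistent with a corrupted CUDA context rather than a bad GEMM argument. A small lookup helper (not part of PaddleFleetX; the numeric values come from the cuBLAS headers):

```python
# Numeric values of cublasStatus_t, as documented in the cuBLAS headers.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}

def cublas_status_name(code: int) -> str:
    """Return the symbolic name for a raw cublasStatus_t code."""
    return CUBLAS_STATUS.get(code, f"unknown cuBLAS status {code}")

print(cublas_status_name(1))  # the code from the traceback above
```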
Please describe your question

Following the documentation (model_zoo/gpt-3/docs/quick_start.md) to run the GPT model, I executed the following commands in order:
```
docker pull registry.baidubce.com/ppfleetx/fleetx-cuda11.2-cudnn8:dev
docker run -it --name=paddle --net=host -v /dev/shm:/dev/shm --shm-size=32G \
    -v $PWD:/paddle --runtime=nvidia \
    registry.baidubce.com/ppfleetx/ppfleetx-cuda11.2-cudnn8:v0.1.0 bash
```
Inside the Docker container:
```
git clone https://github.com/PaddlePaddle/PaddleFleetX.git
cd PaddleFleetX
mkdir data
wget -O data/gpt_en_dataset_300m_ids.npy https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
wget -O data/gpt_en_dataset_300m_idx.npz https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz
```
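Before moving on, it can be worth confirming the downloads are intact NumPy files rather than truncated transfers or HTML error pages, since a corrupt dataset can also surface later as a crash. A minimal sketch (the helper name is mine, not from the repo); every `.npy` file starts with the magic bytes `\x93NUMPY`:

```python
import numpy as np

NPY_MAGIC = b"\x93NUMPY"  # magic prefix of the .npy format

def looks_like_npy(path: str) -> bool:
    """Cheap integrity check: does the file start with the .npy magic bytes?"""
    with open(path, "rb") as f:
        return f.read(len(NPY_MAGIC)) == NPY_MAGIC

# Demonstrated on a small file written locally as a stand-in for the dataset;
# on the real machine, point it at data/gpt_en_dataset_300m_ids.npy instead.
np.save("sample_ids.npy", np.arange(10, dtype=np.int64))
print(looks_like_npy("sample_ids.npy"))  # True for a valid .npy file
```

(The `.npz` index file is a zip archive, so `np.load` on it is the simplest equivalent check.)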
Running training with the following command works normally:
Running inference with the following command fails with an error that `Tensor.copy` cannot be found. Based on the error message, I made the following modifications to the code:
After the modification, I tried running inference again and got `CUDA error(700)`. Debugging with gdb shows the problem occurs inside `paddle::operators::MultiHeadMatMulV2Kernel`; see the gdb backtrace below.
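One caveat about that backtrace: CUDA kernel launches are asynchronous, so error 700 is often reported by a later API call than the kernel that actually faulted, which can make a gdb trace misleading. A common first step (a general CUDA debugging technique, not something specific to PaddleFleetX) is to force synchronous launches before the CUDA context is created, e.g. by re-running the script with `CUDA_LAUNCH_BLOCKING=1` in its environment:

```python
import os
import subprocess
import sys

# CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous, so the
# failure is reported at the kernel that actually faulted. It must be set
# before the process creates its CUDA context (i.e. before importing paddle).
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")

# Hypothetical re-run of the failing script under the modified environment;
# substitute the actual inference invocation from the quick-start doc.
cmd = [sys.executable, "./tools/inference.py"]
print(env["CUDA_LAUNCH_BLOCKING"])
# subprocess.run(cmd, env=env, check=True)  # uncomment on the GPU machine
```

Running the same command under `compute-sanitizer` (or `cuda-memcheck`, which still ships with CUDA 11.2) will usually pinpoint the out-of-bounds access inside the `multihead_matmul` kernel more precisely than gdb.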
How should this problem be resolved? Thanks!