[Question]: Error when running the GPT model by following the documentation #6158

Closed
ziyanwangin opened this issue Jun 13, 2023 · 1 comment
Labels: question, triage

ziyanwangin commented Jun 13, 2023

Please describe your question

Following the documentation (model_zoo/gpt-3/docs/quick_start.md) to run the GPT model, I ran the following commands in order:

docker pull registry.baidubce.com/ppfleetx/fleetx-cuda11.2-cudnn8:dev
docker run -it --name=paddle --net=host -v /dev/shm:/dev/shm --shm-size=32G -v $PWD:/paddle --runtime=nvidia registry.baidubce.com/ppfleetx/ppfleetx-cuda11.2-cudnn8:v0.1.0 bash

Inside the Docker container:

git clone https://github.com/PaddlePaddle/PaddleFleetX.git
cd PaddleFleetX
mkdir data
wget -O data/gpt_en_dataset_300m_ids.npy https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_ids.npy
wget -O data/gpt_en_dataset_300m_idx.npz https://bj.bcebos.com/paddlenlp/models/transformers/gpt/data/gpt_en_dataset_300m_idx.npz

Training with the following command runs normally:

python ./tools/train.py -c ./ppfleetx/configs/nlp/gpt/pretrain_gpt_345M_single_card.yaml

Running inference with the following command:

python ./tools/inference.py -c ./ppfleetx/configs/nlp/gpt/inference_gpt_345M_single_card.yaml

fails because Tensor has no `copy` attribute:

Traceback (most recent call last):
  File "./tools/inference.py", line 53, in <module>
    outs = engine.inference(data)
  File "/paddle/PaddleFleetX/ppfleetx/core/engine/eager_engine.py", line 864, in inference
    return self._inference_engine.predict(data)
  File "/paddle/PaddleFleetX/ppfleetx/core/engine/inference_engine.py", line 260, in predict
    handle.copy_from_cpu(np.array(d.copy()))
AttributeError: 'Tensor' object has no attribute 'copy'

Based on the error message, I changed the code as follows:

--- a/ppfleetx/core/engine/inference_engine.py
+++ b/ppfleetx/core/engine/inference_engine.py
@@ -257,7 +257,7 @@
                     raise ValueError()
                 for d, name in zip(data, self.input_names()):
                     handle = self.predictor.get_input_handle(name)
-                    handle.copy_from_cpu(np.array(d.copy()))
+                    handle.copy_from_cpu(np.array(d))
             elif isinstance(data, Mapping):
                 # key check
                 for k, v in data.items():
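
As a hedged aside on the patch above: the traceback shows that `d` is a Tensor without a `.copy()` method, whereas NumPy arrays do have one. A slightly more defensive conversion could look like the sketch below. The helper name `to_host_array` is hypothetical and not part of PaddleFleetX. It assumes only that tensor-like inputs expose a `.numpy()` method (as `paddle.Tensor` does) and that `copy_from_cpu` wants a NumPy array.

```python
import numpy as np


def to_host_array(d):
    """Return a NumPy array for `d` (hypothetical helper, not PaddleFleetX API).

    Tensor-like objects that expose .numpy() are converted through it;
    anything else (lists, NumPy arrays) goes through np.array, which copies,
    so the caller's data is never mutated.
    """
    if hasattr(d, "numpy") and callable(d.numpy):
        return np.asarray(d.numpy())
    return np.array(d)
```

In the patched loop this would be used as `handle.copy_from_cpu(to_host_array(d))`. It only addresses the `AttributeError`.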

After this change, I tried running inference again:

python ./tools/inference.py -c ./ppfleetx/configs/nlp/gpt/inference_gpt_345M_single_card.yaml

This now fails with CUBLAS error(1), followed by CUDA error(700):

Traceback (most recent call last):
  File "./tools/inference.py", line 53, in <module>
    outs = engine.inference(data)
  File "/paddle/PaddleFleetX/ppfleetx/core/engine/eager_engine.py", line 864, in inference
    return self._inference_engine.predict(data)
  File "/paddle/PaddleFleetX/ppfleetx/core/engine/inference_engine.py", line 269, in predict
    self.predictor.run()
OSError: (External) CUBLAS error(1).
  [Hint: Please search for the error code(1) on website (https://docs.nvidia.com/cuda/cublas/index.html#cublasstatus_t) to get Nvidia's official solution and advice about CUBLAS Error.] (at /paddle/paddle/paddle/phi/backends/gpu/gpu_resources.cc:185)
  [operator < multihead_matmul > error]
terminate called after throwing an instance of 'phi::enforce::EnforceNotMet'
  what():  (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/paddle/fluid/platform/device/gpu/gpu_info.cc:271)

Aborted (core dumped)
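
For reading the log above: in NVIDIA's headers, cuBLAS status 1 is `CUBLAS_STATUS_NOT_INITIALIZED` and CUDA runtime error 700 is `cudaErrorIllegalAddress`. The sketch below pulls such codes out of a log and names them. The helper `decode_gpu_errors` and its regular expression are hypothetical, and the CUDA table is deliberately partial. It only decodes the numbers and does not diagnose the cause.

```python
import re

# cublasStatus_t values as listed in NVIDIA's cuBLAS documentation.
CUBLAS_STATUS = {
    0: "CUBLAS_STATUS_SUCCESS",
    1: "CUBLAS_STATUS_NOT_INITIALIZED",
    3: "CUBLAS_STATUS_ALLOC_FAILED",
    7: "CUBLAS_STATUS_INVALID_VALUE",
    8: "CUBLAS_STATUS_ARCH_MISMATCH",
    11: "CUBLAS_STATUS_MAPPING_ERROR",
    13: "CUBLAS_STATUS_EXECUTION_FAILED",
    14: "CUBLAS_STATUS_INTERNAL_ERROR",
    15: "CUBLAS_STATUS_NOT_SUPPORTED",
    16: "CUBLAS_STATUS_LICENSE_ERROR",
}

# Partial table of CUDA runtime error codes (only two entries, by design).
CUDA_ERROR = {
    2: "cudaErrorMemoryAllocation",
    700: "cudaErrorIllegalAddress",
}

# Matches the "CUBLAS error(1)" / "CUDA error(700)" phrasing seen in the log.
_PATTERN = re.compile(r"\b(CUBLAS|CUDA) error\((\d+)\)")


def decode_gpu_errors(log_text):
    """Return (library, code, symbolic name) for each error found in log_text."""
    found = []
    for lib, code in _PATTERN.findall(log_text):
        table = CUBLAS_STATUS if lib == "CUBLAS" else CUDA_ERROR
        found.append((lib, int(code), table.get(int(code), "unknown")))
    return found
```

Applied to the log in this report, it would list the cuBLAS failure first and the illegal memory access second, matching the order in which they were raised.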

Debugging with gdb shows that the exception is thrown inside paddle::operators::MultiHeadMatMulV2Kernel; see the gdb backtrace below:
Thread 1 "python" hit Catchpoint 1 (exception thrown), __cxxabiv1::__cxa_throw (obj=0x43039290,
    tinfo=0x7f6d6c1427a8 ,                                                              
    dest=0x7f6d4ad15a40 )                                                          
    at ../../../../gcc-8.2.0/libstdc++-v3/libsupc++/eh_throw.cc:80                                                                
80      ../../../../gcc-8.2.0/libstdc++-v3/libsupc++/eh_throw.cc: No such file or directory.                                      
(gdb) bt                                                                                                                          
#0  __cxxabiv1::__cxa_throw (obj=0x43039290, tinfo=0x7f6d6c1427a8 ,                     
    dest=0x7f6d4ad15a40 )                                                          
    at ../../../../gcc-8.2.0/libstdc++-v3/libsupc++/eh_throw.cc:80                                                                
#1  0x00007f6d4a360a5f in phi::InitBlasHandle(cublasContext**, CUstream_st*) [clone .cold.187] ()                                 
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so                                                          
#2  0x00007f6d5434a828 in phi::GPUContext::Impl::CublasCall(std::function const&)::{lambda()#1}::operator()() const ()
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so
#3  0x00007f6dabbc6907 in __pthread_once_slow (once_control=0x75b6cb4,                                                            
    init_routine=0x7f6d9ca001e0 ) at pthread_once.c:116                                                      
#4  0x00007f6d543441c4 in phi::GPUContext::CublasCall(std::function const&) const ()                       
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so                                                          
#5  0x00007f6d4c2d84ae in void phi::funcs::Blas::GEMM(CBLAS_TRANSPOSE, CBLAS_TRANSPOSE, int, int, int, float, float const*, float const*, float, float*) const ()
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so
#6  0x00007f6d4c2d8e58 in void phi::funcs::Blas::MatMul(phi::DenseTensor const&, bool, phi::DenseTensor const&, bool, float, phi::DenseTensor*, float) const ()
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so
#7  0x00007f6d4db19a92 in paddle::operators::MultiHeadMatMulV2Kernel::Compute(paddle::framework::ExecutionContext const&) const [clone .constprop.1064] ()
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so
#8  0x00007f6d4db1a264 in std::_Function_handler, paddle::operators::MultiHeadMatMulV2Kernel >::operator()(char const*, char const*, int) const::{lambda(paddle::framework::ExecutionContext const&)#1}>::_M_invoke(std::_Any_data const&, paddle::framework::ExecutionContext const&) ()
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so
#9  0x00007f6d4f5ac399 in paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&, paddle::framework::RuntimeContext*) const ()
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so
#10 0x00007f6d4f5ae3f4 in paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&) const () from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so                                               
#11 0x00007f6d4f594aba in paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&) ()             
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so                                                          
#12 0x00007f6d4ef3ec4d in paddle::framework::NaiveExecutor::Run() ()                                                              
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so                                                          
#13 0x00007f6d4b4b952b in paddle::AnalysisPredictor::ZeroCopyRun() ()                                                             
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so                                                          
#14 0x00007f6d4b07f810 in void pybind11::cpp_function::initialize(paddle::pybind::(anonymous namespace)::BindPaddleInferPredictor(pybind11::module_*)::{lambda(paddle_infer::Predictor&)#2}&&, void (*)(paddle_infer::Predictor&), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_call&)#3}::_FUN(pybind11::detail::function_call) ()
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so
#15 0x00007f6d4ad2c333 in pybind11::cpp_function::dispatcher(_object*, _object*, _object*) ()                                     
   from /usr/local/lib/python3.7/dist-packages/paddle/fluid/libpaddle.so                                                          
#16 0x0000000000593784 in _PyMethodDef_RawFastCallKeywords ()                                                                     
#17 0x0000000000594731 in _PyObject_FastCallKeywords ()          
#18 0x0000000000548cc1 in ?? ()                                  
#19 0x000000000051566f in _PyEval_EvalFrameDefault ()            
#20 0x0000000000549e0e in _PyEval_EvalCodeWithName ()            
#21 0x0000000000593fce in _PyFunction_FastCallKeywords ()        
#22 0x0000000000511e2c in _PyEval_EvalFrameDefault ()            
#23 0x0000000000593dd7 in _PyFunction_FastCallKeywords ()        
#24 0x0000000000511e2c in _PyEval_EvalFrameDefault ()            
#25 0x0000000000549576 in _PyEval_EvalCodeWithName ()            
#26 0x0000000000604173 in PyEval_EvalCode ()                     
#27 0x00000000005f5506 in ?? ()                                  
#28 0x00000000005f8c6c in PyRun_FileExFlags ()

How should this problem be resolved? Thanks!

ziyanwangin added the question label Jun 13, 2023

w5688414 (Contributor) commented May 8, 2024

Which versions of paddle and paddlenlp are you using? Please try upgrading them.
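
One way to collect the requested version information is sketched below. The helper `module_version` is hypothetical. It assumes that both `paddle` and `paddlenlp` expose a `__version__` attribute, and it reads that attribute instead of using `importlib.metadata` because the container in this report runs Python 3.7.

```python
import importlib


def module_version(name):
    """Return the module's __version__ string, or None if it cannot be found."""
    try:
        mod = importlib.import_module(name)
    except ImportError:
        # The package is not installed in this environment.
        return None
    return getattr(mod, "__version__", None)


# Print the versions asked for in the comment above.
for name in ("paddle", "paddlenlp"):
    print(name, module_version(name))
```

Running this inside the same Docker container used for inference gives the values to paste into the issue.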

paddle-bot closed this as completed May 13, 2025