CUDA GoodFeaturesToTrackDetector is not ThreadSafe ? #18051

Closed
nkwangyh opened this issue Aug 7, 2020 · 8 comments · Fixed by opencv/opencv_contrib#2868
Labels
bug category: gpu/cuda (contrib) OpenCV 4.0+: moved to opencv_contrib Hackathon https://opencv.org/opencv-hackathon-starts-next-week/

Comments

@nkwangyh

nkwangyh commented Aug 7, 2020

System information (version)
  • OpenCV => 4.2
  • Operating System / Platform => Ubuntu 16.04
  • Compiler => g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609; cuda release 9.1, V9.1.85
Detailed description

I ran into the same problem that AlexBn reported here:
https://answers.opencv.org/question/227794/cuda-goodfeaturestotrackdetector-is-not-threadsafe/

When using the OpenCV CUDA GoodFeaturesToTrackDetector in a parallel loop, I consistently get the exception "merge_sort: failed to synchronize", even though each thread works on its own cuda::GpuMat, in its own cuda::Stream, with its own Algorithm instance.

Steps to reproduce
#include <iostream>
#include <list>
#include <thread>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/cvconfig.h>
#include <opencv2/opencv.hpp>
#if defined(HAVE_CUDA)
#include <opencv2/core/cuda.hpp>
#include <opencv2/core/cuda/common.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/cudaoptflow.hpp>
#include <opencv2/cudaarithm.hpp>
#endif
#include <mutex>

using namespace std; using namespace cv;

int main() {
   int NBThread = 5;
   Mat frames = imread("C:\\Users\\alex\\Desktop\\test.png");
   cvtColor(frames, frames, COLOR_BGR2GRAY);

   vector<Mat> vectImg;
   for (int u = 0; u < NBThread; u++)
       vectImg.push_back(frames.clone());

   for (int i = 0; i < 100000; i++) 
   {
       vector<thread> workers;
       mutex m;
       for (int id = 0; id < NBThread; ++id) 
           workers.emplace_back([&, id]()
               { 
                   Size frameSize = vectImg[id].size();
                   // Create the detector
                   int     srcType = CV_8UC1;
                   int     maxCorners = /*1000*/   4000;
                   double  qualityLevel = /*0.01*/ 0.01;
                   double  minDistance = /*0.0*/   0.0;
                   int     blockSize = /*3*/       3;
                   bool    useHarrisDetector = /*false*/   false;
                   double  harrisK = /*0.04*/  0.04;
                   auto m_CudaDetector = cv::cuda::createGoodFeaturesToTrackDetector(srcType, maxCorners, qualityLevel, minDistance, blockSize, useHarrisDetector, harrisK);

                   cuda::Stream stream;
                   cuda::GpuMat gpuFrame = cuda::GpuMat(frameSize, CV_8UC1);
                   gpuFrame.upload(vectImg[id], stream);
                   cv::cuda::GpuMat d_prevRef;
                   m_CudaDetector->detect(gpuFrame, d_prevRef, cuda::GpuMat(), stream);
                   stream.waitForCompletion();

                   std::cout << " Number of points detected = " << d_prevRef.size() << " thread : " << id << std::endl;
               });

       for (auto& worker : workers) worker.join();
   }
   return 0; }

After many iterations I get an exception with the following call stack:

opencv_cudaimgproc420d.dll!thrust::cuda_cub::throw_on_error(cudaError status, const char * msg) Line 227 C++
opencv_cudaimgproc420d.dll!thrust::cuda_cub::__merge_sort::merge_sort .... Line 1318 C++
opencv_cudaimgproc420d.dll!thrust::cuda_cub::__smart_sort::smart_sort ... Line 1552 C++
opencv_cudaimgproc420d.dll!thrust::cuda_cub::sort ... Line 1631 C++
opencv_cudaimgproc420d.dll!thrust::sort ... Line 57 C++
opencv_cudaimgproc420d.dll!cv::cuda::device::gfft::sortCorners_gpu(cv::cuda::PtrStepSz<float> eig, float2 * corners,int count, CUstream_st * stream) Line 139 C++
opencv_cudaimgproc420d.dll!`anonymous namespace'::GoodFeaturesToTrackDetector::detect(const cv::debug_build_guard::_InputArray & _image, const cv::debug_build_guard::_OutputArray & _corners, const cv::debug_build_guard::_InputArray & _mask, cv::cuda::Stream & stream) Line 125 C++

Must I conclude that the OpenCV CUDA GoodFeaturesToTrackDetector is not thread-safe, despite the use of separate Streams?
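
Until the detector itself is fixed, one possible stopgap is to serialize the affected call behind a single process-wide mutex. This is not proposed anywhere in this thread, it gives up the parallelism for that call, and it is untested against the real detector. The sketch below uses only the C++ standard library; `detectGate`, `detectSerialized`, and `maxConcurrentCalls` are hypothetical names, and the callable passed in is a stand-in for the real `m_CudaDetector->detect(...)` call.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: one mutex shared by every caller in the process.
// It lives in a non-template function so that all call sites get the same gate.
inline std::mutex& detectGate() {
    static std::mutex gate;
    return gate;
}

// Hypothetical wrapper: runs `call` while holding the gate, so at most one
// thread is inside the wrapped section at any time.
template <typename Fn>
void detectSerialized(Fn&& call) {
    std::lock_guard<std::mutex> lock(detectGate());
    call();
}

// Demonstration only: several threads enter the wrapped section repeatedly,
// and the function returns the highest number of threads seen inside at once.
int maxConcurrentCalls(int numThreads, int iterations) {
    std::atomic<int> inside{0};
    std::atomic<int> maxInside{0};
    std::vector<std::thread> workers;
    for (int id = 0; id < numThreads; ++id) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                detectSerialized([&] {
                    int now = ++inside;
                    int seen = maxInside.load();
                    // Record a new maximum if this entry raised it.
                    while (now > seen && !maxInside.compare_exchange_weak(seen, now)) {}
                    --inside;
                });
            }
        });
    }
    for (auto& w : workers) w.join();
    return maxInside.load();
}
```

In the reproduction code, the wrapped callable would need to contain both the `detect(...)` call and `stream.waitForCompletion()`, so that the lock is held until the GPU work for that call has finished.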

@nkwangyh
Author

nkwangyh commented Aug 7, 2020

I've just looked inside the function and found:

        void sortCorners_gpu(PtrStepSzf eig, float2* corners, int count, cudaStream_t stream)
        {
            bindTexture(&eigTex, eig);

            thrust::device_ptr<float2> ptr(corners);
#if THRUST_VERSION >= 100802
            if (stream)
                thrust::sort(thrust::cuda::par(ThrustAllocator::getAllocator()).on(stream), ptr, ptr + count, EigGreater());
            else
                thrust::sort(thrust::cuda::par(ThrustAllocator::getAllocator()), ptr, ptr + count, EigGreater());
#else
            thrust::sort(ptr, ptr + count, EigGreater());
#endif
        }

It seems that something goes wrong when Thrust is used from multiple CPU threads. Searching for "cuda thrust merge_sort failed to synchronize" turns up another discussion:
amdegroot/ssd.pytorch#120
which also suggests that the bug is related to multiple threads and Thrust.
Could anyone help?
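
To make the data flow of `sortCorners_gpu` easier to follow, here is a CPU-only sketch of the same idea in plain C++: order corner coordinates by descending response. It assumes, which the snippet above does not show, that `EigGreater` looks up each corner's eigenvalue through the bound `eigTex`. `Corner`, `EigMap`, and `sortCornersByResponse` are hypothetical names. Unlike the original, the comparator carries its own reference to the response map instead of reading module-level state.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a float2 corner coordinate.
struct Corner {
    int x;
    int y;
};

// Hypothetical row-major response map, standing in for the `eig` image.
struct EigMap {
    int width;
    int height;
    std::vector<float> data;

    float at(int x, int y) const {
        return data[static_cast<std::size_t>(y) * width + x];
    }
};

// Sorts corners so that the strongest response comes first. The lambda
// captures `eig` for this call only, so two concurrent calls on different
// maps cannot observe each other's data.
void sortCornersByResponse(const EigMap& eig, std::vector<Corner>& corners) {
    std::sort(corners.begin(), corners.end(),
              [&eig](const Corner& a, const Corner& b) {
                  return eig.at(a.x, a.y) > eig.at(b.x, b.y);
              });
}
```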

@nglee
Contributor

nglee commented Aug 8, 2020

It seems this algorithm uses a texture reference, which is quite obsolete and does not support multi-threaded programming. Texture objects appeared in 2013 and superseded texture references (link), and texture references are deprecated as of CUDA 11.

There have been some other cases similar to this issue, and they were solved by removing texture references and adopting texture objects. I believe that we can apply the same solution to this issue.
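
Roughly speaking, a texture reference acts as one module-level binding shared by every caller, whereas a texture object is a handle that each call creates and passes along explicitly. The CPU-only sketch below models that difference in plain C++. The interleaving of two threads is written out by hand so that the outcome is deterministic. `g_boundImage`, `bindGlobal`, `sumViaGlobal`, `sumViaHandle`, and `replayClobberedBinding` are hypothetical names, and none of this is CUDA API.

```cpp
#include <numeric>
#include <vector>

using Image = std::vector<int>;

// Models a texture reference: a single binding shared by the whole module.
static const Image* g_boundImage = nullptr;

void bindGlobal(const Image& img) { g_boundImage = &img; }

// Models a kernel that reads its input through the shared binding.
int sumViaGlobal() {
    return std::accumulate(g_boundImage->begin(), g_boundImage->end(), 0);
}

// Models a kernel that receives its input as an explicit per-call handle,
// which is the role a texture object plays.
int sumViaHandle(const Image& img) {
    return std::accumulate(img.begin(), img.end(), 0);
}

// Replays one unlucky interleaving of two threads A and B by hand.
int replayClobberedBinding(const Image& a, const Image& b) {
    bindGlobal(a);          // thread A binds its image
    bindGlobal(b);          // thread B rebinds before A's kernel has run
    return sumViaGlobal();  // A's kernel now reads B's image
}
```

With images `{1, 2, 3}` and `{10, 20, 30}`, the replay returns the sum of the second image even though it models the first thread's kernel, while `sumViaHandle` always reads the image it was given.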

@asmorkalov asmorkalov added the Hackathon https://opencv.org/opencv-hackathon-starts-next-week/ label Aug 10, 2020
@nkwangyh
Author

nkwangyh commented Aug 12, 2020

Thanks @nglee.
I followed your suggestion and replaced the texture references with texture objects. I also moved the global variables of the CUDA kernels into the host callers for thread safety. With these changes the crash is fixed.
I also found the same problem in cv::cuda::SparsePyrLKOpticalFlow. There, however, the textures are implemented with templates to support different image types, so I only changed the single-channel grayscale case. That works as well.
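
The pattern described here, moving kernel-side globals into state owned by the host caller, can be illustrated without CUDA. In this C++ sketch, each call owns its counter and output buffer and hands them to the worker function, so concurrent calls never touch shared mutable state. `CallState`, `findAbove`, and `runConcurrently` are hypothetical names. This is an analogy, not the actual OpenCV change.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-call state that would otherwise live in globals.
struct CallState {
    int count = 0;
    std::vector<int> hits;
};

// Collects the indices of all values above `threshold` into caller-owned state.
void findAbove(const std::vector<float>& values, float threshold, CallState& state) {
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (values[i] > threshold) {
            state.hits.push_back(static_cast<int>(i));
            ++state.count;
        }
    }
}

// Runs the same search on several threads. Each thread writes only to its own
// CallState, and the input is only read, so no locking is needed.
std::vector<CallState> runConcurrently(const std::vector<float>& values,
                                       float threshold, int numThreads) {
    std::vector<CallState> states(numThreads);
    std::vector<std::thread> workers;
    for (int id = 0; id < numThreads; ++id)
        workers.emplace_back([&, id] { findAbove(values, threshold, states[id]); });
    for (auto& w : workers) w.join();
    return states;
}
```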

@shubhamcodez
Contributor

> Thanks @nglee
> I've followed your suggestion by replacing the texture references with texture objects. Also the global variables of cuda kernels are moved in the host callers for thread safety. The changes do work and the crash is fixed.
> Besides, I've found the same problem in cv::cuda::SparsePyrLKOpticalFlow. But the textures are implemented with templates to support different image types. I only make changes for the single channel grayscale image type. And it also works.

Is it resolved now?

@areche

areche commented Feb 3, 2021

Hi @nkwangyh, are your changes in a pull request?
If so, which one?
Thanks!

@nkwangyh
Author

nkwangyh commented Feb 4, 2021

@shubhamcodez @areche Sorry for the delay. Yes, the bug was resolved, but my fix was a little dirty and didn't cover all image types, so I didn't create a pull request. I will try to submit the changes before this weekend.

@nkwangyh
Author

nkwangyh commented Feb 5, 2021

@asmorkalov @areche I've submitted a pull request for GoodFeaturesToTrackDetector here. The code has been verified in my local environment. I hope it helps.
However, as I mentioned above, the counterpart in cv::cuda::SparsePyrLKOpticalFlow is more complicated. Apart from the thread-safety problems caused by the global texture references, I also noticed that the CUDA version of the algorithm is less polished than the x86 version in many respects. There is also a bug in the handling of corners from higher levels of the image pyramid, which can cause false matches.
For SparsePyrLKOpticalFlow, I have so far implemented only the single-channel grayscale version, but I would be glad to submit it if needed.

@wHideOf

wHideOf commented Feb 17, 2025

> It seems this algorithm uses texture reference, which is quite obsolete and does not support multi-threaded programming. Texture objects came up in 2013 and superseded the texture reference(link), and texture references are now deprecated in CUDA 11.
>
> There have been some other cases similar to this issue, and they were solved by removing texture references and adopting texture objects. I believe that we can apply the same solution to this issue.

I encountered a similar issue when calling cv::cuda::resize() to upscale a GpuMat from multiple threads.
