CUDA GoodFeaturesToTrackDetector is not ThreadSafe ? #18051

nkwangyh · 2020-08-07T03:48:38Z

System information (version)

OpenCV => 4.2
Operating System / Platform => Ubuntu 16.04
Compiler => g++ (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609; cuda release 9.1, V9.1.85

Detailed description

I came across the same problem as in the link by AlexBn:
https://answers.opencv.org/question/227794/cuda-goodfeaturestotrackdetector-is-not-threadsafe/

While using the OpenCV CUDA GoodFeaturesToTrackDetector in a parallel loop, I noticed that I systematically get the exception "merge_sort: failed to synchronize", even though each thread runs on its own cuda::GpuMat, in its own cuda::Stream, with its own Algorithm instance.

Steps to reproduce

#include <iostream>
#include <mutex>
#include <thread>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/cvconfig.h>
#include <opencv2/opencv.hpp>
#if defined(HAVE_CUDA)
#include <opencv2/core/cuda.hpp>
#include <opencv2/core/cuda/common.hpp>
#include <opencv2/cudaimgproc.hpp>
#include <opencv2/cudaoptflow.hpp>
#include <opencv2/cudaarithm.hpp>
#endif

using namespace std;
using namespace cv;

int main() {
   int NBThread = 5;
   Mat frames = imread("C:\\Users\\alex\\Desktop\\test.png");
   cvtColor(frames, frames, COLOR_BGR2GRAY);

   vector<Mat> vectImg;
   for (int u = 0; u < NBThread; u++)
       vectImg.push_back(frames.clone());

   for (int i = 0; i < 100000; i++) 
   {
       vector<thread> workers;
       mutex m;
       for (int id = 0; id < NBThread; ++id) 
           workers.emplace_back([&, id]()
               { 
                   Size frameSize = vectImg[id].size();
                   // Create the detector
                   int     srcType = CV_8UC1;
                   int     maxCorners = /*1000*/   4000;
                   double  qualityLevel = /*0.01*/ 0.01;
                   double  minDistance = /*0.0*/   0.0;
                   int     blockSize = /*3*/       3;
                   bool    useHarrisDetector = /*false*/   false;
                   double  harrisK = /*0.04*/  0.04;
                   auto m_CudaDetector = cv::cuda::createGoodFeaturesToTrackDetector(srcType, maxCorners, qualityLevel, minDistance, blockSize, useHarrisDetector, harrisK);

                   cuda::Stream stream;
                   cuda::GpuMat gpuFrame = cuda::GpuMat(frameSize, CV_8UC1);
                   gpuFrame.upload(vectImg[id], stream);
                   cv::cuda::GpuMat d_prevRef;
                   m_CudaDetector->detect(gpuFrame, d_prevRef, cuda::GpuMat(), stream);
                   stream.waitForCompletion();

                   std::cout << " Number of points detected = " << d_prevRef.size() << " thread : " << id << std::endl;
               });

       for (auto& worker : workers) worker.join();
   }
   return 0;
}

After many iterations I get an exception with this call stack:

opencv_cudaimgproc420d.dll!thrust::cuda_cub::throw_on_error(cudaError status, const char * msg) Line 227 C++
opencv_cudaimgproc420d.dll!thrust::cuda_cub::__merge_sort::merge_sort .... Line 1318 C++
opencv_cudaimgproc420d.dll!thrust::cuda_cub::__smart_sort::smart_sort ... Line 1552 C++
opencv_cudaimgproc420d.dll!thrust::cuda_cub::sort ... Line 1631 C++
opencv_cudaimgproc420d.dll!thrust::sort ... Line 57 C++
opencv_cudaimgproc420d.dll!cv::cuda::device::gfft::sortCorners_gpu(cv::cuda::PtrStepSz<float> eig, float2 * corners,int count, CUstream_st * stream) Line 139 C++
opencv_cudaimgproc420d.dll!`anonymous namespace'::GoodFeaturesToTrackDetector::detect(const cv::debug_build_guard::_InputArray & _image, const cv::debug_build_guard::_OutputArray & _corners, const cv::debug_build_guard::_InputArray & _mask, cv::cuda::Stream & stream) Line 125 C++

Must I conclude that the OpenCV CUDA GoodFeaturesToTrackDetector is not thread-safe, despite the use of separate Streams?


nkwangyh · 2020-08-07T06:22:04Z

I've just looked inside the function and found:

        void sortCorners_gpu(PtrStepSzf eig, float2* corners, int count, cudaStream_t stream)
        {
            bindTexture(&eigTex, eig);

            thrust::device_ptr<float2> ptr(corners);
#if THRUST_VERSION >= 100802
            if (stream)
                thrust::sort(thrust::cuda::par(ThrustAllocator::getAllocator()).on(stream), ptr, ptr + count, EigGreater());
            else
                thrust::sort(thrust::cuda::par(ThrustAllocator::getAllocator()), ptr, ptr + count, EigGreater());
#else
            thrust::sort(ptr, ptr + count, EigGreater());
#endif
        }

It seems something goes wrong when Thrust is used from multiple CPU threads. Searching for "cuda thrust merge_sort failed to synchronize", I found another discussion:
amdegroot/ssd.pytorch#120
which also suggests that the bug is related to multiple threads and Thrust.
Could anyone help?
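
Since sortCorners_gpu binds a texture before sorting, and later comments in this thread describe that texture as global state, one conservative stopgap is to let only one thread at a time run the section that touches it. The sketch below is an assumption, not a fix confirmed in this thread. It uses only the standard library, with a plain counter standing in for the GPU work. The names g_detectMutex, runSerialized and countSerializedCalls are hypothetical.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical process-wide lock guarding the section that touches
// shared GPU state (assumption: detect() plus the stream wait).
std::mutex g_detectMutex;

// Runs `work` while holding the lock, so at most one thread executes
// the guarded section at any time.
template <typename Fn>
void runSerialized(Fn&& work) {
    std::lock_guard<std::mutex> lock(g_detectMutex);
    work();
}

// Demonstration only: a plain counter replaces the GPU calls. The
// counter has no synchronization of its own; the lock alone makes
// the increments safe.
int countSerializedCalls(int threads, int callsPerThread) {
    int counter = 0;
    std::vector<std::thread> workers;
    for (int id = 0; id < threads; ++id) {
        workers.emplace_back([&counter, callsPerThread] {
            for (int i = 0; i < callsPerThread; ++i)
                runSerialized([&counter] { ++counter; });
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

In the reproducer, the lambda passed to runSerialized would presumably wrap both the detect() call and stream.waitForCompletion(), so that no other thread can rebind the texture while the sort is still in flight. This gives up the parallelism of the detection step, so it is only a workaround.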

nglee · 2020-08-08T04:28:08Z

It seems this algorithm uses a texture reference, which is quite obsolete and does not support multi-threaded programming. Texture objects were introduced in 2013 and superseded texture references (link), and texture references are now deprecated in CUDA 11.

There have been some other cases similar to this issue, and they were solved by removing texture references and adopting texture objects. I believe that we can apply the same solution to this issue.
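
The difference between the two approaches can be illustrated with a host-only C++ analogy. This is not CUDA code and not the actual OpenCV patch. A texture reference behaves like one shared global binding that every caller overwrites, whereas a texture object behaves like a small handle created per call and passed to the kernel by value. The names EigHandle, sumThroughHandle and sumAllConcurrently are hypothetical, and a simple sum stands in for the kernel.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for a texture object: a lightweight handle
// created per call and passed by value, so threads share no binding.
struct EigHandle {
    const float* data;
    std::size_t size;
};

// Stand-in for a kernel: reads only through the handle it was given.
float sumThroughHandle(EigHandle h) {
    float s = 0.0f;
    for (std::size_t i = 0; i < h.size; ++i) s += h.data[i];
    return s;
}

// One worker per input. Each worker builds its own handle and writes
// to its own result slot, so no global state is rebound concurrently.
std::vector<float> sumAllConcurrently(
        const std::vector<std::vector<float>>& inputs) {
    std::vector<float> results(inputs.size(), 0.0f);
    std::vector<std::thread> workers;
    for (std::size_t id = 0; id < inputs.size(); ++id) {
        workers.emplace_back([&inputs, &results, id] {
            EigHandle h{inputs[id].data(), inputs[id].size()};
            results[id] = sumThroughHandle(h);
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

With a single global binding in place of EigHandle, two workers could overwrite each other's binding between the bind and the read, which is the kind of interference described above.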

nkwangyh · 2020-08-12T08:49:45Z

Thanks @nglee
I followed your suggestion and replaced the texture references with texture objects. I also moved the global variables used by the CUDA kernels into the host callers for thread safety. The changes work and the crash is fixed.
Besides, I found the same problem in cv::cuda::SparsePyrLKOpticalFlow. There, however, the textures are implemented with templates to support different image types, so I only changed the single-channel grayscale case. That also works.

shubhamcodez · 2020-09-30T08:39:17Z

Is it resolved now?

areche · 2021-02-03T14:32:36Z

Hi @nkwangyh, are your changes in a pull request?
If so, which one?
Thanks!

nkwangyh · 2021-02-04T09:04:12Z

@shubhamcodez @areche Sorry for the delay. Yes, the bug was resolved, but my fix was a little dirty and didn't cover all image types, so I didn't create a pull request. I will try to submit the changes before this weekend.

nkwangyh · 2021-02-05T03:52:56Z

@asmorkalov @areche I've submitted a pull request for GoodFeaturesToTrackDetector here. The code has been verified in my local environment. I hope it helps.
However, as I mentioned above, the counterpart in cv::cuda::SparsePyrLKOpticalFlow is more complicated. Apart from the thread-safety problems caused by the global texture reference, I also noticed that the CUDA version of the algorithm is less polished than the x86 version in many respects. There is also a bug in the handling of corners from higher levels of the image pyramid, which can cause false matches.
For SparsePyrLKOpticalFlow, I have so far only implemented the single-channel grayscale version, but I will be pleased to submit that if necessary.

wHideOf · 2025-02-17T11:30:58Z

I encountered a similar issue when calling cv::cuda::resize() to upscale a GpuMat from multiple threads.

asmorkalov added the Hackathon https://opencv.org/opencv-hackathon-starts-next-week/ label Aug 10, 2020

nkwangyh mentioned this issue Feb 5, 2021: "#18051 Fix CUDA GoodFeaturesToTrackDetector not threadsafe bug", opencv/opencv_contrib#2856 (Closed)

nkwangyh mentioned this issue Feb 12, 2021: "Fix CUDA GoodFeaturesToTrackDetector not threadsafe bug", opencv/opencv_contrib#2868 (Merged)

opencv-pushbot closed this as completed in opencv/opencv_contrib#2868 Feb 20, 2021

keivanmoazami mentioned this issue Jul 25, 2022: "CUDA SparsePyrLKOpticalFlow is not ThreadSafe ?", opencv/opencv_contrib#3307 (Open)
