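What is the fastest way to execute three nested for loops?

Question:

I am an image processing programmer and I am using OpenCV with C++. As part of a program I wrote, I have three nested for loops. The first iterates over different images, the second over the rows of an image, and the third over the columns of an image. There are no dependencies among the three, so they can be executed in parallel (I mean that all pixels of all images can be processed in parallel). I am not familiar with parallel programming, GPU programming, threads, TBB, parallel for loops, and so on. I found various links on the internet that suggest these approaches. I would like to know which is the fastest solution for my problem. My OS is Windows and I am using Visual Studio 2015.

My code looks like this: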

int prjResCol[MAX_NUMBER_OF_PROJECTOR]; 
int prjResRow[MAX_NUMBER_OF_PROJECTOR]; 
Mat prjCamCor[MAX_NUMBER_OF_PROJECTOR][2];
Mat prjImgColored[MAX_NUMBER_OF_PROJECTOR]; 

for (int i = 0; i < numOfProjector; i++) 
{ 
    Mat tmp(prjResRow[i], prjResCol[i], CV_8UC3, Scalar(0, 0, 0)); 
    prjImgColored[i] = tmp; 

    for (int ii = 0; ii < prjResRow[i]; ii++)
    {
        double* ptrPrjCamIAnd0 = prjCamCor[i][0].ptr<double>(ii);
        double* ptrPrjCamIAnd1 = prjCamCor[i][1].ptr<double>(ii);
        Vec3b* ptrPrjImgColoredI = prjImgColored[i].ptr<Vec3b>(ii);

        for (int jj = 0; jj < prjResCol[i]; jj++)
        {
            if ((ptrPrjCamIAnd0[jj] != NAN_VALUE) && (ptrPrjCamIAnd1[jj] != NAN_VALUE))
            {
                ptrPrjImgColoredI[jj] = secondImgColored.at<Vec3b>(ptrPrjCamIAnd1[jj], ptrPrjCamIAnd0[jj]);
            }
        }
    }
    imwrite(mainAdr + "\\img" + to_string(i) + ".bmp", prjImgColored[i]); 
}

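Try to provide an MCVE, a small but complete sample. You have left out key information about several variables (such as the ones whose names start with 'prj'), about the types ('Mat', 'Vec3b'), and about 'CV_8UC3' (whatever that is). That information is essential, because in order to optimize your code, someone needs to understand what these things are. – Peter

Have you profiled? What are your constraints? How many images? What size? What processing happens in the inner loop? Without knowing this, it makes no sense to start "optimizing". – Miki

The maximum number of images is 20. Each Mat is about 2000 * 3000 (rows * columns). – Shahab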

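Answer

As you wrote, using a parallel for loop to iterate over the pixels will be the fastest approach for large images. Parallel algorithms carry some overhead, so for small images (for example 256 x 256) you may be better off with the conventional loop you posted.

Here is an example in Visual C++: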

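// Headers and namespaces this snippet relies on (GDI+ and the Parallel Patterns Library).
#include <windows.h>
#include <gdiplus.h>
#include <ppl.h>
#include <functional>

using namespace concurrency;
using namespace Gdiplus;
using namespace std;

// Calls the provided function for each pixel in a Bitmap object.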
void ProcessImage(Bitmap* bmp, const function<void (DWORD&)>& f) 
{ 
    int width = bmp->GetWidth(); 
    int height = bmp->GetHeight(); 

    // Lock the bitmap. 
    BitmapData bitmapData; 
    Rect rect(0, 0, bmp->GetWidth(), bmp->GetHeight()); 
    bmp->LockBits(&rect, ImageLockModeWrite, PixelFormat32bppRGB, &bitmapData); 

    // Get a pointer to the bitmap data. 
    DWORD* image_bits = (DWORD*)bitmapData.Scan0; 

    // Call the function for each pixel in the image. 
    parallel_for (0, height, [&, width](int y)
    {
        for (int x = 0; x < width; ++x)
        {
            // Get the current pixel value.
            DWORD* curr_pixel = image_bits + (y * width) + x;

            // Call the function.
            f(*curr_pixel);
        }
    });

    // Unlock the bitmap. 
    bmp->UnlockBits(&bitmapData); 
}
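
The example above depends on GDI+ and PPL. As a rough illustration of the same row-splitting idea applied to the lookup loop in the question, here is a minimal sketch that uses only the C++ standard library. `Pixel`, `Image`, `kNoMapping` and `remapParallel` are hypothetical names invented for this sketch: they stand in for `Vec3b`, `Mat`, `NAN_VALUE` and the question's inner loops, and are not OpenCV APIs. It also assumes the "no mapping" marker is an ordinary sentinel value; if it were a real NaN, the test would need `std::isnan`, because comparing against NaN with `!=` is always true.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a 3-channel 8-bit pixel (Vec3b in the question).
struct Pixel { std::uint8_t b, g, r; };

// Hypothetical stand-in for Mat: row-major pixels in one flat vector.
struct Image {
    int rows = 0, cols = 0;
    std::vector<Pixel> data;
    Image(int r, int c)
        : rows(r), cols(c), data(static_cast<std::size_t>(r) * c, Pixel{0, 0, 0}) {}
    Pixel& at(int r, int c) { return data[static_cast<std::size_t>(r) * cols + c]; }
    const Pixel& at(int r, int c) const { return data[static_cast<std::size_t>(r) * cols + c]; }
};

// Assumed sentinel meaning "this pixel has no source" (plays the role of NAN_VALUE).
const double kNoMapping = -1.0;

// For each destination pixel, mapX/mapY give the source column/row to copy from.
// Destination rows are split into contiguous bands, one band per thread, so no
// two threads ever write the same pixel.
void remapParallel(const Image& src,
                   const std::vector<double>& mapX,
                   const std::vector<double>& mapY,
                   Image& dst,
                   unsigned numThreads = std::thread::hardware_concurrency())
{
    assert(mapX.size() == dst.data.size() && mapY.size() == dst.data.size());
    if (numThreads == 0) numThreads = 1;  // hardware_concurrency() may report 0
    numThreads = std::min<unsigned>(numThreads, std::max(dst.rows, 1));
    const int n = static_cast<int>(numThreads);
    const int rowsPerThread = (dst.rows + n - 1) / n;  // ceiling division

    auto worker = [&](int rowBegin, int rowEnd) {
        for (int y = rowBegin; y < rowEnd; ++y) {
            for (int x = 0; x < dst.cols; ++x) {
                const std::size_t idx = static_cast<std::size_t>(y) * dst.cols + x;
                if (mapX[idx] == kNoMapping || mapY[idx] == kNoMapping) continue;
                const int c = static_cast<int>(mapX[idx]);
                const int r = static_cast<int>(mapY[idx]);
                // Skip lookups that fall outside the source image.
                if (r < 0 || r >= src.rows || c < 0 || c >= src.cols) continue;
                dst.at(y, x) = src.at(r, c);
            }
        }
    };

    std::vector<std::thread> threads;
    for (int begin = 0; begin < dst.rows; begin += rowsPerThread) {
        threads.emplace_back(worker, begin, std::min(begin + rowsPerThread, dst.rows));
    }
    for (auto& t : threads) t.join();  // wait for every band to finish
}
```

Calling `remapParallel(src, mapX, mapY, dst)` uses one band per hardware thread by default. The same banding could also be expressed with PPL's `parallel_for` as above, or with OpenCV's `cv::parallel_for_`; whether it beats the serial loop for a given image size is something to measure rather than assume.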

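Another way to parallelize this is to run the single-threaded workflow (the double for loop) on several images at the same time. Below is an example written in C#. You would simply replace the bitmap flip routine with your serial double for loop. A C++ implementation using an appropriate parallel library should look very similar: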

// A simple source for demonstration purposes. Modify this path as necessary.
String[] files = System.IO.Directory.GetFiles(@"C:\Users\Public\Pictures\Sample Pictures", "*.jpg");
String newDir = @"C:\Users\Public\Pictures\Sample Pictures\Modified";
System.IO.Directory.CreateDirectory(newDir);

// Method signature: Parallel.ForEach(IEnumerable<TSource> source, Action<TSource> body) 
    // Be sure to add a reference to System.Drawing.dll. 
    Parallel.ForEach(files, (currentFile) => 
    { 
     // The more computational work you do here, the greater 
     // the speedup compared to a sequential foreach loop. 
     String filename = System.IO.Path.GetFileName(currentFile); 
     var bitmap = new Bitmap(currentFile); 

     bitmap.RotateFlip(RotateFlipType.Rotate180FlipNone); 
     bitmap.Save(Path.Combine(newDir, filename)); 

     // Peek behind the scenes to see how work is parallelized. 
     // But be aware: Thread contention for the Console slows down parallel loops!!! 

     Console.WriteLine("Processing {0} on thread {1}", filename, Thread.CurrentThread.ManagedThreadId); 
     //close lambda expression and method invocation 
     });
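
To make the "one task per image" idea concrete on the C++ side, here is a minimal sketch using `std::async` from the standard library, which should be available in Visual Studio 2015. `forEachImageParallel` and its `processOne` callback are hypothetical names made up for this sketch; the callback is assumed to contain the serial row and column loops for image `i` and to touch only that image's data.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Runs processOne(i) for every image index in [0, count), each as its own
// asynchronous task, then waits for all of them to finish.
void forEachImageParallel(std::size_t count,
                          const std::function<void(std::size_t)>& processOne)
{
    assert(processOne);  // precondition: the callback must be callable

    std::vector<std::future<void>> tasks;
    tasks.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // std::launch::async requests asynchronous execution instead of
        // deferring the work until get() is called.
        tasks.push_back(std::async(std::launch::async, processOne, i));
    }
    for (auto& t : tasks) {
        t.get();  // blocks until the task is done; rethrows its exception, if any
    }
}
```

With the question's code, the callback body would be everything inside the outer `for (int i ...)` loop, including the `imwrite` call. Launching one task per image is reasonable for about 20 images; for a much larger count, a fixed pool of workers would likely be a better fit.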

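OpenCV has supported parallel loops since at least version 2.4.3. By using a parallel loop you can take advantage of a multi-core CPU, where each core iterates over a separate sub-section of the image.

OpenCV also supports CUDA, a parallel processing API created by NVIDIA that takes advantage of the GPU. I don't think that approach is the right solution for this problem, but since you mention that you are an image processing programmer, it is worth considering for future problems.

Thanks for your answer. The maximum number of images is 20. Each Mat is about 2000 * 3000 (rows * columns). – Shahab

Parallel processing is basically useless for this kind of processing with so few images. – Miki

Although 20 images of 6 megapixels each is a fairly small workload, parallel processing can still shorten the processing time if implemented correctly (allowing for a slight overhead). If someone is using an Intel i7 processor (8 cores) and running a parallel workflow (processing multiple images in parallel), and the PC is not being pushed to perform any other intensive task, you will see a time improvement. I have not tried this pattern on images, but I have processed several 100 MB files this way, and the time saved compared with the serial workflow was significant. –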
