需要帮助优化代码（最低图像约定）

问题描述：

我写了一些模拟代码，并使用“随机中断GDB”调试方法。我发现我的计划的99.9％的时间在这个例程（它是最小影像约定）花：需要帮助优化代码（最低图像约定）

inline double distanceSqPeriodic(double const * const position1, double const * const position2, double boxWidth) { 
     double xhw, yhw, zhw, x, y, z;                     

     xhw = boxWidth/2.0;                      
     yhw = xhw;                      
     zhw = xhw;                      

     x = position2[0] - position1[0];                    
     if (x > xhw) 
       x -= boxWidth;                      
     else if (x < -xhw) 
       x += boxWidth;                      


     y = position2[1] - position1[1];                    
     if (y > yhw) 
       y -= boxWidth;                      
     else if (y < -yhw) 
       y += boxWidth;                      


     z = position2[2] - position1[2];                    
     if (z > zhw) 
       z -= boxWidth;                      
     else if (z < -zhw) 
       z += boxWidth;                      

     return x * x + y * y + z * z;                     
}

至今（也许不是很显著的）我已经执行的优化：

返回的距离，而不是平方根的平方
内联它
常量我所能
没有标准库膨胀
编译每g ++优化标志我能想到

我用尽了我能做的事情。也许我可以使用花车而不是双打，但我宁愿这是最后的手段。也许我可以以某种方式在此上使用SIMD，但我从来没有这样做过，所以我想这是很多工作。有任何想法吗？

感谢

这将有助于看到呼叫代码，特别是如果你想要去SIMD路线。理解参数也会有帮助，例如每个呼叫的'boxWidths []'中的值是不同的？ – 2013-02-25 15:01:15

调用方法次数更少（即算法改进）。当然，不知道它是否适用;） – 2013-02-25 15:02:12

@Paul R No ... d'oh！让我解决这个问题 – Nick 2013-02-25 15:02:26

答

首先，你没有使用正确的算法。如果两点大于boxWidth，会怎么样？其次，如果你有多个粒子，调用一个完成所有距离计算的单个函数，并将结果放到输出缓冲区中将会显着提高效率。内联有助于减少一些，但不是全部。任何预先计算 - 例如在算法中将盒子长度除以2 - 将在不需要时重复。

这里是一些SIMD代码来做计算。您需要使用-msse4进行编译。使用-O3，在我的机器上（macbook pro，llvm-gcc-4.2），我的速度提高了约2倍。这确实需要使用32位浮点数而不是双精度算术。

上证所真的不是那么复杂，它只是看起来可怕。例如而不是写一个* b，你必须编写笨重的_mm_mul_ps（a，b）。

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h> 
#include <smmintrin.h> 

// you can compile this code with -DDOUBLE to try using doubles vs. floats 
// in the unoptimized code. The SSE code uses only floats. 
#ifdef DOUBLE 
typedef double real; 
#else 
typedef float real; 
#endif 

static inline __m128 loadFloat3(const float const* value) { 
    // Load (x,y,z) into a SSE register, leaving the last entry 
    // set to zero. 
    __m128 x = _mm_load_ss(&value[0]); 
    __m128 y = _mm_load_ss(&value[1]); 
    __m128 z = _mm_load_ss(&value[2]); 
    __m128 xy = _mm_movelh_ps(x, y); 
    return _mm_shuffle_ps(xy, z, _MM_SHUFFLE(2, 0, 2, 0)); 
} 

int fdistanceSqPeriodic(float* position1, float* position2, const float boxWidth, 
          float* out, const int n_points) { 
    int i; 
    __m128 r1, r2, r12, s12, r12_2, s, box, invBox; 

    box = _mm_set1_ps(boxWidth); 
    invBox = _mm_div_ps(_mm_set1_ps(1.0f), box); 

    for (i = 0; i < n_points; i++) { 
    r1 = loadFloat3(position1); 
    r2 = loadFloat3(position1); 
    r12 = _mm_sub_ps(r1, r2); 

    s12 = _mm_mul_ps(r12, invBox); 
    s12 = _mm_sub_ps(s12, _mm_round_ps(s12, _MM_FROUND_TO_NEAREST_INT)); 
    r12 = _mm_mul_ps(box, s12); 

    r12_2 = _mm_mul_ps(r12, r12); 

    // double horizontal add instruction accumulates the sum of 
    // all four elements into each of the elements 
    // (e.g. s.x = s.y = s.z = s.w = r12_2.x + r12_2.y + r12_2.z + r12_2.w) 
    s = _mm_hadd_ps(r12_2, r12_2); 
    s = _mm_hadd_ps(s, s); 

    _mm_store_ss(out++, s); 
    position1 += 3; 
    position2 += 3; 
    } 
    return 1; 
} 

inline real distanceSqPeriodic(real const * const position1, real const * const position2, real boxWidth) { 
    real xhw, yhw, zhw, x, y, z;                     

    xhw = boxWidth/2.0;                      
    yhw = xhw;                      
    zhw = xhw;                      

    x = position2[0] - position1[0];                    
    if (x > xhw) 
    x -= boxWidth;                      
    else if (x < -xhw) 
    x += boxWidth;                      


    y = position2[1] - position1[1];                    
    if (y > yhw) 
    y -= boxWidth;                      
    else if (y < -yhw) 
    y += boxWidth;                      


    z = position2[2] - position1[2];                    
    if (z > zhw) 
    z -= boxWidth;                      
    else if (z < -zhw) 
    z += boxWidth;                      

    return x * x + y * y + z * z;                     
} 


int main(void) { 
    real* position1; 
    real* position2; 
    real* output; 
    int n_runs = 10000000; 
    posix_memalign((void**) &position1, 16, n_runs*3*sizeof(real)); 
    posix_memalign((void**) &position2, 16, n_runs*3*sizeof(real)); 
    posix_memalign((void**) &output, 16, n_runs*sizeof(real)); 
    real boxWidth = 1.8; 
    real result = 0; 
    int i; 
    clock_t t; 

#ifdef OPT 
    printf("Timing optimized SSE implementation\n"); 
#else 
    printf("Timinig original implementation\n"); 
#endif 

#ifdef DOUBLE 
    printf("Using double precision\n"); 
#else 
    printf("Using single precision\n"); 
#endif 

    t = clock(); 
#ifdef OPT 
    fdistanceSqPeriodic(position1, position2, boxWidth, output, n_runs); 
#else 
    for (i = 0; i < n_runs; i++) { 
    *output = distanceSqPeriodic(position1, position2, boxWidth); 
    position1 += 3; 
    position2 += 3; 
    output++; 
    } 
#endif 
    t = clock() - t; 

    printf("It took me %d clicks (%f seconds).\n", (int) t, ((float)t)/CLOCKS_PER_SEC); 
}

嘿，这太棒了！感谢你这样做。我一定会在我的模拟相关项目中使用它。 – Nick 2013-08-06 01:30:45

没问题。我基本上是从我为MDTraj（http://rmcgibbo.github.io/mdtraj/）写的代码复制粘贴的。我还没有真正分析过这个时机 - 从玩这个代码看起来最重要的是内存带宽和对齐。一旦字节传送到CPU或从CPU传送出去，实际的计算本身就很便宜。 – 2013-08-06 05:20:37

正确的链接到mdtraj现在是https://github.com/mdtraj/mdtraj – max 2016-10-10 20:52:02

答

Return the square of the distance instead of the square root

说只要是个好主意，因为你是比较方块是正方形的。

Inline it

这有时是一个反优化：内联的代码占用空间，在执行流水线/高速缓存，无论是支与否。

通常它没有区别，因为编译器对是否内联是否有最终的说法。

Const what I can

通常根本没有差异。

No standard library bloat

臃肿？

Compiling with every g++ optimization flag I can think of

这很好：将大部分优化留给编译器。只有当你衡量你真正的瓶颈，并确定这个瓶颈是否显着时，才需要在手上进行优化。

你可以试试做的是让你的代码无分支。如果不使用位掩码，这可能是这样的：

//if (z > zhw) 
//  z -= boxWidths[2]; 
//else if (z < -zhw) 
//  z += boxWidths[2]; 

const auto z_a[] = { 
    z, 
    z - boxWidths[2] 
}; 
z = z_a[z>zhw]; 
...

或

z -= (z>zhw) * boxWidths[2];

然而，没有保证，这是速度更快。您的编译器现在可能难以在您的代码中识别SIMD点，或者分支目标缓冲区做得很好，而且大部分时间您的函数都有相同的代码路径。

这实际上减慢了它的相当多。 – Nick 2013-02-25 15:25:14

答

您需要摆脱比较，因为这些比较难以预测。

的功能被实现为：

//   /\ /\ 
    //  /\/ \ 
    ----0----- or ------------ , as (-x)^2 == x^2 
    //
    //

后者是二腹肌语句的结果：

x = abs(half-abs(diff))+half;

代码

double tst(double a[4], double b[4], double half) 
{ 
    double sum=0.0,t; 
    int i; 
    for (i=0;i<3;i++) { t=fabs(fabs(b[i]-a[i])-half)-half; sum+=t*t;} 
    return sum; 
}

由节拍原来的实现因子四（+一些） - 在这一点上甚至没有完全平行：只有xmm注册的下半部分rs被使用。

通过并行处理x & & y，理论上可以获得约50％的收益。理论上使用花车而不是双打可以使速度提高约3倍。

这些比较与算术表达式一样难以预测。我认为你的意思是if语句（即分支） – 2013-02-25 15:48:03

比较_implies_分支，就像在LUT [i> j]的另一种方法中那样。 – 2013-02-25 18:38:14

例如在'z - =（z> zhw）* boxWidths [2];'没有分支，而是比较。或者假设'const bool foobar = x 无分支（至少不在x86上） – 2013-02-26 11:27:51

答

您可能想要使用晶圆厂（在ISO 90 C中标准化），因为这应该能够被缩减为单个非分支指令。

需要帮助优化代码（最低图像约定）

相关推荐