Optimizing C++ code for performance_问答_开发者

Can you think of some way to optimize this piece of code? It's meant to execute in an ARMv7 processor (Iphone 3GS):

4.0%  inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols) 
      {
0.7%    float *data = (float *) img->imageData;
1.4%    int step = img->widthStep/sizeof(float);

        // The subtraction by one for row/col is because row/col is inclusive.
1.1%    int r1 = std::min(row,          img->height) - 1;
1.0%    int c1 = std::min(col,          img->width)  - 1;
2.7%    int r2 = std::min(row + rows,   img->height) - 1;
3.7%    int c2 = std::min(col + cols,   img->width)  - 1;

        float A(0.0f), B(0.0f), C(0.0f), D(0.0f);
8.5%    if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1];
11.7%   if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2];
7.6%    if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1];
9.2%    if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];

21.9%   return std::max(0.f, A - B - C + D);
3.8%  }

All this code is taken from the OpenSURF library. Here's the context of the function (some people were asking for the context):

//! Calculate DoH responses for supplied layer
void FastHessian::buildResponseLayer(ResponseLayer *rl)
{
  float *responses = rl->responses;         // response storage
  unsigned char *laplacian = rl->l开发者_如何学Pythonaplacian; // laplacian sign storage
  int step = rl->step;                      // step size for this filter
  int b = (rl->filter - 1) * 0.5 + 1;         // border for this filter
  int l = rl->filter / 3;                   // lobe for this filter (filter size / 3)
  int w = rl->filter;                       // filter size
  float inverse_area = 1.f/(w*w);           // normalisation factor
  float Dxx, Dyy, Dxy;

  for(int r, c, ar = 0, index = 0; ar < rl->height; ++ar) 
  {
    for(int ac = 0; ac < rl->width; ++ac, index++) 
    {
      // get the image coordinates
      r = ar * step;
      c = ac * step; 

      // Compute response components
      Dxx = BoxIntegral(img, r - l + 1, c - b, 2*l - 1, w)
          - BoxIntegral(img, r - l + 1, c - l * 0.5, 2*l - 1, l)*3;
      Dyy = BoxIntegral(img, r - b, c - l + 1, w, 2*l - 1)
          - BoxIntegral(img, r - l * 0.5, c - l + 1, l, 2*l - 1)*3;
      Dxy = + BoxIntegral(img, r - l, c + 1, l, l)
            + BoxIntegral(img, r + 1, c - l, l, l)
            - BoxIntegral(img, r - l, c - l, l, l)
            - BoxIntegral(img, r + 1, c + 1, l, l);

      // Normalise the filter responses with respect to their size
      Dxx *= inverse_area;
      Dyy *= inverse_area;
      Dxy *= inverse_area;

      // Get the determinant of hessian response & laplacian sign
      responses[index] = (Dxx * Dyy - 0.81f * Dxy * Dxy);
      laplacian[index] = (Dxx + Dyy >= 0 ? 1 : 0);

#ifdef RL_DEBUG
      // create list of the image coords for each response
      rl->coords.push_back(std::make_pair<int,int>(r,c));
#endif
    }
  }
}

Some questions:

Is it a good idea that the function is inline? Would using inline assembly provide a significant speedup?

Specialize for the edges so that you don't need to check for them in every row and column. I assume that this call is in a nested loop and is called a lot. This function would become:

inline float BoxIntegralNonEdge(IplImage *img, int row, int col, int rows, int cols) 
{
  float *data = (float *) img->imageData;
  int step = img->widthStep/sizeof(float);

  // The subtraction by one for row/col is because row/col is inclusive.
  int r1 = row - 1;
  int c1 = col - 1;
  int r2 = row + rows - 1;
  int c2 = col + cols - 1;

  float A(data[r1 * step + c1]), B(data[r1 * step + c2]), C(data[r2 * step + c1]), D(data[r2 * step + c2]);

  return std::max(0.f, A - B - C + D);
}

You get rid of a conditional and branch for each min and two conditionals and a branch for each if. You can only call this function if you already meet the conditions -- check that in the caller for the whole row once instead of each pixel.

I wrote up some tips for optimizing image processing when you have to do work on each pixel:

http://www.atalasoft.com/cs/blogs/loufranco/archive/2006/04/28/9985.aspx

Other things from the blog:

You are recalculating a position in the image data with 2 multiplies (indexing is multiplication) -- you should be incrementing a pointer.
Instead of passing in img, row, row, col and cols, pass in pointers to the exact pixels to process -- which you get from incrementing pointers, not indexing.
If you don't do the above, step is the same for all pixels, calculate it in the caller and pass it in. If you do 1 and 2, you won't need step at all.

There are a few places to reuse temporary variables, but whether it would improve performance would have to be measured as dirkgently stated:

Change

  if (r1 >= 0 && c1 >= 0) A = data[r1 * step + c1]; 
  if (r1 >= 0 && c2 >= 0) B = data[r1 * step + c2]; 
  if (r2 >= 0 && c1 >= 0) C = data[r2 * step + c1]; 
  if (r2 >= 0 && c2 >= 0) D = data[r2 * step + c2];

  if (r1 >= 0) {
    int r1Step = r1 * step;
    if (c1 >= 0) A = data[r1Step + c1]; 
    if (c2 >= 0) B = data[r1Step + c2]; 
  }
  if (r2 >= 0) {
    int r2Step = r2 * step;
    if (c1 >= 0) C = data[r2Step + c1]; 
    if (c2 >= 0) D = data[r2Step + c2]; 
  }

You may actually end up doing the temp multiplactions too often in case your if statements rarely provides true.

You aren't interested in four variables A, B, C, D, but only the combination A - B - C + D.

Try

float result(0.0f);
if (r1 >= 0 && c1 >= 0) result += data[r1 * step + c1];
if (r1 >= 0 && c2 >= 0) result -= data[r1 * step + c2];
if (r2 >= 0 && c1 >= 0) result -= data[r2 * step + c1];
if (r2 >= 0 && c2 >= 0) result += data[r2 * step + c2];

if (result > 0f) return result;
return 0f;

The compiler probably handles inling automatically where it's proper.

Without any knowledge about the context. Is the if(r1 >= 0 && c1 >= 0) check necessary?

Isn't it required that the row and col parameters are > 0?

float BoxIntegral(IplImage *img, int row, int col, int rows, int cols) 
{
  assert(row > 0 && col > 0);
  float *data = (float*)img->imageData; // Don't use C-style casts
  int step = img->widthStep/sizeof(float);

  // Is the min check rly necessary?
  int r1 = std::min(row,          img->height) - 1;
  int c1 = std::min(col,          img->width)  - 1;
  int r2 = std::min(row + rows,   img->height) - 1;
  int c2 = std::min(col + cols,   img->width)  - 1;

  int r1_step = r1 * step;
  int r2_step = r2 * step;

  float A = data[r1_step + c1];
  float B = data[r1_step + c2];
  float C = data[r2_step + c1];
  float D = data[r2_step + c2];

  return std::max(0.0f, A - B - C + D);
}

Some of the examples say to initialize A, B, C and D directly and skip the initialization with 0, but this is functionally different than your original code in some ways. I would do this however:

inline float BoxIntegral(IplImage *img, int row, int col, int rows, int cols)  {

    const float *data = (float *) img->imageData;
    const int step = img->widthStep/sizeof(float);

    // The subtraction by one for row/col is because row/col is inclusive.
    const int r1 = std::min(row,          img->height) - 1;
    const int r2 = std::min(row + rows,   img->height) - 1;
    const int c1 = std::min(col,          img->width)  - 1;
    const int c2 = std::min(col + cols,   img->width)  - 1;

    const float A = (r1 >= 0 && c1 >= 0) ? data[r1 * step + c1] : 0.0f;
    const float B = (r1 >= 0 && c2 >= 0) ? data[r1 * step + c2] : 0.0f;
    const float C = (r2 >= 0 && c1 >= 0) ? data[r2 * step + c1] : 0.0f;
    const float D = (r2 >= 0 && c2 >= 0) ? data[r2 * step + c2] : 0.0f;

    return std::max(0.f, A - B - C + D);
}

like your original code, this will make A, B, C and D have a value either from data[] if the condition is true or 0.0f if the condition is false. Also, I would (as I have shown) use const wherever it is appropriate. Many compilers aren't able to improve code much based on const-ness, but it certainly can't hurt to give the compiler more information about the data it is operating on. Finally I have reordered the r1/r2/c1/c2 variables to encourage reuse of the fetched width and height.

Obviously you would need to profile to determine if any of this is actually an improvement.

I am not sure if your problem lends itself to SIMD but this could potentially allow you to perform multiple operations on your image at once and give you a good performance improvement. I am assuming you are inlining and optimizing because you are performing the operation multiple times. Take a look at:

http://blogs.arm.com/software-enablement/coding-for-neon-part-1-load-and-stores/
http://blogs.arm.com/software-enablement/coding-for-neon-part-2-dealing-with-leftovers/
http://blogs.arm.com/software-enablement/coding-for-neon-part-3-matrix-multiplication/
http://blogs.arm.com/software-enablement/coding-for-neon-part-4-shifting-left-and-right/

Compiler do have some support for Neon if the correct flags are enabled but you will probably need to roll out some on your own.

Edit To get compiler support for neon you will need to use the compiler flag -mfpu=neon