Can I use Duff's Device on an array in C?

I have a loop here and I want to make it run faster. I am passing in a large array. I recently heard of Duff's Device; can it be applied to this for loop? Any ideas?

for (i = 0; i < dim; i++) {
    for (j = 0; j < dim; j++) {
        dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
    }
}


Please, please don't use Duff's Device. A thousand maintenance programmers will thank you. I used to work for a training company where someone thought it funny to introduce the device in the first ten pages of their C programming course. As an instructor it was impossible to deal with, unless (as the guy who wrote that bit of the course apparently did) you believe in "kewl" coding.

Needless to say, I had the thing expunged from the course, ASAP.


Why do you want to make it run faster?

Is there an actual performance problem?

If so, have you profiled and found that this is executing often enough, and hence worth optimizing?

If so, you may want to write it in two ways, the straightforward way you have now and with Duff's Device, or any other method you like.

At that point, you test the performance. You may be surprised. Modern optimizers are quite good, and modern CPUs are really complicated, so source-level optimization is often counterproductive. (I once did this in a loop that was taking a whole lot of time, and found that tightening up the loop, even while introducing some indirection, improved performance. Your mileage is almost certainly going to vary.)
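
For what it's worth, here is a minimal sketch of that kind of measurement, assuming two hypothetical versions named rotate_plain and rotate_unrolled and a crude clock()-based timer; the names and signature are placeholders, not anything from the question:

#include <stdio.h>
#include <time.h>

#define REPS 100   /* repeat enough times to get a measurable interval */

/* Time one version of the routine; fn is whichever implementation
 * you are comparing (rotate_plain, rotate_unrolled, ... are hypothetical). */
double time_version(void (*fn)(int, const int *, int *),
                    int dim, const int *src, int *dst)
{
    clock_t start = clock();
    for (int r = 0; r < REPS; r++)
        fn(dim, src, dst);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Usage (names are placeholders):
 *   printf("plain:    %.3f s\n", time_version(rotate_plain,    dim, src, dst));
 *   printf("unrolled: %.3f s\n", time_version(rotate_unrolled, dim, src, dst));
 */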

Finally, if Duff's Device is indeed faster, you have to decide whether the performance improvement is worth taking this straightforward and optimizable code and substituting a maintenance problem that may not improve performance at all in the next compiler version.


You should never unroll loops by hand. At best it gives you a very platform-specific advantage, if any. All good compilers can unroll loops, and it's not even guaranteed to make the code faster, because the longer unrolled code takes up more instruction-cache space and memory bandwidth.

If you want the loop to run fast, you should make sure that, whatever RIDX computes, dst is accessed sequentially, so that you minimize the number of cache misses. Other than that I can't see how you could make the loop faster.
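
A rough sketch of that idea, assuming the common definition RIDX(i, j, n) == (i)*(n) + (j) and an int element type (both assumptions, since the question doesn't show them): process the image in small square blocks so that both src and dst stay in cache, instead of letting one of them be written with a large stride across the whole array.

#define RIDX(i, j, n) ((i) * (n) + (j))   /* assumed definition */
#define BSIZE 16                          /* block size; tune by measurement */

void rotate_blocked(int dim, const int *src, int *dst)
{
    for (int ii = 0; ii < dim; ii += BSIZE)
        for (int jj = 0; jj < dim; jj += BSIZE)
            for (int i = ii; i < ii + BSIZE && i < dim; i++)
                for (int j = jj; j < jj + BSIZE && j < dim; j++)
                    dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}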


Duff's Device is simply a technique for loop unrolling. And since any loop can be unrolled, you can use Duff's Device.
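
For illustration only, this is roughly what the inner loop looks like with a textbook Duff's Device unrolled eight times. It assumes RIDX(i, j, n) == (i)*(n) + (j) and an int element type, neither of which the question confirms, and it relies on dim being greater than zero whenever the outer loop body runs:

#define RIDX(i, j, n) ((i) * (n) + (j))   /* assumed definition */

void rotate_duff(int dim, const int *src, int *dst)
{
    for (int i = 0; i < dim; i++) {
        int j = 0;
        int n = (dim + 7) / 8;               /* number of passes through the switch */
        switch (dim % 8) {                   /* jump into the middle of the unrolled body */
        case 0: do { dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; j++;
        case 7:      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; j++;
        case 6:      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; j++;
        case 5:      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; j++;
        case 4:      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; j++;
        case 3:      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; j++;
        case 2:      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; j++;
        case 1:      dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)]; j++;
                } while (--n > 0);
        }
    }
}

Whether this actually beats the plain loop is exactly the kind of thing you have to measure; the memory access pattern, not the loop overhead, usually dominates here.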


Even if you were able to figure this out and get a gain, it would be a pittance and would in no way justify the complexity.

You would be better served spending your energies a level up--reconsidering your entire solution. Perhaps rather than copying values you could create a translation array and spend a little more time looking up answers indirectly when you need them (not really a good idea for building images--just trying to give you a different way to look at it).

Or maybe there is some completely different approach--look at your entire problem and try completely throwing away your current approaches and concepts and just see if there is something you haven't considered because you are too tied to this implementation.

Could your graphics card do some of this work?

Rethinking the problem at a high level works a lot more often than you might think.

Edit: Looking at your sample more, it looks like you are taking a block of your image and copying it, pixel for pixel, to another image. If so, there are almost certainly ways to do it that get rid of the macro and copy byte for byte instead, or even use a block-move assembly function and then tweak the edges of the result to match.

Or I may have guessed wrong, but chances are that looking at it on a larger scale than pixel for pixel might help you a lot more than unrolling loops.
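
If it really were a straight sub-rectangle copy (the loop in the question is actually a rotation, so this doesn't apply to it directly), each row is contiguous and a per-row memcpy is about as fast as a hand-written loop gets. A minimal sketch under that assumption, again with the usual RIDX definition assumed:

#include <string.h>

#define RIDX(i, j, n) ((i) * (n) + (j))   /* assumed definition */

/* Straight row-by-row copy of a dim x dim block of ints; only valid
 * when destination rows really are contiguous copies of source rows. */
void copy_block(int dim, const int *src, int *dst)
{
    for (int i = 0; i < dim; i++)
        memcpy(&dst[RIDX(i, 0, dim)], &src[RIDX(i, 0, dim)],
               dim * sizeof dst[0]);
}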


The number of instruction cycles to implement the statement

dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];

will far outweigh the loop overhead, so unrolling the loop will be very little help on a percentage basis.


I believe this is a candidate for Duff's Device, depending on what the RIDX() function does. But I hope you don't expect someone to write the code for you... Also, you might want to format your code properly so it's actually readable.


Probably, as long as dim is a power of 2 or you have fast modulus on your target system. Learned something new today. I independently discovered that construct 5 years back and dropped it into our memCopy() routine. Who knew :)


Pedantically, no. Duff's Device was for writing to a hardware register (thus the target of the copy was always the same address).

You can implement something very much like Duff's Device for a copy like this, but there will be a clarity and maintenance cost. I'd first profile to make sure it's a problem. I'd also look into whether you can simplify the indexing, as that may enable the compiler to do the dirty work of unrolling the loop.
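
One way to simplify the indexing, assuming RIDX(i, j, n) == (i)*(n) + (j) (an assumption; the question doesn't show the macro): hoist the multiplications out of the inner loop and walk plain pointers, which gives the optimizer a much easier loop to unroll or vectorize on its own.

void rotate_ptr(int dim, const int *src, int *dst)
{
    for (int i = 0; i < dim; i++) {
        const int *s = src + i * dim;               /* reads: forward, stride 1    */
        int       *d = dst + (dim - 1) * dim + i;   /* writes: backward, stride dim */
        for (int j = 0; j < dim; j++) {
            *d = s[j];
            d -= dim;
        }
    }
}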


If you use it, make sure you measure it to determine that the improvement is real, significant, and necessary in terms of your performance requirements. I doubt it will be.

For large loops, the remainder dealt with by Duff's Device will be an insignificant proportion of the operation, and for small loops where the remainder is significant you will only see a benefit if you have many such loops (themselves in a loop), because small loops by definition don't take that long! Even then the compiler's optimiser is likely to do as well or better without rendering your code unreadable. It is also possible that applying Duff's Device will prevent the optimiser from applying other, perhaps more effective, optimisations, which is why, if you use it, you need to measure it.

All the time you are likely to save on this (if any) you have probably wasted several times over reading responses to this question.


Duff's device may not be the optimized solution in an unrolled loop.

I had a function that sent a bit to a port, followed by a clock pulse to another port. For each bit, the function did:

  if (bit == 1)
  {
     write to the set port.
  }
  else
  {
     write to the clear port.
  }
  write high clock bit.
  write low clock bit.

This was put into a Duff's device loop, along with bit shifting and bit count incrementing.

I improved the efficiency of the loop by using nibble values instead of bits (a nibble being 4 bits). The switch statement was based on the nibble value. This allowed 4 bits to be processed without any if statements, improving the flow through the instruction cache (pipeline).
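
A rough sketch of that nibble idea, with hypothetical register names (PORT_SET, PORT_CLR, PORT_CLK stand in for whatever the real hardware uses); because the bit values are compile-time constants inside each case, the per-bit if folds away:

#include <stdint.h>

extern volatile uint32_t PORT_SET, PORT_CLR, PORT_CLK;  /* hypothetical registers */

/* Emit one bit whose value is a compile-time constant inside each case,
 * so the branch disappears, then pulse the clock. */
#define EMIT_BIT(b) do { \
        if (b) PORT_SET = 1u; else PORT_CLR = 1u; \
        PORT_CLK = 1u; PORT_CLK = 0u; \
    } while (0)

#define EMIT_NIBBLE(n) do { \
        EMIT_BIT((n) & 8); EMIT_BIT((n) & 4); \
        EMIT_BIT((n) & 2); EMIT_BIT((n) & 1); \
    } while (0)

void send_nibble(uint8_t nibble)
{
    switch (nibble & 0x0F) {         /* 16 straight-line cases, no per-bit branches */
    case 0x0: EMIT_NIBBLE(0x0); break;
    case 0x1: EMIT_NIBBLE(0x1); break;
    /* ... cases 0x2 through 0xE follow the same pattern ... */
    case 0xF: EMIT_NIBBLE(0xF); break;
    }
}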

There are times when Duff's device may not be the optimal solution; but can be the foundation for a more efficient solution.


Modern compilers already do loop unrolling for you when optimizations are turned on, which renders Duff's device obsolete. The compiler knows better than you do the optimal level of unrolling for your compilation target, and you don't have to write any extra code to do it. It was a neat hack at the time, but these days Duff's device is just a historical curiosity, not a good programming practice.


In the end, whoever makes the call on optimization, everyone involved needs to be sure it is well documented and written in a style that is as self-documenting as possible, using correctly spelled, meaningful names for variables, functions, etc., so it is obvious when the comments and the code get out of sync.

The need for optimization will never end. I was talking with a grad student who had broken malloc()/free() while working on the largest file of genetic data ever attempted in one pass. After a while the heap became too fragmented for malloc to find a contiguous block of RAM to allocate to the calling function. He had to switch to a library whose malloc only issued blocks of memory on 32k boundaries. It took 160% more memory than the old library and ran a good deal slower, but it finished the job.

You must be careful when using Duff's Device and many other optimizations, to be sure the compiler doesn't optimize your optimization into obscure, broken object code. As we move into an environment of automatic parallelizing tools, this will become more of a problem.

I expect that the lower the level of the optimization, the more likely future optimizations are to break the code. I can see that my habit of discarding line feeds in code designed to run on multiple platforms, and putting them back in the print and write functions on each platform, will run into problems with several of the things discussed in this thread.

-gcouger
