I heard that there is a way to optimize an a * 10 operation (in any language) into something like a * 2 * 2 * 2 + a * 2 and get a great benefit, because * 2 translates into a simple binary shift operation and runs much faster than a multiplication instruction.
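For concreteness, here is what that rewrite looks like in C with explicit shifts (a sketch; the function names are mine). Note that a * 2 * 2 * 2 + a * 2 is just a * 8 + a * 2, i.e. (a << 3) + (a << 1):

    #include <stdint.h>

    /* Plain multiplication by 10. */
    uint32_t mul10_plain(uint32_t a) {
        return a * 10;
    }

    /* Shift-and-add: a * 10 == a * 8 + a * 2 == (a << 3) + (a << 1).
       Unsigned arithmetic avoids signed-overflow pitfalls. */
    uint32_t mul10_shift(uint32_t a) {
        return (a << 3) + (a << 1);
    }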
Yes, that's true. However, a good compiler may do this automatically for you when multiplying a variable by a suitable constant (if it's appropriate for the target CPU architecture).
I just tried this with GCC on an Intel target, and -O didn't use the shift-and-add method. I guess the imul instruction is faster. However, I certainly have seen this type of code generated by GCC for an ARM target, where the multiplication instruction is relatively slow.
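If you want to see what your own compiler does, one easy check (a sketch, assuming GCC or Clang on a Unix-like system) is to compile a tiny test function straight to assembly and read the output; what you get will vary with compiler version, optimization level, and target:

    /* mul10.c -- compile with: gcc -O2 -S mul10.c
       then inspect mul10.s for imul, shifts, or lea sequences. */
    int mul10(int a) {
        return a * 10;
    }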
As mentioned, the best way to optimize the code depends on the particular CPU. However, given the deep pipelining, register renaming, and out-of-order execution of modern processors, it also depends very much on the surrounding code and what can be scheduled into the gaps.
Check out this list of latencies and throughputs. On a modern desktop processor (Nehalem or Sandy Bridge), compare four shifts and an add against one multiply: a shift has twice the throughput and a third of the latency of a multiply, but four of them will almost certainly be a loss, even without the add. On other processors, the situation may be different. I say "almost" because, conceivably, the multiplication unit could be busy processing another chain of multiplies in nearby code, leaving the shift and add units free to do the multiply by 10 in parallel.
Always try it out yourself and measure, and of course, only do this when you really have to count clocks. :-)
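A minimal measurement sketch in C (the iteration count and timing method are mine, for illustration; a compiler may still fold or vectorize a toy loop like this, so check the generated assembly before trusting the numbers):

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const int64_t iters = 100000000;    /* 1e8 iterations */
        volatile uint32_t sink = 0;         /* volatile keeps the stores alive */
        uint32_t a = 12345;

        clock_t t0 = clock();
        for (int64_t i = 0; i < iters; i++) {
            sink = a * 10;                  /* plain multiply */
            a++;
        }
        clock_t t1 = clock();
        for (int64_t i = 0; i < iters; i++) {
            sink = (a << 3) + (a << 1);     /* shift-and-add */
            a++;
        }
        clock_t t2 = clock();

        (void)sink;
        printf("mul:   %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
        printf("shift: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
        return 0;
    }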