
add vs mul (IA32-Assembly)

Source: https://www.devze.com 2023-01-15 20:51 (from the web)
I know that add is faster than the mul instruction. I want to know how to go about using add instead of mul in the following code in order to make it more efficient.

Sample code:

            mov eax, [ebp + 8]              #eax = x1
            mov ecx, [ebp + 12]             #ecx = x2
            mov edx, [ebp + 16]             #edx = y1
            mov ebx, [ebp + 20]             #ebx = y2

            sub eax,ecx                     #eax = x1-x2
            sub edx,ebx                     #edx = y1-y2

            mul edx                         #eax = (x1-x2)*(y1-y2)


add is faster than mul, but if you need to multiply two general values, a single mul is far faster than any loop of repeated add operations.

You can't seriously use add to make that code go faster than it will with mul. If you needed to multiply by some small constant value (such as 2), then maybe you could use add to speed things up. But for the general case - no.


If you are multiplying two values that you don't know in advance, it is effectively impossible to beat the multiply instruction in x86 assembler.

If you know the value of one of the operands in advance, you may be able to beat the multiply instruction by using a small number of adds. This works particularly well when the known operand is small and has only a few set bits in its binary representation. To multiply an unknown value x by a known value consisting of 2^p + 2^q + ... + 2^r, you simply add x*2^p + x*2^q + ... + x*2^r for each set bit p, q, ..., r. This is easily accomplished in assembler by left shifting and adding:

;  x in EDX (destroyed)
;  product to EAX
;  assumes bit positions p > q > r
xor  eax,eax
shl  edx,r   ; edx = x*2^r
add  eax,edx
shl  edx,q-r ; edx = x*2^q
add  eax,edx
shl  edx,p-q ; edx = x*2^p
add  eax,edx

The key problem with this is that it takes at least 4 clocks, assuming a superscalar CPU constrained by register dependencies. Multiply typically takes 10 or fewer clocks on modern CPUs, so if the sequence grows longer than that, you might as well just do a multiply.
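The same shift-and-add decomposition is easy to sketch in C for checking the idea; the function name below is mine, not anything standard:

```c
#include <stdint.h>

/* Multiply x by a known constant c by summing shifted copies of x:
   one add per set bit of c, mirroring the shift/add sequence above. */
uint32_t mul_by_shift_add(uint32_t x, uint32_t c) {
    uint32_t acc = 0;
    for (int bit = 0; c != 0; bit++, c >>= 1) {
        if (c & 1)
            acc += x << bit;   /* add x*2^bit when that bit is set in c */
    }
    return acc;
}
```

The number of adds is the population count of c, which is why the trick only pays off when the known operand has few set bits.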

To multiply by 9:

mov  eax,edx ; eax = x*2^0; replaces xor eax,eax / add eax,edx
shl  edx,3 ; x*2^3
add  eax,edx

This beats multiply; should only take 2 clocks.

What is less well known is the use of the LEA (load effective address) instruction to accomplish a fast multiply-by-small-constant. LEA takes only a single clock worst case, and its execution can often be overlapped with other instructions on superscalar CPUs.

LEA is essentially "add two values with small constant multipliers". It computes t = 2^k*x + y for k = 0, 1, 2, 3 (see the Intel reference manual), with t, x and y being any registers. If x == y, you can get 1, 2, 3, 4, 5, 8 or 9 times x, but using x and y as separate registers allows intermediate results to be combined and moved to other registers (e.g., to t), and this turns out to be remarkably handy. Using it, you can accomplish a multiply by 9 with a single instruction:

lea  eax,[edx*8+edx]  ; takes 1 clock

Using LEA carefully, you can multiply by a variety of peculiar constants in a small number of cycles:

lea  eax,[edx*4+edx] ; 5 * edx
lea  eax,[eax*2+edx] ; 11 * edx
lea  eax,[eax*4] ; 44 * edx

To do this, you have to decompose your constant multiplier into various factors/sums involving 1,2,3,4,5,8 and 9. It is remarkable how many small constants you can do this for, and still only use 3-4 instructions.

If you allow the use of other typically single-clock instructions (e.g., SHL/SUB/NEG/MOV), you can multiply by some constant values that pure LEA can't handle as efficiently by itself. To multiply by 31:

lea  eax,[4*edx]
lea  eax,[8*eax]  ; 32*edx
sub  eax,edx      ; 31*edx ; 3 clocks total

The corresponding LEA sequence is longer:

lea  eax,[edx*4+edx] ; 5*edx
lea  eax,[edx*2+eax] ; 7*edx
lea  eax,[eax*2+edx] ; 15*edx
lea  eax,[eax*2+edx] ; 31*edx ; 4 clocks

Figuring out these sequences is a bit tricky, but you can set up an organized attack.

Since LEA, SHL, SUB, NEG and MOV are all single-clock instructions worst case (and zero clocks if they have no dependencies on other instructions), you can compute the execution cost of any such sequence. This means you can implement a dynamic programming algorithm to generate the best possible sequence of such instructions. This is only useful if the clock count is smaller than that of the integer multiply for your particular CPU (I use 5 clocks as a rule of thumb), and if it doesn't use up all the registers, or at least doesn't use up registers that are already busy (avoiding any spills).
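One way to sketch that organized attack in C is a relaxation over reachable constant multipliers. The model below is a deliberate simplification of what such a compiler pass would do: every op counts as one clock, register pressure is ignored, SHL destroying its source is ignored, and the function name `build_costs` is mine. It finds the fewest modeled ops to reach each constant up to 10000:

```c
#define LIMIT 10000

/* Fill cost[0..LIMIT] with the fewest single-clock ops found to reach
   each constant multiple m of the unknown value x.  Modeled ops: one
   LEA (m*2, m*3, m*4, m*5, m*8, m*9), one SHL (m<<k), or one ADD/SUB
   with the original x (m+1, m-1).  cost[m] == -1 means unreachable. */
static void build_costs(int cost[LIMIT + 1]) {
    for (int i = 0; i <= LIMIT; i++) cost[i] = -1;
    cost[1] = 0;                           /* x itself is free */
    int changed = 1;
    while (changed) {                      /* relax to a fixed point */
        changed = 0;
        for (long m = 1; m <= LIMIT; m++) {
            if (cost[m] < 0) continue;
            long next[40];
            int n = 0;
            next[n++] = m * 2; next[n++] = m * 3; next[n++] = m * 4;
            next[n++] = m * 5; next[n++] = m * 8; next[n++] = m * 9;
            next[n++] = m + 1; next[n++] = m - 1;
            for (int k = 1; (m << k) <= LIMIT; k++)
                next[n++] = m << k;
            for (int i = 0; i < n; i++) {
                long t = next[i];
                if (t >= 1 && t <= LIMIT &&
                    (cost[t] < 0 || cost[t] > cost[m] + 1)) {
                    cost[t] = cost[m] + 1;
                    changed = 1;
                }
            }
        }
    }
}
```

Under this model, 9 costs one op (a single LEA) and 1280 costs two (LEA to 5x, then a shift by 8); a real code generator would additionally track which registers each step consumes.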

I've actually built this into our PARLANSE compiler, and it is very effective for computing offsets into arrays of structures A[i], where the size of the structure elements of A is a known constant. A clever person would cache the answers so a sequence doesn't have to be recomputed each time the same constant multiplier occurs; I didn't actually do that, because the time to generate such sequences is less than you'd expect.

It is mildly interesting to print out the sequences of instructions needed to multiply by all the constants from 1 to 10000. Most of them can be done in 5-6 instructions worst case. As a consequence, the PARLANSE compiler hardly ever uses an actual multiply when indexing even the nastiest arrays of nested structures.


Unless your multiplications are fairly simplistic, the add most likely won't outperform a mul. Having said that, you can use add to do multiplications:

Multiply by 2:
    add eax,eax          ; x2
Multiply by 4:
    add eax,eax          ; x2
    add eax,eax          ; x4
Multiply by 8:
    add eax,eax          ; x2
    add eax,eax          ; x4
    add eax,eax          ; x8

They work nicely for powers of two. I'm not saying they're faster. They were certainly necessary in the days before fancy multiplication instructions. That's from someone whose soul was forged in the hell-fires that were the Mostek 6502, Zilog z80 and RCA1802 :-)

You can even multiply by non-powers by simply storing interim results:

Multiply by 9:
    push ebx              ; preserve
    push eax              ; save for later
    add  eax,eax          ; x2
    add  eax,eax          ; x4
    add  eax,eax          ; x8
    pop  ebx              ; get original eax into ebx
    add  eax,ebx          ; x9
    pop  ebx              ; recover original ebx
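A C mirror of that x9 sequence (register names kept as variable names for the correspondence; this is an illustration, not a suggestion to write C this way):

```c
#include <stdint.h>

/* Mirror of the multiply-by-9 sequence above: save the original
   value, double three times (x8), then add the saved copy (x9). */
uint32_t times9(uint32_t eax) {
    uint32_t saved = eax;   /* push eax: save for later */
    eax += eax;             /* x2 */
    eax += eax;             /* x4 */
    eax += eax;             /* x8 */
    eax += saved;           /* x9 */
    return eax;
}
```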

I generally suggest that you write your code primarily for readability and only worry about performance when you need it. However, if you're working in assembler, you may well already be at that point. But I'm not sure my "solution" is really applicable to your situation, since you have an arbitrary multiplicand.

You should, however, always profile your code in the target environment to ensure that what you're doing is actually faster. Assembler doesn't change that aspect of optimisation at all.


If you really want to see some more general purpose assembler for using add to do multiplication, here's a routine that will take two unsigned values in ax and bx and return the product in ax. It will not handle overflow elegantly.

START:  MOV    AX, 0007    ; Load up registers
        MOV    BX, 0005
        CALL   MULT        ; Call multiply function.
        HLT                ; Stop.

MULT:   PUSH   BX          ; Preserve BX, CX, DX.
        PUSH   CX
        PUSH   DX

        XOR    CX,CX       ; CX is the accumulator.

        CMP    BX, 0       ; If multiplying by zero, just stop.
        JZ     FIN

MORE:   PUSH   BX          ; Xfer BX to DX for bit check.
        POP    DX

        AND    DX, 0001    ; Is lowest bit 1?
        JZ     NOADD       ; No, do not add.
        ADD    CX,AX

NOADD:  SHL    AX,1        ; Shift AX left (double).
        SHR    BX,1        ; Shift BX right (integer halve, next bit).
        JNZ    MORE        ; Keep going until no more bits in BX.

FIN:    PUSH   CX          ; Xfer product from CX to AX.
        POP    AX

        POP    DX          ; Restore registers and return.
        POP    CX
        POP    BX
        RET

It relies on the fact that 123 multiplied by 456 is identical to:

    123 x 6
+  1230 x 5
+ 12300 x 4

which is the same way you were taught multiplication back in grade/primary school. It's easier with binary since you're only ever multiplying by zero or one (in other words, either adding or not adding).
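The whole routine condenses to a few lines of C, with the register names kept as variable names so the correspondence to the assembly above is easy to follow (a sketch of the same shift-and-add algorithm, not a drop-in replacement):

```c
#include <stdint.h>

/* C mirror of the MULT routine above: shift-and-add multiplication.
   ax doubles each round while bx sheds its low bit; cx accumulates.
   Like the 8086 version, overflow wraps mod 2^16. */
uint16_t mult(uint16_t ax, uint16_t bx) {
    uint16_t cx = 0;          /* CX is the accumulator      */
    while (bx != 0) {
        if (bx & 1)           /* is the lowest bit of BX 1? */
            cx += ax;         /* yes: add the current AX    */
        ax <<= 1;             /* SHL AX,1: double           */
        bx >>= 1;             /* SHR BX,1: next bit         */
    }
    return cx;                /* product (mod 2^16)         */
}
```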

It's pretty old-school x86 (8086, from a DEBUG session - I can't believe they still actually include that thing in XP) since that was about the last time I coded directly in assembler. There's something to be said for high level languages :-)


When it comes to assembly instructions, the speed of executing any instruction is measured in clock cycles. A mul instruction always takes more clock cycles than an add, but if you execute add in a loop, the overall clock cycles to do a multiplication using add will be far more than a single mul instruction. You can have a look at the following URLs, which list the clock cycles of a single add/mul instruction; that way you can do the math on which one will be faster.

http://home.comcast.net/~fbui/intel_a.html#add

http://home.comcast.net/~fbui/intel_m.html#mul

My recommendation is to use the mul instruction rather than putting add in a loop; the latter is a very inefficient solution.


I'd have to echo the responses you already have: for a general multiply you're best off using MUL; after all, that's what it's there for!

In some specific cases, where you know you'll be multiplying by a specific fixed value each time (for example, when working out a pixel index in a bitmap), you can consider breaking the multiply down into a (small) handful of SHLs and ADDs, e.g.:

1280 x 1024 display - each line on the display is 1280 pixels.

1280 = 1024 + 256 = 2^10 + 2^8

y * 1280 = y * (2 ^ 10) + y * (2 ^ 8) = ADD (SHL y, 10), (SHL y, 8)

...given that graphics processing is likely to need to be speedy, such an approach may save you precious clock cycles.
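That pixel-offset decomposition looks like this in C (the function name is mine, for illustration):

```c
#include <stdint.h>

/* Row offset on a 1280-wide display: y*1280 as (y<<10) + (y<<8),
   since 1280 = 2^10 + 2^8. */
static inline uint32_t row_offset(uint32_t y) {
    return (y << 10) + (y << 8);
}
```

Any decent optimizing compiler performs this strength reduction automatically nowadays, but in hand-written assembly it is yours to do.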

