Unsigned 64x64->128 bit integer multiply on 32-bit platforms

In the context of exploratory activity I have started to take a look at integer & fixed-point arithmetic building blocks for 32-bit platforms. My primary target would be ARM32 (specifically armv7), with a side glance to RISC-V32 which I expect to grow in importance in the embedded space. The first sample building block I chose to examine is unsigned 64x64->128 bit integer multiplication. Other questions on this site about this building block do not provide detailed coverage of 32-bit platforms.

Over the past thirty years, I have implemented this and other arithmetic building blocks multiple times, but always in assembly language, for various architectures. However, at this point in time my hope and desire is that these could be programmed in straight ISO-C, without the use of intrinsics. Ideally a single version of the C code would generate good machine code across architectures. I know that the approach of manipulating HLL code to control machine code is generally brittle, but hope that processor architectures and toolchains have matured enough to make this feasible.

Some approaches used in assembly language implementations are not well suited for porting to C. In the exemplary code below I have selected six variants that seemed amenable to an HLL implementation. Besides the generation of partial products, which is common to all variants, the two basic approaches are: (1) Sum the partial products using 64-bit arithmetic, letting the compiler take care of the carry propagation between 32-bit halves. In this case there are multiple choices in which order to sum the partial products. (2) Use 32-bit arithmetic for the summing, simulating the carry flag directly. In this case we have a choice of generating the carry after an addition (a = a + b; carry = a < b;) or before the addition (carry = ~a < b; a = a + b;). Variants 1 through 3 below fall into the former category, variants 5 and 6 fall into the latter.
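
To make the two carry idioms concrete, here is a minimal sketch (illustration only, not one of the variants below) that adds a 32-bit term into a 64-bit quantity held as two 32-bit halves, once deriving the simulated carry after the addition and once before it:

/* illustration only: add a 32-bit addend into {hi,lo}; carry derived AFTER the add */
static void add32_carry_after (uint32_t *hi, uint32_t *lo, uint32_t addend)
{
    uint32_t carry;
    *lo = *lo + addend;
    carry = *lo < addend;            /* the sum wrapped around, so there was a carry-out */
    *hi = *hi + carry;
}

/* illustration only: same operation, but carry derived BEFORE the add overwrites lo */
static void add32_carry_before (uint32_t *hi, uint32_t *lo, uint32_t addend)
{
    uint32_t carry = ~*lo < addend;  /* lo + addend > 0xffffffff  <=>  addend > ~lo */
    *lo = *lo + addend;
    *hi = *hi + carry;
}

The hope in both cases is that the compiler recognizes the pattern and maps it onto the native add / add-with-carry instruction pair.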

At Compiler Explorer, I focused on the toolchains gcc 12.2 and clang 15.0 for the platforms of interest. I compiled with -O3. The general finding is that on average clang generates more efficient code than gcc, and that the differences between the variants (number of instructions and registers used) are more pronounced with clang. While this may be understandable in the case of RISC-V as the newer architecture, it surprised me in the case of armv7 which has been around for well over a dozen years.

Three cases in particular struck me as noteworthy. While I have worked with compiler engineers before and have a reasonable understanding of basic code transformations, phase-ordering issues, etc., the only technique I am aware of that might apply to this code is idiom recognition, and I do not see how this could explain the observations by itself. The first case is variant 3, where clang 15.0 produces extremely tight code comprising just 10 instructions that I don't think can be improved upon:

umul64wide:
        push    {r4, lr}
        umull   r12, r4, r2, r0
        umull   lr, r0, r3, r0
        umaal   lr, r4, r2, r1
        umaal   r0, r4, r3, r1
        ldr     r1, [sp, #8]
        strd    r0, r4, [r1]
        mov     r0, r12
        mov     r1, lr
        pop     {r4, pc}
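
For reference, umaal performs a 32x32-bit multiply and accumulates two 32-bit addends into a 64-bit result, {RdHi:RdLo} = Rn * Rm + RdLo + RdHi, which cannot overflow 64 bits. A rough C model of the operation (for illustration only) is:

/* illustrative model of ARM UMAAL: a * b + lo + hi always fits in 64 bits */
static uint64_t umaal_model (uint32_t lo, uint32_t hi, uint32_t a, uint32_t b)
{
    return (uint64_t)a * b + lo + hi;
}

This is why the four partial products above can be folded with just two umull and two umaal instructions.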

By contrast, gcc generates twice the number of instructions and requires twice the number of registers. I hypothesize that it does not recognize how to use the multiply-accumulate instruction umaal here, but is that the full story? The reverse situation, but not quite as dramatic, occurs in variant 6, where gcc 12.2 produces this sequence of 18 instructions, with low register usage:

umul64wide:
        mov     ip, r0
        push    {r4, r5, lr}
        mov     lr, r1
        umull   r0, r1, r0, r2
        ldr     r4, [sp, #12]
        umull   r5, ip, r3, ip
        adds    r1, r1, r5
        umull   r2, r5, lr, r2
        adc     ip, ip, #0
        umull   lr, r3, lr, r3
        adds    r1, r1, r2
        adc     r2, ip, #0
        adds    r2, r2, r5
        adc     r3, r3, #0
        adds    r2, r2, lr
        adc     r3, r3, #0
        strd    r2, r3, [r4]
        pop     {r4, r5, pc}

The generated code nicely turns the simulated carry propagation into real carry propagation. clang 15.0 uses nine more instructions and five more registers, and I cannot really make out what it is trying to do without spending much more time on analysis. The third observation concerns the differences in the machine code produced for variant 5 vs. variant 6, in particular with clang. These use the same basic algorithm, with one variant computing the simulated carry before the additions and the other after them. In the end I did find that one variant, namely variant 4, seems to be efficient across both toolchains and both architectures. However, before I proceed to other building blocks and face a similar struggle, I would like to inquire:

(1) Are there coding idioms or algorithms I have not considered in the code below that might lead to superior results? (2) Are there specific optimization switches, e.g. a hypothetical -ffrobnicate (see here), not included in -O3 that would help the compilers generate efficient code more consistently for this kind of bit-manipulation scenario? Explanations as to which compiler mechanisms are likely responsible for the observed cases of significant differences in code generation, and how one might influence or work around them, could also be helpful.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define VARIANT         (3)
#define USE_X64_ASM_REF (0)

/* Multiply two unsigned 64-bit operands a and b. Returns least significant 64 
   bits of product as return value, most significant 64 bits of product via h.
*/
uint64_t umul64wide (uint64_t a, uint64_t b, uint64_t *h)
{
    uint32_t a_lo = (uint32_t)a;
    uint32_t a_hi = a >> 32;
    uint32_t b_lo = (uint32_t)b;
    uint32_t b_hi = b >> 32;
    uint64_t p0 = (uint64_t)a_lo * b_lo;
    uint64_t p1 = (uint64_t)a_lo * b_hi;
    uint64_t p2 = (uint64_t)a_hi * b_lo;
    uint64_t p3 = (uint64_t)a_hi * b_hi;
#if VARIANT == 1
    uint32_t c = (uint32_t)(((p0 >> 32) + (uint32_t)p1 + (uint32_t)p2) >> 32);
    *h = p3 + (p1 >> 32) + (p2 >> 32) + c;
    return p0 + ((p1 + p2) << 32);
#elif VARIANT == 2
    uint64_t s = (p0 >> 32) + p1;
    uint64_t t = (uint32_t)s + p2;
    *h = (s >> 32) + (t >> 32) + p3;
    return (uint32_t)p0 + (t << 32);
#elif VARIANT == 3
    *h = (p1 >> 32) + (((p0 >> 32) + (uint32_t)p1 + p2) >> 32) + p3;
    return p0 + ((p1 + p2) << 32);
#elif VARIANT == 4
    uint64_t t = (p0 >> 32) + p1 + (uint32_t)p2;
    *h = (p2 >> 32) + (t >> 32) + p3;
    return (uint32_t)p0 + (t << 32);
#elif VARIANT == 5
    uint32_t r0, r1, r2, r3, r4, r5, r6;
    r0 = (uint32_t)p0;
    r1 = p0 >> 32;    
    r5 = (uint32_t)p1;
    r2 = p1 >> 32;
    r1 = r1 + r5;
    r6 = r1 < r5;
    r2 = r2 + r6;
    r6 = (uint32_t)p2;
    r5 = p2 >> 32;
    r1 = r1 + r6;
    r6 = r1 < r6;
    r2 = r2 + r6;
    r4 = (uint32_t)p3;
    r3 = p3 >> 32;
    r2 = r2 + r5;
    r6 = r2 < r5;
    r3 = r3 + r6;
    r2 = r2 + r4;
    r6 = r2 < r4;
    r3 = r3 + r6;
    *h =   ((uint64_t)r3 << 32) | r2;
    return ((uint64_t)r1 << 32) | r0;
#elif VARIANT == 6
    uint32_t r0, r1, r2, r3, r4, r5, r6;
    r0 = (uint32_t)p0;
    r1 = p0 >> 32;    
    r5 = (uint32_t)p1;
    r2 = p1 >> 32;
    r4 = ~r1;
    r4 = r4 < r5;
    r1 = r1 + r5;
    r2 = r2 + r4;
    r6 = (uint32_t)p2;
    r5 = p2 >> 32;
    r4 = ~r1;
    r4 = r4 < r6;
    r1 = r1 + r6;
    r2 = r2 + r4;
    r4 = (uint32_t)p3;
    r3 = p3 >> 32;
    r6 = ~r2;
    r6 = r6 < r5;
    r2 = r2 + r5;
    r3 = r3 + r6;
    r6 = ~r2;
    r6 = r6 < r4;
    r2 = r2 + r4;
    r3 = r3 + r6;
    *h =   ((uint64_t)r3 << 32) | r2;
    return ((uint64_t)r1 << 32) | r0;
#else
#error unsupported VARIANT
#endif
}

#if defined(__SIZEOF_INT128__)
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
    unsigned __int128 prod = ((unsigned __int128)a) * b;
    *h = (uint64_t)(prod >> 64);
    return (uint64_t)prod;
}
#elif defined(_MSC_VER) && defined(_WIN64)
#include <intrin.h>
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
    *h = __umulh (a, b);
    return a * b;
}
#elif USE_X64_ASM_REF
uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
    uint64_t res_l, res_h;
    __asm__ (
        "movq  %2, %%rax;\n\t"          // rax = a
        "mulq  %3;\n\t"                 // rdx:rax = a * b
        "movq  %%rdx, %0;\n\t"          // res_h = rdx
        "movq  %%rax, %1;\n\t"          // res_l = rax
        : "=rm" (res_h), "=rm"(res_l)
        : "rm"(a), "rm"(b)
        : "%rax", "%rdx");
    *h = res_h;
    return res_l;
}
#else // generic (and slow) reference implementation
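/* The following macros simulate 32-bit adds with carry-in and carry-out,
   using the temporaries t0 and t1:
   ADDcc  : returns a + b and sets cy to the carry-out
   ADDCcc : returns a + b + cy and updates cy to the carry-out
   ADDC   : returns a + b + cy and leaves cy unchanged
*/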
#define ADDCcc(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), cy=t0<cy, t0=t0+t1, t1=t0<t1, cy=cy+t1, t0=t0)

#define ADDcc(a,b,cy,t0,t1) \
  (t0=(b), t1=(a), t0=t0+t1, cy=t0<t1, t0=t0)

#define ADDC(a,b,cy,t0,t1) \
  (t0=(b)+cy, t1=(a), t0+t1)

uint64_t umul64wide_ref (uint64_t a, uint64_t b, uint64_t *h)
{
    uint32_t cy, t0, t1;
    uint32_t a_lo = (uint32_t)a;
    uint32_t a_hi = a >> 32;
    uint32_t b_lo = (uint32_t)b;
    uint32_t b_hi = b >> 32;
    uint64_t p0 = (uint64_t)a_lo * b_lo;
    uint64_t p1 = (uint64_t)a_lo * b_hi;
    uint64_t p2 = (uint64_t)a_hi * b_lo;
    uint64_t p3 = (uint64_t)a_hi * b_hi;
    uint32_t p0_lo = (uint32_t)p0;
    uint32_t p0_hi = p0 >> 32;
    uint32_t p1_lo = (uint32_t)p1;
    uint32_t p1_hi = p1 >> 32;
    uint32_t p2_lo = (uint32_t)p2;
    uint32_t p2_hi = p2 >> 32;
    uint32_t p3_lo = (uint32_t)p3;
    uint32_t p3_hi = p3 >> 32;
    uint32_t r0 = p0_lo;
    uint32_t r1 = ADDcc  (p0_hi, p1_lo, cy, t0, t1);
    uint32_t r2 = ADDCcc (p1_hi, p2_hi, cy, t0, t1);
    uint32_t r3 = ADDC   (p3_hi,     0, cy, t0, t1);
    r1 = ADDcc  (r1, p2_lo, cy, t0, t1);
    r2 = ADDCcc (r2, p3_lo, cy, t0, t1);
    r3 = ADDC   (r3,     0, cy, t0, t1);
    *h =   ((uint64_t)r3 << 32) + r2;
    return ((uint64_t)r1 << 32) + r0;
}
#endif 

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <gmars...@gmail.com>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

int main (void)
{
    uint64_t a, b, res_hi, res_lo, ref_hi, ref_lo, count = 0;

    printf ("Smoke test of umul64wide variant %d\n", VARIANT);
    do {
        a = KISS64;
        b = KISS64;
        ref_lo = umul64wide_ref (a, b, &ref_hi);
        res_lo = umul64wide (a, b, &res_hi);
        if ((res_lo ^ ref_lo) | (res_hi ^ ref_hi)) {
            printf ("!!!! error: a=%016llx b=%016llx res=%016llx_%016llx ref=%016llx_%016llx\n",
                    a, b, res_hi, res_lo, ref_hi, ref_lo);
            return EXIT_FAILURE;
        }
        if (!(count & 0xfffffff)) printf ("\r%llu", count);
        count++;
    } while (count);
    return EXIT_SUCCESS;
}
