Which processor architecture is best suited for biginteger arithmetic?

If I were to write assembly code for large integer calculations (e.g. prime factoring, modulo calculations, etc.) with a focus on speed, which architecture would be best suited for this: x86(-64), ARM, PowerPC, MIPS, or others?


If you work with a small number of variable-size numbers, I think POWER6 would suit your needs best (although I haven't worked with that architecture), since it provides high IPC and very high clock frequencies (up to 5 GHz).

If you work with a large number of fixed-size numbers, x86-64 would be the best choice, as it has SIMD operations that work on 64-bit integers, and you can use them to speed up long arithmetic across multiple numbers at once. You will likely need an SSE 4.2-capable CPU (Intel Nehalem/Westmere/Sandy Bridge, or the upcoming AMD Bulldozer), since the 64-bit compare instruction PCMPGTQ was only added in SSE 4.2.
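
As a rough illustration of that compare instruction, here is a minimal C sketch (my own, not from the answer) using the _mm_cmpgt_epi64 intrinsic from <immintrin.h>, which maps to PCMPGTQ; compile with -msse4.2:

#include <stdio.h>
#include <immintrin.h>

int main(void)
{
    /* Two signed 64-bit lanes per register. */
    __m128i a = _mm_set_epi64x(10, -3);   /* lanes: {-3, 10} */
    __m128i b = _mm_set_epi64x(7,  -5);   /* lanes: {-5,  7} */

    /* Lane-wise "a > b" (PCMPGTQ): each lane becomes all-ones or zero. */
    __m128i gt = _mm_cmpgt_epi64(a, b);

    long long lanes[2];
    _mm_storeu_si128((__m128i *)lanes, gt);
    printf("lane0: %lld, lane1: %lld\n", lanes[0], lanes[1]);  /* -1, -1 */
    return 0;
}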

Also, these GMP benchmark results might be interesting to you.


IMO nothing beats x86-64, because no one else cares about higher-precision arithmetic

Many RISC architectures like MIPS, DEC Alpha, or RISC-V don't have a flags register, so you need a separate instruction to compute the carry. Therefore they're bad choices and can be eliminated right away. For example, to do a two-word a += b in MIPS you need:

addu aLow, aLow, bLow     # aLow += bLow
sltu tmp, aLow, bLow      # carry: tmp = (aLow < bLow)
addu aHigh, aHigh, bHigh  # aHigh += bHigh
addu aHigh, aHigh, tmp    # aHigh += carry

With a carry flag you need only two instructions: add aLow, bLow followed by adc aHigh, bHigh.

The MIPS designers could have done it better, but they didn't
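
On x86-64 you can also reach that carry chain from C through the _addcarry_u64 intrinsic (<immintrin.h>, GCC/Clang/MSVC), which compilers typically lower to one ADD followed by ADCs. A minimal sketch of a 256-bit a += b, with limb layout and names of my own choosing:

#include <stdio.h>
#include <immintrin.h>

/* Little-endian limbs: a[0] is the least-significant 64 bits. */
static void add256(unsigned long long a[4], const unsigned long long b[4])
{
    unsigned char c = 0;
    c = _addcarry_u64(c, a[0], b[0], &a[0]);    /* ADD */
    c = _addcarry_u64(c, a[1], b[1], &a[1]);    /* ADC */
    c = _addcarry_u64(c, a[2], b[2], &a[2]);    /* ADC */
    (void)_addcarry_u64(c, a[3], b[3], &a[3]);  /* ADC */
}

int main(void)
{
    unsigned long long a[4] = { ~0ULL, ~0ULL, 0, 0 };   /* 2^128 - 1 */
    unsigned long long b[4] = { 1, 0, 0, 0 };
    add256(a, b);                                       /* -> 2^128 */
    printf("%llx %llx %llx %llx\n", a[3], a[2], a[1], a[0]);  /* 0 1 0 0 */
    return 0;
}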

A higher clock helps, as Marco van de Voort said, but those architectures don't have clocks 50%-100% faster than equivalent x86 parts. The rest of what he said is fairly incorrect. It's important to note that arbitrary-precision math isn't trivially parallelized, therefore:

  • Superscalar execution (more ALUs) doesn't help much because of the dependency chain, which prevents instructions from running in parallel
  • Multiple cores and/or multiple sockets are also relatively useless for the same reason.
  • SIMD also won't generally help, because it's meant for processing multiple separate pieces of data, not a single large integer. There's no direct way to propagate the carry from one limb to the next. See
    • Is it possible to use SSE and SSE2 to make a 128-bit wide integer?
    • Is there hardware support for 128bit integers in modern processors?
    • Is __int128_t arithmetic emulated by GCC, even with SSE?

In short: you really want to calculate the carries in parallel, which is very difficult.
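
To see why, here is a plain C sketch (my own) of a multi-limb addition: each iteration needs the carry produced by the one before it, so the loop body forms one long dependency chain that extra ALUs, cores or plain SIMD lanes cannot shorten:

#include <stddef.h>
#include <stdint.h>

/* r = a + b over n 64-bit limbs (little-endian); returns the final carry. */
uint64_t add_limbs(uint64_t *r, const uint64_t *a, const uint64_t *b, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        uint64_t s  = a[i] + carry;     /* may wrap */
        uint64_t c1 = (s < carry);      /* carry out of a[i] + carry */
        r[i] = s + b[i];
        uint64_t c2 = (r[i] < s);       /* carry out of s + b[i] */
        carry = c1 | c2;                /* the next iteration waits on this */
    }
    return carry;
}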


In the x86 world you've had the carry flag from the very beginning. Later, Intel introduced the ADX extension with the new instructions ADOX and ADCX (alongside MULX from BMI2) to accelerate big-integer arithmetic even further. How they help is explained in Intel's paper New Instructions Supporting Large Integer Arithmetic on Intel Architecture Processors.
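
As a hedged sketch of how those instructions get used (my own code, not Intel's, written with the _mulx_u64 and _addcarryx_u64 intrinsics from <immintrin.h>; compile with -mbmi2 -madx): one row of a schoolbook multiply, acc += x * b, split into two separate carry chains, which is exactly what ADCX (carry flag) and ADOX (overflow flag) let the hardware keep in flight at the same time. It assumes acc[4] == 0 on entry so the result cannot overflow the 5-limb accumulator:

#include <immintrin.h>

void muladd_row(unsigned long long acc[5],
                unsigned long long x,
                const unsigned long long b[4])
{
    unsigned long long lo[4], hi[4];
    for (int i = 0; i < 4; i++)
        lo[i] = _mulx_u64(x, b[i], &hi[i]);     /* full 128-bit products */

    unsigned char cf = 0, of = 0;

    /* Chain 1 (think ADCX): add the low 64 bits of every product. */
    for (int i = 0; i < 4; i++)
        cf = _addcarryx_u64(cf, acc[i], lo[i], &acc[i]);
    cf = _addcarryx_u64(cf, acc[4], 0, &acc[4]);

    /* Chain 2 (think ADOX): add the high 64 bits, shifted up one limb. */
    for (int i = 0; i < 4; i++)
        of = _addcarryx_u64(of, acc[i + 1], hi[i], &acc[i + 1]);
}

Whether a compiler actually emits the two-flag ADCX/ADOX form from these intrinsics varies; hand-written assembly, as in the Intel paper, gives you full control over the interleaving.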

But it's not only ADX that makes x86 fast. As I mentioned before, SIMD doesn't really help in general, but nowadays on x86 things may be different. We have very long vectors (256 bits with AVX2, 512 bits with AVX-512, and maybe longer in the future), so if you use tricks like partial-word arithmetic to delay carry propagation, or arrange the words in unusual ways (like llhhllhhllhhllhh instead of the linear llllllllhhhhhhhh layout of normal big-integer arithmetic), then SIMD may be faster than scalar operations. A small sketch of the partial-word trick follows the list below. For more information you should read

  • Can long integer routines benefit from SSE?
  • practical BigNum AVX/SSE possible?
  • Kogge-Stone Vector Addition
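
Here is a tiny sketch of the partial-word trick mentioned above (radix, limb count and names are my own choices, not a standard API), using AVX2 intrinsics from <immintrin.h>; compile with -mavx2. Each 64-bit lane holds only a 52-bit digit, so lane-wise additions need no immediate carry propagation, and the accumulated carries are flushed later in a single scalar pass:

#include <stdint.h>
#include <immintrin.h>

#define RADIX_BITS 52
#define DIGIT_MASK ((1ULL << RADIX_BITS) - 1)

/* a[0..3] and b[0..3] are 52-bit digits, least significant first. */
void add_digits_avx2(uint64_t r[4], const uint64_t a[4], const uint64_t b[4])
{
    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    /* Digits may now exceed 52 bits, but there are 12 spare bits per lane,
     * so a few thousand additions can be chained before normalising. */
    _mm256_storeu_si256((__m256i *)r, _mm256_add_epi64(va, vb));
}

/* Scalar pass that pushes the accumulated carries back into the digits
 * (the carry out of the top digit is ignored in this sketch). */
void normalize(uint64_t d[4])
{
    uint64_t carry = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t t = d[i] + carry;
        d[i]  = t & DIGIT_MASK;
        carry = t >> RADIX_BITS;
    }
}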

Of course, AVX-512 will only help if you have very large numbers; for just a 512-bit number you may get better results with scalar code.

No other architecture currently has SIMD registers longer than 128 bits, so even if you can use SIMD on them, the cost of calculating the carries would far outweigh the savings from the parallel additions. Again, this is why x86 beats them all.
