
instruction to copy lower 32 bits of a register to upper 32 bits

Is there an x86 instruction to directly replicate the lower 32 bits of an x86_64 register to the upper 32 bits?


Example: rbx -> 0x0123456789ABCDEF
Resultant rbx -> 0x89ABCDEF89ABCDEF


If I'm remembering my assembly class correctly, only the lowest two bytes in each register are individually addressable (al, ah, bl, bh, etc.). So if you're looking for a single instruction, you're probably out of luck.

If it can be multiple instructions, I'd probably go with a combination of left shift and masking (pardon my pseudocode - it's been a while):

tmp = rbx
#Make sure you're using the version of left shift that zeroes the right bits:
tmp = tmp << 32
rbx = rbx & 0x00000000ffffffff
rbx = rbx | tmp
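
As a sketch, here are the same steps in x86-64 assembly (NASM syntax), using RAX as the temporary. Note that AND can't take a 64-bit immediate mask, so the truncation is done by writing the 32-bit register, which zero-extends:

  mov  rax, rbx        ; tmp = rbx (any spare register works as the temporary)
  shl  rax, 32         ; logical left shift zero-fills the low 32 bits
  mov  ebx, ebx        ; writing EBX zero-extends: same effect as rbx &= 0xffffffff
  or   rbx, rax        ; RBX = low32:low32

(The answer below shows a tighter 3-instruction version of the same idea.)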

Hope this helps!


There's a tradeoff between front-end throughput (total uops) and latency if you have AVX-512 or BMI2.

The standard way uses pure integer regs. A left shift leaves the low 32 bits zero, and writing a 32-bit register zero-extends to 64 bits. You can use any other register as the temporary; there's no advantage to RAX.

  mov  eax, ebx           ; RAX = RBX & 0xFFFFFFFF (writing EAX zero-extends)
  shl  rbx, 32
  or   rbx, rax

Compared to the other answer, the MOV creates the "tmp" copy and does the truncation at the same time. It would be worse to copy and shift in RAX and separately truncate RBX in place.

Throughput cost: 3 uops for the front-end, 2 for the back-end (assuming mov is eliminated).
Latency cost: 2 cycles. SHL-immediate and OR are single-cycle on all CPUs since P4, and MOV either has zero latency (eliminated) or can run in parallel with SHL.


With BMI2 rorx to copy a 64-bit register with its halves swapped, we can get it done in 2 instructions, but only into a different register. One of those instructions is shrd, which with an immediate count is a single uop with 3c latency on Intel Sandybridge-family, but slower and 6 uops on AMD Zen. RORX is efficient everywhere: a single uop with 1c latency.

; Intel SnB 4c latency, 2 uops.  AMD Zen: 3c latency, 7 uops
    rorx   rax, rbx, 32          ; top half of RAX = EBX 
    shrd   rax, rbx, 32          ; shift in another copy of EBX
                         ; RAX = EBX:EBX, RBX = untouched

So on Intel SnB-family, e.g. Skylake, total of 4 cycle latency, 2 uops (front-end and back-end, running on different ports).

On AMD Zen and Zen2, interestingly, the latency (per uops.info) from operand 1 -> 1 (in this case from the RAX input to the output) is only 2 cycles. (It's only 1 cycle from operand 2 -> 1, but RAX comes from RORX so it's ready after RBX; I don't see a way to take advantage of that.) So the total latency is only 3 cycles. But the throughput cost is fairly high: 7 uops for the pair (6 for shrd plus 1 for rorx, as in the comment above).


The other 2-uop way requires AVX-512, so no current AMD CPUs can run it at all, rather than just running it more slowly like the BMI2 version. The total latency is 6 cycles on Skylake-X. (See "experiment 49" in uops.info's test results for SKX vpbroadcastd latency, where they used this sequence in an unrolled loop to create a loop-carried dependency chain, specifically to measure the RBX->RBX latency.)

  vpbroadcastd xmm0, ebx       ; AVX-512VL.  Single-uop on current Intel
  vmovq        rbx, xmm0       ; AVX1

This appears to have no advantage over the rorx/shrd version: it's slower on Intel's current AVX-512 CPUs.

Except on Knights Landing, where shrd r64,r64,imm is very slow (1 uop but 11c throughput and latency, although rorx is 1c). Agner Fog doesn't have timings for KNL's vpbroadcastd/q xmm, r, but even if it's 2 uops this is probably faster.


Without AVX-512, there's no advantage to using XMM registers if the data originally started in a GP integer register (instead of memory) and you need it back there, although it is possible:

; generally slower than the integer shl/or version
movd       xmm0, ebx
punpckldq  xmm0, xmm0     ; duplicate the low 32 bits
movq       rbx, xmm0

On Skylake, a movd xmm, reg / movd reg, xmm round-trip has 4 cycle latency (per https://uops.info/ testing), so this will have a total of 5 cycles. It costs 3 uops, but on Intel Haswell / Skylake and similar CPUs, 2 of them need port 5: the movd xmm, r32 and the shuffle. Depending on the surrounding code, this could be a throughput bottleneck.

Latency is also worse on some earlier CPUs, notably Bulldozer-family which is fortunately now obsolete. But even on Zen2, the movd/movq round trip has 6 cycle latency, plus another 1 cycle for the shuffle.

If your data started in memory, you could load it with vbroadcastss xmm0, [mem] (AVX1) followed by vmovq rbx, xmm0. Broadcast-loads are handled purely by the load port on modern Intel and AMD CPUs, for element sizes of 4 bytes or wider.
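
As a minimal sketch of that load path, assuming the 32-bit value is at an address held in RSI (the register choice is just illustrative, not from the original answer):

  vbroadcastss xmm0, [rsi]     ; AVX1 broadcast-load: all four dword elements = the value at [rsi]
  vmovq        rbx, xmm0       ; RBX = low32:low32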

If you want to store many copies to memory (like for wmemset), you'll want to use 16-byte stores at least, so you'd use pshufd xmm0, xmm0, 0 (SSE2) or vpbroadcastd ymm0, xmm0 (AVX2) to broadcast into a whole vector. If you just need 8 bytes of it as part of the cleanup for that, you can of course use movq [mem], xmm0.
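
As a sketch of the AVX2 broadcast-and-store pattern, assuming a destination pointer in RDI and a count of 32-byte chunks in RCX (hypothetical register assignments, not from the original answer):

  vmovd        xmm0, ebx          ; low dword of XMM0 = EBX
  vpbroadcastd ymm0, xmm0         ; AVX2: all eight dword elements = EBX
store_loop:
  vmovdqu      [rdi], ymm0        ; 32-byte store = eight copies of EBX
  add          rdi, 32
  dec          rcx
  jnz          store_loop
  vmovq        [rdi], xmm0        ; optional 8-byte tail: one EBX:EBX pair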


BMI2 shlx is only available in shlx reg, reg, reg form, not with an immediate count. With a 32 in a register, you could use this in a loop to produce the result without destroying the input.

  mov edx, 32     ; outside a loop

; inside a loop:
...
  shlx rax, rbx, rdx
  mov  ecx, ebx
  or   rax, rcx
...                     ; RAX = EBX:EBX.   RBX unmodified.

This has the same 2c latency as the normal SHL version, for the same reasons.

