Is there an x86 instruction to directly replicate the lower 32 bits of an x86-64 register into the upper 32 bits?
Example: rbx -> 0x0123456789ABCDEF. Resultant rbx -> 0x89ABCDEF89ABCDEF
If I'm remembering my assembly class correctly, you can only address the lower slices of a register directly (al/ah, ax, eax, etc.); there's no register name for just the upper 32 bits. So if you're looking for a single instruction, you're probably out of luck.
If it can be multiple instructions, I'd probably go with a combination of left shift and masking (pardon my pseudocode - it's been a while):
tmp = rbx
#Make sure you're using the version of left shift that zeroes the right bits:
tmp = tmp << 32
rbx = rbx & 0x00000000ffffffff
rbx = rbx | tmp
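In NASM-style x86-64 assembly, that could look something like this (a sketch, using rax as the temporary; the masking step becomes mov ebx, ebx because writing a 32-bit register zero-extends):
mov rax, rbx        ; tmp = rbx
shl rax, 32         ; tmp <<= 32 (SHL shifts in zeros from the right)
mov ebx, ebx        ; rbx &= 0xffffffff (writing EBX zero-extends into RBX)
or  rbx, rax        ; rbx |= tmp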
Hope this helps!
There's a tradeoff between front-end throughput (total uops) vs. latency if you have AVX-512 or BMI2.
The standard way uses pure integer regs. A left shift will leave the low 32 bits zero, and writing a 32-bit register will zero-extend to 64 bits. You can use any other register as the temporary, no advantage to RAX.
mov eax, ebx ; RAX = RBX & 0xFFFFFFFF (writing EAX zero-extends into RAX)
shl rbx, 32
or rbx, rax
Compared to the other answer's pseudocode, the MOV creates the "tmp" copy and does the truncation at the same time. It would be worse to copy and shift in RAX and then separately truncate RBX in place, as sketched below.
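For comparison, that less efficient ordering would look something like this (a sketch: 4 instructions instead of 3):
mov rax, rbx        ; copy
shl rax, 32         ; low half of RBX now in the high half of RAX
mov ebx, ebx        ; extra instruction just to truncate RBX in place
or  rbx, rax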
Throughput cost: 3 uops for the front-end, 2 for the back-end (assuming the mov is eliminated).
Latency cost: 2 cycles: SHL-immediate and OR are single-cycle on all CPUs since P4. MOV either has zero latency (eliminated) or can run in parallel with SHL.
With BMI2, using rorx to copy-and-swap the halves of a 64-bit register, we can get it done in 2 instructions, into a different register only. But one of those instructions is shrd, which is single-uop with 3c latency on Intel Sandybridge-family (with an immediate count), but slower and 6 uops on AMD Zen. RORX is efficient everywhere: single uop, 1c latency.
; Intel SnB 4c latency, 2 uops. AMD Zen: 3c latency, 7 uops
rorx rax, rbx, 32 ; top half of RAX = EBX
shrd rax, rbx, 32 ; shift in another copy of EBX
; RAX = EBX:EBX, RBX = untouched
So on Intel SnB-family, e.g. Skylake, total of 4 cycle latency, 2 uops (front-end and back-end, running on different ports).
On AMD Zen and Zen2, interestingly the latency (uops.info) from operand 1 -> 1 (in this case from RAX input to output) is only 2 cycles. (And only 1 cycle from operand 2 -> 1, but RAX comes from RORX so it's ready after RBX, no way to take advantage of that that I can see.) So total latency only 3 cycles. But the throughput cost is fairly high, 6 uops.
The other 2-uop way requires AVX-512, so no current AMD CPUs can run it at all, rather than just slower like the BMI2 version. The total latency is 6 cycles on Skylake-X. (See "experiment 49" in uops.info's test results for SKX vpbroadcastd latency, where they used this in an unrolled loop to create a loop-carried dependency chain specifically to measure the RBX->RBX latency.)
vpbroadcastd xmm0, ebx ; AVX-512VL. Single-uop on current Intel
vmovq rbx, xmm0 ; AVX1
This appears to have zero advantage over the rorx/shrd version: slower on Intel's current AVX-512 CPUs.
Except on Knights Landing, where shrd r64,r64,imm is very slow (1 uop, 11c throughput and latency), although rorx is 1c. Agner Fog doesn't have timings for KNL's vpbroadcastd/q xmm, r, but even if it's 2 uops this is probably faster.
Without AVX-512, there's no advantage to using XMM registers if the data originally started in a GP integer register (instead of memory) and you need it back there, although it is possible:
; generally slower than the integer shl/or version
movd xmm0, ebx
punpckldq xmm0, xmm0 ; duplicate the low 32 bits
movq rbx, xmm0
On Skylake, a movd xmm, reg / movd reg, xmm round-trip has 4 cycle latency (per https://uops.info/ testing), so this will have a total of 5. It costs 3 uops, but on Intel Haswell / Skylake and similar CPUs, 2 of them need port 5: movq xmm, r64 and the shuffle. Depending on the surrounding code, this could be a throughput bottleneck.
Latency is also worse on some earlier CPUs, notably Bulldozer-family which is fortunately now obsolete. But even on Zen2, the movd/movq round trip has 6 cycle latency, plus another 1 cycle for the shuffle.
If your data started in memory, you could load it with vbroadcastss xmm0, [mem] (AVX1) / vmovq rbx, xmm0. Broadcast-loads are handled purely by the load port in modern Intel and AMD CPUs, for element sizes of 4 bytes or wider.
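For example (a sketch, assuming the dword lives at a hypothetical label src):
vbroadcastss xmm0, [src]   ; AVX1 broadcast-load of a 4-byte element (handled by the load port)
vmovq        rbx, xmm0     ; low 8 bytes = that dword duplicated into both halves of RBX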
If you want to store many copies to memory (like for wmemset), you'll want to use 16-byte stores at least, so you'd pshufd xmm0, xmm0, 0 (SSE2) or vpbroadcastd ymm0, xmm0 (AVX2) to broadcast into a whole vector. If you just need 8 bytes of it as part of the cleanup for that, you can of course use movq [mem], xmm0.
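A sketch of the AVX2 variant (dst is a hypothetical destination label):
vmovd        xmm0, ebx      ; get the 32-bit value into a vector register
vpbroadcastd ymm0, xmm0     ; AVX2: broadcast the dword to all 8 dword lanes of YMM0
vmovdqu      [dst], ymm0    ; each 32-byte store writes 8 copies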
BMI2 shlx is only available in shlx reg, reg, reg form, not with an immediate count. With a 32 in a register, you could use this in a loop to produce the result without destroying the input:
mov edx, 32 ; outside a loop
;inside a loop:
...
shlx rax, rbx, rdx
mov ecx, ebx
or rax, rcx
... ; RAX = EBX:EBX. RBX unmodified.
This has the same 2c latency as the normal SHL version, for the same reasons.