The ARM ARM doesn't say much about how the PLD instruction should actually be used, but from seeing it used elsewhere I know that it takes an address as a hint about where the next read will come from.
My question is: given a tight 256-byte copy loop of ldm/stm instructions, say r4-r11 x 8, would it be better to prefetch each cache line before the copy, to prefetch in between each instruction pair, or not to prefetch at all, since the memcpy in question isn't reading and writing the same area of memory? I'm pretty sure my cache line size is 64 bytes, but it may be 32 bytes; I'm awaiting confirmation on that before writing the final code.
From the Cortex-A Series Programmer's Guide, chapter 17.4 (NB: some details might be different for ARM11):
Best performance for memcpy() is achieved using LDM of a whole cache line and then writing these values with an STM of a whole cache line. Alignment of the stores is more important than alignment of the loads. The PLD instruction should be used where possible. There are four PLD slots in the load/store unit. A PLD instruction takes precedence over the automatic pre-fetcher and has no cost in terms of the integer pipeline performance. The exact timing of PLD instructions for best memcpy() can vary slightly between systems, but PLD to an address three cache lines ahead of the currently copying line is a useful starting point.
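As a rough illustration of the "three cache lines ahead" starting point, here is a minimal C sketch rather than tuned assembly. It assumes a 64-byte cache line and a GCC/Clang toolchain, whose `__builtin_prefetch` is normally lowered to PLD on ARM targets. The helper name `copy_lines_with_prefetch` and both constants are made up for this example, and the plain `memcpy` per line merely stands in for the LDM/STM block.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: 64-byte cache lines; confirm this for your core. */
#define CACHE_LINE 64u
/* Starting point from the guide: preload three lines ahead, then tune. */
#define PREFETCH_LINES_AHEAD 3u

/* Hypothetical helper: copies len bytes one cache line at a time.
 * len must be a multiple of CACHE_LINE. */
static void copy_lines_with_prefetch(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(len % CACHE_LINE == 0);

    for (size_t off = 0; off < len; off += CACHE_LINE) {
        size_t ahead = off + PREFETCH_LINES_AHEAD * CACHE_LINE;

        /* Only hint at lines that are still inside the source buffer,
         * so the pointer arithmetic stays in bounds. */
        if (ahead < len)
            __builtin_prefetch(s + ahead, 0 /* read */, 0 /* low locality */);

        /* Stand-in for the LDM/STM of one whole cache line. */
        memcpy(d + off, s + off, CACHE_LINE);
    }
}
```

For a 256-byte copy with 64-byte lines this issues only one preload (for the fourth line), which suggests that for such a small block the preload distance may need to be shorter, or the first lines preloaded before the loop starts. Measure on the target hardware before settling on a value.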
An example of a reasonably generic copy loop that makes use of cacheline-sized LDM/STM blocks and/or PLD where available can be found in the Linux kernel, arch/arm/lib/copy_page.S. That implements what Igor mentions above, regarding the use of preloads, and illustrates the blocking.
Note that on ARMv7 (where the cacheline size is usually 64 bytes) it's not possible to LDM a full cacheline as a single op (there's only 14 regs you could use since SP/PC can't be touched for this). So you might have to use two/four pairs of LDM/STM.
To really get the "fastest" possible ARM asm code, you will need to test different approaches on your system. As far as a ldm/stm loop goes, this one seems to work the best for me:
// Use non-conflicting register r12 to avoid waiting for r6 in pld
pld [r6, #0]
add r12, r6, #32
1:
ldm r6!, {r0, r1, r2, r3, r4, r5, r8, r9}
pld [r12, #32]
stm r10!, {r0, r1, r2, r3, r4, r5, r8, r9}
subs r11, r11, #16
ldm r6!, {r0, r1, r2, r3, r4, r5, r8, r9}
pld [r12, #64]
stm r10!, {r0, r1, r2, r3, r4, r5, r8, r9}
add r12, r6, #32
bne 1b
The block above assumes that you have already set up r6 (source pointer), r10 (destination pointer) and r11 (count), and the loop counts r11 down in terms of words, not bytes. I have tested this on a Cortex-A9 (iPad 2) and it gives quite good results on that processor. But be careful: on a Cortex-A8 (iPhone 4) a NEON loop seems to be faster than ldm/stm, at least for larger copies.
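To make the r11 bookkeeping concrete, here is a small C sketch of the arithmetic. It assumes 4-byte words and relies on the fact that each pass of the loop above moves two 8-register blocks, i.e. 16 words or 64 bytes. The function and macro names are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define WORD_BYTES 4u            /* one ARM register holds 4 bytes */
#define WORDS_PER_ITERATION 16u  /* two LDM/STM pairs of 8 registers each */

/* Hypothetical helper: the value to load into r11 for a copy of `bytes` bytes.
 * The size must be a non-zero multiple of 64 bytes, otherwise the
 * `subs r11, r11, #16` / `bne` pair never lands exactly on zero. */
static size_t loop_word_count(size_t bytes)
{
    assert(bytes > 0);
    assert(bytes % (WORD_BYTES * WORDS_PER_ITERATION) == 0);
    return bytes / WORD_BYTES;
}

/* Number of times the loop body runs for that size. */
static size_t loop_iterations(size_t bytes)
{
    return loop_word_count(bytes) / WORDS_PER_ITERATION;
}
```

For the 256-byte copy from the question, `loop_word_count(256)` is 64, so r11 starts at 64 and the loop body runs 4 times.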