Is memcpy accelerated in some way on the iPhone?

https://www.devze.com 2023-02-28 16:19 (source: web)
A few days ago I was writing some code and noticed that copying RAM with memcpy was much, much faster than copying it in a for loop.

I have no measurements right now (maybe I'll take some later), but as I remember, the same block of RAM that the for loop copied in about 300 ms or more was copied by memcpy in 20 ms or less.

Is it possible that memcpy is hardware-accelerated?


Well, I can't speak about Apple's compilers, but gcc definitely treats memcpy as a builtin.


The built-in implementation of memcpy tends to be optimized pretty heavily for the platform in question, so it will usually be faster than a naive for loop.

Some optimizations include copying as much as possible at a time (not single bytes but rather whole words, or if the processor in question supports it, even more), some degree of loop unrolling, etc. Of course the best course of optimization depends on the platform, so it's usually best to stick to the built-in function.

In most cases it's written by far more experienced people than the typical user anyway.
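The word-at-a-time idea described above can be sketched in a few lines. This is only an illustration of the technique, not a real memcpy: production implementations also deal with alignment, cache behavior, and SIMD paths. The inner memcpy calls on a fixed-size temporary are a standard portable idiom that compilers lower to single word loads and stores:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Sketch of one classic memcpy optimization: copy register-sized
 * chunks, then mop up the remaining bytes one at a time. */
void *word_copy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* 8 bytes per iteration instead of 1. */
    while (n >= sizeof(uint64_t)) {
        uint64_t tmp;
        memcpy(&tmp, s, sizeof tmp);   /* compiles to one 64-bit load  */
        memcpy(d, &tmp, sizeof tmp);   /* ...and one 64-bit store      */
        s += sizeof tmp;
        d += sizeof tmp;
        n -= sizeof tmp;
    }
    while (n--)                        /* trailing 0..7 bytes */
        *d++ = *s++;
    return dst;
}
```

The fixed-size memcpy-into-a-temporary trick avoids the undefined behavior of casting an unaligned `char *` to `uint64_t *` directly.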


Sometimes mem-to-mem DMA is implemented in processors, so yes, if such a thing exists on the iPhone, then it's likely that memcpy() takes advantage of it. Even if it weren't implemented, I'm not surprised by the 15-to-1 advantage that memcpy() seems to have over your character-by-character copy.

Moral 1: always prefer memcpy() to strcpy() if possible.
Moral 2: always prefer memmove() to memcpy(); always.


The newest iPhone's ARM chip has SIMD instructions that allow four operations at the same time. This includes moving memory around.

Also, if you create a highly optimized memcpy, you'd typically unroll the loop to some degree and implement it as a Duff's device.
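For readers unfamiliar with it, a Duff's device interleaves a switch with an unrolled loop so that the switch jumps into the middle of the first iteration to handle the leftover bytes. A minimal 8-way version, shown only to illustrate the technique (modern compilers generally unroll a plain loop at least as well):

```c
#include <stddef.h>

/* 8-way unrolled byte copy using Duff's device: the switch dispatches
 * on n % 8 to handle the remainder, then the do/while copies full
 * groups of 8. */
void duff_copy(char *dst, const char *src, size_t n)
{
    if (n == 0)
        return;
    size_t rounds = (n + 7) / 8;   /* total loop iterations, rounded up */
    switch (n % 8) {
    case 0: do { *dst++ = *src++;
    case 7:      *dst++ = *src++;
    case 6:      *dst++ = *src++;
    case 5:      *dst++ = *src++;
    case 4:      *dst++ = *src++;
    case 3:      *dst++ = *src++;
    case 2:      *dst++ = *src++;
    case 1:      *dst++ = *src++;
            } while (--rounds > 0);
    }
}
```

The point of the construct is amortizing loop overhead (the decrement and branch) over eight copies per iteration without needing a separate cleanup loop for the tail.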


It looks like the ARM CPU has instructions that can copy 48 bits per access. I'd bet the lower overhead of doing it in larger chunks is what you're seeing.
