Performance on arrays or matrices

I was talking to a friend today, and he told me something that I don't know is true, so I decided to ask it here.

He has this huge matrix where he has items like this:

 1 2 3 4 .. 1000
1001 ...... 2000
2001 ...... 3000
....

Anyway, he was saying that it is more efficient to traverse it in the order 1, 2, 3, 4, ... in C, because in C the array is stored in memory row by row. He once tested this in code, traversing one of these huge structures column by column and then row by row, and the times were different, one order being clearly faster than the other.

But I was thinking: how could this make a difference?

I mean, it takes the same time to access *(i+1) and *(i+1000) in a contiguous array of memory, right?

Ted


He is right, by locality of reference (specifically, spatial locality). Nearby addresses are prefetched and cached to speed up access. Thus, if you hit memory address p, then p + 1 and other nearby addresses will be prefetched and cached, whereas accessing p + 1000 will likely require another trip across the memory bus.
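To make the two orders concrete, here is a minimal sketch (not your friend's actual code; N = 1000 is an assumed size matching the matrix above):

#define N 1000  /* assumed size, matching the 1000-wide matrix above */

/* Row by row: consecutive iterations touch consecutive addresses,
 * so every byte of each cache line fetched from memory gets used. */
long sum_by_rows(int m[N][N])
{
    long sum = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Column by column: consecutive iterations are N * sizeof(int) bytes
 * apart, so almost every access has to fetch a fresh cache line. */
long sum_by_cols(int m[N][N])
{
    long sum = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}

The two functions read exactly the same elements; only the access pattern differs.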

(By the way, the layout that you describe is row-major order. Some languages (e.g., Fortran) use column-major order.)
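A small sketch of what the two layouts mean for indexing (ROWS and COLS are arbitrary illustrative sizes):

#include <assert.h>

enum { ROWS = 3, COLS = 1000 };

int main(void)
{
    static int m[ROWS][COLS];
    int *flat = &m[0][0];

    m[2][5] = 42;

    /* Row-major (C): m[i][j] sits at flat offset i*COLS + j, so
     * stepping j by 1 moves just sizeof(int) bytes in memory.
     * Column-major (Fortran) would use offset j*ROWS + i instead. */
    assert(flat[2 * COLS + 5] == 42);
    return 0;
}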


Processors assume that if you retrieve a chunk of memory you're likely to need other nearby data soon, and so they prefetch a bit of extra data into the cache. Cache hits are very fast, so code that takes advantage of this will be faster than code that doesn't. This is why it's faster to traverse nearby data than to jump around in memory. (So yes, he's right.)


Yup, it's true. In my experience, you can roughly halve performance by traversing the matrix the wrong way around.

The reason is that most of your hardware is built around assumptions about how it will probably be used, and it exploits this knowledge to achieve better performance. In this case, the relevant principle is "spatial locality": the idea that if you recently needed address N, you'll probably need the contents of address N+1 soon.

Your RAM is designed to transfer memory in bursts for exactly that reason: when you ask for one word of data, it starts loading a larger contiguous chunk, so the rest is already on its way when you ask for it a moment later.

And the CPU itself does the same. Its cache is organised on the same principle: it stores entire cache lines (typically 64 bytes on modern CPUs) rather than individual bytes or words. That way, when you request a single byte, the whole cache line is read into the cache, so that nearby data is already available should you need it.

If you traverse your matrix column by column, each memory access is to an address far from the previous one, so there is no spatial locality. Both the CPU cache and your RAM fail to predict which data to prefetch, and you end up waiting for data to be transferred all the time.
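If you want to reproduce the effect yourself, a rough benchmark sketch along these lines works (N = 4000 and the clock()-based timing are arbitrary choices; any matrix comfortably larger than your last-level cache will do):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4000  /* assumed size: 4000x4000 ints, about 64 MB */

static long sum_by_rows(int (*m)[N])
{
    long s = 0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

static long sum_by_cols(int (*m)[N])
{
    long s = 0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += m[i][j];
    return s;
}

int main(void)
{
    int (*m)[N] = calloc(N, sizeof *m);  /* zero-initialised matrix */
    if (!m) return 1;

    clock_t t0 = clock();
    long a = sum_by_rows(m);
    clock_t t1 = clock();
    long b = sum_by_cols(m);
    clock_t t2 = clock();

    /* Printing the sums keeps the compiler from discarding the loops. */
    printf("row-wise:    %.3f s (sum %ld)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, a);
    printf("column-wise: %.3f s (sum %ld)\n", (double)(t2 - t1) / CLOCKS_PER_SEC, b);

    free(m);
    return 0;
}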


See the Wikipedia article Row-major order.

In the C language, elements are stored row by row. Caching means that when *i is accessed, *(i+1) is likely to be in the cache already, so accessing it is quicker.

The reason for mentioning the C language is that not all languages are like that. Famously, Fortran uses column-major order, so you will sometimes find a row-major matrix called C-contiguous and a column-major one called Fortran-contiguous.


Ulrich Drepper has an excellent paper, "What Every Programmer Should Know About Memory", that covers cache behaviour among other aspects of the memory system of modern computers. It's at http://www.akkadia.org/drepper/cpumemory.pdf.
