Will a larger binary, with parts of the code not executed at a given time, affect use of the level-2 CPU cache?

开发者 https://www.devze.com 2023-02-03 12:49 Source: the web
It appears that CPUs run significantly faster if their L2 is not filled. Will a programmer be better off to code something that will eventually be smaller in binary, even if parts of that code are not executed all the time? Say, parts of the code that are only enabled via a config file.


The truth is somewhat more complex, I'll try to outline it for you.

If you look at the memory hierarchy in a modern PC with a multi-core processor you will find that there are six levels:

  1. The prefetcher, one for every core (no latency)
  2. The L1 cache, one or two (combined, or split code and data; 2*64K on AMD K10) for every core (latency ~3 clock cycles)
  3. The L2 cache, one (512K on AMD K10) for every core (latency ~10)
  4. The L3 cache, one (ncores*1 MB on AMD K10) per processor, shared by all cores (latency ~30)
  5. System RAM, one per system, used by all processors (latency ~100)
  6. Synchronization (or bus lock), one mechanism per system, used by all bus-mastering devices (latency at least 300 cycles, and up to 1 us if an old PCI card uses all 32 clocks available when bus-mastering at 33 MHz - on a 3 GHz processor that means 3000 clock cycles)

Don't take the cycle counts as exact; they're meant to give you a feel for the possible penalties incurred when executing code.

I use synchronization as a memory level because sometimes you need to synchronize memory too and that costs time.

The language you use will have a great impact on performance. A program written in C, C++ or Fortran will be smaller and execute faster than one written in an interpreted or bytecode-based language such as Basic, C# or Java. C and Fortran will also give you better control when organizing your data areas and your program's access to them. Certain features of OO languages (C++, C# and Java), such as encapsulation and use of standard classes, will result in larger generated code.

How code is written also has a great impact on performance - though some uninformed individuals will say that compilers are so good these days that it isn't necessary to write good source code. Great code will mean great performance and Garbage In will always result in Garbage Out.

In the context of your question writing small is usually better for performance than not caring. If you are used to coding efficiently (small/fast code) then you'll do it regardless of whether you're writing seldom- or often-used sequences.

The cache will most likely not hold your entire program (though it might), but rather numerous 32- or 64-byte chunks ("cache lines") fetched from 32- or 64-byte-aligned addresses in your code and data. The more often the information in one of these chunks is accessed, the longer it will keep its cache line. If the core wants a chunk that's not in L1 it will search for it all the way down to RAM if necessary, incurring penalty clock cycles while doing so.

So in general small, tight and inlined code sequences will execute faster because they impact the cache(s) less. Code that makes many calls to other code areas will have a greater impact on the cache, as will code with unoptimized jumps. Divisions are extremely detrimental, but only to the execution of the core in question. Apparently AMD is much better at them than Intel (http://gmplib.org/~tege/x86-timing.pdf).

There is also the issue of data organization. Here, too, it is better to have often-used data residing in a physically small area, so that one cache-line fetch brings in several often-used variables instead of just one per fetch (which is the norm).

When accessing arrays of data or data structures try to make sure that you access them from lower to higher memory addresses. Again, accessing all over the place will have a negative impact on the caches.

Finally there is the technique of giving data prefetch hints to the processor, so that it may direct the caches to begin fetching data as far ahead as possible of when the data will actually be used.

To have a reasonable chance of understanding these things well enough to put them to use at a practical level, you will need to test different constructs and time them, preferably with the rdtsc counter (there is lots of info about it here on Stack Overflow) or by using a profiler.
