
C++: latency when loading a data structure from memory into cache

devze.com (Developer) · https://www.devze.com · 2023-02-12 06:39 · Source: web

I have the following function in C++:

int readData(Class1 *data)
{
    // StartTime
    try {
        char *a1 = data->name1;
        int a2 = data->age1;

        char *b1 = data->name1;
        int b2 = data->age1;
        .
        .
        .
        char *e1 = data->name5;
        int e2 = data->age5;
    }
    catch (...)
    {
        return -1;
    }
    // endTime
    return 0;
}

The timings follow a pattern:

When I call this function the 1st time, it takes 9-10 microseconds to return. When I call it a 2nd time, within 500 milliseconds of the first call, it takes 1-2 microseconds.

When I call it a 3rd time, 2-3 seconds after the 2nd call, it again takes 9-10 microseconds.

Can you please advise why it takes so much time when called after 2-3 seconds? And what is the solution to this problem, so that it always takes 1-2 microseconds?

Note: I have put tags in the code showing from where to where I am measuring time. I am using CPU ticks, so I am sure the time profile is correct.

Thank you,

Ila Agarwal


What happened in between your calls? The most obvious explanation is that the CPU's L1/L2 caches were filled with other data, since your code did nothing for several seconds. When you access the memory locations again, the data has to be loaded from main memory again, which is much slower. C++ has no GC or anything similar, so there is nothing between you and the machine, only the OS and the hardware. You should check how much time it takes when you measure again after lunch, when your code and data have been swapped out to the page file. Then the first call will be over 1000 times slower.

Yours, Alois Kraus


From your description alone, I would not be 100% sure that it is actually cache related. Here are some questions to determine that:

  • How much data are you reading?
  • What is the data layout (i.e. how is the data allocated in memory)?
  • What CPU do you have?
  • What compiler / optimization settings are you using?
  • Are other processes/threads running?
  • How much memory is your application using? Could it even start swapping (using more memory than the available RAM)?
  • Which operating system are you using? On Linux you could use PAPI to read CPU-internal counters that tell you about cache misses etc.
  • What is happening during the 2-3 seconds between the 2nd and 3rd call?

But let's assume for now that this is cache related: initially, the data is in main memory.

The first time the function executes, the CPU has to fetch the data from memory into the L2/L1 data cache.

Now, if you 'quickly' call the function again, the data can be retrieved from the CPU cache, which takes much less time than accessing main memory.

During the two seconds that pass, other code runs, including the operating system. This other code accesses different data in memory, which evicts the data previously stored in the cache.

Therefore the third execution is slower again.

However, 10 µs to fill the cache appears very long to me; I suspect that the memory access pattern is very bad and you are not using the available bandwidth to main memory efficiently. Optimizing code for good cache access is a complex topic. There are many tricks to optimize the memory access pattern; most of them are done by the compiler and the CPU itself. The important points under your control are the data layout in memory, the memory access patterns/loops, and the compiler flags (and choice of compiler). If you provide more information about your code, we might be able to help with that.
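As an illustration of the data-layout point: the same records traversed in the same order can behave very differently depending on how they are allocated. The types below are made up for illustration, not the asker's `Class1`:

```cpp
// Sketch: contiguous layout vs. pointer-chasing. Records stored back-to-back
// in one array let the hardware prefetcher stream cache lines sequentially;
// following scattered heap pointers defeats it.
#include <memory>
#include <vector>

struct Person { int age; char name[28]; };  // 32 bytes: two per 64-byte cache line

// Cache-friendly: one contiguous allocation, sequential access.
long long sumAgesContiguous(const std::vector<Person> &people) {
    long long s = 0;
    for (const Person &p : people) s += p.age;
    return s;
}

// Cache-hostile: each record is a separate heap allocation reached through a
// pointer, so consecutive records may land on distant cache lines.
long long sumAgesScattered(const std::vector<std::unique_ptr<Person>> &people) {
    long long s = 0;
    for (const auto &p : people) s += p->age;
    return s;
}
```

Both functions compute the same sum; on large inputs the contiguous version typically wins simply because it touches memory in a pattern the prefetcher can predict.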


Slightly duplicating the other answers, but what you're seeing is expected:

  • Cache misses. The longer the gap between calls, the more likely your code and/or data has been evicted from the L1, L2 and L3 (if you have one) caches.
  • On top of caching, there are TLB misses (the MMU). There are usually 2-3 levels of indirection to walk here, and that adds huge latency.
  • Then there's DDR as a last resort, and that's huge latency compared to the CPU clock.
  • You might have a power-saving scheme turned on, which means that after idling for 2-3 seconds the CPU runs at a reduced clock for a while after waking up, so you'll run slower for a few time slices.
  • DDR also has aggressive power-saving modes, which incur a further cycle penalty when coming out of idle/sleep.

How to fix this? It's extremely difficult:

  • TLB lock-down. Program the hardware TLB to reserve entries for the code and data you'll hit. This reduces the TLB efficiency for everything else.
  • Cache lock-down. Reserve a way or entries for the code and data. This reduces the effective size of the cache for everything else.
  • Don't use power saving. This is bad for all other use cases.

In other words, don't. You need to design around the worst case rather than trying to make everything the best case. There is absolutely no way you could arrange the above on x86 under a modern OS. You might, with some effort, get it working on ARM under Linux, but it's just not a sensible approach. If you really need microsecond response times, you want a dedicated piece of hardware for it, or a microcontroller (e.g. ARM-based) doing only that task. On a multi-user OS you basically have no guarantees, and I'd be thrilled to even get 10 µs latency. On the other hand, if you're actually doing this on a microcontroller already, then by all means use the TLB/cache lock-down method :)
