CPU Cycle count based profiling in C/C++ Linux x86_64_问答_开发者

I am using the following code to profile my operations to optimize on cpu cycles taken in my functions.

static __inline__ unsigned long GetCC(void)
开发者_如何学Python{
  unsigned a, d; 
  asm volatile("rdtsc" : "=a" (a), "=d" (d)); 
  return ((unsigned long)a) | (((unsigned long)d) << 32); 
}

I don't think it is the best since even two consecutive calls gives me a difference of "33". Any suggestions ?

I personally think the rdtsc instruction is great and usable for a variety of tasks. I do not think that using cpuid is necessary to prepare for rdtsc. Here is how I reason around rdtsc:

Since I use the Watcom compiler I've implemented rdtsc using "#pragma aux" which means that the C compiler will generate the instruction inline, expect the result in edx:eax and also inform its optimizer that the contents of eax and edx have been modified. This is a huge improvement from traditional _asm implementations where the optimizer would stay away from optimizing in _asm's vicinity. I've also implemented a divide_U8_by_U4 using "#pragma aux" so that I won't need to call a lib function when I convert clock_cycles to us or ms.
Every execution of rdtsc will result in some overhead (A LOT more if it is encapsulated as in the author's example) which must be taken more into account the shorter the sequence to measure is. Generally I don't time shorter sequences than 1/30 of the internal clock frequency which usually works out to 1/10^8 seconds (3 GHZ internal clock). I use such measurements as indications, not fact. Knowing this I can leave out cpuid. The more times I measure, the closer to fact I will get.
To measure reliably I would use the 1/100 - 1/300 range i/e 0.03 - 0.1 us. In this range the additional accuracy of using cpuid is practically insignificant. I use this range for short sequence timing. This is my "non-standard" unit since it is dependent on the CPU's internal clock frequency. For example on a 1 GHz machine I would not use 0.03 us because that would put me outside the 1/100 limit and my readings would become indications. Here I would use 0.1 us as the shortest time measurement unit. 1/300 would not be used since it would be too close to 1 us (see below) to make any significant difference.
For even longer processing sequences I divide the difference between two rdtsc reading with say 3000 (for 3 GHz) and will convert the elapsed clock cycles to us. Actually I use (diff+1500)/3000 where 1500 is half of 3000. For I/O waits I use milliseconds => (diff+1500000)/3000000. These are my "standard" units. I very seldom use seconds.
Sometimes I get unexpectedly slow results and then I must ask myself: is this due to an interrupt or to the code? I measure a few more times to see if it was, indeed, an interrupt. In that case ... well interrupts happen all the time in the real world. If my sequence is short then there is a good possibility that the next measurement won't be interrupted. If the sequence is longer interrupts will occur more often and there isn't much I can do about it.
Measuring long elapsed times very accurately (hour and longer ETs in us or lower) will increase the risk of getting a division exception in divide_U8_by_U4, so I think through when to use us and when to use ms.
I also have code for basic statistics. Using this I log min and max values and I can calculate mean and standard deviation. This code is non-trivial so its own ET must be subtracted from the measured ETs.
If the compiler is doing extensive optimizations and your readings are stored in local variables the compiler may determine ("correctly") that the code can be omitted. One way to avoid this is to store the results in public (non-static, non-stack-based) variables.
Programs running in real-world conditions should be measured in real-world conditions, there's no way around that.

As to the question of time stamp counter being accurate I would say that assuming the tsc on different cores are synchronized (which is the norm) there is the problem of CPU throttling during periods of low activity to reduce energy consumption. It is always possible to inhibit the functionality when testing. If you're executing an instruction at 1 GHz or at 10 Mhz on the same processor the elapsed cycle count will be the same even though the former completed in 1% of the time compred to the latter.

Trying to count the cycles of an individual execution of a function is not really the right way to go. The fact that your process can be interrupted at any time, along with delays caused by cache misses and branch mispredictions means that there can be considerable deviation in the number of cycles taken from call to call.

The right way is either:

Count the number of cycles or CPU time (with clock()) taken for a large number of calls to the function, then average them; or
Use a cycle-level emulating profiler like Callgrind / kcachegrind.

By the way, you need to execute a serialising instruction before RDTSC. Typically CPUID is used.

You are on the right track¹, but you need to do two things:

Run cpuid instruction before rdtsc to flush the CPU pipeline (makes measurement more reliable). As far as I recall it clobbers registers from eax to edx.
Measure real time. There is a lot more to execution time, than just CPU cycles (locking contention, context switches and other overhead you don't control). Calibrate TSC ticks with real time. You can do it in a simple loop that takes differences in measurements of, say, gettimeofday (Linux, since you didn't mentioned the platform) calls and rdtsc output. Then you can tell how much time each TSC tick takes. Another consideration is synchronization of TSC across CPUs, because each core may have its own counter. In Linux you can see it in /proc/cpuinfo, your CPU should have a constant_tsc flag. Most newer Intel CPUs I've seen have this flag.

¹I have personally found rdtsc to be more accurate than system calls like gettimeofday() for fine-grained measurements.

Another thing you might need to worry about is if you are running on a multi-core machine the program could be moved to a different core, which will have a different rdtsc counter. You may be able to pin the process to one core via a system call, though.

If I were trying to measure something like this I would probably record the time stamps to an array and then come back and examine this array after the code being benchmarked had completed. When you are examining the data recorded to the array of timestamps you should keep in mind that this array will rely on the CPU cache (and possibly paging if your array is big), but you could prefetch or just keep that in mind as you analyze the data. You should see a very regular time delta between time stamps, but with several spikes and possibly a few dips (probably from getting moved to a different core). The regular time delta is probably your best measurement, since it suggests that no outside events effected those measurements.

That being said, if the code you are benchmarking has irregular memory access patterns or run times or relies on system calls (especially IO related ones) then you will have a difficult time separating the noise from the data you are interested in.

The TSC isn't a good measure of time. The only guarantee that the CPU makes about the TSC is that it rises monotonically (that is, if you RDTSC once and then do it again, the second one will return a result that is higher than the first) and that it will take it a very long time to wraparound.

Do I understand correctly that the reason you do this is to bracket other code with it so you can measure how long the other code takes?

I'm sure you know another good way to do that is just loop the other code 10^6 times, stopwatch it, and call it microseconds.

Once you've measured the other code, am I correct to assume you want to know which lines in it are worth optimizing, so as to reduce the time it takes?

If so, you're on well-trod ground. You could use a tool like Zoom or LTProf. Here's my favorite method.

Linux perf_event_open system call with config = PERF_COUNT_HW_CPU_CYCLES

This Linux system call appears to be a cross architecture wrapper for performance events.

This answer is basically the same as the one for this C++ question: How to get the CPU cycle count in x86_64 from C++? see that answer for more details.

perf_event_open.c

#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_CPU_CYCLES;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("%lld\n", count);

    close(fd);
}