As I remember, with gcc for the Pentium it was possible to get a detailed dump of the compilation process, in which gcc shows how it plans (schedules) assembler instructions for the U and V pipelines and how many ticks (CPU clocks) each instruction will take.
Can you tell me which versions of gcc can produce such dumps and which option turns them on?
E.g. for Core2 there is a core2.md with decoders and execution ports defined, and latencies for every instruction. I want to see how gcc uses this and what decisions it makes during instruction scheduling.
In other words, for an example program:
int main() {
int i; int j=0;
for(i=0;i<1000000;i++)
j+=i^((i+5)&(i>>2)&(i>>5) + (i>>2)&(i>>5))-(i+5);
return j%250;
}
how can I see how many ticks gcc plans for each iteration?
I'm not sure exactly what you mean, but -fsched-verbose=n (try with n=6) dumps some scheduling information which looks like what you're after.
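The .sched2 dump (possible to get via -fsched-verbose=1, -fdump-rtl-all or -fdump-rtl-all-all) contains the needed information. Below is an example from gcc 4.6.0 with -Ofast -march=native -mtune=native.
Note the right part: c2_decoder# is the Core2 decoder planned for the instruction; c2_p# are the numbers of the execution ports that can be used. The number on the left, before the arrow, is the clock cycle in which the instruction is scheduled.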
;; ======================================================
;; -- basic block 4 from 65 to 79 -- after reload
;; ======================================================
;; 0--> 78 {sp=bp+0x4;bp=[bp];clobber [scratc:c2_decoder0,(c2_p2+(c2_p0|c2_p1)),c2_p0|c2_p1
;; 0--> 65 xmm0=xmm4 :c2_decodern,c2_p0|c2_p1|c2_p5
;; 0--> 36 dx=0x10624dd3 :c2_decodern,c2_p0|c2_p1|c2_p5
;; 1--> 31 xmm0=xmm0 0>>0x40 :c2_decodern,c2_p1
;; 2--> 32 xmm4=xmm4+xmm0 :c2_decodern,c2_p0|c2_p5
;; 3--> 67 xmm0=xmm4 :c2_decodern,c2_p0|c2_p1|c2_p5
;; 4--> 33 xmm0=xmm0 0>>0x20 :c2_decodern,c2_p1
;; 5--> 34 xmm4=xmm4+xmm0 :c2_decodern,c2_p0|c2_p5
;; 6--> 72 cx=xmm4 :c2_decodern,c2_p0|c2_p1|c2_p5
;; 7--> 69 ax=cx :c2_decodern,c2_p0|c2_p1|c2_p5
;; 8--> 37 {dx=trn(sxn(ax)*sxn(dx) 0>>0x20);c:c2_decodern,c2_p1
;; 8--> 70 ax=cx :c2_decodern,c2_p0|c2_p1|c2_p5
;; 9--> 39 {ax=ax>>0x1f;clobber flags;} :c2_decodern,c2_p0|c2_p5
;; 11--> 38 {dx=dx>>0x4;clobber flags;} :c2_decodern,c2_p0|c2_p5
;; 12--> 40 {dx=dx-ax;clobber flags;} :c2_decodern,c2_p0|c2_p1|c2_p5
;; 13--> 41 {dx=dx*0xfa;clobber flags;} :c2_decodern,c2_p1
;; 16--> 42 {cx=cx-dx;clobber flags;} :c2_decodern,c2_p0|c2_p1|c2_p5
;; 17--> 47 ax=cx :c2_decodern,c2_p0|c2_p1|c2_p5
;; 17--> 50 use ax :nothing
;; 18--> 79 return :c2_decoder0
;; Ready list (final):
;; total time = 18
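To turn such a dump into per-cycle numbers, the lines can be picked apart with a small script. The following is only a rough sketch in Python: the line layout (cycle, arrow, insn number, RTL text, colon, reservation) is assumed from the dump shown above and may differ between gcc versions, and the names parse_sched_dump and insns_per_cycle are made up for this example.

```python
import re
from collections import defaultdict

# Assumed layout of one scheduled insn, taken from the dump above:
#   ";;  <cycle>--> <insn number> <rtl text> :<reservation>"
LINE_RE = re.compile(r"^;;\s*(\d+)-->\s*(\d+)\s+(.*?)\s*:([^:]*)$")
TOTAL_RE = re.compile(r"^;;\s*total time\s*=\s*(\d+)")

def parse_sched_dump(text):
    """Return (rows, total_time); rows are (cycle, insn, rtl, units) tuples."""
    rows, total = [], None
    for raw in text.splitlines():
        line = raw.strip()
        m = LINE_RE.match(line)
        if m:
            cycle, insn, rtl, units = m.groups()
            rows.append((int(cycle), int(insn), rtl, units.strip()))
            continue
        t = TOTAL_RE.match(line)
        if t:
            total = int(t.group(1))
    return rows, total

def insns_per_cycle(rows):
    """Group insn numbers by the cycle in which they were scheduled."""
    per_cycle = defaultdict(list)
    for cycle, insn, _rtl, _units in rows:
        per_cycle[cycle].append(insn)
    return dict(per_cycle)

# A few lines copied from the basic block 4 dump above.
SAMPLE = """\
;; 12--> 40 {dx=dx-ax;clobber flags;} :c2_decodern,c2_p0|c2_p1|c2_p5
;; 13--> 41 {dx=dx*0xfa;clobber flags;} :c2_decodern,c2_p1
;; 16--> 42 {cx=cx-dx;clobber flags;} :c2_decodern,c2_p0|c2_p1|c2_p5
;; 17--> 47 ax=cx :c2_decodern,c2_p0|c2_p1|c2_p5
;; 17--> 50 use ax :nothing
;; 18--> 79 return :c2_decoder0
;; total time = 18
"""

if __name__ == "__main__":
    rows, total = parse_sched_dump(SAMPLE)
    for cycle, insns in sorted(insns_per_cycle(rows).items()):
        print(cycle, insns)
    print("total time:", total)
```

Cycles with no entry (14 and 15 in the sample) are ones where the scheduler issued nothing, which here follows the multiply in insn 41.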