I want to know how the Linux kernel handles something like receiving a TCP packet: in what order are the main TCP functions called? I want to see the interrupt handler (top half), the bottom half, and even the work done by the kernel after the user calls read().
How can I get a function trace from the kernel on a linear time scale?
I want to get a trace of a single packet, not a profile of the kernel while it receives thousands of packets.
The kernel is 2.6.18 or 2.6.23 (the versions supported by my Debian). I can apply some patches to it.
I think the closest tool that can at least partially achieve what you want is the kernel's ftrace. Here is a sample usage:
root@ansis-xeon:/sys/kernel/debug/tracing# cat available_tracers
blk function_graph mmiotrace function sched_switch nop
root@ansis-xeon:/sys/kernel/debug/tracing# echo 1 > ./tracing_on
root@ansis-xeon:/sys/kernel/debug/tracing# echo function_graph > ./current_tracer
root@ansis-xeon:/sys/kernel/debug/tracing# cat trace
 3)   0.379 us    |               __dequeue_entity();
 3)   1.552 us    |             }
 3)   0.300 us    |             hrtick_start_fair();
 3)   2.803 us    |           }
 3)   0.304 us    |           perf_event_task_sched_out();
 3)   0.287 us    |           __phys_addr();
 3)   0.382 us    |           native_load_sp0();
 3)   0.290 us    |           native_load_tls();
 ------------------------------------------
 3)    <idle>-0    =>  ubuntuo-2079
 ------------------------------------------
 3)   0.509 us    |           __math_state_restore();
 3)               |           finish_task_switch() {
 3)   0.337 us    |             perf_event_task_sched_in();
 3)   0.971 us    |           }
 3) ! 100015.0 us |         }
 3)               |         hrtimer_cancel() {
 3)               |           hrtimer_try_to_cancel() {
 3)               |             lock_hrtimer_base() {
 3)   0.327 us    |               _spin_lock_irqsave();
 3)   0.897 us    |             }
 3)   0.305 us    |             _spin_unlock_irqrestore();
 3)   2.185 us    |           }
 3)   2.749 us    |         }
 3) ! 100022.5 us |       }
 3) ! 100023.2 us |     }
 3)   0.704 us    |     fget_light();
 3)   0.522 us    |     pipe_poll();
 3)   0.342 us    |     fput();
 3)   0.476 us    |     fget_light();
 3)   0.467 us    |     pipe_poll();
 3)   0.292 us    |     fput();
 3)   0.394 us    |     fget_light();
 3)               |     inotify_poll() {
 3)               |       mutex_lock() {
 3)   0.285 us    |         _cond_resched();
 3)   1.134 us    |       }
 3)   0.289 us    |       fsnotify_notify_queue_is_empty();
 3)               |       mutex_unlock() {
 3)   2.987 us    |       }
 3)   0.292 us    |     fput();
 3)   0.517 us    |     fget_light();
 3)   0.415 us    |     pipe_poll();
 3)   0.292 us    |     fput();
 3)   0.504 us    |     fget_light();
 3)               |     sock_poll() {
 3)   0.480 us    |       unix_poll();
 3)   4.224 us    |     }
 3)   0.183 us    |     fput();
 3)   0.341 us    |     fget_light();
 3)               |     sock_poll() {
 3)   0.274 us    |       unix_poll();
 3)   0.731 us    |     }
 3)   0.182 us    |     fput();
 3)   0.269 us    |     fget_light();
It is not perfect because it does not print function parameters and misses some static (inlined) functions, but you can get a feel for who is calling whom inside the kernel.
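To get closer to a single-packet trace, you can restrict the graph tracer to the packet-receive path and generate a packet yourself. A sketch, assuming tcp_v4_rcv as the IPv4 TCP entry point (function names vary across kernel versions):

root@ansis-xeon:/sys/kernel/debug/tracing# echo tcp_v4_rcv > ./set_graph_function
root@ansis-xeon:/sys/kernel/debug/tracing# echo > ./trace
root@ansis-xeon:/sys/kernel/debug/tracing# nc -w 1 example.com 80 < /dev/null
root@ansis-xeon:/sys/kernel/debug/tracing# cat trace

The second echo clears the ring buffer, and the nc call produces one short TCP exchange, so the resulting trace covers a handful of packets instead of thousands.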
If this is not enough, then use GDB. But as you might already know, setting up GDB for kernel debugging is not as easy as it is for user-space processes. I prefer to use GDB+QEMU if it is ever needed.
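If you go that route, a session looks roughly like this (paths are assumptions; -s makes QEMU listen for GDB on TCP port 1234, and -S halts the virtual CPU until GDB connects):

qemu-system-x86_64 -kernel arch/x86/boot/bzImage -append "console=ttyS0" -s -S
gdb ./vmlinux
(gdb) target remote :1234
(gdb) break tcp_v4_rcv
(gdb) continue

With a breakpoint on tcp_v4_rcv (or any other function on the receive path) you can then single-step through exactly one packet.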
Happy tracing!
Update: on later Linux distributions I suggest using trace-cmd, a command-line tool that is a "wrapper" around /sys/kernel/debug/tracing. trace-cmd is much more intuitive to use than the raw interface the kernel provides.
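For example, something like this should record just the receive path (again assuming tcp_v4_rcv as the entry point):

trace-cmd record -p function_graph -g tcp_v4_rcv
trace-cmd report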
You want oprofile. It can give you timings for (a selected subset of) your entire system, which means you can trace network activity from the device to the application and back again, through the kernel and all the libraries.
I can't immediately see a way to trace only a single packet at a time. In particular, it may be difficult to get such fine-grained tracing, since the top half of an interrupt handler shouldn't be doing anything blocking (that is deadlock-prone).
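For reference, a typical oprofile session looks something like this (the vmlinux path is an assumption; oprofile needs an uncompressed kernel image with symbols):

opcontrol --vmlinux=/usr/lib/debug/boot/vmlinux-$(uname -r) --start
# ... generate some network traffic ...
opcontrol --stop
opreport --symbols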
Maybe this is overly pedantic, but have you taken a look at the source code? I know from experience that the TCP layer is very well commented, both in recording intent and in referencing the RFCs.
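For orientation while reading, the 2.6-era IPv4 receive path looks roughly like this (a sketch from memory of the 2.6 sources, assuming an e1000-style NAPI driver; verify the exact chain against your tree):

/* Hardware IRQ (top half): acknowledge the NIC and schedule NAPI */
e1000_intr() -> __netif_rx_schedule()

/* NET_RX_SOFTIRQ (bottom half): drain the RX ring and climb the stack */
net_rx_action() -> e1000_clean() -> netif_receive_skb()
    -> ip_rcv() -> ip_rcv_finish() -> ip_local_deliver()
    -> tcp_v4_rcv() -> tcp_v4_do_rcv() -> tcp_rcv_established()
    -> skb queued on sk->sk_receive_queue, sk->sk_data_ready() wakes the reader

/* Process context, when the application calls read()/recv() */
sys_read() -> vfs_read() -> sock_recvmsg() -> tcp_recvmsg()
    -> skb_copy_datagram_iovec()   /* copies the payload to user space */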
I can highly recommend TCP/IP Illustrated, especially Volume 2, The Implementation (website). It's very good and walks through the code line by line. From the description: "Combining 500 illustrations with 15,000 lines of real, working code...". The source code included is from the BSD kernel, but the stacks are extremely similar, and comparing the two is often instructive.
This question is old and probably no longer relevant to the original poster, but a nice trick I used lately, which was helpful to me, is to set the "mark" field of the sk_buff to some value and only printk() if the value matches.
So, for example, if you know where the top-half IRQ handler is (as the question suggests), you can hard-code a check (e.g. on TCP port, source IP, or source MAC address; you get the point) and set the mark to some arbitrary value (e.g. skb->mark = 0x9999).
Then, all the way up the stack, you printk() only if the mark has that value. As long as nothing else changes your mark (which, as far as I could see, is usually the case in typical settings), you'll only see the packets you're interested in.
Since most functions of interest receive the skb, this works for almost everything that could be interesting.
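A minimal sketch of the idea, against a 2.6.23-era tree (on 2.6.18 the field is skb->nfmark and the header accessors differ; the port filter is a made-up example):

#include <linux/types.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <linux/in.h>

#define TRACE_MARK 0x9999

/* Early in the receive path (e.g. at the top of ip_rcv(), where the
 * network header is already valid): tag the packets we care about. */
static void tag_interesting(struct sk_buff *skb)
{
        struct iphdr *iph = ip_hdr(skb);

        if (iph->protocol == IPPROTO_TCP) {
                /* The transport header may not be set yet, so locate it by hand. */
                struct tcphdr *th = (struct tcphdr *)((u8 *)iph + iph->ihl * 4);

                if (ntohs(th->dest) == 5001)    /* hypothetical filter: dst port 5001 */
                        skb->mark = TRACE_MARK;
        }
}

/* Then, anywhere along the path that has the skb: */
if (skb->mark == TRACE_MARK)
        printk(KERN_DEBUG "%s: skb=%p len=%u\n", __func__, skb, skb->len);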