本来到这里,我要介绍的要介绍了,但是且慢,由于这个东西要在2.6.26后的系统上使用,我们的线上系统大部分是RHEL 5U4, 2.6.18的, 我们如何使用呢?
这时候 systemtap 一如既往的前来救助了!
systemtap 1.4版本以后带了个latencytop.stp, 也是intel的贡献. 那我们试验下穷人家的latencytop.
它在那里呢?
[shell]
$ uname -r
2.6.18-164.el5
$ stap -V
Systemtap translator/driver (version 1.5 /0.137 non-git sources)
Copyright (C) 2005-2011 Red Hat, Inc. and others
This is free software; see the source for copying conditions.
enabled features: AVAHI LIBRPM LIBSQLITE3 NSS BOOST_SHARED_PTR TR1_UNORDERED_MAP NLS
$ ls -al /usr/share/doc/systemtap/examples/profiling/latencytap.stp
-rwxr-xr-x 1 chuba users 16240 Feb 17 22:02/usr/share/doc/systemtap/examples/profiling/latencytap.stp
$ sudo stap -t –all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: Skipped too many probes, check MAXSKIPPED or try again with stap -t for more details.
WARNING: Number of errors: 0, skipped probes: 101
WARNING: Skipped due to global ‘dequeue’ lock timeout: 2
WARNING: Skipped due to global ‘this_sleep’ lock timeout: 99
—– probe hit report:
kernel.trace(“deactivate_task”)!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:47:1), hits: 254, cycles: 680min/43327avg/2248467max, from: kernel.trace(“deactivate_task”)
kernel.trace(“activate_task”)!, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:58:1), hits: 255, cycles: 890min/502549avg/2271568max, from: kernel.trace(“activate_task”)
kernel.function(“finish_task_switch@kernel/sched.c:1969″)?, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:78:7), hits: 509, cycles: 213min/1002207avg/5382852max, from: kernel.function(“finish_task_switch”) from: scheduler.cpu_on
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:123:1), hits: 1, cycles: 1802min/1802avg/1802max, from: begin
begin, (/usr/share/doc/systemtap/examples/profiling/latencytap.stp:131:1), hits: 1, cycles: 227979min/227979avg/227979max, from: begin
Pass 5: run failed. Try again with another ‘–vp 00001′ option.
[/shell]
出错了! 原因是lock timeout, 原来stap的全局变量是用锁保护的,现在超时了!知道原因好办,打个patch吧!
[shell]
$ diff -up translate.cxx.orig translate.cxx
— translate.cxx.orig 2011-03-22 21:26:52.000000000 +0800
+++ /translate.cxx 2011-03-29 20:31:28.000000000 +0800
@@ -5802,10 +5802,10 @@ translate_pass (systemtap_session& s)
s.op->newline() << "#define MAXACTION_INTERRUPTIBLE (MAXACTION * 10)";
s.op->newline() << "#endif";
s.op->newline() << "#ifndef TRYLOCKDELAY";
- s.op->newline() << "#define TRYLOCKDELAY 10 /* microseconds */";
+ s.op->newline() << "#define TRYLOCKDELAY 50 /* microseconds */";
s.op->newline() << "#endif";
s.op->newline() << "#ifndef MAXTRYLOCK";
- s.op->newline() << "#define MAXTRYLOCK 100 /* 1 millisecond total */";
+ s.op->newline() << "#define MAXTRYLOCK 500 /* 1 millisecond total */";
s.op->newline() << "#endif";
s.op->newline() << "#ifndef MAXMAPENTRIES";
s.op->newline() << "#define MAXMAPENTRIES 2048";
#编译安装后再来一次
$ sudo stap --all-modules /usr/share/doc/systemtap/examples/profiling/latencytap.stp
ERROR: probe overhead exceeded threshold
WARNING: Number of errors: 1, skipped probes: 0
Pass 5: run failed. Try again with another '--vp 00001' option.
#又错了,这次原因是probe overhead exceeded threshold, 看下代码我们知道,脚本的开销太大了,超过正常的负载,通过查看代码可以用STP_NO_OVERLOAD来解除这个限制
#再来一次
$ sudo stap -DSTP_NO_OVERLOAD --all-modules -DMAXSKIPPED=1024 /usr/share/doc/systemtap/examples/profiling/latencytap.stp
Reason Count Average(us) Maximum(us) Percent%
Userspace lock contention 345 16409195 83258717 45%
1453 867513 60231852 10%
95 7391754 33821926 5%
migration() kernel thread 1733 402701 3571412 5%
7239 87993 401552 5%
Reading from a pipe 212 2922207 52151180 4%
142 2267850 17990214 2%
108 2457247 7494331 2%
Waking ksoftirqd 16 16082822 59266312 2%
Waiting for event (select) 99 2113310 28510974 1%
kjournald() kernel thread 148 1313447 13983084 1%
Application requested delay 94 1059898 10011409 0%
41 2391993 7618788 0%
Waiting for event (select) 38 2259444 29057362 0%
719 92947 584944 0%
Waiting for event (poll) 1 57582711 57582711 0%
Application requested delay 3 19030709 36000553 0%
Waiting for event (select) 39 1341880 5847683 0%
34 936628 6649350 0%
5 6163603 10008484 0%
...
[/shell]
这次看到结果了,哈哈,小高兴一把。但是在繁忙的系统上这个脚本的资源占用特别多,也是不爽的。 幸运的是这个脚本支持查看某个进程的延迟情况, 就是在 latencytap.stp 后面加个-x 参数。