Linux内核崩溃问题排查过程总结(2)

Since it’s not realy bug, and the unstable sched clock is working fast and reliable otherwise, We suggested to keep using the sched clock in any case and just to make note in the warning itself.or disables tracing by #echo 0 > /sys/kernel/debug/tracing/tracing_on

Root Cause(根本原因) this case was ftrace involved ftrace due to teh call to ftrace_raw_event_power_end (debugfs is mounted and ftrace loaded in this case), they are to do with problems calculating a time stamp for a ring buffer entry.
Message comes from here and appears to be indicating problems with time stability.
1966 static int
1967 rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
1968  u64 *ts, u64 *delta)
1969 {
1970    struct ring_buffer_event *event;
1971    static int once;
1972  int ret;
1973
1974 if (unlikely(*delta > (1ULL << 59) && !once++)) {
1975    int local_clock_stable = 1;
1976 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
1977  local_clock_stable = sched_clock_stable;
1978 #endif
1979  printk(KERN_WARNING ”Delta way too big! %llu”
1980        “ ts=%llu write stamp = %llu\n%s”,
1981      (unsigned long long)*delta,
1982    (unsigned long long)*ts,
1983    (unsigned long long)cpu_buffer->write_stamp,
1984      local_clock_stable ? ”" :
1985      “If you just came from a suspend/resume,\n”
1986      “please switch to the trace global clock:\n”
1987  ”  echo global > /sys/kernel/debug/tracing/trace_clock\n”);
1988      WARN_ON(1);

This called from rb_reserve_next_event() here.
2122 /*
2123 * Only the first commit can update the timestamp.
2124 * Yes there is a race here. If an interrupt comes in
2125 * just after the conditional and it traces too, then it
2126 * will also check the deltas. More than one timestamp may
2127 * also be made. But only the entry that did the actual
2128 * commit will be something other than zero.
2129 */
2130 if (likely(cpu_buffer->tail_page == cpu_buffer->commit_page &&
2131  rb_page_write(cpu_buffer->tail_page) ==
2132  rb_commit_index(cpu_buffer))) {
2133      u64 diff;
2134
2135      diff = ts - cpu_buffer->write_stamp;
2136
2137      /* make sure this diff is calculated here */
2138    barrier();
2139
2140 /* Did the write stamp get updated already? */

2141if (unlikely(ts < cpu_buffer->write_stamp))
2142  goto get_event;
2143
2144  delta = diff;
2145  if (unlikely(test_time_stamp(delta))) {
2146
2147      commit = rb_add_time_stamp(cpu_buffer, &ts, &delta); <—- HERE
This has to do with time stamping for ring buffer entries.

通过上面的信息可以看出,其实和我去年分析的代码和方式一模一样,只是判断原因方面我不敢确定,毕竟重现不了,只能是推测。

后面又继续分析内核中出现的另一个错误,“BUG: soft lockup – CPU#N stuck for 4278190091s! [qmgr/master:进程号]”,对上面的错误信息我做了一点点处理,CPU#后面的N是对应的一个具体的cpu编号,这个在每一台服务器是不一样的,还有就是最后中括号中的进程和进程号码不同,不过就是qmgr和master。如下表格统计:

更多详情见请继续阅读下一页的精彩内容

Linux Kernel 的详细介绍请点这里
Linux Kernel 的下载地址请点这里

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:http://www.heiqu.com/19060.html