Linux内核崩溃问题排查过程总结(2)

日期：2020-05-31 栏目：程序人生浏览：次

Since it’s not realy bug, and the unstable sched clock is working fast and reliable otherwise, We suggested to keep using the sched clock in any case and just to make note in the warning itself.or disables tracing by #echo 0 > /sys/kernel/debug/tracing/tracing_on

Root Cause（根本原因） this case was ftrace involved ftrace due to teh call to ftrace_raw_event_power_end (debugfs is mounted and ftrace loaded in this case), they are to do with problems calculating a time stamp for a ring buffer entry.
Message comes from here and appears to be indicating problems with time stability.
1966 static int
1967 rb_add_time_stamp(struct ring_buffer_per_cpu *cpu_buffer,
1968 u64 *ts, u64 *delta)
1969 {
1970 struct ring_buffer_event *event;
1971 static int once;
1972 int ret;
1973
1974 if (unlikely(*delta > (1ULL << 59) && !once++)) {
1975 int local_clock_stable = 1;
1976 #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
1977 local_clock_stable = sched_clock_stable;
1978 #endif
1979 printk(KERN_WARNING ”Delta way too big! %llu”
1980 “ ts=%llu write stamp = %llu\n%s”,
1981 (unsigned long long)*delta,
1982 (unsigned long long)*ts,
1983 (unsigned long long)cpu_buffer->write_stamp,
1984 local_clock_stable ? ”" :
1985 “If you just came from a suspend/resume,\n”
1986 “please switch to the trace global clock:\n”
1987 ” echo global > /sys/kernel/debug/tracing/trace_clock\n”);
1988 WARN_ON(1);

This called from rb_reserve_next_event() here.
2122 /*
2123 * Only the first commit can update the timestamp.
2124 * Yes there is a race here. If an interrupt comes in
2125 * just after the conditional and it traces too, then it
2126 * will also check the deltas. More than one timestamp may
2127 * also be made. But only the entry that did the actual
2128 * commit will be something other than zero.
2129 */
2130 if (likely(cpu_buffer->tail_page == cpu_buffer->commit_page &&
2131 rb_page_write(cpu_buffer->tail_page) ==
2132 rb_commit_index(cpu_buffer))) {
2133 u64 diff;
2134
2135 diff = ts - cpu_buffer->write_stamp;
2136
2137 /* make sure this diff is calculated here */
2138 barrier();
2139
2140 /* Did the write stamp get updated already? */

2141if (unlikely(ts < cpu_buffer->write_stamp))
2142 goto get_event;
2143
2144 delta = diff;
2145 if (unlikely(test_time_stamp(delta))) {
2146
2147 commit = rb_add_time_stamp(cpu_buffer, &ts, &delta); <—- HERE
This has to do with time stamping for ring buffer entries.

通过上面的信息可以看出，其实和我去年分析的代码和方式一模一样，只是判断原因方面我不敢确定，毕竟重现不了，只能是推测。

后面又继续分析内核中出现的另一个错误，“BUG: soft lockup – CPU#N stuck for 4278190091s! [qmgr/master:进程号]”，对上面的错误信息我做了一点点处理，CPU#后面的N是对应的一个具体的cpu编号，这个在每一台服务器是不一样的，还有就是最后中括号中的进程和进程号码不同，不过就是qmgr和master。如下表格统计：

更多详情见请继续阅读下一页的精彩内容：

Linux Kernel 的详细介绍：请点这里
Linux Kernel 的下载地址：请点这里

转载注明出处：http://www.heiqu.com/19060.html

Linux内核崩溃问题排查过程总结(2)

相关推荐