12:01:54.6808480, Redis Server端向客户端发送了一个Push包,也就是对于查询请求的一个结果返回。后面的包都是在做连接处理,包括Ack包,Ack确认包,以及重置的RST包,紧接着下面一个Push包是在12:01:56.1181350发出的。中间的间隔是1.4372870秒。也就是说,在这1.4372870秒期间,Redis的服务器端,除了做一个查询,其他的操作都是在做建连,或拒绝连接。
客户端报错的前后逻辑已经清楚了,redis-server卡了1.43秒,client的connection pool被打满,疯狂新建连接,server的accept queue满,直接拒绝服务,client报错。开始怀疑client发送了特殊命令,这时需要确认一下client的最后几个命令是什么,找到redis-server卡死前的第一个包,装一个wireshark的redis插件,看到最后几个命令是简单的get,并且key-value都很小,不至于需要耗费1.43秒才能完成。服务端也没有slow log,此时排障再次陷入僵局。
【进一步分析】为了了解这1.43秒之内,Redis Server在做什么事情,我们用pstack来抓取信息。Pstack本质上是gdb attach. 高频率的抓取会影响redis的吞吐。死循环0.5秒一次无脑抓,在redis-server卡死的时候抓到堆栈如下(过滤了没用的栈信息):
Thu May 31 11:29:18 CST 2018 Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)): #0 0x000000000048cec4 in ?? () #1 0x00000000004914a4 in je_arena_ralloc () #2 0x00000000004836a1 in je_realloc () #3 0x0000000000422cc5 in zrealloc () #4 0x00000000004213d7 in sdsRemoveFreeSpace () #5 0x000000000041ef3c in clientsCronResizeQueryBuffer () #6 0x00000000004205de in clientsCron () #7 0x0000000000420784 in serverCron () #8 0x0000000000418542 in aeProcessEvents () #9 0x000000000041873b in aeMain () #10 0x0000000000420fce in main () Thu May 31 11:29:19 CST 2018 Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)): #0 0x0000003729ee5407 in madvise () from /lib64/libc.so.6 #1 0x0000000000493a4e in je_pages_purge () #2 0x000000000048cf70 in ?? () #3 0x00000000004914a4 in je_arena_ralloc () #4 0x00000000004836a1 in je_realloc () #5 0x0000000000422cc5 in zrealloc () #6 0x00000000004213d7 in sdsRemoveFreeSpace () #7 0x000000000041ef3c in clientsCronResizeQueryBuffer () #8 0x00000000004205de in clientsCron () #9 0x0000000000420784 in serverCron () #10 0x0000000000418542 in aeProcessEvents () #11 0x000000000041873b in aeMain () #12 0x0000000000420fce in main () Thu May 31 11:29:19 CST 2018 Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)): #0 0x000000000048108c in je_malloc_usable_size () #1 0x0000000000422be6 in zmalloc () #2 0x00000000004220bc in sdsnewlen () #3 0x000000000042c409 in createStringObject () #4 0x000000000042918e in processMultibulkBuffer () #5 0x0000000000429662 in processInputBuffer () #6 0x0000000000429762 in readQueryFromClient () #7 0x000000000041847c in aeProcessEvents () #8 0x000000000041873b in aeMain () #9 0x0000000000420fce in main () Thu May 31 11:29:20 CST 2018 Thread 1 (Thread 0x7ff2db6de720 (LWP 8378)): #0 0x000000372a60e7cd in write () from /lib64/libpthread.so.0 #1 0x0000000000428833 in sendReplyToClient () #2 0x0000000000418435 in aeProcessEvents () #3 0x000000000041873b in aeMain () #4 0x0000000000420fce in main ()