PHP实现通过strace定位故障原因的方法

日期：2020-06-23 栏目：程序人生浏览：次

俗话说：不怕贼偷，就怕贼惦记着。在面对故障的时候，我也有类似的感觉：不怕出故障，就怕你不知道故障的原因，故障却隔三差五的找上门来。

十一长假还没结束，服务器却频现高负载，Nginx出现错误日志：

connect() failed (110: Connection timed out) while connecting to upstream
connect() failed (111: Connection refused) while connecting to upstream

看上去是Upstream出了问题，在本例中Upstream就是PHP（版本：5.2.5）。可惜监控不完善，我搞不清楚到底是哪出了问题，无奈之下只好不断重启PHP来缓解故障。

如果每次都手动重启服务无疑是个苦差事，幸运的是可以通过CRON设置每分钟执行：

#/bin/bash LOAD=$(awk '{print $1}' /proc/loadavg) if [ $(echo "$LOAD > 100" | bc) = 1 ]; then /etc/init.d/php-fpm restart fi

可惜这只是一个权宜之计，要想彻底解决就必须找出故障的真正原因是什么。

闲言碎语不要讲，轮到Strace出场了，统计一下各个系统调用的耗时情况：

shell> strace -c -p $(pgrep -n php-cgi) % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 30.53 0.023554 132 179 brk 14.71 0.011350 140 81 mlock 12.70 0.009798 15 658 16 recvfrom 8.96 0.006910 7 927 read 6.61 0.005097 43 119 accept 5.57 0.004294 4 977 poll 3.13 0.002415 7 359 write 2.82 0.002177 7 311 sendto 2.64 0.002033 2 1201 1 stat 2.27 0.001750 1 2312 gettimeofday 2.11 0.001626 1 1428 rt_sigaction 1.55 0.001199 2 730 fstat 1.29 0.000998 10 100 100 connect 1.03 0.000792 4 178 shutdown 1.00 0.000773 2 492 open 0.93 0.000720 1 711 close 0.49 0.000381 2 238 chdir 0.35 0.000271 3 87 select 0.29 0.000224 1 357 setitimer 0.21 0.000159 2 81 munlock 0.17 0.000133 2 88 getsockopt 0.14 0.000110 1 149 lseek 0.14 0.000106 1 121 mmap 0.11 0.000086 1 121 munmap 0.09 0.000072 0 238 rt_sigprocmask 0.08 0.000063 4 17 lstat 0.07 0.000054 0 313 uname 0.00 0.000000 0 15 1 access 0.00 0.000000 0 100 socket 0.00 0.000000 0 101 setsockopt 0.00 0.000000 0 277 fcntl ------ ----------- ----------- --------- --------- ---------------- 100.00 0.077145 13066 118 total

看上去「brk」非常可疑，它竟然耗费了三成的时间，保险起见，单独确认一下：

shell> strace -T -e brk -p $(pgrep -n php-cgi) brk(0x1f18000) = 0x1f18000 <0.024025> brk(0x1f58000) = 0x1f58000 <0.015503> brk(0x1f98000) = 0x1f98000 <0.013037> brk(0x1fd8000) = 0x1fd8000 <0.000056> brk(0x2018000) = 0x2018000 <0.012635>

说明：在Strace中和操作花费时间相关的选项有两个，分别是「-r」和「-T」，它们的差别是「-r」表示相对时间，而「-T」表示绝对时间。简单统计可以用「-r」，但是需要注意的是在多任务背景下，CPU随时可能会被切换出去做别的事情，所以相对时间不一定准确，此时最好使用「-T」，在行尾可以看到操作时间，可以发现确实很慢。

在继续定位故障原因前，我们先通过「man brk」来查询一下它的含义：

brk() sets the end of the data segment to the value specified by end_data_segment, when that value is reasonable, the system does have enough memory and the process does not exceed its max data size (see setrlimit(2)).

简单点说就是内存不够用时通过它来申请新内存（data segment），可是为什么呢？

shell> strace -T -p $(pgrep -n php-cgi) 2>&1 | grep -B 10 brk stat("/path/to/script.php", {...}) = 0 <0.000064> brk(0x1d9a000) = 0x1d9a000 <0.000067> brk(0x1dda000) = 0x1dda000 <0.001134> brk(0x1e1a000) = 0x1e1a000 <0.000065> brk(0x1e5a000) = 0x1e5a000 <0.012396> brk(0x1e9a000) = 0x1e9a000 <0.000092>

通过「grep」我们很方便就能获取相关的上下文，反复运行几次，发现每当请求某些PHP脚本时，就会出现若干条耗时的「brk」，而且这些PHP 脚本有一个共同的特点，就是非常大，甚至有几百K，为何会出现这么大的PHP脚本？实际上是程序员为了避免数据库操作，把非常庞大的数组变量通过「var_export」持久化到PHP文件中，然后在程序中通过「include」来获取相应的变量，因为变量太大，所以PHP不得不频繁执行「brk」，不幸的是在本例的环境中，此操作比较慢，从而导致处理请求的时间过长，加之PHP进程数有限，于是乎在Nginx上造成请求拥堵，最终导致高负载故障。

下面需要验证一下推断似乎否正确，首先查询一下有哪些地方涉及问题脚本：

shell> find /path -name "*.php" | xargs grep "script.php"

转载注明出处：https://www.heiqu.com/a71c6272f4839fd4e4a8654f2966c238.html

PHP实现通过strace定位故障原因的方法

相关推荐