Nginx499原因探查

499为客户端请求超时主动断开后,NGINX日志中记录的状态码.笔者线上遇到一个问题,NGINX日志记录499,request_time记录为1s左右(即客户端的超时时间),php-fpm层记录的实际处理时间为10-20ms,产生了不一致,通过文章中的实验验证一下问题出自哪里.

WRK

安装

环境如下:

1
2
[root@rh etc]# cat /etc/redhat-release
CentOS Linux release 7.5.1804 (Core)

使用

1
2
3
4
5
6
7
8
9
10
11
12
[root@rh etc]# wrk
Usage: wrk <options> <url>
Options:
-c, --connections <N> Connections to keep open //连接数量
-d, --duration <T> Duration of test //测试持续时间
-t, --threads <N> Number of threads to use //并发线程数量

-s, --script <S> Load Lua script file
-H, --header <H> Add header to request
--latency Print latency statistics
--timeout <T> Socket/request timeout //请求超时时间,可以模拟客户端请求超时断开,从而产生499
-v, --version Print version details

php-fpm的配置

1
2
3
4
5
pm = dynamic
pm.max_children = 10
pm.start_servers = 2
pm.min_spare_servers = 2
pm.max_spare_servers = 4

模拟配置,生产上比该值要大.
首先验证一个问题,我们在php代码逻辑中sleep 50s,并发10个请求,根据如上配置,php-fpm是有4个worker还是10个worker

1
2
3
4
[root@rh matrix]# wrk -c 10 -d 30 -t 10 http://localhost/alloc
...
[root@rh etc]# ps -ef |grep php-fpm|wc -l
10

可以看到是10个worker,请求结束后会变为4个.

看起来似乎跟php-fpm的配置没有关系.php-fpm是能够动态增加个数的.

模拟线上

线上情况是客户端1s的超时时间,那么我们可以如下设计实验,在php接口中sleep 900ms,然后设置请求持续时间为1s,看看实际情况:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
[root@rh fpm.d]# wrk -c 20  -t 20 -d 1   --latency http://localhost/alloc
Running 1s test @ http://localhost/alloc
20 threads and 20 connections
Thread Stats Avg Stdev Max +/- Stdev
Latency 908.22ms 1.84ms 910.55ms 75.00%
Req/Sec 0.50 0.58 1.00 100.00%
Latency Distribution
50% 908.84ms
75% 910.55ms
90% 910.55ms
99% 910.55ms
4 requests in 1.10s, 856.00B read
Requests/sec: 3.64
Transfer/sec: 778.77B

可以看到只执行成功4个请求,并不是如预料中所想的10个请求,看看nginx和应用的日志

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 200 5 0.903 0.903 22 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 200 5 0.907 0.907 22 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 200 5 0.906 0.906 22 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 200 5 0.908 0.908 22 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 0.194 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 0.196 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 0.194 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 0.190 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.100 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.100 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.100 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.097 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.097 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.097 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.097 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.096 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.096 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.096 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.096 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.096 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.094 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.094 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.094 - 0 "-" "-" 127.0.0.1 "-" "-"
- - localhost [08/Sep/2019:08:37:33 +0800] "GET /alloc HTTP/1.1" 49 499 0 1.094 - 0 "-" "-" 127.0.0.1 "-" "-"

大量的499 1s钟超时

应用层日志处理时间基本维持在900ms左右.并且应用层与nginx层日志数量都是24条

当然,调大php-fpm启动数量后可以请求的个数相应增加.读者可自行实验.

小结

  • 动态配置看起来很美好,实际总是有时间开销(master统计,fork进程,监控并发).大并发场景建议还是使用static静态配好.

  • 即使php-fpm处理不了所有请求,nginx还是会发送所有建立的连接并发送到php-fpm,占用资源