Nginx under high concurrency: connect() failed (110: Connection timed out) while connecting to upstream

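Background
While load-testing an application service, Nginx started reporting errors after about 1 minute of sustained requests. I spent some time tracking down the cause and eventually pinpointed it; this post summarizes the process.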

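Load-testing tool
The load test uses siege. It makes it easy to set the number of concurrent users and the test duration, and it reports clear results: successful transactions, failures, throughput, and other performance figures.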

Test parameters
Single endpoint, 100 concurrent users, sustained for 1 minute.

Errors reported by the load-testing tool

The server is now under siege...
[error] socket: unable to connect sock.c:249: Connection timed out
[error] socket: unable to connect sock.c:249: Connection timed out

Errors in the Nginx error.log

2018/11/21 17:31:23 [error] 15622#0: *24993920 connect() failed (110: Connection timed out) while connecting to upstream, client: 192.168.xx.xx, server: xx-qa.xx.com, request: "GET /guide/v1/activities/1107 HTTP/1.1", upstream: "http://192.168.xx.xx:8082/xx/v1/activities/1107", host: "192.168.86.90"

2018/11/21 18:21:09 [error] 4469#0: *25079420 connect() failed (110: Connection timed out) while connecting to upstream, client: 192.168.xx.xx, server: xx-qa.xx.com, request: "GET /guide/v1/activities/1107 HTTP/1.1", upstream: "http://192.168.xx.xx:8082/xx/v1/activities/1107", host: "192.168.86.90"
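
To see when the errors start and how dense they are, the log lines above can be bucketed by minute. The following is a minimal sketch, assuming the error.log format shown above; the function name `count_upstream_timeouts` and the shortened sample lines are illustrative, not part of the original investigation.

```shell
# Count "connect() failed (110: ...)" errors per minute in an Nginx error log.
# Reads log lines on stdin; prints "<date> <HH:MM> <count>" in time order.
count_upstream_timeouts() {
  grep -F 'connect() failed (110: Connection timed out)' \
    | awk '{ print $1, substr($2, 1, 5) }' \
    | sort \
    | uniq -c \
    | awk '{ print $2, $3, $1 }'
}

# Demo on two shortened, hypothetical lines in the same format.
count_upstream_timeouts <<'EOF'
2018/11/21 17:31:23 [error] 15622#0: *1 connect() failed (110: Connection timed out) while connecting to upstream
2018/11/21 17:31:40 [error] 15622#0: *2 connect() failed (110: Connection timed out) while connecting to upstream
EOF
```

Against a real log this would be run as `count_upstream_timeouts < error.log`, with the path adjusted to wherever the Nginx error log lives.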

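Troubleshooting

  1. Seeing "timed out", my first guess was a performance problem in the application service that left it unable to respond under concurrent load. The application logs, however, showed no errors at all.

  2. I watched the application service's CPU load (it runs in a Docker container: docker stats <container id>). CPU usage rose under concurrent requests, with nothing else unusual, which is normal. Continued observation showed that once the load test started failing, the CPU load of the application service dropped and no more requests appeared in its logs. So the failure to respond most likely came from the node in front of the application service in the chain, namely Nginx.

  3. Check the TCP connections on the Nginx host during the load test: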

    # Count connections on port 80 (grep "80" matches any line containing "80", so this is an upper bound)
    netstat -nat|grep -i "80"|wc -l
    5407
    
    # Count current TCP connections by state
    netstat -na | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
    LISTEN 12
    SYN_RECV 1
    ESTABLISHED 454
    FIN_WAIT1 1
    TIME_WAIT 5000
    

Two things stand out in the TCP connections:

  1. There are more than 5000 connections.
  2. The TIME_WAIT count stops growing once it reaches 5000.
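
The per-state count can be narrowed to sockets whose local port is exactly 80, which avoids counting unrelated lines that merely contain "80". This is a sketch assuming the usual `netstat -nat` column layout (local address in column 4, state in column 6); the function name `tcp_states_for_port` and the sample addresses are made up for illustration.

```shell
# Count TCP states for sockets whose local address ends in the given port.
# Reads `netstat -nat` style output on stdin.
tcp_states_for_port() {
  awk -v port="$1" '
    /^tcp/ {
      n = split($4, parts, ":")          # local address, e.g. 10.0.0.1:80 or :::80
      if (parts[n] == port) count[$6]++  # the last ":"-separated field is the port
    }
    END { for (s in count) print s, count[s] }
  ' | sort
}

# Demo on hypothetical sample output; port 8082 is not counted.
tcp_states_for_port 80 <<'EOF'
tcp        0      0 0.0.0.0:80          0.0.0.0:*           LISTEN
tcp        0      0 10.0.0.1:80         10.0.0.2:51234      TIME_WAIT
tcp        0      0 10.0.0.1:8082       10.0.0.2:51235      ESTABLISHED
EOF
```

On a live host the same function would be fed with `netstat -nat | tcp_states_for_port 80`.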

Analyzing these two points in turn:

  1. In theory, a load test with 100 concurrent users should need only about 100 connections, so siege itself must be creating the thousands of connections during the test.

    # Inspect the siege configuration
    vim ~/.siege/siege.conf
    
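    # Mystery solved: siege defaults to connection = close. During a sustained test,
    # each request closes its connection when it finishes and the next request opens
    # a new one. That explains why the Nginx host showed more than 5000 TCP
    # connections rather than 100.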
    
    # Connection directive. Options "close" and "keep-alive" Starting with
    # version 2.57, siege implements persistent connections in accordance 
    # to RFC 2068 using both chunked encoding and content-length directives
    # to determine the page size. 
    #
    # To run siege with persistent connections set this to keep-alive. 
    #
    # CAUTION:        Use the keep-alive directive with care.
    # DOUBLE CAUTION: This directive does not work well on HPUX
    # TRIPLE CAUTION: We don't recommend you set this to keep-alive
    # ex: connection = close
    #     connection = keep-alive
    #
    connection = close
    
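  2. TIME_WAIT stopping at 5000. To analyze this, first be clear about what the TCP state TIME_WAIT means:

    TIME_WAIT: wait long enough to be sure the remote TCP has received the acknowledgment of its connection termination request. TCP has to guarantee that all data is delivered correctly in every possible situation. When a socket is closed, the side that closes actively enters TIME_WAIT, while the side that closes passively moves to CLOSED (via CLOSE_WAIT and LAST_ACK). This is what guarantees that all data gets through.

From this definition it follows that after a connection is closed during the load test, the connection on the Nginx host does not become CLOSED right away but enters TIME_WAIT. There are many write-ups online about too many TIME_WAIT sockets causing dropped packets, and they match what I saw during the load test.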
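
A rough back-of-the-envelope estimate shows how quickly TIME_WAIT sockets pile up when every request uses its own connection. Linux holds a socket in TIME_WAIT for a fixed 60 seconds, so the steady-state count is roughly the number of connections closed per second times 60. The sketch below uses an assumed request rate of 200 per second, which is not a figure measured in this test, and the function name `estimate_time_wait` is illustrative.

```shell
# Linux keeps a socket in TIME_WAIT for a fixed 60 seconds.
TIME_WAIT_SECONDS=60

# Approximate steady-state TIME_WAIT count when each request opens and closes
# its own connection: connections closed per second times the TIME_WAIT duration.
estimate_time_wait() {
  echo $(( $1 * TIME_WAIT_SECONDS ))
}

# Assumed rate of 200 requests per second (hypothetical).
estimate_time_wait 200    # prints 12000
```

By the same arithmetic, any rate above roughly 84 requests per second (84 × 60 = 5040) would be enough to pass the 5000 mark at which the observed TIME_WAIT count stopped growing.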

# View the kernel settings on the Nginx host
cat /etc/sysctl.conf 
# sysctl settings are defined through files in
# /usr/lib/sysctl.d/, /run/sysctl.d/, and /etc/sysctl.d/.
#
# Vendors settings live in /usr/lib/sysctl.d/.
# To override a whole file, create a new file with the same in
# /etc/sysctl.d/ and put new settings there. To override
# only specific settings, add a file with a lexically later
# name in /etc/sysctl.d/ and put new settings there.
#
# For more information, see sysctl.conf(5) and sysctl.d(5).
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

vm.swappiness = 0
net.ipv4.neigh.default.gc_stale_time=120


# see details in https://help.aliyun.com/knowledge_detail/39428.html
net.ipv4.conf.all.rp_filter=0
net.ipv4.conf.default.rp_filter=0
net.ipv4.conf.default.arp_announce = 2
net.ipv4.conf.lo.arp_announce=2
net.ipv4.conf.all.arp_announce=2


# see details in https://help.aliyun.com/knowledge_detail/41334.html
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_synack_retries = 2
kernel.sysrq = 1
fs.file-max = 65535
net.ipv4.ip_forward = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_max_syn_backlog = 10240
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_syn_retries = 3
net.ipv4.tcp_max_orphans = 8192
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.tcp_window_scaling = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1024 65000
net.ipv4.icmp_echo_ignore_all = 0

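net.ipv4.tcp_max_tw_buckets = 5000: the value 5000 is the maximum number of TIME_WAIT sockets the system keeps at the same time. Once this number is exceeded, TIME_WAIT sockets are destroyed immediately and a warning is printed. This matches the TIME_WAIT count stopping at 5000 in the netstat output above.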
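
The sysctl.conf shown above assigns several keys more than once (for example net.ipv4.tcp_max_syn_backlog is set to 1024 and later to 10240). Settings are applied in file order, so the last assignment is the one that sticks. The helper below is a small sketch for reading off that effective value; the name `effective_sysctl` is made up for this example, and it only parses the file, so `sysctl <key>` on the host remains the way to see the live value.

```shell
# Print the value that wins for a key in sysctl.conf-style input on stdin.
# When a key is assigned more than once, the last assignment takes effect.
effective_sysctl() {
  awk -F= -v key="$1" '
    /^[[:space:]]*[#;]/ { next }            # skip comment lines
    {
      k = $1; v = $2
      gsub(/^[ \t]+|[ \t]+$/, "", k)        # trim whitespace around the key
      gsub(/^[ \t]+|[ \t]+$/, "", v)        # and around the value
      if (k == key) last = v
    }
    END { if (last != "") print last }
  '
}

# Demo with the two assignments taken from the file above.
effective_sysctl net.ipv4.tcp_max_syn_backlog <<'EOF'
net.ipv4.tcp_max_syn_backlog = 1024
net.ipv4.tcp_max_syn_backlog = 10240
EOF
```

Typical use would be `effective_sysctl net.ipv4.tcp_max_tw_buckets < /etc/sysctl.conf`.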

Optimization
Based on information found online, tune the following Linux kernel parameters:

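net.ipv4.tcp_syncookies = 1 enables SYN cookies. When the SYN backlog queue overflows, cookies are used to handle new connections, which protects against small-scale SYN floods. 0 turns it off.

net.ipv4.tcp_tw_reuse = 1 enables reuse: TIME_WAIT sockets may be reused for new outgoing TCP connections. 0 turns it off.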

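net.ipv4.tcp_tw_recycle = 1 enables fast recycling of TIME_WAIT sockets. The default is 0, meaning off. Note that this option is known to break connections from clients behind NAT and was removed in Linux 4.12.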

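net.ipv4.tcp_fin_timeout = 30 determines how long a socket stays in FIN-WAIT-2 when the local end requested the close.

net.ipv4.tcp_keepalive_time = 1200 sets how long a connection must be idle before TCP starts sending keepalive probes, when keepalive is enabled. The default is 2 hours; this changes it to 20 minutes.

net.ipv4.ip_local_port_range = 1024 65000 sets the port range used for outgoing connections. The default range is small, 32768 to 61000; this widens it to 1024 to 65000.

net.ipv4.tcp_max_syn_backlog = 8192 sets the length of the SYN queue. The default is 1024; raising it to 8192 lets the queue hold more connections that are waiting to be established.

net.ipv4.tcp_max_tw_buckets = 5000 sets the maximum number of TIME_WAIT sockets the system keeps at the same time. Once this number is exceeded, TIME_WAIT sockets are destroyed immediately and a warning is printed. The commonly cited default is 180000; here it is set to 5000.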
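
One interaction between these parameters is worth checking: both tcp_tw_reuse and tcp_tw_recycle rely on TCP timestamps, so they do nothing while net.ipv4.tcp_timestamps is 0, as it is in the sysctl.conf shown earlier. The check below is a minimal sketch that only parses "key = value" lines; the function name `check_tw_settings` is an assumption for this example and not an existing tool.

```shell
# Warn when TIME_WAIT reuse/recycle is enabled but TCP timestamps are off.
# Reads sysctl.conf-style "key = value" lines on stdin.
check_tw_settings() {
  awk -F= '
    /^[[:space:]]*#/ { next }              # skip comment lines
    {
      k = $1; v = $2
      gsub(/[ \t]/, "", k); gsub(/[ \t]/, "", v)
      val[k] = v                           # a later assignment overwrites an earlier one
    }
    END {
      if (val["net.ipv4.tcp_timestamps"] == "0") {
        if (val["net.ipv4.tcp_tw_reuse"] == "1")
          print "tcp_tw_reuse=1 has no effect while tcp_timestamps=0"
        if (val["net.ipv4.tcp_tw_recycle"] == "1")
          print "tcp_tw_recycle=1 has no effect while tcp_timestamps=0"
      }
    }
  '
}

# Demo with three of the settings from the file shown earlier.
check_tw_settings <<'EOF'
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
EOF
```

Running it as `check_tw_settings < /etc/sysctl.conf` prints nothing when the combination is consistent.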
