引
端午假期回來,同事往賬號中充錢,意外發(fā)現(xiàn)充錢失敗
奇怪的是之前的各個測試流程都通過了,之前沒有發(fā)現(xiàn)任何bug
bug初現(xiàn)
第一反應(yīng)查看日志
ubuntu@ip-172-31-16-234:~$ tail -f /var/log/supervisor/poll_block-err.log
result = self._query(query)
File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/cursors.py", line 321, in _query
conn.query(q)
File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 859, in query
self._execute_command(COMMAND.COM_QUERY, sql)
File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 1096, in _execute_command
self._write_bytes(packet)
File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 1048, in _write_bytes
"MySQL server has gone away (%r)" % (e,))
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))") [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'trans_type': 'in', 'wallet_address': '0x2d9272772978AAaC6690994D596811f1966CCC80', 'dt_create': datetime.datetime(2018, 6, 19, 15, 1, 57, 309124), 'phone_num': '+8618813002365', 'eth_num': Decimal('1'), 'to_address': '', 'txn_hash': '0xb6a7f52c9aef9f000dc979b8b1a58b92c0174d646156d50576079ee0d679e1a2', 'from_address': '0x23e96cA25136Fb6444B08FB24f94554197620c66'}] (Background on this error at: http://sqlalche.me/e/e3q8)
情況闡述,
此接口為監(jiān)聽用戶打幣情況接口,為異步操作,單開了一個app對象用來監(jiān)聽打幣情況。
業(yè)務(wù)邏輯為,監(jiān)聽到有用戶打幣,那么用戶的錢包地址中就增加相應(yīng)的數(shù)量。
接下來繼續(xù)測試,第二筆和第三筆都沒有問題,都正常。
奇怪的地方在于,第一筆失敗,在沒有任何人操作的情況下,繼續(xù)打了第二筆和第三筆,竟然成功。
bug分析
- 三次打幣有一個失敗,肯定是和打幣的變量有關(guān),可是打幣的變量只有一個,那就是打幣金額
首先來測試,跟失敗的例子使用相同的打幣金額會不會成功。
經(jīng)測試,打相同數(shù)量的幣并沒有觸發(fā)這個bug,奇怪。
- 另外一個變量就是,中途放了假,test服務(wù)器沒有任何人去使用。難道。。。?跟一些超時參數(shù)有關(guān)?
果斷搜索time out 相關(guān)的參數(shù),意外發(fā)現(xiàn):
參考 https://dev.mysql.com/doc/refman/5.7/en/gone-away.html
By default, the server closes the connection after eight hours if nothing has happened. You can change the time limit by setting the [wait_timeout](https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_wait_timeout) variable when you start [**mysqld**](https://dev.mysql.com/doc/refman/5.7/en/mysqld.html). See [Section 5.1.7, “Server System Variables”](https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html).
目前來看,八個小時如果沒有任何操作會自動斷掉connection,貌似有這個的可能。
bug復(fù)現(xiàn)
接下來嘗試來復(fù)現(xiàn)這個bug
設(shè)置MySQL timeout 變量 https://blog.csdn.net/liuj2511981/article/details/8520912
把wait timeout 時間縮小到幾秒鐘。
set global wait_timeout=10;
show global variables like '%timeout%';
果然復(fù)現(xiàn)出來了
File "/home/ubuntu/.virtualenvs/coucou/local/lib/python3.5/site-packages/pymysql/connections.py", line 1037, in _read_bytes
CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'to_address': '', 'from_address': '0x5a86f9741b7D420dBe420Ab4FB880fC21516049d', 'wallet_address': '0x9cb9f536181Aa0310808248A2BC3a8d2878D14E1', 'dt_create': datetime.datetime(2018, 6, 20, 10, 55, 58, 846276), 'trans_type': 'in', 'eth_num': Decimal('0.1'), 'txn_hash': '0x6665b259dacf1de2df847f8812667e6af2f32ae8f5fef8f41a6dfdc3b9d9506b', 'phone_num': '+8613146963300'}] (Background on this error at: http://sqlalche.me/e/e3q8)
深度觀察
雖然我們能準(zhǔn)確的復(fù)現(xiàn)出來了,但是目前為止還不知道原因。
既然是跟超時有關(guān),我們來查看程序與mysql的鏈接情況。
首先我們把時間改成2s
set global wait_timeout=2;
我們使用下面這個命令,來查看程序與mysql的連接產(chǎn)生和斷開的情況:
mysqladmin -u dbu -p -i '1' processlist
+----+------+-----------+----+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+----+---------+------+----------+------------------+
| 44 | dbu | localhost | | Query | 0 | starting | show processlist |
+----+------+-----------+----+---------+------+----------+------------------+
+----+------+-----------+----+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+----+---------+------+----------+------------------+
| 44 | dbu | localhost | | Query | 0 | starting | show processlist |
+----+------+-----------+----+---------+------+----------+------------------+
+----+------+-----------------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------------+--------+---------+------+----------+------------------+
| 44 | dbu | localhost | | Query | 0 | starting | show processlist |
| 45 | dbu | localhost:52478 | master | Sleep | 1 | | |
+----+------+-----------------+--------+---------+------+----------+------------------+
+----+------+-----------------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------------+--------+---------+------+----------+------------------+
| 44 | dbu | localhost | | Query | 0 | starting | show processlist |
| 45 | dbu | localhost:52478 | master | Sleep | 2 | | |
+----+------+-----------------+--------+---------+------+----------+------------------+
也就是說,這個連接的存活期只有2s,當(dāng)2后這個鏈接自動關(guān)閉掉了。
而之前沒有碰到過這個情況的原因是,很少有程序都是即時返回的。程序的生存周期只有不到一秒,
而這個腳本是要跨超長時間的,需要一直維持著mysql的連接,而且從開始運(yùn)行到監(jiān)聽到用戶打幣都是默認(rèn)使用的一個連接。
當(dāng)正在檢測到了用戶打幣的時候,之前的鏈接早已超時,
于是產(chǎn)生了報(bào)錯:
sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'txn_hash': '0xf02316210853e2b791c9fb67535a210d85dcd61676277aab7953c10a91d6c138', 'to_address': '', 'from_address': '0x5a86f9741b7D420dBe420Ab4FB880fC21516049d', 'wallet_address': '0x9cb9f536181Aa0310808248A2BC3a8d2878D14E1', 'dt_create': datetime.datetime(2018, 6, 21, 11, 25, 22, 222526), 'trans_type': 'in', 'phone_num': '+8613146963300', 'eth_num': Decimal('0.1')}] (Background on this error at: http://sqlalche.me/e/e3q8)
同樣的,很多用戶執(zhí)行長時間sql腳本的時候也經(jīng)常碰到這個bug。
解決方式
1 poll_block進(jìn)程中,循環(huán)去維持一個跟數(shù)據(jù)庫的連接。
2 把wait timeout 值設(shè)大些。
延伸。mysql 連接相關(guān)。
status 里面的connects實(shí)踐。
當(dāng)我業(yè)務(wù)程序沒有啟動的時候,用linux客戶端進(jìn)入mysql
mysql dbu@(none):coucou> show processlist;
+----+------+-----------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+--------+---------+------+----------+------------------+
| 9 | dbu | localhost | coucou | Query | 0 | starting | show processlist |
+----+------+-----------+--------+---------+------+----------+------------------+
可以看到,當(dāng)前只有一個連接,那就是本客戶端的鏈接,而且id為9
當(dāng)啟動業(yè)務(wù)程序后:
gunicorn 'coucou.application:create_app()' -w 8 -b 0.0.0.0:5000
可以看到,我開了8 個線程。這時候再去看數(shù)據(jù)庫的鏈接。
mysql dbu@(none):coucou> show processlist;
+----+------+-----------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------+--------+---------+------+----------+------------------+
| 9 | dbu | localhost | coucou | Query | 0 | starting | show processlist |
+----+------+-----------+--------+---------+------+----------+------------------+
還是只有一個啊,因?yàn)檫@個時候app 與 mysql還沒有交互。
我試著走一個接口。
mysql dbu@(none):coucou> show processlist;
+----+------+-----------------+--------+---------+------+----------+------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------------+--------+---------+------+----------+------------------+
| 9 | dbu | localhost | coucou | Query | 0 | starting | show processlist |
| 19 | dbu | localhost:52392 | master | Sleep | 2 | | <null> |
+----+------+-----------------+--------+---------+------+----------+------------------+
2 rows in set
會發(fā)現(xiàn)多出來了一個連接。
然后我再多走2個接口,發(fā)現(xiàn)不是每個接口都會增長一個連接。
最后我走一百個接口,發(fā)現(xiàn),最多也就是9個連接,不會再增長了。
現(xiàn)象分析
每個請求會被分配到gunicoin的一個線程里面,每個gunicoin所托起的app一旦開始運(yùn)行都會維持一個跟mysql的連接。
因?yàn)槊總€請求分到線程都是隨機(jī)的。也就是不一定每個接口都會增長一個連接。
因?yàn)樽疃嘤?個線程,也就是最多會有8+1 個連接,其中1是客戶端的鏈接。
status命令確認(rèn)
mysql dbu@(none):coucou> status
--------------
mycli 1.16.0, running on CPython 2.7.12
+----------------------+----------------------------------+
| | |
+----------------------+----------------------------------+
| Connection id: | 9 |
| Current database: | coucou |
| Current user: | dbu@localhost |
| Current pager: | System default |
| Server version: | 5.7.22-0ubuntu0.16.04.1 (Ubuntu) |
| Protocol version: | 10 |
| Connection: | Localhost via UNIX socket |
| Server characterset: | utf8mb4 |
| Db characterset: | utf8mb4 |
| Client characterset: | utf8 |
| Conn. characterset: | utf8 |
| UNIX socket: | /var/run/mysqld/mysqld.sock |
| Uptime: | 14 hours 20 min 53 sec |
+----------------------+----------------------------------+
Connections: 9 Queries: 1236 Slow queries: 0 Opens: 220 Flush tables: 1 Open tables: 129 Queries per second avg: 0.000
Statsu顯示頁面里面 也是顯示當(dāng)前連接有9個。而且客戶端的連接id為9;