Bug 記:一個因?yàn)榉帕巳煨¢L假產(chǎn)生的bug

端午假期回來,同事往賬號中充錢,意外發(fā)現(xiàn)充錢失敗

奇怪的是之前的各個測試流程都通過了,之前沒有發(fā)現(xiàn)任何bug

bug初現(xiàn)

第一反應(yīng)查看日志


ubuntu@ip-172-31-16-234:~$ tail -f /var/log/supervisor/poll_block-err.log

 result = self._query(query)

 File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/cursors.py", line 321, in _query

 conn.query(q)

 File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 859, in query

 self._execute_command(COMMAND.COM_QUERY, sql)

 File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 1096, in _execute_command

 self._write_bytes(packet)

 File "/home/ubuntu/.virtualenvs/coucou/lib/python3.5/site-packages/pymysql/connections.py", line 1048, in _write_bytes

 "MySQL server has gone away (%r)" % (e,))

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2006, "MySQL server has gone away (BrokenPipeError(32, 'Broken pipe'))") [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'trans_type': 'in', 'wallet_address': '0x2d9272772978AAaC6690994D596811f1966CCC80', 'dt_create': datetime.datetime(2018, 6, 19, 15, 1, 57, 309124), 'phone_num': '+8618813002365', 'eth_num': Decimal('1'), 'to_address': '', 'txn_hash': '0xb6a7f52c9aef9f000dc979b8b1a58b92c0174d646156d50576079ee0d679e1a2', 'from_address': '0x23e96cA25136Fb6444B08FB24f94554197620c66'}] (Background on this error at: http://sqlalche.me/e/e3q8)

情況闡述,

此接口為監(jiān)聽用戶打幣情況接口,為異步操作,單開了一個app對象用來監(jiān)聽打幣情況。

業(yè)務(wù)邏輯為,監(jiān)聽到有用戶打幣,那么用戶的錢包地址中就增加相應(yīng)的數(shù)量。

接下來繼續(xù)測試,第二筆和第三筆都沒有問題,都正常。

奇怪的地方在于,第一筆失敗,在沒有任何人操作的情況下,繼續(xù)打了第二筆和第三筆,竟然成功。

bug分析

  • 三次打幣有一個失敗,肯定是和打幣的變量有關(guān),可是打幣的變量只有一個,那就是打幣金額

首先來測試,跟失敗的例子使用相同的打幣金額會不會成功。

經(jīng)測試,打相同數(shù)量的幣并沒有觸發(fā)這個bug,奇怪。

  • 另外一個變量就是,中途放了假,test服務(wù)器沒有任何人去使用。難道。。。?跟一些超時參數(shù)有關(guān)?

果斷搜索time out 相關(guān)的參數(shù),意外發(fā)現(xiàn):

參考 https://dev.mysql.com/doc/refman/5.7/en/gone-away.html


By default, the server closes the connection after eight hours if nothing has happened. You can change the time limit by setting the [wait_timeout](https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html#sysvar_wait_timeout) variable when you start [**mysqld**](https://dev.mysql.com/doc/refman/5.7/en/mysqld.html). See [Section 5.1.7,  “Server System Variables”](https://dev.mysql.com/doc/refman/5.7/en/server-system-variables.html).

目前來看,八個小時如果沒有任何操作會自動斷掉connection,貌似有這個的可能。

bug復(fù)現(xiàn)

接下來嘗試來復(fù)現(xiàn)這個bug

設(shè)置MySQL timeout 變量 https://blog.csdn.net/liuj2511981/article/details/8520912

把wait timeout 時間縮小到幾秒鐘。


set global wait_timeout=10;

show global variables like '%timeout%';

果然復(fù)現(xiàn)出來了


 File "/home/ubuntu/.virtualenvs/coucou/local/lib/python3.5/site-packages/pymysql/connections.py", line 1037, in _read_bytes

 CR.CR_SERVER_LOST, "Lost connection to MySQL server during query")

sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'to_address': '', 'from_address': '0x5a86f9741b7D420dBe420Ab4FB880fC21516049d', 'wallet_address': '0x9cb9f536181Aa0310808248A2BC3a8d2878D14E1', 'dt_create': datetime.datetime(2018, 6, 20, 10, 55, 58, 846276), 'trans_type': 'in', 'eth_num': Decimal('0.1'), 'txn_hash': '0x6665b259dacf1de2df847f8812667e6af2f32ae8f5fef8f41a6dfdc3b9d9506b', 'phone_num': '+8613146963300'}] (Background on this error at: http://sqlalche.me/e/e3q8)

深度觀察

雖然我們能準(zhǔn)確的復(fù)現(xiàn)出來了,但是目前為止還不知道原因。

既然是跟超時有關(guān),我們來查看程序與mysql的鏈接情況。

首先我們把時間改成2s


 set global wait_timeout=2;

我們使用下面這個命令,來查看程序與mysql的連接產(chǎn)生和斷開的情況:


mysqladmin -u dbu -p -i '1' processlist


+----+------+-----------+----+---------+------+----------+------------------+

| Id | User | Host | db | Command | Time | State | Info  |

+----+------+-----------+----+---------+------+----------+------------------+

| 44 | dbu | localhost | | Query  | 0 | starting | show processlist |

+----+------+-----------+----+---------+------+----------+------------------+

+----+------+-----------+----+---------+------+----------+------------------+

| Id | User | Host | db | Command | Time | State | Info  |

+----+------+-----------+----+---------+------+----------+------------------+

| 44 | dbu | localhost | | Query  | 0 | starting | show processlist |

+----+------+-----------+----+---------+------+----------+------------------+

+----+------+-----------------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------------+--------+---------+------+----------+------------------+

| 44 | dbu | localhost  | | Query  | 0 | starting | show processlist |

| 45 | dbu | localhost:52478 | master | Sleep  | 1 | | |

+----+------+-----------------+--------+---------+------+----------+------------------+

+----+------+-----------------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------------+--------+---------+------+----------+------------------+

| 44 | dbu | localhost  | | Query  | 0 | starting | show processlist |

| 45 | dbu | localhost:52478 | master | Sleep  | 2 | | |

+----+------+-----------------+--------+---------+------+----------+------------------+

也就是說,這個連接的存活期只有2s,當(dāng)2后這個鏈接自動關(guān)閉掉了。

而之前沒有碰到過這個情況的原因是,很少有程序都是即時返回的。程序的生存周期只有不到一秒,

而這個腳本是要跨超長時間的,需要一直維持著mysql的連接,而且從開始運(yùn)行到監(jiān)聽到用戶打幣都是默認(rèn)使用的一個連接。

當(dāng)正在檢測到了用戶打幣的時候,之前的鏈接早已超時,

于是產(chǎn)生了報(bào)錯:


sqlalchemy.exc.OperationalError: (pymysql.err.OperationalError) (2013, 'Lost connection to MySQL server during query') [SQL: 'INSERT INTO user_presale_wallet_log_model (trans_type, phone_num, wallet_address, from_address, to_address, eth_num, txn_hash, dt_create) VALUES (%(trans_type)s, %(phone_num)s, %(wallet_address)s, %(from_address)s, %(to_address)s, %(eth_num)s, %(txn_hash)s, %(dt_create)s)'] [parameters: {'txn_hash': '0xf02316210853e2b791c9fb67535a210d85dcd61676277aab7953c10a91d6c138', 'to_address': '', 'from_address': '0x5a86f9741b7D420dBe420Ab4FB880fC21516049d', 'wallet_address': '0x9cb9f536181Aa0310808248A2BC3a8d2878D14E1', 'dt_create': datetime.datetime(2018, 6, 21, 11, 25, 22, 222526), 'trans_type': 'in', 'phone_num': '+8613146963300', 'eth_num': Decimal('0.1')}] (Background on this error at: http://sqlalche.me/e/e3q8)

同樣的,很多用戶執(zhí)行長時間sql腳本的時候也經(jīng)常碰到這個bug。

解決方式

1 poll_block進(jìn)程中,循環(huán)去維持一個跟數(shù)據(jù)庫的連接。

2 把wait timeout 值設(shè)大些。

延伸。mysql 連接相關(guān)。

status 里面的connects實(shí)踐。

當(dāng)我業(yè)務(wù)程序沒有啟動的時候,用linux客戶端進(jìn)入mysql


mysql dbu@(none):coucou> show processlist;

+----+------+-----------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------+--------+---------+------+----------+------------------+

| 9 | dbu | localhost | coucou | Query  | 0 | starting | show processlist |

+----+------+-----------+--------+---------+------+----------+------------------+

可以看到,當(dāng)前只有一個連接,那就是本客戶端的鏈接,而且id為9

當(dāng)啟動業(yè)務(wù)程序后:


gunicorn 'coucou.application:create_app()' -w 8 -b 0.0.0.0:5000

可以看到,我開了8 個線程。這時候再去看數(shù)據(jù)庫的鏈接。


mysql dbu@(none):coucou> show processlist;

+----+------+-----------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------+--------+---------+------+----------+------------------+

| 9 | dbu | localhost | coucou | Query  | 0 | starting | show processlist |

+----+------+-----------+--------+---------+------+----------+------------------+

還是只有一個啊,因?yàn)檫@個時候app 與 mysql還沒有交互。

我試著走一個接口。


mysql dbu@(none):coucou> show processlist;

+----+------+-----------------+--------+---------+------+----------+------------------+

| Id | User | Host | db  | Command | Time | State | Info  |

+----+------+-----------------+--------+---------+------+----------+------------------+

| 9 | dbu | localhost  | coucou | Query  | 0 | starting | show processlist |

| 19 | dbu | localhost:52392 | master | Sleep  | 2 | | <null>  |

+----+------+-----------------+--------+---------+------+----------+------------------+

2 rows in set

會發(fā)現(xiàn)多出來了一個連接。

然后我再多走2個接口,發(fā)現(xiàn)不是每個接口都會增長一個連接。

最后我走一百個接口,發(fā)現(xiàn),最多也就是9個連接,不會再增長了。

現(xiàn)象分析

每個請求會被分配到gunicoin的一個線程里面,每個gunicoin所托起的app一旦開始運(yùn)行都會維持一個跟mysql的連接。

因?yàn)槊總€請求分到線程都是隨機(jī)的。也就是不一定每個接口都會增長一個連接。

因?yàn)樽疃嘤?個線程,也就是最多會有8+1 個連接,其中1是客戶端的鏈接。

status命令確認(rèn)


mysql dbu@(none):coucou> status

--------------

mycli 1.16.0, running on CPython 2.7.12

+----------------------+----------------------------------+

| | |

+----------------------+----------------------------------+

| Connection id:  | 9 |

| Current database: | coucou  |

| Current user: | dbu@localhost |

| Current pager:  | System default  |

| Server version: | 5.7.22-0ubuntu0.16.04.1 (Ubuntu) |

| Protocol version: | 10  |

| Connection: | Localhost via UNIX socket |

| Server characterset: | utf8mb4 |

| Db characterset:  | utf8mb4 |

| Client characterset: | utf8  |

| Conn. characterset: | utf8  |

| UNIX socket:  | /var/run/mysqld/mysqld.sock |

| Uptime: | 14 hours 20 min 53 sec  |

+----------------------+----------------------------------+

Connections: 9 Queries: 1236 Slow queries: 0 Opens: 220 Flush tables: 1 Open tables: 129 Queries per second avg: 0.000

Statsu顯示頁面里面 也是顯示當(dāng)前連接有9個。而且客戶端的連接id為9;

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時請結(jié)合常識與多方信息審慎甄別。
平臺聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡書系信息發(fā)布平臺,僅提供信息存儲服務(wù)。

友情鏈接更多精彩內(nèi)容