
一、背景
- 在業(yè)務(wù)龐大,運維流程長且復(fù)雜的場景下,為了提高運維效率,減小系統(tǒng)對業(yè)務(wù)工人的依賴(耦合),系統(tǒng)引入消息隊列
- 系統(tǒng)基于MySql實現(xiàn)的消息隊列 oxen-queue
- 在高并發(fā)的情況下,消息隊列由于設(shè)計原因,有死鎖情況出現(xiàn)
- 造成的影響:入隊操作失敗,導(dǎo)致單據(jù)創(chuàng)建失??;重新入隊不及時,導(dǎo)致業(yè)務(wù)處理慢

二、環(huán)境
- 建表語句
create table Queue ( FId bigint unsigned auto_increment primary key, FBatchId bigint unsigned null, FJobType varchar(256) not null, FCreatedTs datetime default CURRENT_TIMESTAMP not null comment '創(chuàng)建時間', FStartedTs datetime null comment '開始時間', FBody varchar(8096) null, FStatus varchar(256) default 'waiting' null, FResult text null, FRecovered int default 0 null, FRunningTime int null, FUniqueKey bigint null, FPriority bigint null, FRetryCount bigint default 0 not null, FCreateTime datetime default CURRENT_TIMESTAMP not null, constraint i_u_FUniqueKey unique (FUniqueKey) ) charset = utf8; create index i_FBatchId_FPriority on devops.Queue (FBatchId, FPriority); create index i_FCreatedTs on devops.Queue (FCreatedTs); create index i_FJobType_FBatchId_FStatus_FPriority on devops.Queue (FJobType, FBatchId, FStatus, FPriority); create index i_FStartedTs_FJobType_FStatus on devops.Queue (FStartedTs, FJobType, FStatus); create index i_FStatus on devops.Queue (FStatus); INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL); - Queue相關(guān)DB操作
// 入隊 INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL); // 重新入隊 UPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody` = 'body' WHERE `FId` = 1; // 超時任務(wù)處理 UPDATE T_DevOpsQueue SET FStatus='waiting', FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus = 'processing' AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeStatusQueue';
三、問題定位
起因
用戶反饋單據(jù)創(chuàng)建異常,通過相關(guān)信息,定位到異常產(chǎn)生的原因是消息隊列死鎖導(dǎo)致-
系統(tǒng)異常日志
INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL) ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction首次入隊失敗,由于沒有回退流程,導(dǎo)致數(shù)據(jù)單據(jù)數(shù)據(jù)不一致
第二條錯誤日志,這里是我通過在日志中發(fā)現(xiàn)的, -
統(tǒng)計系統(tǒng)死鎖(以
ER_LOCK_DEADLOCK為關(guān)鍵字)
一天的死鎖數(shù)據(jù).png
通過ER_LOCK_DEADLOCK查找,發(fā)現(xiàn)不只是Insert, 覺大部分為UpdateUPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody`= 'body' WHERE `FId` = 1 ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction問:為什么異常發(fā)生頻率這么高,卻一直沒有被發(fā)現(xiàn)
答:Insert會引起業(yè)務(wù)異常(首次入隊失敗,由于沒有回退流程,導(dǎo)致數(shù)據(jù)單據(jù)數(shù)據(jù)不一致),更容易關(guān)注到,Updata(是重新入隊)會被后臺Worker補償,所以沒有業(yè)務(wù)影響 初步分析-業(yè)務(wù)層面
系統(tǒng)錯誤日志已經(jīng)明確,試圖從業(yè)務(wù)層面去分區(qū)死鎖并解決,
猜測:由于業(yè)務(wù)處理復(fù)雜,整個流程全部在一個事務(wù)中處理(包含多次網(wǎng)絡(luò)請求),初步認為是事務(wù)使用不當導(dǎo)致死鎖
解決:拆解事務(wù),網(wǎng)絡(luò)請求等耗時操作移出事務(wù)
結(jié)果:并沒有改善,所以不是這里引起的(事后分析,入隊操作并沒有使用事務(wù),所以一定不是這里引起的)
四、死鎖分析與處理
-
死鎖分析(一)
-
找DBA拉取死鎖日志(
show engine innodb status)// 事物一 *** (1) TRANSACTION: TRANSACTION 264237645, ACTIVE 2 sec starting index read mysql tables in use 2, locked 2 LOCK WAIT 128 lock struct(s), heap size 24784, 265 row lock(s) // 事物ID:53640626,client host: x.x.x.177 MySQL thread id 53640626, OS thread handle 139684660885248, query id 1839137253 x.x.x.177 x.x.x.177 db Sending data // 死鎖SQL UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 6835010934527921, FStatus = "processing", FStartedTs = NOW() // 事物一,等待的鎖 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: // 等待主鍵的記錄X鎖(排他鎖) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237645 lock_mode X locks rec but not gap waiting Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二 *** (2) TRANSACTION: TRANSACTION 264237646, ACTIVE 1 sec starting index read mysql tables in use 2, locked 2 128 lock struct(s), heap size 24784, 265 row lock(s) // 事物ID:53950278,client host: x.x.x.198 MySQL thread id 53950278, OS thread handle 139684121917184, query id 1839137525 x.x.x.198 x.x.x.198 db Sending data // 死鎖SQL UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW() // 事物二,持有的鎖 *** (2) HOLDS THE LOCK(S): // 等待主鍵的記錄S鎖(共享鎖) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237646 lock mode S locks rec but not gap Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二,等待的鎖 *** (2) WAITING FOR THIS LOCK TO BE GRANTED: // 等待主鍵的記錄X鎖(排他鎖) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237646 lock_mode X locks rec but not gap waiting Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二,回滾 *** WE ROLL BACK TRANSACTION (2) -
日志分析
疑惑:為什么業(yè)務(wù)系統(tǒng)異常日志信息和MySql死鎖日志信息不一致?
推測: MySql中引起死鎖的Sql會影響業(yè)務(wù)系統(tǒng)中Sql,導(dǎo)致的異常;不止一個死鎖先解決MySql死鎖日志中的: 死鎖語句是同一個,但是是不同的Client執(zhí)行(不同的Host),系統(tǒng)的服務(wù),部署在不同的機器(集群)并發(fā)引起的
- 死鎖SQL
UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW()
-
通過日志,分析得到,倆個事務(wù)持有、等待鎖的情況
事務(wù)1 事務(wù)2 持有鎖 ... PRIMARY的記錄S鎖 等待鎖 PRIMARY的記錄X鎖 PRIMARY的記錄X鎖 -
雖然日志中沒有記錄事務(wù)1持有鎖的情況,但是不難推斷出,是PRIMARY的記錄S鎖
lock.png
-
分析一句Sql是怎么形成上面這樣的死鎖的
-
首先分析這條Sql會加哪些鎖(
select * from performance_schema.data_locks)
sql鎖情況
鎖情況.png- Sql語句中有子查詢,并強制使用i_FJobType_FBatchId_FStatus_FPriority索引,所以這里獲取了二級索引(i_FJobType_FBatchId_FStatus_FPriority)的Next-Key共享鎖,
- 由于索引的機制,這里會回表,也就是通過二級索引,會獲取聚簇索引的記錄(
S,REC_NOT_GAP)共享鎖, - 通過主鍵Update,所以再獲取聚簇索引的記錄(
X,REC_NOT_GAP)排他鎖
-
真實的死鎖原因
lock.png
先加了聚簇索引的記錄共享鎖,然后再獲取聚簇索引的記錄排他鎖,在并發(fā)的情況下,同時持有共享鎖,再獲取排他鎖時,相互等待,形成死鎖
-
-
如何解決
- 問題的根因已經(jīng)明確,這里主要共享鎖相互兼容,導(dǎo)致相同的Sql語句同時持有共享鎖導(dǎo)致,所以我們這里通過將共享鎖升級為排他鎖,解決,添加
FOR UPDATE
UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 FOR UPDATE ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW();調(diào)整后Sql鎖情況
lock.png通過FOR UPDATE將共享鎖升級為排他鎖,順序加鎖,解決問題
- 問題的根因已經(jīng)明確,這里主要共享鎖相互兼容,導(dǎo)致相同的Sql語句同時持有共享鎖導(dǎo)致,所以我們這里通過將共享鎖升級為排他鎖,解決,添加
-
解決效果
日志.png通過統(tǒng)計觀察指定異常,發(fā)現(xiàn)這里的異常已經(jīng)消除
-
-
死鎖分析(二)
上面的異常是消除了,但是并沒有解決掉我發(fā)現(xiàn)系統(tǒng)日志的異常信息,也就側(cè)面驗證了我們在上面的疑惑,系統(tǒng)不止一個死鎖所以這里調(diào)整了日志統(tǒng)計信息,看所有的死鎖問題
日志.png
show engine innodb status只顯示最近一次發(fā)生的死鎖日志,如果死鎖頻繁,可以通過系統(tǒng)全局變量innodb_print_all_deadlocks置為ON,通過日志可以看到所有的死鎖信息-
找DBA拉取死鎖日志(
show engine innodb status),發(fā)現(xiàn)上面的死鎖確實換成了另外一個*** (1) TRANSACTION: TRANSACTION 278046801, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 5 lock struct(s), heap size 1136, 3 row lock(s) MySQL thread id 54720832, OS thread handle 139687080482560, query id 1895336202 x.x.x.177 x.x.x.177 mmpaydevopsweb Searching rows for update UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeGrayTimeQueue' *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 217 page no 325711 n bits 104 index PRIMARY of table `db`.`Queue` trx id 278046801 lock_mode X locks rec but not gap waiting Record lock, heap no 31 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 *** (2) TRANSACTION: TRANSACTION 278046800, ACTIVE 0 sec updating or deleting mysql tables in use 1, locked 1 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1 MySQL thread id 55659707, OS thread handle 139684629968640, query id 1895336203 x.x.x.62 x.x.x.62 mmpaydevopsweb updating update `Queue` set `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 26846, `FBody` = '{\"job_type\":\"syncChangeStatusJobType\",\"payload\":{\"deploymentId\":41582},\"start_time\":10000}' where `FId` = 7446396 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 217 page no 325711 n bits 104 index PRIMARY of table `db`.`Queue` trx id 278046800 lock_mode X locks rec but not gap Record lock, heap no 31 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 217 page no 435270 n bits 864 index i_FStatus of table `db`.`Queue` trx id 278046800 lock_mode X locks rec but not gap waiting Record lock, heap no 767 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 *** WE ROLL BACK TRANSACTION (2) -
日志分析
-
死鎖SQL
============================= 事物一 ====================================== UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeGrayTimeQueue' ============================= 事物二 ====================================== UPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody` = 'body' WHERE `FId` = 150; ``` 通過日志,分析得到,倆個事務(wù)持有、等待鎖的情況
事務(wù)1 事務(wù)2 持有鎖 ... PRIMARY的記錄X鎖 等待鎖 PRIMARY的記錄X鎖 二級索引FStatus的記錄X鎖 lock.png -
-
分析Sql是怎么形成上面這樣的死鎖的
- 首先分析這條Sql會加哪些鎖(
select * from performance_schema.data_locks)
sql鎖情況
Sql-1
lock.png
Sql-2
lock.png
這里Sql2好像并沒有二級索引FStatus的X鎖,難道推斷出了問題?并不是,該語句更新了FStatus,而FStatus是一個二級索引(二級索引所在的列產(chǎn)生實際變化的更新),所以在更新前,需要對FStatus的二級索引對應(yīng)的記錄加鎖(隱式鎖)
所以實際的加鎖情況是
lock.png這種情況主要是不同的sql,加鎖的順序不一致,更新的條件是主鍵、二級索引,很容易產(chǎn)生死鎖
- 首先分析這條Sql會加哪些鎖(
-
如何解決
倆個語句如果加鎖順序一致,就可以避免這樣的問題發(fā)生// 先查詢 SELECT FId FROM ${this.db_table} WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL ${this.job_timeout_seconds} SECOND) AND ? LIMIT 10 // where條件加上主鍵 UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FId IN(1,2) //上面的ID FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'jobType'這里where條件為什么不去掉后面的條件,只通過主鍵修改?原因是可能形成數(shù)據(jù)覆蓋
-
解決效果
日志.png
-
五、疑惑
-
上面并沒有分析關(guān)于引發(fā)系統(tǒng)異常的Insert語句的死鎖,具體Insert為什么也會有死鎖的問題呢?
-
處理超時消息
update.png -
首次入隊(Insert)
insert.png -
真實執(zhí)行鎖情況
lock.png
lock.png
這里可能有疑問,事物二(Insert)為什么會有二級索引的X鎖,這個就要說說隱式鎖
為了減少鎖的數(shù)量,提高Mysql的性能,在一些情況下,可以不不加鎖,等判斷可能發(fā)生沖突時才加鎖,也就是延遲加鎖, -
結(jié)果:倆次解決效果

熱點行鎖,影響系統(tǒng)性能
六、總結(jié)
排查分析的過程比較曲折,剛開始只通過系統(tǒng)日志中的一句異常信息,從系統(tǒng)上分析死鎖的原因,結(jié)果浪費了很多時間,系統(tǒng)日志與MySql死鎖日志不一致也就糾結(jié)了好久,最后發(fā)現(xiàn),死鎖問題不單單只有一個,所以最后采用逐個擊破的方式,最后解決了該問題
記錄主要注意以下五點,在開發(fā)中可以很大程度的避免死鎖,
- 事務(wù)時間要短,原則是耗時操作不要放到事務(wù)中處理,最好將影響并發(fā)度的鎖,放到時候最后
- 遇到死鎖也不要慌,通過死鎖日志分析是最有效的,但是要先了解清楚Mysql的鎖機制
- 一個表加太多索引,很容易導(dǎo)致死鎖,加索引前要仔細思考是否有必要,不能無腦加,同時在寫Sql的時候也要考慮,是不是存在死鎖的問題,和那些會形成
- 死鎖有時候在并發(fā)度高的情況才會出現(xiàn),對業(yè)務(wù)壓測還是很有必要的
- 通過mysql工具可以清晰的分析Sql鎖的情況,很好的幫助我們定位問題















