1. 問題描述
線上數(shù)據(jù)庫,里面有個account_recharge_transaction表,這個表采用了hash分區(qū),以account_id作為分區(qū)鍵,但是最近這段時間發(fā)現(xiàn)這個一個接口報超時錯誤,大概有2%的概率超時。執(zhí)行sql如下
SELECT * FROM "account_recharge_transaction" WHERE "status"=2 AND "account_id"=111795163172472852 ORDER BY "id" DESC LIMIT 1;
2. 排查方向
- 超時能否確定確實是執(zhí)行sql引起的?
通過otel的鏈路排查發(fā)現(xiàn)確實是sql查詢消耗的,不存在是上游業(yè)務(wù)占用了過多的時間導(dǎo)致執(zhí)行sql本身時間不夠?qū)е碌模膊淮嬖趫?zhí)行sql之前有其他的業(yè)務(wù)代碼占用了過多的時間。
- 是否是數(shù)據(jù)庫連接數(shù)不夠?qū)е碌模?/li>
不存在這個問題,采用k8s部署,有5個pod,每個pod最大連接數(shù)是30,配置是足夠的,同時查看連接池的監(jiān)控發(fā)現(xiàn)占用很少。
- 是否是sql本身有問題
通過執(zhí)行postgre的explain語句發(fā)現(xiàn)走索引了, 索引定義是:
account_recharge_transaction_pkey PRIMARY KEY (id, account_id)
explain analyse SELECT "id","account_id","created_at","client_member_id","subsidy_id","amount","batch_id","status","transaction_id","sub_account_id","from","to","type","updated_at","detail","operator","member_id" FROM "account_recharge_transaction"
WHERE "account_id"=111332578555887692 AND "status"=2 ORDER BY "id" DESC LIMIT 1;
Limit (cost=0.43..103.72 rows=1 width=123) (actual time=38.996..38.997 rows=1 loops=1)
-> Index Scan Backward using account_recharge_transaction_060_pkey on account_recharge_transaction_060 account_recharge_transaction (cost=0.43..24066.54 rows=233 width=123) (
actual time=38.995..38.995 rows=1 loops=1)
Index Cond: (account_id = '111332578555887692'::bigint)
Filter: (status = 2)
Planning Time: 0.160 ms
Execution Time: 39.035 ms
- 排除沒走索引問題。
- 排除跨分區(qū)查詢問題,只走了一個分區(qū)。
- 排除數(shù)據(jù)量過大問題,這個查詢命中數(shù)據(jù)其實很少只有個位數(shù),而且手動執(zhí)行的時候用了39.035ms并不慢,但是線上確實是執(zhí)行這個的時候超時了。
- 是否是數(shù)據(jù)庫負(fù)載問題?
線上用aws的aurora postgre數(shù)據(jù)庫,配置相當(dāng)高,CPU或者內(nèi)存不足相關(guān)的問題。
- 是否是sql真正執(zhí)行的時候執(zhí)行計劃出現(xiàn)了問題,導(dǎo)致沒有走h(yuǎn)ash分區(qū),走全表掃描?
通過配置auto_explain發(fā)現(xiàn)即使在超時的時候執(zhí)行計劃也是走一個分區(qū)和primary key索引了。
3. 繼續(xù)排查
通過上面分析和信息收集依然沒有找出偶爾超時的問題,繼續(xù)在aws的打開performance insight,可以看到更多的數(shù)據(jù)庫信息,里面有AAS指標(biāo),里面的等待選項發(fā)現(xiàn)對應(yīng)的sql是IO操作。但是上面sql初次看起來不可能引發(fā)io問題,因為數(shù)據(jù)量很少,每個分區(qū)也就200萬數(shù)據(jù),而且查詢的數(shù)據(jù)也很小。繼續(xù)手動在讀庫執(zhí)行查詢語句,發(fā)現(xiàn)有的時候確實能復(fù)現(xiàn)查詢比較慢的情況。
繼續(xù)給explain增加參數(shù)
EXPLAIN (ANALYZE, VERBOSE, COSTS, BUFFERS)
SELECT "id","account_id","created_at","client_member_id","subsidy_id","amount","batch_id","status","transaction_id","sub_account_id","from","to","type","updated_at","detail","operator","member_id"
FROM "account_recharge_transaction"
WHERE "account_id"=111441242101419219 AND "status"=2 ORDER BY "id" DESC LIMIT 1;
發(fā)現(xiàn)訪問速度正常時結(jié)果如下
Limit (cost=0.43..101.12 rows=1 width=123) (actual time=38.298..38.299 rows=0 loops=1)
Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_tr
ansaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id,
account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_t
ransaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharg
e_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
Buffers: shared hit=8948
-> Index Scan Backward using account_recharge_transaction_082_pkey on public.account_recharge_transaction_082 account_recharge_transaction (
cost=0.43..23462.20 rows=233 width=123) (actual time=38.297..38.297 rows=0 loops=1)
Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recha
rge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batc
h_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_rech
arge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_r
echarge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
Index Cond: (account_recharge_transaction.account_id = '111441242101419222'::bigint)
Filter: (account_recharge_transaction.status = 2)
Buffers: shared hit=8948
Query Identifier: -7739276633008834702
Planning Time: 0.165 ms
Execution Time: 38.329 ms
訪問速度異常時結(jié)果如下
Limit (cost=0.43..101.12 rows=1 width=123) (actual time=8940.396..8940.396 rows=0 loops=1)
Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recharge_tr
ansaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batch_id,
account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_recharge_t
ransaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_recharg
e_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
Buffers: shared hit=296 read=8652
I/O Timings: shared/local read=8873.460
-> Index Scan Backward using account_recharge_transaction_082_pkey on public.account_recharge_transaction_082 account_recharge_transaction (
cost=0.43..23462.20 rows=233 width=123) (actual time=8940.394..8940.394 rows=0 loops=1)
Output: account_recharge_transaction.id, account_recharge_transaction.account_id, account_recharge_transaction.created_at, account_recha
rge_transaction.client_member_id, account_recharge_transaction.subsidy_id, account_recharge_transaction.amount, account_recharge_transaction.batc
h_id, account_recharge_transaction.status, account_recharge_transaction.transaction_id, account_recharge_transaction.sub_account_id, account_rech
arge_transaction."from", account_recharge_transaction."to", account_recharge_transaction.type, account_recharge_transaction.updated_at, account_r
echarge_transaction.detail, account_recharge_transaction.operator, account_recharge_transaction.member_id
Index Cond: (account_recharge_transaction.account_id = '111441242101419222'::bigint)
Filter: (account_recharge_transaction.status = 2)
Buffers: shared hit=296 read=8652
I/O Timings: shared/local read=8873.460
Query Identifier: -7739276633008834702
Planning:
Buffers: shared hit=117
Planning Time: 0.611 ms
Execution Time: 8940.421 ms
(15 rows)
發(fā)現(xiàn)兩者差別在buffer的命中率,比如速度比較慢的時候是因為,讀取了8652個塊數(shù)據(jù),只命中了296個。這是postgre的特性,加載數(shù)據(jù)優(yōu)先從buffer shared(內(nèi)存)讀取,不存在再去磁盤讀取,磁盤讀取耗時較長。
新的疑問:磁盤讀取雖然慢,但也不能這么慢吧?。?!,到底讀了什么
buffer shared的單位是塊,可以通過這個sql查詢出來 ,大小是8k
SELECT current_setting('block_size') AS block_size; -- 8192
所以上面8948 * 8大約是70M
然后通過sql查詢索引的大小發(fā)現(xiàn)是72M
SELECT pg_size_pretty(pg_relation_size('account_recharge_transaction_030_pkey'))
AS index_size_in_mb;
似乎是這次查詢直接把這個索引給全部加載到buffer shared了,但是線上的buffer shard大小是多少呢?大約如下,這么大理論上應(yīng)該是夠的,不應(yīng)該不夠。
SHOW shared_buffers; -- 單位是block,大小是5492159 * 8k = 大約等于43G,總內(nèi)存大小64GB
沒有辦法在線上直接安裝擴(kuò)展查詢buffer shared數(shù)據(jù)是不是直接加載70M大部分都是索引數(shù)據(jù),索引在sandbox進(jìn)行測試。
CREATE EXTENSION pg_buffercache;
SELECT
c.relname,
count(*) AS buffers,
pg_size_pretty(count(*) * 8192) AS size
FROM
pg_buffercache b
JOIN pg_class c ON b.relfilenode = c.relfilenode
JOIN pg_database d ON (b.reldatabase = d.oid AND d.datname = current_database())
WHERE
c.relname = 'account_recharge_transaction_082'
OR c.relname = 'account_recharge_transaction_082_pkey'
GROUP BY
c.relname
ORDER BY
buffers DESC;
執(zhí)行上面的sql buffer cache確實包含了account_recharge_transaction_082_pkey所有的數(shù)據(jù),所以上面的sql查詢必須依賴所有的數(shù)據(jù)。
4. 真相大白
為什么這個會依賴整個索引呢,是因為分區(qū)表的主鍵索引的字段的順序不對,因為postgre數(shù)據(jù)庫hash分區(qū)的主鍵必須包含分區(qū)鍵,當(dāng)時是這樣建立的,account_recharge_transaction_pkey PRIMARY KEY (id, account_id),應(yīng)該是(account_id,id),account_id作為分區(qū)鍵在前,id在后。這樣能解釋通一系列疑問了。
- 為什么第一次explain正常?
因為按照account_id查詢一定命中一個分區(qū),這個是postgre的特有邏輯,根據(jù)where條件,執(zhí)行器發(fā)現(xiàn)用account_recharge_transaction_pkey索引比全表掃描更優(yōu)化,索引使用了這個索引,通過這個索引去查詢數(shù)據(jù),但是這個查詢并不是高效,因為account_id在后面,btree索引只能被全部加載到內(nèi)存然后在掃描整個索引。索引單看explain不加其他參數(shù)看不出來,反而會造成索引被充分利用錯覺,所以以后執(zhí)行explain要加其他參數(shù)綜合考慮。
- 為什么有的時候有問題?
100個分區(qū)按照每個70M算大約是7G,account_id查詢概率比較均等,如果想完全都通過buffer shared查詢到,至少有7Gbuffer shared空間給這個表,但是線上也有其他的表這個不容易保證,滿了以后就可能被覆蓋到,導(dǎo)致內(nèi)存沒有去磁盤讀。
- 為什么磁盤讀取那么慢?
主要還是需要加載一個分區(qū)的整個索引數(shù)據(jù),70M太多了。
5. 怎么解決?
理想方案:刪除現(xiàn)在的索引,重建新的主鍵索引用accont_id, id .但是線上數(shù)據(jù)太大了,這個會鎖表,耗費時間太長,不可取。
實際方案:線上本來就存在account_id 和 created_at, 通過created_at desc,id desc實現(xiàn)查詢。
優(yōu)化后對比發(fā)現(xiàn)加載的block數(shù)量大大減少,之前是9000左右,現(xiàn)在大多數(shù)都是10左右。
總結(jié):
- postgre的explain執(zhí)行要加其他參數(shù)觀察全面一些。
- 分析問題要定量,不能憑直覺,比如感覺數(shù)據(jù)不應(yīng)該那么多,不應(yīng)該耗費那么久,要想辦法去看看到底發(fā)生了什么。
- postgre hash分區(qū)建立主鍵索引一定把分區(qū)鍵放前面。