ES問題排查及優(yōu)化

一、背景
最近有報(bào)es搜索超時(shí)異常,服務(wù)器CPU飆高(達(dá)130%到),于是開始進(jìn)行排查

二、排查
1.項(xiàng)目組設(shè)置接口請(qǐng)求超時(shí)500ms,不是一直穩(wěn)定復(fù)現(xiàn),所以需要抓到盡可能詳細(xì)的可分析日志,于是采取
1)在es_server_api(對(duì)es二次封裝的接口) 中查詢的底層代碼中增加了日志信息(耗時(shí)+查詢語句)。跑了段時(shí)間發(fā)現(xiàn)確實(shí)有不少耗時(shí)超過500ms甚至1s
2)通過es的head插件開啟es的慢查詢?nèi)罩?,監(jiān)測(cè)底層搜索詳細(xì)信息,設(shè)置方式如下,針對(duì)es6.2.4版本,親測(cè)有效 (網(wǎng)上有的地址和設(shè)置項(xiàng)不對(duì)):
Put http://xx.xxx.xxx.xxxx:9200//myindex/_settings
{
"index.search.slowlog.threshold.query.warn":"5s",
"index.search.slowlog.threshold.query.info":"2s",
"index.search.slowlog.threshold.query.debug":"1s",
"index.search.slowlog.threshold.query.trace":"400ms",
"index.search.slowlog.threshold.fetch.warn":"1s",
"index.search.slowlog.threshold.fetch.info":"800ms",
"index.search.slowlog.threshold.fetch.debug":"500ms",
"index.search.slowlog.threshold.fetch.trace":"200ms"
}
由于ES是設(shè)置的query_then_fetch,所以對(duì)兩個(gè)階段的閾值進(jìn)行了設(shè)置,開啟后出現(xiàn)了慢查詢?nèi)罩?,耗時(shí)現(xiàn)象同項(xiàng)目中打印的日志一致。

2.根據(jù)記錄下來的查詢語句,通過es的head插件執(zhí)行,有時(shí)很快有時(shí)慢,和服務(wù)器上訪問的效果一致,于是對(duì)查詢語句進(jìn)行詳細(xì)的捕捉,方式是在查詢語句中增加"profile":true,就可以打印出每個(gè)查詢獲取的過程及耗時(shí)情況。具體每項(xiàng)設(shè)置及含義參考:https://www.elastic.co/guide/en/elasticsearch/reference/5.4/_profiling_queries.html
也沒發(fā)現(xiàn)蛛絲馬跡。

3.查看es的gc.log,發(fā)現(xiàn)gc.log中幾乎每秒都要觸發(fā)一次GC Full GC (Allocation Failure)。內(nèi)存不夠用,又沒有內(nèi)存可回收,所以GC也不斷。怪不得CPU這么高,大部分時(shí)間都用在gc上面了。于是查看es的jvm配置信息,發(fā)現(xiàn)其設(shè)置為:

# Heap configuration
-Xms1g
-Xmx1g 
## GC configuration
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly  [制止JVM自動(dòng)]

說明配置的內(nèi)存小了,于是將其中一個(gè)節(jié)點(diǎn)配置升級(jí)到3G,重啟后慢查詢?nèi)罩静蹲讲坏搅?,日志打印的耗時(shí)也降低了很多,基本上在10ms以內(nèi),重啟了另一臺(tái)也正常。
gc.log分析過程:

----step1-----
2019-08-21T09:20:23.852+0800: 26063583.125: [GC (CMS Initial Mark) [1 CMS-initial-mark: 707833K(707840K)] 734480K(1014528K), 0.0141329 secs] [Times: user=0.02 sys=0.00, real=0.02 secs] 
2019-08-21T09:20:23.866+0800: 26063583.139: Total time for which application threads were stopped: 0.0153949 seconds, Stopping threads took: 0.0001766 seconds
----step2-----
2019-08-21T09:20:23.866+0800: 26063583.139: [CMS-concurrent-mark-start]
2019-08-21T09:20:24.006+0800: 26063583.279: [CMS-concurrent-mark: 0.140/0.140 secs] [Times: user=0.28 sys=0.00, real=0.14 secs] 
----step3-----
2019-08-21T09:20:24.006+0800: 26063583.280: [CMS-concurrent-preclean-start]
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-preclean: 0.092/0.092 secs] [Times: user=0.11 sys=0.00, real=0.09 secs] 
----step4-----
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-abortable-preclean-start]
2019-08-21T09:20:24.098+0800: 26063583.371: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
----step5-----
2019-08-21T09:20:24.099+0800: 26063583.372: [GC (CMS Final Remark) [YG occupancy: 151902 K (306688 K)]2019-08-21T09:20:24.099+0800: 26063583.372: [Rescan (parallel) , 0.0238185 secs]2019-08-21T09:20:24.123+0800: 26063583.396: [weak refs processing, 0.0003479 secs]2019-08-21T09:20:24.123+0800: 26063583.397: [class unloading, 0.0332278 secs]2019-08-21T09:20:24.157+0800: 26063583.430: [scrub symbol table, 0.0102765 secs]2019-08-21T09:20:24.167+0800: 26063583.440: [scrub string table, 0.0013097 secs][1 CMS-remark: 707833K(707840K)] 859736K(1014528K), 0.0691496 secs] [Times: user=0.14 sys=0.00, real=0.07 secs] 
2019-08-21T09:20:24.168+0800: 26063583.442: Total time for which application threads were stopped: 0.0703531 seconds, Stopping threads took: 0.0001407 seconds
----step6-----
2019-08-21T09:20:24.169+0800: 26063583.442: [CMS-concurrent-sweep-start]
2019-08-21T09:20:24.206+0800: 26063583.480: [CMS-concurrent-sweep: 0.038/0.038 secs] [Times: user=0.04 sys=0.00, real=0.04 secs] 
----step7-----
2019-08-21T09:20:24.207+0800: 26063583.480: [CMS-concurrent-reset-start]
2019-08-21T09:20:24.208+0800: 26063583.481: [CMS-concurrent-reset: 0.002/0.002 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 

2019-08-21T09:20:24.945+0800: 26063584.218: [GC (Allocation Failure) 2019-08-21T09:20:24.945+0800: 26063584.218: [ParNew: 306407K->306407K(306688K), 0.0000269 secs]2019-08-21T09:20:24.945+0800: 26063584.218: [CMS: 707833K->707833K(707840K), 0.3487631 secs] 1014241K->729888K(1014528K), [Metaspace: 70690K->70690K(1116160K)], 0.3489630 secs] [Times: user=0.35 sys=0.00, real=0.35 secs] 
2019-08-21T09:20:25.294+0800: 26063584.567: Total time for which application threads were stopped: 0.3502331 seconds, Stopping threads took: 0.0000746 seconds

分析:其中對(duì)于 CMS回收器來說,總共分四大步驟(7個(gè)小步驟),其中initial mark 和remark 階段屬于stop the world 的,需要重點(diǎn)關(guān)注。由圖上log可知 兩個(gè)階段的停機(jī)時(shí)長(zhǎng)為 0.0153949s + 0.070353s,約為85毫秒,但最后一條顯示發(fā)生了GC。從日志來看,堆區(qū)總大小為1014528K,回收前堆區(qū)大小為1014241K,回收后堆區(qū)大小為729888K,從比例來看,回收后內(nèi)存依然占到72%,結(jié)合es的JVM參數(shù)-XX:CMSInitiatingOccupancyFraction=75 來看,雖然回收了但很快內(nèi)存占用率又達(dá)到75% 從而導(dǎo)致需要頻繁的gc。

三、處理
調(diào)整es可使用的內(nèi)存大小。編輯es/config/jvm.options,調(diào)整了Xms和Xmx的大小,由原來默認(rèn)的1g調(diào)整為3g。(官方建議這個(gè)值不要超過物理內(nèi)存的50%,也不要超過32G。詳見官網(wǎng)說明)接著重啟es就好了。cpu降下來了,也沒有新的慢查詢產(chǎn)生了,基本都在10ms以內(nèi)返回?cái)?shù)據(jù)。

四、參考文章
1.日志分析方法可參考:https://blog.csdn.net/z69183787/article/details/86526648
2.profile api 解析 :https://www.elastic.co/guide/en/elasticsearch/reference/5.4/_profiling_queries.html

最后編輯于
?著作權(quán)歸作者所有,轉(zhuǎn)載或內(nèi)容合作請(qǐng)聯(lián)系作者
【社區(qū)內(nèi)容提示】社區(qū)部分內(nèi)容疑似由AI輔助生成,瀏覽時(shí)請(qǐng)結(jié)合常識(shí)與多方信息審慎甄別。
平臺(tái)聲明:文章內(nèi)容(如有圖片或視頻亦包括在內(nèi))由作者上傳并發(fā)布,文章內(nèi)容僅代表作者本人觀點(diǎn),簡(jiǎn)書系信息發(fā)布平臺(tái),僅提供信息存儲(chǔ)服務(wù)。

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容