Refining Ceph Cluster Health Monitoring

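Requirements
When I built a monitoring and alerting system for Ceph, the cluster health check at first only distinguished OK, WARN, and ERROR, based on Ceph's status output. On reflection, that is not enough, because WARN and ERROR each cover many different conditions. If an alert about Ceph health arrives in the middle of the night, all it says is that the cluster has a problem, not what the problem is. During working hours that is easy to handle: log in to the Ceph environment and take a look. At night, however, some alerts are not that urgent and can wait until the next day. So these health states need to be broken down further.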

I therefore pulled the description messages behind HEALTH_OK, HEALTH_WARN, and HEALTH_ERR out of the Ceph source code, classified them, and assigned each class a status code so that alerts can be graded by level.
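As a rough illustration of the first step (getting the description messages out of the health output), here is a minimal sketch. It assumes the health report is available as JSON with an `overall_status` string and a `summary` list whose entries carry a `summary` message, which I believe matches what older (pre-Luminous) releases print for `ceph health --format json`; check this against your own version. The `parse_health` helper and the sample document are made up for the example.

```python
import json


def parse_health(raw):
    """Return (overall_status, [message, ...]) from a health JSON string.

    Assumed layout (verify against your Ceph release):
      {"overall_status": "HEALTH_WARN",
       "summary": [{"severity": "HEALTH_WARN", "summary": "..."}]}
    """
    data = json.loads(raw)
    # Fall back to safe defaults when a key is missing.
    status = data.get("overall_status", "")
    messages = [item.get("summary", "") for item in data.get("summary", [])]
    return status, messages


# Hypothetical sample, shaped like the assumed layout above.
SAMPLE = json.dumps({
    "overall_status": "HEALTH_WARN",
    "summary": [
        {"severity": "HEALTH_WARN", "summary": "Monitor clock skew detected"},
        {"severity": "HEALTH_WARN", "summary": "1 near full osd(s)"},
    ],
    "detail": [],
})

status, messages = parse_health(SAMPLE)
print(status)
for message in messages:
    print(" -", message)
```

In a real check the string would come from the ceph command line rather than from a sample, for instance by capturing the output of `ceph health --format json` with `subprocess.check_output`.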

Health messages reported by Ceph itself:
HEALTH_WARN:

| Health description message | What it indicates |
| --- | --- |
| Monitor clock skew detected | Clock skew between monitors |
| mons down, quorum | A Ceph Monitor is down |
| some monitors are running older code | Visible right after deployment; does not appear during operation |
| in osds are down | Appears after an OSD goes down |
| flag(s) set | Cluster flags are set; can be ignored |
| crush map has legacy tunables | Visible right after deployment; does not appear during operation |
| crush map has straw_calc_version=0 | Visible right after deployment; does not appear during operation |
| cache pools are missing hit_sets | Appears when a cache tier is in use |
| no legacy OSD present but 'sortbitwise' flag is not set | Visible right after deployment; does not appear during operation |
| has mon_osd_down_out_interval set to 0 | Appears when mon_osd_down_out_interval is set to 0, which has the same effect as the noout flag |
| 'require_jewel_osds' osdmap flag is not set | Visible right after deployment; does not appear during operation |
| is full | Appears when a pool is full |
| near full osd | Warns when an OSD is nearly full |
| unscrubbed pgs | Some PGs have not been scrubbed |
| pgs stuck | Shown when PGs are stuck in an unhealthy state |
| requests are blocked | Warns about slow requests |
| osds have slow requests | Warns about slow requests |
| recovery | Reported when recovery is needed |
| at/near target max | Warns when a cache tier is in use |
| too few PGs per OSD | Too few PGs per OSD |
| too many PGs per OSD | Too many PGs per OSD |
| pgp_num | pg_num is greater than pgp_num |
| has many more objects per pg than average (too few pgs?) | Too many objects per PG |

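HEALTH_ERR:

| Health description message | What it indicates |
| --- | --- |
| no osds | Visible right after deployment; does not appear during operation |
| full osd | Appears when an OSD is full |
| pgs are stuck inactive for more than | PGs are in the inactive state; they can be neither read nor written |
| scrub errors | Scrub errors occurred. Open question: did the scrub itself fail, or did it find inconsistent PGs? |
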
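Handling in the current monitoring code
From the messages above I picked the few critical ones and gave each its own status code. Only these are tracked; the rest either do not appear during normal operation or relate to features we do not currently use, so they are ignored.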

Ceph Health Status Code:

| Code | Decimal value |
| --- | --- |
| Other warnings | 0 |
| HEALTH_OK | 1 |
| HEALTH_CLOCK_SKEW = 1 << 1 | 2 |
| HEALTH_NEAR_FULL = 1 << 2 | 4 |
| HEALTH_FULL = 1 << 3 | 8 |
| HEALTH_SLOW_REQUEST = 1 << 4 | 16 |
| HEALTH_PG_STALE = 1 << 5 | 32 |
| HEALTH_SCRUB_ERROR = 1 << 6 | 64 |

Note: the alert description includes a basic legend for the status codes:

ceph cluster not health; clock skew:2,nearfull:4,full:8,slow_request:16,pg_stale:32,scrub_error:64,others:0
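
To make the status code scheme concrete, here is a minimal sketch of how health messages could be mapped to the codes listed above. The bit values come from the status code table; everything else is my assumption for illustration: the `classify` and `decode` helpers, the matching patterns (in particular matching "stale" for HEALTH_PG_STALE), and the idea of OR-ing the bits together when several conditions are present at once. It is not the actual monitoring code.

```python
import re

# Bit values copied from the status code table.
HEALTH_OTHER = 0
HEALTH_OK = 1
HEALTH_CLOCK_SKEW = 1 << 1
HEALTH_NEAR_FULL = 1 << 2
HEALTH_FULL = 1 << 3
HEALTH_SLOW_REQUEST = 1 << 4
HEALTH_PG_STALE = 1 << 5
HEALTH_SCRUB_ERROR = 1 << 6

# (pattern, flag, label) triples. The patterns are assumptions based on
# the message fragments listed in this article; labels follow the legend
# used in the alert description.
RULES = [
    (re.compile(r"clock skew"), HEALTH_CLOCK_SKEW, "clock skew"),
    (re.compile(r"near full osd"), HEALTH_NEAR_FULL, "nearfull"),
    # "near full osd" also contains "full osd", so exclude that case.
    (re.compile(r"(?<!near )full osd"), HEALTH_FULL, "full"),
    (re.compile(r"requests are blocked|osds have slow requests"),
     HEALTH_SLOW_REQUEST, "slow_request"),
    (re.compile(r"stale"), HEALTH_PG_STALE, "pg_stale"),
    (re.compile(r"scrub errors"), HEALTH_SCRUB_ERROR, "scrub_error"),
]


def classify(overall_status, messages):
    """Fold the health messages into one numeric status code."""
    if overall_status == "HEALTH_OK":
        return HEALTH_OK
    code = HEALTH_OTHER
    for message in messages:
        for pattern, flag, _label in RULES:
            if pattern.search(message):
                code |= flag
    return code  # stays 0 ("others") when nothing matched


def decode(code):
    """List the labels of the conditions encoded in a status code."""
    if code == HEALTH_OK:
        return ["ok"]
    labels = [label for _pattern, flag, label in RULES if code & flag]
    return labels or ["others"]


code = classify("HEALTH_WARN",
                ["Monitor clock skew detected", "1 near full osd(s)"])
print(code, decode(code))
```

With this layout a single number can carry several conditions at once, and the on-call person can decode it using the legend in the alert text. Whether to combine bits or report only the most severe condition is a design choice the table itself does not settle.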

Links
https://access.redhat.com/documentation/en-us/red_hat_ceph_storage/2/html/troubleshooting_guide/initial-troubleshooting

Source details
Appendix:

Note: a line ending in [detail] is a description that can be seen in the output of ceph health detail.
HEALTH_WARN:

【Monitor.cc:】
Monitor clock skew detected

【MonmapMonitor.cc:】
mons down, quorum
is down (out of quorum) [detail]
some monitors are running older code
only supports the "classic" command set [detail]

【OSDMonitor.cc:】
osd." << i << " is down since epoch [detail]
in osds are down
flag(s) set
crush map has legacy tunables (require
see http://ceph.com/docs/master/rados/operations/crush-map/#tunables [detail]
crush map has straw_calc_version=0
see http://ceph.com/docs/master/rados/operations/crush-map/#tunables [detail]
with cache_mode needs hit_set_type to be set but it is not [detail]
cache pools are missing hit_sets
no legacy OSD present but 'sortbitwise' flag is not set
has mon_osd_down_out_interval set to 0
this has the same effect as the 'noout' flag [detail]
'require_jewel_osds' osdmap flag is not set
is full
near full osd

【PGMonitor.cc:】
current state/last acting [detail]
ops are blocked > [detail]
deep-scrubbed, last_deep_scrub_stamp [detail]
unscrubbed pgs
pgs stuck
min_size from / may help; search ceph.com/docs for 'incomplete [detail]
requests are blocked >
osds have slow requests
recovery
objects at/near target max [detail]
B at/near target max [detail]
at/near target max
too few PGs per OSD
too many PGs per OSD

pgp_num
has many more objects per pg than average (too few pgs?)

HEALTH_ERR:

【OSDMonitor.cc:】
no osds
full osd

【PGMonitor.cc:】
pgs are stuck inactive for more than
scrub errors
