1. 問題背景

1.1 客戶端緩存問題

$ ceph -s
health HEALTH_WARN
mds0: Client xxx-online00.gz01 failing to respond to cache pressure

官方解釋

消息: “Client name failing to respond to cache pressure”
代碼: MDS_HEALTH_CLIENT_RECALL, MDS_HEALTH_CLIENT_RECALL_MANY
描述:
客戶端有各自的元數(shù)據(jù)緩存，客戶端緩存中的條目（比如索引節(jié)點(diǎn)）也會(huì)存在于 MDS 緩存中，所以當(dāng) MDS 需要削減其緩存時(shí)（保持在 mds_cache_size 以下），它也會(huì)發(fā)消息給客戶端讓它們削減自己的緩存。如果有客戶端沒響應(yīng)或者有缺陷，就會(huì)妨礙 MDS 將緩存保持在 mds_cache_size 以下， MDS 就有可能耗盡內(nèi)存而后崩潰。如果某個(gè)客戶端的響應(yīng)時(shí)間超過了 mds_recall_state_timeout （默認(rèn)為 60s ），這條消息就會(huì)出現(xiàn)。

1.2 服務(wù)端內(nèi)存不釋放

同上參考1.1 客戶端緩存問題

1.3 mds session的inode過多

客戶端session的inode太多，導(dǎo)致內(nèi)存很高，從而也導(dǎo)致主從mds切換加載inode慢，嚴(yán)重影響服務(wù)的可用性。

1.4 mds夯住問題或慢查詢

客戶端搜索遍歷查找文件（不可控)
session的 inode太大導(dǎo)致mds負(fù)載過高
日志級(jí)別開的太大，從而導(dǎo)致mds負(fù)載高

2. 分析思路

上面的幾個(gè)問題都是有一定的聯(lián)系，互相影響的。所以，我們先從已知的方向逐步深入分析問題，從而優(yōu)化解決問題。

2.1 組件通信流程圖

image.png

Client <–> MDS
元數(shù)據(jù)操作和capalities
Client <–> OSD
數(shù)據(jù)IO
Client <–> Monitor
認(rèn)證，集群map信息等
MDS <–> Monitor
心跳，集群map信息等
MDS <–> OSD
元數(shù)據(jù)IO
Monitor <–> OSD
心跳，集群map信息等

2.2 查看客戶端session

$ ceph --admin-daemon  /var/run/ceph/ceph-mds.ceph-xxxx-osd02.py.asok session ls
[
    {
        "id": 5122511,
        "num_leases": 0,
        "num_caps": 655,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 1,
        "reconnecting": false,
        "inst": "client.5122511 192.168.1.2:0\/2026289820",
        "client_metadata": {
            "ceph_sha1": "b1e0532418e4631af01acbc0cedd426f1905f4af",
            "ceph_version": "ceph version 0.94.10 (b1e0532418e4631af01acbc0cedd426f1905f4af)",
            "entity_id": "log_xxx_cephfs",
            "hostname": "ceph-test-osd02",
            "mount_point": "\/mnt\/log"
        }
    }
]

說明：

id：client唯一id
num_caps：client獲取的caps
inst：client端的ip和端口鏈接信息
ceph_version：client端的ceph-fuse版本，若使用kernel client，則為kernel_version
hostname：client端的主機(jī)名
mount_point：client在主機(jī)上對(duì)應(yīng)的mount point
pid：client端ceph-fuse進(jìn)程的pid

2.3 查看客戶端的inode數(shù)量

跟蹤代碼發(fā)現(xiàn)session里面的num_caps就是統(tǒng)計(jì)的客戶端的inode數(shù)量, 大概統(tǒng)計(jì)了下已經(jīng)打開的inode數(shù)量在400w左右。

image.png

總結(jié)：
可以查看客戶端的session信息，包含host、mount、inode等信息
可以統(tǒng)計(jì)所有客戶端session的inode數(shù)量。

2.4 嘗試mds主從切換

2.4.1 執(zhí)行過程如下

2018-04-27 19:24:03.923349 7f53015d7700  1 mds.0.2738 handle_mds_map state change up:boot --> up:replay
2018-04-27 19:24:03.923356 7f53015d7700  1 mds.0.2738 replay_start
2018-04-27 19:24:03.923360 7f53015d7700  1 mds.0.2738  recovery set is
2018-04-27 19:24:03.923365 7f53015d7700  1 mds.0.2738  waiting for osdmap 6339 (which blacklists prior instance)
2018-04-27 19:24:03.948526 7f52fc2ca700  0 mds.0.cache creating system inode with ino:100
2018-04-27 19:24:03.948675 7f52fc2ca700  0 mds.0.cache creating system inode with ino:1
2018-04-27 19:24:04.238128 7f52fa2b8700  1 mds.0.2738 replay_done
2018-04-27 19:24:04.238143 7f52fa2b8700  1 mds.0.2738 making mds journal writeable
2018-04-27 19:24:04.924352 7f53015d7700  1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:24:04.924357 7f53015d7700  1 mds.0.2738 handle_mds_map state change up:replay --> up:reconnect
2018-04-27 19:24:04.924370 7f53015d7700  1 mds.0.2738 reconnect_start
2018-04-27 19:24:04.924371 7f53015d7700  1 mds.0.2738 reopen_log
2018-04-27 19:24:04.924380 7f53015d7700  1 mds.0.server reconnect_clients -- 19 sessions
2018-04-27 19:24:04.926357 7f53015d7700  0 log_channel(cluster) log [DBG] : reconnect by client.4375 192.168.1.3:0/1796553051 after 0.001950
2018-04-27 19:24:04.926429 7f53015d7700  0 log_channel(cluster) log [DBG] : reconnect by client.4403 192.168.1.3:0/1032897847 after 0.002036
2018-04-27 19:24:15.228507 7f53015d7700  1 mds.0.2738 reconnect_done
2018-04-27 19:24:15.984143 7f53015d7700  1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:24:15.984148 7f53015d7700  1 mds.0.2738 handle_mds_map state change up:reconnect --> up:rejoin
2018-04-27 19:24:15.984156 7f53015d7700  1 mds.0.2738 rejoin_start
2018-04-27 19:25:15.987531 7f53015d7700  1 mds.0.2738 rejoin_joint_start
2018-04-27 19:27:40.105134 7f52fd4ce700  1 mds.0.2738 rejoin_done
2018-04-27 19:27:42.206654 7f53015d7700  1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:27:42.206658 7f53015d7700  1 mds.0.2738 handle_mds_map state change up:rejoin --> up:active
2018-04-27 19:27:42.206666 7f53015d7700  1 mds.0.2738 recovery_done -- successful recovery!

主從切換流程：

handle_mds_map state change up:boot --> up:replay
handle_mds_map state change up:replay --> up:reconnect
handle_mds_map state change up:reconnect --> up:rejoin
handle_mds_map state change up:rejoin --> up:active

up:boot

此狀態(tài)在啟動(dòng)期間被廣播到CEPH監(jiān)視器。這種狀態(tài)是不可見的，因?yàn)楸O(jiān)視器立即將MDS分配給可用的秩或命令MDS作為備用操作。這里記錄了完整性的狀態(tài)。

up:replay

日志恢復(fù)階段，他將日志內(nèi)容讀入內(nèi)存后，在內(nèi)存中進(jìn)行回放操作。

up:reconnect

恢復(fù)的mds需要與之前的客戶端重新建立連接，并且需要查詢之前客戶端發(fā)布的文件句柄，重新在mds的緩存中創(chuàng)建一致性功能和鎖的狀態(tài)。
mds不會(huì)同步記錄文件打開的信息，原因是需要避免在訪問mds時(shí)產(chǎn)生多余的延遲，并且大多數(shù)文件是以只讀方式打開。

up:rejoin

把客戶端的inode加載到mds cache。(耗時(shí)最多的地方)

為什么mds切換耗時(shí)比較高？

分析日志(發(fā)現(xiàn)執(zhí)行rejoin_start，rejoin_joint_start動(dòng)作耗時(shí)比較高)

2018-04-27 19:24:15.984156 7f53015d7700  1 mds.0.2738 rejoin_start
2018-04-27 19:25:15.987531 7f53015d7700  1 mds.0.2738 rejoin_joint_start
2018-04-27 19:27:40.105134 7f52fd4ce700  1 mds.0.2738 rejoin_done
2018-04-27 19:27:42.206654 7f53015d7700  1 mds.0.2738 handle_mds_map i am now mds.0.2738
2018-04-27 19:27:42.206658 7f53015d7700  1 mds.0.2738 handle_mds_map state change up:rejoin --> up:active

跟蹤代碼分析(在執(zhí)行process_imported_caps超時(shí)了，這個(gè)函數(shù)主要是打開inodes 加載到cache中)

image.png

總結(jié)：

主從切換時(shí)mds詳細(xì)狀態(tài)
主從切換時(shí)主要耗時(shí)的階段rejoin_start,加載客戶端session的inode信息

2.5 釋放客戶端inode

2.5.1 模擬客戶端session inode過多

查看客戶端session信息

#查看客戶端session的inode數(shù)量， num_caps:7
$ ceph daemon mds.ceph-xxx-osd01.ys session ls
[
    {
        "id": 554418,
        "num_leases": 0,
        "num_caps": 7,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 0,
        "reconnecting": false,
        "inst": "client.554418 192.168.1.2:0/1285681097",
        "client_metadata": {
            "ceph_sha1": "fe3a2269d799a8b950404cb2de11af84c7af0ea4",
            "ceph_version": "didi_dss version 12.2.2.4 (fe3a2269d799a8b950404cb2de11af84c7af0ea4) luminous (stable)",
            "entity_id": "admin",
            "hostname": "ceph-xxx-osd01.ys",
            "mount_point": "/mnt",
            "pid": "2084",
            "root": "/"
        }
    }
]

客戶端遍歷所有文件

#遍歷掛載目錄下所有文件
$ tree /mnt/
 
#查看這個(gè)目錄下面所有文件夾及文件數(shù)量
$ tree /mnt/ | wc -l
347

再次查看客戶端session信息

#查看客戶端session的inode數(shù)量， num_caps:346
$ ceph daemon mds.ceph-xxx-osd01.ys session ls
[
   {
        "id": 554418,
        "num_leases": 1,
        "num_caps": 346,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 2,
        "reconnecting": false,
        "inst": "client.554418 192.168.1.3:0/1285681097",
        "client_metadata": {
            "ceph_sha1": "fe3a2269d799a8b950404cb2de11af84c7af0ea4",
            "ceph_version": "didi_dss version 12.2.2.4 (fe3a2269d799a8b950404cb2de11af84c7af0ea4) luminous (stable)",
            "entity_id": "admin",
            "hostname": "ceph-xxx-osd01.ys",
            "mount_point": "/mnt",
            "pid": "2084",
            "root": "/"
        }
    }
]

結(jié)論：

客戶端通過遍歷掛載目錄下所有文件，發(fā)現(xiàn)服務(wù)端的session num_caps跟客戶端文件夾及文件梳理匹配
也就是說客戶端讀取過的文件句柄，都會(huì)在服務(wù)端記錄下來。 (mds緩存了dentry，并且以lru算法的緩存淘汰方式把dentry緩存在了內(nèi)存中)

2.5.2 釋放客戶端session inode

解決方案：

方案1：采用多活mds(目前12版 multi active不穩(wěn)定)
方案2：evict client(主動(dòng)踢出有問題的客戶端)
方案3：client remount(有問題的客戶端重新mount掛載)
方案4：drop_cache, limit_cache
- mds limiting cache by memory https://github.com/ceph/ceph/pull/17711
- (官方提供的mds 主動(dòng)刪除cache，補(bǔ)丁在review過程中個(gè)，目標(biāo)版本是ceph-14.0.0) https://github.com/ceph/ceph/pull/21566
  
  image.png

3. 深入分析

根據(jù)上面的分析，我們基本有一定的思路。這里我們繼續(xù)深入到方案2中。

3.1 剔除客戶端session

3.1.1 查看客戶端session信息

$ ceph daemon mds.ceph-xxx-osd01.ys session ls
[
    {
        "id": 554418,
        "num_leases": 0,
        "num_caps": 1589,
        "state": "open",
        "replay_requests": 0,
        "completed_requests": 2,
        "reconnecting": false,
        "inst": "client.554418 192.168.1.2:0/1285681097",
        "client_metadata": {
            "ceph_sha1": "fe3a2269d799a8b950404cb2de11af84c7af0ea4",
            "ceph_version": "didi_dss version 12.2.2.4 (fe3a2269d799a8b950404cb2de11af84c7af0ea4) luminous (stable)",
            "entity_id": "admin",
            "hostname": "ceph-xxx-osd01.ys",
            "mount_point": "/mnt",
            "pid": "2084",
            "root": "/"
        }
    }
]

3.1.2 剔除客戶端session信息

$ ceph tell mds.ceph-xxx-osd01.ys client evict id=554418

3.1.3 查看osd的blacklist

#超時(shí)恢復(fù)的時(shí)間是1小時(shí)，剔除的時(shí)間是16:16:30, 恢復(fù)的時(shí)間是17:16:30
$ ceph osd blacklist ls
listed 1 entries
192.168.1.2:0/1285681097 2018-10-10 17:16:30.819201

3.1.4 查看客戶端掛載目錄(不能讀寫)

$ ll /mnt
ls: cannot access /mnt: No such file or directory

3.1.5 恢復(fù)剔除的客戶端

$ ceph osd blacklist rm 192.168.1.2:0/1285681097
un-blacklisting 192.168.1.2:0/1285681097

3.1.6 查看客戶端掛載目錄(正常讀寫)

$ ll /mnt
total 147698
-rw-r--r-- 1 root root         4 Oct 10 15:25 aa.txt
...

3.1.7 osd黑名單的客戶端超時(shí)時(shí)間

舊版本超時(shí)時(shí)間為1小時(shí)
新版本12.2.2 超時(shí)時(shí)間為300s

總結(jié)：

可以通過指令client evict 剔除指定的客戶端
剔除的客戶端會(huì)加入到osd黑名單中
加入到osd黑名單中的客戶端都不能讀寫
恢復(fù)剔除的客戶端需要?jiǎng)h除osd黑名單中的客戶端信息
刪除osd黑名單中客戶端信息，客戶端立馬能正常讀寫
fuse客戶端可以恢復(fù)，kernel客戶端無法恢復(fù)
經(jīng)過試驗(yàn)證明：
- 剔除的用戶雖然釋放了inode
- 主mds的內(nèi)存并未釋放
- 主從切換后，備mds內(nèi)存會(huì)釋放
- 主從切換后，切換速度少了加載inode耗時(shí)的階段，從而加快切換速度，秒級(jí)別

3.2 內(nèi)存未釋放分析

3.2.1 依賴軟件

yum install google-perftools

3.2.2 查看mds內(nèi)存

top - 13:14:06 up 63 days, 21:36,  1 user,  load average: 0.06, 0.08, 0.12
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.2 us,  0.1 sy,  0.0 ni, 99.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13149521+total, 96957576 free, 10023744 used, 24513896 buff/cache
KiB Swap:        0 total,        0 free,        0 used. 11539159+avail Mem
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 4997 ceph      20   0 4081012 1.447g  11452 S   0.0  1.2   0:54.29 ceph-mds

3.2.3 啟動(dòng)剖析器

$ ceph tell mds.0 heap start_profiler
2018-10-12 13:15:35.979279 7f3430bfa700  0 client.5796596 ms_handle_reset on 192.168.1.2:6804/2252738073
2018-10-12 13:15:36.008686 7f34293fc700  0 client.5796599 ms_handle_reset on 192.168.1.2:6804/2252738073
mds.ceph-xxx-osd01.ys started profiler

3.2.4 轉(zhuǎn)儲(chǔ)堆棧信息

$ ceph tell mds.0 heap dump
2018-10-12 13:16:34.891671 7efd04bfa700  0 client.5796659 ms_handle_reset on 192.168.1.2:6804/2252738073
2018-10-12 13:16:34.922696 7efcfd3fc700  0 client.5796662 ms_handle_reset on 192.168.1.2:6804/2252738073
mds.ceph-xxx-osd01.ys dumping heap profile now.
------------------------------------------------
MALLOC:     1225155304 ( 1168.4 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +    289987072 (  276.6 MiB) Bytes in central cache freelist
MALLOC: +     11013456 (   10.5 MiB) Bytes in transfer cache freelist
MALLOC: +      7165384 (    6.8 MiB) Bytes in thread cache freelists
MALLOC: +      7598240 (    7.2 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =   1540919456 ( 1469.5 MiB) Actual memory used (physical + swap)
MALLOC: +    112582656 (  107.4 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =   1653502112 ( 1576.9 MiB) Virtual address space used
MALLOC:
MALLOC:          94545              Spans in use
MALLOC:             16              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------
Call ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).
Bytes released to the OS take up virtual address space but no physical memory.

3.2.5 google-pprof分析內(nèi)存堆棧

pprof --text bin/ceph-mds out/mds.a.profile.0001.heap
$ pprof --text bin/ceph-mds out/mds.a.profile.0008.heap
Using local file bin/ceph-mds.
Using local file out/mds.a.profile.0008.heap.
Total: 46.6 MB
    18.1  38.7%  38.7%     19.5  41.9% Server::prepare_new_inode
     6.2  13.3%  52.0%      6.2  13.3% std::_Rb_tree::_M_emplace_hint_unique (inline)
     5.0  10.7%  62.7%      5.8  12.3% CDir::add_null_dentry
     3.8   8.1%  70.8%      3.8   8.1% std::_Rb_tree::_Rb_tree_impl::_M_initialize (inline)
     3.6   7.7%  78.6%      3.6   7.7% ceph::logging::Log::create_entry
     3.1   6.7%  85.2%      3.1   6.7% Counter::_count (inline)
     2.6   5.5%  90.7%      2.6   5.5% ceph::buffer::raw_combined::create (inline)
     0.9   2.0%  92.8%      0.9   2.0% std::_Vector_base::_M_create_storage (inline)
     0.8   1.6%  94.4%      0.8   1.6% CDir::add_null_dentry (inline)
     0.6   1.2%  95.6%      0.6   1.2% CInode::add_client_cap (inline)
     0.5   1.1%  96.6%      0.5   1.1% std::string::_Rep::_S_create
     0.5   1.0%  97.6%      0.5   1.0% MDCache::add_inode (inline)
     0.2   0.5%  98.2%      0.3   0.6% decode_message
     0.2   0.4%  98.5%      0.2   0.5% MDCache::request_start (inline)
     0.1   0.2%  98.7%      0.1   0.3% CInode::project_inode
     0.1   0.2%  99.0%      0.1   0.2% std::_Rb_tree::_M_insert_unique (inline)
     0.1   0.1%  99.1%      0.1   0.1% std::string::_M_data (inline)

4. 總結(jié)

cephfs mds目前版本都是單活，對(duì)于復(fù)雜多變的客戶端可能會(huì)帶來一定的性能影響。
例如：在本地磁盤下去搜索一個(gè)文件，如果文件數(shù)過多的，對(duì)本機(jī)cpu以及負(fù)載也會(huì)帶來一定的沖擊，更何況是復(fù)雜多變的網(wǎng)絡(luò)磁盤。

目前推薦的優(yōu)化方案：

ceph-fuse客戶端Qos限速，避免IO一瞬間涌進(jìn)來導(dǎo)致mds抖動(dòng)(從客戶端限制IOPS,避免資源爭(zhēng)搶，對(duì)系統(tǒng)資源帶來沖擊)
多活mds, 目錄分片(最終解決方案，詳見：多活MDS的性能測(cè)試 )
mds在主處理流程中使用了單線程，這導(dǎo)致了其單個(gè)MDS的性能受到了限制，最大單個(gè)MDS可達(dá)8k ops/s，CPU利用率達(dá)到的 140%左右。
如果mds負(fù)載過高或者內(nèi)存過大，限制內(nèi)存或者定期的回收cache(減輕mds的壓力，提升吞吐 https://github.com/ceph/ceph/pull/17711/files)
剔除用戶可以釋放inode數(shù)量，但是不能減少內(nèi)存，如果此時(shí)切換主從可以加快切換速度。

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av

Ceph MDS問題分析

Ceph MDS問題分析

1. 問題背景

1.1 客戶端緩存問題

1.2 服務(wù)端內(nèi)存不釋放

1.3 mds session的inode過多

1.4 mds夯住問題或慢查詢

2. 分析思路

2.1 組件通信流程圖

2.2 查看客戶端session

2.3 查看客戶端的inode數(shù)量

2.4 嘗試mds主從切換

2.4.1 執(zhí)行過程如下

2.5 釋放客戶端inode

2.5.1 模擬客戶端session inode過多

2.5.2 釋放客戶端session inode

3. 深入分析

3.1 剔除客戶端session

3.1.1 查看客戶端session信息

3.1.2 剔除客戶端session信息

3.1.3 查看osd的blacklist

3.1.4 查看客戶端掛載目錄(不能讀寫)

3.1.5 恢復(fù)剔除的客戶端

3.1.6 查看客戶端掛載目錄(正常讀寫)

3.1.7 osd黑名單的客戶端超時(shí)時(shí)間

3.2 內(nèi)存未釋放分析

3.2.1 依賴軟件

3.2.2 查看mds內(nèi)存

3.2.3 啟動(dòng)剖析器

3.2.4 轉(zhuǎn)儲(chǔ)堆棧信息

3.2.5 google-pprof分析內(nèi)存堆棧

4. 總結(jié)

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九 欧美,1769亚洲,黄色成人av

Ceph MDS問題分析

1. 問題背景

1.1 客戶端緩存問題

1.2 服務(wù)端內(nèi)存不釋放

1.3 mds session的inode過多

1.4 mds夯住問題或慢查詢

2. 分析思路

2.1 組件通信流程圖

2.2 查看客戶端session

2.3 查看客戶端的inode數(shù)量

2.4 嘗試mds主從切換

2.4.1 執(zhí)行過程如下

2.5 釋放客戶端inode

2.5.1 模擬客戶端session inode過多

2.5.2 釋放客戶端session inode

3. 深入分析

3.1 剔除客戶端session

3.1.1 查看客戶端session信息

3.1.2 剔除客戶端session信息

3.1.3 查看osd的blacklist

3.1.4 查看客戶端掛載目錄(不能讀寫)

3.1.5 恢復(fù)剔除的客戶端

3.1.6 查看客戶端掛載目錄(正常讀寫)

3.1.7 osd黑名單的客戶端超時(shí)時(shí)間

3.2 內(nèi)存未釋放分析

3.2.1 依賴軟件

3.2.2 查看mds內(nèi)存

3.2.3 啟動(dòng)剖析器

3.2.4 轉(zhuǎn)儲(chǔ)堆棧信息

3.2.5 google-pprof分析內(nèi)存堆棧

4. 總結(jié)

相關(guān)閱讀更多精彩內(nèi)容

友情鏈接更多精彩內(nèi)容

色偷偷精品伊人,欧洲久久精品,欧美综合婷婷骚逼,国产AV主播,国产最新探花在线,九色在线视频一区,伊人大交九欧美,1769亚洲,黄色成人av