Analysis of Ceph Configuration Parameters

Overview
Ceph has a great many configuration parameters, and the web is full of tuning guides listing them. But why are the parameters set to those values, and are those values reasonable? Explanations are scarce.

Starting from our current ceph.conf file, this article explains every setting in it, to serve as a basis for future parameter tuning and as learning material for new team members.

Parameter Details
1. Some fixed configuration parameters

fsid = 6d529c3d-5745-4fa5-*-*
mon_initial_members = **, **, **
mon_host = *.*.*.*, *.*.*.*, *.*.*.*

The above are usually generated by ceph-deploy. They are all ceph monitor related parameters and need not be modified.

2. Network configuration parameters

public_network = 10.10.2.0/24  default ""
cluster_network = 10.10.2.0/24 default ""
public network: carries monitor-osd, client-monitor and client-osd traffic; ideally a high-bandwidth 10GbE network;
cluster network: carries osd-osd traffic (replication, recovery); usually also a high-bandwidth 10GbE network;
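In our cluster the two networks share one subnet. On hosts with two NICs, a minimal split-network sketch might look as follows (the 10.10.3.0/24 subnet is hypothetical):

[global]
public_network = 10.10.2.0/24    # client/monitor-facing traffic
cluster_network = 10.10.3.0/24   # dedicated to OSD replication and recovery traffic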

Reference: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
3. Pool size configuration parameters

osd_pool_default_size = 3       default 3
osd_pool_default_min_size = 1   default 0 // 0 means no specific default; Ceph uses size - size/2

These are the default size/min_size values applied when a ceph pool is created. They are commonly set to 3 and 1; three replicas are enough to guarantee data reliability.
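Note these defaults only apply at pool creation time. For an existing pool the values can be inspected and changed per pool; a sketch, using the hypothetical pool name mypool:

# ceph osd pool get mypool size
# ceph osd pool get mypool min_size
# ceph osd pool set mypool min_size 1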
4. Authentication configuration parameters

auth_service_required = none   default "cephx"
auth_client_required = none    default "cephx, none"
auth_cluster_required = none   default "cephx"

These are the Ceph authentication (cephx) parameters; authentication is enabled by default.
For clusters that run only on a trusted internal network they are often set to none, i.e. authentication is disabled, which slightly speeds up access to the cluster.

5. OSD down/out configuration parameters

mon_osd_down_out_interval = 3600  default 300 // seconds
mon_osd_min_down_reporters = 3    default 2
mon_osd_report_timeout = 900      default 900
osd_heartbeat_interval = 10       default 6
osd_heartbeat_grace = 60          default 20
mon_osd_down_out_interval: how long ceph waits after marking an osd down before also marking it out
mon_osd_min_down_reporters: the minimum number of reporters required for the mon to mark an osd down (each other osd that reports the osd as down counts as one reporter)
mon_osd_report_timeout: the longest the mon waits for an osd's report before marking that osd down
osd_heartbeat_interval: the interval at which an osd sends heartbeats to other osds (only osds sharing a PG heartbeat each other)
osd_heartbeat_grace: how long an osd waits without heartbeats before reporting a peer osd as down. Raising grace has a side effect: if an osd dies unexpectedly, the cluster must wait the full grace period for peer reports, and during that window IO to the PGs served by that osd hangs. So avoid setting grace too high.

Setting these parameters sensibly for your environment reduces the time and probability of hung IO by detecting down osds promptly, while delaying the down-to-out transition to avoid unnecessary data recovery caused by network flapping.
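These options can also be adjusted on a running cluster; a sketch using injectargs (the values are illustrative, injected values are not persistent, and some options only take full effect after a restart):

# ceph tell mon.* injectargs '--mon_osd_down_out_interval 3600'
# ceph tell osd.* injectargs '--osd_heartbeat_grace 60'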

References:

http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/

http://blog.wjin.org/posts/ceph-osd-heartbeat.html

6. Objecter configuration parameters

objecter_inflight_ops = 10240               default 1024
objecter_inflight_op_bytes = 1048576000     default 100M
These are the client-side objecter throttles; their settings affect the performance of librbd and RGW clients.

Tuning suggestion:

Increase both values; a way to verify the effective values is sketched below.
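To confirm what a running client daemon actually uses, the values can be read through its admin socket; a sketch for an RGW instance (the socket path is illustrative):

# ceph daemon /var/run/ceph/ceph-client.rgw.*.asok config get objecter_inflight_ops
# ceph daemon /var/run/ceph/ceph-client.rgw.*.asok config get objecter_inflight_op_bytes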

7. Ceph RGW configuration parameters

rgw_frontends = "civetweb num_threads=500"              default "fastcgi, civetweb port=7480"
rgw_thread_pool_size = 200                              default 100
rgw_override_bucket_index_max_shards = 20               default 0

rgw_max_chunk_size = 1048576                            default 512 * 1024
rgw_cache_lru_size = 10000                              default 10000 // num of entries in rgw cache
rgw_bucket_default_quota_max_objects = *                default -1 // number of objects allowed
rgw_frontends: the RGW frontend configuration, normally the lightweight civetweb; port is the port RGW listens on, set according to your environment; num_threads is the number of civetweb threads;
rgw_thread_pool_size: the RGW frontend thread count, same meaning as num_threads in rgw_frontends, but num_threads takes precedence over rgw_thread_pool_size; only one of the two needs to be configured;
rgw_override_bucket_index_max_shards: the maximum number of shards for a bucket's index object; increasing it improves bucket index object access times, but lengthens bucket listing (ls);
rgw_max_chunk_size: the maximum RGW chunk size; for object storage workloads dominated by large files this can be increased;

rgw_cache_lru_size: the size of RGW's LRU cache; for read-heavy workloads, increasing it speeds up RGW responses;
rgw_bucket_default_quota_max_objects: together with the quota mechanism, caps the default maximum number of objects per bucket;
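A minimal sketch of these options in a per-instance ceph.conf section (the instance name client.rgw.gateway-node1 and the port are hypothetical):

[client.rgw.gateway-node1]
rgw_frontends = "civetweb port=7480 num_threads=500"
rgw_override_bucket_index_max_shards = 20
rgw_cache_lru_size = 10000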

References:

http://docs.ceph.com/docs/jewel/install/install-ceph-gateway/

http://ceph-users.ceph.narkive.com/mdB90g7R/rgw-increase-the-first-chunk-size

https://access.redhat.com/solutions/2122231

8. Debug configuration parameters

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcacher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_paxos = 0/0
debug_rgw = 0/0 

This disables all debug logging, which speeds up the cluster to some extent, but it also discards key logs, making problems harder to analyze afterwards.
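When troubleshooting, the relevant subsystems can be re-enabled at runtime without restarting the daemons; a sketch (the subsystems and levels are illustrative):

# ceph tell osd.* injectargs '--debug_osd 5/5 --debug_ms 1/1'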
Reference:

http://www.10tiao.com/html/362/201609/2654062487/1.html

9. OSD op configuration parameters

osd_enable_op_tracker = true       default true
osd_num_op_tracker_shard = 32      default 32
osd_op_threads = 5                 default 2
osd_disk_threads = 1               default 1
osd_op_num_shards = 15             default 5
osd_op_num_threads_per_shard = 2   default 2
osd_enable_op_tracker: enables tracking of osd op state, default true; disabling it is not recommended, because the osd's slow_request, ops_in_flight and historic_ops statistics then stop working:
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck.  Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck.  Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.

With the op tracker enabled, if cluster IOPS are high, osd_num_op_tracker_shard can be increased, since every shard has its own independent mutex:

class OpTracker {
...
    struct ShardedTrackingData {
        Mutex ops_in_flight_lock_sharded;
        xlist<TrackedOp *> ops_in_flight_sharded;
        explicit ShardedTrackingData(string lock_name):
            ops_in_flight_lock_sharded(lock_name.c_str()) {}
    };
    vector<ShardedTrackingData*> sharded_in_flight_list;
    uint32_t num_optracker_shards;
...
};
osd_op_threads: serves the work queues peering_wq (osd peering requests) and recovery_gen_wq (PG recovery requests);
osd_disk_threads: serves the work queue remove_wq (PG removal requests);


osd_op_num_shards and osd_op_num_threads_per_shard: the corresponding thread pool is osd_op_tp and the work queue is op_shardedwq;

op_shardedwq handles the following request types:

OpRequestRef

PGSnapTrim

PGScrub

Increasing osd_op_num_shards raises the number of threads processing osd ops, which increases concurrency and improves OSD performance; see the sketch below.
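A minimal sketch of the resulting thread count under the settings above (the total follows directly from the two options):

[osd]
osd_op_num_shards = 15
osd_op_num_threads_per_shard = 2
# osd_op_tp worker threads = osd_op_num_shards * osd_op_num_threads_per_shard = 30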

10. OSD client message configuration parameters

osd_client_message_size_cap = 1048576000  default 500*1024L*1024L     // client data allowed in-memory (in bytes)
osd_client_message_cap = 1000             default 100     // num client messages allowed in-memory

These cap the client messages an osd may hold in memory. Larger values raise the osd's processing capacity but consume more system memory.
Tuning suggestion:
When the server has ample memory, increase both values appropriately.
11. OSD scrub configuration parameters

osd_scrub_begin_hour = 10                default 0
osd_scrub_end_hour = 5                   default 24

// The time in seconds that scrubbing sleeps between two consecutive scrubs

osd_scrub_sleep = 1                      default 0        // sleep between [deep]scrub ops

osd_scrub_load_threshold = 8             default 0.5

// minimum/maximum number of objects per chunky scrub; the values below are the defaults

osd_scrub_chunk_min = 5
osd_scrub_chunk_max = 25

Ceph osd scrub is the mechanism that guarantees data consistency in ceph. Scrub operates per PG, and each scrub acquires the PG lock, so it can interfere with the PG's normal IO.
Ceph later introduced chunky scrub: each pass scrubs only a subset of the PG's objects, releases the PG lock when done, and queues the next scrub of that PG. This greatly shortens the time the PG lock is held during scrub and limits the impact on the PG's normal IO.
Likewise, the osd_scrub_sleep parameter makes the scrub thread release the PG lock and sleep for a while before each scrub, which also reduces scrub's impact on normal PG IO.
Tuning suggestions:

osd_scrub_begin_hour and osd_scrub_end_hour: the window in which OSD scrub may start, chosen according to business load;
osd_scrub_sleep: how long the osd sleeps before each scrub; a known bug is related to this option (see the tracker link below), so disabling it is advised;
osd_scrub_load_threshold: the system load above which the osd will not start a scrub; set it according to your load average;
osd_scrub_chunk_min and osd_scrub_chunk_max: set according to the number of objects per PG; for RGW workloads consisting entirely of small files, both should be increased; an operational alternative is sketched below.
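During peak business hours scrub can also be suppressed cluster-wide via the noscrub flags; a sketch (remember to unset them afterwards):

# ceph osd set noscrub
# ceph osd set nodeep-scrub
...
# ceph osd unset noscrub
# ceph osd unset nodeep-scrub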

References:

http://www.itdecent.cn/p/ea2296e1555c

http://tracker.ceph.com/issues/19497

12. OSD thread timeout configuration parameters

osd_op_thread_timeout = 100                 default 15
osd_op_thread_suicide_timeout = 300         default 150

osd_recovery_thread_timeout = 100           default 30
osd_recovery_thread_suicide_timeout = 300   default 300
osd_op_thread_timeout and osd_op_thread_suicide_timeout govern these work queues:
op_shardedwq - handles OpRequestRef, PGSnapTrim and PGScrub requests
peering_wq - handles osd peering requests
osd_recovery_thread_timeout and osd_recovery_thread_suicide_timeout govern this work queue:

recovery_wq - handles PG recovery requests

All of Ceph's work queues share the base class WorkQueue_, defined as follows:

/// Pool of threads that share work submitted to multiple work queues.
class ThreadPool : public md_config_obs_t {
...
    /// Basic interface to a work queue used by the worker threads.
    struct WorkQueue_ {
        string name;
        time_t timeout_interval, suicide_interval;
        WorkQueue_(string n, time_t ti, time_t sti)
            : name(n), timeout_interval(ti), suicide_interval(sti)
        { }
...

timeout_interval and suicide_interval here correspond to the timeout and suicide_timeout options described above.
While a thread processes a request from a work queue, it is subject to both limits:

timeout_interval - when exceeded, m_unhealthy_workers is incremented
suicide_interval - when exceeded, an assert fires and the OSD process crashes
The corresponding check function is:

bool HeartbeatMap::_check(const heartbeat_handle_d *h, const char *who, time_t now)
{
    bool healthy = true;
    time_t was;
    was = h->timeout.read();
    if (was && was < now) {
        ldout(m_cct, 1) << who << " '" << h->name << "'"
                        << " had timed out after " << h->grace << dendl;
        healthy = false;
    }
    was = h->suicide_timeout.read();
    if (was && was < now) {
        ldout(m_cct, 1) << who << " '" << h->name << "'"
                        << " had suicide timed out after " << h->suicide_grace << dendl;
        assert(0 == "hit suicide timeout");
    }
    return healthy;
}

Currently only RGW registers a perfcounter for its workers, so only RGW can report total/unhealthy worker counts via perf dump:

[root@ node1]# ceph daemon /var/run/ceph/ceph-client.rgw.*.asok perf dump | grep worker
        "total_workers": 32,
        "unhealthy_workers": 0

The corresponding option is:

OPTION(rgw_num_async_rados_threads, OPT_INT, 32) // num of threads to use for async rados operations

Tuning suggestions:

*_thread_timeout: the smaller the value, the sooner slow requests are detected, so do not set it very high; on fast devices in particular, lower it;
*_thread_suicide_timeout: setting it too low makes the OSD crash on timeout, so raise it; this matters especially after the corresponding throttles have been increased.

13. Filestore op thread configuration parameters

filestore_op_threads = 5                    default 2
filestore_op_thread_timeout = 100           default 60
filestore_op_thread_suicide_timeout = 300   default 180
filestore_op_threads: sets the op_tp thread pool, whose work queue is op_wq; all filestore requests pass through op_wq;
increasing it raises filestore's processing capacity and performance; tune it together with the filestore throttles;
filestore_op_thread_timeout and filestore_op_thread_suicide_timeout govern the work queue:

op_wq
Their meaning matches the thread_timeout/thread_suicide_timeout described in the previous section.

14. Filestore merge/split configuration parameters

filestore_merge_threshold = -1    default 10
filestore_split_multiple = 10000  default 2
These two parameters govern filestore directory splitting/merging. The maximum number of files allowed in each filestore directory is:
filestore_split_multiple * abs(filestore_merge_threshold) * 16

In RGW small-file workloads the default limit (2 * 10 * 16 = 320 files) is reached very easily, and a filestore split triggered in the middle of writes hurts filestore performance badly.

When a filestore directory splits, it splits into multiple directory levels, with 16 subdirectories at the lowest level, according to the following rule:

For example PG 31.4C0 has a hash ending in 4C0; when its directory splits, it becomes DIR_0/DIR_C/DIR_4/{DIR_0, ..., DIR_F};

Objects in the original directory are then distributed into the subdirectories. Object names have the form _head_xxxxX4C0: whatever hex digit X is, the object goes into subdirectory DIR_X. For example object _head_xxxxA4C0 goes into DIR_0/DIR_C/DIR_4/DIR_A.

Workarounds:

1) Increase the merge/split parameters so that a single directory can hold more files;

2) Set filestore_merge_threshold to a negative value; this triggers directory pre-splitting early, avoiding a burst of splits concentrated in one time window (we have not investigated the detailed mechanism);

3) Specify expected-num-objects when creating the pool; the split subdirectories are then created at pool-creation time according to the split rule, so later splits never impact filestore performance (see the sketch below).
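A sketch of workaround 3) on jewel, where the pre-split only happens when filestore_merge_threshold is negative; the pool name, PG counts, ruleset name and object count here are illustrative:

# with our settings, max files per directory: 10000 * abs(-1) * 16 = 160000
# pre-create split directories for an expected one million objects:
# ceph osd pool create mypool 1024 1024 replicated replicated_ruleset 1000000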

References:

http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/

http://docs.ceph.com/docs/jewel/rados/operations/pools/#create-a-pool

http://blog.csdn.net/for_tech/article/details/51251936

http://ivanjobs.github.io/page3/

15. Filestore fd cache configuration parameters

filestore_fd_cache_shards =  32  default 16     // FD number of shards
filestore_fd_cache_size = 32768  default 128  // FD lru size
The filestore fd cache keeps file descriptors of filestore files open to speed up access; for any workload that is not write-once, increasing these values improves filestore performance very noticeably. Note that the OSD process's open-file limit must be high enough to accommodate the cache.

16. Filestore sync configuration parameters

filestore_wbthrottle_enable = false   default true      recommended off for SSDs
filestore_min_sync_interval = 1       default 0.01 s    minimum interval between syncs of fs data to disk, FileStore::sync_entry()
filestore_max_sync_interval = 10      default 5 s       maximum interval between syncs of fs data to disk, FileStore::sync_entry()
filestore_commit_timeout = 1000       default 600 s     FileStore::sync_entry() creates new SyncEntryTimeout(m_filestore_commit_timeout)
filestore_wbthrottle_enable controls the filestore writeback throttle, i.e. the thresholds on the amount of data filestore has queued in the op_wq work queue; it defaults to true, and when enabled the XFS-related options are:
OPTION(filestore_wbthrottle_xfs_bytes_start_flusher, OPT_U64, 41943040)
OPTION(filestore_wbthrottle_xfs_bytes_hard_limit, OPT_U64, 419430400)
OPTION(filestore_wbthrottle_xfs_ios_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_ios_hard_limit, OPT_U64, 5000)
OPTION(filestore_wbthrottle_xfs_inodes_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_inodes_hard_limit, OPT_U64, 5000)

With ordinary HDDs it can stay true; for SSDs it is recommended to disable the writeback throttle.

filestore_min_sync_interval and filestore_max_sync_interval set how often filestore flushes outstanding IO to disk. Larger values let the system merge as much IO as possible and reduce filestore's disk-write pressure, but they also increase the page cache's memory footprint and the potential for data loss.

filestore_commit_timeout is the timeout for filestore's sync entry to complete; when filestore is under heavy load, raising it helps avoid OSD crashes caused by IO timeouts.

17. Filestore throttle configuration parameters

filestore_expected_throughput_bytes =  536870912   default 200MB    /// Expected filestore throughput in B/s
filestore_expected_throughput_ops = 2000           default 200      /// Expected filestore throughput in ops/s
filestore_queue_max_bytes= 1048576000              default 100MB
filestore_queue_max_ops = 5000                     default 50

/// Use above to inject delays intended to keep the op queue between low and high
filestore_queue_low_threshhold = 0.3               default 0.3
filestore_queue_high_threshhold = 0.9              default 0.9

filestore_queue_high_delay_multiple = 2            default 0    /// Filestore high delay multiple.  Defaults to 0 (disabled)
filestore_queue_max_delay_multiple = 10            default 0    /// Filestore max delay multiple.  Defaults to 0 (disabled)

The jewel release introduced a dynamic throttle to smooth out the long-tail latency problem caused by the plain throttle.
With ordinary disks the previous throttle mechanism works well, which is why filestore_queue_high_delay_multiple and filestore_queue_max_delay_multiple both default to 0.
For fast disks, run the small benchmark tool ceph_smalliobenchfs before deployment to derive suitable values.

BackoffThrottle is described as follows:

/**
* BackoffThrottle
*
* Creates a throttle which gradually induces delays when get() is called
* based on params low_threshhold, high_threshhold, expected_throughput,
* high_multiple, and max_multiple.
*
* In [0, low_threshhold), we want no delay.
*
* In [low_threshhold, high_threshhold), delays should be injected based
* on a line from 0 at low_threshhold to
* high_multiple * (1/expected_throughput) at high_threshhold.
*
* In [high_threshhold, 1), we want delays injected based on a line from
* (high_multiple * (1/expected_throughput)) at high_threshhold to
* (high_multiple * (1/expected_throughput)) +
* (max_multiple * (1/expected_throughput)) at 1.
*
* Let the current throttle ratio (current/max) be r, low_threshhold be l,
* high_threshhold be h, high_delay (high_multiple / expected_throughput) be e,
* and max_delay (max_multiple / expected_throughput) be m.
*
* delay = 0, r \in [0, l)
* delay = (r - l) * (e / (h - l)), r \in [l, h)
* delay = e + (r - h) * (m / (1 - h)), r \in [h, 1)
*/ 
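A worked example of the piecewise delay formula, using this section's values (expected throughput 2000 ops/s, high_multiple 2, max_multiple 10, l = 0.3, h = 0.9), so e = 2/2000 = 1 ms and m = 10/2000 = 5 ms:

r = 0.20 -> delay = 0
r = 0.60 -> delay = (0.60 - 0.30) * (0.001 / (0.9 - 0.3)) = 0.5 ms
r = 0.95 -> delay = 0.001 + (0.95 - 0.90) * (0.005 / (1 - 0.9)) = 3.5 ms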

References:

http://docs.ceph.com/docs/jewel/dev/osd_internals/osd_throttles/
http://blog.wjin.org/posts/ceph-dynamic-throttle.html
https://github.com/ceph/ceph/blob/master/src/doc/dynamic-throttle.txt
Ceph BackoffThrottle分析

18. Filestore finisher threads configuration parameters

filestore_ondisk_finisher_threads = 2 default 1
filestore_apply_finisher_threads = 2  default 1

These set the number of finisher threads for filestore commit/apply, both 1 by default; every completed IO commit/apply must pass through the corresponding ondisk/apply finisher thread.
With ordinary HDDs the disk is the bottleneck and a single finisher thread keeps up fine.

With fast disks, IO completes quickly and a single finisher thread cannot process that many commit/apply replies, so it becomes the bottleneck; this is why jewel introduced the finisher thread pool options. Setting them to 2 is generally sufficient.

19. Journal configuration parameters

journal_max_write_bytes=1048576000       default 10M
journal_max_write_entries=5000           default 100

journal_throttle_high_multiple = 2       default 0    /// Multiple over expected at high_threshhold. Defaults to 0 (disabled).
journal_throttle_max_multiple = 10       default 0    /// Multiple over expected at max.  Defaults to 0 (disabled).

/// Target range for journal fullness

OPTION(journal_throttle_low_threshhold, OPT_DOUBLE, 0.6)
OPTION(journal_throttle_high_threshhold, OPT_DOUBLE, 0.9)

journal_max_write_bytes and journal_max_write_entries cap the data volume and the number of entries of a single journal write.

When the journal sits on an SSD partition, increase both to raise journal throughput.

journal_throttle_high_multiple and journal_throttle_max_multiple configure JournalThrottle. JournalThrottle is a wrapper around BackoffThrottle, so it works exactly like the dynamic throttle described in the filestore throttle section:

int FileJournal::set_throttle_params()
{
    stringstream ss;
    bool valid = throttle.set_params(
                     g_conf->journal_throttle_low_threshhold,
                     g_conf->journal_throttle_high_threshhold,
                     g_conf->filestore_expected_throughput_bytes,
                     g_conf->journal_throttle_high_multiple,
                     g_conf->journal_throttle_max_multiple,
                     header.max_size - get_top(),
                     &ss);
...
}

As the code above shows, the related options are:

journal_throttle_low_threshhold
journal_throttle_high_threshhold
filestore_expected_throughput_bytes

20. RBD cache configuration parameters

[client]
rbd_cache_size = 134217728                  default 32M // cache size in bytes
rbd_cache_max_dirty = 100663296             default 24M // dirty limit in bytes - set to 0 for write-through caching
rbd_cache_target_dirty = 67108864           default 16M // target dirty limit in bytes
rbd_cache_writethrough_until_flush = true   default true // whether to make writeback caching writethrough until flush is called, to be sure the user of librbd will send flushs so that writeback is safe
rbd_cache_max_dirty_age = 5                 default 1.0  // seconds in cache before writeback starts
rbd_cache_size: the per-rbd-image cache size on the client side. It does not need to be large; 64M is plenty, or it will eat too much client memory;
scale rbd_cache_max_dirty and rbd_cache_target_dirty from rbd_cache_size, keeping the same ratios as the defaults;
rbd_cache_max_dirty: the maximum number of dirty bytes in writeback mode, default 24MB; setting it to 0 means writethrough mode;
rbd_cache_target_dirty: the dirty-byte threshold at which the cache starts writing back to the ceph cluster, default 16MB; note that it must be smaller than rbd_cache_max_dirty;
rbd_cache_writethrough_until_flush: the rbd cache operates in writethrough mode until the first flush is issued, after which it switches to writeback;

rbd_cache_max_dirty_age: the maximum time an entry may stay in the OSDC ObjectCacher before writeback starts;

Reference:

https://my.oschina.net/linuxhunter/blog/541997
