Overview
Ceph has a great many configuration parameters, and a large number of tuning guides can be found online. But why are the parameters set the way they are, and are those settings reasonable? Few of these guides explain.
Starting from our current ceph.conf, this article explains each configuration item, to serve as a basis for future parameter tuning and for newcomers to learn from.
Parameter Details
1. Fixed configuration parameters
fsid = 6d529c3d-5745-4fa5-*-*
mon_initial_members = **, **, **
mon_host = *.*.*.*, *.*.*.*, *.*.*.*
The above are typically generated by ceph-deploy. They are all Ceph monitor related parameters and should not be modified.
2. Network configuration parameters
public_network = 10.10.2.0/24    (default: "")
cluster_network = 10.10.2.0/24    (default: "")
public network: the network for monitor↔OSD, client↔monitor and client↔OSD traffic; preferably a high-bandwidth (10GbE) network;
cluster network: the network for OSD↔OSD traffic; usually a high-bandwidth (10GbE) network;
Reference: http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/
3. Pool size configuration parameters
osd_pool_default_size = 3    (default: 3)
osd_pool_default_min_size = 1    (default: 0)  // 0 means no specific default; ceph will use size-size/2
These are the default size parameters applied when creating a Ceph pool. They are usually set to 3 and 1; three replicas are enough to guarantee data reliability;
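When osd_pool_default_min_size is left at its default of 0, Ceph derives min_size from size as size - size/2 with integer division (per the comment above). A minimal sketch of that rule (the function name is mine, not Ceph's):

```python
def default_min_size(size: int) -> int:
    """Mimic Ceph's fallback rule: min_size = size - size/2 (integer division)."""
    return size - size // 2

# size=3 -> min_size=2: IO is served as long as 2 of the 3 replicas are up.
print(default_min_size(3))
```

So with the common size=3, the implicit default would be 2; the explicit min_size=1 above trades some safety for availability when two replicas are down.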
4. Authentication configuration parameters
auth_service_required = none    (default: "cephx")
auth_client_required = none    (default: "cephx, none")
auth_cluster_required = none    (default: "cephx")
These are the Ceph authentication parameters; by default cephx authentication is enabled;
On a Ceph cluster used only on an internal, trusted network, these are often set to none, i.e. authentication disabled, which somewhat speeds up access to the cluster;
5. OSD down/out configuration parameters
mon_osd_down_out_interval = 3600    (default: 300)  // seconds
mon_osd_min_down_reporters = 3    (default: 2)
mon_osd_report_timeout = 900    (default: 900)
osd_heartbeat_interval = 10    (default: 6)
osd_heartbeat_grace = 60    (default: 20)
mon_osd_down_out_interval: the interval after which Ceph additionally marks a down OSD as out;
mon_osd_min_down_reporters: the minimum number of reporters required before the monitor marks an OSD down (each peer OSD reporting it as down counts as one reporter);
mon_osd_report_timeout: the maximum time the monitor waits for a report from an OSD before marking it down;
osd_heartbeat_interval: the interval at which an OSD sends heartbeats to other OSDs (only OSDs sharing a PG exchange heartbeats);
osd_heartbeat_grace: the maximum interval after which an OSD reports a peer OSD as down. Raising the grace period has a side effect: if an OSD exits abnormally, its peers must wait the full grace period before reporting it, and during that window IO to the PGs served by that OSD hangs; so do not set the grace period too high.
Configuring these parameters sensibly for the actual environment helps detect down OSDs promptly (reducing the duration and likelihood of hung IO) while delaying the down-to-out transition (avoiding data recovery triggered by transient network flaps);
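The monitor-side decisions above can be pictured with a toy model (my own simplification, not Ceph's actual code): an OSD is marked down once enough distinct peers report it, and marked out after it has stayed down for mon_osd_down_out_interval seconds.

```python
def should_mark_down(reporters: set, min_down_reporters: int = 2) -> bool:
    # Each peer OSD whose heartbeats to the target fail counts as one reporter.
    return len(reporters) >= min_down_reporters

def should_mark_out(seconds_down: float, down_out_interval: float = 300.0) -> bool:
    # Only after this interval is the down OSD also marked out,
    # which is what triggers data recovery.
    return seconds_down >= down_out_interval

print(should_mark_down({1, 2, 3}, 3))   # three reporters reach the threshold
print(should_mark_out(120, 3600))       # with the interval raised to 3600s: not yet out
```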
References:
http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/
http://blog.wjin.org/posts/ceph-osd-heartbeat.html
6. Objecter configuration parameters
objecter_inflight_ops = 10240    (default: 1024)
objecter_inflight_op_bytes = 1048576000    (default: 100M)
These are the throttle settings for the client-side objecter; they affect the performance of librbd and RGW;
Suggested configuration:
Increase both values.
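The effect of these two limits can be sketched as a simple admission check (a hypothetical simplification of the objecter's throttle, not its real implementation): a new op is admitted only while in-flight ops and bytes stay under the caps, otherwise the caller waits.

```python
class InflightThrottle:
    """Toy model of objecter_inflight_ops / objecter_inflight_op_bytes."""
    def __init__(self, max_ops: int = 1024, max_bytes: int = 100 * 1024 * 1024):
        self.max_ops, self.max_bytes = max_ops, max_bytes
        self.ops = 0
        self.bytes = 0

    def try_start(self, op_bytes: int) -> bool:
        if self.ops + 1 > self.max_ops or self.bytes + op_bytes > self.max_bytes:
            return False  # the real objecter would block here until ops complete
        self.ops += 1
        self.bytes += op_bytes
        return True

    def finish(self, op_bytes: int) -> None:
        self.ops -= 1
        self.bytes -= op_bytes

t = InflightThrottle(max_ops=2, max_bytes=10)
print(t.try_start(6), t.try_start(3), t.try_start(3))  # third op exceeds the byte cap
```

Raising the caps lets more ops stay in flight before clients stall, which is why librbd/RGW throughput benefits from larger values.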
7. Ceph RGW configuration parameters
rgw_frontends = "civetweb num_threads=500"    (default: "fastcgi, civetweb port=7480")
rgw_thread_pool_size = 200    (default: 100)
rgw_override_bucket_index_max_shards = 20    (default: 0)
rgw_max_chunk_size = 1048576    (default: 512 * 1024)
rgw_cache_lru_size = 10000    (default: 10000)  // num of entries in rgw cache
rgw_bucket_default_quota_max_objects = *    (default: -1)  // number of objects allowed
rgw_frontends: the RGW frontend configuration, usually the lightweight civetweb; port is the port RGW listens on, set per deployment; num_threads is the civetweb thread count;
rgw_thread_pool_size: the RGW frontend thread count, with the same meaning as num_threads in rgw_frontends, though num_threads takes precedence over rgw_thread_pool_size; only one of the two needs to be set;
rgw_override_bucket_index_max_shards: the maximum number of shards for a bucket's index object. Increasing it improves bucket index access, but also makes listing the bucket slower;
rgw_max_chunk_size: the maximum RGW chunk size; for object-storage workloads dominated by large files this value can be increased;
rgw_cache_lru_size: the RGW LRU cache size; for read-heavy workloads, increasing it speeds up RGW responses;
rgw_bucket_default_quota_max_objects: caps the maximum number of objects allowed in a bucket;
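Bucket index sharding spreads index entries across several RADOS objects; conceptually, each entry lands in the shard chosen by hashing the object name (a schematic illustration with a stand-in hash, not RGW's exact hash function):

```python
import hashlib

def index_shard(object_name: str, num_shards: int) -> int:
    """Pick a bucket-index shard for an object (illustrative, not RGW's real hash)."""
    h = int(hashlib.md5(object_name.encode()).hexdigest(), 16)
    return h % num_shards

# With 20 shards, concurrent uploads of different keys usually hit different
# index shards, so index updates contend less; but a bucket listing must
# read and merge all 20 shards, which is why listing gets slower.
shards = {index_shard(f"obj-{i}", 20) for i in range(1000)}
print(len(shards))  # most of the 20 shards end up used
```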
References:
http://docs.ceph.com/docs/jewel/install/install-ceph-gateway/
http://ceph-users.ceph.narkive.com/mdB90g7R/rgw-increase-the-first-chunk-size
https://access.redhat.com/solutions/2122231
8. Debug configuration parameters
debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_mon = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_paxos = 0/0
debug_rgw = 0/0
Disabling all debug logging speeds up the Ceph cluster to some extent, but critical logs are lost, which makes problems harder to analyze;
References:
http://www.10tiao.com/html/362/201609/2654062487/1.html
9. OSD op configuration parameters
osd_enable_op_tracker = true    (default: true)
osd_num_op_tracker_shard = 32    (default: 32)
osd_op_threads = 5    (default: 2)
osd_disk_threads = 1    (default: 1)
osd_op_num_shards = 15    (default: 5)
osd_op_num_threads_per_shard = 2    (default: 2)
osd_enable_op_tracker: enables tracking of OSD op state; default true. Disabling it is not recommended, because with it off the OSD's slow_request, ops_in_flight and historic_ops statistics no longer work:
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
# ceph daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops
op_tracker tracking is not enabled now, so no ops are tracked currently, even those get stuck. Please enable "osd_enable_op_tracker", and the tracker will start to track new ops received afterwards.
With the op tracker enabled, if the cluster's IOPS is high, osd_num_op_tracker_shard can be increased, since each shard has its own mutex:
class OpTracker {
...
struct ShardedTrackingData {
Mutex ops_in_flight_lock_sharded;
xlist<TrackedOp *> ops_in_flight_sharded;
explicit ShardedTrackingData(string lock_name):
ops_in_flight_lock_sharded(lock_name.c_str()) {}
};
vector<ShardedTrackingData*> sharded_in_flight_list;
uint32_t num_optracker_shards;
...
};
osd_op_threads: serves the work queues peering_wq (OSD peering requests) and recovery_gen_wq (PG recovery requests);
osd_disk_threads: serves the work queue remove_wq (PG remove requests);
osd_op_num_shards and osd_op_num_threads_per_shard: apply to the thread pool osd_op_tp and the work queue op_shardedwq;
which handles the following request types:
OpRequestRef
PGSnapTrim
PGScrub
Increasing osd_op_num_shards raises the number of threads processing OSD ops, increasing concurrency and improving OSD performance;
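The op worker count is the product of the two settings, and ops are distributed over the shards; a small sketch (the shard mapping below is a hypothetical simplification of how op_shardedwq keys ops by PG):

```python
def op_worker_threads(num_shards: int, threads_per_shard: int) -> int:
    """Total threads in osd_op_tp serving op_shardedwq."""
    return num_shards * threads_per_shard

def shard_for_pg(pg_id: int, num_shards: int) -> int:
    # Ops for the same PG always land on the same shard,
    # preserving per-PG ordering while shards run in parallel.
    return pg_id % num_shards

print(op_worker_threads(15, 2))  # the tuned values above give 30 worker threads
print(shard_for_pg(31, 15))
```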
10. OSD client message configuration parameters
osd_client_message_size_cap = 1048576000    (default: 500*1024L*1024L)  // client data allowed in-memory (in bytes)
osd_client_message_cap = 1000    (default: 100)  // num client messages allowed in-memory
These cap the client messages an OSD may hold in memory. Larger values raise the OSD's processing capacity, but consume more system memory;
Suggested configuration:
Increase both values moderately when the server has plenty of memory.
11. OSD scrub configuration parameters
osd_scrub_begin_hour = 10    (default: 0)
osd_scrub_end_hour = 5    (default: 24)
// The time in seconds that scrubbing sleeps between two consecutive scrubs
osd_scrub_sleep = 1    (default: 0)  // sleep between [deep]scrub ops
osd_scrub_load_threshold = 8    (default: 0.5)
// min/max number of objects per chunky scrub; the values below are the defaults
osd_scrub_chunk_min = 5
osd_scrub_chunk_max = 25
Ceph OSD scrub is the mechanism that guarantees data consistency in Ceph. Scrubbing operates per PG, and each scrub takes the PG lock, so it can interfere with the PG's normal IO;
Ceph later introduced chunky scrub: each pass scrubs only a subset of the PG's objects, releases the PG lock when done, and queues the next scrub of that PG. This greatly shortens the time the PG lock is held during scrubbing and limits the impact on the PG's normal IO;
Similarly, the osd_scrub_sleep parameter makes the scrub thread release the PG lock and sleep for a while before each scrub chunk, which also reduces the impact of scrubbing on normal PG IO;
Suggested configuration:
osd_scrub_begin_hour and osd_scrub_end_hour: the time window in which OSD scrubs may start; set according to the workload;
osd_scrub_sleep: how long the OSD sleeps between scrub chunks; there is a known bug related to this option, so it is advisable to leave it disabled;
osd_scrub_load_threshold: the system load threshold below which the OSD may start scrubbing; set according to the host's load average;
osd_scrub_chunk_min and osd_scrub_chunk_max: set according to the number of objects per PG; for RGW workloads consisting entirely of small files, both values should be increased;
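Note that the scrub window may wrap around midnight, as it does with begin=10, end=5 above. A sketch of the window check (my own simplification of the OSD's hour test):

```python
def in_scrub_window(hour: int, begin: int, end: int) -> bool:
    """True if a scrub may start at this hour; [begin, end) may wrap midnight."""
    if begin == end:
        return True                        # degenerate window: always allowed
    if begin < end:
        return begin <= hour < end         # plain daytime window
    return hour >= begin or hour < end     # window wraps past midnight

# begin=10, end=5 allows 10:00-23:59 and 00:00-04:59
print(in_scrub_window(23, 10, 5), in_scrub_window(7, 10, 5))
```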
References:
http://www.itdecent.cn/p/ea2296e1555c
http://tracker.ceph.com/issues/19497
12. OSD thread timeout configuration parameters
osd_op_thread_timeout = 100    (default: 15)
osd_op_thread_suicide_timeout = 300    (default: 150)
osd_recovery_thread_timeout = 100    (default: 30)
osd_recovery_thread_suicide_timeout = 300    (default: 300)
osd_op_thread_timeout and osd_op_thread_suicide_timeout apply to the work queues:
op_shardedwq - handles requests of type OpRequestRef, PGSnapTrim and PGScrub
peering_wq - handles OSD peering requests
osd_recovery_thread_timeout and osd_recovery_thread_suicide_timeout apply to the work queue:
recovery_wq - handles PG recovery requests
Ceph's work queues all share a base class WorkQueue_, defined as follows:
/// Pool of threads that share work submitted to multiple work queues.
class ThreadPool : public md_config_obs_t {
...
/// Basic interface to a work queue used by the worker threads.
struct WorkQueue_ {
string name;
time_t timeout_interval, suicide_interval;
WorkQueue_(string n, time_t ti, time_t sti)
: name(n), timeout_interval(ti), suicide_interval(sti)
{ }
...
Here timeout_interval and suicide_interval correspond to the timeout and suicide_timeout options described above;
While a thread processes a request from a work queue, it is constrained by both timeouts:
timeout_interval - when exceeded, m_unhealthy_workers is incremented
suicide_interval - when exceeded, an assert fires and the OSD process crashes
The corresponding handler is:
bool HeartbeatMap::_check(const heartbeat_handle_d *h, const char *who, time_t now)
{
bool healthy = true;
time_t was;
was = h->timeout.read();
if (was && was < now) {
ldout(m_cct, 1) << who << " '" << h->name << "'"
<< " had timed out after " << h->grace << dendl;
healthy = false;
}
was = h->suicide_timeout.read();
if (was && was < now) {
ldout(m_cct, 1) << who << " '" << h->name << "'"
<< " had suicide timed out after " << h->suicide_grace << dendl;
assert(0 == "hit suicide timeout");
}
return healthy;
}
Currently only RGW adds a perfcounter for its workers, so only RGW can report total/unhealthy worker counts via perf dump:
[root@ node1]# ceph daemon /var/run/ceph/ceph-client.rgw.*.asok perf dump | grep worker
"total_workers": 32,
"unhealthy_workers": 0
The corresponding option is:
OPTION(rgw_num_async_rados_threads, OPT_INT, 32) // num of threads to use for async rados operations
Suggested configuration:
*_thread_timeout: the smaller this value, the sooner slow requests are detected, so it should not be set very high; for fast devices in particular, lower it;
*_thread_suicide_timeout: if set too low, the OSD crashes when the timeout fires, so it should be raised; especially after the corresponding throttles have been increased, this value should be raised as well;
13. Filestore op thread configuration parameters
filestore_op_threads = 5    (default: 2)
filestore_op_thread_timeout = 100    (default: 60)
filestore_op_thread_suicide_timeout = 300    (default: 180)
filestore_op_threads: applies to the thread pool op_tp and the work queue op_wq; all filestore requests pass through op_wq;
Increasing this parameter raises the filestore's processing capacity and improves its performance; tune it together with the filestore throttles;
filestore_op_thread_timeout and filestore_op_thread_suicide_timeout apply to the work queue:
op_wq
Their meaning matches the thread_timeout/thread_suicide_timeout options in the previous section;
14. Filestore merge/split configuration parameters
filestore_merge_threshold = -1    (default: 10)
filestore_split_multiple = 10000    (default: 2)
These two parameters govern filestore directory merging/splitting. The maximum number of files allowed in a single filestore directory is:
filestore_split_multiple * abs(filestore_merge_threshold) * 16
In RGW small-file workloads the default limit (320 files) is reached very quickly; if a filestore split is triggered while writes are in flight, filestore performance suffers badly;
Each filestore directory split follows the rule below, splitting into multiple directory levels with 16 subdirectories at the bottom level:
For example PG 31.4C0, whose hash ends in 4C0: if that directory splits, it becomes DIR_0/DIR_C/DIR_4/{DIR_0 ... DIR_F};
Objects from the original directory are redistributed into the subdirectories by the same rule. Object names have the form _head_xxxxX4C0: whatever hex digit X is, the object goes into subdirectory DIR_X. For example object _head_xxxxA4C0 goes into DIR_0/DIR_C/DIR_4/DIR_A;
Workarounds:
1) Increase the merge/split parameters so a single directory can hold more files;
2) Set filestore_merge_threshold to a negative value; this triggers directory pre-splitting up front, avoiding a burst of splits concentrated in a short period (the detailed mechanism was not investigated);
3) Specify expected-num-objects when creating the pool; the split subdirectories are then created at pool-creation time according to the split rule, avoiding the performance impact of directory splits later on;
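The per-directory file limit above is easy to check numerically (function name is mine):

```python
def max_files_per_dir(split_multiple: int, merge_threshold: int) -> int:
    """filestore_split_multiple * abs(filestore_merge_threshold) * 16."""
    return split_multiple * abs(merge_threshold) * 16

print(max_files_per_dir(2, 10))       # defaults: 320 files, hit fast by small files
print(max_files_per_dir(10000, -1))   # the tuned values above: 160000 files
```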
References:
http://docs.ceph.com/docs/master/rados/configuration/filestore-config-ref/
http://docs.ceph.com/docs/jewel/rados/operations/pools/#create-a-pool
http://blog.csdn.net/for_tech/article/details/51251936
http://ivanjobs.github.io/page3/
15. Filestore fd cache configuration parameters
filestore_fd_cache_shards = 32    (default: 16)  // FD number of shards
filestore_fd_cache_size = 32768    (default: 128)  // FD lru size
The filestore fd cache speeds up access to files inside the filestore; for workloads that are not write-once, increasing these values noticeably improves filestore performance;
16. Filestore sync configuration parameters
filestore_wbthrottle_enable = false    (default: true)  recommended off for SSDs
filestore_min_sync_interval = 1    (default: 0.01 s)  minimum interval between syncs of fs data to disk, FileStore::sync_entry()
filestore_max_sync_interval = 10    (default: 5 s)  maximum interval between syncs of fs data to disk, FileStore::sync_entry()
filestore_commit_timeout = 1000    (default: 600 s)  see new SyncEntryTimeout(m_filestore_commit_timeout) in FileStore::sync_entry()
filestore_wbthrottle_enable controls the filestore writeback throttle, i.e. the data-volume threshold for the filestore work queue op_wq; it defaults to true. When enabled, the XFS-related options are:
OPTION(filestore_wbthrottle_xfs_bytes_start_flusher, OPT_U64, 41943040)
OPTION(filestore_wbthrottle_xfs_bytes_hard_limit, OPT_U64, 419430400)
OPTION(filestore_wbthrottle_xfs_ios_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_ios_hard_limit, OPT_U64, 5000)
OPTION(filestore_wbthrottle_xfs_inodes_start_flusher, OPT_U64, 500)
OPTION(filestore_wbthrottle_xfs_inodes_hard_limit, OPT_U64, 5000)
With ordinary HDDs it can stay true; for SSDs it is recommended to set it false, leaving the writeback throttle disabled;
filestore_min_sync_interval and filestore_max_sync_interval set the interval at which the filestore flushes outstanding IO to disk. Increasing them lets the system merge as much IO as possible and reduces the filestore's write pressure on the disk, but also increases page-cache memory usage and the risk of data loss;
filestore_commit_timeout is the timeout for a filestore sync entry to reach disk; when the filestore is under heavy load, raising this value helps avoid OSD crashes caused by IO timeouts;
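The two intervals can be pictured as bounds on when the sync thread fires (my own schematic, not FileStore::sync_entry() itself): a sync never happens sooner than the minimum interval, and is forced once the maximum interval has elapsed.

```python
def should_sync(elapsed: float, commit_requested: bool,
                min_interval: float = 0.01, max_interval: float = 5.0) -> bool:
    """Schematic of the sync cadence: honor min_interval, force at max_interval."""
    if elapsed >= max_interval:
        return True  # force a sync, bounding the window of unflushed data
    return commit_requested and elapsed >= min_interval

print(should_sync(12.0, False, 1, 10))  # past the max interval: sync regardless
print(should_sync(0.5, True, 1, 10))    # commit wanted, but min interval not reached
```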
17. Filestore throttle configuration parameters
filestore_expected_throughput_bytes = 536870912    (default: 200MB)  /// Expected filestore throughput in B/s
filestore_expected_throughput_ops = 2000    (default: 200)  /// Expected filestore throughput in ops/s
filestore_queue_max_bytes = 1048576000    (default: 100MB)
filestore_queue_max_ops = 5000    (default: 50)
/// Use above to inject delays intended to keep the op queue between low and high
filestore_queue_low_threshhold = 0.3    (default: 0.3)
filestore_queue_high_threshhold = 0.9    (default: 0.9)
filestore_queue_high_delay_multiple = 2    (default: 0)  /// Filestore high delay multiple. Defaults to 0 (disabled)
filestore_queue_max_delay_multiple = 10    (default: 0)  /// Filestore max delay multiple. Defaults to 0 (disabled)
The jewel release introduced a dynamic throttle to smooth out the long-tail latency caused by the plain throttle;
With ordinary disks the previous throttle mechanism works well enough, which is why filestore_queue_high_delay_multiple and filestore_queue_max_delay_multiple both default to 0;
For fast disks, run the small benchmark tool ceph_smalliobenchfs before deployment to derive suitable values;
BackoffThrottle is documented as follows:
/**
* BackoffThrottle
*
* Creates a throttle which gradually induces delays when get() is called
* based on params low_threshhold, high_threshhold, expected_throughput,
* high_multiple, and max_multiple.
*
* In [0, low_threshhold), we want no delay.
*
* In [low_threshhold, high_threshhold), delays should be injected based
* on a line from 0 at low_threshhold to
* high_multiple * (1/expected_throughput) at high_threshhold.
*
* In [high_threshhold, 1), we want delays injected based on a line from
* (high_multiple * (1/expected_throughput)) at high_threshhold to
* (high_multiple * (1/expected_throughput)) +
* (max_multiple * (1/expected_throughput)) at 1.
*
 * Let the current throttle ratio (current/max) be r, low_threshhold be l,
 * high_threshhold be h, high_delay (high_multiple / expected_throughput) be e,
 * and max_delay (max_multiple / expected_throughput) be m.
 *
 * delay = 0, r \in [0, l)
 * delay = (r - l) * (e / (h - l)), r \in [l, h)
 * delay = e + (r - h) * (m / (1 - h)), r \in [h, 1)
 */
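The piecewise delay described in the comment can be implemented directly (a sketch following the comment's formulas; in Ceph the delay is applied inside get()):

```python
def backoff_delay(r: float, l: float, h: float,
                  expected_throughput: float,
                  high_multiple: float, max_multiple: float) -> float:
    """Delay injected at throttle ratio r, per the BackoffThrottle comment."""
    e = high_multiple / expected_throughput   # delay at the high threshold
    m = max_multiple / expected_throughput    # extra delay added by ratio 1.0
    if r < l:
        return 0.0                            # below the low threshold: no delay
    if r < h:
        return (r - l) * (e / (h - l))        # ramp from 0 at l up to e at h
    return e + (r - h) * (m / (1 - h))        # ramp from e at h up to e + m at 1

# With l=0.3, h=0.9, throughput=200 ops/s, high_multiple=2, max_multiple=10:
print(backoff_delay(0.2, 0.3, 0.9, 200, 2, 10))             # below low threshold
print(round(backoff_delay(0.9, 0.3, 0.9, 200, 2, 10), 4))   # exactly e = 2/200
```

The delay is continuous at both thresholds and grows gradually, which is what smooths out the long-tail stalls of a hard-limit throttle.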
References:
http://docs.ceph.com/docs/jewel/dev/osd_internals/osd_throttles/
http://blog.wjin.org/posts/ceph-dynamic-throttle.html
https://github.com/ceph/ceph/blob/master/src/doc/dynamic-throttle.txt
Ceph BackoffThrottle analysis
18. Filestore finisher threads configuration parameters
filestore_ondisk_finisher_threads = 2    (default: 1)
filestore_apply_finisher_threads = 2    (default: 1)
These two parameters set the number of finisher threads for filestore commit/apply, both defaulting to 1; every completed IO commit/apply passes through the corresponding ondisk/apply finisher thread;
With ordinary HDDs the disk is the bottleneck, and a single finisher thread keeps up fine;
With fast disks, IO completes quickly and a single finisher thread cannot handle all the commit/apply replies, becoming the bottleneck itself; the jewel release therefore made the finisher thread pool size configurable; 2 is usually sufficient;
19. Journal configuration parameters
journal_max_write_bytes = 1048576000    (default: 10M)
journal_max_write_entries = 5000    (default: 100)
journal_throttle_high_multiple = 2    (default: 0)  /// Multiple over expected at high_threshhold. Defaults to 0 (disabled).
journal_throttle_max_multiple = 10    (default: 0)  /// Multiple over expected at max. Defaults to 0 (disabled).
/// Target range for journal fullness
OPTION(journal_throttle_low_threshhold, OPT_DOUBLE, 0.6)
OPTION(journal_throttle_high_threshhold, OPT_DOUBLE, 0.9)
journal_max_write_bytes and journal_max_write_entries limit the data volume and the number of entries of a single journal write;
When the journal sits on an SSD partition, increase both values to raise journal throughput;
journal_throttle_high_multiple and journal_throttle_max_multiple configure JournalThrottle. JournalThrottle is a wrapper around BackoffThrottle, so it works the same way as the dynamic throttle described in the filestore throttle section;
int FileJournal::set_throttle_params()
{
stringstream ss;
bool valid = throttle.set_params(
g_conf->journal_throttle_low_threshhold,
g_conf->journal_throttle_high_threshhold,
g_conf->filestore_expected_throughput_bytes,
g_conf->journal_throttle_high_multiple,
g_conf->journal_throttle_max_multiple,
header.max_size - get_top(),
&ss);
...
}
As the code above shows, the relevant options are:
journal_throttle_low_threshhold
journal_throttle_high_threshhold
filestore_expected_throughput_bytes
20. RBD cache configuration parameters
[client]
rbd_cache_size = 134217728    (default: 32M)  // cache size in bytes
rbd_cache_max_dirty = 100663296    (default: 24M)  // dirty limit in bytes - set to 0 for write-through caching
rbd_cache_target_dirty = 67108864    (default: 16M)  // target dirty limit in bytes
rbd_cache_writethrough_until_flush = true    (default: true)  // whether to make writeback caching writethrough until flush is called, to be sure the user of librbd will send flushes so that writeback is safe
rbd_cache_max_dirty_age = 5    (default: 1.0)  // seconds in cache before writeback starts
rbd_cache_size: the per-rbd-image cache size on the client. It does not need to be large; 64M is a reasonable setting, otherwise it consumes a lot of client memory;
Scale rbd_cache_max_dirty and rbd_cache_target_dirty according to rbd_cache_size, following the ratios of the defaults;
rbd_cache_max_dirty: the maximum dirty bytes in the cache in writeback mode, default 24MB; a value of 0 means writethrough mode;
rbd_cache_target_dirty: the dirty-bytes threshold at which the cache starts writing back to the Ceph cluster, default 16MB; note that this must always be smaller than rbd_cache_max_dirty;
rbd_cache_writethrough_until_flush: the rbd cache stays in writethrough mode until the first flush is issued, after which it switches to writeback;
rbd_cache_max_dirty_age: the maximum time an entry may stay in the OSDC ObjectCacher cache before writeback starts;
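The writeback triggers above can be summarized in a small predicate (my own schematic of the ObjectCacher's behavior, not its real code): flushing starts once dirty data exceeds the target, or once the oldest dirty entry exceeds the age limit.

```python
def should_writeback(dirty_bytes: int, oldest_dirty_age: float,
                     target_dirty: int = 16 << 20,
                     max_dirty_age: float = 1.0) -> bool:
    """Start flushing when dirty data exceeds the target or grows too old."""
    return dirty_bytes > target_dirty or oldest_dirty_age >= max_dirty_age

print(should_writeback(20 << 20, 0.1))   # over the 16M target: flush
print(should_writeback(1 << 20, 0.5))    # small and young: keep caching
```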
References: