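Problem description
When a dashboard panel for a metric with a single label is viewed over a long time range (for example, 1 week), the query fails and Prometheus runs out of memory (OOM).
Troubleshooting approach
- When the time series is very long, downsample the data, for example by picking 1,000 points out of 100,000 for plotting
- Limit memory usage to avoid OOM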
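As a rough illustration of the first idea, the sketch below keeps 1,000 evenly spaced points out of 100,000. The function name `downsample` and the sample series are hypothetical, and plain stride sampling is only one possible strategy; it is not what Grafana or Prometheus do internally.

```python
def downsample(points, max_points=1000):
    """Return at most max_points evenly spaced items from points."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    n = len(points)
    if n <= max_points:
        # Short series are returned unchanged.
        return list(points)
    # Spread max_points indices evenly over the series (stride sampling).
    return [points[i * n // max_points] for i in range(max_points)]


# Hypothetical series of (timestamp, value) pairs with 100,000 points.
series = [(t, float(t % 60)) for t in range(100_000)]
sampled = downsample(series, 1000)
print(len(sampled))
```

Stride sampling is cheap but can miss short spikes; aggregating each bucket (for example, taking its maximum or average) is a common alternative when peaks matter.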
Implementation
Data sampling
Searching the Grafana documentation turned up the following query editor options:

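Reference: https://grafana.com/docs/features/datasources/prometheus/#query-editor
When a query returns a very large number of samples, Grafana can run into performance problems while rendering the chart. Min Step sets the minimum step that Prometheus uses when evaluating the query, which reduces the amount of data Prometheus returns.
The Resolution option controls how much data Grafana itself renders. For example, with a Resolution of 1/10, Grafana merges every 10 samples returned by Prometheus into one point. A Resolution closer to 1/1 therefore gives a more precise visualization, while a lower ratio such as 1/10 gives a coarser one.
Reference: https://yunlzheng.gitbook.io/prometheus-book/part-ii-prometheus-jin-jie/grafana/grafana-panels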
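To make the interaction between the time range, Min Step, and Resolution concrete, here is a minimal sketch of how a step could be derived. It is an approximation for illustration, not Grafana's actual formula; `choose_step`, `panel_width_px`, and `resolution_divisor` are hypothetical names, and the 15-second minimum step is an assumed value.

```python
import math


def choose_step(range_seconds, panel_width_px, resolution_divisor=1, min_step_seconds=15):
    """Approximate a query step in seconds.

    Aims for one point per `resolution_divisor` pixels and never goes
    below `min_step_seconds`. All parameters are illustrative assumptions.
    """
    if panel_width_px <= 0 or resolution_divisor <= 0:
        raise ValueError("panel_width_px and resolution_divisor must be positive")
    max_points = max(1, panel_width_px // resolution_divisor)
    return max(min_step_seconds, math.ceil(range_seconds / max_points))


one_week = 7 * 24 * 3600
print(choose_step(one_week, 1000))      # resolution 1/1
print(choose_step(one_week, 1000, 10))  # resolution 1/10
print(choose_step(3600, 1000))          # short range: the minimum step wins
```

Under these assumptions a one-week range on a 1,000-pixel panel gives a step of about 10 minutes at 1/1 and well over an hour at 1/10, which is why raising Min Step or lowering Resolution shrinks the response so sharply.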
URL query parameters of the Prometheus range query API (/api/v1/query_range):
- query=<string>: Prometheus expression query string.
- start=<rfc3339 | unix_timestamp>: Start timestamp.
- end=<rfc3339 | unix_timestamp>: End timestamp.
- step=<duration | float>: Query resolution step width in duration format or float number of seconds.
- timeout=<duration>: Evaluation timeout. Optional. Defaults to and is capped by the value of the -query.timeout flag.
Reference: https://prometheus.io/docs/prometheus/2.7/querying/api/
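The sketch below shows how the `step` parameter shapes a range query. It only builds the request URL and counts the points per series; it does not contact a server. The base URL `http://localhost:9090`, the query `up`, and the timestamps are placeholder assumptions, and the helper names are hypothetical.

```python
from urllib.parse import urlencode


def build_range_query_url(base_url, query, start, end, step, timeout=None):
    """Build a /api/v1/query_range URL from the documented parameters."""
    params = {"query": query, "start": start, "end": end, "step": step}
    if timeout is not None:
        params["timeout"] = timeout
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"


def points_per_series(start, end, step_seconds):
    """Number of evaluation timestamps between start and end, inclusive."""
    return (end - start) // step_seconds + 1


start, end = 1700000000, 1700604800  # placeholder unix timestamps, one week apart
url = build_range_query_url("http://localhost:9090", "up", start, end, 600)
print(url)
print(points_per_series(start, end, 600))  # 10-minute step
print(points_per_series(start, end, 15))   # 15-second step
```

With these placeholder values, a 10-minute step yields about a thousand points per series over the week, while a 15-second step yields roughly forty times as many.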
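Limiting memory usage to avoid OOM
The help output lists --query.max-samples=50000000, meaning that by default a single query may load at most 50,000,000 samples into memory; a query that would exceed this limit is rejected. Setting this value according to the machine's available memory therefore keeps Prometheus from running out of memory.
Note: Prometheus version: 2.7.1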
./prometheus --help
usage: prometheus [<flags>]
--query.max-samples=50000000 Maximum number of samples a single query can load into memory. Note that queries will fail if they would load more samples than this into memory, so this also limits the number of samples a query can return.
Reference: ./prometheus --help
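For sizing the limit, a back-of-the-envelope estimate can help. The sketch below counts only the samples a range query would return (series multiplied by evaluation steps) and compares that with the default limit from the help output. The 16 bytes per sample is an assumption (an 8-byte timestamp plus an 8-byte float value, ignoring labels and other overhead), the series count is made up, and a query can load more samples than it returns, so treat the result as a lower bound.

```python
MAX_SAMPLES = 50_000_000  # default --query.max-samples from the help output
BYTES_PER_SAMPLE = 16     # assumption: 8-byte timestamp + 8-byte float, no overhead


def returned_samples(num_series, range_seconds, step_seconds):
    """Samples in the result of a range query: series x evaluation steps."""
    return num_series * (range_seconds // step_seconds + 1)


def fits_limit(num_series, range_seconds, step_seconds, max_samples=MAX_SAMPLES):
    """True if the returned samples alone stay within the limit."""
    return returned_samples(num_series, range_seconds, step_seconds) <= max_samples


one_week = 7 * 24 * 3600
num_series = 5000  # hypothetical number of matching series

print(fits_limit(num_series, one_week, 15))   # 15-second step
print(fits_limit(num_series, one_week, 600))  # 10-minute step
print(MAX_SAMPLES * BYTES_PER_SAMPLE)         # rough bytes held at the default limit
```

Under these assumptions the default limit corresponds to roughly 800 MB of raw sample data per query, so lowering --query.max-samples on a small machine, together with a larger step, makes oversized queries fail quickly instead of exhausting memory.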