Monitoring Servers with Prometheus
To monitor a server's CPU, memory, disk, and I/O, first install node_exporter, which collects system-level metrics from the machine.
Download: https://github.com/prometheus/node_exporter/releases/
https://prometheus.io/download/
wget https://github.com/prometheus/node_exporter/releases/download/v0.18.0/node_exporter-0.18.0.linux-amd64.tar.gz
tar xvf node_exporter-0.18.0.linux-amd64.tar.gz
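The release tag and the archive name in the download URL must carry the same version, otherwise the download returns 404. The following is a minimal sketch that derives both from one version string. The function names `node_exporter_archive` and `node_exporter_url` are hypothetical helpers, and the sketch assumes the upstream naming scheme `node_exporter-<version>.<os>-<arch>.tar.gz` seen in the command above.

```shell
#!/bin/sh
# Hypothetical helpers: build the archive name and release URL from one
# version string so the tag and the file name cannot drift apart.
# Assumption: archives are named node_exporter-<version>.<os>-<arch>.tar.gz.
node_exporter_archive() {
    version=$1
    os=${2:-linux}
    arch=${3:-amd64}
    printf 'node_exporter-%s.%s-%s.tar.gz\n' "$version" "$os" "$arch"
}

node_exporter_url() {
    version=$1
    printf 'https://github.com/prometheus/node_exporter/releases/download/v%s/%s\n' \
        "$version" "$(node_exporter_archive "$@")"
}

# Print the URL for the version used in this walkthrough.
node_exporter_url 0.18.0
```

With these helpers the download step could be written as `wget "$(node_exporter_url 0.18.0)"`.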
mkdir -p /usr/local/prometheus
mv node_exporter-0.18.0.linux-amd64 /usr/local/prometheus/node_exporter
Create the user
groupadd prometheus
useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus
chown -R prometheus:prometheus /usr/local/prometheus
Create the systemd service
cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/prometheus/node_exporter/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
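The service only starts if `ExecStart` names the node_exporter binary itself rather than the directory that contains it. The following is a rough sketch of a pre-flight check. The function names `unit_execstart` and `check_execstart` are hypothetical, and the parsing assumes a single-line `ExecStart=` entry like the one above.

```shell
#!/bin/sh
# Hypothetical helper: print the program path from the first ExecStart= line
# of a unit file, skipping optional systemd prefixes such as '-' or '@'.
unit_execstart() {
    sed -n 's/^ExecStart=[-@:+!]*\([^[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# Hypothetical helper: succeed only if that path is a regular, executable file.
check_execstart() {
    path=$(unit_execstart "$1")
    if [ -z "$path" ]; then
        echo "no ExecStart found in $1" >&2
        return 2
    fi
    if [ -f "$path" ] && [ -x "$path" ]; then
        echo "ok: $path"
        return 0
    fi
    echo "not an executable file: $path" >&2
    return 1
}
```

For example, `check_execstart /etc/systemd/system/node_exporter.service` would report an error if the path points at a directory.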
Start the service
systemctl daemon-reload
systemctl start node_exporter
systemctl status node_exporter
● node_exporter.service - node_exporter
Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2019-06-05 09:18:56 GMT; 3s ago
Main PID: 11050 (node_exporter)
CGroup: /system.slice/node_exporter.service
└─11050 /usr/local/prometheus/node_exporter/node_exporter
systemctl enable node_exporter
By default, Node Exporter serves its metrics at http://IP:9100/metrics
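The endpoint returns plain text with one sample per line, so a single value can be picked out with standard tools. The following is a minimal sketch. `metric_value` is a hypothetical helper, the sample input is illustrative rather than captured output, and the sketch assumes samples carry no trailing timestamp.

```shell
#!/bin/sh
# Hypothetical helper: read Prometheus text exposition format on stdin and
# print the value of the first sample whose metric name equals $1.
# Assumption: samples have no trailing timestamp, so the value is the last field.
metric_value() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP and TYPE comment lines
        {
            i = index($1, "{")             # strip the label set, if any
            metric = (i ? substr($1, 1, i - 1) : $1)
            if (metric == name) { print $NF; exit }
        }'
}

# Illustrative sample input (not real output from a server).
metric_value node_load5 <<'EOF'
# TYPE node_load5 gauge
node_load5 0.42
EOF
```

Against a live exporter the same function could be fed with `curl -s http://IP:9100/metrics | metric_value node_load5`.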
Configure Prometheus
vim /usr/local/prometheus/prometheus.yml
  - job_name: 'linux'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: node1
prometheus.yml now defines two scrape jobs: one for the Prometheus server itself and one for the Linux server. A complete example:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'linux'
    static_configs:
      - targets: ['NODE_IP:9100']
        labels:
          instance: node1
Restart Prometheus
systemctl restart prometheus
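Prometheus refuses to start with an invalid configuration file, so it can help to validate the file before restarting. The following is a sketch using `promtool check config`. The wrapper name `check_prometheus_config` is hypothetical, and the sketch assumes promtool (shipped in the Prometheus release tarball) is on PATH.

```shell
#!/bin/sh
# Hypothetical wrapper: validate a Prometheus config file with promtool.
# Assumptions: promtool is on PATH; the default path matches the one edited above.
check_prometheus_config() {
    config=${1:-/usr/local/prometheus/prometheus.yml}
    if ! command -v promtool >/dev/null 2>&1; then
        echo "promtool not found on PATH" >&2
        return 2
    fi
    if [ ! -f "$config" ]; then
        echo "config not found: $config" >&2
        return 2
    fi
    promtool check config "$config"
}
```

One way to use it is `check_prometheus_config && systemctl restart prometheus`, so the restart only runs when the check passes.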
Open the Prometheus web UI and go to Status -> Targets. Both configured targets should be listed with State UP.
Configuring Prometheus alerting rules for nodes
groups:
  - name: example
    rules:
      - alert: "Instance down"
        expr: up{job="linux"} == 0
        for: 1m
        labels:
          severity: page
        annotations:
          summary: "Server instance {{ $labels.instance }} is down"
          description: "Job {{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
      - alert: "Disk free space below 5%"
        expr: 100 - ((node_filesystem_avail_bytes{job="linux",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"} * 100) / node_filesystem_size_bytes{job="linux",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"}) > 95
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }}: low disk space alert"
          description: "Disk {{ $labels.device }} on {{ $labels.instance }} has less than 5% free space, current value: {{ $value }}"
      - alert: "Memory available below 20%"
        expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes)) * 100 > 80
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Server instance {{ $labels.instance }}: low memory alert"
          description: "{{ $labels.instance }} has less than 20% memory available, current value: {{ $value }}"
      - alert: "CPU load average above 4"
        expr: node_load5 > 4
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }}: CPU load alert"
          description: "The 5-minute load average on {{ $labels.instance }} has exceeded 4, current value: {{ $value }}"
      - alert: "Disk read I/O above 30MB/s"
        expr: irate(node_disk_read_bytes_total{device="sda"}[1m]) > 30000000
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }}: disk read I/O alert"
          description: "Disk read throughput on {{ $labels.instance }} has exceeded 30MB/s, current value: {{ $value }}"
      - alert: "Disk write I/O above 30MB/s"
        expr: irate(node_disk_written_bytes_total{device="sda"}[1m]) > 30000000
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }}: disk write I/O alert"
          description: "Disk write throughput on {{ $labels.instance }} has exceeded 30MB/s, current value: {{ $value }}"
      - alert: "Network transmit rate above 10MB/s"
        expr: irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) > 10000000
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }}: network traffic alert"
          description: "Outbound traffic on interface {{ $labels.device }} of {{ $labels.instance }} has exceeded 10MB/s, current value: {{ $value }}"
      - alert: "CPU usage above 90%"
        expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) * 100) > 90
        for: 30s
        annotations:
          summary: "Server instance {{ $labels.instance }}: CPU usage alert"
          description: "CPU usage on {{ $labels.instance }} has exceeded 90%, current value: {{ $value }}"
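A rule with a `job="..."` selector only matches series from a scrape job of exactly that name, so a mismatch between the rules and prometheus.yml means the alert never fires. The following is a rough sketch that lists such mismatches. The function names are hypothetical, and the matching is plain text rather than a YAML or PromQL parser, assuming one `- job_name:` entry per line as in the scrape configuration shown earlier.

```shell
#!/bin/sh
# Hypothetical helper: list the job names defined in a prometheus.yml.
# Assumption: each job is written on one line as "- job_name: <name>",
# optionally quoted. This is a text match, not a YAML parser.
config_jobs() {
    sed -n 's/^[[:space:]]*-[[:space:]]*job_name:[[:space:]]*//p' "$1" |
        tr -d "'\"" | sed 's/[[:space:]]*$//'
}

# Hypothetical helper: list the distinct values of job="..." selectors
# found in a rules file.
rule_jobs() {
    grep -o 'job="[^"]*"' "$1" | sed 's/^job="//; s/"$//' | sort -u
}

# Print every job referenced by a rule that no scrape job defines.
# Usage: unmatched_rule_jobs <prometheus.yml> <rules file>
unmatched_rule_jobs() {
    rule_jobs "$2" | while IFS= read -r job; do
        config_jobs "$1" | grep -qxF -e "$job" || printf '%s\n' "$job"
    done
}
```

An empty result from `unmatched_rule_jobs` means every `job` selector in the rules file has a matching scrape job.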