node_exporter Configuration

Monitoring Servers with Prometheus

To monitor a server's CPU, memory, disk, I/O and similar metrics, first install node_exporter. node_exporter collects system-level data from the machine and exposes it for Prometheus to scrape.

Download: https://github.com/prometheus/node_exporter/releases/
https://prometheus.io/download/

wget https://github.com/prometheus/node_exporter/releases/download/v0.18.0/node_exporter-0.18.0.linux-amd64.tar.gz

tar xvf node_exporter-0.18.0.linux-amd64.tar.gz
mv node_exporter-0.18.0.linux-amd64/node_exporter /usr/local/bin/node_exporter
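
The version number appears twice in the release URL, once in the tag and once in the file name, and the two must agree. A minimal sketch of a helper that keeps them in sync is shown here; the function name node_exporter_url and its default platform argument are illustrative and not part of node_exporter itself.

```shell
# Build the release tarball URL for a given node_exporter version.
# The function name and the default platform are assumptions for illustration.
node_exporter_url() {
  version="$1"
  platform="${2:-linux-amd64}"
  printf 'https://github.com/prometheus/node_exporter/releases/download/v%s/node_exporter-%s.%s.tar.gz\n' \
    "$version" "$version" "$platform"
}

# Example (requires network access):
#   wget "$(node_exporter_url 0.18.0)"
```

Passing a different version string changes both places in the URL at once, which avoids requesting a file that does not exist under that tag.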

Create the user

groupadd prometheus
useradd -g prometheus -m -d /var/lib/prometheus -s /sbin/nologin prometheus
chown -R prometheus:prometheus /usr/local/prometheus

Create the systemd service

cat > /etc/systemd/system/node_exporter.service << EOF
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target
[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF

Start the service

systemctl daemon-reload
systemctl start node_exporter

systemctl status node_exporter
● node_exporter.service - node_exporter
   Loaded: loaded (/etc/systemd/system/node_exporter.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2019-06-05 09:18:56 GMT; 3s ago
 Main PID: 11050 (node_exporter)
   CGroup: /system.slice/node_exporter.service
           └─11050 /usr/local/bin/node_exporter

systemctl enable node_exporter

By default, Node Exporter serves its metrics at http://IP:9100/metrics
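
To confirm that the endpoint is answering, it can help to pull a single value out of the text it returns. The following is a minimal sketch that assumes the metric of interest has no labels (node_load5 is one such metric); the function name metric_value is hypothetical.

```shell
# Print the value of an unlabelled metric from Prometheus text-format input on stdin.
# Comment lines start with "#", so their first field never equals the metric name.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}

# Example (requires a running node_exporter):
#   curl -s http://localhost:9100/metrics | metric_value node_load5
```

Metrics that carry labels, such as node_cpu_seconds_total, have the label set attached to the first field and would need a looser match than this sketch uses.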

Configure Prometheus

vim /usr/local/prometheus/prometheus.yml

  - job_name: 'linux'
    static_configs:
      - targets: ['localhost:9100']
        labels:
          instance: node1

prometheus.yml defines two scrape jobs in total: one monitors the Prometheus service itself, and the other monitors the Linux server. Here is a complete example:

scrape_configs:

  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'linux'
    static_configs:
      - targets: ['NODE_IP:9100']
        labels:
          instance: node1

Restart Prometheus

systemctl restart prometheus
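
A YAML mistake in prometheus.yml will stop Prometheus from coming back up after a restart, so it may be worth validating the file first. The sketch below assumes promtool (shipped in the Prometheus release tarball) is on the PATH and that the service is named prometheus as above; the function name restart_prometheus_if_valid is illustrative.

```shell
# Restart Prometheus only if the configuration passes promtool's check.
# The function name and the default path are assumptions for illustration.
restart_prometheus_if_valid() {
  config="${1:-/usr/local/prometheus/prometheus.yml}"
  if promtool check config "$config"; then
    systemctl restart prometheus
  else
    echo "config check failed; not restarting" >&2
    return 1
  fi
}

# Example:
#   restart_prometheus_if_valid /usr/local/prometheus/prometheus.yml
```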

Open the Prometheus web UI and go to Status -> Targets. Both configured targets should be listed with a State of UP.

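Prometheus alerting rules for nodes

Save the rules below to a file and list that file under rule_files in prometheus.yml so that Prometheus loads them.
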
groups:
- name: example
  rules:
 
  - alert: "Instance lost"
    expr: up{job="linux"} == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Server instance {{ $labels.instance }} lost"
      description: "Job {{ $labels.job }} on {{ $labels.instance }} has been down for more than 1 minute"
 
  - alert: "Disk space below 5%"
    expr: 100 - ((node_filesystem_avail_bytes{job="linux",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"} * 100) / node_filesystem_size_bytes{job="linux",mountpoint=~".*",fstype=~"ext4|xfs|ext2|ext3"}) > 95
    for: 30s
    annotations:
      summary: "Server instance {{ $labels.instance }}: low disk space alert"
      description: "Disk {{ $labels.device }} on {{ $labels.instance }} has less than 5% free space; current usage: {{ $value }}"
 
  - alert: "Available memory below 20%"
    expr: ((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes) / (node_memory_MemTotal_bytes)) * 100 > 80
    for: 30s
    labels:
      severity: warning
    annotations:
      summary: "Server instance {{ $labels.instance }}: low memory alert"
      description: "{{ $labels.instance }} has less than 20% of its memory available; current usage: {{ $value }}"
 
  - alert: "CPU load average above 4"
    expr: node_load5 > 4
    for: 30s
    annotations:
      summary: "Server instance {{ $labels.instance }}: CPU load alert"
      description: "The 5-minute load average on {{ $labels.instance }} has exceeded 4; current value: {{ $value }}"
 
  - alert: "Disk read I/O above 30MB/s"
    expr: irate(node_disk_read_bytes_total{device="sda"}[1m]) > 30000000
    for: 30s
    annotations:
      summary: "Server instance {{ $labels.instance }}: disk read load alert"
      description: "Disk read throughput on {{ $labels.instance }} has exceeded 30MB/s; current value: {{ $value }}"
 
  - alert: "Disk write I/O above 30MB/s"
    expr: irate(node_disk_written_bytes_total{device="sda"}[1m]) > 30000000
    for: 30s
    annotations:
      summary: "Server instance {{ $labels.instance }}: disk write load alert"
      description: "Disk write throughput on {{ $labels.instance }} has exceeded 30MB/s; current value: {{ $value }}"
 
  - alert: "NIC transmit rate above 10MB/s"
    expr: irate(node_network_transmit_bytes_total{device!~"lo"}[1m]) > 10000000
    for: 30s
    annotations:
      summary: "Server instance {{ $labels.instance }}: network traffic alert"
      description: "Outbound traffic on NIC {{ $labels.device }} of {{ $labels.instance }} has exceeded 10MB/s; current value: {{ $value }}"
 
  - alert: "CPU usage above 90%"
    expr: 100 - ((avg by (instance,job,env)(irate(node_cpu_seconds_total{mode="idle"}[30s]))) *100) > 90
    for: 30s
    annotations:
      summary: "Server instance {{ $labels.instance }}: CPU usage alert"
      description: "CPU usage on {{ $labels.instance }} has exceeded 90%; current value: {{ $value }}"
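
To sanity-check the memory threshold on a host before relying on the alert, the same arithmetic can be reproduced locally from /proc/meminfo. This is a rough sketch, not part of Prometheus or node_exporter; the function name mem_used_percent is hypothetical, and it mirrors the rule's formula (MemTotal - MemFree - Buffers - Cached) / MemTotal * 100.

```shell
# Compute the used-memory percentage from /proc/meminfo-style input on stdin,
# using the same formula as the memory alert expression.
# Field names are matched exactly, so "SwapCached:" is not mistaken for "Cached:".
mem_used_percent() {
  awk '
    $1 == "MemTotal:" { total = $2 }
    $1 == "MemFree:"  { free = $2 }
    $1 == "Buffers:"  { buffers = $2 }
    $1 == "Cached:"   { cached = $2 }
    END { if (total > 0) printf "%.1f\n", (total - free - buffers - cached) / total * 100 }
  '
}

# Example (Linux only):
#   mem_used_percent < /proc/meminfo
```

A result above 80 corresponds to the condition under which the memory rule would start its 30-second pending period.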