Installing Alertmanager
Download page: https://prometheus.io/download/
Once the download finishes, upload the package to the machine that runs the Prometheus service.

Extract the alertmanager package
tar -zxvf alertmanager-0.21.0.linux-amd64.tar.gz -C /data
mv /data/alertmanager-0.21.0.linux-amd64 /data/alertmanager
Change into the extracted alertmanager directory and edit alertmanager.yml to configure the alert notifications. The contents of alertmanager.yml are as follows:
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.163.com:25'
  smtp_from: '***@163.com'            # mailbox that sends the alerts
  smtp_auth_username: '***@163.com'   # mailbox that sends the alerts
  smtp_auth_password: '***'           # authorization code for the mailbox
  smtp_require_tls: false
route:
  group_by: ['alertname']   # labels used for grouping
  group_wait: 10s           # after an alert fires, wait 10s so other alerts in the same group are sent together
  group_interval: 10s       # interval between two batches of alerts for a group
  repeat_interval: 1m       # interval before an alert is re-sent, which limits how often the same email goes out; set to 1 minute here for testing
  receiver: 'mail'          # default receiver
  # routes:                 # specifies which groups can receive messages
receivers:
  - name: 'mail'
    email_configs:
      - to: '***'
#inhibit_rules:
#  - source_match:
#      severity: 'critical'
#    target_match:
#      severity: 'warning'
#    equal: ['alertname', 'dev', 'instance']
Check that the alertmanager.yml configuration is valid
./amtool check-config alertmanager.yml
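The `routes:` key noted in the configuration is where child routes would go. As a minimal sketch only (the `ops-mail` receiver, the example.com addresses, and the temporary file are hypothetical and not part of the original setup), the following writes a small configuration with one child route keyed on the `severity` label, then validates it with the same check command if `amtool` is present in the current directory:

```shell
# Write a hypothetical sample config to a temporary file (all names are illustrative).
SAMPLE="$(mktemp)"
cat > "$SAMPLE" <<'EOF'
global:
  smtp_smarthost: 'smtp.example.com:25'
  smtp_from: 'alerts@example.com'
  smtp_require_tls: false
route:
  receiver: 'mail'            # default receiver
  group_by: ['alertname']
  routes:
    - match:
        severity: critical    # alerts labeled severity=critical ...
      receiver: 'ops-mail'    # ... go to this receiver instead
receivers:
  - name: 'mail'
    email_configs:
      - to: 'team@example.com'
  - name: 'ops-mail'
    email_configs:
      - to: 'oncall@example.com'
EOF

# Validate only if amtool is available in the current directory.
if [ -x ./amtool ]; then
  ./amtool check-config "$SAMPLE"
fi
```

Alerts that do not match any child route fall through to the default receiver, so the single-receiver setup in this article keeps working unchanged if such a route is added later.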
Start Alertmanager
nohup ./alertmanager &
tail -f nohup.out
level=error ts=2021-04-23T06:06:05.336Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=error ts=2021-04-23T06:07:05.368Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=error ts=2021-04-23T06:08:05.401Z caller=dispatch.go:309 component=dispatcher msg="Notify for alerts failed" num_alerts=1 err="mail/email[0]: notify retry canceled after 2 attempts: create SMTP client: EOF"
level=info ts=2021-04-23T06:08:15.693Z caller=main.go:216 msg="Starting Alertmanager" version="(version=0.21.0, branch=HEAD, revision=4c6c03ebfe21009c546e4d1e9b92c371d67c021d)"
level=info ts=2021-04-23T06:08:15.693Z caller=main.go:217 build_context="(go=go1.14.4, user=root@dee35927357f, date=20200617-08:54:02)"
level=info ts=2021-04-23T06:08:15.697Z caller=cluster.go:161 component=cluster msg="setting advertise address explicitly" addr=192.168.56.128 port=9094
level=info ts=2021-04-23T06:08:15.700Z caller=cluster.go:623 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=info ts=2021-04-23T06:08:15.737Z caller=coordinator.go:119 component=configuration msg="Loading configuration file" file=alertmanager.yml
level=info ts=2021-04-23T06:08:15.738Z caller=coordinator.go:131 component=configuration msg="Completed loading of configuration file" file=alertmanager.yml
level=info ts=2021-04-23T06:08:15.788Z caller=main.go:485 msg=Listening address=:9093
level=info ts=2021-04-23T06:08:17.702Z caller=cluster.go:648 component=cluster msg="gossip not settled" polls=0 before=0 now=1 elapsed=2.001649742s
level=info ts=2021-04-23T06:08:25.711Z caller=cluster.go:640 component=cluster msg="gossip settled; proceeding" elapsed=10.010215916s
Alertmanager listens on port 9093 by default; the web UI is reachable at IP:9093
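Besides opening the page in a browser, the instance can be checked from the command line. The sketch below assumes Alertmanager is reachable at `http://localhost:9093` and that `amtool` sits in the current directory; the function names, the `TestAlert` name, and the annotation text are illustrative choices, not part of the original setup.

```shell
# Assumed address; replace localhost with the server's IP when checking remotely.
AM_URL="${AM_URL:-http://localhost:9093}"

# Succeeds when Alertmanager's health and readiness endpoints both answer.
check_alertmanager() {
  curl -fsS "${AM_URL}/-/healthy" && curl -fsS "${AM_URL}/-/ready"
}

# Sends a hand-made alert so the email path can be exercised end to end.
# "TestAlert" and the annotation text are illustrative values.
send_test_alert() {
  ./amtool alert add alertname=TestAlert severity=warning \
    --annotation=summary="manual test alert" \
    --alertmanager.url="${AM_URL}"
}
```

Run `check_alertmanager` first, then `send_test_alert`. If no email arrives, nohup.out is the place to look for "Notify for alerts failed" lines like the ones in the log above.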

Configuring Prometheus
vim /your prometheus path/prometheus.yml
Edit the prometheus.yml configuration file
This is the configuration file after the changes:
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rule.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'nginx'
    static_configs:
      - targets: ['192.168.56.129:9913']
  - job_name: 'tomcat'
    file_sd_configs:
      - files: ['/opt/prometheus/sd_config/tomcat.yml']
        refresh_interval: 180s
The parts that need to be configured are
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']
and
rule_files: # alerting rule files
  - "rule.yml"
Writing the rule.yml file
cat prometheus-2.26.0.linux-amd64/rule.yml
groups:
  - name: mem-rule
    rules:
      - alert: "Memory alert"
        expr: up == 0 # PromQL expression
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Service name: {{$labels.alertname}} memory alert"
          description: "{{ $labels.alertname }} memory utilization is above 5%"
          value: "{{ $value }}"
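Before restarting anything, both files can be checked with `promtool`, which ships in the same Prometheus release archive. This is a sketch that assumes `promtool`, `rule.yml`, and `prometheus.yml` all live in the `prometheus-2.26.0.linux-amd64` directory; the `PROM_DIR` variable and the function name are illustrative.

```shell
# Assumed location of the extracted Prometheus release; adjust to your path.
PROM_DIR="${PROM_DIR:-prometheus-2.26.0.linux-amd64}"

# Checks the rule file first, then the main config (which also loads rule_files).
validate_prometheus_files() {
  "${PROM_DIR}/promtool" check rules "${PROM_DIR}/rule.yml" &&
    "${PROM_DIR}/promtool" check config "${PROM_DIR}/prometheus.yml"
}
```

Calling `validate_prometheus_files` returns a non-zero status if either file fails to parse, which is usually cheaper to discover here than after a restart.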
To make the effect easy to demonstrate, the alert rule is up == 0 rather than a real memory check. The monitored services are Tomcat, Nginx, and Prometheus itself.
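To see which targets the expression currently matches, it can be run against the Prometheus HTTP API. The sketch assumes Prometheus is reachable at `http://localhost:9090`; the variable and function names are illustrative.

```shell
# Assumed address of the Prometheus server.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Lists the targets that currently match the alert expression.
query_down_targets() {
  curl -fsS "${PROM_URL}/api/v1/query" --data-urlencode 'query=up == 0'
}

# Lists alerts that Prometheus currently reports as pending or firing.
list_prometheus_alerts() {
  curl -fsS "${PROM_URL}/api/v1/alerts"
}
```

Stopping one of the monitored services and calling `query_down_targets` should return that target; once the 30s `for` period has passed, the alert should show up in `list_prometheus_alerts`.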
Restart Prometheus and Alertmanager
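As an alternative to a full restart, both programs re-read their configuration files when they receive SIGHUP. The sketch below assumes the processes run under their default binary names `prometheus` and `alertmanager` and that Prometheus listens on `http://localhost:9090`; the function names are illustrative.

```shell
# Assumed address of the Prometheus server.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Ask both running processes to re-read their configuration files.
reload_monitoring() {
  pkill -HUP -x prometheus
  pkill -HUP -x alertmanager
}

# Show which Alertmanager instances Prometheus is sending alerts to.
show_alertmanagers() {
  curl -fsS "${PROM_URL}/api/v1/alertmanagers"
}
```

After `reload_monitoring`, the output of `show_alertmanagers` should list the `localhost:9093` target configured in prometheus.yml. Stopping and starting both processes, as described here, achieves the same result.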
