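DevOps Series (7): Monitoring and Alerting with Grafana + Prometheus

Foreword

I plan to share a series of articles on DevOps, and I hope it helps those of you who love to learn and explore.

The articles mainly record concise, efficient operations and deployment commands, with the aim of documenting a system and making it quick to build. Like an operations document or handbook, they make it easy to rebuild, modify, and optimize a system. Each article stands alone and can be used by itself to deploy and use one component.

This chapter is DevOps Series (7): Monitoring and Alerting with Grafana + Prometheus.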

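Outline

DevOps Series: Introduction

DevOps Series (1): Setting up docker

DevOps Series (2): Setting up gitlab

DevOps Series (3): Setting up nexus-harbor

DevOps Series (4): Setting up jenkins

DevOps Series (5): Setting up an efk system

DevOps Series (6): Setting up grafana + prometheus

DevOps Series (7): Monitoring and alerting with grafana + prometheus

DevOps Series (8): Log monitoring and alerting with efk + prometheus + grafana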

Use prometheus + blackbox-exporter + alertmanager to monitor HTTP endpoints and send alerts


Install blackbox-exporter

docker run -d --restart=always --name blackbox-exporter -p 9115:9115 prom/blackbox-exporter

Once the container is running, open http://localhost:9115 to check it.

Modify prometheus.yml

rule_files:
   - "blackbox_rules.yml"

scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    file_sd_configs:
    - files: ['http_check.yml']
      refresh_interval: 10s
    relabel_configs:
     - source_labels: [__address__]
       target_label: __param_target
     - source_labels: [__param_target]
       target_label: instance
     - target_label: __address__
       replacement: 192.168.20.2:9115  # host and port where blackbox-exporter runs

Add blackbox_rules.yml under rule_files.

Add the job under scrape_configs.

Add the endpoints to check in http_check.yml.
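
To make the relabeling concrete: the three relabel_configs steps copy each target from http_check.yml into the target URL parameter, keep it as the instance label, and then point the scrape at blackbox-exporter itself. Below is a minimal Go sketch of the URL that Prometheus ends up requesting, assuming the exporter address and module from the config above. probeURL is a hypothetical helper written for illustration, not part of Prometheus.

```go
package main

import (
	"fmt"
	"net/url"
)

// probeURL builds the request that Prometheus effectively sends to
// blackbox-exporter once relabeling has moved the original target
// into the "target" query parameter.
func probeURL(exporterAddr, module, target string) string {
	q := url.Values{}
	q.Set("module", module) // from params.module in the scrape config
	q.Set("target", target) // from __param_target after relabeling
	u := url.URL{
		Scheme:   "http",
		Host:     exporterAddr, // the rewritten __address__
		Path:     "/probe",     // metrics_path
		RawQuery: q.Encode(),
	}
	return u.String()
}

func main() {
	fmt.Println(probeURL("192.168.20.2:9115", "http_2xx", "https://www.jafir.top"))
}
```

Requesting the printed URL by hand, for example with curl, is a quick way to check a target before restarting Prometheus.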

vi blackbox_rules.yml
groups:
- name: Service probing
  rules:
  - alert: BlackboxProbeFailed
    expr: probe_success == 0
    for: 0m
    labels:
      severity: critical
      team: node
    annotations:
      summary: Blackbox probe failed (instance {{ $labels.instance }})
      description: "Service online check failed\nCurrent value= {{ $value }}\nIp = {{ $labels.ip }}\nDomain= {{ $labels.domain }}\nEnv= {{ $labels.env }}\nService= {{ $labels.service }}"

  - alert: BlackboxProbeHttpFailure
    expr: probe_http_status_code <= 199 or probe_http_status_code >= 400
    for: 0m
    labels:
      severity: critical
      team: node
    annotations:
      summary: Blackbox probe HTTP failure (instance {{ $labels.instance }})
      description: "HTTP status code is not in 200-399\nCurrent value= {{ $value }}\nIp = {{ $labels.ip }}\nDomain= {{ $labels.domain }}\nEnv= {{ $labels.env }}\nService= {{ $labels.service }}"

  - alert: BlackboxSslCertificateWillExpireSoon
    expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
    for: 0m
    labels:
      severity: warning
    annotations:
        summary: Blackbox SSL certificate will expire soon (instance {{ $labels.instance }})
        description: "SSL certificate expires in 30 days\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"
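
The two threshold rules are plain arithmetic on the exported values. As a rough illustration only, the Go sketch below restates the HTTP status rule and the certificate expiry rule as functions. The function names and the sample timestamp are hypothetical, and Prometheus itself evaluates the PromQL expressions, not code like this.

```go
package main

import "fmt"

// thirtyDays is the same constant as in the rule: 86400 * 30 seconds.
const thirtyDays = 86400 * 30

// httpFailure mirrors:
//   probe_http_status_code <= 199 or probe_http_status_code >= 400
func httpFailure(statusCode float64) bool {
	return statusCode <= 199 || statusCode >= 400
}

// certExpiresSoon mirrors:
//   probe_ssl_earliest_cert_expiry - time() < 86400 * 30
// Both arguments are Unix timestamps in seconds.
func certExpiresSoon(certExpiryUnix, nowUnix float64) bool {
	return certExpiryUnix-nowUnix < thirtyDays
}

func main() {
	for _, code := range []float64{0, 200, 302, 404, 502} {
		fmt.Printf("status %v -> alert: %v\n", code, httpFailure(code))
	}

	now := 1694164404.0 // hypothetical "current" time
	fmt.Println("expires in 10 days:", certExpiresSoon(now+10*86400, now))
	fmt.Println("expires in 90 days:", certExpiresSoon(now+90*86400, now))
}
```

Note that a status value of 0 also satisfies the first branch, so the HTTP rule fires when no status code was recorded at all.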

Add the endpoint check configuration

vi http_check.yml
- targets: [ 'https://www.jafir.top' ]
  labels:
    name: web
    env: prod
    service: api 

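Restart prometheus and test.

In the example above, www.jafir.top is probed. If the probe fails, the matching alert rule fires and the alert is delivered through prometheus and alertmanager. Other important sites and endpoints can be probed the same way.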

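SMS and phone calls are supported by the PrometheusAlert all-in-one suite

The Aliyun voice and SMS templates we currently use do not support the code variable (code is only available for verification codes), so the only option is to download the PrometheusAlert source and modify it.

Find a Linux server and clone the source. The current version is 4.9.

git clone git@github.com:feiyu563/PrometheusAlert.git

Install go

yum install go

The file modified so far is

PrometheusAlert/controllers/aliyun.go

The following lines were added, one in each function:

        // in PostALYmessage
        TemplateKey := beego.AppConfig.DefaultString("ALY_DX_Template_Key", "code")
        // in PostALYphonecall
        TemplateKey := beego.AppConfig.DefaultString("ALY_DH_Template_Key", "code")

The full file follows: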

package controllers

import (
        "PrometheusAlert/models"
        "github.com/aliyun/alibaba-cloud-sdk-go/sdk/requests"
        "github.com/aliyun/alibaba-cloud-sdk-go/services/dysmsapi"
        "github.com/aliyun/alibaba-cloud-sdk-go/services/dyvmsapi"
        "github.com/astaxie/beego"
        "github.com/astaxie/beego/logs"
        "strings"
)

func PostALYmessage(Messages, PhoneNumbers, logsign string) string {
        open := beego.AppConfig.String("open-alydx")
        if open != "1" {
                logs.Info(logsign, "[alymessage]", "Aliyun SMS interface is not enabled; set open-alydx to 1 first")
                return "Aliyun SMS interface is not enabled; set open-alydx to 1 first"
        }
        AccessKeyId := beego.AppConfig.String("ALY_DX_AccessKeyId")
        AccessSecret := beego.AppConfig.String("ALY_DX_AccessSecret")
        SignName := beego.AppConfig.String("ALY_DX_SignName")
        Template := beego.AppConfig.String("ALY_DX_Template")
        // name of the variable in the Aliyun SMS template; defaults to "code"
        TemplateKey := beego.AppConfig.DefaultString("ALY_DX_Template_Key", "code")
        client, err := dysmsapi.NewClientWithAccessKey("cn-hangzhou", AccessKeyId, AccessSecret)

        request := dysmsapi.CreateSendSmsRequest()
        request.Scheme = "https"
        request.PhoneNumbers = PhoneNumbers
        request.SignName = SignName
        request.TemplateCode = Template
        request.TemplateParam = `{"` + TemplateKey + `":"` + Messages + `"}`
        response, err := client.SendSms(request)

        if err != nil {
                logs.Error(logsign, "[alymessage]", err.Error())
        }
        logs.Info(logsign, "[alymessage]", response)
        models.AlertToCounter.WithLabelValues("alydx").Add(1)
        ChartsJson.Alydx += 1
        return response.Message
}
func PostALYphonecall(Messages string, PhoneNumbers, logsign string) string {
        open := beego.AppConfig.String("open-alydh")
        if open != "1" {
                logs.Info(logsign, "[alyphonecall]", "Aliyun phone-call interface is not enabled; set open-alydh to 1 first")
                return "Aliyun phone-call interface is not enabled; set open-alydh to 1 first"
        }
        AccessKeyId := beego.AppConfig.String("ALY_DH_AccessKeyId")
        AccessSecret := beego.AppConfig.String("ALY_DH_AccessSecret")
        CalledShowNumber := beego.AppConfig.String("ALY_DX_CalledShowNumber")
        TtsCode := beego.AppConfig.String("ALY_DH_TtsCode")
        // name of the variable in the Aliyun voice template; defaults to "code"
        TemplateKey := beego.AppConfig.DefaultString("ALY_DH_Template_Key", "code")
        mobiles := strings.Split(PhoneNumbers, ",")
        for _, m := range mobiles {
                client, err := dyvmsapi.NewClientWithAccessKey("cn-hangzhou", AccessKeyId, AccessSecret)
                request := dyvmsapi.CreateSingleCallByTtsRequest()
                request.Scheme = "https"
                request.CalledShowNumber = CalledShowNumber
                request.CalledNumber = m
                request.TtsCode = TtsCode
                request.TtsParam = `{"` + TemplateKey + `":"` + Messages + `"}`
                request.PlayTimes = requests.NewInteger(2)

                response, err := client.SingleCallByTts(request)
                if err != nil {
                        logs.Error(logsign, "[alyphonecall]", err.Error())
                }
                logs.Info(logsign, "[alyphonecall]", response)
        }
        models.AlertToCounter.WithLabelValues("alydh").Add(1)
        ChartsJson.Alydh += 1
        return PhoneNumbers + "Called Over."
}

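After that, the new keys can be set in the conf file:

# Aliyun SMS template key
ALY_DX_Template_Key=desc

# Aliyun phone-call template key
ALY_DH_Template_Key=desc

The variable name is customizable; the default is code.

For example, here it is desc, so the Aliyun template can define a variable named desc.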



I ran into one more problem here, possibly related to the base image's file system, so I changed the base image in the Dockerfile from FROM alpine:3.18 to centos:7.

FROM centos:7

LABEL maintainer="jikun.zhang"

RUN yum -y install epel-release && \
    yum -y install tzdata && \
    ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && \
    echo "Asia/Shanghai" > /etc/timezone && \
    yum -y remove epel-release && \
    mkdir -p /app/logs && \
    yum -y install sqlite curl && \
    yum clean all

HEALTHCHECK --start-period=10s --interval=20s --timeout=3s --retries=3 \
    CMD curl -fs http://localhost:8080/health || exit 1

After modifying the source, rebuild:

make docker

This builds an image locally. To upload it to a private registry, tag it and push it (run docker login first):

docker tag feiyu563/prometheus-alert:latest harbor.jafir.top/java/feiyu563/prometheus-alert:latest
docker push harbor.jafir.top/java/feiyu563/prometheus-alert:latest

With that, alertmanager can send alerts through Aliyun SMS and voice calls.

PrometheusAlert configuration

Once it is deployed, edit app.conf to set the Aliyun AccessKey, secret, and so on (mapping the data to a host directory is recommended).

Add a custom SMS template


{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}
[Prometheus recovery]
{{$v.annotations.description}}
{{else}}
[Prometheus alert]
{{$v.annotations.description}}
{{end}}
{{ end }}

Template test payload:

{"receiver":"prometheus-dx-ali","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"BlackboxProbeHttpFailure","env":"prod","instance":"https://www.jafir.top","job":"blackbox-http","name":"http-web","service":"web","severity":"critical","team":"node"},"annotations":{"description":"HTTP status code is not in 200-399\nCurrent value= 502\nName= http-web\nEnv= prod\nService= web","summary":"Blackbox probe HTTP failure (instance https://www.jafir.top)"},"startsAt":"2023-09-08T09:13:24.324Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_http_status_code+%3C%3D+199+or+probe_http_status_code+%3E%3D+400\u0026g0.tab=1","fingerprint":"af6387fb6deb5043"}],"groupLabels":{"alertname":"BlackboxProbeHttpFailure","instance":"https://www.jafir.top"},"commonLabels":{"alertname":"BlackboxProbeHttpFailure","env":"prod","instance":"https://www.jafir.top","job":"blackbox-http","name":"http-web","service":"web","severity":"critical","team":"node"},"commonAnnotations":{"description":"HTTP status code is not in 200-399\nCurrent value= 502\nName= http-web\nEnv= prod\nService= web","summary":"Blackbox probe HTTP failure (instance https://www.jafir.top)"},"externalURL":"http://alertmanager:9093","version":"4","groupKey":"{}/{alertname=\"BlackboxProbeHttpFailure\"}:{alertname=\"BlackboxProbeHttpFailure\", instance=\"https://www.jafir.top\"}","truncatedAlerts":0}

Add a custom phone-call template


{{ range $k,$v:=.alerts }}{{if eq $v.status "resolved"}}Recovery: {{$v.annotations.description}}{{else}}Alert: {{$v.annotations.description}}{{end}}{{ end }}

Template test payload:

{"receiver":"prometheus-dh-ali","status":"firing","alerts":[{"status":"firing","labels":{"alertname":"BlackboxProbeFailed","env":"prod","instance":"https://www.jafir.top","job":"blackbox-http","name":"http-web","service":"web","severity":"critical","team":"node"},"annotations":{"description":"Service online check failed\nCurrent value= 0\nName= http-web\nEnv= prod\nService= web","summary":"Blackbox probe failed (instance https://www.jafir.top)"},"startsAt":"2023-09-08T09:13:24.324Z","endsAt":"0001-01-01T00:00:00Z","generatorURL":"http://prometheus:9090/graph?g0.expr=probe_success+%3D%3D+0\u0026g0.tab=1","fingerprint":"9d044445256d0302"}],"groupLabels":{"alertname":"BlackboxProbeFailed","instance":"https://www.jafir.top"},"commonLabels":{"alertname":"BlackboxProbeFailed","env":"prod","instance":"https://www.jafir.top","job":"blackbox-http","name":"http-web","service":"web","severity":"critical","team":"node"},"commonAnnotations":{"description":"Service online check failed\nCurrent value= 0\nName= http-web\nEnv= prod\nService= web","summary":"Blackbox probe failed (instance https://www.jafir.top)"},"externalURL":"http://alertmanager:9093","version":"4","groupKey":"{}/{alertname=\"BlackboxProbeFailed\"}:{alertname=\"BlackboxProbeFailed\", instance=\"https://www.jafir.top\"}","truncatedAlerts":0}

Test:
Set defaultPhone in app.conf to the number that should receive the test message.



alertmanager configuration

Add routes and receivers

Define the routing rules to suit your setup. Here the blackbox probe alerts are wired in so that they send an SMS or place a phone call.

routes:
  - receiver: 'web.hook.grafanaalert'  # route to the receiver named "web.hook.grafanaalert"
    match:
      __alert_rule_namespace_uid__: 'IrqNMj34z'  # matches this label (not alertname) to select Grafana alerts
  - receiver: 'prometheus-dh-ali'
    match:
      alertname: "BlackboxProbeFailed"
  - receiver: 'prometheus-dx-ali'
    match:
      alertname: "BlackboxProbeHttpFailure"

receivers:
- name: 'prometheus-dx-ali'
  webhook_configs:
  - url: 'http://prometheus-alert-new:8080/prometheusalert?type=alydx&tpl=prometheus-dx-ali&phone=139xxxx'
    send_resolved: false
- name: 'prometheus-dh-ali'
  webhook_configs:
  - url: 'http://prometheus-alert-new:8080/prometheusalert?type=alydh&tpl=prometheus-dh-ali&phone=139xxxx'
    send_resolved: false

Full file (two more receivers are added here: SMS plus WeCom, and phone call plus WeCom):

global:
  resolve_timeout: 15s
route:
  group_by: ['alertname','instance']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  receiver: 'web.hook.prometheusalert'

  routes:
  - receiver: 'web.hook.grafanaalert'  # route to the receiver named "web.hook.grafanaalert"
    match:
      __alert_rule_namespace_uid__: 'IrqNMj34z'  # matches this label (not alertname) to select Grafana alerts
  - match:
      alertname: "BlackboxProbeFailed"
    receiver: 'prometheus-dh-ali-all'
  - match:
      alertname: "BlackboxProbeHttpFailure"
    receiver: 'prometheus-dx-ali-all'

receivers:
- name: 'web.hook.prometheusalert'
  webhook_configs:
  - url: 'http://prometheus-alert:8080/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=your-wecom-webhook-url'
- name: 'web.hook.grafanaalert'
  webhook_configs:
  - url: 'http://prometheus-alert:8080/prometheusalert?type=wx&tpl=grafana-wx&wxurl=your-wecom-webhook-url'
- name: 'prometheus-dx-ali'
  webhook_configs:
  - url: 'http://prometheus-alert-new:8080/prometheusalert?type=alydx&tpl=prometheus-dx-ali&phone=139xxxx'
    send_resolved: false
- name: 'prometheus-dh-ali'
  webhook_configs:
  - url: 'http://prometheus-alert-new:8080/prometheusalert?type=alydh&tpl=prometheus-dh-ali&phone=139xxxx'
    send_resolved: false
- name: 'prometheus-dx-ali-all'
  webhook_configs:
  - url: 'http://prometheus-alert-new:8080/prometheusalert?type=alydx&tpl=prometheus-dx-ali&phone=139xxxx'
  - url: 'http://prometheus-alert:8080/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=your-wecom-webhook-url'
    send_resolved: false
- name: 'prometheus-dh-ali-all'
  webhook_configs:
  - url: 'http://prometheus-alert-new:8080/prometheusalert?type=alydh&tpl=prometheus-dh-ali&phone=139xxxx'
  - url: 'http://prometheus-alert:8080/prometheusalert?type=wx&tpl=prometheus-wx&wxurl=your-wecom-webhook-url'
    send_resolved: false
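
How an alert picks its receiver under this file can be sketched in a few lines of Go. This is a deliberately simplified model, assuming exact label matches and that the first matching child route wins. It ignores grouping, timing, regex matchers, and the continue option, and the type and function names are made up for this example.

```go
package main

import "fmt"

// route is a simplified child route: exact label matches plus a receiver.
type route struct {
	match    map[string]string
	receiver string
}

// defaultReceiver is the top-level receiver used when no child route matches.
const defaultReceiver = "web.hook.prometheusalert"

// childRoutes mirrors the routes section of the file above, in order.
var childRoutes = []route{
	{match: map[string]string{"__alert_rule_namespace_uid__": "IrqNMj34z"}, receiver: "web.hook.grafanaalert"},
	{match: map[string]string{"alertname": "BlackboxProbeFailed"}, receiver: "prometheus-dh-ali-all"},
	{match: map[string]string{"alertname": "BlackboxProbeHttpFailure"}, receiver: "prometheus-dx-ali-all"},
}

// pickReceiver returns the receiver of the first child route whose
// labels all match, or the default receiver otherwise.
func pickReceiver(labels map[string]string) string {
	for _, r := range childRoutes {
		matched := true
		for k, v := range r.match {
			if labels[k] != v {
				matched = false
				break
			}
		}
		if matched {
			return r.receiver
		}
	}
	return defaultReceiver
}

func main() {
	fmt.Println(pickReceiver(map[string]string{"alertname": "BlackboxProbeFailed"}))
	fmt.Println(pickReceiver(map[string]string{"alertname": "BlackboxProbeHttpFailure"}))
	fmt.Println(pickReceiver(map[string]string{"alertname": "BlackboxSslCertificateWillExpireSoon"}))
}
```

Under this model the certificate expiry alert falls through to the default receiver, since no child route names it.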

Restart alertmanager

docker restart alertmanager

The configuration above can be used as a reference. Its main purpose is to set up notifications by phone call, SMS, and WeCom.
