Environment preparation

Two hosts

IP:
  192.168.1.60, 192.168.1.61
Spec:
  2 CPU cores, 4 GB memory

Download the packages

#### Download on the master node:
wget https://github.com/prometheus/prometheus/releases/download/v2.37.6/prometheus-2.37.6.linux-amd64.tar.gz
#### Download on the monitored nodes:
wget https://github.com/prometheus/node_exporter/releases/download/v1.5.0/node_exporter-1.5.0.linux-amd64.tar.gz
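
A downloaded archive can be truncated or corrupted in transit, so it may be worth verifying it before extracting. The following is a minimal sketch, not part of the original procedure: `verify_sha256` is a hypothetical helper name, and the expected digest is assumed to be copied by hand from the checksum list published with the release (for example a `sha256sums.txt` asset, which should be confirmed on the release page).

```shell
# Hypothetical helper: compare a file's SHA-256 digest with an expected value.
# Usage: verify_sha256 <file> <expected-hex-digest>
verify_sha256() {
  vs_file="$1"
  vs_expected="$2"
  if [ ! -f "$vs_file" ]; then
    echo "missing file: $vs_file" >&2
    return 2
  fi
  # sha256sum prints "<digest>  <filename>"; keep only the digest.
  vs_actual="$(sha256sum "$vs_file" | awk '{print $1}')"
  if [ "$vs_actual" = "$vs_expected" ]; then
    echo "OK: $vs_file"
  else
    echo "MISMATCH: $vs_file" >&2
    return 1
  fi
}
```

It would be called as `verify_sha256 prometheus-2.37.6.linux-amd64.tar.gz <digest>`, where `<digest>` is the value taken from the release page.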

Deploy the packages

Extract the packages to the target directory

#### On the master node
tar -xf prometheus-2.37.6.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/prometheus-2.37.6.linux-amd64 /usr/local/prometheus
#### On the monitored nodes
tar -xf node_exporter-1.5.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/node_exporter-1.5.0.linux-amd64 /usr/local/node_exporter

Manage the Prometheus service with systemd

cat /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus-Server
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml
Restart=on-failure

[Install]
WantedBy=multi-user.target

Reload the systemd unit files

systemctl daemon-reload

Modify the configuration file

Edit the Prometheus configuration file on the master node

vi /usr/local/prometheus/prometheus.yml
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
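      - targets: ["localhost:9090"]     # Prometheus itself on the master node
        labels:                         # labels attached to this target
          instance: prometheus-master
      - targets: ["192.168.1.60:9100"]  # monitored node address
        labels:                         # labels for the monitored node
          instance: master
# Each targets entry represents one host; add one entry per additional host.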
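
A YAML indentation slip in this file prevents Prometheus from starting, so validating it before starting the service can help. The sketch below assumes the `promtool` binary from the same release archive sits next to `prometheus` in /usr/local/prometheus; `check_prometheus_config` is a hypothetical wrapper name.

```shell
# Hypothetical wrapper around "promtool check config".
# Usage: check_prometheus_config [install-dir]   (default: /usr/local/prometheus)
check_prometheus_config() {
  cpc_dir="${1:-/usr/local/prometheus}"
  cpc_tool="$cpc_dir/promtool"
  if [ ! -x "$cpc_tool" ]; then
    echo "promtool not found at $cpc_tool" >&2
    return 127
  fi
  # promtool exits non-zero if prometheus.yml fails validation.
  "$cpc_tool" check config "$cpc_dir/prometheus.yml"
}
```

Running `check_prometheus_config` with no argument checks /usr/local/prometheus/prometheus.yml.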

Start the services

Start Prometheus

systemctl start prometheus

Start node_exporter

/usr/local/node_exporter/node_exporter &
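
A process started with `&` does not come back after a reboot and is not restarted if it crashes. As an alternative, node_exporter could be managed by systemd in the same way as Prometheus. The following is a sketch under that assumption: the unit content mirrors the prometheus.service file from this article, and `write_node_exporter_unit` is a hypothetical helper that writes it to a path of your choice.

```shell
# Hypothetical helper: write a systemd unit for node_exporter to the given path.
# Usage: write_node_exporter_unit <destination-file>
write_node_exporter_unit() {
  cat > "$1" <<'EOF'
[Unit]
Description=node_exporter
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}
```

On a monitored node this might be used as `write_node_exporter_unit /usr/lib/systemd/system/node_exporter.service`, followed by `systemctl daemon-reload` and `systemctl start node_exporter`.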

Open http://192.168.1.60:9090 in a browser to view the Prometheus web UI
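
Besides the browser, reachability can be checked from a terminal. This sketch assumes `curl` is installed and relies on Prometheus's `/-/healthy` endpoint and node_exporter's `/metrics` endpoint; `wait_for_http` is a hypothetical helper name.

```shell
# Hypothetical helper: poll a URL until it answers with a success status.
# Usage: wait_for_http <url> [max-attempts]   (default: 10 attempts, 1s apart)
wait_for_http() {
  wfh_url="$1"
  wfh_tries="${2:-10}"
  wfh_i=1
  while [ "$wfh_i" -le "$wfh_tries" ]; do
    # -f makes curl fail on HTTP errors; the response body is discarded.
    if curl -fsS --max-time 2 -o /dev/null "$wfh_url" 2>/dev/null; then
      echo "up: $wfh_url"
      return 0
    fi
    wfh_i=$((wfh_i + 1))
    if [ "$wfh_i" -le "$wfh_tries" ]; then
      sleep 1
    fi
  done
  echo "no response: $wfh_url" >&2
  return 1
}
```

With the addresses used in this article, the calls would be `wait_for_http http://192.168.1.60:9090/-/healthy` for Prometheus and `wait_for_http http://192.168.1.60:9100/metrics` for node_exporter.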

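Set up WeChat Work (企业微信) alerting

Register a WeChat Work account

Visit https://work.weixin.qq.com/ in a browser to register.

Log in to WeChat Work

https://work.weixin.qq.com/    # log in by scanning the QR code

Create an application

1. Click "应用管理" (App Management), then click "创建应用" (Create App).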

2. Upload a logo, enter the application name, and select the members who can see the application.

3. After the application is created, its details page is displayed.

4. Add a trusted IP. Without it, WeChat Work will not accept the messages sent by the Alertmanager service.

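5. Save this application's key details (corp ID, AgentId, and Secret); they are needed later when editing the Alertmanager configuration file.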

Deploy Alertmanager

Download the package (on the host where Prometheus was installed in this article)

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz

Extract the Alertmanager archive to the target directory

tar -xf alertmanager-0.24.0.linux-amd64.tar.gz -C /usr/local/
mv /usr/local/alertmanager-0.24.0.linux-amd64 /usr/local/alertmanager

Manage the Alertmanager service with systemd

cat /usr/lib/systemd/system/alertmanager.service
[Unit]
Description=alertmanager
Documentation=https://github.com/prometheus/alertmanager
After=network.target
[Service]
Type=simple
User=root
ExecStart=/usr/local/alertmanager/alertmanager --config.file=/usr/local/alertmanager/alertmanager.yml
Restart=on-failure
[Install]
WantedBy=multi-user.target

Reload the systemd unit files

systemctl daemon-reload

Modify the Prometheus configuration file to add the Alertmanager endpoint and the alerting rules file

cat /usr/local/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:     # Alertmanager endpoint that Prometheus sends alerts to
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']   # Alertmanager address and port
           #- alertmanager:9093

rule_files:
  - "/usr/local/prometheus/rules.yml"  # alerting rules file

scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"
    static_configs:
      - targets: ["192.168.1.60:9090"]
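        labels:                         # labels attached to this target
          instance: prometheus-master
      - targets: ["192.168.1.60:9100"]  # monitored node address
        labels:                         # labels for the monitored node
          instance: master
  - job_name: "node"
    static_configs:
      - targets: ["192.168.1.61:9100"]  # monitored node address
        labels:                         # labels for the monitored node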
          instance: node

Add the rules.yml alerting rules file

cat /usr/local/prometheus/rules.yml
groups:
- name: Node_Down
  rules:
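  - alert: NodeInstanceDown            # alert name, shown as the alert type in the notification
    expr: up == 0
    for: 10s                           # the condition must hold for 10s before the alert fires
    labels:
      user: root
      severity: Warning                # alert severity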
    annotations:
      summary: "Instance {{ $labels.instance }} Down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been Down."

Modify the alertmanager.yml configuration file

cat /usr/local/alertmanager/alertmanager.yml
global:
  # If an alert has no end time and is not updated within 2 minutes, treat it as resolved
  resolve_timeout: 2m
# Custom notification templates
templates:
  - '/usr/local/alertmanager/wechat.tmpl'
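# route defines how alerts are dispatched
route:
  # Labels used to group alerts
  group_by: ['alertname']
  # How long to wait before the first notification for a new group, so that alerts of the same group arriving within 10s are sent together
  group_wait: 10s
  # How long to wait before notifying about new alerts added to a group that has already been notified
  group_interval: 10s
  # How long to wait before repeating a notification for alerts that are still firing; reduces duplicate WeChat messages
  repeat_interval: 1h
  # Default receiver
  receiver: 'wechat'
  routes:   # child routes: which receiver handles which alerts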
    - receiver: 'wechat'
      continue: true
      group_wait: 10s
receivers:
- name: 'wechat'
  wechat_configs:
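  - corp_id: 'xxxxxxx'    # corp ID, shown under "我的企业" (My Company)
    to_party: '2'  # Note: department ID 1 cannot be used here; use the department ID (greater than 1) of the target contacts or group. To target department 1, write the group name here instead.
    agent_id: '1000002'    # AgentId, shown on the page of the application you created
    api_secret: 'xxxxxxxx'    # Secret, shown on the page of the application you created in WeChat Work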
    send_resolved: true

Configure the message template file wechat.tmpl

cat /usr/local/alertmanager/wechat.tmpl
{{ define "wechat.default.message" }}
{{ range .Alerts }}
========start=========
Alert source: prometheus_alert
Severity: {{ .Labels.severity }}
Alert type: {{ .Labels.alertname }}
Affected host: {{ .Labels.instance }}
Summary: {{ .Annotations.summary }}
Details: {{ .Annotations.description }}
Triggered at: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
=========end===========
{{ end }}
{{ end }}
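
Alertmanager refuses to start on a malformed configuration, so a validation step before starting the service may save a round trip. The sketch below assumes the `amtool` binary from the same release archive is in /usr/local/alertmanager; `check_alertmanager_config` is a hypothetical wrapper name. `amtool check-config` should also try to load the files listed under `templates`, so a wrong template path is likely to surface here.

```shell
# Hypothetical wrapper around "amtool check-config".
# Usage: check_alertmanager_config [install-dir]   (default: /usr/local/alertmanager)
check_alertmanager_config() {
  cac_dir="${1:-/usr/local/alertmanager}"
  cac_tool="$cac_dir/amtool"
  if [ ! -x "$cac_tool" ]; then
    echo "amtool not found at $cac_tool" >&2
    return 127
  fi
  # amtool exits non-zero if alertmanager.yml fails validation.
  "$cac_tool" check-config "$cac_dir/alertmanager.yml"
}
```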

Start the Alertmanager service

systemctl start alertmanager
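
To exercise the notification path without taking a node down, a synthetic alert could be posted straight to Alertmanager through its v2 HTTP API (`POST /api/v2/alerts`). This is a sketch, assuming `curl` is available and that Alertmanager listens on localhost:9093 as in the prometheus.yml above; the helper names and the `ManualTest` alert are made up for illustration.

```shell
# Hypothetical helper: print a JSON payload containing one synthetic alert.
# Usage: build_test_alert [instance-label]
build_test_alert() {
  bta_instance="${1:-test-host}"
  cat <<EOF
[{"labels":{"alertname":"ManualTest","severity":"Warning","instance":"$bta_instance"},
  "annotations":{"summary":"Manual test alert","description":"Sent by hand to test the WeChat Work receiver."}}]
EOF
}

# Hypothetical helper: POST the payload to Alertmanager's v2 alerts endpoint.
# Usage: send_test_alert [alertmanager-base-url] [instance-label]
send_test_alert() {
  sta_url="${1:-http://localhost:9093}"
  build_test_alert "${2:-}" | curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    --data-binary @- "$sta_url/api/v2/alerts"
}
```

Calling `send_test_alert` with no arguments posts one alert to the local Alertmanager. Because the payload sets no end time, Alertmanager should treat the alert as resolved once `resolve_timeout` has passed.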

Restart the Prometheus and node_exporter services

systemctl restart prometheus
kill -9 $(ps -elf | grep node_exporter | grep -v grep | awk '{print $4}')
/usr/local/node_exporter/node_exporter &

Check the alert in WeChat Work

Kill the node_exporter process; once the alert fires, the alert message arrives in WeChat Work.
