prometheus+grafana+alertmanager: server operations monitoring and … (Part 2)


[root@iZbp155czz7lmsi13r3qmqZ prometheus]# vi /opt/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
route:
  receiver: webhook
  group_wait: 30s
  group_interval: 1m
  repeat_interval: 4h
  group_by: [alertname]
  routes:
  - receiver: webhook
    group_wait: 10s
receivers:
- name: webhook
  webhook_configs:
  - url: http://121.41.18.234:8060/dingtalk/webhook1/send
    send_resolved: true
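Because the receiver sets send_resolved: true, Alertmanager POSTs a JSON body to the webhook URL both when an alert fires and when it resolves. As a rough illustration of what the receiving side has to handle, here is a minimal Python sketch that turns such a body into one text line per alert. The function name format_notification and the sample values are hypothetical; the field names (status, alerts, annotations) follow Alertmanager's webhook payload format, and the annotation keys are the ones used in this article's rules. It is not the code of the DingTalk webhook service used here.

```python
def format_notification(payload: dict) -> str:
    """Render an Alertmanager webhook payload as one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        annotations = alert.get("annotations", {})
        # Each alert carries its own status: "firing" or "resolved".
        status = alert.get("status", payload.get("status", "unknown"))
        lines.append(
            "[{}] {} on {}: {}".format(
                status,
                annotations.get("Alert_type", "?"),
                annotations.get("Server", "?"),
                annotations.get("explain", ""),
            )
        )
    return "\n".join(lines)


# Hypothetical sample body (English stand-ins for the article's annotation values).
sample = {
    "version": "4",
    "status": "resolved",
    "receiver": "webhook",
    "alerts": [
        {
            "status": "resolved",
            "labels": {"alertname": "ServiceDown", "instance": "localhost", "team": "node"},
            "annotations": {
                "Alert_type": "service alert",
                "Server": "localhost",
                "explain": "netdata service is down",
            },
        }
    ],
}

print(format_notification(sample))
```

A real receiver would wrap this in an HTTP handler and forward the text to the chat tool; that part is left out here.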
Run:
docker run -d -p 9093:9093 -v /opt/alertmanager/:/etc/alertmanager/ --name alertmanager prom/alertmanager
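Once the container is up, the route to the webhook can be exercised without waiting for Prometheus by pushing a hand-made alert to Alertmanager's v2 API (POST /api/v2/alerts). The following is only a sketch under assumptions: the helper names build_test_alert and post_alerts, the alert name ManualTest, and the annotation text are made up for illustration, and the address in the usage comment is the one from this article.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def build_test_alert(alertname: str, instance: str, minutes: int = 5) -> list:
    """Build a one-element alert list in the shape /api/v2/alerts accepts."""
    now = datetime.now(timezone.utc)
    return [
        {
            "labels": {"alertname": alertname, "instance": instance, "team": "node"},
            "annotations": {"explain": "manual test alert"},
            # RFC 3339 timestamps; the alert counts as resolved after endsAt.
            "startsAt": now.isoformat(),
            "endsAt": (now + timedelta(minutes=minutes)).isoformat(),
        }
    ]


def post_alerts(base_url: str, alerts: list, timeout: float = 5.0) -> int:
    """POST the alerts to Alertmanager and return the HTTP status code."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(alerts).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Usage (needs a reachable Alertmanager, so it is left commented out):
# post_alerts("http://121.41.18.234:9093", build_test_alert("ManualTest", "localhost"))
```

If the routing works, the test alert should reach the webhook after the group_wait delay configured above.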
Add the alerting rules:
[root@iZbp155czz7lmsi13r3qmqZ prometheus]# vi /opt/prometheus/rules.yml
groups:
- name: host_monitoring
  rules:
  # Memory alert: free RAM below 800 MiB for 2 minutes
  - alert: 内存报警
    expr: netdata_system_ram_MiB_average{chart="system.ram",dimension="free",family="ram"} < 800
    for: 2m
    labels:
      team: node
    annotations:
      Alert_type: 内存报警
      Server: '{{$labels.instance}}'
      #summary: "{{$labels.instance}}: High Memory usage detected"
      # English: memory usage is above 90%, amount remaining: <value>M
      explain: "内存使用量超过90%,目前剩余量为:{{ $value }}M"
      #description: "{{$labels.instance}}: Memory usage is above 80% (current value is: {{ $value }})"
  # CPU alert: idle CPU below 20% for 2 minutes
  - alert: CPU报警
    expr: netdata_system_cpu_percentage_average{chart="system.cpu",dimension="idle",family="cpu"} < 20
    for: 2m
    labels:
      team: node
    annotations:
      Alert_type: CPU报警
      Server: '{{$labels.instance}}'
      # English: CPU usage is above 80%, amount remaining: <value>
      explain: "CPU使用量超过80%,目前剩余量为:{{ $value }}"
      #summary: "{{$labels.instance}}: High CPU usage detected"
      #description: "{{$labels.instance}}: CPU usage is above 80% (current value is: {{ $value }})"
  # Disk alert: less than 4 GiB available on / for 2 minutes
  - alert: 磁盘报警
    expr: netdata_disk_space_GiB_average{chart="disk_space._",dimension="avail",family="/"} < 4
    for: 2m
    labels:
      team: node
    annotations:
      Alert_type: 磁盘报警
      Server: '{{$labels.instance}}'
      # English: disk usage is above 90%, amount remaining: <value>G
      explain: "磁盘使用量超过90%,目前剩余量为:{{ $value }}G"
  # Service alert: a scrape target has been down for 2 minutes
  - alert: 服务告警
    expr: up == 0
    for: 2m
    labels:
      team: node
    annotations:
      Alert_type: 服务报警
      Server: '{{$labels.instance}}'
      # English: the netdata service is down
      explain: "netdata服务已关闭"
[root@iZbp155czz7lmsi13r3qmqZ prometheus]# vi /opt/prometheus/prometheus.yml
global:
  scrape_interval: 60s
  evaluation_interval: 60s
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets: ["121.41.18.234:9093"]
# Rule configuration: the files are loaded at startup (and on reload);
# the rules are then evaluated every evaluation_interval
rule_files:
  - "rules.yml"
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['121.41.18.234:9090']
        labels:
          instance: prometheus
  - job_name: linux
    static_configs:
      - targets: ['121.41.18.234:9100']
        labels:
          instance: localhost
Once the configuration is complete, restart Prometheus.
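After the restart, one way to confirm that rules.yml was actually loaded is to ask Prometheus for its rule list through the HTTP API (GET /api/v1/rules). The sketch below assumes the Prometheus address from this article; the helper names fetch_rules and list_alerting_rules and the sample response are hypothetical, trimmed to the fields the code reads.

```python
import json
import urllib.request


def fetch_rules(base_url: str, timeout: float = 5.0) -> dict:
    """GET /api/v1/rules from a Prometheus server and decode the JSON body."""
    url = base_url.rstrip("/") + "/api/v1/rules"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def list_alerting_rules(payload: dict) -> list:
    """Return (group, rule name, state) for every alerting rule in the response."""
    found = []
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting":
                found.append((group.get("name"), rule.get("name"), rule.get("state")))
    return found


# Hypothetical, trimmed response for illustration only.
sample_response = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "host_monitoring",
                "rules": [
                    {"type": "alerting", "name": "ServiceDown", "state": "inactive"},
                    {"type": "recording", "name": "some:recorded:metric"},
                ],
            }
        ]
    },
}

print(list_alerting_rules(sample_response))

# Against a live server (left commented out because it needs network access):
# print(list_alerting_rules(fetch_rules("http://121.41.18.234:9090")))
```

An empty list from the live server would suggest that the rule file was not found or failed to parse.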
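Test the alert: stop the monitored service so that the up == 0 rule becomes true.
Wait two minutes and the alert appears.
At this point no notification has been sent yet; the condition has to persist for a while longer, so keep waiting...
Once the alert changes to the firing state, the notification is sent.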
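The waiting comes from the for: 2m clause combined with the 60s evaluation_interval: a rule whose expression is true is first marked pending, and it only switches to firing once the expression has stayed true for the full hold time. The following Python sketch is a simplified model of that behaviour, not Prometheus code; the function name alert_states is made up, and details such as scrape timing and staleness are ignored.

```python
from datetime import timedelta


def alert_states(results, interval=timedelta(seconds=60), hold=timedelta(minutes=2)):
    """Map a sequence of per-evaluation expression results to alert states.

    results: one boolean per rule evaluation (True = expression matched).
    interval: time between evaluations (evaluation_interval).
    hold: how long the expression must stay true before firing (for:).
    """
    states = []
    active_since = None  # time at which the expression first became true
    for index, is_true in enumerate(results):
        now = index * interval
        if not is_true:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        states.append("firing" if now - active_since >= hold else "pending")
    return states


# Target goes down at the second evaluation and comes back at the sixth.
print(alert_states([False, True, True, True, True, False]))
# → ['inactive', 'pending', 'pending', 'firing', 'firing', 'inactive']
```

In this model the alert spends two evaluations in pending before it fires, which matches the extra wait described above.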
Then start the service again.
After a short wait, the alert clears.
The recovery message arrives, and with that this article comes to an end!