 
 

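- Resource preparation
- Monitoring host: Ubuntu 20.04, AMD EPYC 7551, 1 core / 1 GB RAM (8.210.17.226)
- Monitored host: Debian 10, ARM, 2 cores / 12 GB RAM (47.242.196.53)
- Before you install
- 2.1 Services to install on the monitoring host:
- Prometheus
- Node Exporter
- Grafana
- Alertmanager
- webhook-adapter
- 2.2 Services to install on the monitored host:
- Node Exporter
- 2.3 Linux system settings that affect the services:
- Time synchronization
- Firewall and SELinux
- All services in this guide run as containers, so Docker must be installed
- Docker installation
- Deploying the components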
- 4.1 Install Node Exporter
- Pull the image: docker pull prom/node-exporter:latest
- Create the startup script: vi node-export-start.sh
- Script contents:
docker run -d -p 9100:9100 \
-v "/proc:/host/proc" \
-v "/sys:/host/sys" \
-v "/:/rootfs" \
-v "/etc/localtime:/etc/localtime" \
prom/node-exporter \
--path.procfs /host/proc \
--path.sysfs /host/sys \
--collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)"
- Start Node Exporter: chmod +x node-export-start.sh && ./node-export-start.sh
- Verify that Node Exporter is running: open http://8.210.17.226:9100/metrics
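
Besides opening the page in a browser, the /metrics output can be checked from a script. The following is a minimal sketch, not part of the original setup: the function names `parse_metrics` and `fetch_metrics` are made up for illustration, and the parser assumes plain `name{labels} value` lines without timestamps, which is what Node Exporter normally serves.

```python
from urllib.request import urlopen


def parse_metrics(text):
    """Parse Prometheus text-format lines into a {series: value} dict."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value follows the last space (assumes no trailing timestamp).
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue
    return samples


def fetch_metrics(url, timeout=5.0):
    """Download a /metrics page and parse it (needs network access)."""
    with urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

Calling `fetch_metrics("http://8.210.17.226:9100/metrics")` from a machine that can reach the host should return a non-empty dict if the exporter is up; a key such as `node_memory_MemTotal_bytes` is a convenient one to look for.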
 
- 4.2 Install Prometheus
- Pull the image: docker pull prom/prometheus
- Create the startup script: vi prometheus-start.sh
- Script contents:
docker run -d \
    -p 9090:9090 \
    -v /home/docker/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml \
    -v /home/docker/prometheus/rules:/etc/prometheus/rules \
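    prom/prometheus
- Before starting the container, create the configuration directory and file in a terminal:

mkdir -p /home/docker/prometheus/rules
vi /home/docker/prometheus/prometheus.yml
- Paste the following configuration:
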
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 8.210.17.226:9093
rule_files:
  - "rules/*.yml"
scrape_configs:
  # Jobs to scrape
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          serviceId: prometheus
          serviceName: 普罗米修斯
  - job_name: "node_exporter"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['10.0.0.240:9100','47.242.196.53:9100']
- Start Prometheus: chmod +x prometheus-start.sh && ./prometheus-start.sh
- Verify that Prometheus is running: open http://8.210.17.226:9090/targets
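
The same information shown on the /targets page is also exposed as JSON by the Prometheus HTTP API at /api/v1/targets. As a rough sketch (the helper names `summarize_targets`, `unhealthy` and `fetch_targets` are illustrative, not from the original post), the response can be reduced to a per-job health summary like this:

```python
import json
from urllib.request import urlopen


def summarize_targets(payload):
    """Return {job: {instance: health}} from a /api/v1/targets response."""
    summary = {}
    for target in payload["data"]["activeTargets"]:
        labels = target["labels"]
        job = labels.get("job", "unknown")
        instance = labels.get("instance", "unknown")
        summary.setdefault(job, {})[instance] = target["health"]
    return summary


def unhealthy(summary):
    """List (job, instance) pairs whose health is anything other than 'up'."""
    return sorted(
        (job, instance)
        for job, instances in summary.items()
        for instance, health in instances.items()
        if health != "up"
    )


def fetch_targets(base_url, timeout=5.0):
    """Query the targets API of a running Prometheus (needs network access)."""
    with urlopen(base_url + "/api/v1/targets", timeout=timeout) as resp:
        return json.load(resp)
```

For this setup, `unhealthy(summarize_targets(fetch_targets("http://8.210.17.226:9090")))` would be expected to return an empty list once every exporter is reachable.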
 
- 4.3 Install Grafana
- Pull the image: docker pull grafana/grafana:latest
- Create the startup script: vi grafana-start.sh
- Script contents:
docker run -d -i -p 3000:3000 \
-v "/etc/localtime:/etc/localtime" \
-e "GF_SERVER_ROOT_URL=http://grafana.server.name" \
-e "GF_SECURITY_ADMIN_PASSWORD=studygolang" \
grafana/grafana
- Start Grafana: chmod +x grafana-start.sh && ./grafana-start.sh
- Verify that Grafana is running: open http://8.210.17.226:3000/metrics or http://8.210.17.226:3000 and log in with username admin, password studygolang
 
- 4.4 Install Alertmanager
- Pull the image: docker pull prom/alertmanager:latest
- Start a temporary container:
 
docker run --name alertmanager -d -p 9093:9093 --restart=always \
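prom/alertmanager
- Copy the default configuration out of the container: docker cp alertmanager:/etc/alertmanager /home/docker
- Remove the temporary container (for example with docker rm -f alertmanager), then create the startup script: vi alertmanager-start.sh
- Script contents: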
docker run --name alertmanager -d -p 9093:9093 --restart=always \
-v /home/docker/alertmanager/:/etc/alertmanager/ \
prom/alertmanager
- 4.5 Install webhook-adapter
- Pull the image: docker pull guyongquan/webhook-adapter:latest
- Create the startup script: vi webhook-adapter-start.sh
- Script contents:
docker run --name webhook-adapter -p 8080:80 -d guyongquan/webhook-adapter \
"--adapter=/app/prometheusalert/wx.js=/wx=https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=<WeCom group robot key>"
- Using Grafana
- 5.1 Add Prometheus as a data source
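
The data source is normally added through the Grafana web UI, but Grafana also accepts it over its HTTP API (POST /api/datasources). The sketch below only builds the request; the helper names are hypothetical, the data source name "Prometheus" is an arbitrary choice, and the credentials are the ones set in the Grafana startup script above.

```python
import base64
import json
from urllib.request import Request


def datasource_payload(name, prometheus_url):
    """JSON body describing a Prometheus data source for Grafana."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # Grafana's backend queries Prometheus on the user's behalf
        "isDefault": True,
    }


def build_request(grafana_url, user, password, payload):
    """Build (but do not send) the POST request with HTTP basic auth."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return Request(
        grafana_url + "/api/datasources",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
        method="POST",
    )
```

To actually create the data source, pass the result of `build_request("http://8.210.17.226:3000", "admin", "studygolang", datasource_payload("Prometheus", "http://8.210.17.226:9090"))` to `urllib.request.urlopen`.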
- Configuring Prometheus and Alertmanager alerting
- 6.1 Edit the Prometheus configuration file prometheus.yml
- In a terminal:

cd /home/docker/prometheus
vim prometheus.yml
- Update the configuration:

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - 8.210.17.226:9093
rule_files:
  - "rules/*.yml" # The rules directory must be mounted when the Prometheus container starts, otherwise these files cannot be read
scrape_configs:
  # Jobs to scrape
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['localhost:9090']
        labels:
          serviceId: prometheus
          serviceName: 普罗米修斯
  - job_name: "cloudcone"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['8.210.17.226:9100']
  - job_name: "oracle_SanJose_ARM"
    # metrics_path defaults to '/metrics'
    static_configs:
      - targets: ['47.242.196.53:9100']
- 6.2 Create the rule files in the Prometheus rules directory
- memory_over.yml:

groups:
- name: Memory alert rules
  rules:
  - alert: HighMemoryUsage
    expr: (1 - (node_memory_MemAvailable_bytes / (node_memory_MemTotal_bytes))) * 100 > 80
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Available memory on the server is running low."
      description: "Memory usage has exceeded 80% (current value: {{ $value }}%)"
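
To make the memory rule concrete, here is a small sketch of the arithmetic behind the expression and of how the `for: 1m` clause delays firing. It is a simplified model rather than Prometheus's actual evaluator, and the function names are invented for this example.

```python
def memory_usage_percent(mem_available_bytes, mem_total_bytes):
    """Same arithmetic as the rule: (1 - MemAvailable / MemTotal) * 100."""
    return (1 - mem_available_bytes / mem_total_bytes) * 100


def first_firing_time(evaluations, threshold, hold_seconds):
    """Return the first timestamp at which the alert would fire, or None.

    evaluations: time-ordered list of (timestamp_seconds, value).
    The condition must stay true across consecutive evaluations for at
    least hold_seconds (the rule's "for" duration); a single evaluation
    below the threshold resets the pending timer.
    """
    pending_since = None
    for ts, value in evaluations:
        if value > threshold:
            if pending_since is None:
                pending_since = ts
            if ts - pending_since >= hold_seconds:
                return ts
        else:
            pending_since = None
    return None
```

With the 15-second evaluation interval configured earlier, a host that stays above 80% fires 60 seconds after the first breaching evaluation, while a brief dip below the threshold restarts the wait.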
- disk_over.yml:
 
groups:
- name: Disk usage alert rules
  rules:
  - alert: HighDiskUsage
    expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 80
    for: 20m
    labels:
      severity: warning
    annotations:
      summary: "Disk partition usage is too high"
      description: "Partition usage is above 80% (current value: {{ $value }}%)"
- cpu_over.yml:
 
groups:
- name: CPU alert rules
  rules:
  - alert: HighCpuUsage
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]) )) * 100 > 50
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage is spiking."
      description: "CPU usage is above 50% (current value: {{ $value }}%)"
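
The CPU expression is less obvious than the memory one, so a worked example may help. `node_cpu_seconds_total{mode="idle"}` is a per-core counter of idle seconds; `irate` turns the last two samples into a per-second idle fraction, `avg by (instance)` averages the cores, and subtracting from 100 gives the busy percentage. The sketch below mirrors that calculation under simplifying assumptions (no counter resets, at least two samples per core); the function names are illustrative.

```python
def irate(samples):
    """Per-second increase between the last two (timestamp, value) samples."""
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    return (v2 - v1) / (t2 - t1)


def cpu_usage_percent(idle_samples_per_cpu):
    """100 - avg(idle rate across cores) * 100, as in the alert expression.

    idle_samples_per_cpu: {cpu_id: [(timestamp_seconds, idle_seconds), ...]}
    """
    rates = [irate(samples) for samples in idle_samples_per_cpu.values()]
    return 100 - (sum(rates) / len(rates)) * 100
```

For instance, if over 15 seconds one core was idle for 12 seconds and another for 3 seconds, the average idle fraction is 0.5 and the computed usage is 50%, which sits exactly on the threshold and does not satisfy `> 50`.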
- node_alived.yml:
 
groups:
- name: Instance liveness alert rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Host is down!"
      description: "This instance has been down for more than one minute."
- 6.3 Edit the Alertmanager configuration file alertmanager.yml
- In a terminal:

cd /home/docker/alertmanager
vim alertmanager.yml
- alertmanager.yml:

global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
- name: 'web.hook'
  webhook_configs:
  - send_resolved: true
    url: 'http://8.210.17.226:8080/adapter/wx'    
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
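
With this configuration Alertmanager POSTs a JSON document to the webhook URL, and webhook-adapter converts it into a message for the WeCom group robot. The sketch below illustrates that kind of conversion; it is not the adapter's actual wx.js logic. It assumes the standard Alertmanager webhook fields (`alerts`, `status`, `labels`, `annotations`) and the robot's plain `text` message type, and the function name is made up.

```python
def to_wecom_text(payload):
    """Turn an Alertmanager webhook payload into a WeCom text message body."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "?"),
                instance=labels.get("instance", "?"),
                summary=annotations.get("summary", ""),
            )
        )
    # One line per alert; resolved alerts arrive because send_resolved is true.
    return {"msgtype": "text", "text": {"content": "\n".join(lines)}}
```

Because `send_resolved: true` is set, the same receiver also gets payloads whose alerts have status `resolved`, which this sketch would render with a `[RESOLVED]` prefix.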
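- To be improved
- When a new machine needs to be monitored, the services on the monitoring host must be restarted after the configuration files are updated
- Metrics from inside containers are not collected yet (cAdvisor can do this; not set up for now)
- The monitoring host is currently under some resource pressure, with memory usage at 70% or higher
- Grafana is not served over HTTPS (of little use here)
- Push server alerts to WeChat
- Other issues yet to be discovered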
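
On the restart requirement in the first item: one possible way around it, which this setup does not use, is Prometheus's file-based service discovery (`file_sd_configs`), where targets live in JSON files that Prometheus re-reads when they change. The sketch below only generates such a file's contents. The helper names and the `role` label in the usage note are hypothetical, and the scrape job would still need a `file_sd_configs` entry pointing at a mounted directory.

```python
import json


def file_sd_groups(targets, labels=None):
    """Build the list-of-groups structure used by file_sd JSON files."""
    group = {"targets": sorted(set(targets))}
    if labels:
        group["labels"] = dict(labels)
    return [group]


def render_file_sd(targets, labels=None):
    """Serialize the target groups as JSON text ready to write to disk."""
    return json.dumps(file_sd_groups(targets, labels), indent=2)
```

For example, `render_file_sd(["8.210.17.226:9100", "47.242.196.53:9100"], {"role": "node"})` yields a single target group; adding a new machine would then mean rewriting this file instead of editing prometheus.yml.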