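Zabbix Monitoring

Basic Concepts

What is Zabbix? What does its architecture look like?

Answer: Zabbix is an open-source, enterprise-grade distributed monitoring solution that monitors network parameters, server health, and the integrity of application services. Its architecture follows a client/server (C/S) model: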

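Core components:

Zabbix Server: the core service; processes collected data and raises alerts
Zabbix Agent: deployed on each monitored host to collect data
Zabbix Proxy: a distributed monitoring proxy that collects data on the Server's behalf and reduces its load
Database: stores configuration and historical data
Web UI: the management interface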

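How does Zabbix differ from Prometheus?

Feature	Zabbix	Prometheus
Architecture	Client/server (agents, optional proxies)	Pull-based scraping of HTTP endpoints
Data model	Key-value pairs	Time series + labels
Storage	Relational database	TSDB (local storage)
Alerting	Built in	Alertmanager (separate component)
Visualization	Simple built-in graphs	Usually paired with Grafana
Typical use	Traditional infrastructure	Cloud-native / container environments
Learning curve	Lower	Moderate
Scalability	Suited to small and medium deployments	Large clusters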

Installation and Configuration

How do you install Zabbix Server?

bash

# Install on CentOS/RHEL 8
# 1. Install the Zabbix repository
rpm -Uvh https://repo.zabbix.com/zabbix/6.4/rhel/8/x86_64/zabbix-release-6.4-1.el8.noarch.rpm
dnf clean all

# 2. Install Zabbix Server, the web frontend, and the Agent
dnf install zabbix-server-mysql zabbix-web-mysql zabbix-apache-conf zabbix-sql-scripts zabbix-agent

# 3. Create the database (mysql prompts for the root password)
mysql -uroot -p
mysql> create database zabbix character set utf8mb4 collate utf8mb4_bin;
mysql> create user 'zabbix'@'localhost' identified by '<password>';
mysql> grant all privileges on zabbix.* to 'zabbix'@'localhost';
# Needed for the import when binary logging is enabled; set it back to 0 afterwards
mysql> set global log_bin_trust_function_creators = 1;
mysql> quit;

# 4. Import the initial schema and data (prompts for the zabbix user's password)
zcat /usr/share/doc/zabbix-sql-scripts/mysql/server.sql.gz | mysql --default-character-set=utf8mb4 -uzabbix -p zabbix

# 5. Configure the database connection
vi /etc/zabbix/zabbix_server.conf
DBHost=localhost
DBName=zabbix
DBUser=zabbix
DBPassword=<password>

# 6. Start the services and enable them at boot
systemctl restart zabbix-server zabbix-agent httpd php-fpm
systemctl enable zabbix-server zabbix-agent httpd php-fpm

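How many working modes does the Zabbix Agent have?

Active mode

ini

# /etc/zabbix/zabbix_agentd.conf
# Zabbix only supports comments that start at the beginning of a line.
# Address of the Zabbix Server that receives active-check data
ServerActive=192.168.1.100:10051
# Must match the host name configured on the Server
Hostname=agent1.example.com
# How often, in seconds, the Agent refreshes its list of active checks
RefreshActiveChecks=120

Characteristics:

The Agent initiates the connection and sends data to the Server
Suits environments where firewalls block inbound connections to the Agent
Reduces connection load on the Server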

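Passive mode

ini

# /etc/zabbix/zabbix_agentd.conf
# Server IP addresses allowed to connect to this Agent
Server=192.168.1.100
# Port the Agent listens on
ListenPort=10050

Characteristics:

The Server connects to the Agent and pulls data
Simple to configure; this is the default mode
The Server must be able to reach the Agent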

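Items and Triggers

What is an item? Which types are commonly used?

Answer: An item is the basic unit of data collection in Zabbix. It defines what data to collect, how to collect it, and how often.

Common item types:

Type	Description	Example
`Zabbix agent`	Collected through the Agent (passive)	`system.cpu.load`
`Zabbix agent (active)`	Pushed by the Agent	`vfs.fs.size[/,free]`
`SNMP v1/v2/v3`	Network device monitoring	`ifInOctets[1]` (OID such as `IF-MIB::ifInOctets.1`)
`Simple check`	Simple checks (ICMP/TCP)	`icmpping[]`
`Internal`	Zabbix internal metrics	`zabbix[queue]`
`External check`	External script	`script.sh`
`Database monitor`	Database query over ODBC	`db.odbc.select[<description>,<dsn>]`
`HTTP agent`	Web monitoring	HTTP status code
`IPMI`	Hardware monitoring	Temperature, voltage
`SSH agent`	Run commands remotely over SSH	Custom scripts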

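Common built-in item keys

bash

# CPU
system.cpu.util          # CPU utilization (%)
system.cpu.load[all,avg1] # 1-minute load average

# Memory
vm.memory.size[available]   # available memory
vm.memory.size[total]       # total memory

# Disk
vfs.fs.size[/,free]         # free space on the root filesystem
vfs.fs.inode[/,pfree]       # free inodes on the root filesystem (%)
vfs.dev.read[,ops]          # disk read operations per second

# Network
net.if.in[eth0,bytes]       # inbound bytes on eth0
net.if.out[eth0,bytes]      # outbound bytes on eth0

# Processes
proc.num[nginx]              # number of nginx processes (the first parameter is the process name)
net.tcp.listen[80]           # whether port 80 is listening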

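Trigger expression syntax

yaml

# Zabbix 5.4 and later (including the 6.4 release installed above) use function(/host/key,parameters)
# Basic expressions
last(/host/item)>90                      # latest value is greater than 90
avg(/host/item,5m)>80                    # 5-minute average is greater than 80
max(/host/item,10m)>95                   # 10-minute maximum is greater than 95

# Combined conditions
last(/host/cpu.load)>2 and last(/host/cpu.util)>80       # AND
last(/host/disk.free)<10G or last(/host/inode.pfree)<10  # OR (pfree is already a percentage, so no % sign)

# Common functions
last(/host/key)        # latest value
avg(/host/key,#5)      # average of the last 5 values
min(/host/key,5m)      # minimum over 5 minutes
max(/host/key,5m)      # maximum over 5 minutes
change(/host/key)      # difference from the previous value
count(/host/key,5m)    # number of values collected in 5 minutes
nodata(/host/key,5m)   # 1 if no data arrived in 5 minutes
last(/host/key,#1)<>last(/host/key,#2)  # value changed; replaces the removed diff()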
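
To make the window semantics of these history functions concrete, the sketch below re-implements a few of them in Python over a hand-made value history. It is an illustration only, not Zabbix code: the `history` data, the helper names, and the choice to treat a time window as the interval `(now - seconds, now]` are all assumptions.

```python
from statistics import mean

# Hypothetical item history: (unix_timestamp, value) pairs, oldest first.
history = [(0, 70.0), (60, 82.0), (120, 91.0), (180, 96.0), (240, 88.0)]


def window(samples, now, seconds):
    """Values collected in the `seconds` leading up to `now` (assumed: (now-seconds, now])."""
    return [value for ts, value in samples if now - seconds < ts <= now]


def last(samples):
    """Most recent value."""
    return samples[-1][1]


def change(samples):
    """Latest value minus the previous one."""
    return samples[-1][1] - samples[-2][1]


def avg_last_n(samples, n):
    """Average of the most recent n values (the '#n' form)."""
    return mean(value for _, value in samples[-n:])


now = 240
recent = window(history, now, 120)   # values at t=180 and t=240
trigger_fires = max(recent) > 95     # "maximum over the window is above 95"
print(last(history), change(history), avg_last_n(history, 5), trigger_fires)
```

A count-based window (`#5`) always sees the same number of values, while a time-based window sees however many values arrived in that period, which matters when the collection interval changes.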

Worked example: disk space alert

yaml

# Trigger configuration
Name: Low disk space
Expression: last(/Linux Server/vfs.fs.size[/,pfree])<10
Severity: Warning

# Multi-level alerts
last(/Linux Server/vfs.fs.size[/,pfree])<5   # Average
last(/Linux Server/vfs.fs.size[/,pfree])<2   # High
last(/Linux Server/vfs.fs.size[/,pfree])<1   # Disaster

Custom Monitoring

How do you write a custom monitoring script?

bash

#!/bin/bash
# /etc/zabbix/scripts/check_mysql.sh
# Reports the number of MySQL client connections

MYSQL_USER="monitor"
MYSQL_PASS="password"
MYSQL_HOST="localhost"

# Read the current connection count (Threads_connected)
connections=$(mysql -u"$MYSQL_USER" -p"$MYSQL_PASS" -h"$MYSQL_HOST" \
    -e "SHOW STATUS LIKE 'Threads_connected';" \
    -N | awk '{print $2}')

echo "$connections"

bash

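# Make the script executable
chmod +x /etc/zabbix/scripts/check_mysql.sh

# Register a UserParameter in the directory named by Include= in zabbix_agentd.conf
# (the official RHEL packages use /etc/zabbix/zabbix_agentd.d/)
vi /etc/zabbix/zabbix_agentd.d/custom.conf
UserParameter=mysql.connections,/etc/zabbix/scripts/check_mysql.sh

# Restart the Agent
systemctl restart zabbix-agent

Testing the custom item

bash

# Test from the Server
zabbix_get -s 192.168.1.50 -k mysql.connections

# Or test locally on the Agent host
zabbix_agentd -t mysql.connections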

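Prometheus Monitoring

Basic Concepts

What is Prometheus? How does it work?

Answer: Prometheus is an open-source monitoring and alerting system with a built-in time series database (TSDB), originally developed at SoundCloud. The Server periodically scrapes metrics over HTTP from its targets, stores them locally, and evaluates queries and alerting rules against them.

Core components:

Prometheus Server: the main service; scrapes, stores, and queries data
Exporters: collectors that expose a monitored system's metrics on an HTTP endpoint
Pushgateway: a gateway that accepts pushed metrics from short-lived jobs
Alertmanager: manages alerts; supports deduplication, grouping, routing, and silencing
Service Discovery: finds target instances automatically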

What is Prometheus's data model?

Answer: Prometheus stores data as time series. Each series is identified by a metric name and a set of labels:

metric_name{
    label_name1="value1",
    label_name2="value2"
} value [timestamp]

Example:

txt

http_requests_total{
    method="GET",
    handler="/api/users",
    instance="web01:9100",
    job="webserver"
} 10540 1609459200000

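Metric types:

Type	Description	Example
Counter	A cumulative count that only increases	`http_requests_total`
Gauge	A point-in-time value that can go up or down	`memory_usage_bytes`
Histogram	Histogram (distribution of observations in buckets)	`http_request_duration_seconds`
Summary	Summary (quantile statistics)	`rpc_duration_seconds`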

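The PromQL Query Language

Basic query syntax

txt

# Instant vector selectors
up                              # one series per target: 1 if the last scrape succeeded, 0 if it failed
http_requests_total             # every series of this metric

# Label matchers
http_requests_total{job="api"}  # series whose job is api
http_requests_total{method!="GET"}  # method is not GET
http_requests_total{env=~"prod|staging"}  # regular-expression match

# Range vector selectors (with a time window)
http_requests_total{job="api"}[5m]  # samples from the last 5 minutes
rate(http_requests_total[5m])       # average per-second rate over 5 minutes
increase(http_requests_total[1h])   # increase over 1 hour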
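
The relationship between `increase()` and `rate()` on a counter is easier to see with numbers. The sketch below is a simplified Python model, not Prometheus's implementation: it handles counter resets but skips the extrapolation to the window edges that Prometheus performs, and the `samples` data and function names are made up for illustration.

```python
# Hypothetical counter samples: (timestamp_seconds, value). The drop at t=45 is a reset.
samples = [(0, 100.0), (15, 130.0), (30, 160.0), (45, 10.0), (60, 40.0)]


def increase(points):
    """Total growth across the samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, previous), (_, current) in zip(points, points[1:]):
        if current >= previous:
            total += current - previous
        else:
            # After a reset the counter restarted from zero, so the new value is the growth.
            total += current
    return total


def simple_rate(points):
    """Average per-second growth between the first and last sample (no extrapolation)."""
    elapsed = points[-1][0] - points[0][0]
    return increase(points) / elapsed


print(increase(samples), simple_rate(samples))
```

Because resets are absorbed this way, `rate()` should be applied to the raw counter before any aggregation such as `sum()`; summing first would hide the reset from it.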

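Common aggregation operators

txt

# Sum
sum(http_requests_total)                          # total across all series
sum(http_requests_total) by (method)              # sum grouped by method

# Average
avg(cpu_usage_percent) by (instance)              # average CPU utilization per instance

# Maximum / minimum
max(memory_usage_bytes) by (instance)             # maximum memory usage per instance
min(up)                                           # 0 if any target is down, otherwise 1

# Top N
topk(5, http_requests_total)                      # the 5 series with the most requests
bottomk(3, cpu_usage)                             # the 3 series with the lowest CPU usage

# Counting
count(up == 1)                                    # number of targets that are up
count_values("status", http_status)               # number of series per distinct value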

Practical query examples

txt

# CPU utilization (%)
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory utilization (%)
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
    / node_memory_MemTotal_bytes * 100

# Disk utilization (%)
100 - ((node_filesystem_avail_bytes{mountpoint="/"}
    / node_filesystem_size_bytes{mountpoint="/"}) * 100)

# QPS (requests per second)
sum(rate(http_requests_total{job="api"}[5m]))

# P99 latency (aggregate the buckets by le before taking the quantile)
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m])) * 100
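
The P99 query estimates a quantile from cumulative histogram buckets by interpolating linearly inside the bucket that contains the target rank. The Python sketch below models that idea under simplifying assumptions: the lowest bucket is taken to start at 0, the bucket data is invented, and the function is a teaching aid rather than Prometheus's actual code.

```python
import math

# Hypothetical cumulative buckets: (upper_bound_seconds, count_of_observations <= bound).
buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]


def histogram_quantile(q, cumulative):
    """Estimate the q-quantile (0 < q <= 1) from cumulative bucket counts."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in cumulative:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


print(histogram_quantile(0.5, buckets), histogram_quantile(0.95, buckets))
```

The estimate can never be more precise than the bucket layout, so bucket boundaries are worth placing near the latency thresholds that matter for alerting.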

Exporter

Which exporters are commonly used?

Exporter	Purpose	Default port
Node Exporter	Linux system metrics	9100
mysqld_exporter	MySQL database	9104
redis_exporter	Redis cache	9121
blackbox_exporter	HTTP/TCP/ICMP probing	9115
postgres_exporter	PostgreSQL database	9187
cadvisor	Docker containers	8080
jmx_exporter	JVM/JMX applications	5556 (chosen at launch; no fixed default)
snmp_exporter	SNMP network devices	9116
elasticsearch_exporter	Elasticsearch	9114
mongodb_exporter	MongoDB	9216

Deploying and configuring Node Exporter

bash

# Download and unpack
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
cd node_exporter-1.7.0.linux-amd64

# Install the binary and create the unprivileged user that the unit below runs as
cp node_exporter /usr/local/bin/
useradd --no-create-home --shell /bin/false prometheus

# Create the systemd service
cat > /etc/systemd/system/node-exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=prometheus
ExecStart=/usr/local/bin/node_exporter \
    --collector.processes \
    --collector.filesystem.mount-points-exclude='^/(sys|proc|dev|run|boot)($|/)'

[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable node-exporter
systemctl start node-exporter

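Commonly used collectors:

Collector	Description
`processes`	Process information (disabled by default)
`diskstats`	Disk I/O statistics
`filesystem`	Filesystem information
`loadavg`	System load
`meminfo`	Detailed memory information
`netstat`	Network statistics
`stat`	Kernel statistics from /proc/stat (boot time, forks, interrupts)
`tcpstat`	TCP connection statistics (disabled by default)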

Example Prometheus configuration file

yaml

# prometheus.yml
global:
  scrape_interval: 15s     # how often to scrape targets
  evaluation_interval: 15s # how often to evaluate rules
  scrape_timeout: 10s      # timeout for a single scrape

alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager:9093

rule_files:
  - "alert_rules.yml"

scrape_configs:
  # Prometheus monitoring itself
  - job_name: 'prometheus'
    static_configs:
    - targets: ['localhost:9090']

  # Node Exporter
  - job_name: 'node'
    static_configs:
    - targets: ['node1:9100', 'node2:9100', 'node3:9100']

  # Kubernetes service discovery: keep only pods annotated prometheus.io/scrape: "true"
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true

Alertmanager Alerting

Alertmanager's alert-handling flow

Writing alerting rules

yaml

# alert_rules.yml
groups:
- name: node_alerts
  rules:
  # Instance down
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} is down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute"

  # CPU utilization
  - alert: HighCpuUsage
    expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage"
      description: "CPU usage on instance {{ $labels.instance }} is above 85%, current value: {{ $value }}%"

  # Memory utilization
  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage"
      description: "Memory usage on instance {{ $labels.instance }} is above 90%"

  # Disk space
  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 < 15
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Low disk space"
      description: "Free space on {{ $labels.mountpoint }} of {{ $labels.instance }} is below 15%"

Alertmanager configuration

yaml

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alert@example.com'
  smtp_auth_username: 'alert@example.com'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default-receiver'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
    repeat_interval: 1h

inhibit_rules:
- source_match:
    severity: 'critical'
  target_match:
    severity: 'warning'
  equal: ['alertname', 'instance']

receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'team@example.com'

- name: 'critical-alerts'
  email_configs:
  - to: 'oncall@example.com'
  webhook_configs:
  - url: 'http://dingtalk-webhook/send'
    send_resolved: true
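
The `inhibit_rules` entry in this configuration mutes a warning alert while a critical alert with the same `alertname` and `instance` is firing. The Python sketch below models that matching logic with plain dicts. It is a simplified illustration and not Alertmanager's code; the `RULE` dict mirrors the configuration, while the function names and sample alerts are hypothetical.

```python
# The inhibit rule from the configuration, written as a plain dict.
RULE = {
    "source_match": {"severity": "critical"},
    "target_match": {"severity": "warning"},
    "equal": ["alertname", "instance"],
}


def matches(alert, matchers):
    """True if the alert carries every label/value pair in `matchers`."""
    return all(alert.get(name) == value for name, value in matchers.items())


def is_inhibited(target, firing, rule=RULE):
    """True if some other firing alert suppresses `target` under `rule`."""
    if not matches(target, rule["target_match"]):
        return False
    return any(
        source is not target
        and matches(source, rule["source_match"])
        and all(source.get(label) == target.get(label) for label in rule["equal"])
        for source in firing
    )


critical = {"alertname": "HighCpuUsage", "instance": "node1", "severity": "critical"}
warning_same = {"alertname": "HighCpuUsage", "instance": "node1", "severity": "warning"}
warning_other = {"alertname": "HighCpuUsage", "instance": "node2", "severity": "warning"}
firing = [critical, warning_same, warning_other]

print(is_inhibited(warning_same, firing), is_inhibited(warning_other, firing))
```

The `equal` list is what keeps the rule narrow: without it, one critical alert anywhere would silence every warning.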

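Grafana Visualization

Basic Concepts

What is Grafana? What are its main features?

Answer: Grafana is an open-source visualization platform that supports many data sources and provides a wide range of charts and dashboards.

Core features:

Support for multiple data sources (Prometheus, InfluxDB, Elasticsearch, and others)
A wide range of visualization panels (line charts, bar charts, heat maps, tables, and more)
Integrated alerting
Template variables and dynamic dashboards
Access control and team collaboration
A plugin ecosystem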

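Grafana panel types and where to use them

Panel type	Use case	Example
Time Series	Trends over time	CPU/memory trends
Stat	A single value	Current QPS, error rate
Gauge	A value against a range	CPU/memory percentage
Bar Chart	Comparison	Request volume per service
Table	Tabular data	Instance list and status
Heatmap	Heat map	Request latency distribution
Logs	Log viewing	Application log analysis
Pie Chart	Proportions	Share of traffic by source
Status History	State over time	Service availability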

Common panel configurations

CPU utilization panel

json

{
  "type": "timeseries",
  "title": "CPU Utilization",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
      "legendFormat": "{{instance}}"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "thresholds": {
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 70},
          {"color": "red", "value": 90}
        ]
      },
      "max": 100,
      "min": 0
    }
  }
}

Memory usage panel

json

{
  "type": "gauge",
  "title": "Memory Utilization",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
      "legendFormat": "Memory Usage"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "percent",
      "max": 100,
      "min": 0,
      "thresholds": {
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 70},
          {"color": "red", "value": 90}
        ]
      }
    }
  }
}

Request QPS panel

json

{
  "type": "stat",
  "title": "Current QPS",
  "datasource": "Prometheus",
  "targets": [
    {
      "expr": "sum(rate(http_requests_total{job=\"api\"}[5m]))",
      "legendFormat": "QPS"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "reqps",
      "mappings": [],
      "thresholds": {
        "steps": [
          {"color": "green", "value": null},
          {"color": "yellow", "value": 1000},
          {"color": "red", "value": 5000}
        ]
      }
    }
  }
}

Variables and Templates

How do you create template variables?

json

{
  "templating": {
    "list": [
      {
        "name": "instance",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(up, instance)",
        "refresh": 1,
        "sort": 1,
        "includeAll": true,
        "current": {
          "selected": true,
          "text": "All",
          "value": "$__all"
        }
      },
      {
        "name": "job",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(up{instance=~\"$instance\"}, job)",
        "refresh": 1
      }
    ]
  }
}

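Common variable types:

Type	Description	Example
Query	Values from a data source query	`label_values(up, instance)`
Interval	Time intervals	`1m, 5m, 15m, 1h`
Custom	Manually defined options	`prod, staging, dev`
Constant	A fixed value	A specific value
DataSource	Data source selection	Switch between data sources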

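Using variables in queries

txt

# Reference a variable as $variable
up{instance="$instance"}

# Interval variable
rate(http_requests_total[$interval])

# Multi-value variable (use =~ so the selected values are matched as a regex)
sum(rate(http_requests_total{job=~"$jobs"}[5m]))

# Built-in variables
$__range        # the dashboard's current time range
$__rate_interval  # an automatically chosen window suited to rate()
$__from / $__to  # start and end of the time range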

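Alerting Configuration

Grafana alerting vs Prometheus alerting

Feature	Grafana Alert	Prometheus Alert
Evaluated in	Grafana	Prometheus Server
Data sources	Multiple data sources	Prometheus only
Visualization	Linked to panels	Configured separately
Notification channels	Many built in	Relies on Alertmanager
Typical use	Business-metric alerts	Infrastructure alerts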

Configuring Grafana alert rules

yaml

# Contact Point (notification channel)
uid: my-webhook
name: webhook-alert
type: webhook
settings:
  url: http://dingtalk-webhook/send
  httpMethod: POST

# Notification Policy
receiver: grafana-default-email
routes:
- matcher: alertname = "HighErrorRate"
  receiver: webhook-alert
  group_by: ['alertname', 'instance']
  continue: false

# Alert Rule example
name: API Error Rate High
condition: B
data:
  - refId: A
    datasourceUid: prometheus
    model:
      expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
  - refId: B
    datasourceUid: __expr__
    model:
      type: threshold
      conditions:
      - evaluator: {params: [], type: gt}
        query: {params: [A]}
        value: 5
isPaused: false
evaluationGroup: default
noDataState: NoData
executionErrorState: Alerting
for: 5m
annotations:
  summary: "API error rate above 5%"
  description: "Current error rate: ${A}"
labels:
  severity: warning

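Monitoring System Design

How do you design a complete monitoring system?

Layered monitoring architecture

Classifying metrics (the USE and RED methods)

USE method (resource view):

Utilization: how busy the resource is (CPU, memory, disk)
Saturation: how much work is queued (queue length, load)
Errors: number of errors

RED method (service view):

Rate: requests per second (QPS)
Errors: failed requests per second
Duration: distribution of request latency (P50/P95/P99)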
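
As a rough illustration of the RED method, the Python sketch below derives all three signals from a small request log. Everything here is assumed for the example: the `requests` data, the 10-second window, the rule that a status of 500 or higher counts as an error, and the nearest-rank percentile definition.

```python
import math

# Hypothetical request log for one 10-second window: (http_status, duration_seconds).
requests = [
    (200, 0.12), (200, 0.08), (500, 0.95), (200, 0.20), (200, 0.15),
    (200, 0.11), (503, 1.40), (200, 0.09), (200, 0.30), (200, 0.10),
]
WINDOW_SECONDS = 10


def percentile(sorted_values, p):
    """Nearest-rank percentile; p is an integer from 1 to 100."""
    rank = math.ceil(p * len(sorted_values) / 100)
    return sorted_values[rank - 1]


durations = sorted(duration for _, duration in requests)

rate = len(requests) / WINDOW_SECONDS                                         # Rate
errors = sum(1 for status, _ in requests if status >= 500) / WINDOW_SECONDS   # Errors
p50, p95 = percentile(durations, 50), percentile(durations, 95)               # Duration

print(f"rate={rate}/s errors={errors}/s p50={p50}s p95={p95}s")
```

Note how the two failed requests are also the slowest ones: the median looks healthy while the tail does not, which is why Duration is tracked as a distribution and not as an average.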

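Monitoring best practices

Choose sensible sampling intervals
- Critical metrics: 10-15 seconds
- General metrics: 30-60 seconds
- Low-frequency metrics: 5 minutes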

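Data retention policy

bash

# Prometheus sets local retention with command-line flags, not in prometheus.yml.
# Keep at most 50GB on disk and at most 30 days of data.
prometheus \
    --storage.tsdb.retention.size=50GB \
    --storage.tsdb.retention.time=30d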

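Alert severity levels
- P0 (fatal): service unavailable; respond immediately
- P1 (severe): core functionality degraded; respond within 15 minutes
- P2 (warning): non-core anomaly; handle within 1 hour
- P3 (info): optimization suggestion; handle during working hours

Avoid alert storms
- Set a sensible `for` duration
- Use grouping and inhibition rules
- Configure alert escalation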
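
The effect of a `for` duration on noisy signals can be shown with a small state machine: a condition must hold continuously for the whole duration before the alert moves from pending to firing, and any false evaluation resets the clock. The Python sketch below is a simplified model of that behaviour; the evaluation timestamps and the function name are made up for illustration.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_is_true) evaluations to inactive/pending/firing."""
    states = []
    active_since = None  # when the condition most recently became true
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluated every 15 seconds; the condition blips false once at t=30.
evaluations = [
    (0, True), (15, True), (30, False), (45, True),
    (60, True), (75, True), (90, True), (105, True),
]
print(alert_states(evaluations, for_seconds=60))
```

With `for_seconds=0` every true evaluation would fire immediately, including the short burst before the blip; the 60-second hold delays the notification until the condition has persisted.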

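Troubleshooting Common Production Issues

Prometheus performance problems

Symptoms: slow queries, high CPU or memory usage

Where to look:

bash

# TSDB statistics: head series count and the highest-cardinality metrics and labels
curl http://localhost:9090/api/v1/status/tsdb

# Number of active (head) time series
curl 'http://localhost:9090/api/v1/query?query=prometheus_tsdb_head_series'

# Tuning suggestions
# 1. Scrape fewer targets or lengthen the scrape interval
# 2. Tighten PromQL queries so they do not scan every series
# 3. Add recording rules to precompute expensive expressions
# 4. Use Thanos or Cortex for remote storage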

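Zabbix performance problems

Symptoms: busy poller processes, delayed data

Tuning:

ini

# zabbix_server.conf
# Raise the number of pollers
StartPollers=20
StartPollersUnreachable=5
StartTrappers=10
StartPingers=5
StartDiscoverers=5

# Database tuning
# 1. Purge old history regularly
# 2. Add indexes
# 3. Use partitioned tables
# 4. Consider read/write splitting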

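Grafana loads slowly

Causes and fixes:

Query time range too wide: shorten the default time range
Too many panels: split the dashboard or load panels lazily
Slow data source: optimize the queries or add caching
Heavy browser rendering: reduce the number of data points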
