Add production-ready monitoring infrastructure:
- 15 alerting rules (4 Critical + 11 Warning)
- Grafana dashboard with service health panels
- Full documentation with deployment guide

Covers: service availability, error rates, latency, routing health, database connections, and log metrics.
Sub2API Relay Manager Monitoring Setup
Overview
This project ships with a complete monitoring and alerting setup: Prometheus metrics, a Grafana dashboard, and Prometheus alerting rules.
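Configured metrics
HTTP-layer metrics
- http_requests_total: total number of HTTP requests (labeled by method, path, status)
- http_request_duration_seconds: distribution of HTTP request latency
Business metrics
- active_hosts: number of active hosts
- active_providers: number of active providers
- route_decisions_total: total number of routing decisions
- route_failovers_total: total number of routing failovers
Database metrics
- db_connections_active: number of active database connections
- db_operations_total: total number of database operations
Log metrics
- log_flush_errors_total: number of log flush errors
- log_dropped_events_total: number of dropped log events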
Alerting rules
Critical
| Alert | Condition | Description |
|---|---|---|
| ServiceDown | up == 0 for 1 minute | Service is completely down |
| NoActiveProviders | active_providers == 0 for 1 minute | No provider available |
| NoActiveHosts | active_hosts == 0 for 1 minute | No host available |
| HealthCheckFailing | /healthz returns non-200 | Health check is failing |
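Warning
| Alert | Condition | Description |
|---|---|---|
| HighErrorRate | Error rate > 5% for 2 minutes | High rate of HTTP 5xx/4xx errors |
| HighLatency | P95 latency > 1 s for 3 minutes | Requests are slow to process |
| RouteFailoverSpike | Failover rate > 2x the normal level | Routing is unstable |
| HighDBConnections | Active connections > 50 for 5 minutes | Database connection pool is under pressure |
| LogFlushErrors | Log flush errors > 0 | Logging system malfunction |
| LogDroppedEvents | Dropped-event rate > 10/sec | Log buffer overflow |
| BatchImportFailures | Batch failure rate > 10% | Provider import problems |
| AuthFailures | Authentication failures > 10/sec | Credential problems or an attack |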
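
The HighLatency condition above depends on a P95 estimated from the http_request_duration_seconds histogram. The rule's exact expression is not shown in this document, so the following is only a minimal sketch of how such a quantile is typically estimated from cumulative buckets (rank lookup plus linear interpolation inside the matching bucket). The function name and the sample bucket bounds are illustrative assumptions, and Prometheus's own `histogram_quantile` works on rates and handles more edge cases.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate the q-quantile (0 <= q <= 1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs, like the
    `le` label values of a *_bucket series; one bound must be math.inf.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations at all
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if bound == math.inf:
                # The quantile lies in the open-ended bucket, so the best
                # available answer is the highest finite bound.
                return buckets[i - 1][0] if i > 0 else math.nan
            in_bucket = count - prev_count
            if in_bucket == 0:
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical bucket layout (seconds); the real service may use other bounds.
sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
p95 = estimate_quantile(0.95, sample)
```

With these sample buckets the estimate falls between 0.5 s and 1 s, so the 1 s threshold would not be crossed. The estimate can never be finer than the bucket bounds allow, which is worth keeping in mind when choosing buckets around the alert threshold.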
Deployment steps
1. Prometheus configuration
Add the following to prometheus.yml:
rule_files:
  - "sub2api-relay-manager-rules.yml"
scrape_configs:
  - job_name: "sub2api-relay-manager"
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: /metrics
    scrape_interval: 15s
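Copy the alerting rules (make sure the destination path and file name match the rule_files entry in prometheus.yml):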
cp deploy/monitoring/prometheus-rules.yml /etc/prometheus/rules/
2. Grafana configuration
Import the dashboard:
curl -X POST \
  http://admin:admin@localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d @deploy/monitoring/grafana-dashboard.json
3. Alertmanager configuration (optional)
Configure the alert notification channels (Slack/Email/PagerDuty):
# alertmanager.yml
global:
  smtp_smarthost: "localhost:587"
  smtp_from: "alerts@example.com"
route:
  receiver: "ops-team"
  group_by: ["alertname", "severity"]
receivers:
  - name: "ops-team"
    email_configs:
      - to: "ops@example.com"
        # email_configs has no top-level subject field; set it as a header
        headers:
          Subject: "[Alert] {{ .GroupLabels.alertname }}"
    slack_configs:
      - api_url: "YOUR_SLACK_WEBHOOK_URL"
        channel: "#alerts"
Verification
Check the metrics endpoint
curl http://localhost:8080/metrics
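
To go one step beyond eyeballing the curl output, the response can be checked against the metric names listed in this document. The sketch below is an illustrative helper, not part of the project: the function names are hypothetical, and the parsing covers only the common `name{labels} value` shape of the Prometheus text format.

```python
# Metric names as listed in this document.
EXPECTED_METRICS = [
    "http_requests_total",
    "http_request_duration_seconds",
    "active_hosts",
    "active_providers",
    "route_decisions_total",
    "route_failovers_total",
    "db_connections_active",
    "db_operations_total",
    "log_flush_errors_total",
    "log_dropped_events_total",
]

# Suffixes that histogram and summary series append to the base name.
_SUFFIXES = ("_bucket", "_sum", "_count")


def exposed_metric_names(metrics_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The sample name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        names.add(name)
        for suffix in _SUFFIXES:
            if name.endswith(suffix):
                names.add(name[: -len(suffix)])
    return names


def missing_metrics(metrics_text, expected=EXPECTED_METRICS):
    """List the expected metrics that do not appear in the scrape output."""
    present = exposed_metric_names(metrics_text)
    return [m for m in expected if m not in present]
```

One way to use it is to save the output of `curl -s http://localhost:8080/metrics` to a file and pass its contents to `missing_metrics`. A metric reported as missing is not always a bug: with some client libraries, labeled counters such as http_requests_total only show up after the first matching request has been served.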
Verify the alerting rules
# View the loaded rules in Prometheus
http://localhost:9090/rules
# View the alert status
http://localhost:9090/alerts
Trigger a test alert
# Simulate a high error rate
for i in {1..100}; do
  curl http://localhost:8080/api/nonexistent
done
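
Whether 100 failing requests are enough to fire HighErrorRate depends on how much normal traffic arrives in the same window. The sketch below works through that arithmetic from two snapshots of http_requests_total grouped by status code. It assumes the rule compares 4xx plus 5xx requests against all requests, which matches the table's description but is not confirmed by the rule file itself; the function name is hypothetical.

```python
def error_ratio(before, after):
    """Fraction of requests between two snapshots that ended in 4xx or 5xx.

    `before` and `after` map an HTTP status code (int) to the cumulative
    http_requests_total value for that status at each snapshot.
    """
    total = 0
    errors = 0
    for status, count in after.items():
        delta = count - before.get(status, 0)
        if delta < 0:
            delta = count  # counter reset (process restart): count from zero
        total += delta
        if 400 <= status <= 599:
            errors += delta
    return errors / total if total else 0.0


# 100 requests to a nonexistent path (404) alongside 900 successful requests.
quiet = error_ratio({200: 1000}, {200: 1900, 404: 100})

# The same 100 failures alongside 1900 successful requests.
busy = error_ratio({200: 1000}, {200: 2900, 404: 100})
```

In the first case the ratio is 10%, above the 5% threshold; in the second it is exactly 5%, which does not exceed it. On a busy instance the loop may therefore need more iterations, and since the condition has to hold for 2 minutes, a single short burst may not be enough to move the alert from pending to firing.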
Interpreting the metrics
Reference values for a healthy system
| Metric | Normal range | Alert threshold |
|---|---|---|
| active_providers | >= 2 | < 2 (warning), = 0 (critical) |
| active_hosts | >= 1 | = 0 (critical) |
| Error Rate | < 1% | > 5% |
| P95 Latency | < 500ms | > 1s |
| DB Connections | < 20 | > 50 |
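
The thresholds in the table above can be written down as a small check, for example to sanity-test values read off the dashboard. This is only a sketch: the dictionary keys are hypothetical names chosen for illustration, the error rate is expressed as a fraction, the "warning" severity for the last three rows is taken from the Warning table, and the "for N minutes" durations of the real rules are ignored.

```python
def evaluate(snapshot):
    """Return {metric: severity} for every value that crosses a threshold."""
    findings = {}

    providers = snapshot.get("active_providers")
    if providers is not None:
        if providers == 0:
            findings["active_providers"] = "critical"
        elif providers < 2:
            findings["active_providers"] = "warning"

    hosts = snapshot.get("active_hosts")
    if hosts is not None and hosts == 0:
        findings["active_hosts"] = "critical"

    # (key, threshold) pairs that raise a warning when exceeded.
    upper_limits = [
        ("error_rate", 0.05),          # > 5%
        ("p95_latency_seconds", 1.0),  # > 1s
        ("db_connections", 50),        # > 50
    ]
    for key, limit in upper_limits:
        value = snapshot.get(key)
        if value is not None and value > limit:
            findings[key] = "warning"

    return findings


result = evaluate({
    "active_providers": 1,
    "active_hosts": 0,
    "error_rate": 0.02,
    "p95_latency_seconds": 1.5,
    "db_connections": 10,
})
```

Values between the normal range and the alert threshold (for instance an error rate of 2%) produce no finding here, which mirrors the table: they are outside the healthy range but do not trigger an alert.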
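Troubleshooting
ServiceDown alert
- Check the process status: systemctl status sub2api-relay-manager
- View the logs: journalctl -u sub2api-relay-manager
- Check that the port is listening: netstat -tlnp | grep 8080
HighLatency alert
- Check database performance
- Check upstream provider response times
- Check memory and CPU usage
RouteFailoverSpike alert
- Check provider health
- Check /api/routing/routes/health
- Analyze the provider response logs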