Add production-ready monitoring infrastructure:
- 15 alerting rules (4 Critical + 11 Warning)
- Grafana dashboard with service health panels
- Full documentation with deployment guide

Covers: service availability, error rates, latency, routing health, database connections, and log metrics
Sub2API Relay Manager Monitoring Setup
Overview
This project ships with a complete monitoring and alerting setup: Prometheus metrics, a Grafana dashboard, and Prometheus alerting rules.
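Configured Metrics
HTTP Layer Metrics
- http_requests_total - Total number of HTTP requests (labeled by method, path, status)
- http_request_duration_seconds - Distribution of HTTP request latency
Business Metrics
- active_hosts - Number of active hosts
- active_providers - Number of active providers
- route_decisions_total - Total number of routing decisions
- route_failovers_total - Total number of routing failovers
Database Metrics
- db_connections_active - Number of active database connections
- db_operations_total - Total number of database operations
Logging Metrics
- log_flush_errors_total - Number of log flush errors
- log_dropped_events_total - Number of dropped log events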
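As a quick sanity check that the metrics listed above are exposed, the following sketch scans a Prometheus text-format payload for their `# TYPE` declarations. The helper names `declared_metrics` and `missing_metrics` and the sample payload are illustrative assumptions, not part of the project. Labeled metrics that have not been observed yet may legitimately be absent from a real scrape.

```python
# Metric names taken from the lists above.
EXPECTED_METRICS = [
    "http_requests_total",
    "http_request_duration_seconds",
    "active_hosts",
    "active_providers",
    "route_decisions_total",
    "route_failovers_total",
    "db_connections_active",
    "db_operations_total",
    "log_flush_errors_total",
    "log_dropped_events_total",
]


def declared_metrics(exposition_text):
    """Return the metric names declared by '# TYPE <name> <type>' lines."""
    names = set()
    for line in exposition_text.splitlines():
        parts = line.split()
        if len(parts) >= 4 and parts[0] == "#" and parts[1] == "TYPE":
            names.add(parts[2])
    return names


def missing_metrics(exposition_text, expected=EXPECTED_METRICS):
    """Return the expected metric names that the payload does not declare."""
    declared = declared_metrics(exposition_text)
    return [name for name in expected if name not in declared]


# Hypothetical sample payload; a real one would come from GET /metrics.
SAMPLE = """\
# HELP active_hosts Number of active hosts
# TYPE active_hosts gauge
active_hosts 3
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/healthz",status="200"} 12
"""

print(missing_metrics(SAMPLE))
```

In practice the payload would be the body returned by the /metrics endpoint, and any name reported as missing is a prompt to check whether that code path has run yet.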
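Alerting Rules
Critical Level
| Alert Name | Trigger Condition | Description |
|---|---|---|
| ServiceDown | up == 0 for 1 minute | Service is completely down |
| NoActiveProviders | active_providers == 0 for 1 minute | No provider available |
| NoActiveHosts | active_hosts == 0 for 1 minute | No host available |
| HealthCheckFailing | /healthz returns a non-200 status | Health check is failing |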
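Warning Level
| Alert Name | Trigger Condition | Description |
|---|---|---|
| HighErrorRate | Error rate > 5% for 2 minutes | High rate of HTTP 5xx/4xx errors |
| HighLatency | P95 latency > 1 second for 3 minutes | Request handling is slow |
| RouteFailoverSpike | Failover rate > 2x the normal level | Routing is unstable |
| HighDBConnections | Active connections > 50 for 5 minutes | Database connection pool is under pressure |
| LogFlushErrors | Log flush errors > 0 | Logging system malfunction |
| LogDroppedEvents | Dropped event rate > 10/sec | Log buffer overflow |
| BatchImportFailures | Batch failure rate > 10% | Provider import problems |
| AuthFailures | Authentication failures > 10/sec | Credential problems or an attack |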
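
The "for N minutes" part of a trigger condition means the expression has to stay true across consecutive rule evaluations before the alert fires. The sketch below mimics that pending-to-firing behavior for a HighErrorRate-style rule. The function name `alert_states` and the sample series are assumptions made for illustration; this is a simplified model, not Prometheus's own implementation.

```python
def alert_states(samples, threshold, for_seconds):
    """Return one state per sample: 'inactive', 'pending', or 'firing'.

    samples is a time-ordered list of (timestamp_seconds, value) pairs, one
    per rule evaluation. The alert is pending while the value exceeds the
    threshold and fires once it has done so for at least for_seconds.
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


# Hypothetical error ratios sampled every 60 seconds (0.06 == 6%).
series = [(0, 0.06), (60, 0.02), (120, 0.06), (180, 0.06), (240, 0.07)]
print(alert_states(series, threshold=0.05, for_seconds=120))
```

With this series the dip at 60 seconds resets the timer, so the alert only fires at the last sample, after the ratio has stayed above 5% for a full 2 minutes.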
Deployment Steps
1. Prometheus Configuration
Add the following to prometheus.yml:
rule_files:
  - "sub2api-relay-manager-rules.yml"
scrape_configs:
  - job_name: "sub2api-relay-manager"
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: /metrics
    scrape_interval: 15s
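Copy the alerting rules. Note that the file name and directory used here differ from the rule_files entry above, so adjust one of them until the path resolves to the copied file: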
cp deploy/monitoring/prometheus-rules.yml /etc/prometheus/rules/
2. Grafana Configuration
Import the dashboard:
curl -X POST \
  http://admin:admin@localhost:3000/api/dashboards/db \
  -H 'Content-Type: application/json' \
  -d @deploy/monitoring/grafana-dashboard.json
3. Alertmanager Configuration (Optional)
Configure the alert notification channels (Slack/Email/PagerDuty):
# alertmanager.yml
global:
  smtp_smarthost: "localhost:587"
  smtp_from: "alerts@example.com"
route:
  receiver: "ops-team"
  group_by: ["alertname", "severity"]
receivers:
  - name: "ops-team"
    email_configs:
      - to: "ops@example.com"
        # Email headers such as the subject are set under "headers"
        headers:
          subject: "[Alert] {{ .GroupLabels.alertname }}"
    slack_configs:
      - api_url: "YOUR_SLACK_WEBHOOK_URL"
        channel: "#alerts"
Verification
Check the Metrics Endpoint
curl http://localhost:8080/metrics
Verify the Alerting Rules
# View the loaded rules in Prometheus
http://localhost:9090/rules
# View the alert states
http://localhost:9090/alerts
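
The same information is available programmatically from the Prometheus HTTP API at /api/v1/rules. The sketch below is one possible way to summarize it, assuming Prometheus listens on localhost:9090 as above. The function names and the sample payload are illustrative and hand-written, not output captured from this project.

```python
import json
import urllib.request


def fetch_rules(base_url="http://localhost:9090"):
    """Fetch the rules payload from the Prometheus HTTP API (GET /api/v1/rules)."""
    with urllib.request.urlopen(f"{base_url}/api/v1/rules", timeout=5) as resp:
        return json.load(resp)


def alerting_rule_states(payload):
    """Map each alerting rule name to its state (inactive, pending, or firing)."""
    states = {}
    for group in payload.get("data", {}).get("groups", []):
        for rule in group.get("rules", []):
            # Recording rules share the same list; keep alerting rules only.
            if rule.get("type") == "alerting":
                states[rule["name"]] = rule.get("state", "unknown")
    return states


# Hand-written sample shaped like an /api/v1/rules response (not real output).
SAMPLE_PAYLOAD = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "example-group",
                "rules": [
                    {"type": "alerting", "name": "ServiceDown", "state": "inactive"},
                    {"type": "alerting", "name": "HighLatency", "state": "firing"},
                    {"type": "recording", "name": "job:http_requests:rate5m"},
                ],
            }
        ]
    },
}

print(alerting_rule_states(SAMPLE_PAYLOAD))
```

Against a running server, `alerting_rule_states(fetch_rules())` would return the live states; a rule that is missing from the result was most likely not loaded.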
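Trigger a Test Alert
# Simulate a high error rate.
# HighErrorRate requires the condition to hold for 2 minutes,
# so a single short burst may not be enough to fire it.
for i in {1..100}; do
  curl -s -o /dev/null http://localhost:8080/api/nonexistent
done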
Interpreting the Metrics
Reference Values for Normal Operation
| Metric | Normal Range | Alert Threshold |
|---|---|---|
| active_providers | >= 2 | < 2 (warning), = 0 (critical) |
| active_hosts | >= 1 | = 0 (critical) |
| Error Rate | < 1% | > 5% |
| P95 Latency | < 500ms | > 1s |
| DB Connections | < 20 | > 50 |
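
To make the thresholds concrete, the sketch below maps a reading to a severity using the values from the table. The function name `classify` and the metric keys `error_rate`, `p95_latency`, and `db_connections` are hypothetical labels chosen for this example; the last three are treated as warnings because their alerts appear under the Warning level.

```python
def classify(metric, value):
    """Return 'critical', 'warning', or 'ok' for a metric in the table above.

    Units: error_rate is a fraction (0.05 == 5%), p95_latency is in seconds.
    """
    if metric == "active_providers":
        if value == 0:
            return "critical"
        return "warning" if value < 2 else "ok"
    if metric == "active_hosts":
        return "critical" if value == 0 else "ok"
    # Upper bounds beyond which the table lists an alert.
    warning_above = {"error_rate": 0.05, "p95_latency": 1.0, "db_connections": 50}
    if metric in warning_above:
        return "warning" if value > warning_above[metric] else "ok"
    raise ValueError(f"unknown metric: {metric}")


print(classify("active_providers", 1))
print(classify("p95_latency", 0.4))
```

Readings between the normal range and the alert threshold, such as an error rate of 3%, are reported as ok here because the table does not define an alert for them.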
Troubleshooting
ServiceDown Alert
- Check the process status: systemctl status sub2api-relay-manager
- View the logs: journalctl -u sub2api-relay-manager
- Check that the port is listening: netstat -tlnp | grep 8080
HighLatency Alert
- Check database performance
- Check upstream provider response times
- Check memory and CPU usage
RouteFailoverSpike Alert
- Check provider health status
- Query /api/routing/routes/health
- Analyze the provider response logs