# Health Check and Monitoring Guide

This document describes the system's health check endpoints, Prometheus metrics, and alerting rules.

---

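## 1. Health Check Endpoints

The system exposes three health check endpoints, each suited to a different scenario:

| Endpoint | Path | Description | Use case |
|----------|------|-------------|----------|
| Liveness probe | `/health/live` | Confirms the process is alive | Kubernetes `livenessProbe` |
| Readiness probe | `/health/ready` | Confirms the service is ready | Kubernetes `readinessProbe` |
| Health check | `/health` | Overall health status | Load balancers, health check scripts |
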
### 1.1 Response Format

```json
{
  "status": "ok",
  "timestamp": "2026-05-10T13:00:00Z",
  "version": "1.0.0"
}
```

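### 1.2 Response Codes

| Status | HTTP status code | Description |
|--------|------------------|-------------|
| ok | 200 | Service is healthy |
| degraded | 200 | Service is degraded (some dependencies, such as Redis, are unavailable) |
| unhealthy | 503 | Service is unhealthy (for example, the database is unreachable) |
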
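Because `degraded` also returns HTTP 200, a caller that looks only at the status code cannot tell it apart from `ok`. The sketch below shows one way a script might combine the status code with the `status` field of the response body. The function name `classify_health` is hypothetical, and the `sed` pattern assumes the JSON shape shown in section 1.1; a real script would more likely use a JSON parser such as `jq`.

```bash
#!/bin/bash
# classify_health HTTP_CODE JSON_BODY
# Prints ok, degraded, or unhealthy, following the response code table.
# Hypothetical helper; assumes the body has a top-level "status" string.
classify_health() {
    local code=$1
    local body=$2
    local status

    # Pull the value of "status" out of the body without a JSON parser.
    status=$(printf '%s\n' "$body" |
        sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' |
        head -n 1)

    if [ "$code" = "200" ] && [ "$status" = "ok" ]; then
        echo "ok"
    elif [ "$code" = "200" ] && [ "$status" = "degraded" ]; then
        echo "degraded"
    else
        # 503, connection failure, or an unexpected body
        echo "unhealthy"
    fi
}

# Example (needs a running service):
#   body=$(curl -s http://localhost:8080/health)
#   code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:8080/health)
#   classify_health "$code" "$body"
```
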
---

## 2. Prometheus Metrics

### 2.1 Exposition

Metrics endpoint: `GET /metrics`

Returns text in the Prometheus exposition format.

### 2.2 Core Metrics

#### HTTP Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `http_requests_total` | Counter | method, path, status | Total number of HTTP requests |
| `http_request_duration_seconds` | Histogram | method, path | Request latency distribution |

#### Authentication Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `login_attempts_total` | Counter | result, method | Login attempts (success/failure) |
| `active_sessions_total` | Gauge | none | Number of currently active sessions |
| `refresh_tokens_total` | Counter | none | Number of token refreshes |

#### Database Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `db_query_duration_seconds` | Histogram | operation, table | Database query latency |
| `db_connections_open` | Gauge | type | Number of currently open connections |
| `db_connections_in_use` | Gauge | type | Number of connections in use |

#### Cache Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `cache_hits_total` | Counter | cache_level | Number of cache hits |
| `cache_misses_total` | Counter | cache_level | Number of cache misses |
| `cache_operations_total` | Counter | operation | Total number of cache operations |

#### Rate Limiting Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| `ratelimit_rejections_total` | Counter | endpoint, algorithm | Number of requests rejected by the rate limiter |

### 2.3 Viewing Current Metrics

```bash
curl http://localhost:8080/metrics
```

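The `/metrics` output is line-oriented, so simple figures can be derived with standard tools. As a minimal sketch, the helpers below sum every sample of one metric across its label sets and use that to estimate the cache hit ratio from `cache_hits_total` and `cache_misses_total`. The function names are hypothetical, the parsing assumes samples carry no trailing timestamp, and the ratio covers the whole process lifetime (counters since start) rather than a recent window as a PromQL `rate()` would.

```bash
#!/bin/bash
# metric_sum NAME
# Reads Prometheus text format on stdin and prints the sum of all samples
# of metric NAME, across every label set. Hypothetical helper.
metric_sum() {
    awk -v name="$1" '
        /^#/ { next }                      # skip HELP/TYPE comment lines
        {
            metric = $1
            sub(/[{].*/, "", metric)       # drop the {label="..."} part
            if (metric == name) total += $NF
        }
        END { print total + 0 }
    '
}

# cache_hit_ratio
# Reads the same text on stdin and prints hits / (hits + misses),
# or "n/a" when there have been no lookups yet.
cache_hit_ratio() {
    local text hits misses
    text=$(cat)
    hits=$(printf '%s\n' "$text" | metric_sum cache_hits_total)
    misses=$(printf '%s\n' "$text" | metric_sum cache_misses_total)
    awk -v h="$hits" -v m="$misses" 'BEGIN {
        if (h + m == 0) { print "n/a"; exit }
        printf "%.2f\n", h / (h + m)
    }'
}

# Example (needs a running service):
#   curl -s http://localhost:8080/metrics | cache_hit_ratio
```
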
---

## 3. Alerting Rules

### 3.1 Recommended Alerting Rules (Prometheus rule file format, for use with Alertmanager)

```yaml
groups:
  - name: user-management
    rules:
      # Service unavailable
      - alert: ServiceDown
        expr: up{job="user-management"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "User management service is down"

      # High error rate
      # sum() aggregates across labels; without it each 5xx series is
      # divided by itself and the ratio is always 1.
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) /
          sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HTTP 5xx error rate is above 5%"

      # High login failure rate (possible brute-force attack)
      - alert: HighLoginFailureRate
        expr: |
          sum(rate(login_attempts_total{result="fail"}[5m])) /
          sum(rate(login_attempts_total[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Login failure rate is above 80%, possible brute-force attack"

      # High response latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 response latency is above 1 second"

      # Database connection pool exhausted
      - alert: DatabaseConnectionPoolExhausted
        expr: db_connections_in_use / db_connections_open > 0.9
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Database connection pool utilization is above 90%"

      # Abnormal drop in active sessions
      - alert: ActiveSessionsDropped
        expr: |
          active_sessions_total < 10
          and
          delta(active_sessions_total[10m]) < -5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Active session count dropped sharply"

      # Frequent rate limit rejections
      - alert: RateLimitRejectionsHigh
        expr: |
          rate(ratelimit_rejections_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Rate limit rejections are too frequent"
```

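To make the `HighErrorRate` threshold concrete: over a fixed window, the ratio of two `rate()` values equals the ratio of the two counter increases, because the window length cancels out. The sketch below does that arithmetic for two scrapes taken at the start and end of a window. The function names and argument order are hypothetical, and the calculation ignores counter resets and the extrapolation that Prometheus applies.

```bash
#!/bin/bash
# error_ratio ERR_START ERR_END TOTAL_START TOTAL_END
# Prints the share of requests in the window that returned 5xx.
error_ratio() {
    awk -v e0="$1" -v e1="$2" -v t0="$3" -v t1="$4" 'BEGIN {
        errors = e1 - e0
        total = t1 - t0
        if (total <= 0) { print "0.0000"; exit }   # no traffic in the window
        printf "%.4f\n", errors / total
    }'
}

# exceeds_threshold RATIO THRESHOLD
# Returns 0 (true) when RATIO is strictly greater than THRESHOLD.
exceeds_threshold() {
    awk -v r="$1" -v t="$2" 'BEGIN { exit !(r > t) }'
}

# 1000 requests in the window, 60 of them 5xx: 6% is above the 5% threshold
ratio=$(error_ratio 140 200 9000 10000)
if exceeds_threshold "$ratio" 0.05; then
    echo "error ratio $ratio exceeds 0.05"
fi
```
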
---

## 4. Grafana Dashboards

We recommend configuring the following dashboards:

### 4.1 Core Dashboard Panels

**Overview dashboard**:
- Request rate (QPS)
- P50/P90/P99 latency
- Error rate
- Active sessions

**Auth dashboard**:
- Login attempts (success/failure)
- Token refresh count
- Active session trend
- TOTP adoption rate

**Database dashboard**:
- P99 query latency
- Connection pool utilization
- Slow query count

**Cache dashboard**:
- Hit rate
- Miss rate
- L1 vs. L2 cache comparison

---

## 5. Log Keyword Monitoring

We recommend configuring alerts on the following keywords in your log aggregation system (such as Loki or ELK):

| Keyword | Severity | Description |
|---------|----------|-------------|
| `auth: increment login attempts failed` | warning | Redis/L1 cache is unavailable |
| `goroutine leak` | critical | Potential goroutine leak |
| `token blacklisted but refresh failed` | critical | Writing to the token blacklist failed |
| `password reset code replay` | warning | Possible verification code replay |
| `temporary login token cleanup failed` | warning | Temporary token cleanup failed |
| `cache.Set failed` | warning | Cache write failed |
| `failed to send email` | warning | Email delivery failed |

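Where no log aggregation system is in place yet, the same keywords can be checked with a small script. This is only a sketch: the function name `scan_log` is hypothetical, the keyword list is copied from the table above, and the log path in the usage comment is the one used by the script in section 6.

```bash
#!/bin/bash
# scan_log FILE
# Prints "SEVERITY COUNT KEYWORD" for every keyword that occurs in FILE.
# Returns 1 if any critical keyword was found, 0 otherwise.
scan_log() {
    local file=$1
    local entry severity keyword count
    local critical_found=0

    # severity|keyword pairs, copied from the keyword table
    local rules=(
        "warning|auth: increment login attempts failed"
        "critical|goroutine leak"
        "critical|token blacklisted but refresh failed"
        "warning|password reset code replay"
        "warning|temporary login token cleanup failed"
        "warning|cache.Set failed"
        "warning|failed to send email"
    )

    for entry in "${rules[@]}"; do
        severity=${entry%%|*}
        keyword=${entry#*|}
        # -F matches the keyword literally; -c counts matching lines
        count=$(grep -F -c -- "$keyword" "$file" 2>/dev/null)
        count=${count:-0}
        if [ "$count" -gt 0 ]; then
            echo "$severity $count $keyword"
            [ "$severity" = "critical" ] && critical_found=1
        fi
    done

    return "$critical_found"
}

# Example:
#   scan_log logs/app.log
```
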
---

## 6. Example Health Check Script

```bash
#!/bin/bash
# health_check.sh: service health check script

HEALTH_URL="http://localhost:8080/health"
READY_URL="http://localhost:8080/health/ready"
METRICS_URL="http://localhost:8080/metrics"

# Print [OK] or [FAIL] for an endpoint; return 0 only on HTTP 200.
# Note: /health also returns 200 when the service is degraded.
check_endpoint() {
    local url=$1
    local name=$2
    local status
    # --max-time keeps the script from hanging; curl reports 000 if it cannot connect
    status=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$url")

    if [ "$status" = "200" ]; then
        echo "[OK] $name: $status"
        return 0
    else
        echo "[FAIL] $name: $status"
        return 1
    fi
}

# Run the checks
failed=0

check_endpoint "$HEALTH_URL" "Health" || failed=$((failed + 1))
check_endpoint "$READY_URL" "Ready" || failed=$((failed + 1))

# Check the Prometheus metrics endpoint (a failure here is only a warning)
status=$(curl -s -o /dev/null --max-time 5 -w "%{http_code}" "$METRICS_URL")
if [ "$status" = "200" ]; then
    echo "[OK] Metrics: $status"
else
    echo "[WARN] Metrics: $status"
fi

# Check the database connection (via the application log)
if grep -q "database opened" logs/app.log 2>/dev/null; then
    echo "[OK] Database: connected"
else
    echo "[FAIL] Database: not connected"
    failed=$((failed + 1))
fi

# Exit code is the number of failed checks (0 means all passed)
exit $failed
```

---

## 7. Example Kubernetes Deployment Configuration

```yaml
# Excerpt: metadata, selector, and the container image are omitted.
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
        - name: user-management
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
            failureThreshold: 3

          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
            timeoutSeconds: 3
            failureThreshold: 3

          ports:
            - name: http
              containerPort: 8080
            # Section 2.3 reads /metrics on port 8080; confirm which port serves metrics.
            - name: metrics
              containerPort: 9090

          resources:
            requests:
              memory: "256Mi"
              cpu: "200m"
            limits:
              memory: "1Gi"
              cpu: "1000m"
```

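Deploy scripts that run outside Kubernetes sometimes need a similar wait-until-ready step. The sketch below retries a probe command a fixed number of times with a pause between attempts, loosely mirroring `failureThreshold` and `periodSeconds`. The function name `wait_until_ready` is hypothetical, and this is a simplification: the kubelet keeps probing for the lifetime of the container, whereas this loop stops at the first success.

```bash
#!/bin/bash
# wait_until_ready ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
# Returns 0 on the first success, 1 if every attempt failed.
wait_until_ready() {
    local attempts=$1
    local interval=$2
    shift 2
    local i

    for ((i = 1; i <= attempts; i++)); do
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        # pause between attempts, but not after the last one
        if [ "$i" -lt "$attempts" ]; then
            sleep "$interval"
        fi
    done

    echo "not ready after $attempts attempt(s)"
    return 1
}

# Example, using the readiness probe values from the manifest:
#   wait_until_ready 3 10 curl -sf --max-time 3 http://localhost:8080/health/ready
```
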
---

*Last updated: 2026-05-10*