# Detailed Technical Architecture Design

> Version: v1.0
> Date: 2026-03-18
> Basis: backend skill best practices
> Status: Historical draft (superseded by `technical_architecture_optimized_v2_2026-03-18.md`; not an implementation baseline)

---

## 1. System Architecture Overview

### 1.1 Overall Architecture Diagram

```
┌──────────────────────────────────────────────────────────────────────┐
│                             Client layer                             │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│  │   Web App    │ │  Mobile App  │ │ SDK (Python) │ │  SDK (Node)  │ │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      API Gateway (entry layer)                       │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │   • Rate limiting  • Auth  • Routing  • Logging  • Monitoring   │ │
│  └─────────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                        Business service layer                        │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│  │    Router    │ │     Auth     │ │   Billing    │ │   Provider   │ │
│  │   Service    │ │   Service    │ │   Service    │ │   Service    │ │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│  │    Tenant    │ │     Risk     │ │  Settlement  │ │   Webhook    │ │
│  │   Service    │ │   Service    │ │   Service    │ │   Service    │ │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                         Infrastructure layer                         │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│  │  PostgreSQL  │ │    Redis     │ │    Kafka     │ │   S3/MinIO   │ │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌──────────────────────────────────────────────────────────────────────┐
│                      External integration layer                      │
│  ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│  │    subapi    │ │  OpenAI API  │ │  Anthropic   │ │ CN providers │ │
│  └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
```

---

## 2. Technology Selection

### 2.1 Technology Stack

| Layer | Technology | Version | Notes |
|-------|------------|---------|-------|
| API Gateway | Kong / Traefik | 3.x | High-performance gateway |
| Backend services | Go | 1.21 | High concurrency |
| Web framework | Gin | 1.9 | High performance |
| Database | PostgreSQL | 15 | Primary database |
| Cache | Redis | 7.x | Caching + rate limiting |
| Message queue | Kafka | 3.x | Asynchronous processing |
| Service mesh | Istio | 1.18 | Microservice governance |
| Container orchestration | Kubernetes | 1.28 | Container orchestration |
| CI/CD | GitHub Actions | - | Continuous integration |
| Monitoring | Prometheus + Grafana | - | Observability |
| Logging | ELK Stack | 8.x | Log collection |

### 2.2 Project Structure

```
llm-gateway/
├── cmd/                    # Entry points
│   ├── gateway/            # Gateway service
│   ├── router/             # Router service
│   ├── billing/            # Billing service
│   └── admin/              # Admin console
├── internal/               # Internal packages
│   ├── config/             # Configuration management
│   ├── middleware/         # Middleware
│   ├── handler/            # HTTP handlers
│   ├── service/            # Business logic
│   ├── repository/         # Data access
│   └── model/              # Data models
├── pkg/                    # Shared packages
│   ├── utils/              # Utility functions
│   ├── errors/             # Error definitions
│   └── constants/          # Constant definitions
├── api/                    # API definitions
│   ├── openapi/            # OpenAPI specification
│   └── proto/              # Protobuf definitions
├── configs/                # Configuration files
├── scripts/                # Scripts
├── test/                   # Tests
└── docs/                   # Documentation
```

---

## 3. Detailed Module Design

### 3.1 API Gateway Module

```go
// cmd/gateway/main.go
package main

import (
	"log"

	"github.com/gin-gonic/gin"

	"llm-gateway/internal/handler"
	"llm-gateway/internal/middleware"
)

func main() {
	// gin.New() rather than gin.Default(): Default() already installs Gin's
	// own Logger and Recovery, which would run alongside the project
	// middleware registered below.
	r := gin.New()

	// Global middleware
	r.Use(middleware.Logger())
	r.Use(middleware.Recovery())
	r.Use(middleware.CORS())

	// Rate limiting
	r.Use(middleware.RateLimiter())

	// API routes
	v1 := r.Group("/v1")
	{
		// Authentication
		v1.POST("/auth/token", handler.AuthToken)
		v1.POST("/auth/refresh", handler.RefreshToken)

		// Chat and text completions
		v1.POST("/chat/completions", middleware.AuthRequired(), handler.ChatCompletions)
		v1.POST("/completions", middleware.AuthRequired(), handler.Completions)

		// Embeddings
		v1.POST("/embeddings", middleware.AuthRequired(), handler.Embeddings)

		// Models (no authentication required)
		v1.GET("/models", handler.ListModels)

		// Users
		users := v1.Group("/users")
		users.Use(middleware.AuthRequired())
		{
			users.GET("", handler.ListUsers)
			users.GET("/:id", handler.GetUser)
			users.PUT("/:id", handler.UpdateUser)
		}

		// API keys
		keys := v1.Group("/keys")
		keys.Use(middleware.AuthRequired())
		{
			keys.GET("", handler.ListKeys)
			keys.POST("", handler.CreateKey)
			keys.DELETE("/:id", handler.DeleteKey)
			keys.POST("/:id/rotate", handler.RotateKey)
		}

		// Billing
		billing := v1.Group("/billing")
		billing.Use(middleware.AuthRequired())
		{
			billing.GET("/balance", handler.GetBalance)
			billing.GET("/usage", handler.GetUsage)
			billing.GET("/invoices", handler.ListInvoices)
		}

		// Supply side
		supply := v1.Group("/supply")
		supply.Use(middleware.AuthRequired())
		{
			supply.GET("/accounts", handler.ListAccounts)
			supply.POST("/accounts", handler.CreateAccount)
			supply.GET("/packages", handler.ListPackages)
			supply.POST("/packages", handler.CreatePackage)
			supply.GET("/earnings", handler.GetEarnings)
		}
	}

	// Admin console
	admin := r.Group("/admin")
	admin.Use(middleware.AdminRequired())
	{
		admin.GET("/stats", handler.AdminStats)
		admin.GET("/users", handler.AdminListUsers)
		admin.POST("/users/:id/disable", handler.DisableUser)
	}

	// Run blocks until the server stops; surface the error instead of dropping it.
	if err := r.Run(":8080"); err != nil {
		log.Fatalf("gateway server exited: %v", err)
	}
}
```

### 3.2 Router Service Module

```go
// internal/service/router.go
package service

import (
	"context"

	"llm-gateway/internal/adapter"
	"llm-gateway/internal/model"
)

// RouterService picks a provider for each request and forwards the call.
// MetricsCollector, RouteMetrics, ErrNoProviderAvailable and tryFallback are
// assumed to be defined elsewhere in this package.
type RouterService struct {
	adapterRegistry  *adapter.Registry
	metricsCollector *MetricsCollector
}

type RouteRequest struct {
	Model    string                  `json:"model"`
	Messages []model.Message         `json:"messages"`
	Options  model.CompletionOptions `json:"options"`
	UserID   int64                   `json:"user_id"`
	TenantID int64                   `json:"tenant_id"`
}

func (s *RouterService) Route(ctx context.Context, req RouteRequest) (*model.CompletionResponse, error) {
	// 1. Look up the providers that can serve this model
	providers := s.adapterRegistry.GetAvailableProviders(req.Model)
	if len(providers) == 0 {
		return nil, ErrNoProviderAvailable
	}

	// 2. Select the best provider
	selected := s.selectProvider(providers, req)

	// 3. Record the routing decision
	s.metricsCollector.RecordRoute(ctx, &RouteMetrics{
		Model:    req.Model,
		Provider: selected.Name(),
		TenantID: req.TenantID,
	})

	// 4. Call the provider
	resp, err := selected.Call(ctx, req)
	if err != nil {
		// 5. On failure, try the fallback
		return s.tryFallback(ctx, req, err)
	}

	return resp, nil
}

// selectProvider returns the provider with the highest score.
func (s *RouterService) selectProvider(providers []*adapter.Provider, req RouteRequest) *adapter.Provider {
	// Multi-dimensional selection strategy
	var best *adapter.Provider
	bestScore := -1.0

	for _, p := range providers {
		score := s.calculateScore(p, req)
		if score > bestScore {
			bestScore = score
			best = p
		}
	}

	return best
}

// calculateScore combines four metrics with weights that sum to 1.0.
// LatencyP99 and CostPer1K are used unnormalized, so their contribution
// depends on the unit of the raw values.
func (s *RouterService) calculateScore(p *adapter.Provider, req RouteRequest) float64 {
	// Latency score (40%)
	latencyScore := 1.0 / (p.LatencyP99 + 1)

	// Availability score (30%)
	availabilityScore := p.Availability

	// Cost score (20%)
	costScore := 1.0 / (p.CostPer1K + 1)

	// Quality score (10%)
	qualityScore := p.QualityScore

	return latencyScore*0.4 + availabilityScore*0.3 + costScore*0.2 + qualityScore*0.1
}
```
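
To see how the weights in `calculateScore` combine, the following Python sketch applies the same formula to two hypothetical providers. The provider names and metric values are invented for illustration, and the latency unit is assumed to be seconds, which the design does not state.

```python
from dataclasses import dataclass


@dataclass
class ProviderStats:
    """Hypothetical snapshot of the metrics the router reads."""
    name: str
    latency_p99: float   # assumed to be seconds
    availability: float  # 0.0 to 1.0
    cost_per_1k: float   # currency units per 1K tokens
    quality: float       # 0.0 to 1.0


def calculate_score(p: ProviderStats) -> float:
    """Same weighting as calculateScore: 40% / 30% / 20% / 10%."""
    latency_score = 1.0 / (p.latency_p99 + 1)
    cost_score = 1.0 / (p.cost_per_1k + 1)
    return (latency_score * 0.4 + p.availability * 0.3
            + cost_score * 0.2 + p.quality * 0.1)


def select_provider(providers):
    """Return the highest-scoring provider, or None for an empty list."""
    return max(providers, key=calculate_score, default=None)


fast = ProviderStats("fast", latency_p99=1.0, availability=0.99,
                     cost_per_1k=0.03, quality=0.8)
cheap = ProviderStats("cheap", latency_p99=3.0, availability=0.95,
                      cost_per_1k=0.01, quality=0.7)

best = select_provider([fast, cheap])
print(best.name, round(calculate_score(best), 3))  # fast 0.771
```

Because latency and cost enter as `1 / (x + 1)`, the outcome depends on the unit of the raw values: with latency expressed in milliseconds, the latency term would be close to zero for every provider and the other three terms would decide the ranking.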

### 3.3 Provider Adapter Module

```go
// internal/adapter/registry.go
package adapter

import (
	"context"
	"errors"
	"sync"
	"time"
)

// ErrProviderNotFound is returned when no provider is registered under a name.
var ErrProviderNotFound = errors.New("provider not found")

type Registry struct {
	mu        sync.RWMutex
	providers map[string]Provider
	fallback  map[string]string
	health    map[string]*HealthStatus
}

type Provider interface {
	Name() string
	Call(ctx context.Context, req interface{}) (interface{}, error)
	HealthCheck(ctx context.Context) error
	GetCapabilities() Capabilities
}

type Capabilities struct {
	SupportsStreaming    bool
	SupportsFunctionCall bool
	SupportsVision       bool
	MaxTokens            int
	Models               []string
}

type HealthStatus struct {
	IsHealthy bool
	Latency   time.Duration
	LastCheck time.Time
}

func NewRegistry() *Registry {
	return &Registry{
		providers: make(map[string]Provider),
		fallback:  make(map[string]string),
		health:    make(map[string]*HealthStatus),
	}
}

// Register adds a provider, optionally with the name of its fallback.
func (r *Registry) Register(name string, p Provider, fallback string) {
	r.mu.Lock()
	defer r.mu.Unlock()

	r.providers[name] = p
	if fallback != "" {
		r.fallback[name] = fallback
	}

	// Start the periodic health check
	go r.healthCheckLoop(name, p)
}

// Get returns the named provider, or its fallback when the provider is unhealthy.
func (r *Registry) Get(name string) (Provider, error) {
	r.mu.RLock()
	defer r.mu.RUnlock()

	p, ok := r.providers[name]
	if !ok {
		return nil, ErrProviderNotFound
	}

	// Check health status
	if health, ok := r.health[name]; ok && !health.IsHealthy {
		// Try the fallback. Guard against a fallback name that was never
		// registered; otherwise a nil Provider would be returned with a nil error.
		if fallback, ok := r.fallback[name]; ok {
			if fp, ok := r.providers[fallback]; ok {
				return fp, nil
			}
		}
	}

	return p, nil
}

// healthCheckLoop probes the provider every 30 seconds with a 5-second timeout.
func (r *Registry) healthCheckLoop(name string, p Provider) {
	ticker := time.NewTicker(30 * time.Second)
	for range ticker.C {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		start := time.Now()
		err := p.HealthCheck(ctx)
		latency := time.Since(start)
		cancel()

		r.mu.Lock()
		r.health[name] = &HealthStatus{
			IsHealthy: err == nil,
			Latency:   latency,
			LastCheck: time.Now(),
		}
		r.mu.Unlock()
	}
}
```
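
The lookup order in `Registry.Get` (named provider first, fallback only when the provider is marked unhealthy) can be traced with a small Python sketch. The function name `resolve` and the provider names are hypothetical. The sketch treats a provider with no health record as healthy, which matches a registry where no check has run yet.

```python
from typing import Optional


def resolve(name: str, providers: dict, fallback: dict, healthy: dict) -> Optional[str]:
    """Return the provider name to use, or None if `name` is not registered.

    providers: name -> provider object
    fallback:  name -> fallback provider name
    healthy:   name -> bool; a missing entry counts as healthy
    """
    if name not in providers:
        return None  # corresponds to ErrProviderNotFound
    if not healthy.get(name, True):
        candidate = fallback.get(name)
        # Only switch when the fallback itself is registered.
        if candidate in providers:
            return candidate
    return name


providers = {"openai": object(), "subapi": object()}
fallback = {"openai": "subapi"}

print(resolve("openai", providers, fallback, {}))                 # openai
print(resolve("openai", providers, fallback, {"openai": False}))  # subapi
print(resolve("missing", providers, fallback, {}))                # None
```

As in the Go code, the fallback's own health is not consulted, so an unhealthy fallback would still be returned.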

### 3.4 Billing Service Module

```go
// internal/service/billing.go
package service

import (
	"context"
	"fmt"

	// Assumed package: the calls below follow the github.com/shopspring/decimal API.
	"github.com/shopspring/decimal"

	"llm-gateway/internal/model"
	"llm-gateway/internal/repository"
)

// BalanceManager, WebhookNotifier, CalculateActualCost and
// ErrInsufficientBalance are assumed to be defined elsewhere in this package.
type BillingService struct {
	repo       *repository.BillingRepository
	balanceMgr *BalanceManager
	notifier   *WebhookNotifier
}

type Money struct {
	Amount   decimal.Decimal
	Currency string
}

func (s *BillingService) ProcessRequest(ctx context.Context, req *model.LLMRequest) (*model.BillingRecord, error) {
	// 1. Reserve the estimated cost up front
	estimatedCost := s.EstimateCost(req)
	reserved, err := s.balanceMgr.Reserve(ctx, req.UserID, estimatedCost)
	if err != nil {
		return nil, ErrInsufficientBalance
	}

	// 2. Compute the actual cost from the provider response
	actualCost := s.CalculateActualCost(req.Response)

	// 3. Settle the difference between the actual and reserved amounts
	diff := actualCost.Sub(reserved.Amount)
	if diff.IsPositive() {
		err = s.balanceMgr.Charge(ctx, req.UserID, diff)
	} else if diff.IsNegative() {
		err = s.balanceMgr.Refund(ctx, req.UserID, diff.Abs())
	}
	if err != nil {
		// Do not let a failed charge or refund be overwritten by later steps.
		return nil, fmt.Errorf("settle balance for request %v: %w", req.ID, err)
	}

	// 4. Write the billing record
	record := &model.BillingRecord{
		UserID:           req.UserID,
		RequestID:        req.ID,
		Model:            req.Model,
		PromptTokens:     req.Response.Usage.PromptTokens,
		CompletionTokens: req.Response.Usage.CompletionTokens,
		Amount:           actualCost,
		Status:           model.BillingStatusSettled,
	}

	if err := s.repo.Create(ctx, record); err != nil {
		// The balance is already settled; report the anomaly so the missing
		// record can be compensated, and still return the record to the caller.
		s.notifier.NotifyBillingAnomaly(ctx, record, err)
	}

	return record, nil
}

func (s *BillingService) EstimateCost(req *model.LLMRequest) Money {
	// Estimate from the model's price list
	price := s.repo.GetModelPrice(req.Model)

	promptCost := decimal.NewFromInt(int64(req.Messages.Tokens())).
		Mul(price.InputPer1K).
		Div(decimal.NewFromInt(1000))

	// Estimate output cost from the max_tokens upper bound
	estimatedOutput := decimal.NewFromInt(int64(req.Options.MaxTokens))
	outputCost := estimatedOutput.
		Mul(price.OutputPer1K).
		Div(decimal.NewFromInt(1000))

	total := promptCost.Add(outputCost)

	// Four decimal places, matching the DECIMAL(10, 4) scale of
	// billing_records.amount; two places would round small requests to zero.
	return Money{Amount: total.Round(4), Currency: "USD"}
}
```
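
The reserve-then-settle arithmetic can be checked with a short worked example. This Python sketch uses the standard `decimal` module; the per-1K prices and token counts are hypothetical, and the four-decimal rounding is an assumption taken from the `DECIMAL(10, 4)` column in the schema section.

```python
from decimal import Decimal, ROUND_HALF_UP

FOUR_PLACES = Decimal("0.0001")


def cost(prompt_tokens: int, output_tokens: int,
         input_per_1k: Decimal, output_per_1k: Decimal) -> Decimal:
    """Token cost at per-1K prices, rounded to four decimal places."""
    prompt = Decimal(prompt_tokens) * input_per_1k / 1000
    output = Decimal(output_tokens) * output_per_1k / 1000
    return (prompt + output).quantize(FOUR_PLACES, rounding=ROUND_HALF_UP)


def settle(reserved: Decimal, actual: Decimal):
    """Return the follow-up action after the real cost is known."""
    diff = actual - reserved
    if diff > 0:
        return ("charge", diff)
    if diff < 0:
        return ("refund", -diff)
    return ("none", Decimal("0"))


# Hypothetical prices: 0.0030 per 1K input tokens, 0.0060 per 1K output tokens.
IN_PRICE, OUT_PRICE = Decimal("0.0030"), Decimal("0.0060")

# Reservation assumes the full max_tokens budget (800) is consumed.
reserved = cost(1200, 800, IN_PRICE, OUT_PRICE)
# The response actually used 300 completion tokens.
actual = cost(1200, 300, IN_PRICE, OUT_PRICE)

print(reserved, actual, settle(reserved, actual))
```

Reserving against `max_tokens` means the reservation is an upper bound, so in this scheme a refund is the common case and a follow-up charge should be rare.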

### 3.5 Risk Control Service Module

```go
// internal/service/risk.go
package service

import (
	"context"

	"llm-gateway/internal/model"
)

// RateLimiter and getUser are assumed to be defined elsewhere in this package.
type RiskService struct {
	rules       []RiskRule
	rateLimiter *RateLimiter
}

type RiskRule struct {
	Name      string
	Condition func(*model.LLMRequest, *model.User) bool
	Score     int
	Action    RiskAction
}

type RiskAction string

const (
	RiskActionAllow  RiskAction = "allow"
	RiskActionBlock  RiskAction = "block"
	RiskActionReview RiskAction = "review"
)

// RiskResult is the outcome of evaluating all rules for one request.
type RiskResult struct {
	Action   RiskAction
	Score    int
	Triggers []string
}

// Evaluate sums the scores of all triggered rules and maps the total to an action.
func (s *RiskService) Evaluate(ctx context.Context, req *model.LLMRequest) *RiskResult {
	var totalScore int
	var triggers []string

	user := s.getUser(ctx, req.UserID)

	for _, rule := range s.rules {
		if rule.Condition(req, user) {
			totalScore += rule.Score
			triggers = append(triggers, rule.Name)
		}
	}

	// Decision: >= 70 blocks, >= 40 goes to review, anything lower is allowed
	if totalScore >= 70 {
		return &RiskResult{
			Action:   RiskActionBlock,
			Score:    totalScore,
			Triggers: triggers,
		}
	} else if totalScore >= 40 {
		return &RiskResult{
			Action:   RiskActionReview,
			Score:    totalScore,
			Triggers: triggers,
		}
	}

	return &RiskResult{
		Action: RiskActionAllow,
		Score:  totalScore,
	}
}

// DefaultRiskRules returns the predefined risk rules.
func DefaultRiskRules() []RiskRule {
	return []RiskRule{
		{
			Name: "high_velocity",
			Condition: func(req *model.LLMRequest, user *model.User) bool {
				return req.TokensPerMinute > 1000
			},
			Score:  30,
			Action: RiskActionBlock,
		},
		{
			Name: "new_account_high_value",
			Condition: func(req *model.LLMRequest, user *model.User) bool {
				return user.AccountAgeDays < 7 && req.EstimatedCost > 100
			},
			Score:  35,
			Action: RiskActionReview,
		},
		{
			Name: "unusual_model",
			Condition: func(req *model.LLMRequest, user *model.User) bool {
				return !user.PreferredModels.Contains(req.Model)
			},
			Score:  15,
			Action: RiskActionReview,
		},
	}
}
```
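
A compact Python sketch of the same scoring scheme makes the threshold behaviour easy to test. The rule names, scores, and thresholds are taken from the Go code above; the dictionary-based request and user shapes and the model names are assumptions for illustration.

```python
# (name, condition, score) triples mirroring DefaultRiskRules
RULES = [
    ("high_velocity",
     lambda req, user: req["tokens_per_minute"] > 1000, 30),
    ("new_account_high_value",
     lambda req, user: user["account_age_days"] < 7 and req["estimated_cost"] > 100, 35),
    ("unusual_model",
     lambda req, user: req["model"] not in user["preferred_models"], 15),
]


def evaluate(rules, req, user):
    """Sum the scores of triggered rules and map the total to an action."""
    triggers = [(name, score) for name, cond, score in rules if cond(req, user)]
    total = sum(score for _, score in triggers)
    if total >= 70:
        action = "block"
    elif total >= 40:
        action = "review"
    else:
        action = "allow"
    return action, total, [name for name, _ in triggers]


new_user = {"account_age_days": 3, "preferred_models": ["model-a"]}
old_user = {"account_age_days": 400, "preferred_models": ["model-a"]}
burst = {"tokens_per_minute": 1500, "estimated_cost": 150, "model": "model-b"}

print(evaluate(RULES, burst, new_user))  # ('block', 80, [...all three rules...])
print(evaluate(RULES, burst, old_user))  # ('review', 45, ['high_velocity', 'unusual_model'])
```

With the default scores, no single rule reaches the review threshold of 40: at least two rules must fire for a review, and all three (80 points) are needed to block. The per-rule `Action` field does not influence the decision.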

---

## 4. Data Flow Design

### 4.1 Request Processing Flow

```
     User request
           │
           ▼
┌─────────────────────┐
│ API Gateway         │
│ • Rate limiting     │
│ • Auth              │
│ • Logging           │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Routing decision    │
│ • Model mapping     │
│ • Provider selection│
└──────────┬──────────┘
           │
     ┌─────┴─────┐
     │           │
     ▼           ▼
┌─────────┐ ┌─────────┐
│Provider │ │Fallback │
│    A    │ │    B    │
└────┬────┘ └────┬────┘
     │           │
     └─────┬─────┘
           │
           ▼
┌─────────────────────┐
│ Billing             │
│ • Reserve           │
│ • Charge actual cost│
│ • Record            │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Return response     │
└─────────────────────┘
```

### 4.2 Asynchronous Processing Flow

```
              Request handling
                      │
                      ▼
         ┌─────────────────────────┐
         │ Sync: reserve + execute │
         │ • Reserve balance       │
         │ • Call provider         │
         │ • Charge actual cost    │
         └────────────┬────────────┘
                      │
          ┌───────────┴───────────┐
          │                       │
          ▼                       ▼
┌───────────────────┐   ┌───────────────────┐
│ Sync response     │   │ Async queue       │
│ • Return result   │   │ • Record usage    │
│ • Update balance  │   │ • Statistics      │
│                   │   │ • Reconciliation  │
└───────────────────┘   └─────────┬─────────┘
                                  │
                                  ▼
                        ┌───────────────────┐
                        │ Kafka topics      │
                        │ • usage           │
                        │ • billing         │
                        │ • audit           │
                        └─────────┬─────────┘
                                  │
                        ┌─────────┴─────────┐
                        │                   │
                        ▼                   ▼
                ┌───────────────┐   ┌───────────────┐
                │ Consumer      │   │ Consumer      │
                │ • Write to DB │   │ • Reconcile   │
                │ • Monitoring  │   │ • Alerting    │
                └───────────────┘   └───────────────┘
```

---

## 5. API Design Specification

### 5.1 RESTful API Design

```yaml
# openapi.yaml
openapi: 3.0.3
info:
  title: LLM Gateway API
  version: 1.0.0
  description: Enterprise LLM Gateway API

servers:
  - url: https://api.lgateway.com/v1
    description: Production server
  - url: https://staging-api.lgateway.com/v1
    description: Staging server

paths:
  /chat/completions:
    post:
      summary: Create a chat completion
      operationId: createChatCompletion
      tags:
        - Chat
      security:
        - BearerAuth: []
      requestBody:
        required: true
        content:
          application/json:
            schema:
              $ref: '#/components/schemas/ChatCompletionRequest'
      responses:
        '200':
          description: Successful response
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ChatCompletionResponse'
        '400':
          $ref: '#/components/responses/BadRequest'
        '401':
          $ref: '#/components/responses/Unauthorized'
        '429':
          $ref: '#/components/responses/RateLimited'
        '500':
          $ref: '#/components/responses/InternalServerError'

components:
  securitySchemes:
    BearerAuth:
      type: http
      scheme: bearer
      bearerFormat: JWT

  # The ChatCompletionResponse and Error schemas referenced in this file
  # are not defined in this excerpt.
  schemas:
    ChatCompletionRequest:
      type: object
      required:
        - model
        - messages
      properties:
        model:
          type: string
          description: Model identifier
        messages:
          type: array
          items:
            $ref: '#/components/schemas/Message'
        temperature:
          type: number
          minimum: 0
          maximum: 2
          default: 1.0
        max_tokens:
          type: integer
          minimum: 1
          maximum: 32000
        stream:
          type: boolean
          default: false

    Message:
      type: object
      required:
        - role
        - content
      properties:
        role:
          type: string
          enum: [system, user, assistant]
        content:
          type: string

  responses:
    BadRequest:
      description: Bad request
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
    Unauthorized:
      description: Unauthorized
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
    RateLimited:
      description: Rate limited
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
    InternalServerError:
      description: Internal server error
      content:
        application/json:
          schema:
            $ref: '#/components/schemas/Error'
```

### 5.2 Error Response Format

```json
{
  "error": {
    "code": "BILLING_001",
    "message": "Insufficient balance",
    "message_i18n": {
      "zh_CN": "余额不足",
      "en_US": "Insufficient balance"
    },
    "details": {
      "required": 100.00,
      "available": 50.00,
      "top_up_url": "/v1/billing/top-up"
    },
    "trace_id": "req_abc123",
    "retryable": false,
    "doc_url": "https://docs.lgateway.com/errors/billing-001"
  }
}
```
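
One way to keep every error body in this shape is to build it through a single helper. The Python sketch below is illustrative only: `build_error` is a hypothetical function, and the rule that turns `BILLING_001` into the `billing-001` documentation slug is inferred from the one example above rather than stated by the design.

```python
import json


def build_error(code, message, *, message_i18n=None, details=None,
                trace_id="", retryable=False,
                doc_base="https://docs.lgateway.com/errors/"):
    """Assemble an error body with the fields of the error response format."""
    return {
        "error": {
            "code": code,
            "message": message,
            "message_i18n": message_i18n or {},
            "details": details or {},
            "trace_id": trace_id,
            "retryable": retryable,
            # Slug rule inferred from the example: BILLING_001 -> billing-001
            "doc_url": doc_base + code.lower().replace("_", "-"),
        }
    }


body = build_error(
    "BILLING_001",
    "Insufficient balance",
    message_i18n={"zh_CN": "余额不足", "en_US": "Insufficient balance"},
    details={"required": 100.00, "available": 50.00,
             "top_up_url": "/v1/billing/top-up"},
    trace_id="req_abc123",
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Centralising construction like this keeps optional fields such as `details` present (as empty objects) on every error, which simplifies client-side parsing.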

---

## 6. Database Design

### 6.1 Core Table Structure

```sql
-- Tenants (created first because users references it)
CREATE TABLE tenants (
    id BIGSERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    plan VARCHAR(20) DEFAULT 'free',
    status VARCHAR(20) DEFAULT 'active',
    settings JSONB DEFAULT '{}',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Users
CREATE TABLE users (
    id BIGSERIAL PRIMARY KEY,
    email VARCHAR(255) UNIQUE NOT NULL,
    password_hash VARCHAR(255) NOT NULL,
    name VARCHAR(100),
    tenant_id BIGINT REFERENCES tenants(id),
    role VARCHAR(20) DEFAULT 'user',
    status VARCHAR(20) DEFAULT 'active',
    mfa_enabled BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- API keys
CREATE TABLE api_keys (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    key_hash VARCHAR(64) NOT NULL UNIQUE,
    key_prefix VARCHAR(20) NOT NULL,
    description VARCHAR(200),
    permissions JSONB DEFAULT '{}',
    rate_limit_rpm INT DEFAULT 60,
    rate_limit_tpm INT DEFAULT 100000,
    status VARCHAR(20) DEFAULT 'active',
    expires_at TIMESTAMP,
    last_used_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Billing records
CREATE TABLE billing_records (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    tenant_id BIGINT REFERENCES tenants(id),
    request_id VARCHAR(64) NOT NULL,
    provider VARCHAR(50) NOT NULL,
    model VARCHAR(50) NOT NULL,
    prompt_tokens INT NOT NULL,
    completion_tokens INT NOT NULL,
    amount DECIMAL(10, 4) NOT NULL,
    currency VARCHAR(3) DEFAULT 'USD',
    status VARCHAR(20) DEFAULT 'settled',
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Usage records
CREATE TABLE usage_records (
    id BIGSERIAL PRIMARY KEY,
    user_id BIGINT NOT NULL REFERENCES users(id),
    tenant_id BIGINT REFERENCES tenants(id),
    api_key_id BIGINT REFERENCES api_keys(id),
    request_id VARCHAR(64) NOT NULL,
    model VARCHAR(50) NOT NULL,
    provider VARCHAR(50) NOT NULL,
    prompt_tokens INT DEFAULT 0,
    completion_tokens INT DEFAULT 0,
    latency_ms INT,
    status_code INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- PostgreSQL has no ON UPDATE CURRENT_TIMESTAMP column clause (that is MySQL
-- syntax), so updated_at is maintained by a trigger.
CREATE OR REPLACE FUNCTION set_updated_at() RETURNS trigger AS $$
BEGIN
    NEW.updated_at = CURRENT_TIMESTAMP;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_tenants_updated_at BEFORE UPDATE ON tenants
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER trg_users_updated_at BEFORE UPDATE ON users
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER trg_api_keys_updated_at BEFORE UPDATE ON api_keys
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();
CREATE TRIGGER trg_billing_records_updated_at BEFORE UPDATE ON billing_records
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();

-- Indexes (users.email and api_keys.key_hash are already indexed by their
-- UNIQUE constraints, so no separate index is created for them)
CREATE INDEX idx_users_tenant ON users(tenant_id);
CREATE INDEX idx_api_keys_user ON api_keys(user_id);
CREATE INDEX idx_billing_user ON billing_records(user_id, created_at);
CREATE INDEX idx_billing_tenant ON billing_records(tenant_id, created_at);
CREATE INDEX idx_usage_user ON usage_records(user_id, created_at);
CREATE INDEX idx_usage_request ON usage_records(request_id);
```
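
The `api_keys` table stores `key_hash` and `key_prefix` rather than the key itself. The Python sketch below shows one way such values could be produced and checked. It assumes SHA-256 (whose hex digest is exactly 64 characters, matching `VARCHAR(64)`), a hypothetical `sk-` prefix, and an 8-character display prefix; none of these choices are stated in the design.

```python
import hashlib
import hmac
import secrets


def generate_api_key(prefix: str = "sk-"):
    """Create a key and the values that would be stored for it.

    Returns (plaintext_key, key_hash, key_prefix). Only the hash and the
    short prefix are persisted; the plaintext is shown to the user once.
    """
    key = prefix + secrets.token_urlsafe(32)
    key_hash = hashlib.sha256(key.encode("utf-8")).hexdigest()  # 64 hex chars
    key_prefix = key[:8]  # short, non-secret identifier for display
    return key, key_hash, key_prefix


def verify_api_key(presented: str, stored_hash: str) -> bool:
    """Hash the presented key and compare it in constant time."""
    candidate = hashlib.sha256(presented.encode("utf-8")).hexdigest()
    return hmac.compare_digest(candidate, stored_hash)


key, key_hash, key_prefix = generate_api_key()
print(key_prefix, len(key_hash), verify_api_key(key, key_hash))
```

Because the hash is deterministic, an incoming key can be looked up directly through the unique index on `key_hash`.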

---

## 7. Simplifying Message Queue Operations

### 7.1 Kafka Operations Challenges

| Challenge | Impact | Simplification |
|-----------|--------|----------------|
| Complex cluster management | High operations cost | Use a managed service |
| Partition replica synchronization | Data latency | Tune configuration |
| Consumer group management | Consumer backlog | Simplify the architecture |
| Monitoring and alerting | Too much noise | Trim the metric set |
| Capacity planning | Hard to scale | Automated scaling |

### 7.2 Managed Kafka Service Selection

```yaml
# Message queue service selection
recommended:
  # Alibaba Cloud Kafka (mainland China)
  aliyun:
    type: managed
    version: 2.2.0
    features:
      - Automatic partition rebalancing
      - Dead-letter queue support
      - Cross-availability-zone disaster recovery
    ops_benefits:
      - No self-managed operations
      - SLA 99.9%
      - Pay-as-you-go billing

  # AWS MSK (overseas)
  aws_msk:
    type: managed
    version: 2.8.0
    features:
      - MSK Serverless removes capacity planning
      - Fine-grained access control
    ops_benefits:
      - Integrates with the AWS ecosystem
      - Managed upgrades

alternatives:
  # Lightweight alternative
  redis_streams:
    use_case: "Low-latency small messages"
    limitations:
      - No persistence guarantee
      - Single-threaded consumption

  # Option when the workload is simple
  database_queues:
    use_case: "Message rate < 1000/s"
    limitations:
      - Limited performance
      - Retries must be implemented by the application
```

### 7.3 Simplified Topic Design

```python
# Simplified topic design: reduced from the original 10+ topics to 4
TOPIC_DESIGN = {
    # Core business topic
    "llm.requests": {
        "partitions": 6,
        "retention": "7d",
        "description": "LLM request flow",
    },

    # Asynchronous billing topic
    "llm.billing": {
        "partitions": 3,
        "retention": "30d",
        "description": "Billing ledger entries",
    },

    # Notification event topic
    "llm.events": {
        "partitions": 3,
        "retention": "3d",
        "description": "Event notifications of all kinds",
    },

    # Monitoring data topic
    "llm.metrics": {
        "partitions": 1,
        "retention": "1d",
        "description": "Raw monitoring data",
    },
}
```
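
Kafka's `retention.ms` topic setting takes milliseconds, so shorthand values such as `"7d"` need converting before they are passed to the broker. The helper below is a minimal sketch; the function name and the supported unit suffixes (`d`, `h`, `m`) are assumptions.

```python
import re

# Milliseconds per supported unit suffix
_UNIT_MS = {"d": 86_400_000, "h": 3_600_000, "m": 60_000}


def retention_to_ms(value: str) -> int:
    """Convert a shorthand such as '7d' or '12h' to milliseconds."""
    match = re.fullmatch(r"(\d+)([dhm])", value)
    if match is None:
        raise ValueError(f"unsupported retention value: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_MS[unit]


retentions = {"llm.requests": "7d", "llm.billing": "30d",
              "llm.events": "3d", "llm.metrics": "1d"}
for topic, retention in retentions.items():
    print(topic, f"retention.ms={retention_to_ms(retention)}")
```

For the four topics this yields 604800000, 2592000000, 259200000 and 86400000 milliseconds respectively.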

### 7.4 Simplified Consumer Groups

```python
# Simplified consumer group design
class SimplifiedConsumerGroup:
    """Simplified consumer group management."""

    # Before: several consumer groups per service
    # After: one consumer group per service

    def __init__(self):
        self.groups = {
            "router-service": {
                "topics": ["llm.requests"],
                "consumers": 3,  # llm.requests has 6 partitions: 2 per consumer
                "strategy": "round_robin",
            },
            "billing-service": {
                "topics": ["llm.billing"],
                "consumers": 2,
                "strategy": "failover",
            },
            "notification-service": {
                "topics": ["llm.events"],
                "consumers": 1,
                "strategy": "broadcast",
            },
        }

    def get_consumer_count(self, group: str) -> int:
        """Return the configured consumer count for a group."""
        return self.groups[group]["consumers"]
```

### 7.5 Operations Automation Script

```bash
#!/bin/bash
# scripts/kafka-ops.sh - Kafka operations automation
# Assumes KAFKA_BROKER is set and that get_group_for_topic, alert and
# prometheus_pushgateway are helper functions defined elsewhere.

set -euo pipefail

# Topic settings mirror the topic design in section 7.3.
# Bash has no nested maps, so partitions and retention use separate arrays;
# retention.ms must be given in milliseconds.
declare -A TOPIC_PARTITIONS=(
    ["llm.requests"]=6 ["llm.billing"]=3 ["llm.events"]=3 ["llm.metrics"]=1
)
declare -A TOPIC_RETENTION_MS=(
    ["llm.requests"]=604800000    # 7d
    ["llm.billing"]=2592000000    # 30d
    ["llm.events"]=259200000      # 3d
    ["llm.metrics"]=86400000      # 1d
)

# Sum the LAG column (column 6) for one consumer group.
# Header text and "-" values count as 0; "sum + 0" guarantees a number.
group_lag() {
    kafka-consumer-groups.sh --bootstrap-server "$KAFKA_BROKER" \
        --group "$1" --describe | awk '{sum += $6} END {print sum + 0}'
}

# 1. Topic health check
check_topics() {
    echo "=== Checking topic status ==="
    kafka-topics.sh --bootstrap-server "$KAFKA_BROKER" --list | while read -r topic; do
        partitions=$(kafka-topics.sh --bootstrap-server "$KAFKA_BROKER" \
            --topic "$topic" --describe | grep -c "Leader:" || true)
        lag=$(group_lag "$(get_group_for_topic "$topic")")

        echo "Topic: $topic, Partitions: $partitions, Lag: $lag"

        if [ "$lag" -gt 1000 ]; then
            alert "Consumer lag alert: $topic is behind by $lag messages"
        fi
    done
}

# 2. Create topics automatically (idempotent)
ensure_topics() {
    for topic in "${!TOPIC_PARTITIONS[@]}"; do
        kafka-topics.sh --bootstrap-server "$KAFKA_BROKER" \
            --create --if-not-exists --topic "$topic" \
            --partitions "${TOPIC_PARTITIONS[$topic]}" \
            --replication-factor 3 \
            --config "retention.ms=${TOPIC_RETENTION_MS[$topic]}"
    done
}

# 3. Consumer lag monitoring
monitor_lag() {
    for group in $(kafka-consumer-groups.sh --bootstrap-server "$KAFKA_BROKER" \
        --list 2>/dev/null); do
        prometheus_pushgateway "kafka_consumer_lag" "$(group_lag "$group")" "group=$group"
    done
}
```

### 7.6 Trimmed Monitoring Metrics

```yaml
# Trimmed-down Kafka monitoring metrics (avoids alert noise)
kafka_metrics:
  essential:
    - name: kafka_consumer_group_lag_max
      description: Maximum consumer lag
      alert_threshold: 1000

    - name: kafka_topic_partition_under_replicated
      description: Number of under-replicated partitions
      alert_threshold: 0

    - name: kafka_server_broker_topic_messages_in_total
      description: Inbound message rate
      alert_threshold: "rate_change > 50%"

  optional:
    # Enable the following metrics only while troubleshooting
    - kafka_network_request_metrics
    - kafka_consumer_fetch_manager_metrics
    - kafka_producer_metrics
```

### 7.7 Automated Capacity Planning

```python
# Automated capacity planning
class KafkaCapacityPlanner:
    """Automated Kafka capacity planning."""

    def calculate_requirements(self, metrics: dict) -> dict:
        """Derive capacity requirements from observed traffic."""
        # Peak QPS
        peak_qps = metrics["peak_qps"]

        # Average message size in bytes
        avg_msg_size = metrics["avg_msg_size_kb"] * 1024

        # Retention period in days
        retention_days = 7

        # Required disk in bytes, sized as if peak traffic lasted all day
        # (a deliberately conservative estimate)
        disk_per_day = peak_qps * avg_msg_size * 86400
        total_disk = disk_per_day * retention_days

        # Recommended configuration
        return {
            # One partition per 100 QPS, at least 1 and at most 12
            "partitions": max(1, min(peak_qps // 100, 12)),
            "replication_factor": 3,
            # Logical data size; raw cluster storage is this value
            # multiplied by replication_factor
            "disk_gb": total_disk / (1024**3),
            "broker_count": 3,
            "scaling_trigger": "disk_usage > 70%",
        }
```
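
A worked example helps sanity-check the sizing formula. The sketch below restates it as a standalone function with hypothetical traffic figures (500 peak QPS, 2 KB messages). It clamps the partition count to at least 1 and also reports the size after multiplying by the replication factor, since each replica stores a full copy of the data.

```python
def plan_capacity(peak_qps: int, avg_msg_size_kb: float,
                  retention_days: int = 7, replication_factor: int = 3) -> dict:
    """Size disk and partitions as if peak traffic lasted all day."""
    bytes_per_day = peak_qps * avg_msg_size_kb * 1024 * 86400
    logical_gb = bytes_per_day * retention_days / 1024**3
    return {
        # One partition per 100 QPS, clamped to the range 1..12
        "partitions": max(1, min(peak_qps // 100, 12)),
        "disk_gb": logical_gb,
        "disk_gb_replicated": logical_gb * replication_factor,
    }


plan = plan_capacity(peak_qps=500, avg_msg_size_kb=2)
print(plan["partitions"],
      round(plan["disk_gb"], 2),
      round(plan["disk_gb_replicated"], 2))
```

For these inputs the result is 5 partitions, about 576.78 GB of logical data over 7 days, and about 1730.35 GB once three replicas are counted.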

### 7.8 Fault Self-Healing

```python
# Kafka fault self-healing
class KafkaSelfHealing:
    """Kafka self-healing mechanism."""

    def __init__(self):
        self.healing_rules = {
            "under_replicated": {
                "detect": "partition.replicas - in.sync.replicas > 0",
                "action": "trigger_preferred_reelection",
                "cooldown": 300,  # 5 minutes
            },
            "controller_failover": {
                "detect": "controller_epoch jump",
                "action": "wait_for_automatic_election",
                "cooldown": 60,
            },
            "partition_offline": {
                "detect": "leader == -1",
                "action": "assign_new_leader",
                "cooldown": 60,
            },
        }

    async def check_and_heal(self):
        """Periodically check each rule and heal when needed."""
        # should_heal and execute_healing are not shown in this document.
        for rule_name, rule in self.healing_rules.items():
            if self.should_heal(rule_name):
                await self.execute_healing(rule_name, rule)
```
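
The `cooldown` values imply that a rule must not fire again until its cooldown has elapsed. The sketch below shows one way that part of `should_heal` might work. The class name `CooldownTracker` and the injectable clock are assumptions, and the sketch covers only the cooldown check, not fault detection.

```python
import time


class CooldownTracker:
    """Tracks when each healing rule last ran and enforces its cooldown."""

    def __init__(self, cooldowns: dict, clock=time.monotonic):
        self._cooldowns = cooldowns  # rule name -> cooldown in seconds
        self._clock = clock          # injectable so tests can control time
        self._last_run = {}

    def should_heal(self, rule_name: str) -> bool:
        """True if the rule has never run or its cooldown has elapsed."""
        last = self._last_run.get(rule_name)
        if last is None:
            return True
        return self._clock() - last >= self._cooldowns[rule_name]

    def mark_healed(self, rule_name: str) -> None:
        """Record that the rule's action has just been executed."""
        self._last_run[rule_name] = self._clock()


# Drive the tracker with a fake clock to show the 5-minute cooldown.
now = [0.0]
tracker = CooldownTracker({"under_replicated": 300}, clock=lambda: now[0])

print(tracker.should_heal("under_replicated"))  # True: never ran
tracker.mark_healed("under_replicated")
now[0] = 299.0
print(tracker.should_heal("under_replicated"))  # False: still cooling down
now[0] = 300.0
print(tracker.should_heal("under_replicated"))  # True: cooldown elapsed
```

Using a monotonic clock avoids spurious re-triggers when the wall clock is adjusted.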

---

## 8. Consistency Verification

### 8.1 Consistency with Existing Documents

| Design item | Source document | Consistent |
|-------------|-----------------|------------|
| Provider Adapter | `architecture_solution_v1.md` | ✅ |
| Routing strategy | `architecture_solution_v1.md` | ✅ |
| Billing precision | `business_solution_v1.md` | ✅ |
| Security mechanisms | `security_solution_v1.md` | ✅ |
| API versioning | `api_solution_v1.md` | ✅ |
| Error code scheme | `api_solution_v1.md` | ✅ |
| Rate limiting | `p1_optimization_solution_v1.md` | ✅ |
| Webhook | `p1_optimization_solution_v1.md` | ✅ |

---

## 9. Implementation Plan

### 9.1 Development Phases

| Phase | Scope | Duration |
|-------|-------|----------|
| Phase 1 | Infrastructure + API Gateway | 3 weeks |
| Phase 2 | Core service development | 4 weeks |
| Phase 3 | Integration testing | 2 weeks |
| Phase 4 | Performance optimization | 2 weeks |

---

**Document status**: Detailed technical architecture design

**Related documents**:
- `architecture_solution_v1_2026-03-18.md`
- `api_solution_v1_2026-03-18.md`
- `security_solution_v1_2026-03-18.md`