36 KiB
36 KiB
详细技术架构设计
版本:v1.0 日期:2026-03-18 依据:backend skill 最佳实践 状态:历史草稿(已被
technical_architecture_optimized_v2_2026-03-18.md替代,不作为实施基线)
1. 系统架构概览
1.1 整体架构图
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ 客户端层 │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Web App │ │ Mobile App │ │ SDK (Python)│ │ SDK (Node) │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ API Gateway (入口层) │
│ ┌─────────────────────────────────────────────────────────────────────────────┐ │
│ │ • 限流 • 鉴权 • 路由 • 日志 • 监控 │ │
│ └─────────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ 业务服务层 │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Router │ │ Auth │ │ Billing │ │ Provider │ │
│ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Tenant │ │ Risk │ │ Settlement │ │ Webhook │ │
│ │ Service │ │ Service │ │ Service │ │ Service │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ 基础设施层 │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ PostgreSQL │ │ Redis │ │ Kafka │ │ S3/MinIO │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────────┐
│ 外部集成层 │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ subapi │ │ OpenAI API │ │ Anthropic │ │ 国内供应商 │ │
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────────┘
2. 技术选型
2.1 技术栈
| 层级 | 技术 | 版本 | 说明 |
|---|---|---|---|
| API Gateway | Kong / Traefik | 3.x | 高性能网关 |
| 后端服务 | Go | 1.21 | 高并发 |
| Web框架 | Gin | 1.9 | 高性能 |
| 数据库 | PostgreSQL | 15 | 主数据库 |
| 缓存 | Redis | 7.x | 缓存+限流 |
| 消息队列 | Kafka | 3.x | 异步处理 |
| 服务网格 | Istio | 1.18 | 微服务治理 |
| 容器编排 | Kubernetes | 1.28 | 容器编排 |
| CI/CD | GitHub Actions | - | 持续集成 |
| 监控 | Prometheus + Grafana | - | 可观测性 |
| 日志 | ELK Stack | 8.x | 日志收集 |
2.2 项目结构
llm-gateway/
├── cmd/ # 入口程序
│ ├── gateway/ # 网关服务
│ ├── router/ # 路由服务
│ ├── billing/ # 计费服务
│ └── admin/ # 管理后台
├── internal/ # 内部包
│ ├── config/ # 配置管理
│ ├── middleware/ # 中间件
│ ├── handler/ # HTTP处理器
│ ├── service/ # 业务逻辑
│ ├── repository/ # 数据访问
│ └── model/ # 数据模型
├── pkg/ # 公共包
│ ├── utils/ # 工具函数
│ ├── errors/ # 错误定义
│ └── constants/ # 常量定义
├── api/ # API定义
│ ├── openapi/ # OpenAPI规范
│ └── proto/ # Protobuf定义
├── configs/ # 配置文件
├── scripts/ # 脚本
├── test/ # 测试
└── docs/ # 文档
3. 模块详细设计
3.1 API Gateway 模块
// cmd/gateway/main.go
package main
import (
"github.com/gin-gonic/gin"
"llm-gateway/internal/middleware"
"llm-gateway/internal/handler"
)
func main() {
r := gin.Default()
// 全局中间件
r.Use(middleware.Logger())
r.Use(middleware.Recovery())
r.Use(middleware.CORS())
// 限流
r.Use(middleware.RateLimiter())
// API路由
v1 := r.Group("/v1")
{
// 认证
v1.POST("/auth/token", handler.AuthToken)
v1.POST("/auth/refresh", handler.RefreshToken)
// 对话
v1.POST("/chat/completions", middleware.AuthRequired(), handler.ChatCompletions)
v1.POST("/completions", middleware.AuthRequired(), handler.Completions)
// Embeddings
v1.POST("/embeddings", middleware.AuthRequired(), handler.Embeddings)
// 模型
v1.GET("/models", handler.ListModels)
// 用户
users := v1.Group("/users")
users.Use(middleware.AuthRequired())
{
users.GET("", handler.ListUsers)
users.GET("/:id", handler.GetUser)
users.PUT("/:id", handler.UpdateUser)
}
// API Key
keys := v1.Group("/keys")
keys.Use(middleware.AuthRequired())
{
keys.GET("", handler.ListKeys)
keys.POST("", handler.CreateKey)
keys.DELETE("/:id", handler.DeleteKey)
keys.POST("/:id/rotate", handler.RotateKey)
}
// 计费
billing := v1.Group("/billing")
billing.Use(middleware.AuthRequired())
{
billing.GET("/balance", handler.GetBalance)
billing.GET("/usage", handler.GetUsage)
billing.GET("/invoices", handler.ListInvoices)
}
// 供应方
supply := v1.Group("/supply")
supply.Use(middleware.AuthRequired())
{
supply.GET("/accounts", handler.ListAccounts)
supply.POST("/accounts", handler.CreateAccount)
supply.GET("/packages", handler.ListPackages)
supply.POST("/packages", handler.CreatePackage)
supply.GET("/earnings", handler.GetEarnings)
}
}
// 管理后台
admin := r.Group("/admin")
admin.Use(middleware.AdminRequired())
{
admin.GET("/stats", handler.AdminStats)
admin.GET("/users", handler.AdminListUsers)
admin.POST("/users/:id/disable", handler.DisableUser)
}
r.Run(":8080")
}
3.2 路由服务模块
// internal/service/router.go
package service
import (
"context"
"time"
"llm-gateway/internal/model"
"llm-gateway/internal/adapter"
)
type RouterService struct {
adapterRegistry *adapter.Registry
metricsCollector *MetricsCollector
}
type RouteRequest struct {
Model string `json:"model"`
Messages []model.Message `json:"messages"`
Options model.CompletionOptions `json:"options"`
UserID int64 `json:"user_id"`
TenantID int64 `json:"tenant_id"`
}
func (s *RouterService) Route(ctx context.Context, req RouteRequest) (*model.CompletionResponse, error) {
// 1. 获取可用供应商
providers := s.adapterRegistry.GetAvailableProviders(req.Model)
if len(providers) == 0 {
return nil, ErrNoProviderAvailable
}
// 2. 选择最优供应商
selected := s.selectProvider(providers, req)
// 3. 记录路由决策
s.metricsCollector.RecordRoute(ctx, &RouteMetrics{
Model: req.Model,
Provider: selected.Name(),
TenantID: req.TenantID,
})
// 4. 调用供应商
resp, err := selected.Call(ctx, req)
if err != nil {
// 5. 失败时尝试fallback
return s.tryFallback(ctx, req, err)
}
return resp, nil
}
func (s *RouterService) selectProvider(providers []*adapter.Provider, req RouteRequest) *adapter.Provider {
// 多维度选择策略
var best *adapter.Provider
bestScore := -1.0
for _, p := range providers {
score := s.calculateScore(p, req)
if score > bestScore {
bestScore = score
best = p
}
}
return best
}
func (s *RouterService) calculateScore(p *adapter.Provider, req RouteRequest) float64 {
// 延迟评分 (40%)
latencyScore := 1.0 / (p.LatencyP99 + 1)
// 可用性评分 (30%)
availabilityScore := p.Availability
// 成本评分 (20%)
costScore := 1.0 / (p.CostPer1K + 1)
// 质量评分 (10%)
qualityScore := p.QualityScore
return latencyScore*0.4 + availabilityScore*0.3 + costScore*0.2 + qualityScore*0.1
}
3.3 Provider Adapter 模块
// internal/adapter/registry.go
package adapter
import (
"context"
"sync"
)
type Registry struct {
mu sync.RWMutex
providers map[string]Provider
fallback map[string]string
health map[string]*HealthStatus
}
type Provider interface {
Name() string
Call(ctx context.Context, req interface{}) (interface{}, error)
HealthCheck(ctx context.Context) error
GetCapabilities() Capabilities
}
type Capabilities struct {
SupportsStreaming bool
SupportsFunctionCall bool
SupportsVision bool
MaxTokens int
Models []string
}
type HealthStatus struct {
IsHealthy bool
Latency time.Duration
LastCheck time.Time
}
func NewRegistry() *Registry {
return &Registry{
providers: make(map[string]Provider),
fallback: make(map[string]string),
health: make(map[string]*HealthStatus),
}
}
func (r *Registry) Register(name string, p Provider, fallback string) {
r.mu.Lock()
defer r.mu.Unlock()
r.providers[name] = p
if fallback != "" {
r.fallback[name] = fallback
}
// 启动健康检查
go r.healthCheckLoop(name, p)
}
func (r *Registry) Get(name string) (Provider, error) {
r.mu.RLock()
defer r.mu.RUnlock()
p, ok := r.providers[name]
if !ok {
return nil, ErrProviderNotFound
}
// 检查健康状态
if health, ok := r.health[name]; ok && !health.IsHealthy {
// 尝试fallback
if fallback, ok := r.fallback[name]; ok {
return r.providers[fallback], nil
}
}
return p, nil
}
func (r *Registry) healthCheckLoop(name string, p Provider) {
ticker := time.NewTicker(30 * time.Second)
for range ticker.C {
ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
err := p.HealthCheck(ctx)
cancel()
r.mu.Lock()
r.health[name] = &HealthStatus{
IsHealthy: err == nil,
LastCheck: time.Now(),
}
r.mu.Unlock()
}
}
3.4 计费服务模块
// internal/service/billing.go
package service
import (
"context"
"decimal"
"llm-gateway/internal/model"
)
type BillingService struct {
repo *repository.BillingRepository
balanceMgr *BalanceManager
notifier *WebhookNotifier
}
type Money struct {
Amount decimal.Decimal
Currency string
}
func (s *BillingService) ProcessRequest(ctx context.Context, req *model.LLMRequest) (*model.BillingRecord, error) {
// 1. 预扣额度
estimatedCost := s.EstimateCost(req)
reserved, err := s.balanceMgr.Reserve(ctx, req.UserID, estimatedCost)
if err != nil {
return nil, ErrInsufficientBalance
}
// 2. 处理请求(实际扣费)
actualCost := s.CalculateActualCost(req.Response)
// 3. 补偿差额
diff := actualCost.Sub(reserved.Amount)
if diff.IsPositive() {
err = s.balanceMgr.Charge(ctx, req.UserID, diff)
} else if diff.IsNegative() {
err = s.balanceMgr.Refund(ctx, req.UserID, diff.Abs())
}
// 4. 记录账单
record := &model.BillingRecord{
UserID: req.UserID,
RequestID: req.ID,
Model: req.Model,
PromptTokens: req.Response.Usage.PromptTokens,
CompletionTokens: req.Response.Usage.CompletionTokens,
Amount: actualCost,
Status: model.BillingStatusSettled,
}
err = s.repo.Create(ctx, record)
if err != nil {
// 记录失败,触发补偿
s.notifier.NotifyBillingAnomaly(ctx, record, err)
}
return record, nil
}
func (s *BillingService) EstimateCost(req *model.LLMRequest) Money {
// 使用模型定价估算
price := s.repo.GetModelPrice(req.Model)
promptCost := decimal.NewFromInt(int64(req.Messages.Tokens()))
.Mul(price.InputPer1K)
.Div(decimal.NewFromInt(1000))
// 估算输出
estimatedOutput := decimal.NewFromInt(int64(req.Options.MaxTokens))
outputCost := estimatedOutput
.Mul(price.OutputPer1K)
.Div(decimal.NewFromInt(1000))
total := promptCost.Add(outputCost)
return Money{Amount: total.Round(2), Currency: "USD"}
}
3.5 风控服务模块
// internal/service/risk.go
package service
import (
"llm-gateway/internal/model"
)
type RiskService struct {
rules []RiskRule
rateLimiter *RateLimiter
}
type RiskRule struct {
Name string
Condition func(*model.LLMRequest, *model.User) bool
Score int
Action RiskAction
}
type RiskAction string
const (
RiskActionAllow RiskAction = "allow"
RiskActionBlock RiskAction = "block"
RiskActionReview RiskAction = "review"
)
func (s *RiskService) Evaluate(ctx context.Context, req *model.LLMRequest) *RiskResult {
var totalScore int
var triggers []string
user := s.getUser(ctx, req.UserID)
for _, rule := range s.rules {
if rule.Condition(req, user) {
totalScore += rule.Score
triggers = append(triggers, rule.Name)
}
}
// 决策
if totalScore >= 70 {
return &RiskResult{
Action: RiskActionBlock,
Score: totalScore,
Triggers: triggers,
}
} else if totalScore >= 40 {
return &RiskResult{
Action: RiskActionReview,
Score: totalScore,
Triggers: triggers,
}
}
return &RiskResult{
Action: RiskActionAllow,
Score: totalScore,
}
}
// 预定义风控规则
func DefaultRiskRules() []RiskRule {
return []RiskRule{
{
Name: "high_velocity",
Condition: func(req *model.LLMRequest, user *model.User) bool {
return req.TokensPerMinute > 1000
},
Score: 30,
Action: RiskActionBlock,
},
{
Name: "new_account_high_value",
Condition: func(req *model.LLMRequest, user *model.User) bool {
return user.AccountAgeDays < 7 && req.EstimatedCost > 100
},
Score: 35,
Action: RiskActionReview,
},
{
Name: "unusual_model",
Condition: func(req *model.LLMRequest, user *model.User) bool {
return !user.PreferredModels.Contains(req.Model)
},
Score: 15,
Action: RiskActionReview,
},
}
}
4. 数据流设计
4.1 请求处理流程
用户请求
│
▼
┌─────────────────┐
│ API Gateway │
│ • 限流 │
│ • 鉴权 │
│ • 日志 │
└────────┬────────┘
│
▼
┌─────────────────┐
│ 路由决策 │
│ • 模型映射 │
│ • 供应商选择 │
└────────┬────────┘
│
┌────┴────┐
│ │
▼ ▼
┌────────┐ ┌────────┐
│ Provider│ │ Fallback│
│ A │ │ B │
└────┬───┘ └────┬───┘
│ │
└────┬────┘
│
▼
┌─────────────────┐
│ 计费处理 │
│ • 预扣 │
│ • 实际扣费 │
│ • 记录 │
└────────┬────────┘
│
▼
┌─────────────────┐
│ 响应返回 │
└─────────────────┘
4.2 异步处理流程
请求处理
│
▼
┌─────────────────┐
│ 同步:预扣+执行 │
│ │
│ • 预扣额度 │
│ • 调用供应商 │
│ • 实际扣费 │
└────────┬────────┘
│
┌────┴────┐
│ │
▼ ▼
┌────────┐ ┌────────┐
│ 同步响应 │ │ 异步队列│
│ │ │ │
│ • 返回 │ │ • 记录使用量│
│ • 更新 │ │ • 统计 │
│ 余额 │ │ • 对账 │
└────────┘ └────┬────┘
│
▼
┌─────────────┐
│ Kafka Topic │
│ • usage │
│ • billing │
│ • audit │
└──────┬──────┘
│
┌─────┴─────┐
│ │
▼ ▼
┌─────────┐ ┌─────────┐
│ 消费者 │ │ 消费者 │
│ • 写入DB │ │ • 对账 │
│ • 监控 │ │ • 告警 │
└─────────┘ └─────────┘
5. API 设计规范
5.1 RESTful API 设计
# openapi.yaml
openapi: 3.0.3
info:
title: LLM Gateway API
version: 1.0.0
description: Enterprise LLM Gateway API
servers:
- url: https://api.lgateway.com/v1
description: Production server
- url: https://staging-api.lgateway.com/v1
description: Staging server
paths:
/chat/completions:
post:
summary: Create a chat completion
operationId: createChatCompletion
tags:
- Chat
security:
- BearerAuth: []
requestBody:
required: true
content:
application/json:
schema:
$ref: '#/components/schemas/ChatCompletionRequest'
responses:
'200':
description: Successful response
content:
application/json:
schema:
$ref: '#/components/schemas/ChatCompletionResponse'
'400':
$ref: '#/components/responses/BadRequest'
'401':
$ref: '#/components/responses/Unauthorized'
'429':
$ref: '#/components/responses/RateLimited'
'500':
$ref: '#/components/responses/InternalServerError'
components:
securitySchemes:
BearerAuth:
type: http
scheme: bearer
bearerFormat: JWT
schemas:
ChatCompletionRequest:
type: object
required:
- model
- messages
properties:
model:
type: string
description: Model identifier
messages:
type: array
items:
$ref: '#/components/schemas/Message'
temperature:
type: number
minimum: 0
maximum: 2
default: 1.0
max_tokens:
type: integer
minimum: 1
maximum: 32000
stream:
type: boolean
default: false
Message:
type: object
required:
- role
- content
properties:
role:
type: string
enum: [system, user, assistant]
content:
type: string
responses:
BadRequest:
description: Bad request
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
Unauthorized:
description: Unauthorized
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
RateLimited:
description: Rate limited
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
InternalServerError:
description: Internal server error
content:
application/json:
schema:
$ref: '#/components/schemas/Error'
5.2 错误响应格式
{
"error": {
"code": "BILLING_001",
"message": "Insufficient balance",
"message_i18n": {
"zh_CN": "余额不足",
"en_US": "Insufficient balance"
},
"details": {
"required": 100.00,
"available": 50.00,
"top_up_url": "/v1/billing/top-up"
},
"trace_id": "req_abc123",
"retryable": false,
"doc_url": "https://docs.lgateway.com/errors/billing-001"
}
}
6. 数据库设计
6.1 核心表结构
-- 用户表
CREATE TABLE users (
id BIGSERIAL PRIMARY KEY,
email VARCHAR(255) UNIQUE NOT NULL,
password_hash VARCHAR(255) NOT NULL,
name VARCHAR(100),
tenant_id BIGINT REFERENCES tenants(id),
role VARCHAR(20) DEFAULT 'user',
status VARCHAR(20) DEFAULT 'active',
mfa_enabled BOOLEAN DEFAULT FALSE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
-- API Keys表
CREATE TABLE api_keys (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
key_hash VARCHAR(64) NOT NULL UNIQUE,
key_prefix VARCHAR(20) NOT NULL,
description VARCHAR(200),
permissions JSONB DEFAULT '{}',
rate_limit_rpm INT DEFAULT 60,
rate_limit_tpm INT DEFAULT 100000,
status VARCHAR(20) DEFAULT 'active',
expires_at TIMESTAMP,
last_used_at TIMESTAMP,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
-- 租户表
CREATE TABLE tenants (
id BIGSERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
plan VARCHAR(20) DEFAULT 'free',
status VARCHAR(20) DEFAULT 'active',
settings JSONB DEFAULT '{}',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
-- 账单记录表
CREATE TABLE billing_records (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
tenant_id BIGINT REFERENCES tenants(id),
request_id VARCHAR(64) NOT NULL,
provider VARCHAR(50) NOT NULL,
model VARCHAR(50) NOT NULL,
prompt_tokens INT NOT NULL,
completion_tokens INT NOT NULL,
amount DECIMAL(10, 4) NOT NULL,
currency VARCHAR(3) DEFAULT 'USD',
status VARCHAR(20) DEFAULT 'settled',
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);
-- 使用量记录表
CREATE TABLE usage_records (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
tenant_id BIGINT REFERENCES tenants(id),
api_key_id BIGINT REFERENCES api_keys(id),
request_id VARCHAR(64) NOT NULL,
model VARCHAR(50) NOT NULL,
provider VARCHAR(50) NOT NULL,
prompt_tokens INT DEFAULT 0,
completion_tokens INT DEFAULT 0,
latency_ms INT,
status_code INT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- 索引
CREATE INDEX idx_users_email ON users(email);
CREATE INDEX idx_users_tenant ON users(tenant_id);
CREATE INDEX idx_api_keys_user ON api_keys(user_id);
CREATE INDEX idx_api_keys_hash ON api_keys(key_hash);
CREATE INDEX idx_billing_user ON billing_records(user_id, created_at);
CREATE INDEX idx_billing_tenant ON billing_records(tenant_id, created_at);
CREATE INDEX idx_usage_user ON usage_records(user_id, created_at);
CREATE INDEX idx_usage_request ON usage_records(request_id);
7. 消息队列运维简化
7.1 Kafka运维挑战分析
| 挑战 | 影响 | 简化方案 |
|---|---|---|
| 集群管理复杂 | 运维成本高 | 使用托管服务 |
| 分区副本同步 | 数据延迟 | 优化配置 |
| 消费者组管理 | 消费积压 | 简化架构 |
| 监控告警 | 噪声过多 | 精简指标 |
| 容量规划 | 扩展困难 | 自动化伸缩 |
7.2 托管Kafka服务选型
# 消息队列服务选型
recommended:
# 阿里云Kafka(国内)
aliyun:
type: managed
version: 2.2.0
features:
- 自动分区重平衡
- 死信队列支持
- 跨可用区容灾
ops_benefits:
- 免运维
- SLA 99.9%
- 按量计费
# AWS MSK(海外)
aws_msk:
type: managed
version: 2.8.0
features:
- MSK Serverless免容量规划
- 精细访问控制
ops_benefits:
- 与AWS生态集成
- 托管升级
alternatives:
# 轻量级替代方案
redis_streams:
use_case: "低延迟小消息"
limitations:
- 无持久化保证
- 单线程消费
# 业务简单时的选择
database_queues:
use_case: "消息量<1000/s"
limitations:
- 性能有限
- 需自行实现重试
7.3 Topic设计简化
# 简化的Topic设计 - 从原来的10+个精简为4个
TOPIC_DESIGN = {
# 核心业务Topic
"llm.requests": {
"partitions": 6,
"retention": "7d",
"description": "LLM请求流转"
},
# 异步计费Topic
"llm.billing": {
"partitions": 3,
"retention": "30d",
"description": "计费流水"
},
# 通知事件Topic
"llm.events": {
"partitions": 3,
"retention": "3d",
"description": "各类事件通知"
},
# 监控数据Topic
"llm.metrics": {
"partitions": 1,
"retention": "1d",
"description": "原始监控数据"
}
}
7.4 消费者组简化
# 简化的消费者组设计
class SimplifiedConsumerGroup:
"""简化消费者组管理"""
# 原来:每个服务多个消费者组
# 优化后:一个服务一个消费者组
def __init__(self):
self.groups = {
"router-service": {
"topics": ["llm.requests"],
"consumers": 3, # 与分区数匹配
"strategy": "round_robin"
},
"billing-service": {
"topics": ["llm.billing"],
"consumers": 2,
"strategy": "failover"
},
"notification-service": {
"topics": ["llm.events"],
"consumers": 1,
"strategy": "broadcast"
}
}
def get_consumer_count(self, group: str) -> int:
"""自动计算消费者数量"""
return self.groups[group]["consumers"]
7.5 自动化运维脚本
#!/bin/bash
# scripts/kafka-ops.sh - Kafka运维自动化
set -e
# 1. 主题健康检查
check_topics() {
echo "=== 检查Topic状态 ==="
kafka-topics.sh --bootstrap-server $KAFKA_BROKER --list | while read topic; do
partitions=$(kafka-topics.sh --bootstrap-server $KAFKA_BROKER \
--topic $topic --describe | grep -c "Leader:")
lag=$(kafka-consumer-groups.sh --bootstrap-server $KAFKA_BROKER \
--group $(get_group_for_topic $topic) \
--describe | awk '{sum+=$6} END {print sum}')
echo "Topic: $topic, Partitions: $partitions, Lag: $lag"
if [ $lag -gt 1000 ]; then
alert "消费积压告警: $topic 积压 $lag 条"
fi
done
}
# 2. 自动创建Topic(幂等)
ensure_topics() {
for topic in "${!TOPIC_DESIGN[@]}"; do
config="${TOPIC_DESIGN[$topic]}"
kafka-topics.sh --bootstrap-server $KAFKA_BROKER \
--topic $topic --create \
--partitions ${config[partitions]} \
--replication-factor 3 \
--config retention.ms=${config[retention]} \
2>/dev/null || echo "Topic $topic already exists"
done
}
# 3. 消费延迟监控
monitor_lag() {
for group in $(kafka-consumer-groups.sh --bootstrap-server $KAFKA_BROKER \
--list 2>/dev/null); do
lag=$(kafka-consumer-groups.sh --bootstrap-server $KAFKA_BROKER \
--group $group --describe | awk '{sum+=$6} END {print sum}')
prometheus_pushgateway "kafka_consumer_lag" $lag "group=$group"
done
}
7.6 监控指标精简
# 精简的Kafka监控指标 - 避免噪声
kafka_metrics:
essential:
- name: kafka_consumer_group_lag_max
description: 最大消费延迟
alert_threshold: 1000
- name: kafka_topic_partition_under_replicated
description: 副本不同步数
alert_threshold: 0
- name: kafka_server_broker_topic_messages_in_total
description: 消息入站速率
alert_threshold: rate_change > 50%
optional:
# 以下指标仅在排查问题时启用
- kafka_network_request_metrics
- kafka_consumer_fetch_manager_metrics
- kafka_producer_metrics
7.7 容量规划自动化
# 自动容量规划
class KafkaCapacityPlanner:
"""Kafka容量自动规划"""
def calculate_requirements(self, metrics: dict) -> dict:
"""基于实际流量计算容量"""
# 峰值QPS
peak_qps = metrics["peak_qps"]
# 平均消息大小
avg_msg_size = metrics["avg_msg_size_kb"] * 1024
# 保留期
retention_days = 7
# 计算所需磁盘
disk_per_day = peak_qps * avg_msg_size * 86400
total_disk = disk_per_day * retention_days
# 推荐配置
return {
"partitions": min(peak_qps // 100, 12), # 最大12分区
"replication_factor": 3,
"disk_gb": total_disk / (1024**3),
"broker_count": 3,
"scaling_trigger": "disk_usage > 70%"
}
7.8 故障自愈机制
# Kafka故障自愈
class KafkaSelfHealing:
"""Kafka自愈机制"""
def __init__(self):
self.healing_rules = {
"under_replicated": {
"detect": "partition.replicas - in.sync.replicas > 0",
"action": "trigger_preferred_reelection",
"cooldown": 300 # 5分钟
},
"controller_failover": {
"detect": "controller_epoch跳跃",
"action": "等待自动选举",
"cooldown": 60
},
"partition_offline": {
"detect": "leader == -1",
"action": "assign_new_leader",
"cooldown": 60
}
}
async def check_and_heal(self):
"""定期检查并自愈"""
for rule_name, rule in self.healing_rules.items():
if self.should_heal(rule_name):
await self.execute_healing(rule_name, rule)
7. 一致性验证
7.1 与现有文档一致性
| 设计项 | 对应文档 | 一致性 |
|---|---|---|
| Provider Adapter | architecture_solution_v1.md |
✅ |
| 路由策略 | architecture_solution_v1.md |
✅ |
| 计费精度 | business_solution_v1.md |
✅ |
| 安全机制 | security_solution_v1.md |
✅ |
| API版本管理 | api_solution_v1.md |
✅ |
| 错误码体系 | api_solution_v1.md |
✅ |
| 限流机制 | p1_optimization_solution_v1.md |
✅ |
| Webhook | p1_optimization_solution_v1.md |
✅ |
8. 实施计划
8.1 开发阶段
| 阶段 | 内容 | 周数 |
|---|---|---|
| Phase 1 | 基础设施 + API Gateway | 3周 |
| Phase 2 | 核心服务开发 | 4周 |
| Phase 3 | 集成测试 | 2周 |
| Phase 4 | 性能优化 | 2周 |
文档状态:详细技术架构设计 关联文档:
architecture_solution_v1_2026-03-18.mdapi_solution_v1_2026-03-18.mdsecurity_solution_v1_2026-03-18.md