# QA Design Review Report: Gateway Closure (2026-05-08)

Stage gate verdict: REQUEST_CHANGES
Ready for Engineer implementation: No

## Review Scope
- PM closure document: /home/long/project/supply-intelligence/prd/PM_GATEWAY_CLOSURE_PRD_2026-05-08.md
- TechLead design: /home/long/project/supply-intelligence/tech/TECHLEAD_GATEWAY_CLOSURE_DESIGN_2026-05-08.md
- Source-of-truth index: /home/long/project/supply-intelligence/tech/CURRENT_SOURCE_OF_TRUTH_2026-05.md
- Consumer closed-loop decision: /home/long/project/supply-intelligence/tech/GATEWAY_CONSUMER_DECISION_2026-05.md
- Closure execution board: /home/long/project/supply-intelligence/tech/PRODUCTION_LAUNCH_CLOSURE_BOARD_2026-05-08.md
- Spot check of the real code paths:
  - /home/long/project/supply-intelligence/internal/httpapi/server.go
  - /home/long/project/supply-intelligence/internal/gatewayconsumer/service.go
  - /home/long/project/supply-intelligence/internal/poller/gateway_package_poller.go
  - /home/long/project/supply-intelligence/internal/poller/runtime.go
  - /home/long/project/supply-intelligence/internal/publish/service.go
  - /home/long/project/supply-intelligence/internal/repository/interfaces.go
  - /home/long/project/supply-intelligence/internal/repository/postgres.go
  - /home/long/project/supply-intelligence/internal/metrics/metrics.go
  - /home/long/project/supply-intelligence/internal/app/app.go
  - /home/long/project/supply-intelligence/internal/httpapi/postgres_e2e_test.go
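
## Design Coverage Check
1. Contract boundary: covered
   - Both the PM and TechLead documents state explicitly that published != applied and define the pending/applied/failed semantics.
   - Evidence: PM document 4.2/4.3; TechLead document 2.2/2.3.

2. Failure retry: partially covered, not closed
   - The PM document defines retryable vs. non-retryable failures, a cap of 3 attempts, and a backoff window.
   - The TechLead document also identifies that the existing code lacks retry metadata and a retry structure.
   - However, the design stays at the level of recommendations and does not form an executable minimal implementation loop against the existing interfaces and table schema.
   - Evidence: TechLead 3.2~3.7.

3. Gradual rollout/rollback: partially covered, no executable entry point
   - The PM document gives the decision thresholds for pause and rollback.
   - The TechLead document proposes runbook scripts and a runtime pause/resume API.
   - However, the current real code has no runtime-status/pause/resume entry point and no script files.
   - Evidence: server.go only has /gateway/consume-once plus paths such as health/metrics; no runtime control route is present.

4. Inspection gate: partially covered, no real metrics wired in
   - The documents define the 24h/72h inspection items.
   - However, metrics.go only declares the metrics; nothing in the call chain records them.
   - Evidence: metrics.go; a full-text search finds no usages of GatewayEventsProcessedTotal / GatewayEventLatencySeconds.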
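
## Risk and Safeguard Check
- Risk 1: publish completion can still be mistaken for consume completion
  - Safeguard: admission-state exposes last_event.gateway_sync_status, and E2E covers publish -> consume -> ack.
  - Gap: how a failed event re-enters automatic consumption for retry is not implemented.

- Risk 2: insufficient failure classification prevents the retry/terminal-state policy from being implemented
  - Safeguard: the documents already define the failure classification model and the retry cap.
  - Gap: the code has no persisted retry_count / next_retry_at / failure_category fields and no corresponding repository methods.

- Risk 3: no way to pause the ramp-up or perform a controlled rollback
  - Safeguard: poller/runtime already has Start/Stop.
  - Gap: there is no pause/resume or runtime-status; Stop is a coarse, process-level shutdown, which does not meet the runbook design requirements.

- Risk 4: observability is not actionable
  - Safeguard: /metrics exists.
  - Gap: the metrics are not wired into the call chain, so they cannot support gate decisions such as "15-minute applied ratio < 95%".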
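
To make the "15-minute applied ratio < 95%" gate from Risk 4 concrete, the following is a minimal in-memory sketch of how such a check could be evaluated once terminal outcomes are actually recorded in the consume path. `AppliedRatioGate`, `Record`, `Ratio`, `Breached`, and `epoch` are hypothetical names introduced for illustration; none of them exist in the reviewed code, and a real implementation would more likely derive the ratio from the metrics backend.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is an arbitrary fixed reference time used only by the demo below.
var epoch = time.Date(2026, 5, 8, 0, 0, 0, 0, time.UTC)

// outcome is one terminal consume result with the time it was observed.
type outcome struct {
	at      time.Time
	applied bool // true for applied, false for failed
}

// AppliedRatioGate is a hypothetical helper, not a type from the repository.
type AppliedRatioGate struct {
	Window    time.Duration // look-back window (15 minutes in the quoted gate)
	Threshold float64       // minimum acceptable applied ratio (0.95 in the quoted gate)
	outcomes  []outcome     // not pruned in this sketch
}

// NewAppliedRatioGate returns a gate configured with the values quoted in the review.
func NewAppliedRatioGate() *AppliedRatioGate {
	return &AppliedRatioGate{Window: 15 * time.Minute, Threshold: 0.95}
}

// Record stores one terminal outcome.
func (g *AppliedRatioGate) Record(at time.Time, applied bool) {
	g.outcomes = append(g.outcomes, outcome{at: at, applied: applied})
}

// Ratio returns applied/total over [now-Window, now].
// ok is false when no outcome falls inside the window.
func (g *AppliedRatioGate) Ratio(now time.Time) (ratio float64, ok bool) {
	cutoff := now.Add(-g.Window)
	var total, applied int
	for _, o := range g.outcomes {
		if o.at.Before(cutoff) || o.at.After(now) {
			continue
		}
		total++
		if o.applied {
			applied++
		}
	}
	if total == 0 {
		return 0, false
	}
	return float64(applied) / float64(total), true
}

// Breached reports whether the window has data and the ratio is below Threshold.
func (g *AppliedRatioGate) Breached(now time.Time) bool {
	ratio, ok := g.Ratio(now)
	return ok && ratio < g.Threshold
}

func main() {
	g := NewAppliedRatioGate()
	for i := 0; i < 18; i++ {
		g.Record(epoch, true)
	}
	g.Record(epoch, false)
	g.Record(epoch, false)

	now := epoch.Add(time.Minute)
	ratio, _ := g.Ratio(now)
	fmt.Printf("applied ratio=%.2f breached=%v\n", ratio, g.Breached(now))
}
```

An empty window is treated as "no data" rather than as a breach; whether the real gate should pause on missing data is a policy question the reviewed documents would need to answer.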

## Handoff Deliverable Usability
- Usable:
  - The basic loop of publish, pull, ack, and admission-state exists.
  - The real code paths can be located, and a PostgreSQL E2E test demonstrates the basic chain.
- Insufficient:
  - No executable runbook file.
  - No tabletop drill / inspection / rollback scripts.
  - No runtime control interface.
  - No persistence for retry state and no storage for failure classification.

## Key Call Chain Verification (Definition / Wiring / Invocation / Entry Point)

### Chain A: package publish
- Definition: /home/long/project/supply-intelligence/internal/publish/service.go
  - PublishDraft / RecordPackagePublished
- Wiring: /home/long/project/supply-intelligence/internal/app/app.go
  - buildApp() injects publish.NewService(repo)
- Invocation: /home/long/project/supply-intelligence/internal/httpapi/server.go
  - handlePublishPackageEvent() -> publishService.PublishDraft(...)
- Entry point: /home/long/project/supply-intelligence/internal/httpapi/server.go
  - Route: POST /internal/supply-intelligence/publish/package-event
- Conclusion: closed

### Chain B: package changes pull
- Definition: /home/long/project/supply-intelligence/internal/repository/interfaces.go
  - ListPackageEventsAfter
- Wiring: /home/long/project/supply-intelligence/internal/app/app.go
  - gatewayconsumer.NewService(repo)
- Invocation: /home/long/project/supply-intelligence/internal/httpapi/server.go
  - handleListPackageChanges() -> repo.ListPackageEventsAfter(...)
  - gatewayconsumer.Service.ConsumeOnce() -> repo.ListPackageEventsAfter(...)
- Entry point: /internal/supply-intelligence/gateway/package-changes
- Conclusion: closed, but it supports only cursor-based stream reads; filtering by retry due time is not supported
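
The missing "retry due filtering" noted for Chain B could look roughly like the sketch below: given events already read from the cursor stream, keep only those that are pending and whose retry time has arrived. `PackageEvent`, its fields, `DueForConsume`, and `epoch` are hypothetical stand-ins; the real domain type and the column name `next_retry_at` are assumptions taken from the fields this review says are missing. In production this filter would more plausibly live in the SQL query than in application code.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is an arbitrary fixed reference time used only by the demo below.
var epoch = time.Date(2026, 5, 8, 0, 0, 0, 0, time.UTC)

// PackageEvent is a hypothetical, trimmed-down stand-in for the real event type.
type PackageEvent struct {
	EventID     string
	AckStatus   string    // "pending", "applied", or "failed"
	NextRetryAt time.Time // zero value: never failed, so due immediately
}

// DueForConsume keeps pending events whose NextRetryAt is not in the future,
// preserving the input (cursor) order.
func DueForConsume(events []PackageEvent, now time.Time) []PackageEvent {
	due := make([]PackageEvent, 0, len(events))
	for _, e := range events {
		if e.AckStatus != "pending" {
			continue // applied and terminally failed events are never picked up
		}
		if e.NextRetryAt.After(now) {
			continue // still inside its backoff window
		}
		due = append(due, e)
	}
	return due
}

func main() {
	events := []PackageEvent{
		{EventID: "evt-1", AckStatus: "pending"},
		{EventID: "evt-2", AckStatus: "pending", NextRetryAt: epoch.Add(time.Hour)},
		{EventID: "evt-3", AckStatus: "failed"},
	}
	for _, e := range DueForConsume(events, epoch) {
		fmt.Println("due:", e.EventID)
	}
}
```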

### Chain C: ack write-back
- Definition: /home/long/project/supply-intelligence/internal/repository/interfaces.go
  - AckPackageEvent
- Wiring: /home/long/project/supply-intelligence/internal/app/app.go
  - gatewayconsumer.NewService(repo)
- Invocation:
  - /home/long/project/supply-intelligence/internal/httpapi/server.go::handleAckPackageChange -> repo.AckPackageEvent(...)
  - /home/long/project/supply-intelligence/internal/gatewayconsumer/service.go::ConsumeOnce -> repo.AckPackageEvent(...)
- Entry point: POST /internal/supply-intelligence/gateway/package-changes/{event_id}/ack
- Conclusion: closed

### Chain D: default consumer and poller/runtime
- Definition: /home/long/project/supply-intelligence/internal/gatewayconsumer/service.go::ConsumeOnce
- Wiring: /home/long/project/supply-intelligence/internal/app/app.go
  - NewGatewayPackagePoller(gatewayConsumerService)
  - NewRuntime(gatewayPoller, time.Second)
- Invocation: /home/long/project/supply-intelligence/internal/poller/gateway_package_poller.go::PollOnce
  - p.consumer.ConsumeOnce(...)
- Entry point: /home/long/project/supply-intelligence/internal/poller/runtime.go::Start
  - A periodic timer triggers PollOnce
- Conclusion: closed, but the runtime can only start/stop; it cannot pause/resume with the semantics the runbook requires
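
One way to get pause/resume semantics without stopping the process is to keep the timer loop running and skip the poll while a flag is set. The sketch below illustrates that idea only; `PausableRuntime`, `Tick`, `Pause`, `Resume`, and `RuntimeStatus` are hypothetical names and do not reflect the actual shape of runtime.go. It assumes the existing ticker loop would call `Tick` where it currently calls PollOnce.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RuntimeStatus is a hypothetical snapshot that a runtime-status endpoint could return.
type RuntimeStatus struct {
	Paused  bool
	Polled  int64 // ticks that ran the poll function
	Skipped int64 // ticks skipped because the runtime was paused
}

// PausableRuntime wraps a poll function with a pause flag.
// It is safe for concurrent use and must be used through a pointer.
type PausableRuntime struct {
	poll    func() error
	paused  atomic.Bool
	polled  atomic.Int64
	skipped atomic.Int64
}

// NewPausableRuntime returns a runtime that starts in the running state.
func NewPausableRuntime(poll func() error) *PausableRuntime {
	return &PausableRuntime{poll: poll}
}

// Pause makes subsequent ticks no-ops; in-flight polls are not interrupted.
func (r *PausableRuntime) Pause() { r.paused.Store(true) }

// Resume lets subsequent ticks poll again.
func (r *PausableRuntime) Resume() { r.paused.Store(false) }

// Tick is what the periodic timer would call instead of polling directly.
func (r *PausableRuntime) Tick() error {
	if r.paused.Load() {
		r.skipped.Add(1)
		return nil
	}
	r.polled.Add(1)
	return r.poll()
}

// Status returns a point-in-time snapshot of the counters and the pause flag.
func (r *PausableRuntime) Status() RuntimeStatus {
	return RuntimeStatus{
		Paused:  r.paused.Load(),
		Polled:  r.polled.Load(),
		Skipped: r.skipped.Load(),
	}
}

func main() {
	rt := NewPausableRuntime(func() error {
		fmt.Println("poll once")
		return nil
	})
	_ = rt.Tick()
	rt.Pause()
	_ = rt.Tick() // skipped while paused
	rt.Resume()
	_ = rt.Tick()
	fmt.Printf("%+v\n", rt.Status())
}
```

Because the process and the timer stay alive, pausing does not lose the poller's cursor or require a redeploy, which is the distinction the review draws between Stop and a runbook-level pause.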

### Chain E: admission-state
- Definition: /home/long/project/supply-intelligence/internal/httpapi/server.go::handleModelAdmissionState
- Wiring: /home/long/project/supply-intelligence/internal/app/app.go
- Invocation: server.go reads directly from repo.GetLatestDiscoveryCandidateContext / GetSupplyPackage / GetLatestPackageEvent
- Entry point: GET /internal/supply-intelligence/models/{platform}/{model}/admission-state
- Conclusion: closed; suitable as the entry point for verifying state after publish
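
Since published != applied, a post-publish check should look at the gateway sync status rather than at the publish result. The sketch below shows such a check against an admission-state response body. The JSON shape is an assumption: only the path last_event.gateway_sync_status is named in this review, and the literal value "applied" is inferred from the pending/applied/failed semantics. `GatewayApplied` and the struct names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// admissionState models only the part of the response this check needs.
// The nesting is assumed from the dotted path last_event.gateway_sync_status.
type admissionState struct {
	LastEvent *lastEvent `json:"last_event"`
}

type lastEvent struct {
	GatewaySyncStatus string `json:"gateway_sync_status"`
}

// GatewayApplied reports whether the latest package event has been applied
// by the gateway. A missing last_event counts as not applied.
func GatewayApplied(body []byte) (bool, error) {
	var st admissionState
	if err := json.Unmarshal(body, &st); err != nil {
		return false, fmt.Errorf("decode admission-state: %w", err)
	}
	if st.LastEvent == nil {
		return false, nil
	}
	return st.LastEvent.GatewaySyncStatus == "applied", nil
}

func main() {
	// Example body: the package was published but the gateway has not applied it yet.
	body := []byte(`{"last_event":{"gateway_sync_status":"pending"}}`)
	applied, err := GatewayApplied(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("applied:", applied)
}
```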
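
## Issue List

### Critical
1. No real persistence or scheduling loop for the retry state machine
   - Evidence: sections 3.2~3.7 of tech/TECHLEAD_GATEWAY_CLOSURE_DESIGN_2026-05-08.md are recommendations only; internal/repository/interfaces.go has only AckPackageEvent, with no interface for retry_count, next_retry_at, or fetching retryable pending events; AckPackageEvent in internal/repository/postgres.go updates only ack_status/consumer/detail/time.
   - Impact: the 3 automatic retries, the backoff, and the terminal failed state defined by PM cannot be executed as designed.
   - Conclusion: blocks entry into implementation.

2. No executable runtime control entry point for gradual rollout/rollback
   - Evidence: Routes in server.go does not expose runtime-status/pause/resume; runtime.go has only Start/Stop; app.go only calls StartBackground automatically at startup.
   - Impact: gate actions required by PM, such as "pause the ramp-up without rolling back immediately" and "controlled resume", cannot be executed.
   - Conclusion: blocks entry into implementation.

3. Observability metrics are not wired into the real call chain
   - Evidence: internal/metrics/metrics.go declares GatewayEventsProcessedTotal/GatewayEventLatencySeconds/AccountsByStatus/RoutingEnabledAccounts; a full-text search finds no actual usages of these metrics.
   - Impact: key gates such as the 15-minute applied ratio, the retry backlog, and the failure trend cannot be verified.
   - Conclusion: blocks entry into implementation.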
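
For Critical item 1, the state transition that needs to be persisted can be sketched as a pure function over the three missing fields. This is an illustration under stated assumptions, not the TechLead design: `RetryState`, `RecordFailure`, the `FailureCategory` values, and the 30-second doubling backoff are all hypothetical. Only the cap of 3 automatic retries and the field names retry_count / next_retry_at / failure_category come from the review, and the cap is read here as "3 retries after the first attempt".

```go
package main

import (
	"fmt"
	"time"
)

// FailureCategory is a hypothetical enum; the review names the field, not its values.
type FailureCategory string

const (
	FailureRetryable    FailureCategory = "retryable"
	FailureNonRetryable FailureCategory = "non_retryable"
)

// MaxAutoRetries mirrors the cap of 3 automatic retries from the PM document.
const MaxAutoRetries = 3

// baseBackoff is an assumed value; the PM backoff window is not restated in this review.
const baseBackoff = 30 * time.Second

// RetryState holds the fields the review says are not persisted today.
type RetryState struct {
	AckStatus       string // "pending", "applied", or "failed"
	RetryCount      int    // automatic retries scheduled so far
	NextRetryAt     time.Time
	FailureCategory FailureCategory
}

// RecordFailure returns the state after one failed consume attempt.
// Non-retryable failures and exhausted retries move the event to terminal failed;
// otherwise the event stays pending with a doubled backoff (30s, 60s, 120s).
func RecordFailure(s RetryState, cat FailureCategory, now time.Time) RetryState {
	s.FailureCategory = cat
	if cat == FailureNonRetryable || s.RetryCount >= MaxAutoRetries {
		s.AckStatus = "failed"
		s.NextRetryAt = time.Time{}
		return s
	}
	s.RetryCount++
	s.AckStatus = "pending"
	s.NextRetryAt = now.Add(baseBackoff << (s.RetryCount - 1))
	return s
}

func main() {
	now := time.Date(2026, 5, 8, 0, 0, 0, 0, time.UTC)
	s := RetryState{AckStatus: "pending"}
	for i := 1; i <= 4; i++ {
		s = RecordFailure(s, FailureRetryable, now)
		fmt.Printf("failure %d: status=%s retry_count=%d next_retry_at=%s\n",
			i, s.AckStatus, s.RetryCount, s.NextRetryAt.Format(time.RFC3339))
	}
}
```

Keeping the transition as a pure function makes it straightforward to unit test independently of the repository layer; the repository would then only need to persist the returned fields in the same statement that updates ack_status.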
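
### Important
1. The failure classification model has not landed in repository/domain
   - Evidence: TechLead 3.3 only recommends adding a failure category enum; no corresponding field or interface is found in the current domain/repository.
   - Impact: the retryable/non-retryable split can only rely on ad hoc decisions inside the consumer and cannot be audited or traced.

2. Failed events have no mechanism to re-enter automatic retry
   - Evidence: TechLead 2.4 notes that ListPackageEventsAfter returns failed events, but the consumer only consumes pending ones; gatewayconsumer/service.go lines 124-126 explicitly skip non-pending events.
   - Impact: once failed is written back, automatic retry cannot be resumed, which is inconsistent with PM's "manual handling entry point / controlled retry" design.

3. The runbook depends on script files, but no such deliverables are found in the repository
   - Evidence: TechLead 4.2 recommends adding scripts/gateway_closure_smoke.sh / inspect.sh / rollback.sh and a runbook document; none of these files are currently found.
   - Impact: the handoff deliverables cannot be executed directly and can only be reviewed on paper.

4. Some of the 24h/72h inspection indicators in the PM document remain outcome-oriented and lack source field definitions
   - Evidence: PM 7.1/7.2 only describe "keeps growing / stable / whether it occurs", without binding them to concrete sampling interfaces or threshold ownership.
   - Impact: QA and Engineer are likely to interpret them differently.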
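
### Minor
1. The source-of-truth index contains a path that still uses a legacy repository prefix
   - Evidence: line 5 of /home/long/project/supply-intelligence/tech/CURRENT_SOURCE_OF_TRUTH_2026-05.md contains "/home/long/project/立交桥/projects/supply-intelligence/".
   - Impact: readers can easily confuse the two paths.

2. The metric naming proposed in the TechLead document is not fully consistent with the existing metrics naming style
   - Evidence: 3.2/5.2 recommend supply_intelligence_gateway_* names; the existing metrics already use the supply_intelligence_ prefix, but the label plan has not been unified.
   - Impact: the naming convention must be unified during implementation to avoid duplication and ambiguity.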
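
## Gap Taxonomy Summary
- Contract gap: the published/pending/applied/failed semantics are defined, but the retry/terminal-state semantics do not form a closed loop in code.
- Execution gap: gradual rollout, pause, and rollback need runtime control and scripts; only the basic Start/Stop exists today.
- Observability gap: the metrics are declared, but nothing actually records them.
- Data-model gap: fields such as retry_count, next_retry_at, and failure_category are missing.
- Operational gap: the runbook deliverables are missing, so no drill can be run directly.
- Verification gap: E2E demonstrates the basic loop, but nothing provides evidence for failure retry, rollback, or the inspection gates.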

## Final Gate Verdict
- Design coverage: partial pass
- Risk safeguards: insufficient
- Handoff usability: insufficient
- Stage gate verdict: REQUEST_CHANGES
- Ready for Engineer implementation: No
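
## Notes
This review spot-checked the real call chains rather than judging from the documents alone. However, because the three main chains (retry, runtime control, and observability) are still not closed at the code level, APPROVED cannot be given.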