Files
sub2api-cn-relay-manager/.agent/skills/false-negative-status-triage/SKILL.md

123 lines
3.7 KiB
Markdown
Raw Normal View History

---
name: false-negative-status-triage
description: Diagnose and fix false-negative status signals when control-plane status says something is degraded or broken but real user traffic works. Use this whenever provider status, account status, probe status, route health, or inventory state disagrees with real `/models`, `/chat/completions`, usage logs, or verified user flows. Also use it for Chinese requests such as “误报”, “false-negative”, “状态语义不一致”, “provider_status 不准”, “last_probe_status 错误”, or “真实数据面可用但后台还显示失败”.
---
# False-Negative Status Triage
This skill is for signal reconciliation.
Problem pattern:
- real request path works
- status projection still says degraded, broken, or failed
Treat this as a modeling problem first, not an outage first.
## Four-layer comparison
Always compare these layers side by side:
1. import batch result
2. provider snapshot or aggregate status
3. provider account inventory status
4. real data-plane evidence
Do not jump directly from import noise to a user-facing conclusion.
## Meaning of each layer
Keep these separate:
- `batch_status`: did every import-time check pass?
- `provider_status`: is the provider actually usable at provider level?
- `account_status`: what should operators believe about a specific account asset?
- `last_probe_status`: what happened in the last probe or normalized diagnostic view?
These should not all collapse to the same string.
## Preferred source of truth
When real user traffic and probe-only signals disagree:
- trust real data-plane success over probe-only failure
- trust host `usage_logs` over display counters when available
- trust access closure readiness over a single noisy account probe for provider-level availability
## Normalization strategy
Use a narrow rule rather than promoting everything.
Good example:
- batch is partial
- access closure is ready
- only one imported account resource exists
- smoke model is actually present
- raw account probe failed
In this case:
- provider-level state can be `active`
- account inventory may be normalized away from `broken`
- probe display can become `gateway_ready` or `warning`
The point is to remove false negatives without hiding real breakage.
## What must remain strict
Do not normalize away these cases:
- strict import failures
- rolled back batches
- broken access closure
- missing smoke model
- multi-account scenarios where one account may really be bad
This skill is about reducing noise, not erasing legitimate failures.
## Fix workflow
### 1. Reproduce the disagreement
Capture:
- provider snapshot
- provider account inventory row
- real `/models` result
- real `/chat/completions` result
- usage log evidence if possible
### 2. Identify the wrong abstraction boundary
Typical causes:
- provider status derived too directly from batch partiality
- account inventory mirrors raw probe status instead of normalized availability
- advisory or transient probe failure treated as definitive breakage
### 3. Add tests first
Write regression tests for:
- provider-level promotion when access is truly ready
- account-level normalization only in the intended narrow scenario
- guardrails that keep real broken cases broken
### 4. Change semantics minimally
- keep raw batch detail truthful
- normalize higher-level status only where it improves operational meaning
- avoid changing unrelated enums or broad behavior
### 5. Verify on a live sample
Re-read the same provider and account on a real environment after deployment.
You want to see:
- raw batch still truthful
- aggregate provider state corrected
- account inventory corrected
- real request path still working