3.7 KiB
3.7 KiB
name, description
| name | description |
|---|---|
| false-negative-status-triage | Diagnose and fix false-negative status signals when control-plane status says something is degraded or broken but real user traffic works. Use this whenever provider status, account status, probe status, route health, or inventory state disagrees with real `/models`, `/chat/completions`, usage logs, or verified user flows. Also use it for Chinese requests such as “误报”, “false-negative”, “状态语义不一致”, “provider_status 不准”, “last_probe_status 错误”, or “真实数据面可用但后台还显示失败”. |
False-Negative Status Triage
This skill is for signal reconciliation.
Problem pattern:
- real request path works
- status projection still says degraded, broken, or failed
Treat this as a modeling problem first, not an outage first.
Four-layer comparison
Always compare these layers side by side:
- import batch result
- provider snapshot or aggregate status
- provider account inventory status
- real data-plane evidence
Do not jump directly from import noise to a user-facing conclusion.
Meaning of each layer
Keep these separate:
batch_status: did every import-time check pass?provider_status: is the provider actually usable at provider level?account_status: what should operators believe about a specific account asset?last_probe_status: what happened in the last probe or normalized diagnostic view?
These should not all collapse to the same string.
Preferred source of truth
When real user traffic and probe-only signals disagree:
- trust real data-plane success over probe-only failure
- trust host
usage_logsover display counters when available - trust access closure readiness over a single noisy account probe for provider-level availability
Normalization strategy
Use a narrow rule rather than promoting everything.
Good example:
- batch is partial
- access closure is ready
- only one imported account resource exists
- smoke model is actually present
- raw account probe failed
In this case:
- provider-level state can be
active - account inventory may be normalized away from
broken - probe display can become
gateway_readyorwarning
The point is to remove false negatives without hiding real breakage.
What must remain strict
Do not normalize away these cases:
- strict import failures
- rolled back batches
- broken access closure
- missing smoke model
- multi-account scenarios where one account may really be bad
This skill is about reducing noise, not erasing legitimate failures.
Fix workflow
1. Reproduce the disagreement
Capture:
- provider snapshot
- provider account inventory row
- real
/modelsresult - real
/chat/completionsresult - usage log evidence if possible
2. Identify the wrong abstraction boundary
Typical causes:
- provider status derived too directly from batch partiality
- account inventory mirrors raw probe status instead of normalized availability
- advisory or transient probe failure treated as definitive breakage
3. Add tests first
Write regression tests for:
- provider-level promotion when access is truly ready
- account-level normalization only in the intended narrow scenario
- guardrails that keep real broken cases broken
4. Change semantics minimally
- keep raw batch detail truthful
- normalize higher-level status only where it improves operational meaning
- avoid changing unrelated enums or broad behavior
5. Verify on a live sample
Re-read the same provider and account on a real environment after deployment.
You want to see:
- raw batch still truthful
- aggregate provider state corrected
- account inventory corrected
- real request path still working