---
name: false-negative-status-triage
description: Diagnose and fix false-negative status signals when control-plane status says something is degraded or broken but real user traffic works. Use this whenever provider status, account status, probe status, route health, or inventory state disagrees with real `/models`, `/chat/completions`, usage logs, or verified user flows. Also use it for Chinese requests such as “误报”, “false-negative”, “状态语义不一致”, “provider_status 不准”, “last_probe_status 错误”, or “真实数据面可用但后台还显示失败”.
---

# False-Negative Status Triage

This skill is for signal reconciliation: deciding which signal to trust when control-plane status and real user traffic disagree.

Problem pattern:

- real request path works
- status projection still says degraded, broken, or failed

Treat this as a status-modeling problem first and an outage second.

## Four-layer comparison

Always compare these layers side by side:

1. import batch result
2. provider snapshot or aggregate status
3. provider account inventory status
4. real data-plane evidence

Do not jump directly from import noise to a user-facing conclusion.
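
As a minimal sketch of the side-by-side comparison, the four layers can be put into one table and checked for disagreement with the data plane. The `LayerView` type, its field names, and the layer labels are hypothetical and only serve the illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LayerView:
    """One row of the four-layer comparison (hypothetical shape)."""
    layer: str     # "import_batch", "provider", "account", or "data_plane"
    status: str    # raw status string reported by that layer
    healthy: bool  # whether that layer reports itself as healthy


def find_disagreements(views: list[LayerView]) -> list[str]:
    """Return control-plane layers that report unhealthy while real traffic works."""
    data_plane = next(v for v in views if v.layer == "data_plane")
    if not data_plane.healthy:
        # Real traffic is failing, so this is not a false negative.
        return []
    return [v.layer for v in views if v.layer != "data_plane" and not v.healthy]


views = [
    LayerView("import_batch", "partial", healthy=False),
    LayerView("provider", "degraded", healthy=False),
    LayerView("account", "broken", healthy=False),
    LayerView("data_plane", "ok", healthy=True),
]
print(find_disagreements(views))  # ['import_batch', 'provider', 'account']
```

Each layer named in the output is a candidate false negative; it still needs the checks in the rest of this skill before any status is changed.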

## Meaning of each layer

Keep these fields separate, because each answers a different question:

- `batch_status`: did every import-time check pass?
- `provider_status`: is the provider actually usable at provider level?
- `account_status`: what should operators believe about a specific account asset?
- `last_probe_status`: what happened in the last probe or normalized diagnostic view?

They should not all collapse into the same string.
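
One way to keep the four fields from collapsing is to model them as separate attributes and flag records where they all carry the same value. This is a sketch only: the `StatusRecord` type, the `is_collapsed` helper, and the example values are assumptions, not the project's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StatusRecord:
    """Hypothetical record holding the four status fields side by side."""
    batch_status: str       # did every import-time check pass?
    provider_status: str    # is the provider usable at provider level?
    account_status: str     # what should operators believe about this account?
    last_probe_status: str  # what did the last probe or diagnostic view show?


def is_collapsed(record: StatusRecord) -> bool:
    """True when all four fields carry the same string, a hint that one
    signal was copied into every layer instead of being modeled separately."""
    values = {
        record.batch_status,
        record.provider_status,
        record.account_status,
        record.last_probe_status,
    }
    return len(values) == 1


collapsed = StatusRecord("failed", "failed", "failed", "failed")
separated = StatusRecord("partial", "active", "active", "gateway_ready")
print(is_collapsed(collapsed), is_collapsed(separated))  # True False
```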

## Preferred source of truth

When real user traffic and probe-only signals disagree:

- trust real data-plane success over probe-only failure
- trust host `usage_logs` over display counters when available
- for provider-level availability, trust access closure readiness over a single noisy account probe
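
The precedence above can be sketched as two small functions that fall through from the strongest evidence to the weakest. The `Evidence` type and its field names are hypothetical; `None` stands for "this signal was not observed".

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evidence:
    """Hypothetical bundle of signals; None means the signal is unavailable."""
    data_plane_ok: Optional[bool]         # real /models or /chat/completions outcome
    usage_log_count: Optional[int]        # requests counted from host usage_logs
    display_counter: Optional[int]        # counter shown in the admin display
    access_closure_ready: Optional[bool]  # access closure readiness
    probe_ok: Optional[bool]              # result of a single account probe


def provider_available(e: Evidence) -> Optional[bool]:
    """Prefer real traffic, then access closure readiness, then the probe."""
    if e.data_plane_ok is not None:
        return e.data_plane_ok
    if e.access_closure_ready is not None:
        return e.access_closure_ready
    return e.probe_ok


def request_count(e: Evidence) -> Optional[int]:
    """Prefer host usage_logs over display counters when they are available."""
    if e.usage_log_count is not None:
        return e.usage_log_count
    return e.display_counter


evidence = Evidence(
    data_plane_ok=True,
    usage_log_count=12,
    display_counter=0,
    access_closure_ready=True,
    probe_ok=False,
)
print(provider_available(evidence), request_count(evidence))  # True 12
```

Note that a failing data plane or a broken access closure still wins over a passing probe, so the ordering never hides real breakage.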

## Normalization strategy

Use a narrow normalization rule rather than promoting every failed status.

A good example is when all of the following hold:

- batch is partial
- access closure is ready
- only one imported account resource exists
- smoke model is actually present
- raw account probe failed

In this case:

- provider-level state can be `active`
- account inventory may be normalized away from `broken`
- probe display can become `gateway_ready` or `warning`

The point is to remove false negatives without hiding real breakage.
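
The example above can be sketched as a rule that fires only when all five conditions hold and otherwise leaves existing statuses untouched. The type names, field names, and the raw value `"failed"` are hypothetical. Normalizing the account to `"active"` is also an assumption, since the text only requires moving it away from `broken`.

```python
from __future__ import annotations

from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class ImportContext:
    """Hypothetical inputs to the narrow normalization rule."""
    batch_status: str
    access_closure_ready: bool
    imported_account_count: int
    smoke_model_present: bool
    raw_probe_status: str


@dataclass(frozen=True)
class NormalizedStatus:
    provider_status: str
    account_status: str
    probe_display: str


def normalize_narrow_case(ctx: ImportContext) -> Optional[NormalizedStatus]:
    """Return overrides only for the narrow case; None means change nothing."""
    narrow_case = (
        ctx.batch_status == "partial"
        and ctx.access_closure_ready
        and ctx.imported_account_count == 1
        and ctx.smoke_model_present
        and ctx.raw_probe_status == "failed"
    )
    if not narrow_case:
        return None
    return NormalizedStatus(
        provider_status="active",
        account_status="active",        # assumption: any non-broken value would do
        probe_display="gateway_ready",  # "warning" is the other option in the text
    )


ctx = ImportContext("partial", True, 1, True, "failed")
print(normalize_narrow_case(ctx))
print(normalize_narrow_case(replace(ctx, imported_account_count=2)))  # None
```

Returning `None` outside the narrow case keeps the rule from silently promoting anything it was not written for.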

## What must remain strict

Do not normalize away these cases:

- strict import failures
- rolled-back batches
- broken access closure
- missing smoke model
- multi-account scenarios where one account may really be bad

This skill is about reducing noise, not erasing legitimate failures.
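
A guard that lists every reason normalization must be refused makes the strict cases above explicit and easy to log. This is a sketch under assumed names: `GuardInput` and its fields are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GuardInput:
    """Hypothetical facts needed to decide whether normalization is allowed."""
    strict_import_failed: bool
    batch_rolled_back: bool
    access_closure_ready: bool
    smoke_model_present: bool
    imported_account_count: int


def blocking_reasons(g: GuardInput) -> list[str]:
    """Return every strict case that applies; an empty list means none apply."""
    reasons = []
    if g.strict_import_failed:
        reasons.append("strict import failure")
    if g.batch_rolled_back:
        reasons.append("rolled-back batch")
    if not g.access_closure_ready:
        reasons.append("broken access closure")
    if not g.smoke_model_present:
        reasons.append("missing smoke model")
    if g.imported_account_count > 1:
        reasons.append("multi-account scenario")
    return reasons


risky = GuardInput(
    strict_import_failed=False,
    batch_rolled_back=True,
    access_closure_ready=True,
    smoke_model_present=False,
    imported_account_count=3,
)
print(blocking_reasons(risky))
# ['rolled-back batch', 'missing smoke model', 'multi-account scenario']
```

Normalization should proceed only when the returned list is empty; returning reasons rather than a bare boolean keeps the refusal explainable to operators.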

## Fix workflow

### 1. Reproduce the disagreement

Capture:

- provider snapshot
- provider account inventory row
- real `/models` result
- real `/chat/completions` result
- usage log evidence if possible
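
One possible shape for the capture is a single record filled from injected reader functions, so the same code works against any environment. Everything here is hypothetical: the `DisagreementCapture` type, the reader parameters, and the stubbed values stand in for whatever clients and queries the real system provides.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Optional

Reader = Callable[[], Any]  # zero-argument function returning JSON-like data


@dataclass
class DisagreementCapture:
    """Hypothetical evidence bundle for one provider and account."""
    provider_snapshot: Any
    account_row: Any
    models_result: Any
    chat_result: Any
    usage_log_rows: Optional[Any] = None


def capture_disagreement(
    read_provider: Reader,
    read_account: Reader,
    call_models: Reader,
    call_chat: Reader,
    read_usage_logs: Optional[Reader] = None,
) -> DisagreementCapture:
    """Collect all layers in one pass so they describe the same moment."""
    return DisagreementCapture(
        provider_snapshot=read_provider(),
        account_row=read_account(),
        models_result=call_models(),
        chat_result=call_chat(),
        usage_log_rows=read_usage_logs() if read_usage_logs is not None else None,
    )


# Stubbed readers; real ones would query the control plane and the gateway.
snapshot = capture_disagreement(
    read_provider=lambda: {"provider_status": "degraded"},
    read_account=lambda: {"account_status": "broken", "last_probe_status": "failed"},
    call_models=lambda: {"http_status": 200, "model_count": 3},
    call_chat=lambda: {"http_status": 200},
)
print(json.dumps(asdict(snapshot), indent=2))
```

Saving the printed JSON alongside the bug report gives the later verification step a baseline to compare against.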

### 2. Identify the wrong abstraction boundary

Typical causes:

- provider status derived too directly from batch partiality
- account inventory mirrors raw probe status instead of normalized availability
- advisory or transient probe failure treated as definitive breakage

### 3. Add tests first

Write regression tests for:

- provider-level promotion when access is truly ready
- account-level normalization only in the intended narrow scenario
- guardrails that keep real broken cases broken
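
The three kinds of regression test might look like the pytest-style functions below. The `normalize` function is a stand-in written only so the tests can run; in practice the tests would import the project's real normalization code, and the dictionary keys shown are assumptions.

```python
from __future__ import annotations


def normalize(case: dict) -> dict:
    """Stand-in for the real normalization logic (hypothetical)."""
    narrow = (
        case["batch_status"] == "partial"
        and case["access_closure_ready"]
        and case["account_count"] == 1
        and case["smoke_model_present"]
    )
    if narrow:
        return {"provider_status": "active", "account_status": "active"}
    # Outside the narrow case, raw statuses pass through unchanged.
    return {
        "provider_status": case["raw_provider_status"],
        "account_status": case["raw_account_status"],
    }


# Baseline: the narrow scenario where a failed probe is a false negative.
BASE = {
    "batch_status": "partial",
    "access_closure_ready": True,
    "account_count": 1,
    "smoke_model_present": True,
    "raw_provider_status": "degraded",
    "raw_account_status": "broken",
}


def test_provider_promoted_when_access_ready():
    assert normalize(BASE)["provider_status"] == "active"


def test_account_normalized_only_in_narrow_case():
    assert normalize(BASE)["account_status"] == "active"
    assert normalize({**BASE, "account_count": 2})["account_status"] == "broken"


def test_guardrails_keep_broken_cases_broken():
    for override in (
        {"batch_status": "rolled_back"},
        {"access_closure_ready": False},
        {"smoke_model_present": False},
    ):
        result = normalize({**BASE, **override})
        assert result == {"provider_status": "degraded", "account_status": "broken"}
```

Writing the guardrail test as a loop over single-field overrides makes it cheap to add a new strict case later.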

### 4. Change semantics minimally

- keep raw batch detail truthful
- normalize higher-level status only where it improves operational meaning
- avoid changing unrelated enums or broad behavior

### 5. Verify on a live sample

Re-read the same provider and account in a real environment after deployment.

You want to see:

- raw batch still truthful
- aggregate provider state corrected
- account inventory corrected
- real request path still working
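
The four expectations above can be checked mechanically by comparing a reading taken before the fix with one taken after deployment. This is a sketch: the dictionary keys and the expected values `"active"` and `"broken"` are assumptions about how the readings are recorded.

```python
from __future__ import annotations


def verify_fix(before: dict, after: dict) -> list[str]:
    """Return a list of problems; an empty list means the live sample looks right."""
    problems = []
    if after["batch_status"] != before["batch_status"]:
        problems.append("raw batch status changed, so batch detail is no longer truthful")
    if after["provider_status"] != "active":
        problems.append("aggregate provider state was not corrected")
    if after["account_status"] == "broken":
        problems.append("account inventory still shows broken")
    if not after["data_plane_ok"]:
        problems.append("real request path stopped working")
    return problems


before = {
    "batch_status": "partial",
    "provider_status": "degraded",
    "account_status": "broken",
    "data_plane_ok": True,
}
after = {
    "batch_status": "partial",
    "provider_status": "active",
    "account_status": "active",
    "data_plane_ok": True,
}
print(verify_fix(before, after))  # []
```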