---
name: false-negative-status-triage
description: Diagnose and fix false-negative status signals when control-plane status says something is degraded or broken but real user traffic works. Use this whenever provider status, account status, probe status, route health, or inventory state disagrees with real `/models`, `/chat/completions`, usage logs, or verified user flows. Also use it for Chinese requests such as “误报”, “false-negative”, “状态语义不一致”, “provider_status 不准”, “last_probe_status 错误”, or “真实数据面可用但后台还显示失败”.
---

# False-Negative Status Triage

This skill is for signal reconciliation.

Problem pattern:

- real request path works
- status projection still says degraded, broken, or failed

Treat this as a modeling problem first, not an outage first.

## Four-layer comparison

Always compare these layers side by side:

1. import batch result
2. provider snapshot or aggregate status
3. provider account inventory status
4. real data-plane evidence

Do not jump directly from import noise to a user-facing conclusion.

## Meaning of each layer

Keep these separate:

- `batch_status`: did every import-time check pass?
- `provider_status`: is the provider actually usable at provider level?
- `account_status`: what should operators believe about a specific account asset?
- `last_probe_status`: what happened in the last probe or normalized diagnostic view?

These should not all collapse to the same string.

## Preferred source of truth

When real user traffic and probe-only signals disagree:

- trust real data-plane success over probe-only failure
- trust host `usage_logs` over display counters when available
- trust access closure readiness over a single noisy account probe for provider-level availability

## Normalization strategy

Use a narrow rule rather than promoting everything.

Good example:

- batch is partial
- access closure is ready
- only one imported account resource exists
- smoke model is actually present
- raw account probe failed

In this case:

- provider-level state can be `active`
- account inventory may be normalized away from `broken`
- probe display can become `gateway_ready` or `warning`

The point is to remove false negatives without hiding real breakage.

## What must remain strict

Do not normalize away these cases:

- strict import failures
- rolled back batches
- broken access closure
- missing smoke model
- multi-account scenarios where one account may really be bad

This skill is about reducing noise, not erasing legitimate failures.

## Fix workflow

### 1. Reproduce the disagreement

Capture:

- provider snapshot
- provider account inventory row
- real `/models` result
- real `/chat/completions` result
- usage log evidence if possible

### 2. Identify the wrong abstraction boundary

Typical causes:

- provider status derived too directly from batch partiality
- account inventory mirrors raw probe status instead of normalized availability
- advisory or transient probe failure treated as definitive breakage

### 3. Add tests first

Write regression tests for:

- provider-level promotion when access is truly ready
- account-level normalization only in the intended narrow scenario
- guardrails that keep real broken cases broken

### 4. Change semantics minimally

- keep raw batch detail truthful
- normalize higher-level status only where it improves operational meaning
- avoid changing unrelated enums or broad behavior

### 5. Verify on a live sample

Re-read the same provider and account on a real environment after deployment.

You want to see:

- raw batch still truthful
- aggregate provider state corrected
- account inventory corrected
- real request path still working
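
## Example sketch of the narrow rule

The narrow rule from "Normalization strategy" and the guardrails from "What must remain strict" can be sketched as one small projection function plus the regression checks that step 3 asks for. This is a minimal illustration, not the real implementation. The `Evidence` and `Projection` types, the field names, the `project` function, and status strings such as `partial`, `rolled_back`, `degraded`, and `ok` are all assumptions made for the example. Mapping the account to `active` in the narrow case is also an assumption, since the text only says it may be normalized away from `broken`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Inputs to the rule. All field names and values are hypothetical."""
    batch_status: str           # assumed values: "ok", "partial", "failed", "rolled_back"
    access_closure_ready: bool
    account_count: int
    smoke_model_present: bool
    raw_probe_status: str       # assumed values: "ok", "failed"


@dataclass(frozen=True)
class Projection:
    provider_status: str
    account_status: str
    last_probe_status: str


# Batch outcomes that are never normalized away (assumed names).
STRICT_BATCH_STATES = {"failed", "rolled_back"}


def qualifies_for_normalization(e: Evidence) -> bool:
    """True only for the single narrow scenario from the good example."""
    return (
        e.batch_status == "partial"
        and e.access_closure_ready
        and e.account_count == 1
        and e.smoke_model_present
        and e.raw_probe_status == "failed"
    )


def strict_projection(e: Evidence) -> Projection:
    """Default path: higher-level status follows the raw signals."""
    if e.batch_status in STRICT_BATCH_STATES or not e.access_closure_ready:
        provider = "broken"
    elif e.batch_status == "ok" and e.raw_probe_status == "ok":
        provider = "active"
    else:
        provider = "degraded"
    account = "active" if e.raw_probe_status == "ok" else "broken"
    return Projection(provider, account, e.raw_probe_status)


def project(e: Evidence) -> Projection:
    """Normalize only the narrow case; everything else stays strict."""
    if qualifies_for_normalization(e):
        return Projection("active", "active", "gateway_ready")
    return strict_projection(e)


# Regression checks in the spirit of "Add tests first".
good = Evidence(
    batch_status="partial",
    access_closure_ready=True,
    account_count=1,
    smoke_model_present=True,
    raw_probe_status="failed",
)
assert project(good) == Projection("active", "active", "gateway_ready")

# Guardrails: each strict case must stay visibly unhealthy.
rolled_back = Evidence("rolled_back", True, 1, True, "failed")
multi_account = Evidence("partial", True, 2, True, "failed")
no_smoke_model = Evidence("partial", True, 1, False, "failed")
closure_broken = Evidence("partial", False, 1, True, "failed")
assert project(rolled_back).provider_status == "broken"
assert project(multi_account).account_status == "broken"
assert project(no_smoke_model).provider_status == "degraded"
assert project(closure_broken).provider_status == "broken"
```

Note that `batch_status` is an input here and is never rewritten, which keeps the raw batch detail truthful. Expressing the rule as a single conjunction means every strict case falls through to the default path automatically, so adding a new strict condition only requires one more term in `qualifies_for_normalization`.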