Report · WR-2026-0430-0142 Postern incident — arbiter loss, sentinel-04-c1
At 17:41:58 UTC, arbiter node sentinel-04-c1 stopped reporting heartbeats to the cluster coordinator. The cluster degraded to a two-arbiter quorum, which remained valid; no write path was interrupted. Auto-failover promoted the standby arbiter sentinel-04-d1 at 17:43:11 UTC, restoring full quorum. Total time to restoration: 1m 13s. Total customer-visible impact: none.
- severity
- SEV-3 · degraded redundancy, no impact
- detected
- 17:41:58 utc · automated · paging route postern.arbiter
- contained
- 17:43:11 utc · automated · standby promotion
- operator ack
- 17:43:47 utc · on-call · gate (you)
- customer impact
- nil
- budget consumed
- 0.4% of monthly error budget · 94% remains
Three contributing factors were identified during the post-incident review at 17:46 UTC. None are independently sufficient to cause an outage; their combination explains the 1m 13s detection-to-containment window.
| ID | Severity | Finding | Owner |
|---|---|---|---|
| F-01 | high | Heartbeat interval (5s) is too long for the new arbiter SLO. Reduce to 2s; current value pre-dates the SLO change in RFC-038. | infra · gate |
| F-02 | med | Standby promotion runbook lists manual confirmation step that automation already performs. Drift from runbook-arb-04. | infra · ward |
| F-03 | low | Pager template did not include cluster identifier in subject line. Adds 8–12s to operator orientation. | ops · matins |
The fault was reproduced in stage-iad at 17:55 UTC by isolating the arbiter VM at the hypervisor level. Detection-to-containment timing matched production within 4s; the fix proposed below brought the window to 11s.
$ watchctl drill arbiter-loss --target sentinel-04-c1 --env stage # drill scheduled · 17:55:02 utc # isolating sentinel-04-c1 at hypervisor [ok] quorum maintained · 2/3 arbiters [warn] heartbeat timeout · 17:55:07 (5.2s) [ok] standby promotion · 17:55:13 (6.1s) [ok] cluster nominal · 17:55:13 · total 11s $ watchctl drill verify --runbook runbook-arb-04 [ok] runbook synthesized · automation matches expected steps # exit 0
| ID | Action | Owner | Due |
|---|---|---|---|
| A-01 | Reduce arbiter heartbeat to 2s; ship behind feature flag arb-hb-2s | infra · gate | 02 may |
| A-02 | Reconcile runbook-arb-04 with current automation; remove manual step | infra · ward | 05 may |
| A-03 | Update pager template to include cluster identifier in subject | ops · matins | 03 may |