WATCH

Report · WR-2026-0430-0142 Postern incident — arbiter loss, sentinel-04-c1

Classification internal · ops
Issued 30 apr 2026 · 17:48 utc
Status contained · monitoring
disposition Contained at 17:43:11 UTC. No customer-visible impact. Root cause identified, replacement arbiter promoted. Monitoring continues for 24h. signed on-call · gate
— 01 Summary resolved

At 17:41:58 UTC, arbiter node sentinel-04-c1 stopped reporting heartbeats to the cluster coordinator. The cluster degraded to a two-arbiter quorum, which remained valid; no write path was interrupted. Auto-failover promoted the standby arbiter sentinel-04-d1 at 17:43:11 UTC, restoring full quorum. Total time to restoration: 1m 13s. Total customer-visible impact: none.

severity
SEV-3 · degraded redundancy, no impact
detected
17:41:58 utc · automated · paging route postern.arbiter
contained
17:43:11 utc · automated · standby promotion
operator ack
17:43:47 utc · on-call · gate (you)
customer impact
nil
budget consumed
0.4% of monthly error budget · 94% remains
— 02 Findings 3 to address

Three contributing factors were identified during the post-incident review at 17:46 UTC. None are independently sufficient to cause an outage; their combination explains the 1m 13s detection-to-containment window.

IDSeverityFindingOwner
F-01 high Heartbeat interval (5s) is too long for the new arbiter SLO. Reduce to 2s; current value pre-dates the SLO change in RFC-038. infra · gate
F-02 med Standby promotion runbook lists manual confirmation step that automation already performs. Drift from runbook-arb-04. infra · ward
F-03 low Pager template did not include cluster identifier in subject line. Adds 8–12s to operator orientation. ops · matins
— 03 Reproduction verified

The fault was reproduced in stage-iad at 17:55 UTC by isolating the arbiter VM at the hypervisor level. Detection-to-containment timing matched production within 4s; the fix proposed below brought the window to 11s.

$ watchctl drill arbiter-loss --target sentinel-04-c1 --env stage
# drill scheduled · 17:55:02 utc
# isolating sentinel-04-c1 at hypervisor
[ok]    quorum maintained         · 2/3 arbiters
[warn]  heartbeat timeout         · 17:55:07 (5.2s)
[ok]    standby promotion        · 17:55:13 (6.1s)
[ok]    cluster nominal          · 17:55:13 · total 11s

$ watchctl drill verify --runbook runbook-arb-04
[ok]    runbook synthesized      · automation matches expected steps
# exit 0
— 04 Action items 3 open · 0 closed
IDActionOwnerDue
A-01Reduce arbiter heartbeat to 2s; ship behind feature flag arb-hb-2sinfra · gate02 may
A-02Reconcile runbook-arb-04 with current automation; remove manual stepinfra · ward05 may
A-03Update pager template to include cluster identifier in subjectops · matins03 may
Gate
on-call · primary · 17:43:47 ack
Ward
reviewer · approved 17:48