Update DR runbook: first restore drill passed

Tested restore of:
- PostgreSQL dumps (forgejo: 112 tables, mautrix_slack: 32 tables)
- Forgejo repositories
- User home directories

Also updated known gaps status (sops key, PostgreSQL pin fixed).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dan 2026-01-10 16:18:22 -08:00
parent 5a45993046
commit 9c03d2204d


@@ -510,16 +510,16 @@ Forgejo LFS objects and large repos may take significant time to restore. Consid
 |-----|------|-----|--------|
 | ~~`/home/*` not backed up~~ | ~~User work lost forever~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
 | ~~`/var/lib/acme` not backed up~~ | ~~Let's Encrypt rate limit~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
-| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Open (y8le) |
-| Sops key tied to SSH host key only | Lose host key = lose all secrets | Add offline recovery age key | Open (93q9) |
-| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Open (jboq) |
+| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Deferred (y8le) |
+| ~~Sops key tied to SSH host key only~~ | ~~Lose host key = lose all secrets~~ | ~~Add offline recovery age key~~ | **FIXED** (93q9) |
+| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Deferred (jboq) |
 | `rm -rf` in restore steps | Wrong snapshot = data destroyed | Always restore to staging first | Docs only |
 **Medium Priority:**
 | Gap | Risk | Fix |
 |-----|------|-----|
-| PostgreSQL version not pinned | Version mismatch on restore | Pin `pkgs.postgresql_15` |
+| ~~PostgreSQL version not pinned~~ | ~~Version mismatch on restore~~ | ~~Pin `pkgs.postgresql_15`~~ **FIXED** |
 | Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
 | DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
 | No backup monitoring | Silent failures for days | Add healthchecks.io integration |
@@ -538,14 +538,16 @@ Forgejo LFS objects and large repos may take significant time to restore. Consid
 ## 10. Runbook Maintenance
 - **Owner:** dan
-- **Last updated:** 2026-01-10
-- **Last drill:** Never (TODO: schedule first drill)
-- **Next review:** After first restore drill
+- **Last updated:** 2026-01-11
+- **Last drill:** 2026-01-11 (restore test passed)
+- **Next review:** After NixOS 24.11 upgrade
 ### Change Log
 | Date | Change |
 |------|--------|
+| 2026-01-11 | First restore drill - all tests passed |
+| 2026-01-11 | Fixed /var/backup permissions (postgres couldn't traverse) |
 | 2026-01-10 | Initial draft |
 ---