Update DR runbook: first restore drill passed

Tested restore of:
- PostgreSQL dumps (forgejo: 112 tables, mautrix_slack: 32 tables)
- Forgejo repositories
- User home directories

Also updated known gaps status (sops key, PostgreSQL pin fixed).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dan 2026-01-10 16:18:22 -08:00
parent 5a45993046
commit 9c03d2204d

@@ -510,16 +510,16 @@ Forgejo LFS objects and large repos may take significant time to restore. Consid
 |-----|------|-----|--------|
 | ~~`/home/*` not backed up~~ | ~~User work lost forever~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
 | ~~`/var/lib/acme` not backed up~~ | ~~Let's Encrypt rate limit~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
-| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Open (y8le) |
-| Sops key tied to SSH host key only | Lose host key = lose all secrets | Add offline recovery age key | Open (93q9) |
-| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Open (jboq) |
+| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Deferred (y8le) |
+| ~~Sops key tied to SSH host key only~~ | ~~Lose host key = lose all secrets~~ | ~~Add offline recovery age key~~ | **FIXED** (93q9) |
+| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Deferred (jboq) |
 | `rm -rf` in restore steps | Wrong snapshot = data destroyed | Always restore to staging first | Docs only |

 **Medium Priority:**

 | Gap | Risk | Fix |
 |-----|------|-----|
-| PostgreSQL version not pinned | Version mismatch on restore | Pin `pkgs.postgresql_15` |
+| ~~PostgreSQL version not pinned~~ | ~~Version mismatch on restore~~ | ~~Pin `pkgs.postgresql_15`~~ **FIXED** |
 | Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
 | DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
 | No backup monitoring | Silent failures for days | Add healthchecks.io integration |
@@ -538,14 +538,16 @@ Forgejo LFS objects and large repos may take significant time to restore. Consid
 ## 10. Runbook Maintenance

 - **Owner:** dan
-- **Last updated:** 2026-01-10
-- **Last drill:** Never (TODO: schedule first drill)
-- **Next review:** After first restore drill
+- **Last updated:** 2026-01-11
+- **Last drill:** 2026-01-11 (restore test passed)
+- **Next review:** After NixOS 24.11 upgrade

 ### Change Log

 | Date | Change |
 |------|--------|
+| 2026-01-11 | First restore drill - all tests passed |
+| 2026-01-11 | Fixed /var/backup permissions (postgres couldn't traverse) |
 | 2026-01-10 | Initial draft |

 ---
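
The commit message quotes table counts for the restored PostgreSQL dumps (forgejo: 112, mautrix_slack: 32) but does not say how they were obtained. One way to sanity-check such a count before restoring is to count `CREATE TABLE` statements in the dump itself. The sketch below is a hypothetical helper, not the method used in the drill: it assumes plain-format SQL dumps (not `pg_dump -Fc` archives), and the names `tables_in_dump` and `check_table_count` are invented for illustration.

```python
import re

# Matches the start of a CREATE TABLE statement as written by a plain-format
# dump, e.g. "CREATE TABLE public.repository (". Foreign tables are not
# matched, so treat the result as an approximate count.
CREATE_TABLE = re.compile(
    r"^CREATE\s+(?:UNLOGGED\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\S+)",
    re.IGNORECASE | re.MULTILINE,
)


def tables_in_dump(sql_text: str) -> list[str]:
    """Return the table names declared in a plain-SQL dump, in file order."""
    return [match.group(1) for match in CREATE_TABLE.finditer(sql_text)]


def check_table_count(sql_text: str, expected: int) -> bool:
    """Return True if the dump declares exactly `expected` tables."""
    return len(tables_in_dump(sql_text)) == expected
```

For a drill, the dump file would be read into a string and checked against the expected count, for example `check_table_count(dump_text, 112)` for the forgejo database. A mismatch only signals that the dump deserves a closer look; it does not replace an actual restore to a staging instance.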