Update DR runbook: first restore drill passed
Tested restore of:
- PostgreSQL dumps (forgejo: 112 tables, mautrix_slack: 32 tables)
- Forgejo repositories
- User home directories

Also updated known gaps status (sops key, PostgreSQL pin fixed).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent 5a45993046
commit 9c03d2204d
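The commit message reports per-database table counts (forgejo: 112, mautrix_slack: 32) as evidence that the PostgreSQL dumps restored. A minimal sketch of how such a count could be checked before a restore is shown here. It assumes the dumps are plain-text SQL as produced by `pg_dump` without a custom format, which the commit does not state; the names `table_names` and `check_table_count` are hypothetical helpers, not part of the runbook.

```python
import re

# Assumption: the dump is plain-text SQL, where each table appears as a
# "CREATE TABLE schema.name (" statement starting at the beginning of a line.
_CREATE_TABLE = re.compile(
    r"^CREATE (?:UNLOGGED |FOREIGN )?TABLE\s+(\S+)", re.MULTILINE
)


def table_names(dump_sql: str) -> list[str]:
    """Return the qualified table names declared in a plain-text dump."""
    return _CREATE_TABLE.findall(dump_sql)


def check_table_count(dump_sql: str, expected: int) -> bool:
    """True when the dump declares exactly the expected number of tables."""
    return len(table_names(dump_sql)) == expected


if __name__ == "__main__":
    sample = (
        "CREATE TABLE public.repository (\n    id bigint NOT NULL\n);\n"
        'CREATE TABLE public."user" (\n    id bigint NOT NULL\n);\n'
        "-- CREATE TABLE inside a comment does not start the line, so it is skipped\n"
    )
    print(table_names(sample))
    print(check_table_count(sample, expected=2))
```

For the drill in this commit, the expected values would be 112 for the forgejo dump and 32 for mautrix_slack. Counting tables in the restored database itself is the stronger check; this only confirms the dump file is not truncated or empty.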
@@ -510,16 +510,16 @@ Forgejo LFS objects and large repos may take significant time to restore. Consid
 |-----|------|-----|--------|
 | ~~`/home/*` not backed up~~ | ~~User work lost forever~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
 | ~~`/var/lib/acme` not backed up~~ | ~~Let's Encrypt rate limit~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
-| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Open (y8le) |
-| Sops key tied to SSH host key only | Lose host key = lose all secrets | Add offline recovery age key | Open (93q9) |
-| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Open (jboq) |
+| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Deferred (y8le) |
+| ~~Sops key tied to SSH host key only~~ | ~~Lose host key = lose all secrets~~ | ~~Add offline recovery age key~~ | **FIXED** (93q9) |
+| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Deferred (jboq) |
 | `rm -rf` in restore steps | Wrong snapshot = data destroyed | Always restore to staging first | Docs only |
 
 **Medium Priority:**
 
 | Gap | Risk | Fix |
 |-----|------|-----|
-| PostgreSQL version not pinned | Version mismatch on restore | Pin `pkgs.postgresql_15` |
+| ~~PostgreSQL version not pinned~~ | ~~Version mismatch on restore~~ | ~~Pin `pkgs.postgresql_15`~~ **FIXED** |
 | Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
 | DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
 | No backup monitoring | Silent failures for days | Add healthchecks.io integration |
@@ -538,14 +538,16 @@ Forgejo LFS objects and large repos may take significant time to restore. Consid
 ## 10. Runbook Maintenance
 
 - **Owner:** dan
-- **Last updated:** 2026-01-10
-- **Last drill:** Never (TODO: schedule first drill)
-- **Next review:** After first restore drill
+- **Last updated:** 2026-01-11
+- **Last drill:** 2026-01-11 (restore test passed)
+- **Next review:** After NixOS 24.11 upgrade
 
 ### Change Log
 
 | Date | Change |
 |------|--------|
+| 2026-01-11 | First restore drill - all tests passed |
+| 2026-01-11 | Fixed /var/backup permissions (postgres couldn't traverse) |
 | 2026-01-10 | Initial draft |
 
 ---
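The Runbook Maintenance section records the last drill as a free-text bullet, which was "Never" before this commit and is an ISO date after it. As a rough sketch, the field could be parsed from the runbook file itself (not from this diff) to flag a stale drill. The 90-day threshold and the function names `last_drill_date` and `drill_overdue` are assumptions for illustration; the runbook does not define a drill interval.

```python
import re
from datetime import date
from typing import Optional

# Matches the maintenance bullet as it appears in the runbook Markdown,
# e.g. "- **Last drill:** 2026-01-11 (restore test passed)".
_LAST_DRILL = re.compile(r"^- \*\*Last drill:\*\*\s+(\S+)", re.MULTILINE)


def last_drill_date(runbook_md: str) -> Optional[date]:
    """Return the recorded drill date, or None if absent or not a date."""
    match = _LAST_DRILL.search(runbook_md)
    if match is None:
        return None
    try:
        return date.fromisoformat(match.group(1))
    except ValueError:
        # Covers values such as "Never" that are not ISO dates.
        return None


def drill_overdue(runbook_md: str, today: date, max_age_days: int = 90) -> bool:
    """True when no drill is recorded or the last one is older than the limit.

    max_age_days is a hypothetical policy value, not taken from the runbook.
    """
    drilled = last_drill_date(runbook_md)
    if drilled is None:
        return True
    return (today - drilled).days > max_age_days
```

Treating a missing or unparseable date as overdue is a deliberate choice here: a runbook that says "Never" should fail the check rather than pass silently.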