diff --git a/docs/disaster-recovery-runbook.md b/docs/disaster-recovery-runbook.md
index bf45fd2..ad72aa3 100644
--- a/docs/disaster-recovery-runbook.md
+++ b/docs/disaster-recovery-runbook.md
@@ -510,16 +510,16 @@ Forgejo LFS objects and large repos may take significant time to restore. Consid
 |-----|------|-----|--------|
 | ~~`/home/*` not backed up~~ | ~~User work lost forever~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
 | ~~`/var/lib/acme` not backed up~~ | ~~Let's Encrypt rate limit~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
-| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Open (y8le) |
-| Sops key tied to SSH host key only | Lose host key = lose all secrets | Add offline recovery age key | Open (93q9) |
-| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Open (jboq) |
+| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Deferred (y8le) |
+| ~~Sops key tied to SSH host key only~~ | ~~Lose host key = lose all secrets~~ | ~~Add offline recovery age key~~ | **FIXED** (93q9) |
+| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Deferred (jboq) |
 | `rm -rf` in restore steps | Wrong snapshot = data destroyed | Always restore to staging first | Docs only |
 
 **Medium Priority:**
 
 | Gap | Risk | Fix |
 |-----|------|-----|
-| PostgreSQL version not pinned | Version mismatch on restore | Pin `pkgs.postgresql_15` |
+| ~~PostgreSQL version not pinned~~ | ~~Version mismatch on restore~~ | ~~Pin `pkgs.postgresql_15`~~ **FIXED** |
 | Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
 | DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
 | No backup monitoring | Silent failures for days | Add healthchecks.io integration |
@@ -538,14 +538,16 @@ Forgejo LFS objects and large repos may take significant time to restore. Consid
 ## 10. Runbook Maintenance
 
 - **Owner:** dan
-- **Last updated:** 2026-01-10
-- **Last drill:** Never (TODO: schedule first drill)
-- **Next review:** After first restore drill
+- **Last updated:** 2026-01-11
+- **Last drill:** 2026-01-11 (restore test passed)
+- **Next review:** After NixOS 24.11 upgrade
 
 ### Change Log
 
 | Date | Change |
 |------|--------|
+| 2026-01-11 | First restore drill - all tests passed |
+| 2026-01-11 | Fixed /var/backup permissions (postgres couldn't traverse) |
 | 2026-01-10 | Initial draft |
 
 ---