diff --git a/docs/disaster-recovery-runbook.md b/docs/disaster-recovery-runbook.md
new file mode 100644
index 0000000..f87a768
--- /dev/null
+++ b/docs/disaster-recovery-runbook.md
@@ -0,0 +1,624 @@
+# Disaster Recovery Runbook - ops-jrz1
+
+## Overview
+
+This runbook covers restore procedures for ops-jrz1, a NixOS homelab server running Matrix, Forgejo, and supporting services. Backups are stored in Backblaze B2 using restic.
+
+**Recovery Time Objective (RTO):** 2-6 hours for full restore
+**Recovery Point Objective (RPO):** 24 hours (daily backups at 3 AM UTC)
+
+---
+
+## 1. What's Backed Up
+
+| Component | Path | Backup Method | Restore Priority |
+|-----------|------|---------------|------------------|
+| PostgreSQL (forgejo) | `/var/backup/postgresql/forgejo.sql.gz` | pg_dump via timer | Critical |
+| PostgreSQL (mautrix_slack) | `/var/backup/postgresql/mautrix_slack.sql.gz` | pg_dump via timer | Critical |
+| Forgejo | `/var/lib/forgejo` | restic file backup | Critical |
+| Matrix | `/var/lib/matrix-continuwuity` | restic file backup | High |
+| Maubot | `/var/lib/maubot` | restic file backup | Medium |
+| Slack Bridge | `/var/lib/mautrix-slack` | restic file backup | Medium |
+| User Homes | `/home/*` | **NOT YET BACKED UP** | High |
+
+### What's NOT Backed Up (Reproducible via NixOS)
+
+- `/nix/store` - rebuilt from flake
+- `/etc` - generated from NixOS config
+- Service binaries - installed via Nix
+
+### Critical Items Stored Out-of-Band
+
+| Item | Storage Location | Notes |
+|------|------------------|-------|
+| NixOS flake | GitHub mirror / local laptop | Self-hosted Forgejo may be dead |
+| Restic password | Password manager + printed | Can't restore without it |
+| B2 credentials | Password manager + printed | Can't access backups without it |
+| Age key (sops) | `/etc/ssh/ssh_host_ed25519_key` on server | Derived from SSH host key |
+
+---
+
+## 2. Break Glass - Emergency Quick Reference
+
+**Print this page and store physically.**
+
+### B2 Backup Access
+
+```
+Bucket: ops-jrz1-backup
+Restic Repo: b2:ops-jrz1-backup
+Key ID: [stored in password manager]
+App Key: [stored in password manager]
+Restic Password: [stored in password manager]
+```
+
+### Minimal Restore Commands
+
+```bash
+# Set environment
+export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
+export RESTIC_PASSWORD="[password]"
+export B2_ACCOUNT_ID="[key-id]"
+export B2_ACCOUNT_KEY="[app-key]"
+
+# List snapshots
+restic snapshots
+
+# Restore everything to /tmp/restore
+restic restore latest --target /tmp/restore
+
+# Restore specific path
+restic restore latest --target / --include /var/lib/forgejo
+```
+
+### Service Start Order
+
+```
+1. postgresql
+2. forgejo
+3. mautrix-slack
+4. matrix-continuwuity
+5. maubot
+6. nginx
+```
+
+### Config Repository
+
+```
+Primary: git.clarun.xyz:dan/ops-jrz1.git (may be down)
+Mirror: [ADD GITHUB MIRROR URL]
+Local: ~/proj/ops-jrz1 on admin laptop
+```
+
+---
+
+## 3. Restore Scenarios
+
+### Scenario A: Full Server Loss
+
+**When:** Hardware failure, VPS provider issue, complete disk loss.
+
+**Time estimate:** 2-4 hours
+
+#### Phase 1: Bootstrap NixOS (30-60 min)
+
+1. Provision new VPS or boot NixOS installer on new hardware
+2. Partition disks, mount to `/mnt`
+3. Get NixOS flake from backup location (laptop, GitHub mirror)
+4. **Critical:** Restore SSH host keys if you have them backed up, OR accept new host identity (will need to re-encrypt sops secrets with new age key)
+
+```bash
+# If restoring old host keys (preserves sops decryption):
+mkdir -p /mnt/etc/ssh
+# Copy ssh_host_* files from secure backup
+
+# Install NixOS
+nixos-install --flake /path/to/flake#ops-jrz1 --no-root-passwd
+reboot
+```
+
+#### Phase 2: Restore Secrets (10-20 min)
+
+If you had to generate new SSH host keys:
+
+```bash
+# Get new age public key
+ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub
+
+# Update .sops.yaml with new key on admin machine
+# Re-encrypt secrets: sops updatekeys secrets/secrets.yaml
+# Redeploy
+```
+
+#### Phase 3: Stop Services, Restore Data (30-90 min)
+
+```bash
+# Stop all services that use the data
+systemctl stop forgejo mautrix-slack matrix-continuwuity maubot
+
+# Set restic environment
+export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
+export RESTIC_PASSWORD="..."
+export B2_ACCOUNT_ID="..."
+export B2_ACCOUNT_KEY="..."
+
+# Restore PostgreSQL dumps
+restic restore latest --target /tmp/restore --include /var/backup/postgresql
+
+# Import databases
+systemctl start postgresql
+sudo -u postgres psql -c "DROP DATABASE IF EXISTS forgejo;"
+sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
+gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
+
+sudo -u postgres psql -c "DROP DATABASE IF EXISTS mautrix_slack;"
+sudo -u postgres psql -c "CREATE DATABASE mautrix_slack OWNER mautrix_slack;"
+gunzip -c /tmp/restore/var/backup/postgresql/mautrix_slack.sql.gz | sudo -u postgres psql -d mautrix_slack
+
+# Restore Forgejo data
+rm -rf /var/lib/forgejo/*
+restic restore latest --target / --include /var/lib/forgejo
+chown -R forgejo:forgejo /var/lib/forgejo
+
+# Restore Matrix data
+rm -rf /var/lib/matrix-continuwuity/*
+restic restore latest --target / --include /var/lib/matrix-continuwuity
+chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity
+
+# Restore Maubot data
+rm -rf /var/lib/maubot/*
+restic restore latest --target / --include /var/lib/maubot
+chown -R maubot:maubot /var/lib/maubot
+
+# Restore Slack bridge data
+rm -rf /var/lib/mautrix-slack/*
+restic restore latest --target / --include /var/lib/mautrix-slack
+chown -R mautrix-slack:mautrix-slack /var/lib/mautrix-slack
+
+# Restore user home directories
+restic restore latest --target / --include /home
+# Permissions should be preserved by restic
+```
+
+#### Phase 4: Start Services and Verify (15-30 min)
+
+```bash
+# Start in dependency order
+systemctl start forgejo
+systemctl start mautrix-slack
+systemctl start matrix-continuwuity
+systemctl start maubot
+
+# Check status
+systemctl status forgejo mautrix-slack matrix-continuwuity maubot
+```
+
+See Section 5 for verification checklist.
+
+---
+
+### Scenario B: Single Service Corruption
+
+**When:** One service's data is corrupted but server is otherwise fine.
+
+**Time estimate:** 15-60 min
+
+#### Example: Matrix RocksDB Corruption
+
+```bash
+# Stop the service
+systemctl stop matrix-continuwuity
+
+# Find available snapshots
+restic snapshots
+
+# Restore to temp location first (safer)
+restic restore latest --target /tmp/restore --include /var/lib/matrix-continuwuity
+
+# Verify it looks reasonable
+ls -la /tmp/restore/var/lib/matrix-continuwuity/
+
+# Replace corrupted data
+rm -rf /var/lib/matrix-continuwuity/*
+cp -a /tmp/restore/var/lib/matrix-continuwuity/* /var/lib/matrix-continuwuity/
+chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity
+
+# Start and verify
+systemctl start matrix-continuwuity
+journalctl -u matrix-continuwuity -f
+```
+
+#### Example: PostgreSQL Database Corruption
+
+```bash
+# Stop dependent services
+systemctl stop forgejo mautrix-slack
+
+# Restore dump
+restic restore latest --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
+
+# Drop and recreate
+sudo -u postgres psql -c "DROP DATABASE forgejo;"
+sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
+gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
+
+# Restart
+systemctl start forgejo
+```
+
+---
+
+### Scenario C: User Deleted Their Work
+
+**When:** Dev accidentally `rm -rf`'d their project or home directory.
+
+**Time estimate:** 5-20 min
+
+```bash
+# Find what snapshots are available
+restic snapshots
+
+# Browse a snapshot to find the data
+restic ls latest /home/USERNAME/
+
+# Restore specific directory to temp location
+restic restore latest --target /tmp/restore --include /home/USERNAME/project-name
+
+# Let user copy what they need
+cp -a /tmp/restore/home/USERNAME/project-name /home/USERNAME/
+chown -R USERNAME:users /home/USERNAME/project-name
+
+# Or restore entire home directory
+restic restore latest --target / --include /home/USERNAME
+chown -R USERNAME:users /home/USERNAME
+```
+
+#### Point-in-Time Restore
+
+```bash
+# List snapshots with dates
+restic snapshots
+
+# Restore from specific snapshot (not latest)
+restic restore abc123def --target /tmp/restore --include /home/USERNAME
+```
+
+---
+
+### Scenario D: Single Forgejo Repo Deleted
+
+**When:** A git repository was deleted from Forgejo.
+
+**Time estimate:** 10-30 min
+
+**Challenge:** Forgejo database and filesystem must be in sync.
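
One way to see whether the database and the filesystem disagree is to diff two lists of `owner/repo` names. The sketch below is illustrative only: `repo_sync_report` and both input files are hypothetical, and how the database-side list gets exported is left open because the runbook does not describe the Forgejo schema.

```shell
# Sketch only: report where two "owner/repo" lists disagree.
# Both input files are hypothetical exports, one name per line:
#   $1 = repos the Forgejo database knows about
#   $2 = repos present under /var/lib/forgejo/repositories
# The function body runs in a subshell so its variables do not leak.
repo_sync_report() (
    db_sorted="$(mktemp)"
    disk_sorted="$(mktemp)"
    sort -u "$1" > "$db_sorted"
    sort -u "$2" > "$disk_sorted"

    # Known to the DB but absent on disk: git data needs restoring.
    comm -23 "$db_sorted" "$disk_sorted" | sed 's/^/missing-on-disk: /'
    # Present on disk but unknown to the DB: DB record is gone.
    comm -13 "$db_sorted" "$disk_sorted" | sed 's/^/missing-in-db: /'

    rm -f "$db_sorted" "$disk_sorted"
)
```

A `missing-on-disk` line suggests the DB record survived and only the git data needs restoring; a `missing-in-db` line suggests the record was deleted too, so database and filesystem should be restored from the same snapshot.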
+
+#### Option 1: Restore Just the Git Data (if DB record exists)
+
+```bash
+# Find repo path - usually /var/lib/forgejo/repositories/USERNAME/REPO.git
+restic ls latest /var/lib/forgejo/repositories/
+
+# Restore repo
+restic restore latest --target /tmp/restore --include /var/lib/forgejo/repositories/USERNAME/REPO.git
+
+# Copy into place
+cp -a /tmp/restore/var/lib/forgejo/repositories/USERNAME/REPO.git /var/lib/forgejo/repositories/USERNAME/
+chown -R forgejo:forgejo /var/lib/forgejo/repositories/USERNAME/REPO.git
+
+# Regenerate hooks
+sudo -u forgejo forgejo admin regenerate hooks
+```
+
+#### Option 2: Full Forgejo Restore (if DB record was also deleted)
+
+Need to restore both database and filesystem to same point in time:
+
+```bash
+systemctl stop forgejo
+
+# Restore database
+restic restore SNAPSHOT_ID --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
+sudo -u postgres psql -c "DROP DATABASE forgejo;"
+sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
+gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
+
+# Restore filesystem
+rm -rf /var/lib/forgejo/*
+restic restore SNAPSHOT_ID --target / --include /var/lib/forgejo
+chown -R forgejo:forgejo /var/lib/forgejo
+
+systemctl start forgejo
+```
+
+---
+
+## 4. Restore Commands Reference
+
+### Environment Setup
+
+```bash
+export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
+export RESTIC_PASSWORD_FILE="/run/secrets/restic/password" # if on server
+# OR
+export RESTIC_PASSWORD="your-password-here" # if restoring from scratch
+export B2_ACCOUNT_ID="your-key-id"
+export B2_ACCOUNT_KEY="your-app-key"
+```
+
+### Common Operations
+
+```bash
+# List all snapshots
+restic snapshots
+
+# List snapshots with tags
+restic snapshots --tag ops-jrz1
+
+# Browse snapshot contents
+restic ls latest
+restic ls latest /var/lib/forgejo
+
+# Restore everything
+restic restore latest --target /
+
+# Restore specific path
+restic restore latest --target / --include /var/lib/forgejo
+
+# Restore to different location
+restic restore latest --target /tmp/restore --include /home/dan
+
+# Restore specific snapshot (not latest)
+restic restore abc123de --target /tmp/restore
+
+# Mount backup as filesystem (for browsing)
+mkdir /mnt/restic
+restic mount /mnt/restic
+# Browse /mnt/restic/snapshots/latest/...
+# Ctrl+C to unmount
+
+# Check backup integrity
+restic check
+restic check --read-data # slower, verifies all data
+```
+
+---
+
+## 5. Verification Checklist
+
+### After Full Restore
+
+#### Infrastructure
+- [ ] SSH access works
+- [ ] DNS resolves correctly
+- [ ] HTTPS certificates valid (may need `systemctl start acme-clarun.xyz`)
+
+#### PostgreSQL
+- [ ] `systemctl status postgresql` - active
+- [ ] `sudo -u postgres psql -c "\l"` - lists forgejo, mautrix_slack databases
+- [ ] No errors in `journalctl -u postgresql`
+
+#### Forgejo
+- [ ] `systemctl status forgejo` - active
+- [ ] Web UI loads at https://git.clarun.xyz
+- [ ] Can log in
+- [ ] Repositories visible and browsable
+- [ ] Can clone a repo: `git clone git@git.clarun.xyz:org/repo.git`
+- [ ] Can push to a repo
+
+#### Matrix
+- [ ] `systemctl status matrix-continuwuity` - active
+- [ ] No RocksDB errors in `journalctl -u matrix-continuwuity`
+- [ ] Can log in with Matrix client
+- [ ] Can send/receive messages
+- [ ] Old messages visible
+
+#### Maubot
+- [ ] `systemctl status maubot` - active
+- [ ] Web UI accessible via SSH tunnel (port 29316)
+- [ ] Bots responding
+
+#### Slack Bridge
+- [ ] `systemctl status mautrix-slack` - active
+- [ ] Bridge connected (check logs)
+- [ ] Messages flowing both directions
+
+#### User Home Directories
+- [ ] Users can SSH in
+- [ ] User files present
+- [ ] Permissions correct
+
+### After Partial Restore
+
+- [ ] Restored service starts without errors
+- [ ] Basic functionality works
+- [ ] No data from "future" (if restoring older snapshot)
+
+---
+
+## 6. Time Estimates
+
+| Scenario | Download Size | Estimated Time |
+|----------|--------------|----------------|
+| Full server restore | ~15 GB | 2-4 hours |
+| Single service (Matrix) | ~2 GB | 15-45 min |
+| Single service (Forgejo) | ~5 GB | 20-60 min |
+| Single user home | ~1 GB | 5-15 min |
+| Single git repo | ~100 MB | 5-10 min |
+| PostgreSQL DB only | ~50 MB | 10-20 min |
+
+*Times assume 100 Mbps download from B2. Actual times depend on network speed and data size.*
+
+---
+
+## 7. Quarterly Restore Drill
+
+Schedule: First Sunday of each quarter
+
+### Procedure
+
+1. Spin up test VM (or use local NixOS VM)
+2. Attempt full restore procedure
+3. Run verification checklist
+4. Document:
+   - Actual time taken
+   - Any issues encountered
+   - Runbook updates needed
+5. Destroy test VM
+
+### Success Criteria
+
+- [ ] NixOS boots with config
+- [ ] PostgreSQL databases restore and pass basic queries
+- [ ] Forgejo UI loads and repos are accessible
+- [ ] Matrix client can connect and see history
+- [ ] At least one user home directory restored with correct permissions
+
+---
+
+## 8. Known Limitations
+
+### RocksDB Consistency
+
+Matrix homeserver uses RocksDB which is sensitive to incomplete backups. Current backup runs while service is active. For guaranteed consistency, should:
+- Stop service before backup, OR
+- Use RocksDB checkpoint feature, OR
+- Use filesystem snapshots (ZFS/btrfs)
+
+**Current risk:** Low probability of corrupted Matrix restore. Mitigation: verify RocksDB opens without errors after restore.
+
+### Point-in-Time Recovery
+
+Restic provides daily snapshots, not continuous backup. Cannot restore to arbitrary point in time. For PostgreSQL PITR, would need WAL archiving (not currently configured).
+
+### User Home Directory Backup
+
+**TODO:** User home directories (`/home/*`) are not currently included in backup. Need to add to backup-b2.nix.
+
+### Large File Handling
+
+Forgejo LFS objects and large repos may take significant time to restore. Consider whether to exclude LFS from regular backups and handle separately.
+
+---
+
+## 9. Known Gaps and TODOs
+
+**Critical - Must Fix Before Relying on This Runbook:**
+
+| Gap | Risk | Fix |
+|-----|------|-----|
+| `/home/*` not backed up | User work lost forever | Add to backup-b2.nix paths |
+| `/var/lib/acme` not backed up | Let's Encrypt rate limit (7 days no HTTPS) | Add to backup-b2.nix paths |
+| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook |
+| Sops key tied to SSH host key only | Lose host key = lose all secrets | Add offline recovery age key |
+| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub |
+| `rm -rf` in restore steps | Wrong snapshot = data destroyed | Always restore to staging first |
+
+**Medium Priority:**
+
+| Gap | Risk | Fix |
+|-----|------|-----|
+| PostgreSQL version not pinned | Version mismatch on restore | Pin `pkgs.postgresql_15` |
+| Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
+| DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
+| No backup monitoring | Silent failures for days | Add healthchecks.io integration |
+| Postgres roles/extensions | Restore may fail | Include `pg_dumpall --globals-only` |
+
+**Nice to Have:**
+
+| Gap | Improvement |
+|-----|-------------|
+| Manual restore steps | Create restore script |
+| No immutable backups | Enable B2 Object Lock |
+| No second backup location | Replicate to second provider |
+
+---
+
+## 10. Runbook Maintenance
+
+- **Owner:** dan
+- **Last updated:** 2026-01-10
+- **Last drill:** Never (TODO: schedule first drill)
+- **Next review:** After first restore drill
+
+### Change Log
+
+| Date | Change |
+|------|--------|
+| 2026-01-10 | Initial draft |
+
+---
+
+## Appendix A: File Paths Reference
+
+```
+# PostgreSQL dumps (created by services.postgresqlBackup)
+/var/backup/postgresql/forgejo.sql.gz
+/var/backup/postgresql/mautrix_slack.sql.gz
+
+# Forgejo
+/var/lib/forgejo/
+├── conf/
+├── data/
+│   ├── avatars/
+│   ├── attachments/
+│   └── lfs/
+├── repositories/
+│   └── USERNAME/
+│       └── REPO.git/
+└── gitea.db (if using SQLite, but we use PostgreSQL)
+
+# Matrix (Conduwuit with RocksDB)
+/var/lib/matrix-continuwuity/
+├── db/ # RocksDB database
+└── media/ # Uploaded media
+
+# Maubot
+/var/lib/maubot/
+├── plugins/
+├── trash/
+└── maubot.db # SQLite database
+
+# Slack Bridge
+/var/lib/mautrix-slack/
+└── registration.yaml
+
+# User homes
+/home/USERNAME/
+├── .ssh/
+├── .config/
+├── .npm-global/
+└── [user projects]
+
+# Secrets (runtime, not backed up - regenerated from sops)
+/run/secrets/
+├── matrix-registration-token
+├── maubot-admin-password
+├── restic/password
+└── ...
+```
+
+## Appendix B: Service Dependencies
+
+```
+                   ┌─────────────┐
+                   │ postgresql  │
+                   └──────┬──────┘
+                          │
+          ┌───────────────┼───────────────┐
+          │               │               │
+          ▼               ▼               ▼
+     ┌──────────┐  ┌─────────────┐   ┌──────────┐
+     │ forgejo  │  │mautrix-slack│   │  maubot  │
+     └──────────┘  └──────┬──────┘   └────┬─────┘
+                          │               │
+                          ▼               ▼
+                     ┌─────────────────────────┐
+                     │   matrix-continuwuity   │
+                     └─────────────────────────┘
+                                  │
+                                  ▼
+                           ┌─────────────┐
+                           │    nginx    │
+                           └─────────────┘
+```
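
The start order from Section 2 can be sketched as a loop that stops at the first unit that fails to come up. This is a minimal sketch, not part of the runbook's tooling: `start_in_order` and the `SYSTEMCTL` override variable are hypothetical names, the latter added so the loop can be dry-run against a stub.

```shell
# Sketch only: start units in the given order, stopping at the first failure.
# SYSTEMCTL is a hypothetical override hook; it defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

start_in_order() {
    for unit in "$@"; do
        echo "starting $unit"
        if ! "$SYSTEMCTL" start "$unit"; then
            echo "failed to start $unit, stopping here" >&2
            return 1
        fi
        # is-active exits non-zero when the unit did not stay up.
        if ! "$SYSTEMCTL" is-active --quiet "$unit"; then
            echo "$unit is not active, stopping here" >&2
            return 1
        fi
    done
}
```

Called as `start_in_order postgresql forgejo mautrix-slack matrix-continuwuity maubot nginx`, it follows the Section 2 list; stopping early keeps a failed PostgreSQL start from cascading into confusing errors in the services that depend on it.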