# Disaster Recovery Runbook - ops-jrz1

## Overview

This runbook covers restore procedures for ops-jrz1, a NixOS homelab server running Matrix, Forgejo, and supporting services. Backups are stored in Backblaze B2 using restic.

**Recovery Time Objective (RTO):** 2-6 hours for full restore

**Recovery Point Objective (RPO):** 24 hours (daily backups at 3 AM UTC)

---

## 1. What's Backed Up

| Component | Path | Backup Method | Restore Priority |
|-----------|------|---------------|------------------|
| PostgreSQL (forgejo) | `/var/backup/postgresql/forgejo.sql.gz` | pg_dump via timer | Critical |
| PostgreSQL (mautrix_slack) | `/var/backup/postgresql/mautrix_slack.sql.gz` | pg_dump via timer | Critical |
| Forgejo | `/var/lib/forgejo` | restic file backup | Critical |
| Matrix | `/var/lib/matrix-continuwuity` | restic file backup | High |
| Maubot | `/var/lib/maubot` | restic file backup | Medium |
| Slack Bridge | `/var/lib/mautrix-slack` | restic file backup | Medium |
| User Homes | `/home/*` | restic file backup | High |
| ACME Certs | `/var/lib/acme` | restic file backup | Medium |

### What's NOT Backed Up (Reproducible via NixOS)

- `/nix/store` - rebuilt from flake
- `/etc` - generated from NixOS config
- Service binaries - installed via Nix

### Critical Items Stored Out-of-Band

| Item | Storage Location | Notes |
|------|------------------|-------|
| NixOS flake | GitHub mirror / local laptop | Self-hosted Forgejo may be dead |
| Restic password | Password manager + printed | Can't restore without it |
| B2 credentials | Password manager + printed | Can't access backups without it |
| Age key (sops) | `/etc/ssh/ssh_host_ed25519_key` on server | Derived from SSH host key |

---

## 2. Break Glass - Emergency Quick Reference

**Print this page and store physically.**

### B2 Backup Access

```
Bucket: ops-jrz1-backup
Restic Repo: b2:ops-jrz1-backup
Key ID: [stored in password manager]
App Key: [stored in password manager]
Restic Password: [stored in password manager]
```

### Minimal Restore Commands

```bash
# Set environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="[password]"
export B2_ACCOUNT_ID="[key-id]"
export B2_ACCOUNT_KEY="[app-key]"

# List snapshots
restic snapshots

# Restore everything to /tmp/restore
restic restore latest --target /tmp/restore

# Restore specific path
restic restore latest --target / --include /var/lib/forgejo
```

### Service Start Order

```
1. postgresql
2. forgejo
3. mautrix-slack
4. matrix-continuwuity
5. maubot
6. nginx
```

### Config Repository

```
Primary: git.clarun.xyz:dan/ops-jrz1.git (may be down)
Mirror: [ADD GITHUB MIRROR URL]
Local: ~/proj/ops-jrz1 on admin laptop
```

---

## 3. Restore Scenarios

### Scenario A: Full Server Loss

**When:** Hardware failure, VPS provider issue, complete disk loss.

**Time estimate:** 2-4 hours

#### Phase 1: Bootstrap NixOS (30-60 min)

1. Provision new VPS or boot NixOS installer on new hardware
2. Partition disks, mount to `/mnt`
3. Get NixOS flake from backup location (laptop, GitHub mirror)
4. **Critical:** Restore SSH host keys if you have them backed up, OR accept a new host identity (you will need to re-encrypt sops secrets with the new age key)

```bash
# If restoring old host keys (preserves sops decryption):
mkdir -p /mnt/etc/ssh
# Copy ssh_host_* files from secure backup

# Install NixOS
nixos-install --flake /path/to/flake#ops-jrz1 --no-root-passwd
reboot
```

#### Phase 2: Restore Secrets (10-20 min)

If you had to generate new SSH host keys:

```bash
# Get new age public key
ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub

# Update .sops.yaml with new key on admin machine
# Re-encrypt secrets:
sops updatekeys secrets/secrets.yaml

# Redeploy
```

#### Phase 3: Stop Services, Restore Data (30-90 min)

```bash
# Stop all services that use the data
systemctl stop forgejo mautrix-slack matrix-continuwuity maubot

# Set restic environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="..."
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."

# Restore PostgreSQL dumps
restic restore latest --target /tmp/restore --include /var/backup/postgresql

# Import databases
systemctl start postgresql
sudo -u postgres psql -c "DROP DATABASE IF EXISTS forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

sudo -u postgres psql -c "DROP DATABASE IF EXISTS mautrix_slack;"
sudo -u postgres psql -c "CREATE DATABASE mautrix_slack OWNER mautrix_slack;"
gunzip -c /tmp/restore/var/backup/postgresql/mautrix_slack.sql.gz | sudo -u postgres psql -d mautrix_slack

# Restore Forgejo data
rm -rf /var/lib/forgejo/*
restic restore latest --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo

# Restore Matrix data
rm -rf /var/lib/matrix-continuwuity/*
restic restore latest --target / --include /var/lib/matrix-continuwuity
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity

# Restore Maubot data
rm -rf /var/lib/maubot/*
restic restore latest --target / --include /var/lib/maubot
chown -R maubot:maubot /var/lib/maubot

# Restore Slack bridge data
rm -rf /var/lib/mautrix-slack/*
restic restore latest --target / --include /var/lib/mautrix-slack
chown -R mautrix-slack:mautrix-slack /var/lib/mautrix-slack

# Restore user home directories
restic restore latest --target / --include /home
# Permissions should be preserved by restic
```

#### Phase 4: Start Services and Verify (15-30 min)

```bash
# Start in dependency order
systemctl start forgejo
systemctl start mautrix-slack
systemctl start matrix-continuwuity
systemctl start maubot

# Check status
systemctl status forgejo mautrix-slack matrix-continuwuity maubot
```

See Section 5 for the verification checklist.

---

### Scenario B: Single Service Corruption

**When:** One service's data is corrupted but the server is otherwise fine.

**Time estimate:** 15-60 min

#### Example: Matrix RocksDB Corruption

```bash
# Stop the service
systemctl stop matrix-continuwuity

# Find available snapshots
restic snapshots

# Restore to temp location first (safer)
restic restore latest --target /tmp/restore --include /var/lib/matrix-continuwuity

# Verify it looks reasonable
ls -la /tmp/restore/var/lib/matrix-continuwuity/

# Replace corrupted data
rm -rf /var/lib/matrix-continuwuity/*
cp -a /tmp/restore/var/lib/matrix-continuwuity/* /var/lib/matrix-continuwuity/
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity

# Start and verify
systemctl start matrix-continuwuity
journalctl -u matrix-continuwuity -f
```

#### Example: PostgreSQL Database Corruption

```bash
# Stop dependent services
systemctl stop forgejo mautrix-slack

# Restore dump
restic restore latest --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz

# Drop and recreate
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

# Restart every service that was stopped above
systemctl start forgejo mautrix-slack
```

---

### Scenario C: User Deleted Their Work

**When:** Dev accidentally `rm -rf`'d their project or home directory.

**Time estimate:** 5-20 min

```bash
# Find what snapshots are available
restic snapshots

# Browse a snapshot to find the data
restic ls latest /home/USERNAME/

# Restore specific directory to temp location
restic restore latest --target /tmp/restore --include /home/USERNAME/project-name

# Let user copy what they need
cp -a /tmp/restore/home/USERNAME/project-name /home/USERNAME/
chown -R USERNAME:users /home/USERNAME/project-name

# Or restore entire home directory
restic restore latest --target / --include /home/USERNAME
chown -R USERNAME:users /home/USERNAME
```

#### Point-in-Time Restore

```bash
# List snapshots with dates
restic snapshots

# Restore from specific snapshot (not latest); replace abc123def with an ID from the list
restic restore abc123def --target /tmp/restore --include /home/USERNAME
```

---

### Scenario D: Single Forgejo Repo Deleted

**When:** A git repository was deleted from Forgejo.

**Time estimate:** 10-30 min

**Challenge:** Forgejo database and filesystem must be in sync.
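
Which option applies depends on whether Forgejo still has a database row for the deleted repository. The following is a minimal sketch of that decision, not a tested procedure. `pick_restore_option` and `repo_row_count` are hypothetical helper names, and the query assumes the Gitea-derived schema (a `repository` table with `owner_name` and `lower_name` columns), which should be confirmed against the running Forgejo version before use.

```bash
# Hypothetical helper: map the number of matching DB rows to the option to follow.
pick_restore_option() {
  local row_count="$1"
  if [ "$row_count" -gt 0 ]; then
    echo "option1"   # DB record exists: restore only the git data
  else
    echo "option2"   # DB record is gone: restore database and filesystem together
  fi
}

# Hypothetical wrapper: count rows for OWNER/REPO.
# ASSUMPTION: table and column names follow the Gitea-derived schema.
repo_row_count() {
  local owner="$1" repo="$2"
  sudo -u postgres psql -d forgejo -tA -v owner="$owner" -v repo="$repo" <<'SQL'
SELECT count(*) FROM repository
WHERE lower(owner_name) = lower(:'owner') AND lower_name = lower(:'repo');
SQL
}
```

On the server this could be used as `pick_restore_option "$(repo_row_count USERNAME REPO)"`. If the result is `option2`, both the database dump and `/var/lib/forgejo` need to come from the same snapshot.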
#### Option 1: Restore Just the Git Data (if DB record exists)

```bash
# Find repo path - usually /var/lib/forgejo/repositories/USERNAME/REPO.git
restic ls latest /var/lib/forgejo/repositories/

# Restore repo
restic restore latest --target /tmp/restore --include /var/lib/forgejo/repositories/USERNAME/REPO.git

# Copy into place
cp -a /tmp/restore/var/lib/forgejo/repositories/USERNAME/REPO.git /var/lib/forgejo/repositories/USERNAME/
chown -R forgejo:forgejo /var/lib/forgejo/repositories/USERNAME/REPO.git

# Regenerate hooks
sudo -u forgejo forgejo admin regenerate hooks
```

#### Option 2: Full Forgejo Restore (if DB record was also deleted)

Restore both the database and the filesystem from the same snapshot so they stay in sync:

```bash
systemctl stop forgejo

# Restore database
restic restore SNAPSHOT_ID --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

# Restore filesystem
rm -rf /var/lib/forgejo/*
restic restore SNAPSHOT_ID --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo

systemctl start forgejo
```

---

## 4. Restore Commands Reference

### Environment Setup

```bash
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"

# Password: set only one of the two lines below
export RESTIC_PASSWORD_FILE="/run/secrets/restic/password"  # if on server
# OR
export RESTIC_PASSWORD="your-password-here"  # if restoring from scratch

export B2_ACCOUNT_ID="your-key-id"
export B2_ACCOUNT_KEY="your-app-key"
```

### Common Operations

```bash
# List all snapshots
restic snapshots

# List snapshots with tags
restic snapshots --tag ops-jrz1

# Browse snapshot contents
restic ls latest
restic ls latest /var/lib/forgejo

# Restore everything
restic restore latest --target /

# Restore specific path
restic restore latest --target / --include /var/lib/forgejo

# Restore to different location
restic restore latest --target /tmp/restore --include /home/dan

# Restore specific snapshot (not latest)
restic restore abc123de --target /tmp/restore

# Mount backup as filesystem (for browsing)
mkdir -p /mnt/restic
restic mount /mnt/restic
# Browse /mnt/restic/snapshots/latest/...
# Ctrl+C to unmount

# Check backup integrity
restic check
restic check --read-data  # slower, verifies all data
```

---

## 5. Verification Checklist

### After Full Restore

#### Infrastructure

- [ ] SSH access works
- [ ] DNS resolves correctly
- [ ] HTTPS certificates valid (may need `systemctl start acme-clarun.xyz`)

#### PostgreSQL

- [ ] `systemctl status postgresql` - active
- [ ] `sudo -u postgres psql -c "\l"` - lists forgejo, mautrix_slack databases
- [ ] No errors in `journalctl -u postgresql`

#### Forgejo

- [ ] `systemctl status forgejo` - active
- [ ] Web UI loads at https://git.clarun.xyz
- [ ] Can log in
- [ ] Repositories visible and browsable
- [ ] Can clone a repo: `git clone git@git.clarun.xyz:org/repo.git`
- [ ] Can push to a repo

#### Matrix

- [ ] `systemctl status matrix-continuwuity` - active
- [ ] No RocksDB errors in `journalctl -u matrix-continuwuity`
- [ ] Can log in with Matrix client
- [ ] Can send/receive messages
- [ ] Old messages visible

#### Maubot

- [ ] `systemctl status maubot` - active
- [ ] Web UI accessible via SSH tunnel (port 29316)
- [ ] Bots responding

#### Slack Bridge

- [ ] `systemctl status mautrix-slack` - active
- [ ] Bridge connected (check logs)
- [ ] Messages flowing both directions

#### User Home Directories

- [ ] Users can SSH in
- [ ] User files present
- [ ] Permissions correct

### After Partial Restore

- [ ] Restored service starts without errors
- [ ] Basic functionality works
- [ ] No data from "future" (if restoring older snapshot)

---

## 6. Time Estimates

| Scenario | Download Size | Estimated Time |
|----------|--------------|----------------|
| Full server restore | ~15 GB | 2-4 hours |
| Single service (Matrix) | ~2 GB | 15-45 min |
| Single service (Forgejo) | ~5 GB | 20-60 min |
| Single user home | ~1 GB | 5-15 min |
| Single git repo | ~100 MB | 5-10 min |
| PostgreSQL DB only | ~50 MB | 10-20 min |

*Times assume 100 Mbps download from B2. Actual times depend on network speed and data size.*

---

## 7. Quarterly Restore Drill

Schedule: First Sunday of each quarter

### Procedure

1. Spin up test VM (or use local NixOS VM)
2. Attempt full restore procedure
3. Run verification checklist
4. Document:
   - Actual time taken
   - Any issues encountered
   - Runbook updates needed
5. Destroy test VM

### Success Criteria

- [ ] NixOS boots with config
- [ ] PostgreSQL databases restore and pass basic queries
- [ ] Forgejo UI loads and repos are accessible
- [ ] Matrix client can connect and see history
- [ ] At least one user home directory restored with correct permissions

---

## 8. Known Limitations

### RocksDB Consistency

The Matrix homeserver uses RocksDB, which is sensitive to incomplete backups. The current backup runs while the service is active. For guaranteed consistency, the backup should do one of the following:

- Stop service before backup, OR
- Use RocksDB checkpoint feature, OR
- Use filesystem snapshots (ZFS/btrfs)

**Current risk:** Low probability of corrupted Matrix restore. Mitigation: verify RocksDB opens without errors after restore.

### Point-in-Time Recovery

Restic provides daily snapshots, not continuous backup, so you cannot restore to an arbitrary point in time. PostgreSQL PITR would need WAL archiving (not currently configured).

### User Home Directory Backup

User home directories (`/home/*`) were previously missing from the backup. They are now included (see Section 1 and the **FIXED** entry in Section 9).

### Large File Handling

Forgejo LFS objects and large repos may take significant time to restore. Consider whether to exclude LFS from regular backups and handle it separately.

---

## 9. Known Gaps and TODOs

**Critical - Must Fix Before Relying on This Runbook:**

| Gap | Risk | Fix | Status |
|-----|------|-----|--------|
| ~~`/home/*` not backed up~~ | ~~User work lost forever~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
| ~~`/var/lib/acme` not backed up~~ | ~~Let's Encrypt rate limit~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Deferred (y8le) |
| ~~Sops key tied to SSH host key only~~ | ~~Lose host key = lose all secrets~~ | ~~Add offline recovery age key~~ | **FIXED** (93q9) |
| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Deferred (jboq) |
| `rm -rf` in restore steps | Wrong snapshot = data destroyed | Always restore to staging first | Docs only |

**Medium Priority:**

| Gap | Risk | Fix |
|-----|------|-----|
| ~~PostgreSQL version not pinned~~ | ~~Version mismatch on restore~~ | ~~Pin `pkgs.postgresql_15`~~ **FIXED** |
| Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
| DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
| No backup monitoring | Silent failures for days | Add healthchecks.io integration |
| Postgres roles/extensions | Restore may fail | Include `pg_dumpall --globals-only` |

**Nice to Have:**

| Gap | Improvement |
|-----|-------------|
| Manual restore steps | Create restore script |
| No immutable backups | Enable B2 Object Lock |
| No second backup location | Replicate to second provider |

---

## 10. Runbook Maintenance

- **Owner:** dan
- **Last updated:** 2026-01-11
- **Last drill:** 2026-01-11 (restore test passed)
- **Next review:** After NixOS 24.11 upgrade

### Change Log

| Date | Change |
|------|--------|
| 2026-01-11 | First restore drill - all tests passed |
| 2026-01-11 | Fixed /var/backup permissions (postgres couldn't traverse) |
| 2026-01-10 | Initial draft |

---

## Appendix A: File Paths Reference

```
# PostgreSQL dumps (created by services.postgresqlBackup)
/var/backup/postgresql/forgejo.sql.gz
/var/backup/postgresql/mautrix_slack.sql.gz

# Forgejo
/var/lib/forgejo/
├── conf/
├── data/
│   ├── avatars/
│   ├── attachments/
│   └── lfs/
├── repositories/
│   └── USERNAME/
│       └── REPO.git/
└── gitea.db          (if using SQLite, but we use PostgreSQL)

# Matrix (continuwuity with RocksDB)
/var/lib/matrix-continuwuity/
├── db/               # RocksDB database
└── media/            # Uploaded media

# Maubot
/var/lib/maubot/
├── plugins/
├── trash/
└── maubot.db         # SQLite database

# Slack Bridge
/var/lib/mautrix-slack/
└── registration.yaml

# User homes
/home/USERNAME/
├── .ssh/
├── .config/
├── .npm-global/
└── [user projects]

# Secrets (runtime, not backed up - regenerated from sops)
/run/secrets/
├── matrix-registration-token
├── maubot-admin-password
├── restic/password
└── ...
```

## Appendix B: Service Dependencies

```
              ┌─────────────┐
              │ postgresql  │
              └──────┬──────┘
                     │
     ┌───────────────┼───────────────┐
     │               │               │
     ▼               ▼               ▼
┌──────────┐  ┌─────────────┐   ┌──────────┐
│ forgejo  │  │mautrix-slack│   │  maubot  │
└──────────┘  └──────┬──────┘   └────┬─────┘
                     │               │
                     ▼               ▼
                ┌─────────────────────────┐
                │   matrix-continuwuity   │
                └─────────────────────────┘
                             │
                             ▼
                      ┌─────────────┐
                      │    nginx    │
                      └─────────────┘
```
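
The dependency chain above, together with the Service Start Order in Section 2, could be wrapped in a small helper that starts units one at a time and stops at the first failure. This is only a sketch: `start_in_order` and the `DRY_RUN` toggle are hypothetical names that do not exist in the ops-jrz1 config, and the unit names are taken from the list in Section 2.

```bash
# Units in the order given under "Service Start Order" (Section 2).
SERVICES="postgresql forgejo mautrix-slack matrix-continuwuity maubot nginx"

# Hypothetical helper: start each unit in order, abort on the first failure.
# With DRY_RUN=1 it only prints the commands it would run.
start_in_order() {
  local svc
  for svc in $SERVICES; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "systemctl start $svc"
      continue
    fi
    systemctl start "$svc" || { echo "failed to start $svc" >&2; return 1; }
    systemctl is-active --quiet "$svc" || { echo "$svc is not active" >&2; return 1; }
  done
}
```

Running `DRY_RUN=1 start_in_order` first prints the planned commands without touching any unit. A non-zero exit status from a real run points at the first service to inspect with `journalctl -u`.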