ops-jrz1/docs/disaster-recovery-runbook.md
Dan 9c03d2204d Update DR runbook: first restore drill passed
Tested restore of:
- PostgreSQL dumps (forgejo: 112 tables, mautrix_slack: 32 tables)
- Forgejo repositories
- User home directories

Also updated known gaps status (sops key, PostgreSQL pin fixed).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 16:18:22 -08:00


# Disaster Recovery Runbook - ops-jrz1
## Overview
This runbook covers restore procedures for ops-jrz1, a NixOS homelab server running Matrix, Forgejo, and supporting services. Backups are stored in Backblaze B2 using restic.
**Recovery Time Objective (RTO):** 2-6 hours for full restore
**Recovery Point Objective (RPO):** 24 hours (daily backups at 3 AM UTC)
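
The RPO only holds if the daily backup actually ran. The following is a minimal sketch of a freshness check, not part of the current setup; the function name `snapshot_is_fresh` and its arguments are hypothetical.

```bash
# Hypothetical check: is the newest snapshot within the RPO window?
# Arguments: snapshot time and current time as Unix epoch seconds,
# then the allowed age in hours.
snapshot_is_fresh() {
    snapshot_epoch=$1
    now_epoch=$2
    max_hours=$3
    age_seconds=$(( now_epoch - snapshot_epoch ))
    # Succeeds (exit 0) only when the snapshot is no older than the limit.
    [ "$age_seconds" -le $(( max_hours * 3600 )) ]
}

# Possible usage (assumes jq and GNU date are installed on the host):
#   ts=$(restic snapshots --latest 1 --json | jq -r '.[0].time')
#   snapshot_is_fresh "$(date -d "$ts" +%s)" "$(date +%s)" 24 || echo "backup is stale"
```

The limit is passed in rather than hard-coded, since a real check would probably allow a little slack beyond 24 hours for the backup's own runtime.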
---
## 1. What's Backed Up
| Component | Path | Backup Method | Restore Priority |
|-----------|------|---------------|------------------|
| PostgreSQL (forgejo) | `/var/backup/postgresql/forgejo.sql.gz` | pg_dump via timer | Critical |
| PostgreSQL (mautrix_slack) | `/var/backup/postgresql/mautrix_slack.sql.gz` | pg_dump via timer | Critical |
| Forgejo | `/var/lib/forgejo` | restic file backup | Critical |
| Matrix | `/var/lib/matrix-continuwuity` | restic file backup | High |
| Maubot | `/var/lib/maubot` | restic file backup | Medium |
| Slack Bridge | `/var/lib/mautrix-slack` | restic file backup | Medium |
| User Homes | `/home/*` | restic file backup | High |
| ACME Certs | `/var/lib/acme` | restic file backup | Medium |
### What's NOT Backed Up (Reproducible via NixOS)
- `/nix/store` - rebuilt from flake
- `/etc` - generated from NixOS config
- Service binaries - installed via Nix
### Critical Items Stored Out-of-Band
| Item | Storage Location | Notes |
|------|------------------|-------|
| NixOS flake | Local laptop (`~/proj/ops-jrz1`); GitHub mirror planned (see Section 9) | Self-hosted Forgejo may be dead |
| Restic password | Password manager + printed | Can't restore without it |
| B2 credentials | Password manager + printed | Can't access backups without it |
| Age key (sops) | `/etc/ssh/ssh_host_ed25519_key` on server, plus offline recovery age key (see Section 9) | Primary key is derived from the SSH host key and is lost with the server |
---
## 2. Break Glass - Emergency Quick Reference
**Print this page and store physically.**
### B2 Backup Access
```
Bucket: ops-jrz1-backup
Restic Repo: b2:ops-jrz1-backup
Key ID: [stored in password manager]
App Key: [stored in password manager]
Restic Password: [stored in password manager]
```
### Minimal Restore Commands
```bash
# Set environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="[password]"
export B2_ACCOUNT_ID="[key-id]"
export B2_ACCOUNT_KEY="[app-key]"
# List snapshots
restic snapshots
# Restore everything to /tmp/restore
restic restore latest --target /tmp/restore
# Restore specific path
restic restore latest --target / --include /var/lib/forgejo
```
### Service Start Order
```
1. postgresql
2. forgejo
3. mautrix-slack
4. matrix-continuwuity
5. maubot
6. nginx
```
### Config Repository
```
Primary: git.clarun.xyz:dan/ops-jrz1.git (may be down)
Mirror: [ADD GITHUB MIRROR URL]
Local: ~/proj/ops-jrz1 on admin laptop
```
---
## 3. Restore Scenarios
### Scenario A: Full Server Loss
**When:** Hardware failure, VPS provider issue, complete disk loss.
**Time estimate:** 2-4 hours
#### Phase 1: Bootstrap NixOS (30-60 min)
1. Provision new VPS or boot NixOS installer on new hardware
2. Partition disks, mount to `/mnt`
3. Get NixOS flake from backup location (laptop, GitHub mirror)
4. **Critical:** Restore SSH host keys if you have them backed up, OR accept new host identity (will need to re-encrypt sops secrets with new age key)
```bash
# If restoring old host keys (preserves sops decryption):
mkdir -p /mnt/etc/ssh
# Copy ssh_host_* files from secure backup
# Install NixOS
nixos-install --flake /path/to/flake#ops-jrz1 --no-root-passwd
reboot
```
#### Phase 2: Restore Secrets (10-20 min)
If you had to generate new SSH host keys:
```bash
# Get new age public key
ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub
# Update .sops.yaml with the new key on the admin machine
# Re-encrypt secrets (needs a key that can still decrypt them, e.g. the offline recovery age key):
#   sops updatekeys secrets/secrets.yaml
# Redeploy
```
#### Phase 3: Stop Services, Restore Data (30-90 min)
```bash
# Stop all services that use the data
systemctl stop forgejo mautrix-slack matrix-continuwuity maubot
# Set restic environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="..."
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."
# Restore PostgreSQL dumps
restic restore latest --target /tmp/restore --include /var/backup/postgresql
# Import databases
systemctl start postgresql
sudo -u postgres psql -c "DROP DATABASE IF EXISTS forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
sudo -u postgres psql -c "DROP DATABASE IF EXISTS mautrix_slack;"
sudo -u postgres psql -c "CREATE DATABASE mautrix_slack OWNER mautrix_slack;"
gunzip -c /tmp/restore/var/backup/postgresql/mautrix_slack.sql.gz | sudo -u postgres psql -d mautrix_slack
# Restore Forgejo data
rm -rf /var/lib/forgejo/*
restic restore latest --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo
# Restore Matrix data
rm -rf /var/lib/matrix-continuwuity/*
restic restore latest --target / --include /var/lib/matrix-continuwuity
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity
# Restore Maubot data
rm -rf /var/lib/maubot/*
restic restore latest --target / --include /var/lib/maubot
chown -R maubot:maubot /var/lib/maubot
# Restore Slack bridge data
rm -rf /var/lib/mautrix-slack/*
restic restore latest --target / --include /var/lib/mautrix-slack
chown -R mautrix-slack:mautrix-slack /var/lib/mautrix-slack
# Restore user home directories
restic restore latest --target / --include /home
# restic restores ownership and modes as recorded in the snapshot;
# check them if UIDs differ on the new host (see Section 9, dynamic UIDs)
```
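
Section 9 flags the `rm -rf` steps above as risky if the wrong snapshot is chosen. One way to soften that, sketched here under assumptions, is to restore into `/tmp/restore` first and move the old data aside instead of deleting it. The helper name `swap_in_staged` and the `.pre-restore.` suffix are hypothetical, and the approach needs enough free disk for both copies.

```bash
# Hypothetical helper: replace a live data directory with a staged copy,
# keeping the old contents under a timestamped name instead of deleting them.
swap_in_staged() {
    staged=$1   # e.g. /tmp/restore/var/lib/forgejo
    live=$2     # e.g. /var/lib/forgejo
    # Refuse to continue if the staged restore is missing or empty.
    if [ ! -d "$staged" ] || [ -z "$(ls -A "$staged")" ]; then
        echo "staged restore at $staged is missing or empty; leaving $live untouched" >&2
        return 1
    fi
    # Move the old data aside so a wrong snapshot choice is recoverable.
    if [ -e "$live" ]; then
        mv "$live" "$live.pre-restore.$(date +%Y%m%d%H%M%S)" || return 1
    fi
    # cp -a keeps modes and timestamps, and ownership when run as root.
    cp -a "$staged" "$live"
}
```

With the service stopped, this could be run as `swap_in_staged /tmp/restore/var/lib/forgejo /var/lib/forgejo` after `restic restore latest --target /tmp/restore --include /var/lib/forgejo`. The moved-aside directory can be deleted once the verification checklist passes.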
#### Phase 4: Start Services and Verify (15-30 min)
```bash
# Start in dependency order
systemctl start forgejo
systemctl start mautrix-slack
systemctl start matrix-continuwuity
systemctl start maubot
# Check status
systemctl status forgejo mautrix-slack matrix-continuwuity maubot
```
See Section 5 for verification checklist.
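
Starting units one at a time makes it easier to see which service fails first. The sketch below wraps the Phase 4 commands in a loop that stops at the first failure; `start_in_order` and the `SYSTEMCTL` override are hypothetical names introduced for illustration.

```bash
# Hypothetical helper: start units one at a time and stop at the first failure.
# SYSTEMCTL can be overridden (for example SYSTEMCTL=echo) for a dry run.
start_in_order() {
    for unit in "$@"; do
        ${SYSTEMCTL:-systemctl} start "$unit" || {
            echo "failed to start $unit; not starting later units" >&2
            return 1
        }
        # is-active exits non-zero unless the unit is in the active state.
        ${SYSTEMCTL:-systemctl} is-active --quiet "$unit" || {
            echo "$unit did not reach active state" >&2
            return 1
        }
    done
}

# Possible usage, with the same order as the commands above:
#   start_in_order forgejo mautrix-slack matrix-continuwuity maubot
```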
---
### Scenario B: Single Service Corruption
**When:** One service's data is corrupted but server is otherwise fine.
**Time estimate:** 15-60 min
#### Example: Matrix RocksDB Corruption
```bash
# Stop the service
systemctl stop matrix-continuwuity
# Find available snapshots
restic snapshots
# Restore to temp location first (safer)
restic restore latest --target /tmp/restore --include /var/lib/matrix-continuwuity
# Verify it looks reasonable
ls -la /tmp/restore/var/lib/matrix-continuwuity/
# Replace corrupted data
rm -rf /var/lib/matrix-continuwuity/*
cp -a /tmp/restore/var/lib/matrix-continuwuity/* /var/lib/matrix-continuwuity/
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity
# Start and verify
systemctl start matrix-continuwuity
journalctl -u matrix-continuwuity -f
```
#### Example: PostgreSQL Database Corruption
```bash
# Stop dependent services
systemctl stop forgejo mautrix-slack
# Restore dump
restic restore latest --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
# Drop and recreate
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
# Restart everything stopped above
systemctl start forgejo mautrix-slack
```
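
Because the `DROP DATABASE` step is destructive, it may be worth checking the restored dump before running it. This is a minimal sketch, assuming the dumps are plain-format `pg_dump` output compressed with gzip (as the `.sql.gz` names and the `psql` import suggest); `dump_looks_valid` is a hypothetical name.

```bash
# Hypothetical helper: sanity-check a compressed pg_dump file before dropping anything.
dump_looks_valid() {
    dump=$1
    # The file must exist and be non-empty.
    [ -s "$dump" ] || return 1
    # gzip -t verifies the compressed stream without writing any output.
    gzip -t "$dump" 2>/dev/null || return 1
    # Plain-format pg_dump output begins with a "PostgreSQL database dump" comment.
    gunzip -c "$dump" 2>/dev/null | head -n 5 | grep -q "PostgreSQL database dump"
}

# Possible usage before the drop-and-recreate step:
#   dump_looks_valid /tmp/restore/var/backup/postgresql/forgejo.sql.gz || exit 1
```

This only catches truncated or obviously wrong files. It does not prove the dump imports cleanly.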
---
### Scenario C: User Deleted Their Work
**When:** Dev accidentally `rm -rf`'d their project or home directory.
**Time estimate:** 5-20 min
```bash
# Find what snapshots are available
restic snapshots
# Browse a snapshot to find the data
restic ls latest /home/USERNAME/
# Restore specific directory to temp location
restic restore latest --target /tmp/restore --include /home/USERNAME/project-name
# Let user copy what they need
cp -a /tmp/restore/home/USERNAME/project-name /home/USERNAME/
chown -R USERNAME:users /home/USERNAME/project-name
# Or restore entire home directory
restic restore latest --target / --include /home/USERNAME
chown -R USERNAME:users /home/USERNAME
```
#### Point-in-Time Restore
```bash
# List snapshots with dates
restic snapshots
# Restore from specific snapshot (not latest)
restic restore abc123def --target /tmp/restore --include /home/USERNAME
```
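
Picking the right snapshot ID by eye gets error-prone as the list grows. The sketch below selects the newest snapshot taken at or before a cutoff. It assumes all timestamps are ISO 8601 in the same timezone, so that string order matches time order; `snapshot_before` is a hypothetical name.

```bash
# Hypothetical helper: read "SHORT_ID TIMESTAMP" lines on stdin and print the ID
# of the newest snapshot taken at or before the cutoff given as $1.
# Exits non-zero if no snapshot is old enough.
snapshot_before() {
    cutoff=$1
    # Sort by timestamp, then keep the last ID whose timestamp is <= cutoff.
    LC_ALL=C sort -k2,2 | LC_ALL=C awk -v cutoff="$cutoff" '
        $2 <= cutoff { id = $1 }
        END { if (id != "") print id; else exit 1 }'
}

# Possible usage (assumes jq is installed):
#   restic snapshots --json \
#     | jq -r '.[] | "\(.short_id) \(.time)"' \
#     | snapshot_before "2026-01-09T12:00:00"
```

The printed ID can then be passed to `restic restore` in place of `abc123def`.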
---
### Scenario D: Single Forgejo Repo Deleted
**When:** A git repository was deleted from Forgejo.
**Time estimate:** 10-30 min
**Challenge:** Forgejo database and filesystem must be in sync.
#### Option 1: Restore Just the Git Data (if DB record exists)
```bash
# Find repo path - usually /var/lib/forgejo/repositories/USERNAME/REPO.git
restic ls latest /var/lib/forgejo/repositories/
# Restore repo
restic restore latest --target /tmp/restore --include /var/lib/forgejo/repositories/USERNAME/REPO.git
# Copy into place
cp -a /tmp/restore/var/lib/forgejo/repositories/USERNAME/REPO.git /var/lib/forgejo/repositories/USERNAME/
chown -R forgejo:forgejo /var/lib/forgejo/repositories/USERNAME/REPO.git
# Regenerate hooks
sudo -u forgejo forgejo admin regenerate hooks
```
#### Option 2: Full Forgejo Restore (if DB record was also deleted)
Restore both the database and the filesystem from the same snapshot so they stay in sync:
```bash
systemctl stop forgejo
# Restore database
restic restore SNAPSHOT_ID --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
# Restore filesystem
rm -rf /var/lib/forgejo/*
restic restore SNAPSHOT_ID --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo
systemctl start forgejo
```
---
## 4. Restore Commands Reference
### Environment Setup
```bash
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD_FILE="/run/secrets/restic/password" # if on server
# OR
export RESTIC_PASSWORD="your-password-here" # if restoring from scratch
export B2_ACCOUNT_ID="your-key-id"
export B2_ACCOUNT_KEY="your-app-key"
```
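
A missing variable tends to surface as a confusing restic error partway through a restore. The following sketch checks the environment up front; `check_restic_env` is a hypothetical helper, not something the current configuration provides.

```bash
# Hypothetical helper: fail early if the restic/B2 environment is incomplete.
check_restic_env() {
    missing=""
    # These three are always required for the B2-backed repository.
    for var in RESTIC_REPOSITORY B2_ACCOUNT_ID B2_ACCOUNT_KEY; do
        eval "value=\${$var:-}"
        [ -n "$value" ] || missing="$missing $var"
    done
    # The repository password may come from either variable.
    if [ -z "${RESTIC_PASSWORD:-}" ] && [ -z "${RESTIC_PASSWORD_FILE:-}" ]; then
        missing="$missing RESTIC_PASSWORD|RESTIC_PASSWORD_FILE"
    fi
    if [ -n "$missing" ]; then
        echo "missing:$missing" >&2
        return 1
    fi
    return 0
}

# Possible usage:
#   check_restic_env && restic snapshots
```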
### Common Operations
```bash
# List all snapshots
restic snapshots
# List snapshots with tags
restic snapshots --tag ops-jrz1
# Browse snapshot contents
restic ls latest
restic ls latest /var/lib/forgejo
# Restore everything
restic restore latest --target /
# Restore specific path
restic restore latest --target / --include /var/lib/forgejo
# Restore to different location
restic restore latest --target /tmp/restore --include /home/dan
# Restore specific snapshot (not latest)
restic restore abc123de --target /tmp/restore
# Mount backup as filesystem (for browsing)
mkdir -p /mnt/restic
restic mount /mnt/restic
# Browse /mnt/restic/snapshots/latest/...
# Ctrl+C to unmount
# Check backup integrity
restic check
restic check --read-data # slower, verifies all data
```
---
## 5. Verification Checklist
### After Full Restore
#### Infrastructure
- [ ] SSH access works
- [ ] DNS resolves correctly
- [ ] HTTPS certificates valid (may need `systemctl start acme-clarun.xyz`)
#### PostgreSQL
- [ ] `systemctl status postgresql` - active
- [ ] `sudo -u postgres psql -c "\l"` - lists forgejo, mautrix_slack databases
- [ ] No errors in `journalctl -u postgresql`
#### Forgejo
- [ ] `systemctl status forgejo` - active
- [ ] Web UI loads at https://git.clarun.xyz
- [ ] Can log in
- [ ] Repositories visible and browsable
- [ ] Can clone a repo: `git clone git@git.clarun.xyz:org/repo.git`
- [ ] Can push to a repo
#### Matrix
- [ ] `systemctl status matrix-continuwuity` - active
- [ ] No RocksDB errors in `journalctl -u matrix-continuwuity`
- [ ] Can log in with Matrix client
- [ ] Can send/receive messages
- [ ] Old messages visible
#### Maubot
- [ ] `systemctl status maubot` - active
- [ ] Web UI accessible via SSH tunnel (port 29316)
- [ ] Bots responding
#### Slack Bridge
- [ ] `systemctl status mautrix-slack` - active
- [ ] Bridge connected (check logs)
- [ ] Messages flowing both directions
#### User Home Directories
- [ ] Users can SSH in
- [ ] User files present
- [ ] Permissions correct
### After Partial Restore
- [ ] Restored service starts without errors
- [ ] Basic functionality works
- [ ] No data newer than the restored snapshot remains (if restoring an older snapshot)
---
## 6. Time Estimates
| Scenario | Download Size | Estimated Time |
|----------|--------------|----------------|
| Full server restore | ~15 GB | 2-4 hours |
| Single service (Matrix) | ~2 GB | 15-45 min |
| Single service (Forgejo) | ~5 GB | 20-60 min |
| Single user home | ~1 GB | 5-15 min |
| Single git repo | ~100 MB | 5-10 min |
| PostgreSQL DB only | ~50 MB | 10-20 min |
*Times assume 100 Mbps download from B2. Actual times depend on network speed and data size.*
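
To see how much of each estimate is raw download, the transfer time can be worked out directly. This is a rough sketch using decimal units (1 GB = 8000 megabits) and integer arithmetic; `transfer_minutes` is a hypothetical name.

```bash
# Rough transfer-time estimate: size in decimal GB, link speed in Mbps.
# seconds = GB * 8000 / Mbps, then rounded up to whole minutes.
transfer_minutes() {
    size_gb=$1
    mbps=$2
    seconds=$(( size_gb * 8000 / mbps ))
    echo $(( (seconds + 59) / 60 ))
}

# Possible usage:
#   transfer_minutes 15 100   # full server restore
#   transfer_minutes 5 100    # Forgejo only
```

At 100 Mbps the full 15 GB restore works out to about 20 minutes of transfer, so most of the 2-4 hour estimate is provisioning, database import, and verification rather than download.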
---
## 7. Quarterly Restore Drill
Schedule: First Sunday of each quarter
### Procedure
1. Spin up test VM (or use local NixOS VM)
2. Attempt full restore procedure
3. Run verification checklist
4. Document:
- Actual time taken
- Any issues encountered
- Runbook updates needed
5. Destroy test VM
### Success Criteria
- [ ] NixOS boots with config
- [ ] PostgreSQL databases restore and pass basic queries
- [ ] Forgejo UI loads and repos are accessible
- [ ] Matrix client can connect and see history
- [ ] At least one user home directory restored with correct permissions
---
## 8. Known Limitations
### RocksDB Consistency
The Matrix homeserver uses RocksDB, which is sensitive to incomplete backups. The current backup runs while the service is active. For guaranteed consistency, the backup job should do one of the following:
- Stop the service before the backup, OR
- Use the RocksDB checkpoint feature, OR
- Use filesystem snapshots (ZFS/btrfs)
**Current risk:** Low probability of corrupted Matrix restore. Mitigation: verify RocksDB opens without errors after restore.
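
The first option (stopping the service around the backup) could look roughly like the sketch below. It is a generic shell wrapper, not the actual backup-b2.nix hook; `backup_with_service_stopped` and the `SYSTEMCTL` override are hypothetical.

```bash
# Hypothetical wrapper: stop a unit, run the given backup command,
# and restart the unit whether or not the backup succeeded.
backup_with_service_stopped() {
    unit=$1
    shift
    ${SYSTEMCTL:-systemctl} stop "$unit" || return 1
    "$@"
    status=$?
    # Restart even after a failed backup, so an error does not leave Matrix down.
    ${SYSTEMCTL:-systemctl} start "$unit" || return 1
    # Report the backup command's own exit status.
    return "$status"
}

# Possible usage:
#   backup_with_service_stopped matrix-continuwuity \
#     restic backup /var/lib/matrix-continuwuity
```

The trade-off is a short Matrix outage during every backup run, which is presumably why this fix is still deferred.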
### Point-in-Time Recovery
Restic provides daily snapshots, not continuous backup, so data cannot be restored to an arbitrary point in time. PostgreSQL PITR would need WAL archiving (not currently configured).
### User Home Directory Backup
User home directories (`/home/*`) are now included in the backup (the gap is marked fixed in Section 9), and a home directory restore was exercised in the first restore drill.
### Large File Handling
Forgejo LFS objects and large repos may take significant time to restore. Consider whether to exclude LFS from regular backups and handle separately.
---
## 9. Known Gaps and TODOs
**Critical - Must Fix Before Relying on This Runbook:**
| Gap | Risk | Fix | Status |
|-----|------|-----|--------|
| ~~`/home/*` not backed up~~ | ~~User work lost forever~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
| ~~`/var/lib/acme` not backed up~~ | ~~Let's Encrypt rate limit~~ | ~~Add to backup-b2.nix paths~~ | **FIXED** |
| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Deferred (y8le) |
| ~~Sops key tied to SSH host key only~~ | ~~Lose host key = lose all secrets~~ | ~~Add offline recovery age key~~ | **FIXED** (93q9) |
| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Deferred (jboq) |
| `rm -rf` in restore steps | Wrong snapshot = data destroyed | Always restore to staging first | Docs only |
**Medium Priority:**
| Gap | Risk | Fix |
|-----|------|-----|
| ~~PostgreSQL version not pinned~~ | ~~Version mismatch on restore~~ | ~~Pin `pkgs.postgresql_15`~~ **FIXED** |
| Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
| DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
| No backup monitoring | Silent failures for days | Add healthchecks.io integration |
| Postgres roles/extensions | Restore may fail | Include `pg_dumpall --globals-only` |
**Nice to Have:**
| Gap | Improvement |
|-----|-------------|
| Manual restore steps | Create restore script |
| No immutable backups | Enable B2 Object Lock |
| No second backup location | Replicate to second provider |
---
## 10. Runbook Maintenance
- **Owner:** dan
- **Last updated:** 2026-01-11
- **Last drill:** 2026-01-11 (restore test passed)
- **Next review:** After NixOS 24.11 upgrade
### Change Log
| Date | Change |
|------|--------|
| 2026-01-11 | First restore drill - all tests passed |
| 2026-01-11 | Fixed /var/backup permissions (postgres couldn't traverse) |
| 2026-01-10 | Initial draft |
---
## Appendix A: File Paths Reference
```
# PostgreSQL dumps (created by services.postgresqlBackup)
/var/backup/postgresql/forgejo.sql.gz
/var/backup/postgresql/mautrix_slack.sql.gz
# Forgejo
/var/lib/forgejo/
├── conf/
├── data/
│   ├── avatars/
│   ├── attachments/
│   └── lfs/
├── repositories/
│   └── USERNAME/
│       └── REPO.git/
└── gitea.db (if using SQLite, but we use PostgreSQL)
# Matrix (continuwuity with RocksDB)
/var/lib/matrix-continuwuity/
├── db/ # RocksDB database
└── media/ # Uploaded media
# Maubot
/var/lib/maubot/
├── plugins/
├── trash/
└── maubot.db # SQLite database
# Slack Bridge
/var/lib/mautrix-slack/
└── registration.yaml
# User homes
/home/USERNAME/
├── .ssh/
├── .config/
├── .npm-global/
└── [user projects]
# Secrets (runtime, not backed up - regenerated from sops)
/run/secrets/
├── matrix-registration-token
├── maubot-admin-password
├── restic/password
└── ...
```
## Appendix B: Service Dependencies
```
               ┌─────────────┐
               │ postgresql  │
               └──────┬──────┘
      ┌───────────────┼───────────────┐
      │               │               │
      ▼               ▼               ▼
┌──────────┐   ┌─────────────┐   ┌──────────┐
│ forgejo  │   │mautrix-slack│   │  maubot  │
└──────────┘   └──────┬──────┘   └────┬─────┘
                      │               │
                      ▼               ▼
                 ┌─────────────────────────┐
                 │   matrix-continuwuity   │
                 └─────────────────────────┘

               ┌─────────────┐
               │    nginx    │
               └─────────────┘
```