Documents restore procedures for full server loss, partial restore, and user data recovery scenarios. Includes verification checklists, time estimates, and break-glass quick reference. Also documents known gaps (home dirs, ACME, RocksDB consistency) that need fixing before the runbook is production-ready. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Disaster Recovery Runbook - ops-jrz1
Overview
This runbook covers restore procedures for ops-jrz1, a NixOS homelab server running Matrix, Forgejo, and supporting services. Backups are stored in Backblaze B2 using restic.
Recovery Time Objective (RTO): 2-6 hours for full restore
Recovery Point Objective (RPO): 24 hours (daily backups at 3 AM UTC)
1. What's Backed Up
| Component | Path | Backup Method | Restore Priority |
|---|---|---|---|
| PostgreSQL (forgejo) | /var/backup/postgresql/forgejo.sql.gz | pg_dump via timer | Critical |
| PostgreSQL (mautrix_slack) | /var/backup/postgresql/mautrix_slack.sql.gz | pg_dump via timer | Critical |
| Forgejo | /var/lib/forgejo | restic file backup | Critical |
| Matrix | /var/lib/matrix-continuwuity | restic file backup | High |
| Maubot | /var/lib/maubot | restic file backup | Medium |
| Slack Bridge | /var/lib/mautrix-slack | restic file backup | Medium |
| User Homes | /home/* | NOT YET BACKED UP | High |
What's NOT Backed Up (Reproducible via NixOS)
- /nix/store - rebuilt from flake
- /etc - generated from NixOS config
- Service binaries - installed via Nix
Critical Items Stored Out-of-Band
| Item | Storage Location | Notes |
|---|---|---|
| NixOS flake | GitHub mirror / local laptop | Self-hosted Forgejo may be dead |
| Restic password | Password manager + printed | Can't restore without it |
| B2 credentials | Password manager + printed | Can't access backups without it |
| Age key (sops) | /etc/ssh/ssh_host_ed25519_key on server | Derived from SSH host key |
2. Break Glass - Emergency Quick Reference
Print this page and store physically.
B2 Backup Access
Bucket: ops-jrz1-backup
Restic Repo: b2:ops-jrz1-backup
Key ID: [stored in password manager]
App Key: [stored in password manager]
Restic Password: [stored in password manager]
Minimal Restore Commands
# Set environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="[password]"
export B2_ACCOUNT_ID="[key-id]"
export B2_ACCOUNT_KEY="[app-key]"
# List snapshots
restic snapshots
# Restore everything to /tmp/restore
restic restore latest --target /tmp/restore
# Restore specific path
restic restore latest --target / --include /var/lib/forgejo
Service Start Order
1. postgresql
2. forgejo
3. mautrix-slack
4. matrix-continuwuity
5. maubot
6. nginx
Config Repository
Primary: git.clarun.xyz:dan/ops-jrz1.git (may be down)
Mirror: [ADD GITHUB MIRROR URL]
Local: ~/proj/ops-jrz1 on admin laptop
3. Restore Scenarios
Scenario A: Full Server Loss
When: Hardware failure, VPS provider issue, complete disk loss.
Time estimate: 2-4 hours
Phase 1: Bootstrap NixOS (30-60 min)
- Provision new VPS or boot NixOS installer on new hardware
- Partition disks, mount to /mnt
- Get NixOS flake from backup location (laptop, GitHub mirror)
- Critical: Restore SSH host keys if you have them backed up, OR accept new host identity (will need to re-encrypt sops secrets with new age key)
# If restoring old host keys (preserves sops decryption):
mkdir -p /mnt/etc/ssh
# Copy ssh_host_* files from secure backup
# Install NixOS
nixos-install --flake /path/to/flake#ops-jrz1 --no-root-passwd
reboot
Phase 2: Restore Secrets (10-20 min)
If you had to generate new SSH host keys:
# Get new age public key
ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub
# Update .sops.yaml with new key on admin machine
# Re-encrypt secrets: sops updatekeys secrets/secrets.yaml
# Redeploy
Phase 3: Stop Services, Restore Data (30-90 min)
# Stop all services that use the data
systemctl stop forgejo mautrix-slack matrix-continuwuity maubot
# Set restic environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="..."
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."
# Restore PostgreSQL dumps
restic restore latest --target /tmp/restore --include /var/backup/postgresql
# Import databases
systemctl start postgresql
sudo -u postgres psql -c "DROP DATABASE IF EXISTS forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
sudo -u postgres psql -c "DROP DATABASE IF EXISTS mautrix_slack;"
sudo -u postgres psql -c "CREATE DATABASE mautrix_slack OWNER mautrix_slack;"
gunzip -c /tmp/restore/var/backup/postgresql/mautrix_slack.sql.gz | sudo -u postgres psql -d mautrix_slack
# Restore Forgejo data
rm -rf /var/lib/forgejo/*
restic restore latest --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo
# Restore Matrix data
rm -rf /var/lib/matrix-continuwuity/*
restic restore latest --target / --include /var/lib/matrix-continuwuity
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity
# Restore Maubot data
rm -rf /var/lib/maubot/*
restic restore latest --target / --include /var/lib/maubot
chown -R maubot:maubot /var/lib/maubot
# Restore Slack bridge data
rm -rf /var/lib/mautrix-slack/*
restic restore latest --target / --include /var/lib/mautrix-slack
chown -R mautrix-slack:mautrix-slack /var/lib/mautrix-slack
# Restore user home directories (only works once /home is added to the backup; see Section 9)
restic restore latest --target / --include /home
# Permissions should be preserved by restic
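The steps above delete live data before restoring over it, which Section 9 flags as a risk. A minimal sketch of a staging-first helper follows. The names `restore_to_staging` and `STAGING_ROOT` are hypothetical and not part of the existing tooling, and the emptiness check is only a coarse sanity test, not proof that the data is usable.

```shell
#!/usr/bin/env bash
# Sketch only: restore one path into a staging directory and refuse to
# continue if nothing came back. restore_to_staging and STAGING_ROOT are
# hypothetical names introduced for this example.

STAGING_ROOT="${STAGING_ROOT:-/tmp/restore}"

# restore_to_staging SNAPSHOT PATH
# Restores PATH from SNAPSHOT under $STAGING_ROOT, checks that the result
# exists and is non-empty, then prints the staged location on stdout.
restore_to_staging() {
    local snapshot="$1" path="$2"
    local staged="${STAGING_ROOT}${path}"

    # Send restic's progress output to stderr so stdout carries only the path
    restic restore "$snapshot" --target "$STAGING_ROOT" --include "$path" >&2 || return 1

    if [ ! -e "$staged" ]; then
        echo "staging check failed: $staged is missing" >&2
        return 1
    fi
    if [ -d "$staged" ] && [ -z "$(ls -A "$staged")" ]; then
        echo "staging check failed: $staged is empty" >&2
        return 1
    fi
    echo "$staged"
}
```

One possible use is `staged=$(restore_to_staging latest /var/lib/forgejo)`, followed by the `rm -rf` and a `cp -a "$staged"/. /var/lib/forgejo/` only if the helper succeeded.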
Phase 4: Start Services and Verify (15-30 min)
# Start in dependency order
systemctl start forgejo
systemctl start mautrix-slack
systemctl start matrix-continuwuity
systemctl start maubot
# Check status
systemctl status forgejo mautrix-slack matrix-continuwuity maubot
See Section 5 for verification checklist.
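`systemctl status` output is easy to misread during a restore. As a sketch, the same check can be reduced to one line per unit with `systemctl is-active`. The function name `check_services` is an assumption made for this example.

```shell
#!/usr/bin/env bash
# Sketch only: report which units are active after a restore.
# check_services is a hypothetical helper name.

# check_services UNIT...
# Prints "ok" or "FAIL" per unit and returns non-zero if any unit is not active.
check_services() {
    local failed=0 unit
    for unit in "$@"; do
        if systemctl is-active --quiet "$unit"; then
            echo "ok    $unit"
        else
            echo "FAIL  $unit"
            failed=1
        fi
    done
    return "$failed"
}
```

For the full restore this could be called as `check_services postgresql forgejo mautrix-slack matrix-continuwuity maubot nginx`. A passing result only shows that the units started, so the Section 5 checklist still applies.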
Scenario B: Single Service Corruption
When: One service's data is corrupted but server is otherwise fine.
Time estimate: 15-60 min
Example: Matrix RocksDB Corruption
# Stop the service
systemctl stop matrix-continuwuity
# Find available snapshots
restic snapshots
# Restore to temp location first (safer)
restic restore latest --target /tmp/restore --include /var/lib/matrix-continuwuity
# Verify it looks reasonable
ls -la /tmp/restore/var/lib/matrix-continuwuity/
# Replace corrupted data
rm -rf /var/lib/matrix-continuwuity/*
cp -a /tmp/restore/var/lib/matrix-continuwuity/* /var/lib/matrix-continuwuity/
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity
# Start and verify
systemctl start matrix-continuwuity
journalctl -u matrix-continuwuity -f
Example: PostgreSQL Database Corruption
# Stop dependent services
systemctl stop forgejo mautrix-slack
# Restore dump
restic restore latest --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
# Drop and recreate
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
# Restart the services stopped above
systemctl start forgejo mautrix-slack
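Dropping the database before checking the dump leaves nothing to fall back on if the file turns out to be truncated. The sketch below shows one way to check a restored dump first. The helper name `verify_dump` is hypothetical, and the header check assumes the dumps are plain-text `pg_dump` output, which is consistent with piping them into `psql` above.

```shell
#!/usr/bin/env bash
# Sketch only: sanity-check a gzipped SQL dump before dropping the live database.
# verify_dump is a hypothetical helper name.

# verify_dump FILE
# Succeeds if FILE is non-empty, is a valid gzip stream, and its first
# lines contain the comment header that plain-text pg_dump output starts with.
verify_dump() {
    local dump="$1" header

    [ -s "$dump" ] || { echo "missing or empty: $dump" >&2; return 1; }
    gzip -t "$dump" 2>/dev/null || { echo "corrupt gzip: $dump" >&2; return 1; }

    # Read only the first lines; "|| true" keeps an early pipe close from failing the check
    header="$(gzip -dc "$dump" 2>/dev/null | head -n 20 || true)"
    case "$header" in
        *"PostgreSQL database dump"*) return 0 ;;
        *) echo "no pg_dump header found: $dump" >&2; return 1 ;;
    esac
}
```

A possible use is `verify_dump /tmp/restore/var/backup/postgresql/forgejo.sql.gz` before the `DROP DATABASE` step. It catches truncated or wrong files, not logical errors inside the dump.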
Scenario C: User Deleted Their Work
When: Dev accidentally rm -rf'd their project or home directory. Requires /home to be in the backup set, which is not yet the case (see Section 9).
Time estimate: 5-20 min
# Find what snapshots are available
restic snapshots
# Browse a snapshot to find the data
restic ls latest /home/USERNAME/
# Restore specific directory to temp location
restic restore latest --target /tmp/restore --include /home/USERNAME/project-name
# Let user copy what they need
cp -a /tmp/restore/home/USERNAME/project-name /home/USERNAME/
chown -R USERNAME:users /home/USERNAME/project-name
# Or restore entire home directory
restic restore latest --target / --include /home/USERNAME
chown -R USERNAME:users /home/USERNAME
Point-in-Time Restore
# List snapshots with dates
restic snapshots
# Restore from specific snapshot (not latest)
restic restore abc123def --target /tmp/restore --include /home/USERNAME
Scenario D: Single Forgejo Repo Deleted
When: A git repository was deleted from Forgejo.
Time estimate: 10-30 min
Challenge: Forgejo database and filesystem must be in sync.
Option 1: Restore Just the Git Data (if DB record exists)
# Find repo path - usually /var/lib/forgejo/repositories/USERNAME/REPO.git
restic ls latest /var/lib/forgejo/repositories/
# Restore repo
restic restore latest --target /tmp/restore --include /var/lib/forgejo/repositories/USERNAME/REPO.git
# Copy into place
cp -a /tmp/restore/var/lib/forgejo/repositories/USERNAME/REPO.git /var/lib/forgejo/repositories/USERNAME/
chown -R forgejo:forgejo /var/lib/forgejo/repositories/USERNAME/REPO.git
# Regenerate hooks
sudo -u forgejo forgejo admin regenerate hooks
Option 2: Full Forgejo Restore (if DB record was also deleted)
Need to restore both database and filesystem to same point in time:
systemctl stop forgejo
# Restore database
restic restore SNAPSHOT_ID --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo
# Restore filesystem
rm -rf /var/lib/forgejo/*
restic restore SNAPSHOT_ID --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo
systemctl start forgejo
4. Restore Commands Reference
Environment Setup
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD_FILE="/run/secrets/restic/password" # if on server
# OR
export RESTIC_PASSWORD="your-password-here" # if restoring from scratch
export B2_ACCOUNT_ID="your-key-id"
export B2_ACCOUNT_KEY="your-app-key"
Common Operations
# List all snapshots
restic snapshots
# List snapshots filtered by tag
restic snapshots --tag ops-jrz1
# Browse snapshot contents
restic ls latest
restic ls latest /var/lib/forgejo
# Restore everything
restic restore latest --target /
# Restore specific path
restic restore latest --target / --include /var/lib/forgejo
# Restore to different location
restic restore latest --target /tmp/restore --include /home/dan
# Restore specific snapshot (not latest)
restic restore abc123de --target /tmp/restore
# Mount backup as filesystem (for browsing)
mkdir /mnt/restic
restic mount /mnt/restic
# restic mount stays in the foreground; browse /mnt/restic/snapshots/latest/... from a second terminal
# Ctrl+C in the first terminal to unmount
# Check backup integrity
restic check
restic check --read-data # slower, verifies all data
5. Verification Checklist
After Full Restore
Infrastructure
- SSH access works
- DNS resolves correctly
- HTTPS certificates valid (may need systemctl start acme-clarun.xyz)
PostgreSQL
- systemctl status postgresql - active
- sudo -u postgres psql -c "\l" - lists forgejo, mautrix_slack databases
- No errors in journalctl -u postgresql
Forgejo
- systemctl status forgejo - active
- Web UI loads at https://git.clarun.xyz
- Can log in
- Repositories visible and browsable
- Can clone a repo: git clone git@git.clarun.xyz:org/repo.git
- Can push to a repo
Matrix
- systemctl status matrix-continuwuity - active
- No RocksDB errors in journalctl -u matrix-continuwuity
- Can log in with Matrix client
- Can send/receive messages
- Old messages visible
Maubot
- systemctl status maubot - active
- Web UI accessible via SSH tunnel (port 29316)
- Bots responding
Slack Bridge
- systemctl status mautrix-slack - active
- Bridge connected (check logs)
- Messages flowing both directions
User Home Directories
- Users can SSH in
- User files present
- Permissions correct
After Partial Restore
- Restored service starts without errors
- Basic functionality works
- No data from "future" (if restoring older snapshot)
6. Time Estimates
| Scenario | Download Size | Estimated Time |
|---|---|---|
| Full server restore | ~15 GB | 2-4 hours |
| Single service (Matrix) | ~2 GB | 15-45 min |
| Single service (Forgejo) | ~5 GB | 20-60 min |
| Single user home | ~1 GB | 5-15 min |
| Single git repo | ~100 MB | 5-10 min |
| PostgreSQL DB only | ~50 MB | 10-20 min |
Times assume 100 Mbps download from B2. Actual times depend on network speed and data size.
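As a rough cross-check of the table, raw download time can be worked out from size and link speed. The sketch below uses decimal units (1 GB = 8000 megabits) and ignores protocol overhead. The function name `transfer_minutes` is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch only: raw download time for a given size and link speed.
# transfer_minutes is a hypothetical helper; decimal units, no overhead.

# transfer_minutes SIZE_GB LINK_MBPS
# Prints the transfer time in whole minutes, rounded up.
transfer_minutes() {
    local size_gb="$1" mbps="$2"
    local seconds=$(( size_gb * 8000 / mbps ))
    echo $(( (seconds + 59) / 60 ))
}

transfer_minutes 15 100   # full server restore, ~15 GB -> 20
transfer_minutes 5 100    # Forgejo, ~5 GB -> 7
```

At 100 Mbps the 15 GB download takes about 20 minutes, so most of the 2-4 hour full-restore estimate is presumably bootstrap, import, and verification time rather than transfer.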
7. Quarterly Restore Drill
Schedule: First Sunday of each quarter
Procedure
- Spin up test VM (or use local NixOS VM)
- Attempt full restore procedure
- Run verification checklist
- Document:
  - Actual time taken
  - Any issues encountered
  - Runbook updates needed
- Destroy test VM
Success Criteria
- NixOS boots with config
- PostgreSQL databases restore and pass basic queries
- Forgejo UI loads and repos are accessible
- Matrix client can connect and see history
- At least one user home directory restored with correct permissions
8. Known Limitations
RocksDB Consistency
The Matrix homeserver uses RocksDB, which is sensitive to incomplete backups. The current backup runs while the service is active. For guaranteed consistency, the backup should do one of the following:
- Stop service before backup, OR
- Use RocksDB checkpoint feature, OR
- Use filesystem snapshots (ZFS/btrfs)
Current risk: Low probability of corrupted Matrix restore. Mitigation: verify RocksDB opens without errors after restore.
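The first option (stop the service before backup) comes down to an ordering guarantee: stop, back up, always restart. The shell sketch below illustrates only that ordering. The function name `backup_with_service_stopped` is hypothetical, and how the real job in backup-b2.nix would be wired to do this is not shown here.

```shell
#!/usr/bin/env bash
# Sketch only: back up paths while a service is stopped, and restart the
# service even if the backup fails. backup_with_service_stopped is a
# hypothetical helper, not the existing backup job.

# backup_with_service_stopped UNIT PATH...
backup_with_service_stopped() {
    local unit="$1"; shift
    local status=0

    systemctl stop "$unit" || return 1
    restic backup "$@" || status=$?
    # Always try to bring the service back, whatever the backup result was
    systemctl start "$unit" || status=1
    return "$status"
}
```

A possible call is `backup_with_service_stopped matrix-continuwuity /var/lib/matrix-continuwuity`. The trade-off is a short Matrix outage during each backup run.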
Point-in-Time Recovery
Restic provides daily snapshots, not continuous backup. Cannot restore to arbitrary point in time. For PostgreSQL PITR, would need WAL archiving (not currently configured).
User Home Directory Backup
TODO: User home directories (/home/*) are not currently included in backup. Need to add to backup-b2.nix.
Large File Handling
Forgejo LFS objects and large repos may take significant time to restore. Consider whether to exclude LFS from regular backups and handle separately.
9. Known Gaps and TODOs
Critical - Must Fix Before Relying on This Runbook:
| Gap | Risk | Fix |
|---|---|---|
| /home/* not backed up | User work lost forever | Add to backup-b2.nix paths |
| /var/lib/acme not backed up | Let's Encrypt rate limit (7 days no HTTPS) | Add to backup-b2.nix paths |
| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook |
| Sops key tied to SSH host key only | Lose host key = lose all secrets | Add offline recovery age key |
| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub |
| rm -rf in restore steps | Wrong snapshot = data destroyed | Always restore to staging first |
Medium Priority:
| Gap | Risk | Fix |
|---|---|---|
| PostgreSQL version not pinned | Version mismatch on restore | Pin pkgs.postgresql_15 |
| Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
| DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
| No backup monitoring | Silent failures for days | Add healthchecks.io integration |
| Postgres roles/extensions | Restore may fail | Include pg_dumpall --globals-only |
Nice to Have:
| Gap | Improvement |
|---|---|
| Manual restore steps | Create restore script |
| No immutable backups | Enable B2 Object Lock |
| No second backup location | Replicate to second provider |
10. Runbook Maintenance
- Owner: dan
- Last updated: 2026-01-10
- Last drill: Never (TODO: schedule first drill)
- Next review: After first restore drill
Change Log
| Date | Change |
|---|---|
| 2026-01-10 | Initial draft |
Appendix A: File Paths Reference
# PostgreSQL dumps (created by services.postgresqlBackup)
/var/backup/postgresql/forgejo.sql.gz
/var/backup/postgresql/mautrix_slack.sql.gz
# Forgejo
/var/lib/forgejo/
├── conf/
├── data/
│   ├── avatars/
│   ├── attachments/
│   └── lfs/
├── repositories/
│   └── USERNAME/
│       └── REPO.git/
└── gitea.db (if using SQLite, but we use PostgreSQL)
# Matrix (continuwuity with RocksDB)
/var/lib/matrix-continuwuity/
├── db/ # RocksDB database
└── media/ # Uploaded media
# Maubot
/var/lib/maubot/
├── plugins/
├── trash/
└── maubot.db # SQLite database
# Slack Bridge
/var/lib/mautrix-slack/
└── registration.yaml
# User homes
/home/USERNAME/
├── .ssh/
├── .config/
├── .npm-global/
└── [user projects]
# Secrets (runtime, not backed up - regenerated from sops)
/run/secrets/
├── matrix-registration-token
├── maubot-admin-password
├── restic/password
└── ...
Appendix B: Service Dependencies
                ┌─────────────┐
                │ postgresql  │
                └──────┬──────┘
                       │
       ┌───────────────┼───────────────┐
       │               │               │
       ▼               ▼               ▼
  ┌──────────┐  ┌─────────────┐   ┌──────────┐
  │ forgejo  │  │mautrix-slack│   │  maubot  │
  └──────────┘  └──────┬──────┘   └────┬─────┘
                       │               │
                       ▼               ▼
                  ┌─────────────────────────┐
                  │   matrix-continuwuity   │
                  └─────────────────────────┘
                               │
                               ▼
                        ┌─────────────┐
                        │    nginx    │
                        └─────────────┘