ops-jrz1/docs/disaster-recovery-runbook.md
Dan 9c03d2204d Update DR runbook: first restore drill passed
Tested restore of:
- PostgreSQL dumps (forgejo: 112 tables, mautrix_slack: 32 tables)
- Forgejo repositories
- User home directories

Also updated known gaps status (sops key, PostgreSQL pin fixed).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 16:18:22 -08:00


Disaster Recovery Runbook - ops-jrz1

Overview

This runbook covers restore procedures for ops-jrz1, a NixOS homelab server running Matrix, Forgejo, and supporting services. Backups are stored in Backblaze B2 using restic.

Recovery Time Objective (RTO): 2-6 hours for full restore
Recovery Point Objective (RPO): 24 hours (daily backups at 3 AM UTC)
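
The 24-hour RPO only holds while the daily backup keeps succeeding. Below is a minimal sketch of a freshness check. The helper name `snapshot_within_rpo` is hypothetical, and the 26-hour default (24 hours plus slack for the backup's own run time) is an illustrative choice, not a value from this runbook. The usage example assumes GNU `date`.

```shell
# Hypothetical helper: succeed if a snapshot is recent enough to meet the RPO.
# usage: snapshot_within_rpo <snapshot_epoch> <now_epoch> [max_age_hours]
snapshot_within_rpo() {
  local snapshot_epoch="$1" now_epoch="$2" max_age_hours="${3:-26}"
  local age_seconds=$(( now_epoch - snapshot_epoch ))
  [ "$age_seconds" -le $(( max_age_hours * 3600 )) ]
}

# Example (illustrative timestamp, copied by hand from `restic snapshots` output):
#   snapshot_within_rpo "$(date -u -d '2026-01-10 03:00:00' +%s)" "$(date -u +%s)" \
#     || echo "latest snapshot is older than the RPO window" >&2
```

Passing both timestamps as arguments keeps the comparison testable without a live repository.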


1. What's Backed Up

| Component | Path | Backup Method | Restore Priority |
| --- | --- | --- | --- |
| PostgreSQL (forgejo) | /var/backup/postgresql/forgejo.sql.gz | pg_dump via timer | Critical |
| PostgreSQL (mautrix_slack) | /var/backup/postgresql/mautrix_slack.sql.gz | pg_dump via timer | Critical |
| Forgejo | /var/lib/forgejo | restic file backup | Critical |
| Matrix | /var/lib/matrix-continuwuity | restic file backup | High |
| Maubot | /var/lib/maubot | restic file backup | Medium |
| Slack Bridge | /var/lib/mautrix-slack | restic file backup | Medium |
| User Homes | /home/* | restic file backup | High |
| ACME Certs | /var/lib/acme | restic file backup | Medium |

What's NOT Backed Up (Reproducible via NixOS)

  • /nix/store - rebuilt from flake
  • /etc - generated from NixOS config
  • Service binaries - installed via Nix

Critical Items Stored Out-of-Band

| Item | Storage Location | Notes |
| --- | --- | --- |
| NixOS flake | GitHub mirror / local laptop | Self-hosted Forgejo may be dead |
| Restic password | Password manager + printed | Can't restore without it |
| B2 credentials | Password manager + printed | Can't access backups without it |
| Age key (sops) | /etc/ssh/ssh_host_ed25519_key on server | Derived from SSH host key |

2. Break Glass - Emergency Quick Reference

Print this page and store physically.

B2 Backup Access

Bucket: ops-jrz1-backup
Restic Repo: b2:ops-jrz1-backup
Key ID: [stored in password manager]
App Key: [stored in password manager]
Restic Password: [stored in password manager]

Minimal Restore Commands

# Set environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="[password]"
export B2_ACCOUNT_ID="[key-id]"
export B2_ACCOUNT_KEY="[app-key]"

# List snapshots
restic snapshots

# Restore everything to /tmp/restore
restic restore latest --target /tmp/restore

# Restore specific path
restic restore latest --target / --include /var/lib/forgejo

Service Start Order

1. postgresql
2. forgejo
3. mautrix-slack
4. matrix-continuwuity
5. maubot
6. nginx

Config Repository

Primary: git.clarun.xyz:dan/ops-jrz1.git (may be down)
Mirror: [ADD GITHUB MIRROR URL]
Local: ~/proj/ops-jrz1 on admin laptop

3. Restore Scenarios

Scenario A: Full Server Loss

When: Hardware failure, VPS provider issue, complete disk loss.

Time estimate: 2-4 hours

Phase 1: Bootstrap NixOS (30-60 min)

  1. Provision new VPS or boot NixOS installer on new hardware
  2. Partition disks, mount to /mnt
  3. Get NixOS flake from backup location (laptop, GitHub mirror)
  4. Critical: Restore the SSH host keys if you have them backed up, OR accept a new host identity (the sops secrets must then be re-encrypted for the new age key; see Phase 2)
# If restoring old host keys (preserves sops decryption):
mkdir -p /mnt/etc/ssh
# Copy ssh_host_* files from secure backup

# Install NixOS
nixos-install --flake /path/to/flake#ops-jrz1 --no-root-passwd
reboot

Phase 2: Restore Secrets (10-20 min)

If you had to generate new SSH host keys:

# Get new age public key
ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub

# Update .sops.yaml with new key on admin machine
# Re-encrypt secrets: sops updatekeys secrets/secrets.yaml
# Redeploy

Phase 3: Stop Services, Restore Data (30-90 min)

# Stop all services that use the data
systemctl stop forgejo mautrix-slack matrix-continuwuity maubot

# Set restic environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="..."
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."

# Restore PostgreSQL dumps
restic restore latest --target /tmp/restore --include /var/backup/postgresql

# Import databases
systemctl start postgresql
sudo -u postgres psql -c "DROP DATABASE IF EXISTS forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

sudo -u postgres psql -c "DROP DATABASE IF EXISTS mautrix_slack;"
sudo -u postgres psql -c "CREATE DATABASE mautrix_slack OWNER mautrix_slack;"
gunzip -c /tmp/restore/var/backup/postgresql/mautrix_slack.sql.gz | sudo -u postgres psql -d mautrix_slack

# Restore Forgejo data
rm -rf /var/lib/forgejo/*
restic restore latest --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo

# Restore Matrix data
rm -rf /var/lib/matrix-continuwuity/*
restic restore latest --target / --include /var/lib/matrix-continuwuity
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity

# Restore Maubot data
rm -rf /var/lib/maubot/*
restic restore latest --target / --include /var/lib/maubot
chown -R maubot:maubot /var/lib/maubot

# Restore Slack bridge data
rm -rf /var/lib/mautrix-slack/*
restic restore latest --target / --include /var/lib/mautrix-slack
chown -R mautrix-slack:mautrix-slack /var/lib/mautrix-slack

# Restore user home directories
restic restore latest --target / --include /home
# Permissions should be preserved by restic
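
Section 9 flags the `rm -rf` steps above as risky and recommends restoring to staging first. The sketch below shows one way to do that for a single path. It assumes the owning service is already stopped and that there is enough disk space to hold the old data alongside the restored copy. The helper name `restore_via_staging` and the `.pre-restore.<timestamp>` suffix are hypothetical.

```shell
# Hypothetical helper: restore one path through a staging directory and
# move the existing data aside instead of deleting it.
# usage: restore_via_staging <absolute_path> <owner[:group]> [snapshot_id]
restore_via_staging() {
  local path="$1" owner="$2" snapshot="${3:-latest}"
  local staging
  staging="$(mktemp -d /tmp/restore.XXXXXX)" || return 1

  restic restore "$snapshot" --target "$staging" --include "$path" || return 1

  # Refuse to touch live data if the snapshot produced nothing for this path.
  if [ -z "$(ls -A "$staging$path" 2>/dev/null)" ]; then
    echo "staging copy of $path is empty; live data left untouched" >&2
    return 1
  fi

  # Keep the old data until the restored service has been verified.
  if [ -e "$path" ]; then
    mv "$path" "$path.pre-restore.$(date +%s)" || return 1
  fi
  mkdir -p "$(dirname "$path")"
  cp -a "$staging$path" "$path" || return 1
  chown -R "$owner" "$path"
}

# usage: restore_via_staging /var/lib/forgejo forgejo:forgejo
```

Once the service passes the checks in Section 5, the `.pre-restore.*` copy and the staging directory can be removed by hand.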

Phase 4: Start Services and Verify (15-30 min)

# Start in dependency order
systemctl start forgejo
systemctl start mautrix-slack
systemctl start matrix-continuwuity
systemctl start maubot

# Check status
systemctl status forgejo mautrix-slack matrix-continuwuity maubot
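
Starting the units one at a time makes it easier to see which service failed first. A minimal sketch follows; the helper name `start_in_order` is hypothetical. Note that `systemctl is-active` straight after `systemctl start` only shows that the unit has not failed yet, so the status and log checks are still needed.

```shell
# Hypothetical helper: start units one at a time, stopping at the first failure.
# usage: start_in_order <unit> [<unit> ...]
start_in_order() {
  local unit
  for unit in "$@"; do
    systemctl start "$unit" || { echo "failed to start: $unit" >&2; return 1; }
    systemctl is-active --quiet "$unit" || { echo "not active after start: $unit" >&2; return 1; }
    echo "started: $unit"
  done
}

# usage, following the order used in this phase:
#   start_in_order forgejo mautrix-slack matrix-continuwuity maubot
```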

See Section 5 for verification checklist.


Scenario B: Single Service Corruption

When: One service's data is corrupted but server is otherwise fine.

Time estimate: 15-60 min

Example: Matrix RocksDB Corruption

# Stop the service
systemctl stop matrix-continuwuity

# Find available snapshots
restic snapshots

# Restore to temp location first (safer)
restic restore latest --target /tmp/restore --include /var/lib/matrix-continuwuity

# Verify it looks reasonable
ls -la /tmp/restore/var/lib/matrix-continuwuity/

# Replace corrupted data
rm -rf /var/lib/matrix-continuwuity/*
cp -a /tmp/restore/var/lib/matrix-continuwuity/* /var/lib/matrix-continuwuity/
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity

# Start and verify
systemctl start matrix-continuwuity
journalctl -u matrix-continuwuity -f

Example: PostgreSQL Database Corruption

# Stop dependent services
systemctl stop forgejo mautrix-slack

# Restore dump
restic restore latest --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz

# Drop and recreate
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

# Restart everything that was stopped above
systemctl start forgejo mautrix-slack
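
The `DROP DATABASE` step destroys the live database before the dump has been shown to be readable. Below is a minimal sketch of a pre-check that could run between the restic restore and the drop. The helper name `verify_dump` is hypothetical. It only confirms that the file is an intact, non-empty gzip stream, not that the SQL inside is complete.

```shell
# Hypothetical helper: succeed only if the dump is a readable, non-empty gzip file.
# usage: verify_dump <path/to/dump.sql.gz>
verify_dump() {
  local dump="$1"
  [ -s "$dump" ] || { echo "missing or empty: $dump" >&2; return 1; }
  gzip -t "$dump" 2>/dev/null || { echo "corrupt gzip: $dump" >&2; return 1; }
  # The decompressed stream must contain at least one byte of SQL.
  [ "$(gunzip -c "$dump" | head -c 1 | wc -c)" -gt 0 ] \
    || { echo "dump decompresses to nothing: $dump" >&2; return 1; }
}

# usage:
#   verify_dump /tmp/restore/var/backup/postgresql/forgejo.sql.gz \
#     && sudo -u postgres psql -c "DROP DATABASE forgejo;"
```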

Scenario C: User Deleted Their Work

When: A developer accidentally ran rm -rf on their project or home directory.

Time estimate: 5-20 min

# Find what snapshots are available
restic snapshots

# Browse a snapshot to find the data
restic ls latest /home/USERNAME/

# Restore specific directory to temp location
restic restore latest --target /tmp/restore --include /home/USERNAME/project-name

# Copy what the user needs back into place
cp -a /tmp/restore/home/USERNAME/project-name /home/USERNAME/
chown -R USERNAME:users /home/USERNAME/project-name

# Or restore entire home directory
restic restore latest --target / --include /home/USERNAME
chown -R USERNAME:users /home/USERNAME

Point-in-Time Restore

# List snapshots with dates
restic snapshots

# Restore from specific snapshot (not latest)
restic restore abc123def --target /tmp/restore --include /home/USERNAME
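
Choosing the snapshot ID by eye is error-prone when the list is long. The sketch below picks the newest snapshot taken at or before a cutoff. It assumes the default table output of `restic snapshots` (an 8-character hex ID, then date and time columns) and compares timestamps as plain strings in whatever timezone restic prints. The helper name `snapshot_before` and the sample cutoff are hypothetical.

```shell
# Hypothetical helper: print the ID of the newest snapshot at or before a cutoff.
# Reads the table printed by `restic snapshots` on stdin.
# usage: restic snapshots | snapshot_before "YYYY-MM-DD HH:MM:SS"
snapshot_before() {
  awk -v cutoff="$1" '
    # Data rows start with a short hex snapshot ID, followed by date and time.
    $1 ~ /^[0-9a-f]+$/ && length($1) == 8 {
      stamp = $2 " " $3
      if (stamp <= cutoff && stamp > best) { best = stamp; id = $1 }
    }
    END { if (id == "") exit 1; print id }
  '
}

# usage:
#   id="$(restic snapshots | snapshot_before '2026-01-09 12:00:00')" \
#     && restic restore "$id" --target /tmp/restore --include /home/USERNAME
```

The function exits non-zero when no snapshot is old enough, so the restore in the usage example does not run with an empty ID.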

Scenario D: Single Forgejo Repo Deleted

When: A git repository was deleted from Forgejo.

Time estimate: 10-30 min

Challenge: Forgejo database and filesystem must be in sync.

Option 1: Restore Just the Git Data (if DB record exists)

# Find repo path - usually /var/lib/forgejo/repositories/USERNAME/REPO.git
restic ls latest /var/lib/forgejo/repositories/

# Restore repo
restic restore latest --target /tmp/restore --include /var/lib/forgejo/repositories/USERNAME/REPO.git

# Copy into place
cp -a /tmp/restore/var/lib/forgejo/repositories/USERNAME/REPO.git /var/lib/forgejo/repositories/USERNAME/
chown -R forgejo:forgejo /var/lib/forgejo/repositories/USERNAME/REPO.git

# Regenerate hooks
sudo -u forgejo forgejo admin regenerate hooks

Option 2: Full Forgejo Restore (if DB record was also deleted)

Need to restore both database and filesystem to same point in time:

systemctl stop forgejo

# Restore database
restic restore SNAPSHOT_ID --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

# Restore filesystem
rm -rf /var/lib/forgejo/*
restic restore SNAPSHOT_ID --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo

systemctl start forgejo

4. Restore Commands Reference

Environment Setup

export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD_FILE="/run/secrets/restic/password"  # if on server
# OR
export RESTIC_PASSWORD="your-password-here"  # if restoring from scratch
export B2_ACCOUNT_ID="your-key-id"
export B2_ACCOUNT_KEY="your-app-key"
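
A missing variable tends to surface as a confusing restic or B2 error partway through a restore. A minimal sketch of an up-front check over the variables listed above; the helper name `require_restic_env` is hypothetical.

```shell
# Hypothetical helper: fail early if the restic/B2 environment is incomplete.
require_restic_env() {
  local missing=0 name value
  for name in RESTIC_REPOSITORY B2_ACCOUNT_ID B2_ACCOUNT_KEY; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  # Either the password itself or a password file is enough.
  if [ -z "${RESTIC_PASSWORD:-}" ] && [ -z "${RESTIC_PASSWORD_FILE:-}" ]; then
    echo "missing: RESTIC_PASSWORD or RESTIC_PASSWORD_FILE" >&2
    missing=1
  fi
  [ "$missing" -eq 0 ]
}

# usage: require_restic_env && restic snapshots
```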

Common Operations

# List all snapshots
restic snapshots

# List only snapshots carrying a given tag
restic snapshots --tag ops-jrz1

# Browse snapshot contents
restic ls latest
restic ls latest /var/lib/forgejo

# Restore everything
restic restore latest --target /

# Restore specific path
restic restore latest --target / --include /var/lib/forgejo

# Restore to different location
restic restore latest --target /tmp/restore --include /home/dan

# Restore specific snapshot (not latest)
restic restore abc123de --target /tmp/restore

# Mount backup as filesystem (for browsing)
mkdir -p /mnt/restic
restic mount /mnt/restic
# Browse /mnt/restic/snapshots/latest/...
# Ctrl+C to unmount

# Check backup integrity
restic check
restic check --read-data  # slower, verifies all data

5. Verification Checklist

After Full Restore

Infrastructure

  • SSH access works
  • DNS resolves correctly
  • HTTPS certificates valid (may need systemctl start acme-clarun.xyz)

PostgreSQL

  • systemctl status postgresql - active
  • sudo -u postgres psql -c "\l" - lists forgejo, mautrix_slack databases
  • No errors in journalctl -u postgresql
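
A database can be listed by `\l` and still be empty. The first restore drill recorded 112 tables for forgejo and 32 for mautrix_slack, so a table count makes a cheap sanity check. The sketch below assumes the tables live in the `public` schema and that the drill counted them the same way; both are assumptions. The helper name `check_table_count` is hypothetical.

```shell
# Hypothetical helper: succeed if a database has at least the expected number of tables.
# usage: check_table_count <database> <minimum_tables>
check_table_count() {
  local db="$1" minimum="$2" count
  count="$(sudo -u postgres psql -At -d "$db" -c \
    "SELECT count(*) FROM information_schema.tables WHERE table_schema = 'public' AND table_type = 'BASE TABLE';")" || return 1
  echo "$db: $count tables (expected at least $minimum)"
  [ "$count" -ge "$minimum" ]
}

# usage:
#   check_table_count forgejo 112
#   check_table_count mautrix_slack 32
```

The table counts change as the services are upgraded, so treat the minimums as rough lower bounds rather than exact values.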

Forgejo

  • systemctl status forgejo - active
  • Web UI loads at https://git.clarun.xyz
  • Can log in
  • Repositories visible and browsable
  • Can clone a repo: git clone git@git.clarun.xyz:org/repo.git
  • Can push to a repo

Matrix

  • systemctl status matrix-continuwuity - active
  • No RocksDB errors in journalctl -u matrix-continuwuity
  • Can log in with Matrix client
  • Can send/receive messages
  • Old messages visible

Maubot

  • systemctl status maubot - active
  • Web UI accessible via SSH tunnel (port 29316)
  • Bots responding

Slack Bridge

  • systemctl status mautrix-slack - active
  • Bridge connected (check logs)
  • Messages flowing both directions

User Home Directories

  • Users can SSH in
  • User files present
  • Permissions correct

After Partial Restore

  • Restored service starts without errors
  • Basic functionality works
  • No data from "future" (if restoring older snapshot)

6. Time Estimates

| Scenario | Download Size | Estimated Time |
| --- | --- | --- |
| Full server restore | ~15 GB | 2-4 hours |
| Single service (Matrix) | ~2 GB | 15-45 min |
| Single service (Forgejo) | ~5 GB | 20-60 min |
| Single user home | ~1 GB | 5-15 min |
| Single git repo | ~100 MB | 5-10 min |
| PostgreSQL DB only | ~50 MB | 10-20 min |

Times assume 100 Mbps download from B2. Actual times depend on network speed and data size.
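
To re-derive the transfer portion of these estimates for a different link speed, the arithmetic can be sketched as follows. The sketch uses decimal units (1 GB = 8000 megabits) and ignores protocol overhead and B2 throttling. The helper name `download_minutes` is hypothetical.

```shell
# Hypothetical helper: estimate raw download time in whole minutes.
# usage: download_minutes <size_gb> <link_mbps>
download_minutes() {
  awk -v gb="$1" -v mbps="$2" 'BEGIN { printf "%.0f\n", gb * 8000 / mbps / 60 }'
}

download_minutes 15 100   # full server restore, ~15 GB; prints 20
```

At 100 Mbps the ~15 GB download is roughly 20 minutes of raw transfer, so most of the 2-4 hour full-restore estimate comes from provisioning, database import, and verification.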


7. Quarterly Restore Drill

Schedule: First Sunday of each quarter

Procedure

  1. Spin up test VM (or use local NixOS VM)
  2. Attempt full restore procedure
  3. Run verification checklist
  4. Document:
    • Actual time taken
    • Any issues encountered
    • Runbook updates needed
  5. Destroy test VM

Success Criteria

  • NixOS boots with config
  • PostgreSQL databases restore and pass basic queries
  • Forgejo UI loads and repos are accessible
  • Matrix client can connect and see history
  • At least one user home directory restored with correct permissions

8. Known Limitations

RocksDB Consistency

The Matrix homeserver uses RocksDB, which is sensitive to backups taken while files are being written. The current backup runs while the service is active. For guaranteed consistency, the backup should do one of the following:

  • Stop service before backup, OR
  • Use RocksDB checkpoint feature, OR
  • Use filesystem snapshots (ZFS/btrfs)

Current risk: Low probability of corrupted Matrix restore.
Mitigation: verify RocksDB opens without errors after restore.
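
The first option (stop the service before the backup) can be sketched as a small wrapper. How the real backup job is wired in backup-b2.nix is not shown in this runbook, so the backup command is passed in as arguments. The helper name `backup_with_service_stopped` is hypothetical.

```shell
# Hypothetical helper: stop a service, run a backup command, then restart the
# service even if the backup failed. Returns the backup command's exit status.
# usage: backup_with_service_stopped <unit> <command> [args...]
backup_with_service_stopped() {
  local service="$1"; shift
  systemctl stop "$service" || return 1
  "$@"
  local status=$?
  systemctl start "$service" || return 1
  return "$status"
}

# usage:
#   backup_with_service_stopped matrix-continuwuity \
#     restic backup /var/lib/matrix-continuwuity
```

The trade-off is that Matrix is down for as long as the backup runs, which is why the checkpoint and filesystem-snapshot options are listed as alternatives.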

Point-in-Time Recovery

Restic provides daily snapshots, not continuous backup, so restores are limited to the times at which snapshots were taken. PostgreSQL point-in-time recovery (PITR) would require WAL archiving, which is not currently configured.

User Home Directory Backup

User home directories (/home/*) are now included in the backup (see Section 1). The gap is marked FIXED in Section 9, and the first restore drill covered user home directories.

Large File Handling

Forgejo LFS objects and large repos may take significant time to restore. Consider whether to exclude LFS from regular backups and handle separately.


9. Known Gaps and TODOs

Critical - Must Fix Before Relying on This Runbook:

| Gap | Risk | Fix | Status |
| --- | --- | --- | --- |
| /home/* not backed up | User work lost forever | Add to backup-b2.nix paths | FIXED |
| /var/lib/acme not backed up | Let's Encrypt rate limit | Add to backup-b2.nix paths | FIXED |
| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook | Deferred (y8le) |
| Sops key tied to SSH host key only | Lose host key = lose all secrets | Add offline recovery age key | FIXED (93q9) |
| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub | Deferred (jboq) |
| rm -rf in restore steps | Wrong snapshot = data destroyed | Always restore to staging first | Docs only |

Medium Priority:

| Gap | Risk | Fix | Status |
| --- | --- | --- | --- |
| PostgreSQL version not pinned | Version mismatch on restore | Pin pkgs.postgresql_15 | FIXED |
| Dynamic UIDs | Permission errors after restore | Static UIDs for service users | |
| DNS provider not documented | Can't update IP on new VPS | Document in break glass section | |
| No backup monitoring | Silent failures for days | Add healthchecks.io integration | |
| Postgres roles/extensions | Restore may fail | Include pg_dumpall --globals-only | |

Nice to Have:

| Gap | Improvement |
| --- | --- |
| Manual restore steps | Create restore script |
| No immutable backups | Enable B2 Object Lock |
| No second backup location | Replicate to second provider |

10. Runbook Maintenance

  • Owner: dan
  • Last updated: 2026-01-11
  • Last drill: 2026-01-11 (restore test passed)
  • Next review: After NixOS 24.11 upgrade

Change Log

| Date | Change |
| --- | --- |
| 2026-01-11 | First restore drill - all tests passed |
| 2026-01-11 | Fixed /var/backup permissions (postgres couldn't traverse) |
| 2026-01-10 | Initial draft |

Appendix A: File Paths Reference

# PostgreSQL dumps (created by services.postgresqlBackup)
/var/backup/postgresql/forgejo.sql.gz
/var/backup/postgresql/mautrix_slack.sql.gz

# Forgejo
/var/lib/forgejo/
├── conf/
├── data/
│   ├── avatars/
│   ├── attachments/
│   └── lfs/
├── repositories/
│   └── USERNAME/
│       └── REPO.git/
└── gitea.db  (if using SQLite, but we use PostgreSQL)

# Matrix (Continuwuity with RocksDB)
/var/lib/matrix-continuwuity/
├── db/           # RocksDB database
└── media/        # Uploaded media

# Maubot
/var/lib/maubot/
├── plugins/
├── trash/
└── maubot.db     # SQLite database

# Slack Bridge
/var/lib/mautrix-slack/
└── registration.yaml

# User homes
/home/USERNAME/
├── .ssh/
├── .config/
├── .npm-global/
└── [user projects]

# Secrets (runtime, not backed up - regenerated from sops)
/run/secrets/
├── matrix-registration-token
├── maubot-admin-password
├── restic/password
└── ...

Appendix B: Service Dependencies

                    ┌─────────────┐
                    │ postgresql  │
                    └──────┬──────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
           ▼               ▼               ▼
    ┌──────────┐    ┌─────────────┐  ┌──────────┐
    │ forgejo  │    │mautrix-slack│  │  maubot  │
    └──────────┘    └──────┬──────┘  └────┬─────┘
                           │              │
                           ▼              ▼
                    ┌─────────────────────────┐
                    │  matrix-continuwuity    │
                    └─────────────────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │    nginx    │
                    └─────────────┘