Dan b62f649a28 Add disaster recovery runbook draft

Documents restore procedures for full server loss, partial restore,
and user data recovery scenarios. Includes verification checklists,
time estimates, and break-glass quick reference.

Also documents known gaps (home dirs, ACME, RocksDB consistency)
that need fixing before the runbook is production-ready.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

2026-01-10 14:02:01 -08:00


Disaster Recovery Runbook - ops-jrz1

Overview

This runbook covers restore procedures for ops-jrz1, a NixOS homelab server running Matrix, Forgejo, and supporting services. Backups are stored in Backblaze B2 using restic.

Recovery Time Objective (RTO): 2-6 hours for full restore

Recovery Point Objective (RPO): 24 hours (daily backups at 3 AM UTC)
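
A minimal sketch for checking the 24-hour RPO against the newest snapshot. The function name is hypothetical, and it assumes GNU `date` and an ISO 8601 timestamp such as the `time` field printed by `restic snapshots --json`.

```shell
# Returns 0 if the snapshot is no older than max_hours, 1 if it is older,
# 2 if a timestamp cannot be parsed. Assumes GNU date (for -d).
snapshot_within_rpo() {
    snapshot_time=$1    # e.g. 2026-01-10T03:00:00Z
    now_time=$2         # e.g. output of: date -u +%Y-%m-%dT%H:%M:%SZ
    max_hours=${3:-24}  # RPO stated in this runbook
    snap_s=$(date -u -d "$snapshot_time" +%s) || return 2
    now_s=$(date -u -d "$now_time" +%s) || return 2
    age_s=$((now_s - snap_s))
    [ "$age_s" -le $((max_hours * 3600)) ]
}
```

Example: `snapshot_within_rpo 2026-01-10T03:00:00Z "$(date -u +%Y-%m-%dT%H:%M:%SZ)" || echo "latest snapshot is older than the RPO"`.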

1. What's Backed Up

Component	Path	Backup Method	Restore Priority
PostgreSQL (forgejo)	`/var/backup/postgresql/forgejo.sql.gz`	pg_dump via timer	Critical
PostgreSQL (mautrix_slack)	`/var/backup/postgresql/mautrix_slack.sql.gz`	pg_dump via timer	Critical
Forgejo	`/var/lib/forgejo`	restic file backup	Critical
Matrix	`/var/lib/matrix-continuwuity`	restic file backup	High
Maubot	`/var/lib/maubot`	restic file backup	Medium
Slack Bridge	`/var/lib/mautrix-slack`	restic file backup	Medium
User Homes	`/home/*`	NOT YET BACKED UP	High

What's NOT Backed Up (Reproducible via NixOS)

/nix/store - rebuilt from flake
/etc - generated from NixOS config
Service binaries - installed via Nix

Critical Items Stored Out-of-Band

Item	Storage Location	Notes
NixOS flake	GitHub mirror / local laptop	Self-hosted Forgejo may be dead
Restic password	Password manager + printed	Can't restore without it
B2 credentials	Password manager + printed	Can't access backups without it
Age key (sops)	`/etc/ssh/ssh_host_ed25519_key` on server	Derived from SSH host key; currently has no off-server copy (see Section 9)

2. Break Glass - Emergency Quick Reference

Print this page and store physically.

B2 Backup Access

Bucket: ops-jrz1-backup
Restic Repo: b2:ops-jrz1-backup
Key ID: [stored in password manager]
App Key: [stored in password manager]
Restic Password: [stored in password manager]

Minimal Restore Commands

# Set environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="[password]"
export B2_ACCOUNT_ID="[key-id]"
export B2_ACCOUNT_KEY="[app-key]"

# List snapshots
restic snapshots

# Restore everything to /tmp/restore
restic restore latest --target /tmp/restore

# Restore specific path
restic restore latest --target / --include /var/lib/forgejo

Service Start Order

1. postgresql
2. forgejo
3. mautrix-slack
4. matrix-continuwuity
5. maubot
6. nginx
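
The order above could be scripted roughly as follows. This is a sketch, not part of the deployed config; `start_in_order` is a hypothetical helper that stops at the first failure so later services are not started against a broken dependency.

```shell
# Start each unit in the order given; abort on the first failure.
start_in_order() {
    for unit in "$@"; do
        echo "starting $unit"
        if ! systemctl start "$unit"; then
            echo "failed to start $unit; aborting" >&2
            return 1
        fi
    done
}

# Order taken from the list above:
# start_in_order postgresql forgejo mautrix-slack matrix-continuwuity maubot nginx
```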

Config Repository

Primary: git.clarun.xyz:dan/ops-jrz1.git (may be down)
Mirror: [ADD GITHUB MIRROR URL]
Local: ~/proj/ops-jrz1 on admin laptop

3. Restore Scenarios

Scenario A: Full Server Loss

When: Hardware failure, VPS provider issue, complete disk loss.

Time estimate: 2-4 hours

Phase 1: Bootstrap NixOS (30-60 min)

Provision new VPS or boot NixOS installer on new hardware
Partition disks, mount to /mnt
Get NixOS flake from backup location (laptop, GitHub mirror)
Critical: Restore SSH host keys if you have them backed up, OR accept new host identity (will need to re-encrypt sops secrets with new age key)

# If restoring old host keys (preserves sops decryption):
mkdir -p /mnt/etc/ssh
# Copy ssh_host_* files from secure backup

# Install NixOS
nixos-install --flake /path/to/flake#ops-jrz1 --no-root-passwd
reboot
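
The "copy ssh_host_* files" step might look like the sketch below. `restore_host_keys` and the source directory are hypothetical; the modes (0600 for private keys, 0644 for `.pub` files) are the usual ones sshd expects. It assumes the coreutils `install` command is available in the installer environment.

```shell
# Copy saved host keys into the target root with restrictive modes.
restore_host_keys() {
    src=$1    # directory holding the saved ssh_host_* files (hypothetical)
    dest=$2   # e.g. /mnt/etc/ssh
    mkdir -p "$dest" || return 1
    for key in "$src"/ssh_host_*; do
        [ -f "$key" ] || continue        # skip if the glob matched nothing
        case $key in
            *.pub) mode=644 ;;           # public keys may be world-readable
            *)     mode=600 ;;           # private keys must not be
        esac
        install -m "$mode" "$key" "$dest/" || return 1
    done
}

# restore_host_keys /path/to/saved-keys /mnt/etc/ssh
```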

Phase 2: Restore Secrets (10-20 min)

If you had to generate new SSH host keys:

# Get new age public key
ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub

# Update .sops.yaml with new key on admin machine
# Re-encrypt secrets (requires an existing recipient key that can still decrypt them):
#   sops updatekeys secrets/secrets.yaml
# Redeploy

Phase 3: Stop Services, Restore Data (30-90 min)

# Stop all services that use the data
systemctl stop forgejo mautrix-slack matrix-continuwuity maubot

# Set restic environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="..."
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."

# Restore PostgreSQL dumps
restic restore latest --target /tmp/restore --include /var/backup/postgresql

# Import databases
systemctl start postgresql
sudo -u postgres psql -c "DROP DATABASE IF EXISTS forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

sudo -u postgres psql -c "DROP DATABASE IF EXISTS mautrix_slack;"
sudo -u postgres psql -c "CREATE DATABASE mautrix_slack OWNER mautrix_slack;"
gunzip -c /tmp/restore/var/backup/postgresql/mautrix_slack.sql.gz | sudo -u postgres psql -d mautrix_slack

# Restore Forgejo data
rm -rf /var/lib/forgejo/*
restic restore latest --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo

# Restore Matrix data
rm -rf /var/lib/matrix-continuwuity/*
restic restore latest --target / --include /var/lib/matrix-continuwuity
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity

# Restore Maubot data
rm -rf /var/lib/maubot/*
restic restore latest --target / --include /var/lib/maubot
chown -R maubot:maubot /var/lib/maubot

# Restore Slack bridge data
rm -rf /var/lib/mautrix-slack/*
restic restore latest --target / --include /var/lib/mautrix-slack
chown -R mautrix-slack:mautrix-slack /var/lib/mautrix-slack

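# Restore user home directories
# NOTE: /home is not yet in the backup set (see Section 8); this step
# only works once it has been added to backup-b2.nix.
restic restore latest --target / --include /home
# Permissions should be preserved by restic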
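
Section 9 flags the `rm -rf` steps above as a risk. One way to avoid them is to restore to a staging directory and swap it in only if it has content. A sketch under that assumption follows; `swap_in_restored` is a hypothetical helper, and the `.pre-restore.*` suffix is an arbitrary naming choice.

```shell
# Move the live directory aside and move the staged restore into place.
# Refuses to act if the staged copy is missing or empty, so a wrong
# snapshot or failed restore cannot wipe live data.
swap_in_restored() {
    staged=$1   # e.g. /var/lib/restore-staging/var/lib/forgejo
    live=$2     # e.g. /var/lib/forgejo
    if [ ! -d "$staged" ] || [ -z "$(ls -A "$staged")" ]; then
        echo "staged restore $staged is missing or empty; not touching $live" >&2
        return 1
    fi
    if [ -e "$live" ]; then
        mv "$live" "$live.pre-restore.$(date +%Y%m%d%H%M%S)" || return 1
    fi
    mv "$staged" "$live"
}
```

Staging on the same filesystem as the live path keeps both `mv` calls a plain rename. The old data stays in the `.pre-restore.*` directory until it is removed by hand, and the `chown -R` step still applies afterwards.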

Phase 4: Start Services and Verify (15-30 min)

# Start in dependency order
systemctl start forgejo
systemctl start mautrix-slack
systemctl start matrix-continuwuity
systemctl start maubot

# Check status
systemctl status forgejo mautrix-slack matrix-continuwuity maubot

See Section 5 for verification checklist.

Scenario B: Single Service Corruption

When: One service's data is corrupted but server is otherwise fine.

Time estimate: 15-60 min

Example: Matrix RocksDB Corruption

# Stop the service
systemctl stop matrix-continuwuity

# Find available snapshots
restic snapshots

# Restore to temp location first (safer)
restic restore latest --target /tmp/restore --include /var/lib/matrix-continuwuity

# Verify it looks reasonable
ls -la /tmp/restore/var/lib/matrix-continuwuity/

# Replace corrupted data
rm -rf /var/lib/matrix-continuwuity/*
cp -a /tmp/restore/var/lib/matrix-continuwuity/* /var/lib/matrix-continuwuity/
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity

# Start and verify
systemctl start matrix-continuwuity
journalctl -u matrix-continuwuity -f
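
The "verify it looks reasonable" step can go a little further than `ls`. A RocksDB directory contains a `CURRENT` file naming the active `MANIFEST-*` file, so a cheap check is that both exist. This sketch assumes the RocksDB files sit directly in the `db/` subdirectory shown in Appendix A; `looks_like_rocksdb` is a hypothetical helper. Passing the check does not prove the database will open.

```shell
# Returns 0 if the directory has a non-empty CURRENT file that points at
# an existing manifest file; 1 otherwise.
looks_like_rocksdb() {
    db_dir=$1
    [ -s "$db_dir/CURRENT" ] || return 1
    manifest=$(head -n 1 "$db_dir/CURRENT")   # e.g. MANIFEST-000005
    [ -f "$db_dir/$manifest" ]
}

# looks_like_rocksdb /tmp/restore/var/lib/matrix-continuwuity/db \
#     || echo "restored copy looks incomplete; try an older snapshot"
```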

Example: PostgreSQL Database Corruption

# Stop dependent services
systemctl stop forgejo mautrix-slack

# Restore dump
restic restore latest --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz

# Drop and recreate
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

# Restart everything that was stopped above
systemctl start forgejo mautrix-slack
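
Because the database is dropped before the dump is imported, it may be worth checking the dump first. The sketch below assumes the dump is gzip-compressed plain-text `pg_dump` output, which ends with a "PostgreSQL database dump complete" comment; `dump_is_usable` is a hypothetical helper.

```shell
# Returns 0 only if the file exists, is non-empty, passes the gzip
# integrity test, and carries the pg_dump completion marker near its end.
dump_is_usable() {
    dump=$1
    [ -s "$dump" ] || return 1
    gzip -t "$dump" 2>/dev/null || return 1
    gzip -dc "$dump" | tail -n 10 | grep -q "PostgreSQL database dump complete"
}

# dump_is_usable /tmp/restore/var/backup/postgresql/forgejo.sql.gz \
#     || { echo "dump failed checks; do not drop the database" >&2; exit 1; }
```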

Scenario C: User Deleted Their Work

When: Dev accidentally rm -rf'd their project or home directory. Requires /home to be in the backup set, which is not yet the case (see Section 8).

Time estimate: 5-20 min

# Find what snapshots are available
restic snapshots

# Browse a snapshot to find the data
restic ls latest /home/USERNAME/

# Restore specific directory to temp location
restic restore latest --target /tmp/restore --include /home/USERNAME/project-name

# Let user copy what they need
cp -a /tmp/restore/home/USERNAME/project-name /home/USERNAME/
chown -R USERNAME:users /home/USERNAME/project-name

# Or restore entire home directory
restic restore latest --target / --include /home/USERNAME
chown -R USERNAME:users /home/USERNAME

Point-in-Time Restore

# List snapshots with dates
restic snapshots

# Restore from specific snapshot (not latest)
restic restore abc123def --target /tmp/restore --include /home/USERNAME

Scenario D: Single Forgejo Repo Deleted

When: A git repository was deleted from Forgejo.

Time estimate: 10-30 min

Challenge: Forgejo database and filesystem must be in sync.

Option 1: Restore Just the Git Data (if DB record exists)

# Find repo path - usually /var/lib/forgejo/repositories/USERNAME/REPO.git
restic ls latest /var/lib/forgejo/repositories/

# Restore repo
restic restore latest --target /tmp/restore --include /var/lib/forgejo/repositories/USERNAME/REPO.git

# Copy into place
cp -a /tmp/restore/var/lib/forgejo/repositories/USERNAME/REPO.git /var/lib/forgejo/repositories/USERNAME/
chown -R forgejo:forgejo /var/lib/forgejo/repositories/USERNAME/REPO.git

# Regenerate hooks
sudo -u forgejo forgejo admin regenerate hooks

Option 2: Full Forgejo Restore (if DB record was also deleted)

Need to restore both database and filesystem to same point in time:

systemctl stop forgejo

# Restore database
restic restore SNAPSHOT_ID --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

# Restore filesystem
rm -rf /var/lib/forgejo/*
restic restore SNAPSHOT_ID --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo

systemctl start forgejo

4. Restore Commands Reference

Environment Setup

export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD_FILE="/run/secrets/restic/password"  # if on server
# OR
export RESTIC_PASSWORD="your-password-here"  # if restoring from scratch
export B2_ACCOUNT_ID="your-key-id"
export B2_ACCOUNT_KEY="your-app-key"

Common Operations

# List all snapshots
restic snapshots

# List snapshots with tags
restic snapshots --tag ops-jrz1

# Browse snapshot contents
restic ls latest
restic ls latest /var/lib/forgejo

# Restore everything
restic restore latest --target /

# Restore specific path
restic restore latest --target / --include /var/lib/forgejo

# Restore to different location
restic restore latest --target /tmp/restore --include /home/dan

# Restore specific snapshot (not latest)
restic restore abc123de --target /tmp/restore

# Mount backup as filesystem (for browsing)
mkdir -p /mnt/restic
restic mount /mnt/restic
# Browse /mnt/restic/snapshots/latest/...
# Ctrl+C to unmount

# Check backup integrity
restic check
restic check --read-data  # slower, verifies all data

5. Verification Checklist

After Full Restore

Infrastructure

SSH access works
DNS resolves correctly
HTTPS certificates valid (may need systemctl start acme-clarun.xyz)

PostgreSQL

systemctl status postgresql - active
sudo -u postgres psql -c "\l" - lists forgejo, mautrix_slack databases
No errors in journalctl -u postgresql

Forgejo

systemctl status forgejo - active
Web UI loads at https://git.clarun.xyz
Can log in
Repositories visible and browsable
Can clone a repo: git clone git@git.clarun.xyz:org/repo.git
Can push to a repo

Matrix

systemctl status matrix-continuwuity - active
No RocksDB errors in journalctl -u matrix-continuwuity
Can log in with Matrix client
Can send/receive messages
Old messages visible

Maubot

systemctl status maubot - active
Web UI accessible via SSH tunnel (port 29316)
Bots responding

Slack Bridge

systemctl status mautrix-slack - active
Bridge connected (check logs)
Messages flowing both directions

User Home Directories

Users can SSH in
User files present
Permissions correct
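
The `systemctl status ... - active` items above can be checked in one pass. A sketch follows; `check_services` is a hypothetical helper. It covers only unit state, not the functional checks such as login, clone, or message flow.

```shell
# Print one line per unit and return non-zero if any unit is not active.
check_services() {
    failed=0
    for unit in "$@"; do
        # is-active exits non-zero for inactive units; keep going regardless
        state=$(systemctl is-active "$unit" 2>/dev/null) || true
        if [ "$state" = "active" ]; then
            echo "ok    $unit"
        else
            echo "FAIL  $unit (${state:-unknown})"
            failed=1
        fi
    done
    return "$failed"
}

# check_services postgresql forgejo matrix-continuwuity maubot mautrix-slack
```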

After Partial Restore

Restored service starts without errors
Basic functionality works
No data newer than the restored snapshot remains (if restoring an older snapshot)

6. Time Estimates

Scenario	Download Size	Estimated Time
Full server restore	~15 GB	2-4 hours
Single service (Matrix)	~2 GB	15-45 min
Single service (Forgejo)	~5 GB	20-60 min
Single user home	~1 GB	5-15 min
Single git repo	~100 MB	5-10 min
PostgreSQL DB only	~50 MB	10-20 min

Times assume 100 Mbps download from B2. Actual times depend on network speed and data size.
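
For reference, the raw transfer time at a given link speed can be worked out as below. This is plain arithmetic using decimal units (1 GB = 8000 megabits); `transfer_minutes` is a hypothetical helper. At 100 Mbps the 15 GB full restore downloads in about 20 minutes, so most of the 2-4 hour estimate is procedure rather than transfer.

```shell
# Whole minutes (rounded up) to download size_gb at mbps.
transfer_minutes() {
    size_gb=$1
    mbps=$2
    seconds=$(( size_gb * 8000 / mbps ))   # 1 GB = 8000 megabits
    echo $(( (seconds + 59) / 60 ))
}

# transfer_minutes 15 100   → 20
# transfer_minutes 5 100    → 7
```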

7. Quarterly Restore Drill

Schedule: First Sunday of each quarter

Procedure

Spin up test VM (or use local NixOS VM)
Attempt full restore procedure
Run verification checklist
Document:
- Actual time taken
- Any issues encountered
- Runbook updates needed
Destroy test VM

Success Criteria

NixOS boots with config
PostgreSQL databases restore and pass basic queries
Forgejo UI loads and repos are accessible
Matrix client can connect and see history
At least one user home directory restored with correct permissions

8. Known Limitations

RocksDB Consistency

The Matrix homeserver uses RocksDB, which is sensitive to incomplete backups. The current backup runs while the service is active. For guaranteed consistency, the backup should do one of the following:

Stop service before backup, OR
Use RocksDB checkpoint feature, OR
Use filesystem snapshots (ZFS/btrfs)

Current risk: Low probability of corrupted Matrix restore. Mitigation: verify RocksDB opens without errors after restore.
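
The first option (stop the service before backup) could be sketched as a wrapper like the one below. This is an illustration only: `backup_with_service_stopped` is a hypothetical helper, not what `backup-b2.nix` currently does. It restarts the service even when the backup command fails.

```shell
# Stop a unit, run the given command, then always start the unit again.
# Returns the command's exit status, or 1 if stop/start itself fails.
backup_with_service_stopped() {
    unit=$1
    shift
    systemctl stop "$unit" || return 1
    status=0
    "$@" || status=$?                    # run the backup command
    systemctl start "$unit" || return 1  # restart even if the backup failed
    return "$status"
}

# backup_with_service_stopped matrix-continuwuity \
#     restic backup /var/lib/matrix-continuwuity
```

The trade-off is a short Matrix outage during each nightly backup.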

Point-in-Time Recovery

Restic provides daily snapshots, not continuous backup, so data can only be restored to the time of a snapshot, not to an arbitrary point in time. PostgreSQL point-in-time recovery (PITR) would need WAL archiving, which is not currently configured.

User Home Directory Backup

TODO: User home directories (/home/*) are not currently included in backup. Need to add to backup-b2.nix.

Large File Handling

Forgejo LFS objects and large repos may take significant time to restore. Consider whether to exclude LFS from regular backups and handle separately.

9. Known Gaps and TODOs

Critical - Must Fix Before Relying on This Runbook:

Gap	Risk	Fix
`/home/*` not backed up	User work lost forever	Add to backup-b2.nix paths
`/var/lib/acme` not backed up	Let's Encrypt rate limit (7 days no HTTPS)	Add to backup-b2.nix paths
RocksDB backed up while running	Corrupt Matrix restore	Stop service in pre-backup hook
Sops key tied to SSH host key only	Lose host key = lose all secrets	Add offline recovery age key
Flake only on self-hosted Forgejo	Can't restore if Forgejo is dead	Mirror to GitHub
`rm -rf` in restore steps	Wrong snapshot = data destroyed	Always restore to staging first

Medium Priority:

Gap	Risk	Fix
PostgreSQL version not pinned	Version mismatch on restore	Pin `pkgs.postgresql_15`
Dynamic UIDs	Permission errors after restore	Static UIDs for service users
DNS provider not documented	Can't update IP on new VPS	Document in break glass section
No backup monitoring	Silent failures for days	Add healthchecks.io integration
Postgres roles/extensions	Restore may fail	Include `pg_dumpall --globals-only`

Nice to Have:

Gap	Improvement
Manual restore steps	Create restore script
No immutable backups	Enable B2 Object Lock
No second backup location	Replicate to second provider

10. Runbook Maintenance

Owner: dan
Last updated: 2026-01-10
Last drill: Never (TODO: schedule first drill)
Next review: After first restore drill

Change Log

Date	Change
2026-01-10	Initial draft

Appendix A: File Paths Reference

# PostgreSQL dumps (created by services.postgresqlBackup)
/var/backup/postgresql/forgejo.sql.gz
/var/backup/postgresql/mautrix_slack.sql.gz

# Forgejo
/var/lib/forgejo/
├── conf/
├── data/
│   ├── avatars/
│   ├── attachments/
│   └── lfs/
├── repositories/
│   └── USERNAME/
│       └── REPO.git/
└── gitea.db  (if using SQLite, but we use PostgreSQL)

# Matrix (continuwuity with RocksDB)
/var/lib/matrix-continuwuity/
├── db/           # RocksDB database
└── media/        # Uploaded media

# Maubot
/var/lib/maubot/
├── plugins/
├── trash/
└── maubot.db     # SQLite database

# Slack Bridge
/var/lib/mautrix-slack/
└── registration.yaml

# User homes
/home/USERNAME/
├── .ssh/
├── .config/
├── .npm-global/
└── [user projects]

# Secrets (runtime, not backed up - regenerated from sops)
/run/secrets/
├── matrix-registration-token
├── maubot-admin-password
├── restic/password
└── ...

Appendix B: Service Dependencies

                    ┌─────────────┐
                    │ postgresql  │
                    └──────┬──────┘
                           │
           ┌───────────────┼───────────────┐
           │               │               │
           ▼               ▼               ▼
    ┌──────────┐    ┌─────────────┐  ┌──────────┐
    │ forgejo  │    │mautrix-slack│  │  maubot  │
    └──────────┘    └──────┬──────┘  └────┬─────┘
                           │              │
                           ▼              ▼
                    ┌─────────────────────────┐
                    │  matrix-continuwuity    │
                    └─────────────────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │    nginx    │
                    └─────────────┘
