Add disaster recovery runbook draft

Documents restore procedures for full server loss, partial restore, and user data recovery scenarios. Includes verification checklists, time estimates, and break-glass quick reference. Also documents known gaps (home dirs, ACME, RocksDB consistency) that need fixing before the runbook is production-ready.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

parent 31d388d21c
commit b62f649a28

docs/disaster-recovery-runbook.md (Normal file, 624 lines)
@ -0,0 +1,624 @@
# Disaster Recovery Runbook - ops-jrz1

## Overview

This runbook covers restore procedures for ops-jrz1, a NixOS homelab server running Matrix, Forgejo, and supporting services. Backups are stored in Backblaze B2 using restic.

**Recovery Time Objective (RTO):** 2-6 hours for full restore
**Recovery Point Objective (RPO):** 24 hours (daily backups at 3 AM UTC)

---

## 1. What's Backed Up

| Component | Path | Backup Method | Restore Priority |
|-----------|------|---------------|------------------|
| PostgreSQL (forgejo) | `/var/backup/postgresql/forgejo.sql.gz` | pg_dump via timer | Critical |
| PostgreSQL (mautrix_slack) | `/var/backup/postgresql/mautrix_slack.sql.gz` | pg_dump via timer | Critical |
| Forgejo | `/var/lib/forgejo` | restic file backup | Critical |
| Matrix | `/var/lib/matrix-continuwuity` | restic file backup | High |
| Maubot | `/var/lib/maubot` | restic file backup | Medium |
| Slack Bridge | `/var/lib/mautrix-slack` | restic file backup | Medium |
| User Homes | `/home/*` | **NOT YET BACKED UP** | High |

### What's NOT Backed Up (Reproducible via NixOS)

- `/nix/store` - rebuilt from flake
- `/etc` - generated from NixOS config
- Service binaries - installed via Nix

### Critical Items Stored Out-of-Band

| Item | Storage Location | Notes |
|------|------------------|-------|
| NixOS flake | GitHub mirror / local laptop | Self-hosted Forgejo may be dead |
| Restic password | Password manager + printed | Can't restore without it |
| B2 credentials | Password manager + printed | Can't access backups without it |
| Age key (sops) | `/etc/ssh/ssh_host_ed25519_key` on server | Derived from SSH host key |

---

## 2. Break Glass - Emergency Quick Reference

**Print this page and store physically.**

### B2 Backup Access

```
Bucket: ops-jrz1-backup
Restic Repo: b2:ops-jrz1-backup
Key ID: [stored in password manager]
App Key: [stored in password manager]
Restic Password: [stored in password manager]
```

### Minimal Restore Commands

```bash
# Set environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="[password]"
export B2_ACCOUNT_ID="[key-id]"
export B2_ACCOUNT_KEY="[app-key]"

# List snapshots
restic snapshots

# Restore everything to /tmp/restore
restic restore latest --target /tmp/restore

# Restore specific path
restic restore latest --target / --include /var/lib/forgejo
```
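
The commands above depend on four exported variables. As a minimal sketch (the helper name `require_env` is hypothetical and not part of this runbook), a pre-flight check could report anything missing before restic is invoked:

```bash
# Hypothetical helper: report which of the named variables are missing
# from the exported environment. Returns non-zero if any are missing.
require_env() {
    local missing=0 name
    for name in "$@"; do
        # printenv only sees exported variables, which is also all restic will see
        if [ -z "$(printenv "$name")" ]; then
            echo "missing: $name" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example use before any restic call:
# require_env RESTIC_REPOSITORY RESTIC_PASSWORD B2_ACCOUNT_ID B2_ACCOUNT_KEY || exit 1
```

A variable that was assigned but never exported is reported as missing, which matches how restic would behave.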

### Service Start Order

```
1. postgresql
2. forgejo
3. mautrix-slack
4. matrix-continuwuity
5. maubot
6. nginx
```

### Config Repository

```
Primary: git.clarun.xyz:dan/ops-jrz1.git (may be down)
Mirror: [ADD GITHUB MIRROR URL]
Local: ~/proj/ops-jrz1 on admin laptop
```

---

## 3. Restore Scenarios

### Scenario A: Full Server Loss

**When:** Hardware failure, VPS provider issue, complete disk loss.

**Time estimate:** 2-4 hours

#### Phase 1: Bootstrap NixOS (30-60 min)

1. Provision new VPS or boot NixOS installer on new hardware
2. Partition disks, mount to `/mnt`
3. Get NixOS flake from backup location (laptop, GitHub mirror)
4. **Critical:** Restore SSH host keys if you have them backed up, OR accept new host identity (will need to re-encrypt sops secrets with new age key)

```bash
# If restoring old host keys (preserves sops decryption):
mkdir -p /mnt/etc/ssh
# Copy ssh_host_* files from secure backup

# Install NixOS
nixos-install --flake /path/to/flake#ops-jrz1 --no-root-passwd
reboot
```

#### Phase 2: Restore Secrets (10-20 min)

If you had to generate new SSH host keys:

```bash
# Get new age public key
ssh-to-age < /etc/ssh/ssh_host_ed25519_key.pub

# Update .sops.yaml with new key on admin machine
# Re-encrypt secrets: sops updatekeys secrets/secrets.yaml
# Redeploy
```

#### Phase 3: Stop Services, Restore Data (30-90 min)

```bash
# Stop all services that use the data
systemctl stop forgejo mautrix-slack matrix-continuwuity maubot

# Set restic environment
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD="..."
export B2_ACCOUNT_ID="..."
export B2_ACCOUNT_KEY="..."

# Restore PostgreSQL dumps
restic restore latest --target /tmp/restore --include /var/backup/postgresql

# Import databases
systemctl start postgresql
sudo -u postgres psql -c "DROP DATABASE IF EXISTS forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

sudo -u postgres psql -c "DROP DATABASE IF EXISTS mautrix_slack;"
sudo -u postgres psql -c "CREATE DATABASE mautrix_slack OWNER mautrix_slack;"
gunzip -c /tmp/restore/var/backup/postgresql/mautrix_slack.sql.gz | sudo -u postgres psql -d mautrix_slack

# Restore Forgejo data
rm -rf /var/lib/forgejo/*
restic restore latest --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo

# Restore Matrix data
rm -rf /var/lib/matrix-continuwuity/*
restic restore latest --target / --include /var/lib/matrix-continuwuity
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity

# Restore Maubot data
rm -rf /var/lib/maubot/*
restic restore latest --target / --include /var/lib/maubot
chown -R maubot:maubot /var/lib/maubot

# Restore Slack bridge data
rm -rf /var/lib/mautrix-slack/*
restic restore latest --target / --include /var/lib/mautrix-slack
chown -R mautrix-slack:mautrix-slack /var/lib/mautrix-slack

# Restore user home directories
# (only works once /home is included in the backup; see Section 9)
restic restore latest --target / --include /home
# Permissions should be preserved by restic
```

#### Phase 4: Start Services and Verify (15-30 min)

```bash
# Start in dependency order
systemctl start forgejo
systemctl start mautrix-slack
systemctl start matrix-continuwuity
systemctl start maubot

# Check status
systemctl status forgejo mautrix-slack matrix-continuwuity maubot
```
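
`systemctl status` prints a lot of output for several units at once. As a rough sketch (the function name `check_services` is hypothetical), a loop over `systemctl is-active --quiet` can reduce the check to a list of units that are not running:

```bash
# Hypothetical helper: print every unit that is not active.
# Returns non-zero if at least one unit is not active.
check_services() {
    local failed=0 svc
    for svc in "$@"; do
        if ! systemctl is-active --quiet "$svc"; then
            echo "NOT ACTIVE: $svc"
            failed=1
        fi
    done
    return "$failed"
}

# Example use after Phase 4:
# check_services postgresql forgejo mautrix-slack matrix-continuwuity maubot nginx
```

An empty output with exit status 0 means every listed unit reported active. It does not replace the functional checks in Section 5.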

See Section 5 for verification checklist.

---

### Scenario B: Single Service Corruption

**When:** One service's data is corrupted but server is otherwise fine.

**Time estimate:** 15-60 min

#### Example: Matrix RocksDB Corruption

```bash
# Stop the service
systemctl stop matrix-continuwuity

# Find available snapshots
restic snapshots

# Restore to temp location first (safer)
restic restore latest --target /tmp/restore --include /var/lib/matrix-continuwuity

# Verify it looks reasonable
ls -la /tmp/restore/var/lib/matrix-continuwuity/

# Replace corrupted data
rm -rf /var/lib/matrix-continuwuity/*
cp -a /tmp/restore/var/lib/matrix-continuwuity/* /var/lib/matrix-continuwuity/
chown -R matrix-continuwuity:matrix-continuwuity /var/lib/matrix-continuwuity

# Start and verify
systemctl start matrix-continuwuity
journalctl -u matrix-continuwuity -f
```
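
The replace step above deletes the live data before the restored copy is in place. A more conservative variant, sketched here under the assumption that there is enough disk space to keep both copies, moves the old directory aside instead of deleting it. The name `swap_in_staged` and the `.pre-restore.<timestamp>` suffix are illustrative choices, not existing conventions in this repo:

```bash
# Hypothetical helper: move the live directory aside, then move the staged copy in.
# A rename is cheap when both paths are on the same filesystem; across
# filesystems mv falls back to copying.
swap_in_staged() {
    local staged="$1" live="$2" aside
    aside="${live}.pre-restore.$(date +%Y%m%d%H%M%S)"
    if [ ! -d "$staged" ]; then
        echo "staged copy not found: $staged" >&2
        return 1
    fi
    if [ -e "$live" ]; then
        mv "$live" "$aside" || return 1
    fi
    mv "$staged" "$live"
}

# Example use with the service stopped:
# swap_in_staged /tmp/restore/var/lib/matrix-continuwuity /var/lib/matrix-continuwuity
```

The `chown` step from the procedure above still applies afterwards, and the directory moved aside can be deleted once the service has been verified.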

#### Example: PostgreSQL Database Corruption

```bash
# Stop the service that uses the database (only forgejo is restored here)
systemctl stop forgejo

# Restore dump
restic restore latest --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz

# Drop and recreate
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

# Restart
systemctl start forgejo
```
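
`DROP DATABASE` is irreversible, so it may be worth confirming that the restored dump is readable before running it. A minimal sketch, with the hypothetical name `dump_is_usable`:

```bash
# Hypothetical helper: succeed only if the dump exists, is non-empty,
# and its gzip stream is intact.
dump_is_usable() {
    local dump="$1"
    # -s: file exists and has a size greater than zero
    # gzip -t: test the compressed data without writing any output
    [ -s "$dump" ] && gzip -t "$dump" 2>/dev/null
}

# Example use before dropping the database:
# dump_is_usable /tmp/restore/var/backup/postgresql/forgejo.sql.gz || exit 1
```

This only verifies the compression layer. It does not prove that the SQL inside is complete, so the verification checklist in Section 5 is still needed after the import.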

---

### Scenario C: User Deleted Their Work

**When:** Dev accidentally `rm -rf`'d their project or home directory.

**Time estimate:** 5-20 min

**Prerequisite:** `/home` must be included in the backup. It currently is not (see Section 9), so this scenario cannot be performed until that gap is fixed.

```bash
# Find what snapshots are available
restic snapshots

# Browse a snapshot to find the data
restic ls latest /home/USERNAME/

# Restore specific directory to temp location
restic restore latest --target /tmp/restore --include /home/USERNAME/project-name

# Let user copy what they need
cp -a /tmp/restore/home/USERNAME/project-name /home/USERNAME/
chown -R USERNAME:users /home/USERNAME/project-name

# Or restore entire home directory
restic restore latest --target / --include /home/USERNAME
chown -R USERNAME:users /home/USERNAME
```

#### Point-in-Time Restore

```bash
# List snapshots with dates
restic snapshots

# Restore from specific snapshot (not latest)
restic restore abc123def --target /tmp/restore --include /home/USERNAME
```

---

### Scenario D: Single Forgejo Repo Deleted

**When:** A git repository was deleted from Forgejo.

**Time estimate:** 10-30 min

**Challenge:** Forgejo database and filesystem must be in sync.

#### Option 1: Restore Just the Git Data (if DB record exists)

```bash
# Find repo path - usually /var/lib/forgejo/repositories/USERNAME/REPO.git
restic ls latest /var/lib/forgejo/repositories/

# Restore repo
restic restore latest --target /tmp/restore --include /var/lib/forgejo/repositories/USERNAME/REPO.git

# Copy into place
cp -a /tmp/restore/var/lib/forgejo/repositories/USERNAME/REPO.git /var/lib/forgejo/repositories/USERNAME/
chown -R forgejo:forgejo /var/lib/forgejo/repositories/USERNAME/REPO.git

# Regenerate hooks
sudo -u forgejo forgejo admin regenerate hooks
```

#### Option 2: Full Forgejo Restore (if DB record was also deleted)

Need to restore both database and filesystem to same point in time:

```bash
systemctl stop forgejo

# Restore database
restic restore SNAPSHOT_ID --target /tmp/restore --include /var/backup/postgresql/forgejo.sql.gz
sudo -u postgres psql -c "DROP DATABASE forgejo;"
sudo -u postgres psql -c "CREATE DATABASE forgejo OWNER forgejo;"
gunzip -c /tmp/restore/var/backup/postgresql/forgejo.sql.gz | sudo -u postgres psql -d forgejo

# Restore filesystem
rm -rf /var/lib/forgejo/*
restic restore SNAPSHOT_ID --target / --include /var/lib/forgejo
chown -R forgejo:forgejo /var/lib/forgejo

systemctl start forgejo
```
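
Option 2 only stays consistent if both `restic restore` calls use the same `SNAPSHOT_ID`. One way to make that hard to get wrong, sketched here with a hypothetical wrapper named `restore_from_snapshot`, is to pass the ID once and let a single restic call pull every path. restic accepts `--include` more than once. The `RESTIC` override is an assumption added for dry runs and is not a restic feature:

```bash
# Hypothetical wrapper: restore several paths from ONE snapshot in one restic call.
# Set RESTIC=echo to print the command instead of running it.
restore_from_snapshot() {
    local snapshot="$1" target="$2" n
    shift 2
    # Rewrite the remaining arguments from "P1 P2 ..." to
    # "--include P1 --include P2 ..." by appending and shifting.
    n=$#
    while [ "$n" -gt 0 ]; do
        set -- "$@" --include "$1"
        shift
        n=$((n - 1))
    done
    "${RESTIC:-restic}" restore "$snapshot" --target "$target" "$@"
}

# Example: stage the Forgejo dump and data from the same snapshot
# restore_from_snapshot SNAPSHOT_ID /tmp/restore \
#     /var/backup/postgresql/forgejo.sql.gz /var/lib/forgejo
```

Restoring to a staging target first also avoids the `rm -rf` risk listed in Section 9, at the cost of extra disk space.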

---

## 4. Restore Commands Reference

### Environment Setup

```bash
export RESTIC_REPOSITORY="b2:ops-jrz1-backup"
export RESTIC_PASSWORD_FILE="/run/secrets/restic/password" # if on server
# OR
export RESTIC_PASSWORD="your-password-here" # if restoring from scratch
export B2_ACCOUNT_ID="your-key-id"
export B2_ACCOUNT_KEY="your-app-key"
```

### Common Operations

```bash
# List all snapshots
restic snapshots

# List snapshots filtered by tag
restic snapshots --tag ops-jrz1

# Browse snapshot contents
restic ls latest
restic ls latest /var/lib/forgejo

# Restore everything
restic restore latest --target /

# Restore specific path
restic restore latest --target / --include /var/lib/forgejo

# Restore to different location
restic restore latest --target /tmp/restore --include /home/dan

# Restore specific snapshot (not latest)
restic restore abc123de --target /tmp/restore

# Mount backup as filesystem (for browsing)
mkdir -p /mnt/restic
restic mount /mnt/restic
# Browse /mnt/restic/snapshots/latest/...
# Ctrl+C to unmount

# Check backup integrity
restic check
restic check --read-data # slower, verifies all data
```

---

## 5. Verification Checklist

### After Full Restore

#### Infrastructure
- [ ] SSH access works
- [ ] DNS resolves correctly
- [ ] HTTPS certificates valid (may need `systemctl start acme-clarun.xyz`)

#### PostgreSQL
- [ ] `systemctl status postgresql` - active
- [ ] `sudo -u postgres psql -c "\l"` - lists forgejo, mautrix_slack databases
- [ ] No errors in `journalctl -u postgresql`

#### Forgejo
- [ ] `systemctl status forgejo` - active
- [ ] Web UI loads at https://git.clarun.xyz
- [ ] Can log in
- [ ] Repositories visible and browsable
- [ ] Can clone a repo: `git clone git@git.clarun.xyz:org/repo.git`
- [ ] Can push to a repo

#### Matrix
- [ ] `systemctl status matrix-continuwuity` - active
- [ ] No RocksDB errors in `journalctl -u matrix-continuwuity`
- [ ] Can log in with Matrix client
- [ ] Can send/receive messages
- [ ] Old messages visible

#### Maubot
- [ ] `systemctl status maubot` - active
- [ ] Web UI accessible via SSH tunnel (port 29316)
- [ ] Bots responding

#### Slack Bridge
- [ ] `systemctl status mautrix-slack` - active
- [ ] Bridge connected (check logs)
- [ ] Messages flowing both directions

#### User Home Directories
- [ ] Users can SSH in
- [ ] User files present
- [ ] Permissions correct

### After Partial Restore

- [ ] Restored service starts without errors
- [ ] Basic functionality works
- [ ] No data from "future" (if restoring older snapshot)

---

## 6. Time Estimates

| Scenario | Download Size | Estimated Time |
|----------|--------------|----------------|
| Full server restore | ~15 GB | 2-4 hours |
| Single service (Matrix) | ~2 GB | 15-45 min |
| Single service (Forgejo) | ~5 GB | 20-60 min |
| Single user home | ~1 GB | 5-15 min |
| Single git repo | ~100 MB | 5-10 min |
| PostgreSQL DB only | ~50 MB | 10-20 min |

*Times assume 100 Mbps download from B2. Actual times depend on network speed and data size.*
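
For a sense of how much of each estimate is raw transfer, the download time can be worked out from the size and link speed. The sketch below assumes decimal units (1 GB = 8000 megabits) and that the link sustains its full rate, which real B2 downloads may not. The function name `transfer_minutes` is illustrative:

```bash
# Minutes needed to download SIZE_GB at RATE_MBPS, rounded up to a whole minute.
# Integer arithmetic only; 1 GB is treated as 8000 megabits.
transfer_minutes() {
    local size_gb="$1" rate_mbps="$2" seconds
    seconds=$(( size_gb * 8000 / rate_mbps ))
    echo $(( (seconds + 59) / 60 ))
}

transfer_minutes 15 100   # full server restore: prints 20
transfer_minutes 5 100    # Forgejo: prints 7
```

Under these assumptions the 15 GB download accounts for about 20 minutes of the 2-4 hour full restore, so most of that estimate is installation, import, and verification work that a faster link would not shorten.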

---

## 7. Quarterly Restore Drill

Schedule: First Sunday of each quarter

### Procedure

1. Spin up test VM (or use local NixOS VM)
2. Attempt full restore procedure
3. Run verification checklist
4. Document:
   - Actual time taken
   - Any issues encountered
   - Runbook updates needed
5. Destroy test VM

### Success Criteria

- [ ] NixOS boots with config
- [ ] PostgreSQL databases restore and pass basic queries
- [ ] Forgejo UI loads and repos are accessible
- [ ] Matrix client can connect and see history
- [ ] At least one user home directory restored with correct permissions

---

## 8. Known Limitations

### RocksDB Consistency

The Matrix homeserver uses RocksDB, which is sensitive to incomplete backups. The current backup runs while the service is active. For guaranteed consistency, the backup should do one of the following:
- Stop service before backup, OR
- Use RocksDB checkpoint feature, OR
- Use filesystem snapshots (ZFS/btrfs)

**Current risk:** Low probability of corrupted Matrix restore. Mitigation: verify RocksDB opens without errors after restore.

### Point-in-Time Recovery

Restic provides daily snapshots, not continuous backup, so a restore can only target one of those snapshots and not an arbitrary point in time. PostgreSQL PITR would need WAL archiving (not currently configured).

### User Home Directory Backup

**TODO:** User home directories (`/home/*`) are not currently included in the backup. They need to be added to backup-b2.nix.

### Large File Handling

Forgejo LFS objects and large repos may take significant time to restore. Consider whether to exclude LFS from regular backups and handle it separately.

---

## 9. Known Gaps and TODOs

**Critical - Must Fix Before Relying on This Runbook:**

| Gap | Risk | Fix |
|-----|------|-----|
| `/home/*` not backed up | User work lost forever | Add to backup-b2.nix paths |
| `/var/lib/acme` not backed up | Let's Encrypt rate limit (7 days no HTTPS) | Add to backup-b2.nix paths |
| RocksDB backed up while running | Corrupt Matrix restore | Stop service in pre-backup hook |
| Sops key tied to SSH host key only | Lose host key = lose all secrets | Add offline recovery age key |
| Flake only on self-hosted Forgejo | Can't restore if Forgejo is dead | Mirror to GitHub |
| `rm -rf` in restore steps | Wrong snapshot = data destroyed | Always restore to staging first |

**Medium Priority:**

| Gap | Risk | Fix |
|-----|------|-----|
| PostgreSQL version not pinned | Version mismatch on restore | Pin `pkgs.postgresql_15` |
| Dynamic UIDs | Permission errors after restore | Static UIDs for service users |
| DNS provider not documented | Can't update IP on new VPS | Document in break glass section |
| No backup monitoring | Silent failures for days | Add healthchecks.io integration |
| Postgres roles/extensions | Restore may fail | Include `pg_dumpall --globals-only` |
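
Until backup monitoring exists, a manual freshness check can catch the "silent failures for days" case from the table above. The sketch below covers only the comparison against the 24-hour RPO. It assumes the newest snapshot's timestamp has already been converted to Unix epoch seconds (`restic snapshots --json` reports a `time` field per snapshot, and the conversion is left out here). The function name `snapshot_within_rpo` is hypothetical:

```bash
# Hypothetical check: is a snapshot taken at SNAPSHOT_EPOCH recent enough?
# Arguments are Unix epoch seconds; MAX_HOURS defaults to the 24-hour RPO.
snapshot_within_rpo() {
    local snapshot_epoch="$1" now_epoch="$2" max_hours="${3:-24}"
    [ $(( now_epoch - snapshot_epoch )) -le $(( max_hours * 3600 )) ]
}

# Example use, with SNAPSHOT_EPOCH obtained separately:
# snapshot_within_rpo "$SNAPSHOT_EPOCH" "$(date +%s)" 26 || echo "backup is stale"
```

Passing a slightly larger limit such as 26 hours leaves room for the time the daily backup itself takes to finish.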

**Nice to Have:**

| Gap | Improvement |
|-----|-------------|
| Manual restore steps | Create restore script |
| No immutable backups | Enable B2 Object Lock |
| No second backup location | Replicate to second provider |

---

## 10. Runbook Maintenance

- **Owner:** dan
- **Last updated:** 2026-01-10
- **Last drill:** Never (TODO: schedule first drill)
- **Next review:** After first restore drill

### Change Log

| Date | Change |
|------|--------|
| 2026-01-10 | Initial draft |

---

## Appendix A: File Paths Reference

```
# PostgreSQL dumps (created by services.postgresqlBackup)
/var/backup/postgresql/forgejo.sql.gz
/var/backup/postgresql/mautrix_slack.sql.gz

# Forgejo
/var/lib/forgejo/
├── conf/
├── data/
│   ├── avatars/
│   ├── attachments/
│   └── lfs/
├── repositories/
│   └── USERNAME/
│       └── REPO.git/
└── gitea.db (if using SQLite, but we use PostgreSQL)

# Matrix (Continuwuity with RocksDB)
/var/lib/matrix-continuwuity/
├── db/      # RocksDB database
└── media/   # Uploaded media

# Maubot
/var/lib/maubot/
├── plugins/
├── trash/
└── maubot.db   # SQLite database

# Slack Bridge
/var/lib/mautrix-slack/
└── registration.yaml

# User homes
/home/USERNAME/
├── .ssh/
├── .config/
├── .npm-global/
└── [user projects]

# Secrets (runtime, not backed up - regenerated from sops)
/run/secrets/
├── matrix-registration-token
├── maubot-admin-password
├── restic/password
└── ...
```

## Appendix B: Service Dependencies

```
               ┌─────────────┐
               │ postgresql  │
               └──────┬──────┘
                      │
      ┌───────────────┼───────────────┐
      │               │               │
      ▼               ▼               ▼
 ┌──────────┐  ┌─────────────┐   ┌──────────┐
 │ forgejo  │  │mautrix-slack│   │  maubot  │
 └──────────┘  └──────┬──────┘   └────┬─────┘
                      │               │
                      ▼               ▼
                 ┌─────────────────────────┐
                 │   matrix-continuwuity   │
                 └─────────────────────────┘
                              │
                              ▼
                       ┌─────────────┐
                       │    nginx    │
                       └─────────────┘
```