NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
- Session Summary
- Accomplishments
- Key Decisions
- Problems & Solutions
- Technical Details
- Process and Workflow
- Learning and Insights
- Context for Future Work
- Raw Notes
- Session Metrics
Session Summary
Date: 2026-01-10 (Continued from previous session)
Focus Area: Complete NixOS 24.05 to 24.11 upgrade with full DR preparation
Accomplishments
- Completed B2 backup setup with restic (backup-b2.nix module)
- Added PostgreSQL dump automation via services.postgresqlBackup
- Created comprehensive disaster recovery runbook (docs/disaster-recovery-runbook.md)
- Added /home and /var/lib/acme to backup paths (previously missing)
- Created offline sops recovery key for disaster scenarios
- Documented NixOS 24.11 breaking changes analysis
- Pinned PostgreSQL to v15 to prevent auto-upgrade
- Executed first restore drill - all tests passed
- Built and deployed NixOS 24.11 (generation 72)
- Verified all services post-upgrade
- Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
- Closed upgrade epic 00e with all 6 child tasks
- Ran ops-review on backup module, fixed 2 MED findings
- Added failure notification service (backup-b2-failed) with OnFailure handlers
- Added network dependency and timeouts to backup services
- Post-upgrade health check: all services active, no failed units
Key Decisions
Decision 1: Use boot instead of switch for deployment
- Context: Upgrading major NixOS version with systemd 255→256
- Options considered:
  - nixos-rebuild switch - immediate activation
  - nixos-rebuild boot - stage for next boot
- Rationale: boot provides cleaner service restarts, avoids mixed-state issues
- Impact: Required reboot but ensured all services start fresh
Decision 2: Pin PostgreSQL to v15 instead of upgrading to v16
- Context: NixOS 24.11 defaults to PostgreSQL 16
- Options considered:
  - Pin to 15 now, upgrade PostgreSQL later (two steps)
  - Let PostgreSQL upgrade with NixOS (requires pg_upgrade)
- Rationale: Safer to decouple NixOS upgrade from database upgrade
- Impact: Added package = pkgs.postgresql_15; to dev-services.nix
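A minimal sketch of how the pin might sit in dev-services.nix. Only the package line is confirmed by these notes; the surrounding attributes are assumptions added so the fragment evaluates as a module on its own.

```nix
{ pkgs, ... }:
{
  services.postgresql = {
    # Assumption: PostgreSQL is enabled in this same module.
    enable = true;

    # Pin the major version explicitly so a NixOS release bump cannot
    # change it; moving to 16 later becomes a separate, deliberate
    # step that includes running pg_upgrade.
    package = pkgs.postgresql_15;
  };
}
```

With the pin in place, the 15→16 migration can be scheduled independently of the OS upgrade, which is the decoupling this decision aims for.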
Decision 3: Offline sops recovery key stored on laptop
- Context: sops keys are derived from the SSH host key, so losing the host means losing access to the secrets
- Options considered:
  - Encrypted file on laptop
  - Paper key in safe
  - Hardware key (YubiKey)
- Rationale: Laptop encrypted file balances security and accessibility
- Impact: Added third age recipient to .sops.yaml
Decision 4: Accept RocksDB backup consistency risk
- Context: Matrix-continuwuity uses RocksDB, backed up while running
- Options considered:
  - Stop matrix-continuwuity during backup (~30s downtime at 3 AM)
  - Use RocksDB checkpoint API (requires upstream support)
  - Accept risk - RocksDB has crash consistency
- Rationale: 3 AM backup window has minimal activity, multiple daily snapshots provide redundancy
- Impact: Closed y8le without implementing service stop; can re-evaluate if restore drill shows corruption
Problems & Solutions
| Problem | Solution | Learning |
|---|---|---|
| PostgreSQL backup dir /var/backup has 0750 permissions, postgres user can't traverse | chmod 751 /var/backup to allow execute traversal | Parent directory permissions matter for child access |
| sops nested key structure mismatch | Restructured secrets.yaml from flat keys (restic/password) to nested YAML | sops-nix expects proper YAML nesting, not path-style keys |
| PostgreSQL collation version mismatch after glibc upgrade | ALTER DATABASE … REFRESH COLLATION VERSION; for each DB | Standard post-upgrade maintenance, not an error |
| mautrix-slack retry warnings on startup | Matrix homeserver wasn't ready yet, bridge retried and connected | Service ordering worked correctly, just log noise |
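The first two fixes in the table can also be expressed declaratively. The fragment below is a sketch under stated assumptions, not the contents of the actual modules: sops-nix addresses a nested YAML key with a slash-separated name, and a systemd-tmpfiles rule can hold /var/backup at mode 0751 instead of relying on a one-off chmod. The relative path to secrets.yaml and the root:root ownership are hypothetical.

```nix
{ ... }:
{
  # Assumption: this module lives in modules/, one level below the
  # repository root where secrets/secrets.yaml is stored.
  sops.defaultSopsFile = ../secrets/secrets.yaml;

  # "restic/password" selects the nested key below, not a flat key
  # that is literally named "restic/password":
  #
  #   restic:
  #     password: ENC[...]
  sops.secrets."restic/password" = { };

  # Keep /var/backup traversable (execute bit for "other") so the
  # postgres user can reach /var/backup/postgresql.
  # Assumption: the directory is owned by root:root.
  systemd.tmpfiles.rules = [
    "d /var/backup 0751 root root -"
  ];
}
```

Encoding the directory mode in configuration means a rebuilt or restored host gets the same permissions without anyone needing to remember the manual fix.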
Technical Details
Code Changes
- Total files modified: 10
- Key files changed:
  - modules/backup-b2.nix - New B2 backup module with restic
  - modules/dev-services.nix - Added postgresqlBackup, pinned PostgreSQL 15
  - flake.nix - Updated to nixos-24.11, unpinned sops-nix
  - .sops.yaml - Added recovery key recipient
  - secrets/secrets.yaml - Added restic credentials (encrypted)
- New files created:
  - docs/disaster-recovery-runbook.md - Comprehensive DR documentation
  - docs/nixos-24.11-upgrade-notes.md - Breaking changes analysis
Commands Used
# B2 bucket creation (manual in Backblaze console)
# Created: ops-jrz1-backup with scoped application key
# Trigger PostgreSQL backups
systemctl start postgresqlBackup-forgejo.service postgresqlBackup-mautrix_slack.service
# Restore drill - test restore to /tmp
restic restore latest --target /tmp/dr-test --include /var/backup/postgresql
gunzip -c /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz | head -30
# Fix PostgreSQL collation after glibc upgrade
sudo -u postgres psql -c "ALTER DATABASE postgres REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE forgejo REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE mautrix_slack REFRESH COLLATION VERSION;"
# NixOS upgrade
nix flake update
nixos-rebuild build --flake .#ops-jrz1
nix copy --to ssh://root@ops-jrz1 ./result
ssh root@ops-jrz1 'nix-shell -p nvd --run "nvd diff /run/current-system /nix/store/..."'
nixos-rebuild boot --flake .#ops-jrz1 --target-host root@ops-jrz1
ssh root@ops-jrz1 reboot
Architecture Notes
- B2 backup runs daily at 3 AM (after PostgreSQL dump at 2 AM)
- Weekly integrity check on Sundays at 4 AM (5% data sample)
- Retention: 7 daily, 4 weekly, 6 monthly snapshots
- Three sops keys: VPS host, admin workstation, offline recovery
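The schedule and retention policy above could be sketched with the upstream NixOS modules as follows. This is an illustration, not the contents of backup-b2.nix: the unit names in these notes (backup-b2, backup-b2-check) suggest the real module defines its own systemd units, whereas services.restic.backups.b2 would create restic-backups-b2.service. The repository path, the restic/b2-env secret name, and the exact timer expressions are assumptions.

```nix
{ config, ... }:
{
  # 2 AM: dump each database to /var/backup/postgresql/<db>.sql.gz
  # (the module's default location and compression).
  services.postgresqlBackup = {
    enable = true;
    databases = [ "forgejo" "mautrix_slack" ];
    startAt = "*-*-* 02:00:00";
  };

  # 3 AM: push the dumps and the other backup paths to B2.
  services.restic.backups.b2 = {
    # Bucket name from these notes; the path inside it is hypothetical.
    repository = "b2:ops-jrz1-backup:ops-jrz1";
    passwordFile = config.sops.secrets."restic/password".path;
    # Hypothetical secret holding B2_ACCOUNT_ID and B2_ACCOUNT_KEY;
    # both secrets must also be declared under sops.secrets.
    environmentFile = config.sops.secrets."restic/b2-env".path;
    paths = [ "/var/backup/postgresql" "/home" "/var/lib/acme" ];
    timerConfig = {
      OnCalendar = "*-*-* 03:00:00";
      Persistent = true; # catch up after downtime
    };
    # Retention: 7 daily, 4 weekly, 6 monthly snapshots.
    pruneOpts = [
      "--keep-daily 7"
      "--keep-weekly 4"
      "--keep-monthly 6"
    ];
  };

  # Sundays at 4 AM: trigger the integrity check. The matching
  # backup-b2-check.service is assumed to exist and to run something
  # like: restic check --read-data-subset=5%
  systemd.timers.backup-b2-check = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      OnCalendar = "Sun *-*-* 04:00:00";
      Persistent = true;
    };
  };
}
```

Keeping the dump one hour ahead of the upload gives the PostgreSQL backup time to finish before restic reads the files.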
Ops Review Findings
Ran ops-review skill with lenses: secrets, blast-radius, observability, resilience, nix-hygiene
MED (fixed):
- backup-b2-check missing network-online.target dependency
- No failure notification mechanism for backup services
LOW (skipped - style only):
- statix W20 warnings about repeated keys in Nix modules (idiomatic pattern, not worth refactoring)
Changes made to modules/backup-b2.nix:
- Added backup-b2-failed.service oneshot for failure notification
- Added onFailure = [ "backup-b2-failed.service" ] to both backup services
- Added after/wants = [ "network-online.target" ] to backup-b2-check
- Added TimeoutStartSec (2h for backup, 1h for check)
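Taken together, these changes might look like the fragment below. It is a sketch assuming backup-b2 and backup-b2-check are already defined elsewhere in the module (the NixOS module system merges these attributes into the existing definitions); the body of the notification script is hypothetical, since the notes do not say how the failure is reported.

```nix
{ ... }:
{
  # Oneshot unit that systemd starts whenever a backup unit fails.
  systemd.services.backup-b2-failed = {
    description = "Report a failed B2 backup run";
    serviceConfig.Type = "oneshot";
    # Hypothetical notification: write an error line to the journal.
    script = ''
      echo "backup-b2: backup or integrity check failed" >&2
    '';
  };

  systemd.services.backup-b2 = {
    onFailure = [ "backup-b2-failed.service" ];
    # Abort a hung upload instead of blocking the next scheduled run.
    serviceConfig.TimeoutStartSec = "2h";
  };

  systemd.services.backup-b2-check = {
    # Wait for the network before contacting B2.
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    onFailure = [ "backup-b2-failed.service" ];
    serviceConfig.TimeoutStartSec = "1h";
  };
}
```

Routing both units through one failure handler keeps the notification logic in a single place, so swapping the journal message for an external alert later only touches backup-b2-failed.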
Process and Workflow
What Worked Well
- Using orch consensus earlier in session for backup strategy validation
- Restore drill caught the /var/backup permissions issue before a real disaster
- Incremental approach: backup → DR runbook → upgrade → verify
- beads issue tracking kept work organized across session
What Was Challenging
- sops key structure confusion (flat vs nested YAML)
- The permissions issue with /var/backup wasn't obvious
- Long nix copy times to server (~5 min for full closure)
Learning and Insights
Technical Insights
- PostgreSQL collation refresh is standard maintenance after glibc upgrades
- NixOS boot vs switch: boot is safer for major upgrades
- restic restore preserves permissions and ownership
- mautrix-slack has graceful retry logic for homeserver connectivity
Process Insights
- Restore drills find issues that code review misses
- DR runbooks should be tested, not just written
- Upgrade checklists prevent forgotten steps
Architectural Insights
- Three-key sops setup (host + admin + recovery) covers disaster scenarios
- Separating database upgrade from OS upgrade reduces risk
- services.postgresqlBackup is better than raw pg_dumpall scripts
Context for Future Work
Open Questions
- Should we set up backup monitoring (healthchecks.io)?
- When to upgrade PostgreSQL 15→16?
- Mirror flake to GitHub (jboq - deferred)?
Next Steps
- Monitor logs for 24-48 hours post-upgrade
- Schedule quarterly restore drills
- Consider static UIDs for service users (permission consistency)
Related Work
- 2025-12-05 Security Review - Initial backup planning
- 2026-01-05 Backup Beads Cleanup - Earlier backup work
- docs/disaster-recovery-runbook.md - Created this session
- docs/nixos-24.11-upgrade-notes.md - Created this session
Raw Notes
- NixOS 24.11 codename: Vicuna
- Kernel upgraded: 6.6.68 → 6.6.94
- systemd upgraded: 255.9 → 256.10
- Matrix-continuwuity went from RC to stable (0.5.0-rc.8.1 → 0.5.1)
- maubot upgraded significantly (0.4.2 → 0.5.0)
- Forgejo stayed on 7.x LTS (7.0.12 → 7.0.15)
- Closure size increased by ~137 MiB
Beads issues closed this session:
- zgs8 - B2 backup setup (from earlier in session)
- r177 - Add /home and /var/lib/acme to backups
- 93q9 - Add offline sops recovery key
- 09o - Review NixOS 24.11 release notes
- 7qg - Pin PostgreSQL to v15
- asi - Take verified backup before upgrade
- 3wd - Update flake to nixos-24.11
- a9d - Deploy NixOS 24.11
- 3zo - Post-upgrade verification
- 00e - Upgrade epic (parent)
- y8le - Stop Matrix before backup (closed: accepted risk)
Session Metrics
- Commits made: 12
- Files touched: 10
- Lines added/removed: +1013/-52
- Tests added: 0 (restore drill was manual verification)
- Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
- Ops-review: 2 MED fixed, 4 LOW skipped (style-only statix warnings)