#+TITLE: NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
#+DATE: 2026-01-10
#+KEYWORDS: nixos-upgrade, backup, restic, disaster-recovery, postgresql, b2, restore-drill
#+COMMITS: 12
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2026-01-10 (Continued from previous session)
** Focus Area: Complete NixOS 24.05 to 24.11 upgrade with full DR preparation

* Accomplishments
- [X] Completed B2 backup setup with restic (backup-b2.nix module)
- [X] Added PostgreSQL dump automation via services.postgresqlBackup
- [X] Created comprehensive disaster recovery runbook (docs/disaster-recovery-runbook.md)
- [X] Added /home and /var/lib/acme to backup paths (was missing)
- [X] Created offline sops recovery key for disaster scenarios
- [X] Documented NixOS 24.11 breaking changes analysis
- [X] Pinned PostgreSQL to v15 to prevent auto-upgrade
- [X] Executed first restore drill - all tests passed
- [X] Built and deployed NixOS 24.11 (generation 72)
- [X] Verified all services post-upgrade
- [X] Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
- [X] Closed upgrade epic 00e with all 6 child tasks
- [X] Ran ops-review on backup module, fixed 2 MED findings
- [X] Added failure notification service (backup-b2-failed) with OnFailure handlers
- [X] Added network dependency and timeouts to backup services
- [X] Post-upgrade health check: all services active, no failed units

* Key Decisions
** Decision 1: Use boot instead of switch for deployment
- Context: Upgrading major NixOS version with systemd 255→256
- Options considered:
  1. nixos-rebuild switch - immediate activation
  2. nixos-rebuild boot - stage for next boot
- Rationale: boot provides cleaner service restarts, avoids mixed-state issues
- Impact: Required reboot but ensured all services start fresh

** Decision 2: Pin PostgreSQL to v15 instead of upgrading to v16
- Context: NixOS 24.11 defaults to PostgreSQL 16
- Options considered:
  1. Pin to 15 now, upgrade PostgreSQL later (two steps)
  2. Let PostgreSQL upgrade with NixOS (requires pg_upgrade)
- Rationale: Safer to decouple NixOS upgrade from database upgrade
- Impact: Added ~package = pkgs.postgresql_15;~ to dev-services.nix

** Decision 3: Offline sops recovery key stored on laptop
- Context: sops keys derived from SSH host key - lose host, lose secrets
- Options considered:
  1. Encrypted file on laptop
  2. Paper key in safe
  3. Hardware key (YubiKey)
- Rationale: Laptop encrypted file balances security and accessibility
- Impact: Added third age recipient to .sops.yaml

** Decision 4: Accept RocksDB backup consistency risk
- Context: Matrix-continuwuity uses RocksDB, backed up while running
- Options considered:
  1. Stop matrix-continuwuity during backup (~30s downtime at 3 AM)
  2. Use RocksDB checkpoint API (requires upstream support)
  3. Accept risk - RocksDB has crash consistency
- Rationale: 3 AM backup window has minimal activity, multiple daily snapshots provide redundancy
- Impact: Closed y8le without implementing service stop; can re-evaluate if restore drill shows corruption

* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| PostgreSQL backup dir /var/backup has 0750 permissions, postgres user can't traverse | chmod 751 /var/backup to allow execute traversal | Parent directory permissions matter for child access |
| sops nested key structure mismatch | Restructured secrets.yaml from flat keys (restic/password) to nested YAML | sops-nix expects proper YAML nesting, not path-style keys |
| PostgreSQL collation version mismatch after glibc upgrade | ALTER DATABASE ... REFRESH COLLATION VERSION; for each DB | Standard post-upgrade maintenance, not an error |
| mautrix-slack retry warnings on startup | Matrix homeserver wasn't ready yet, bridge retried and connected | Service ordering worked correctly, just log noise |

* Technical Details
** Code Changes
- Total files modified: 10
- Key files changed:
  - ~modules/backup-b2.nix~ - New B2 backup module with restic
  - ~modules/dev-services.nix~ - Added postgresqlBackup, pinned PostgreSQL 15
  - ~flake.nix~ - Updated to nixos-24.11, unpinned sops-nix
  - ~.sops.yaml~ - Added recovery key recipient
  - ~secrets/secrets.yaml~ - Added restic credentials (encrypted)
- New files created:
  - ~docs/disaster-recovery-runbook.md~ - Comprehensive DR documentation
  - ~docs/nixos-24.11-upgrade-notes.md~ - Breaking changes analysis

** Commands Used
#+BEGIN_SRC bash
# B2 bucket creation (manual in Backblaze console)
# Created: ops-jrz1-backup with scoped application key

# Trigger PostgreSQL backups
systemctl start postgresqlBackup-forgejo.service postgresqlBackup-mautrix_slack.service

# Restore drill - test restore to /tmp
restic restore latest --target /tmp/dr-test --include /var/backup/postgresql
gunzip -c /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz | head -30

# Fix PostgreSQL collation after glibc upgrade
sudo -u postgres psql -c "ALTER DATABASE postgres REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE forgejo REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE mautrix_slack REFRESH COLLATION VERSION;"

# NixOS upgrade
nix flake update
nixos-rebuild build --flake .#ops-jrz1
nix copy --to ssh://root@ops-jrz1 ./result
ssh root@ops-jrz1 'nix-shell -p nvd --run "nvd diff /run/current-system /nix/store/..."'
nixos-rebuild boot --flake .#ops-jrz1 --target-host root@ops-jrz1
ssh root@ops-jrz1 reboot
#+END_SRC

** Architecture Notes
- B2 backup runs daily at 3 AM (after PostgreSQL dump at 2 AM)
- Weekly integrity check on Sundays at 4 AM (5% data sample)
- Retention: 7 daily, 4 weekly, 6 monthly snapshots
- Three sops keys: VPS host, admin workstation, offline recovery

** Ops Review Findings
Ran ops-review skill with lenses: secrets, blast-radius, observability, resilience, nix-hygiene

MED (fixed):
1. ~backup-b2-check~ missing ~network-online.target~ dependency
2. No failure notification mechanism for backup services

LOW (skipped - style only):
- statix W20 warnings about repeated keys in Nix modules (idiomatic pattern, not worth refactoring)

Changes made to ~modules/backup-b2.nix~:
- Added ~backup-b2-failed.service~ oneshot for failure notification
- Added ~onFailure = [ "backup-b2-failed.service" ]~ to both backup services
- Added ~after/wants = [ "network-online.target" ]~ to backup-b2-check
- Added ~TimeoutStartSec~ (2h for backup, 1h for check)

* Process and Workflow
** What Worked Well
- Using orch consensus earlier in session for backup strategy validation
- Restore drill caught the /var/backup permissions issue before a real disaster
- Incremental approach: backup → DR runbook → upgrade → verify
- beads issue tracking kept work organized across session

** What Was Challenging
- sops key structure confusion (flat vs nested YAML)
- The permissions issue with /var/backup wasn't obvious
- Long nix copy times to server (~5 min for full closure)

* Learning and Insights
** Technical Insights
- PostgreSQL collation refresh is standard maintenance after glibc upgrades
- NixOS boot vs switch: boot is safer for major upgrades
- restic restore preserves permissions and ownership
- mautrix-slack has graceful retry logic for homeserver connectivity

** Process Insights
- Restore drills find issues that code review misses
- DR runbooks should be tested, not just written
- Upgrade checklists prevent forgotten steps

** Architectural Insights
- Three-key sops setup (host + admin + recovery) covers disaster scenarios
- Separating database upgrade from OS upgrade reduces risk
- services.postgresqlBackup is better than raw pg_dumpall scripts

* Context for Future Work
** Open Questions
- Should we set up backup monitoring (healthchecks.io)?
- When to upgrade PostgreSQL 15→16?
- Mirror flake to GitHub (jboq - deferred)?

** Next Steps
- Monitor logs for 24-48 hours post-upgrade
- Schedule quarterly restore drills
- Consider static UIDs for service users (permission consistency)

** Related Work
- [[file:2025-12-05-security-review-backup-implementation.org][2025-12-05 Security Review]] - Initial backup planning
- [[file:2026-01-05-phone-workflow-mosh-backup-beads-cleanup.org][2026-01-05 Backup Beads Cleanup]] - Earlier backup work
- docs/disaster-recovery-runbook.md - Created this session
- docs/nixos-24.11-upgrade-notes.md - Created this session

* Raw Notes
- NixOS 24.11 codename: Vicuna
- Kernel upgraded: 6.6.68 → 6.6.94
- systemd upgraded: 255.9 → 256.10
- Matrix-continuwuity went from RC to stable (0.5.0-rc.8.1 → 0.5.1)
- maubot upgraded significantly (0.4.2 → 0.5.0)
- Forgejo stayed on 7.x LTS (7.0.12 → 7.0.15)
- Closure size increased by ~137 MiB

Beads issues closed this session:
- zgs8 - B2 backup setup (from earlier in session)
- r177 - Add /home and /var/lib/acme to backups
- 93q9 - Add offline sops recovery key
- 09o - Review NixOS 24.11 release notes
- 7qg - Pin PostgreSQL to v15
- asi - Take verified backup before upgrade
- 3wd - Update flake to nixos-24.11
- a9d - Deploy NixOS 24.11
- 3zo - Post-upgrade verification
- 00e - Upgrade epic (parent)
- y8le - Stop Matrix before backup (closed: accepted risk)

* Session Metrics
- Commits made: 12
- Files touched: 10
- Lines added/removed: +1013/-52
- Tests added: 0 (restore drill was manual verification)
- Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
- Ops-review: 2 MED fixed, 4 LOW skipped (style-only statix warnings)
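
* Appendix: Backup Unit Sketch
The schedule, retention, and ops-review hardening recorded under Technical Details could be wired up roughly as follows. This is a minimal sketch, not the contents of ~modules/backup-b2.nix~: the unit name ~backup-b2~, the ~/run/secrets/restic-env~ path, and the journal-only failure action are assumptions made for illustration. Only ~backup-b2-check~, ~backup-b2-failed.service~, the timings, the timeouts, and the retention counts come from the notes above.

#+BEGIN_SRC nix
# Hypothetical sketch; the real module may be structured differently.
{ pkgs, ... }:
let
  # Settings shared by both backup units (hardening from the ops review).
  common = {
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    onFailure = [ "backup-b2-failed.service" ];
  };
  # Assumed path: a file that sets RESTIC_REPOSITORY, the restic password
  # and the B2 credentials, e.g. rendered by sops-nix.
  envFile = "/run/secrets/restic-env";
in
{
  # Failure notification; placeholder action writes to the journal at
  # error priority.
  systemd.services.backup-b2-failed = {
    description = "Report a failed B2 backup run";
    serviceConfig.Type = "oneshot";
    script = ''
      echo "B2 backup or integrity check failed" \
        | ${pkgs.systemd}/bin/systemd-cat -t backup-b2 -p err
    '';
  };

  # Daily backup at 3 AM, after the 2 AM PostgreSQL dump.
  # The unit name "backup-b2" is an assumption.
  systemd.services.backup-b2 = common // {
    description = "Daily restic backup to B2";
    startAt = "03:00";
    serviceConfig = {
      Type = "oneshot";
      TimeoutStartSec = "2h";
      EnvironmentFile = envFile;
    };
    script = ''
      ${pkgs.restic}/bin/restic backup /home /var/lib/acme /var/backup/postgresql
      ${pkgs.restic}/bin/restic forget --prune \
        --keep-daily 7 --keep-weekly 4 --keep-monthly 6
    '';
  };

  # Weekly integrity check on Sundays at 4 AM, reading a 5% data sample.
  systemd.services.backup-b2-check = common // {
    description = "Weekly restic integrity check";
    startAt = "Sun 04:00";
    serviceConfig = {
      Type = "oneshot";
      TimeoutStartSec = "1h";
      EnvironmentFile = envFile;
    };
    script = ''
      ${pkgs.restic}/bin/restic check --read-data-subset=5%
    '';
  };
}
#+END_SRC

Using ~script~ rather than a raw ~ExecStart~ string keeps the literal ~%~ in ~--read-data-subset=5%~ away from systemd's specifier expansion. The backup path list here covers only the paths named in this log and is likely incomplete.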