From d581d7bac4a293ac2532f96d6f66c8fe26ff0fa4 Mon Sep 17 00:00:00 2001
From: Dan
Date: Sat, 10 Jan 2026 18:46:38 -0800
Subject: [PATCH] Add worklog: NixOS 24.11 upgrade with DR preparation

---
 ...26-01-10-nixos-24.11-upgrade-dr-backup.org | 180 ++++++++++++++++++
 1 file changed, 180 insertions(+)
 create mode 100644 docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org

diff --git a/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org b/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org
new file mode 100644
index 0000000..8a74da4
--- /dev/null
+++ b/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org
@@ -0,0 +1,180 @@
+#+TITLE: NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
+#+DATE: 2026-01-10
+#+KEYWORDS: nixos-upgrade, backup, restic, disaster-recovery, postgresql, b2, restore-drill
+#+COMMITS: 11
+#+COMPRESSION_STATUS: uncompressed
+
+* Session Summary
+** Date: 2026-01-10 (Continued from previous session)
+** Focus Area: Complete NixOS 24.05 to 24.11 upgrade with full DR preparation
+
+* Accomplishments
+- [X] Completed B2 backup setup with restic (backup-b2.nix module)
+- [X] Added PostgreSQL dump automation via services.postgresqlBackup
+- [X] Created comprehensive disaster recovery runbook (docs/disaster-recovery-runbook.md)
+- [X] Added /home and /var/lib/acme to backup paths (was missing)
+- [X] Created offline sops recovery key for disaster scenarios
+- [X] Documented NixOS 24.11 breaking changes analysis
+- [X] Pinned PostgreSQL to v15 to prevent auto-upgrade
+- [X] Executed first restore drill - all tests passed
+- [X] Built and deployed NixOS 24.11 (generation 72)
+- [X] Verified all services post-upgrade
+- [X] Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
+- [X] Closed upgrade epic 00e with all 6 child tasks
+
+* Key Decisions
+** Decision 1: Use boot instead of switch for deployment
+- Context: Upgrading major NixOS version with systemd 255→256
+- Options considered:
+  1. nixos-rebuild switch - immediate activation
+  2. nixos-rebuild boot - stage for next boot
+- Rationale: boot provides cleaner service restarts, avoids mixed-state issues
+- Impact: Required reboot but ensured all services start fresh
+
+** Decision 2: Pin PostgreSQL to v15 instead of upgrading to v16
+- Context: NixOS 24.11 defaults to PostgreSQL 16
+- Options considered:
+  1. Pin to 15 now, upgrade PostgreSQL later (two steps)
+  2. Let PostgreSQL upgrade with NixOS (requires pg_upgrade)
+- Rationale: Safer to decouple NixOS upgrade from database upgrade
+- Impact: Added ~package = pkgs.postgresql_15;~ to dev-services.nix
+
+** Decision 3: Offline sops recovery key stored on laptop
+- Context: sops keys derived from SSH host key - lose host, lose secrets
+- Options considered:
+  1. Encrypted file on laptop
+  2. Paper key in safe
+  3. Hardware key (YubiKey)
+- Rationale: Laptop encrypted file balances security and accessibility
+- Impact: Added third age recipient to .sops.yaml
+
+* Problems & Solutions
+| Problem | Solution | Learning |
+|---------+----------+----------|
+| PostgreSQL backup dir /var/backup has 0750 permissions, postgres user can't traverse | chmod 751 /var/backup to allow execute traversal | Parent directory permissions matter for child access |
+| sops nested key structure mismatch | Restructured secrets.yaml from flat keys (restic/password) to nested YAML | sops-nix expects proper YAML nesting, not path-style keys |
+| PostgreSQL collation version mismatch after glibc upgrade | ALTER DATABASE ... REFRESH COLLATION VERSION; for each DB | Standard post-upgrade maintenance, not an error |
+| mautrix-slack retry warnings on startup | Matrix homeserver wasn't ready yet, bridge retried and connected | Service ordering worked correctly, just log noise |
+
+* Technical Details
+
+** Code Changes
+- Total files modified: 10
+- Key files changed:
+  - ~modules/backup-b2.nix~ - New B2 backup module with restic
+  - ~modules/dev-services.nix~ - Added postgresqlBackup, pinned PostgreSQL 15
+  - ~flake.nix~ - Updated to nixos-24.11, unpinned sops-nix
+  - ~.sops.yaml~ - Added recovery key recipient
+  - ~secrets/secrets.yaml~ - Added restic credentials (encrypted)
+- New files created:
+  - ~docs/disaster-recovery-runbook.md~ - Comprehensive DR documentation
+  - ~docs/nixos-24.11-upgrade-notes.md~ - Breaking changes analysis
+
+** Commands Used
+#+BEGIN_SRC bash
+# B2 bucket creation (manual in Backblaze console)
+# Created: ops-jrz1-backup with scoped application key
+
+# Trigger PostgreSQL backups
+systemctl start postgresqlBackup-forgejo.service postgresqlBackup-mautrix_slack.service
+
+# Restore drill - test restore to /tmp
+restic restore latest --target /tmp/dr-test --include /var/backup/postgresql
+gunzip -c /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz | head -30
+
+# Fix PostgreSQL collation after glibc upgrade
+sudo -u postgres psql -c "ALTER DATABASE postgres REFRESH COLLATION VERSION;"
+sudo -u postgres psql -c "ALTER DATABASE forgejo REFRESH COLLATION VERSION;"
+sudo -u postgres psql -c "ALTER DATABASE mautrix_slack REFRESH COLLATION VERSION;"
+
+# NixOS upgrade
+nix flake update
+nixos-rebuild build --flake .#ops-jrz1
+nix copy --to ssh://root@ops-jrz1 ./result
+ssh root@ops-jrz1 'nix-shell -p nvd --run "nvd diff /run/current-system /nix/store/..."'
+nixos-rebuild boot --flake .#ops-jrz1 --target-host root@ops-jrz1
+ssh root@ops-jrz1 reboot
+#+END_SRC
+
+** Architecture Notes
+- B2 backup runs daily at 3 AM (after PostgreSQL dump at 2 AM)
+- Weekly integrity check on Sundays at 4 AM (5% data sample)
+- Retention: 7 daily, 4 weekly, 6 monthly snapshots
+- Three sops keys: VPS host, admin workstation, offline recovery
+
+* Process and Workflow
+
+** What Worked Well
+- Using orch consensus earlier in session for backup strategy validation
+- Restore drill caught the /var/backup permissions issue before a real disaster
+- Incremental approach: backup → DR runbook → upgrade → verify
+- beads issue tracking kept work organized across session
+
+** What Was Challenging
+- sops key structure confusion (flat vs nested YAML)
+- The permissions issue with /var/backup wasn't obvious
+- Long nix copy times to server (~5 min for full closure)
+
+* Learning and Insights
+
+** Technical Insights
+- PostgreSQL collation refresh is standard maintenance after glibc upgrades
+- NixOS boot vs switch: boot is safer for major upgrades
+- restic restore preserves permissions and ownership
+- mautrix-slack has graceful retry logic for homeserver connectivity
+
+** Process Insights
+- Restore drills find issues that code review misses
+- DR runbooks should be tested, not just written
+- Upgrade checklists prevent forgotten steps
+
+** Architectural Insights
+- Three-key sops setup (host + admin + recovery) covers disaster scenarios
+- Separating database upgrade from OS upgrade reduces risk
+- services.postgresqlBackup is better than raw pg_dumpall scripts
+
+* Context for Future Work
+
+** Open Questions
+- Should we set up backup monitoring (healthchecks.io)?
+- When to upgrade PostgreSQL 15→16?
+- Mirror flake to GitHub (jboq - deferred)?
+
+** Next Steps
+- Monitor logs for 24-48 hours post-upgrade
+- Schedule quarterly restore drills
+- Consider static UIDs for service users (permission consistency)
+
+** Related Work
+- [[file:2025-12-05-security-review-backup-implementation.org][2025-12-05 Security Review]] - Initial backup planning
+- [[file:2026-01-05-phone-workflow-mosh-backup-beads-cleanup.org][2026-01-05 Backup Beads Cleanup]] - Earlier backup work
+- docs/disaster-recovery-runbook.md - Created this session
+- docs/nixos-24.11-upgrade-notes.md - Created this session
+
+* Raw Notes
+- NixOS 24.11 codename: Vicuna
+- Kernel upgraded: 6.6.68 → 6.6.94
+- systemd upgraded: 255.9 → 256.10
+- Matrix-continuwuity went from RC to stable (0.5.0-rc.8.1 → 0.5.1)
+- maubot upgraded significantly (0.4.2 → 0.5.0)
+- Forgejo stayed on 7.x LTS (7.0.12 → 7.0.15)
+- Closure size increased by ~137 MiB
+
+Beads issues closed this session:
+- zgs8 - B2 backup setup (from earlier in session)
+- r177 - Add /home and /var/lib/acme to backups
+- 93q9 - Add offline sops recovery key
+- 09o - Review NixOS 24.11 release notes
+- 7qg - Pin PostgreSQL to v15
+- asi - Take verified backup before upgrade
+- 3wd - Update flake to nixos-24.11
+- a9d - Deploy NixOS 24.11
+- 3zo - Post-upgrade verification
+- 00e - Upgrade epic (parent)
+
+* Session Metrics
+- Commits made: 11
+- Files touched: 10
+- Lines added/removed: +991/-52
+- Tests added: 0 (restore drill was manual verification)
+- Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
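
Note on the collation fix recorded in the worklog: the three ~ALTER DATABASE ... REFRESH COLLATION VERSION~ commands are the same statement repeated per database, so they could be generated from a list of names. The following is a minimal sketch of that idea, not part of the commit. The function name ~collation_refresh_sql~ is hypothetical, and the database names are simply the three that appear in the worklog.

```shell
#!/bin/sh
# Sketch only: collation_refresh_sql is a hypothetical helper, not from the commit.
set -eu

# Print one REFRESH COLLATION VERSION statement per database name given as an argument.
collation_refresh_sql() {
  for db in "$@"; do
    # Double quotes make the database name a quoted SQL identifier.
    printf 'ALTER DATABASE "%s" REFRESH COLLATION VERSION;\n' "$db"
  done
}

# Database names taken from the worklog's Commands Used block.
collation_refresh_sql postgres forgejo mautrix_slack
```

The script only prints SQL, so it can be reviewed before anything runs. Assuming the same ~sudo -u postgres psql~ access used in the worklog, the output could then be piped into ~sudo -u postgres psql -v ON_ERROR_STOP=1~ so that a failure on one database stops the run.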