Add worklog: NixOS 24.11 upgrade with DR preparation
parent
75515c7e53
commit
d581d7bac4
180
docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org
Normal file
#+TITLE: NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
#+DATE: 2026-01-10
#+KEYWORDS: nixos-upgrade, backup, restic, disaster-recovery, postgresql, b2, restore-drill
#+COMMITS: 11
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2026-01-10 (Continued from previous session)
** Focus Area: Complete NixOS 24.05 to 24.11 upgrade with full DR preparation

* Accomplishments
- [X] Completed B2 backup setup with restic (backup-b2.nix module)
- [X] Added PostgreSQL dump automation via services.postgresqlBackup
- [X] Created comprehensive disaster recovery runbook (docs/disaster-recovery-runbook.md)
- [X] Added /home and /var/lib/acme to backup paths (previously missing)
- [X] Created offline sops recovery key for disaster scenarios
- [X] Documented NixOS 24.11 breaking changes analysis
- [X] Pinned PostgreSQL to v15 to prevent auto-upgrade
- [X] Executed first restore drill - all tests passed
- [X] Built and deployed NixOS 24.11 (generation 72)
- [X] Verified all services post-upgrade
- [X] Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
- [X] Closed upgrade epic 00e with all 6 child tasks

* Key Decisions
** Decision 1: Use boot instead of switch for deployment
- Context: Upgrading major NixOS version with systemd 255→256
- Options considered:
  1. nixos-rebuild switch - immediate activation
  2. nixos-rebuild boot - stage for next boot
- Rationale: boot provides cleaner service restarts, avoids mixed-state issues
- Impact: Required reboot but ensured all services start fresh

** Decision 2: Pin PostgreSQL to v15 instead of upgrading to v16
- Context: NixOS 24.11 defaults to PostgreSQL 16
- Options considered:
  1. Pin to 15 now, upgrade PostgreSQL later (two steps)
  2. Let PostgreSQL upgrade with NixOS (requires pg_upgrade)
- Rationale: Safer to decouple NixOS upgrade from database upgrade
- Impact: Added ~package = pkgs.postgresql_15;~ to dev-services.nix

** Decision 3: Offline sops recovery key stored on laptop
- Context: sops keys derived from SSH host key - lose host, lose secrets
- Options considered:
  1. Encrypted file on laptop
  2. Paper key in safe
  3. Hardware key (YubiKey)
- Rationale: Laptop encrypted file balances security and accessibility
- Impact: Added third age recipient to .sops.yaml

* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| PostgreSQL backup dir /var/backup has 0750 permissions, postgres user can't traverse | chmod 751 /var/backup to allow execute traversal | Parent directory permissions matter for child access |
| sops nested key structure mismatch | Restructured secrets.yaml from flat keys (restic/password) to nested YAML | sops-nix expects proper YAML nesting, not path-style keys |
| PostgreSQL collation version mismatch after glibc upgrade | ALTER DATABASE ... REFRESH COLLATION VERSION; for each DB | Standard post-upgrade maintenance, not an error |
| mautrix-slack retry warnings on startup | Matrix homeserver wasn't ready yet, bridge retried and connected | Service ordering worked correctly, just log noise |
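
The /var/backup row can be reproduced in isolation. The sketch below was not run during the session: it uses a scratch directory from ~mktemp~ in place of /var/backup and a hypothetical helper, ~others_can_traverse~, to show that the last octal digit decides whether a user outside the owning group can pass through a directory. It assumes GNU ~stat~ (for ~-c '%a'~).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the session): succeeds when the "other"
# execute (search) bit is set in an octal mode string such as 750 or 751.
others_can_traverse() {
  local mode=$1
  (( (8#$mode & 8#001) != 0 ))
}

# Scratch directory standing in for /var/backup.
demo_dir=$(mktemp -d)
chmod 750 "$demo_dir"
mode_before=$(stat -c '%a' "$demo_dir")
chmod 751 "$demo_dir"
mode_after=$(stat -c '%a' "$demo_dir")
rmdir "$demo_dir"

# Report what each mode means for a user who is neither owner nor in the group.
for mode in "$mode_before" "$mode_after"; do
  if others_can_traverse "$mode"; then
    echo "$mode: others can traverse"
  else
    echo "$mode: others cannot traverse"
  fi
done
```

Mode 751 grants only the search bit to others, so a service user can reach its own subdirectory without being able to list the parent.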
* Technical Details

** Code Changes
- Total files modified: 10
- Key files changed:
  - ~modules/backup-b2.nix~ - New B2 backup module with restic
  - ~modules/dev-services.nix~ - Added postgresqlBackup, pinned PostgreSQL 15
  - ~flake.nix~ - Updated to nixos-24.11, unpinned sops-nix
  - ~.sops.yaml~ - Added recovery key recipient
  - ~secrets/secrets.yaml~ - Added restic credentials (encrypted)
- New files created:
  - ~docs/disaster-recovery-runbook.md~ - Comprehensive DR documentation
  - ~docs/nixos-24.11-upgrade-notes.md~ - Breaking changes analysis

** Commands Used
#+BEGIN_SRC bash
# B2 bucket creation (manual in Backblaze console)
# Created: ops-jrz1-backup with scoped application key

# Trigger PostgreSQL backups
systemctl start postgresqlBackup-forgejo.service postgresqlBackup-mautrix_slack.service

# Restore drill - test restore to /tmp
restic restore latest --target /tmp/dr-test --include /var/backup/postgresql
gunzip -c /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz | head -30

# Fix PostgreSQL collation after glibc upgrade
sudo -u postgres psql -c "ALTER DATABASE postgres REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE forgejo REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE mautrix_slack REFRESH COLLATION VERSION;"

# NixOS upgrade
nix flake update
nixos-rebuild build --flake .#ops-jrz1
nix copy --to ssh://root@ops-jrz1 ./result
ssh root@ops-jrz1 'nix-shell -p nvd --run "nvd diff /run/current-system /nix/store/..."'
nixos-rebuild boot --flake .#ops-jrz1 --target-host root@ops-jrz1
ssh root@ops-jrz1 reboot
#+END_SRC
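
The three collation commands differ only in the database name, so they could be folded into a loop for the next glibc bump. This is a sketch, not what was run: ~collation_refresh_sql~ and the ~APPLY~ switch are hypothetical names, and the script only prints the statements unless ~APPLY=1~ is set.

```shell
#!/usr/bin/env bash
# Database names taken from the commands in this worklog.
databases=(postgres forgejo mautrix_slack)

# Hypothetical helper: print the refresh statement for one database.
# The name is double-quoted as an SQL identifier.
collation_refresh_sql() {
  printf 'ALTER DATABASE "%s" REFRESH COLLATION VERSION;\n' "$1"
}

# Dry run by default: print each statement. With APPLY=1, run it via psql.
for db in "${databases[@]}"; do
  sql=$(collation_refresh_sql "$db")
  if [ "${APPLY:-0}" = "1" ]; then
    sudo -u postgres psql -c "$sql"
  else
    echo "$sql"
  fi
done
```

Keeping the dry run as the default makes it safe to paste into a runbook and review before anything touches the server.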
** Architecture Notes
- B2 backup runs daily at 3 AM (after PostgreSQL dump at 2 AM)
- Weekly integrity check on Sundays at 4 AM (5% data sample)
- Retention: 7 daily, 4 weekly, 6 monthly snapshots
- Three sops keys: VPS host, admin workstation, offline recovery
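
For reference, the retention and integrity-check settings above correspond to restic flags along the following lines. The backup-b2.nix module is not reproduced in this worklog, so how it actually passes these values to restic is an assumption; ~retention_args~ is a hypothetical helper, and the script only echoes the commands rather than running them.

```shell
#!/usr/bin/env bash
# Retention counts from the notes above: 7 daily, 4 weekly, 6 monthly.
# Hypothetical helper: emit the matching restic flags, one token per line.
retention_args() {
  printf '%s\n' --keep-daily 7 --keep-weekly 4 --keep-monthly 6
}

# Collect the tokens into an array so they stay separate arguments.
mapfile -t keep < <(retention_args)

# Echo only: show the commands without touching a repository.
echo restic forget "${keep[@]}" --prune
echo restic check --read-data-subset=5%
```

Run by hand, these would also need the repository and credentials in the environment (RESTIC_REPOSITORY, RESTIC_PASSWORD_FILE, and the B2 key variables).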
* Process and Workflow

** What Worked Well
- Using orch consensus earlier in the session for backup strategy validation
- Restore drill caught the /var/backup permissions issue before a real disaster
- Incremental approach: backup → DR runbook → upgrade → verify
- beads issue tracking kept work organized across the session

** What Was Challenging
- sops key structure confusion (flat vs nested YAML)
- The permissions issue with /var/backup wasn't obvious
- Long nix copy times to server (~5 min for full closure)

* Learning and Insights

** Technical Insights
- PostgreSQL collation refresh is standard maintenance after glibc upgrades
- NixOS boot vs switch: boot is safer for major upgrades
- restic restore preserves permissions and ownership
- mautrix-slack has graceful retry logic for homeserver connectivity

** Process Insights
- Restore drills find issues that code review misses
- DR runbooks should be tested, not just written
- Upgrade checklists prevent forgotten steps

** Architectural Insights
- Three-key sops setup (host + admin + recovery) covers disaster scenarios
- Separating database upgrade from OS upgrade reduces risk
- services.postgresqlBackup is better than raw pg_dumpall scripts
* Context for Future Work

** Open Questions
- Should we set up backup monitoring (healthchecks.io)?
- When to upgrade PostgreSQL 15→16?
- Mirror flake to GitHub (jboq - deferred)?

** Next Steps
- Monitor logs for 24-48 hours post-upgrade
- Schedule quarterly restore drills
- Consider static UIDs for service users (permission consistency)
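
For the quarterly drills, the manual ~gunzip -c ... | head -30~ inspection could be backed by a scripted check. The sketch below is one possible shape, not part of the current runbook: ~verify_dump~ is a hypothetical helper, and the demo uses scratch files rather than a real restored dump.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if the file is a non-empty, intact gzip
# stream whose decompressed content is also non-empty.
verify_dump() {
  local dump=$1
  [ -s "$dump" ] || return 1                  # file exists and has data
  gzip -t "$dump" 2>/dev/null || return 1    # compressed stream is intact
  [ -n "$(gunzip -c "$dump" | head -c 1)" ]   # something comes out of it
}

# Demo with scratch files standing in for restored dumps.
scratch=$(mktemp -d)
printf -- '-- PostgreSQL database dump\n' | gzip > "$scratch/good.sql.gz"
printf 'not gzip data' > "$scratch/bad.sql.gz"

for f in "$scratch/good.sql.gz" "$scratch/bad.sql.gz"; do
  if verify_dump "$f"; then
    echo "$(basename "$f"): ok"
  else
    echo "$(basename "$f"): FAILED"
  fi
done

rm -rf "$scratch"
```

In a real drill the loop would point at the restored files under /tmp/dr-test/var/backup/postgresql instead of the scratch directory. A check like this only proves the archive is readable; loading the dump into a scratch PostgreSQL instance remains the stronger test.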
** Related Work
- [[file:2025-12-05-security-review-backup-implementation.org][2025-12-05 Security Review]] - Initial backup planning
- [[file:2026-01-05-phone-workflow-mosh-backup-beads-cleanup.org][2026-01-05 Backup Beads Cleanup]] - Earlier backup work
- docs/disaster-recovery-runbook.md - Created this session
- docs/nixos-24.11-upgrade-notes.md - Created this session

* Raw Notes
- NixOS 24.11 codename: Vicuna
- Kernel upgraded: 6.6.68 → 6.6.94
- systemd upgraded: 255.9 → 256.10
- Matrix-continuwuity went from RC to stable (0.5.0-rc.8.1 → 0.5.1)
- maubot upgraded significantly (0.4.2 → 0.5.0)
- Forgejo stayed on 7.x LTS (7.0.12 → 7.0.15)
- Closure size increased by ~137 MiB

Beads issues closed this session:
- zgs8 - B2 backup setup (from earlier in session)
- r177 - Add /home and /var/lib/acme to backups
- 93q9 - Add offline sops recovery key
- 09o - Review NixOS 24.11 release notes
- 7qg - Pin PostgreSQL to v15
- asi - Take verified backup before upgrade
- 3wd - Update flake to nixos-24.11
- a9d - Deploy NixOS 24.11
- 3zo - Post-upgrade verification
- 00e - Upgrade epic (parent)

* Session Metrics
- Commits made: 11
- Files touched: 10
- Lines added/removed: +991/-52
- Tests added: 0 (restore drill was manual verification)
- Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)