ops-jrz1/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org
Dan fec21745ce Update worklog with ops-review fixes and y8le decision
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 20:19:07 -08:00


#+TITLE: NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
#+DATE: 2026-01-10
#+KEYWORDS: nixos-upgrade, backup, restic, disaster-recovery, postgresql, b2, restore-drill
#+COMMITS: 12
#+COMPRESSION_STATUS: uncompressed
* Session Summary
** Date: 2026-01-10 (Continued from previous session)
** Focus Area: Complete NixOS 24.05 to 24.11 upgrade with full DR preparation
* Accomplishments
- [X] Completed B2 backup setup with restic (backup-b2.nix module)
- [X] Added PostgreSQL dump automation via services.postgresqlBackup
- [X] Created comprehensive disaster recovery runbook (docs/disaster-recovery-runbook.md)
- [X] Added /home and /var/lib/acme to backup paths (both were missing)
- [X] Created offline sops recovery key for disaster scenarios
- [X] Documented NixOS 24.11 breaking changes analysis
- [X] Pinned PostgreSQL to v15 to prevent auto-upgrade
- [X] Executed first restore drill - all tests passed
- [X] Built and deployed NixOS 24.11 (generation 72)
- [X] Verified all services post-upgrade
- [X] Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
- [X] Closed upgrade epic 00e with all 6 child tasks
- [X] Ran ops-review on backup module, fixed 2 MED findings
- [X] Added failure notification service (backup-b2-failed) with OnFailure handlers
- [X] Added network dependency and timeouts to backup services
- [X] Post-upgrade health check: all services active, no failed units
* Key Decisions
** Decision 1: Use boot instead of switch for deployment
- Context: Upgrading major NixOS version with systemd 255→256
- Options considered:
  1. ~nixos-rebuild switch~ - immediate activation
  2. ~nixos-rebuild boot~ - stage for next boot
- Rationale: staging with ~boot~ and rebooting starts every service fresh on the new generation, avoiding the mixed old/new state a live ~switch~ can leave behind
- Impact: Required reboot but ensured all services start fresh
** Decision 2: Pin PostgreSQL to v15 instead of upgrading to v16
- Context: NixOS 24.11 defaults to PostgreSQL 16
- Options considered:
1. Pin to 15 now, upgrade PostgreSQL later (two steps)
2. Let PostgreSQL upgrade with NixOS (requires pg_upgrade)
- Rationale: Safer to decouple NixOS upgrade from database upgrade
- Impact: Added ~package = pkgs.postgresql_15;~ to dev-services.nix
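The pin described in the Impact line can be sketched as a minimal NixOS module fragment. Only the ~package~ line is taken from this worklog; the surrounding ~services.postgresql~ block and ~enable = true~ are assumptions about how dev-services.nix is laid out.

#+BEGIN_SRC nix
{ pkgs, ... }:
{
  services.postgresql = {
    enable = true; # assumed; the worklog only confirms the package line
    # Pin the major version so the NixOS upgrade does not also move the database.
    package = pkgs.postgresql_15;
  };
}
#+END_SRC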
** Decision 3: Offline sops recovery key stored on laptop
- Context: sops keys are derived from the SSH host key, so losing the host can mean losing access to the secrets
- Options considered:
  1. Encrypted file on laptop
  2. Paper key in safe
  3. Hardware key (YubiKey)
- Rationale: An encrypted file on the laptop balances security and accessibility
- Impact: Added third age recipient to .sops.yaml
** Decision 4: Accept RocksDB backup consistency risk
- Context: Matrix-continuwuity uses RocksDB, backed up while running
- Options considered:
  1. Stop matrix-continuwuity during backup (~30s downtime at 3 AM)
  2. Use RocksDB checkpoint API (requires upstream support)
  3. Accept risk - RocksDB has crash consistency
- Rationale: The 3 AM backup window has minimal activity, and the retained daily snapshots provide fallback points if one snapshot is inconsistent
- Impact: Closed y8le without implementing the service stop; can re-evaluate if a restore drill shows corruption
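For reference, option 1 (not implemented) could look roughly like the fragment below if the decision is revisited. It assumes the backup unit is ~backup-b2.service~ and the homeserver unit is ~matrix-continuwuity.service~; both names are inferred from this worklog, not checked against the module.

#+BEGIN_SRC nix
{ pkgs, ... }:
{
  systemd.services.backup-b2.serviceConfig = {
    # Stop the homeserver so RocksDB is quiescent while restic reads its files.
    ExecStartPre = "${pkgs.systemd}/bin/systemctl stop matrix-continuwuity.service";
    # ExecStopPost runs whether the backup succeeded or failed,
    # so the homeserver is started again in both cases.
    ExecStopPost = "${pkgs.systemd}/bin/systemctl start matrix-continuwuity.service";
  };
}
#+END_SRC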
* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| PostgreSQL backup dir /var/backup has 0750 permissions, postgres user can't traverse | chmod 751 /var/backup to allow execute traversal | Parent directory permissions matter for child access |
| sops nested key structure mismatch | Restructured secrets.yaml from flat keys (restic/password) to nested YAML | sops-nix expects proper YAML nesting, not path-style keys |
| PostgreSQL collation version mismatch after glibc upgrade | ALTER DATABASE ... REFRESH COLLATION VERSION; for each DB | Standard post-upgrade maintenance, not an error |
| mautrix-slack retry warnings on startup | Matrix homeserver wasn't ready yet, bridge retried and connected | Service ordering worked correctly, just log noise |
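The first two fixes in the table could also be captured declaratively so they survive a rebuild. This is a sketch, not what was deployed: the session used a manual chmod, the ~root root~ ownership is an assumption, and sops-nix must already be imported with a default sops file configured.

#+BEGIN_SRC nix
{ ... }:
{
  # Keep /var/backup traversable (0751) so the postgres user can reach
  # /var/backup/postgresql underneath it.
  systemd.tmpfiles.rules = [
    "d /var/backup 0751 root root -"
  ];

  # sops-nix resolves "restic/password" as the key "password" nested under
  # "restic" in secrets.yaml, not as a literal top-level key with a slash.
  sops.secrets."restic/password" = { };
}
#+END_SRC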
* Technical Details
** Code Changes
- Total files modified: 10
- Key files changed:
  - ~modules/backup-b2.nix~ - New B2 backup module with restic
  - ~modules/dev-services.nix~ - Added postgresqlBackup, pinned PostgreSQL 15
  - ~flake.nix~ - Updated to nixos-24.11, unpinned sops-nix
  - ~.sops.yaml~ - Added recovery key recipient
  - ~secrets/secrets.yaml~ - Added restic credentials (encrypted)
- New files created:
  - ~docs/disaster-recovery-runbook.md~ - Comprehensive DR documentation
  - ~docs/nixos-24.11-upgrade-notes.md~ - Breaking changes analysis
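The ~flake.nix~ change amounts to moving the nixpkgs input to the 24.11 branch and dropping the explicit pin on sops-nix. A minimal sketch, assuming standard input URLs, a ~follows~ on nixpkgs, an x86_64-linux host, and a hypothetical ~./configuration.nix~ entry point (the real flake's layout is not shown in this worklog):

#+BEGIN_SRC nix
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.11";
    # No rev or ref in the URL: sops-nix tracks its default branch,
    # with the exact revision recorded only in flake.lock.
    sops-nix.url = "github:Mic92/sops-nix";
    sops-nix.inputs.nixpkgs.follows = "nixpkgs"; # assumption
  };

  outputs = { nixpkgs, sops-nix, ... }: {
    nixosConfigurations.ops-jrz1 = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux"; # assumption
      modules = [
        sops-nix.nixosModules.sops
        ./configuration.nix # hypothetical path
      ];
    };
  };
}
#+END_SRC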
** Commands Used
#+BEGIN_SRC bash
# B2 bucket creation (manual in Backblaze console)
# Created: ops-jrz1-backup with scoped application key
# Trigger PostgreSQL backups
systemctl start postgresqlBackup-forgejo.service postgresqlBackup-mautrix_slack.service
# Restore drill - test restore to /tmp
restic restore latest --target /tmp/dr-test --include /var/backup/postgresql
gunzip -c /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz | head -30
# Fix PostgreSQL collation after glibc upgrade
sudo -u postgres psql -c "ALTER DATABASE postgres REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE forgejo REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE mautrix_slack REFRESH COLLATION VERSION;"
# NixOS upgrade
nix flake update
nixos-rebuild build --flake .#ops-jrz1
nix copy --to ssh://root@ops-jrz1 ./result
ssh root@ops-jrz1 'nix-shell -p nvd --run "nvd diff /run/current-system /nix/store/..."'
nixos-rebuild boot --flake .#ops-jrz1 --target-host root@ops-jrz1
ssh root@ops-jrz1 reboot
#+END_SRC
** Architecture Notes
- B2 backup runs daily at 3 AM (after PostgreSQL dump at 2 AM)
- Weekly integrity check on Sundays at 4 AM (5% data sample)
- Retention: 7 daily, 4 weekly, 6 monthly snapshots
- Three sops keys: VPS host, admin workstation, offline recovery
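The schedule and retention above map onto NixOS options roughly as follows. This is a sketch under assumptions: the unit names ~backup-b2~ and ~backup-b2-check~ come from this worklog, but the script bodies, the path list (only the three paths named in this worklog), and the ~/run/secrets/restic-env~ environment file are illustrative rather than copied from ~modules/backup-b2.nix~.

#+BEGIN_SRC nix
{ pkgs, ... }:
{
  # 2 AM: per-database dumps, written to /var/backup/postgresql (the module default).
  services.postgresqlBackup = {
    enable = true;
    databases = [ "forgejo" "mautrix_slack" ];
    startAt = "*-*-* 02:00:00";
  };

  # 3 AM: upload to B2, then apply the retention policy.
  systemd.services.backup-b2 = {
    startAt = "*-*-* 03:00:00";
    path = [ pkgs.restic ];
    serviceConfig = {
      Type = "oneshot";
      # Hypothetical file providing RESTIC_REPOSITORY, RESTIC_PASSWORD_FILE
      # and the B2 credentials.
      EnvironmentFile = "/run/secrets/restic-env";
    };
    script = ''
      restic backup /home /var/lib/acme /var/backup/postgresql
      restic forget --prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6
    '';
  };

  # Sunday 4 AM: check repository structure and read a 5% sample of the data.
  systemd.services.backup-b2-check = {
    startAt = "Sun *-*-* 04:00:00";
    path = [ pkgs.restic ];
    serviceConfig = {
      Type = "oneshot";
      EnvironmentFile = "/run/secrets/restic-env";
    };
    script = ''
      restic check --read-data-subset=5%
    '';
  };
}
#+END_SRC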
** Ops Review Findings
Ran ops-review skill with lenses: secrets, blast-radius, observability, resilience, nix-hygiene
MED (fixed):
1. ~backup-b2-check~ missing ~network-online.target~ dependency
2. No failure notification mechanism for backup services
LOW (skipped - style only):
- statix W20 warnings about repeated keys in Nix modules (idiomatic pattern, not worth refactoring)
Changes made to ~modules/backup-b2.nix~:
- Added ~backup-b2-failed.service~ oneshot for failure notification
- Added ~onFailure = [ "backup-b2-failed.service" ]~ to both backup services
- Added ~network-online.target~ to both ~after~ and ~wants~ on backup-b2-check
- Added ~TimeoutStartSec~ (2h for backup, 1h for check)
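Taken together, the four changes listed above could look like the fragment below. It is a sketch: the unit names, targets, and timeout values come from this worklog, while the body of the notification script is a placeholder because the actual notification action is not recorded here.

#+BEGIN_SRC nix
{ ... }:
{
  # Oneshot started by systemd when a unit listing it in onFailure fails.
  systemd.services.backup-b2-failed = {
    description = "Report a failed B2 backup or integrity check";
    serviceConfig.Type = "oneshot";
    # Placeholder action: write an error line to the journal.
    script = ''
      echo "B2 backup or integrity check failed" >&2
    '';
  };

  systemd.services.backup-b2 = {
    onFailure = [ "backup-b2-failed.service" ];
    serviceConfig.TimeoutStartSec = "2h";
  };

  systemd.services.backup-b2-check = {
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    onFailure = [ "backup-b2-failed.service" ];
    serviceConfig.TimeoutStartSec = "1h";
  };
}
#+END_SRC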
* Process and Workflow
** What Worked Well
- Using orch consensus earlier in session for backup strategy validation
- Restore drill caught the /var/backup permissions issue before a real disaster
- Incremental approach: backup → DR runbook → upgrade → verify
- beads issue tracking kept work organized across session
** What Was Challenging
- sops key structure confusion (flat vs nested YAML)
- The permissions issue with /var/backup wasn't obvious
- Long nix copy times to server (~5 min for full closure)
* Learning and Insights
** Technical Insights
- PostgreSQL collation refresh is standard maintenance after glibc upgrades
- NixOS boot vs switch: boot is safer for major upgrades
- restic restore preserves permissions and ownership (when run as root)
- mautrix-slack has graceful retry logic for homeserver connectivity
** Process Insights
- Restore drills find issues that code review misses
- DR runbooks should be tested, not just written
- Upgrade checklists prevent forgotten steps
** Architectural Insights
- Three-key sops setup (host + admin + recovery) covers disaster scenarios
- Separating database upgrade from OS upgrade reduces risk
- ~services.postgresqlBackup~ generates a dump unit and timer per database, which is easier to maintain than raw pg_dumpall scripts
* Context for Future Work
** Open Questions
- Should we set up backup monitoring (healthchecks.io)?
- When to upgrade PostgreSQL 15→16?
- Mirror flake to GitHub (jboq - deferred)?
** Next Steps
- Monitor logs for 24-48 hours post-upgrade
- Schedule quarterly restore drills
- Consider static UIDs for service users (permission consistency)
** Related Work
- [[file:2025-12-05-security-review-backup-implementation.org][2025-12-05 Security Review]] - Initial backup planning
- [[file:2026-01-05-phone-workflow-mosh-backup-beads-cleanup.org][2026-01-05 Backup Beads Cleanup]] - Earlier backup work
- docs/disaster-recovery-runbook.md - Created this session
- docs/nixos-24.11-upgrade-notes.md - Created this session
* Raw Notes
- NixOS 24.11 codename: Vicuna
- Kernel upgraded: 6.6.68 → 6.6.94
- systemd upgraded: 255.9 → 256.10
- Matrix-continuwuity went from RC to stable (0.5.0-rc.8.1 → 0.5.1)
- maubot upgraded significantly (0.4.2 → 0.5.0)
- Forgejo stayed on 7.x LTS (7.0.12 → 7.0.15)
- Closure size increased by ~137 MiB
Beads issues closed this session:
- zgs8 - B2 backup setup (from earlier in session)
- r177 - Add /home and /var/lib/acme to backups
- 93q9 - Add offline sops recovery key
- 09o - Review NixOS 24.11 release notes
- 7qg - Pin PostgreSQL to v15
- asi - Take verified backup before upgrade
- 3wd - Update flake to nixos-24.11
- a9d - Deploy NixOS 24.11
- 3zo - Post-upgrade verification
- 00e - Upgrade epic (parent)
- y8le - Stop Matrix before backup (closed: accepted risk)
* Session Metrics
- Commits made: 12
- Files touched: 10
- Lines added/removed: +1013/-52
- Tests added: 0 (restore drill was manual verification)
- Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
- Ops-review: 2 MED fixed, 4 LOW skipped (style-only statix warnings)