#+TITLE: NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
#+DATE: 2026-01-10
#+KEYWORDS: nixos-upgrade, backup, restic, disaster-recovery, postgresql, b2, restore-drill
#+COMMITS: 12
#+COMPRESSION_STATUS: uncompressed

* Session Summary
** Date: 2026-01-10 (Continued from previous session)
** Focus Area: Complete NixOS 24.05 to 24.11 upgrade with full DR preparation

* Accomplishments
- [X] Completed B2 backup setup with restic (backup-b2.nix module)
- [X] Added PostgreSQL dump automation via services.postgresqlBackup
- [X] Created comprehensive disaster recovery runbook (docs/disaster-recovery-runbook.md)
- [X] Added /home and /var/lib/acme to backup paths (both were missing)
- [X] Created offline sops recovery key for disaster scenarios
- [X] Documented NixOS 24.11 breaking changes analysis
- [X] Pinned PostgreSQL to v15 to prevent auto-upgrade
- [X] Executed first restore drill - all tests passed
- [X] Built and deployed NixOS 24.11 (generation 72)
- [X] Verified all services post-upgrade
- [X] Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
- [X] Closed upgrade epic 00e with all 6 child tasks
- [X] Ran ops-review on backup module, fixed 2 MED findings
- [X] Added failure notification service (backup-b2-failed) with OnFailure handlers
- [X] Added network dependency and timeouts to backup services
- [X] Post-upgrade health check: all services active, no failed units

* Key Decisions
** Decision 1: Use boot instead of switch for deployment
- Context: Upgrading major NixOS version with systemd 255→256
- Options considered:
  1. ~nixos-rebuild switch~ - immediate activation
  2. ~nixos-rebuild boot~ - stage for next boot
- Rationale: boot activates the new generation only at the next reboot, which gives cleaner service restarts and avoids mixed-state issues from switching a live system
- Impact: Required a reboot but ensured all services started fresh

** Decision 2: Pin PostgreSQL to v15 instead of upgrading to v16
- Context: NixOS 24.11 defaults to PostgreSQL 16
- Options considered:
  1. Pin to 15 now, upgrade PostgreSQL later (two steps)
  2. Let PostgreSQL upgrade with NixOS (requires pg_upgrade)
- Rationale: Safer to decouple NixOS upgrade from database upgrade
- Impact: Added ~package = pkgs.postgresql_15;~ to dev-services.nix

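A post-deploy check can confirm the pin held. This is a minimal sketch, not something run in the session: ~pg_major~ and ~pin_held~ are hypothetical helper names, and the version string is assumed to come from ~SHOW server_version~.

```bash
# Hypothetical helper: print the major version from a PostgreSQL
# server_version string such as "15.10".
pg_major() {
    printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed only if the reported version matches the pin.
pin_held() {
    [ "$(pg_major "$1")" = "$2" ]
}

# On the host, the version string could be fetched with (not run here):
#   sudo -u postgres psql -tAc 'SHOW server_version'
```

For example, ~pin_held "$(sudo -u postgres psql -tAc 'SHOW server_version')" 15~ would fail loudly if the rebuild had pulled in PostgreSQL 16.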
** Decision 3: Offline sops recovery key stored on laptop
- Context: sops keys derived from SSH host key - lose host, lose secrets
- Options considered:
  1. Encrypted file on laptop
  2. Paper key in safe
  3. Hardware key (YubiKey)
- Rationale: Laptop encrypted file balances security and accessibility
- Impact: Added third age recipient to .sops.yaml

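Adding a recipient to ~.sops.yaml~ does not by itself re-encrypt existing files; ~sops updatekeys~ does that. The sketch below is one way to confirm an encrypted file really carries three age recipients. ~count_age_recipients~ is a hypothetical helper, and it assumes the usual sops YAML metadata layout with one ~recipient: age1...~ line per key.

```bash
# Hypothetical helper: count the age recipients recorded in the sops
# metadata of an encrypted YAML file.
count_age_recipients() {
    grep -c 'recipient: age1' "$1"
}

# Typical flow on the admin workstation (not run here):
#   sops updatekeys secrets/secrets.yaml
#   [ "$(count_age_recipients secrets/secrets.yaml)" -eq 3 ]
```

The count only shows that the metadata lists three keys. Actually decrypting with the offline key remains the real test.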
** Decision 4: Accept RocksDB backup consistency risk
- Context: Matrix-continuwuity uses RocksDB, backed up while running
- Options considered:
  1. Stop matrix-continuwuity during backup (~30s downtime at 3 AM)
  2. Use RocksDB checkpoint API (requires upstream support)
  3. Accept risk - RocksDB has crash consistency
- Rationale: 3 AM backup window has minimal activity, multiple daily snapshots provide redundancy
- Impact: Closed y8le without implementing service stop; can re-evaluate if restore drill shows corruption

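If a later restore drill does show corruption, option 1 could look roughly like the sketch below. It was not implemented this session. ~backup_with_service_stopped~ is a hypothetical wrapper, and the ~SYSTEMCTL~ override exists only so the flow can be dry-run with ~echo~.

```bash
# Sketch of option 1: stop the service, run the backup command, and always
# restart the service, even if the backup fails.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

backup_with_service_stopped() {
    unit="$1"; shift
    "$SYSTEMCTL" stop "$unit" || return 1
    "$@"                        # the backup command, passed by the caller
    status=$?
    "$SYSTEMCTL" start "$unit" || return 1
    return "$status"            # propagate the backup's exit code
}
```

A dry run such as ~SYSTEMCTL=echo backup_with_service_stopped demo.service echo backup~ prints the stop, backup, and start steps in order without touching any unit.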
* Problems & Solutions
| Problem | Solution | Learning |
|---------+----------+----------|
| PostgreSQL backup dir /var/backup has 0750 permissions, postgres user can't traverse | ~chmod 751 /var/backup~ to allow execute traversal | Parent directory permissions matter for child access |
| sops nested key structure mismatch | Restructured secrets.yaml from flat keys (restic/password) to nested YAML | sops-nix expects proper YAML nesting, not path-style keys |
| PostgreSQL collation version mismatch after glibc upgrade | ~ALTER DATABASE ... REFRESH COLLATION VERSION;~ for each DB | Standard post-upgrade maintenance, not an error |
| mautrix-slack retry warnings on startup | Matrix homeserver wasn't ready yet, bridge retried and connected | Service ordering worked correctly, just log noise |

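The first row comes down to one bit: a user who is neither owner nor group member needs the "other" execute bit on every parent directory to reach a child. A minimal sketch of that check, assuming (as the fix implies) that postgres is neither the owner nor in the group of /var/backup; ~others_can_traverse~ is a hypothetical helper:

```bash
# Succeed if a numeric mode (e.g. 750 or 0751) grants execute/search
# permission to "other" users. The leading 0 makes the shell read the
# mode as an octal constant.
others_can_traverse() {
    [ $(( 0$1 & 1 )) -ne 0 ]
}

# On a live system the mode could be read with GNU stat (not run here):
#   others_can_traverse "$(stat -c '%a' /var/backup)"
```

Here ~others_can_traverse 750~ fails and ~others_can_traverse 751~ succeeds, matching the before and after states in the table.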
* Technical Details

** Code Changes
- Total files modified: 10
- Key files changed:
  - ~modules/backup-b2.nix~ - New B2 backup module with restic
  - ~modules/dev-services.nix~ - Added postgresqlBackup, pinned PostgreSQL 15
  - ~flake.nix~ - Updated to nixos-24.11, unpinned sops-nix
  - ~.sops.yaml~ - Added recovery key recipient
  - ~secrets/secrets.yaml~ - Added restic credentials (encrypted)
- New files created:
  - ~docs/disaster-recovery-runbook.md~ - Comprehensive DR documentation
  - ~docs/nixos-24.11-upgrade-notes.md~ - Breaking changes analysis

** Commands Used
#+BEGIN_SRC bash
# B2 bucket creation (manual in Backblaze console)
# Created: ops-jrz1-backup with scoped application key

# Trigger PostgreSQL backups
systemctl start postgresqlBackup-forgejo.service postgresqlBackup-mautrix_slack.service

# Restore drill - test restore to /tmp, then inspect the start of the dump
restic restore latest --target /tmp/dr-test --include /var/backup/postgresql
gunzip -c /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz | head -30

# Fix PostgreSQL collation after glibc upgrade (once per database)
sudo -u postgres psql -c "ALTER DATABASE postgres REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE forgejo REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE mautrix_slack REFRESH COLLATION VERSION;"

# NixOS upgrade: build locally, copy the closure, diff against the running
# system, stage the new generation for next boot, then reboot
nix flake update
nixos-rebuild build --flake .#ops-jrz1
nix copy --to ssh://root@ops-jrz1 ./result
ssh root@ops-jrz1 'nix-shell -p nvd --run "nvd diff /run/current-system /nix/store/..."'
nixos-rebuild boot --flake .#ops-jrz1 --target-host root@ops-jrz1
ssh root@ops-jrz1 reboot
#+END_SRC

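The restore drill above eyeballs the first 30 lines of the dump. A scripted version of the same check might look like the sketch below; ~verify_dump~ is a hypothetical helper, and it assumes the dumps are plain-format ~pg_dump~ output, whose header comment contains "PostgreSQL database dump".

```bash
# Hypothetical helper: succeed only if the file is a non-empty, valid gzip
# stream whose first lines look like a plain-format pg_dump.
verify_dump() {
    [ -s "$1" ] || return 1                     # reject empty files
    gzip -t "$1" 2>/dev/null || return 1        # reject corrupt gzip data
    gunzip -c "$1" | head -n 5 | grep -q 'PostgreSQL database dump'
}

# Example against the drill's restore target (not run here):
#   verify_dump /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz
```

This only proves the file decompresses and starts like a dump. Loading it into a scratch database is still the stronger test.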
** Architecture Notes
- B2 backup runs daily at 3 AM (after PostgreSQL dump at 2 AM)
- Weekly integrity check on Sundays at 4 AM (5% data sample)
- Retention: 7 daily, 4 weekly, 6 monthly snapshots
- Three sops keys: VPS host, admin workstation, offline recovery

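For reference, the retention policy and sampled check above map onto restic's standard flags roughly as follows. This is a sketch: whether ~backup-b2.nix~ passes exactly these flags is an assumption, and ~retention_flags~ and ~check_flags~ are hypothetical helpers.

```bash
# Print the restic forget flags for a daily/weekly/monthly retention policy.
retention_flags() {
    printf '%s\n' "--keep-daily $1 --keep-weekly $2 --keep-monthly $3"
}

# Print the restic check flag for verifying a percentage sample of pack data.
check_flags() {
    printf '%s\n' "--read-data-subset=$1%"
}

# Equivalent manual invocations (not run here):
#   restic forget --prune $(retention_flags 7 4 6)
#   restic check $(check_flags 5)
```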
** Ops Review Findings
Ran ops-review skill with lenses: secrets, blast-radius, observability, resilience, nix-hygiene

MED (fixed):
1. ~backup-b2-check~ missing ~network-online.target~ dependency
2. No failure notification mechanism for backup services

LOW (skipped - style only):
- statix W20 warnings about repeated keys in Nix modules (idiomatic pattern, not worth refactoring)

Changes made to ~modules/backup-b2.nix~:
- Added ~backup-b2-failed.service~ oneshot for failure notification
- Added ~onFailure = [ "backup-b2-failed.service" ]~ to both backup services
- Added ~after~ and ~wants = [ "network-online.target" ]~ to ~backup-b2-check~
- Added ~TimeoutStartSec~ (2h for backup, 1h for check)

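The deployed units can be checked against these changes by inspecting their properties. A minimal sketch: ~unit_is_hardened~ is a hypothetical helper that reads ~systemctl show~ style KEY=VALUE lines on stdin. It checks only the failure handler and the network ordering, not the timeouts.

```bash
# Read KEY=VALUE property lines on stdin and succeed only if the unit has
# the failure handler and is ordered after network-online.target.
unit_is_hardened() {
    props="$(cat)"
    printf '%s\n' "$props" | grep -q '^OnFailure=.*backup-b2-failed\.service' || return 1
    printf '%s\n' "$props" | grep -q '^After=.*network-online\.target' || return 1
}

# Example against a live unit (not run here):
#   systemctl show -p OnFailure -p After backup-b2-check.service | unit_is_hardened
```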
* Process and Workflow

** What Worked Well
- Using orch consensus earlier in session for backup strategy validation
- Restore drill caught the /var/backup permissions issue before a real disaster
- Incremental approach: backup → DR runbook → upgrade → verify
- beads issue tracking kept work organized across session

** What Was Challenging
- sops key structure confusion (flat vs nested YAML)
- The permissions issue with /var/backup wasn't obvious
- Long nix copy times to server (~5 min for full closure)

* Learning and Insights

** Technical Insights
- PostgreSQL collation refresh is standard maintenance after glibc upgrades
- NixOS boot vs switch: boot is safer for major upgrades
- restic restore preserves permissions and ownership
- mautrix-slack has graceful retry logic for homeserver connectivity

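Since the collation refresh recurs with every glibc bump, the per-database statements can be generated rather than typed. A minimal sketch; ~collation_refresh_sql~ is a hypothetical helper and the database names are the ones from this session:

```bash
# Print one REFRESH COLLATION VERSION statement per database name given.
collation_refresh_sql() {
    for db in "$@"; do
        printf 'ALTER DATABASE "%s" REFRESH COLLATION VERSION;\n' "$db"
    done
}

# Example (not run here):
#   collation_refresh_sql postgres forgejo mautrix_slack | sudo -u postgres psql
```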
** Process Insights
- Restore drills find issues that code review misses
- DR runbooks should be tested, not just written
- Upgrade checklists prevent forgotten steps

** Architectural Insights
- Three-key sops setup (host + admin + recovery) covers disaster scenarios
- Separating database upgrade from OS upgrade reduces risk
- services.postgresqlBackup is better than raw pg_dumpall scripts

* Context for Future Work

** Open Questions
- Should we set up backup monitoring (healthchecks.io)?
- When to upgrade PostgreSQL 15→16?
- Mirror flake to GitHub (jboq - deferred)?

** Next Steps
- Monitor logs for 24-48 hours post-upgrade
- Schedule quarterly restore drills
- Consider static UIDs for service users (permission consistency)

** Related Work
- [[file:2025-12-05-security-review-backup-implementation.org][2025-12-05 Security Review]] - Initial backup planning
- [[file:2026-01-05-phone-workflow-mosh-backup-beads-cleanup.org][2026-01-05 Backup Beads Cleanup]] - Earlier backup work
- docs/disaster-recovery-runbook.md - Created this session
- docs/nixos-24.11-upgrade-notes.md - Created this session

* Raw Notes
- NixOS 24.11 codename: Vicuna
- Kernel upgraded: 6.6.68 → 6.6.94
- systemd upgraded: 255.9 → 256.10
- Matrix-continuwuity went from RC to stable (0.5.0-rc.8.1 → 0.5.1)
- maubot upgraded significantly (0.4.2 → 0.5.0)
- Forgejo stayed on 7.x LTS (7.0.12 → 7.0.15)
- Closure size increased by ~137 MiB

Beads issues closed this session:
- zgs8 - B2 backup setup (from earlier in session)
- r177 - Add /home and /var/lib/acme to backups
- 93q9 - Add offline sops recovery key
- 09o - Review NixOS 24.11 release notes
- 7qg - Pin PostgreSQL to v15
- asi - Take verified backup before upgrade
- 3wd - Update flake to nixos-24.11
- a9d - Deploy NixOS 24.11
- 3zo - Post-upgrade verification
- 00e - Upgrade epic (parent)
- y8le - Stop Matrix before backup (closed: accepted risk)

* Session Metrics
- Commits made: 12
- Files touched: 10
- Lines added/removed: +1013/-52
- Tests added: 0 (restore drill was manual verification)
- Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
- Ops-review: 2 MED fixed, 4 LOW skipped (style-only statix warnings)