ops-jrz1/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org
Dan fec21745ce Update worklog with ops-review fixes and y8le decision
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-10 20:19:07 -08:00

NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification

Session Summary

Date: 2026-01-10 (Continued from previous session)

Focus Area: Complete NixOS 24.05 to 24.11 upgrade with full DR preparation

Accomplishments

  • Completed B2 backup setup with restic (backup-b2.nix module)
  • Added PostgreSQL dump automation via services.postgresqlBackup
  • Created comprehensive disaster recovery runbook (docs/disaster-recovery-runbook.md)
  • Added /home and /var/lib/acme to backup paths (both were missing)
  • Created offline sops recovery key for disaster scenarios
  • Documented NixOS 24.11 breaking changes analysis
  • Pinned PostgreSQL to v15 to prevent auto-upgrade
  • Executed first restore drill - all tests passed
  • Built and deployed NixOS 24.11 (generation 72)
  • Verified all services post-upgrade
  • Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
  • Closed upgrade epic 00e with all 6 child tasks
  • Ran ops-review on backup module, fixed 2 MED findings
  • Added failure notification service (backup-b2-failed) with OnFailure handlers
  • Added network dependency and timeouts to backup services
  • Post-upgrade health check: all services active, no failed units

Key Decisions

Decision 1: Use boot instead of switch for deployment

  • Context: Upgrading major NixOS version with systemd 255→256
  • Options considered:

    1. nixos-rebuild switch - immediate activation
    2. nixos-rebuild boot - stage for next boot
  • Rationale: boot defers activation to the next reboot, so all services start together under the new systemd instead of running in a mixed old/new state
  • Impact: Required reboot but ensured all services start fresh

Decision 2: Pin PostgreSQL to v15 instead of upgrading to v16

  • Context: NixOS 24.11 defaults to PostgreSQL 16
  • Options considered:

    1. Pin to 15 now, upgrade PostgreSQL later (two steps)
    2. Let PostgreSQL upgrade with NixOS (requires pg_upgrade)
  • Rationale: Safer to decouple NixOS upgrade from database upgrade
  • Impact: Added package = pkgs.postgresql_15; to dev-services.nix
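
The pin only helps if the on-disk cluster really is a v15 cluster, so a pre-deploy guard can confirm that before the rebuild. The following is a minimal sketch, not something run in this session: the helper name pg_cluster_matches is hypothetical, and the data directory in the usage comment assumes the NixOS default layout (/var/lib/postgresql/<major>).

```shell
# Hypothetical helper: succeed only if the cluster in $1 was initialized by
# PostgreSQL major version $2. PG_VERSION is a plain-text file that initdb
# writes at the top of every data directory.
pg_cluster_matches() {
  local datadir="$1" expected="$2"
  # Exit status 2 means "could not tell" (file missing or unreadable).
  [ -r "$datadir/PG_VERSION" ] || return 2
  [ "$(tr -d '[:space:]' < "$datadir/PG_VERSION")" = "$expected" ]
}

# Usage on the host (path is an assumption):
#   pg_cluster_matches /var/lib/postgresql/15 15 || echo "cluster is not v15" >&2
```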

Decision 3: Offline sops recovery key stored on laptop

  • Context: sops keys derived from SSH host key - lose host, lose secrets
  • Options considered:

    1. Encrypted file on laptop
    2. Paper key in safe
    3. Hardware key (YubiKey)
  • Rationale: Laptop encrypted file balances security and accessibility
  • Impact: Added third age recipient to .sops.yaml
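
One way to confirm that the recovery recipient actually works is to attempt a decrypt using only the offline key. A minimal sketch under assumptions: the helper name and the key file path in the usage comment are hypothetical; SOPS_AGE_KEY_FILE is the environment variable sops reads for age identities. It should be run on a machine that holds no other recipient's key, otherwise a successful decrypt says nothing about the recovery key.

```shell
# Hypothetical helper: succeed if the age identity file $1 can decrypt the
# sops-encrypted file $2. The plaintext is discarded, never printed.
recovery_key_decrypts() {
  SOPS_AGE_KEY_FILE="$1" sops -d "$2" >/dev/null 2>&1
}

# Usage (key path is an assumption):
#   recovery_key_decrypts ~/recovery/age-key.txt secrets/secrets.yaml \
#     && echo "recovery key OK"
```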

Decision 4: Accept RocksDB backup consistency risk

  • Context: Matrix-continuwuity uses RocksDB, backed up while running
  • Options considered:

    1. Stop matrix-continuwuity during backup (~30s downtime at 3 AM)
    2. Use RocksDB checkpoint API (requires upstream support)
    3. Accept risk - RocksDB has crash consistency
  • Rationale: 3 AM backup window has minimal activity, and the retained daily snapshots provide redundancy if one copy turns out to be inconsistent
  • Impact: Closed y8le without implementing service stop; can re-evaluate if restore drill shows corruption

Problems & Solutions

| Problem | Solution | Learning |
|---------+----------+----------|
| PostgreSQL backup dir /var/backup has 0750 permissions, postgres user can't traverse | chmod 751 /var/backup to allow execute traversal | Parent directory permissions matter for child access |
| sops nested key structure mismatch | Restructured secrets.yaml from flat keys (restic/password) to nested YAML | sops-nix expects proper YAML nesting, not path-style keys |
| PostgreSQL collation version mismatch after glibc upgrade | ALTER DATABASE … REFRESH COLLATION VERSION; for each DB | Standard post-upgrade maintenance, not an error |
| mautrix-slack retry warnings on startup | Matrix homeserver wasn't ready yet, bridge retried and connected | Service ordering worked correctly, just log noise |
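
The /var/backup problem comes down to a single permission bit: a user who is neither the owner nor in the group needs the world-execute bit on the parent directory to reach anything beneath it. A small sketch for checking that bit; the helper name is hypothetical, and stat -c assumes GNU coreutils.

```shell
# Hypothetical helper: succeed if "other" users may traverse directory $1,
# i.e. the world-execute bit is set. In the octal mode string that bit makes
# the last digit odd (1, 3, 5 or 7).
others_can_traverse() {
  local mode
  mode=$(stat -c '%a' "$1") || return 2
  case "$mode" in
    *[1357]) return 0 ;;
    *)       return 1 ;;
  esac
}

# Usage:
#   others_can_traverse /var/backup || echo "postgres cannot reach its dump dir" >&2
```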

Technical Details

Code Changes

  • Total files modified: 10
  • Key files changed:

    • modules/backup-b2.nix - New B2 backup module with restic
    • modules/dev-services.nix - Added postgresqlBackup, pinned PostgreSQL 15
    • flake.nix - Updated to nixos-24.11, unpinned sops-nix
    • .sops.yaml - Added recovery key recipient
    • secrets/secrets.yaml - Added restic credentials (encrypted)
  • New files created:

    • docs/disaster-recovery-runbook.md - Comprehensive DR documentation
    • docs/nixos-24.11-upgrade-notes.md - Breaking changes analysis

Commands Used

# B2 bucket creation (manual in Backblaze console)
# Created: ops-jrz1-backup with scoped application key

# Trigger PostgreSQL backups
systemctl start postgresqlBackup-forgejo.service postgresqlBackup-mautrix_slack.service

# Restore drill - test restore to /tmp
restic restore latest --target /tmp/dr-test --include /var/backup/postgresql
gunzip -c /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz | head -30

# Fix PostgreSQL collation after glibc upgrade
sudo -u postgres psql -c "ALTER DATABASE postgres REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE forgejo REFRESH COLLATION VERSION;"
sudo -u postgres psql -c "ALTER DATABASE mautrix_slack REFRESH COLLATION VERSION;"

# NixOS upgrade
nix flake update
nixos-rebuild build --flake .#ops-jrz1
nix copy --to ssh://root@ops-jrz1 ./result
ssh root@ops-jrz1 'nix-shell -p nvd --run "nvd diff /run/current-system /nix/store/..."'
nixos-rebuild boot --flake .#ops-jrz1 --target-host root@ops-jrz1
ssh root@ops-jrz1 reboot
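
The manual head inspection in the restore drill could be scripted for future drills. A minimal sketch, assuming the dumps are gzip-compressed plain-format pg_dump output as in the drill commands; the helper name is hypothetical. It checks gzip integrity first, then looks for the "PostgreSQL database dump" banner that pg_dump writes at the top of plain-format dumps.

```shell
# Hypothetical helper: succeed if $1 is an intact gzip file whose first lines
# carry the pg_dump plain-format banner.
dump_looks_valid() {
  # gzip -t reads the whole file and verifies its checksum.
  gzip -t "$1" 2>/dev/null || return 1
  gunzip -c "$1" | head -n 5 | grep -q 'PostgreSQL database dump'
}

# Usage after a restore to /tmp/dr-test:
#   dump_looks_valid /tmp/dr-test/var/backup/postgresql/forgejo.sql.gz
```

This only shows that the file is readable and looks like a dump; it does not replace loading the dump into a scratch database.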

Architecture Notes

  • B2 backup runs daily at 3 AM (after PostgreSQL dump at 2 AM)
  • Weekly integrity check on Sundays at 4 AM (5% data sample)
  • Retention: 7 daily, 4 weekly, 6 monthly snapshots
  • Three sops keys: VPS host, admin workstation, offline recovery
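
Expressed as plain restic invocations, the retention policy and the sampled integrity check would look roughly like this. This is a sketch of the CLI equivalents, not the contents of backup-b2.nix; the function names are hypothetical, and the repository and password are assumed to come from RESTIC_REPOSITORY and RESTIC_PASSWORD_FILE in the environment.

```shell
# Apply the retention policy (7 daily, 4 weekly, 6 monthly) and delete
# data no longer referenced by any kept snapshot.
prune_snapshots() {
  restic forget --keep-daily 7 --keep-weekly 4 --keep-monthly 6 --prune
}

# Weekly integrity check: verify repository structure and re-read a
# random 5% sample of the stored pack data.
check_sample() {
  restic check --read-data-subset=5%
}
```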

Ops Review Findings

Ran ops-review skill with lenses: secrets, blast-radius, observability, resilience, nix-hygiene

MED (fixed):

  1. backup-b2-check missing network-online.target dependency
  2. No failure notification mechanism for backup services

LOW (skipped - style only):

  • statix W20 warnings about repeated keys in Nix modules (idiomatic pattern, not worth refactoring)

Changes made to modules/backup-b2.nix:

  • Added backup-b2-failed.service oneshot for failure notification
  • Added onFailure = [ "backup-b2-failed.service" ] to both backup services
  • Added after/wants = [ "network-online.target" ] to backup-b2-check
  • Added TimeoutStartSec (2h for backup, 1h for check)
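
The changes above can be spot-checked on the running host by querying the unit properties. A minimal sketch: the helper name unit_lists is hypothetical, and the unit names in the usage comment are the ones mentioned in this worklog.

```shell
# Hypothetical helper: succeed if property $2 of unit $1 lists entry $3.
# `systemctl show -p PROP --value UNIT` prints the property's value, which
# for dependency-style properties is a space-separated list of units.
unit_lists() {
  systemctl show -p "$2" --value "$1" | tr ' ' '\n' | grep -qxF -- "$3"
}

# Usage on the host:
#   unit_lists backup-b2-check.service After network-online.target
#   unit_lists backup-b2-check.service OnFailure backup-b2-failed.service
```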

Process and Workflow

What Worked Well

  • Using orch consensus earlier in session for backup strategy validation
  • Restore drill caught the /var/backup permissions issue before a real disaster
  • Incremental approach: backup → DR runbook → upgrade → verify
  • beads issue tracking kept work organized across the session

What Was Challenging

  • sops key structure confusion (flat vs nested YAML)
  • The permissions issue with /var/backup wasn't obvious
  • Long nix copy times to server (~5 min for full closure)

Learning and Insights

Technical Insights

  • PostgreSQL collation refresh is standard maintenance after glibc upgrades
  • NixOS boot vs switch: boot is safer for major upgrades
  • restic restore preserves permissions and ownership (ownership only when run as root)
  • mautrix-slack has graceful retry logic for homeserver connectivity
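
The collation refresh from the first insight was run once per database by hand; it can be generalized so that newly added databases are not forgotten after the next glibc bump. A minimal sketch, with a hypothetical helper name, that turns a list of database names into the corresponding statements:

```shell
# Hypothetical helper: read database names from stdin, one per line, and
# print one REFRESH statement for each. Blank lines are skipped.
refresh_collation_sql() {
  local db
  while IFS= read -r db; do
    if [ -n "$db" ]; then
      printf 'ALTER DATABASE "%s" REFRESH COLLATION VERSION;\n' "$db"
    fi
  done
}

# Usage on the host (feeds every connectable database through the helper):
#   sudo -u postgres psql -AtXc "SELECT datname FROM pg_database WHERE datallowconn" \
#     | refresh_collation_sql | sudo -u postgres psql -X
```

REFRESH COLLATION VERSION only records the new version. If the collation's sort order actually changed, PostgreSQL's documentation recommends rebuilding the affected indexes (REINDEX) before refreshing.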

Process Insights

  • Restore drills find issues that code review misses
  • DR runbooks should be tested, not just written
  • Upgrade checklists prevent forgotten steps

Architectural Insights

  • Three-key sops setup (host + admin + recovery) covers disaster scenarios
  • Separating database upgrade from OS upgrade reduces risk
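  • services.postgresqlBackup is preferable to raw pg_dumpall scripts: it provides per-database systemd units (e.g. postgresqlBackup-forgejo) and compressed dumps without custom scripting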

Context for Future Work

Open Questions

  • Should we set up backup monitoring (healthchecks.io)?
  • When to upgrade PostgreSQL 15→16?
  • Mirror flake to GitHub (jboq - deferred)?

Next Steps

  • Monitor logs for 24-48 hours post-upgrade
  • Schedule quarterly restore drills
  • Consider static UIDs for service users (permission consistency)

Related Work

Raw Notes

  • NixOS 24.11 codename: Vicuna
  • Kernel upgraded: 6.6.68 → 6.6.94
  • systemd upgraded: 255.9 → 256.10
  • Matrix-continuwuity went from RC to stable (0.5.0-rc.8.1 → 0.5.1)
  • maubot upgraded significantly (0.4.2 → 0.5.0)
  • Forgejo stayed on 7.x LTS (7.0.12 → 7.0.15)
  • Closure size increased by ~137 MiB

Beads issues closed this session:

  • zgs8 - B2 backup setup (from earlier in session)
  • r177 - Add /home and /var/lib/acme to backups
  • 93q9 - Add offline sops recovery key
  • 09o - Review NixOS 24.11 release notes
  • 7qg - Pin PostgreSQL to v15
  • asi - Take verified backup before upgrade
  • 3wd - Update flake to nixos-24.11
  • a9d - Deploy NixOS 24.11
  • 3zo - Post-upgrade verification
  • 00e - Upgrade epic (parent)
  • y8le - Stop Matrix before backup (closed: accepted risk)

Session Metrics

  • Commits made: 12
  • Files touched: 10
  • Lines added/removed: +1013/-52
  • Tests added: 0 (restore drill was manual verification)
  • Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
  • Ops-review: 2 MED fixed, 4 LOW skipped (style-only statix warnings)
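
The post-upgrade service check can be repeated during the 24-48 hour monitoring window with a short loop. A sketch only: the helper name is hypothetical, and the unit names in the usage comment are taken from the service list above and may not match the exact systemd unit names on the host.

```shell
# Hypothetical helper: print every unit from the argument list that is not
# currently active. Returns non-zero if at least one unit is down.
inactive_units() {
  local unit rc=0
  for unit in "$@"; do
    if ! systemctl is-active --quiet "$unit"; then
      echo "$unit"
      rc=1
    fi
  done
  return "$rc"
}

# Usage on the host (unit names are assumptions):
#   inactive_units postgresql forgejo matrix-continuwuity mautrix-slack maubot nginx
#   systemctl --failed
```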