From fec21745ce6fbec2ba486fba54d9074e3397c35d Mon Sep 17 00:00:00 2001
From: Dan
Date: Sat, 10 Jan 2026 20:19:07 -0800
Subject: [PATCH] Update worklog with ops-review fixes and y8le decision

Co-Authored-By: Claude Opus 4.5
---
 ...26-01-10-nixos-24.11-upgrade-dr-backup.org | 37 +++++++++++++++++--
 1 file changed, 34 insertions(+), 3 deletions(-)

diff --git a/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org b/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org
index 8a74da4..0675f91 100644
--- a/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org
+++ b/docs/worklogs/2026-01-10-nixos-24.11-upgrade-dr-backup.org
@@ -1,7 +1,7 @@
 #+TITLE: NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
 #+DATE: 2026-01-10
 #+KEYWORDS: nixos-upgrade, backup, restic, disaster-recovery, postgresql, b2, restore-drill
-#+COMMITS: 11
+#+COMMITS: 12
 #+COMPRESSION_STATUS: uncompressed
 
 * Session Summary
@@ -21,6 +21,10 @@
 - [X] Verified all services post-upgrade
 - [X] Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
 - [X] Closed upgrade epic 00e with all 6 child tasks
+- [X] Ran ops-review on backup module, fixed 2 MED findings
+- [X] Added failure notification service (backup-b2-failed) with OnFailure handlers
+- [X] Added network dependency and timeouts to backup services
+- [X] Post-upgrade health check: all services active, no failed units
 
 * Key Decisions
 ** Decision 1: Use boot instead of switch for deployment
@@ -48,6 +52,15 @@
 - Rationale: Laptop encrypted file balances security and accessibility
 - Impact: Added third age recipient to .sops.yaml
 
+** Decision 4: Accept RocksDB backup consistency risk
+- Context: Matrix-continuwuity uses RocksDB, backed up while running
+- Options considered:
+  1. Stop matrix-continuwuity during backup (~30s downtime at 3 AM)
+  2. Use RocksDB checkpoint API (requires upstream support)
+  3. Accept risk - RocksDB has crash consistency
+- Rationale: 3 AM backup window has minimal activity, multiple daily snapshots provide redundancy
+- Impact: Closed y8le without implementing service stop; can re-evaluate if restore drill shows corruption
+
 * Problems & Solutions
 | Problem | Solution | Learning |
 |---------+----------+----------|
@@ -102,6 +115,22 @@ ssh root@ops-jrz1 reboot
 - Retention: 7 daily, 4 weekly, 6 monthly snapshots
 - Three sops keys: VPS host, admin workstation, offline recovery
 
+** Ops Review Findings
+Ran ops-review skill with lenses: secrets, blast-radius, observability, resilience, nix-hygiene
+
+MED (fixed):
+1. ~backup-b2-check~ missing ~network-online.target~ dependency
+2. No failure notification mechanism for backup services
+
+LOW (skipped - style only):
+- statix W20 warnings about repeated keys in Nix modules (idiomatic pattern, not worth refactoring)
+
+Changes made to ~modules/backup-b2.nix~:
+- Added ~backup-b2-failed.service~ oneshot for failure notification
+- Added ~onFailure = [ "backup-b2-failed.service" ]~ to both backup services
+- Added ~after/wants = [ "network-online.target" ]~ to backup-b2-check
+- Added ~TimeoutStartSec~ (2h for backup, 1h for check)
+
 * Process and Workflow
 
 ** What Worked Well
@@ -171,10 +200,12 @@ Beads issues closed this session:
 - a9d - Deploy NixOS 24.11
 - 3zo - Post-upgrade verification
 - 00e - Upgrade epic (parent)
+- y8le - Stop Matrix before backup (closed: accepted risk)
 
 * Session Metrics
-- Commits made: 11
+- Commits made: 12
 - Files touched: 10
-- Lines added/removed: +991/-52
+- Lines added/removed: +1013/-52
 - Tests added: 0 (restore drill was manual verification)
 - Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
+- Ops-review: 2 MED fixed, 4 LOW skipped (style-only statix warnings)