Update worklog with ops-review fixes and y8le decision

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Dan 2026-01-10 20:19:07 -08:00
parent b1d2674629
commit fec21745ce


@@ -1,7 +1,7 @@
 #+TITLE: NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
 #+DATE: 2026-01-10
 #+KEYWORDS: nixos-upgrade, backup, restic, disaster-recovery, postgresql, b2, restore-drill
-#+COMMITS: 11
+#+COMMITS: 12
 #+COMPRESSION_STATUS: uncompressed
 * Session Summary
@@ -21,6 +21,10 @@
 - [X] Verified all services post-upgrade
 - [X] Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
 - [X] Closed upgrade epic 00e with all 6 child tasks
+- [X] Ran ops-review on backup module, fixed 2 MED findings
+- [X] Added failure notification service (backup-b2-failed) with OnFailure handlers
+- [X] Added network dependency and timeouts to backup services
+- [X] Post-upgrade health check: all services active, no failed units
 * Key Decisions
 ** Decision 1: Use boot instead of switch for deployment
@@ -48,6 +52,15 @@
 - Rationale: Laptop encrypted file balances security and accessibility
 - Impact: Added third age recipient to .sops.yaml
+** Decision 4: Accept RocksDB backup consistency risk
+- Context: Matrix-continuwuity uses RocksDB, backed up while running
+- Options considered:
+  1. Stop matrix-continuwuity during backup (~30s downtime at 3 AM)
+  2. Use RocksDB checkpoint API (requires upstream support)
+  3. Accept risk - RocksDB has crash consistency
+- Rationale: 3 AM backup window has minimal activity, multiple daily snapshots provide redundancy
+- Impact: Closed y8le without implementing service stop; can re-evaluate if restore drill shows corruption
 * Problems & Solutions
 | Problem | Solution | Learning |
 |---------+----------+----------|
@@ -102,6 +115,22 @@ ssh root@ops-jrz1 reboot
 - Retention: 7 daily, 4 weekly, 6 monthly snapshots
 - Three sops keys: VPS host, admin workstation, offline recovery
+** Ops Review Findings
+Ran ops-review skill with lenses: secrets, blast-radius, observability, resilience, nix-hygiene
+MED (fixed):
+1. ~backup-b2-check~ missing ~network-online.target~ dependency
+2. No failure notification mechanism for backup services
+LOW (skipped - style only):
+- statix W20 warnings about repeated keys in Nix modules (idiomatic pattern, not worth refactoring)
+Changes made to ~modules/backup-b2.nix~:
+- Added ~backup-b2-failed.service~ oneshot for failure notification
+- Added ~onFailure = [ "backup-b2-failed.service" ]~ to both backup services
+- Added ~after/wants = [ "network-online.target" ]~ to backup-b2-check
+- Added ~TimeoutStartSec~ (2h for backup, 1h for check)
 * Process and Workflow
 ** What Worked Well
@@ -171,10 +200,12 @@ Beads issues closed this session:
 - a9d - Deploy NixOS 24.11
 - 3zo - Post-upgrade verification
 - 00e - Upgrade epic (parent)
+- y8le - Stop Matrix before backup (closed: accepted risk)
 * Session Metrics
-- Commits made: 11
+- Commits made: 12
 - Files touched: 10
-- Lines added/removed: +991/-52
+- Lines added/removed: +1013/-52
 - Tests added: 0 (restore drill was manual verification)
 - Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
+- Ops-review: 2 MED fixed, 4 LOW skipped (style-only statix warnings)
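
For orientation, the changes to ~modules/backup-b2.nix~ listed under Ops Review Findings might look roughly like the sketch below. This is not the actual module. The unit names ~backup-b2-failed~ and ~backup-b2-check~ and the timeout values come from the worklog, while the name ~backup-b2~ for the main backup unit and the body of the notification script are assumptions made only for illustration.

```nix
# Sketch only, not the real modules/backup-b2.nix.
# Assumed: the main backup unit is called "backup-b2"; the notification
# command is a placeholder because the worklog does not record it.
{ ... }:
{
  # Oneshot unit started by systemd when a backup unit fails.
  systemd.services.backup-b2-failed = {
    description = "Notify on B2 backup failure";
    serviceConfig.Type = "oneshot";
    script = ''
      # Placeholder notification: writes a message to the journal.
      echo "B2 backup unit failed; check the journal" >&2
    '';
  };

  # Main backup unit (hypothetical name): failure handler and 2h start timeout.
  systemd.services.backup-b2 = {
    onFailure = [ "backup-b2-failed.service" ];
    serviceConfig.TimeoutStartSec = "2h";
  };

  # Check unit: wait for the network, failure handler, 1h start timeout.
  systemd.services.backup-b2-check = {
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    onFailure = [ "backup-b2-failed.service" ];
    serviceConfig.TimeoutStartSec = "1h";
  };
}
```

The repeated ~systemd.services.*~ keys are the pattern that statix W20 flags; they are kept here because the worklog treats that warning as style-only.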