Update worklog with ops-review fixes and y8le decision
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent b1d2674629
commit fec21745ce
@@ -1,7 +1,7 @@
 #+TITLE: NixOS 24.11 Upgrade with DR Preparation and B2 Backup Verification
 #+DATE: 2026-01-10
 #+KEYWORDS: nixos-upgrade, backup, restic, disaster-recovery, postgresql, b2, restore-drill
-#+COMMITS: 11
+#+COMMITS: 12
 #+COMPRESSION_STATUS: uncompressed
 
 * Session Summary
@@ -21,6 +21,10 @@
 - [X] Verified all services post-upgrade
 - [X] Fixed PostgreSQL collation mismatch (glibc 2.39→2.40)
 - [X] Closed upgrade epic 00e with all 6 child tasks
+- [X] Ran ops-review on backup module, fixed 2 MED findings
+- [X] Added failure notification service (backup-b2-failed) with OnFailure handlers
+- [X] Added network dependency and timeouts to backup services
+- [X] Post-upgrade health check: all services active, no failed units
 
 * Key Decisions
 ** Decision 1: Use boot instead of switch for deployment
@@ -48,6 +52,15 @@
 - Rationale: Laptop encrypted file balances security and accessibility
 - Impact: Added third age recipient to .sops.yaml
 
+** Decision 4: Accept RocksDB backup consistency risk
+- Context: Matrix-continuwuity uses RocksDB, backed up while running
+- Options considered:
+1. Stop matrix-continuwuity during backup (~30s downtime at 3 AM)
+2. Use RocksDB checkpoint API (requires upstream support)
+3. Accept risk - RocksDB has crash consistency
+- Rationale: 3 AM backup window has minimal activity, multiple daily snapshots provide redundancy
+- Impact: Closed y8le without implementing service stop; can re-evaluate if restore drill shows corruption
+
 * Problems & Solutions
 | Problem | Solution | Learning |
 |---------+----------+----------|
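For reference, option 1 from Decision 4 (the one that was declined) could look roughly like the fragment below. This is only a sketch under assumptions: the unit name `backup-b2` and the exact name `matrix-continuwuity.service` are guesses, since the worklog does not spell them out, and the real module may already define `ExecStartPre` in a way that would need merging.

```nix
{ pkgs, ... }:
{
  # Hypothetical unit name: the worklog does not name the main backup unit.
  systemd.services.backup-b2.serviceConfig = {
    # The "+" prefix runs the command with full privileges even if User= is set.
    ExecStartPre = "+${pkgs.systemd}/bin/systemctl stop matrix-continuwuity.service";
    # ExecStopPost also runs when the backup fails, so Matrix is started again either way.
    ExecStopPost = "+${pkgs.systemd}/bin/systemctl start matrix-continuwuity.service";
  };
}
```

Keeping this as a small, separate fragment would make it easy to add later if a restore drill ever shows RocksDB corruption.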
@@ -102,6 +115,22 @@ ssh root@ops-jrz1 reboot
 - Retention: 7 daily, 4 weekly, 6 monthly snapshots
 - Three sops keys: VPS host, admin workstation, offline recovery
 
+** Ops Review Findings
+Ran ops-review skill with lenses: secrets, blast-radius, observability, resilience, nix-hygiene
+
+MED (fixed):
+1. ~backup-b2-check~ missing ~network-online.target~ dependency
+2. No failure notification mechanism for backup services
+
+LOW (skipped - style only):
+- statix W20 warnings about repeated keys in Nix modules (idiomatic pattern, not worth refactoring)
+
+Changes made to ~modules/backup-b2.nix~:
+- Added ~backup-b2-failed.service~ oneshot for failure notification
+- Added ~onFailure = [ "backup-b2-failed.service" ]~ to both backup services
+- Added ~after/wants = [ "network-online.target" ]~ to backup-b2-check
+- Added ~TimeoutStartSec~ (2h for backup, 1h for check)
+
 * Process and Workflow
 
 ** What Worked Well
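The changes listed for `modules/backup-b2.nix` might be expressed roughly as follows. This is a minimal sketch, not the module's actual contents: the main unit name `backup-b2` is an assumption (only `backup-b2-check` and `backup-b2-failed` are named in the worklog), and the `logger` call is a placeholder for whatever notification the real oneshot performs.

```nix
{ pkgs, ... }:
{
  # Oneshot unit started by systemd when a backup unit enters the failed state.
  systemd.services.backup-b2-failed = {
    description = "Report a failed B2 backup run";
    serviceConfig = {
      Type = "oneshot";
      # Placeholder notification: writes an error-level line to the system log.
      ExecStart = "${pkgs.util-linux}/bin/logger -p user.err -t backup-b2 'B2 backup unit failed'";
    };
  };

  # Hypothetical name for the main backup unit.
  systemd.services.backup-b2 = {
    onFailure = [ "backup-b2-failed.service" ];
    serviceConfig.TimeoutStartSec = "2h";
  };

  systemd.services.backup-b2-check = {
    onFailure = [ "backup-b2-failed.service" ];
    # Ordering (after) plus a soft pull-in (wants) of the network-online target.
    after = [ "network-online.target" ];
    wants = [ "network-online.target" ];
    serviceConfig.TimeoutStartSec = "1h";
  };
}
```

Setting both `after` and `wants` matters here: `after` alone only orders the units, while `wants` is what actually pulls `network-online.target` into the transaction.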
@@ -171,10 +200,12 @@ Beads issues closed this session:
 - a9d - Deploy NixOS 24.11
 - 3zo - Post-upgrade verification
 - 00e - Upgrade epic (parent)
+- y8le - Stop Matrix before backup (closed: accepted risk)
 
 * Session Metrics
-- Commits made: 11
+- Commits made: 12
 - Files touched: 10
-- Lines added/removed: +991/-52
+- Lines added/removed: +1013/-52
 - Tests added: 0 (restore drill was manual verification)
 - Services verified: 6/6 (postgresql, forgejo, matrix-continuwuity, mautrix-slack, maubot, nginx)
+- Ops-review: 2 MED fixed, 4 LOW skipped (style-only statix warnings)