ops-jrz1/specs/003-maubot-integration/plan.md
Dan 8826d62bcc Add maubot integration and infrastructure updates
- maubot.nix: Declarative bot framework with plugin deployment
- backup.nix: Local backup service for Matrix/bridge data
- sna-instagram-bot: Instagram content bridge plugin
- beads: Issue tracking workflow integrated
- spec 004: Browser-based dev environment design
- nixpkgs bump: Oct 22 → Dec 2
- Fix maubot health check (401 = healthy)
2025-12-08 15:55:12 -08:00

# Implementation Plan: Maubot Integration
**Branch**: `003-maubot-integration` | **Date**: 2025-10-26 | **Spec**: [spec.md](./spec.md)
**Input**: Feature specification from `/specs/003-maubot-integration/spec.md`
## Summary
Extract the maubot bot framework from ops-base and deploy it to ops-jrz1 with the Instagram bot plugin. Primary approach: adapt the proven ops-base maubot.nix module to ops-jrz1 patterns (conduwuit homeserver, sops-nix secrets, dev-platform wrapper), using registration token auth instead of a shared secret. Instagram content is fetched via yt-dlp (community scraping). The initial deployment validates a single instance; the architecture supports 3+ concurrent instances.
## Technical Context
**Language/Version**: Python 3.11 (maubot runtime environment)
**Primary Dependencies**: maubot 0.5.2+, yt-dlp >=2023.1.6, aiohttp, SQLite, sops-nix
**Storage**: SQLite `/var/lib/maubot/bot.db` (service state), per-bot databases (plugin-specific)
**Testing**: Manual QA on production VPS (no staging environment), 7-day validation period
**Target Platform**: NixOS 24.05+ on ops-jrz1 VPS (45.77.205.49, x86_64-linux)
**Project Type**: Infrastructure service (NixOS module)
**Performance Goals**: <5 second Instagram content fetch (SC-001), 99% uptime over 7 days (SC-003), <2 second management UI load (SC-007)
**Constraints**: Localhost-only management interface (SSH tunnel required), single Instagram bot instance initially, conduwuit registration token auth (no shared secret)
**Scale/Scope**: 1 Instagram bot instance MVP, architecture validated for 3 concurrent instances (SC-002), small team usage (<20 Instagram fetches/day)
## Constitution Check
*GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.*
### Principle I: Declarative Infrastructure ✅ PASS
**Compliance**:
- All maubot configuration defined in NixOS modules (maubot.nix, dev-services.nix)
- No imperative modifications required (service managed via nixos-rebuild)
- Configuration changes deployed declaratively
- Rollback via NixOS generations
**Evidence**:
- Module adaptation documented in research.md (ops-base → ops-jrz1 pattern)
- Secrets via sops-nix (declarative encryption)
- Runtime config generated from NixOS module options
### Principle II: Security First ✅ PASS
**Compliance**:
- All secrets encrypted via sops-nix (maubot-admin-password, maubot-secret-key, registration-token)
- Runtime secrets in /run/secrets/ (tmpfs, ephemeral)
- No secrets in Nix store or configuration files (LoadCredential pattern)
- Management interface localhost-only (SSH tunnel required per FR-003)
**Evidence**:
- Secrets management pattern documented in data-model.md
- File permissions: 0400 for secrets, 0600 for config with credentials
- Pre-commit hooks scan for secret leaks (inherited from platform)
### Principle III: Presentable State Over Speed ✅ PASS
**Compliance**:
- Comprehensive specification (spec.md with 16 functional requirements, 4 user stories)
- Complete documentation suite (research.md, data-model.md, quickstart.md)
- 7-day validation period required before announcement (per constitution)
- Success criteria measurable and testable (SC-001 through SC-008)
**Evidence**:
- Spec clarification session resolved all ambiguities (5 questions answered)
- Quickstart.md provides deployment runbook with troubleshooting
- Testing checklist in quickstart.md validates all success criteria
### Principle IV: Quality Over Quick Wins ✅ PASS
**Compliance**:
- Extracted proven pattern from ops-base (391-line maubot.nix module in production)
- Research phase documented alternatives (yt-dlp vs instaloader, SQLite vs PostgreSQL)
- Follows established ops-jrz1 patterns (mautrix-slack module structure, sops-nix secrets)
- Spec-kit workflow followed (specify → clarify → plan → tasks → implement)
**Evidence**:
- Research.md documents 3 major technical decisions with rationale
- Module adaptation strategy preserves ops-base proven components
- Constitution check validates pattern consistency
**Gate Status**: ALL CHECKS PASS - Proceed to implementation
## Project Structure
### Documentation (this feature)
```text
specs/003-maubot-integration/
├── spec.md              # Feature specification (✅ complete)
├── plan.md              # This file (✅ complete)
├── research.md          # Phase 0 output (✅ complete)
├── data-model.md        # Phase 1 output (✅ complete)
├── quickstart.md        # Phase 1 output (✅ complete)
├── checklists/
│   └── requirements.md  # Quality validation (✅ complete)
└── tasks.md             # Phase 2 output (/speckit.tasks - pending)
```
### Source Code (repository root)
**Structure Decision**: Infrastructure service (NixOS module) - no application source code
```text
/home/dan/proj/ops-jrz1/
├── modules/
│   ├── maubot.nix                # Low-level maubot service module (to create)
│   ├── dev-services.nix          # High-level wrapper (to update)
│   ├── mautrix-slack.nix         # Reference pattern (existing)
│   └── matrix-continuwuity.nix   # Matrix homeserver (existing)
├── hosts/
│   └── ops-jrz1.nix              # VPS configuration (to update: enable maubot)
├── secrets/
│   └── secrets.yaml              # Encrypted secrets (to update: add maubot secrets)
├── specs/
│   └── 003-maubot-integration/   # This feature directory
└── docs/
    ├── platform-vision.md        # North star document (reference)
    ├── CLAUDE.md                 # Development guidelines (to update)
    └── worklogs/                 # Session logs (to create after deployment)
```
**External source files** (to copy/adapt):
```text
/home/dan/proj/ops-base/
└── vm-configs/modules/
    └── maubot.nix              # Source module (391 lines, proven in production)
/home/dan/proj/sna/
├── instagram_bot.py            # Instagram bot source (11,643 bytes)
└── sna-instagram-bot.mbp       # Packaged plugin (ready to upload)
```
**Runtime state** (on VPS after deployment):
```text
/var/lib/maubot/
├── config/
│   └── config.yaml               # Generated runtime config
├── plugins/
│   └── sna.instagram-v1.0.0.mbp  # Uploaded plugin
├── bot.db                        # SQLite database (service state)
└── trash/                        # Deleted plugins
/run/secrets/                     # sops-nix decrypted secrets (tmpfs)
├── maubot-admin-password
├── maubot-secret-key
└── matrix-registration-token
```
## Deployment Strategy
**Context**: ops-jrz1 is a live production server with critical services (Matrix homeserver, Slack bridge, PostgreSQL, Forgejo, nginx). Deployment must be incremental with validation checkpoints.
### Live Server Risk Assessment
**Critical Services** (must remain operational):
- conduwuit Matrix homeserver (8008) - All Matrix functionality
- mautrix-slack (29319) - ~50 Slack channels syncing bidirectionally
- PostgreSQL (5432) - Bridge database (172KB, critical state)
- Forgejo (git.clarun.xyz) - Code hosting
- nginx (443) - TLS termination for all public services
**New Service** (isolated):
- maubot (29316, localhost-only) - New SQLite database, different port, no appservice registration
### Incremental Deployment Approach
Deploy in 4 phases with git commits as rollback points:
**Phase 1: Module Files (No-Op Deployment)**
- Add modules/maubot.nix (adapted from ops-base)
- Add services.dev-platform.maubot wrapper to modules/dev-services.nix (options + config)
- **Do NOT enable**: services.dev-platform.maubot.enable remains unset
- Deploy → Verify no services changed → Git commit
- **Rollback**: nixos-rebuild switch --rollback OR git revert
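One way to back up the "no services changed" check is to snapshot the running service units before and after the deploy and compare the two lists. The sketch below is a coarse check under stated assumptions: `snapshot_units` and `changed_units` are hypothetical helper names, and the comparison only detects units that appeared or disappeared, not units that were restarted.

```bash
# List running service units on a host, one per line, sorted.
# $1 is an SSH target such as root@45.77.205.49.
snapshot_units() {
  ssh "$1" 'systemctl list-units --type=service --state=running --no-legend --plain' \
    | awk '{print $1}' | sort
}

# Print units that appear in only one of two snapshot files.
# Empty output means the set of running services is unchanged.
changed_units() {
  comm -3 "$1" "$2"
}
```

Typical use: `snapshot_units root@45.77.205.49 > before.txt`, deploy, `snapshot_units root@45.77.205.49 > after.txt`, then `changed_units before.txt after.txt` should print nothing after a Phase 1 deploy.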
**Phase 2: Secrets (Preparation)**
- Add maubot-admin-password, maubot-secret-key to secrets/secrets.yaml
- Add sops.secrets declarations to hosts/ops-jrz1.nix
- **Still disabled**: services.dev-platform.maubot.enable remains unset
- Deploy → Verify secrets decrypt to /run/secrets/ → Git commit
- **Rollback**: nixos-rebuild switch --rollback OR git revert
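The Phase 2 verification (secrets decrypted with mode 0400) can be scripted rather than eyeballed. This is a minimal sketch meant to run on the VPS; `check_secret_modes` is a hypothetical helper, not part of sops-nix, and it assumes GNU `stat`.

```bash
# Confirm every maubot-* secret in a directory exists and has mode 0400.
# Returns non-zero if any file has a different mode or none are found.
check_secret_modes() {
  local dir="${1:-/run/secrets}" rc=0 found=0 f mode
  for f in "$dir"/maubot-*; do
    [ -e "$f" ] || continue            # glob matched nothing
    found=1
    mode="$(stat -L -c '%a' "$f")"     # GNU stat; -L follows symlinks
    if [ "$mode" != "400" ]; then
      echo "BAD  $f (mode $mode, expected 400)"
      rc=1
    else
      echo "OK   $f"
    fi
  done
  if [ "$found" -eq 0 ]; then
    echo "BAD  no maubot-* secrets found in $dir"
    rc=1
  fi
  return "$rc"
}
```

Calling `check_secret_modes` with no argument inspects `/run/secrets`; a non-zero exit status signals that Phase 2 should not be committed yet.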
**Phase 3: Service Start (Module Only)**
- Enable in hosts/ops-jrz1.nix: services.dev-platform.maubot.enable = true
- Deploy → Verify maubot.service starts → Verify existing services healthy → Git commit
- **Rollback**: Set enable = false + redeploy OR nixos-rebuild switch --rollback
**Phase 4: Bot Deployment (Manual, Reversible)**
- SSH tunnel to management UI (localhost:29316)
- Create bot Matrix user via registration token
- Upload Instagram plugin (.mbp file)
- Create bot instance (test in private room first)
- **Rollback**: Delete bot instance via web UI (no code changes to revert)
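Because the management interface listens on localhost only, Phase 4 starts with a port forward. The sketch below wraps the standard `ssh -N -L` invocation; `open_maubot_tunnel` and the `DRY_RUN` switch are hypothetical conveniences, and the `/_matrix/maubot` path mentioned afterwards is maubot's upstream default, which this deployment may override.

```bash
# Forward a local port to the maubot management UI on the VPS.
# With DRY_RUN set, print the ssh command instead of running it.
open_maubot_tunnel() {
  local host="${1:-root@45.77.205.49}" port="${2:-29316}"
  local cmd=(ssh -N -L "${port}:127.0.0.1:${port}" "$host")
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "${cmd[*]}"
  else
    "${cmd[@]}"   # blocks until interrupted with Ctrl-C
  fi
}
```

With the tunnel open, the UI should be reachable in a local browser at `http://127.0.0.1:29316/_matrix/maubot`, assuming the default UI path is unchanged.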
### Validation Checkpoints
After each phase deployment:
```bash
# 1. Verify existing services still healthy
ssh root@45.77.205.49 'systemctl status matrix-continuwuity mautrix-slack forgejo postgresql nginx'
# 2. Check for errors in last 5 minutes (excluding maubot)
ssh root@45.77.205.49 'journalctl --since "5 minutes ago" | grep -E "ERR|CRIT|FTL" | grep -v maubot'
# 3. Test Slack bridge (post in Slack, verify appears in Matrix)
# Phase-specific validations documented in tasks.md
```
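`systemctl status` output has to be read by a human; for a checkpoint that a script can gate on, `systemctl is-active` gives a one-word answer per unit. A minimal sketch, meant to run on the VPS: `check_units` is a hypothetical helper, and the `SYSTEMCTL` override exists only so the function can be exercised without systemd.

```bash
# Print OK/FAIL per unit and return non-zero if any unit is not active.
check_units() {
  local failed=0 unit state
  for unit in "$@"; do
    # is-active prints the state and exits non-zero unless it is "active"
    state="$(${SYSTEMCTL:-systemctl} is-active "$unit" 2>/dev/null)" || true
    if [ "$state" = "active" ]; then
      echo "OK   $unit"
    else
      echo "FAIL $unit ($state)"
      failed=1
    fi
  done
  return "$failed"
}
```

For the critical services listed in this plan, the call would be `check_units matrix-continuwuity mautrix-slack forgejo postgresql nginx`.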
### Rollback Procedures
**NixOS Generation Rollback** (fastest):
```bash
ssh root@45.77.205.49 'nixos-rebuild switch --rollback'
ssh root@45.77.205.49 'systemctl status matrix-continuwuity mautrix-slack'
```
**Git Revert** (if committed):
```bash
git revert HEAD
nixos-rebuild switch --flake .#ops-jrz1 --target-host root@45.77.205.49 --build-host localhost
```
**Service Disable** (Phase 3 specific):
```nix
# In hosts/ops-jrz1.nix
services.dev-platform.maubot.enable = false; # Then redeploy
```
### Risk Mitigation
**Known risks from mautrix-slack deployment** (2025-10-26):
1. IPv4 vs localhost: Always use 127.0.0.1 (not localhost) in homeserverUrl
2. Conduwuit database corruption: Have database wipe procedure ready (low risk - fresh maubot install)
3. Port conflicts: Maubot uses 29316 (unique, no conflicts expected)
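The first risk is easy to reintroduce by accident, so a small guard can be run against whatever value ends up in `homeserverUrl` before deploying. This is an illustrative sketch: `check_homeserver_url` is a hypothetical helper, and the accepted form (plain HTTP on 127.0.0.1) is an assumption based on the homeserver port listed above.

```bash
# Accept only homeserver URLs that address the IPv4 loopback explicitly.
check_homeserver_url() {
  case "$1" in
    http://127.0.0.1|http://127.0.0.1:*|http://127.0.0.1/*)
      return 0 ;;
    *)
      echo "homeserverUrl should use 127.0.0.1, got: $1" >&2
      return 1 ;;
  esac
}
```

For example, `check_homeserver_url "http://127.0.0.1:8008"` succeeds, while `check_homeserver_url "http://localhost:8008"` prints a warning and returns non-zero.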
**Blast radius containment**:
- Phase 1 fail → Nix syntax errors only, no runtime impact
- Phase 2 fail → Secrets issue, no services affected
- Phase 3 fail → Maubot won't start, but Matrix/Slack/Forgejo unaffected (different ports, databases)
- Phase 4 fail → Bot instance only, delete via UI
### Success Criteria Per Phase
- **Phase 1**: Build succeeds and activation reports no started, stopped, or restarted units
- **Phase 2**: /run/secrets/maubot-* files exist with mode 0400, existing services healthy
- **Phase 3**: systemctl status maubot.service shows "active (running)", management UI accessible via SSH tunnel
- **Phase 4**: Bot responds to Instagram URL in <5 seconds (SC-001)
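For the Phase 3 criterion, an unauthenticated HTTP probe is enough to tell whether maubot is answering: the commit note for this revision records that a 401 response counts as healthy, since it shows the service is up and enforcing auth. The sketch below encodes that rule. `maubot_status_healthy` and `probe_maubot` are hypothetical helpers, treating 200 as healthy is an added assumption, and the URL to probe is left to the caller because the exact endpoint is not specified in this plan.

```bash
# Map an HTTP status code to healthy/unhealthy (401 = up, auth required).
maubot_status_healthy() {
  case "$1" in
    200|401) return 0 ;;
    *)       return 1 ;;
  esac
}

# Probe a URL and report its status; curl prints 000 when it cannot connect.
probe_maubot() {
  local url="$1" code
  code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url")" || true
  echo "HTTP $code"
  maubot_status_healthy "$code"
}
```

Run `probe_maubot <url>` on the VPS or through the SSH tunnel; a connection failure surfaces as `HTTP 000` and a non-zero exit status.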
### Update/Upgrade Procedure (State-Preserving)
After initial deployment, future updates must preserve runtime state in `/var/lib/maubot/`:
- `bot.db` - Service state (bot instances, plugin configurations)
- `plugins/` - Uploaded .mbp files
- `config/config.yaml` - Generated runtime config
**Typical update scenarios**:
**Scenario 1: Module Configuration Change** (e.g., change port, add new option)
```bash
# 1. Edit modules/dev-services.nix or hosts/ops-jrz1.nix
# 2. Deploy
nixos-rebuild switch --flake .#ops-jrz1 --target-host root@45.77.205.49 --build-host localhost
# 3. Verify service restarted cleanly
ssh root@45.77.205.49 'systemctl status maubot.service'
ssh root@45.77.205.49 'journalctl -u maubot.service -n 50'
# 4. Verify bot instances still running (check management UI)
# StateDirectory persists across service restarts
```
**Scenario 2: Maubot Version Upgrade** (nixpkgs update)
```bash
# 1. Update flake.lock or nixpkgs input
nix flake update
# 2. Review maubot changelog for breaking changes
# Check: https://github.com/maubot/maubot/releases
# 3. Deploy with build test first
nixos-rebuild build --flake .#ops-jrz1
# 4. If build succeeds, deploy
nixos-rebuild switch --flake .#ops-jrz1 --target-host root@45.77.205.49 --build-host localhost
# 5. Monitor service restart
ssh root@45.77.205.49 'journalctl -u maubot.service -f'
# 6. Verify bot instances reconnected (check Matrix room for bot presence)
```
**Scenario 3: Plugin Update** (new Instagram bot version)
```bash
# Manual via web UI:
# 1. Upload new .mbp file (Plugins tab → Upload)
# 2. Maubot detects version change
# 3. Restart affected bot instances (Instances tab → Stop → Start)
# 4. Test in private room before production use
# No nixos-rebuild needed - plugin is runtime state
```
**Scenario 4: Add New Bot Instance** (e.g., second Instagram bot or new bot type)
```bash
# Manual via web UI:
# 1. Create bot Matrix user (via registration token)
# 2. Upload plugin if new type (Plugins tab)
# 3. Create bot instance (Instances tab → Add instance)
# 4. Configure and enable
# No nixos-rebuild needed - bot instances are runtime state
```
**State Preservation Guarantees**:
- NixOS StateDirectory (`/var/lib/maubot/`) persists across:
  - Service restarts (systemctl restart maubot.service)
  - System reboots
  - Module configuration changes
  - Maubot version upgrades (unless database schema incompatible)
- State is lost or becomes unusable only if:
  - It is explicitly deleted manually
  - The service definition changes the StateDirectory path (the old directory is left in place but no longer used)
  - A major maubot version ships an incompatible schema (rare, documented in release notes)
**Rollback with State**:
```bash
# NixOS generation rollback preserves StateDirectory
ssh root@45.77.205.49 'nixos-rebuild switch --rollback'
# Bot instances resume with previous configuration
# Database and plugins unchanged
```
**When to wipe database** (rare, destructive):
```bash
# Only if:
# 1. Database corruption detected
# 2. Major version migration requires clean slate (check release notes)
# 3. Testing fresh deployment
# Stop the service first so the backup captures a consistent SQLite file:
ssh root@45.77.205.49 'systemctl stop maubot.service'
# Backup:
ssh root@45.77.205.49 'tar czf /root/maubot-backup-$(date +%Y%m%d).tar.gz /var/lib/maubot/'
# Wipe:
ssh root@45.77.205.49 'rm -f /var/lib/maubot/bot.db'
ssh root@45.77.205.49 'systemctl start maubot.service'
# Reconfigure all bot instances via web UI
```
## Complexity Tracking
**No violations** - All constitution principles satisfied.
This feature follows established patterns:
- Declarative infrastructure (NixOS modules)
- Security first (sops-nix encrypted secrets)
- Presentable state (comprehensive spec, 7-day validation)
- Quality over speed (extract proven ops-base module, document alternatives)
**No simpler alternatives rejected** - Chosen approach is the simplest that meets requirements while maintaining quality standards.