ops-jrz1/specs/003-maubot-integration/plan.md
Dan 8826d62bcc Add maubot integration and infrastructure updates
- maubot.nix: Declarative bot framework with plugin deployment
- backup.nix: Local backup service for Matrix/bridge data
- sna-instagram-bot: Instagram content bridge plugin
- beads: Issue tracking workflow integrated
- spec 004: Browser-based dev environment design
- nixpkgs bump: Oct 22 → Dec 2
- Fix maubot health check (401 = healthy)
2025-12-08 15:55:12 -08:00

Implementation Plan: Maubot Integration

Branch: 003-maubot-integration | Date: 2025-10-26 | Spec: spec.md

Input: Feature specification from /specs/003-maubot-integration/spec.md

Summary

Extract the maubot bot framework from ops-base and deploy it to ops-jrz1 with the Instagram bot plugin. Primary approach: adapt the proven ops-base maubot.nix module to ops-jrz1 patterns (conduwuit homeserver, sops-nix secrets, dev-platform wrapper), using registration token auth instead of a shared secret. Instagram content is fetched via yt-dlp (community scraping). The initial deployment validates a single instance; the architecture supports 3+ concurrent instances.

Technical Context

  • Language/Version: Python 3.11 (maubot runtime environment)
  • Primary Dependencies: maubot 0.5.2+, yt-dlp >=2023.1.6, aiohttp, SQLite, sops-nix
  • Storage: SQLite /var/lib/maubot/bot.db (service state), per-bot databases (plugin-specific)
  • Testing: Manual QA on production VPS (no staging environment), 7-day validation period
  • Target Platform: NixOS 24.05+ on ops-jrz1 VPS (45.77.205.49, x86_64-linux)
  • Project Type: Infrastructure service (NixOS module)
  • Performance Goals: <5 second Instagram content fetch (SC-001), 99% uptime over 7 days (SC-003), <2 second management UI load (SC-007)
  • Constraints: Localhost-only management interface (SSH tunnel required), single Instagram bot instance initially, conduwuit registration token auth (no shared secret)
  • Scale/Scope: 1 Instagram bot instance MVP, architecture validated for 3 concurrent instances (SC-002), small team usage (<20 Instagram fetches/day)

Constitution Check

GATE: Must pass before Phase 0 research. Re-check after Phase 1 design.

Principle I: Declarative Infrastructure PASS

Compliance:

  • All maubot configuration defined in NixOS modules (maubot.nix, dev-services.nix)
  • No imperative modifications required (service managed via nixos-rebuild)
  • Configuration changes deployed declaratively
  • Rollback via NixOS generations

Evidence:

  • Module adaptation documented in research.md (ops-base → ops-jrz1 pattern)
  • Secrets via sops-nix (declarative encryption)
  • Runtime config generated from NixOS module options

Principle II: Security First PASS

Compliance:

  • All secrets encrypted via sops-nix (maubot-admin-password, maubot-secret-key, registration-token)
  • Runtime secrets in /run/secrets/ (tmpfs, ephemeral)
  • No secrets in the Nix store or in version-controlled configuration files (LoadCredential pattern)
  • Management interface localhost-only (SSH tunnel required per FR-003)

Evidence:

  • Secrets management pattern documented in data-model.md
  • File permissions: 0400 for secrets, 0600 for config with credentials
  • Pre-commit hooks scan for secret leaks (inherited from platform)
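
The permission claims above can be spot-checked on the host. The following is a minimal sketch, assuming GNU `stat` (as shipped with NixOS); `has_mode` is a hypothetical helper name and is not part of the repository.

```shell
# has_mode FILE MODE: succeed only if FILE's octal permission bits equal MODE.
# Hypothetical helper; relies on GNU coreutils `stat -c %a`.
# MODE is written without the leading zero (400, not 0400).
has_mode() (
  file=$1
  want=$2
  # -L follows symlinks so the target's mode is reported, not the link's.
  actual=$(stat -L -c '%a' "$file") || exit 1
  [ "$actual" = "$want" ]
)
```

For example, `has_mode /run/secrets/maubot-admin-password 400` would be expected to succeed after the secrets phase, and `has_mode /var/lib/maubot/config/config.yaml 600` after the service has generated its config.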

Principle III: Presentable State Over Speed PASS

Compliance:

  • Comprehensive specification (spec.md with 16 functional requirements, 4 user stories)
  • Complete documentation suite (research.md, data-model.md, quickstart.md)
  • 7-day validation period required before announcement (per constitution)
  • Success criteria measurable and testable (SC-001 through SC-008)

Evidence:

  • Spec clarification session resolved all ambiguities (5 questions answered)
  • Quickstart.md provides deployment runbook with troubleshooting
  • Testing checklist in quickstart.md validates all success criteria

Principle IV: Quality Over Quick Wins PASS

Compliance:

  • Extracted proven pattern from ops-base (391-line maubot.nix module in production)
  • Research phase documented alternatives (yt-dlp vs instaloader, SQLite vs PostgreSQL)
  • Follows established ops-jrz1 patterns (mautrix-slack module structure, sops-nix secrets)
  • Spec-kit workflow followed (specify → clarify → plan → tasks → implement)

Evidence:

  • Research.md documents 3 major technical decisions with rationale
  • Module adaptation strategy preserves ops-base proven components
  • Constitution check validates pattern consistency

Gate Status: ALL CHECKS PASS - Proceed to implementation

Project Structure

Documentation (this feature)

specs/003-maubot-integration/
├── spec.md              # Feature specification (✅ complete)
├── plan.md              # This file (✅ complete)
├── research.md          # Phase 0 output (✅ complete)
├── data-model.md        # Phase 1 output (✅ complete)
├── quickstart.md        # Phase 1 output (✅ complete)
├── checklists/
│   └── requirements.md  # Quality validation (✅ complete)
└── tasks.md             # Phase 2 output (/speckit.tasks - pending)

Source Code (repository root)

Structure Decision: Infrastructure service (NixOS module) - no application source code

/home/dan/proj/ops-jrz1/
├── modules/
│   ├── maubot.nix           # Low-level maubot service module (to create)
│   ├── dev-services.nix     # High-level wrapper (to update)
│   ├── mautrix-slack.nix    # Reference pattern (existing)
│   └── matrix-continuwuity.nix  # Matrix homeserver (existing)
├── hosts/
│   └── ops-jrz1.nix         # VPS configuration (to update: enable maubot)
├── secrets/
│   └── secrets.yaml         # Encrypted secrets (to update: add maubot secrets)
├── specs/
│   └── 003-maubot-integration/  # This feature directory
└── docs/
    ├── platform-vision.md   # North star document (reference)
    ├── CLAUDE.md            # Development guidelines (to update)
    └── worklogs/            # Session logs (to create after deployment)

External source files (to copy/adapt):

/home/dan/proj/ops-base/
└── vm-configs/modules/
    └── maubot.nix           # Source module (391 lines, proven in production)

/home/dan/proj/sna/
├── instagram_bot.py         # Instagram bot source (11,643 bytes)
└── sna-instagram-bot.mbp    # Packaged plugin (ready to upload)

Runtime state (on VPS after deployment):

/var/lib/maubot/
├── config/
│   └── config.yaml          # Generated runtime config
├── plugins/
│   └── sna.instagram-v1.0.0.mbp  # Uploaded plugin
├── bot.db                   # SQLite database (service state)
└── trash/                   # Deleted plugins

/run/secrets/                # sops-nix decrypted secrets (tmpfs)
├── maubot-admin-password
├── maubot-secret-key
└── matrix-registration-token

Deployment Strategy

Context: ops-jrz1 is a live production server with critical services (Matrix homeserver, Slack bridge, PostgreSQL, Forgejo, nginx). Deployment must be incremental with validation checkpoints.

Live Server Risk Assessment

Critical Services (must remain operational):

  • conduwuit Matrix homeserver (8008) - All Matrix functionality
  • mautrix-slack (29319) - ~50 Slack channels syncing bidirectionally
  • PostgreSQL (5432) - Bridge database (172KB, critical state)
  • Forgejo (git.clarun.xyz) - Code hosting
  • nginx (443) - TLS termination for all public services

New Service (isolated):

  • maubot (29316, localhost-only) - New SQLite database, different port, no appservice registration
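
The "localhost-only" property can be verified from the listener table rather than assumed. This is a sketch under the assumption that `ss -ltn` output is available from the host; `loopback_only` is a hypothetical helper that only parses text passed on stdin.

```shell
# loopback_only PORT: read `ss -ltn` output on stdin and succeed only if
# PORT has at least one listener and every listener is bound to loopback.
# Hypothetical helper; field 4 of `ss -ltn` output is "local-address:port".
loopback_only() {
  awk -v port="$1" '
    {
      n = split($4, parts, ":")
      if (parts[n] != port) next      # listener is on a different port
      found = 1
      # Everything before the final ":port" is the bind address.
      addr = substr($4, 1, length($4) - length(parts[n]) - 1)
      if (addr != "127.0.0.1" && addr != "[::1]") exposed = 1
    }
    END { exit (found && !exposed) ? 0 : 1 }
  '
}
```

A possible invocation is `ssh root@45.77.205.49 'ss -ltnH' | loopback_only 29316`. A wildcard bind such as `0.0.0.0:29316` or `*:29316` makes the check fail.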

Incremental Deployment Approach

Deploy in 4 phases with git commits as rollback points:

Phase 1: Module Files (No-Op Deployment)

  • Add modules/maubot.nix (adapted from ops-base)
  • Add services.dev-platform.maubot wrapper to modules/dev-services.nix (options + config)
  • Do NOT enable: services.dev-platform.maubot.enable remains unset
  • Deploy → Verify no services changed → Git commit
  • Rollback: nixos-rebuild switch --rollback OR git revert

Phase 2: Secrets (Preparation)

  • Add maubot-admin-password, maubot-secret-key to secrets/secrets.yaml
  • Add sops.secrets declarations to hosts/ops-jrz1.nix
  • Still disabled: services.dev-platform.maubot.enable remains unset
  • Deploy → Verify secrets decrypt to /run/secrets/ → Git commit
  • Rollback: nixos-rebuild switch --rollback OR git revert

Phase 3: Service Start (Module Only)

  • Enable in hosts/ops-jrz1.nix: services.dev-platform.maubot.enable = true
  • Deploy → Verify maubot.service starts → Verify existing services healthy → Git commit
  • Rollback: Set enable = false + redeploy OR nixos-rebuild switch --rollback

Phase 4: Bot Deployment (Manual, Reversible)

  • SSH tunnel to management UI (localhost:29316)
  • Create bot Matrix user via registration token
  • Upload Instagram plugin (.mbp file)
  • Create bot instance (test in private room first)
  • Rollback: Delete bot instance via web UI (no code changes to revert)
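
Because the management interface listens only on localhost, Phase 4 starts with a port forward. The sketch below uses two hypothetical helpers: `tunnel_cmd` prints the forwarding command, and `is_healthy_status` encodes the "401 = healthy" rule from the commit note (an unauthenticated request that is rejected still proves the service is answering). Treating 200 as healthy as well is an assumption.

```shell
# tunnel_cmd HOST [PORT]: print the ssh command that forwards the
# localhost-only management port to the operator's machine.
tunnel_cmd() {
  printf 'ssh -N -L %s:127.0.0.1:%s root@%s\n' "${2:-29316}" "${2:-29316}" "$1"
}

# is_healthy_status CODE: treat HTTP 401 (unauthenticated but answering)
# and 200 as "service is up"; anything else counts as unhealthy.
is_healthy_status() {
  case $1 in
    200|401) return 0 ;;
    *) return 1 ;;
  esac
}
```

With the tunnel running, something like `is_healthy_status "$(curl -s -o /dev/null -w '%{http_code}' http://127.0.0.1:29316/_matrix/maubot/v1/version)"` could serve as a probe. The URL path is an assumption based on maubot's default base path and should be checked against the deployed config.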

Validation Checkpoints

After each phase deployment:

# 1. Verify existing services still healthy
ssh root@45.77.205.49 'systemctl status matrix-continuwuity mautrix-slack forgejo postgresql nginx'

# 2. Check for errors in last 5 minutes (excluding maubot)
ssh root@45.77.205.49 'journalctl --since "5 minutes ago" | grep -E "ERR|CRIT|FTL" | grep -v maubot'

# 3. Test Slack bridge (post in Slack, verify appears in Matrix)

# Phase-specific validations documented in tasks.md
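
The first two checkpoints can be turned into pass/fail checks so that a phase is not committed by mistake. This is a sketch only: `check_units`, `recent_errors`, and the `RUN` variable are hypothetical names introduced here, and `systemctl is-active` is swapped in for `systemctl status` because its exit code is easier to test.

```shell
# check_units UNIT...: report every unit that is not active.
# RUN is a hypothetical hook: leave it empty to check locally, or set
# RUN="ssh root@45.77.205.49" to check the VPS.
check_units() (
  failed=0
  for unit in "$@"; do
    if ! ${RUN:-} systemctl is-active --quiet "$unit"; then
      echo "NOT ACTIVE: $unit"
      failed=1
    fi
  done
  [ "$failed" -eq 0 ]
)

# recent_errors: filter journal text on stdin the same way as step 2,
# printing matching lines and succeeding only when there are none.
recent_errors() {
  if grep -E 'ERR|CRIT|FTL' | grep -v maubot; then
    return 1
  fi
  return 0
}
```

Usage might look like `RUN="ssh root@45.77.205.49" check_units matrix-continuwuity mautrix-slack forgejo postgresql nginx`, followed by `ssh root@45.77.205.49 'journalctl --since "5 minutes ago"' | recent_errors`. The Slack round-trip in step 3 remains a manual test.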

Rollback Procedures

NixOS Generation Rollback (fastest):

ssh root@45.77.205.49 'nixos-rebuild switch --rollback'
ssh root@45.77.205.49 'systemctl status matrix-continuwuity mautrix-slack'

Git Revert (if committed):

git revert HEAD
nixos-rebuild switch --flake .#ops-jrz1 --target-host root@45.77.205.49 --build-host localhost

Service Disable (Phase 3 specific):

# In hosts/ops-jrz1.nix
services.dev-platform.maubot.enable = false;  # Then redeploy

Risk Mitigation

Known risks from mautrix-slack deployment (2025-10-26):

  1. IPv4 vs localhost: Always use 127.0.0.1 (not localhost) in homeserverUrl, since localhost can resolve to the IPv6 address ::1
  2. Conduwuit database corruption: Have database wipe procedure ready (low risk - fresh maubot install)
  3. Port conflicts: Maubot uses 29316 (unique, no conflicts expected)

Blast radius containment:

  • Phase 1 fail → Nix syntax errors only, no runtime impact
  • Phase 2 fail → Secrets issue, no services affected
  • Phase 3 fail → Maubot won't start, but Matrix/Slack/Forgejo unaffected (different ports, databases)
  • Phase 4 fail → Bot instance only, delete via UI

Success Criteria Per Phase

  • Phase 1: Build succeeds and nixos-rebuild switch stops, starts, or restarts no existing services
  • Phase 2: /run/secrets/maubot-* files exist with mode 0400, existing services healthy
  • Phase 3: systemctl status maubot.service shows "active (running)", management UI accessible via SSH tunnel
  • Phase 4: Bot responds to Instagram URL in <5 seconds (SC-001)
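
The 5-second target for Phase 4 can be approximated with a timing wrapper. This is a rough sketch: `within_seconds` is a hypothetical helper, it measures whole seconds only, and it times whichever command it is given rather than the end-to-end bot response in a Matrix room.

```shell
# within_seconds LIMIT CMD...: run CMD and succeed only if it exits 0
# and finishes in fewer than LIMIT seconds (whole-second resolution).
within_seconds() (
  limit=$1
  shift
  start=$(date +%s)
  "$@" || exit 1
  end=$(date +%s)
  [ $((end - start)) -lt "$limit" ]
)
```

One way to use it is `within_seconds 5 yt-dlp --simulate --quiet "$URL"`, where `$URL` is a test Instagram link. The plugin's real yt-dlp invocation is not shown in this plan, so this times the fetch step only under that assumption.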

Update/Upgrade Procedure (State-Preserving)

After initial deployment, future updates must preserve runtime state in /var/lib/maubot/:

  • bot.db - Service state (bot instances, plugin configurations)
  • plugins/ - Uploaded .mbp files
  • config/config.yaml - Generated runtime config
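
A quick presence check on these three items before and after an update makes accidental state loss visible. This is a minimal sketch; `state_intact` is a hypothetical helper and checks only that the files exist, not that their contents are valid.

```shell
# state_intact [DIR]: succeed only if the three pieces of runtime state
# listed above exist and bot.db is non-empty.
state_intact() (
  dir=${1:-/var/lib/maubot}
  [ -s "$dir/bot.db" ] &&
    [ -d "$dir/plugins" ] &&
    [ -f "$dir/config/config.yaml" ]
)
```

On the VPS this would be run as `state_intact` with no argument; passing a directory allows the same check against a restored backup.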

Typical update scenarios:

Scenario 1: Module Configuration Change (e.g., change port, add new option)

# 1. Edit modules/dev-services.nix or hosts/ops-jrz1.nix
# 2. Deploy
nixos-rebuild switch --flake .#ops-jrz1 --target-host root@45.77.205.49 --build-host localhost

# 3. Verify service restarted cleanly
ssh root@45.77.205.49 'systemctl status maubot.service'
ssh root@45.77.205.49 'journalctl -u maubot.service -n 50'

# 4. Verify bot instances still running (check management UI)
# StateDirectory persists across service restarts

Scenario 2: Maubot Version Upgrade (nixpkgs update)

# 1. Update flake.lock (with no arguments this updates every input, including nixpkgs)
nix flake update

# 2. Review maubot changelog for breaking changes
# Check: https://github.com/maubot/maubot/releases

# 3. Deploy with build test first
nixos-rebuild build --flake .#ops-jrz1

# 4. If build succeeds, deploy
nixos-rebuild switch --flake .#ops-jrz1 --target-host root@45.77.205.49 --build-host localhost

# 5. Monitor service restart
ssh root@45.77.205.49 'journalctl -u maubot.service -f'

# 6. Verify bot instances reconnected (check Matrix room for bot presence)
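
Before deploying an upgrade, the new build's version can be compared against the documented minimum (maubot 0.5.2+). This is a sketch assuming GNU `sort -V`; `version_at_least` is a hypothetical helper, and how the running maubot version is read is left open here.

```shell
# version_at_least MIN HAVE: succeed if HAVE >= MIN in version order.
# Relies on GNU `sort -V` to order dotted version strings.
version_at_least() (
  min=$1
  have=$2
  lowest=$(printf '%s\n%s\n' "$min" "$have" | sort -V | head -n 1)
  # If MIN sorts first (or both are equal), HAVE satisfies the minimum.
  [ "$lowest" = "$min" ]
)
```

For instance, `version_at_least 0.5.2 "$new_version"` would gate step 4, with `$new_version` supplied by the operator.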

Scenario 3: Plugin Update (new Instagram bot version)

# Manual via web UI:
# 1. Upload new .mbp file (Plugins tab → Upload)
# 2. Maubot detects version change
# 3. Restart affected bot instances (Instances tab → Stop → Start)
# 4. Test in private room before production use

# No nixos-rebuild needed - plugin is runtime state

Scenario 4: Add New Bot Instance (e.g., second Instagram bot or new bot type)

# Manual via web UI:
# 1. Create bot Matrix user (via registration token)
# 2. Upload plugin if new type (Plugins tab)
# 3. Create bot instance (Instances tab → Add instance)
# 4. Configure and enable

# No nixos-rebuild needed - bot instances are runtime state
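
Step 1 relies on token-authenticated registration. In the Matrix client-server API this is a user-interactive auth stage of type `m.login.registration_token`, sent inside the `auth` object of a `POST /_matrix/client/v3/register` request together with the session ID returned by the first (401) response. The sketch below only builds that object. `registration_auth_json` is a hypothetical helper, and whether the homeserver here accepts exactly this flow should be confirmed against its documentation.

```shell
# registration_auth_json TOKEN SESSION: print the "auth" object for a
# token-authenticated registration request.
# No JSON escaping is done, so both values must be free of quotes and
# backslashes (hypothetical helper, for illustration only).
registration_auth_json() {
  printf '{"type":"m.login.registration_token","token":"%s","session":"%s"}\n' "$1" "$2"
}
```

The token would come from the decrypted secret (for example `/run/secrets/matrix-registration-token` on the VPS), and the session value from the homeserver's initial response.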

State Preservation Guarantees:

  • The systemd StateDirectory (/var/lib/maubot/) persists across:
    • Service restarts (systemctl restart maubot.service)
    • System reboots
    • Module configuration changes
    • Maubot version upgrades (unless database schema incompatible)
  • State is lost or becomes unusable only if:
    • The directory is deleted manually
    • The service definition changes the StateDirectory path (the old data stays on disk but is no longer used)
    • A major maubot version ships an incompatible schema (rare, documented in release notes)

Rollback with State:

# NixOS generation rollback preserves StateDirectory
ssh root@45.77.205.49 'nixos-rebuild switch --rollback'

# Bot instances resume with previous configuration
# Database and plugins unchanged

When to wipe database (rare, destructive):

# Only if:
# 1. Database corruption detected
# 2. Major version migration requires clean slate (check release notes)
# 3. Testing fresh deployment

# Stop the service first so the backup captures a consistent bot.db:
ssh root@45.77.205.49 'systemctl stop maubot.service'

# Backup:
ssh root@45.77.205.49 'tar czf /root/maubot-backup-$(date +%Y%m%d).tar.gz /var/lib/maubot/'

# Wipe (bot.db is a single file, so -r is not needed):
ssh root@45.77.205.49 'rm -f /var/lib/maubot/bot.db'
ssh root@45.77.205.49 'systemctl start maubot.service'

# Reconfigure all bot instances via web UI
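
Since the wipe is destructive, it helps to confirm the backup archive is readable before deleting anything. This is a sketch using a hypothetical `backup_state` helper; the file name follows the dated pattern used above.

```shell
# backup_state SRC_DIR OUT_DIR: archive SRC_DIR into a dated tarball,
# verify the archive can be listed, and print its path.
backup_state() (
  src=$1
  out="$2/maubot-backup-$(date +%Y%m%d).tar.gz"
  # -C keeps paths in the archive relative to SRC_DIR's parent.
  tar czf "$out" -C "$(dirname "$src")" "$(basename "$src")" || exit 1
  tar tzf "$out" >/dev/null || exit 1
  echo "$out"
)
```

Run on the VPS with the service stopped, `backup_state /var/lib/maubot /root` would print the archive path only if the tarball was written and could be read back.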

Complexity Tracking

No violations - All constitution principles satisfied.

This feature follows established patterns:

  • Declarative infrastructure (NixOS modules)
  • Security first (sops-nix encrypted secrets)
  • Presentable state (comprehensive spec, 7-day validation)
  • Quality over speed (extract proven ops-base module, document alternatives)

No simpler alternatives rejected - Chosen approach is the simplest that meets requirements while maintaining quality standards.