ops-jrz1/CLAUDE.md
Dan 2dfe4ea829 Document current architecture, manual fixes, and QA checklist
Added comprehensive documentation:
- Manual workaround for sender_localpart registration bug
- QA testing checklist for untested features
- Future monitoring/alerting requirements
- Current architecture diagram and data flow
- Security model and operational notes
2025-10-26 14:52:31 -07:00


ops-jrz1 Development Guidelines

Auto-generated from all feature plans. Last updated: 2025-10-22

Active Technologies

  • Nix 2.x, NixOS 24.05+, Bash 5.x (for scripts) (001-extract-matrix-platform)
  • mautrix-slack (Go), PostgreSQL 15.10, sops-nix (002-slack-bridge-integration)
  • Matrix homeserver: conduwuit (clarun.xyz)
  • Secrets management: sops-nix with age encryption

Project Structure

.
├── hosts/                    # NixOS host configurations
│   └── ops-jrz1.nix         # VPS configuration (45.77.205.49)
├── modules/                  # NixOS modules
│   ├── dev-services.nix     # PostgreSQL, Forgejo, bridge coordination
│   ├── mautrix-slack.nix    # Slack bridge module
│   └── matrix-continuwuity.nix  # Matrix homeserver
├── secrets/                  # sops-encrypted secrets
│   └── secrets.yaml         # Encrypted credentials (age)
├── specs/                    # Feature specifications
│   ├── 001-extract-matrix-platform/
│   └── 002-slack-bridge-integration/
│       ├── spec.md          # Feature specification
│       ├── plan.md          # Implementation plan
│       ├── research.md      # Technical research findings
│       ├── data-model.md    # Data model & state machines
│       ├── quickstart.md    # Deployment runbook
│       └── contracts/       # Configuration schemas
├── docs/                     # Documentation
│   ├── platform-vision.md   # North star document
│   └── worklogs/            # Deployment logs
└── .specify/                 # Spec-kit framework files

Commands

Deployment

# Deploy configuration to VPS
nixos-rebuild switch --flake .#ops-jrz1 \
  --target-host root@45.77.205.49 \
  --build-host localhost

# Deploy to staging
nixos-rebuild switch --flake .#ops-jrz1-staging \
  --target-host root@45.77.205.49 \
  --build-host localhost

Bridge Management

# Check bridge status
ssh root@45.77.205.49 'systemctl status mautrix-slack'

# View bridge logs
ssh root@45.77.205.49 'journalctl -u mautrix-slack -f'

# Check Socket Mode connection
ssh root@45.77.205.49 'journalctl -u mautrix-slack -n 20 | grep -i socket'

# Query bridge database
ssh root@45.77.205.49 'sudo -u mautrix_slack psql mautrix_slack -c "SELECT * FROM portal;"'

Secrets Management

# Edit encrypted secrets
sops secrets/secrets.yaml

# View decrypted secrets (never commit output)
sops -d secrets/secrets.yaml

# Add new secret
sops secrets/secrets.yaml
# (Edit in your $EDITOR, auto-encrypts on save)

Matrix Server

# Check Matrix homeserver
ssh root@45.77.205.49 'systemctl status matrix-continuwuity'

# Check that the client-server API responds (use 127.0.0.1, not localhost; see Critical Networking Details)
ssh root@45.77.205.49 'curl -s http://127.0.0.1:8008/_matrix/client/versions | jq .'

Database

# List databases
ssh root@45.77.205.49 'sudo -u postgres psql -l'

# Check bridge database
ssh root@45.77.205.49 'sudo -u postgres psql mautrix_slack -c "\dt"'

# Backup bridge database
ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack' > backup.sql

Code Style

  • Nix 2.x, NixOS 24.05+, Bash 5.x: Follow standard conventions
  • NixOS modules: Use nixpkgs module pattern (options, config, mkIf)
  • Configuration: Declarative over imperative
  • Secrets: Never hardcode, use sops-nix or interactive login
  • Logging: Use appropriate levels (debug for troubleshooting, info for production)

Development Patterns

Slack Bridge (002-slack-bridge-integration)

  • Authentication: Interactive login via Matrix chat (login app command)
  • Socket Mode: WebSocket connection, no public endpoint needed
  • Portal Creation: Automatic based on activity (no manual channel mapping)
  • Secrets: Stored in bridge database after authentication (not in NixOS config)
  • Token Requirements: Bot token (xoxb-) + app-level token (xapp-)
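
The two token types can be told apart by prefix before they are pasted into the login flow. The following is a minimal sketch; `slack_token_kind` is a hypothetical helper, and it checks only the prefix, not whether Slack accepts the token.

```shell
# Hypothetical helper: classify a Slack token by its prefix only.
# Prints "bot" for xoxb-, "app" for xapp-, "unknown" for anything else.
slack_token_kind() {
  case "$1" in
    xoxb-?*) echo "bot" ;;
    xapp-?*) echo "app" ;;
    *)       echo "unknown" ;;
  esac
}
```

A caller would compare the printed value against the kind it expects and refuse to continue on "unknown". The function never echoes the token itself, so it is safe to use in logged scripts.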

Secrets Management

  • Encryption: Age encryption via SSH host key (/etc/ssh/ssh_host_ed25519_key)
  • Storage: secrets/secrets.yaml (encrypted, safe to commit)
  • Runtime: Decrypted to /run/secrets/ (tmpfs, cleared on reboot)
  • Permissions: 0440 for service-specific secrets, owned by service user
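
The permission rule above can be spot-checked on the host. This is a sketch only: `check_secret_perms` is a hypothetical helper, and the secret name in the usage note is a placeholder, not a real entry in secrets.yaml.

```shell
# Hypothetical helper: succeed only if a file has mode 0440 and, when a
# second argument is given, is owned by that user.
check_secret_perms() {
  path=$1
  want_owner=${2:-}
  [ -e "$path" ] || { echo "missing: $path" >&2; return 1; }
  # -L follows symlinks, in case the path under /run/secrets/ is a link.
  # The first 10 characters of the listing are the type and permission
  # bits, e.g. "-r--r-----" for mode 0440.
  listing=$(ls -ldL "$path")
  mode=$(printf '%s\n' "$listing" | cut -c1-10)
  owner=$(printf '%s\n' "$listing" | awk '{print $3}')
  [ "$mode" = "-r--r-----" ] || { echo "bad mode: $mode" >&2; return 1; }
  if [ -n "$want_owner" ] && [ "$owner" != "$want_owner" ]; then
    echo "bad owner: $owner" >&2
    return 1
  fi
  return 0
}
```

Example use on the VPS, with `<secret-name>` standing in for an actual secret: `check_secret_perms /run/secrets/<secret-name> mautrix_slack`.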

Deployment Workflow

  1. Make configuration changes locally
  2. Commit to git
  3. Deploy via nixos-rebuild
  4. Verify service status and logs
  5. Document in docs/worklogs/
  6. Test functionality
  7. Monitor for stability
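
Steps 3 and 4 of the workflow can be wrapped in a small script. This is a sketch under assumptions: `run`, `deploy_and_verify`, and the `DRY_RUN` switch are hypothetical names, while the host, flake attribute, and unit names are the ones used elsewhere in this document.

```shell
# Defaults mirror the deployment commands in this document.
TARGET=${TARGET:-root@45.77.205.49}
FLAKE_ATTR=${FLAKE_ATTR:-ops-jrz1}

# With DRY_RUN=1 (hypothetical switch), print the command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 3: deploy. Step 4: verify that each unit reports active.
deploy_and_verify() {
  run nixos-rebuild switch --flake ".#$FLAKE_ATTR" \
    --target-host "$TARGET" --build-host localhost || return 1
  for unit in mautrix-slack matrix-continuwuity; do
    run ssh "$TARGET" "systemctl is-active --quiet $unit" || return 1
  done
}
```

Running with `DRY_RUN=1` first shows exactly which commands would be executed. Each unit is checked separately because `systemctl is-active` with several units succeeds if any one of them is active.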

Recent Changes

  • 001-extract-matrix-platform: Added Nix 2.x, NixOS 24.05+, Bash 5.x (for scripts)
  • 002-slack-bridge-integration: Deployed mautrix-slack bridge with Socket Mode (2025-10-26)
    • Phase 0-1: Research and design complete
    • Phase 2: Infrastructure deployed and operational
    • Status: Bidirectional message flow working (Slack ↔ Matrix)
    • ~50 Slack channels synced to Matrix rooms

Known Issues

  • olm-3.2.16 marked insecure (permitted via nixpkgs.config.permittedInsecurePackages)
  • conduwuit log level set to "debug" (intended for troubleshooting, consider reverting to "info")
  • Fresh database required after conduwuit version upgrades (wipe /var/lib/matrix-continuwuity/db/)

Testing Guidelines

  • Test message latency: Should be <5 seconds (FR-001, FR-002)
  • Test reactions, edits, file attachments
  • Monitor health indicators: connection_status, last_successful_message, error_count
  • Stability target: 99% uptime over 7-day period
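
The stability target translates into a concrete downtime budget. A minimal sketch of the arithmetic, with `downtime_budget_seconds` as a hypothetical helper name:

```shell
# Allowed downtime, in whole seconds, for an uptime target over N days.
# Integer percentages only, since shell arithmetic has no floating point.
downtime_budget_seconds() {
  days=$1
  uptime_pct=$2
  echo $(( days * 86400 * (100 - uptime_pct) / 100 ))
}

# downtime_budget_seconds 7 99   # prints 6048
```

For the 99% target over 7 days this gives 6048 seconds, or roughly 1 hour 41 minutes of total downtime across the week.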

Manual Configuration Workarounds

mautrix-slack Registration File Fix (KNOWN ISSUE)

Problem: The bridge's registration generator creates a random sender_localpart instead of using the configured bot.username value.

Current Manual Fix (Required on Fresh Deploy):

# After bridge service starts and generates registration
ssh root@45.77.205.49 'systemctl stop mautrix-slack'

# Edit registration file to fix sender_localpart
ssh root@45.77.205.49 "sed -i 's/^sender_localpart: .*/sender_localpart: slackbot/' /var/lib/matrix-appservices/mautrix_slack_registration.yaml"

# Re-register appservice in Matrix admin room
# In Element, send to admin room:
# !admin appservices unregister slack
# !admin appservices register
# <paste corrected YAML>

# Restart homeserver to load new registration
ssh root@45.77.205.49 'systemctl restart matrix-continuwuity'

# Start bridge
ssh root@45.77.205.49 'systemctl start mautrix-slack'

Root Cause: mautrix-slack's -g flag generates registration independently of config.yaml settings.

Potential Permanent Fix: Patch modules/mautrix-slack.nix to post-process registration file after generation:

# In ExecStartPre, after registration generation:
${pkgs.gnused}/bin/sed -i 's/^sender_localpart: .*/sender_localpart: ${cfg.appservice.senderLocalpart}/' "$REG_PATH"

Impact: Without this fix, registration sender_localpart won't match bridge config, causing authentication failures.
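
The manual sed step can be made self-verifying, so that a registration file without the key is reported instead of silently left unchanged. This is a sketch, not the module's actual behavior: `fix_sender_localpart` is a hypothetical helper, and it assumes the localpart contains no `/` or `&` characters.

```shell
# Hypothetical helper: rewrite sender_localpart in a registration file and
# confirm the result. The new content is written to a temp file and copied
# back with cat, so the original file keeps its owner and mode.
fix_sender_localpart() {
  reg=$1
  localpart=$2
  tmp=$(mktemp) || return 1
  sed "s/^sender_localpart: .*/sender_localpart: $localpart/" "$reg" > "$tmp" \
    && cat "$tmp" > "$reg"
  status=$?
  rm -f "$tmp"
  [ "$status" -eq 0 ] || return 1
  # Fail if the key is absent, because then nothing was replaced.
  grep -qx "sender_localpart: $localpart" "$reg"
}
```

On the VPS this would be called as `fix_sender_localpart /var/lib/matrix-appservices/mautrix_slack_registration.yaml slackbot` while the bridge is stopped. The re-registration in the admin room is still required afterwards.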


QA Testing Checklist

Core Features (✅ Tested & Working)

  • Bidirectional text messaging (Slack ↔ Matrix)
  • Channel discovery and room creation (~50 channels synced)
  • Socket Mode WebSocket connection
  • Bot authentication with Matrix homeserver
  • Bridge startup and recovery after restart

Features Requiring QA Testing (⚠️ Untested)

  • File Attachments

    • Upload file in Slack → verify appears in Matrix
    • Upload file in Matrix → verify appears in Slack
    • Test various file types (images, PDFs, archives)
    • Test large files (>10MB)
  • Emoji Reactions

    • Add reaction in Slack → verify appears in Matrix
    • Add reaction in Matrix → verify appears in Slack
    • Remove reaction → verify syncs
  • Message Edits

    • Edit message in Slack → verify updates in Matrix
    • Edit message in Matrix → verify updates in Slack
  • Message Deletion

    • Delete message in Slack → verify removes from Matrix
    • Delete message in Matrix → verify removes from Slack
  • Thread Replies

    • Reply in Slack thread → verify threading in Matrix
    • Reply in Matrix thread → verify threading in Slack
  • User Profile Sync

    • Change Slack display name → verify updates Matrix puppet
    • Change Slack avatar → verify updates Matrix puppet
  • Error Handling

    • Network interruption recovery
    • Matrix homeserver restart handling
    • Slack WebSocket reconnection
    • Invalid token handling
  • Performance

    • High-volume channel (>100 messages/hour)
    • Large file transfer times
    • Message latency under load

Test Commands

# Monitor bridge during testing
ssh root@45.77.205.49 'journalctl -u mautrix-slack -f'

# Check for errors
ssh root@45.77.205.49 'journalctl -u mautrix-slack --since "1 hour ago" | grep -E "ERR|WRN|FTL"'

# Verify message flow
# Test in #vlads-pad or similar channel
# Send from Slack, verify in Matrix room
# Send from Matrix room, verify in Slack

Future Infrastructure Needs

Monitoring & Alerting (Not Implemented)

Health Checks Needed:

  • Bridge WebSocket connection status
  • Matrix homeserver availability
  • Message processing latency
  • Database connection health
  • Error rate thresholds

Potential Solutions:

# Option 1: Simple systemd monitoring
systemctl is-active --quiet mautrix-slack || alert   # "alert" is a placeholder for the notification command

# Option 2: Prometheus + Alertmanager
# - Export bridge metrics (if available)
# - Alert on service down, high error rate, message lag

# Option 3: Uptime monitoring
# - External ping to Matrix homeserver
# - Check /_matrix/client/versions endpoint
# - Alert on HTTP errors or timeout

Metrics to Track:

  • Bridge uptime percentage
  • Messages processed (Slack → Matrix, Matrix → Slack)
  • WebSocket reconnection events
  • Database query performance
  • Error counts by type

Alert Conditions:

  • Bridge down for >5 minutes
  • No messages processed in >15 minutes (if active channels exist)
  • Error rate >5% of total messages
  • Database connection failures
  • Disk space <10% free
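
Two of the alert conditions reduce to simple threshold comparisons. The sketch below uses hypothetical helper names and takes the counts as arguments; where those counts come from (logs, metrics) is left open, since no monitoring is implemented yet.

```shell
# error_rate_exceeded ERRORS TOTAL [THRESHOLD_PCT]
# True when errors are more than THRESHOLD_PCT percent of total messages
# (default 5).
error_rate_exceeded() {
  errors=$1
  total=$2
  threshold=${3:-5}
  [ "$total" -gt 0 ] || return 1   # no traffic: the rate is undefined
  # Compare errors/total > threshold/100 without floating point.
  [ $(( errors * 100 )) -gt $(( threshold * total )) ]
}

# disk_space_low FREE_PCT [MIN_PCT]
# True when free space is below MIN_PCT percent (default 10).
disk_space_low() {
  [ "$1" -lt "${2:-10}" ]
}
```

One way to obtain the free percentage for the root filesystem is `df -P / | awk 'NR==2 {gsub("%","",$5); print 100-$5}'`, whose output can be passed to `disk_space_low`.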

Backup Strategy (Not Implemented)

Critical Data:

  • Matrix RocksDB: /var/lib/matrix-continuwuity/db/ (66M)
  • Bridge PostgreSQL: mautrix_slack database (172K)
  • Registration files: /var/lib/matrix-appservices/*.yaml
  • Secrets: sops-encrypted secrets/secrets.yaml (in git)

Backup Approach:

# Daily database backups (create the target directory first)
ssh root@45.77.205.49 'mkdir -p /root/backups'
ssh root@45.77.205.49 'tar czf /root/backups/matrix-$(date +%Y%m%d).tar.gz /var/lib/matrix-continuwuity/db/'
ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack > /root/backups/bridge-$(date +%Y%m%d).sql'

# Retention: 7 daily, 4 weekly, 12 monthly
# Store off-VPS (rsync to backup server or cloud storage)
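
The daily tier of the retention policy could be enforced with a pruning step like the one below. This is a sketch: `prune_daily_backups` is a hypothetical helper, it relies on the YYYYMMDD suffix sorting chronologically, and it does not implement the weekly or monthly tiers.

```shell
# prune_daily_backups DIR PREFIX [KEEP]
# Delete all but the newest KEEP files named PREFIX-YYYYMMDD.* in DIR
# (default KEEP is 7). Files with other names are left alone.
prune_daily_backups() {
  dir=$1
  prefix=$2
  keep=${3:-7}
  ls "$dir" | grep "^$prefix-[0-9]\{8\}\." | sort -r | tail -n +"$(( keep + 1 ))" |
  while IFS= read -r name; do
    rm -f -- "$dir/$name"
  done
}
```

For the file names used above, `prune_daily_backups /root/backups matrix` and `prune_daily_backups /root/backups bridge` would each keep the seven most recent daily files.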

Recovery Procedure:

  1. Deploy NixOS configuration
  2. Restore database backups
  3. Restore registration files
  4. Re-authenticate with Slack (new tokens via login app)
  5. Verify message flow

Note: Matrix database can be wiped and rebuilt from Slack if needed (current architecture treats Matrix as ephemeral view layer).


Current Architecture State (2025-10-26)

Deployed Services

┌─────────────────────────────────────────────────────┐
│ clarun.xyz (45.77.205.49)                          │
│                                                     │
│  ┌─────────────────────────────────────────────┐  │
│  │ nginx :443 (HTTPS)                          │  │
│  │  - Matrix Client-Server API                 │  │
│  │  - Forgejo (git.clarun.xyz)                 │  │
│  └────────────┬────────────────────────────────┘  │
│               │                                    │
│               ├─→ conduwuit :8008 (127.0.0.1)     │
│               │   - Matrix homeserver             │
│               │   - RocksDB schema v18            │
│               │   - 66M database                  │
│               │                                    │
│               └─→ Forgejo :3000 (127.0.0.1)       │
│                                                     │
│  ┌─────────────────────────────────────────────┐  │
│  │ mautrix-slack :29319 (127.0.0.1)            │  │
│  │  - Socket Mode WebSocket to Slack           │  │
│  │  - PostgreSQL backend (172K)                │  │
│  │  - ~50 portal rooms                         │  │
│  └────────────┬────────────────────────────────┘  │
│               │                                    │
│               └─→ PostgreSQL :5432 (unix socket)  │
│                                                     │
└─────────────────────────────────────────────────────┘
         │
         └─→ Slack API (Socket Mode WebSocket)
             - Workspace: chochacho
             - Bot token: xoxb-...
             - App token: xapp-...

Critical Networking Details

  • All internal services are addressed by IPv4 literal (127.0.0.1), NOT "localhost"
  • Reason: localhost can resolve to IPv6 [::1] first, while the services bind IPv4 only
  • Fixed in: nginx proxy_pass, bridge homeserverUrl configuration
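
A quick lint can catch a regression of the localhost issue in a rendered config file. This is a sketch: `find_localhost_refs` is a hypothetical helper, and it only looks for http(s) URLs, so the `--build-host localhost` argument in the deployment commands is unaffected.

```shell
# Hypothetical lint: print (with line numbers) every line of a config file
# that contains an http(s) URL pointing at "localhost".
# Exit status 0 means at least one offending line was found.
find_localhost_refs() {
  grep -nE 'https?://localhost([^A-Za-z0-9.-]|$)' "$1"
}
```

A deployment check would treat a zero exit status as a failure and ask for the URL to be changed to 127.0.0.1.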

Service Dependencies

postgresql.service
  └─→ mautrix-slack.service
        └─→ matrix-continuwuity.service
              └─→ nginx.service

Data Flow

  1. Slack → Matrix:

    • Slack pushes event via Socket Mode WebSocket
    • Bridge receives, transforms to Matrix event
    • Bridge sends the event to conduwuit over the client-server API (authenticated with as_token)
    • conduwuit distributes to Matrix rooms
    • Element clients receive via /sync
  2. Matrix → Slack:

    • Element client sends message via conduwuit
    • conduwuit forwards to bridge appservice endpoint
    • Bridge transforms to Slack API call
    • Bridge POSTs to Slack API (bot token)
    • Appears in Slack channel

Security Model

  • Secrets: Managed via sops-nix, deployed to /run/secrets/
  • Bridge tokens:
    • as_token: Bridge authenticates to Matrix
    • hs_token: Matrix authenticates to bridge
  • Slack tokens:
    • xoxb-: Bot API calls
    • xapp-: Socket Mode connection
  • No public bridge endpoint: Socket Mode eliminates webhook requirement

Operational Notes

  • Matrix database disposable (can rebuild from Slack)
  • Bridge config fully declarative except sender_localpart fix
  • Fresh database recommended after conduwuit version upgrades
  • Debug logging currently enabled on conduwuit