diff --git a/CLAUDE.md b/CLAUDE.md
index 0e1350f..669b091 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -131,14 +131,16 @@ ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack' > backup.sql
 ## Recent Changes
 - 001-extract-matrix-platform: Added Nix 2.x, NixOS 24.05+, Bash 5.x (for scripts)
-- 002-slack-bridge-integration: Added mautrix-slack bridge with Socket Mode (2025-10-22)
-  - Phase 0: Research complete (Socket Mode, API scopes, sops-nix, channel patterns)
-  - Phase 1: Design complete (data-model.md, contracts/, quickstart.md)
-  - Next: Phase 2 (/speckit.tasks for implementation breakdown)
+- 002-slack-bridge-integration: Deployed mautrix-slack bridge with Socket Mode (2025-10-26)
+  - Phase 0-1: Research and design complete
+  - Phase 2: Infrastructure deployed and operational
+  - Status: Bidirectional message flow working (Slack ↔ Matrix)
+  - ~50 Slack channels synced to Matrix rooms
 ## Known Issues
-- mautrix-slack exits with code 11 without credentials (expected, requires `login app`)
 - olm-3.2.16 marked insecure (permitted via nixpkgs.config.permittedInsecurePackages)
+- conduwuit log level set to "debug" (intended for troubleshooting, consider reverting to "info")
+- Fresh database required after conduwuit version upgrades (wipe /var/lib/matrix-continuwuity/db/)
 ## Testing Guidelines
 - Test message latency: Should be <5 seconds (FR-001, FR-002)
@@ -147,4 +149,258 @@ ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack' > backup.sql
 - Stability target: 99% uptime over 7-day period
+
+## Manual Configuration Workarounds
+
+### mautrix-slack Registration File Fix (KNOWN ISSUE)
+
+**Problem:** The bridge's registration generator creates a random `sender_localpart` instead of using the configured `bot.username` value.
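One way to confirm whether a deployment is affected is to read `sender_localpart` from the generated registration file and compare it with the expected bot username. The sketch below is illustrative only: `registration_localpart` and `check_localpart` are hypothetical helper names, not part of mautrix-slack or this repository, and the parsing assumes the value is a plain unquoted YAML scalar on its own line.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; not part of mautrix-slack or this repository.

# Print the sender_localpart value from a registration YAML file.
# Assumes an unquoted scalar on a line starting with "sender_localpart:".
registration_localpart() {
  sed -n 's/^sender_localpart:[[:space:]]*//p' "$1" | head -n 1
}

# Succeed only if the registration's sender_localpart equals the expected name.
check_localpart() {
  local file="$1" expected="$2" actual
  actual="$(registration_localpart "$file")"
  if [ "$actual" = "$expected" ]; then
    echo "ok: sender_localpart is '$actual'"
  else
    echo "mismatch: sender_localpart is '$actual', expected '$expected'" >&2
    return 1
  fi
}
```

On the server this might be called as `check_localpart /var/lib/matrix-appservices/mautrix_slack_registration.yaml slackbot`; a non-zero exit status would indicate that the workaround is still needed.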
+
+**Current Manual Fix (Required on Fresh Deploy):**
+```bash
+# After bridge service starts and generates registration
+ssh root@45.77.205.49 'systemctl stop mautrix-slack'
+
+# Edit registration file to fix sender_localpart
+ssh root@45.77.205.49 "sed -i 's/^sender_localpart: .*/sender_localpart: slackbot/' /var/lib/matrix-appservices/mautrix_slack_registration.yaml"
+
+# Re-register appservice in Matrix admin room
+# In Element, send to admin room:
+# !admin appservices unregister slack
+# !admin appservices register
+#
+
+# Restart homeserver to load new registration
+ssh root@45.77.205.49 'systemctl restart matrix-continuwuity'
+
+# Start bridge
+ssh root@45.77.205.49 'systemctl start mautrix-slack'
+```
+
+**Root Cause:** mautrix-slack's `-g` flag generates the registration independently of the `config.yaml` settings.
+
+**Potential Permanent Fix:** Patch `modules/mautrix-slack.nix` to post-process the registration file after generation:
+```nix
+# In ExecStartPre, after registration generation:
+${pkgs.gnused}/bin/sed -i 's/^sender_localpart: .*/sender_localpart: ${cfg.appservice.senderLocalpart}/' "$REG_PATH"
+```
+
+**Impact:** Without this fix, the `sender_localpart` in the registration file does not match the bridge's configured bot username, which causes authentication failures.
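The post-processing step can also be tried outside Nix before patching the module. The following is a minimal sketch, assuming GNU sed (for `-i` without a backup suffix); `fix_sender_localpart` is a hypothetical helper name, and the character check is deliberately stricter than what Matrix allows so that the value cannot break the sed expression.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the sed expression used in the manual fix.
# Assumes GNU sed; not part of mautrix-slack or this repository.

# Rewrite sender_localpart in a registration file. Safe to run repeatedly.
fix_sender_localpart() {
  local file="$1" name="$2"
  # Reject empty names and anything outside a conservative character set
  # (this also keeps "/" out of the sed replacement).
  case "$name" in
    ''|*[!a-z0-9._=-]*) echo "invalid localpart: '$name'" >&2; return 2 ;;
  esac
  sed -i "s/^sender_localpart: .*/sender_localpart: ${name}/" "$file"
  # Confirm the expected line is now present (fixed-string, whole-line match).
  grep -qxF -- "sender_localpart: ${name}" "$file"
}
```

Because the substitution replaces the whole line, running it a second time leaves the file unchanged, which is the property an `ExecStartPre` hook would need.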
+
+---
+
+## QA Testing Checklist
+
+### Core Features (✅ Tested & Working)
+- [x] Bidirectional text messaging (Slack ↔ Matrix)
+- [x] Channel discovery and room creation (~50 channels synced)
+- [x] Socket Mode WebSocket connection
+- [x] Bot authentication with Matrix homeserver
+- [x] Bridge startup and recovery after restart
+
+### Features Requiring QA Testing (⚠️ Untested)
+- [ ] **File Attachments**
+  - Upload file in Slack → verify appears in Matrix
+  - Upload file in Matrix → verify appears in Slack
+  - Test various file types (images, PDFs, archives)
+  - Test large files (>10MB)
+
+- [ ] **Emoji Reactions**
+  - Add reaction in Slack → verify appears in Matrix
+  - Add reaction in Matrix → verify appears in Slack
+  - Remove reaction → verify syncs
+
+- [ ] **Message Edits**
+  - Edit message in Slack → verify updates in Matrix
+  - Edit message in Matrix → verify updates in Slack
+
+- [ ] **Message Deletion**
+  - Delete message in Slack → verify removes from Matrix
+  - Delete message in Matrix → verify removes from Slack
+
+- [ ] **Thread Replies**
+  - Reply in Slack thread → verify threading in Matrix
+  - Reply in Matrix thread → verify threading in Slack
+
+- [ ] **User Profile Sync**
+  - Change Slack display name → verify updates Matrix puppet
+  - Change Slack avatar → verify updates Matrix puppet
+
+- [ ] **Error Handling**
+  - Network interruption recovery
+  - Matrix homeserver restart handling
+  - Slack WebSocket reconnection
+  - Invalid token handling
+
+- [ ] **Performance**
+  - High-volume channel (>100 messages/hour)
+  - Large file transfer times
+  - Message latency under load
+
+### Test Commands
+```bash
+# Monitor bridge during testing
+ssh root@45.77.205.49 'journalctl -u mautrix-slack -f'
+
+# Check for errors
+ssh root@45.77.205.49 'journalctl -u mautrix-slack --since "1 hour ago" | grep -E "ERR|WRN|FTL"'
+
+# Verify message flow
+# Test in #vlads-pad or similar channel
+# Send from Slack, verify in Matrix room
+# Send from Matrix room, verify in Slack
+```
+
+---
+
+## Future Infrastructure Needs
+
+### Monitoring & Alerting (Not Implemented)
+
+**Health Checks Needed:**
+- Bridge WebSocket connection status
+- Matrix homeserver availability
+- Message processing latency
+- Database connection health
+- Error rate thresholds
+
+**Potential Solutions:**
+```bash
+# Option 1: Simple systemd monitoring
+# (`alert` is a placeholder for whatever notification command is chosen)
+systemctl is-active --quiet mautrix-slack || alert
+
+# Option 2: Prometheus + Alertmanager
+# - Export bridge metrics (if available)
+# - Alert on service down, high error rate, message lag
+
+# Option 3: Uptime monitoring
+# - External ping to Matrix homeserver
+# - Check /_matrix/client/versions endpoint
+# - Alert on HTTP errors or timeout
+```
+
+**Metrics to Track:**
+- Bridge uptime percentage
+- Messages processed (Slack → Matrix, Matrix → Slack)
+- WebSocket reconnection events
+- Database query performance
+- Error counts by type
+
+**Alert Conditions:**
+- Bridge down for >5 minutes
+- No messages processed in >15 minutes (if active channels exist)
+- Error rate >5% of total messages
+- Database connection failures
+- Disk space <10% free
+
+### Backup Strategy (Not Implemented)
+
+**Critical Data:**
+- Matrix RocksDB: `/var/lib/matrix-continuwuity/db/` (66M)
+- Bridge PostgreSQL: `mautrix_slack` database (172K)
+- Registration files: `/var/lib/matrix-appservices/*.yaml`
+- Secrets: sops-encrypted `secrets/secrets.yaml` (in git)
+
+**Backup Approach:**
+```bash
+# Daily database backups
+ssh root@45.77.205.49 'tar czf /root/backups/matrix-$(date +%Y%m%d).tar.gz /var/lib/matrix-continuwuity/db/'
+ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack > /root/backups/bridge-$(date +%Y%m%d).sql'
+
+# Retention: 7 daily, 4 weekly, 12 monthly
+# Store off-VPS (rsync to backup server or cloud storage)
+```
+
+**Recovery Procedure:**
+1. Deploy NixOS configuration
+2. Restore database backups
+3. Restore registration files
+4. Re-authenticate with Slack (new tokens via `login app`)
+5. Verify message flow
+
+**Note:** The Matrix database can be wiped and rebuilt from Slack if needed (the current architecture treats Matrix as an ephemeral view layer).
+
+---
+
+## Current Architecture State (2025-10-26)
+
+### Deployed Services
+```
+┌─────────────────────────────────────────────────────┐
+│  clarun.xyz (45.77.205.49)                          │
+│                                                     │
+│  ┌─────────────────────────────────────────────┐    │
+│  │  nginx :443 (HTTPS)                         │    │
+│  │  - Matrix Client-Server API                 │    │
+│  │  - Forgejo (git.clarun.xyz)                 │    │
+│  └────────────┬────────────────────────────────┘    │
+│               │                                     │
+│               ├─→ conduwuit :8008 (127.0.0.1)       │
+│               │   - Matrix homeserver               │
+│               │   - RocksDB schema v18              │
+│               │   - 66M database                    │
+│               │                                     │
+│               └─→ Forgejo :3000 (127.0.0.1)         │
+│                                                     │
+│  ┌─────────────────────────────────────────────┐    │
+│  │  mautrix-slack :29319 (127.0.0.1)           │    │
+│  │  - Socket Mode WebSocket to Slack           │    │
+│  │  - PostgreSQL backend (172K)                │    │
+│  │  - ~50 portal rooms                         │    │
+│  └────────────┬────────────────────────────────┘    │
+│               │                                     │
+│               └─→ PostgreSQL :5432 (unix socket)    │
+│                                                     │
+└─────────────────────────────────────────────────────┘
+                │
+                └─→ Slack API (Socket Mode WebSocket)
+                    - Workspace: chochacho
+                    - Bot token: xoxb-...
+                    - App token: xapp-...
+```
+
+### Critical Networking Details
+- **All internal services use IPv4 (127.0.0.1)** - NOT "localhost"
+- Reason: `localhost` can resolve to IPv6 `[::1]` first, but the services bind to IPv4 only
+- Fixed in: nginx proxy_pass, bridge homeserverUrl configuration
+
+### Service Dependencies
+```
+postgresql.service
+  └─→ mautrix-slack.service
+        └─→ matrix-continuwuity.service
+              └─→ nginx.service
+```
+
+### Data Flow
+1. **Slack → Matrix:**
+   - Slack pushes event via Socket Mode WebSocket
+   - Bridge receives, transforms to Matrix event
+   - Bridge sends the event to conduwuit via the Client-Server API (authenticated with `as_token`)
+   - conduwuit distributes to Matrix rooms
+   - Element clients receive via /sync
+
+2. **Matrix → Slack:**
+   - Element client sends message via conduwuit
+   - conduwuit forwards to bridge appservice endpoint
+   - Bridge transforms to Slack API call
+   - Bridge POSTs to Slack API (bot token)
+   - Appears in Slack channel
+
+### Security Model
+- **Secrets:** Managed via sops-nix, deployed to `/run/secrets/`
+- **Bridge tokens:**
+  - `as_token`: Bridge authenticates to Matrix
+  - `hs_token`: Matrix authenticates to bridge
+- **Slack tokens:**
+  - `xoxb-`: Bot API calls
+  - `xapp-`: Socket Mode connection
+- **No public bridge endpoint:** Socket Mode eliminates webhook requirement
+
+### Operational Notes
+- Matrix database disposable (can rebuild from Slack)
+- Bridge config fully declarative except sender_localpart fix
+- Fresh database recommended after conduwuit version upgrades
+- Debug logging currently enabled on conduwuit
+
\ No newline at end of file
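The two Slack token prefixes listed under Security Model lend themselves to a quick sanity check before the bridge is started, for example to catch a bot token pasted where the app token belongs. This is a rough sketch under that assumption: `slack_token_kind` is a hypothetical helper, it inspects only the prefix, and it does not validate the token against Slack.

```shell
#!/usr/bin/env bash
# Hypothetical pre-deploy check; inspects the token prefix only.

# Print "bot" for xoxb- tokens and "app" for xapp- tokens.
# Anything else prints "unknown" and returns non-zero.
slack_token_kind() {
  case "$1" in
    xoxb-?*) echo "bot" ;;
    xapp-?*) echo "app" ;;
    *) echo "unknown"; return 1 ;;
  esac
}
```

A deploy script could, for instance, refuse to continue unless the value intended for Socket Mode reports `app` and the value intended for API calls reports `bot`.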