Document current architecture, manual fixes, and QA checklist
Added comprehensive documentation:
- Manual workaround for sender_localpart registration bug
- QA testing checklist for untested features
- Future monitoring/alerting requirements
- Current architecture diagram and data flow
- Security model and operational notes
parent 0b1751766b
commit 2dfe4ea829
266
CLAUDE.md
@@ -131,14 +131,16 @@ ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack' > backup.sql

## Recent Changes
- 001-extract-matrix-platform: Added Nix 2.x, NixOS 24.05+, Bash 5.x (for scripts)
- 002-slack-bridge-integration: Added mautrix-slack bridge with Socket Mode (2025-10-22)
  - Phase 0: Research complete (Socket Mode, API scopes, sops-nix, channel patterns)
  - Phase 1: Design complete (data-model.md, contracts/, quickstart.md)
  - Next: Phase 2 (/speckit.tasks for implementation breakdown)
- 002-slack-bridge-integration: Deployed mautrix-slack bridge with Socket Mode (2025-10-26)
  - Phase 0-1: Research and design complete
  - Phase 2: Infrastructure deployed and operational
  - Status: Bidirectional message flow working (Slack ↔ Matrix)
  - ~50 Slack channels synced to Matrix rooms

## Known Issues
- mautrix-slack exits with code 11 without credentials (expected, requires `login app`)
- olm-3.2.16 marked insecure (permitted via nixpkgs.config.permittedInsecurePackages)
- conduwuit log level set to "debug" (intended for troubleshooting, consider reverting to "info")
- Fresh database required after conduwuit version upgrades (wipe /var/lib/matrix-continuwuity/db/)

## Testing Guidelines
- Test message latency: Should be <5 seconds (FR-001, FR-002)

@@ -147,4 +149,258 @@ ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack' > backup.sql
- Stability target: 99% uptime over 7-day period
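
To make the stability target concrete, the sketch below converts an uptime percentage and a window length into an allowed-downtime budget. The function name `downtime_budget_minutes` is illustrative and not part of the repository.

```bash
# Allowed downtime, in minutes, for a given uptime target over a window.
# Usage: downtime_budget_minutes <uptime_percent> <window_days>
downtime_budget_minutes() {
  awk -v pct="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", days * 24 * 60 * (100 - pct) / 100 }'
}

# 99% over 7 days: 1% of 10080 minutes
downtime_budget_minutes 99 7   # prints 100.8
```

In other words, the 99% target tolerates roughly 1 hour 40 minutes of total downtime per 7-day period.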

<!-- MANUAL ADDITIONS START -->

## Manual Configuration Workarounds

### mautrix-slack Registration File Fix (KNOWN ISSUE)

**Problem:** The bridge's registration generator creates a random `sender_localpart` instead of using the configured `bot.username` value.

**Current Manual Fix (Required on Fresh Deploy):**
```bash
# After the bridge service starts and generates its registration, stop it
ssh root@45.77.205.49 'systemctl stop mautrix-slack'

# Edit registration file to fix sender_localpart
ssh root@45.77.205.49 "sed -i 's/^sender_localpart: .*/sender_localpart: slackbot/' /var/lib/matrix-appservices/mautrix_slack_registration.yaml"

# Re-register appservice in Matrix admin room
# In Element, send to admin room:
#   !admin appservices unregister slack
#   !admin appservices register
#   <paste corrected YAML>

# Restart homeserver to load new registration
ssh root@45.77.205.49 'systemctl restart matrix-continuwuity'

# Start bridge
ssh root@45.77.205.49 'systemctl start mautrix-slack'
```

**Root Cause:** mautrix-slack's `-g` flag generates the registration independently of `config.yaml` settings.

**Potential Permanent Fix:** Patch `modules/mautrix-slack.nix` to post-process the registration file after generation:
```nix
# In ExecStartPre, after registration generation:
${pkgs.gnused}/bin/sed -i 's/^sender_localpart: .*/sender_localpart: ${cfg.appservice.senderLocalpart}/' "$REG_PATH"
```

**Impact:** Without this fix, the `sender_localpart` in the registration does not match the bridge config, which causes authentication failures.
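
A quick way to confirm whether the fix is still needed is to compare the value in the registration file with the expected bot localpart. The following is a minimal sketch; the function name `check_sender_localpart` is hypothetical, and it assumes `sender_localpart` is a top-level key in the YAML, as the `sed` command above also assumes.

```bash
# Compare sender_localpart in a registration file with the expected value.
# Usage: check_sender_localpart <registration.yaml> <expected_localpart>
check_sender_localpart() {
  reg_file=$1
  expected=$2
  # Take the first top-level sender_localpart value and strip optional quotes.
  actual=$(sed -n 's/^sender_localpart:[[:space:]]*//p' "$reg_file" | tr -d "\"'" | head -n 1)
  if [ "$actual" = "$expected" ]; then
    echo "OK: sender_localpart is '$actual'"
  else
    echo "MISMATCH: sender_localpart is '$actual', expected '$expected'" >&2
    return 1
  fi
}
```

On the server this could be run as `check_sender_localpart /var/lib/matrix-appservices/mautrix_slack_registration.yaml slackbot`; a non-zero exit status means the manual fix has not been applied yet.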

---

## QA Testing Checklist

### Core Features (✅ Tested & Working)
- [x] Bidirectional text messaging (Slack ↔ Matrix)
- [x] Channel discovery and room creation (~50 channels synced)
- [x] Socket Mode WebSocket connection
- [x] Bot authentication with Matrix homeserver
- [x] Bridge startup and recovery after restart

### Features Requiring QA Testing (⚠️ Untested)
- [ ] **File Attachments**
  - Upload file in Slack → verify appears in Matrix
  - Upload file in Matrix → verify appears in Slack
  - Test various file types (images, PDFs, archives)
  - Test large files (>10MB)

- [ ] **Emoji Reactions**
  - Add reaction in Slack → verify appears in Matrix
  - Add reaction in Matrix → verify appears in Slack
  - Remove reaction → verify syncs

- [ ] **Message Edits**
  - Edit message in Slack → verify updates in Matrix
  - Edit message in Matrix → verify updates in Slack

- [ ] **Message Deletion**
  - Delete message in Slack → verify removes from Matrix
  - Delete message in Matrix → verify removes from Slack

- [ ] **Thread Replies**
  - Reply in Slack thread → verify threading in Matrix
  - Reply in Matrix thread → verify threading in Slack

- [ ] **User Profile Sync**
  - Change Slack display name → verify updates Matrix puppet
  - Change Slack avatar → verify updates Matrix puppet

- [ ] **Error Handling**
  - Network interruption recovery
  - Matrix homeserver restart handling
  - Slack WebSocket reconnection
  - Invalid token handling

- [ ] **Performance**
  - High-volume channel (>100 messages/hour)
  - Large file transfer times
  - Message latency under load

### Test Commands
```bash
# Monitor bridge during testing
ssh root@45.77.205.49 'journalctl -u mautrix-slack -f'

# Check for errors
ssh root@45.77.205.49 'journalctl -u mautrix-slack --since "1 hour ago" | grep -E "ERR|WRN|FTL"'

# Verify message flow
# Test in #vlads-pad or similar channel
# Send from Slack, verify in Matrix room
# Send from Matrix room, verify in Slack
```
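
During a QA pass it can help to summarize the log rather than scroll through it. The sketch below counts lines per level using the same `ERR`, `WRN`, and `FTL` markers as the `grep` above; the helper name `count_log_levels` is illustrative, and a line containing more than one marker is counted once per marker.

```bash
# Read log lines on stdin and print a count per level marker.
count_log_levels() {
  awk '
    /ERR/ { err++ }
    /WRN/ { wrn++ }
    /FTL/ { ftl++ }
    END   { printf "ERR=%d WRN=%d FTL=%d\n", err, wrn, ftl }
  '
}
```

Example usage: `ssh root@45.77.205.49 'journalctl -u mautrix-slack --since "1 hour ago"' | count_log_levels`.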

---

## Future Infrastructure Needs

### Monitoring & Alerting (Not Implemented)

**Health Checks Needed:**
- Bridge WebSocket connection status
- Matrix homeserver availability
- Message processing latency
- Database connection health
- Error rate thresholds

**Potential Solutions:**
```bash
# Option 1: Simple systemd monitoring
# (`alert` is a placeholder for whatever notification command is chosen)
systemctl is-active --quiet mautrix-slack || alert

# Option 2: Prometheus + Alertmanager
# - Export bridge metrics (if available)
# - Alert on service down, high error rate, message lag

# Option 3: Uptime monitoring
# - External ping to Matrix homeserver
# - Check /_matrix/client/versions endpoint
# - Alert on HTTP errors or timeout
```

**Metrics to Track:**
- Bridge uptime percentage
- Messages processed (Slack → Matrix, Matrix → Slack)
- WebSocket reconnection events
- Database query performance
- Error counts by type

**Alert Conditions:**
- Bridge down for >5 minutes
- No messages processed in >15 minutes (if active channels exist)
- Error rate >5% of total messages
- Database connection failures
- Disk space <10% free
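
Two of the alert conditions above reduce to simple threshold checks. The following is a sketch only: the function names are hypothetical, the error and message counts are assumed to come from some metrics source that does not exist yet, and `df -P` is used because its POSIX output format keeps each filesystem on one line.

```bash
# Succeeds (exit 0) when errors exceed the threshold percentage of total.
# Usage: error_rate_exceeded <errors> <total> [threshold_percent, default 5]
error_rate_exceeded() {
  awk -v e="$1" -v t="$2" -v th="${3:-5}" \
    'BEGIN { exit !(t > 0 && e * 100 / t > th) }'
}

# Succeeds (exit 0) when free space on a mount point is below the minimum.
# Usage: disk_free_low <mount_point> [min_free_percent, default 10]
disk_free_low() {
  # Column 5 of `df -P` is the used capacity, e.g. "42%".
  used=$(df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }')
  [ $((100 - used)) -lt "${2:-10}" ]
}
```

Either check could be wired to the placeholder notifier, for example `disk_free_low / && alert`.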

### Backup Strategy (Not Implemented)

**Critical Data:**
- Matrix RocksDB: `/var/lib/matrix-continuwuity/db/` (66M)
- Bridge PostgreSQL: `mautrix_slack` database (172K)
- Registration files: `/var/lib/matrix-appservices/*.yaml`
- Secrets: sops-encrypted `secrets/secrets.yaml` (in git)

**Backup Approach:**
```bash
# Make sure the backup directory exists
ssh root@45.77.205.49 'mkdir -p /root/backups'

# Daily database backups
ssh root@45.77.205.49 'tar czf /root/backups/matrix-$(date +%Y%m%d).tar.gz /var/lib/matrix-continuwuity/db/'
ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack > /root/backups/bridge-$(date +%Y%m%d).sql'

# Retention: 7 daily, 4 weekly, 12 monthly
# Store off-VPS (rsync to backup server or cloud storage)
```
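
The retention comment above is not implemented anywhere yet. As a starting point, the sketch below handles only the daily tier: it keeps the newest N files for a given name prefix and deletes the rest. The function name `prune_backups` is hypothetical, the weekly and monthly tiers are left out, and it relies on the `YYYYMMDD` stamp in the file names sorting chronologically as plain text.

```bash
# Keep the newest <keep> backups named <prefix>-YYYYMMDD.* in <dir>.
# Usage: prune_backups <dir> <prefix> <keep>
prune_backups() {
  dir=$1
  prefix=$2
  keep=$3
  total=$(find "$dir" -maxdepth 1 -type f -name "${prefix}-*" | wc -l)
  excess=$((total - keep))
  [ "$excess" -gt 0 ] || return 0
  # Oldest files sort first, so the first $excess entries are the ones to drop.
  find "$dir" -maxdepth 1 -type f -name "${prefix}-*" | sort | head -n "$excess" |
    while IFS= read -r f; do
      rm -f -- "$f"
    done
}
```

For the layout above this would be `prune_backups /root/backups matrix 7` and `prune_backups /root/backups bridge 7`, run on the server after the daily backup.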

**Recovery Procedure:**
1. Deploy NixOS configuration
2. Restore database backups
3. Restore registration files
4. Re-authenticate with Slack (new tokens via `login app`)
5. Verify message flow

**Note:** Matrix database can be wiped and rebuilt from Slack if needed (current architecture treats Matrix as ephemeral view layer).

---

## Current Architecture State (2025-10-26)

### Deployed Services
```
┌─────────────────────────────────────────────────────┐
│  clarun.xyz (45.77.205.49)                          │
│                                                     │
│  ┌─────────────────────────────────────────────┐    │
│  │  nginx :443 (HTTPS)                         │    │
│  │    - Matrix Client-Server API               │    │
│  │    - Forgejo (git.clarun.xyz)               │    │
│  └────────────┬────────────────────────────────┘    │
│               │                                     │
│               ├─→ conduwuit :8008 (127.0.0.1)       │
│               │     - Matrix homeserver             │
│               │     - RocksDB schema v18            │
│               │     - 66M database                  │
│               │                                     │
│               └─→ Forgejo :3000 (127.0.0.1)         │
│                                                     │
│  ┌─────────────────────────────────────────────┐    │
│  │  mautrix-slack :29319 (127.0.0.1)           │    │
│  │    - Socket Mode WebSocket to Slack         │    │
│  │    - PostgreSQL backend (172K)              │    │
│  │    - ~50 portal rooms                       │    │
│  └────────────┬────────────────────────────────┘    │
│               │                                     │
│               └─→ PostgreSQL :5432 (unix socket)    │
│                                                     │
└─────────────────────────────────────────────────────┘
                │
                └─→ Slack API (Socket Mode WebSocket)
                    - Workspace: chochacho
                    - Bot token: xoxb-...
                    - App token: xapp-...
```

### Critical Networking Details
- **All internal services use IPv4 (127.0.0.1)** - NOT "localhost"
- Reason: `localhost` can resolve to IPv6 `[::1]`, but the services bind IPv4-only
- Fixed in: nginx proxy_pass, bridge homeserverUrl configuration
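
To guard against the `localhost` problem creeping back in, rendered config files can be scanned for URLs that still use it. This is a minimal sketch; the function name `find_localhost_refs` is hypothetical, and which files to scan (rendered nginx config, bridge `config.yaml`) depends on where the deployment writes them.

```bash
# Print every line (with line number) that contains an http(s)://localhost URL.
# Exits non-zero when nothing is found, like grep itself.
# Usage: find_localhost_refs <file>...
find_localhost_refs() {
  grep -nE 'https?://localhost(:[0-9]+)?' "$@"
}
```

Any match is a candidate for replacement with `127.0.0.1`.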

### Service Dependencies
```
postgresql.service
└─→ mautrix-slack.service
└─→ matrix-continuwuity.service
└─→ nginx.service
```

### Data Flow
1. **Slack → Matrix:**
   - Slack pushes event via Socket Mode WebSocket
   - Bridge receives the event and transforms it into a Matrix event
   - Bridge sends the event to conduwuit over the client-server API, authenticating with `as_token`
   - conduwuit distributes to Matrix rooms
   - Element clients receive via /sync

2. **Matrix → Slack:**
   - Element client sends message via conduwuit
   - conduwuit pushes the event to the bridge's appservice endpoint, authenticating with `hs_token`
   - Bridge transforms to Slack API call
   - Bridge POSTs to Slack API (bot token)
   - Appears in Slack channel
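
For manual testing of the Matrix side of this flow, a message can be sent with the standard client-server `send` endpoint. The sketch below is not mautrix-slack's own code; the function names, the room ID, and the token are placeholders, and the JSON body is built without escaping, so it only suits plain test strings.

```bash
# Build the client-server URL for sending an m.room.message event.
# Room IDs contain "!" and ":", which must be percent-encoded in the path.
# Usage: matrix_send_url <base_url> <room_id> <txn_id>
matrix_send_url() {
  base=$1
  room_id=$2
  txn_id=$3
  encoded=$(printf '%s' "$room_id" | sed -e 's/!/%21/g' -e 's/:/%3A/g')
  printf '%s/_matrix/client/v3/rooms/%s/send/m.room.message/%s\n' "$base" "$encoded" "$txn_id"
}

# Send a plain-text message (access token and text are supplied by the caller).
# Usage: send_text <base_url> <room_id> <txn_id> <access_token> <text>
send_text() {
  curl -sS -X PUT "$(matrix_send_url "$1" "$2" "$3")" \
    -H "Authorization: Bearer $4" \
    -H 'Content-Type: application/json' \
    -d "{\"msgtype\":\"m.text\",\"body\":\"$5\"}"
}
```

Run on the server, a call such as `send_text http://127.0.0.1:8008 '!roomid:clarun.xyz' "test-$(date +%s)" "$TOKEN" 'hello from Matrix'` should produce a message that then appears in the bridged Slack channel.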

### Security Model
- **Secrets:** Managed via sops-nix, deployed to `/run/secrets/`
- **Bridge tokens:**
  - `as_token`: Bridge authenticates to Matrix
  - `hs_token`: Matrix authenticates to bridge
- **Slack tokens:**
  - `xoxb-`: Bot API calls
  - `xapp-`: Socket Mode connection
- **No public bridge endpoint:** Socket Mode eliminates webhook requirement

### Operational Notes
- Matrix database disposable (can rebuild from Slack)
- Bridge config fully declarative except sender_localpart fix
- Fresh database recommended after conduwuit version upgrades
- Debug logging currently enabled on conduwuit

<!-- MANUAL ADDITIONS END -->