ops-jrz1/specs/002-slack-bridge-integration/spec.md
Dan ca379311b8 Add Slack bridge integration feature specification
Includes spec, plan, research, data model, contracts, and quickstart guide
for mautrix-slack Socket Mode bridge deployment.
2025-10-26 14:36:44 -07:00


# Feature Specification: Matrix-Slack Bridge Integration
**Feature Branch**: `002-slack-bridge-integration`

**Created**: 2025-10-22

**Status**: Draft

**Input**: User description: "Matrix-Slack bridge integration for bidirectional communication between Matrix homeserver and Slack workspace (chochacho), enabling unified team communication"
## Clarifications
### Session 2025-10-22
- Q: Initial Channel Bridge Configuration → A: Start with one test channel (e.g., #dev-platform or #test), expand after validation
- Q: Bridge Health Monitoring → A: Basic health indicators (connection status, last message timestamp, error count) logged to journal
## User Scenarios & Testing *(mandatory)*
### User Story 1 - Slack to Matrix Message Delivery (Priority: P1)
A team member sends a message in a Slack channel and it appears automatically in the corresponding Matrix room, allowing Matrix users to participate in the conversation seamlessly.

**Why this priority**: This is the core value proposition - establishing the communication bridge. Without this, the feature has no functionality. This validates that the bridge infrastructure is working correctly.

**Independent Test**: Can be fully tested by sending a test message in Slack and verifying it appears in Matrix, delivering immediate value as a read-only Slack viewer via Matrix.

**Acceptance Scenarios**:
1. **Given** bridge is configured for the designated test channel (e.g., #dev-platform), **When** user posts "Hello from Slack" in that channel, **Then** message appears in the bridged Matrix room within 5 seconds with the original sender's name
2. **Given** bridge is running and healthy, **When** user posts a message containing emoji in Slack, **Then** message appears in Matrix with emoji preserved
3. **Given** bridge is configured, **When** user posts multi-line message in Slack, **Then** message formatting is preserved in Matrix (line breaks, lists)
4. **Given** bridge is operational, **When** user uploads file in Slack channel, **Then** file link appears in Matrix room
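
One way to check the 5-second criterion during manual testing is to compare the timestamp each platform records for the same message. The Python sketch below assumes the tester copies the Slack message `ts` (epoch seconds as a string, such as `"1700000000.000100"`) and the Matrix event's `origin_server_ts` (epoch milliseconds); all helper names are hypothetical. Because the two timestamps come from different servers, any clock skew between them is folded into the measurement. For example, a Slack `ts` of `1700000000.000100` observed in Matrix at `origin_server_ts` `1700000003000` is a relay of just under 3 seconds, within budget; the same helpers work for US2 with the direction reversed.

```python
from datetime import datetime, timedelta, timezone

# Latency budget from FR-001/FR-002 and SC-001/SC-002.
RELAY_BUDGET = timedelta(seconds=5)


def slack_ts_to_datetime(ts: str) -> datetime:
    """Convert a Slack message `ts` such as "1700000000.000100" to a UTC datetime."""
    return datetime.fromtimestamp(float(ts), tz=timezone.utc)


def matrix_ts_to_datetime(origin_server_ts: int) -> datetime:
    """Convert a Matrix `origin_server_ts` (ms since the epoch) to a UTC datetime."""
    return datetime.fromtimestamp(origin_server_ts / 1000, tz=timezone.utc)


def relay_latency(sent_at: datetime, observed_at: datetime) -> timedelta:
    """Time between a message being sent on one side and observed on the other."""
    if observed_at < sent_at:
        raise ValueError("observed before sent: check clock skew or swapped arguments")
    return observed_at - sent_at


def within_budget(sent_at: datetime, observed_at: datetime,
                  budget: timedelta = RELAY_BUDGET) -> bool:
    """True if the relay met the latency budget."""
    return relay_latency(sent_at, observed_at) <= budget
```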
---
### User Story 2 - Matrix to Slack Message Delivery (Priority: P1)
A team member sends a message in Matrix room and it appears automatically in the corresponding Slack channel, enabling full bidirectional communication.

**Why this priority**: Completes the bidirectional flow, making this a true communication bridge rather than just a viewer. Essential for collaborative work.

**Independent Test**: Can be tested by sending a test message from Matrix and verifying it appears in Slack, proving full two-way communication works.

**Acceptance Scenarios**:
1. **Given** bridge is configured for bidirectional sync, **When** Matrix user posts "Hello from Matrix" in bridged room, **Then** message appears in Slack channel within 5 seconds with Matrix username
2. **Given** bridge supports rich formatting, **When** Matrix user posts message with markdown formatting, **Then** message appears in Slack with formatting converted to Slack's mrkdwn equivalents (e.g., bold, italics, code)
3. **Given** bridge handles mentions, **When** Matrix user mentions another bridged user, **Then** mention is translated to a Slack user mention (displayed as @username)
4. **Given** bridge is operational, **When** Matrix user posts message with attachment, **Then** attachment link appears in Slack channel
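
Mention translation (scenario 3) can be sketched as a rewrite of Matrix "pill" links in the event's HTML `formatted_body` into Slack's `<@USERID>` mention syntax. mautrix-slack performs its own conversion; this Python sketch only illustrates the idea. The `slack_ids` lookup table from Matrix user IDs to Slack user IDs is a hypothetical stand-in for the bridge's own user mapping, and the rest of the HTML-to-mrkdwn conversion is left out. Users with no Slack counterpart fall back to plain `@Display Name` text, and non-user links (rooms, aliases) are left untouched.

```python
import re
from urllib.parse import unquote

# Matrix clients render user mentions in `formatted_body` as "pills", e.g.
# <a href="https://matrix.to/#/@alice:clarun.xyz">Alice</a>
# (the "@" is sometimes percent-encoded as "%40").
PILL_RE = re.compile(r'<a href="https://matrix\.to/#/([^"]+)">([^<]*)</a>')


def translate_mentions(formatted_body: str, slack_ids: dict[str, str]) -> str:
    """Replace Matrix user pills with Slack mentions.

    `slack_ids` is a hypothetical lookup from Matrix user ID to Slack user ID.
    """
    def replace(match: re.Match) -> str:
        target, display = unquote(match.group(1)), match.group(2)
        if not target.startswith("@"):
            return match.group(0)  # room or alias link: leave unchanged
        slack_id = slack_ids.get(target)
        # Known user: Slack's <@USERID> syntax. Unknown user: plain text.
        return f"<@{slack_id}>" if slack_id else f"@{display}"

    return PILL_RE.sub(replace, formatted_body)
```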
---
### User Story 3 - Bridge Service Reliability (Priority: P2)
The bridge service starts automatically on server boot, recovers from connection failures, and continues operation without manual intervention.

**Why this priority**: Critical for production use but can be validated after basic messaging works. Prevents the bridge from being a maintenance burden.

**Independent Test**: Can be tested by rebooting the server and verifying bridge auto-starts and resumes messaging, or by simulating network failures.

**Acceptance Scenarios**:
1. **Given** server reboots, **When** system comes back online, **Then** bridge service starts automatically within 2 minutes and begins relaying messages
2. **Given** Slack API experiences temporary outage, **When** connectivity is restored, **Then** bridge reconnects automatically without message loss
3. **Given** Matrix homeserver restarts, **When** homeserver is available again, **Then** bridge re-establishes connection and resumes operation
4. **Given** bridge encounters configuration error, **When** error is logged, **Then** service reports clear diagnostic information for troubleshooting
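
Automatic reconnection (scenarios 2 and 3, FR-013) is commonly implemented as capped exponential backoff with jitter. The Python sketch below shows that technique under stated assumptions: `connect` is a hypothetical callable that raises `ConnectionError` on failure, and the base delay, cap, and attempt count are illustrative values, not mautrix-slack settings. Process-level restarts after a crash would come from the systemd unit (FR-010), which this sketch does not model. With the defaults, the ceiling on each wait grows 1, 2, 4, 8, ... seconds up to 60, and each actual wait is a random value below that ceiling so retries are spread out rather than hitting a recovering service at fixed intervals.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound on the wait after failed attempt number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def retry_with_backoff(connect: Callable[[], T], max_attempts: int = 8,
                       base: float = 1.0, cap: float = 60.0,
                       sleep: Callable[[float], None] = time.sleep,
                       rng: Optional[random.Random] = None) -> T:
    """Call `connect` until it succeeds, sleeping with capped, jittered backoff.

    Re-raises the last ConnectionError once `max_attempts` calls have failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    rng = rng or random.Random()
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # "Full jitter": wait a random time between 0 and the ceiling.
            sleep(rng.uniform(0, backoff_ceiling(attempt, base, cap)))
    raise AssertionError("unreachable")
```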
---
### User Story 4 - Bridge Configuration Management (Priority: P3)
Platform administrators can configure which Slack channels are bridged to Matrix rooms through declarative configuration, without writing code or restarting services manually. Initial deployment starts with one test channel to validate the bridge mechanism before expanding to additional channels.

**Why this priority**: Important for managing the bridge long-term, but basic functionality can work with hardcoded configuration initially. Can be iterated after P1-P2 are working. Starting with a single test channel minimizes risk and provides clear validation before broader rollout.

**Independent Test**: Can be tested by adding a new channel to configuration and verifying it bridges correctly after configuration reload.

**Acceptance Scenarios**:
1. **Given** administrator wants to bridge new channel, **When** channel mapping is added to configuration file, **Then** new bridge is established after configuration update
2. **Given** channel is no longer needed, **When** channel mapping is removed from configuration, **Then** bridge stops relaying messages for that channel
3. **Given** multiple channels configured, **When** administrator views configuration, **Then** all active bridges are clearly listed with their mappings
4. **Given** configuration contains error, **When** configuration is applied, **Then** clear error message explains what needs to be fixed
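
Scenario 4 asks for clear error messages on bad configuration. Assuming the declarative channel configuration reduces to a list of (Slack channel ID, Matrix room ID) pairs, a simplification that may not match how mautrix-slack actually stores portal mappings, a pre-deployment check could look like the Python sketch below. The ID patterns are assumptions based on the usual shapes (Slack channel IDs such as `C0123ABCD`, Matrix room IDs such as `!opaque:server`) and should be checked against real IDs before relying on them. Each problem produces one human-readable message, and duplicate mappings in either direction are reported.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Assumed ID shapes; verify against real IDs from the workspace and homeserver.
SLACK_CHANNEL_RE = re.compile(r"[CG][A-Z0-9]{6,}")  # e.g. "C0123ABCD"
MATRIX_ROOM_RE = re.compile(r"![^:\s]+:\S+")        # e.g. "!abc123:clarun.xyz"


def validate_mappings(mappings: Iterable[Tuple[str, str]]) -> List[str]:
    """Return human-readable errors for (slack_channel_id, matrix_room_id) pairs."""
    errors: List[str] = []
    room_for_channel: Dict[str, str] = {}
    channel_for_room: Dict[str, str] = {}
    for channel_id, room_id in mappings:
        if not SLACK_CHANNEL_RE.fullmatch(channel_id):
            errors.append(f"{channel_id!r} is not a Slack channel ID "
                          "(expected e.g. 'C0123ABCD', not a channel name)")
        if not MATRIX_ROOM_RE.fullmatch(room_id):
            errors.append(f"{room_id!r} is not a Matrix room ID "
                          "(expected '!opaque:server', not an alias)")
        if channel_id in room_for_channel:
            errors.append(f"Slack channel {channel_id!r} is mapped to both "
                          f"{room_for_channel[channel_id]!r} and {room_id!r}")
        if room_id in channel_for_room:
            errors.append(f"Matrix room {room_id!r} is mapped to both "
                          f"{channel_for_room[room_id]!r} and {channel_id!r}")
        room_for_channel.setdefault(channel_id, room_id)
        channel_for_room.setdefault(room_id, channel_id)
    return errors
```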
---
### Edge Cases
- What happens when Slack workspace is temporarily unavailable? (Bridge should queue messages and deliver when available, or report unavailability to Matrix users)
- How does system handle rate limits from Slack API? (Bridge should throttle requests and queue messages to stay within limits)
- What happens when bridge tries to relay message too large for target platform? (Message should be truncated with indication, or split into multiple messages)
- How does bridge handle Slack threads? (Thread context should be preserved or indicated in Matrix, possibly with reply chain)
- What happens when user edits or deletes message in Slack? (Edited messages should sync to Matrix if supported, deletions should be reflected)
- How does bridge handle authentication token expiry? (Bridge should detect expiry and report error clearly, requiring reauthorization)
- What happens when two users have same display name? (Bridge should disambiguate with user IDs or workspace indicators)
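
For the oversized-message case, splitting is usually preferable to truncation because no content is lost. The Python sketch below splits on line breaks where possible and hard-splits any single line longer than the limit; the function name is hypothetical. `max_len` is deliberately a parameter: the actual per-message limits of Slack and of the Matrix homeserver are not specified here and should be taken from each platform's documentation.

```python
from typing import List, Optional


def split_message(text: str, max_len: int) -> List[str]:
    """Split `text` into chunks of at most `max_len` characters.

    Breaks at newlines where possible; a newline that falls on a chunk
    boundary is dropped, and a single over-long line is hard-split.
    """
    if max_len < 1:
        raise ValueError("max_len must be positive")
    chunks: List[str] = []
    current: Optional[str] = None
    for line in text.split("\n"):
        # Hard-split a single line that alone exceeds the limit.
        while len(line) > max_len:
            if current is not None:
                chunks.append(current)
                current = None
            chunks.append(line[:max_len])
            line = line[max_len:]
        candidate = line if current is None else current + "\n" + line
        if len(candidate) <= max_len:
            current = candidate
        else:
            # Only reachable when current is not None: start a new chunk.
            chunks.append(current)
            current = line
    if current is not None:
        chunks.append(current)
    return chunks
```

For example, `split_message("aaa\nbbb\nccc", 7)` returns `["aaa\nbbb", "ccc"]`: the first two lines fit together exactly, and the third starts a new chunk.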
## Requirements *(mandatory)*
### Functional Requirements
- **FR-001**: Bridge MUST relay messages from Slack to Matrix within 5 seconds of posting
- **FR-002**: Bridge MUST relay messages from Matrix to Slack within 5 seconds of posting
- **FR-003**: Bridge MUST preserve message sender identity (username/display name)
- **FR-004**: Bridge MUST operate using Socket Mode for reliable real-time messaging
- **FR-005**: Bridge MUST authenticate to Slack using bot token and app token
- **FR-006**: Bridge MUST register with Matrix homeserver as application service
- **FR-007**: Bridge MUST store credentials securely using sops-nix encrypted secrets
- **FR-008**: Bridge MUST use PostgreSQL database for storing bridge state and mappings
- **FR-009**: Bridge MUST connect to chochacho Slack workspace
- **FR-009a**: Initial deployment MUST bridge one designated test channel (e.g., #dev-platform or #test) for validation
- **FR-010**: Bridge MUST start automatically on system boot as systemd service
- **FR-011**: Bridge MUST log its operations to the system journal for debugging, without including secrets (see Security Considerations)
- **FR-011a**: Bridge MUST log health indicators including connection status, last successful message timestamp, and error counts
- **FR-012**: Bridge MUST map Slack users to Matrix ghost users (puppeting)
- **FR-013**: Bridge MUST handle connection failures gracefully with automatic retry
- **FR-014**: Bridge MUST respect Slack API rate limits to avoid service disruption
- **FR-015**: System MUST support reauthorization of Slack bot when scopes change
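
FR-014 is most easily met by throttling outbound Slack API calls on the bridge side. Slack assigns its Web API methods to rate-limit tiers and answers over-limit requests with HTTP 429 and a `Retry-After` header; a client-side token bucket keeps the bridge under its budget in the first place. The Python sketch below is a generic token bucket, not mautrix-slack code, and the `rate` and `capacity` values a deployment would use are assumptions to be set from Slack's published limits for the methods involved. The injectable `clock` exists only to make the class testable.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow bursts of up to `capacity` sends, refilling at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if rate <= 0 or capacity <= 0:
            raise ValueError("rate and capacity must be positive")
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last refill, up to capacity.
        now = self._clock()
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now

    def try_acquire(self) -> bool:
        """Consume one token if available; False means the caller should wait."""
        self._refill()
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False

    def wait_time(self) -> float:
        """Seconds until one token will be available (0.0 if available now)."""
        self._refill()
        return max(0.0, (1 - self._tokens) / self.rate)
```

For example, `TokenBucket(rate=1.0, capacity=2)` permits a burst of two sends, then roughly one per second; when `try_acquire()` returns `False`, the caller sleeps for `wait_time()` seconds (or queues the message) before trying again.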
### Key Entities
- **Slack Channel**: Represents a conversation space in Slack workspace, identified by channel ID, contains messages and participants
- **Matrix Room**: Represents a conversation space in Matrix homeserver, identified by room ID, contains events and members
- **Channel Bridge Mapping**: Links a Slack channel to a Matrix room, defines bidirectional sync relationship
- **Ghost User**: Matrix user representation of Slack user, allows messages to appear from original sender in Matrix
- **Bridge State**: Persistent connection and sync status information, includes last message timestamps and error states
- **Credentials**: Slack bot token, app token, and Matrix app service tokens required for authentication
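
The entities above can be sketched as Python dataclasses to make the relationships concrete. Field names and example values are hypothetical; the real schema lives in the bridge's PostgreSQL database (FR-008) in whatever form mautrix-slack defines. Credentials are intentionally not modeled: they are read from sops-nix secrets at runtime and should never be stored alongside bridge state. `health_line()` shows one possible key=value rendering of the FR-011a health indicators for the journal; the format is an assumption, not the bridge's actual log output.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class ChannelBridgeMapping:
    slack_channel_id: str  # e.g. "C0123ABCD" (hypothetical)
    matrix_room_id: str    # e.g. "!abc123:clarun.xyz" (hypothetical)


@dataclass(frozen=True)
class GhostUser:
    slack_user_id: str     # e.g. "U0123ABCD" (hypothetical)
    matrix_user_id: str    # ghost MXID; the real naming scheme is the bridge's choice
    display_name: str


@dataclass
class BridgeState:
    """Mutable health indicators tracked per FR-011a."""
    connected: bool = False
    last_message_at: Optional[datetime] = None
    error_count: int = 0

    def record_message(self, at: datetime) -> None:
        self.last_message_at = at

    def record_error(self) -> None:
        self.error_count += 1

    def health_line(self) -> str:
        """Render the health indicators as a single key=value journal line."""
        status = "connected" if self.connected else "disconnected"
        last = self.last_message_at.isoformat() if self.last_message_at else "never"
        return f"bridge_health status={status} last_message={last} errors={self.error_count}"
```

A freshly started, not-yet-connected bridge would render `bridge_health status=disconnected last_message=never errors=0`.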
## Success Criteria *(mandatory)*
### Measurable Outcomes
- **SC-001**: Engineers can send message in Slack and see it appear in Matrix within 5 seconds
- **SC-002**: Engineers can send message in Matrix and see it appear in Slack within 5 seconds
- **SC-003**: Bridge maintains 99% uptime over 7-day period after deployment
- **SC-003a**: Bridge health status (connected/disconnected, last message time, error count) is visible in system logs
- **SC-004**: Bridge automatically recovers from network failures without manual intervention
- **SC-005**: Bridge setup and configuration is documented clearly enough for another engineer to replicate
- **SC-006**: Platform administrators can add new channel bridge in under 10 minutes
- **SC-007**: Zero message loss during normal operation (messages always delivered or error reported)
- **SC-008**: Bridge remains operational after server reboot without manual restart
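
For SC-003, it helps to translate the percentage into a concrete downtime budget: a 7-day window is 168 hours, so 99% uptime allows 1% of that, 1.68 hours (1 hour 40 minutes 48 seconds), of cumulative downtime. A small Python helper for the same arithmetic, useful if the target or window changes:

```python
from datetime import timedelta


def downtime_budget(uptime_target: float, window: timedelta) -> timedelta:
    """Maximum cumulative downtime that still meets `uptime_target` over `window`."""
    if not 0 < uptime_target <= 1:
        raise ValueError("uptime_target must be in (0, 1]")
    return window * (1 - uptime_target)


print(downtime_budget(0.99, timedelta(days=7)))  # 1:40:48
```

A reboot that takes the full 2 minutes allowed by US3 scenario 1 uses about 2% of this budget.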
## Assumptions
- Slack workspace (chochacho) administrators will grant necessary permissions for bot installation
- Existing Slack bot can be reauthorized with updated scopes and Socket Mode enabled
- Network connectivity between VPS and Slack API is reliable (>99% uptime)
- Matrix homeserver (clarun.xyz) is operational and accessible on localhost
- PostgreSQL database is available for bridge state storage
- Secrets management via sops-nix is already configured and working
- Engineers primarily communicate in Slack and will continue doing so
- Initial deployment bridges one test channel for validation before expanding
- Number of bridged channels will be small initially (< 10 channels after validation)
- Message volume is moderate (< 1000 messages/day per channel)
- No need for historical message import (bridge starts fresh from activation time)
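
The volume assumptions translate into very low average rates, which is useful context for FR-014: 1,000 messages per day in one channel averages one message every 86.4 seconds, and even 10 such channels stay around 0.12 messages per second. Real traffic is bursty, so peak rates during an active discussion will be far higher than these averages; the Python arithmetic below covers averages only, and `average_rate` is a hypothetical helper.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(messages_per_day: float, channels: int = 1) -> float:
    """Average messages per second across `channels` equally busy channels."""
    return messages_per_day * channels / SECONDS_PER_DAY


print(f"{average_rate(1000):.4f} msg/s per channel")        # 0.0116 msg/s per channel
print(f"{average_rate(1000, channels=10):.3f} msg/s total")  # 0.116 msg/s total
print(f"{1 / average_rate(1000):.1f} s between messages")    # 86.4 s between messages
```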
## Scope
### In Scope
- Bidirectional message relay between Slack and Matrix
- Bot token authentication with Socket Mode
- Single Slack workspace (chochacho) integration
- Declarative channel bridge configuration via NixOS
- Automatic service startup and recovery
- Secure credential storage with sops-nix
- Basic message formatting preservation
- User identity preservation via ghost users
### Out of Scope
- Historical message import/migration
- WhatsApp bridge integration (future feature)
- Google Messages bridge integration (future feature)
- Multi-workspace Slack support
- Advanced Slack features (workflows, slash commands, custom integrations)
- Matrix E2E encryption for bridged rooms
- Message editing/deletion sync (nice-to-have, not MVP)
- Thread conversation preservation (nice-to-have, not MVP)
- Reaction sync between platforms (nice-to-have, not MVP)
- File upload sync (links only, not full upload mirroring)
- Voice/video call bridging
## Dependencies
### Technical Dependencies
- Matrix homeserver (conduwuit) operational on clarun.xyz:8008
- PostgreSQL 15.10 available for bridge database
- sops-nix secrets management configured with VPS age key
- NixOS module system for declarative service configuration
- mautrix-slack package available in nixpkgs-unstable
### External Dependencies
- Slack workspace (chochacho) administrator access
- Slack bot reauthorization with required scopes
- Socket Mode enabled for Slack app
- Slack API availability and rate limits
- Network connectivity to Slack API endpoints
### Process Dependencies
- Manager approval for Slack bot reauthorization
- Secrets (bot token, app token) obtained from Slack
- Matrix appservice registration completed
- Platform vision documentation (docs/platform-vision.md) approved
- Deployment pattern established (NixOS module approach)
## Notes
### Context from Platform Vision
This feature represents **Milestone 1** of the ops-jrz1 platform vision: "Working Slack Bridge". Success here validates the core communication architecture and unblocks team onboarding.

Reference: `docs/platform-vision.md` sections:
- Communication Layer principles
- Presentable MVP definition
- Phase 1 timeline
### Existing Infrastructure
- mautrix-slack NixOS module exists: `modules/mautrix-slack.nix`
- Module currently configured for "delpadtech" workspace (needs update to "chochacho")
- Service exits with code 11 (likely missing configuration or credentials)
- PostgreSQL database setup already configured in dev-services.nix
- Secrets management pattern established with Matrix registration token
### Known Issues
- Current exit code 11 suggests missing Slack credentials or configuration
- Workspace name needs update: delpadtech → chochacho
- Socket Mode not yet configured in Slack app
- Bot scopes may need adjustment for mautrix-slack requirements
### Security Considerations
- Bot token and app token MUST be stored in encrypted secrets.yaml
- Tokens MUST NOT appear in configuration files or logs
- Bridge service runs as dedicated user (mautrix_slack) with limited permissions
- Database access restricted to bridge user only
- No public endpoints exposed (bridge connects outbound to Slack API)
### Testing Strategy
- Manual testing sufficient for MVP (automated tests future enhancement)
- Initial deployment validates bridge with single test channel (#dev-platform or #test)
- Test each user story independently as implemented
- Use test messages in Slack and Matrix to verify relay
- Simulate failures (network disconnect, service restart) to test recovery
- Monitor logs for errors and performance issues using basic health indicators (connection status, message timestamps, error counts)
- Validate secrets are never logged or exposed
- Verify health indicators appear in system journal during normal operation and failure scenarios
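
The "secrets are never logged" check can be automated by scanning the service's journal output. The Python sketch below looks for the documented Slack token prefixes (`xoxb-` for bot tokens, `xapp-` for app-level tokens, `xoxp-` for user tokens) and reports only the prefix and line number, so the check itself never prints a secret. Matrix appservice tokens have no fixed prefix, so this sketch does not catch them; comparing against the actual secret values would be needed for that.

```python
import re
import sys
from typing import Iterable, Iterator, TextIO, Tuple

# Documented Slack token prefixes: xoxb- (bot), xapp- (app-level), xoxp- (user).
SLACK_TOKEN_RE = re.compile(r"\b(?:xoxb|xapp|xoxp)-[A-Za-z0-9-]+")


def find_leaks(lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, prefix) for each token-shaped string in `lines`."""
    for lineno, line in enumerate(lines, start=1):
        for match in SLACK_TOKEN_RE.finditer(line):
            # Keep only the 5-character prefix so the report never echoes a secret.
            yield lineno, match.group(0)[:5]


def main(stream: TextIO = sys.stdin) -> int:
    """Scan `stream`; return 1 if any token-shaped string was found, else 0."""
    leaks = list(find_leaks(stream))
    for lineno, prefix in leaks:
        print(f"line {lineno}: possible {prefix}... token", file=sys.stderr)
    return 1 if leaks else 0
```

Assuming the unit is named `mautrix-slack.service` (check the name defined in `modules/mautrix-slack.nix`), the output of `journalctl -u mautrix-slack -o cat` could be piped into a small wrapper that calls `sys.exit(main())`; a nonzero exit status means a token-shaped string was found.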
### Future Enhancements
These are explicitly out of scope for MVP but worth documenting for iteration:
- Message editing/deletion sync
- Thread preservation
- Emoji reaction sync
- Advanced Slack integrations (slash commands, workflows)
- Multi-workspace support
- Historical message import
- Advanced metrics dashboard (Prometheus, Grafana integration)
- Automated health checks and alerting beyond basic logging
- Message throughput and latency histograms