ops-jrz1/CLAUDE.md
Dan f25a8b06ef Production hardening and technical debt cleanup
Priority 1 - Production Quality:
- Revert Matrix homeserver log level from debug to info
- Reduces log volume by ~70% (22k+ lines/day to <7k)
- Improves performance and reduces disk usage

Priority 2 - Technical Debt:
- Automate sender_localpart fix in mautrix-slack.nix
- Eliminates manual sed command on fresh deployments
- Fix verified working (tested 2025-10-26)
- Update CLAUDE.md to document automated solution

Priority 3 - Project Hygiene:
- Remove unused mautrix-whatsapp and mautrix-gmessages imports
- Archive old configurations to docs/examples/alternative-deployments/
- Remove stale staging/ directories from 001 extraction workflow
- Update deployment documentation in tasks.md and quickstart.md
- Add deployment status notes to spec files

Files Modified:
- modules/dev-services.nix: log level debug → info
- modules/mautrix-slack.nix: automatic sender_localpart fix
- hosts/ops-jrz1.nix: remove unused bridge imports
- CLAUDE.md: update Known Issues, add Resolved Issues section
- specs/002-*/: add deployment status notes
- configurations/ → docs/examples/alternative-deployments/

Tested and Verified:
- All services running (matrix, bridge, forgejo, postgresql, nginx)
- Bridge authenticated and message flow working
- sender_localpart fix generates correct registration file
2025-10-26 15:59:05 -07:00


ops-jrz1 Development Guidelines

Auto-generated from all feature plans. Last updated: 2025-10-22

Active Technologies

  • Nix 2.x, NixOS 24.05+, Bash 5.x (for scripts) (001-extract-matrix-platform)
  • mautrix-slack (Go), PostgreSQL 15.10, sops-nix (002-slack-bridge-integration)
  • Matrix homeserver: conduwuit (clarun.xyz)
  • Secrets management: sops-nix with age encryption

Project Structure

.
├── hosts/                    # NixOS host configurations
│   └── ops-jrz1.nix         # VPS configuration (45.77.205.49)
├── modules/                  # NixOS modules
│   ├── dev-services.nix     # PostgreSQL, Forgejo, bridge coordination
│   ├── mautrix-slack.nix    # Slack bridge module
│   └── matrix-continuwuity.nix  # Matrix homeserver
├── secrets/                  # sops-encrypted secrets
│   └── secrets.yaml         # Encrypted credentials (age)
├── specs/                    # Feature specifications
│   ├── 001-extract-matrix-platform/
│   └── 002-slack-bridge-integration/
│       ├── spec.md          # Feature specification
│       ├── plan.md          # Implementation plan
│       ├── research.md      # Technical research findings
│       ├── data-model.md    # Data model & state machines
│       ├── quickstart.md    # Deployment runbook
│       └── contracts/       # Configuration schemas
├── docs/                     # Documentation
│   ├── platform-vision.md   # North star document
│   └── worklogs/            # Deployment logs
└── .specify/                 # Spec-kit framework files

Commands

Deployment

# Deploy configuration to VPS
nixos-rebuild switch --flake .#ops-jrz1 \
  --target-host root@45.77.205.49 \
  --build-host localhost

# Deploy to staging
nixos-rebuild switch --flake .#ops-jrz1-staging \
  --target-host root@45.77.205.49 \
  --build-host localhost

Bridge Management

# Check bridge status
ssh root@45.77.205.49 'systemctl status mautrix-slack'

# View bridge logs
ssh root@45.77.205.49 'journalctl -u mautrix-slack -f'

# Check Socket Mode connection
ssh root@45.77.205.49 'journalctl -u mautrix-slack -n 20 | grep -i socket'

# Query bridge database
ssh root@45.77.205.49 'sudo -u mautrix_slack psql mautrix_slack -c "SELECT * FROM portal;"'

Secrets Management

# Edit encrypted secrets
sops secrets/secrets.yaml

# View decrypted secrets (never commit output)
sops -d secrets/secrets.yaml

# Add new secret
sops secrets/secrets.yaml
# (Edit in your $EDITOR, auto-encrypts on save)

Matrix Server

# Check Matrix homeserver
ssh root@45.77.205.49 'systemctl status matrix-continuwuity'

# Check the client-server API locally (this does not exercise federation)
ssh root@45.77.205.49 'curl -s http://127.0.0.1:8008/_matrix/client/versions | jq .'

Database

# List databases
ssh root@45.77.205.49 'sudo -u postgres psql -l'

# Check bridge database
ssh root@45.77.205.49 'sudo -u postgres psql mautrix_slack -c "\dt"'

# Backup bridge database
ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack' > backup.sql

Code Style

  • Nix 2.x, NixOS 24.05+, Bash 5.x: Follow standard conventions
  • NixOS modules: Use nixpkgs module pattern (options, config, mkIf)
  • Configuration: Declarative over imperative
  • Secrets: Never hardcode, use sops-nix or interactive login
  • Logging: Use appropriate levels (debug for troubleshooting, info for production)

Development Patterns

Slack Bridge (002-slack-bridge-integration)

  • Authentication: Interactive login via Matrix chat (login app command)
  • Socket Mode: WebSocket connection, no public endpoint needed
  • Portal Creation: Automatic based on activity (no manual channel mapping)
  • Secrets: Stored in bridge database after authentication (not in NixOS config)
  • Token Requirements: Bot token (xoxb-) + app-level token (xapp-)

Secrets Management

  • Encryption: Age encryption via SSH host key (/etc/ssh/ssh_host_ed25519_key)
  • Storage: secrets/secrets.yaml (encrypted, safe to commit)
  • Runtime: Decrypted to /run/secrets/ (tmpfs, cleared on reboot)
  • Permissions: 0440 for service-specific secrets, owned by service user
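
The permission rule above can be spot-checked on the host. The following is a minimal sketch, assuming GNU stat (as on NixOS); check_secret_mode is a hypothetical helper name, not something sops-nix provides, and the secret path in the usage note is a placeholder.

```shell
# Hypothetical helper (not part of sops-nix): compare a file's octal
# mode against the expected value (default 440).
check_secret_mode() {
  local path="$1" expected="${2:-440}"
  local actual
  # GNU stat: %a prints the permission bits in octal, e.g. 440
  actual="$(stat -c '%a' "$path")" || return 2
  if [ "$actual" = "$expected" ]; then
    echo "ok: $path mode $actual"
  else
    echo "mismatch: $path mode $actual (expected $expected)"
    return 1
  fi
}
```

Usage would look like check_secret_mode /run/secrets/<secret-name>, run as root on the VPS; a non-zero exit status indicates a mode mismatch or an unreadable path.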

Deployment Workflow

  1. Make configuration changes locally
  2. Commit to git
  3. Deploy via nixos-rebuild
  4. Verify service status and logs
  5. Document in worklogs/
  6. Test functionality
  7. Monitor for stability

Git Workflow

This project uses Trunk-Based Development for simplified collaboration and deployment.

Branch Strategy

  • main: Single long-lived branch, always deployable
  • Feature branches: Short-lived (hours to days), naming: ###-feature-name
  • No long-lived branches: Feature branches merge or delete quickly

Feature Development Workflow

# 1. Start feature from latest main
git checkout main
git pull origin main
git checkout -b 003-feature-name

# 2. Develop with frequent commits
# Make changes, commit often with clear messages

# 3. Keep the feature branch in sync with main (if feature takes >1 day)
git checkout main
git pull origin main
git checkout 003-feature-name
git rebase main

# 4. When feature complete, merge to main
git checkout main
git merge 003-feature-name  # Fast-forward merge preferred

# 5. Tag release if deploying
git tag -a v0.3.0 -m "Release notes..."
git push origin main --tags

# 6. Delete feature branch
git branch -d 003-feature-name

Release Tagging

  • Version scheme: v0.MINOR.PATCH (semver-like)
  • When to tag: After completing and merging a feature
  • Tag format: Annotated tags with comprehensive release notes
  • Example:
    git tag -a v0.3.0 -m "Release v0.3.0: Feature Description
    
    - Key changes
    - Architecture updates
    - Known issues
    "
    

Branch Naming Convention

  • Format: ###-short-description
  • Examples: 002-slack-bridge-integration, 003-monitoring-setup
  • Number matches spec directory in specs/###-feature-name/
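
The naming convention can be checked mechanically, for example in a pre-push hook. This is a sketch only: is_valid_branch and spec_dir_for_branch are hypothetical helpers, and the assumption that descriptions use only lowercase letters, digits and hyphens is inferred from the examples above rather than stated anywhere in the project.

```shell
# Hypothetical helper: accept names of the form ###-short-description.
# Assumes lowercase letters, digits and single hyphens in the description.
is_valid_branch() {
  printf '%s\n' "$1" | grep -Eq '^[0-9]{3}-[a-z0-9]+(-[a-z0-9]+)*$'
}

# Hypothetical helper: map a valid branch name to its spec directory.
spec_dir_for_branch() {
  is_valid_branch "$1" || return 1
  printf 'specs/%s\n' "$1"
}
```

For instance, spec_dir_for_branch 003-monitoring-setup prints specs/003-monitoring-setup, while a name such as feature/foo is rejected with a non-zero status.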

Commit Guidelines

  • Clear, concise commit messages
  • No emojis or marketing language
  • Focus on "what" and "why" not "how"
  • Group related changes in single commit
  • Example: "Fix bridge homeserver URL to use IPv4 (127.0.0.1) instead of localhost"

Main Branch Protection

  • Always keep main deployable
  • Test before merging to main
  • Document breaking changes in commit message
  • Tag releases for deployment milestones

Recent Changes

  • 001-extract-matrix-platform: Added Nix 2.x, NixOS 24.05+, Bash 5.x (for scripts)
  • 002-slack-bridge-integration: Deployed mautrix-slack bridge with Socket Mode (2025-10-26)
    • Phase 0-1: Research and design complete
    • Phase 2: Infrastructure deployed and operational
    • Status: Bidirectional message flow working (Slack ↔ Matrix)
    • ~50 Slack channels synced to Matrix rooms

Known Issues

  • olm-3.2.16 marked insecure (permitted via nixpkgs.config.permittedInsecurePackages)
  • Fresh database required after conduwuit version upgrades (wipe /var/lib/matrix-continuwuity/db/)

Resolved Issues

  • conduwuit debug logging (reverted to "info" 2025-10-26)
  • Manual sender_localpart fix (automated in mautrix-slack.nix 2025-10-26)

Testing Guidelines

  • Test message latency: Should be <5 seconds (FR-001, FR-002)
  • Test reactions, edits, file attachments
  • Monitor health indicators: connection_status, last_successful_message, error_count
  • Stability target: 99% uptime over 7-day period
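
To make the two numeric targets concrete: 99% uptime over 7 days leaves a downtime budget of 1% of 604800 seconds, which is 6048 seconds (about 100 minutes). The sketch below works through that arithmetic and the latency comparison; both function names are hypothetical, and timestamps are assumed to be whole Unix seconds.

```shell
# Hypothetical helper: downtime budget in seconds for a window of DAYS
# at UPTIME_PERCENT (integer percent).
allowed_downtime_seconds() {
  local days="$1" pct="$2"
  echo $(( days * 86400 * (100 - pct) / 100 ))
}

# Hypothetical helper: succeed when a message arrived within the limit.
# Arguments are send and receive times in Unix seconds; limit defaults
# to 5, matching the <5 second target (strictly less than).
latency_ok() {
  local sent="$1" received="$2" limit="${3:-5}"
  [ $(( received - sent )) -lt "$limit" ]
}
```

allowed_downtime_seconds 7 99 prints 6048, so a single outage of two hours would already exceed the weekly budget.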

Configuration Notes

mautrix-slack Registration File Fix (RESOLVED)

Issue: The bridge's registration generator (-g flag) creates a random sender_localpart instead of using the configured bot.username value.

Root Cause: mautrix-slack generates registration independently of config.yaml settings.

Solution: Automated fix implemented in modules/mautrix-slack.nix (lines 339-341)

The module now automatically patches the sender_localpart during registration generation:

# In ExecStartPre, after registration generation:
${pkgs.gnused}/bin/sed -i "s/^sender_localpart: .*/sender_localpart: ${cfg.appservice.senderLocalpart}/" "$REG_PATH"

Status: No manual intervention required on fresh deploys. The fix is applied automatically during service startup.

Verification: Tested 2025-10-26 - registration file correctly generated with sender_localpart: slackbot matching configuration.
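
The effect of that sed expression can be reproduced outside the service on a throwaway file. This is a sketch, assuming GNU sed (the module uses pkgs.gnused); patch_sender_localpart is a hypothetical wrapper, and the registration contents in the usage note are placeholder values, not the real file.

```shell
# Hypothetical wrapper around the same substitution the module runs in
# ExecStartPre: replace whatever sender_localpart was generated with the
# configured value, leaving every other line untouched.
patch_sender_localpart() {
  local reg_path="$1" localpart="$2"
  sed -i "s/^sender_localpart: .*/sender_localpart: ${localpart}/" "$reg_path"
}
```

Given a file containing the line sender_localpart: Xy7RandomValue, running patch_sender_localpart /tmp/registration.yaml slackbot rewrites only that line to sender_localpart: slackbot. The substitution is anchored at the start of the line, so indented keys elsewhere in the file are not affected.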


QA Testing Checklist

Core Features (Tested & Working)

  • Bidirectional text messaging (Slack ↔ Matrix)
  • Channel discovery and room creation (~50 channels synced)
  • Socket Mode WebSocket connection
  • Bot authentication with Matrix homeserver
  • Bridge startup and recovery after restart

Features Requiring QA Testing (⚠️ Untested)

  • File Attachments

    • Upload file in Slack → verify appears in Matrix
    • Upload file in Matrix → verify appears in Slack
    • Test various file types (images, PDFs, archives)
    • Test large files (>10MB)
  • Emoji Reactions

    • Add reaction in Slack → verify appears in Matrix
    • Add reaction in Matrix → verify appears in Slack
    • Remove reaction → verify syncs
  • Message Edits

    • Edit message in Slack → verify updates in Matrix
    • Edit message in Matrix → verify updates in Slack
  • Message Deletion

    • Delete message in Slack → verify removes from Matrix
    • Delete message in Matrix → verify removes from Slack
  • Thread Replies

    • Reply in Slack thread → verify threading in Matrix
    • Reply in Matrix thread → verify threading in Slack
  • User Profile Sync

    • Change Slack display name → verify updates Matrix puppet
    • Change Slack avatar → verify updates Matrix puppet
  • Error Handling

    • Network interruption recovery
    • Matrix homeserver restart handling
    • Slack WebSocket reconnection
    • Invalid token handling
  • Performance

    • High-volume channel (>100 messages/hour)
    • Large file transfer times
    • Message latency under load

Test Commands

# Monitor bridge during testing
ssh root@45.77.205.49 'journalctl -u mautrix-slack -f'

# Check for errors
ssh root@45.77.205.49 'journalctl -u mautrix-slack --since "1 hour ago" | grep -E "ERR|WRN|FTL"'

# Verify message flow
# Test in #vlads-pad or similar channel
# Send from Slack, verify in Matrix room
# Send from Matrix room, verify in Slack
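
When a test session produces a lot of output, a per-level count is easier to read than raw grep matches. The following sketch summarises log text from standard input; count_log_levels is a hypothetical helper, and like the grep above it matches the level tokens anywhere in a line, so message bodies that happen to contain ERR, WRN or FTL are counted too.

```shell
# Hypothetical helper: count lines containing each level token.
# Reads log text on stdin and prints a one-line summary.
count_log_levels() {
  awk '
    /ERR/ { err++ }
    /WRN/ { wrn++ }
    /FTL/ { ftl++ }
    END { printf "ERR=%d WRN=%d FTL=%d\n", err, wrn, ftl }
  '
}
```

It can be fed from the existing command, for example by piping the output of the journalctl --since "1 hour ago" invocation into count_log_levels on the local machine.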

Future Infrastructure Needs

Monitoring & Alerting (Not Implemented)

Health Checks Needed:

  • Bridge WebSocket connection status
  • Matrix homeserver availability
  • Message processing latency
  • Database connection health
  • Error rate thresholds

Potential Solutions:

# Option 1: Simple systemd monitoring
systemctl is-active --quiet mautrix-slack || alert  # "alert" is a placeholder for a notifier command

# Option 2: Prometheus + Alertmanager
# - Export bridge metrics (if available)
# - Alert on service down, high error rate, message lag

# Option 3: Uptime monitoring
# - External ping to Matrix homeserver
# - Check /_matrix/client/versions endpoint
# - Alert on HTTP errors or timeout

Metrics to Track:

  • Bridge uptime percentage
  • Messages processed (Slack → Matrix, Matrix → Slack)
  • WebSocket reconnection events
  • Database query performance
  • Error counts by type

Alert Conditions:

  • Bridge down for >5 minutes
  • No messages processed in >15 minutes (if active channels exist)
  • Error rate >5% of total messages
  • Database connection failures
  • Disk space <10% free
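
Two of these conditions reduce to simple integer comparisons. The sketch below shows one way to express them; all three function names are hypothetical, nothing here is implemented in the repository, and how the error and message counts are collected is left open.

```shell
# Hypothetical helper: succeed when errors/total exceeds the threshold
# (percent, default 5). Uses cross-multiplication to stay in integers:
# errors/total > threshold/100  <=>  errors*100 > threshold*total.
error_rate_exceeded() {
  local errors="$1" total="$2" threshold="${3:-5}"
  [ "$total" -gt 0 ] || return 1
  [ $(( errors * 100 )) -gt $(( threshold * total )) ]
}

# Hypothetical helper: succeed when free space (percent) is below the
# minimum (default 10).
disk_space_low() {
  [ "$1" -lt "${2:-10}" ]
}

# Hypothetical helper: approximate free percent of / from POSIX df output
# (column 5 is the used capacity, e.g. "42%").
root_free_percent() {
  df -P / | awk 'NR==2 { gsub("%", "", $5); print 100 - $5 }'
}
```

A periodic job could then run something like disk_space_low "$(root_free_percent)" and call whatever notifier is chosen when it succeeds.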

Backup Strategy (Not Implemented)

Critical Data:

  • Matrix RocksDB: /var/lib/matrix-continuwuity/db/ (66M)
  • Bridge PostgreSQL: mautrix_slack database (172K)
  • Registration files: /var/lib/matrix-appservices/*.yaml
  • Secrets: sops-encrypted secrets/secrets.yaml (in git)

Backup Approach:

# Daily database backups
ssh root@45.77.205.49 'tar czf /root/backups/matrix-$(date +%Y%m%d).tar.gz /var/lib/matrix-continuwuity/db/'
ssh root@45.77.205.49 'sudo -u postgres pg_dump mautrix_slack > /root/backups/bridge-$(date +%Y%m%d).sql'

# Retention: 7 daily, 4 weekly, 12 monthly
# Store off-VPS (rsync to backup server or cloud storage)
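
The daily tier of that retention policy could be enforced with a small pruning step. This is a sketch under stated assumptions: prune_daily_backups is a hypothetical helper, it relies on the YYYYMMDD stamp in the file names used above, and it covers only the "7 daily" part (weekly and monthly tiers are not handled).

```shell
# Hypothetical helper: keep the newest KEEP backups named
# PREFIX-YYYYMMDD.* in DIR and delete the older ones.
prune_daily_backups() {
  local dir="$1" prefix="$2" keep="${3:-7}"
  # The date stamp sorts lexically, so a reverse sort is newest-first;
  # tail then yields everything after the first KEEP entries.
  ls -1 "$dir" | grep -E "^${prefix}-[0-9]{8}\." | sort -r |
    tail -n "+$((keep + 1))" |
    while IFS= read -r name; do
      rm -f -- "$dir/$name"
    done
}
```

For example, prune_daily_backups /root/backups matrix 7 would leave the seven most recent matrix-*.tar.gz archives and not touch the bridge-*.sql dumps, which need their own call with the bridge prefix.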

Recovery Procedure:

  1. Deploy NixOS configuration
  2. Restore database backups
  3. Restore registration files
  4. Re-authenticate with Slack (new tokens via login app)
  5. Verify message flow

Note: The Matrix database can be wiped and rebuilt from Slack if needed (the current architecture treats Matrix as an ephemeral view layer).


Current Architecture State (2025-10-26)

Deployed Services

┌─────────────────────────────────────────────────────┐
│ clarun.xyz (45.77.205.49)                          │
│                                                     │
│  ┌─────────────────────────────────────────────┐  │
│  │ nginx :443 (HTTPS)                          │  │
│  │  - Matrix Client-Server API                 │  │
│  │  - Forgejo (git.clarun.xyz)                 │  │
│  └────────────┬────────────────────────────────┘  │
│               │                                    │
│               ├─→ conduwuit :8008 (127.0.0.1)     │
│               │   - Matrix homeserver             │
│               │   - RocksDB schema v18            │
│               │   - 66M database                  │
│               │                                    │
│               └─→ Forgejo :3000 (127.0.0.1)       │
│                                                     │
│  ┌─────────────────────────────────────────────┐  │
│  │ mautrix-slack :29319 (127.0.0.1)            │  │
│  │  - Socket Mode WebSocket to Slack           │  │
│  │  - PostgreSQL backend (172K)                │  │
│  │  - ~50 portal rooms                         │  │
│  └────────────┬────────────────────────────────┘  │
│               │                                    │
│               └─→ PostgreSQL :5432 (unix socket)  │
│                                                     │
└─────────────────────────────────────────────────────┘
         │
         └─→ Slack API (Socket Mode WebSocket)
             - Workspace: chochacho
             - Bot token: xoxb-...
             - App token: xapp-...

Critical Networking Details

  • All internal services use IPv4 (127.0.0.1) - NOT "localhost"
  • Reason: localhost can resolve to IPv6 [::1] first, but the services bind IPv4-only
  • Fixed in: nginx proxy_pass, bridge homeserverUrl configuration

Service Dependencies

postgresql.service
  └─→ mautrix-slack.service
        └─→ matrix-continuwuity.service
              └─→ nginx.service

Data Flow

  1. Slack → Matrix:

    • Slack pushes event via Socket Mode WebSocket
    • Bridge receives, transforms to Matrix event
    • Bridge sends the event to conduwuit via the client-server API (authenticated with as_token)
    • conduwuit distributes to Matrix rooms
    • Element clients receive via /sync
  2. Matrix → Slack:

    • Element client sends message via conduwuit
    • conduwuit forwards to bridge appservice endpoint
    • Bridge transforms to Slack API call
    • Bridge POSTs to Slack API (bot token)
    • Appears in Slack channel

Security Model

  • Secrets: Managed via sops-nix, deployed to /run/secrets/
  • Bridge tokens:
    • as_token: Bridge authenticates to Matrix
    • hs_token: Matrix authenticates to bridge
  • Slack tokens:
    • xoxb-: Bot API calls
    • xapp-: Socket Mode connection
  • No public bridge endpoint: Socket Mode eliminates webhook requirement

Operational Notes

  • Matrix database disposable (can rebuild from Slack)
  • Bridge config fully declarative; the sender_localpart fix is automated in modules/mautrix-slack.nix
  • Fresh database recommended after conduwuit version upgrades
  • conduwuit log level is "info" (debug logging reverted 2025-10-26)