Research: Extract Matrix Platform Modules

Date: 2025-10-11
Feature: Extract Matrix Platform Modules as Public Template
Status: Completed

Overview

This document captures the technical decisions and research findings for extracting the Matrix platform modules from ops-base and publishing them as a public template. All decisions are informed by the RFC's multi-model consensus validation (gemini-2.5-pro, gpt-5-codex, qwen3-coder).

Decision 1: Sanitization Strategy

Decision

Hybrid approach: Automated script for bulk replacements + manual validation checklist

Rationale

  1. Safety: Multiple validation layers reduce risk of missed sensitive data
  2. Efficiency: Automated script handles 90% of repetitive replacements
  3. Accuracy: Manual review catches context-specific issues (comments, documentation)
  4. Auditability: Checklist provides proof of thorough review
  5. RFC Consensus: All three models recommended this approach

Implementation

# scripts/sanitize-files.sh
# Automated replacements:
- clarun.xyz → example.com
- talu.uno → matrix.example.org
- 192.168.1.x → 10.0.0.x (RFC 1918)
- 45.77.205.49 → 203.0.113.10 (TEST-NET-3)
- /home/dan → /home/user
- jrz1 → matrix
- @admin:clarun → @admin:example

# Manual validation checklist:
- Grep for known sensitive patterns
- Review all comments for personal context
- Scan git commit messages if preserved
- Check for hardcoded tokens/secrets
- Verify REPLACE_ME comments added
- Run gitleaks for automated secret detection
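
The automated pass and the first checklist item (grep for known sensitive patterns) could be sketched as below. This is a minimal illustration, not the contents of scripts/sanitize-files.sh: the function names sanitize_stream and find_leftovers are hypothetical, and only the seven replacements listed above are covered, not the full rule set.

```bash
# Sketch of the automated replacement pass. The rules mirror the list above;
# sanitize_stream and find_leftovers are hypothetical helper names.

# Apply the documented replacements to text read from stdin.
# Note: the /home/dan rule is a plain prefix match, so a path such as
# /home/daniel would also be rewritten; the manual review should catch this.
sanitize_stream() {
  sed -E \
    -e 's/clarun\.xyz/example.com/g' \
    -e 's/talu\.uno/matrix.example.org/g' \
    -e 's/192\.168\.1\.([0-9]{1,3})/10.0.0.\1/g' \
    -e 's/45\.77\.205\.49/203.0.113.10/g' \
    -e 's#/home/dan#/home/user#g' \
    -e 's/jrz1/matrix/g' \
    -e 's/@admin:clarun/@admin:example/g'
}

# Report any line that still contains a known sensitive pattern.
# Reads stdin, or the files given as arguments. Exit status 0 means
# something was found, which is the failure case for a sanitized file.
find_leftovers() {
  grep -En 'clarun|talu\.uno|192\.168\.1\.|45\.77\.205\.49|/home/dan|jrz1' "$@"
}
```

For example, `sanitize_stream < hosts/ops-jrz1.nix > staging/matrix.nix` followed by `find_leftovers staging/matrix.nix` would print nothing if the pass was complete (the staging path here is illustrative).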

Alternatives Considered

  • git-filter-repo: Rejected - it rewrites existing history, whereas we want a fresh one
  • Fully manual: Rejected - too error-prone for 3,400+ lines
  • Fully automated: Rejected - cannot detect context-specific personal info

References


Decision 2: Worklog Extraction Process

Decision

LLM-assisted selective extraction with manual review and organization

Rationale

  1. Volume: 300KB+ of worklogs too large for purely manual extraction
  2. Quality: LLM can identify and extract architectural patterns effectively
  3. Structure: Human organization ensures docs follow consistent template
  4. Sanitization: Manual review removes personal debugging context
  5. Precedent: Successfully used for RFC creation

Implementation

# Process for each worklog file:
1. Identify extractable content:
   - Architectural decisions (KEEP)
   - Pattern documentation (KEEP)
   - Problem-solving approaches (KEEP, sanitize)
   - Personal debugging sessions (SKIP)
   - Time-stamped logs (SKIP)
   - IP/domain-specific troubleshooting (SANITIZE)

2. Extract with LLM:
   - Input: worklog file + target doc structure
   - Output: Draft documentation section
   - Prompt: "Extract technical patterns, remove personal context"

3. Manual review:
   - Verify accuracy against source code
   - Remove remaining personal references
   - Ensure consistency with other docs
   - Add cross-references and examples

4. Target mapping:
   - Socket Mode pattern → docs/bridges/slack-setup.md
   - Config generation → docs/patterns/config-generation.md
   - Admin room setup → docs/patterns/admin-room-setup.md
   - sops-nix workflow → docs/secrets-management.md

Alternatives Considered

  • Full manual extraction: Rejected - too time-consuming, inconsistent
  • Direct copy-paste: Rejected - contains personal info, lacks structure
  • Automated extraction only: Rejected - loses nuance, creates poor docs

Target Documents (from worklogs)

| Worklog Source | Target Documentation |
| --- | --- |
| mautrix-slack-bridge-implementation-gmessages-pattern.org | docs/patterns/config-generation.md |
| mautrix-slack-socket-mode-oauth-scopes-blocker.org | docs/bridges/slack-setup.md |
| conduwuit-admin-room-discovery-password-reset.org | docs/patterns/admin-room-setup.md |
| sops-nix-secrets-management-rfc.md | docs/secrets-management.md |
| (various debugging logs) | docs/troubleshooting.md (optional) |
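
The source-to-target mapping above can also be kept in a small lookup so the extraction step always writes each draft to the same place. The sketch below is one possible shape; target_doc_for_worklog is a hypothetical helper name. Unmapped files deliberately return an error, because the debugging logs are only optional candidates for docs/troubleshooting.md and need a human decision.

```bash
# Look up the target document for a worklog file (sketch; the function
# name is hypothetical). Accepts a bare filename or a path.
target_doc_for_worklog() {
  case "${1##*/}" in
    mautrix-slack-bridge-implementation-gmessages-pattern.org)
      echo "docs/patterns/config-generation.md" ;;
    mautrix-slack-socket-mode-oauth-scopes-blocker.org)
      echo "docs/bridges/slack-setup.md" ;;
    conduwuit-admin-room-discovery-password-reset.org)
      echo "docs/patterns/admin-room-setup.md" ;;
    sops-nix-secrets-management-rfc.md)
      echo "docs/secrets-management.md" ;;
    *)
      # Not in the table: review manually before extracting anything.
      echo "no mapped target for: $1" >&2
      return 1 ;;
  esac
}
```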

Decision 3: CI/CD Implementation

Decision

GitHub Actions with nix flake check + gitleaks, no Cachix for v1.0

Rationale

  1. Simplicity: GitHub Actions native to platform, no additional services
  2. Security: gitleaks catches secrets in every PR/commit
  3. Validation: nix flake check ensures all configs build
  4. Cost: Free for public repos, no Cachix subscription needed
  5. RFC Alignment: Matches RFC automated validation requirements

Implementation

# .github/workflows/ci.yml
name: CI

on: [push, pull_request]

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: cachix/install-nix-action@v25
      - name: Run nix flake check
        run: nix flake check --all-systems
      - name: Build example configurations
        run: |
          nix build .#nixosConfigurations.example-vps.config.system.build.toplevel
          nix build .#nixosConfigurations.example-dev.config.system.build.toplevel

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Run gitleaks
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
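
To get the same feedback before pushing, the workflow's commands can be run locally in the same order. The following is a sketch under assumptions: run_ci_locally and ci_steps are hypothetical names, a plain `gitleaks detect` stands in for the gitleaks GitHub Action, and DRY_RUN=1 only prints the steps.

```bash
# Print the commands the CI workflow runs, one per line.
ci_steps() {
  cat <<'EOF'
nix flake check --all-systems
nix build .#nixosConfigurations.example-vps.config.system.build.toplevel
nix build .#nixosConfigurations.example-dev.config.system.build.toplevel
gitleaks detect
EOF
}

# Run each step in order and stop at the first failure.
# With DRY_RUN=1 the steps are printed but not executed.
run_ci_locally() {
  local step
  while IFS= read -r step; do
    echo "+ $step"
    if [ "${DRY_RUN:-0}" != 1 ]; then
      # stdin is redirected so a step cannot consume the remaining step list.
      bash -c "$step" </dev/null || { echo "FAILED: $step" >&2; return 1; }
    fi
  done <<EOF
$(ci_steps)
EOF
}
```

Calling `DRY_RUN=1 run_ci_locally` lists the four steps; calling `run_ci_locally` from the repository root executes them.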

Alternatives Considered

  • Cachix integration: Deferred to v1.1 - adds complexity, build times acceptable without it
  • GitLab CI: Rejected - GitHub is primary platform for NixOS community
  • Local-only validation: Rejected - PR validation critical for community contributions

Performance Targets

  • Total CI time: <5 minutes per commit
  • nix flake check: <2 minutes
  • gitleaks scan: <30 seconds
  • Build examples: <3 minutes
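
These targets can be checked mechanically once step durations are known. The sketch below encodes the four limits in seconds; the step keys (flake-check, gitleaks, build-examples, total) and the function names are hypothetical labels, not names used by the workflow.

```bash
# Time budget in seconds for each CI step (values from the targets above).
budget_seconds() {
  case "$1" in
    flake-check)    echo 120 ;;
    gitleaks)       echo 30 ;;
    build-examples) echo 180 ;;
    total)          echo 300 ;;
    *)              return 1 ;;
  esac
}

# Compare a measured duration against its budget.
# Usage: within_budget <step> <elapsed-seconds>
# The targets are strict upper bounds, so elapsed == limit counts as slow.
within_budget() {
  local limit
  limit=$(budget_seconds "$1") || { echo "unknown step: $1" >&2; return 2; }
  if [ "$2" -lt "$limit" ]; then
    echo "OK   $1: ${2}s < ${limit}s"
  else
    echo "SLOW $1: ${2}s >= ${limit}s"
    return 1
  fi
}
```

Because the validate and security jobs run in parallel, the flake check and example builds together determine whether the 5-minute total is met.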

Decision 4: Sync Workflow Design

Decision

Git tags + sync-log.md file with quarterly calendar reminders

Rationale

  1. Traceability: Git tags mark template versions synced with ops-base state
  2. Documentation: sync-log.md records what changed and why
  3. Simplicity: No complex tooling, standard git workflow
  4. Discoverability: sync-log.md visible in repo for transparency
  5. RFC Consensus: Documented workflow reduces sync discipline risk

Implementation

# scripts/sync-to-template.sh workflow:
1. Identify changes in ops-base since last sync:
   git log --since="$(git -C ../nixos-matrix-platform-template tag -l 'sync-*' --sort=-v:refname | head -1 | cut -d'-' -f2)"

2. Review changes for applicability:
   - Bug fixes: SYNC
   - New features: SYNC if tested
   - Security fixes: SYNC (priority)
   - Personal config: SKIP
   - Worklogs: SKIP

3. Apply sanitization to selected changes

4. Validate in template:
   nix flake check
   gitleaks detect --no-git

5. Update sync-log.md:
   ## Sync 2025-10-11 (ops-base: abc123)
   - [BUGFIX] Matrix registration token validation
   - [FEATURE] WhatsApp bridge reconnection logic
   - [SECURITY] Updated sops-nix to v0.16.0

6. Commit with tag:
   git commit -m "Sync improvements from ops-base (2025-10-11)"
   git tag "sync-20251011-abc123"
   git push && git push --tags

# sync-log.md format:
# Sync Log: ops-base → nixos-matrix-platform-template

## Sync 2025-10-11 (ops-base: abc123)
**Changes**:
- [BUGFIX] Matrix registration token validation
- [FEATURE] WhatsApp bridge reconnection
- [SECURITY] sops-nix v0.16.0 upgrade

**Skipped**:
- Personal config changes in comm-talu-uno.nix
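
The tag format used in step 6 (sync-YYYYMMDD-commit) carries both the sync date and the ops-base commit, so it can be built and parsed with a few helpers. This is a sketch, not part of scripts/sync-to-template.sh; the function names are hypothetical, and the commit part is assumed to be a lowercase hexadecimal abbreviation.

```bash
# Build a sync tag from an ISO date and an ops-base commit.
# Usage: make_sync_tag 2025-10-11 abc123
make_sync_tag() {
  printf 'sync-%s-%s\n' "$(printf '%s' "$1" | tr -d '-')" "$2"
}

# Split a sync tag into "<yyyymmdd> <commit>"; fails on any other tag.
parse_sync_tag() {
  local parsed
  parsed=$(printf '%s\n' "$1" | sed -nE 's/^sync-([0-9]{8})-([0-9a-f]+)$/\1 \2/p')
  [ -n "$parsed" ] || return 1
  printf '%s\n' "$parsed"
}

# Pick the most recent sync tag from a list of tags on stdin.
# The YYYYMMDD date sorts correctly as plain text.
latest_sync_tag() {
  grep -E '^sync-[0-9]{8}-' | sort -r | head -n 1
}
```

One way to use these is `git tag -l | latest_sync_tag` in the template repository, then passing the commit field of `parse_sync_tag` to `git log <commit>..HEAD` in ops-base to list changes since the last sync.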

Alternatives Considered

  • Automated sync via git subtree: Rejected - sanitization can't be fully automated (see Decision 1)
  • Manual documentation only: Rejected - easy to forget, no traceability
  • Separate tracking tool: Rejected - over-engineered for quarterly cadence

Quarterly Sync Schedule

  • Q1 (January): Major sync after holiday break
  • Q2 (April): Feature updates and spring cleaning
  • Q3 (July): Mid-year maintenance
  • Q4 (October): Pre-holiday stability sync
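
For a reminder script, the schedule reduces to finding the first sync month (January, April, July, October) on or after the current month. A minimal sketch, with next_sync_month as a hypothetical helper name:

```bash
# Print the next scheduled sync month (1, 4, 7 or 10) on or after the
# given month number. Months after October wrap to January of next year.
# Usage: next_sync_month 8    (also accepts zero-padded input such as 08)
next_sync_month() {
  local m next
  m="${1#0}"                                  # drop one leading zero
  case "$m" in ''|*[!0-9]*) return 1 ;; esac  # digits only
  { [ "$m" -ge 1 ] && [ "$m" -le 12 ]; } || return 1
  next=$(( (m + 1) / 3 * 3 + 1 ))             # 1->1, 2..4->4, 5..7->7, 8..10->10
  if [ "$next" -gt 12 ]; then next=1; fi      # 11 and 12 wrap to January
  echo "$next"
}
```

For instance, `next_sync_month "$(date +%m)"` gives the month in which the next sync is due.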

Decision 5: Testing Strategy

Decision

Build validation + selective VPS integration testing (Phase 3 of RFC)

Rationale

  1. Cost-effective: Full VPS test only at major milestones
  2. Fast feedback: nix flake check catches most issues (an estimated 90%) within the 2-minute CI target, without a deployment
  3. Real validation: At least one VPS deployment before v1.0 publication
  4. Community testing: Beta testers provide diverse environment testing
  5. Risk management: Balances thoroughness with time/cost

Implementation

# Testing levels:
1. Every commit (CI):
   - nix flake check (all configs)
   - gitleaks scan
   - Build example-vps.nix
   - Build example-dev.nix

2. Before PR merge:
   - All CI checks pass
   - Manual review of sanitization
   - Documentation accuracy check

3. Before v1.0 publication (Phase 3):
   - Deploy example-vps.nix to fresh Vultr VPS
   - Deploy example-dev.nix to fresh Vultr VPS
   - Test all user stories end-to-end
   - Verify Matrix server responds
   - Test bridge setup guides
   - Validate secrets management workflow
   - Community beta testing (3-5 testers)

4. Post-publication:
   - Monitor GitHub issues for deployment problems
   - Track success metrics (SC-007, SC-008)

VPS Test Checklist (Phase 3)

  • Fresh NixOS VPS provisioned
  • Clone template repository
  • Follow getting-started.md (time it - should be <30 min)
  • Customize example-vps.nix with test domain
  • Deploy with nixos-rebuild
  • Verify Matrix API responds (curl /_matrix/client/versions)
  • Create test user with registration token
  • Test Element Web login
  • Follow slack-setup.md (if testing bridges)
  • Verify CI runs on sample PR
  • Document any issues or unclear docs
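
The "Verify Matrix API responds" item can be scripted so the result is recorded rather than eyeballed. The sketch below assumes the homeserver is reachable over HTTPS on the given domain; the function names are hypothetical. It relies only on the fact that the client versions endpoint returns a JSON object with a "versions" array.

```bash
# Build the client versions URL for a homeserver domain.
versions_url() {
  printf 'https://%s/_matrix/client/versions\n' "$1"
}

# Succeed if the JSON on stdin contains a "versions" array.
has_versions() {
  grep -Eq '"versions"[[:space:]]*:[[:space:]]*\['
}

# Query the homeserver and check the response shape.
# Usage: check_matrix_api matrix.example.org
check_matrix_api() {
  local url
  url=$(versions_url "$1")
  curl --fail --silent --show-error --max-time 10 "$url" | has_versions
}
```

A call such as `check_matrix_api matrix.example.org && echo "Matrix API OK"` fits directly into the checklist run.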

Alternatives Considered

  • No integration testing: Rejected - too risky for v1.0 publication
  • Automated VPS tests: Deferred to v1.1 - complex setup, manual adequate for v1.0
  • Continuous VPS testing: Rejected - expensive, unnecessary for template

Decision 6: Repository Initialization Approach

Decision

Manual repository creation on GitHub with direct file commits (not git-filter-repo)

Rationale

  1. Clean history: Fresh git init ensures no ops-base history
  2. Safety: Manual process allows review at each step
  3. Simplicity: Direct commits easier than git-filter-repo for one-time extraction
  4. Transparency: Clear commit history shows sanitization process

Implementation

# Repository creation workflow:
1. Create empty GitHub repo: nixos-matrix-platform-template
2. Clone to local machine
3. Copy sanitized files from staging directory
4. Review each file before adding
5. Create initial commit structure:
   - Commit 1: Add README, LICENSE, .gitignore
   - Commit 2: Add modules/
   - Commit 3: Add configurations/
   - Commit 4: Add docs/
   - Commit 5: Add .github/workflows/
   - Commit 6: Add examples/ and scripts/
6. Run final validation
7. Push to GitHub
8. Enable GitHub Discussions
9. Add repository description and tags
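
Since each file is reviewed before it is added, it may help to print the six commits from step 5 as git commands and run them by hand. The sketch below only prints; it never invokes git. The function names are hypothetical, and the paths are taken from the commit list above, so they may need adjusting to the actual filenames.

```bash
# The initial commit structure as "message|paths" records.
commit_plan() {
  cat <<'EOF'
Add README, LICENSE, .gitignore|README LICENSE .gitignore
Add modules|modules
Add configurations|configurations
Add docs|docs
Add CI workflows|.github/workflows
Add examples and scripts|examples scripts
EOF
}

# Print the git commands for each planned commit, for manual review.
print_commit_commands() {
  local message paths
  while IFS='|' read -r message paths; do
    printf 'git add -- %s\n' "$paths"
    printf 'git commit -m "%s"\n' "$message"
  done <<EOF
$(commit_plan)
EOF
}
```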

No git-filter-repo

  • Not needed - we're creating fresh repo, not rewriting history
  • ops-base history stays in ops-base
  • Template has clean, purposeful commit history

Technology Stack Summary

Core Technologies

  • Nix/NixOS: 24.05+ (pinned via flake.lock)
  • nixpkgs: Pinned to tested commit for reproducibility
  • sops-nix: v0.15.0+ (secrets management)
  • age: Latest (encryption backend for sops-nix)

Validation & Security

  • gitleaks: v8.18.0+ (secret scanning)
  • nix flake check: Built-in Nix validation

CI/CD

  • GitHub Actions: Native CI/CD platform
  • cachix/install-nix-action@v25: Nix installation for CI
  • gitleaks/gitleaks-action@v2: Secret scanning action

Development Tools

  • Bash: 5.x (sanitization/sync scripts)
  • git: 2.x (version control, tagging)
  • SSH: For VPS deployment testing

Services (Matrix Platform - in template)

  • matrix-continuwuity: Matrix homeserver (from nixpkgs-unstable)
  • mautrix-slack: Slack bridge
  • mautrix-whatsapp: WhatsApp bridge
  • mautrix-gmessages: Google Messages bridge
  • forgejo: Git service
  • nginx: Reverse proxy
  • postgresql: Database for bridges/Forgejo
  • fail2ban: Intrusion prevention
  • sops: Secrets encryption CLI
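
Several entries in this summary are minimum versions (for example gitleaks v8.18.0+). A script can enforce such a floor with a version-aware comparison. The sketch below assumes a `sort` that supports `-V` (GNU coreutils does); version_at_least is a hypothetical helper name.

```bash
# Succeed if version <have> is greater than or equal to <want>.
# A leading "v" is ignored on either argument.
# Usage: version_at_least <have> <want>
version_at_least() {
  local have want lowest
  have="${1#v}"
  want="${2#v}"
  # The smaller of the two versions sorts first under sort -V.
  lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
  [ "$lowest" = "$want" ]
}
```

A plain string comparison would rank 8.9.1 above 8.18.0, which is why the version sort is used here.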

Best Practices Applied

From NixOS Community

  1. Pinned dependencies: Use flake.lock for reproducibility
  2. Module options: Provide configurable options with sensible defaults
  3. Security hardening: Apply systemd security features
  4. Documentation: Comprehensive examples and guides

From Infrastructure-as-Code

  1. Immutability: Fresh git history, no rewrites
  2. Validation: Multiple layers (syntax, build, secrets, integration)
  3. Idempotency: All scripts can be run multiple times safely
  4. Auditability: Clear commit messages, sync logs, checklists

From Open Source Projects

  1. Governance files: CONTRIBUTING.md, SECURITY.md, CODE_OF_CONDUCT
  2. Issue templates: Structured bug reports and feature requests
  3. CI/CD: Automated checks on every PR
  4. Semantic versioning: v1.0.0 for initial stable release

Risk Mitigations

Secret Leakage (Critical Risk)

  • Mitigation 1: Automated gitleaks scan (CI + local)
  • Mitigation 2: Manual review checklist
  • Mitigation 3: Fresh git history (no ops-base commits)
  • Mitigation 4: Community beta review before publication
  • Residual Risk: Low (multi-layer validation)

Template Divergence (Medium Risk)

  • Mitigation 1: Documented sync workflow (scripts/sync-to-template.sh)
  • Mitigation 2: Quarterly calendar reminders
  • Mitigation 3: Git tags + sync-log.md for tracking
  • Mitigation 4: RFC-validated process
  • Residual Risk: Medium (requires discipline)

Breaking Dependencies (Medium Risk)

  • Mitigation 1: Pinned nixpkgs version
  • Mitigation 2: CI tests on every commit
  • Mitigation 3: Integration testing before major releases
  • Mitigation 4: Version compatibility matrix in README
  • Residual Risk: Low (pinning prevents surprises)

Poor Documentation (Medium Risk)

  • Mitigation 1: Extract from 300KB+ tested worklogs
  • Mitigation 2: Community beta testing for clarity
  • Mitigation 3: User story acceptance criteria
  • Mitigation 4: Quick start guide (5-minute target)
  • Residual Risk: Low (comprehensive extraction + validation)

Success Metrics Tracking

How to Measure (post-publication)

  • SC-001 (30 min deployment): Time beta testers during Phase 3
  • SC-002 (builds pass): CI badge in README, track failures
  • SC-003 (zero secrets): gitleaks CI status, manual audits
  • SC-004 (8 modules extracted): Checklist of modules present and building - matrix-continuwuity, mautrix-slack, mautrix-whatsapp, mautrix-gmessages, dev-services, fail2ban, ssh-hardening, matrix-secrets
  • SC-005 (documentation complete): Checklist of required docs present - getting-started.md, architecture.md, secrets-management.md, slack-setup.md, whatsapp-setup.md, gmessages-setup.md, config-generation.md, admin-room-setup.md
  • SC-006 (CI runs): GitHub Actions badge, monitor runs
  • SC-007 (10 stars in 3 months): GitHub stars count
  • SC-008 (3 issues/PRs): GitHub insights
  • SC-009 (zero incidents): Monitor issues, no secret reports
  • SC-010 (quarterly sync): Track sync-log.md entries
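
The checklist-style metrics (SC-004 and SC-005) amount to confirming that a known set of files exists. A small helper can report what is missing. This is a sketch: missing_files is a hypothetical name, and the caller supplies the paths, since the document does not give the full location of every module and doc.

```bash
# Report which of the given relative paths are missing under a root
# directory. Exit status is 0 only if every path exists.
# Usage: missing_files <root> <path>...
missing_files() {
  local root path status
  root="$1"
  status=0
  shift
  for path in "$@"; do
    if [ ! -e "$root/$path" ]; then
      echo "MISSING $path"
      status=1
    fi
  done
  return "$status"
}
```

For SC-005 this could be called as `missing_files . docs/secrets-management.md docs/bridges/slack-setup.md docs/patterns/config-generation.md docs/patterns/admin-room-setup.md`, extended with the remaining required docs at their actual paths.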

Open Questions Resolved

All open questions from the spec were marked "None - resolved in RFC". Additional implementation questions are resolved here:

  1. Q: Should we use pre-commit hooks?
    • A: Yes, but optional for users. Include a .pre-commit-config.yaml example.
  2. Q: What NixOS version to target?
    • A: 24.05 or later; test on both the current stable release and unstable.
  3. Q: Should we include Cachix in CI?
    • A: Not for v1.0 (added complexity); consider for v1.1 if builds are slow.
  4. Q: How to handle user questions/support?
    • A: GitHub Discussions for Q&A, Issues for bugs only (per CONTRIBUTING.md).
  5. Q: Should we create a Matrix room for support?
    • A: Yes, mentioned in README (#nixos-matrix-template:matrix.org) - dogfooding.

Next Steps

  1. Research completed - all decisions documented
  2. → Proceed to Phase 1: Design & Contracts
    • Create data-model.md (entities, relationships)
    • Create contracts/ (sanitization rules, CI contracts)
    • Create quickstart.md (developer onboarding)
    • Update agent context (.specify/memory/AGENT_FILE.md)