# Research: Extract Matrix Platform Modules
**Date**: 2025-10-11
**Feature**: Extract Matrix Platform Modules as Public Template
**Status**: Completed
## Overview
This document captures the technical decisions and research findings for extracting the Matrix platform modules from ops-base and publishing them as a public template. All decisions are informed by the RFC's multi-model consensus validation (gemini-2.5-pro, gpt-5-codex, qwen3-coder).
## Decision 1: Sanitization Strategy
### Decision
**Hybrid approach**: Automated script for bulk replacements + manual validation checklist
### Rationale
1. **Safety**: Multiple validation layers reduce risk of missed sensitive data
2. **Efficiency**: Automated script handles an estimated 90% of the repetitive replacements
3. **Accuracy**: Manual review catches context-specific issues (comments, documentation)
4. **Auditability**: Checklist provides proof of thorough review
5. **RFC Consensus**: All three models recommended this approach
### Implementation
```bash
# scripts/sanitize-files.sh
#
# Automated replacements:
#   clarun.xyz     → example.com
#   talu.uno       → matrix.example.org
#   192.168.1.x    → 10.0.0.x (RFC 1918)
#   45.77.205.49   → 203.0.113.10 (TEST-NET-3)
#   /home/dan      → /home/user
#   jrz1           → matrix
#   @admin:clarun  → @admin:example
#
# Manual validation checklist:
#   - Grep for known sensitive patterns
#   - Review all comments for personal context
#   - Scan git commit messages if preserved
#   - Check for hardcoded tokens/secrets
#   - Verify REPLACE_ME comments added
#   - Run gitleaks for automated secret detection
```
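
As a rough illustration of the automated half of this approach, the replacement list can be written as a stream filter plus a leftover scan. This is a minimal sketch, not the contents of scripts/sanitize-files.sh: the function names `sanitize_stream` and `find_leftovers` are hypothetical, and the patterns are plain substring matches, so any hit still needs the manual review from the checklist.

```bash
# Sketch only: the expressions mirror the replacement list above.
# Function names are hypothetical, not taken from the real script.

# Apply the bulk replacements to text on stdin; write the result to stdout.
sanitize_stream() {
  sed -E \
    -e 's/clarun\.xyz/example.com/g' \
    -e 's/talu\.uno/matrix.example.org/g' \
    -e 's/192\.168\.1\.([0-9]{1,3})/10.0.0.\1/g' \
    -e 's/45\.77\.205\.49/203.0.113.10/g' \
    -e 's#/home/dan#/home/user#g' \
    -e 's/@admin:clarun/@admin:example/g' \
    -e 's/jrz1/matrix/g'
}

# Print every input line that still contains a known sensitive pattern.
# Follows grep semantics: exit 0 if leftovers were found, 1 if the input is clean.
find_leftovers() {
  grep -nE 'clarun|talu\.uno|192\.168\.1\.|45\.77\.205\.49|/home/dan|jrz1'
}
```

A file could then be processed with `sanitize_stream < input.nix > output.nix`, followed by `find_leftovers < output.nix` to confirm nothing slipped through. The domain rule runs before the `@admin:clarun` rule, so a full user ID such as `@admin:clarun.xyz` ends up as `@admin:example.com`.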
### Alternatives Considered
- **git-filter-repo**: Rejected - it rewrites the existing history, whereas we want a fresh history
- **Fully manual**: Rejected - too error-prone for 3,400+ lines
- **Fully automated**: Rejected - can't detect context-specific personal info
### References
- RFC Section: "Enhanced Sanitization Process" (lines 305-375)
- gitleaks documentation: https://github.com/gitleaks/gitleaks
- NixOS security best practices
---
## Decision 2: Worklog Extraction Process
### Decision
**LLM-assisted selective extraction** with manual review and organization
### Rationale
1. **Volume**: 300KB+ of worklogs too large for purely manual extraction
2. **Quality**: LLM can identify and extract architectural patterns effectively
3. **Structure**: Human organization ensures docs follow consistent template
4. **Sanitization**: Manual review removes personal debugging context
5. **Precedent**: Successfully used for RFC creation
### Implementation
```
# Process for each worklog file:
1. Identify extractable content:
   - Architectural decisions (KEEP)
   - Pattern documentation (KEEP)
   - Problem-solving approaches (KEEP, sanitize)
   - Personal debugging sessions (SKIP)
   - Time-stamped logs (SKIP)
   - IP/domain-specific troubleshooting (SANITIZE)
2. Extract with LLM:
   - Input: worklog file + target doc structure
   - Output: Draft documentation section
   - Prompt: "Extract technical patterns, remove personal context"
3. Manual review:
   - Verify accuracy against source code
   - Remove remaining personal references
   - Ensure consistency with other docs
   - Add cross-references and examples
4. Target mapping:
   - Socket Mode pattern → docs/bridges/slack-setup.md
   - Config generation → docs/patterns/config-generation.md
   - Admin room setup → docs/patterns/admin-room-setup.md
   - sops-nix workflow → docs/secrets-management.md
```
### Alternatives Considered
- **Full manual extraction**: Rejected - too time-consuming, inconsistent
- **Direct copy-paste**: Rejected - contains personal info, lacks structure
- **Automated extraction only**: Rejected - loses nuance, creates poor docs
### Target Documents (from worklogs)
| Worklog Source | Target Documentation |
|----------------|---------------------|
| mautrix-slack-bridge-implementation-gmessages-pattern.org | docs/patterns/config-generation.md |
| mautrix-slack-socket-mode-oauth-scopes-blocker.org | docs/bridges/slack-setup.md |
| conduwuit-admin-room-discovery-password-reset.org | docs/patterns/admin-room-setup.md |
| sops-nix-secrets-management-rfc.md | docs/secrets-management.md |
| (various debugging logs) | docs/troubleshooting.md (optional) |
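
The mapping in the table above could be encoded as a small lookup so that extraction tooling resolves the target document consistently. This is a sketch under assumptions: `target_doc_for` is a hypothetical helper, and sending every unlisted worklog to docs/troubleshooting.md is one reading of the table's last row, not a confirmed rule.

```bash
# Sketch: filenames are copied from the table above; the helper name is hypothetical.
# Print the target documentation path for a worklog file (path or bare filename).
target_doc_for() {
  case "$(basename "$1")" in
    mautrix-slack-bridge-implementation-gmessages-pattern.org)
      echo "docs/patterns/config-generation.md" ;;
    mautrix-slack-socket-mode-oauth-scopes-blocker.org)
      echo "docs/bridges/slack-setup.md" ;;
    conduwuit-admin-room-discovery-password-reset.org)
      echo "docs/patterns/admin-room-setup.md" ;;
    sops-nix-secrets-management-rfc.md)
      echo "docs/secrets-management.md" ;;
    *)
      # Assumption: unlisted worklogs go to the optional troubleshooting doc.
      echo "docs/troubleshooting.md" ;;
  esac
}
```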
---
## Decision 3: CI/CD Implementation
### Decision
**GitHub Actions with nix flake check + gitleaks**, no Cachix for v1.0
### Rationale
1. **Simplicity**: GitHub Actions native to platform, no additional services
2. **Security**: gitleaks catches secrets in every PR/commit
3. **Validation**: nix flake check ensures all configs build
4. **Cost**: Free for public repos, no Cachix subscription needed
5. **RFC Alignment**: Matches RFC automated validation requirements
### Implementation
```yaml
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: cachix/install-nix-action@v25
      - name: Run nix flake check
        run: nix flake check --all-systems
      - name: Build example configurations
        run: |
          nix build .#nixosConfigurations.example-vps.config.system.build.toplevel
          nix build .#nixosConfigurations.example-dev.config.system.build.toplevel
  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Run gitleaks
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
### Alternatives Considered
- **Cachix integration**: Deferred to v1.1 - adds complexity, build times acceptable without it
- **GitLab CI**: Rejected - GitHub is primary platform for NixOS community
- **Local-only validation**: Rejected - PR validation critical for community contributions
### Performance Targets
- Total CI time: <5 minutes per commit
- nix flake check: <2 minutes
- gitleaks scan: <30 seconds
- Build examples: <3 minutes
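
To make these targets checkable when running the same steps locally, each command can be wrapped in a timing guard. The following is a minimal sketch: `run_with_budget` is a hypothetical helper, budgets are whole seconds derived from the targets above, and the guard only reports an overrun after the command finishes (it does not kill a slow command).

```bash
# Run a command and fail if it errors or exceeds its time budget.
# Usage: run_with_budget <label> <budget_seconds> <command> [args...]
run_with_budget() {
  local label="$1" budget="$2"
  shift 2
  local start elapsed
  start="$(date +%s)"
  if ! "$@"; then
    echo "FAIL ${label}: command exited non-zero" >&2
    return 1
  fi
  elapsed=$(( $(date +%s) - start ))
  if [ "$elapsed" -gt "$budget" ]; then
    echo "FAIL ${label}: ${elapsed}s exceeds ${budget}s budget" >&2
    return 1
  fi
  echo "OK ${label}: ${elapsed}s (budget ${budget}s)"
}

# Example invocations matching the targets above (not executed here):
#   run_with_budget "nix flake check" 120 nix flake check --all-systems
#   run_with_budget "gitleaks scan"    30 gitleaks detect --no-git
```

Because the `validate` and `security` jobs run in parallel, the 5-minute total is governed by the flake check plus the example builds, not by the sum of all three steps.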
---
## Decision 4: Sync Workflow Design
### Decision
**Git tags + sync-log.md file** with quarterly calendar reminders
### Rationale
1. **Traceability**: Git tags mark template versions synced with ops-base state
2. **Documentation**: sync-log.md records what changed and why
3. **Simplicity**: No complex tooling, standard git workflow
4. **Discoverability**: sync-log.md visible in repo for transparency
5. **RFC Consensus**: Documented workflow reduces sync discipline risk
### Implementation
```bash
# scripts/sync-to-template.sh workflow:

# 1. Identify changes in ops-base since last sync:
git log --since="$(git -C ../nixos-matrix-platform-template tag -l 'sync-*' --sort=-v:refname | head -1 | cut -d'-' -f2)"

# 2. Review changes for applicability:
#    - Bug fixes: SYNC
#    - New features: SYNC if tested
#    - Security fixes: SYNC (priority)
#    - Personal config: SKIP
#    - Worklogs: SKIP

# 3. Apply sanitization to selected changes

# 4. Validate in template:
nix flake check
gitleaks detect --no-git

# 5. Update sync-log.md:
#    ## Sync 2025-10-11 (from ops-base commit abc123)
#    - Fixed: Matrix registration token validation
#    - Added: WhatsApp bridge reconnection logic
#    - Security: Updated sops-nix to v0.16.0

# 6. Commit with tag (push the branch first; --tags alone pushes only the tags):
git commit -m "Sync improvements from ops-base (2025-10-11)"
git tag "sync-20251011-abc123"
git push && git push --tags

# sync-log.md format:
#   # Sync Log: ops-base → nixos-matrix-platform-template
#   ## Sync 2025-10-11 (ops-base: abc123)
#   **Changes**:
#   - [BUGFIX] Matrix registration token validation
#   - [FEATURE] WhatsApp bridge reconnection
#   - [SECURITY] sops-nix v0.16.0 upgrade
#   **Skipped**:
#   - Personal config changes in comm-talu-uno.nix
```
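
The tag convention carries the sync date, so the "since last sync" lookup can be factored into small helpers. This is a sketch under assumptions: the helper names are hypothetical, and the tag format `sync-YYYYMMDD-<commit>` is inferred from the single example tag shown in the workflow.

```bash
# Build a tag name from an ISO date (YYYY-MM-DD) and an ops-base commit id.
make_sync_tag() {
  local iso_date="$1" commit="$2"
  echo "sync-$(printf '%s' "$iso_date" | tr -d '-')-${commit}"
}

# Print the ISO date encoded in a sync tag; fail if the tag is malformed.
sync_tag_date() {
  local iso
  iso="$(printf '%s\n' "$1" |
    sed -nE 's/^sync-([0-9]{4})([0-9]{2})([0-9]{2})-[0-9a-f]+$/\1-\2-\3/p')"
  if [ -z "$iso" ]; then
    echo "not a sync tag: $1" >&2
    return 1
  fi
  echo "$iso"
}

# Pick the most recent sync tag from a newline-separated list on stdin.
# Dates are zero-padded, so a reverse lexical sort is also chronological.
latest_sync_tag() {
  grep -E '^sync-[0-9]{8}-' | sort -r | head -n 1
}
```

With these, the lookup might read `git log --since="$(sync_tag_date "$(git tag -l 'sync-*' | latest_sync_tag)")"`, which hands git an unambiguous ISO date instead of a bare eight-digit string.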
### Alternatives Considered
- **Automated sync via git subtree**: Rejected - sanitization can't be automated
- **Manual documentation only**: Rejected - easy to forget, no traceability
- **Separate tracking tool**: Rejected - over-engineered for quarterly cadence
### Quarterly Sync Schedule
- Q1 (January): Major sync after holiday break
- Q2 (April): Feature updates and spring cleaning
- Q3 (July): Mid-year maintenance
- Q4 (October): Pre-holiday stability sync
---
## Decision 5: Testing Strategy
### Decision
**Build validation + selective VPS integration testing** (Phase 3 of RFC)
### Rationale
1. **Cost-effective**: Full VPS test only at major milestones
2. **Fast feedback**: nix flake check catches an estimated 90% of issues within minutes
3. **Real validation**: At least one VPS deployment before v1.0 publication
4. **Community testing**: Beta testers provide diverse environment testing
5. **Risk management**: Balances thoroughness with time/cost
### Implementation
```
# Testing levels:
1. Every commit (CI):
   - nix flake check (all configs)
   - gitleaks scan
   - Build example-vps.nix
   - Build example-dev.nix
2. Before PR merge:
   - All CI checks pass
   - Manual review of sanitization
   - Documentation accuracy check
3. Before v1.0 publication (Phase 3):
   - Deploy example-vps.nix to fresh Vultr VPS
   - Deploy example-dev.nix to fresh Vultr VPS
   - Test all user stories end-to-end
   - Verify Matrix server responds
   - Test bridge setup guides
   - Validate secrets management workflow
   - Community beta testing (3-5 testers)
4. Post-publication:
   - Monitor GitHub issues for deployment problems
   - Track success metrics (SC-007, SC-008)
```
### VPS Test Checklist (Phase 3)
- [ ] Fresh NixOS VPS provisioned
- [ ] Clone template repository
- [ ] Follow getting-started.md (time it - should be <30 min)
- [ ] Customize example-vps.nix with test domain
- [ ] Deploy with nixos-rebuild
- [ ] Verify Matrix API responds (curl /_matrix/client/versions)
- [ ] Create test user with registration token
- [ ] Test Element Web login
- [ ] Follow slack-setup.md (if testing bridges)
- [ ] Verify CI runs on sample PR
- [ ] Document any issues or unclear docs
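
The "Verify Matrix API responds" item can be scripted so the check is repeatable across test deployments. A minimal sketch, assuming `curl` is available on the tester's machine: both function names are hypothetical, and the response check is a simple pattern match for the `versions` array that this endpoint returns, not a full JSON parse.

```bash
# Succeed if the JSON on stdin contains a "versions" array, as returned by
# GET /_matrix/client/versions on a working homeserver.
looks_like_versions_response() {
  grep -Eq '"versions"[[:space:]]*:[[:space:]]*\['
}

# Query a homeserver. The caller supplies the base URL,
# for example https://matrix.example.org (placeholder domain).
check_matrix_api() {
  local base_url="$1"
  curl --fail --silent --show-error --max-time 10 \
    "${base_url%/}/_matrix/client/versions" | looks_like_versions_response
}
```

Running `check_matrix_api https://matrix.example.org && echo "Matrix API OK"` after `nixos-rebuild` gives a pass/fail result that can be recorded in the checklist.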
### Alternatives Considered
- **No integration testing**: Rejected - too risky for v1.0 publication
- **Automated VPS tests**: Deferred to v1.1 - complex setup, manual adequate for v1.0
- **Continuous VPS testing**: Rejected - expensive, unnecessary for template
---
## Decision 6: Repository Initialization Approach
### Decision
**Manual repository creation** on GitHub with direct file commits (not git-filter-repo)
### Rationale
1. **Clean history**: Fresh git init ensures no ops-base history
2. **Safety**: Manual process allows review at each step
3. **Simplicity**: Direct commits easier than git-filter-repo for one-time extraction
4. **Transparency**: Clear commit history shows sanitization process
### Implementation
```bash
# Repository creation workflow:
# 1. Create empty GitHub repo: nixos-matrix-platform-template
# 2. Clone to local machine
# 3. Copy sanitized files from staging directory
# 4. Review each file before adding
# 5. Create initial commit structure:
#    - Commit 1: Add README, LICENSE, .gitignore
#    - Commit 2: Add modules/
#    - Commit 3: Add configurations/
#    - Commit 4: Add docs/
#    - Commit 5: Add .github/workflows/
#    - Commit 6: Add examples/ and scripts/
# 6. Run final validation
# 7. Push to GitHub
# 8. Enable GitHub Discussions
# 9. Add repository description and tags
```
### No git-filter-repo
- Not needed - we're creating fresh repo, not rewriting history
- ops-base history stays in ops-base
- Template has clean, purposeful commit history
---
## Technology Stack Summary
### Core Technologies
- **Nix/NixOS**: 24.05+ (pinned via flake.lock)
- **nixpkgs**: Pinned to tested commit for reproducibility
- **sops-nix**: v0.15.0+ (secrets management)
- **age**: Latest (encryption backend for sops-nix)
### Validation & Security
- **gitleaks**: v8.18.0+ (secret scanning)
- **nix flake check**: Built-in Nix validation
### CI/CD
- **GitHub Actions**: Native CI/CD platform
- **cachix/install-nix-action@v25**: Nix installation for CI
- **gitleaks/gitleaks-action@v2**: Secret scanning action
### Development Tools
- **Bash**: 5.x (sanitization/sync scripts)
- **git**: 2.x (version control, tagging)
- **SSH**: For VPS deployment testing
### Services (Matrix Platform - in template)
- **matrix-continuwuity**: Matrix homeserver (from nixpkgs-unstable)
- **mautrix-slack**: Slack bridge
- **mautrix-whatsapp**: WhatsApp bridge
- **mautrix-gmessages**: Google Messages bridge
- **forgejo**: Git service
- **nginx**: Reverse proxy
- **postgresql**: Database for bridges/Forgejo
- **fail2ban**: Intrusion prevention
- **sops**: Secrets encryption CLI
---
## Best Practices Applied
### From NixOS Community
1. **Pinned dependencies**: Use flake.lock for reproducibility
2. **Module options**: Provide configurable options with sensible defaults
3. **Security hardening**: Apply systemd security features
4. **Documentation**: Comprehensive examples and guides
### From Infrastructure-as-Code
1. **Immutability**: Fresh git history, no rewrites
2. **Validation**: Multiple layers (syntax, build, secrets, integration)
3. **Idempotency**: All scripts can be run multiple times safely
4. **Auditability**: Clear commit messages, sync logs, checklists
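
The idempotency claim can be spot-checked mechanically: applying a filter twice should give the same output as applying it once. A minimal sketch, assuming the script under test reads stdin and writes stdout; `is_idempotent` is a hypothetical helper, not part of the repository.

```bash
# Succeed when running a filter twice yields the same output as running it once.
# Usage: is_idempotent <input_file> <filter_command> [args...]
is_idempotent() {
  local input_file="$1"
  shift
  local once twice
  once="$("$@" < "$input_file")" || return 1
  twice="$(printf '%s\n' "$once" | "$@")" || return 1
  [ "$once" = "$twice" ]
}
```

For example, `is_idempotent sample.nix sed 's/jrz1/matrix/g'` succeeds, while a rule whose replacement text still matches its own pattern would fail the check.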
### From Open Source Projects
1. **Governance files**: CONTRIBUTING.md, SECURITY.md, CODE_OF_CONDUCT
2. **Issue templates**: Structured bug reports and feature requests
3. **CI/CD**: Automated checks on every PR
4. **Semantic versioning**: v1.0.0 for initial stable release
---
## Risk Mitigations
### Secret Leakage (Critical Risk)
- **Mitigation 1**: Automated gitleaks scan (CI + local)
- **Mitigation 2**: Manual review checklist
- **Mitigation 3**: Fresh git history (no ops-base commits)
- **Mitigation 4**: Community beta review before publication
- **Residual Risk**: Low (multi-layer validation)
### Template Divergence (Medium Risk)
- **Mitigation 1**: Documented sync workflow (scripts/sync-to-template.sh)
- **Mitigation 2**: Quarterly calendar reminders
- **Mitigation 3**: Git tags + sync-log.md for tracking
- **Mitigation 4**: RFC-validated process
- **Residual Risk**: Medium (requires discipline)
### Breaking Dependencies (Medium Risk)
- **Mitigation 1**: Pinned nixpkgs version
- **Mitigation 2**: CI tests on every commit
- **Mitigation 3**: Integration testing before major releases
- **Mitigation 4**: Version compatibility matrix in README
- **Residual Risk**: Low (pinning prevents surprises)
### Poor Documentation (Medium Risk)
- **Mitigation 1**: Extract from 300KB+ tested worklogs
- **Mitigation 2**: Community beta testing for clarity
- **Mitigation 3**: User story acceptance criteria
- **Mitigation 4**: Quick start guide (5-minute target)
- **Residual Risk**: Low (comprehensive extraction + validation)
---
## Success Metrics Tracking
### How to Measure (post-publication)
- **SC-001** (30 min deployment): Time beta testers during Phase 3
- **SC-002** (builds pass): CI badge in README, track failures
- **SC-003** (zero secrets): gitleaks CI status, manual audits
- **SC-004** (8 modules extracted): Checklist of modules present and building - matrix-continuwuity, mautrix-slack, mautrix-whatsapp, mautrix-gmessages, dev-services, fail2ban, ssh-hardening, matrix-secrets
- **SC-005** (documentation complete): Checklist of required docs present - getting-started.md, architecture.md, secrets-management.md, slack-setup.md, whatsapp-setup.md, gmessages-setup.md, config-generation.md, admin-room-setup.md
- **SC-006** (CI runs): GitHub Actions badge, monitor runs
- **SC-007** (10 stars in 3 months): GitHub stars count
- **SC-008** (3 issues/PRs): GitHub insights
- **SC-009** (zero incidents): Monitor issues, no secret reports
- **SC-010** (quarterly sync): Track sync-log.md entries
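
The SC-005 checklist lends itself to a scripted check. The sketch below searches by filename only, because the exact directory of some documents (for example whatsapp-setup.md) is not stated in this document; `missing_docs` is a hypothetical helper.

```bash
# Print each required doc that cannot be found anywhere under the given root.
# Returns 0 when nothing is missing, 1 otherwise.
missing_docs() {
  local root="$1" rc=0 doc
  for doc in getting-started.md architecture.md secrets-management.md \
             slack-setup.md whatsapp-setup.md gmessages-setup.md \
             config-generation.md admin-room-setup.md; do
    if [ -z "$(find "$root" -type f -name "$doc" 2>/dev/null)" ]; then
      echo "missing: $doc"
      rc=1
    fi
  done
  return "$rc"
}
```

Running `missing_docs docs` from the template root prints one line per absent file, and its exit status could gate a CI step.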
---
## Open Questions Resolved
All open questions from the spec were marked "None - resolved in RFC". Additional implementation questions are resolved here:
1. **Q: Should we use pre-commit hooks?**
   - A: Yes, but optional for users. Include a .pre-commit-config.yaml example
2. **Q: What NixOS version to target?**
   - A: 24.05+ (current stable), test on both stable and unstable
3. **Q: Should we include Cachix in CI?**
   - A: Not for v1.0 (added complexity), consider for v1.1 if builds are slow
4. **Q: How to handle user questions/support?**
   - A: GitHub Discussions for Q&A, Issues for bugs only (per CONTRIBUTING.md)
5. **Q: Should we create a Matrix room for support?**
   - A: Yes, mentioned in README (#nixos-matrix-template:matrix.org) - dogfooding
---
## Next Steps
1. Research completed - all decisions documented
2. Proceed to Phase 1: Design & Contracts
   - Create data-model.md (entities, relationships)
   - Create contracts/ (sanitization rules, CI contracts)
   - Create quickstart.md (developer onboarding)
   - Update agent context (.specify/memory/AGENT_FILE.md)