Dan bce31933ed Add platform vision and spec-kit integration docs

2025-10-26 14:36:52 -07:00

ops-jrz1 Platform Vision

Status: North Star Document
Last Updated: 2025-10-22
Maintainers: dan (primary), team (shared responsibility)

Executive Summary

ops-jrz1 is a self-hosted collaborative development platform for small engineering teams (2-5 engineers). It provides communication bridging (Matrix ↔ Slack), code hosting (Forgejo), and declarative deployment infrastructure (NixOS) with a focus on sustainability over speed and quality over quick wins.

Core Philosophy

Build It Right Over Time

Avoid technical debt
Declarative and reproducible (NixOS)
Self-documenting
Sustainable for small team
Clear patterns for contributions

Presentable State First

Working demo-able features
Clear documentation
Inviting for new engineers
Professional appearance

Current State (Generation 31+)

Operational Services

✅ Matrix homeserver (conduwuit 0.5.0-rc.8) on clarun.xyz
✅ Forgejo (7.0.12) at git.clarun.xyz
✅ nginx reverse proxy with TLS (Let's Encrypt)
✅ PostgreSQL 15.10 (Forgejo database)
✅ sops-nix secrets management
✅ Self-hosted infrastructure configuration (ops-jrz1 repo on Forgejo)

Security Posture

✅ SSH key-only authentication
✅ Secrets encrypted with age/sops-nix
✅ Services isolated on localhost (Matrix, PostgreSQL)
✅ Firewall (only SSH, HTTP, HTTPS exposed)
✅ Comprehensive security validation completed
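
As an illustration of how the SSH and firewall items above are typically expressed in NixOS, here is a minimal sketch. It is not copied from the ops-jrz1 repository, and the exact settings used there may differ. It assumes the stock `services.openssh` and `networking.firewall` modules.

```nix
# Sketch only: key-only SSH plus a firewall exposing SSH, HTTP, and HTTPS.
# Assumption: the real ops-jrz1 configuration may set these differently.
{
  services.openssh = {
    enable = true;
    settings = {
      # Disable password-based logins so only SSH keys are accepted.
      PasswordAuthentication = false;
      KbdInteractiveAuthentication = false;
      # Root may log in, but only with a key.
      PermitRootLogin = "prohibit-password";
    };
  };

  networking.firewall = {
    enable = true;
    # Port 22 is opened by the openssh module itself (openFirewall defaults
    # to true), so only HTTP and HTTPS are listed here.
    allowedTCPPorts = [ 80 443 ];
  };
}
```

Services that should stay on localhost are simply not listed in `allowedTCPPorts` and are bound to `127.0.0.1` in their own configuration.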

Incomplete/Blocked

⚠️ mautrix-slack bridge (service exits with code 11; workspace configuration and Slack credentials still needed)
⚠️ mautrix-whatsapp (configured but not tested)
⚠️ mautrix-gmessages (configured but not tested)
⚠️ No deployment pattern for team projects yet

Target "Presentable MVP"

Definition of Presentable

When we can say: "Here's a working platform you can use and contribute to"

Criteria:

Slack bridge works bidirectionally
One example project successfully deployed
Clear onboarding documentation
Stable and tested (not constantly broken)
Professional presentation (docs, architecture clarity)

Milestone 1: Working Slack Bridge

Goal: Engineers in Slack can see that the platform is alive and useful

Success Metric: A "Hello from Matrix!" message sent from Matrix appears in Slack via the bridge

Tasks:

Update workspace config (delpadtech → chochacho)
Create Slack app in chochacho workspace
Configure Slack credentials (app token, bot token) in sops-nix
Debug exit code 11 issue
Test bidirectional messaging (Slack ↔ Matrix)
Document setup in worklog

Impact: Highly visible proof of concept, validates core architecture

Priority: HIGH - Unblocks team communication and collaboration

Milestone 2: Example Project Pattern

Goal: Clear template for "how to add a project"

Success Metric: An engineer can clone the template repo, modify it, and deploy a simple bot

Deliverables:

Example project: "chochacho-hello-bot" (responds to !hello in Matrix)
Project structure: Nix flake + NixOS module pattern
Documentation: docs/project-template.md
Template repository on Forgejo

Impact: Makes platform "joinable" - clear contribution path

Priority: MEDIUM - Required before onboarding engineers

Milestone 3: Platform Documentation

Goal: New engineer can understand and use the platform

Deliverables:

docs/architecture.md - How the platform works
docs/onboarding.md - How to join as an engineer
docs/deployment.md - How to deploy projects
README.md - Overview and navigation

Impact: Presentability factor, shows maturity and thoughtfulness

Priority: MEDIUM - Can iterate as engineers join

Architecture Principles

Communication Layer

Primary: Slack (chochacho workspace)
Hub: Matrix homeserver bridges to Slack
Direction: Bidirectional (Slack ↔ Matrix)

Current Focus: Slack bridge only (not WhatsApp, Google Messages, etc.)

User Experience: Engineers stay in Slack, Matrix runs behind the scenes to unify communication

Code Hosting

Primary: Self-hosted Forgejo at git.clarun.xyz
Flexibility: Projects can also reference external repos (GitHub, etc.)

Model:

ops-jrz1 repository: Platform infrastructure (NixOS config)
Project repositories: Individual team projects
Clear separation: Infrastructure vs applications

Deployment Philosophy

Chosen Approach: NixOS-Native (Strict Declarative)

Pattern: Project as NixOS Module

# Example project structure
project-name/
├── flake.nix              # Nix flake (how to build)
├── default.nix            # Derivation (package definition)
├── module.nix             # NixOS service module
├── src/                   # Project code
└── README.md              # Deployment instructions
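
To make the `module.nix` entry concrete, a minimal sketch of a NixOS service module follows. The option namespace `services.hello-bot`, the binary name `hello-bot`, and the unit description are hypothetical and chosen for illustration; the actual template may look different.

```nix
# module.nix: hypothetical service module for an example bot.
# All names under `services.hello-bot` are illustrative assumptions.
{ config, lib, ... }:

let
  cfg = config.services.hello-bot;
in
{
  options.services.hello-bot = {
    enable = lib.mkEnableOption "the example hello bot";

    package = lib.mkOption {
      type = lib.types.package;
      description = "Package that provides the bot executable.";
    };
  };

  # Only define the systemd unit when the service is enabled.
  config = lib.mkIf cfg.enable {
    systemd.services.hello-bot = {
      description = "Example Matrix hello bot";
      wantedBy = [ "multi-user.target" ];
      wants = [ "network-online.target" ];
      after = [ "network-online.target" ];
      serviceConfig = {
        # Assumes the package installs a binary named `hello-bot`.
        ExecStart = "${cfg.package}/bin/hello-bot";
        # Run as a transient unprivileged user instead of root.
        DynamicUser = true;
        Restart = "on-failure";
      };
    };
  };
}
```

Keeping the service definition in the project repository means the project carries its own deployment description, and the platform only has to import and enable it.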

Deployment Workflow:

Engineer develops project locally (with Nix)
Project added to ops-jrz1 as import or flake reference
Push to Forgejo (project repo or ops-jrz1 update)
Admin reviews change (pull request optional)
nixos-rebuild switch deploys to production
Rollback available via NixOS generations
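
The "flake reference" step in the workflow above could look roughly like the following fragment of the ops-jrz1 `flake.nix`. This is a sketch under several assumptions: the repository owner `example`, the configuration attribute `ops-jrz1`, the `x86_64-linux` system, the `./configuration.nix` path, and the `services.hello-bot` options are all hypothetical, and the project flake is assumed to export `nixosModules.default` and `packages.<system>.default`.

```nix
# flake.nix in ops-jrz1 (fragment, illustrative only)
{
  inputs = {
    nixpkgs.url = "github:NixOS/nixpkgs/nixos-24.05";

    # Hypothetical project flake hosted on the Forgejo instance.
    hello-bot.url = "git+https://git.clarun.xyz/example/chochacho-hello-bot";
    # Build the project against the same nixpkgs as the platform.
    hello-bot.inputs.nixpkgs.follows = "nixpkgs";
  };

  outputs = { self, nixpkgs, hello-bot, ... }: {
    nixosConfigurations.ops-jrz1 = nixpkgs.lib.nixosSystem {
      system = "x86_64-linux";
      modules = [
        ./configuration.nix
        # Service module exported by the project flake (assumed output name).
        hello-bot.nixosModules.default
        {
          services.hello-bot.enable = true;
          services.hello-bot.package = hello-bot.packages.x86_64-linux.default;
        }
      ];
    };
  };
}
```

Because the project revision is pinned in `flake.lock`, the admin review step can be a review of the lock file change, and a rollback restores both the platform and the project to the previous generation.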

Benefits:

✅ Declarative and reproducible
✅ Built-in rollback (generation management)
✅ Consistent with existing ops-jrz1 pattern
✅ Forces proper packaging (quality gate)
✅ No additional deployment systems to maintain

Trade-offs:

❌ Requires NixOS knowledge (acceptable: team can learn)
❌ Less "instant" than webhook deployment (acceptable: "no deployment urgency")
❌ Admin approval step (beneficial: quality control)

Alternative Considered: Hybrid model (platform in NixOS, projects flexible)

Deferred: Can relax strictness later if needed
Starting strict enforces quality and consistency

Multi-Engineer Access Model

Level 1: Communication Only

Slack workspace access (chochacho)
Can participate in bridged conversations
No infrastructure access needed

Level 2: Code Contributor

Forgejo account (pattern established)
SSH key uploaded to Forgejo
Can push to project repositories
Can submit pull requests

Level 3: Deployer

Can trigger deployments (merge to main?)
May have SSH access for debugging
Permissions to restart services

Level 4: Admin

SSH root access to VPS
Can modify ops-jrz1 NixOS config
Secrets management access (sops-nix keys)
Infrastructure decision authority

Target Distribution (2-5 engineers):

Level 1: All engineers
Level 2: All engineers (default)
Level 3: 2-3 trusted engineers
Level 4: 1-2 admins (primary: dan)
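
One possible way to express a Level 3 deployer declaratively is sketched below. The user name `alice`, the SSH public key placeholder, and the unit name `hello-bot.service` are hypothetical. Whether deployers get shell access at all is still an open design question in this document.

```nix
# Sketch: a deployer who can SSH in and restart one service without full root.
{
  users.users.alice = {
    isNormalUser = true;
    # Placeholder key: replace with the engineer's real public key.
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 AAAA...placeholder alice@example"
    ];
  };

  security.sudo.extraRules = [
    {
      users = [ "alice" ];
      commands = [
        {
          # Allow restarting exactly this unit, nothing else.
          command = "/run/current-system/sw/bin/systemctl restart hello-bot.service";
          options = [ "NOPASSWD" ];
        }
      ];
    }
  ];
}
```

Level 4 admins would instead hold root SSH access and the sops-nix keys, so no sudo rule is needed for them.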

Secrets Management

Tool: sops-nix with age encryption

Current State:

VPS SSH host key, converted to an age key (public recipient): age1vuxcwvdvzl2u7w6kudqvnnf45czrnhwv9aevjq9hyjjpa409jvkqhkz32q
Admin workstation can decrypt (dan's age key)

Pattern:

# secrets/secrets.yaml (encrypted)
matrix-registration-token: "..."
acme-email: "..."
slack-app-token: "..."        # Future
slack-bot-token: "..."        # Future
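
On the NixOS side, secrets from this file are usually declared with sops-nix so that they are decrypted at activation time. The following is a minimal sketch, not the actual ops-jrz1 configuration. The service user `mautrix-slack` is an assumption and must exist on the host for activation to succeed.

```nix
# Sketch: declaring the Slack tokens with sops-nix.
{
  sops.defaultSopsFile = ./secrets/secrets.yaml;

  # Decrypt using the host's SSH ed25519 key, which sops-nix converts
  # to an age identity.
  sops.age.sshKeyPaths = [ "/etc/ssh/ssh_host_ed25519_key" ];

  sops.secrets."slack-app-token" = {
    owner = "mautrix-slack"; # hypothetical service user
    mode = "0400";
  };

  sops.secrets."slack-bot-token" = {
    owner = "mautrix-slack"; # hypothetical service user
    mode = "0400";
  };
}
```

Each secret is written to a file under `/run/secrets/` by default, and other modules can refer to it via `config.sops.secrets."slack-bot-token".path` instead of embedding the value in the Nix store.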

Future Considerations:

Add engineer age keys for collaboration
Per-project secrets (if needed)
Secret rotation workflow

Testing Strategy

Current: ops-jrz1-vm (VM testing before production)

Workflow:

Develop locally
Test in VM (nixos-rebuild build-vm)
Deploy to production (nixos-rebuild switch)
Rollback if issues (nixos-rebuild switch --rollback)

Future:

Automated testing (unit, integration)
Staging environment (if needed)
Pre-deployment health checks
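
For the automated testing item, one option is the NixOS VM test framework that ships with nixpkgs. The sketch below is an assumption about how such a test might look, not an existing ops-jrz1 test. It boots a throwaway VM, starts nginx with a trivial virtual host, and checks that it answers.

```nix
# tests/nginx-smoke.nix: hypothetical smoke test using the NixOS test driver.
{ pkgs }:

pkgs.testers.runNixOSTest {
  name = "nginx-smoke";

  nodes.machine = { ... }: {
    services.nginx.enable = true;
    # Minimal virtual host that returns a fixed response.
    services.nginx.virtualHosts."localhost".locations."/".return = "200 'ok'";
  };

  # The test script is Python, executed by the NixOS test driver.
  testScript = ''
    machine.wait_for_unit("nginx.service")
    machine.wait_for_open_port(80)
    machine.succeed("curl --fail http://localhost/")
  '';
}
```

A test like this could be exposed as a flake `checks` output so that `nix flake check` runs it before a deployment; that wiring is left out here.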

Technical Stack

Infrastructure

OS: NixOS 24.05
Config Management: Nix flakes
Secrets: sops-nix with age encryption
Firewall: iptables (nixos-fw)
Web Server: nginx with ACME/Let's Encrypt
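
As a sketch of the nginx and ACME entry, the fragment below proxies `git.clarun.xyz` to a Forgejo instance listening on localhost. The upstream port 3000 (Forgejo's default HTTP port) and the contact email are assumptions; in this setup the real ACME email is kept in the encrypted secrets file.

```nix
# Sketch: TLS-terminating reverse proxy in front of Forgejo.
{
  security.acme.acceptTerms = true;
  security.acme.defaults.email = "admin@example.org"; # placeholder address

  services.nginx = {
    enable = true;
    recommendedProxySettings = true;
    recommendedTlsSettings = true;

    virtualHosts."git.clarun.xyz" = {
      # Request and renew a Let's Encrypt certificate for this host.
      enableACME = true;
      # Redirect plain HTTP to HTTPS.
      forceSSL = true;
      # Assumes Forgejo listens on 127.0.0.1:3000.
      locations."/".proxyPass = "http://127.0.0.1:3000";
    };
  };
}
```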

Communication

Matrix Homeserver: conduwuit 0.5.0-rc.8
Bridge Framework: mautrix (the Slack, WhatsApp, and Google Messages bridges are written in Go)
Target Bridge: mautrix-slack (Socket Mode)

Development Platform

Git Server: Forgejo 7.0.12
Database: PostgreSQL 15.10
CI/CD: Forgejo Actions (future consideration)

Expected Project Stack (Flexible)

Python bots (primary expectation)
Node.js services (if needed)
Go binaries (if needed)
Any language with Nix packaging support

Open Questions

Communication Bridge

Which Slack channels to bridge? (All? Specific list? On-demand?)
User identity mapping: Slack display names or Matrix usernames?
Bot integration needs: GitHub notifications? CI/CD status?

Project Deployment

Automated deployment on merge? Or manual trigger?
Pull request workflow required? Or direct push to main?
Health checks before deployment?
Monitoring and alerting strategy?

Team Collaboration

How many engineers will actually join? (impacts scaling decisions)
Shared development environments needed?
Per-project Matrix rooms or one big room?
Weekly syncs or async-only collaboration?

Repository Organization

Monorepo (ops-jrz1 + projects) or separate repos?
Public vs private repositories?
Who owns which repositories?

Success Metrics

Technical Success

✅ All services healthy and monitored
✅ Zero unplanned downtime
✅ Fast rollback capability (< 5 minutes)
✅ Clear audit trail (git history + NixOS generations)

Team Success

✅ Engineers can deploy projects independently
✅ Onboarding time < 1 hour
✅ Documentation answers common questions
✅ Platform feels stable and trustworthy

Project Success (Presentable State)

✅ Slack bridge works reliably
✅ Example project demonstrates the pattern
✅ Documentation is complete and clear
✅ At least one other engineer has successfully deployed

Timeline

Phase 1: Working Slack Bridge (1-2 focused sessions)

Update workspace configuration
Slack app setup and credential management
Debug and validate bidirectional messaging

Phase 2: Project Pattern (1-2 sessions after Phase 1)

Create example bot
Document deployment pattern
Establish template repository

Phase 3: Documentation (1 session)

Architecture documentation
Onboarding guide
Deployment runbook

Phase 4: Team Onboarding (1 session per engineer)

Invite engineers
Supervised first deployment
Gather feedback and iterate

Target: Presentable state within 4-8 focused work sessions

Constraint: No deadline pressure; quality takes priority over speed

References

Internal Documentation

Security Test Report - Generation 31 validation
Deployment Log - Initial deployment
Forgejo Setup - Git server configuration

External Resources

Revision History

2025-10-22: Initial vision document created after brainstorming session
- Defined presentable MVP criteria
- Established three-milestone roadmap
- Documented architectural principles
- Identified open questions for iteration
