Research conducted 2026-01-22:
- pi-extension-ecosystem-research.md: 56 GitHub projects, 52 official examples
- pi-ui-ecosystem-research.md: TUI patterns, components, overlays
- multi-model-consensus-analysis.md: gap analysis leading to /synod design
# Multi-Model Consensus: Current State & Pi Integration Analysis
**Date**: 2026-01-22
**Purpose**: Analyze what we have in orch vs what pi needs for multi-model consensus

---
## What We Have: Orch CLI
### Core Capabilities

**Commands**:

1. `orch consensus` - Parallel multi-model queries with vote/brainstorm/critique/open modes
2. `orch chat` - Single-model conversation with session management
3. `orch models` - List/resolve 423 available models
4. `orch sessions` - Manage conversation history

**Key Features**:
**Model Selection**:

- 423 models across providers (OpenAI, Anthropic, Google, DeepSeek, Qwen, Perplexity, etc.)
- Aliases: `flash`, `gemini`, `gpt`, `claude`, `sonnet`, `opus`, `haiku`, `deepseek`, `r1`, `qwen`
- Stance modifiers: `gpt:for`, `claude:against`, `gemini:neutral`
- Cost awareness: `--allow-expensive` for opus/r1
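The `model:stance` syntax is simple enough to sketch as a parser. This is illustrative only, not orch's actual implementation:

```typescript
type Stance = "for" | "against" | "neutral";

interface ModelSpec {
  alias: string;
  stance?: Stance;
}

// Split "gpt:against" into { alias: "gpt", stance: "against" };
// specs without a recognized ":stance" suffix carry no stance.
function parseModelSpec(spec: string): ModelSpec {
  const [alias, stance] = spec.split(":");
  if (stance === "for" || stance === "against" || stance === "neutral") {
    return { alias, stance };
  }
  return { alias };
}
```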
**Modes**:

- `vote` - Support/Oppose/Neutral verdict with reasoning
- `brainstorm` - Generate ideas without judgment
- `critique` - Find flaws and weaknesses
- `open` - Freeform responses
**Context**:

- File inclusion: `--file PATH` (multiple allowed)
- Stdin piping: `cat code.py | orch consensus "..."`
- Session continuity: `--session ID` for chat mode
- Web search: `--websearch` (Gemini only)
**Execution**:

- Parallel by default, `--serial` for sequential
- Serial strategies: neutral, refine, debate, brainstorm
- Synthesis: `--synthesize MODEL` to aggregate responses
- Timeout control: `--timeout SECS`
**Output**:

- Structured vote results with verdict counts
- Reasoning for each model
- Color-coded output (SUPPORT/OPPOSE/NEUTRAL)
- Session IDs for continuation
### Current Skill Integration

**Location**: `~/.codex/skills/orch/`

**What it provides**:

- Documentation of orch capabilities
- Usage patterns (second opinion, architecture decision, code review, devil's advocate, etc.)
- Model selection guidance
- Conversational patterns (session-based multi-turn, cross-model dialogue, iterative refinement)
- Combined patterns (explore then validate)
**What it does NOT provide**:

- Direct agent tool invocation (agent must shell out to `orch`)
- UI integration (no pickers, no inline results)
- Conversation context sharing (agent's conversation ≠ orch's conversation)
- Interactive model selection
- Add-to-context workflow

---
## What Pi Oracle Extension Provides

### From shitty-extensions/oracle.ts

**UI Features**:

- Interactive model picker overlay
- Quick keys (1-9) for fast selection
- Shows which models are authenticated/available
- Excludes current model from picker
- Formatted result display with scrolling

**Context Sharing**:

- **Inherits full conversation context** - Oracle sees the entire pi conversation
- Sends conversation history to the queried model
- No need to re-explain context
**Workflow**:

1. User types `/oracle <prompt>`
2. Model picker appears
3. Select model with arrow keys or number
4. Oracle queries model with **full conversation context + prompt**
5. Result displays in a scrollable overlay
6. **"Add to context?" prompt** - YES/NO choice
7. If YES, the oracle response is appended to the conversation
**Model Awareness**:

- Only shows models with valid API keys
- Filters out the current model
- Groups by provider (OpenAI, Google, Anthropic, OpenAI Codex)

**Input Options**:

- Direct: `/oracle -m gpt-4o <prompt>` (skips picker)
- Files: `/oracle -f file.ts <prompt>` (includes file content)
**Implementation Details**:

- Uses pi's `@mariozechner/pi-ai` complete() API
- Serializes conversation with `serializeConversation()`
- Converts to LLM format with `convertToLlm()`
- Custom TUI component for result display
- BorderedLoader during query

---
## Gap Analysis

### What Orch Has That Oracle Doesn't

1. **Multiple simultaneous queries** - Oracle queries one model at a time
2. **Structured voting** - Support/Oppose/Neutral verdicts with counts
3. **Multiple modes** - vote/brainstorm/critique/open (Oracle is always "open")
4. **Stance modifiers** - :for/:against/:neutral bias (devil's advocate)
5. **Serial strategies** - refine, debate, brainstorm sequences
6. **Synthesis** - Aggregate multiple responses into a summary
7. **Session management** - Persistent conversation threads
8. **423 models** - Far more models than Oracle's ~18
9. **Cost awareness** - Explicit `--allow-expensive` gate
10. **Web search** - Integrated search for Gemini/Perplexity
11. **CLI flexibility** - File piping, stdin, session export
### What Oracle Has That Orch Doesn't

1. **Conversation context inheritance** - Oracle sees the full pi conversation automatically
2. **Interactive UI** - Model picker, scrollable results, keyboard navigation
3. **Add-to-context workflow** - Explicit YES/NO to inject the response
4. **Current model exclusion** - Automatically filters out the active model
5. **Native pi integration** - No subprocess, uses pi's AI API directly
6. **Quick keys** - 1-9 for instant model selection
7. **Authenticated model filtering** - Only shows models with valid keys
8. **Inline result display** - Formatted overlay with scrolling
### What Neither Has (Opportunities)

1. **Side-by-side comparison** - Show multiple model responses in a split view
2. **Vote visualization** - Bar chart or consensus gauge
3. **Response diff** - Highlight disagreements between models
4. **Model capability awareness** - Filter by vision/reasoning/coding/etc.
5. **Cost preview** - Show estimated cost before querying
6. **Cached responses** - Don't re-query the same prompt to the same model
7. **Response export** - Save consensus to a file/issue
8. **Model recommendations** - Suggest models based on query type
9. **Confidence scoring** - Gauge certainty in responses
10. **Conversation branching** - Fork the conversation with different models

---
## Pi Integration Options

### Option 1: Wrap Orch CLI as Tool

**Approach**: Register `orch` as a pi tool, shell out to the CLI

**Pros**:

- Minimal code, reuses existing orch
- All orch features available (423 models, voting, synthesis, etc.)
- Already works with the current skill
**Cons**:

- No conversation context sharing (pi's conversation ≠ orch's input)
- No interactive UI (no model picker, no add-to-context)
- Subprocess overhead
- Output parsing required
- Can't leverage pi's AI API
**Implementation**:

```typescript
pi.registerTool({
  name: "orch_consensus",
  description: "Query multiple AI models for consensus on a question",
  parameters: Type.Object({
    prompt: Type.String({ description: "Question to ask" }),
    models: Type.Array(Type.String(), { description: "Model aliases (flash, gemini, gpt, claude, etc.)" }),
    mode: Type.Optional(Type.Enum({ vote: "vote", brainstorm: "brainstorm", critique: "critique", open: "open" })),
    files: Type.Optional(Type.Array(Type.String(), { description: "Paths to include as context" })),
  }),
  async execute(toolCallId, params, onUpdate, ctx, signal) {
    const args = ["consensus", params.prompt, ...params.models];
    if (params.mode) args.push("--mode", params.mode);
    if (params.files) params.files.forEach(f => args.push("--file", f));

    const result = await pi.exec("orch", args);
    return { content: [{ type: "text", text: result.stdout }] };
  }
});
```
**Context issue**: The agent would need to provide conversation context manually:

```typescript
// Agent would have to do this:
const context = serializeConversation(ctx.sessionManager.getBranch());
const contextFile = writeToTempFile(context);
args.push("--file", contextFile);
```
---

### Option 2: Oracle-Style Extension with Orch Models

**Approach**: Port Oracle's UI/UX but use orch's model registry
**Pros**:

- Best UX: interactive picker, add-to-context, full conversation sharing
- Native pi integration, no subprocess
- Can query multiple models and show them side-by-side
- Direct access to pi's AI API

**Cons**:

- Doesn't leverage orch's advanced features (voting, synthesis, serial strategies)
- Duplicate model registry (though it could import from orch config)
- More code to maintain
- Loses orch's CLI flexibility (piping, session export, etc.)
**Implementation**:

```typescript
pi.registerCommand("consensus", {
  description: "Get consensus from multiple models",
  handler: async (args, ctx) => {
    // 1. Show model picker (multi-select)
    const models = await ctx.ui.custom(
      (tui, theme, kb, done) => new ModelPickerComponent(theme, done, { multiSelect: true })
    );

    // 2. Serialize conversation context
    const conversationHistory = serializeConversation(ctx.sessionManager.getBranch());

    // 3. Query models in parallel
    const promises = models.map(m =>
      complete(m.model, [
        ...conversationHistory.map(convertToLlm),
        { role: "user", content: args }
      ], m.apiKey)
    );

    // 4. Show results in comparison view
    const results = await Promise.all(promises);
    await ctx.ui.custom(
      (tui, theme, kb, done) => new ConsensusResultComponent(results, theme, done)
    );

    // 5. Add to context?
    const shouldAdd = await ctx.ui.confirm("Add responses to conversation context?");
    if (shouldAdd) {
      // Append all responses or a synthesized summary
      ctx.sessionManager.appendMessage({
        role: "assistant",
        content: formatConsensus(results)
      });
    }
  }
});
```
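The `formatConsensus` helper in the sketch above is hypothetical; one plausible shape (the `ConsensusResult` fields are assumptions, not pi's API):

```typescript
interface ConsensusResult {
  modelName: string;
  response: string;
}

// Join each model's response under a labeled header so the combined
// text can be appended to the conversation as a single message.
function formatConsensus(results: ConsensusResult[]): string {
  return results
    .map(r => `### ${r.modelName}\n${r.response}`)
    .join("\n\n");
}
```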
**Features to implement**:

- Multi-select model picker (checkboxes)
- Parallel query with progress indicators
- Side-by-side result display with scrolling
- Voting mode: parse "SUPPORT/OPPOSE/NEUTRAL" from responses
- Add-to-context with synthesis option

---
### Option 3: Hybrid Approach

**Approach**: Keep the orch CLI for advanced use, add an Oracle-style extension for quick queries

**Pros**:

- Best of both worlds
- Agent can use the tool for programmatic access
- User can use `/oracle` for interactive queries
- Orch handles complex scenarios (serial strategies, synthesis)
- Oracle handles quick second opinions

**Cons**:

- Two parallel systems to maintain
- Potential confusion about which to use
**Implementation**:

**Tool (for agent)**:

```typescript
pi.registerTool({
  name: "orch_consensus",
  // ... as in Option 1, shells out to the orch CLI
});
```

**Command (for user)**:

```typescript
pi.registerCommand("oracle", {
  description: "Get second opinion from another model",
  // ... as in Option 2, native UI integration
});
```
**Usage patterns**:

- User types `/oracle <prompt>` → interactive picker, add-to-context flow
- Agent calls `orch_consensus()` → structured vote results in tool output
- Agent suggests: "I can get consensus from multiple models using orch_consensus if you'd like"
- User can also run `orch` directly in the shell for advanced features

---
### Option 4: Enhanced Oracle with Orch Backend

**Approach**: Oracle UI that calls the orch CLI under the hood

**Pros**:

- Leverages orch's features through a nice UI
- Single source of truth (orch)
- Can expose orch modes/options in the UI

**Cons**:

- Subprocess overhead
- Hard to share conversation context (orch doesn't expect serialized conversations)
- Awkward impedance mismatch
**Implementation challenges**:

```typescript
// How to pass conversation context to orch?
// Orch expects a prompt, not a conversation history.

// Option A: Serialize the entire conversation to a temp file
const contextFile = "/tmp/pi-conversation.txt";
fs.writeFileSync(contextFile, formatConversation(history));
await pi.exec("orch", ["consensus", prompt, ...models, "--file", contextFile]);

// Option B: Inject context into the prompt
const augmentedPrompt = `
Given this conversation:
${formatConversation(history)}

Answer this question: ${prompt}
`;
await pi.exec("orch", ["consensus", augmentedPrompt, ...models]);
```

Both are awkward because orch's input model doesn't match pi's conversation model.

---
## Recommendation

### Short Term: Option 3 (Hybrid)

**Rationale**:

1. **Keep the orch CLI** for its strengths:
   - 423 models (far more than Oracle)
   - Voting/synthesis/serial strategies
   - CLI flexibility (piping, sessions, export)
   - Already works, well-tested

2. **Add an Oracle-style extension** for its strengths:
   - Interactive UI (model picker, results display)
   - Conversation context sharing
   - Add-to-context workflow
   - Quick keys, better UX

3. **Clear division of labor**:
   - `/oracle` → quick second opinion, inherits conversation, nice UI
   - `orch_consensus` tool → agent programmatic access, structured voting
   - `orch` CLI → advanced features (synthesis, serial strategies, sessions)
### Long Term: Option 2 (Native Integration) + Orch as Fallback

**Rationale**:

Eventually, we want:

1. A native pi tool with full UI integration
2. Access to orch's model registry (import from config)
3. Voting, synthesis, and comparison built into the UI
4. Conversation context sharing by default

But keep the `orch` CLI for:

- Session management
- Export/archival
- Scripting/automation
- Features not yet in the pi extension

---
## Implementation Plan

### Phase 1: Oracle Extension (Week 1)

**Goal**: Interactive second opinion with conversation context

**Tasks**:

1. Port the Oracle extension from shitty-extensions
2. Add model aliases from orch config
3. Implement the model picker with multi-select
4. Conversation context serialization
5. Add-to-context workflow
6. Test with flash/gemini/gpt/claude

**Deliverable**: `/oracle` command for quick second opinions
### Phase 2: Orch Tool Wrapper (Week 2)

**Goal**: Agent can invoke orch programmatically

**Tasks**:

1. Register the `orch_consensus` tool
2. Map tool parameters to orch CLI args
3. Serialize conversation context to a temp file
4. Parse orch output (vote results)
5. Format for agent consumption

**Deliverable**: Agent can call orch for structured consensus
### Phase 3: Enhanced Oracle UI (Week 3-4)

**Goal**: Side-by-side comparison and voting

**Tasks**:

1. Multi-model query in parallel
2. Split-pane result display
3. Vote parsing (SUPPORT/OPPOSE/NEUTRAL)
4. Consensus gauge visualization
5. Diff highlighting (show disagreements)
6. Cost preview before query

**Deliverable**: Rich consensus UI with voting
### Phase 4: Advanced Features (Month 2)

**Goal**: Match orch's advanced features

**Tasks**:

1. Synthesis mode (aggregate responses)
2. Serial strategies (refine, debate)
3. Stance modifiers (:for/:against)
4. Response caching (don't re-query)
5. Model recommendations based on the query
6. Export to file/issue

**Deliverable**: Feature parity with the orch CLI

---
## Technical Details

### Model Registry Sharing

**Current state**: Orch has 423 models in a Python config

**Options**:

1. **Import orch config** - Parse orch's model registry
2. **Duplicate registry** - Maintain a separate TypeScript registry
3. **Query orch** - Call `orch models` and parse the output

**Recommendation**: Start with (3), migrate to (1) later
```typescript
async function getOrchModels(): Promise<ModelAlias[]> {
  const { stdout } = await pi.exec("orch", ["models"]);
  // parseOrchModels: parser for `orch models` output (format-specific, not shown)
  return parseOrchModels(stdout);
}
```
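`parseOrchModels` depends on orch's actual output format, which isn't documented here. Assuming one `alias<TAB>model-id` pair per line (an assumption to revisit against the real output), a sketch could be:

```typescript
interface ModelAlias {
  alias: string;
  id: string;
}

// Assumed format: one "alias\tmodel-id" pair per line of `orch models`.
// Adjust the split once the real output format is confirmed.
function parseOrchModels(stdout: string): ModelAlias[] {
  return stdout
    .split("\n")
    .map(line => line.trim())
    .filter(line => line.length > 0)
    .map(line => {
      const [alias, id] = line.split("\t");
      return { alias, id: id ?? alias };
    });
}
```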
### Conversation Context Serialization

**Challenge**: Pi's conversation format ≠ standard chat format

**Solution**: Use pi's built-in `serializeConversation()` and `convertToLlm()`
```typescript
import { serializeConversation, convertToLlm } from "@mariozechner/pi-coding-agent";

const history = ctx.sessionManager.getBranch();
const serialized = serializeConversation(history);
const llmMessages = serialized.map(convertToLlm);

// Now compatible with any model's chat API
const response = await complete(model, llmMessages, apiKey);
```
### Add-to-Context Workflow

**UI Flow**:

1. Show consensus results
2. Prompt: "Add responses to conversation context?"
3. Options:
   - YES - Add all responses (verbose)
   - SUMMARY - Add a synthesized summary (concise)
   - NO - Don't add

**Implementation**:
```typescript
const choice = await ctx.ui.select("Add to context?", [
  "Yes, add all responses",
  "Yes, add synthesized summary",
  "No, keep separate"
]);

if (choice === 0) {
  // Append all model responses
  for (const result of results) {
    ctx.sessionManager.appendMessage({
      role: "assistant",
      content: `[${result.modelName}]: ${result.response}`
    });
  }
} else if (choice === 1) {
  // Synthesize and append (synthesize() is a helper that would query a
  // model to summarize the responses; not shown here)
  const summary = await synthesize(results, "gemini");
  ctx.sessionManager.appendMessage({
    role: "assistant",
    content: `[Consensus]: ${summary}`
  });
}
```
### Vote Parsing

**Challenge**: Extract SUPPORT/OPPOSE/NEUTRAL from freeform responses

**Strategies**:

1. **Prompt engineering** - Ask models to start the response with a verdict
2. **Regex matching** - Parse the structured output
3. **Secondary query** - Ask "classify this response as SUPPORT/OPPOSE/NEUTRAL"

**Recommendation**: (1) with (3) as fallback
```typescript
const votePrompt = `${originalPrompt}

Respond with your verdict first: SUPPORT, OPPOSE, or NEUTRAL.
Then explain your reasoning.`;

const response = await complete(model, [...history, { role: "user", content: votePrompt }]);

// Trim first: models often lead with whitespace or markdown before the verdict
const match = response.trim().match(/^(SUPPORT|OPPOSE|NEUTRAL)/i);
const verdict = match ? match[1].toUpperCase() : "NEUTRAL";
```
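Strategy (3) can be approximated locally before paying for a secondary query: scan the whole response for verdict keywords and take the first hit. A heuristic sketch, not from the source:

```typescript
type Verdict = "SUPPORT" | "OPPOSE" | "NEUTRAL";

// Fallback when the response doesn't start with a verdict: find the
// earliest whole-word occurrence of any verdict keyword in the text.
// Defaults to NEUTRAL when nothing matches (same default as above).
function classifyVerdict(response: string): Verdict {
  const match = response.toUpperCase().match(/\b(SUPPORT|OPPOSE|NEUTRAL)\b/);
  return (match ? match[1] : "NEUTRAL") as Verdict;
}
```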
### Cost Estimation

**Orch approach**: Uses pricing data in the model registry

**Implementation**:
```typescript
interface ModelInfo {
  id: string;
  name: string;
  inputCostPer1M: number;
  outputCostPer1M: number;
}

function estimateCost(prompt: string, history: Message[], models: ModelInfo[]): number {
  // estimateTokens: rough token count over the messages (helper, see below)
  const inputTokens = estimateTokens([...history, { role: "user", content: prompt }]);
  const outputTokens = 1000; // rough per-response estimate

  return models.reduce((total, m) => {
    const inputCost = (inputTokens / 1_000_000) * m.inputCostPer1M;
    const outputCost = (outputTokens / 1_000_000) * m.outputCostPer1M;
    return total + inputCost + outputCost;
  }, 0);
}

// Show before querying
const cost = estimateCost(prompt, history, selectedModels);
const confirmed = await ctx.ui.confirm(`Estimated cost: $${cost.toFixed(3)}. Continue?`);
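The `estimateTokens` helper used above is assumed rather than defined; a rough chars/4 heuristic is common for English text and is good enough for a cost preview:

```typescript
interface Message {
  role: string;
  content: string;
}

// Rough heuristic: ~4 characters per token for English text.
// Fine for a cost preview; use a real tokenizer for billing-grade numbers.
function estimateTokens(messages: Message[]): number {
  const chars = messages.reduce((n, m) => n + m.content.length, 0);
  return Math.ceil(chars / 4);
}
```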
---

## Design Questions
### 1. Should Oracle query multiple models or just one?

**Current Oracle**: One model at a time
**Orch**: Multiple models in parallel

**Recommendation**: Support both

- `/oracle <prompt>` → single-model picker (quick second opinion)
- `/oracle-consensus <prompt>` → multi-select picker (true consensus)

Or:

- `/oracle` with Shift+Enter for multi-select
### 2. Should results auto-add to context or always prompt?

**Current Oracle**: Always prompts
**Orch**: No context, just output

**Recommendation**: Make it configurable

- Default: always prompt
- Setting: `oracle.autoAddToContext = true` to skip the prompt
- ESC = don't add (quick exit)
### 3. How to handle expensive models?

**Orch**: Requires the `--allow-expensive` flag

**Recommendation**: Show cost and prompt

- Model picker shows cost per model
- Selecting opus/r1 shows a warning: "This is expensive ($X per query). Continue?"
- Can be disabled in settings
### 4. Should we cache responses?

**Problem**: Querying the same prompt to the same model multiple times wastes money

**Recommendation**: Short-term cache

- Cache key: `hash(model + conversation_context + prompt)`
- TTL: 5 minutes
- Show an indicator: "(cached)" in results
- Option to force refresh
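The cache-key-plus-TTL idea can be sketched with a plain Map. The SHA-256 key mirrors the `hash(model + conversation_context + prompt)` scheme above; the 5-minute TTL is from this document, the rest is illustrative:

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 5 * 60 * 1000; // 5-minute TTL per the recommendation above
const cache = new Map<string, { response: string; expires: number }>();

// Key over model + context + prompt, NUL-separated to avoid collisions
// between e.g. ("ab", "c") and ("a", "bc").
function cacheKey(model: string, context: string, prompt: string): string {
  return createHash("sha256").update(`${model}\0${context}\0${prompt}`).digest("hex");
}

function getCached(key: string, now = Date.now()): string | undefined {
  const entry = cache.get(key);
  if (!entry) return undefined;
  if (entry.expires <= now) { // expired: evict and report a miss
    cache.delete(key);
    return undefined;
  }
  return entry.response;
}

function putCached(key: string, response: string, now = Date.now()): void {
  cache.set(key, { response, expires: now + TTL_MS });
}
```

Force-refresh is then just skipping `getCached` and overwriting via `putCached`.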
### 5. How to visualize consensus?

**Options**:

1. List view (like orch) - each model's response sequentially
2. Side-by-side - split screen with responses in columns
3. Gauge - visual consensus meter (% support)
4. Diff view - highlight agreements/disagreements

**Recommendation**: Progressive disclosure

- Initial: gauge + vote counts
- Expand: list view with reasoning
- Advanced: side-by-side diff view
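The gauge from option (3) can be a one-line text bar over parsed verdicts; a sketch with illustrative names:

```typescript
type Verdict = "SUPPORT" | "OPPOSE" | "NEUTRAL";

// Render e.g. "SUPPORT ██░░ 50% (1/2)" from a list of parsed verdicts.
function renderGauge(verdicts: Verdict[], width = 10): string {
  const support = verdicts.filter(v => v === "SUPPORT").length;
  const pct = verdicts.length ? support / verdicts.length : 0;
  const filled = Math.round(pct * width);
  const bar = "█".repeat(filled) + "░".repeat(width - filled);
  return `SUPPORT ${bar} ${Math.round(pct * 100)}% (${support}/${verdicts.length})`;
}
```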
---

## Next Steps
1. **Prototype Oracle extension** (today)
   - Port from shitty-extensions
   - Test with flash/gemini
   - Verify conversation context sharing

2. **Design consensus UI** (tomorrow)
   - Sketch the multi-model result layout
   - Decide on vote visualization
   - Mock up the add-to-context flow

3. **Implement model picker** (day 3)
   - Multi-select support
   - Quick keys (1-9 for single, checkboxes for multi)
   - Show cost/capabilities
   - Filter by authenticated models

4. **Build comparison view** (day 4-5)
   - Parallel query execution
   - Progress indicators
   - Side-by-side results
   - Diff highlighting

5. **Add orch tool wrapper** (day 6)
   - Register the tool for agent use
   - Map parameters to CLI args
   - Parse vote output

6. **Integration testing** (day 7)
   - Test with real conversations
   - Verify context sharing works
   - Check cost estimates
   - Test with slow models (timeout handling)

---
## Success Metrics

**Must Have**:

- [ ] `/oracle` command works with conversation context
- [ ] Model picker shows authenticated models only
- [ ] Results display with add-to-context option
- [ ] Multi-model query in parallel
- [ ] Vote parsing (SUPPORT/OPPOSE/NEUTRAL)
- [ ] Cost estimation before query

**Nice to Have**:

- [ ] Side-by-side comparison view
- [ ] Diff highlighting for disagreements
- [ ] Response caching (5 min TTL)
- [ ] Model recommendations based on query
- [ ] Export consensus to file/issue
- [ ] Serial strategies (refine, debate)

**Stretch Goals**:

- [ ] Synthesis mode with custom prompts
- [ ] Confidence scoring
- [ ] Conversation branching
- [ ] Historical consensus tracking
- [ ] Model capability filtering (vision/reasoning/coding)

---
## References

- [orch CLI](https://github.com/yourusername/orch) - Current implementation
- [shitty-extensions/oracle.ts](https://github.com/hjanuschka/shitty-extensions/blob/main/extensions/oracle.ts)
- [pi-mono extension docs](https://github.com/badlogic/pi-mono/blob/main/packages/coding-agent/docs/extensions.md)
- [pi-mono TUI docs](https://github.com/badlogic/pi-mono/blob/main/packages/tui/README.md)