Future Enhancement: Direct Screen Capture

Discovery

During implementation, we discovered that grim (the Wayland screenshot tool) can output directly to stdout:

grim - | file -
# Output: /dev/stdin: PNG image data, 174 x 174, 8-bit/color RGBA, non-interlaced

This opens up the possibility of skipping file-based screenshots entirely.
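
Two quick probes, runnable as-is, confirm that the stream is a complete PNG and show how large a full-screen capture actually is (a number that matters later when judging whether inline injection is practical):

# Confirm the stdout stream is a PNG, then measure its raw size
grim - | file -     # expect: /dev/stdin: PNG image data, ...
grim - | wc -c      # byte count of one full-screen capture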

Current Workflow

User action:

  1. Mod4+S → select region → space
  2. Screenshot saved to ~/Pictures/Screenshots/Screenshot-YYYY-MM-DD-HH-MM-SS.png
  3. Tell AI: "look at my screenshot"
  4. AI runs: ls -t ~/Pictures/Screenshots/*.png | head -1
  5. AI reads file and analyzes

Latency: 2-5 seconds (file I/O, directory scanning)
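
Step 4's ls -t one-liner works, but it errors out when the directory is empty or contains no PNGs. A slightly more defensive sketch of the same lookup (same directory convention as above):

#!/usr/bin/env bash
# Print the newest screenshot, or exit cleanly if there are none
shopt -s nullglob
files=(~/Pictures/Screenshots/*.png)
if (( ${#files[@]} == 0 )); then
    echo "No screenshots found" >&2
    exit 1
fi
ls -t "${files[@]}" | head -1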

Proposed Direct Capture Workflow

User action:

  1. Tell AI: "show me what's on my screen"
  2. AI runs: grim - | <inject into context>
  3. AI analyzes without file intermediary

Latency: <1 second (no file I/O)

Technical Questions (Unanswered)

Can AI read from stdin?

grim - | base64 | <how does AI ingest this?>

Unknown: Does OpenCode/Claude Code support image injection from stdin/base64?

Can AI read from clipboard?

grim - | wl-copy
# AI reads from clipboard with wl-paste?

Unknown: Does OpenCode/Claude Code have clipboard access?
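
One experiment that can be run today, independent of any AI tooling, is a clipboard round-trip with an explicit MIME type. If wl-paste hands back the same PNG bytes, the remaining question is only whether OpenCode/Claude Code can invoke wl-paste and ingest the result:

# Round-trip test: capture → clipboard → read back
grim - | wl-copy --type image/png       # offer the capture as image/png
wl-paste --list-types                   # image/png should appear here
wl-paste --type image/png | file -      # expect: /dev/stdin: PNG image data, ...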

Can we capture specific windows?

The niri compositor provides:

niri msg focused-window    # Get focused window info
niri msg windows           # List all windows
niri msg pick-window       # Mouse selection

grim supports regions:

grim -g "x,y widthxheight" -    # Capture specific region

Possibility:

  1. Get window geometry from niri
  2. Capture that specific region with grim
  3. Inject directly without saving
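
A rough sketch of those three steps. The exact shape of niri's JSON output is not confirmed here, so the jq filter and the geometry fields below are illustrative placeholders to be adjusted against real niri msg output:

#!/usr/bin/env bash
# Sketch: capture only the focused window (JSON field names are assumptions)
set -euo pipefail

# 1. Ask niri for the focused window as JSON
win_json=$(niri msg focused-window -j)

# 2. Build grim's "x,y WxH" geometry string from that JSON
#    (.x/.y/.width/.height are placeholders; inspect the real output first)
geometry=$(jq -r '"\(.x),\(.y) \(.width)x\(.height)"' <<<"$win_json")

# 3. Capture just that region and stream it to stdout, no file involved
grim -g "$geometry" -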

Implementation Options

Option A: Clipboard-Based (Easiest to Test)

#!/usr/bin/env bash
# skills/screenshot-capture/scripts/capture-screen.sh

# Capture entire screen to clipboard
grim - | wl-copy

# Tell AI it's in clipboard
echo "Screen captured to clipboard. Use wl-paste to read."

Pros:

  • Simple integration
  • Works with existing clipboard tools
  • No file cleanup needed

Cons:

  • Requires AI to support clipboard reading
  • Unclear if OpenCode/Claude Code can do this

Option B: Temp File (Current Approach)

#!/usr/bin/env bash
# What we currently do (implicitly)

TEMP_FILE="/tmp/screen-capture-$(date +%s).png"
grim "$TEMP_FILE"
echo "$TEMP_FILE"

# AI reads file, analyzes, could delete after

Pros:

  • Works with current AI image capabilities
  • Proven approach

Cons:

  • File I/O overhead
  • Temp file cleanup required
  • Not as elegant

Option C: Base64 Stdin (Most Direct)

#!/usr/bin/env bash
# Hypothetical direct injection

grim - | base64 | ai-inject-image --format png --encoding base64

Pros:

  • No files at all
  • Minimal latency
  • Clean architecture

Cons:

  • Requires AI tool support for stdin images
  • Completely unknown if possible
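
One thing that can be measured right now, regardless of tool support, is how large the injected payload would be. Base64 inflates the raw PNG by roughly a third, and that number matters for any context budget:

# Size of the would-be payload for a full screen vs. a small region
grim - | base64 -w0 | wc -c
grim -g "0,0 400x300" - | base64 -w0 | wc -c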

Next Steps to Validate

  1. Test clipboard reading:

    grim - | wl-copy
    # In OpenCode: "What's in the clipboard?"
    # Does it understand it's an image?
    
  2. Test temp file with auto-cleanup:

    TEMP=$(mktemp --suffix=.png)
    trap "rm -f $TEMP" EXIT
    grim "$TEMP"
    # AI analyzes
    # File auto-deleted on exit
    
  3. Research AI tool capabilities:

    • Check OpenCode documentation for image input methods
    • Check Claude Code documentation for image input methods
    • Test if base64-encoded images can be injected
  4. Test region capture:

    # Get focused window geometry (store it for the next step)
    GEOMETRY=$(niri msg focused-window -j | jq -r '.geometry')

    # Capture just that region
    grim -g "$GEOMETRY" -
    
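
The checks above could be rolled into a single throwaway script so the whole experiment takes a minute to run. Nothing here assumes anything about OpenCode/Claude Code itself; it only exercises the Wayland side:

#!/usr/bin/env bash
# One-shot validation of the Wayland side of the proposed workflow
set -u

echo "== Tooling =="
for tool in grim wl-copy wl-paste niri jq; do
    command -v "$tool" >/dev/null && echo "ok: $tool" || echo "MISSING: $tool"
done

echo "== 1. stdout capture =="
grim - | file -                     # expect PNG image data

echo "== 2. clipboard round-trip =="
grim - | wl-copy --type image/png
wl-paste --list-types               # expect image/png in the list

echo "== 3. temp file with auto-cleanup =="
TEMP=$(mktemp --suffix=.png)
trap 'rm -f "$TEMP"' EXIT
grim "$TEMP" && file "$TEMP"

echo "== 4. focused-window geometry (inspect the fields manually) =="
niri msg focused-window -j | jq .   # confirm which fields carry x/y/width/height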

User Experience Comparison

Current (File-Based)

User: "Look at my last screenshot"
AI: <finds file in ~/Pictures/Screenshots>
AI: <reads file>
AI: "I see a terminal window with..."
Time: 2-5 seconds

Proposed (Direct Capture)

User: "Show me what's on screen"
AI: <captures directly with grim>
AI: "I see a terminal window with..."
Time: <1 second

Advanced (Region Aware)

User: "What's in the focused window?"
AI: <gets geometry from niri>
AI: <captures that region only>
AI: "The focused window shows..."
Time: <1 second

Decision: Why We Didn't Implement This Now

  1. Unknown AI Capabilities: Don't know if OpenCode/Claude Code support non-file image input
  2. Unvalidated Workflow: Current file-based approach is proven to work
  3. User Request: User asked for "find my screenshots", not "capture my screen"
  4. YAGNI: Would be premature optimization without user feedback

The current implementation solves the stated problem. This enhancement becomes relevant only if users say:

  • "This is too slow"
  • "I want to capture what's on screen now, not find old files"
  • "Can you see my current window?"

Recommendation

Ship the file-based solution first (screenshot-latest skill).

After real usage, if users want:

  • Real-time screen capture → Investigate direct capture
  • Region selection → Integrate niri window geometry
  • Clipboard workflow → Test clipboard-based approach

Don't build it until users ask for it.

Technical Notes

grim capabilities verified:

  • Can output to stdout (grim -)
  • Outputs valid PNG format
  • Supports region capture (-g "x,y WxH")
  • Works with Wayland compositors (niri confirmed)

niri capabilities verified:

  • Can query window geometry (niri msg windows -j)
  • Can get focused window (niri msg focused-window)
  • Supports JSON output for parsing

Unknown capabilities:

  • Can OpenCode/Claude Code read from clipboard?
  • Can OpenCode/Claude Code accept base64 image data?
  • Can OpenCode/Claude Code accept stdin image data?
  • What's the actual latency difference in real usage?

References

  • man grim - Screenshot tool documentation
  • niri msg --help - Compositor IPC commands
  • man wl-clipboard - Wayland clipboard utilities

This document describes potential enhancements, not the current implementation. The current screenshot-latest skill uses a file-based approach intentionally.