Scaffold for agent capability benchmark harness (skills-qng9): - docs/specs/scenario-schema.md: YAML schema for test scenarios - tests/scenarios/: Easy, medium, hard example scenarios - tests/fixtures/: Python fixtures for testing Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
80 lines
2.1 KiB
YAML
80 lines
2.1 KiB
YAML
# Hard scenario: Debug and fix a race condition
|
|
id: fix-race-condition
|
|
title: Fix race condition in cache invalidation
|
|
difficulty: hard
|
|
tags: [python, concurrency, debugging, race-condition]
|
|
|
|
fixture:
|
|
source: flask-user-api # Same fixture, but with a bug
|
|
|
|
task:
|
|
description: |
|
|
Users are reporting stale data after updates. Investigation shows there's
|
|
a race condition in the cache invalidation logic.
|
|
|
|
Bug report:
|
|
- User updates their profile
|
|
- Immediately after, they see old data
|
|
- Refreshing a few seconds later shows correct data
|
|
|
|
The issue is intermittent and happens under load.
|
|
|
|
Find and fix the race condition.
|
|
|
|
context:
|
|
- path: src/routes/users.py
|
|
hint: The endpoint with the bug
|
|
- path: src/cache.py
|
|
hint: Cache implementation
|
|
- path: tests/test_users.py
|
|
hint: Existing tests (don't catch this bug)
|
|
|
|
execution:
|
|
mode: live # Too complex for scripted mode
|
|
timeout: 15m
|
|
|
|
live:
|
|
max_turns: 50
|
|
# No scripted version - agent must debug
|
|
|
|
verify:
|
|
properties:
|
|
- type: tests_pass
|
|
command: pytest tests/ -v
|
|
|
|
- type: custom
|
|
command: |
|
|
# Stress test for race condition
|
|
python tests/stress_test_cache.py --iterations=100 --concurrent=10
|
|
|
|
llm_judge:
|
|
enabled: true
|
|
model: opus # Use best model for complex evaluation
|
|
rubric:
|
|
- criterion: Correctly identified the race condition
|
|
weight: 1.0
|
|
- criterion: Fix actually resolves the race (not just hiding it)
|
|
weight: 1.5
|
|
- criterion: Fix doesn't introduce new bugs or performance issues
|
|
weight: 1.0
|
|
- criterion: Explanation of the bug is accurate
|
|
weight: 0.5
|
|
- criterion: Added test that would catch this regression
|
|
weight: 0.8
|
|
threshold: 0.8
|
|
|
|
human:
|
|
required: true # Hard scenarios need human verification
|
|
queue: hard-reviews
|
|
rubric:
|
|
- Race condition is correctly identified
|
|
- Fix is correct and complete
|
|
- No new concurrency issues introduced
|
|
- Performance impact is acceptable
|
|
|
|
benchmark:
|
|
enabled: true
|
|
runs: 3
|
|
dimensions:
|
|
models: [sonnet, opus] # Compare models on hard tasks
|