Run One Task
Prerequisites
- Installed the benchmark (uv sync completed)
- Configured .env with at least one API key (e.g. ANTHROPIC_API_KEY or OPENAI_API_KEY)
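A quick pre-flight check can confirm the API key prerequisite before a run. The sketch below is not part of the benchmark: `find_api_keys` is a hypothetical helper, and it assumes your `.env` uses simple `KEY=value` lines (optionally prefixed with `export`).

```python
import os

# Key names taken from the prerequisites above; extend for other providers.
KNOWN_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def find_api_keys(env_text, environ=None):
    """Return the known API key names that have a non-empty value.

    Hypothetical helper: checks simple KEY=value lines in the .env text
    and the process environment. Multi-line values are not handled.
    """
    environ = os.environ if environ is None else environ
    values = {}
    for raw in env_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip("'\"")
    return [key for key in KNOWN_KEYS if values.get(key) or environ.get(key)]
```

Calling `find_api_keys(open(".env").read())` from the project root returns an empty list when neither key is set, which is a reasonable signal to fix the configuration before running anything.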
Step 1: Pick a task
List available tasks:
Filter by domain:
Each row shows task_id, domain, status, and name. Use any task_id from the output.
Start with a fast task
physics.sod_shock_tube runs in under 5 minutes and has clear pass/fail criteria.
Step 2: Run it
ai4sci-bench run \
--tasks physics.sod_shock_tube \
--agent direct_llm \
--agent-config '{"model":"claude-sonnet-4-20250514"}' \
--sandbox task \
--prompt-levels b1
| Flag | Meaning |
|---|---|
| --tasks | Task ID(s) to run (comma-separated for multiple) |
| --agent | Which adapter to use (direct_llm, claude_code_cli, codex_cli) |
| --agent-config | JSON with model name and optional settings |
| --sandbox task | Isolate execution in a task-specific uv environment |
| --prompt-levels b1 | Run only the most guided prompt level |
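When launching runs from a script, shell quoting of the `--agent-config` JSON is an easy place to slip. The following is a minimal sketch that assembles the same command from Python. `build_run_command` is a hypothetical helper, not something the benchmark provides, and it uses only the flags listed in the table.

```python
import json
import shlex


def build_run_command(task_ids, agent, model, sandbox="task", prompt_levels=("b1",)):
    """Assemble the argv list for `ai4sci-bench run` (hypothetical helper)."""
    return [
        "ai4sci-bench", "run",
        "--tasks", ",".join(task_ids),
        "--agent", agent,
        # json.dumps guarantees valid JSON; compact separators match the example.
        "--agent-config", json.dumps({"model": model}, separators=(",", ":")),
        "--sandbox", sandbox,
        "--prompt-levels", ",".join(prompt_levels),
    ]


argv = build_run_command(
    ["physics.sod_shock_tube"], "direct_llm", "claude-sonnet-4-20250514"
)
# shlex.join quotes each argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

Passing the list straight to `subprocess.run(argv, check=True)` avoids shell quoting altogether.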
Step 3: What happens during a run
- Ground-truth generation: the framework runs generate_gt.py to create reference outputs for scoring
- Prompt delivery: the agent receives the task prompt, input data, and workspace
- Agent execution: the agent writes code and produces output artifacts
- Scoring: scorers compare agent outputs against the reference using gates and numerical checks
- Report: results are written as structured JSON with full provenance
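To make the scoring step more concrete, here is an illustrative sketch of how a gate differs from a numerical check. It is not the framework's scorer: the gate names, the relative L2 error metric, the tolerance, and the linear score ramp are all assumptions chosen for illustration.

```python
import math


def relative_l2_error(candidate, reference):
    """Relative L2 distance between two equal-length numeric sequences."""
    num = math.sqrt(sum((c - r) ** 2 for c, r in zip(candidate, reference)))
    den = math.sqrt(sum(r ** 2 for r in reference))
    return num / den if den else num


def score_output(candidate, reference, tolerance=0.05):
    """Hypothetical scorer: gates first, then a graded numerical check."""
    # Gates are hard preconditions; any failure short-circuits to a zero score.
    gates = {
        "has_output": candidate is not None,
        "same_length": candidate is not None and len(candidate) == len(reference),
        "all_finite": candidate is not None
        and all(math.isfinite(x) for x in candidate),
    }
    if not all(gates.values()):
        return {"final_score": 0.0, "gate_results": gates}
    # Numerical check: 100 at zero error, falling linearly to 0 at the tolerance.
    err = relative_l2_error(candidate, reference)
    score = 100.0 * max(0.0, 1.0 - err / tolerance)
    return {"final_score": score, "gate_results": gates}
```

The key names in the returned dictionary mirror the fields mentioned for the scored result JSON, but the real schema and scoring formula may differ.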
Step 4: Check results
View a summary:
Or inspect the raw output directory:
results/
    run_metadata.json            # Global run config
    physics.sod_shock_tube/
        <instance_id>.json       # Scored result with component scores
The scored result JSON includes final_score (0-100), component_scores, and gate_results.
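The raw files can also be read programmatically. The sketch below assumes the directory layout shown above and that `gate_results` is a mapping from gate name to a boolean. That shape is a guess, so adjust the helper if your files differ. Both function names are hypothetical.

```python
import json
from pathlib import Path


def summarize_result(result):
    """Pull out the three fields named above from one scored result dict."""
    gates = result.get("gate_results") or {}
    # Assumption: gate_results maps gate name -> True/False.
    failed = sorted(n for n, ok in gates.items() if not ok) if isinstance(gates, dict) else []
    return {
        "final_score": result.get("final_score"),
        "component_scores": result.get("component_scores"),
        "failed_gates": failed,
    }


def load_task_results(results_dir, task_id):
    """Summarize every <instance_id>.json under results_dir/task_id."""
    task_dir = Path(results_dir) / task_id
    return {
        path.name: summarize_result(json.loads(path.read_text(encoding="utf-8")))
        for path in sorted(task_dir.glob("*.json"))
    }
```

For example, `load_task_results("results", "physics.sod_shock_tube")` returns one summary per instance file, keyed by file name.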
Common options
| Option | Default | Description |
|---|---|---|
| --prompt-levels | b1,b2,b3,b4 | Which prompt levels to run |
| --sandbox | none | Isolation mode: none, task, os, linux_ns |
| --timeout | 3600 | Agent timeout in seconds |
| --instances-per-task | 1 | Number of parameter samples per task |
| --seed | 42 | Random seed for reproducibility |
Trying all prompt levels
Run the full B1-B4 autonomy ladder to see how agent performance degrades as guidance decreases:
ai4sci-bench run \
--tasks physics.sod_shock_tube \
--agent direct_llm \
--agent-config '{"model":"claude-sonnet-4-20250514"}' \
--sandbox task \
--prompt-levels b1,b2,b3,b4
- B1: Full execution guidance — algorithm, parameters, output format all specified
- B2: Partial guidance — methods suggested but not fully detailed
- B3: Minimal guidance — only the scientific goal and output contract
- B4: Distractor-aware — minimal guidance plus misleading suggestions
Troubleshooting
API key not found
Ensure your .env file has the correct key for your chosen model. The direct_llm adapter uses litellm and reads standard env vars (ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.).
Timeout errors
Some tasks require more than the default 3600s. Check the task's execution.agent_timeout_seconds in its task.yaml, or pass --timeout 7200.
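If you script your runs, the timeout can be derived from the task definition instead of hard-coded. This is a minimal sketch: `timeout_flag` is a hypothetical helper, and `task_config` is assumed to be the already-parsed contents of `task.yaml` (for instance via PyYAML's `yaml.safe_load`, if you have it installed). The source does not say whether the framework already applies the task's value on its own, so treat this as a convenience only.

```python
DEFAULT_TIMEOUT_S = 3600  # documented default for --timeout


def timeout_flag(task_config, default=DEFAULT_TIMEOUT_S):
    """Return ["--timeout", "<seconds>"] if the task declares more than the default.

    Reads execution.agent_timeout_seconds from the parsed task.yaml mapping.
    Returns an empty list when the key is absent or within the default.
    """
    declared = (task_config.get("execution") or {}).get("agent_timeout_seconds")
    if declared is None or int(declared) <= default:
        return []
    return ["--timeout", str(int(declared))]
```

The returned list can be appended to the argument list of an `ai4sci-bench run` invocation.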
Missing packages
Use --sandbox task to automatically install task-declared runtime.packages in an isolated environment.
Next steps
- Read Results — understand the output structure in detail
- Bring Your Agent — evaluate Claude Code CLI, Codex, or your own agent