Run One Task

Run your first benchmark evaluation in under 5 minutes. This guide walks you through picking a task, executing it, and reading the results.

Prerequisites

The benchmark is installed (uv sync has completed).
.env is configured with at least one API key (e.g. ANTHROPIC_API_KEY or OPENAI_API_KEY).

Step 1: Pick a task

List available tasks:

ai4sci-bench list

Filter by domain:

ai4sci-bench list --domain physics

Each row shows task_id, domain, status, and name. Use any task_id from the output.

Start with a fast task

physics.sod_shock_tube runs in under 5 minutes and has clear pass/fail criteria.

Step 2: Run it

ai4sci-bench run \
  --tasks physics.sod_shock_tube \
  --agent direct_llm \
  --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task \
  --prompt-levels b1

| Flag | Meaning |
| --- | --- |
| `--tasks` | Task ID(s) to run (comma-separated for multiple) |
| `--agent` | Which adapter to use (`direct_llm`, `claude_code_cli`, `codex_cli`) |
| `--agent-config` | JSON with model name and optional settings |
| `--sandbox task` | Isolate execution in a task-specific `uv` environment |
| `--prompt-levels b1` | Run only the most guided prompt level |
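If you launch runs from a script, it can help to assemble the command as an argument list rather than a hand-quoted string, so the JSON passed to --agent-config is never mangled by the shell. The sketch below uses only the flags shown in this guide; the helper name `build_run_command` is hypothetical and not part of the benchmark.

```python
import json
import shlex


def build_run_command(tasks, agent, model, sandbox="task", prompt_levels=("b1",)):
    """Return the argv list for an `ai4sci-bench run` invocation.

    Hypothetical helper: it only strings together flags shown in this guide.
    """
    return [
        "ai4sci-bench", "run",
        "--tasks", ",".join(tasks),  # comma-separated task IDs
        "--agent", agent,
        # Compact JSON, matching the form used in the command above.
        "--agent-config", json.dumps({"model": model}, separators=(",", ":")),
        "--sandbox", sandbox,
        "--prompt-levels", ",".join(prompt_levels),
    ]


argv = build_run_command(
    tasks=["physics.sod_shock_tube"],
    agent="direct_llm",
    model="claude-sonnet-4-20250514",
)

# Print a copy-pasteable shell command with correct quoting.
print(shlex.join(argv))
```

To execute the run instead of printing it, pass `argv` to `subprocess.run(argv, check=True)`; with a list, no shell quoting is involved.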

Step 3: What happens during a run

1. Ground-truth generation: the framework runs generate_gt.py to create reference outputs for scoring.
2. Prompt delivery: the agent receives the task prompt, input data, and workspace.
3. Agent execution: the agent writes code and produces output artifacts.
4. Scoring: scorers compare agent outputs against the reference using gates and numerical checks.
5. Report: results are written as structured JSON with full provenance.
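To make the scoring stage more concrete, here is a toy illustration of how gates and numerical checks can combine into one score. It is not the benchmark's scorer: the gate names, the relative tolerance `rtol`, the all-or-nothing component scores, and the equal weighting are all assumptions made for this sketch.

```python
import math


def score_output(agent, reference, rtol=0.05):
    """Toy scorer: evaluate gates first, then per-quantity numerical checks.

    `agent` and `reference` map quantity names to floats. Everything here
    (gate names, tolerance, weighting) is assumed for illustration only.
    """
    gate_results = {
        "all_quantities_present": set(reference) <= set(agent),
        "all_values_finite": all(
            isinstance(v, (int, float)) and math.isfinite(v) for v in agent.values()
        ),
    }
    if not all(gate_results.values()):
        # In this sketch a failed gate short-circuits scoring to zero.
        return {"final_score": 0.0, "component_scores": {}, "gate_results": gate_results}

    component_scores = {}
    for name, ref in reference.items():
        # Relative error against the reference value (guarding against ref == 0).
        rel_err = abs(agent[name] - ref) / max(abs(ref), 1e-12)
        component_scores[name] = 100.0 if rel_err <= rtol else 0.0

    final_score = (
        sum(component_scores.values()) / len(component_scores) if component_scores else 0.0
    )
    return {
        "final_score": final_score,
        "component_scores": component_scores,
        "gate_results": gate_results,
    }


# Made-up values: "a" is within 5% of the reference, "b" is not.
result = score_output({"a": 1.0, "b": 2.5}, {"a": 1.01, "b": 2.0})
print(result["final_score"], result["component_scores"])
```

The point of the gate step is that a structurally invalid output (missing quantities, NaN values) fails outright instead of earning partial numerical credit.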

Step 4: Check results

View a summary:

ai4sci-bench report

Or inspect the raw output directory:

results/
  run_metadata.json          # Global run config
  physics.sod_shock_tube/
    <instance_id>.json       # Scored result with component scores

The scored result JSON includes final_score (0-100), component_scores, and gate_results.
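For scripted inspection, the result files can be read with the standard library alone. The sketch below assumes the directory layout shown above (results/<task_id>/<instance_id>.json) and the three field names mentioned in this guide; the internal structure of component_scores and gate_results is not specified here, so the code prints them without interpreting them. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_results(results_dir="results"):
    """Yield (task_id, instance_id, result_dict) for each scored result file."""
    root = Path(results_dir)
    # Matches results/<task_id>/<instance_id>.json and skips the
    # top-level run_metadata.json, which is not a scored result.
    for path in sorted(root.glob("*/*.json")):
        with path.open(encoding="utf-8") as fh:
            yield path.parent.name, path.stem, json.load(fh)


def summarize(results_dir="results"):
    """Return one human-readable summary line per scored result."""
    lines = []
    for task_id, instance_id, result in load_results(results_dir):
        score = result.get("final_score")   # 0-100 according to this guide
        gates = result.get("gate_results")  # structure not specified here
        lines.append(f"{task_id}/{instance_id}: final_score={score} gate_results={gates}")
    return lines


if __name__ == "__main__":
    for line in summarize():
        print(line)
```

Using `.get()` rather than indexing keeps the script from crashing if a result file lacks one of the fields, for example after an aborted run.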

Common options

| Option | Default | Description |
| --- | --- | --- |
| `--prompt-levels` | `b1,b2,b3,b4` | Which prompt levels to run |
| `--sandbox` | `none` | Isolation mode: `none`, `task`, `os`, `linux_ns` |
| `--timeout` | 3600 | Agent timeout in seconds |
| `--instances-per-task` | 1 | Number of parameter samples per task |
| `--seed` | 42 | Random seed for reproducibility |

Trying all prompt levels

Run the full B1-B4 autonomy ladder to see how agent performance changes as guidance decreases and distractors are introduced:

ai4sci-bench run \
  --tasks physics.sod_shock_tube \
  --agent direct_llm \
  --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task \
  --prompt-levels b1,b2,b3,b4

B1: Full execution guidance — algorithm, parameters, output format all specified
B2: Partial guidance — methods suggested but not fully detailed
B3: Minimal guidance — only the scientific goal and output contract
B4: Distractor-aware — minimal guidance plus misleading suggestions
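Once a ladder run has finished, a per-level average makes the trend easy to read. The sketch below takes (prompt_level, final_score) pairs as input; how to recover the prompt level for each result file is not covered in this guide, so that pairing step is left to the caller. The function name and the example scores are made up for illustration.

```python
from collections import defaultdict
from statistics import mean

LEVELS = ("b1", "b2", "b3", "b4")  # the autonomy ladder described above


def mean_score_by_level(records):
    """Average final_score per prompt level.

    `records` is an iterable of (prompt_level, final_score) pairs.
    """
    buckets = defaultdict(list)
    for level, score in records:
        level = level.lower()
        if level not in LEVELS:
            raise ValueError(f"unknown prompt level: {level!r}")
        buckets[level].append(float(score))
    # Keep ladder order and skip levels that were not run.
    return {lvl: mean(buckets[lvl]) for lvl in LEVELS if lvl in buckets}


# Made-up scores, for illustration only.
example = [("b1", 90), ("b1", 80), ("b2", 70), ("b4", 40)]
print(mean_score_by_level(example))
```

With --instances-per-task greater than 1, each level contributes several scores, which is why the helper averages rather than reporting a single value.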

Troubleshooting

API key not found

Ensure your .env file has the correct key for your chosen model. The direct_llm adapter uses litellm and reads standard env vars (ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.).
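A quick pre-flight check can tell you whether a key is visible before you start a run. The sketch below looks for the two variables named in this guide, both in the process environment and in a local .env file. The line parser is deliberately simplistic and is not how the benchmark or litellm load configuration; the function names are hypothetical.

```python
import os
from pathlib import Path

CANDIDATE_KEYS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")  # the two named in this guide


def read_dotenv(path=".env"):
    """Parse simple KEY=VALUE lines; this is not a full .env parser."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for raw in p.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def configured_keys(dotenv_path=".env", environ=None):
    """Return the candidate keys that have a non-empty value in either source."""
    environ = os.environ if environ is None else environ
    dotenv = read_dotenv(dotenv_path)
    return [k for k in CANDIDATE_KEYS if environ.get(k) or dotenv.get(k)]


if __name__ == "__main__":
    found = configured_keys()
    print("configured:", ", ".join(found) if found else "none; add a key to .env")
```

Note that the check only confirms a key is present and non-empty; it does not verify that the key is valid or that it matches the provider of the model in --agent-config.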

Timeout errors

Some tasks need more than the default 3600 s. Check execution.agent_timeout_seconds in the task's task.yaml, or pass a larger value such as --timeout 7200.

Missing packages

Use --sandbox task to automatically install task-declared runtime.packages in an isolated environment.

Next steps

Read Results — understand the output structure in detail
Bring Your Agent — evaluate Claude Code CLI, Codex, or your own agent