Run One Task

Run your first benchmark evaluation in under 5 minutes. This guide walks you through picking a task, executing it, and reading the results.

Prerequisites

  • Installed the benchmark (uv sync completed)
  • Configured .env with at least one API key (e.g. ANTHROPIC_API_KEY or OPENAI_API_KEY)

Step 1: Pick a task

List available tasks:

ai4sci-bench list

Filter by domain:

ai4sci-bench list --domain physics

Each row shows task_id, domain, status, and name. Use any task_id from the output.

Start with a fast task

physics.sod_shock_tube runs in under 5 minutes and has clear pass/fail criteria.

Step 2: Run it

ai4sci-bench run \
  --tasks physics.sod_shock_tube \
  --agent direct_llm \
  --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task \
  --prompt-levels b1

Flag                 Meaning
--tasks              Task ID(s) to run (comma-separated for multiple)
--agent              Which adapter to use (direct_llm, claude_code_cli, codex_cli)
--agent-config       JSON with model name and optional settings
--sandbox task       Isolate execution in a task-specific uv environment
--prompt-levels b1   Run only the most guided prompt level
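
If you launch runs from a script, the same command can be assembled in Python, which avoids shell quoting of the JSON passed to --agent-config. This is a minimal sketch that uses only the flags listed above. The helper name build_run_command is hypothetical, and the sketch assumes ai4sci-bench is on your PATH.

```python
import json
import shlex


def build_run_command(tasks, agent, model, sandbox="task", prompt_levels=("b1",)):
    """Assemble the argv list for `ai4sci-bench run` (hypothetical helper)."""
    return [
        "ai4sci-bench", "run",
        "--tasks", ",".join(tasks),
        "--agent", agent,
        # json.dumps produces the same compact JSON as the shell example.
        "--agent-config", json.dumps({"model": model}, separators=(",", ":")),
        "--sandbox", sandbox,
        "--prompt-levels", ",".join(prompt_levels),
    ]


cmd = build_run_command(
    tasks=["physics.sod_shock_tube"],
    agent="direct_llm",
    model="claude-sonnet-4-20250514",
)
# Print a shell-safe rendering; to execute instead, pass the list to
# subprocess.run(cmd, check=True) so no extra quoting is needed.
print(shlex.join(cmd))
```

Passing the argument list directly to subprocess.run (without shell=True) hands the JSON string to the CLI unchanged.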

Step 3: What happens during a run

  1. Ground-truth generation — the framework runs generate_gt.py to create reference outputs for scoring
  2. Prompt delivery — the agent receives the task prompt, input data, and workspace
  3. Agent execution — the agent writes code and produces output artifacts
  4. Scoring — scorers compare agent outputs against the reference using gates and numerical checks
  5. Report — results are written as structured JSON with full provenance
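
To make the scoring step more concrete, here is a toy illustration of how pass/fail gates and numerical checks could combine into a 0-100 score. It is not the framework's scorer: the function name, the relative-tolerance rule, the equal weighting, and the choice to score zero when any gate fails are all assumptions made for illustration, and the numbers are invented.

```python
import math


def toy_score(gates, checks, rel_tol=0.05):
    """Illustrative only: combine pass/fail gates with numerical checks.

    gates:  dict of name -> bool (e.g. "output file exists")
    checks: dict of name -> (agent_value, reference_value)
    """
    gate_results = dict(gates)
    # Assumption: any failed gate zeroes the score.
    if not all(gate_results.values()):
        return {"final_score": 0.0, "component_scores": {},
                "gate_results": gate_results}

    # Assumption: each check is all-or-nothing within a relative tolerance.
    component_scores = {
        name: 100.0 if math.isclose(agent, ref, rel_tol=rel_tol) else 0.0
        for name, (agent, ref) in checks.items()
    }
    # Assumption: components are weighted equally.
    final = (sum(component_scores.values()) / len(component_scores)
             if component_scores else 0.0)
    return {"final_score": final, "component_scores": component_scores,
            "gate_results": gate_results}


# Made-up values: check_a is within 5% of its reference, check_b is not.
result = toy_score(
    gates={"output_exists": True},
    checks={"check_a": (0.85, 0.86), "check_b": (0.30, 0.42)},
)
print(result["final_score"])  # 50.0
```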

Step 4: Check results

View a summary:

ai4sci-bench report

Or inspect the raw output directory:

results/
  run_metadata.json          # Global run config
  physics.sod_shock_tube/
    <instance_id>.json       # Scored result with component scores

The scored result JSON includes final_score (0-100), component_scores, and gate_results.
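
For a quick look at the raw files, a short Python script can load each scored result and print the three fields named above. This is a sketch under assumptions: it expects the directory layout shown in this step, the helper names are hypothetical, and any structure inside component_scores or gate_results is printed as-is rather than interpreted.

```python
import json
from pathlib import Path


def load_results(results_dir, task_id):
    """Yield (file name, parsed JSON) for each scored instance of a task."""
    task_dir = Path(results_dir) / task_id
    for path in sorted(task_dir.glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            yield path.name, json.load(fh)


def summarize(results_dir="results", task_id="physics.sod_shock_tube"):
    """Return printable lines for the fields described in this guide."""
    lines = []
    for name, result in load_results(results_dir, task_id):
        # .get() guards against files that lack a field this sketch expects.
        lines.append(f"{name}: final_score={result.get('final_score')}")
        lines.append(f"  component_scores={result.get('component_scores')}")
        lines.append(f"  gate_results={result.get('gate_results')}")
    return lines


if __name__ == "__main__":
    print("\n".join(summarize()) or "no results found")
```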

Common options

Option                Default      Description
--prompt-levels       b1,b2,b3,b4  Which prompt levels to run
--sandbox             none         Isolation mode: none, task, os, linux_ns
--timeout             3600         Agent timeout in seconds
--instances-per-task  1            Number of parameter samples per task
--seed                42           Random seed for reproducibility

Trying all prompt levels

Run the full B1-B4 autonomy ladder to see how agent performance changes as guidance decreases:

ai4sci-bench run \
  --tasks physics.sod_shock_tube \
  --agent direct_llm \
  --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task \
  --prompt-levels b1,b2,b3,b4

  • B1: Full execution guidance (algorithm, parameters, and output format all specified)
  • B2: Partial guidance (methods suggested but not fully detailed)
  • B3: Minimal guidance (only the scientific goal and output contract)
  • B4: Distractor-aware (minimal guidance plus misleading suggestions)

Troubleshooting

API key not found

Ensure your .env file has the correct key for your chosen model. The direct_llm adapter uses litellm and reads standard env vars (ANTHROPIC_API_KEY, OPENAI_API_KEY, etc.).
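
A quick way to check which keys are visible before starting a run is sketched below. It is an independent diagnostic, not part of the benchmark: the helper names are hypothetical, the parser handles only plain KEY=VALUE lines (no `export` prefixes or multi-line values), and the key list covers just the two variables named above.

```python
import os
from pathlib import Path

KEY_NAMES = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY")


def read_dotenv(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    p = Path(path)
    if not p.is_file():
        return values
    for raw in p.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def available_keys(dotenv_path=".env", environ=None):
    """Return key names with a non-empty value in .env or the environment."""
    environ = os.environ if environ is None else environ
    merged = {**read_dotenv(dotenv_path), **environ}
    return [name for name in KEY_NAMES if merged.get(name)]


if __name__ == "__main__":
    found = available_keys()
    print("Keys found:", ", ".join(found) if found else "none")
```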

Timeout errors

Some tasks need longer than the default 3600-second agent timeout. Check execution.agent_timeout_seconds in the task's task.yaml, or pass a larger value such as --timeout 7200.

Missing packages

Use --sandbox task to automatically install task-declared runtime.packages in an isolated environment.

Next steps