
Bring Your Agent

Evaluate any LLM agent with the benchmark — use a built-in adapter or plug in your own.

Built-in adapters

The benchmark ships with three ready-to-use agent adapters:

direct_llm

Direct API-based evaluation. Sends the task prompt to an LLM and executes the returned code. Supports any model accessible via litellm (Claude, GPT, Gemini, OpenRouter).

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent direct_llm \
  --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task --prompt-levels b1

Config options:

Key       Description
model     Model identifier (e.g. claude-opus-4-20250514, gpt-5.5, openrouter/anthropic/claude-3.5-sonnet)
api_key   Optional; overrides env var
api_base  Optional; custom API endpoint

Best for: baselines, quick comparisons, any model with an API.

claude_code_cli

Uses the Claude Code CLI as a fully autonomous coding agent.

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent claude_code_cli \
  --agent-config '{"effort":"medium","permission_mode":"bypassPermissions"}' \
  --sandbox task --prompt-levels b1

Config options:

Key              Description
effort           Reasoning effort: low, medium (default), high, xhigh
permission_mode  bypassPermissions recommended for non-interactive use

Requires: claude CLI installed locally.

codex_cli

Uses OpenAI's Codex CLI agent with GPT models.

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent codex_cli \
  --agent-config '{"model":"gpt-5.5","effort":"medium"}' \
  --sandbox task --prompt-levels b1

Config options:

Key     Description
model   Default: gpt-5.5
effort  Reasoning effort: low, medium (default), high, xhigh

Default sandbox: workspace-write.

Sandbox modes

Control how agent code is isolated from the host system:

Mode      Isolation                     Use case
none      No isolation                  Quick local testing only
task      Task-specific uv environment  Default for most runs; installs task packages
os        Full Docker container         Maximum isolation, recommended for submissions
linux_ns  Linux namespace (Linux only)  Lightweight alternative to Docker

ai4sci-bench run --sandbox os ...

Sandbox requirement for submissions

Official leaderboard submissions require --sandbox task or --sandbox os. Results from none mode are not accepted.
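
The submission rule above can be checked mechanically before a long run. The following is a minimal sketch, not part of ai4sci-bench: the helper names are hypothetical, and it conservatively treats a command with no explicit --sandbox flag as ineligible, since the default is not assumed here.

```python
# Sandbox modes accepted for leaderboard submissions, per the rule stated above.
SUBMISSION_SANDBOXES = {"task", "os"}


def sandbox_from_argv(argv):
    """Return the value that follows --sandbox, or None if the flag is absent."""
    for i, arg in enumerate(argv):
        if arg == "--sandbox" and i + 1 < len(argv):
            return argv[i + 1]
    return None


def submission_eligible(argv):
    """True only when the command explicitly selects an accepted sandbox mode.

    Hypothetical helper: a missing --sandbox flag counts as ineligible so the
    check never relies on an assumed default.
    """
    return sandbox_from_argv(argv) in SUBMISSION_SANDBOXES
```

For example, submission_eligible(["ai4sci-bench", "run", "--sandbox", "none"]) is False, while the same command with --sandbox os passes.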

Tool modes

Control whether agents can access external resources:

Mode          Flag                              Behavior
Restricted    --tool-mode restricted (default)  No web search, no external tools
Search        --tool-mode search                Web search allowed
Unrestricted  --tool-mode unrestricted          All tools allowed

The --allow-external-tools flag is equivalent to --tool-mode search.

Multi-model comparison

Use batch-run to evaluate multiple agents in a single session. Ground truth is generated once and reused:

ai4sci-bench batch-run \
  --tasks physics.sod_shock_tube,physics.dp_contact_process \
  --agent 'direct_llm:{"model":"claude-sonnet-4-20250514"}' \
  --agent 'direct_llm:{"model":"gpt-5.5"}' \
  --sandbox task \
  --prompt-levels b1,b2,b3,b4

Results go to separate directories per agent, with a shared batch_records/ summary.
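
When the list of agents grows, hand-quoting each adapter:{json} spec becomes error-prone. The sketch below builds the same batch-run command programmatically. It uses only the flags shown above; the function name batch_run_argv is an illustrative assumption, not part of the benchmark.

```python
import json
import shlex


def batch_run_argv(tasks, agents, sandbox="task", prompt_levels=("b1",)):
    """Build the argv list for `ai4sci-bench batch-run`.

    agents is a list of (adapter_name, config_dict) pairs; each pair becomes
    one --agent 'adapter:{json}' argument in the spec format shown above.
    """
    argv = ["ai4sci-bench", "batch-run", "--tasks", ",".join(tasks)]
    for adapter, config in agents:
        # Compact separators keep the JSON free of spaces, as in the docs.
        spec = f"{adapter}:{json.dumps(config, separators=(',', ':'))}"
        argv += ["--agent", spec]
    argv += ["--sandbox", sandbox, "--prompt-levels", ",".join(prompt_levels)]
    return argv


argv = batch_run_argv(
    tasks=["physics.sod_shock_tube", "physics.dp_contact_process"],
    agents=[
        ("direct_llm", {"model": "claude-sonnet-4-20250514"}),
        ("direct_llm", {"model": "gpt-5.5"}),
    ],
    prompt_levels=["b1", "b2", "b3", "b4"],
)
print(shlex.join(argv))  # shell-quoted form, suitable for logging or pasting
```

Passing the list to subprocess.run(argv, check=True) would execute it without any shell quoting concerns, since each spec travels as a single argument.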

Custom agents

Via --agent-cmd

Wrap any CLI tool as an agent:

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent-cmd 'my-agent --workspace {workspace} --prompt {prompt_file}' \
  --sandbox task --prompt-levels b1

The command receives:

  • {workspace} — path to the working directory with input data
  • {prompt_file} — path to the rendered prompt markdown

Your agent must write output files to the workspace directory.
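
As a starting point, here is a minimal sketch of what the my-agent executable from the command above could look like. It accepts the --workspace and --prompt arguments used in that example, reads the prompt, and writes a file into the workspace. The solve function and the agent_notes.txt filename are placeholders invented for illustration; a real agent would produce whatever output files the task prompt asks for.

```python
#!/usr/bin/env python3
"""Placeholder agent for --agent-cmd (illustrative; not part of ai4sci-bench)."""
import argparse
import sys
from pathlib import Path


def solve(prompt, workspace):
    """Stand-in for real agent logic: records what the agent was given.

    A real agent would call a model or tool here and write the output files
    named in the task prompt. The filename below is a made-up placeholder.
    """
    input_names = sorted(p.name for p in workspace.iterdir() if p.is_file())
    out_path = workspace / "agent_notes.txt"  # hypothetical output name
    out_path.write_text(
        f"prompt_chars={len(prompt)}\ninputs={','.join(input_names)}\n",
        encoding="utf-8",
    )
    return out_path


def main(argv):
    parser = argparse.ArgumentParser(prog="my-agent")
    parser.add_argument("--workspace", type=Path, required=True)
    parser.add_argument("--prompt", type=Path, required=True)
    args = parser.parse_args(argv)

    if not args.workspace.is_dir():
        print(f"workspace not found: {args.workspace}", file=sys.stderr)
        return 1
    prompt = args.prompt.read_text(encoding="utf-8")
    solve(prompt, args.workspace)
    return 0


# Run only when arguments are supplied, so importing the module has no side effects.
if __name__ == "__main__" and len(sys.argv) > 1:
    raise SystemExit(main(sys.argv[1:]))
```

Saved as an executable named my-agent on the PATH, this matches the --agent-cmd template shown earlier; a non-zero exit code signals that the workspace was missing.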

Via HTTP

For remote agents, use the http_agent adapter (experimental):

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent http_agent \
  --agent-config '{"endpoint":"http://localhost:8080/solve"}' \
  --sandbox task

Preflight validation

Before expensive runs, validate that your setup works:

ai4sci-bench validate --preflight \
  --agent direct_llm \
  --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task

This checks API connectivity, sandbox initialization, and task environment setup without consuming the tokens of a full run.
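
One way to make the preflight step routine is to gate the full run on it. The sketch below does that with subprocess. It assumes that validate --preflight exits with a non-zero status on failure, which this page does not confirm, and the function name preflight_then_run is hypothetical.

```python
import subprocess

# Agent and sandbox flags shared by the preflight check and the full run.
COMMON = [
    "--agent", "direct_llm",
    "--agent-config", '{"model":"claude-sonnet-4-20250514"}',
    "--sandbox", "task",
]


def preflight_then_run(tasks, prompt_levels, runner=subprocess.run):
    """Run the preflight check and start the full run only if it succeeds.

    Assumption: `validate --preflight` returns a non-zero exit code on failure.
    The runner parameter exists so the flow can be exercised with a stub.
    Returns the list of commands that were executed.
    """
    executed = []
    preflight = ["ai4sci-bench", "validate", "--preflight", *COMMON]
    executed.append(preflight)
    if runner(preflight).returncode != 0:
        return executed  # stop before spending tokens on the full run

    run = [
        "ai4sci-bench", "run",
        "--tasks", ",".join(tasks),
        *COMMON,
        "--prompt-levels", ",".join(prompt_levels),
    ]
    executed.append(run)
    runner(run)
    return executed
```

Calling preflight_then_run(["physics.sod_shock_tube"], ["b1"]) would issue the two commands shown on this page in sequence, skipping the second if the first fails.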

Next steps