Bring Your Agent¶

Evaluate any LLM agent with the benchmark — use a built-in adapter or plug in your own.

Built-in adapters¶

The benchmark ships with three ready-to-use agent adapters:

`direct_llm`, `claude_code_cli`, `codex_cli`

`direct_llm`: Direct API-based evaluation. Sends the task prompt to an LLM and executes the returned code. Supports any model accessible via litellm (for example Claude, GPT, or Gemini models, including those routed through OpenRouter).

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent direct_llm \
  --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task --prompt-levels b1

Config options:

Key	Description
`model`	Model identifier (e.g. `claude-opus-4-20250514`, `gpt-5.5`, `openrouter/anthropic/claude-3.5-sonnet`)
`api_key`	Optional — overrides env var
`api_base`	Optional — custom API endpoint

Best for: baselines, quick comparisons, any model with an API.
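When the `--agent-config` JSON is typed by hand, shell quoting mistakes are easy to make. The following is a minimal sketch, not part of the benchmark itself, that assembles the same `run` command from Python. The helper name `build_run_command` is hypothetical; only the flags shown in the command above are used.

```python
import json
import shlex


def build_run_command(task, agent, agent_config, sandbox="task", prompt_levels=("b1",)):
    """Return the argv list for one `ai4sci-bench run` invocation.

    Hypothetical helper: it only arranges flags that appear in this page.
    """
    return [
        "ai4sci-bench", "run",
        "--tasks", task,
        "--agent", agent,
        # json.dumps produces valid JSON regardless of what the values contain.
        "--agent-config", json.dumps(agent_config, separators=(",", ":")),
        "--sandbox", sandbox,
        "--prompt-levels", ",".join(prompt_levels),
    ]


argv = build_run_command(
    "physics.sod_shock_tube",
    "direct_llm",
    {"model": "claude-sonnet-4-20250514"},
)

# shlex.join quotes each argument for a POSIX shell, so the printed line
# can be pasted into a terminal.
print(shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` avoids shell quoting altogether; `shlex.join` is only needed when the command has to be shown or logged as a single string.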

`claude_code_cli`: Uses the Claude Code CLI as a fully autonomous coding agent.

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent claude_code_cli \
  --agent-config '{"effort":"medium","permission_mode":"bypassPermissions"}' \
  --sandbox task --prompt-levels b1

Config options:

Key	Description
`effort`	Reasoning effort: `low`, `medium` (default), `high`, `xhigh`
`permission_mode`	`bypassPermissions` recommended for non-interactive use

Requires: claude CLI installed locally.

`codex_cli`: Uses OpenAI's Codex CLI agent with GPT models.

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent codex_cli \
  --agent-config '{"model":"gpt-5.5","effort":"medium"}' \
  --sandbox task --prompt-levels b1

Config options:

Key	Description
`model`	Default: `gpt-5.5`
`effort`	Reasoning effort: `low`, `medium` (default), `high`, `xhigh`

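Default Codex sandbox policy: workspace-write. This is the Codex CLI's own sandbox setting and is separate from the benchmark's --sandbox flag.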

Sandbox modes¶

Control how agent code is isolated from the host system:

Mode	Isolation	Use case
`none`	No isolation	Quick local testing only
`task`	Task-specific `uv` environment	Default for most runs — installs task packages
`os`	Full Docker container	Maximum isolation, recommended for submissions
`linux_ns`	Linux namespace (Linux only)	Lightweight alternative to Docker

ai4sci-bench run --sandbox os ...

Sandbox requirement for submissions

Official leaderboard submissions require --sandbox task or --sandbox os. Results from none mode are not accepted.
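A pipeline that prepares submissions could check this rule before launching a long run. The sketch below is an illustration under stated assumptions: the function and constant names are hypothetical, and `linux_ns` is treated as not accepted only because the note names `task` and `os` alone.

```python
# Sandbox modes listed in the table above.
KNOWN_SANDBOXES = {"none", "task", "os", "linux_ns"}

# Modes the note above names as accepted for official submissions.
# ASSUMPTION: linux_ns is excluded because the note does not mention it.
SUBMISSION_SANDBOXES = {"task", "os"}


def is_submission_eligible(sandbox: str) -> bool:
    """Return True if results from this sandbox mode can be submitted."""
    if sandbox not in KNOWN_SANDBOXES:
        raise ValueError(f"unknown sandbox mode: {sandbox!r}")
    return sandbox in SUBMISSION_SANDBOXES
```

Raising on unknown values, rather than returning `False`, keeps a typo such as `taks` from being silently reported as "not eligible".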

Tool modes¶

Control whether agents can access external resources:

Mode	Flag	Behavior
Restricted	`--tool-mode restricted` (default)	No web search, no external tools
Search	`--tool-mode search`	Web search allowed
Unrestricted	`--tool-mode unrestricted`	All tools allowed

The --allow-external-tools flag is equivalent to --tool-mode search.

Multi-model comparison¶

Use batch-run to evaluate multiple agents in a single session. Ground truth is generated once and reused:

ai4sci-bench batch-run \
  --tasks physics.sod_shock_tube,physics.dp_contact_process \
  --agent 'direct_llm:{"model":"claude-sonnet-4-20250514"}' \
  --agent 'direct_llm:{"model":"gpt-5.5"}' \
  --sandbox task \
  --prompt-levels b1,b2,b3,b4

Results go to separate directories per agent, with a shared batch_records/ summary.
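Each `--agent` value above has the form `adapter:{json}`. When many models are compared, the specs can be generated rather than typed. This is a minimal sketch with hypothetical helper names (`agent_spec`, `build_batch_command`); it uses only the flags shown in the command above.

```python
import json


def agent_spec(adapter: str, config: dict) -> str:
    """Format one --agent value as 'adapter:{json}' (compact JSON)."""
    return f"{adapter}:{json.dumps(config, separators=(',', ':'))}"


def build_batch_command(tasks, agents, sandbox="task", prompt_levels=("b1",)):
    """Return the argv list for one `ai4sci-bench batch-run` invocation.

    `agents` is a list of (adapter, config) pairs; each pair becomes one
    repeated --agent flag.
    """
    argv = ["ai4sci-bench", "batch-run", "--tasks", ",".join(tasks)]
    for adapter, config in agents:
        argv += ["--agent", agent_spec(adapter, config)]
    argv += ["--sandbox", sandbox, "--prompt-levels", ",".join(prompt_levels)]
    return argv


batch_argv = build_batch_command(
    tasks=["physics.sod_shock_tube", "physics.dp_contact_process"],
    agents=[
        ("direct_llm", {"model": "claude-sonnet-4-20250514"}),
        ("direct_llm", {"model": "gpt-5.5"}),
    ],
    prompt_levels=("b1", "b2", "b3", "b4"),
)
```

The resulting list can be passed directly to `subprocess.run(batch_argv)`, which sidesteps shell quoting of the embedded JSON.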

Custom agents¶

Via `--agent-cmd`¶

Wrap any CLI tool as an agent:

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent-cmd 'my-agent --workspace {workspace} --prompt {prompt_file}' \
  --sandbox task --prompt-levels b1

The command receives:

{workspace} — path to the working directory with input data
{prompt_file} — path to the rendered prompt markdown

Your agent must write output files to the workspace directory.
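As an illustration of this contract, here is a minimal sketch of what a `my-agent` executable could look like. It assumes the two placeholders are passed as `--workspace` and `--prompt`, as in the command above. The output file name `agent_output.txt` is hypothetical: the files a task actually expects are defined by its prompt, not by this sketch.

```python
import argparse
import sys
from pathlib import Path


def main(argv):
    parser = argparse.ArgumentParser(description="Toy agent for --agent-cmd")
    parser.add_argument("--workspace", type=Path, required=True,
                        help="working directory containing the input data")
    parser.add_argument("--prompt", type=Path, required=True,
                        help="path to the rendered prompt markdown")
    args = parser.parse_args(argv)

    prompt = args.prompt.read_text(encoding="utf-8")

    # A real agent would call a model or solver here. This placeholder only
    # records the prompt size so the control flow is visible.
    result = f"received a prompt of {len(prompt)} characters\n"

    # HYPOTHETICAL file name: the task prompt defines the expected outputs.
    out_path = args.workspace / "agent_output.txt"
    out_path.write_text(result, encoding="utf-8")
    return 0


# Run only when invoked with arguments, so importing the module or running
# it bare is a no-op.
if __name__ == "__main__" and len(sys.argv) > 1:
    raise SystemExit(main(sys.argv[1:]))
```

Keeping all writes under `--workspace` matters because that directory is where the benchmark looks for output files.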

Via HTTP¶

For remote agents, use the http_agent adapter (experimental):

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent http_agent \
  --agent-config '{"endpoint":"http://localhost:8080/solve"}' \
  --sandbox task

Preflight validation¶

Before expensive runs, validate that your setup works:

ai4sci-bench validate --preflight \
  --agent direct_llm \
  --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task

This checks API connectivity, sandbox initialization, and task environment setup without spending the tokens of a full run.
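In scripted workflows, the preflight check can gate the real run. The sketch below assumes that a failed preflight exits with a nonzero status, which this page does not state. The function name `run_with_preflight` and its `runner` parameter are hypothetical; `runner` exists only so the logic can be exercised without invoking the CLI.

```python
import subprocess


def run_with_preflight(common_args, run_args, runner=subprocess.run):
    """Run `validate --preflight`, then `run` only if the check succeeded.

    common_args: flags shared by both commands (agent, agent config, sandbox).
    run_args:    flags that only the full run needs (tasks, prompt levels).
    """
    preflight = ["ai4sci-bench", "validate", "--preflight", *common_args]
    # ASSUMPTION: a failed preflight exits with a nonzero status.
    if runner(preflight).returncode != 0:
        return False
    # check=True raises CalledProcessError if the full run itself fails.
    runner(["ai4sci-bench", "run", *common_args, *run_args], check=True)
    return True
```

A call might look like `run_with_preflight(["--agent", "direct_llm", "--agent-config", '{"model":"claude-sonnet-4-20250514"}', "--sandbox", "task"], ["--tasks", "physics.sod_shock_tube", "--prompt-levels", "b1"])`.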

Next steps¶

Read Results — understand scoring output
Submit Results — submit for the official leaderboard
