Bring Your Agent¶
Built-in adapters¶
The benchmark ships with three ready-to-use agent adapters:
Direct API-based evaluation. Sends the task prompt to an LLM and executes the returned code. Supports any model accessible via litellm (Claude, GPT, Gemini, OpenRouter).
ai4sci-bench run --tasks physics.sod_shock_tube \
--agent direct_llm \
--agent-config '{"model":"claude-sonnet-4-20250514"}' \
--sandbox task --prompt-levels b1
Config options:
| Key | Description |
|---|---|
model |
Model identifier (e.g. claude-opus-4-20250514, gpt-5.5, openrouter/anthropic/claude-3.5-sonnet) |
api_key |
Optional — overrides env var |
api_base |
Optional — custom API endpoint |
Best for: baselines, quick comparisons, any model with an API.
Uses the Claude Code CLI as a fully autonomous coding agent.
ai4sci-bench run --tasks physics.sod_shock_tube \
--agent claude_code_cli \
--agent-config '{"effort":"medium","permission_mode":"bypassPermissions"}' \
--sandbox task --prompt-levels b1
Config options:
| Key | Description |
|---|---|
effort |
Reasoning effort: low, medium (default), high, xhigh |
permission_mode |
bypassPermissions recommended for non-interactive use |
Requires: claude CLI installed locally.
Uses OpenAI's Codex CLI agent with GPT models.
ai4sci-bench run --tasks physics.sod_shock_tube \
--agent codex_cli \
--agent-config '{"model":"gpt-5.5","effort":"medium"}' \
--sandbox task --prompt-levels b1
Config options:
| Key | Description |
|---|---|
model |
Default: gpt-5.5 |
effort |
Reasoning effort: low, medium (default), high, xhigh |
Default sandbox: workspace-write.
Sandbox modes¶
Control how agent code is isolated from the host system:
| Mode | Isolation | Use case |
|---|---|---|
none |
No isolation | Quick local testing only |
task |
Task-specific uv environment |
Default for most runs — installs task packages |
os |
Full Docker container | Maximum isolation, recommended for submissions |
linux_ns |
Linux namespace (Linux only) | Lightweight alternative to Docker |
Sandbox requirement for submissions
Official leaderboard submissions require --sandbox task or --sandbox os. Results from none mode are not accepted.
Tool modes¶
Control whether agents can access external resources:
| Mode | Flag | Behavior |
|---|---|---|
| Restricted | --tool-mode restricted (default) |
No web search, no external tools |
| Search | --tool-mode search |
Web search allowed |
| Unrestricted | --tool-mode unrestricted |
All tools allowed |
The --allow-external-tools flag is equivalent to --tool-mode search.
Multi-model comparison¶
Use batch-run to evaluate multiple agents in a single session. Ground truth is generated once and reused:
ai4sci-bench batch-run \
--tasks physics.sod_shock_tube,physics.dp_contact_process \
--agent 'direct_llm:{"model":"claude-sonnet-4-20250514"}' \
--agent 'direct_llm:{"model":"gpt-5.5"}' \
--sandbox task \
--prompt-levels b1,b2,b3,b4
Results go to separate directories per agent, with a shared batch_records/ summary.
Custom agents¶
Via --agent-cmd¶
Wrap any CLI tool as an agent:
ai4sci-bench run --tasks physics.sod_shock_tube \
--agent-cmd 'my-agent --workspace {workspace} --prompt {prompt_file}' \
--sandbox task --prompt-levels b1
The command receives:
{workspace}— path to the working directory with input data{prompt_file}— path to the rendered prompt markdown
Your agent must write output files to the workspace directory.
Via HTTP¶
For remote agents, use the http_agent adapter (experimental):
ai4sci-bench run --tasks physics.sod_shock_tube \
--agent http_agent \
--agent-config '{"endpoint":"http://localhost:8080/solve"}' \
--sandbox task
Preflight validation¶
Before expensive runs, validate that your setup works:
ai4sci-bench validate --preflight \
--agent direct_llm \
--agent-config '{"model":"claude-sonnet-4-20250514"}' \
--sandbox task
This checks: API connectivity, sandbox initialization, task environment setup — without consuming full run tokens.
Next steps¶
- Read Results — understand scoring output
- Submit Results — submit for the official leaderboard