Overview

ASI-Bench evaluates LLM agents on compact but realistic AI for Science projects. This page explains what the benchmark measures, why the B1-B4 prompt ladder matters, and how results are intended to become auditable leaderboard entries.

Why This Benchmark

Many scientific benchmark tasks are either short-form question answering or isolated code-generation problems. ASI-Bench targets a different setting: project-level scientific workflows that require an agent to reason through a small but complete computational project.

The benchmark is designed to test whether an agent can:

understand a scientific objective
inspect provided data and decide what matters
choose an appropriate method
implement and debug the workflow
produce output artifacts that can be checked objectively

This makes the benchmark closer to the work pattern of a scientific assistant than to a single-prompt exam.

What Project-Level Means

In this benchmark, a task is not only a prompt. A task includes metadata, runtime requirements, data or instance generation, expected artifacts, scoring rules, and public-safe summaries for website rendering.
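As a rough illustration, the components listed above could be bundled into a single task record. The sketch below is not the ASI-Bench schema: the class name `TaskDefinition`, every field name, and the example values are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskDefinition:
    """Hypothetical task record; field names are illustrative only."""
    task_id: str
    title: str
    domain: str
    runtime: dict              # runtime requirements, e.g. time limit
    generator: str             # name of the data or instance generator
    expected_artifacts: tuple  # files a run is expected to produce
    scoring: dict              # scoring rules
    public_summary: str        # public-safe text for website rendering

    def missing_fields(self) -> list:
        """Return the names of components that were left empty."""
        return [name for name, value in vars(self).items() if not value]


# Made-up example values, not a real benchmark task.
task = TaskDefinition(
    task_id="example-001",
    title="Fit a decay curve",
    domain="physics",
    runtime={"time_limit_s": 600},
    generator="decay_curve_generator",
    expected_artifacts=("answer.json", "fit.png"),
    scoring={"metric": "relative_error"},
    public_summary="Estimate a decay constant from noisy measurements.",
)
print(task.missing_fields())
```

Treating an empty component as an error early makes it harder to publish a task that has a prompt but no scoring rule or public summary.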

A successful agent run should leave behind evidence:

generated data files or figures
implementation code
structured answer artifacts
scoreable outputs
run metadata and provenance

That evidence is important because the website is intended to support official leaderboard results, not just informal model comparisons.
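One way to make the evidence requirement checkable is to scan a run directory for each category of artifact. The following is a minimal sketch: the mapping `REQUIRED_EVIDENCE`, the file name `answer.json`, and the function `missing_evidence` are hypothetical, and the real run layout may differ.

```python
from pathlib import Path

# Hypothetical evidence categories mapped to glob patterns.
REQUIRED_EVIDENCE = {
    "implementation code": "*.py",
    "structured answer": "answer.json",
    "run metadata": "run_metadata.json",
}


def missing_evidence(run_dir, required=REQUIRED_EVIDENCE):
    """Return the sorted labels of evidence categories with no matching file."""
    run_dir = Path(run_dir)
    return sorted(
        label
        for label, pattern in required.items()
        if not any(run_dir.rglob(pattern))  # search subdirectories too
    )
```

A run would count as complete under this sketch only when the returned list is empty.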

Benchmark family	Typical focus	ASI-Bench difference
QA-style science benchmarks	answer correctness	evaluates multi-step scientific workflows and artifacts
coding benchmarks	code generation or issue repair	adds scientific method selection and domain interpretation
ScienceAgentBench / AstaBench	broad science-agent evaluation	emphasizes project-level tasks with B1-B4 prompt-level control
SkillsBench-style agent benchmarks	agent/tool skill use	uses scientific tasks and reproducible scoring as the primary surface

The benchmark therefore sits between narrow coding tests and unconstrained open-ended research.

B1-B4 Prompt Levels

ASI-Bench evaluates the same scientific objective under four prompt levels:

B1: strongest guidance, focused on execution.
B2: partial guidance, focused on domain understanding plus execution.
B3: minimal guidance, focused on autonomous scientific problem solving.
B4: B3 plus distractor content, focused on prioritization and robustness.

The goal is to measure how much scaffolding an agent needs before it can solve a scientific workflow reliably.

B1-B4 Prompt Ladder

Same task goal, same data, same evaluation; guidance decreases from B1 to B4 while autonomy and prioritization demands increase.

B1 (execution-focused): strongest guidance, mostly specified method, reliable implementation.
B2 (partial guidance): method hints remain, parameters loosen, domain understanding matters.
B3 (minimal guidance): task goal stays fixed while method choice becomes the challenge.
B4 (distractor-aware): B3 plus irrelevant context, testing focus and robustness.
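The ladder can be pictured as four prompts assembled from one shared objective. The sketch below is one assumed reading of the levels, not the benchmark's prompt templates: the function `build_prompt` and its arguments are hypothetical.

```python
def build_prompt(level, objective, method_hint, parameters, distractor):
    """Assemble the prompt for one level from a shared objective (assumed rules)."""
    parts = [objective]
    if level == "B1":
        # Strongest guidance: method and parameters are both spelled out.
        parts += [f"Method: {method_hint}", f"Parameters: {parameters}"]
    elif level == "B2":
        # Partial guidance: the method hint stays, the parameters are dropped.
        parts.append(f"Hint: {method_hint}")
    elif level == "B4":
        # B3 plus irrelevant context.
        parts.append(distractor)
    elif level != "B3":
        raise ValueError(f"unknown prompt level: {level}")
    # B3 falls through with the objective only.
    return "\n".join(parts)


# Made-up example inputs.
prompts = {
    level: build_prompt(
        level,
        objective="Estimate the decay constant from data.csv.",
        method_hint="nonlinear least squares",
        parameters="initial guess 0.1",
        distractor="Note: the lab was repainted last year.",
    )
    for level in ("B1", "B2", "B3", "B4")
}
```

Because the objective string is identical at every level, score differences between levels can be attributed to the amount of guidance rather than to a change in the task.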

Task Lifecycle

The public website is generated from structured benchmark metadata. In the long term, each task page should expose:

task title and identifier
domain and subdomain
public summary
expected output types
high-level evaluation summary
runtime and sandbox notes
safe prompt excerpt

The current public catalog is intentionally scoped to tasks that are ready to be shown on the benchmark site.
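Generating pages from structured metadata suggests an allowlist step, so that private fields such as ground truth never reach the site. This is a minimal sketch under that assumption: `PUBLIC_FIELDS`, its key names, and `public_view` are hypothetical and only mirror the list of page fields above.

```python
# Hypothetical allowlist of keys that may appear on a public task page.
PUBLIC_FIELDS = (
    "task_id",
    "title",
    "domain",
    "subdomain",
    "public_summary",
    "output_types",
    "evaluation_summary",
    "runtime_notes",
    "prompt_excerpt",
)


def public_view(task_metadata):
    """Keep only allowlisted keys; anything else is dropped."""
    return {key: task_metadata[key] for key in PUBLIC_FIELDS if key in task_metadata}
```

An allowlist fails closed: a newly added private field stays hidden unless someone deliberately adds it to `PUBLIC_FIELDS`.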

Evaluation Workflow

Figure: Benchmark workflow from task definition to scoreable outputs and website-ready reporting artifacts.

At a high level, the evaluation loop is:

Read the task prompt and inspect the provided data.
Decide on an appropriate scientific or computational method.
Implement the solution and generate required artifacts.
Produce structured outputs such as data files, figures, and code.
Score the result against benchmark evaluation rules.
Summarize the run in reporting artifacts that can power the public website.
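The scoring step can be illustrated with a simple rule that compares a structured answer against ground truth. The rule below (fraction of numeric keys matched within a relative tolerance) is an assumption chosen for illustration, not the benchmark's scoring rule, and `score_answer` is a hypothetical name.

```python
import math


def score_answer(answer, truth, rel_tol=1e-3):
    """Fraction of ground-truth keys matched within a relative tolerance."""
    if not truth:
        raise ValueError("ground truth must not be empty")
    hits = 0
    for key, expected in truth.items():
        got = answer.get(key)
        # Missing or non-numeric values simply count as misses.
        if isinstance(got, (int, float)) and math.isclose(got, expected, rel_tol=rel_tol):
            hits += 1
    return hits / len(truth)


# Made-up values: one key within tolerance, one outside it.
truth = {"energy": 1.0, "period": 2.0}
answer = {"energy": 1.0004, "period": 2.5}
print(score_answer(answer, truth))  # 0.5
```

A rule of this shape depends only on the output artifact, which is what allows a run to be checked without inspecting how the agent arrived at the answer.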

Examples of result artifacts include:

run_metadata.json for run-level provenance
per-instance result JSON files
batch_overview.json
task_scoreboard.json
task_level_long.json
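To show how per-instance results might roll up into a task-level summary, the sketch below averages scores by task and prompt level. The record keys (`task_id`, `level`, `score`) and the function `task_scoreboard` are assumptions; the actual layout of `task_scoreboard.json` is not described here.

```python
from collections import defaultdict
from statistics import mean


def task_scoreboard(instance_results):
    """Average per-instance scores by task and prompt level (assumed layout)."""
    grouped = defaultdict(list)
    for record in instance_results:
        grouped[(record["task_id"], record["level"])].append(record["score"])

    board = defaultdict(dict)
    for (task_id, level), scores in sorted(grouped.items()):
        board[task_id][level] = mean(scores)
    return dict(board)


# Made-up per-instance records.
results = [
    {"task_id": "t1", "level": "B1", "score": 1.0},
    {"task_id": "t1", "level": "B1", "score": 0.5},
    {"task_id": "t1", "level": "B4", "score": 0.0},
]
print(task_scoreboard(results))  # {'t1': {'B1': 0.75, 'B4': 0.0}}
```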

Reproducibility and Contamination Control

The benchmark design emphasizes:

sandboxed execution modes
explicit runtime requirements
structured output contracts
parameterized ground-truth generation
provenance attached to saved results

Parameterized generation helps reduce contamination risk and makes it easier to produce multiple scoreable instances without hand-authoring every case.
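A minimal sketch of parameterized generation, assuming a toy task (recover the slope and intercept of a noisy line): each seed yields a different instance together with its ground truth, and the same seed always reproduces the same instance. The function `generate_instance` and the task itself are hypothetical.

```python
import random


def generate_instance(seed, n_points=20):
    """Generate one toy instance and its ground truth from a seed."""
    rng = random.Random(seed)  # local RNG keeps instances reproducible per seed
    slope = rng.uniform(0.5, 3.0)
    intercept = rng.uniform(-1.0, 1.0)
    xs = [i / (n_points - 1) for i in range(n_points)]
    ys = [slope * x + intercept + rng.gauss(0.0, 0.01) for x in xs]
    return {
        "seed": seed,
        "data": list(zip(xs, ys)),  # what the agent would be given
        "truth": {"slope": slope, "intercept": intercept},  # kept for scoring
    }
```

Because the ground truth is derived from the seed rather than written by hand, fresh instances can be issued without their answers having appeared anywhere beforehand.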

Leaderboard Readiness

The official leaderboard should eventually show reviewed runs accepted by benchmark maintainers. Each official result should include:

agent or harness name
model name
overall score
B1-B4 breakdown
task coverage
benchmark version
evaluation date
source or trace links when available
trust label such as official, reproduced, community, or unverified

The trust label matters because official baselines and community submissions should not be mixed without clear provenance.
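The fields listed for an official result could be validated with a small record type. The sketch below is illustrative only: the class `LeaderboardEntry`, its field names, and the unweighted mean used for the overall score are assumptions, not the leaderboard's actual schema or aggregation rule.

```python
from dataclasses import dataclass

TRUST_LABELS = ("official", "reproduced", "community", "unverified")
LEVELS = ("B1", "B2", "B3", "B4")


@dataclass(frozen=True)
class LeaderboardEntry:
    """Hypothetical leaderboard record; field names are illustrative only."""
    agent: str
    model: str
    level_scores: dict        # score per prompt level, keyed by "B1".."B4"
    tasks_covered: int
    benchmark_version: str
    evaluation_date: str
    trust: str = "unverified"  # default to the least trusted label

    def __post_init__(self):
        if self.trust not in TRUST_LABELS:
            raise ValueError(f"unknown trust label: {self.trust!r}")
        missing = [lvl for lvl in LEVELS if lvl not in self.level_scores]
        if missing:
            raise ValueError(f"missing level scores: {missing}")

    @property
    def overall_score(self):
        # Unweighted mean over the four levels (an assumption).
        return sum(self.level_scores[lvl] for lvl in LEVELS) / len(LEVELS)


# Made-up example values.
entry = LeaderboardEntry(
    agent="example-harness",
    model="example-model",
    level_scores={"B1": 1.0, "B2": 0.75, "B3": 0.5, "B4": 0.25},
    tasks_covered=10,
    benchmark_version="0.1",
    evaluation_date="2025-01-01",
    trust="community",
)
print(entry.overall_score)  # 0.625
```

Defaulting `trust` to `unverified` means an entry is only shown as official when a maintainer sets that label explicitly.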