Overview
Why This Benchmark
Many scientific benchmark tasks are either short-form question answering or isolated code-generation problems. ASI-Bench targets a different setting: project-level scientific workflows that require an agent to reason through a small but complete computational project.
The benchmark is designed to test whether an agent can:
- understand a scientific objective
- inspect provided data and decide what matters
- choose an appropriate method
- implement and debug the workflow
- produce output artifacts that can be checked objectively
This makes the benchmark closer to the work pattern of a scientific assistant than to a single-prompt exam.
What Project-Level Means
In this benchmark, a task is not only a prompt. A task includes metadata, runtime requirements, data or instance generation, expected artifacts, scoring rules, and public-safe summaries for website rendering.
A successful agent run should leave behind evidence:
- generated data files or figures
- implementation code
- structured answer artifacts
- scoreable outputs
- run metadata and provenance
That evidence is important because the website is intended to support official leaderboard results, not just informal model comparisons.
How It Differs From Related Benchmarks
| Benchmark family | Typical focus | ASI-Bench difference |
|---|---|---|
| QA-style science benchmarks | answer correctness | evaluates multi-step scientific workflows and artifacts |
| coding benchmarks | code generation or issue repair | adds scientific method selection and domain interpretation |
| ScienceAgentBench / AstaBench | broad science-agent evaluation | emphasizes project-level tasks with B1-B4 prompt-level control |
| SkillsBench-style agent benchmarks | agent/tool skill use | uses scientific tasks and reproducible scoring as the primary surface |
The benchmark therefore sits between narrow coding tests and unconstrained open-ended research.
B1-B4 Prompt Levels
ASI-Bench evaluates the same scientific objective under four prompt levels:
- B1: strongest guidance, focused on execution.
- B2: partial guidance, focused on domain understanding plus execution.
- B3: minimal guidance, focused on autonomous scientific problem solving.
- B4: B3 plus distractor content, focused on prioritization and robustness.
The goal is to measure how much scaffolding an agent needs before it can solve a scientific workflow reliably.
Figure: B1-B4 Prompt Ladder. Same task goal, same data, same evaluation; guidance decreases from B1 to B4 while autonomy and prioritization demands increase.
- B1, execution-focused: strongest guidance, mostly specified method, reliable implementation.
- B2, partial guidance: method hints remain, parameters loosen, domain understanding matters.
- B3, minimal guidance: task goal stays fixed while method choice becomes the challenge.
- B4, distractor-aware: B3 plus irrelevant context, testing focus and robustness.
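The four levels can be sketched as a small data structure, together with a helper that averages scores per level so the effect of reduced scaffolding is visible. This is a minimal illustration, not the benchmark's own code: `PromptLevel`, `PROMPT_LEVELS`, `mean_score_by_level`, and the `(level_name, score)` input shape are all assumptions made for this example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class PromptLevel:
    """One rung of the B1-B4 ladder (hypothetical representation)."""
    name: str
    guidance: str
    focus: str
    has_distractors: bool


# Descriptions follow the level list above; the field layout is an assumption.
PROMPT_LEVELS = (
    PromptLevel("B1", "strongest", "execution", False),
    PromptLevel("B2", "partial", "domain understanding plus execution", False),
    PromptLevel("B3", "minimal", "autonomous scientific problem solving", False),
    PromptLevel("B4", "minimal", "prioritization and robustness", True),
)


def mean_score_by_level(results):
    """Average (level_name, score) pairs into one mean score per level.

    Levels with no results are left out of the returned dict.
    """
    grouped = {level.name: [] for level in PROMPT_LEVELS}
    for level_name, score in results:
        grouped[level_name].append(score)  # a KeyError flags an unknown level
    return {name: mean(scores) for name, scores in grouped.items() if scores}
```

Comparing the per-level means for one agent on the same task is one simple way to read off how much guidance it needed.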
Task Lifecycle
The public website is generated from structured benchmark metadata. In the long term, each task page should expose:
- task title and identifier
- domain and subdomain
- public summary
- expected output types
- high-level evaluation summary
- runtime and sandbox notes
- safe prompt excerpt
The current public catalog is intentionally scoped to tasks that are ready to be shown on the benchmark site.
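The fields listed above could be carried by a record like the one below. This is a sketch under stated assumptions: the class name `TaskPage`, the field names, and the `ready_for_site` flag are hypothetical and do not describe the benchmark's actual metadata schema.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class TaskPage:
    """Public-facing view of one task (hypothetical field names)."""
    task_id: str
    title: str
    domain: str
    subdomain: str
    public_summary: str
    expected_output_types: tuple
    evaluation_summary: str
    runtime_notes: str
    safe_prompt_excerpt: str
    ready_for_site: bool = False  # assumed flag for catalog scoping


def public_catalog(tasks):
    """Return plain dicts for the tasks that are marked ready to publish."""
    return [asdict(task) for task in tasks if task.ready_for_site]
```

Keeping the public record separate from the full task definition makes it harder to leak scoring details or full prompts onto the site by accident.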
Evaluation Workflow
At a high level, the evaluation loop is:
- Read the task prompt and inspect the provided data.
- Decide on an appropriate scientific or computational method.
- Implement the solution and generate required artifacts.
- Produce structured outputs such as data files, figures, and code.
- Score the result against benchmark evaluation rules.
- Summarize the run in reporting artifacts that can power the public website.
Examples of result artifacts include:
- `run_metadata.json` for run-level provenance
- per-instance result JSON files
- `batch_overview.json`
- `task_scoreboard.json`
- `task_level_long.json`
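A reporting step might begin by checking that a run directory actually contains these artifacts. The sketch below assumes the four named files sit directly inside the run directory and that each holds valid JSON; the directory layout and the helper name `check_run_directory` are assumptions, and the per-instance result files are not checked because their naming is not specified here.

```python
import json
import tempfile
from pathlib import Path

# File names come from the artifact list above; the flat layout is an assumption.
REQUIRED_ARTIFACTS = (
    "run_metadata.json",
    "batch_overview.json",
    "task_scoreboard.json",
    "task_level_long.json",
)


def check_run_directory(run_dir):
    """Map each required artifact to 'ok', 'missing', or 'invalid json'."""
    status = {}
    for name in REQUIRED_ARTIFACTS:
        path = Path(run_dir) / name
        if not path.is_file():
            status[name] = "missing"
            continue
        try:
            json.loads(path.read_text(encoding="utf-8"))
            status[name] = "ok"
        except json.JSONDecodeError:
            status[name] = "invalid json"
    return status


if __name__ == "__main__":
    # Demo on a throwaway directory that holds only one of the four files.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "run_metadata.json").write_text("{}", encoding="utf-8")
        for name, state in check_run_directory(tmp).items():
            print(f"{name}: {state}")
```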
Reproducibility and Contamination Control
The benchmark design emphasizes:
- sandboxed execution modes
- explicit runtime requirements
- structured output contracts
- parameterized ground-truth generation
- provenance attached to saved results
Parameterized generation helps reduce contamination risk and makes it easier to produce multiple scoreable instances without hand-authoring every case.
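To make the idea concrete, here is a toy example of parameterized generation: a seed determines both the data an agent would see and the ground truth a scorer would keep. The line-fitting task, the parameter ranges, the tolerance, and the names `Instance`, `generate_instance`, and `score_answer` are all invented for illustration and are not taken from any ASI-Bench task.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    seed: int
    xs: tuple
    ys: tuple
    true_slope: float  # held by the scorer, not shown to the agent


def generate_instance(seed, n_points=20):
    """Build one noisy straight-line dataset whose parameters depend on the seed."""
    rng = random.Random(seed)  # local RNG, so the same seed reproduces the instance
    slope = rng.uniform(0.5, 5.0)
    intercept = rng.uniform(-1.0, 1.0)
    xs = tuple(float(i) for i in range(n_points))
    ys = tuple(slope * x + intercept + rng.gauss(0.0, 0.1) for x in xs)
    return Instance(seed, xs, ys, slope)


def score_answer(instance, reported_slope, rel_tol=0.05):
    """Return 1.0 if the reported slope is within rel_tol of the truth, else 0.0."""
    error = abs(reported_slope - instance.true_slope) / abs(instance.true_slope)
    return 1.0 if error <= rel_tol else 0.0
```

Because the numeric answer changes with the seed, a memorized answer from one instance does not transfer to another, while the scoring rule stays the same.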
Leaderboard Readiness
The official leaderboard should eventually show reviewed runs accepted by benchmark maintainers. Each official result should include:
- agent or harness name
- model name
- overall score
- B1-B4 breakdown
- task coverage
- benchmark version
- evaluation date
- source or trace links when available
- trust label such as `official`, `reproduced`, `community`, or `unverified`
The trust label matters because official baselines and community submissions should not be mixed without clear provenance.
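One way to keep that provenance explicit is to validate the trust label when a result record is created and to group entries by label before display. The sketch below is an assumption-laden illustration: `LeaderboardEntry`, its field names, and `split_by_trust` are hypothetical, and the label set treats the four example labels as exhaustive, which the text above does not guarantee.

```python
from dataclasses import dataclass
from typing import Optional

# The four example labels from the list above, treated here as the full set.
TRUST_LABELS = ("official", "reproduced", "community", "unverified")


@dataclass(frozen=True)
class LeaderboardEntry:
    """One leaderboard row (hypothetical field names)."""
    agent: str
    model: str
    overall_score: float
    level_scores: dict  # keyed by prompt level name: "B1" through "B4"
    task_coverage: int
    benchmark_version: str
    evaluation_date: str
    trust_label: str
    trace_url: Optional[str] = None  # source or trace link, when available

    def __post_init__(self):
        # Reject entries whose provenance label is not recognized.
        if self.trust_label not in TRUST_LABELS:
            raise ValueError(f"unknown trust label: {self.trust_label!r}")


def split_by_trust(entries):
    """Group entries by trust label so different kinds of result are not mixed."""
    groups = {label: [] for label in TRUST_LABELS}
    for entry in entries:
        groups[entry.trust_label].append(entry)
    return groups
```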