Paper & Citation

Status: The benchmark paper is in preparation. This page will be updated with the arXiv preprint link and full citation once published.

Benchmark Overview

Title: ASI-Bench: A Project-level Benchmark for Evaluating LLM Agents on AI for Science
Authors: (In preparation)
Venue: arXiv preprint (forthcoming)
Repository: github.com/zjw49246/Agent-AI4Sci-Bench
Website: zjw49246.github.io/Agent-AI4Sci-Bench

Provisional Citation

If you use this benchmark before the paper is published, please cite the repository:

@misc{agentai4scibench2026,
  title        = {ASI-Bench: A Project-level Benchmark for
                  Evaluating LLM Agents on AI for Science},
  author       = {ASI-Bench Contributors},
  year         = {2026},
  howpublished = {\url{https://github.com/zjw49246/Agent-AI4Sci-Bench}},
  note         = {Paper in preparation. Check the repository for updates.}
}

Citation will be updated

Once the arXiv preprint is published, the provisional BibTeX entry above will be replaced with the full entry, including the author list, eprint ID, and venue information.
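
For readers citing the provisional entry from a LaTeX document, the following is a minimal sketch rather than an official template. It assumes the entry has been saved in a file named refs.bib (a hypothetical file name) and that the standard `plain` bibliography style is acceptable. The entry's `howpublished` field uses `\url`, so the `url` (or `hyperref`) package needs to be loaded.

```latex
% Minimal sketch; refs.bib is a hypothetical file name holding the provisional entry.
\documentclass{article}
\usepackage{url} % provides \url, which the howpublished field relies on

\begin{document}

% The citation key matches the provisional BibTeX entry.
We evaluate our agent on ASI-Bench~\cite{agentai4scibench2026}.

\bibliographystyle{plain}
\bibliography{refs}

\end{document}
```

The citation key may change when the full entry is published, so it is worth rechecking the repository before submitting a camera-ready version.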

Key Contributions

The paper presents:

  1. Project-level AI4Sci tasks: compact but realistic computational science workflows that require multi-step reasoning, code generation, and artifact production
  2. B1-B4 autonomy ladder: a systematic way to measure how much scaffolding an agent needs to solve the same scientific problem
  3. Auditable evaluation: structured artifacts and provenance that enable reproducible, reviewed leaderboard entries
  4. Multi-domain coverage: 42+ public tasks across 8 scientific domains (math, physics, chemistry, astronomy, biology, materials, earth science, and engineering)