AI for Science agent evaluation

ASI-Bench

A project-level benchmark for evaluating AI4Sci agents on realistic scientific workflows, scoreable artifacts, and auditable provenance.

Model progress

AI Progress on ASI-Bench

Overall score by model on the public test set. More model evaluations coming soon.

Verified review data

Leaderboard Snapshot

Until the official baseline snapshot is finalized, these review rows, derived from the repository, show each run's overall score, B1-B4 breakdown, run metadata, and trust label.

Scientific breadth

8 Domains, 42 Public Tasks

ASI-Bench covers a wide spectrum of computational science workflows.

Task registry

Featured Public Tasks

The cards below are selected from the generated public catalog and spread across domains when possible.

Get started in 60 seconds

Quick Start

Three steps from zero to your first benchmark score.

1
git clone https://github.com/zjw49246/Agent-AI4Sci-Bench.git
cd Agent-AI4Sci-Bench && uv sync && cp .env.example .env

Clone, install dependencies, and set your API keys.

2
ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent direct_llm --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task --prompt-levels b1

Run one task with a direct LLM agent on the easiest prompt level.

3
ai4sci-bench report

View your score breakdown and result artifacts.
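
Step 2 passes the agent configuration as inline JSON, which is easy to mis-quote in a shell. The following is a minimal Python sketch that builds the same argument list with `json.dumps`, so the JSON is always well formed. The helper name `build_run_command` is hypothetical and not part of the repository. Only the flags shown in step 2 are used, and passing a prompt level other than `b1` assumes the CLI accepts the same lowercase naming for the other levels.

```python
import json
import shlex
from typing import List


def build_run_command(task: str, model: str, prompt_level: str = "b1") -> List[str]:
    """Return the argv list for the `ai4sci-bench run` call shown in step 2.

    Hypothetical helper: it only assembles arguments and does not run anything.
    """
    # Compact separators reproduce the inline JSON from the quick start.
    agent_config = json.dumps({"model": model}, separators=(",", ":"))
    return [
        "ai4sci-bench", "run",
        "--tasks", task,
        "--agent", "direct_llm",
        "--agent-config", agent_config,
        "--sandbox", "task",
        "--prompt-levels", prompt_level,
    ]


argv = build_run_command("physics.sod_shock_tube", "claude-sonnet-4-20250514")

# Print a shell-safe version of the command for inspection.
print(shlex.join(argv))
```

To execute the command rather than print it, the list can be handed to `subprocess.run(argv, check=True)` from inside the cloned repository, which avoids shell quoting altogether.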

Evaluation pipeline

From Task to Auditable Result

ASI-Bench combines scientific tasks, B1-B4 prompt levels, agent harnesses, and reproducible scoring into one auditable evaluation stack.

01

Scientific workflow task

Parameterized AI4Sci task with data, expected artifacts, runtime requirements, and scoring rules.

02

B1-B4 prompt level

Same scientific goal under decreasing guidance, from execution support to autonomous problem solving.

03

Agent / harness + model

CLI agent, scaffold, or direct baseline paired with a specific model and run configuration.

04

Sandbox, scorer, and report

Structured artifacts become task scores, B-level breakdowns, and reviewed leaderboard entries.

Evaluation pipeline from task definition through result reporting.
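
Stage 04 turns per-run scores into B-level breakdowns and an overall score. This page does not state how ASI-Bench weights tasks or levels, so the sketch below assumes the simplest scheme: an unweighted mean per prompt level, then an unweighted mean across levels. The record layout, the function names, the second task id, and every score are placeholders for illustration, not benchmark results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical run records: (task id, prompt level, score in [0, 1]).
# "example.placeholder_task" and all scores are made up for illustration.
RUNS: List[Tuple[str, str, float]] = [
    ("physics.sod_shock_tube", "b1", 1.0),
    ("physics.sod_shock_tube", "b2", 0.75),
    ("physics.sod_shock_tube", "b3", 0.5),
    ("physics.sod_shock_tube", "b4", 0.25),
    ("example.placeholder_task", "b1", 0.5),
    ("example.placeholder_task", "b2", 0.25),
    ("example.placeholder_task", "b3", 0.0),
    ("example.placeholder_task", "b4", 0.25),
]


def level_breakdown(runs: List[Tuple[str, str, float]]) -> Dict[str, float]:
    """Average the scores of all runs that share a prompt level."""
    by_level = defaultdict(list)
    for _task, level, score in runs:
        by_level[level].append(score)
    return {level: mean(scores) for level, scores in sorted(by_level.items())}


def overall_score(breakdown: Dict[str, float]) -> float:
    """Assumed aggregation: unweighted mean of the per-level means."""
    return mean(list(breakdown.values()))


breakdown = level_breakdown(RUNS)
print(breakdown)
print(overall_score(breakdown))
```

Averaging per level first keeps a level with many runs from dominating the overall number. A real scorer may weight tasks or levels differently.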

Why it matters

What Makes This Benchmark Different

Project-level workflows

Tasks require data inspection, method selection, implementation, and debugging, and they are judged on scoreable artifacts rather than a single short answer.

AI for Science domains

The public catalog spans math, physics, chemistry, astronomy, materials, and engineering-style scientific workflows.

B1-B4 autonomy ladder

The same scientific goal is evaluated under decreasing guidance, revealing how much scaffolding each agent needs.

Auditable evaluation

Runs produce structured artifacts and provenance that can power reviewed leaderboard entries instead of self-reported scores.

Updates

Latest News

  • 2026-04-25 Public website data snapshot generated with 22 public tasks across 6 domains.
  • 2026-04-22 Homepage, catalog, and leaderboard became data-driven from generated benchmark JSON.
  • 2026-04-22 Benchmark figures added: domain coverage, B1-B4 prompt ladder, and evaluation workflow.