AI for Science agent evaluation

ASI-Bench

A project-level benchmark for evaluating AI4Sci agents on realistic scientific workflows, scoreable artifacts, and auditable provenance.

Benchmark Workspace

repo-verified preview

Run Pipeline

From workflow to reviewed score

Task

Parameterized AI4Sci problems define input data, expected artifacts, runtime rules, and scoring contracts.

Leaderboard

Review

Repo-verified review rows show score, prompt level, harness, and provenance fields before official baseline runs are published.

Prompt Ladder

B1, B2, B3, B4

Domains

Math, Physics, Chemistry, Astronomy, Materials, Engineering

Catalog

Test set

Public test tasks are generated from current repo metadata across AI for Science domains.

Model progress

AI Progress on ASI-Bench

Overall score by model on the public test set. More model evaluations coming soon.

Verified review data

Leaderboard Snapshot

Repo-derived review rows show overall score, B1-B4 breakdowns, run metadata, and trust labels before the official baseline snapshot is finalized.

Scientific breadth

8 Domains, 42 Public Tasks

ASI-Bench covers a wide spectrum of computational science workflows.

Task registry

Featured Public Tasks

The cards below are selected from the generated public catalog and spread across domains when possible.

Get started in 60 seconds

Quick Start

Three steps from zero to your first benchmark score.

1

git clone https://github.com/zjw49246/Agent-AI4Sci-Bench.git
cd Agent-AI4Sci-Bench && uv sync && cp .env.example .env

Clone the repository, install dependencies with uv, and copy the example environment file; then add your API keys to .env.

2

ai4sci-bench run --tasks physics.sod_shock_tube \
  --agent direct_llm --agent-config '{"model":"claude-sonnet-4-20250514"}' \
  --sandbox task --prompt-levels b1

Run one task with a direct LLM agent on the easiest prompt level.

3

ai4sci-bench report

View your score breakdown and result artifacts.
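
For scripted sweeps over tasks or prompt levels, the run command from step 2 can be assembled programmatically. The following is a minimal Python sketch that reuses only the flags shown above. The helper name `build_run_command` is hypothetical and not part of the ai4sci-bench package, and passing the agent config as compact JSON is an assumption based on the step 2 example.

```python
import json
import shlex


def build_run_command(task, model, prompt_level="b1"):
    """Return the argv list for one `ai4sci-bench run` invocation.

    Hypothetical helper: it only reuses the flags shown in the Quick Start.
    """
    # Compact JSON, matching the shape of the --agent-config value in step 2.
    agent_config = json.dumps({"model": model}, separators=(",", ":"))
    return [
        "ai4sci-bench", "run",
        "--tasks", task,
        "--agent", "direct_llm",
        "--agent-config", agent_config,
        "--sandbox", "task",
        "--prompt-levels", prompt_level,
    ]


argv = build_run_command("physics.sod_shock_tube", "claude-sonnet-4-20250514")
# shlex.join quotes the JSON argument so the line can be pasted into a shell.
print(shlex.join(argv))
```

The list form can be passed directly to `subprocess.run(argv, check=True)`, which avoids shell quoting issues with the JSON argument.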

Evaluation pipeline

From Task to Auditable Result

ASI-Bench combines scientific tasks, B1-B4 prompt levels, agent harnesses, and reproducible scoring into one auditable evaluation stack.

01

Scientific workflow task

Parameterized AI4Sci task with data, expected artifacts, runtime requirements, and scoring rules.

02

B1-B4 prompt level

Same scientific goal under decreasing guidance, from execution support to autonomous problem solving.

03

Agent / harness + model

CLI agent, scaffold, or direct baseline paired with a specific model and run configuration.

04

Sandbox, scorer, and report

Structured artifacts become task scores, B-level breakdowns, and reviewed leaderboard entries.

Evaluation pipeline from task definition through result reporting.
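
As a rough illustration of how the four stages fit together, the sketch below models a task, a scored run, an artifact check, and a per-level score summary in Python. All class names, field names, artifact names, and score values are hypothetical, and the unweighted averaging is an assumption; the benchmark's actual schema and scoring rules may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TaskSpec:
    """Stage 01: hypothetical task description; field names are illustrative."""
    task_id: str               # e.g. "physics.sod_shock_tube"
    expected_artifacts: tuple  # files a run must produce to be scoreable


@dataclass(frozen=True)
class RunResult:
    """Stages 02-04: one scored run of a task at one prompt level."""
    task_id: str
    prompt_level: str          # "b1" through "b4"
    produced: tuple            # artifact names found after the run
    score: float               # assumed to lie in [0, 1]


def missing_artifacts(spec, result):
    """Return expected artifacts the run did not produce, in spec order."""
    return [a for a in spec.expected_artifacts if a not in result.produced]


def summarize(results):
    """Average scores per prompt level, then average the levels.

    Assumption: unweighted means; the real scorer may weight differently.
    """
    by_level = defaultdict(list)
    for r in results:
        by_level[r.prompt_level].append(r.score)
    breakdown = {lvl: mean(s) for lvl, s in sorted(by_level.items())}
    return {"overall": mean(list(breakdown.values())), "by_level": breakdown}


# Made-up artifact names and scores, for illustration only.
spec = TaskSpec("physics.sod_shock_tube", ("solution.csv", "report.json"))
results = [
    RunResult(spec.task_id, "b1", ("solution.csv", "report.json"), 0.75),
    RunResult(spec.task_id, "b1", ("solution.csv", "report.json"), 0.5),
    RunResult(spec.task_id, "b4", ("solution.csv",), 0.25),
]
print(missing_artifacts(spec, results[2]))
print(summarize(results))
```

Averaging levels rather than raw runs keeps a level with many runs from dominating the overall number, which is one possible design choice among several.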

Why it matters

What Makes This Benchmark Different

Project-level workflows

Tasks require data inspection, method selection, implementation, debugging, and objective artifacts rather than a single short answer.

AI for Science domains

The public catalog spans math, physics, chemistry, astronomy, materials, and engineering-style scientific workflows.

B1-B4 autonomy ladder

The same scientific goal is evaluated under decreasing guidance, revealing how much scaffolding each agent needs.
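
One way to read the ladder is to look at how an agent's score changes as guidance is removed. The Python sketch below computes per-step changes and the total B1-to-B4 drop. This metric is illustrative only and is not defined by the benchmark; the function name and the scores are made up.

```python
LADDER = ("b1", "b2", "b3", "b4")  # most guidance first, least guidance last


def guidance_drop(scores_by_level):
    """Return per-step score changes along the ladder and the total B1-to-B4 drop.

    Illustrative metric only; not part of the benchmark's scoring.
    """
    steps = {
        f"{hi}->{lo}": scores_by_level[lo] - scores_by_level[hi]
        for hi, lo in zip(LADDER, LADDER[1:])
    }
    total = scores_by_level[LADDER[0]] - scores_by_level[LADDER[-1]]
    return steps, total


# Made-up scores for a single agent.
steps, total = guidance_drop({"b1": 0.75, "b2": 0.5, "b3": 0.5, "b4": 0.25})
print(steps)
print(total)
```

A large total drop would suggest the agent leans heavily on the guidance provided at B1, while a flat profile would suggest it solves the task with little scaffolding.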

Auditable evaluation

Runs produce structured artifacts and provenance that can power reviewed leaderboard entries instead of self-reported scores.
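
To make the idea of auditable provenance concrete, the sketch below hashes each artifact and the run configuration into a single record that a reviewer could recompute. It is a minimal Python illustration under stated assumptions: the record layout, field names, and the `provenance_record` helper are hypothetical and do not describe the benchmark's actual provenance format.

```python
import hashlib
import json


def provenance_record(run_config, artifacts):
    """Build a tamper-evident record for one run.

    `artifacts` maps artifact name to raw bytes. Field names are hypothetical.
    """
    digests = {
        name: hashlib.sha256(data).hexdigest()
        for name, data in sorted(artifacts.items())
    }
    body = {"run_config": run_config, "artifact_sha256": digests}
    # Canonical JSON (sorted keys, fixed separators) makes the hash reproducible.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    body["record_sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return body


# Made-up run configuration and artifact contents.
config = {"task": "physics.sod_shock_tube", "prompt_level": "b1"}
record = provenance_record(config, {"report.json": b'{"ok": true}'})
print(record["record_sha256"])
```

Because the record hash covers both the configuration and every artifact digest, changing any artifact byte or config value yields a different `record_sha256`.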

Updates

Latest News

2026-04-25 Public website data snapshot generated with 22 public tasks across 6 domains.
2026-04-22 Homepage, catalog, and leaderboard became data-driven from generated benchmark JSON.
2026-04-22 Benchmark figures added: domain coverage, B1-B4 prompt ladder, and evaluation workflow.