ASI-Bench

Result Format

This page defines the required structure for benchmark result submissions.

Required Artifacts

A valid submission must include the complete output directory produced by `ai4sci-bench run` or `batch-run`.

Top-level Files

| File | Purpose |
| --- | --- |
| `run_metadata.json` | Global run configuration, agent/model info, provenance |
| `batch_records/batch_overview.json` | Summary statistics across all tasks |
| `batch_records/task_scoreboard.json` | Per-task aggregated scores |
| `batch_records/task_level_long.json` | Per-task, per-prompt-level breakdown |

Per-Instance Files

Under `<task_id>/`:

| File | Purpose |
| --- | --- |
| `<instance_id>.json` | Scored result with component scores, gates, metadata |
| `<instance_id>.agent_stdout.jsonl` | Agent execution log (recommended) |
| `<instance_id>.agent_model_output.md` | Raw LLM completions (optional) |
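Before submitting, it can help to confirm that the files listed above are actually present. The following is a minimal sketch, not part of the benchmark tooling: the helper names `missing_top_level` and `instances_without_log` are hypothetical, and the sketch assumes that each `<task_id>/` directory sits directly under the output directory.

```python
from pathlib import Path

# Top-level files listed on this page, relative to the output directory.
REQUIRED_TOP_LEVEL = (
    "run_metadata.json",
    "batch_records/batch_overview.json",
    "batch_records/task_scoreboard.json",
    "batch_records/task_level_long.json",
)


def missing_top_level(output_dir):
    """Return the required top-level files that are absent from output_dir."""
    root = Path(output_dir)
    return [rel for rel in REQUIRED_TOP_LEVEL if not (root / rel).is_file()]


def instances_without_log(output_dir):
    """Return scored results that lack the recommended agent_stdout log.

    Assumption: every directory other than batch_records/ is a <task_id>/ folder.
    """
    root = Path(output_dir)
    missing = []
    task_dirs = sorted(
        p for p in root.iterdir() if p.is_dir() and p.name != "batch_records"
    )
    for task_dir in task_dirs:
        # "*.json" matches <instance_id>.json only; .jsonl and .md files are skipped.
        for result in sorted(task_dir.glob("*.json")):
            log = result.with_name(result.stem + ".agent_stdout.jsonl")
            if not log.is_file():
                missing.append(result.relative_to(root).as_posix())
    return missing
```

An empty list from `missing_top_level` means all four top-level files exist. A non-empty list from `instances_without_log` is only a warning, since the execution log is recommended rather than required.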

Key Fields in `run_metadata.json`

{
  "agent": "direct_llm",
  "agent_config": {"model": "claude-sonnet-4-20250514"},
  "sandbox": "task",
  "seed": 42,
  "prompt_levels": ["b1", "b2", "b3", "b4"],
  "benchmark_version": "public-test",
  "instances_per_task": 1,
  "framework_version": "...",
  "result_schema_version": 1
}
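As a rough illustration, the keys shown in the example can be checked for presence with a few lines of Python. This is a sketch under assumptions: `EXPECTED_KEYS` is copied from the example above and may not match the full schema, and `load_metadata` and `missing_metadata_keys` are hypothetical helper names, not part of the framework.

```python
import json

# Keys taken from the example on this page; the real schema may define more.
EXPECTED_KEYS = {
    "agent",
    "agent_config",
    "sandbox",
    "seed",
    "prompt_levels",
    "benchmark_version",
    "instances_per_task",
    "framework_version",
    "result_schema_version",
}


def load_metadata(path):
    """Read run_metadata.json from the given path and return it as a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def missing_metadata_keys(metadata):
    """Return the expected keys that are absent, in sorted order."""
    return sorted(EXPECTED_KEYS - set(metadata))
```

For instance, `missing_metadata_keys(load_metadata("run_metadata.json"))` would return an empty list for a file that contains every key in the example.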

Naming Convention

- Instance IDs are auto-generated and include task parameters and seed.
- Do not rename or restructure the output directory.
- Keep the full directory tree intact when submitting.

What Reviewers Check

- `run_metadata.json` provenance fields are complete.
- Sandbox mode was `task` or `os` (not `none`).
- A fixed seed was used.
- All prompt levels (B1-B4) were run.
- No evidence of result tampering (gate pass patterns are consistent).
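Some of the checks above can be approximated locally before submitting. The sketch below is not the reviewers' actual procedure: the function name `precheck` is hypothetical, the lowercase level identifiers `b1` to `b4` are assumed from the `run_metadata.json` example, and a "fixed seed" is assumed to mean an integer `seed` value. Tampering detection is not covered, since it depends on reviewer judgment.

```python
# Prompt level identifiers as they appear in the run_metadata.json example.
REQUIRED_LEVELS = {"b1", "b2", "b3", "b4"}


def precheck(metadata):
    """Return a list of human-readable problems found in run metadata."""
    problems = []

    # Sandbox mode must be one of the accepted values; "none" is rejected.
    if metadata.get("sandbox") not in {"task", "os"}:
        problems.append("sandbox must be 'task' or 'os'")

    # Assumption: a fixed seed is recorded as an integer (bool is excluded).
    seed = metadata.get("seed")
    if not isinstance(seed, int) or isinstance(seed, bool):
        problems.append("seed must be a fixed integer")

    # Every prompt level must have been run.
    levels = set(metadata.get("prompt_levels") or [])
    absent = sorted(REQUIRED_LEVELS - levels)
    if absent:
        problems.append("missing prompt levels: " + ", ".join(absent))

    return problems
```

An empty result only means these three checks passed; it does not guarantee that a submission will be accepted.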