Read Results¶
Understand what the benchmark produces after a run — where files go, what they contain, and how to interpret scores.
Output directory structure¶
After a run completes, the output directory (default results/) contains:
results/
  run_metadata.json                        # Global run configuration + provenance
  <task_id>/
    <instance_id>.json                     # Per-instance scored result
    <instance_id>.agent_stdout.jsonl       # Raw agent output log
    <instance_id>.agent_model_output.md    # LLM completions (direct_llm)
  batch_records/                           # Present in batch-run mode
    batch_overview.json                    # Summary across all tasks
    task_scoreboard.json                   # Per-task aggregated scores
    task_level_long.json                   # Per-task-per-level breakdown
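The layout above can be walked with a short script. The following is a minimal sketch, assuming the file-name suffixes shown in the listing and that `batch_records/` sits directly under the output directory. `classify_result_file` and `index_results` are hypothetical helper names, not part of the benchmark's API.

```python
from collections import defaultdict
from pathlib import Path
from typing import Optional, Tuple

# Suffixes taken from the directory listing above.
_SUFFIX_KINDS = [
    (".agent_stdout.jsonl", "agent_stdout"),
    (".agent_model_output.md", "agent_model_output"),
    (".json", "scored_result"),
]


def classify_result_file(name: str) -> Optional[Tuple[str, str]]:
    """Return (instance_id, kind) for a per-instance file name, else None."""
    for suffix, kind in _SUFFIX_KINDS:
        if name.endswith(suffix):
            return name[: -len(suffix)], kind
    return None


def index_results(root: str) -> dict:
    """Map task_id -> instance_id -> kind -> path for every task directory."""
    index = defaultdict(lambda: defaultdict(dict))
    for task_dir in Path(root).iterdir():
        # Skip run_metadata.json and the batch_records/ summary directory
        # (assumed here to live at the top level of the results directory).
        if not task_dir.is_dir() or task_dir.name == "batch_records":
            continue
        for path in task_dir.iterdir():
            parsed = classify_result_file(path.name)
            if parsed is not None:
                instance_id, kind = parsed
                index[task_dir.name][instance_id][kind] = path
    return index
```

Calling `index_results("results")` on a finished run would give one entry per task directory, which is a convenient starting point for loading the scored results.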
Key files¶
run_metadata.json¶
Global configuration for the entire run:
{
  "agent": "direct_llm",
  "agent_config": {"model": "claude-sonnet-4-20250514"},
  "sandbox": "task",
  "seed": 42,
  "prompt_levels": ["b1", "b2", "b3", "b4"],
  "benchmark_version": "public-test",
  "instances_per_task": 1,
  "framework_version": "...",
  "result_schema_version": 1
}
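Before comparing scores from two runs, it can help to confirm that their configurations match. The sketch below assumes only that `run_metadata.json` is a flat JSON object like the one above; `load_run_metadata` and `metadata_differences` are hypothetical helper names.

```python
import json
from pathlib import Path


def load_run_metadata(results_dir: str) -> dict:
    """Read run_metadata.json from the top of a results directory."""
    with open(Path(results_dir) / "run_metadata.json", encoding="utf-8") as fh:
        return json.load(fh)


def metadata_differences(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every key whose value differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```

An empty result means the two runs share the same agent, seed, prompt levels, and versions, so differences in score are less likely to come from configuration drift.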
Per-instance result (<instance_id>.json)¶
Each instance produces a scored result with these key fields:
| Field | Description |
|---|---|
| final_score | Overall score (0-100) |
| component_scores | Breakdown by scorer (numerical accuracy, code quality, etc.) |
| gate_results | Hard/soft gate pass/fail status |
| requested_mode | What sandbox mode was requested |
| effective_mode | What actually ran |
| enforcement_status | Whether the sandbox was enforced |
| verification_status | Post-run verification outcome |
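A quick sanity check on a loaded result is to confirm that the fields listed above are present. This sketch uses only the field names from the table; the value types are not specified here, so it does not inspect them. `missing_fields` is a hypothetical helper name.

```python
# Field names copied from the table above.
EXPECTED_FIELDS = (
    "final_score",
    "component_scores",
    "gate_results",
    "requested_mode",
    "effective_mode",
    "enforcement_status",
    "verification_status",
)


def missing_fields(result: dict) -> list:
    """Return the expected field names that are absent from a result dict."""
    return [name for name in EXPECTED_FIELDS if name not in result]
```

A non-empty return value may indicate a truncated file or a result written under a different `result_schema_version`.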
batch_records/task_scoreboard.json¶
When using batch-run, this file provides the leaderboard-friendly view:
[
  {
    "task_id": "physics.sod_shock_tube",
    "agent": "direct_llm",
    "model": "claude-sonnet-4-20250514",
    "b1": 100.0,
    "b2": 98.5,
    "b3": 53.2,
    "b4": 74.9,
    "mean_score": 81.6
  }
]
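The scoreboard rows are easy to post-process. As a minimal sketch, assuming each row carries one numeric key per prompt level (as in the sample above) plus `mean_score`, the hypothetical helpers below rank tasks and find each task's weakest level.

```python
from typing import Iterable, Optional

# Level keys as they appear in the sample row; adjust to the run's prompt_levels.
DEFAULT_LEVELS = ("b1", "b2", "b3", "b4")


def weakest_level(row: dict, levels: Iterable[str] = DEFAULT_LEVELS) -> Optional[tuple]:
    """Return (level, score) for the lowest-scoring level present in a row."""
    scored = [
        (lvl, row[lvl])
        for lvl in levels
        if isinstance(row.get(lvl), (int, float))
    ]
    if not scored:
        return None
    return min(scored, key=lambda pair: pair[1])


def rank_tasks(rows: list) -> list:
    """Sort scoreboard rows from highest to lowest mean_score."""
    return sorted(rows, key=lambda r: r["mean_score"], reverse=True)
```

For the sample row, `weakest_level` returns `("b3", 53.2)`, which points at the prompt level worth inspecting first.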
Using ai4sci-bench report¶
The report command renders a human-readable summary from the results directory. It displays:
- Per-task scores with B-level breakdown
- Gate pass/fail summary
- Low-scoring instances with diagnostic hints from scorer details
Understanding scores¶
- 100: Perfect — all output artifacts match reference within tolerances
- 70-99: Strong — most components correct, minor numerical or format issues
- 30-69: Partial — core logic present but significant deviations
- 0-29: Weak — fundamental approach or output issues
- 0 (with gate fail): A hard gate blocked scoring entirely (e.g., required file missing)
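The bands above can be expressed as a small lookup. This is a sketch under one assumption: the bands are listed with integer bounds, so fractional scores are handled here with half-open intervals (for example, 69.5 counts as Partial and 99.9 as Strong). `score_band` is a hypothetical helper, not a benchmark function.

```python
def score_band(final_score: float, hard_gate_failed: bool = False) -> str:
    """Map a final score to the interpretation bands described above."""
    if hard_gate_failed:
        # A failed hard gate forces the score to 0 and takes precedence.
        return "gate fail"
    if final_score >= 100:
        return "perfect"
    if final_score >= 70:
        return "strong"
    if final_score >= 30:
        return "partial"
    return "weak"
```

Applied to the sample scoreboard row, the four levels fall into the perfect, strong, partial, and strong bands respectively.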
Gate types¶
| Gate | Severity | Behavior |
|---|---|---|
| file_match | hard | Required output files must exist |
| code_analysis | hard or soft | Pattern checks in agent code |
| Custom gates | configurable | Task-specific invariant checks |
Hard gates block all scoring — the final score is 0 regardless of other components. Soft gates only produce warnings.
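The hard/soft rule can be illustrated in a few lines. This is a sketch of the behavior described above, not the benchmark's implementation: the gate record shape (keys `name`, `severity`, `passed`) is hypothetical, since the actual `gate_results` schema is not shown here.

```python
def apply_gates(raw_score: float, gates: list) -> tuple:
    """Return (final_score, warnings) after applying hard and soft gates.

    Each gate is a dict with hypothetical keys "name", "severity"
    ("hard" or "soft"), and "passed" (bool).
    """
    warnings = []
    hard_failed = False
    for gate in gates:
        if gate["passed"]:
            continue
        if gate["severity"] == "hard":
            hard_failed = True
        else:
            # Soft gates never change the score; they only surface a warning.
            warnings.append(f"soft gate failed: {gate['name']}")
    return (0.0 if hard_failed else raw_score), warnings
```

With a failed soft gate the score passes through unchanged and one warning is collected; with any failed hard gate the score becomes 0 regardless of the component scores.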
Provenance fields¶
Every result records exactly how it was produced:
- requested_mode / effective_mode: what sandbox was asked for vs. what ran
- enforcement_status: whether isolation was actually enforced
- verification_status: post-run checks (e.g., no network access observed)
- Agent config, model name, CLI version, seed, and timeout
This enables reproducibility and supports the official review process.
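One provenance check that needs no knowledge of the field values is comparing the requested sandbox mode with the one that actually ran. The sketch below assumes both fields are directly comparable values in the per-instance result; `mode_mismatch` is a hypothetical helper name.

```python
def mode_mismatch(result: dict):
    """Return (requested, effective) when the two modes differ, else None."""
    requested = result.get("requested_mode")
    effective = result.get("effective_mode")
    if requested != effective:
        return requested, effective
    return None
```

Results where this returns a pair are worth reviewing alongside `enforcement_status` and `verification_status` before they are compared with other runs.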
Next steps¶
- Bring Your Agent — try different agents and configurations
- Submit Results — submit for the official leaderboard