Read Results¶

Understand what the benchmark produces after a run — where files go, what they contain, and how to interpret scores.

Output directory structure¶

After a run completes, the output directory (default: `results/`) contains:

results/
  run_metadata.json              # Global run configuration + provenance
  <task_id>/
    <instance_id>.json           # Per-instance scored result
    <instance_id>.agent_stdout.jsonl   # Raw agent output log
    <instance_id>.agent_model_output.md  # LLM completions (direct_llm)
  batch_records/                 # Present in batch-run mode
    batch_overview.json          # Summary across all tasks
    task_scoreboard.json         # Per-task aggregated scores
    task_level_long.json         # Per-task-per-level breakdown
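The layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the benchmark itself: the helper name `iter_instance_results` is hypothetical, and it assumes every `*.json` file directly inside a task directory is a per-instance result.

```python
from pathlib import Path


def iter_instance_results(results_dir):
    """Yield (task_id, instance_id, path) for each per-instance result file.

    Assumes the layout shown above: <results_dir>/<task_id>/<instance_id>.json.
    """
    root = Path(results_dir)
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # batch_records/ holds aggregate files, not per-instance results.
        if task_dir.name == "batch_records":
            continue
        # "*.json" does not match the .jsonl logs or the .md model output.
        for path in sorted(task_dir.glob("*.json")):
            yield task_dir.name, path.stem, path


if __name__ == "__main__":
    if Path("results").is_dir():
        for task_id, instance_id, path in iter_instance_results("results"):
            print(task_id, instance_id, path)
```

Because `run_metadata.json` sits at the top level rather than inside a task directory, the walk skips it without special handling.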

Key files¶

`run_metadata.json`¶

Global configuration for the entire run:

{
  "agent": "direct_llm",
  "agent_config": {"model": "claude-sonnet-4-20250514"},
  "sandbox": "task",
  "seed": 42,
  "prompt_levels": ["b1", "b2", "b3", "b4"],
  "benchmark_version": "public-test",
  "instances_per_task": 1,
  "framework_version": "...",
  "result_schema_version": 1
}
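When post-processing results, it can help to load this file first and fail early if the schema is not the one your script expects. The sketch below assumes the field names from the sample above; `load_run_metadata` and `SUPPORTED_SCHEMA_VERSION` are illustrative names, not part of the benchmark.

```python
import json
from pathlib import Path

# Matches "result_schema_version" in the sample above; adjust as needed.
SUPPORTED_SCHEMA_VERSION = 1


def load_run_metadata(results_dir="results"):
    """Load run_metadata.json and reject an unexpected schema version."""
    path = Path(results_dir) / "run_metadata.json"
    with path.open(encoding="utf-8") as fh:
        metadata = json.load(fh)
    version = metadata.get("result_schema_version")
    if version != SUPPORTED_SCHEMA_VERSION:
        raise ValueError(f"unexpected result_schema_version: {version!r}")
    return metadata
```

Calling `load_run_metadata("results")` returns the parsed dictionary, so fields such as `seed` or `prompt_levels` can be read directly.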

Per-instance result (`<instance_id>.json`)¶

Each instance produces a scored result with these key fields:

Field	Description
`final_score`	Overall score (0-100)
`component_scores`	Breakdown by scorer (numerical accuracy, code quality, etc.)
`gate_results`	Pass/fail status of each hard and soft gate
`requested_mode`	Sandbox mode that was requested
`effective_mode`	Sandbox mode that actually ran
`enforcement_status`	Whether the sandbox was enforced
`verification_status`	Post-run verification outcome
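As a rough illustration, the fields above can be condensed into a one-line summary per instance. The sketch assumes `component_scores` is a flat mapping from scorer name to a numeric score; the real structure may be richer, and the scorer names in the usage note are made up.

```python
def summarize_result(result):
    """Return a one-line summary of a per-instance result dict.

    Assumes `component_scores` maps scorer name -> numeric score.
    """
    parts = [f"final={result['final_score']:.1f}"]
    for name, score in sorted(result.get("component_scores", {}).items()):
        parts.append(f"{name}={score}")
    return " ".join(parts)
```

For a result loaded with `json.load`, `summarize_result(result)` yields a string such as `final=81.6 code_quality=70 numerical_accuracy=90`, with component names sorted alphabetically.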

`batch_records/task_scoreboard.json`¶

In batch-run mode, this file provides a leaderboard-friendly view, with one entry per task and one score per prompt level:

[
  {
    "task_id": "physics.sod_shock_tube",
    "agent": "direct_llm",
    "model": "claude-sonnet-4-20250514",
    "b1": 100.0,
    "b2": 98.5,
    "b3": 53.2,
    "b4": 74.9,
    "mean_score": 81.6
  }
]
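A scoreboard row like the one above is easy to inspect programmatically. The sketch below finds the weakest prompt level and recomputes the mean over `b1` to `b4`. Treating `mean_score` as a plain arithmetic mean is an assumption: the sample value 81.6 is consistent with it up to rounding, but the benchmark's aggregation rule is not stated here.

```python
from statistics import mean

LEVELS = ("b1", "b2", "b3", "b4")


def weakest_level(row):
    """Return (level, score) for the lowest-scoring prompt level in a row."""
    level = min(LEVELS, key=lambda name: row[name])
    return level, row[level]


def level_mean(row):
    """Arithmetic mean over the four levels (assumed aggregation rule)."""
    return mean(row[name] for name in LEVELS)


# Row copied from the sample scoreboard above.
sample = {
    "task_id": "physics.sod_shock_tube",
    "b1": 100.0,
    "b2": 98.5,
    "b3": 53.2,
    "b4": 74.9,
    "mean_score": 81.6,
}

print(weakest_level(sample))
print(level_mean(sample))
```

To process a real run, load `batch_records/task_scoreboard.json` with `json.load` and apply the same functions to each row.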

Using `ai4sci-bench report`¶

The report command renders a human-readable summary from the results directory:

ai4sci-bench report
ai4sci-bench report --output-dir path/to/results

It displays:

Per-task scores with B-level breakdown
Gate pass/fail summary
Low-scoring instances with diagnostic hints from scorer details

Understanding scores¶

100: Perfect — all output artifacts match reference within tolerances
70-99: Strong — most components correct, minor numerical or format issues
30-69: Partial — core logic present but significant deviations
0-29: Weak — fundamental approach or output issues
0 (with gate fail): A hard gate blocked scoring entirely (e.g., required file missing)
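The bands above translate directly into a small lookup. This is a sketch only: `score_band` is a hypothetical helper, and placing fractional scores such as 69.5 or 99.9 in the lower band is an assumption, since the list gives integer ranges.

```python
def score_band(final_score, hard_gate_failed=False):
    """Map a 0-100 final score to the band names listed above."""
    if not 0 <= final_score <= 100:
        raise ValueError(f"score out of range: {final_score}")
    if hard_gate_failed:
        # A hard gate forces the score to 0, so report it separately from "Weak".
        return "Gate fail"
    if final_score == 100:
        return "Perfect"
    if final_score >= 70:
        return "Strong"
    if final_score >= 30:
        return "Partial"
    return "Weak"
```

Passing `hard_gate_failed=True` separates a score of 0 caused by a blocked run from a score of 0 earned by a weak solution.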

Gate types¶

Gate	Severity	Behavior
`file_match`	hard	Required output files must exist
`code_analysis`	hard or soft	Pattern checks in agent code
Custom gates	configurable	Task-specific invariant checks

Hard gates block all scoring — the final score is 0 regardless of other components. Soft gates only produce warnings.
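The rule can be sketched as follows. The shape of `gate_results` used here (a list of dictionaries with `gate`, `severity`, and `passed` keys) is hypothetical and chosen only for illustration; the actual result schema may differ.

```python
def apply_gates(raw_score, gate_results):
    """Return (final_score, warnings) after applying the gate semantics.

    `gate_results` is a hypothetical list of dicts with the keys
    "gate", "severity" ("hard" or "soft"), and "passed" (bool).
    """
    warnings = []
    blocked = False
    for gate in gate_results:
        if gate["passed"]:
            continue
        if gate["severity"] == "hard":
            # Any failed hard gate zeroes the final score.
            blocked = True
        else:
            # Failed soft gates are reported but leave the score untouched.
            warnings.append(f"soft gate failed: {gate['gate']}")
    return (0.0 if blocked else raw_score), warnings
```

With this shape, a failed `file_match` gate turns any raw score into 0.0, while a failed soft `code_analysis` gate only adds a warning.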

Provenance fields¶

Every result records exactly how it was produced:

requested_mode / effective_mode — what sandbox was asked for vs. what ran
enforcement_status — whether isolation was actually enforced
verification_status — post-run checks (e.g., no network access observed)
Agent config, model name, CLI version, seed, and timeout

This enables reproducibility and supports the official review process.
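One simple use of these fields is to flag results where the sandbox that ran differs from the one requested. The sketch below compares only `requested_mode` and `effective_mode`; the mode value `"none"` in the usage note is a made-up placeholder, and the possible values of `enforcement_status` and `verification_status` are not assumed.

```python
def sandbox_mismatches(results):
    """Return the results whose effective sandbox mode differs from the requested one.

    `results` is an iterable of per-instance result dicts.
    """
    return [
        result
        for result in results
        if result["requested_mode"] != result["effective_mode"]
    ]
```

For example, given one result with `requested_mode` and `effective_mode` both set to `"task"` and another that requested `"task"` but ran as `"none"`, only the second is returned.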

Next steps¶

Bring Your Agent — try different agents and configurations
Submit Results — submit for the official leaderboard