Result Format¶
This page defines the required structure for benchmark result submissions.
Required Artifacts¶
A valid submission must include the complete output directory produced by `ai4sci-bench run` or `batch-run`.
Top-level Files¶
| File | Purpose |
|---|---|
| `run_metadata.json` | Global run configuration, agent/model info, provenance |
| `batch_records/batch_overview.json` | Summary statistics across all tasks |
| `batch_records/task_scoreboard.json` | Per-task aggregated scores |
| `batch_records/task_level_long.json` | Per-task, per-prompt-level breakdown |
Per-Instance Files¶
Under `<task_id>/`:
| File | Purpose |
|---|---|
| `<instance_id>.json` | Scored result with component scores, gates, metadata |
| `<instance_id>.agent_stdout.jsonl` | Agent execution log (recommended) |
| `<instance_id>.agent_model_output.md` | Raw LLM completions (optional) |
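Before submitting, it can help to confirm that the files listed above are actually present. The following is a minimal sketch, not part of the benchmark tooling: the function name `find_missing_artifacts` is hypothetical, and it assumes that every subdirectory of the output directory other than `batch_records/` is a `<task_id>/` directory and that every `*.json` file inside one is an instance result.

```python
from pathlib import Path

# Top-level files from the table above, relative to the output directory.
TOP_LEVEL_FILES = [
    "run_metadata.json",
    "batch_records/batch_overview.json",
    "batch_records/task_scoreboard.json",
    "batch_records/task_level_long.json",
]


def find_missing_artifacts(output_dir):
    """Return (missing_required, missing_recommended) as lists of relative paths."""
    root = Path(output_dir)
    missing_required = [f for f in TOP_LEVEL_FILES if not (root / f).is_file()]
    missing_recommended = []

    # Assumption: any subdirectory other than batch_records/ is a <task_id>/ directory.
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        if task_dir.name == "batch_records":
            continue
        # "*.json" does not match the ".jsonl" log files.
        results = sorted(task_dir.glob("*.json"))
        if not results:
            missing_required.append(f"{task_dir.name}/<instance_id>.json")
        for result in results:
            # The execution log is recommended, so it is reported separately.
            log = result.with_name(result.stem + ".agent_stdout.jsonl")
            if not log.is_file():
                missing_recommended.append(f"{task_dir.name}/{log.name}")
    return missing_required, missing_recommended
```

Calling `find_missing_artifacts("path/to/output")` returns two lists; an empty first list means every required file was found under the assumptions stated above. The optional `agent_model_output.md` files are deliberately not checked.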
Key Fields in run_metadata.json¶
{
  "agent": "direct_llm",
  "agent_config": {"model": "claude-sonnet-4-20250514"},
  "sandbox": "task",
  "seed": 42,
  "prompt_levels": ["b1", "b2", "b3", "b4"],
  "benchmark_version": "public-test",
  "instances_per_task": 1,
  "framework_version": "...",
  "result_schema_version": 1
}
Naming Convention¶
- Instance IDs are auto-generated and include task parameters and seed
- Do not rename or restructure the output directory
- Keep the full directory tree intact when submitting
What Reviewers Check¶
- `run_metadata.json` provenance fields are complete
- Sandbox mode was `task` or `os` (not `none`)
- Fixed seed was used
- All prompt levels (B1-B4) were run
- No evidence of result tampering (gate pass patterns are consistent)
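Several of these checks can be approximated locally before submission. The sketch below is illustrative and is not the reviewers' actual procedure: `check_run_metadata` is a hypothetical helper, the list of expected keys is simply taken from the example `run_metadata.json` above, "fixed seed" is interpreted as an integer `seed` value, and B1-B4 are assumed to appear in `prompt_levels` as `b1` to `b4`. The tampering check is not covered.

```python
ACCEPTED_SANDBOX_MODES = {"task", "os"}
REQUIRED_PROMPT_LEVELS = {"b1", "b2", "b3", "b4"}

# Assumption: these are the keys shown in the example run_metadata.json;
# the authoritative list of provenance fields may differ.
EXPECTED_KEYS = [
    "agent",
    "agent_config",
    "sandbox",
    "seed",
    "prompt_levels",
    "benchmark_version",
    "instances_per_task",
    "framework_version",
    "result_schema_version",
]


def check_run_metadata(meta):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = [f"missing field: {key}" for key in EXPECTED_KEYS if key not in meta]

    sandbox = meta.get("sandbox")
    if sandbox not in ACCEPTED_SANDBOX_MODES:
        problems.append(f"sandbox must be 'task' or 'os', got {sandbox!r}")

    # Assumption: a fixed seed is recorded as an integer (bool is excluded explicitly).
    seed = meta.get("seed")
    if not isinstance(seed, int) or isinstance(seed, bool):
        problems.append(f"seed must be an integer, got {seed!r}")

    levels = {str(level).lower() for level in meta.get("prompt_levels") or []}
    missing_levels = sorted(REQUIRED_PROMPT_LEVELS - levels)
    if missing_levels:
        problems.append("missing prompt levels: " + ", ".join(missing_levels))

    return problems
```

A typical use is to load the file with `json.load` and pass the resulting dictionary to `check_run_metadata`, then fix anything it reports before submitting.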