How to contribute a benchmark task

The hard part isn't the clicks; it's the concepts. This page explains what a good task is and how the three steps and their terms fit together. Each screen also explains its own fields just in time, as you fill them in.

Step 1

Propose the science

In two or three sentences, say what scientific question the task tests and why it is hard. This is what reviewers read first — a sharp question beats a long one.

State what to compute, not how — the method stays out of the low-information prompts.
Declare only the scientific dependencies your task needs; the benchmark agent comes from the platform image, not from you.
Good: “Recover the steady-state vortex centre of a lid-driven cavity at Re=100.” Avoid: “Run a projection-method solver.”

Step 2

Build the four prompt levels

Write B1 through B4 as a gradient: B1 gives the most information, B3/B4 the least. A model that understands should score high on B1 and lower on B3/B4; that gradient is the proof your task measures understanding.

B1 — full information: background, method, equations all given.
B3 — minimal: say only what to compute, never the algorithm. Naming a method here leaks the approach and breaks the gradient.
Assemble the evaluation: ground truth (generate_gt.py), scorers, and any gates.

Step 3

Test locally, submit, and review

Run the task on one model yourself, report the per-level scores, then submit. Reviewers use your local scores plus AI pre-review to judge difficulty and leakage before the task is accepted.

Local testing is required to submit — it is a key difficulty signal for reviewers.
Declare the sandbox and provenance for every result. Scores from different sandboxes are not comparable and are bucketed apart.
Track feedback on the proposal page and revise until it's accepted.

Glossary

Not sure what a term means? Plain-language explanations here.

Difficulty gradient (B1–B4) [prompt_b1…b4]

Four prompt levels, from full information (B1) to minimal (B3/B4). A model that truly understands should score lower as information is withheld — that's the proof the task tests understanding, not method-copying.
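The gradient described above can be checked mechanically once you have per-level scores. The following is a minimal sketch, not the platform's own check: it assumes the levels are ordered B1, B2, B3, B4 by decreasing information, and the function name, the tolerance parameter, and the example numbers are all made up for illustration.

```python
from typing import Mapping

# Assumed ordering: B1 carries the most information, B4 the least.
LEVELS = ("B1", "B2", "B3", "B4")


def gradient_holds(scores: Mapping[str, float], tolerance: float = 0.0) -> bool:
    """Return True if the score never rises as information is withheld.

    `tolerance` (hypothetical) allows small upward noise between adjacent levels.
    """
    ordered = [scores[level] for level in LEVELS]
    return all(
        later <= earlier + tolerance
        for earlier, later in zip(ordered, ordered[1:])
    )


# Made-up scores, for illustration only.
example_scores = {"B1": 0.92, "B2": 0.80, "B3": 0.55, "B4": 0.40}
print(gradient_holds(example_scores))
```

A flat or inverted result is a hint that a low-information prompt may be leaking the method, or that the task is not sensitive to the information being withheld.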

Information leakage [leak]

Naming a specific method / algorithm / key constant in a low-information level (especially B3) leaks the higher-level approach to the model and breaks the gradient. The editor flags this in real time.
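To get a feel for what a leak check looks for, here is a rough sketch of a keyword scan. It is not the editor's real-time checker, whose logic is not described here; the watch list and the function name are hypothetical, and a real list would be drawn from the method section of your own B1 prompt.

```python
import re
from typing import Sequence

# Hypothetical watch list of method names that should not appear in B3.
METHOD_TERMS = ("projection method", "Runge-Kutta", "SIMPLE algorithm")


def find_leaks(prompt: str, terms: Sequence[str] = METHOD_TERMS) -> list[str]:
    """Return the watch-list terms that occur in the prompt (case-insensitive)."""
    hits = []
    for term in terms:
        # Let spaces and hyphens between words match interchangeably,
        # so "projection-method" is caught by "projection method".
        words = [re.escape(word) for word in re.split(r"[\s-]+", term)]
        pattern = r"[\s-]+".join(words)
        if re.search(pattern, prompt, flags=re.IGNORECASE):
            hits.append(term)
    return hits


print(find_leaks("Run a projection-method solver."))
print(find_leaks("Recover the steady-state vortex centre of a lid-driven cavity at Re=100."))
```

A keyword scan only catches literal mentions. A prompt can still leak the approach by paraphrase or by quoting a key constant, so treat a clean scan as necessary rather than sufficient.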

Ground truth (GT) [generate_gt.py]

The script that produces the task's reference answer. The framework uses it to score an agent's output.
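As a rough illustration of the shape such a script can take, the sketch below computes a reference value deterministically and writes it to a file. Everything specific in it is an assumption: the toy integral stands in for your real science, and the output file name and JSON layout are hypothetical, not the format the framework expects.

```python
import json
import math
import tempfile
from pathlib import Path


def compute_reference(n: int = 100_000) -> float:
    """Toy stand-in for the real computation: midpoint-rule integral of sin(x) on [0, pi]."""
    h = math.pi / n
    return h * sum(math.sin((i + 0.5) * h) for i in range(n))


def write_ground_truth(out_path: Path) -> Path:
    """Write the reference answer as JSON (hypothetical layout) and return the path."""
    payload = {"quantity": "integral_sin_0_pi", "value": compute_reference()}
    out_path.write_text(json.dumps(payload, indent=2))
    return out_path


# Usage: write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    path = write_ground_truth(Path(tmp) / "ground_truth.json")
    print(json.loads(path.read_text()))
```

The point of the sketch is the discipline rather than the details: no randomness without a fixed seed, no dependence on the agent's output, and a result that anyone can regenerate.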

Scorer [numerical / …]

The rule that grades an output (e.g. relative_l2, mae), with full-score / zero-score thresholds and a weight.
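The sketch below shows one way the pieces of a scorer could fit together: an error metric, two thresholds, and a weight. The metric definitions are the standard ones, but the linear interpolation between the thresholds and the weighted average are assumptions, since how the framework maps errors to scores is not spelled out here. All function names are illustrative.

```python
import math
from typing import Sequence, Tuple


def relative_l2(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Relative L2 error ||pred - ref|| / ||ref||; undefined for an all-zero reference."""
    diff_norm = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    ref_norm = math.sqrt(sum(r ** 2 for r in ref))
    if ref_norm == 0.0:
        raise ValueError("reference norm is zero")
    return diff_norm / ref_norm


def mae(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute error over paired entries."""
    if not ref:
        raise ValueError("reference is empty")
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def threshold_score(error: float, full_at: float, zero_at: float) -> float:
    """1.0 at or below full_at, 0.0 at or above zero_at, linear in between (assumed)."""
    if error <= full_at:
        return 1.0
    if error >= zero_at:
        return 0.0
    return (zero_at - error) / (zero_at - full_at)


def weighted_total(parts: Sequence[Tuple[float, float]]) -> float:
    """Weighted average of (score, weight) pairs (assumed combination rule)."""
    total_weight = sum(weight for _, weight in parts)
    return sum(score * weight for score, weight in parts) / total_weight


# Made-up data: reference vs. an agent's output.
ref, pred = [3.0, 4.0], [3.0, 4.5]
s_l2 = threshold_score(relative_l2(pred, ref), full_at=0.0, zero_at=0.2)
s_mae = threshold_score(mae(pred, ref), full_at=0.05, zero_at=0.45)
print(weighted_total([(s_l2, 2.0), (s_mae, 1.0)]))
```

When you pick thresholds, set the full-score value near the accuracy of your ground truth and the zero-score value where an answer stops being scientifically useful.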

Gate [file_match…]

A check that runs before scoring and is either hard or soft (e.g. a required file must exist). A failing hard gate fails the run outright.
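Here is a small sketch of how gates could be applied before any scorer runs. The hard-gate behaviour follows the description above. What a failing soft gate does is not stated, so recording a warning is an assumption, and the function names and file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

# A gate is (name, check, is_hard); check takes the run directory and returns True on pass.
Gate = Tuple[str, Callable[[Path], bool], bool]


def file_exists_gate(run_dir: Path, filename: str) -> bool:
    """Pass if the required output file exists in the run directory."""
    return (run_dir / filename).is_file()


def apply_gates(run_dir: Path, gates: List[Gate]) -> Tuple[bool, List[str]]:
    """Return (passed, messages). A failing hard gate stops the run immediately."""
    messages: List[str] = []
    for name, check, is_hard in gates:
        if check(run_dir):
            continue
        if is_hard:
            return False, messages + [f"hard gate failed: {name}"]
        # Assumption: a failing soft gate is only recorded, not fatal.
        messages.append(f"soft gate failed: {name}")
    return True, messages


gates: List[Gate] = [
    ("result.json exists", lambda d: file_exists_gate(d, "result.json"), True),
    ("log.txt exists", lambda d: file_exists_gate(d, "log.txt"), False),
]

with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    print(apply_gates(run_dir, gates))   # required file missing
    (run_dir / "result.json").write_text("{}")
    print(apply_gates(run_dir, gates))   # required file present, optional log missing
```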

Sandbox [none / task / os / linux_ns]

The isolation environment a run uses. Scores from different sandboxes aren't comparable, so you must declare it when submitting results.
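To illustrate what keeping sandboxes apart means in practice, the sketch below groups results by their declared sandbox and averages only within each group. The sandbox names come from the list above; the record layout, the function name, and the numbers are made up.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Mapping

SANDBOXES = {"none", "task", "os", "linux_ns"}


def bucket_by_sandbox(results: Iterable[Mapping]) -> Dict[str, float]:
    """Average scores per sandbox; scores are never mixed across sandboxes."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for result in results:
        sandbox = result.get("sandbox")
        if sandbox not in SANDBOXES:
            # An undeclared or unknown sandbox cannot be placed in any bucket.
            raise ValueError(f"undeclared or unknown sandbox: {sandbox!r}")
        buckets[sandbox].append(result["score"])
    return {sandbox: mean(scores) for sandbox, scores in buckets.items()}


# Made-up results for illustration.
results = [
    {"sandbox": "none", "score": 1.0},
    {"sandbox": "none", "score": 0.5},
    {"sandbox": "linux_ns", "score": 0.25},
]
print(bucket_by_sandbox(results))
```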

Local testing [required to submit]

Before submitting, you run the task on a model yourself and report the per-level scores. It's a key signal reviewers use to judge difficulty.

Environment [runtime.packages]

The Python version + scientific packages your task needs. Declare science packages only — the benchmark agents come from the platform image.
