How to contribute a benchmark task

The hard part isn't the clicks, it's the concepts. This page explains what makes a good task and how the three steps and their terms fit together. Each screen also explains its own fields as you fill them in.

Step 1

Propose the science

In two or three sentences, say what scientific question the task tests and why it is hard. This is what reviewers read first — a sharp question beats a long one.

  • State what to compute, not how — the method stays out of the low-information prompts.
  • Declare only the scientific dependencies your task needs; the benchmark agent comes from the platform image, not from you.
  • Good: “Recover the steady-state vortex centre of a lid-driven cavity at Re=100.” Avoid: “Run a projection-method solver.”

Step 2

Build the four prompt levels

Write B1 through B4 as a gradient, from the most information (B1) to the least (B3/B4). A model that understands should score high on B1 and lower on the minimal levels; that gradient is the proof your task measures understanding.

  • B1 — full information: background, method, equations all given.
  • B3 — minimal: say only what to compute, never the algorithm. Naming a method here leaks the approach and breaks the gradient.
  • Assemble the evaluation: ground truth (generate_gt.py), scorers, and any gates.

Step 3

Test locally, submit, and review

Run the task on one model yourself, report the per-level scores, then submit. Reviewers use your local scores plus AI pre-review to judge difficulty and leakage before the task is accepted.

  • Local testing is required to submit — it is a key difficulty signal for reviewers.
  • Declare the sandbox and provenance for every result. Scores from different sandboxes are not comparable and are bucketed apart.
  • Track feedback on the proposal page and revise until it's accepted.

Glossary

Not sure what a term means? Plain-language explanations are below.

Difficulty gradient (B1–B4) [prompt_b1…b4]

Four prompt levels, from full information (B1) to minimal (B3/B4). A model that truly understands should score lower as information is withheld — that's the proof the task tests understanding, not method-copying.
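
As a rough illustration of what "score lower as information is withheld" means for your local results, here is a minimal sketch that checks whether per-level scores never rise from B1 to B4. The function name, the `tolerance` parameter, and the example numbers are assumptions made for this illustration, not part of the platform.

```python
from typing import Mapping

# Level names as used in this guide.
LEVELS = ("B1", "B2", "B3", "B4")


def has_gradient(scores: Mapping[str, float], tolerance: float = 0.0) -> bool:
    """Return True if scores never rise by more than `tolerance` from B1 to B4."""
    ordered = [scores[level] for level in LEVELS]
    return all(
        later <= earlier + tolerance
        for earlier, later in zip(ordered, ordered[1:])
    )


# Hypothetical per-level scores from one local run (illustrative numbers only).
example_scores = {"B1": 0.92, "B2": 0.80, "B3": 0.41, "B4": 0.35}
gradient_ok = has_gradient(example_scores)
```

A small `tolerance` can absorb run-to-run noise; how much noise is acceptable is a judgment call for you and the reviewers.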

Information leakage [leak]

Naming a specific method / algorithm / key constant in a low-information level (especially B3) leaks the higher-level approach to the model and breaks the gradient. The editor flags this in real time.
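
To make the idea concrete, here is a minimal sketch of a leak check you could run on your own B3 prompt before submitting. It is not how the editor detects leaks; `find_leaks` and the `METHOD_TERMS` list are hypothetical and only show the principle of scanning a low-information prompt for method names.

```python
import re
from typing import Iterable, List


def find_leaks(prompt: str, method_terms: Iterable[str]) -> List[str]:
    """Return the method terms found in the prompt (case-insensitive, whole words).

    Hyphens and runs of whitespace are treated as a single space, so
    "projection-method" matches the term "projection method".
    """
    normalized = re.sub(r"[-\s]+", " ", prompt)
    hits = []
    for term in method_terms:
        term_normalized = re.sub(r"[-\s]+", " ", term)
        pattern = r"\b" + re.escape(term_normalized) + r"\b"
        if re.search(pattern, normalized, flags=re.IGNORECASE):
            hits.append(term)
    return hits


# Hypothetical list of method names an author might keep for a cavity-flow task.
METHOD_TERMS = ["projection method", "SIMPLE algorithm", "finite volume"]

good_b3 = "Recover the steady-state vortex centre of a lid-driven cavity at Re=100."
bad_b3 = "Run a projection-method solver."
```

With these inputs, `find_leaks(good_b3, METHOD_TERMS)` finds nothing, while `find_leaks(bad_b3, METHOD_TERMS)` reports "projection method". A keyword list only catches names you thought to include, so treat it as a first pass.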

Ground truth (GT) [generate_gt.py]

The script that produces the task's reference answer. The framework scores an agent's output against that reference.
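
For orientation, here is a minimal sketch of what a generate_gt.py could look like for a toy task with a known exact answer. The interface the framework expects is not described here, so the function names, the JSON layout, and the toy problem (exponential decay) are all assumptions.

```python
import json
import math
from pathlib import Path


def compute_reference(k: float = 0.5, t_end: float = 2.0, n: int = 5) -> dict:
    """Toy reference: exact solution y(t) = exp(-k t) of y' = -k y with y(0) = 1."""
    times = [t_end * i / (n - 1) for i in range(n)]
    return {"t": times, "y": [math.exp(-k * t) for t in times]}


def write_ground_truth(out_path: Path) -> None:
    """Write the reference answer as JSON (file name and layout are assumed)."""
    out_path.write_text(json.dumps(compute_reference(), indent=2))
```

Calling `write_ground_truth(Path("gt.json"))` would write the reference to disk. The useful property is that the script is deterministic: running it twice gives the same reference.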

Scorer [numerical / …]

The rule that grades an output (e.g. relative_l2, mae), with full-score / zero-score thresholds and a weight.
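
The sketch below shows one way thresholds and weights could combine. It assumes a linear ramp between the full-score and zero-score thresholds and a weight-normalised average; the platform's actual mapping may differ, and all function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple


def relative_l2(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Relative L2 error: ||pred - ref||_2 / ||ref||_2."""
    num = math.sqrt(sum((p - r) ** 2 for p, r in zip(pred, ref)))
    den = math.sqrt(sum(r ** 2 for r in ref))
    return num / den


def mae(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - r) for p, r in zip(pred, ref)) / len(ref)


def error_to_score(error: float, full_at: float, zero_at: float) -> float:
    """Map an error to [0, 1]: 1 at or below full_at, 0 at or above zero_at.

    The linear interpolation in between is an assumption of this sketch.
    """
    if error <= full_at:
        return 1.0
    if error >= zero_at:
        return 0.0
    return (zero_at - error) / (zero_at - full_at)


def weighted_total(parts: List[Tuple[float, float]]) -> float:
    """Combine (score, weight) pairs into a weight-normalised average."""
    total_weight = sum(weight for _, weight in parts)
    return sum(score * weight for score, weight in parts) / total_weight
```

For example, an error halfway between the two thresholds earns half credit under this ramp, and a scorer with three times the weight of another contributes three times as much to the total.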

Gate [file_match…]

A hard or soft check that runs before scoring (e.g. a required file must exist). A failing hard gate fails the run outright.
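
Here is a minimal sketch of a file-existence gate. The `FileGate` class and `run_gates` function are hypothetical, and treating a failed soft gate as a warning is an assumption; only the hard-gate behaviour (the run fails outright) comes from the definition above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple


@dataclass
class FileGate:
    """A gate that requires a file to exist in the run's output directory."""
    required_file: str
    hard: bool = True


def run_gates(output_dir: Path, gates: List[FileGate]) -> Tuple[bool, List[str]]:
    """Return (may_score, warnings).

    A failed hard gate stops the run before scoring. A failed soft gate
    only records a warning (assumed behaviour).
    """
    warnings: List[str] = []
    for gate in gates:
        if (output_dir / gate.required_file).is_file():
            continue
        if gate.hard:
            return False, [f"hard gate failed: {gate.required_file} is missing"]
        warnings.append(f"soft gate failed: {gate.required_file} is missing")
    return True, warnings
```

Running the gates first means a run that produced no output file is rejected before any scorer tries to read it.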

Sandbox [none / task / os / linux_ns]

The isolation environment a run uses. Scores from different sandboxes aren't comparable, so you must declare it when submitting results.
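
To show what keeping sandboxes apart looks like in practice, this minimal sketch groups local results by their declared sandbox and averages within each group only. The record layout (`sandbox` and `score` keys) and the function name are assumptions; the four sandbox names are the ones listed above.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Sandbox names as listed in this glossary entry.
SANDBOXES = {"none", "task", "os", "linux_ns"}


def bucket_by_sandbox(results: List[dict]) -> Dict[str, float]:
    """Average scores per sandbox, never mixing results across sandboxes."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for result in results:
        sandbox = result["sandbox"]
        if sandbox not in SANDBOXES:
            raise ValueError(f"undeclared or unknown sandbox: {sandbox!r}")
        buckets[sandbox].append(result["score"])
    return {sandbox: mean(scores) for sandbox, scores in buckets.items()}
```

A result with a missing or misspelled sandbox raises an error instead of being averaged into the wrong bucket.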

Local testing [required to submit]

Before submitting, you run the task on a model yourself and report the per-level scores. It's a key signal reviewers use to judge difficulty.

Environment [runtime.packages]

The Python version + scientific packages your task needs. Declare science packages only — the benchmark agents come from the platform image.
