The hard part isn't the clicks — it's the concepts. This is what a good task is, and how the three steps and their terms fit together. Every screen also teaches just-in-time as you fill it.
In two or three sentences, say what scientific question the task tests and why it is hard. This is what reviewers read first — a sharp question beats a long one.
Write B1 through B4 as a gradient: B1 gives the most information, B3 the least. A model that understands should score high on B1 and lower on B3 — that gradient is the proof your task measures understanding.
Run the task on one model yourself, report the per-level scores, then submit. Reviewers use your local scores plus AI pre-review to judge difficulty and leakage before the task is accepted.
Not sure what a term means? Plain-language explanations here.
Four prompt levels, from full information (B1) to minimal (B3/B4). A model that truly understands should score lower as information is withheld — that's the proof the task tests understanding, not method-copying.
Naming a specific method / algorithm / key constant in a low-information level (especially B3) leaks the higher-level approach to the model and breaks the gradient. The editor flags this in real time.
The script that produces the task's reference answer. The framework uses it to score an agent's output.
The rule that grades an output (e.g. relative_l2, mae), with full-score / zero-score thresholds and a weight.
A hard / soft check before scoring (e.g. a required file must exist). A failing hard gate fails the run outright.
The isolation environment a run uses. Scores from different sandboxes aren't comparable, so you must declare it when submitting results.
Before submitting, you run the task on a model yourself and report the per-level scores. It's a key signal reviewers use to judge difficulty.
The Python version + scientific packages your task needs. Declare science packages only — the benchmark agents come from the platform image.