News

This page tracks public-facing benchmark and website updates. It is intended to become the lightweight changelog for the benchmark website, public catalog, official leaderboard status, and release milestones.

What This Page Is For

  • benchmark launch announcements
  • new task or domain additions
  • official leaderboard refreshes
  • public release notes for scoring, visibility, or website changes

Current Update Log

2026-04-22 — Public Website Scaffold Established

The benchmark now has a working MkDocs-based web/ scaffold with:

  • homepage
  • overview
  • task catalog
  • leaderboard section
  • getting-started pages
  • FAQ
  • paper and news pages

This moved the website from planning documents to a buildable public site structure.

2026-04-22 — Public Task Catalog Limited to test Tasks

The generated public task catalog was tightened so the website no longer mirrors the full internal task tree. The public website now shows only the test subset.

Current public task and domain counts are generated from the latest repo metadata rather than maintained by hand.
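
As a rough illustration of the filtering and count generation described above, the sketch below keeps only test-subset records and derives the counts from them. The record fields (`id`, `domain`, `split`) and the function names are assumptions made for this example; the actual repo metadata schema and exporter are not shown on this page.

```python
# Hypothetical task records; the real repo metadata schema may differ.
TASKS = [
    {"id": "t1", "domain": "finance", "split": "test"},
    {"id": "t2", "domain": "finance", "split": "train"},
    {"id": "t3", "domain": "legal", "split": "test"},
]


def public_catalog(tasks):
    """Keep only the tasks that belong to the test subset."""
    return [t for t in tasks if t.get("split") == "test"]


def catalog_counts(tasks):
    """Derive public task and domain counts from the filtered records."""
    public = public_catalog(tasks)
    domains = {t["domain"] for t in public}
    return {"tasks": len(public), "domains": len(domains)}


if __name__ == "__main__":
    print(catalog_counts(TASKS))
```

Because the counts are computed from the same filtered list that feeds the catalog, they cannot drift from what the site actually displays.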

2026-04-22 — Public Task Summaries Sanitized

The task catalog exporter now derives safer public summaries automatically and avoids leaking template-heavy prompt text such as unresolved {{ ... }} placeholders.

This makes the public task cards and catalog entries read more like benchmark summaries and less like raw prompt exports.
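
One way such sanitizing could work is sketched below: strip anything that looks like an unresolved {{ ... }} placeholder, normalize whitespace, and cap the length. This is a minimal sketch, not the exporter's actual logic; the function name `public_summary` and the `max_chars` limit are hypothetical.

```python
import re

# Matches unresolved template placeholders such as {{ document }}.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}", re.DOTALL)


def public_summary(prompt_text, max_chars=160):
    """Return a placeholder-free, whitespace-normalized, length-capped summary."""
    cleaned = PLACEHOLDER.sub("", prompt_text)
    # Collapse the runs of whitespace left behind by removed placeholders.
    cleaned = " ".join(cleaned.split())
    if len(cleaned) <= max_chars:
        return cleaned
    # Cut at the last word boundary that fits, then mark the truncation.
    return cleaned[:max_chars].rsplit(" ", 1)[0] + "..."


if __name__ == "__main__":
    print(public_summary("Summarize {{ document }} for the   user."))
```

A regex pass like this only removes the placeholder tokens; text that reads awkwardly without its placeholder would still need a hand-written or separately derived summary.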

2026-04-22 — Homepage, Catalog, and Leaderboard Became Data-Driven

The site now renders generated JSON directly for:

  • homepage stats
  • featured public tasks
  • task catalog filtering/search
  • leaderboard rendering and verified review snapshot handling

The public site is now tied to generated benchmark metadata instead of relying entirely on hardcoded page text.

What Should Appear Here Next

Likely future entries:

  • paper release
  • first official baseline leaderboard snapshot
  • public benchmark release notes
  • new domain additions
  • changes to public visibility or submission policy