
Create Your Own LLM Agent Benchmark

Date: June 13, 2025 3-4pm


Session Leaders: Loren Riesenfeld


Format: Hybrid (In-person with online access)


Tags: #llm #ai #benchmark #agent


We typically evaluate LLM performance qualitatively (vibes) and quantitatively (benchmarks). The problem with most benchmarks is that they're boring as hell.

But who says LLM benchmarks have to be boring? Recently we've seen new, fun benchmarks like Claude Plays Pokémon and efforts to teach models the game Diplomacy.

In this session, I'll show a benchmark I created that measures how well frontier reasoning models can solve The Atlantic's new word game Bracket City.

We'll brainstorm new benchmark ideas and create our own fun benchmarks.

Prereqs:
- Some coding experience will be helpful, but not required.
- An OpenRouter account and API key: https://openrouter.ai/
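
If you want a head start, here is a minimal sketch (not from the session materials) of calling a model through OpenRouter's OpenAI-compatible chat completions endpoint and checking its answer against an expected one. The model name, puzzle prompt, and scoring check are placeholder assumptions you would swap for your own benchmark.

```python
# Minimal sketch: query a model via OpenRouter and do a naive correctness check.
# Assumes OPENROUTER_API_KEY is set in your environment (key from openrouter.ai).
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]

def ask_model(model: str, prompt: str) -> str:
    """Send a single prompt to a model via OpenRouter and return its reply text."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Hypothetical benchmark item: one puzzle-style prompt and one expected answer.
    puzzle = "Solve the clue: [capital of [country known for fjords]]"
    expected = "oslo"
    reply = ask_model("openai/gpt-4o", puzzle)
    print("Model said:", reply)
    print("Correct?", expected in reply.lower())
```

A real benchmark would loop this over many items and models and tally scores, but the single-call shape above is all you need to start experimenting.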