
Create Your Own LLM Agent Benchmark

Date: June 13, 2025 3-4pm


Session Leaders: Loren Riesenfeld


Format: Hybrid (In-person with online access)


Tags: #llm #ai #benchmark #agent


We typically evaluate LLM performance qualitatively (vibes) and quantitatively (benchmarks). The problem with most benchmarks is that they're boring as hell.

But who says LLM benchmarks have to be boring? Recently we've seen new, fun benchmarks like Claude Plays Pokémon and efforts to teach models the game Diplomacy.

In this session, I'll show a benchmark I created that measures how well frontier reasoning models can solve The Atlantic's new word game Bracket City.

We'll brainstorm new benchmark ideas and create our own fun benchmarks.

Prereqs:
- Some coding experience will be helpful, but not required.
- An OpenRouter account and API key: https://openrouter.ai/
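
If you want a head start, here is a minimal sketch (not from the session materials) of calling a model through OpenRouter's OpenAI-compatible chat completions endpoint and checking its answer against an expected one. The model name, puzzle prompt, and scoring check are placeholder assumptions you would swap for your own benchmark.

```python
# Minimal sketch: query a model via OpenRouter and do a naive correctness check.
# Assumes OPENROUTER_API_KEY is set in your environment (key from openrouter.ai).
import os
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]

def ask_model(model: str, prompt: str) -> str:
    """Send a single prompt to a model via OpenRouter and return its reply text."""
    response = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    # Hypothetical benchmark item: one puzzle-style prompt and one expected answer.
    puzzle = "Solve the clue: [capital of [country known for fjords]]"
    expected = "oslo"
    reply = ask_model("openai/gpt-4o", puzzle)
    print("Model said:", reply)
    print("Correct?", expected in reply.lower())
```

A real benchmark would loop this over many items and models and tally scores, but the single-call shape above is all you need to start experimenting.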