A reading list for evaluators

This is my reading list for the summer of 2024. Much of it was provided by Ollie Jaffe, so thank you to him.

Must reads

https://www.jasonwei.net/blog/evals

GAIA: A Benchmark for General AI Assistants

MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

More tasks, human baselines, and preliminary results for GPT-4 and Claude - METR Blog

Evaluating Frontier Models for Dangerous Capabilities (see in particular the material on checkpoints)

Contamination

A Careful Examination of Large Language Model Performance on Grade School Arithmetic

Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

pass@n just works

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
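
For concreteness, here is a minimal sketch (mine, not from either paper) of the standard unbiased pass@k estimator from the Codex paper, which is how repeated-sampling results like these are typically reported: draw n samples, count the c that are correct, and estimate the chance that at least one of k random draws succeeds.

```python
# Unbiased pass@k estimator (Chen et al., 2021 style):
# the probability that a random size-k subset of the n samples
# contains at least one of the c correct ones.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every k-subset must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. n=200 samples with c=11 correct: pass@1 = 0.055, pass@100 ≈ 0.9996
```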

Smarter methods for test time compute

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

STaR: Self-Taught Reasoner Bootstrapping Reasoning With Reasoning

https://epochai.org/blog/trading-off-compute-in-training-and-inference#combining-techniques

The paper below is a big part of why people have faith in test time compute (see side experiment IV, part C):

Scaling Scaling Laws with Board Games
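
As a toy illustration of the simplest test-time compute method in this literature, here is a hedged best-of-n reranking sketch (mine, with stubbed-out generate/score functions standing in for a sampler and a verifier or reward model):

```python
# Best-of-n reranking: sample several candidate answers and keep the
# one a verifier/reward model scores highest. `generate` and `score`
# below are hypothetical stand-ins, not any particular API.
import random
from typing import Callable, List

def best_of_n(
    generate: Callable[[str], str],      # draws one candidate answer
    score: Callable[[str, str], float],  # higher = verifier prefers it
    prompt: str,
    n: int = 16,
) -> str:
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda ans: score(prompt, ans))

# Toy usage with stubs, just to show the control flow.
if __name__ == "__main__":
    stub_generate = lambda p: random.choice(["4", "5", "22"])
    stub_score = lambda p, ans: 1.0 if ans == "4" else 0.0
    print(best_of_n(stub_generate, stub_score, "What is 2 + 2?"))  # almost always "4"
```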

Is performance on evals correlated?

Observational Scaling Laws and the Predictability of Language Model Performance

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

How to build evals using AI

Discovering Language Model Behaviors with Model-Written Evaluations
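
A hedged sketch of the recipe in that paper, assuming a generic completion function (the `complete` callable below is a hypothetical stand-in, not a real API): one pass drafts candidate eval items, a second pass filters out drafts the model itself judges off-target. The real pipeline uses a preference model with confidence thresholds; this only shows the control flow.

```python
# Model-written evals, simplified: generate candidate items with an LM,
# then keep only those a second model pass judges to actually test the
# target behavior. `complete` is a hypothetical LM completion callable.
from typing import Callable, List

def model_written_eval(
    complete: Callable[[str], str],
    behavior: str,
    n_drafts: int = 100,
) -> List[str]:
    kept: List[str] = []
    for _ in range(n_drafts):
        draft = complete(
            "Write one yes/no question that tests whether an AI assistant "
            f"exhibits the following behavior: {behavior}"
        )
        verdict = complete(
            f"Does this question test the behavior '{behavior}'? "
            f"Answer YES or NO.\n\nQuestion: {draft}"
        )
        if verdict.strip().upper().startswith("YES"):
            kept.append(draft)
    return kept
```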

LM Reasoning

Language Models (Mostly) Know What They Know

Large Language Models Cannot Self-Correct Reasoning Yet

LLMs cannot find reasoning errors, but can correct them!