A reading list for evaluators
This is my reading list for the summer of 2024. Much of it was put together by Ollie Jaffe, so thank you to him.
Must-reads
Successful language model evals (Jason Wei): https://www.jasonwei.net/blog/evals
GAIA: A Benchmark for General AI Assistants
MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
More tasks, human baselines, and preliminary results for GPT-4 and Claude (METR blog)
Evaluating Frontier Models for Dangerous Capabilities (see in particular the discussion of checkpoints)
Contamination
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
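The rephrased-samples paper shows how easily the standard n-gram-overlap decontamination check is defeated: rewording a test item changes its surface form without changing what it tests. To make concrete what rephrasing evades, here is a minimal sketch of that overlap baseline (the 13-gram window is a common choice; the function names and example strings are illustrative):

```python
def ngrams(text: str, n: int = 13) -> set:
    """All n-grams of whitespace tokens (a 13-gram window is a common choice)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(train_doc: str, test_item: str, n: int = 13) -> bool:
    """Flag a test item if any of its n-grams appears verbatim in training text.

    Rephrasing a test item breaks these exact matches, which is why the
    rephrased-samples paper finds overlap checks like this one insufficient.
    """
    return bool(ngrams(train_doc, n) & ngrams(test_item, n))

leaked = "the quick brown fox jumps over the lazy dog near the old river bank"
print(is_contaminated("training corpus ... " + leaked + " ... more text", leaked))  # True
print(is_contaminated("training corpus with a paraphrase of that item", leaked))    # False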
pass@n just works
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
The Larger the Better? Improved LLM Code-Generation via Budget Reallocation
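Both of these build on repeated sampling and the unbiased pass@k estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c that pass, and estimate the probability that at least one of k samples would pass. A minimal sketch:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples drawn per problem, c: samples that passed, k: target budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    # 1 - C(n-c, k) / C(n, k), as a running product to avoid huge binomials
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 3 passed; estimate pass@10.
print(pass_at_k(n=200, c=3, k=10))
```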
Smarter methods for test-time compute
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
STaR: Bootstrapping Reasoning With Reasoning (Self-Taught Reasoner)
Trading Off Compute in Training and Inference (Epoch AI), in particular the section on combining techniques: https://epochai.org/blog/trading-off-compute-in-training-and-inference#combining-techniques
Scaling Scaling Laws with Board Games (this is why people have faith in test-time compute; see side experiment IV, part C)
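A common thread across these papers is spending inference compute to draw many candidates and then selecting among them. As one concrete instance, here is a minimal best-of-N sketch; `generate` and `score` are hypothetical stand-ins for your sampler and your verifier or reward model:

```python
import random
from typing import Callable

def best_of_n(generate: Callable[[str], str],
              score: Callable[[str, str], float],
              prompt: str, n: int) -> str:
    """Sample n candidates and return the one the verifier scores highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))

# Toy stand-ins so the sketch runs: a noisy generator and a trivial "verifier".
random.seed(0)
toy_generate = lambda p: f"{p} -> {random.randint(0, 9)}"
toy_score = lambda p, a: float(a.endswith("7"))  # pretend 7 is the right answer
print(best_of_n(toy_generate, toy_score, "2+5", n=16))
```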
Is performance on evals correlated?
Observational Scaling Laws and the Predictability of Language Model Performance
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
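Both papers boil down to the same quantitative question: across a set of models, how correlated are scores on different benchmarks? Safetywashing computes exactly this kind of correlation between safety and capability benchmarks; the observational-scaling-laws paper goes further and extracts shared capability components. A minimal sketch, assuming a models-by-benchmarks score matrix (the numbers and the safety benchmark name are made up for illustration):

```python
import numpy as np

# Hypothetical scores: rows are models, columns are benchmarks.
benchmarks = ["MMLU", "GSM8K", "SafetyBenchX"]
scores = np.array([
    [0.45, 0.20, 0.50],
    [0.60, 0.35, 0.58],
    [0.70, 0.55, 0.66],
    [0.82, 0.80, 0.71],
])

# Pearson correlation between benchmarks, computed across models.
corr = np.corrcoef(scores, rowvar=False)
print(benchmarks)
print(np.round(corr, 2))
# A "safety" benchmark whose scores correlate ~1.0 with capability benchmarks
# is arguably measuring capabilities, not safety progress.
```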
How to build evals using AI
Discovering Language Model Behaviors with Model-Written Evaluations
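The recipe in that paper is roughly: prompt one model to generate candidate eval items, filter them with a second model call acting as a judge, and keep what passes. A minimal sketch of that loop; `lm` is a hypothetical stand-in for whatever completion API you use, and the prompts are illustrative only:

```python
from typing import Callable

def build_eval(lm: Callable[[str], str], behavior: str, n_items: int) -> list:
    """Generate eval questions for a behavior, keeping only judge-approved ones."""
    items = []
    while len(items) < n_items:
        question = lm(
            "Write one yes/no question that tests whether an assistant "
            f"exhibits this behavior: {behavior}\nQuestion:"
        ).strip()
        verdict = lm(
            f"Does this question test for '{behavior}'? Answer yes or no.\n"
            f"Question: {question}\nAnswer:"
        ).strip().lower()
        if verdict.startswith("yes"):  # second call acts as a relevance filter
            items.append(question)
    return items

# Toy stand-in so the sketch runs without an API.
def toy_lm(prompt: str) -> str:
    return "yes" if prompt.startswith("Does") else "Do you agree with everything I say?"

print(build_eval(toy_lm, "sycophancy", 2))
```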
LM Reasoning
Language Models (Mostly) Know What They Know
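The headline measurement in that paper is calibration: when a model states P(True) = 0.8 about its own answer, is it right about 80% of the time? Expected calibration error is a standard way to quantify this; here is a minimal sketch over (stated probability, correctness) pairs, run on synthetic data that is calibrated by construction:

```python
import numpy as np

def expected_calibration_error(p_true, correct, bins: int = 10) -> float:
    """ECE: bin answers by stated P(True), compare mean stated confidence
    to empirical accuracy in each bin, weighted by bin size."""
    p_true = np.asarray(p_true)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    ece = 0.0
    for i in range(bins):
        lo, hi = edges[i], edges[i + 1]
        # last bin is closed on the right so that P(True) = 1.0 is included
        mask = (p_true >= lo) & (p_true < hi) if i < bins - 1 else (p_true >= lo)
        if mask.any():
            ece += mask.mean() * abs(p_true[mask].mean() - correct[mask].mean())
    return float(ece)

rng = np.random.default_rng(0)
p = rng.uniform(size=5000)                      # stated P(True)
y = (rng.uniform(size=5000) < p).astype(float)  # correctness, calibrated by construction
print(expected_calibration_error(p, y))         # close to 0 for a calibrated model
```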