LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

LongCoT (LLNL/Oxford; Bartoldson, Torr, de Witt) is a benchmark of 2,500 problems in chemistry, math, CS, chess, and logic. Each step is tractable for frontier models in isolation, so failures reflect long-horizon reasoning limits rather than single-step difficulty. Best scores at release: GPT 5.2 at 9.8%, Gemini 3 Pro at 6.1%.
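
One way to see why individually tractable steps can still produce low end-to-end scores is a toy compounding-error calculation. The sketch below assumes every step succeeds independently with the same probability; that is a simplifying assumption of this illustration, not a model taken from LongCoT, and the function name `chain_success` and the example values are hypothetical.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that a chain of n_steps is entirely correct.

    Assumes (for illustration only) that each step succeeds
    independently with the same probability p_step.
    """
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


if __name__ == "__main__":
    # Hypothetical per-step reliabilities and chain lengths.
    for p in (0.99, 0.999):
        for n in (10, 100, 1000):
            print(f"p_step={p}, n_steps={n}: {chain_success(p, n):.4f}")
```

Under this toy model, even a high per-step success rate decays geometrically with chain length, which is consistent with the note's framing that failures come from the horizon rather than from any single step.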