LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning
arxiv.org | AI Research | Apr 15, 2026, 5:58 PM
LongCoT (LLNL/Oxford; Bartoldson, Torr, de Witt) is a benchmark of 2,500 problems spanning chemistry, math, CS, chess, and logic. Each step is tractable for frontier models in isolation, so failures reflect long-horizon reasoning limits rather than the difficulty of any single step. Best scores at release: GPT 5.2 at 9.8%, Gemini 3 Pro at 6.1%.