RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time
arxiv.org | AI Research | Apr 13, 2026, 3:38 PM
Reward models for visual generation collapse human judgments into a single scalar, discarding the reasoning behind them. In RationalRewards (HKUST / U Waterloo; Wenhu Chen), PARROT recovers rationales from preference data. The 8B model matches Gemini-2.5-Pro with 10-20x less data, and a test-time Generate-Critique-Refine loop matches RL fine-tuning.
Apr 16, 2026, 4:22 PM