TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment
arxiv.org | AI Research | Apr 13, 2026, 8:00 PM

Dense patch-text alignment is where vision-language models (VLMs) still fall short. TIPSv2 (Google DeepMind; CVPR 2026) pairs patch-level distillation, in which the student surprisingly surpasses its teacher, with iBOT++, which extends self-distillation to visible patches rather than only masked ones. Evaluated on 9 tasks and 20 datasets. Code and models are released.
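To make the masked-versus-visible distinction concrete, here is a minimal sketch of a patch-level self-distillation loss. It is an illustration based only on the one-line description above, not the TIPSv2 implementation: the function names, the `include_visible` flag, the temperature parameter, and the toy logits are all hypothetical. The loss for each patch is the cross-entropy between the teacher's and the student's softmax distributions; the only difference between the two variants is which patches are averaged over.

```python
import math

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over one patch's logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]

def patch_cross_entropy(teacher_logits, student_logits, temperature=1.0):
    """Cross-entropy between teacher and student distributions for one patch."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))

def patch_distillation_loss(teacher, student, masked, include_visible):
    """Mean per-patch loss.

    include_visible=False: average over masked patches only (masked-only,
    iBOT-style objective).
    include_visible=True: average over every patch, masked or visible
    (assumed reading of the "visible patches too" extension).
    """
    losses = [
        patch_cross_entropy(t, s)
        for t, s, is_masked in zip(teacher, student, masked)
        if is_masked or include_visible
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Toy example (hypothetical values): 4 patches, 3 prototype logits each.
teacher = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0], [0.0, 0.0, 3.0], [1.0, -1.0, 0.5]]
student = [[1.5, 0.2, -0.5], [0.0, 0.0, 0.0], [0.5, -0.5, 2.0], [0.8, -0.7, 0.6]]
masked = [True, False, False, True]  # patches 0 and 3 are masked for the student

print("masked only:", patch_distillation_loss(teacher, student, masked, False))
print("all patches:", patch_distillation_loss(teacher, student, masked, True))
```

In the masked-only setting, patches 1 and 2 contribute no gradient signal; with `include_visible=True` every patch is supervised by the teacher, which is the intuition behind applying the objective to visible patches as well.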
Apr 16, 2026, 2:10 PM