TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Dense patch-text alignment is where vision-language models (VLMs) still fall short. TIPSv2 (Google DeepMind; CVPR 2026) makes two contributions: patch-level distillation, in which the student surprisingly surpasses the teacher, and iBOT++, which extends self-distillation to visible patches rather than only masked ones. Evaluated on 9 tasks and 20 datasets. Code and models released.
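
To make the iBOT++ idea concrete, here is a minimal NumPy sketch of a patch-level self-distillation loss that covers either masked patches only or all patches, visible ones included. It is an illustration under assumptions, not the released TIPSv2 code: the function name `patch_distill_loss`, the `include_visible` flag, the temperature values, and the plain cross-entropy form (no teacher centering) are all hypothetical choices made for this example.

```python
import numpy as np


def log_softmax(x, temp):
    """Numerically stable log-softmax over the last axis, with a temperature."""
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def patch_distill_loss(student_logits, teacher_logits, mask, include_visible,
                       student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between teacher and student per-patch distributions.

    student_logits, teacher_logits: arrays of shape (num_patches, num_prototypes).
    mask: bool array of shape (num_patches,), True where the student's input
        patch was masked.
    include_visible: False averages over masked patches only (masked-only
        baseline); True averages over every patch (the reading of "visible
        patches too" assumed in this sketch).
    Temperatures are illustrative defaults, not values from the paper.
    """
    p_teacher = np.exp(log_softmax(teacher_logits, teacher_temp))
    log_p_student = log_softmax(student_logits, student_temp)
    per_patch = -(p_teacher * log_p_student).sum(axis=-1)  # shape (num_patches,)

    if include_visible:
        weights = np.ones_like(per_patch)
    else:
        weights = mask.astype(per_patch.dtype)
    # Guard against an all-visible input in the masked-only case.
    return float((per_patch * weights).sum() / max(weights.sum(), 1.0))


# Toy inputs: 6 patches, 8 prototypes, 3 of the patches masked.
rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 8))
student = rng.normal(size=(6, 8))
mask = np.array([True, False, True, False, False, True])

masked_only = patch_distill_loss(student, teacher, mask, include_visible=False)
all_patches = patch_distill_loss(student, teacher, mask, include_visible=True)
```

In the masked-only setting, the student's outputs at visible positions receive no patch-level training signal; with `include_visible=True` every patch contributes to the loss, which is the difference this sketch is meant to show.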