Near, far: Patch-ordering enhances vision foundation models' scene understanding

Pariza, Valentinos; Salehi, Mohammadreza; Burghouts, Gertjan; Locatello, Francesco; Asano, Yuki M.

arXiv:2408.11054 (cs)

[Submitted on 20 Aug 2024 (v1), last revised 17 Apr 2025 (this version, v3)]

Abstract: We introduce NeCo: Patch Neighbor Consistency, a novel self-supervised training loss that enforces patch-level nearest-neighbor consistency across a student and a teacher model. Compared to contrastive approaches that only yield binary learning signals, i.e., 'attract' and 'repel', this approach benefits from the more fine-grained learning signal of sorting spatially dense features relative to reference patches. Our method applies differentiable sorting on top of pretrained representations, such as DINOv2-registers, to bootstrap the learning signal and further improve upon them. This dense post-pretraining leads to superior performance across various models and datasets, despite requiring only 19 hours on a single GPU. The method generates high-quality dense feature encoders and establishes several new state-of-the-art results, such as +5.5% and +6% for non-parametric in-context semantic segmentation on ADE20k and Pascal VOC, +7.2% and +5.7% for linear segmentation evaluations on COCO-Things and -Stuff, and an improvement of more than 1.5% in the 3D understanding of multi-view consistency on SPair-71k.
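
To make the idea of "sorting features relative to reference patches" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the differentiable sorting operator, so this sketch swaps in a simple pairwise-sigmoid soft rank as a stand-in. The function names (`cosine`, `soft_ranks`, `ordering_gap`), the temperature `tau`, the use of a single shared reference vector, and the squared-error comparison are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def soft_ranks(scores, tau=0.1):
    """Smooth rank of each score (close to 0 for the highest score).

    The hard count "how many scores are larger than mine" is relaxed
    with a sigmoid, so the rank changes smoothly with the scores.
    This is an illustrative stand-in, not the paper's sorting operator.
    """
    ranks = []
    for i, s_i in enumerate(scores):
        r = 0.0
        for j, s_j in enumerate(scores):
            if j != i:
                r += 1.0 / (1.0 + math.exp(-(s_j - s_i) / tau))
        ranks.append(r)
    return ranks

def ordering_gap(student_patches, teacher_patches, reference, tau=0.1):
    """Mean squared difference between student and teacher soft ranks.

    Each model's patch features are ranked by similarity to the same
    reference vector (an assumption of this sketch); the gap is zero
    when both models order the patches identically.
    """
    s_sims = [cosine(p, reference) for p in student_patches]
    t_sims = [cosine(p, reference) for p in teacher_patches]
    s_rank = soft_ranks(s_sims, tau)
    t_rank = soft_ranks(t_sims, tau)
    return sum((a - b) ** 2 for a, b in zip(s_rank, t_rank)) / len(s_rank)

# Toy 2-D "patch features": the teacher orders patches near-to-far
# from the reference, the student orders them the other way round.
reference = [1.0, 0.0]
teacher = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
student = [[0.0, 1.0], [0.8, 0.6], [1.0, 0.0]]

print(ordering_gap(teacher, teacher, reference))  # identical orderings
print(ordering_gap(student, teacher, reference))  # reversed orderings
```

Unlike a binary attract/repel signal, the gap in this sketch depends on where every patch lands in the ordering, which is the kind of finer-grained signal the abstract describes.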

Comments:	Accepted at ICLR25. The webpage is accessible at: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2408.11054 [cs.CV]
	(or arXiv:2408.11054v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2408.11054

Submission history

From: Mohammadreza Salehi Dehnavi
[v1] Tue, 20 Aug 2024 17:58:59 UTC (36,425 KB)
[v2] Tue, 11 Feb 2025 14:15:13 UTC (40,871 KB)
[v3] Thu, 17 Apr 2025 09:45:54 UTC (47,850 KB)
