- Title
Task-like training paradigm in CLIP for zero-shot sketch-based image retrieval.
- Authors
Zhang, Haoxiang; Cheng, Deqiang; Jiang, He; Liu, Jingjing; Kou, Qiqi
- Abstract
The Contrastive Language-Image Pre-training model (CLIP) has recently gained attention in the zero-shot domain. However, it still falls short in addressing cross-modal perception and the semantic gap between seen and unseen classes in Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR). To overcome these obstacles, we propose a Task-Like Training paradigm (TLT). In this work, we view cross-modal perception and the semantic gap as a multi-task learning process. Before tackling these challenges, we fully exploit CLIP's text encoder and propose a text-based identification learning mechanism that helps the model quickly learn discriminative features. Next, we propose text prompt tutoring and cross-modal consistency learning to address cross-modal perception and the semantic gap, respectively. Meanwhile, we present a collaborative architecture to explore the potential shared information between tasks. Extensive results show that our approach significantly outperforms state-of-the-art methods on the Sketchy, Sketchy-No, Tuberlin, and QuickDraw datasets.
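For context, a minimal sketch (not from the paper, and independent of the proposed TLT method) of how CLIP-style retrieval scoring works in ZS-SBIR: a sketch query and a gallery of photo embeddings are L2-normalized and ranked by cosine similarity. Random vectors stand in here for the outputs of CLIP's encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder embeddings standing in for CLIP encoder outputs
# (in practice: image features for the gallery, sketch features for the query).
gallery = rng.standard_normal((5, 512))  # 5 gallery photos, 512-d features
query = rng.standard_normal((1, 512))    # 1 query sketch

# L2-normalize so a dot product equals cosine similarity.
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query /= np.linalg.norm(query, axis=1, keepdims=True)

# Rank gallery photos by similarity to the sketch query (best first).
scores = (query @ gallery.T).ravel()
ranking = np.argsort(-scores)
print(ranking)
```

Retrieval metrics such as mAP and Prec@k are then computed over this ranking against the class labels of the gallery images.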
- Subjects
IMAGE retrieval; TUTORS & tutoring
- Publication
Multimedia Tools & Applications, 2024, Vol 83, Issue 19, p57811
- ISSN
1380-7501
- Publication type
Article
- DOI
10.1007/s11042-023-17675-x