- Title
Word-to-region attention network for visual question answering.
- Authors
Peng, Liang; Yang, Yang; Bin, Yi; Xie, Ning; Shen, Fumin; Ji, Yanli; Xu, Xing
- Abstract
Visual attention, which allows more concentration on the image regions that are relevant to a reference question, brings remarkable performance improvement in Visual Question Answering (VQA). Most VQA attention models employ the entire reference question representation to query relevant image regions. Nonetheless, only certain salient words of the question play an effective role in an attention operation. In this paper, we propose a novel Word-to-Region Attention Network (WRAN), which can 1) simultaneously locate pertinent object regions, instead of a uniform grid of image regions of equal size, and identify the corresponding words of the reference question; as well as 2) enforce consistency between image object regions and the core semantics of questions. We evaluate the proposed model on the VQA v1.0 and VQA v2.0 datasets. Experimental results demonstrate the superiority of the proposed model compared to state-of-the-art methods.
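The abstract describes an attention mechanism in which individual question words, rather than the whole question vector, query detected object regions. The sketch below is a minimal, hypothetical illustration of such word-to-region attention in PyTorch; the layer names, dimensions, and the word-salience gating are assumptions for illustration and do not reproduce the authors' actual WRAN architecture.

```python
# Hypothetical sketch of word-to-region attention (not the authors' code):
# each question word attends over detected object-region features, and the
# word-weighted region summaries are pooled into a joint representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordToRegionAttention(nn.Module):
    def __init__(self, word_dim=300, region_dim=2048, hidden_dim=512):
        super().__init__()
        self.word_proj = nn.Linear(word_dim, hidden_dim)
        self.region_proj = nn.Linear(region_dim, hidden_dim)
        self.word_gate = nn.Linear(word_dim, 1)  # salience score per word (assumed)

    def forward(self, words, regions):
        # words:   (batch, n_words, word_dim),   e.g. word embeddings
        # regions: (batch, n_regions, region_dim), e.g. object-detector features
        w = self.word_proj(words)                        # (B, T, H)
        r = self.region_proj(regions)                    # (B, K, H)
        # word-to-region affinity: each word scores every object region
        scores = torch.bmm(w, r.transpose(1, 2))         # (B, T, K)
        region_att = F.softmax(scores, dim=-1)           # attention over regions per word
        attended = torch.bmm(region_att, r)              # (B, T, H) region summary per word
        # salient-word weights: only certain words should drive the attention
        word_att = F.softmax(self.word_gate(words), dim=1)   # (B, T, 1)
        fused = (word_att * attended * w).sum(dim=1)     # (B, H) joint representation
        return fused

# Usage with random tensors standing in for question/image features
model = WordToRegionAttention()
q = torch.randn(2, 14, 300)    # 2 questions, 14 words each
v = torch.randn(2, 36, 2048)   # 36 detected object regions per image
print(model(q, v).shape)       # torch.Size([2, 512])
```

In this sketch, the elementwise product of the word-level attention, the attended region features, and the projected word features is one simple way to encourage agreement between object regions and the question's salient words; the paper's actual consistency mechanism may differ.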
- Subjects
QUESTION answering systems; ARTIFICIAL intelligence; MACHINE learning; VISUAL perception; SEMANTICS
- Publication
Multimedia Tools & Applications, 2019, Vol 78, Issue 3, p3843
- ISSN
1380-7501
- Publication type
Article
- DOI
10.1007/s11042-018-6389-3