- Title
Improving scene text image captioning using transformer-based multilevel attention.
- Authors
Srivastava, Swati; Sharma, Himanshu
- Abstract
Many existing image captioning methods focus only on image objects and their relationships when generating captions, ignoring the text present in an image. Scene text (ST) contains information crucial for understanding an image and facilitating reasoning. The existing methods fail to establish strong correlations between optical character recognition (OCR) tokens, as they have limited OCR representation power. Further, these methods have not efficiently used the positional information of the text. In this work, we propose an ST-based image captioning model (Trans-MAtt) based on a multilevel attention mechanism and a relation network. We use relation networks to enhance the connections between ST tokens. We employ a multilevel attention method, which comprises spatial, semantic, and appearance attention modules that precisely define the image. To represent context-enriched ST tokens, we use a combination of appearance, location, FastText, and PHOC features. We predict the ST location in the image, which is further integrated with the generated word embeddings for final caption generation. Experiments on the TextCaps dataset demonstrate the effectiveness of the proposed Trans-MAtt model, where it outperforms the current best model by 3.4% on B-4, 2.9% on METEOR, 3.3% on ROUGE-L, 3.1% on CIDEr-D, and 4.1% on SPICE metric scores. Our experiments on the Flickr30k and MSCOCO datasets also demonstrate the superiority of the proposed model over existing methods.
- Subjects
OPTICAL character recognition
- Publication
Journal of Electronic Imaging, 2023, Vol 32, Issue 3, p033023
- ISSN
1017-9909
- Publication type
Article
- DOI
10.1117/1.JEI.32.3.033023