Srivastava, Swati; Sharma, Himanshu

doi:10.1117/1.JEI.32.3.033023

Title: Improving scene text image captioning using transformer-based multilevel attention.
Authors: Srivastava, Swati; Sharma, Himanshu
Abstract: Many existing image captioning methods focus only on image objects and their relationships when generating captions, ignoring the text present in an image. Scene text (ST) contains information that is crucial for understanding an image and facilitates reasoning. Existing methods fail to establish strong correlations between optical character recognition (OCR) tokens because they have limited OCR representation power. Further, these methods do not use the positional information of the text efficiently. In this work, we propose an ST-based image captioning model (Trans-MAtt) based on a multilevel attention mechanism and a relation network. We use relation networks to enhance the connections between ST tokens. We employ a multilevel attention method comprising spatial, semantic, and appearance attention modules that precisely define the image. To represent context-enriched ST tokens, we use a combination of appearance, location, FastText, and PHOC features. We predict the ST location in the image, which is then integrated with the generated word embeddings for final caption generation. Experiments on the TextCaps dataset demonstrate the effectiveness of the proposed Trans-MAtt model, where it outperforms the current best model by 3.4% on B-4, 2.9% on METEOR, 3.3% on ROUGE-L, 3.1% on CIDEr-D, and 4.1% on SPICE metric scores. Our experiments on the Flickr30k and MSCOCO datasets also demonstrate the superiority of the proposed model over existing methods.
Subjects: OPTICAL character recognition
Publication: Journal of Electronic Imaging, 2023, Vol 32, Issue 3, p33023
ISSN: 1017-9909
Publication type: Article
DOI: 10.1117/1.JEI.32.3.033023
