- Title
Deep fusion framework for speech command recognition using acoustic and linguistic features.
- Authors
Mehra, Sunakshi; Susan, Seba
- Abstract
The research problem addressed in this study is how to effectively combine multimodal data from imperfect text transcripts and raw audio in a deep framework for automatic speech recognition. We propose fusing the audio and text modalities late in the pipeline. Each modality is processed independently by a self-attention based deep bidirectional long short-term memory network (SA-deep BiLSTM), which comprises five BiLSTM layers with a self-attention module between the third and fourth layers. The linguistic features are word stems extracted from the text transcript, vectorized using GloVe word embeddings; the acoustic features are Mel frequency cepstral coefficients (MFCC) and the Mel-spectrogram. By fusing the posterior class probabilities of the SA-deep BiLSTM models trained on the individual modalities, we achieve an accuracy of 98.80% on the 10-word categories of the Google speech command dataset. Extensive experiments on the Google speech command dataset and an ablation analysis show that the proposed method outperforms the state of the art in classification accuracy.
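The late-fusion step described in the abstract combines the posterior class probabilities produced by the two unimodal models. A minimal sketch of decision-level fusion is given below; the weighted-averaging rule, the function names, and the example probabilities are illustrative assumptions, not details taken from the paper.

```python
# Sketch of late (decision-level) fusion of posterior class probabilities
# from two unimodal classifiers (audio and text). The equal-weight average
# is an assumption; the paper only states that posteriors are fused.

def fuse_posteriors(audio_probs, text_probs, w_audio=0.5, w_text=0.5):
    """Weighted average of per-class posteriors, renormalised to sum to 1."""
    fused = [w_audio * a + w_text * t for a, t in zip(audio_probs, text_probs)]
    total = sum(fused)
    return [p / total for p in fused]

def predict(fused_probs, labels):
    """Return the label with the highest fused posterior."""
    return max(zip(fused_probs, labels))[1]

# Hypothetical posteriors over three of the ten command classes
labels = ["yes", "no", "stop"]
audio_probs = [0.6, 0.3, 0.1]   # from the audio-side SA-deep BiLSTM
text_probs  = [0.5, 0.1, 0.4]   # from the text-side SA-deep BiLSTM

fused = fuse_posteriors(audio_probs, text_probs)
print(predict(fused, labels))  # -> yes
```

In this sketch a class that only one modality favours can still win if its fused posterior is highest, which is the intended benefit of combining complementary acoustic and linguistic evidence.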
- Publication
Multimedia Tools & Applications, 2023, Vol 82, Issue 25, p38667
- ISSN
1380-7501
- Publication type
Article
- DOI
10.1007/s11042-023-15118-1