- Title
Deep fusion framework for speech command recognition using acoustic and linguistic features.
- Authors
Mehra, Sunakshi; Susan, Seba
- Abstract
The research problem addressed in this study is how to effectively combine multimodal data from imperfect text transcripts and raw audio in a deep framework for automatic speech recognition. We propose fusing the audio and text modalities late in the pipeline. Each modality is processed independently by a self-attention based deep bidirectional long short-term memory network (SA-deep BiLSTM), which comprises five BiLSTM layers with a self-attention module between the third and fourth layers. The linguistic features are word stems extracted from the text transcript, vectorized using GloVe word embeddings; the acoustic features are Mel frequency cepstral coefficients (MFCC) and the Mel-spectrogram. By fusing the posterior class probabilities of the SA-deep BiLSTM models trained on the individual modalities, we achieve an accuracy of 98.80% on the 10-word categories of the Google speech command dataset. Extensive experiments on the Google speech command dataset and an ablation analysis show that the proposed method outperforms the state of the art in classification accuracy.
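The late-fusion step described in the abstract combines the posterior class probabilities produced by the two unimodal models. A minimal sketch of decision-level fusion is given below; the weighted-averaging rule, the function names, and the example probabilities are illustrative assumptions, not details taken from the paper.

```python
# Sketch of late (decision-level) fusion of posterior class probabilities
# from two unimodal classifiers (audio and text). The equal-weight average
# is an assumption; the paper only states that posteriors are fused.

def fuse_posteriors(audio_probs, text_probs, w_audio=0.5, w_text=0.5):
    """Weighted average of per-class posteriors, renormalised to sum to 1."""
    fused = [w_audio * a + w_text * t for a, t in zip(audio_probs, text_probs)]
    total = sum(fused)
    return [p / total for p in fused]

def predict(fused_probs, labels):
    """Return the label with the highest fused posterior."""
    return max(zip(fused_probs, labels))[1]

# Hypothetical posteriors over three of the ten command classes
labels = ["yes", "no", "stop"]
audio_probs = [0.6, 0.3, 0.1]   # from the audio-side SA-deep BiLSTM
text_probs  = [0.5, 0.1, 0.4]   # from the text-side SA-deep BiLSTM

fused = fuse_posteriors(audio_probs, text_probs)
print(predict(fused, labels))  # -> yes
```

In this sketch a class that only one modality favours can still win if its fused posterior is highest, which is the intended benefit of combining complementary acoustic and linguistic evidence.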
- Publication
Multimedia Tools & Applications, 2023, Vol 82, Issue 25, p38667
- ISSN
1380-7501
- Publication type
Article
- DOI
10.1007/s11042-023-15118-1