Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition

Authors

S. Dhahbi, N. Saleem, T. Surya Gunawan, S. Bourouis, I. Ali, A. Trigui, A. D. Algarni

DOI:

https://doi.org/10.9781/ijimai.2024.04.003

Keywords:

Real-Time Speech, Simple Recurrent Unit (SRU), Speech Enhancement, Speech Processing, Speech Quality
Supporting Agencies

The authors extend their appreciation to the Deanship of Scientific Research at King Khalid University for funding this work through the Large Group Research Project under grant number RGP2/383/44. This work was also supported by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R51), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Abstract

Traditional recurrent neural networks (RNNs) have difficulty capturing long-term temporal dependencies. Lightweight recurrent models for speech enhancement are therefore important: they must improve noisy speech while remaining computationally efficient and still capturing long-term dependencies. This study proposes a lightweight hourglass-shaped model for speech enhancement (SE) and automatic speech recognition (ASR). Simple recurrent units (SRUs) with skip connections are implemented, and attention gates are added to the skip connections to highlight the important features and spectral regions. The model operates without relying on future information, which makes it well suited for real-time processing. Combined acoustic features are used as input, and two training objectives are estimated. Experimental evaluations using short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ), and word error rate (WER) indicate better intelligibility, perceptual quality, and word recognition rates. The composite measures further confirm the model's performance in terms of residual noise and speech distortion. On the TIMIT database, the proposed model improves STOI by 16.21% and PESQ by 0.69 (31.1%) over the noisy speech, whereas on the LibriSpeech database it improves STOI by 16.41% and PESQ by 0.71 (32.9%). Further, the model outperforms other deep neural networks (DNNs) in both seen and unseen noise conditions. The ASR performance, measured with the Kaldi toolkit, reaches a 15.13% WER in noisy backgrounds.
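
The abstract describes the architecture only at a high level and gives no implementation. As a rough, hypothetical sketch of the ingredients it names (a causal SRU stack in an hourglass wide-narrow-wide layout, attention gates on the skip connections, and a time-frequency mask as one training target), the PyTorch fragment below may help. All layer widths, the additive attention gate, the mask output, and the simplified SRU cell (no highway term) are our assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class SRULayer(nn.Module):
    """Minimal simple recurrent unit (SRU): the matrix products are computed
    for all frames at once; only cheap elementwise operations recur in time."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.proj = nn.Linear(input_size, 3 * hidden_size)  # candidate, forget, reset
        self.hidden_size = hidden_size

    def forward(self, x):                       # x: (batch, time, input_size)
        cand, f, r = self.proj(x).chunk(3, dim=-1)
        f, r = torch.sigmoid(f), torch.sigmoid(r)
        c = x.new_zeros(x.size(0), self.hidden_size)
        outs = []
        for t in range(x.size(1)):              # causal: no future frames used
            c = f[:, t] * c + (1.0 - f[:, t]) * cand[:, t]
            outs.append(r[:, t] * torch.tanh(c))
        return torch.stack(outs, dim=1)

class AttentionGate(nn.Module):
    """Additive attention gate on a skip connection: the decoder state decides,
    per frame, how much of the encoder features to pass through."""
    def __init__(self, dim):
        super().__init__()
        self.w_skip = nn.Linear(dim, dim, bias=False)
        self.w_dec = nn.Linear(dim, dim, bias=False)
        self.score = nn.Linear(dim, 1)

    def forward(self, skip, dec):               # both (batch, time, dim)
        a = torch.sigmoid(self.score(torch.tanh(self.w_skip(skip) + self.w_dec(dec))))
        return skip * a                         # per-frame gate in [0, 1]

class HourglassSRU(nn.Module):
    """Wide-narrow-wide SRU stack with attention-gated skips; outputs a mask."""
    def __init__(self, n_feat=257):
        super().__init__()
        self.enc1, self.enc2 = SRULayer(n_feat, 512), SRULayer(512, 256)
        self.bottleneck = SRULayer(256, 128)
        self.dec1, self.dec2 = SRULayer(128, 256), SRULayer(256, 512)
        self.gate1, self.gate2 = AttentionGate(256), AttentionGate(512)
        self.mask = nn.Linear(512, n_feat)

    def forward(self, feats):                   # (batch, time, n_feat) spectra
        e1 = self.enc1(feats)
        e2 = self.enc2(e1)
        d1 = self.dec1(self.bottleneck(e2))
        d1 = d1 + self.gate1(e2, d1)            # attention-gated skip
        d2 = self.dec2(d1)
        d2 = d2 + self.gate2(e1, d2)
        return torch.sigmoid(self.mask(d2))     # time-frequency mask in [0, 1]

mask = HourglassSRU()(torch.randn(2, 100, 257)) # e.g. 2 utterances, 100 frames
```

Because every layer here is strictly causal and the recurrence is elementwise, frames can be enhanced as they arrive, which is the real-time property the abstract emphasizes.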
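
For the evaluation metrics named above, a minimal sketch using the open-source pystoi, pesq, and jiwer packages follows. These are common implementations of STOI, PESQ, and WER; the paper's exact scoring pipeline (it reports Kaldi-based ASR) is not stated in the abstract, and the file paths and transcripts below are hypothetical.

```python
# pip install pystoi pesq jiwer soundfile   (assumed third-party packages)
import soundfile as sf
from pystoi import stoi
from pesq import pesq
from jiwer import wer

# Hypothetical file paths; 16 kHz mono signals of equal length expected.
clean, fs = sf.read("clean.wav")
enhanced, _ = sf.read("enhanced.wav")

print("STOI:", stoi(clean, enhanced, fs, extended=False))  # intelligibility, 0..1
print("PESQ:", pesq(fs, clean, enhanced, "wb"))            # quality, about -0.5..4.5

# WER between a reference transcript and an ASR hypothesis (e.g. from Kaldi).
print("WER :", wer("she had your dark suit", "she had her dark suit"))
```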

Published

2024-06-01

How to Cite

Dhahbi, S., Saleem, N., Surya Gunawan, T., Bourouis, S., Ali, I., Trigui, A., and Algarni, A. D. (2024). Lightweight Real-Time Recurrent Models for Speech Enhancement and Automatic Speech Recognition. International Journal of Interactive Multimedia and Artificial Intelligence, 8(6), 74–85. https://doi.org/10.9781/ijimai.2024.04.003
