Efficient Gated Convolutional Recurrent Neural Networks for Real-Time Speech Enhancement.

Authors

  • Fazal E. Wahab, University of Science and Technology of China.
  • Zhongfu Ye, University of Science and Technology of China.
  • Nasir Saleem, Faculty of Engineering & Technology, Gomal University.
  • Hamza Ali, University of Engineering & Technology, Mardan.
  • Imad Ali, Department of Computer Science, University of Swat.

DOI:

https://doi.org/10.9781/ijimai.2023.05.007

Keywords:

Convolutional Gated Recurrent Unit (Convolutional GRU), Deep Learning, Intelligibility, Long Short Term Memory (LSTM), Speech Enhancement

Abstract

Deep learning (DL) networks have become powerful alternatives for speech enhancement and have achieved excellent results in improving speech quality, intelligibility, and background noise suppression. However, due to their high computational load, most DL models for speech enhancement are difficult to deploy for real-time processing, and formulating resource-efficient, compact networks remains challenging. To address this problem, we propose a resource-efficient convolutional recurrent network that learns the complex ratio mask for real-time speech enhancement. A convolutional encoder-decoder and gated recurrent units (GRUs) are integrated into the convolutional recurrent network architecture, forming a causal system suitable for real-time speech processing. Parallel GRU grouping and efficient skip connections are employed to obtain a compact network. In the proposed network, the causal encoder-decoder is composed of five convolutional (Conv2D) and deconvolutional (Deconv2D) layers. Leaky rectified linear unit (ReLU) activations are applied to all layers except the output layer, where a softplus activation is used to confine the network output to positive values. Furthermore, batch normalization is adopted after every convolution (or deconvolution) and prior to activation. The proposed network allows different noise types and speakers to be used in training and testing. Experiments on the LibriSpeech dataset show that the proposed real-time approach improves objective perceptual quality and intelligibility with far fewer trainable parameters than existing LSTM and GRU models. The proposed model obtains an average STOI score of 83.53% and an average PESQ score of 2.52, improving quality and intelligibility over noisy speech by 31.61% and 17.18%, respectively.
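
The abstract describes the overall architecture but not its hyperparameters. The PyTorch code below is a minimal sketch of one plausible reading: five causal Conv2D encoder layers and five Deconv2D decoder layers with concatenation-based skip connections, a parallel (grouped) GRU bottleneck, batch normalization after each (de)convolution and before activation, leaky ReLU everywhere except a softplus at the output. The 161-bin input, channel widths, 2x3 kernels with stride 2 in frequency, two-way GRU grouping, and all class and parameter names are illustrative assumptions, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalConv(nn.Module):
    """Conv2D -> BatchNorm -> LeakyReLU over (time, freq), padded causally in time."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=(2, 3),
                              stride=(1, 2), padding=(0, 1))
        self.bn = nn.BatchNorm2d(out_ch)

    def forward(self, x):
        x = F.pad(x, (0, 0, 1, 0))                 # one zero frame at the start of time
        return F.leaky_relu(self.bn(self.conv(x)), 0.2)


class CausalDeconv(nn.Module):
    """Deconv2D -> BatchNorm -> activation (leaky ReLU, or softplus at the output)."""

    def __init__(self, in_ch, out_ch, last=False):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=(2, 3),
                                         stride=(1, 2), padding=(0, 1))
        self.bn = nn.BatchNorm2d(out_ch)
        self.last = last

    def forward(self, x):
        t = x.shape[2]
        y = self.bn(self.deconv(x)[:, :, :t, :])   # trim the extra trailing frame
        return F.softplus(y) if self.last else F.leaky_relu(y, 0.2)


class GroupedGRUCRN(nn.Module):
    """Causal encoder-decoder with a parallel (grouped) GRU bottleneck and skips."""

    def __init__(self, freq_bins=161, groups=2):
        super().__init__()
        chans = [2, 16, 32, 64, 128, 256]          # real/imag input, 5 encoder widths
        self.encoder = nn.ModuleList(
            [CausalConv(chans[i], chans[i + 1]) for i in range(5)])
        f = freq_bins
        for _ in range(5):                         # frequency bins left after encoding
            f = (f + 1) // 2
        feat = chans[-1] * f
        self.groups = groups
        self.grus = nn.ModuleList(                 # parallel GRUs on feature groups
            [nn.GRU(feat // groups, feat // groups, batch_first=True)
             for _ in range(groups)])
        self.decoder = nn.ModuleList(              # skip connections double the input
            [CausalDeconv(2 * chans[5 - i], chans[4 - i], last=(i == 4))
             for i in range(5)])

    def forward(self, spec):                       # spec: (batch, 2, time, freq)
        skips, x = [], spec
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x = torch.cat([gru(part)[0] for gru, part in
                       zip(self.grus, x.chunk(self.groups, dim=-1))], dim=-1)
        x = x.reshape(b, t, c, f).permute(0, 2, 1, 3)
        for dec, skip in zip(self.decoder, reversed(skips)):
            x = dec(torch.cat([x, skip], dim=1))   # concatenation-based skip connection
        return x                                   # positive mask, (batch, 2, time, freq)


# Example: estimate a mask for a 100-frame, 161-bin noisy spectrogram.
net = GroupedGRUCRN()
noisy = torch.randn(1, 2, 100, 161)
print(net(noisy).shape)                            # torch.Size([1, 2, 100, 161])

The causal time padding in the encoder and the frame trim after each deconvolution keep every output frame dependent only on current and past frames, which is what makes such a system usable for frame-by-frame real-time processing.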

Published

2024-12-01

How to Cite

E. Wahab, F., Ye, Z., Saleem, N., Ali, H., and Ali, I. (2024). Efficient Gated Convolutional Recurrent Neural Networks for Real-Time Speech Enhancement. International Journal of Interactive Multimedia and Artificial Intelligence, 9(1), 66–74. https://doi.org/10.9781/ijimai.2023.05.007