ERBM-SE: Extended Restricted Boltzmann Machine for Multi-Objective Single-Channel Speech Enhancement.

Muhammad Irfan Khattak; Nasir Saleem; Aamir Nawaz; Aftab Ahmed Almani; Farhana Umer; Elena Verdú

doi:10.9781/ijimai.2022.03.002

Authors

Muhammad Irfan Khattak, University of Engineering and Technology Peshawar
Nasir Saleem, Gomal University
Aamir Nawaz, Gomal University
Aftab Ahmed Almani, Shandong University
Farhana Umer, Islamia University of Bahawalpur
Elena Verdú, Universidad Internacional De La Rioja

DOI:

https://doi.org/10.9781/ijimai.2022.03.002

Keywords:

Restricted Boltzmann Machine, Spectral Masking, Speech Enhancement, Speech Intelligibility, Speech Quality, Supervised Learning, Machine Learning

Abstract

Supervised, machine learning-based single-channel speech enhancement has attracted considerable research interest over conventional approaches. In this paper, an extended Restricted Boltzmann Machine (RBM) is proposed for spectral masking-based enhancement of noisy speech. In a conventional RBM, the acoustic features for the speech enhancement task are extracted layer by layer, and the feature compression may result in the loss of vital information during network training. To exploit the important information in the raw data, an extended RBM is proposed for acoustic feature representation and speech enhancement. In the proposed RBM, the acoustic features are progressively extracted by multiple stacked RBMs during the pre-training phase. The hidden acoustic features from the previous RBM are combined with the raw input data, and the combination serves as the new input to the present RBM. By adding the raw data to each RBM, the layer-wise features related to the raw data are progressively extracted, which helps to mine valuable information in the raw data. Results on the TIMIT database show that the proposed method successfully attenuates the noise and improves speech quality and intelligibility. STOI, PESQ, and SDR improve by 16.86%, 25.01%, and 3.84 dB, respectively, over the unprocessed noisy speech.
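
The stacking idea in the abstract, where each RBM after the first receives the previous RBM's hidden features joined with the raw input, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes Bernoulli units, inputs scaled to [0, 1], concatenation as the way of combining features with raw data, and one-step contrastive divergence (CD-1) for pre-training. The names `RBM`, `pretrain_extended_stack`, and `extract_features`, along with all hyperparameters, are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class RBM:
    """Bernoulli-Bernoulli RBM trained with CD-1 (an assumption of this sketch)."""

    def __init__(self, n_visible, n_hidden, rng):
        self.W = 0.01 * rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.rng = rng

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, lr=0.05):
        # Positive phase: hidden activations driven by the data.
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        # Negative phase: one Gibbs step back to the visible layer and up again.
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)


def pretrain_extended_stack(raw, hidden_sizes, epochs=5, seed=0):
    """Greedy layer-wise pre-training with the raw input re-injected at each layer."""
    rng = np.random.default_rng(seed)
    rbms = []
    layer_input = raw
    for n_hidden in hidden_sizes:
        rbm = RBM(layer_input.shape[1], n_hidden, rng)
        for _ in range(epochs):
            rbm.cd1_step(layer_input)
        rbms.append(rbm)
        hidden = rbm.hidden_probs(layer_input)
        # Extended step: the next RBM sees the hidden features AND the raw data.
        layer_input = np.concatenate([hidden, raw], axis=1)
    return rbms


def extract_features(rbms, raw):
    """Forward pass through the stack, returning the top-layer hidden features."""
    layer_input = raw
    for rbm in rbms:
        hidden = rbm.hidden_probs(layer_input)
        layer_input = np.concatenate([hidden, raw], axis=1)
    return hidden


# Toy usage: 32 frames of 20-dimensional features standing in for acoustic input.
raw = np.random.default_rng(1).random((32, 20))
rbms = pretrain_extended_stack(raw, hidden_sizes=[16, 8])
features = extract_features(rbms, raw)
```

In this sketch the second RBM has 16 + 20 visible units, which makes the re-injection of the raw data explicit. A conventional stack would pass only the 16 hidden features forward.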
