Audio-Visual Automatic Speech Recognition Using PZM, MFCC and Statistical Analysis.
DOI: https://doi.org/10.9781/ijimai.2021.09.001

Keywords: Audio-visual Speech Recognition, Lip Tracking, Pseudo Zernike Moment, Mel Frequency Cepstral Coefficients (MFCC), Incremental Feature Selection (IFS), Statistical Analysis

Abstract
Audio-Visual Automatic Speech Recognition (AV-ASR) has become a promising research area for situations in which the audio signal is corrupted by noise. The main objective of this paper is to select the most important and discriminative audio and visual speech features for recognizing audio-visual speech. This paper proposes the Pseudo Zernike Moment (PZM) together with a feature selection method for audio-visual speech recognition. Visual information is captured from the lip contour, and moments are computed from it for lip reading. From the audio, 19 Mel Frequency Cepstral Coefficients (MFCC) are extracted as speech features. Since not all 19 speech features are equally important, feature selection algorithms are used to select the most effective ones. Statistical tests such as Analysis of Variance (ANOVA), the Kruskal-Wallis test, and the Friedman test are employed to analyze the significance of the features, in combination with the Incremental Feature Selection (IFS) technique: statistical analysis first assesses the significance of the speech features, and IFS then selects the speech feature subset. Furthermore, multiclass Support Vector Machine (SVM), Artificial Neural Network (ANN), and Naive Bayes (NB) machine learning techniques are used to recognize speech in both the audio and visual modalities. Based on the recognition rates, a combined decision is taken from the two individual recognition systems. This paper compares the results achieved by the proposed model and existing models for both audio and visual speech recognition. The Zernike Moment (ZM) is compared with the PZM, showing that the proposed model using PZM extracts more discriminative features for visual speech recognition. This study also shows that audio feature selection using statistical analysis outperforms approaches without any feature selection technique.
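To make the visual feature concrete, the following is a minimal sketch of how a single pseudo-Zernike moment of a grayscale image can be computed from the standard radial-polynomial definition (pixels mapped to the unit disk). The image size, orders, and sampling scheme here are illustrative, not the paper's exact implementation.

```python
import math
import numpy as np

def pzm(image, n, m):
    """Pseudo-Zernike moment A_nm of a square grayscale image.

    Pixel centers are mapped onto the unit disk; pixels outside the
    disk are ignored. Requires |m| <= n.
    """
    assert abs(m) <= n
    N = image.shape[0]
    # Map pixel centers to the square [-1, 1] x [-1, 1].
    coords = (2.0 * np.arange(N) + 1.0 - N) / N
    x, y = np.meshgrid(coords, coords)
    rho = np.sqrt(x ** 2 + y ** 2)
    theta = np.arctan2(y, x)
    inside = rho <= 1.0

    # Pseudo-Zernike radial polynomial R_nm(rho).
    R = np.zeros_like(rho)
    for s in range(n - abs(m) + 1):
        c = ((-1) ** s * math.factorial(2 * n + 1 - s)
             / (math.factorial(s)
                * math.factorial(n - abs(m) - s)
                * math.factorial(n + abs(m) + 1 - s)))
        R += c * rho ** (n - s)

    # Inner product with the conjugate basis function V*_nm.
    V = R * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(image[inside] * V[inside]) * (2.0 / N) ** 2

# Rotation invariance of |A_nm|: a 90-degree rotation only shifts the
# phase of the moment, so the magnitude is (numerically) unchanged.
img = np.zeros((64, 64))
img[20:40, 10:30] = 1.0              # an off-center bright patch
a = abs(pzm(img, 3, 1))
b = abs(pzm(np.rot90(img), 3, 1))
```

The rotation-invariant magnitudes |A_nm|, computed over the segmented lip region, are what typically serve as the visual feature vector.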
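The audio-side selection step can be sketched as follows: rank features by a one-way ANOVA F-statistic across word classes, then run Incremental Feature Selection, growing the feature subset one ranked feature at a time and keeping the prefix with the best held-out accuracy. The synthetic data, the nearest-centroid classifier, and the even/odd split below are illustrative stand-ins, not the paper's setup (which uses SVM/ANN/NB).

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Toy stand-in for 19 MFCC features over 3 word classes; only the
# first two features actually separate the classes.
n_per_class, n_feats, n_classes = 40, 19, 3
X = rng.normal(size=(n_classes * n_per_class, n_feats))
y = np.repeat(np.arange(n_classes), n_per_class)
for c in range(n_classes):
    X[y == c, 0] += 3.0 * c
    X[y == c, 1] -= 2.0 * c

# 1) Rank features by ANOVA F-statistic (larger F = more discriminative).
F = np.array([f_oneway(*(X[y == c, j] for c in range(n_classes))).statistic
              for j in range(n_feats)])
ranking = np.argsort(F)[::-1]

def accuracy(feats):
    """Nearest-centroid accuracy on a fixed even/odd train/test split."""
    Xtr, ytr = X[::2, feats], y[::2]
    Xte, yte = X[1::2, feats], y[1::2]
    cent = np.stack([Xtr[ytr == c].mean(axis=0) for c in range(n_classes)])
    pred = np.argmin(((Xte[:, None, :] - cent[None]) ** 2).sum(-1), axis=1)
    return (pred == yte).mean()

# 2) IFS: add features in ranked order, keep the best-scoring prefix.
scores = [accuracy(ranking[:k]) for k in range(1, n_feats + 1)]
best_k = int(np.argmax(scores)) + 1
selected = ranking[:best_k]
```

The Kruskal-Wallis and Friedman rankings slot in the same way (`scipy.stats.kruskal` / `scipy.stats.friedmanchisquare` in place of `f_oneway`); only the ranking criterion changes, not the IFS loop.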