Deep Multi-Model Fusion for Human Activity Recognition Using Evolutionary Algorithms

Authors

K. Kant Verma, B. Mohan Singh

DOI:

https://doi.org/10.9781/ijimai.2021.08.008

Keywords:

Human Activity, Activity Recognition, Human Activity Detection, Support Vector Machine (SVM), Convolutional Neural Network (CNN), 3D Convolutional Neural Network (3DCNN), Long Short-Term Memory (LSTM), Deep Learning, Genetic Algorithms, Particle Swarm Optimization
Supporting Agencies

We are grateful to the College of Engineering Roorkee, India, and UTU Dehradun, India, for providing excellent research facilities to carry out this research work.

Abstract

Machine recognition of human activities is an active research area in computer vision. Previous studies have typically used only one or two modalities to handle this task; however, combining the maximum available information improves the recognition accuracy of human activities. This paper therefore proposes an automatic human activity recognition system based on deep fusion of multiple streams, together with decision-level score optimization using evolutionary algorithms, applied to RGB, depth-map, and 3D skeleton joint information. The proposed approach works in three phases: (1) spatio-temporal activity learning from RGB, depth, and skeleton joint positions using two 3D Convolutional Neural Networks (3DCNNs) and a Long Short-Term Memory (LSTM) network; (2) training an SVM for each model on the activities learned in the previous phase, and generating scores with the trained SVMs; (3) score fusion and optimization using two evolutionary algorithms, the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). The proposed approach is validated on two challenging 3D datasets, MSRDailyActivity3D and UTKinectAction3D, achieving accuracies of 85.94% and 96.5%, respectively. The experimental results show the usefulness of the proposed representation. Furthermore, fusing the different modalities yields higher recognition accuracy than using only one or two types of information, and obtains state-of-the-art results.
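
To make the third phase concrete, the following is a minimal Python sketch (not the authors' implementation) of decision-level score fusion with PSO-optimized stream weights. The three random score matrices are hypothetical stand-ins for the per-stream SVM outputs of the second phase, and the swarm size, inertia, and acceleration coefficients are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical placeholders for phase-2 outputs: per-stream SVM class scores
# (RGB 3DCNN, depth 3DCNN, skeleton LSTM), each of shape (n_samples, n_classes).
n_samples, n_classes = 200, 16
scores = [rng.random((n_samples, n_classes)) for _ in range(3)]
labels = rng.integers(0, n_classes, size=n_samples)

def fused_accuracy(weights):
    """Classification accuracy of the weighted decision-level fusion."""
    w = np.abs(weights) / (np.abs(weights).sum() + 1e-12)  # convex combination
    fused = sum(wi * s for wi, s in zip(w, scores))
    return float(np.mean(fused.argmax(axis=1) == labels))

# Minimal PSO over the three stream weights; fitness is the fused accuracy.
n_particles, n_iters = 20, 50
pos = rng.random((n_particles, 3))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_fit = np.array([fused_accuracy(p) for p in pos])
gbest = pbest[pbest_fit.argmax()].copy()

for _ in range(n_iters):
    r1, r2 = rng.random((2, n_particles, 3))
    # Standard velocity update: inertia + cognitive + social terms.
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, 0.0, 1.0)
    fit = np.array([fused_accuracy(p) for p in pos])
    better = fit > pbest_fit
    pbest[better], pbest_fit[better] = pos[better], fit[better]
    gbest = pbest[pbest_fit.argmax()].copy()

print("fusion weights:", np.abs(gbest) / np.abs(gbest).sum())
print("fused accuracy:", fused_accuracy(gbest))
```

A GA variant would replace the velocity update with selection, crossover, and mutation over the same three-component weight vectors; the paper evaluates both optimizers for the final decision-level fusion.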

References

H. Rahmani, A. Mian, and M. Shah, “Learning a deep model for human action recognition from novel viewpoints,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 667-681, 2017.

Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “Learning clip representations for skeleton-based 3d action recognition,” in IEEE Transactions on Image Processing, vol. 27, no. 6, pp. 2842-2855, 2018.

J. K. Aggarwal, and M. S. Ryoo, “Human activity analysis: A review,” in ACM Computing Surveys (CSUR), vol. 43, no. 3, pp. 1-43, 2011.

W. Li, Z. Zhang, and Z. Liu, “Action recognition based on a bag of 3d points,” in 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops, San Francisco, CA, USA, 2010, pp. 9-14.

L. Xia, C.C. Chen, and J. K. Aggarwal, “View invariant human action recognition using histograms of 3d joints,” in Computer Vision and Pattern Recognition Workshops (CVPRW), Providence, RI, USA, 2012, pp. 20-27.

X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” in Proceedings of the 20th ACM international conference on Multimedia, Nara, Japan, 2012, pp. 1057-1060.

Y. Zhu, W. Chen, and G. Guo, “Fusing spatiotemporal features and joints for 3d action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland, OR, USA, 2013, pp. 486-491.

A. Chaaraoui, J. Padilla-Lopez, and F. Flórez-Revuelta, “Fusion of skeletal and silhouette-based features for human action recognition with rgb-d devices,” in Proceedings of the IEEE international conference on computer vision workshops, Sydney, NSW, Australia, 2013, pp. 91-97.

S. Siddiqui, M. A. Khan, K. Bashir, M. Sharif, F. Azam, and M. Y. Javed, “Human action recognition: a construction of codebook by discriminative features selection approach,” in International Journal of Applied Pattern Recognition, vol. 5, no. 3, pp. 206-228, 2018.

A. Franco, A. Magnani, and D. Maio, “A multimodal approach for human activity recognition based on skeleton and RGB data,” in Pattern Recognition Letters, vol. 131, pp. 293-299, 2020.

K. Khoshelham, and S. O. Elberink, “Accuracy and resolution of kinect depth data for indoor mapping applications,” in Sensors, vol. 12, no. 2, pp. 1437-1454, 2012.

H. H. Pham, L. Khoudour, A. Crouzil, P. Zegers and S. A. Velastin, “Exploiting deep residual networks for human action recognition from skeletal data,” in Computer Vision and Image Understanding, vol. 170, pp. 51–66, 2018.

S. Yang, J. Yang, F. Li, G. Fan and D. Li, “Human Action Recognition Based on Fusion Features,” in International Conference on Cyber Security Intelligence and Analytics, 2019, pp. 569–579.

A. Jalal, M. Z. Uddin, and T. S. Kim, “Depth video-based human activity recognition system using translation and scaling invariant features for life logging at smart home,” in IEEE Transactions on Consumer Electronics, vol. 58, no. 3, pp. 863-871, 2012.

M. Khan, T. Akram, M. Sharif, N. Muhammad, M. Javed and S. Naqvi, “An improved strategy for human action recognition; experiencing a cascaded design,” in IET Image Processing, vol. 14, no. 5, pp. 818-829, 2019.

K. He, X. Zhang, S. Ren and J. Sun, “Deep residual learning for image recognition,” in 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016, pp. 770–778.

L. Bi, D. Feng and J. Kim, “Dual-path adversarial learning for fully convolutional network (FCN)-based medical image segmentation,” in The Visual Computer, vol. 34, no. 6, pp. 1–10, 2018.

M. Rashid, M. A. Khan, M. Sharif, M. Raza, M. M. Sarfraz and F. Afza, “Object detection and classification: a joint selection and fusion strategy of deep convolutional neural network and SIFT point features,” in Multimedia Tools and Applications, vol. 78, no. 12, pp. 15751–15777, 2019.

F. Zhou, Y. Hu and X. Shen, “Msanet: multimodal self-augmentation and adversarial network for RGB-D object recognition,” in The Visual Computer, vol. 35, no. 11, pp. 1583-1594, 2019, https://doi.org/10.1007/s00371-018-1559-x

I. Gogić, M. Manhart, I. S. Pandžić and J. Ahlberg, “Fast facial expression recognition using local binary features and shallow neural networks,” in The Visual Computer, vol. 36, no. 01, pp. 1-16, 2018.

M. Sharif, M. A. Khan, M. Faisal, M. Yasmin and S. L. Fernandes, “A framework for offline signature verification system: Best features selection approach,” in Pattern Recognition Letters, 2018.

K. K. Verma, B. M. Singh and A. Dixit, “A review of supervised and unsupervised machine learning techniques for suspicious behavior recognition in intelligent surveillance system”, in International Journal of Information Technology, 2019, pp. 1-14.

G. I. Parisi, “Human Action Recognition and Assessment via Deep Neural Network Self-Organization,” in Modelling Human Motion, pp. 187-211, 2020.

X. X. Niu and C. Y. Suen, “A novel hybrid CNN–SVM classifier for recognizing handwritten digits,” in Pattern Recognition, vol. 45, no. 4, pp. 1318-1325, 2012.

D. X. Xue, R. Zhang, H. Feng and Y. L. Wang, “CNN-SVM for microvascular morphological type recognition with data augmentation,” in Journal of medical and biological engineering, vol. 36, no. 6, pp. 755-764, 2016.

A. B. Sargano, X. Wang, P. Angelov and Z. Habib, “Human action recognition using transfer learning with deep representations,” in 2017 International joint conference on neural networks (IJCNN), Anchorage, AK, USA, 2017, pp. 463-469.

T. Jiang, Z. Zhang and Y. Yang, “Modeling coverage with semantic embedding for image caption generation,” in The Visual Computer, vol. 35, no. 11, pp. 1655-1665, 2019, https://doi.org/10.1007/s00371-018-1565-z

A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei and S. Savarese, “Social lstm: human trajectory prediction in crowded spaces,” in IEEE conference on computer vision and pattern recognition, 2016, pp. 961–971

I. Sutskever, O. Vinyals and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in neural information processing systems, 2014, pp. 3104–3112.

K.K. Verma, B. M. Singh, “Deep Learning Approach to Recognize COVID-19, SARS and Streptococcus Disease from Chest X-Ray Images,” in Journal of Scientific and Industrial Research, vol. 80, no. 01, pp. 51-59, 2021.

J. Cong and B. Zhang, “Multi-model feature fusion for human action recognition towards sport sceneries,” in Signal Processing: Image Communication, 2020.

E. Zhou and H. Zhang, “Human action recognition towards massive-scale sport sceneries based on deep multi-model feature fusion,” in Signal Processing: Image Communication, vol. 84, 2020.

K. Soomro, A. R. Zamir and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” arXiv preprint arXiv:1212.0402, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio and T. Serre, “HMDB: a large video database for human motion recognition,” in 2011 International Conference on Computer Vision, Barcelona, Spain, 2011, pp. 2556-2563.

F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles, “Activitynet: A large-scale video benchmark for human activity understanding,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Boston, MA, USA, 2015, pp. 961-970.

A. B. Sargano, P. Angelov and Z. Habib, “A comprehensive review on handcrafted and learning-based action representation approaches for human activity recognition,” in Applied Sciences, vol. 7, no. 01, 2017.

H. Wang, and C. Schmid, “Action recognition with improved trajectories,” in Proceedings of the IEEE international conference on computer vision, Sydney, NSW, Australia, 2013, pp. 3551-3558.

A. Krizhevsky, I. Sutskever and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, vol. 25, pp. 1097-1105, 2012.

K. Simonyan and A. Zisserman, “Two-stream convolutional networks for action recognition in videos,” in Advances in neural information processing systems, 2014, pp. 568-576.

A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 2014, pp. 1725-1732.

D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE international conference on computer vision, Santiago, Chile, 2015, pp. 4489-4497.

C. Feichtenhofer, A. Pinz, and A. Zisserman, “Convolutional two-stream network fusion for video action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 2016, pp. 1933-1941.

G. Varol, I. Laptev and C. Schmid, “Long-term temporal convolutions for action recognition,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1510-1517, 2017.

A. Ullah, J. Ahmad, K. Muhammad, M. Sajjad and S. W. Baik, “Action recognition in video sequences using deep bi-directional LSTM with CNN features,” in IEEE Access, vol. 6, pp. 1155-1166, 2017.

K. K. Verma, B. M. Singh, H. L. Mandoria and P. Chauhan, “Two-Stage Human Activity Recognition Using 2D-ConvNet,” in International Journal of Interactive Multimedia & Artificial Intelligence, vol. 6, no 2, pp. 135-135, 2020.

Z. Li, Z. Zheng, F. Lin, H. Leung and Q. Li, “Action recognition from depth sequence using depth motion maps-based local ternary patterns and CNN,” in Multimedia Tools and Applications, vol. 78, no. 14, pp. 19587-19601, 2019.

P. Wang, W. Li, Z. Gao, C. Tang and P. O. Ogunbona, “Depth pooling based large-scale 3-d action recognition with convolutional neural networks,” in IEEE Transactions on Multimedia, vol. 20, no. 5, pp. 1051-1061, 2018.

C. Chen, R. Jafari and N. Kehtarnavaz, “Action recognition from depth sequences using depth motion maps-based local binary patterns,” in 2015 IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2015, pp. 1092-1099.

P. Wang, W. Li, Z. Gao, J. Zhang, C. Tang and P. Ogunbona, “Deep convolutional neural networks for action recognition using depth map sequences,” arXiv preprint arXiv:1501.04686, 2015.

V. Megavannan, B. Agarwal and R. V. Babu, “Human action recognition using depth maps,” in 2012 International Conference on Signal Processing and Communications (SPCOM), Bangalore, India, 2012, pp. 1-5.

Y. Han, S. L. Chung, Q. Xiao, W. Y. Lin and S. F. Su, “Global Spatio-Temporal Attention for Action Recognition based on 3D Human Skeleton Data,” in IEEE Access, vol. 8, pp. 88604-88616, 2020.

B. Ren, M. Liu, R. Ding and H. Liu, “A Survey on 3D Skeleton-Based Action Recognition Using Learning Method,” arXiv preprint arXiv:2002.05907, 2020.

R. Saini, P. Kumar, P. P. Roy and D. P. Dogra, “A novel framework of continuous human-activity recognition using kinect,” in Neurocomputing, vol. 311, pp. 99-111, 2018.

B. Chikhaoui, B. Ye and A. Mihailidis, “Feature-level combination of skeleton joints and body parts for accurate aggressive and agitated behavior recognition,” in Journal of Ambient Intelligence and Humanized Computing, vol. 8, no. 6, pp. 957-976, 2017.

A. Shahroudy, J. Liu, T.T. Ng, and G. Wang, “Ntu rgb+ d: A large scale dataset for 3d human activity analysis,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 2016, pp. 1010-1019.

Y. Gu, X. Ye, W. Sheng, Y. Ou and Y. Li, “Multiple stream deep learning model for human action recognition,” in Image and Vision Computing, vol. 93, 2020.

P. Khaire, P. Kumar and J. Imran, “Combining CNN streams of RGB-D and skeletal data for human activity recognition,” in Pattern Recognition Letters, vol. 115, pp. 107-116, 2018.

A. Tomas and K. K. Biswas, “Human activity recognition using combined deep architectures,” in 2017 IEEE 2nd International Conference on Signal and Image Processing (ICSIP), Singapore, 2017, pp. 41-45.

E. P. Ijjina and K. M. Chalavadi, “Human action recognition in RGB-D videos using motion sequence information and deep learning,” in Pattern Recognition, vol. 72, pp. 504-516, 2017.

C. Zhao, M. Chen, J. Zhao, Q. Wang and Y. Shen, “3D Behavior Recognition Based on Multi-Modal Deep Space-Time Learning,” in Applied Sciences, vol. 9, no. 4, pp. 716, 2019.

S. Ji, W. Xu, M. Yang and K. Yu, “3D convolutional neural networks for human action recognition,” in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 01, pp. 221-231, 2012.

S. Hochreiter and J. Schmidhuber, “Bridging long time lags by weight guessing and ‘Long Short-Term Memory’,” in Spatiotemporal models in biological and artificial systems, vol. 37, pp. 65-72, 1996.

S. Gaglio, G. L. Re and M. Morana, “Human activity recognition process using 3-D posture data,” in IEEE Transactions on Human-Machine Systems, vol. 45, no. 05, pp. 586-597, 2014.

O. Chapelle, V. Vapnik, O. Bousquet and S. Mukherjee, “Choosing multiple parameters for support vector machines,” in Machine Learning, vol. 46, no. 1-3, pp. 131-159, 2002.

D. E. Goldberg, B. Korb and K. Deb, “Messy genetic algorithms: Motivation, analysis, and first results,” in Complex systems, vol. 3, no. 05, pp. 493-530, 1989.

J. Kennedy and R. Eberhart, “Particle swarm optimization,” in Proceedings of ICNN’95-International Conference on Neural Networks, Perth, Australia, 1995, pp. 1942-1948.

J. Wang, Z. Liu, Y. Wu and J. Yuan, “Mining actionlet ensemble for action recognition with depth cameras,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 2012, pp. 1290-1297.

O. Oreifej and Z. Liu, “Hon4d: Histogram of oriented 4d normals for activity recognition from depth sequences,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Portland, OR, USA, 2013, pp. 716-723.

L. Xia and J. K. Aggarwal, “Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera,” in Proceedings of the IEEE conference on computer vision and pattern recognition, Portland, OR, USA, 2013, pp. 2834-2841.

L. Seidenari, V. Varano, S. Berretti, A. Bimbo and P. Pala, “Recognizing actions from depth cameras as weakly aligned multi-part bag-of-poses,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Portland OR, USA, 2013, pp. 479-485.

Y. Hbali, S. Hbali, L. Ballihi and M. Sadgal, “Skeleton-based human activity recognition for elderly monitoring systems,” in IET Computer Vision, vol. 12, no. 01, pp. 16-26, 2017.

A. Ben Tamou, L. Ballihi and D. Aboutajdine, “Automatic learning of articulated skeletons based on mean of 3d joints for efficient action recognition,” in International Journal of Pattern Recognition and Artificial Intelligence, vol. 31, no. 04, 2017.

A. A. Liu, W. Z. Nie, Y. T. Su, L. Ma, T. Hao and Z. X. Yang, “Coupled hidden conditional random fields for RGB-D human action recognition,” in Signal Processing, vol. 112, pp. 74-82, 2015.

Z. Liu, C. Zhang and Y. Tian, “3D-based deep convolutional neural network for action recognition with depth sequences,” in Image and Vision Computing, vol. 55, pp. 93–100, 2015.

Published

2021-12-01

How to Cite

Kant Verma, K. and Mohan Singh, B. (2021). Deep Multi-Model Fusion for Human Activity Recognition Using Evolutionary Algorithms. International Journal of Interactive Multimedia and Artificial Intelligence, 7(2), 44–58. https://doi.org/10.9781/ijimai.2021.08.008