Violence Detection in Audio: Evaluating the Effectiveness of Deep Learning Models and Data Augmentation.
DOI: https://doi.org/10.9781/ijimai.2023.08.007

Keywords: Audio, Deep Learning, Human Activity Detection, Human Activity, Machine Learning, Transfer Learning, Violence Detection

Abstract
Human nature is inherently intertwined with violence, which impacts the lives of many individuals. Various forms of violence pervade society, with physical violence being the most prevalent in daily life. The study of human actions has gained significant attention in recent years, with audio (captured by microphones) and video (captured by cameras) being the primary means of recording instances of violence. While video requires substantial processing capacity and hardware-software performance, audio is a viable alternative that offers several advantages beyond these technical considerations. It is therefore crucial to represent audio data in a manner conducive to accurate classification. For violence in a car, no dedicated datasets are readily available, so we created a custom dataset tailored to this scenario, with the aim of assessing whether it could improve the detection of violence in car-related situations. Because the dataset is imbalanced, data augmentation techniques were applied. The literature shows that Deep Learning (DL) algorithms can classify audio effectively, with a common approach being the conversion of audio into a mel spectrogram image. On this dataset, the EfficientNetB1 neural network achieved the highest accuracy (95.06%) in detecting violence in audio, closely followed by EfficientNetB0 (94.19%). By contrast, MobileNetV2 proved less capable of classifying instances of violence.
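As a concrete illustration of the two steps the abstract describes, the sketch below builds a mel-spectrogram "image" from a waveform using only NumPy, after applying a simple waveform-level augmentation. The parameter values (16 kHz sample rate, 512-point FFT, 64 mel bands) and the `augment` helper are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters spaced evenly on the mel scale
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        lo, mid, hi = bins[i - 1], bins[i], bins[i + 1]
        for k in range(lo, mid):
            fb[i - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):
            fb[i - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb

def mel_spectrogram(y, sr=16000, n_fft=512, hop=256, n_mels=64):
    # frame + window the signal, take the power spectrum, apply the filterbank
    window = np.hanning(n_fft)
    frames = np.stack([y[s:s + n_fft] * window
                       for s in range(0, len(y) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    mel = mel_filterbank(n_mels, n_fft, sr) @ power.T
    return 10.0 * np.log10(np.maximum(mel, 1e-10))  # dB image, mels x frames

def augment(y, rng):
    # simple waveform-level augmentation for the minority class:
    # additive Gaussian noise plus a random circular time shift
    noisy = y + 0.005 * rng.standard_normal(len(y))
    return np.roll(noisy, rng.integers(0, len(y)))

rng = np.random.default_rng(0)
y = rng.standard_normal(16000)        # one second of noise stands in for a clip
S = mel_spectrogram(augment(y, rng))
print(S.shape)                        # (n_mels, n_frames)
```

The resulting 2-D dB-scaled array can be saved as an image and fed to an ImageNet-pretrained classifier such as EfficientNetB0/B1, which is the transfer-learning setup the abstract evaluates.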