Deobfuscating Leetspeak With Deep Learning to Improve Spam Filtering

Iñaki Vélez de Mendizabal; Xabier Vidriales; Vitor Basto Fernandes; Enaitz Ezpeleta; José R. Méndez; Urko Zurutuza

doi:10.9781/ijimai.2023.07.003

Authors

Iñaki Vélez de Mendizabal, Mondragon Unibertsitatea
Xabier Vidriales, Mondragon Unibertsitatea
Vitor Basto Fernandes, Iscte – Instituto Universitário de Lisboa
Enaitz Ezpeleta, Mondragon Unibertsitatea
José R. Méndez, Universidade de Vigo
Urko Zurutuza, Mondragon Unibertsitatea

Keywords:

Convolutional Neural Network (CNN), Deep Learning, Spam Filter, Text Mining

Supporting Agencies

Iñaki Vélez de Mendizabal, Enaitz Ezpeleta and Urko Zurutuza are part of the Intelligent Systems for Industrial Systems research group of Mondragon Unibertsitatea (IT1676-22), supported by the Department of Education, Universities and Research of the Basque Country. This work is supported by the project Semantic Knowledge Integration for Content-Based Spam Filtering, subprojects TIN2017-84658-C2-1-R and TIN2017-84658-C2-2-R, from SMEIC, SRA and ERDF. Vitor Basto Fernandes acknowledges FCT – Fundação para a Ciência e a Tecnologia, I.P., for its support in the context of projects UIDB/04466/2020 and UIDP/04466/2020.

Abstract

As anti-spam filters have evolved, spammers have had to work harder to get their content past them. Encoding content in images and writing it in Leetspeak are two clear examples of the evasion techniques currently in use. Despite the importance of these problems, few studies address them, and the performance they report is very limited. This study reviews the existing, still rudimentary, work on Leetspeak deobfuscation and proposes a new decoding technique based on neural networks. In addition, we distribute an image database created specifically for training Leetspeak decoding models. We have also created and made available four different corpora to analyse the performance of Leetspeak decoding schemes. Using these corpora, we have experimentally evaluated our neural network approach. The results show that the proposed model is useful for deobfuscating Leetspeak character sequences.
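To make the problem concrete, the sketch below shows what Leetspeak obfuscation looks like and why a fixed lookup table is a weak decoder. It is an illustration only and is not the neural network approach proposed in the paper: the substitution table `LEET_TABLE` and the function `naive_decode` are hypothetical names and values chosen for this example.

```python
# Hypothetical substitution table for illustration; not taken from the paper.
LEET_TABLE = {
    "4": "a", "@": "a", "3": "e", "1": "i", "!": "i",
    "0": "o", "5": "s", "$": "s", "7": "t",
}


def naive_decode(text: str) -> str:
    """Replace every known Leetspeak symbol with one fixed letter."""
    return "".join(LEET_TABLE.get(ch, ch) for ch in text.lower())


print(naive_decode("fr33 m0n3y"))     # free money
print(naive_decode("c4ll 555-0100"))  # call sss-oioo
```

The second call shows the limitation: a context-free table cannot tell an obfuscated letter from a genuine digit, so it corrupts the phone number. It also has to commit to a single reading for ambiguous symbols such as "1", which may stand for "i" or "l".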
