An Investigation Into Different Text Representations to Train an Artificial Immune Network for Clustering Texts.
DOI: https://doi.org/10.9781/ijimai.2023.08.006

Keywords: Artificial Immune System, Artificial Immune Network, Clonal Selection, Natural Computing, Text Clustering, Text Structuring

Abstract
Extracting knowledge from text data is a complex task, usually performed either by first structuring the texts and then applying machine learning algorithms, or by using specific deep architectures capable of dealing directly with raw text. The traditional approach to structuring texts, called Bag of Words (BoW), transforms each word in a document into a dimension (variable) of the structured data. Another approach uses grammatical classes to categorize the words and thus limits the dimension of the structured data to the number of grammatical categories. A third form of structuring text data is a distributed representation of words, sentences, or documents, produced by methods such as Word2Vec, Doc2Vec, and SBERT. This paper investigates four classes of text-structuring methods used to prepare documents for clustering by an artificial immune network called aiNet. The goal is to assess the influence of each structuring method on the quality of the clustering obtained by the system, and to examine how methods belonging to the same type of representation differ from each other; for example, both LIWC and MRC are considered grammar-based models, yet each uses a completely different dictionary to generate its representation. Based on internal clustering measures, our results show that vector space models, on average, presented the best results for the chosen datasets, followed closely by the state-of-the-art SBERT model, while MRC had the worst overall performance. We also observed consistency in the number of clusters generated by each representation for each dataset, with SBERT producing the number of clusters closest to the original number of classes in the data.
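The BoW structuring described above (one dimension per vocabulary word, one count vector per document) can be sketched in a few lines of pure Python. This is a minimal illustration, not the paper's actual preprocessing pipeline; the tokenization (lowercased whitespace splitting) and the function name are assumptions for the example.

```python
from collections import Counter

def bag_of_words(documents):
    """Structure raw texts as BoW count vectors, one dimension per vocabulary word."""
    # Build the shared vocabulary from every token seen across the corpus.
    vocab = sorted({token for doc in documents for token in doc.lower().split()})
    # Each document becomes a vector of word counts over that vocabulary.
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts.get(word, 0) for word in vocab])
    return vocab, vectors

docs = ["the cat sat", "the dog sat on the mat"]
vocab, X = bag_of_words(docs)
# vocab → ['cat', 'dog', 'mat', 'on', 'sat', 'the']
# X[1]  → [0, 1, 1, 1, 1, 2]
```

Note how the dimensionality grows with the vocabulary, which is precisely the limitation that grammar-based (LIWC, MRC) and distributed (Word2Vec, Doc2Vec, SBERT) representations address by mapping documents into a fixed number of dimensions.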