A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network.

Authors

  • P. R. Joe Dhanith, National Institute of Technology
  • B. Surendiran, National Institute of Technology, Puducherry
  • S. P. Raja, Vel Tech Rangarajan Dr. Sagunthala R&D Institute of Science and Technology

DOI:

https://doi.org/10.9781/ijimai.2020.09.003

Keywords:

Web Crawlers, Semantics, Word Embeddings, Adagrad, Recurrent Network

Abstract

Learning-based focused crawlers download uniform resource locators (URLs) relevant to a specific topic from the web. Several studies have used the term frequency-inverse document frequency (TF-IDF) weighted cosine vector as the input feature vector for learning algorithms. TF-IDF-based crawlers compute the relevance of a web page only when a topic word co-occurs on that page; otherwise, the page is considered irrelevant, even when a synonym of a topic word appears on it. To resolve this challenge, this paper proposes a new methodology that integrates Adagrad-optimized Skip Gram Negative Sampling (A-SGNS)-based word embedding with a Recurrent Neural Network (RNN). The cosine similarity computed from the word embedding matrix forms a feature vector that is fed to the RNN to predict the relevance of a web page. The performance of the proposed method is evaluated using the harvest rate (hr) and irrelevance ratio (ir). The proposed methodology outperforms existing methodologies, with an average harvest rate of 0.42 and an irrelevance ratio of 0.58.
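The pipeline sketched in the abstract can be illustrated with a minimal example. The helper names (`cosine_similarity`, `page_feature_vector`, `harvest_rate`), the max-similarity feature scheme, and the toy 2-dimensional embeddings are illustrative assumptions, not the authors' implementation; the paper learns its embeddings with A-SGNS and feeds the features to an RNN.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def page_feature_vector(topic_vecs, page_vecs):
    """One feature per topic word: its best cosine match against the
    embeddings of the words on the page (hypothetical scheme). A synonym
    scores high even when the topic word itself never occurs on the page."""
    return [max(cosine_similarity(t, w) for w in page_vecs) for t in topic_vecs]

def harvest_rate(relevant, downloaded):
    """hr = relevant pages / pages downloaded; the irrelevance ratio is 1 - hr."""
    return relevant / downloaded

# Toy embeddings (made up for illustration).
topic = [[1.0, 0.0], [0.0, 1.0]]
page = [[0.9, 0.1], [0.2, 0.8]]

features = page_feature_vector(topic, page)  # input vector for the classifier
print(harvest_rate(42, 100))  # → 0.42
```

In this sketch, a page whose words sit close to the topic words in embedding space yields feature values near 1 even without exact term matches, which is the gap in TF-IDF crawlers the abstract describes.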



Published

2021-06-01

How to Cite

Joe Dhanith, P. R., Surendiran, B., and Raja, S. P. (2021). A Word Embedding Based Approach for Focused Web Crawling Using the Recurrent Neural Network. International Journal of Interactive Multimedia and Artificial Intelligence, 6(6), 122–132. https://doi.org/10.9781/ijimai.2020.09.003