Implementation of a Noisy Hyperlink Removal System Using the Semantic and Relational Approach of the DBpedia Ontology

Article type: Research article (theoretical)

Author

Faculty Member, Department of Computer Engineering, Technical and Vocational University, Tehran, Iran.

10.48301/kssa.2023.382583.2426

Abstract

As web data expands rapidly, the web graph, a graph representation of the web, keeps growing and has gradually shifted from a content-based structure to a non-content-based one. Junk data such as noisy hyperlinks in the web structure graph causes problems for many link mining algorithms and reduces the speed and efficiency of information retrieval algorithms. Previous work has removed noisy hyperlinks using structural and string-based approaches. These approaches mistakenly remove some useful hyperlinks and, in some situations, fail to detect noisy ones. In this paper, an interactive crawler was first used to crawl websites and build a dataset of noisy and useful hyperlinks. Then, using semantic web approaches and resources such as the DBpedia ontology, the semantic and relational structure of these hyperlinks was examined. Next, the DBpedia ontology reasoner was activated to remove noisy hyperlinks from the web structure graph. Experiments on this system demonstrate the accuracy and capability of semantic web technologies in removing noisy hyperlinks.

Article title [English]

Implementation of a Noisy Hyperlink Removal System: Using the Semantic and Relational Approach of the DBpedia Ontology

Author [English]

  • Kazem Taghandiki
Faculty Member, Department of Computer Engineering, Technical and Vocational University (TVU), Tehran, Iran.
Abstract [English]

With the rapid growth of web data, the web graph, which is a graph representation of the web, keeps growing and has gradually changed from a content-based structure to a non-content-based one. The presence of junk data such as noisy hyperlinks in the web structure graph has caused problems for many link mining algorithms and reduced the speed and efficiency of information retrieval algorithms. Previous research has removed noisy hyperlinks using structural and string-based approaches. These approaches incorrectly remove some useful hyperlinks and are unable to detect noisy hyperlinks in some situations. In this paper, a dataset of noisy and useful hyperlinks was first created by crawling websites with an interactive crawler. Then, through semantic web approaches and resources such as the DBpedia ontology, the semantic and relational structure of these hyperlinks was examined. Next, the DBpedia ontology reasoner was activated to remove noisy hyperlinks from the web structure graph. The tests performed on this system showed the accuracy and capability of semantic web technologies in removing noisy hyperlinks.
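The abstract does not specify which similarity measures or reasoner rules the system uses, so the following is only a minimal sketch of the general idea: a hyperlink is kept if the concepts of its source and target pages are either close in a class hierarchy (semantic similarity) or directly connected by a relation (relatedness). The class hierarchy, the relation set, the Wu-Palmer-style score, and the threshold are all assumptions chosen for illustration. The class names merely resemble ontology classes and are hand-written here, not queried from DBpedia.

```python
# Toy class hierarchy (child -> parent). Hand-written for illustration only.
PARENT = {
    "Agent": "Thing",
    "Organisation": "Agent",
    "Company": "Organisation",
    "Work": "Thing",
    "Software": "Work",
    "Place": "Thing",
    "City": "Place",
}

# Hypothetical pairs of classes assumed to be linked by some ontology property.
RELATED = {
    frozenset(("Company", "Software")),
    frozenset(("Company", "City")),
}


def ancestors(cls):
    """Return the path from cls up to the root, starting with cls itself."""
    path = [cls]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(cls):
    """Depth of a class in the hierarchy; the root has depth 1."""
    return len(ancestors(cls))


def semantic_similarity(a, b):
    """Wu-Palmer-style score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_of_b = set(ancestors(b))
    # The first ancestor of a that is also an ancestor of b is the
    # lowest common subsumer (lcs).
    lcs = next(c for c in ancestors(a) if c in ancestors_of_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def is_noisy(source_cls, target_cls, threshold=0.5):
    """Flag a link as noisy when its endpoints are neither related nor similar."""
    if frozenset((source_cls, target_cls)) in RELATED:
        return False
    return semantic_similarity(source_cls, target_cls) < threshold


if __name__ == "__main__":
    links = [("Software", "City"), ("Company", "Software"), ("Company", "Organisation")]
    for src, dst in links:
        print(src, "->", dst, "noisy" if is_noisy(src, dst) else "useful")
```

In this toy setup, a link from a software page to a city page is flagged as noisy because the two classes only meet at the root, while a company-to-software link is kept because of the assumed relation between them. A real system would derive both the hierarchy and the relations from the ontology and its reasoner rather than from hard-coded tables.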

Keywords [English]

  • Semantic Web
  • Noisy Hyperlinks
  • Ontology
  • Reasoner
  • Semantic Similarity
  • Relatedness Similarity