A BERT-Based Language Model for Analyzing Sports Content in the Persian Language

Sotoude, Davood; Amiri Tehranizade, Amin

doi:10.48301/kssa.2023.357227.2251

Article type: Research article (applied)

Authors

Davood Sotoude ¹

Amin Amiri Tehranizade ²

¹ Faculty Member, Department of Computer Engineering, Technical and Vocational University, Tehran, Iran.

² Postdoctoral Researcher, Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran.

Abstract

Pretrained language models are highly important because of their use in natural language processing problems. Language models such as BERT have gained particular popularity among researchers. Because these models focus on English, other languages are limited to a handful of multilingual models. This article presents the VarzeshiBERT language model for analyzing Persian sports content in tasks related to this domain. The model is based on BERT and was trained on a collected dataset. Three tasks were used to evaluate the new model: sentiment analysis, named entity recognition, and fill-in-the-blank (text infilling). Because no suitable dataset existed for training, a large dataset of Persian-language sports events and news was compiled from several online sources. Given its domain specialization, and compared with the language models previously presented for Persian, the model produced better results in all three tasks. It achieved the best performance, with 71.7% and 95.2%, in fill-in-the-blank and part-of-speech tagging, respectively. In sentiment analysis, the sports model also yielded better results. These results show that a language model tailored to a specialized domain yields better results than related language models trained on general-domain text.

Keywords

Language model

Natural language processing

Sentiment analysis

Named entity recognition

Dataset

Subjects

Artificial intelligence

Article Title (English)

Introducing a Language Model based on BERT to Analyze Sports Content in the Persian Language

Authors (English)

Davood Sotoude ¹

Amin Amiri Tehranizade ²

¹ Faculty Member, Department of Computer Engineering, Technical and Vocational University (TVU), Tehran, Iran.

² Postdoc Researcher, Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran.

Abstract (English)

Pretrained language models are important because of their applications in natural language processing. Language models such as BERT have become especially popular among researchers. Because these models focus on English, other languages are limited to a few multilingual models. This article presents the PersianSportBERT language model for analyzing Persian sports content in tasks related to this domain. The model is based on BERT and was trained on a collected dataset. Three tasks were used to evaluate the new model: sentiment analysis, named entity recognition, and text infilling. Because no suitable dataset existed for training, a large collection of Persian-language sports events and news was compiled from several online sources. Owing to its domain specialization, the model produced better results in all three tasks than the language models previously presented for Persian. It achieved the best performance in text infilling and named entity recognition, with 71.7% and 95.2%, respectively. In sentiment analysis, the sports model also produced better results. These findings indicate that a language model tailored to a specialized domain yields better results than language models trained on general-domain text.
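The abstract does not state how the text-infilling score is computed. One plausible reading is top-1 accuracy over masked tokens, and the sketch below illustrates that reading only. The function names, the `[MASK]` placeholder, the toy predictor, and the English example sentences are all hypothetical stand-ins, not the authors' evaluation code or data.

```python
from typing import Callable, Iterable, List, Tuple

MASK = "[MASK]"  # assumed placeholder token, as used by BERT-style models


def mask_filling_accuracy(
    examples: Iterable[Tuple[List[str], int]],
    predict: Callable[[List[str], int], str],
) -> float:
    """Return the fraction of masked positions whose original token is recovered."""
    correct = 0
    total = 0
    for tokens, position in examples:
        gold = tokens[position]
        # Hide the gold token before handing the sentence to the predictor.
        masked = tokens[:position] + [MASK] + tokens[position + 1:]
        if predict(masked, position) == gold:
            correct += 1
        total += 1
    return correct / total if total else 0.0


def toy_predict(masked_tokens: List[str], position: int) -> str:
    # Hypothetical stand-in for a masked language model: always guesses "goal".
    return "goal"


examples = [
    (["the", "striker", "scored", "a", "goal"], 4),
    (["the", "team", "won", "the", "match"], 4),
]
score = mask_filling_accuracy(examples, toy_predict)
print(f"top-1 mask-filling accuracy: {score:.1%}")
```

In practice `toy_predict` would be replaced by a call into a real masked language model; the scoring loop itself would stay the same under this assumed metric.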

Keywords (English)

Language Models

Natural Language Processing

Sentiment Analysis

Named-entity Recognition

Dataset
