Karafan Journal

Introducing a Language Model based on BERT to Analyze Sports Content in the Persian Language

Document Type: Original Article

Authors
1 Faculty Member, Department of Computer Engineering, Technical and Vocational University (TVU), Tehran, Iran.
2 Postdoc Researcher, Department of Medical Informatics, Faculty of Medicine, Mashhad University of Medical Sciences, Mashhad, Iran.
Abstract
Pretrained language models are central to natural language processing, and models such as BERT have become increasingly popular among researchers. Because these models focus mainly on English, other languages are often limited to a handful of multilingual models. This article presents PersianSportBERT, a language model for analyzing Persian-language sports content. The model is based on BERT and was trained on a purpose-built dataset: because no suitable dataset existed, a wide range of Persian sports news and event reports was collected from several online sources. The model was evaluated on three tasks: sentiment analysis, named entity recognition, and text infilling. Owing to its domain specialization, it outperformed the language models previously presented for Persian on all three tasks. It achieved its best results in text infilling and named entity recognition, with 71.7% and 95.2%, respectively, and it also produced better results in sentiment analysis. These findings indicate that a language model trained for a specialized field can outperform general-domain language models on tasks within that field.

Volume 20, Issue 1 - Serial Number 61
Technical & Engineering
Spring 2023
Pages 341-362

  • Received: 04 September 2022
  • Revised: 13 November 2022
  • Accepted: 30 January 2023