A language model based on BERT for analyzing sports content in the Persian language

Document Type: Original Article

Authors

1 Department of Computer Engineering, Technical and Vocational University (TVU), Tehran, Iran

2 Medical University of Tehran, Tehran, Iran

10.48301/kssa.2023.357227.2251

Abstract

Pretrained language models are important because of their wide application in natural language processing tasks, and models such as BERT have become especially popular among researchers. Because these models focus mainly on English, other languages are often limited to multilingual variants. This article presents VarzeshiBERT, a language model for analyzing Persian-language sports content. The model is based on BERT and was trained on a newly collected dataset: because no suitable corpus existed, a broad collection of sports events and news in the Persian language was gathered from several online sources. The new model is evaluated on three tasks: sentiment analysis, named entity recognition, and text infilling. Owing to its domain specialization, the model outperforms existing general-purpose Persian language models on all three tasks, achieving 71.7% on text infilling and 95.2% on named entity recognition; in sentiment analysis, the sports-specific model also yields better results. These results indicate that a language model trained for a specialized domain produces better results on that domain's texts than general-domain language models.
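The text-infilling evaluation described above corresponds to BERT's masked language modeling objective: the model predicts a token hidden behind a `[MASK]` placeholder. A minimal sketch of such an evaluation is shown below using the Hugging Face Transformers `fill-mask` pipeline. Since the article does not give a public checkpoint name for VarzeshiBERT, the sketch uses ParsBERT (`HooshvareLab/bert-fa-base-uncased`), a general-domain Persian BERT, as a stand-in; the example sentence is illustrative, not from the paper's dataset.

```python
from transformers import pipeline

# Stand-in checkpoint: ParsBERT, a general-domain Persian BERT.
# The paper's VarzeshiBERT checkpoint name is not given in the abstract.
fill = pipeline("fill-mask", model="HooshvareLab/bert-fa-base-uncased")

# Illustrative Persian sports sentence with one masked token,
# roughly: "The Persepolis team [MASK] in this match."
sentence = "تیم پرسپولیس در این بازی [MASK] شد."

# The pipeline returns candidate fillers ranked by probability.
predictions = fill(sentence)
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```

A domain-specialized model such as VarzeshiBERT would be evaluated the same way, scoring how often the top-ranked filler matches the held-out token on sports text.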
