Karafan Quarterly Scientific Journal

Identifying Effective Features for Predicting Type 2 Diabetes Using Novel Wrapper-Based Random Feature Selection Methods

Article type: Research article (applied)

Authors
1 Faculty Member, Department of Computer Engineering, Payame Noor University, Tehran, Iran.
2 Associate Professor, Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran.
3 Assistant Professor, Department of Computer, Birjand Branch, Islamic Azad University, Birjand, Iran.
Abstract
Type 2 diabetes mellitus is a chronic metabolic disorder characterized by hyperglycemia resulting from insulin resistance or insulin deficiency. According to estimates, about 537 million adults had diabetes in 2021, a substantial share of which is attributed to type 2 diabetes. This shows that focusing on strategies for prevention, early diagnosis, and management of type 2 diabetes is vital. This study examines the performance of different feature selection methods in machine learning models for predicting type 2 diabetes. Various novel wrapper-based feature selection methods are used to identify the most important features. The classification algorithms KNN, decision tree, SVM, random forest, and MLP are evaluated on two standard datasets, Pima Indian Diabetes and Mendeley Diabetes. The results are compared using evaluation measures such as accuracy, specificity, precision, sensitivity, F1-measure, and the ROC curve. The features selected in the Pima dataset are glucose, body mass index, age, and blood pressure; in the Mendeley dataset they are HbA1c, BMI, and cholesterol. With the ERSFS feature selection method, these features yield the highest accuracy, 77.3% on the Pima dataset and 98% on the Mendeley dataset. The study demonstrates the potential of feature selection methods to improve the classification of type 2 diabetes and can help physicians and researchers develop and use more accurate diagnostic tools for this disease. It also offers valuable insight into the most important factors in predicting the onset of type 2 diabetes.
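The abstract names its evaluation criteria without defining them. As a reading aid, the block below gives the conventional confusion-matrix definitions, assuming the article follows standard usage; the symbols TP, TN, FP, and FN (true/false positives and negatives) are not taken from the abstract itself.

```latex
\begin{align*}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN}, \\
\text{Sensitivity} &= \frac{TP}{TP + FN}, \\
\text{Specificity} &= \frac{TN}{TN + FP}, \\
\text{Precision}   &= \frac{TP}{TP + FP}, \\
F_1                &= \frac{2 \cdot \text{Precision} \cdot \text{Sensitivity}}{\text{Precision} + \text{Sensitivity}}.
\end{align*}
```

The ROC curve plots sensitivity against 1 - specificity as the decision threshold varies.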

Article Title (English)

Effective Feature Identification for Type-2 Diabetes Prediction Using Novel Wrapper-Based Random Feature Selection Methods

Authors (English)

Hamed SabbaghGol 1
Hamid Saadatfar 2
Mahdi Khazaiepoor 3
1 Faculty Member, Department of Computer Engineering, Payame Noor University, Tehran, Iran.
2 Associate Professor, Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran.
3 Assistant Professor, Department of Computer, Bi.C., Islamic Azad University, Birjand, Iran.
Abstract (English)

Type 2 diabetes mellitus is a chronic metabolic disorder characterized by hyperglycemia resulting from insulin resistance or insulin deficiency. According to estimates, approximately 537 million adults had diabetes in 2021, a significant portion of which is attributed to type 2 diabetes. This highlights the critical need to focus on preventive strategies, early diagnosis, and management of type 2 diabetes. This study investigates the performance of different feature selection methods in machine learning models for predicting type 2 diabetes. Various novel wrapper-based feature selection methods are employed to identify the most significant features. Classification algorithms, including KNN, decision tree, SVM, random forest, and MLP, are evaluated on two standard datasets: Pima Indian Diabetes and Mendeley Diabetes. The results are compared using criteria such as accuracy, specificity, precision, sensitivity, F1-measure, and the ROC curve. The selected features in the Pima dataset are glucose, body mass index, age, and blood pressure; in the Mendeley dataset they are HbA1c, BMI, and cholesterol. With the ERSFS feature selection method, these features yield the highest accuracy, 77.3% on the Pima dataset and 98% on the Mendeley dataset. The study reveals the potential of feature selection methods to improve the classification of type 2 diabetes and can help clinicians and researchers develop and use more accurate diagnostic tools for this disease. It also provides valuable insight into the most important factors affecting the prediction of type 2 diabetes.
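To make the idea of a wrapper-based random feature selection method concrete, the sketch below scores randomly drawn feature subsets by the accuracy of a classifier trained on them and keeps the best one. It is a generic illustration only and is not the ERSFS algorithm reported in the abstract. The function names (`knn_predict`, `holdout_accuracy`, `random_subset_search`, `make_toy_data`), the 70/30 holdout split, the choice of k = 3, and the synthetic data are all assumptions made for this example; neither the Pima nor the Mendeley dataset is used.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def holdout_accuracy(X, y, subset, split):
    """Wrapper criterion: KNN accuracy on held-out rows using only the columns in `subset`."""
    train_X = [[row[j] for j in subset] for row in X[:split]]
    test_X = [[row[j] for j in subset] for row in X[split:]]
    hits = sum(
        knn_predict(train_X, y[:split], x) == target
        for x, target in zip(test_X, y[split:])
    )
    return hits / len(test_X)


def random_subset_search(X, y, n_iter=200, seed=0):
    """Evaluate random feature subsets; return the best (subset, accuracy) found."""
    rng = random.Random(seed)
    n_features = len(X[0])
    split = int(0.7 * len(X))  # assumed 70/30 holdout split
    # Start from the full feature set so the result is never worse than the baseline.
    best_subset = list(range(n_features))
    best_score = holdout_accuracy(X, y, best_subset, split)
    for _ in range(n_iter):
        size = rng.randint(1, n_features)
        subset = sorted(rng.sample(range(n_features), size))
        score = holdout_accuracy(X, y, subset, split)
        # Prefer higher accuracy; break ties in favour of fewer features.
        if score > best_score or (score == best_score and len(subset) < len(best_subset)):
            best_subset, best_score = subset, score
    return best_subset, best_score


def make_toy_data(n=200, seed=1):
    """Synthetic binary data: columns 0 and 1 carry signal, columns 2 to 5 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        informative = [rng.gauss(2.0 * label, 1.0), rng.gauss(1.5 * label, 1.0)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
        X.append(informative + noise)
        y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    subset, score = random_subset_search(X, y)
    print("selected feature indices:", subset)
    print("holdout accuracy:", round(score, 3))
```

Because the classifier is retrained and rescored for every candidate subset, a wrapper of this kind optimizes the metric of interest directly, at the cost of many more model evaluations than a filter method would need.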

Keywords (English)

Type-2 Diabetes
Dimension Reduction
Feature Selection
Machine Learning
Classification
 
Volume 22, Issue 1
Technical and Engineering
Spring 1404
Pages 150-182

  • Received: 09 Tir 1403
  • Revised: 21 Azar 1403
  • Accepted: 24 Farvardin 1404