Karafan Journal

Effective Feature Identification for Type-2 Diabetes Prediction Using Novel Wrapper-Based Random Feature Selection Methods

Document Type: Original Article

Authors
1 Faculty Member, Department of Computer Engineering, Payame Noor University, Tehran, Iran.
2 Associate Professor, Department of Computer Engineering, Faculty of Electrical and Computer Engineering, University of Birjand, Birjand, Iran.
3 Assistant Professor, Department of Computer, Bi.C., Islamic Azad University, Birjand, Iran.
Abstract
Type 2 diabetes mellitus is a chronic metabolic disorder characterized by hyperglycemia resulting from insulin resistance or insulin deficiency. According to estimates, approximately 537 million adults had diabetes in 2021, and a significant portion of these cases is attributed to type 2 diabetes. This highlights the critical need for preventive strategies, early diagnosis, and management of type 2 diabetes. This study investigates the performance of several novel feature selection methods in machine learning models for predicting type 2 diabetes. Various wrapper-based feature selection methods are employed to identify the most significant features. Classification algorithms, including KNN, decision tree, SVM, random forest, and MLP, are evaluated on two standard datasets: Pima Indians Diabetes and Mendeley Diabetes. The results are compared using accuracy, specificity, precision, sensitivity, F1-measure, and the ROC curve. The features selected in the Pima dataset are glucose, body mass index (BMI), age, and blood pressure; those selected in the Mendeley dataset are HbA1c, BMI, and cholesterol. With these features, the ERSFS feature selection method achieves the highest accuracy: 77.3% on the Pima dataset and 98% on the Mendeley dataset. The study demonstrates the potential of feature selection methods to improve the classification performance of type 2 diabetes models and can help clinicians and researchers develop and use more accurate diagnostic tools for this disease. It also provides valuable insight into the most important factors affecting the prediction of type 2 diabetes.
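The general idea of a wrapper-based random feature selection method, together with the evaluation criteria named in the abstract, can be illustrated with a minimal sketch. It is not the ERSFS algorithm reported in the article: it is a generic random-subset wrapper around a small hand-written KNN classifier, run on synthetic data rather than the Pima or Mendeley datasets. All function names, the 70/30 holdout split, the tie-breaking rule, and the data generator are assumptions made for illustration only.

```python
import random


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training rows (squared Euclidean distance)."""
    nearest = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )[:k]
    return 1 if 2 * sum(label for _, label in nearest) > k else 0


def wrapper_score(X, y, subset, split):
    """Wrapper criterion: holdout accuracy of KNN restricted to the columns in `subset`."""
    Xs = [[row[j] for j in subset] for row in X]
    preds = [knn_predict(Xs[:split], y[:split], x) for x in Xs[split:]]
    return sum(p == t for p, t in zip(preds, y[split:])) / len(preds)


def random_wrapper_selection(X, y, n_iter=200, seed=0):
    """Sample random feature subsets and keep the one with the best wrapper score."""
    rng = random.Random(seed)
    n_features = len(X[0])
    split = int(0.7 * len(X))  # assumed 70/30 holdout split
    best = tuple(range(n_features))
    best_score = wrapper_score(X, y, best, split)
    for _ in range(n_iter):
        size = rng.randint(1, n_features)
        subset = tuple(sorted(rng.sample(range(n_features), size)))
        score = wrapper_score(X, y, subset, split)
        # Prefer higher accuracy; on ties, prefer the smaller subset.
        if score > best_score or (score == best_score and len(subset) < len(best)):
            best, best_score = subset, score
    return best, best_score


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, precision and F1 from the confusion matrix."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)

    def ratio(num, den):
        return num / den if den else 0.0  # undefined ratios are reported as 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Synthetic data (hypothetical): column 0 separates the classes, columns 1-3 are noise.
data_rng = random.Random(42)
y = [i % 2 for i in range(100)]
X = [
    [10 * label + data_rng.uniform(-1, 1)] + [data_rng.uniform(-1, 1) for _ in range(3)]
    for label in y
]

subset, score = random_wrapper_selection(X, y)
print("selected feature indices:", subset, "holdout accuracy:", score)
print(binary_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```

A wrapper method scores each candidate subset by training and testing the classifier itself, which is why the search loop calls `wrapper_score` on every sampled subset. In practice the holdout split would typically be replaced by cross-validation, and the classifier by any of the models compared in the study.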

Volume 22, Issue 1
Technical and Engineering
Spring 2025
Pages 150-182

  • Received: 29 June 2024
  • Revised: 11 December 2024
  • Accepted: 13 April 2025