An Audio Classification Approach for a Voice-Controlled Smart Wheelchair Using Deep Networks for Persian-Speaking Users

Amiri, Mohammad

doi:10.48301/kssa.2025.490050.3053

Article type: Research article (applied)

Author

Mohammad Amiri

Assistant Professor, Department of Computer Engineering, National University of Skills, Tehran, Iran.

Abstract

In every society, some people with spinal cord injuries lack the physical and motor ability to move their limbs; they cannot use a conventional wheelchair and need a voice-controlled one.
Traditional methods for classifying voice commands mainly comprised simple algorithms and approaches based on manual annotation, which often have limited effectiveness because they cannot recognize complex patterns or cope with the high variability of human speech.
Audio classification is one of the challenges in the field of pattern recognition. Because of the good results they produce, convolutional neural networks have been widely used in sound recognition and classification. This paper presents a method for classifying ambient sounds from their audio spectrograms with deep neural networks, applied to classifying the voices of Persian speakers in order to build a voice-controlled wheelchair. For the implementation, Inception-V3 was used as the convolutional neural network, pretrained on the ImageNet dataset. In the next step, we trained the network on images generated from the spectral features of the voices of about 50 Persian speakers. In the absence of an existing dataset of Persian speakers, we created our own with 50 people, 35 men and 15 women, aged 25 to 60. The experimental results reached a mean accuracy of 83.33%. The wheelchair will therefore be able to execute five commands: stop, left, right, forward, and backward.
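The spectrogram step mentioned in the abstract can be illustrated with a minimal sketch. It is not the author's pipeline: the frame length, hop size, Hann window, sample rate, and the synthetic test tone are all assumptions chosen for illustration, and the naive DFT stands in for the FFT routine a real system would use.

```python
import cmath
import math


def hann(n):
    """Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def spectrogram(samples, frame_len=64, hop=32):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    window = hann(frame_len)
    bins = frame_len // 2 + 1  # keep non-negative frequencies only
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(bins):
            # Naive DFT for clarity; a real pipeline would call an FFT library.
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                      for n in range(frame_len))
            row.append(abs(acc))
        rows.append(row)
    return rows


# Synthetic stand-in for a recorded command (assumption): 1 kHz tone at 8 kHz.
SAMPLE_RATE = 8000
FRAME_LEN = 64
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(512)]

spec = spectrogram(tone, frame_len=FRAME_LEN, hop=32)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_LEN  # bin index -> frequency in Hz
```

Each row of `spec` is one time frame, so rendering the rows as pixel intensities gives the kind of image that an image classifier such as Inception-V3 can take as input.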

Keywords

Voice Recognition

Audio Classification

Deep Learning

Convolutional Neural Networks

Audio Spectral Features

Voice-Controlled Devices

Inception-V3

Subjects

Artificial Intelligence

Article title (English)

Deep learning-based audio classification algorithm in a voice-controlled wheelchair for Persian-speaking users

Author (English)

Mohammad Amiri

Assistant Professor, Department of Computer Engineering, National University of Skills (NUS), Tehran, Iran.

Abstract (English)

In every society, some people with spinal cord injuries lack the physical and motor ability to move their limbs; they cannot use a conventional wheelchair and need one with voice control. Audio classification is one of the challenges in the field of pattern recognition. Traditional methods for classifying voice commands rely mainly on simple algorithms and manual annotation, and they are often of limited effectiveness because they cannot recognize complex patterns or cope with the high variability of human speech. Convolutional neural networks (CNNs) have been widely used in audio recognition and classification because they often give good results. This paper presents a spectrogram-based method for classifying ambient sounds with deep neural networks, applied to classifying the voices of Persian speakers in order to build a voice-controlled intelligent wheelchair. For the implementation, we used Inception-V3, a convolutional neural network pretrained on the ImageNet dataset. We then trained the network on spectrogram images generated from sound recordings of about 50 Persian speakers. In the absence of an existing dataset of Persian speakers, we created our own with 50 participants, 35 men and 15 women, aged 25 to 60. The experiments achieved a mean accuracy of 83.33%. The wheelchair will therefore be able to execute five commands: stop, left, right, forward, and backward.
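As a rough illustration of how a five-class classifier output could be turned into one of the wheelchair commands listed above, the sketch below applies a softmax to raw scores and picks the most likely class. The class order, the `decide` helper, and the confidence threshold with its fallback to "stop" are hypothetical choices made for this example; the abstract does not describe them.

```python
import math

# Class order is an assumption; the abstract names the commands, not their indices.
COMMANDS = ["stop", "left", "right", "forward", "backward"]


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decide(logits, min_confidence=0.6):
    """Return (command, confidence); fall back to 'stop' when unsure.

    The threshold and the fallback are hypothetical safety choices,
    not something stated in the paper.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < min_confidence:
        return "stop", probs[best]
    return COMMANDS[best], probs[best]


# Made-up scores standing in for the network's output layer.
confident = decide([0.1, 4.0, 0.2, 0.0, -1.0])
uncertain = decide([1.0, 1.1, 0.9, 1.0, 1.0])
```

With a reported mean accuracy of 83.33%, some misclassifications are expected, which is why this sketch treats low-confidence predictions as "stop" rather than as a movement command.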

Keywords (English)

Voice Recognition

Audio Classification

Deep Learning

Convolutional Neural Networks

Spectrogram

Voice-controlled devices

Inception-V3
