Deep learning-based audio classification algorithm in a voice-controlled wheelchair for Persian-speaking users

Amiri, Mohammad

doi:10.48301/kssa.2025.490050.3053

Document Type: Original Article

Author

Mohammad Amiri

Assistant Professor, Department of Computer Engineering, National University of Skills (NUS), Tehran, Iran.

Abstract

In every society, some people with spinal disabilities lack the physical and motor abilities needed to move their limbs; they cannot operate a conventional wheelchair and need one that responds to voice commands. Audio classification is one of the challenges in the field of pattern recognition. Traditional methods for classifying voice commands rely mainly on simple algorithms and manual annotation, which often perform poorly because they cannot capture complex patterns or cope with the high variability of human speech. Convolutional neural networks (CNNs) have been widely used for audio recognition and classification because they often achieve good results. This paper presents a method that classifies ambient sounds from their spectrograms using deep neural networks and applies it to the speech of Persian speakers in order to build an intelligent voice-controlled wheelchair. We used Inception-V3, a convolutional neural network pretrained on the ImageNet dataset, and then fine-tuned it on spectrogram images generated from ambient recordings of 50 Persian speakers. Because no dataset of Persian speakers was available, we built our own with 50 participants (35 men and 15 women) aged 25 to 60. The system distinguishes five commands (stop, left, right, forward, and back) and achieved a mean accuracy of 83.33% in the experiments.
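The method summarized above turns each spoken command into a spectrogram image and classifies it with a fine-tuned Inception-V3. The plain-Python sketch below illustrates only the two ends of that pipeline, under stated assumptions: a short-time Fourier transform log-power spectrogram, and a mapping from five class scores to a wheelchair command. The function names `log_spectrogram` and `to_command`, the frame length, hop size, sample rate, class order, and the `min_confidence` fallback to "stop" are hypothetical choices for illustration, not values reported in the paper. Rendering the spectrogram as an image, resizing it to Inception-V3's default 299×299 input, and fine-tuning the network are not shown.

```python
import cmath
import math


def log_spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Short-time Fourier transform log-power spectrogram (in dB).

    Returns a list of frames; each frame holds frame_len // 2 + 1 values,
    one per frequency bin from DC up to the Nyquist frequency.
    frame_len, hop and eps are illustrative defaults, not the paper's settings.
    """
    if frame_len < 2 or hop < 1:
        raise ValueError("frame_len must be >= 2 and hop >= 1")
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    # Symmetric Hann window to reduce spectral leakage at frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * k / (frame_len - 1))
              for k in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        segment = [signal[start + k] * window[k] for k in range(frame_len)]
        row = []
        for f in range(n_bins):
            # Naive DFT of bin f; an FFT library would be used in practice.
            coeff = sum(segment[k] * cmath.exp(-2j * math.pi * f * k / frame_len)
                        for k in range(frame_len))
            # Power in decibels; eps avoids log10(0) on silent frames.
            row.append(10 * math.log10(abs(coeff) ** 2 + eps))
        frames.append(row)
    return frames


# Hypothetical class order: the paper lists five commands but not how the
# classifier's outputs are indexed or how they reach the motors.
COMMANDS = ("stop", "left", "right", "forward", "back")


def to_command(class_scores, min_confidence=0.5):
    """Map classifier scores (e.g. softmax outputs) to a wheelchair command.

    Falls back to "stop" when the best score is below min_confidence;
    this safety rule is an assumption, not part of the published method.
    """
    if len(class_scores) != len(COMMANDS):
        raise ValueError("expected one score per command")
    best = max(range(len(class_scores)), key=lambda i: class_scores[i])
    if class_scores[best] < min_confidence:
        return "stop"
    return COMMANDS[best]


if __name__ == "__main__":
    rate = 8000  # assumed sample rate in Hz
    # 1 kHz test tone; with frame_len=64 each bin spans 125 Hz, so it lands on bin 8.
    tone = [math.sin(2 * math.pi * 1000 * n / rate) for n in range(512)]
    spec = log_spectrogram(tone, frame_len=64, hop=32)
    peak = max(range(len(spec[0])), key=lambda f: spec[0][f])
    print(len(spec), len(spec[0]), peak)  # 15 33 8
    print(to_command([0.1, 0.05, 0.7, 0.1, 0.05]))  # right
```

Defaulting to "stop" on low-confidence predictions is one conservative design choice for a mobility device; the abstract does not say how the published system handles uncertain inputs.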

Keywords

Voice Recognition

Audio Classification

Deep Learning

Convolutional Neural Networks

Spectrogram

Voice-controlled devices

Inception-V3

Subjects

Artificial intelligence