CNN Based Automatic Speech Recognition: A Comparative Study

  • Hilal Ilgaz
    Computer Engineering Department, Gazi University, Ankara, Turkey.
  • Beyza Akkoyun
    Computer Engineering Department, Gazi University, Ankara, Turkey.
  • Özlem Alpay
    Computer Engineering Department, Gazi University, Ankara, Turkey. ozlemalpay[at]gazi.edu.tr
  • M. Ali Akcayol
    Computer Engineering Department, Gazi University, Ankara, Turkey.

Abstract

Recently, one of the most common approaches used in speech recognition is deep learning. The most advanced results have been obtained with speech recognition systems built on convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Since CNNs can capture local features effectively, they are applied to tasks with relatively short-term dependencies, such as keyword detection or phoneme-level sequence recognition. This paper presents the development of a deep learning-based speech command recognition system. The Google Speech Commands Dataset, which contains 65,000 one-second-long utterances of 30 short English words, has been used; 80% of the dataset has been used for training and 20% for testing. The one-second voice commands have been converted into spectrograms and used to train different artificial neural network (ANN) models, including several CNN variants widely used in deep learning applications. The performance of the proposed model has reached 94.60%.
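As a rough illustration of the pipeline summarized in the abstract, the sketch below shows how a one-second, 16 kHz command can be converted into a spectrogram and classified with a small CNN. It is a minimal example assuming Python with TensorFlow/Keras; the STFT parameters and layer sizes are illustrative assumptions, not the configuration evaluated in the paper.

import tensorflow as tf

NUM_CLASSES = 30     # 30 short English words in Speech Commands v1
SAMPLE_RATE = 16000  # one-second clips at 16 kHz

def to_spectrogram(waveform):
    # Magnitude spectrogram via the short-time Fourier transform.
    # frame_length and frame_step are assumed values, not taken from the paper.
    stft = tf.signal.stft(waveform, frame_length=255, frame_step=128)
    spectrogram = tf.abs(stft)
    return spectrogram[..., tf.newaxis]  # add a channel axis for Conv2D

# A 16000-sample input yields a (124, 129, 1) spectrogram "image".
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(124, 129, 1)),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(NUM_CLASSES),  # logits over the 30 command words
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"],
)

# Shape check on a dummy one-second waveform.
x = to_spectrogram(tf.random.normal([SAMPLE_RATE]))
print(x.shape)  # (124, 129, 1)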
