
Mahesh S Patil
School of Computer Science and Engineering, KLE Technological University
India
Satyadhyan Chickerur
India
Anand Meti
School of Computer Science and Engineering, KLE Technological University
India
Priyanka M Nabapure
School of Computer Science and Engineering, KLE Technological University
India
Sunaina Mahindrakar
School of Computer Science and Engineering, KLE Technological University
India
Sonali Naik
School of Computer Science and Engineering, KLE Technological University
India
Soumya Kanyal
School of Computer Science and Engineering, KLE Technological University
India
Vol. 8 No. 3 (2019), Articles, pages 13-26
DOI: https://doi.org/10.14201/ADCAIJ2019831326
Accepted: Feb 25, 2020

Abstract

Speech communication in a noisy environment is a difficult and challenging task. Professionals in fields such as aviation, construction, and manufacturing work in noisy environments and find it difficult to communicate orally. Such environments call for an automated lip-reading system that can help convey instructions and commands. This paper proposes a novel lip-reading solution that extracts the geometrical shape of lip movement from video and predicts the words or sentences spoken. An India-specific language data set was developed, consisting of lip-movement information captured from 50 persons: students aged 18 to 20 years and faculty aged 25 to 40 years. Each speaker recited a paragraph of 58 words across 10 sentences in Hindi (an Indian language written in the Devanagari script), recorded under various conditions. The implementation combines facial-part detection with Long Short-Term Memory (LSTM) networks. The proposed solution predicts spoken words with 77% and 35% accuracy for data sets of 3 and 10 words respectively, and predicts sentences with 20% accuracy, which is encouraging.
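The pipeline described in the abstract (per-frame facial-part detection, geometric lip-shape features, then a sequence model) can be sketched in outline. The snippet below is a minimal illustration, not the authors' implementation: it assumes each video frame has already been reduced to a list of (x, y) lip-landmark points (e.g., the mouth landmarks from a 68-point facial-landmark detector such as dlib's), and shows how those points could be turned into simple geometric features (mouth width, opening height, aspect ratio) forming the per-frame feature sequence an LSTM classifier would consume. All function names and the synthetic landmark values are hypothetical.

```python
# Hypothetical sketch of the geometric feature-extraction step for
# visual speech recognition. Input: lip landmarks per frame, assumed
# to come from an external face/landmark detector (not shown here).

def lip_features(points):
    """Reduce one frame's lip landmarks to simple shape features:
    mouth width, mouth opening height, and their ratio."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    width = max(xs) - min(xs)
    height = max(ys) - min(ys)
    return (width, height, height / width if width else 0.0)

def sequence_features(frames):
    """Map a video (a list of per-frame landmark sets) to the feature
    sequence a sequence model such as an LSTM could be trained on."""
    return [lip_features(frame) for frame in frames]

# Two synthetic frames: mouth nearly closed, then open.
closed = [(0, 0), (10, 0), (5, 1)]
opened = [(0, 0), (10, 0), (5, 6)]
print(sequence_features([closed, opened]))  # → [(10, 1, 0.1), (10, 6, 0.6)]
```

In a full system, one such feature vector per frame would be stacked into a fixed-length sequence and fed to an LSTM with a softmax output over the word vocabulary; normalizing by face size would make the features robust to camera distance.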


