Human recognition by utilizing voice recognition and visual recognition

Document Type : Research Paper


1 Department of Computer Science, College of Education for Pure Science, University of Baghdad, Baghdad, Iraq

2 Department of Religious Education, Iraqi Sunni Affairs, Iraq


Audio-visual detection and recognition systems are considered among the most promising methods for many applications, including surveillance, speech recognition, eavesdropping devices, and intelligence operations. In human recognition, most current research focuses either on re-identifying body images taken by several cameras or on audio-only recognition. However, these traditional methods are not always useful when used alone. Indoor surveillance cameras, for example, are installed close to the ceiling and capture images from directly above, people do not always look straight at the camera, and cameras cannot be installed in some areas, such as toilets or bedrooms. It is therefore often difficult to identify a movement or a break-in. Likewise, when pursuing a suspect who enters a building or a party, it may be necessary to identify his location and/or to listen to his speech alone, isolated from other voices and noise. Hence, a hybrid combination of the two modalities is very effective. In this work, we propose a multimodal human recognition approach that uses both face and audio and is based on deep convolutional neural networks (CNNs). To address the problem of part of the body not being captured, the recognition results of separate VGG Face16 and ResNet50 CNNs are combined by score-level fusion using the weighted sum rule to enhance recognition performance. The results show that the proposed system succeeds in recognizing each person from his voice and/or his captured face. In addition, the system can separate a person's voice, isolate it from a noisy environment, and determine whether the desired person is present.
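As an illustration of score-level fusion by the weighted sum rule, the following minimal Python sketch combines per-identity scores from a face model and a voice model. It is not the authors' implementation: the function names, the min-max normalization step, the weight `face_weight`, and the acceptance `threshold` are all assumptions made for this example, and the abstract does not state how the weights were chosen.

```python
from typing import List, Optional, Sequence


def min_max_normalize(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1] so both modalities are comparable.

    Assumption: min-max scaling; the abstract does not name a normalization.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores are equal, so this modality carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_sum_fusion(
    face_scores: Sequence[float],
    voice_scores: Sequence[float],
    face_weight: float = 0.5,  # hypothetical weight; voice gets 1 - face_weight
) -> List[float]:
    """Fuse per-identity scores from two modalities with the weighted sum rule."""
    if len(face_scores) != len(voice_scores):
        raise ValueError("both modalities must score the same set of identities")
    if not 0.0 <= face_weight <= 1.0:
        raise ValueError("face_weight must lie in [0, 1]")
    face = min_max_normalize(face_scores)
    voice = min_max_normalize(voice_scores)
    return [
        face_weight * f + (1.0 - face_weight) * v
        for f, v in zip(face, voice)
    ]


def identify(fused: Sequence[float], threshold: float = 0.5) -> Optional[int]:
    """Return the index of the best-scoring identity, or None if below threshold."""
    best = max(range(len(fused)), key=lambda i: fused[i])
    return best if fused[best] >= threshold else None


# Illustrative scores for three enrolled identities (made-up numbers).
face_scores = [0.2, 0.9, 0.4]
voice_scores = [0.1, 0.3, 0.8]

both = identify(weighted_sum_fusion(face_scores, voice_scores, face_weight=0.6))
voice_only = identify(weighted_sum_fusion(face_scores, voice_scores, face_weight=0.0))
```

Setting `face_weight` to 0.0 falls back to the voice scores alone, which is one simple way to handle a frame in which the face is not visible; a deployed system would more likely tune the weights on validation data.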


Volume 13, Issue 1
March 2022
Pages 343-351
  • Received Date: 01 August 2021
  • Revised Date: 11 September 2021
  • Accepted Date: 27 September 2021