MFCC based hybrid fingerprinting method for audio classification through LSTM

Document Type: Research Paper


Department of Computer Science, Karpagam Academy of Higher Education, Coimbatore, India


In this paper, a novel audio fingerprinting methodology for audio classification is proposed. An audio fingerprint is a compact, unique digest that identifies a signal. The proposed model applies audio fingerprinting to create a unique fingerprint for each audio file: an MFCC spectrum is extracted, the spectra are averaged over time, and the resulting mean spectrum is converted into a binary image. These images are then fed to an LSTM network to classify the environmental sounds in the UrbanSound8K dataset, achieving an accuracy of 98.8% across all 10 folds of the dataset.
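The fingerprinting step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not specify the binarization rule, so thresholding each MFCC coefficient at its temporal mean is assumed here, and the input is a toy MFCC-like matrix rather than coefficients extracted from real audio.

```python
import numpy as np

def binarize_mfcc(mfcc):
    """Turn an MFCC matrix (n_mfcc x n_frames) into a binary fingerprint image.

    Assumption: each coefficient row is thresholded at its mean over time;
    the paper only states that the mean spectrum is converted to a binary image.
    """
    mean_per_coeff = mfcc.mean(axis=1, keepdims=True)  # temporal mean per coefficient
    return (mfcc > mean_per_coeff).astype(np.uint8)    # 1 above the mean, 0 otherwise

# Toy stand-in for an extracted MFCC spectrum: 4 coefficients x 5 frames.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(4, 5))
fp = binarize_mfcc(mfcc)
print(fp.shape)  # fingerprint image has the same shape as the MFCC matrix
```

In the full pipeline, such binary images (one per audio clip) would form the input sequence to the LSTM classifier.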


Volume 12, Special Issue
December 2021
Pages 2125-2136
  • Receive Date: 02 November 2021
  • Revise Date: 29 November 2021
  • Accept Date: 07 December 2021