Harmonium note and triad music transcription using neural networks

Document Type: Research Paper


1 Department of Electronics and Telecommunication, College of Engineering Pune (COEP), Wellesley Rd, Shivajinagar, Pune, Maharashtra 411005, India

2 College of Engineering Pune (COEP), Wellesley Rd, Shivajinagar, Pune, Maharashtra 411005, India


Learning music requires a two-pronged approach that combines theoretical study with practical exposure to the instrument being learnt. While previous literature has focused on developing note-recognition technologies for various musical instruments, the harmonium has received comparatively little attention in this research area. This research applies a hybrid approach to polyphonic triad recognition for harmonium music. Over 21,000 audio samples of harmonium notes and triads were recorded to train a Convolutional Recurrent Neural Network (CRNN) model. The same recordings were also used to train Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) models, allowing a comparative analysis of the efficiency of the three architectures. The results indicate that the CRNN model is the most efficient, accurate, and precise on score-based transcription: the proposed system achieved 94% accuracy for harmonium triad recognition. The recognized triads were rendered as sheet music using LilyPond. Possible applications of this output include helping students better understand triad sequences and Automatic Music Transcription of performances.
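The core signal-processing idea behind triad recognition can be illustrated without any trained model: the constituent pitches of a sustained triad appear as distinct peaks in its magnitude spectrum. The sketch below is a minimal, illustrative stand-in, not the paper's CRNN; the sample rate, the synthetic test signal, and the peak-separation threshold are assumptions chosen for demonstration only.

```python
# Illustrative sketch (NOT the paper's CRNN): recovering the three pitches of a
# synthetic, harmonium-like triad from its magnitude spectrum.
import numpy as np

SR = 16000  # sample rate in Hz (an assumption for this demo)

def synth_triad(freqs, dur=1.0, sr=SR):
    """Sum of sinusoids standing in for a sustained harmonium triad."""
    t = np.arange(int(sr * dur)) / sr
    return sum(np.sin(2 * np.pi * f * t) for f in freqs)

def top_pitches(signal, k=3, sr=SR, min_sep=10.0):
    """Return the k strongest spectral peaks (Hz), at least min_sep Hz apart."""
    spectrum = np.abs(np.fft.rfft(signal * np.hanning(len(signal))))
    freqs = np.fft.rfftfreq(len(signal), 1 / sr)
    picked = []
    # Greedily accept the loudest bins, skipping bins too close to a pick,
    # so the mainlobe of one sinusoid is not reported three times.
    for b in np.argsort(spectrum)[::-1]:
        if all(abs(freqs[b] - p) >= min_sep for p in picked):
            picked.append(freqs[b])
        if len(picked) == k:
            break
    return sorted(picked)

# C major triad: C4, E4, G4 (equal temperament, rounded)
triad = synth_triad([261.63, 329.63, 392.00])
print([int(round(f)) for f in top_pitches(triad)])  # → [262, 330, 392]
```

A real front end would of course replace the synthetic sinusoids with recorded harmonium audio, whose rich harmonics defeat naive peak picking; that is precisely why the paper feeds a time-frequency representation of the recordings to a learned CRNN instead.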


Volume 12, Special Issue
December 2021
Pages 2105–2123
  • Received: 01 October 2021
  • Revised: 04 November 2021
  • Accepted: 07 December 2021