Hidden Markov model and Persian speech recognition

Document Type: Research Paper


Assistant Professor, Department of Technology and Media Engineering, IRIBU University, Tehran, Iran.


Speech recognition, the process of converting an audio signal into its equivalent text, has become one of the most important research topics. Although many studies have addressed speech recognition for many of the world's languages, comparatively few have been conducted for Persian, so further work in this area is needed. Persian is a rich language that can form many new words by adding a suffix or prefix to a root, and it can be argued that the success rate of speech recognition programs for this language rises as the number of phonemes increases and can therefore be improved significantly. This study takes a practical approach to Persian speech recognition based on syllables, a unit that lies between phonemes and words, and implements it with a hidden Markov model. After the syllable utterances were obtained, several coefficients were calculated for every syllable. Suitable models were then built, and the success rate was determined by testing the resulting systems. System performance was measured with the error rate criterion. The results show a word error rate of 18.3% for the hidden Markov model, and post-processing improved system performance by approximately 16%.
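The abstract does not say how the error rate was computed. A common definition of word error rate is the word-level edit distance (substitutions, deletions and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch under that assumption; the function name `word_error_rate` and the sample strings are illustrative and do not come from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumed definition of WER; the paper's exact scoring procedure
    is not given in the abstract.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion over four words.
    print(word_error_rate("a b c d", "a x c"))
```

The same routine can score syllable sequences if syllables are separated by spaces, which may be closer to the syllable-based unit described above.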


Volume 14, Issue 1
January 2023
Pages 3111-3119
  • Receive Date: 16 June 2022
  • Revise Date: 19 July 2022
  • Accept Date: 09 August 2022