A method for the automatic extraction of keywords in legislative documents using statistical, semantic, and clustering relationships

Document Type: Research Paper


1 Faculty of Computer Engineering, Shahroud University of Technology, Semnan, Iran

2 University of Science and Technology of Mazandaran, Behshahr, Iran


The use of intelligent methods for automatically generating keywords for legislative documents has attracted the attention of many researchers over the past few decades. As the number of legislative documents and the volume of unstructured text grow, rapid access to these documents becomes more important. Extracting keywords from legislative documents can accelerate the legislative process and reduce costs. Many methods for generating keywords have been proposed. The present study attempts to extract more meaningful keywords from texts by using a thesaurus, which provides a structured system of terms, in order to improve the classification of legislative documents. The method combines the semantic relationships in the thesaurus, document clustering, and statistical features computed for each word. After the texts are pre-processed, candidate keywords are first selected using statistical methods. Next, phrases derived from these keywords are extracted using the semantic terms in the thesaurus. A numerical weight is then assigned to each word to indicate its relative importance, both with respect to the text and in comparison with other words. Finally, the final keywords are selected using the relationships in the thesaurus together with clustering methods. To evaluate the method, each tested text was compared to educational texts, and the similarity between them was measured. Tests on texts covering various subjects indicated that the proposed method is highly accurate. Data from the Parliament of Iran and the Deputy for Presidential Laws were used to evaluate the proposed model, which achieved higher accuracy and performance on these two databases than the other methods considered.
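
The weighting and thesaurus steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes TF-IDF as the statistical weight and a simple multiplicative bonus for words whose thesaurus neighbours also occur in the document. The function names, the `bonus` parameter, and the toy corpus and thesaurus are all hypothetical, and the clustering step is omitted.

```python
import math
from collections import Counter


def tf_idf_scores(doc_tokens, corpus):
    """Return {term: weight} for one tokenized document (assumed TF-IDF weighting)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for d in corpus if term in d)
        # Smoothed inverse document frequency, so the value is always finite.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        scores[term] = (count / len(doc_tokens)) * idf
    return scores


def boost_with_thesaurus(scores, thesaurus, bonus=0.5):
    """Raise the weight of terms whose thesaurus neighbours also occur in the document.

    `thesaurus` maps a term to an iterable of related terms (hypothetical format).
    """
    boosted = {}
    for term, weight in scores.items():
        related = thesaurus.get(term, ())
        support = sum(1 for r in related if r in scores)
        boosted[term] = weight * (1.0 + bonus * support)
    return boosted


def top_keywords(scores, k=3):
    """Return the k highest-weighted terms, breaking ties alphabetically."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy data, invented purely for illustration.
corpus = [
    ["law", "tax", "tax", "income"],
    ["law", "court"],
    ["law", "budget"],
]
thesaurus = {"tax": ["income"], "income": ["tax"]}

scores = tf_idf_scores(corpus[0], corpus)
boosted = boost_with_thesaurus(scores, thesaurus)
keywords = top_keywords(boosted, k=2)
```

In this toy corpus, "law" appears in every document and therefore receives a low weight, while "tax" and "income" are both rare across the corpus and linked in the thesaurus, so they rank as the top two keywords.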


Volume 12, Special Issue
December 2021
Pages 265-278
  • Receive Date: 03 April 2021
  • Revise Date: 10 May 2021
  • Accept Date: 27 May 2021