Connected Component Based Word Spotting on Persian Handwritten image documents

Document Type: Research Paper

Authors

Department of Computer Engineering and IT, Payam Noor University, Tehran, Iran

10.22075/ijnaa.2019.4125

Abstract

Word spotting makes unindexed image documents searchable by locating one or more words in a
document image, given a query word. This problem is challenging, mainly because of the large
number of word classes with very small inter-class and substantial intra-class distances. In this
paper, a segmentation-based word spotting method is presented for multi-writer Persian
handwritten documents using attribute-based classification and label embedding. For this purpose,
a hierarchical framework is proposed in which, at first, the candidates are selected based on
connected component (CC) sequences. Then, the query word is segmented into its constituent CCs,
and regions of the document with a similar CC count are selected based on their distance to the
CC count of the query word. As a result, the candidate regions are extracted. In the final phase,
the query word is located only in the candidate regions of the document. A well-known Persian
handwritten text dataset, namely FTH, is chosen as a benchmark for the presented method. The
results show that the proposed method outperforms the state-of-the-art methods, reaching 81.02
percent for unseen word class retrieval.
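
The candidate-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 8-connectivity choice, the absolute-difference distance, and the `max_count_diff` threshold are all assumptions made for illustration, and images are assumed to be already binarized (1 = ink, 0 = background).

```python
from collections import deque


def count_connected_components(image):
    """Count 8-connected foreground components in a binary image (list of rows)."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # New component found: flood-fill it with a breadth-first search.
            count += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def select_candidates(region_images, query_image, max_count_diff=1):
    """Return indices of regions whose CC count is close to the query's CC count.

    The distance here is a plain absolute difference (an assumed, hypothetical
    choice); the paper may define the distance differently.
    """
    query_count = count_connected_components(query_image)
    return [
        index
        for index, region in enumerate(region_images)
        if abs(count_connected_components(region) - query_count) <= max_count_diff
    ]
```

In this sketch, only the regions returned by `select_candidates` would be passed on to the final, more expensive matching phase, which mirrors the hierarchical idea of narrowing the search before locating the query word.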

Keywords