Document Type: Research Paper
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran.
Interpretability of word vector components is important for extracting conceptual relations. Word vectors derived from count-based models are interpretable but suffer from high dimensionality. Our goal in this study is to obtain interpretable, low-dimensional word vectors with the least possible loss of accuracy. To this end, we propose an approach that reduces the dimensions of word vectors using a labeling method based on the binary particle swarm optimization (BPSO) algorithm and a voting method for selecting the final context words. In this approach, we define several base models that solve the labeling problem using different data and different objective functions. We train each base model and select its three best solutions. We create the target word vectors of the dictionary from the context words labeled "1". Next, we combine the three best solutions of every base model into an ensemble and use voting to assign a final label to each primary context word, which yields the N final context words. We use the ukWaC corpus to construct the word vectors and evaluate them on the MEN, RG-65, and SimLex-999 test sets. The evaluation results show that when the dimensionality of the word vectors is reduced from 5000 to 1507, the Spearman correlation coefficient of the proposed approach decreases less than that of each base model. We consider this accuracy drop an acceptable penalty, because the resulting word vectors are both low-dimensional and interpretable.
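The voting step described above can be illustrated with a minimal sketch. The abstract does not specify the exact voting rule or how N is determined, so the sketch assumes simple vote counting: each solution in the ensemble is a binary label vector over the primary context words, a label of 1 counts as one vote, and the N words with the most votes are kept. The function name `vote_context_words`, the tie-breaking rule, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def vote_context_words(
    solutions: Sequence[Sequence[int]],
    context_words: Sequence[str],
    n_final: int,
) -> List[str]:
    """Keep the n_final context words that receive the most "1" labels.

    Assumption: plain vote counting with top-N selection; the paper's
    actual voting rule may differ.
    """
    if not solutions:
        raise ValueError("at least one solution is required")
    if any(len(s) != len(context_words) for s in solutions):
        raise ValueError("every solution must label every context word")
    if not 0 <= n_final <= len(context_words):
        raise ValueError("n_final must be between 0 and the number of words")

    # Count how many solutions labeled each context word with 1.
    votes = [sum(s[i] for s in solutions) for i in range(len(context_words))]

    # Rank by vote count (descending); ties go to the earlier word
    # (an arbitrary choice made for this sketch).
    ranked = sorted(range(len(context_words)), key=lambda i: (-votes[i], i))

    # Return the selected words in their original order.
    keep = sorted(ranked[:n_final])
    return [context_words[i] for i in keep]


# Toy ensemble: two hypothetical base models, three best solutions each.
words = ["dog", "run", "blue", "eat", "city"]
ensemble = [
    [1, 0, 1, 1, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
]
selected = vote_context_words(ensemble, words, n_final=3)
print(selected)  # ['dog', 'run', 'eat']
```

In this toy ensemble, "dog" and "eat" receive five votes each and "run" receives three, so these three words form the reduced set of dimensions. A threshold rule (for example, keeping every word labeled 1 by a majority of solutions) would be an equally plausible reading of the abstract, and under such a rule N would follow from the votes rather than being fixed in advance.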