A hybrid semi-supervised boosting to sentiment analysis

Document Type: Research Paper


1 Electrical and Computer Engineering Department, Tabriz University, Tabriz, Iran

2 Computer Engineering Department, Payame-Noor University, Tehran, Iran


In this article, we propose a hybrid semi-supervised boosting algorithm for sentiment analysis. Semi-supervised learning is the task of learning from a limited amount of labeled data together with a large amount of unlabeled data, which is the situation in our dataset. The proposed approach uses classifier predictions along with similarity information to assign labels to unlabeled examples. We propose a hybrid model, built on the boosting framework, that assigns a final label to each unlabeled example based on the agreement among the different classification models it constructs. The approach employs several different similarity measures in its loss function to show the role of the similarity function. We also describe the main preprocessing steps applied to the dataset. Our experimental results on real-world microblog data from a commercial website show that the proposed approach can effectively exploit information from the unlabeled data and significantly improves classification performance.
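The labeling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm or loss function. It only shows one plausible way to mix a classifier's class scores with similarity-weighted votes from labeled examples, and then to accept a pseudo-label only when several models agree. The function names (`cosine`, `pseudo_label`, `agree`), the choice of cosine similarity, and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pseudo_label(x, labeled, classifier_scores, alpha=0.5):
    """Combine classifier scores with similarity-weighted votes for one example.

    x: feature vector of the unlabeled example
    labeled: list of (vector, label) pairs
    classifier_scores: dict mapping label -> score in [0, 1] for x
    alpha: hypothetical mixing weight (an assumption, not from the paper)
    Returns (label, combined_score).
    """
    # Each labeled example votes for its own label, weighted by similarity to x.
    sim_votes = {}
    for vec, label in labeled:
        sim_votes[label] = sim_votes.get(label, 0.0) + max(cosine(x, vec), 0.0)
    total = sum(sim_votes.values()) or 1.0

    # Mix the normalized similarity votes with the classifier's own scores.
    combined = {}
    for label in sorted(set(sim_votes) | set(classifier_scores)):
        combined[label] = (
            alpha * classifier_scores.get(label, 0.0)
            + (1 - alpha) * sim_votes.get(label, 0.0) / total
        )
    best = max(combined, key=combined.get)
    return best, combined[best]


def agree(labels):
    """Return the common label if all models propose the same one, else None."""
    if labels and all(label == labels[0] for label in labels):
        return labels[0]
    return None
```

In a hybrid setup of this kind, `pseudo_label` would be called once per model (for example, one model per similarity measure), and `agree` would decide whether the unlabeled example receives a final label or stays unlabeled for the next round.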


Volume 12, Issue 2
November 2021
Pages 1769-1784
  • Receive Date: 04 April 2021
  • Accept Date: 26 June 2021