Xinyue Wang, Danqun Zhao*
This paper reports a series of comparative experiments on citation sentiment identification/analysis (CSI/CSA) that compare sentiment lexicon and machine learning methods and explore a fusion of the two. We design four groups of experiments targeting the key steps in current CSI/CSA research (sentiment lexicon expansion, text feature extraction, data resampling, and method fusion) in order to find the combinations of methods that identify citation sentiment most effectively. The experimental setup is as follows: we use the open citation corpus created by Athar; take SentiWordNet as the original sentiment lexicon and SO-PMI as its expansion method; extract text features with TF-IDF, Word2Vec, and BERT; and adopt "SMOTE+Undersampling" as the main data resampling method. Nine frequently used machine learning algorithms (models) are compared: support vector machine, random forest, decision tree, linear classification, AdaBoost, extremely randomized trees, stochastic gradient descent, long short-term memory network, and convolutional neural network. The main findings are: ① the sentiment lexicon extended with SO-PMI performs better than the original one for CSI/CSA; ② although it is a simple text feature extraction method, TF-IDF generally performs better than Word2Vec and BERT; ③ "SMOTE+Undersampling" better addresses the data imbalance problem in the Athar corpus; ④ integrating the sentiment lexicon with machine learning improves CSI/CSA performance, as shown by higher accuracy and Macro-F1.
Citation Sentiment Identification/Analysis (CSI/CSA); Sentiment Lexicon; Machine Learning; SMOTE
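As an illustration of the lexicon expansion step named in the abstract, the following is a minimal sketch of SO-PMI scoring. It is not the authors' implementation: the function name `so_pmi`, the sentence-level co-occurrence window, the additive `smoothing` constant, and the seed words are all assumptions made for this example. A candidate word receives a positive score when it co-occurs more strongly with positive seed words than with negative ones, and a negative score in the opposite case.

```python
import math
from collections import Counter
from itertools import combinations


def so_pmi(sentences, pos_seeds, neg_seeds, smoothing=0.01):
    """Score each non-seed word by SO-PMI using sentence-level co-occurrence.

    sentences: list of token lists (hypothetical pre-tokenized citation sentences).
    pos_seeds / neg_seeds: seed words with known polarity (assumed inputs).
    smoothing: small constant added to pair counts to avoid log(0) (an assumption).
    """
    n = len(sentences)
    word_df = Counter()  # number of sentences containing each word
    pair_df = Counter()  # number of sentences containing each unordered word pair
    for tokens in sentences:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    def pmi(w1, w2):
        # Pairs are stored in sorted order, so look them up the same way.
        pair = tuple(sorted((w1, w2)))
        p_joint = (pair_df[pair] + smoothing) / n
        return math.log2(p_joint / ((word_df[w1] / n) * (word_df[w2] / n)))

    seeds = set(pos_seeds) | set(neg_seeds)
    scores = {}
    for w in word_df:
        if w in seeds:
            continue
        # Only seeds that actually occur in the corpus contribute.
        pos = sum(pmi(w, s) for s in pos_seeds if s in word_df)
        neg = sum(pmi(w, s) for s in neg_seeds if s in word_df)
        scores[w] = pos - neg
    return scores
```

In a pipeline like the one described, words whose score exceeds some chosen threshold in either direction could be added to the positive or negative part of the lexicon; the threshold itself would need to be tuned and is not specified here.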