2024  VOLUME 4  ISSUE 1

RESEARCH ARTICLE

A comparative study of Neural Sinkhorn Topic Model based on different word embedding

AUTHORS

Fenggao Niu, Sijia Wang, Bikun Chen

ABSTRACT

Topic models play an important role in many natural language processing tasks. Early topic models rely on the bag-of-words assumption, which ignores contextual relationships between words and suffers from sparsity. Word embeddings map words to dense vectors in a low-dimensional space and preserve relational information between words. For this reason, embeddings such as Word2Vec, GloVe, and fastText have been introduced into neural topic models to improve modeling quality. However, current topic models typically use only one kind of embedding and do not fully consider the characteristics of each. To study the advantages and disadvantages of different word embeddings and their influence on topic modeling, and thereby provide a basis for choosing an embedding method, this paper varies the word embeddings and their dimensions and examines how they affect a topic model (the Neural Sinkhorn Topic Model) and a text classification task. The results show that: ⅰ) word embeddings trained on large corpora have the greatest impact on topic modeling and document classification, with an increase of 23% in the topic coherence and topic diversity indicators and an average increase of 68% in the classification indicators; ⅱ) word embeddings trained with the Skip-gram model are suited to long-text topic modeling, while those trained with the GloVe model are suited to short-text topic modeling; ⅲ) word embeddings trained with the fastText model perform poorly in the topic model itself but give better results when combined with the topic model for document classification; ⅳ) the word embedding dimension also affects the topic model, and the most suitable dimension should be chosen according to the actual situation.
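As a rough illustration of one of the indicators named above, topic diversity is often computed as the fraction of unique words among the top-k words of all topics. The sketch below assumes that common definition; the abstract does not state the exact formula used in the paper, and the function name `topic_diversity` and the example topics are hypothetical.

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top_k words of every topic.

    `topics` is a list of word lists, each ordered from most to least
    probable. This follows one common definition and is an assumption,
    not necessarily the formula used in the paper.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    if not top_words:
        raise ValueError("no topic words supplied")
    return len(set(top_words)) / len(top_words)


# Hypothetical topics, invented purely for illustration.
example_topics = [
    ["game", "team", "season", "player"],
    ["market", "stock", "price", "trade"],
    ["game", "price", "software", "system"],
]
score = topic_diversity(example_topics, top_k=4)
```

A score close to 1 means the topics share few top words, while a score close to 0 means the topics are largely redundant.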

KEYWORDS

Word embedding; Topic model; Neural Sinkhorn topic model; Comparative study; Text classification
