Boosting Text Clustering using Topic Selection

Marcelo Mendoza; Pablo Ormeño-Arriagada; Carlos Valle

Abstract

Latent Dirichlet Allocation (LDA) is a key topic modeling algorithm in the text mining field. Despite the great success of LDA, the state of the art reports that LDA is sensitive to the choice of hyperparameters and, accordingly, the quality of the topics found depends on tuning. Instead of searching for the optimal LDA hyperparameters for a given corpus, we propose a topic selection and aggregation strategy that exploits hyperparameter variability, such as the number of topics to infer, to boost the quality of the topics found. We show that our approach is a simple and effective way to boost topic models. Experimental results show that our proposal improves the quality of the topics found, favoring document and term clustering tasks.
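
The abstract does not detail the selection and aggregation criterion, so the following is only a minimal sketch of the general idea it describes: fit LDA several times while varying the number of topics, then pool the resulting topics into one candidate set from which a later selection step would choose. It uses scikit-learn's LatentDirichletAllocation; the toy corpus, the chosen topic counts, and the top-5-terms summary are illustrative assumptions, not the authors' method.

```python
# Sketch: pool topics from LDA runs with different numbers of topics.
# The paper's actual selection/aggregation step is not specified in the
# abstract; here every topic is simply kept as a candidate.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [  # toy corpus for illustration only
    "topic models describe documents as mixtures of topics",
    "lda infers latent topics from word co-occurrence patterns",
    "document clustering groups texts with similar content",
    "hyperparameter tuning affects the quality of inferred topics",
]

vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
terms = np.array(vectorizer.get_feature_names_out())

candidate_topics = []
for n_topics in (2, 3, 4):  # exploit variability in the number of topics
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(X)
    for topic in lda.components_:  # one term distribution per inferred topic
        top_terms = terms[topic.argsort()[::-1][:5]]
        candidate_topics.append(list(top_terms))

# A selection step (e.g., ranking candidates by topic coherence) would
# filter candidate_topics before using them for clustering.
for t in candidate_topics:
    print(t)
```

A real pipeline would replace the final print loop with a scoring criterion that keeps only the best topics across runs before aggregating them for document and term clustering.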

More information

Publication date: 2018
Start/End date: May 22, 2018