Per-Corpus Configuration of Topic Modelling for GitHub and Stack Overflow Collections
To make sense of large amounts of textual data, topic modelling is frequently used as a text-mining tool for the discovery of hidden semantic structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used topic model that aims to explain the structure of a corpus by grouping texts. LDA requires multiple parameters to work well, and there are only rough and sometimes conflicting guidelines available on how these parameters should be set. In this paper, we contribute (i) a broad study of parameters to arrive at good local optima, (ii) an a-posteriori characterisation of text corpora related to eight programming languages from GitHub and Stack Overflow, and (iii) an analysis of corpus feature importance via per-corpus LDA configuration.
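To make concrete what "parameters" can mean for LDA, the following is a minimal sketch, not the authors' method. It assumes three commonly tuned settings that the abstract does not name: the number of topics (`num_topics`) and two symmetric Dirichlet priors (`alpha` for document-topic mixtures, `beta` for topic-word distributions). Inference uses a plain collapsed Gibbs sampler as a stand-in, and the function name `fit_lda` and the toy corpus are hypothetical.

```python
import random


def fit_lda(docs, num_topics, alpha, beta, iterations=200, seed=0):
    """Collapsed Gibbs sampler for LDA (illustrative sketch only).

    docs: list of tokenised documents (lists of strings).
    num_topics, alpha, beta: assumed parameter names, not taken from the paper.
    Returns (doc_topic counts, topic_word counts, vocabulary).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1

    return doc_topic, topic_word, vocab


# Hypothetical toy corpus; real corpora would need tokenisation and cleaning.
docs = [
    ["python", "list", "loop", "python", "dict"],
    ["java", "class", "object", "java", "interface"],
    ["python", "dict", "loop", "list"],
    ["java", "object", "class", "interface"],
]
doc_topic, topic_word, vocab = fit_lda(docs, num_topics=2, alpha=0.1, beta=0.01)
```

Configuring LDA per corpus would then amount to repeating such a fit over a range of candidate settings for each corpus and comparing the results with a quality measure of one's choice; which measure and search strategy the paper uses is not stated in the abstract.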