Abstract:
Researchers usually present and synthesize their findings in scientific publications, so analyzing the content of those
publications is essential to understanding a subject. This study proposes improvements to topic modeling on a collection of
Neural Information Processing Systems (NIPS) conference papers published between 1987 and 2017. The study had two goals:
producing more coherent topics and labeling topics automatically. The first goal was achieved through five phases: (1) text
pre-processing; (2) sentence reduction using a new method called RS-LW (Reduced Sentences Based on Length and Weight), which
removes the shorter sentences, calculates a weight for each remaining sentence, and then removes approximately 25% of the
sentences, those with the lowest weights; (3) sentence embedding using S-BERT (Sentence-Bidirectional Encoder Representations
from Transformers); (4) dimensionality reduction of the sentence embeddings using UMAP (Uniform Manifold Approximation and
Projection); and (5) clustering with HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to group
similar documents. The experimental findings demonstrate that the proposed RS-LW phase produces more cohesive topics,
improving topic coherence (0.593) and topic diversity (0.96). Although topic
modeling extracts the most salient sentences describing latent topics from text collections, it does not by itself supply an
appropriate label for each topic. The second goal was achieved by proposing a new method that generates keywords by accessing
the authors' profiles in Google Scholar and extracting their listed interests, which are then used to label the topics automatically.
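
The RS-LW reduction phase described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `reduce_sentences`, the minimum length `min_words`, and the weighting scheme (mean corpus term frequency of a sentence's words) are assumptions made for illustration, since the abstract does not define them. Only the order of steps (drop short sentences, weight the rest, remove roughly the lowest-weight 25%) comes from the text.

```python
import math
import re
from collections import Counter


def reduce_sentences(sentences, min_words=5, drop_fraction=0.25):
    """Illustrative RS-LW-style reduction: length filter, then weight filter."""
    # Step 1: drop short sentences. ASSUMPTION: "short" means fewer than
    # min_words word tokens; the abstract gives no threshold.
    tokenized = [(s, re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    kept = [(s, toks) for s, toks in tokenized if len(toks) >= min_words]
    if not kept:
        return []

    # Step 2: weight each remaining sentence. ASSUMPTION: the weight is the
    # mean corpus frequency of the sentence's words; the abstract does not
    # say how weights are computed.
    freq = Counter(tok for _, toks in kept for tok in toks)
    weights = [sum(freq[t] for t in toks) / len(toks) for _, toks in kept]

    # Step 3: remove roughly the drop_fraction lowest-weight sentences,
    # keeping the survivors in their original order.
    n_drop = math.floor(len(kept) * drop_fraction)
    lowest_first = sorted(range(len(kept)), key=lambda i: weights[i])
    dropped = set(lowest_first[:n_drop])
    return [s for i, (s, _) in enumerate(kept) if i not in dropped]
```

In a pipeline like the one outlined in the abstract, the sentences returned by such a function would then be passed on to the embedding, dimensionality reduction, and clustering phases.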