From Data to Insight: Topic Modeling and Automatic Labeling Strategies

F. Najeeb, Rana; N. Dhannoon, Ban; Qais Alkhalidi, Farah

DOI: http://dx.doi.org/10.12785/ijcds/XXXXXX


Journal: International Journal of Computing and Digital Systems (Preprint)

ISSN: 2210-142X

Date: 2024-04-26

Abstract:

Researchers usually present and synthesize their findings in scientific publications, so analyzing the content of those publications is essential to understanding a subject. This study proposes an improved approach to topic modeling for a collection of papers from the Neural Information Processing Systems (NIPS) conference published between 1987 and 2017. The study has two goals: producing more coherent topics and labeling topics automatically. The first goal is achieved in five phases: (1) text pre-processing; (2) reduction using a new method called RS-LW (Reduced Sentences Based on Length and Weight), which removes the shorter sentences, calculates a weight for each remaining sentence, and removes approximately 25% of the sentences with the lowest weights; (3) sentence embedding using S-BERT (Sentence-Bidirectional Encoder Representations from Transformers); (4) dimensionality reduction of the sentence embeddings using UMAP (Uniform Manifold Approximation and Projection); and (5) clustering with HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to group similar documents. The experimental findings show that the proposed RS-LW phase produces more cohesive topics, improving topic coherence (0.593) and topic diversity (0.96). Although topic modeling extracts the most salient sentences describing the latent topics in a text collection, it does not by itself assign an appropriate label to each topic. The second goal is achieved by proposing a new method that generates keywords by accessing the authors' profiles in Google Scholar and extracting their listed interests, which are then used to label the topics automatically.
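
The reduction phase described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the length threshold or the sentence weight, so the function name `reduce_sentences`, the `min_words` cutoff, and the mean-term-frequency weight used here are all assumptions chosen only to make the length filter and the roughly 25% weight filter concrete.

```python
import re
from collections import Counter


def reduce_sentences(text, min_words=5, drop_fraction=0.25):
    """Hypothetical RS-LW-style reduction: length filter, then weight filter."""
    # Naive sentence splitter on ., ! and ? (an assumption for this sketch).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    # Step 1: drop sentences with fewer than min_words tokens
    # (the threshold is assumed; the abstract gives no value).
    tokenized = [(s, re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    kept = [(s, toks) for s, toks in tokenized if len(toks) >= min_words]
    if not kept:
        return []

    # Step 2: weight each remaining sentence by the mean corpus frequency
    # of its tokens (an assumed weighting, not the paper's definition).
    freq = Counter(tok for _, toks in kept for tok in toks)
    weights = [sum(freq[t] for t in toks) / len(toks) for _, toks in kept]

    # Step 3: remove about drop_fraction of the lowest-weight sentences,
    # preserving the original order of the survivors.
    n_drop = int(len(kept) * drop_fraction)
    order = sorted(range(len(kept)), key=lambda i: weights[i])
    dropped = set(order[:n_drop])
    return [s for i, (s, _) in enumerate(kept) if i not in dropped]
```

Under these assumptions, the reduced sentences would then be passed to the embedding, dimensionality reduction, and clustering phases; swapping in a different weight (for example TF-IDF) only changes Step 2.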
