Topic Modeling with BERT

The key steps in BERTopic modeling are as follows.

  • Use a sentence-embedding model to embed the documents.
  • Reduce the dimensionality of the embeddings using UMAP.
  • Cluster the documents (in the reduced dimensions) using HDBSCAN.
  • Use c-TF-IDF to extract keywords for each cluster, based on their frequency within the cluster and their IDF.
  • MMR (Maximal Marginal Relevance): diversify the candidate keywords so that a small set of words can represent each topic.
  • Use the intertopic distance map to see how the topics relate to one another.
  • Use a similarity matrix (heatmap) and a dendrogram (hierarchical map) to visualize the topics and their keywords.
  • Track the traction of each topic over time. Some topics may become irrelevant, while the traction of others may be increasing or decreasing.
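
The steps above can be sketched with BERTopic's Python API. This is a minimal sketch, not a tuned configuration: it assumes a recent BERTopic release with `sentence-transformers`, `umap-learn`, `hdbscan` and `scikit-learn` installed, and the embedding model name, the hyperparameter values and the two helper function names are illustrative choices rather than anything prescribed by this post.

```python
def build_topic_model(min_topic_size=10, random_state=42):
    """Assemble a BERTopic model with each pipeline stage made explicit."""
    # Heavy dependencies are imported here so they load only when a model is built.
    from bertopic import BERTopic
    from bertopic.representation import MaximalMarginalRelevance
    from hdbscan import HDBSCAN
    from sentence_transformers import SentenceTransformer
    from sklearn.feature_extraction.text import CountVectorizer
    from umap import UMAP

    # Step 1: sentence embeddings (model name is an assumption).
    embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    # Step 2: dimensionality reduction.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=random_state)
    # Step 3: density-based clustering of the reduced embeddings.
    hdbscan_model = HDBSCAN(min_cluster_size=min_topic_size, metric="euclidean",
                            cluster_selection_method="eom", prediction_data=True)
    # Step 4: token counts that feed the c-TF-IDF keyword extraction.
    vectorizer_model = CountVectorizer(stop_words="english")
    # Step 5: MMR to diversify the keywords of each topic.
    representation_model = MaximalMarginalRelevance(diversity=0.3)

    return BERTopic(
        embedding_model=embedding_model,
        umap_model=umap_model,
        hdbscan_model=hdbscan_model,
        vectorizer_model=vectorizer_model,
        representation_model=representation_model,
    )


def fit_and_visualize(docs, timestamps):
    """Fit the model on `docs` and build the visualizations listed above."""
    topic_model = build_topic_model()
    topics, probs = topic_model.fit_transform(docs)

    figures = {
        "intertopic_distance": topic_model.visualize_topics(),
        "heatmap": topic_model.visualize_heatmap(),
        "dendrogram": topic_model.visualize_hierarchy(),
    }
    # Topic traction over time needs one timestamp per document.
    over_time = topic_model.topics_over_time(docs, timestamps)
    figures["over_time"] = topic_model.visualize_topics_over_time(over_time)
    return topic_model, figures
```

Here `docs` is a list of strings and `timestamps` is a list of the same length. The visualizations generally need a corpus large enough to produce several topics, so a few hundred documents or more is a safer starting point than a handful.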

Installation

# Install BERTopic, together with sentence-transformers, from PyPI:

pip install bertopic

# If you want to install BERTopic with other embedding models, you can choose one or more of the following:

# Choose an embedding backend
pip install "bertopic[flair,gensim,spacy,use]"

# Topic modeling with images
pip install "bertopic[vision]"

Supported Topic Modelling Techniques

BERTopic supports a wide range of topic modeling techniques, listed below.

  • Guided
  • Supervised
  • Semi-supervised
  • Manual
  • Multi-topic distributions
  • Hierarchical
  • Class-based
  • Dynamic
  • Online/Incremental
  • Multimodal
  • Multi-aspect
  • Text Generation/LLM
  • Merge Models
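
A few of these techniques map directly onto arguments and methods of a `BERTopic` model. The sketch below illustrates four of them (guided, hierarchical, class-based and multi-topic distributions) under the assumption of a recent BERTopic release; the function name `technique_examples` and its arguments are hypothetical and exist only for illustration.

```python
def technique_examples(docs, classes, seed_topic_list):
    """Run a few of the listed techniques on the same corpus.

    docs            -- list of document strings
    classes         -- one class label per document (for class-based modeling)
    seed_topic_list -- list of keyword lists used to guide the topics
    """
    # Imported lazily so the dependency loads only when the function is called.
    from bertopic import BERTopic

    # Guided: seed keywords nudge the model toward the topics you expect.
    topic_model = BERTopic(seed_topic_list=seed_topic_list)
    topics, _ = topic_model.fit_transform(docs)

    # Hierarchical: how topics merge into broader parent topics.
    hierarchy = topic_model.hierarchical_topics(docs)

    # Class-based: topic representations per class label.
    per_class = topic_model.topics_per_class(docs, classes=classes)

    # Multi-topic distributions: approximate topic mixture per document.
    topic_distr, _ = topic_model.approximate_distribution(docs)

    return {
        "topics": topics,
        "hierarchy": hierarchy,
        "per_class": per_class,
        "topic_distr": topic_distr,
    }
```

A seed list might look like `[["drug", "cancer", "health"], ["game", "team", "season"]]`, where each inner list names keywords for one expected topic. These keywords are made up for illustration and should be replaced with terms from your own domain.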

Related Resources

Tools in BERTopic

Best Topic Modeling Tool in BERTopic

BERTopic Model Building

Application

  • arXiv dataset (1.7M+ STEM papers)
  • Images/photographs
  • Historical Documents
  • News articles
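
For a text corpus such as the arXiv dataset, the documents first have to be collected into a list of strings. The sketch below assumes the metadata is stored as a JSON Lines file with `abstract` and `update_date` fields, as in the publicly distributed arXiv metadata snapshot; the function name and the `limit` parameter are hypothetical, and the field names should be checked against your copy of the data.

```python
import json


def load_arxiv_abstracts(path, limit=1000):
    """Read up to `limit` abstracts and their dates from a JSON Lines file."""
    docs, dates = [], []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if len(docs) >= limit:
                break
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Collapse the line breaks and extra spaces inside the abstract.
            abstract = " ".join(record.get("abstract", "").split())
            if not abstract:
                continue  # skip records without an abstract
            docs.append(abstract)
            dates.append(record.get("update_date", ""))
    return docs, dates
```

The returned `docs` list can be passed to a topic model's `fit_transform`, and `dates` can serve as the timestamps for tracking topics over time.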