Topic Modeling with BERT
The key steps in BERTopic modeling are as follows.
- Embed the documents (here, the sentences of the article) with a sentence-embedding model.
- Reduce the dimensionality of the embeddings using UMAP.
- Cluster the reduced embeddings using HDBSCAN.
- Use c-TF-IDF (class-based TF-IDF) to extract keywords for each cluster from their frequency within the cluster and their IDF across clusters.
- MMR (Maximal Marginal Relevance): select a small set of words that are relevant to the topic yet diverse, so that a few words can represent the topic.
- Inspect the intertopic distance map.
- Use a similarity matrix (heatmap) and a dendrogram (hierarchical view) to visualize the topics and their keywords.
- Track the traction of each topic over time. Some topics may become irrelevant, while for others traction may be increasing or decreasing.
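The c-TF-IDF and MMR steps above can be illustrated with a minimal pure-Python sketch. It is not BERTopic's own code: the function names (`c_tf_idf`, `top_keywords`, `mmr`), the toy clusters, and the two-dimensional "embeddings" are all assumptions made up for illustration. The weighting used is tf(t, c) * log(1 + A / f(t)), where A is the average number of words per cluster and f(t) is the frequency of term t across all clusters.

```python
import math
from collections import Counter


def c_tf_idf(clusters):
    """Score terms per cluster; `clusters` maps cluster id -> list of tokenized docs."""
    # Treat each cluster as one long document and count its terms.
    class_counts = {
        c: Counter(tok for doc in docs for tok in doc) for c, docs in clusters.items()
    }
    # Frequency of every term across all clusters.
    total = Counter()
    for counts in class_counts.values():
        total.update(counts)
    # A: average number of words per cluster.
    avg_words = sum(total.values()) / len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total[term])
            for term, freq in counts.items()
        }
    return scores


def top_keywords(scores, k=2):
    """Return the k highest-scoring terms per cluster (ties broken alphabetically)."""
    return {
        c: [t for t, _ in sorted(s.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for c, s in scores.items()
    }


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def mmr(topic_vec, word_vecs, top_n, diversity=0.3):
    """Greedily pick words that are relevant to the topic but not redundant."""
    selected, remaining = [], list(word_vecs)
    while remaining and len(selected) < top_n:
        def score(word):
            relevance = cosine(word_vecs[word], topic_vec)
            redundancy = max(
                (cosine(word_vecs[word], word_vecs[s]) for s in selected), default=0.0
            )
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): two clusters of tokenized documents.
clusters = {
    0: [["cat", "dog", "pet"], ["dog", "pet", "food"]],
    1: [["stock", "market", "price"], ["market", "price", "food"]],
}
scores = c_tf_idf(clusters)
keywords = top_keywords(scores, k=2)

# Toy 2-D "embeddings" (hypothetical): "dog" and "dogs" are near-duplicates.
word_vecs = {"dog": (1.0, 0.9), "dogs": (1.0, 0.88), "leash": (0.2, 1.0)}
relevant_only = mmr((1.0, 1.0), word_vecs, top_n=2, diversity=0.0)
diverse = mmr((1.0, 1.0), word_vecs, top_n=2, diversity=0.7)
print(keywords, relevant_only, diverse)
```

In this toy example, "food" appears in both clusters and therefore scores lower than cluster-specific words. With `diversity=0.0` MMR keeps the two near-duplicate words, while a higher `diversity` swaps the second one for a less redundant candidate.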
Installation
# Install BERTopic with the default sentence-transformers backend from PyPI:
pip install bertopic
# To use other embedding backends, install the corresponding extras (pick the ones you need):
pip install "bertopic[flair,gensim,spacy,use]"
# Topic modeling with images
pip install bertopic[vision]
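After installing, it can be useful to confirm which packages are actually importable in the current environment. The helper below is a small sketch using only the standard library; `available_backends` is a hypothetical name, and the module list is an assumption about the import names behind the extras above (the `use` extra is left out).

```python
from importlib.util import find_spec


def available_backends(
    modules=("bertopic", "sentence_transformers", "flair", "gensim", "spacy"),
):
    """Map each top-level module name to True if it can be imported here."""
    return {name: find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, found in available_backends().items():
        print(f"{name}: {'installed' if found else 'missing'}")
```

This only checks that a module can be found; it does not verify versions or that a backend works end to end.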
Supported Topic Modelling Techniques
BERTopic supports a wide range of topic modeling techniques, listed below.
- Guided
- Supervised
- Semi-supervised
- Manual
- Multi-topic distributions
- Hierarchical
- Class-based
- Dynamic
- Online/Incremental
- Multimodal
- Multi-aspect
- Text Generation/LLM
- Merge Models
Related Resources
- Advanced Topic Modeling with BERTopic by Pinecone
- BERTopic by spaCy
- BERTopic GitHub
- BERTopic by Hugging Face
Tools in BERTopic
Best Topic Modeling Tool in BERTopic
BERTopic Model Building
Application
- arXiv Dataset (1.7M+ STEM papers)
- Images/photographs
- Historical Documents
- News articles