has a wide range of use cases in the natural language processing (NLP) field, such as document tagging, survey analysis, and content organization. It is an unsupervised learning technique, which makes it cost-effective because it reduces the resources required to collect human-annotated data. We’ll dive deeper into BERTopic, a popular Python library for transformer-based topic modeling, to help us process financial news faster and reveal how trending topics change over time.
BERTopic consists of 6 core modules that can be customized to suit different use cases. In this article, we’ll examine and experiment with each module individually and explore how they work together to produce the end results.
At a high level, a typical BERTopic architecture consists of:
- Embeddings: transform text into vector representations (i.e. embeddings) that capture semantic meaning using sentence-transformer models.
- Dimensionality Reduction: reduce the high-dimensional embeddings to a lower-dimensional space while preserving important relationships, using techniques such as PCA or UMAP.
- Clustering: group similar documents together based on their reduced-dimensionality embeddings to form distinct topics, using algorithms such as HDBSCAN or K-Means.
- Vectorizers: after topic clusters are formed, vectorizers convert text into numerical features that can be used for topic analysis, for example the count vectorizer or the online vectorizer.
- c-TF-IDF: calculate importance scores for words within and across topic clusters to identify key terms.
- Representation Model: leverage semantic similarity between the embeddings of candidate keywords and the embeddings of documents to find the most representative topic keywords, using KeyBERT or LLM-based techniques, among others.
Project Overview
In this practical application, we’ll use topic modeling to identify trending topics in Apple financial news. Using NewsAPI, we collect daily top-ranked Apple stock news from Google Search and compile them into a dataset of 250 documents, with each document containing financial news for one specific day. However, this isn’t the main focus of this article, so feel free to replace it with your own dataset. The objective is to demonstrate how to transform raw text documents containing top Google search results into meaningful topic keywords and refine those keywords to be more representative.
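To make the dataset shape concrete, here is a minimal sketch of compiling article records into one document per day. The `articles` records, their `date` and `text` fields, and the helper `build_daily_docs` are hypothetical stand-ins for whatever your news source returns; no NewsAPI call is made here.

```python
from collections import defaultdict

# Hypothetical article records; in practice these would come from a news API.
articles = [
    {"date": "2024-10-01", "text": "Apple stock opens higher."},
    {"date": "2024-10-01", "text": "Analysts discuss AAPL outlook."},
    {"date": "2024-10-02", "text": "Apple announces a product event."},
]

def build_daily_docs(records):
    """Concatenate all article texts published on the same day into one document."""
    by_day = defaultdict(list)
    for record in records:
        by_day[record["date"]].append(record["text"])
    # Sort by date so the documents follow chronological order.
    days = sorted(by_day)
    return days, [" ".join(by_day[day]) for day in days]

days, docs = build_daily_docs(articles)
print(len(docs))  # one document per distinct day
```

Keeping the `days` list alongside `docs` makes it possible to relate topics back to dates later on.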

BERTopic’s 6 Fundamental Modules
1. Embeddings

BERTopic uses sentence transformer models as its first building block, converting sentences into dense vector representations (i.e. embeddings) that capture semantic meaning. These models are based on transformer architectures like BERT and are specifically trained to produce high-quality sentence embeddings. We then compute the semantic similarity between sentences using the cosine distance between their embeddings. Common models include:
- all-MiniLM-L6-v2: lightweight, fast, good general performance
- BAAI/bge-base-en-v1.5: larger model with strong semantic understanding, at the cost of much slower training and inference.
There is a huge range of pre-trained sentence transformers to choose from on the “Sentence Transformer” website and the Hugging Face model hub. We can use a few lines of code to load a sentence transformer model and encode the text sequences into high-dimensional numerical embeddings.
from sentence_transformers import SentenceTransformer
# Initialize the model
model = SentenceTransformer("all-MiniLM-L6-v2")
# Convert sentences to embeddings
sentences = ["First sentence", "Second sentence"]
embeddings = model.encode(sentences)  # Returns a NumPy array of embeddings
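Once sentences are encoded, comparing them comes down to comparing vectors. The following is a minimal pure-Python sketch of cosine similarity; the 3-dimensional vectors are toy values standing in for real sentence embeddings, and the helper name `cosine_similarity` is our own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-dimensional vectors standing in for real sentence embeddings.
emb_a = [1.0, 2.0, 0.0]
emb_b = [2.0, 4.0, 0.0]   # same direction as emb_a
emb_c = [0.0, 0.0, 3.0]   # orthogonal to emb_a

sim_ab = cosine_similarity(emb_a, emb_b)  # close to 1: very similar
sim_ac = cosine_similarity(emb_a, emb_c)  # 0: unrelated
```

Cosine distance is simply one minus this similarity, so vectors pointing the same way have a distance near zero regardless of their length.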
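In this instance, we input a collection of financial news data from October 2024 to March 2025 into the sentence transformer “bge-base-en-v1.5”. As shown in the result below, these text documents are transformed into vector embeddings with a shape of 250 rows, each with 384 dimensions. Note that 384 is the native output size of all-MiniLM-L6-v2, while bge-base-en-v1.5 natively produces 768-dimensional embeddings, so check embeddings.shape for the model you actually load.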

We can then feed this sentence transformer to the BERTopic pipeline and keep all other modules at their default settings.
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic
# docs is the list of daily news documents (one string per day)
emb_minilm = SentenceTransformer("all-MiniLM-L6-v2")
topic_model = BERTopic(
    embedding_model=emb_minilm,
)
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()
As the end result, we get the following topic representation.

In comparison, the larger and more powerful “bge-base-en-v1.5” model gives the following result, which is slightly more meaningful than that of the smaller “all-MiniLM-L6-v2” model but still leaves considerable room for improvement.

One area for improvement is reducing the dimensionality, because sentence transformers typically produce high-dimensional embeddings. As BERTopic relies on comparing the spatial proximity between embeddings to form meaningful clusters, it is crucial to apply a dimensionality reduction technique to make the embeddings less sparse. Therefore, we are going to introduce various dimensionality reduction techniques in the next section.
2. Dimensionality Reduction

After converting the financial news documents into embeddings, we face the problem of high dimensionality. Since each embedding contains 384 dimensions, the vector space becomes too sparse to produce meaningful distance measurements between two embeddings. Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP) are common techniques for reducing dimensionality while preserving as much of the structure in the data as possible (PCA does so by retaining the maximum variance). We’ll look at UMAP, BERTopic’s default dimensionality reduction technique, in more detail. It is a non-linear algorithm adopted from topology analysis that seeks manifold structure within the data. It works by extending a radius outwards from each data point and connecting each point with its close neighbors. You can dive deeper into UMAP visualization on the website “Understanding UMAP”.
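To see why distances become less informative in high dimensions, the sketch below compares how much the nearest and farthest neighbors of a point differ in 2 versus 384 dimensions. It uses uniformly random points rather than real embeddings, and `relative_contrast` is a helper name introduced purely for illustration.

```python
import math
import random

def relative_contrast(dim, n_points=100, seed=0):
    """(max - min) / min of distances from one random point to all the others."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)

contrast_2d = relative_contrast(dim=2)
contrast_384d = relative_contrast(dim=384)

# In high dimensions the nearest and farthest points are almost equally far away.
print(contrast_2d > contrast_384d)
```

When every point is roughly the same distance from every other point, density-based clustering has little signal to work with, which is the motivation for reducing dimensionality first.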
UMAP n_neighbors Experimentation
An important UMAP parameter is n_neighbors, which controls how UMAP balances local and global structure in the data. Low values of n_neighbors force UMAP to focus on local structure, while large values make it look at larger neighborhoods around each point.
The diagram below shows several scatterplots demonstrating the effect of different n_neighbors values, with each plot visualizing the embeddings in a 2-dimensional space after applying UMAP dimensionality reduction.
With smaller n_neighbors values (e.g. n=2, n=5), the plots show more tightly coupled micro clusters, indicating a focus on local structure. As n_neighbors increases (towards n=100, n=150), the points form more cohesive global patterns, demonstrating how larger neighborhood sizes help UMAP capture broader relationships in the data.
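The local-versus-global trade-off can be illustrated with a plain k-nearest-neighbor graph, which is the kind of structure UMAP starts from. This is a toy sketch on 1-D points, not UMAP itself; `knn_graph` and `count_components` are hypothetical helper names.

```python
def knn_graph(points, k):
    """Undirected k-nearest-neighbor graph over 1-D points, as an adjacency dict."""
    graph = {i: set() for i in range(len(points))}
    for i, p in enumerate(points):
        # Rank the other points by distance to p (ties broken by index).
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (abs(points[j] - p), j),
        )
        for j in others[:k]:
            graph[i].add(j)
            graph[j].add(i)
    return graph

def count_components(graph):
    """Count connected components with an iterative depth-first search."""
    seen, components = set(), 0
    for start in graph:
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
    return components

# Two well-separated groups of points on a line.
points = [0, 1, 2, 10, 11, 12]
local_view = count_components(knn_graph(points, k=2))   # groups stay separate
global_view = count_components(knn_graph(points, k=3))  # groups become linked
```

With a small neighborhood the two groups form separate islands, while a slightly larger one already connects them, mirroring how larger n_neighbors values pull UMAP toward global structure.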

UMAP min_dist Experimentation
The min_dist parameter in UMAP controls how tightly points are allowed to be packed together in the lower-dimensional representation. It sets the minimum distance between points in the embedding space. A smaller min_dist allows points to be packed very closely together, while a larger min_dist forces points to be more scattered and evenly spread out. The diagram below shows an experiment varying min_dist from 0.0001 to 1 with n_neighbors=5.
When min_dist is set to smaller values, UMAP emphasizes preserving local structure, while larger values spread the embeddings into a circular shape.

We decide to set n_neighbors=5 and min_dist=0.01 based on the hyperparameter tuning results, as this forms more distinct data clusters that are easier for the subsequent clustering model to process.
import umap
UMAP_N = 5
UMAP_DIST = 0.01
umap_model = umap.UMAP(
    n_neighbors=UMAP_N,
    min_dist=UMAP_DIST,
    random_state=0
)
3. Clustering

Following the dimensionality reduction module is the process of grouping embeddings in close proximity into clusters. This process is fundamental to topic modeling, as it groups similar text documents together by looking at their semantic relationships. BERTopic employs the HDBSCAN model by default, which has the advantage of capturing structures with diverse densities. Additionally, BERTopic provides the flexibility to choose other clustering models based on the nature of the dataset, such as K-Means (for spherical, equally sized clusters) or agglomerative clustering (for hierarchical clusters).
HDBSCAN Experimentation
We’ll explore how two important parameters, min_cluster_size and min_samples, impact the behavior of the HDBSCAN model.
min_cluster_size determines the minimum number of data points allowed to form a cluster; clusters not meeting the threshold are treated as outliers. If min_cluster_size is set too low, you might get many small, unstable clusters that may just be noise. If it is set too high, you might merge several clusters into one, losing their distinct characteristics.
min_samples sets the k used when measuring the distance between a point and its k-th nearest neighbor, determining how strict the cluster formation process is. The larger the min_samples value, the more conservative the clustering becomes, as clusters are restricted to forming in dense areas and sparse points are classified as noise.
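The k-th nearest neighbor distance mentioned above can be sketched in a few lines. This is a simplified illustration on 1-D points, assuming the point itself is excluded from the neighbor count (library conventions differ on this); `core_distance` is a name we introduce here, not an HDBSCAN API.

```python
def core_distance(points, index, k):
    """Distance from points[index] to its k-th nearest neighbor (self excluded)."""
    dists = sorted(
        abs(points[index] - q) for j, q in enumerate(points) if j != index
    )
    return dists[k - 1]

# Four tightly packed points and one isolated point on a line.
points = [0, 1, 2, 3, 50]

dense_core = core_distance(points, index=1, k=2)    # point at 1, inside the dense group
sparse_core = core_distance(points, index=4, k=2)   # isolated point at 50
```

Points in dense regions get a small value while isolated points get a large one, and raising k inflates the value for sparse points further, which is why a larger min_samples pushes more points into the noise category.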
A condensed tree is a useful tool for choosing appropriate values of these two parameters. Clusters that persist over a wide range of lambda values (shown on the left vertical axis of a condensed tree plot) are considered stable and more meaningful. We prefer the selected clusters to be both tall (more stable) and wide (large cluster size). We use condensed_tree_ from HDBSCAN to compare min_cluster_size values from 3 to 50, then visualize the data points in their vector space, color coded by the predicted cluster labels. As we progress through different min_cluster_size values, we can identify optimal values that group close data points together.
In this experiment, we selected min_cluster_size=15, as it generates 4 clusters (highlighted in purple in the condensed tree plot below) with good stability and cluster size. Additionally, the scatterplot indicates reasonable cluster formation based on proximity and density.
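The preference for tall and wide clusters can be expressed numerically. The following is a minimal sketch, assuming the usual HDBSCAN notion of stability as the sum over member points of the lambda at which each point leaves the cluster minus the lambda at which the cluster is born. The lambda values are made up for illustration and `cluster_stability` is a hypothetical helper.

```python
def cluster_stability(lambda_birth, lambda_leave):
    """Sum over member points of (lambda when the point leaves - lambda at cluster birth)."""
    return sum(lam - lambda_birth for lam in lambda_leave)

# Hypothetical lambda values (lambda = 1 / distance) for two candidate clusters.
# Tall and wide: 20 points that stay in the cluster over a long lambda range.
tall_and_wide = cluster_stability(lambda_birth=0.5, lambda_leave=[2.5] * 20)
# Short and narrow: 5 points that drop out soon after the cluster appears.
short_and_narrow = cluster_stability(lambda_birth=0.5, lambda_leave=[1.0] * 5)
```

Height (how long points persist) and width (how many points there are) both increase the sum, which matches what we look for visually in the condensed tree plot.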

min_cluster_size Experimentation
We then carry out a similar exercise to compare min_samples values from 1 to 80 and selected min_samples=5. As you can observe from the visuals, the parameters min_samples and min_cluster_size exert distinct impacts on the clustering process.

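min_samples Experimentation
import hdbscan
from bertopic import BERTopic
from sentence_transformers import SentenceTransformer
MIN_CLUSTER_SIZE = 15
MIN_SAMPLES = 5
# hdbscan.HDBSCAN does not document a random_state parameter, so none is passed
clustering_model = hdbscan.HDBSCAN(
    min_cluster_size=MIN_CLUSTER_SIZE,
    metric='euclidean',
    cluster_selection_method='eom',
    min_samples=MIN_SAMPLES
)
# Larger embedding model compared earlier in the article
emb_bge = SentenceTransformer("BAAI/bge-base-en-v1.5")
topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model,
)
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()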
K-Means Experimentation
Compared to HDBSCAN, K-Means clustering lets us generate more granular topics by specifying the n_clusters parameter, thereby controlling the number of topics generated from the text documents.
This image shows a series of scatter plots demonstrating different clustering results when varying the number of clusters (n_clusters) from 3 to 50 using K-Means. With n_clusters=3, the data is divided into just three large groups. As n_clusters increases (5, 8, 10, etc.), the data points are split into more granular groupings. Overall, K-Means forms rounder clusters than HDBSCAN. We selected n_clusters=8, where the clusters are neither too broad (losing important distinctions) nor too granular (creating artificial divisions). Additionally, it is a suitable number of topics for categorizing 250 days of financial news. However, feel free to adjust the code snippet to your requirements if you need to identify more granular or broader topics.
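For readers who want to see what K-Means does under the hood, here is a bare-bones sketch of Lloyd's algorithm on toy 2-D points. It uses fixed starting centroids for reproducibility, whereas scikit-learn's KMeans uses a smarter initialization by default, so treat it as an illustration rather than a replacement.

```python
import math

def kmeans(points, centroids, n_iter=10):
    """Plain Lloyd's algorithm: assign points to the nearest centroid, then re-average."""
    centroids = list(centroids)  # copy so the caller's list is not modified
    labels = []
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Toy 2-D points standing in for UMAP-reduced embeddings.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
```

Because every point is always assigned to its nearest centroid, K-Means never labels anything as an outlier, which is one practical difference from HDBSCAN.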

n_clusters Experimentation
from sklearn.cluster import KMeans
N_CLUSTER = 8
clustering_model = KMeans(
    n_clusters=N_CLUSTER,
    random_state=0
)
# BERTopic accepts a scikit-learn style clustering model through hdbscan_model
topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model,
)
topics, probs = topic_model.fit_transform(docs)
topic_model.get_topic_info()
Comparing the topic cluster results of K-Means and HDBSCAN reveals that K-Means produces more distinct and meaningful topic representations. However, both methods still generate many stop words, indicating that subsequent modules are necessary to refine the topic representations.


4. Vectorizer

Previous modules serve the role of grouping documents into semantically similar clusters; starting from this module, the main focus is to fine-tune the topics by choosing more representative and meaningful keywords. BERTopic offers various vectorizer options, from the basic CountVectorizer to the more advanced OnlineCountVectorizer, which incrementally updates topic representations. For this exercise, we’ll experiment with CountVectorizer, a text processing tool that creates a matrix of token counts from a collection of documents. Each row in the matrix represents a document and each column represents a term from the vocabulary, with the values showing how many times each term appears in each document. This matrix representation allows machine learning algorithms to process the text data mathematically.
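The document-term matrix described above can be built by hand in a few lines. This sketch uses a simple whitespace tokenizer and toy documents; a real vectorizer applies a regex tokenizer and lowercasing, so the output is only meant to show the shape of the matrix.

```python
from collections import Counter

docs = [
    "apple stock rises",
    "apple unveils new iphone",
    "iphone sales lift apple stock",
]

# Tokenize with a simple whitespace split (real vectorizers use a regex tokenizer).
tokenized = [doc.split() for doc in docs]
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# One row per document, one column per vocabulary term.
matrix = []
for tokens in tokenized:
    counts = Counter(tokens)
    matrix.append([counts[term] for term in vocabulary])

for row in matrix:
    print(row)
```

Each column lines up with one entry of `vocabulary`, so the first column here counts occurrences of "apple" in every document.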
Vectorizer Experimentation
We’ll go through a few important parameters of CountVectorizer and see how they affect the topic representations.
ngram_range specifies how many words to combine into topic phrases. It is particularly useful for documents consisting of short phrases, which is not needed in this scenario.
Example output if we set ngram_range=(1, 3):
0 -1_apple nasdaq aapl_apple stock_apple nasdaq_nasdaq aapl
1 0_apple warren buffett_apple stock_berkshire hathaway_apple nasdaq aapl
2 1_apple nasdaq aapl_nasdaq aapl apple_apple stock_apple nasdaq
3 2_apple aapl stock_apple nasdaq aapl_apple stock_aapl stock
4 3_apple nasdaq aapl_cramer apple aapl_apple nasdaq_apple stock
stop_words determines whether stop words are removed from the topics, which significantly improves topic representations.
min_df and max_df determine the frequency thresholds for terms to be included in the vocabulary. min_df sets the minimum number of documents a term must appear in, while max_df sets the maximum document frequency above which terms are considered too common and discarded.
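The effect of these parameters can be reproduced with a small from-scratch sketch. It mimics the general behavior (stop words removed before n-grams are built, an integer min_df as a document count, a float max_df as a proportion) but it is not scikit-learn's implementation; `ngrams` and `filter_vocabulary` are hypothetical helpers and the documents are toy examples.

```python
def ngrams(tokens, n_min, n_max):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def filter_vocabulary(docs, stop_words, min_df, max_df, ngram_range=(1, 1)):
    """Keep terms whose document frequency lies within [min_df, max_df * n_docs]."""
    doc_terms = []
    for doc in docs:
        tokens = [t for t in doc.lower().split() if t not in stop_words]
        doc_terms.append(set(ngrams(tokens, *ngram_range)))
    vocabulary = set().union(*doc_terms)
    kept = []
    for term in sorted(vocabulary):
        df = sum(term in terms for terms in doc_terms)  # number of docs containing term
        if df >= min_df and df <= max_df * len(docs):
            kept.append(term)
    return kept

docs = [
    "Apple stock rises on the iPhone launch",
    "Apple stock falls on the tariff news",
    "Apple and the regulators",
    "Apple stock and the iPhone",
    "Apple earnings beat the forecast",
]
kept = filter_vocabulary(docs, stop_words={"the", "on", "and"}, min_df=2, max_df=0.8)
```

Here "apple" appears in every document and is dropped by max_df=0.8, while words that occur only once are dropped by min_df=2, leaving the terms that actually distinguish documents.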
We explore the effect of adding CountVectorizer with max_df=0.8 (i.e. ignoring words that appear in more than 80% of the documents) to both the HDBSCAN and K-Means models from the previous step.
from sklearn.feature_extraction.text import CountVectorizer
# Drop English stop words and terms found in more than 80% of documents
vectorizer_model = CountVectorizer(
    max_df=0.8,
    stop_words="english"
)
topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model,
    vectorizer_model=vectorizer_model
)
Both show improvements after introducing CountVectorizer, which significantly reduces keywords that appear frequently in all documents and bring no extra value, such as “aapl”, “stock”, and “apple”.


5. c-TF-IDF

While the Vectorizer module focuses on adjusting the topic representation at the document level, c-TF-IDF mainly looks at the cluster level to reduce terms that occur frequently across clusters. This is achieved by treating all documents belonging to one cluster as a single document and calculating keyword importance based on the traditional TF-IDF approach.
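As a rough illustration of the class-based idea, the sketch below scores each term as tf(term, class) * log(1 + A / f(term)), where A is the average number of words per class and f(term) is the term's frequency across all classes. We assume this is the weighting described in the BERTopic documentation; the term-frequency normalization and the helper `c_tf_idf` are simplifications of our own.

```python
import math
from collections import Counter

def c_tf_idf(class_docs):
    """Score terms per class: tf(term, class) * log(1 + A / f(term))."""
    class_counts = {c: Counter(text.split()) for c, text in class_docs.items()}
    # f(term): frequency of the term across all classes.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # A: average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)
    scores = {}
    for c, counts in class_counts.items():
        n_words = sum(counts.values())
        scores[c] = {
            term: (freq / n_words) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

# Each value is the concatenation of all documents assigned to that cluster.
class_docs = {
    0: "apple stock buffett berkshire apple stake",
    1: "apple stock iphone launch apple sales",
}
scores = c_tf_idf(class_docs)
top_term_cluster_0 = max(scores[0], key=scores[0].get)
```

Terms shared by both clusters, such as "apple" and "stock", are pushed down, while terms unique to one cluster rise to the top even when they occur less often.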
c-TF-IDF Experimentation
- reduce_frequent_words: determines whether to down-weight words that occur frequently across topics.
- bm25_weighting: when set to True, uses BM25 weighting instead of standard TF-IDF, which can better handle variations in document length. In smaller datasets, this variant can be more robust to stop words.
We use the following code snippet to add c-TF-IDF (with bm25_weighting=True) to our BERTopic pipeline.
from bertopic.vectorizers import ClassTfidfTransformer
ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model,
    vectorizer_model=vectorizer_model,
    ctfidf_model=ctfidf_model
)
The topic cluster outputs below show that adding c-TF-IDF has no major impact on the end results when CountVectorizer has already been added. This is likely because our CountVectorizer has already set a high bar by eliminating words appearing in more than 80% of the documents. This in turn already reduces overlapping vocabulary at the topic cluster level, which is what c-TF-IDF is intended to achieve.


However, if we replace CountVectorizer with c-TF-IDF, although the result below shows slight improvements compared to when neither is added, too many stop words are present, making the topic representations less valuable. Therefore, it appears that for the documents we are dealing with in this scenario, the c-TF-IDF module does not bring extra value.


6. Representation Model

The last module is the representation model, which we observed to have a significant impact on tuning the topic representations. Instead of using a frequency-based approach like the Vectorizer and c-TF-IDF, it leverages semantic similarity between the embeddings of candidate keywords and the embeddings of documents to find the most representative topic keywords. This can result in more semantically coherent topic representations and fewer near-synonymous keywords. BERTopic also offers various customization options for representation models, including but not limited to the following:
- KeyBERTInspired: employs the KeyBERT technique to extract topic words based on semantic similarity.
- ZeroShotClassification: takes advantage of open-source transformers in the Hugging Face model hub to assign labels to topics.
- MaximalMarginalRelevance: decreases synonyms in topics (e.g. stock and stocks).
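To show how semantic similarity can both rank keywords and suppress near-synonyms, here is a small sketch of greedy maximal marginal relevance. The 2-D vectors are toy values standing in for real keyword and topic embeddings, and `mmr` is our own simplified function, not BERTopic's implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mmr(topic_vec, candidates, top_n, diversity):
    """Greedy MMR: trade off relevance to the topic against similarity to picks so far."""
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < top_n:
        def score(word):
            relevance = cosine(remaining[word], topic_vec)
            redundancy = max(
                (cosine(remaining[word], candidates[s]) for s in selected), default=0.0
            )
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

# Toy 2-D vectors; real keyword and topic embeddings come from a sentence transformer.
topic_vec = [1.0, 0.2]
candidates = {
    "stock": [1.0, 0.0],
    "stocks": [1.0, 0.05],
    "buyback": [0.6, 0.8],
}
plain = mmr(topic_vec, candidates, top_n=2, diversity=0.0)    # pure relevance ranking
diverse = mmr(topic_vec, candidates, top_n=2, diversity=0.7)  # penalizes near-duplicates
```

With diversity set to 0 the ranking is purely similarity-based and keeps both "stocks" and "stock", whereas a higher diversity value swaps the near-duplicate for a less redundant keyword.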
KeyBERTInspired Experimentation
We found KeyBERTInspired to be a very cost-effective approach, as it significantly improves the end result by adding a few extra lines of code, without the need for extensive hyperparameter tuning.
from bertopic.representation import KeyBERTInspired
representation_model = KeyBERTInspired()
topic_model = BERTopic(
    embedding_model=emb_bge,
    umap_model=umap_model,
    hdbscan_model=clustering_model,
    vectorizer_model=vectorizer_model,
    representation_model=representation_model
)
After incorporating the KeyBERT-inspired representation model, we now observe that both models generate noticeably more coherent and valuable themes.


Take-Home Message
This article explores the BERTopic technique and implementation for topic modeling, detailing its six key modules with practical examples using Apple stock market news data to demonstrate each component’s impact on the quality of topic representations.
- Embeddings: use transformer-based embedding models to convert documents into numerical representations that capture semantic meaning and contextual relationships in text.
- Dimensionality Reduction: employ UMAP or other dimensionality reduction techniques to reduce high-dimensional embeddings while preserving both the local and global structure of the data.
- Clustering: compare the HDBSCAN (density-based) and K-Means (centroid-based) clustering algorithms to group similar documents into coherent topics.
- Vectorizers: use CountVectorizer to create document-term matrices and refine topics with a statistical approach.
- c-TF-IDF: update topic representations by analyzing term frequency at the cluster level (topic class) and reduce words that are common across different topics.
- Representation Model: refine topic keywords using semantic similarity, with options like KeyBERTInspired and MaximalMarginalRelevance for better topic descriptions.