Fine-Tune Your Topic Modeling Workflow with BERTopic

Topic modeling remains a critical tool in the AI and NLP toolbox. While large language models (LLMs) handle text exceptionally well, extracting high-level topics from massive datasets still requires dedicated topic modeling techniques. A typical workflow consists of four core steps: embedding, dimensionality reduction, clustering, and topic representation.

One of the most popular frameworks today is BERTopic, which simplifies each stage with modular components and an intuitive API. In this post, I’ll walk through practical adjustments you can make to improve clustering results and boost interpretability, based on hands-on experiments with the open-source 20 Newsgroups dataset, which is distributed under the Creative Commons Attribution 4.0 International license.

Project Overview

We’ll start with the default settings recommended in BERTopic’s documentation and progressively update specific configurations to highlight their effects. Along the way, I’ll explain the purpose of each module and how to make informed decisions when customizing it.

Dataset Preparation

We load a sample of 500 news documents.

import random
from datasets import load_dataset
dataset = load_dataset("SetFit/20_newsgroups")
random.seed(42)
text_label = list(zip(dataset["train"]["text"], dataset["train"]["label_text"]))
text_label_500 = random.sample(text_label, 500)

Since the data originates from casual Usenet discussions, we apply cleaning steps to strip headers, remove clutter, and preserve only informative sentences.

This preprocessing ensures higher-quality embeddings and a smoother downstream clustering process.

import re

def clean_for_embedding(text, max_sentences=5):
    lines = text.split("\n")
    # Drop quoted replies
    lines = [line for line in lines if not line.strip().startswith(">")]
    # Drop header-like lines such as "From:" or "Subject:"
    lines = [line for line in lines if not re.match(
        r"^\s*(from|subject|organization|lines|writes|article)\s*:", line, re.IGNORECASE)]
    text = " ".join(lines)
    # Collapse whitespace and remove runs of three or more "!" or "?"
    text = re.sub(r"\s+", " ", text).strip()
    text = re.sub(r"[!?]{3,}", "", text)
    sentence_split = re.split(r'(?<=[.!?]) +', text)
    # Keep sentences longer than 15 characters that are not all caps
    sentence_split = [
        s for s in sentence_split
        if len(s.strip()) > 15 and not s.strip().isupper()
    ]
    return " ".join(sentence_split[:max_sentences])
texts_clean = [clean_for_embedding(text) for text,_ in text_label_500]
labels = [label for _, label in text_label_500]
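
Aggressive cleaning can leave some posts empty or very short, and such documents carry little signal for the embedding model. The following is a minimal sketch of a sanity check, assuming the cleaned texts are plain Python strings. The helper name `cleaning_report` and the 20-character threshold are illustrative choices, not part of BERTopic.

```python
from typing import Dict, List


def cleaning_report(cleaned: List[str], min_chars: int = 20) -> Dict[str, float]:
    """Summarize how many documents still have usable content after cleaning."""
    total = len(cleaned)
    empty = sum(1 for text in cleaned if not text.strip())
    short = sum(1 for text in cleaned if 0 < len(text.strip()) < min_chars)
    avg_len = sum(len(text) for text in cleaned) / total if total else 0.0
    return {
        "documents": total,
        "empty": empty,
        "short": short,
        "avg_chars": round(avg_len, 1),
    }


# Toy input standing in for `texts_clean`
sample = [
    "",
    "Too short.",
    "The new graphics driver fixes the flickering issue on my monitor.",
]
report = cleaning_report(sample)
print(report)
```

If the report shows many empty or short entries, it may be worth relaxing the filters or dropping those documents before embedding.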

Initial BERTopic Pipeline

Using BERTopic’s modular design, we configure each component: SentenceTransformer for embeddings, UMAP for dimensionality reduction, HDBSCAN for clustering, and CountVectorizer + KeyBERT for topic representation.

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.representation import KeyBERTInspired

# Step 1 - Extract embeddings
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

# Step 2 - Reduce dimensionality
umap_model = UMAP(n_neighbors=10, n_components=5, min_dist=0.0, metric='cosine', random_state=42)

# Step 3 - Cluster reduced embeddings
hdbscan_model = HDBSCAN(min_cluster_size=15, metric='euclidean', cluster_selection_method='eom', prediction_data=True)

# Step 4 - Tokenize topics
vectorizer_model = CountVectorizer(stop_words="english")

# Step 5 - Create topic representation
ctfidf_model = ClassTfidfTransformer()

# Step 6 - (Optional) Fine-tune topic representations with
# a `bertopic.representation` model
representation_model = KeyBERTInspired()

# All steps collectively
topic_model = BERTopic(
  embedding_model=embedding_model,          # Step 1 - Extract embeddings
  umap_model=umap_model,                    # Step 2 - Reduce dimensionality
  hdbscan_model=hdbscan_model,              # Step 3 - Cluster reduced embeddings
  vectorizer_model=vectorizer_model,        # Step 4 - Tokenize topics
  ctfidf_model=ctfidf_model,                # Step 5 - Extract topic words
  representation_model=representation_model # Step 6 - (Optional) Fine-tune topic representations
)
topics, probs = topic_model.fit_transform(texts_clean)

This setup yields only a few broad topics with noisy representations, which highlights the need for fine-tuning to achieve more coherent results.

Original discovered topics (Image generated by author)

Parameter Tuning for Granular Topics

n_neighbors from UMAP module

UMAP is the dimensionality reduction module that projects the original embeddings onto lower-dimensional dense vectors. By adjusting UMAP’s n_neighbors, we control how locally or globally the data is interpreted during dimensionality reduction. Lowering this value uncovers finer-grained clusters and improves topic distinctiveness.

umap_model_new = UMAP(n_neighbors=5, n_components=5, min_dist=0.0, metric='cosine', random_state=42)
topic_model.umap_model = umap_model_new
topics, probs = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

Topics discovered after setting UMAP’s n_neighbors parameter (Image generated by author)

min_cluster_size and cluster_selection_method from HDBSCAN module

HDBSCAN is the default clustering module in BERTopic. Lowering HDBSCAN’s min_cluster_size and switching the cluster_selection_method from “eom” to “leaf” further sharpens topic resolution. These settings help uncover smaller, more focused themes and balance the distribution across clusters.

hdbscan_model_leaf = HDBSCAN(min_cluster_size=5, metric='euclidean', cluster_selection_method='leaf', prediction_data=True)
topic_model.hdbscan_model = hdbscan_model_leaf
topics, _ = topic_model.fit_transform(texts_clean)
topic_model.get_topic_info()

The number of clusters increases to 30 when cluster_selection_method is set to leaf and min_cluster_size to 5.
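
To compare runs beyond the raw cluster count, it helps to look at how documents are distributed across topics. Below is a small sketch, assuming `topics` is the list of per-document topic IDs returned by `fit_transform`, with -1 marking outliers. The function name `summarize_assignments` and the chosen statistics are hypothetical, not a BERTopic feature.

```python
from collections import Counter
from typing import Dict, List


def summarize_assignments(topics: List[int]) -> Dict[str, float]:
    """Describe a topic assignment: topic count, outlier share, size balance."""
    counts = Counter(topics)
    n_docs = len(topics)
    outliers = counts.pop(-1, 0)  # -1 is the outlier bucket
    sizes = sorted(counts.values(), reverse=True)
    return {
        "n_topics": len(sizes),
        "outlier_share": outliers / n_docs if n_docs else 0.0,
        "largest_topic_share": sizes[0] / n_docs if sizes else 0.0,
        "smallest_topic_size": sizes[-1] if sizes else 0,
    }


# Toy assignment standing in for the `topics` list of a real run
example_topics = [0, 0, 0, 1, 1, -1, 2, 2, -1, 0]
summary = summarize_assignments(example_topics)
print(summary)
```

A high outlier share or a single dominant topic can indicate that the settings are still too coarse for the dataset.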

Controlling Randomness for Reproducibility

UMAP is inherently non-deterministic, meaning it can produce different results on each run unless you explicitly set a fixed random_state. This detail is often omitted in example code, so be sure to include it to ensure reproducibility.

Similarly, if you’re using a third-party embedding API (like OpenAI), be cautious. Some APIs introduce slight variations on repeated calls. For reproducible outputs, cache embeddings and feed them directly into BERTopic.

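from bertopic.backend import BaseEmbedder
import numpy as np

class CustomEmbedder(BaseEmbedder):
    """Lightweight wrapper to call NVIDIA's embedding endpoint via the OpenAI SDK."""

    def __init__(self, embedding_model, client):
        super().__init__()
        self.embedding_model = embedding_model
        self.client = client

    def encode(self, documents):  # type: ignore[override]
        response = self.client.embeddings.create(
            input=documents,
            model=self.embedding_model,
            encoding_format="float",
            extra_body={"input_type": "passage", "truncate": "NONE"},
        )
        embeddings = np.array([embed.embedding for embed in response.data])
        return embeddings

    def embed(self, documents, verbose=False):
        # BERTopic calls embed() on BaseEmbedder subclasses, so delegate to encode()
        return self.encode(documents)

# `client` (an OpenAI-compatible SDK client) and `embedding_model_name` (the
# endpoint's model ID) are placeholders that you must define for your own setup.
embedder = CustomEmbedder(embedding_model=embedding_model_name, client=client)
# Embed once, then reuse the same array on every run
embeddings = embedder.encode(texts_clean)
topic_model.embedding_model = embedder
topics, probs = topic_model.fit_transform(texts_clean, embeddings=embeddings)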
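
The caching advice above can be sketched with the standard library alone. This is one possible approach under stated assumptions: the helper names, the pickle format, and the `embedding_cache` directory are all illustrative, and `compute_fn` stands for any callable that maps a list of texts to embeddings (for example, the `encode` method of the embedder class above).

```python
import hashlib
import pickle
from pathlib import Path
from typing import Callable, Sequence


def embedding_cache_key(texts: Sequence[str], model_name: str) -> str:
    """Hash the model name and every text so any change yields a new key."""
    digest = hashlib.sha256(model_name.encode("utf-8"))
    for text in texts:
        digest.update(b"\x00")  # separator so ["ab", "c"] differs from ["a", "bc"]
        digest.update(text.encode("utf-8"))
    return digest.hexdigest()


def load_or_compute_embeddings(
    texts: Sequence[str],
    model_name: str,
    compute_fn: Callable,
    cache_dir: str = "embedding_cache",
):
    """Return cached embeddings if present; otherwise compute and store them."""
    path = Path(cache_dir) / f"{embedding_cache_key(texts, model_name)}.pkl"
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    embeddings = compute_fn(list(texts))
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(embeddings, handle)
    return embeddings
```

The returned array can then be passed through the `embeddings=` argument of `fit_transform`, so repeated experiments reuse identical vectors instead of calling the API again.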

Every dataset domain may require different clustering settings for optimal results. To streamline experimentation, consider defining evaluation criteria and automating the tuning process. For this tutorial, we’ll use the cluster configuration that sets n_neighbors to 5, min_cluster_size to 5, and cluster_selection_method to “eom”. This combination strikes a balance between granularity and coherence.
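
One way to automate the tuning is a plain grid search over the clustering parameters. The sketch below is only an outline: `grid_search`, `fit_fn`, and the `outlier_share` criterion are hypothetical names, and the criterion itself is an assumption that you should replace with whatever evaluation measure suits your domain.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def outlier_share(topics: List[int]) -> float:
    """Example criterion: fraction of documents left unassigned (-1). Lower is better."""
    return topics.count(-1) / len(topics) if topics else 1.0


def grid_search(
    fit_fn: Callable[..., List[int]],
    grid: Dict[str, Sequence],
    score_fn: Callable[[List[int]], float] = outlier_share,
) -> List[Tuple[float, Dict[str, object]]]:
    """Fit once per parameter combination and return (score, params), best first."""
    names = list(grid)
    results = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        topics = list(fit_fn(**params))
        results.append((score_fn(topics), params))
    return sorted(results, key=lambda pair: pair[0])


# Hypothetical wiring with the objects used in this post:
# def fit_fn(n_neighbors, min_cluster_size, cluster_selection_method):
#     topic_model.umap_model = UMAP(n_neighbors=n_neighbors, n_components=5,
#                                   min_dist=0.0, metric="cosine", random_state=42)
#     topic_model.hdbscan_model = HDBSCAN(min_cluster_size=min_cluster_size,
#                                         metric="euclidean",
#                                         cluster_selection_method=cluster_selection_method,
#                                         prediction_data=True)
#     topics, _ = topic_model.fit_transform(texts_clean)
#     return topics
```

Because every combination triggers a full fit, keeping the grid small and reusing cached embeddings keeps the search affordable.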

Enhancing Topic Representations

Representation plays a crucial role in making clusters interpretable. By default, BERTopic generates unigram-based representations, which often lack sufficient context. In the next section, we’ll explore several techniques to enrich these representations and improve topic interpretability.

Ngram

n-gram range

In BERTopic, CountVectorizer is the default tool for converting text data into bag-of-words representations. Instead of relying on generic unigrams, switch to bigrams or trigrams using ngram_range in CountVectorizer. This simple change adds much-needed context.

Since we’re only updating the representation, BERTopic offers the update_topics function to avoid redoing the modeling all over again.

topic_model.update_topics(texts_clean, vectorizer_model=CountVectorizer(stop_words="english", ngram_range=(2,3)))
topic_model.get_topic_info()
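
To make the effect of ngram_range=(2,3) concrete, here is a pure-Python approximation of how scikit-learn’s word analyzer builds n-grams, as I understand it: stop words are dropped first, and n-grams are then formed from the remaining tokens. The function `word_ngrams` is an illustrative stand-in, not the library’s implementation.

```python
from typing import FrozenSet, Iterable, List, Tuple


def word_ngrams(
    tokens: Iterable[str],
    ngram_range: Tuple[int, int] = (1, 1),
    stop_words: FrozenSet[str] = frozenset(),
) -> List[str]:
    """Build contiguous word n-grams after removing stop words."""
    kept = [token for token in tokens if token not in stop_words]
    min_n, max_n = ngram_range
    ngrams = []
    for n in range(min_n, min(max_n, len(kept)) + 1):
        for start in range(len(kept) - n + 1):
            ngrams.append(" ".join(kept[start:start + n]))
    return ngrams


tokens = "the driver for the graphics card".split()
result = word_ngrams(tokens, ngram_range=(2, 3), stop_words=frozenset({"the", "for"}))
print(result)
# → ['driver graphics', 'graphics card', 'driver graphics card']
```

Note that “driver graphics” never appeared as adjacent words in the text. Because stop words are removed before the n-grams are built, some bigrams join words that were originally separated, which is one source of hard-to-read keywords.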

Custom Tokenizer

Some bigrams are still hard to interpret, e.g. 486dx 50, ac uk, dxf doc. For better control, implement a custom tokenizer that filters n-grams based on part-of-speech patterns. This removes meaningless combinations and elevates the quality of your topic keywords.

import spacy
from typing import List

class ImprovedTokenizer:
    def __init__(self):
        self.nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])
        # Keep only the most meaningful syntactic bigram patterns
        self.MEANINGFUL_BIGRAMS = {
            ("ADJ", "NOUN"),
            ("NOUN", "NOUN"),
            ("VERB", "NOUN"),
        }

    def __call__(self, text: str, max_tokens=200) -> List[str]:
        doc = self.nlp(text[:3000])  # truncate long docs for speed
        tokens = [(t.text, t.lemma_.lower(), t.pos_) for t in doc if t.is_alpha]

        bigrams = []
        for i in range(len(tokens) - 1):
            word1, lemma1, pos1 = tokens[i]
            word2, lemma2, pos2 = tokens[i + 1]
            if (pos1, pos2) in self.MEANINGFUL_BIGRAMS:
                # Join the lowercased lemmas so inflected variants map to one bigram
                bigrams.append(f"{lemma1} {lemma2}")

        return bigrams

topic_model.update_topics(docs=texts_clean, vectorizer_model=CountVectorizer(tokenizer=ImprovedTokenizer()))
topic_model.get_topic_info()

LLM

Finally, you can integrate LLMs to generate coherent titles or summaries for each topic. BERTopic supports OpenAI integration directly or through custom prompting. These LLM-based summaries drastically improve explainability.

import os
import openai
from bertopic.representation import OpenAI

client = openai.OpenAI(api_key=os.environ["OPENAI_API_KEY"])
topic_model.update_topics(texts_clean, representation_model=OpenAI(client, model="gpt-4o-mini", delay_in_seconds=5))
topic_model.get_topic_info()

The representations are now all meaningful sentences.

You can also write your own function to get the LLM-generated title and write it back to the topic model object with the set_topic_labels function. Please refer to the example code snippet below.

import os
import openai
from typing import Dict, List, Tuple
from tqdm import tqdm

def generate_topic_titles_with_llm(
    topic_model,
    docs: List[str],
    api_key: str,
    model: str = "gpt-4o"
) -> Dict[int, Tuple[str, str]]:
    client = openai.OpenAI(api_key=api_key)
    topic_info = topic_model.get_topic_info()
    topic_repr = {}
    # Skip the outlier topic (-1)
    topics = topic_info[topic_info.Topic != -1].Topic.tolist()

    for topic in tqdm(topics, desc="Generating titles"):
        indices = [i for i, t in enumerate(topic_model.topics_) if t == topic]
        if not indices:
            continue
        # Use the first document assigned to the topic as its example text
        top_doc = docs[indices[0]]

        prompt = f'''You are a helpful summarizer for topic clustering.
Given the following text that represents a topic, generate:
1. A short **title** for the topic (2–6 words)
2. A one or two sentence **summary** of the topic.
Text:
"""
{top_doc}
"""
'''

        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": "You are a helpful assistant for summarizing topics."},
                    {"role": "user", "content": prompt}
                ],
                temperature=0.5
            )
            output = response.choices[0].message.content.strip()
            lines = output.split("\n")
            title = lines[0].replace("Title:", "").strip()
            summary = lines[1].replace("Summary:", "").strip() if len(lines) > 1 else ""
            topic_repr[topic] = (title, summary)
        except Exception as e:
            print(f"Error with topic {topic}: {e}")
            topic_repr[topic] = ("[Error]", str(e))

    return topic_repr

topic_repr = generate_topic_titles_with_llm(topic_model, texts_clean, os.environ["OPENAI_API_KEY"])
# set_topic_labels expects one string per topic, so keep only the title
topic_repr_dict = {
    topic: topic_repr.get(topic, ("Topic", ""))[0]
    for topic in topic_model.get_topic_info()["Topic"]
}
topic_model.set_topic_labels(topic_repr_dict)
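
The title and summary parsing in the function above assumes the model answers in exactly two lines with plain “Title:” and “Summary:” prefixes. LLM output often deviates from that (numbered items, markdown bold, blank lines), so a more tolerant parser can help. The following is a sketch under those assumptions; `parse_title_summary` is a hypothetical helper, and the patterns it accepts are guesses about typical responses rather than a guaranteed format.

```python
import re
from typing import Tuple

# Matches optional numbering and markdown emphasis around a "Title"/"Summary" label
_LABEL = re.compile(
    r"^\s*(?:\d+[.)]\s*)?[*_#\s]*(title|summary)[*_\s]*[:\-]\s*[*_\s]*",
    re.IGNORECASE,
)


def parse_title_summary(output: str) -> Tuple[str, str]:
    """Extract (title, summary) from loosely formatted LLM output."""
    title, summary = "", ""
    unlabeled = []
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines between items
        match = _LABEL.match(line)
        if match:
            value = line[match.end():].strip(" *_")
            if match.group(1).lower() == "title":
                title = title or value
            else:
                summary = summary or value
        else:
            unlabeled.append(line.strip(" *_"))
    # Fall back to positional parsing for anything that carried no label
    if not title and unlabeled:
        title = unlabeled.pop(0)
    if not summary and unlabeled:
        summary = " ".join(unlabeled)
    return title, summary


print(parse_title_summary("1. **Title:** Space Missions\n\n2. **Summary:** Posts about NASA launches."))
# → ('Space Missions', 'Posts about NASA launches.')
```

Asking the model for a fixed output format in the prompt reduces the need for this kind of cleanup, but a tolerant parser is a cheap safeguard.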

Conclusion

This guide outlined actionable strategies to improve topic modeling results with BERTopic. By understanding the role of each module and tuning parameters for your specific domain, you can achieve more focused, stable, and interpretable topics.

Representation matters just as much as clustering. Whether it’s through n-grams, syntactic filtering, or LLMs, investing in better representations makes your topics easier to understand and more useful in practice.

BERTopic also offers advanced modeling techniques beyond the basics covered here. In a future post, we’ll explore these capabilities in depth. Stay tuned!
