
    Topic Model Labelling with LLMs | Towards Data Science

By Editor Times Featured | July 15, 2025 | 7 Mins Read


By: Petr Koráb*, Martin Feldkircher**,***, Viktoriya Teliha** (*Text Mining Stories, Prague; **Vienna School of International Studies; ***Centre for Applied Macroeconomic Analysis, Australia).

Labelling the lists of words produced by topic models requires domain expertise and may be subjective to the labeler. Especially when the number of topics grows large, it is convenient to assign human-readable names to topics automatically with an LLM. Simply copying and pasting the results into UIs such as chatgpt.com is a rather "black-box" and unsystematic approach. A better option is to add topic labelling to the code with a documented labeler, which gives the engineer more control over the results and ensures reproducibility. This tutorial explores in detail:

• How to train a topic model with the new Turftopic Python package
• How to label topic model results with GPT-4o mini.

We will train a state-of-the-art FASTopic model by Xiaobao Wu et al. [3], presented at last year's NeurIPS. This model outperforms competing models such as BERTopic on several key metrics (e.g., topic diversity) and has broad applications in business intelligence.

1. Components of the Topic Modelling Pipeline

Labelling is an essential part of the topic modelling pipeline because it bridges the model outputs with real-world decisions. The model assigns a number to each topic, but a business decision relies on a human-readable text label summarizing the typical words in each topic. Models are typically labelled by (1) labellers with domain expertise, often following a well-defined labelling strategy, (2) LLMs, and (3) commercial tools. The path from raw data to decision-making through a topic model is illustrated in Image 1.

Image 1. Components of the topic modelling pipeline.
Source: adapted and extended from Kardos et al. [2].

The pipeline begins with raw data, which is preprocessed and vectorized for the topic model. The model returns topics named with integers, along with typical terms (words or bigrams). The labelling layer replaces the integer in the topic name with a text label. The model user (product manager, customer care department, etc.) then works with labelled terms to make data-informed decisions. In the following modelling example, we will follow the pipeline step by step.
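Conceptually, the labelling layer is just a mapping from integer topic IDs to readable names. A minimal sketch, with hypothetical topics and labels, looks like this:

```python
# Minimal sketch of the labelling layer: the model outputs integer topic IDs
# with their typical bigrams; the labeller maps each ID to a readable name.
# Topics and labels below are hypothetical illustrations.

# Model output: topic ID -> typical bigrams
model_topics = {
    0: ["poor delivery", "late package", "wrong address"],
    1: ["refund request", "charged twice", "billing error"],
}

# Labelling layer: topic ID -> human-readable name (assigned manually here;
# in this tutorial the mapping is produced by an LLM)
topic_labels = {0: "Delivery problems", 1: "Billing and refunds"}

# The model user works with labelled topics, not integers
labelled = {topic_labels[tid]: words for tid, words in model_topics.items()}
print(labelled["Delivery problems"][0])  # -> "poor delivery"
```

Whether a human or an LLM fills in `topic_labels`, the rest of the pipeline stays the same.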

2. Data

We will use FASTopic to classify customer complaint data into 10 topics. The example use case relies on a synthetically generated Customer Care Email dataset available on Kaggle, licensed under the GPL-3 license. The prefiltered data covers 692 incoming emails to the customer care department and looks like this:

Image 2. Customer Care Email dataset. Image by authors.

2.1. Data preprocessing

Text data is preprocessed sequentially in six steps. Numbers are removed first, followed by emojis. English stopwords are removed next, followed by punctuation. Additional tokens (such as company and person names) are removed in the next step, before lemmatization. Read more on text preprocessing for topic models in our previous tutorial.
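As a rough illustration, the six steps can be sketched in plain Python; the regexes, stopword set, and toy lemma table below are hypothetical stand-ins for production tools such as NLTK or spaCy:

```python
import re
import string

STOPWORDS = {"the", "a", "is", "was", "to", "my"}       # toy stopword set
EXTRA_TOKENS = {"acme"}                                 # e.g., company names
LEMMAS = {"packages": "package", "arrived": "arrive"}   # toy lemma table

def preprocess(text: str) -> list[str]:
    text = re.sub(r"\d+", " ", text)                      # 1. remove numbers
    text = re.sub(r"[\U0001F300-\U0001FAFF]", " ", text)  # 2. remove emojis
    tokens = text.lower().split()
    tokens = [t for t in tokens if t not in STOPWORDS]    # 3. remove stopwords
    tokens = [t.strip(string.punctuation) for t in tokens]    # 4. strip punctuation
    tokens = [t for t in tokens if t and t not in EXTRA_TOKENS]  # 5. extra tokens
    return [LEMMAS.get(t, t) for t in tokens]             # 6. lemmatize

print(preprocess("My 2 Acme packages arrived late! 😡"))
# -> ['package', 'arrive', 'late']
```

A real pipeline would use a full stopword list and a proper lemmatizer, but the order of operations is the same.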

First, we read the clean data and tokenize the dataset:

import pandas as pd

# Read data
data = pd.read_csv("data.csv", usecols=['message_clean'])

# Create corpus list
docs = data["message_clean"].tolist()
Image 3. Recommended cleaning pipeline for topic models. Image by authors.

2.2. Bigram vectorization

Next, we create a bigram tokenizer to process tokens as bigrams during model training. Bigram models provide more relevant information and identify key qualities and problems for business decisions better than single-word models ("delivery" vs. "poor delivery", "stomach" vs. "sensitive stomach", etc.).

from sklearn.feature_extraction.text import CountVectorizer

bigram_vectorizer = CountVectorizer(
    ngram_range=(2, 2),               # only bigrams
    max_features=1000                 # top 1000 bigrams by frequency
)
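For intuition, `ngram_range=(2, 2)` counts only adjacent word pairs. A plain-Python equivalent of the bigram extraction (a simplification of what `CountVectorizer` does internally) looks like this:

```python
def bigrams(text: str) -> list[str]:
    """Extract adjacent word pairs, roughly what ngram_range=(2, 2) counts."""
    tokens = text.lower().split()
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

print(bigrams("poor delivery ruined my day"))
# -> ['poor delivery', 'delivery ruined', 'ruined my', 'my day']
```

`max_features=1000` then keeps only the 1,000 most frequent of these pairs across the corpus.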

3. Model training

The FASTopic model is currently implemented in two Python packages:

    • Fastopic: the official package by X. Wu
    • Turftopic: a new Python package that brings many useful topic modelling features, including labelling with LLMs [2]

We will use the Turftopic implementation because of the direct link between the model and the Namer that provides LLM labelling.

Let's set up the model and fit it to the data. It is essential to set a random state to ensure training reproducibility.

from turftopic import FASTopic

# Model specification
topic_size = 10
model = FASTopic(n_components=topic_size,        # train for 10 topics
                 vectorizer=bigram_vectorizer,   # generate bigrams in topics
                 random_state=32).fit(docs)      # set random state

# Prepare topic data from the corpus
topic_data = model.prepare_topic_data(docs)

Now, let's prepare a dataframe with topic IDs and the top 10 bigrams with the highest probability obtained from the model (code is here).

Image 4. Unlabelled topics in FASTopic. Image by authors.

4. Topic labelling

In the next step, we add text labels to the topic IDs with GPT-4o mini. With the following code, we label the topics and add a new column, topic_name, to the dataframe.

from turftopic.namers import OpenAITopicNamer
import os

# OpenAI API key to access GPT-4o mini
os.environ["OPENAI_API_KEY"] = ""

# use the Namer to label the topic model with an LLM
namer = OpenAITopicNamer("gpt-4o-mini")
model.rename_topics(namer)

# create a dataframe with labelled topics
topics_df = model.topics_df()
topics_df.columns = ['topic_id', 'topic_name', 'topic_words']

# split and explode
topics_df['topic_word'] = topics_df['topic_words'].str.split(',')
topics_df = topics_df.explode('topic_word')
topics_df['topic_word'] = topics_df['topic_word'].str.strip()

# add a rank for each word within a topic
topics_df['word_rank'] = topics_df.groupby('topic_id').cumcount() + 1

# pivot to wide format
wide = topics_df.pivot(index='word_rank',
                       columns=['topic_id', 'topic_name'], values='topic_word')
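The split-explode-rank-pivot transformation above is easiest to see on a toy dataframe (the two topics and labels here are hypothetical):

```python
import pandas as pd

# Toy labelled topics to illustrate the split -> explode -> rank -> pivot steps
df = pd.DataFrame({
    'topic_id': [0, 1],
    'topic_name': ['Delivery problems', 'Billing and refunds'],
    'topic_words': ['poor delivery, late package', 'refund request, charged twice'],
})

# split the comma-separated words and give each its own row
df['topic_word'] = df['topic_words'].str.split(',')
df = df.explode('topic_word')
df['topic_word'] = df['topic_word'].str.strip()

# rank words within each topic by their original order
df['word_rank'] = df.groupby('topic_id').cumcount() + 1

# pivot so that each (topic_id, topic_name) pair becomes a column
wide = df.pivot(index='word_rank',
                columns=['topic_id', 'topic_name'], values='topic_word')
print(wide[(0, 'Delivery problems')].tolist())
# -> ['poor delivery', 'late package']
```

Each column of `wide` is one labelled topic, with its words ordered by rank, which is convenient for reporting tables and word clouds.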

Here is the table with labelled topics after additional transformations. It would be interesting to compare the LLM results with those of a company insider who is familiar with the company's processes and customer base. The dataset is synthetic, so let's rely on the GPT-4o mini labelling.

Image 5. Topics in FASTopic labelled by GPT-4o mini. Image by authors.

We can also visualize the labelled topics for a better presentation. The code for the bigram word cloud visualization, generated from the topics produced by the model, is here.

Image 6. Word cloud visualization of topics labelled by GPT-4o mini. Image by authors.

Summary

• The new Turftopic Python package links recent topic models with an LLM-based labeler for producing human-readable topic names.
• The main benefits are: 1) independence from the labeler's subjective expertise, 2) the ability to label models with numerous topics that a human labeler would have difficulty labelling on their own, and 3) more control over the code and reproducibility.
• Topic labelling with LLMs has a wide range of applications in diverse areas. Read our recent paper on topic modelling of central bank communication, where GPT-4 labelled the FASTopic model [1].
• The labels differ slightly between training runs, even with a fixed random state. This is caused not by the Namer but by random processes in model training that output bigrams with probabilities in descending order. The differences in probabilities lie in tiny decimals, so each training run produces a few new words in the top 10, which in turn affects the LLM labeler.

The data and full code for this tutorial are here.

Petr Korab is a Senior Data Analyst and founder of Text Mining Stories with over eight years of experience in Business Intelligence and NLP.

Subscribe to our blog to get the latest news from the NLP industry!

    References

[1] Feldkircher, M., Korab, P., Teliha, V. (2025). "What Do Central Bankers Talk About? Evidence From the BIS Archive," CAMA Working Papers 2025-35, Centre for Applied Macroeconomic Analysis, Crawford School of Public Policy, The Australian National University.

[2] Kardos, M., Enevoldsen, K. C., Kostkan, J., Kristensen-McLachlan, R. D., Rocca, R. (2025). Turftopic: Topic Modelling with Contextual Representations from Sentence Transformers. Journal of Open Source Software, 10(111), 8183, https://doi.org/10.21105/joss.08183.

[3] Wu, X., Nguyen, T., Ce Zhang, D., Yang Wang, W., Luu, A. T. (2024). FASTopic: A Fast, Adaptive, Stable, and Transferable Topic Modeling Paradigm. arXiv preprint: 2405.17978.


