
    Topic Modeling Techniques for 2026: Seeded Modeling, LLM Integration, and Data Summaries

    By Editor Times Featured · January 14, 2026 · 16 Mins Read


    By: Martin Feldkircher (Vienna School of International Studies), Márton Kardos (Aarhus University, Denmark), and Petr Koráb (Text Mining Stories)

    1. Introduction

    Topic modelling has recently progressed in two directions. One stream of Python packages focuses on improved statistical methods: more robust, efficient, and preprocessing-free models that produce fewer junk topics (e.g., FASTopic). The other relies on the power of generative language models to extract intuitively understandable topics and their descriptions (e.g., TopicGPT [7], LLooM [6]).

    Thanks to research on statistical methods for modelling text representations from transformers, junk topics are the exception rather than the norm in newer models. Meanwhile, novel LLM-based approaches are challenging our long-standing views about what a topic model is and what it can do. Human-readable topic names and descriptions are increasingly becoming an expected result of a well-designed topic modelling pipeline.

    As exciting as these developments are, topic modelling is far from a solved problem. Neural topic models can be rather unstable and sometimes hard for users to trust because of their black-box nature. LLM-powered methods produce impressive results, but can at times raise questions of trust, due to hallucinations and sensitivity to semantically irrelevant changes in input. This is especially a problem for the banking sector, where (un)certainty is critical. Running large language models is also a substantial infrastructural and computational burden, and might end up costing large sums of money even for smaller datasets.

    Our previous tutorial provides a detailed introduction to how LLMs enhance traditional topic modeling by automatically labeling topic names. In this article, we combine current topic modeling methods with targeted LLM assistance. In our view, a combination of recent advances in language modeling and classical machine learning can give users the best of both worlds: a pipeline that combines the capabilities of large language models with the computational efficiency, trustworthiness, and stability of probabilistic ML.

    This article explains recent topic-modelling techniques that should be part of the NLP toolkit in 2026. We will cover:

    • How to use text prompts to specify what topic models should focus on (i.e., seeded topic models).
    • How LLM-generated summaries can make topic models more accurate.
    • How generative models can be used to label topics and provide their descriptions.
    • How these techniques can be used to gain insights from central banking communication.

    We illustrate these techniques on a corpus of central bank communication from the European Central Bank. This type of text is long, carefully structured, and highly repetitive: exactly the kind of data where standard topic models struggle and where interpretability is critical. By combining seeded topic modelling with LLM-assisted document summarization and analysis, we show how to extract focused, stable, and economically meaningful topics without compromising transparency or scalability.

    2. Example Data

    We use the press conference communications of the European Central Bank (ECB) as example text data. Since 2002, the ECB's Governing Council has met on the first Thursday of each month, and its communication of the meeting's outcome follows a two-step structure ([2]).

    How it works: First, at 13:45 CET, the ECB releases a brief monetary policy decision (MPD) statement, which contains only limited textual information. Second, at 14:30 CET, the ECB President delivers an introductory statement during a press conference. This carefully prepared document explains the rationale behind policy decisions, outlines the ECB's assessment of economic conditions, and provides guidance on future policy considerations. The introductory statement typically lasts about 15 minutes and is followed by a 45-minute Q&A session.

    For this article, we use the introductory statements, scraped directly from the ECB website (released under a flexible data licence). The dataset contains 279 statements, and here is what it looks like:

    Image 1: ECB communication dataset. Source: image by authors.

    3. Seeded Topic Modelling

    Traditionally, topic models focus on identifying the most informative topics in a dataset. A naive approach practitioners take is to fit a larger model and then, often manually, filter out topics irrelevant to their data question.

    What if you could condition a topic model to extract only the topics relevant to your data question? That is precisely what seeded topic modelling is for.

    In some methods, this means selecting a set of keywords that reflect your question. But in the framework we explore in this article, you can specify your interest in free text using a seed phrase that tells the model what to focus on.

    3.1 KeyNMF Model

    We will use the state-of-the-art contextual KeyNMF topic model ([3]). It is, in many respects, similar to older topic models, in that it formulates topic discovery as matrix factorization. In other words, when using this model, you assume that topics are latent factors, present in your documents to a lesser or greater extent, which determine and explain the content of those documents.

    KeyNMF is contextual because, unlike older models, it uses context-sensitive transformer representations of text. To understand how seeded modelling works, we need a basic understanding of the model. The modelling process happens in the following steps:

    1. We encode our documents into dense vectors using a sentence-transformer.
    2. We encode the vocabulary of these documents into the same embedding space.
    3. For each document, we extract the top N keywords by taking the terms with the highest cosine similarity to the document embedding.
    4. Term importance for a given document is then the cosine similarity, pruned at zero. These scores are arranged into a keyword matrix, where each row is a document and each column corresponds to a term.
    5. The keyword matrix is decomposed into a topic-term matrix and a document-topic matrix using Non-negative Matrix Factorization.
    Image 2: KeyNMF model architecture. Source: Turftopic's documentation.
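    The five steps above can be sketched in a few lines of numpy and scikit-learn. This is an illustrative toy version only, not Turftopic's implementation: the random matrices stand in for real sentence-transformer embeddings of documents and vocabulary, and the shapes are arbitrary.

    ```python
    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(0)

    # Stand-ins for steps 1-2: in practice these are sentence-transformer
    # embeddings of documents and of the vocabulary.
    doc_emb = rng.normal(size=(6, 16))     # 6 documents, 16-dim embeddings
    vocab_emb = rng.normal(size=(40, 16))  # 40 vocabulary terms

    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    # Steps 3-4: cosine similarity between each document and each term,
    # pruned at zero, gives the keyword matrix (documents x terms).
    sims = normalize(doc_emb) @ normalize(vocab_emb).T
    keyword_matrix = np.clip(sims, 0.0, None)

    # Step 5: NMF decomposes it into document-topic and topic-term factors.
    nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
    doc_topic = nmf.fit_transform(keyword_matrix)  # shape (6, 3)
    topic_term = nmf.components_                   # shape (3, 40)
    ```

    Ranking each row of topic_term by weight yields the top words per topic, which is what print_topics displays later in this article.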

    3.2 Seeded KeyNMF

    The general KeyNMF, while perfectly adequate for discovering topics in a corpus, is not necessarily the most suitable choice if we need to use the model for a specific question. To make this happen, we first have to specify a seed phrase: a phrase that minimally indicates what we are interested in. For example, when analysing the ECB communication dataset, this could be "Enlargement of the Eurozone".

    Since sentence-transformers can encode this seed phrase, we can use it to retrieve documents that are relevant to our question:

    1. We encode the seed phrase into the same embedding space as our documents and vocabulary.
    2. To make our model more attentive to documents that contain relevant information, we compute a document relevance score as the cosine similarity to the seed embedding. We prune, again, at zero.
    3. To exaggerate the seed's importance, one can apply a seed exponent. This involves raising the document relevance scores to the power of this exponent.
    4. We multiply the keyword matrix's entries by the document relevance.
    5. We then, as before, use NMF to decompose this now conditioned keyword matrix.
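    The conditioning steps can be illustrated with a small numpy sketch. This is a toy version with random stand-in data: the matrix shapes and the seed_exponent value are arbitrary, and in the real model the embeddings come from a sentence-transformer.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Stand-ins: a nonnegative keyword matrix (documents x terms) and
    # embeddings for the documents and the seed phrase.
    keyword_matrix = rng.random((6, 40))
    doc_emb = rng.normal(size=(6, 16))
    seed_emb = rng.normal(size=16)

    # Steps 1-2: cosine similarity of each document to the seed, pruned at zero.
    doc_norm = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    seed_norm = seed_emb / np.linalg.norm(seed_emb)
    relevance = np.clip(doc_norm @ seed_norm, 0.0, None)

    # Step 3: a seed exponent exaggerates the seed's influence.
    seed_exponent = 3.0
    relevance = relevance ** seed_exponent

    # Step 4: rescale each document's keyword row by its relevance,
    # so off-topic documents contribute little to the factorization.
    conditioned = keyword_matrix * relevance[:, None]
    ```

    Documents with zero relevance end up with all-zero rows, so the subsequent NMF step effectively ignores them.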

    The advantages of this approach are that it is:

    • highly flexible, and
    • able to save a lot of manual work.

    Be careful: some embedding models can be sensitive to phrasing and might produce different document-importance scores for the same document given a slightly different seed phrase. To deal with this, we recommend using one of the paraphrase models from sentence-transformers, because they have deliberately been trained to be phrasing-invariant and produce high-quality topics with KeyNMF.

    3.3 How to Use Seeded KeyNMF

    KeyNMF and its seeded version are available on PyPI in the Turftopic package, in a scikit-learn-compatible form. To specify what you are interested in, simply initialize the model with a seed phrase:

    from sentence_transformers import SentenceTransformer
    from turftopic import KeyNMF

    # Encode documents using a sentence-transformer
    encoder = SentenceTransformer("paraphrase-mpnet-base-v2")
    embeddings = encoder.encode(corpus, show_progress_bar=True)

    # Initialize KeyNMF with 4 topics and a seed phrase
    model = KeyNMF(
        n_components=4,
        encoder=encoder,
        seed_phrase="Enlargement of the Eurozone",
        seed_exponent=3.0,
    )

    # Fit the model, reusing the precomputed embeddings
    model.fit(corpus, embeddings=embeddings)

    # Print modelled topics
    model.print_topics()

    We can see that the model returns topic IDs with typical keywords that are clearly related to the Euro and the Eurozone:

    Topic ID | Top-ranking words
    0 | gdp, economic, euro, growth, economy, assessment, analysis, macroeconomic, measures, expected
    1 | inflation, rates, inflationary, stability, euro, rate, economy, monetary, ecb, expectations
    2 | euro, fiscal, stability, countries, reforms, question, governing, policy, european, policies
    3 | ecb, germany, frankfurt, communicationssonnemannstrasse, 7455media, europa, question, eu, 1344, 2060314
    Image 3: Seeded KeyNMF model output. Source: image by authors.

    4. LLM-assisted Topic Modeling

    Finding interpretable topics in a corpus is a hard task, and it often requires more than just a statistical model that finds patterns in the raw data. LLMs serve topic modelling in two main areas:

    • Reading a document and identifying the relevant aspects of the text based on a specific data question.
    • Interpreting the topic model's output in the relevant context.

    In the following sections, we will explore 1) how LLMs improve the processing of documents for a topic model, and 2) how generative models improve understanding and interpreting the model results.

    Image 4: Topic modelling pipeline extended with LLM components. Source: Turftopic's docs.

    4.1. Document Summarization

    One of the Achilles' heels of the sentence transformers we commonly use for topic analysis is their short context length. Encoder models that can read considerably longer contexts have rarely been evaluated for their performance in topic modeling, so we do not know whether or how these larger transformer models work in a topic modelling pipeline. Another problem is that they produce higher-dimensional embeddings, which often negatively affect unsupervised machine learning models ([4]). This can be either because Euclidean distances get inflated in higher-dimensional space, or because the number of parameters surges with input dimensionality, making parameter recovery more difficult.
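    The distance effect is easy to demonstrate. The following toy numpy experiment (not part of the pipeline; the sample sizes and dimensions are arbitrary) measures how the spread between the nearest and farthest neighbour of a query point collapses as dimensionality grows:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def distance_spread(dim, n=500):
        """Ratio of farthest to nearest neighbour distance from one query
        point to a random Gaussian cloud; it shrinks toward 1 as the
        dimension grows (distance concentration)."""
        cloud = rng.normal(size=(n, dim))
        query = rng.normal(size=dim)
        d = np.linalg.norm(cloud - query, axis=1)
        return d.max() / d.min()

    low = distance_spread(8)      # low-dimensional embeddings
    high = distance_spread(1024)  # high-dimensional embeddings
    ```

    With 8 dimensions the farthest point is typically several times farther away than the nearest one; with 1024 dimensions the ratio is close to 1, and distance-based unsupervised methods lose much of their signal.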

    We can address these issues by:

    • Chunking documents into smaller sections that fit into the context window of a sentence transformer. Unfortunately, chunking can result in text chunks that are wildly out of context, and it might take considerable effort to chunk documents at semantically sensible boundaries.
    • Using generative models to summarize the contents of these documents. LLMs excel at this task and can remove all kinds of tokenization-based noise and irrelevant information that might hinder our topic model.
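    To make the chunking option concrete, here is what a minimal paragraph-based chunker might look like. This is a hypothetical sketch: real pipelines usually count tokens rather than characters, and the double-newline paragraph convention is an assumption about the input text.

    ```python
    def chunk_text(text, max_chars=1000):
        """Greedily pack paragraphs into chunks below a character budget,
        so each chunk stays within a sentence-transformer's context window.
        Paragraphs are assumed to be separated by blank lines; a single
        paragraph longer than the budget becomes its own chunk."""
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, current = [], ""
        for p in paragraphs:
            if current and len(current) + len(p) + 2 > max_chars:
                chunks.append(current)
                current = p
            else:
                current = f"{current}\n\n{p}" if current else p
        if current:
            chunks.append(current)
        return chunks
    ```

    Even this simple version shows the trade-off mentioned above: chunk boundaries fall wherever the budget runs out, not where the document's meaning changes.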

    Let's now summarise the trade-offs of using LLM-generated summaries in topic modelling in the following image.

    Image 5: Benefits and drawbacks of LLM-assisted document processing in the topic modelling pipeline. Source: image by authors.

    The recommended strategy for LLM-assisted document preprocessing has two steps:

    1. Train a topic model with simple preprocessing, or no preprocessing at all.
    2. If you find that topic models have a hard time interpreting your corpus, LLM-based summarisation can be a good choice, provided the trade-offs work out positively for your specific project.

    4.1.1. Document Summarization in Code

    Let's now look at how we can summarize documents using an LLM. In this example, we will use GPT-5-nano, but Turftopic also supports locally run open LLMs. We recommend using open LLMs locally, if possible, because of lower costs and better data privacy.

    import pandas as pd
    from tqdm import tqdm
    from turftopic.analyzers import OpenAIAnalyzer

    # Load the data
    data = pd.read_parquet("data/ecb_data.parquet")

    # We write a prompt that will extract the relevant information.
    # We ask the model to separate information into key points so that
    # they become easier to model.
    summary_prompt = (
        "Summarize the following press conference from "
        "the European Central Bank into a set of key points separated by "
        "two newline characters. Reply with the summary only, nothing else. "
        "\n{document}"
    )

    # Initialize a summarizer
    summarizer = OpenAIAnalyzer("gpt-5-nano", summary_prompt=summary_prompt)
    summaries = []

    # Summarize the documents, tracking progress with tqdm
    for doc in tqdm(data["content"], desc="Summarising documents..."):
        summary = summarizer.summarize_document(doc)
        # We print summaries as we go as a sanity check, to make sure
        # the prompt works
        print(summary)
        summaries.append(summary)

    # Collect summaries into a dataframe
    summary_df = pd.DataFrame(
        {
            "id": data["id"],
            "date": data["date"],
            "author": data["author"],
            "title": data["title"],
            "summary": summaries,
        }
    )

    Next, we will fit a simple KeyNMF model on the key points in these summaries, and let the model discover the number of topics using the Bayesian Information Criterion. This approach works very well in this case, but be aware that automatic topic number detection has its shortcomings. Check out the Topic Model Leaderboard to learn more about how models perform at detecting the number of topics.

    from turftopic import KeyNMF

    # Create the corpus from the text summaries (not the original texts)
    corpus = list(summary_df["summary"])

    # Collect key points by segmenting at double line breaks
    points = []
    for doc in corpus:
        _points = doc.split("\n\n")
        doc_points = [p for p in _points if len(p.strip().removeprefix(" - "))]
        points.extend(doc_points)

    # Tell KeyNMF to automatically detect the number of topics using BIC
    model = KeyNMF("auto", encoder="paraphrase-mpnet-base-v2")
    doc_topic = model.fit_transform(points)

    # Print topic IDs with top terms
    model.print_topics()

    Here are the KeyNMF results trained on the document summaries:

    Topic ID | Top-ranking words
    0 | inflation, hicp, expectations, expected, wage, energy, prices, price, medium, pressures
    1 | ecb, rates, unchanged, kept, key, rate, liquidity, banks, market, exchange
    2 | m3, credit, monetary, growth, lending, liquidity, loans, financial, money, m1
    3 | euro, area, economy, banknotes, expected, exchange, currency, countries, convergence, external
    4 | reforms, fiscal, growth, structural, consolidation, markets, potential, productivity, market, important
    5 | risks, downside, growth, balanced, outlook, upside, prices, tensions, potential, global
    6 | stability, price, medium, pact, remains, risks, expectations, term, upside, fiscal
    7 | policy, monetary, rate, fiscal, rates, decisions, transmission, measures, remains, stance
    8 | gdp, growth, real, economic, projections, quarter, demand, q2, expected, economy
    9 | council, governing, rate, decision, meeting, refinancing, consensus, unanimous, ensure, rates
    Image 6: KeyNMF 10-topic results trained on document summaries. Source: image by authors.

    4.3. Topic Analysis with LLMs

    In a typical topic-analysis pipeline, a user would first train a topic model, then spend time interpreting what the model has discovered, label topics manually, and finally provide a brief description of the kinds of documents each topic contains. This is time-consuming, especially in corpora with many identified topics.

    This part can now be done by LLMs, which can easily generate human-readable topic names and descriptions. We will use the same Analyzer API from Turftopic to achieve this:

    from turftopic.analyzers import OpenAIAnalyzer

    analyzer = OpenAIAnalyzer()
    analysis_result = model.analyze_topics(analyzer, use_documents=True)

    print(analysis_result.to_df())

    We apply the analyzer to the introductory statements issued by the ECB, which accompany each monetary policy decision. These statements are carefully prepared and follow a relatively standard structure. Here are the labelled topic names with their descriptions and top terms printed from analysis_result:

    Image 7: Topic analysis using GPT-5-nano in Turftopic. Source: image by authors.

    Next, let's plot the prevalence of the labelled KeyNMF topic names over time, i.e., how intensely these topics were discussed in the ECB press conferences over the past 25 years:

    from datetime import datetime

    import plotly.express as px
    from scipy.signal import savgol_filter

    # Create a dataframe from the labelled topics, combined with the
    # timestamps from the date column (`timestamps` is assumed to hold
    # the documents' dates)
    time_df = pd.DataFrame(
        dict(
            date=timestamps,
            **dict(zip(analysis_result.topic_names,
                       doc_topic.T / doc_topic.sum(axis=1)))
        )
    ).set_index("date")

    # Aggregate the dataframe to monthly frequency
    time_df = time_df.groupby(by=[time_df.index.month, time_df.index.year]).mean()
    time_df.index = [datetime(year=y, month=m, day=1) for m, y in time_df.index]
    time_df = time_df.sort_index()

    # Smooth the series and display them with Plotly
    for col in time_df.columns:
        time_df[col] = savgol_filter(time_df[col], 12, 2)
    fig = px.line(
        time_df.sort_index(),
        template="plotly_white",
    )
    fig.show()

    Here is the labelled topic model dataframe displayed over time:

    Image 8: Topic analysis using GPT-5-nano in Turftopic over time. Source: image by authors.

    Model results in context: The monetary union topic was most prominent in the early 2000s (see [5] for more information). The monetary policy and rate decision topic peaks at the end of the global financial crisis around 2011, a period during which the ECB (some commentators argue mistakenly) raised interest rates. The timing of the inflation and inflation expectations topic also corresponds with economic developments: it rises sharply around 2022, when energy prices pushed inflation into double-digit territory in the euro area for the first time since its creation.

    5. Summary

    Let's now summarize the key points of the article. The requirements and code for this tutorial are in this repo.

    • The seeded KeyNMF topic model combines text prompts with a state-of-the-art topic model to concentrate modelling on a certain problem.
    • Summarizing data for topic modeling reduces training time, but it has drawbacks that should be considered in a project.
    • The Turftopic Python package integrates systematic topic descriptions and labels from recent LLMs into a topic modelling pipeline.

    References 

    [1] Taejin Park, Fernando Perez-Cruz and Hyun Song Shin. 2025. Mapping the space of central bankers' ideas. In: BIS Working Papers, No. 1299, 16 October 2025, 26 pp.

    [2] Carlo Altavilla, Luca Brugnolini, Refet S. Gürkaynak, Roberto Motto and Giuseppe Ragusa. 2019. Measuring euro area monetary policy. In: Journal of Monetary Economics, Volume 108, pp. 162-179.

    [3] Ross Deans Kristensen-McLachlan, Rebecca M.M. Hicke, Márton Kardos, and Mette Thunø. 2024. Context is Key(NMF): Modelling Topical Information Dynamics in Chinese Diaspora Media. In: CHR 2024: Computational Humanities Research Conference, December 4-6, 2024, Aarhus, Denmark.

    [4] Márton Kardos, Jan Kostkan, Kenneth Enevoldsen, Arnault-Quentin Vermillet, Kristoffer Nielbo, and Roberta Rocca. 2025. S3 - Semantic Signal Separation. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 633-666, Vienna, Austria. Association for Computational Linguistics.

    [5] Martin Feldkircher, Petr Koráb and Viktoriya Teliha. 2025. What do central bankers talk about? Evidence from the BIS archive. In: CAMA Working Paper No. 35/2025.

    [6] Michelle S. Lam, Janice Teoh, James A. Landay, Jeffrey Heer, and Michael S. Bernstein. 2024. Concept Induction: Analyzing Unstructured Text with High-Level Concepts Using LLooM. In: Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI '24). Association for Computing Machinery, New York, NY, USA, Article 766, 1-28. https://doi.org/10.1145/3613904.3642830.

    [7] Chau Minh Pham, Alexander Hoyle, Simeng Sun, Philip Resnik, and Mohit Iyyer. 2024. TopicGPT: A Prompt-based Topic Modeling Framework. In: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2956-2984, Mexico City, Mexico. Association for Computational Linguistics.


