
    Learn How to Use Transformers with HuggingFace and SpaCy

    By Editor Times Featured · September 15, 2025 · 8 Mins Read


    Introduction

    The transformer is the cutting-edge architecture for NLP, and not only NLP. Modern models like ChatGPT, Llama, and Gemma are based on this architecture, introduced in 2017 in the Attention Is All You Need paper by Vaswani et al.

    In the previous article, we saw how to use spaCy to accomplish several tasks, and you may have noticed that we never had to train anything; we simply leveraged spaCy's built-in capabilities, which are mainly rule-based approaches.

    spaCy also lets you insert trainable components into the NLP pipeline, or use models off the shelf from the 🤗 Hugging Face Hub, an online platform that provides open-source models for AI developers.

    So let's learn how to use spaCy with Hugging Face models!

    Why Transformers?

    Before transformers, the state-of-the-art approach to creating vector representations of words was word-vector methods. A word vector is a dense representation of a word, which we can use to perform mathematical operations.

    For example, we can observe that two words with similar meanings also have similar vectors. The most famous methods of this kind are GloVe and FastText.

    These methods, however, have a big drawback: a word is always represented by the same vector. But a word does not always have the same meaning.

    For instance:

    • “She went to the bank to withdraw some money.”
    • “He sat by the bank of the river, watching the water flow.”

    In these two sentences, the word bank has two different meanings, so it makes no sense to always represent the word with the same vector.
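    To make this concrete, here is a minimal numpy sketch. The vectors below are made up purely for illustration (real GloVe or FastText vectors have hundreds of dimensions): words with similar meanings score high on cosine similarity, but a static method is forced to assign bank a single vector that must cover both senses at once.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 4-dimensional "word vectors", invented for this example
money = np.array([0.9, 0.1, 0.0, 0.2])
cash  = np.array([0.8, 0.2, 0.1, 0.3])
river = np.array([0.0, 0.9, 0.8, 0.1])

print(cosine_similarity(money, cash))   # high: similar meanings, similar vectors
print(cosine_similarity(money, river))  # low: unrelated meanings

# A static method gives "bank" ONE vector, shared by the financial
# and the river sense -- at best an average of the two.
bank = (money + river) / 2
```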

    With transformer-based architectures, we can now build models that consider the whole context to generate the vector representation of a word.

    src: https://arxiv.org/abs/1706.03762

    The main innovation introduced by this network is the multi-head attention block. If you're not familiar with it, I recently wrote an article about it: https://towardsdatascience.com/a-simple-implementation-of-the-attention-mechanism-from-scratch/
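    As a quick refresher, a single attention head can be sketched in a few lines of numpy. This is a toy single-head version, not the paper's full mechanism: multi-head attention runs several of these in parallel over learned projections and concatenates the results.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights                      # weighted sum of value rows

# 3 tokens with d_k = 4: each output row is a context-aware mix of all values
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)       # (3, 4)
print(w.sum(axis=-1))  # each row of attention weights sums to 1
```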

    The transformer is made up of two parts: the left half, the encoder, creates the vector representation of texts, while the right half, the decoder, is used to generate new text. GPT, for example, is based on the right half, because it generates text as a chatbot does.

    In this article we are interested in the encoder part, which captures the semantics of the text we give as input.

    BERT and RoBERTa

    This won't be a course on these models, but let's recap some essential points.

    While ChatGPT is built on the decoder side of the transformer architecture, BERT and RoBERTa are based on the encoder side.

    BERT was introduced by Google in 2018, and you can read more about it here: https://arxiv.org/abs/1810.04805

    BERT is a stack of encoder layers. The model comes in two sizes: BERT base contains 12 encoders, while BERT large contains 24.

    src: https://iq.opengenus.org/content/images/2021/01/bert-base-bert-large-encoders.png

    BERT base generates vectors of size 768, while BERT large generates vectors of size 1024. Both accept an input of up to 512 tokens.
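    This input limit matters later: spacy-transformers feeds long texts to the model as overlapping windows (the strided spans configured further down in this article). The idea can be sketched roughly like this; a simplified illustration of the windowing, not the library's actual code.

```python
def strided_spans(tokens, window, stride):
    """Split a long token sequence into overlapping windows so a model
    with a fixed input size can still see every token."""
    spans, start = [], 0
    while start < len(tokens):
        spans.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return spans

# 300 tokens, window=128, stride=96 (the values used in the config below)
spans = strided_spans(list(range(300)), window=128, stride=96)
print([(s[0], s[-1]) for s in spans])  # [(0, 127), (96, 223), (192, 299)]
```

    Because stride is smaller than window, consecutive spans overlap, so tokens near a window boundary still get context from both sides.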

    The tokenizer used by the BERT model is called WordPiece.
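    WordPiece splits out-of-vocabulary words into known sub-word pieces using a greedy longest-match-first rule, marking continuation pieces with ##. Here is a toy sketch of that matching rule; the tiny vocabulary is hypothetical (a real BERT vocabulary has roughly 30,000 learned entries).

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation, as WordPiece does it.
    Continuation pieces carry the '##' prefix, as in BERT."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # not at word start: continuation
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the candidate until it matches
        if piece is None:
            return ["[UNK]"]  # no piece matches: unknown token
        tokens.append(piece)
        start = end
    return tokens

# Tiny hypothetical vocabulary, just for illustration
vocab = {"play", "##ing", "##ed", "un", "##believ", "##able"}
print(wordpiece_tokenize("playing", vocab))       # ['play', '##ing']
print(wordpiece_tokenize("unbelievable", vocab))  # ['un', '##believ', '##able']
```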

    BERT is trained on two objectives:

    • Masked Language Modeling (MLM): predict missing (masked) tokens within a sentence.
    • Next Sentence Prediction (NSP): determine whether a given second sentence logically follows the first one.
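    The MLM objective is easy to sketch: hide a fraction of the tokens and keep the originals as the labels to predict. This is a simplified illustration; BERT's full recipe also replaces some of the picked tokens with random or unchanged tokens, which is skipped here.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Replace roughly mask_prob of the tokens with [MASK] and record
    the original tokens as the labels the model must predict."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets[i] = tok  # the label to recover at position i
        else:
            masked.append(tok)
    return masked, targets

tokens = "she went to the bank to withdraw some money".split()
masked, targets = mask_tokens(tokens, mask_prob=0.3)
print(masked)   # some tokens replaced by [MASK]
print(targets)  # {position: original token, ...}
```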

    The RoBERTa model builds on top of BERT with some key differences: https://arxiv.org/abs/1907.11692

    RoBERTa uses dynamic masking, so the masked tokens change at every iteration during training, and it drops the NSP training objective.
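    The contrast is simple to picture: static masking draws one mask pattern per sentence before training, while dynamic masking draws a fresh pattern every time the sentence is seen. A toy sketch of the difference:

```python
import random

sentence = "he sat by the bank of the river".split()

def random_mask(tokens, rng, prob=0.3):
    """Independently replace each token with [MASK] with probability prob."""
    return ["[MASK]" if rng.random() < prob else tok for tok in tokens]

# Static masking (original BERT): the pattern is drawn once up front,
# so every epoch trains on the exact same masked sentence.
static = random_mask(sentence, random.Random(42))
epochs_static = [static] * 3

# Dynamic masking (RoBERTa): a fresh pattern is drawn each epoch.
rng = random.Random(42)
epochs_dynamic = [random_mask(sentence, rng) for _ in range(3)]

print(epochs_static[0] == epochs_static[2])  # True: always identical
print(epochs_dynamic)                        # patterns typically differ
```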

    Using RoBERTa with spaCy

    The TextCategorizer is a spaCy component that predicts one or more labels for an entire document. It can work in two modes:

    • exclusive_classes = true: one label per text (e.g., positive or negative)
    • exclusive_classes = false: multiple labels per text (e.g., spam, urgent, billing)

    spaCy can mix this with totally different embeddings:

    • Classic word vectors (tok2vec)
    • Transformer models like RoBERTa, which we use here
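    Either way, the component writes its scores into doc.cats, a plain dict mapping label to score (the numbers below are hypothetical). How you read those scores depends on the mode:

```python
# Hypothetical scores a trained TextCategorizer might produce
cats = {"ABBR": 0.01, "DESC": 0.03, "ENTY": 0.02,
        "HUM": 0.90, "LOC": 0.03, "NUM": 0.01}

# exclusive_classes = true: the labels compete, so take the argmax
best_label = max(cats, key=cats.get)
print(best_label)  # HUM

# exclusive_classes = false: labels are scored independently,
# so keep everything above a threshold (spaCy's default is 0.5)
multi = {"spam": 0.8, "urgent": 0.7, "billing": 0.1}
chosen = [label for label, score in multi.items() if score >= 0.5]
print(chosen)  # ['spam', 'urgent']
```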

    This way we can leverage RoBERTa's understanding of the English language and integrate it into the spaCy pipeline to make it production-ready.

    If you have a dataset, you can further train the RoBERTa model with spaCy to fine-tune it on the specific downstream task you're trying to solve.

    Dataset preparation

    In this article I'm going to use the TREC dataset, which contains short questions. Each question is labelled with the type of answer it expects, such as:

    Label  Meaning
    ABBR   Abbreviation
    DESC   Description / Definition
    ENTY   Entity (thing, object)
    HUM    Human (person, group)
    LOC    Location (place)
    NUM    Numeric (count, date, etc.)

    Here is an example, where we expect a human name as the answer:

    Q (text): “Who wrote the Iliad?”
    A (label): “HUM”

    As usual, we start by installing the libraries.

    !pip install datasets==3.6.0
    !pip install -U spacy[transformers]

    Now we need to load and prepare the dataset.

    With spacy.blank("en") we create a blank spaCy pipeline for English. It doesn't include any components (like the tagger or the parser), so it's lightweight and perfect for converting raw text to Doc objects without loading a full language model like en_core_web_sm.

    DocBin is a special spaCy class that efficiently stores many Doc objects in binary format. This is the format spaCy expects training data to be stored in.

    Once converted and saved as .spacy files, these can be passed directly to spacy train, which is much faster than using plain JSON or text files.

    With that in mind, this script to prepare the train and dev datasets should be fairly easy to follow.

    from datasets import load_dataset
    import spacy
    from spacy.tokens import DocBin
    
    # Load the TREC dataset
    dataset = load_dataset("trec")
    
    # Get label names (e.g., ["DESC", "ENTY", "ABBR", ...])
    label_names = dataset["train"].features["coarse_label"].names
    
    # Create a blank English pipeline (no components yet)
    nlp = spacy.blank("en")
    
    # Convert Hugging Face examples into spaCy Docs and save as a .spacy file
    def convert_to_spacy(split, filename):
        doc_bin = DocBin()
        for example in split:
            text = example["text"]
            label = label_names[example["coarse_label"]]
            cats = {name: 0.0 for name in label_names}
            cats[label] = 1.0
            doc = nlp.make_doc(text)
            doc.cats = cats
            doc_bin.add(doc)
        doc_bin.to_disk(filename)
    
    convert_to_spacy(dataset["train"], "train.spacy")
    convert_to_spacy(dataset["test"], "dev.spacy")
    

    We're going to further train RoBERTa on this dataset using a spaCy CLI command. The command expects a config.cfg file in which we describe the type of training, the model we're using, the number of epochs, and so on.

    Here is the config file I used for my training.

    [paths]
    train = ./train.spacy
    dev = ./dev.spacy
    vectors = null
    init_tok2vec = null
    
    [system]
    gpu_allocator = "pytorch"
    seed = 42
    
    [nlp]
    lang = "en"
    pipeline = ["transformer", "textcat"]
    batch_size = 32
    
    [components]
    
    [components.transformer]
    factory = "transformer"
    
    [components.transformer.model]
    @architectures = "spacy-transformers.TransformerModel.v3"
    name = "roberta-base"
    tokenizer_config = {"use_fast": true}
    transformer_config = {}
    mixed_precision = false
    grad_scaler_config = {}
    
    [components.transformer.model.get_spans]
    @span_getters = "spacy-transformers.strided_spans.v1"
    window = 128
    stride = 96
    
    [components.textcat]
    factory = "textcat"
    scorer = {"@scorers": "spacy.textcat_scorer.v2"}
    threshold = 0.5
    
    [components.textcat.model]
    @architectures = "spacy.TextCatEnsemble.v2"
    nO = null
    
    [components.textcat.model.linear_model]
    @architectures = "spacy.TextCatBOW.v3"
    ngram_size = 1
    no_output_layer = true
    exclusive_classes = true
    size = 262144
    
    [components.textcat.model.tok2vec]
    @architectures = "spacy-transformers.TransformerListener.v1"
    upstream = "transformer"
    pooling = {"@layers": "reduce_mean.v1"}
    grad_factor = 1.0
    
    [corpora]
    
    [corpora.train]
    @readers = "spacy.Corpus.v1"
    path = ${paths.train}
    
    [corpora.dev]
    @readers = "spacy.Corpus.v1"
    path = ${paths.dev}
    
    [training]
    train_corpus = "corpora.train"
    dev_corpus = "corpora.dev"
    seed = ${system.seed}
    gpu_allocator = ${system.gpu_allocator}
    dropout = 0.1
    accumulate_gradient = 1
    patience = 1600
    max_epochs = 10
    max_steps = 2000
    eval_frequency = 100
    frozen_components = []
    annotating_components = []
    
    [training.optimizer]
    @optimizers = "Adam.v1"
    learn_rate = 0.00005
    L2 = 0.01
    grad_clip = 1.0
    use_averages = false
    eps = 1e-08
    beta1 = 0.9
    beta2 = 0.999
    L2_is_weight_decay = true
    
    [training.batcher]
    @batchers = "spacy.batch_by_words.v1"
    discard_oversize = false
    tolerance = 0.2
    
    [training.batcher.size]
    @schedules = "compounding.v1"
    start = 256
    stop = 2048
    compound = 1.001
    
    [training.logger]
    @loggers = "spacy.ConsoleLogger.v1"
    progress_bar = true
    
    [training.score_weights]
    cats_score = 1.0
    
    [initialize]
    vectors = ${paths.vectors}
    init_tok2vec = ${paths.init_tok2vec}
    vocab_data = null
    lookups = null
    
    [initialize.components]
    [initialize.tokenizer]
    

    Make sure you have a GPU at your disposal and launch the training CLI command!

    python -m spacy train config.cfg --output ./output --gpu-id 0

    You will see the training start, and you can monitor the loss of the TextCategorizer component.

    To be clear, here we are training the TextCategorizer component, a small neural network head that receives the document representation and learns to predict the correct label.

    But we are also fine-tuning RoBERTa during this training. That means the RoBERTa weights are updated on the TREC dataset, so the model learns to represent input questions in a way that is more useful for classification.

    Once the model is trained and saved, we can use it for inference!

    import spacy
    
    nlp = spacy.load("output/model-best")
    
    doc = nlp("What is the capital of Italy?")
    print(doc.cats)
    

    The output should be something like the following:

    {'LOC': 0.98, 'HUM': 0.01, 'NUM': 0.0, …}
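    Since doc.cats is a plain Python dict, extracting the predicted label is a one-liner (using the illustrative scores above):

```python
# Illustrative scores, as in the example output above
cats = {"LOC": 0.98, "HUM": 0.01, "NUM": 0.0}

predicted = max(cats, key=cats.get)  # label with the highest score
print(predicted, cats[predicted])    # LOC 0.98
```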

    Closing Thoughts

    To recap, in this post we saw how to:

    • Use a Hugging Face dataset with spaCy
    • Convert text classification data into the .spacy format
    • Configure a full pipeline using RoBERTa and textcat
    • Train and test your model using the spaCy CLI

    This approach works for any short-text classification task: emails, support tickets, product reviews, FAQs, even chatbot intents.


