    Fine-tuning Multimodal Embedding Models | by Shaw Talebi

    January 31, 2025


    The first (and most important) step of any fine-tuning process is data collection. Here, I extracted title-thumbnail pairs from my channel in a 2-step process.

    First, I used YouTube’s search API to extract the video IDs for all the videos on my channel. Second, I used YouTube’s video API to extract the title and thumbnail URL of each of my long-form videos (i.e. longer than 3 min).

    # imports
    from top_secret import my_key
    import requests
    from isodate import parse_duration

    import pandas as pd
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from datasets import DatasetDict, Dataset

    channel_id = 'UCa9gErQ9AE5jT2DZLjXBIdA' # my YouTube channel ID
    page_token = None # initialize page token
    url = 'https://www.googleapis.com/youtube/v3/search' # YouTube search API

    # extract video data across multiple search result pages
    video_id_list = []

    while page_token != 0:
        params = {
            "key": my_key,
            'channelId': channel_id,
            'part': ["snippet","id"],
            'order': "date",
            'maxResults': 50,
            'pageToken': page_token
        }
        response = requests.get(url, params=params)

        for raw_item in dict(response.json())['items']:

            # only execute for youtube videos
            if raw_item['id']['kind'] != "youtube#video":
                continue

            # grab video ids
            video_id_list.append(raw_item['id']['videoId'])

        try:
            # grab next page token
            page_token = dict(response.json())['nextPageToken']
        except:
            # if no next page token, kill while loop
            page_token = 0

    Note that you will need a YouTube API key to run the above Python code, which you can create using the Google Cloud Console. To adapt this to your channel, you just need to change the channel_id variable.

    # extract video titles and thumbnails
    url = "https://www.googleapis.com/youtube/v3/videos"
    video_data_list = []

    for video_id in video_id_list:

        params = {
            "part": ["snippet","contentDetails"],
            "id": video_id,
            "key": my_key,
        }
        response = requests.get(url, params=params)

        raw_dict = dict(response.json())['items'][0]

        # only process videos longer than 3 minutes
        iso_duration = raw_dict['contentDetails']["duration"]
        if parse_duration(iso_duration).total_seconds() < 180:
            continue

        # extract video data
        video_data = {}
        video_data['video_id'] = video_id
        video_data['title'] = raw_dict['snippet']['title']
        video_data['thumbnail_url'] = raw_dict['snippet']['thumbnails']['high']['url']

        # append data to list
        video_data_list.append(video_data)

    As an additional step, I created negative thumbnail-title pairs. We can use these during the training process to not only guide the model with examples of which embeddings should be close together (i.e. positive pairs), but also which embeddings should be far apart (i.e. negative pairs).

    To do this, I computed the similarity between all possible title pairs using the Sentence Transformers library. Then, for each positive pair, I matched the least similar title as a negative example (ensuring there were no duplicates).

    # store data in dataframe
    df = pd.DataFrame(video_data_list)

    # Load the model
    model = SentenceTransformer("all-mpnet-base-v2")

    # Encode all titles
    embeddings = model.encode(df['title'].to_list())

    # compute similarities
    similarities = model.similarity(embeddings, embeddings)

    # match the least similar title to each positive match as the negative match
    similarities_argsorted = np.argsort(similarities.numpy(), axis=1)
    negative_pair_index_list = []

    for i in range(len(similarities)):

        # Start with the smallest similarity index for the current row
        j = 0
        index = int(similarities_argsorted[i][j])

        # Ensure the index is unique
        while index in negative_pair_index_list:
            j += 1  # Move to the next smallest index
            index = int(similarities_argsorted[i][j])  # Fetch next smallest index

        negative_pair_index_list.append(index)

    # add negative pairs to df
    df['title_neg'] = df['title'].iloc[negative_pair_index_list].values

    Finally, I created a train-valid-test split and pushed the dataset to the Hugging Face Hub.

    # Shuffle the dataset
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)

    # Split into train, validation, and test sets
    train_frac = 0.7
    valid_frac = 0.15
    test_frac = 0.15

    # define train and validation size
    train_size = int(train_frac * len(df))
    valid_size = int(valid_frac * len(df))

    # create train, validation, and test datasets
    df_train = df[:train_size]
    df_valid = df[train_size:train_size + valid_size]
    df_test = df[train_size + valid_size:]

    # Convert the pandas DataFrames back to Hugging Face Datasets
    train_ds = Dataset.from_pandas(df_train)
    valid_ds = Dataset.from_pandas(df_valid)
    test_ds = Dataset.from_pandas(df_test)

    # Combine into a DatasetDict
    dataset_dict = DatasetDict({
        'train': train_ds,
        'valid': valid_ds,
        'test': test_ds
    })

    # push data to hub
    dataset_dict.push_to_hub("shawhin/yt-title-thumbnail-pairs")

    Although we have all the data we need for fine-tuning, it is still not in a suitable format for training. More specifically, we need to convert our image URLs to PIL image objects and organize our data into (anchor, positive, negative) triplets, i.e., a thumbnail, its corresponding title, and a negative title, respectively.

    We can process all three data splits (i.e. train, valid, and test) in the following way using the Hugging Face Datasets library.

    from PIL import Image
    from datasets import load_dataset

    # load dataset
    dataset = load_dataset("shawhin/yt-title-thumbnail-pairs")

    # define preprocessing function
    def preprocess(batch):
        """
        Preprocessing data without augmentations for test set
        """
        # get images from urls
        image_list = [Image.open(requests.get(url, stream=True).raw)
                      for url in batch["thumbnail_url"]]

        # return columns with standard names
        return {
            "anchor": image_list,
            "positive": batch["title"],
            "negative": batch["title_neg"]
        }

    # remove columns not relevant to training
    columns_to_remove = [col for col in dataset['train'].column_names
                         if col not in ['anchor', 'positive', 'negative']]
    # apply transformations
    dataset = dataset.map(preprocess, batched=True,
                          remove_columns=columns_to_remove)

    It’s important that we order our columns as (anchor, positive, negative) triplets because this is the format expected by the loss function we’ll use during training (which I learned the hard way).
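    If you want to be explicit about this, you can enforce and verify the column order yourself. Below is a small optional check, a sketch assuming the select_columns method of the datasets library (it is not part of the original pipeline):

    # optional: enforce and verify the (anchor, positive, negative) column order
    dataset = dataset.select_columns(["anchor", "positive", "negative"])
    print(dataset["train"].column_names)
    # expected: ['anchor', 'positive', 'negative']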

    Training involves optimizing a model’s parameters to minimize a loss function. However, this value (i.e. a contrastive loss) isn’t helpful for assessing the model’s performance on a downstream task (e.g. matching titles to thumbnails).

    A quantity that is more insightful, in this case, is the model’s ability to correctly match a given thumbnail to the correct title among several candidates. This is denoted Recall@1.

    We can implement an evaluator compatible with the Sentence Transformers library to compute this metric. Since the code is quite long, I won’t paste it here, but the curious reader can find it in Cell 12 of this notebook.
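    To give a sense of what that evaluator computes, here is a minimal sketch of Recall@1 for a data split (an illustration only, not the ImageTextRetrievalEvaluator from the notebook): embed the thumbnails and titles, compute all pairwise similarities, and count how often the top-ranked title for each thumbnail is its true title.

    # minimal Recall@1 sketch (illustrative, not the notebook's evaluator)
    def recall_at_1(model, images, texts):
        """Fraction of thumbnails whose most similar title is the correct one."""
        # embed thumbnails and titles into the shared embedding space
        img_emb = model.encode(images)
        txt_emb = model.encode(texts)

        # similarity matrix: rows = thumbnails, columns = candidate titles
        sims = model.similarity(img_emb, txt_emb).numpy()

        # a hit when the highest-scoring title sits at the matching index
        hits = (np.argmax(sims, axis=1) == np.arange(len(images))).sum()
        return hits / len(images)

    # example usage on the validation split
    # print(recall_at_1(model, dataset["valid"]["anchor"], dataset["valid"]["positive"]))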

    # function to create new evaluator given data split
    def create_recall_evaluator(set_name, k=1):
        """
        Create triplet evaluator for "train", "valid", or "test" split
        """

        return ImageTextRetrievalEvaluator(
            images=dataset[f"{set_name}"]["anchor"],
            texts=dataset[f"{set_name}"]["positive"],
            name=f"yt-title-thumbnail-{set_name}",
            k=k
        )

    # Create new evaluator with Recall@k
    evaluator_recall_train = create_recall_evaluator("train", k=1)
    evaluator_recall_valid = create_recall_evaluator("valid", k=1)

    print("Train:", evaluator_recall_train(model))
    print("Valid:", evaluator_recall_valid(model))

    # >> Train: {'yt-title-thumbnail-train_Recall@1': 0.660377358490566}
    # >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.6363636363636364}

    We can see the model already has decent performance out of the box, with correct titles being matched 66% of the time.

    There are 3 key things we must do before training the model. Namely, choose which parameters to train, pick a loss function, and set hyperparameters.

    Trainable Parameters

    The key limitation of this project is that I’ve only posted 76 YouTube videos (as of writing this). With the validation and test splits, this leaves only 53 examples for training.

    Since we have so few training examples, limiting the number of parameters we train is a good idea. In this case, I only train the final projection layer of the model, which maps the text and image embeddings into a shared vector space. This is about 1M parameters total.

    # import model
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")

    # pick specific layers to train (note: you can add more layers to this list)
    trainable_layers_list = ['projection']

    # Apply freezing configuration
    for name, param in model.named_parameters():

        # freeze all params
        param.requires_grad = False

        # unfreeze layers in trainable_layers_list
        if any(layer in name for layer in trainable_layers_list):
            param.requires_grad = True

    # Count total and trainable parameters
    total_params = sum(p.numel() for p in model.parameters())
    trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

    print(f"Total parameters: {total_params:,}")
    print(f"Trainable parameters: {trainable_params:,}")
    print(f"% of trainable parameters: {100*trainable_params/total_params:.2f}%")

    # >> Total parameters: 427,616,513
    # >> Trainable parameters: 1,376,256
    # >> % of trainable parameters: 0.32%

    Loss function

    Here, I use the Multiple Negatives Ranking Loss from the Sentence Transformers library (which works with single negatives like in this case). It works by maximizing the similarity between positive pairs while minimizing the similarity between negative pairs. Here’s what the loss function looks like for the single-negative case [2].

    Multiple negatives loss function (with only one negative). Image by author.
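    In the standard form used by the library (cosine similarity sim, scaled by a factor s, 20 by default), the single-negative version of this loss for an anchor a, positive p, and negative n can be written as:

    \mathcal{L}(a, p, n) = -\log \frac{\exp\left(s \cdot \mathrm{sim}(a, p)\right)}{\exp\left(s \cdot \mathrm{sim}(a, p)\right) + \exp\left(s \cdot \mathrm{sim}(a, n)\right)}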
    from sentence_transformers.losses import MultipleNegativesRankingLoss

    # define loss
    loss = MultipleNegativesRankingLoss(model)

    Hyperparameters

    For hyperparameters, I experimented with a handful of choices manually and picked the one with the best validation loss and Recall@1 performance. Here are the final choices.

    from sentence_transformers import SentenceTransformerTrainingArguments

    # hyperparameters
    num_epochs = 2
    batch_size = 16
    lr = 1e-4
    finetuned_model_name = "clip-title-thumbnail-embeddings"

    train_args = SentenceTransformerTrainingArguments(
        output_dir=f"models/{finetuned_model_name}",
        num_train_epochs=num_epochs,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        learning_rate=lr,
        # Evaluation settings
        eval_strategy="epoch",
        eval_steps=1,
        logging_steps=1,
    )

    With our loss and hyperparameters defined, we can train the model using the SentenceTransformerTrainer().

    from sentence_transformers import SentenceTransformerTrainer

    trainer = SentenceTransformerTrainer(
        model=model,
        args=train_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["valid"],
        loss=loss,
        evaluator=[evaluator_recall_train, evaluator_recall_valid],
    )
    trainer.train()

    Model training is an iterative process where you may explore dozens of models for different choices of trainable parameters, loss functions, and hyperparameters.

    However, I highly recommend keeping these experiments as simple as possible. If you find yourself spending too much time tweaking training args to get your model to converge, there’s probably something fundamentally wrong with your data (speaking from experience 😅).

    As a final step, we can evaluate the model’s Recall@1 score on the test set. These data were not used for training or hyperparameter tuning, so they give us an unbiased assessment of the model.

    evaluator_recall_test = create_recall_evaluator("test")

    print("Train:", evaluator_recall_train(model))
    print("Valid:", evaluator_recall_valid(model))
    print("Test:", evaluator_recall_test(model))

    # >> Train: {'yt-title-thumbnail-train_Recall@1': 0.8490566037735849}
    # >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.9090909090909091}
    # >> Test: {'yt-title-thumbnail-test_Recall@1': 0.75}

    We see that the model performs well across all three datasets, with 75% Recall@1 on the test set. In other words, 75% of the time, the model correctly matches a given thumbnail to its original title. Additionally, the recall on the validation dataset increases by about 27 percentage points (from 64% to 91%)!

    Multimodal embedding models, like CLIP, unlock countless 0-shot use cases such as image classification and retrieval. Here, we saw how we can fine-tune such a model to adapt it to a specialized domain (i.e. my YouTube titles and thumbnails).

    Although CLIP is a small model by today’s standards (~500M parameters) and our training dataset was tiny, the final model still demonstrated strong performance on this task. This highlights the power of fine-tuning.

    If you have any questions or suggestions for future content, let me know in the comments 🙂

    More on Multimodal AI 👇

    Multimodal AI


