    Fine-tuning Multimodal Embedding Models | by Shaw Talebi



The first (and most important) step of any fine-tuning process is data collection. Here, I extracted title-thumbnail pairs from my channel in a 2-step process.

First, I used YouTube’s search API to extract the video IDs for all the videos on my channel. Second, I used YouTube’s video API to extract the title and thumbnail URL of each of my long-form videos (i.e. longer than 3 min).

# imports
from top_secret import my_key
import requests
from isodate import parse_duration

import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from datasets import DatasetDict, Dataset

channel_id = 'UCa9gErQ9AE5jT2DZLjXBIdA' # my YouTube channel ID
page_token = None # initialize page token
url = 'https://www.googleapis.com/youtube/v3/search' # YouTube search API

# extract video data across multiple search result pages
video_id_list = []

while page_token != 0:
    params = {
        "key": my_key,
        'channelId': channel_id,
        'part': ["snippet", "id"],
        'order': "date",
        'maxResults': 50,
        'pageToken': page_token
    }
    response = requests.get(url, params=params)

    for raw_item in dict(response.json())['items']:

        # only execute for YouTube videos
        if raw_item['id']['kind'] != "youtube#video":
            continue

        # grab video ids
        video_id_list.append(raw_item['id']['videoId'])

    try:
        # grab next page token
        page_token = dict(response.json())['nextPageToken']
    except:
        # if no next page token, kill while loop
        page_token = 0

Note that you’ll need a YouTube API key to run the above Python code, which you can create using the Google Cloud Console. To adapt this to your channel, you just need to change the channel_id variable.
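For reference, the top_secret import above is just a local Python file holding the key. A minimal sketch (the file and variable names simply need to match the import; keep this file out of version control):

# top_secret.py
my_key = "YOUR_YOUTUBE_API_KEY" # placeholder; paste your actual API key here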

# extract video titles and thumbnails
url = "https://www.googleapis.com/youtube/v3/videos"
video_data_list = []

for video_id in video_id_list:

    params = {
        "part": ["snippet", "contentDetails"],
        "id": video_id,
        "key": my_key,
    }
    response = requests.get(url, params=params)

    raw_dict = dict(response.json())['items'][0]

    # only process videos longer than 3 minutes (180 s)
    iso_duration = raw_dict['contentDetails']["duration"]
    if parse_duration(iso_duration).total_seconds() < 180:
        continue

    # extract video data
    video_data = {}
    video_data['video_id'] = video_id
    video_data['title'] = raw_dict['snippet']['title']
    video_data['thumbnail_url'] = raw_dict['snippet']['thumbnails']['high']['url']

    # append data to list
    video_data_list.append(video_data)

As an additional step, I created negative thumbnail-title pairs. We can use these during the training process to not only guide the model with examples of which embeddings should be close together (i.e. positive pairs), but also which embeddings should be far apart (i.e. negative pairs).

To do this, I computed the similarity between all possible title pairs using the Sentence Transformers library. Then, for each positive pair, I matched the least similar title as a negative example (ensuring there were no duplicates).

# store data in dataframe
df = pd.DataFrame(video_data_list)

# Load the model
model = SentenceTransformer("all-mpnet-base-v2")

# Encode all titles
embeddings = model.encode(df['title'].to_list())

# compute similarities
similarities = model.similarity(embeddings, embeddings)

# match the least similar title to each positive pair as the negative match
similarities_argsorted = np.argsort(similarities.numpy(), axis=1)
negative_pair_index_list = []

for i in range(len(similarities)):

    # Start with the smallest similarity index for the current row
    j = 0
    index = int(similarities_argsorted[i][j])

    # Ensure the index is unique
    while index in negative_pair_index_list:
        j += 1 # Move to the next smallest index
        index = int(similarities_argsorted[i][j]) # Fetch the next smallest index

    negative_pair_index_list.append(index)

# add negative pairs to df
df['title_neg'] = df['title'].iloc[negative_pair_index_list].values

Finally, I created a train-valid-test split and pushed the dataset to the Hugging Face Hub.

# Shuffle the dataset
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Split into train, validation, and test sets
train_frac = 0.7
valid_frac = 0.15
test_frac = 0.15

# define train and validation size
train_size = int(train_frac * len(df))
valid_size = int(valid_frac * len(df))

# create train, validation, and test datasets
df_train = df[:train_size]
df_valid = df[train_size:train_size + valid_size]
df_test = df[train_size + valid_size:]

# Convert the pandas DataFrames back to Hugging Face Datasets
train_ds = Dataset.from_pandas(df_train)
valid_ds = Dataset.from_pandas(df_valid)
test_ds = Dataset.from_pandas(df_test)

# Combine into a DatasetDict
dataset_dict = DatasetDict({
    'train': train_ds,
    'valid': valid_ds,
    'test': test_ds
})

# push data to hub
dataset_dict.push_to_hub("shawhin/yt-title-thumbnail-pairs")

Although we have all the data we need for fine-tuning, it’s still not in a suitable format for training. More specifically, we need to convert our image URLs to PIL image objects and organize our data into (anchor, positive, negative) triplets, i.e., a thumbnail, its corresponding title, and a negative title, respectively.

We can process all three data splits (i.e. train, valid, and test) in the following way using the Hugging Face Datasets library.

from PIL import Image
from datasets import load_dataset

# load dataset
dataset = load_dataset("shawhin/yt-title-thumbnail-pairs")

# define preprocessing function
def preprocess(batch):
    """
    Preprocessing data (without augmentations) for the test set
    """
    # get images from urls
    image_list = [Image.open(requests.get(url, stream=True).raw)
                  for url in batch["thumbnail_url"]]

    # return columns with standard names
    return {
        "anchor": image_list,
        "positive": batch["title"],
        "negative": batch["title_neg"]
    }

# remove columns not relevant to training
columns_to_remove = [col for col in dataset['train'].column_names
                     if col not in ['anchor', 'positive', 'negative']]
# apply transformations
dataset = dataset.map(preprocess, batched=True,
                      remove_columns=columns_to_remove)

It’s important that we order our columns as (anchor, positive, negative) triplets, because this is the format expected by the loss function we’ll use during training (which I learned the hard way).
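If you want to enforce that column order explicitly, a one-liner like the following should do it (a hedged sketch; select_columns in the Datasets library returns columns in the order listed):

# enforce (anchor, positive, negative) column order explicitly
dataset = dataset.select_columns(["anchor", "positive", "negative"])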

Training involves optimizing a model’s parameters to minimize a loss function. However, this value (i.e. a contrastive loss) is rarely helpful in assessing the model’s performance on a downstream task (e.g. matching titles to thumbnails).

A quantity that is more insightful, in this case, is the model’s ability to correctly match a given thumbnail to the correct title among several candidates. This is denoted Recall@1.

We can implement an evaluator compatible with the Sentence Transformers library to compute this metric. Since the code is quite long, I won’t paste it here, but the curious reader can find it in Cell 12 of this notebook.
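For a sense of what that evaluator does, here is a minimal sketch (the real implementation lives in the notebook; the class name and constructor arguments mirror how it’s called in the next snippet, but the body is my rough reconstruction):

import numpy as np
from sentence_transformers.evaluation import SentenceEvaluator

class ImageTextRetrievalEvaluator(SentenceEvaluator):
    """Minimal sketch: Recall@k for matching each thumbnail to its title."""
    def __init__(self, images, texts, name="", k=1):
        self.images = images # anchor thumbnails (PIL images)
        self.texts = texts   # corresponding titles
        self.name = name
        self.k = k

    def __call__(self, model, output_path=None, epoch=-1, steps=-1):
        # embed thumbnails and titles into the shared vector space
        img_emb = model.encode(self.images)
        txt_emb = model.encode(self.texts)

        # cosine similarity of every thumbnail against every title
        sims = model.similarity(img_emb, txt_emb).numpy()

        # a "hit" means the true title (diagonal entry i) ranks in the top k
        hits = sum(i in np.argsort(-sims[i])[:self.k] for i in range(len(sims)))

        return {f"{self.name}_Recall@{self.k}": hits / len(sims)}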

# function to create new evaluator given data split
def create_recall_evaluator(set_name, k=1):
    """
    Create triplet evaluator for "train", "valid", or "test" split
    """

    return ImageTextRetrievalEvaluator(
        images=dataset[f"{set_name}"]["anchor"],
        texts=dataset[f"{set_name}"]["positive"],
        name=f"yt-title-thumbnail-{set_name}",
        k=k
    )

# Create new evaluator with Recall@k
evaluator_recall_train = create_recall_evaluator("train", k=1)
evaluator_recall_valid = create_recall_evaluator("valid", k=1)

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))

# >> Train: {'yt-title-thumbnail-train_Recall@1': 0.660377358490566}
# >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.6363636363636364}

We can see the model already has decent performance out of the box, with correct titles being matched 66% of the time.

There are 3 key things we must do before training the model. Namely, choose which parameters to train, pick a loss function, and set hyperparameters.

    Trainable Parameters

The key limitation of this project is that I’ve only posted 76 YouTube videos (as of writing this). After the validation and test splits, this leaves only 53 examples for training.

Since we have so few training examples, limiting the number of parameters we train is a good idea. In this case, I only train the final projection layer of the model, which maps the text and image embeddings into a shared vector space. This is about 1M parameters total.

# import model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/clip-ViT-L-14")

# pick specific layers to train (note: you can add more layers to this list)
trainable_layers_list = ['projection']

# Apply freezing configuration
for name, param in model.named_parameters():

    # freeze all params
    param.requires_grad = False

    # unfreeze layers in trainable_layers_list
    if any(layer in name for layer in trainable_layers_list):
        param.requires_grad = True

# Count total and trainable parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
print(f"% of trainable parameters: {100*trainable_params/total_params:.2f}%")

# >> Total parameters: 427,616,513
# >> Trainable parameters: 1,376,256
# >> % of trainable parameters: 0.32%

Loss function

Here, I use the Multiple Negatives Ranking Loss from the Sentence Transformers library (which works with single negatives like in this case). It works by maximizing the similarity between positive pairs while minimizing the similarity between negative pairs. Here’s what the loss function looks like for the single-negative case [2].

Multiple negatives loss function (with only one negative). Image by author.
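For intuition, here is my reconstruction of that formula: with a single negative, the loss reduces to a softmax cross-entropy over one positive and one negative title, where $s(\cdot,\cdot)$ is the (scaled) cosine similarity between embeddings:

$$\mathcal{L} = -\log \frac{e^{s(a,\,p)}}{e^{s(a,\,p)} + e^{s(a,\,n)}}$$

Here $a$ is the anchor (thumbnail) embedding, $p$ its true title, and $n$ the negative title. Minimizing $\mathcal{L}$ pushes $s(a,p)$ up and $s(a,n)$ down, which is exactly the behavior described above.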
from sentence_transformers.losses import MultipleNegativesRankingLoss

# define loss
loss = MultipleNegativesRankingLoss(model)

    Hyperparameters

For hyperparameters, I experimented with a handful of choices manually and picked the one with the best validation loss and Recall@1 performance. Here are the final choices.

from sentence_transformers import SentenceTransformerTrainingArguments

# hyperparameters
num_epochs = 2
batch_size = 16
lr = 1e-4
finetuned_model_name = "clip-title-thumbnail-embeddings"

train_args = SentenceTransformerTrainingArguments(
    output_dir=f"models/{finetuned_model_name}",
    num_train_epochs=num_epochs,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    learning_rate=lr,
    # Evaluation settings
    eval_strategy="epoch",
    eval_steps=1,
    logging_steps=1,
)

With our loss and hyperparameters defined, we can train the model using the SentenceTransformerTrainer().

from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,
    args=train_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["valid"],
    loss=loss,
    evaluator=[evaluator_recall_train, evaluator_recall_valid],
)
trainer.train()

Model training is an iterative process where you may explore dozens of models for different choices of trainable parameters, loss functions, and hyperparameters.

However, I highly recommend keeping these experiments as simple as possible. If you find yourself spending too much time tweaking training args to get your model to converge, there’s probably something fundamentally wrong with your data (speaking from experience 😅).

As a final step, we can evaluate the model’s Recall@1 score on the test set. These data were not used for training or hyperparameter tuning, so this gives us an unbiased assessment of the model.

evaluator_recall_test = create_recall_evaluator("test")

print("Train:", evaluator_recall_train(model))
print("Valid:", evaluator_recall_valid(model))
print("Test:", evaluator_recall_test(model))

# >> Train: {'yt-title-thumbnail-train_Recall@1': 0.8490566037735849}
# >> Valid: {'yt-title-thumbnail-valid_Recall@1': 0.9090909090909091}
# >> Test: {'yt-title-thumbnail-test_Recall@1': 0.75}

We see that the model performs well across all three datasets, with 75% Recall@1 on the test set. In other words, 75% of the time, the model correctly matches a given thumbnail to its original title. Additionally, Recall@1 on the validation set increases by 27 percentage points (from 64% to 91%)!

Multimodal embedding models, like CLIP, unlock countless 0-shot use cases such as image classification and retrieval. Here, we saw how we can fine-tune such a model to adapt it to a specialized domain (i.e. my YouTube titles and thumbnails).

Although CLIP is a small model by today’s standards (~500M parameters) and our training dataset was tiny, the final model still demonstrated strong performance on this task. This highlights the power of fine-tuning.

If you have any questions or suggestions for future content, let me know in the comments 🙂

More on Multimodal AI 👇

Multimodal AI


