Multimodal Search Engine Agents Powered by BLIP-2 and Gemini

This put up was co-authored with Rafael Guedes.

Introduction

Conventional fashions can solely course of a single sort of knowledge, akin to textual content, pictures, or tabular knowledge. Multimodality is a trending idea within the AI analysis group, referring to a mannequin’s capability to be taught from a number of varieties of knowledge concurrently. This new expertise (probably not new, however considerably improved in the previous few months) has quite a few potential purposes that can remodel the consumer expertise of many merchandise.

One good instance can be the brand new approach engines like google will work sooner or later, the place customers can enter queries utilizing a mix of modalities, akin to textual content, pictures, audio, and so forth. One other instance may very well be enhancing AI-powered buyer help programs for voice and textual content inputs. In e-commerce, they’re enhancing product discovery by permitting customers to look utilizing pictures and textual content. We are going to use the latter as our case examine on this article.

The frontier AI analysis labs are transport a number of fashions that help a number of modalities each month. CLIP and DALL-E by OpenAI and BLIP-2 by Salesforce mix picture and textual content. ImageBind by Meta expanded the a number of modality idea to 6 modalities (textual content, audio, depth, thermal, picture, and inertial measurement items).

On this article, we’ll discover BLIP-2 by explaining its structure, the best way its loss operate works, and its coaching course of. We additionally current a sensible use case that mixes BLIP-2 and Gemini to create a multimodal style search agent that may help prospects to find the perfect outfit based mostly on both textual content or textual content and picture prompts.

As all the time, the code is out there on our GitHub.

BLIP-2: a multimodal mannequin

BLIP-2 (Bootstrapped Language-Picture Pre-Coaching) [1] is a vision-language mannequin designed to resolve duties akin to visible query answering or multimodal reasoning based mostly on inputs of each modalities: picture and textual content. As we’ll see under, this mannequin was developed to handle two major challenges within the vision-language area:

Cut back computational value utilizing frozen pre-trained visible encoders and LLMs, drastically lowering the coaching assets wanted in comparison with a joint coaching of imaginative and prescient and language networks.
Bettering visual-language alignment by introducing Q-Former. Q-Former brings the visible and textual embeddings nearer, resulting in improved reasoning process efficiency and the power to carry out multimodal retrieval.

Structure

The structure of BLIP-2 follows a modular design that integrates three modules:

Visible Encoder is a frozen visible mannequin, akin to ViT, that extracts visible embeddings from the enter pictures (that are then utilized in downstream duties).
Querying Transformer (Q-Former) is the important thing to this structure. It consists of a trainable light-weight transformer that acts as an intermediate layer between the visible and language fashions. It’s liable for producing contextualized queries from the visible embeddings in order that they are often processed successfully by the language mannequin.
LLM is a frozen pre-trained LLM that processes refined visible embeddings to generate textual descriptions or solutions.

Determine 2: BLIP-2 structure (picture by creator)

Loss Features

BLIP-2 has three loss capabilities to coach the Q-Former module:

Picture-text contrastive loss [2] enforces the alignment between visible and textual content embeddings by maximizing the similarity of paired image-text representations whereas pushing aside dissimilar pairs.
Picture-text matching loss [3] is a binary classification loss that goals to make the mannequin be taught fine-grained alignments by predicting whether or not a textual content description matches the picture (constructive, i.e., goal=1) or not (unfavourable, i.e., goal=0).
Picture-grounded textual content technology loss [4] is a cross-entropy loss utilized in LLMs to foretell the chance of the subsequent token within the sequence. The Q-Former structure doesn’t enable interactions between the picture embeddings and the textual content tokens; due to this fact, the textual content have to be generated based mostly solely on the visible data, forcing the mannequin to extract related visible options.

For each image-text contrastive loss and image-text matching loss, the authors used in-batch unfavourable sampling, which implies that if we’ve a batch measurement of 512, every image-text pair has one constructive pattern and 511 unfavourable samples. This strategy will increase effectivity since unfavourable samples are taken from the batch, and there’s no want to look your complete dataset. It additionally gives a extra numerous set of comparisons, resulting in a greater gradient estimation and sooner convergence.

Determine 3: Coaching losses defined (picture by creator)

Coaching Course of

The coaching of BLIP-2 consists of two phases:

Stage 1 – Bootstrapping visual-language illustration:

The mannequin receives pictures as enter which might be transformed to an embedding utilizing the frozen visible encoder.
Along with these pictures, the mannequin receives their textual content descriptions, that are additionally transformed into embedding.
The Q-Former is educated utilizing image-text contrastive loss, making certain that the visible embeddings align carefully with their corresponding textual embeddings and get additional away from the non-matching textual content descriptions. On the similar time, the image-text matching loss helps the mannequin develop fine-grained representations by studying to categorise whether or not a given textual content accurately describes the picture or not.

Determine 4: Stage 1 coaching course of (picture by creator)

Stage 2 – Bootstrapping vision-to-language technology:

The pre-trained language mannequin is built-in into the structure to generate textual content based mostly on the beforehand realized representations.
The main focus shifts from alignment to textual content technology through the use of the image-grounded textual content technology loss which improves the mannequin capabilities of reasoning and textual content technology.

Determine 5: Stage 2 coaching course of (picture by the creator)

Making a Multimodal Style Search Agent utilizing BLIP-2 and Gemini

On this part, we’ll leverage the multimodal capabilities of BLIP-2 to construct a style assistant search agent that may obtain enter textual content and/or pictures and return suggestions. For the dialog capabilities of the agent, we’ll use Gemini 1.5 Professional hosted in Vertex AI, and for the interface, we’ll construct a Streamlit app.

The style dataset used on this use case is licensed beneath the MIT license and will be accessed by means of the next hyperlink: Fashion Product Images Dataset. It consists of greater than 44k pictures of style merchandise.

Step one to make this attainable is to arrange a Vector DB. This allows the agent to carry out a vectorized search based mostly on the picture embeddings of the gadgets obtainable within the retailer and the textual content or picture embeddings from the enter. We use docker and docker-compose to assist us arrange the surroundings:

Docker-Compose with Postgres (the database) and the PGVector extension that enables vectorized search.

providers:
  postgres:
    container_name: container-pg
    picture: ankane/pgvector
    hostname: localhost
    ports:
      - "5432:5432"
    env_file:
      - ./env/postgres.env
    volumes:
      - postgres-data:/var/lib/postgresql/knowledge
    restart: unless-stopped

  pgadmin:
    container_name: container-pgadmin
    picture: dpage/pgadmin4
    depends_on:
      - postgres
    ports:
      - "5050:80"
    env_file:
      - ./env/pgadmin.env
    restart: unless-stopped

volumes:
  postgres-data:

Postgres env file with the variables to log into the database.

POSTGRES_DB=postgres
POSTGRES_USER=admin
POSTGRES_PASSWORD=root

Pgadmin env file with the variables to log into the UI for guide querying the database (non-compulsory).

[email protected] 
PGADMIN_DEFAULT_PASSWORD=root

Connection env file with all of the parts to make use of to hook up with PGVector utilizing Langchain.

DRIVER=psycopg
HOST=localhost
PORT=5432
DATABASE=postgres
USERNAME=admin
PASSWORD=root

As soon as the Vector DB is ready up and operating (docker-compose up -d), it’s time to create the brokers and instruments to carry out a multimodal search. We construct two brokers to resolve this use case: one to grasp what the consumer is requesting and one other one to supply the advice:

The classifier is liable for receiving the enter message from the client and extracting which class of garments the consumer is in search of, for instance, t-shirts, pants, footwear, jerseys, or shirts. It’ll additionally return the variety of gadgets the client needs in order that we are able to retrieve the precise quantity from the Vector DB.

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Discipline

class ClassifierOutput(BaseModel):
    """
    Information construction for the mannequin's output.
    """

    class: record = Discipline(
        description="An inventory of garments class to seek for ('t-shirt', 'pants', 'footwear', 'jersey', 'shirt')."
    )
    number_of_items: int = Discipline(description="The variety of gadgets we should always retrieve.")

class Classifier:
    """
    Classifier class for classification of enter textual content.
    """

    def __init__(self, mannequin: ChatVertexAI) -> None:
        """
        Initialize the Chain class by creating the chain.
        Args:
            mannequin (ChatVertexAI): The LLM mannequin.
        """
        tremendous().__init__()

        parser = PydanticOutputParser(pydantic_object=ClassifierOutput)

        text_prompt = """
        You're a style assistant skilled on understanding what a buyer wants and on extracting the class or classes of garments a buyer needs from the given textual content.
        Textual content:
        {textual content}

        Directions:
        1. Learn rigorously the textual content.
        2. Extract the class or classes of garments the client is in search of, it may be:
            - t-shirt if the custimer is in search of a t-shirt.
            - pants if the client is in search of pants.
            - jacket if the client is in search of a jacket.
            - footwear if the client is in search of footwear.
            - jersey if the client is in search of a jersey.
            - shirt if the client is in search of a shirt.
        3. If the client is in search of a number of gadgets of the identical class, return the variety of gadgets we should always retrieve. If not specfied however the consumer requested for greater than 1, return 2.
        4. If the client is in search of a number of class, the variety of gadgets must be 1.
        5. Return a legitimate JSON with the classes discovered, the important thing have to be 'class' and the worth have to be a listing with the classes discovered and 'number_of_items' with the variety of gadgets we should always retrieve.

        Present the output as a legitimate JSON object with none further formatting, akin to backticks or further textual content. Make sure the JSON is accurately structured in keeping with the schema offered under.
        {format_instructions}

        Reply:
        """

        immediate = PromptTemplate.from_template(
            text_prompt, partial_variables={"format_instructions": parser.get_format_instructions()}
        )
        self.chain = immediate | mannequin | parser

    def classify(self, textual content: str) -> ClassifierOutput:
        """
        Get the class from the mannequin based mostly on the textual content context.
        Args:
            textual content (str): consumer message.
        Returns:
            ClassifierOutput: The mannequin's reply.
        """
        attempt:
            return self.chain.invoke({"textual content": textual content})
        besides Exception as e:
            elevate RuntimeError(f"Error invoking the chain: {e}")

The assistant is liable for answering with a customized advice retrieved from the Vector DB. On this case, we’re additionally leveraging the multimodal capabilities of Gemini to investigate the pictures retrieved and produce a greater reply.

from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Discipline

class AssistantOutput(BaseModel):
    """
    Information construction for the mannequin's output.
    """

    reply: str = Discipline(description="A string with the style recommendation for the client.")

class Assistant:
    """
    Assitant class for offering style recommendation.
    """

    def __init__(self, mannequin: ChatVertexAI) -> None:
        """
        Initialize the Chain class by creating the chain.
        Args:
            mannequin (ChatVertexAI): The LLM mannequin.
        """
        tremendous().__init__()

        parser = PydanticOutputParser(pydantic_object=AssistantOutput)

        text_prompt = """
        You're employed for a style retailer and you're a style assistant skilled on understanding what a buyer wants.
        Based mostly on the gadgets which might be obtainable within the retailer and the client message under, present a style recommendation for the client.
        Variety of gadgets: {number_of_items}
        
        Photographs of things:
        {gadgets}

        Buyer message:
        {customer_message}

        Directions:
        1. Test rigorously the pictures offered.
        2. Learn rigorously the client wants.
        3. Present a style recommendation for the client based mostly on the gadgets and buyer message.
        4. Return a legitimate JSON with the recommendation, the important thing have to be 'reply' and the worth have to be a string together with your recommendation.

        Present the output as a legitimate JSON object with none further formatting, akin to backticks or further textual content. Make sure the JSON is accurately structured in keeping with the schema offered under.
        {format_instructions}

        Reply:
        """

        immediate = PromptTemplate.from_template(
            text_prompt, partial_variables={"format_instructions": parser.get_format_instructions()}
        )
        self.chain = immediate | mannequin | parser

    def get_advice(self, textual content: str, gadgets: record, number_of_items: int) -> AssistantOutput:
        """
        Get recommendation from the mannequin based mostly on the textual content and gadgets context.
        Args:
            textual content (str): consumer message.
            gadgets (record): gadgets discovered for the client.
            number_of_items (int): variety of gadgets to be retrieved.
        Returns:
            AssistantOutput: The mannequin's reply.
        """
        attempt:
            return self.chain.invoke({"customer_message": textual content, "gadgets": gadgets, "number_of_items": number_of_items})
        besides Exception as e:
            elevate RuntimeError(f"Error invoking the chain: {e}")

By way of instruments, we outline one based mostly on BLIP-2. It consists of a operate that receives a textual content or picture as enter and returns normalized embeddings. Relying on the enter, the embeddings are produced utilizing the textual content embedding mannequin or the picture embedding mannequin of BLIP-2.

from typing import Optionally available

import numpy as np
import torch
import torch.nn.useful as F
from PIL import Picture
from PIL.JpegImagePlugin import JpegImageFile
from transformers import AutoProcessor, Blip2TextModelWithProjection, Blip2VisionModelWithProjection

PROCESSOR = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
TEXT_MODEL = Blip2TextModelWithProjection.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float32).to(
    "cpu"
)
IMAGE_MODEL = Blip2VisionModelWithProjection.from_pretrained(
    "Salesforce/blip2-itm-vit-g", torch_dtype=torch.float32
).to("cpu")

def generate_embeddings(textual content: Optionally available[str] = None, picture: Optionally available[JpegImageFile] = None) -> np.ndarray:
    """
    Generate embeddings from textual content or picture utilizing the Blip2 mannequin.
    Args:
        textual content (Optionally available[str]): buyer enter textual content
        picture (Optionally available[Image]): buyer enter picture
    Returns:
        np.ndarray: embedding vector
    """
    if textual content:
        inputs = PROCESSOR(textual content=textual content, return_tensors="pt").to("cpu")
        outputs = TEXT_MODEL(**inputs)
        embedding = F.normalize(outputs.text_embeds, p=2, dim=1)[:, 0, :].detach().numpy().flatten()
    else:
        inputs = PROCESSOR(pictures=picture, return_tensors="pt").to("cpu", torch.float16)
        outputs = IMAGE_MODEL(**inputs)
        embedding = F.normalize(outputs.image_embeds, p=2, dim=1).imply(dim=1).detach().numpy().flatten()

    return embedding

Word that we create the connection to PGVector with a unique embedding mannequin as a result of it’s necessary, though it is not going to be used since we’ll retailer the embeddings produced by BLIP-2 immediately.

Within the loop under, we iterate over all classes of garments, load the pictures, and create and append the embeddings to be saved within the vector db into a listing. Additionally, we retailer the trail to the picture as textual content in order that we are able to render it in our Streamlit app. Lastly, we retailer the class to filter the outcomes based mostly on the class predicted by the classifier agent.

import glob
import os

from dotenv import load_dotenv
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_postgres.vectorstores import PGVector
from PIL import Picture

from blip2 import generate_embeddings

load_dotenv("env/connection.env")

CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.getenv("DRIVER"),
    host=os.getenv("HOST"),
    port=os.getenv("PORT"),
    database=os.getenv("DATABASE"),
    consumer=os.getenv("USERNAME"),
    password=os.getenv("PASSWORD"),
)

vector_db = PGVector(
    embeddings=HuggingFaceEmbeddings(model_name="nomic-ai/modernbert-embed-base"),  # doesn't matter for our use case
    collection_name="style",
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

if __name__ == "__main__":

    # generate picture embeddings
    # save path to picture in textual content
    # save class in metadata
    texts = []
    embeddings = []
    metadatas = []

    for class in glob.glob("pictures/*"):
        cat = class.cut up("/")[-1]
        for img in glob.glob(f"{class}/*"):
            texts.append(img)
            embeddings.append(generate_embeddings(picture=Picture.open(img)).tolist())
            metadatas.append({"class": cat})

    vector_db.add_embeddings(texts, embeddings, metadatas)

We are able to now construct our Streamlit app to talk with our assistant and ask for suggestions. The chat begins with the agent asking the way it can assist and offering a field for the client to put in writing a message and/or to add a file.

As soon as the client replies, the workflow is the next:

The classifier agent identifies which classes of garments the client is in search of and what number of items they need.
If the client uploads a file, this file goes to be transformed into an embedding, and we’ll search for comparable gadgets within the vector db, conditioned by the class of garments the client needs and the variety of items.
The gadgets retrieved and the client’s enter message are then despatched to the assistant agent to provide the advice message that’s rendered along with the pictures retrieved.
If the client didn’t add a file, the method is identical, however as a substitute of producing picture embeddings for retrieval, we create textual content embeddings.

import os

import streamlit as st
from dotenv import load_dotenv
from langchain_google_vertexai import ChatVertexAI
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_postgres.vectorstores import PGVector
from PIL import Picture

import utils
from assistant import Assistant
from blip2 import generate_embeddings
from classifier import Classifier

load_dotenv("env/connection.env")
load_dotenv("env/llm.env")

CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.getenv("DRIVER"),
    host=os.getenv("HOST"),
    port=os.getenv("PORT"),
    database=os.getenv("DATABASE"),
    consumer=os.getenv("USERNAME"),
    password=os.getenv("PASSWORD"),
)

vector_db = PGVector(
    embeddings=HuggingFaceEmbeddings(model_name="nomic-ai/modernbert-embed-base"),  # doesn't matter for our use case
    collection_name="style",
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

mannequin = ChatVertexAI(model_name=os.getenv("MODEL_NAME"), challenge=os.getenv("PROJECT_ID"), temperarture=0.0)
classifier = Classifier(mannequin)
assistant = Assistant(mannequin)

st.title("Welcome to ZAAI's Style Assistant")

user_input = st.text_input("Hello, I am ZAAI's Style Assistant. How can I allow you to at this time?")

uploaded_file = st.file_uploader("Add a picture", sort=["jpg", "jpeg", "png"])

if st.button("Submit"):

    # perceive what the consumer is asking for
    classification = classifier.classify(user_input)

    if uploaded_file:

        picture = Picture.open(uploaded_file)
        picture.save("input_image.jpg")
        embedding = generate_embeddings(picture=picture)

    else:

        # create textual content embeddings in case the consumer doesn't add a picture
        embedding = generate_embeddings(textual content=user_input)

    # create a listing of things to be retrieved and the trail
    retrieved_items = []
    retrieved_items_path = []
    for merchandise in classification.class:
        garments = vector_db.similarity_search_by_vector(
            embedding, ok=classification.number_of_items, filter={"class": {"$in": [item]}}
        )
        for dress in garments:
            retrieved_items.append({"bytesBase64Encoded": utils.encode_image_to_base64(dress.page_content)})
            retrieved_items_path.append(dress.page_content)

    # get assistant's advice
    assistant_output = assistant.get_advice(user_input, retrieved_items, len(retrieved_items))
    st.write(assistant_output.reply)

    cols = st.columns(len(retrieved_items)+1)
    for col, retrieved_item in zip(cols, ["input_image.jpg"]+retrieved_items_path):
        col.picture(retrieved_item)

    user_input = st.text_input("")

else:
    st.warning("Please present textual content.")

Each examples will be seen under:

Determine 6 exhibits an instance the place the client uploaded a picture of a pink t-shirt and requested the agent to finish the outfit.

Determine 6: Instance of textual content and picture enter (picture by creator)

Determine 7 exhibits a extra simple instance the place the client requested the agent to indicate them black t-shirts.

Determine 7: Instance of textual content enter (picture by creator)

Conclusion

Multimodal AI is now not only a analysis matter. It’s getting used within the trade to reshape the best way prospects work together with firm catalogs. On this article, we explored how multimodal fashions like BLIP-2 and Gemini will be mixed to handle real-world issues and supply a extra personalised expertise to prospects in a scalable approach.

We explored the structure of BLIP-2 in depth, demonstrating the way it bridges the hole between textual content and picture modalities. To increase its capabilities, we developed a system of brokers, every specializing in numerous duties. This method integrates an LLM (Gemini) and a vector database, enabling retrieval of the product catalog utilizing textual content and picture embeddings. We additionally leveraged Gemini’s multimodal reasoning to enhance the gross sales assistant agent’s responses to be extra human-like.

With instruments like BLIP-2, Gemini, and PG Vector, the way forward for multimodal search and retrieval is already occurring, and the various search engines of the long run will look very totally different from those we use at this time.

About me

Serial entrepreneur and chief within the AI area. I develop AI merchandise for companies and spend money on AI-focused startups.

Founder @ ZAAI | LinkedIn | X/Twitter

References

[1] Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi. 2023. BLIP-2: Bootstrapping Language-Picture Pre-training with Frozen Picture Encoders and Massive Language Fashions. arXiv:2301.12597

[2] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, Dilip Krishnan. 2020. Supervised Contrastive Studying. arXiv:2004.11362

[3] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Deepak Gotmare, Shafiq Joty, Caiming Xiong, Steven Hoi. 2021. Align earlier than Fuse: Imaginative and prescient and Language Illustration Studying with Momentum Distillation. arXiv:2107.07651

[4] Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, Hsiao-Wuen Hon. 2019. Unified Language Mannequin Pre-training for Pure Language Understanding and Era. arXiv:1905.03197

Source link

Multimodal Search Engine Agents Powered by BLIP-2 and Gemini

User Privacy Concerns with AI Sexting Apps

AI in Cryptocurrency Trading: Boon or Bane?

Sesame Speech Model: How This Viral AI Model Generates Human-Like Speech

Layers of the AI Stack, Explained Simply

An LLM-Based Workflow for Automated Tabular Data Validation

Plotly’s AI Tools Are Redefining Data Science Workflows

The future of AI processing

How one midwest manufacturer automated heavy lifting and unlocked new value with PCC

Verge Next licenses hubless motor tech, enabling more e-motos

EU-Startups Summit 2025: Practical info you need before landing in Malta

Featured Picks

Why venture capitalists should care about their portfolio companies’ brand stories

Best Coolers of 2025 – CNET

AI “godfather” Yoshua Bengio joins UK project to prevent AI catastrophes

Multimodal Search Engine Agents Powered by BLIP-2 and Gemini

Introduction

BLIP-2: a multimodal mannequin

Structure

Loss Features

Coaching Course of

Stage 1 – Bootstrapping visual-language illustration:

Stage 2 – Bootstrapping vision-to-language technology:

Making a Multimodal Style Search Agent utilizing BLIP-2 and Gemini

Conclusion

About me

References

Related Posts