Over the last couple of years, I've worked mostly with large language models: training, fine-tuning, prompting, and so on, since this was in high demand in the market and among users. But I believe that LLMs working mainly on text are just the beginning of GenAI. At some point, everybody will want physical AI, where models can see, hear, feel, and reason in a more grounded, human way.
So let's get started with multimodality. In this notebook, I introduce LLaVA, an architecture capable of interpreting both images and text to generate multimodal responses.
In this tutorial, we're going to use lighter-weight components suitable for running the notebook on a free-tier environment such as Google Colab.
The components we're going to use are:
1️⃣ CLIP-ViT B/32 as the image encoder
2️⃣ TinyLlama-1.1B as the language model
3️⃣ A 2-layer MLP adapter to bridge the two
Setup
Before we can dive into the code, let's set up our environment.
Let's first install the datasets library.
!pip install -U datasets
We now need to import the required packages from Hugging Face and PyTorch. These imports provide pre-trained models and utilities for multimodal processing.
import json
from pathlib import Path
import requests
import safetensors.torch
import torch
from datasets import load_dataset
from huggingface_hub import hf_hub_download
from PIL import Image
from transformers import (
    AutoConfig,
    AutoTokenizer,
    LlamaTokenizer,
    LlavaConfig,
    LlavaForConditionalGeneration,
    LlavaProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)
from transformers.models.clip.modeling_clip import CLIPVisionModel
from transformers.models.clip.image_processing_clip import CLIPImageProcessor
Download pre-trained model components
Our LLaVA model will be composed of the two pre-trained backbones introduced above: the CLIP-ViT B/32 image encoder and the TinyLlama-1.1B language model.
We use the hf_hub_download utility to retrieve their pre-trained weights from the Hugging Face Hub:
vision_backbone_name = "openai/clip-vit-base-patch32"
text_backbone_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
_ = hf_hub_download(
    vision_backbone_name, filename="pytorch_model.bin", local_dir="/content"
)
_ = hf_hub_download(
    text_backbone_name, filename="model.safetensors", local_dir="/content"
)
Model
Instantiate a new LLaVA model
Let's now instantiate a new LLaVA model. As explained above, a LLaVA model is composed of two parts: a visual encoder and a textual decoder, whose weights we have just downloaded.
vision_config = AutoConfig.from_pretrained(vision_backbone_name).vision_config
text_config = AutoConfig.from_pretrained(text_backbone_name)
We specify the backbone models in the LLaVA config, then instantiate the actual model with LlavaForConditionalGeneration(llava_config).
llava_config = LlavaConfig(vision_config=vision_config, text_config=text_config)
model = LlavaForConditionalGeneration(llava_config).cuda()
model
Perform some surgical operations

Previously, we said we could assemble an LLaVA model starting from a pre-trained image encoder and a pre-trained LLM. Let's do just that!
The original LLaVA model is initialised from a CLIP-ViT L/14 and a Vicuna v1.5 7B. To make things more manageable with the resources offered by the free plan of Google Colab, we'll use a CLIP-ViT B/32 and a TinyLlama 1.1B.
The only component we'll train is a 2-layer MLP adapter sitting between them.
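To make the adapter concrete, here is a minimal sketch of what such a 2-layer projector roughly looks like; the class and attribute names are illustrative rather than the exact transformers implementation, but the idea is the same: map CLIP's patch features into the LLM's embedding space.
import torch.nn as nn

class MultiModalProjectorSketch(nn.Module):
    """Illustrative 2-layer MLP: vision hidden size -> GELU -> text hidden size."""
    def __init__(self, vision_hidden_size: int, text_hidden_size: int):
        super().__init__()
        self.linear_1 = nn.Linear(vision_hidden_size, text_hidden_size)
        self.act = nn.GELU()
        self.linear_2 = nn.Linear(text_hidden_size, text_hidden_size)

    def forward(self, image_features):
        # image_features: (batch, num_patches, vision_hidden_size)
        return self.linear_2(self.act(self.linear_1(image_features)))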
In order to use the CLIP and TinyLlama models, we need to load their pre-trained weights. These weights can come in different formats, such as .safetensors or .bin. The load_weights function handles this for us: it checks the file type and calls the appropriate loading function.
def load_weights(path_to_weights: str):
    if path_to_weights.endswith(".safetensors"):
        return load_safetensors_weights(path_to_weights)
    elif path_to_weights.endswith(".bin"):
        return load_bin_weights(path_to_weights)
    else:
        raise ValueError(f"Unsupported weights file: {path_to_weights}")

def load_bin_weights(path_to_weights: str):
    return torch.load(path_to_weights, weights_only=True)

def load_safetensors_weights(path_to_weights: str):
    return safetensors.torch.load_file(path_to_weights)
vision_backbone_state_dict = load_weights("/content/pytorch_model.bin")
text_backbone_state_dict = load_weights("/content/model.safetensors")
Inject the vision backbone's weights into the model 💉
The next lines load the weights into the vision part of the model. We set strict=False to be flexible, since it lets us skip any weights that don't exactly match the model's expected structure (the CLIP checkpoint also ships parts, such as the text encoder, that our vision tower doesn't use).
incompatible_keys = model.vision_tower.load_state_dict(
    vision_backbone_state_dict, strict=False
)
assert len(incompatible_keys.missing_keys) == 0, (
    f"Missing keys in state dict: {incompatible_keys.missing_keys}"
)
incompatible_keys.unexpected_keys
Inject the text backbone's weights into the model 💉
Same logic as before, this time for the text model.
incompatible_keys = model.language_model.load_state_dict(
    text_backbone_state_dict, strict=True
)
Freeze the pre-trained components ❄️
We now want to freeze the backbone vision and text models, because we don't want to update their weights while training.
We will only train the small adapter (the MLP that connects vision and language), which is much lighter and faster to train.
_ = model.vision_tower.requires_grad_(False)
_ = model.language_model.requires_grad_(False)
# Then we define a helper function to count model parameters
def count_parameters(model, trainable_only=False):
    return sum(
        p.numel()
        for p in model.parameters()
        if not trainable_only or p.requires_grad
    )

print(f"Total parameters: {count_parameters(model)}")
print(f"Trainable parameters: {count_parameters(model, trainable_only=True)}")
Processor
Before feeding any text into our model, we need to convert words into numbers. That's what the tokenizer is for.
tokenizer = LlamaTokenizer.from_pretrained(
    text_backbone_name, additional_special_tokens=["<image>", "<pad>"]
)
tokenizer.pad_token_id = 32001
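As a quick sanity check (assuming the two added special tokens are indeed <image> and <pad>, as in the standard LLaVA setup), we can confirm the IDs the tokenizer assigned to them:
# The new tokens should sit right after the base Llama vocabulary (32,000 entries)
print(tokenizer.convert_tokens_to_ids("<image>"))  # expected: 32000
print(tokenizer.convert_tokens_to_ids("<pad>"))    # expected: 32001, matching pad_token_id above
print(len(tokenizer))                              # expected: 32002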
Below is the format we'll use to chat with our LLaVA model.
The first part is the so-called system prompt, which contains general guidelines for how the model should reply to the user.
The second part is a Jinja template (basically code) that determines how the conversation is rendered based on some structured input (see the example below).
LLAVA_CHAT_TEMPLATE = (
    "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. "
    "{% for message in messages %}{% if message['role'] == 'user' %}USER: {% else %}ASSISTANT: {% endif %}{% for item in message['content'] %}{% if item['type'] == 'text' %}{{ item['text'] }}{% elif item['type'] == 'image' %}<image>{% endif %}{% endfor %}{% if message['role'] == 'user' %} {% else %}{{eos_token}}{% endif %}{% endfor %}"
)
tokenizer.chat_template = LLAVA_CHAT_TEMPLATE
sample_messages = [
    {
        "content": [
            {
                "index": 0,
                "text": None,
                "type": "image"
            },
            {
                "index": None,
                "text": "\nWhat potential activities might be popular at this location?",
                "type": "text"
            }
        ],
        "role": "user"
    },
    {
        "content": [
            {
                "index": None,
                "text": (
                    "At this location, with a sandy path leading to the ocean where multiple boats, including "
                    "sailboats, are moored, popular activities might include boating, sailing, swimming, and "
                    "beachcombing. Additionally, the sandy path and shoreline provide an ideal setting for leisurely "
                    "strolls and picnics, while the ocean view offers a serene environment for relaxation and "
                    "photography. Depending on the specific area and available facilities, other water sports such as "
                    "kayaking, paddleboarding, and snorkeling could also be prevalent."
                ),
                "type": "text"
            }
        ],
        "role": "assistant"
    }
]
Let’s apply the chat template to our samples.
tokenizer.apply_chat_template(
    sample_messages, tokenize=False, add_generation_prompt=False
)
At this point we've set up our tokenizer and downloaded the vision model. We now bring them together into one unified processor.
processor = LlavaProcessor(
    image_processor=CLIPImageProcessor.from_pretrained(vision_backbone_name),
    tokenizer=tokenizer,
    patch_size=model.config.vision_config.patch_size,
)
processor.chat_template = LLAVA_CHAT_TEMPLATE
Since we added special tokens like <image> and <pad> to our tokenizer earlier, the model needs to resize its vocabulary to account for them too.
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=8)
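As a quick check (just a sketch), the embedding matrix should now have at least len(tokenizer) rows, rounded up to a multiple of 8:
print(model.get_input_embeddings().weight.shape)  # first dimension >= len(tokenizer), padded to a multiple of 8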
Dataset
Let's download the dataset we're going to use from Hugging Face.
The dataset, which contains image-text pairs, is publicly available and can be found here.
train_dataset = load_dataset(
    "HuggingFaceH4/llava-instruct-mix-vsft", split="train", streaming=True
)
What do our training examples look like?
next(iter(train_dataset))
How do we build a batch of examples?
The following function takes raw image-text examples and turns them into model-ready inputs. It formats the messages using the chat template, processes both the text and the image with the LlavaProcessor we defined previously, and creates proper training labels while ignoring padding.
def get_data_collator(processor, ignore_index):
    def collate_examples(examples):
        # Extract texts and images from the raw examples
        texts = []
        images = []
        for example in examples:
            messages = example["messages"]
            text = processor.tokenizer.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=False
            )
            texts.append(text)
            images.append(example["images"][0])

        # Process the inputs (tokenize text and transform images)
        batch = processor(texts, images, return_tensors="pt", padding=True)

        # Create labels
        labels = batch["input_ids"].clone()
        if processor.tokenizer.pad_token_id is not None:
            labels[labels == processor.tokenizer.pad_token_id] = ignore_index
        batch["labels"] = labels

        return batch

    return collate_examples

# NOTE: this does a bit more than a collate function should...
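As a quick usage check (just a sketch; it pulls two examples from the streaming dataset, so it needs an internet connection), we can build a small batch by hand and inspect the tensor shapes:
from itertools import islice

# Build the collator exactly as the trainer will, then feed it two raw examples
collate_fn = get_data_collator(processor, ignore_index=model.config.ignore_index)
sample_batch = collate_fn(list(islice(iter(train_dataset), 2)))

for key, value in sample_batch.items():
    # e.g. input_ids, attention_mask, pixel_values, labels
    print(key, tuple(value.shape))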
Training
Let's finally define the training arguments, including batch size, learning rate, total steps, and mixed precision (fp16) for speed. We also avoid saving checkpoints to keep things light. Then we wrap everything into a Seq2SeqTrainer, passing in the model, the dataset, and our custom collator for image-text inputs.
args = Seq2SeqTrainingArguments(
    output_dir="/content/training_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    max_steps=350,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 2e-5},
    warmup_ratio=0.05,
    logging_strategy="steps",
    logging_steps=5,
    fp16=True,
    remove_unused_columns=False,  # Important!
    optim="adamw_torch",
    report_to="none",
    save_strategy="no",  # let's not save the checkpoint to disk, otherwise it will take 5 minutes
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    data_collator=get_data_collator(
        processor, ignore_index=model.config.ignore_index,
    ),
    train_dataset=train_dataset,
)
trainer.train()
Inference
Note that to make sure inference works as expected, you should use heavier models and train for a longer time.
We'll use this image for inference:

conversation = [
    {
        "content": [
            {
                "type": "image"
            },
            {
                "text": "\nWhat is represented in the image?",
                "type": "text"
            }
        ],
        "role": "user"
    }
]
In this code block, we load an image from a URL and format a conversation using the chat template. The processor turns both into tensors. Then we move the inputs to the model's device and generate a response, letting the model describe the image based on the user's prompt.
image_url = "https://llava-vl.github.io/static/images/monalisa.jpg"
inputs_for_generation = processor(
    images=Image.open(requests.get(image_url, stream=True).raw),
    text=processor.apply_chat_template(conversation, add_generation_prompt=True),
    return_tensors="pt",
)
inputs_for_generation = inputs_for_generation.to(device=model.device)
output = trainer.model.generate(
    **inputs_for_generation, max_new_tokens=200, do_sample=False
)
print(processor.decode(output[0], skip_special_tokens=True))
Extensions and improvements
- Use a larger image encoder (e.g. CLIP-ViT Large) and LLM (e.g. Llama 3.1 8B)
- Train for longer. It takes a while for the model to figure out how to follow instructions in the presence of image features
- Follow the multi-stage training procedure adopted by the original LLaVA (a rough sketch of how the freezing could be toggled follows this list):
- Stage 1: Pre-training for Feature Alignment --> train the model on single-turn instruction data, where it is asked to briefly describe the picture. Image encoder and LLM are frozen
- Stage 2: Fine-tuning End-to-End --> train the model on multi-turn instruction data. Only the image encoder is frozen
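A minimal sketch (not the original LLaVA training code) of how the two stages could be expressed with the components we already used; vision_tower and language_model appear above, while multi_modal_projector is assumed to be the adapter's attribute name in this transformers version:
def set_training_stage(model, stage: int):
    if stage == 1:
        # Stage 1: feature alignment -- freeze both backbones, train only the adapter
        model.vision_tower.requires_grad_(False)
        model.language_model.requires_grad_(False)
        model.multi_modal_projector.requires_grad_(True)
    elif stage == 2:
        # Stage 2: end-to-end fine-tuning -- keep only the image encoder frozen
        model.vision_tower.requires_grad_(False)
        model.language_model.requires_grad_(True)
        model.multi_modal_projector.requires_grad_(True)
    else:
        raise ValueError(f"Unknown stage: {stage}")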
Working demo: huggingface.co/spaces/badayvedat/LLaVA
Conclusion
I think this small project is interesting for better understanding how multimodal models like LLaVA work. Even though we used smaller models, the main idea is the same: combine vision and language into one system that can understand images and talk about them.
Of course, the results obtained in this toy example aren't really good; there's a lot of room for improvement. But making LLaVA work in an environment with limited resources is quite challenging.
Follow me on TDS if you liked this article! 😁
💼 LinkedIn ️| 🐦 X (Twitter) | 💻 Website