
    Introducing Gemini Embeddings 2 Preview | Towards Data Science

    By Editor Times Featured · March 17, 2026 · 11 Mins Read


    Google has released a preview of its newest embedding model. This model is notable for one main reason: it can embed text, PDFs, images, audio, and video, making it a one-stop shop for embedding almost anything you'd care to throw at it.

    If you're new to embedding, you might wonder what all the fuss is about, but it turns out that embedding is one of the cornerstones of retrieval augmented generation, or RAG as it's known. In turn, RAG is one of the most fundamental applications of modern artificial intelligence processing.

    A quick recap of RAG and embedding

    RAG is a method of chunking, encoding, and storing information that can then be searched using similarity functions that match search terms to the embedded information. The encoding part turns whatever you're searching into a series of numbers called vectors: this is what embedding does. The vectors (embeddings) are then typically stored in a vector database.

    When a user enters a search term, it too is encoded as embeddings, and the resulting vectors are compared with the contents of the vector database, usually using a process called cosine similarity. The closer the search term vectors are to parts of the information in the vector store, the more relevant the search terms are to those parts of the stored data. Large language models can interpret all this, then retrieve and display the most relevant parts to the user.
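    To make this concrete, cosine similarity between two vectors is just the dot product divided by the product of their magnitudes. Here is a minimal NumPy sketch, using made-up three-dimensional "embeddings" (real ones have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector magnitudes;
    # 1.0 means identical direction, 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy vectors pointing in nearly the same direction
query = np.array([0.2, 0.9, 0.1])
doc = np.array([0.25, 0.85, 0.05])
print(f"{cosine_sim(query, doc):.3f}")  # close to 1.0
```

    This is exactly what scikit-learn's cosine_similarity does for us later, just one pair at a time.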

    There's a whole bunch of other stuff that surrounds this, like how the input data should be split up or chunked, but the embedding, storing, and retrieval are the main features of RAG processing. To help you visualise, here's a simplified schematic of a RAG process.

    Image by Nano Banana
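    As an aside, the chunking step in the schematic can be as simple as fixed-size slices with a little overlap. Here is a minimal sketch (the chunk_size and overlap values are arbitrary; production systems usually split on sentence or paragraph boundaries instead):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    # Overlapping chunks ensure a sentence cut at one boundary
    # still appears whole in the neighbouring chunk
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# 500 characters -> chunks starting at positions 0, 160, 320, 480
pieces = chunk_text("a" * 500)
print(len(pieces))  # 4
```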

    So, what's special about Gemini Embedding?

    Okay, so now that we know how important embedding is for RAG, why is Google's new Gemini embedding model such a big deal? Simply this: traditional embedding models, with a few exceptions, have been limited to text, PDFs, and other document types, and maybe images at a push.

    What Gemini now offers is true multi-modal input for embeddings. That means text, PDFs and docs, images, audio, and video. Being a preview embedding model, there are certain size limitations on the inputs right now, but hopefully you can see the direction of travel and how potentially useful this could be.

    Input limitations

    I mentioned that there are limitations on what we can input to the new Gemini embedding model. They are:

    • Text: Up to 8192 input tokens, which is about 6000 words
    • Images: Up to 6 images per request, supporting PNG and JPEG formats
    • Videos: A maximum of 2 minutes of video, in MP4 and MOV formats
    • Audio: A maximum duration of 80 seconds, supporting MP3 and WAV formats
    • Documents: Up to 6 pages long
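    Since an over-long clip just wastes an API call, it can be worth a quick local check before embedding anything. Here is a small sketch with the preview's audio limit from the list above hard-coded (these limits will presumably rise as the model matures):

```python
# Preview limit taken from the list above (subject to change)
MAX_AUDIO_SECONDS = 80

def check_audio_length(name: str, duration_seconds: float) -> None:
    # Fail fast locally rather than letting the API reject the request
    if duration_seconds > MAX_AUDIO_SECONDS:
        raise ValueError(
            f"{name} is {duration_seconds:.0f}s long; the preview model "
            f"accepts at most {MAX_AUDIO_SECONDS}s of audio"
        )

check_audio_length("fishing2.mp3", 37)  # the 37-second clip used later is fine
```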

    Okay, time to see the new embedding model in practice with some Python coding examples.

    Setting up a development environment

    To begin, let's set up a standard development environment to keep our projects separate. I'll be using the UV tool for this, but feel free to use whichever methods you're used to.

    $ uv init embed-test --python 3.13
    $ cd embed-test
    $ uv venv
    $ source .venv/bin/activate
    $ uv add google-genai jupyter numpy scikit-learn pydub audioop-lts
    
    # To run the notebook, type this in
    
    $ uv run jupyter notebook

    You'll also need a Gemini API key, which you can get from Google's AI Studio home page.

    https://aistudio.google.com

    Look for a Get API Key link near the bottom left of the screen after you've logged in. Take a note of it, as you'll need it later.
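    Rather than pasting the key directly into your notebook (as the setup code below does for brevity), it is safer to export it as an environment variable and read it at runtime. The variable name GEMINI_API_KEY here is just a convention, not a requirement:

```python
import os

def load_api_key(var_name: str = "GEMINI_API_KEY") -> str:
    # Fail fast with a clear message instead of a cryptic auth error later
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Then pass it to the SDK instead of a hard-coded string:
# client = genai.Client(api_key=load_api_key())
```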

    Please note: apart from being a user of their products, I have no association or affiliation with Google or any of its subsidiaries.

    Setup Code

    I won't talk much about embedding text or PDF documents, as these are relatively straightforward and are covered extensively elsewhere. Instead, we'll look at embedding images and audio, which are less common.

    This is the setup code, which is common to all our examples.

    import os
    import numpy as np
    from pydub import AudioSegment
    from google import genai
    from google.genai import types
    from sklearn.metrics.pairwise import cosine_similarity
    
    from IPython.display import display, Image as IPImage, Audio as IPAudio, Markdown
    
    client = genai.Client(api_key='YOUR_API_KEY')
    
    MODEL_ID = "gemini-embedding-2-preview"

    Example 1: Embedding images

    For this example, we'll embed 3 images: one of a ginger cat, one of a Labrador, and one of a yellow dolphin. We'll then set up a series of questions or phrases, each one specific to, or related to, one of the images, and see if the model can choose the most appropriate image for each question. It does this by computing a similarity score between the question and each image. The higher this score, the more pertinent the question is to the image.

    Here are the images I'm using.

    Image by Nano Banana

    So, I have two questions and two phrases.

    • Which animal is yellow
    • Which is most likely called Rover
    • There's something fishy going on here
    • A purrrfect image
    # Some helper functions
    #
    
    # Embed text
    def embed_text(text: str) -> np.ndarray:
        """Encode a text string into an embedding vector.
    
        Simply pass the string directly to embed_content.
        """
        result = client.models.embed_content(
            model=MODEL_ID,
            contents=[text],
        )
        return np.array(result.embeddings[0].values)
        
    # Embed an image
    def embed_image(image_path: str) -> np.ndarray:
    
        # Determine the MIME type from the file extension
        ext = image_path.lower().rsplit('.', 1)[-1]
        mime_map = {'png': 'image/png', 'jpg': 'image/jpeg', 'jpeg': 'image/jpeg'}
        mime_type = mime_map.get(ext, 'image/png')
    
        with open(image_path, 'rb') as f:
            image_bytes = f.read()
    
        result = client.models.embed_content(
            model=MODEL_ID,
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
            ],
        )
        return np.array(result.embeddings[0].values)
    
    # --- Define image files ---
    image_files = ["dog.png", "cat.png", "dolphin.png"]
    image_labels = ["dog", "cat", "dolphin"]
    
    # Our questions
    text_descriptions = [
        "Which animal is yellow",
        "Which is most likely called Rover",
        "There's something fishy going on here",
        "A purrrfect image"
    ]
    
    # --- Compute embeddings ---
    print("Embedding texts...")
    text_embeddings = np.array([embed_text(t) for t in text_descriptions])
    
    print("Embedding images...")
    image_embeddings = np.array([embed_image(f) for f in image_files])
    
    # Use cosine similarity to score every text against every image
    text_image_sim = cosine_similarity(text_embeddings, image_embeddings)
    
    # Print the best match for each text
    print("\nBest image match for each text:")
    for i, text in enumerate(text_descriptions):
        # np.argmax looks across row i to find the highest score among the columns
        best_idx = np.argmax(text_image_sim[i, :])
        best_image = image_labels[best_idx]
        best_score = text_image_sim[i, best_idx]
        
        print(f'  "{text}" => {best_image} (score: {best_score:.3f})')

    Here's the output.

    Embedding texts...
    Embedding images...
    
    Best image match for each text:
      "Which animal is yellow" => dolphin (score: 0.399)
      "Which is most likely called Rover" => dog (score: 0.357)
      "There's something fishy going on here" => dolphin (score: 0.302)
      "A purrrfect image" => cat (score: 0.368)

    Not too shabby. The model came up with the same answers I'd have given. How about you?

    Example 2: Embedding audio

    For the audio, I used a man's voice describing a fishing trip in which he sees a bright yellow dolphin. Click below to hear the full audio. It's about 37 seconds long.

    If you don't want to listen, here is the full transcript.

    Hi, my name is Glen, and I want to tell you about a fascinating sight I witnessed last Tuesday afternoon while out ocean fishing with some buddies. It was a warm day with a yellow sun in the sky. We were fishing for tuna and had no luck catching anything. Boy, we must have spent the best part of 5 hours out there. So, we were pretty glum as we headed back to dry land. But then, out of the blue, and I swear this is no lie, we saw a school of dolphins. Not only that, but one of them was bright yellow in colour. We never saw anything like it in our lives, but I can tell you all thoughts of a bad fishing day went out the window. It was mesmerising.

    Now, let's see if we can narrow down where the speaker talks about seeing a yellow dolphin.

    Normally, when dealing with embeddings, we're only interested in the general properties, ideas, and concepts contained in the source information. If we want to narrow down specific properties, such as where in an audio file a particular phrase occurs, or where in a video a particular action or event occurs, this is a slightly more complex task. To do it in our example, we first need to chunk the audio into smaller pieces before embedding each chunk. We then perform a similarity search on each embedded chunk before producing our final answer.

    
    # --- HELPER FUNCTIONS ---
    
    def embed_text(text: str) -> np.ndarray:
        result = client.models.embed_content(model=MODEL_ID, contents=[text])
        return np.array(result.embeddings[0].values)
        
    def embed_audio(audio_path: str) -> np.ndarray:
        ext = audio_path.lower().rsplit('.', 1)[-1]
        mime_map = {'wav': 'audio/wav', 'mp3': 'audio/mp3'}
        mime_type = mime_map.get(ext, 'audio/wav')
    
        with open(audio_path, 'rb') as f:
            audio_bytes = f.read()
    
        result = client.models.embed_content(
            model=MODEL_ID,
            contents=[types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)],
        )
        return np.array(result.embeddings[0].values)
    
    # --- MAIN SEARCH SCRIPT ---
    
    def search_audio_with_embeddings(audio_file_path: str, search_phrase: str, chunk_seconds: int = 5):
        print(f"Loading {audio_file_path}...")
        audio = AudioSegment.from_file(audio_file_path)
        
        # pydub works in milliseconds, so 5 seconds = 5000 ms
        chunk_length_ms = chunk_seconds * 1000 
        
        audio_embeddings = []
        temp_files = []
        
        print(f"Slicing audio into {chunk_seconds}-second pieces...")
        
        # 2. Chop the audio into pieces
        # We use a loop to jump forward by chunk_length_ms each time
        for i, start_ms in enumerate(range(0, len(audio), chunk_length_ms)):
            # Extract the slice
            chunk = audio[start_ms:start_ms + chunk_length_ms]
            
            # Save it temporarily to your folder so the Gemini API can read it
            chunk_name = f"temp_chunk_{i}.wav"
            chunk.export(chunk_name, format="wav")
            temp_files.append(chunk_name)
            
            # 3. Embed this specific chunk
            print(f"  Embedding chunk {i + 1}...")
            emb = embed_audio(chunk_name)
            audio_embeddings.append(emb)
            
        audio_embeddings = np.array(audio_embeddings)
        
        # 4. Embed the search text
        print(f"\nEmbedding your search: '{search_phrase}'...")
        text_emb = np.array([embed_text(search_phrase)])
        
        # 5. Compare the text against all the audio chunks
        print("Calculating similarities...")
        sim_scores = cosine_similarity(text_emb, audio_embeddings)[0]
        
        # Find the chunk with the highest score
        best_chunk_idx = np.argmax(sim_scores)
        best_score = sim_scores[best_chunk_idx]
        
        # Calculate the timestamp
        start_time = best_chunk_idx * chunk_seconds
        end_time = start_time + chunk_seconds
        
        print("\n--- Results ---")
        print(f"The concept '{search_phrase}' most closely matches the audio between {start_time}s and {end_time}s!")
        print(f"Confidence score: {best_score:.3f}")
        
        # Clean up the temporary chunk files
        for temp_file in temp_files:
            os.remove(temp_file)
        
    
    # --- RUN IT ---
    
    # Replace with whatever phrase you're looking for!
    search_audio_with_embeddings("fishing2.mp3", "yellow dolphin", chunk_seconds=5)

    Here is the output.

    Loading fishing2.mp3...
    Slicing audio into 5-second pieces...
      Embedding chunk 1...
      Embedding chunk 2...
      Embedding chunk 3...
      Embedding chunk 4...
      Embedding chunk 5...
      Embedding chunk 6...
      Embedding chunk 7...
      Embedding chunk 8...
    
    Embedding your search: 'yellow dolphin'...
    Calculating similarities...
    
    --- Results ---
    The concept 'yellow dolphin' most closely matches the audio between 25s and 30s!
    Confidence score: 0.643

    That's pretty accurate. Listening to the audio again, the word "dolphin" is mentioned at the 25-second mark and "bright yellow" is mentioned at the 29-second mark. Earlier in the audio, I deliberately introduced the phrase "yellow sun" to see whether the model would be confused, but it handled the distraction well.
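    If you'd rather see a few candidate timestamps than a single winner, np.argsort gives a ranked list over the same scores. A small sketch (the scores array here is made up for illustration, loosely mirroring the run above):

```python
import numpy as np

def top_k_chunks(sim_scores: np.ndarray, chunk_seconds: int, k: int = 3):
    # Indices of the k highest scores, best first
    order = np.argsort(sim_scores)[::-1][:k]
    return [(int(i) * chunk_seconds, (int(i) + 1) * chunk_seconds, float(sim_scores[i]))
            for i in order]

# Hypothetical per-chunk scores for eight 5-second chunks
scores = np.array([0.21, 0.25, 0.22, 0.30, 0.28, 0.64, 0.41, 0.33])
for start, end, score in top_k_chunks(scores, chunk_seconds=5):
    print(f"{start}s-{end}s  score {score:.3f}")
```

    This drops straight into search_audio_with_embeddings in place of the single np.argmax.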

    Summary

    This article introduced Gemini Embeddings 2 Preview as Google's new all-in-one embedding model for text, PDFs, images, audio, and video. It explained why that matters for RAG systems, where embeddings help turn content and search queries into vectors that can be compared for similarity.

    I then walked through two Python examples showing how to generate embeddings for images and audio with the Google GenAI SDK, use similarity scoring to match text queries against images, and chunk audio into smaller segments to identify the part of a spoken recording that's semantically closest to a given search phrase.

    The ability to perform semantic searches beyond just text and other documents is a real boon. Google's new embedding model promises to open up a whole new raft of possibilities for multimodal search, retrieval, and recommendation systems, making it much easier to work with images, audio, video, and documents in a single pipeline. As the tooling matures, it could become a very practical foundation for richer RAG applications that understand far more than text alone.


    You can find the original blog post announcing Gemini Embeddings 2 using the link below.

    https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2


