
    Introducing Gemini Embeddings 2 Preview | Towards Data Science

    By Editor Times Featured · March 17, 2026 · 11 Mins Read


    Google has released a preview of its newest embedding model. This model is notable for one main reason: it can embed text, PDFs, images, audio, and video, making it a one-stop shop for embedding almost anything you'd care to throw at it.

    If you're new to embedding, you might wonder what all the fuss is about, but it turns out that embedding is one of the cornerstones of retrieval augmented generation, or RAG as it's known. In turn, RAG is one of the most fundamental applications of modern artificial intelligence processing.

    A quick recap of RAG and embedding

    RAG is a method of chunking, encoding, and storing information that can then be searched using similarity functions that match search terms to the embedded information. The encoding part turns whatever you're searching into a series of numbers called vectors: this is what embedding does. The vectors (embeddings) are then typically stored in a vector database.

    When a user enters a search term, it too is encoded as embeddings, and the resulting vectors are compared with the contents of the vector database, usually using a process called cosine similarity. The closer the search term vectors are to parts of the information in the vector store, the more relevant the search terms are to those parts of the stored data. Large language models can interpret all this, then retrieve and display the most relevant parts to the user.
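    To make this concrete, cosine similarity between two vectors is just the dot product divided by the product of their magnitudes. Here is a minimal NumPy sketch, using made-up three-dimensional "embeddings" (real ones have hundreds or thousands of dimensions):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the vector magnitudes;
    # 1.0 means identical direction, 0.0 means unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy vectors pointing in nearly the same direction
query = np.array([0.2, 0.9, 0.1])
doc = np.array([0.25, 0.85, 0.05])
print(f"{cosine_sim(query, doc):.3f}")  # close to 1.0
```

    This is exactly what scikit-learn's cosine_similarity does for us later, just one pair at a time.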

    There's a whole bunch of other stuff that surrounds this, like how the input data should be split up or chunked, but the embedding, storing, and retrieval are the main features of RAG processing. To help you visualise, here's a simplified schematic of a RAG process.

    Image by Nano Banana
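    As an aside, the chunking step in the schematic can be as simple as fixed-size slices with a little overlap. Here is a minimal sketch (the chunk_size and overlap values are arbitrary; production systems usually split on sentence or paragraph boundaries instead):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    # Overlapping chunks ensure a sentence cut at one boundary
    # still appears whole in the neighbouring chunk
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# 500 characters -> chunks starting at positions 0, 160, 320, 480
pieces = chunk_text("a" * 500)
print(len(pieces))  # 4
```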

    So, what's special about Gemini Embedding?

    Okay, so now that we know how important embedding is for RAG, why is Google's new Gemini embedding model such a big deal? Simply this: traditional embedding models, with a few exceptions, have been limited to text, PDFs, and other document types, and maybe images at a push.

    What Gemini now offers is true multi-modal input for embeddings. That means text, PDFs and docs, images, audio, and video. Being a preview embedding model, there are certain size limitations on the inputs right now, but hopefully you can see the direction of travel and how potentially useful this could be.

    Input limitations

    I mentioned that there are limitations on what we can input to the new Gemini embedding model. They are:

    • Text: Up to 8192 input tokens, which is about 6000 words
    • Images: Up to 6 images per request, supporting PNG and JPEG formats
    • Videos: A maximum of 2 minutes of video, in MP4 and MOV formats
    • Audio: A maximum duration of 80 seconds, supporting MP3 and WAV formats
    • Documents: Up to 6 pages long
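    Since an over-long clip just wastes an API call, it can be worth a quick local check before embedding anything. Here is a small sketch with the preview's audio limit from the list above hard-coded (these limits will presumably rise as the model matures):

```python
# Preview limit taken from the list above (subject to change)
MAX_AUDIO_SECONDS = 80

def check_audio_length(name: str, duration_seconds: float) -> None:
    # Fail fast locally rather than letting the API reject the request
    if duration_seconds > MAX_AUDIO_SECONDS:
        raise ValueError(
            f"{name} is {duration_seconds:.0f}s long; the preview model "
            f"accepts at most {MAX_AUDIO_SECONDS}s of audio"
        )

check_audio_length("fishing2.mp3", 37)  # the 37-second clip used later is fine
```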

    Okay, time to see the new embedding model in practice with some Python coding examples.

    Setting up a development environment

    To begin, let's set up a standard development environment to keep our projects separate. I'll be using the UV tool for this, but feel free to use whichever methods you're used to.

    $ uv init embed-test --python 3.13
    $ cd embed-test
    $ uv venv
    $ source .venv/bin/activate
    $ uv add google-genai jupyter numpy scikit-learn pydub audioop-lts
    
    # To run the notebook, type this in
    
    $ uv run jupyter notebook

    You'll also need a Gemini API key, which you can get from Google's AI Studio home page.

    https://aistudio.google.com

    Look for a Get API Key link near the bottom left of the screen after you've logged in. Take a note of it, as you'll need it later.
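    Rather than pasting the key directly into your notebook (as the setup code below does for brevity), it is safer to export it as an environment variable and read it at runtime. The variable name GEMINI_API_KEY here is just a convention, not a requirement:

```python
import os

def load_api_key(var_name: str = "GEMINI_API_KEY") -> str:
    # Fail fast with a clear message instead of a cryptic auth error later
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key

# Then pass it to the SDK instead of a hard-coded string:
# client = genai.Client(api_key=load_api_key())
```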

    Please note: apart from being a user of their products, I have no association or affiliation with Google or any of its subsidiaries.

    Setup Code

    I won't talk much about embedding text or PDF documents, as these are relatively straightforward and are covered extensively elsewhere. Instead, we'll look at embedding images and audio, which are less common.

    This is the setup code, which is common to all our examples.

    import os
    import numpy as np
    from pydub import AudioSegment
    from google import genai
    from google.genai import types
    from sklearn.metrics.pairwise import cosine_similarity
    
    from IPython.display import display, Image as IPImage, Audio as IPAudio, Markdown
    
    client = genai.Client(api_key='YOUR_API_KEY')
    
    MODEL_ID = "gemini-embedding-2-preview"

    Example 1: Embedding images

    For this example, we'll embed 3 images: one of a ginger cat, one of a Labrador, and one of a yellow dolphin. We'll then set up a series of questions or phrases, each one specific to, or related to, one of the images, and see if the model can choose the most appropriate image for each question. It does this by computing a similarity score between the question and each image. The higher this score, the more pertinent the question is to the image.

    Here are the images I'm using.

    Image by Nano Banana

    So, I have two questions and two phrases.

    • Which animal is yellow
    • Which is most likely called Rover
    • There's something fishy going on here
    • A purrrfect image
    # Some helper functions
    #
    
    # Embed text
    def embed_text(text: str) -> np.ndarray:
        """Encode a text string into an embedding vector.
    
        Simply pass the string directly to embed_content.
        """
        result = client.models.embed_content(
            model=MODEL_ID,
            contents=[text],
        )
        return np.array(result.embeddings[0].values)
        
    # Embed an image
    def embed_image(image_path: str) -> np.ndarray:
    
        # Determine the MIME type from the file extension
        ext = image_path.lower().rsplit('.', 1)[-1]
        mime_map = {'png': 'image/png', 'jpg': 'image/jpeg', 'jpeg': 'image/jpeg'}
        mime_type = mime_map.get(ext, 'image/png')
    
        with open(image_path, 'rb') as f:
            image_bytes = f.read()
    
        result = client.models.embed_content(
            model=MODEL_ID,
            contents=[
                types.Part.from_bytes(data=image_bytes, mime_type=mime_type),
            ],
        )
        return np.array(result.embeddings[0].values)
    
    # --- Define image files ---
    image_files = ["dog.png", "cat.png", "dolphin.png"]
    image_labels = ["dog", "cat", "dolphin"]
    
    # Our questions
    text_descriptions = [
        "Which animal is yellow",
        "Which is most likely called Rover",
        "There's something fishy going on here",
        "A purrrfect image"
    ]
    
    # --- Compute embeddings ---
    print("Embedding texts...")
    text_embeddings = np.array([embed_text(t) for t in text_descriptions])
    
    print("Embedding images...")
    image_embeddings = np.array([embed_image(f) for f in image_files])
    
    # Use cosine similarity to score every text against every image
    text_image_sim = cosine_similarity(text_embeddings, image_embeddings)
    
    # Print the best match for each text
    print("\nBest image match for each text:")
    for i, text in enumerate(text_descriptions):
        # np.argmax looks across row i to find the highest score among the columns
        best_idx = np.argmax(text_image_sim[i, :])
        best_image = image_labels[best_idx]
        best_score = text_image_sim[i, best_idx]
        
        print(f'  "{text}" => {best_image} (score: {best_score:.3f})')

    Here's the output.

    Embedding texts...
    Embedding images...
    
    Best image match for each text:
      "Which animal is yellow" => dolphin (score: 0.399)
      "Which is most likely called Rover" => dog (score: 0.357)
      "There's something fishy going on here" => dolphin (score: 0.302)
      "A purrrfect image" => cat (score: 0.368)

    Not too shabby. The model came up with the same answers I'd have given. How about you?

    Example 2: Embedding audio

    For the audio, I used a man's voice describing a fishing trip in which he sees a bright yellow dolphin. Click below to hear the full audio. It's about 37 seconds long.

    If you don't want to listen, here is the full transcript.

    Hi, my name is Glen, and I want to tell you about a fascinating sight I witnessed last Tuesday afternoon while out ocean fishing with some buddies. It was a warm day with a yellow sun in the sky. We were fishing for tuna and had no luck catching anything. Boy, we must have spent the best part of 5 hours out there. So, we were pretty glum as we headed back to dry land. But then, out of the blue, and I swear this is no lie, we saw a school of dolphins. Not only that, but one of them was bright yellow in colour. We never saw anything like it in our lives, but I can tell you all thoughts of a bad fishing day went out the window. It was mesmerising.

    Now, let's see if we can narrow down where the speaker talks about seeing a yellow dolphin.

    Normally, when dealing with embeddings, we're only interested in the general properties, ideas, and concepts contained in the source information. If we want to narrow down specific properties, such as where in an audio file a particular phrase occurs, or where in a video a particular action or event occurs, this is a slightly more complex task. To do it in our example, we first need to chunk the audio into smaller pieces before embedding each chunk. We then perform a similarity search on each embedded chunk before producing our final answer.

    
    # --- HELPER FUNCTIONS ---
    
    def embed_text(text: str) -> np.ndarray:
        result = client.models.embed_content(model=MODEL_ID, contents=[text])
        return np.array(result.embeddings[0].values)
        
    def embed_audio(audio_path: str) -> np.ndarray:
        ext = audio_path.lower().rsplit('.', 1)[-1]
        mime_map = {'wav': 'audio/wav', 'mp3': 'audio/mp3'}
        mime_type = mime_map.get(ext, 'audio/wav')
    
        with open(audio_path, 'rb') as f:
            audio_bytes = f.read()
    
        result = client.models.embed_content(
            model=MODEL_ID,
            contents=[types.Part.from_bytes(data=audio_bytes, mime_type=mime_type)],
        )
        return np.array(result.embeddings[0].values)
    
    # --- MAIN SEARCH SCRIPT ---
    
    def search_audio_with_embeddings(audio_file_path: str, search_phrase: str, chunk_seconds: int = 5):
        print(f"Loading {audio_file_path}...")
        audio = AudioSegment.from_file(audio_file_path)
        
        # pydub works in milliseconds, so 5 seconds = 5000 ms
        chunk_length_ms = chunk_seconds * 1000 
        
        audio_embeddings = []
        temp_files = []
        
        print(f"Slicing audio into {chunk_seconds}-second pieces...")
        
        # 2. Chop the audio into pieces
        # We use a loop to jump forward by chunk_length_ms each time
        for i, start_ms in enumerate(range(0, len(audio), chunk_length_ms)):
            # Extract the slice
            chunk = audio[start_ms:start_ms + chunk_length_ms]
            
            # Save it temporarily to your folder so the Gemini API can read it
            chunk_name = f"temp_chunk_{i}.wav"
            chunk.export(chunk_name, format="wav")
            temp_files.append(chunk_name)
            
            # 3. Embed this specific chunk
            print(f"  Embedding chunk {i + 1}...")
            emb = embed_audio(chunk_name)
            audio_embeddings.append(emb)
            
        audio_embeddings = np.array(audio_embeddings)
        
        # 4. Embed the search text
        print(f"\nEmbedding your search: '{search_phrase}'...")
        text_emb = np.array([embed_text(search_phrase)])
        
        # 5. Compare the text against all the audio chunks
        print("Calculating similarities...")
        sim_scores = cosine_similarity(text_emb, audio_embeddings)[0]
        
        # Find the chunk with the highest score
        best_chunk_idx = np.argmax(sim_scores)
        best_score = sim_scores[best_chunk_idx]
        
        # Calculate the timestamp
        start_time = best_chunk_idx * chunk_seconds
        end_time = start_time + chunk_seconds
        
        print("\n--- Results ---")
        print(f"The concept '{search_phrase}' most closely matches the audio between {start_time}s and {end_time}s!")
        print(f"Confidence score: {best_score:.3f}")
        
        # Clean up the temporary chunk files
        for temp_file in temp_files:
            os.remove(temp_file)
        
    
    # --- RUN IT ---
    
    # Replace with whatever phrase you're looking for!
    search_audio_with_embeddings("fishing2.mp3", "yellow dolphin", chunk_seconds=5)

    Here is the output.

    Loading fishing2.mp3...
    Slicing audio into 5-second pieces...
      Embedding chunk 1...
      Embedding chunk 2...
      Embedding chunk 3...
      Embedding chunk 4...
      Embedding chunk 5...
      Embedding chunk 6...
      Embedding chunk 7...
      Embedding chunk 8...
    
    Embedding your search: 'yellow dolphin'...
    Calculating similarities...
    
    --- Results ---
    The concept 'yellow dolphin' most closely matches the audio between 25s and 30s!
    Confidence score: 0.643

    That's pretty accurate. Listening to the audio again, the word "dolphin" is mentioned at the 25-second mark and "bright yellow" is mentioned at the 29-second mark. Earlier in the audio, I deliberately introduced the phrase "yellow sun" to see whether the model would be confused, but it handled the distraction well.
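    If you'd rather see a few candidate timestamps than a single winner, np.argsort gives a ranked list over the same scores. A small sketch (the scores array here is made up for illustration, loosely mirroring the run above):

```python
import numpy as np

def top_k_chunks(sim_scores: np.ndarray, chunk_seconds: int, k: int = 3):
    # Indices of the k highest scores, best first
    order = np.argsort(sim_scores)[::-1][:k]
    return [(int(i) * chunk_seconds, (int(i) + 1) * chunk_seconds, float(sim_scores[i]))
            for i in order]

# Hypothetical per-chunk scores for eight 5-second chunks
scores = np.array([0.21, 0.25, 0.22, 0.30, 0.28, 0.64, 0.41, 0.33])
for start, end, score in top_k_chunks(scores, chunk_seconds=5):
    print(f"{start}s-{end}s  score {score:.3f}")
```

    This drops straight into search_audio_with_embeddings in place of the single np.argmax.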

    Summary

    This article introduced Gemini Embeddings 2 Preview as Google's new all-in-one embedding model for text, PDFs, images, audio, and video. It explained why that matters for RAG systems, where embeddings help turn content and search queries into vectors that can be compared for similarity.

    I then walked through two Python examples showing how to generate embeddings for images and audio with the Google GenAI SDK, use similarity scoring to match text queries against images, and chunk audio into smaller segments to identify the part of a spoken recording that's semantically closest to a given search phrase.

    The ability to perform semantic searches beyond just text and other documents is a real boon. Google's new embedding model promises to open up a whole new raft of possibilities for multimodal search, retrieval, and recommendation systems, making it much easier to work with images, audio, video, and documents in a single pipeline. As the tooling matures, it could become a very practical foundation for richer RAG applications that understand far more than text alone.


    You can find the original blog post announcing Gemini Embeddings 2 using the link below.

    https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-embedding-2


