    Detecting and Editing Visual Objects with Gemini

    By Editor Times Featured · February 26, 2026 · 37 Mins Read


    Before we begin:

    • I’m a developer at Google Cloud. Ideas and opinions expressed here are entirely my own.
    • The full source code for this article, including future updates, is available in this notebook under the Apache 2.0 license.
    • All new images in this article were generated with Gemini Nano Banana using the explored proof of concept. All source images are either in the public domain or free to use (reference links are provided in the code output).
    • You can experiment with Gemini models for free in Google AI Studio. For programmatic API access, note that while a free tier is available for some models (i.e., you can perform object detection), image generation is a pay-as-you-go service.

    ✨ Overview

    Traditional computer vision models are typically trained to detect a fixed set of object classes, like “person”, “cat”, or “car”. If you want to detect something specific that wasn’t in the training set, such as an “illustration” in a book photograph, you usually need to gather a dataset, label it manually, and train a custom model, which can take hours or even days.

    In this exploration, we’ll test a different approach using Gemini. We will leverage its spatial understanding capabilities to perform open-vocabulary object detection. This lets us find objects based solely on a natural language description, without any training.

    Once the visual objects are detected, we’ll extract them and then use Gemini’s image editing capabilities (specifically the Nano Banana models) to restore and creatively transform them.


    🔥 Challenge

    We’re dealing with unstructured data: photos of books, magazines, and objects in the wild. These images present several difficulties for traditional computer vision:

    • Variety: The objects we want to find (illustrations, engravings, and visuals in general) vary wildly in style and content.
    • Distortion: Pages are curved, photos are taken at angles, and lighting is uneven.
    • Noise: Old books have stains, paper grain, and text bleeding through from the other side.

    Our challenge is to build a robust pipeline that can detect these objects despite the distortions, extract them cleanly, and edit them to look like high-quality digital assets… all using simple text prompts.


    🏁 Setup

    🐍 Python packages

    We’ll use the following packages:

    • google-genai: the Google Gen AI Python SDK lets us call Gemini with a few lines of code
    • pillow for image management
    • matplotlib for result visualization

    We’ll also use these packages (dependencies of google-genai):

    • pydantic for data management
    • tenacity for request management
    pip install --quiet "google-genai>=1.63.0" "pillow>=11.3.0" "matplotlib>=3.10.0"

    🔗 Gemini API

    To use the Gemini API, we have two main options:

    1. Via Vertex AI with a Google Cloud project
    2. Via Google AI Studio with a Gemini API key
    The Google Gen AI SDK provides a unified interface to these APIs, and we can use environment variables for the configuration. 🔽

    🛠️ Option 1 – Gemini API via Vertex AI

    Requirements:

    Gen AI SDK environment variables:

    • GOOGLE_GENAI_USE_VERTEXAI="True"
    • GOOGLE_CLOUD_PROJECT=""
    • GOOGLE_CLOUD_LOCATION=""

    💡 For preview models, the location must be set to global. For generally available models, we can choose the closest location among the Google model endpoint locations.

    ℹ️ Learn more about setting up a project and a development environment.

    🛠️ Option 2 – Gemini API via Google AI Studio

    Requirement:

    Gen AI SDK environment variables:

    • GOOGLE_GENAI_USE_VERTEXAI="False"
    • GOOGLE_API_KEY=""

    ℹ️ Learn more about getting a Gemini API key from Google AI Studio.
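    For a local setup, a minimal sketch is to export these variables in your shell before launching Python (the project ID and API key values below are placeholders, not real credentials):

```shell
# Option 1 - Vertex AI (placeholder project ID)
export GOOGLE_GENAI_USE_VERTEXAI="True"
export GOOGLE_CLOUD_PROJECT="my-gcp-project"
export GOOGLE_CLOUD_LOCATION="global"

# Option 2 - Google AI Studio (uncomment and set a real key instead)
# export GOOGLE_GENAI_USE_VERTEXAI="False"
# export GOOGLE_API_KEY="your-api-key"
```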

    💡 You can store your environment configuration outside of the source code:

    Environment          Method
    IDE                  .env file (or equivalent)
    Colab                Colab Secrets (🗝️ icon in left panel, see code below)
    Colab Enterprise     Google Cloud project and location are automatically defined
    Vertex AI Workbench  Google Cloud project and location are automatically defined
    Define the following environment detection functions. You can also define your configuration manually if needed. 🔽
    import os
    import sys
    from collections.abc import Callable
    
    from google import genai
    
    # Manual setup (leave unchanged if setup is environment-defined)
    
    # @markdown **Which API: Vertex AI or Google AI Studio?**
    GOOGLE_GENAI_USE_VERTEXAI = True  # @param {type: "boolean"}
    
    # @markdown **Option A - Google Cloud project [+location]**
    GOOGLE_CLOUD_PROJECT = ""  # @param {type: "string"}
    GOOGLE_CLOUD_LOCATION = "global"  # @param {type: "string"}
    
    # @markdown **Option B - Google AI Studio API key**
    GOOGLE_API_KEY = ""  # @param {type: "string"}
    
    
    def check_environment() -> bool:
        check_colab_user_authentication()
        return check_manual_setup() or check_vertex_ai() or check_colab() or check_local()
    
    
    def check_manual_setup() -> bool:
        return check_define_env_vars(
            GOOGLE_GENAI_USE_VERTEXAI,
            GOOGLE_CLOUD_PROJECT.strip(),  # Might have been pasted with a line return
            GOOGLE_CLOUD_LOCATION,
            GOOGLE_API_KEY,
        )
    
    
    def check_vertex_ai() -> bool:
        # Workbench and Colab Enterprise
        match os.getenv("VERTEX_PRODUCT", ""):
            case "WORKBENCH_INSTANCE":
                pass
            case "COLAB_ENTERPRISE":
                if not running_in_colab_env():
                    return False
            case _:
                return False
    
        return check_define_env_vars(
            True,
            os.getenv("GOOGLE_CLOUD_PROJECT", ""),
            os.getenv("GOOGLE_CLOUD_REGION", ""),
            "",
        )
    
    
    def check_colab() -> bool:
        if not running_in_colab_env():
            return False
    
        # Colab Enterprise was checked before, so this is Colab only
        from google.colab import auth as colab_auth  # type: ignore
    
        colab_auth.authenticate_user()
    
        # Use Colab Secrets (🗝️ icon in left panel) to store the environment variables
        # Secrets are private, visible only to you and the notebooks that you select
        # - Vertex AI: Store your settings as secrets
        # - Google AI: Directly import your Gemini API key from the UI
        vertexai, project, location, api_key = get_vars(get_colab_secret)
    
        return check_define_env_vars(vertexai, project, location, api_key)
    
    
    def check_local() -> bool:
        vertexai, project, location, api_key = get_vars(os.getenv)
    
        return check_define_env_vars(vertexai, project, location, api_key)
    
    
    def running_in_colab_env() -> bool:
        # Colab or Colab Enterprise
        return "google.colab" in sys.modules
    
    
    def check_colab_user_authentication() -> None:
        if running_in_colab_env():
            from google.colab import auth as colab_auth  # type: ignore
    
            colab_auth.authenticate_user()
    
    
    def get_colab_secret(secret_name: str, default: str) -> str:
        from google.colab import errors, userdata  # type: ignore
    
        try:
            return userdata.get(secret_name)
        except errors.SecretNotFoundError:
            return default
    
    
    def disable_colab_cell_scrollbar() -> None:
        if running_in_colab_env():
            from google.colab import output  # type: ignore
    
            output.no_vertical_scroll()
    
    
    def get_vars(getenv: Callable[[str, str], str]) -> tuple[bool, str, str, str]:
        # Limit getenv calls to the minimum (may trigger UI confirmation for secret access)
        vertexai_str = getenv("GOOGLE_GENAI_USE_VERTEXAI", "")
        if vertexai_str:
            vertexai = vertexai_str.lower() in ["true", "1"]
        else:
            vertexai = bool(getenv("GOOGLE_CLOUD_PROJECT", ""))
    
        project = getenv("GOOGLE_CLOUD_PROJECT", "") if vertexai else ""
        location = getenv("GOOGLE_CLOUD_LOCATION", "") if project else ""
        api_key = getenv("GOOGLE_API_KEY", "") if not project else ""
    
        return vertexai, project, location, api_key
    
    
    def check_define_env_vars(
        vertexai: bool,
        project: str,
        location: str,
        api_key: str,
    ) -> bool:
        match (vertexai, bool(project), bool(location), bool(api_key)):
            case (True, True, _, _):
                # Vertex AI - Google Cloud project [+location]
                location = location or "global"
                define_env_vars(vertexai, project, location, "")
            case (True, False, _, True):
                # Vertex AI - API key
                define_env_vars(vertexai, "", "", api_key)
            case (False, _, _, True):
                # Google AI Studio - API key
                define_env_vars(vertexai, "", "", api_key)
            case _:
                return False
    
        return True
    
    
    def define_env_vars(vertexai: bool, project: str, location: str, api_key: str) -> None:
        os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = str(vertexai)
        os.environ["GOOGLE_CLOUD_PROJECT"] = project
        os.environ["GOOGLE_CLOUD_LOCATION"] = location
        os.environ["GOOGLE_API_KEY"] = api_key
    
    
    def check_configuration(client: genai.Client) -> None:
        service = "Vertex AI" if client.vertexai else "Google AI Studio"
        print(f"✅ Using the {service} API", end="")
    
        if client._api_client.project:
            print(f' with project "{client._api_client.project[:7]}…"', end="")
            print(f' in location "{client._api_client.location}"')
        elif client._api_client.api_key:
            api_key = client._api_client.api_key
            print(f' with API key "{api_key[:5]}…{api_key[-5:]}"', end="")
            print(f" (in case of error, make sure it was created for {service})")
    
    
    print("✅ Environment functions defined")

    🤖 Gen AI SDK

    To send Gemini requests, create a google.genai client:

    from google import genai
    
    check_environment()
    
    client = genai.Client()
    
    check_configuration(client)

    🖼️ Image test suite

    Let’s define a list of images for our tests: 🔽
    from dataclasses import dataclass
    from enum import StrEnum
    
    Url = str
    
    
    class Source(StrEnum):
        incunable = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg"
        engravings = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"
        museum_guidebook = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg"
        denver_illustrated = "https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg"
        physics_textbook = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:03:64:87:31:8:00036487318:0103/full/pct:50/0/default.jpg"
        portrait_miniatures = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2024:2024rosen013592v02:0249/full/pct:50/0/default.jpg"
        wizard_of_oz_drawings = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg"
        paintings = "https://images.unsplash.com/photo-1714146681164-f26fed839692?h=1440"
        alice_drawing = "https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"
        book = "https://images.unsplash.com/photo-1643451533573-ee364ba6e330?h=800"
        manual = "https://images.unsplash.com/photo-1623666936367-a100f62ba9b7?h=800"
        electronics = "https://images.unsplash.com/photo-1757397584789-8b2c5bfcdbc3?h=1440"
    
    
    @dataclass
    class SourceMetadata:
        title: str
        webpage_url: Url
        credit_line: str
    
    
    LOC = "Library of Congress"
    LOC_RARE_BOOKS = "Library of Congress, Rare Book and Special Collections Division"
    LOC_MEETING_FRONTIERS = "Library of Congress, Meeting of Frontiers"
    
    metadata_by_source: dict[Source, SourceMetadata] = {
        Source.incunable: SourceMetadata(
            "Vergaderinge der historien van Troy (1485)",
            "https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165",
            LOC_RARE_BOOKS,
        ),
        Source.engravings: SourceMetadata(
            "Harper's illustrated catalogue (1847)",
            "https://www.loc.gov/resource/gdcscd.00340766921/?sp=121",
            LOC,
        ),
        Source.museum_guidebook: SourceMetadata(
            "Barnum's American Museum illustrated (1850)",
            "https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33",
            LOC_RARE_BOOKS,
        ),
        Source.denver_illustrated: SourceMetadata(
            "Denver illustrated (1893)",
            "https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51",
            LOC_MEETING_FRONTIERS,
        ),
        Source.physics_textbook: SourceMetadata(
            "Lessons in physics (1916)",
            "https://www.loc.gov/resource/gdcscd.00036487318/?sp=103",
            LOC,
        ),
        Source.portrait_miniatures: SourceMetadata(
            "The history of portrait miniatures (1904)",
            "https://www.loc.gov/resource/rbc0001.2024rosen013592v02/?sp=249",
            LOC_RARE_BOOKS,
        ),
        Source.wizard_of_oz_drawings: SourceMetadata(
            "The Wonderful Wizard of Oz (1899)",
            "https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48",
            LOC_RARE_BOOKS,
        ),
        Source.paintings: SourceMetadata(
            "Open book showing paintings by Vincent van Gogh",
            "https://unsplash.com/photos/9hD7qrxICag",
            "Photo by Trung Manh cong on Unsplash",
        ),
        Source.alice_drawing: SourceMetadata(
            "Open book showing an illustration and text from Alice's Adventures in Wonderland",
            "https://unsplash.com/photos/bewzr_Q9u2o",
            "Photo by Brett Jordan on Unsplash",
        ),
        Source.book: SourceMetadata(
            "Open book showing two botanical illustrations",
            "https://unsplash.com/photos/4IDqcNj827I",
            "Photo by Ranurte on Unsplash",
        ),
        Source.manual: SourceMetadata(
            "Open user manual for a vintage camera",
            "https://unsplash.com/photos/aaFU96eYASk",
            "Photo by Annie Spratt on Unsplash",
        ),
        Source.electronics: SourceMetadata(
            "Circuit board with electronic components",
            "https://unsplash.com/photos/Aqa1pHQ57pw",
            "Photo by Albert Stoynov on Unsplash",
        ),
    }
    
    print("✅ Test images defined")

    🧠 Gemini models

    Gemini comes in different versions. We can currently use the following models:

    • For object detection: Gemini 2.5 or Gemini 3, each available in Flash or Pro versions.
    • For object editing: Gemini 2.5 Flash Image or Gemini 3 Pro Image, also known as Nano Banana and Nano Banana Pro.

    🛠️ Helpers

    Now, let’s add core helper classes and functions: 🔽
    from enum import auto
    from pathlib import Path
    from typing import Any, cast
    
    import IPython.display
    import matplotlib.pyplot as plt
    import pydantic
    import tenacity
    from google.genai.errors import ClientError
    from google.genai.types import (
        FinishReason,
        GenerateContentConfig,
        GenerateContentResponse,
        PIL_Image,
        ThinkingConfig,
        ThinkingLevel,
    )
    
    
    # Multimodal models with spatial understanding and structured outputs
    class MultimodalModel(StrEnum):
        # Generally Available (GA)
        GEMINI_2_5_FLASH = "gemini-2.5-flash"
        GEMINI_2_5_PRO = "gemini-2.5-pro"
        # Preview
        GEMINI_3_FLASH_PREVIEW = "gemini-3-flash-preview"
        GEMINI_3_1_PRO_PREVIEW = "gemini-3.1-pro-preview"
        # Default model used for object detection
        DEFAULT = GEMINI_3_FLASH_PREVIEW
    
    
    # Image generation and editing models
    class ImageModel(StrEnum):
        # Generally Available (GA)
        GEMINI_2_5_FLASH_IMAGE = "gemini-2.5-flash-image"  # Nano Banana 🍌
        # Preview
        GEMINI_3_PRO_IMAGE_PREVIEW = "gemini-3-pro-image-preview"  # Nano Banana Pro 🍌
        # Default model used for image editing
        DEFAULT = GEMINI_2_5_FLASH_IMAGE
    
    
    Model = MultimodalModel | ImageModel
    
    
    def generate_content(
        contents: list[Any],
        model: Model,
        config: GenerateContentConfig | None,
        should_display_response_info: bool = False,
    ) -> GenerateContentResponse | None:
        response = None
        client = check_client_for_model(model)
    
        for attempt in get_retrier():
            with attempt:
                response = client.models.generate_content(
                    model=model.value,
                    contents=contents,
                    config=config,
                )
        if should_display_response_info:
            display_response_info(response, config)
    
        return response
    
    
    def check_client_for_model(model: Model) -> genai.Client:
        if (
            model.value.endswith("-preview")
            and client.vertexai
            and client._api_client.location != "global"
        ):
            # Preview models are only available in the "global" location
            return genai.Client(location="global")
    
        return client
    
    
    def display_response_info(
        response: GenerateContentResponse | None,
        config: GenerateContentConfig | None,
    ) -> None:
        if response is None:
            print("❌ No response")
            return
    
        if usage_metadata := response.usage_metadata:
            if usage_metadata.prompt_token_count:
                print(f"Input tokens   : {usage_metadata.prompt_token_count:9,d}")
            if usage_metadata.candidates_token_count:
                print(f"Output tokens  : {usage_metadata.candidates_token_count:9,d}")
            if usage_metadata.thoughts_token_count:
                print(f"Thoughts tokens: {usage_metadata.thoughts_token_count:9,d}")
    
        if (
            config is not None
            and config.response_mime_type == "application/json"
            and response.parsed is None
        ):
            print("❌ Could not parse the JSON response")
            return
        if not response.candidates:
            print("❌ No `response.candidates`")
            return
        if (finish_reason := response.candidates[0].finish_reason) != FinishReason.STOP:
            print(f"❌ {finish_reason = }")
        if not response.text:
            print("❌ No `response.text`")
            return
    
    
    def generate_image(
        sources: list[PIL_Image],
        prompt: str,
        model: ImageModel,
        config: GenerateContentConfig | None = None,
    ) -> PIL_Image | None:
        contents = [*sources, prompt.strip()]
    
        response = generate_content(contents, model, config)
    
        return check_get_output_image_from_response(response)
    
    
    def check_get_output_image_from_response(
        response: GenerateContentResponse | None,
    ) -> PIL_Image | None:
        if response is None:
            print("❌ No `response`")
            return None
        if not response.candidates:
            print("❌ No `response.candidates`")
            if response.prompt_feedback:
                if block_reason := response.prompt_feedback.block_reason:
                    print(f"{block_reason = :s}")
                if block_reason_message := response.prompt_feedback.block_reason_message:
                    print(f"{block_reason_message = }")
            return None
        if not (content := response.candidates[0].content):
            print("❌ No `response.candidates[0].content`")
            return None
        if not (parts := content.parts):
            print("❌ No `response.candidates[0].content.parts`")
            return None
    
        output_image: PIL_Image | None = None
        for part in parts:
            if part.text:
                display_markdown(part.text)
                continue
            sdk_image = part.as_image()
            assert sdk_image is not None
            output_image = sdk_image._pil_image
            assert output_image is not None
            break  # There should be a single image
    
        return output_image
    
    
    def get_thinking_config(model: Model) -> ThinkingConfig | None:
        match model:
            case MultimodalModel.GEMINI_2_5_FLASH:
                return ThinkingConfig(thinking_budget=0)
            case MultimodalModel.GEMINI_2_5_PRO:
                return ThinkingConfig(thinking_budget=128, include_thoughts=False)
            case MultimodalModel.GEMINI_3_FLASH_PREVIEW:
                return ThinkingConfig(thinking_level=ThinkingLevel.MINIMAL)
            case MultimodalModel.GEMINI_3_1_PRO_PREVIEW:
                return ThinkingConfig(thinking_level=ThinkingLevel.LOW)
            case _:
                return None  # Default
    
    
    def display_markdown(markdown: str) -> None:
        IPython.display.display(IPython.display.Markdown(markdown))
    
    
    def display_image(image: PIL_Image) -> None:
        IPython.display.display(image)
    
    
    def get_retrier() -> tenacity.Retrying:
        return tenacity.Retrying(
            stop=tenacity.stop_after_attempt(7),
            wait=tenacity.wait_incrementing(start=10, increment=1),
            retry=should_retry_request,
            reraise=True,
        )
    
    
    def should_retry_request(retry_state: tenacity.RetryCallState) -> bool:
        if not retry_state.outcome:
            return False
        err = retry_state.outcome.exception()
        if not isinstance(err, ClientError):
            return False
        print(f"❌ ClientError {err.code}: {err.message}")
    
        retry = False
        match err.code:
            case 400 if err.message is not None and " try again " in err.message:
                # Workshop: first-time access to Cloud Storage (service agent provisioning)
                retry = True
            case 429:
                # Workshop: temporary project with 1 QPM quota
                retry = True
        print(f"🔄 Retry: {retry}")
    
        return retry
    
    
    print("✅ Helpers defined")

    🔍 Detecting visual objects

    To perform visual object detection, craft the prompt to indicate what you’d like to detect and how the results should be returned. In the same request, it’s also possible to extract additional information about each detected object. This can be virtually anything, from labels such as “furniture”, “table”, or “chair”, to more precise classifications like “mammals” or “reptiles”, or contextual data such as captions, colors, shapes, etc.

    For the following tests, we’ll experiment with detecting illustrations within book photos. Here’s a possible prompt:

    OBJECT_DETECTION_PROMPT = """
    Detect every illustration within the book image and extract the following data for each:
    - `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).
    - `caption`: Verbatim caption or legend such as "Figure 1". Use "" if not found.
    - `label`: Single-word label describing the illustration. Use "" if not found.
    """

    Notes:

    • Bounding boxes are very useful for locating or extracting the detected objects.
    • Typically, for Gemini models, a box_2d bounding box represents coordinates normalized to a (0, 0, 1000, 1000) space for a (0, 0, width, height) input image.
    • We’re also requesting the extraction of captions (metadata often present in reference books) and labels (dynamic metadata).
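    As a quick standalone illustration of that normalization (the box values below are made up), converting a box_2d back to pixel coordinates only requires scaling by the image width and height:

```python
def to_pixel_box(
    box_2d: list[int], width: int, height: int
) -> tuple[int, int, int, int]:
    # Gemini returns [y1, x1, y2, x2] normalized to a 0-1000 space;
    # Pillow's Image.crop() expects (x1, y1, x2, y2) in pixels.
    y1, x1, y2, x2 = box_2d
    return (
        round(x1 * width / 1000),
        round(y1 * height / 1000),
        round(x2 * width / 1000),
        round(y2 * height / 1000),
    )


# A box covering the top-left quadrant of a 2000x1000 image
print(to_pixel_box([0, 0, 500, 500], width=2000, height=1000))  # (0, 0, 1000, 500)
```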

    To automate response processing, it’s convenient to define a Pydantic class that matches the prompt, such as:

    class DetectedObject(pydantic.BaseModel):
        box_2d: list[int]
        caption: str
        label: str
    
    DetectedObjects: TypeAlias = list[DetectedObject]
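    As a sanity check that doesn’t require an API call, pydantic can validate a hand-written JSON payload against this schema (the sample values below are made up):

```python
import pydantic


class DetectedObject(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: str


# TypeAdapter validates a plain JSON array against list[DetectedObject]
adapter = pydantic.TypeAdapter(list[DetectedObject])
sample_json = '[{"box_2d": [10, 20, 400, 500], "caption": "Figure 1", "label": "engraving"}]'
objects = adapter.validate_json(sample_json)
print(objects[0].label)  # engraving
```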

    Then, request a structured output with the config fields response_mime_type and response_schema:

    config = GenerateContentConfig(
        # …,
        response_mime_type="application/json",
        response_schema=DetectedObjects,
        # …,
    )

    This will generate a JSON response, which the SDK can parse automatically, letting us directly use object instances:

    detected_objects = cast(DetectedObjects, response.parsed)
    Let’s add a few object-detection-specific classes and functions: 🔽
    import io
    import urllib.request
    from collections.abc import Iterator
    from dataclasses import area
    from datetime import datetime
    
    import PIL.Picture
    from google.genai.sorts import Half, PartMediaResolutionLevel
    from PIL.PngImagePlugin import PngInfo
    
    OBJECT_DETECTION_PROMPT = """
    Detect each illustration inside the e-book picture and extract the next knowledge for every:
    - `box_2d`: Bounding field coordinates of the illustration solely (ignoring any caption).
    - `caption`: Verbatim caption or legend resembling "Determine 1". Use "" if not discovered.
    - `label`: Single-word label describing the illustration. Use "" if not discovered.
    """
    
    # Margin added to detected/cropped objects, giving extra context for a greater understanding of spatial distortions
    CROP_MARGIN_PX = 10
    
    # Set to True to save lots of every generated picture
    SAVE_GENERATED_IMAGES = False
    OUTPUT_IMAGES_PATH = Path("./object_detection_and_editing")
    
    
    # Matching class for structured output technology
    class DetectedObject(pydantic.BaseModel):
        box_2d: listing[int]
        caption: str
        label: str
    
    
    # Misc knowledge lessons
    InputImage = Path | Url
    DetectedObjects = listing[DetectedObject]
    WorkflowStepImages = listing[PIL_Image]
    
    
    class WorkflowStep(StrEnum):
        SOURCE = auto()
        CROPPED = auto()
        RESTORED = auto()
        COLORIZED = auto()
        CINEMATIZED = auto()
    
    
    @dataclass
    class VisualObjectWorkflow:
        source_image: PIL_Image
        detected_objects: DetectedObjects
        images_by_step: dict[WorkflowStep, WorkflowStepImages] = area(default_factory=dict)
    
        def __post_init__(self) -> None:
            denormalize_bounding_boxes(self)
    
    
    workflow_by_image: dict[InputImage, VisualObjectWorkflow] = {}
    
    
    def denormalize_bounding_boxes(self: VisualObjectWorkflow) -> None:
        """Convert the box_2d coordinates.
        - Earlier than: [y1, x1, y2, x2] normalized to 0-1000, as returned by Gemini
        - After:  [x1, y1, x2, y2] in source_image coordinates, as utilized in Pillow
        """
    
        def to_image_coord(coord: int, dim: int) -> int:
            return int(coord * dim / 1000 + 0.5)
    
        w, h = self.source_image.dimension
        for obj in self.detected_objects:
            y1, x1, y2, x2 = obj.box_2d
            x1, x2 = to_image_coord(x1, w), to_image_coord(x2, w)
            y1, y2 = to_image_coord(y1, h), to_image_coord(y2, h)
            obj.box_2d = [x1, y1, x2, y2]
    
    
    def detect_objects(
        image: InputImage,
        prompt: str = OBJECT_DETECTION_PROMPT,
        model: MultimodalModel = MultimodalModel.DEFAULT,
        config: GenerateContentConfig | None = None,
        media_resolution: PartMediaResolutionLevel | None = None,
        display_results: bool = True,
    ) -> None:
        display_image_source_info(image)
        pil_image, content_part = get_pil_image_and_part(image, model, media_resolution)
        prompt = prompt.strip()
        contents = [content_part, prompt]
        config = config or get_object_detection_config(model)

        response = generate_content(contents, model, config)

        if response is not None and response.parsed is not None:
            detected_objects = cast(DetectedObjects, response.parsed)
        else:
            detected_objects = DetectedObjects()

        workflow = VisualObjectWorkflow(pil_image, detected_objects)
        workflow_by_image[image] = workflow
        add_cropped_objects(workflow, image, prompt)

        if display_results:
            display_detected_objects(workflow)


    def get_pil_image_and_part(
        image: InputImage,
        model: MultimodalModel,
        media_resolution: PartMediaResolutionLevel | None,
    ) -> tuple[PIL_Image, Part]:
        if isinstance(image, Path):
            image_bytes = image.read_bytes()
        else:
            headers = {"User-Agent": "Mozilla/5.0"}
            req = urllib.request.Request(image, headers=headers)
            with urllib.request.urlopen(req, timeout=10) as response:
                image_bytes = response.read()

        pil_image = PIL.Image.open(io.BytesIO(image_bytes))
        content_part = Part.from_bytes(
            data=image_bytes,
            mime_type="image/*",
            media_resolution=media_resolution,
        )

        return pil_image, content_part


    def get_object_detection_config(model: Model) -> GenerateContentConfig:
        # Low randomness for more determinism
        return GenerateContentConfig(
            temperature=0.0,
            top_p=0.0,
            seed=42,
            response_mime_type="application/json",
            response_schema=DetectedObjects,
            thinking_config=get_thinking_config(model),
        )


    def add_cropped_objects(
        workflow: VisualObjectWorkflow,
        input: InputImage,
        prompt: str,
        crop_margin: int = CROP_MARGIN_PX,
    ) -> None:
        cropped_images: list[PIL_Image] = []
        obj_count = len(workflow.detected_objects)
        for obj_order, obj in enumerate(workflow.detected_objects, 1):
            cropped_image, _ = extract_object_image(workflow.source_image, obj, crop_margin)
            cropped_images.append(cropped_image)
            save_workflow_image(
                WorkflowStep.SOURCE,
                WorkflowStep.CROPPED,
                input,
                obj_order,
                obj_count,
                cropped_image,
                dict(prompt=prompt, crop_margin=str(crop_margin)),
            )
        workflow.images_by_step[WorkflowStep.CROPPED] = cropped_images


    def extract_object_image(
        image: PIL_Image,
        obj: DetectedObject,
        margin: int = 0,
    ) -> tuple[PIL_Image, tuple[int, int, int, int]]:
        def clamp(coord: int, dim: int) -> int:
            return min(max(coord, 0), dim)

        x1, y1, x2, y2 = obj.box_2d
        w, h = image.size
        if margin != 0:
            x1, x2 = clamp(x1 - margin, w), clamp(x2 + margin, w)
            y1, y2 = clamp(y1 - margin, h), clamp(y2 + margin, h)

        box = (x1, y1, x2, y2)
        object_image = image.crop(box)

        return object_image, box


    def save_workflow_image(
        source_step: WorkflowStep,
        target_step: WorkflowStep,
        input_image: InputImage,
        obj_order: int,
        obj_count: int,
        target_image: PIL_Image | None,
        image_info: dict[str, str] | None = None,
    ) -> None:
        if not SAVE_GENERATED_IMAGES or target_image is None:
            return
        if not OUTPUT_IMAGES_PATH.is_dir():
            OUTPUT_IMAGES_PATH.mkdir(parents=True)
        time_str = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
        try:
            filename = f"{Source(input_image).name}_"
        except ValueError:
            filename = ""
        filename += f"{obj_order}o{obj_count}_{source_step}_{target_step}_{time_str}.png"
        image_path = OUTPUT_IMAGES_PATH.joinpath(filename)
        params = {}
        if image_info:
            png_info = PngInfo()
            for k, v in image_info.items():
                png_info.add_text(k, v)
            params.update(pnginfo=png_info)
        target_image.save(image_path, **params)


    # Matplotlib
    FIGURE_FG_COLOR = "#F1F3F4"
    FIGURE_BG_COLOR = "#202124"
    EDGE_COLOR = "#80868B"
    rcParams = {
        "figure.dpi": 300,
        "text.color": FIGURE_FG_COLOR,
        "figure.edgecolor": FIGURE_FG_COLOR,
        "axes.titlecolor": FIGURE_FG_COLOR,
        "axes.edgecolor": EDGE_COLOR,
        "xtick.color": FIGURE_FG_COLOR,
        "ytick.color": FIGURE_FG_COLOR,
        "figure.facecolor": FIGURE_BG_COLOR,
        "xtick.bottom": False,
        "xtick.top": False,
        "ytick.left": False,
        "ytick.right": False,
        "xtick.labelbottom": False,
        "ytick.labelleft": False,
    }
    plt.rcParams.update(rcParams)


    def display_image_source_info(image: InputImage) -> None:
        def get_image_info_md() -> str:
            if image not in Source:
                return f"[[Source Image]({image})]"
            source = Source(image)
            metadata = metadata_by_source.get(source)
            if not metadata:
                return f"[[Source Image]({source.value})]"
            parts = [
                f"[Source Image]({source.value})",
                f"[Source Page]({metadata.webpage_url})",
                metadata.title,
                metadata.credit_line,
            ]
            separator = "•"
            inner_info = f" {separator} ".join(parts)
            return f"{separator} {inner_info} {separator}"

        def yield_md_rows() -> Iterator[str]:
            horizontal_line = "---"
            image_info = get_image_info_md()
            yield horizontal_line
            yield f"_{image_info}_"
            yield horizontal_line

        display_markdown(f"{chr(10)}{chr(10)}".join(yield_md_rows()))


    def display_detected_objects(workflow: VisualObjectWorkflow) -> None:
        source_image = workflow.source_image
        detected_objects = PIL.Image.new("RGB", source_image.size, "white")
        for obj in workflow.detected_objects:
            obj_image, box = extract_object_image(source_image, obj)
            detected_objects.paste(obj_image, (box[0], box[1]))

        _, (ax1, ax2) = plt.subplots(1, 2, layout="compressed")
        ax1.imshow(source_image)
        ax2.imshow(detected_objects)

        disable_colab_cell_scrollbar()
        plt.show()


    print("✅ Object detection helpers defined")

    🧪 Let’s start simple: can we detect the single illustration in this incunable from 1485?

    detect_objects(Source.incunable)

    • Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects detected by Gemini

    💡 This works well. The bounding box is very precise, enclosing the hand-colored woodcut illustration very tightly.


    🧪 Now, let’s check the detection of the multiple visuals in this museum guidebook:

    detect_objects(Source.museum_guidebook)

    • Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects detected by Gemini

    💡 Remarks:

    • The bounding boxes are again very precise.
    • The results are good: there are no false positives and no false negatives.
    • The captions below the visuals are not enclosed within the bounding boxes, which was specifically requested. The bounding box granularity can be controlled by changing the prompt.

    🧪 What about slightly warped visuals?

    detect_objects(Source.paintings)

    • Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

    Visual objects detected by Gemini

    💡 This doesn’t make a difference. Notice how the bottom-right painting is partially covered by the orange bookmark. We’ll try to fix that in the restoration step.


    🧪 What about the tilted visuals in this book about the architecture of Denver?

    detect_objects(Source.denver_illustrated)

    • Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

    Visual objects detected by Gemini

    💡 Each visual is perfectly detected: spatial understanding covers tilted objects.


    🧪 Finally, let’s check the detection on this somewhat warped book page from Alice’s Adventures in Wonderland:

    detect_objects(Source.alice_drawing)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects detected by Gemini

    💡 Page curvature and other distortions don’t prevent non-rectangular objects from being detected. In fact, spatial understanding works at the pixel level, which explains this precision for warped objects. If you’d like to work at a lower level, you can also ask for a “segmentation mask” in the prompt and you’ll get a base64-encoded PNG (each pixel giving the 0-255 probability that it belongs to the object within the bounding box). See the segmentation doc for more details.
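    As a minimal sketch (assuming Pillow and NumPy are available; `decode_segmentation_mask` is a hypothetical helper, not part of the notebook), such a base64-encoded PNG probability mask could be turned into a boolean mask matching the bounding box:

    ```python
    import base64
    import io

    import numpy as np
    import PIL.Image


    def decode_segmentation_mask(
        mask_b64: str,
        box_size: tuple[int, int],
        threshold: int = 128,
    ) -> np.ndarray:
        """Decode a base64-encoded PNG probability mask into a boolean array.

        Each pixel holds the 0-255 probability that it belongs to the object;
        the mask is resized to the bounding box size (width, height) before
        thresholding.
        """
        mask_image = PIL.Image.open(io.BytesIO(base64.b64decode(mask_b64)))
        mask_image = mask_image.convert("L").resize(box_size, PIL.Image.NEAREST)
        return np.asarray(mask_image) >= threshold
    ```

    The resulting boolean array can then be used, for example, to paste only the object's pixels onto a white canvas instead of the full rectangular crop.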


    🏷️ Text extraction and dynamic labeling

    On top of localizing each object with its bounding box, our prompt asked to extract a verbatim caption and to assign a single-word label, when possible.

    Let’s add a simple function to display the detection data in a table: 🔽
    from collections import defaultdict


    def display_detection_data(source: Source, show_consolidated: bool = False) -> None:
        def string_with_visible_linebreaks(s: str) -> str:
            return f'''"{s.replace(chr(10), "↩️")}"'''

        def yield_md_rows_consolidated(workflow: VisualObjectWorkflow) -> Iterator[str]:
            yield "| label | count | captions |"
            yield "| :--- | ---: | :--- |"
            stats = defaultdict(list)
            for obj in workflow.detected_objects:
                stats[obj.label].append(string_with_visible_linebreaks(obj.caption))
            for label, captions in stats.items():
                count = len(captions)
                label_captions = " • ".join(sorted(captions))
                yield f"| {label} | {count} | {label_captions} |"

        def yield_md_rows_with_bbox(workflow: VisualObjectWorkflow) -> Iterator[str]:
            yield "| box_2d | label | caption |"
            yield "| :--- | :--- | :--- |"
            for obj in workflow.detected_objects:
                yield f"| {obj.box_2d} | {obj.label} | {string_with_visible_linebreaks(obj.caption)} |"

        workflow = workflow_by_image.get(source)
        if workflow is None:
            print(f'❌ No detection for source "{source.name}"')
            return
        md_rows = list(
            yield_md_rows_consolidated(workflow)
            if show_consolidated
            else yield_md_rows_with_bbox(workflow)
        )
        display_image_source_info(source)
        display_markdown(chr(10).join(md_rows))

    In the museum guidebook, the dynamic labeling is precise according to the context, and the captions below each illustration are perfectly extracted:

    display_detection_data(Source.museum_guidebook)

    • Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

    | box_2d | label | caption |
    | :--- | :--- | :--- |
    | [954, 629, 1338, 1166] | beetle | “The Horned Beetle.” |
    | [265, 984, 464, 1504] | armor | “Armor of a Man.” |
    | [737, 984, 915, 1328] | armor | “Horse Armor.” |
    | [1225, 1244, 1589, 1685] | beetle | “The Goliath Beetle.” |
    | [264, 1766, 431, 2006] | mask | “The Mask.” |
    | [937, 1769, 1260, 2087] | butterfly | “Painted Lady Butterfly.” |
    | [1325, 2170, 1581, 2468] | butterfly | “The Lady Butterfly.” |

    In the book photo showing four paintings, this is good too:

    display_detection_data(Source.paintings)

    • Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

    | box_2d | label | caption |
    | :--- | :--- | :--- |
    | [378, 203, 837, 575] | painting | “Hái Ô-liu (Olive Picking), tháng 12 năm 1889, sơn dầu trên toan, 28 3/4 x 35 in. [73 x 89 cm]” |
    | [913, 207, 1380, 563] | painting | “Hẻm núi Les Peiroulets (Les Peiroulets Ravine), tháng 10 năm 1889, sơn dầu trên toan, 28 3/4 x 36 1/4 in. [73 x 92 cm]” |
    | [387, 596, 845, 978] | painting | “Trưa: Nghỉ ngơi (phỏng theo Millet) (Noon: Rest from Work [after Millet]), tháng 1 năm 1890, sơn dầu trên toan, 28 3/4 x 35 7/8 in. [73 x 91 cm]” |
    | [921, 611, 1397, 982] | painting | “Hoa hạnh đào (Almond Blossom), tháng 2 năm 1890, sơn dầu trên toan, 28 3/8 x 36 1/4 in. [73 x 92 cm]” |

    In the Denver architecture book, the four captions are assigned to the correct illustrations, which was not an obvious task:

    display_detection_data(Source.denver_illustrated)

    • Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

    | box_2d | label | caption |
    | :--- | :--- | :--- |
    | [203, 224, 741, 839] | building | “ERNEST AND CRANMER BUILDING.” |
    | [743, 73, 1192, 758] | building | “PEOPLE’S BANK BUILDING.” |
    | [1185, 211, 1787, 865] | building | “BOSTON BUILDING.” |
    | [699, 754, 1238, 1203] | building | “COOPER BUILDING.” |

    💡 If you take a closer look at the input image, it’s hard to tell which caption belongs to which illustration at a glance. Most of us would need to think about it (and might get it wrong). Asking Gemini shows that the results are intentional and not pure luck: deciphering vintage layouts can feel a bit like a puzzle, but there is usually a “reading-order” logic at play. In this specific case, the captions are arranged to correspond with the images in a clockwise or Z-pattern starting from the top left.


    In the “Alice’s Adventures in Wonderland” book page, there was a single illustration accompanying the story text. As expected, the caption is empty (i.e., no false positive):

    display_detection_data(Source.alice_drawing)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    | box_2d | label | caption |
    | :--- | :--- | :--- |
    | [111, 146, 1008, 593] | illustration | “” |

    🔭 Generalizing object detection

    We can use the same principles for other object types. We’ll generally keep requesting bounding boxes to identify object positions within images. Without changing our current output structure (i.e., no code change), we can use captions and labels to extract different object metadata depending on the input type.


    🧪 See how we can detect electronic components by adapting the prompt while keeping the exact same code and output structure:

    ELECTRONIC_COMPONENT_DETECTION_PROMPT = """
    Exhaustively detect all the individual electronic components in the image and provide the following data for each:
    - `box_2d`: bounding box coordinates.
    - `caption`: Verbatim alphanumeric text visible on the component (including original line breaks), or "" if no text is present.
    - `label`: Specific type of component.
    """

    detect_objects(
        Source.electronics,
        ELECTRONIC_COMPONENT_DETECTION_PROMPT,
        media_resolution=PartMediaResolutionLevel.MEDIA_RESOLUTION_ULTRA_HIGH,
    )

    • Source Image • Source Page • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •

    Visual objects detected by Gemini

    💡 Remarks:

    • Large and tiny components are detected, thanks to the explicit instruction “exhaustively detect…”.
    • By using the ultra-high media resolution, we ensure more details are tokenized and the “P” component (a visual outlier) gets detected.

    Here’s a consolidated view of the detected components:

    display_detection_data(Source.electronics, show_consolidated=True)

    • Source Image • Source Page • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •

    | label | count | captions |
    | :--- | ---: | :--- |
    | integrated circuit | 3 | “49240↩️020S6K” • “8105↩️0:35” • “P4010↩️9NA0” |
    | resistor | 4 | “” • “” • “105” • “R020” |
    | inductor | 1 | “n1W” |
    | diode | 3 | “K” • “L” • “P” |
    | capacitor | 6 | “” • “” • “” • “” • “” • “” |
    | transistor | 1 | “41” |
    | connector | 1 | “” |

    💡 Remarks:

    • Components are detected along with their text markings, despite the three different text orientations (upright, sideways, and upside down), the blur, and the image noise.
    • We removed the degree of freedom for multi-line text by specifying the inclusion of “original line breaks” in the prompt: responses now consistently include the line breaks for the three integrated circuits (displayed with the ↩️ emoji for better visibility).
    • The last degree of freedom lies in the labeling. While most components were properly labeled, it’s unclear whether the “P” component is a diode, a resistor, or a fuse. Making the instructions more specific (e.g., listing the possible labels, using an enum for the label field in the Pydantic class, or providing guidelines and more details about the expected circuit boards) will make the prompt more “closed” and the results more deterministic and accurate.
      It’s also possible to enable/update the thinking_config configuration, which will trigger a chain of thought before generating the final answer. In all the detections performed, our code used ThinkingLevel.MINIMAL, which didn’t consume any thought tokens (with Gemini 3 Flash). Updating the parameter to ThinkingLevel.LOW, ThinkingLevel.MEDIUM, or ThinkingLevel.HIGH will use thought tokens and can lead to better outputs in complex cases.
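    As a sketch (assuming the `google-genai` SDK's `ThinkingConfig`/`ThinkingLevel` types and the notebook's `DetectedObjects` schema; `deliberate_config` is a hypothetical name), switching a detection to a higher thinking level could look like this:

    ```python
    from google.genai.types import GenerateContentConfig, ThinkingConfig, ThinkingLevel

    # Hypothetical variant of get_object_detection_config: same low-randomness
    # settings, but a higher thinking level (consumes thought tokens).
    deliberate_config = GenerateContentConfig(
        temperature=0.0,
        top_p=0.0,
        seed=42,
        response_mime_type="application/json",
        response_schema=DetectedObjects,
        thinking_config=ThinkingConfig(thinking_level=ThinkingLevel.HIGH),
    )

    detect_objects(
        Source.electronics,
        ELECTRONIC_COMPONENT_DETECTION_PROMPT,
        config=deliberate_config,
    )
    ```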

    This demonstrates the versatility of the approach. Without retraining a model, we switched from detecting 15th-century woodcuts and illustrations with vintage layouts to identifying modern electronics, just by changing the prompt. Such detections, along with caption and label metadata, could be used to auto-crop components for a parts catalog, verify assembly lines, or create interactive schematics… all without a single labeled training image.


    🪄 Editing visual objects

    Now that we can detect visual objects, we can envision an automation workflow to extract and reuse them. For this, we’ll use Gemini 2.5 Flash Image (also known as Nano Banana 🍌) by default, a state-of-the-art image generation and editing model.

    Our object editing functions will follow the same template, taking one step as input and generating an edited image for the output step. Let’s define core helpers for this: 🔽
    from typing import Protocol


    class ObjectEditingFunction(Protocol):
        def __call__(
            self,
            image: InputImage,
            prompt: str | None = None,
            model: ImageModel | None = None,
            config: GenerateContentConfig | None = None,
            display_results: bool = True,
        ) -> None: ...


    SourceTargetSteps = tuple[WorkflowStep, WorkflowStep]
    registered_functions: dict[SourceTargetSteps, ObjectEditingFunction] = {}

    DEFAULT_EDITING_CONFIG = GenerateContentConfig(response_modalities=["IMAGE"])
    EMPTY_IMAGE = PIL.Image.new("1", (1, 1), "white")


    def object_editing_function(
        default_prompt: str,
        source_step: WorkflowStep,
        target_step: WorkflowStep,
        default_model: ImageModel = ImageModel.DEFAULT,
        default_config: GenerateContentConfig = DEFAULT_EDITING_CONFIG,
    ) -> ObjectEditingFunction:
        def editing_function(
            image: InputImage,
            prompt: str | None = default_prompt,
            model: ImageModel | None = default_model,
            config: GenerateContentConfig | None = default_config,
            display_results: bool = True,
        ) -> None:
            workflow, source_images = get_workflow_and_step_images(image, source_step)
            if prompt is None:
                prompt = default_prompt
            prompt = prompt.strip()
            if model is None:
                model = default_model
            # Note: "config is None" is valid and will use the model endpoint default config

            target_images: list[PIL_Image] = []
            display_image_source_info(image)
            obj_count = len(source_images)
            for obj_order, source_image in enumerate(source_images, 1):
                target_image = generate_image([source_image], prompt, model, config)
                save_workflow_image(
                    source_step,
                    target_step,
                    image,
                    obj_order,
                    obj_count,
                    target_image,
                    dict(prompt=prompt),
                )
                target_images.append(target_image if target_image else EMPTY_IMAGE)

            workflow.images_by_step[target_step] = target_images
            if display_results:
                display_sources_and_targets(workflow, source_step, target_step)

        registered_functions[(source_step, target_step)] = editing_function

        return editing_function


    def get_workflow_and_step_images(
        image: InputImage,
        step: WorkflowStep,
    ) -> tuple[VisualObjectWorkflow, list[PIL_Image]]:
        # Objects detected?
        if image not in workflow_by_image:
            detect_objects(image, display_results=False)
        workflow = workflow_by_image.get(image)
        assert workflow is not None

        # Workflow step images? (single level, could be extended to a dynamical graph)
        operation = (WorkflowStep.CROPPED, step)
        if step not in workflow.images_by_step and operation in registered_functions:
            source_function = registered_functions[operation]
            source_function(image, display_results=False)

        # Source images
        source_images = workflow.images_by_step.get(step)
        assert source_images is not None

        return workflow, source_images


    def display_sources_and_targets(
        workflow: VisualObjectWorkflow,
        source_step: WorkflowStep,
        target_step: WorkflowStep,
    ) -> None:
        source_images = workflow.images_by_step[source_step]
        target_images = workflow.images_by_step[target_step]
        if not source_images:
            print("❌ No images to display")
            return

        fig = plt.figure(layout="compressed")
        if horizontal := (len(source_images) >= 2):
            rows, cols = 2, len(source_images)
        else:
            rows, cols = len(source_images), 2
        gs = fig.add_gridspec(rows, cols)

        for i, (source_image, target_image) in enumerate(
            zip(source_images, target_images, strict=True)
        ):
            for dim, image in enumerate([source_image, target_image]):
                grid_spec = gs[dim, i] if horizontal else gs[i, dim]
                ax = fig.add_subplot(grid_spec)
                ax.set_axis_off()
                ax.imshow(image)

        disable_colab_cell_scrollbar()
        plt.show()


    print("✅ Object editing helpers defined")

    Now, let’s define a first editing step to restore the detected objects, which can contain many real-life artifacts…


    ✨ Restoring visual objects

    For this restoration step, we need to craft a prompt that’s generic enough (to cover most use cases) but also specific enough (to take restoration needs into account).

    An image editing prompt is based on natural language, typically using imperative or declarative instructions. With an imperative prompt, you describe the actions to perform on the input, while with a declarative prompt, you describe the expected output. Both are possible and can provide equivalent results. Your choice is mostly a matter of preference, as long as the prompt makes sense.
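    As an illustrative sketch (these two strings are hypothetical examples, not the prompts used in this article), the same restoration intent could be phrased both ways:

    ```python
    # Imperative prompt: describe the actions to perform on the input image
    imperative_prompt = """
    Isolate the visual on a pure white background and remove all paper artifacts.
    """

    # Declarative prompt: describe the expected output image
    declarative_prompt = """
    A clean rendition of the visual, free of paper artifacts, isolated on a pure white background.
    """
    ```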

    Our test suite is mostly composed of book photos, which can contain various photographic and paper artifacts. The Nano Banana models understand these subtleties and can edit images accordingly, which simplifies the prompt.

    Here is a possible restoration function using an imperative prompt:

    RESTORATION_PROMPT = """
    - Isolate and straighten the visual on a pure white background, excluding any surrounding text.
    - Clean up all physical artifacts and noise while preserving every original detail.
    - Center the result and scale it to fit the canvas with minimal, symmetrical margins, ensuring no distortion or cropping.
    """

    # Default config with low randomness for more deterministic restoration outputs
    RESTORATION_CONFIG = GenerateContentConfig(
        temperature=0.0,
        top_p=0.0,
        seed=42,
        response_modalities=["IMAGE"],
    )

    restore_objects = object_editing_function(
        RESTORATION_PROMPT,
        WorkflowStep.CROPPED,
        WorkflowStep.RESTORED,
        default_config=RESTORATION_CONFIG,
    )

    print("✅ Restoration function defined")

    🧪 Let’s try to restore the illustration from the 1485 incunable:

    restore_objects(Source.incunable)

    • Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 We now have a nice restoration of the hand-colored woodcut illustration. Note that our prompt is generic (“clean up all physical artifacts”) and could be made more specific to remove more or fewer artifacts. In this example, there are remaining artifacts, such as the paper discoloration in the sword or the bleeding ink in the armor. We’ll see if we can fix these in the colorization step.


    🧪 What about the illustrations from the museum guidebook?

    restore_objects(Source.museum_guidebook)

    • Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 All good!


    🧪 What about the slightly warped visuals?

    restore_objects(Source.paintings)

    • Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

    Visual objects edited by Nano Banana

    💡 Remarks:

    • Notice how, on the last painting, the orange bookmark is properly removed and the hidden part inpainted to complete the painting.
    • We asked to scale the result to fit the canvas “with minimal, symmetrical margins, ensuring no distortion or cropping”. Depending on the aspect ratio and type of the visual, this degree of freedom can result in different white margins.
    • This example shows well-known paintings by Vincent van Gogh. Nano Banana doesn’t fetch any reference images and only uses the provided input. If these were photos of private paintings, they would be restored in the same way.

    In the Denver architecture book, the illustrations can be tilted, which our generic prompt doesn’t fully take into account. When multiple geometric transformations are involved, it can be challenging to craft an imperative prompt that details all the operations to perform. Instead, a descriptive prompt can be more straightforward, directly describing the expected output.

    🧪 Here’s an example of a descriptive prompt focusing on the restoration of tilted visuals:

    tilted_visual_prompt = """
    An upright, high-fidelity rendition of the visual isolated against a pure white background, filling the canvas with minimal uniform margins. The output is clean, sharp, and free of physical artifacts.
    """

    restore_objects(Source.denver_illustrated, tilted_visual_prompt)

    • Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

    Visual objects edited by Nano Banana

    💡 Remarks:

    • To get these results, the prompt focuses on requesting an “upright” visual “filling the canvas”, which proves more straightforward to write than trying to account for all possible geometric corrections.
    • The native visual understanding automatically identifies the content type (photo, illustration, etc.) and the different artifacts (photographic, paper, printing, scanning…), allowing for precise restorations out of the box.
    • Notice how the consistency is preserved: the last visual is restored as an illustration, while the first visuals keep their photographic style.
    • The results, with this fairly generic prompt, are impressive. It’s, of course, possible to be more specific and request particular lighting, styles, colors…

    In this last test, the input visual has distortions not only from the page curvature but also from the photo perspective.

    🧪 Here’s an example of a descriptive prompt focusing on restoring warped illustrations:

    warped_visual_prompt = """
    An edge-to-edge digital extraction of the illustration from the provided book photo, excluding any peripheral text. All page curvature and perspective distortions are corrected, resulting in an image framed in a perfect rectangle, on a pure white canvas with minimal margins.
    """

    restore_objects(Source.alice_drawing, warped_visual_prompt)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects edited by Nano Banana

    💡 It’s really impressive that such a restoration can be performed in a single step. Note that this prompt is not stable and can generate less optimal results (it would benefit from being more precise). If you have complex transformations, test descriptive prompts iteratively, using precise and concise instructions, and you might be pleasantly surprised. In the worst case, it’s also possible to process the transformations in successive, simpler steps.

    Now, let’s add a colorization step…


    🎨 Colorization

    Our restoration step respected the original styles of the input images. Recent image editing models excel at transforming image styles, starting with colors. This can often be done directly with a simple, precise instruction.

    Here is a possible colorization function using an imperative prompt:

    COLORIZATION_PROMPT = """
    Colorize this image in a modern book illustration style, maintaining all original details without any additions.
    """

    colorize = object_editing_function(
        COLORIZATION_PROMPT,
        WorkflowStep.RESTORED,
        WorkflowStep.COLORIZED,
    )

    print("✅ Colorization function defined")

    🧪 Let’s modernize our 1485 illustration:

    colorize(Source.incunable)

    • Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 All details are preserved, as requested in the prompt. Notice how the colorization can naturally fix some remaining artifacts (e.g., the paper discoloration in the sword or the bleeding ink in the armor).


    🧪 Let’s colorize our museum guidebook illustrations:

    colorize(Source.museum_guidebook)

    • Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 Our prompt is very open, as it only specifies “modern book illustration style”. This can generate very creative colorizations, but they all seem to make good sense.


    🧪 What about our Denver buildings?

    colorize(Supply.denver_illustrated)

    • Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Assembly of Frontiers •

    Visual objects edited by Nano Banana

    💡 As requested, they all look like modern illustrations, including the first visuals (originating from noisy photographs).


    It’s possible to go further by not only “colorizing” but also “transforming” the image into a significantly different one.

    🧪 Let’s turn our “Alice’s Adventures in Wonderland” drawing into a watercolor painting:

    watercolor_prompt = """
    Transform this visual into a warm, watercolor painting.
    """
    
    colorize(Source.alice_drawing, watercolor_prompt)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects edited by Nano Banana

    🧪 What about making it a traditional painting?

    painting_prompt = """
    Transform this visual into a traditional painting.
    """
    
    colorize(Source.alice_drawing, painting_prompt)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects edited by Nano Banana

    We can also change image compositions. Depending on the context, some compositions are more or less implied by default. For example, illustrations typically have margins, while photographs often have edge-to-edge (full-bleed in the printing world) compositions. When possible, it’s interesting to refer to a type of visual (which intrinsically brings a lot of semantics to the context) and adjust the instructions accordingly.

    🧪 Let’s see how we can detect engravings in this 1847 book, restore them, and transform them into modern digital graphics:

    detect_objects(Source.engravings)

    • Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

    Visual objects edited by Nano Banana
    restore_objects(Source.engravings)

    • Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

    Visual objects edited by Nano Banana
    visual_to_digital_graphic_prompt = """
    Transform this visual into a full-color, flat digital graphic, extending the content for a full-bleed effect.
    """
    
    colorize(Source.engravings, visual_to_digital_graphic_prompt)

    • Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

    Visual objects edited by Nano Banana

    🧪 We can also transform the same engravings into photographs with a very simple prompt:

    visual_to_photo_prompt = """
    Transform this visual into a high-end, modern camera photograph.
    """
    
    colorize(Source.engravings, visual_to_photo_prompt)

    • Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

    Visual objects edited by Nano Banana

    💡 As photographs are generally full-bleed, the prompt doesn’t need to specify a composition.

    It’s really up to our imagination, as Nano Banana seems to grasp every aspect of the visual semantics.

    Let’s add a final step to see how far we can go, reimagining images as cinematic movie stills…


    🎞️ Cinematization

    We’ve used rather “closed” prompts so far, crafting specific instructions and constraints to control the outputs. It’s possible to go even further with “open” prompts and generate images in full creative mode. In particular, it can be interesting to refer to photographic or cinematographic terminology, as it encompasses many visual techniques.

    Here is a possible generic cinematization function to reimagine images as movie stills:

    CINEMATIZATION_PROMPT = """
    Reimagine this image as a joyful, modern live-action cinematic movie still featuring professional lighting and composition.
    """
    
    cinematize = object_editing_function(
        CINEMATIZATION_PROMPT,
        WorkflowStep.RESTORED,
        WorkflowStep.CINEMATIZED,
    )

    🧪 Let’s cinematize the “Alice’s Adventures in Wonderland” drawing:

    cinematize(Source.alice_drawing)

    • Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

    Visual objects edited by Nano Banana

    💡 This looks like a high-budget movie still. There are many degrees of freedom in the prompt, but you’re likely to get foreground figures in sharp focus, a smooth background blur, “golden hour” lighting (a magical ingredient for many cinematographers), and detailed textures. Such compositions really evoke different atmospheres compared to the photographs generated in the previous test.


    🧪 Let’s test the workflow on a page from The Wonderful Wizard of Oz containing three drawings:

    detect_objects(Source.wizard_of_oz_drawings)

    • Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana
    restore_objects(Source.wizard_of_oz_drawings)

    • Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana
    cinematize(Source.wizard_of_oz_drawings)

    • Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •

    Visual objects edited by Nano Banana

    💡 The cast for a new movie is ready 😉


    Cinematic images have many use cases:

    • These cinematized stills can be good “reference images” for video generation models like Veo. See Generate Veo videos from reference images.
    • As they’re photorealistic representations, they can also be a source for generating 2D or 3D visuals, in any style, with realistic figures, good proportions, advanced lighting, enhanced compositions…
    • You can use them in many professional contexts or for high-end products: presentations, magazines, posters, storyboards, brainstorming sessions…

    🏁 Conclusion

    • Gemini’s native spatial understanding enables the detection of specific visual objects based on a single prompt in natural language.
    • We tested the detection of illustrations in book photos, which traditional machine learning (ML) models usually miss, as they’re typically trained to detect people, animals, vehicles, food, and a finite set of physical object classes.
    • We tested the detection of straight, tilted, and even significantly warped illustrations, and they were always precisely identified.
    • The core implementation was straightforward, requiring minimal code using the Python SDK and customized prompts. By comparison, fine-tuning a traditional object detection model is time-consuming: it involves assembling an image dataset, labeling objects, and managing training jobs.
    • This solution is very flexible: we could switch from detecting illustrations to electronic components by adapting the prompt, while keeping the code unchanged.
    • Using structured outputs (with a JSON schema or Pydantic classes, and the Python SDK) makes the code both easy to implement and ready to deploy to production.
    • Then, Nano Banana allows editing these visual objects in virtually any way imaginable.
    • We tested a workflow with restoration, colorization, and even cinematization steps, using imperative and descriptive prompts.
    • The possibilities seem truly limitless, and the ideas in this exploration can be reused in various contexts.
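One practical detail worth keeping in mind when reusing the detection step: Gemini’s object detection responses typically return `box_2d` coordinates normalized to a 0–1000 range, in `[y_min, x_min, y_max, x_max]` order, so they must be rescaled before cropping. Here is a minimal sketch of that conversion (using a plain dataclass where the notebook uses Pydantic classes; the field names follow that convention):

```python
from dataclasses import dataclass


@dataclass
class DetectedObject:
    """One detected visual object, as returned via structured outputs."""
    box_2d: list  # [y_min, x_min, y_max, x_max], normalized to 0-1000
    label: str


def to_pixel_box(obj: DetectedObject, width: int, height: int) -> tuple:
    """Convert a normalized 0-1000 box to a (left, top, right, bottom) pixel box."""
    y_min, x_min, y_max, x_max = obj.box_2d
    return (
        x_min * width // 1000,
        y_min * height // 1000,
        x_max * width // 1000,
        y_max * height // 1000,
    )


obj = DetectedObject(box_2d=[100, 200, 500, 800], label="illustration")
print(to_pixel_box(obj, width=2000, height=1000))  # (400, 100, 1600, 500)
```

The resulting tuple can be passed directly to something like Pillow’s `Image.crop()` to extract each detected visual object from the source page.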

    ➕ More!

    Thanks for reading. Let me know if you create something cool!


