Before we begin:
- I’m a developer at Google Cloud. Thoughts and opinions expressed here are entirely my own.
- The entire source code for this article, including future updates, is available in this notebook under the Apache 2.0 license.
- All new images in this article were generated with Gemini Nano Banana using the explored proof of concept. All source images are either in the public domain or free to use (reference links are provided in the code output).
- You can experiment with Gemini models for free in Google AI Studio. For programmatic API access, please note that while a free tier is available for some models (i.e., you can perform object detection), image generation is a pay-as-you-go service.
✨ Overview
Traditional computer vision models are typically trained to detect a fixed set of object classes, like “person”, “cat”, or “car”. If you want to detect something specific that wasn’t in the training set, such as an “illustration” in a book photograph, you usually need to gather a dataset, label it manually, and train a custom model, which can take hours or even days.
In this exploration, we’ll test a different approach using Gemini. We will leverage its spatial understanding capabilities to perform open-vocabulary object detection. This lets us find objects based solely on a natural language description, without any training.
Once the visual objects are detected, we’ll extract them and then use Gemini’s image editing capabilities (specifically the Nano Banana models) to restore and creatively transform them.
🔥 Challenge
We’re dealing with unstructured data: photographs of books, magazines, and objects in the wild. These images present several difficulties for traditional computer vision:
- Variety: The objects we want to find (illustrations, engravings, and visuals in general) vary wildly in style and content.
- Distortion: Pages are curved, photos are taken at angles, and lighting is uneven.
- Noise: Old books have stains, paper grain, and text bleeding through from the other side.
Our challenge is to build a robust pipeline that can detect these objects despite the distortions, extract them cleanly, and edit them to look like high-quality digital assets… all using simple text prompts.
🏁 Setup
🐍 Python packages
We’ll use the following packages:
- google-genai: the Google Gen AI Python SDK lets us call Gemini with a few lines of code
- pillow: for image management
- matplotlib: for result visualization
We’ll also use these packages (dependencies of google-genai):
- pydantic: for data management
- tenacity: for request management
pip install --quiet "google-genai>=1.63.0" "pillow>=11.3.0" "matplotlib>=3.10.0"
🔗 Gemini API
To use the Gemini API, we have two main options:
- Via Vertex AI with a Google Cloud project
- Via Google AI Studio with a Gemini API key
The Google Gen AI SDK provides a unified interface to these APIs, and we can use environment variables for the configuration. 🔽
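For reference, both options can also be configured directly in code when creating the client, without environment variables (a minimal sketch with placeholder values; the rest of this article relies on the environment-based setup below):

from google import genai

# Option 1 - Vertex AI (placeholder project ID)
client = genai.Client(vertexai=True, project="your-project-id", location="global")

# Option 2 - Google AI Studio (placeholder API key)
# client = genai.Client(api_key="your-api-key")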
🛠️ Option 1 – Gemini API via Vertex AI
Requirements:
Gen AI SDK environment variables:
- GOOGLE_GENAI_USE_VERTEXAI="True"
- GOOGLE_CLOUD_PROJECT=""
- GOOGLE_CLOUD_LOCATION=""
💡 For preview models, the location must be set to global. For generally available models, we can choose the closest location among the Google model endpoint locations.
ℹ️ Learn more about setting up a project and a development environment.
🛠️ Option 2 – Gemini API via Google AI Studio
Requirement:
Gen AI SDK environment variables:
- GOOGLE_GENAI_USE_VERTEXAI="False"
- GOOGLE_API_KEY=""
ℹ️ Learn more about getting a Gemini API key from Google AI Studio.
💡 You can store your environment configuration outside of the source code:
| Environment | Method |
|---|---|
| IDE | .env file (or equivalent) |
| Colab | Colab Secrets (🗝️ icon in left panel, see code below) |
| Colab Enterprise | Google Cloud project and location are automatically defined |
| Vertex AI Workbench | Google Cloud project and location are automatically defined |
Define the following environment detection functions. You can also define your configuration manually if needed. 🔽
import os
import sys
from collections.abc import Callable

from google import genai

# Manual setup (leave unchanged if setup is environment-defined)
# @markdown **Which API: Vertex AI or Google AI Studio?**
GOOGLE_GENAI_USE_VERTEXAI = True  # @param {type: "boolean"}
# @markdown **Option A - Google Cloud project [+location]**
GOOGLE_CLOUD_PROJECT = ""  # @param {type: "string"}
GOOGLE_CLOUD_LOCATION = "global"  # @param {type: "string"}
# @markdown **Option B - Google AI Studio API key**
GOOGLE_API_KEY = ""  # @param {type: "string"}


def check_environment() -> bool:
    check_colab_user_authentication()
    return check_manual_setup() or check_vertex_ai() or check_colab() or check_local()


def check_manual_setup() -> bool:
    return check_define_env_vars(
        GOOGLE_GENAI_USE_VERTEXAI,
        GOOGLE_CLOUD_PROJECT.strip(),  # Might have been pasted with a line return
        GOOGLE_CLOUD_LOCATION,
        GOOGLE_API_KEY,
    )


def check_vertex_ai() -> bool:
    # Workbench and Colab Enterprise
    match os.getenv("VERTEX_PRODUCT", ""):
        case "WORKBENCH_INSTANCE":
            pass
        case "COLAB_ENTERPRISE":
            if not running_in_colab_env():
                return False
        case _:
            return False
    return check_define_env_vars(
        True,
        os.getenv("GOOGLE_CLOUD_PROJECT", ""),
        os.getenv("GOOGLE_CLOUD_REGION", ""),
        "",
    )


def check_colab() -> bool:
    if not running_in_colab_env():
        return False
    # Colab Enterprise was checked before, so this is Colab only
    from google.colab import auth as colab_auth  # type: ignore

    colab_auth.authenticate_user()
    # Use Colab Secrets (🗝️ icon in left panel) to store the environment variables
    # Secrets are private, visible only to you and the notebooks that you select
    # - Vertex AI: Store your settings as secrets
    # - Google AI: Directly import your Gemini API key from the UI
    vertexai, project, location, api_key = get_vars(get_colab_secret)
    return check_define_env_vars(vertexai, project, location, api_key)


def check_local() -> bool:
    vertexai, project, location, api_key = get_vars(os.getenv)
    return check_define_env_vars(vertexai, project, location, api_key)


def running_in_colab_env() -> bool:
    # Colab or Colab Enterprise
    return "google.colab" in sys.modules


def check_colab_user_authentication() -> None:
    if running_in_colab_env():
        from google.colab import auth as colab_auth  # type: ignore

        colab_auth.authenticate_user()


def get_colab_secret(secret_name: str, default: str) -> str:
    from google.colab import errors, userdata  # type: ignore

    try:
        return userdata.get(secret_name)
    except errors.SecretNotFoundError:
        return default


def disable_colab_cell_scrollbar() -> None:
    if running_in_colab_env():
        from google.colab import output  # type: ignore

        output.no_vertical_scroll()


def get_vars(getenv: Callable[[str, str], str]) -> tuple[bool, str, str, str]:
    # Limit getenv calls to the minimum (may trigger a UI confirmation for secret access)
    vertexai_str = getenv("GOOGLE_GENAI_USE_VERTEXAI", "")
    if vertexai_str:
        vertexai = vertexai_str.lower() in ["true", "1"]
    else:
        vertexai = bool(getenv("GOOGLE_CLOUD_PROJECT", ""))
    project = getenv("GOOGLE_CLOUD_PROJECT", "") if vertexai else ""
    location = getenv("GOOGLE_CLOUD_LOCATION", "") if project else ""
    api_key = getenv("GOOGLE_API_KEY", "") if not project else ""
    return vertexai, project, location, api_key


def check_define_env_vars(
    vertexai: bool,
    project: str,
    location: str,
    api_key: str,
) -> bool:
    match (vertexai, bool(project), bool(location), bool(api_key)):
        case (True, True, _, _):
            # Vertex AI - Google Cloud project [+location]
            location = location or "global"
            define_env_vars(vertexai, project, location, "")
        case (True, False, _, True):
            # Vertex AI - API key
            define_env_vars(vertexai, "", "", api_key)
        case (False, _, _, True):
            # Google AI Studio - API key
            define_env_vars(vertexai, "", "", api_key)
        case _:
            return False
    return True


def define_env_vars(vertexai: bool, project: str, location: str, api_key: str) -> None:
    os.environ["GOOGLE_GENAI_USE_VERTEXAI"] = str(vertexai)
    os.environ["GOOGLE_CLOUD_PROJECT"] = project
    os.environ["GOOGLE_CLOUD_LOCATION"] = location
    os.environ["GOOGLE_API_KEY"] = api_key


def check_configuration(client: genai.Client) -> None:
    service = "Vertex AI" if client.vertexai else "Google AI Studio"
    print(f"✅ Using the {service} API", end="")
    if client._api_client.project:
        print(f' with project "{client._api_client.project[:7]}…"', end="")
        print(f' in location "{client._api_client.location}"')
    elif client._api_client.api_key:
        api_key = client._api_client.api_key
        print(f' with API key "{api_key[:5]}…{api_key[-5:]}"', end="")
        print(f" (in case of error, make sure it was created for {service})")


print("✅ Environment functions defined")
🤖 Gen AI SDK
To send Gemini requests, create a google.genai client:

from google import genai

check_environment()
client = genai.Client()
check_configuration(client)
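As a quick sanity check, we can send a minimal text request (a simple sketch; the model is one of those used later in this article):

# Smoke test: send a trivial text prompt and print the reply
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="In one sentence, what is object detection?",
)
print(response.text)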
🖼️ Image test suite
Let’s define a list of images for our tests: 🔽

from dataclasses import dataclass
from enum import StrEnum

Url = str


class Source(StrEnum):
    incunable = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014rosen0487:0165/full/pct:25/0/default.jpg"
    engravings = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:34:07:66:92:1:00340766921:0121/full/pct:50/0/default.jpg"
    museum_guidebook = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2014:2014gen34181:0033/full/pct:75/0/default.jpg"
    denver_illustrated = "https://tile.loc.gov/image-services/iiif/service:gdc:gdclccn:rc:01:00:04:94:rc01000494:0051/full/pct:50/0/default.jpg"
    physics_textbook = "https://tile.loc.gov/image-services/iiif/service:gdc:gdcscd:00:03:64:87:31:8:00036487318:0103/full/pct:50/0/default.jpg"
    portrait_miniatures = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2024:2024rosen013592v02:0249/full/pct:50/0/default.jpg"
    wizard_of_oz_drawings = "https://tile.loc.gov/image-services/iiif/service:rbc:rbc0001:2006:2006gen32405:0048/full/pct:25/0/default.jpg"
    paintings = "https://images.unsplash.com/photo-1714146681164-f26fed839692?h=1440"
    alice_drawing = "https://images.unsplash.com/photo-1630595011903-689853b04ee2?h=800"
    book = "https://images.unsplash.com/photo-1643451533573-ee364ba6e330?h=800"
    manual = "https://images.unsplash.com/photo-1623666936367-a100f62ba9b7?h=800"
    electronics = "https://images.unsplash.com/photo-1757397584789-8b2c5bfcdbc3?h=1440"


@dataclass
class SourceMetadata:
    title: str
    webpage_url: Url
    credit_line: str


LOC = "Library of Congress"
LOC_RARE_BOOKS = "Library of Congress, Rare Book and Special Collections Division"
LOC_MEETING_FRONTIERS = "Library of Congress, Meeting of Frontiers"

metadata_by_source: dict[Source, SourceMetadata] = {
    Source.incunable: SourceMetadata(
        "Vergaderinge der historien van Troy (1485)",
        "https://www.loc.gov/resource/rbc0001.2014rosen0487/?sp=165",
        LOC_RARE_BOOKS,
    ),
    Source.engravings: SourceMetadata(
        "Harper's illustrated catalogue (1847)",
        "https://www.loc.gov/resource/gdcscd.00340766921/?sp=121",
        LOC,
    ),
    Source.museum_guidebook: SourceMetadata(
        "Barnum's American Museum illustrated (1850)",
        "https://www.loc.gov/resource/rbc0001.2014gen34181/?sp=33",
        LOC_RARE_BOOKS,
    ),
    Source.denver_illustrated: SourceMetadata(
        "Denver illustrated (1893)",
        "https://www.loc.gov/resource/gdclccn.rc01000494/?sp=51",
        LOC_MEETING_FRONTIERS,
    ),
    Source.physics_textbook: SourceMetadata(
        "Lessons in physics (1916)",
        "https://www.loc.gov/resource/gdcscd.00036487318/?sp=103",
        LOC,
    ),
    Source.portrait_miniatures: SourceMetadata(
        "The history of portrait miniatures (1904)",
        "https://www.loc.gov/resource/rbc0001.2024rosen013592v02/?sp=249",
        LOC_RARE_BOOKS,
    ),
    Source.wizard_of_oz_drawings: SourceMetadata(
        "The Wonderful Wizard of Oz (1899)",
        "https://www.loc.gov/resource/rbc0001.2006gen32405/?sp=48",
        LOC_RARE_BOOKS,
    ),
    Source.paintings: SourceMetadata(
        "Open book showing paintings by Vincent van Gogh",
        "https://unsplash.com/photos/9hD7qrxICag",
        "Photo by Trung Manh cong on Unsplash",
    ),
    Source.alice_drawing: SourceMetadata(
        "Open book showing an illustration and text from Alice's Adventures in Wonderland",
        "https://unsplash.com/photos/bewzr_Q9u2o",
        "Photo by Brett Jordan on Unsplash",
    ),
    Source.book: SourceMetadata(
        "Open book showing two botanical illustrations",
        "https://unsplash.com/photos/4IDqcNj827I",
        "Photo by Ranurte on Unsplash",
    ),
    Source.manual: SourceMetadata(
        "Open user manual for vintage camera",
        "https://unsplash.com/photos/aaFU96eYASk",
        "Photo by Annie Spratt on Unsplash",
    ),
    Source.electronics: SourceMetadata(
        "Circuit board with electronic components",
        "https://unsplash.com/photos/Aqa1pHQ57pw",
        "Photo by Albert Stoynov on Unsplash",
    ),
}

print("✅ Test images defined")
🧠 Gemini models
Gemini comes in different versions. We can currently use the following models:
- For object detection: Gemini 2.5 or Gemini 3, each available in Flash or Pro versions.
- For object editing: Gemini 2.5 Flash Image or Gemini 3 Pro Image, also known as Nano Banana and Nano Banana Pro.
🛠️ Helpers
Now, let’s add core helper classes and functions: 🔽
from enum import auto
from pathlib import Path
from typing import Any, cast

import IPython.display
import matplotlib.pyplot as plt
import pydantic
import tenacity
from google.genai.errors import ClientError
from google.genai.types import (
    FinishReason,
    GenerateContentConfig,
    GenerateContentResponse,
    PIL_Image,
    ThinkingConfig,
    ThinkingLevel,
)


# Multimodal models with spatial understanding and structured outputs
class MultimodalModel(StrEnum):
    # Generally Available (GA)
    GEMINI_2_5_FLASH = "gemini-2.5-flash"
    GEMINI_2_5_PRO = "gemini-2.5-pro"
    # Preview
    GEMINI_3_FLASH_PREVIEW = "gemini-3-flash-preview"
    GEMINI_3_1_PRO_PREVIEW = "gemini-3.1-pro-preview"
    # Default model used for object detection
    DEFAULT = GEMINI_3_FLASH_PREVIEW


# Image generation and editing models
class ImageModel(StrEnum):
    # Generally Available (GA)
    GEMINI_2_5_FLASH_IMAGE = "gemini-2.5-flash-image"  # Nano Banana 🍌
    # Preview
    GEMINI_3_PRO_IMAGE_PREVIEW = "gemini-3-pro-image-preview"  # Nano Banana Pro 🍌
    # Default model used for image editing
    DEFAULT = GEMINI_2_5_FLASH_IMAGE


Model = MultimodalModel | ImageModel


def generate_content(
    contents: list[Any],
    model: Model,
    config: GenerateContentConfig | None,
    should_display_response_info: bool = False,
) -> GenerateContentResponse | None:
    response = None
    client = check_client_for_model(model)
    for attempt in get_retrier():
        with attempt:
            response = client.models.generate_content(
                model=model.value,
                contents=contents,
                config=config,
            )
    if should_display_response_info:
        display_response_info(response, config)
    return response


def check_client_for_model(model: Model) -> genai.Client:
    if (
        model.value.endswith("-preview")
        and client.vertexai
        and client._api_client.location != "global"
    ):
        # Preview models are only available in the "global" location
        return genai.Client(location="global")
    return client


def display_response_info(
    response: GenerateContentResponse | None,
    config: GenerateContentConfig | None,
) -> None:
    if response is None:
        print("❌ No response")
        return
    if usage_metadata := response.usage_metadata:
        if usage_metadata.prompt_token_count:
            print(f"Input tokens  : {usage_metadata.prompt_token_count:9,d}")
        if usage_metadata.candidates_token_count:
            print(f"Output tokens : {usage_metadata.candidates_token_count:9,d}")
        if usage_metadata.thoughts_token_count:
            print(f"Thoughts tokens: {usage_metadata.thoughts_token_count:9,d}")
    if (
        config is not None
        and config.response_mime_type == "application/json"
        and response.parsed is None
    ):
        print("❌ Could not parse the JSON response")
        return
    if not response.candidates:
        print("❌ No `response.candidates`")
        return
    if (finish_reason := response.candidates[0].finish_reason) != FinishReason.STOP:
        print(f"❌ {finish_reason = }")
    if not response.text:
        print("❌ No `response.text`")
        return


def generate_image(
    sources: list[PIL_Image],
    prompt: str,
    model: ImageModel,
    config: GenerateContentConfig | None = None,
) -> PIL_Image | None:
    contents = [*sources, prompt.strip()]
    response = generate_content(contents, model, config)
    return check_get_output_image_from_response(response)


def check_get_output_image_from_response(
    response: GenerateContentResponse | None,
) -> PIL_Image | None:
    if response is None:
        print("❌ No `response`")
        return None
    if not response.candidates:
        print("❌ No `response.candidates`")
        if response.prompt_feedback:
            if block_reason := response.prompt_feedback.block_reason:
                print(f"{block_reason = :s}")
            if block_reason_message := response.prompt_feedback.block_reason_message:
                print(f"{block_reason_message = }")
        return None
    if not (content := response.candidates[0].content):
        print("❌ No `response.candidates[0].content`")
        return None
    if not (parts := content.parts):
        print("❌ No `response.candidates[0].content.parts`")
        return None
    output_image: PIL_Image | None = None
    for part in parts:
        if part.text:
            display_markdown(part.text)
            continue
        sdk_image = part.as_image()
        assert sdk_image is not None
        output_image = sdk_image._pil_image
        assert output_image is not None
        break  # There should be a single image
    return output_image


def get_thinking_config(model: Model) -> ThinkingConfig | None:
    match model:
        case MultimodalModel.GEMINI_2_5_FLASH:
            return ThinkingConfig(thinking_budget=0)
        case MultimodalModel.GEMINI_2_5_PRO:
            return ThinkingConfig(thinking_budget=128, include_thoughts=False)
        case MultimodalModel.GEMINI_3_FLASH_PREVIEW:
            return ThinkingConfig(thinking_level=ThinkingLevel.MINIMAL)
        case MultimodalModel.GEMINI_3_1_PRO_PREVIEW:
            return ThinkingConfig(thinking_level=ThinkingLevel.LOW)
        case _:
            return None  # Default


def display_markdown(markdown: str) -> None:
    IPython.display.display(IPython.display.Markdown(markdown))


def display_image(image: PIL_Image) -> None:
    IPython.display.display(image)


def get_retrier() -> tenacity.Retrying:
    return tenacity.Retrying(
        stop=tenacity.stop_after_attempt(7),
        wait=tenacity.wait_incrementing(start=10, increment=1),
        retry=should_retry_request,
        reraise=True,
    )


def should_retry_request(retry_state: tenacity.RetryCallState) -> bool:
    if not retry_state.outcome:
        return False
    err = retry_state.outcome.exception()
    if not isinstance(err, ClientError):
        return False
    print(f"❌ ClientError {err.code}: {err.message}")
    retry = False
    match err.code:
        case 400 if err.message is not None and " try again " in err.message:
            # Workshop: first-time access to Cloud Storage (service agent provisioning)
            retry = True
        case 429:
            # Workshop: temporary project with 1 QPM quota
            retry = True
    print(f"🔄 Retry: {retry}")
    return retry


print("✅ Helpers defined")
🔍 Detecting visual objects
To perform visual object detection, craft the prompt to indicate what you’d like to detect and how results should be returned. In the same request, it’s also possible to extract additional information about each detected object. This can be virtually anything, from labels such as “furniture”, “table”, or “chair”, to more precise classifications like “mammals” or “reptiles”, or contextual data such as captions, colors, shapes, etc.
For the following tests, we’ll experiment with detecting illustrations within book photographs. Here’s a possible prompt:

OBJECT_DETECTION_PROMPT = """
Detect every illustration within the book image and extract the following data for each:
- `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).
- `caption`: Verbatim caption or legend such as "Figure 1". Use "" if not found.
- `label`: Single-word label describing the illustration. Use "" if not found.
"""
Notes:
- Bounding boxes are very useful for locating or extracting the detected objects.
- Typically, for Gemini models, a box_2d bounding box represents coordinates normalized to a (0, 0, 1000, 1000) space for a (0, 0, width, height) input image (see the quick example after these notes).
- We’re also requesting to extract captions (metadata typically present in reference books) and labels (dynamic metadata).
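Here is the conversion arithmetic on a hypothetical detection (a minimal sketch; the denormalize_bounding_boxes helper defined later in this article performs the same conversion for all detected objects):

# Hypothetical box_2d returned by Gemini: [y1, x1, y2, x2] normalized to 0-1000
box_2d = [100, 250, 300, 750]
width, height = 2000, 1000  # source image size in pixels

y1, x1, y2, x2 = box_2d
# Scale each coordinate back to pixel space
pixel_box = (
    round(x1 * width / 1000),   # x1 -> 500
    round(y1 * height / 1000),  # y1 -> 100
    round(x2 * width / 1000),   # x2 -> 1500
    round(y2 * height / 1000),  # y2 -> 300
)
print(pixel_box)  # (500, 100, 1500, 300)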
To automate response processing, it’s convenient to define a Pydantic class that matches the prompt, such as:

class DetectedObject(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: str


DetectedObjects = list[DetectedObject]

Then, request a structured output with the response_mime_type and response_schema config fields:

config = GenerateContentConfig(
    # …,
    response_mime_type="application/json",
    response_schema=DetectedObjects,
    # …,
)

This will generate a JSON response that the SDK can parse automatically, letting us directly use object instances:

detected_objects = cast(DetectedObjects, response.parsed)
Let’s add a few object-detection-specific classes and functions: 🔽

import io
import urllib.request
from collections.abc import Iterator
from dataclasses import field
from datetime import datetime

import PIL.Image
from google.genai.types import Part, PartMediaResolutionLevel
from PIL.PngImagePlugin import PngInfo

OBJECT_DETECTION_PROMPT = """
Detect every illustration within the book image and extract the following data for each:
- `box_2d`: Bounding box coordinates of the illustration only (ignoring any caption).
- `caption`: Verbatim caption or legend such as "Figure 1". Use "" if not found.
- `label`: Single-word label describing the illustration. Use "" if not found.
"""

# Margin added to detected/cropped objects, giving more context for a better understanding of spatial distortions
CROP_MARGIN_PX = 10

# Set to True to save every generated image
SAVE_GENERATED_IMAGES = False
OUTPUT_IMAGES_PATH = Path("./object_detection_and_editing")


# Matching class for structured output generation
class DetectedObject(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: str


# Misc data classes
InputImage = Path | Url
DetectedObjects = list[DetectedObject]
WorkflowStepImages = list[PIL_Image]


class WorkflowStep(StrEnum):
    SOURCE = auto()
    CROPPED = auto()
    RESTORED = auto()
    COLORIZED = auto()
    CINEMATIZED = auto()


@dataclass
class VisualObjectWorkflow:
    source_image: PIL_Image
    detected_objects: DetectedObjects
    images_by_step: dict[WorkflowStep, WorkflowStepImages] = field(default_factory=dict)

    def __post_init__(self) -> None:
        denormalize_bounding_boxes(self)


workflow_by_image: dict[InputImage, VisualObjectWorkflow] = {}


def denormalize_bounding_boxes(self: VisualObjectWorkflow) -> None:
    """Convert the box_2d coordinates.

    - Before: [y1, x1, y2, x2] normalized to 0-1000, as returned by Gemini
    - After: [x1, y1, x2, y2] in source_image coordinates, as used in Pillow
    """

    def to_image_coord(coord: int, dim: int) -> int:
        return int(coord * dim / 1000 + 0.5)

    w, h = self.source_image.size
    for obj in self.detected_objects:
        y1, x1, y2, x2 = obj.box_2d
        x1, x2 = to_image_coord(x1, w), to_image_coord(x2, w)
        y1, y2 = to_image_coord(y1, h), to_image_coord(y2, h)
        obj.box_2d = [x1, y1, x2, y2]


def detect_objects(
    image: InputImage,
    prompt: str = OBJECT_DETECTION_PROMPT,
    model: MultimodalModel = MultimodalModel.DEFAULT,
    config: GenerateContentConfig | None = None,
    media_resolution: PartMediaResolutionLevel | None = None,
    display_results: bool = True,
) -> None:
    display_image_source_info(image)
    pil_image, content_part = get_pil_image_and_part(image, model, media_resolution)
    prompt = prompt.strip()
    contents = [content_part, prompt]
    config = config or get_object_detection_config(model)
    response = generate_content(contents, model, config)
    if response is not None and response.parsed is not None:
        detected_objects = cast(DetectedObjects, response.parsed)
    else:
        detected_objects = DetectedObjects()
    workflow = VisualObjectWorkflow(pil_image, detected_objects)
    workflow_by_image[image] = workflow
    add_cropped_objects(workflow, image, prompt)
    if display_results:
        display_detected_objects(workflow)


def get_pil_image_and_part(
    image: InputImage,
    model: MultimodalModel,
    media_resolution: PartMediaResolutionLevel | None,
) -> tuple[PIL_Image, Part]:
    if isinstance(image, Path):
        image_bytes = image.read_bytes()
    else:
        headers = {"User-Agent": "Mozilla/5.0"}
        req = urllib.request.Request(image, headers=headers)
        with urllib.request.urlopen(req, timeout=10) as response:
            image_bytes = response.read()
    pil_image = PIL.Image.open(io.BytesIO(image_bytes))
    content_part = Part.from_bytes(
        data=image_bytes,
        mime_type="image/*",
        media_resolution=media_resolution,
    )
    return pil_image, content_part


def get_object_detection_config(model: Model) -> GenerateContentConfig:
    # Low randomness for more determinism
    return GenerateContentConfig(
        temperature=0.0,
        top_p=0.0,
        seed=42,
        response_mime_type="application/json",
        response_schema=DetectedObjects,
        thinking_config=get_thinking_config(model),
    )


def add_cropped_objects(
    workflow: VisualObjectWorkflow,
    input_image: InputImage,
    prompt: str,
    crop_margin: int = CROP_MARGIN_PX,
) -> None:
    cropped_images: list[PIL_Image] = []
    obj_count = len(workflow.detected_objects)
    for obj_order, obj in enumerate(workflow.detected_objects, 1):
        cropped_image, _ = extract_object_image(workflow.source_image, obj, crop_margin)
        cropped_images.append(cropped_image)
        save_workflow_image(
            WorkflowStep.SOURCE,
            WorkflowStep.CROPPED,
            input_image,
            obj_order,
            obj_count,
            cropped_image,
            dict(prompt=prompt, crop_margin=str(crop_margin)),
        )
    workflow.images_by_step[WorkflowStep.CROPPED] = cropped_images


def extract_object_image(
    image: PIL_Image,
    obj: DetectedObject,
    margin: int = 0,
) -> tuple[PIL_Image, tuple[int, int, int, int]]:
    def clamp(coord: int, dim: int) -> int:
        return min(max(coord, 0), dim)

    x1, y1, x2, y2 = obj.box_2d
    w, h = image.size
    if margin != 0:
        x1, x2 = clamp(x1 - margin, w), clamp(x2 + margin, w)
        y1, y2 = clamp(y1 - margin, h), clamp(y2 + margin, h)
    box = (x1, y1, x2, y2)
    object_image = image.crop(box)
    return object_image, box


def save_workflow_image(
    source_step: WorkflowStep,
    target_step: WorkflowStep,
    input_image: InputImage,
    obj_order: int,
    obj_count: int,
    target_image: PIL_Image | None,
    image_info: dict[str, str] | None = None,
) -> None:
    if not SAVE_GENERATED_IMAGES or target_image is None:
        return
    if not OUTPUT_IMAGES_PATH.is_dir():
        OUTPUT_IMAGES_PATH.mkdir(parents=True)
    time_str = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    try:
        filename = f"{Source(input_image).name}_"
    except ValueError:
        filename = ""
    filename += f"{obj_order}o{obj_count}_{source_step}_{target_step}_{time_str}.png"
    image_path = OUTPUT_IMAGES_PATH.joinpath(filename)
    params = {}
    if image_info:
        png_info = PngInfo()
        for k, v in image_info.items():
            png_info.add_text(k, v)
        params.update(pnginfo=png_info)
    target_image.save(image_path, **params)


# Matplotlib
FIGURE_FG_COLOR = "#F1F3F4"
FIGURE_BG_COLOR = "#202124"
EDGE_COLOR = "#80868B"

rcParams = {
    "figure.dpi": 300,
    "text.color": FIGURE_FG_COLOR,
    "figure.edgecolor": FIGURE_FG_COLOR,
    "axes.titlecolor": FIGURE_FG_COLOR,
    "xtick.color": FIGURE_FG_COLOR,
    "ytick.color": FIGURE_FG_COLOR,
    "figure.facecolor": FIGURE_BG_COLOR,
    "axes.edgecolor": EDGE_COLOR,
    "xtick.bottom": False,
    "xtick.top": False,
    "ytick.left": False,
    "ytick.right": False,
    "xtick.labelbottom": False,
    "ytick.labelleft": False,
}
plt.rcParams.update(rcParams)


def display_image_source_info(image: InputImage) -> None:
    def get_image_info_md() -> str:
        if image not in Source:
            return f"[[Source Image]({image})]"
        source = Source(image)
        metadata = metadata_by_source.get(source)
        if not metadata:
            return f"[[Source Image]({source.value})]"
        parts = [
            f"[Source Image]({source.value})",
            f"[Source Page]({metadata.webpage_url})",
            metadata.title,
            metadata.credit_line,
        ]
        separator = "•"
        inner_info = f" {separator} ".join(parts)
        return f"{separator} {inner_info} {separator}"

    def yield_md_rows() -> Iterator[str]:
        horizontal_line = "---"
        image_info = get_image_info_md()
        yield horizontal_line
        yield f"_{image_info}_"
        yield horizontal_line

    display_markdown(f"{chr(10)}{chr(10)}".join(yield_md_rows()))


def display_detected_objects(workflow: VisualObjectWorkflow) -> None:
    source_image = workflow.source_image
    detected_objects = PIL.Image.new("RGB", source_image.size, "white")
    for obj in workflow.detected_objects:
        obj_image, box = extract_object_image(source_image, obj)
        detected_objects.paste(obj_image, (box[0], box[1]))
    _, (ax1, ax2) = plt.subplots(1, 2, layout="compressed")
    ax1.imshow(source_image)
    ax2.imshow(detected_objects)
    disable_colab_cell_scrollbar()
    plt.show()


print("✅ Object detection helpers defined")
🧪 Let’s start simple: can we detect the single illustration in this incunable from 1485?

detect_objects(Source.incunable)

• Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •



💡 This works well. The bounding box is very precise, enclosing the hand-colored woodcut illustration very tightly.

🧪 Now, let’s check the detection of the multiple visuals in this museum guidebook:

detect_objects(Source.museum_guidebook)

• Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •



💡 Remarks:
- The bounding boxes are again very precise.
- The results are good: there are no false positives and no false negatives.
- The captions below the visuals are not enclosed within the bounding boxes, as specifically requested. The bounding box granularity can be controlled by changing the prompt.

🧪 What about slightly warped visuals?

detect_objects(Source.paintings)

• Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •



💡 This doesn’t make a difference. Notice how the bottom-right painting is partially covered by the orange bookmark. We’ll try to fix that in the restoration step.

🧪 What about the tilted visuals in this book about Denver’s architecture?

detect_objects(Source.denver_illustrated)

• Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •



💡 Each visual is perfectly detected: spatial understanding covers tilted objects.

🧪 Finally, let’s check the detection on this somewhat warped book page from Alice’s Adventures in Wonderland:

detect_objects(Source.alice_drawing)

• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •



💡 Page curvature and other distortions don’t prevent non-rectangular objects from being detected. In fact, spatial understanding works at the pixel level, which explains this precision for warped objects. If you’d like to work at a lower level, you can also ask for a “segmentation mask” in the prompt and you’ll get a base64-encoded PNG (each pixel giving the 0-255 probability that it belongs to the object within the bounding box). See the segmentation documentation for more details.
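As a sketch of what handling such a mask could look like (the mask field name and decoding details are assumptions, not tested in this article):

import base64
import io

import PIL.Image
import pydantic


class DetectedObjectWithMask(pydantic.BaseModel):
    box_2d: list[int]
    mask: str  # assumed field: base64-encoded PNG probability map
    label: str


def decode_mask(obj: DetectedObjectWithMask) -> PIL.Image.Image:
    # Each pixel of the grayscale PNG gives the 0-255 probability that it
    # belongs to the object within the bounding box. Depending on the model
    # output, a data-URL prefix ("data:image/png;base64,") may need stripping.
    base64_data = obj.mask.split(",")[-1]
    return PIL.Image.open(io.BytesIO(base64.b64decode(base64_data)))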
🏷️ Text extraction and dynamic labeling
On top of localizing each object with its bounding box, our prompt asked to extract a verbatim caption and to assign a single-word label, when possible.
Let’s add a simple function to display the detection data in a table: 🔽

from collections import defaultdict


def display_detection_data(source: Source, show_consolidated: bool = False) -> None:
    def string_with_visible_linebreaks(s: str) -> str:
        return f'''"{s.replace(chr(10), "↩️")}"'''

    def yield_md_rows_consolidated(workflow: VisualObjectWorkflow) -> Iterator[str]:
        yield "| label | count | captions |"
        yield "| :--- | ---: | :--- |"
        stats = defaultdict(list)
        for obj in workflow.detected_objects:
            stats[obj.label].append(string_with_visible_linebreaks(obj.caption))
        for label, captions in stats.items():
            count = len(captions)
            label_captions = " • ".join(sorted(captions))
            yield f"| {label} | {count} | {label_captions} |"

    def yield_md_rows_with_bbox(workflow: VisualObjectWorkflow) -> Iterator[str]:
        yield "| box_2d | label | caption |"
        yield "| :--- | :--- | :--- |"
        for obj in workflow.detected_objects:
            yield f"| {obj.box_2d} | {obj.label} | {string_with_visible_linebreaks(obj.caption)} |"

    workflow = workflow_by_image.get(source)
    if workflow is None:
        print(f'❌ No detection for source "{source.name}"')
        return
    md_rows = list(
        yield_md_rows_consolidated(workflow)
        if show_consolidated
        else yield_md_rows_with_bbox(workflow)
    )
    display_image_source_info(source)
    display_markdown(chr(10).join(md_rows))
In the museum guidebook, the dynamic labeling is precise and context-appropriate, and the captions below each illustration are perfectly extracted:

display_detection_data(Source.museum_guidebook)

• Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •

| box_2d | label | caption |
|---|---|---|
| [954, 629, 1338, 1166] | beetle | “The Horned Beetle.” |
| [265, 984, 464, 1504] | armor | “Armor of a Man.” |
| [737, 984, 915, 1328] | armor | “Horse Armor.” |
| [1225, 1244, 1589, 1685] | beetle | “The Goliath Beetle.” |
| [264, 1766, 431, 2006] | mask | “The Mask.” |
| [937, 1769, 1260, 2087] | butterfly | “Painted Lady Butterfly.” |
| [1325, 2170, 1581, 2468] | butterfly | “The Lady Butterfly.” |
In the book photo showing four paintings, this is good too:

display_detection_data(Source.paintings)

• Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •

| box_2d | label | caption |
|---|---|---|
| [378, 203, 837, 575] | painting | “Hái Ô-liu (Olive Picking), tháng 12 năm 1889, sơn dầu trên toan, 28 3/4 x 35 in. [73 x 89 cm]” |
| [913, 207, 1380, 563] | painting | “Hẻm núi Les Peiroulets (Les Peiroulets Ravine), tháng 10 năm 1889, sơn dầu trên toan, 28 3/4 x 36 1/4 in. [73 x 92 cm]” |
| [387, 596, 845, 978] | painting | “Trưa: Nghỉ ngơi (phỏng theo Millet) (Noon: Rest from Work [after Millet]), tháng 1 năm 1890, sơn dầu trên toan, 28 3/4 x 35 7/8 in. [73 x 91 cm]” |
| [921, 611, 1397, 982] | painting | “Hoa hạnh đào (Almond Blossom), tháng 2 năm 1890, sơn dầu trên toan, 28 3/8 x 36 1/4 in. [73 x 92 cm]” |
In the Denver architecture book, the four captions are assigned to the correct illustrations, which was not an obvious task:

display_detection_data(Source.denver_illustrated)

• Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •

| box_2d | label | caption |
|---|---|---|
| [203, 224, 741, 839] | building | “ERNEST AND CRANMER BUILDING.” |
| [743, 73, 1192, 758] | building | “PEOPLE’S BANK BUILDING.” |
| [1185, 211, 1787, 865] | building | “BOSTON BUILDING.” |
| [699, 754, 1238, 1203] | building | “COOPER BUILDING.” |
💡 If you take a closer look at the input image, it’s hard to tell which caption belongs to which illustration at a glance. Most of us would need to think about it (and might be wrong). Asking Gemini shows that the results are intentional and not pure luck: deciphering vintage layouts can feel a bit like a puzzle, but there is usually a “reading-order” logic at play. In this specific case, the captions are arranged to correspond with the images in a clockwise or Z-pattern starting from the top left.

In the “Alice’s Adventures in Wonderland” book page, there was a single illustration accompanying the story text. As expected, the caption is empty (i.e., no false positive):

display_detection_data(Source.alice_drawing)

• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

| box_2d | label | caption |
|---|---|---|
| [111, 146, 1008, 593] | illustration | “” |
🔭 Generalizing object detection
We can use the same principles for other object types. We’ll typically keep requesting bounding boxes to identify object positions within images. Without changing our current output structure (i.e., no code change), we can use captions and labels to extract different object metadata depending on the input type.
🧪 See how we can detect electronic components by adapting the prompt while keeping the exact same code and output structure:

ELECTRONIC_COMPONENT_DETECTION_PROMPT = """
Exhaustively detect all the individual electronic components in the image and provide the following data for each:
- `box_2d`: Bounding box coordinates.
- `caption`: Verbatim alphanumeric text visible on the component (including original line breaks), or "" if no text is present.
- `label`: Specific type of component.
"""

detect_objects(
    Source.electronics,
    ELECTRONIC_COMPONENT_DETECTION_PROMPT,
    media_resolution=PartMediaResolutionLevel.MEDIA_RESOLUTION_ULTRA_HIGH,
)
• Source Image • Source Page • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •



💡 Remarks:
- Large and tiny components are detected, thanks to the explicit instruction “exhaustively detect…”.
- By using the ultra-high media resolution, we ensure more details are tokenized and the “P” component (a visual outlier) gets detected.

Here’s a consolidated view of the detected components:

display_detection_data(Source.electronics, show_consolidated=True)

• Source Image • Source Page • Circuit board with electronic components • Photo by Albert Stoynov on Unsplash •
| label | count | captions |
|---|---|---|
| integrated circuit | 3 | “49240↩️020S6K” • “8105↩️0:35” • “P4010↩️9NA0” |
| resistor | 4 | “” • “” • “105” • “R020” |
| inductor | 1 | “n1W” |
| diode | 3 | “K” • “L” • “P” |
| capacitor | 6 | “” • “” • “” • “” • “” • “” |
| transistor | 1 | “41” |
| connector | 1 | “” |
💡 Remarks:
- Components are detected along with their text markings, despite the three different text orientations (upright, sideways, and upside down), the blur, and the image noise.
- We removed the degree of freedom for multi-line text by specifying the inclusion of “original line breaks” in the prompt: responses now consistently include the line breaks for the three integrated circuits (displayed with the ↩️ emoji for better visibility).
- The last degree of freedom lies in the labeling. While most components were properly labeled, it’s unclear whether the “P” component is a diode, a resistor, or a fuse. Making the instructions more specific (e.g., listing the possible labels, using an enum for the `label` field in the Pydantic class as sketched below, or providing guidelines and more details about the expected circuit boards) will make the prompt more “closed” and the results more deterministic and accurate.
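For example, the label vocabulary could be closed with an enum (a sketch; the list of component types is an assumption for illustration purposes):

from enum import StrEnum

import pydantic


# Closing the label vocabulary: the structured output can only use these values
class ComponentType(StrEnum):
    CAPACITOR = "capacitor"
    CONNECTOR = "connector"
    DIODE = "diode"
    FUSE = "fuse"
    INDUCTOR = "inductor"
    INTEGRATED_CIRCUIT = "integrated circuit"
    RESISTOR = "resistor"
    TRANSISTOR = "transistor"


class DetectedComponent(pydantic.BaseModel):
    box_2d: list[int]
    caption: str
    label: ComponentType  # constrained to the enum values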
It’s also possible to enable or update the `thinking_config` configuration, which will trigger a chain of thought before producing the final answer. In all the detections performed, our code used `ThinkingLevel.MINIMAL`, which didn’t consume any thought tokens (with Gemini 3 Flash). Updating the parameter to `ThinkingLevel.LOW`, `ThinkingLevel.MEDIUM`, or `ThinkingLevel.HIGH` will consume thought tokens and can lead to better outputs in complex cases.
This demonstrates the versatility of the approach. Without retraining a model, we switched from detecting 15th-century woodcuts and illustrations with vintage layouts to identifying modern electronics just by changing the prompt. Such detections, along with caption and label metadata, could be used to auto-crop components for a parts catalog (see the sketch below), verify assembly lines, or create interactive schematics… all without a single labeled training image.
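For instance, the cropped components could be saved as a rudimentary parts catalog with the helpers already defined (a sketch; the file naming scheme is arbitrary):

# Save each detected component as an individual image file
workflow = workflow_by_image[Source.electronics]
catalog_path = Path("./parts_catalog")
catalog_path.mkdir(exist_ok=True)
for i, obj in enumerate(workflow.detected_objects, 1):
    component_image, _ = extract_object_image(workflow.source_image, obj, margin=10)
    component_image.save(catalog_path / f"{i:02d}_{obj.label.replace(' ', '_')}.png")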
🪄 Editing visual objects
Now that we can detect visual objects, we can envision an automation workflow to extract and reuse them. For this, we’ll use Gemini 2.5 Flash Image (also known as Nano Banana 🍌) by default, a state-of-the-art image generation and editing model.
Our object editing functions will follow the same template, taking one step as input and generating an edited image for the output step. Let’s define core helpers for this: 🔽
from typing import Protocol


class ObjectEditingFunction(Protocol):
    def __call__(
        self,
        image: InputImage,
        prompt: str | None = None,
        model: ImageModel | None = None,
        config: GenerateContentConfig | None = None,
        display_results: bool = True,
    ) -> None: ...


SourceTargetSteps = tuple[WorkflowStep, WorkflowStep]
registered_functions: dict[SourceTargetSteps, ObjectEditingFunction] = {}

DEFAULT_EDITING_CONFIG = GenerateContentConfig(response_modalities=["IMAGE"])
EMPTY_IMAGE = PIL.Image.new("1", (1, 1), "white")


def object_editing_function(
    default_prompt: str,
    source_step: WorkflowStep,
    target_step: WorkflowStep,
    default_model: ImageModel = ImageModel.DEFAULT,
    default_config: GenerateContentConfig = DEFAULT_EDITING_CONFIG,
) -> ObjectEditingFunction:
    def editing_function(
        image: InputImage,
        prompt: str | None = default_prompt,
        model: ImageModel | None = default_model,
        config: GenerateContentConfig | None = default_config,
        display_results: bool = True,
    ) -> None:
        workflow, source_images = get_workflow_and_step_images(image, source_step)
        if prompt is None:
            prompt = default_prompt
        prompt = prompt.strip()
        if model is None:
            model = default_model
        # Note: "config is None" is valid and will use the model endpoint default config
        target_images: list[PIL_Image] = []
        display_image_source_info(image)
        obj_count = len(source_images)
        for obj_order, source_image in enumerate(source_images, 1):
            target_image = generate_image([source_image], prompt, model, config)
            save_workflow_image(
                source_step,
                target_step,
                image,
                obj_order,
                obj_count,
                target_image,
                dict(prompt=prompt),
            )
            target_images.append(target_image if target_image else EMPTY_IMAGE)
        workflow.images_by_step[target_step] = target_images
        if display_results:
            display_sources_and_targets(workflow, source_step, target_step)

    registered_functions[(source_step, target_step)] = editing_function
    return editing_function


def get_workflow_and_step_images(
    image: InputImage,
    step: WorkflowStep,
) -> tuple[VisualObjectWorkflow, list[PIL_Image]]:
    # Objects detected?
    if image not in workflow_by_image:
        detect_objects(image, display_results=False)
    workflow = workflow_by_image.get(image)
    assert workflow is not None
    # Workflow step images? (single level, could be extended to a dynamic graph)
    operation = (WorkflowStep.CROPPED, step)
    if step not in workflow.images_by_step and operation in registered_functions:
        source_function = registered_functions[operation]
        source_function(image, display_results=False)
    # Source images
    source_images = workflow.images_by_step.get(step)
    assert source_images is not None
    return workflow, source_images


def display_sources_and_targets(
    workflow: VisualObjectWorkflow,
    source_step: WorkflowStep,
    target_step: WorkflowStep,
) -> None:
    source_images = workflow.images_by_step[source_step]
    target_images = workflow.images_by_step[target_step]
    if not source_images:
        print("❌ No images to display")
        return
    fig = plt.figure(layout="compressed")
    if horizontal := (len(source_images) >= 2):
        rows, cols = 2, len(source_images)
    else:
        rows, cols = len(source_images), 2
    gs = fig.add_gridspec(rows, cols)
    for i, (source_image, target_image) in enumerate(
        zip(source_images, target_images, strict=True)
    ):
        for dim, image in enumerate([source_image, target_image]):
            grid_spec = gs[dim, i] if horizontal else gs[i, dim]
            ax = fig.add_subplot(grid_spec)
            ax.set_axis_off()
            ax.imshow(image)
    disable_colab_cell_scrollbar()
    plt.show()


print("✅ Object editing helpers defined")
Now, let’s define a first editing step to restore the detected objects, which can contain many real-life artifacts…
✨ Restoring visual objects
For this restoration step, we need to craft a prompt that’s generic enough (to cover most use cases) yet specific enough (to take restoration needs into account).
An image editing prompt relies on natural language, typically using imperative or declarative instructions. With an imperative prompt, you describe the actions to perform on the input, while with a declarative prompt, you describe the expected output. Both are possible and will produce equivalent results. Your choice is mostly a matter of preference, as long as the prompt makes sense.
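To make the distinction concrete, here are two hypothetical prompts describing the same edit, one imperative and one declarative:

# Imperative: describe the actions to perform on the input
imperative_prompt = "Remove the background and place the illustration on a pure white canvas."

# Declarative: describe the expected output
declarative_prompt = "The illustration isolated on a pure white canvas."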
Our test suite is mostly composed of book photographs, which can contain various photographic and paper artifacts. The Nano Banana models understand these subtleties and can edit images accordingly, which simplifies the prompt.
Here is a possible restoration function using an imperative prompt:

RESTORATION_PROMPT = """
- Isolate and straighten the visual on a pure white background, excluding any surrounding text.
- Clean up all physical artifacts and noise while preserving every original detail.
- Center the result and scale it to fit the canvas with minimal, symmetrical margins, ensuring no distortion or cropping.
"""

# Default config with low randomness for more deterministic restoration outputs
RESTORATION_CONFIG = GenerateContentConfig(
    temperature=0.0,
    top_p=0.0,
    seed=42,
    response_modalities=["IMAGE"],
)

restore_objects = object_editing_function(
    RESTORATION_PROMPT,
    WorkflowStep.CROPPED,
    WorkflowStep.RESTORED,
    default_config=RESTORATION_CONFIG,
)

print("✅ Restoration function defined")
🧪 Let’s try to restore the illustration from the 1485 incunable:

restore_objects(Source.incunable)

• Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •



💡 We now have a nice restoration of the hand-colored woodcut illustration. Note that our prompt is generic (“clean up all physical artifacts”) and could be made more specific to remove more or fewer artifacts. In this example, there are remaining artifacts, such as the paper discoloration in the sword or the bleeding ink in the armor. We’ll see if we can fix these in the colorization step.

🧪 What about the illustrations from the museum guidebook?

restore_objects(Source.museum_guidebook)

• Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •



💡 All good!

🧪 What about the slightly warped visuals?

restore_objects(Source.paintings)

• Source Image • Source Page • Open book showing paintings by Vincent van Gogh • Photo by Trung Manh cong on Unsplash •



💡 Remarks:
- Notice how, in the last painting, the orange bookmark is properly removed and the hidden part inpainted to complete the painting.
- We asked to “fill the canvas with minimal uniform margins, without distortion or cropping”. Depending on the aspect ratio and type of the visual, this degree of freedom can result in different white margins.
- This example shows well-known paintings by Vincent van Gogh. Nano Banana doesn’t fetch any reference images and only uses the provided input. If these were photos of private paintings, they would be restored in the same way.

In the Denver architecture book, the illustrations can be tilted, which our generic prompt doesn’t fully take into account. When multiple geometric transformations are involved, it can be challenging to craft an imperative prompt that details all the operations to perform. Instead, a descriptive prompt can be more straightforward by directly describing the expected output.

🧪 Here’s an example of a descriptive prompt focusing on the restoration of tilted visuals:

tilted_visual_prompt = """
An upright, high-fidelity rendition of the visual isolated against a pure white background, filling the canvas with minimal uniform margins. The output is clean, sharp, and free of physical artifacts.
"""
restore_objects(Source.denver_illustrated, tilted_visual_prompt)

• Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •



💡 Remarks:
- To get these results, the prompt focuses on requesting an “upright” visual “filling the canvas”, which proves more straightforward to write than trying to account for all possible geometric corrections.
- The native visual understanding automatically identifies the content type (photo, illustration, etc.) and the different artifacts (photographic, paper, printing, scanning…), allowing for precise restorations out of the box.
- Notice how the consistency is preserved: the last visual is restored as an illustration, while the first visuals keep their photographic style.
- The results, with this fairly generic prompt, are spectacular. It is, of course, possible to be more specific and request particular lighting, styles, colors…

In this last test, the input visual has distortions not only from the page curvature but also from the photo perspective.

🧪 Here’s an example of a descriptive prompt focusing on restoring warped illustrations:

warped_visual_prompt = """
An edge-to-edge digital extraction of the illustration from the provided book photo, excluding any peripheral text. All page curvature and perspective distortions are corrected, resulting in an image framed in a perfect rectangle, on a pure white canvas with minimal margins.
"""
restore_objects(Source.alice_drawing, warped_visual_prompt)

• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •



💡 It’s really impressive that such a restoration can be performed in a single step. Note that this prompt is not stable and can generate less optimal results (it could benefit from being more precise). If you have complex transformations, test descriptive prompts iteratively, using precise and concise instructions, and you might be pleasantly surprised. In the worst case, it’s also possible to process the transformations in successive, simpler steps, as sketched below.
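As a sketch, successive steps simply mean feeding each intermediate image back as the next input (the prompts are illustrative, and cropped_image stands for one of the cropped objects from the detection step):

# Step 1: correct the geometry only
flattened = generate_image(
    [cropped_image],
    "Correct all page curvature and perspective distortions, without any other change.",
    ImageModel.DEFAULT,
)
# Step 2: clean up the flattened result
if flattened is not None:
    restored = generate_image(
        [flattened],
        "Clean up all physical artifacts and noise while preserving every original detail.",
        ImageModel.DEFAULT,
    )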
Now, let’s add a colorization step…
🎨 Colorization
Our restoration step respected the original styles of the input images. Recent image editing models excel at transforming image styles, starting with colors. This can often be done directly with a simple, precise instruction.
Here is a possible colorization function using an imperative prompt:

COLORIZATION_PROMPT = """
Colorize this image in a modern book illustration style, maintaining all original details without any additions.
"""

colorize = object_editing_function(
    COLORIZATION_PROMPT,
    WorkflowStep.RESTORED,
    WorkflowStep.COLORIZED,
)

print("✅ Colorization function defined")
🧪 Let’s modernize our 1485 illustration:

colorize(Source.incunable)

• Source Image • Source Page • Vergaderinge der historien van Troy (1485) • Library of Congress, Rare Book and Special Collections Division •



💡 All details are preserved, as requested in the prompt. Notice how the colorization can naturally fix some remaining artifacts (e.g., the paper discoloration in the sword or the bleeding ink in the armor).

🧪 Let’s colorize our museum guidebook illustrations:

colorize(Source.museum_guidebook)

• Source Image • Source Page • Barnum’s American Museum illustrated (1850) • Library of Congress, Rare Book and Special Collections Division •



💡 Our prompt is very open, as it only specifies a “modern book illustration style”. This can generate very creative colorizations, but they all seem to make good sense.

🧪 What about our Denver buildings?

colorize(Source.denver_illustrated)

• Source Image • Source Page • Denver illustrated (1893) • Library of Congress, Meeting of Frontiers •



💡 As requested, they all look like modern illustrations, including the first visuals (originating from noisy photos).

It’s possible to go further by not only “colorizing” but also “transforming” the image into a significantly different one.

🧪 Let’s turn our “Alice’s Adventures in Wonderland” drawing into a watercolor painting:

watercolor_prompt = """
Transform this visual into a warm watercolor painting.
"""
colorize(Source.alice_drawing, watercolor_prompt)

• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •



🧪 What about making it a traditional painting?

painting_prompt = """
Transform this visual into a traditional painting.
"""
colorize(Source.alice_drawing, painting_prompt)

• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •



We can also change image compositions. Depending on the context, some compositions are more or less implied by default. For example, illustrations typically have margins, while photographs often have edge-to-edge (full-bleed in the printing world) compositions. When possible, it’s interesting to refer to a type of visual (which intrinsically brings a lot of semantics to the context) and adjust the instructions accordingly.
🧪 Let’s see how we can detect engravings in this 1847 book, restore them, and transform them into modern digital graphics:
detect_objects(Source.engravings)
• Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

restore_objects(Source.engravings)
• Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

visual_to_digital_graphic_prompt = """
Transform this visual into a full-color, flat digital graphic, extending the content for a full-bleed effect.
"""
colorize(Source.engravings, visual_to_digital_graphic_prompt)
• Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

🧪 We can also transform the same engravings into photographs with a very simple prompt:
visual_to_photo_prompt = """
Transform this visual into a high-end, modern camera photograph.
"""
colorize(Source.engravings, visual_to_photo_prompt)
• Source Image • Source Page • Harper’s illustrated catalogue (1847) • Library of Congress •

💡 As photographs are typically full-bleed, the prompt doesn’t need to specify a composition.
It’s really up to our imagination, as Nano Banana seems to grasp every aspect of the visual semantics.
Let’s add a final step to see how far we can go, reimagining images as cinematic movie stills…
🎞️ Cinematization
We’ve used fairly “closed” prompts so far, crafting specific instructions and constraints to control the outputs. It’s possible to go even further with “open” prompts and generate images in full creative mode. Notably, it can be interesting to refer to photographic or cinematographic terminology, as it encompasses many visual techniques.
Here is a possible generic cinematization function to reimagine images as movie stills:

CINEMATIZATION_PROMPT = """
Reimagine this image as a joyful, modern live-action cinematic movie still featuring professional lighting and composition.
"""

cinematize = object_editing_function(
    CINEMATIZATION_PROMPT,
    WorkflowStep.RESTORED,
    WorkflowStep.CINEMATIZED,
)
🧪 Let’s cinematize the “Alice’s Adventures in Wonderland” drawing:
cinematize(Source.alice_drawing)
• Source Image • Source Page • Open book showing an illustration and text from Alice’s Adventures in Wonderland • Photo by Brett Jordan on Unsplash •

💡 This looks like a high-budget movie still. There are many degrees of freedom in the prompt, but you’re likely to get foreground figures in sharp focus, a gradual background blur, “golden hour” lighting (a magical ingredient for many cinematographers), and detailed textures. Such compositions really evoke different atmospheres compared to the photographs generated in the previous test.

🧪 Let’s test the workflow on a page from The Wonderful Wizard of Oz containing three drawings:

detect_objects(Source.wizard_of_oz_drawings)

• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •



restore_objects(Source.wizard_of_oz_drawings)

• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •



cinematize(Source.wizard_of_oz_drawings)

• Source Image • Source Page • The Wonderful Wizard of Oz (1899) • Library of Congress, Rare Book and Special Collections Division •



💡 The cast for a new movie is ready 😉

Cinematic images have many use cases:
- These cinematized stills can be good “reference images” for video generation models like Veo (see Generate Veo videos from reference images, and the sketch after this list).
- As they’re photorealistic representations, they can also be a source for generating 2D or 3D visuals, in any style, with realistic figures, correct proportions, advanced lighting, enhanced compositions…
- You can use them in many professional contexts or for high-end products: presentations, magazines, posters, storyboards, brainstorming sessions…
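For example, animating a cinematized still with Veo might look like the following (a rough sketch based on the Gen AI SDK’s video generation API; the model name and exact parameters may differ, so check the current documentation):

import time

# Rough sketch: animate one of the cinematized stills with Veo
operation = client.models.generate_videos(
    model="veo-3.0-generate-001",  # assumed model name
    prompt="A gentle camera push-in on the scene, cinematic lighting",
    image=cinematized_still,  # may need conversion to the SDK Image type
)
while not operation.done:  # video generation is asynchronous
    time.sleep(10)
    operation = client.operations.get(operation)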
🏁 Conclusion
- Gemini’s native spatial understanding enables the detection of specific visual objects based on a single prompt in natural language.
- We tested the detection of illustrations in book photographs, which traditional machine learning (ML) models usually miss, as they’re typically trained to detect people, animals, vehicles, food, and a finite set of physical object classes.
- We tested the detection of straight, tilted, and even significantly warped illustrations, and they were always precisely identified.
- The core implementation was simple, requiring minimal code using the Python SDK and customized prompts. By comparison, fine-tuning a traditional object detection model is time-consuming: it involves assembling an image dataset, labeling objects, and managing training jobs.
- This solution is very versatile: we could switch from detecting illustrations to electronic components by adapting the prompt, while keeping the code unchanged.
- Using structured outputs (with a JSON schema or Pydantic classes, and the Python SDK) makes the code both easy to implement and ready to deploy to production.
- Then, Nano Banana allows editing these visual objects in virtually any way imaginable.
- We tested a workflow with restoration, colorization, and even cinematization steps, using imperative and descriptive prompts.
- The possibilities seem truly limitless, and the principles in this exploration can be reused in many contexts.
➕ More!
Thanks for reading. Let me know if you create something cool!

