Multimodal Support
Exo supports multimodal inputs and outputs across all built-in LLM providers (OpenAI, Anthropic, Gemini, Vertex). Agents can receive images, audio, video, and documents from users, and tools can return media that the LLM processes directly.
Content Block Types
Six frozen Pydantic models represent media content. Import them from exo (re-exported from exo.types):
from exo import (
    AudioBlock,
    ContentBlock,
    DocumentBlock,
    ImageDataBlock,
    ImageURLBlock,
    TextBlock,
    VideoBlock,
)

| Type | type field | Key fields |
|---|---|---|
| TextBlock | "text" | text: str |
| ImageURLBlock | "image_url" | url: str, detail: "auto" \| "low" \| "high" |
| ImageDataBlock | "image_data" | data: str (base64), media_type: str |
| AudioBlock | "audio" | data: str (base64), format: str |
| VideoBlock | "video" | data: str \| None, url: str \| None, media_type: str |
| DocumentBlock | "document" | data: str (base64), media_type: str, title: str \| None |
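ContentBlock is a discriminated union of all six types, distinguished by the type field. Message content is typed as MessageContent = str | list[ContentBlock], so a plain string remains valid wherever a list of blocks is accepted.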
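To make the "discriminated union" idea concrete, the sketch below uses frozen standard-library dataclasses as stand-ins for two of the block types and selects the class from the type field of a raw dictionary. It is not Exo's implementation (the real blocks are Pydantic models). The names TextBlockSketch, ImageURLBlockSketch, and parse_block are hypothetical, and so is the "auto" default for detail.

```python
from dataclasses import dataclass
from typing import Literal, Union


@dataclass(frozen=True)
class TextBlockSketch:
    text: str
    type: Literal["text"] = "text"


@dataclass(frozen=True)
class ImageURLBlockSketch:
    url: str
    # Assumption: "auto" as the default is a guess, not taken from Exo.
    detail: Literal["auto", "low", "high"] = "auto"
    type: Literal["image_url"] = "image_url"


ContentBlockSketch = Union[TextBlockSketch, ImageURLBlockSketch]

# Map each discriminator value to the class that handles it.
_BY_TYPE = {"text": TextBlockSketch, "image_url": ImageURLBlockSketch}


def parse_block(raw: dict) -> ContentBlockSketch:
    """Pick the concrete class from the "type" discriminator, then build it."""
    try:
        cls = _BY_TYPE[raw["type"]]
    except KeyError:
        raise ValueError(f"unknown or missing block type: {raw.get('type')!r}") from None
    return cls(**raw)
```

Because the discriminator is an ordinary field with a fixed value per class, a consumer can branch on block.type without isinstance checks.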
Sending Multimodal Input to an Agent
Pass a list of ContentBlock objects as the input to run(), or as a UserMessage.content:
import base64
from exo import run, Agent, ImageDataBlock, ImageURLBlock, TextBlock
# These snippets use top-level await; run them inside an async function or a notebook.
agent = Agent(name="vision", model="openai:gpt-4o")
# Image from URL
result = await run(agent, [
    TextBlock(text="What is in this image?"),
    ImageURLBlock(url="https://example.com/photo.jpg", detail="high"),
])
# Image from bytes
with open("diagram.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
result = await run(agent, [
    TextBlock(text="Explain this diagram."),
    ImageDataBlock(data=b64, media_type="image/png"),
])

For audio:
from exo import AudioBlock
with open("speech.mp3", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
result = await run(agent, [
    TextBlock(text="Transcribe this audio."),
    AudioBlock(data=b64, format="mp3"),
])

For PDF documents (supported by Anthropic, Gemini, and Vertex, but not OpenAI):
from exo import DocumentBlock
# The OpenAI agent above would skip DocumentBlock, so use a provider that supports it.
pdf_agent = Agent(name="reader", model="anthropic:claude-opus-4-6")
with open("report.pdf", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
result = await run(pdf_agent, [
    TextBlock(text="Summarise this report."),
    DocumentBlock(data=b64, media_type="application/pdf", title="Q1 Report"),
])

Tools That Return Media
A tool can return list[ContentBlock] — the agent automatically propagates the blocks to the LLM so it can reason about the generated media:
import base64
from exo import Agent, tool, ImageDataBlock
@tool
async def capture_screenshot(url: str) -> list[ImageDataBlock]:
    """Capture a screenshot of a URL."""
    # _take_screenshot is a placeholder for your own capture routine; it should return PNG bytes.
    screenshot_bytes = await _take_screenshot(url)
    b64 = base64.b64encode(screenshot_bytes).decode()
    return [ImageDataBlock(data=b64, media_type="image/png")]
agent = Agent(
    name="auditor",
    model="anthropic:claude-opus-4-6",
    tools=[capture_screenshot],
)

The LLM receives the image alongside the tool result and can describe, analyze, or make decisions based on it.
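Both the input examples and the screenshot tool repeat the same two steps: base64-encode raw bytes and state a media type. A small helper can do both, as in the minimal sketch below. It uses only the standard library, encode_media is a hypothetical name rather than part of Exo, and the media type is guessed from the filename extension alone.

```python
import base64
import mimetypes


def encode_media(raw: bytes, filename: str) -> tuple[str, str]:
    """Return (base64_text, media_type) for raw bytes.

    The media type is guessed from the filename extension only; the
    bytes themselves are not inspected.
    """
    media_type, _ = mimetypes.guess_type(filename)
    if media_type is None:
        raise ValueError(f"cannot guess media type for {filename!r}")
    return base64.b64encode(raw).decode("ascii"), media_type
```

The returned pair maps onto the data and media_type fields listed for ImageDataBlock and DocumentBlock, for example ImageDataBlock(data=b64, media_type=media_type).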
Built-in Generation Tools
Three ready-to-use tools are available in exo.models. Add them directly to any agent.
DALL-E 3 Image Generation
from exo import Agent, run
from exo.models import dalle_generate_image
agent = Agent(
    name="artist",
    model="openai:gpt-4o",
    tools=[dalle_generate_image],
)
result = await run(agent, "Create a painting of a sunset over Mount Fuji.")

Parameters: prompt, size ("1024x1024" / "1792x1024" / "1024x1792"), quality ("standard" / "hd"), style ("vivid" / "natural").
Returns: list[ImageURLBlock].
Requires: OPENAI_API_KEY environment variable.
Imagen 3 Image Generation
from exo import Agent
from exo.models import imagen_generate_image
agent = Agent(
    name="illustrator",
    model="gemini:gemini-2.0-flash",
    tools=[imagen_generate_image],
)

Parameters: prompt, aspect_ratio ("1:1" / "16:9" / "9:16" / "4:3" / "3:4"), number_of_images (1-4).
Returns: list[ImageDataBlock] (base64 PNG).
Requires: GOOGLE_API_KEY environment variable.
Veo 2 Video Generation
from exo import Agent
from exo.models import veo_generate_video
agent = Agent(
    name="director",
    model="vertex:gemini-2.0-flash",
    tools=[veo_generate_video],
)

Parameters: prompt, duration_seconds (5-8), aspect_ratio ("16:9" / "9:16").
Returns: list[VideoBlock] (URL to generated video in Cloud Storage).
Requires: GOOGLE_CLOUD_PROJECT and GOOGLE_CLOUD_LOCATION environment variables.
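Each built-in tool depends on environment variables, so it can help to check them before building an agent. The following is a minimal preflight sketch. The tool names and variable names come from the sections above, while REQUIRED_ENV and missing_env are hypothetical helpers, not part of exo.models.

```python
import os
from typing import Mapping

# Tool name -> environment variables listed as required for that tool.
REQUIRED_ENV = {
    "dalle_generate_image": ("OPENAI_API_KEY",),
    "imagen_generate_image": ("GOOGLE_API_KEY",),
    "veo_generate_video": ("GOOGLE_CLOUD_PROJECT", "GOOGLE_CLOUD_LOCATION"),
}


def missing_env(tool_name: str, env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variables that are unset or empty for a tool."""
    return [name for name in REQUIRED_ENV[tool_name] if not env.get(name)]
```

Calling missing_env("veo_generate_video") at startup and failing fast on a non-empty result surfaces configuration problems before the first tool call.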
Provider Capability Matrix
| Block type | OpenAI | Anthropic | Gemini | Vertex |
|---|---|---|---|---|
| TextBlock | ✅ | ✅ | ✅ | ✅ |
| ImageURLBlock (https) | ✅ | ✅ | ✅ | ✅ |
| ImageURLBlock (data URI) | ✅ | ✅ | ✅ | ✅ |
| ImageDataBlock | ✅ (via data URI) | ✅ | ✅ | ✅ |
| AudioBlock | ✅ | ❌ (warning) | ✅ | ✅ |
| VideoBlock | ❌ (warning) | ❌ (warning) | ✅ | ✅ |
| DocumentBlock (PDF) | ❌ (warning) | ✅ | ✅ | ✅ |
| Tool result with media | synthetic user msg | native | native | native |
Unsupported blocks are skipped with a WARNING log instead of raising an error. For OpenAI, a tool result that contains media blocks is followed by a synthetic user message, injected after the tool message, so the LLM can see the media.
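The skip-and-warn behaviour can be pictured as a lookup against the matrix above. The sketch below is illustrative only and is not Exo's internal code: blocks are plain dictionaries with a type key, the SUPPORTED table is transcribed from the matrix, and filter_blocks and the logger name are hypothetical.

```python
import logging

logger = logging.getLogger("multimodal_sketch")

# Block "type" values each provider accepts, transcribed from the matrix.
SUPPORTED = {
    "openai": {"text", "image_url", "image_data", "audio"},
    "anthropic": {"text", "image_url", "image_data", "document"},
    "gemini": {"text", "image_url", "image_data", "audio", "video", "document"},
    "vertex": {"text", "image_url", "image_data", "audio", "video", "document"},
}


def filter_blocks(provider: str, blocks: list[dict]) -> list[dict]:
    """Keep supported blocks; log a WARNING for each one that is dropped."""
    allowed = SUPPORTED[provider]
    kept = []
    for block in blocks:
        if block["type"] in allowed:
            kept.append(block)
        else:
            logger.warning(
                "%s does not support %r blocks; skipping", provider, block["type"]
            )
    return kept
```

Running a check like this before calling run() shows which inputs a given provider would drop, which is useful when the same prompt is sent to several providers.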
End-to-End Example: Vision Agent + Image Generation Swarm
import asyncio
from exo import Agent, Swarm, run, ImageURLBlock, TextBlock
from exo.models import dalle_generate_image
# Agent 1: Analyze an input image
analyst = Agent(
    name="analyst",
    model="openai:gpt-4o",
    instructions="You describe images in detail, focusing on style and composition.",
)
# Agent 2: Generate a variation based on the description
generator = Agent(
    name="generator",
    model="openai:gpt-4o",
    instructions="You generate image prompts and call dalle_generate_image.",
    tools=[dalle_generate_image],
)
# Sequential swarm: analyst → generator
swarm = Swarm(agents=[analyst, generator], flow="analyst >> generator")
async def main() -> None:
    result = await run(
        swarm,
        [
            TextBlock(text="Analyze this painting and generate a modern variant."),
            ImageURLBlock(url="https://upload.wikimedia.org/wikipedia/commons/thumb/e/ec/Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg/800px-Mona_Lisa%2C_by_Leonardo_da_Vinci%2C_from_C2RMF_retouched.jpg"),
        ],
    )
    print(result.output)
asyncio.run(main())

The analyst describes the Mona Lisa; the generator calls dalle_generate_image with a modernized prompt; the final response includes the URL of the generated image.
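If the generated image URL is needed programmatically rather than just printed, it can be pulled out of the response text. The sketch below assumes result.output is a plain string, which the example suggests but does not guarantee. extract_urls is a hypothetical helper and the regular expression is deliberately simple.

```python
import re

# Match http(s) URLs up to whitespace or a common closing delimiter.
_URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list[str]:
    """Return http(s) URLs found in text, with trailing punctuation stripped."""
    return [match.rstrip(".,;:!?") for match in _URL_RE.findall(text)]
```

For instance, extract_urls(result.output) would return any image URLs mentioned in the final response, in order of appearance.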