Why Accessibility Trees Beat Screenshots
The core insight behind Rove: for most AI agent tasks, you don't need a picture of the page — you need to know what's on it.
The Token Problem
When an AI agent needs to "see" a webpage, the traditional approach is to take a screenshot and send it to a vision model. This works, but it's expensive:
| Approach | Tokens per page | Cost per page |
|---|---|---|
| Screenshot → Vision model | ~114,000 | ~$0.57 |
| A11y tree → LLM | ~26,000 | ~$0.13 |
| Savings | 77% | 77% |
Costs calculated at GPT-4o pricing. Actual costs vary by model and page complexity.
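The arithmetic behind the table can be checked with a short sketch. The rate of $5 per million input tokens is an assumption chosen for illustration, not a figure stated in the table; substitute your model's current price. At that assumed rate, the dollar figures above correspond to the cost of a single page.

```python
# Assumed input price in USD per 1M tokens. This rate is an assumption for
# illustration; replace it with your model's current pricing.
ASSUMED_USD_PER_MILLION_TOKENS = 5.00


def cost_usd(tokens: int, usd_per_million: float = ASSUMED_USD_PER_MILLION_TOKENS) -> float:
    """Input-token cost in USD for a given token count."""
    return tokens * usd_per_million / 1_000_000


def savings_fraction(baseline_tokens: int, reduced_tokens: int) -> float:
    """Fraction of tokens saved relative to the baseline."""
    return 1 - reduced_tokens / baseline_tokens


screenshot_tokens = 114_000  # token count from the table
a11y_tokens = 26_000         # token count from the table

print(f"screenshot: ${cost_usd(screenshot_tokens):.2f} per page")
print(f"a11y tree:  ${cost_usd(a11y_tokens):.2f} per page")
print(f"savings:    {savings_fraction(screenshot_tokens, a11y_tokens):.0%}")
```

Because both approaches are priced per token, the percentage saved is the same for tokens and for dollars, whatever rate you plug in.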
What Is an Accessibility Tree?
Every web page has two representations:
- The DOM — the full HTML with all styling, scripts, and layout information. Too verbose for an LLM.
- The accessibility tree — a semantic representation of the page content, built by the browser for screen readers.
The accessibility tree contains:
- Roles: heading, button, link, textbox, checkbox, etc.
- Names: the visible text or ARIA label
- States: focused, disabled, checked, expanded
- Hierarchy: parent-child relationships between elements
It does NOT contain:
- CSS styling
- JavaScript
- Layout coordinates
- Decorative elements
- Hidden content (by default)
Example
For a simple login page, here's the difference:
Screenshot: A 1280x720 PNG image, ~500KB, encoded to ~114,000 tokens
Accessibility tree: ~200 bytes of structured data, ~340 tokens:
{
"role": "main",
"children": [
{"role": "heading", "name": "Sign In", "level": 1},
{"role": "textbox", "name": "Email address", "focused": true},
{"role": "textbox", "name": "Password"},
{"role": "button", "name": "Sign in"},
{"role": "link", "name": "Forgot password?"}
]
}
An LLM can understand this instantly. It knows there's an email field, a password field, and a sign-in button. It can instruct the agent to fill the email, fill the password, and click sign in — all from ~340 tokens instead of ~114,000.
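To make this concrete, here is a minimal sketch of how an agent harness might pull the actionable elements out of that tree. It assumes the tree is available as parsed JSON in the shape shown above; the `INTERACTIVE_ROLES` set and the `interactive_elements` helper are illustrative names, not part of Rove.

```python
import json

# The login-page tree from the example above.
TREE_JSON = """
{
  "role": "main",
  "children": [
    {"role": "heading", "name": "Sign In", "level": 1},
    {"role": "textbox", "name": "Email address", "focused": true},
    {"role": "textbox", "name": "Password"},
    {"role": "button", "name": "Sign in"},
    {"role": "link", "name": "Forgot password?"}
  ]
}
"""

# Roles treated as actionable here; this set is an illustrative assumption.
INTERACTIVE_ROLES = {"textbox", "button", "link", "checkbox"}


def interactive_elements(node: dict) -> list:
    """Depth-first list of (role, name) pairs for actionable nodes."""
    found = []
    if node.get("role") in INTERACTIVE_ROLES:
        found.append((node["role"], node.get("name", "")))
    for child in node.get("children", []):
        found.extend(interactive_elements(child))
    return found


tree = json.loads(TREE_JSON)
for role, name in interactive_elements(tree):
    print(f"{role}: {name}")
```

The heading is skipped because it is not something the agent can act on; the four remaining nodes are exactly the controls needed to sign in.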
When to Use Each
Use the a11y tree when:
- Navigating and interacting with pages
- Filling forms
- Extracting text content
- Understanding page structure
- Any task where you need to know what is on the page
Use screenshots when:
- You need to verify visual appearance (colors, layout, images)
- You're doing visual regression testing
- The page renders content with canvas or WebGL, which draws pixels that appear in neither the DOM nor the accessibility tree
- You specifically need a visual record for debugging
How Rove Exposes It
# Get the a11y tree for the current page
curl -X POST https://api.roveapi.com/v1/browser/action \
-H "Authorization: Bearer rvp_live_YOUR_KEY" \
-H "Content-Type: application/json" \
-d '{
"session_id": "sess_...",
"action": "get_a11y_tree",
"params": {}
}'
The response includes estimated_tokens so you can track your token budget:
{
"tree": "...",
"node_count": 47,
"estimated_tokens": 1240
}
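If you are calling the API from Python rather than curl, the same request can be sketched with the standard library. The endpoint, headers, and body mirror the curl example; `build_tree_request` and `TokenBudget` are hypothetical helper names introduced here, and the request is built but not sent so the snippet runs without a live session.

```python
import json
import urllib.request

API_URL = "https://api.roveapi.com/v1/browser/action"  # endpoint from the curl example


def build_tree_request(api_key, session_id, params=None):
    """Build (but do not send) a get_a11y_tree request."""
    body = {"session_id": session_id, "action": "get_a11y_tree", "params": params or {}}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


class TokenBudget:
    """Hypothetical helper that tracks spend via the response's estimated_tokens."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def record(self, response):
        """Add a response's estimated_tokens and return the remaining budget."""
        self.used += response["estimated_tokens"]
        return self.limit - self.used


request = build_tree_request("rvp_live_YOUR_KEY", "sess_...")
# To send it: response = json.load(urllib.request.urlopen(request))

budget = TokenBudget(limit=20_000)
remaining = budget.record({"tree": "...", "node_count": 47, "estimated_tokens": 1240})
print(remaining)  # 18760
```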
Auto-Scoping
On large pages (over 50K characters), Rove automatically scopes the tree to the main content area. It looks for landmarks like <main>, [role="main"], #content, or #app and returns only that subtree.
When auto-scope kicks in, the response includes:
{
"tree": "...",
"auto_scoped": true,
"scoped_to": "main",
"node_count": 142,
"estimated_tokens": 5200
}
This means an Amazon search page that would otherwise return ~270K characters comes back at roughly 20-30K characters, with no configuration needed.
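On the client side, it can help to log whether a response was auto-scoped so you know which part of the page the agent is looking at. A minimal sketch, assuming the `auto_scoped` and `scoped_to` fields are simply absent when auto-scope did not run (`scope_summary` is a hypothetical helper name):

```python
def scope_summary(response: dict) -> str:
    """One-line description of how a get_a11y_tree response was scoped."""
    tokens = response.get("estimated_tokens", 0)
    nodes = response.get("node_count", 0)
    if response.get("auto_scoped"):
        # scoped_to names the landmark Rove chose, e.g. "main".
        return f"auto-scoped to '{response.get('scoped_to')}': {nodes} nodes, ~{tokens} tokens"
    return f"full tree: {nodes} nodes, ~{tokens} tokens"


scoped = {
    "tree": "...",
    "auto_scoped": True,
    "scoped_to": "main",
    "node_count": 142,
    "estimated_tokens": 5200,
}
print(scope_summary(scoped))
```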
Manual Scoping
You can also scope manually with the selector parameter:
{
"action": "get_a11y_tree",
"params": {
"selector": "#search-results"
}
}
When you provide an explicit selector, auto-scope is skipped — you get exactly the subtree you asked for.
Controlling Output Size
max_chars
Sets a hard limit on output size, in characters. Output is truncated at the nearest line boundary:
{
"action": "get_a11y_tree",
"params": {
"max_chars": 30000
}
}
When the output is capped, the response includes truncated: true and full_length.
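To illustrate what truncating at a line boundary means, here is a small sketch of the idea. This is not Rove's implementation: it assumes "nearest line boundary" means the last newline at or before the limit, and that `full_length` is the character count before truncation.

```python
def truncate_at_line_boundary(text: str, max_chars: int) -> dict:
    """Cap text at max_chars, cutting back to the last complete line."""
    if len(text) <= max_chars:
        return {"tree": text}
    # Last newline at or before the limit; the slice below drops that newline.
    cut = text.rfind("\n", 0, max_chars + 1)
    if cut == -1:
        # No newline within the limit: fall back to a hard cut.
        cut = max_chars
    return {"tree": text[:cut], "truncated": True, "full_length": len(text)}


sample = "heading Sign In\ntextbox Email\nbutton Sign in"
result = truncate_at_line_boundary(sample, max_chars=35)
print(result["tree"])
print(result["truncated"], result["full_length"])
```

With a 35-character limit the third line does not fit, so the output ends after the second line instead of stopping mid-node.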
max_depth
Limit tree nesting depth for a structural overview:
{
"action": "get_a11y_tree",
"params": {
"max_depth": 3
}
}
A depth of 3 returns top-level structure without deeply nested elements — useful for understanding page layout before drilling into a specific section.
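The effect of a depth limit can be sketched on a JSON-shaped tree like the login example earlier. This is an illustration only: it assumes the root counts as depth 1, which may not match how Rove counts levels, and `limit_depth` is a hypothetical helper.

```python
def limit_depth(node: dict, max_depth: int, depth: int = 1) -> dict:
    """Copy of the tree with nodes deeper than max_depth dropped (root is depth 1)."""
    pruned = {key: value for key, value in node.items() if key != "children"}
    children = node.get("children", [])
    if children and depth < max_depth:
        pruned["children"] = [limit_depth(child, max_depth, depth + 1) for child in children]
    return pruned


page = {
    "role": "main",
    "children": [
        {
            "role": "form",
            "name": "Sign in",
            "children": [{"role": "textbox", "name": "Email address"}],
        },
    ],
}

overview = limit_depth(page, max_depth=2)
print(overview)
```

The overview keeps the form node, so you can see it exists, but omits the fields inside it until you drill in.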
visible_only
Skip hidden and offscreen elements, such as collapsed menus, closed modals, and below-the-fold content:
{
"action": "get_a11y_tree",
"params": {
"visible_only": true
}
}
exclude_selectors
Strip noise elements before snapshotting:
{
"action": "get_a11y_tree",
"params": {
"exclude_selectors": ["nav", "footer", ".ad-slot", "[role='banner']"]
}
}
Matching elements are hidden before the snapshot and restored afterward, so the page is not permanently modified.
Combining Options
All scoping options can be combined:
{
"action": "get_a11y_tree",
"params": {
"selector": "main",
"max_depth": 4,
"max_chars": 50000,
"exclude_selectors": ["nav", "footer"]
}
}
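When you assemble these options in code, a small builder keeps unset options out of the payload. The sketch below uses only the parameter names documented above; the `build_tree_params` helper and its range checks are assumptions added for illustration, not validation that Rove is known to perform.

```python
from typing import Optional


def build_tree_params(
    selector: Optional[str] = None,
    max_depth: Optional[int] = None,
    max_chars: Optional[int] = None,
    visible_only: bool = False,
    exclude_selectors: Optional[list] = None,
) -> dict:
    """Build the params object, leaving out any option that was not set."""
    # Client-side sanity checks (assumed, not documented API behavior).
    if max_depth is not None and max_depth < 1:
        raise ValueError("max_depth must be at least 1")
    if max_chars is not None and max_chars < 1:
        raise ValueError("max_chars must be at least 1")

    params = {}
    if selector:
        params["selector"] = selector
    if max_depth is not None:
        params["max_depth"] = max_depth
    if max_chars is not None:
        params["max_chars"] = max_chars
    if visible_only:
        params["visible_only"] = True
    if exclude_selectors:
        params["exclude_selectors"] = list(exclude_selectors)
    return params


params = build_tree_params(
    selector="main",
    max_depth=4,
    max_chars=50000,
    exclude_selectors=["nav", "footer"],
)
print(params)
```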
Working with Large Pages
The "77% fewer tokens than screenshots" claim holds for typical pages, but heavy retail and content sites can return much larger trees. Amazon's home page returns ~197K characters, and search results can reach ~270K characters. Here's how to handle them.
The Orient-Drill-Act Pattern
Instead of trying to consume the full tree in one call, use a three-step approach:
- Orient — Get a shallow overview of the page structure:
{
"action": "get_a11y_tree",
"params": { "max_depth": 2 }
}
- Drill — Scope into the relevant subtree:
{
"action": "get_a11y_tree",
"params": { "selector": "#search-results" }
}
- Act — Interact with specific elements you found in the scoped tree.
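The three steps above can be sketched as a small driver. Everything here is hypothetical scaffolding: `call` stands in for whatever function sends an action to Rove, `pick_selector` is where your agent or LLM chooses a subtree from the overview, and `act` is your own interaction logic. The example wires in stubs so it runs without a live session.

```python
from typing import Any, Callable

# (action name, params) -> response dict
Action = Callable[[str, dict], dict]


def orient_drill_act(
    call: Action,
    pick_selector: Callable[[dict], str],
    act: Callable[[dict], Any],
) -> Any:
    """Run the orient, drill, and act steps in order."""
    overview = call("get_a11y_tree", {"max_depth": 2})       # Orient
    selector = pick_selector(overview)                       # choose a subtree
    detail = call("get_a11y_tree", {"selector": selector})   # Drill
    return act(detail)                                       # Act


# Stubbed usage: record the calls instead of hitting the API.
calls = []


def fake_call(action: str, params: dict) -> dict:
    calls.append((action, params))
    return {"tree": "...", "estimated_tokens": 100}


result = orient_drill_act(
    fake_call,
    pick_selector=lambda overview: "#search-results",
    act=lambda detail: detail["estimated_tokens"],
)
print(calls)
print(result)
```

Keeping the selector choice and the final action as callbacks means the two tree requests stay cheap and predictable, while the decision-making stays with the model.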
Root Selector Scoping
For known page layouts, skip the orient step and go straight to the content:
| Site type | Recommended selector | Typical result |
|---|---|---|
| Amazon search | #search .s-main-slot | ~4K chars (from ~270K) |
| Amazon product | #dp-container | ~8K chars |
| News articles | article, [role="article"] | ~3-5K chars |
| SaaS dashboards | main, [role="main"] | ~5-10K chars |
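In code, the table above amounts to a lookup with a fallback. A minimal sketch, where the dictionary keys and the `params_for` helper are made-up names and unknown site types fall back to the shallow orient request:

```python
# Selectors from the table above; the keys are illustrative labels.
ROOT_SELECTORS = {
    "amazon_search": "#search .s-main-slot",
    "amazon_product": "#dp-container",
    "news_article": 'article, [role="article"]',
    "saas_dashboard": 'main, [role="main"]',
}


def params_for(site_type: str) -> dict:
    """Return get_a11y_tree params for a known layout, else a shallow overview."""
    selector = ROOT_SELECTORS.get(site_type)
    if selector is None:
        # Unknown layout: start with the orient step instead of guessing.
        return {"max_depth": 2}
    return {"selector": selector}


print(params_for("amazon_product"))
print(params_for("unknown_site"))
```

Site markup changes over time, so treat these selectors as starting points and confirm them against a shallow overview when results look empty.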
The Warmup Pattern
Some sites (especially Amazon) serve lighter pages on first navigation and heavier ones on subsequent requests. If you're scraping product pages, navigate to the homepage first to establish cookies and session state, then navigate to your target URL:
navigate({ url: "https://www.amazon.com" })
navigate({ url: "https://www.amazon.com/dp/B0..." })
get_a11y_tree({ selector: "#dp-container" })
Expected Tree Sizes by Category
| Page type | Full tree | With auto-scope | With manual selector |
|---|---|---|---|
| Simple landing page | ~2-5K chars | ~2-5K chars | — |
| Blog post | ~5-15K chars | ~3-8K chars | ~2-4K chars |
| E-commerce product | ~50-100K chars | ~15-30K chars | ~4-8K chars |
| E-commerce search | ~150-300K chars | ~20-40K chars | ~4-10K chars |
| SaaS dashboard | ~10-50K chars | ~5-15K chars | ~3-8K chars |
| Social media feed | ~100-200K chars | ~20-40K chars | ~5-15K chars |
The key takeaway: always scope large pages with selector, max_depth, or exclude_selectors rather than consuming the full tree.
MCP Usage
Via MCP (Claude, Cursor), the same options are available as tool parameters:
get_a11y_tree({
session_id: "sess_...",
selector: "#results",
max_chars: 30000,
exclude_selectors: ["nav", "footer"]
})