Why Accessibility Trees Beat Screenshots

The core insight behind Rove: for most AI agent tasks, you don't need a picture of the page — you need to know what's on it.

The Token Problem

When an AI agent needs to "see" a webpage, the traditional approach is to take a screenshot and send it to a vision model. This works, but it's expensive:

Approach	Tokens per page	Cost per page
Screenshot → Vision model	~114,000	~$0.57
A11y tree → LLM	~26,000	~$0.13
Savings	77%	77%

Costs assume GPT-4o input pricing of $5 per million tokens. Actual costs vary by model and page complexity.
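
The dollar figures follow from multiplying the token count by a per-token price. The sketch below reproduces that arithmetic. The price constant of $5 per million input tokens is an assumption inferred from the table's own numbers, and the function names are illustrative only.

```python
# Assumed price, inferred from the table's figures: $5 per 1M input tokens.
PRICE_PER_MILLION_TOKENS = 5.00


def cost_per_page(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Return the input-token cost in dollars for a single page."""
    return tokens * price_per_million / 1_000_000


def savings(baseline_tokens: int, reduced_tokens: int) -> float:
    """Return the fractional token reduction relative to the baseline."""
    return 1 - reduced_tokens / baseline_tokens


screenshot_tokens = 114_000
a11y_tokens = 26_000

print(f"screenshot: ${cost_per_page(screenshot_tokens):.2f} per page")
print(f"a11y tree:  ${cost_per_page(a11y_tokens):.2f} per page")
print(f"savings:    {savings(screenshot_tokens, a11y_tokens):.0%}")
```

Swapping in your own model's price and your measured token counts gives a like-for-like comparison for your workload.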

What Is an Accessibility Tree?

Every web page has two representations:

The DOM — the full HTML with all styling, scripts, and layout information. Too verbose for an LLM.
The accessibility tree — a semantic representation of the page content, built by the browser for screen readers.

The accessibility tree contains:

Roles: heading, button, link, textbox, checkbox, etc.
Names: the visible text or ARIA label
States: focused, disabled, checked, expanded
Hierarchy: parent-child relationships between elements

It does NOT contain:

CSS styling
JavaScript
Layout coordinates
Decorative elements
Hidden content (by default)

Example

For a simple login page, here's the difference:

Screenshot: A 1280x720 PNG image, ~500KB, encoded to ~114,000 tokens

Accessibility tree: ~200 bytes of structured data, ~340 tokens:

{
  "role": "main",
  "children": [
    {"role": "heading", "name": "Sign In", "level": 1},
    {"role": "textbox", "name": "Email address", "focused": true},
    {"role": "textbox", "name": "Password"},
    {"role": "button", "name": "Sign in"},
    {"role": "link", "name": "Forgot password?"}
  ]
}

An LLM can understand this instantly. It knows there's an email field, a password field, and a sign-in button. It can instruct the agent to fill the email, fill the password, and click sign in — all from ~340 tokens instead of ~114,000.
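
To illustrate how little processing this structure needs, here is a minimal sketch that walks the example tree and lists the elements an agent could act on. It assumes the tree is available as JSON shaped like the example above; the set of roles treated as interactive is an illustrative choice, not something Rove defines.

```python
import json

# The login-page tree from the example above.
TREE_JSON = """
{
  "role": "main",
  "children": [
    {"role": "heading", "name": "Sign In", "level": 1},
    {"role": "textbox", "name": "Email address", "focused": true},
    {"role": "textbox", "name": "Password"},
    {"role": "button", "name": "Sign in"},
    {"role": "link", "name": "Forgot password?"}
  ]
}
"""

# Illustrative assumption: roles an agent would typically fill or click.
INTERACTIVE_ROLES = {"textbox", "button", "link", "checkbox"}


def interactive_elements(node: dict):
    """Yield (role, name) for every interactive node, depth-first."""
    if node.get("role") in INTERACTIVE_ROLES:
        yield node["role"], node.get("name", "")
    for child in node.get("children", []):
        yield from interactive_elements(child)


tree = json.loads(TREE_JSON)
for role, name in interactive_elements(tree):
    print(f"{role}: {name}")
```

The heading is skipped because it is not actionable; the two textboxes, the button, and the link are reported in document order.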

When to Use Each

Use the a11y tree when:

Navigating and interacting with pages
Filling forms
Extracting text content
Understanding page structure
Any task where you need to know what is on the page

Use screenshots when:

You need to verify visual appearance (colors, layout, images)
You're doing visual regression testing
The page uses canvas or WebGL content whose rendered output isn't exposed in the accessibility tree
You specifically need a visual record for debugging

How Rove Exposes It

# Get the a11y tree for the current page
curl -X POST https://api.roveapi.com/v1/browser/action \
  -H "Authorization: Bearer rvp_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "sess_...",
    "action": "get_a11y_tree",
    "params": {}
  }'

The response includes estimated_tokens so you can track your token budget:

{
  "tree": "...",
  "node_count": 47,
  "estimated_tokens": 1240
}
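
The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl call above; the helper names (`build_request`, `fetch_tree`, `fits_budget`) are hypothetical, and the budget check is one possible way to use `estimated_tokens`, not a Rove feature.

```python
import json
import urllib.request
from typing import Optional

# Endpoint taken from the curl example above.
API_URL = "https://api.roveapi.com/v1/browser/action"


def build_request(api_key: str, session_id: str,
                  params: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) a get_a11y_tree request."""
    body = {
        "session_id": session_id,
        "action": "get_a11y_tree",
        "params": params or {},
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_tree(request: urllib.request.Request) -> dict:
    """Send the request and parse the JSON response."""
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


def fits_budget(response: dict, remaining_tokens: int) -> bool:
    """Return True if the tree fits in the remaining token budget."""
    return response["estimated_tokens"] <= remaining_tokens


# Canned response from the example above, so this runs without a network call.
sample = {"tree": "...", "node_count": 47, "estimated_tokens": 1240}
print(fits_budget(sample, remaining_tokens=8000))
```

In real use, pass the result of `build_request` to `fetch_tree` and check `fits_budget` before appending the tree to the model's context.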

Auto-Scoping

On large pages (trees over 50K characters), Rove automatically scopes the tree to the main content area. It looks for a main-content container such as <main>, [role="main"], #content, or #app and returns only that subtree.

When auto-scope kicks in, the response includes:

{
  "tree": "...",
  "auto_scoped": true,
  "scoped_to": "main",
  "node_count": 142,
  "estimated_tokens": 5200
}

For example, an Amazon search page whose full tree would be ~270K characters comes back at ~20-30K characters, with no configuration needed.
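
Client code may want to record when a response was auto-scoped, since content outside the chosen container will be missing from the tree. A minimal sketch, assuming `auto_scoped` is either absent or false when no scoping happened (the source only shows the true case); `describe_scope` is a hypothetical helper.

```python
def describe_scope(response: dict) -> str:
    """Summarize whether a get_a11y_tree response was auto-scoped."""
    size = f"{response['node_count']} nodes, ~{response['estimated_tokens']} tokens"
    # .get() covers both a missing key and an explicit false value.
    if response.get("auto_scoped"):
        return f"auto-scoped to {response['scoped_to']}: {size}"
    return f"full tree: {size}"


scoped = {
    "tree": "...",
    "auto_scoped": True,
    "scoped_to": "main",
    "node_count": 142,
    "estimated_tokens": 5200,
}
print(describe_scope(scoped))
```

If the element you need lives outside the main content area, such as a link in the site header, request it with an explicit selector instead.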

Manual Scoping

You can also scope manually using selector:

{
  "action": "get_a11y_tree",
  "params": {
    "selector": "#search-results"
  }
}

When you provide an explicit selector, auto-scope is skipped — you get exactly the subtree you asked for.

Controlling Output Size

`max_chars`

Hard limit on output size. Truncates at the nearest line boundary:

{
  "action": "get_a11y_tree",
  "params": {
    "max_chars": 30000
  }
}

When the output is capped, the response includes truncated: true and full_length, the size of the untruncated tree.
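
To make the line-boundary behavior concrete, here is a sketch of one way such a cap could work. It is not Rove's implementation: it interprets "nearest line boundary" as the last newline at or before the limit, so the output never exceeds `max_chars`, and it assumes the tree is serialized as newline-separated text.

```python
def truncate_at_line(text: str, max_chars: int) -> dict:
    """Cap text at max_chars, cutting back to the last complete line."""
    if len(text) <= max_chars:
        return {"tree": text, "truncated": False}
    # Last newline whose index is <= max_chars, so the kept prefix fits.
    cut = text.rfind("\n", 0, max_chars + 1)
    if cut == -1:
        # No line break inside the limit: fall back to a hard cut.
        cut = max_chars
    return {"tree": text[:cut], "truncated": True, "full_length": len(text)}


tree_text = "heading Sign In\ntextbox Email\ntextbox Password\nbutton Sign in"
result = truncate_at_line(tree_text, max_chars=35)
print(result["tree"])
```

With a limit of 35 characters, the third line would be cut mid-word, so the sketch keeps only the first two complete lines.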

`max_depth`

Limit tree nesting depth for a structural overview:

{
  "action": "get_a11y_tree",
  "params": {
    "max_depth": 3
  }
}

A depth of 3 returns top-level structure without deeply nested elements — useful for understanding page layout before drilling into a specific section.
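
The effect of a depth limit can be sketched on a nested structure like the login example earlier. This is an illustration only: it assumes the root counts as depth 1, which the source does not specify, and `prune` is a hypothetical helper rather than part of the API.

```python
def prune(node: dict, max_depth: int, depth: int = 1) -> dict:
    """Return a copy of node with everything below max_depth removed."""
    pruned = {key: value for key, value in node.items() if key != "children"}
    children = node.get("children", [])
    # Keep children only while we are still above the depth limit.
    if children and depth < max_depth:
        pruned["children"] = [prune(child, max_depth, depth + 1) for child in children]
    return pruned


page = {
    "role": "main",
    "children": [
        {"role": "form", "name": "Sign In", "children": [
            {"role": "textbox", "name": "Email address"},
            {"role": "button", "name": "Sign in"},
        ]},
    ],
}

overview = prune(page, max_depth=2)
print(overview)
```

At depth 2 the form is still visible but its fields are dropped, which is enough to decide where to drill next.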

`visible_only`

Skip hidden and offscreen elements, such as collapsed menus, closed modals, and below-the-fold content:

{
  "action": "get_a11y_tree",
  "params": {
    "visible_only": true
  }
}

`exclude_selectors`

Strip noise elements before snapshotting:

{
  "action": "get_a11y_tree",
  "params": {
    "exclude_selectors": ["nav", "footer", ".ad-slot", "[role='banner']"]
  }
}

Matching elements are hidden before the snapshot and restored afterward, so the page is not permanently modified.

Combining Options

All scoping options can be combined:

{
  "action": "get_a11y_tree",
  "params": {
    "selector": "main",
    "max_depth": 4,
    "max_chars": 50000,
    "exclude_selectors": ["nav", "footer"]
  }
}
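
When several options are in play, a small helper can keep the params object tidy by omitting anything left unset. The `TreeOptions` class below is a hypothetical client-side convenience, not part of Rove; its field names simply mirror the parameters documented above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TreeOptions:
    """Client-side holder for get_a11y_tree parameters (hypothetical helper)."""
    selector: Optional[str] = None
    max_depth: Optional[int] = None
    max_chars: Optional[int] = None
    visible_only: Optional[bool] = None
    exclude_selectors: List[str] = field(default_factory=list)

    def to_params(self) -> dict:
        """Return only the options that were explicitly set."""
        params = {}
        for key in ("selector", "max_depth", "max_chars", "visible_only"):
            value = getattr(self, key)
            if value is not None:
                params[key] = value
        if self.exclude_selectors:
            params["exclude_selectors"] = list(self.exclude_selectors)
        return params


options = TreeOptions(
    selector="main",
    max_depth=4,
    max_chars=50000,
    exclude_selectors=["nav", "footer"],
)
print({"action": "get_a11y_tree", "params": options.to_params()})
```

The printed request matches the combined example above, and `TreeOptions().to_params()` yields an empty params object for a default call.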

Working with Large Pages

The "77% fewer tokens than screenshots" claim holds for typical pages, but heavy retail and content sites can return much larger trees. Amazon's home page returns ~197K characters, and search results can reach ~270K characters. Here's how to handle them.

The Orient-Drill-Act Pattern

Instead of trying to consume the full tree in one call, use a three-step approach:

Orient — Get a shallow overview of the page structure:

{
  "action": "get_a11y_tree",
  "params": { "max_depth": 2 }
}

Drill — Scope into the relevant subtree:

{
  "action": "get_a11y_tree",
  "params": { "selector": "#search-results" }
}

Act — Interact with specific elements you found in the scoped tree.
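
The first two steps can be sketched as a single function. Everything here is an assumption made for illustration: `call` stands in for whatever transport you use to send an action and get back the parsed response, and `choose_selector` stands in for the LLM reading the overview and naming a region. The Act step is left to the caller because it depends on the interaction you need.

```python
from typing import Callable

# Hypothetical transport: takes (action, params), returns the parsed response.
Call = Callable[[str, dict], dict]


def orient_and_drill(call: Call, choose_selector: Callable[[str], str]) -> dict:
    """Run the Orient and Drill steps and return the scoped response."""
    # Orient: shallow overview of the page structure.
    overview = call("get_a11y_tree", {"max_depth": 2})
    # In practice the LLM reads the overview and picks the region to inspect.
    selector = choose_selector(overview["tree"])
    # Drill: fetch only the chosen subtree.
    return call("get_a11y_tree", {"selector": selector})


# Stand-in transport that records requests instead of contacting a server.
requests_made = []


def fake_call(action: str, params: dict) -> dict:
    requests_made.append((action, params))
    return {"tree": "stub tree for " + str(params)}


scoped = orient_and_drill(fake_call, lambda overview: "#search-results")
print(requests_made)
```

Two small requests replace one very large one, and only the second response needs to go into the model's working context.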

Root Selector Scoping

For known page layouts, skip the orient step and go straight to the content:

Site type	Recommended selector	Typical result
Amazon search	`#search .s-main-slot`	~4K chars (from ~270K)
Amazon product	`#dp-container`	~8K chars
News articles	`article, [role="article"]`	~3-5K chars
SaaS dashboards	`main, [role="main"]`	~5-10K chars
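
For the Amazon rows, the lookup can be automated from the URL. The sketch below is a rough illustration: the URL patterns (`/s` for search, `/dp/` for product pages) are assumptions about Amazon's URL layout, and `recommended_selector` is a hypothetical helper. It returns None for anything it does not recognize so that auto-scoping can take over.

```python
from typing import Optional
from urllib.parse import urlparse


def recommended_selector(url: str) -> Optional[str]:
    """Return a root selector for known layouts, or None to rely on auto-scope."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    is_amazon = host == "amazon.com" or host.endswith(".amazon.com")
    if is_amazon:
        # Assumed URL patterns; adjust if the site's layout differs.
        if parsed.path == "/s":
            return "#search .s-main-slot"
        if "/dp/" in parsed.path:
            return "#dp-container"
    return None


for url in (
    "https://www.amazon.com/s?k=usb+cable",
    "https://www.amazon.com/dp/B000000000",
    "https://example.com/blog/post",
):
    print(url, "->", recommended_selector(url))
```

News articles and SaaS dashboards cannot be recognized from the URL alone, so their selectors from the table are best supplied per site.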

The Warmup Pattern

Some sites (especially Amazon) serve lighter pages on first navigation and heavier ones on subsequent requests. If you're scraping product pages, navigate to the homepage first to establish cookies and session state, then navigate to your target URL:

navigate({ url: "https://www.amazon.com" })
navigate({ url: "https://www.amazon.com/dp/B0..." })
get_a11y_tree({ selector: "#dp-container" })

Expected Tree Sizes by Category

Page type	Full tree	With auto-scope	With manual selector
Simple landing page	~2-5K chars	~2-5K chars	—
Blog post	~5-15K chars	~3-8K chars	~2-4K chars
E-commerce product	~50-100K chars	~15-30K chars	~4-8K chars
E-commerce search	~150-300K chars	~20-40K chars	~4-10K chars
SaaS dashboard	~10-50K chars	~5-15K chars	~3-8K chars
Social media feed	~100-200K chars	~20-40K chars	~5-15K chars

The key takeaway: always scope large pages with selector, max_depth, or exclude_selectors rather than consuming the full tree.

MCP Usage

Via MCP (Claude, Cursor), the same options are available as tool parameters:

get_a11y_tree({
  session_id: "sess_...",
  selector: "#results",
  max_chars: 30000,
  exclude_selectors: ["nav", "footer"]
})