Why Accessibility Trees Beat Screenshots

The core insight behind Rove: for most AI agent tasks, you don't need a picture of the page — you need to know what's on it.

The Token Problem

When an AI agent needs to "see" a webpage, the traditional approach is to take a screenshot and send it to a vision model. This works, but it's expensive:

Approach                     Tokens per page    Cost per page
Screenshot → Vision model    ~114,000           ~$0.57
A11y tree → LLM              ~26,000            ~$0.13
Savings                      77%                77%

Costs calculated at GPT-4o pricing; the dollar figures correspond to $5 per million input tokens. Actual costs vary by model and page complexity.
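
The arithmetic behind these figures can be sketched in a few lines of Python. The price constant is an assumption (chosen because $5 per million input tokens reproduces the dollar amounts shown), and the function names are illustrative; substitute your model's current pricing.

```python
# Assumed price in USD per one million input tokens. This is an illustrative
# value that reproduces the figures above, not a quoted rate.
PRICE_PER_MILLION_TOKENS = 5.00


def cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_TOKENS) -> float:
    """Return the input cost in USD for the given number of tokens."""
    return tokens * price_per_million / 1_000_000


def savings_percent(baseline_tokens: int, reduced_tokens: int) -> float:
    """Return the percentage of tokens saved relative to the baseline."""
    return 100 * (baseline_tokens - reduced_tokens) / baseline_tokens


screenshot_tokens = 114_000
a11y_tokens = 26_000

print(f"screenshot: ${cost_usd(screenshot_tokens):.2f}")   # screenshot: $0.57
print(f"a11y tree:  ${cost_usd(a11y_tokens):.2f}")         # a11y tree:  $0.13
print(f"savings:    {savings_percent(screenshot_tokens, a11y_tokens):.0f}%")  # savings:    77%
```

Because cost scales linearly with token count, the percentage saved is the same for tokens and for dollars.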

What Is an Accessibility Tree?

A browser maintains two representations of every web page:

  1. The DOM: the full document, including markup, styling hooks, scripts, and layout information. It is usually too verbose to send to an LLM.
  2. The accessibility tree: a semantic representation of the page content, built by the browser for screen readers and other assistive technology.

The accessibility tree contains:

  • Roles: heading, button, link, textbox, checkbox, etc.
  • Names: the visible text or ARIA label
  • States: focused, disabled, checked, expanded
  • Hierarchy: parent-child relationships between elements

It does NOT contain:

  • CSS styling
  • JavaScript
  • Layout coordinates
  • Decorative elements
  • Hidden content (by default)
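
One way to picture this structure is as a small recursive data type. The sketch below is an assumption for illustration only: the A11yNode class, its field names, and the outline format are hypothetical and are not Rove's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class A11yNode:
    """Hypothetical in-memory model of one accessibility tree node."""
    role: str                                   # e.g. "button", "textbox"
    name: str = ""                              # visible text or ARIA label
    states: dict = field(default_factory=dict)  # e.g. {"focused": True}
    children: list[A11yNode] = field(default_factory=list)

    def outline(self, depth: int = 0) -> str:
        """Render the node and its descendants as an indented text outline."""
        active = [key for key, value in self.states.items() if value]
        line = "  " * depth + self.role
        if self.name:
            line += f' "{self.name}"'
        if active:
            line += " [" + ", ".join(active) + "]"
        return "\n".join([line] + [child.outline(depth + 1) for child in self.children])


form = A11yNode("form", "Newsletter", children=[
    A11yNode("textbox", "Email", states={"focused": True}),
    A11yNode("checkbox", "Send me updates", states={"checked": False}),
    A11yNode("button", "Subscribe", states={"disabled": True}),
])
print(form.outline())
```

Note what the model has no fields for: colors, pixel positions, or script handlers. Only roles, names, states, and hierarchy are kept.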

Example

For a simple login page, here's the difference:

Screenshot: A 1280x720 PNG image, ~500KB, encoded to ~114,000 tokens

Accessibility tree: ~200 bytes of structured data, ~340 tokens:

{
  "role": "main",
  "children": [
    {"role": "heading", "name": "Sign In", "level": 1},
    {"role": "textbox", "name": "Email address", "focused": true},
    {"role": "textbox", "name": "Password"},
    {"role": "button", "name": "Sign in"},
    {"role": "link", "name": "Forgot password?"}
  ]
}

An LLM can act on this directly. It can tell there is an email field, a password field, and a sign-in button, and it can instruct the agent to fill in the email, fill in the password, and click Sign in, all from ~340 tokens instead of ~114,000.
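
As a rough illustration of how little processing the tree needs, the sketch below parses the login-page example and lists the elements an agent could act on. The set of roles treated as interactive and the helper names are assumptions made for this example.

```python
import json

# The login-page tree from the example above.
TREE_JSON = """
{
  "role": "main",
  "children": [
    {"role": "heading", "name": "Sign In", "level": 1},
    {"role": "textbox", "name": "Email address", "focused": true},
    {"role": "textbox", "name": "Password"},
    {"role": "button", "name": "Sign in"},
    {"role": "link", "name": "Forgot password?"}
  ]
}
"""

# Illustrative choice of roles an agent can type into or click.
INTERACTIVE_ROLES = {"textbox", "button", "link", "checkbox"}


def iter_nodes(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def interactive_elements(tree):
    """Return (role, name) pairs for every interactive node in the tree."""
    return [
        (node["role"], node.get("name", ""))
        for node in iter_nodes(tree)
        if node.get("role") in INTERACTIVE_ROLES
    ]


tree = json.loads(TREE_JSON)
for role, name in interactive_elements(tree):
    print(f"{role}: {name}")
```

The heading is skipped because it is not actionable; the two text boxes, the button, and the link are listed in document order.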

When to Use Each

Use the a11y tree when:

  • Navigating and interacting with pages
  • Filling forms
  • Extracting text content
  • Understanding page structure
  • Any task where you need to know what is on the page

Use screenshots when:

  • You need to verify visual appearance (colors, layout, images)
  • You're doing visual regression testing
  • The page renders content in canvas or WebGL, which generally does not appear in the DOM or the accessibility tree
  • You specifically need a visual record for debugging

How Rove Exposes It

# Get the a11y tree for the current page
curl -X POST https://rove-api.fly.dev/v1/browser/action \
  -H "Authorization: Bearer rvp_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "sess_...",
    "action": "get_a11y_tree",
    "params": {
      "include_hidden": false,
      "root_selector": "main"
    }
  }'

The response includes estimated_tokens so you can track your token budget:

{
  "tree": { ... },
  "node_count": 47,
  "estimated_tokens": 1240
}
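
The same call can be sketched in Python using only the standard library. The endpoint, action name, and parameters are taken from the curl example above; the helper names and the TokenBudget class are hypothetical, and error handling is omitted.

```python
import json
import urllib.request

API_URL = "https://rove-api.fly.dev/v1/browser/action"


def build_tree_request(api_key, session_id, root_selector=None, include_hidden=False):
    """Build (but do not send) a get_a11y_tree request."""
    params = {"include_hidden": include_hidden}
    if root_selector is not None:
        params["root_selector"] = root_selector
    body = {"session_id": session_id, "action": "get_a11y_tree", "params": params}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def fetch_tree(request):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


class TokenBudget:
    """Hypothetical helper that tracks tokens spent across tree fetches."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def charge(self, response):
        """Record a response's estimated_tokens and return what remains."""
        self.used += response["estimated_tokens"]
        return self.limit - self.used


# Offline demonstration using the sample response shown above.
budget = TokenBudget(limit=10_000)
sample_response = {"tree": {}, "node_count": 47, "estimated_tokens": 1240}
remaining = budget.charge(sample_response)
print(remaining)  # 8760
```

In real use you would pass the result of fetch_tree(build_tree_request(...)) to budget.charge and stop or narrow the scope when the remaining budget runs low.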

Scoping the Tree

For complex pages, you can scope the tree to a specific element:

{
  "action": "get_a11y_tree",
  "params": {
    "root_selector": "#product-details"
  }
}

This returns only the subtree under that element, further reducing the token count.
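
Scoping combines naturally with estimated_tokens: try progressively narrower selectors until the tree fits your budget. The sketch below assumes a caller-supplied fetch function that takes a selector and returns a response like the one shown earlier; the function name is hypothetical and the token counts in the stand-in are invented for illustration.

```python
def first_tree_within_budget(fetch, selectors, max_tokens):
    """Try selectors in order; return (selector, response) for the first
    tree whose estimated_tokens fits max_tokens, or None if none fit."""
    for selector in selectors:
        response = fetch(selector)
        if response["estimated_tokens"] <= max_tokens:
            return selector, response
    return None


# Stand-in for a real API call. These estimates are made up for the demo.
FAKE_ESTIMATES = {"body": 26_000, "main": 9_500, "#product-details": 1_240}


def fake_fetch(selector):
    return {"tree": {}, "estimated_tokens": FAKE_ESTIMATES[selector]}


choice = first_tree_within_budget(
    fake_fetch,
    ["body", "main", "#product-details"],  # widest scope first
    max_tokens=2_000,
)
print(choice[0])  # #product-details
```

Ordering the selectors from widest to narrowest keeps as much context as the budget allows, at the cost of one extra request per rejected scope.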