# Why Accessibility Trees Beat Screenshots
The core insight behind Rove: for most AI agent tasks, you don't need a picture of the page — you need to know what's on it.
## The Token Problem
When an AI agent needs to "see" a webpage, the traditional approach is to take a screenshot and send it to a vision model. This works, but it's expensive:
| Approach | Tokens per page | Cost per page |
|---|---|---|
| Screenshot → Vision model | ~114,000 | ~$0.57 |
| A11y tree → LLM | ~26,000 | ~$0.13 |
| **Savings** | **77%** | **77%** |
*Costs calculated at GPT-4o input pricing; actual costs vary by model and page complexity.*
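The savings figure is simple arithmetic on the table's numbers. A quick sanity check, assuming GPT-4o's input price of roughly $5 per million tokens (the price, like the token counts, is an estimate):

```python
# Rough cost comparison using the figures from the table above.
# The $5-per-1M-token input price is an assumption based on
# GPT-4o pricing at the time of writing.
PRICE_PER_TOKEN = 5.00 / 1_000_000  # USD per input token

screenshot_tokens = 114_000
a11y_tokens = 26_000

cost_screenshot = screenshot_tokens * PRICE_PER_TOKEN  # cost per page
cost_a11y = a11y_tokens * PRICE_PER_TOKEN

savings = 1 - a11y_tokens / screenshot_tokens
print(f"Per page: ${cost_screenshot:.2f} vs ${cost_a11y:.2f}")
print(f"Savings: {savings:.0%}")  # ~77%
```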
## What Is an Accessibility Tree?
Every web page has two representations:
- The DOM — the full HTML with all styling, scripts, and layout information. Too verbose for an LLM.
- The accessibility tree — a semantic representation of the page content, built by the browser for screen readers.
The accessibility tree contains:
- Roles: heading, button, link, textbox, checkbox, etc.
- Names: the visible text or ARIA label
- States: focused, disabled, checked, expanded
- Hierarchy: parent-child relationships between elements
It does NOT contain:
- CSS styling
- JavaScript
- Layout coordinates
- Decorative elements
- Hidden content (by default)
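The node shape described above can be sketched as a small data structure. This is an illustrative model, not Rove's internal representation:

```python
from dataclasses import dataclass, field

@dataclass
class A11yNode:
    # The semantic fields the browser exposes to assistive technology.
    role: str                                     # "heading", "button", "textbox", ...
    name: str = ""                                # visible text or ARIA label
    states: dict = field(default_factory=dict)    # focused, disabled, checked, ...
    children: list = field(default_factory=list)  # parent-child hierarchy

    def walk(self):
        """Yield this node and all descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

form = A11yNode("form", "Sign In", children=[
    A11yNode("textbox", "Email address", states={"focused": True}),
    A11yNode("button", "Sign in"),
])
print([n.role for n in form.walk()])  # ['form', 'textbox', 'button']
```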
### Example
For a simple login page, here's the difference:
**Screenshot:** a 1280×720 PNG image, ~500 KB, encoded to ~114,000 tokens.

**Accessibility tree:** a compact block of structured JSON, ~340 tokens:
```json
{
  "role": "main",
  "children": [
    {"role": "heading", "name": "Sign In", "level": 1},
    {"role": "textbox", "name": "Email address", "focused": true},
    {"role": "textbox", "name": "Password"},
    {"role": "button", "name": "Sign in"},
    {"role": "link", "name": "Forgot password?"}
  ]
}
```
An LLM can understand this instantly. It knows there's an email field, a password field, and a sign-in button. It can instruct the agent to fill the email, fill the password, and click sign in — all from ~340 tokens instead of ~114,000.
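A minimal sketch of the first step an agent might take with that tree: pull out the interactive elements it can act on. The helper below is illustrative, not part of Rove:

```python
import json

# The example tree from above.
tree = json.loads("""{
  "role": "main",
  "children": [
    {"role": "heading", "name": "Sign In", "level": 1},
    {"role": "textbox", "name": "Email address", "focused": true},
    {"role": "textbox", "name": "Password"},
    {"role": "button", "name": "Sign in"},
    {"role": "link", "name": "Forgot password?"}
  ]
}""")

INTERACTIVE = {"textbox", "button", "link", "checkbox"}

def interactive_elements(node):
    """Collect (role, name) pairs for every interactive node, depth-first."""
    found = []
    if node.get("role") in INTERACTIVE:
        found.append((node["role"], node.get("name", "")))
    for child in node.get("children", []):
        found += interactive_elements(child)
    return found

print(interactive_elements(tree))
# [('textbox', 'Email address'), ('textbox', 'Password'),
#  ('button', 'Sign in'), ('link', 'Forgot password?')]
```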
## When to Use Each
Use the a11y tree when:
- Navigating and interacting with pages
- Filling forms
- Extracting text content
- Understanding page structure
- Any task where you need to know what is on the page
Use screenshots when:
- You need to verify visual appearance (colors, layout, images)
- You're doing visual regression testing
- The page uses canvas or WebGL content that isn't in the DOM
- You specifically need a visual record for debugging
## How Rove Exposes It
```bash
# Get the a11y tree for the current page
curl -X POST https://rove-api.fly.dev/v1/browser/action \
  -H "Authorization: Bearer rvp_live_YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "sess_...",
    "action": "get_a11y_tree",
    "params": {
      "include_hidden": false,
      "root_selector": "main"
    }
  }'
```
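The same call from Python, using only the standard library. The endpoint and request fields are as documented above; the key and session ID are placeholders:

```python
import json
import urllib.request

API_URL = "https://rove-api.fly.dev/v1/browser/action"
API_KEY = "rvp_live_YOUR_KEY"  # placeholder

payload = {
    "session_id": "sess_...",  # your active session ID
    "action": "get_a11y_tree",
    "params": {"include_hidden": False, "root_selector": "main"},
}

req = urllib.request.Request(
    API_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Authorization": f"Bearer {API_KEY}",
        "Content-Type": "application/json",
    },
    method="POST",
)

# with urllib.request.urlopen(req) as resp:  # uncomment to send
#     tree = json.load(resp)
print(req.method, req.full_url)
```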
The response includes `estimated_tokens` so you can track your token budget:
```json
{
  "tree": { ... },
  "node_count": 47,
  "estimated_tokens": 1240
}
```
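One way to use `estimated_tokens` is as a guard before forwarding the tree to the model. A sketch, where the 8,000-token budget is an arbitrary example value:

```python
def fits_budget(response: dict, budget: int = 8_000) -> bool:
    """Return True if the tree is small enough to send to the LLM as-is."""
    return response.get("estimated_tokens", 0) <= budget

resp = {"tree": {}, "node_count": 47, "estimated_tokens": 1240}
print(fits_budget(resp))        # True: 1,240 tokens fits an 8,000-token budget
print(fits_budget(resp, 1000))  # False: over a tighter 1,000-token budget
```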
### Scoping the Tree
For complex pages, you can scope the tree to a specific element:
```json
{
  "action": "get_a11y_tree",
  "params": {
    "root_selector": "#product-details"
  }
}
```
This returns only the subtree under that element, further reducing token count.
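Scoping composes naturally with a token budget: request the full tree first and fall back to a narrower selector when it comes back too large. In this sketch, `fetch_tree` is a hypothetical stand-in for the API call, with canned responses so the control flow is visible:

```python
def fetch_tree(root_selector=None):
    # Hypothetical stand-in for the get_a11y_tree API call; a real
    # implementation would POST to /v1/browser/action as shown earlier.
    full = {"tree": {"role": "main"}, "estimated_tokens": 12_000}
    scoped = {"tree": {"role": "region"}, "estimated_tokens": 900}
    return scoped if root_selector else full

def get_scoped_tree(selector, budget=8_000):
    """Try the full tree; re-fetch scoped to `selector` if over budget."""
    resp = fetch_tree()
    if resp["estimated_tokens"] > budget:
        resp = fetch_tree(root_selector=selector)
    return resp

resp = get_scoped_tree("#product-details")
print(resp["estimated_tokens"])  # 900: fell back to the scoped subtree
```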