If you’re building AI agents that browse the web, you’ve probably wired up the same loop everyone else has: take a screenshot, send it to a vision model, parse the coordinates, click, repeat. It works — but it’s fragile, wasteful, and harder for the model to reason about than it needs to be.
PinchTab takes a fundamentally different approach. Instead of sending pixel data to your LLM, it extracts structured text from the page and returns stable element references. Your agent reads a clean list of interactive elements instead of interpreting a 1,500-token image. The result is browser automation that’s more reliable, easier to debug, and simpler for the model to act on.
The Screenshot Problem
The current generation of browser-using AI agents — think Anthropic’s Computer Use, OpenAI’s Operator, or any custom Playwright/Puppeteer setup — typically works like this:
- Take a screenshot of the page
- Send the image to a vision model
- The model returns coordinates to click
- Execute the click at those pixel coordinates
- Take another screenshot
- Repeat
This loop has three problems:
Fragility. Pixel coordinates break constantly. A slightly different viewport size, a cookie banner that shifts content down by 40 pixels, a responsive layout that reflows — any of these can send your agent clicking in the wrong place. Coordinate-based clicking is inherently brittle because the visual layout of a page is not a stable API.
Reasoning overhead. When you send a screenshot to an LLM, the model has to do extra work: identify what’s on the page, figure out which elements are interactive, determine what they do, and map that understanding back to pixel coordinates. That’s a lot of implicit reasoning that could be made explicit — and when the model gets it wrong, you get silent failures that are hard to debug.
Cost. A typical browser screenshot costs ~1,500 tokens on Claude (images are resized to fit within 1568px and tokenized at roughly width × height / 750). That’s not catastrophic for a single page, but it adds up in multi-step workflows. A 10-page agent run burns ~15,000 tokens on screenshots alone — before any reasoning. PinchTab’s structured text approach uses ~800 tokens per page, cutting input costs roughly in half.
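For reference, that ~1,500 figure follows from Anthropic's published image rules: the long edge is capped at 1568px, total area at roughly 1.15 megapixels, and token count is about width × height / 750. A quick sketch of the arithmetic (the constants are Anthropic's; the helper function is mine):

```python
import math

MAX_EDGE = 1568          # long edge is scaled down to fit within this
MAX_PIXELS = 1_150_000   # total area is capped at ~1.15 megapixels
PIXELS_PER_TOKEN = 750

def screenshot_tokens(width: int, height: int) -> int:
    """Approximate Claude's token cost for an image of the given dimensions."""
    # First constraint: fit the long edge within MAX_EDGE.
    long_edge = max(width, height)
    if long_edge > MAX_EDGE:
        scale = MAX_EDGE / long_edge
        width, height = int(width * scale), int(height * scale)
    # Second constraint: keep total area under MAX_PIXELS.
    if width * height > MAX_PIXELS:
        scale = math.sqrt(MAX_PIXELS / (width * height))
        width, height = int(width * scale), int(height * scale)
    return round(width * height / PIXELS_PER_TOKEN)

print(screenshot_tokens(1920, 1080))  # → 1532, close to the ~1,500 cited above
```

A 1920×1080 capture hits both caps on the way down, which is why the per-screenshot cost lands near 1,500 tokens rather than the ~2,765 the raw pixel count would suggest.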
How PinchTab Works
PinchTab is a standalone HTTP server (a single 12MB Go binary) that gives your AI agent direct control over Chrome through a clean API. The key insight is that it extracts the semantic structure of a page rather than its visual appearance.
When you request a page snapshot, PinchTab returns something like:
[e1] Search input "Search products..."
[e2] Button "Sign In"
[e3] Link "Electronics" → /category/electronics
[e4] Link "Clothing" → /category/clothing
[e5] Button "Add to Cart"
Each [eN] is a stable element reference tied to the DOM, not to pixel positions. Your agent reads text, not pixels. It clicks e5, not coordinates (742, 381).
This matters for three reasons:
- Refs don’t break when the layout shifts. A cookie banner, a viewport resize, or a CSS change won’t invalidate element references the way it would invalidate pixel coordinates.
- LLMs reason better about structured text. Given a labeled list of buttons and links, the model can select the right action with higher accuracy and less prompt engineering than when interpreting a raw image.
- Debugging is straightforward. When something goes wrong, you can read the snapshot output and immediately see what the agent saw — no squinting at screenshots trying to figure out what the model misinterpreted.
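Because the snapshot is plain text with a regular shape, turning it into data is a few lines of code. A minimal Python sketch, assuming the line format shown in the example above (one element per line, with an optional → target for links):

```python
import re

# [ref] Role "label", optionally followed by → href for links
SNAPSHOT_LINE = re.compile(r'^\[(e\d+)\]\s+(.+?)\s+"([^"]*)"(?:\s+→\s+(\S+))?$')

def parse_snapshot(text: str) -> list[dict]:
    """Parse PinchTab-style snapshot lines into structured element records."""
    elements = []
    for line in text.strip().splitlines():
        m = SNAPSHOT_LINE.match(line.strip())
        if m:
            ref, role, label, href = m.groups()
            elements.append({"ref": ref, "role": role, "label": label, "href": href})
    return elements

snapshot = '''[e1] Search input "Search products..."
[e2] Button "Sign In"
[e3] Link "Electronics" → /category/electronics'''

for el in parse_snapshot(snapshot):
    print(el["ref"], el["role"], el["label"])
```

From here an agent can filter by role or label before handing the list to the model, or look up an href without ever rendering the page.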
Getting Started in 60 Seconds
Install:
curl -fsSL https://pinchtab.com/install.sh | bash
Or via npm or Docker:
npm install -g pinchtab
# or
docker run -d -p 9867:9867 pinchtab/pinchtab
Start the server and create a browser instance:
pinchtab &
TAB=$(curl -s -X POST http://localhost:9867/instances \
-d '{"profile":"default"}' | jq -r '.id')
Navigate to a page:
curl -X POST "http://localhost:9867/instances/$TAB/action" \
-d '{"kind":"navigate","url":"https://example.com"}'
Get the structured snapshot (interactive elements only):
curl "http://localhost:9867/instances/$TAB/snapshot?filter=interactive"
Click an element by its stable ref:
curl -X POST "http://localhost:9867/instances/$TAB/action" \
-d '{"kind":"click","ref":"e5"}'
Fill a form field:
curl -X POST "http://localhost:9867/instances/$TAB/action" \
-d '{"kind":"fill","ref":"e3","value":"search query here"}'
That’s it. No browser drivers to configure, no Selenium grid, no Docker Compose orchestration. One binary, one HTTP API.
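The same flow translates directly into whatever language your agent runs in. A minimal Python sketch over the endpoints shown above, using only the standard library (the class and method names are my own, not part of PinchTab):

```python
import json
import urllib.request

class PinchTab:
    """Thin client for the PinchTab HTTP API endpoints shown above."""

    def __init__(self, base_url="http://localhost:9867"):
        self.base_url = base_url
        self.instance_id = None

    def _post(self, path, payload):
        req = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def start(self, profile="default"):
        self.instance_id = self._post("/instances", {"profile": profile})["id"]

    def action(self, kind, **fields):
        # navigate, click, and fill all share the same action endpoint.
        return self._post(f"/instances/{self.instance_id}/action",
                          {"kind": kind, **fields})

    def snapshot(self, element_filter="interactive"):
        url = (f"{self.base_url}/instances/{self.instance_id}"
               f"/snapshot?filter={element_filter}")
        with urllib.request.urlopen(url) as resp:
            return resp.read().decode()

# Usage (requires a running PinchTab server):
#   tab = PinchTab()
#   tab.start()
#   tab.action("navigate", url="https://example.com")
#   print(tab.snapshot())
#   tab.action("click", ref="e5")
```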
The CLI Shorthand
For quick scripting and testing, PinchTab also exposes CLI commands that map directly to the API:
pinchtab nav https://example.com
pinchtab snap -i -c # interactive elements with coordinates
pinchtab click e5 # click by ref
pinchtab fill e3 "hello" # fill input
pinchtab text # extract full page text
This makes it easy to prototype agent workflows in a shell script before wiring up your LLM.
The Real Cost Picture
Let’s be precise about token costs. A typical 1920×1080 browser screenshot gets scaled down by Claude’s API to fit within 1568px on the long edge and roughly 1.15 megapixels of total area, landing at roughly 1,500 tokens. PinchTab’s structured text snapshots use roughly 800 tokens per page — about half the input cost.
| Approach | Tokens per page | 10-page workflow | 1,000 runs/day |
|---|---|---|---|
| Screenshots (Claude) | ~1,500 | ~15,000 | 15M tokens/day |
| PinchTab | ~800 | ~8,000 | 8M tokens/day |
That’s a meaningful saving at scale, but the token reduction isn’t the main story. The real wins are:
- Higher accuracy. Structured element refs eliminate coordinate-based misclicks entirely. The agent can’t click the wrong pixel because it’s not clicking pixels at all.
- Faster iteration. When an agent workflow breaks, the text snapshot tells you exactly what went wrong. No need to replay screenshots frame by frame.
- No vision model required. You can use any text-only LLM — no need to pay for multimodal capabilities or deal with vision model latency.
That last point is significant. Vision API calls are typically slower than text-only calls, and many of the cheapest, fastest models don’t support image input at all. PinchTab opens up browser automation to the entire model ecosystem.
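In practice, the text-only loop is just string handling on both sides of the model call. A sketch of that glue with hypothetical helper names (the LLM call itself is whatever text API you already use):

```python
def build_prompt(goal: str, snapshot: str) -> str:
    """Pack the goal and the page's interactive elements into a text-only prompt."""
    return (
        f"Goal: {goal}\n\n"
        f"Interactive elements on the current page:\n{snapshot}\n\n"
        "Reply with exactly one action:\n"
        "  click <ref>\n"
        "  fill <ref> <text>\n"
    )

def parse_reply(reply: str) -> dict:
    """Turn the model's one-line reply into a PinchTab action payload."""
    parts = reply.strip().split(maxsplit=2)
    if parts[0] == "click":
        return {"kind": "click", "ref": parts[1]}
    if parts[0] == "fill":
        return {"kind": "fill", "ref": parts[1], "value": parts[2]}
    raise ValueError(f"unrecognized action: {reply!r}")

print(parse_reply("click e5"))           # {'kind': 'click', 'ref': 'e5'}
print(parse_reply("fill e1 red shoes"))  # {'kind': 'fill', 'ref': 'e1', 'value': 'red shoes'}
```

Because the reply grammar is a handful of verbs over stable refs, even small instruction-tuned models can drive it reliably; there is no coordinate output to validate.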
Multi-Instance Orchestration
PinchTab isn’t just for single-threaded agents. It supports running parallel Chrome instances with isolated user profiles:
# Create isolated instances for different users/tasks
TAB1=$(curl -s -X POST http://localhost:9867/instances \
-d '{"profile":"user-a"}' | jq -r '.id')
TAB2=$(curl -s -X POST http://localhost:9867/instances \
-d '{"profile":"user-b"}' | jq -r '.id')
Each profile maintains its own cookies, local storage, and browsing history — and these persist across server restarts. This means your agents can maintain logged-in sessions without re-authenticating every time.
When to Use PinchTab
PinchTab fits best when you’re building:
- AI agents that interact with web UIs — form filling, data extraction, monitoring dashboards
- Automated workflows at scale where reliability and debuggability matter
- Multi-user agent systems that need isolated browser sessions
- Headless automation on servers or Raspberry Pi (ARM64 native support)
It’s less suited for tasks that genuinely need visual understanding — like analyzing charts, reading CAPTCHAs, or evaluating page design. For those, you still need a vision model and screenshots.
The Bigger Picture
PinchTab represents a broader shift in how we think about AI-browser interaction. The first wave of browser agents used brute-force vision: send the model everything, let it figure it out. That works, but it’s fragile and puts the burden of understanding entirely on the model.
The next wave is about giving agents the right abstraction. An LLM doesn’t need to see a login button — it needs to know there’s a clickable element labeled “Sign In” at ref e2. By meeting the model at the semantic level instead of the pixel level, PinchTab makes browser agents more reliable, easier to debug, and accessible to any LLM — not just multimodal ones.
The project is open source (MIT license), written in Go, and available now on GitHub.
PinchTab is listed in our Agents & Automation directory.