High-Potential
Python

🧪 ClawBench: Browser AI Agent Benchmark

319 stars · 20 forks · Python
agent-evaluation, agentic-ai, ai-agent-benchmark, ai-agents, benchmark, browser-agent, browser-automation, browser-use, chrome-agent, chrome-extension, computer-use, dataset
With the rise of "Computer Use" and browser automation agents, evaluating their capabilities objectively has become difficult. ClawBench addresses that gap with an open-source benchmark built specifically for browser AI agents. The dataset contains 153 everyday online tasks across 144 live websites. Evaluation combines 5-layer recording, DOM matching, and an LLM judge. The current top score is only 33.3%, which indicates that today's browser agents still struggle with complex, real-world web navigation. For teams developing autonomous agents, it offers a ready-made framework for testing against live sites.
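
To make the headline number concrete, here is a minimal Python sketch of how a per-task verdict and an overall score could be computed. It is not ClawBench's actual code: the `TaskResult` record, its field names, and the rule that a task passes only when both the DOM match and the LLM judge agree are assumptions made for illustration. Only the task count (153) and the top score (33.3%) come from the description above.

```python
from dataclasses import dataclass

TOTAL_TASKS = 153  # task count stated in the project description


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are illustrative only."""
    task_id: str
    dom_match: bool     # final page state matched the expected DOM condition
    judge_passed: bool  # LLM judge accepted the recorded run


def task_succeeded(result: TaskResult) -> bool:
    # Assumed rule: both signals must agree.
    # The real benchmark may combine them differently.
    return result.dom_match and result.judge_passed


def success_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks counted as successful."""
    if not results:
        return 0.0
    passed = sum(task_succeeded(r) for r in results)
    return 100.0 * passed / len(results)


def tasks_for_score(score_percent: float, total: int = TOTAL_TASKS) -> int:
    """Number of solved tasks implied by a percentage score (nearest integer)."""
    return round(score_percent / 100.0 * total)
```

If the score is a plain fraction of tasks solved (an assumption), `tasks_for_score(33.3)` works out to 51, meaning the best agent completes roughly one task in three and fails the remaining 102.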