Complete post: https://github.com/nottelabs/open-operator-evals
The race for open-source web agents is heating up, with bold claims being thrown around. We cut through the noise and bring a fully transparent and reproducible benchmark to get a sense of the current scene. Everything is open, inviting you to see exactly how different systems perform, and perhaps prompting a closer look at others' claims ;)
Rank | Provider | Agent Self-Report | LLM Evaluation | Time per Task | Task Reliability |
---|---|---|---|---|---|
🏆 | Notte | 86.2% | 79.0% | 47s | 96.6% |
2️⃣ | Browser-Use | 77.3% | 60.2% | 113s | 83.3% |
3️⃣ | Convergence | 38.4% | 31.4% | 83s | 50.0% |
Results are averaged over tasks and then over 8 separate runs to account for the high variance inherent in web agent systems. In our benchmarks, each provider ran each task 8 times using the same configuration, headless mode, and strict limits: 6 minutes or 20 steps maximum—because no one wants an agent burning 80 steps to find a lasagna recipe. Agents had to handle execution and failures autonomously.
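For reference, here is a minimal sketch of the run protocol described above. The `Agent` interface (`open`, `step`, `done`, `claimed_success`) is hypothetical and stands in for each provider's actual API; only the limits (20 steps, 6 minutes, 8 runs, headless mode) come from our setup.

```python
import time

MAX_STEPS = 20          # hard cap on agent steps per task
MAX_SECONDS = 6 * 60    # hard cap on wall-clock time per task
N_RUNS = 8              # repetitions per task to average out variance

def run_task(agent, task: str, url: str) -> dict:
    """Run one task under the benchmark limits (hypothetical agent interface)."""
    start = time.time()
    agent.open(url)  # headless browser session
    for _ in range(MAX_STEPS):
        if time.time() - start > MAX_SECONDS:
            return {"success": False, "reason": "timeout"}
        step = agent.step(task)  # agent decides and executes one action
        if step.done:
            return {"success": step.claimed_success, "answer": step.answer}
    return {"success": False, "reason": "step limit reached"}
```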
Key highlights
You can investigate all replays/logs and reproduce the benchmark yourself 👇🏻
- Notte leads the benchmark with the highest performance: 86.2% self-reported success and 79% LLM-verified completion. It also has the fastest execution time at 47s per task and an impressive 96.6% task reliability (the percentage of tasks an agent completes successfully at least once when given multiple attempts).
- Browser-Use shows a notable gap between our results and the claims in their blog post, achieving 77.3% self-reported agent performance and 60.2% LLM-verified success versus their stated 89%. Since their result files are not publicly available, we could not verify their reported performance.
- Convergence shows significantly lower performance than its competitors, with 38.4% agent success and 31.4% evaluation success, primarily due to CAPTCHA and bot detection issues. However, it shows strong self-awareness, achieving near-perfect alignment in some instances, which indicates potential for improvement if the detection challenges are overcome.
PS: We are actively hiring software and research engineers 🪩
The metrics
- Agent Self-Report: the success rate reported by the agent itself across all tasks. This reflects the agent's internal confidence in its performance.
- LLM Evaluation: the success rate determined by GPT-4 using WebVoyager's evaluation prompt as a judge, assessing the agent's actions and outputs. This provides an objective measure of task completion.
- Time per Task: the average execution time in seconds for the agent to attempt and complete a single task, indicating the efficiency and speed of the agent's operations.
- Task Reliability: the percentage of tasks the agent successfully completed at least once across multiple attempts (8 in this benchmark). This highlights the agent's ability to handle a diverse set of tasks given sufficient retries, indicating system robustness.
- Alignment: the ratio of Agent Self-Report to LLM Evaluation, indicating overestimation (>1.0) or underestimation (<1.0) by the agent. Values at or below 1.0 are typically better.
- Mismatch: the number of instances where the agent claimed success but the evaluator disagreed. This reveals how often the agent incorrectly assessed its own performance. A small computation sketch for all of these metrics follows below.
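As a concrete illustration, here is how these metrics can be computed from raw run records. The record schema below is our own simplification for illustration, not the exact format of the result files in this repository.

```python
from statistics import mean

# One record per (task, run): what the agent claimed vs. what the LLM judge decided.
# Illustrative schema only; see the repository's result files for the real format.
records = [
    {"task": "t1", "run": 0, "agent_success": True,  "llm_success": True,  "seconds": 47},
    {"task": "t1", "run": 1, "agent_success": True,  "llm_success": False, "seconds": 51},
    # ...
]

agent_self_report = mean(r["agent_success"] for r in records)
llm_evaluation = mean(r["llm_success"] for r in records)
alignment = agent_self_report / llm_evaluation  # >1.0 means the agent overestimates itself
mismatch = sum(r["agent_success"] and not r["llm_success"] for r in records)
time_per_task = mean(r["seconds"] for r in records)

# Task reliability: share of tasks solved (per the LLM judge) in at least one run.
tasks = {r["task"] for r in records}
solved_once = {r["task"] for r in records if r["llm_success"]}
task_reliability = len(solved_once) / len(tasks)
```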
The dataset
WebVoyager is a dataset of ~600 tasks for web agents. Example:
task: Book a journey with return option on same day from Edinburg to Manchester for Tomorrow, and show me the lowest price option available
url: https://www.google.com/travel/flights
An agent navigates the site and returns a success status and an answer. The agent's self-reported success alone is unreliable, as agents may misjudge task completion. WebVoyager addresses this with an independent LLM evaluator that judges success based on the agent's actions and screenshots.
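To make the evaluation step concrete, below is a rough sketch of what such an LLM-as-judge call can look like. It assumes the OpenAI Python client, uses a placeholder prompt rather than the verbatim WebVoyager judge prompt, and the model name is only an example.

```python
import base64
from openai import OpenAI

client = OpenAI()

def judge(task: str, agent_answer: str, screenshot_paths: list[str]) -> bool:
    """Ask a GPT-4-class model whether the agent completed the task (illustrative prompt)."""
    images = []
    for path in screenshot_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        images.append({"type": "image_url",
                       "image_url": {"url": f"data:image/png;base64,{b64}"}})

    prompt = (
        f"Task: {task}\n"
        f"Agent's final answer: {agent_answer}\n"
        "Based on the screenshots, did the agent complete the task? Answer SUCCESS or FAILURE."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # example model; the benchmark uses GPT-4 with WebVoyager's prompt
        messages=[{"role": "user",
                   "content": [{"type": "text", "text": prompt}, *images]}],
    )
    return "SUCCESS" in response.choices[0].message.content.upper()
```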
The challenge of high variance
Beyond known limitations like outdated web content, a key issue is the high variance in agent performance. These systems, powered by non-deterministic LLMs and operating on a constantly changing web, often yield inconsistent results. Reasoning errors, execution failures, and unpredictable network behavior make single-run evaluations unreliable. To counter this, we propose to run each task multiple times for a much more accurate view—averaging results helps smooth out randomness and gives a more statistically sound estimate of performance.
WebVoyager30
To reduce variance and improve reproducibility, we sampled WebVoyager30—a 30-task subset across 15 diverse websites. It retains the full dataset’s complexity while enabling practical multi-run evaluation, offering a more reliable benchmark for the community.
Running 30 tasks × 8 times (240 runs total) is far more informative than running 600 tasks once, as it averages out randomness and provides a statistically sounder view of performance. Running all 600 tasks 8× would be ideal but is often impractical due to compute costs and time, making fast and accessible reproduction difficult.
The selected tasks are neither trivial nor overly complex—they reflect the overall difficulty of the full dataset, making this a reasonable and cost-effective proxy.
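To illustrate why repeated runs matter, here is a tiny sketch of the aggregation: average the per-run success rates and report their spread. The numbers are made up for illustration, not taken from the benchmark.

```python
from statistics import mean, stdev

# Hypothetical per-run LLM-evaluated success rates over the 30 tasks (8 runs).
run_success_rates = [0.80, 0.73, 0.77, 0.83, 0.70, 0.76, 0.79, 0.74]

avg = mean(run_success_rates)
spread = stdev(run_success_rates)  # run-to-run variation that a single run would hide
print(f"LLM Evaluation: {avg:.1%} ± {spread:.1%} over {len(run_success_rates)} runs")
```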
Breakdowns
Benchmark results breakdown for each provider.
Notte
Provider: notte
Version: [v1.3.3](https://github.com/nottelabs/notte/releases/tag/v1.3.3)
Reasoning: gemini/gemini-2.0-flash
Notte leads the benchmark with 86.2% self-reported success and 79% LLM-verified completion, along with the fastest execution time at 47s per task and an impressive 96.6% task reliability. It shows consistent performance, with self-assessments slightly overestimating results: alignment ratios range from 0.960 to 1.183, with low mismatch counts (mostly 3). Task times are consistently efficient (45-51s), and run 1743001170-7 achieved near-perfect alignment at 0.960.
Runs | Agent Self-Report | LLM Evaluation | Alignment | Mismatch | Time per Task |
---|---|---|---|---|---|
1743001170-0 | 0.929 | 0.857 | 1.084 | 3 | 47s |
1743001170-3 | 0.867 | 0.767 | 1.130 | 3 | 50s |
1743001170-4 | 0.867 | 0.800 | 1.084 | 3 | 51s |
1743001170-6 | 0.867 | 0.733 | 1.183 | 4 | 45s |
1743001170-1 | 0.862 | 0.759 | 1.136 | 3 | 47s |
1743001170-7 | 0.857 | 0.893 | 0.960 | 1 | 47s |
1743001170-2 | 0.828 | 0.759 | 1.091 | 2 | 45s |
1743001170-5 | 0.821 | 0.750 | 1.095 | 3 | 49s |
Browser-Use
Provider: Browser-Use
Version: [v0.1.40](https://github.com/browser-use/browser-use/releases/tag/0.1.40)
Reasoning: openai/gpt-4o
Browser-Use reported an 89% success rate on WebVoyager, but we were unable to replicate these results despite our efforts, both on WebVoyager30 with multiple retries and on the full dataset in a single shot. We also tested different configurations of the agent and browser, along with lenient interpretations of ambiguous outcomes, but we could not reach their reported performance. Browser-Use shows higher alignment ratios (1.2–1.534), indicating a 20–50% overestimation of its abilities. It also has more mismatches (5–8), reflecting a larger gap between self-assessment and actual performance.
Runs | Agent Self-Report | LLM Evaluation | Alignment | Mismatch | Time per Task |
---|---|---|---|---|---|
1743016360-6 | 0.833 | 0.667 | 1.249 | 7 | 98s |
1743016360-4 | 0.815 | 0.667 | 1.222 | 5 | 119s |
1743016360-1 | 0.808 | 0.577 | 1.400 | 7 | 127s |
1743016360-5 | 0.800 | 0.600 | 1.333 | 6 | 95s |
1743016360-2 | 0.786 | 0.679 | 1.158 | 5 | 132s |
1743016360-7 | 0.767 | 0.500 | 1.534 | 8 | 105s |
1743016360-3 | 0.708 | 0.542 | 1.306 | 5 | 113s |
1743016360-0 | 0.667 | 0.583 | 1.144 | 2 | 118s |
Convergence
Provider: Convergence
Version: [a4389c5](https://github.com/convergence-ai/proxy-lite/commit/a4389c599d5f5f77dc18510c879e2e783434766b)
Reasoning: Convergence Proxy-lite
Convergence Proxy-lite performs significantly below competitors at just 38.4% (agent) and 31.4% (evaluation) success rates. However, these results appear heavily impacted by technical issues, as the system frequently triggers Google's CAPTCHA and bot detection services. Despite these limitations, Convergence demonstrates remarkably better alignment between self-assessment and evaluation than Browser-Use, with one run achieving perfect 1.000 alignment with zero mismatches. This suggests that with improved bot detection handling, Convergence would likely outperform Browser-Use due to its superior self-awareness and calibration.
Runs | Agent Self-Report | LLM Evaluation | Alignment | Mismatch | Time per Task |
---|---|---|---|---|---|
1743114165-6 | 0.483 | 0.345 | 1.400 | 4 | 77s |
1743114165-0 | 0.407 | 0.333 | 1.222 | 2 | 85s |
1743114165-3 | 0.393 | 0.286 | 1.374 | 3 | 82s |
1743114165-4 | 0.379 | 0.345 | 1.099 | 2 | 82s |
1743114165-5 | 0.379 | 0.276 | 1.373 | 3 | 84s |
1743114165-7 | 0.367 | 0.333 | 1.102 | 3 | 84s |
1743114165-2 | 0.357 | 0.286 | 1.248 | 3 | 86s |
1743114165-1 | 0.310 | 0.310 | 1.000 | 0 | 84s |
Conclusion
Our open-source agent evaluation reveals notable differences between reported and observed performance. While Notte shows strong capabilities and good self-awareness, other systems exhibit issues with reproducibility and self-assessment. These results underscore the importance of clear, reproducible benchmarks, and we encourage collaboration from the research and engineering community to develop better, trusted evaluation standards.