December 31, 2024

AI-Powered Web Tasks: Comparing Harness AI Test Automation and browser-use

Our previous blog, "AI-Powered Web Testing: An Evaluation of Our AI Test Automation and Automating Complex Web Tasks," introduced a framework for evaluating our AI Test Automation tool against web tasks. The Harness AI Test Automation solution allows users to define intent-based tests in natural language, which leads to more efficient and adaptable test definitions. Here, we follow up with a detailed comparison between the AI Test Automation solution and browser-use.
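
For readers unfamiliar with browser-use, here is a minimal sketch of how a natural-language task is handed to it through its published Python API at the time of writing; the model choice and task wording are our own illustrations, not taken from the evaluation runs.

```python
import asyncio

from browser_use import Agent
from langchain_openai import ChatOpenAI


async def main() -> None:
    # The task is plain natural language; the agent plans and executes
    # the browser steps needed to satisfy it.
    agent = Agent(
        task="Go to harness.io, open the About Us page, and verify Jyoti is the CEO.",
        llm=ChatOpenAI(model="gpt-4o"),  # any LangChain-compatible chat model
    )
    history = await agent.run()
    print(history)


if __name__ == "__main__":
    asyncio.run(main())
```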

Our evaluation focuses on how the Harness AI Test Automation and browser-use handle diverse web scenarios. We defined key categories of web tasks representing real-world user interactions, from simple information retrieval to complex enterprise application workflows. The chosen web tasks were inspired by everyday consumer use cases, publicly available benchmarks, and Harness customer requirements.

The evaluation framework includes navigating websites, understanding web pages, and UI interactions. Furthermore, the agent must be able to complete multi-step processes. The appendix concretely defines the prompts, the expected agent behavior, and whether each solution completed the task.
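
Concretely, each appendix entry can be thought of as a small record. The dataclass below is a hypothetical rendering of that schema (the field names are our own), populated with the basic-assertion task from this post:

```python
from dataclasses import dataclass


@dataclass
class WebTask:
    """One appendix row: a prompt, the expected behavior, and outcomes."""
    category: str
    prompt: str
    expected_behavior: str
    harness_completed: bool
    browser_use_completed: bool


TASKS = [
    WebTask(
        category="Basic Assertion",
        prompt="Navigate to harness.io, open the About Us page, and validate Jyoti is the CEO.",
        expected_behavior="The agent opens the About Us page and confirms Jyoti Bansal as CEO.",
        harness_completed=True,
        browser_use_completed=True,
    ),
]


def success_rate(completed: list[bool]) -> float:
    """Percentage of tasks completed, e.g. 17 of 19 -> 89.5."""
    return 100.0 * sum(completed) / len(completed)
```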

Summary of the AI Test Automation and browser-use

The Harness AI Test Automation completed 17 of the 19 tasks (an 89.5% success rate); browser-use completed 13 of the 19 (a 68.4% success rate). We now discuss the tasks by category, with more detailed definitions of the individual web tasks in the appendix.

Category Comparisons

Basic Assertion

Our simplest task involves navigating to harness.io, opening the About Us page, and validating that Jyoti is the CEO. Both solutions did so successfully. The Harness AI Test Automation solution scrolled to the About Us link, navigated to the About Us page, and correctly confirmed Jyoti as the CEO. browser-use followed nearly the same sequence, locating the About Us page and identifying Jyoti Bansal as the CEO. A more efficient approach would have been to use the search bar on harness.io to answer the question "Is Jyoti the CEO?"; neither solution did.

This baseline test demonstrated that both tools can effectively handle UI interactions and page comprehension.

E-commerce Flows

A primary use case for these solutions is executing tasks on e-commerce websites. Here at Harness, we host a simulated shoe store with star ratings, prices, and the ability to purchase and check out. Our task involved adding three 4-star-rated shoes under $50 to a cart and completing the checkout. This task is where we observed the first difference between the tools.

The Harness AI Test Automation successfully identified the correct products based on the star-rating and price constraints, added them to the cart, and completed the checkout process. In contrast, browser-use struggled with the filtering criteria, adding five items, including some priced above $50, and eventually timed out without completing the checkout.
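
The qualifying condition itself is simple to state in code, which makes the miss notable. Below is a sketch of the oracle a graded cart must satisfy; the product data is invented for illustration, and we assume "4-star-rated" means exactly four stars.

```python
def qualifies(product: dict) -> bool:
    # Assumes "4-star-rated" means exactly four stars; the price bound is strict.
    return product["stars"] == 4 and product["price"] < 50.00


catalog = [
    {"name": "Trail Runner", "stars": 4, "price": 39.99},   # qualifies
    {"name": "Court Classic", "stars": 5, "price": 44.99},  # wrong rating
    {"name": "City Walker", "stars": 4, "price": 59.99},    # over budget
    {"name": "Road Racer", "stars": 4, "price": 24.99},     # qualifies
    {"name": "Gym Flex", "stars": 4, "price": 47.50},       # qualifies
]

cart = [p for p in catalog if qualifies(p)][:3]

# The grading oracle: exactly three items, each meeting both criteria.
assert len(cart) == 3 and all(qualifies(p) for p in cart)
```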

The task also highlighted our customers' requirement for repeatability. With this and other tasks, we observed inconsistent results across runs. Software testing requires consistency from run to run unless a defect is genuinely introduced.

Our shoe store with stars, prices, and basic e-commerce functionality

Another e-commerce task involved retrieving the price range of Amazon Basics products. The starting website was onlyonestopshop.com. browser-use successfully navigated to Google to find the answer, while the Harness AI Test Automation struggled to find this information because the agent did not navigate away from onlyonestopshop.

Finally, we searched for hot pink napkins on onlyonestopshop. The task asserted that the napkins cost under a dollar. Both solutions succeeded. The Harness AI Test Automation used the site's search functionality directly, while browser-use navigated to Google, queried for pink napkins on onlyonestopshop, and opened the napkins' detail page.

browser-use opened Google to satisfy the pink napkin task

Date Selection on a Calendar

Calendar interactions are a special use case that often confuses LLMs. We tested the AI Test Automation and browser-use solutions on Kayak and booking.com. Both performed well, interacting with the date pickers and selecting the correct dates. This demonstrates both solutions' ability to execute an important function common in reservation systems.

However, when the booking scenario's complexity increased, both tools struggled. The most difficult task was finding a flight from New York to Atlanta on specific dates in business class. The Harness AI Test Automation set the travel class and dates but got the departure and destination cities wrong. browser-use encountered three consecutive errors and failed to complete the task.

Financial (Mortgage)

We defined web tasks against the financial website better.com. The task was to view the rates for a specific type of mortgage. Both tools completed it. The Harness AI Test Automation correctly navigated the site, then identified and clicked the requested mortgage type. browser-use executed the task similarly, progressing even further into the flow before failing when better.com asked for an email confirmation.

Additionally, we prompted both solutions to create an account on an internal demo site, Digital Bank. Although not strictly a financial task, creating an account and logging in is common across the internet. The Harness AI Test Automation solution completed the initial signup form, filled out a second form with information such as an address, and logged into Digital Bank. Unfortunately, browser-use failed to enter matching values in the password and confirm-password fields.

Left: The Harness AI Test Automation solution can create and log in a new user. Right: browser-use is unable to match the passwords to create a user.

Maps

Inspired by publicly available benchmarks, we observed how the AI Test Automation and browser-use performed on map-based tasks. One task asked the agents to find the driving distance between Carnegie Mellon University and Massachusetts' top computer science school. Both tools performed well while taking different approaches. The Harness AI Test Automation navigated to Google Maps, recognized MIT as the top computer science (CS) school in Massachusetts, entered the appropriate locations, and accurately determined the trip would take longer than 9 hours; the LLM evidently knew the top CS school in Massachusetts is MIT. browser-use also completed the task but opted for a bus route that took over 10 hours.

Left: The Harness AI Test Automation solution found the fastest route from Carnegie Mellon University to Massachusetts Institute of Technology on Google Maps. Right: browser-use leveraged rome2rio.com to find a route from Carnegie Mellon University to Massachusetts Institute of Technology.

Documentation Navigation

Our customers need to validate online documentation, both proprietary and open source. We asked both solutions to navigate GitHub and list known issues. Both browser-use and the AI Test Automation solution accomplished the task.

Furthermore, both tools were tested on Apache's project management interface. The web task was to select the Cassandra project and navigate to the AMQP 2.2.0 release. Both applications performed well, finding the Cassandra project and locating the requested release.

However, when checking the documentation on pnpm.io for the prune command, only the Harness AI Test Automation located the command's documentation for version 8.x; browser-use failed to find the command while incorrectly claiming success. These web tasks demonstrate the specialization required for some customer use cases.

Enterprise Applications

Both solutions performed well when navigating the Harness app and managing a GitHub connector. The test required logging in, selecting a project, opening the account settings, deleting an existing GitHub connector, and creating a new one. These CRUD (create, read, update, delete) operations are prevalent in enterprise use cases.

The Harness AI Test Automation identified the correct project, located and deleted the existing GitHub connector, created a new connector with dummy credentials, and verified the connection. Similarly, browser-use demonstrated its ability to execute the web task flawlessly.

This demonstrates that both tools can handle enterprise application workflows involving authentication, navigation through non-traditional interfaces, and the execution of CRUD operations.

Internal Applications

The most significant performance gap emerged while testing the specialized interfaces of internal applications. Unorthodox, information-rich interfaces proved difficult, particularly for browser-use. One task required the agent to tap the run/play icon for a specific test. The Harness AI Test Automation solution tapped the run/play icon for the target test; browser-use instead clicked the three-dot menu and chose the Run Suite option. Both tools accomplished the overall task, but browser-use did not tap the run/play icon as the prompt instructed.

browser-use clicked the three-dot menu despite the instruction to click the play button next to 'Data Driven test'.

Another web task was to clone a test. browser-use did not recognize the Clone test button that popped up and clicked the run button instead. Although the UI is arguably unintuitive, similar interfaces appear in our customers' applications.

browser-use was unable to recognize 'Clone test' at the bottom of the page.

Data Table Navigation

As mentioned earlier, complex, data-rich tables have proven difficult. One task involved audit logs: specifically, identifying an audit log entry older than two days. The Harness AI Test Automation navigated to the audit logs interface and worked through the data table to find older entries. browser-use reached the audit log UI but failed to navigate the data table effectively enough to locate older records.
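
The check itself is easy to express; what trips agents up is paging through the table to reach older rows. A minimal sketch of both pieces (the page structure is our own invention):

```python
from datetime import datetime, timedelta, timezone


def older_than_two_days(ts: datetime, now: datetime | None = None) -> bool:
    """True when an audit-log timestamp precedes the two-day cutoff."""
    now = now or datetime.now(timezone.utc)
    return ts < now - timedelta(days=2)


def find_old_entry(pages):
    # `pages` yields lists of (timestamp, message) rows, newest first,
    # mimicking the paginated data table the agent must page through.
    for rows in pages:
        for ts, message in rows:
            if older_than_two_days(ts):
                return ts, message
    return None  # no qualifying entry on any page
```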

This highlights further specialization necessary for complex data tables common with enterprise applications.

Key Findings and Observations

Our evaluation revealed several important insights:

  1. Task Complexity Matters: Both the Harness AI Test Automation and browser-use handled simple tasks well, but the Harness AI Test Automation excelled on more complex use cases.
  2. Multi-criteria Tasks: Both solutions can apply multiple filtering criteria simultaneously, but the AI Test Automation outperformed browser-use on the task of finding products with specific ratings and price points.
  3. Specialized Interfaces: The Harness AI Test Automation better handles the specialized interfaces prevalent in internal enterprise applications.
  4. Data Tables: Complex data tables posed a significant challenge for browser-use. The Harness AI Test Automation's multi-agent architecture can handle the complex data tables common in enterprise applications.
  5. Alternative Approaches: When the answer was not obvious, browser-use leveraged Google search, navigating away from the originating website. The Harness AI Test Automation service omits such behavior by design.

Design Similarities and Differences

Here are some key similarities and differences between AI Test Automation and browser-use:

  1. Multi-agent Architecture: Our solution employs specialized agents for tasks that require more nuanced handling of complex interactions.
  2. Test-Specific Optimizations: The Harness AI Test Automation solution is built for software testing, which requires specialized interactions and assertions that general browser automation tools may lack.
  3. Prompt Engineering: Both browser-use and the Harness AI Test Automation leverage the HTML source and visual screenshots of web pages. The two artifacts are decorated with marks to assist the LLM and are passed to it to determine next steps and evaluate assertions (a minimal sketch of this marking idea follows below).
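
This marking approach is often called set-of-marks prompting: interactive elements receive numeric labels that the model can reference in its next action. A tool-agnostic sketch of the idea (the element schema is our own invention):

```python
def annotate_elements(elements: list[dict]) -> str:
    """Render interactive elements with numeric marks for the LLM prompt.

    The model can then answer with an action such as 'click [2]'.
    """
    lines = []
    for i, el in enumerate(elements):
        text = el.get("text", "")
        lines.append(f"[{i}] <{el['tag']}> {text!r} role={el.get('role', 'generic')}")
    return "\n".join(lines)


page_elements = [
    {"tag": "a", "text": "About Us", "role": "link"},
    {"tag": "input", "text": "", "role": "searchbox"},
    {"tag": "button", "text": "Add to Cart", "role": "button"},
]

print(annotate_elements(page_elements))
# [0] <a> 'About Us' role=link
# [1] <input> '' role=searchbox
# [2] <button> 'Add to Cart' role=button
```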

Future Work

We will continue to expand our evaluation framework by adding more tasks, annotating tasks with expected outcomes, and making the framework publicly available. We also continuously add capabilities to our solution as we encounter more complex scenarios.

In future evaluations, we will compare the Harness AI Test Automation against other adjacent technologies, including Anthropic's Computer Use, OpenAI's Operator, and Opera's AI Browser Operator. This blog and future studies will help us understand the state of the art in AI-powered web automation and LLM capabilities.

We should note that browser-use and similar tools focus primarily on general web browser interactions, while the Harness AI Test Automation is designed specifically for software testing. Testing requires additional capabilities such as conditionals, API invocation, and assertions, and repeatability is a necessity for engineering teams.
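
For example, a testing-oriented step might cross-check what the browser agent observed in the UI against an API response, something a general browsing agent has no notion of. A hypothetical sketch (the endpoint and response fields are invented for illustration):

```python
import requests


def verify_order_total(order_id: str, ui_total: float) -> None:
    # Hypothetical endpoint: a test step calls the application's API and
    # asserts it agrees with the total the browser agent saw in the UI.
    resp = requests.get(f"https://shop.example.com/api/orders/{order_id}", timeout=10)
    resp.raise_for_status()
    api_total = resp.json()["total"]
    assert abs(api_total - ui_total) < 0.01, f"UI showed {ui_total}, API returned {api_total}"
```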

Conclusion

Overall results of Harness AI Test Automation and browser-use.

In our evaluation, both the Harness AI Test Automation and browser-use easily handle simple web tasks. As task difficulty increases, the AI Test Automation tool starts to outperform browser-use, and when presented with enterprise application tasks, the AI Test Automation solution shines.

The AI Test Automation tool continues to advance web application testing by defining tests in natural language. Tests can now be defined by intent, making the test definition more adaptive to the underlying application and applicable to multiple implementations. At Harness, we remain committed to leveraging AI to streamline software engineering. We are obsessed with empowering developers to do what they love: ship stuff.

Appendix

[Appendix table: task prompts, expected agent behavior, and completion results for each solution]


Ben Markines

He is a Principal Engineer at Split Software, working on data pipelines and applications that power some of the world’s most popular products. His work involves collecting, managing, and analyzing large-scale data used to drive monitoring and statistical applications, enabling software teams to gain insights and make data-driven decisions that deliver exceptional user experiences. He excels at taking abstract, unstructured projects, defining clear requirements, and building alignment across teams. His background spans data mining, machine learning, and similarity networks.
