
AI-Powered Web Testing: Evaluating Our AI Test Automation on Complex Web Tasks

Harness AI Test Automation
At Harness, we’re developers ourselves, and we’re obsessed with creating the best developer experience. Today, software engineers are responsible not only for designing and implementing applications but also for ensuring they are thoroughly tested. How can engineers be confident that a change, whether to the front end or the back end, does not break the application? The AI Test Automation tool lets users define their tests in natural language: the natural-language description becomes the ‘source code’ of the test, and a test defined this way is portable across applications. But how can we determine how well our solution is doing? The purpose of this blog is to describe the framework we used to evaluate it. In this write-up, we define a wide-ranging set of web tasks that our agent, the Harness AI Test Automation, is expected to complete. Future work will benchmark its results against adjacent agents charged with executing web tasks.
Our AI Test Automation application navigates web pages, evaluates assertions, and self-heals when the underlying application changes. A crucial underlying function of the AI Test Automation application is executing web tasks, which simulate real-world user interactions. The classic method of defining these interactions was to build complex automation scripts. Our AI Test Automation service instead executes these tasks by deriving specific actions from the user’s intent. Our solution requires no code and less maintenance than contemporary solutions.
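To make the contrast concrete, here is a minimal sketch of the classic scripted approach, written against Playwright’s Python API rather than any Harness tooling; the URL, credentials, and selectors are placeholders.

```python
# A minimal sketch of the classic scripted approach (Playwright's Python API).
# The URL, credentials, and selectors below are placeholders.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://app.example.com/login")

    # Every interaction is pinned to a concrete selector, so a renamed field
    # or a moved button breaks the script.
    page.fill("#username", "test-user")
    page.fill("#password", "test-password")
    page.click("button[type='submit']")

    # Assertions are likewise tied to specific DOM structure.
    assert page.locator("h1.dashboard-title").first.inner_text() == "Dashboard"

    browser.close()
```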

Naturally Define Web Tasks
For evaluation, we identified a set of web tasks analogous to what our internal and external customers expect. Additionally, we included tasks inspired by publicly available research such as VisualWebArena. We ran the evaluation on a random sample of tasks to keep the study contained while still being representative of the use cases we strive to solve. The web tasks range from internal B2B applications to consumer web applications.
Web Tasks
We derived these web tasks from our customers, from internal dogfooding, and from past research:
- Information retrieval — Finding information on websites (like finding a contact number or identifying a company’s CEO)
- E-commerce flows — Adding products to a cart and checking out
- Application configuration — Creating and managing artifacts in an enterprise application
- Date selection — Navigating complex calendars across travel websites
- Form filling — Completing online forms
- UI interaction — Logging into web applications, clicking specific HTML elements, and navigating tables to find a record matching given criteria
These tasks led us to build a multi-agent solution with agents at varying levels of specialization. The Harness AI Test Automation tool accepts a set of intents to accomplish these tasks. Note that software testing requires producing consistent results across test runs.
How Our AI Test Automation Tool Performed
We will now go into more detail about our evaluation, breaking down the web tasks above even further.
Company Information Retrieval
Our first test was straightforward: navigate to harness.io, find the About Us page, and verify that Jyoti is the CEO. The AI Test Automation application navigated to the site, scrolled the page, found the About Us link, and scrolled again to locate the CEO. Although simple, these tasks exercise basic scrolling functionality and an agent’s ability to decide on follow-up browser actions.
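For comparison, a hand-written equivalent of this retrieval task might look like the sketch below (Playwright’s Python API). The link name and page structure are hard-coded assumptions here, whereas the agent discovers them at run time from the intent.

```python
# A hand-written equivalent of the information-retrieval task (Playwright's
# Python API). The link name and page structure are hard-coded assumptions.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.harness.io")

    # The agent scrolls and searches to discover this link at run time;
    # a script has to know it in advance.
    page.get_by_role("link", name="About Us").first.click()
    page.wait_for_load_state()

    # Verify the CEO's name appears somewhere on the page.
    assert "Jyoti" in page.content()
    browser.close()
```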
E-commerce Scenarios

The Agent generates specific actions to execute the user’s intent.
We made the web task more complicated with e-commerce flows. On a demo app, we asked the AI Test Automation application to add three 4-star-rated shoes under $50 to a cart, and then complete checkout. This requires the agent to understand product ratings, filter by price, select the correct items, and complete a multi-step checkout process — all of which the AI Test Automation application successfully handled.
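A scripted version of this flow has to encode the filtering and checkout logic explicitly. The sketch below (Playwright’s Python API; the shop URL, selectors, and data attributes are hypothetical) hints at how much of the application’s structure such a script must assume up front.

```python
# A sketch of the scripted equivalent of the e-commerce task (Playwright's
# Python API). The shop URL, selectors, and data attributes are hypothetical.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://demo-shop.example.com/shoes")

    # Filter by rating and price, and add the first three matches to the cart.
    added = 0
    for card in page.locator(".product-card").all():
        rating = int(card.get_attribute("data-rating") or 0)
        price = float(card.get_attribute("data-price") or "inf")
        if rating == 4 and price < 50:
            card.get_by_role("button", name="Add to cart").click()
            added += 1
        if added == 3:
            break

    # The multi-step checkout is another run of hard-coded selectors.
    page.click("#cart")
    page.click("#checkout")
    browser.close()
```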
Another set of tasks involved onlyonestopshop. The AI Test Automation can navigate the site to find a specific item under a given price. The automation located the item but failed to complete the checkout process.
Enterprise Application Workflows
A more involved task requires the AI Test Automation application to navigate a proprietary enterprise application. In one of our use cases, we asked the application to log into the Harness platform, select a specific project, navigate to account settings, delete an existing GitHub connector, and recreate the just-deleted connector. These CRUD operations represent a common pattern that enterprise applications need to test for every entity type they define, and the AI Test Automation framework automatically verifies such operations. Tasks over enterprise applications also come with additional requirements, such as more complex user interfaces and security controls.
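To illustrate the shape of such a test, here is a purely hypothetical intent-based definition of this connector workflow, written as a plain list of natural-language steps. This is not the actual Harness test format; the user and project names are placeholders.

```python
# Purely hypothetical: an intent-based definition of the connector workflow,
# written as plain natural-language steps. This is NOT the actual Harness
# AI Test Automation format; it only shows the shape of such a test.
connector_crud_test = [
    "Log into the Harness platform as the QA user",
    "Open the demo project and navigate to account settings",
    "Delete the existing GitHub connector",
    "Recreate the connector that was just deleted",
    "Verify the connector appears in the connectors list",
]
```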
Travel Booking Interfaces
Selecting a specific date on a calendar presents complications for automation. We created an agent specializing in calendar interactions to navigate to and select a date defined by the user. Our AI Test Automation tool successfully navigated Kayak, booking.com, and Wasimil calendars to pick specific dates in the near term (the 14th of next month) and the far future (February 15th of next year). These types of interactions often break traditional test automation.
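The sketch below shows why date pickers are brittle for scripted automation. It uses Playwright’s Python API against a generic, assumed calendar widget, not the actual markup of Kayak, booking.com, or Wasimil, and the target date is only an example.

```python
# Why date pickers are brittle for scripted automation (Playwright's Python
# API). The calendar widget structure here is a generic assumption, not the
# actual markup of Kayak, booking.com, or Wasimil.
from playwright.sync_api import sync_playwright

TARGET_MONTH = "February 2026"  # example target: February 15th of next year
TARGET_DAY = "15"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://travel.example.com")
    page.click("#departure-date")  # open the date-picker widget

    # Step forward month by month until the header matches, then pick the day.
    # Any change to the header text format or the arrow button breaks this.
    for _ in range(24):
        if page.locator(".calendar-header").first.inner_text() == TARGET_MONTH:
            break
        page.click("button.next-month")
    page.locator(".calendar-day").filter(has_text=TARGET_DAY).first.click()
    browser.close()
```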
Software Documentation Sites
When tasked with checking documentation on sites like GitHub and pnpm.io, the AI Test Automation demonstrated its ability to navigate documentation, find specific commands (like the prune command in pnpm), and validate an existing version.
Real Estate and Financial Sites
The AI Test Automation was also tasked with financial workflows. One example requires our agent to find the rate for a particular type of mortgage. On Better.com, the agent found the mortgage type and asserted the expected rate.
Internal Applications
At Harness, the AI Test Automation tool has become the standard way to streamline testing and validation processes. Our AI Test Automation application showed its flexibility by navigating specialized interfaces, selecting checkboxes next to specific artifacts, and triggering actions on those artifacts, e.g., cloning and editing. The AI Test Automation tool also found a record by sorting and paging through a complex data table with many columns.
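The sort-and-page pattern the tool handled looks roughly like the following when scripted by hand (Playwright’s Python API; the ARIA roles, the “Name” column, and the “Next page” button are illustrative assumptions about the table’s markup).

```python
# The sort-and-page pattern over a data table, scripted by hand (Playwright's
# Python API). The ARIA roles, "Name" column, and "Next page" button are
# illustrative assumptions about the table's markup.
from playwright.sync_api import sync_playwright

def find_row(page, target: str):
    # Sort by the "Name" column, then page forward until the record appears.
    page.get_by_role("columnheader", name="Name").click()
    while True:
        row = page.get_by_role("row").filter(has_text=target)
        if row.count() > 0:
            return row.first
        next_button = page.get_by_role("button", name="Next page")
        if next_button.is_disabled():
            return None  # ran out of pages without finding the record
        next_button.click()

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://internal.example.com/artifacts")
    row = find_row(page, "my-artifact")
    assert row is not None
    browser.close()
```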
Tasks Involving Maps
Inspired by VisualWebArena, the AI Test Automation application determined the distance between Carnegie Mellon and the top CS school in Massachusetts. We assume the LLM knows that the current top CS school in Massachusetts is MIT. The AI Test Automation navigated to Google Maps, identified MIT as the destination, and accurately determined that the trip would take longer than 9 hours.
Power of Intent-based Tests
Traditional automated testing of web applications usually requires coding knowledge, significant maintenance when the UI changes, and detailed knowledge of the underlying application. Test authors typically lay out a particular sequence of steps to accomplish the task they have in mind, which is brittle if that sequence changes because of design modifications. The Harness AI Test Automation instead lets users describe the intent behind the test in natural language, and the tool generates the necessary steps to execute that intent.
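A compact illustration of that difference, reusing the documentation task from earlier: the scripted form is pinned to today’s markup, while the intent leaves step generation to the agent. The selectors are placeholders and the intent string is not the actual Harness syntax.

```python
# The same documentation check expressed two ways (Playwright's Python API;
# selectors are placeholders and the intent string is not Harness syntax).
from playwright.sync_api import sync_playwright

# Intent style: the goal is the test; the agent derives (and re-derives) the steps.
INTENT = "Open the CLI documentation and verify that the prune command is documented."

# Step-sequence style: pinned to today's markup and navigation structure.
with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://docs.example.com")
    page.click("nav >> text=CLI")
    page.click("text=prune")
    assert "prune" in page.locator("h1").first.inner_text()
    browser.close()
```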
AI Test Automation fundamentally changes who can create and maintain tests. Composing a test can now be democratized across the organization, which in turn produces more robust applications and better feature ideation. The AI Test Automation adapts to varied use cases across domains. Traditional test scripts may break when developers move a button or change a class name; because the AI Test Automation re-derives its actions from the stated intent, it adapts to such changes. This self-healing capability means fewer broken tests, less maintenance, and higher overall reliability, allowing engineering teams to evolve their applications faster and with confidence.
Future Work
Our AI Test Automation performed well across many tasks. We will publish the precise results in a follow-up blog comparing our AI Test Automation application against related technologies, including browser-use, Anthropic’s computer-use, and OpenAI’s Operator. On the other hand, our AI Test Automation application still struggles with complicated data tables and with instructions or tasks that confuse the LLM.

An example appointment calendar/data table is one that many models struggle with. As of this writing, Claude 3.7 is the only model that successfully schedules an event at the correct time and date.
The AI Test Automation tool is an advancement for software engineering organizations. AI-enabled tools empower developers to concentrate on work that is harder to automate. Here at Harness, we intend to leverage AI throughout the software engineering life cycle, and the Harness AI Test Automation solution is one important part of that life cycle.

