September 12, 2025

Reverse Engineering Comet

Comet’s automation is powerful and elegant when assumptions about web structure hold, but brittle when they don’t. The diversity and messiness of the web mean that screenshot-augmented, multi-modal approaches (like Harness AI Test) offer broader coverage and reliability, especially for enterprise applications. Comet’s design reflects a deliberate trade-off: optimized for speed and simplicity on well-architected sites, but potentially limited by the web’s inconsistencies.

Comet is a new browser from Perplexity, and it is amazing at performing tasks autonomously. You simply need to give a high-level intent, and it does the rest.

Find me the cheapest flight from SFO to LAX departing on 14th September and returning on 19th September.

That's all you need to tell Comet, and it will find you the flight. Needless to say, this will change how we browse the internet.

As an engineer, I wanted to know how it works. Specifically, what context is it using? I looked for information online but could not find an engineering blog or any other technical write-up. So I tried the next best thing: just talk to Comet and see what it has to say. Surprisingly, it is quite vocal about how it figures things out.

Let's go step by step and prompt Comet about its inner workings:

Do you use accessibility tree on the website to see what elements to act on?

Comet: Yes, I utilize the website’s accessibility tree — specifically roles, ARIA attributes, labels, and states — to precisely identify and interact with page elements for automation and control tasks

What about screenshots?

Comet: No, I do not use screenshots to identify or interact with website elements. My actions are based on the underlying accessibility tree and HTML structure, not images of the page.

Do you look at the raw DOM as well?

Comet: Yes, I use the raw DOM (Document Object Model) in addition to the accessibility tree. This allows me to access the full structure, attributes, and content of every element on the page for precise interaction and automation.

But DOMs can be huge and won't fit your context window.
Do you do any special processing to reduce its size?

Comet: Yes, special processing is performed on large DOMs to make them manageable:

Filtering: Only relevant elements are extracted — those visible, interactive, or matching the task (e.g., buttons, form fields).

Pruning: Non-essential elements (such as scripts, styles, or hidden sections) are ignored.

Summarization: Hierarchical structure is flattened, and elements are grouped by their roles and ARIA landmarks.

Chunking: The DOM is broken into smaller, logical sections for stepwise analysis.

This targeted extraction ensures only the essential information needed for the automation task is processed, keeping the context efficient and focused.
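To make those four steps concrete, here is a minimal sketch on a toy DOM. It is only an illustration, not Comet's actual implementation: I am assuming nodes are plain objects with `tag`, `attrs`, `text`, `hidden`, and `children` fields, and the names `condense` and `chunk`, the tag lists, and the sample page are all hypothetical.

```javascript
// Hypothetical node shape: { tag, attrs, text, hidden, children }.
const PRUNED_TAGS = new Set(['script', 'style', 'noscript']);
const INTERACTIVE_TAGS = new Set(['a', 'button', 'input', 'select', 'textarea']);

// Filtering + pruning + flattening: walk the tree, skip scripts, styles and
// hidden subtrees, and emit one flat record per interactive element.
function condense(node, landmark = 'page', out = []) {
  if (node.hidden || PRUNED_TAGS.has(node.tag)) return out;
  const attrs = node.attrs || {};
  // Track the nearest landmark so flattened records keep some context.
  const here = attrs.role === 'navigation' || node.tag === 'nav' ? 'navigation'
    : attrs.role === 'main' || node.tag === 'main' ? 'main'
    : landmark;
  if (INTERACTIVE_TAGS.has(node.tag) || attrs.role === 'button') {
    out.push({
      landmark: here,
      role: attrs.role || node.tag,
      label: attrs['aria-label'] || node.text || '',
    });
  }
  for (const child of node.children || []) condense(child, here, out);
  return out;
}

// Chunking: split the flat list into fixed-size pieces for stepwise analysis.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) chunks.push(items.slice(i, i + size));
  return chunks;
}

// Made-up page: one script, a nav with two links, a hidden section, a form.
const page = {
  tag: 'body',
  children: [
    { tag: 'script', text: 'track()' },
    { tag: 'nav', children: [{ tag: 'a', text: 'Home' }, { tag: 'a', text: 'Flights' }] },
    { tag: 'main', children: [
      { tag: 'div', hidden: true, children: [{ tag: 'button', text: 'Secret' }] },
      { tag: 'input', attrs: { 'aria-label': 'From' } },
      { tag: 'button', text: 'Search' },
    ] },
  ],
};

const elements = condense(page);
console.log(chunk(elements, 2));
```

The script and the hidden button never reach the output, and what is left is a short flat list grouped by landmark. A real pipeline would also have to decide visibility from computed styles and layout, which this sketch skips.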

So, in a nutshell, Comet relies on the DOM plus the accessibility tree. That will limit its capability somewhat, especially when dealing with applications that are complex and do not follow the rules (unfortunately, a common occurrence on the internet).

Exploring Limitations

Next, let's talk about the limitations of Comet.

How do you deal with IFRAME with cross origin source? You are not allowed to see inside it, so how will you know where to act?

Comet: If the element to act on is inside a cross-origin iframe, I cannot access or manipulate it directly. Only the iframe as a whole can be targeted from the parent page.
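This matches how the same-origin policy behaves for any script running in the parent page: in browsers, `iframe.contentDocument` is `null` when the frame is cross-origin. The sketch below shows the resulting fallback. The function name `frameTargets`, the mock frame objects, and the `payments.example` URL are made up so that it runs outside a browser; it says nothing about how Comet itself is wired.

```javascript
// Returns what a parent-page script could target for a given <iframe>.
// `iframe` is expected to look like an HTMLIFrameElement; in browsers,
// `contentDocument` is null when the frame is cross-origin.
function frameTargets(iframe) {
  let doc = null;
  try {
    doc = iframe.contentDocument;
  } catch (err) {
    // Defensive: treat any access error the same as "not accessible".
    doc = null;
  }
  if (!doc) {
    // Cross-origin (or not yet loaded): only the frame element itself is visible.
    return { accessible: false, targets: [iframe] };
  }
  // Same-origin: the inner document can be queried like any other.
  return { accessible: true, targets: Array.from(doc.querySelectorAll('button, a, input')) };
}

// Stand-ins for real iframes (hypothetical) so the sketch runs under Node.
const crossOrigin = { src: 'https://payments.example/widget', contentDocument: null };
const sameOrigin = {
  src: '/local.html',
  contentDocument: { querySelectorAll: () => [{ tag: 'button', text: 'Pay' }] },
};

console.log(frameTargets(crossOrigin).accessible); // false
console.log(frameTargets(sameOrigin).targets.length); // 1
```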

Since you cannot "see" i.e. no access to screenshot, doesn't that mean you cannot act on images or icons or svg that do not have good label?

Comet: Correct. If an image, icon, or SVG does not have accessible labels, ARIA attributes, alt text, or useful DOM properties (such as a descriptive title or role), I cannot reliably identify or act on it. My ability to interact is limited to what can be detected in the DOM or accessibility tree.
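To see what "no good label" means in practice, here is a simplified name lookup. It is a sketch under my own assumptions: elements are plain objects with `attrs` and `text`, the function `accessibleName` is hypothetical, and the real accessible-name computation has more steps (for example `aria-labelledby`), which I leave out.

```javascript
// Simplified name lookup; checks a few common label sources in order.
function accessibleName(el) {
  const attrs = el.attrs || {};
  const candidates = [attrs['aria-label'], attrs.alt, attrs.title, el.text];
  for (const value of candidates) {
    if (typeof value === 'string' && value.trim() !== '') return value.trim();
  }
  // Nothing in the markup describes this element.
  return null;
}

const unlabeledIcon = { tag: 'img', attrs: { src: './svg1.svg' } };
const labeledIcon = { tag: 'img', attrs: { src: './svg1.svg', alt: 'cart' } };

console.log(accessibleName(unlabeledIcon)); // null
console.log(accessibleName(labeledIcon)); // cart
```

For the unlabeled icon, the only text an agent without vision has to go on is the file name `svg1.svg`, which says nothing about what the image shows.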

Testing Comet with simple HTML

Let’s verify Comet’s claim that it relies solely on accessibility features and labels, without using screenshots.

Simple HTML to demonstrate

<html>
<head>
  <script>
    function hclick(txt) {
      document.getElementById('res').innerText = 'Result:' + txt;
    }
  </script>
</head>
<body>
<div id="res">Result: </div>
IMG1
 <img onclick="hclick('img1');" src="./svg1.svg" style="width:50px;height:50px" />
IMG2
 <img onclick="hclick('img2');" src="./svg2.svg" style="width:50px;height:50px" />
</body>
</html>

Click on the cart

Comet: There is still no clickable “Cart” element detected on this page. The word “Cart:” might be present as plain text, but it is not an interactive button or link that can be clicked. If the cart is meant to be clickable, it should be implemented as a button or link in the HTML code.

It failed to click on the cart. If it could “see” the page, it would have been able to find the correct element.

Click on the bulb

Comet: There are no elements labeled “bulb” or displaying a bulb icon on this page, so nothing could be clicked. If you’re expecting a bulb icon or button, it may not be present or may not be labeled as such in the HTML.

Again, it failed to click on the bulb.

Let’s make the page more readable and add some labels:

<html>
<head>
  <script>
    function hclick(txt) {
      document.getElementById('res').innerText = 'Result:' + txt;
    }
  </script>
</head>
<body>
<div id="res">Result: </div>

Cart
 <img onclick="hclick('img1');" src="./svg1.svg" style="width:50px;height:50px" />
Bulb
 <img onclick="hclick('img2');" src="./svg2.svg" style="width:50px;height:50px" />
</body>
</html>

Click on cart

Comet: There is no clickable “Cart” element found on this page. The word “Cart” is present, but it is not implemented as an interactive button or link that can be clicked. If you want it to be clickable, it should be defined as a button or anchor tag in the HTML.

It failed again — despite the clickable image next to the Cart text, Comet couldn’t make the connection.

Let's be more explicit and add an alt attribute to each image. This makes it obvious in the HTML which image is the cart.

Cart
 <img alt="cart" onclick="hclick('svg1');" src="./svg1.svg" style="width:50px;height:50px" />
Bulb
 <img alt="bulb" onclick="hclick('svg2');" src="./svg2.svg" style="width:50px;height:50px" />

Comet: There is no clickable “cart” element found on the current page. If you would like me to try anything else or if the cart should be activated in another way, please let me know.

Failed again. Apparently it cannot tell that the img is clickable. It claimed to check the event handlers, but it failed to do so in this case.

Let's wrap the img in a button and see if it can find it now.

Cart
 <button>
  <img alt="cart" onclick="hclick('img1');" src="./svg1.svg" style="width:50px;height:50px" />
 </button>
Bulb
 <button>
  <img alt="bulb" onclick="hclick('img2');" src="./svg2.svg" style="width:50px;height:50px" />
 </button>

Click on the cart

Comet: The cart button was found and successfully clicked on your current page

Finally, it was able to act!
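One simple rule is consistent with all four outcomes: only native interactive tags count as clickable, and the element's name must match the request. The sketch below is my guess at such a rule, not Comet's actual logic. The node shape and the names `nameOf`, `findClickable`, and `variants` are hypothetical.

```javascript
const NATIVE_INTERACTIVE = new Set(['a', 'button', 'input', 'select', 'textarea']);

// Name of an element: its own aria-label/alt/text, else the names of its children.
function nameOf(el) {
  const attrs = el.attrs || {};
  const own = attrs['aria-label'] || attrs.alt || el.text;
  if (own) return own;
  return (el.children || []).map(nameOf).filter(Boolean).join(' ');
}

// Strict rule: only native interactive tags count, and the name must match.
// Inline onclick handlers and adjacent text nodes are deliberately ignored.
function findClickable(body, query) {
  return body.find(
    (el) => NATIVE_INTERACTIVE.has(el.tag) && nameOf(el).toLowerCase().includes(query)
  ) || null;
}

// Helper that builds the test page's clickable image with optional extra attributes.
const img = (attrs) => ({ tag: 'img', attrs: { onclick: "hclick('img1');", ...attrs } });

// The four versions of the cart markup tried above, as simplified trees.
const variants = {
  'IMG1 text, bare img': [{ tag: '#text', text: 'IMG1' }, img({})],
  'Cart text, bare img': [{ tag: '#text', text: 'Cart' }, img({})],
  'img with alt': [{ tag: '#text', text: 'Cart' }, img({ alt: 'cart' })],
  'img inside button': [{ tag: '#text', text: 'Cart' }, { tag: 'button', children: [img({ alt: 'cart' })] }],
};

for (const [name, body] of Object.entries(variants)) {
  console.log(name, '->', findClickable(body, 'cart') ? 'found' : 'not found');
}
```

Only the last variant is found: the button is a native interactive tag, and it takes its name from the alt text of the image inside it.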

Verdict:

Comet works well with well-structured applications, but it makes some basic assumptions. When apps don’t follow best practices, Comet can struggle or fail outright.

Contrasting with Harness AI Test Automation

Harness AI Test takes a bimodal approach: it looks at the screenshot while also generating a wireframe from the DOM and Accessibility tree.
This allows it to identify clickable, writable, and selectable elements regardless of how the developer implemented them, whether through tags like <button>, onclick handlers, tabIndex, a pointer cursor, or other variations.
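The DOM side of that idea can be sketched as a check that collects every hint that an element is clickable instead of trusting the tag alone. This is my own illustration, not Harness AI Test's implementation. Elements are assumed to be plain objects with `tag`, `attrs`, and `style`, and the name `clickableSignals` and both lookup sets are hypothetical.

```javascript
const CLICKABLE_TAGS = new Set(['a', 'button', 'input', 'select', 'summary']);
const CLICKABLE_ROLES = new Set(['button', 'link', 'checkbox', 'menuitem', 'tab']);

// Returns the list of reasons an element looks clickable (empty if none).
function clickableSignals(el) {
  const attrs = el.attrs || {};
  const style = el.style || {};
  const signals = [];
  if (CLICKABLE_TAGS.has(el.tag)) signals.push('native tag');
  if (CLICKABLE_ROLES.has(attrs.role)) signals.push('ARIA role');
  if ('onclick' in attrs) signals.push('onclick handler');
  if ('tabindex' in attrs && Number(attrs.tabindex) >= 0) signals.push('focusable via tabindex');
  if (style.cursor === 'pointer') signals.push('pointer cursor');
  return signals;
}

const examples = [
  { tag: 'img', attrs: { onclick: "hclick('img1');", src: './svg1.svg' } },
  { tag: 'div', attrs: { tabindex: '0' }, style: { cursor: 'pointer' } },
  { tag: 'button', attrs: {} },
  { tag: 'div', attrs: {} },
];

for (const el of examples) console.log(el.tag, clickableSignals(el));
```

The bare img from the test page is picked up through its onclick attribute even though it has no label. Handlers attached with addEventListener do not show up as attributes, so a check like this still misses them, which is one more reason to pair it with a screenshot.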

Let’s take a look at the original HTML with AI Test, which can truly see.

<html>
<head>
  <script>
    function hclick(txt) {
      document.getElementById('res').innerText = 'Result:' + txt;
    }
  </script>
</head>
<body>
<div id="res">Result: </div>

IMG1
 <img onclick="hclick('img1');" src="./svg1.svg" style="width:50px;height:50px" />
IMG2
 <img onclick="hclick('img2');" src="./svg2.svg" style="width:50px;height:50px" />
</body>
</html>

For both prompts (“Click on the cart” and “Click on the bulb”), AI Test located and clicked the appropriate image and verified that the expected outcome was achieved.

Conclusion

Real-world DOMs are messy, and accessibility is often weak. Even id uniqueness gets broken, almost anything can be made clickable, and custom tags are common. To build agents that truly match human behavior, we need multiple signals: screenshots, a condensed DOM, accessibility data, and practical knowledge of JS-driven behavior. That last piece comes from experience with legacy patterns (yes, even frameset).
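Broken id uniqueness is easy to check for, and it shows why a locator built on a single signal is fragile. The sketch below is a small illustration of my own, assuming a tree of plain `{ attrs, children }` objects. The function `duplicateIds` and the sample page are hypothetical.

```javascript
// Collects ids that occur more than once in a tree of { attrs, children } nodes.
function duplicateIds(root) {
  const counts = new Map();
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    const id = node.attrs && node.attrs.id;
    if (id) counts.set(id, (counts.get(id) || 0) + 1);
    stack.push(...(node.children || []));
  }
  // Keep only the ids seen at least twice.
  return [...counts].filter(([, n]) => n > 1).map(([id]) => id);
}

// Made-up page where two different buttons share the id "save".
const legacyPage = {
  tag: 'body',
  children: [
    { tag: 'form', attrs: { id: 'profile' }, children: [{ tag: 'button', attrs: { id: 'save' } }] },
    { tag: 'form', attrs: { id: 'billing' }, children: [{ tag: 'button', attrs: { id: 'save' } }] },
  ],
};

console.log(duplicateIds(legacyPage)); // [ 'save' ]
```

An agent that targets the element with id "save" has no way to tell from the id alone which of the two buttons is meant, so it needs position, surrounding text, or the screenshot to disambiguate.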

Comet’s choice to lean on accessibility and the DOM over screenshots looks like a technical and product trade-off: computer-use models are early, screenshot “decoration” can hurt consumer UX, speed and cost matter, and top consumer sites generally follow accessibility best practices — so it works well there. Enterprise apps are different: older stacks and inconsistent accessibility make a screenshot+vision (bimodal) approach better for accuracy, coverage, and repeatability across edge cases.

Bottom line: Comet is excellent where its assumptions hold, but the web is diverse. A bimodal path would broaden its reach.
