Harness AI achieved the #2 spot on the SWE-bench Verified leaderboard by autonomously solving real-world GitHub software issues, powering fast, reliable AI-assisted software delivery.
AI-assisted coding tools have dramatically increased the speed and quantity of code generation, yet software delivery remains a key bottleneck. While AI accelerates development, the real challenge lies in integrating, validating, and deploying code reliably. Without end-to-end automation, from code creation through delivery, the true impact of AI-generated code is delayed, highlighting the need for platforms that streamline the entire software lifecycle. That’s the problem we’re solving every day while building Harness AI.
That is why, today, we’re thrilled to announce that Harness AI’s code agent has climbed to #2 on the SWE‑Bench Verified leaderboard*, one of the most rigorous benchmarks for autonomous software engineering, taking us one step closer to building AI that is designed for everything after code generation.
*This blog post is part of our SWE‑Bench Verified submission. The final leaderboard screenshot will be updated once SWE‑Bench officially lists Harness AI on the public leaderboard.
SWE-bench is a benchmark for evaluating LLMs on real-world software issues collected from GitHub. SWE-bench Verified is its human-validated version. It tests AI’s real-world agentic coding skills, the kind required for coding tools like Cursor or Claude Code. Here’s the actual scenario:
500 real GitHub issues across production-grade Python repos. No hints. No scaffolding. Just the raw codebase and a single attempt.
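For a concrete sense of what the agent is given, here is a minimal sketch of loading one task from the public dataset. It assumes the Hugging Face `datasets` library and the published `princeton-nlp/SWE-bench_Verified` dataset; the field names shown come from that dataset and are not specific to our harness.

```python
# Minimal sketch: inspecting one SWE-bench Verified task instance.
# Assumes: pip install datasets
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Verified", split="test")
task = ds[0]

print(task["repo"])               # the production Python repository on GitHub
print(task["base_commit"])        # the commit the agent starts from
print(task["problem_statement"])  # the raw issue text -- the only specification
# The agent must produce a patch; hidden FAIL_TO_PASS tests decide success.
```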
To solve a task, an agent must:
- Understand the issue from its natural-language description
- Navigate an unfamiliar, production-scale codebase to locate the relevant code
- Implement a fix without breaking existing behavior
- Verify the change against the repository's own test suite
Harness AI did exactly that, and it did it better than almost every other AI out there!
At the heart of Harness AI is a clean, modular architecture that mimics how real developers work, but with the speed and precision of AI.
We use Claude 4 in Thinking Mode, letting it reason deeply, generate step-by-step strategies, and revise plans on the fly. Unlike traditional prompting, “Thinking Mode” creates an internal monologue (a scratchpad), where the agent can brainstorm ideas, evaluate outcomes, and course-correct before taking action. This drastically reduces hallucinations and brittle plans.
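As a rough illustration of how such a scratchpad can be obtained (a sketch, not our production setup), the Anthropic Python SDK exposes extended thinking as shown below; the model id, token budgets, and prompt are assumptions made for the example.

```python
# Sketch: enabling extended ("thinking") mode with the Anthropic Python SDK.
# The model id and budgets below are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-20250514",                     # assumed Claude 4 model id
    max_tokens=8192,                                      # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 4096},  # the internal scratchpad
    messages=[{
        "role": "user",
        "content": "Plan a fix for the failing date-parsing test, step by step.",
    }],
)

# Thinking blocks (the scratchpad) arrive alongside the final answer.
for block in response.content:
    if block.type == "thinking":
        print("[scratchpad]", block.thinking[:200], "...")
    elif block.type == "text":
        print("[answer]", block.text)
```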
SWE‑Bench doesn’t require fancy or complex tools. In most cases, simple file editing, shell execution, and structured planning are enough to solve tasks effectively.
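To make that concrete, here is a hedged sketch of what such a minimal tool surface could look like as Anthropic-style tool definitions; the tool names and parameters are illustrative, not our actual tool set.

```python
# Illustrative tool definitions (Anthropic tool-use schema); names are hypothetical.
TOOLS = [
    {
        "name": "edit_file",
        "description": "Replace an exact text snippet in a file with new content.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "old_text": {"type": "string"},
                "new_text": {"type": "string"},
            },
            "required": ["path", "old_text", "new_text"],
        },
    },
    {
        "name": "run_shell",
        "description": "Run a shell command (e.g. pytest or grep) and return its output.",
        "input_schema": {
            "type": "object",
            "properties": {"command": {"type": "string"}},
            "required": ["command"],
        },
    },
]
```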
All tool usage is guarded with robust error handling, intelligent fallbacks, and timeouts. If a tool fails (e.g., a test command hangs or a grep query returns nothing), the agent doesn’t break; it adapts.
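The sketch below shows the shape of that guarding for a shell tool; the specific timeout and fallback messages are assumptions made for illustration.

```python
# Sketch: a guarded shell tool -- hard timeout, caught errors, and a fallback
# signal instead of a crash when a command hangs or returns nothing.
import subprocess

def run_shell(command: str, timeout: int = 120) -> str:
    """Run a command with a hard timeout; never raise into the agent loop."""
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True, timeout=timeout
        )
        output = (result.stdout + result.stderr).strip()
        if not output:
            # Empty output (e.g. a grep with no matches) is a signal, not a failure.
            return "NO_OUTPUT: the command produced nothing; try a broader query."
        return output
    except subprocess.TimeoutExpired:
        # A hanging test command is reported back so the agent can adapt its plan.
        return f"TIMEOUT: '{command}' exceeded {timeout}s; try a narrower test run."
    except Exception as exc:  # any other tool failure is surfaced, not fatal
        return f"TOOL_ERROR: {exc}"
```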
While we are proud of the result, we see it more as a signal: AI agents are no longer limited to toy coding problems or compiler tricks. They can now read real codebases, understand their architecture, fix bugs, and prove that the fixes work.
We built Harness AI to tackle SWE‑Bench with precision, using just a few well-defined tools and sub-agents. That’s all the benchmark needed.
But enterprise-grade engineering problems go far beyond that. In real-world environments, AI agents must reason across systems, interact with external services, and collaborate with developers in dynamic workflows. That’s where Harness AI truly excels, with a scalable architecture, intelligent sub-agents, and advanced tools built for the complexity of modern software delivery.
Learn more about Harness AI’s capabilities across the SDLC.