December 31, 2024

DevOps Meets AI: Evaluating the Performance of Leading LLMs


Modern DevOps processes are essential for ensuring efficient, reliable, and scalable software delivery. However, managing infrastructure, CI/CD pipelines, monitoring, and incident response remains a complex and time-consuming challenge for many organizations. These tasks require continuous tuning, configuration management, and rapid troubleshooting, making DevOps resource-intensive. As software systems grow in complexity, manual intervention becomes a bottleneck, increasing the risk of human error, inefficiencies, and slower deployments. This is where automation becomes a necessity, helping teams streamline workflows, reduce operational overhead, and improve deployment velocity.

The rise of artificial intelligence, particularly large language models (LLMs), has opened new possibilities for automating various aspects of software development and operations. By leveraging AI, organizations can enhance efficiency, reduce manual effort, and accelerate software delivery. LLMs bring the potential to transform DevOps by enabling intelligent automation, improving decision-making, and making systems more adaptive to changing requirements.

Our AI engineering team has been at the forefront of integrating AI into DevOps workflows. From AI-powered CI/CD optimizations to intelligent deployment strategies, we continuously explore ways to leverage AI for greater efficiency. In this blog, we share our journey in evaluating LLMs for DevOps automation, benchmarking their performance, and understanding their impact on software delivery workflows.

Harnessing LLMs for DevOps Automation

Before diving into the evaluation, let’s first outline the specific problem we aim to solve using large language models. (Note: In this post, I won’t go into the underlying architecture of the Harness AI DevOps Agent — stay tuned for a future blog post on that!)

Our exploration begins with the task of pipeline generation. Specifically, the AI DevOps Agent takes as input a user command describing the desired pipeline, along with relevant context information. The expected output is a pipeline YAML file generated by the AI DevOps Agent (itself composed of multiple sub-agents), automating the configuration process and streamlining DevOps workflows. An example user command and the resulting YAML pipeline would be:

“Create an IACM pipeline to do an IACM init and plan”

Response:

yaml

pipeline:


For simplicity, we focused the first phase of our evaluations on generating a single step of the pipeline. We also explored two different solution designs for utilizing LLMs:

  1. Direct Single LLM Calls: In this approach, we send the user command along with the relevant context (e.g., stage type, pipeline schema) in a single request to the LLM under evaluation.
  2. Agentic Framework Approach: This approach leverages an agentic framework to distribute sub-tasks, such as context generation, schema verification, and step generation, among multiple AI agents. We implemented this framework using AutoGen. (A minimal sketch of both designs follows this list.)
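
To make the distinction concrete, here is a minimal Python sketch of the two designs. It assumes an OpenAI-compatible client for the direct call and the open-source AutoGen (pyautogen) API for the agentic variant; the prompts, agent roles, and context string are illustrative placeholders rather than the actual Harness implementation.

python

# Minimal sketch of the two solution designs. Prompts, agent roles, and the
# context string are placeholders, not the Harness AI DevOps Agent itself.
# API keys are assumed to be provided via the environment.
from openai import OpenAI
import autogen

USER_COMMAND = "Please add a Terraform plan step to the pipeline."
CONTEXT = "stage type, pipeline schema, ..."  # placeholder for the real context

# 1. Direct single LLM call: the command and context go out in one request.
def direct_call() -> str:
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You generate Harness pipeline step YAML."},
            {"role": "user", "content": f"{USER_COMMAND}\n\nContext:\n{CONTEXT}"},
        ],
    )
    return response.choices[0].message.content

# 2. Agentic approach: sub-tasks are split across cooperating agents.
def agentic_call() -> None:
    llm_config = {"config_list": [{"model": "gpt-4o"}]}
    generator = autogen.AssistantAgent(
        name="step_generator",
        system_message="Draft the pipeline step YAML for the user's request.",
        llm_config=llm_config,
    )
    verifier = autogen.AssistantAgent(
        name="schema_verifier",
        system_message="Validate the drafted YAML against the step schema and request fixes.",
        llm_config=llm_config,
    )
    user_proxy = autogen.UserProxyAgent(
        name="user", human_input_mode="NEVER", code_execution_config=False
    )
    chat = autogen.GroupChat(agents=[user_proxy, generator, verifier], messages=[], max_round=6)
    manager = autogen.GroupChatManager(groupchat=chat, llm_config=llm_config)
    user_proxy.initiate_chat(manager, message=f"{USER_COMMAND}\n\nContext:\n{CONTEXT}")

In the agentic variant, a dedicated verification agent can reject malformed YAML before a result is returned, which is the behavior reflected in the schema-failure results later in this post.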

Performance Metrics: How We Measure Success

In this blog post, we focus on the generation use case — specifically, creating pipeline steps, stages, and related configurations — and introduce the metrics used to evaluate the performance of different models for this task. Our evaluations are conducted against a benchmark dataset with a known ground truth. Specifically, we have curated a dataset consisting of user commands for creating pipeline steps and their corresponding YAML configurations. Using this benchmark data, we have developed a set of metrics to assess the quality of AI-generated YAML outputs in response to user prompts.

Since we are evaluating AI-generated pipelines against known, predefined pipelines, the comparison ultimately involves measuring the differences between two YAML files. To accomplish this, we leverage and build upon DeepDiff, a framework for computing the structural differences between key-value objects. DeepDiff is conceptually inspired by Levenshtein Edit Distance, making it well-suited for quantifying variations between YAML configurations and assessing how closely the generated output matches the expected pipeline definition.

At its core, DeepDiff quantifies the difference between two objects by determining the number of operations required to transform one into the other. This difference is then normalized to produce a similarity score between 0 and 1, providing a structured way to compare data. While we utilize the standard DeepDiff library as one of our evaluation metrics, we have also developed two modified versions tailored specifically for comparing step YAMLs. These adaptations address the unique challenges of our use case, ensuring a more precise and meaningful assessment of AI-generated pipeline configurations.
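
As a rough illustration, the snippet below computes such a normalized similarity with the off-the-shelf DeepDiff library, using its built-in deep-distance option; treat it as a minimal sketch rather than the exact production metric.

python

# Minimal sketch: normalized similarity between a generated step YAML and the
# ground truth, using DeepDiff's deep distance (a normalized value in [0, 1]).
import yaml
from deepdiff import DeepDiff

def step_similarity(generated_yaml: str, reference_yaml: str) -> float:
    generated = yaml.safe_load(generated_yaml)
    reference = yaml.safe_load(reference_yaml)
    diff = DeepDiff(reference, generated, ignore_order=True, get_deep_distance=True)
    distance = diff.get("deep_distance", 0.0)  # 0.0 when the two objects are identical
    return 1.0 - distance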

In particular, we have introduced:

  • DeepDiff 2: This metric first applies schema verification before computing the similarity score, assigning a score of zero if the generated YAML fails validation. Additionally, it does not penalize differences in optional fields such as name, identifier, and description, ensuring that minor variations do not disproportionately impact the similarity score. Moreover, as long as the generated solution adheres to schema validation, this metric allows additional keys in the step without penalizing the score.
  • DeepDiff 3: This metric builds upon DeepDiff 2 but introduces a penalty for any additional key that does not exist in the reference solution. This stricter approach provides a more precise comparison to the ground truth, considering that extra keys with default values may impact the user experience. Users may not expect to see default values for optional fields in the UI, making it essential to account for such differences in evaluation. (A rough sketch of both variants follows this list.)
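
The sketch below captures the spirit of the two variants. The schema validator, the exclusion patterns for optional fields, and the leaf-count normalization are assumptions made for illustration; they stand in for the real verification and normalization logic, which is not shown here.

python

# Illustrative sketch of "DeepDiff 2"- and "DeepDiff 3"-style scoring.
# `validate_step_schema` is a hypothetical validator supplied by the caller.
import re
from typing import Callable
from deepdiff import DeepDiff

OPTIONAL_FIELDS = ("name", "identifier", "description")

def count_leaves(obj) -> int:
    """Number of scalar leaves in a nested dict/list structure (used to normalize)."""
    if isinstance(obj, dict):
        return sum(count_leaves(v) for v in obj.values()) or 1
    if isinstance(obj, list):
        return sum(count_leaves(v) for v in obj) or 1
    return 1

def diff_score(generated: dict, reference: dict,
               validate_step_schema: Callable[[dict], bool],
               penalize_extra_keys: bool) -> float:
    # Both variants assign a score of zero when schema validation fails.
    if not validate_step_schema(generated):
        return 0.0
    diff = DeepDiff(
        reference, generated,
        ignore_order=True,
        # Differences in optional, cosmetic fields are not penalized.
        exclude_regex_paths=[rf"\['{re.escape(field)}'\]" for field in OPTIONAL_FIELDS],
    )
    penalized = dict(diff)
    if not penalize_extra_keys:
        # DeepDiff 2: keys that exist only in the generated step are tolerated.
        penalized.pop("dictionary_item_added", None)
    changes = sum(len(v) for v in penalized.values())
    return max(0.0, 1.0 - changes / count_leaves(reference))

# DeepDiff 2: diff_score(generated, reference, validator, penalize_extra_keys=False)
# DeepDiff 3: diff_score(generated, reference, validator, penalize_extra_keys=True)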

Benchmarking LLMs: Evaluating the Leading Models

Benchmark Dataset

Let’s first introduce the benchmark data used for this study.

At Harness, our QA team generates numerous sample pipelines using automation tools such as APIs and Terraform Providers to simulate customer use cases and various Harness configurations. These pipelines play a crucial role in sanity testing, ensuring that when a new version of Harness is released, all steps, stages, and pipelines continue to function as expected.

For this study, we leveraged this data to create a benchmark dataset of 115 step YAMLs. For each example, we manually added a potential user command that could generate the corresponding step. The same user command was then used to generate a step YAML using an LLM. The AI-generated solutions were subsequently compared against the original YAML file to evaluate accuracy and quality.

Below is an example of a user command and its corresponding YAML file, which serves as the ground truth in our evaluation:

User Command: “Please add a Terraform plan step to the pipeline.”

Ground Truth YAML:

yaml

-


This YAML structure represents the expected output when an LLM generates a pipeline step based on the given user command. The AI-generated YAML will be evaluated against this reference to assess its accuracy and quality.
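
Putting the pieces together, the evaluation loop itself can be sketched as follows. The record format, the generate_step function, and the scoring helpers are hypothetical placeholders standing in for the direct-call or agentic generator and the three metrics described above.

python

# Simplified sketch of the benchmark loop. `generate_step` and the scoring
# helpers are hypothetical placeholders; the record below mirrors one of the
# 115 benchmark examples (the ground-truth YAML is elided here).
import yaml

benchmark = [
    {"user_command": "Please add a Terraform plan step to the pipeline.",
     "ground_truth_yaml": "..."},  # placeholder; a real record holds the full step YAML
]

def evaluate(generate_step, score_deepdiff, score_deepdiff2, score_deepdiff3):
    results, schema_failures = [], 0
    for record in benchmark:
        generated = yaml.safe_load(generate_step(record["user_command"]))
        reference = yaml.safe_load(record["ground_truth_yaml"])
        scores = {
            "deepdiff": score_deepdiff(generated, reference),
            "deepdiff2": score_deepdiff2(generated, reference),
            "deepdiff3": score_deepdiff3(generated, reference),
        }
        if scores["deepdiff2"] == 0.0:  # zero is assigned when schema validation fails
            schema_failures += 1
        results.append(scores)
    return results, schema_failures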

Models Compared

We evaluated both an agentic framework and direct model calls for utilizing LLMs in pipeline generation. The selection of models for each approach was constrained by the compatibility of the frameworks we used. For example, AutoGen supports only a limited set of LLMs, which influenced our model choices for the agentic framework.

As a result, there isn’t a one-to-one correspondence between the models used in the agentic framework and those used in direct calls. However, there is significant overlap between the two sets.

Agentic Framework: Models operating within an agent-driven setup

  • GPT-4o
  • O3-mini-medium
  • Claude-3.7

Direct Model Calls: Models queried directly without an agentic framework

  • GPT-4o
  • O3-mini-medium
  • Claude-3.7
  • DeepSeek R1
  • DeepSeek V3

This comparison allows us to assess how different models and methodologies perform in generating high-quality DevOps pipeline configurations.

Results

The figure below illustrates the performance of each model based on the three evaluation metrics introduced earlier. Models that are called using an agentic framework are prefixed with “Autogen_” in the results.

Our findings indicate that using an agentic framework significantly improves response quality across all three metrics. However, AutoGen does not yet support DeepSeek models, so for these models, we only report their performance when called directly.

LLM Performance Comparison for Pipeline Step Generation


To gain deeper insight into the scores, we also visualize the number of samples that failed the schema verification step (such cases are assigned a score of zero). This highlights instances where models struggle to generate valid YAML structures:

Schema Verification Failures Across Models


The plot above clearly demonstrates the effectiveness of an agentic framework with a dedicated schema verification agent. Notably, none of the models within the agentic framework produced outputs that failed schema validation.

Takeaways

Our evaluation of LLMs for DevOps automation provided valuable insights into their strengths, limitations, and practical applications. Below are some key takeaways:

  • LLMs demonstrate strong potential for automating DevOps workflows, particularly in generating pipeline YAMLs from user commands — achieving a pass rate of over 95% for the best models. This reduces manual effort, increases efficiency, and streamlines software delivery.
  • Leveraging an agentic framework that breaks tasks into smaller sub-tasks and distributes them among sub-agents significantly improves accuracy. This approach reduces schema verification failures and minimizes model hallucinations, leading to more reliable and structured pipeline generation.
Bashir Rastegarpanah

He is an applied scientist with hands-on experience building data-driven solutions in natural language processing, genomics, recommender systems, and AI observability. He has led research projects in both startup environments and academia. In recent years, his work has focused on developing tools and methodologies to enhance the safety, reliability, and trustworthiness of AI systems — including generative machine learning models and large language models.
