Enterprises considering how to take advantage of AI innovations could take a cue from Jyoti Bansal. Over the last fifteen years, he has been exploring different ways of instrumenting and optimizing software infrastructure using advanced analytics. The resulting digital twins of complex software environments help teams discern hidden patterns, ask better questions, and make more meaningful decisions.
Bansal founded AppDynamics in 2008 to address the then-new field of Application Performance Monitoring; Cisco acquired the company in 2017 for $3.7 billion. The core idea was to design better software agents and log analytics to listen in on communications across applications and summarize them for various workflows. In 2017 he founded Harness, which uses different kinds of agents to automate and simplify software delivery processes. In 2020 he launched Traceable, which also uses agents to identify and remediate API security issues.
Long before Large Language Models (LLMs) and ChatGPT were a thing, Harness used other types of AI to improve continuous verification and test intelligence. Continuous verification helps teams speed up software development lifecycles while reducing the impact of problems. Test intelligence helps teams shortlist the most appropriate tests for a code update by identifying the things most likely to break. He explains:
When we launched Harness as a company five years ago, we were the first one to bring AI to DevOps and this continuous verification technology. Now AI is everywhere, right? It wasn't using LLMs at that time, but it was using a lot of very sophisticated AI models, like neural nets, to learn the normal behavior of code.
Harness is now introducing a new LLM co-pilot to improve the user interface and streamline the governance of IT, security, and cost management. He acknowledges that hallucinations are an important issue to address. Bansal says:
When we look at LLMs, it’s a natural extension of bringing AI to DevOps, and we can use it for many other things.
Enterprises at the cutting edge are increasingly provisioning their infrastructure as code using tools like HashiCorp Terraform, which Martin Banks wrote about recently. In this workflow, the business writes a spec, developers code the functionality, QA writes the tests, and operations writes the infrastructure code that pushes everything out into deployment.
The tests in the middle are informed by the nature of the code and the infrastructure it will run on. These can include security, performance, integration, and load testing. The operations code for provisioning the infrastructure needs to scale up new releases slowly to identify and mitigate any problems, then facilitate a wider rollout once a release has been vetted.
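To make that rollout step concrete, here is a minimal sketch of the progressive-rollout logic in Python. The helper names, traffic percentages, and error threshold are illustrative assumptions, not any particular tool's API:

```python
import time

# Hypothetical helpers standing in for a service mesh or deployment API
# and a metrics store; names and thresholds below are assumptions.
def route_traffic_to_canary(percent: int) -> None:
    print(f"Routing {percent}% of traffic to the new release")

def canary_error_rate() -> float:
    return 0.002  # placeholder metric read

def rollback() -> None:
    print("Rolling back the new release")

ERROR_THRESHOLD = 0.01           # assumed acceptable error rate
ROLLOUT_STEPS = [1, 5, 25, 100]  # assumed traffic percentages

for percent in ROLLOUT_STEPS:
    route_traffic_to_canary(percent)
    time.sleep(1)  # in practice, soak for minutes or hours per step
    if canary_error_rate() > ERROR_THRESHOLD:
        rollback()
        break
else:
    print("Release fully rolled out")
```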
Coding the software is sometimes seen as the main event, but it’s dependent on all these other processes, which is where Harness decided to focus. Bansal explains:
Writing code is an important part, but that's not really the only part. So, we look at how we can use AI to improve the rest of the pieces as well. This includes things like how you run your CI/CD pipeline, how you debug problems, how you fix vulnerabilities, and how you auto-generate tests. You can make all those workflows thirty to fifty percent more efficient, and that’s why we launched this AI Development Assistant called AIDA.
It’s important to note that the new AI assistant was not built in a vacuum. It draws on Harness’s deep domain expertise from analyzing millions of software and infrastructure code changes: how they break, how they introduce vulnerabilities, and how they affect costs.
One essential component is continuous verification, an evolution of Continuous Integration and Continuous Deployment (CI/CD). These processes streamline the workflow of pushing out code updates, sometimes several times a day, and rolling them back when things go wrong. Continuous verification rolls an update out to a small set of users before scaling up, reducing the risk of breaking things for everyone. Here, AI and machine learning models can help identify deviations that may signify a problem. Bansal notes:
Everyone wants to ship very fast, but the number one reason they cannot is because you don't know if you will break something. And that process can take a long time. AI can really help in learning if an update is going to break something.
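One way to picture what the models are checking is a simplified comparison of canary metrics against a learned baseline. The latency numbers and the basic three-sigma rule below are made-up illustrations, not Harness's actual algorithm:

```python
from statistics import mean, stdev

# Latency samples learned from the currently deployed version (assumed values).
baseline_latency_ms = [102, 98, 105, 101, 99, 103, 100, 97]

# Latency samples observed from the canary running the new code (assumed values).
canary_latency_ms = [110, 240, 236, 228, 215, 251]

mu, sigma = mean(baseline_latency_ms), stdev(baseline_latency_ms)

# Flag the canary if its average strays more than three standard deviations
# from the baseline, a crude stand-in for a learned model of normal behavior.
deviation = abs(mean(canary_latency_ms) - mu)
if deviation > 3 * sigma:
    print(f"Deviation of {deviation:.1f} ms: fail verification and roll back")
else:
    print("Canary is consistent with the baseline: continue the rollout")
```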
Test intelligence builds a model of the dependencies across software components that maps code changes to the most appropriate set of tests. QA teams might create a library of five thousand tests for a large code base, but the majority may not be relevant to a small update or addition. In this case, AI and ML models associate previous updates with the problems they caused to shortlist the four hundred tests most likely to surface issues and determine the order in which they should run.
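Here is a toy sketch of that selection step. The file names, dependency map, and failure-history heuristic are illustrative assumptions rather than Harness's algorithm:

```python
# Map from source module to the tests known to exercise it, typically built
# from dependency analysis and the results of past test runs (assumed data).
tests_by_module = {
    "billing/invoice.py": ["test_invoice_totals", "test_invoice_rounding"],
    "billing/tax.py": ["test_tax_rates", "test_invoice_totals"],
    "auth/login.py": ["test_login", "test_session_expiry"],
}

# Historical failure counts, used to run the riskiest tests first.
failure_history = {"test_invoice_totals": 7, "test_tax_rates": 2,
                   "test_invoice_rounding": 1}

def select_tests(changed_files):
    selected = set()
    for path in changed_files:
        selected.update(tests_by_module.get(path, []))
    # Order by how often each test has caught problems before.
    return sorted(selected, key=lambda t: -failure_history.get(t, 0))

print(select_tests(["billing/invoice.py", "billing/tax.py"]))
# ['test_invoice_totals', 'test_tax_rates', 'test_invoice_rounding']
```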
The end result is that developers may spend only 10 minutes waiting for their update to go through rather than two hours. This matters because developers generally don’t want to switch to another task while a build is running. Bansal explains:
Doing something else is a hard mental switch. You just made a code change and now want to know if it is going to work or not. You want to finish the task before you do something else. When you have to run all the tests, that can mean two hours of wait. That’s why you see a lot of ping pong tables in engineering offices, because people are waiting for builds to complete and there are too many of them.
Over the years, Harness has built up a large collection of data about the characteristics of software, test, and infrastructure code and organized it into various forms for analysis using graph, object, streaming, and vector databases. The latter is a newer structure in which raw data is transformed into embeddings, an intermediate numerical representation that makes it easier for LLMs to retrieve relevant context.
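As a rough illustration of how a vector store supports this, consider the toy sketch below. The hand-made three-dimensional vectors are stand-ins; real systems use learned embeddings with hundreds of dimensions and a dedicated vector database:

```python
import math

# Toy embeddings: each snippet of operational history is represented as a
# small vector; in practice these come from an embedding model.
documents = {
    "rollback after checkout latency spike": [0.9, 0.1, 0.3],
    "YAML policy blocking public storage buckets": [0.1, 0.8, 0.2],
    "flaky integration test in billing service": [0.4, 0.2, 0.9],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Stand-in for the embedding of the question "why did checkout slow down?"
query = [0.85, 0.15, 0.35]

best = max(documents, key=lambda d: cosine_similarity(documents[d], query))
print("Most relevant context to hand to the LLM:", best)
```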
Bansal says the most exciting thing about LLMs is the potential for a simpler user interface. For example, it could simplify the process of generating complex tests for things like chaos engineering, which involves carefully breaking various kinds of infrastructure in a sandbox to identify problems. He explains:
If you can use the natural language interface to create the chaos experiment, it reduces the time and cognitive burden for the developer to learn the format of writing the experiment. LLMs are very powerful in creating this friendly natural language interface, or assisting developers to configure things when learning configuration formats and languages, which is a burden otherwise.
Similarly, it could also help teams author policies for controlling cloud costs, security governance, and quality governance to improve risk management. Previously, many of these policies were hand-written in configuration languages like YAML, which could take hours to set up. Done right, an LLM could help generate the appropriate configuration in a few minutes.
Bansal goes on:
If you can just say that in English natural language and do some LLM interface, then it can generate the right kind of configuration for you in minutes. That's where we see a lot of power for LLMs because there are so many tasks that need to be done in security, governance, compliance, debugging, and troubleshooting, so you can have an assistant to help you through. You don’t have to go and learn the syntax and configuration formats for so many things.
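A minimal sketch of that pattern might look like the following; the `call_llm` function, the prompt, and the policy format are hypothetical placeholders, not Harness's AIDA interface:

```python
def call_llm(prompt: str) -> str:
    """Placeholder for a call to a hosted LLM; here it returns canned YAML."""
    return (
        "policy:\n"
        "  name: limit-monthly-cloud-spend\n"
        "  resource: cloud-cost\n"
        "  rule: monthly_spend_usd <= 50000\n"
        "  action: alert-finance-team\n"
    )

# The developer states the intent in plain English; the assistant produces
# the configuration that would otherwise be written by hand in YAML.
request = "Alert the finance team if monthly cloud spend goes above $50,000."
prompt = f"Generate a cost-governance policy in YAML for: {request}"
print(call_llm(prompt))
```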
Each use case needs to be trained for differently. LLMs for code creation, policy creation, and test generation all require different training, and different algorithms are better suited to each.
In an effort like this, it’s essential to reduce the risk of hallucinations in automatically generated scripts that play a role in provisioning infrastructure. It’s bad enough when AI makes up citations. But it could be catastrophic if a test bot went mad, spinning up endless instances and incurring a large cloud bill.
Bansal said they are putting several layers of controls in place to reduce these risks. One layer structures the inputs into the LLM prompts. A second checks that the results are valid. It is also important to hone the data used to train these models.
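A minimal sketch of that second layer, assuming a simple allow-list schema for generated cost policies (the keys and format are illustrative, not Harness's actual checks):

```python
import yaml  # PyYAML, used only to confirm the generated text parses

# Keys a generated cost policy is allowed to contain (assumed schema).
ALLOWED_KEYS = {"name", "resource", "rule", "action"}

def validate_generated_policy(text: str) -> bool:
    """Reject output that is not valid YAML or that strays outside the schema."""
    try:
        doc = yaml.safe_load(text)
    except yaml.YAMLError:
        return False
    policy = doc.get("policy") if isinstance(doc, dict) else None
    if not isinstance(policy, dict):
        return False
    return set(policy) <= ALLOWED_KEYS

generated = (
    "policy:\n"
    "  name: limit-monthly-cloud-spend\n"
    "  resource: cloud-cost\n"
    "  rule: monthly_spend_usd <= 50000\n"
    "  action: alert-finance-team\n"
)
print(validate_generated_policy(generated))  # True: safe to surface to the developer
```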
A general-purpose LLM like ChatGPT may create policies that are not even syntactically valid because it was trained on more general information from across the web. A policy engine for AWS would need to be trained on valid AWS policies to ensure those are the only things that come out. Bansal explains:
There's a lot of work that needs to be done around the LLM to make it possible. I would say anyone claiming 100% accuracy is claiming too much. If you can get to high 90% accuracy, that’s the most you will likely get. But that’s also the reason why it's important for everyone to think of these as not a replacement for the developer on any of these jobs. It's an assistant to the developer. So, the developer still has to take a look and make sure this is right.
There are also cost tradeoffs to consider. Bansal says:
It’s an assistant to the developer to do 40%-50% of the job. It’s a productivity boost and not a replacement for you. Pushing for 100% accuracy may not be the right goal. Asking the AI a question might be so expensive that it costs you like $200 to ask the question, and it may not be worth it. It would make more sense if you want to ask the question at $2 and get to 98% accuracy and still have the developer involved. Maybe over time, the cost will come down so much you can ask the question, which is 100% accurate, but that’s not the case now.
There are a few things that stand out for me. First is the importance of building a foundation for bringing different types of data together to answer new questions across various disciplines, in this case business, coding, QA, and operations. In some respects, this is analogous to the digital twins enterprises are building in other domains, but applied to software infrastructure.
Second is the cautious approach Harness is taking to reduce hallucinations and mitigate their impact when they do arise. As Bansal suggested, a cost-benefit analysis will be required to strike the right balance between delivering the greatest benefit and not breaking the bank.