It wasn’t exactly the holiday dinner I hoped for.
At my first software company, I was the unlucky first responder when an outage struck our platform…on Thanksgiving. Holiday viewing had pushed traffic way up at Netflix, our biggest customer, leaving our monitoring system unable to handle the extra demand. Reluctant to disturb my teammates, I took it upon myself to get to the root of the problem. Four hours later, the meal was finished, but I was still hunched over my laptop. In total, we spent at least two days figuring out what went wrong and fixing it.
In the intervening 15 years, not much has changed in incident response. Software incidents remain a menace. And for engineers, dealing with them is still a huge headache: reactive, manual, and time-consuming. These outages play a major role in burnout, which more than half of developers cite as a reason their peers quit, Harness research found.
But AI is poised to change that, and hopefully spare engineers a lot of late nights and skipped meals. Here’s why incident response is such a hassle, and how new agentic AI is putting an end to that toil and frustration.
Stuck in the past: A serious problem with ripple effects
Right now, when a software incident strikes, it’s like a house is on fire. And every minute matters. The goal isn’t merely to put out the fire, but to mitigate the impact of the incident and make sure that the customer experience doesn’t suffer.
First, the on-call engineer gets an alert, putting them in the hot seat. Logging into monitoring systems, they pore over charts and graphs to identify possible root causes, most of which turn out to be dead ends. Inevitably, other experts from the development team are brought into the loop, the response swelling into group Slack threads or Zoom calls.
The whole process is messy and unpredictable. Often, the fix is the easy part; pinpointing and diagnosing the problem is what’s tricky. Meanwhile, the impact of slow, inefficient incident response can be devastating. The July 2024 CrowdStrike meltdown, for instance, saw a faulty software update crash 8.5 million devices worldwide, plunging airlines, banks, retailers, hospitals, and many other essential businesses into chaos.
Nor are these events rare. In a recent survey, 55 percent of companies said they experienced IT disruptions at least once weekly, with almost half of these resulting in downtime for at least two hours. Overall, one in three businesses lost $100,000 to $1 million or more per outage.
Then there’s the impact on developers and the dev experience. No one likes to get woken up in the middle of the night or have to spend the weekend on call. And when an engineer is in the midst of a task, it’s challenging to context-switch to troubleshoot an incident.
Bring incident response into the AI era
While AI is being used extensively for coding, it’s still novel for incident response. Historically, dev teams have been reluctant to introduce automation into a process that can be frustratingly ad hoc. But in many ways, incident response makes a perfect use case for AI—and agentic AI in particular—since it requires always-on capabilities, a deep knowledge base and repeated cycles to uncover root causes.
Let’s say several users report issues with publishing on a company’s website. Recognizing that these reports could point to a bigger problem, agentic AI can alert the on-call team to a possible incident, giving responders a head start before things spiral out of control.
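To make the early-warning idea concrete, here is a minimal sketch of how a cluster of user reports might be turned into an alert. It is only an illustration: the `ReportBurstDetector` class, the threshold of three reports, and the 10-minute window are all hypothetical choices, not a description of any particular product, and the sketch assumes reports arrive in time order.

```python
from collections import deque
from datetime import datetime, timedelta


class ReportBurstDetector:
    """Flag a possible incident when several reports about the same
    feature arrive within a short time window (hypothetical rule)."""

    def __init__(self, threshold=3, window=timedelta(minutes=10)):
        self.threshold = threshold  # assumed: reports needed to raise a flag
        self.window = window        # assumed: how far back reports still count
        self._reports = {}          # feature name -> deque of report timestamps

    def add_report(self, feature, when):
        """Record one report; return True if the on-call team should be alerted."""
        recent = self._reports.setdefault(feature, deque())
        recent.append(when)
        # Discard reports that have fallen out of the window.
        while recent and when - recent[0] > self.window:
            recent.popleft()
        return len(recent) >= self.threshold


detector = ReportBurstDetector()
start = datetime(2024, 1, 1, 12, 0)
for minutes in (0, 2, 5):
    alert = detector.add_report("publishing", start + timedelta(minutes=minutes))
print(alert)  # the third report inside ten minutes raises the flag
```

A real agent would weigh far richer signals than a simple count, but the shape is the same: watch continuously, and escalate when scattered complaints start to look like a pattern.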
Traditionally, diagnostics—figuring out what went wrong—has been the most tedious step in incident response, requiring experienced devs to cycle through potential root causes and eliminate them one by one. But agentic AI can quickly handle detective work that might take engineers hours, or even days. Ideally, it also integrates institutional knowledge from past Zoom calls, Slack messages, and other communication into its knowledge base.
Ultimately, AI will settle on one or more likely root causes. For instance, the latest software deployment may have introduced conflicting publishing permissions, leading to user errors.
Importantly, human engineers still have a critical role to play at this stage. Rather than prescribe a solution, AI can guide the ensuing investigation by flagging recent code changes that might be problematic or system logs that warrant a closer look. Engineers can use those AI-powered insights to do further research and decide on the best course of action, such as rolling back the code to the previous stable version or applying a patch.
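As a rough illustration of what "flagging recent code changes" could look like, the sketch below shortlists deployments that landed shortly before an incident began and puts changes to the affected service first. The `Change` record, the `flag_suspect_changes` function, and the 24-hour lookback are assumptions made for this example; the output is a reading list for engineers, not a verdict.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    change_id: str        # hypothetical identifier for a deployment
    service: str          # service the change was deployed to
    deployed_at: datetime


def flag_suspect_changes(changes, incident_start, affected_service,
                         lookback=timedelta(hours=24)):
    """Return changes deployed within `lookback` before the incident,
    with the affected service first and the most recent changes first."""
    candidates = [
        c for c in changes
        if timedelta(0) <= incident_start - c.deployed_at <= lookback
    ]
    return sorted(
        candidates,
        key=lambda c: (c.service != affected_service,
                       incident_start - c.deployed_at),
    )


incident = datetime(2024, 1, 1, 12, 0)
history = [
    Change("a", "search", incident - timedelta(minutes=10)),
    Change("b", "publishing", incident - timedelta(hours=2)),
    Change("c", "publishing", incident - timedelta(days=3)),
]
for change in flag_suspect_changes(history, incident, "publishing"):
    print(change.change_id, change.service)
```

The decision to roll back or patch stays with the engineer; the ranking only narrows where to look first.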
The process doesn’t end here. Ideally, agentic AI can learn from the experience: documenting key events and actions taken and filing away insights for future incidents. (Translation: Next time, you don’t have to wake up the database manager to repeat the process.)
Caveats for AI incident response
While this process may sound seamless, it comes with caveats. Chief among them: even the best large language model will fall short for incident response unless it’s fully integrated into the development process.
For incident response to work, the AI agent must be plugged into the company’s developer resources and knowledge graph: databases, microservices, continuous integration/continuous delivery, and other infrastructure and apps. By connecting the dots across the software development lifecycle, agentic AI simplifies the hunt for root causes and speeds up resolution time.
Nonetheless, when used properly, agentic AI for incident response can yield a significant payoff for the business, customers, and employees.
For starters, there’s a dramatic reduction in mean time to recovery from incidents: as much as 50 to 80 percent, in my experience. With unplanned downtime costing the world’s biggest companies an estimated $400 billion a year, or almost 10 percent of profits, that’s a huge return on investment. Customers regain access to vital services faster, reducing risks to brand loyalty.
A better developer experience is another upside. Freed from unnecessary toil, engineers can focus on what they do best: creating innovative software that adds value for customers. And just as important, they can relax and enjoy their Thanksgiving.