How to Leverage Machine Learning to Detect Cloud Spend Anomalies
To make predictions on cost data, we tested both statistical and neural net models and compared their performance on different time series datasets.
Written by: Ram Potham, Parnian Zargham, Ravi Yadalam
As companies continue to scale their cloud consumption, effectively managing cloud costs is critical, and the ability to automatically detect anomalies in cloud spend is key to doing so. A cloud spend anomaly occurs when a company’s cloud costs deviate from their typical spending pattern in a way that could have a material impact on the company. Anomalies can have various causes, such as a team member unintentionally provisioning resources or cloud account credentials being compromised. It’s imperative that they are caught and addressed as soon as possible.
The first step in detecting cloud spend anomalies, a core part of cloud cost management (CCM), is being able to accurately predict the company’s cloud costs over time. Finding the cloud services and resources that cost much more than they should can prevent significant wasted cloud spend. Cloud costs can be predicted accurately with machine learning methods by framing the problem as a time series forecasting problem.
In this blog, we detail the multiple machine learning approaches we have experimented with to arrive at the best solution to this problem.
Introduction to Time Series Prediction in Machine Intelligence
Our time series data is a sequence of daily instance costs. Broadly, machine learning forecasting has proven effective for time series data because it can capture weekly, monthly, and seasonal patterns in costs. Using these patterns and previous data, a model can make predictions from one day to over a month in advance. We find the patterns by developing models designed to separate the different components of the time series and learn how they change and interact. Once we can predict the data accurately, we create error boundaries around the predictions to detect anomalies and identify when companies are paying too much.
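One way to picture the error boundaries is as a band above each forecast whose width comes from how wrong recent forecasts have been. The sketch below is illustrative only: the function names `error_band` and `flag_anomalies`, the multiplier `k`, and the sample numbers are assumptions made for this example, not the logic used in production.

```python
from statistics import pstdev

def error_band(past_errors, k=3.0):
    """Band width: k standard deviations of historical forecast errors.

    `k` is an assumed sensitivity knob; larger values flag fewer days.
    """
    return k * pstdev(past_errors)

def flag_anomalies(actual, predicted, band):
    """Return indices of days whose actual cost exceeds forecast + band."""
    return [
        i
        for i, (a, p) in enumerate(zip(actual, predicted))
        if a - p > band
    ]

# Hypothetical forecast errors from a calibration period (actual - predicted).
past_errors = [1.0, -2.0, 0.5, -1.5, 2.0]
band = error_band(past_errors)

# Hypothetical daily costs and their one-day-ahead forecasts.
actual = [100.0, 102.0, 98.0, 130.0, 101.0]
predicted = [100.0, 101.0, 99.0, 100.0, 100.0]
anomalous_days = flag_anomalies(actual, predicted, band)
```

The band is calibrated on a separate history of errors so that a large anomaly does not inflate its own threshold. Only overspend is flagged here, since the goal stated above is to catch days when costs are too high.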
Generally, time series data can be split into trends (e.g., increasing or decreasing behavior), seasonality (such as the cyclic component of a certain period), and random noise (since observed data doesn’t perfectly follow a model). We can model them separately and put them together to make our final prediction.
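To make the trend, seasonality, and noise split concrete, here is a minimal additive decomposition in plain Python. It is a sketch under stated assumptions: a weekly period of 7, a centered moving average for the trend, and a synthetic series. The function name `decompose` is hypothetical, and the models discussed later in this blog estimate these components in their own ways.

```python
def decompose(series, period=7):
    """Additive split into trend, seasonal, and residual parts.

    Assumes an odd `period` so the moving-average window can be centered.
    """
    half = period // 2
    n = len(series)

    # Trend: centered moving average; left as None near the edges.
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half : t + half + 1]) / period

    # Seasonality: average detrended value for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    cycle = [sum(b) / len(b) for b in buckets]
    offset = sum(cycle) / period
    cycle = [c - offset for c in cycle]  # center the cycle on zero
    seasonal = [cycle[t % period] for t in range(n)]

    # Residual: whatever trend and seasonality do not explain.
    residual = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, residual

# Synthetic daily cost: a rising trend plus a weekend pattern, no noise.
pattern = [0, 0, 0, 0, 0, 3, -3]
series = [10 + 0.5 * t + pattern[t % 7] for t in range(28)]
trend, seasonal, residual = decompose(series)
```

Because the synthetic series contains no noise, the residual comes out at zero wherever the trend is defined. On real cost data the residual is what the error boundaries have to absorb.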
However, before we start building the models, we first need to preprocess the data. This involves handling missing values by imputing them with linear interpolation. For this blog, we use the last 90 days of observations to make a one-day forecast. We must also appropriately scale the data and the predictions so that the model trains well and its predictions are on the right scale.
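The preprocessing steps above can be sketched in a few lines of plain Python. The helper names (`interpolate`, `min_max_scale`, `make_windows`, `unscale`) and the choice of min-max scaling are assumptions for illustration; the text does not say which scaler was actually used.

```python
def interpolate(series):
    """Fill None gaps by linear interpolation between known neighbors."""
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled  # leading or trailing gaps stay None in this sketch

def min_max_scale(series):
    """Scale values to [0, 1]; also return what is needed to undo it."""
    lo, hi = min(series), max(series)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a flat series
    return [(v - lo) / span for v in series], lo, span

def unscale(value, lo, span):
    """Map a scaled prediction back to the original cost scale."""
    return value * span + lo

def make_windows(series, lookback=90):
    """Pair each `lookback`-day window with the following day's value."""
    return [
        (series[i : i + lookback], series[i + lookback])
        for i in range(len(series) - lookback)
    ]

gap_filled = interpolate([1.0, None, None, 4.0])
scaled, lo, span = min_max_scale([10.0, 20.0, 30.0])
windows = make_windows(list(range(100)), lookback=90)
```

In practice the scaling parameters would be computed on the training portion only and reused at prediction time, so that information from the days being forecast does not leak into the model.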
To make predictions on cost data, we tried both statistical and neural net models and compared their performance on different time series datasets.
Overview of Machine Learning Models
ARIMA stands for “autoregressive integrated moving average,” which is a lot to unpack. We call this model Auto ARIMA since it automatically generates the optimal parameters for the model based on the time series data it’s trained on. There are three parts in this model:
- The autoregressive component involves predicting future values using past values (specifically a linear combination of them)
- The integrated component involves differencing the data (taking the difference between each pair of adjacent points) one or more times until the trend is eliminated. The resulting data is called stationary, and its future values are easier to predict.
- The moving average component involves using the errors of past predictions to predict future values.
A variation of this model that we tested was SARIMA, which also accounts for a seasonal component.
The parameters p, d, and q correspond to the AR, I, and MA components of the ARIMA model, respectively.
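The AR and I parts are easy to show by hand. The sketch below differences a series, fits a one-lag autoregression by least squares, and undoes the differencing to get a one-day forecast, which roughly corresponds to p=1, d=1, q=0. It deliberately leaves out the moving average part and the automatic parameter search, and the function names are hypothetical; a real Auto ARIMA implementation does considerably more.

```python
def difference(series, d=1):
    """Apply first-order differencing d times (the 'I' in ARIMA)."""
    for _ in range(d):
        series = [b - a for a, b in zip(series, series[1:])]
    return series

def fit_ar1(series):
    """Least-squares coefficient phi for x[t] ~ phi * x[t-1] (no intercept)."""
    num = sum(a * b for a, b in zip(series, series[1:]))
    den = sum(a * a for a in series[:-1])
    return num / den if den else 0.0

def forecast_next(series):
    """One-step forecast: difference once, apply AR(1), undo the difference."""
    diffs = difference(series, d=1)
    phi = fit_ar1(diffs)
    return series[-1] + phi * diffs[-1]

# A quadratic trend becomes constant after two rounds of differencing.
twice_differenced = difference([1, 4, 9, 16], d=2)

# A steadily rising cost series: the forecast continues the trend.
next_cost = forecast_next([10, 12, 14, 16, 18])
```

The first example shows why d may need to be greater than 1: one round of differencing removes a linear trend, but a curved trend needs another pass before the data looks stationary.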
ESRNN is a deep learning time series model that won the M4 forecasting competition, a competition designed to improve forecasting accuracy. The name stands for “exponential smoothing recurrent neural networks.” The exponential smoothing component decomposes the time series into trend and seasonality components. Meanwhile, the recurrent neural network is a type of artificial neural network built for sequential data such as time series. The network retains information from previous inputs in its memory, which influences how it processes later inputs and the outputs it produces. This is different from traditional neural networks, where all inputs are treated as independent of each other.
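For readers unfamiliar with exponential smoothing, the core recurrence is small enough to show directly. This sketch covers only simple smoothing of the level with a fixed `alpha`, which is an assumption for illustration; ESRNN also models seasonality and learns its smoothing parameters jointly with the neural network, none of which is shown here.

```python
def exponential_smoothing(series, alpha=0.3):
    """Return the smoothed level after each observation.

    Each new level is a weighted average of the latest observation and
    the previous level, so older observations fade out exponentially.
    """
    level = series[0]
    levels = [level]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level
        levels.append(level)
    return levels

# With alpha=0.5 the level moves halfway toward each new observation.
levels = exponential_smoothing([10.0, 20.0, 20.0], alpha=0.5)
```

A higher `alpha` makes the level react faster to recent costs, while a lower one gives a smoother, slower-moving estimate.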
The long short-term memory (LSTM) network is another neural network, a type of RNN, that remembers past sequences and patterns to predict future values. A typical recurrent neural network can suffer from the vanishing gradient problem, where gradients shrink during training and information about earlier inputs is lost. The LSTM model addresses this with a series of gates (input, output, and forget) that act like switches, controlling what information is kept or discarded and mitigating the vanishing gradient problem. Our implementation uses stacked LSTM layers, which make the model deeper and allow it to capture more complex patterns in the data.
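The role of the three gates is easier to see in a single time step written out by hand. The following is a toy scalar LSTM cell with made-up weights, not the stacked model described above; the weight layout (a dict of `(input weight, hidden weight, bias)` tuples) is an assumption chosen for readability.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step.

    `w` maps a gate name to (input weight, hidden weight, bias).
    Returns the new hidden state h and cell state c.
    """
    def gate(name, activation):
        wx, wh, b = w[name]
        return activation(wx * x + wh * h_prev + b)

    f = gate("forget", sigmoid)       # how much of the old cell state to keep
    i = gate("input", sigmoid)        # how much of the candidate to write
    o = gate("output", sigmoid)       # how much of the cell state to expose
    g = gate("candidate", math.tanh)  # proposed new content

    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

# With all-zero weights every gate sits at 0.5 and the candidate is 0,
# so the cell state is simply halved.
zero_weights = {name: (0.0, 0.0, 0.0)
                for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=zero_weights)
```

The cell state update is additive (`f * c_prev + i * g`), which is what lets information, and gradients during training, persist across many time steps when the forget gate stays close to 1.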
Prophet is developed by Facebook. It is designed to forecast time series with yearly, weekly, and daily patterns plus holiday effects. The benefit of this framework is that it’s easy to use and generalizable, including multiple time series models. We used Kats for prediction and anomaly detection.
To compare all the approaches fairly, we trained the models on the same data and made predictions for the same points. We used daily cost data from Harness instances running in our own QA environment. For this study, we are interested in predicting the cost for each day given the previous 90 days of data. The dataset spans from January 2021 to June 2022.
We compared the accuracy of these models using mean absolute percentage error (MAPE), a commonly used metric for comparing prediction models. The results are summarized in Table 1 below.
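For reference, MAPE averages the absolute error of each prediction as a fraction of the actual value and expresses it as a percentage. A minimal version is sketched below; skipping days with zero actual cost is an assumption made here to avoid division by zero, and the sample numbers are invented.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Days where the actual value is 0 are skipped, since the percentage
    error is undefined there.
    """
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    return 100.0 * sum(abs((a - p) / a) for a, p in pairs) / len(pairs)

# Two days, each off by 10% of the actual cost.
score = mape([100.0, 200.0], [110.0, 180.0])
```

Because each error is divided by the actual value, MAPE is scale-independent, which makes it convenient for comparing models across instances with very different daily costs.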
Taking CCM to the Next Level
While Auto ARIMA appears to perform best in terms of MAPE, it needs extensive hyperparameter tuning. Auto ARIMA also underperforms on data with certain patterns. For example, on cost data with spikes on the first day of each month, it fails to predict the values at the peaks.
Overall, we found it is best to use Auto ARIMA when there isn’t much random noise and there are stable seasonal and trend components. When there is a lot of random noise alongside stable seasonal and trend components, LSTM and ESRNN work better, but both need a massive dataset to train on and learn the correct set of parameters. Since we need to train separate models per customer and per instance, we are dealing with smaller datasets. Prophet, on the other hand, outperforms the other models at modeling seasonality and needs less hyperparameter tuning.
Through this experience, we concluded that we cannot use one algorithm to model all kinds of cost data. In the preprocessing step, we need to categorize each time series by characteristics such as seasonality to know which models are likely to perform better. Among the selected models, we cross-validate on the validation dataset to pick the best model with the best set of hyperparameters. As a result, training and prediction time can vary between datasets.
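The selection step can be sketched as a loop that scores each candidate on a held-out validation period and keeps the one with the lowest MAPE. The three forecasters below are deliberately trivial stand-ins (not ARIMA, ESRNN, LSTM, or Prophet), and the names `CANDIDATES` and `pick_best_model` are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (assumes no zero actuals)."""
    return 100.0 * sum(
        abs((a - p) / a) for a, p in zip(actual, predicted)
    ) / len(actual)

# Stand-in forecasters: each maps the history so far to a one-day forecast.
CANDIDATES = {
    "naive": lambda history: history[-1],
    "seasonal_naive_7": lambda history: history[-7],
    "mean_7": lambda history: sum(history[-7:]) / 7,
}

def pick_best_model(series, n_validation, candidates=CANDIDATES):
    """Score each candidate on the last n_validation days; return the best."""
    split = len(series) - n_validation
    scores = {}
    for name, forecast in candidates.items():
        # Rolling one-step-ahead forecasts: only past data is visible.
        preds = [forecast(series[:t]) for t in range(split, len(series))]
        scores[name] = mape(series[split:], preds)
    best = min(scores, key=scores.get)
    return best, scores

# Synthetic cost series with a strict weekly pattern.
weekly = [100, 100, 100, 100, 100, 150, 150] * 6
best, scores = pick_best_model(weekly, n_validation=14)
```

On this strictly weekly series the seasonal stand-in wins, which mirrors the point above: the characteristics of the series decide which model performs best.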
These machine learning models not only enable companies to accurately project their daily cloud costs but also to find cost anomalies from their daily cloud consumption. These predictions empower companies to proactively prevent adverse cloud cost impact by addressing anomalous consumption of cloud services and resources in time.
Learn more about how Harness CCM can help your team predict cloud spend before it breaks your budget.