December 31, 2024

Finding the Needle in the Cost Haystack: Anomaly Detection with BQML

Cloud cost anomalies can be elusive. Unexpected spikes in usage, misconfigurations, and billing changes often blend into the noise of normal spending patterns. Detecting these anomalies before they turn into expensive surprises is a challenge for Cloud Cost Management (CCM).

At Harness, we leverage Google BigQuery ML (BQML) to automatically detect cloud cost anomalies using time-series forecasting models. Unlike traditional anomaly detection methods that require external ML pipelines, BQML enables in-database machine learning, allowing us to run anomaly detection directly on cloud cost data stored in BigQuery.

Why Use BQML for Cost Anomaly Detection?

Traditional anomaly detection approaches often require moving cost data out of the warehouse into external pipelines built around models such as Prophet, ARIMA, or SARIMA. With BQML, we simplify this process by running machine learning models natively in Google BigQuery, eliminating unnecessary data transfer. Our current anomaly detection uses Prophet; BQML removes both that data movement and the data formatting overhead that Python-based ML frameworks like Prophet require.

Additionally, BQML integrates seamlessly within BigQuery, eliminating the need for external setup, dependencies, or infrastructure — a requirement when using Prophet. This makes BQML a more efficient and scalable choice for in-database anomaly detection.

Key Advantages of BQML for Anomaly Detection

  • SQL-Based Machine Learning — Train and deploy ML models using standard SQL queries
  • No Data Movement — Analyze cost anomalies directly in BigQuery
  • Scalable for Large Datasets — Optimized for handling millions of cost records efficiently
  • Automated Forecasting & Detection — Supports scheduled model retraining for continuous monitoring

BQML workflow

How BQML Detects Cost Anomalies

1. Preparing Cloud Cost Data in BigQuery

At Harness Cloud Cost Management, we have built a comprehensive cost tracking infrastructure that captures daily cloud spending across AWS, GCP, and Azure at the resource level. Our system aggregates expenses by cloud provider, account, and specific services, creating a unified view of your entire cloud footprint. This granular approach not only powers our anomaly detection engine but also enables deeper spending analysis. By consolidating usage data at the resource level, we’ve streamlined cost monitoring, making it easier to spot unusual patterns and take proactive steps to optimize your cloud investments.
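
To make this concrete, the per-day, per-resource series the model consumes can be produced by an aggregation along these lines. The table and column names below are illustrative placeholders, not our actual schema:

SELECT
  usage_date,                               -- one row per day per resource
  cloud_provider,
  account_id,
  service,
  resource_id,
  SUM(cost) AS daily_cost                   -- daily spend aggregated at the resource level
FROM `project.dataset.raw_billing_data`     -- hypothetical unified billing table
GROUP BY usage_date, cloud_provider, account_id, service, resource_id;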

2. Training the ARIMA_PLUS Model

We use BQML’s ARIMA_PLUS model, which is specifically designed for time-series forecasting and anomaly detection. After evaluating multiple BQML model options, we found ARIMA_PLUS to be the most effective for cloud cost anomaly detection.
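
A minimal training statement looks roughly like the following. Model, table, and column names are illustrative; the options are standard BQML ARIMA_PLUS options:

CREATE OR REPLACE MODEL `project.dataset.cost_anomaly_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'usage_date',  -- the daily timestamp column
  time_series_data_col = 'daily_cost',       -- the value being modeled
  time_series_id_col = 'resource_id',        -- one time series per resource
  data_frequency = 'DAILY',
  auto_arima = TRUE,                         -- let BQML search the ARIMA order space
  auto_arima_max_order = 2                   -- the max_order setting discussed below
) AS
SELECT usage_date, resource_id, daily_cost
FROM `project.dataset.daily_costs`;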

Why Is ARIMA_PLUS the Right Model for Daily Cloud Cost Data?

ARIMA_PLUS excels at handling the unique characteristics of cloud cost data:

  • Seasonal Patterns — Cloud costs often follow regular patterns (daily, weekly, monthly billing cycles) that ARIMA_PLUS can automatically detect and incorporate into its predictions.
  • Trend Components — Many cloud services show gradual increases or decreases in cost over time as usage patterns evolve. ARIMA_PLUS captures these trends effectively.
  • Irregular Spikes — The model can distinguish between expected variations (like monthly billing) and truly anomalous cost events.
  • Automated Parameter Selection — ARIMA_PLUS automatically determines the optimal parameters for your specific cost patterns, reducing the need for manual tuning.

Comparison with Other BQML Time Series Models

For our cloud cost anomaly detection needs, ARIMA_PLUS, trained on 16 months of data, provides the best balance of accuracy, automation, and interpretability.

Understanding ARIMA_PLUS and max_order in BQML

ARIMA Model Components in ARIMA_PLUS

BQML’s ARIMA_PLUS model combines autoregression (AR), differencing (I), and moving averages (MA) to model time-series cost data; points that fall outside the model's prediction interval are flagged as anomalies. The max_order parameter plays a crucial role in controlling the model's complexity.

When you set max_order = 2, you're defining an upper limit on ARIMA's p, d, and q parameters to keep the model efficient while still capturing key cost patterns.

Breaking Down ARIMA Parameters

  • p (autoregressive order) — how many past cost values the model looks back on to predict the next point
  • d (differencing order) — how many times the series is differenced to remove trend and make it stationary
  • q (moving-average order) — how many past forecast errors the model uses to correct its predictions

How max_order = 2 Affects ARIMA_PLUS in BQML

Setting max_order = 2 allows only certain combinations of ARIMA models:

  • p, d, q values are restricted to {0, 1, or 2}
  • Reduces overfitting risks by limiting model complexity
  • Ensures the model generalizes well to new cost data trends

Example Configurations with max_order = 2

  • ARIMA(1,1,2) with Seasonality = 12 (Good for monthly billing cycles)
  • ARIMA(2,0,1) with Seasonality = 4 (Works for quarterly financial reporting)
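
After training, ML.ARIMA_EVALUATE can be used to inspect which orders the automatic search actually selected for each series and which seasonal periods were detected (model name as in the illustrative training statement above):

SELECT
  resource_id,          -- the time_series_id_col used at training time
  non_seasonal_p,
  non_seasonal_d,
  non_seasonal_q,
  seasonal_periods,     -- e.g. WEEKLY or YEARLY patterns the model found
  AIC
FROM ML.ARIMA_EVALUATE(MODEL `project.dataset.cost_anomaly_model`);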

Eliminating False Positives with Seasonality Handling

One of the biggest challenges in anomaly detection is false positives — cases where the model flags expected cost fluctuations as anomalies.

Example: Monthly Billing Spikes

Consider a cloud service that is billed at the start of every month.

  • Without seasonality detection, the model might incorrectly flag this expected spike as an anomaly every month.
  • With seasonality detection enabled in ARIMA_PLUS, the model learns the recurring billing pattern and distinguishes normal fluctuations from true anomalies.

This ensures that regular monthly charges are recognized as expected behavior, reducing false positives and improving anomaly detection accuracy.
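
The learned components can be inspected directly with ML.EXPLAIN_FORECAST, which is a useful sanity check that a recurring billing spike has been absorbed into a seasonal term rather than left in the residual. The column selection below assumes the illustrative model from earlier:

SELECT
  time_series_timestamp,
  time_series_type,          -- 'history' or 'forecast'
  trend,
  seasonal_period_weekly,    -- recurring weekly component learned by the model
  seasonal_period_monthly,   -- recurring monthly component (e.g. billing cycles)
  spikes_and_dips,
  step_changes
FROM ML.EXPLAIN_FORECAST(
  MODEL `project.dataset.cost_anomaly_model`,
  STRUCT(15 AS horizon, 0.98 AS confidence_level)
);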

How Long Does It Take to Learn Normal Spikes?

BQML’s ARIMA_PLUS model typically requires 2–3 full seasonal cycles (e.g., 2–3 months for monthly patterns) to accurately distinguish normal vs. abnormal cost fluctuations. For instance, with a service billed at the start of every month, the model needs to observe this pattern for about 2–3 months before it can reliably identify it as expected behavior rather than an anomaly.

3. Detecting Anomalies in Cost Data

Once the model is trained, we use ML.DETECT_ANOMALIES to identify suspicious cost spikes.

SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `ccm-play.BillingReport.cost_anomaly_model`
)
WHERE is_anomaly = TRUE;
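
Called this way, with no input data, the function scores the historical training data. It also accepts an explicit probability threshold and a query of newly ingested rows to score; the threshold value, table, and columns below are illustrative:

SELECT *
FROM ML.DETECT_ANOMALIES(
  MODEL `ccm-play.BillingReport.cost_anomaly_model`,
  STRUCT(0.98 AS anomaly_prob_threshold),        -- probability cut-off for flagging a point
  (SELECT usage_date, resource_id, daily_cost
   FROM `project.dataset.daily_costs`
   WHERE usage_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 7 DAY))
)
WHERE is_anomaly = TRUE;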

Detected anomalies are processed through customer-configured Anomaly Preferences to minimize duplication and ensure accurate tracking.

Anomaly Preferences

A newly detected anomaly surfaces immediately if no other anomaly has occurred in the last N days. If an anomaly does exist within that window, the system applies customer-defined thresholds to determine whether the new anomaly is distinct:

  • % Change: Must exceed X%.
  • Absolute $ Increase: Must be at least $Y.

If these thresholds are met, the anomaly is logged as a new anomaly. If the thresholds are not met, the system updates the duration of the existing anomaly instead of creating a duplicate.
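
A simplified sketch of that decision logic in SQL, with placeholder tables and example values for N, X, and Y (the real values are customer-configured):

SELECT
  d.resource_id,
  d.anomaly_date,
  CASE
    WHEN e.anomaly_date IS NULL THEN 'NEW'              -- nothing seen in the last N days
    WHEN SAFE_DIVIDE(d.cost - e.cost, e.cost) > 0.25    -- % change threshold (X = 25%)
         AND d.cost - e.cost >= 100                     -- absolute $ increase threshold (Y = $100)
      THEN 'NEW'
    ELSE 'EXTEND_EXISTING'                              -- update the duration of the existing anomaly
  END AS action
FROM detected_anomalies AS d
LEFT JOIN logged_anomalies AS e
  ON e.resource_id = d.resource_id
 AND e.anomaly_date >= DATE_SUB(d.anomaly_date, INTERVAL 7 DAY);  -- look-back window (N = 7 days)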

4. Forecasting the Cost

We use ML.FORECAST on the model to predict costs for a given day:

SELECT
  forecast_timestamp,
  forecast_value,
  prediction_interval_lower_bound,
  prediction_interval_upper_bound
FROM ML.FORECAST(
  MODEL `project.dataset.anomaly_model`,
  STRUCT(15 AS horizon, 0.98 AS confidence_level)
);

  • horizon = 15 → Predicts values 15 days into the future.
  • confidence_level = 0.98 → Sets a 98% confidence interval, meaning predictions are expected to fall within this range 98% of the time.
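
Once actual costs for those dates arrive, a point is suspicious when it lands outside that prediction interval. A sketch of the comparison, reusing the illustrative cost table and assuming the model was trained with a resource_id series column:

SELECT
  a.usage_date,
  a.resource_id,
  a.daily_cost,
  f.prediction_interval_lower_bound,
  f.prediction_interval_upper_bound,
  (a.daily_cost < f.prediction_interval_lower_bound
   OR a.daily_cost > f.prediction_interval_upper_bound) AS outside_interval
FROM `project.dataset.daily_costs` AS a
JOIN ML.FORECAST(
       MODEL `project.dataset.anomaly_model`,
       STRUCT(15 AS horizon, 0.98 AS confidence_level)) AS f
  ON f.resource_id = a.resource_id
 AND f.forecast_timestamp = TIMESTAMP(a.usage_date);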

5. Retraining

BQML does not support incremental training. Instead, we retrain the model every Sunday using the last 487 days of data, which ensures it stays updated with the latest daily cost trends.
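
Conceptually, the weekly retrain is just a full CREATE OR REPLACE MODEL run as a scheduled query over a sliding 487-day window (names and options mirror the illustrative training statement above):

CREATE OR REPLACE MODEL `project.dataset.cost_anomaly_model`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'usage_date',
  time_series_data_col = 'daily_cost',
  time_series_id_col = 'resource_id',
  data_frequency = 'DAILY',
  auto_arima_max_order = 2
) AS
SELECT usage_date, resource_id, daily_cost
FROM `project.dataset.daily_costs`
WHERE usage_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 487 DAY);   -- last 487 days only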

Performance & Real-World Results

Training Time vs Data Size

We tested the worst-case scenario with a 35GB dataset of cost data to evaluate model training performance:

  • 30-day model → ~2 minutes to train
  • 487-day model → ~17 minutes, but significantly improves anomaly detection accuracy

On average, training took only a few minutes for normal-sized cost data.

Detected Anomalies

In production, the model has accurately flagged cost spikes across many resources.

Cost of Creating and Training the Model

BQML on-demand pricing is $312.50 per TiB. In the worst-case scenario, we train the model using 16 months of data, where the dataset size is 35.39 GB.

Cost Calculation: 35,390,000,000 bytes ÷ 1,099,511,627,776 bytes per TiB × $312.50 per TiB ≈ $10.06

BQML training costs are automatically labeled within Google Cloud Billing, allowing for cost tracking and analysis. These labels help identify and attribute expenses associated with BQML model training within your Google Cloud project.

BQML Cost Tagging Details: The cost associated with BQML is tagged with the following predefined labels:

  • Key: bigquery.googleapis.com/bqml
  • Value: bqml_arima_plus_training
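
Assuming the standard Cloud Billing export to BigQuery, training spend can then be isolated by filtering on that label. The export table name below is a placeholder, and depending on the export the label may surface under system_labels rather than labels:

SELECT
  DATE(usage_start_time) AS usage_day,
  SUM(cost) AS bqml_training_cost
FROM `project.billing_dataset.gcp_billing_export_v1_XXXXXX` AS b,
     UNNEST(b.labels) AS label
WHERE label.key = 'bigquery.googleapis.com/bqml'
  AND label.value = 'bqml_arima_plus_training'
GROUP BY usage_day
ORDER BY usage_day;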

Conclusion: Smarter Cost Monitoring with BQML

With BQML’s ARIMA_PLUS, we can efficiently detect cost anomalies while minimizing false positives.

  • Seasonality handling improves accuracy for cyclical cost patterns
  • max_order = 2 balances model complexity and performance
  • Fully automated anomaly detection with BigQuery SQL

Next Steps

We are continuously improving our anomaly detection pipeline by:

  • Comparing BQML vs Prophet models under a feature flag
  • Enhancing detection for Kubernetes cost anomalies using BQML
  • Detecting anomalies in real time as cost data is ingested

Chandra Mulpuri

Chandra Mulpuri is a Software Engineering professional specializing in cloud-based communication and collaboration platforms. He has extensive experience in building virtual meeting, calling, and telephony applications with a strong focus on high availability and enterprise scalability. Chandra is skilled in Unified Communications, Call Control, Java, C++, and Cloud Applications, and is a VMware Certified Professional (VCP:73237).
