December 31, 2024

Seamless Data Sync from Google BigQuery to ClickHouse in an AWS Airgapped Environment


Understanding the Key Components

Airgap Environment

An airgapped environment enforces strict outbound policies, preventing external network communication. This setup enhances security but presents challenges for cross-cloud data synchronization.

Proxy Server

A proxy server is a lightweight, high-performance intermediary facilitating outbound requests from workloads in restricted environments. It acts as a bridge, enabling controlled external communication.

ClickHouse

ClickHouse is an open-source, column-oriented OLAP (Online Analytical Processing) database known for its high-performance analytics capabilities.

This article explores how to seamlessly sync data from BigQuery, Google Cloud’s managed analytics database, to ClickHouse running in an AWS-hosted airgapped Kubernetes cluster using proxy-based networking.

Use Case

Deploying ClickHouse in airgapped environments presents challenges in syncing data across isolated cloud infrastructures such as GCP, Azure, or AWS.

In our setup, ClickHouse is deployed via Helm charts in an AWS Kubernetes cluster, with strict outbound restrictions. The goal is to sync data from a BigQuery table (GCP) to ClickHouse (AWS K8S), adhering to airgap constraints.

Challenges

  • Restricted Outbound Network: The ClickHouse cluster cannot directly access Google Cloud services due to airgap policies.
  • Data Transfer Between Isolated Clouds: There is no straightforward mechanism for syncing data from GCP to ClickHouse in AWS without external connectivity.

Solution

The solution leverages a corporate proxy server to facilitate communication. By injecting a custom proxy configuration into ClickHouse, we enable HTTP/HTTPS traffic routing through the proxy, allowing controlled outbound access.

Architecture Overview

  1. BigQuery to GCS Export: Data is first exported from BigQuery to a GCS bucket.
  2. ClickHouse GCS Integration: ClickHouse reads the exported files from GCS using its gcs table function.
  3. Proxy Routing: ClickHouse’s outbound requests are routed through a corporate proxy server.
  4. Data Ingestion in ClickHouse: The retrieved data is processed and stored within ClickHouse for analytics.
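
The steps above come down to two SQL statements: a BigQuery export and a ClickHouse insert that reads from GCS. The Python sketch below only builds those statements as strings. It is a minimal illustration under assumptions, not the statements used in this setup: the table, bucket, and target names are hypothetical, Parquet is an arbitrary choice of export format, and the HMAC credentials are placeholders.

```python
GCS_ENDPOINT = "https://storage.googleapis.com"


def gs_to_https(gs_uri: str) -> str:
    """Convert gs://bucket/path into the HTTPS URL form used by ClickHouse."""
    prefix = "gs://"
    if not gs_uri.startswith(prefix):
        raise ValueError(f"expected a gs:// URI, got {gs_uri!r}")
    return f"{GCS_ENDPOINT}/{gs_uri[len(prefix):]}"


def bigquery_export_sql(table: str, gs_uri: str) -> str:
    """Step 1: build a BigQuery EXPORT DATA statement writing Parquet to GCS."""
    return (
        "EXPORT DATA OPTIONS (\n"
        f"  uri = '{gs_uri}',\n"
        "  format = 'PARQUET',\n"
        "  overwrite = true\n"
        ") AS\n"
        f"SELECT * FROM `{table}`"
    )


def clickhouse_ingest_sql(target_table: str, https_url: str) -> str:
    """Steps 2 and 4: read the exported files with gcs() and insert them."""
    # <hmac_key> and <hmac_secret> are placeholders, not real credentials.
    return (
        f"INSERT INTO {target_table}\n"
        f"SELECT * FROM gcs('{https_url}', '<hmac_key>', '<hmac_secret>', 'Parquet')"
    )


if __name__ == "__main__":
    uri = "gs://example-bucket/export/*.parquet"  # hypothetical bucket and path
    print(bigquery_export_sql("example-project.example_dataset.events", uri))
    print(clickhouse_ingest_sql("analytics.events", gs_to_https(uri)))
```

Step 3 does not appear in the SQL at all: the proxy is applied by the ClickHouse server configuration, so the query itself stays unchanged.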

Implementation Steps

1. Proxy Configuration

  • Created a proxy.xml file defining proxy details for outbound HTTP/HTTPS requests.
  • Used a Kubernetes ConfigMap (clickhouse-proxy-config) to store this configuration.
  • Mounted the ConfigMap dynamically into the ClickHouse pod.
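
As a rough illustration of the first two bullets, the Python sketch below renders a proxy.xml and wraps it in a ConfigMap manifest. The proxy address is hypothetical, and the XML layout (a `proxy` element with `http` and `https` entries, each holding a `uri`) follows ClickHouse's documented proxy settings rather than the exact file used in this setup, so treat it as an assumption.

```python
import json
import xml.etree.ElementTree as ET


def render_proxy_xml(proxy_uri: str) -> str:
    """Render a ClickHouse config fragment routing HTTP and HTTPS via one proxy."""
    root = ET.Element("clickhouse")
    proxy = ET.SubElement(root, "proxy")
    for scheme in ("http", "https"):
        # One <uri> per scheme; both point at the same corporate proxy here.
        ET.SubElement(ET.SubElement(proxy, scheme), "uri").text = proxy_uri
    ET.indent(root, space="    ")  # requires Python 3.9+
    return ET.tostring(root, encoding="unicode")


def proxy_configmap(proxy_uri: str, name: str = "clickhouse-proxy-config") -> dict:
    """Build a ConfigMap manifest whose single key holds the rendered proxy.xml."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name},
        "data": {"proxy.xml": render_proxy_xml(proxy_uri)},
    }


if __name__ == "__main__":
    # Hypothetical proxy address; replace with the real corporate proxy.
    manifest = proxy_configmap("http://proxy.example.internal:3128")
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML.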

2. Kubernetes Deployment

  • Mounted proxy.xml in the ClickHouse pod at /etc/clickhouse-server/config.d/proxy.xml.
  • Relaxed the security context for testing only: allowed privilege escalation and ran the pod as root to simplify file permissions.
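
The mount and security settings described above might look like the pod spec fragment below, expressed as a Python dictionary. This is a sketch under assumptions: the volume name is hypothetical, the use of `subPath` is one common way to place a single file into config.d, and where these fields go in the Helm chart's values depends on the chart in use.

```python
import json

MOUNT_PATH = "/etc/clickhouse-server/config.d/proxy.xml"


def proxy_mount_fragment(configmap_name: str = "clickhouse-proxy-config") -> dict:
    """Build the pod spec pieces that mount proxy.xml into the ClickHouse container."""
    volume_name = "proxy-config"  # hypothetical volume name
    return {
        # Pod-level: expose the ConfigMap as a volume.
        "volumes": [{"name": volume_name, "configMap": {"name": configmap_name}}],
        # Container-level: mount only the proxy.xml key at the target path.
        "volumeMounts": [
            {
                "name": volume_name,
                "mountPath": MOUNT_PATH,
                "subPath": "proxy.xml",
                "readOnly": True,
            }
        ],
        # Container-level: relaxed settings, intended for testing only.
        "securityContext": {"runAsUser": 0, "allowPrivilegeEscalation": True},
    }


if __name__ == "__main__":
    print(json.dumps(proxy_mount_fragment(), indent=2))
```

One design note: a file mounted through `subPath` does not pick up later ConfigMap changes until the pod is restarted, so a proxy change needs a rollout.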

3. Testing and Validation

  • Deployed a non-stateful ClickHouse instance to iterate quickly.
  • Verified that ClickHouse requests were routed through the proxy.
  • Observed proxy logs confirming that outbound requests were relayed to GCP.
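
The log check described above can be approximated with a small filter that keeps only the proxy log lines mentioning the GCS host. This is a hypothetical helper: it assumes the proxy writes the destination host into each log line, and the sample lines are made up for illustration and do not follow any particular proxy's log format.

```python
from typing import Iterable, List

GCS_HOST = "storage.googleapis.com"


def relayed_requests(log_lines: Iterable[str], host: str = GCS_HOST) -> List[str]:
    """Return the log lines that mention the given destination host."""
    return [line for line in log_lines if host in line]


# Made-up sample lines, not taken from a real proxy.
SAMPLE_LOG = [
    "CONNECT storage.googleapis.com:443 200",
    "CONNECT example.internal:443 403",
]

if __name__ == "__main__":
    for line in relayed_requests(SAMPLE_LOG):
        print(line)
```

An empty result after running a gcs query would suggest that ClickHouse is not using the proxy configuration.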

Figure: The left window shows the query to BigQuery, and the right window shows the proxy logs with the request being forwarded through the proxy server.

Outcome

This approach successfully enabled secure communication between ClickHouse (AWS) and BigQuery (GCP) in an airgapped environment. The use of a ConfigMap-based proxy configuration made the setup:

  • Scalable: Easily adaptable to different cloud vendors (GCP, Azure, AWS).
  • Flexible: Decouples networking configurations from application logic.
  • Secure: Ensures outbound traffic is strictly controlled via the proxy.

By leveraging ClickHouse’s extensible configuration system and Kubernetes, we overcame strict network isolation to enable cross-cloud data workflows in constrained environments. This architecture can be extended to other cloud-native workloads requiring external data synchronization in airgapped environments.
