On Kubernetes (Beta)

Overview

Airbyte allows scaling sync workloads horizontally using Kubernetes. The core components (API server, scheduler, etc.) run as deployments, while the scheduler launches connector-related pods on different nodes.

Getting Started

Cluster Setup

For local testing, we recommend following one of these setup guides:

For testing on GKE you can create a cluster with the command line or the Cloud Console UI.

For testing on EKS you can install eksctl and run eksctl create cluster to create an EKS cluster/VPC/subnets/etc. This process should take 10-15 minutes.

For production, Airbyte should function on most clusters v1.19 and above. We have tested support on GKE and EKS. If you run into a problem starting Airbyte, please reach out on the #troubleshooting channel on our Slack or create an issue on GitHub.

Install kubectl

If you do not already have the CLI tool kubectl installed, please follow these instructions to install.

Configure kubectl

Configure kubectl to connect to your cluster by using kubectl config use-context my-cluster-name.

  • For GKE

    • Configure gcloud with gcloud auth login.

    • On the Google Cloud Console, the cluster page has a Connect button, which provides a command to run locally that looks like:

      gcloud container clusters get-credentials CLUSTER_NAME --zone ZONE_NAME --project PROJECT_NAME

    • Use kubectl config get-contexts to show the contexts available.

    • Run kubectl config use-context <gke context> to access the cluster from kubectl.

  • For EKS

    • Configure your AWS CLI to connect to your project.

    • Install eksctl

    • Run eksctl utils write-kubeconfig --cluster=<CLUSTER NAME> to make the context available to kubectl

    • Use kubectl config get-contexts to show the contexts available.

    • Run kubectl config use-context <eks context> to access the cluster with kubectl.
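The context-selection steps above can be sketched as a small shell helper. This is only an illustration: use_cluster_context is a hypothetical function name, and the sketch assumes the context has already been written to your kubeconfig by gcloud or eksctl.

```shell
# Switch kubectl to the named context, but only if it is listed in the kubeconfig.
# "use_cluster_context" is a hypothetical helper name, not part of kubectl.
use_cluster_context() {
  wanted="$1"
  # "-o name" prints one context name per line; -x -F require an exact, literal match.
  if kubectl config get-contexts -o name | grep -qxF -- "$wanted"; then
    kubectl config use-context "$wanted"
  else
    echo "context not found: $wanted" >&2
    return 1
  fi
}
```

For example, use_cluster_context my-cluster-name switches to that context, or returns a non-zero status if it is not present.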

Configure Logs

Both dev and stable versions of Airbyte include a stand-alone Minio deployment. Airbyte publishes logs to this Minio deployment by default. This means Airbyte comes as a self-contained Kubernetes deployment - no other configuration is required.

Airbyte currently supports logging to Minio, S3 or GCS. The following instructions are for users wishing to log to their own Minio layer, S3 bucket or GCS bucket.

The provided credentials require both read and write permissions. The logger attempts to create the log bucket if it does not exist.

Configuring Custom Minio Log Location

Replace the following variables in the .env file in the kube/overlays/stable directory:

# The Minio bucket to write logs in.
S3_LOG_BUCKET=
# Minio Access Key.
AWS_ACCESS_KEY_ID=
# Minio Secret Key.
AWS_SECRET_ACCESS_KEY=
# Endpoint where Minio is deployed.
S3_MINIO_ENDPOINT=

The S3_PATH_STYLE_ACCESS variable should remain true. The S3_LOG_BUCKET_REGION variable should remain empty.

Configuring Custom S3 Log Location

Replace the following variables in the .env file in the kube/overlays/stable directory:

# The S3 bucket to write logs in.
S3_LOG_BUCKET=
# The S3 bucket region.
S3_LOG_BUCKET_REGION=
# AWS access key ID.
AWS_ACCESS_KEY_ID=
# AWS secret access key.
AWS_SECRET_ACCESS_KEY=
# Set this to empty.
S3_MINIO_ENDPOINT=
# Set this to empty.
S3_PATH_STYLE_ACCESS=

See here for instructions on creating an S3 bucket and here for instructions on creating AWS credentials.
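Before applying the overlay, it can help to sanity-check the S3 settings described above. The following is a minimal sketch, not part of Airbyte: env_value and check_s3_log_env are hypothetical helper names, and the sketch assumes a plain KEY=VALUE .env file without quoting.

```shell
# Print the value of KEY from a KEY=VALUE file (last assignment wins).
# Both helper names here are hypothetical, not Airbyte tooling.
env_value() {
  sed -n "s/^$2=//p" "$1" | tail -n 1
}

# Check the S3 logging variables: the first group must be set,
# and the Minio-specific ones must be left empty.
check_s3_log_env() {
  env_file="$1"
  for key in S3_LOG_BUCKET S3_LOG_BUCKET_REGION AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY; do
    [ -n "$(env_value "$env_file" "$key")" ] || { echo "missing: $key" >&2; return 1; }
  done
  for key in S3_MINIO_ENDPOINT S3_PATH_STYLE_ACCESS; do
    [ -z "$(env_value "$env_file" "$key")" ] || { echo "must be empty: $key" >&2; return 1; }
  done
}
```

Running check_s3_log_env kube/overlays/stable/.env returns a non-zero status and names the first variable that does not match the S3 setup.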

Configuring Custom GCS Log Location

Create the GCP service account with read/write permission to the GCS log bucket.

1) Base64 encode the GCP json secret.

# The output of this command will be a Base64 string.
$ cat gcp.json | base64

2) Populate the gcs-log-creds secret with the Base64-encoded credential by taking the encoded credential from the previous step and adding it to the secret-gcs-log-creds.yaml file.

apiVersion: v1
kind: Secret
metadata:
  name: gcs-log-creds
  namespace: default
data:
  gcp.json: <base64-encoded-string>

3) Replace the following variables in the .env file in the kube/overlays/stable directory:

# The GCS bucket to write logs in.
GCP_STORAGE_BUCKET=
# The path the GCS creds are written to. Unless you know what you are doing, use the below default value.
GOOGLE_APPLICATION_CREDENTIALS=/secrets/gcs-log-creds/gcp.json

See here for instructions on creating a GCS bucket and here for instructions on creating GCP credentials.
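One detail of the Base64 step above: GNU base64 wraps its output at 76 characters by default, which makes the result awkward to paste into a single YAML value. A minimal sketch that produces a single-line string on both Linux and macOS follows; encode_file_base64 is a hypothetical helper name.

```shell
# Base64-encode a file as a single line, stripping any line wrapping
# that the local base64 implementation adds.
# "encode_file_base64" is a hypothetical helper name.
encode_file_base64() {
  base64 < "$1" | tr -d '\n'
}
```

For example, encode_file_base64 gcp.json prints the string to place in the gcp.json field of the secret.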

Launch Airbyte

Run the following commands to launch Airbyte:

git clone https://github.com/airbytehq/airbyte.git
cd airbyte
kubectl apply -k kube/overlays/stable

After 2-5 minutes, kubectl get pods | grep airbyte should show Running as the status for all the core Airbyte pods. This may take longer on Kubernetes clusters with slow internet connections.
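To see at a glance which core pods are still starting, the status check above can be scripted. This is a sketch under two assumptions: the default kubectl get pods column layout (NAME, READY, STATUS, ...), and pod names that begin with airbyte. The function name pods_not_running is hypothetical.

```shell
# Read "kubectl get pods --no-headers" output on stdin and print the names of
# Airbyte pods whose STATUS column (third field) is not "Running".
# "pods_not_running" is a hypothetical helper name.
pods_not_running() {
  awk '$1 ~ /^airbyte/ && $3 != "Running" { print $1 }'
}
```

Usage: kubectl get pods --no-headers | pods_not_running. Empty output means every matching pod reports Running.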

Run kubectl port-forward svc/airbyte-webapp-svc 8000:80 to allow access to the UI/API.

Now visit http://localhost:8000 in your browser and start moving some data!

Production Airbyte on Kubernetes

Setting resource limits

  • Core container pods

    • Instead of launching Airbyte with kubectl apply -k kube/overlays/stable, you can run with kubectl apply -k kube/overlays/stable-with-resource-limits.

    • The kube/overlays/stable-with-resource-limits/set-resource-limits.yaml file can be modified to provide different resource requirements for core pods.

  • Connector pods

    • By default, connector pods launch without resource limits.

    • To add resource limits, configure the "Docker Resource Limits" section of the .env file in the overlay folder you're using.

  • Volume sizes

    • You can modify kube/resources/volume-* files to specify different volume sizes for the persistent volumes backing Airbyte.

Increasing job parallelism

The number of simultaneous jobs (getting specs, checking connections, discovering schemas, and performing syncs) is limited by a few factors. First, SUBMITTER_NUM_THREADS (set in the .env file for your Kustomization overlay) provides a global limit on the number of simultaneous jobs that can run across all worker pods.

The number of worker pods can be changed by increasing the number of replicas for the airbyte-worker deployment. An example of a Kustomization patch that increases this number can be seen in airbyte/kube/overlays/dev-integration-test/kustomization.yaml and airbyte/kube/overlays/dev-integration-test/parallelize-worker.yaml. The number of simultaneous jobs on a specific worker pod is also limited by the number of ports exposed by the worker deployment, which is set by TEMPORAL_WORKER_PORTS in your .env file. If too few ports are available for communicating with connector pods, jobs will start to run but will hang until ports become available.

You can also tune environment variables for the max simultaneous job types that can run on the worker pod by setting MAX_SPEC_WORKERS, MAX_CHECK_WORKERS, MAX_DISCOVER_WORKERS, MAX_SYNC_WORKERS for the worker pod deployment (not in the .env file). These values can be used if you want to create separate worker deployments for separate types of workers with different resource allocations.
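The way these limits combine can be sketched as a simple upper bound: the global SUBMITTER_NUM_THREADS limit, or the total capacity of all worker pods, whichever is smaller. The per-pod capacity below is an assumed input (it depends on your port and MAX_*_WORKERS settings, which this sketch does not derive), and max_parallel_jobs is a hypothetical helper name.

```shell
# Upper bound on simultaneous jobs:
#   min(global limit, worker replicas * assumed jobs per worker pod)
# "max_parallel_jobs" is a hypothetical helper; the per-pod figure is an
# assumption you supply, not a value Airbyte reports.
max_parallel_jobs() {
  global_limit="$1"   # e.g. SUBMITTER_NUM_THREADS
  replicas="$2"       # airbyte-worker replica count
  per_pod="$3"        # assumed jobs one worker pod can run at once
  pod_capacity=$(( replicas * per_pod ))
  if [ "$pod_capacity" -lt "$global_limit" ]; then
    echo "$pod_capacity"
  else
    echo "$global_limit"
  fi
}
```

With a global limit of 10, two replicas and an assumed three jobs per pod, the bound is 6, so adding replicas would help; with a global limit of 5 and four replicas, the bound is 5, so SUBMITTER_NUM_THREADS is the constraint.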

Cloud logging

Airbyte writes logs to two directories. App logs, including server and scheduler logs, are written to the app-logging directory. Job logs are written to the job-logging directory. Both directories live at the top level of the log bucket; for example, the app-logging directory lives at s3://log-bucket/app-logging. These paths can change, so we recommend using a dedicated log bucket and not using it for other purposes.

Airbyte publishes logs every minute, so it is normal to see minute-long log delays. Each publish creates its own log file, since cloud storage services do not support append operations. This also means it is normal to see hundreds of files in your log bucket.

Each log file is named {yyyyMMddHH24mmss}_{podname}_{UUID} and is not compressed. Users can view logs simply by navigating to the relevant folder and downloading the file for the time period in question.
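Because the timestamp is the first underscore-separated field of each file name, a listing of log files can be narrowed to a time window with a short filter. The sketch below assumes the timestamp is a 14-digit yyyyMMddHHmmss value and that the input is bare file names, one per line; logs_between is a hypothetical helper name.

```shell
# Read log file names on stdin and keep those whose leading timestamp
# falls within [from, to], both given as yyyyMMddHHmmss.
# "logs_between" is a hypothetical helper name.
logs_between() {
  awk -F_ -v from="$1" -v to="$2" '$1 >= from && $1 <= to'
}
```

For example, piping a bucket listing through logs_between 20210615100000 20210615105959 keeps only the files published during that hour.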

See the Known Issues section for planned logging improvements.

Using an external DB

After Issue #3605 is completed, users will be able to configure a custom database instead of the simple Postgres container running directly in Kubernetes. This separate instance (preferably on a system like AWS RDS or Google Cloud SQL) should be easier and safer to maintain than Postgres on your cluster.

Known Issues

As we improve our Kubernetes offering, we would like to point out some common pain points. We are working on improving these. Please let us know if there are any other issues blocking your adoption of Airbyte or if you would like to contribute fixes to address any of these issues.

  • The server and scheduler deployments must run on the same node. (#4232)

  • Some UI operations have higher latency on Kubernetes than Docker-Compose. (#4233)

  • Logging to Azure Storage is not supported. (#4200)

  • Large log files might take a while to load. (#4201)

  • UI does not include configured buckets in the displayed log path. (#4204)

  • Logs are not reset when Airbyte is re-deployed. (#4235)

  • File sources reading from and file destinations writing to local mounts are not supported on Kubernetes.

Customizing Airbyte Manifests

We use Kustomize to allow overrides for different environments. Our shared resources are in the kube/resources directory, and we define overlays for each environment. We recommend creating your own overlay if you want to customize your deployments. This overlay can live in your own VCS.

Example kustomization.yaml file:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
bases:
- https://github.com/airbytehq/airbyte.git/kube/overlays/stable?ref=master

View Raw Manifests

For a specific overlay, you can run kubectl kustomize kube/overlays/stable to view the manifests that Kustomize will apply to your Kubernetes cluster. This is useful for debugging because it will show the exact resources you are defining.
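When the rendered output is long, a quick summary of resource kinds can make it easier to confirm that an overlay defines what you expect. This is a minimal sketch that only looks at top-level kind: lines; count_kinds is a hypothetical helper name.

```shell
# Read rendered manifests on stdin and print each top-level resource kind
# with the number of times it appears, sorted by kind.
# "count_kinds" is a hypothetical helper name.
count_kinds() {
  awk '/^kind:/ { n[$2]++ } END { for (k in n) print k, n[k] }' | sort
}
```

Usage: kubectl kustomize kube/overlays/stable | count_kinds.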

Helm Charts

We do not currently offer Helm charts. If you are interested in this functionality please vote on the related issue.

Operator Guide

View API Server Logs

Run kubectl logs deployments/airbyte-server to view the API server logs; add -f to stream them in real time. Logs can also be downloaded as a text file via the Admin tab in the UI.

View Scheduler or Job Logs

Run kubectl logs deployments/airbyte-scheduler to view the scheduler logs; add -f to stream them in real time. Logs can also be downloaded as a text file via the Admin tab in the UI.

Connector Container Logs

Although all logs can be accessed by viewing the scheduler logs, connector container logs may be easier to understand when isolated by accessing from the Airbyte UI or the Airbyte API for a specific job attempt. Connector pods launched by Airbyte will not relay logs directly to Kubernetes logging. You must access these logs through Airbyte.

Upgrading Airbyte Kube

See Upgrading K8s.

Resizing Volumes

To resize a volume, change the .spec.resources.requests.storage value. After re-applying, the mount should be extended if that operation is supported for your type of mount. For a production deployment, it's useful to track the usage of volumes to ensure they don't run out of space.
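As an alternative to editing the manifest and re-applying, the same field can be changed with kubectl patch. The sketch below only builds the patch document; pvc_resize_patch is a hypothetical helper name, and expansion still depends on the storage class allowing it.

```shell
# Print a JSON patch body that sets .spec.resources.requests.storage
# on a PersistentVolumeClaim to the given size.
# "pvc_resize_patch" is a hypothetical helper name.
pvc_resize_patch() {
  printf '{"spec":{"resources":{"requests":{"storage":"%s"}}}}\n' "$1"
}
```

It could then be applied with, for example, kubectl patch pvc airbyte-volume-configs -p "$(pvc_resize_patch 1Gi)".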

Copy Files To/From Volumes

See the documentation for kubectl cp.

Listing Files

kubectl exec -it airbyte-scheduler-6b5747df5c-bj4fx -- ls /tmp/workspace/8

Reading Files

kubectl exec -it airbyte-scheduler-6b5747df5c-bj4fx -- cat /tmp/workspace/8/0/logs.log
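The pod name in the commands above is specific to one deployment and changes whenever the scheduler pod is recreated. A minimal sketch that looks up the current name first follows; scheduler_pod and list_workspace are hypothetical helper names, and the sketch assumes a single scheduler pod in the current namespace.

```shell
# Print the name of the first pod whose name starts with "airbyte-scheduler-".
# Both helper names are hypothetical, not Airbyte tooling.
scheduler_pod() {
  kubectl get pods -o name | sed -n 's|^pod/\(airbyte-scheduler-.*\)$|\1|p' | head -n 1
}

# List the workspace directory for a given job ID on the scheduler pod.
list_workspace() {
  kubectl exec "$(scheduler_pod)" -- ls "/tmp/workspace/$1"
}
```

For example, list_workspace 8 lists the same directory as the Listing Files command above without hard-coding the pod name.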

Persistent storage on GKE regional cluster

Running Airbyte on a GKE regional cluster requires enabling persistent regional storage. To do so, enable the CSI driver on GKE. After enabling it, add storageClassName: standard-rwo to the volume-configs YAML.

volume-configs.yaml example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: airbyte-volume-configs
  labels:
    airbyte: volume-configs
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi
  storageClassName: standard-rwo

Troubleshooting

If you run into any problems operating Airbyte on Kubernetes, please reach out on the #issues channel on our Slack or create an issue on GitHub.

Developing Airbyte on Kubernetes

Read about the Kubernetes dev cycle!