How to run Meltano in a container on Google Cloud Composer

Mark MacArdle
Data@ManyPets
Jan 6, 2022

Run Meltano pipelines with Airflow's KubernetesPodOperator

Meltano is an open source tool for extracting data from sources and loading it into destinations like your data warehouse. It uses extractors and loaders written in the Singer open source standard. Singer has been around for a long time so there’s a huge list of available extractors written by numerous contributors. At ManyPets we began using Meltano when neither Fivetran, Stitch, nor any other ETL provider we could find supported our telephone system, called Purecloud. Luckily there’s a Singer extractor for it called tap-purecloud which Meltano can run.

Cloud Composer is a managed service for running Airflow on GCP. Meltano can be run in a container and Airflow has operators for triggering containers so this is a good way to productionise your Meltano pipelines.

As someone new to running containers I struggled a lot to get my first pipeline working like this. Meltano is pretty new and doesn’t have that many how-tos out there so this post is for anyone else struggling! This post will cover:

  • Getting your Meltano pipeline set up
  • Running it in a container and how to handle variables and secrets
  • Getting that container on GCP
  • Running the container in an Airflow DAG with Cloud Composer

Setup

You need three basic things to run a pipeline with Meltano: a Meltano project, an extractor to pull data, and a loader to send it to your warehouse. This post won’t cover setting up a project or extractor as Meltano have a good getting started guide for that. I’ll be referring to the tap-purecloud extractor as an example in this post but the same steps apply to any extractor.

For a loader, if you’re using Cloud Composer and GCP you’re also likely using BigQuery as your warehouse so this post will cover some setup for the target-bigquery loader.

BigQuery loader and running a pipeline locally

When setting up the target-bigquery loader I only set a few of the configs and will be using environment variables (env vars) to supply the rest at run time. More on that later.

Note: all terminal code shown here is run from the root of the Meltano project directory.

meltano add loader target-bigquery
meltano config target-bigquery set location EU
meltano config target-bigquery set add_metadata_columns true

target-bigquery has 4 compulsory configs. One is location, which is set above as I know for my use case I won’t ever have to change it. There’s also project_id and credentials_path which we’ll set with env vars later as they will vary. Finally there’s dataset_id which, despite being compulsory, we actually won’t set as this will cause it to default to the namespace config value of the extractor (so make sure you’ve set that if you haven’t already). This approach allows you to have data from different extractors uploaded to different BigQuery datasets (called schemas in other databases). If you do set the dataset_id config then all extractors would upload to the same dataset. The extractor I’m using is tap-purecloud and I’ve set its namespace config to purecloud_landing_zone.

With an extractor and the target-bigquery loader set up you should be able to run your pipeline locally with a command like meltano elt tap-purecloud target-bigquery.

Other odds and ends

To keep following this guide you’ll need to have Docker installed locally, so you can create containers, as well as the Google Cloud SDK, so you can use gcloud commands. You can check if you have them already by running docker --version and gcloud --version and seeing if your terminal outputs a version or an error.

This guide was tested only on Composer version 1.x (note that Composer and Airflow itself have different versions; this guide will work with v1 or v2 of Airflow). If you’re using Composer v2.x then some modifications may be needed for the gcloud commands and Airflow operator we’ll use. The Cloud Composer docs let you switch between the commands and examples for v1 and v2.

A value we’re going to use a lot is the name of the GCP project we’re working on. Let us set it as an env var in our terminal so we don’t have to keep typing it.

export GCP_PROJECT=my-project-name

If you’re unfamiliar with env vars they’re variables you can set in your terminal and reference with the $ sign. Eg echo $GCP_PROJECT will output our project name. We’ll add more useful env vars as we progress.

Containerising your Meltano project

If you’ve gotten a pipeline running locally then, before involving Airflow and GCP, the next thing to do is get it running locally in a Docker container. Meltano provide a command to create starter Dockerfile and .dockerignore files:

meltano add files docker

Passing secrets to the container

When setting up a Meltano extractor or loader it’ll store any config values with a type of secret (a secret is any confidential value, eg a password) in a .env file. This is handy and something I use when running outside a container, but we shouldn’t include the .env file on the container. It’s considered bad practice to store secrets on a container image as you can’t be sure where an image may end up. The .env file is included by default in the .dockerignore file Meltano generates to help you avoid doing this.

Instead we’ll pass any secret configs to the container as env vars. For any secret files, like the service account keyfile target-bigquery needs, we’ll “mount” these into the container when we run it. If you’re keeping your keyfile in your Meltano project directory make sure you’re excluding it in the .dockerignore file.

Passing variables to the container

For config values that may change it’s also best to pass these as env vars to the container, rather than hardcode them in the meltano.yml file. This allows you to define them in your Airflow DAG and then pass them to the task that calls the container. For example we do this for the project_id variable for the target-bigquery loader because we may run the pipeline on our development or production environments.

To see what Meltano expects the env var for a given config to be called run the below (replacing target-bigquery with a different extractor or loader if you need).

meltano config target-bigquery list

From this we find the name of the env var for project_id should be TARGET_BIGQUERY_PROJECT_ID.

Build the container

You shouldn’t need to edit the Dockerfile, so you can now build a container image locally with a command like the one below. An image is a static, executable file which runs the container’s code. We used meltano_bbm for the tag here but it can be anything.

docker build --tag meltano_bbm .

If you make any changes to your Dockerfile or .dockerignore then re-run this command to update the image.

Running the container locally

You can then run the container locally with a command like below. The env vars the container will have are defined with -e and the mounting of the keyfile is done with --mount. The reason for not passing the keyfile’s content as an env var is that the target-bigquery loader can only use a file for credentials. I’ve used /var/keyfile.json as the path where the keyfile will be stored in the container but you could put it anywhere.

The final line is the arguments that’ll be passed to the meltano command when the container has started up. They’re passed to the meltano command because it’s defined as the “entrypoint” for the container in the Dockerfile.

docker run \
--mount type=bind,src=/absolute/path/to/service-account/keyfile.json,dst=/var/keyfile.json \
-e TAP_PURECLOUD_CLIENT_SECRET=$TAP_PURECLOUD_CLIENT_SECRET \
-e TAP_PURECLOUD_START_DATE=2021-12-16 \
-e TARGET_BIGQUERY_PROJECT_ID=$GCP_PROJECT \
-e GOOGLE_APPLICATION_CREDENTIALS=/var/keyfile.json \
meltano_bbm \
elt tap-purecloud target-bigquery --job_id=purecloud_to_bigquery

Note that the way the keyfile is mounted here is technically a little different to how KubernetesPodOperator is going to do it in Airflow. Here a bind mount is used while KubernetesPodOperator will use a volume mount. I don’t believe there’s any difference in effect for this use case. I used the bind mount here just because I couldn’t figure out how to get the volume mount working right.

Debug by launching a shell on the container

Initially I had a lot of issues understanding what env vars were on the container, so a top tip for debugging is to launch a shell on the container at start up instead of having it run Meltano.

You do this by overriding the defined entrypoint with the --entrypoint=bash argument. Also, using the -it flag means an interactive container will launch instead of the shell just starting and immediately exiting. You can use these extra arguments while still setting all env vars as before. A difference from a normal run is that you won’t need to supply any arguments to Meltano, as you’re not running a pipeline.

docker run \
--mount type=bind,src=/absolute/path/to/service-account/keyfile.json,dst=/var/keyfile.json \
-e TAP_PURECLOUD_CLIENT_SECRET=$TAP_PURECLOUD_CLIENT_SECRET \
-e TAP_PURECLOUD_START_DATE=2021-12-16 \
-e TARGET_BIGQUERY_PROJECT_ID=$GCP_PROJECT \
-e GOOGLE_APPLICATION_CREDENTIALS=/var/keyfile.json \
--entrypoint=bash \
-it \
meltano_bbm

Putting the container image and secrets on GCP

Now that we have a working container we need to make it and the secrets it needs accessible in our GCP project.

Artifact Registry

Artifact Registry is an expanded, newer version of the much better named Container Registry. It does everything Container Registry did and more. We’ll store our container image in it so that other GCP tools can download the image from it.

First create a repository for the image. In this case repository means a directory of container images. Here meltano-repo will be the name of the repo. Change it or the location if you need.

gcloud artifacts repositories create meltano-repo --repository-format=docker --location=europe-west2

Next we need to authorise our locally running docker to be able to push to our new repo. If you used a different location to europe-west2 above you need to change that here too.

gcloud auth configure-docker europe-west2-docker.pkg.dev

We then need to tag our image so it gets pushed to the right repo. You need to do this every time you rebuild the image.

docker tag meltano_bbm europe-west2-docker.pkg.dev/$GCP_PROJECT/meltano-repo/meltano_bbm

Finally actually push the image to our Artifact Registry repo.

docker push europe-west2-docker.pkg.dev/$GCP_PROJECT/meltano-repo/meltano_bbm

Kubernetes setup

Kubernetes is a tool for managing running containers. GCP has a managed service for this called Google Kubernetes Engine (GKE). If you’re using Cloud Composer you won’t need to set up a cluster from scratch as Composer itself runs on GKE so one will already exist.

What we will do is create a new node-pool within that cluster for our Meltano container to run on. When you run the Airflow DAG that will trigger the container, the DAG will take up one pod on a node and the container will need to run in a second. As this will take up at least two slots, it’s recommended to create a new node-pool within the Composer GKE cluster for the Meltano container, which avoids it competing for resources with Composer itself.

To create the node pool we first need the name and zone of our Composer Kubernetes cluster. We can find these by going to Kubernetes Engine in the Google Cloud UI and clicking into the cluster for Composer. Set these values as env vars in your terminal as we’ll have to reuse them.

export COMPOSER_GKE_NAME=europe-west2-data-warehouse-123abc-gke
export COMPOSER_GKE_ZONE=europe-west2-c

Then we can run a command like below to create a pool called meltano-pool. We enable autoscaling to avoid this new node pool costing us money when it’s not in use. For machine-type you can use whatever you want. I chose e2-standard-2 as it was one of the cheaper options and my pipeline doesn’t need to import much data.

gcloud container node-pools create meltano-pool \
--project=$GCP_PROJECT \
--cluster=$COMPOSER_GKE_NAME \
--zone=$COMPOSER_GKE_ZONE \
--machine-type=e2-standard-2 \
--enable-autoupgrade \
--num-nodes=1 \
--enable-autoscaling \
--min-nodes=0 \
--max-nodes=3 \
--disk-size=20

The last thing to do here is put our secrets on the Composer Kubernetes cluster so we can easily pass them to the container when we trigger it in Airflow.

Get credentials which, in future commands, kubectl will automatically use to connect to the Composer cluster.

gcloud container clusters get-credentials $COMPOSER_GKE_NAME \
--project=$GCP_PROJECT \
--zone=$COMPOSER_GKE_ZONE

When setting a secret I first delete any existing one of the same name to allow it to be updated.

export PURECLOUD_SECRET_NAME=purecloud-api-secret
kubectl delete secret $PURECLOUD_SECRET_NAME --ignore-not-found

Then actually set the secret. As I set the Purecloud client secret in the Meltano extractor’s config when setting it up, the value for it is saved in my .env file under TAP_PURECLOUD_CLIENT_SECRET. Due to the formatting of the .env file, running source .env makes any values defined in it accessible in my current terminal session.

source .env
kubectl create secret generic $PURECLOUD_SECRET_NAME \
--from-literal purecloud_secret=$TAP_PURECLOUD_CLIENT_SECRET

For use with target-bigquery we need to make the value of a service account keyfile a secret in Kubernetes. The service account needs permission to create datasets and modify tables, so I gave it the BigQuery User role in IAM. You could be more restrictive than this though.

The commands to make a secret from a file are nearly the same as creating from a passed value except instead of kubectl create … --from-literal … we use kubectl create … --from-file ….

export SA_SECRET_NAME=meltano-service-account
kubectl delete secret $SA_SECRET_NAME --ignore-not-found
kubectl create secret generic $SA_SECRET_NAME \
--from-file meltano_serv_acc_keyfile.json=./path/to/serv_acc.json

Running on Composer

We’re nearly there! Now we just need to actually trigger the running of our container, which we’ll do with the KubernetesPodOperator. Below is an example of the DAG we use to run our Purecloud to BigQuery pipeline. It shows how we pass normal env vars as well as a secret one. It also shows how the secret for the service account keyfile is handled differently, being mounted as a volume instead of passed as an env var.
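
Here’s a minimal sketch of what such a DAG can look like, using the example values from earlier in this post (the project name, image path, keyfile mount path, schedule and DAG/task names are placeholders to adjust for your own setup). The imports are for Airflow 2; on Airflow 1.10 the operator and Secret class live under airflow.contrib instead.

from datetime import datetime

from airflow import DAG
from airflow.kubernetes.secret import Secret
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

GCP_PROJECT = "my-project-name"  # placeholder, set to your GCP project

# Purecloud client secret, exposed to the container as an env var
purecloud_secret = Secret(
    deploy_type="env",
    deploy_target="TAP_PURECLOUD_CLIENT_SECRET",
    secret="purecloud-api-secret",
    key="purecloud_secret",
)

# Service account keyfile, mounted into the container as a file
keyfile_secret = Secret(
    deploy_type="volume",
    deploy_target="/var/secrets/google",  # directory the keyfile is mounted into
    secret="meltano-service-account",
    key="meltano_serv_acc_keyfile.json",
)

with DAG(
    dag_id="purecloud_to_bigquery",
    start_date=datetime(2021, 12, 16),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    run_meltano = KubernetesPodOperator(
        task_id="meltano_elt",
        name="meltano-purecloud-to-bigquery",
        namespace="default",
        image=f"europe-west2-docker.pkg.dev/{GCP_PROJECT}/meltano-repo/meltano_bbm:latest",
        # These are passed to the meltano entrypoint, just like with docker run
        arguments=[
            "elt",
            "tap-purecloud",
            "target-bigquery",
            "--job_id=purecloud_to_bigquery",
        ],
        # Plain (non-secret) env vars
        env_vars={
            "TAP_PURECLOUD_START_DATE": "2021-12-16",
            "TARGET_BIGQUERY_PROJECT_ID": GCP_PROJECT,
            "GOOGLE_APPLICATION_CREDENTIALS": "/var/secrets/google/meltano_serv_acc_keyfile.json",
        },
        # The secret env var and secret volume defined above
        secrets=[purecloud_secret, keyfile_secret],
        # Run the pod on the node pool created earlier. On newer versions of the
        # cncf.kubernetes provider this may need to be a kubernetes V1Affinity
        # object rather than a dict.
        affinity={
            "nodeAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "cloud.google.com/gke-nodepool",
                                    "operator": "In",
                                    "values": ["meltano-pool"],
                                }
                            ]
                        }
                    ]
                }
            }
        },
        is_delete_operator_pod=True,
        get_logs=True,
    )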

A nice feature of KubernetesPodOperator is that it doesn’t just start the container and then finish the task. It’ll both wait for the container to finish and print out any logs coming from it. This makes running a container just like running any other Airflow DAG step. It was for these reasons we chose this approach over using Cloud Run (GCP’s on-demand container service).

We’re growing fast here at ManyPets 🚀. This means there are a lot more data challenges to solve and we’d love your help to do it 😄. See our careers page to come join us!
