Apache Kafka and Google Cloud Platform Guide

In this article, you will find a step-by-step quick-start guide on how to send messages from an Apache Kafka topic to Google Cloud Platform using an Apache Beam data pipeline running on Dataflow, and how to build a data lake and data warehouse hosted on the cloud for big data analytics.

I already have messages on my Kafka server; if you want to learn how to move database records to Kafka, please go through my other article here.

So first we will move the data to Google Cloud Storage (GCS), a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. For that, you need an account on GCP; Google provides a free tier, and you can create an account here. Once you have the account, create a bucket using the GCP guide. You will also need to create a service account to get the key that is required to connect to Google Cloud services; use this guide for that. I have created a bucket called kafka_messgages and a folder called customers where I will be storing the files.
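If you prefer to create the bucket from code rather than the console, a minimal sketch with the google-cloud-storage client might look like the following. The key file name and location here are placeholders for illustration; the bucket name is the one used in this walkthrough.

```python
from google.cloud import storage

# Authenticate with the service account key downloaded from the GCP console
# ("gcp-service-account.json" is a placeholder path; use your own key file).
client = storage.Client.from_service_account_json("gcp-service-account.json")

# Create the bucket that will hold the Parquet files coming out of Kafka.
bucket = client.create_bucket("kafka_messgages", location="US")
print(f"Created bucket {bucket.name}")
```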

Now that we have the service account JSON key, let's push data to the bucket. I will be converting the data into Parquet format before loading it into the bucket. In this code, you need to provide the bucket name, the path of your service account key, and the Kafka server.
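As a rough sketch of the flow (not the exact linked code; the topic name, key path, and field layout below are illustrative assumptions), the script consumes a batch of messages from Kafka, writes them to a local Parquet file with pandas, and uploads that file to the bucket:

```python
import json
import time

import pandas as pd
from google.cloud import storage
from kafka import KafkaConsumer

BUCKET = "kafka_messgages"
KEY_PATH = "gcp-service-account.json"   # placeholder path to your service account key
KAFKA_SERVER = "localhost:9092"         # your Kafka bootstrap server
TOPIC = "customers"                     # illustrative topic name

# Consume a batch of JSON messages from the Kafka topic.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=[KAFKA_SERVER],
    auto_offset_reset="earliest",
    consumer_timeout_ms=10000,
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)
records = [message.value for message in consumer]

# Convert the batch to Parquet (requires pyarrow to be installed).
df = pd.DataFrame(records)
local_file = "data.parquet"
df.to_parquet(local_file)

# Upload the Parquet file to GCS under a timestamped folder.
client = storage.Client.from_service_account_json(KEY_PATH)
blob = client.bucket(BUCKET).blob(f"customers/{int(time.time())}/data.parquet")
blob.upload_from_filename(local_file)
```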

Get the code here. Let’s check if we have the messages in the bucket.

You can see we have three new folders with timestamps, and inside each is a data.parquet file. Now, let's create a Dataflow job that will move the data from Cloud Storage to BigQuery. First, you need to create a table in BigQuery. I have created a dataset called kafka and, inside that, a table called customers.
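Creating the dataset and table is easiest in the BigQuery UI; an equivalent sketch with the google-cloud-bigquery client would look like this (the schema fields below are assumed for illustration; use the columns that match your Kafka messages):

```python
from google.cloud import bigquery

# Reuse the same service account key (placeholder path).
client = bigquery.Client.from_service_account_json("gcp-service-account.json")

# Create the dataset that will hold the Kafka data.
client.create_dataset("kafka", exists_ok=True)

# Create the customers table; this schema is only an example.
schema = [
    bigquery.SchemaField("customer_id", "INTEGER"),
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("email", "STRING"),
]
table = bigquery.Table(f"{client.project}.kafka.customers", schema=schema)
client.create_table(table, exists_ok=True)
```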

In order to run an Apache Beam Dataflow job from your terminal, you need to install the Cloud SDK from here. Then create a virtual environment with python3 -m venv gcp, activate it with source gcp/bin/activate, and then run pip install apache-beam.

Then run pip install "google-cloud-storage" and pip install "apache-beam[gcp]".

You can find the code here and edit the file with your own project, bucket, and table details.
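For reference, a stripped-down sketch of such a pipeline might look like this: it reads the Parquet files from the bucket and writes the records to the BigQuery table on Dataflow. The project ID, region, paths, and schema below are placeholders, not the exact values from the linked code.

```python
import apache_beam as beam
from apache_beam.io.parquetio import ReadFromParquet
from apache_beam.options.pipeline_options import PipelineOptions

# Pipeline options for running on Dataflow; replace the placeholders
# with your own project, region, and bucket.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://kafka_messgages/temp",
    job_name="kafka-parquet-to-bigquery",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # Read every Parquet file under the customers folder in the bucket.
        | "ReadParquet" >> ReadFromParquet("gs://kafka_messgages/customers/*/data.parquet")
        # Each Parquet record comes out as a dict, which WriteToBigQuery accepts directly.
        | "WriteToBQ" >> beam.io.WriteToBigQuery(
            "my-gcp-project:kafka.customers",
            schema="customer_id:INTEGER,name:STRING,email:STRING",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```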

Now, let’s run the Python file to submit the job.

Now go to the GCP console and then to Dataflow; you should see a new job running.

It will take around 4–5 minutes to run the job. Once it is completed, let’s check Google BigQuery to see if the data was successfully loaded.
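If you want to verify this programmatically rather than in the BigQuery UI, a quick row count with the BigQuery client (again using the same placeholder key path) would be:

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("gcp-service-account.json")

# Count the rows that the Dataflow job loaded into the customers table.
query = "SELECT COUNT(*) AS row_count FROM `kafka.customers`"
for row in client.query(query).result():
    print(f"Rows loaded: {row.row_count}")
```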

That should give you a quickstart on how to migrate your data to Google Cloud Platform.
