In this article, you will find a step-by-step quick-start guide on how to send messages from an Apache Kafka topic to Google Cloud Platform using an Apache Beam data pipeline running on Dataflow, and how to build a cloud-hosted data lake and data warehouse for big data analytics.
I already have messages on my Kafka server; if you want to learn how to move database records to Kafka, please go through my other article here.
First, we will move the data to Google Cloud Storage (GCS), a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. For that, you need a GCP account; Google provides a free tier and you can sign up here. Once you have an account, you can create a bucket by following the GCP guide. You will also need a service account key to connect to Google Cloud services; use this guide to create one. I have created a bucket called kafka_messgages with a folder called customers where I will store the files.
Now that we have the service account JSON key, let's push data to the bucket. I will convert the data into Parquet format before loading it into the bucket. In the code, you need to provide the bucket name, the path to your service account key, and the Kafka server address.
Get the code here. Let's check if the messages have arrived in the bucket.
You can see we have three new folders with timestamps, and inside each we have a data.parquet file. Now, let's create a Dataflow job that will move the data from Cloud Storage to BigQuery. First, you need to create a table in BigQuery. I have created a dataset called kafka and, inside it, a table called customers.
In order to run an Apache Beam Dataflow job from your terminal, you need to install the Cloud SDK from here. Then create a virtual environment with python3 -m venv gcp, activate it with source gcp/bin/activate, and install the dependencies:
pip install apache-beam
pip install google-cloud-storage
pip install "apache-beam[gcp]"
You can find the code here and edit the file.
Now, let's run the Python file.
Now go to the GCP console, then to Dataflow; you should see a new job running.
It will take around 4–5 minutes for the job to run. Once it is complete, let's check BigQuery to see whether the data was successfully loaded.
That should give you a quick start on how to migrate your data to Google Cloud Platform.