Amazon Web Services (AWS) Big Data Solution
Big data refers to data that would typically be too expensive to store, manage, and analyze using traditional database systems, which is why organizations are moving towards cloud computing. Cloud computing offers access to data storage, processing, and analytics on a more scalable, flexible, cost-effective, and even more secure basis than can be achieved with an on-premises deployment. Instead of investing heavily in data centers or servers before you know how you’re going to use them, you pay only when you consume resources, and only for how much you consume: a pay-as-you-go model. One of the biggest cloud providers is Amazon, with its cloud platform Amazon Web Services (AWS). AWS offers many services that can help you build a solution for your big data challenges, and this article gives you a quick start on implementing such a solution for your organisation.
Let’s start the guide. To make programmatic calls to AWS or to use the AWS Command Line Interface (CLI), we need an AWS access key. When you create your access keys, you create the access key ID (for example, AKIAIOSFODNN7EXAMPLE) and the secret access key (for example, wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY) as a set. To create and download the access key and secret key from the AWS console you need the right permissions, so I am using the Administrator user to create the key for a user in the developer group. Log in to your console, go to IAM, open the user, switch to the security credentials tab, and hit “Create access key”.
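If you prefer to do this programmatically, a rough boto3 sketch is shown below; “dev-user” is a placeholder IAM user name, and the calling identity needs iam:CreateAccessKey permission (for example, the Administrator user mentioned above).

```python
# Hypothetical boto3 equivalent of the console steps above; "dev-user" is a
# placeholder IAM user name, not a value from the article.
import boto3

iam = boto3.client("iam")
response = iam.create_access_key(UserName="dev-user")

print(response["AccessKey"]["AccessKeyId"])
print(response["AccessKey"]["SecretAccessKey"])  # shown only once, store it safely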
Now, go to your terminal, install the AWS CLI (which is distributed as a Python package), and provide the access key.
pip install awscli
aws configure
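To confirm the credentials were picked up correctly, a quick sanity check with boto3 (the Python AWS SDK, installed separately with pip install boto3) might look like this:

```python
# Minimal sanity check that the configured credentials work; it prints the
# account ID and ARN of the identity making the call.
import boto3

identity = boto3.client("sts").get_caller_identity()
print(identity["Account"], identity["Arn"])
```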
Now let's send some data from a Kafka topic. If you want to create your own Kafka server, you can go to this article. Using the code below, you can read data from the Kafka topic and send it to an S3 bucket. You can find the code here.
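As a minimal sketch of that idea (not the exact script linked above), a consumer built with kafka-python and boto3 could look like this; the topic name, broker address, and bucket name are placeholders.

```python
# A minimal Kafka-to-S3 forwarder sketch; topic, broker, and bucket names are
# placeholders, not values from the article.
import json
import boto3
from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "events",                               # hypothetical topic name
    bootstrap_servers="localhost:9092",     # your Kafka broker(s)
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

s3 = boto3.client("s3")
bucket = "my-kafka-landing-bucket"          # hypothetical bucket name

for message in consumer:
    # Write each record as a small JSON object; in practice you would batch
    # records before uploading to keep the object count manageable.
    s3.put_object(
        Bucket=bucket,
        Key=f"raw/events/record-{message.offset}.json",
        Body=json.dumps(message.value),
    )
```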
Check the data in the bucket using the AWS Console.
Using an AWS Glue crawler, we can connect to a data store, progress through a prioritized list of classifiers to determine the schema of the data, and then create metadata tables in the Glue Data Catalog.
- Go to AWS Glue, click Crawlers in the left pane, and then Add crawler.
- Give the crawler a name and click Next.
- Select Data stores and Crawl all folders.
- Give the path of the S3 bucket.
- Choose No for adding another data store.
- Choose an existing IAM role, or create a new one that provides access to the S3 bucket.
- For frequency, choose Run on demand.
- Create a database, click Next, and then Finish.
Now run the crawler. It will take a few minutes to complete.
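If you prefer scripting these steps instead of clicking through the console, a rough boto3 equivalent is sketched below; the crawler name, IAM role, database name, and S3 path are placeholders.

```python
# Hypothetical boto3 equivalent of the console steps above; names, role, and
# S3 path are placeholders.
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="kafka-landing-crawler",
    Role="AWSGlueServiceRole-demo",          # role with read access to the bucket
    DatabaseName="kafka_landing_db",
    Targets={"S3Targets": [{"Path": "s3://my-kafka-landing-bucket/raw/"}]},
)

glue.start_crawler(Name="kafka-landing-crawler")
```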
Go to Databases in the left pane, select the database name, open Tables in DB_Name, click the table name, and scroll down to see the schema.
You can query the data using AWS Athena, which is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. You can read more about it here.
Go to Athena using the console. Before you can run a query, you need to set up a query result location in S3. Create an S3 bucket, then go to the settings at the top right and add the S3 path. We can now query the dataset.
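The same query can also be submitted programmatically. Here is a minimal boto3 sketch; the database, table, and result bucket are placeholders matching the earlier examples.

```python
# A minimal sketch of running an Athena query with boto3; database, table,
# and result location are placeholders.
import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT * FROM kafka_landing_db.raw LIMIT 10;",
    QueryExecutionContext={"Database": "kafka_landing_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)

# Athena runs queries asynchronously; check the state before reading results.
execution_id = response["QueryExecutionId"]
status = athena.get_query_execution(QueryExecutionId=execution_id)
print(status["QueryExecution"]["Status"]["State"])
```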
Now, you can create a Redshift cluster as the data warehousing solution using this documentation.
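A hedged boto3 sketch of creating a small cluster is shown below; the cluster identifier, credentials, and node type are placeholders, and the console or the linked documentation works just as well.

```python
# Hypothetical sketch for creating a small Redshift cluster; identifiers and
# credentials are placeholders.
import boto3

redshift = boto3.client("redshift")

redshift.create_cluster(
    ClusterIdentifier="bigdata-demo-cluster",
    ClusterType="single-node",
    NodeType="dc2.large",
    DBName="dev",
    MasterUsername="awsuser",
    MasterUserPassword="ChangeMe1234",   # use Secrets Manager in practice
)
```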
Once your cluster is available, we can create a table using the query editor.
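If you would rather not use the query editor, the Redshift Data API can run the same DDL from Python. The sketch below assumes the placeholder cluster, database, user, and table definition from the earlier examples.

```python
# A minimal sketch of creating a table with the Redshift Data API; the cluster,
# database, user, and column definitions are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="bigdata-demo-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="""
        CREATE TABLE IF NOT EXISTS events (
            event_id   VARCHAR(64),
            event_type VARCHAR(64),
            event_time TIMESTAMP
        );
    """,
)
```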
Let’s create a Glue job to load the data from the S3 bucket into the Redshift table.
Before that, we need to create a connection in the Glue console. Go to Glue and, in the left pane, click Connections.
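For reference, the same JDBC connection can be created with boto3; in the sketch below the connection name, JDBC URL, credentials, and VPC details are all placeholders.

```python
# Hypothetical sketch of creating the JDBC connection with boto3; every value
# below is a placeholder for your own cluster and VPC.
import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "redshift-demo-connection",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://bigdata-demo-cluster.xxxx.us-east-1.redshift.amazonaws.com:5439/dev",
            "USERNAME": "awsuser",
            "PASSWORD": "ChangeMe1234",
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0123456789abcdef0",
            "SecurityGroupIdList": ["sg-0123456789abcdef0"],
            "AvailabilityZone": "us-east-1a",
        },
    }
)
```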
Now create a Glue job. There are a few ways to generate a Glue script; one is to use the AWS-generated script, which works well if you don’t have any complex transformations or logic to apply before storing the data in the Redshift cluster.
You also need to set up an Amazon S3 VPC endpoint, which you can do by going to VPC -> Endpoints.
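A hedged boto3 sketch of creating that gateway endpoint is below; the VPC ID, route table ID, and region in the service name are placeholders.

```python
# Hypothetical sketch of creating the S3 gateway endpoint; VPC ID, route table
# ID, and region are placeholders.
import boto3

ec2 = boto3.client("ec2")

ec2.create_vpc_endpoint(
    VpcEndpointType="Gateway",
    VpcId="vpc-0123456789abcdef0",
    ServiceName="com.amazonaws.us-east-1.s3",
    RouteTableIds=["rtb-0123456789abcdef0"],
)
```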
Let’s create the Glue job now. I am using a custom Glue script; you can find the code here. I saved that script in an S3 bucket called “new-glue-script”.
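As a rough sketch of what such a custom Glue (PySpark) script can look like (not the exact script linked above), the database, table, connection, and Redshift table names below are placeholders carried over from the earlier examples.

```python
# Sketch of a Glue PySpark job that loads the crawled S3 data into Redshift;
# database, table, connection, and column names are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME", "TempDir"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read from the table the crawler created in the Data Catalog.
source = glue_context.create_dynamic_frame.from_catalog(
    database="kafka_landing_db",
    table_name="raw",
)

# Map the source columns onto the Redshift table's columns.
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("event_id", "string", "event_id", "string"),
        ("event_type", "string", "event_type", "string"),
        ("event_time", "string", "event_time", "timestamp"),
    ],
)

# Write to Redshift through the Glue connection created earlier.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-demo-connection",
    connection_options={"dbtable": "events", "database": "dev"},
    redshift_tmp_dir=args["TempDir"],
)

job.commit()
```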
Let’s check the data in Redshift.
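Besides the query editor, a small check can be scripted with the Redshift Data API; the sketch below reuses the same placeholder cluster, database, user, and table names as before.

```python
# Small sketch to confirm rows landed in the placeholder "events" table.
import time
import boto3

redshift_data = boto3.client("redshift-data")

statement = redshift_data.execute_statement(
    ClusterIdentifier="bigdata-demo-cluster",
    Database="dev",
    DbUser="awsuser",
    Sql="SELECT COUNT(*) FROM events;",
)

# The Data API is asynchronous, so wait for the statement to finish.
while redshift_data.describe_statement(Id=statement["Id"])["Status"] not in (
    "FINISHED",
    "FAILED",
    "ABORTED",
):
    time.sleep(1)

result = redshift_data.get_statement_result(Id=statement["Id"])
print(result["Records"][0][0]["longValue"])
```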
So, this is how you can quick-start your big data solution on AWS.