This is an End-To-End Data Engineering Project on Real-Time Stock Market Data, using AWS and Kafka.
- Apache Kafka: This is an event streaming platform used to collect, process, store, and integrate data at scale. Its Python client was used to stream the data received from the web API (a minimal producer sketch follows this list).
- AWS EC2: This is a cloud-based computing service that allows users to quickly launch virtual servers and manage storage, security, and networking from an easy-to-use dashboard. It was used to host the Kafka server for this project.
- AWS Lambda: This is a serverless compute service that lets you run code without provisioning or managing any servers. Lambda functions can run code in response to events such as changes in S3 or DynamoDB, scheduled (cron) jobs, etc. For this project, it was used to run the data-scraping function, triggered every hour.
- AWS CloudWatch: This is a monitoring and management service that provides data and actionable insights for AWS, hybrid, and on-premises applications and resources. For this project, it was used to collect and aggregate metrics and logs from the Lambda function. The performance events are ingested as CloudWatch Logs to simplify monitoring and troubleshooting.
- AWS Serverless Application Model (SAM): This is a framework designed to make the creation, deployment, and execution of serverless applications as simple as possible, using AWS SAM templates. For this project, it was used to deploy the Lambda function.
- AWS S3: This is a highly scalable object storage service that stores data as objects within buckets. It is commonly used to store and distribute large media files, data backups, and static website files. For this project, it is used to store the data streamed from Kafka.
- AWS Glue: Glue Crawler is a fully managed service that automatically crawls your data sources, identifies the data, and infers schemas to create an AWS Glue Data Catalog. The Data Catalog is a fully managed metadata repository that makes it easy to discover and manage the crawled data, and it allows us to query the data directly from S3 without loading it first.
- AWS Athena: Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. For this project, it is used to analyze the data in the Glue Data Catalog or in other S3 buckets.
- Python
- YAML (for AWS SAM templates)
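
As a rough illustration of the producer side mentioned above, here is a minimal sketch assuming the kafka-python client; the broker address, topic name, and record fields are placeholders rather than the project's actual values.

```python
# Minimal kafka-python producer sketch; broker address, topic, and record shape are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="<EC2-PUBLIC-IP>:9092",                    # Kafka broker running on the EC2 instance
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),    # serialize dicts to JSON bytes
)

# Example record shaped like a scraped stock quote (illustrative only).
record = {"symbol": "ACME", "price": 123.45, "timestamp": "2024-01-01T00:00:00Z"}
producer.send("stock_market_data", value=record)
producer.flush()  # block until the message is actually delivered
```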
Data was scraped from https://edition.cnn.com/markets using the Python requests module. The website does not provide a public API, so it was reverse engineered to find the underlying endpoints. Postman was then used to analyze the endpoints.
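
The exact endpoint discovered through reverse engineering is not reproduced here; the snippet below only sketches the general pattern of calling a JSON endpoint found in the browser's network tab with requests. The URL, parameters, and headers are placeholders.

```python
# Sketch of calling a reverse-engineered JSON endpoint; the URL and params are placeholders.
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # many endpoints reject requests without a browser-like user agent

def fetch_quotes(url: str, params: dict) -> dict:
    """Call the JSON endpoint and return the parsed payload."""
    response = requests.get(url, params=params, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    data = fetch_quotes("https://example.com/markets/api/quotes", {"symbols": "AAPL,MSFT"})
    print(data)
```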
Created an EC2 instance (Amazon Linux 2 AMI) to run the Python consumer file and host Kafka
Installed AWS SAM and created the Lambda function using a SAM template. The sample template used can be found here
Tested the Lambda function before enabling the hourly cron schedule
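
For orientation, here is a hedged sketch of what the hourly Lambda handler could look like: scrape the endpoint, apply a simple transformation, and hand each record to the Kafka producer. The environment variable names, endpoint, topic, and field names are illustrative; the actual code generated through SAM is linked below.

```python
# Illustrative Lambda handler: scrape -> transform -> produce to Kafka.
# Environment variables, endpoint, topic, and field names are placeholders, not the project's real values.
import json
import os

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BOOTSTRAP"],  # e.g. "<EC2-PUBLIC-IP>:9092"
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def transform(raw: dict) -> dict:
    """Keep only the fields we care about (illustrative)."""
    return {"symbol": raw.get("symbol"), "price": raw.get("current_price")}

def lambda_handler(event, context):
    response = requests.get(os.environ["SCRAPE_URL"], timeout=10)
    response.raise_for_status()
    for raw in response.json().get("quotes", []):
        producer.send(os.environ.get("KAFKA_TOPIC", "stock_market_data"), value=transform(raw))
    producer.flush()
    return {"statusCode": 200}
```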
Installed Kafka and Java on the EC2 instance, and ran Kafka and ZooKeeper in the background to test the producer and consumer clients
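
Before wiring up the full pipeline, a quick way to check that the broker on the EC2 instance is reachable from Python is to list its topics with kafka-python; the broker address below is a placeholder.

```python
# Quick broker connectivity check with kafka-python; the address is a placeholder.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="<EC2-PUBLIC-IP>:9092")
print(consumer.topics())  # prints the set of topics currently known to the broker
consumer.close()
```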
Created an S3 bucket and an IAM role to enable access to S3 from the instance
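
The bucket was set up through the console, but the same step can be sketched with boto3; the bucket name and region below are placeholders.

```python
# Sketch of creating the S3 bucket and verifying write access with boto3.
# The bucket name and region are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="stock-market-kafka-demo")  # no LocationConstraint needed in us-east-1
s3.put_object(Bucket="stock-market-kafka-demo", Key="healthcheck.json", Body=b"{}")  # smoke-test write
```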
Created and ran a crawler on AWS Glue. This includes creating a database, choosing a data source, and creating an IAM role that allows Glue to access S3
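
The crawler was created in the Glue console; the equivalent API calls look roughly like this (crawler name, role ARN, database name, and S3 path are placeholders).

```python
# Sketch of creating and starting a Glue crawler with boto3; names, role, and path are placeholders.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="stock-market-crawler",
    Role="arn:aws:iam::123456789012:role/GlueS3AccessRole",  # IAM role that allows Glue to read the bucket
    DatabaseName="stock_market_db",
    Targets={"S3Targets": [{"Path": "s3://stock-market-kafka-demo/"}]},
)
glue.start_crawler(Name="stock-market-crawler")
```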
Accessed Athena to preview the data; the number of rows grows over time because of the cron job that runs hourly on AWS Lambda.
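
Queries were run interactively in the Athena console; programmatically the same check can be sketched with boto3, where the database, table, and output location are placeholders.

```python
# Sketch of querying the crawled table from Athena with boto3.
# Database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM stock_market_data",
    QueryExecutionContext={"Database": "stock_market_db"},
    ResultConfiguration={"OutputLocation": "s3://stock-market-kafka-demo/athena-results/"},
)
print(response["QueryExecutionId"])  # use get_query_execution / get_query_results to fetch the output
```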
Start the consumer file -> scrape data using the AWS Lambda function created through SAM -> transform the data -> pass it to the producer client -> the consumer reads the data and uploads it to S3 -> the data schema is crawled by Glue -> Athena queries the data directly from S3
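
To make the consumer side of that flow concrete, here is a simplified sketch that reads JSON records from the topic and writes each one to S3. It is not the project's actual consumer file (linked below); the broker address, topic, bucket, and key format are placeholders.

```python
# Simplified consumer -> S3 sketch; broker, topic, and bucket names are placeholders.
import json

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "stock_market_data",
    bootstrap_servers="<EC2-PUBLIC-IP>:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")

for count, message in enumerate(consumer):
    # One object per message keeps the example simple; batching would reduce S3 request costs.
    s3.put_object(
        Bucket="stock-market-kafka-demo",
        Key=f"stock_market_{count}.json",
        Body=json.dumps(message.value).encode("utf-8"),
    )
```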
- Kafka commands can be found here
- The app generated through SAM can be found here. This module contains the Kafka producer and transformation code.
- Kafka Consumer file can be found here
A sample template.yaml is also provided to show how SAM is used to configure the Lambda function.