This is an End-To-End Data Engineering Project on Real-Time Stock Market Data, using AWS and Kafka.
- Apache Kafka: This is an event streaming platform used to collect, process, store, and integrate data at scale. Its Python client was used to stream the data received from the web API (a minimal producer sketch follows this list).
- AWS EC2: This is a cloud-based computing service that allows users to quickly launch virtual servers and manage storage, security, and networking from an easy-to-use dashboard. It was used to host the Kafka server for this project.
- AWS Lambda: This is a serverless compute service that lets you run code without provisioning or managing any servers. Lambda functions can run code in response to events such as changes in S3 or DynamoDB, scheduled (cron) jobs, etc. For this project, it was used to run the data-scraping function, triggered every hour.
- AWS CloudWatch: This is a monitoring and management service that provides data and actionable insights for AWS, hybrid, and on-premises applications and resources. For this project, it was used to collect and aggregate metrics and logs from the Lambda function. The performance events are ingested as CloudWatch Logs to simplify monitoring and troubleshooting.
- AWS Serverless Application Model (SAM): This is a framework designed to make the creation, deployment, and execution of serverless applications as simple as possible, using AWS SAM templates. For this project, it was used to deploy the Lambda function.
- AWS S3: This is a highly scalable object storage service that stores data as objects within buckets. It is commonly used to store and distribute large media files, data backups, and static website files. For this project, it is used to store the data streamed from Kafka.
- AWS Glue: Glue Crawler is a fully managed service that automatically crawls your data sources, identifies the data, and infers schemas to create an AWS Glue Data Catalog. The Data Catalog is a fully managed metadata repository that makes it easy to discover and manage the crawled data, and it allows us to query the data directly from S3 without loading it first.
- AWS Athena: Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. For this project, it is used to analyze the data in the Glue Data Catalog or in other S3 buckets.
- Python
- YAML (for AWS SAM templates)
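
As a rough illustration of the producer side mentioned above, here is a minimal sketch assuming the kafka-python client; the broker address, topic name, and record fields are placeholders rather than the project's actual values.

```python
# Minimal kafka-python producer sketch; broker address, topic, and record shape are placeholders.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="<EC2-PUBLIC-IP>:9092",                    # Kafka broker running on the EC2 instance
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),    # serialize dicts to JSON bytes
)

# Example record shaped like a scraped stock quote (illustrative only).
record = {"symbol": "ACME", "price": 123.45, "timestamp": "2024-01-01T00:00:00Z"}
producer.send("stock_market_data", value=record)
producer.flush()  # block until the message is actually delivered
```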
Data was scraped from https://edition.cnn.com/markets using the Python requests module. The website does not provide a public API, so it was reverse engineered to find the underlying endpoints. Postman was then used to analyze the endpoints.
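
The exact endpoint discovered through reverse engineering is not reproduced here; the snippet below only sketches the general pattern of calling a JSON endpoint found in the browser's network tab with requests. The URL, parameters, and headers are placeholders.

```python
# Sketch of calling a reverse-engineered JSON endpoint; the URL and params are placeholders.
import requests

HEADERS = {"User-Agent": "Mozilla/5.0"}  # many endpoints reject requests without a browser-like user agent

def fetch_quotes(url: str, params: dict) -> dict:
    """Call the JSON endpoint and return the parsed payload."""
    response = requests.get(url, params=params, headers=HEADERS, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    data = fetch_quotes("https://example.com/markets/api/quotes", {"symbols": "AAPL,MSFT"})
    print(data)
```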
Created an EC2 instance (Amazon Linux 2 AMI) to run the Python consumer file and host Kafka
Installed AWS SAM and created the Lambda function using a SAM template. The sample template used can be found here
Tested the Lambda function before enabling the hourly cron schedule
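
For orientation, here is a hedged sketch of what the hourly Lambda handler could look like: scrape the endpoint, apply a simple transformation, and hand each record to the Kafka producer. The environment variable names, endpoint, topic, and field names are illustrative; the actual code generated through SAM is linked below.

```python
# Illustrative Lambda handler: scrape -> transform -> produce to Kafka.
# Environment variables, endpoint, topic, and field names are placeholders, not the project's real values.
import json
import os

import requests
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.environ["KAFKA_BOOTSTRAP"],  # e.g. "<EC2-PUBLIC-IP>:9092"
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def transform(raw: dict) -> dict:
    """Keep only the fields we care about (illustrative)."""
    return {"symbol": raw.get("symbol"), "price": raw.get("current_price")}

def lambda_handler(event, context):
    response = requests.get(os.environ["SCRAPE_URL"], timeout=10)
    response.raise_for_status()
    for raw in response.json().get("quotes", []):
        producer.send(os.environ.get("KAFKA_TOPIC", "stock_market_data"), value=transform(raw))
    producer.flush()
    return {"statusCode": 200}
```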
Installed Kafka and Java on the EC2 instance, and ran Kafka and ZooKeeper in the background to test the producer and consumer clients
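
Before wiring up the full pipeline, a quick way to check that the broker on the EC2 instance is reachable from Python is to list its topics with kafka-python; the broker address below is a placeholder.

```python
# Quick broker connectivity check with kafka-python; the address is a placeholder.
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="<EC2-PUBLIC-IP>:9092")
print(consumer.topics())  # prints the set of topics currently known to the broker
consumer.close()
```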
Created an S3 bucket and an IAM role to enable access to S3 from the instance
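
The bucket was set up through the console, but the same step can be sketched with boto3; the bucket name and region below are placeholders.

```python
# Sketch of creating the S3 bucket and verifying write access with boto3.
# The bucket name and region are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")
s3.create_bucket(Bucket="stock-market-kafka-demo")  # no LocationConstraint needed in us-east-1
s3.put_object(Bucket="stock-market-kafka-demo", Key="healthcheck.json", Body=b"{}")  # smoke-test write
```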
Created and ran a crawler on AWS Glue. This includes creating a database, choosing a data source, and creating an IAM role that allows Glue to access S3
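
The crawler was created in the Glue console; the equivalent API calls look roughly like this (crawler name, role ARN, database name, and S3 path are placeholders).

```python
# Sketch of creating and starting a Glue crawler with boto3; names, role, and path are placeholders.
import boto3

glue = boto3.client("glue")
glue.create_crawler(
    Name="stock-market-crawler",
    Role="arn:aws:iam::123456789012:role/GlueS3AccessRole",  # IAM role that allows Glue to read the bucket
    DatabaseName="stock_market_db",
    Targets={"S3Targets": [{"Path": "s3://stock-market-kafka-demo/"}]},
)
glue.start_crawler(Name="stock-market-crawler")
```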
Accessed Athena to preview the data; the number of rows grows over time because of the cron job that runs hourly on AWS Lambda.
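
Queries were run interactively in the Athena console; programmatically the same check can be sketched with boto3, where the database, table, and output location are placeholders.

```python
# Sketch of querying the crawled table from Athena with boto3.
# Database, table, and output location are placeholders.
import boto3

athena = boto3.client("athena")
response = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM stock_market_data",
    QueryExecutionContext={"Database": "stock_market_db"},
    ResultConfiguration={"OutputLocation": "s3://stock-market-kafka-demo/athena-results/"},
)
print(response["QueryExecutionId"])  # use get_query_execution / get_query_results to fetch the output
```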
Start the consumer file -> scrape data using the AWS Lambda function created through SAM -> transform the data -> pass it to the producer client -> the consumer reads the data and uploads it to S3 -> the data schema is crawled by Glue -> Athena queries the data directly from S3
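
To make the consumer side of that flow concrete, here is a simplified sketch that reads JSON records from the topic and writes each one to S3. It is not the project's actual consumer file (linked below); the broker address, topic, bucket, and key format are placeholders.

```python
# Simplified consumer -> S3 sketch; broker, topic, and bucket names are placeholders.
import json

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "stock_market_data",
    bootstrap_servers="<EC2-PUBLIC-IP>:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
s3 = boto3.client("s3")

for count, message in enumerate(consumer):
    # One object per message keeps the example simple; batching would reduce S3 request costs.
    s3.put_object(
        Bucket="stock-market-kafka-demo",
        Key=f"stock_market_{count}.json",
        Body=json.dumps(message.value).encode("utf-8"),
    )
```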
- Kafka commands can be found here
- The app generated through SAM can be found here. This module contains the Kafka producer and transformation code.
- Kafka Consumer file can be found here
A sample template.yaml is also provided to show how SAM is used to configure the Lambda function.