anton-antonenko/spark-streaming-capstone-project

Spark Streaming Capstone Project

This is my capstone project for the BigData Streaming course.
The goal of the project is to simulate a clickstream and detect bots in it in near real time.
Clicks are generated by data-generator and saved to a file as JSON rows. KafkaConnect then pushes this data to Kafka. The Spark application reads the data from Kafka and writes its output to Cassandra and Redis.
The pipeline is as follows:

                                                                               / -> Redis (detected bots cache)
data-generator -> clickstream.txt -> KafkaConnect -> Kafka topic -> Spark app |
                                                                               \ -> Cassandra (the whole clickstream stored there)
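
To make the first stage of the pipeline concrete, here is a minimal Python sketch of reading JSON rows such as those in clickstream.txt. The field names `ip` and `ts`, the `Click` type, and the `parse_clicks` helper are assumptions made for illustration; the actual schema produced by data-generator is not described here and may differ.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Click:
    ip: str    # hypothetical field name for the user ip
    ts: float  # hypothetical field name: event time, seconds since epoch


def parse_clicks(lines: Iterable[str]) -> Iterator[Click]:
    """Parse one JSON object per line, skipping blank or malformed rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            row = json.loads(line)
            yield Click(ip=str(row["ip"]), ts=float(row["ts"]))
        except (ValueError, KeyError, TypeError):
            # Malformed row: drop it rather than fail the whole stream.
            continue
```

Skipping bad rows instead of raising is a design choice for this sketch only; a real pipeline might route them to a separate topic for inspection.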

The Spark app must tolerate data that arrives up to 10 minutes late.
A user IP is considered a bot when more than 20 clicks arrive from it within 10 seconds.
When a bot is detected, its IP must be saved to Redis.
10 minutes after the last bot-like activity, the user IP must be removed from the bots cache.
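
The rules above can be sketched in plain Python, without Spark or Redis, to show how they interact. This is not the project's implementation: it assumes fixed (tumbling) 10-second windows keyed by event time, models the Redis cache as a dict of expiry times, and the `BotDetector` class and its method names are hypothetical.

```python
from collections import defaultdict

WINDOW = 10         # seconds per counting window
MAX_CLICKS = 20     # more clicks than this in one window flags a bot
BOT_TTL = 600       # seconds a bot stays cached after its last bot-like click
MAX_LATENESS = 600  # events older than (max event time - this) are dropped


class BotDetector:
    def __init__(self):
        self.counts = defaultdict(int)  # (ip, window_start) -> click count
        self.bots = {}                  # ip -> expiry time (stand-in for Redis)
        self.max_event_time = float("-inf")

    def process(self, ip, event_time):
        """Feed one click, identified by ip and event time in seconds."""
        self.max_event_time = max(self.max_event_time, event_time)
        watermark = self.max_event_time - MAX_LATENESS
        if event_time < watermark:
            return  # later than the allowed 10 minutes: ignore
        window_start = int(event_time // WINDOW) * WINDOW
        self.counts[(ip, window_start)] += 1
        if self.counts[(ip, window_start)] > MAX_CLICKS:
            # Each bot-like click pushes the expiry forward, never backward.
            expiry = event_time + BOT_TTL
            self.bots[ip] = max(self.bots.get(ip, expiry), expiry)
        # Windows that end at or before the watermark can no longer receive
        # events, so their state is dropped (a linear scan, fine for a sketch).
        for key in [k for k in self.counts if k[1] + WINDOW <= watermark]:
            del self.counts[key]

    def is_bot(self, ip, now):
        """Check the cache at time `now`, evicting the entry if it expired."""
        expiry = self.bots.get(ip)
        if expiry is None:
            return False
        if now >= expiry:
            del self.bots[ip]
            return False
        return True
```

With a real Redis instance the expiry bookkeeping would normally be delegated to a key TTL, and in Spark the lateness bound would be expressed as a watermark on the event-time column; both are left out here to keep the sketch self-contained.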

How to run it

See the demo.sh file for all the required instructions.
