This is my capstone project for the BigData Streaming course.
The goal of the project is to simulate a clickstream and detect bots in it in near real time.
Clicks are generated by data-generator and saved to a file as JSON rows. KafkaConnect then pushes this data to Kafka. The Spark application reads the data from Kafka and writes its output to Cassandra and Redis.
The pipeline is as follows:

                                                                                  / -> Redis (detected bots cache)
    data-generator -> clickstream.txt -> KafkaConnect -> Kafka topic -> Spark app |
                                                                                  \ -> Cassandra (the whole clickstream is stored there)
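
To make the first stage of the pipeline concrete, here is a minimal sketch of a generator that appends clicks to a file as JSON rows, one object per line. It is not the project's actual data-generator: the function names and the field names (`ip`, `event_time`, `url`) are hypothetical and chosen only for illustration.

```python
import json
import random
import time


def make_click(now=None):
    """Build one click event. All field names here are hypothetical."""
    return {
        "ip": "10.0.0.{}".format(random.randint(1, 254)),
        "event_time": int(now if now is not None else time.time()),
        "url": random.choice(["/", "/cart", "/checkout"]),
    }


def append_clicks(path, count):
    """Append `count` clicks to `path`, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as out:
        for _ in range(count):
            out.write(json.dumps(make_click()) + "\n")
```

Calling `append_clicks("clickstream.txt", 100)` would add 100 rows to the file that KafkaConnect reads.
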
The Spark app must tolerate data that arrives up to 10 minutes late.
A user IP is considered a bot when more than 20 clicks come from it within 10 seconds.
When a bot is detected, its IP must be saved to Redis.
Once 10 minutes have passed since the last bot-like activity, the user IP must be removed from the bots cache.
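
The detection and expiry rules above can be sketched in plain Python, without Spark or Redis. This is only an illustration under stated assumptions, not the project's implementation: the 10-second window is treated as a sliding window per IP, a dictionary stands in for the Redis cache, events are assumed to arrive in event-time order (so late data is not handled here), and the class and method names are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10        # detection window from the rule above
MAX_CLICKS = 20            # more than this many clicks in the window marks a bot
BOT_TTL_SECONDS = 10 * 60  # cache lifetime after the last bot-like activity


class BotDetector:
    """Hypothetical in-memory stand-in for the Spark app plus the Redis cache."""

    def __init__(self):
        self.clicks = defaultdict(deque)  # ip -> event times inside the window
        self.bots = {}                    # ip -> time of last bot-like activity

    def on_click(self, ip, ts):
        """Record a click at event time `ts` (seconds); return True if ip is a bot."""
        window = self.clicks[ip]
        window.append(ts)
        # Drop clicks that fell out of the 10-second window ending at ts.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        if len(window) > MAX_CLICKS:
            self.bots[ip] = ts  # refresh the time of last bot-like activity
        return self.is_bot(ip, ts)

    def is_bot(self, ip, now):
        """Check the cache, evicting the entry once the 10-minute TTL has passed."""
        last = self.bots.get(ip)
        if last is None:
            return False
        if now - last >= BOT_TTL_SECONDS:
            del self.bots[ip]
            return False
        return True
```

For example, feeding 21 clicks from one IP within about two seconds flags that IP on the 21st click, and `is_bot` stops reporting it 10 minutes after that click. In a real Redis cache the eviction step could instead rely on key expiry, refreshed on each bot-like event.
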
Check out the demo.sh file for all the needed instructions.