Assignment 01
Name: Dhruv Jayant Tillu Roll No.: 6107
Subject: 510303 - BDA
Aim: Demonstrate the application of Apache Spark to analyse streaming data from social media. (Installation of
multi-node Hadoop as well as Spark is to be done by students.)
Requirements:
• Software: PyCharm Professional
• Libraries: PySpark Module
• Dataset: socialmedia.csv from Kaggle
Theory: This PySpark code demonstrates real-time data processing using Structured Streaming. It analyzes
social media data, aggregating post counts and average likes per user within hourly windows. The aim is to
provide insights into user activity patterns and engagement levels over time, enabling continuous
monitoring and analysis of social media trends.
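The hourly aggregation described above can be illustrated without Spark. The following is a minimal pure-Python sketch, not Spark's implementation: it floors each timestamp to the start of its hour, groups by (window start, user), and computes the post count and average likes. The function name hourly_window_stats and the sample records are hypothetical and are not taken from socialmedia.csv.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def hourly_window_stats(posts):
    """Group (timestamp, user_id, likes) records into 1-hour tumbling windows."""
    buckets = defaultdict(list)
    for ts, user_id, likes in posts:
        # Floor the timestamp to the start of its hour: this is the window start
        start = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(start, user_id)].append(likes)

    rows = []
    for (start, user_id), likes in sorted(buckets.items()):
        rows.append({
            "window": (start, start + timedelta(hours=1)),
            "user_id": user_id,
            "post_count": len(likes),
            "avg_likes": sum(likes) / len(likes),
        })
    return rows


# Hypothetical sample records (timestamp, user_id, likes)
sample_posts = [
    (datetime(2024, 1, 1, 10, 5), "u1", 10),
    (datetime(2024, 1, 1, 10, 40), "u1", 20),
    (datetime(2024, 1, 1, 11, 15), "u1", 6),
    (datetime(2024, 1, 1, 10, 30), "u2", 4),
]

stats = hourly_window_stats(sample_posts)
for row in stats:
    print(row["window"][0], row["user_id"], row["post_count"], row["avg_likes"])
```

Each output row corresponds to one (window, user_id) group, which is the same shape as the window, user_id, post_count and avg_likes columns produced by the streaming query.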
Code:
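from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count, avg
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, TimestampType

# Local session using all available cores
spark = SparkSession.builder.appName("SocialMediaStreamingAnalysis").master("local[*]").getOrCreate()

# Schema of the incoming records; likes is declared numeric so it can be averaged
schema = StructType([
    StructField("timestamp", TimestampType(), True),
    StructField("user_id", StringType(), True),
    StructField("post_text", StringType(), True),
    StructField("likes", IntegerType(), True)
])

# Assumption: the original listing does not define `lines`. Here it is read as a
# stream of CSV files (with a header row) from a placeholder directory.
lines = spark.readStream \
    .schema(schema) \
    .option("header", "true") \
    .csv("path/to/socialmedia_dir")

# Post count and average likes per user within 1-hour windows
windowedCounts = lines.groupBy(
    window(lines.timestamp, "1 hour"),
    lines.user_id
).agg(
    count("*").alias("post_count"),
    avg("likes").alias("avg_likes")
)

print("Query Explanation:")
windowedCounts.explain(extended=True)

# Complete mode prints the full result table to the console on every trigger
query = windowedCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()
query.awaitTermination()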
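Spark's file-based streaming sources read files from a directory as they arrive, so one way to feed a single file such as socialmedia.csv to a job like this is to split it into smaller files. The following is a minimal sketch under that assumption. The helper name split_csv, the part-NNNN.csv naming, the stream_input directory, and the sample rows are all hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def _write_chunk(out_dir, index, header, rows):
    """Write one chunk as its own CSV file, repeating the header row."""
    path = out_dir / f"part-{index:04d}.csv"
    with open(path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(rows)
    return path


def split_csv(source, out_dir, rows_per_file=100):
    """Split `source` into numbered CSV files of at most `rows_per_file` rows."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with open(source, newline="", encoding="utf-8") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input: nothing to split
            return written
        chunk = []
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_file:
                written.append(_write_chunk(out_dir, len(written), header, chunk))
                chunk = []
        if chunk:  # remaining rows that did not fill a whole file
            written.append(_write_chunk(out_dir, len(written), header, chunk))
    return written


# Demo on a small hypothetical file with the same columns as the schema
with tempfile.TemporaryDirectory() as tmp:
    src_path = Path(tmp) / "socialmedia.csv"
    src_path.write_text(
        "timestamp,user_id,post_text,likes\n"
        "2024-01-01 10:05:00,u1,hello,10\n"
        "2024-01-01 10:40:00,u1,again,20\n"
        "2024-01-01 10:30:00,u2,hi,4\n"
        "2024-01-01 11:15:00,u1,later,6\n"
        "2024-01-01 11:20:00,u2,bye,8\n",
        encoding="utf-8",
    )
    parts = split_csv(src_path, Path(tmp) / "stream_input", rows_per_file=2)
    part_names = [p.name for p in parts]
    last_part_lines = parts[-1].read_text(encoding="utf-8").splitlines()
    print(part_names)
```

For a running query, files are best written elsewhere and then moved into the watched directory, so that Spark does not pick up a partially written file.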
Output:
Query Explanation:
== Physical Plan ==
-------------------------------------------
Batch: 0
-------------------------------------------
+------------------------------------------+-------+----------+-----------------+
|window |user_id|post_count|avg_likes |
+------------------------------------------+-------+----------+-----------------+
+------------------------------------------+-------+----------+-----------------+
-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-------+----------+-----------------+
|window |user_id|post_count|avg_likes |
+------------------------------------------+-------+----------+-----------------+
+------------------------------------------+-------+----------+-----------------+
Conclusion: This assignment demonstrates the use of Apache Spark and PySpark for analyzing real-time
social media data with Structured Streaming. Windowed aggregation of post counts and average likes per
user allows continuous monitoring of user activity patterns and engagement on social media.