
Name: Dhruv Jayant Tillu
Roll No.: 6107
Subject: 510303 - BDA

ASSIGNMENT: 01
Aim: Demonstrate the application of Apache Spark to analyze streaming data from social media. (Installation of multi-node Hadoop and Spark is to be done by students.)

Requirements:
• Software: PyCharm Professional
• Libraries: PySpark Module
• Dataset: socialmedia.csv from Kaggle

Theory: This PySpark code demonstrates real-time data processing using structured streaming. It analyzes
social media data, aggregating post counts and average likes per user within hourly windows. The code
showcases:

1. Defining a schema for structured data

2. Reading streaming data from a CSV file

3. Applying windowed aggregations on streaming data

4. Using PySpark's DataFrame API for declarative data transformations

5. Outputting results in real-time to the console

The aim is to provide insights into user activity patterns and engagement levels over time, enabling
continuous monitoring and analysis of social media trends. A small sketch for simulating the streaming
input is given below, ahead of the main code.
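Spark's file streaming source monitors a path and picks up files as they appear, so one simple way to test the job is to drop small CSV files into the monitored folder while the query runs. The helper below is a minimal sketch, not part of the submitted code: the directory name stream_input/ and the sample rows are assumptions chosen to match the schema (timestamp, user_id, post_text, likes).

# feed_stream.py - hypothetical helper to simulate a CSV stream.
# Assumes the Spark job monitors the "stream_input/" directory.
import os
import time

SAMPLE_ROWS = [
    "2024-09-18 08:15:00,user1,Morning post,12",
    "2024-09-18 08:40:00,user2,Coffee time,36",
    "2024-09-18 09:05:00,user3,New blog up,19",
]

os.makedirs("stream_input", exist_ok=True)

# Write one new CSV file per row so each micro-batch sees fresh data.
for i, row in enumerate(SAMPLE_ROWS):
    with open(f"stream_input/batch_{i}.csv", "w") as f:
        f.write(row + "\n")
    time.sleep(5)  # give Spark time to pick up each file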

Code:
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count, avg
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder \
    .appName("SocialMediaStreamingAnalysis") \
    .master("local[*]") \
    .getOrCreate()

# Schema for the incoming CSV records. "likes" arrives as a string;
# avg() below implicitly casts it to double (visible in the query plan).
schema = StructType([
    StructField("timestamp", TimestampType(), True),
    StructField("user_id", StringType(), True),
    StructField("post_text", StringType(), True),
    StructField("likes", StringType(), True)
])

# The file streaming source watches a path for new CSV files; in
# practice this should point at a directory that receives files.
lines = spark.readStream.option("sep", ",").schema(schema).csv("./socialmedia.csv")

# Group posts into 1-hour event-time windows per user and compute
# the post count and average likes for each (window, user) pair.
windowedCounts = lines.groupBy(
    window(lines.timestamp, "1 hour"),
    lines.user_id
).agg(
    count("*").alias("post_count"),
    avg("likes").alias("avg_likes")
)

print("Query Explanation:")
windowedCounts.explain(extended=True)

# "complete" output mode re-emits the full aggregated result table
# to the console on every micro-batch.
query = windowedCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
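Note that complete output mode keeps the entire aggregation state in memory for the lifetime of the query. As a point of comparison, the following is a sketch of a watermarked variant (not part of the submitted code) that lets Spark discard state for windows older than a chosen lateness threshold and emit only changed rows via update mode; it reuses the lines DataFrame defined above.

# Hypothetical variant with a watermark: Spark drops window state
# more than 2 hours behind the latest event time it has seen.
windowedWithWatermark = lines \
    .withWatermark("timestamp", "2 hours") \
    .groupBy(window(lines.timestamp, "1 hour"), lines.user_id) \
    .agg(count("*").alias("post_count"), avg("likes").alias("avg_likes"))

watermarkedQuery = windowedWithWatermark.writeStream \
    .outputMode("update") \
    .format("console") \
    .start()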

Output:

Query Explanation:

== Physical Plan ==
*(2) HashAggregate(keys=[window#20, user_id#1], functions=[count(1), avg(cast(likes#3 as double))])
+- Exchange hashpartitioning(window#20, user_id#1, 200), ENSURE_REQUIREMENTS, [id=#45]
   +- *(1) HashAggregate(keys=[window#20, user_id#1], functions=[partial_count(1), partial_avg(cast(likes#3 as double))])
      +- *(1) Project [named_struct(start, precisetimestamp(HiveIntervalDayTime(3600000000),0), end, precisetimestamp(HiveIntervalDayTime(7200000000),0)) AS window#20, user_id#1, likes#3]
         +- *(1) Filter (isnotnull(timestamp#0) AND (timestamp#0 >= cast(1970-01-01 00:00:00.0 as timestamp)))
            +- StreamingRelation CSV, [timestamp#0, user_id#1, post_text#2, likes#3]

== Analyzed Logical Plan ==
window: struct<start:timestamp,end:timestamp>, user_id: string, post_count: bigint, avg_likes: double
Aggregate [window#20, user_id#1], [window#20, user_id#1, count(1) AS post_count#33L, avg(cast(likes#3 as double)) AS avg_likes#38]
+- Project [named_struct(start, precisetimestamp(HiveIntervalDayTime(3600000000),0), end, precisetimestamp(HiveIntervalDayTime(7200000000),0)) AS window#20, user_id#1, likes#3]
   +- Filter (isnotnull(timestamp#0) AND (timestamp#0 >= cast(1970-01-01 00:00:00.0 as timestamp)))
      +- StreamingRelation CSV, [timestamp#0, user_id#1, post_text#2, likes#3]

== Optimized Logical Plan ==
Aggregate [window#20, user_id#1], [window#20, user_id#1, count(1) AS post_count#33L, avg(cast(likes#3 as double)) AS avg_likes#38]
+- Project [named_struct(start, precisetimestamp(HiveIntervalDayTime(3600000000),0), end, precisetimestamp(HiveIntervalDayTime(7200000000),0)) AS window#20, user_id#1, likes#3]
   +- Filter (isnotnull(timestamp#0) AND (timestamp#0 >= 1970-01-01 00:00:00.0))
      +- StreamingRelation CSV, [timestamp#0, user_id#1, post_text#2, likes#3]

-------------------------------------------
Batch: 0
-------------------------------------------
+------------------------------------------+-------+----------+-----------------+
|window                                    |user_id|post_count|avg_likes        |
+------------------------------------------+-------+----------+-----------------+
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user1  |2         |13.5             |
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user2  |2         |36.0             |
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user3  |1         |7.0              |
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user4  |1         |31.0             |
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user5  |1         |45.0             |
|{2024-09-18 09:00:00, 2024-09-18 10:00:00}|user6  |1         |28.0             |
|{2024-09-18 09:00:00, 2024-09-18 10:00:00}|user3  |1         |19.0             |
|{2024-09-18 09:00:00, 2024-09-18 10:00:00}|user4  |1         |26.0             |
|{2024-09-18 09:00:00, 2024-09-18 10:00:00}|user5  |1         |17.0             |
|{2024-09-18 10:00:00, 2024-09-18 11:00:00}|user7  |1         |82.0             |
|{2024-09-18 10:00:00, 2024-09-18 11:00:00}|user1  |1         |9.0              |
|{2024-09-18 11:00:00, 2024-09-18 12:00:00}|user2  |1         |14.0             |
|{2024-09-18 11:00:00, 2024-09-18 12:00:00}|user6  |1         |23.0             |
|{2024-09-18 11:00:00, 2024-09-18 12:00:00}|user7  |1         |56.0             |
|{2024-09-18 11:00:00, 2024-09-18 12:00:00}|user4  |1         |18.0             |
|{2024-09-18 12:00:00, 2024-09-18 13:00:00}|user5  |1         |21.0             |
|{2024-09-18 12:00:00, 2024-09-18 13:00:00}|user3  |1         |11.0             |
|{2024-09-18 12:00:00, 2024-09-18 13:00:00}|user1  |1         |16.0             |
+------------------------------------------+-------+----------+-----------------+

-------------------------------------------
Batch: 1
-------------------------------------------
+------------------------------------------+-------+----------+-----------------+
|window                                    |user_id|post_count|avg_likes        |
+------------------------------------------+-------+----------+-----------------+
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user1  |2         |13.5             |
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user2  |2         |36.0             |
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user3  |1         |7.0              |
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user4  |1         |31.0             |
|{2024-09-18 08:00:00, 2024-09-18 09:00:00}|user5  |1         |45.0             |
|{2024-09-18 09:00:00, 2024-09-18 10:00:00}|user6  |1         |28.0             |
|{2024-09-18 09:00:00, 2024-09-18 10:00:00}|user3  |1         |19.0             |
|{2024-09-18 09:00:00, 2024-09-18 10:00:00}|user4  |1         |26.0             |
|{2024-09-18 09:00:00, 2024-09-18 10:00:00}|user5  |1         |17.0             |
|{2024-09-18 10:00:00, 2024-09-18 11:00:00}|user7  |1         |82.0             |
|{2024-09-18 10:00:00, 2024-09-18 11:00:00}|user1  |1         |9.0              |
|{2024-09-18 11:00:00, 2024-09-18 12:00:00}|user2  |1         |14.0             |
|{2024-09-18 11:00:00, 2024-09-18 12:00:00}|user6  |1         |23.0             |
|{2024-09-18 11:00:00, 2024-09-18 12:00:00}|user7  |1         |56.0             |
|{2024-09-18 11:00:00, 2024-09-18 12:00:00}|user4  |1         |18.0             |
|{2024-09-18 12:00:00, 2024-09-18 13:00:00}|user5  |1         |21.0             |
|{2024-09-18 12:00:00, 2024-09-18 13:00:00}|user3  |1         |11.0             |
|{2024-09-18 12:00:00, 2024-09-18 13:00:00}|user1  |1         |16.0             |
+------------------------------------------+-------+----------+-----------------+

Conclusion: This assignment successfully demonstrates the use of Apache Spark and PySpark for analyzing
real-time social media data with Structured Streaming. Windowed aggregation enables continuous monitoring
of user activity patterns and engagement levels over time. Note that because the query uses complete
output mode, each micro-batch (e.g. Batch 1 above) re-emits the full aggregated result table rather than
only the changed rows.
