1. Introduction of Subject
Dr Jigna Patel
N-407
[email protected]
9898942993
Course Outcomes
After successful completion of this course, the student will be able to:
1. outline the significance and challenges of big data
2. model big data using different tools and frameworks
3. apply big data techniques for useful business analytic applications
4. design algorithms for mining the data from large volumes
Syllabus
Unit I
Introduction to Big Data: Evolution of Big Data, Types of Digital Data, Classification of Digital Data, Structured Data,
Semi-Structured Data, Unstructured Data, Definition of Big Data, Challenges of Conventional Systems, Big Data Platforms and
Data Storage
Unit II
Big Data Analytics: Importance of Big data analytics, Classification of Analytics, Top Challenges Facing Big Data, Technologies to
meet the Challenges Posed by Big Data, Terminologies Used in Big Data Environment
Unit III
Hadoop: Introducing Hadoop, comparisons of RDBMS and Hadoop, Distributed Computing Challenges, Hadoop Overview,
Business Value of Hadoop, Hadoop Distributed File System, Processing Data with Hadoop, working with Map Reduce,
Hadoop YARN, Hadoop in the Cloud, Applications on Big Hadoop Ecosystem, Fundamentals of Pig, Hive, HBase and ZooKeeper,
Basic concepts of Apache Spark
Unit IV
The Big Data Technology Landscape: CAP Theorem - BASE Concept, NoSQL, Types of NoSQL Databases, Introduction to
MongoDB, Data Types in MongoDB, CRUD, Apache Cassandra, Features of Cassandra, CRUD
Unit V
Big Data Analytics Algorithms: Applying Linear Regression, Clustering, Association Rule Mining, and Decision Trees on Big Data.
Self-study: Frameworks: Applications on Big Data Using Pig and Hive
References:
1. Michael Berthold, David J. Hand, Intelligent Data Analysis, Springer
2. Tom White, Hadoop: The Definitive Guide, Third Edition, O’Reilly Media
3. Chris Eaton, Dirk DeRoos, Tom Deutsch, George Lapis, Paul Zikopoulos, Understanding Big Data: Analytics
for Enterprise Class Hadoop and Streaming Data, McGraw Hill Publishing
4. Anand Rajaraman and Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press
5. Bill Franks, Taming the Big Data Tidal Wave: Finding Opportunities in Huge Data Streams with Advanced
Analytics, John Wiley & Sons
6. Glenn J. Myatt, Making Sense of Data, John Wiley & Sons
7. Pete Warden, Big Data Glossary, O’Reilly
8. Jiawei Han, Micheline Kamber, Data Mining Concepts and Techniques, Second Edition, Elsevier
9. Da Ruan, Guoqing Chen, Etienne E. Kerre, Geert Wets, Intelligent Data Mining, Springer
10. Paul Zikopoulos, Dirk deRoos, Krishnan Parasuraman, Thomas Deutsch, James Giles, David Corrigan,
Harness the Power of Big Data The IBM Big Data Platform, Tata McGraw Hill Publications
11. Michael Minelli, Michele Chambers, Ambiga Dhiraj, Big Data, Big Analytics: Emerging Business Intelligence
and Analytic Trends for Today's Businesses, Wiley Publications
12. Zikopoulos, Paul, Chris Eaton, Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming
Data, Tata McGraw Hill Publications
13. Seema Acharya and Subhashini C, Big Data and Analytics, Wiley India
Examination Scheme
                CE                       SEE        LPW
Exam Duration   Continuous Evaluation    3.0 Hrs    Continuous Evaluation + 2 hrs Semester End LPW Exam
10. Implement any one of the analytic algorithms using PySpark and MLlib for larger datasets in main memory
(machine learning application). Hours: 04, CLO: 4
• Regression
• K-means Clustering
• Association Rule Mining Algorithm
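To make the K-means option concrete, here is a minimal plain-Python sketch of the algorithm's two steps (assignment and update). It deliberately does not use PySpark or MLlib, which the practical itself requires; the function name `kmeans` and the sample points are hypothetical and only illustrate what the library computes at scale.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Illustrative Lloyd's algorithm on 2-D points (not the MLlib API)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k of the input points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x, y in points:
            nearest = min(
                range(k),
                key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for i, cluster in enumerate(clusters):
            if cluster:
                mean_x = sum(p[0] for p in cluster) / len(cluster)
                mean_y = sum(p[1] for p in cluster) / len(cluster)
                updated.append((mean_x, mean_y))
            else:
                updated.append(centroids[i])  # keep an empty cluster's centroid
        if updated == centroids:
            break  # assignments are stable, so stop early
        centroids = updated
    return centroids, clusters

# Hypothetical sample data: two loose groups of points.
sample = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(sample, k=2)
print(centroids)
```

In the actual practical, the same two steps would run in parallel over partitions of a distributed dataset rather than over a Python list.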
Lab session
Practice Purpose
Sr. No.   Practical Title                                                          Hours      CLO
11*       Extend MongoDB functionality for MapReduce on document collection        02 Hours   3
12*       Extend Cassandra functionality for MapReduce on restaurant dataset       02 Hours   3
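The MapReduce pattern named in practicals 11 and 12 can be sketched in plain Python before trying it on a real store. This is only an illustration of the map, shuffle, and reduce phases: it does not use the MongoDB or Cassandra APIs, and the in-memory `restaurants` documents and the `cuisine` field are hypothetical.

```python
from collections import defaultdict

# Hypothetical documents standing in for a restaurant collection.
restaurants = [
    {"name": "Trattoria Uno", "cuisine": "Italian"},
    {"name": "Spice Route", "cuisine": "Indian"},
    {"name": "Pasta Bar", "cuisine": "Italian"},
    {"name": "Dosa Corner", "cuisine": "Indian"},
    {"name": "Sushi Go", "cuisine": "Japanese"},
]

def map_phase(doc):
    """Map: emit one (key, value) pair per document."""
    yield doc["cuisine"], 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)

def map_reduce(docs):
    pairs = (pair for doc in docs for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())

counts = map_reduce(restaurants)
print(counts)
```

On a cluster, the map and reduce functions stay the same in spirit, while the framework handles splitting the input and moving grouped keys between machines.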
InClassQuestion#1
Reference : https://data-flair.training/blogs/big-data-case-studies/
• The largest retailer in the world and the world’s largest company by revenue, with more than 2.1 million employees
• 10,586 stores and clubs in 24 countries
• More than 2 million employees and 20000 stores in 28 countries
• Major problems are:
1. Inventory Management: Ensuring shelves are stocked with the right products at the right time.
2. Customer Insights: Understanding and predicting customer behavior to improve sales.
3. Supply Chain Optimization: Managing a vast network of suppliers and logistics.
3. Streaming Quality:
Tools: Amazon Web Services (AWS), Akamai
Algorithms: Adaptive bitrate streaming, predictive analytics
Solution: Optimizing streaming quality by predicting and
managing network congestion.
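As a rough illustration of adaptive bitrate streaming combined with a simple prediction, the sketch below smooths recent throughput measurements and picks the highest bitrate that fits under the estimate. It is not any streaming provider's actual algorithm; the bitrate ladder, the smoothing factor `alpha`, and the `safety` margin are all assumed values.

```python
# Hypothetical bitrate ladder in kbps; real services define their own.
BITRATE_LADDER_KBPS = [235, 750, 1750, 3000, 5800]

def predict_throughput(samples_kbps, alpha=0.5):
    """Exponentially weighted moving average of measured throughput."""
    estimate = samples_kbps[0]
    for sample in samples_kbps[1:]:
        estimate = alpha * sample + (1 - alpha) * estimate
    return estimate

def choose_bitrate(samples_kbps, safety=0.8):
    """Pick the highest ladder rung that fits under a safety-scaled estimate."""
    budget = predict_throughput(samples_kbps) * safety
    candidates = [rate for rate in BITRATE_LADDER_KBPS if rate <= budget]
    # Fall back to the lowest rung when even that exceeds the budget.
    return max(candidates) if candidates else BITRATE_LADDER_KBPS[0]

print(choose_bitrate([4000, 4000, 4000]))
print(choose_bitrate([4000, 1000]))
```

When measured throughput drops, the estimate falls and the player steps down the ladder, trading picture quality for fewer stalls.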
• A big technical challenge for eBay, as a data-intensive business, is to exploit a system
that can rapidly analyze and act on data as it arrives (streaming data).
• There are many rapidly evolving methods to support streaming data analysis.
• eBay is working with several tools including Apache Spark, Storm, Kafka.
• It allows the company’s data analysts to search for information tags that have
been associated with the data (metadata) and make it consumable to as many
people as possible with the right level of security and permissions (data
governance).
• The company has been at the forefront of using big data solutions and actively
contributes its knowledge back to the open-source community.
• It is a 179-year-old company.
• The company has recognized the potential of big data and put it to use in
business units around the globe.
• P&G has put a strong emphasis on using big data to make better, smarter,
real-time business decisions.
• The Global Business Services organization has developed tools, systems, and
processes to provide managers with direct access to the latest data and advanced
analytics. As a result, P&G, one of the oldest companies, still holds a large share of
the market despite many emerging competitors.
InClassQuestion#2
• How can we apply big data analytics in the education sector?
• Personalizing Learning: Tailoring educational content and approaches to meet
individual student needs and learning styles.
• Predicting Student Performance: Using data to identify at-risk students and
intervene early to improve outcomes.
• Enhancing Curriculum Development: Analyzing data on student engagement
and success to refine and improve curriculum content.
• Optimizing Resource Allocation: Efficiently distributing resources like faculty,
funding, and facilities based on data insights.
• Improving Administrative Efficiency: Streamlining operations and decision-
making processes using data-driven insights.
Reference : https://www.bigdataframework.org/short-history-of-big-data/
Evolution of Technology
Reference : https://www.youtube.com/watch?v=zez2Tv-bcXY
Internet of Things
Reference : https://www.edureka.co/blog/big-data-tutorial
Conclusion