
ITDX45 - Big Data Analytics
Module 5 - Technology and Tools
Analytics of unstructured data
• The MapReduce paradigm breaks a large task into smaller tasks, runs the tasks in parallel, and consolidates the outputs of the individual tasks into the final output.
• Apache Hadoop includes a software implementation of MapReduce.
• MapReduce consists of two basic parts:
  - a map step, and
  - a reduce step
Map:
• Applies an operation to a piece of data
• Provides some intermediate output
Reduce:
• Consolidates the intermediate outputs from the map
steps
• Provides the final output
• The map step parses the provided text string into individual words and emits a set of key/value pairs of the form <word, 1>.
• The reduce step sums the 1 values for each word and outputs <word, count> key/value pairs. (A short worked example appears below.)
• The final output of a MapReduce process applied to a set of documents might have the key as an ordered pair and the value as an ordered tuple of length 2n.
• A possible representation of such a key/value pair follows:
  <(filename, datetime), (word1, 5, word2, 7, … , wordn, 6)>
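A worked illustration of this flow, using an invented one-line input "to be or not to be":
  • The map step emits <to, 1>, <be, 1>, <or, 1>, <not, 1>, <to, 1>, <be, 1>.
  • The framework groups the intermediate pairs by key, and the reduce step outputs <to, 2>, <be, 2>, <or, 1>, <not, 1>.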
Hadoop cluster
• To manage data access, the Hadoop Distributed File System (HDFS) uses three Java background processes:
  1. NameNode,
  2. DataNode, and
  3. Secondary NameNode.
• The NameNode determines and tracks where the various blocks of a data file are stored.
• The DataNode manages the data stored on each machine.
• The Secondary NameNode provides the capability to perform some of the NameNode tasks to reduce the load on the NameNode.
If a client application wants to access a particular file
stored in HDFS, the application contacts the NameNode, and the
NameNode provides the application with the locations of the
various blocks for that file. The application then communicates
with the appropriate DataNodes to access the file.
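A minimal sketch of this client interaction using the Hadoop FileSystem API (the file path below is hypothetical); the NameNode lookup and the DataNode reads happen behind the fs.open() call:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Opening the file asks the NameNode for the block locations;
        // the returned stream then reads those blocks from the DataNodes.
        Path file = new Path("/user/example/input.txt");   // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}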
• Each DataNode periodically builds a report about the blocks
stored on the DataNode and sends the report to the
NameNode.
• If one or more blocks are not accessible on a DataNode, the
NameNode ensures that an accessible copy of an inaccessible
data block is replicated to another machine.
• For performance reasons, the NameNode resides in a
machine’s memory. Because the NameNode is critical to the
operation of HDFS, any unavailability or corruption of the
NameNode results in a data unavailability event on the
cluster.
• Thus, the NameNode is viewed as a single point of failure in
the Hadoop environment.
• To minimize the chance of a NameNode failure and to improve
performance, the NameNode is typically run on a dedicated
machine.
Structuring a MapReduce Job in Hadoop
• A typical MapReduce program in Java consists of three classes (a minimal word count sketch follows this description):
  1. the driver,
  2. the mapper, and
  3. the reducer.
• The driver provides details such as the input file locations, the provisions for adding the input file to the map task, the names of the mapper and reducer Java classes, and the location of the reduce task output.
• The mapper provides the logic to be processed on each data block corresponding to the input files specified in the driver code.
• The reducer provides the logic applied to the intermediate key/value pairs emitted by the map tasks to produce the final output.
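A minimal sketch of such a program, closely following the standard Hadoop word count example (input and output paths are passed on the command line; the class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits <word, 1> for every word in its input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the 1s for each word and emits <word, count>
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: names the mapper/reducer classes and the input/output locations
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}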
The Hadoop Ecosystem
• Hadoop-related Apache projects:
  1. Pig: Provides a high-level data-flow programming language
  2. Hive: Provides SQL-like access (a small example follows this list)
  3. Mahout: Provides analytical tools
  4. HBase: Provides real-time reads and writes
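A small, hedged illustration of Hive's SQL-like access; the weblogs table and its page column are hypothetical:

-- HiveQL: count page views per page in a hypothetical weblogs table
SELECT page, COUNT(*) AS views
FROM weblogs
GROUP BY page
ORDER BY views DESC
LIMIT 10;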
NoSQL
• NoSQL (Not only Structured Query Language) is a term used to describe data stores that are applied to unstructured data.
• Key/value stores contain data (the value) that can be simply accessed by a given identifier (the key).
  • Using a customer's login ID as the key, the value contains the customer's preferences.
  • Using a web session ID as the key, the value contains everything that was captured during the session.
• Document stores are useful when the value of the key/value pair is a file and the file itself is self-describing (for example, JSON or XML); a small illustrative document appears at the end of this list.
  • Content management of web pages
  • Web analytics of stored log data
• Column family stores are useful for sparse datasets: records with thousands of possible columns where only a few columns have entries.
  • To store and render blog entries, tags, and viewers' feedback
  • To store and update various web page metrics and counters
• Graph databases are intended for use cases such as
networks, where there are items (people or web page
links) and relationships between these items.
• Social networks such as Facebook and LinkedIn
• Geospatial applications such as delivery and traffic systems to
optimize the time to reach one or more destinations
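An illustrative self-describing JSON document of the kind a document store might hold for a web session (all field names and values here are hypothetical):

{
  "session_id": "a1b2c3",
  "customer_id": 666730,
  "pages_viewed": ["home", "product/123", "cart"],
  "checkout_completed": false
}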
SQL Essentials
• A relational database, part of a Relational Database
Management System (RDBMS), organizes data in tables
with established relationships between the tables.
SQL query
• SELECT first_name, last_name
  FROM customer
  WHERE customer_id = 666730
• Joins enable a database user to appropriately select columns from two or more tables.
• SELECT c.customer_id, o.order_id, o.product_id, o.item_quantity AS qty
  FROM orders o
  INNER JOIN customer c ON o.customer_id = c.customer_id
  WHERE c.first_name = 'Mason' AND c.last_name = 'Hu'
• INNER JOIN returns those rows from the two tables
where the ON criterion is met.
• The sorting of the records is accomplished with the
ORDER BY clause.
• SELECT c.customer_id, c.first_name, c.last_name, o.order_id
  FROM orders o
  RIGHT OUTER JOIN customer c ON o.customer_id = c.customer_id
  WHERE o.order_id IS NULL
  ORDER BY c.last_name, c.first_name
  LIMIT 5
• RIGHT OUTER JOIN is used to specify that all rows
from the table customer, on the right-hand side
(RHS) of the join, should be returned, regardless of
whether there is a matching customer_id in the
orders table.
Database text analysis
• SELECT SUBSTRING('1234567890', 3, 2)  /* returns '34' */
• SELECT '1234567890' LIKE '%7%'        /* returns True */
• SELECT '1234567890' LIKE '7%'         /* returns False */
• SELECT '1234567890' LIKE '_2%'        /* returns True */
• SELECT '1234567890' LIKE '_3%'        /* returns False */
• SELECT '1234567890' LIKE '__3%'       /* returns True */
Regular expression operators and elements (reference tables)
Advanced SQL
• Window functions (see the example after this list)
• User-defined functions and aggregates
• Ordered aggregates
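A brief, hedged sketch of a window function, reusing the orders table and columns assumed in the earlier join examples; it computes a per-customer running total of item quantities without collapsing rows the way GROUP BY would:

SELECT customer_id,
       order_id,
       item_quantity,
       SUM(item_quantity) OVER (PARTITION BY customer_id
                                ORDER BY order_id) AS running_qty
FROM orders;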
MADlib
• MADlib is an external library used from SQL implementations (for example, PostgreSQL and Greenplum); a brief usage sketch follows this slide.
• MADlib is an open-source library for scalable in-database analytics. It offers data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data.
Components of MAD
• Magnetic: attract all of an organization's data sources
• Agile: adapt easily to new data sources and analytical methods
• Deep: support sophisticated, statistically rich analytics
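A hedged illustration of how MADlib is typically invoked from SQL; the houses table and its columns are hypothetical, and madlib.linregr_train is one of the library's documented training functions (check the MADlib documentation for the exact signature in a given version):

-- Train a linear regression model in-database (illustrative sketch)
SELECT madlib.linregr_train(
    'houses',                     -- source table (hypothetical)
    'houses_linregr',             -- output table holding the fitted model
    'price',                      -- dependent variable
    'ARRAY[1, tax, bath, size]'   -- independent variables (1 = intercept term)
);

-- Inspect the fitted coefficients and R-squared
SELECT coef, r2 FROM houses_linregr;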
