
ITDX45 - Big Data Analytics
Module 5 - Technology and Tools
Analytics of unstructured data
• The MapReduce paradigm breaks a large task into smaller tasks, runs the tasks in parallel, and consolidates the outputs of the individual tasks into the final output.
• Apache Hadoop includes a software implementation of MapReduce.
• MapReduce consists of two basic parts:
  - a map step, and
  - a reduce step
Map:
• Applies an operation to a piece of data
• Provides some intermediate output
Reduce:
• Consolidates the intermediate outputs from the map
steps
• Provides the final output
• The map step parses the provided text string into individual words and emits a set of key/value pairs of the form <word, 1>.
• The reduce step sums the 1 values for each word and outputs <word, count> key/value pairs. (A short worked example appears below.)
• The final output of a MapReduce process applied to a set of documents might have the key as an ordered pair and the value as an ordered tuple of length 2n.
• A possible representation of such a key/value pair follows:
  <(filename, datetime), (word1, 5, word2, 7, … , wordn, 6)>
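A worked illustration of this flow, using an invented one-line input "to be or not to be":
  • The map step emits <to, 1>, <be, 1>, <or, 1>, <not, 1>, <to, 1>, <be, 1>.
  • The framework groups the intermediate pairs by key, and the reduce step outputs <to, 2>, <be, 2>, <or, 1>, <not, 1>.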
Hadoop cluster
• To manage data access, the Hadoop Distributed File System (HDFS) uses three Java background processes:
  1. NameNode,
  2. DataNode, and
  3. Secondary NameNode.
• The NameNode determines and tracks where the various blocks of a data file are stored.
• The DataNode manages the data stored on each machine.
• The Secondary NameNode provides the capability to perform some of the NameNode tasks to reduce the load on the NameNode.
If a client application wants to access a particular file
stored in HDFS, the application contacts the NameNode, and the
NameNode provides the application with the locations of the
various blocks for that file. The application then communicates
with the appropriate DataNodes to access the file.
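A minimal sketch of this client interaction using the Hadoop FileSystem API (the file path below is hypothetical); the NameNode lookup and the DataNode reads happen behind the fs.open() call:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Opening the file asks the NameNode for the block locations;
        // the returned stream then reads those blocks from the DataNodes.
        Path file = new Path("/user/example/input.txt");   // hypothetical path
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}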
• Each DataNode periodically builds a report about the blocks
stored on the DataNode and sends the report to the
NameNode.
• If one or more blocks are not accessible on a DataNode, the
NameNode ensures that an accessible copy of an inaccessible
data block is replicated to another machine.
• For performance reasons, the NameNode resides in a
machine’s memory. Because the NameNode is critical to the
operation of HDFS, any unavailability or corruption of the
NameNode results in a data unavailability event on the
cluster.
• Thus, the NameNode is viewed as a single point of failure in
the Hadoop environment.
• To minimize the chance of a NameNode failure and to improve
performance, the NameNode is typically run on a dedicated
machine.
Structuring a MapReduce Job in Hadoop
• A typical MapReduce program in Java consists of three classes (a minimal word count sketch follows this description):
  1. the driver,
  2. the mapper, and
  3. the reducer.
• The driver provides details such as the input file locations, the provisions for adding the input file to the map task, the names of the mapper and reducer Java classes, and the location of the reduce task output.
• The mapper provides the logic to be processed on each data block corresponding to the input files specified in the driver code.
• The reducer provides the logic applied to the intermediate key/value pairs emitted by the map tasks to produce the final output.
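A minimal sketch of such a program, closely following the standard Hadoop word count example (input and output paths are passed on the command line; the class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits <word, 1> for every word in its input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the 1s for each word and emits <word, count>
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: names the mapper/reducer classes and the input/output locations
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));     // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}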
The Hadoop Ecosystem
• Hadoop-related Apache projects:
  1. Pig: Provides a high-level data-flow programming language
  2. Hive: Provides SQL-like access (a small example follows this list)
  3. Mahout: Provides analytical tools
  4. HBase: Provides real-time reads and writes
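A small, hedged illustration of Hive's SQL-like access; the weblogs table and its page column are hypothetical:

-- HiveQL: count page views per page in a hypothetical weblogs table
SELECT page, COUNT(*) AS views
FROM weblogs
GROUP BY page
ORDER BY views DESC
LIMIT 10;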
NoSQL
• NoSQL (Not only Structured Query Language) is a term used to describe data stores that are applied to unstructured data.
• Key/value stores contain data (the value) that can be simply accessed by a given identifier (the key).
  • Using a customer's login ID as the key, the value contains the customer's preferences.
  • Using a web session ID as the key, the value contains everything that was captured during the session.
• Document stores are useful when the value of the key/value pair is a file and the file itself is self-describing (for example, JSON or XML); a small illustrative document appears at the end of this list.
  • Content management of web pages
  • Web analytics of stored log data
• Column family stores are useful for sparse datasets: records with thousands of possible columns where only a few columns have entries.
  • To store and render blog entries, tags, and viewers' feedback
  • To store and update various web page metrics and counters
• Graph databases are intended for use cases such as
networks, where there are items (people or web page
links) and relationships between these items.
• Social networks such as Facebook and LinkedIn
• Geospatial applications such as delivery and traffic systems to
optimize the time to reach one or more destinations
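An illustrative self-describing JSON document of the kind a document store might hold for a web session (all field names and values here are hypothetical):

{
  "session_id": "a1b2c3",
  "customer_id": 666730,
  "pages_viewed": ["home", "product/123", "cart"],
  "checkout_completed": false
}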
SQL Essentials
• A relational database, part of a Relational Database
Management System (RDBMS), organizes data in tables
with established relationships between the tables.
SQL query
• SELECT first_name, last_name
  FROM customer
  WHERE customer_id = 666730
• Joins enable a database user to appropriately select columns from two or more tables.
• SELECT c.customer_id, o.order_id, o.product_id, o.item_quantity AS qty
  FROM orders o
  INNER JOIN customer c ON o.customer_id = c.customer_id
  WHERE c.first_name = 'Mason' AND c.last_name = 'Hu'
• INNER JOIN returns those rows from the two tables
where the ON criterion is met.
• The sorting of the records is accomplished with the
ORDER BY clause.
• SELECT c.customer_id, c.first_name, c.last_name, o.order_id
  FROM orders o
  RIGHT OUTER JOIN customer c ON o.customer_id = c.customer_id
  WHERE o.order_id IS NULL
  ORDER BY c.last_name, c.first_name
  LIMIT 5
• RIGHT OUTER JOIN is used to specify that all rows
from the table customer, on the right-hand side
(RHS) of the join, should be returned, regardless of
whether there is a matching customer_id in the
orders table.
Database text analysis
• SELECT SUBSTRING('1234567890', 3, 2)  /* returns '34' */
• SELECT '1234567890' LIKE '%7%'        /* returns True */
• SELECT '1234567890' LIKE '7%'         /* returns False */
• SELECT '1234567890' LIKE '_2%'        /* returns True */
• SELECT '1234567890' LIKE '_3%'        /* returns False */
• SELECT '1234567890' LIKE '__3%'       /* returns True */
Regular expression operators and elements (reference tables)
Advanced SQL
• Window functions (see the example after this list)
• User-defined functions and aggregates
• Ordered aggregates
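A brief, hedged sketch of a window function, reusing the orders table and columns assumed in the earlier join examples; it computes a per-customer running total of item quantities without collapsing rows the way GROUP BY would:

SELECT customer_id,
       order_id,
       item_quantity,
       SUM(item_quantity) OVER (PARTITION BY customer_id
                                ORDER BY order_id) AS running_qty
FROM orders;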
MADlib
• MADlib is an external library used from SQL implementations (for example, PostgreSQL and Greenplum); a brief usage sketch follows this slide.
• MADlib is an open-source library for scalable in-database analytics. It offers data-parallel implementations of mathematical, statistical, and machine learning methods for structured and unstructured data.
Components of MAD
• Magnetic: attract all of an organization's data sources
• Agile: adapt easily to new data sources and analytical methods
• Deep: support sophisticated, statistically rich analytics
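A hedged illustration of how MADlib is typically invoked from SQL; the houses table and its columns are hypothetical, and madlib.linregr_train is one of the library's documented training functions (check the MADlib documentation for the exact signature in a given version):

-- Train a linear regression model in-database (illustrative sketch)
SELECT madlib.linregr_train(
    'houses',                     -- source table (hypothetical)
    'houses_linregr',             -- output table holding the fitted model
    'price',                      -- dependent variable
    'ARRAY[1, tax, bath, size]'   -- independent variables (1 = intercept term)
);

-- Inspect the fitted coefficients and R-squared
SELECT coef, r2 FROM houses_linregr;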
