4-2 Bda PPTS

This document provides an introduction to big data analytics and the characteristics of big data. It discusses the scale, variety, and speed of big data and how these factors require new techniques for managing and analyzing large, complex datasets. The document also introduces Hadoop as an open-source platform for distributed storage and processing of big data at scale. Key topics covered include the value of big data analytics, challenges in handling big data, and technologies like MapReduce that power big data systems.

Big Data Analytics

Introduction

Theme of this Course

• Large-Scale Data Management
• Big Data Analytics
• Data Science and Analytics
• How to manage very large amounts of data and extract value and knowledge from them
Introduction to Big Data

What is Big Data?

What makes data "Big" Data?
Big Data Definition

• No single standard definition…

"Big Data" is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Characteristics of Big Data:
1 - Scale (Volume)
• Data volume is increasing exponentially
• 44x increase from 2009 to 2020: from 0.8 zettabytes to 35 ZB

[Chart: exponential increase in collected/generated data]
Characteristics of Big Data:
2 - Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
Characteristics of Big Data:
3 - Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions → missing opportunities
• Examples
• E-Promotions: based on your current location and your purchase history, send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body; any abnormal measurement requires immediate reaction
Big Data: 3V's

Some Make it 4V's
Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data architecture & technology)
Who's Generating Big Data

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
The Model Has Changed…
• The model of generating/consuming data has changed

Old model: few companies generate data, all others consume it

New model: all of us generate data, and all of us consume data
What's Driving Big Data

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

…moving toward:

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
Value of Big Data Analytics
• Big data is more real-time in nature than traditional DW applications
• Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
• Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
Challenges in Handling Big Data

• The bottleneck is in technology
• New architectures, algorithms, and techniques are needed

• Also in technical skills
• Experts in using the new technology and dealing with big data
What Technology Do We Have for Big Data?

Big Data Technology
What You Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and works)
• How big data are managed in a scalable, efficient way
• Learn to write Hadoop jobs in different languages
• Programming languages: Java, C, Python
• High-level languages: Apache Pig, Hive
• Learn advanced analytics tools on top of Hadoop
• RHadoop: statistical tools for managing big data
• Mahout: data mining and machine learning tools over big data
• Learn state-of-the-art technology from recent research papers
• Optimizations, indexing techniques, and other extensions to Hadoop
Course Logistics

• Web page: http://web.cs.wpi.edu/~cs525/s13-MYE/
• Electronic WPI system: blackboard.wpi.edu
• Lectures: Tuesday, Thursday (4:00pm - 5:20pm)
Textbook & Reading List
• No specific textbook
• Big Data is a relatively new topic (so no fixed syllabus)

• Reading list
• We will cover state-of-the-art technology from research papers in big conferences
• Many Hadoop-related papers are available on the course website

• Related books:
• Hadoop: The Definitive Guide [pdf]
Requirements & Grading
• Seminar-type course
• Students will read research papers and present them (see Reading List); done in teams of two
• Hands-on course
• No written homework or exams
• Several coding projects covering the entire semester
Requirements & Grading (Cont'd)
• Reviews
• When a team is presenting (not the instructor), the other students should prepare a review of the presented paper
• The course website gives guidelines on how to write good reviews

• Reviews are done individually
Late Submission Policy
• For projects
• One day late → 10% off the max grade
• Two days late → 20% off the max grade
• Three days late → 30% off the max grade
• Beyond that, no late submission is accepted
• Submissions:
• Submitted via the Blackboard system by the due date
• Demonstrated to the instructor within the following week

• For reviews
• No late submissions
• A student may skip at most 4 reviews
• Submissions:
• Given to the instructor at the beginning of class
More about Projects
• A virtual machine is provided including the needed platform for the projects
• Ubuntu OS (version 12.10)
• Hadoop platform (version 1.1.0)
• Apache Pig (version 0.10.0)
• Mahout library (version 0.7)
• RHadoop
• In addition to other software packages

• Download it from the course website (link)
• Username and password will be sent to you
• Needs VirtualBox (VBox) [free]
Next Step from You…
1. Form teams of two
2. Visit the course website (Reading List); each team selects its first paper to present (1st come, 1st served)
• Send me your top 2-3 choices
3. You have until Jan 20th
• Otherwise, I'll randomly form teams and assign papers
4. Use the Blackboard "Discussion" forum for posts or for searching for teammates
Course Output: What You Will Learn…
(Recap of the list above: the Hadoop/MapReduce platform, writing Hadoop jobs in several languages, analytics tools such as RHadoop and Mahout, and state-of-the-art extensions from recent research.)
Open Source World's Solution

• Google File System → Hadoop Distributed FS
• Map-Reduce → Hadoop Map-Reduce
• Sawzall → Pig, Hive, JAQL
• Big Table → Hadoop HBase, Cassandra
• Chubby → ZooKeeper
Simplified Search Engine Architecture

[Diagram: Internet → Spider → Search Log Storage → Batch Processing System on top of Hadoop → Runtime → SE Web Server]

Simplified Data Warehouse Architecture

[Diagram: Web Server → View/Click/Events Log Storage → Batch Processing System on top of Hadoop → Database → Business Intelligence, with Domain Knowledge as an input]

Hadoop History
• Jan 2006 – Doug Cutting joins Yahoo
• Feb 2006 – Hadoop splits out of Nutch, and Yahoo starts using it
• Dec 2006 – Yahoo creating a 100-node Webmap with Hadoop
• Apr 2007 – Yahoo on a 1000-node cluster
• Dec 2007 – Yahoo creating a 1000-node Webmap with Hadoop
• Jan 2008 – Hadoop made a top-level Apache project
• Sep 2008 – Hive added to Hadoop as a contrib project
Hadoop Introduction
• Open source Apache project
• http://hadoop.apache.org/
• Book: http://oreilly.com/catalog/9780596521998/index.html
• Written in Java
• Does work with other languages
• Runs on
• Linux, Windows, and more
• Commodity hardware with high failure rate
Current Status of Hadoop
• Largest cluster
• 2000 nodes (8 cores, 4 TB disk each)
• Used by 40+ companies / universities over the world
• Yahoo, Facebook, etc.
• Cloud computing donation from Google and IBM
• Startup focusing on providing services for Hadoop
• Cloudera
Hadoop Components
• Hadoop Distributed File System (HDFS)
• Hadoop Map-Reduce
• Contrib projects
• Hadoop Streaming
• Pig / JAQL / Hive
• HBase
• Hama / Mahout
Hadoop Distributed File System

Goals of HDFS
• Very large distributed file system
• 10K nodes, 100 million files, 10 PB
• Convenient cluster management
• Load balancing
• Node failures
• Cluster expansion
• Optimized for batch processing
• Allows moving computation to data
• Maximizes throughput

HDFS Architecture
HDFS Details
• Data coherency
• Write-once-read-many access model
• Clients can only append to existing files
• Files are broken up into blocks
• Typically 128 MB block size
• Each block replicated on multiple DataNodes
• Intelligent client
• Clients can find the location of blocks
• Clients access data directly from the DataNode
HDFS User Interface
• Java API
• Command line
• hadoop dfs -mkdir /foodir
• hadoop dfs -cat /foodir/myfile.txt
• hadoop dfs -rm /foodir/myfile.txt
• hadoop dfsadmin -report
• hadoop dfsadmin -decommission datanodename
• Web interface
• http://host:port/dfshealth.jsp
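
As a quick round-trip illustration (hypothetical paths, assuming a running Hadoop 1.x cluster and the dfs syntax above):

  hadoop dfs -mkdir /foodir
  hadoop dfs -put localfile.txt /foodir/myfile.txt
  hadoop dfs -cat /foodir/myfile.txt
  hadoop dfs -rm /foodir/myfile.txt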
More about HDFS
• http://hadoop.apache.org/core/docs/current/hdfs_design.html

• Hadoop FileSystem API supports several implementations:
• HDFS
• Local file system
• Kosmos File System (KFS)
• Amazon S3 file system
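
A minimal sketch of reading a file through that FileSystem API (Hadoop 1.x style; the path is illustrative, and the concrete file system comes from the cluster's configuration):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsCat {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up core-site.xml etc.
      FileSystem fs = FileSystem.get(conf);       // HDFS, local FS, KFS or S3 depending on config
      Path p = new Path("/foodir/myfile.txt");    // hypothetical path
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);                 // print the file, line by line
      }
      in.close();
    }
  }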
Hadoop Map-Reduce and Hadoop Streaming

Hadoop Map-Reduce Introduction
• Map/Reduce works like a parallel Unix pipeline:
• cat input | grep | sort | uniq -c | cat > output
• Input | Map | Shuffle & Sort | Reduce | Output
• Framework does inter-node communication
• Failure recovery, consistency, etc.
• Load balancing, scalability, etc.
• Fits a lot of batch processing applications
• Log processing
• Web index building
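
For instance, the pipeline analogy instantiated as a single-machine word count (the same shape as the streaming job shown later; file names are illustrative):

  cat input.txt | tr " " "\n" | sort | uniq -c > output.txt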
(Simplified) Map Reduce Review

[Diagram: on each machine, input pairs <k, v> pass through a local Map into intermediate pairs <nk, nv>; a global Shuffle groups pairs by new key across machines; a local Sort orders them; a local Reduce aggregates each group, e.g. <nk1, nv1> and <nk1, nv6> → <nk1, 2> on machine 1, and <nk2, nv4>, <nk2, nv5>, <nk2, nv2> → <nk2, 3> on machine 2]

Local Map → Global Shuffle → Local Sort → Local Reduce

Physical Flow

Example Code
Hadoop Streaming
• Allows writing Map and Reduce functions in any language
• Hadoop Map/Reduce natively accepts only Java

• Example: word count

  hadoop streaming
    -input /user/zshao/articles
    -mapper 'tr " " "\n"'
    -reducer 'uniq -c'
    -output /user/zshao/
    -numReduceTasks 32
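
Streaming mappers and reducers simply read lines on stdin and write tab-separated key/value lines on stdout, so any executable can be plugged in. A hypothetical variant of the same job with custom scripts (script names and output path are illustrative; -file ships the scripts to the cluster):

  hadoop streaming
    -input /user/zshao/articles
    -mapper my_mapper.sh
    -reducer my_reducer.sh
    -file my_mapper.sh -file my_reducer.sh
    -output /user/zshao/wordcount
    -numReduceTasks 32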
Example: Log Processing
• Generate #pageviews and #distinct users for each page each day
• Input: timestamp url userid
• Generate the number of page views
• Map: emit <<date(timestamp), url>, 1>
• Reduce: add up the values for each row
• Generate the number of distinct users
• Map: emit <<date(timestamp), url, userid>, 1>
• Reduce: for the set of rows with the same <date(timestamp), url>, count the number of distinct users (as with "uniq -c")
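
A minimal sketch of the pageview-count pass in Java (old Hadoop 1.x mapred API; the tab-separated field layout and ISO timestamp prefix are assumptions based on the slide's input format):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class PageViews {
    // Map: turn each log line "timestamp url userid" into <date + url, 1>
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      public void map(LongWritable key, Text line,
          OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
        String[] f = line.toString().split("\t");   // assumed tab-separated
        String date = f[0].substring(0, 10);        // assumes ISO timestamps; illustrative
        out.collect(new Text(date + "\t" + f[1]), ONE);
      }
    }
    // Reduce: sum the 1s for each <date, url>
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterator<LongWritable> vals,
          OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
        long sum = 0;
        while (vals.hasNext()) sum += vals.next().get();
        out.collect(key, new LongWritable(sum));
      }
    }
  }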
Example: Page Rank
• In each Map/Reduce job:
• Map: for each input <url, <eigenvalue, vector<link>>>, emit <link, eigenvalue(url)/#links>
• Reduce: add all values up for each link, to generate the new eigenvalue for that link

• Run ~50 map/reduce jobs till the eigenvalues are stable
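
The iteration itself is just a driver loop chaining jobs, each job's output directory becoming the next job's input. A hypothetical sketch (old mapred API; RankMap and RankSum stand in for the map and reduce described above, and the paths are illustrative):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.*;

  public class PageRankDriver {
    public static void main(String[] args) throws Exception {
      for (int i = 0; i < 50; i++) {                       // ~50 iterations until stable
        JobConf job = new JobConf(PageRankDriver.class);
        job.setJobName("pagerank-" + i);
        FileInputFormat.setInputPaths(job, new Path("ranks-" + i));
        FileOutputFormat.setOutputPath(job, new Path("ranks-" + (i + 1)));
        job.setMapperClass(RankMap.class);                 // emits <link, rank(url)/#links>
        job.setReducerClass(RankSum.class);                // sums contributions per link
        JobClient.runJob(job);                             // blocks until the job finishes
      }
    }
  }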
TODO: Split Job Scheduler and Map-Reduce
• Allow easy plug-in of different scheduling algorithms
• Scheduling based on job priority, size, etc.
• Scheduling for CPU, disk, memory, network bandwidth
• Preemptive scheduling
• Allow running MPI or other jobs on the same cluster
• PageRank is best done with MPI
TODO: Faster Map-Reduce

[Diagram: a pipelined Mapper → Sender → Receiver → Reducer design in which map output is streamed to reducers R1..Rn as it is produced and merge-sorted on the receiver side before Reduce; the notes mention user buffers, spilling to disk, flow control, and checkpointing]
MapReduce and Hadoop Distributed File System

CCSCNE 2009, Plattsburgh, April 24 2009 (B. Ramamurthy & K. Madurai)


The Context: Big-data

⚫ Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
⚫ Google collects 270PB of data in a month (2007), 20000PB a day (2008)
⚫ 2010 census data is expected to be a huge gold mine of information
⚫ Data mining huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance
⚫ We are in a knowledge economy
 Data is an important asset to any organization
 Discovery of knowledge; enabling discovery; annotation of data
⚫ We are looking at newer
 programming models, and
 supporting algorithms and data structures
⚫ NSF refers to it as "data-intensive computing"; industry calls it "big data" and "cloud computing"
Purpose of this talk

⚫ To provide a simple introduction to:
 "Big-data computing": an important advancement that has the potential to significantly impact the CS undergraduate curriculum
 A programming model called MapReduce for processing "big data"
 A supporting file system called the Hadoop Distributed File System (HDFS)
⚫ To encourage educators to explore ways to infuse relevant concepts of this emerging area into their curriculum


The Outline

⚫ Introduction to MapReduce
⚫ From CS foundations to MapReduce
⚫ MapReduce programming model
⚫ Hadoop Distributed File System
⚫ Relevance to undergraduate curriculum
⚫ Demo (Internet access needed)
⚫ Our experience with the framework
⚫ Summary
⚫ References

MapReduce


What is MapReduce?

⚫ MapReduce is a programming model Google has used successfully in processing its "big-data" sets (~20000 petabytes per day)
 Users specify the computation in terms of a map and a reduce function
 The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and
 also handles machine failures, efficient communications, and performance issues
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.


From CS Foundations to MapReduce

Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green, …}
Problem: count the occurrences of the different words in the collection.

Let's design a solution for this problem:
 We will start from scratch
 We will add and relax constraints
 We will do incremental design, improving the solution for performance and scalability


Word Counter and Result Table

[Diagram: a single WordCounter with parse() and count() methods, driven by Main, reads the DataCollection {web, weed, green, sun, moon, land, part, web, green, …} and fills a ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1]


Multiple Instances of Word Counter

[Diagram: the same WordCounter (parse(), count()) now run by multiple threads (Main spawns Thread 1..*), all filling the shared ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1]

Observe:
• Multi-threaded
• Lock on shared data


Improve Word Counter for Performance

[Diagram: parsing and counting are split into separate Parser and Counter classes connected by a WordList of <KEY, VALUE> pairs (web, weed, green, sun, moon, land, part, web, green, …); with separate counters per thread there is no need for a lock on the shared ResultTable]
Peta-scale Data

[Diagram: the same Parser (1..*) → WordList → Counter (1..*) pipeline, now applied to a peta-scale DataCollection feeding the ResultTable]
Addressing the Scale Issue

⚫ A single machine cannot serve all the data: you need a distributed special (file) system
⚫ Large number of commodity hardware disks: say, 1000 disks of 1TB each
 Issue: with a mean time between failures (MTBF) or failure rate of 1/1000, at least 1 of the above 1000 disks would be down at any given time
 Thus failure is the norm and not an exception
 The file system has to be fault-tolerant: replication, checksums
 Data transfer bandwidth is critical (location of data)
⚫ Critical aspects: fault tolerance + replication + load balancing, monitoring
⚫ Exploit the parallelism afforded by splitting parsing and counting
⚫ Provision and locate computing at data locations


Peta-scale Data

[Diagram repeated: Parser (1..*) → WordList → Counter (1..*) over a peta-scale DataCollection feeding the ResultTable]
Peta Scale Data is Commonly Distributed

[Diagram: the same pipeline, but the input is now many distributed DataCollections feeding the Parsers and Counters. Issue: managing the large-scale data]
Write Once Read Many (WORM) Data

[Diagram: the same pipeline over distributed DataCollections; the log-style input data is written once and read many times]
WORM Data is Amenable to Parallelism

[Diagram: distributed DataCollections feeding Parser (1..*) → WordList → Counter (1..*)]

1. Data with WORM characteristics yields to parallel processing
2. Data without dependencies yields to out-of-order processing


Divide and Conquer: Provision Computing at Data Location

[Diagram: each of several nodes runs the Main → Parser → WordList → Counter pipeline over its local slice of the DataCollection]

For our example:
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks

This is a particular solution; let's generalize it:

Our parse is a mapping operation:
MAP: input → <key, value> pairs

Our count is a reduce operation:
REDUCE: <key, value> pairs reduced

Map/Reduce originated from Lisp, but the operations have a different meaning here.

The runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!


Mapper and Reducer

Remember: MapReduce is simplified processing for larger data sets.

[Slide shows the MapReduce version of the WordCount source code]
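
The slide's source code itself was an image in this export; below is a minimal sketch of the classic Hadoop WordCount in the old (Hadoop 0.20/1.x) mapred API, matching the era of the talk:

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCount {
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
        StringTokenizer tok = new StringTokenizer(value.toString());
        while (tok.hasMoreTokens()) {
          word.set(tok.nextToken());
          out.collect(word, ONE);                 // emit <word, 1>
        }
      }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) sum += values.next().get();
        out.collect(key, new IntWritable(sum));   // emit <word, total>
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);        // combiner = local reduce
      conf.setReducerClass(Reduce.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }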
Map Operation

MAP: input data → <key, value> pairs

[Diagram: the Data Collection is split (split 1 … split n) and supplied to multiple processors; each Map task turns its split into a stream of <KEY, VALUE> pairs such as <web, 1>, <weed, 1>, <green, 1>, <sun, 1>, <moon, 1>, <land, 1>, <part, 1>, <web, 1>, <green, 1>, …]


Reduce Operation

MAP: input data → <key, value> pairs
REDUCE: <key, value> pairs → <result>

[Diagram: Data Collection split 1 … split n → one Map task per split → shuffle → multiple Reduce tasks]


Large-scale data splits

[Diagram: splits → Map (parse-hash) emitting <key, 1> → Reducers (say, Count) producing output partitions P-0000 (count1), P-0001 (count2), P-0002 (count3)]


MapReduce Example in my operating systems class

[Diagram: input words "Cat", "Bat", "Dog", and other words (size: TBytes) → split → map → combine → reduce → output files part0, part1, part2]
MapReduce Programming Model


MapReduce programming model

⚫ Determine if the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? Is the data set large?)
⚫ Design and implement a solution as Mapper and Reducer classes
⚫ Compile the source code against the Hadoop core
⚫ Package the code as a jar executable
⚫ Configure the application (job): number of mappers and reducers (tasks), input and output streams
⚫ Load the data (or use previously available data)
⚫ Launch the job and monitor
⚫ Study the result
⚫ Detailed steps (a sketch of the compile/package/launch commands follows)
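
A hedged sketch of those steps as Hadoop 1.x commands (class, jar, and path names are illustrative):

  javac -classpath hadoop-core-1.1.0.jar -d classes WordCount.java
  jar cvf wordcount.jar -C classes .
  hadoop dfs -put local-input/ input
  hadoop jar wordcount.jar WordCount input output
  hadoop dfs -cat output/part-00000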



MapReduce Characteristics

⚫ Very large scale data: peta- and exabytes
⚫ Write-once-read-many data: allows for parallelism without mutexes
⚫ Map and Reduce are the main operations: simple code
⚫ There are other supporting operations such as combine and partition (out of the scope of this talk)
⚫ All the maps should be completed before the reduce operation starts
⚫ Map and reduce operations are typically performed by the same physical processor
⚫ The numbers of map tasks and reduce tasks are configurable
⚫ Operations are provisioned near the data
⚫ Commodity hardware and storage
⚫ The runtime takes care of splitting and moving data for operations
⚫ Special distributed file system, e.g., the Hadoop Distributed File System and Hadoop runtime
Classes of problems "mapreducable"

⚫ Benchmark for comparing: Jim Gray's challenge on data-intensive computing, e.g. "Sort"
⚫ Google uses it (we think) for wordcount, AdWords, PageRank, indexing data
⚫ Simple algorithms such as grep, text indexing, reverse indexing
⚫ Bayesian classification: data mining domain
⚫ Facebook uses it for various operations: demographics
⚫ Financial services use it for analytics
⚫ Astronomy: Gaussian analysis for locating extraterrestrial objects
⚫ Expected to play a critical role in the semantic web and Web 3.0


Scope of MapReduce

Data size: small
• Pipelined: instruction level
• Concurrent: thread level
• Service: object level
• Indexed: file level
• Mega: block level
• Virtual: system level
Data size: large

Hadoop


What is Hadoop?

⚫ At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose
⚫ GFS is not open source
⚫ Doug Cutting and Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS)
⚫ The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop, or simply Hadoop
⚫ This is open source and distributed by Apache


Basic Features: HDFS

⚫ Highly fault-tolerant
⚫ High throughput
⚫ Suitable for applications with large data sets
⚫ Streaming access to file system data
⚫ Can be built out of commodity hardware


Hadoop Distributed File System

[Diagram: an application uses the HDFS client to talk to the HDFS server (master/name node); the local file system uses a small block size (e.g. 2K) while HDFS uses a 128M block size, replicated across data nodes]

Hadoop Distributed File System

[Same diagram, adding the blockmap and heartbeat messages between the name node and the data nodes]

More details: we discuss this in great detail in my Operating Systems course.
Relevance and Impact on Undergraduate Courses

⚫ Data structures and algorithms: a new look at traditional algorithms such as sort. Quicksort may not be your choice: it is not easily parallelizable. Merge sort is better.
⚫ You can identify mappers and reducers among your algorithms. Mappers and reducers are simply placeholders for algorithms relevant to your applications.
⚫ Large-scale data and analytics are indeed concepts to reckon with, similar to how we addressed "programming in the large" with OO concepts.
⚫ While a full course on MR/HDFS may not be warranted, the concepts can perhaps be woven into most courses in our CS curriculum.


Demo

⚫ VMware-simulated Hadoop and MapReduce demo
⚫ Remote access to the NEXOS system at my Buffalo office
⚫ 5-node cluster running HDFS on Ubuntu 8.04
⚫ 1 name node and 4 data nodes
⚫ Each is an old commodity PC with 512 MB RAM, 120GB-160GB external storage
⚫ Zeus (namenode); datanodes: hermes, dionysus, aphrodite, athena


Summary

⚫ We introduced the MapReduce programming model for processing large-scale data
⚫ We discussed the supporting Hadoop Distributed File System
⚫ The concepts were illustrated using a simple example
⚫ We reviewed some important parts of the source code for the example
⚫ Relationship to cloud computing


References

1. Apache Hadoop Tutorial: http://hadoop.apache.org and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html
Hive - SQL on top of Hadoop

Map-Reduce and SQL
• Map-Reduce is scalable
– SQL has a huge user base
– SQL is easy to code
• Solution: combine SQL and Map-Reduce
– Hive on top of Hadoop (open source)
– Aster Data (proprietary)
– Greenplum (proprietary)
Hive
• A database/data warehouse on top of Hadoop
– Rich data types (structs, lists and maps)
– Efficient implementations of SQL filters, joins and group-bys on top of map-reduce
• Allows users to access Hive data without using Hive
• Link:
– http://svn.apache.org/repos/asf/hadoop/hive/trunk/
Hive Architecture

[Diagram: Web UI and Hive CLI (browsing, queries, DDL, management) sit on top of Hive QL (Parser, Planner, Execution) and the MetaStore; a SerDe layer (Thrift, Jute, JSON) connects down to Map Reduce and HDFS; external access goes through the Thrift API]
Hive QL – Join

[Example: page_view (pageid, userid, time) = {(1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)} joined with user (userid, age, gender) = {(111, 25, female), (222, 32, male)} gives pv_users (pageid, age) = {(1, 25), (2, 25), (1, 32)}]

• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce

[Diagram: the Map phase tags each row with its source table and emits the join key: page_view rows become <userid, <1, pageid>> and user rows become <userid, <2, age>>. Shuffle and sort bring rows with the same userid together, and the Reduce phase combines the tagged values to produce pv_users rows <pageid, age>]
Hive QL – Group By

[Example: pv_users (pageid, age) = {(1, 25), (2, 25), (1, 32), (2, 25)} → pageid_age_sum (pageid, age, count) = {(1, 25, 1), (2, 25, 2), (1, 32, 1)}]

• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce

[Diagram: Map emits <<pageid, age>, 1> for each row; shuffle and sort group identical keys across machines; Reduce sums the 1s to produce <pageid, age, count>]
Hive QL – Group By with Distinct

[Example: page_view (pageid, userid, time) = {(1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14), (2, 111, 9:08:20)} → result (pageid, count_distinct_userid) = {(1, 2), (2, 1)}]

• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid
Hive QL – Group By with Distinct in Map Reduce

[Diagram: Map emits <pageid, userid> keys; in the shuffle and sort, the shuffle key (pageid) is a prefix of the sort key, so each reducer sees a pageid's userids in sorted order and can count the distinct ones, producing {(1, 2), (2, 1)}]
Hive QL: Order By

[Diagram: Map emits <<userid, pageid>, time> pairs; rows are shuffled (randomly across reducers) and sorted by key, so each reducer outputs its page_view rows in order]
Hive Optimizations

Efficient execution of SQL on top of Map-Reduce

(Simplified) Map Reduce Revisit

[Diagram repeated from earlier: Local Map → Global Shuffle → Local Sort → Local Reduce across two machines]
Merge Sequential Map Reduce Jobs

• SQL:
FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key
SELECT …

[Diagram: instead of one Map-Reduce job for (a JOIN b) = AB and a second for (AB JOIN c) = ABC, sequential jobs on the same key are merged into one: key 1 with av 111, bv 222, cv 333 ends up in a single row (1, 111, 222, 333)]
Share Common Read Operations

• Extended SQL:
FROM pv_users
INSERT INTO TABLE pv_pageid_sum
  SELECT pageid, count(1)
  GROUP BY pageid
INSERT INTO TABLE pv_age_sum
  SELECT age, count(1)
  GROUP BY age;

[Diagram: one scan of pv_users {(1, 25), (2, 32)} feeds two map-reduce pipelines, producing pageid counts {(1, 1), (2, 1)} and age counts {(25, 1), (32, 1)}]
Load Balance Problem

[Diagram: pv_users is skewed, e.g. (pageid 1, age 25) appears many times; computing partial aggregates (pageid_age_partial_sum) before the shuffle reduces the skew: {(1,25) x4, (2,32)} → partial sums {(1, 25, 2), (2, 32, 1), (1, 25, 2)} → final {(1, 25, 4), (2, 32, 1)}]
Map-side Aggregation / Combiner

[Diagram: each machine pre-aggregates locally during the Map phase, e.g. machine 1 emits <male, 343> and <female, 128>, machine 2 emits <male, 123> and <female, 244>; after shuffle and sort, the reducers only combine the partial sums into <male, 466> and <female, 372>]
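
In Hadoop terms this is what a combiner does. When the reduce function is associative and commutative (like a sum), the reducer class can be reused as the combiner; a sketch of the relevant JobConf lines, assuming Map and Reduce classes like those in the earlier WordCount sketch:

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);   // runs on the map side, pre-aggregating locally
  conf.setReducerClass(Reduce.class);    // final aggregation after the shuffle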
Query Rewrite

• Predicate push-down
– select * from (select * from t) where col1 = '2008';
• Column pruning
– select col1, col3 from (select * from t);
TODO: Column-based Storage and Map-side Join

[Table 1: url, page quality, IP: http://a.com/ (90, 65.1.2.3), http://b.com/ (20, 68.9.0.81), http://c.com/ (68, 11.3.85.1)]
[Table 2: url, clicked, viewed: http://a.com/ (12, 145), http://b.com/ (45, 383), http://c.com/ (23, 67)]
MetaStore
• Stores table/partition properties:
– Table schema and SerDe library
– Table location on HDFS
– Logical partitioning keys and types
– Other information
• Thrift API
– Current clients in PHP (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (tests)
• Metadata can be stored as text files or even in a SQL backend
Hive CLI
• DDL:
– create table/drop table/rename table
– alter table add column
• Browsing:
– show tables
– describe table
– cat table
• Loading Data
• Queries
Web UI for Hive

• MetaStore UI:
– Browse and navigate all tables in the system
– Comment on each table and each column
– Also captures data dependencies
• HiPal:
– Interactively construct SQL queries by mouse clicks
– Supports projection, filtering, group by and joining
– Also supports …
Hive Query Language
• Philosophy
– SQL
– Map-Reduce with custom scripts (hadoop streaming)

• Query Operators
– Projections
– Equi-joins
– Group by
– Sampling
– Order By
Hive QL – Custom Map/Reduce Scripts

• Extended SQL:
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script' AS (dt, uid)
  CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
  REDUCE map.dt, map.uid
  USING 'reduce_script' AS (date, count);

• Map-Reduce: similar to Hadoop streaming