4-2 Bda PPTS

This document provides an introduction to big data analytics and the characteristics of big data. It discusses the scale, variety, and speed of big data and how these factors require new techniques for managing and analyzing large, complex datasets. The document also introduces Hadoop as an open-source platform for distributed storage and processing of big data at scale. Key topics covered include the value of big data analytics, challenges in handling big data, and technologies like MapReduce that power big data systems.

Big Data Analytics

Introduction

Theme of this Course

• Large-Scale Data Management
• Big Data Analytics
• Data Science and Analytics
• How to manage very large amounts of data and extract value and knowledge from them
Introduction to Big Data

What is Big Data?

What makes data "Big" Data?
Big Data Definition

• No single standard definition…

"Big Data" is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…
Characteristics of Big Data:
1 - Scale (Volume)
• Data volume is increasing exponentially
• 44x increase from 2009 to 2020: from 0.8 zettabytes to 35 ZB

[Chart: exponential increase in collected/generated data]
Characteristics of Big Data:
2 - Complexity (Variety)
• Various formats, types, and structures
• Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc.
• Static data vs. streaming data
• A single application can be generating/collecting many types of data
Characteristics of Big Data:
3 - Speed (Velocity)
• Data is being generated fast and needs to be processed fast
• Online data analytics
• Late decisions → missing opportunities
• Examples
• E-Promotions: based on your current location and your purchase history, send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activities and body; any abnormal measurement requires immediate reaction
Big Data: 3V's

Some Make it 4V's
Harnessing Big Data

• OLTP: Online Transaction Processing (DBMSs)
• OLAP: Online Analytical Processing (Data Warehousing)
• RTAP: Real-Time Analytics Processing (Big Data architecture & technology)
Who's Generating Big Data

• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)

• Progress and innovation are no longer hindered by the ability to collect data
• But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion
The Model Has Changed…
• The model of generating/consuming data has changed

Old model: few companies generate data, all others consume it

New model: all of us generate data, and all of us consume data
What's Driving Big Data

- Ad-hoc querying and reporting
- Data mining techniques
- Structured data, typical sources
- Small to mid-size datasets

…moving toward:

- Optimizations and predictive analytics
- Complex statistical analysis
- All types of data, and many sources
- Very large datasets
- More real-time
Value of Big Data Analytics
• Big data is more real-time in nature than traditional DW applications
• Traditional DW architectures (e.g. Exadata, Teradata) are not well-suited for big data apps
• Shared-nothing, massively parallel processing, scale-out architectures are well-suited for big data apps
Challenges in Handling Big Data

• The bottleneck is in technology
• New architectures, algorithms, and techniques are needed

• Also in technical skills
• Experts in using the new technology and dealing with big data
What Technology Do We Have for Big Data?

Big Data Technology
What You Will Learn…
• We focus on Hadoop/MapReduce technology
• Learn the platform (how it is designed and works)
• How big data are managed in a scalable, efficient way
• Learn to write Hadoop jobs in different languages
• Programming languages: Java, C, Python
• High-level languages: Apache Pig, Hive
• Learn advanced analytics tools on top of Hadoop
• RHadoop: statistical tools for managing big data
• Mahout: data mining and machine learning tools over big data
• Learn state-of-the-art technology from recent research papers
• Optimizations, indexing techniques, and other extensions to Hadoop
Course Logistics

• Web page: http://web.cs.wpi.edu/~cs525/s13-MYE/
• Electronic WPI system: blackboard.wpi.edu
• Lectures: Tuesday, Thursday (4:00pm - 5:20pm)
Textbook & Reading List
• No specific textbook
• Big Data is a relatively new topic (so no fixed syllabus)

• Reading list
• We will cover state-of-the-art technology from research papers in big conferences
• Many Hadoop-related papers are available on the course website

• Related books:
• Hadoop: The Definitive Guide [pdf]
Requirements & Grading
• Seminar-type course
• Students will read research papers and present them (see Reading List); done in teams of two
• Hands-on course
• No written homework or exams
• Several coding projects covering the entire semester
Requirements & Grading (Cont'd)
• Reviews
• When a team is presenting (not the instructor), the other students should prepare a review of the presented paper
• The course website gives guidelines on how to write good reviews

• Reviews are done individually
Late Submission Policy
• For projects
• One day late → 10% off the max grade
• Two days late → 20% off the max grade
• Three days late → 30% off the max grade
• Beyond that, no late submission is accepted
• Submissions:
• Submitted via the Blackboard system by the due date
• Demonstrated to the instructor within the following week

• For reviews
• No late submissions
• A student may skip at most 4 reviews
• Submissions:
• Given to the instructor at the beginning of class
More about Projects
• A virtual machine is provided including the needed platform for the projects
• Ubuntu OS (version 12.10)
• Hadoop platform (version 1.1.0)
• Apache Pig (version 0.10.0)
• Mahout library (version 0.7)
• RHadoop
• In addition to other software packages

• Download it from the course website (link)
• Username and password will be sent to you
• Needs VirtualBox (VBox) [free]
Next Step from You…
1. Form teams of two
2. Visit the course website (Reading List); each team selects its first paper to present (1st come, 1st served)
• Send me your top 2-3 choices
3. You have until Jan 20th
• Otherwise, I'll randomly form teams and assign papers
4. Use the Blackboard "Discussion" forum for posts or for searching for teammates
Course Output: What You Will Learn…
(Recap of the list above: the Hadoop/MapReduce platform, writing Hadoop jobs in several languages, analytics tools such as RHadoop and Mahout, and state-of-the-art extensions from recent research.)
Open Source World's Solution

• Google File System → Hadoop Distributed FS
• Map-Reduce → Hadoop Map-Reduce
• Sawzall → Pig, Hive, JAQL
• Big Table → Hadoop HBase, Cassandra
• Chubby → ZooKeeper
Simplified Search Engine Architecture

[Diagram: Internet → Spider → Search Log Storage → Batch Processing System on top of Hadoop → Runtime → SE Web Server]

Simplified Data Warehouse Architecture

[Diagram: Web Server → View/Click/Events Log Storage → Batch Processing System on top of Hadoop → Database → Business Intelligence, with Domain Knowledge as an input]

Hadoop History
• Jan 2006 – Doug Cutting joins Yahoo
• Feb 2006 – Hadoop splits out of Nutch, and Yahoo starts using it
• Dec 2006 – Yahoo creating a 100-node Webmap with Hadoop
• Apr 2007 – Yahoo on a 1000-node cluster
• Dec 2007 – Yahoo creating a 1000-node Webmap with Hadoop
• Jan 2008 – Hadoop made a top-level Apache project
• Sep 2008 – Hive added to Hadoop as a contrib project
Hadoop Introduction
• Open source Apache project
• http://hadoop.apache.org/
• Book: http://oreilly.com/catalog/9780596521998/index.html
• Written in Java
• Does work with other languages
• Runs on
• Linux, Windows, and more
• Commodity hardware with high failure rate
Current Status of Hadoop
• Largest cluster
• 2000 nodes (8 cores, 4 TB disk each)
• Used by 40+ companies / universities over the world
• Yahoo, Facebook, etc.
• Cloud computing donation from Google and IBM
• Startup focusing on providing services for Hadoop
• Cloudera
Hadoop Components
• Hadoop Distributed File System (HDFS)
• Hadoop Map-Reduce
• Contrib projects
• Hadoop Streaming
• Pig / JAQL / Hive
• HBase
• Hama / Mahout
Hadoop Distributed File System

Goals of HDFS
• Very large distributed file system
• 10K nodes, 100 million files, 10 PB
• Convenient cluster management
• Load balancing
• Node failures
• Cluster expansion
• Optimized for batch processing
• Allows moving computation to data
• Maximizes throughput

HDFS Architecture
HDFS Details
• Data coherency
• Write-once-read-many access model
• Clients can only append to existing files
• Files are broken up into blocks
• Typically 128 MB block size
• Each block replicated on multiple DataNodes
• Intelligent client
• Clients can find the location of blocks
• Clients access data directly from the DataNode
HDFS User Interface
• Java API
• Command line
• hadoop dfs -mkdir /foodir
• hadoop dfs -cat /foodir/myfile.txt
• hadoop dfs -rm /foodir/myfile.txt
• hadoop dfsadmin -report
• hadoop dfsadmin -decommission datanodename
• Web interface
• http://host:port/dfshealth.jsp
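
As a quick round-trip illustration (hypothetical paths, assuming a running Hadoop 1.x cluster and the dfs syntax above):

  hadoop dfs -mkdir /foodir
  hadoop dfs -put localfile.txt /foodir/myfile.txt
  hadoop dfs -cat /foodir/myfile.txt
  hadoop dfs -rm /foodir/myfile.txt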
More about HDFS
• http://hadoop.apache.org/core/docs/current/hdfs_design.html

• Hadoop FileSystem API supports several implementations:
• HDFS
• Local file system
• Kosmos File System (KFS)
• Amazon S3 file system
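
A minimal sketch of reading a file through that FileSystem API (Hadoop 1.x style; the path is illustrative, and the concrete file system comes from the cluster's configuration):

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsCat {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();   // picks up core-site.xml etc.
      FileSystem fs = FileSystem.get(conf);       // HDFS, local FS, KFS or S3 depending on config
      Path p = new Path("/foodir/myfile.txt");    // hypothetical path
      BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);                 // print the file, line by line
      }
      in.close();
    }
  }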
Hadoop Map-Reduce and Hadoop Streaming

Hadoop Map-Reduce Introduction
• Map/Reduce works like a parallel Unix pipeline:
• cat input | grep | sort | uniq -c | cat > output
• Input | Map | Shuffle & Sort | Reduce | Output
• Framework does inter-node communication
• Failure recovery, consistency, etc.
• Load balancing, scalability, etc.
• Fits a lot of batch processing applications
• Log processing
• Web index building
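
For instance, the pipeline analogy instantiated as a single-machine word count (the same shape as the streaming job shown later; file names are illustrative):

  cat input.txt | tr " " "\n" | sort | uniq -c > output.txt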
(Simplified) Map Reduce Review

[Diagram: on each machine, input pairs <k, v> pass through a local Map into intermediate pairs <nk, nv>; a global Shuffle groups pairs by new key across machines; a local Sort orders them; a local Reduce aggregates each group, e.g. <nk1, nv1> and <nk1, nv6> → <nk1, 2> on machine 1, and <nk2, nv4>, <nk2, nv5>, <nk2, nv2> → <nk2, 3> on machine 2]

Local Map → Global Shuffle → Local Sort → Local Reduce

Physical Flow

Example Code
Hadoop Streaming
• Allows writing Map and Reduce functions in any language
• Hadoop Map/Reduce natively accepts only Java

• Example: word count

  hadoop streaming
    -input /user/zshao/articles
    -mapper 'tr " " "\n"'
    -reducer 'uniq -c'
    -output /user/zshao/
    -numReduceTasks 32
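
Streaming mappers and reducers simply read lines on stdin and write tab-separated key/value lines on stdout, so any executable can be plugged in. A hypothetical variant of the same job with custom scripts (script names and output path are illustrative; -file ships the scripts to the cluster):

  hadoop streaming
    -input /user/zshao/articles
    -mapper my_mapper.sh
    -reducer my_reducer.sh
    -file my_mapper.sh -file my_reducer.sh
    -output /user/zshao/wordcount
    -numReduceTasks 32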
Example: Log Processing
• Generate #pageviews and #distinct users for each page each day
• Input: timestamp url userid
• Generate the number of page views
• Map: emit <<date(timestamp), url>, 1>
• Reduce: add up the values for each row
• Generate the number of distinct users
• Map: emit <<date(timestamp), url, userid>, 1>
• Reduce: for the set of rows with the same <date(timestamp), url>, count the number of distinct users (as with "uniq -c")
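
A minimal sketch of the pageview-count pass in Java (old Hadoop 1.x mapred API; the tab-separated field layout and ISO timestamp prefix are assumptions based on the slide's input format):

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class PageViews {
    // Map: turn each log line "timestamp url userid" into <date + url, 1>
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {
      private static final LongWritable ONE = new LongWritable(1);
      public void map(LongWritable key, Text line,
          OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
        String[] f = line.toString().split("\t");   // assumed tab-separated
        String date = f[0].substring(0, 10);        // assumes ISO timestamps; illustrative
        out.collect(new Text(date + "\t" + f[1]), ONE);
      }
    }
    // Reduce: sum the 1s for each <date, url>
    public static class Reduce extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {
      public void reduce(Text key, Iterator<LongWritable> vals,
          OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
        long sum = 0;
        while (vals.hasNext()) sum += vals.next().get();
        out.collect(key, new LongWritable(sum));
      }
    }
  }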
Example: Page Rank
• In each Map/Reduce job:
• Map: for each input <url, <eigenvalue, vector<link>>>, emit <link, eigenvalue(url)/#links>
• Reduce: add all values up for each link, to generate the new eigenvalue for that link

• Run ~50 map/reduce jobs till the eigenvalues are stable
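
The iteration itself is just a driver loop chaining jobs, each job's output directory becoming the next job's input. A hypothetical sketch (old mapred API; RankMap and RankSum stand in for the map and reduce described above, and the paths are illustrative):

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.*;

  public class PageRankDriver {
    public static void main(String[] args) throws Exception {
      for (int i = 0; i < 50; i++) {                       // ~50 iterations until stable
        JobConf job = new JobConf(PageRankDriver.class);
        job.setJobName("pagerank-" + i);
        FileInputFormat.setInputPaths(job, new Path("ranks-" + i));
        FileOutputFormat.setOutputPath(job, new Path("ranks-" + (i + 1)));
        job.setMapperClass(RankMap.class);                 // emits <link, rank(url)/#links>
        job.setReducerClass(RankSum.class);                // sums contributions per link
        JobClient.runJob(job);                             // blocks until the job finishes
      }
    }
  }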
TODO: Split Job Scheduler and Map-Reduce
• Allow easy plug-in of different scheduling algorithms
• Scheduling based on job priority, size, etc.
• Scheduling for CPU, disk, memory, network bandwidth
• Preemptive scheduling
• Allow running MPI or other jobs on the same cluster
• PageRank is best done with MPI
TODO: Faster Map-Reduce

[Diagram: a pipelined Mapper → Sender → Receiver → Reducer design in which map output is streamed to reducers R1..Rn as it is produced and merge-sorted on the receiver side before Reduce; the notes mention user buffers, spilling to disk, flow control, and checkpointing]
MapReduce and Hadoop Distributed File System

CCSCNE 2009, Plattsburgh, April 24 2009 (B. Ramamurthy & K. Madurai)


The Context: Big-data

⚫ Man on the moon with 32KB (1969); my laptop had 2GB RAM (2009)
⚫ Google collects 270PB of data in a month (2007), 20000PB a day (2008)
⚫ 2010 census data is expected to be a huge gold mine of information
⚫ Data mining huge amounts of data collected in a wide range of domains, from astronomy to healthcare, has become essential for planning and performance
⚫ We are in a knowledge economy
 Data is an important asset to any organization
 Discovery of knowledge; enabling discovery; annotation of data
⚫ We are looking at newer
 programming models, and
 supporting algorithms and data structures
⚫ NSF refers to it as "data-intensive computing"; industry calls it "big data" and "cloud computing"
Purpose of this talk

⚫ To provide a simple introduction to:
 "Big-data computing": an important advancement that has the potential to significantly impact the CS undergraduate curriculum
 A programming model called MapReduce for processing "big data"
 A supporting file system called the Hadoop Distributed File System (HDFS)
⚫ To encourage educators to explore ways to infuse relevant concepts of this emerging area into their curriculum


The Outline

⚫ Introduction to MapReduce
⚫ From CS foundations to MapReduce
⚫ MapReduce programming model
⚫ Hadoop Distributed File System
⚫ Relevance to undergraduate curriculum
⚫ Demo (Internet access needed)
⚫ Our experience with the framework
⚫ Summary
⚫ References

MapReduce


What is MapReduce?

⚫ MapReduce is a programming model Google has used successfully in processing its "big-data" sets (~20000 petabytes per day)
 Users specify the computation in terms of a map and a reduce function
 The underlying runtime system automatically parallelizes the computation across large-scale clusters of machines, and
 also handles machine failures, efficient communications, and performance issues
-- Reference: Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.


From CS Foundations to MapReduce

Consider a large data collection:
{web, weed, green, sun, moon, land, part, web, green, …}
Problem: count the occurrences of the different words in the collection.

Let's design a solution for this problem:
 We will start from scratch
 We will add and relax constraints
 We will do incremental design, improving the solution for performance and scalability


Word Counter and Result Table

[Diagram: a single WordCounter with parse() and count() methods, driven by Main, reads the DataCollection {web, weed, green, sun, moon, land, part, web, green, …} and fills a ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1]


Multiple Instances of Word Counter

[Diagram: the same WordCounter (parse(), count()) now run by multiple threads (Main spawns Thread 1..*), all filling the shared ResultTable: web 2, weed 1, green 2, sun 1, moon 1, land 1, part 1]

Observe:
• Multi-threaded
• Lock on shared data


Improve Word Counter for Performance

[Diagram: parsing and counting are split into separate Parser and Counter classes connected by a WordList of <KEY, VALUE> pairs (web, weed, green, sun, moon, land, part, web, green, …); with separate counters per thread there is no need for a lock on the shared ResultTable]
Peta-scale Data

[Diagram: the same Parser (1..*) → WordList → Counter (1..*) pipeline, now applied to a peta-scale DataCollection feeding the ResultTable]
Addressing the Scale Issue

⚫ A single machine cannot serve all the data: you need a distributed special (file) system
⚫ Large number of commodity hardware disks: say, 1000 disks of 1TB each
 Issue: with a mean time between failures (MTBF) or failure rate of 1/1000, at least 1 of the above 1000 disks would be down at any given time
 Thus failure is the norm and not an exception
 The file system has to be fault-tolerant: replication, checksums
 Data transfer bandwidth is critical (location of data)
⚫ Critical aspects: fault tolerance + replication + load balancing, monitoring
⚫ Exploit the parallelism afforded by splitting parsing and counting
⚫ Provision and locate computing at data locations


Peta-scale Data

[Diagram repeated: Parser (1..*) → WordList → Counter (1..*) over a peta-scale DataCollection feeding the ResultTable]
Peta Scale Data is Commonly Distributed

[Diagram: the same pipeline, but the input is now many distributed DataCollections feeding the Parsers and Counters. Issue: managing the large-scale data]
Write Once Read Many (WORM) Data

[Diagram: the same pipeline over distributed DataCollections; the log-style input data is written once and read many times]
WORM Data is Amenable to Parallelism

[Diagram: distributed DataCollections feeding Parser (1..*) → WordList → Counter (1..*)]

1. Data with WORM characteristics yields to parallel processing
2. Data without dependencies yields to out-of-order processing


Divide and Conquer: Provision Computing at Data Location

[Diagram: each of several nodes runs the Main → Parser → WordList → Counter pipeline over its local slice of the DataCollection]

For our example:
#1: Schedule parallel parse tasks
#2: Schedule parallel count tasks

This is a particular solution; let's generalize it:

Our parse is a mapping operation:
MAP: input → <key, value> pairs

Our count is a reduce operation:
REDUCE: <key, value> pairs reduced

Map/Reduce originated from Lisp, but the operations have a different meaning here.

The runtime adds distribution + fault tolerance + replication + monitoring + load balancing to your base application!


Mapper and Reducer

Remember: MapReduce is simplified processing for larger data sets.

[Slide shows the MapReduce version of the WordCount source code]
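
The slide's source code itself was an image in this export; below is a minimal sketch of the classic Hadoop WordCount in the old (Hadoop 0.20/1.x) mapred API, matching the era of the talk:

  import java.io.IOException;
  import java.util.Iterator;
  import java.util.StringTokenizer;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class WordCount {
    public static class Map extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();
      public void map(LongWritable key, Text value,
          OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
        StringTokenizer tok = new StringTokenizer(value.toString());
        while (tok.hasMoreTokens()) {
          word.set(tok.nextToken());
          out.collect(word, ONE);                 // emit <word, 1>
        }
      }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> out, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) sum += values.next().get();
        out.collect(key, new IntWritable(sum));   // emit <word, total>
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);        // combiner = local reduce
      conf.setReducerClass(Reduce.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }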
Map Operation

MAP: input data → <key, value> pairs

[Diagram: the Data Collection is split (split 1 … split n) and supplied to multiple processors; each Map task turns its split into a stream of <KEY, VALUE> pairs such as <web, 1>, <weed, 1>, <green, 1>, <sun, 1>, <moon, 1>, <land, 1>, <part, 1>, <web, 1>, <green, 1>, …]


Reduce Operation

MAP: input data → <key, value> pairs
REDUCE: <key, value> pairs → <result>

[Diagram: Data Collection split 1 … split n → one Map task per split → shuffle → multiple Reduce tasks]


Large-scale data splits

[Diagram: splits → Map (parse-hash) emitting <key, 1> → Reducers (say, Count) producing output partitions P-0000 (count1), P-0001 (count2), P-0002 (count3)]


MapReduce Example in my operating systems class

[Diagram: input words "Cat", "Bat", "Dog", and other words (size: TBytes) → split → map → combine → reduce → output files part0, part1, part2]
MapReduce Programming Model


MapReduce programming model

⚫ Determine if the problem is parallelizable and solvable using MapReduce (e.g., is the data WORM? Is the data set large?)
⚫ Design and implement a solution as Mapper and Reducer classes
⚫ Compile the source code against the Hadoop core
⚫ Package the code as a jar executable
⚫ Configure the application (job): number of mappers and reducers (tasks), input and output streams
⚫ Load the data (or use previously available data)
⚫ Launch the job and monitor
⚫ Study the result
⚫ Detailed steps (a sketch of the compile/package/launch commands follows)
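
A hedged sketch of those steps as Hadoop 1.x commands (class, jar, and path names are illustrative):

  javac -classpath hadoop-core-1.1.0.jar -d classes WordCount.java
  jar cvf wordcount.jar -C classes .
  hadoop dfs -put local-input/ input
  hadoop jar wordcount.jar WordCount input output
  hadoop dfs -cat output/part-00000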



MapReduce Characteristics

⚫ Very large scale data: peta- and exabytes
⚫ Write-once-read-many data: allows for parallelism without mutexes
⚫ Map and Reduce are the main operations: simple code
⚫ There are other supporting operations such as combine and partition (out of the scope of this talk)
⚫ All the maps should be completed before the reduce operation starts
⚫ Map and reduce operations are typically performed by the same physical processor
⚫ The numbers of map tasks and reduce tasks are configurable
⚫ Operations are provisioned near the data
⚫ Commodity hardware and storage
⚫ The runtime takes care of splitting and moving data for operations
⚫ Special distributed file system, e.g., the Hadoop Distributed File System and Hadoop runtime
Classes of problems "mapreducable"

⚫ Benchmark for comparing: Jim Gray's challenge on data-intensive computing, e.g. "Sort"
⚫ Google uses it (we think) for wordcount, AdWords, PageRank, indexing data
⚫ Simple algorithms such as grep, text indexing, reverse indexing
⚫ Bayesian classification: data mining domain
⚫ Facebook uses it for various operations: demographics
⚫ Financial services use it for analytics
⚫ Astronomy: Gaussian analysis for locating extraterrestrial objects
⚫ Expected to play a critical role in the semantic web and Web 3.0


Scope of MapReduce

Data size: small
• Pipelined: instruction level
• Concurrent: thread level
• Service: object level
• Indexed: file level
• Mega: block level
• Virtual: system level
Data size: large

Hadoop


What is Hadoop?

⚫ At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose
⚫ GFS is not open source
⚫ Doug Cutting and Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS)
⚫ The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop, or simply Hadoop
⚫ This is open source and distributed by Apache


Basic Features: HDFS

⚫ Highly fault-tolerant
⚫ High throughput
⚫ Suitable for applications with large data sets
⚫ Streaming access to file system data
⚫ Can be built out of commodity hardware


Hadoop Distributed File System

[Diagram: an application uses the HDFS client to talk to the HDFS server (master/name node); the local file system uses a small block size (e.g. 2K) while HDFS uses a 128M block size, replicated across data nodes]

Hadoop Distributed File System

[Same diagram, adding the blockmap and heartbeat messages between the name node and the data nodes]

More details: we discuss this in great detail in my Operating Systems course.
Relevance and Impact on Undergraduate Courses

⚫ Data structures and algorithms: a new look at traditional algorithms such as sort. Quicksort may not be your choice: it is not easily parallelizable. Merge sort is better.
⚫ You can identify mappers and reducers among your algorithms. Mappers and reducers are simply placeholders for algorithms relevant to your applications.
⚫ Large-scale data and analytics are indeed concepts to reckon with, similar to how we addressed "programming in the large" with OO concepts.
⚫ While a full course on MR/HDFS may not be warranted, the concepts can perhaps be woven into most courses in our CS curriculum.


Demo

⚫ VMware-simulated Hadoop and MapReduce demo
⚫ Remote access to the NEXOS system at my Buffalo office
⚫ 5-node cluster running HDFS on Ubuntu 8.04
⚫ 1 name node and 4 data nodes
⚫ Each is an old commodity PC with 512 MB RAM, 120GB-160GB external storage
⚫ Zeus (namenode); datanodes: hermes, dionysus, aphrodite, athena


Summary

⚫ We introduced the MapReduce programming model for processing large-scale data
⚫ We discussed the supporting Hadoop Distributed File System
⚫ The concepts were illustrated using a simple example
⚫ We reviewed some important parts of the source code for the example
⚫ Relationship to cloud computing


References

1. Apache Hadoop Tutorial: http://hadoop.apache.org and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.
3. Cloudera videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic
4. http://www.cse.buffalo.edu/faculty/bina/mapreduce.html
Hive - SQL on top of Hadoop

Map-Reduce and SQL
• Map-Reduce is scalable
– SQL has a huge user base
– SQL is easy to code
• Solution: combine SQL and Map-Reduce
– Hive on top of Hadoop (open source)
– Aster Data (proprietary)
– Greenplum (proprietary)
Hive
• A database/data warehouse on top of Hadoop
– Rich data types (structs, lists and maps)
– Efficient implementations of SQL filters, joins and group-bys on top of map-reduce
• Allows users to access Hive data without using Hive
• Link:
– http://svn.apache.org/repos/asf/hadoop/hive/trunk/
Hive Architecture

[Diagram: Web UI and Hive CLI (browsing, queries, DDL, management) sit on top of Hive QL (Parser, Planner, Execution) and the MetaStore; a SerDe layer (Thrift, Jute, JSON) connects down to Map Reduce and HDFS; external access goes through the Thrift API]
Hive QL – Join

[Example: page_view (pageid, userid, time) = {(1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14)} joined with user (userid, age, gender) = {(111, 25, female), (222, 32, male)} gives pv_users (pageid, age) = {(1, 25), (2, 25), (1, 32)}]

• SQL:
INSERT INTO TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv JOIN user u ON (pv.userid = u.userid);
Hive QL – Join in Map Reduce

[Diagram: the Map phase tags each row with its source table and emits the join key: page_view rows become <userid, <1, pageid>> and user rows become <userid, <2, age>>. Shuffle and sort bring rows with the same userid together, and the Reduce phase combines the tagged values to produce pv_users rows <pageid, age>]
Hive QL – Group By

[Example: pv_users (pageid, age) = {(1, 25), (2, 25), (1, 32), (2, 25)} → pageid_age_sum (pageid, age, count) = {(1, 25, 1), (2, 25, 2), (1, 32, 1)}]

• SQL:
INSERT INTO TABLE pageid_age_sum
SELECT pageid, age, count(1)
FROM pv_users
GROUP BY pageid, age;
Hive QL – Group By in Map Reduce

[Diagram: Map emits <<pageid, age>, 1> for each row; shuffle and sort group identical keys across machines; Reduce sums the 1s to produce <pageid, age, count>]
Hive QL – Group By with Distinct

[Example: page_view (pageid, userid, time) = {(1, 111, 9:08:01), (2, 111, 9:08:13), (1, 222, 9:08:14), (2, 111, 9:08:20)} → result (pageid, count_distinct_userid) = {(1, 2), (2, 1)}]

• SQL:
SELECT pageid, COUNT(DISTINCT userid)
FROM page_view GROUP BY pageid
Hive QL – Group By with Distinct in Map Reduce

[Diagram: Map emits <pageid, userid> keys; in the shuffle and sort, the shuffle key (pageid) is a prefix of the sort key, so each reducer sees a pageid's userids in sorted order and can count the distinct ones, producing {(1, 2), (2, 1)}]
Hive QL: Order By

[Diagram: Map emits <<userid, pageid>, time> pairs; rows are shuffled (randomly across reducers) and sorted by key, so each reducer outputs its page_view rows in order]
Hive Optimizations

Efficient execution of SQL on top of Map-Reduce

(Simplified) Map Reduce Revisit

[Diagram repeated from earlier: Local Map → Global Shuffle → Local Sort → Local Reduce across two machines]
Merge Sequential Map Reduce Jobs

• SQL:
FROM (a JOIN b ON a.key = b.key) JOIN c ON a.key = c.key
SELECT …

[Diagram: instead of one Map-Reduce job for (a JOIN b) = AB and a second for (AB JOIN c) = ABC, sequential jobs on the same key are merged into one: key 1 with av 111, bv 222, cv 333 ends up in a single row (1, 111, 222, 333)]
Share Common Read Operations

• Extended SQL:
FROM pv_users
INSERT INTO TABLE pv_pageid_sum
  SELECT pageid, count(1)
  GROUP BY pageid
INSERT INTO TABLE pv_age_sum
  SELECT age, count(1)
  GROUP BY age;

[Diagram: one scan of pv_users {(1, 25), (2, 32)} feeds two map-reduce pipelines, producing pageid counts {(1, 1), (2, 1)} and age counts {(25, 1), (32, 1)}]
Load Balance Problem

[Diagram: pv_users is skewed, e.g. (pageid 1, age 25) appears many times; computing partial aggregates (pageid_age_partial_sum) before the shuffle reduces the skew: {(1,25) x4, (2,32)} → partial sums {(1, 25, 2), (2, 32, 1), (1, 25, 2)} → final {(1, 25, 4), (2, 32, 1)}]
Map-side Aggregation / Combiner

[Diagram: each machine pre-aggregates locally during the Map phase, e.g. machine 1 emits <male, 343> and <female, 128>, machine 2 emits <male, 123> and <female, 244>; after shuffle and sort, the reducers only combine the partial sums into <male, 466> and <female, 372>]
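
In Hadoop terms this is what a combiner does. When the reduce function is associative and commutative (like a sum), the reducer class can be reused as the combiner; a sketch of the relevant JobConf lines, assuming Map and Reduce classes like those in the earlier WordCount sketch:

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);   // runs on the map side, pre-aggregating locally
  conf.setReducerClass(Reduce.class);    // final aggregation after the shuffle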
Query Rewrite

• Predicate push-down
– select * from (select * from t) where col1 = '2008';
• Column pruning
– select col1, col3 from (select * from t);
TODO: Column-based Storage and Map-side Join

[Table 1: url, page quality, IP: http://a.com/ (90, 65.1.2.3), http://b.com/ (20, 68.9.0.81), http://c.com/ (68, 11.3.85.1)]
[Table 2: url, clicked, viewed: http://a.com/ (12, 145), http://b.com/ (45, 383), http://c.com/ (23, 67)]
MetaStore
• Stores table/partition properties:
– Table schema and SerDe library
– Table location on HDFS
– Logical partitioning keys and types
– Other information
• Thrift API
– Current clients in PHP (Web Interface), Python (old CLI), Java (Query Engine and CLI), Perl (tests)
• Metadata can be stored as text files or even in a SQL backend
Hive CLI
• DDL:
– create table/drop table/rename table
– alter table add column
• Browsing:
– show tables
– describe table
– cat table
• Loading Data
• Queries
Web UI for Hive

• MetaStore UI:
– Browse and navigate all tables in the system
– Comment on each table and each column
– Also captures data dependencies
• HiPal:
– Interactively construct SQL queries by mouse clicks
– Supports projection, filtering, group by and joining
– Also supports …
Hive Query Language
• Philosophy
– SQL
– Map-Reduce with custom scripts (hadoop streaming)

• Query Operators
– Projections
– Equi-joins
– Group by
– Sampling
– Order By
Hive QL – Custom Map/Reduce Scripts

• Extended SQL:
FROM (
  FROM pv_users
  MAP pv_users.userid, pv_users.date
  USING 'map_script' AS (dt, uid)
  CLUSTER BY dt) map
INSERT INTO TABLE pv_users_reduced
  REDUCE map.dt, map.uid
  USING 'reduce_script' AS (date, count);

• Map-Reduce: similar to Hadoop streaming