
Unit-6: Data Visualization and Hadoop

The document discusses data visualization and Hadoop. It begins by defining data visualization as the graphical representation of data or information using visual elements like charts, graphs, and maps. It then discusses some challenges of big data visualization, such as visualizing diverse and heterogeneous data, speed requirements, and designing scalable tools. Different types of conventional data visualization methods are covered, including tables, histograms, scatter plots, various charts, timelines, and diagrams. The document outlines techniques and tools used for data visualization with Hadoop.

Unit 6

Data Visualization and Hadoop

Outline

● Introduction to Data Visualization

● Challenges to Big Data Visualization

● Types of Data Visualization

● Data Visualization Techniques

● Tools Used in Data Visualization

● Hadoop Ecosystem, MapReduce, Pig, Hive


Introduction

Data visualization is the graphical representation of information and data.

Data Visualization
● Data visualization is a graphical representation of any data or information.
● Visual elements such as charts, graphs, and maps are among the data visualization tools that give viewers an easy and accessible way of understanding the represented information.
● Data visualization enables you, or decision-makers in any enterprise or industry, to look into analytical reports and understand concepts that might otherwise be difficult to grasp.
Data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

Visual elements: charts, graphs, and maps.

Its Need

● Data is more valuable when it is visualized.
● Charts and graphs make communicating data findings easier.
● Without visual representation, it can be hard for the audience to grasp the true meaning of the findings and insights.
Big Data Visualization Challenges

● Visualizing big data that is diverse and heterogeneous (structured, semi-structured, and unstructured) is a major problem.
● Speed is a key requirement for big data analysis.
● Designing a new visualization tool with efficient indexing is not easy for big data.
● Cloud computing and advanced graphical user interfaces can be combined with big data to manage big data scalability better.
● Visualization systems must contend with unstructured data forms such as graphs, tables, text, trees, and other metadata.
● Big data often comes in unstructured formats.
● Because of bandwidth limitations and power requirements, visualization should move closer to the data to extract meaningful information efficiently.
● Visualization software should run smoothly. Because of the size of big data, the need for massive parallelization is a challenge in visualization.
Some other big data visualization problems:

● Visual noise
● Information loss
● Large image perception
● High rate of image change
● High performance requirements
Visual noise
Most of the objects in the dataset are too close to each other, so users cannot distinguish them as separate objects on the screen.

Information loss
Reducing the visible data set is possible, but it leads to information loss.

Large image perception
Data visualization methods are limited not only by the aspect ratio and resolution of the device, but also by the limits of physical perception.

High rate of image change
Users observing the data cannot react to the number of data changes or their intensity on the display.

High performance requirements
This is hardly noticeable in static visualization, where the speed requirements are lower. It matters when the display must update quickly.


Solutions

● Meeting the need for speed
● Understanding the data
● Addressing data quality
● Displaying meaningful results
● Dealing with outliers

Meeting the need for speed
One possible solution is hardware: increased memory and powerful parallel processing. Another is to hold the data in memory while using a grid computing approach, in which many machines share the work.

Understanding the data
One solution is to have the proper domain expertise in place.

Addressing data quality
It is necessary to ensure the data is clean through a process of data governance or information management.

Displaying meaningful results
One way is to cluster the data into a higher-level view where smaller groups of data are visible and the data can be visualized effectively.

Dealing with outliers
Possible solutions are to remove the outliers from the data or to create a separate chart for the outliers.
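
The outlier step can be sketched in a few lines of Python. This is only an illustration: the slides do not say how outliers are identified, so the 1.5 × IQR rule, the function name split_outliers, and the sample values below are all assumptions made for the example.

```python
from statistics import quantiles

def split_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using the k * IQR rule (an assumed rule)."""
    q1, _, q3 = quantiles(values, n=4)      # first and third quartiles
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr  # fences around the bulk of the data
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

# Made-up measurements with one obvious extreme value.
values = [10, 12, 11, 13, 12, 11, 95]
inliers, outliers = split_outliers(values)
print("main chart:", inliers)
print("separate outlier chart:", outliers)
```

The two lists can then be drawn as two charts, or the second list can simply be dropped, matching the two options named above.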
Types of data visualization

● Many conventional data visualization methods are in common use:
○ table, histogram, scatter plot, line chart, bar chart, pie chart, area chart, flow chart, bubble chart,
○ multiple data series or combinations of charts, timeline,
○ Venn diagram, data flow diagram, entity relationship diagram, etc.
● Additional methods include parallel coordinates, treemap, cone tree, and semantic network.

Types of data visualization:
1. Table
2. Histogram
3. Scatter plot
4. Various charts
5. Timeline
6. Various diagrams
Types of data visualization

Table
Histogram

● The data is grouped into ranges (e.g. 10-29) and then plotted as connected bars.
● A histogram is an approximate representation of the distribution of numerical data.
● To build one, divide the entire range of values into a series of intervals and count how many values fall into each interval. This is called binning.
● For example, determine the frequency of annual stock market percentage returns within particular ranges (bins) such as 0-10%, 11-20%, etc.
● The height of each bar represents the number of observations (years) with a return % in the range represented by the respective bin.
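
Binning can be illustrated with a short Python sketch. The return values are made up, the function name bin_counts is hypothetical, and the bins are half-open intervals such as [0, 10) and [10, 20) rather than the 0-10%, 11-20% labels used above, which is an assumption chosen so that fractional values always land in exactly one bin.

```python
from collections import Counter

def bin_counts(values, bin_width, start=0):
    """Count how many values fall into each half-open interval [lo, lo + bin_width)."""
    counts = Counter()
    for v in values:
        index = int((v - start) // bin_width)   # which bin this value belongs to
        lo = start + index * bin_width
        counts[(lo, lo + bin_width)] += 1
    return dict(sorted(counts.items()))

# Hypothetical annual returns in percent (made-up numbers for illustration).
returns = [3.5, 12.0, 7.2, 18.9, 25.1, 9.9, 14.4]
histogram = bin_counts(returns, bin_width=10)

# A text histogram: the bar length is the number of years in the bin.
for (lo, hi), n in histogram.items():
    print(f"{lo:>3}-{hi:<3}% {'#' * n} ({n})")
```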
Conventional data visualization methods

Scatter plot

● It displays the collection of all points for a set of data.
● Use it when you have multiple data points and need to examine the correlation between the X and Y variables.
● Consequently, the variables should depend on or influence each other in some way.
● For example, supply is usually related to demand.
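
A scatter plot is usually read alongside a numeric measure of correlation. The sketch below computes the Pearson correlation coefficient in plain Python. The choice of Pearson's coefficient and the demand and supply figures are assumptions for illustration, not something the slides specify.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Made-up demand (X) and supply (Y) observations.
demand = [10, 20, 30, 40, 50]
supply = [12, 22, 29, 41, 52]
r = pearson(demand, supply)
print(f"correlation between demand and supply: {r:.3f}")
```

A value of r close to 1 corresponds to points that lie near a rising straight line on the scatter plot.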
Different Types of Chart

Dot distribution map

● A dot symbol represents a feature on the map.

Pie chart

● The circle is divided into sectors that represent numeric proportions.

Tree diagram (hierarchical)

● It represents data or a hierarchy in graph form.

Node-link diagram

● Node: visualized as a dot.
● Link: a line segment that displays a connection between data items.
Timeline

● The most effective way to visualize a sequence of events in chronological order.
● Timelines are typically linear, with key events outlined along the axis.
● Timelines are used to communicate time-related information and display historical data.
Venn diagram

Data flow diagram

Parallel coordinates

● Used to plot individual data elements across many dimensions.
● Parallel coordinates are very useful for displaying multidimensional data.
Treemap

● An effective method for visualizing hierarchies.
● The size of each sub-rectangle represents one measure, while color is often used to represent another measure of the data.
● Example: streaming music and video tracks in a social network community.

Semantic network

● A graphical representation of the logical relationships between different concepts.
● It forms a directed graph: a combination of nodes (vertices), edges (arcs), and a label on each edge.
Data visualization techniques

Visualization techniques/methods:

1. Data visualization
2. Information visualization
3. Concept visualization
4. Strategic visualization
5. Metaphor visualization
6. Compound visualization

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.259.4640&rep=rep1&type=pdf
Data visualization

Visually represents quantitative data, with or without axes, in schematic or diagrammatic form, e.g. table, line chart, pie chart, histogram, and scatter plot.

Information visualization

An interactive interface to data that increases cognition or perception ability. It transforms data into a changeable image that users can interact with and manipulate, e.g. data map, treemap, clustering, semantic network, timeline, and Venn/Euler diagram.

Concept visualization

Methods used to elaborate ideas, plans, and concepts and to analyze them easily, e.g. mind map, layer chart, concentric circles, decision tree, and PERT chart.

Strategic visualization

A systematic approach in which an organization visually represents its strategies for development, formulation, communication, implementation, and sometimes their analysis, e.g. organizational chart, strategy map, failure tree, and portfolio diagram.

Metaphor visualization

Organizes and structures information graphically. It conveys insight into the information through the key characteristics of the metaphor that is employed, e.g. metro map, story template, funnel, and tree.

Compound visualization

The complementary use of different graphic representation formats in one single schema or frame, e.g. cartoon, rich picture, knowledge map, and learning map.
Data Visualization Tools

● Dundas BI
● Infogram
● Sisense
● Adaptive Insights
● Power BI
● Google Charts
● FineReport
● Whatagraph
● Grafana
● Tableau
What is Tableau

Tableau is a software company that produces interactive data visualization products with a focus on business.

Tableau

• Tableau Software is an American computer software company headquartered in Seattle, WA, USA.
• It produces interactive data visualization products focused on business intelligence (BI).
• The company grew out of research at Stanford University's Department of Computer Science between 1997 and 2002.

Tableau Desktop (business analytics anyone can use)

• Tableau Desktop is a data visualization application that lets you examine virtually any kind of structured data and produce highly interactive graphs, dashboards, and reports within minutes.
Tableau Server
• Tableau Server is a business intelligence application that offers browser-based analytics anyone can use.
• It is a rapid-fire alternative to the slow pace of traditional BI software. It is an online solution for sharing, distributing, and collaborating on content created in Tableau.
• What makes Tableau different? It is designed for everyone. No scripting is required, so anyone can become an analytics expert. You can grow your deployment as you need it. Train online for free. Find answers in minutes, not months.

Tableau Online
• Tableau Online is a secure, cloud-based solution for sharing, distributing, and collaborating on Tableau views and Tableau dashboards.
• Tableau Online puts the flexibility and ease of a powerful cloud-based data visualization solution to work for you, without servers, server software, or IT support.

Tableau Public
• Tableau Public is free software that lets anyone connect to a spreadsheet or file and create interactive data visualizations for the web.
• It is delivered as a service that lets the user be up and running overnight.
• With Tableau Public, users can build interactive visuals and publish them quickly, without the help of programmers or IT.
• It is designed for organizations that want to enhance their websites with interactive data visualizations. There are higher limits on the size of data you can work with and, among other features, you can keep your underlying data hidden.
Tableau allows all customers to spend more time on data analysis and less on data wrangling.
What is Google Charts

Google chart tools are powerful, simple to use, and free.

We can use interactive charts and data tools.


Microsoft Power BI

• Microsoft Power BI is a business intelligence (BI) platform that provides nontechnical business users with tools for aggregating, analyzing, visualizing and sharing data.
• Power BI's user interface is fairly intuitive for users familiar with Excel, and its deep integration with other Microsoft products makes it a versatile self-service tool that requires little upfront training.

Uses of Power BI
• Though Power BI is a self-service BI tool that brings data analytics to employees, it's mostly used by data analysts and BI professionals who create the data models before disseminating reports throughout the organization.
• However, those without an analytical background can still navigate Power BI and create reports.
Key features of Power BI
Some of the most important features are the following:

• Artificial intelligence. Users can access image recognition and text analytics in Power BI, create machine
learning models using automated ML capabilities and integrate with Azure Machine Learning.
• Hybrid deployment support. This feature provides built-in connectors that allow Power BI tools to
connect with a number of different data sources from Microsoft, Salesforce and other vendors.
• Quick Insights. This feature allows users to create subsets of data and automatically apply analytics to
that information.
• Common data model support. Power BI's support for the common data model allows the use of a
standardized and extensible collection of data schemas (entities, attributes and relationships).
• Cortana integration. This feature, which is especially popular on mobile devices, allows users to
verbally query data using natural language and access results using Cortana, Microsoft's digital assistant.
• Customization. This feature allows developers to change the appearance of default visualization and
reporting tools and import new tools into the platform.
Content Beyond Syllabus: Dashboard

● Decision-making process and data accuracy: gain information quickly and accurately.
● A dashboard is a good presentation of information in visual form that enables easy decision making.
● Challenge for an organization: to sort and process the large sum of available data.
Hadoop ecosystem,Map Reduce, Pig, Hive

• The Hadoop Ecosystem is a framework and suite of tools that tackle the many

• Although Hadoop has been on the decline for some time, there are organizations
like LinkedIn where it has become a core technology.
• Some of the popular tools that help scale and improve functionality are Pig,
Hive, Oozie, and Spark.
• Spark has developed legs of its own and has become an ecosystem unto itself,
where add-ons like Spark MLlib turn it into a machine learning platform that
supports Hadoop, Kubernetes, and Apache Mesos.
• Most of the tools in the Hadoop Ecosystem revolve around the four core
technologies, which are YARN, HDFS, MapReduce, and Hadoop Common.
• All these components and tools work together to provide services such as ingestion, storage, analysis, and maintenance of big data.

• HDFS: Hadoop Distributed File System
• Hive: data warehouse that helps in reading, writing, and managing large datasets
• Pig: helps create applications that run on Hadoop by executing jobs as MapReduce
• MapReduce: system used for processing large data sets
• YARN: Yet Another Resource Negotiator
• Spark: popular analytics engine that works in-memory
• Oozie: open-source workflow scheduling program
• ZooKeeper: centralized service for maintaining configuration information, naming, providing distributed synchronization, and more
• Mahout: helps create ML applications
• Hadoop is an open-source framework from the Apache Software Foundation, capable of processing large amounts of heterogeneous data sets in a distributed fashion across clusters of commodity computers and hardware using a simplified programming model.
• Hadoop provides a reliable shared storage and analysis system.
Hadoop Distributed File System (HDFS)

• The HDFS architecture consists of a NameNode and DataNodes, each with its own function.
• The NameNode stores metadata, and the DataNodes store the actual data.
• The client interacts with the NameNode in the cluster to perform a task.
• Each DataNode keeps sending a heartbeat to the NameNode to indicate that it is alive.
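
The division of labour between the NameNode and the DataNodes can be pictured with a toy Python model. It is not HDFS code and uses no Hadoop API: the class ToyNameNode, its methods, the 10-second timeout, and the node names are all hypothetical, chosen only to show metadata being kept apart from data and heartbeats being used to decide which nodes are alive.

```python
class ToyNameNode:
    """Toy model: holds metadata only, plus the last heartbeat time per DataNode."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds  # assumed value, not an HDFS default
        self.block_locations = {}   # file name -> ids of DataNodes holding its blocks
        self.last_heartbeat = {}    # DataNode id -> time of most recent heartbeat

    def register_file(self, name, datanode_ids):
        # Metadata only: the file contents themselves live on the DataNodes.
        self.block_locations[name] = list(datanode_ids)

    def heartbeat(self, datanode_id, now):
        self.last_heartbeat[datanode_id] = now

    def live_datanodes(self, now):
        # A node counts as alive if its last heartbeat is recent enough.
        return sorted(d for d, t in self.last_heartbeat.items()
                      if now - t <= self.timeout_seconds)

namenode = ToyNameNode(timeout_seconds=10)
namenode.register_file("/user/customer.txt", ["dn1", "dn2"])
namenode.heartbeat("dn1", now=0)
namenode.heartbeat("dn2", now=8)
live = namenode.live_datanodes(now=12)   # dn1 has been silent for 12 seconds
print(live)
```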

https://www.cloudduggu.com/hadoop/hdfs/
HDFS architecture

Master node (file store, namespace, metadata):
● NameNode
● Secondary NameNode (namespace backup)

HDFS client

Worker nodes:
● DataNodes (data serving)

Between the master and worker nodes: heartbeats, balancing, replication.
Hadoop ecosystem,Map Reduce, Pig, Hive

Map Reduce

● The MapReduce paradigm offers the means


○ to break a large task into smaller tasks,
○ run tasks in parallel, and
○ consolidate the outputs of the individual tasks into the final
output.
● Apache Hadoop includes a software implementation of MapReduce.
MapReduce steps

● Map step: applies an operation to a piece of data and provides some intermediate output.
● Reduce step: consolidates the intermediate outputs from the map steps and provides the final output.
● Both the map step and the reduce step use <key, value> pairs as input and output.
How is a MapReduce job run in Hadoop?

A typical MapReduce program in Java consists of three classes: the driver, the mapper, and the reducer.

Driver

The driver provides details such as:
● input file locations,
● provisions for adding the input file to the map task,
● names of the mapper and reducer Java classes,
● the location of the reduce task output.

Mapper

● Provides the logic to be processed on each data block corresponding to the specified input files in the driver code.
● A map task is instantiated on a worker node where a data block resides.
● The key/value pairs are stored temporarily in the worker node's memory.

Shuffle and sort

● The key/value pairs are processed by the built-in shuffle and sort functionality, based on the number of reducers to be executed.
● Keys are passed to each reducer in sorted order.
● Each reducer processes the values for each key and emits a key/value pair as defined by the reduce logic.
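
The map, shuffle-and-sort, and reduce stages can be imitated on a single machine in plain Python. This is a minimal sketch of the data flow only, not Hadoop code: word counting is an assumed example task, and the function names map_step, shuffle_and_sort, reduce_step, and run_job are hypothetical.

```python
from collections import defaultdict

def map_step(line):
    """Map: emit a (word, 1) pair for every word in one piece of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_and_sort(pairs):
    """Shuffle and sort: group values by key and hand keys over in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_step(key, values):
    """Reduce: consolidate all values for one key into a final (key, total) pair."""
    return key, sum(values)

def run_job(lines):
    # In Hadoop the map tasks run in parallel on worker nodes; here they run in a loop.
    intermediate = [pair for line in lines for pair in map_step(line)]
    return [reduce_step(key, values) for key, values in shuffle_and_sort(intermediate)]

lines = ["big data needs big tools", "data tools"]
result = run_job(lines)
print(result)
```

Every stage consumes and produces key/value pairs, which is what lets Hadoop distribute the map and reduce work across many nodes.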
Hadoop MapReduce programming language options

● Java API
● Hadoop Streaming: requires knowledge of Python, C, or Ruby
● Hadoop Pipes: uses C++ code
The Hadoop Ecosystem - Pig

● Apache Pig consists of
○ a data flow language, Pig Latin, and
○ an environment to execute the Pig code.
● The main benefit of using Pig is to utilize the power of MapReduce in a distributed system while simplifying the tasks of developing and executing a MapReduce job.
● Using Pig involves entering the Pig execution environment by typing pig at the command prompt and then entering a sequence of Pig instruction lines at the grunt prompt.
● Example:
$ pig
grunt> records = LOAD '/user/customer.txt' AS (cust_id:INT, first_name:CHARARRAY, last_name:CHARARRAY, email_address:CHARARRAY);
grunt> filtered_records = FILTER records BY email_address matches '.*@isp.com';
grunt> STORE filtered_records INTO '/user/isp_customers';
grunt> quit
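
For readers more familiar with Python than Pig Latin, the FILTER step above behaves roughly like the list comprehension below. The customer rows are made up and stand in for the contents of /user/customer.txt. The sketch runs on one machine, whereas Pig compiles the script into jobs that run on the cluster.

```python
import re

# Hypothetical in-memory stand-in for the rows loaded from /user/customer.txt.
records = [
    (1, "Ada", "Lovelace", "ada@isp.com"),
    (2, "Alan", "Turing", "alan@example.org"),
    (3, "Grace", "Hopper", "grace@isp.com"),
]

# Same regular expression as in the Pig script.
pattern = re.compile(r".*@isp.com")

# fullmatch is used because Pig's matches operator must match the whole string.
filtered_records = [row for row in records if pattern.fullmatch(row[3])]

for row in filtered_records:
    print(row)
```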
The Hadoop Ecosystem - Pig built-in functions

● Eval
● Load/Store
● Math
● String
● DateTime
The Hadoop Ecosystem - Hive

● Apache Hive enables users to process data without explicitly writing MapReduce code.
● The Hive language, HiveQL (Hive Query Language), resembles Structured Query Language (SQL).
● A Hive table structure consists of rows and columns.
● The rows typically correspond to some record, transaction, or particular entity (for example, customer) detail.
● The values of the corresponding columns represent the various attributes or characteristics for each row.
● Additionally, a user may consider using Hive if the user has experience with SQL and the data is already in HDFS.
● Hive is not intended for real-time querying.
Hadoop ecosystem,Map Reduce, Pig, Hive

The Hadoop Ecosystem - HIVE

● When to use Hive?


○ Data easily fits into a table structure.
○ Data is already in HDFS. (Note: Non-HDFS files can be loaded into a Hive
table.)
○ Developers are comfortable with SQL programming and queries.
○ There is a desire to partition datasets based on time. (For example, daily updates
are added to the Hive table.)
○ Batch processing is acceptable.
The Hadoop Ecosystem - Hive

● To start Hive, type hive at the command prompt.
$ hive
hive>
● From this environment, a user can define new tables, query them, or summarize their contents.
● The following statement defines a customer table for a tab ('\t')-delimited HDFS file:
hive> create table customer (cust_id bigint, first_name string, last_name string, email_address string) row format delimited fields terminated by '\t';
● To load the customer table with the contents of the HDFS file customer.txt:
hive> load data inpath '/user/customer.txt' into table customer;
● The following HiveQL query counts the number of records in the newly created table, customer:
hive> select count(*) from customer;
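
What the table definition and the count query do to a tab-delimited file can be mimicked in plain Python. This is an analogy only, not a Hive client: the two rows are made up, and the whole file is read into memory, which Hive does not need to do.

```python
import csv
import io

# Hypothetical contents of a tab-delimited customer.txt (made-up rows).
customer_txt = (
    "1\tAda\tLovelace\tada@isp.com\n"
    "2\tAlan\tTuring\talan@example.org\n"
)

# Column names taken from the create table statement above.
columns = ["cust_id", "first_name", "last_name", "email_address"]

# Splitting each line on tabs plays the role of: fields terminated by '\t'.
reader = csv.reader(io.StringIO(customer_txt), delimiter="\t")
customer = [dict(zip(columns, row)) for row in reader]

# Counting the parsed rows plays the role of: select count(*) from customer.
record_count = len(customer)
print(record_count)
```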
The Hadoop Ecosystem - Hive use cases

● Exploratory or ad-hoc analysis of HDFS data: Data can be queried, transformed, and exported to analytical tools, such as R.
● Extracts or data feeds to reporting systems, dashboards, or data repositories such as HBase: Hive queries can be scheduled to provide such periodic feeds.
● Combining external structured data with data already residing in HDFS: Hadoop is excellent for processing unstructured data, but often there is structured data residing in an RDBMS, such as Oracle or SQL Server, that needs to be joined with the data residing in HDFS. The data from an RDBMS can be periodically added to Hive tables for querying with existing data in HDFS.
Thank You
