
Introduction to Big Data and Hadoop

Created by: Vipin Jaiswal
CONTENTS
1. Basics of Big Data
2. Overview of Hadoop
3. Components of Hadoop Ecosystem
4. Advantages of Hadoop
5. Hadoop Ecosystem Tools
6. Big Data Processing with Hadoop
7. Challenges and Future of Big Data and Hadoop


01
Basics of Big Data
Definition and Characteristics of Big Data

Volume
Big Data refers to a large amount of data that cannot be easily managed and processed using traditional database management systems. The volume of data in Big Data refers to the massive scale at which data is generated and stored. This includes structured, unstructured, and semi-structured data.

Velocity
Velocity in Big Data refers to the speed at which data is generated and processed. With the advent of the internet and technological advancements, data is continuously being generated and needs to be processed in real time. Velocity is crucial for data-intensive applications such as social media analytics, financial trading, or sensor data analysis.

Variety
The variety of Big Data refers to the different types and formats of data. It includes structured data (like relational databases), unstructured data (like social media posts, emails, or multimedia content), and semi-structured data (like XML or JSON files). Big Data encompasses a wide range of data formats, making it challenging to handle and analyze using traditional methods.

Veracity
Veracity refers to the quality and accuracy of the data. In Big Data, the data generated can often be uncertain, incomplete, or contain errors. Veracity is a key challenge in Big Data analysis, as the trustworthiness of the insights derived from the data heavily relies on the accuracy and reliability of the data sources.
Importance and Benefits of Big Data

Decision Making and Insights
Big Data provides valuable insights that can drive better decision-making processes. By analyzing large volumes of data from various sources, organizations can identify patterns, trends, and correlations that can lead to informed business strategies and improved operational efficiency.

Predictive Analytics
Big Data enables predictive analytics, which involves the use of historical data to make predictions about future events or trends. By analyzing and modeling large datasets, organizations can identify patterns and make accurate predictions for proactive decision-making, risk assessment, and forecasting.

Cost and Time Efficiency
Big Data technologies, like Hadoop, provide cost and time efficiency in handling and analyzing large datasets. Traditional methods often require significant investments in infrastructure and processing time, whereas Big Data technologies offer scalable and distributed processing capabilities, reducing both the cost and time involved in data processing.
02
Overview of Hadoop
History and Evolution of Hadoop

Hadoop was developed by Doug Cutting and Mike Cafarella in 2005 as a result of their experiences handling big data at Yahoo!

Inspired by Google's MapReduce and Google File System (GFS) papers, Hadoop aimed to provide an open-source framework for processing and storing large datasets.

Originally, it consisted of two main components: Hadoop Distributed File System (HDFS) for data storage and MapReduce for data processing.

Over the years, Hadoop has evolved significantly, with various components being added to the ecosystem, making it a comprehensive big data processing framework.
Architecture of Hadoop

The architecture of Hadoop is based on a distributed computing model, allowing it to process and store large datasets across multiple machines.

It follows a master-slave architecture, where one machine serves as the master and coordinates the overall processing, while the other machines act as slaves and perform the actual data processing tasks.

The master node is responsible for managing the distributed file system (HDFS), job scheduling, and resource management through the component called Yet Another Resource Negotiator (YARN).

The slave nodes, also known as data nodes, store and process the data in parallel using the MapReduce paradigm.

Hadoop's architecture enables horizontal scalability and fault tolerance, making it suitable for handling big data workloads.
03
Components of Hadoop Ecosystem
Hadoop Distributed File System (HDFS)

HDFS is the primary storage system in Hadoop, designed to store and distribute large datasets across multiple machines.

It follows a master-slave architecture, where the NameNode serves as the master and maintains the metadata about files and blocks, while the DataNodes act as slaves and store the actual data.

HDFS provides data redundancy through replication, dividing the data into blocks and storing multiple copies on different DataNodes.

It offers high throughput and fault tolerance by maintaining DataNode replicas, enabling data availability even in the presence of failures.
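To make the NameNode's metadata role concrete, here is a minimal Python sketch that asks the NameNode for a file's block size and replication factor over the WebHDFS REST interface. The hostname, port (9870 is the usual Hadoop 3 default), and file path are illustrative assumptions, not values from this deck.

    import requests

    # Hypothetical NameNode address; WebHDFS listens on the NameNode's HTTP port
    # (9870 by default in Hadoop 3.x, 50070 in Hadoop 2.x).
    NAMENODE = "http://namenode.example.com:9870"
    PATH = "/data/events/2024/logs.csv"  # illustrative HDFS path

    # GETFILESTATUS returns the metadata the NameNode keeps for the file:
    # length, block size, replication factor, owner, permissions, and so on.
    resp = requests.get(f"{NAMENODE}/webhdfs/v1{PATH}", params={"op": "GETFILESTATUS"})
    resp.raise_for_status()
    status = resp.json()["FileStatus"]

    print("length (bytes):", status["length"])
    print("block size    :", status["blockSize"])    # data is split into blocks of this size
    print("replication   :", status["replication"])  # copies kept on different DataNodes

The NameNode answers this metadata request itself; reading the file's contents would be redirected to the DataNodes that hold the block replicas.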
Yet Another Resource Negotiator (YARN)

YARN is the resource management and job scheduling component in Hadoop.

It separates the resource management capabilities from the MapReduce processing framework, making Hadoop more versatile to support various data processing models.

YARN consists of two main components: the ResourceManager and the NodeManager.

The ResourceManager oversees resource allocation and scheduling of jobs across the cluster, ensuring optimal resource utilization.

The NodeManager runs on individual slave nodes and manages resources like CPU, memory, and disk for executing tasks.
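As a rough illustration of what the ResourceManager tracks, the hedged Python sketch below reads cluster-wide metrics from the ResourceManager's REST API. The hostname and port (8088 is the common ResourceManager web port) are assumed placeholders.

    import requests

    # Hypothetical ResourceManager address (default web UI/REST port is 8088).
    RESOURCE_MANAGER = "http://resourcemanager.example.com:8088"

    # Cluster metrics aggregate what every NodeManager has reported:
    # available memory and vcores, running applications, active nodes, etc.
    resp = requests.get(f"{RESOURCE_MANAGER}/ws/v1/cluster/metrics")
    resp.raise_for_status()
    metrics = resp.json()["clusterMetrics"]

    print("active nodes     :", metrics["activeNodes"])
    print("apps running     :", metrics["appsRunning"])
    print("available MB     :", metrics["availableMB"])
    print("available vcores :", metrics["availableVirtualCores"])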
MapReduce

MapReduce is the processing paradigm in Hadoop used for distributed data processing.

It divides the processing task into two main stages: Map and Reduce.

The Map stage takes input data and applies a mapping function to generate intermediate key-value pairs.

The Reduce stage combines the intermediate key-value pairs with the same key and produces the final output.

MapReduce provides fault tolerance, as tasks can be re-executed on different nodes in case of failures.

It automatically handles the parallelization of tasks, data distribution, and fault tolerance, allowing efficient processing of large-scale datasets.
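To show the two stages without a cluster, here is a small self-contained Python simulation of the MapReduce flow for word counting: the map function emits intermediate key-value pairs, a shuffle step groups them by key, and the reduce function aggregates each group. This only illustrates the paradigm; it is not Hadoop's execution engine.

    from collections import defaultdict

    def map_phase(line):
        """Map: turn one input record into intermediate (word, 1) pairs."""
        return [(word.lower(), 1) for word in line.split()]

    def reduce_phase(word, counts):
        """Reduce: combine all values that share the same key."""
        return word, sum(counts)

    documents = [
        "big data needs distributed processing",
        "hadoop brings distributed processing to big data",
    ]

    # Map stage: apply the mapper to every input record.
    intermediate = [pair for line in documents for pair in map_phase(line)]

    # Shuffle/sort: group intermediate pairs by key (the framework does this in Hadoop).
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # Reduce stage: aggregate each group into the final output.
    results = dict(reduce_phase(w, c) for w, c in groups.items())
    print(results)  # e.g. {'big': 2, 'data': 2, 'distributed': 2, ...}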
Hadoop Common

Hadoop Common is a set of common utilities and libraries used by other components in the Hadoop ecosystem.

It includes various modules and libraries that provide functionality like input/output operations, networking, security, and serialization.

Hadoop Common acts as a foundation for the entire Hadoop ecosystem, ensuring interoperability and compatibility between different components.
04
Advantages of Hadoop
Scalability and Flexibility

Hadoop offers horizontal scalability, allowing organizations to scale their data storage and processing capabilities effortlessly.

It can handle large datasets by distributing the workload across a cluster of machines, enabling parallel processing.

The flexibility of Hadoop allows it to ingest and process different types of data, including structured, semi-structured, and unstructured data.
Fault Tolerance

Fault tolerance is a critical feature of Hadoop, ensuring the availability and reliability of data processing.

Hadoop achieves fault tolerance by replicating data across multiple DataNodes in HDFS.

In case of node failures, Hadoop automatically re-replicates the data and reassigns the tasks to different nodes, ensuring uninterrupted processing.
Cost-Effective Storage and Processing

Hadoop provides a cost-effective solution for storing and processing large datasets.

It utilizes commodity hardware that is less expensive compared to traditional storage and processing solutions.

Hadoop's open-source nature eliminates the need for licensing fees, making it a cost-effective choice for organizations dealing with big data.
05
Hadoop Ecosystem Tools
Apache Hive

Introduction to Hive
Apache Hive is a data warehousing software that enables users to process structured data in Hadoop. It provides a SQL-like interface to query data stored in Hadoop.

Hive Query Language (HiveQL)
HiveQL is a SQL-like language used to query data stored in Apache Hive. It is similar to SQL in syntax, but is optimized for querying large data sets stored in Hadoop.

Integration with Hadoop
Apache Hive is designed to work seamlessly with Hadoop. It can run on top of Hadoop Distributed File System (HDFS) and process large datasets in parallel.
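As a hedged sketch of issuing HiveQL from client code, the example below uses the PyHive library to run a query against HiveServer2. The host, port (10000 is the common HiveServer2 default), table name, and columns are assumptions for illustration.

    from pyhive import hive  # pip install pyhive (plus its thrift/sasl dependencies)

    # Hypothetical HiveServer2 endpoint and database; adjust for a real cluster.
    conn = hive.Connection(host="hiveserver.example.com", port=10000,
                           username="analyst", database="default")
    cursor = conn.cursor()

    # HiveQL looks like SQL but is compiled into distributed jobs over data in HDFS.
    cursor.execute("""
        SELECT country, COUNT(*) AS orders
        FROM sales                -- assumed example table
        WHERE year = 2024
        GROUP BY country
        ORDER BY orders DESC
        LIMIT 10
    """)

    for country, orders in cursor.fetchall():
        print(country, orders)

    cursor.close()
    conn.close()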
Apache Pig

Introduction to Pig
Apache Pig is a platform for creating MapReduce programs used to analyze large data sets. It provides a high-level language, Pig Latin, for expressing complex data transformations.

Pig Latin Language
Pig Latin is a scripting language used to create data processing pipelines on Hadoop. It is similar to SQL in syntax but is optimized for big data processing.

Data Processing with Pig
Pig Latin is used to define data transformations such as filtering, grouping, joining, or aggregations. Pig automatically generates MapReduce jobs from these transformations to process and analyze large data sets efficiently.
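The hedged sketch below writes a small Pig Latin pipeline from Python and runs it with the pig command in local mode; the input file, schema, and output path are assumptions for illustration.

    import subprocess
    from pathlib import Path

    # Illustrative Pig Latin pipeline: load, filter, group, aggregate, store.
    pig_script = """
    logs     = LOAD 'access_log.csv' USING PigStorage(',')
               AS (user:chararray, url:chararray, bytes:int);
    big_hits = FILTER logs BY bytes > 1024;
    by_user  = GROUP big_hits BY user;
    totals   = FOREACH by_user GENERATE group AS user, SUM(big_hits.bytes) AS total_bytes;
    STORE totals INTO 'user_totals';
    """

    Path("traffic.pig").write_text(pig_script)

    # '-x local' runs against the local filesystem; without it, Pig compiles the
    # script into jobs that run on the Hadoop cluster.
    subprocess.run(["pig", "-x", "local", "traffic.pig"], check=True)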
Apache Spark

Introduction to Spark
Apache Spark is an open-source cluster computing system used for large-scale data processing. It provides high-level APIs for distributed data processing in Java, Scala, Python, and R.

Spark Core and Spark SQL
Spark Core is the foundation of Apache Spark and provides distributed task scheduling, memory management, and fault recovery. Spark SQL is a module in Spark used for structured data processing, integrating with various SQL-based tools.

Spark Streaming and Machine Learning
Spark Streaming is a module in Spark that enables real-time processing of streaming data. Spark's machine learning library, MLlib, provides algorithms and tools for data scientists to build and deploy machine learning models on large datasets.
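A short PySpark sketch of the DataFrame workflow described above; the input file and column names are assumptions, and a local SparkSession is used so the example can run without a cluster.

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # local[*] runs Spark in-process on all local cores; on a real cluster the same
    # code would be submitted with spark-submit against a YARN or standalone master.
    spark = (SparkSession.builder
             .appName("sales-summary")
             .master("local[*]")
             .getOrCreate())

    # Assumed example CSV with columns: country, amount.
    sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

    summary = (sales.groupBy("country")
                    .agg(F.sum("amount").alias("revenue"),
                         F.count("*").alias("orders"))
                    .orderBy(F.desc("revenue")))

    summary.show(10)
    spark.stop()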
06
Big Data Processing with Hadoop
Data Ingestion and Storage in Hadoop
Data Extraction and Transformation

Data extraction involves gathering data from various sources such as databases, log files, or APIs.
Data transformation involves cleaning, filtering, and formatting the extracted data to make it suitable for analysis in Hadoop.

Data Loading into Hadoop

Once the data is extracted and transformed, it can be loaded into Hadoop Distributed File System (HDFS), which is the primary storage system
in Hadoop.
The data can be loaded into HDFS using various methods such as command-line tools, Hadoop APIs, or third-party tools like Apache NiFi.

Data Partitioning and Replication in HDFS

Data partitioning refers to splitting the data into smaller chunks called blocks and distributing them across multiple nodes in the Hadoop cluster.
Data replication ensures that each data block is replicated on multiple nodes for fault tolerance. This replication factor can be configured based
on the desired level of fault tolerance.
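As a hedged sketch of the command-line loading route, the Python snippet below drives standard hdfs dfs commands to create a directory, copy a local file into HDFS, and raise its replication factor; the paths and replication value are illustrative.

    import subprocess

    def hdfs(*args):
        """Run an 'hdfs dfs' subcommand and fail loudly if it errors."""
        subprocess.run(["hdfs", "dfs", *args], check=True)

    # Create a target directory in HDFS (assumed example path).
    hdfs("-mkdir", "-p", "/data/raw/2024")

    # Copy a locally extracted and transformed file into HDFS; HDFS splits it into
    # blocks and distributes the blocks across DataNodes automatically.
    hdfs("-put", "-f", "transactions.csv", "/data/raw/2024/transactions.csv")

    # Increase the replication factor for this file (the cluster default is typically 3).
    hdfs("-setrep", "4", "/data/raw/2024/transactions.csv")

    # List the directory to confirm the file landed where expected.
    hdfs("-ls", "/data/raw/2024")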
Data Processing with MapReduce

MapReduce Workflow
MapReduce is a programming model used for processing large datasets in
parallel across a Hadoop cluster.
The MapReduce workflow involves two main stages: the Map stage, where data is processed and transformed into intermediate key-value pairs, and the Reduce stage, where the intermediate results are aggregated to generate the final output.

Map and Reduce Functions


Map functions take input key-value pairs and generate intermediate key-value pairs as output. They perform data transformation and filtering operations.
Reduce functions take intermediate key-value pairs as input and produce the final output by performing aggregation or summarization tasks.

Job Execution and Task Scheduling


When a MapReduce job is submitted, it is divided into multiple tasks that are
assigned to different nodes in the Hadoop cluster for parallel execution.
The Hadoop YARN (Yet Another Resource Negotiator) system is responsible for
managing task scheduling, resource allocation, and fault tolerance during job
execution.
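To sketch how such a job is submitted in practice, the example below uses Hadoop Streaming, which lets the Map and Reduce functions be ordinary scripts that read from stdin and write tab-separated key/value pairs to stdout. The streaming-jar location, HDFS paths, and mapper/reducer file names are assumptions that vary by installation.

    import subprocess

    # The streaming jar's location varies by distribution; this is a common layout.
    STREAMING_JAR = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.3.6.jar"

    # mapper.py and reducer.py are assumed local scripts implementing the Map and
    # Reduce functions. Generic options like -files must come before the others.
    cmd = [
        "hadoop", "jar", STREAMING_JAR,
        "-files", "mapper.py,reducer.py",      # ship the scripts to every task container
        "-input", "/data/raw/2024",            # HDFS input directory
        "-output", "/data/wordcount/out",      # HDFS output directory (must not exist yet)
        "-mapper", "python3 mapper.py",
        "-reducer", "python3 reducer.py",
    ]

    # YARN splits the job into map and reduce tasks, schedules them on NodeManagers,
    # and re-runs any task whose node fails.
    subprocess.run(cmd, check=True)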
Data Analytics and Visualization

Data Analysis Techniques
Data analysis techniques in Hadoop involve applying statistical, machine learning, and data mining algorithms to extract insights and patterns from large datasets. Techniques such as clustering, classification, regression, and anomaly detection can be applied to gain deeper understanding and make data-driven decisions.

Visualization Tools and Techniques
Visualization tools like Apache Superset, Tableau, or Matplotlib can be used to create visually appealing and interactive visualizations of the analyzed data. Techniques such as charts, graphs, maps, and dashboards help in presenting complex data in a simplified and intuitive manner.

Insights and Decision Making
Data analytics and visualization enable businesses to uncover meaningful insights and make informed decisions. By analyzing big data, organizations can identify trends, predict future outcomes, optimize processes, and identify areas for improvement, leading to better decision-making and competitive advantage.
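As a small illustration of the visualization step, the snippet below plots an aggregated result as a Matplotlib bar chart; the values are hard-coded sample data standing in for output exported from Hadoop.

    import matplotlib.pyplot as plt

    # Illustrative aggregate, e.g. per-country order counts produced by a Hive query
    # or MapReduce job; real code would load this from HDFS or a CSV export.
    countries = ["IN", "US", "DE", "BR", "JP"]
    orders = [1250, 980, 640, 410, 380]

    plt.figure(figsize=(6, 4))
    plt.bar(countries, orders)
    plt.title("Orders by country (sample data)")
    plt.xlabel("Country")
    plt.ylabel("Orders")
    plt.tight_layout()
    plt.show()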
07
Challenges and Future of Big Data and Hadoop
Data Security and Privacy

Risks and Threats in Big Data
With the collection of big data comes the need for increased security measures to protect sensitive information from cyber threats, unauthorized access, and data breaches. Risks and threats can arise at all stages of the data lifecycle, from collection to processing, analysis, and storage. Some of the common risks and threats in big data include malicious attacks, insider threats, data leaks, and hacks.

Privacy Regulations and Compliance
From healthcare to finance, organizations across various industries must comply with privacy regulations to safeguard the personal information of their clients and customers. Failure to comply can result in severe penalties and legal consequences. However, with the large quantities of data collected, vulnerabilities can emerge, leading to non-compliance.
Scalability and Performance Optimization

Handling Large Volumes of Data
Big data requires efficient storage and processing solutions to handle the sheer volume of information. Traditional databases and storage systems struggle to accommodate big data's constant growth and change. Hadoop's distributed architecture allows organizations to scale their data infrastructure as their needs increase.

Improving Processing Speed
Organizations must process big data in a timely manner to achieve actionable insights and valuable data analysis. Improving processing speed involves optimizing clusters, networks, and other computing resources to efficiently run data-intensive applications. Various optimization techniques, such as caching and parallel processing, help improve processing speed.
Advancements and Emerging Technologies

Cloud Computing and Big Data Integration
With the growth of big data, cloud computing has emerged as an essential tool for cost-effective storage, analysis, and computation of data. Organizations now have the option to store, process, and compute big data in the cloud, removing the need for large on-premise infrastructures.

Internet of Things (IoT)
IoT devices generate massive amounts of data in real time, which can be analyzed and leveraged to improve decision-making. Integration of big data and IoT offers organizations opportunities to gain more insights into their data and improve business processes.

Artificial Intelligence and Machine Learning
Artificial intelligence and machine learning have transformative capabilities for big data analysis. These technologies can help organizations identify patterns, predict trends, and generate insights from vast volumes of data. Incorporation of AI and machine learning into big data solutions will drive advancements and progress in data analytics and exploration.

Thanks
