Introduction to Big Data and Hadoop
Created by: Vipin Jaiswal
Contents
1. Basics of Big Data
2. Overview of Hadoop
3. Components of Hadoop Ecosystem
4. Advantages of Hadoop
5. Hadoop Ecosystem Tools
6. Big Data Processing with Hadoop
7. Challenges and Future of Big Data and Hadoop
01. Basics of Big Data
Volume: Big Data refers to a large amount of data that cannot be stored or processed efficiently with traditional data-processing tools.
Velocity: Velocity in Big Data refers to the speed at which data is generated, collected, and processed.
Variety: The variety of Big Data refers to the different types and formats of data. It includes structured data (like relational databases), semi-structured data (like JSON or XML), and unstructured data (like text, images, and video).
Veracity: Veracity refers to the quality and accuracy of the data. In Big Data, sources are often noisy or inconsistent, so the trustworthiness of the data must be assessed before it is used for analysis.
02. Overview of Hadoop
History of Hadoop
1. Hadoop was created by Doug Cutting and Mike Cafarella in 2005 and was developed further at Yahoo! to handle big data workloads.
2. Inspired by Google's MapReduce and Google File System (GFS) papers, Hadoop aimed to provide an open-source framework for processing and storing large datasets.
3. Originally, it consisted of two main components: Hadoop Distributed File System (HDFS) for data storage and MapReduce for data processing.
4. Over the years, Hadoop has evolved significantly, with various components being added to the ecosystem, making it a comprehensive big data processing framework.
Architecture of Hadoop
The architecture of Hadoop is based on a distributed computing model, allowing it to process and store large datasets across multiple machines.
It follows a master-slave architecture, where one machine serves as the master and coordinates the overall processing, while the other machines act as slaves and perform the actual data processing tasks.
The master node is responsible for managing the distributed file system (HDFS), job scheduling, and resource management through the component called Yet Another Resource Negotiator (YARN).
The slave nodes, also known as data nodes, store and process the data in parallel using the MapReduce paradigm.
Hadoop's architecture enables horizontal scalability and fault tolerance, making it suitable for handling big data workloads.
03. Components of Hadoop Ecosystem
Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop: it splits files into blocks, distributes them across the data nodes of the cluster, and replicates each block on multiple machines for fault tolerance.
MapReduce
MapReduce is the processing paradigm in Hadoop used for distributed data processing.
It divides the processing task into two main stages: Map and Reduce.
The Map stage takes input data and applies a mapping function to generate intermediate key-value pairs.
The Reduce stage combines the intermediate key-value pairs with the same key and produces the final output.
MapReduce provides fault tolerance, as tasks can be re-executed on different nodes in case of failures.
It automatically handles the parallelization of tasks, data distribution, and fault tolerance, allowing efficient processing of large-scale datasets.
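As a minimal sketch of the two stages, here is a word count written as Hadoop Streaming-style Python scripts (the file names mapper.py and reducer.py are placeholders; Hadoop sorts the intermediate pairs by key between the two stages, which is why the reducer can assume identical keys arrive together):

    # mapper.py -- Map stage: read input lines from stdin, emit intermediate key-value pairs
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            # one tab-separated (key, value) pair per line
            print(f"{word}\t1")

    # reducer.py -- Reduce stage: combine values for the same key and emit the final output
    import sys

    current_word, current_count = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t", 1)
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                print(f"{current_word}\t{current_count}")
            current_word, current_count = word, int(count)
    if current_word is not None:
        print(f"{current_word}\t{current_count}")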
Hadoop Common
Hadoop Common is a set of common utilities and libraries used by other components in the Hadoop ecosystem.
It includes various modules and libraries that provide functionality like input/output operations, networking, security, and serialization.
Hadoop Common acts as a foundation for the entire Hadoop ecosystem, ensuring interoperability and compatibility between different components.
04. Advantages of Hadoop
Scalability and Flexibility
Hadoop offers horizontal scalability, allowing organizations to scale their data storage and processing capabilities simply by adding machines to the cluster.
It can handle large datasets by distributing the workload across a cluster of machines, enabling parallel processing.
The flexibility of Hadoop allows it to ingest and process different types of data, including structured, semi-structured, and unstructured data.
Fault Tolerance
Data blocks are replicated across multiple nodes and failed tasks are re-executed elsewhere in the cluster, so processing continues even when individual machines fail.
Cost-Effectiveness
Hadoop provides a cost-effective solution for storing and processing large datasets.
It utilizes commodity hardware that is less expensive compared to traditional storage and processing solutions.
Hadoop's open-source nature eliminates the need for licensing fees, making it a cost-effective choice for organizations dealing with big data.
05. Hadoop Ecosystem Tools
Apache Hive
Apache Hive is a data warehousing software that enables users to process structured data in Hadoop. It provides a SQL-like interface to query data stored in Hadoop.
HiveQL is a SQL-like language used to query data stored in Apache Hive. It is similar to SQL in syntax, but is optimized for querying large datasets stored in Hadoop.
Apache Hive is designed to work seamlessly with Hadoop. It can run on top of Hadoop Distributed File System (HDFS) and process large datasets in parallel.
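A rough sketch of issuing HiveQL from Python, assuming the third-party PyHive client is installed and a HiveServer2 endpoint is reachable on the default port; the sales table and its columns are made up for illustration:

    # Query a Hive table over HiveServer2 using the PyHive client (assumed to be installed)
    from pyhive import hive

    conn = hive.Connection(host="localhost", port=10000, database="default")
    cursor = conn.cursor()
    # HiveQL looks like SQL; Hive compiles it into jobs that run over data stored in HDFS
    cursor.execute(
        "SELECT country, COUNT(*) AS orders "
        "FROM sales "
        "GROUP BY country "
        "ORDER BY orders DESC "
        "LIMIT 10"
    )
    for row in cursor.fetchall():
        print(row)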
Apache Pig
Introduction to Pig: Apache Pig is a platform for creating MapReduce programs used to analyze large data sets. It provides a high-level language, Pig Latin, for expressing complex data transformations.
Pig Latin Language: Pig Latin is a scripting language used to create data processing pipelines on Hadoop. It is similar to SQL in syntax but is optimized for big data processing.
Data Processing with Pig: Pig Latin is used to define data transformations such as filtering, grouping, joining, or aggregations. Pig automatically generates MapReduce jobs from these transformations to process and analyze large data sets efficiently.
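To make the filter / group / aggregate idea concrete, here is the kind of pipeline a Pig Latin script expresses, sketched in plain Python on a tiny in-memory dataset. This is only a conceptual illustration, not Pig itself; in Pig the corresponding steps would be FILTER, GROUP, and FOREACH ... GENERATE running over data in HDFS.

    # Conceptual illustration of a Pig-style pipeline: filter -> group -> aggregate
    from collections import defaultdict

    records = [
        {"user": "a", "action": "click", "amount": 3},
        {"user": "b", "action": "view", "amount": 1},
        {"user": "a", "action": "click", "amount": 2},
    ]

    # FILTER: keep only click events
    clicks = [r for r in records if r["action"] == "click"]

    # GROUP: collect the amounts for each user
    grouped = defaultdict(list)
    for r in clicks:
        grouped[r["user"]].append(r["amount"])

    # FOREACH ... GENERATE: aggregate each group
    totals = {user: sum(amounts) for user, amounts in grouped.items()}
    print(totals)  # {'a': 5}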
Apache Spark
Introduction to Spark: Apache Spark is an open-source cluster computing system used for large-scale data processing. It provides high-level APIs for distributed data processing in Java, Scala, Python, and R.
Spark Core and Spark SQL: Spark Core is the foundation of Apache Spark and provides distributed task scheduling, memory management, and fault recovery. Spark SQL is a module in Spark used for structured data processing, integrating with various SQL-based tools.
Spark Streaming and Machine Learning: Spark Streaming is a module in Spark that enables real-time processing of streaming data. Spark's machine learning library, MLlib, provides algorithms and tools for data scientists to build and deploy machine learning models on large datasets.
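A short PySpark sketch of the high-level Python API mentioned above, assuming a working Spark installation; the HDFS input path is a placeholder:

    # Word count with the PySpark API; Spark distributes the work across the cluster
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("word-count-example").getOrCreate()

    lines = spark.read.text("hdfs:///data/input")            # placeholder input path
    words = lines.rdd.flatMap(lambda row: row.value.split())
    counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    for word, count in counts.take(10):
        print(word, count)

    spark.stop()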
06. Big Data Processing with Hadoop
Data Ingestion and Storage in Hadoop
Data Extraction and Transformation
Data extraction involves gathering data from various sources such as databases, log files, or APIs.
Data transformation involves cleaning, filtering, and formatting the extracted data to make it suitable for analysis in Hadoop.
Once the data is extracted and transformed, it can be loaded into the Hadoop Distributed File System (HDFS), which is the primary storage system in Hadoop.
The data can be loaded into HDFS using various methods such as command-line tools, Hadoop APIs, or third-party tools like Apache NiFi (see the sketch below).
Data partitioning refers to splitting the data into smaller chunks called blocks and distributing them across multiple nodes in the Hadoop cluster.
Data replication ensures that each data block is replicated on multiple nodes for fault tolerance. The replication factor can be configured based on the desired level of fault tolerance.
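For example, loading a local file into HDFS with the standard hdfs dfs command-line tool, wrapped in a small Python script; the local and HDFS paths are placeholders, and the replication factor of 3 is only an illustrative choice:

    # Load a local file into HDFS using the hdfs dfs command-line tool
    import subprocess

    local_file = "sales.csv"          # placeholder local path
    hdfs_dir = "/data/raw/sales"      # placeholder HDFS directory

    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", hdfs_dir], check=True)
    subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_dir], check=True)

    # Optionally adjust the replication factor for the ingested data
    subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", hdfs_dir], check=True)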
Data Processing with MapReduce
MapReduce Workflow
MapReduce is a programming model used for processing large datasets in
parallel across a Hadoop cluster.
The MapReduce workflow involves two main stages: the Map stage, where data is processed and transformed into intermediate key-value pairs, and the Reduce stage, where the intermediate results are aggregated to generate the final output.
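Continuing the earlier mapper.py / reducer.py sketch, such a job could be submitted to the cluster with the Hadoop Streaming utility; the path to the streaming jar varies by installation and is an assumption here, as are the input and output paths:

    # Submit the mapper.py / reducer.py scripts as a Hadoop Streaming job
    import subprocess

    streaming_jar = "/opt/hadoop/share/hadoop/tools/lib/hadoop-streaming.jar"  # assumed path

    subprocess.run([
        "hadoop", "jar", streaming_jar,
        "-files", "mapper.py,reducer.py",
        "-mapper", "python3 mapper.py",
        "-reducer", "python3 reducer.py",
        "-input", "/data/raw/sales",       # placeholder HDFS input
        "-output", "/data/out/wordcount",  # placeholder HDFS output (must not already exist)
    ], check=True)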
Data Analysis and Visualization
Data analysis techniques in Hadoop involve applying statistical, machine learning, and data mining algorithms to extract insights and patterns from large datasets. Techniques such as clustering, classification, regression, and anomaly detection can be applied to gain deeper understanding and make data-driven decisions.
Visualization tools like Apache Superset, Tableau, or Matplotlib can be used to create visually appealing and interactive visualizations of the analyzed data. Techniques such as charts, graphs, maps, and dashboards help in presenting complex data in a simplified and intuitive manner.
Data analytics and visualization enable businesses to uncover meaningful insights and make informed decisions. By analyzing big data, organizations can identify trends, predict future outcomes, optimize processes, and identify areas for improvement, leading to better decision-making and competitive advantage.
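A tiny illustration of the clustering-plus-visualization step, using scikit-learn and Matplotlib on synthetic data purely for demonstration; in practice the input would be analysis results pulled out of Hadoop:

    # Cluster synthetic 2-D data with k-means and visualize the result
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(42)
    data = np.vstack([
        rng.normal(loc=0.0, scale=1.0, size=(100, 2)),
        rng.normal(loc=5.0, scale=1.0, size=(100, 2)),
    ])

    labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(data)

    plt.scatter(data[:, 0], data[:, 1], c=labels)
    plt.title("k-means clusters")
    plt.savefig("clusters.png")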
07. Challenges and Future of Big Data and Hadoop
Data Security and Privacy
Big data systems collect and store large volumes of potentially sensitive information, which raises challenges around access control, data protection, and compliance with privacy regulations.
Cloud Computing and Big Data Integration
With the growth of big data, cloud computing has emerged as an essential tool for cost-effective storage, analysis, and computation of data. Organizations now have the option to store, process, and compute big data in the cloud, removing the need for large on-premise infrastructures.
Internet of Things (IoT)
IoT devices generate massive amounts of data in real time, which can be analyzed and leveraged to improve decision-making. Integration of big data and IoT offers organizations opportunities to gain more insights into their data and improve business processes.
Artificial Intelligence and Machine Learning
Artificial intelligence and machine learning have transformative capabilities for big data analysis. These technologies can help organizations identify patterns, predict trends, and generate insights from vast volumes of data. Incorporating AI and machine learning into big data solutions will drive advancements in data analytics and exploration.
Thanks