BD unit 1
Big Data Technologies and Tools
1. Data Storage
This involves storing and organizing large amounts of data in a way that makes it easy to
access and manage. Two widely used tools are:
Apache Hadoop:
o It's open-source software that helps store and process big data by dividing the
work across multiple computers (clusters).
o It processes large workloads in parallel and handles many different types of
data.
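To make "dividing the work across multiple computers" concrete, here is a minimal word-count example in the style of Hadoop Streaming, a standard Hadoop interface that runs scripts over stdin/stdout (the file names and data are illustrative, not part of these notes):

```python
#!/usr/bin/env python3
# mapper.py -- Hadoop runs many copies of this script in parallel,
# one per chunk ("split") of the input, across the cluster.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")   # emit one (word, 1) pair per occurrence
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop sorts the mapper output by key, so all counts
# for the same word arrive on consecutive lines and can be summed.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")   # flush the final word
```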
MongoDB:
o A database designed to handle massive amounts of unstructured data, like text
or images.
o It uses "key-value pairs" (like labels and their values) to organize information.
2. Data Mining
This is about extracting useful patterns, trends, or information from raw data. Two common
tools for this are:
RapidMiner:
o It helps process data and create machine learning models to predict outcomes
(like predicting future sales).
o Combines data preparation and advanced analytics in one platform.
Presto:
o Originally created by Facebook, Presto is a tool to run queries on large
datasets from different sources (like databases or cloud storage).
o It returns results very quickly and can combine data from several sources in a
single query.
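As a rough sketch, a Presto query can be submitted from Python with the presto-python-client package (the host, catalog, and table names below are placeholders, not values from these notes):

```python
import prestodb  # pip install presto-python-client

conn = prestodb.dbapi.connect(
    host="presto.example.com",  # assumed coordinator address
    port=8080,
    user="analyst",
    catalog="hive",             # Presto can also target other catalogs
    schema="default",
)
cur = conn.cursor()
# One SQL query, even though the data behind the catalog may live in a
# data lake, a relational database, or cloud storage.
cur.execute("SELECT region, count(*) FROM orders GROUP BY region")
for row in cur.fetchall():
    print(row)
```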
3. Data Analytics
Here, tools are used to analyze and make sense of the data to support business decisions.
Examples include:
Apache Spark:
o A fast tool for analyzing big data.
o Unlike Hadoop, it processes data in memory (RAM) instead of relying on
slower storage methods.
o It handles complex analytics tasks with speed.
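A small PySpark sketch of the in-memory idea: once a dataset is cached in RAM, repeated queries reuse it instead of rereading from disk (pyspark is assumed to be installed; the file name and columns are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("demo").getOrCreate()

df = spark.read.csv("sales.csv", header=True, inferSchema=True)
df.cache()  # keep the dataset in memory after it is first computed

# Both queries below reuse the cached, in-memory copy of the data.
df.groupBy("region").count().show()
print(df.filter(df["amount"] > 100).count())

spark.stop()
```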
Splunk:
o A tool to find insights and trends in large datasets.
o It creates visualizations like charts, dashboards, and reports, and supports AI
integration to enhance data analysis.
4. Data Visualization
This step involves creating visual representations (like graphs or dashboards) to make the
data insights easy to understand for decision-makers.
Tableau:
o A popular tool with a simple drag-and-drop feature for creating graphs, pie
charts, and dashboards.
o Visualizations can be securely shared in real-time.
Looker:
o A business intelligence tool that simplifies sharing data insights with others.
o It allows teams to monitor and track important metrics, like social media
performance.
3) Vs of Big Data
Definitions of big data vary slightly, but big data is always described in terms
of volume, velocity, and variety. These characteristics are often referred to as
the "3 Vs of big data" and were first defined by Gartner in 2001.
1. Volume
As its name suggests, the most common characteristic associated with big data
is its high volume. This describes the enormous amount of data that is available
for collection and produced from a variety of sources and devices on a
continuous basis.
2. Velocity
Big data velocity refers to the speed at which data is generated. Today, data is
often produced in real-time or near real-time, and therefore, it must also be
processed, accessed, and analyzed at the same rate to have any meaningful
impact.
3. Variety
Data is heterogeneous, meaning it can come from many different sources and
can be structured, unstructured, or semi-structured. More traditional structured
data (such as data in spreadsheets or relational databases) is now supplemented
by unstructured text, images, audio, video files, or semi-structured formats like
sensor data that can’t be organized in a fixed data schema.
In addition to these three original Vs, three others are often mentioned in
relation to harnessing the power of big data: veracity, variability, and value.
4. Veracity
Big data can be messy, noisy, and error-prone, which makes it difficult to
control the quality and accuracy of the data. Large datasets can be unwieldy and
confusing, while smaller datasets could present an incomplete picture. The
higher the veracity of the data, the more trustworthy it is.
5. Variability
Variability refers to how inconsistent data can be: the speed and volume of data
flows can fluctuate widely (for example, during seasonal peaks), and the meaning
of the same data can change depending on its context, which complicates
processing and analysis.
6. Value
It’s essential to determine the business value of the data you collect. Big data
must contain the right data and then be effectively analyzed in order to yield
insights that can help drive decision-making.
4) Crowdsourcing (note)
7) Mobile BI (padeepz)
Introduction to NoSQL
What is NoSQL?
NoSQL stands for "Not Only SQL", meaning it is not limited to traditional relational
databases. It is designed to handle huge amounts of data that relational databases struggle
with. NoSQL databases are schema-free and non-relational, making them more flexible for
different data structures.
Most NoSQL databases are open-source and distributed, meaning data is copied across
multiple servers (either local or remote). If one server goes offline, the system continues to
run without losing data access.
Unlike traditional databases, most NoSQL databases relax strict consistency rules
(often settling for eventual consistency), which makes them faster and easier to
scale.
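The replication idea can be illustrated with a toy sketch (this is not a real database, just a few dictionaries standing in for servers): every write is copied to all live nodes, so a read still succeeds when one node goes offline.

```python
class TinyReplicatedStore:
    """Toy stand-in for a distributed NoSQL store."""

    def __init__(self, replicas=3):
        self.nodes = [{} for _ in range(replicas)]  # each dict = one server
        self.online = [True] * replicas

    def put(self, key, value):
        for node, up in zip(self.nodes, self.online):
            if up:
                node[key] = value  # copy the data to every live replica

    def get(self, key):
        for node, up in zip(self.nodes, self.online):
            if up and key in node:
                return node[key]   # any live replica can serve the read
        raise KeyError(key)

store = TinyReplicatedStore()
store.put("user:1", {"name": "Asha"})
store.online[0] = False        # simulate one server going offline
print(store.get("user:1"))     # still served by a surviving replica
```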
Why NoSQL?
Advantages of NoSQL:
o Schema-free design that easily stores unstructured and semi-structured data.
o Scales horizontally by simply adding more servers to the cluster.
o High availability, since data is replicated across multiple servers.
o Fast reads and writes for very large workloads.
Disadvantages of NoSQL:
o Weaker consistency guarantees (often eventual consistency instead of strict
ACID transactions).
o No single standard query language comparable to SQL.
o Less mature tooling and a smaller base of experienced administrators than
relational databases.
The CAP theorem states that a database can guarantee only two of the following three
properties:
1. Consistency – Every request returns the most recent data (but may require waiting).
2. Availability – Every request gets a response, but it may not be the latest data.
3. Partition Tolerance – The system remains operational even if some network failures occur.
Since NoSQL databases are distributed, they must choose between Consistency and
Availability while ensuring Partition Tolerance.
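A toy sketch of that trade-off (purely illustrative; real systems are far more involved): during a partition, a consistency-first (CP) node refuses to answer rather than risk serving stale data, while an availability-first (AP) node always answers but may return out-of-date values.

```python
class Node:
    def __init__(self, prefers_consistency):
        self.value = "v1"               # last value this node has seen
        self.partitioned = False        # cut off from the other replicas?
        self.prefers_consistency = prefers_consistency

    def read(self):
        if self.partitioned:
            if self.prefers_consistency:
                # CP choice: stay consistent by refusing to respond.
                raise RuntimeError("unavailable during partition")
            # AP choice: always respond, even if the value may be stale.
            return self.value + " (possibly stale)"
        return self.value

cp = Node(prefers_consistency=True)
ap = Node(prefers_consistency=False)
cp.partitioned = ap.partitioned = True

print(ap.read())                 # answers, possibly with stale data
try:
    cp.read()
except RuntimeError as err:
    print(err)                   # stays consistent, sacrifices availability
```

Real systems make this choice too: MongoDB is commonly classified as leaning toward consistency (CP), while Cassandra leans toward availability (AP).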