BD unit 1

The document covers various aspects of big data applications and technologies, including data storage, mining, analytics, and visualization tools like Apache Hadoop, MongoDB, and Tableau. It also introduces NoSQL databases, highlighting their flexibility, scalability, and types, such as key-value and document-oriented stores. Additionally, it discusses the '3 Vs' of big data (volume, velocity, variety) and the CAP theorem related to distributed databases.

Uploaded by

SUJITHA M

1) BIG DATA APPLICATIONS (Padeepz app)

2) Big Data Technologies (Padeepz app)

Big Data Technologies and Tools

1. Data Storage

This involves storing and organizing large amounts of data in a way that makes it easy to
access and manage. Two widely used tools are:

 Apache Hadoop:
o An open-source framework that stores and processes big data by dividing the work across clusters of computers.
o It scales out across low-cost commodity hardware and handles many different types of data.
 MongoDB:
o A database designed to handle massive amounts of unstructured data, like text or images.
o It stores records as flexible, JSON-like documents made up of field-value pairs.
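To make the document model concrete, here is a minimal sketch of what a MongoDB-style record looks like, using only Python's standard json module (the field names and values are purely illustrative, not from any real collection):

```python
import json

# A MongoDB-style document: a JSON-like record of field-value pairs.
# Field names here are illustrative only.
product = {
    "_id": "sku-1001",
    "name": "wireless mouse",
    "tags": ["electronics", "accessories"],          # arrays are allowed
    "stock": {"warehouse_a": 25, "warehouse_b": 7},  # nested documents too
}

# Documents serialize naturally to JSON, which is why schema-free stores
# can hold records with different shapes side by side in one collection.
as_json = json.dumps(product)
restored = json.loads(as_json)
print(restored["stock"]["warehouse_a"])  # -> 25
```

Because no fixed schema is enforced, a second document in the same collection could omit "tags" entirely or add new fields without any migration step.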

2. Data Mining

This is about extracting useful patterns, trends, or information from raw data. Two common
tools for this are:

 RapidMiner:
o It helps process data and create machine learning models to predict outcomes
(like predicting future sales).
o Combines data preparation and advanced analytics in one platform.
 Presto:
o Originally created by Facebook, Presto is a distributed SQL query engine for running queries on large datasets from different sources (like databases or cloud storage).
o It returns results quickly and can combine data from several sources in a single query.

3. Data Analytics

Here, tools are used to analyze and make sense of the data to support business decisions.
Examples include:

 Apache Spark:
o A fast engine for analyzing big data.
o Unlike Hadoop's MapReduce, it processes data in memory (RAM) instead of relying on slower disk storage between steps.
o This lets it handle complex analytics tasks, such as iterative machine learning, with speed.
 Splunk:
o A tool for finding insights and trends in large datasets, especially machine-generated data such as logs.
o It creates visualizations like charts, dashboards, and reports, and supports AI integration to enhance data analysis.

4. Data Visualization

This step involves creating visual representations (like graphs or dashboards) to make the
data insights easy to understand for decision-makers.

 Tableau:
o A popular tool with a simple drag-and-drop feature for creating graphs, pie
charts, and dashboards.
o Visualizations can be securely shared in real-time.
 Looker:
o A business intelligence tool that simplifies sharing data insights with others.
o It allows teams to monitor and track important metrics, like social media
performance.

3) Vs of Big Data
Big data definitions vary slightly, but big data is always described in terms
of volume, velocity, and variety. These characteristics are often referred to
as the "3 Vs of big data" and were first defined by analyst Doug Laney in 2001.

1. Volume

As its name suggests, the most common characteristic associated with big data
is its high volume. This describes the enormous amount of data that is available
for collection and produced from a variety of sources and devices on a
continuous basis.

2. Velocity

Big data velocity refers to the speed at which data is generated. Today, data is
often produced in real-time or near real-time, and therefore, it must also be
processed, accessed, and analyzed at the same rate to have any meaningful
impact.
3. Variety

Data is heterogeneous, meaning it can come from many different sources and
can be structured, unstructured, or semi-structured. More traditional structured
data (such as data in spreadsheets or relational databases) is now supplemented
by unstructured text, images, audio, video files, or semi-structured formats like
sensor data that can’t be organized in a fixed data schema.

In addition to these three original Vs, three others are often mentioned in
relation to harnessing the power of big data: veracity, variability, and value.

4. Veracity

Big data can be messy, noisy, and error-prone, which makes it difficult to
control the quality and accuracy of the data. Large datasets can be unwieldy and
confusing, while smaller datasets could present an incomplete picture. The
higher the veracity of the data, the more trustworthy it is.

5. Variability

The meaning of collected data is constantly changing, which can lead to
inconsistency over time. These shifts include not only changes in context and
interpretation but also changes in data collection methods, based on the
information that companies want to capture and analyze.

6. Value

It’s essential to determine the business value of the data you collect. Big data
must contain the right data and then be effectively analyzed in order to yield
insights that can help drive decision-making.

4)Crowdsourcing (note)

5)Web Analytics (Padeepz app)

6) Cloud and Big Data (pdf) + difference between them (note)

7)Mobile BI (padeepz)

8) Inter- and trans-firewall analytics (pdf)


UNIT 2

Introduction to NoSQL

What is NoSQL?

NoSQL stands for "Not Only SQL", meaning it is not limited to traditional relational
databases. It is designed to handle huge amounts of data that relational databases struggle
with. NoSQL databases are schema-free and non-relational, making them more flexible for
different data structures.

Most NoSQL databases are open-source and distributed, meaning data is replicated
across multiple servers (either local or remote). If one server goes offline, the
system continues to run without losing access to the data.

Unlike traditional databases, NoSQL databases do not enforce strict consistency
rules, which makes them faster and more scalable.

Why NoSQL?

NoSQL databases are useful due to modern business data challenges:

1. Volume & Velocity – Handles large, fast-growing datasets.
2. Variability – Supports diverse data types that do not fit into structured tables.
3. Agility – Adapts quickly to business changes.

Key Features of NoSQL

 Works on multiple processors and low-cost hardware.
 Supports linear scalability – adding more processors increases performance proportionally.
 Designed for the big data processing needs of companies like Facebook, Twitter, and Google.

Types of NoSQL Databases

There are four main types:

1. Key-Value Stores – Data is stored as key-value pairs in a hash-table-like structure.
o Example: Redis, Amazon DynamoDB
2. Document-Oriented Stores – Data is stored in documents (mostly JSON format).
o Example: MongoDB, CouchDB
3. Column-Oriented Stores – Data is stored in columns instead of rows, optimizing
performance for big data analytics.
o Example: Cassandra, HBase
4. Graph Databases – Data points and the relationships between them are stored
together; often used in social networks.
o Example: Neo4j, AllegroGraph
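The key-value model above is essentially a hash table with a network interface. A toy in-memory sketch of that model in Python (an illustration of the idea, not the API of Redis or DynamoDB):

```python
# A toy in-memory key-value store, sketching the hash-table model that
# systems like Redis build on. Purely illustrative, not a real client.
class KeyValueStore:
    def __init__(self):
        self._data = {}          # Python dicts are hash tables

    def put(self, key, value):
        self._data[key] = value  # O(1) average insert/overwrite

    def get(self, key, default=None):
        return self._data.get(key, default)  # O(1) average lookup

    def delete(self, key):
        self._data.pop(key, None)

store = KeyValueStore()
store.put("user:42:name", "Sujitha")   # keys are often namespaced strings
print(store.get("user:42:name"))       # -> Sujitha
store.delete("user:42:name")
print(store.get("user:42:name"))       # -> None
```

Because each operation touches a single key, this model shards naturally across many servers, which is why key-value stores scale so well.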
Examples

 CouchDB – A JSON-based document database using JavaScript for queries.
 Elasticsearch – A document database with a powerful full-text search engine.
 Couchbase – A key-value and document store used for cloud applications.

Advantages of NoSQL:

 Can store different data types easily.
 Works efficiently with big data.
 Cost-effective – many NoSQL databases are open-source and run on cheap hardware.

Disadvantages of NoSQL:

 Does not have built-in consistency like relational databases.
 Developers may need to write extra code for reliability.

CAP Theorem (Trade-Off in Distributed Databases)

The CAP theorem states that a database can guarantee only two of the following three
properties:

1. Consistency – Every read returns the most recent write (but may require waiting).
2. Availability – Every request gets a response, but it may not reflect the latest data.
3. Partition Tolerance – The system remains operational even if network failures split the nodes.

Since NoSQL databases are distributed, they must choose between Consistency and
Availability while ensuring Partition Tolerance.
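The trade-off can be seen in a toy simulation of two replicas during a network partition: a system that picks Consistency (CP) refuses to answer from a stale replica, while one that picks Availability (AP) answers with possibly stale data. All names and behavior below are illustrative, not any real database:

```python
# Toy two-replica system illustrating the CAP trade-off during a partition.
class Replica:
    def __init__(self):
        self.value = None

def write(primary, secondary, value, partitioned):
    primary.value = value
    if not partitioned:
        secondary.value = value   # replication only succeeds without a partition

def read_cp(secondary, partitioned):
    # CP choice: refuse to answer rather than risk returning stale data.
    if partitioned:
        raise RuntimeError("unavailable during partition")
    return secondary.value

def read_ap(secondary):
    # AP choice: always answer, even if the value may be stale.
    return secondary.value

a, b = Replica(), Replica()
write(a, b, "v1", partitioned=False)   # both replicas now hold "v1"
write(a, b, "v2", partitioned=True)    # partition: only the primary sees "v2"

print(read_ap(b))                      # -> v1 (available, but stale)
try:
    read_cp(b, partitioned=True)
except RuntimeError as e:
    print(e)                           # -> unavailable during partition
```

Once the partition heals, real systems reconcile the replicas; how they do so (and what they promise in the meantime) is exactly the C-versus-A choice the theorem describes.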
