
UNIT III

NoSQL movement for handling Big Data


Distributing data storage and processing with the Hadoop framework, case study on risk
assessment for loan sanctioning, CAP theorem, base principle of NoSQL databases,
types of NoSQL databases, case study on disease diagnosis and profiling
Hadoop INTRODUCTION
History of Hadoop
Hadoop was developed at the Apache Software Foundation, and its co-founders are
Doug Cutting and Mike Cafarella. Doug Cutting named it after his son's toy elephant.
In October 2003, Google published its paper on the Google File System. In January
2006, MapReduce development started within the Apache Nutch project, with around
6,000 lines of code for MapReduce and around 5,000 lines of code for HDFS. In April
2006, Hadoop 0.1.0 was released.
Hadoop is an open-source software framework for storing and processing big data. It
was created by the Apache Software Foundation in 2006, based on white papers
published by Google that described the Google File System (GFS) and the MapReduce
programming model. The Hadoop framework allows for the distributed processing of
large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. It is used by many organizations, including Yahoo,
Facebook, and IBM, for a variety of purposes such as data warehousing, log
processing, and research. Hadoop has been widely adopted in the industry and has
become a key technology for big data processing.
Q) Distributing data storage and processing with Hadoop framework
Hadoop is an open-source software framework for storing large amounts of data and
performing computation on it. The framework is written primarily in Java, with some
native code in C and shell scripts.
Hadoop is an open-source software framework that is used for storing and processing
large amounts of data in a distributed computing environment. It is designed to handle
big data and is based on the MapReduce programming model, which allows for the
parallel processing of large datasets.
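To make the MapReduce model concrete, the following is a minimal word-count sketch written for Hadoop Streaming in Python; the script names and the submission command vary by installation, so treat this only as an illustrative outline.

python
#!/usr/bin/env python3
# mapper.py -- reads raw text from stdin and emits one (word, 1) pair per line.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

python
#!/usr/bin/env python3
# reducer.py -- Hadoop Streaming sorts the mapper output by key, so all counts
# for a given word arrive together and can be summed in one pass.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The two scripts are submitted to the cluster with the hadoop-streaming jar, which handles splitting the input across mappers, shuffling and sorting the intermediate pairs, and collecting the reducer output in HDFS.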
Hadoop has two main components:
 HDFS (Hadoop Distributed File System): This is the storage component of
Hadoop, which allows for the storage of large amounts of data across multiple
machines. It is designed to work with commodity hardware, which makes it
cost-effective.

 YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as
CPU and memory) for processing the data stored in HDFS.
 Hadoop also includes several additional modules that provide additional
functionality, such as Hive (a SQL-like query language), Pig (a high-level
platform for creating MapReduce programs), and HBase (a non-relational,
distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It’s also used for data processing,
data analysis, and data mining. It enables the distributed processing of large
data sets across clusters of computers using a simple programming model.

Components that collectively form the Hadoop ecosystem:


 HDFS: Hadoop Distributed File System
 YARN: Yet Another Resource Negotiator
 MapReduce: Programming based Data Processing
 Spark: In-Memory data processing
 PIG, HIVE: Query based processing of data services
 HBase: NoSQL Database
 Mahout, Spark MLLib: Machine Learning algorithm libraries
 Solr, Lucene: Searching and indexing
 Zookeeper: Managing cluster
 Oozie: Job Scheduling

Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is simple.
4. It offers huge, flexible storage.
5. It is low cost.
Hadoop has several key features that make it well-suited for big data processing:
 Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
 Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
 Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it
can continue to operate even in the presence of hardware failures.
 Data Locality: Hadoop moves computation to the node where the data is stored;
this reduces network traffic and improves performance.
 High Availability: Hadoop's high-availability features help ensure that data is
always accessible and is not lost.
 Flexible Data Processing: Hadoop's MapReduce programming model allows data
to be processed in a distributed fashion, making it easy to implement a wide
variety of data processing tasks.
 Data Integrity: Hadoop uses built-in checksums to help ensure that stored data
is consistent and correct.
 Data Replication: Hadoop replicates data across the cluster for fault tolerance.
 Data Compression: Hadoop supports built-in data compression, which reduces
storage space and improves performance.
 YARN: A resource management platform that allows multiple data processing
engines like real-time streaming, batch processing, and interactive SQL, to run
and process data stored in HDFS.

Hadoop comprises four core components:


 Hadoop Common: Common utilities which support other Hadoop modules.
 HDFS: The Hadoop Distributed File System provides high-throughput access
to application data.
 MapReduce: A highly efficient methodology for parallel processing of huge
volumes of data.
 YARN: Stands for Yet Another Resource Negotiator. It is used for job
scheduling and efficient management of cluster resources.
Other Hadoop ecosystem components:
 Ambari: A tool for managing, monitoring and provisioning Hadoop clusters.
Apache Ambari supports the HDFS and MapReduce programs. Major
highlights of Ambari are:
o Management of the Hadoop framework is efficient, secure and
consistent.
o Management of cluster operations with an intuitive web UI and a robust
API.
o The installation and configuration of a Hadoop cluster are simplified.
o It supports automation, smart configuration and recommendations.
o Advanced cluster security set-up is included with this toolkit.
o The entire cluster can be monitored using metrics, heat maps,
analysis and troubleshooting views.
o Increased levels of customization and extension make this more
valuable.
 Cassandra: A distributed system for handling extremely large amounts of data
stored across several commodity servers. The database management system
(DBMS) is highly available with no single point of failure.
 HBase: A non-relational, distributed database management system that works
efficiently on sparse data sets and is highly scalable.
 Spark: A highly agile, scalable and secure big-data compute engine, versatile
enough to support a wide variety of workloads such as real-time processing,
machine learning and ETL (a brief PySpark sketch follows after this list).
 Hive: A data warehouse tool used for analyzing, querying and summarizing
data on top of the Hadoop framework.
 Pig: A high-level framework that lets us work with either Apache Spark or
MapReduce to analyze data. The language used to write scripts for this
framework is known as Pig Latin.
 Sqoop: This framework is used for transferring the data to Hadoop from
relational databases. This application is based on a command-line interface.
 Oozie: This is a scheduling system for workflow management, executing
workflow routes for successful completion of the task in Hadoop.
 Zookeeper: Open source centralized service which is used to provide
coordination between distributed applications of Hadoop. It offers the registry
and synchronization service at a high level.
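As a small illustration of the Spark engine mentioned above, here is a minimal PySpark word-count sketch; the input path is an assumed placeholder, shown only for demonstration.

python
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("WordCountExample").getOrCreate()

# Read a text file (the path is an assumed placeholder) and count word frequencies.
lines = spark.read.text("hdfs:///data/sample.txt")
words = lines.selectExpr("explode(split(value, ' ')) AS word")
counts = words.groupBy("word").count()

counts.show(10)   # print the first 10 (word, count) rows
spark.stop()

Because Spark keeps intermediate data in memory, the same counts DataFrame could feed further transformations (filtering, joining, machine learning) without re-reading the input.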

Q) ACID principle of relational databases
The ACID principle refers to a set of properties that ensure reliable processing in a
relational database. ACID stands for Atomicity, Consistency, Isolation, and
Durability, each representing an essential property for database transactions:
1. Atomicity
 A transaction is treated as a single, indivisible unit. This means that either all
the operations within the transaction are completed successfully, or none are. If
any part of the transaction fails, the entire transaction is rolled back, ensuring
that the database remains in a consistent state.
2. Consistency
 The database must move from one valid state to another. Any data written to
the database must adhere to all predefined rules, constraints, and triggers. A
transaction should ensure that the integrity of the database is maintained,
leaving the database in a consistent state both before and after the transaction.
3. Isolation
 Transactions should occur independently without interference. The operations
of one transaction should not be visible to other transactions until it is
complete. Isolation prevents concurrent transactions from affecting each other's
execution, thus avoiding issues like dirty reads or lost updates.
4. Durability
 Once a transaction is committed, the changes made are permanent, even in the
event of a system crash or power failure. Durability ensures that the database
can recover to a consistent state using backup and recovery mechanisms.

Let us look at the ACID principles in the context of a relational database system,
using a real-world example of a banking application and typical relational databases
such as MySQL, PostgreSQL, Oracle, or Microsoft SQL Server. Suppose we want to
transfer $100 from Account A to Account B in a bank's database. Here is how each
ACID principle applies:
1. Atomicity
 Relational Database Example: In a relational database like MySQL or
PostgreSQL, a transaction is initiated using the BEGIN statement and
concluded with a COMMIT or ROLLBACK. Atomicity ensures that all SQL
operations within a transaction succeed or fail as a whole.

 SQL Example:
sql
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B';
COMMIT;
 Explanation: If an error occurs (e.g., due to a database connection issue or
constraint violation), a ROLLBACK will revert all changes, leaving the
database unaffected. If Account A’s balance is reduced but Account B’s balance
is not increased, the transaction will be rolled back to prevent data
inconsistency.
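The same atomicity behaviour can be sketched from application code. Below is a minimal Python example using the standard-library sqlite3 module; the accounts table mirrors the hypothetical table from the SQL example above and lives in a throw-away in-memory database.

python
import sqlite3

conn = sqlite3.connect(":memory:")   # throw-away in-memory database
conn.execute("CREATE TABLE accounts (account_id TEXT PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES ('A', 500), ('B', 200)")
conn.commit()

try:
    with conn:   # commits if the block succeeds, rolls back if any statement fails
        conn.execute("UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A'")
        conn.execute("UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B'")
except sqlite3.Error:
    print("Transfer failed; both updates were rolled back together")

If the second UPDATE raised an error, the context manager would roll back the first one as well, so the database never shows a state where money has left Account A without arriving in Account B.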
2. Consistency
 Relational Database Example: Databases like Oracle or SQL Server enforce
consistency by applying constraints, triggers, and rules during transactions. If a
rule is violated, such as a CHECK constraint ensuring that an account balance
cannot go below zero, the database will reject the transaction.
 SQL Example:
sql
ALTER TABLE accounts ADD CONSTRAINT balance_check CHECK (balance >=
0);
 Explanation: If Account A has only $50 and the transaction tries to deduct
$100, the CHECK constraint will prevent the operation, maintaining the
database's consistent state by enforcing that no account goes into a negative
balance.
3. Isolation
 Relational Database Example: Relational databases support different
isolation levels (e.g., Read Uncommitted, Read Committed, Repeatable
Read, and Serializable) to control how transactions interact with each other.
For example, in PostgreSQL the default isolation level is Read Committed, which
guarantees that any data read was committed at the moment it was read.
 SQL Example:
sql
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B';
COMMIT;
 Explanation: By setting the isolation level to SERIALIZABLE, the database
ensures that the current transaction behaves as if no other transactions are
running concurrently, thus avoiding issues like dirty reads, non-repeatable
reads, or phantom reads.
4. Durability
 Relational Database Example: In systems like MySQL, durability is achieved
by writing transaction logs to disk before making changes to the actual
database. These logs ensure that even if the database server crashes, committed
transactions can be recovered.
 Explanation: After the COMMIT operation in a relational database like SQL
Server, the changes are stored in a transaction log. If the system crashes, the log
can be used to restore the last committed state of the database, ensuring that no
data is lost.

Q) CAP theorem
The CAP Theorem is a fundamental principle in distributed systems, formulated by
Eric Brewer in 2000. It states that a distributed system can provide only two of
the following three guarantees simultaneously:
1. Consistency (C): Every read receives the most recent write or an error. This
means all nodes in the system will return the same data at any given time.
2. Availability (A): Every request (read or write) receives a non-error response,
even if it does not contain the most recent data. In other words, the system
always stays operational and responsive.
3. Partition Tolerance (P): The system continues to operate, even if there is a
partition (communication breakdown) between nodes. This means the system
can handle network failures where some parts of the system can't communicate
with others.
The Core Idea
 In a distributed system, network partitions are inevitable. Therefore, systems
must always be partition-tolerant. Given this constraint, they must choose
between Consistency and Availability when a partition occurs:
o CP Systems: Prioritize Consistency over Availability. These systems
will reject requests (making them unavailable) to ensure that all nodes
have the same data.
 Example: HBase, MongoDB (with certain configurations)
o AP Systems: Prioritize Availability over Consistency. These systems
will return the data they have, even if it may not be the most recent.
 Example: Cassandra, CouchDB
The CAP Theorem has significant practical implications for designing, building, and
operating distributed systems. Here's how it influences various types of systems and
architectures, along with examples:
1. CP Systems (Consistency + Partition Tolerance)
 Characteristics:
o Prioritize data accuracy over availability.
o During a network partition, some parts of the system may become
unavailable to ensure consistency.
 Use Cases:
o Systems where data accuracy is critical, like financial services, banking
systems, and inventory management, where having accurate, consistent
data is more important than high availability.
2. AP Systems (Availability + Partition Tolerance)
 Characteristics:
o Prioritize uptime and responsiveness.
o Provide data even during network partitions, but the data may not be the
most recent.
o Achieve eventual consistency, where the system becomes consistent
over time after the partition is resolved.
 Use Cases:
o Systems where uptime is crucial, like social networks, caching systems,
or services where temporary data inconsistency is acceptable.
3. CA Systems (Consistency + Availability)
 Characteristics:
o Prioritize both consistency and availability.
o Cannot tolerate network partitions, meaning it assumes a reliable
network.
 Use Cases:
o Centralized databases or systems in a controlled environment where
partitions are unlikely.
o Ideal for single-node databases or systems within a highly controlled
data center environment.
 The CAP theorem states that distributed databases can have at most two of
the three properties: consistency, availability, and partition tolerance. As a
result, database systems prioritize only two properties at a time.
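In practice, many NoSQL systems let the application tune this consistency-versus-availability trade-off per request. As a hedged illustration, the DataStax Python driver for Cassandra lets a consistency level be attached to each query; the keyspace and table names below are assumptions made for the example.

python
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])    # assumed local single-node cluster
session = cluster.connect("bank")   # assumed keyspace

# QUORUM favours consistency: a majority of replicas must agree before answering.
strict_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE account_id = 'A'",
    consistency_level=ConsistencyLevel.QUORUM,
)

# ONE favours availability: any single replica may answer, possibly with stale data.
fast_read = SimpleStatement(
    "SELECT balance FROM accounts WHERE account_id = 'A'",
    consistency_level=ConsistencyLevel.ONE,
)

row = session.execute(strict_read).one()

Raising the consistency level pushes the system toward CP behaviour (requests may fail during a partition), while lowering it pushes toward AP behaviour (requests succeed but may return stale data).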

Q) Base principle of NoSQL databases

NoSQL databases are designed to handle a wide variety of data models and to
support large-scale, high-performance, and flexible storage and
retrieval. The core principles of NoSQL databases are:
1. Schema Flexibility: Unlike traditional relational databases, NoSQL databases
are schema-less or have a flexible schema, which allows for storing
unstructured, semi-structured, and structured data. This flexibility enables rapid
adaptation to changing data requirements.
2. Scalability: NoSQL databases are often horizontally scalable, meaning they
can be distributed across multiple servers or nodes to manage growing data
volumes. This approach, known as "sharding," allows NoSQL systems to
handle large datasets and heavy read/write loads.
3. Data Model Variety: NoSQL databases offer multiple data models suited for
different types of data and use cases:
o Document-based (e.g., MongoDB) stores data as JSON-like
documents.
o Key-value (e.g., Redis) uses simple key-value pairs for quick retrieval.
o Column-family (e.g., Cassandra) organizes data by column rather than
by row.
o Graph-based (e.g., Neo4j) represents data as nodes and edges, ideal for
relationships and connections.
4. Eventual Consistency: Many NoSQL databases favor eventual consistency
over strict transactional consistency. This approach allows for temporary
inconsistencies to improve performance and scalability but will eventually
achieve a consistent state.
5. High Availability and Fault Tolerance: NoSQL databases often replicate data
across multiple nodes, so if one node fails, data can still be accessed from other
nodes. This replication model enhances availability and resilience.
These principles make NoSQL databases well-suited for use cases where data
structure is dynamic, performance needs are high, and scalability is a
priority, such as in real-time applications, social networks, and big
data analytics.

Q) Types of NoSQL databases


NoSQL databases are categorized based on their data model. The main types of
NoSQL databases are:
1. Document Databases
o Store data in JSON-like documents, making them highly flexible and
schema-less.
o Each document contains pairs of fields and values, and documents can
vary in structure.
o Commonly used for content management systems, catalogs, and user
profiles.
Examples: MongoDB, Couchbase, Amazon DocumentDB.
2. Key-Value Stores
o Use a simple key-value pair system for storing data, with each key
mapping to a unique value.
o Useful for applications requiring fast, low-latency data retrieval.
o Commonly used for caching, session storage, and real-time data
management.
Examples: Redis, Amazon DynamoDB, Riak.
3. Column-Family Stores
o Organize data into columns rather than rows, with data stored in a table
format.
o Efficient for reading and writing large volumes of data and commonly
used in analytics and data warehousing.
o Useful for time-series data and handling large-scale data.
Examples: Cassandra, HBase, ScyllaDB.
4. Graph Databases
o Designed to store data as a graph of nodes and edges, which represent
entities and their relationships.
o Ideal for applications where relationships between entities are critical,
like social networks, recommendation engines, and fraud detection.
Examples: Neo4j, ArangoDB, Amazon Neptune.
5. Time-Series Databases
o Optimized for handling time-stamped data, making them ideal for IoT,
monitoring, and real-time analytics.
o Efficient in storing, retrieving, and analyzing time-based data points.
Examples: InfluxDB, TimescaleDB, OpenTSDB.
6. Object-Oriented Databases
o Store data in objects, typically with an object-oriented programming
approach.
o Useful for applications requiring complex data structures and direct
representation of objects in code.
Examples: Db4o, ObjectDB
(OR)

A database is a collection of structured data or information that is stored in a
computer system and can be accessed easily. A database is usually managed by a
Database Management System (DBMS).
NoSQL is a non-relational database used to store data in a non-tabular form. NoSQL
stands for "Not only SQL". The main types are document, key-value, wide-column,
and graph databases.
Types of NoSQL Database:
 Document-based databases
 Key-value stores
 Column-oriented databases
 Graph-based databases

Document-Based Database:
A document-based database is a non-relational database. Instead of storing data in
rows and columns (tables), it uses documents to store data, typically in JSON, BSON,
or XML format.
Documents can be stored and retrieved in a form that is much closer to the data
objects used in applications, which means less translation is required to use the data
in applications. In a document database, particular elements can be indexed for faster
querying.
Collections are groups of documents with similar contents. Documents within a
collection do not all need to share the same schema, because document databases
have a flexible schema.
Key features of documents database:
 Flexible schema: Documents in the database have a flexible schema, meaning
documents in the same database need not follow the same schema.
 Faster creation and maintenance: Creating a document is easy, and minimal
maintenance is required once it is created.
 No foreign keys: There is no rigid relationship between two documents, so
documents can be independent of one another and no foreign keys are required.
 Open formats: Documents are built using open formats such as XML and JSON.
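As a hedged sketch of the document model described above, the following uses the pymongo driver; the connection string, database name, and document fields are assumptions made for the example.

python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local MongoDB instance
db = client["shop"]                                 # assumed database name

# Documents in the same collection may have different fields (flexible schema).
db.products.insert_one({"name": "Laptop", "price": 900, "tags": ["electronics"]})
db.products.insert_one({"name": "Notebook", "price": 2, "pages": 200})

# Query by field value; matching JSON-like documents are returned as Python dicts.
for doc in db.products.find({"price": {"$lt": 100}}):
    print(doc)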
Key-Value Stores:
A key-value store is a nonrelational database. The simplest form of a NoSQL database
is a key-value store. Every data element in the database is stored in key-value pairs.
The data can be retrieved by using a unique key allotted to each element in the
database. The values can be simple data types like strings and numbers or complex
objects.
A key-value store is like a relational database with only two columns: the key and
the value.
Key features of the key-value store:
 Simplicity.
 Scalability.
 Speed.
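A hedged sketch of the key-value model using the redis-py client; the key names and the time-to-live value are examples only.

python
import redis

r = redis.Redis(host="localhost", port=6379, db=0)   # assumed local Redis server

# Store and fetch simple key-value pairs.
r.set("session:1001", "user_42")
print(r.get("session:1001"))   # b'user_42'

# Keys can be given a time-to-live, which suits caching and session storage.
r.setex("cache:homepage", 60, "<html>...</html>")    # expires after 60 seconds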
Column Oriented Databases:
A column-oriented database is a non-relational database that stores data in columns
instead of rows. This means that when we want to run analytics on a small number of
columns, we can read those columns directly without loading unwanted data into
memory.
Columnar databases are designed to read and retrieve data more efficiently and with
greater speed, and they are used to store large amounts of data.
Key features of a column-oriented database:
 Scalability.
 Compression.
 Very responsive.
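To illustrate the column-family layout, here is a hedged sketch using Cassandra's CQL through the DataStax Python driver; the keyspace, table, and sensor names are assumptions chosen to show a typical time-series design.

python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()   # assumed local Cassandra node

# All readings for one sensor on one day share a partition, so a time-range
# scan touches a single, contiguous slice instead of scattered rows.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.readings (
        sensor_id text, day date, ts timestamp, value double,
        PRIMARY KEY ((sensor_id, day), ts)
    )
""")
session.execute(
    "INSERT INTO metrics.readings (sensor_id, day, ts, value) "
    "VALUES ('sensor-1', '2024-01-01', toTimestamp(now()), 21.5)"
)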
Graph-Based databases:
Graph-based databases focus on the relationships between elements. They store data
in the form of nodes, and the connections between the nodes are called links or
relationships.
Key features of graph database:
 In a graph-based database, it is easy to identify the relationship between the
data by using the links.
 Query output is available in real time.
 The speed depends upon the number of relationships among the database
elements.
 Updating data is also easy, as adding a new node or edge to a graph database is
a straightforward task that does not require significant schema changes.
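A hedged sketch of the graph model using the official neo4j Python driver; the connection URI, credentials, node labels, and relationship type are assumptions for the example.

python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Nodes represent entities; the FRIENDS_WITH edge is the relationship between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Alice", b="Bob",
    )
    # Traverse relationships directly instead of joining tables.
    result = session.run(
        "MATCH (p:Person {name: $a})-[:FRIENDS_WITH]->(friend) RETURN friend.name AS name",
        a="Alice",
    )
    print([record["name"] for record in result])

driver.close()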

Q) case study on disease diagnosis and profiling

Case Study: Early Diagnosis and Profiling for Diabetes and Cardiovascular
Diseases Using Machine Learning
Background
Chronic diseases like diabetes and cardiovascular diseases (CVDs) pose a
significant health burden worldwide. These diseases are often interrelated, and
early diagnosis can play a crucial role in improving patient outcomes. Traditional
diagnosis approaches rely on symptoms, lab tests, and physician expertise, which
can sometimes result in delayed detection. This study explores a data-driven
approach using machine learning for early detection and profiling of diabetes
and CVDs, aiming to offer personalized risk profiles and improve prediction
accuracy.
Objectives
1. Early Detection: Develop machine learning models to diagnose diabetes and
CVDs based on clinical data and patient history, focusing on detecting risk
before symptoms become severe.
2. Risk Profiling: Profile patients based on demographic, clinical, and lifestyle
features to identify those at high risk.
3. Personalized Treatment Pathways: Use profiling data to offer personalized
health recommendations for high-risk patients, improving adherence to
preventive care measures.
Data Collection and Preprocessing
The data used for this study was collected from electronic health records (EHRs)
across multiple hospitals and includes:
 Demographic Information: Age, gender, ethnicity.
 Lifestyle Factors: Smoking status, alcohol consumption, exercise frequency.
 Clinical Measurements: Blood pressure, cholesterol levels, BMI, blood sugar
levels.
 Medical History: Previous diagnoses, medications, family history of diseases.
To ensure data quality:
1. Data Cleaning: Removed duplicate entries, handled missing values using
mean imputation for continuous variables and mode imputation for categorical
variables.
2. Normalization: Normalized clinical measurement data to maintain uniformity
across records.

3. Feature Engineering: Created new features, such as BMI categories, exercise
frequency levels, and average blood pressure readings (a pandas sketch of these
steps follows below).
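The preprocessing steps above can be sketched with pandas and scikit-learn; the file name and column names are assumptions, since the study's actual schema is not given.

python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical EHR extract; file and column names are assumed for illustration.
df = pd.read_csv("ehr_records.csv")

# Data cleaning: remove duplicates, impute missing values.
df = df.drop_duplicates()
df["bmi"] = df["bmi"].fillna(df["bmi"].mean())                # mean imputation (continuous)
df["smoking_status"] = df["smoking_status"].fillna(
    df["smoking_status"].mode()[0])                           # mode imputation (categorical)

# Feature engineering on the raw values, e.g. BMI categories.
df["bmi_category"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, 100],
                            labels=["underweight", "normal", "overweight", "obese"])

# Normalization of the clinical measurements so they share a common scale.
clinical_cols = ["systolic_bp", "cholesterol", "bmi", "blood_sugar"]
df[clinical_cols] = StandardScaler().fit_transform(df[clinical_cols])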
Methodology
The study involved several machine learning models to predict diabetes and CVD
risk:
1. Logistic Regression for baseline risk scoring.
2. Random Forest and XGBoost for robust, high-performance predictions, given
the complex relationships among features.
3. Neural Network model to capture non-linear relationships in the data,
particularly useful for chronic diseases with multifactorial causes.
The data was split into training and testing sets, with cross-validation to prevent
overfitting. Performance metrics used include accuracy, precision, recall, and F1
score, focusing on reducing false negatives to avoid missed diagnoses.
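A hedged sketch of the modelling and evaluation step using scikit-learn; a Random Forest stands in for the full set of models, and a synthetic dataset replaces the real EHR features so the snippet stays self-contained and runnable.

python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

# Synthetic stand-in for the engineered feature matrix X and diagnosis labels y.
X, y = make_classification(n_samples=1000, n_features=12, weights=[0.8, 0.2],
                           random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=300, random_state=42)

# Cross-validation on the training set guards against overfitting; recall is
# emphasised to keep false negatives (missed diagnoses) low.
print(cross_val_score(model, X_train, y_train, cv=5, scoring="recall").mean())

model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))   # precision, recall, F1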
Results and Analysis
1. Model Performance: The XGBoost model performed best with an accuracy of
92% for diabetes diagnosis and 89% for CVD diagnosis. Precision and recall
scores were also high, indicating reliable performance.
2. Risk Profiling: Patients were grouped into high, medium, and low-risk
categories based on model outputs and risk factors. Key risk factors identified
included age, high BMI, and sedentary lifestyle for both diseases.
3. Personalized Recommendations: High-risk patients were given specific
lifestyle recommendations, such as diet modifications, exercise plans, and
regular monitoring. The model’s interpretability helped physicians understand
the most influential factors for each patient.
Discussion
The machine learning models were able to provide an early and accurate prediction
of diabetes and CVD risk. This approach outperformed traditional statistical
methods, especially in identifying high-risk individuals who may not yet exhibit
symptoms. The integration of risk profiling into EHRs allowed healthcare providers
to offer personalized care recommendations, enhancing patient engagement in
preventive care.

Key Takeaways

1. Enhanced Prediction Accuracy: Machine learning models effectively
identified individuals at risk of diabetes and CVD, supporting early diagnosis.
2. Customizable Risk Profiles: Profiling enabled healthcare providers to
understand individual risk factors and offer targeted interventions.
3. Improved Patient Outcomes: Personalized treatment pathways encouraged
adherence to preventive measures, potentially reducing future healthcare costs
and improving patient quality of life.

