Unit III
YARN (Yet Another Resource Negotiator): This is the resource management
component of Hadoop, which manages the allocation of resources (such as
CPU and memory) for processing the data stored in HDFS.
Hadoop also includes several modules that provide additional functionality, such as Hive (a SQL-like query language), Pig (a high-level platform for creating MapReduce programs), and HBase (a non-relational, distributed database).
Hadoop is commonly used in big data scenarios such as data warehousing,
business intelligence, and machine learning. It’s also used for data processing,
data analysis, and data mining. It enables the distributed processing of large
data sets across clusters of computers using a simple programming model.
Features of Hadoop:
1. It is fault tolerant.
2. It is highly available.
3. Its programming model is simple.
4. It provides huge, flexible storage.
5. It is low cost.
Hadoop has several key features that make it well-suited for big data processing:
Distributed Storage: Hadoop stores large data sets across multiple machines,
allowing for the storage and processing of extremely large amounts of data.
Scalability: Hadoop can scale from a single server to thousands of machines,
making it easy to add more capacity as needed.
Fault-Tolerance: Hadoop is designed to be highly fault-tolerant, meaning it
can continue to operate even in the presence of hardware failures.
Data Locality: Hadoop schedules processing on the nodes where the data already resides. This reduces network traffic and improves performance.
High Availability: Hadoop provides a high-availability feature that helps ensure the data is always accessible and is not lost.
Flexible Data Processing: Hadoop's MapReduce programming model allows data to be processed in a distributed fashion, making it easy to implement a wide variety of data processing tasks (a short sketch follows this feature list).
Data Integrity: Hadoop provides a built-in checksum mechanism, which helps ensure that stored data is consistent and correct.
Data Replication: Hadoop replicates data across the cluster, which provides fault tolerance.
Data Compression: Hadoop provides built-in data compression, which reduces storage space and improves performance.
YARN: A resource-management platform that allows multiple data processing engines, such as real-time streaming, batch processing, and interactive SQL, to run and process data stored in HDFS.
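To make the Flexible Data Processing feature concrete, here is a minimal word-count job written for Hadoop Streaming in Python. The script names (mapper.py, reducer.py) and the sample command are illustrative assumptions, not part of the original notes; treat this as a sketch rather than a production job.
Python sketch (illustrative):
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word read from standard input.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- sums the counts per word; Hadoop sorts mapper output by key,
# so all lines for the same word arrive at the reducer consecutively.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

The job would typically be submitted with the Hadoop Streaming jar, for example: hadoop jar hadoop-streaming.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py (the paths are placeholders).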
Q) ACID principle of relational databases
The ACID principle refers to a set of properties that ensure reliable processing in a
relational database. ACID stands for Atomicity, Consistency, Isolation, and
Durability, each representing an essential property for database transactions:
1. Atomicity
A transaction is treated as a single, indivisible unit. This means that either all
the operations within the transaction are completed successfully, or none are. If
any part of the transaction fails, the entire transaction is rolled back, ensuring
that the database remains in a consistent state.
2. Consistency
The database must move from one valid state to another. Any data written to
the database must adhere to all predefined rules, constraints, and triggers. A
transaction should ensure that the integrity of the database is maintained,
leaving the database in a consistent state both before and after the transaction.
3. Isolation
Transactions should occur independently without interference. The operations
of one transaction should not be visible to other transactions until it is
complete. Isolation prevents concurrent transactions from affecting each other's
execution, thus avoiding issues like dirty reads or lost updates.
4. Durability
Once a transaction is committed, the changes made are permanent, even in the
event of a system crash or power failure. Durability ensures that the database
can recover to a consistent state using backup and recovery mechanisms.
Consider the ACID principles in the context of a relational database system, using a real-world example: a banking application on a typical relational database such as MySQL, PostgreSQL, Oracle, or Microsoft SQL Server. Suppose we want to transfer $100 from Account A to Account B in the bank's database. Here's how each ACID principle applies:
1. Atomicity
Relational Database Example: In a relational database like MySQL or
PostgreSQL, a transaction is initiated using the BEGIN statement and
concluded with a COMMIT or ROLLBACK. Atomicity ensures that all SQL
operations within a transaction succeed or fail as a whole.
SQL Example:
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B';
COMMIT;
Explanation: If an error occurs (e.g., due to a database connection issue or
constraint violation), a ROLLBACK will revert all changes, leaving the
database unaffected. If Account A’s balance is reduced but Account B’s balance
is not increased, the transaction will be rolled back to prevent data
inconsistency.
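The same all-or-nothing behaviour can be seen from application code. Below is a minimal Python sketch using the standard-library sqlite3 module (chosen only for illustration; the original example targets MySQL/PostgreSQL). If either UPDATE fails, the explicit rollback leaves both balances unchanged.
Python sketch (illustrative):
import sqlite3

conn = sqlite3.connect("bank.db")  # illustrative database file
try:
    cur = conn.cursor()
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A'")
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B'")
    conn.commit()    # both updates become visible together
except sqlite3.Error:
    conn.rollback()  # any failure undoes both updates (atomicity)
    raise
finally:
    conn.close()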
2. Consistency
Relational Database Example: Databases like Oracle or SQL Server enforce
consistency by applying constraints, triggers, and rules during transactions. If a
rule is violated, such as a CHECK constraint ensuring that an account balance
cannot go below zero, the database will reject the transaction.
SQL Example:
ALTER TABLE accounts ADD CONSTRAINT balance_check CHECK (balance >= 0);
Explanation: If Account A has only $50 and the transaction tries to deduct
$100, the CHECK constraint will prevent the operation, maintaining the
database's consistent state by enforcing that no account goes into a negative
balance.
3. Isolation
Relational Database Example: Relational databases support different
isolation levels (e.g., Read Uncommitted, Read Committed, Repeatable
Read, and Serializable) to control how transactions interact with each other.
For example, in PostgreSQL, the default isolation level is Read Committed, which ensures that each statement sees only data that was committed before that statement began.
SQL Example:
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN;
UPDATE accounts SET balance = balance - 100 WHERE account_id = 'A';
UPDATE accounts SET balance = balance + 100 WHERE account_id = 'B';
COMMIT;
Explanation: By setting the isolation level to SERIALIZABLE, the database
ensures that the current transaction behaves as if no other transactions are
running concurrently, thus avoiding issues like dirty reads, non-repeatable
reads, or phantom reads.
4. Durability
Relational Database Example: In systems like MySQL, durability is achieved
by writing transaction logs to disk before making changes to the actual
database. These logs ensure that even if the database server crashes, committed
transactions can be recovered.
Explanation: After the COMMIT operation in a relational database like SQL
Server, the changes are stored in a transaction log. If the system crashes, the log
can be used to restore the last committed state of the database, ensuring that no
data is lost.
The CAP Theorem is a fundamental principle in distributed systems, formulated by
Eric Brewer in 2000. It states that a distributed system can only provide two out of
the following three guarantees simultaneously:
1. Consistency (C): Every read receives the most recent write or an error. This
means all nodes in the system will return the same data at any given time.
2. Availability (A): Every request (read or write) receives a response, even if it's
not the most recent data. This means the system remains operational 100% of
the time.
3. Partition Tolerance (P): The system continues to operate, even if there is a
partition (communication breakdown) between nodes. This means the system
can handle network failures where some parts of the system can't communicate
with others.
The Core Idea
In a distributed system, network partitions are inevitable. Therefore, systems
must always be partition-tolerant. Given this constraint, they must choose
between Consistency and Availability when a partition occurs:
o CP Systems: Prioritize Consistency over Availability. These systems
will reject requests (making them unavailable) to ensure that all nodes
have the same data.
Example: HBase, MongoDB (with certain configurations)
o AP Systems: Prioritize Availability over Consistency. These systems
will return the data they have, even if it may not be the most recent.
Example: Cassandra, CouchDB
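As a rough illustration of the CP-versus-AP choice, the toy Python sketch below models a replica during a network partition: the CP-style read refuses to answer when it cannot guarantee the latest value, while the AP-style read always answers, possibly with stale data. This is a simplified thought experiment, not how any real database is implemented.
Python sketch (illustrative):
class Replica:
    def __init__(self):
        self.value = "v1"         # last value this replica has seen
        self.partitioned = False  # True = cut off from the other replicas

def cp_read(replica):
    # CP behaviour: refuse to serve a read that might be stale.
    if replica.partitioned:
        raise RuntimeError("unavailable: cannot guarantee the latest value")
    return replica.value

def ap_read(replica):
    # AP behaviour: always answer, even if the value may be out of date.
    return replica.value

node = Replica()
node.partitioned = True   # simulate a network partition
print(ap_read(node))      # "v1" -- available, but possibly stale
try:
    cp_read(node)
except RuntimeError as err:
    print(err)            # consistent, but unavailable during the partition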
The CAP Theorem has significant practical implications for designing, building, and
operating distributed systems. Here's how it influences various types of systems and
architectures, along with examples:
1. CP Systems (Consistency + Partition Tolerance)
Characteristics:
o Prioritize data accuracy over availability.
o During a network partition, some parts of the system may become
unavailable to ensure consistency.
Use Cases:
o Systems where data accuracy is critical, like financial services, banking
systems, and inventory management, where having accurate, consistent
data is more important than high availability.
2. AP Systems (Availability + Partition Tolerance)
Characteristics:
o Prioritize uptime and responsiveness.
o Provide data even during network partitions, but the data may not be the
most recent.
o Achieve eventual consistency, where the system becomes consistent
over time after the partition is resolved.
Use Cases:
o Systems where uptime is crucial, like social networks, caching systems,
or services where temporary data inconsistency is acceptable.
3. CA Systems (Consistency + Availability)
Characteristics:
o Prioritize both consistency and availability.
o Cannot tolerate network partitions, meaning it assumes a reliable
network.
Use Cases:
o Centralized databases or systems in a controlled environment where
partitions are unlikely.
o Ideal for single-node databases or systems within a highly controlled
data center environment.
The CAP theorem states that distributed databases can have at most two of
the three properties: consistency, availability, and partition tolerance. As a
result, database systems prioritize only two properties at a time.
NoSQL databases are designed to handle a wide variety of data models and to
support large-scale, high-performance, and flexible storage and
retrieval. The core principles of NoSQL databases are:
1. Schema Flexibility: Unlike traditional relational databases, NoSQL databases
are schema-less or have a flexible schema, which allows for storing
unstructured, semi-structured, and structured data. This flexibility enables rapid
adaptation to changing data requirements.
2. Scalability: NoSQL databases are often horizontally scalable, meaning they can be distributed across multiple servers or nodes to manage growing data volumes. This approach, known as "sharding," allows NoSQL systems to handle large datasets and heavy read/write loads (a toy sketch follows this list).
3. Data Model Variety: NoSQL databases offer multiple data models suited for
different types of data and use cases:
o Document-based (e.g., MongoDB) stores data as JSON-like
documents.
o Key-value (e.g., Redis) uses simple key-value pairs for quick retrieval.
o Column-family (e.g., Cassandra) organizes data by column rather than
by row.
o Graph-based (e.g., Neo4j) represents data as nodes and edges, ideal for
relationships and connections.
4. Eventual Consistency: Many NoSQL databases favor eventual consistency
over strict transactional consistency. This approach allows for temporary
inconsistencies to improve performance and scalability but will eventually
achieve a consistent state.
5. High Availability and Fault Tolerance: NoSQL databases often replicate data
across multiple nodes, so if one node fails, data can still be accessed from other
nodes. This replication model enhances availability and resilience.
These principles make NoSQL databases well-suited for use cases where data
structure is dynamic, performance needs are high, and scalability is a
priority, such as in real-time applications, social networks, and big
data analytics.
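The sharding idea from principle 2 can be sketched with a toy hash-based router in Python: each key is hashed and assigned to one of a fixed set of shards, so data and load are spread across nodes. Real systems use more sophisticated schemes (for example, consistent hashing), so this is only an illustration.
Python sketch (illustrative):
import hashlib

SHARDS = ["shard-0", "shard-1", "shard-2"]  # illustrative node names

def shard_for(key: str) -> str:
    # Hash the key and map it onto one of the shards.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

for user_id in ["user:1", "user:2", "user:42"]:
    print(user_id, "->", shard_for(user_id))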
Types of NoSQL Databases:
1. Document Databases
o Store data in JSON-like documents, making them highly flexible and
schema-less.
o Each document contains pairs of fields and values, and documents can
vary in structure.
o Commonly used for content management systems, catalogs, and user
profiles.
Examples: MongoDB, Couchbase, Amazon DocumentDB.
2. Key-Value Stores
o Use a simple key-value pair system for storing data, with each key
mapping to a unique value.
o Useful for applications requiring fast, low-latency data retrieval.
o Commonly used for caching, session storage, and real-time data
management.
Examples: Redis, Amazon DynamoDB, Riak.
3. Column-Family Stores
o Organize data into columns rather than rows, with data stored in a table
format.
o Efficient for reading and writing large volumes of data and commonly
used in analytics and data warehousing.
o Useful for time-series data and handling large-scale data.
Examples: Cassandra, HBase, ScyllaDB.
4. Graph Databases
o Designed to store data as a graph of nodes and edges, which represent
entities and their relationships.
o Ideal for applications where relationships between entities are critical,
like social networks, recommendation engines, and fraud detection.
Examples: Neo4j, ArangoDB, Amazon Neptune.
5. Time-Series Databases
o Optimized for handling time-stamped data, making them ideal for IoT,
monitoring, and real-time analytics.
o Efficient in storing, retrieving, and analyzing time-based data points.
Examples: InfluxDB, TimescaleDB, OpenTSDB.
6. Object-Oriented Databases
o Store data in objects, typically with an object-oriented programming
approach.
o Useful for applications requiring complex data structures and direct
representation of objects in code.
Examples: Db4o, ObjectDB
(OR)
Document-Based Database:
A document-based database is a nonrelational database. Instead of storing data in rows and columns (tables), it stores the data as documents, typically in JSON, BSON, or XML format.
Documents can be stored and retrieved in a form that is much closer to the data objects used in applications, which means less translation is required to use the data in an application. Particular elements of a document can be indexed for faster querying.
Collections are groups of documents with related content. Documents in the same collection are not required to share a schema, because document databases have a flexible schema.
Key features of document databases:
Flexible schema: Documents in the database have a flexible schema, meaning documents need not all follow the same schema.
Faster creation and maintenance: Creating documents is easy, and minimal maintenance is required once a document is created.
No foreign keys: Documents are independent of one another, with no enforced relationships between them, so foreign keys are not required.
Open formats: Documents are built with open formats such as XML, JSON, and BSON.
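A short sketch with the pymongo driver illustrates the flexible schema described above: two documents with different fields can live in the same collection. The database and collection names and the local connection string are assumptions for illustration, and a running MongoDB server is assumed.
Python sketch (illustrative):
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB
catalog = client["shop"]["products"]               # illustrative names

# Two documents with different structures in the same collection.
catalog.insert_one({"name": "Laptop", "price": 900, "specs": {"ram_gb": 16}})
catalog.insert_one({"name": "T-shirt", "price": 15, "sizes": ["S", "M", "L"]})

# Query by a field; only documents that contain it are matched.
for doc in catalog.find({"price": {"$lt": 100}}):
    print(doc["name"])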
Key-Value Stores:
A key-value store is a nonrelational database. The simplest form of a NoSQL database
is a key-value store. Every data element in the database is stored in key-value pairs.
The data can be retrieved by using a unique key allotted to each element in the
database. The values can be simple data types like strings and numbers or complex
objects.
A key-value store is like a relational database with only two columns: the key and the value.
Key features of the key-value store:
Simplicity.
Scalability.
Speed.
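A key-value store can be exercised with the redis-py client in a few lines. The session key used below is invented for illustration, and the code assumes a Redis server on localhost.
Python sketch (illustrative):
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store and read back a value by its key.
r.set("session:42", "alice")
print(r.get("session:42"))  # -> "alice"

# Keys can be given a time-to-live, which suits caching and session storage.
r.set("session:42", "alice", ex=3600)  # expires after one hour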
Column-Oriented Databases:
A column-oriented database is a non-relational database that stores data in columns instead of rows. When analytics need only a small number of columns, those columns can be read directly without loading unwanted data into memory.
Columnar databases are designed to read and retrieve data efficiently and at high speed, and they are used to store large amounts of data.
Key features of column-oriented databases:
Scalability.
Compression.
Very responsive.
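The column-family model can be sketched with the DataStax cassandra-driver for Python. The keyspace, table, and contact point below are assumptions for illustration and assume a local Cassandra node.
Python sketch (illustrative):
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumes a local Cassandra node
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.cpu_usage (
        host text, ts timestamp, usage double,
        PRIMARY KEY (host, ts)
    )
""")

# Insert one reading and read back the column values for a single host.
session.execute(
    "INSERT INTO metrics.cpu_usage (host, ts, usage) VALUES (%s, toTimestamp(now()), %s)",
    ("web-1", 42.5),
)
for row in session.execute("SELECT ts, usage FROM metrics.cpu_usage WHERE host = %s", ("web-1",)):
    print(row.ts, row.usage)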
Graph-Based Databases:
Graph-based databases focus on the relationships between data elements. Data is stored as nodes, and the connections between the nodes are called links, edges, or relationships.
Key features of graph database:
In a graph-based database, it is easy to identify relationships between data items by following the links.
Query results are available in real time.
Query speed depends on the number of relationships among the database elements.
Updating data is also easy, as adding a new node or edge to a graph database is
a straightforward task that does not require significant schema changes.
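The node-and-edge model can be sketched with the official neo4j Python driver and a small Cypher query. The connection details and the FRIENDS_WITH relationship are assumptions for illustration and assume a local Neo4j server.
Python sketch (illustrative):
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # Create two nodes and an edge (relationship) between them.
    session.run(
        "MERGE (a:Person {name: $a}) "
        "MERGE (b:Person {name: $b}) "
        "MERGE (a)-[:FRIENDS_WITH]->(b)",
        a="Alice", b="Bob",
    )
    # Follow the relationship to find Alice's friends.
    result = session.run(
        "MATCH (:Person {name: $a})-[:FRIENDS_WITH]->(f) RETURN f.name AS friend",
        a="Alice",
    )
    for record in result:
        print(record["friend"])

driver.close()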
Case Study: Early Diagnosis and Profiling for Diabetes and Cardiovascular
Diseases Using Machine Learning
Background
Chronic diseases like diabetes and cardiovascular diseases (CVDs) pose a
significant health burden worldwide. These diseases are often interrelated, and
early diagnosis can play a crucial role in improving patient outcomes. Traditional
diagnosis approaches rely on symptoms, lab tests, and physician expertise, which
can sometimes result in delayed detection. This study explores a data-driven
approach using machine learning for early detection and profiling of diabetes
and CVDs, aiming to offer personalized risk profiles and improve prediction
accuracy.
Objectives
1. Early Detection: Develop machine learning models to diagnose diabetes and
CVDs based on clinical data and patient history, focusing on detecting risk
before symptoms become severe.
2. Risk Profiling: Profile patients based on demographic, clinical, and lifestyle
features to identify those at high risk.
3. Personalized Treatment Pathways: Use profiling data to offer personalized
health recommendations for high-risk patients, improving adherence to
preventive care measures.
Data Collection and Preprocessing
The data used for this study was collected from electronic health records (EHRs)
across multiple hospitals and includes:
Demographic Information: Age, gender, ethnicity.
Lifestyle Factors: Smoking status, alcohol consumption, exercise frequency.
Clinical Measurements: Blood pressure, cholesterol levels, BMI, blood sugar
levels.
Medical History: Previous diagnoses, medications, family history of diseases.
To ensure data quality:
1. Data Cleaning: Removed duplicate entries, handled missing values using
mean imputation for continuous variables and mode imputation for categorical
variables.
2. Normalization: Normalized clinical measurement data to maintain uniformity
across records.
3. Feature Engineering: Created new features, such as BMI categories, exercise
frequency levels, and average blood pressure readings.
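The cleaning, normalization, and feature-engineering steps above can be expressed as a short pandas/scikit-learn sketch. The file name and column names are placeholders standing in for the (unavailable) EHR data, so this shows the shape of the pipeline rather than the study's actual code.
Python sketch (illustrative):
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names standing in for the EHR records.
df = pd.read_csv("ehr_records.csv").drop_duplicates()

numeric_cols = ["age", "bmi", "blood_pressure", "cholesterol", "blood_sugar"]
categorical_cols = ["gender", "smoking_status", "exercise_level"]

# Mean imputation for continuous variables, mode imputation for categorical ones.
df[numeric_cols] = SimpleImputer(strategy="mean").fit_transform(df[numeric_cols])
df[categorical_cols] = SimpleImputer(strategy="most_frequent").fit_transform(df[categorical_cols])

# Feature engineering on the raw values, e.g. BMI categories.
df["bmi_category"] = pd.cut(df["bmi"], bins=[0, 18.5, 25, 30, float("inf")],
                            labels=["underweight", "normal", "overweight", "obese"])

# Normalize clinical measurements so they share a comparable scale.
df[numeric_cols] = StandardScaler().fit_transform(df[numeric_cols])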
Methodology
The study involved several machine learning models to predict diabetes and CVD
risk:
1. Logistic Regression for baseline risk scoring.
2. Random Forest and XGBoost for robust, high-performance predictions, given
the complex relationships among features.
3. Neural Network model to capture non-linear relationships in the data,
particularly useful for chronic diseases with multifactorial causes.
The data was split into training and testing sets, with cross-validation to prevent
overfitting. Performance metrics used include accuracy, precision, recall, and F1
score, focusing on reducing false negatives to avoid missed diagnoses.
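A hedged sketch of the modelling step using scikit-learn is shown below; the study also used XGBoost and a neural network, which are omitted here, and synthetic stand-in data replaces the real EHR features. Precision, recall, and F1 are reported because false negatives correspond to missed diagnoses.
Python sketch (illustrative):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the preprocessed EHR features and diagnosis labels.
X, y = make_classification(n_samples=1000, n_features=12, weights=[0.8, 0.2], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Baseline risk scoring with logistic regression, cross-validated on the training set.
baseline = LogisticRegression(max_iter=1000)
print("baseline CV accuracy:", cross_val_score(baseline, X_train, y_train, cv=5).mean())

# A more flexible model for non-linear feature interactions.
model = RandomForestClassifier(n_estimators=300, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))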
Results and Analysis
1. Model Performance: The XGBoost model performed best with an accuracy of
92% for diabetes diagnosis and 89% for CVD diagnosis. Precision and recall
scores were also high, indicating reliable performance.
2. Risk Profiling: Patients were grouped into high, medium, and low-risk
categories based on model outputs and risk factors. Key risk factors identified
included age, high BMI, and sedentary lifestyle for both diseases.
3. Personalized Recommendations: High-risk patients were given specific
lifestyle recommendations, such as diet modifications, exercise plans, and
regular monitoring. The model’s interpretability helped physicians understand
the most influential factors for each patient.
Discussion
The machine learning models were able to provide an early and accurate prediction
of diabetes and CVD risk. This approach outperformed traditional statistical
methods, especially in identifying high-risk individuals who may not yet exhibit
symptoms. The integration of risk profiling into EHRs allowed healthcare providers
to offer personalized care recommendations, enhancing patient engagement in
preventive care.
Key Takeaways
1. Enhanced Prediction Accuracy: Machine learning models effectively
identified individuals at risk of diabetes and CVD, supporting early diagnosis.
2. Customizable Risk Profiles: Profiling enabled healthcare providers to
understand individual risk factors and offer targeted interventions.
3. Improved Patient Outcomes: Personalized treatment pathways encouraged
adherence to preventive measures, potentially reducing future healthcare costs
and improving patient quality of life.