Big Data Complete Notes
Types of Data:
• Structured Data: Stored in RDBMS.
• Semi-Structured Data: Partially organized.
• Unstructured Data
Definition of Big Data:
Storage | GB to TB | TB to PB
What It Is:
• Advanced techniques to extract actionable insights from huge and diverse data.
What It Isn’t:
• Cost-effective storage.
• Real-time decisions.
• Cloud computing.
Classification of Analytics:
Data Science:
• Interdisciplinary field.
• Combines statistics, machine learning, data engineering, domain expertise.
Important Terminologies:
• Highly scalable.
• Fault-tolerant.
• Runs on commodity hardware.
• Data replication for fault recovery.
Key Advantages:
• Cost-effective.
• Handles structured, semi-structured, and unstructured data.
• Supports multiple languages (Java, Python, etc.).
• Ecosystem includes various tools for different tasks.
Versions of Hadoop:
• Hadoop 1.x: Single NameNode, scalability issues.
• Hadoop 2.x: Introduced YARN, better resource management.
• Hadoop 3.x: Erasure coding, containerization support, better performance.
Distributions:
• Cloudera, Hortonworks, MapR, Amazon EMR.
RDBMS vs Hadoop
History of Hadoop:
• Inspired by Google's papers on the Google File System (GFS) and MapReduce.
• Created by Doug Cutting and Mike Cafarella.
• Yahoo adopted and funded development.
HDFS:
• Master-slave architecture.
• NameNode (master): Holds file-system metadata (namespace and block locations).
• DataNodes (slaves): Store the actual data blocks.
• Replication factor (default = 3).
• Designed for write-once, read-many workloads.
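The effect of block storage and replication on disk usage can be sketched with a little arithmetic. This is a minimal illustration, not HDFS code: the 128 MB block size is an assumption (it is the usual default in Hadoop 2.x and 3.x, but is configurable), and hdfsFootprint is a hypothetical helper name. Only the replication factor of 3 comes from the notes above.

```javascript
// Sketch: how a file maps to HDFS blocks and replicas.
// ASSUMPTION: 128 MB block size (configurable in a real cluster).
// The replication factor of 3 is the default stated in the notes.
const BLOCK_SIZE_MB = 128;
const REPLICATION_FACTOR = 3;

// hdfsFootprint is a hypothetical helper, not part of any Hadoop API.
function hdfsFootprint(fileSizeMb, blockSizeMb = BLOCK_SIZE_MB, replication = REPLICATION_FACTOR) {
  // A file is split into fixed-size blocks; the last block may be partly filled.
  const blocks = Math.ceil(fileSizeMb / blockSizeMb);
  return {
    blocks,                                 // logical blocks tracked by the NameNode
    replicas: blocks * replication,         // block copies spread over DataNodes
    rawStorageMb: fileSizeMb * replication, // disk space consumed across the cluster
  };
}

console.log(hdfsFootprint(1024));
// { blocks: 8, replicas: 24, rawStorageMb: 3072 }
```

A 1 GB file therefore occupies about 3 GB of raw disk, which is the price paid for being able to lose a DataNode without losing data.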
Introduction:
Components:
NoSQL Databases
Introduction:
• Non-relational databases designed for horizontal scalability and flexible data models.
Types:
Advantages:
• Schema-free
• Horizontal scaling
• High performance
• Better handling of unstructured data
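Horizontal scaling usually means spreading records over several machines. One common way to do this is hash-based partitioning, sketched below. This is an illustrative assumption about how a store might place keys, not the algorithm of any particular NoSQL product; shardFor and the key names are hypothetical.

```javascript
// Sketch of hash-based partitioning (sharding).
// ASSUMPTION: a simple string hash modulo the number of shards; real systems
// use more elaborate schemes such as consistent hashing or range partitioning.
function shardFor(key, shardCount) {
  let hash = 0;
  for (const ch of String(key)) {
    // Multiply-and-add hash, kept within unsigned 32-bit range.
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0;
  }
  return hash % shardCount;
}

// Hypothetical keys distributed across 4 shards.
for (const key of ["user:1", "user:2", "user:3", "user:4"]) {
  console.log(key, "-> shard", shardFor(key, 4));
}
```

Because the shard is computed from the key alone, any node can route a request without a central lookup, and capacity grows by adding shards rather than buying a bigger server.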
Use in Industry:
UNIT – IV: MONGODB
Necessity of MongoDB
• High availability and scalability
• Schema flexibility
• Rich querying and indexing capabilities
MongoDB    | RDBMS
Document   | Row
Collection | Table
Field      | Column
Index      | Index
Datatypes in MongoDB
• String, Integer, Double, Boolean
• Array
• ObjectId
• Embedded documents
• Null, Date
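The types listed above can be seen together in a single document. The sketch below uses a plain JavaScript object as a stand-in; every field name and value is hypothetical, and the _id is written as a hex string because ObjectId only exists inside the MongoDB shell or a driver.

```javascript
// Hypothetical user document illustrating the data types listed above.
// ASSUMPTION: plain JavaScript stand-ins. JavaScript has a single number type,
// whereas MongoDB's BSON format distinguishes integers from doubles.
const userDoc = {
  _id: "507f1f77bcf86cd799439011",             // stand-in for an ObjectId (24 hex characters)
  name: "Alice",                               // String
  age: 25,                                     // Integer
  rating: 4.5,                                 // Double
  active: true,                                // Boolean
  tags: ["admin", "beta"],                     // Array
  address: { city: "Pune", zip: "411001" },    // Embedded document
  nickname: null,                              // Null
  createdAt: new Date("2024-01-01T00:00:00Z"), // Date
};

// Nested fields are reached with ordinary property access.
console.log(userDoc.address.city, userDoc.tags.length);
```

Keeping the address inside the user document, rather than in a separate table joined by key, is the main structural difference from the row-and-column model.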
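// Insert one document into the users collection
> db.users.insertOne({name: "Alice", age: 25});
// Find all users whose age is greater than 20
> db.users.find({age: {$gt: 20}});
// Update the first document named "Alice", setting age to 26
> db.users.updateOne({name: "Alice"}, {$set: {age: 26}});
// Delete every document named "Alice"
> db.users.deleteMany({name: "Alice"});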
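To make the meaning of the $gt filter and the $set update concrete, the sketch below imitates them on an in-memory array. It is only an illustration of the semantics, not how MongoDB evaluates queries; matchesGt and applySet are hypothetical helper names, and the second user is made-up sample data.

```javascript
// In-memory sketch of the query semantics used above; NOT MongoDB's implementation.
const users = [
  { name: "Alice", age: 25 },
  { name: "Bob", age: 18 }, // hypothetical extra record
];

// {age: {$gt: 20}} keeps documents whose field is strictly greater than the value.
function matchesGt(doc, field, value) {
  return doc[field] > value;
}

// {$set: {...}} overwrites only the listed fields and leaves the rest untouched.
function applySet(doc, fields) {
  return { ...doc, ...fields };
}

const olderThan20 = users.filter((u) => matchesGt(u, "age", 20));
const updatedAlice = applySet(users[0], { age: 26 });

console.log(olderThan20);  // [ { name: 'Alice', age: 25 } ]
console.log(updatedAlice); // { name: 'Alice', age: 26 }
```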
UNIT – V: R PROGRAMMING
Introduction to R
• Statistical computing language
• Open-source and powerful for data analysis and visualization
Operators in R
• Arithmetic: +, -, *, /, ^, %% (modulus), %/% (integer division)
• Relational: <, <=, >, >=, ==, !=
• Logical: & and | (element-wise), && and || (single values), !
Data Structures
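• Vectors: One-dimensional; all elements share one type
• Matrices: Two-dimensional; all elements share one type
• Lists: Ordered collections whose elements may differ in type
• Data Frames: Table-like structure; columns may differ in type
• Factors: Categorical data stored as levels
• Tables: Frequency counts, as produced by table()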
Graphs in R
• plot(), barplot(), hist(), boxplot(), pie()
Apply Family
• apply(), lapply(), sapply(), tapply(), mapply()
• Used for repetitive operations on data structures
END OF SEMESTER NOTES