NoSQL
Web and business applications generate an enormous amount of data every day. A large share of this
data is handled by relational database management systems (RDBMS). The relational model was
introduced in E. F. Codd’s 1970 paper "A relational model of data for large shared data banks",
which made data modeling and application programming much easier. Beyond its intended benefits,
the relational model is well suited to client-server programming, and today it is the predominant
technology for storing structured data in web and business applications.
Distributed Systems
A distributed system consists of multiple computers and software components that communicate
through a computer network (a local area network or a wide area network). A distributed system can
take any number of possible configurations, involving mainframes, workstations, personal
computers, and so on. The computers interact with each other and share the resources of the system
to achieve a common goal.
Advantages of distributed computing:
Scalability: A distributed system can easily be expanded by adding more machines as needed.
Sharing of resources: Shared data is essential to many applications, such as banking and reservation
systems. Because data is shared in a distributed system, other resources (e.g. expensive printers)
can be shared as well.
Flexibility: New services are comparatively easy to install, implement, and debug.
Speed: A distributed system can combine the computing power of many machines, which can make it
faster than a single-machine system.
Open system: Because it is an open system, every service is equally accessible to every client,
whether local or remote.
Performance: The collection of processors in the system can provide higher performance (and a
better price/performance ratio) than a centralized computer.
Disadvantages of distributed computing:
Software: Limited software support is a main disadvantage of distributed computing systems.
Networking: The network infrastructure can introduce problems such as transmission errors,
overloading, and loss of messages.
Security: Easy access in a distributed computing system increases security risk, and sharing
data raises data-security problems.
Scalability
Scalability is the ability of a system (hardware, communication, or software) to expand to meet
growing business needs. For example, scaling a web application means allowing more people to use
the application. You scale a system either by upgrading the existing hardware without changing much
of the application, or by adding extra hardware.
There are two ways of scaling, vertical and horizontal:
Vertical scaling: to scale vertically (or scale up) means to add resources within the same logical
unit to increase capacity, for example adding CPUs to an existing server, increasing its memory,
or expanding storage by adding hard drives.
Horizontal scaling: to scale horizontally (or scale out) means to add more nodes to a system, such
as adding a new computer to a distributed software application. NoSQL data stores take advantage
of scaling out: they add more nodes to the system and distribute the load over those nodes, which
can make them much faster under heavy load.
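As a rough illustration of distributing load over nodes, the sketch below assigns each key to a node by hashing the key. The node names, the key format, and the function name `node_for_key` are hypothetical, and simple modulo placement is only one possible strategy (real systems often use consistent hashing so that fewer keys move when a node is added).

```python
import hashlib

def node_for_key(key, nodes):
    """Pick a node for a key by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]

nodes = ["node-a", "node-b", "node-c"]  # hypothetical node names
keys = ["user:1", "user:2", "user:3", "user:4"]  # hypothetical keys

# Every key maps deterministically to exactly one node.
placement = {key: node_for_key(key, nodes) for key in keys}
```

Scaling out then amounts to extending `nodes` and recomputing the placement for the affected keys.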
What is NoSQL?
NoSQL is a non-relational database management system that differs from traditional relational
database management systems in some significant ways. It is designed for distributed data stores
with very large-scale storage needs (for example Google or Facebook, which collect terabits of data
every day for their users). Such data stores may not require a fixed schema, avoid join operations,
and typically scale horizontally.
The term NoSQL is used to refer to a non-SQL, "not only SQL", or non-relational database. It
provides a mechanism for storing and retrieving data other than the tabular relations used in
relational databases. A NoSQL database generally does not use tables for storing data. It is
typically used for big data and real-time web applications.
Why NoSQL?
Today, data is becoming easier to access and capture through third parties such as Facebook,
Google+ and others. Personal user information, social graphs, geolocation data, user-generated
content, and machine logging data are just a few examples where the data has been increasing
exponentially. Providing these services properly requires processing huge amounts of data, which
SQL databases were never designed for. NoSQL databases evolved to handle such huge volumes of data
properly.
Example:
Social-network graph:
Each record: UserID1, UserID2
Separate records: UserID, first_name, last_name, age, gender, ...
Task: Find all friends of friends of friends of ... friends of a given user.
Wikipedia pages:
Large collection of documents
Combination of structured and unstructured data
Task: Retrieve all pages regarding athletics at the Summer Olympics before 1950.
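The social-network task above (friends of friends of ... friends) can be sketched as a breadth-first traversal over the (UserID1, UserID2) records. This is only an illustrative sketch: the function name `friends_within`, the sample records, and the assumption that friendship is symmetric are not from the original text.

```python
from collections import deque

def friends_within(edges, start, max_hops):
    """Return users reachable from `start` in at most `max_hops` friendship hops."""
    # Build an undirected adjacency map from (UserID1, UserID2) records.
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        user, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the requested depth
        for friend in adjacency.get(user, ()):
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, hops + 1))
    seen.discard(start)
    return seen

edges = [(1, 2), (2, 3), (3, 4), (4, 5)]  # hypothetical friendship records
friends_of_friends = friends_within(edges, 1, 2)
```

In a relational database each extra hop would typically need another self-join on the friendship table, which is one reason this kind of query is often cited as a motivation for graph-oriented stores.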
RDBMS vs NoSQL
RDBMS
- Structured and organized data
- Structured query language (SQL)
- Data and its relationships are stored in separate tables.
- Data Manipulation Language, Data Definition Language
- Tight Consistency
NoSQL
- Stands for Not Only SQL
- No declarative query language
- No predefined schema
- Key-Value pair storage, Column Store, Document Store, Graph databases
- Eventual consistency rather than ACID properties
- Unstructured and unpredictable data
- CAP Theorem
- Prioritizes high performance, high availability and scalability
- BASE Transaction
In early 2009, when last.fm wanted to organize an event on open-source distributed databases,
Eric Evans, a Rackspace employee, reused the term NoSQL to refer to databases that are
non-relational, distributed, and do not conform to atomicity, consistency, isolation, and
durability (ACID), the four defining features of traditional relational database systems.
In the same year, NoSQL was discussed and debated extensively at the "no:sql(east)" conference held
in Atlanta, USA. After that, discussion and practice of NoSQL gained momentum, and NoSQL saw
unprecedented growth.
Availability - This means that the system is always on (guaranteed service availability), with no
downtime.
Partition Tolerance - This means that the system continues to function even when communication
among the servers is unreliable, i.e. the servers may be partitioned into multiple groups that cannot
communicate with one another.
Theoretically, it is impossible to fulfill all three requirements (consistency, availability, and
partition tolerance) at the same time. Under CAP, a distributed system can satisfy at most two of
the three. Therefore, current NoSQL databases follow different combinations of C, A, and P from
the CAP theorem. Here is a brief description of the three combinations CA, CP, and AP:
CA - Single-site cluster, therefore all nodes are always in contact. When a partition occurs, the
system blocks.
CP - Some data may not be accessible, but the rest is still consistent/accurate.
AP - The system is still available under partitioning, but some of the data returned may be inaccurate.
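The difference between the CP and AP combinations can be illustrated with a toy two-replica store. This is a deliberately simplified sketch, not how any particular database behaves: the class names, the `mode` labels, and the `partitioned` flag are all hypothetical.

```python
class Replica:
    def __init__(self):
        self.data = {}

class TwoNodeStore:
    """Toy two-replica store; `mode` is "CP" or "AP" (hypothetical labels)."""

    def __init__(self, mode):
        self.mode = mode
        self.replicas = [Replica(), Replica()]
        self.partitioned = False  # True when the replicas cannot talk to each other

    def write(self, replica_index, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Refuse the write rather than let the replicas diverge.
                raise RuntimeError("unavailable during partition")
            # AP: accept the write locally; the other replica becomes stale.
            self.replicas[replica_index].data[key] = value
            return
        for replica in self.replicas:
            replica.data[key] = value

    def read(self, replica_index, key):
        return self.replicas[replica_index].data.get(key)

ap = TwoNodeStore("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)       # accepted, but only replica 0 sees it
stale = ap.read(1, "x")   # replica 1 still returns the old value
fresh = ap.read(0, "x")
```

In "CP" mode the same write during a partition raises an error instead, so no replica ever returns a value that another replica would contradict.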
NoSQL pros/cons
Advantages:
• High scalability
• Distributed computing
• Lower cost
• Schema flexibility, semi-structured data
• No complicated relationships
Disadvantages:
• No standardization
• Limited query capabilities (so far)
• Eventual consistency is not intuitive to program for
The BASE
The CAP theorem states that a distributed computer system cannot guarantee all of the following
three properties at the same time:
• Consistency
• Availability
• Partition tolerance
A BASE system gives up on strong consistency in exchange for availability.
• Basically Available indicates that the system guarantees availability, in terms of the
CAP theorem.
• Soft state indicates that the state of the system may change over time, even without input.
This is because of the eventual consistency model.
• Eventual consistency indicates that the system will become consistent over time, given that
the system doesn't receive input during that time.
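Eventual consistency can be sketched as replicas that exchange their state and keep the newest version of each key. The sketch below assumes a last-write-wins rule with hypothetical logical timestamps; real systems may use other conflict-resolution schemes, and the function name is illustrative.

```python
def merge_last_write_wins(replica_a, replica_b):
    """Anti-entropy step: both replicas keep the newest (timestamp, value) per key."""
    for key in set(replica_a) | set(replica_b):
        newest = max(
            (replica[key] for replica in (replica_a, replica_b) if key in replica),
            key=lambda entry: entry[0],  # compare by timestamp only
        )
        replica_a[key] = newest
        replica_b[key] = newest

# Each entry is a (timestamp, value) pair; the timestamps are hypothetical logical clocks.
a = {"cart": (1, ["book"])}
b = {"cart": (2, ["book", "pen"]), "theme": (1, "dark")}

# Before the merge the replicas disagree (soft state); afterwards they hold the same data.
merge_last_write_wins(a, b)
```

If no new writes arrive, repeating this exchange between all pairs of replicas leaves every replica with the same contents, which is the "consistent over time" behavior described above.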
ACID vs BASE
ACID            BASE
Atomic          Basically Available
Consistent      Soft state
Isolated        Eventual consistency
Durable
NoSQL Categories
There are four general types (most common categories) of NoSQL databases. Each of these
categories has its own specific attributes and limitations. No single solution is better than all
the others; however, some databases are better suited to specific problems. To clarify the NoSQL
landscape, let's discuss the most common categories:
• Key-value stores
• Column-oriented
• Graph
• Document oriented
Key-value stores
• Key-value stores are the most basic type of NoSQL database.
• Designed to handle huge amounts of data.
• Many designs are based on Amazon’s Dynamo paper.
• Key-value stores allow developers to store schema-less data.
• In key-value storage, the database stores data as a hash table where each key is unique and
the value can be a string, JSON, a BLOB (Binary Large OBject), etc.
• In some stores (Redis, for example), the values stored against keys may also be hashes, lists,
sets, or sorted sets.
• For example, a key-value pair might consist of a key like "Name" that is associated with a
value like "Robin".
• Key-value stores can be used as collections, dictionaries, associative arrays, etc.
• Key-value stores typically follow the 'Availability' and 'Partition tolerance' aspects of the
CAP theorem.
• Key-value stores work well for shopping cart contents, or for individual values like
color schemes, a landing page URI, or a default account number.
Examples of key-value store databases: Redis, Dynamo, Riak, etc.
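The hash-table idea behind a key-value store can be sketched in a few lines. This is a minimal in-memory illustration, not the API of Redis, Dynamo, or Riak; the class name, method names, and the "cart:42" key are hypothetical.

```python
import json

class KeyValueStore:
    """Minimal in-memory key-value store backed by a hash table (a Python dict)."""

    def __init__(self):
        self._table = {}

    def put(self, key, value):
        self._table[key] = value  # overwrites any previous value for the key

    def get(self, key, default=None):
        return self._table.get(key, default)

    def delete(self, key):
        self._table.pop(key, None)  # deleting a missing key is a no-op

store = KeyValueStore()
store.put("Name", "Robin")
# The store does not interpret values; here a shopping cart is kept as a JSON string.
store.put("cart:42", json.dumps({"items": ["book", "pen"]}))
cart = json.loads(store.get("cart:42"))
```

Because the store never looks inside a value, no schema is needed: the application decides how each value is encoded and decoded.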
Column-oriented databases
• Column-oriented databases work primarily on columns, and every column is treated
individually.
• Values of a single column are stored contiguously.
• Each column's data is stored in column-specific files.
• In column stores, query processors work on columns too.
• All data within each column data file has the same type, which makes it ideal for
compression.
• Column stores can improve query performance because a query can read only the columns it
needs.
• High performance on aggregation queries (e.g. COUNT, SUM, AVG, MIN, MAX).
• Well suited to data warehouses and business intelligence, customer relationship management
(CRM), library card catalogs, etc.
Examples of column-oriented databases: BigTable, Cassandra, SimpleDB, etc.
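The difference between a row layout and a column layout, and why aggregation benefits from the latter, can be sketched as follows. The table contents and field names are hypothetical sample data.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"id": 1, "region": "north", "amount": 120},
    {"id": 2, "region": "south", "amount": 80},
    {"id": 3, "region": "north", "amount": 200},
]

# Column layout: one contiguous list per column, all values of the same type.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An aggregate such as SUM or AVG touches only the column it needs.
total = sum(columns["amount"])
average = total / len(columns["amount"])
```

In the row layout the same SUM would have to read every record in full, including the `id` and `region` fields it never uses.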
Graph databases
A graph data structure consists of a finite (and possibly mutable) set of ordered pairs, called edges
or arcs, of certain entities called nodes or vertices.
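Here is a comparison between the classic relational model and the graph model:
Relational model    Graph model
Rows                Vertices
Joins               Edges
A corresponding comparison for the document model:
Relational model    Document model
Tables              Collections
Rows                Documents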
Data Storage Layer: focuses on high-performance, scalable data storage for
the task at hand.
Data Management Layer: allows low-level access to the data through specialized languages
that are more appropriate for the job, rather than being constrained to the standard SQL
format.
Key Features of NoSQL :
Dynamic schema: NoSQL databases do not have a fixed schema and can accommodate changing
data structures without the need for migrations or schema alterations.
Horizontal scalability: NoSQL databases are designed to scale out by adding more nodes to a
database cluster, making them well-suited for handling large amounts of data and high levels of
traffic.
Document-based: Some NoSQL databases, such as MongoDB, use a document-based data model,
where data is stored in semi-structured format, such as JSON or BSON.
Key-value-based: Other NoSQL databases, such as Redis, use a key-value data model, where data
is stored as a collection of key-value pairs.
Column-based: Some NoSQL databases, such as Cassandra, use a column-based data model,
where data is organized into columns instead of rows.
Distributed and high availability: NoSQL databases are often designed to be highly available
and to automatically handle node failures and data replication across multiple nodes in a database
cluster.
Flexibility: NoSQL databases allow developers to store and retrieve data in a flexible and dynamic
manner, with support for multiple data types and changing data structures.
Performance: NoSQL databases are optimized for high performance and can handle a high
volume of reads and writes, making them suitable for big data and real-time applications.
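The dynamic-schema and document-based features listed above can be illustrated with plain Python dictionaries standing in for JSON documents. This is only a sketch: the `find` helper and the sample documents are hypothetical and do not reflect the query API of MongoDB or any other product.

```python
# Documents in the same collection need not share the same fields.
collection = [
    {"_id": 1, "name": "Robin", "email": "robin@example.com"},
    {"_id": 2, "name": "Sam", "tags": ["admin"], "address": {"city": "Pune"}},
]

def find(docs, **criteria):
    """Return documents whose top-level fields match all criteria."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

# A later document can carry a new field without any migration step.
collection.append({"_id": 3, "name": "Lee", "last_login": "2024-01-01"})

matches = find(collection, name="Sam")
```

A relational table would need an ALTER TABLE (or a separate table) before the `last_login` value could be stored; here the new field simply appears on the documents that have it.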
Advantages of NoSQL: There are many advantages of working with NoSQL databases
such as MongoDB and Cassandra. The main advantages are high scalability and high
availability.
High scalability: NoSQL databases use sharding for horizontal scaling. Sharding means partitioning
the data and placing it on multiple machines in such a way that the order of the data is preserved.
Vertical scaling means adding more resources to the existing machine, whereas horizontal scaling
means adding more machines to handle the data. Vertical scaling is limited by the capacity of a
single machine, whereas horizontal scaling is comparatively easy with databases built for it.
Examples of horizontally scaling databases are MongoDB, Cassandra, etc. Because of this
scalability, NoSQL can handle a huge amount of data: as the data grows, the database scales out
to handle it efficiently.
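Order-preserving sharding can be sketched as range partitioning: each shard owns a contiguous key range, so every key on one shard sorts before every key on the next. The split points and sample keys below are hypothetical, and real databases choose and rebalance such boundaries automatically.

```python
from bisect import bisect_right

# Keys below "h" go to shard 0, keys from "h" up to (not including) "p" go to shard 1,
# and everything else goes to shard 2.
split_points = ["h", "p"]  # hypothetical range boundaries
shards = [[] for _ in range(len(split_points) + 1)]

def shard_for(key):
    """Return the index of the shard whose key range contains `key`."""
    return bisect_right(split_points, key)

for key in ["alice", "zoe", "mike", "harry", "bob"]:
    shards[shard_for(key)].append(key)
```

Because the ranges do not overlap, a range query such as "all keys between a and g" only needs to contact the shards whose ranges intersect it.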
Flexibility: NoSQL databases are designed to handle unstructured or semi-structured data, which
means that they can accommodate dynamic changes to the data model. This makes NoSQL
databases a good fit for applications that need to handle changing data requirements.
High availability: The automatic replication feature of NoSQL databases makes them highly
available, because after a failure the data can be restored from a replica to the previous
consistent state.
Scalability: NoSQL databases are highly scalable, which means that they can handle large
amounts of data and traffic with ease. This makes them a good fit for applications that need to
handle large amounts of data or traffic.
Performance: NoSQL databases are designed to handle large amounts of data and traffic, which
means that they can offer improved performance compared to traditional relational databases.
Cost-effectiveness: NoSQL databases are often more cost-effective than traditional relational
databases, as they are typically less complex and do not require expensive hardware or software.
Disadvantages of NoSQL: NoSQL has the following disadvantages.
Lack of standardization: There are many different types of NoSQL databases, each with its own
unique strengths and weaknesses. This lack of standardization can make it difficult to choose the
right database for a specific application.
Lack of ACID compliance: Many NoSQL databases are not fully ACID-compliant, which means that
they may not guarantee the consistency, integrity, and durability of data. This can be a drawback
for applications that require strong data consistency guarantees.
Narrow focus: NoSQL databases have a narrow focus: they are designed mainly for storage and
provide comparatively little other functionality. Relational databases are a better choice than
NoSQL in the field of transaction management.
Open-source: Many NoSQL databases are open-source, and there is no reliable standard for NoSQL
yet. In other words, two database systems are unlikely to be equivalent.
Lack of support for complex queries: Many NoSQL databases are not designed to handle complex
queries, which means that they may not be a good fit for applications that require complex data
analysis or reporting.
Lack of maturity: NoSQL databases are relatively new and lack the maturity of traditional
relational databases. This can make them less reliable and less secure than traditional databases.
Management challenge: The purpose of big data tools is to make the management of a large
amount of data as simple as possible, but in practice it is not so easy. Data management in NoSQL
can be much more complex than in a relational database. NoSQL, in particular, has a reputation for
being challenging to install and even more demanding to manage on a daily basis.
GUI is not available: GUI tools for accessing the database are not widely available in the
market.
Backup: Backup is a weak point for some NoSQL databases. With MongoDB, for example, taking a
consistent backup of data spread across a distributed cluster requires extra care.
Large document size: Some database systems, like MongoDB and CouchDB, store data in JSON-like
formats. This means that documents can be quite large (affecting storage, network bandwidth, and
speed), and descriptive key names actually hurt because they are repeated in every document and
increase its size.
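The cost of descriptive key names can be estimated by serializing the same record with long and short keys. The field names and values below are hypothetical, and the sketch measures plain JSON text rather than any database's internal storage format.

```python
import json

# The same record with descriptive keys and with abbreviated keys.
verbose = {"customer_first_name": "Robin", "customer_last_name": "Hood", "customer_age": 30}
compact = {"fn": "Robin", "ln": "Hood", "a": 30}

verbose_size = len(json.dumps(verbose).encode("utf-8"))
compact_size = len(json.dumps(compact).encode("utf-8"))

# Bytes saved per document; multiplied by the number of documents this can add up.
saved_per_document = verbose_size - compact_size
```

Short keys trade readability for size, so whether the saving is worth it depends on how many documents are stored and how often they cross the network.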
When should NoSQL be used:
• When a huge amount of data needs to be stored and retrieved.
• The relationships between the data you store are not that important.
• The data changes over time and is not structured.
• Support for constraints and joins is not required at the database level.
• The data is growing continuously and you need to scale the database regularly to handle
the data.