
Big Data Introduction

The document provides an overview of various data types including structured, semi-structured, quasi-structured, and unstructured data, along with their characteristics, merits, and demerits. It also discusses the concept of Big Data, its importance, and the 10 V's associated with it, as well as data processing and analytics tools like Hadoop, Hive, MADlib, and Pig. The document emphasizes the significance of understanding these data types for effective data management and analysis in organizations.

Uploaded by

qygsjxmc8k

Specialized Topics in Computer (CSC 411)

Introduction to Big Data and Data Types
STRUCTURED DATA, SEMI-STRUCTURED DATA, QUASI-STRUCTURED DATA AND UNSTRUCTURED DATA
Learning Objectives
• What is Data and Information?
• What is Digital Data?
• What is Big Data?
• 10 V’s of Big Data
• Data and Information Processing
• Structured
• Semi-Structured
• Quasi-Structured
• Unstructured
• Data Analytics Tools
• Conclusion – The Future of Data
What is Data and Information?
• The terms Data and Information are closely connected and it’s common for the two
terms to be used interchangeably.
• However, it’s vital to grasp the distinction between Data and Information.
• Data can be described as the quantities, characters, or symbols on which operations
are performed by a computer, being stored and transmitted in the form of electrical
signals and recorded on magnetic, optical, or mechanical recording media.
• Data is a collection of discrete values that convey information, describing quantity,
quality, fact, statistics, other basic units of meaning, or simply sequences of symbols
that may be further interpreted.
• Data is a statement of fact about an entity.
What is Data and Information?
• Information
• It is data that has been processed in such a way as to be meaningful to the person who receives it.
• It is processed data with a purpose.
• It is anything that is communicated.
Big Data
What is Big Data?
• The term big data started to show up sparingly in
the early 1990s, and its prevalence and importance
increased exponentially as years passed.
• Nowadays big data is often seen as integral to a
company's data strategy.
• Big Data is a broad term used to refer to the huge
volume of digital information generated by
various businesses.
• Big data is generated not only by traditional information exchange and software, but also by sensors of various types embedded in a variety of environments: hospitals, metro stations, markets, and virtually every electrical device that produces data.
What is Big Data?
• Big Data puts an inordinate focus on the issue of information volume. It exceeds the capacity of traditional data management technologies, creating the need for new tools and technologies to handle the extremely large volume.
• The challenge lies not only in storing large volumes of data but also in building new capabilities for analyzing them.
10 V’s of Big Data
• Volume
• Velocity
• Variety
• Variability
• Veracity
• Value
• Validity
• Vulnerability
• Volatility
• Visualization
What is Digital Data?
• In the computing world, digital data is considered a collection of facts that is transmitted and saved in an electronic format and processed by software systems.
• Digital data is generated by various devices, like desktops, laptops, tablets, mobile phones, and electronic sensors.
• Digital data is stored as strings of binary values (0s and 1s) on a storage medium that is either internal or external to the devices generating or accessing the information.
What is Digital Data?
• The storage devices can be of various kinds, like magnetic, optical, or solid-state storage devices.
• Examples of digital data are electronic
documents, text files, e-mails, e-books, digital
pictures, digital audio, and digital video.
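To make the "strings of binary values" point concrete, here is a minimal sketch that shows the bits behind a short piece of text. The helper name to_bits and the sample text are illustrative assumptions, and UTF-8 is just one common encoding.

```python
def to_bits(data: bytes) -> str:
    """Render each byte as eight binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)


text = "Hi"                        # sample text (illustrative)
encoded = text.encode("utf-8")     # text -> bytes, as it would be stored
bits = to_bits(encoded)            # bytes -> readable string of 0s and 1s
print(bits)                        # 01001000 01101001

# Reading the bits back recovers the original text.
decoded = bytes(int(chunk, 2) for chunk in bits.split()).decode("utf-8")
print(decoded)                     # Hi
```

The same idea applies to pictures, audio, and video: only the rules for interpreting the bits differ.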
Data and Information Processing
• Processing and analyzing information is significant and critical to any
organization.
• It allows organizations to derive value from information to take intelligent
decisions and improve organizational effectiveness.
• It is easier to analyze structured data because it is stored in an organised format.
• On the other hand, processing non-structured data and extracting value from it using traditional applications is difficult and time-consuming, and it demands more hardware resources.
• New architectures, technologies, and techniques have emerged that change how unstructured information from various sources is stored, managed, analyzed, and turned into value.
Data Types
Structured Data Type
• It is the type of data that is stored in relational databases such as Microsoft SQL Server and Oracle, where data is organised in rows and columns within named tables.
• It is highly specific and is stored in a predefined format.
• Structured data also adheres to predefined rules for formatting and labeling information.
• It consists of clearly defined data types with patterns that make them easily searchable.
• It usually resides in relational database management systems (RDBMS). Fields store length-delimited data like phone numbers, Social Security numbers, or ZIP codes, and records can also contain text strings of variable length like names, making it a simple matter to search.
Structured Data
• Data may be human- or machine-generated, as long as the data is created
within an RDB structure.
• This format is eminently searchable, both with human-generated queries
and via algorithms using types of data and field names, such as
alphabetical or numeric, currency, or date.
• Common relational database applications with structured data include
airline reservation systems, inventory control, sales transactions, and ATM
activity.
• Structured Query Language (SQL) enables queries on this type of
structured data within relational databases.
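As a small illustration of querying structured data with SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The sales table, its columns, and the rows are made up for the example and do not come from any real system.

```python
import sqlite3

# In-memory database so the example leaves nothing on disk.
conn = sqlite3.connect(":memory:")

# The schema is declared up front: named columns with fixed types.
conn.execute(
    "CREATE TABLE sales ("
    " id INTEGER PRIMARY KEY,"
    " item TEXT NOT NULL,"
    " amount REAL NOT NULL,"
    " sold_on TEXT NOT NULL)"
)

# Hypothetical sales transactions.
conn.executemany(
    "INSERT INTO sales (item, amount, sold_on) VALUES (?, ?, ?)",
    [
        ("notebook", 4.50, "2024-01-01"),
        ("pen", 1.25, "2024-01-02"),
        ("notebook", 4.50, "2024-01-03"),
        ("pen", 1.25, "2024-01-03"),
    ],
)

# Because every row has the same fields, filtering and aggregating is one query.
rows = conn.execute(
    "SELECT item, SUM(amount) FROM sales"
    " WHERE sold_on >= ? GROUP BY item ORDER BY item",
    ("2024-01-02",),
).fetchall()

for item, total in rows:
    print(item, total)
```

The query works only because the field names and types are known in advance, which is what makes structured data so searchable.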
Characteristics of Structured Data
• The structured data conform to a data model with a predefined
structure.
• Data is organized into entities such as tables, and these tables
are linked together using relationships.
• All data stored in a table column have similar attributes. For
example, if a table contains the [FirstName] column as string data,
it will always store the string data for all records in the column.
• It does not allow dynamic structure change for a specific record.
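The last characteristic can be seen in a short sketch, again using sqlite3 with a hypothetical students table. A single record cannot carry a field the table does not define; the structure has to change for the whole table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, first_name TEXT NOT NULL)"
)
conn.execute("INSERT INTO students (first_name) VALUES ('Ada')")
conn.execute("INSERT INTO students (first_name) VALUES ('Grace')")

# Trying to give ONE record an extra field fails: the column is not in the schema.
try:
    conn.execute(
        "INSERT INTO students (first_name, nickname) VALUES ('Alan', 'Al')"
    )
    rejected = False
except sqlite3.OperationalError:
    rejected = True

# The structure must be changed for the table as a whole ...
conn.execute("ALTER TABLE students ADD COLUMN nickname TEXT")

# ... after which every existing record has the new column (empty for now).
rows = conn.execute(
    "SELECT first_name, nickname FROM students ORDER BY id"
).fetchall()
print(rejected, rows)
```

This rigidity is the price paid for the predictability that makes structured data easy to query.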
Merits of Structured Data
• The fixed and well-defined schema makes the data easier to manage and access, and helps reduce storage.
• The data can be indexed based on its attributes. Indexing helps to read data from a database quickly.
• Data security can be implemented at a granular level, i.e., row, column, or table.
• Structured data can be accessed easily by machine learning algorithms. Therefore, you can quickly do data manipulation and calculations.
• You can perform Business Intelligence operations, with access to a wider range of tools.
• Structured data enables users to understand and analyze different data relationships quickly.
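The indexing merit can be sketched with sqlite3 as well. The table, the index name idx_students_last_name, and the data are assumptions for illustration; EXPLAIN QUERY PLAN is SQLite-specific, and its exact wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (id INTEGER PRIMARY KEY, last_name TEXT, dept TEXT)"
)
conn.executemany(
    "INSERT INTO students (last_name, dept) VALUES (?, ?)",
    [("Okoro", "CSC"), ("Bello", "MTH"), ("Eze", "CSC")],
)


def plan(sql, params=()):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params)
    return " ".join(row[-1] for row in rows)  # last field is the plan text


query = "SELECT id, last_name, dept FROM students WHERE last_name = ?"

before = plan(query, ("Okoro",))   # no index yet: the whole table is scanned
conn.execute("CREATE INDEX idx_students_last_name ON students (last_name)")
after = plan(query, ("Okoro",))    # the plan now mentions the index

print(before)
print(after)
```

On three rows the difference is invisible, but on large tables looking a value up through an index avoids reading every record.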
Demerits of Structured Data
• You need to define the schema well in advance, covering all data requirements. If you later need an additional column, the structure must be modified for all records in the table. Therefore, structured data is less flexible.

• It can be used only for its intended purpose, which limits business use cases.

• Limitations On Use: Due to the organization style of structured data, it is more difficult to have flexibility or varied use cases.

• Limited Storage: Structured data is stored in specific spaces of data warehouses. While accessing the data is easy, scalability can be difficult. Changes within data warehouses can become hard to manage. Using cloud data centers helps with the storage problems.

• High Overhead: Data centers or other storage for structured data can become expensive and be part of the structured data ordeal. Again, cloud data centers are recommended, but the storage can still require significant work to keep the data maintained properly.
Examples of Structured Data
• Spreadsheets
• Relational databases such as Microsoft SQL Server, Oracle
• Online Transaction Processing (OLTP) systems
• Sales transactions
• Phone numbers
• Email addresses
• ATM activity
• Inventory control
• Student fee payment databases
• Airline reservation and ticketing
• ZIP codes
Semi-structured Data
• This type of data does not have a standard data model but it has
clear self-describing patterns and structure.
• The Semi-structured data does not conform to a specific data model.
• However, it has structural properties for quick data analysis.
• It can be considered as a combined version of Structured and
Unstructured Data.
• Examples of semi-structured data are Excel spreadsheets that have
a row and column structure and XML files that are defined by an
XML schema.
Examples of Semi-structured Data
• Emails: Emails are an excellent example of semi-structured data. They have different fields for sender, recipients, date, subject, and importance, and can be easily categorized into different folders such as Inbox, Sent, Spam, and Promotions.
• Markup language: XML has a set of document encoding rules for defining formats that are both human- and machine-readable.
• The JavaScript Object Notation (JSON) offers a semi-structured data interchange format. It can be used for transmitting data between web servers and applications. It is widely popular for data exchange and supported by various relational and non-relational databases.
• NoSQL databases (MongoDB, DocumentDB, Couchbase) use a flexible data model that can be used with semi-structured data for storing, importing, and exporting.
The following image shows semi-structured
data that contains student records in JSON
format.
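A minimal sketch of what such student records might look like, and how they can be read in Python. The names, fields, and values are invented for illustration and are not the records from the original slide.

```python
import json

# Hypothetical student records. Each record describes itself with field
# names, but the records do not all share the same fields.
raw = """
[
  {"id": 1, "name": "Amina", "courses": ["CSC 411", "MTH 301"]},
  {"id": 2, "name": "Bola", "email": "bola@example.com"},
  {"id": 3, "name": "Chidi", "courses": ["CSC 411"],
   "guardian": {"name": "Ngozi"}}
]
"""

students = json.loads(raw)

# The structure can be discovered from the data itself.
for student in students:
    print(student["id"], sorted(student.keys()))

# Optional fields need a default, because not every record has them.
csc411 = [s["name"] for s in students if "CSC 411" in s.get("courses", [])]
print(csc411)
```

The self-describing keys are what make this semi-structured: there is no fixed table schema, yet a program can still navigate the records reliably.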
Quasi-structured Data
• This type of data consists of textual content with erratic data formats; it can be formatted with effort, software tools, and time.
• This data type includes web clickstream data such as Google searches.
• An example of quasi-structured data is the data about webpages a
user visited and in what order.
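As a rough sketch of the "effort and tools" involved, the code below turns a hypothetical clickstream (a list of visited URLs in order) into uniform records using Python's standard urllib.parse module. The URLs and the choice of output fields are assumptions for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical clickstream: pages one user visited, in order.
clicks = [
    "https://www.google.com/search?q=big+data",
    "https://www.google.com/search?q=hadoop+tutorial&hl=en",
    "https://example.com/docs/hadoop/intro",
]


def structure(url):
    """Pull a few consistent fields out of an irregular URL."""
    parts = urlparse(url)
    query = parse_qs(parts.query)        # e.g. {"q": ["big data"]}
    return {
        "host": parts.netloc,
        "path": parts.path,
        "search_terms": query.get("q", [None])[0],  # None if not a search
    }


# Keep the visit order, since order is part of the information.
visits = [dict(order=i, **structure(u)) for i, u in enumerate(clicks, start=1)]

for visit in visits:
    print(visit)
```

Real clickstream data is far messier, so a pipeline like this usually needs many more rules before the result is clean enough to analyze.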
Unstructured Data
• This type of data does not follow a data model or schema and isn’t organized in any specific format.
• It does not contain a predefined schema structure and does not belong to a data model.
• Therefore, it cannot be stored directly in the tables of a relational database.
• We can use non-relational databases such as MongoDB, Couchbase, Apache Cassandra, Redis, DocumentDB for storing unstructured data.
• It might have internal structural elements, but it does not store information in a predefined table format.
• It allows dynamic data generation and storage.
• Some samples of unstructured data are e-mails, displays, images, text documents, PDF files and videos.
• Approximately 90% of the digital data generated these days is non-structured, that is, semi-structured, quasi-structured, or unstructured.
Characteristics of Unstructured Data
• It works with data that does not have a specific format or sequence.
• You do not define a specific schema or structure for data storage.
• It allows dynamic data storage for individual records.
• Data is portable and scalable.
Merits of Unstructured Data

• As unstructured data does not have predefined rules, you can use it for more than one intended purpose.
• Unstructured data is quick to adapt because it uses a dynamic schema, and you do not need to edit all records to update a single record.
• It can work efficiently with heterogeneous sources.
Disadvantages of Unstructured Data
• You need experienced specialists such as data analysts and data scientists to work with unstructured data and draw value from it.
• You need specific data management tools for data analysis.
• Indexing unstructured data is complex and prone to error due to its flexible structure and lack of predefined attributes.
• Its storage cost is high compared to structured data.
Examples of Unstructured Data
• As per a recent report, 80% to 90% of enterprise data is unstructured.
• This underlines the importance and criticality of working with unstructured data.
• Emails: The email body or message is a popular form of unstructured data we use daily for email communication.
• Documents: Word files, spreadsheets, PDF, PowerPoint presentations.
• Websites: YouTube, Facebook, Instagram, LinkedIn contents can contain unstructured data such as social media messages.
• Media files: All sorts of media files such as images, audio, video.
• Communication: Mobile communication data, SMS messages, location data, live chat, IM, collaboration software.
• Books, magazines, articles, blogs, press releases, medical records (X-rays, ECG or imaging data).
• Scientific research data.
• Satellite imagery and sensor data.
Differences Between Structured and Unstructured Data
• Structured data is highly specific in comparison to unstructured data.
• Structured data is stored in a predefined schema or format, whereas unstructured data is a conglomeration of many different types of information.
• Structured data has a fixed schema and is referred to as organized data.
• The information can usually easily be searched for and processed in a database.
• However, if any information does not comply with the schema requirements, it fails to store in a database.
• The unstructured data offers flexibility and scalability without defining a fixed schema before working with any document.
• It allows storing data in various formats.
• However, it is slightly challenging to work with in comparison with structured data.
Structured vs. Unstructured Data: Comparison Table
The following table summarizes the difference between structured and
unstructured data.
How to Convert Unstructured Data into
Structured Data
• The data conversion process is time-consuming and requires experienced personnel.
• It might involve the following phases.
• Define your structured data requirements.
• Data cleansing: removing duplicates, cleaning up columns.
• Refine data.
• The data conversion might use machine learning models with Python or R services, or third-party tools such as Azure Data Factory, log parser tools, Cogito Semantic Technology, Zoho Analytics, SAS Viya, TextMiner, RapidMiner.
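The phases above can be sketched on a very small scale with a regular expression instead of a machine learning model. The log lines, the field names, and the pattern are all assumptions chosen for illustration; real free text rarely fits one pattern this neatly.

```python
import re

# Hypothetical free-text log lines, including a duplicate and a junk line.
raw_lines = [
    "2024-03-01 ERROR disk full on node-7",
    "2024-03-01 ERROR disk full on node-7",
    "garbage line",
    "2024-03-02 INFO  backup finished on node-2",
]

# Phase 1: define the structure we want (date, level, message, node).
PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2})\s+(?P<level>[A-Z]+)\s+"
    r"(?P<message>.+?) on node-(?P<node>\d+)$"
)

records = []
seen = set()
for line in raw_lines:
    match = PATTERN.match(line.strip())
    if match is None:
        continue                     # Phase 2: cleanse, drop unparseable lines
    row = match.groupdict()
    key = tuple(row.values())
    if key in seen:
        continue                     # Phase 2: cleanse, drop duplicates
    seen.add(key)
    row["node"] = int(row["node"])   # Phase 3: refine, give fields proper types
    records.append(row)

for record in records:
    print(record)
```

The resulting records all share the same fields and types, so they could be loaded into a relational table.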
Hadoop Data Analytics Tool
• It is an open-source framework for distributed storage and
processing of large sets of data on commodity hardware.
• It enables businesses to quickly gain insight from massive
amounts of structured and unstructured data.
• It is designed to scale from a single server to thousands of
machines, with a very high degree of fault tolerance.
• Rather than relying on high-end hardware, the resiliency of
these clusters comes from the software’s ability to detect and
handle failures at the application layer.
Benefits of Hadoop Data Analytics Tool
• Hadoop enables a computing solution that is:
• Scalable: New nodes can be added as needed without changing the data formats,
how data is loaded, how jobs are written, or the applications on the top.
• Cost-effective: Hadoop brings massively parallel computing to commodity
servers. The result is a sizeable decrease in the cost per terabyte of storage,
which in turn makes it affordable to model all your data.
• Flexible: Hadoop is schema-less and can absorb any type of data from any
number of sources. Data from multiple sources can be joined and aggregated in
arbitrary ways enabling deeper analyses than any one system can provide.
• Fault tolerant: When a node is lost, the system redirects work to another
location of the data and continues processing without missing a beat.
Hive Data Analytics Tool
• Hive data warehouse software facilitates querying and
managing large datasets residing in distributed storage.
• It provides a mechanism to project structure onto this
data and query the data using a SQL-like language called
HiveQL.
• At the same time, this language also allows traditional
map/reduce programmers to plug in their custom mappers and reducers.
MADlib Data Analytics Tool
• MADlib is an open-source library for scalable in-
database analytics that can help improve data analysis
efficiency and accuracy.
• It provides data parallel implementations of
mathematical, statistical, and machine-learning
methods for structured and unstructured data.
• These SQL-based algorithms for machine learning, data
mining, and statistics run at speed and scale.
Pig Data Analytics Tool
• Pig is a platform for analysing large data sets that consists of a
high-level language for expressing data analysis programs,
coupled with infrastructure for evaluating these programs.
• The salient property of Pig programs is that their structure is
amenable to substantial parallelization, which in turn enables
them to handle very large data sets.
• At the present time, Pig's infrastructure layer consists of a
compiler that produces sequences of MapReduce programs, for
which large-scale parallel implementations already exist (e.g. the
Hadoop subproject).
Pig Data Analytics Tool
• Pig's language layer currently consists of a textual language called Pig
Latin, which has the following key properties:
• Ease of programming: It is trivial to achieve parallel execution of simple,
"embarrassingly parallel" data analysis tasks.
• Complex tasks comprised of multiple interrelated data transformations are
explicitly encoded as data flow sequences, making them easy to write,
understand, and maintain.
• Optimization opportunities: The way tasks are encoded permits the system to
optimize their execution automatically, allowing the user to focus on semantics
rather than efficiency.
• Extensibility: Users can create their own functions to do special-purpose
processing.
MapReduce Data Analytics Tool
• A software framework that allows developers to write programs that
process massive amounts of unstructured data in parallel across a
distributed cluster of processors or standalone computers.
• The framework is divided into two parts:
• Map: A function that parcels out work to different nodes in the
distributed cluster.
• Reduce: A function that collates the work and resolves the results
into a single value.
• The first is the map job, which takes a set of data and converts it into
another set of data, where individual elements are broken down into
tuples (key/value pairs).
MapReduce Data Analytics Tool
• The reduce job takes the output from a map as input and
combines those data tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the
reduce job is always performed after the map job.
• MapReduce is important because it allows ordinary
developers to use MapReduce library routines to create
parallel programs without having to worry about
programming for intra-cluster communication, task
monitoring, or failure handling.
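The map and reduce jobs described above can be imitated in plain Python with the classic word-count example. This is a single-process sketch of the idea, not Hadoop's API: the function names are illustrative, and the grouping step between map and reduce (often called the shuffle) is written out by hand here, whereas a real framework performs it across the cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: break a document into (key, value) tuples, one per word."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group all values that share a key, as the framework would between jobs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


# Hypothetical input; in a cluster each document could sit on a different node.
documents = ["big data needs big tools", "data tools scale"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
grouped = shuffle(mapped)
counts = dict(reduce_phase(key, values) for key, values in grouped.items())

print(counts)
```

Because each call to map_phase looks at one document only and each call to reduce_phase looks at one key only, the calls are independent of each other, which is what lets a framework spread them over many machines.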
Conclusion: The Future of Data
• Data is at the heart of today’s digital world, whether you are a business professional or a consumer.
• Data is collected at every moment, and it forms the basis of many of our decisions.
• In the future, data may take on a more significant role in our lives, and it will likely be used in new ways.
• Every organization holds structured, semi-structured, and unstructured data.
• You might convert between data formats for import and export, or consume the data in a standard format.
References
• https://blog.skyvia.com/structured-vs-unstructured-data/
• https://www.datamation.com/big-data/structured-vs-unstructured-data/
• https://www.mycloudwiki.com/san/data-and-information-basics/
• The 10 Vs of Big Data | Transforming Data with Intelligence (tdwi.org)
