Chapter 2-converted BI

Chapter 2 of 'Fundamentals of Business Analytics' by RN Prasad and Seema Acharya discusses the types of digital data: structured, semi-structured, and unstructured. It outlines their origins, management, storage, and the challenges associated with extracting information from each type. The chapter emphasizes the importance of understanding these data types for effective business analytics.


Chapter 2: Types of Digital Data
“Fundamentals of Business Analytics”
RN Prasad and Seema Acharya
Copyright © 2011 Wiley India Pvt. Ltd. All rights reserved.
Learning Objectives and Learning Outcomes

Learning Objectives
Introduction to digital data and its types
1. Structured data – origin, organization, storage, access and usage
2. Semi-structured data – origin, organization, storage, access and usage
3. Unstructured data – origin, organization, storage, access and usage

Learning Outcomes
(a) To differentiate between structured, unstructured and semi-structured data
(b) To understand the need to integrate structured, unstructured and semi-structured data

Session Plan

Lecture time : 45 to 60 minutes

Q/A : 15 minutes

Agenda
• Types of digital data
– Unstructured
• Origin
• Management
• Storage
• Storage of unstructured data in relational database
• Process of extracting information
• Key take-away and additional reads
– Semi-structured
• Origin
• Management
• Storage
• Storage of semi-structured data in relational database
• Process of extracting information
• XML
• Key take-away and additional reads
Agenda (contd.)

• Types of digital data – contd.


– Structured
• Origin
• Management
• Storage
• Process of extracting information
• Key take-away and additional reads

Digital Data

• Digital data can be
– Unstructured
– Semi-structured
– Structured
• According to Merrill Lynch, 80–90% of business data is either unstructured or semi-structured
• Data is usually in a format which makes it difficult to extract information from it

Formats of Digital Data

Unstructured Data

What is Unstructured Data?

Unstructured data:
• Does not conform to any data model
• Cannot be stored in form of rows and columns as in a database
• Has no easily identifiable structure
• Not in any particular format or sequence
• Does not follow any rule or semantics
• Not easily usable by a program
Where does Unstructured Data Come from?

Unstructured data comes from:
• Web pages
• Memos
• Videos (MPEG, etc.)
• Images (JPEG, GIF, etc.)
• Body of an e-mail
• Word documents
• PowerPoint presentations
• Chats
• Reports
• Whitepapers
• Surveys
How to Store Unstructured Data?

Challenges faced
• Storage space: The sheer volume of unstructured data and its unprecedented growth make it difficult to store. Audio, video, images, etc. take up a huge amount of storage space.
• Scalability: Scalability becomes an issue as the volume of unstructured data increases.
• Retrieve information: Retrieving and recovering unstructured data is cumbersome.
• Security: Ensuring security is difficult due to the varied sources of data (e.g. e-mail, web pages).
• Update and delete: Updating, deleting, etc. are not easy due to the unstructured form.
• Indexing and searching: Indexing becomes difficult as data increases. Searching is difficult for non-text data.
How to Store Unstructured Data?
Possible solutions
• Change formats: Unstructured data may be converted to formats which are easily managed, stored and searched. For example, IBM is working on providing a solution which converts audio, video, etc. to text.
• New hardware: Create hardware which supports unstructured data, either to complement the existing storage devices or to stand alone for unstructured data.
• RDBMS/BLOBs: Store in relational databases which support BLOBs (Binary Large Objects).
• XML: Store in XML, which tries to give some structure to unstructured data by using tags and elements.
• CAS (content-addressable storage): Organize files based on their metadata.

How to Extract Information from Unstructured Data?
Challenges faced
• Interpretation: Unstructured data is not easily interpreted by conventional search algorithms.
• Tags: As the data grows, it is not possible to put tags manually.
• Indexing: Designing algorithms to understand the meaning of a document and then tag or index it accordingly is difficult.
• Deriving meaning: Computer programs cannot automatically derive meaning/structure from unstructured data.
• File formats: The increasing number of file formats makes it difficult to interpret data.
• Classification/Taxonomy: Different naming conventions followed across the organization make it difficult to classify data.
How to Extract Information from Unstructured Data?
Possible solutions
• Tags: Unstructured data can be stored in a virtual repository and be automatically tagged. For example, Documentum provides this type of solution.
• Text mining: Text mining tools help in grouping and classifying unstructured data and analyze it by considering grammar, context, synonyms, etc.
• Application platforms: Application platforms like XOLAP help extract information from e-mail and XML-based documents.
• Classification/Taxonomy: Taxonomies within the organization can be managed automatically to organize data in hierarchical structures.
• Naming conventions/standards: Following naming conventions or standards across an organization can greatly improve storage and retrieval.
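
As a rough illustration of automatic tagging, the sketch below suggests tags for a piece of free text by counting its most frequent words. It is a deliberately naive stand-in: real text mining tools also consider grammar, context and synonyms. The function name `suggest_tags`, the stop-word list and the sample memo are assumptions made for this example.

```python
import re
from collections import Counter

# Small, hand-picked stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on", "was", "by"}

def suggest_tags(text, max_tags=3):
    """Return the most frequent non-stop-words as candidate tags."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [word for word, _ in counts.most_common(max_tags)]

memo = (
    "The invoice for the server upgrade is attached. "
    "Finance approved the invoice and the server order."
)
tags = suggest_tags(memo, max_tags=2)
```

Here "invoice" and "server" each occur twice, so they are the two suggested tags for the memo.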
Further Reading

• http://www.information-management.com/issues/20030201/6287-1.html
• http://www.enterpriseitplanet.com/storage/features/article.php/11318_3407161_2
• http://domino.research.ibm.com/comm/research_projects.nsf/pages/uima.index.html
• http://www.research.ibm.com/UIMA/UIMA%20Architecture%20Highlights.html

Answer a Quick Question

Ask the participants of the learning program to state some more examples of unstructured data

Do it Exercise

Search, think and write about two best practices for managing the growth of
unstructured data

Semi-structured Data

What is Semi-structured Data?
Semi-structured data:
• Does not conform to a data model but contains tags & elements (metadata)
• Cannot be stored in form of rows and columns as in a database
• Similar entities are grouped
• Attributes in a group may not be the same
• The tags and elements describe how data is stored
• Not sufficient metadata
Where does Semi-structured Data Come from?

Semi-structured data comes from:
• E-mail
• XML
• TCP/IP packets
• Zipped files
• Binary executables
• Mark-up languages
• Integration of data from heterogeneous sources
How to Manage Semi-structured Data?

Some ways in which semi-structured data is managed and stored

• Schemas
– Describe the structure and content of data to some extent
– Assign meaning to data, hence allowing automatic search and indexing
• Graph-based data models
– Contain data on the leaves of the graph. Also known as ‘schema less’
– Used for data exchange among heterogeneous sources
• XML
– Models the data using tags and elements
– Schemas are not tightly coupled to data
How to Store Semi-structured Data?

Challenges faced
• Storage cost: Storing data with their schemas increases cost.
• RDBMS: Semi-structured data cannot be stored in existing RDBMS as data cannot be mapped into tables directly.
• Irregular and partial structure: Some data elements may have extra information while others have none at all.
• Implicit structure: In many cases the structure is implicit. Interpreting relationships and correlations is very difficult.
• Evolving schemas: Schemas keep changing with requirements, making it difficult to capture them in a database.
• Distinction between schema and data: A vague distinction between schema and data exists at times, making it difficult to capture data.
How to Store Semi-structured Data?

Possible solutions
• XML: XML allows users to define tags and attributes to store data. Data can be stored in a hierarchical/nested structure.
• RDBMS: Semi-structured data can be stored in a relational database by mapping the data to a relational schema, which is then mapped to a table.
• Special purpose DBMS: Databases which are specifically designed to store semi-structured data.
• OEM (Object Exchange Model): Data can be stored and exchanged in the form of a graph where entities are represented as objects, which are the vertices in the graph.
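
The RDBMS option, mapping semi-structured data to a relational schema, might look like the sketch below. It assumes Python's standard xml.etree.ElementTree and sqlite3 modules; the `<contacts>` document, the `contacts` table and the `load_contacts` helper are hypothetical examples.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: the second contact has no <email>.
XML_DATA = """
<contacts>
  <contact><name>Asha</name><email>asha@example.com</email></contact>
  <contact><name>Ravi</name></contact>
</contacts>
"""

def load_contacts(conn, xml_text):
    """Map each <contact> element to one row; missing fields become NULL."""
    root = ET.fromstring(xml_text)
    for contact in root.findall("contact"):
        conn.execute(
            "INSERT INTO contacts (name, email) VALUES (?, ?)",
            (contact.findtext("name"), contact.findtext("email")),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, email TEXT)")
load_contacts(conn, XML_DATA)
rows = conn.execute("SELECT name, email FROM contacts ORDER BY name").fetchall()
```

The irregular structure shows up as a NULL in the second row. Any element that the fixed table has no column for would simply be dropped, which is one reason the mapping is listed among the challenges as well as the solutions.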
How to Extract Information from Semi-structured Data?

Challenges faced
• Flat files: Semi-structured data is usually stored in flat files, which are difficult to index and search.
• Heterogeneous sources: Data comes from varied sources, which makes it difficult to tag and search.
• Incomplete/irregular structure: Extracting structure when there is none, and interpreting the relations in whatever structure is present, is a difficult task.
How to Extract Information from Semi-structured Data?

Possible solutions
• Indexing: Indexing data in a graph-based model enables quick search.
• OEM: Allows data to be stored in a graph-based data model, which is easier to index and search.
• XML: Allows data to be arranged in a hierarchical or tree-like structure, which enables indexing and searching.
• Mining tools: Various mining tools are available which search data based on graphs, schemas, structure, etc.
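
To make the idea of indexing a graph-based model concrete, here is a small sketch in the spirit of OEM: nested Python dictionaries and lists stand in for the objects (vertices), and plain values sit on the leaves. The `library` data and the `build_label_index` function are invented for illustration and are not an implementation of any specific OEM system.

```python
from collections import defaultdict

# A tiny OEM-style graph: dicts and lists are complex objects,
# plain values are the leaves that carry the data.
library = {
    "book": [
        {"title": "Data Basics", "author": "A. Rao", "year": 2009},
        {"title": "Web Markup", "author": "S. Iyer"},  # no "year": irregular structure
    ]
}

def build_label_index(node, index=None):
    """Walk the graph and record every leaf value under its edge label."""
    if index is None:
        index = defaultdict(list)
    if isinstance(node, dict):
        for label, child in node.items():
            if isinstance(child, (dict, list)):
                build_label_index(child, index)
            else:
                index[label].append(child)
    elif isinstance(node, list):
        for child in node:
            build_label_index(child, index)
    return index

index = build_label_index(library)
titles = index["title"]
years = index["year"]
```

Once the index is built, a search for all values under a label is a single dictionary lookup, even though the two book objects do not share the same attributes.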
XML – A Solution for Semi-structured Data Management

XML: Extensible Markup Language
• What is XML? An open-standard markup language written in plain text. It is hardware and software independent.
• Does what? Designed to store and transport data over the Internet.
• How? It allows data to be stored in a hierarchical/nested structure. It allows the user to define tags to store the data.
XML – A Solution for Semi-structured Data Management

XML has no predefined tags

<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>

The words in the <> (angle brackets) are user-defined tags

XML is known as self-describing, as data can exist without a schema and a schema can be added later
A schema can be described in a DTD or in XML Schema (XSD)
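
Because the tags describe the data, a program can read the message from this slide without any schema. The sketch below assumes Python's standard xml.etree.ElementTree module; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

MESSAGE = """<message>
<to> XYZ </to>
<from> ABC </from>
<subject> Greetings </subject>
<body> Hello! How are you? </body>
</message>"""

root = ET.fromstring(MESSAGE)
# Each child element becomes a tag -> text pair; no schema is required.
fields = {child.tag: child.text.strip() for child in root}
```

After parsing, `fields["subject"]` holds "Greetings" and `fields["to"]` holds "XYZ".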

Further Reading

• http://queue.acm.org/detail.cfm?id=1103832
• http://www.computerworld.com/s/article/93968/Taming_Text
• http://searchstorage.techtarget.com/generic/0,295582,sid5_gci1334684,00.html
• http://searchdatamanagement.techtarget.com/generic/0,295582,sid91_gci1264550,00.html
• http://searchdatamanagement.techtarget.com/news/article/0,289142,sid91_gci1252122,00.html

Answer a Quick Question

What is your take on this….

A Web Page is unstructured. If yes, why?

Structured Data
What Is Structured Data?

Structured data:
• Conforms to a data model
• Data is stored in form of rows and columns (e.g., relational database)
• Similar entities are grouped
• Attributes in a group are the same
• Data resides in fixed fields within a record or file
• Definition, format & meaning of data is explicitly known
Where does Structured Data Come from?

Structured data comes from:
• Databases (e.g., Access)
• Spreadsheets
• SQL
• OLTP systems

Structured Data: Everything in its Place

Fully described datasets

Clearly defined categories and sub-categories

Data neatly placed in rows and columns

Data that goes into the records is regulated by a well-defined structure

Indexing can be easily done either by the DBMS itself or manually


Structured Data

Semi-structured
Name | E-mail
Patrick Wood | [email protected], [email protected]
First name: Mark, Last name: Taylor | [email protected]
Alex Bourdoo | [email protected]

Structured
First Name | Last Name | E-mail Id | Alternate E-mail Id
Patrick | Wood | [email protected] | p.wood@ymail.uk
Mark | Taylor | MarkT@dcs.ymail.ac.uk |
Alex | Bourdoo | AlexBourdoo@dcs.ymail.ac.uk |
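
The comparison above can be sketched in code: loosely formatted contact records are converted into records with fixed fields. This is only an illustration under stated assumptions. The `to_structured` function, the dictionary keys and the example.com addresses are placeholders and do not reproduce the addresses on the slide.

```python
def to_structured(record):
    """Turn a loosely formatted contact into a record with fixed fields."""
    if "name" in record:
        # "Patrick Wood" -> ("Patrick", "Wood")
        first, _, last = record["name"].partition(" ")
    else:
        first, last = record["first name"], record["last name"]
    emails = [e.strip() for e in record["email"].split(",")]
    return {
        "first_name": first,
        "last_name": last,
        "email_id": emails[0],
        "alternate_email_id": emails[1] if len(emails) > 1 else None,
    }

# Two records of the same kind whose attributes differ from one to the next.
semi_structured = [
    {"name": "Patrick Wood", "email": "pw@example.com, p.wood@example.org"},
    {"first name": "Mark", "last name": "Taylor", "email": "mark@example.com"},
]
structured = [to_structured(r) for r in semi_structured]
```

Every resulting record has the same four fields, so the list could be loaded straight into a table with four columns.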

Ease with Structured Data-Storage

Ease with structured data
• Storage: Data types, both predefined and user-defined, help with the storage of structured data
• Scalability: Scalability is not generally an issue as data increases
• Security
• Update and delete: Updating, deleting, etc. are easy due to the structured form

Ease with Structured Data-Retrieval

Ease with structured data
• Retrieve information: A well-defined structure helps in easy retrieval of data
• Indexing and searching: Data can be indexed based not only on a text string but on other attributes as well. This enables streamlined search
• Mining data: Structured data can be easily mined and knowledge can be extracted from it
• BI operations: BI works extremely well with structured data. Hence data mining, warehousing, etc. can be easily undertaken
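
A short sketch shows why retrieval is easy once data sits in rows and columns. It assumes Python's built-in sqlite3 module; the `employees` table, its rows and the index name are made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees (name, dept, salary) VALUES (?, ?, ?)",
    [
        ("Asha", "Finance", 52000.0),
        ("Ravi", "Sales", 48000.0),
        ("Meena", "Finance", 61000.0),
    ],
)
# Index on the dept attribute so lookups by department can avoid a full scan.
conn.execute("CREATE INDEX idx_employees_dept ON employees (dept)")

# Retrieval: filter on one attribute, sort on another.
finance = conn.execute(
    "SELECT name FROM employees WHERE dept = ? ORDER BY salary DESC", ("Finance",)
).fetchall()

# A simple BI-style aggregate: average salary per department.
avg_by_dept = dict(
    conn.execute("SELECT dept, AVG(salary) FROM employees GROUP BY dept").fetchall()
)
```

Filtering, sorting and aggregating each take one declarative statement because every row has the same, explicitly known fields.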
Further Readings

• http://www.govtrack.us/articles/20061209data.xpd
• http://www.sapdesignguild.org/editions/edition2/sui_content.asp

Do it Exercise

Think and write about an instance where data was presented to you in
unstructured, semi-structured and structured format

Summary please…

Ask a few participants of the learning program to summarize the lecture.

