
Unit 1.

INTRODUCTION TO DATA SCIENCE

Learning Objectives
•Introduction to core concepts and technologies
•Familiarity with terminology related to data science
•Understanding the Data Science process
•Getting acquainted with popular Data Science toolkits
•Types of data dealt with in Data Science
•Familiarity with example applications
1.1 DATA SCIENCE
• Interdisciplinary field of scientific methods, processes
and systems to extract knowledge or insights from
structured or unstructured data, similar to data mining
• Data Science uses computer science, statistics,
machine learning, visualization and human-computer
interaction to collect, clean, integrate, analyze,
visualize and interact with data in order to
create data products
Data Science as a convergence of various knowledge domains
Discipline of using quantitative methods from Statistics and
Mathematics with Technology
Broad Canvas of Data Science: Dealing with Big Data
1.2 TERMINOLOGY RELATED TO DATA SCIENCE
Big Data is a term used to describe a collection of data
that is huge in size and yet growing exponentially with time.
– Characteristics of Big Data
Volume, Velocity, Variety, Veracity, Value
• Structured data –
Structured data is data whose elements are addressable for effective
analysis. It has been organized into a formatted repository, typically
a database.

• Semi-structured data –
Semi-structured data is information that does not reside in a relational
database but that has some organizational properties that make it
easier to analyze. With some processing, it can be stored in a relational
database.

• Unstructured data –
Unstructured data is data that is not organized in a predefined
manner and does not have a predefined data model, so it is not a good
fit for a mainstream relational database.
Properties of structured, semi-structured and unstructured data:

• Technology
  – Structured data: based on relational database tables
  – Semi-structured data: based on XML/RDF (Resource Description Framework)
  – Unstructured data: based on character and binary data

• Transaction management
  – Structured data: matured transactions and various concurrency techniques
  – Semi-structured data: transactions adapted from the DBMS, not matured
  – Unstructured data: no transaction management and no concurrency

• Version management
  – Structured data: versioning over tuples, rows and tables
  – Semi-structured data: versioning over tuples or graphs is possible
  – Unstructured data: versioned as a whole

• Flexibility
  – Structured data: schema dependent and less flexible
  – Semi-structured data: more flexible than structured data but less flexible than unstructured data
  – Unstructured data: more flexible; there is an absence of schema

• Scalability
  – Structured data: it is very difficult to scale the DB schema
  – Semi-structured data: its scaling is simpler than structured data
  – Unstructured data: it is more scalable

• Robustness
  – Structured data: very robust
  – Semi-structured data: new technology, not very widespread
  – Unstructured data: —

• Query performance
  – Structured data: structured queries allow complex joining
  – Semi-structured data: queries over anonymous nodes are possible
  – Unstructured data: only textual queries are possible
TERMINOLOGY RELATED TO DATA SCIENCE
Business Intelligence – technology that uses
transformed and loaded historical data to
generate reports.
- Helps executives, managers and other corporate
end users make informed business decisions
Business Intelligence & Big Data
• Data Analytics – collecting, processing and performing statistical
analysis of data

• Data Wrangling – the process of cleaning and unifying messy and
complex data sets for easy access and analysis.
The key steps to data wrangling (a short sketch follows this list):
• Data Acquisition: identify and obtain access to the data
within your sources
• Joining Data: combine the edited data for further use
and analysis
• Data Cleansing: redesign the data into a
usable/functional format and correct/remove any bad
data
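To make these steps concrete, here is a minimal sketch in Python using pandas; the file names (customers.csv, orders.csv) and column names are hypothetical, chosen only for illustration:

```python
# A minimal data-wrangling sketch using pandas.
# File names and column names are hypothetical, for illustration only.
import pandas as pd

# Data acquisition: obtain the raw data from its sources
customers = pd.read_csv("customers.csv")   # e.g. customer_id, name, city
orders = pd.read_csv("orders.csv")         # e.g. order_id, customer_id, amount

# Joining data: combine the two sets for further use and analysis
merged = orders.merge(customers, on="customer_id", how="left")

# Data cleansing: fix formats and correct/remove bad records
merged["city"] = merged["city"].str.strip().str.title()
merged = merged.dropna(subset=["amount"])   # drop rows with missing amounts
merged = merged[merged["amount"] > 0]       # remove obviously bad values

print(merged.head())
```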
Goals of data wrangling:
• Reveal a “deeper intelligence” within your data
by gathering data from multiple sources
• Put accurate, actionable data in the hands
of business analysts in a timely manner
• Reduce the time spent collecting and
organizing unruly data before it can be utilized
• Enable data scientists and analysts to focus on
the analysis of data, rather than the wrangling
• Drive better decision-making by senior
leaders in an organization
• Algorithm – a series of repeatable steps for
carrying out a certain type of task with data.
• Machine Learning – machine learning is an
application of artificial intelligence (AI) that
gives systems the ability to automatically
learn and improve from experience without
being explicitly programmed
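As a small illustration of learning from data rather than from hand-written rules, the sketch below trains a simple classifier with scikit-learn; the dataset and model choice are assumptions made only for this example:

```python
# A minimal machine-learning sketch with scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small built-in dataset
X, y = load_iris(return_X_y=True)

# Hold out part of the data to measure how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The model "learns from experience" (the training data) instead of explicit rules
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```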
• Web Analytics – web analytics is the process of
analyzing the behaviour of visitors to a
web site.
The use of web analytics is said to enable a
business to attract more visitors, retain existing
customers or attract new customers for goods or
services, and to increase the dollar volume each
customer spends.
1.3 METHODS OF DATA REPOSITORY
• A Data Repository refers to an enterprise data
storage entity into which data has been
partitioned for an analytical or reporting
purpose.
• Data Lakes,
• Data Marts,
• Data Warehouses (DW),
• Big Data and Hadoop
• Data Lake – a storage repository that holds a vast
amount of raw data in its native format until it is
needed and refined elsewhere
Characteristics:
1. All data is loaded from source systems; no data is
turned away
2. Data is stored at the leaf level in an untransformed
or nearly untransformed state
3. Data is transformed and a schema is applied to fulfil
the needs of analysis
Data Warehouse – constructed by integrating data from
multiple heterogeneous sources; supports analytical
reporting, structured or ad-hoc queries and decision
making.

Understanding a DW:
1. Kept separate from the organization's operational database
2. No frequent updating is done in a warehouse
3. Holds historical data
4. Helps in the integration of a diversity of application systems
5. Helps in consolidated historical data analysis
Data Warehouse Models

Virtual Warehouse – a set of views over an operational database;
easy to build, but requires excess capacity on operational database servers.
Data Mart – a subset of organization-wide data
Characteristics:
1. Windows-based or Unix/Linux-based servers are used to implement data marts
2. The data mart cycle is measured in short periods of time
3. Small in size
4. Customized by department
5. The source of a data mart is a departmentally structured data warehouse
6. It is flexible
Enterprise Warehouse – collects all the information and subjects spanning an entire
organization
Provides enterprise-wide data integration
Data is integrated from operational systems and external information providers
1.5 Types of Data
• Unstructured data: heterogeneous data sources
containing a combination of simple text files, images,
videos etc., e.g. Word, PDF, text, media logs
• Semi-structured data: web pages generated by
scripting, HTML and XML data
• Structured data: stored, accessed and processed in
a fixed format, e.g. relational data
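A minimal sketch of how these three types are typically read in Python; the file names (sales.csv, products.json, review.txt) are hypothetical, chosen only for illustration:

```python
# Reading structured, semi-structured and unstructured data (illustrative file names).
import json
import pandas as pd

# Structured: fixed-format rows and columns, e.g. a CSV exported from a relational table
table = pd.read_csv("sales.csv")

# Semi-structured: organizational properties (keys, nesting) but no fixed relational schema
with open("products.json") as f:
    products = json.load(f)        # e.g. a list of nested dictionaries

# Unstructured: no predefined data model, e.g. free text from a document or log
with open("review.txt") as f:
    text = f.read()

print(table.shape, len(products), len(text.split()))
```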
1.6 Data Science Process
• Frame or define the (domain) problem
• Collect the raw data needed for your problem
• Data preparation: process the data for analysis
• Explore the data
• Perform in-depth analysis and produce
prescriptive business insights
• Evaluation
• Visualization and communication of the results of the
analysis (a short exploration/visualization sketch follows this list)
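A minimal sketch of the explore and visualize steps, assuming a hypothetical sales.csv with an amount column:

```python
# Exploring and visualizing data with pandas and matplotlib (illustrative only).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")      # hypothetical data set

# Explore: summary statistics and a first look at the rows
print(df.describe())
print(df.head())

# Visualize and communicate: a simple distribution plot of one column
df["amount"].hist(bins=30)
plt.xlabel("amount")
plt.ylabel("frequency")
plt.title("Distribution of order amounts")
plt.savefig("amount_distribution.png")
```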
1.7 Data Science Project’s Life Cycle
High-level phases of CRISP-DM suggested for Data Science
projects:
• Business Understanding
• Data Acquisition and Understanding
• Modeling
• Deployment
• Customer Acceptance
Life cycle of the Data Science process
1.8 Popular Data Science Toolkit

• R Programming Language
• Python
• KNIME – open-source analytics platform for data reporting, mining and
predictive analytics
• SQL
• Apache Hadoop and Big Data tools
Apache Mahout – an environment for building scalable machine learning
algorithms
Apache Spark – cluster computing framework for data analysis (a short
Spark sketch follows this list)
Impala – MPP database for Apache Hadoop
Apache Storm – computational platform for real-time analytics
MongoDB – NoSQL database offering scalability and high performance
• TensorFlow – dataflow programming across a range of tasks
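As a small illustration of one tool from this list, the sketch below reads and aggregates a CSV with Apache Spark's Python API (PySpark); the file name and column names (sales.csv, region, amount) are assumptions made only for the example:

```python
# A minimal PySpark sketch (file and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Read a structured data set into a distributed DataFrame
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A simple aggregation that Spark can run across a cluster
totals = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.show()

spark.stop()
```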
1.9 Familiarity with Example Applications

1. Airline route planning – predict flight delays;
decide which class of airplane to buy; decide whether
to fly directly to the destination or take a halt in
between; effectively drive customer loyalty
programs
2. Fraud & risk detection – customer profiling,
past expenditure
3. Delivery logistics – improve operational
efficiency
4. Uber’s taxi service – smartphone-app-based taxi booking service
5. Price comparison websites – comparing the prices of products from
multiple vendors in one place
6. People analytics – application of analytics that helps companies
manage human resources
7. Portfolio analytics – make decisions on when to lend money
8. Risk analytics – risk scores for individual customers
9. Digital analytics – business & technical activity that defines, creates,
collects, verifies or transforms digital data into
reporting, research, analysis, recommendations, optimizations,
predictions and automations
10. Security analytics – event management and user behavior
analytics

Application areas: Marketing, Finance, Human Resources, Health Care, Government
