
Unit 1.

INTRODUCTION TO DATA SCIENCE

Learning Objectives
•Introduction to core concepts and technologies
•Familiarity with terminology related to data science
•Understanding the Data Science process
•Getting acquainted with popular Data Science toolkits
•Types of data dealt with in Data Science
•Familiarity with example applications
1.1 DATA SCIENCE
• Interdisciplinary field of scientific methods, processes
and systems to extract knowledge or insights from
structured or unstructured data, similar to data mining
• Data Science uses computer science, statistics,
machine learning, visualization and human-computer
interaction to collect, clean, integrate, analyze,
visualize and interact with data in order to
create data products
Data Science as a convergence of various knowledge domains
Discipline of using quantitative methods from Statistics and
Mathematics with Technology
Broad Canvas of Data Science: Dealing with Big Data
1.2 TERMINOLOGY RELATED TO DATA SCIENCE
Big Data is a term used to describe a collection of data
that is huge in size and yet growing exponentially with time.
– Characteristics of Big Data
Volume, Velocity, Variety, Veracity, Value
• Structured data –
Structured data is data whose elements are addressable for effective
analysis. It has been organized into a formatted repository, typically
a database.

• Semi-structured data –
Semi-structured data is information that does not reside in a relational
database but that has some organizational properties that make it
easier to analyze. With some processing, it can be stored in a relational
database.

• Unstructured data –
Unstructured data is data that is not organized in a predefined
manner and does not have a predefined data model, so it is not a good
fit for a mainstream relational database.
Properties of structured, semi-structured and unstructured data:

• Technology
  – Structured data: based on relational database tables
  – Semi-structured data: based on XML/RDF (Resource Description Framework)
  – Unstructured data: based on character and binary data

• Transaction management
  – Structured data: matured transactions and various concurrency techniques
  – Semi-structured data: transactions adapted from the DBMS, not matured
  – Unstructured data: no transaction management and no concurrency

• Version management
  – Structured data: versioning over tuples, rows and tables
  – Semi-structured data: versioning over tuples or graphs is possible
  – Unstructured data: versioned as a whole

• Flexibility
  – Structured data: schema dependent and less flexible
  – Semi-structured data: more flexible than structured data but less flexible than unstructured data
  – Unstructured data: more flexible; there is an absence of schema

• Scalability
  – Structured data: it is very difficult to scale the DB schema
  – Semi-structured data: its scaling is simpler than structured data
  – Unstructured data: it is more scalable

• Robustness
  – Structured data: very robust
  – Semi-structured data: new technology, not very widespread
  – Unstructured data: —

• Query performance
  – Structured data: structured queries allow complex joining
  – Semi-structured data: queries over anonymous nodes are possible
  – Unstructured data: only textual queries are possible
TERMINOLOGY RELATED TO DATA SCIENCE
Business Intelligence – technology that uses
transformed and loaded historical data to
generate reports.
- Helps executives, managers and other corporate
end users make informed business decisions
Business Intelligence & Big Data
• Data Analytics – collecting, processing and performing statistical
analysis of data

• Data Wrangling – the process of cleaning and unifying messy and
complex data sets for easy access and analysis.
The key steps to data wrangling (a short sketch follows this list):
• Data Acquisition: identify and obtain access to the data
within your sources
• Joining Data: combine the edited data for further use
and analysis
• Data Cleansing: redesign the data into a
usable/functional format and correct/remove any bad
data
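To make these steps concrete, here is a minimal sketch in Python using pandas; the file names (customers.csv, orders.csv) and column names are hypothetical, chosen only for illustration:

```python
# A minimal data-wrangling sketch using pandas.
# File names and column names are hypothetical, for illustration only.
import pandas as pd

# Data acquisition: obtain the raw data from its sources
customers = pd.read_csv("customers.csv")   # e.g. customer_id, name, city
orders = pd.read_csv("orders.csv")         # e.g. order_id, customer_id, amount

# Joining data: combine the two sets for further use and analysis
merged = orders.merge(customers, on="customer_id", how="left")

# Data cleansing: fix formats and correct/remove bad records
merged["city"] = merged["city"].str.strip().str.title()
merged = merged.dropna(subset=["amount"])   # drop rows with missing amounts
merged = merged[merged["amount"] > 0]       # remove obviously bad values

print(merged.head())
```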
Goals of data wrangling:
• Reveal a “deeper intelligence” within your data
by gathering data from multiple sources
• Put accurate, actionable data in the hands
of business analysts in a timely manner
• Reduce the time spent collecting and
organizing unruly data before it can be utilized
• Enable data scientists and analysts to focus on
the analysis of data, rather than the wrangling
• Drive better decision-making by senior
leaders in an organization
• Algorithm – a series of repeatable steps for
carrying out a certain type of task with data.
• Machine Learning – machine learning is an
application of artificial intelligence (AI) that
gives systems the ability to automatically
learn and improve from experience without
being explicitly programmed
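As a small illustration of learning from data rather than from hand-written rules, the sketch below trains a simple classifier with scikit-learn; the dataset and model choice are assumptions made only for this example:

```python
# A minimal machine-learning sketch with scikit-learn (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Load a small built-in dataset
X, y = load_iris(return_X_y=True)

# Hold out part of the data to measure how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# The model "learns from experience" (the training data) instead of explicit rules
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```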
• Web Analytics – web analytics is the process of
analyzing the behaviour of visitors to a
web site.
The use of web analytics is said to enable a
business to attract more visitors, retain existing
customers or attract new customers for goods or
services, and to increase the dollar volume each
customer spends.
1.3 METHODS OF DATA REPOSITORY
• A Data Repository refers to an enterprise data
storage entity into which data has been
partitioned for an analytical or reporting
purpose.
• Data Lakes,
• Data Marts,
• Data Warehouses (DW),
• Big Data and Hadoop
• Data Lake – a storage repository that holds a vast
amount of raw data in its native format until it is
needed and refined elsewhere
Characteristics:
1. All data is loaded from source systems; no data is
turned away
2. Data is stored at the leaf level in an untransformed
or nearly untransformed state
3. Data is transformed and a schema is applied to fulfil
the needs of analysis
Data Warehouse – constructed by integrating data from
multiple heterogeneous sources; supports analytical
reporting, structured or ad-hoc queries and decision
making.

Understanding a DW:
1. Kept separate from the organization's operational database
2. No frequent updating is done in a warehouse
3. Holds historical data
4. Helps in the integration of a diversity of application systems
5. Helps in consolidated historical data analysis
Data Warehouse Models

Virtual Warehouse – a set of views over an operational database;
easy to build, but requires excess capacity on operational database servers.
Data Mart – a subset of organization-wide data
Characteristics:
1. Windows-based or Unix/Linux-based servers are used to implement data marts
2. The data mart cycle is measured in short periods of time
3. Small in size
4. Customized by department
5. The source of a data mart is a departmentally structured data warehouse
6. It is flexible
Enterprise Warehouse – collects all the information and subjects spanning an entire
organization
Provides enterprise-wide data integration
Data is integrated from operational systems and external information providers
1.5 Types of Data
• Unstructured data: heterogeneous data sources
containing a combination of simple text files, images,
videos etc., e.g. Word, PDF, text, media logs
• Semi-structured data: web pages generated by
scripting, HTML and XML data
• Structured data: stored, accessed and processed in
a fixed format, e.g. relational data
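A minimal sketch of how these three types are typically read in Python; the file names (sales.csv, products.json, review.txt) are hypothetical, chosen only for illustration:

```python
# Reading structured, semi-structured and unstructured data (illustrative file names).
import json
import pandas as pd

# Structured: fixed-format rows and columns, e.g. a CSV exported from a relational table
table = pd.read_csv("sales.csv")

# Semi-structured: organizational properties (keys, nesting) but no fixed relational schema
with open("products.json") as f:
    products = json.load(f)        # e.g. a list of nested dictionaries

# Unstructured: no predefined data model, e.g. free text from a document or log
with open("review.txt") as f:
    text = f.read()

print(table.shape, len(products), len(text.split()))
```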
1.6 Data Science Process
• Frame or define the (domain) problem
• Collect the raw data needed for your problem
• Data preparation: process the data for analysis
• Explore the data
• Perform in-depth analysis and produce
prescriptive business insights
• Evaluation
• Visualization and communication of the results of the
analysis (a short exploration/visualization sketch follows this list)
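A minimal sketch of the explore and visualize steps, assuming a hypothetical sales.csv with an amount column:

```python
# Exploring and visualizing data with pandas and matplotlib (illustrative only).
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("sales.csv")      # hypothetical data set

# Explore: summary statistics and a first look at the rows
print(df.describe())
print(df.head())

# Visualize and communicate: a simple distribution plot of one column
df["amount"].hist(bins=30)
plt.xlabel("amount")
plt.ylabel("frequency")
plt.title("Distribution of order amounts")
plt.savefig("amount_distribution.png")
```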
1.7 Data Science Project’s Life Cycle
High-level phases of CRISP-DM suggested for Data Science
projects:
• Business Understanding
• Data Acquisition and Understanding
• Modeling
• Deployment
• Customer Acceptance
Life cycle of the Data Science process
1.8 Popular Data Science Toolkit

• R Programming Language
• Python
• KNIME – open-source analytics platform for data reporting, mining and
predictive analytics
• SQL
• Apache Hadoop and Big Data tools
Apache Mahout – an environment for building scalable machine learning
algorithms
Apache Spark – cluster computing framework for data analysis (a short
Spark sketch follows this list)
Impala – MPP database for Apache Hadoop
Apache Storm – computational platform for real-time analytics
MongoDB – NoSQL database offering scalability and high performance
• TensorFlow – dataflow programming across a range of tasks
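As a small illustration of one tool from this list, the sketch below reads and aggregates a CSV with Apache Spark's Python API (PySpark); the file name and column names (sales.csv, region, amount) are assumptions made only for the example:

```python
# A minimal PySpark sketch (file and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("intro-example").getOrCreate()

# Read a structured data set into a distributed DataFrame
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A simple aggregation that Spark can run across a cluster
totals = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.show()

spark.stop()
```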
1.9 Familiarity with Example Applications

1. Airline route planning – predict flight delays;
decide which class of airplane to buy; decide whether
to fly directly to the destination or take a halt in
between; effectively drive customer loyalty
programs
2. Fraud & risk detection – customer profiling,
past expenditure
3. Delivery logistics – improve operational
efficiency
4. Uber’s taxi service – smartphone-app-based taxi booking service
5. Price comparison websites – comparing the prices of products from
multiple vendors in one place
6. People analytics – application of analytics that helps companies
manage human resources
7. Portfolio analytics – make decisions on when to lend money
8. Risk analytics – risk scores for individual customers
9. Digital analytics – business & technical activity that defines, creates,
collects, verifies or transforms digital data into
reporting, research, analysis, recommendations, optimizations,
predictions and automations
10. Security analytics – event management and user behavior
analytics

Application areas: Marketing, Finance, Human Resources, Health Care, Government
