Unit 1 Introduction
Learning Objectives
• Introduction to core concepts and technologies
• Familiarity with terminology related to data science
• Understanding the data science process
• Getting acquainted with popular data science toolkits
• Types of data dealt with in data science
• Familiarity with example applications
1.1 DATA SCIENCE
• An interdisciplinary field that uses scientific methods, processes
and systems to extract knowledge or insights from data in
structured or unstructured form, similar to data mining
• Data science draws on computer science, statistics, machine
learning, visualization and human-computer interaction to
collect, clean, integrate, analyze, visualize and interact with
data in order to create data products
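The collect, clean, integrate and analyze steps named above can be sketched in a few lines. This is only an illustrative sketch using the Python standard library; the sample records, the `customers` lookup and the function names `clean`, `integrate` and `analyze` are invented for the example and are not part of the course material.

```python
from statistics import mean

# Collect: two hypothetical raw sources (invented sample records).
sales_raw = [
    {"customer_id": "1", "amount": "120.50"},
    {"customer_id": "2", "amount": ""},        # missing value
    {"customer_id": "1", "amount": "79.50"},
    {"customer_id": "3", "amount": "200.00"},
]
customers = {1: "North", 2: "South", 3: "South"}  # customer_id -> region


def clean(records):
    """Drop records with a missing amount and convert fields to numbers."""
    return [
        {"customer_id": int(r["customer_id"]), "amount": float(r["amount"])}
        for r in records
        if r["amount"]
    ]


def integrate(records, regions):
    """Attach the region from the second source to each sales record."""
    return [{**r, "region": regions[r["customer_id"]]} for r in records]


def analyze(records):
    """Compute the mean sale amount per region."""
    by_region = {}
    for r in records:
        by_region.setdefault(r["region"], []).append(r["amount"])
    return {region: mean(amounts) for region, amounts in by_region.items()}


summary = analyze(integrate(clean(sales_raw), customers))
print(summary)
```

The resulting `summary` dictionary is a very small "data product": a per-region figure that could feed a chart or a report.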
Data Science as a convergence of various knowledge domains
Discipline of using quantitative methods from Statistics and
Mathematics together with Technology
Broad Canvas of Data Science Dealing with Big Data
1.2 TERMINOLOGY RELATED TO DATA SCIENCE
Big Data is a term used to describe a collection of
data that is huge in size and still growing
exponentially with time.
– Characteristics of Big Data
Volume, Velocity, Variety, Veracity, Value
• Structured data –
Structured data is data whose elements are addressable for effective
analysis. It is organized into a formatted repository, typically a
database.
• Semi-structured data –
Semi-structured data is information that does not reside in a relational
database but has some organizational properties that make it easier to
analyze. With some processing, it can be stored in a relational
database.
• Unstructured data –
Unstructured data is data that is not organized in a predefined manner
or does not have a predefined data model, so it is not a good fit for a
mainstream relational database.
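As a rough illustration of the point that semi-structured data can be stored in a relational database after some processing, the sketch below flattens a few JSON records into a fixed set of columns. The sample records, the `person` table and the `flatten` helper are hypothetical and chosen only for this example.

```python
import json
import sqlite3

# Hypothetical semi-structured input: records share some keys but not all.
raw = """
[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Ravi", "contact": {"phone": "555-0100"}},
  {"id": 3, "name": "Mei"}
]
"""


def flatten(record):
    """Map one nested record onto fixed columns; absent fields become None."""
    contact = record.get("contact", {})
    return (record["id"], record["name"],
            contact.get("email"), contact.get("phone"))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, email TEXT, phone TEXT)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?, ?)",
    [flatten(r) for r in json.loads(raw)],
)

# Once loaded, the data can be queried like any structured table.
rows = conn.execute(
    "SELECT name, email FROM person WHERE email IS NOT NULL"
).fetchall()
print(rows)
```

The "processing" here is the choice of columns: fields that a record does not carry are stored as NULL, which is one simple way to fit irregular records into a relational schema.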
Properties        | Structured data                        | Semi-structured data                              | Unstructured data
Technology        | Based on relational database tables    | Based on XML/RDF (Resource Description Framework) | Based on character and binary data
Query performance | Structured queries allow complex joins | Queries over anonymous nodes are possible         | Only textual queries are possible
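The difference in query capability between structured and unstructured data can be seen in a small sketch. The tables, names and notes below are invented sample data, and SQLite stands in for any relational database: the structured side supports a join with aggregation, while the free-text side only supports textual matching.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 100.0), (12, 2, 75.0);
""")

# Structured data: a join plus aggregation is expressed directly in SQL.
totals = conn.execute("""
    SELECT c.name, SUM(o.total)
    FROM customer AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()

# Unstructured data: with no data model, only textual matching is available.
notes = [
    "Asha called about a delayed order and asked for a refund.",
    "Ravi left positive feedback on delivery speed.",
]
matches = [n for n in notes if "refund" in n.lower()]

print(totals)
print(matches)
```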
Business Intelligence - technology that uses
transformed and loaded historical data to
generate reports.
- Helps executives, managers and other corporate
end users make informed business decisions
Business Intelligence & Big Data
• Data Analytics - collecting, processing and performing statistical
analysis of data
Understanding a Data Warehouse (DW) -
1. Kept separate from the organization's operational database
2. Not updated frequently
3. Holds historical data
4. Helps integrate diverse application systems
5. Helps in consolidated analysis of historical data
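The properties listed above can be illustrated with a minimal sketch in which the warehouse is a separate store that receives occasional summarized loads from the operational database. Two in-memory SQLite databases stand in for the real systems; the `sales` and `monthly_sales` tables, the sample rows and the `load_warehouse` function are assumptions made for this example only.

```python
import sqlite3

# Operational database: holds current transactions and is updated frequently.
operational = sqlite3.connect(":memory:")
operational.executescript("""
    CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('2023-01-15', 'North', 100.0),
        ('2023-01-20', 'South', 50.0),
        ('2023-02-03', 'North', 70.0);
""")

# Warehouse: a separate store that receives periodic, summarized loads.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE monthly_sales (month TEXT, region TEXT, total REAL)"
)


def load_warehouse(src, dst):
    """Extract sales, summarize by month and region, append to the warehouse."""
    rows = src.execute("""
        SELECT substr(sale_date, 1, 7) AS month, region, SUM(amount) AS total
        FROM sales
        GROUP BY month, region
        ORDER BY month, region
    """).fetchall()
    dst.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?)", rows)
    return len(rows)


loaded = load_warehouse(operational, warehouse)

# Consolidated historical analysis runs against the warehouse only.
history = warehouse.execute(
    "SELECT month, SUM(total) FROM monthly_sales GROUP BY month ORDER BY month"
).fetchall()
print(loaded, history)
```

Reports of the kind Business Intelligence tools produce would read from `monthly_sales`, so analysis queries never compete with day-to-day transactions on the operational side.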
Data Warehouse Models
• R Programming Language
• Python
• KNIME - open source analytics platform for data reporting, mining and
predictive analysis
• SQL
• Apache Hadoop and Big Data tools
    Apache Mahout - an environment for building scalable machine learning
    algorithms
    Apache Spark - cluster computing framework for data analysis
    Impala - MPP (massively parallel processing) database for Apache Hadoop
    Apache Storm - computational platform for real-time analytics
    MongoDB - NoSQL database offering scalability and high performance
• TensorFlow - dataflow programming across a range of tasks
1.9 Familiarity with Example Applications