INTRODUCTION and M1-CH-1

Data science

Uploaded by

Aanya


DATA SCIENCE AND VISUALIZATION
21CS644
Module-1
INTRODUCTION
Data… Data… Everywhere!

• Lots of data is being collected and warehoused:
  • Web data, e-commerce
  • Purchases at department/grocery stores
  • Bank/credit card transactions
  • Social networks
• Data…
• Information…
HOW MUCH DATA?
▪ Google processes 20 PB a day (2017)
▪ Wayback Machine has 70 PB + 100 TB/month (12/2020)
▪ Facebook has 4 PB of user data + 500TB/day (1/2021)
▪ eBay has 100 PB of user data + 50 TB/day (4/2014)
▪ CERN’s Large Hadron Collider (LHC) generates 15 PB a year
CERN’S HADRON COLLIDER: LARGEST MACHINE IN THE WORLD
[Photo: Maximilien Brice, © CERN]

TYPES OF DATA

• Structured
• Unstructured
• Semi-structured

STRUCTURED DATA

• Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
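As a hedged illustration of a fixed format, the sketch below reads a small table in which every row has the same three columns. The CSV text, its column names, and the two sample rows are assumptions made for this example (the names are borrowed from the XML records later in these notes); only Python's standard `csv` and `io` modules are used.

```python
import csv
import io

# Hypothetical table: every row follows the same fixed schema (name, gender, age).
CSV_TEXT = """name,gender,age
Prashant Rao,Male,35
Seema R.,Female,41
"""

# DictReader maps each row onto the header, so every record has the same keys.
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
columns = list(rows[0].keys())

# Because the format is fixed, a column can be read and converted in one pass.
ages = [int(row["age"]) for row in rows]

print(columns)
print(ages)
```

Because the schema is known in advance, any column can be addressed by name without inspecting individual rows first.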
UNSTRUCTURED DATA

• Any data whose form or structure is unknown is classified as unstructured data.
• [Figure: example of unstructured data]

SEMI-STRUCTURED DATA

• Semi-structured data can contain both forms of data: it has no fixed schema, but it carries tags or markers that separate its elements.
• Example of semi-structured data: personal data stored in an XML file:

<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
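A minimal sketch of how the XML records above could be read with Python's standard `xml.etree.ElementTree`. The wrapping `<people>` root element is an assumption added here, because the fragment has no single root of its own and the parser needs one.

```python
import xml.etree.ElementTree as ET

RECORDS = """
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
"""

# Assumption: wrap the fragment in a hypothetical <people> root so it parses
# as one well-formed XML document.
root = ET.fromstring("<people>" + RECORDS + "</people>")

# Each <rec> becomes a dict keyed by its child tag names.
people = [{field.tag: field.text for field in rec} for rec in root.findall("rec")]

# The tags give the data some structure, but values arrive as text,
# so the age is converted explicitly.
for person in people:
    person["age"] = int(person["age"])

for person in people:
    print(person)
```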

What to do with these data?

• Aggregation and statistics
  • Data warehouse and OLAP
• Indexing, searching, and querying
  • Keyword-based search
  • Pattern matching (XML/RDF)
• Knowledge discovery
  • Data mining
  • Statistical modeling
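The first two items in the list above can be sketched on the four sample records from the XML example. This is only an illustration under stated assumptions: the records are retyped as Python dictionaries, and `keyword_search` is a hypothetical helper written for this note, not part of any library.

```python
from collections import Counter
from statistics import mean

# The four sample records from the XML example, retyped as dictionaries.
people = [
    {"name": "Prashant Rao", "gender": "Male", "age": 35},
    {"name": "Seema R.", "gender": "Female", "age": 41},
    {"name": "Satish Mane", "gender": "Male", "age": 29},
    {"name": "Subrato Roy", "gender": "Male", "age": 26},
]

# Aggregation and statistics: one summary number and one grouped count.
average_age = mean(p["age"] for p in people)
count_by_gender = Counter(p["gender"] for p in people)


def keyword_search(records, keyword):
    """Hypothetical helper: case-insensitive substring match on the name."""
    keyword = keyword.lower()
    return [r["name"] for r in records if keyword in r["name"].lower()]


matches = keyword_search(people, "rao")

print(average_age)
print(dict(count_by_gender))
print(matches)
```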
WHAT IS BIG DATA?

• Big data is a set of extremely large data, so complex and unorganized that it defies the common, easy data management methods that were designed and used before this rise in data.
• Big data sets can’t be processed with traditional database management systems and tools.
• They don’t fit into a regular database.

Job Opportunities in Data

• Data Analysts
  • Analyze and interpret data, visualize it, and build reports to help make better business decisions.
• Data Scientists
  • Mine data by assessing data sources and using algorithms and machine learning techniques.
• Data Architects
  • Design database systems and tools.
• Database Managers
  • Control database system performance, perform troubleshooting, and upgrade hardware and software.
• Big Data Engineers
  • Design, maintain, and support Big Data solutions.
DATA SCIENCE

• Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to extract useful information.
• The goal of data science is to gain knowledge from any type of data, both structured and unstructured.
• Data science is a term for a set of fields that focus on mining big data sets and discovering trends, methods, new insights, and processes.
• It works on data of any size.
• Some applications of data science are e-commerce, manufacturing, banking, health care, transport, and finance.
• Data science is a concept that unifies statistics, data analysis, and machine learning in order to understand actual phenomena with data.
• Data visualization tools and technologies are essential for analyzing massive amounts of information and making data-driven decisions.
• The concept of using pictures to understand data has been used for centuries. General types of data visualization are charts, tables, graphs, maps, and dashboards.
INTRODUCTION TO DATA SCIENCE
MODULE-1
What is Data Science?

• Over the past few years, there’s been a lot of hype in the media about “data science” and “Big Data.”
• Today, data rules the world. This has resulted in a huge demand for data scientists.
• A data scientist helps companies make data-driven decisions to improve their business.
• Data science is a field that deals with structured, semi-structured, and unstructured data.
• It involves practices like data cleansing, data preparation, data analysis, and much more.
• Data science combines statistics, mathematics, programming, and problem-solving with capturing data in ingenious ways.
• This umbrella term includes various techniques that are used when extracting insights and information from data.
BIG DATA

• Big data is a combination of structured, semi-structured, and unstructured data that organizations collect, analyze, and mine for information and insights.
• It's used in machine learning projects, predictive modeling, and other advanced analytics applications.
• Data analytics is the science of examining raw data to reach certain conclusions.
• Data analytics involves applying an algorithmic or mechanical process to derive insights and running through several data sets to look for meaningful correlations.
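One common way to "look for meaningful correlations" is the Pearson correlation coefficient. The sketch below is a plain-Python version under stated assumptions: the `ad_spend` and `sales` figures are invented for illustration, and `pearson` is a helper written for this note rather than a library function.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Co-variation of the two series around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical data: sales rise in a straight line with advertising spend.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

r = pearson(ad_spend, sales)
print(round(r, 3))
```

A value near +1 or -1 signals a strong linear relationship and a value near 0 a weak one. A high correlation alone does not show that one quantity causes the other.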
Big Data and Data Science Hype

• Data science enables companies not only to understand data from multiple sources but also to enhance decision making.
• As a result, data science is widely used in almost every industry, including health care, finance, marketing, banking, city planning, and more.
• There’s a lack of definitions around the most basic terminology.
  • What is “Big Data” anyway?
  • What does “data science” mean?
  • What is the relationship between Big Data and data science?
  • Is data science the science of Big Data?
• There’s a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the media describes it, machine learning algorithms were just invented last week and data was never “big” until Google came along. This is simply not the case. Many of the methods and techniques we’re using, and the challenges we’re facing now, are part of the evolution of everything that’s come before. This doesn’t mean that there’s not new and exciting stuff going on, but we think it’s important to show some basic respect for everything that came before.
• The hype is crazy: people throw around tired phrases straight out of the height of the pre-financial crisis era like “Masters of the Universe” to describe data scientists, and that doesn’t bode well. In general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what’s good underneath it all, if anything.
• Statisticians already feel that they are studying and working on the “Science of Data.” That’s their bread and butter. Maybe you, dear reader, are not a statistician and don’t care, but imagine that for the statistician, this feels a little bit like how identity theft might feel for you. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound as if it’s simply statistics or machine learning in the context of the tech industry.
• People have said to us, “Anything that has to call itself a science isn’t.” Although there might be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what it represents may not be science but more of a craft.

Getting Past the Hype (self-read)
Why now? – Datafication

What is Datafication?

• Datafication is the process of transforming information into digital data, enabling the collection, storage, analysis, and use of that data for specific purposes.
• In practice, datafication can be applied in all sorts of contexts, from collecting user information on digital platforms to capturing sensor data on IoT devices.

The importance of Datafication and its benefits

• Datafication is extremely important in an increasingly digitized and connected world, as it allows organizations and individuals to gain valuable information in order to make better-informed decisions.
• This involves knowing the information, designing scenarios, predicting trends and behaviors, and understanding what the future outcomes are, whether positive or not.
Benefits

• Improved decision making: Data collection and analysis can help organizations make informed decisions based on valuable evidence and insights. This can lead to better resource allocation and more effective actions.
• Operational efficiency: Datafication enables organizations to monitor and analyze data in real time, improving operational efficiency and reducing costs.
• Personalization: Data collection can help customize user experiences by providing personalized recommendations, offers, and relevant content. Examples include music recommendations on Spotify, movies on Netflix, and videos on YouTube, as well as ads on Google, Facebook, and Instagram.
• Risk forecasting and mitigation: Data analytics can help identify risks and predict future trends, allowing organizations to take steps to mitigate those risks before they occur.
• Innovation: Datafication can provide valuable insights for innovation and the development of new products and services. That is, new products can emerge based on user behavior and the identification of users' real needs.
The Current Landscape

• Data science is part of the computer sciences.
• It comprises the disciplines of:
  • i) analytics
  • ii) statistics
  • iii) machine learning
• Analytics generates insights from data using simple presentation, manipulation, calculation, or visualization of data.
• Statistics provides a methodological approach to answering the questions raised by analysts with a certain level of confidence.
• Machine learning is a subfield of artificial intelligence that uses algorithms trained on data sets to create models that enable machines to perform tasks.
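To make "algorithms trained on data sets to create models" concrete, here is a minimal sketch that fits a straight line by ordinary least squares and then uses it to predict an unseen value. It is one simple instance of the idea, not a general machine learning method. The `hours` and `scores` values are invented for illustration, and `fit_line` is a helper written for this note.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: study hours and exam scores.
hours = [1, 2, 3, 4]
scores = [52, 54, 56, 58]

# "Training" estimates the two model parameters from the data.
slope, intercept = fit_line(hours, scores)

# The trained model then performs the task: predicting a score for unseen input.
prediction = slope * 5 + intercept

print(slope, intercept, prediction)
```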
The Current Landscape (contd.)

• The technical or hard skills at the core of data science lie at the intersection of three areas. Working with only some of them leads into incomplete and dangerous territory:
  • The computing ability to get and handle data
  • The mathematical and statistical knowledge to process the data according to the advances of artificial intelligence
  • The domain expertise with which knowledge of the specific field is applied
• When this field is the financial sector, several peculiarities must be taken into account, which makes the zones where only two of the areas intersect especially risky.
The Current Landscape – Skillsets Needed

• Data scientist's skillset:
  • Statistics: traditional analysis
  • Data munging: the process of cleaning and transforming data prior to use or analysis
  • Visualization: graphs, tools
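Data munging, as defined in the list above, can be sketched as a small cleaning pass. The raw rows, their defects (stray whitespace, inconsistent capitalization, a missing age, a duplicate), and the `clean` helper are all assumptions invented for this illustration.

```python
# Hypothetical raw input with typical defects.
raw_rows = [
    {"name": "  Prashant Rao ", "age": "35"},
    {"name": "seema r.", "age": "41"},
    {"name": "Satish Mane", "age": ""},        # missing age
    {"name": "  Prashant Rao ", "age": "35"},  # duplicate of the first row
]


def clean(rows):
    """Normalize names, convert ages to int or None, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        # Collapse extra whitespace and use consistent capitalization.
        name = " ".join(row["name"].split()).title()
        age_text = row["age"].strip()
        # Keep a missing value explicit instead of guessing a number.
        age = int(age_text) if age_text.isdigit() else None
        key = (name, age)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "age": age})
    return cleaned


cleaned = clean(raw_rows)
for row in cleaned:
    print(row)
```

Missing values are kept as `None` rather than filled in, so later analysis can decide how to treat them.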
Skillsets Needed

• Data scientist: the job title emerged in 2008, coined by DJ Patil and Jeff Hammerbacher.
• Data science: the term was coined by Peter Naur in 1974.
• The data scientist job title describes working in a team on problems that require a hybrid skill set of statistics and computer science, paired with personal characteristics including curiosity and persistence.
• A data scientist is a hybrid of statistician, software engineer, and social scientist. This is true for a social product such as friend recommendation on Facebook.
• The definition of data scientist depends on the context.
Skillsets Needed (contd.)

• Expertise in the following fields is a requirement:
  • Computer science
  • Math
  • Statistics
  • Machine learning
  • Communication and presentation skills
  • Data visualization
  • Extensive domain expertise

Data Scientist vs Data Science Team Profiles

• A survey and clustering were used to define subfields of data science.
Define Data Science by Usage

• In academia:
  • An academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data and must grapple with computational problems posed by the structure, size, messiness, complexity, and nature of the data, while simultaneously solving a real-world problem.
Define Data Science by Usage: In Industry

• Data scientists often work with a team to complete projects. Typical activities include:
  • Designing, developing, and maintaining machine learning and other data models
  • Selecting, using, and debugging existing data models
  • Performing statistical and data analyses, often to make decisions about products
  • Conducting research to learn more about the field and to improve model accuracy, including meeting with and interviewing experts
Define Data Science by Usage

• In a general sense, a data scientist is someone:
  • Who knows how to extract meaning from data and interpret it, which requires tools and methods from statistics and machine learning as well as the human factor
  • Whose work involves data collection, cleaning, and munging, which requires persistence, statistics, and software engineering skills: skills for understanding biases in data and for debugging and logging output from code
Define Data Science by Usage

• In a general sense, a data scientist is someone (contd.):
  • Who performs exploratory data analysis (EDA), which data scientists use to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods
  • Who finds patterns and builds models and algorithms for different purposes
  • Who designs experiments, a critical part of data-driven decision making
  • Who communicates with team members, engineers, and leadership in clear language and with data visualizations
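As a hedged sketch of the exploratory data analysis step described above, the snippet below summarizes the main characteristics of one small column and draws a text histogram. The `ages` values are the four ages from the XML sample records. The choice of summary statistics and of ten-year bins is an assumption made for illustration.

```python
from collections import Counter
from statistics import mean, median, stdev

# Ages taken from the four XML sample records.
ages = [35, 41, 29, 26]

# Summarize the main characteristics of the column.
summary = {
    "count": len(ages),
    "min": min(ages),
    "max": max(ages),
    "mean": mean(ages),
    "median": median(ages),
    "stdev": round(stdev(ages), 2),
}
for key, value in summary.items():
    print(f"{key:>6}: {value}")

# A very small text "visualization": one bar per ten-year age band.
decades = Counter((age // 10) * 10 for age in ages)
for start in sorted(decades):
    print(f"{start}-{start + 9}: {'#' * decades[start]}")
```

On real data the same summary would usually be followed by proper charts. The text histogram only stands in for that step here.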
END OF M1-CH1
