
INTRODUCTION and M1-CH-1

Data science

Uploaded by

Aanya
DATA SCIENCE AND

VISUALIZATION
21CS644
Module-1
INTRODUCTION
Data… Data… Everywhere!

 Lots of data is being collected and warehoused
 Web data, e-commerce
 Purchases at department/grocery stores
 Bank/credit card transactions
 Social networks
 DATA….

 INFORMATION…
HOW MUCH DATA?
▪ Google processes 20 PB a day (2017)
▪ Wayback Machine has 70 PB + 100 TB/month (12/2020)
▪ Facebook has 4 PB of user data + 500 TB/day (1/2021)
▪ eBay has 100 PB of user data + 50 TB/day (4/2014)
▪ CERN’s Large Hadron Collider (LHC) generates 15 PB a year
CERN’S HADRON COLLIDER: LARGEST MACHINE IN THE WORLD

Maximilien Brice, © CERN


TYPES OF DATA

 Structured
 Unstructured
 Semi-structured
STRUCTURED DATA

 Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data.
UNSTRUCTURED
 Any data with unknown form or structure is classified as unstructured data.
 Example:
SEMI-STRUCTURED

 Semi-structured data can contain both forms of data: it carries some organizing markers, such as tags, but does not follow a fixed format.

 Examples Of Semi-structured Data


 Personal data stored in an XML file-

<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
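
As a minimal sketch of how such semi-structured records might be read into a program, the example below parses the four records with Python's standard `xml.etree.ElementTree` module. The wrapping `<people>` root element and the `parse_records` helper are assumptions added for illustration (a well-formed XML document needs a single root); they are not part of the original slide.

```python
import xml.etree.ElementTree as ET

# The four <rec> elements from the slide, wrapped in a hypothetical
# <people> root element so that the text is one well-formed XML document.
XML_TEXT = """<people>
<rec><name>Prashant Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
</people>"""


def parse_records(xml_text):
    """Return one dict per <rec> element, with the age converted to int."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("rec"):
        records.append({
            "name": rec.findtext("name"),
            "gender": rec.findtext("gender"),
            "age": int(rec.findtext("age")),
        })
    return records


records = parse_records(XML_TEXT)
average_age = sum(r["age"] for r in records) / len(records)
print(records[0]["name"], average_age)
```

The tags give each value a label, so the records can be turned into structured rows even though no fixed table layout was declared in advance.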

What to do with these data?

 Aggregation and Statistics
 Data warehouse and OLAP
 Indexing, Searching, and Querying
 Keyword based search
 Pattern matching (XML/RDF)
 Knowledge discovery
 Data Mining
 Statistical Modeling
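
The first two kinds of operation listed above, aggregation and keyword-based search, can be sketched on a tiny in-memory data set. The purchase records and both helper functions below are invented for illustration only; a real system would run such queries in a data warehouse or search index rather than over a Python list.

```python
from collections import defaultdict

# Hypothetical purchase records, invented for illustration only.
purchases = [
    {"store": "grocery", "item": "rice", "amount": 40.0},
    {"store": "grocery", "item": "milk", "amount": 10.0},
    {"store": "department", "item": "winter jacket", "amount": 120.0},
    {"store": "department", "item": "rain jacket", "amount": 80.0},
]


def total_by_store(rows):
    """Aggregation: sum the purchase amounts for each store type."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    return dict(totals)


def keyword_search(rows, keyword):
    """Keyword search: keep rows whose item text contains the keyword."""
    keyword = keyword.lower()
    return [row for row in rows if keyword in row["item"].lower()]


totals = total_by_store(purchases)
jackets = keyword_search(purchases, "jacket")
print(totals)
print([row["item"] for row in jackets])
```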
WHAT IS BIG DATA?

 Big data is a set of data so large, complex, and unorganized that it defies the common data management methods that were designed and used before this rise in data.

 Big data sets can’t be processed in traditional database management systems and tools.

 They don’t fit into a regular database network.


Job Opportunities in Data

 Data Analysts
 analyze and interpret data, visualize it, and build reports to help make better business decisions.
 Data Scientists
 mine data by assessing data sources and using algorithms and machine learning techniques.
 Data Architects
 design database systems and tools.
 Database Managers
 control database system performance, perform troubleshooting, and upgrade hardware and software.
 Big Data Engineers
 design, maintain, and support Big Data solutions.
DATA SCIENCE

 Data science is the study of data. It involves developing methods of recording, storing, and analyzing data to extract useful information.
 The goal of data science is to gain knowledge from any type of data, both structured and unstructured.
 Data science is a term for a set of fields that are focused on mining big data sets and discovering trends, methods, new insights, and processes.
 It works on any size of data.
 Some of the applications of data science are in e-commerce, manufacturing, banking, health care, transport, finance, etc.
 Data science is a concept that unifies statistics, data analysis, and machine learning in order to understand actual phenomena with data.
 Data visualization tools and technologies are essential to
analyze massive amounts of information and make data-
driven decisions.
 The concept of using pictures to understand data has been
used for centuries. General types of data visualizations
are Charts, Tables, Graphs, Maps, Dashboards
INTRODUCTION TO DATA SCIENCE
MODULE-1
What is Data Science?

 Over the past few years, there’s been a lot of hype in the media about
“data science” and “Big Data.”

 Today, data rules the world. This has resulted in a huge demand for data scientists.
 A data scientist helps companies make data-driven decisions to improve their business.
 Data science is a field that deals with structured, semi-structured, and unstructured data.
 It involves practices like data cleansing, data preparation, data analysis, and much more.
 Data science is the combination of statistics, mathematics, programming, and problem-solving, together with capturing data in ingenious ways.
 This umbrella term includes various techniques that are used when extracting insights and information from data.
BIG DATA

 Big data is a combination of structured, semi-structured and unstructured data that organizations collect, analyze and mine for information and insights.

 It's used in machine learning projects, predictive modeling and other advanced analytics applications.
 Data analytics is the science of examining raw data to reach certain conclusions.

 Data analytics involves applying an algorithmic or mechanical process to derive insights and running through several data sets to look for meaningful correlations.
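
One simple way to look for a correlation between two data sets is the Pearson correlation coefficient. The sketch below computes it from first principles; the `pearson` helper and the three small series are hypothetical and were chosen only so that the result is easy to check by hand.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical series, invented for illustration only.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.0, 4.0, 6.0, 8.0, 10.0]      # rises in step with ad_spend
discount = [5.0, 4.0, 3.0, 2.0, 1.0]    # falls as ad_spend rises

r_sales = pearson(ad_spend, sales)
r_discount = pearson(ad_spend, discount)
print(r_sales, r_discount)
```

A value near +1 or -1 indicates a strong linear relationship, and a value near 0 indicates little or none. A correlation on its own does not show that one quantity causes the other.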
Big Data and Data Science Hype

 Data science enables companies not only to understand data from multiple sources but also to enhance decision making.

 As a result, data science is widely used in almost every industry, including health care, finance, marketing, banking, city planning, and more.
 There’s a lack of definitions around the most basic terminology.

 What is “Big Data” anyway?


 What does “data science” mean?
 What is the relationship between Big Data and data science?
 Is data science the science of Big Data?
 There’s a distinct lack of respect for the researchers in academia and industry labs who have been working on this kind of stuff for years, and whose work is based on decades (in some cases, centuries) of work by statisticians, computer scientists, mathematicians, engineers, and scientists of all types. From the way the media describes it, machine learning algorithms were just invented last week and data was never “big” until Google came along. This is simply not the case. Many of the methods and techniques we’re using, and the challenges we’re facing now, are part of the evolution of everything that’s come before. This doesn’t mean that there’s not new and exciting stuff going on, but we think it’s important to show some basic respect for everything that came before.
 The hype is crazy: people throw around tired phrases straight out of the height of the pre-financial crisis era like “Masters of the Universe” to describe data scientists, and that doesn’t bode well.
 In general, hype masks reality and increases the noise-to-signal ratio. The longer the hype goes on, the more many of us will get turned off by it, and the harder it will be to see what’s good underneath it all, if anything.
 Statisticians already feel that they are studying and working on the “Science of Data.” That’s their bread and butter. Maybe you, dear reader, are not a statistician and don’t care, but imagine that for the statistician, this feels a little bit like how identity theft might feel for you. Although we will make the case that data science is not just a rebranding of statistics or machine learning but rather a field unto itself, the media often describes data science in a way that makes it sound as if it’s simply statistics or machine learning in the context of the tech industry.
 People have said to us, “Anything that has to call itself a science isn’t.” Although there might be truth in there, that doesn’t mean that the term “data science” itself represents nothing, but of course what it represents may not be science but more of a craft.
Getting Past the Hype (self-read)
Why now? – Datafication

What is Datafication?

 Datafication is the process of transforming information into digital data, by enabling the collection, storage, analysis, and use of that data for specific purposes.
 In practice, datafication can be applied in all sorts of different contexts, from collecting user information on digital platforms to capturing sensor data on IoT devices.
The importance of Datafication and its benefits

 Datafication is extremely important in an increasingly digitized and connected world, as it allows organizations and individuals to gain valuable information in order to make better-informed decisions.
 This involves knowing the information, designing scenarios, predicting trends and behaviors, as well as understanding what the future outcomes are, whether positive or not.
Benefits

• Improved decision making


• Operational efficiency
• Personalization
• Risk forecasting and mitigation
• Innovation
• Improved decision making: Data collection and analysis can help
organizations make informed decisions based on valuable evidence
and insights. This can lead to better resource allocation and more
effective actions.
• Operational efficiency: Datafication enables organizations to
monitor and analyze data in real time, improving operational
efficiency and reducing costs.
• Personalization: Data collection can help customize user experiences by providing personalized recommendations, offers, and relevant content. This is the case with Spotify music recommendations, movies on Netflix and videos on YouTube, but also Google, Facebook and Instagram ads.

• Risk forecasting and mitigation: Data analytics can help identify risks and predict future trends, allowing organizations to take steps to mitigate those risks before they occur.

• Innovation: Lastly, datafication can provide valuable insights for innovation and the development of new products and services. That is, new products can emerge based on user behavior and identification of their real needs.
The Current Landscape

 Data science is part of the computer sciences.
 It comprises the disciplines of
 i) analytics
 ii) statistics and
 iii) machine learning.
 Analytics generates insights from data using simple presentation, manipulation, calculation or visualization of data.
 Statistics provides a methodological approach to answer questions raised by the analysts with a certain level of confidence.
 Machine learning is a subfield of artificial intelligence that uses algorithms trained on data sets to create models that enable machines to perform tasks.
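
To make "algorithms trained on data sets to create models" concrete, here is a minimal sketch of one of the simplest possible models: a straight line fitted by ordinary least squares. The `fit_line` helper and the study-hours data are assumptions made up for this example, and real projects would normally use a library rather than hand-written formulas.

```python
def fit_line(xs, ys):
    """Train a one-variable linear model y = slope * x + intercept
    using ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: hours studied and exam score.
hours = [1.0, 2.0, 3.0, 4.0]
scores = [52.0, 54.0, 56.0, 58.0]

# "Training" estimates the two parameters from the data set.
slope, intercept = fit_line(hours, scores)

# The fitted model can then perform a task: predicting an unseen case.
predicted = slope * 5.0 + intercept
print(slope, intercept, predicted)
```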
The Current Landscape
 The technical (hard) skills at the core of data science lie at the intersection of three areas. Working with only some of them leaves a practitioner in incomplete and dangerous territory:
 Computing skills to obtain and handle data; mathematical and statistical knowledge to process the data in line with advances in artificial intelligence; and domain expertise to apply that knowledge in the specific field.
 When the field is the financial sector, several peculiarities must be taken into account, which makes the zones where only two of the three areas overlap especially risky.
The Current Landscape – Skillsets Needed

 Data Scientist's Skillset:
 Statistics – Traditional Analysis
 Data Munging – the process of cleaning and transforming data prior to use or analysis.
 Visualization – Graphs, tools
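
Data munging, as defined in the list above, can be illustrated with a small cleaning step. In this sketch the messy values (stray spaces, inconsistent gender labels, a missing age and an unparseable age) are invented for illustration, and `clean_row` is a hypothetical helper rather than a standard function.

```python
# Hypothetical raw rows with typical problems; the messy values are invented.
raw_rows = [
    {"name": "  Prashant Rao ", "gender": "male", "age": "35"},
    {"name": "Seema R.", "gender": "FEMALE", "age": " 41"},
    {"name": "Satish Mane", "gender": "Male", "age": ""},            # missing
    {"name": "Subrato Roy", "gender": "M", "age": "twenty-six"},     # not a number
]

GENDER_LABELS = {"m": "Male", "f": "Female"}


def clean_row(row):
    """Trim whitespace, normalize gender labels, and convert age to int.

    Values that cannot be interpreted are stored as None instead of guessed.
    """
    first_letter = row["gender"].strip().lower()[:1]
    try:
        age = int(row["age"].strip())
    except ValueError:
        age = None
    return {
        "name": row["name"].strip(),
        "gender": GENDER_LABELS.get(first_letter),
        "age": age,
    }


cleaned = [clean_row(row) for row in raw_rows]
complete = [row for row in cleaned if row["age"] is not None]
print(cleaned)
print(len(complete), "rows have a usable age")
```

Recording unusable values as `None` keeps the decision about how to handle them (drop, impute, or investigate) separate from the cleaning step itself.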
Skillsets Needed
 Data Scientist – term coined by DJ Patil and Jeff Hammerbacher in 2008, marking the emergence of the job title.
 Data Science – term coined by Peter Naur in 1974.
 Data Scientist job title – working in a team on problems that require a hybrid skill set of statistics and computer science, paired with personal characteristics including curiosity and persistence.
 Data Scientist – a hybrid of statistician, software engineer, and social scientist. This is true in the case of a social product such as friend recommendation on Facebook.
 The definition of a data scientist depends on the context.
Skillsets Needed

 Expertise in the following fields is a requirement


 Computer Science
 Math
 Statistics
 Machine Learning
 Communication and Presentation Skills
 Data Visualization
 Extensive Domain Expertise
Data Scientist vs Data Science Team Profiles
 A survey was taken and clustering was used to define subfields of data science.
Define Data Science by Usage

 In Academics:
 An academic data scientist is a scientist, trained in anything from social science to biology, who works with large amounts of data and must grapple with computational problems posed by the structure, size, messiness, complexity, and nature of the data, while simultaneously solving a real-world problem.
Define Data Science by Usage – In Industry:

 Data scientists often work with a team to complete projects. Typical activities include:
 Design, develop, and maintain machine learning and other data models
 Select, use, and debug existing data models

 Perform statistical and data analyses, often to make decisions about products.

 Conduct research to learn more about the field and to improve model accuracy,
including meeting with and interviewing experts
Define Data Science by Usage

 In a general sense:

 A data scientist is someone:
 Who knows how to extract meaning from data and interpret it, which requires tools and methods from statistics and machine learning as well as the human factor.
 Whose work involves data collection, cleaning, and munging, which requires persistence, statistics, and software engineering skills. These skills are also needed for understanding biases in data and for debugging and logging output from code.
Define Data Science by Usage

 In a general sense:
 A data scientist is someone (contd.):
 Who performs Exploratory Data Analysis (EDA), which is used to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.
 Who finds patterns and builds models and algorithms for different purposes.
 Who designs experiments, as this is a critical part of data-driven decision making.
 Who communicates with team members, engineers, and leadership in clear language and with data visualizations.
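
A first pass at exploratory data analysis usually means summarizing a column and looking at its distribution. The sketch below does this for the four ages from the XML example earlier in these notes, using only Python's standard library; the `summarize` and `decade_histogram` helpers are hypothetical names, and a text histogram stands in for a real chart.

```python
import statistics
from collections import Counter

# Ages taken from the XML records shown earlier in these notes.
ages = [35, 41, 29, 26]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


def decade_histogram(values):
    """Count how many values fall into each decade (20s, 30s, ...)."""
    return dict(Counter((v // 10) * 10 for v in values))


summary = summarize(ages)
histogram = decade_histogram(ages)

print(summary)
# A very small text "visualization": one '#' per person in each decade.
for decade in sorted(histogram):
    print(f"{decade}s | {'#' * histogram[decade]}")
```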
END OF M1-CH1
