
Ch 1 Introduction to Data Science

The document provides an overview of a course on the Foundation of Data Science, highlighting key concepts such as data science, big data, and the skills required for data scientists. It discusses the hype surrounding data science, its relevance across various industries, and the importance of datafication. Additionally, it introduces essential jargon and metrics used in data science, along with tools and frameworks that facilitate data analysis.

Course Overview

Foundation of Data Science : Adapted by Er. Suwarna Lingden, Assistant Professor, DoECE, Thapathali Campus, Institute of Engineering, Tribhuvan University 2
FOUNDATION OF DATA SCIENCE (ENCT202)

INTRODUCTION TO DATA SCIENCE


Session I : 1.1 Overview of Data Science

Learning Objectives
• To define data science

1.1 Overview of Data Science
Big Data and Data Science Hype
• Over the past few years, there’s been a lot of hype in the media about “data science”
and “Big Data.”
• What is “Big Data” anyway?
• What does “data science” mean?
• What is the relationship between Big Data and data science?
• Is data science the science of Big Data?
• Is data science only the stuff going on in companies like Google and Facebook and
tech companies?
• Why do many people refer to Big Data as crossing disciplines (astronomy, finance,
tech, etc.) and to data science as only taking place in tech?
• Just how big is big? Or is it just a relative term?
• The hype is extreme: people describe "Data Scientists" as “Masters of the
Universe”.
• But, the hype masks reality and increases the noise-to-signal ratio.
• Statisticians already feel that they are studying and working on the “Science of
Data.”
• For the statistician, this feels a little bit like how identity theft might feel for
anyone.
• Although data science is not just a rebranding of statistics or machine learning
but rather a field unto itself, the media often describes data science in a way
that makes it sound as if it’s simply statistics or machine learning in the
context of the tech industry.
Getting Past The Hype
• Around all the hype, there is a ring of truth: this is something new.

Why the hype now?


• We have massive amounts of data about many aspects of our lives, and,
simultaneously, an abundance of inexpensive computing power.
• Shopping, communicating, reading news, listening to music, searching for
information, expressing our opinions - all this is being tracked online, as most
people know.
• What people might not know is that the “datafication” of our offline behavior
has started as well, mirroring the online data collection revolution.

• It’s not just Internet data, though - it’s finance, the medical industry,
pharmaceuticals, bioinformatics, social welfare, government, education, retail,
and the list goes on.
• In some cases, the amount of data collected might be enough to be considered
“big”; in other cases, it’s not.
• But it’s not only the massiveness that makes all this new data interesting (or
poses challenges).
• It’s that the data itself, often in real time, becomes the building blocks of data
products.

• On the Internet, this means:
› Amazon recommendation systems
› Friend recommendations on Facebook
› Film and music recommendations, and so on.
• In finance, this means credit ratings, trading algorithms, and models.
• In education, this is starting to mean dynamic personalized learning and
assessments coming out of places like Khan Academy.
• In government, this means policies based on data.

Datafication
• Quantifying friendships with “likes”: it’s the way everything we do, online or
otherwise, ends up recorded for later examination in someone’s data storage
units or maybe multiple storage units, and maybe also for sale.
• Datafication is a process of “taking all aspects of life and turning them into
data.”
• Examples :
› Google’s augmented-reality glasses datafy the gaze.
› Twitter datafies stray thoughts.
› LinkedIn datafies professional networks.

1.1 Overview of Data Science : The Current Landscape
• So, what is data science?
• Is it new, or is it just statistics or analytics rebranded?
• Is it real, or is it pure hype?
• And if it’s new and if it’s real, what does that mean?

Figure : Drew Conway’s Venn diagram of data science


Link : The Data Science Venn Diagram — Drew Conway
• Nathan Yau’s 2009 post, “Rise of the Data Scientist”, lists the skills of a
data scientist, which include:
✓ Statistics (traditional analysis you’re used to thinking about)
✓ Data munging (parsing, scraping, and formatting data)
✓ Visualization (graphs, tools, etc.)

• But wait, is data science just a bag of tricks? Or is it the logical extension of
other fields like statistics and machine learning?
• Some argue that data science is just a rebranding and unwelcome takeover of
statistics.

• DJ Patil and Jeff Hammerbacher - then at LinkedIn and Facebook,
respectively - coined the term “data scientist” in 2008.
• So that is when “data scientist” emerged as a job title (Wikipedia finally
gained an entry on data science in 2012).
• Data scientists are asked to be "experts" in computer science, statistics,
communication, data visualization, and to have extensive domain expertise.
• Nobody is an expert in everything, which is why it makes more sense to
create teams of people who have different profiles and different expertise -
together, as a team, they can specialize in all those things.

1.1 Overview of Data Science : A Data Science Profile
• Computer science
• Mathematics
• Statistics
• Machine learning
• Domain expertise
• Communication and presentation skills
• Data visualization

1.1 Overview of Data Science : What is a Data Scientist really?
• One way to define data science is by its usage, i.e., by what data scientists get paid to do.
• In Academia, an academic data scientist is a scientist, trained in anything from social
science to biology, who works with large amounts of data, and must grapple with
computational problems posed by the structure, size, messiness, and the complexity and
nature of the data, while simultaneously solving a real world problem.
• In Industry, a data scientist is a multidisciplinary professional who extracts meaningful
insights from data by applying techniques from statistics, machine learning, and software
engineering, and who handles the entire data pipeline, from collecting, cleaning, and
transforming raw data to exploratory analysis, identifying patterns, and building models or
prototypes that drive decision-making and product development. Combining technical
expertise with a deep understanding of business needs, data scientists communicate
findings clearly through visualizations and storytelling, enabling data-driven strategies
while addressing biases and ensuring accuracy in their workflows.
Session II : 1.2 Jargons of Data Science

1.2 Jargons of Data Science
Jargon: Special words or expressions used by a profession or group that are difficult for others to understand.
Data
• Raw facts or discrete pieces of information without context or meaning.
• Processes (connected, categorized, calculated, corrected, condensed) are performed to
organize and refine the raw data.
• Example: Individual rainfall measurements for specific days.
Information
• Data that has been processed, analyzed, and given context to establish relationships
or patterns.
• Processes (contextualized, compared, action-informed, consequences, conversations)
are performed to transform data into usable insights or summaries.
• Example: Rainfall this week averaged 50 mm, higher than the monthly norm.
Knowledge
• Contextualized information that explains patterns, relationships, and causes,
leading to actionable insights.
• Reflects understanding gained through comparisons, actions, and experiences.
• Example: Prolonged heavy rainfall may lead to flooding in urban areas.
Wisdom
• The ability to apply knowledge effectively, guided by values, beliefs, and
experience, for sound decision-making.
• Embedding knowledge into practical decision-making frameworks or real-world
applications.
• Example: Initiate flood response measures when rainfall exceeds the weekly
threshold of 70 mm to prevent damage.
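The rainfall examples for data, information, knowledge, and wisdom can be traced in a few lines of Python. This is only an illustrative sketch: the daily readings, the monthly norm of 45 mm, the "at least four heavy days" rule, and the function name `flood_response_needed` are all assumptions made up for the example; only the 50 mm average and the 70 mm threshold come from the slides.

```python
from statistics import mean

# Data: raw daily rainfall readings in mm (hypothetical values).
daily_rainfall_mm = [62, 48, 55, 71, 40, 38, 36]

# Information: the readings are summarized and compared with a reference.
MONTHLY_NORM_MM = 45  # assumed norm, not from the slides
weekly_average = mean(daily_rainfall_mm)
above_norm = weekly_average > MONTHLY_NORM_MM

# Knowledge: a relationship learned from experience, here a toy rule that
# several heavy days in one week raise the risk of urban flooding.
heavy_days = sum(1 for reading in daily_rainfall_mm if reading > MONTHLY_NORM_MM)
elevated_flood_risk = above_norm and heavy_days >= 4

# Wisdom: a decision rule that turns the knowledge into action.
def flood_response_needed(rainfall_mm, threshold_mm=70):
    """Return True when rainfall exceeds the response threshold."""
    return rainfall_mm > threshold_mm

print(weekly_average, above_norm, elevated_flood_risk,
      flood_response_needed(weekly_average))
```

Each step adds context to the previous one: the numbers alone say nothing, the average says something, the rule explains why it matters, and the threshold says what to do.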
1.2 Jargons of Data Science : Types of Data in Data Science

Big Data: Extremely large datasets that are challenging to store, process, and
analyze using traditional methods.
Data Pipeline: The sequence of processes through which data is collected,
cleaned, transformed, and made ready for analysis.
ETL (Extract, Transform, Load): A process of collecting data from various
sources, cleaning and transforming it, and loading it into a storage system like a
database.
EDA (Exploratory Data Analysis): A preliminary step in data analysis to
summarize data characteristics using visualization and statistical methods.
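As a rough illustration of the ETL definition above, the sketch below extracts rows from a small CSV string, transforms them, and loads them into an in-memory SQLite database. The CSV content, the table name `rainfall`, and the three function names are hypothetical; a real pipeline would read from actual sources and write to a production store.

```python
import csv
import io
import sqlite3

# Hypothetical raw source: CSV text with stray whitespace and a missing value.
RAW_CSV = """day,rainfall_mm
Mon, 62
Tue,48
Wed,
Thu, 71
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: trim whitespace, drop rows with missing values, cast types."""
    cleaned = []
    for row in rows:
        value = row["rainfall_mm"].strip()
        if value:
            cleaned.append((row["day"].strip(), float(value)))
    return cleaned

def load(records, conn):
    """Load: write the cleaned records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS rainfall (day TEXT, rainfall_mm REAL)")
    conn.executemany("INSERT INTO rainfall VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
row_count = conn.execute("SELECT COUNT(*) FROM rainfall").fetchone()[0]
print(row_count)  # the row with the missing reading was dropped
```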

Different units of data for reference

A/B Testing: A statistical method to compare two versions of a product or
process.
Data Wrangling: The process of cleaning and transforming raw data into a
usable format.
Data Lake vs. Data Warehouse: A data lake stores raw data in its native
format, while a data warehouse stores processed and structured data for analysis.
Model Drift: The decline in a model's performance over time due to changes in
the underlying data.
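One common way to carry out an A/B test on conversion rates is a two-proportion z-test. The sketch below is a minimal version using only the standard library; the visitor and conversion counts are invented, and the function name `two_proportion_z_test` is an assumption for illustration. Other tests (for example a chi-squared test) may be more appropriate depending on the data.

```python
from math import erfc, sqrt

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare conversion rates of versions A and B.

    Returns the z statistic and the two-sided p-value under the null
    hypothesis that both versions share the same conversion rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return z, p_value

# Hypothetical experiment: 1000 visitors per version.
z, p_value = two_proportion_z_test(conv_a=100, n_a=1000, conv_b=130, n_b=1000)
print(round(z, 2), round(p_value, 3))
```

A small p-value suggests the difference between the two versions is unlikely to be due to chance alone; the cut-off (often 0.05) is a convention chosen before the experiment.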

1.2 Jargons of Data Science : Metrics
Accuracy: The ratio of correctly predicted observations to the total
observations.
Precision and Recall: Metrics for evaluating classification models, especially
in imbalanced datasets.
Confusion Matrix: A table that visualizes the performance of a classification
algorithm.
ROC Curve: A graph showing the performance of a classification model at
different threshold levels.
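The metrics above can be computed directly from the four cells of a binary confusion matrix. The following sketch uses made-up labels (1 = positive, 0 = negative) for an imbalanced dataset; the helper name `confusion_counts` is an assumption for the example.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical imbalanced data: only 2 of 10 observations are positive.
y_true = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp) if (tp + fp) else 0.0  # of predicted positives, how many are right
recall = tp / (tp + fn) if (tp + fn) else 0.0     # of actual positives, how many were found
print(accuracy, precision, recall)
```

Here accuracy looks good while precision and recall are only moderate, which is why the latter two are preferred for imbalanced datasets.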

1.2 Jargons of Data Science : Visualization
Dashboard: A visual interface summarizing data and analytics in charts and
graphs.
Heatmap: A graphical representation of data where values are represented by
varying intensities of color.

1.2 Jargons of Data Science : Tools and frameworks
TensorFlow/PyTorch: Popular deep learning frameworks.
Pandas/NumPy: Libraries in Python for data manipulation and numerical
computing.
Hadoop/Spark: Frameworks for distributed storage and processing of big data.
SQL: A language for querying and managing relational databases.

1.2 Jargons of Data Science : Data Processing
Feature Engineering: The process of creating new input features or modifying
existing ones to improve model performance.
Dimensionality Reduction: Reducing the number of variables (features) in a
dataset while preserving its essential information (e.g., PCA).
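Normalization: Rescaling features to a common range, typically [0, 1] (min-max scaling). The
related technique of standardization rescales data to have a mean of 0 and a standard deviation of 1.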
Imputation: Filling in missing data points with estimated values.
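A minimal sketch of imputation and two common scaling schemes follows: min-max scaling to [0, 1], and z-score scaling to mean 0 and standard deviation 1 (often called standardization). The sample values and function names are assumptions for illustration, mean imputation is only one of several strategies, and the scaling helpers assume the values are not all identical.

```python
from statistics import mean, pstdev

def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

raw = [10.0, None, 20.0, 30.0]   # hypothetical feature with one missing value
filled = impute_mean(raw)
scaled = min_max_scale(filled)
z_scores = standardize(filled)
print(filled, scaled, z_scores)
```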

Supervised Learning: A machine learning method where the model is trained
on labeled data.
Unsupervised Learning: A method where the model learns patterns from
unlabeled data.
Reinforcement Learning: A learning method where an agent learns to make
decisions by interacting with the environment and receiving rewards or
penalties.

Overfitting: When a model performs well on training data but poorly on unseen
data due to excessive complexity.
Underfitting: When a model is too simple and fails to capture the underlying
patterns in data.
Hyperparameters: Model settings that must be defined before training begins,
such as learning rate or the number of layers in a neural network.
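Overfitting, underfitting, and hyperparameters can be seen together in a toy k-nearest-neighbor regressor, where k is the hyperparameter. This is a sketch under assumptions: the data are synthetic (y = 2x plus noise), and the function names are invented for the example. With k = 1 the model memorizes the training set (zero training error); with k equal to the training-set size it predicts one constant for every input and misses the trend.

```python
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mean_squared_error(train, data, k):
    """Mean squared error of k-NN predictions on `data`, fitted on `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

def make_data(n):
    """Synthetic data: true pattern y = 2x with Gaussian noise added."""
    data = []
    for _ in range(n):
        x = random.uniform(0, 10)
        data.append((x, 2 * x + random.gauss(0, 1)))
    return data

random.seed(0)
train, test = make_data(40), make_data(40)

errors = {}
for k in (1, 5, len(train)):  # k is set before "training", not learned from data
    errors[k] = (mean_squared_error(train, train, k),
                 mean_squared_error(train, test, k))
    print(k, errors[k])
```

Comparing the training error with the error on unseen data for each k is the usual way to spot both problems: a large gap hints at overfitting, while high error on both hints at underfitting.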

Session III : 1.3 Modern Data Ecosystem

1.3 Modern Data Ecosystem
• A data ecosystem refers to the infrastructure, tools, and processes that work
together to store, manage, and analyze data.
• Building robust data ecosystems has become essential as organizations
continue to rely on data for decision-making.
• By 2025, it is predicted that 55% of IT organizations will adopt a data
ecosystem, streamlining vendor relationships and improving data management.
• This shift reduces costs and simplifies how businesses handle their data,
helping them gain more control and make better-informed decisions.

Definition of a data ecosystem
• A data ecosystem is a sophisticated, interconnected framework of tools,
processes, people, and technologies that work together to handle the entire data
lifecycle—from collection to analysis to actionable insights.
What is a modern data ecosystem?
• Modern Data Ecosystems refer to a more advanced or up-to-date version of a
data ecosystem.
• They typically emphasize cloud-native solutions, big data technologies, artificial
intelligence (AI), machine learning (ML), and real-time analytics.
• They often focus on scalability, agility, and the use of technologies that allow for
more dynamic and efficient data processing compared to older, more traditional data
ecosystems.
1.3 Traditional vs Modern Data Ecosystem
| Category | Traditional Data Ecosystem | Modern Data Ecosystem |
| Architecture & Infrastructure | Traditional on-premise systems, relational databases. Silos between systems. | Cloud-native, scalable, distributed. Seamless integration across platforms. |
| Data Management & Processing | Focus on structured data. Batch processing common. Manual governance. | Supports structured, semi-structured, and unstructured data. Real-time processing. Automated governance. |
| Analytics & Insights | Descriptive and diagnostic analytics. Historical data focus. | Predictive and prescriptive analytics. AI/ML-driven insights, real-time data analysis. |
| Integration & Flexibility | Complex ETL processes to unify disparate systems. Less flexible. | API-driven, real-time integration. Modular and adaptable to new data sources. |
| Scalability & Agility | Limited scalability, costly infrastructure expansion. | Built for scalability, especially in cloud environments. Easily handles big data growth. |

| Category | Traditional Data Ecosystem | Modern Data Ecosystem |
| Security & Governance | Siloed security controls. Manual governance processes. | Centralized, automated governance with advanced security features (encryption, compliance). |
| User Access & Collaboration | Restricted access, with limited user visibility across datasets. | Democratized data access through self-service tools. Enhanced cross-team collaboration. |
| Cost Efficiency | High maintenance costs for infrastructure and storage. | More cost-efficient with cloud pay-as-you-go models and optimized storage. |
| Data Sources | Primarily traditional databases, structured data. | Includes IoT, social media, logs, structured and unstructured data. |
| Technology Stack | Legacy tools, on-prem BI systems, data warehouses. | Cloud platforms (e.g., Azure, AWS), data lakes, advanced BI tools like Power BI, AI/ML frameworks. |

1.3 Modern Data Ecosystem : its key components
• Data sources
• Data storage and infrastructure
• Data integration tools
• Data processing and analytics
• Data governance

Data sources
• Data sources are at the foundation of every data analytics ecosystem.
• These are the origins of the raw data that feeds into the ecosystem, and they
come from a variety of platforms, such as:
– Internal data sources like CRM software, ERP systems, and transactional
databases.
– External sources such as social media platforms, websites, third-party APIs,
and market data.
– IoT devices, sensors, & other machines that generate real-time data in
industries.
Data storage and infrastructure
• Data warehouses: Perfect for structured data, these centralized hubs make it easy to
store huge amounts of data, which can be quickly accessed for reports and analysis.
• Data lakes: If you’re dealing with a mix of structured and unstructured data, data
lakes are ideal. They allow you to keep raw data in its original form until you need it,
offering a more flexible approach.
– While data lakes offer flexibility for raw data storage, it’s essential to maintain
metadata (data about data) catalogs and enforce governance policies to prevent
them from becoming disorganized “data swamps”.
• Cloud storage: Platforms like AWS, Microsoft Azure, and Google Cloud offer
storage that grows with your business while remaining secure and cost-efficient.
Data Integration Tools
• ETL (Extract, Transform, Load): A traditional method where data is extracted from
sources, transformed into a usable format, and loaded into a storage system.
• ELT (Extract, Load, Transform): A more modern approach where raw data is loaded into
a system and then transformed on-demand.
– ELT has gained popularity in modern cloud-based ecosystems due to its flexibility and
efficiency.
– Unlike traditional ETL, where transformations are performed on separate systems before
loading, ELT allows for raw data to be loaded into scalable cloud data storage like data
lakes.
– The data is then transformed as needed using distributed computing frameworks, such
as Databricks. This approach significantly reduces time-to-insight and is more suited for
handling big data and unstructured datasets.
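The ELT ordering can be sketched with an in-memory SQLite database standing in for a cloud data store (an assumption made purely so the example runs anywhere). Raw records are loaded untouched into a staging table, and the transformation is expressed as a SQL view that runs only when queried. The table, view, and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw records as-is (text, including bad rows).
raw_events = [("2024-01-01", " 62 "), ("2024-01-02", ""), ("2024-01-03", "71")]
conn.execute("CREATE TABLE raw_rainfall (day TEXT, reading TEXT)")
conn.executemany("INSERT INTO raw_rainfall VALUES (?, ?)", raw_events)

# Transform: defined inside the storage system and applied on demand.
conn.execute("""
    CREATE VIEW clean_rainfall AS
    SELECT day, CAST(TRIM(reading) AS REAL) AS rainfall_mm
    FROM raw_rainfall
    WHERE TRIM(reading) <> ''
""")

clean_rows = conn.execute(
    "SELECT day, rainfall_mm FROM clean_rainfall ORDER BY day"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_rainfall").fetchone()[0]
print(clean_rows, raw_count)
```

Because the raw table is kept, a different transformation can be defined later without re-extracting from the source, which is the flexibility the ELT approach is valued for.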
Data processing and analytics
• Modern data ecosystems depend on analytics tools to extract insights, identify
trends, and generate reports from the accumulated data.
– Big data analytics: Tools such as Tableau, Power BI, and Google Analytics help
analyze large datasets in real time, identifying patterns and trends that inform
decision-making.
– AI and Machine Learning: Advanced analytics that enable businesses to predict
outcomes, automate processes, and gain deeper insights from their data that would
be impossible for humans to detect manually.
– Business Intelligence (BI): Platforms that provide visualizations, dashboards, and
reporting tools for decision-makers to interpret data effectively.
With the help of data analytics and BI tools, businesses can move beyond descriptive analytics (what happened)
to predictive (what will happen) and prescriptive (what should happen) analytics.

Data governance
• Data governance is all about making sure that data is managed the right way—kept secure,
well-organized, and compliant with important regulations like GDPR* or CCPA*.
• It starts with data privacy and security, where safeguards are put in place to protect sensitive
information from breaches or unauthorized access.
• Then there’s data quality, which ensures data accuracy, consistency, and reliability, because if
the data is off, the decisions based on it will be off, too.
• Finally, compliance plays a big role in ensuring businesses follow the rules, protecting the
company and its customers.
• All these elements together create a strong data governance framework that protects and
maximizes the value of a company’s data.
• Data governance builds trust in the system by focusing on privacy, quality, and compliance
and keeps everything running smoothly within ethical and legal boundaries.
*GDPR : General Data Protection Regulation *CCPA : California Consumer Privacy Act

Session III : 1.4 Data science lifecycle

1.4 Data Science Lifecycle : The Data Science Process

Real World
– Inside the Real World are lots of people busy at various activities. Some
people are using Google+, others are competing in the Olympics; there are
spammers sending spam, etc.
Raw Data Collection
– Logs, Olympics records, Enron Corporation employee emails, or recorded
genetic material, etc.
Data processing
– Processing data to make it clean for analysis. So we build and use pipelines of
data munging: joining, scraping, wrangling using tools such as Python, R, or
SQL, or all of the above.

Building models
• Next, we design our model to use some algorithm like k-nearest neighbors
(k-NN), linear regression, Naive Bayes, or something else.
• The model we choose depends on the type of problem we’re trying to solve,
of course, which could be a classification problem, a prediction problem, or
a basic description problem.
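To make the model-building step concrete, here is a minimal k-NN classifier for a classification problem, written with the standard library only. The two-feature points, the "ham"/"spam" labels, and the function name `knn_classify` are assumptions invented for this sketch; in practice a library implementation would normally be used.

```python
from collections import Counter
from math import dist

def knn_classify(training_points, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    nearest = sorted(training_points, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical training data: (feature vector, label).
training_points = [
    ((1.0, 1.0), "ham"), ((1.5, 2.0), "ham"), ((2.0, 1.0), "ham"),
    ((6.0, 6.0), "spam"), ((7.0, 5.5), "spam"), ((6.5, 7.0), "spam"),
]

label_near_ham = knn_classify(training_points, (1.2, 1.4))
label_near_spam = knn_classify(training_points, (6.4, 6.1))
print(label_near_ham, label_near_spam)
```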
Communicate, Visualizations and Report Findings
• We then can interpret, visualize, report, or communicate our results.
• This could take the form of reporting the results up to our boss or coworkers,
or publishing a paper in a journal and going out and giving academic talks
about it.
Clean data
• Eventually we get the data down to a nice format, like something with
columns: | name | event | year | gender | event time |
This is where you typically start in a standard statistics class, with a clean, orderly dataset.
But it’s not where you typically start in the real world.
Exploratory Data Analysis (EDA)
• Once we have this clean dataset, we should be doing some kind of EDA.
• In the course of doing EDA, we may realize that it isn’t actually clean because
of duplicates, missing values, or absurd outliers. If that’s the case, we may have
to go back to collect more data, or spend more time cleaning the dataset.
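The three problems EDA tends to surface (duplicates, missing values, absurd outliers) can be checked with a few lines of Python. The records below are invented, following the column layout | name | event | year | gender | event time |, and the "more than three times the median" outlier rule is a crude assumption chosen only for illustration.

```python
from statistics import median

# Hypothetical records: (name, event, year, gender, event_time_seconds).
# None marks a missing value.
records = [
    ("A. Runner", "100m", 2012, "F", 10.9),
    ("B. Sprinter", "100m", 2012, "F", 11.1),
    ("B. Sprinter", "100m", 2012, "F", 11.1),   # exact duplicate
    ("C. Dasher", "100m", 2012, "F", None),     # missing event time
    ("D. Flyer", "100m", 2012, "F", 110.0),     # suspicious value, likely a typo
    ("E. Racer", "100m", 2012, "F", 11.3),
]

unique_records = set(records)
duplicates = len(records) - len(unique_records)
missing = sum(1 for row in records if any(value is None for value in row))

# Flag times far above the median (assumed rule: more than 3x the median).
times = [row[4] for row in unique_records if row[4] is not None]
outliers = [t for t in times if t > 3 * median(times)]
print(duplicates, missing, outliers)
```

Any non-zero count here is a signal to go back to cleaning or collection before modeling.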
Build Data Product
• Alternatively, our goal may be to build or prototype a “data product”; e.g., a
spam classifier, or a search ranking algorithm, or a recommendation system.
• Now the key here that makes data science special and distinct from statistics is
that this data product then gets incorporated back into the real world, and users
interact with that product, and that generates more data, which creates a
feedback loop.

A Data Scientist’s Role in This Process
– Someone has to make the decisions about what data to collect, and why.
– That person needs to be formulating questions and hypotheses and making a plan for
how the problem will be attacked.
– That someone is the data scientist or our beloved data science team.

Session III : 1.5 Trends, markets and application of data science

1.5 Trends, markets and application of data science

