Module 1: Foundations of Data Science
Introduction to Data Science
Data science is an interdisciplinary field that involves using scientific
methods, algorithms, processes, and systems to extract knowledge and
insights from structured and unstructured data. It combines elements from
statistics, computer science, domain knowledge, and data engineering to
solve complex data-driven problems.
Key facts about data science:
Data science involves collecting, cleaning, and analyzing large volumes
of data to gain insights and make informed decisions.
It encompasses various techniques, including data exploration, statistical
modeling, machine learning, and data visualization.
Data scientists often work with programming languages like Python and
R, as well as tools such as Jupyter notebooks and data visualization
libraries.
Data Science Lifecycle
The data science lifecycle is a series of stages that data scientists follow
when working on a data-driven project. These stages typically include:
Problem Definition: Understand the problem you need to solve and define
clear objectives. Identify what data is needed to address the problem.
Data Collection: Gather relevant data from various sources, such as
databases, APIs, or files. This may involve data scraping, data cleaning,
and data integration.
Data Exploration: Explore the dataset to understand its characteristics,
such as data types, distributions, and missing values. Visualization plays
a crucial role in this stage.
Data Preprocessing: Clean and prepare the data for analysis. This
includes handling missing values, encoding categorical variables, and
scaling or normalizing numerical features.
Feature Engineering: Create new features or transform existing ones to
improve the performance of machine learning models.
Model Building: Select appropriate machine learning algorithms and train
predictive models using the prepared data (see the short code sketch after this list).
Model Evaluation: Assess the performance of the models using metrics
like accuracy, precision, recall, or F1-score. Fine-tune models as needed.
Model Deployment: Deploy the trained model to production, making it
accessible for predictions in real-time applications.
Monitoring and Maintenance: Continuously monitor model performance
and retrain as necessary to adapt to changing data patterns.
Communication and Reporting: Share insights and results with
stakeholders through reports, dashboards, or presentations.
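To make the model building and evaluation stages concrete, here is a minimal sketch using scikit-learn; the dataset (scikit-learn's bundled iris data), the model choice, and the metric are illustrative assumptions standing in for a real project's data and requirements.
# A minimal sketch of the model building and evaluation stages,
# using scikit-learn's bundled iris dataset as a stand-in for real project data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out a test set so evaluation reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Model building: fit a simple classifier on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation: score predictions on the held-out test set
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))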
Python for Data Science
Python is a widely used programming language in the field of data
science due to its simplicity, extensive libraries, and vibrant community.
Here are some fundamental concepts and data structures in Python
relevant to data science:
Data Types: Python supports various data types, including integers,
floats, strings, booleans, and complex numbers.
Integers (int): Whole numbers, e.g., 5, -10.
Floats (float): Numbers with a decimal point, e.g., 3.14, -0.5.
Strings (str): Sequences of characters enclosed in single or double quotes,
e.g., "Hello, World!".
Booleans (bool): Represents either True or False.
Operators:
Arithmetic Operators: + (addition), - (subtraction), * (multiplication), /
(division), % (modulo), ** (exponentiation).
Comparison Operators: == (equal to), != (not equal to), < (less than), >
(greater than), <= (less than or equal to), >= (greater than or equal to).
Logical Operators: and (logical AND), or (logical OR), not (logical NOT).
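As a quick illustration of how these three groups of operators combine in expressions (the variable names and values here are arbitrary):
a, b = 10, 3
print(a + b, a - b, a * b)   # 13 7 30
print(a / b)                 # 3.3333333333333335 (true division)
print(a % b, a ** b)         # 1 1000 (remainder and exponentiation)
print(a > b, a == b)         # True False
print(a > 0 and b > 0)       # True
print(not (a < b))           # True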
Control Statements:
if-else Statements: Conditional execution based on a condition.
if condition:
    # code to run if condition is True
else:
    # code to run if condition is False
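For example, with a concrete condition (the variable and threshold are illustrative):
score = 72
if score >= 50:
    print("Pass")
else:
    print("Fail")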
Loops:
for Loops: Iterating over a sequence (e.g., a list, tuple, or range).
for item in sequence:
    # code to repeat for each item
while Loops: Repeating a block of code while a condition is True.
while condition:
    # code to repeat as long as condition is True
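A concrete version of both loop forms (the list and counter values are illustrative):
fruits = ['apple', 'banana', 'cherry']
for fruit in fruits:
    print(fruit)      # prints each fruit in order

count = 0
while count < 3:
    print(count)      # prints 0, 1, 2
    count += 1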
Functions:
Functions allow you to encapsulate a block of code into a reusable unit.
You can define functions using the def keyword and call them by name.
def add(a, b):
    return a + b

result = add(2, 3)  # Calls the add function and assigns the result to 'result'.
Data Structures:
Lists: Ordered collections of items that can be of different data types.
Lists are mutable, meaning you can change their contents.
my_list = [1, 2, 3, 'apple']
Tuples: Similar to lists but immutable, meaning you can't change their
contents after creation.
my_tuple = (1, 2, 3, 'banana')
Sets: Unordered collections of unique elements. Sets are useful for tasks
that involve checking for membership or eliminating duplicates.
my_set = {1, 2, 3, 4, 4} # Creates a set with unique elements: {1, 2, 3, 4}
Dictionaries: Collections of key-value pairs. You can use keys to access
values.
my_dict = {'name': 'Alice', 'age': 30, 'city': 'New York'}
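A short sketch of how these four structures are accessed and modified (the values are illustrative):
my_list = [1, 2, 3, 'apple']
my_list.append('pear')        # lists are mutable
print(my_list[0])             # 1 (indexing starts at 0)

my_tuple = (1, 2, 3, 'banana')
print(my_tuple[-1])           # 'banana' (negative indices count from the end)

my_set = {1, 2, 3, 4}
print(2 in my_set)            # True (fast membership test)

my_dict = {'name': 'Alice', 'age': 30, 'city': 'New York'}
print(my_dict['name'])        # 'Alice' (access a value by its key)
my_dict['age'] = 31           # update a value by key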
Data Analysis using NumPy and Pandas
NumPy and Pandas are essential libraries in Python for data manipulation
and analysis.
NumPy: NumPy provides support for large, multi-dimensional arrays
and matrices, as well as a collection of mathematical functions to operate
on these arrays efficiently. It is commonly used for numerical
computations and array manipulation.
Arrays in NumPy
NumPy’s main object is the homogeneous multidimensional array: a
table of elements (usually numbers), all of the same type, indexed by a
tuple of non-negative integers. In NumPy, dimensions are called axes, and
the number of axes is the array’s rank. NumPy’s array class is called
ndarray, also known by the alias array.
Example:
import numpy as np
# Creating array object
arr = np.array([[1, 2, 3],
                [4, 2, 5]])
# Printing type of arr object
print("Array is of type: ", type(arr))
# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)
Output:
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int64
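Beyond inspecting an array's attributes, NumPy's main strength is fast, element-wise math on whole arrays. A small sketch using the same array as above:
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 2, 5]])

print(arr * 2)            # element-wise multiplication: [[2 4 6] [8 4 10]]
print(arr + 10)           # broadcasting a scalar across all elements
print(arr.sum())          # 17, sum over all elements
print(arr.sum(axis=0))    # [5 4 8], column sums along axis 0
print(arr.T.shape)        # (3, 2), shape of the transpose
print(arr.reshape(3, 2))  # same data viewed as 3 rows x 2 columns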
Pandas: Pandas is a popular open-source data manipulation and analysis
library for the Python programming language. It provides easy-to-use
data structures and data analysis tools for working with structured data,
such as spreadsheets or SQL tables. Pandas is a fundamental tool in the
data science and data analysis ecosystem and is often used in conjunction
with other libraries like NumPy, Matplotlib, and Scikit-learn.
Key features of Pandas include:
Data Structures:
Series: A one-dimensional labeled array capable of holding data of
various types.
DataFrame: A two-dimensional labeled data structure, similar to a
spreadsheet or SQL table, where data is organized into rows and columns.
Data Cleaning and Preparation:
Handling missing data: Pandas provides methods for filling, dropping, or
interpolating missing values.
Data transformation: You can reshape data, pivot tables, merge and join
datasets, and apply various data transformations.
Data indexing and selection: Pandas allows you to slice and dice data
using labels, integers, or boolean indexing.
Data Analysis and Exploration:
Descriptive statistics: Calculate summary statistics like mean, median,
standard deviation, and more.
Grouping and aggregation: Easily group data based on certain criteria and
perform aggregation operations.
Time series analysis: Pandas supports time-based data manipulation and
resampling.
Data I/O:
Read and write data: Pandas can read data from various file formats like
CSV, Excel, SQL databases, and more.
Data export: You can save Pandas DataFrames to different file formats as
well.
Example:
import pandas as pd
# Load a CSV file into a DataFrame
df = pd.read_csv('data.csv')
# Display the first few rows of the DataFrame
print(df.head())
# Calculate mean and standard deviation of a column
mean_value = df['column_name'].mean()
std_deviation = df['column_name'].std()
# Group data by a column and calculate the sum of another column
grouped_data = df.groupby('group_column')['sum_column'].sum()
# Filter data based on a condition
filtered_data = df[df['column_name'] > 50]
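The missing-data methods mentioned earlier under Data Cleaning and Preparation follow the same pattern; a minimal sketch, reusing the placeholder 'column_name' from the example above:
# Handling missing data ('column_name' is the same hypothetical column as above)
rows_without_na = df.dropna()  # drop rows containing any missing value
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())  # fill with the mean
df['column_name'] = df['column_name'].interpolate()  # or interpolate between neighboring values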
Pandas is an essential tool for data analysis and manipulation in Python,
and it plays a crucial role in various data-related tasks, from data cleaning
and preprocessing to exploratory data analysis and model building.
Data Visualization using Matplotlib and Seaborn
Data visualization is crucial for understanding data and communicating
results. Matplotlib and Seaborn are popular Python libraries for creating
various types of plots and visualizations.
Matplotlib provides many out-of-the-box tools for quick and easy data
visualization. For example, when analyzing a new data set, researchers
are often interested in the distribution of values for a set of columns. One
way to do so is through a histogram.
A histogram approximates a distribution by dividing the range of values
into a set of intervals and counting how many values fall into each bin or
bucket. Visualizing a distribution as a histogram is straightforward using
Matplotlib.
For our purposes, we will be working with the FIFA19 data set, which is
publicly available online. To start, we need to import the Pandas library, a
Python library used for data tasks such as statistical analysis and data
wrangling:
import pandas as pd
Next, we need to import the pyplot module from the Matplotlib library. It
is custom to import it as plt:
import matplotlib.pyplot as plt
Now, let’s read our data into a Pandas dataframe. We will relax the limit
on display columns and rows using the set_option() method in Pandas:
df = pd.read_csv("fifa19.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
Let’s display the first five rows of data using the head() method:
print(df.head())
We can generate a histogram for any of the numerical columns by calling
the hist() method on the plt object and passing in the selected column in
the data frame. Let’s do this for the Overall column, which corresponds to
overall player rating:
plt.hist(df['Overall'])
We can also label the x-axis and y-axis and set the plot title using the
xlabel(), ylabel() and title() methods, respectively:
plt.xlabel('Overall')
plt.ylabel('Frequency')
plt.title('Histogram of Overall Rating')
plt.show()
This visualization is a great way to understand how the values in your
data are distributed and to see at a glance which values occur most and
least often.
Generating Scatter Plots in Python With Matplotlib
Scatter plots are a useful data visualization tool that helps with
identifying variable dependence. For example, if we are interested in
seeing if there is a positive relationship between wage and overall player
rating, (i.e., if a player’s wage increases, does his rating also go up?) we
can employ scatter plot visualization for insight.
Before we generate our scatter plot of wage versus overall rating, let’s
convert the wage column from a string to a floating point numerical
column. For this, we will create a new column called wage_euro:
df['wage_euro'] = df['Wage'].str.strip('€')
df['wage_euro'] = df['wage_euro'].str.strip('K')
df['wage_euro'] = df['wage_euro'].astype(float)*1000.0
Now, let’s display our new column wage_euro and the overall column:
print(df[['Overall', 'wage_euro']].head())
To generate a scatter plot in Matplotlib, we simply use the scatter()
method on the plt object. Let’s also label the axes and give our plot a title:
plt.scatter(df['Overall'], df['wage_euro'])
plt.title('Overall vs. Wage')
plt.ylabel('Wage')
plt.xlabel('Overall')
plt.show()
Generating Bar Charts in Python With Matplotlib
Bar charts are another useful visualization tool for analyzing categories in
data. For example, if we want to see the most common nationalities found
in our FIFA19 data set, we can employ bar charts. To visualize
categorical columns, we first need to count the values. We can use the
Counter class from the collections module to generate a dictionary of
count values for each category in a categorical column. Let’s do this for
the Nationality column:
from collections import Counter
print(Counter(df['Nationality']))
We can filter this dictionary using the most_common method. Let’s look
at the 10 most common nationality values (note: Counter has no
least_common method, but slicing the reversed output of most_common()
lets you analyze infrequent nationality values):
print(dict(Counter(df['Nationality']).most_common(10)))
Finally, to generate the bar plot of the 10 most common nationality values,
we simply call the bar method on the plt object and pass in the keys and
values of our dictionary:
nationality_dict = dict(Counter(df['Nationality']).most_common(10))
plt.bar(nationality_dict.keys(), nationality_dict.values())
plt.xlabel('Nationality')
plt.ylabel('Frequency')
plt.title('Bar Plot of Ten Most Common Nationalities')
plt.xticks(rotation=90)
plt.show()
Note the plt.xticks(rotation=90) call in the code above: without it, the
values on the x-axis overlap and become hard to read, so we rotate the
labels 90 degrees to keep them legible.
Generating Pie Charts in Python With Matplotlib
Pie charts are a useful way to visualize proportions in your data. For
example, in this data set, we can use a pie chart to visualize the
proportion of players from England, Germany and Spain.
To do this, let’s create new columns that contain England, Spain,
Germany and one column labeled “other” for all other nationalities:
df.loc[df.Nationality == 'England', 'Nationality2'] = 'England'
df.loc[df.Nationality == 'Spain', 'Nationality2'] = 'Spain'
df.loc[df.Nationality == 'Germany', 'Nationality2'] = 'Germany'
df.loc[~df.Nationality.isin(['England', 'Germany', 'Spain']), 'Nationality2'] = 'Other'
prop = dict(Counter(df['Nationality2']))
for key, value in prop.items():
    prop[key] = value / len(df) * 100
print(prop)
We can create a pie chart using our dictionary and the pie method in
Matplotlib:
fig1, ax1 = plt.subplots()
ax1.pie(prop.values(), labels=prop.keys(), autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.
plt.show()
All of these methods offer us multiple, powerful ways to visualize the
proportions of categories in our data.
Python Data Visualization With Seaborn
Seaborn is a library built on top of Matplotlib that enables more
sophisticated visualization and aesthetic plot formatting. Once you’ve
mastered Matplotlib, you may want to move up to Seaborn for more
complex visualizations.
For example, simply using the Seaborn set() method can dramatically
improve the appearance of your Matplotlib plots. Let’s take a look.
First, import Seaborn as sns and reformat all of the figures we generated.
At the top of your script, write the following code and rerun:
import seaborn as sns
sns.set()
Generating Histograms in Python With Seaborn
We can also generate all of the same visualizations we did in Matplotlib
using Seaborn.
To regenerate our histogram of the Overall column, we use the distplot
method on the Seaborn object (note that recent Seaborn releases deprecate
distplot in favor of histplot):
sns.distplot(df['Overall'])
And we can reuse the plt object for additional axis formatting and title
setting:
plt.xlabel('Overall')
plt.ylabel('Frequency')
plt.title('Histogram of Overall Rating')
plt.show()
As we can see, this plot looks much more visually appealing than the
Matplotlib plot.
Generating Scatter Plots in Python With Seaborn
Seaborn also makes generating scatter plots straightforward. Let’s
recreate the scatter plot from earlier (recent Seaborn versions require the
x and y arguments to be passed by keyword):
sns.scatterplot(x='Overall', y='wage_euro', data=df)
plt.title('Overall vs. Wage')
plt.ylabel('Wage')
plt.xlabel('Overall')
plt.show()
Generating Heatmaps in Python With Seaborn
Seaborn is also known for making correlation heatmaps, which can be
used to identify variable dependence. To generate one, first we need to
calculate the correlation between a set of numerical columns. Let’s do
this for age, overall, wage_euro and skill moves:
corr = df[['Overall', 'Age', 'wage_euro', 'Skill Moves']].corr()
sns.heatmap(corr)
plt.title('Heatmap of Overall, Age, wage_euro, and Skill Moves')
plt.show()
We can also set the annot (annotate) argument to True to display the
correlation values on the heatmap:
sns.heatmap(corr, annot=True)
Generating Pairs Plots in Python With Seaborn
The last Seaborn tool I’ll discuss is the pairplot method. This allows you
to generate a matrix of distributions and scatter plots for a set of
numerical features. Let’s do this for age, overall and potential:
data = df[['Overall', 'Age', 'Potential']]
sns.pairplot(data)
plt.show()
This is a quick and easy way to visualize both the distributions of
numerical values and the relationships between variables through scatter plots.
Both Seaborn and Matplotlib are valuable tools for any data scientist.
Matplotlib makes labeling, titling and formatting graphs simple, which is
important for effective data communication. Further, it provides much of
the basic tooling for visualizing data including histograms, scatter plots,
pie charts and bar charts.
Seaborn is an important library to know because of its beautiful visuals
and extensive statistical tooling. As you can see above, the plots
generated in Seaborn, even if they communicate the same information,
are much prettier than those generated in Matplotlib. Further, the tools
provided by Seaborn allow for much more sophisticated analysis and
visuals. Although I only discussed how to use Seaborn to generate
heatmaps and pairwise plots, it can also be used to generate more
complicated visuals like density maps for variables, line plots with
confidence intervals, cluster maps and much more.
Matplotlib and Seaborn are two of the most widely used visualization
libraries in Python. They both allow you to quickly perform data
visualization for gaining statistical insights and telling a story with data.
While there is significant overlap in the use cases for each of these
libraries, having knowledge of both libraries can allow a data scientist to
generate beautiful visuals that can tell an impactful story about the data
being analyzed.