Module 1: Foundations of Data Science
Introduction to Data Science
Data science is an interdisciplinary field that involves using scientific
methods, algorithms, processes, and systems to extract knowledge and
insights from structured and unstructured data. It combines elements from
statistics, computer science, domain knowledge, and data engineering to
solve complex data-driven problems.
Key facts about data science:
Data science involves collecting, cleaning, and analyzing large volumes
of data to gain insights and make informed decisions.
It encompasses various techniques, including data exploration, statistical
modeling, machine learning, and data visualization.
Data scientists often work with programming languages like Python and
R, as well as tools such as Jupyter notebooks and data visualization
libraries.
Data Science Lifecycle
The data science lifecycle is a series of stages that data scientists follow
when working on a data-driven project. These stages typically include:
Problem Definition: Understand the problem you need to solve and define
clear objectives. Identify what data is needed to address the problem.
Data Collection: Gather relevant data from various sources, such as
databases, APIs, or files. This may involve data scraping, data cleaning,
and data integration.
Data Exploration: Explore the dataset to understand its characteristics,
such as data types, distributions, and missing values. Visualization plays
a crucial role in this stage.
Data Preprocessing: Clean and prepare the data for analysis. This
includes handling missing values, encoding categorical variables, and
scaling or normalizing numerical features.
Feature Engineering: Create new features or transform existing ones to
improve the performance of machine learning models.
Model Building: Select appropriate machine learning algorithms and train
predictive models using the prepared data (see the short code sketch after this list).
Model Evaluation: Assess the performance of the models using metrics
like accuracy, precision, recall, or F1-score. Fine-tune models as needed.
Model Deployment: Deploy the trained model to production, making it
accessible for predictions in real-time applications.
Monitoring and Maintenance: Continuously monitor model performance
and retrain as necessary to adapt to changing data patterns.
Communication and Reporting: Share insights and results with
stakeholders through reports, dashboards, or presentations.
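To make the model building and evaluation stages concrete, here is a minimal sketch using scikit-learn; the dataset (scikit-learn's bundled iris data), the model choice, and the metric are illustrative assumptions standing in for a real project's data and requirements.
# A minimal sketch of the model building and evaluation stages,
# using scikit-learn's bundled iris dataset as a stand-in for real project data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Hold out a test set so evaluation reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Model building: fit a simple classifier on the training data
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Model evaluation: score predictions on the held-out test set
predictions = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, predictions))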
Python for Data Science
Python is a widely used programming language in the field of data
science due to its simplicity, extensive libraries, and vibrant community.
Here are some fundamental concepts and data structures in Python
relevant to data science:
Data Types: Python supports various data types, including integers,
floats, strings, booleans, and complex numbers.
Integers (int): Whole numbers, e.g., 5, -10.
Floats (float): Numbers with a decimal point, e.g., 3.14, -0.5.
Strings (str): Sequences of characters enclosed in single or double quotes,
e.g., "Hello, World!".
Booleans (bool): Represents either True or False.
Operators:
Arithmetic Operators: + (addition), - (subtraction), * (multiplication), /
(division), % (modulo), ** (exponentiation).
Comparison Operators: == (equal to), != (not equal to), < (less than), >
(greater than), <= (less than or equal to), >= (greater than or equal to).
Logical Operators: and (logical AND), or (logical OR), not (logical NOT).
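As a quick illustration of how these three groups of operators combine in expressions (the variable names and values here are arbitrary):
a, b = 10, 3
print(a + b, a - b, a * b)   # 13 7 30
print(a / b)                 # 3.3333333333333335 (true division)
print(a % b, a ** b)         # 1 1000 (remainder and exponentiation)
print(a > b, a == b)         # True False
print(a > 0 and b > 0)       # True
print(not (a < b))           # True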
Control Statements:
if-else Statements: Conditional execution based on a condition.
if condition:
    # code to run if condition is True
else:
    # code to run if condition is False
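For example, with a concrete condition (the variable and threshold are illustrative):
score = 72
if score >= 50:
    print("Pass")
else:
    print("Fail")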
Loops:
for Loops: Iterating over a sequence (e.g., a list, tuple, or range).
for item in sequence:
    # code to repeat for each item
while Loops: Repeating a block of code while a condition is True.
while condition:
    # code to repeat as long as condition is True
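A concrete version of both loop forms (the list and counter values are illustrative):
fruits = ['apple', 'banana', 'cherry']
for fruit in fruits:
    print(fruit)      # prints each fruit in order

count = 0
while count < 3:
    print(count)      # prints 0, 1, 2
    count += 1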
Functions:
Functions allow you to encapsulate a block of code into a reusable unit.
You can define functions using the def keyword and call them by name.
def add(a, b):
    return a + b

result = add(2, 3)  # Calls the add function and assigns the result to 'result'.
Data Structures:
Lists: Ordered collections of items that can be of different data types.
Lists are mutable, meaning you can change their contents.
my_list = [1, 2, 3, 'apple']
Tuples: Similar to lists but immutable, meaning you can't change their
contents after creation.
my_tuple = (1, 2, 3, 'banana')
Sets: Unordered collections of unique elements. Sets are useful for tasks
that involve checking for membership or eliminating duplicates.
my_set = {1, 2, 3, 4, 4} # Creates a set with unique elements: {1, 2, 3, 4}
Dictionaries: Collections of key-value pairs. You can use keys to access
values.
my_dict = {'name': 'Alice', 'age': 30, 'city': 'New York'}
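A short sketch of how these four structures are accessed and modified (the values are illustrative):
my_list = [1, 2, 3, 'apple']
my_list.append('pear')        # lists are mutable
print(my_list[0])             # 1 (indexing starts at 0)

my_tuple = (1, 2, 3, 'banana')
print(my_tuple[-1])           # 'banana' (negative indices count from the end)

my_set = {1, 2, 3, 4}
print(2 in my_set)            # True (fast membership test)

my_dict = {'name': 'Alice', 'age': 30, 'city': 'New York'}
print(my_dict['name'])        # 'Alice' (access a value by its key)
my_dict['age'] = 31           # update a value by key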
Data Analysis using NumPy and Pandas
NumPy and Pandas are essential libraries in Python for data manipulation
and analysis.
NumPy: NumPy provides support for large, multi-dimensional arrays
and matrices, as well as a collection of mathematical functions to operate
on these arrays efficiently. It is commonly used for numerical
computations and array manipulation.
Arrays in NumPy
NumPy’s main object is the homogeneous multidimensional array: a
table of elements (usually numbers), all of the same type, indexed by a
tuple of non-negative integers. In NumPy, dimensions are called axes, and
the number of axes is the array’s rank. NumPy’s array class is called
ndarray, also known by the alias array.
Example:
import numpy as np
# Creating array object
arr = np.array([[1, 2, 3],
                [4, 2, 5]])
# Printing type of arr object
print("Array is of type: ", type(arr))
# Printing array dimensions (axes)
print("No. of dimensions: ", arr.ndim)
# Printing shape of array
print("Shape of array: ", arr.shape)
# Printing size (total number of elements) of array
print("Size of array: ", arr.size)
# Printing type of elements in array
print("Array stores elements of type: ", arr.dtype)
Output:
Array is of type: <class 'numpy.ndarray'>
No. of dimensions: 2
Shape of array: (2, 3)
Size of array: 6
Array stores elements of type: int64
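Beyond inspecting an array's attributes, NumPy's main strength is fast, element-wise math on whole arrays. A small sketch using the same array as above:
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 2, 5]])

print(arr * 2)            # element-wise multiplication: [[2 4 6] [8 4 10]]
print(arr + 10)           # broadcasting a scalar across all elements
print(arr.sum())          # 17, sum over all elements
print(arr.sum(axis=0))    # [5 4 8], column sums along axis 0
print(arr.T.shape)        # (3, 2), shape of the transpose
print(arr.reshape(3, 2))  # same data viewed as 3 rows x 2 columns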
Pandas: Pandas is a popular open-source data manipulation and analysis
library for the Python programming language. It provides easy-to-use
data structures and data analysis tools for working with structured data,
such as spreadsheets or SQL tables. Pandas is a fundamental tool in the
data science and data analysis ecosystem and is often used in conjunction
with other libraries like NumPy, Matplotlib, and Scikit-learn.
Key features of Pandas include:
Data Structures:
Series: A one-dimensional labeled array capable of holding data of
various types.
DataFrame: A two-dimensional labeled data structure, similar to a
spreadsheet or SQL table, where data is organized into rows and columns.
Data Cleaning and Preparation:
Handling missing data: Pandas provides methods for filling, dropping, or
interpolating missing values.
Data transformation: You can reshape data, pivot tables, merge and join
datasets, and apply various data transformations.
Data indexing and selection: Pandas allows you to slice and dice data
using labels, integers, or boolean indexing.
Data Analysis and Exploration:
Descriptive statistics: Calculate summary statistics like mean, median,
standard deviation, and more.
Grouping and aggregation: Easily group data based on certain criteria and
perform aggregation operations.
Time series analysis: Pandas supports time-based data manipulation and
resampling.
Data I/O:
Read and write data: Pandas can read data from various file formats like
CSV, Excel, SQL databases, and more.
Data export: You can save Pandas DataFrames to different file formats as
well.
Example:
import pandas as pd
# Load a CSV file into a DataFrame
df = pd.read_csv('data.csv')
# Display the first few rows of the DataFrame
print(df.head())
# Calculate mean and standard deviation of a column
mean_value = df['column_name'].mean()
std_deviation = df['column_name'].std()
# Group data by a column and calculate the sum of another column
grouped_data = df.groupby('group_column')['sum_column'].sum()
# Filter data based on a condition
filtered_data = df[df['column_name'] > 50]
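The missing-data methods mentioned earlier under Data Cleaning and Preparation follow the same pattern; a minimal sketch, reusing the placeholder 'column_name' from the example above:
# Handling missing data ('column_name' is the same hypothetical column as above)
rows_without_na = df.dropna()  # drop rows containing any missing value
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())  # fill with the mean
df['column_name'] = df['column_name'].interpolate()  # or interpolate between neighboring values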
Pandas is an essential tool for data analysis and manipulation in Python,
and it plays a crucial role in various data-related tasks, from data cleaning
and preprocessing to exploratory data analysis and model building.
Data Visualization using Matplotlib and Seaborn
Data visualization is crucial for understanding data and communicating
results. Matplotlib and Seaborn are popular Python libraries for creating
various types of plots and visualizations.
Matplotlib provides many out-of-the-box tools for quick and easy data
visualization. For example, when analyzing a new data set, researchers
are often interested in the distribution of values for a set of columns. One
way to do so is through a histogram.
A histogram approximates a distribution by dividing the range of values
into a set of intervals and counting how many values fall into each bin or
bucket. Visualizing a distribution as a histogram is straightforward using
Matplotlib.
For our purposes, we will be working with the FIFA19 data set, which is
publicly available online. To start, we need to import the Pandas library, a
Python library used for data tasks such as statistical analysis and data
wrangling:
import pandas as pd
Next, we need to import the pyplot module from the Matplotlib library. It
is custom to import it as plt:
import matplotlib.pyplot as plt
Now, let’s read our data into a Pandas dataframe. We will relax the limit
on display columns and rows using the set_option() method in Pandas:
df = pd.read_csv("fifa19.csv")
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
Let’s display the first five rows of data using the head() method:
print(df.head())
We can generate a histogram for any of the numerical columns by calling
the hist() method on the plt object and passing in the selected column in
the data frame. Let’s do this for the Overall column, which corresponds to
overall player rating:
plt.hist(df['Overall'])
We can also label the x-axis and y-axis and set the plot title using the
xlabel(), ylabel() and title() methods, respectively:
plt.xlabel('Overall')
plt.ylabel('Frequency')
plt.title('Histogram of Overall Rating')
plt.show()
This visualization is a great way to understand how the values in your
data are distributed and to see at a glance which values occur most and
least often.
Generating Scatter Plots in Python With Matplotlib
Scatter plots are a useful data visualization tool that helps with
identifying variable dependence. For example, if we are interested in
seeing if there is a positive relationship between wage and overall player
rating, (i.e., if a player’s wage increases, does his rating also go up?) we
can employ scatter plot visualization for insight.
Before we generate our scatter plot of wage versus overall rating, let’s
convert the wage column from a string to a floating point numerical
column. For this, we will create a new column called wage_euro:
df['wage_euro'] = df['Wage'].str.strip('€')
df['wage_euro'] = df['wage_euro'].str.strip('K')
df['wage_euro'] = df['wage_euro'].astype(float)*1000.0
Now, let’s display our new column wage_euro and the overall column:
print(df[['Overall', 'wage_euro']].head())
To generate a scatter plot in Matplotlib, we simply use the scatter()
method on the plt object. Let’s also label the axes and give our plot a title:
plt.scatter(df['Overall'], df['wage_euro'])
plt.title('Overall vs. Wage')
plt.ylabel('Wage')
plt.xlabel('Overall')
plt.show()
Generating Bar Charts in Python With Matplotlib
Bar charts are another useful visualization tool for analyzing categories in
data. For example, if we want to see the most common nationalities found
in our FIFA19 data set, we can employ bar charts. To visualize
categorical columns, we first need to count the values. We can use the
Counter class from the collections module to generate a dictionary of
count values for each category in a categorical column. Let’s do this for
the Nationality column:
from collections import Counter
print(Counter(df['Nationality']))
We can filter this dictionary using the most_common method. Let’s look
at the 10 most common nationality values (note: Counter has no
least_common method, but slicing the reversed output of most_common()
lets you analyze infrequent nationality values):
print(dict(Counter(df['Nationality']).most_common(10)))
Finally, to generate the bar plot of the 10 most common nationality values,
we simply call the bar method on the plt object and pass in the keys and
values of our dictionary:
nationality_dict = dict(Counter(df['Nationality']).most_common(10))
plt.bar(nationality_dict.keys(), nationality_dict.values())
plt.xlabel('Nationality')
plt.ylabel('Frequency')
plt.title('Bar Plot of Ten Most Common Nationalities')
plt.xticks(rotation=90)
plt.show()
Note the plt.xticks(rotation=90) call in the code above: without it, the
values on the x-axis overlap and become hard to read, so we rotate the
labels 90 degrees to keep them legible.
Generating Pie Charts in Python With Matplotlib
Pie charts are a useful way to visualize proportions in your data. For
example, in this data set, we can use a pie chart to visualize the
proportion of players from England, Germany and Spain.
To do this, let’s create new columns that contain England, Spain,
Germany and one column labeled “other” for all other nationalities:
df.loc[df.Nationality == 'England', 'Nationality2'] = 'England'
df.loc[df.Nationality == 'Spain', 'Nationality2'] = 'Spain'
df.loc[df.Nationality == 'Germany', 'Nationality2'] = 'Germany'
df.loc[~df.Nationality.isin(['England', 'Germany', 'Spain']), 'Nationality2'] = 'Other'
prop = dict(Counter(df['Nationality2']))
for key, value in prop.items():
    prop[key] = value / len(df) * 100
print(prop)
We can create a pie chart using our dictionary and the pie method in
Matplotlib:
fig1, ax1 = plt.subplots()
ax1.pie(prop.values(), labels=prop.keys(), autopct='%1.1f%%',
        shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that the pie is drawn as a circle.
plt.show()
All of these methods offer us multiple, powerful ways to visualize the
proportions of categories in our data.
Python Data Visualization With Seaborn
Seaborn is a library built on top of Matplotlib that enables more
sophisticated visualization and aesthetic plot formatting. Once you’ve
mastered Matplotlib, you may want to move up to Seaborn for more
complex visualizations.
For example, simply using the Seaborn set() method can dramatically
improve the appearance of your Matplotlib plots. Let’s take a look.
First, import Seaborn as sns and reformat all of the figures we generated.
At the top of your script, write the following code and rerun:
import seaborn as sns
sns.set()
Generating Histograms in Python With Seaborn
We can also generate all of the same visualizations we did in Matplotlib
using Seaborn.
To regenerate our histogram of the Overall column, we use the distplot
method on the Seaborn object (note that recent Seaborn releases deprecate
distplot in favor of histplot):
sns.distplot(df['Overall'])
And we can reuse the plt object for additional axis formatting and title
setting:
plt.xlabel('Overall')
plt.ylabel('Frequency')
plt.title('Histogram of Overall Rating')
plt.show()
As we can see, this plot looks much more visually appealing than the
Matplotlib plot.
Generating Scatter Plots in Python With Seaborn
Seaborn also makes generating scatter plots straightforward. Let’s
recreate the scatter plot from earlier (recent Seaborn versions require the
x and y arguments to be passed by keyword):
sns.scatterplot(x='Overall', y='wage_euro', data=df)
plt.title('Overall vs. Wage')
plt.ylabel('Wage')
plt.xlabel('Overall')
plt.show()
Generating Heatmaps in Python With Seaborn
Seaborn is also known for making correlation heatmaps, which can be
used to identify variable dependence. To generate one, first we need to
calculate the correlation between a set of numerical columns. Let’s do
this for age, overall, wage_euro and skill moves:
corr = df[['Overall', 'Age', 'wage_euro', 'Skill Moves']].corr()
sns.heatmap(corr)
plt.title('Heatmap of Overall, Age, wage_euro, and Skill Moves')
plt.show()
We can also set the annot (annotate) argument to True to display the
correlation values on the heatmap:
sns.heatmap(corr, annot=True)
Generating Pairs Plots in Python With Seaborn
The last Seaborn tool I’ll discuss is the pairplot method. This allows you
to generate a matrix of distributions and scatter plots for a set of
numerical features. Let’s do this for age, overall and potential:
data = df[['Overall', 'Age', 'Potential']]
sns.pairplot(data)
plt.show()
This is a quick and easy way to visualize both the distributions of
numerical values and the relationships between variables through scatter plots.
Both Seaborn and Matplotlib are valuable tools for any data scientist.
Matplotlib makes labeling, titling and formatting graphs simple, which is
important for effective data communication. Further, it provides much of
the basic tooling for visualizing data including histograms, scatter plots,
pie charts and bar charts.
Seaborn is an important library to know because of its beautiful visuals
and extensive statistical tooling. As you can see above, the plots
generated in Seaborn, even if they communicate the same information,
are much prettier than those generated in Matplotlib. Further, the tools
provided by Seaborn allow for much more sophisticated analysis and
visuals. Although I only discussed how to use Seaborn to generate
heatmaps and pairwise plots, it can also be used to generate more
complicated visuals like density maps for variables, line plots with
confidence intervals, cluster maps and much more.
Matplotlib and Seaborn are two of the most widely used visualization
libraries in Python. They both allow you to quickly perform data
visualization for gaining statistical insights and telling a story with data.
While there is significant overlap in the use cases for each of these
libraries, having knowledge of both libraries can allow a data scientist to
generate beautiful visuals that can tell an impactful story about the data
being analyzed.