
Handling Missing Data in Pandas by Jaume Boguñá


HANDLING MISSING DATA

PYTHON for DATA SCIENCE

Jaume Boguñá
Dive into Python
Handling Missing Data in Pandas
Missing data occurs when information is absent for one or more items.

In Pandas, missing data is mainly represented by two values:

o None: A Python object, often used to represent missing data in object (non-numeric) columns.
o NaN: Stands for "Not a Number", a special floating-point value.

Pandas provides functions to detect, remove, and replace null values:


1. isnull()/isna().
2. notnull()/notna().
3. dropna().
4. fillna().
5. replace().
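
As a small illustration of the two markers, the sketch below shows that isna() flags both None and NaN, and that pandas stores None as NaN when it builds a numeric column. The Series names `numeric` and `text` and their values are made up for this example and do not come from the slides.

```python
import numpy as np
import pandas as pd

# Hypothetical example data, not taken from the slides
numeric = pd.Series([1, None, 3])                    # None is stored as NaN here
text = pd.Series(["a", None, np.nan], dtype=object)  # object column keeps both markers

print(numeric.dtype)            # float64, because NaN is a floating-point value
print(numeric.isna().tolist())  # [False, True, False]
print(text.isna().tolist())     # [False, True, True]
```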

1. isnull()
1.1 Detect missing values (None or NaN)
import pandas as pd
import numpy as np

df = pd.DataFrame({"Name": ["Jaume", "Paula", "David", "Berta"],
                   "Age": [25, np.nan, 43, 17],
                   "City": ["Madrid", "Valencia", None, "Sevilla"],
                   "Profession": ["Engineer", "Doctor", "Teacher", None]})
df
    Name   Age      City Profession
0  Jaume  25.0    Madrid   Engineer
1  Paula   NaN  Valencia     Doctor
2  David  43.0      None    Teacher
3  Berta  17.0   Sevilla       None

df.isnull()
    Name    Age   City  Profession
0  False  False  False       False
1  False   True  False       False
2  False  False   True       False
3  False  False  False        True
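
The boolean mask can also be used to pull out the rows that contain at least one missing value. The following is a minimal sketch that rebuilds the same DataFrame; the variable names `has_missing` and `incomplete_rows` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Jaume", "Paula", "David", "Berta"],
                   "Age": [25, np.nan, 43, 17],
                   "City": ["Madrid", "Valencia", None, "Sevilla"],
                   "Profession": ["Engineer", "Doctor", "Teacher", None]})

# True for every row that has at least one missing value
has_missing = df.isnull().any(axis=1)

# Boolean indexing keeps only the flagged rows
incomplete_rows = df[has_missing]
print(incomplete_rows.index.tolist())  # [1, 2, 3]
```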

1. isnull()
1.2 Count Missing Values in a DataFrame
df
    Name   Age      City Profession
0  Jaume  25.0    Madrid   Engineer
1  Paula   NaN  Valencia     Doctor
2  David  43.0      None    Teacher
3  Berta  17.0   Sevilla       None

# Count missing values in each column
df.isnull().sum()
Name          0
Age           1
City          1
Profession    1
dtype: int64

# Count missing values across the entire DataFrame
df.isnull().sum().sum()
3
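
The same idea works per row by summing along axis=1. This is a short sketch on the same data; `missing_per_row` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Jaume", "Paula", "David", "Berta"],
                   "Age": [25, np.nan, 43, 17],
                   "City": ["Madrid", "Valencia", None, "Sevilla"],
                   "Profession": ["Engineer", "Doctor", "Teacher", None]})

# Sum the boolean mask across columns to count missing values in each row
missing_per_row = df.isnull().sum(axis=1)
print(missing_per_row.tolist())  # [0, 1, 1, 1]
```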

1. isnull()
1.3 Get Percentage of Missing Values

df
    Name   Age      City Profession
0  Jaume  25.0    Madrid   Engineer
1  Paula   NaN  Valencia     Doctor
2  David  43.0      None    Teacher
3  Berta  17.0   Sevilla       None

# Percentage of missing values in each column
df.isnull().mean() * 100
Name           0.0
Age           25.0
City          25.0
Profession    25.0
dtype: float64
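
One possible use of these percentages is to list the columns whose share of missing values exceeds some cut-off. The sketch below assumes a cut-off of 20%, which is an arbitrary value picked for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Jaume", "Paula", "David", "Berta"],
                   "Age": [25, np.nan, 43, 17],
                   "City": ["Madrid", "Valencia", None, "Sevilla"],
                   "Profession": ["Engineer", "Doctor", "Teacher", None]})

missing_pct = df.isnull().mean() * 100

# Hypothetical cut-off, chosen only for this illustration
threshold = 20

# Keep the labels of columns whose missing percentage is above the cut-off
sparse_columns = missing_pct[missing_pct > threshold].index.tolist()
print(sparse_columns)  # ['Age', 'City', 'Profession']
```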

1. isnull()
1.4 Checking for Null Rows

df
    Name   Age      City Profession
0  Jaume  25.0    Madrid   Engineer
1  Paula   NaN  Valencia     Doctor
2  David  43.0      None    Teacher
3  Berta  17.0   Sevilla       None

# Check which rows are entirely null (True means every value in the row is missing)
df.isnull().all(axis=1)
0    False
1    False
2    False
3    False
dtype: bool
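
Since no row in the example is entirely null, the check returns False everywhere. The sketch below uses a small, made-up DataFrame that does contain an all-null row, and shows the check together with dropna(how="all"), which removes only such rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one row (index 1) in which every value is missing
df = pd.DataFrame({"Name": ["Jaume", None, "David"],
                   "Age": [25, np.nan, 43],
                   "City": ["Madrid", None, None]})

# True only where all values in the row are missing
all_null = df.isnull().all(axis=1)
print(all_null.tolist())  # [False, True, False]

# how="all" drops a row only if every value in it is missing
cleaned = df.dropna(how="all")
print(cleaned.index.tolist())  # [0, 2]
```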

2. notnull()
Opposite of isnull(): returns True where a value is present and False where it is missing.

df
    Name   Age      City Profession
0  Jaume  25.0    Madrid   Engineer
1  Paula   NaN  Valencia     Doctor
2  David  43.0      None    Teacher
3  Berta  17.0   Sevilla       None

df.notnull()
   Name    Age   City  Profession
0  True   True   True        True
1  True  False   True        True
2  True   True  False        True
3  True   True   True       False
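
A common use of notnull() is to keep only the rows where one particular column has a value. The sketch below filters on the City column; the name `with_city` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Jaume", "Paula", "David", "Berta"],
                   "Age": [25, np.nan, 43, 17],
                   "City": ["Madrid", "Valencia", None, "Sevilla"],
                   "Profession": ["Engineer", "Doctor", "Teacher", None]})

# Keep rows whose City is present; other columns may still contain missing values
with_city = df[df["City"].notnull()]
print(with_city["Name"].tolist())  # ['Jaume', 'Paula', 'Berta']
```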

3. dropna()
3.1 Drop Rows with Missing Values

df
    Name   Age      City Profession
0  Jaume  25.0    Madrid   Engineer
1  Paula   NaN  Valencia     Doctor
2  David  43.0      None    Teacher
3  Berta  17.0   Sevilla       None

df.dropna()
    Name   Age    City Profession
0  Jaume  25.0  Madrid   Engineer
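
Note that dropna() returns a new DataFrame and leaves the original untouched unless the result is assigned. A minimal sketch follows; the reset_index(drop=True) step is an optional addition here to renumber the remaining rows.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Jaume", "Paula", "David", "Berta"],
                   "Age": [25, np.nan, 43, 17],
                   "City": ["Madrid", "Valencia", None, "Sevilla"],
                   "Profession": ["Engineer", "Doctor", "Teacher", None]})

# dropna() returns a copy; reset_index(drop=True) renumbers the surviving rows from 0
complete = df.dropna().reset_index(drop=True)

print(len(df), len(complete))  # 4 1  (the original still has all four rows)
```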

3. dropna()
3.2 Drop Columns with Missing Values

df
    Name   Age      City Profession
0  Jaume  25.0    Madrid   Engineer
1  Paula   NaN  Valencia     Doctor
2  David  43.0      None    Teacher
3  Berta  17.0   Sevilla       None

df.dropna(axis=1)
    Name
0  Jaume
1  Paula
2  David
3  Berta

3. dropna()
3.3 Define Which Columns to Look for Missing Values

df
    Name   Age      City Profession
0  Jaume  25.0    Madrid   Engineer
1  Paula   NaN  Valencia     Doctor
2  David  43.0      None    Teacher
3  Berta  17.0   Sevilla       None

df.dropna(subset=['Age', 'Profession'])
    Name   Age    City Profession
0  Jaume  25.0  Madrid   Engineer
2  David  43.0    None    Teacher
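
Besides subset, dropna() also accepts a thresh parameter, which keeps a row only if it has at least that many non-missing values. The slides do not cover it; the sketch below is an illustrative addition on the same data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Name": ["Jaume", "Paula", "David", "Berta"],
                   "Age": [25, np.nan, 43, 17],
                   "City": ["Madrid", "Valencia", None, "Sevilla"],
                   "Profession": ["Engineer", "Doctor", "Teacher", None]})

# Require all 4 values to be present: only the first row qualifies
strict = df.dropna(thresh=4)

# Require at least 3 present values: every row has 3 or more, so all are kept
lenient = df.dropna(thresh=3)

print(strict.index.tolist())   # [0]
print(lenient.index.tolist())  # [0, 1, 2, 3]
```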

4. fillna()
4.1 Fill missing values with a specified value

df
       Maths  Science  French
Joan     8.0      9.0     NaN
Nadia    7.0      NaN     8.0
Elsa     NaN      6.0     5.0
Mario    6.0      7.0     7.0

# Replace all NaN elements with 0s
df.fillna(0)
       Maths  Science  French
Joan     8.0      9.0     0.0
Nadia    7.0      0.0     8.0
Elsa     0.0      6.0     5.0
Mario    6.0      7.0     7.0
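
The slides show this grades DataFrame without the code that creates it. One way it could be built is sketched below; the constructor call and the variable name `scores` are assumptions, only the values come from the table above.

```python
import numpy as np
import pandas as pd

# Assumed construction of the grades table shown in the slides
scores = pd.DataFrame(
    {"Maths": [8, 7, np.nan, 6],
     "Science": [9, np.nan, 6, 7],
     "French": [np.nan, 8, 5, 7]},
    index=["Joan", "Nadia", "Elsa", "Mario"],
)

# fillna(0) returns a new DataFrame with every NaN replaced by 0
filled = scores.fillna(0)

print(int(scores.isnull().sum().sum()))  # 3 (the original is unchanged)
print(int(filled.isnull().sum().sum()))  # 0
```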

4. fillna()
4.1 Fill missing values with a specified value

df
       Maths  Science  French
Joan     8.0      9.0     NaN
Nadia    7.0      NaN     8.0
Elsa     NaN      6.0     5.0
Mario    6.0      7.0     7.0

# Replace all NaN elements in columns 'Maths', 'Science', and 'French'
# with 5, 10, and 0 respectively
df.fillna(value={"Maths": 5, "Science": 10, "French": 0})
       Maths  Science  French
Joan     8.0      9.0     0.0
Nadia    7.0     10.0     8.0
Elsa     5.0      6.0     5.0
Mario    6.0      7.0     7.0

4. fillna()
4.1 Fill missing values with a specified value

df
       Maths  Science  French
Joan     8.0      9.0     NaN
Nadia    7.0      NaN     8.0
Elsa     NaN      6.0     5.0
Mario    6.0      7.0     7.0

# Fill NaN values with the mean of each column, rounded to one decimal place
df.fillna(value=df.mean().round(1))
       Maths  Science  French
Joan     8.0      9.0     6.7
Nadia    7.0      7.3     8.0
Elsa     7.0      6.0     5.0
Mario    6.0      7.0     7.0
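
The same pattern works with other column statistics. As an illustrative variation that is not in the slides, the sketch below fills each column with its median; the DataFrame construction and the name `scores` are assumptions based on the table above.

```python
import numpy as np
import pandas as pd

# Assumed construction of the grades table shown in the slides
scores = pd.DataFrame(
    {"Maths": [8, 7, np.nan, 6],
     "Science": [9, np.nan, 6, 7],
     "French": [np.nan, 8, 5, 7]},
    index=["Joan", "Nadia", "Elsa", "Mario"],
)

# median() skips NaN by default and returns one value per column;
# fillna() then matches those values to columns by label
filled = scores.fillna(scores.median())

print(scores.median().tolist())  # [7.0, 7.0, 7.0]
```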

4. fillna()
4.2 Fill missing values with a specified method (forward or backward)

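df
       Maths  Science  French
Joan     8.0      9.0     NaN
Nadia    7.0      NaN     8.0
Elsa     NaN      6.0     5.0
Mario    6.0      7.0     7.0

# Backward fill (bfill): each NaN takes the next valid value below it in its column.
# The method= argument of fillna() is deprecated since pandas 2.1,
# so the dedicated bfill() method is used here instead of fillna(method="bfill").
df.bfill()
       Maths  Science  French
Joan     8.0      9.0     8.0
Nadia    7.0      6.0     8.0
Elsa     6.0      6.0     5.0
Mario    6.0      7.0     7.0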
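
The forward direction mentioned in the heading is not shown in the slides. A minimal sketch with ffill() follows, assuming the same grades table (the construction and the name `scores` are assumptions). A NaN in the first row has no earlier value to copy, so it stays missing.

```python
import numpy as np
import pandas as pd

# Assumed construction of the grades table shown in the slides
scores = pd.DataFrame(
    {"Maths": [8, 7, np.nan, 6],
     "Science": [9, np.nan, 6, 7],
     "French": [np.nan, 8, 5, 7]},
    index=["Joan", "Nadia", "Elsa", "Mario"],
)

# Forward fill: each NaN takes the last valid value above it in its column
forward = scores.ffill()

print(forward.loc["Elsa", "Maths"])     # 7.0 (copied from Nadia)
print(forward.loc["Nadia", "Science"])  # 9.0 (copied from Joan)
print(forward.loc["Joan", "French"])    # nan (nothing above to copy)
```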

5. replace()
Replaces specific values, including missing values (NaN)

df
       Maths  Science  French
Joan     8.0      9.0     NaN
Nadia    7.0      NaN     8.0
Elsa     NaN      6.0     5.0
Mario    6.0      7.0     7.0

# Replace NaN with 0
df.replace(to_replace=np.nan, value=0)
       Maths  Science  French
Joan     8.0      9.0     0.0
Nadia    7.0      0.0     8.0
Elsa     0.0      6.0     5.0
Mario    6.0      7.0     7.0
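
replace() is also useful in the opposite direction: turning a placeholder value into NaN so that the functions above can detect it. The sketch below assumes a dataset that uses -999 as its placeholder; both the data and that convention are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data in which -999 marks a missing grade
raw = pd.DataFrame(
    {"Maths": [8.0, -999.0, 6.0],
     "Science": [-999.0, 6.0, 7.0]},
    index=["Joan", "Nadia", "Mario"],
)

# Convert the placeholder to NaN so isnull(), dropna() and fillna() can handle it
cleaned = raw.replace(to_replace=-999.0, value=np.nan)

print(cleaned.isnull().sum().tolist())  # [1, 1]
```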

5. replace()
Replaces specific values, including missing values (NaN)

df
       Maths  Science  French
Joan     8.0      9.0     NaN
Nadia    7.0      NaN     8.0
Elsa     NaN      6.0     5.0
Mario    6.0      7.0     7.0

# Replace NaN values with the min value of each column
df.replace(to_replace=np.nan, value=df.min())
       Maths  Science  French
Joan     8.0      9.0     5.0
Nadia    7.0      6.0     8.0
Elsa     6.0      6.0     5.0
Mario    6.0      7.0     7.0
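
For this particular case, fillna() with the column minimums should give the same table. The sketch below shows that alternative; the DataFrame construction and the name `scores` are assumptions based on the table above.

```python
import numpy as np
import pandas as pd

# Assumed construction of the grades table shown in the slides
scores = pd.DataFrame(
    {"Maths": [8, 7, np.nan, 6],
     "Science": [9, np.nan, 6, 7],
     "French": [np.nan, 8, 5, 7]},
    index=["Joan", "Nadia", "Elsa", "Mario"],
)

# min() skips NaN by default; fillna() matches the per-column values by label
filled = scores.fillna(scores.min())

print(filled.loc["Joan", "French"])    # 5.0
print(filled.loc["Nadia", "Science"])  # 6.0
print(filled.loc["Elsa", "Maths"])     # 6.0
```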

Jaume Boguñá
Aerospace Engineer | Data Scientist
