Data Preprocessing in Python
Data preprocessing is the first step in any data analysis or machine learning pipeline. It involves cleaning, transforming and organizing raw data to ensure it is accurate, consistent and ready for modeling. It has a direct impact on model building:
- Clean and well-structured data allows models to learn meaningful patterns rather than noise.
- Properly processed data prevents misleading inputs, leading to more reliable predictions.
- Organized data makes it simpler to create useful inputs for the model, enhancing model performance.
- Organized data supports better Exploratory Data Analysis (EDA), making patterns and trends more interpretable.
Data Preprocessing: Step-by-Step Implementation
Let's implement the preprocessing steps one by one.
Step 1: Import Libraries and Load Dataset
We prepare the environment with libraries like pandas, NumPy, scikit-learn, Matplotlib and Seaborn for data manipulation, numerical operations, visualization and scaling, then load the dataset for preprocessing.
The sample dataset can be downloaded from here.
Python
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv('Geeksforgeeks/Data/diabetes.csv')
df.head()
Output:
Step 2: Inspect Data Structure and Check Missing Values
We examine the dataset's size and data types and identify any missing values that need handling.
- df.info(): Prints concise summary including count of non-null entries and data type of each column.
- df.isnull().sum(): Returns the number of missing values per column.
Python
df.info()
print(df.isnull().sum())
Output:
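For this dataset the check typically reports zero missing values. If missing values were present, a minimal imputation sketch might look like the following; the median strategy and SimpleImputer usage here are illustrative choices, not part of the original walkthrough.
Python
from sklearn.impute import SimpleImputer

# Fill any missing values with the per-column median (a robust default)
imputer = SimpleImputer(strategy='median')
df_filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_filled.isnull().sum())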
Step 3: Statistical Summary and Visualizing Outliers
Get numeric summaries like mean, median, min/max and detect unusual points (outliers). Outliers can skew models if not handled.
- df.describe(): Computes count, mean, std deviation, min/max and quartiles for numerical columns.
- Boxplots: Visualize spread and detect outliers using matplotlib’s boxplot().
Python
df.describe()
fig, axs = plt.subplots(len(df.columns), 1, figsize=(7, 18), dpi=95)
for i, col in enumerate(df.columns):
    axs[i].boxplot(df[col], vert=False)
    axs[i].set_ylabel(col)
plt.tight_layout()
plt.show()
Output:
Step 4: Remove Outliers Using the Interquartile Range (IQR) Method
Remove extreme values beyond a reasonable range to improve model robustness.
- IQR = Q3 (75th percentile) – Q1 (25th percentile).
- Values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR are treated as outliers.
- Calculate the lower and upper bounds for each column separately.
- Filter the data to keep only points within the bounds. The code below applies this to the Insulin column; a sketch extending it to all feature columns follows.
Python
q1, q3 = np.percentile(df['Insulin'], [25, 75])
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
clean_df = df[(df['Insulin'] >= lower) & (df['Insulin'] <= upper)]
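The same rule can be extended to every feature column at once. Below is a minimal sketch, assuming we drop any row that falls outside the IQR bounds of any feature; this is a stricter policy than filtering a single column and is shown for illustration only.
Python
# Per-column IQR bounds for all feature columns (excluding the target)
feature_cols = df.columns.drop('Outcome')
Q1 = df[feature_cols].quantile(0.25)
Q3 = df[feature_cols].quantile(0.75)
IQR = Q3 - Q1
lower_b = Q1 - 1.5 * IQR
upper_b = Q3 + 1.5 * IQR

# Keep rows where every feature lies within its column's bounds
mask = ((df[feature_cols] >= lower_b) & (df[feature_cols] <= upper_b)).all(axis=1)
clean_all = df[mask]
print(df.shape, clean_all.shape)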
Step 5: Correlation Analysis
Understand relationships between features and the target variable (Outcome). Correlation helps gauge feature importance.
- df.corr(): Computes pairwise correlation coefficients between columns.
- A heatmap via seaborn visualizes the correlation matrix clearly.
- Sorting correlations with corr['Outcome'].sort_values() highlights features most correlated with the target.
Python
corr = df.corr()
plt.figure(dpi=130)
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.show()
print(corr['Outcome'].sort_values(ascending=False))
Output:
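One illustrative way to act on this ranking, not covered in the original code, is to keep only the features most strongly correlated with the target. The cutoff of three features below is an arbitrary choice for demonstration.
Python
# Rank features by absolute correlation with the target
ranking = corr['Outcome'].abs().sort_values(ascending=False)
top_features = ranking.index[1:4].tolist()  # skip 'Outcome' itself
print(top_features)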
Step 6: Visualize Target Variable Distribution
Check whether the target classes (Diabetes vs Not Diabetes) are balanced, since class imbalance affects model training and evaluation.
- plt.pie(): Pie chart to display proportion of each class in the target variable 'Outcome'.
Python
# value_counts() lists the majority class (Outcome = 0, Not Diabetes) first
plt.pie(df['Outcome'].value_counts(), labels=[
        'Not Diabetes', 'Diabetes'], autopct='%.f%%', shadow=True)
plt.title('Outcome Proportionality')
plt.show()
Output:
Step 7: Separate Features and Target Variable
Prepare independent variables (features) and dependent variable (target) separately for modeling.
- df.drop(columns=[...]): Drops the target column from features.
- Direct column selection df['Outcome'] selects target column.
Python
X = df.drop(columns=['Outcome'])
y = df['Outcome']
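A quick sanity check, added here for illustration, confirms the split: X keeps every column except the target and y is a single column.
Python
# X: (rows, features), y: (rows,)
print(X.shape, y.shape)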
Step 8: Feature Scaling: Normalization and Standardization
Scale features to a common range or distribution, important for many ML algorithms sensitive to feature magnitudes.
1. Normalization (Min-Max Scaling): Rescales features between 0 and 1. Good for algorithms like k-NN and neural networks.
- Class: MinMaxScaler from sklearn.
- .fit_transform(): Learns min/max from data and applies scaling.
Python
scaler = MinMaxScaler()
X_normalized = scaler.fit_transform(X)
print(X_normalized[:5])
Output:
2. Standardization: Transforms features to have mean = 0 and standard deviation = 1, useful for normally distributed features.
- Class: StandardScaler from sklearn.
Python
scaler = StandardScaler()
X_standardized = scaler.fit_transform(X)
print(X_standardized[:5])
Output:
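To confirm the transform behaved as described, one can check the per-feature mean and standard deviation; this verification snippet is a sketch, not part of the original article.
Python
# After StandardScaler each column should have mean ~0 and std ~1
print(X_standardized.mean(axis=0).round(2))
print(X_standardized.std(axis=0).round(2))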
Advantages
Let's look at the advantages of data preprocessing:
- Improves Data Quality: Cleans and organizes raw data for better analysis.
- Enhances Model Accuracy: Removes noise and irrelevant data, leading to more precise predictions.
- Reduces Overfitting: Handles outliers and redundant features, improving model generalization.
- Speeds Up Training: Efficiently scaled data reduces computation time.
- Ensures Algorithm Compatibility: Converts data into formats suitable for machine learning models.