
MACHINE LEARNING

Dr Soumya Ranjan Mishra


Introduction to Machine Learning:

Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on building systems that learn
from data, improve over time, and make predictions or decisions without being explicitly programmed.

What is Machine Learning?

Ø It is a method of data analysis that automates analytical model building.

Ø ML algorithms use computational methods to "learn" information directly from data without relying on a
predetermined equation.

Ø It involves developing algorithms that can automatically detect patterns in data and use these patterns to
predict future data or outcomes.
Why is Machine Learning Important?

Data Explosion: We generate an enormous amount of data every day, and analyzing it manually is impractical.
ML helps in extracting useful insights.

Automation: ML algorithms can automate repetitive tasks like fraud detection, email classification,
recommendation systems, and more.

Predictive Power: Machine learning models can predict future trends by recognizing patterns in past data.
Types of Machine Learning:

Supervised Learning:

Definition: Supervised learning is where the model is trained on a labeled dataset. The algorithm learns
from the input-output pairs (known as training data) and tries to map inputs to outputs.

Example: Predicting house prices based on features like area, number of rooms, etc.

Key Algorithms:

Ø Linear Regression
Ø Logistic Regression
Ø Support Vector Machines (SVM)
Ø Decision Trees
Ø K-Nearest Neighbors (KNN)
Ø Neural Networks
Applications:
Ø Spam email detection
Ø Stock price prediction
Ø Disease prediction (e.g., cancer diagnosis)
Unsupervised Learning:

Definition: In unsupervised learning, the model is given data without labels (no predefined output). The model
attempts to find hidden patterns or intrinsic structures in the data.

Example: Grouping customers based on purchasing patterns.

Key Algorithms:

Ø K-Means Clustering
Ø Hierarchical Clustering
Ø Principal Component Analysis (PCA)
Ø Gaussian Mixture Models (GMM)

Applications:

Ø Market segmentation
Ø Anomaly detection
Ø Dimensionality reduction
Reinforcement Learning

Definition: Reinforcement learning is a type of ML where an agent learns by interacting with the environment
and receiving feedback through rewards or punishments.

Example: Training a robot to navigate through a maze.

Applications:

Ø Autonomous vehicles
Ø Game playing (e.g., AlphaGo)
Ø Robotics
The Curse of Dimensionality:

Definition: The curse of dimensionality refers to the challenges and issues that arise when analyzing and organizing data
in high-dimensional spaces (with many features or variables).

Key Issues:

Ø Data Sparsity: As the number of features increases, the data becomes sparse, making it harder to find meaningful
patterns.

Ø Distance Metrics: In high-dimensional spaces, the distance between data points becomes less informative, making
clustering or classification less effective.

Ø Overfitting: Higher dimensions lead to overfitting since the model may fit the noise in the data.

Mitigation:

Dimensionality Reduction: Use techniques like Principal Component Analysis (PCA) to reduce the number of features
while retaining the important information.
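A minimal sketch of dimensionality reduction with scikit-learn's PCA is shown below; the synthetic 50-feature dataset and the choice of two components are illustrative assumptions, not part of the slides.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))          # 200 samples with 50 features (high-dimensional)

pca = PCA(n_components=2)               # keep only the 2 directions of largest variance
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                  # (200, 2)
print(pca.explained_variance_ratio_)    # fraction of total variance retained by each component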
Overfitting and Underfitting:

Overfitting:

Definition: Overfitting occurs when a model is too complex and captures noise or random fluctuations in the training data. As
a result, the model performs very well on the training data but poorly on unseen data (test set).

Symptoms:

Ø Very low error on the training set.

Ø High error on the test set.

Causes:
Ø Model is too complex (e.g., too many parameters, high-degree polynomial regression).

Ø Insufficient training data.

Solutions:
Ø Use simpler models (e.g., reducing the number of features or parameters).
Ø Use regularization techniques (e.g., L1/L2 regularization).
Ø Increase the amount of training data.
Underfitting:

Definition: Underfitting occurs when the model is too simple to capture the underlying patterns in the data.
The model performs poorly both on the training set and the test set.

Symptoms:

High error on both training and test sets.

Causes:

Ø Model is too simple (e.g., linear regression when the data has non-linear relationships).

Ø Not enough features in the model.

Solutions:

Ø Use more complex models.

Ø Include more relevant features.


Model Selection:

Definition: Model selection is the process of choosing the best machine learning model for a given problem. It
involves selecting an algorithm and tuning its parameters to achieve the best performance.

Approaches:

Cross-Validation: Use techniques like k-fold cross-validation to evaluate the model on different subsets of the
data.

Hyperparameter Tuning: Use grid search or random search to find the best set of hyperparameters for the
chosen model.

Bias-Variance Trade-off: Choose a model that strikes a balance between bias (underfitting) and variance
(overfitting).
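As a rough illustration of these approaches, the sketch below combines 5-fold cross-validation with a grid search over the regularization strength of a Ridge regressor; the synthetic dataset and the alpha grid are assumptions made for the example.

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)

# 5-fold cross-validation over a small grid of the regularization strength alpha
grid = GridSearchCV(Ridge(), param_grid={"alpha": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X, y)

print(grid.best_params_)   # hyperparameter values with the best cross-validated score
print(grid.best_score_)    # mean R^2 across the 5 validation folds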
Error Analysis and Validation:

Error Analysis:

Analyze the errors of your model to understand where it is going wrong.

Common techniques include:

Confusion Matrix: Used for classification tasks to show the performance of a classification model.

Residual Analysis: For regression models, analyze the residuals (the difference between the actual and
predicted values) to understand the model's performance.
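A minimal sketch of error analysis with a confusion matrix is shown below, using an assumed synthetic classification dataset and a logistic regression classifier purely for illustration.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Rows = actual classes, columns = predicted classes; off-diagonal entries are the errors
print(confusion_matrix(y_test, clf.predict(X_test)))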
Model Validation:

Ø Hold-out Validation: Split data into training and test sets. Train the model on the training set and validate
it on the test set.

Ø Cross-validation: Split the data into multiple folds and train and validate the model on each fold.

Ø Validation Set: Set aside a portion of the training data to validate the model during training.
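The sketch below illustrates k-fold cross-validation with scikit-learn's cross_val_score on an assumed synthetic regression dataset; the choice of 5 folds and a plain linear regression model are assumptions for the example.

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=100, n_features=3, noise=5, random_state=0)

# 5-fold cross-validation: train on 4 folds, validate on the remaining fold, repeat 5 times
scores = cross_val_score(LinearRegression(), X, y, cv=5)
print(scores)          # one R^2 score per fold
print(scores.mean())   # average validation performance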
Parametric vs. Non-Parametric Models:

Parametric Models:

Definition: Parametric models make assumptions about the underlying data distribution and have a fixed
number of parameters.

Examples: Linear Regression, Logistic Regression, Naive Bayes.

Advantages:

Ø Less computationally expensive.

Ø Easier to interpret.

Disadvantages:

May not perform well if the assumption about the data distribution is wrong.
Non-Parametric Models:

Definition: Non-parametric models do not make strong assumptions about the data distribution and do not
have a fixed number of parameters. They are more flexible in terms of data modeling.

Examples: K-Nearest Neighbors (KNN), Decision Trees, Random Forests.

Advantages:

Ø Can model complex, non-linear relationships.

Ø Do not require assumptions about the data.

Disadvantages:

Ø More computationally expensive.

Ø Can suffer from overfitting if not tuned properly.


Example: Parametric Model: Linear Regression

Scenario: Predicting a student's exam score based on the number of hours they studied.

Assumption: There is a linear relationship between hours studied (x) and exam score (y).

y = 5x + 50

For every additional hour studied, the score increases by 5 points. Even with minimal data (e.g., 10 students), this model
assumes the same relationship applies universally.

Example: Non-Parametric Model: k-Nearest Neighbors (k-NN)

Scenario: Predicting a student's exam score based on hours studied, but without assuming a specific form of the
relationship.

Stores all the training data (e.g., scores of 50 students with their study hours). To predict a new student's score, it looks at the
scores of the k closest students (e.g., those who studied a similar number of hours).

Example:
A new student studied for 4 hours.
The model looks at the 3 students who studied closest to 4 hours and predicts the score based on their average.
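A minimal sketch contrasting the two examples is given below; the study-hours data, the scores, and k = 3 are illustrative assumptions.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

hours = np.array([1, 2, 3, 4, 5, 6, 7, 8]).reshape(-1, 1)   # hours studied (illustrative)
score = np.array([55, 60, 66, 69, 75, 81, 85, 90])          # exam scores (illustrative)

# Parametric: learns just two numbers (a slope and an intercept)
lin = LinearRegression().fit(hours, score)

# Non-parametric: stores all training points and averages the k = 3 nearest neighbours
knn = KNeighborsRegressor(n_neighbors=3).fit(hours, score)

new_student = np.array([[4]])        # a new student who studied 4 hours
print(lin.predict(new_student))      # prediction from the fitted line
print(knn.predict(new_student))      # average score of the 3 most similar students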
Regression
What is Regression?

Ø Regression is a supervised learning technique used to model the relationship between input features (independent
variables) and a continuous target variable (dependent variable).

Ø It predicts a numerical outcome based on the given input data.

Example: Predicting house prices based on features like size, location, and number of
rooms.
Linear Regression
Definition:

Ø Linear regression is one of the simplest and most widely used regression techniques.

Ø It models the relationship between the dependent variable Y and one or more
independent variables X by fitting a linear equation:
Types:

Simple Linear Regression:

Involves a single independent variable.

y = wx + b
Multiple Linear Regression:

Involves multiple independent variables.

y = w1x1 + w2x2 + ... + wnxn + b

Intuition Behind Linear Regression

Ø The goal is to find the best-fitting line (or hyperplane in higher dimensions) that
minimizes the error between the predicted values Ŷ and the actual values Y.

Ø The "best fit" is achieved by optimizing the weights w and bias b to minimize the cost
function.

y = wx + b

Example:
Suppose you have data of house sizes X and their prices Y.

A linear regression model predicts house prices based on size by finding a straight line that
best matches the data.
Cost Function

Purpose:

Ø The cost function measures the error or difference between the predicted values Ŷ
and the actual values Y.

Ø In linear regression, we use the Mean Squared Error (MSE) as the cost function.
Why Use MSE?
Ø Squaring the errors ensures they are positive, avoiding cancellations between positive and negative errors.

Ø Larger errors are penalized more heavily, encouraging the model to minimize big mistakes.

Minimizing the Cost Function:


Ø The goal of training the model is to find w and b such that the cost function J(w,b) is minimized.

Ø Optimization techniques like Gradient Descent are used for this purpose.
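A minimal sketch of computing the MSE cost for given values of w and b is shown below; the toy X and y arrays and the two candidate (w, b) pairs are assumptions for illustration.

import numpy as np

def mse_cost(w, b, X, y):
    # J(w, b) = (1/n) * sum((y_pred - y)^2)
    y_pred = w * X + b
    return np.mean((y_pred - y) ** 2)

X = np.array([1, 2, 3, 4, 5], dtype=float)   # e.g., house sizes (illustrative)
y = np.array([2, 4, 5, 4, 5], dtype=float)   # e.g., prices (illustrative)

print(mse_cost(w=1.0, b=0.0, X=X, y=y))      # cost for one guess of (w, b)
print(mse_cost(w=0.8, b=1.2, X=X, y=y))      # a better (w, b) gives a lower cost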
Example Workflow of Linear Regression

• Collect training data (X, y).

• Initialize w and b randomly.

• Compute the cost function J(w,b) using MSE

• Update w and b iteratively to reduce the cost function (using Gradient Descent).

• Evaluate the model using new data.


Gradient Descent for Linear Regression

Yi = β0 + β1Xi

where Yi = dependent variable,

β0 = constant/intercept,

β1 = slope coefficient,

Xi = independent variable.

The goal of the linear regression algorithm is to get the best values for β0 and β1 that define the best-fit line.

The best-fit line is the line with the least error, i.e., the error between the predicted values and the actual values should
be minimum.

But how does linear regression find the best-fit line?
We calculate the MSE using the simple linear equation: Yi = β0 + β1Xi

Ø Using the MSE as the cost function, we update the values of β0 and β1 so that the MSE settles at its minimum.

Ø These parameters can be determined using the gradient descent method such that the value of the
cost function is minimum.
Ø Gradient Descent is one of the optimization algorithms that optimize the cost function (objective function)
to reach the optimal minimal solution.

Ø To find the optimum solution, we need to reduce the cost function (MSE) for all data points. This is done
by updating the values of the slope coefficient (B1) and the constant coefficient (B0) iteratively until we
get an optimal solution for the linear function.
A regression model uses the gradient descent algorithm to optimize the coefficients of the line: it starts from randomly
selected coefficient values and then iteratively updates them to reduce the cost function until the minimum is reached.

In the gradient descent algorithm, the size of each step is set by the learning rate, and this decides how fast the
algorithm converges to the minimum.
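A from-scratch sketch of batch gradient descent for simple linear regression is shown below; the toy data (generated from y = 2x + 1), the learning rate of 0.05, and the 2000 iterations are assumptions chosen only to make the example converge.

import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3, 5, 7, 9, 11], dtype=float)   # underlying relation y = 2x + 1 (illustrative)

b0, b1 = 0.0, 0.0        # intercept and slope, initialised at zero
lr = 0.05                # learning rate (step size)

for _ in range(2000):
    y_pred = b0 + b1 * X
    error = y_pred - y
    # Gradients of the MSE with respect to b0 and b1
    grad_b0 = 2 * error.mean()
    grad_b1 = 2 * (error * X).mean()
    b0 -= lr * grad_b0
    b1 -= lr * grad_b1

print(round(b0, 3), round(b1, 3))   # should approach b0 ≈ 1 and b1 ≈ 2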
Multiple Linear Regression (MLR)

Introduction
Ø Multiple Linear Regression (MLR) is a statistical technique used to model the relationship between one
dependent variable (target) and two or more independent variables (predictors).

Ø The goal is to find the linear equation that best predicts the dependent variable using the independent
variables.
Assumptions of MLR

Linearity: The relationship between dependent and independent variables is linear.

Independence: Observations are independent of each other.

Homoscedasticity: The variance of residuals is constant across all levels of the independent variables.

No Multicollinearity: Independent variables are not highly correlated with each other.

Normality: Residuals (errors) are normally distributed.

Applications

Ø Predicting housing prices based on features like size, location, and number of rooms.

Ø Forecasting sales using advertising spend on different platforms.

Ø Estimating the impact of multiple factors on employee performance.


Different evaluation metrics, such as MSE, RMSE, MAE, and R², are used to measure how well a regression model
predicts continuous values.
Types of Gradient Descent
Batch, Stochastic, and Mini-Batch Gradient Descent
Gradient Descent is an optimization algorithm used to minimize the cost
function by iteratively moving in the direction of the steepest descent
(negative gradient).
Types of Gradient Descent

1. Batch Gradient Descent

2. Stochastic Gradient Descent (SGD)

3. Mini-Batch Gradient Descent


Batch Gradient Descent
Uses the entire dataset to compute the gradient.

• Advantages: Stable updates; converges to the global minimum for convex cost functions.

• Disadvantages: Slow for large datasets, high memory usage.

• Best for small datasets.


Stochastic Gradient Descent (SGD)
Uses one random sample to compute the gradient per iteration.

• Advantages: Faster updates, suitable for large datasets.

• Disadvantages: Noisy updates; may oscillate around the minimum.

• Best for online learning and large-scale data.


Mini-Batch Gradient Descent
Splits the dataset into small batches for gradient computation.

• Advantages: Faster convergence, balanced approach.

• Disadvantages: Requires batch size tuning, slight oscillations.

• Most commonly used in deep learning.


Comparison of Gradient Descent Types
• Batch GD:
• Update: Once per epoch
• Stable but slow.

• SGD:
• Update: Once per sample
• Fast but noisy.

• Mini-Batch GD:
• Update: Once per batch
• Balanced and efficient.
Applications
• Batch GD: Small datasets (e.g., linear regression)

• SGD: Online learning (e.g., recommender systems).

• Mini-Batch GD: Deep learning models.
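To make the three variants concrete, the sketch below implements mini-batch gradient descent for a simple linear model; setting batch_size = len(X) would give batch GD, and batch_size = 1 plain SGD. The noisy data generated from y = 2x + 1, the learning rate, and the batch size are all assumptions for the example.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = 2 * X + 1 + rng.normal(scale=0.5, size=200)   # noisy y = 2x + 1 (illustrative)

b0, b1 = 0.0, 0.0
lr, batch_size = 0.01, 20

for epoch in range(200):
    idx = rng.permutation(len(X))                 # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]     # one mini-batch of 20 samples
        error = (b0 + b1 * X[batch]) - y[batch]
        b0 -= lr * 2 * error.mean()
        b1 -= lr * 2 * (error * X[batch]).mean()

print(round(b0, 2), round(b1, 2))   # should end up close to 1 and 2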


Normalization and Standardization
Normalization:

Rescales the values of a dataset to a range of [0, 1] or [-1, 1].

Standardization:

Transforms data to have a mean (μ) of 0 and a standard deviation (σ) of 1.


Why Are These Important?

Normalization:

Ø Useful for algorithms like k-Nearest Neighbors (k-NN) and Neural Networks that are sensitive to the scale
of data.

Ø Ensures all features contribute equally to the model.

Standardization:

Ø Essential for models assuming Gaussian distribution or sensitive to feature magnitude, such as Support
Vector Machines (SVM), Principal Component Analysis (PCA), and Linear Regression.

Ø Improves convergence speed during optimization in Gradient Descent.


Normalization Example:

Feature 1 (Height in cm): [150, 160, 170, 180, 190]


Feature 2 (Weight in kg): [50, 60, 70, 80, 90]

Normalized Data:

Height: [0, 0.25, 0.5, 0.75, 1]


Weight: [0, 0.25, 0.5, 0.75, 1]

Standardization Example:

Feature (Scores): [50, 60, 70, 80, 90]


Mean = 70
Standard Deviation (sample) = 15.81

Standardized Data:

[−1.26, −0.63, 0, 0.63, 1.26]

Useful for SVM or PCA as it makes the feature mean-centered.
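The sketch below reproduces these examples with scikit-learn's MinMaxScaler and StandardScaler. Note that StandardScaler divides by the population standard deviation, so its output differs slightly from the hand-computed values above, which used the sample standard deviation (15.81).

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

heights = np.array([150, 160, 170, 180, 190], dtype=float).reshape(-1, 1)
scores = np.array([50, 60, 70, 80, 90], dtype=float).reshape(-1, 1)

# Normalization to [0, 1] -- matches the height example above
print(MinMaxScaler().fit_transform(heights).ravel())     # [0.   0.25 0.5  0.75 1.  ]

# Standardization to mean 0, standard deviation 1 (population std of 14.14)
print(StandardScaler().fit_transform(scores).ravel())    # [-1.41 -0.71  0.    0.71  1.41]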


PYTHON CODE FOR LINEAR REGRESSION
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Training data: X is the input feature, y is the continuous target
X = np.array([1, 2, 3, 4, 5, 7, 8, 9, 10, 15, 17, 19, 21]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5, 7, 9, 10, 12, 15, 18, 20, 25])

# Hold-out validation: 80% of the data for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=32)

# Fit the model (learn w and b) and predict on the test set
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Plot the data points and the best-fit line
plt.scatter(X, y, color="blue", label="Data Points")
plt.plot(X, model.predict(X), color="red", label="Best Fit Line")
plt.xlabel("X")
plt.ylabel("y")
plt.title("Linear Regression: Best Fit Line")
plt.legend()
plt.show()

# Predict for a new, unseen input
new_data = np.array([[30]])  # Replace with your input
predicted_output = model.predict(new_data)
print(f"Predicted output for {new_data.flatten()}: {predicted_output[0]:.2f}")
Introduction to Overfitting and Underfitting

Ø Overfitting and underfitting are common problems in machine learning that affect
model performance.

Ø They arise from the trade-off between a model's complexity and its ability to generalize
to new, unseen data.
Underfitting:

A model is underfitting when it is too simple to capture the underlying patterns in the data.

Symptoms:

Ø High training error.

Ø High testing error.

Causes:

Ø Model lacks complexity.


Ø Insufficient training.
Ø Features are not informative enough.

Example: Using a linear regression model to fit non-linear data


Overfitting:

A model is overfitting when it learns not only the patterns but also the noise in the training
data.

Symptoms:

Ø Low training error.


Ø High testing error.
Causes:

Ø Model is too complex.


Ø Insufficient training data for the model's complexity.
Ø No regularization applied.

Example: A decision tree with very deep splits that memorizes the training data.
Example predicting house prices:

Underfitting: Using only the size of the house as a feature while ignoring other factors like
location, number of bedrooms, etc.

Overfitting: Including irrelevant details like the color of the walls, which does not
generalize to new data.
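A minimal sketch of this trade-off is shown below: the same non-linear data (an assumed noisy sine curve) is fitted with polynomials of degree 1, 4, and 15, and the training and test errors are compared. Degree 1 tends to underfit (high error on both sets), while degree 15 tends to overfit (very low training error but a higher test error).

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 3, size=40)).reshape(-1, 1)
y = np.sin(2 * X).ravel() + rng.normal(scale=0.1, size=40)   # non-linear data (illustrative)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for degree in (1, 4, 15):   # too simple, reasonable, too complex
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    print(degree, round(train_err, 4), round(test_err, 4))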
What is Regularization?

Ø Regularization is a technique to prevent overfitting in machine learning models.

Ø It adds a penalty term to the loss function to shrink the magnitude of coefficients, ensuring the model
generalizes well to new data.

Lasso Regularization

Ø Least Absolute Shrinkage and Selection Operator.

Ø Penalty Term: Adds the L1 norm (sum of absolute values of coefficients) to the loss function.
Key Features:

Ø Shrinks coefficients.

Ø Feature Selection: Some coefficients are reduced to exactly zero, effectively removing
irrelevant features.

When to Use:

When you suspect some features are irrelevant and want to automatically select the most
important ones.
Ridge Regularization

Penalty Term: Adds the L2 norm (sum of squares of coefficients) to the loss function.

Key Features:

Ø Shrinks coefficients but does not make them zero.

Ø Retains all features and is better for handling multicollinearity (correlated features).

When to Use:

When all features are expected to contribute to the target variable, even if minimally.
Lasso Regularization: Feature Selection in Predictive Modeling

Example:

A healthcare company is building a predictive model to estimate a patient's risk of developing diabetes based
on features like age, BMI, cholesterol levels, glucose levels, and hundreds of genetic markers.

Problem:

Ø Many features (like certain genetic markers) might have little to no effect on the outcome.

Ø Including all features increases model complexity and could lead to overfitting.

Solution:

Ø Apply Lasso regularization to automatically reduce irrelevant features' coefficients to 0.

Ø The model keeps only the most important features, making it simpler and more interpretable for clinicians.
Ridge Regularization: Improving Predictions in Multicollinear Data

Example:

An e-commerce company wants to predict product demand based on factors like price, discount offered,
advertising spend, and competitor pricing.

Problem:

Ø Some features (e.g., "price" and "discount offered") are highly correlated, leading to multicollinearity.

Ø Standard linear regression struggles in such scenarios, resulting in unstable coefficient estimates.

Solution:

Ø Use Ridge regularization, which penalizes large coefficients without dropping any features.

Ø Ridge shrinks the correlated feature coefficients, stabilizing predictions while retaining all information.
Dataset: Predict house prices based on features like size, location, and age.

Lasso: Automatically drops less relevant features like age if size and location explain most of
the variance.

Ridge: Keeps all features and distributes importance across them, even if age has minimal
impact.

Lasso = Simpler, interpretable models (use when feature selection is needed).

Ridge = Stable, robust predictions (use when all features are useful).
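A minimal sketch comparing the two penalties is given below; the synthetic dataset (only 3 of 10 features are truly informative) and alpha = 1.0 are assumptions chosen for illustration.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Synthetic data in which only 3 of the 10 features actually influence the target
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.round(lasso.coef_, 2))   # irrelevant features are typically driven exactly to 0
print(np.round(ridge.coef_, 2))   # all coefficients are kept, just shrunk toward 0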
Bias, Variance, and Tradeoff

Bias is when the model is too simple and cannot capture the pattern in the data properly.
This leads to underfitting.

Example

Imagine you are trying to guess someone's age.

Instead of analyzing details like height, hair color, and clothing

you always guess the same average age for everyone (e.g., 25 years).

This is a high bias guess because you ignore specific details and make overly simplistic
assumptions.
Variance is when the model is too complex and tries to learn even the smallest details
(noise) in the data. This leads to overfitting.

You try to guess someone's age again.

But this time you overanalyze everything:

1. She has long hair, so she must be 20.


2. Oh wait, she’s wearing glasses, so maybe she’s 22.
3. Wait, a slight wrinkle? Maybe 28.

This is high variance because you are too sensitive to small details that don’t really
matter.
Bias-Variance Tradeoff (Simple Analogy of Exam)
High Bias (Underfitting):

Ø You only study one chapter for the exam and ignore everything else.

Ø Result: You don’t perform well because you miss most questions.

Ø Model is too simple, unable to capture the full syllabus.

High Variance (Overfitting):

Ø You try to memorize every single word from the textbook, notes, and examples—even unnecessary details.

Ø Result: You get confused in the exam because you overanalyzed and didn’t focus on what’s important.

Ø Model is too complex and learns unnecessary details.


Balanced Study (Optimal Tradeoff):

Ø You study all chapters thoroughly but focus on understanding the key concepts and main topics.

Ø Result: You perform well on the exam because you have the right balance of preparation.

Bias = Too simple → Underfitting → Misses the important details.

Variance = Too complex → Overfitting → Gets distracted by unnecessary details.

Goal: Find the right balance (Bias-Variance Tradeoff) so the model can generalize well to new data.
