ML-1-PPT-UNIT-1
Machine Learning (ML) is a branch of artificial intelligence (AI) that focuses on building systems that learn
from data, improve over time, and make predictions or decisions without being explicitly programmed.
Ø ML algorithms use computational methods to "learn" information directly from data without relying on a
predetermined equation.
Ø It involves developing algorithms that can automatically detect patterns in data and use these patterns to
predict future data or outcomes.
Why is Machine Learning Important?
Data Explosion: We generate an enormous amount of data every day, and analyzing it manually is impractical.
ML helps in extracting useful insights.
Automation: ML algorithms can automate repetitive tasks like fraud detection, email classification,
recommendation systems, and more.
Predictive Power: Machine learning models can predict future trends by recognizing patterns in past data.
Types of Machine Learning:
Supervised Learning:
Definition: Supervised learning is where the model is trained on a labeled dataset. The algorithm learns
from the input-output pairs (known as training data) and tries to map inputs to outputs.
Example: Predicting house prices based on features like area, number of rooms, etc.
Key Algorithms:
Ø Linear Regression
Ø Logistic Regression
Ø Support Vector Machines (SVM)
Ø Decision Trees
Ø K-Nearest Neighbors (KNN)
Ø Neural Networks
Applications:
Ø Spam email detection
Ø Stock price prediction
Ø Disease prediction (e.g., cancer diagnosis)
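For illustration, a minimal supervised-learning sketch in Python using scikit-learn; the dataset (Iris) and the choice of a KNN classifier are example assumptions, not part of the slides:

# Minimal supervised-learning sketch: train on labeled data, evaluate on unseen data.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)            # features (inputs) and labels (outputs)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)  # one of the supervised algorithms listed above
model.fit(X_train, y_train)                  # learn the input-output mapping
print(model.score(X_test, y_test))           # accuracy on unseen (test) data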
Unsupervised Learning:
Definition: In unsupervised learning, the model is given data without labels (no predefined output). The model
attempts to find hidden patterns or intrinsic structures in the data.
Key Algorithms:
Ø K-Means Clustering
Ø Hierarchical Clustering
Ø Principal Component Analysis (PCA)
Ø Gaussian Mixture Models (GMM)
Applications:
Ø Market segmentation
Ø Anomaly detection
Ø Dimensionality reduction
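As a sketch of unsupervised learning, the following K-Means example clusters a small made-up dataset that has no labels (the data values are purely illustrative):

# Minimal unsupervised-learning sketch: K-Means finds clusters without any labels.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [1, 4], [1, 0],        # unlabeled data points
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)                        # cluster assignment for each point
print(kmeans.cluster_centers_)               # discovered cluster centers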
Reinforcement Learning
Definition: Reinforcement learning is a type of ML where an agent learns by interacting with the environment
and receiving feedback through rewards or punishments.
Applications:
Ø Autonomous vehicles
Ø Game playing (e.g., AlphaGo)
Ø Robotics
The Curse of Dimensionality:
Definition: The curse of dimensionality refers to the challenges and issues that arise when analyzing and organizing data
in high-dimensional spaces (with many features or variables).
Key Issues:
Ø Data Sparsity: As the number of features increases, the data becomes sparse, making it harder to find meaningful
patterns.
Ø Distance Metrics: In high-dimensional spaces, the distance between data points becomes less informative, making
clustering or classification less effective.
Ø Overfitting: Higher dimensions lead to overfitting since the model may fit the noise in the data.
Mitigation:
Dimensionality Reduction: Use techniques like Principal Component Analysis (PCA) to reduce the number of features
while retaining the important information.
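A minimal PCA sketch, assuming randomly generated data, showing how 10 features can be reduced to 2 components while reporting how much variance is retained:

# Dimensionality-reduction sketch with PCA (illustrative random data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))               # 100 samples with 10 features

pca = PCA(n_components=2)                    # keep the 2 most informative directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                       # (100, 2)
print(pca.explained_variance_ratio_)         # variance retained by each component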
Overfitting and Underfitting:
Overfitting:
Definition: Overfitting occurs when a model is too complex and captures noise or random fluctuations in the training data. As
a result, the model performs very well on the training data but poorly on unseen data (test set).
Symptoms:
Ø Very low error on the training data but noticeably higher error on the test (unseen) data.
Causes:
Ø Too complex model (e.g., too many parameters, high-degree polynomial regression).
Solutions:
Ø Use simpler models (e.g., reducing the number of features or parameters).
Ø Use regularization techniques (e.g., L1/L2 regularization).
Ø Increase the amount of training data.
Underfitting:
Definition: Underfitting occurs when the model is too simple to capture the underlying patterns in the data.
The model performs poorly both on the training set and the test set.
Symptoms:
Ø High error on both the training data and the test data.
Causes:
Ø Too simple model (e.g., linear regression when the data has non-linear relationships).
Solutions:
Ø Use a more complex model, add more relevant features, or reduce regularization.
Model Selection:
Definition: Model selection is the process of choosing the best machine learning model for a given problem. It
involves selecting an algorithm and tuning its hyperparameters to achieve the best performance.
Approaches:
Cross-Validation: Use techniques like k-fold cross-validation to evaluate the model on different subsets of the
data.
Hyperparameter Tuning: Use grid search or random search to find the best set of hyperparameters for the
chosen model.
Bias-Variance Trade-off: Choose a model that strikes a balance between bias (underfitting) and variance
(overfitting).
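A short sketch of the first two approaches above using scikit-learn; the SVM model and the parameter grid are illustrative choices, not prescribed by the slides:

# k-fold cross-validation and hyperparameter tuning (grid search) sketch.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Cross-validation: evaluate the model on 5 different train/validation splits
scores = cross_val_score(SVC(), X, y, cv=5)
print(scores.mean())

# Grid search: try several hyperparameter combinations and keep the best one
grid = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}, cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)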
Error Analysis and Validation:
Error Analysis:
Confusion Matrix: Used for classification tasks to show the performance of a classification model.
Residual Analysis: For regression models, analyze the residuals (the difference between the actual and
predicted values) to understand the model's performance.
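A minimal sketch of both ideas above; the labels, predictions, and values are made up for illustration:

# Confusion matrix for a classification task (rows = actual class, columns = predicted class)
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(confusion_matrix(y_true, y_pred))

# Residual analysis for a regression model: residual = actual - predicted
y_actual = np.array([3.0, 5.0, 7.5, 9.0])
y_hat = np.array([2.8, 5.4, 7.1, 9.3])
print(y_actual - y_hat)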
Model Validation:
Ø Hold-out Validation: Split data into training and test sets. Train the model on the training set and validate
it on the test set.
Ø Cross-validation: Split the data into multiple folds and train and validate the model on each fold.
Ø Validation Set: Set aside a portion of the training data to validate the model during training.
Parametric vs. Non-Parametric Models:
Parametric Models:
Definition: Parametric models make assumptions about the underlying data distribution and have a fixed
number of parameters.
Advantages:
Ø Easier to interpret.
Disadvantages:
Ø May not perform well if the assumption about the data distribution is wrong.
Non-Parametric Models:
Definition: Non-parametric models do not make strong assumptions about the data distribution and do not
have a fixed number of parameters. They are more flexible in terms of data modeling.
Advantages:
Ø Flexible: can model complex, non-linear relationships without assuming a specific distribution.
Disadvantages:
Ø Typically need more data and computation, and can be harder to interpret.
Parametric Example:
Scenario: Predicting a student's exam score based on the number of hours they studied.
Assumption: There is a linear relationship between hours studied (X) and exam score (Y):
Y = 5X + 50
For every additional hour studied, the score increases by 5 points. Even with minimal data (e.g., 10 students), this model
assumes the same relationship applies universally.
Non-Parametric Example (e.g., K-Nearest Neighbors):
Scenario: Predicting a student's exam score based on hours studied, but without assuming a specific form of the
relationship.
Stores all the training data (e.g., scores of 50 students with their study hours). To predict a new student's score, it looks at the
scores of the k closest students (e.g., those who studied a similar number of hours).
Example:
A new student studied for 4 hours.
The model looks at the 3 students who studied closest to 4 hours and predicts the score based on their average.
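A small sketch contrasting the two approaches on the study-hours scenario; the student data below is invented for illustration:

# Parametric model (linear regression) vs. non-parametric model (KNN) on made-up data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

hours = np.array([[1], [2], [3], [5], [6], [8]])            # hours studied
score = np.array([55, 60, 66, 74, 80, 90])                  # exam scores

linear = LinearRegression().fit(hours, score)               # learns 2 parameters: slope and intercept
knn = KNeighborsRegressor(n_neighbors=3).fit(hours, score)  # simply stores all training data

new_student = [[4]]                                         # studied 4 hours
print(linear.predict(new_student))                          # prediction from the fitted line
print(knn.predict(new_student))                             # average of the 3 closest students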
Regression
What is Regression?
Regression is a supervised learning technique used to predict a continuous numerical output from one or more input features.
Example: Predicting house prices based on features like size, location, and number of
rooms.
Linear Regression
Definition:
Ø Linear regression is one of the simplest and most widely used regression techniques.
Ø It models the relationship between the dependent variable Y and one or more
independent variables X by fitting a linear equation:
Types:
Ø Simple Linear Regression (one independent variable): y = wx + b
Ø Multiple Linear Regression (two or more independent variables): y = w1x1 + w2x2 + ... + wnxn + b
Ø The goal is to find the best-fitting line (or hyperplane in higher dimensions) that
minimizes the error between the predicted values Ŷ and the actual values Y.
Ø The "best fit" is achieved by optimizing the weights w and bias b to minimize the cost
function.
Example:
Suppose you have data of house sizes X and their prices Y:
A linear regression model predicts house prices based on size by finding a straight line that
best matches the data.
Cost Function
Purpose:
Ø The cost function measures the error or difference between the predicted values Ŷ
and the actual values Y.
Ø In linear regression, we use the Mean Squared Error (MSE) as the cost function.
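Written out explicitly (this is the standard definition; the slide states it only in words), the MSE cost function over n training points is:

J(w, b) = \mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( Y_i - \hat{Y}_i \right)^2, \qquad \hat{Y}_i = wX_i + b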
Why Use MSE?
Ø Squaring the errors ensures they are positive, avoiding cancellations between positive and negative errors.
Ø Larger errors are penalized more heavily, encouraging the model to minimize big mistakes.
Ø Optimization techniques like Gradient Descent are used for this purpose.
Example Workflow of Linear Regression
• Initialize the weight w and bias b (e.g., to zeros or small random values).
• Predict Ŷ = wX + b for the training data and compute the cost (MSE).
• Update w and b iteratively to reduce the cost function (using Gradient Descent).
• Stop when the cost stops decreasing; use the final w and b to make predictions.
Yi = β0 + β1Xi
β0 = Constant / Intercept,
β1 = Slope / Coefficient,
Xi = Independent variable.
The goal of the linear regression algorithm is to get the best values for β0 and β1 to find the best-fit line.
The best-fit line is the line that has the least error, which means the error between predicted values and
actual values should be minimum.
But how does linear regression find the best-fit line?
We calculate MSE using the simple linear equation: Yi = β0 + β1Xi
Ø Using the MSE function, we update the values of β0 and β1 such that the MSE value settles at
the minima.
Ø These parameters can be determined using the gradient descent method such that the value for
the cost function is minimum.
Ø Gradient Descent is one of the optimization algorithms that optimize the cost function (objective function)
to reach the optimal minimal solution.
Ø To find the optimum solution, we need to reduce the cost function (MSE) for all data points. This is done
by updating the values of the slope coefficient (B1) and the constant coefficient (B0) iteratively until we
get an optimal solution for the linear function.
A regression model uses the gradient descent algorithm to update the coefficients of the line: it starts from
randomly selected coefficient values and then iteratively updates them so as to reduce the cost function until
the minimum is reached.
In the gradient descent algorithm, the size of each step you take is the learning
rate, and this decides how fast the algorithm converges to the minima.
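A minimal gradient-descent sketch for simple linear regression, following the update procedure described above; the data, learning rate, and iteration count are arbitrary illustrative choices:

# Gradient descent for y = B0 + B1*x, minimizing MSE.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])            # underlying relationship: y = 2x + 1

b0, b1 = 0.0, 0.0                                   # start from arbitrary coefficients
lr = 0.01                                           # learning rate (step size)
n = len(x)

for _ in range(5000):
    y_pred = b0 + b1 * x
    error = y_pred - y
    grad_b0 = (2 / n) * error.sum()                 # gradient of MSE w.r.t. B0
    grad_b1 = (2 / n) * (error * x).sum()           # gradient of MSE w.r.t. B1
    b0 -= lr * grad_b0                              # step against the gradient
    b1 -= lr * grad_b1

print(b0, b1)                                       # approaches 1 and 2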
Multiple Linear Regression (MLR)
Introduction
Ø Multiple Linear Regression (MLR) is a statistical technique used to model the relationship between one
dependent variable (target) and two or more independent variables (predictors).
Ø The goal is to find the linear equation that best predicts the dependent variable using the independent
variables.
Assumptions of MLR
Homoscedasticity: The variance of residuals is constant across all levels of the independent variables.
No Multicollinearity: Independent variables are not highly correlated with each other.
Applications
Ø Predicting housing prices based on features like size, location, and number of rooms.
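A minimal MLR sketch with scikit-learn; the house features and prices below are made up for illustration:

# Multiple linear regression: one coefficient per independent variable.
import numpy as np
from sklearn.linear_model import LinearRegression

# columns: size (sq. ft), number of rooms, age (years)
X = np.array([[1000, 2, 10],
              [1500, 3, 5],
              [2000, 4, 2],
              [1200, 3, 8]])
y = np.array([200000, 300000, 420000, 250000])      # prices

mlr = LinearRegression().fit(X, y)
print(mlr.coef_, mlr.intercept_)                    # fitted coefficients and intercept
print(mlr.predict([[1600, 3, 4]]))                  # predict the price of a new house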
Gradient Descent Variants
• Batch GD:
• Update: Once per full pass over the training set
• Stable but slow on large datasets.
• SGD:
• Update: Once per sample
• Fast but noisy.
• Mini-Batch GD:
• Update: Once per batch
• Balanced and efficient.
Applications
• Batch GD: Small datasets (e.g., linear regression)
• SGD: Very large or streaming datasets (online learning)
• Mini-Batch GD: Training neural networks / deep learning
Feature Scaling: Normalization and Standardization
Normalization:
Ø Rescales features to a fixed range, typically [0, 1], using (x − min) / (max − min).
Ø Useful for algorithms like k-Nearest Neighbors (k-NN) and Neural Networks that are sensitive to the scale
of data.
Standardization:
Ø Rescales features to zero mean and unit variance, using (x − mean) / (standard deviation).
Ø Essential for models assuming Gaussian distribution or sensitive to feature magnitude, such as Support
Vector Machines (SVM), Principal Component Analysis (PCA), and Linear Regression.
Normalized Data:
Standardization Example:
Standardized Data:
[−1.26, −0.63, 0, 0.63, 1.26]
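A small sketch reproducing values like these with NumPy; the input data here is assumed to be [10, 20, 30, 40, 50], since the original slide values are not shown:

import numpy as np

x = np.array([10, 20, 30, 40, 50], dtype=float)     # assumed example values

# Normalization (min-max scaling): rescales to [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())
print(x_norm)                                       # [0.   0.25 0.5  0.75 1.  ]

# Standardization (z-score with sample standard deviation, ddof=1)
x_std = (x - x.mean()) / x.std(ddof=1)
print(np.round(x_std, 2))                           # [-1.26 -0.63  0.    0.63  1.26]
# Note: sklearn's StandardScaler uses the population standard deviation (ddof=0),
# which would give roughly [-1.41, -0.71, 0, 0.71, 1.41] instead.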
import numpy as np

X = np.array([1, 2, 3, 4, 5, 7, 8, 9, 10, 15, 17, 19, 21]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5, 7, 9, 10, 12, 15, 18, 20, 25])
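Continuing with this data, a minimal sketch of fitting a linear regression model to it (scikit-learn is an assumed choice; the slides do not specify a library):

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X, y)                                     # learn the slope (w) and intercept (b)
print(model.coef_, model.intercept_)                # fitted slope and intercept
print(model.predict([[12]]))                        # predict y for a new x value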
Ø Overfitting and underfitting are common problems in machine learning that affect
model performance.
Ø They arise from the trade-off between a model's complexity and its ability to generalize
to new, unseen data.
Underfitting:
A model is underfitting when it is too simple to capture the underlying patterns in the data.
Symptoms:
Ø Poor performance on both the training data and the test data.
Causes:
Ø The model lacks the capacity or the features needed to represent the data.
Overfitting:
A model is overfitting when it learns not only the patterns but also the noise in the training
data.
Symptoms:
Ø Excellent performance on the training data but poor performance on new, unseen data.
Example: A decision tree with very deep splits that memorizes the training data.
Example predicting house prices:
Underfitting: Using only the size of the house as a feature while ignoring other factors like
location, number of bedrooms, etc.
Overfitting: Including irrelevant details like the color of the walls, which does not
generalize to new data.
What is Regularization?
Ø Regularization is a technique for reducing overfitting by constraining the size of a model's coefficients.
Ø It adds a penalty term to the loss function to shrink the magnitude of coefficients, ensuring the model
generalizes well to new data.
Lasso Regularization
Ø Penalty Term: Adds the L1 norm (sum of absolute values of coefficients) to the loss function.
Key Features:
Ø Shrinks coefficients.
Ø Feature Selection: Some coefficients are reduced to exactly zero, effectively removing
irrelevant features.
When to Use:
When you suspect some features are irrelevant and want to automatically select the most
important ones.
Ridge Regularization
Penalty Term: Adds the L2 norm (sum of squares of coefficients) to the loss function.
Key Features:
Ø Retains all features and is better for handling multicollinearity (correlated features).
When to Use:
When all features are expected to contribute to the target variable, even if minimally.
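A short sketch contrasting the two penalties on synthetic data where only some features matter; the data and alpha values are illustrative choices:

# Lasso (L1) vs. Ridge (L2) on the same synthetic data.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two features truly matter; the rest are noise.
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(lasso.coef_)   # irrelevant coefficients are driven to (near) zero -> feature selection
print(ridge.coef_)   # all coefficients kept, just shrunk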
Lasso Regularization: Feature Selection in Predictive Modeling
Example:
A healthcare company is building a predictive model to estimate a patient's risk of developing diabetes based
on features like age, BMI, cholesterol levels, glucose levels, and hundreds of genetic markers.
Problem:
Ø Many features (like certain genetic markers) might have little to no effect on the outcome.
Ø Including all features increases model complexity and could lead to overfitting.
Solution:
Ø Use Lasso regularization, which shrinks the coefficients of unimportant features to exactly zero.
Ø The model keeps only the most important features, making it simpler and more interpretable for clinicians.
Ridge Regularization: Improving Predictions in Multicollinear Data
Example:
An e-commerce company wants to predict product demand based on factors like price, discount offered,
advertising spend, and competitor pricing.
Problem:
Ø Some features (e.g., "price" and "discount offered") are highly correlated, leading to multicollinearity.
Ø Standard linear regression struggles in such scenarios, resulting in unstable coefficient estimates.
Solution:
Ø Use Ridge regularization, which penalizes large coefficients without dropping any features.
Ø Ridge shrinks the correlated feature coefficients, stabilizing predictions while retaining all information.
Dataset: Predict house prices based on features like size, location, and age.
Lasso: Automatically drops less relevant features like age if size and location explain most of
the variance.
Ridge: Keeps all features and distributes importance across them, even if age has minimal
impact.
Lasso = Sparse, interpretable models (use when only some features matter).
Ridge = Stable, robust predictions (use when all features are useful).
Bias, Variance, and Tradeoff
Bias is when the model is too simple and cannot capture the pattern in the data properly.
This leads to underfitting.
Example: Suppose you are guessing people's ages.
You always guess the same average age for everyone (e.g., 25 years).
This is a high bias guess because you ignore specific details and make overly simplistic
assumptions.
Variance is when the model is too complex and tries to learn even the smallest details
(noise) in the data. This leads to overfitting.
Example: You change your guess for each person based on tiny details (e.g., their clothing or hairstyle).
This is high variance because you are too sensitive to small details that don't really
matter.
Bias-Variance Tradeoff (Simple Analogy of Exam)
High Bias (Underfitting):
Ø You only study one chapter for the exam and ignore everything else.
Ø Result: You don’t perform well because you miss most questions.
High Variance (Overfitting):
Ø You try to memorize every single word from the textbook, notes, and examples, even unnecessary details.
Ø Result: You get confused in the exam because you overanalyzed and didn't focus on what's important.
Balanced (Good Fit):
Ø You study all chapters thoroughly but focus on understanding the key concepts and main topics.
Ø Result: You perform well on the exam because you have the right balance of preparation.
Goal: Find the right balance (Bias-Variance Tradeoff) so the model can generalize well to new data.