YLP Logistic Regression

The document provides an overview of logistic regression for binary classification problems. It discusses how logistic regression can be used to predict things like loan defaults or medical diagnoses. Unlike linear regression, logistic regression uses a sigmoid curve to model binary outcomes as probabilities between 0 and 1. It finds the best fitting curve by varying the beta coefficients to maximize the likelihood of classifying the training data correctly.

Y.

LAKSHMI PRASAD
08978784848
Objectives
• Introduction to Logistic Regression
• Some Potential Problems and Solutions
• Probability and Odds
• Assumptions of Logistic Regression
• Interpreting Coefficients in Logistic Regression
• Evaluating the performance of the model



Binary Classification
• The most common use of logistic regression models is in binary classification problems.
• Bank customers seek loans from the bank, promising to repay the loan in installments over a determined period of time and with some interest on the amount. However, banks are always at risk because many customers might not be able to pay their loans back. This can cause big losses to the bank.



Classification Problem
• Therefore, predicting credit risk is of utmost importance for the bank, which analyzes customers' information and credit history before deciding to grant a loan.
• Logistic Regression can be used to build a predictive model of how likely a customer is to default on the repayment of the loan.



What is Classification?
PARTITIONING the (FEATURE) SPACE into PURE REGIONS assigned to each CLASS

Two types of Classifiers

Descriptive (Generative) Classifiers: learn class DENSITY functions
• Bayesian Classifiers
• Nearest Neighbor

Discriminative Classifiers: learn class SEPARATORS
• Logistic Regression
• Decision Trees
• Neural Networks, SVM

Approaches to learn classifiers
• 1. Logistic Regression
• 2. K-Nearest Neighbour classifiers
• 3. Decision Trees
• 4. Bayesian Classifiers (Naive Bayes)
• 5. Support Vector Machines
• 6. Neural Networks
• 7. Bagging (Random Forest)
• 8. Boosting (AdaBoost, XGBoost)



Expectations from Model
• Predictive (classification) accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
• 1. Accuracy = percent (%) of testing-set examples correctly classified by the classifier.
• 2. Speed: the computation costs involved in generating and using the model should be as low as possible.
• 3. Robustness: the ability of the model to make correct predictions given noisy data or data with missing values.
• 4. Scalability: the ability to construct the model efficiently given large amounts of data.



Expectations from Model
• 5. Interpretability/Explainability: the level of understanding and insight that is provided by the model.
• 6. Generalized: it should be able to give a comparable level of accuracy on the validation set.
• 7. Deterministic: I expect my model to give me the same result every time I run it.
• 8. Regularization: I expect my model's complexity to be controlled by me, so that I can prevent my model from over-fitting.



Logistic Regression
• Logistic Regression is a supervised classification model.
• It allows you to make predictions from labelled data, if the target (output) variable is categorical.
• 1. A bank wants to predict, based on some variables, whether a particular customer will default on a loan or not.
• 2. A factory manager wants to predict, based on some variables, whether a particular machine will break down in the next month or not.
• 3. Google's backend wants to predict, based on some variables, whether an incoming email is spam or not.



Logistic Regression examples
• Marketing: classify whether a lead is a hot lead or a warm lead.
• Stock Market: predict whether a stock will outperform or underperform.
• Healthcare: whether a tumour is malignant or benign.
• Networking: classify whether a packet is malicious or not.
• Human Resources: whether an employee is going to leave the company (attrition) or not.



Logistic Regression
• Used because having a categorical outcome variable violates the assumption of linearity in normal regression.
• Instead of building a predictive model for "Y (Response)" directly, the approach models "Log Odds (Y)"; hence the name Logistic or Logit.



Where is the Problem?
• The dependent variable is limited to {0, 1} because we have 2 classes: default or no-default, diabetic or non-diabetic, churned or not churned, etc.
• Linear Regression is designed to minimize the Root Mean Squared Error, which is not an appropriate fit in this case.



Logistic Regression
• Let us take the diabetes example. In this example, we try to predict whether a person has diabetes or not, based on that person's blood sugar level.
• Why does a simple decision-boundary approach not work very well for this example?
• It would be too risky to decide the class purely on the basis of a cut-off because, especially in the middle, the patients could belong to either class, diabetic or non-diabetic.



Classifying with Linear Regression



Where is the Problem?
• Recall the graph of the diabetes example.
• Suppose there is another person, with a blood sugar level of 195, and you do not know whether that person has diabetes or not. What would you do then? Would you classify him/her as diabetic or non-diabetic?



Step Function



Limitation of Step Function
• Now, based on the boundary, you may be tempted to declare this person diabetic, but can you really do that?
• This person's sugar level (195 mg/dL) is very close to the threshold (200 mg/dL), below which people are declared non-diabetic.
• It is, therefore, quite possible that this person was just a non-diabetic with a slightly high blood sugar level.
• After all, the data does have people with slightly high sugar levels (220 mg/dL) who are not diabetic.



Classifying with Linear Regression



Classifying with Linear Regression
• The main problem with a straight line is that it is not steep enough.
• In the sigmoid curve, as you can see, you have low values for a lot of points, then the values rise all of a sudden, after which you have a lot of high values.
• In a straight line, though, the values rise from low to high very uniformly, and hence the "boundary" region, the one where the probabilities transition from low to high, is not present.



Logistic Regression
• In this situation, we would actually like to talk in terms of probability.
• One such curve which can model the probability of diabetes very well is the sigmoid curve.



Sigmoid Curve
• The sigmoid curve has all the properties you would want: extremely low values in the start, extremely high values in the end, and intermediate values in the middle. It is a good choice for modelling the value of the probability of diabetes.
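The shape described above can be sketched in a few lines of Python (a minimal illustration, not taken from the slides):

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Extremely low values at the start, intermediate in the middle,
# extremely high values at the end:
print(sigmoid(-6))  # ~0.0025
print(sigmoid(0))   # 0.5
print(sigmoid(6))   # ~0.9975
```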



S-Curve



S-Curve
• Here, we want P1, P2, P3, P4, P6 to be as small as possible and P5, P7, P8, P9, P10 to be as high as possible.
• In the case of P4, I can say either that I want to minimize P4, or that I want to maximize 1−P4.
• Combining all these points into the same formulation, I can say I want to maximize P5, P7, P8, P9, P10 and 1−P1, 1−P2, 1−P3, 1−P4, 1−P6.
• That means I want to maximize the product of all these terms. So, we want to find the β0 and β1 which maximize the product of all these terms.
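The "product of all these terms" is the likelihood. A minimal sketch, using hypothetical blood-sugar readings and labels (not the actual data behind the slides):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def likelihood(b0, b1, xs, ys):
    """Product of P_i for diabetic (y=1) points and (1 - P_i) for non-diabetic (y=0) points."""
    prod = 1.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        prod *= p if y == 1 else (1.0 - p)
    return prod

# hypothetical sugar levels and diabetes labels
xs = [150, 160, 170, 180, 210, 190, 220, 230, 240, 250]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# a well-chosen (b0, b1) gives a much higher likelihood than a flat p = 0.5 model
print(likelihood(-40, 0.2, xs, ys) > likelihood(0, 0.0, xs, ys))  # True
```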



Best-fit Curve
• The next step, just like in linear regression, is to find the best-fit curve.
• Hence, you learnt that in order to find the best-fit sigmoid curve, you need to vary β0 and β1 until you get the combination of beta values that maximises the likelihood.
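Varying β0 and β1 can be sketched as a crude grid search (real software uses iterative optimizers instead; the data and the grid of beta values here are hypothetical):

```python
import itertools
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(b0, b1, xs, ys):
    """Negative log-likelihood; maximizing the likelihood = minimizing this."""
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(b0 + b1 * x), eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

xs = [150, 160, 170, 180, 210, 190, 220, 230, 240, 250]
ys = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

# try every (b0, b1) combination on a small grid, keep the best one
grid = itertools.product([-50, -40, -30, 0], [0.0, 0.1, 0.2, 0.3])
best = min(grid, key=lambda b: neg_log_likelihood(b[0], b[1], xs, ys))
print(best)  # (-40, 0.2)
```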



Understanding Likelihood
• Now, let's say that for the nine points in our example, the labels are as follows:

Point no.   1    2    3    4    5    6    7    8    9
Diabetes    no   no   no   yes  no   yes  yes  yes  yes

In this case, the likelihood would be equal to:

• A) (1−P1)(1−P2)(1−P3)(1−P4)(1−P5)(P6)(P7)(P8)(P9)
• B) (1−P1)(1−P2)(1−P3)(1−P5)(P4)(P6)(P7)(P8)(P9)
• C) (P1)(P2)(P3)(P4)(P5)(1−P6)(1−P7)(1−P8)(1−P9)
• D) (P1)(P2)(P3)(P5)(1−P4)(1−P6)(1−P7)(1−P8)(1−P9)



Answer
• Answer B: Recall that the likelihood is the product of (1−Pi) for all non-diabetic patients and (Pi) for all diabetic patients. Hence, the likelihood is given by (1−P1)(1−P2)(1−P3)(1−P5) (all non-diabetic patients) multiplied by (P4)(P6)(P7)(P8)(P9) (all diabetic patients).



Logistic Regression Best fit curve



Logistic Regression Best fit curve
• If you had to find the β0 and β1 for the best-fitting sigmoid curve, you would have to try a lot of combinations, until you arrive at the one which maximises the likelihood.
• This is similar to linear regression, where you vary β0 and β1 until you find the combination that minimises the cost function.
• Hence, this is called a Generalised Linear Model (GLM); the logistic regression model is one example.



ODDS
• The odds has a range of 0 to ∞, with values greater than 1 associated with an event being more likely to occur than not to occur, and values less than 1 associated with an event that is less likely to occur than not to occur.

odds = p / (1 − p)



Log(Odds)
• It solves the problem we encounter in fitting a linear model to probabilities.
• As probabilities (the dependent variable) only range from 0 to 1, we can get linear predictions that are outside of this range.

ln(odds) = ln(p / (1 − p)) = ln(p) − ln(1 − p)



What is an "Odds Ratio”?
• It is a standard statistical term that denotes the ratio of the probability of success to the probability of failure.
• If the probability of success is 0.75, then odds ratio = (0.75/0.25) = 3.
• In other words, there is a 3:1 chance of success.
• If the probability of success is 50%, what is the odds ratio?
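The arithmetic can be checked directly (it also answers the 50% question: the odds come out to 1):

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1.0 - p)

print(odds(0.75))  # 3.0 -> a 3:1 chance of success
print(odds(0.5))   # 1.0 -> even odds
```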



Why Use Odds-Ratio
• In logistic regression the Odds Ratio represents the
constant effect of a predictor X, on the likelihood that
one outcome will occur.
• In regression models, we often want a measure of the
unique effect of each X on Y.
• If we try to express the effect of X on the likelihood of a
categorical Y having a specific value through Probability,
the effect is not constant.



Why Use Odds-Ratio
• That means there is no way to express in one number
how X affects Y in terms of Probability.
• The effect of X on the probability of Y has different
values depending on the value of X.
• We will not be able to describe that effect in a single
number using Probability and will have to use Odds
Ratio



The Logistic Regression Model
The "logit" model:

ln[p/(1−p)] = β0 + β1X

• p is the probability that the event Y occurs, p(Y=1) [range: 0 to 1]
• p/(1−p) is the odds [range: 0 to ∞]
• ln[p/(1−p)] is the log odds, or "logit" [range: −∞ to +∞]
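Inverting the logit gives the predicted probability for a given X; a short sketch with hypothetical coefficients (b0 = -40 and b1 = 0.2 are made up for illustration):

```python
import math

def predicted_probability(b0, b1, x):
    """Invert ln(p/(1-p)) = b0 + b1*x to get p = 1 / (1 + e^-(b0 + b1*x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

print(predicted_probability(-40, 0.2, 200))           # 0.5   (log odds = 0)
print(round(predicted_probability(-40, 0.2, 210), 3)) # 0.881 (log odds = 2)
```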



Logistic regression Model



Types of Logistic Regression
• 1. Binary Logit: used when the response variable is binary or dichotomous. It has only 2 outcomes, e.g. Good vs. Bad, Yes vs. No.
• 2. Multinomial Logit: used when the response variable has more than 2 outcomes and the outcomes cannot be ordered in any manner, e.g. choice of bread.
• 3. Ordered Logit: used when the response variable has more than 2 outcomes and the outcomes can be ordered in a meaningful way, e.g. High / Medium / Low, or Strongly Agree / Agree / Disagree / Strongly Disagree.



Response Variable coding
• Data preparation for Logistic Regression includes:
• The response variable (or target variable) will need to be converted to 1/0.
• Code "Sanctioned personal loan" as "1" and "Rejected personal loan" as "0".
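In plain Python, the 1/0 conversion is a one-liner (the label strings are the ones from the slide; the sample rows are hypothetical):

```python
# hypothetical raw target values from a loan dataset
raw = ["Sanctioned personal loan", "Rejected personal loan", "Sanctioned personal loan"]

# code "Sanctioned personal loan" as 1, "Rejected personal loan" as 0
target = [1 if label == "Sanctioned personal loan" else 0 for label in raw]
print(target)  # [1, 0, 1]
```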



Considerations
• 1. Missing value treatment: using logical rules.
• 2. Outlier detection: to ensure we don't have highly skewed values.
• 3. Multicollinearity: ensure that two independent variables do not provide similar information.
• 4. Variable transformations: ensure we have meaningful transformations of variables, depending on the research and modeling scope.
• 5. Descriptive statistics: basic measures of central tendency need to be output to validate that the correct data is being used for modeling.



Assumptions
• Makes fewer assumptions than Linear Regression.
• The logistic function applies a non-linear transformation.
• No assumption of a normal distribution for residuals.
• No homoscedasticity assumption.



Data Preparation Partitioning
• Divide the sample into 2 sub-samples:
• 1. Training Sample: the sample used to build the Logistic Regression model.
• 2. Validation Sample: estimates obtained from the development sample will be tested here, for comparison and for checking the robustness of the model.



Simple Decision Boundary?
Medium Decision Boundary!
Complex Decision Boundary!
Model SIGNAL, not NOISE

Model is too simple → UNDER-LEARN
Model is too complex → MEMORIZE
Model is just right → GENERALIZE

Generalization vs. Memorization
Generalization: the ability to predict or assign a label to a "new" observation based on the "model" built from past experience.
Generalize, don't Memorize!

[Figure: Model Accuracy vs. Model Complexity, contrasting Training Set Accuracy with Validation Set Accuracy and marking the right level of model complexity.]

Questions for Classification!
• What is the NATURE of the classifier's DECISION BOUNDARY?
• What is the COMPLEXITY of the classifier's DECISION BOUNDARY?
• How do I CONTROL the COMPLEXITY of the classifier?
• How do I know when the classifier is COMPLEX ENOUGH?
• How do I pick the right CLASSIFIER to use?
Metrics to Evaluate
• 1. Confusion Matrix (Accuracy, Sensitivity, Specificity)
• 2. Receiver Operating Characteristic (ROC) Curve
• 3. Weight of Evidence
• 4. Concordant, Discordant, Tied Pairs
• 5. Area Under Curve (c-Statistic)
• 6. Akaike Information Criterion
• 7. Gini Coefficient



Event Rate
• Event rate is a statistical term that describes how often an event occurs.
• Divide the number of times the event occurred by the total number of times it could have occurred to determine the event rate.
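As a formula, with hypothetical counts:

```python
def event_rate(events, opportunities):
    """How often an event occurs: occurrences / chances to occur."""
    return events / opportunities

# e.g. 120 loan defaults observed out of 2000 loans
print(event_rate(120, 2000))  # 0.06
```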



Confusion Matrix



Sensitivity & Specificity
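Since the matrix and the rate definitions appear as figures, here is a minimal sketch of the counts and the derived rates, using hypothetical actual/predicted labels (1 = event, 0 = non-event):

```python
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 1, 0, 0, 1, 0, 0, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # % classified correctly
sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)                   # true negative rate
print(accuracy, sensitivity, specificity)  # 0.8 0.8 0.8
```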



Receiver Operating Characteristic (ROC) Curve

• Tradeoff between sensitivity and specificity.
• The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.
• The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.



Weight of Evidence (WoE)
• The weight of evidence tells the predictive power of an independent variable in relation to the dependent variable.
• The WoE recoding of predictors is particularly well suited for subsequent modeling using Logistic Regression.
• For a continuous variable, split the data into 10 parts (or fewer, depending on the distribution).
• Calculate WoE by taking the natural log of the division of the % of non-events by the % of events.
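A sketch of the WoE calculation for one binned variable, with hypothetical bin counts (it also accumulates the information value used for variable ranking):

```python
import math

# hypothetical (events, non_events) counts for 3 bins of one predictor
bins = [(10, 90), (30, 70), (60, 40)]
total_events = sum(e for e, ne in bins)
total_non_events = sum(ne for e, ne in bins)

iv = 0.0
for e, ne in bins:
    pct_events = e / total_events
    pct_non_events = ne / total_non_events
    woe = math.log(pct_non_events / pct_events)  # ln(% non-events / % events)
    iv += (pct_non_events - pct_events) * woe    # information-value contribution
    print(round(woe, 3))
print(round(iv, 3))
```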



Information Value
• Information value is a useful concept for variable selection during model building.
• It helps to rank variables on the basis of their importance.

Information Value -- Variable Predictive Power
< 0.02 -- Not useful
0.02 to 0.1 -- Weak predictive power
0.1 to 0.3 -- Medium predictive power
0.3 to 0.5 -- Strong predictive power
> 0.5 -- Suspicious predictive power
Concordant, Discordant
• Concordant: percentage of pairs where the observation with the event has a higher predicted probability than the observation without the event.
• Percent Concordant = (Number of concordant pairs) / (Total number of pairs)
• Discordant: percentage of pairs where the observation with the event has a lower predicted probability than the observation without the event.
• Percent Discordant = (Number of discordant pairs) / (Total number of pairs)



c-Statistic
• Tied: percentage of pairs where the observation with the event has the same predicted probability as the observation without the event.
• Percent Tied = (Number of tied pairs) / (Total number of pairs)
• c-statistic: also called area under the curve (AUC). It is calculated by adding the Percent Concordant and 0.5 times the Percent Tied.
• c-statistic = Percent Concordant + 0.5 × Percent Tied

Higher percentages of concordant pairs and lower percentages of discordant and tied pairs indicate a more desirable model.
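These pair counts can be computed directly; the predicted probabilities below are hypothetical:

```python
from itertools import product

event_probs     = [0.8, 0.6, 0.55]  # predicted p for observations with the event
non_event_probs = [0.3, 0.5, 0.6]   # predicted p for observations without the event

pairs = list(product(event_probs, non_event_probs))  # every event/non-event pair
concordant = sum(1 for e, n in pairs if e > n)
discordant = sum(1 for e, n in pairs if e < n)
tied       = sum(1 for e, n in pairs if e == n)

c_statistic = concordant / len(pairs) + 0.5 * tied / len(pairs)
print(concordant, discordant, tied)  # 7 1 1
print(round(c_statistic, 3))         # 0.833
```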



Akaike Information Criterion (AIC)
• The Akaike Information Criterion (AIC) provides a method for assessing the quality of a model through comparison of related models.
• It is based on the Deviance, but penalizes making the model more complicated.
• Much like adjusted R-squared, its intent is to prevent the inclusion of irrelevant predictors.
• If you have more than one similar candidate model, then you should select the one that has the smallest AIC.



Gini Coefficient
• The Gini coefficient can be derived straight away from the AUC-ROC number.
• Gini is the ratio of the area between the ROC curve and the diagonal line to the area of the triangle above the diagonal, which works out to Gini = 2 × AUC − 1.
• A Gini above 60% indicates a "good" model.
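Deriving Gini straight from the AUC number (the identity Gini = 2 × AUC − 1 follows from the area ratio described above):

```python
def gini_from_auc(auc):
    """Gini = (area between ROC curve and diagonal) / (area of upper triangle) = 2*AUC - 1."""
    return 2 * auc - 1

print(round(gini_from_auc(0.83), 2))  # 0.66 -> above 60%, a "good" model
print(gini_from_auc(0.5))             # 0.0  -> no better than random
```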



Imbalanced Datasets – SMOTE
Synthetic Minority Over-sampling Technique:
• Imbalanced data sets are a special case of the classification problem, where the class distribution is not uniform among the classes.
• These data sets pose a challenge because they create a bias towards the majority class.
• Oversampling involves using a bias to select more samples from one class than from another.
• The general idea of this method is to artificially generate new examples of the minority class using the nearest neighbors of existing cases.
• The majority class examples are also under-sampled, leading to a more balanced dataset.
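A toy sketch of the synthetic-generation step (real implementations, such as imbalanced-learn's SMOTE, apply the same idea to full feature vectors; the 2-D minority points here are made up):

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate synthetic minority points by interpolating between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted((m for m in minority if m is not x),
                            key=lambda m: sum((a - b) ** 2 for a, b in zip(x, m)))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment between x and nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3)]
print(smote_sketch(minority, n_new=4))  # 4 new points near the minority cluster
```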
Questions?
