
Logistic Regression

Logistic Regression
• A binomial logistic regression attempts to
predict the probability that an observation
falls into one of two categories of a
dichotomous dependent variable based on
one or more independent variables that can
be either continuous or categorical
Basic requirements of a binomial logistic regression

• Assumption #1: You have one dependent variable that is dichotomous (i.e., a nominal
variable with two outcomes). Examples of dichotomous variables include gender (two
outcomes: "males" or "females"), presence of heart disease (two outcomes: "yes" or
"no"), employment status (two outcomes: "employed" or "unemployed"), transport
type (two outcomes: "bus" or "car").
• Assumption #2: You have one or more independent variables that are measured on
either a continuous or nominal scale. Examples of continuous variables include height
(measured in metres and centimetres), temperature (measured in °C), salary (measured
in US dollars), revision time (measured in hours), intelligence (measured using IQ score),
firm size (measured in terms of the number of employees), age (measured in years),
reaction time (measured in milliseconds), grip strength (measured in kg), power output
(measured in watts), test performance (measured from 0 to 100), sales (measured in
number of transactions per month) and academic achievement (measured in terms of
GMAT score). Examples of nominal variables include gender (e.g., two categories: male
and female), ethnicity (e.g., three categories: Caucasian, African American and Hispanic)
and profession (e.g., five categories: surgeon, doctor, nurse, dentist, therapist).
Basic requirements of a binomial logistic
regression
• Assumption #3: You should have independence of observations and
the categories of the dichotomous dependent variable and all your nominal
independent variables should be mutually exclusive and exhaustive.

Independence of observations means that there is no relationship between the
observations in each category of the dependent variable or the observations in
each category of any nominal independent variables. In addition, there is no
relationship between the categories. Indeed, an important distinction is made in
statistics when comparing values from either different individuals or from the
same individuals.
Basic requirements of a binomial logistic
regression
• Assumption #4: You should have a bare minimum of 15 cases per independent
variable, although some recommend as many as 50 cases per independent
variable. As with other multivariate techniques, such as multiple regression, there
are a number of recommendations regarding minimum sample size. Indeed,
binomial logistic regression relies on maximum likelihood estimation (MLE), and the
reliability of the estimates declines when there are only a few cases for particular
combinations of the independent variables.
Basic requirements of a binomial logistic
regression
• Assumptions #5, #6 and #7: (a) there should be a linear relationship between the
continuous independent variables and the logit transformation of the dependent
variable; (b) there should be no multicollinearity; and (c) there should be no
significant outliers, high leverage points or highly influential points.
Fitting a binomial logistic regression model

• Binomial logistic regression is part of a larger statistical group of tests called
Generalized Linear Models (GzLM). These tests are an extension of Linear Models
(e.g., multiple regression) to incorporate dependent variables that are not just
continuous, but may be measured on other types of measurement scale (e.g.,
dichotomous or ordinal measurement scales).
• Like multiple regression, binomial logistic regression allows for a relationship to be
modelled between multiple independent variables and a single dependent variable
where the independent variables are being used to predict the dependent
variable. However, in the case of a binomial logistic regression, the dependent
variable is dichotomous. In addition, a transformation is applied so that instead of
predicting the category of the binomial logistic regression directly, the logit of the
dependent variable is predicted instead.
• For example, if we consider four independent variables to be "X1" through "X4"
and the dependent variable to be "Y", a binomial logistic regression models the
following: logit(Y) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε.
Fitting a binomial logistic regression model

• Where β0 is the intercept (also known as the constant), β1 is the slope parameter
(also known as the slope coefficient) for X1, and so forth, and ε represents the
errors. This represents the population model, but it can be estimated as follows:
logit(Y) = b0 + b1X1 + b2X2 + b3X3 + b4X4 + e
• In the formula above, b0 is the sample intercept (aka constant) and estimates β0,
b1 is the sample slope parameter for X1 and estimates β1, and so forth, and e represents
the sample errors/residuals and estimates ε. A logit is the natural log of the odds
of an event occurring. It has little direct meaning. However, by applying an anti-log
it takes on a much more interpretable meaning. In addition, through further
calculations you can ascertain other useful properties of the predictive power of
your binomial logistic regression model, such as the percentage of correctly
classified cases.
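As an illustration of fitting such a model outside SPSS Statistics, the following is a minimal sketch in Python using statsmodels; the data and the variable names X1 to X4 are simulated and hypothetical, not the worked example used later in this guide.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200

# Simulated independent variables X1..X4 (hypothetical, for illustration only)
X = pd.DataFrame({
    "X1": rng.normal(size=n),
    "X2": rng.normal(size=n),
    "X3": rng.normal(size=n),
    "X4": rng.integers(0, 2, size=n),  # a dichotomous predictor
})

# Simulate a dichotomous Y from a known population model:
# logit(Y) = b0 + b1*X1 + b2*X2 + b3*X3 + b4*X4
true_logit = -0.5 + 0.8 * X["X1"] - 0.4 * X["X2"] + 0.0 * X["X3"] + 1.2 * X["X4"]
p = 1 / (1 + np.exp(-true_logit))  # invert the logit to obtain probabilities
y = rng.binomial(1, p)

# Fit the model; the coefficients b0..b4 are estimated by maximum likelihood (MLE)
model = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
print(model.summary())       # coefficients are on the log-odds (logit) scale
print(np.exp(model.params))  # the anti-log of a coefficient gives an odds ratio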
Setting up your data

• For a binomial logistic regression you will have at least two variables – one dependent
variable and one independent variable – but you will typically have two or more
independent variables. In addition, you may also choose to include a case identifier, as
discussed below. In this example, we have the following six variables:

1) The dependent variable, heart_disease, which is whether the participant has heart disease;
2) The independent variable, age, which is the participant's age in years;
3) The independent variable, weight, which is the participant's weight (technically, it is their 'mass');
4) The independent variable, gender, which has two categories: "Male" and "Female";
5) The independent variable, VO2max, which is the maximal aerobic capacity; and
6) The case identifier, caseno, which is used for easy elimination of cases (e.g., participants) that might occur when checking assumptions.
Setting up your data

Assumption #5
There needs to be a linear relationship between the continuous independent
variables and the logit transformation of the dependent variable.

The assumption of linearity in a binomial logistic regression requires that there is a
linear relationship between the continuous independent variables, age, weight and
VO2max, and the logit transformation of the dependent variable, heart_disease.

There are a number of methods to test for a linear relationship between the
continuous independent variables and the logit of the dependent variable. In this
guide, we use the Box-Tidwell approach, which adds an interaction term between
each continuous independent variable and its natural log to the regression
equation. You can then: (a) use the Binary Logistic procedure in SPSS Statistics to test this
assumption; (b) interpret and report the results from this test; and (c) proceed with
your analysis depending on whether you have met or violated this assumption.
Setting up your data

• Assumption #6
Your data must not show multicollinearity

Multicollinearity occurs when you have two or more independent variables that
are highly correlated with each other. This leads to problems with understanding
which independent variable contributes to the variance explained in the
dependent variable, as well as technical issues in calculating a binomial logistic
regression model.

You can detect multicollinearity through an inspection of correlation
coefficients and Tolerance/VIF values, which will inform you whether your data
meets or violates this assumption.
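As a hedged sketch of this check outside SPSS Statistics, the following Python code computes VIF and Tolerance values with statsmodels; the DataFrame and variable names (age, weight, VO2max) are simulated stand-ins for the example data.

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(predictors: pd.DataFrame) -> pd.DataFrame:
    """Return VIF and Tolerance (1/VIF) for each predictor column."""
    X = sm.add_constant(predictors)
    rows = []
    for i, name in enumerate(X.columns):
        if name == "const":
            continue  # the constant is not a predictor of interest
        vif = variance_inflation_factor(X.values, i)
        rows.append({"variable": name, "VIF": vif, "Tolerance": 1.0 / vif})
    return pd.DataFrame(rows)

# Simulated predictors; as a common rule of thumb, VIF > 10 (Tolerance < 0.1)
# is taken as a warning sign of multicollinearity.
rng = np.random.default_rng(2)
df = pd.DataFrame({"age": rng.normal(50, 10, 100),
                   "weight": rng.normal(80, 12, 100),
                   "VO2max": rng.normal(35, 6, 100)})
print(vif_table(df))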
Setting up your data
• Assumption #7
There should be no significant outliers, high leverage points or highly influential
points

Outliers, leverage points and influential points are different terms used to represent
observations in your data set that are in some way unusual when you wish to
perform a binomial logistic regression analysis. These different classifications
of unusual points reflect the different impact they have on the regression line.
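The sketch below illustrates, in Python, one way such unusual observations can be flagged for a logistic regression: standardized Pearson residuals for outliers and hat-matrix diagonals for leverage. The data are simulated, and the cut-offs used are common rules of thumb rather than the specific casewise diagnostics SPSS Statistics reports.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 150
x = rng.normal(size=(n, 2))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + x @ np.array([1.0, -0.8])))))

X = sm.add_constant(x)
res = sm.Logit(y, X).fit(disp=False)
p_hat = res.predict(X)

# Leverage: diagonal of the hat matrix H = W^(1/2) X (X'WX)^-1 X' W^(1/2),
# where W = diag(p_hat * (1 - p_hat)) for a logistic regression.
W = p_hat * (1 - p_hat)
XW = X * np.sqrt(W)[:, None]
H = XW @ np.linalg.inv(X.T @ (X * W[:, None])) @ XW.T
leverage = np.diag(H)

# Standardized Pearson residuals; values beyond roughly +/-2.5 to 3 standard
# deviations are often treated as potential outliers.
pearson = (y - p_hat) / np.sqrt(W)
std_resid = pearson / np.sqrt(1 - leverage)

# Flag cases with large residuals or leverage above twice the average leverage
flagged = np.where((np.abs(std_resid) > 2.5) | (leverage > 2 * X.shape[1] / n))[0]
print("potentially unusual cases:", flagged)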
Testing for linearity

• The most important assumption in logistic regression is that the model is
correctly specified, a component of which is the assumption of linearity, which
is often expressed as "linearity in the logit" (e.g., Hilbe, 2016; Menard, 2002).
It is similar to the linearity assumption in multiple regression, only with
respect to the log odds transformation (logit) of the dependent
variable rather than to the dependent variable itself.
• The linearity assumption states that for every one-unit increase in a continuous
independent variable, the value of the log odds (logit) of the dependent variable
increases by a constant amount
Testing for linearity
• One method that can be used to check the assumption of "linearity in
the logit" is the Box-Tidwell procedure (Box & Tidwell, 1962), which
was developed for linear regression, but is also appropriate for logistic
regression models (Fox, 2016; Guerrero & Johnson, 1982). The
procedure is simple to use and can be carried out in various statistical
packages, including SPSS Statistics. It is one of several methods
recommended to assess whether a continuous independent variable is
linearly related to the logit of the dependent variable (e.g., Hosmer &
Lemeshow, 1989; Menard, 2002, 2010).
• STEP ONE
Create natural log transformations of all continuous independent variables
• STEP TWO
Create interaction terms for each of your continuous independent variables and
their respective natural log transformed variables, before running the Box-Tidwell
procedure, as sketched below.
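A minimal sketch of these two steps in Python with statsmodels follows. The variable names (heart_disease, age, weight, VO2max) follow the running example, but the data here are simulated and, for brevity, only the continuous predictors are included; in the full analysis all independent variables would enter the model.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 300
df = pd.DataFrame({"age": rng.uniform(30, 70, n),
                   "weight": rng.uniform(55, 110, n),
                   "VO2max": rng.uniform(20, 55, n)})
logit = -8 + 0.08 * df["age"] + 0.01 * df["weight"] - 0.05 * df["VO2max"]
df["heart_disease"] = rng.binomial(1, 1 / (1 + np.exp(-logit)))

continuous = ["age", "weight", "VO2max"]

# STEP ONE and STEP TWO: add an x * ln(x) term for each continuous predictor
# (requires strictly positive values)
X = df[continuous].copy()
for v in continuous:
    X[f"{v}_x_ln_{v}"] = df[v] * np.log(df[v])

res = sm.Logit(df["heart_disease"], sm.add_constant(X)).fit(disp=False)

# If an interaction term is statistically significant (often judged against a
# Bonferroni-corrected alpha), linearity in the logit is in doubt for that variable
print(res.pvalues.filter(like="_x_ln_"))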
Interpreting Results

• There are two main objectives that you can achieve with the output from
a binomial logistic regression: (a) determine which of your independent
variables (if any) have a statistically significant effect on your dependent
variable; and (b) determine how well your binomial logistic regression
model predicts the dependent variable. Both of these objectives will be
answered in the following sections:
Interpreting Results

• Data coding: You can start your analysis by inspecting your variables and data,
including: (a) checking if any cases are missing and whether you have the number of
cases you expect (the "Case Processing Summary" table); (b) making sure that the
correct coding was used for the dependent variable (the "Dependent Variable
Encoding" table); and (c) determining whether there are any categories amongst
your categorical independent variables with very low counts – a situation that is
undesirable for binomial logistic regression (the "Categorical Variables Codings"
table). This is highlighted in the Data coding section on the next page.
• Baseline analysis: Next, you can consult the "Classification Table", "Variables in the
Equation" and "Variables not in the Equation" tables. These all relate to the
situation where no independent variables have been added to the model and the
model just includes the constant. As such, you are interested in this information
only as a comparison to the model with all the independent variables added. This 
Baseline analysis section provides a basis against which the main binomial logistic
regression analysis with all independent variables added to the equation can be
evaluated.
Interpreting Results
• Binomial logistic regression results: In evaluating the main logistic
regression results, you can start by determining the overall
statistical significance of the model (namely, how well the model
predicts categories compared to no independent variables). You
can also assess the adequacy of the model by analysing how poor
the model is at predicting the categorical outcomes using
the Hosmer and Lemeshow goodness of fit test. This is explained
in the Model fit section. Next, you can consult the Cox & Snell R
Square and Nagelkerke R Square values to understand how much
variation in the dependent variable can be explained by the model
(i.e., these are two methods of calculating the explained variation),
but it is preferable to report the Nagelkerke R2 value. This is
illustrated in the Variance explained section.
Interpreting Results
• Category prediction: After determining model fit and
explained variation, it is very common to use binomial
logistic regression to predict whether cases can be correctly
classified (i.e., predicted) from the independent variables.
Logistic regression estimates the probability of an event (in
this case, having heart disease) occurring. If the estimated
probability of the event occurring is greater than or equal to
0.5 (better than even chance), SPSS Statistics classifies the
event as occurring (e.g., heart disease being present). If the
probability is less than 0.5, SPSS Statistics classifies the event
as not occurring (e.g., no heart disease).
Interpreting Results
• Variables in the equation: You can assess the contribution of
each independent variable to the model and its statistical
significance using the Variables in the Equation table. You will
also be able to use the odds ratios of each of the independent
variables (along with their confidence intervals) to
understand the change in the odds ratio for each increase in
one unit of the independent variable. Using these odds ratios
you will be able to, for example, make statements such as:
"the odds of having heart disease is 7.026 times greater for
males as opposed to females". You can make such predictions
for categorical and continuous independent variables.
Baseline analysis

• The next three tables headed under the main title, "Block 0: Beginning Block", all
relate to the situation where no independent variables have been added to the
model and the model just includes the constant. As such, you are interested in this
information only as a comparison to the model with all the independent variables
added. The table below, "Classification Table", shows that without any
independent variables, the 'best guess' is to simply assume that all participants did
not have heart disease. If you assume this, you will overall correctly classify 65% of
cases (the "Overall Percentage" row), as shown below:
Baseline analysis

• The table below, "Variables in the Equation", simply shows you that only the
constant was included in this particular model:

• And the table below, "Variables not in the Equation", highlights the independent
variables left out of the model:
Binomial logistic regression results

• All the next tables come after the heading "Block 1: Method = Enter" and
represent the results of the main logistic regression analysis with all independent
variables added to the equation.
Model fit
• The first table, "Omnibus Tests of Model Coefficients", provides the overall
statistical significance of the model (namely, how well the model predicts
categories compared to no independent variables), as shown below
Binomial logistic regression results
• For this type of binomial logistic regression, you can reference the "Model" row.
From the table above, you can see that the model is statistically significant
(p < .0005; "Sig." column). Another way of assessing the adequacy of the model is
to analyse how poor the model is at predicting the categorical outcomes. This is
tested using the Hosmer and Lemeshow goodness of fit test as found in the
similarly titled table, as shown below

• For this test, you do not want the result to be statistically significant because this
would indicate that you have a poor fitting model. In this example, the Hosmer and
Lemeshow test is not statistically significant (p = .871; "Sig." column), indicating
that the model is not a poor fit
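For readers working outside SPSS Statistics, the following is a minimal sketch of the decile-of-risk version of the Hosmer and Lemeshow test in Python; the arrays of observed outcomes and predicted probabilities are simulated, and the grouping details can differ slightly from SPSS's implementation.

import numpy as np
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Return the HL chi-square statistic and its p-value (df = groups - 2)."""
    data = pd.DataFrame({"y": np.asarray(y), "p": np.asarray(p_hat)})
    data["bin"] = pd.qcut(data["p"], q=groups, duplicates="drop")  # deciles of risk
    grouped = data.groupby("bin", observed=True)
    n_g = grouped["y"].count()       # cases per group
    obs_events = grouped["y"].sum()  # observed events per group
    exp_events = grouped["p"].sum()  # expected events per group
    stat = (((obs_events - exp_events) ** 2 / exp_events)
            + (((n_g - obs_events) - (n_g - exp_events)) ** 2
               / (n_g - exp_events))).sum()
    return stat, chi2.sf(stat, len(n_g) - 2)

# A non-significant result (e.g., p = .871 in this example) indicates the model
# is not a poor fit.
rng = np.random.default_rng(5)
p_sim = rng.uniform(0.05, 0.95, 200)
y_sim = rng.binomial(1, p_sim)
print(hosmer_lemeshow(y_sim, p_sim))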
Variance explained
• In order to understand how much variation in the dependent variable can be explained by the
model (the equivalent of R2 in multiple regression), you can consult the table below, "Model
Summary":

• This table contains the Cox & Snell R Square and Nagelkerke R Square values, which are both
methods of calculating the explained variation (it is not as straightforward to do this as
compared to multiple regression). These values are sometimes referred to as pseudo R2 values
and will have lower values than in multiple regression. However, they are interpreted in the
same manner, but with more caution.
• Therefore, the explained variation in the dependent variable based on our model ranges from
24.0% to 33.0%, depending on whether you reference the Cox & Snell R2 or
Nagelkerke R2 methods, respectively. Nagelkerke R2 is a modification of Cox & Snell R2, the
latter of which cannot achieve a value of 1. For this reason, it is preferable to report the
Nagelkerke R2 value.
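The two pseudo R2 values can also be computed directly from the model and null log-likelihoods. The sketch below shows the formulas in Python; the result-object attributes mentioned in the comment (llf, llnull, nobs) refer to a statsmodels Logit fit and are only an assumed usage example.

import numpy as np

def cox_snell_r2(ll_model, ll_null, n):
    # Cox & Snell R2 = 1 - (L_null / L_model)^(2/n)
    return 1 - np.exp((2.0 / n) * (ll_null - ll_model))

def nagelkerke_r2(ll_model, ll_null, n):
    # Rescales Cox & Snell R2 so that its maximum attainable value is 1
    cs = cox_snell_r2(ll_model, ll_null, n)
    max_cs = 1 - np.exp((2.0 / n) * ll_null)
    return cs / max_cs

# Assumed usage with a fitted statsmodels Logit result `res`:
# print(cox_snell_r2(res.llf, res.llnull, res.nobs),
#       nagelkerke_r2(res.llf, res.llnull, res.nobs))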
Category prediction
• Binomial logistic regression estimates the probability of an event (in
this case, having heart disease) occurring. If the estimated probability
of the event occurring is greater than or equal to 0.5 (better than even
chance), SPSS Statistics classifies the event as occurring (e.g., heart
disease being present). If the probability is less than 0.5, SPSS Statistics
classifies the event as not occurring (e.g., no heart disease). It is very
common to use logistic regression to predict whether cases can be
correctly classified (i.e., predicted) from the independent variables.
Therefore, it becomes necessary to have a method to assess the
effectiveness of the predicted classification against the actual
classification. There are many methods to assess this with their
usefulness often depending on the nature of the study conducted.
However, all methods revolve around the observed and predicted
classifications, which are presented in the Classification Table, as
shown below:
Category prediction
• Firstly, notice that the table has a subscript which states, "The cut value
is .500". This means that if the probability of a case being classified into
the "yes" category is greater than .500, then that particular case is
classified into the "yes" category. Otherwise, the case is classified as in
the "no" category (as mentioned previously). The classification table from
earlier – which did not include any independent variables – showed that
65.0% of cases overall could be correctly classified by simply assuming
that all cases were classified as "no" heart disease. However, with the
independent variables added, the model now correctly classifies 71.0% of
cases overall (see "Overall Percentage" row). That is, the addition of the
independent variables improves the overall prediction of cases into their
observed categories of the dependent variable. This particular measure is
referred to as the percentage accuracy in classification (PAC).
Category prediction
• Another measure is the sensitivity, which is the percentage of cases
that had the observed characteristic (e.g., "yes" for heart disease)
which were correctly predicted by the model (i.e., true positives). In
this case, 45.7% of participants who had heart disease were also
predicted by the model to have heart disease (see the "Percentage
Correct" column in the "Yes" row of the observed categories).
• Specificity is the percentage of cases that did not have the observed
characteristic (e.g., "no" for heart disease) and were also correctly
predicted as not having the observed characteristic (i.e., true
negatives). In this case, 84.6% of participants who did not have heart
disease were correctly predicted by the model not to have heart
disease (see the "Percentage Correct" column in the "No" row of the
observed categories).
Category prediction
• The positive predictive value is the percentage of correctly
predicted cases with the observed characteristic compared to
the total number of cases predicted as having the characteristic.
In our case, this is 100 x (16 ÷ (10 + 16)) which is 61.5%. That is,
of all cases predicted as having heart disease, 61.5% were
correctly predicted.
• The negative predictive value is the percentage of correctly
predicted cases without the observed characteristic compared
to the total number of cases predicted as not having the
characteristic. In our case, this is 100 x (55 ÷ (55 + 19)) which is
74.3%. That is, of all cases predicted as not having heart
disease, 74.3% were correctly predicted.
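The five measures above (PAC, sensitivity, specificity, positive and negative predictive value) can be computed from any confusion table. The Python sketch below reconstructs the counts implied by this example's Classification Table (16 true positives, 19 false negatives, 10 false positives and 55 true negatives) and reproduces the reported percentages.

import numpy as np

def classification_measures(y_obs, p_hat, cutoff=0.5):
    y_obs = np.asarray(y_obs)
    y_pred = (np.asarray(p_hat) >= cutoff).astype(int)
    tp = np.sum((y_obs == 1) & (y_pred == 1))  # true positives
    tn = np.sum((y_obs == 0) & (y_pred == 0))  # true negatives
    fp = np.sum((y_obs == 0) & (y_pred == 1))  # false positives
    fn = np.sum((y_obs == 1) & (y_pred == 0))  # false negatives
    return {"PAC": 100 * (tp + tn) / len(y_obs),
            "sensitivity": 100 * tp / (tp + fn),
            "specificity": 100 * tn / (tn + fp),
            "PPV": 100 * tp / (tp + fp),
            "NPV": 100 * tn / (tn + fn)}

# Observed outcomes and predicted probabilities matching the example's table:
# 16 TP, 19 FN, 10 FP, 55 TN -> 71.0%, 45.7%, 84.6%, 61.5% and 74.3%
y_obs = np.repeat([1, 1, 0, 0], [16, 19, 10, 55])
p_hat = np.repeat([0.9, 0.1, 0.9, 0.1], [16, 19, 10, 55])
print(classification_measures(y_obs, p_hat))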
ROC Curve
• In the previous section you calculated five measures – such as sensitivity and
specificity – that assess the ability of a binomial logistic regression model to correctly
classify cases (i.e., to discriminate). All these measures were calculated based on
a cut-off point of 0.5 (50%), meaning that a case (e.g., participant) with a predicted
probability of the event (e.g., heart disease) that is greater than or equal to
0.5 would be classified as having the event (e.g., having heart disease), and all
participants with predicted probabilities lower than 0.5 would be classified as not
having the event (e.g., not having heart disease).
• However, instead of concentrating on one cut-off point only, you can consider all
possible cut-off points in your data, and how each cut-off point
changes the specificity and sensitivity of the test. For example, a higher cut-off point
will increase specificity, but lower sensitivity. That is, a higher cut-off point makes it
"harder" for participants to be classified as having the event of interest, but "easier"
to be classified as not having the event of interest. A visual representation of this is
presented in a plot called the Receiver Operating Characteristic (ROC) curve, which is
a plot of sensitivity versus 1 minus specificity (Hilbe, 2009). The ROC curve can also
be used to calculate an overall measure of discrimination, but this will be discussed
later.
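A minimal sketch of producing the ROC curve and the area under it outside SPSS Statistics is shown below, using Python with scikit-learn and matplotlib; the observed outcomes and predicted probabilities are simulated stand-ins for the model's output.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(6)
p_hat = rng.uniform(0, 1, 300)  # predicted probabilities from the model (simulated)
y_obs = rng.binomial(1, p_hat)  # observed dichotomous outcomes (simulated)

fpr, tpr, thresholds = roc_curve(y_obs, p_hat)  # fpr = 1 - specificity
auc = roc_auc_score(y_obs, p_hat)               # equals the concordance (c) statistic

plt.plot(fpr, tpr, label=f"model (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="no discrimination")
plt.xlabel("1 - specificity")
plt.ylabel("Sensitivity")
plt.legend()
plt.show()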
ROC curve procedure
Interpreting the ROC curve

• You can see in the sub-note highlighted above that the positive actual state is
"1.00 Yes", indicating that we have correctly stated the event (i.e., the event of
interest in this example is having heart disease, which was coded as "1 = Yes").
Whatever category represents your event of interest should be reported in this
sub-note. If not, you need to go back to Step 3 of the ROC procedure above
and change the coding you have entered accordingly.
• Now that you know you have entered the correct information in the ROC curve
procedure, you can consider the ROC curve results. As such, the ROC curve is
presented under the heading, ROC Curve, as shown below:
Interpreting the ROC curve

• The further the blue line is above the diagonal reference line,
the better the discrimination. The area under the ROC curve is
equivalent to the concordance probability (Gönen, 2007), which can
also be reported via SPSS Statistics' NOMREG procedure (i.e., its
multinomial logistic regression procedure). The concordance (c)
statistic is the most common measure of the ability of a generalized
linear model (GzLM) to discriminate, of which binomial logistic
regression is a GzLM (Steyerberg, 2009). It is equivalent to the area
under the ROC curve for a dichotomous dependent variable (i.e., for
binomial (or binary) logistic regressions) (Gönen, 2007; Steyerberg,
2009). You can find the value for this area and, therefore, the
concordance statistic, by consulting the "Area" column in the Area
Under the Curve table, as highlighted below:
Interpreting the ROC curve
• You can see that the area under the ROC curve is .804. The area can range
from 0.5 to 1.0 with higher values representing better discrimination. According
to Hosmer et al. (2013) a value of .804 puts the discrimination of this model at the
lower border of excellent discrimination. The general rules of thumb of Hosmer et
al. (2013) are presented below
AUC                 Classification
0.5                 No discrimination; we might as well flip a coin.
0.5 < AUC < 0.7     Poor discrimination; not much better than a coin toss.
0.7 ≤ AUC < 0.8     Acceptable discrimination.
0.8 ≤ AUC < 0.9     Excellent discrimination.
AUC ≥ 0.9           Outstanding discrimination.

Table: Rules of thumb for the area under the ROC curve (AUC) according to Hosmer et al. (2013).
Interpreting the ROC curve
• It is also possible to provide a 95% confidence interval (CI) for the area under the
ROC curve. These are presented in the "Lower Bound" and "Upper Bound"
columns under the "Asymptotic 95% Confidence Interval" column in the "Area
Under the Curve" table, as highlighted below:
Interpreting the ROC curve
• The area under the ROC curve was .804 (95% CI, .718 to .891), which is an
excellent level of discrimination according to Hosmer et al. (2013).
• If you have space in your report, you should also present the ROC curve
itself (as recommended by Hosmer et al., 2013).
Variables in the equation

• The Variables in the Equation table shows the contribution of each independent
variable to the model and its statistical significance. This table is shown below:
Variables in the equation

• The Wald test ("Wald" column) is used to determine statistical significance for each
of the independent variables. The statistical significance of the test is found in the
"Sig." column. From these results you can see that age (p = .003), gender (p = .021)
and VO2max (p = .039) added significantly to the model/prediction, but weight (p =
.799) did not add significantly to the model.
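The Wald statistic in that column is simply the squared ratio of a coefficient to its standard error, referred to a chi-square distribution with one degree of freedom. The Python sketch below shows the calculation; the coefficient 1.950 is the gender coefficient reported later in this guide, while the standard error is an assumed value for illustration only.

from scipy.stats import chi2

def wald_test(b, se):
    """Return the Wald chi-square statistic and its p-value for one coefficient."""
    wald = (b / se) ** 2
    return wald, chi2.sf(wald, df=1)

# b = 1.950 (gender, from the example); se = 0.842 is assumed for illustration
print(wald_test(1.950, 0.842))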
Variables in the equation

• The B coefficients ("B" column) are used in the equation to predict the


probability of an event occurring, but not in an immediately intuitive
manner. The coefficients do, in fact, show the change in the log odds that
occur for a one-unit change in an independent variable when all other
independent variables are kept constant. So, for example, the log odds
change for gender is 1.950, which is the increase in log odds (as B is
positive) for males (as females were coded "0" and males as "1").
However, this is not often the most intuitive method of understanding
your results.
Variables in the equation

• Luckily, SPSS Statistics also includes the odds ratios of each of the
independent variables in the "Exp(B)" column along with their confidence
intervals ("95% C.I. for EXP(B)" column). This informs you of the change in
the odds for each increase in one unit of the independent variable. For
example, for gender, an increase in one unit (i.e., being male) increases
the odds by 7.026. What this means is that the odds of having heart
disease ("yes" category) is 7.026 times greater for males as opposed to
females. Values less than 1.000 indicate a decreased odds for an increase
in one unit of the independent variable. Sometimes, for clarity, the odds
ratio is inverted (e.g., 1 / .906 = 1.10, for VO2max). Thus, you would state
that for each unit reduction in the independent variable, VO2max, the
odds of having heart disease increases by a factor of 1.10. Remember to
invert the confidence intervals as well if you take this latter approach.
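The odds ratios and confidence intervals in the "Exp(B)" and "95% C.I. for EXP(B)" columns come from exponentiating the B coefficient and the ends of its Wald confidence interval. The Python sketch below uses the gender coefficient of 1.950 from this example; the standard error is assumed for illustration.

import numpy as np

def odds_ratio_ci(b, se, z=1.96):
    """Exponentiate the coefficient and its Wald confidence limits."""
    return np.exp(b), np.exp(b - z * se), np.exp(b + z * se)

or_, lo, hi = odds_ratio_ci(1.950, 0.842)   # exp(1.950) is roughly 7.03
print(f"OR = {or_:.3f}, 95% CI ({lo:.3f}, {hi:.3f})")

# Inverting an odds ratio below 1 (e.g., 1 / 0.906 = 1.10 for VO2max) re-expresses
# it as the change in odds per one-unit decrease in the predictor; the confidence
# limits must be inverted (and swapped) as well.
print(1 / 0.906)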
Summary
• A binomial logistic regression was performed to ascertain the effects of
age, weight, gender and VO2max on the likelihood that participants
have heart disease. The logistic regression model was statistically
significant, χ2(4) = 27.402, p < .0005. The model explained 33.0%
(Nagelkerke R2) of the variance in heart disease and correctly classified
71.0% of cases. Sensitivity was 45.7%, specificity was 84.6%, positive
predictive value was 61.5% and negative predictive value was 74.3%. Of
the five predictor variables only three were statistically significant: age,
gender and VO2max (as shown in Table 1). Males had 7.02 times higher
odds of exhibiting heart disease than females. Increasing age was
associated with an increased likelihood of exhibiting heart disease, but
increasing VO2max was associated with a reduction in the likelihood of
exhibiting heart disease.
Summary
• A binomial logistic regression was performed to ascertain the effects of age,
weight, gender and VO2max on the likelihood that participants have heart disease.
Linearity of the continuous variables with respect to the logit of the dependent
variable was assessed via the Box-Tidwell (1962) procedure. A Bonferroni
correction was applied using all eight terms in the model resulting in statistical
significance being accepted when p < .00625 (Tabachnick & Fidell, 2014). Based on
this assessment, all continuous independent variables were found to be linearly
related to the logit of the dependent variable. There was one standardized residual
with a value of 3.349 standard deviations, which was kept in the analysis. The
logistic regression model was statistically significant, χ2(4) = 27.402, p < .0005. The
model explained 33.0% (Nagelkerke R2) of the variance in heart disease and
correctly classified 71.0% of cases. Sensitivity was 45.7%, specificity was 84.6%,
positive predictive value was 61.5% and negative predictive value was 74.3%. Of
the five predictor variables only three were statistically significant: age, gender and
VO2max (as shown in Table 1). Males had 7.02 times higher odds of exhibiting heart
disease than females. Increasing age was associated with an increased likelihood
of exhibiting heart disease, but increasing VO2max was associated with a reduction
in the likelihood of exhibiting heart disease.
