Logistic Regression
• A binomial logistic regression attempts to predict the probability that an observation
falls into one of two categories of a dichotomous dependent variable, based on one or
more independent variables that can be either continuous or categorical.
Basic requirements of a binomial logistic regression
• The population model can be written as:
logit(Y) = β0 + β1X1 + β2X2 + β3X3 + β4X4 + ε
where β0 is the intercept (also known as the constant), β1 is the slope parameter
(also known as the slope coefficient) for X1, and so forth, and ε represents the
errors. This represents the population model, but it can be estimated as follows:
logit(Y) = b0 + b1X1 + b2X2 + b3X3 + b4X4 + e
• In the formula above, b0 is the sample intercept (aka constant) and estimates β0,
b1 is the sample slope parameter for X1 and estimates β1, and so forth, and e represents
the sample errors/residuals and estimates ε. A logit is the natural log of the odds
of an event occurring. It has little direct meaning on its own. However, by applying the
anti-log (i.e., exponentiating), it can be converted into odds, which are much easier to
interpret. In addition, through further calculations you can ascertain other useful
properties of the predictive power of your binomial logistic regression model, such as
the percentage of correctly classified cases. A minimal sketch of these conversions is
shown below.
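To make the logit more concrete, here is a minimal Python sketch (not part of the SPSS workflow) of converting a logit to odds and then to a probability; the logit value used is made up purely for illustration.

```python
import numpy as np

# A logit is the natural log of the odds of the event occurring:
#   logit(p) = ln(p / (1 - p))
# Applying the anti-log (exponentiating) converts a logit back to odds,
# and the odds can then be converted back to a probability.

def logit_to_odds(logit_value):
    """Anti-log of the logit gives the odds of the event."""
    return np.exp(logit_value)

def odds_to_probability(odds):
    """Convert odds back to a probability."""
    return odds / (1 + odds)

# Hypothetical fitted value of logit(Y) for one case (made-up number).
logit_value = 0.85
odds = logit_to_odds(logit_value)         # about 2.34 to 1
probability = odds_to_probability(odds)   # about 0.70

print(f"odds = {odds:.3f}, probability = {probability:.3f}")
```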
Setting up your data
• For a binomial logistic regression you will have at least two variables – one dependent
variable and one independent variable – but you will typically have two or more
independent variables. In addition, you may also choose to include a case identifier, as
discussed below. In this example, we have the following six variables:
1) The dependent variable, heart disease, which has two categories: "Yes" and "No";
2) The independent variable, age;
3) The independent variable, weight, which is their weight in kilograms (technically,
it is their 'mass');
4) The independent variable, gender, which has two categories: "Male" and "Female";
5) The independent variable, VO2max, which is the maximal aerobic capacity; and
6) The case identifier, caseno, which is used for easy elimination of cases (e.g., participants)
that might be flagged when checking assumptions.
Setting up your data
• Assumption #5
There needs to be a linear relationship between the continuous independent
variables and the logit transformation of the dependent variable.
There are a number of methods to test for a linear relationship between the
continuous independent variables and the logit of the dependent variable. In this
guide, we use the Box-Tidwell approach, which adds interaction terms between
the continuous independent variables and their natural logs to the regression
equation. You can then: (a) use the Binary Logistic procedure in SPSS Statistics to test this
assumption; (b) interpret and report the results from this test; and (c) proceed with
your analysis depending on whether you have met or violated this assumption. A
minimal sketch of what the Box-Tidwell check involves is shown below.
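As an illustration of what the Box-Tidwell check does, the following Python sketch (using statsmodels rather than SPSS Statistics) adds the natural-log interaction terms to a logistic regression; the data frame, values, and variable names are hypothetical stand-ins for the example data set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; variable names mirror the example but the values are simulated.
rng = np.random.default_rng(0)
n = 100
df = pd.DataFrame({
    "heart_disease": rng.integers(0, 2, n),   # 0 = "No", 1 = "Yes"
    "age": rng.uniform(30, 70, n),
    "weight": rng.uniform(55, 110, n),
    "VO2max": rng.uniform(20, 60, n),
    "gender": rng.integers(0, 2, n),          # 0 = female, 1 = male
})

# Box-Tidwell: add an interaction between each continuous predictor and its
# natural log. A statistically significant *_ln_int term suggests the
# linearity-of-the-logit assumption is violated for that predictor.
for var in ["age", "weight", "VO2max"]:
    df[f"{var}_ln_int"] = df[var] * np.log(df[var])

result = smf.logit(
    "heart_disease ~ age + weight + VO2max + gender"
    " + age_ln_int + weight_ln_int + VO2max_ln_int",
    data=df,
).fit(disp=False)

# Inspect the p-values of the interaction terms (with a Bonferroni-adjusted alpha).
print(result.pvalues)
```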
Setting up your data
• Assumption #6
Your data must not show multicollinearity
Multicollinearity occurs when you have two or more independent variables that
are highly correlated with each other. This leads to problems with understanding
which independent variable contributes to the variance explained in the
dependent variable, as well as technical issues in calculating a binomial logistic
regression model.
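SPSS's Binary Logistic procedure does not report collinearity diagnostics directly; one common screen, shown here as a hedged Python sketch using statsmodels, is the variance inflation factor (VIF), where values above roughly 10 are often taken to signal problematic multicollinearity. The variable names and simulated values are assumptions, not the example data.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Simulated independent variables; names mirror the example data set.
rng = np.random.default_rng(0)
n = 100
X = pd.DataFrame({
    "age": rng.uniform(30, 70, n),
    "weight": rng.uniform(55, 110, n),
    "VO2max": rng.uniform(20, 60, n),
    "gender": rng.integers(0, 2, n),
})

# VIFs are usually computed with an intercept column included.
X_const = sm.add_constant(X)
vifs = {
    col: variance_inflation_factor(X_const.values, i)
    for i, col in enumerate(X_const.columns)
    if col != "const"
}
print(vifs)   # values well above ~10 suggest problematic collinearity
```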
• There are two main objectives that you can achieve with the output from
a binomial logistic regression: (a) determine which of your independent
variables (if any) have a statistically significant effect on your dependent
variable; and (b) determine how well your binomial logistic regression
model predicts the dependent variable. Both of these objectives will be
answered in the following sections:
Interpreting Results
• Data coding: You can start your analysis by inspecting your variables and data,
including: (a) checking if any cases are missing and whether you have the number of
cases you expect (the "Case Processing Summary" table); (b) making sure that the
correct coding was used for the dependent variable (the "Dependent Variable
Encoding" table); and (c) determining whether there are any categories amongst
your categorical independent variables with very low counts – a situation that is
undesirable for binomial logistic regression (the "Categorical Variables Codings"
table). This is highlighted in the Data coding section on the next page.
• Baseline analysis: Next, you can consult the "Classification Table", "Variables in the
Equation" and "Variables not in the Equation" tables. These all relate to the
situation where no independent variables have been added to the model and the
model just includes the constant. As such, you are interested in this information
only as a comparison to the model with all the independent variables added. This
Baseline analysis section provides a basis against which the main binomial logistic
regression analysis with all independent variables added to the equation can be
evaluated.
Interpreting Results
• Binomial logistic regression results: In evaluating the main logistic
regression results, you can start by determining the overall
statistical significance of the model (namely, how well the model
predicts categories compared to no independent variables). You
can also assess the adequacy of the model by analysing how poor
the model is at predicting the categorical outcomes using
the Hosmer and Lemeshow goodness of fit test. This is explained
in the Model fit section. Next, you can consult the Cox & Snell R
Square and Nagelkerke R Square values to understand how much
variation in the dependent variable can be explained by the model
(i.e., these are two methods of calculating the explained variation),
but it is preferable to report the Nagelkerke R2 value. This is
illustrated in the Variance explained section.
Interpreting Results
• Category prediction: After determining model fit and
explained variation, it is very common to use binomial
logistic regression to predict whether cases can be correctly
classified (i.e., predicted) from the independent variables.
Logistic regression estimates the probability of an event (in
this case, having heart disease) occurring. If the estimated
probability of the event occurring is greater than or equal to
0.5 (better than even chance), SPSS Statistics classifies the
event as occurring (e.g., heart disease being present). If the
probability is less than 0.5, SPSS Statistics classifies the event
as not occurring (e.g., no heart disease).
Interpreting Results
• Variables in the equation: you can assess the contribution of
each independent variable to the model and its statistical
significance using the Variables in the Equation table. You will
also be able to use the odds ratios of each of the independent
variables (along with their confidence intervals) to
understand the change in the odds ratio for each increase in
one unit of the independent variable. Using these odds ratios
you will be able to, for example, make statements such as:
"the odds of having heart disease is 7.026 times greater for
males as opposed to females". You can make such statements for both
categorical and continuous independent variables.
Baseline analysis
• The next three tables headed under the main title, "Block 0: Beginning Block", all
relate to the situation where no independent variables have been added to the
model and the model just includes the constant. As such, you are interested in this
information only as a comparison to the model with all the independent variables
added. The table below, "Classification Table", shows that without any
independent variables, the 'best guess' is to simply assume that all participants did
not have heart disease. If you assume this, you will overall correctly classify 65% of
cases (the "Overall Percentage" row), as shown below:
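The 65% figure follows directly from the marginal counts in the example (the classification counts reported later imply 100 participants, 65 of whom do not have heart disease); a one-line check:

```python
# Null-model accuracy: with no predictors, the best guess is the larger
# outcome category. In the example, 65 of 100 participants have no heart disease.
n_no, n_yes = 65, 35
baseline_accuracy = 100 * max(n_no, n_yes) / (n_no + n_yes)   # 65.0%
print(baseline_accuracy)
```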
Baseline analysis
• The table below, "Variables in the Equation", simply shows you that only the
constant was included in this particular model:
• And the table below, "Variables not in the Equation", highlights the independent
variables left out of the model:
Binomial logistic regression results
• The next tables appear under the heading "Block 1: Method = Enter" and
represent the results of the main logistic regression analysis with all independent
variables added to the equation. A rough equivalent outside SPSS Statistics is sketched below.
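For readers working outside SPSS Statistics, a rough equivalent of entering all independent variables in a single block is sketched below with statsmodels; the data frame, values, and variable names are hypothetical stand-ins for the example data set.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data; heart_disease is coded 0 = "No", 1 = "Yes".
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "heart_disease": rng.integers(0, 2, n),
    "age": rng.uniform(30, 70, n),
    "weight": rng.uniform(55, 110, n),
    "VO2max": rng.uniform(20, 60, n),
    "gender": rng.integers(0, 2, n),   # 0 = female, 1 = male
})

# Fit the full model with all independent variables entered together.
result = smf.logit("heart_disease ~ age + weight + VO2max + gender", data=df).fit(disp=False)

print(result.summary())        # coefficients, standard errors, z statistics
print(np.exp(result.params))   # odds ratios, comparable to SPSS's Exp(B) column
```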
Model fit
• The first table, "Omnibus Tests of Model Coefficients", provides the overall
statistical significance of the model (namely, how well the model predicts
categories compared to no independent variables), as shown below
Binomial logistic regression results
• For this type of binomial logistic regression, you can reference the "Model" row.
From the table above, you can see that the model is statistically significant
(p < .0005; "Sig." column). Another way of assessing the adequacy of the model is
to analyse how poor the model is at predicting the categorical outcomes. This is
tested using the Hosmer and Lemeshow goodness of fit test as found in the
similarly titled table, as shown below
• For this test, you do not want the result to be statistically significant, because this
would indicate that you have a poorly fitting model. In this example, the Hosmer and
Lemeshow test is not statistically significant (p = .871; "Sig." column), indicating
that the model is not a poor fit. A minimal sketch of the underlying calculation is shown below.
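The Hosmer and Lemeshow test is not built into most Python libraries, but the "deciles of risk" computation it is based on can be sketched as follows. This is a hedged illustration under simplified assumptions, not SPSS's exact implementation, and the probabilities used are simulated.

```python
import numpy as np
import pandas as pd
from scipy import stats

def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Minimal Hosmer-Lemeshow goodness-of-fit sketch.

    Cases are binned into `groups` groups by predicted probability
    ("deciles of risk"); observed and expected event counts are compared
    with a chi-square statistic on (groups - 2) degrees of freedom.
    """
    data = pd.DataFrame({"y": y_true, "p": y_prob})
    data["bin"] = pd.qcut(data["p"], q=groups, duplicates="drop")

    chi_sq = 0.0
    for _, grp in data.groupby("bin", observed=True):
        obs_events = grp["y"].sum()
        exp_events = grp["p"].sum()
        obs_non = len(grp) - obs_events
        exp_non = len(grp) - exp_events
        chi_sq += (obs_events - exp_events) ** 2 / exp_events
        chi_sq += (obs_non - exp_non) ** 2 / exp_non

    dof = data["bin"].nunique() - 2
    p_value = stats.chi2.sf(chi_sq, dof)
    return chi_sq, p_value

# With a real model, y_prob would be the fitted predicted probabilities
# (e.g. result.predict() from statsmodels); here they are simulated.
rng = np.random.default_rng(2)
y_prob = rng.uniform(0.05, 0.95, 100)
y_true = rng.binomial(1, y_prob)
print(hosmer_lemeshow(y_true, y_prob))
```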
Variance explained
• In order to understand how much variation in the dependent variable can be explained by the
model (the equivalent of R2 in multiple regression), you can consult the table below, "Model
Summary":
• This table contains the Cox & Snell R Square and Nagelkerke R Square values, which are both
methods of calculating the explained variation (this is not as straightforward as it is in
multiple regression). These values are sometimes referred to as pseudo R2 values
and will have lower values than in multiple regression. However, they are interpreted in the
same manner, but with more caution.
• Therefore, the explained variation in the dependent variable based on our model ranges from
24.0% to 33.0%, depending on whether you reference the Cox & Snell R2 or
Nagelkerke R2 methods, respectively. Nagelkerke R2 is a modification of Cox & Snell R2, the
latter of which cannot achieve a value of 1. For this reason, it is preferable to report the
Nagelkerke R2 value.
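For reference, both pseudo R2 values can be reproduced from the model and null (intercept-only) log-likelihoods. The sketch below uses made-up log-likelihoods chosen so that the outputs come out close to the .240 and .330 reported in the example; with a statsmodels fit, the real inputs would be result.llf, result.llnull and result.nobs.

```python
import numpy as np

def pseudo_r_squared(ll_model, ll_null, n):
    """Cox & Snell and Nagelkerke pseudo R-squared from log-likelihoods.

    ll_model: log-likelihood of the fitted model
    ll_null:  log-likelihood of the intercept-only model
    n:        number of cases
    """
    cox_snell = 1 - np.exp((2 / n) * (ll_null - ll_model))
    max_cox_snell = 1 - np.exp((2 / n) * ll_null)   # upper bound of Cox & Snell
    nagelkerke = cox_snell / max_cox_snell          # rescaled so 1.0 is attainable
    return cox_snell, nagelkerke

# Made-up log-likelihoods for illustration; yields roughly (0.24, 0.33).
print(pseudo_r_squared(ll_model=-50.2, ll_null=-63.9, n=100))
```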
Category prediction
• Binomial logistic regression estimates the probability of an event (in
this case, having heart disease) occurring. If the estimated probability
of the event occurring is greater than or equal to 0.5 (better than even
chance), SPSS Statistics classifies the event as occurring (e.g., heart
disease being present). If the probability is less than 0.5, SPSS Statistics
classifies the event as not occurring (e.g., no heart disease). It is very
common to use logistic regression to predict whether cases can be
correctly classified (i.e., predicted) from the independent variables.
Therefore, it becomes necessary to have a method to assess the
effectiveness of the predicted classification against the actual
classification. There are many methods to assess this with their
usefulness often depending on the nature of the study conducted.
However, all methods revolve around the observed and predicted
classifications, which are presented in the Classification Table, as
shown below:
Category prediction
• Firstly, notice that the table has a subscript which states, "The cut value
is .500". This means that if the probability of a case being classified into
the "yes" category is greater than .500, then that particular case is
classified into the "yes" category. Otherwise, the case is classified as in
the "no" category (as mentioned previously). The classification table from
earlier – which did not include any independent variables – showed that
65.0% of cases overall could be correctly classified by simply assuming
that all cases were classified as "no" heart disease. However, with the
independent variables added, the model now correctly classifies 71.0% of
cases overall (see "Overall Percentage" row). That is, the addition of the
independent variables improves the overall prediction of cases into their
observed categories of the dependent variable. This particular measure is
referred to as the percentage accuracy in classification (PAC).
Category prediction
• Another measure is the sensitivity, which is the percentage of cases
that had the observed characteristic (e.g., "yes" for heart disease)
which were correctly predicted by the model (i.e., true positives). In
this case, 45.7% of participants who had heart disease were also
predicted by the model to have heart disease (see the "Percentage
Correct" column in the "Yes" row of the observed categories).
• Specificity is the percentage of cases that did not have the observed
characteristic (e.g., "no" for heart disease) and were also correctly
predicted as not having the observed characteristic (i.e., true
negatives). In this case, 84.6% of participants who did not have heart
disease were correctly predicted by the model not to have heart
disease (see the "Percentage Correct" column in the "No" row of the
observed categories).
Category prediction
• The positive predictive value is the percentage of correctly
predicted cases with the observed characteristic compared to
the total number of cases predicted as having the characteristic.
In our case, this is 100 x (16 ÷ (10 + 16)) which is 61.5%. That is,
of all cases predicted as having heart disease, 61.5% were
correctly predicted.
• The negative predictive value is the percentage of correctly
predicted cases without the observed characteristic compared
to the total number of cases predicted as not having the
characteristic. In our case, this is 100 x (55 ÷ (55 + 19)) which is
74.3%. That is, of all cases predicted as not having heart
disease, 74.3% were correctly predicted.
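All five classification measures can be reproduced from the four cell counts implied by the example figures (16 true positives, 10 false positives, 55 true negatives, 19 false negatives):

```python
# Worked check of the classification measures using the counts from the
# example classification table.
tp, fp, tn, fn = 16, 10, 55, 19

overall_pac = 100 * (tp + tn) / (tp + fp + tn + fn)   # 71.0% correctly classified
sensitivity = 100 * tp / (tp + fn)                    # 45.7% of "Yes" cases predicted "Yes"
specificity = 100 * tn / (tn + fp)                    # 84.6% of "No" cases predicted "No"
ppv         = 100 * tp / (tp + fp)                    # 61.5% of predicted "Yes" were correct
npv         = 100 * tn / (tn + fn)                    # 74.3% of predicted "No" were correct

print(overall_pac, sensitivity, specificity, ppv, npv)
```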
ROC Curve
• In the previous section you calculated five measures – such as sensitivity and
specificity – that assess the ability of a binomial logistic regression model to correctly
classify cases (i.e., to discriminate). All these measures were calculated based on
a cut-off point of 0.5 (50%), meaning that a case (e.g., participant) with a predicted
probability of the event (e.g., heart disease) that is greater than or equal to
0.5 would be classified as having the event (e.g., having heart disease), and all
participants with predicted probabilities lower than 0.5 would be classified as not
having the event (e.g., not having heart disease).
• However, instead of concentrating on one cut-off point only, you can consider all
possible cut-off points in your data, and how each cut-off point
changes the specificity and sensitivity of the test. For example, a higher cut-off point
will increase specificity, but lower sensitivity. That is, a higher cut-off point makes it
"harder" for participants to be classified as having the event of interest, but "easier"
to be classified as not having the event of interest. A visual representation of this is
presented in a plot called the Receiver Operating Characteristic (ROC) curve, which is
a plot of sensitivity versus 1 minus specificity (Hilbe, 2009). The ROC curve can also
be used to calculate an overall measure of discrimination, but this will be discussed
later.
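As an illustration outside SPSS Statistics, an ROC curve and its AUC can be produced with scikit-learn as sketched below; y_true and y_prob are simulated stand-ins for the observed 0/1 outcomes and the model's predicted probabilities (e.g., result.predict() from a fitted logistic regression).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# Simulated outcomes and predicted probabilities for illustration only.
rng = np.random.default_rng(3)
y_prob = rng.uniform(0, 1, 100)
y_true = rng.binomial(1, y_prob)

fpr, tpr, thresholds = roc_curve(y_true, y_prob)   # fpr = 1 - specificity
auc = roc_auc_score(y_true, y_prob)                # overall measure of discrimination

plt.plot(fpr, tpr, label=f"AUC = {auc:.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")           # chance (no discrimination) line
plt.xlabel("1 - Specificity (false positive rate)")
plt.ylabel("Sensitivity (true positive rate)")
plt.legend()
plt.show()
```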
ROC curve procedure
Interpreting the ROC curve
• You can see in the sub-note highlighted above that the positive actual state is
"1.00 Yes", indicating that we have correctly stated the event (i.e., the event of
interest in this example is having heart disease, which was coded as "1 = Yes").
Whatever category represents your event of interest should be reported in this
sub-note. If not, you need to go back to Step 3 of the ROC procedure above
and change the coding you have entered accordingly.
• Now that you know you have entered the correct information in the ROC curve
procedure, you can consider the ROC curve results. As such, the ROC curve is
presented under the heading, ROC Curve, as shown below:
Interpreting the ROC curve
Table: Rules of thumb for the area under the ROC curve (AUC) according to Hosmer et al. (2013):
AUC = 0.5: no discrimination (no better than chance).
0.5 < AUC < 0.7: poor discrimination, not much better than a coin toss.
0.7 ≤ AUC < 0.8: acceptable discrimination.
0.8 ≤ AUC < 0.9: excellent discrimination.
AUC ≥ 0.9: outstanding discrimination.
Interpreting the ROC curve
• It is also possible to provide a 95% confidence interval (CI) for the area under the
ROC curve. These are presented in the "Lower Bound" and "Upper Bound"
columns under the "Asymptotic 95% Confidence Interval" column in the "Area
Under the Curve" table, as highlighted below:
Interpreting the ROC curve
• The area under the ROC curve was .804 (95% CI, .718 to .891), which is an
excellent level of discrimination according to Hosmer et al. (2013).
• If you have space in your report, you should also present the ROC curve
itself (as recommended by Hosmer et al., 2013).
Variables in the equation
• The Wald test ("Wald" column) is used to determine statistical significance for each
of the independent variables. The statistical significance of the test is found in the
"Sig." column. From these results you can see that age (p = .003), gender (p = .021)
and VO2max (p = .039) added significantly to the model/prediction, but weight (p =
.799) did not add significantly to the model.
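For reference, the Wald statistic SPSS reports for each coefficient is simply the squared ratio of the coefficient to its standard error, tested against a chi-square distribution with 1 degree of freedom. The coefficient and standard error below are made up purely for illustration.

```python
from scipy import stats

# Wald = (B / SE)^2, compared to a chi-square distribution with df = 1,
# which is how the "Wald" and "Sig." columns are obtained.
b, se = 0.085, 0.028          # made-up coefficient and standard error
wald = (b / se) ** 2
p_value = stats.chi2.sf(wald, df=1)
print(f"Wald = {wald:.3f}, p = {p_value:.3f}")
```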
Variables in the equation
• Luckily, SPSS Statistics also includes the odds ratios of each of the
independent variables in the "Exp(B)" column along with their confidence
intervals ("95% C.I. for EXP(B)" column). This informs you of the change in
the odds for each increase in one unit of the independent variable. For
example, for gender, an increase in one unit (i.e., being male) increases
the odds by 7.026. What this means is that the odds of having heart
disease ("yes" category) is 7.026 times greater for males as opposed to
females. Values less than 1.000 indicate a decreased odds for an increase
in one unit of the independent variable. Sometimes, for clarity, the odds
ratio is inverted (e.g., 1 / .906 = 1.10, for VO2max). Thus, you would state
that for each unit reduction in the independent variable, VO2max, the
odds of having heart disease increases by a factor of 1.10. Remember to
invert the confidence intervals as well if you take this latter approach.
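The odds-ratio arithmetic described above can be checked directly. In the sketch below, the coefficients are back-calculated from the example's Exp(B) values (7.026 for gender, .906 for VO2max) purely for illustration.

```python
import numpy as np

# Exp(B): exponentiating a logistic regression coefficient gives the change
# in the odds for a one-unit increase in that independent variable.
b_gender = np.log(7.026)   # chosen so that exp(b_gender) = 7.026
b_vo2max = np.log(0.906)   # chosen so that exp(b_vo2max) = 0.906

odds_ratio_gender = np.exp(b_gender)   # 7.026: odds of heart disease, males vs females
odds_ratio_vo2max = np.exp(b_vo2max)   # 0.906: odds per one-unit increase in VO2max

# Inverting an odds ratio below 1 for easier interpretation:
inverted = 1 / odds_ratio_vo2max       # about 1.10: odds per one-unit *reduction* in VO2max
print(odds_ratio_gender, odds_ratio_vo2max, inverted)
```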
Summary
• A binomial logistic regression was performed to ascertain the effects of
age, weight, gender and VO2max on the likelihood that participants
have heart disease. The logistic regression model was statistically
significant, χ2(4) = 27.402, p < .0005. The model explained 33.0%
(Nagelkerke R2) of the variance in heart disease and correctly classified
71.0% of cases. Sensitivity was 45.7%, specificity was 84.6%, positive
predictive value was 61.5% and negative predictive value was 74.3%. Of
the four predictor variables, only three were statistically significant: age,
gender and VO2max (as shown in Table 1). Males had 7.02 times higher
odds of exhibiting heart disease than females. Increasing age was
associated with an increased likelihood of exhibiting heart disease, but
increasing VO2max was associated with a reduction in the likelihood of
exhibiting heart disease.
Summary
• A binomial logistic regression was performed to ascertain the effects of age,
weight, gender and VO2max on the likelihood that participants have heart disease.
Linearity of the continuous variables with respect to the logit of the dependent
variable was assessed via the Box-Tidwell (1962) procedure. A Bonferroni
correction was applied using all eight terms in the model resulting in statistical
significance being accepted when p < .00625 (Tabachnick & Fidell, 2014). Based on
this assessment, all continuous independent variables were found to be linearly
related to the logit of the dependent variable. There was one standardized residual
with a value of 3.349 standard deviations, which was kept in the analysis. The
logistic regression model was statistically significant, χ2(4) = 27.402, p < .0005. The
model explained 33.0% (Nagelkerke R2) of the variance in heart disease and
correctly classified 71.0% of cases. Sensitivity was 45.7%, specificity was 84.6%,
positive predictive value was 61.5% and negative predictive value was 74.3%. Of
the four predictor variables, only three were statistically significant: age, gender and
VO2max (as shown in Table 1). Males had 7.02 times higher odds of exhibiting heart
disease than females. Increasing age was associated with an increased likelihood
of exhibiting heart disease, but increasing VO2max was associated with a reduction
in the likelihood of exhibiting heart disease.