Linear Regression Notes

Linear regression is a statistical method that estimates the relationship between one independent variable (height) and one dependent variable (weight) using a straight line. The document provides instructions for performing linear regression in Excel using the Analysis ToolPak, including how to interpret the output results such as the correlation coefficient, R square, and residuals. Additionally, it explains how to visualize the data and assess the model's accuracy through various plots.

Uploaded by

Khalid Obad
Copyright
© All Rights Reserved
Linear regression Notes

What is a linear regression?


Simple linear regression is a regression model that estimates the relationship between
one independent variable and one dependent variable using a straight line.

Example:
We measured the weight and height of 11 different participants; each row of the data
represents a different participant. The two variables are:

• Weight (kg) (Y Range – dependent variable)
• Height (cm) (X Range – independent variable)

So, we want to determine whether there is a relationship between weight (the dependent
variable) and height (the independent variable), using linear regression to see how well
the height measures in my sample predict the weight measures.

Installing the Analysis ToolPak

There are a few ways you can perform a linear regression in Excel, but perhaps the easiest
method is to use the Analysis ToolPak. This is an add-in created by Microsoft that provides
data analysis tools for statistical analyses.

Here are the instructions for installing the Analysis ToolPak:



1. Go to File > Options
2. Click on Add-ins
3. At the bottom, set the Manage drop-down to Excel Add-ins and click the Go button
4. Tick the Analysis ToolPak add-in, and click OK

Now, when you click on the Data ribbon, you should see a Data Analysis button in a
subsection called Analyze.

Performing the linear regression in Excel

To perform the linear regression, click on the Data Analysis button.

Then, select Regression from the list.

You must then enter the following:

• Input Y Range – this is the data for the Y variable, otherwise known as the
dependent variable. The Y variable is the one that you want to predict in the
regression model. In this example it is the weight data.
• Input X Range – this is the data for the X variable, otherwise known as the
independent variable. In this example it is the height data.
• Constant is Zero – tick this option only if you want to force the regression line to
pass through the origin (0, 0). Doing so fixes the Y intercept at zero, so no intercept
is estimated. Generally, for linear regression, this option is not selected, so just
leave it unchecked for this example.

It is also possible to specify the confidence level for the test. By default, the
results will return the 95% confidence intervals without having to change any
options.
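
To make concrete what the Regression tool works out from the Y and X ranges, here is a minimal Python sketch of an ordinary least squares fit. The function name fit_line, the constant_is_zero flag (meant only to mimic the Constant is Zero checkbox) and the five data points are all hypothetical; they are not the 11 participants from this example.

```python
def fit_line(x, y, constant_is_zero=False):
    """Return (slope, intercept) of the least squares line through (x, y)."""
    n = len(x)
    if constant_is_zero:
        # Line forced through the origin: slope = sum(x*y) / sum(x^2)
        slope = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
        return slope, 0.0
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squares around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical heights (cm) and weights (kg), for illustration only
heights = [160.0, 165.0, 170.0, 175.0, 180.0]
weights = [50.0, 54.0, 57.0, 62.0, 66.0]

slope, intercept = fit_line(heights, weights)
print(f"weight = {slope:.3f} * height + {intercept:.3f}")
```

Calling fit_line(heights, weights, constant_is_zero=True) returns an intercept of exactly zero and a different slope, which is why the option is normally left unchecked.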

Output options

For the Output Options, you can specify where you want the regression results to be
placed.

• Output Range – you can highlight where you want the results to be placed in
that worksheet
• New Worksheet Ply – lets you place the results in a new worksheet
• New Workbook – lets you save the results in an entirely separate workbook

Residuals

The final set of options concerns the residuals in the analysis.

• Residuals – will return the list of predicted dependent values, based on the
regression line, as well as the residual values for each point
• Standardized Residuals – will return the standardized residuals; these values
can be useful when identifying potential outliers
• Residual Plots – will create a scatter graph where the residuals are plotted on
the Y axis and the X variable is plotted on the X axis
• Line Fit Plots – will create another scatter graph where the Y and X variables
are plotted, but it will also add the predicted Y values onto the graph
• Normal Probability Plots – will create another scatter plot, which is used to
assess whether the Y variable data fit a normal distribution

Interpretation of the linear regression results

The results are generated into the following:

• Summary Output table
• ANOVA table
• Coefficients table
• Residual Output table
• Residual plot
• Standardized Residuals
• Line Fits plot
• Normal Probability plot

Summary Output table

In the first table called Summary Output, there are some regression statistics from the
test.

Multiple R

This is the absolute value of the correlation coefficient between the two variables of
interest. Briefly, it is a value that tells you how strong the linear relationship is.

For example, a Multiple R of 0.65 would indicate a moderate linear correlation between the
height and weight measures.

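The correlation coefficient (r) can tell us two important things about the correlation:

• Direction
• Strength/magnitude

The correlation coefficient can be any number between –1 and +1, and it has no units of
measure. Because Multiple R is reported as an absolute value, it shows only the strength;
the direction is read from the sign of the slope in the Coefficients table.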

• Perfectly positive correlation: r = 1
• Perfectly negative correlation: r = -1
• No correlation: r = 0

Correlation coefficient (r)    Interpretation
0.00–0.10                      No correlation
0.10–0.39                      Weak correlation
0.40–0.69                      Moderate correlation
0.70–0.89                      Strong correlation
0.90–1.00                      Very strong correlation
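
As a rough illustration of how the correlation coefficient is obtained and then read against the table above, here is a short Python sketch. The function names pearson_r and describe_strength are hypothetical. Assigning the boundary value 0.10 to the "Weak" band is an assumption, since the table lists it in two rows.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def describe_strength(r):
    """Map the absolute value of r to the labels in the table."""
    size = abs(r)  # the sign gives the direction, not the strength
    if size < 0.10:
        return "No correlation"
    if size < 0.40:
        return "Weak correlation"
    if size < 0.70:
        return "Moderate correlation"
    if size < 0.90:
        return "Strong correlation"
    return "Very strong correlation"


print(describe_strength(0.65))
print(describe_strength(-0.95))
```

Note that describe_strength uses the absolute value, so a strongly negative r is labelled in the same way as a strongly positive one.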



R square

The coefficient of determination (R²) indicates the proportion of variance shared between
the two variables.

R² is always between 0 and 1.


To interpret the coefficient of determination, it is convenient to multiply it by 100 to
convert it to a percentage.

In this example, we can say that 91.33% of the variability in weight is explained by the
variability in height.
The other 8.67% of the variance is not explained by the model; it is attributable to
factors that were not measured in the experiment and to measurement error.
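
The "explained variability" reading follows from how R² is defined: one minus the ratio of the residual sum of squares to the total sum of squares. Here is a minimal Python sketch of that calculation. The function name r_squared and the observed and predicted values are hypothetical and do not come from this example's data.

```python
def r_squared(observed, predicted):
    """R squared = 1 - SS_residual / SS_total."""
    mean_y = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


# Hypothetical observed weights (kg) and the values a fitted line predicts
observed = [50.0, 54.0, 57.0, 62.0, 66.0]
predicted = [49.8, 53.8, 57.8, 61.8, 65.8]

r2 = r_squared(observed, predicted)
print(f"{r2 * 100:.2f}% of the variability is explained by the model")
```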

Adjusted R square

The adjusted R square takes into account the number of independent variables in the
regression analysis, and corrects for bias.

Usually, this value is only relevant when you are performing multiple linear regression,
where there is more than one independent variable in the model.
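
For reference, the usual textbook adjustment can be sketched in Python as below. The function name adjusted_r_squared is hypothetical; the inputs are the R² of 0.9133 and the 11 observations quoted in these notes, with k = 1 independent variable. The result is computed from the formula and is not read from the Excel output.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


adj = adjusted_r_squared(0.9133, n=11, k=1)
print(round(adj, 4))  # about 0.9037
```

With a single independent variable the adjustment is small, which is why the value matters mainly in multiple regression.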

Standard error

The standard error of the regression is, roughly, the typical distance between the
observed values and the regression line, expressed in the units of the dependent variable
(kg in this example).

The smaller the standard error, the more precise the linear regression model is.
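
A minimal Python sketch of this statistic is shown below, assuming the usual definition: the square root of the residual sum of squares divided by the residual degrees of freedom (n - 2 for one X variable). The function name and the data are hypothetical.

```python
import math


def regression_standard_error(observed, predicted, n_predictors=1):
    """Square root of SS_residual / (n - number of predictors - 1)."""
    n = len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return math.sqrt(ss_res / (n - n_predictors - 1))


# Hypothetical observed weights (kg) and fitted values
observed = [50.0, 54.0, 57.0, 62.0, 66.0]
predicted = [49.8, 53.8, 57.8, 61.8, 65.8]

se = regression_standard_error(observed, predicted)
print(f"standard error of the regression: {se:.3f} kg")
```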

Observations

This is just the number of observations (participants) included in the analysis, which is
11 in this example.

ANOVA table

The main thing you will be concerned with when looking at this table is the value under
the Significance F header; this is in fact the P value for the regression model.

To be able to interpret this, we need our hypotheses:

• Null hypothesis – there is no linear relationship between the height and weight
measures
• Alternative hypothesis – there is a linear relationship between the height and
weight measures

If my alpha was 0.05, this means I will reject the null and accept the alternative
hypothesis if P≤0.05. The opposite will be true if P>0.05; in this case, I would fail to
reject the null hypothesis.

As you can see, the P value (Significance F) for the model was considerably lower than
my alpha value of 0.05. So, I can conclude that the linear regression model is significant.
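
The decision rule above can be sketched in Python as follows. The F statistic is written in terms of R², which is algebraically the same as the ratio of mean squares in the ANOVA table. The function names are hypothetical, and the P values passed to decide are made-up inputs; in practice you would read Significance F from the Excel output, because converting F to a P value needs the F distribution and that step is not shown here.

```python
def f_statistic(r2, n, k=1):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1)) for k independent variables."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


def decide(p_value, alpha=0.05):
    """Apply the rule: reject the null hypothesis when P <= alpha."""
    if p_value <= alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# R square and sample size as quoted in these notes
print(round(f_statistic(0.9133, n=11), 2))

# Hypothetical P values, for illustration only
print(decide(0.001))
print(decide(0.20))
```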

Coefficients table

Let me now move on to the final table of results regarding the coefficients.

The first row displays the results for the intercept. This is the point where the line of
best fit (regression line) crosses the Y axis, that is, the predicted value of Y when X is
zero.

The second row displays the results for the slope, which is the change in the predicted Y
for each one-unit increase in X.

For a simple linear regression model, the most basic version of the equation is
Y = mX + b, where m is the slope and b is the intercept.

Using the information reported from the results, we can then say:

Y = 0.800264 × X – 79.599

So, in this example, if we know a participant's height (in cm), we can predict their weight
(in kg) by using this equation. For example, if a participant measured 175 cm, the model
estimates their weight to be 60.45 kg.
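
The same arithmetic can be written as a small Python helper. The slope and intercept are the rounded coefficients reported in these notes; the function name predict_weight is hypothetical.

```python
SLOPE = 0.800264      # kg per cm, from the Coefficients table
INTERCEPT = -79.599   # kg, from the Coefficients table


def predict_weight(height_cm):
    """Predicted weight (kg) for a given height (cm): Y = m * X + b."""
    return SLOPE * height_cm + INTERCEPT


print(round(predict_weight(175), 2))  # about 60.45
```

Predictions are only trustworthy for heights within the range that was actually measured in the sample.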

Looking back at the Coefficients table, there are other columns that tell us the standard
error of each estimate, as well as the lower and upper bounds of the 95% confidence
interval (or of a different interval, if a different confidence level was entered). These
values are reported for both the intercept and the slope.

You will also notice that each coefficient has a t statistic. This value is used to compute
the P value for that coefficient.
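
As a sketch of how these columns relate to one another, the standard textbook formulas are given below. They are not taken from the Excel output in this document, and the symbols (the estimated coefficient, its standard error SE, and the critical t value) are notation introduced here. With 11 observations and one X variable, the degrees of freedom are n - 2 = 9.

```latex
t = \frac{\hat{\beta}}{\mathrm{SE}(\hat{\beta})},
\qquad
\text{confidence interval: } \hat{\beta} \pm t_{1-\alpha/2,\; n-2}\,\mathrm{SE}(\hat{\beta})
```

A coefficient whose 95% confidence interval excludes zero has a P value below 0.05.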

Residual options

So, that's an overview of the regression model results. Let me now cover the other
outputs from the regression test.

Residual Output

If you selected the Residuals option during the regression set-up, you will have a
table titled Residual Output.

For each observation from your data that was entered into the regression test, you will get
a predicted value of Y based on the regression model.

For example, if you look at the first observation in the original data, you see this
participant had a height of 167.08 cm. If I put this into the regression equation, along
with the slope and intercept values, I get a predicted weight of 54.10999 kg.

This is what the Predicted column represents; Excel does this for each of the
observations.

Using the predicted values, Excel can then calculate the residuals.

A residual is simply the vertical distance between the actual data point and the line of
best fit, that is, the actual Y value minus the predicted Y value.

My first participant had a height of 167.08 cm and a weight of 51.24 kg. As calculated
above, the predicted weight based on the model was 54.10999 kg. The residual for this point
is therefore the difference between the actual weight (51.24 kg) and the predicted weight
(54.10999 kg), which comes out at around -2.87 kg. The negative sign means the model
overestimates this participant's weight.

Excel then repeats this process for the rest of the observations.
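
The calculation for one row of the Residual Output table can be sketched in Python as follows. Only the first participant's values are quoted in these notes, so that is the only observation used; the function name residual_row is hypothetical. Because the slope and intercept here are the rounded values from the text, the result agrees with Excel's figures only to about two decimal places.

```python
SLOPE = 0.800264      # rounded slope from the Coefficients table
INTERCEPT = -79.599   # rounded intercept from the Coefficients table


def residual_row(height_cm, weight_kg):
    """Return (predicted weight, residual) for one observation."""
    predicted = SLOPE * height_cm + INTERCEPT
    residual = weight_kg - predicted  # observed minus predicted
    return predicted, residual


predicted, residual = residual_row(167.08, 51.24)
print(f"predicted = {predicted:.2f} kg, residual = {residual:.2f} kg")
```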

Residual Plot

If you also selected the Residual Plots option in the Regression set-up window, you will
also get a graph returned.

This is a scatter plot of the residuals on the Y axis and the values of the independent
variable on the X axis.

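Residual plots are useful to look at when investigating homogeneity of variance, which is
an assumption of the linear regression test. Ideally, the residuals are scattered randomly
around zero with a roughly constant spread across the range of X.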

Normal Probability plot

If you selected the Normal Probability Plots option in the regression set-up window, you
will also see a table called Probability Output and a scatter plot of that table, called
the Normal Probability Plot.

The X axis plots the percentile value ranging from 0 to 100 and the Y axis plots the Y
variable data.

The normal probability plot is used to judge whether the data fit a normal distribution.

Essentially, what you are looking for is a straight line of data. In my example the points
fall close to a straight line, which suggests the weight data are approximately normally
distributed.

Standardized Residuals

If you selected the Standardized Residuals option in the regression options, you will also
see a column called Standard Residuals in the residuals table.

A standardized residual is the residual divided by an estimate of its standard deviation.
You can think of these values as Z scores.

These values are useful to look at when trying to identify potential outliers in your
sample.

Generally, any standardized residual greater than 3 or less than -3 is a sign that the
observation may be an outlier.
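
As a minimal sketch of the standardized residual idea, the Python below divides each residual by the sample standard deviation of all the residuals and flags values beyond 3 in either direction. Using the sample standard deviation as the "estimate" is an assumption, and Excel's exact scaling may differ slightly. The function names and the residual values are hypothetical.

```python
import statistics


def standardized_residuals(residuals):
    """Divide each residual by the sample standard deviation of the residuals."""
    sd = statistics.stdev(residuals)  # assumed estimate; uses n - 1
    return [r / sd for r in residuals]


def flag_outliers(residuals, threshold=3.0):
    """Return the positions whose standardized residual exceeds the threshold."""
    z = standardized_residuals(residuals)
    return [i for i, zi in enumerate(z) if abs(zi) > threshold]


# Hypothetical residuals (kg): twenty small ones and one very large one
residuals = [-0.6] * 20 + [12.0]

print(flag_outliers(residuals))  # position of the suspicious observation
```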

Line Fits Plot

If you selected the Line Fit Plots option, you will also see a scatter plot containing the
data that was entered into the regression test.

In my example, I have the height measures on the X axis and the weight measures on the
Y axis.

There is also a second data series, shown in orange in the chart, which contains the
predicted Y values based on the model. These are the Predicted values from the residuals
table.

If you would rather show the line of best fit (which passes through the predicted values)
instead of the individual Predicted points, you could remove the predicted values from the
graph and add a linear trendline to the observed data.
