Correlation and Regression Analysis
Introduction
A univariate distribution is one in which each unit takes a value on only one
variable. In a bivariate distribution each unit takes values on two variables, and
distributions in which each unit takes values on more than two variables are known as
multivariate distributions.
In bivariate distributions we may be interested to know:
1. Whether there is any relationship between the variables under study.
2. The effect of one variable on the other.
3. Whether the variables move together.
The variables are said to be correlated if a change in one variable is accompanied by a
corresponding change in the other variable.
The study of correlation deals with the degree (strength) of the mutual statistical relationship
between two or more variables, i.e., correlation studies the correspondence of movement
(going together) between two variables or series of paired items.
In correlation we do not deal with one series but rather with the association or
relationship between two series, and we do not measure variation within one series but
rather compare variation in two or more series.
If two variables move together in the same direction, the correlation between them
is said to be Positive. If two variables move in opposite directions, the
correlation between them is said to be Negative. If they do not move together
at all, there is No Correlation between them.
Example :
1. Since price and demand move in opposite directions, the correlation
between them is negative.
2. Smoking habit and cases of lung cancer move in the same direction, so the
correlation between them is positive.
Linear and Non-Linear Correlation
Example :
1. The law of demand says that, other factors remaining constant, an increase in the price of a
commodity is followed by a decrease in its demand, but we cannot find any
proportionality relationship between them (non-linear correlation).
2. A proportionate change can be observed between consumption of
coffee and number of employees (linear correlation).
Example :
1. Linear Correlation (every y is exactly twice x):
x    1    2    3    4    5
y    2    4    6    8   10
When only two variables are involved and the relationship is studied between
those two variables, the correlation is known as Simple Correlation. When
more than two variables are involved but the relationship is studied
between two variables only, keeping the other variables constant, the
correlation is known as Partial Correlation. But if more than two variables
are involved and the relationship is studied between all of them, the
correlation is known as Multiple Correlation.
Some Important Points
1. Scatter Diagrams.
2. Karl Pearson’s coefficient for measuring linear correlation.
3. Method of Rank Differences (Spearman’s Rank Correlation Coefficient).
Scatter Diagram :
A scatter diagram or dot diagram is a graphical representation of pairs of numerical values of
the two variables. Each pair of values is represented by a dot on the graph. The scatter
of the points and the direction of the scatter diagram reveal the nature and degree of
correlation between the two variables.
If all the points lie on a straight line having positive slope (i.e. a rising line) the correlation
is said to be perfect positive. In this case the coefficient of correlation is r = +1.
If all the points lie on a line having negative slope the correlation is known as perfect
negative. In this case the coefficient of correlation is r = -1.
In general, if low values of one variable go with low values of the other variable and high
values of one variable go with high values of the other variable, the path traced by
these points runs roughly from the lower left to the upper right corner, and the relationship is
Direct and Positive.
If low values of one variable go with high values of the other variable, while high
values of one variable go with low values of the other variable, the path traced by
these points runs roughly from the upper left corner to the lower right corner, and the relationship is
inverse and called negative.
[Figures: scatter diagrams illustrating Positive Correlation and Negative Correlation]
Merits and Limitations of the Scatter-Diagram Method :
1. It is a non-mathematical and easy way to find the correlation between two variables.
2. By drawing a line of best fit by the free hand method through the plotted dots, the method
can be used for estimating the missing value of the dependent variable for a given value
of the independent variable.
3. The shape of the scatter diagram reveals whether the correlation is linear or non-linear,
which enables us to know the pattern of relationship existing between the two variables.
The scatter diagram also gives us an idea whether the correlation is positive or negative.
4. The values of extreme observations do not affect the method.
Demerits :
It gives only a rough idea of how the two variables are related. The method gives an idea about
the direction of correlation and also whether it is high or low. But this method does not
give any quantitative measure of the degree or the extent of correlation.
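Karl Pearson Coefficient of Correlation
\[
r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{n\,\sigma_x\,\sigma_y}
  = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \,\sum (Y-\bar{Y})^2}}
  = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}
\]
where \(x = X - \bar{X}\) and \(y = Y - \bar{Y}\).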
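Example:
From the following table calculate the Karl Pearson’s coefficient of correlation:
x    6    2   10    4    8
y    9   11    ?    8    7
Arithmetic mean of y is 8.
Solution:
\[
\bar{y} = \frac{\sum y}{n} \;\Rightarrow\; 8 = \frac{35 + ?}{5} \;\Rightarrow\; ? = 5
\]
\[
\bar{x} = \frac{\sum x}{n} = \frac{30}{5} = 6
\]
  X     Y    x = X - 6   y = Y - 8    x²    y²    xy
  6     9        0           1         0     1     0
  2    11       -4           3        16     9   -12
 10     5        4          -3        16     9   -12
  4     8       -2           0         4     0     0
  8     7        2          -1         4     1    -2
                                Σx² = 40   Σy² = 20   Σxy = -26
\[
r = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}} = \frac{-26}{\sqrt{40 \times 20}} = -0.92
\]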
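The calculation above can be checked with a short script. This is only a minimal sketch in plain Python: the function name `pearson_r` is illustrative and not part of the original material, and the missing y value is taken as 5, as implied by the stated mean of 8.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Karl Pearson's r using deviations from the actual means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]          # x = X - mean(X)
    dy = [y - mean_y for y in ys]          # y = Y - mean(Y)
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / sqrt(sum_x2 * sum_y2)

x = [6, 2, 10, 4, 8]
y = [9, 11, 5, 8, 7]   # 5 is the missing value implied by mean(y) = 8
r = pearson_r(x, y)
print(round(r, 2))     # -0.92
```

The negative sign comes from Σxy = -26: large x values pair with small y values in this table.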
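Direct Method :
When the mean values of the two series in bivariate data are fractional and the number of
observations and their values in the two series are not very large, the following simplified form of
the formula may be used for calculating the value of ‘r’.
\[
r = \frac{\dfrac{\sum XY}{N} - \dfrac{\sum X}{N}\cdot\dfrac{\sum Y}{N}}
         {\sqrt{\dfrac{\sum X^2}{N} - \left(\dfrac{\sum X}{N}\right)^2}\;
          \sqrt{\dfrac{\sum Y^2}{N} - \left(\dfrac{\sum Y}{N}\right)^2}}
  = \frac{N\sum XY - \sum X \sum Y}
         {\sqrt{N\sum X^2 - \left(\sum X\right)^2}\;\sqrt{N\sum Y^2 - \left(\sum Y\right)^2}}
\]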
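As an illustration of the direct method, the sketch below applies the second form of the formula to the husband and wife ages that appear later in these notes, whose means (31.1 and 25.7) are fractional. The function name `pearson_r_direct` is an assumption made for this example only.

```python
from math import sqrt

def pearson_r_direct(xs, ys):
    """Pearson's r from raw sums: no means or deviations are needed."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    return numerator / denominator

age_husband = [23, 27, 28, 29, 30, 31, 33, 35, 36, 39]
age_wife = [18, 22, 23, 24, 25, 26, 28, 29, 30, 32]
r_direct = pearson_r_direct(age_husband, age_wife)
print(round(r_direct, 3))
```

Because only raw sums are used, no rounding of fractional means enters the calculation.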
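Short-Cut Method :
When the mean values are fractional, the number of paired observations is large, and the
observations have large values, the computation of ‘r’ can be simplified by using the deviations
of the observations from some suitably chosen constant or constants. The constants for
deviations of X and Y can be either the same or different. The formula for computing the correlation
coefficient based on deviations is as under :-
\[
r = \frac{N\sum d_x d_y - \sum d_x \sum d_y}
         {\sqrt{N\sum d_x^2 - \left(\sum d_x\right)^2}\;\sqrt{N\sum d_y^2 - \left(\sum d_y\right)^2}}
  = \frac{\dfrac{\sum d_x d_y}{N} - \dfrac{\sum d_x}{N}\cdot\dfrac{\sum d_y}{N}}
         {\sqrt{\dfrac{\sum d_x^2}{N} - \left(\dfrac{\sum d_x}{N}\right)^2}\;
          \sqrt{\dfrac{\sum d_y^2}{N} - \left(\dfrac{\sum d_y}{N}\right)^2}}
\]
where \(d_x = X - A\) and \(d_y = Y - B\) are the deviations from the chosen constants A and B.
The value of r does not depend on the choice of A and B, because
\[
\frac{\sum d_x d_y}{N} - \frac{\sum d_x}{N}\cdot\frac{\sum d_y}{N}
 = \frac{\sum (X-A)(Y-B)}{N} - \frac{\sum X - NA}{N}\cdot\frac{\sum Y - NB}{N}
 = \frac{\sum (X-A)(Y-B)}{N} - (\bar{X}-A)(\bar{Y}-B)
 = \frac{\sum XY}{N} - \bar{X}\,\bar{Y}
 = \frac{\sum xy}{N}
\]
where \(x = X - \bar{X}\) and \(y = Y - \bar{Y}\).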
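The sketch below illustrates the short-cut method on the ten-firm sales and expenses data given later in these notes. The function name and the particular constants (60 and 15, then 55 and 10) are arbitrary choices for this example; the point is that any pair of constants gives the same r.

```python
from math import sqrt

def pearson_r_shortcut(xs, ys, a, b):
    """Pearson's r from deviations about arbitrary constants a and b."""
    n = len(xs)
    dx = [x - a for x in xs]
    dy = [y - b for y in ys]
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(p * q for p, q in zip(dx, dy))
    s_dx2 = sum(p * p for p in dx)
    s_dy2 = sum(q * q for q in dy)
    numerator = n * s_dxdy - s_dx * s_dy
    denominator = sqrt(n * s_dx2 - s_dx ** 2) * sqrt(n * s_dy2 - s_dy ** 2)
    return numerator / denominator

sales = [50, 50, 55, 60, 65, 65, 65, 60, 60, 50]
expenses = [11, 13, 14, 16, 16, 15, 15, 14, 13, 13]

r_a = pearson_r_shortcut(sales, expenses, a=60, b=15)
r_b = pearson_r_shortcut(sales, expenses, a=55, b=10)
print(round(r_a, 4), round(r_b, 4))   # both values agree
```

Choosing constants close to the middle of each series keeps the deviations small, which is the practical reason for the method when working by hand.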
Assumptions of Karl Pearson’s Coefficient
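Example 1 :
Calculate the correlation coefficient between the height of father and height of son from the
given data :
Height of     Height of    x = X - 67   y = Y - 68    x²    y²    xy
father (X)    son (Y)
    64           66           -3           -2          9     4     6
    65           67           -2           -1          4     1     2
    66           65           -1           -3          1     9     3
    67           68            0            0          0     0     0
    68           70            1            2          1     4     2
    69           68            2            0          4     0     0
    70           72            3            4          9    16    12
                                              Σx² = 28   Σy² = 34   Σxy = 25
\[
\bar{X} = \frac{\sum X}{N} = \frac{469}{7} = 67, \qquad \bar{Y} = \frac{\sum Y}{N} = \frac{476}{7} = 68
\]
Since the actual Means of X and Y are whole numbers, we can use actual mean method of
computing ‘r’.
\[
r = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sqrt{\sum (X-\bar{X})^2 \,\sum (Y-\bar{Y})^2}}
  = \frac{\sum xy}{\sqrt{\sum x^2 \cdot \sum y^2}}
  = \frac{25}{\sqrt{28 \times 34}} = 0.81
\]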
Case :
Table 3 shows the sales revenue and advertisement expenses of a company for the past 10
months. Find the coefficient of correlation between the sales and advertisement.
Month              Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct
Ad (000 INR)        10   11   12   13   11   10    9   10   11   14
Sales (000 INR)    110  120  115  128  137  145  150  130  120  115
r = -0.52
Case :
A computer, while calculating the correlation coefficient between two variables X and Y from
25 pairs of observations, obtained the following results:
It was, however, discovered at the time of checking that two pairs of observations had been
read wrongly because of a computer bug. They were taken as (6, 14) and (8, 6) while the correct
values were (8, 12) and (6, 8). Prove that the correct value of the correlation coefficient
should be 2/3.
Solution:
Calculate the coefficient of correlation from the following data:
Age of husband 23 27 28 29 30 31 33 35 36 39
Age of wife 18 22 23 24 25 26 28 29 30 32
r = 0.9956
Find Karl Pearson’s coefficient of correlation between sales and expenses of the following
ten firms:
Firm 1 2 3 4 5 6 7 8 9 10
Sales (000 units) 50 50 55 60 65 65 65 60 60 50
Expenses (000 INR) 11 13 14 16 16 15 15 14 13 13
r = 0.7866
Working table for the husband and wife ages (X = age of husband, Y = age of wife):
  X     Y    x = X - 31.1   y = Y - 25.7     x²       y²       xy
 23    18       -8.1           -7.7         65.61    59.29    62.37
 27    22       -4.1           -3.7         16.81    13.69    15.17
 28    23       -3.1           -2.7          9.61     7.29     8.37
 29    24       -2.1           -1.7          4.41     2.89     3.57
 30    25       -1.1           -0.7          1.21     0.49     0.77
 31    26       -0.1            0.3          0.01     0.09    -0.03
 33    28        1.9            2.3          3.61     5.29     4.37
 35    29        3.9            3.3         15.21    10.89    12.87
 36    30        4.9            4.3         24.01    18.49    21.07
 39    32        7.9            6.3         62.41    39.69    49.77
ΣX = 311   ΣY = 257   Σx² = 202.9   Σy² = 158.1   Σxy = 178.3
Calculation of Coefficient of Correlation for Grouped Data
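Case:
                               Marks in Finance
Marks in Statistics    10    20    30    40    50    Total
         5              2     4     1     4     1      12
        10              8     2     5     1     ×      16
        15              ×     3     2     1     ×       6
        20              ×     1     3     2     4      10
        25              ×     ×     4     2     ×       6
      Total            10    10    15    10     5      50
Taking dx = (X - 30)/10 for marks in Finance and dy = (y - 15)/5 for marks in Statistics
(each cell shows the frequency f with the product f·dx·dy in brackets):
             X      10       20       30       40       50
            dx      -2       -1        0       +1       +2        f    fdy   fdy²   fdxdy
  y    dy
  5    -2         2(+8)    4(+8)    1(0)     4(-8)    1(-4)      12    -24    48      +4
 10    -1         8(+16)   2(+2)    5(0)     1(-1)      ×        16    -16    16     +17
 15     0           ×      3(0)     2(0)     1(0)       ×         6      0     0       0
 20    +1           ×      1(-1)    3(0)     2(+2)    4(+8)      10    +10    10      +9
 25    +2           ×        ×      4(0)     2(+4)      ×         6    +12    24      +4
Total f            10       10       15       10        5        50    -18    98      34
fdx               -20      -10        0      +10      +10       -10
fdx²               40       10        0       10       20        80
fdxdy             +24       +9        0       -3       +4        34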
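The column and row totals of the grouped table can be reproduced, and r computed from them, with the short sketch below. It assumes the step deviations dx = (X - 30)/10 and dy = (y - 15)/5 implied by the table, and it applies the short-cut formula with frequencies as weights; the variable names are illustrative only.

```python
from math import sqrt

x_vals = [10, 20, 30, 40, 50]      # marks in Finance (columns)
y_vals = [5, 10, 15, 20, 25]       # marks in Statistics (rows)
freq = [                           # freq[i][j]: row y_vals[i], column x_vals[j]
    [2, 4, 1, 4, 1],
    [8, 2, 5, 1, 0],
    [0, 3, 2, 1, 0],
    [0, 1, 3, 2, 4],
    [0, 0, 4, 2, 0],
]

dx = [(x - 30) // 10 for x in x_vals]   # -2, -1, 0, 1, 2
dy = [(y - 15) // 5 for y in y_vals]    # -2, -1, 0, 1, 2

cells = [(freq[i][j], dx[j], dy[i]) for i in range(5) for j in range(5)]
n = sum(f for f, _, _ in cells)
s_fdx = sum(f * a for f, a, _ in cells)
s_fdy = sum(f * b for f, _, b in cells)
s_fdx2 = sum(f * a * a for f, a, _ in cells)
s_fdy2 = sum(f * b * b for f, _, b in cells)
s_fdxdy = sum(f * a * b for f, a, b in cells)

r_grouped = (n * s_fdxdy - s_fdx * s_fdy) / sqrt(
    (n * s_fdx2 - s_fdx ** 2) * (n * s_fdy2 - s_fdy ** 2)
)
print(n, s_fdx, s_fdy, s_fdx2, s_fdy2, s_fdxdy)
print(round(r_grouped, 2))
```

The class intervals (10 and 5) cancel out of r, which is why step deviations can be used directly.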
Probable Error
r ± P.E.(r) gives a range within which we can reasonably expect the value of the
correlation to vary. It means that if another sample is drawn from the same universe,
the coefficient of correlation for the new sample would, in all probability, not fall
outside these limits.
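The notes do not state the formula for the probable error. The sketch below assumes the commonly quoted textbook form P.E.(r) = 0.6745 (1 - r²)/√N and applies it, purely as an illustration, to the father and son heights of Example 1 (r = 0.81 from 7 pairs). The function name is hypothetical.

```python
from math import sqrt

def probable_error(r, n):
    """Probable error of r, assuming P.E. = 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)

r_example = 0.81      # from Example 1 (father and son heights)
n_pairs = 7
pe = probable_error(r_example, n_pairs)
lower, upper = r_example - pe, r_example + pe
print(round(pe, 3), round(lower, 3), round(upper, 3))
```

With so few pairs the interval is wide, so the limits should be read as a rough guide rather than a guarantee.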
Method of Rank Differences (Spearman’s Rank Correlation Coefficient)
\[
R = 1 - \frac{6\sum D^2}{N(N^2-1)}
\]
where ‘D’ denotes the difference between the two ranks of a pair and ‘N’ denotes the number of paired values.
The above formula is applicable when no value in either of the two series is repeated.
(Repeated values are known as tied values and are given the same rank.) When there are
ties, we assign to each of the tied observations the mean of the ranks which they jointly
occupy.
For Example:
If the third and fourth largest values of a variable are the same, we assign to each value the
rank (3 + 4)/2 = 3.5, and if the fifth, sixth and seventh largest values of a variable are
the same, we assign to each the rank (5 + 6 + 7)/3 = 6.
When some of the values are repeated and average ranks are assigned, the following formula
is used to calculate the rank correlation coefficient:
\[
R = 1 - \frac{6\left[\sum D^2 + \dfrac{m^3 - m}{12}\right]}{N(N^2-1)}
  = 1 - \frac{6\left[\sum D^2 + \dfrac{m(m^2-1)}{12}\right]}{N(N^2-1)}
\]
where m = number of times a particular value is repeated. The correction term is added once
for every repeated value. Repetition of values can occur in one series or in both series, and
in one value or in more than one value.
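Ex:
From the following data, find out the coefficient of rank correlation between price and supply.
Price      4    6    8   10   12   14   16   18
Supply    10   15   20   25   30   35   40   45
Solution :
Price and supply rise together, so every pair has the same rank in both series and ΣD² = 0.
\[
R = 1 - \frac{6\sum D^2}{N(N^2-1)} = 1 - \frac{6 \times 0}{8(8^2-1)} = 1 - 0 = 1
\]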
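Ex:
From the following data, find out the coefficient of rank correlation between x and y.
x   50   33   40   10   15   15   65   24   15   57
y   12   12   24    6   15    4   20    9    6   18
Solution :
In series x the value 15 occurs three times (m1 = 3); in series y the values 12 and 6 each
occur twice (m2 = m3 = 2). With average ranks assigned to the tied values, ΣD² = 41.
\[
R = 1 - \frac{6\left[\sum D^2 + \dfrac{m(m^2-1)}{12}\right]}{N(N^2-1)}
  = 1 - \frac{6\left[\sum D^2 + \dfrac{m_1(m_1^2-1)}{12} + \dfrac{m_2(m_2^2-1)}{12} + \dfrac{m_3(m_3^2-1)}{12}\right]}{N(N^2-1)}
\]
\[
R = 1 - \frac{6\,[41 + 2 + 0.5 + 0.5]}{10(10^2-1)} = 1 - \frac{264}{990}
\]
R = 0.7333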
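The tie-corrected calculation can be sketched as follows. The helper names are hypothetical; ranks are assigned with the largest value ranked 1, tied values share the mean of the ranks they jointly occupy, and the correction (m³ - m)/12 is added for each repeated value in either series.

```python
def average_ranks(values):
    """Rank values with the largest as 1; ties share the mean rank."""
    ordered = sorted(values, reverse=True)
    ranks = []
    for v in values:
        first = ordered.index(v) + 1        # first rank occupied by v
        count = ordered.count(v)            # how many times v occurs
        ranks.append(first + (count - 1) / 2)
    return ranks

def tie_correction(values):
    """Sum of (m^3 - m) / 12 over every distinct value (zero when m = 1)."""
    return sum((values.count(v) ** 3 - values.count(v)) / 12 for v in set(values))

def spearman_with_ties(xs, ys):
    n = len(xs)
    rx, ry = average_ranks(xs), average_ranks(ys)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    correction = tie_correction(xs) + tie_correction(ys)
    return 1 - 6 * (sum_d2 + correction) / (n * (n ** 2 - 1))

x = [50, 33, 40, 10, 15, 15, 65, 24, 15, 57]
y = [12, 12, 24, 6, 15, 4, 20, 9, 6, 18]
R = spearman_with_ties(x, y)
print(round(R, 4))   # 0.7333
```

Ranking from the smallest value instead of the largest would give the same R, provided both series are ranked in the same direction.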
S. No 1 2 3 4 5 6 7 8 9 10
X 12 18 32 18 25 24 25 40 38 22
y 16 15 28 16 24 22 28 36 34 19
R = 0.95
Marks in Statistics 30 38 28 27 28 23 30 33 28 35
Marks in Mathematics 29 27 22 29 20 29 18 21 27 22
R = - 0.3515
Twelve entries in painting competition were ranked by two judges as shown below:
Entry A B C D E F G H I J K L
Judge 1 5 2 3 4 1 6 8 7 10 9 12 11
Judge 2 4 5 2 1 6 7 10 9 11 12 3 8
R = 0.46
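When the data are already ranks, as with the two judges, the basic formula applies directly. The sketch below is a minimal illustration with an assumed function name; it presumes there are no tied ranks.

```python
def spearman_from_ranks(ranks_1, ranks_2):
    """Spearman's R = 1 - 6*sum(D^2) / (N(N^2 - 1)) for untied ranks."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

judge_1 = [5, 2, 3, 4, 1, 6, 8, 7, 10, 9, 12, 11]
judge_2 = [4, 5, 2, 1, 6, 7, 10, 9, 11, 12, 3, 8]

sum_d2 = sum((a - b) ** 2 for a, b in zip(judge_1, judge_2))
R_judges = spearman_from_ranks(judge_1, judge_2)
print(sum_d2, round(R_judges, 2))   # 154 0.46
```

Most of ΣD² comes from entry K, which the judges ranked 12th and 3rd, so a single disagreement lowers R noticeably.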
Regression Analysis
The regression equation used to predict the value of Y from a given value of X
cannot be used to predict the value of X from a given value of Y.
In the graphical method a scatter plot for the series is prepared and two
regression lines are drawn for predicting the values of the X and Y variables.
The regression line that is used to predict the value of Y on the basis of X is
known as Y on X, and the line which is used to predict the value of X for a known
value of Y is known as X on Y.
In case of perfect correlation between X and Y (+1 or -1) there is only one
regression line. In other words, the two lines are identical.
Methods of Studying Regression:
Whenever a straight line is drawn to represent changes in the dependent variable
with respect to the independent variable, the regression is known as Linear
regression. If, however, the relationship between two variables cannot be
represented through a straight line, the regression is known as Non-Linear
regression.
X     1     2     3     4     5
Y   166   184   142   180   338
Why do we need two regression lines to find the values of the two variables X and Y?
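\[
Y - \bar{Y} = b_{YX}\,(X - \bar{X}) \qquad (1)
\]
and
\[
X - \bar{X} = b_{XY}\,(Y - \bar{Y}) \qquad (2)
\]
where,
\(\bar{X}\) = Mean of series X.
\(\bar{Y}\) = Mean of series Y.
\(b_{YX} = r\,\dfrac{\sigma_Y}{\sigma_X}\) = Regression coefficient of Y on X.
\(b_{XY} = r\,\dfrac{\sigma_X}{\sigma_Y}\) = Regression coefficient of X on Y.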
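Properties of Regression Coefficients:
1. The product of the two regression coefficients is equal to the square of the correlation coefficient.
\[
b_{YX} \cdot b_{XY} = r^2
\]
2. Correlation coefficient is the G.M. of two regression coefficients.
\[
r = \pm\sqrt{b_{YX} \cdot b_{XY}}
\]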
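These two properties can be checked numerically. The sketch below uses the father and son heights from Example 1 and relies on the fact that r·σY/σX reduces to Σxy/Σx² (and r·σX/σY to Σxy/Σy²) when x and y are deviations from the means; that simplification is a standard identity, not something spelled out in these notes.

```python
from math import sqrt

father = [64, 65, 66, 67, 68, 69, 70]
son = [66, 67, 65, 68, 70, 68, 72]

n = len(father)
mean_x = sum(father) / n
mean_y = sum(son) / n
x = [v - mean_x for v in father]
y = [v - mean_y for v in son]

s_xy = sum(a * b for a, b in zip(x, y))
s_x2 = sum(a * a for a in x)
s_y2 = sum(b * b for b in y)

b_yx = s_xy / s_x2                 # regression coefficient of Y on X
b_xy = s_xy / s_y2                 # regression coefficient of X on Y
r = s_xy / sqrt(s_x2 * s_y2)

print(round(b_yx * b_xy, 4), round(r ** 2, 4))   # the two values agree
print(round(sqrt(b_yx * b_xy), 2))               # geometric mean, equals |r|
```

The square root only recovers the magnitude of r; its sign is the common sign of the two regression coefficients.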
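Here,
\[
\bar{X} = 173, \quad \bar{Y} = 47.58, \quad \sigma_x = 66, \quad \sigma_y = 22, \quad r = 0.57
\]
\[
b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = 0.57 \times \frac{22}{66} = 0.19
\]
\[
Y - \bar{Y} = b_{YX}\,(X - \bar{X}) \;\Rightarrow\; Y - 47.58 = 0.19\,(X - 173) \;\Rightarrow\; Y = 14.71 + 0.19X
\]
For X = 200, Y = 14.71 + 0.19 × 200 = 52.71.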
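The same prediction can be scripted from the summary statistics alone. This is a minimal sketch; the function name `predict_y` is illustrative and not taken from the notes.

```python
mean_x, mean_y = 173, 47.58
sd_x, sd_y = 66, 22
r = 0.57

b_yx = r * sd_y / sd_x                 # regression coefficient of Y on X
intercept = mean_y - b_yx * mean_x     # constant term of Y = a + b*X

def predict_y(x):
    """Estimate Y from X using the regression line of Y on X."""
    return mean_y + b_yx * (x - mean_x)

y_at_200 = predict_y(200)
print(round(b_yx, 2), round(intercept, 2), round(y_at_200, 2))
```

This line should only be used to estimate Y from X; estimating X from a given Y needs the X on Y line with its own coefficient r·σx/σy.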
2. If the rainfall is timely and good the crop will be good, but if the crop is good
there is no guarantee that the rainfall was timely and good.
Difference between Correlation and Regression:
Correlation: The coefficient of correlation is independent of the change of origin and scale.
Regression: The coefficient of regression is independent of the change of origin only.
“The device used for estimating the value of one variable from the value of other
consist of a line through the points drawn in such a manner as to represent the
average relationship between the two variables. Such a line is called the line of
regression”.
J R Stockton
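As per the method of least squares, the two regression lines are:
\[
Y - \bar{Y} = b_{YX}\,(X - \bar{X}) \qquad (1)
\]
and
\[
X - \bar{X} = b_{XY}\,(Y - \bar{Y}) \qquad (2)
\]
where,
\(\bar{X}\) = Mean of series X.
\(\bar{Y}\) = Mean of series Y.
\(b_{YX} = r\,\dfrac{\sigma_Y}{\sigma_X}\) = Regression coefficient of Y on X.
\(b_{XY} = r\,\dfrac{\sigma_X}{\sigma_Y}\) = Regression coefficient of X on Y.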
Properties of Regression Lines:
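3. If the two regression lines are identical, the correlation between the variables is
perfect.
The angle θ between the two regression lines is given by
\[
\tan\theta = \frac{\sigma_X\,\sigma_Y}{\sigma_X^2 + \sigma_Y^2}\cdot\frac{1 - r^2}{r}
\]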
Meaning of Regression
•The term was first used in 1877 by Sir Francis Galton while studying the
relationship between the heights of fathers and sons.
1. The relationship of cause and effect between two or more variables can be
analyzed with the help of regression analysis.
2. The change in the value of one variable for a unit change in the value of the
other variable can be determined from the regression coefficient.
4. In the field of business, regression is very useful because with its help a
businessman can predict future production, consumption, investment,
prices, profits, sales, etc.
Types of Regression
1. Simple regression :
2. Multiple regression:
If more than two variables are studied at a time in regression analysis, it is called
multiple regression. A multiple regression analysis is one which is made among
more than two related variables at a time say X,Y and Z. The functional
relationship in such case is expressed as under;
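For simple regression, the regression coefficients can be computed from the deviations
\(d_x = X - A\) and \(d_y = Y - B\) taken from assumed means A and B:
\[
b_{xy} = r\,\frac{\sigma_x}{\sigma_y} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - \left(\sum d_y\right)^2}
\qquad
b_{yx} = r\,\frac{\sigma_y}{\sigma_x} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - \left(\sum d_x\right)^2}
\]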
Example 2
Population (in thousands), X:  18  19  20  21  22  23  24  25  26  27
   X      dx = X - 23    dx²      Y      dy = Y - 18    dy²    dxdy
  18          -5          25     14          -4          16     20
  19          -4          16     16          -2           4      8
  20          -3           9     16          -2           4      6
  21          -2           4     18           0           0      0
  22          -1           1     18=A         0           0      0
  23=A         0           0     19           1           1      0
  24           1           1     20           2           4      2
  25           2           4     20           2           4      4
  26           3           9     21           3           9      9
  27           4          16     21           3           9     12
ΣX = 225   Σdx = -5   Σdx² = 85   ΣY = 183   Σdy = 3   Σdy² = 51   Σdxdy = 61
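REGRESSION COEFFICIENT
X on Y:
\[
b_{xy} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - \left(\sum d_y\right)^2}
       = \frac{10(61) - (-5)(3)}{10(51) - (3)^2} = \frac{625}{501} = 1.247
\]
Y on X:
\[
b_{yx} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - \left(\sum d_x\right)^2}
       = \frac{10(61) - (-5)(3)}{10(85) - (-5)^2} = \frac{625}{825} = 0.758
\]
With \(\bar{X} = 225/10 = 22.5\) and \(\bar{Y} = 183/10 = 18.3\):
X on Y:
\[
X - \bar{X} = b_{xy}\,(Y - \bar{Y})
\]
\[
X - 22.5 = 1.247\,(Y - 18.3)
\]
\[
X - 22.5 = 1.247Y - 22.82
\]
\[
X = 1.247Y - 22.82 + 22.5
\]
\[
X = 1.247Y - 0.32
\]
Y on X:
\[
Y - \bar{Y} = b_{yx}\,(X - \bar{X})
\]
\[
Y - 18.3 = 0.758\,(X - 22.5)
\]
\[
Y - 18.3 = 0.758X - 17.055
\]
\[
Y = 0.758X - 17.055 + 18.3
\]
\[
Y = 0.758X + 1.245
\]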
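The two regression coefficients of Example 2 can be recomputed with the sketch below. It assumes the first pair is (18, 14), which is what the stated totals ΣY = 183, Σdy = 3 and Σdxdy = 61 imply, and it uses the assumed means A = 23 and B = 18 from the working table. Function and variable names are illustrative.

```python
X = list(range(18, 28))                          # population, 18 to 27
Y = [14, 16, 16, 18, 18, 19, 20, 20, 21, 21]     # first value implied by the totals
A, B = 23, 18                                    # assumed means

n = len(X)
dx = [x - A for x in X]
dy = [y - B for y in Y]
s_dx, s_dy = sum(dx), sum(dy)
s_dx2 = sum(d * d for d in dx)
s_dy2 = sum(d * d for d in dy)
s_dxdy = sum(p * q for p, q in zip(dx, dy))

numerator = n * s_dxdy - s_dx * s_dy
b_xy = numerator / (n * s_dy2 - s_dy ** 2)       # X on Y
b_yx = numerator / (n * s_dx2 - s_dx ** 2)       # Y on X

mean_x, mean_y = sum(X) / n, sum(Y) / n

def x_on_y(y):
    """Estimate X from a given Y."""
    return mean_x + b_xy * (y - mean_y)

def y_on_x(x):
    """Estimate Y from a given X."""
    return mean_y + b_yx * (x - mean_x)

print(round(b_xy, 3), round(b_yx, 3))
```

Both lines pass through the point of means (22.5, 18.3), which gives a quick check on any hand calculation.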
REGRESSION LINES
• Regression lines are the lines of best fit
expressing mutual average relationship
between two series. These lines give the best
estimate of one variable for any given value of
other variable.
Sale of goods “Y” (rows) against sale of goods “X” (columns), in UNITS:
   Y \ X     5-15   15-25   25-35   35-45   Total
   0-10        1       1       -       -       2
  10-20        3       6       5       1      15
  20-30        1       8       9       2      20
  30-40        -       3       9       3      15
  40-50        -       -       4       4       8
  Total        5      18      27      10      60
Taking dx = (M.P. - 20)/10 and dy = (M.P. - 25)/10, where M.P. is the mid-point of the class
(each cell shows the frequency f with the product f·dx·dy in brackets):
                X      5-15    15-25    25-35    35-45
               M.P.     10       20       30       40
                dx      -1        0       +1       +2        f     fdy    fdy²   fdxdy
   Y     M.P.   dy
  0-10     5    -2     1(2)     1(0)      -        -         2     -4      8       2
 10-20    15    -1     3(3)     6(0)    5(-5)    1(-2)      15    -15     15      -4
 20-30    25     0     1(0)     8(0)    9(0)     2(0)       20      0      0       0
 30-40    35    +1      -       3(0)    9(9)     3(6)       15     15     15      15
 40-50    45    +2      -        -      4(8)     4(16)       8     16     32      24
   f                    5       18       27       10      N = 60  Σfdy = 12  Σfdy² = 70  Σfdxdy = 37
 fdx                   -5        0       27       20      Σfdx = 42
 fdx²                   5        0       27       40      Σfdx² = 72
 fdxdy                  5        0       12       20      Σfdxdy = 37
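Regression Equation
• Y on X
\[
Y - \bar{Y} = r\,\frac{\sigma_y}{\sigma_x}\,(X - \bar{X})
\]
\[
r\,\frac{\sigma_y}{\sigma_x}
 = \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{N\sum f d_x^2 - \left(\sum f d_x\right)^2} \times \frac{i_y}{i_x}
 = \frac{60(37) - (42)(12)}{60(72) - (42)^2} \times \frac{10}{10}
 = \frac{2220 - 504}{4320 - 1764} = \frac{1716}{2556} = 0.67
\]
\[
\bar{Y} = A + \frac{\sum f d_y}{N} \times i = 25 + \frac{12}{60} \times 10 = 27
\]
\[
\bar{X} = A + \frac{\sum f d_x}{N} \times i = 20 + \frac{42}{60} \times 10 = 27
\]
\[
Y - 27 = 0.67\,(X - 27) = 0.67X - 18.09
\]
\[
Y = 27 + 0.67X - 18.09
\]
\[
Y = 8.91 + 0.67X
\]
• X on Y
\[
X - \bar{X} = r\,\frac{\sigma_x}{\sigma_y}\,(Y - \bar{Y})
\]
\[
r\,\frac{\sigma_x}{\sigma_y}
 = \frac{N\sum f d_x d_y - \sum f d_x \sum f d_y}{N\sum f d_y^2 - \left(\sum f d_y\right)^2} \times \frac{i_x}{i_y}
 = \frac{60(37) - (42)(12)}{60(70) - (12)^2} \times \frac{10}{10}
 = \frac{2220 - 504}{4200 - 144} = \frac{1716}{4056} = 0.423
\]
\[
X - 27 = 0.423\,(Y - 27) = 0.423Y - 11.42
\]
\[
X = 15.58 + 0.423Y
\]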
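The grouped-data regression can be reproduced with the sketch below. It assumes class mid-points 10 to 40 for X and 5 to 45 for Y, assumed means 20 and 25, and a class interval of 10 for both series, as used in the working table; the function name is hypothetical. Because the script keeps the unrounded coefficients, its intercepts differ slightly from the hand-computed 8.91 and 15.58, which use the rounded 0.67 and 0.423.

```python
def grouped_regression(freq, x_mid, y_mid, a_x, a_y, i_x, i_y):
    """Regression coefficients and means from a bivariate frequency table."""
    dx = [(m - a_x) // i_x for m in x_mid]       # step deviations of X
    dy = [(m - a_y) // i_y for m in y_mid]       # step deviations of Y
    cells = [(freq[i][j], dx[j], dy[i])
             for i in range(len(y_mid)) for j in range(len(x_mid))]
    n = sum(f for f, _, _ in cells)
    s_fdx = sum(f * a for f, a, _ in cells)
    s_fdy = sum(f * b for f, _, b in cells)
    s_fdx2 = sum(f * a * a for f, a, _ in cells)
    s_fdy2 = sum(f * b * b for f, _, b in cells)
    s_fdxdy = sum(f * a * b for f, a, b in cells)
    numerator = n * s_fdxdy - s_fdx * s_fdy
    b_yx = numerator / (n * s_fdx2 - s_fdx ** 2) * i_y / i_x
    b_xy = numerator / (n * s_fdy2 - s_fdy ** 2) * i_x / i_y
    mean_x = a_x + s_fdx / n * i_x
    mean_y = a_y + s_fdy / n * i_y
    return numerator, b_yx, b_xy, mean_x, mean_y

freq = [            # rows: Y classes 0-10 ... 40-50; columns: X classes 5-15 ... 35-45
    [1, 1, 0, 0],
    [3, 6, 5, 1],
    [1, 8, 9, 2],
    [0, 3, 9, 3],
    [0, 0, 4, 4],
]
numerator, b_yx, b_xy, mean_x, mean_y = grouped_regression(
    freq, x_mid=[10, 20, 30, 40], y_mid=[5, 15, 25, 35, 45],
    a_x=20, a_y=25, i_x=10, i_y=10,
)
print(round(b_yx, 3), round(b_xy, 3), round(mean_x, 1), round(mean_y, 1))
```

The ratio of class intervals matters only when the two series use different widths; here both are 10, so it equals 1.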