Correlation and Regression Analysis

This document discusses correlation and regression analysis. It begins by defining univariate, bivariate, and multivariate distributions. It then defines correlation as measuring the relationship between two variables, noting that correlated variables change in the same or opposite directions together. Correlation can be positive, negative, linear, or non-linear. Methods for measuring correlation discussed include scatter diagrams and Karl Pearson's coefficient of correlation. The coefficient provides a single numerical measure of the linear relationship between two variables. Key assumptions and calculation methods are also outlined.


Correlation and Regression Analysis
Introduction

A univariate distribution is one in which each unit takes a value of only one variable. In a bivariate distribution each unit takes values of two variables, and a distribution in which units take values of more than two variables is known as a multivariate distribution.
In bivariate distributions we may be interested to know:
1. Whether there is any relationship between the variables under study.
2. The effect of one variable on the other.
3. Their movement together.

Correlation is a statistical tool which studies the relationship between variables; correlation analysis involves the various methods and techniques used for studying and measuring the extent of the relationship between two variables.

Two variables are said to be correlated if a change in one variable results in a corresponding change in the other variable.
The study of correlation deals with the degree (strength) of the mutual statistical relationship between two or more variables, i.e., correlation studies the correspondence of movement ('going togetherness') between two variables or series of paired items.
For example:
1. If the price increases, the demand decreases.
2. Cases of lung cancer may increase if the smoking habit increases.
3. Sales of woollen garments increase as the temperature decreases.
In the above examples the two variables move together, in the same or in opposite directions. But there are cases where two variables move independently and there is no tendency of 'going togetherness' between them.

In correlation we do not deal with one series alone but with the association or relationship between two series, and we do not measure variation in one series but compare the variation in two or more series.

The two series may vary together in the same direction; or
they may vary together in opposite directions; or
they may not vary together at all.
Definitions

Correlation has been defined in different ways:
1. Correlation measures the closeness of relationship between two variables, more exactly the closeness of the linear relationship.
2. In the words of Bodington: "Whenever some definite connection exists between two or more groups, classes or series of data, there is said to be a correlation."
Importance and Utility of Correlation

1. The coefficient of correlation measures the extent of the relationship between two variables in a single figure.
2. The existence of a relationship between two or more variables enables us to predict what will happen in the future, e.g., if the production of wheat increases and all other factors remain constant, there may be a fall in the price of wheat.
3. If two variables are closely related, we can estimate the value of one variable given the value of the other.
4. Correlation facilitates decision-making in business organizations. Expectations about the behaviour of certain variables are also based on correlation analysis.
(Figures: scatter plots illustrating quadratic and linear correlation.)
Positive and Negative Correlation (Covariance)

If two variables move together in the same direction, the correlation between them is said to be positive. If they move in opposite directions, the correlation between them is said to be negative. If they do not move together at all, there is no correlation between them.

Examples:
1. Since price and demand move in opposite directions, the correlation between them is negative.
2. Smoking habit and cases of lung cancer move in the same direction, so the correlation between them is positive.
Linear and Non-Linear Correlation

If there is a proportionate change in the values of the two variables, the correlation is known as linear. If the change in the values of the two variables is not proportionate, the correlation is known as non-linear.

Examples:
1. The law of demand says that, other factors remaining constant, an increase in the price of a commodity is followed by a decrease in its demand, but we cannot find any proportional relationship between them.
2. A proportionate change can be observed between consumption of coffee and number of employees.
Example:
1. x: 1  2  3  4  5     (Linear correlation)
   y: 2  4  6  8  10

2. x: 1  2  3  4        (Non-linear correlation)
   y: 3  5  8  15

Correlation Based on Number of Variables

When only two variables are involved and the relationship is studied between those two variables, the correlation is known as simple correlation. When more than two variables are involved but the relationship is studied between two variables only, keeping the other variables constant, the correlation is known as partial correlation. If more than two variables are involved and the relationship is studied among all of them, the correlation is known as multiple correlation.
Some Important Points

1. There should be a sufficient number of items in the series.
2. In correlation analysis we do not deal with one series only but with the association or relationship between two or more series.
3. We do not measure the variation in one series only; rather we compare the variation in two or more series.
4. Here we study only linear correlation.
5. Correlation does not necessarily mean a cause-and-effect relationship.
6. The sign of 'r' indicates the type of linear relationship, whether positive or negative.
Measure of Correlation

1. Scatter Diagrams.
2. Karl Pearson’s coefficient for measuring linear correlation.
3. Method of Rank Differences (Spearman’s Rank Correlation Coefficient).
Scatter Diagram :
A scatter diagram (dot diagram) is a graphical representation of the pairs of numerical values of the two variables. Each pair of values is represented by a dot on the graph. The scatter of the points and the direction of the scatter reveal the nature and degree of correlation between the two variables.
If all the points lie on a straight line with positive slope (a rising line), the correlation is said to be perfect positive; in this case the coefficient of correlation is r = +1.
If all the points lie on a line with negative slope, the correlation is known as perfect negative; in this case r = -1.
In general, if low values of one variable go with low values of the other and high values go with high values, the path traced by the points runs roughly from the lower left to the upper right corner, and the relationship is direct and positive.
If low values of one variable go with high values of the other, and high values go with low values, the path traced by the points runs roughly from the upper left to the lower right corner, and the relationship is inverse and called negative.
(Figures: scatter plots showing positive and negative correlation.)
Merits and Limitations of the Scatter-Diagram Method:
1. It is a non-mathematical and easy way to find the correlation between two variables.
2. By drawing a line of best fit freehand through the plotted dots, the method can be used for estimating a missing value of the dependent variable for a given value of the independent variable.
3. The shape of the scatter diagram reveals whether the correlation is linear or non-linear, which tells us the pattern of relationship existing between the two variables. A scatter diagram also gives us an idea whether the correlation is positive or negative.
4. The method is not affected by extreme observations.

Demerits:
It gives only a rough idea of how the two variables are related. The method indicates the direction of the correlation and whether it is high or low, but it does not give any quantitative measure of the degree or extent of the correlation.
Karl Pearson Coefficient of Correlation

A mathematical method of measuring the intensity or magnitude of the linear relationship between two variable series was suggested by Karl Pearson (1857 - 1936), a great British biometrician and statistician.

Karl Pearson's measure, known as Pearson's correlation coefficient between two variables (series) X and Y, usually denoted by 'r(X, Y)', 'rxy' or simply 'r', is a numerical measure of the linear relationship between them.
Assumptions of Karl Pearson’s Method

1. The variables X and Y under study are linearly related.
2. Each variable is affected by a large number of independent contributory causes of such a nature as to produce a normal distribution.
3. The forces operating on each of the variable series are not independent of each other but are related in a causal fashion.
Calculation of Correlation Coefficient
For ungrouped data, Karl Pearson's coefficient of correlation can be obtained by using any of the following three methods:
(i) Actual Mean Method
(ii) Direct Method
(iii) Short-Cut Method

Actual Mean Method:

    r = Σ(X - X̄)(Y - Ȳ) / (n·σx·σy)

      = Σ(X - X̄)(Y - Ȳ) / √[ Σ(X - X̄)² · Σ(Y - Ȳ)² ]

      = Σxy / √( Σx² · Σy² )

where x = X - X̄ and y = Y - Ȳ.
Example:
From the following table calculate the Karl Pearson’s coefficient of correlation:

x 6 2 10 4 8
y 9 11 ? 8 7

Arithmetic mean of y is 8.

Solution:
Since the arithmetic mean of y is 8, Σy = 35 + ? = 5 × 8 = 40, so the missing value is 5.

    Ȳ = ΣY / n = 40 / 5 = 8
    X̄ = ΣX / n = 30 / 5 = 6

 X    Y    x = X - 6   y = Y - 8    x²    y²    xy
 6    9        0           1         0     1     0
 2   11       -4           3        16     9   -12
10    5        4          -3        16     9   -12
 4    8       -2           0         4     0     0
 8    7        2          -1         4     1    -2

                          Σx² = 40   Σy² = 20   Σxy = -26

    r = Σxy / √(Σx² · Σy²) = -26 / √(40 × 20) = -26 / √800 = -0.92
Direct Method:
If the mean values of the two series in a bivariate data set are fractional and the number of observations in the two series is not very large, the following simplified form of the formula may be used for calculating 'r':

    r = [ ΣXY/N - (ΣX/N)(ΣY/N) ] / √{ [ ΣX²/N - (ΣX/N)² ] · [ ΣY²/N - (ΣY/N)² ] }

      = [ N·ΣXY - ΣX·ΣY ] / √{ [ N·ΣX² - (ΣX)² ] · [ N·ΣY² - (ΣY)² ] }
Short-Cut Method:
When the mean values are fractional, the number of paired observations is large, and the observations have large values, the computation of 'r' can be simplified by using deviations of the observations from some suitably chosen constants (assumed means). The constants for the deviations of X and Y may be the same or different. With dx = X - A and dy = Y - B, the formula based on deviations is:

    r = [ N·Σdxdy - Σdx·Σdy ] / √{ [ N·Σdx² - (Σdx)² ] · [ N·Σdy² - (Σdy)² ] }

Since dx and dy differ from X and Y only by the constants A and B, the deviations have the same variability and covariation as the original observations, so the value of 'r' is unchanged.
Assumptions of Karl Pearson's Coefficient

Karl Pearson's coefficient of correlation is based on the following assumptions:

(i) Linear relationship:
A linear relationship between the two variables is assumed. In such a case, the paired observations on the two variables, plotted on a scatter diagram, cluster around a straight line.
(ii) Causal relationship:
In studying correlation, we expect a cause-and-effect relationship between the forces affecting the values in the two series.

Merits of Karl Pearson's Coefficient of Correlation

1. It is an important and popular method of measuring the relationship between two variables. It gives a precise, quantitative value indicating the degree of relationship existing between the two variables, and the value of 'r' is easily interpretable.
2. It measures the direction as well as the strength of the relationship between the two variables.

Demerits of Karl Pearson's Coefficient of Correlation

1. The value of the coefficient is affected by extreme values.
2. Its computational procedure is difficult compared to other methods.
3. It assumes a linear relationship between the two variables.

Example 1 :
Calculate the correlation coefficient between the height of father and height of son from the
given data :

Table 1 (Heights of Fathers and Sons)

Height of Father (in inches)   64  65  66  67  68  69  70
Height of Son (in inches)      66  67  65  68  70  68  72
Table 2 (Calculation for 'r')

Height of    Height of    x = X - 67   y = Y - 68    x²    y²    xy
Father (X)    Son (Y)
   64            66           -3           -2         9     4     6
   65            67           -2           -1         4     1     2
   66            65           -1           -3         1     9     3
   67            68            0            0         0     0     0
   68            70            1            2         1     4     2
   69            68            2            0         4     0     0
   70            72            3            4         9    16    12

ΣX = 469    ΣY = 476                             Σx² = 28  Σy² = 34  Σxy = 25

    X̄ = ΣX / N = 469 / 7 = 67
    Ȳ = ΣY / N = 476 / 7 = 68

Since the actual means of X and Y are whole numbers, we can use the actual mean method for computing 'r':

    r = Σ(X - X̄)(Y - Ȳ) / √[ Σ(X - X̄)² · Σ(Y - Ȳ)² ] = Σxy / √(Σx² · Σy²)

      = 25 / √(28 × 34) = 25 / √952 = 0.81
Case :
Table 3 shows the sales revenue and advertisement expenses of a company for past 10
months. Find the coefficient of correlation between the sales and advertisement.

Table 3: Sales and Advertisements for 10 months

Month Jan Feb Mar Apr May Jun Jul Aug Sep Oct
Ad (000 INR) 10 11 12 13 11 10 9 10 11 14
Sales (000 INR) 110 120 115 128 137 145 150 130 120 115

r= - 0.51
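The case above can be verified with a short Python sketch (not part of the original slides); the computed value is approximately -0.52, which agrees with the slide's r ≈ -0.51 to one decimal place:

```python
# Pearson's r for the advertisement/sales case, via the direct formula.
ad    = [10, 11, 12, 13, 11, 10, 9, 10, 11, 14]
sales = [110, 120, 115, 128, 137, 145, 150, 130, 120, 115]
N = len(ad)

num = N * sum(a * s for a, s in zip(ad, sales)) - sum(ad) * sum(sales)
den = ((N * sum(a * a for a in ad) - sum(ad) ** 2)
       * (N * sum(s * s for s in sales) - sum(sales) ** 2)) ** 0.5
r = num / den
print(round(r, 1))  # -0.5
```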
Case:

While calculating the correlation coefficient between two variables X and Y from 25 pairs of observations, a computer obtained the following results:

n = 25, ΣX = 125, ΣX² = 650, ΣY = 100, ΣY² = 460, ΣXY = 508

It was, however, discovered at the time of checking that two pairs of observations had been copied wrongly: they were taken as (6, 14) and (8, 6), while the correct values were (8, 12) and (6, 8). Prove that the correct value of the correlation coefficient is 2/3.
Exercises:
Calculate the coefficient of correlation from the following data:

Age of husband   23  27  28  29  30  31  33  35  36  39
Age of wife      18  22  23  24  25  26  28  29  30  32

r = 0.9956

Find Karl Pearson’s coefficient of correlation between sales and expenses of the following
ten firms:
Firm 1 2 3 4 5 6 7 8 9 10
Sales (000 units) 50 50 55 60 65 65 65 60 60 50
Expenses (000 INR) 11 13 14 16 16 15 15 14 13 13

r = 0.7866
 X     Y    x = X - 31.1   y = Y - 25.7      x²       y²       xy
23    18       -8.1           -7.7         65.61    59.29    62.37
27    22       -4.1           -3.7         16.81    13.69    15.17
28    23       -3.1           -2.7          9.61     7.29     8.37
29    24       -2.1           -1.7          4.41     2.89     3.57
30    25       -1.1           -0.7          1.21     0.49     0.77
31    26       -0.1            0.3          0.01     0.09    -0.03
33    28        1.9            2.3          3.61     5.29     4.37
35    29        3.9            3.3         15.21    10.89    12.87
36    30        4.9            4.3         24.01    18.49    21.07
39    32        7.9            6.3         62.41    39.69    49.77

ΣX = 311  ΣY = 257                  Σx² = 202.9  Σy² = 158.1  Σxy = 178.3
Calculation of Coefficient of Correlation for Grouped Data
Case:

Calculate the coefficient of correlation from the following data:

                     Marks in Finance
Marks in
Statistics    10   20   30   40   50   Total
     5         2    4    1    4    1     12
    10         8    2    5    1    -     16
    15         -    3    2    1    -      6
    20         -    1    3    2    4     10
    25         -    -    4    2    -      6
Total         10   10   15   10    5     50

Taking step deviations dx = (X - 30)/10 and dy = (y - 15)/5, the fdxdy value for each cell is shown in parentheses:

  y   dy | X:    10      20      30      40      50 |    f    fdy   fdy²  fdxdy
         | dx:   -2      -1       0      +1      +2 |
  5   -2       2(+8)   4(+8)   1(0)    4(-8)   1(-4)     12   -24    48     +4
 10   -1       8(+16)  2(+2)   5(0)    1(-1)     -       16   -16    16    +17
 15    0         -     3(0)    2(0)    1(0)      -        6     0     0      0
 20   +1         -     1(-1)   3(0)    2(+2)   4(+8)     10   +10    10     +9
 25   +2         -       -     4(0)    2(+4)     -        6   +12    24     +4

  f             10      10      15      10       5       50   -18    98     34
  fdx          -20     -10       0     +10     +10      -10
  fdx²          40      10       0      10      20       80
  fdxdy        +24      +9       0      -3      +4       34

    r = [ N·Σfdxdy - Σfdx·Σfdy ] / √{ [ N·Σfdx² - (Σfdx)² ] · [ N·Σfdy² - (Σfdy)² ] }

      = [ 50 × 34 - (-10)(-18) ] / √[ (50 × 80 - 100)(50 × 98 - 324) ]

      = 1520 / √(3900 × 4576) ≈ 0.36
Probable Error

r ± P.E.(r) gives a range within which we can reasonably expect the value of the correlation coefficient to vary: if another sample is drawn from the same universe, the coefficient of correlation for the new sample should not fall outside these limits.

The standard error of 'r' is given by:

    S.E.(r) = (1 - r²) / √n

The probable error of the coefficient of correlation is given by:

    P.E.(r) = 0.6745 × S.E.(r) = 0.6745 × (1 - r²) / √n

The reason for taking the factor 0.6745 is that 50% of the observations of a normal distribution lie between μ ± 0.6745σ, where μ is the mean and σ is the standard deviation.
Coefficient of Determination (r²)

The coefficient of determination is treated as a better measure, as it tells the effect of the independent variable on the dependent variable. For example, if the coefficient of correlation between advertisement and sales is r = 0.80, then r² = 0.64, which means that 64% of the variation in sales can be explained by the money spent on advertisement.
Interpretation using Coefficient of Correlation

1. Whether the correlation is positive or negative


2. Coefficient of determination
3. Whether causality is there or not
Illustration: consider the observations x = 2, 3, 4, 5, 6, 7, 8, 9, 10, with mean μ = 6 and S.D. σ = 2.5.

Mean ± σ contains about 68.3% of the observations of a normal distribution, i.e., roughly 6 of these 9 values:
    Mean ± σ = [6 - 2.5, 6 + 2.5] = [3.5, 8.5], which contains {4, 5, 6, 7, 8}.

Mean ± 2σ contains about 95% of the observations, here all 9 values:
    Mean ± 2σ = [6 - 5, 6 + 5] = [1, 11], which contains {2, 3, ..., 10}.

Mean ± 3σ contains about 99.7% of the observations, here again all 9 values:
    Mean ± 3σ = [6 - 7.5, 6 + 7.5] = [-1.5, 13.5].
Spearman’s Rank Correlation Coefficient

The rank correlation coefficient permits us to correlate two sets of qualitative observations which are subject to ranking, such as qualitative productivity ratings (poor, fair, good, very good, etc.) given to a group of workers by two independent observers. It also gives an idea whether the two observers have common or different tastes in a particular attribute or characteristic.
Ranks can be assigned either by two persons to a single characteristic (say beauty, honesty or intelligence), or by a single person to two characteristics. When ranks are assigned by two persons to a single characteristic, the correlation is found between the opinions or tastes of the two persons; a high positive correlation indicates that the two persons have the same taste in that characteristic. If two characteristics are judged by the same person, e.g., marks obtained in training and quantum of sales, then the correlation is found between the two characteristics.

(Portrait: Charles Edward Spearman, 10.09.1863 - 17.09.1945, English psychologist.)
Steps to Calculate Spearman’s Rank Correlation Coefficient

To calculate the rank correlation coefficient:

1. First rank the two series, say the X's and Y's, individually among themselves, giving rank 1 to the largest (or smallest) value, rank 2 to the second largest (second smallest), and so on, in each series separately.
2. Find the differences 'D' of the corresponding ranks of X and Y.
3. Square these differences and find the sum of the squares, ΣD².
4. Calculate the rank correlation coefficient by the formula:

    R = 1 - 6ΣD² / [ N(N² - 1) ]

where 'N' denotes the number of paired values.
The above formula is applicable when no value in either of the two series is repeated. (Repeated values are known as tied values and are given the same rank.) When there are ties, we assign to each of the tied observations the mean of the ranks which they jointly occupy.
For example:
If the third and fourth largest values of a variable are the same, we assign to each the rank (3 + 4)/2 = 3.5; and if the fifth, sixth and seventh largest values are the same, we assign to each the rank (5 + 6 + 7)/3 = 6.
When some of the values are repeated and average ranks are assigned, the following formula is used to calculate the rank correlation coefficient:

    R = 1 - 6[ ΣD² + Σ m(m² - 1)/12 ] / [ N(N² - 1) ]

where m is the number of times a particular value is repeated, and the correction term m(m² - 1)/12 is added once for each group of tied values. Repetition of values can occur in one series or in both, and in one value or in more than one value.
Ex:
From the following data, find the coefficient of rank correlation between price and supply.

Price 4 6 8 10 12 14 16 18
Supply 10 15 20 25 30 35 40 45

Solution:

Price   Rank (R1)   Supply   Rank (R2)   D = R2 - R1    D²
  4         8         10         8            0          0
  6         7         15         7            0          0
  8         6         20         6            0          0
 10         5         25         5            0          0
 12         4         30         4            0          0
 14         3         35         3            0          0
 16         2         40         2            0          0
 18         1         45         1            0          0

    R = 1 - 6ΣD² / [ N(N² - 1) ] = 1 - (6 × 0) / [ 8(8² - 1) ] = 1
Ex:
From the following data, find the coefficient of rank correlation between x and y.

x 50 33 40 10 15 15 65 24 15 57
y 12 12 24 6 15 4 20 9 6 18

Solution:

  x   Rank (R1)    y   Rank (R2)   D = R2 - R1     D²
 50       3       12      5.5         +2.5        6.25
 33       5       12      5.5         +0.5        0.25
 40       4       24      1           -3.0        9.00
 10      10        6      8.5         -1.5        2.25
 15       8       15      4           -4.0       16.00
 15       8        4     10           +2.0        4.00
 65       1       20      2           +1.0        1.00
 24       6        9      7           +1.0        1.00
 15       8        6      8.5         +0.5        0.25
 57       2       18      3           +1.0        1.00

                                     ΣD² = 41.00
Here, in the first series (the X series) the value 15 is repeated 3 times; in the Y series the values 12 and 6 are each repeated twice.

Rank correlation coefficient:

    R = 1 - 6[ ΣD² + Σ m(m² - 1)/12 ] / [ N(N² - 1) ]

      = 1 - 6[ 41 + 3(3² - 1)/12 + 2(2² - 1)/12 + 2(2² - 1)/12 ] / [ 10(10² - 1) ]

      = 1 - 6[ 41 + (24 + 6 + 6)/12 ] / 990

      = 1 - (6 × 44) / 990 = 1 - 264/990 = 0.733
X 57 16 24 65 16 16 9 40 48 33
y 19 6 9 20 4 15 6 24 13 13

R = 0.7333

S. No 1 2 3 4 5 6 7 8 9 10
X 12 18 32 18 25 24 25 40 38 22
y 16 15 28 16 24 22 28 36 34 19

R = 0.95

Marks in Statistics 30 38 28 27 28 23 30 33 28 35
Marks in Mathematics 29 27 22 29 20 29 18 21 27 22

R = - 0.3515
Twelve entries in painting competition were ranked by two judges as shown below:

Entry A B C D E F G H I J K L
Judge 1 5 2 3 4 1 6 8 7 10 9 12 11
Judge 2 4 5 2 1 6 7 10 9 11 12 3 8

What is the degree of agreement between the two judges?

R = 0.46
Regression Analysis

The term 'regression' means stepping back towards the average. It was first used by Sir Francis Galton (1822 - 1911) in connection with the inheritance of stature.
Regression Means 'Stepping Back' or 'Going Back'

The experiment (Francis Galton, latter half of the 19th century): sons of short fathers (fathers averaging about 5'2'') averaged about 5'4'', while sons of tall fathers (fathers averaging about 5'10'') averaged about 5'8''. Both groups of sons moved back towards the population average of about 5'6''.
Introduction:

Regression means stepping back towards the average. In statistics, regression analysis is applicable to all those fields where two or more related variables have the tendency to go back to the mean.

According to Blair, "Regression is the measure of the average relationship between two or more variables in terms of the original units of the data."

The chief objective of regression analysis is to know the nature of the relationship between two variables and to use it for predicting the most likely value of the dependent variable corresponding to a given, known value of the independent variable. It may be noted, however, that the regression relation is not reversible, i.e.

the regression equation used to predict the value of Y from a given value of X cannot be used to predict the value of X from a given value of Y.

So the regression relation is an average, irreversible, functional relation.


Methods of Studying Regression:

Regression can be studied either:
(i) graphically, or
(ii) algebraically.

In the graphical method a scatter plot of the series is prepared and two regression lines are drawn for predicting the values of the X and Y variables.
The regression line used to predict the value of Y on the basis of X is known as Y on X, and the line used to predict the value of X for a known value of Y is known as X on Y.
In the case of perfect correlation between X and Y (+1 or -1) there is only one regression line; in other words, the two lines are identical.
Methods of Studying Regression:
Whenever a straight line is drawn to represent changes in the dependent variable with respect to the independent variable, the regression is known as linear regression. If, however, the relationship between the two variables cannot be represented by a straight line, the regression is known as non-linear regression.

Regression lines can be drawn by one of the following methods:

1.Free hand curve method


2.Method of least squares
Methods of Studying Regression:

Free Hand Curve Method

- Plot the paired values of X and Y in a scatter diagram.
- Draw the first regression line in such a way that the positive deviations of all points, measured parallel to the Y axis, cancel the negative deviations. This line is called Y on X.
- Draw the second regression line in such a way that the positive deviations of all points, measured parallel to the X axis, cancel the negative deviations. This line is called X on Y.
- The two regression lines cut each other at a point known as the mean point of the two series.
Methods of Studying Regression:

Method of Least Squares

To avoid the difficulties of the free hand curve method, a mathematical relationship is established between the movements of the X and Y series, and algebraic equations are obtained to represent their relative movements.
For the line Y = a + bX, the two normal equations are:

    ΣY  = Na  + bΣX
    ΣXY = aΣX + bΣX²
Illustration:
Plot the regression lines associated with the following data:

X 1 2 3 4 5
Y 166 184 142 180 338
Why do we need two regression lines for the two variables X and Y?

Since the regression relation is irreversible, one equation is not sufficient to predict the values of both variables X and Y. Moreover, the two regression equations are derived under different sets of assumptions, so one equation cannot serve for both X and Y.
Methods of Studying Regression:

Method of Deviations from the Mean

As the method of least squares is tedious and involves a lot of calculation, the method of deviations from the mean is used to obtain the regression lines:

    Y - Ȳ = bYX (X - X̄)   .......(1)
    X - X̄ = bXY (Y - Ȳ)   .......(2)

where:
    X̄ = mean of series X,  Ȳ = mean of series Y,
    bYX = r·σY/σX = regression coefficient of Y on X,
    bXY = r·σX/σY = regression coefficient of X on Y.
Properties of Regression Coefficients:

1. Both regression coefficients have the same sign (the sign of r), and bYX · bXY = r².

2. The correlation coefficient is the G.M. of the two regression coefficients: r = ±√(bYX · bXY).

3. Both regression coefficients cannot be numerically greater than 1 at the same time (since their product is r² ≤ 1).

4. The regression coefficients denote rates of change.

Find two regression lines from the following data

Sales 91 97 108 121 67 124 51 73 111 57


Purchase 71 75 69 97 70 91 39 61 80 47
Example:

A survey was conducted to study the relationship between expenditure on accommodation (X) and expenditure on entertainment (Y), and the following results were obtained:

                      Mean     S.D.
Accommodation         173       66
Entertainment          47.58    22
Correlation coefficient: 0.57

Estimate the expenditure on entertainment if the expenditure on accommodation is 200.

Solution:

Here X̄ = 173, Ȳ = 47.58, σx = 66, σy = 22, r = 0.57.

    bYX = r·σy/σx = 0.57 × 22/66 = 0.19

    Y - Ȳ = bYX (X - X̄)
    Y = 47.58 + 0.19(X - 173) = 14.71 + 0.19X

For X = 200:

    Y = 14.71 + 0.19 × 200 = 52.71

The Irreversible Relation:

1. An increase in family income leads to an increase in expenditure, but an increase in the family's expenditure does not imply an increase in family income.

2. If the rainfall is timely and good, the crop will be good; but if the crop is good, there is no guarantee that the rainfall was timely and good.
Difference between Correlation and Regression:

Correlation:
- It is merely concerned with determining how strongly the two variables are linearly related.
- It is not able to solve prediction problems.
- The coefficient of correlation is independent of change of origin and of scale.
- The coefficient of correlation satisfies the relation -1 ≤ r ≤ +1.

Regression:
- It precedes correlation.
- It solves prediction problems.
- The coefficients of regression are independent of change of origin only.
- The coefficients of regression satisfy the relation 0 ≤ r² ≤ 1.
Regression Lines:

"The device used for estimating the value of one variable from the value of the other consists of a line through the points, drawn in such a manner as to represent the average relationship between the two variables. Such a line is called the line of regression."

- J. R. Stockton
As per the method of least squares, the two regression lines are:

    Y - Ȳ = bYX (X - X̄)   .......(1)
    X - X̄ = bXY (Y - Ȳ)   .......(2)

where:
    X̄ = mean of series X,  Ȳ = mean of series Y,
    bYX = r·σY/σX = regression coefficient of Y on X,
    bXY = r·σX/σY = regression coefficient of X on Y.
Properties of Regression Lines:

1. The A.M. of X and the A.M. of Y lie on both regression lines (the lines intersect at (X̄, Ȳ)).

2. If r = 0, the two regression lines are perpendicular to each other.

3. If the two regression lines are identical, the correlation between the variables is perfect (r = ±1).

4. The angle θ between the regression lines is given by:

    tan θ = [ (1 - r²) / |r| ] · [ σX·σY / (σX² + σY²) ]
Meaning of Regression

•Regression means an act of returning or going back.

•The term was first used in 1877 by Sir Francis Galton while studying the relationship between the heights of fathers and sons.

•The statistical tool with the help of which we can estimate (predict) the unknown values of one variable from known values of another variable is called regression.
Significance of Regression Analysis

The significance of regression analysis can be expressed under the following heads:

1. The relationship of cause and effect between two or more variables can be analyzed with the help of regression analysis.

2. The change in the value of one variable for a unit change in the value of the other variable can be determined from the regression coefficient.

3. It provides estimates of the values of the dependent variable from values of the independent variable in social, economic and business activities.

4. In the field of business, regression is very useful because with its help a businessman can predict future production, consumption, investment, prices, profits, sales, etc.
Types of Regression

1. Simple regression:

If the regression analysis is based on only two variables, it is called simple regression. A simple regression is confined to two variables, say X and Y, where X is the independent variable and Y the dependent variable. The functional relationship between X and Y is
    Y = f(X)

2. Multiple regression:

If more than two variables are studied at a time in regression analysis, it is called multiple regression. A multiple regression analysis is made among more than two related variables at a time, say X, Y and Z. The functional relationship in such a case is expressed as

    Y = f(X, Z), or X = f(Y, Z), or Z = f(X, Y)

3. Linear regression: If the regression line is a straight line, the regression between the variables under study is linear. In linear regression the value of the dependent variable changes at a constant rate for a unit change in the value of the independent variable. This constant change may be in terms of absolute amount or percentage.

4. Curvilinear or non-linear regression: If the regression line is not a straight line but a smooth curve, the regression is termed curvilinear or non-linear.
Regression Equations

• Regression equation of X on Y: describes the variation in the values of X
for the given changes in Y, and estimates the value of X for a given value
of Y:

  X − X̄ = r (σx / σy) (Y − Ȳ)

• Regression equation of Y on X: describes the variation in the values of Y
for the given changes in X, and estimates the value of Y for a given value
of X:

  Y − Ȳ = r (σy / σx) (X − X̄)

where:
  X̄ = arithmetic mean of the X series
  Ȳ = arithmetic mean of the Y series
  r = correlation coefficient of X and Y
  σx = standard deviation of the X series
  σy = standard deviation of the Y series

In the X-on-Y equation, X is the value to be predicted and Y is the given
value; in the Y-on-X equation, Y is the value to be predicted and X is the
given value.
EXAMPLE 1
• The following information is given to you:

                        Husband's age   Wife's age
MEAN                    25 years        22 years
STANDARD DEVIATION      4 years         5 years

• Coefficient of correlation between the ages of husbands and wives = +0.8
• Find the expected age of the husband when the wife's age is 12 years, and
the expected age of the wife when the husband's age is 33 years.
Given that: X̄ = 25, Ȳ = 22, σx = 4, σy = 5, r = 0.8

Regression equation of X on Y:
  X − X̄ = r (σx / σy) (Y − Ȳ)
  X − 25 = 0.8 × (4/5) (Y − 22)
  X − 25 = 0.64 (Y − 22)
  X = 0.64Y − 14.08 + 25
  X = 0.64Y + 10.92

If the age of the wife is 12, then the age of the husband is:
  X = 0.64 × 12 + 10.92 = 7.68 + 10.92 = 18.60 years

Regression equation of Y on X:
  Y − Ȳ = r (σy / σx) (X − X̄)
  Y − 22 = 0.8 × (5/4) (X − 25)
  Y − 22 = 1.0 (X − 25)
  Y = X − 25 + 22
  Y = X − 3

If the age of the husband is 33 years, then the age of the wife is:
  Y = 33 − 3 = 30 years
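The two predictions above can be checked numerically. A minimal sketch (the function names are illustrative, not from the text):

```python
# Regression equations of Example 1, assuming the given means,
# standard deviations and r. Helper names are illustrative.

def predict_x_from_y(y, x_mean=25, y_mean=22, sd_x=4, sd_y=5, r=0.8):
    """X on Y:  X - X_bar = r * (sd_x / sd_y) * (Y - Y_bar)."""
    return x_mean + r * (sd_x / sd_y) * (y - y_mean)

def predict_y_from_x(x, x_mean=25, y_mean=22, sd_x=4, sd_y=5, r=0.8):
    """Y on X:  Y - Y_bar = r * (sd_y / sd_x) * (X - X_bar)."""
    return y_mean + r * (sd_y / sd_x) * (x - x_mean)

print(predict_x_from_y(12))   # husband's age when wife is 12 -> 18.6
print(predict_y_from_x(33))   # wife's age when husband is 33 -> 30.0
```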
REGRESSION COEFFICIENTS

A regression coefficient indicates the average change in the value of one
variable for a unit change in the value of the other variable. Since there
are two regression equations, there are two regression coefficients: the
regression coefficient of X on Y and the regression coefficient of Y on X.
• Regression coefficient of X on Y (bxy): represents the change in the value
of X for a unit change in the value of the variable Y. When the X and Y
series are given and deviations have been taken from assumed means in one
or both series:

  bxy = (N Σdxdy − Σdx Σdy) / (N Σd²y − (Σdy)²)

• Regression coefficient of Y on X (byx): represents the change in the value
of Y for a unit change in the value of the variable X. With the same
deviations from assumed means:

  byx = (N Σdxdy − Σdx Σdy) / (N Σd²x − (Σdx)²)
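The two formulas above can be sketched in code, given deviations dx and dy taken from assumed means (function name is illustrative, not from the text):

```python
# Minimal sketch of the assumed-mean regression-coefficient formulas.

def regression_coefficients(dx, dy):
    """Return (bxy, byx) from deviations taken from assumed means."""
    n = len(dx)
    s_dx, s_dy = sum(dx), sum(dy)
    s_dxdy = sum(a * b for a, b in zip(dx, dy))
    s_dx2 = sum(a * a for a in dx)
    s_dy2 = sum(b * b for b in dy)
    num = n * s_dxdy - s_dx * s_dy        # common numerator
    bxy = num / (n * s_dy2 - s_dy ** 2)   # X on Y
    byx = num / (n * s_dx2 - s_dx ** 2)   # Y on X
    return bxy, byx
```

Note that both coefficients share the numerator N Σdxdy − Σdx Σdy; only the denominator changes.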
Example 2

From the following data calculate:
(a) the two regression coefficients
(b) the two regression equations

Population (in thousands)   18  19  20  21  22  23  24  25  26  27
No. of TV sets demanded     14  16  16  18  18  19  20  20  21  21

X      dx (from 23)  d²x   Y      dy (from 18)  d²y   dxdy
18     −5            25    14     −4            16    20
19     −4            16    16     −2            4     8
20     −3            9     16     −2            4     6
21     −2            4     18     0             0     0
22     −1            1     18=A   0             0     0
23=A   0             0     19     1             1     0
24     1             1     20     2             4     2
25     2             4     20     2             4     4
26     3             9     21     3             9     9
27     4             16    21     3             9     12

ΣX = 225, Σdx = −5, Σd²x = 85;  ΣY = 183, Σdy = 3, Σd²y = 51;  Σdxdy = 61
REGRESSION COEFFICIENTS

X on Y:
  bxy = (N Σdxdy − Σdx Σdy) / (N Σd²y − (Σdy)²)
      = (10 × 61 − (−5 × 3)) / (10 × 51 − (3)²)
      = (610 + 15) / (510 − 9)
      = 625 / 501 = 1.247

Y on X:
  byx = (N Σdxdy − Σdx Σdy) / (N Σd²x − (Σdx)²)
      = (10 × 61 − (−5 × 3)) / (10 × 85 − (−5)²)
      = (610 + 15) / (850 − 25)
      = 625 / 825 = 0.758
(b) Regression Equations

Regression equation of X on Y (describes the variation in the values of X
for the given changes in Y):
  X − X̄ = bxy (Y − Ȳ)
  X − 22.5 = 1.247 (Y − 18.3)
  X = 1.247Y − 22.82 + 22.5
  X = 1.247Y − 0.32

Regression equation of Y on X (describes the variation in the values of Y
for the given changes in X):
  Y − Ȳ = byx (X − X̄)
  Y − 18.3 = 0.758 (X − 22.5)
  Y = 0.758X − 17.055 + 18.3
  Y = 0.758X + 1.245
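As a quick arithmetic check, the intercepts of both equations follow from the rounded coefficients and the means X̄ = 22.5, Ȳ = 18.3:

```python
# Check of Example 2's regression equations, using the rounded
# coefficients from the slide and the two series means.
bxy, byx = 1.247, 0.758
x_mean, y_mean = 22.5, 18.3

a_x_on_y = x_mean - bxy * y_mean   # intercept of X = 1.247Y + a  ->  ~ -0.32
a_y_on_x = y_mean - byx * x_mean   # intercept of Y = 0.758X + a  ->  ~ 1.245

print(round(a_x_on_y, 2), round(a_y_on_x, 3))
```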
REGRESSION LINES
• Regression lines are the lines of best fit expressing the mutual average
relationship between two series. These lines give the best estimate of one
variable for any given value of the other variable.

• If we take the case of two variables X and Y, we shall have two regression
lines: X on Y and Y on X.
• Obtain the regression equations of Y on X and X on Y from the following
table giving the sales of goods X and goods Y.

Sale of goods Y    Sale of goods X (in units)
(in units)         5-15   15-25   25-35   35-45   Total
0-10               1      1       -       -       2
10-20              3      6       5       1       15
20-30              1      8       9       2       20
30-40              -      3       9       3       15
40-50              -      -       4       4       8
Total              5      18      27      10      60
X           5-15   15-25   25-35   35-45
M.P.        10     20      30      40
dx          −1     0       1       2

Y      M.P.  dy                                  f      fdy       fd²y       fdxdy
0-10   5     −2    1      1       -       -     2      −4        8          2
10-20  15    −1    3      6       5       1     15     −15       15         −4
20-30  25    0     1      8       9       2     20     0         0          0
30-40  35    1     -      3       9       3     15     15        15         15
40-50  45    2     -      -       4       4     8      16        32         24

f           5      0       27      10    N=60   Σfdy=12  Σfd²y=70  Σfdxdy=37
f           5      18      27      10    N=60
fdx         −5     0       27      20    Σfdx = 42
fd²x        5      0       27      40    Σfd²x = 72
fdxdy       5      0       12      20    Σfdxdy = 37
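The row and column totals of the table can be recomputed directly from the cell frequencies:

```python
# Recompute the bivariate table's totals from the cell frequencies.
# Rows are the Y classes (0-10 ... 40-50), columns the X classes
# (5-15 ... 35-45); dx, dy are step deviations from the assumed means.
freq = [
    [1, 1, 0, 0],   # 0-10
    [3, 6, 5, 1],   # 10-20
    [1, 8, 9, 2],   # 20-30
    [0, 3, 9, 3],   # 30-40
    [0, 0, 4, 4],   # 40-50
]
dx = [-1, 0, 1, 2]
dy = [-2, -1, 0, 1, 2]

cells = [(f, dx[j], dy[i])
         for i, row in enumerate(freq) for j, f in enumerate(row)]
N         = sum(f for f, _, _ in cells)           # 60
sum_fdx   = sum(f * a for f, a, _ in cells)       # 42
sum_fdy   = sum(f * b for f, _, b in cells)       # 12
sum_fdx2  = sum(f * a * a for f, a, _ in cells)   # 72
sum_fdy2  = sum(f * b * b for f, _, b in cells)   # 70
sum_fdxdy = sum(f * a * b for f, a, b in cells)   # 37
```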
Regression Equation
• Y on X:
  Y − Ȳ = r (σy / σx) (X − X̄), where

  r (σy / σx) = [N Σfdxdy − Σfdx Σfdy] / [N Σfd²x − (Σfdx)²] × (iy / ix)
              = (60 × 37 − 42 × 12) / (60 × 72 − (42)²) × (10 / 10)
              = (2220 − 504) / (4320 − 1764)
              = 1716 / 2556 = 0.67

  Ȳ = A + (Σfdy / N) × i = 25 + (12 / 60) × 10 = 27
  X̄ = A + (Σfdx / N) × i = 20 + (42 / 60) × 10 = 27

  Y − 27 = 0.67 (X − 27)
  Y = 0.67X − 18.09 + 27
  Y = 0.67X + 8.91
• X on Y:
  X − X̄ = r (σx / σy) (Y − Ȳ), where

  r (σx / σy) = [N Σfdxdy − Σfdx Σfdy] / [N Σfd²y − (Σfdy)²] × (ix / iy)
              = (60 × 37 − 42 × 12) / (60 × 70 − (12)²) × (10 / 10)
              = (2220 − 504) / (4200 − 144)
              = 1716 / 4056 = 0.423

  X − 27 = 0.423 (Y − 27)
  X = 0.423Y − 11.42 + 27
  X = 0.423Y + 15.58
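Both grouped-data equations follow mechanically from the table totals. A sketch using the sums computed from the table (N = 60, Σfdx = 42, Σfdy = 12, Σfd²x = 72, Σfd²y = 70, Σfdxdy = 37, class intervals ix = iy = 10, assumed-mean midpoints 20 and 25):

```python
# Recover both regression coefficients and means of the grouped example
# from the bivariate table's totals.
N, s_fdx, s_fdy = 60, 42, 12
s_fdx2, s_fdy2, s_fdxdy = 72, 70, 37
ix = iy = 10

num = N * s_fdxdy - s_fdx * s_fdy               # 2220 - 504 = 1716
byx = num / (N * s_fdx2 - s_fdx**2) * iy / ix   # ~ 0.67  (Y on X)
bxy = num / (N * s_fdy2 - s_fdy**2) * ix / iy   # ~ 0.423 (X on Y)

x_mean = 20 + s_fdx / N * ix                    # ~ 27
y_mean = 25 + s_fdy / N * iy                    # ~ 27

# Y on X: Y = y_mean + byx * (X - x_mean)
# X on Y: X = x_mean + bxy * (Y - y_mean)
```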
THANK YOU
