Regression
Objectives: At the end of this unit, a student should be able to
Understand regression
Estimate the regression equation
Understand the relation between correlation and regression
Use regression analysis
Understand the application of regression analysis in business
Appreciate regression analysis

Structure
11.1 Introduction
11.2 Equation of a straight line
11.3 Two Lines of Regression
11.4 Properties of Regression Lines
11.5 Key words
11.6 Suggested readings
11.7 Review exercise

11.1 Introduction
The coefficient of correlation gives the magnitude of the association between two variables. The next step is to obtain an expression for the relationship between the variables. We derive the equation that defines this relationship, which is linear because we are dealing with linear correlation as defined in the previous sections. The functional relation between the variables is called the regression equation. The word regression means a tendency to return to the mean. For example, in the correlation of the heights of fathers and sons, a tendency of the human race to return, or regress, to the average height is observed.

11.2 Equation of a straight line
The equation of a straight line is Y = a + bX, where a and b are constants. a is the Y intercept, i.e. the point where the line Y = a + bX cuts the Y axis, and b is the slope of the line, which gives the rate of change of Y with respect to X. The values of a and b are found by solving the normal equations for Y = a + bX, which gives

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad a = \bar{Y} - b\bar{X}$$
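As a sketch of where the formulas for a and b come from, assuming the usual least-squares normal equations (they are not written out in this unit, so this is a reconstruction rather than the author's own derivation):

```latex
\begin{align*}
\sum Y  &= na + b\sum X \\
\sum XY &= a\sum X + b\sum X^2
\end{align*}
% Dividing the first equation by n gives a = \bar{Y} - b\bar{X}.
% Substituting this into the second, with \sum X = n\bar{X}:
\begin{align*}
\sum XY &= (\bar{Y} - b\bar{X})\,n\bar{X} + b\sum X^2 \\
b\left(\sum X^2 - n\bar{X}^2\right) &= \sum XY - n\bar{X}\bar{Y} \\
b &= \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}
\end{align*}
```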
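For example, we find the values of a and b for the following data.

         X     Y     XY    X^2
         10    6     60    100
         9     3     27    81
         7     2     14    49
         8     4     32    64
         11    5     55    121
TOTAL    45    20    188   415

$$\bar{X} = \frac{45}{5} = 9, \qquad \bar{Y} = \frac{20}{5} = 4$$

$$b = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{188 - 5(9)(4)}{415 - 5(9)^2} = \frac{8}{10} = 0.8$$

$$a = \bar{Y} - b\bar{X} = 4 - 0.8(9) = -3.2$$

The estimating equation is therefore Y = -3.2 + 0.8X.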
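The arithmetic of this example can be checked with a short script. This is a minimal sketch in Python; the function name `fit_line` is an illustrative choice, not something defined in this unit.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of b, as in the formula for the slope.
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


# Data from the worked example in section 11.2.
xs = [10, 9, 7, 8, 11]
ys = [6, 3, 2, 4, 5]

a, b = fit_line(xs, ys)
print(f"a = {a:.1f}, b = {b:.1f}")
```

The same function can be reused for any paired data set, provided the X values are not all equal (otherwise the denominator is zero).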
Exercise: Obtain an estimating equation for the data given below:

X:   5   3   7   4   8   2   10   6   8    7   9   11
Y:   8   6   8   5   9   6   8    5   11   7   8   10
11.3 Two Lines of Regression
For bivariate data (Xi, Yi), the relationship may be that Y depends on X or that X depends on Y. If Y depends on X, the regression line is Y on X; Y is the dependent variable and X is the independent variable. If X depends on Y, the regression line is X on Y; X is the dependent variable and Y is the independent variable.

The regression equation of Y on X, Y = a + bX, is used to estimate the value of Y when X is known. The regression equation of X on Y, X = c + dY, is used to estimate the value of X when Y is given. Here a, b, c and d are constants. In Y = a + bX, a can be interpreted as the average value of Y when X is zero; in X = c + dY, c is the average value of X when Y is zero.

The slopes of the equations Y on X and X on Y are denoted by byx and bxy respectively. Their values are

$$b_{yx} = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}, \qquad b_{xy} = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(Y)}$$

that is,

$$b_{yx} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}, \qquad b_{xy} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2}$$

byx and bxy are the coefficients of regression. After we obtain the values of byx and bxy, we obtain the regression equations by substituting in the following equations.

Y on X: $$Y - \bar{Y} = b_{yx}(X - \bar{X})$$

X on Y: $$X - \bar{X} = b_{xy}(Y - \bar{Y})$$
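The covariance form and the sum form of the regression coefficients are the same quantity. A short derivation, assuming the population definitions of covariance and variance (dividing by n; with n - 1 the factor cancels in the same way):

```latex
\begin{align*}
\operatorname{cov}(X, Y) &= \frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})
                          = \frac{1}{n}\left(\sum XY - n\bar{X}\bar{Y}\right) \\
\operatorname{var}(X)    &= \frac{1}{n}\sum (X - \bar{X})^2
                          = \frac{1}{n}\left(\sum X^2 - n\bar{X}^2\right) \\
b_{yx} = \frac{\operatorname{cov}(X, Y)}{\operatorname{var}(X)}
       &= \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2}
\end{align*}
```

Replacing var(X) by var(Y) in the last step gives the corresponding expression for bxy.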
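The table below gives the stopping distance d (in feet) of an automobile travelling at speed V (in miles per hour) at the instant danger is sighted.

Speed V (miles per hour)     20   30   40    50    60    70
Stopping distance d (ft)     54   90   138   206   292   396

Estimate the distance when the speed is 45 miles per hour. Estimate the speed when the distance travelled before stopping the automobile is 100 feet.

We have to obtain the estimating equations, so we calculate byx and bxy, taking speed as X and distance as Y.

         X      Y       XY       X^2      Y^2
         20     54      1080     400      2916
         30     90      2700     900      8100
         40     138     5520     1600     19044
         50     206     10300    2500     42436
         60     292     17520    3600     85264
         70     396     27720    4900     156816
TOTAL    270    1176    64840    13900    314576

$$\bar{X} = \frac{270}{6} = 45, \qquad \bar{Y} = \frac{1176}{6} = 196$$

$$b_{yx} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum X^2 - n\bar{X}^2} = \frac{64840 - 6(45)(196)}{13900 - 6(45)^2} = \frac{11920}{1750} = 6.811429$$

$$b_{xy} = \frac{\sum XY - n\bar{X}\bar{Y}}{\sum Y^2 - n\bar{Y}^2} = \frac{11920}{314576 - 6(196)^2} = \frac{11920}{84080} = 0.141770$$

Substituting in

$$Y - \bar{Y} = b_{yx}(X - \bar{X}), \qquad X - \bar{X} = b_{xy}(Y - \bar{Y})$$

we get (Y - 196) = 6.811429(X - 45), which simplifies to Y = 6.811429X - 110.514, and (X - 45) = 0.141770(Y - 196), which simplifies to X = 0.141770Y + 17.213.

For a speed of 45 miles per hour, the estimated stopping distance is Y = 6.811429(45) - 110.514 = 196 feet. For a stopping distance of 100 feet, the estimated speed is X = 0.141770(100) + 17.213 = 31.39 miles per hour.

Observe that byx and bxy have the same sign.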
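The two estimating equations for the stopping-distance data can also be computed programmatically. The sketch below is one way to do it in Python, using the speed and distance values from the table; the helper names `regression_coefficients`, `distance_at` and `speed_at` are illustrative assumptions, not part of this unit.

```python
def regression_coefficients(xs, ys):
    """Return (x_bar, y_bar, byx, bxy) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    syy = sum(y * y for y in ys) - n * y_bar ** 2
    return x_bar, y_bar, sxy / sxx, sxy / syy


# Speed (miles per hour) and stopping distance (ft) from the table.
speed = [20, 30, 40, 50, 60, 70]
dist = [54, 90, 138, 206, 292, 396]

x_bar, y_bar, byx, bxy = regression_coefficients(speed, dist)


def distance_at(v):
    """Estimate distance from speed using the line Y on X."""
    return y_bar + byx * (v - x_bar)


def speed_at(d):
    """Estimate speed from distance using the line X on Y."""
    return x_bar + bxy * (d - y_bar)


print("byx =", round(byx, 6), "bxy =", round(bxy, 6))
print("distance at 45 mph:", round(distance_at(45), 2))
print("speed for 100 ft:", round(speed_at(100), 2))
```

Note that each estimate uses its own line: Y on X to estimate distance from speed, and X on Y to estimate speed from distance.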
Exercise: For the data below, construct a scatter diagram. Find the least squares regression lines Y on X and X on Y.

Grade on first quiz X      6   5   8   8    7   6   10   4   9   7
Grade on second quiz Y     8   7   7   10   5   8   10   6   8   6
11.4 Properties of Regression Lines
The regression equations Y on X and X on Y have the following properties.
a) The lines of regression meet at the point whose co-ordinates are ($\bar{X}$, $\bar{Y}$). The means of X and Y lie on both lines of regression.
b) The regression coefficients byx and bxy and the correlation coefficient r have the same sign, so all three indicate the same direction of relationship.
c) There is an angle between the two lines of regression. Let this angle be denoted by $\theta$. If the correlation is perfect, $\theta$ = 0 and the two lines coincide. As the correlation becomes weaker, $\theta$ increases.
d) The correlation coefficient r is the geometric mean of the regression coefficients, and r takes the sign (+ or -) that byx and bxy share.

$$r = \pm\sqrt{b_{yx} \cdot b_{xy}}$$

e) The regression coefficients are related to r and the standard deviations by

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x} \qquad \text{and} \qquad b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$
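These properties can be verified numerically. The sketch below uses a small hypothetical data set (not taken from this unit) and assumes the correlation is neither zero nor perfect, so that none of the divisions is by zero.

```python
import math

# Hypothetical data, chosen only to illustrate the properties.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

byx, bxy = sxy / sxx, sxy / syy
r = sxy / math.sqrt(sxx * syy)
sd_x, sd_y = math.sqrt(sxx / n), math.sqrt(syy / n)

# (a) Intersection of Y = a + byx*X and X = c + bxy*Y.
a = y_bar - byx * x_bar
c = x_bar - bxy * y_bar
x_meet = (c + bxy * a) / (1 - byx * bxy)
y_meet = a + byx * x_meet

# (c) Angle between the lines; drawn in the XY plane,
# the X-on-Y line has slope 1/bxy.
theta = abs(math.degrees(math.atan(1 / bxy) - math.atan(byx)))

# (d) Geometric mean of the coefficients, with their shared sign.
r_from_b = math.copysign(math.sqrt(byx * bxy), byx)

print("lines meet at", (x_meet, y_meet), "means are", (x_bar, y_bar))
print("byx, bxy, r:", byx, bxy, r)
print("angle between lines (degrees):", theta)
print("r from coefficients:", r_from_b)
print("r * sd_y / sd_x:", r * sd_y / sd_x, "r * sd_x / sd_y:", r * sd_x / sd_y)
```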
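Illustration:
The two lines of regression are 5x + 6y = 160 and 2x + 4y = 80. Find
1. The mean values of X and Y
2. The regression coefficients
3. The correlation coefficient
4. The variance of Y if the standard deviation of X is 1.

1. Both regression lines pass through ($\bar{X}$, $\bar{Y}$), so we solve the two equations simultaneously. To eliminate x:

5x + 6y = 160     multiply by 2     10x + 12y = 320
2x + 4y = 80      multiply by 5     10x + 20y = 400

Subtracting, -8y = -80, so y = 10. Substituting in either equation we get x = 20. Hence $\bar{X}$ = 20 and $\bar{Y}$ = 10.

2. The regression equations are known, but we do not know which is Y on X and which is X on Y. We assume that 5x + 6y = 160 is Y on X and 2x + 4y = 80 is X on Y. Then

6y = -5x + 160, so y = -(5/6)x + 80/3 and byx = -5/6
2x = -4y + 80, so x = -2y + 40 and bxy = -2

$$|r| = \sqrt{b_{yx} \cdot b_{xy}} = \sqrt{\frac{5}{6} \times 2} = \sqrt{\frac{5}{3}} > 1$$

which is impossible, as $-1 \le r \le 1$. Our assumption is wrong, so we reverse it: 2x + 4y = 80 is Y on X and 5x + 6y = 160 is X on Y. Then

4y = -2x + 80, so y = -(1/2)x + 20 and byx = -1/2
5x = -6y + 160, so x = -(6/5)y + 32 and bxy = -6/5

3. Both regression coefficients are negative, so r is negative:

$$r = -\sqrt{b_{yx} \cdot b_{xy}} = -\sqrt{\frac{6}{5} \times \frac{1}{2}} = -\sqrt{\frac{3}{5}} = -0.7746$$

4. From $b_{yx} = r\,\sigma_y / \sigma_x$ we have

$$b_{yx}^2 = r^2 \cdot \frac{\sigma_y^2}{\sigma_x^2}$$

With $\sigma_x$ = 1,

$$\frac{1}{4} = \frac{3}{5}\,\sigma_y^2, \qquad \sigma_y^2 = \frac{25}{60} = \frac{5}{12} \approx 0.4167$$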
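The trial-and-reverse reasoning of this illustration can be sketched in code. The example below assumes each line is given as a tuple (p, q, s) meaning px + qy = s; the function names `solve_2x2` and `identify` are illustrative, and exact fractions are used to avoid rounding.

```python
import math
from fractions import Fraction


def solve_2x2(l1, l2):
    """Intersection of p1*x + q1*y = s1 and p2*x + q2*y = s2 (Cramer's rule)."""
    (p1, q1, s1), (p2, q2, s2) = l1, l2
    det = p1 * q2 - p2 * q1
    return Fraction(s1 * q2 - s2 * q1, det), Fraction(p1 * s2 - p2 * s1, det)


def identify(l1, l2):
    """Return (byx, bxy) for the assignment of lines that keeps r*r <= 1."""
    for y_on_x, x_on_y in ((l1, l2), (l2, l1)):
        byx = Fraction(-y_on_x[0], y_on_x[1])  # y = -(p/q)x + s/q
        bxy = Fraction(-x_on_y[1], x_on_y[0])  # x = -(q/p)y + s/p
        if 0 <= byx * bxy <= 1:
            return byx, bxy
    raise ValueError("neither assignment gives a valid correlation")


line1 = (5, 6, 160)  # 5x + 6y = 160
line2 = (2, 4, 80)   # 2x + 4y = 80

x_bar, y_bar = solve_2x2(line1, line2)
byx, bxy = identify(line1, line2)

# r is the geometric mean of the coefficients and shares their sign.
r = math.copysign(math.sqrt(float(byx * bxy)), float(byx))

# byx / bxy equals var(Y) / var(X); the standard deviation of X is 1.
var_x = Fraction(1)
var_y = var_x * byx / bxy

print("means:", x_bar, y_bar)
print("byx:", byx, "bxy:", bxy)
print("r:", round(r, 4))
print("variance of Y:", var_y)
```

Using the ratio byx / bxy for the variance is equivalent to the step in the text, since both follow from property (e).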
Exercise: In a partially destroyed laboratory record of an analysis of correlation data, only the following results are legible:
Variance of X = 9
Regression equations: 8x - 10y + 66 = 0; 40x - 18y = 214
What were (a) the mean values of x and y, (b) the standard deviation of y, (c) the coefficient of correlation between x and y?
11.5 Key words
Regression: A general process of predicting one variable from another by statistical means using previous data.
Regression line: A line fitted to a set of data points to estimate the relationship between the variables.
Dependent variable: The variable we are trying to predict.
Independent variable: The known variable in regression analysis.

11.6 Suggested readings
Anderson et al., Statistics for Business and Economics, eighth edition, 2002, Thomson Asia Pvt. Ltd., Singapore.
R. Levin and D. Rubin, Statistics for Management, seventh edition, 1997, Prentice Hall of India, New Delhi.
Frank and Althoen, Statistics: Concepts and Applications, 1994, Cambridge University Press, Cambridge.
A. D. Aczel and J. Sounderpandian, Complete Business Statistics, 2002, Tata McGraw Hill, New Delhi, India.
W. J. Stevenson, Business Statistics: Concepts and Applications, 1978, Harper and Row Publishers, New York, USA.
11.7 Review exercise
1. A computer, while calculating the correlation coefficient between two variables X and Y from 25 pairs of observations, obtained the following:

$$N = 25, \quad \sum X = 125, \quad \sum Y = 100, \quad \sum X^2 = 650, \quad \sum Y^2 = 460, \quad \sum XY = 508$$

Find the correlation coefficient of X and Y, the mean values of X and Y, and the regression equations of Y on X and X on Y.
2. A furniture retailer in a locality is interested in studying whether some relationship exists between the number of building permits issued in that locality in past years and the volume of sales in those years. He has accordingly collected data on the sales (Y) and the number of building permits issued (X) in the past 10 years. The results are as follows: $\sum X = 200$, $\sum Y = 2200$, $\sum XY = 45800$, $\sum X^2 = 4600$ and $\sum Y^2 = 490400$. Using the appropriate regression equation, find the level of sales expected next year when 2000 building permits are to be issued.
3. To the Internal Revenue Service, the reasonableness of total itemized deductions depends on the taxpayer's adjusted gross income. Large deductions, which include charity and medical deductions, are more reasonable for taxpayers with large adjusted gross incomes. If a taxpayer claims larger than average itemized deductions for a given level of income, the chances of an IRS audit are increased. Data (in $1000s) on adjusted gross income and the average or reasonable amount of itemized deductions follow.

Adjusted gross income ($1000s)         22    27    32     48     65     85     120
Total itemized deductions ($1000s)     9.6   9.6   10.1   11.1   13.5   17.7   25.5

Use the least squares method to develop the estimated regression equation. Estimate a reasonable level of total itemized deductions for a taxpayer with an adjusted gross income of $52000. If this taxpayer has claimed total itemized deductions of $20,00, would the IRS agent's request for an audit appear justified? Explain.
4. In a laboratory experiment on a correlation research study, the equations of the two regression lines were found to be 2X - Y + 1 = 0 and 3X - 2Y + 7 = 0. Find the means of X and Y. Also work out the values of the regression coefficients and the coefficient of correlation between the two variables X and Y. Given that the variance of X is 9, find the standard deviation of Y.
5. The two lines of regression based on 100 observations were 20X - 9Y - 106 = 0 and 4X - 5Y + 30 = 0. Determine the coefficient of correlation, and calculate the variance of Y if the variance of X is 9.