
Chapter 14

Regression Analysis

LEARNING OBJECTIVES

After studying this chapter, you should be able to

 use simple linear regression to build models from business data.
 understand how the method of least squares is used to predict values of a dependent (or response) variable based on the values of an independent (or explanatory) variable.
 measure the variability (residual) of the dependent variable about a straight line (also called the regression line) and examine whether the regression model fits the data.

The cause is hidden, but the result is known.


—Ovid

I never think of the future, it comes soon enough.


—Albert Einstein

14.1 INTRODUCTION

In Chapter 13 we introduced the concept of statistical relationship between two


variables such as: level of sales and amount of advertising; yield of a crop and the
amount of fertilizer used; price of a product and its supply, and so on. The
relationship between such variables indicates the degree and direction of their
association, but fails to answer the following question:

 Is there any functional (or algebraic) relationship between the two variables? If yes, can it be
used to estimate the most likely value of one variable, given the value of the other variable?

The statistical technique that expresses the relationship between two or more
variables in the form of an equation to estimate the value of a variable, based on the
given value of another variable, is called regression analysis. The variable whose
value is estimated using the algebraic equation is
called dependent (or response) variable and the variable whose value is used to
estimate this value is called independent (regressor or predictor) variable. The
linear algebraic equation used for expressing a dependent variable in terms of
independent variable is called linear regression equation.
The term regression was used in 1877 by Sir Francis Galton while studying the
relationship between the heights of fathers and sons. He found that though 'tall fathers
have tall sons', if the average height of fathers is x above the general height, the
average height of their sons is only 2x/3 above the general height. Galton described
this fall in average height as 'regression to mediocrity'. However, Galton's theory is
not universally applicable, and the term regression is now applied to other types
of variables in business and economics. In its literary sense, regression means
'moving backward'.

The basic differences between correlation and regression analysis are summarized as
follows:

1. Developing an algebraic equation between two variables from sample data and predicting
the value of one variable, given the value of the other variable is referred to as regression
analysis, while measuring the strength (or degree) of the relationship between two
variables is referred to as correlation analysis. The sign of the correlation coefficient indicates
the nature (direct or inverse) of relationship between two variables, while the absolute
value of correlation coefficient indicates the extent of relationship.
2. Correlation analysis determines an association between two variables x and y but not that
they have a cause-and-effect relationship. Regression analysis, in contrast to correlation,
determines the cause-and-effect relationship between x and y, that is, a change in the
value of independent variable x causes a corresponding change (effect) in the value of
dependent variable y if all other factors that affect y remain unchanged.
3. In linear regression analysis one variable is considered as dependent variable and other
as independent variable, while in correlation analysis both variables are considered to be
independent.
4. The coefficient of determination r2 indicates the proportion of total variance in the
dependent variable that is explained or accounted for by the variation in the
independent variable. Since value of r2 is determined from a sample, its value is subject to
sampling error. Even if the value of r2 is high, the assumption of a linear regression may
be incorrect because it may represent a portion of the relationship that actually is in the
form of a curve.

14.2 ADVANTAGES OF REGRESSION ANALYSIS

The following are some important advantages of regression analysis:

1. Regression analysis helps in developing a regression equation by which the value of a


dependent variable can be estimated given a value of an independent variable.
2. Regression analysis helps to determine the standard error of estimate to measure the
variability or spread of values of the dependent variable about the regression line.
The smaller the variance and standard error of estimate, the closer the pairs of values (x, y) fall about
the regression line and the better the line fits the data, that is, a good estimate can be made of
the value of variable y. When all the points fall on the line, the standard error of estimate
equals zero.
3. When the sample size is large (df ≥ 29), interval estimation for predicting the value of
the dependent variable based on the standard error of estimate is considered acceptable.
The magnitude of r2 remains the same regardless of which of the two variables is
treated as the dependent one.
14.3 TYPES OF REGRESSION MODELS

The primary objective of regression analysis is the development of a regression


model to explain the association between two or more variables in the given
population. A regression model is the mathematical equation that provides
prediction of value of dependent variable based on the known values of one or more
independent variables.

The particular form of regression model depends upon the nature of the problem
under study and the type of data available. However, each type of association or
relationship can be described by an equation relating a dependent variable to one or
more independent variables.

14.3.1 Simple and Multiple Regression Models

If a regression model characterizes the relationship between a dependent y and only


one independent variable x, then such a regression model is called a simple
regression model. But if more than one independent variable is associated with a
dependent variable, then such a regression model is called a multiple regression
model. For example, sales turnover of a product (a dependent variable) is associated
with multiple independent variables such as price of the product, expenditure on
advertisement, quality of the product, competitors, and so on. Now if we want to
estimate possible sales turnover with respect to only one of these independent
variables, then it is an example of a simple regression model; otherwise a multiple
regression model is applicable.

14.3.2 Linear and Nonlinear Regression Models

If the value of a dependent (response) variable y in a regression model tends to


increase in direct proportion to an increase in the values of independent (predictor)
variable x, then such a regression model is called a linear model. Thus, it can be
assumed that the mean value of the variable y for a given value of x is related by a
straight-line relationship. Such a relationship is called simple linear regression
model expressed with respect to the population parameters β0 and β1 as:

E(y|x) = β0 + β1x          (14-1)

where    β0 = y-intercept that represents mean (or average) value of the


dependent variable y when x = 0

                 β1 = slope of the regression line that represents the expected change in
the value of y (either positive or negative) for a unit change in the
value of x.
 

Figure 14.1 Straight Line Relationship

The intercept β0 and the slope β1 are unknown regression coefficients. Equation
(14-1) requires computing the values of β0 and β1 to predict average values of y for a
given value of x. Figure 14.1 presents a scatter diagram in which each pair of
values (xi, yi) is a point in a two-dimensional coordinate system. Although
the mean or average value of y is a linear function of x, not all values of y fall
exactly on the straight line; rather, they fall around the line.

Since some points do not fall on the regression line, the values of y are not
exactly equal to the values yielded by the equation E(y|x) = β0 + β1x, also called the line
of means. The deviations of observed y values from the regression line are
responsible for random error (also called residual variation or residual error) in
the prediction of y values for given values of x. In such a situation, it is likely that the
variable x does not explain all the variability of the variable y. For instance, sales
volume is related to advertising, but if other factors related to sales are ignored, then
a regression equation to predict the sales volume (y) by using annual budget of
advertising (x) as a predictor will probably involve some error. Thus for a fixed value
of x, the actual value of y is determined by the mean value function plus a random
error term as follows:

y = β0 + β1x + e          (14-2)

where e is the observed random error. This equation is also called the simple
probabilistic linear regression model.

The error component e allows each individual value of y to deviate from the line of
means by a small amount. The random errors corresponding to different
observations (xi, yi) for i=1, 2,…, n are assumed to follow a normal distribution with
mean zero and (unknown) constant standard deviation.

The term e in expression (14-2) is called the random error because its value,
associated with each value of the variable y, is assumed to vary unpredictably. The
extent of this error for a given value of x is measured by the error variance σe². The
lower the value of σe², the better the fit of the linear regression model to the sample data.

If the line passing through the pair of values of variables x and y is curvilinear, then
the relationship is called nonlinear. A nonlinear relationship implies a varying
absolute change in the dependent variable with respect to changes in the value of the
independent variable. A nonlinear relationship is not very useful for predictions.

In this chapter, we shall discuss methods of simple linear regression analysis


involving a single independent variable, whereas those involving two or more
independent variables will be discussed in Chapter 15.

14.4 ESTIMATION: THE METHOD OF LEAST SQUARES

To estimate the values of regression coefficients β0 and β1, suppose a sample


of n pairs of observations (x1, y1), (x2, y2),…, (xn, yn) is drawn from the population
under study. A method that provides the best linear unbiased estimates of β0 and
β1 is called the method of least squares. The estimates of β0 and β1 should result in the
straight line that is the 'best fit' to the data points. The straight line so drawn is referred
to as the 'best fitted' (least squares or estimated) regression line because the sum of the
squares of the vertical deviations (differences between the actual values of y and the
estimated values ŷ predicted from the fitted line) is as small as possible.

Using equation (14-2), we may express the given n observations in the sample data as:

yi = β0 + β1xi + ei,   i = 1, 2, …, n

Let b0 and b1 be the least-squares estimators of β0 and β1 respectively. Mathematically, we intend to minimize the sum of squared errors

L = Σ ei² = Σ (yi − b0 − b1xi)²

The least-squares estimators b0 and b1 must satisfy

∂L/∂b0 = −2 Σ (yi − b0 − b1xi) = 0
∂L/∂b1 = −2 Σ xi (yi − b0 − b1xi) = 0

After simplifying these two equations, we get

Σyi = nb0 + b1 Σxi
Σxiyi = b0 Σxi + b1 Σxi²          (14-3)

Equations (14-3) are called the least-squares normal equations. The values of the least-squares estimators b0 and b1 can be obtained by solving equations (14-3). Hence the fitted or estimated regression line is given by:

ŷ = b0 + b1x

where ŷ (called 'y hat') is the value of y lying on the fitted regression line for a given x value, and ei = yi − ŷi is called the residual, which describes the error in fitting the regression line to the observation yi. The fitted value ŷ is also called the predicted value of y because, if the actual value of y is not known, it can be predicted for a given value of x using the estimated regression line.

Remark: The sum of the residuals is zero for any least-squares regression line.
Since Σyi = Σŷi, it follows that Σei = 0.
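These computations can be sketched in a few lines of Python; the data set below is hypothetical, chosen only to illustrate the normal-equation solution and the zero-sum property of the residuals:

```python
# Least-squares estimates for the line y = b0 + b1*x, obtained from the
# normal equations (14-3); the data are hypothetical.
x = [2, 4, 6, 8, 10]
y = [3, 7, 8, 12, 14]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

b1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
b0 = (sy - b1 * sx) / n                          # intercept

# Residuals e_i = y_i - yhat_i; their sum is (numerically) zero.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(round(b1, 2), round(b0, 2))        # 1.35 0.7
print(abs(round(sum(residuals), 10)))    # 0.0
```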

14.5 ASSUMPTIONS FOR A SIMPLE LINEAR REGRESSION MODEL

To make valid statistical inference using regression analysis, we make certain


assumptions about the bivariate population from which a sample of paired
observations is drawn and the manner in which observations are generated. These
assumptions form the basis for application of simple linear regression
models. Figure 14.2 illustrates these assumptions.

Figure 14.2 Graphical Illustration of Assumptions in Regression Analysis


 

Assumptions

1. The relationship between the dependent variable y and independent variable x exists and


is linear. The average relationship between x and y can be described by a simple linear
regression equation y = a + bx + e, where e is the deviation of a particular value of y from
its expected value for a given value of independent variable x.
2. For every value of the independent variable x, there is an expected (or mean) value of the
dependent variable y, and these values are normally distributed. The means of these
normal distributions fall on the line of regression.
3. The dependent variable y is a continuous random variable, whereas values of the
independent variable x are fixed values and are not random.
4. The sampling error associated with the expected value of the dependent variable y is
assumed to be an independent random variable distributed normally with mean zero and
constant standard deviation. The errors are not related with each other in successive
observations.
5. The standard deviation and variance of the values of the dependent variable y about
the regression line are constant for all values of the independent variable x within the
range of the sample data.
6. The value of the dependent variable cannot be estimated for a value of an independent
variable lying outside the range of values in the sample data.

14.6 PARAMETERS OF SIMPLE LINEAR REGRESSION MODEL

The fundamental aim of regression analysis is to determine a regression equation
(line) that makes sense and fits the representative data such that the error
variance is as small as possible. This implies that the regression equation should be
adequate for prediction. J. R. Stockton stated that

 The device used for estimating the values of one variable from the value of the other
consists of a line through the points, drawn in such a manner as to represent the
average relationship between the two variables. Such a line is called line of regression.

The two variables x and y, which are correlated, can be expressed in terms of each
other in the form of straight-line equations called regression equations. Such lines
should provide the best fit of the sample data. The
algebraic expressions of the regression lines are written as:

 The regression equation of y on x


 
y = a + bx
 
is used for estimating the value of y for given values of x.
 Regression equation of x on y

 
x = c + dy
 
is used for estimating the value of x for given values of y.

Remarks

1. When variables x and y are correlated perfectly (either positive or negative) these lines


coincide, that is, we have only one line.
2. The higher the degree of correlation, the nearer the two regression lines are to each other.
3. The lesser the degree of correlation, the farther apart the two regression lines are.
When r = 0, the two lines are at right angles to each other.
4. Two linear regression lines intersect each other at the point of the average value of
variables x and y.

14.6.1 Regression Coefficients

To estimate the values of the population parameters β0 and β1, under certain assumptions, the
fitted or estimated regression equation representing the straight-line regression
model is written as:

ŷ = a + bx

where      ŷ = estimated average (mean) value of the dependent variable y for a given
value of the independent variable x

      a or b0 = y-intercept that represents the average value of y when x = 0

               b = slope of the regression line that represents the expected change in the
value of y for a unit change in the value of x

To determine the value of ŷ for a given value of x, this equation requires the
determination of two unknown constants a (intercept) and b (slope, also called the
regression coefficient). Once these constants have been calculated, the regression line
can be used to compute an estimated value of the dependent variable y for a given value of
the independent variable x.

The particular values of a and b define a specific linear relationship


between x and y based on sample data. The coefficient ‘a’ represents the level of
fitted line (i.e., the distance of the line above or below the origin) when x equals zero,
whereas coefficient ‘b’ represents the slope of the line (a measure of the change in
the estimated value of y for a one-unit change in x).

The regression coefficient ‘b’ is also denoted as:

 byx (regression coefficient of y on x) in the regression line, y = a + bx


 bxy (regression coefficient of x on y) in the regression line, x = c + dy

Properties of regression coefficients

1. The correlation coefficient is the geometric mean of the two regression coefficients, that
is, r = ±√(byx × bxy).
2. If one regression coefficient is greater than one, then the other regression coefficient must be
less than one, because the value of the correlation coefficient r cannot exceed one. However,
both regression coefficients may be less than one.
3. Both regression coefficients must have the same sign (either positive or negative). This
property rules out the case of two regression coefficients with opposite signs.
4. The correlation coefficient has the same sign (either positive or negative) as that of
the two regression coefficients. For example, if byx = −0.664 and bxy = −0.234,
then r = −√(0.664 × 0.234) = −0.394.
5. The absolute value of the arithmetic mean of the regression coefficients bxy and byx is more
than or equal to the absolute value of the correlation coefficient r, that is, |byx + bxy|/2 ≥ |r|.
For example, if byx = −0.664 and bxy = −0.234, then the arithmetic mean of these two values
is (−0.664 − 0.234)/2 = −0.449, and its absolute value 0.449 is more than |r| = 0.394.
6. Regression coefficients are independent of origin but not of scale.

14.7 METHODS TO DETERMINE REGRESSION COEFFICIENTS

Following are the methods to determine the parameters of a fitted regression


equation.
14.7.1 Least Squares Normal Equations

Let ŷ = a + bx be the least-squares line of y on x, where ŷ is the estimated average
value of the dependent variable y. The line that minimizes the sum of the squares of the
deviations of the observed values of y from those predicted is the best-fitting line.
Thus the sum of squared residuals for the least-squares line is a minimum, where

L = Σ (y − ŷ)² = Σ (y − a − bx)²

Differentiating L with respect to a and b and equating to zero, we get, on simplification, the same set of normal equations as equations (14-3):

Σy = na + b Σx
Σxy = a Σx + b Σx²          (14-4)

where n is the total number of pairs of values of x and y in the sample data.

Equations (14-4) are called the normal equations with respect to the regression line
of y on x. After solving these equations for a and b, the values of a and b are
substituted in the regression equation, y = a + bx.

Similarly, if we have the least-squares line x̂ = c + dy of x on y, where x̂ is the estimated
mean value of the dependent variable x, then the normal equations will be

Σx = nc + d Σy
Σxy = c Σy + d Σy²

These equations are solved in the same manner as described above for the
constants c and d.
The values of these constants are substituted in the regression equation x = c + dy.

Alternative method to calculate value of constants

Instead of using the algebraic method to calculate the values of a and b, we may directly
use the results of solving the normal equations.

The gradient ‘b’ (regression coefficient of y on x) and ‘d’ (regression coefficient
of x on y) are calculated as:

b = (n Σxy − Σx Σy)/(n Σx² − (Σx)²)   and   d = (n Σxy − Σx Σy)/(n Σy² − (Σy)²)

Since each regression line passes through the point (x̄, ȳ), the mean values
of x and y, the regression equations can be used to find the values of the
constants a and c as follows:

a = ȳ − b x̄   and   c = x̄ − d ȳ

The calculated values of a, b and c, d are substituted in the regression
lines y = a + bx and x = c + dy respectively to determine the exact relationship.
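These short-cut formulas can be checked numerically; the Python sketch below (with a small hypothetical data set) fits both lines and confirms that each passes through the point of means:

```python
# Fit y = a + b*x and x = c + d*y by the short-cut formulas and verify that
# both lines pass through (x-bar, y-bar); the data are hypothetical.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(p * q for p, q in zip(x, y))
sxx = sum(p * p for p in x)
syy = sum(q * q for q in y)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # regression coefficient of y on x
d = (n * sxy - sx * sy) / (n * syy - sy ** 2)   # regression coefficient of x on y
xbar, ybar = sx / n, sy / n
a = ybar - b * xbar
c = xbar - d * ybar

print(round(b, 2), round(d, 2))                  # 0.6 1.0
print(round(a + b * xbar, 6) == round(ybar, 6))  # True: line of y on x hits the means
print(round(c + d * ybar, 6) == round(xbar, 6))  # True: line of x on y hits the means
```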

Example 14.1: Use least squares regression line to estimate the increase in sales
revenue expected from an increase of 7.5 per cent in advertising expenditure.

Firm   Annual Percentage Increase in   Annual Percentage Increase in
       Advertising Expenditure         Sales Revenue

A      1                               1
B      3                               2
C      4                               2
D      6                               4
E      8                               6
F      9                               8
G      11                              8
H      14                              9

Solution: Assume sales revenue (y) is dependent on advertising expenditure (x).


Calculations for the regression line, using the following normal equations, are shown in Table 14.1:

Σy = na + b Σx
Σxy = a Σx + b Σx²

Table 14.1: Calculation for Normal Equations

Approach 1 (Normal Equations): Substituting the totals from Table 14.1 (n = 8, Σx = 56, Σy = 40, Σxy = 373, Σx² = 524) gives

40 = 8a + 56b
373 = 56a + 524b

Solving these equations, we get

a = 0.072 and b = 0.704


 

Substituting these values in the regression equation

y = a + bx = 0.072 + 0.704x

Since x and y are measured in percentage units, an increase of 7.5 per cent in
advertising expenditure corresponds to x = 7.5, and the estimated increase in
sales revenue will be

y = 0.072 + 0.704 (7.5) = 5.352, i.e. about 5.35 per cent

Approach 2 (Short-cut method):

The gradient ‘b’ is calculated as:

b = (n Σxy − Σx Σy)/(n Σx² − (Σx)²) = (8 × 373 − 56 × 40)/(8 × 524 − 56²) = 744/1056 = 0.704

The intercept ‘a’ on the y-axis is calculated as:

a = ȳ − b x̄ = 5 − 0.704 × 7 = 0.072

Substituting the values of a = 0.072 and b = 0.704 in the regression equation, we get

y = a + bx = 0.072 + 0.704x

For x = 7.5, we have y = 0.072 + 0.704 (7.5) = 5.352, i.e. an estimated increase of about 5.35 per cent in sales revenue.
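The arithmetic of this example can be verified directly in Python. Since the tabulated values are percentage increases, a 7.5 per cent increase corresponds to x = 7.5 in the same units:

```python
# Example 14.1: percentage increases in advertising expenditure (x)
# and sales revenue (y), kept in percent units throughout.
x = [1, 3, 4, 6, 8, 9, 11, 14]
y = [1, 2, 2, 4, 6, 8, 8, 9]
n = len(x)

sx, sy = sum(x), sum(y)
sxy = sum(p * q for p, q in zip(x, y))
sxx = sum(p * p for p in x)

b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)   # = 744/1056, about 0.7045
a = sy / n - b * sx / n                          # about 0.068 (0.072 if b is rounded first)
estimate = a + b * 7.5                           # expected % increase in sales revenue
print(round(b, 3), round(a, 3), round(estimate, 2))   # 0.705 0.068 5.35
```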

Example 14.2: The owner of a small garment shop is hopeful that his sales are
rising significantly week by week. Treating the sales for the previous six weeks as a
typical example of this rising trend, he recorded them in Rs 1000’s and analysed the
results.

Fit a linear regression equation to suggest to him the weekly rate at which his sales
are rising and use this equation to estimate expected sales for the 7th week.

Solution: Assume sales (y) is dependent on weeks (x). Then the normal equations
for the regression equation y = a + bx are written as:

Σy = na + b Σx
Σxy = a Σx + b Σx²
Calculations for sales during various weeks are shown in Table 14.2.

Table 14.2: Calculations of Normal Equations

The gradient ‘b’ is calculated as:

 
 

The intercept ‘a’ on the y-axis is calculated as

Substituting the values a = 2.64 and b = 0.025 in the regression equation, we have

y = 2.64 + 0.025x

For the 7th week (x = 7), y = 2.64 + 0.025 (7) = 2.815. Hence the expected sales during the 7th week are likely to be Rs 2.815 thousand, i.e. Rs 2815.


14.7.2 Deviations Method

Calculations for the least-squares normal equations become lengthy and tedious when
the values of x and y are large. Thus the following two methods may be used to reduce
the computational time.

(a) Deviations Taken from Actual Mean Values of x and y If deviations of the
actual values of the variables x and y are taken from their mean values x̄ and ȳ, then the
regression equations can be written as:

 Regression equation of y on x

y − ȳ = byx (x − x̄)

where byx = regression coefficient of y on x. The value of byx can be calculated using the formula

byx = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²

 Regression equation of x on y

x − x̄ = bxy (y − ȳ)

where bxy = regression coefficient of x on y. The value of bxy can be calculated using the formula

bxy = Σ(x − x̄)(y − ȳ) / Σ(y − ȳ)²
(b) Deviations Taken from Assumed Mean Values for x and y If the mean value
of either x or y or both are in fractions, then we may prefer to take deviations of the
actual values of the variables x and y from assumed means.

 Regression equation of y on x

y − ȳ = byx (x − x̄),   where byx = (n Σdx dy − Σdx Σdy) / (n Σdx² − (Σdx)²)

n = number of observations
dx = x – A; A is the assumed mean of x
dy = y – B; B is the assumed mean of y

 Regression equation of x on y

x − x̄ = bxy (y − ȳ),   where bxy = (n Σdx dy − Σdx Σdy) / (n Σdy² − (Σdy)²)

with n, dx and dy defined as above.

(c) Regression Coefficients in Terms of Correlation Coefficient If
deviations are taken from actual mean values, then the values of the regression
coefficients can alternatively be calculated as follows:

byx = r (σy / σx)   and   bxy = r (σx / σy)
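The equivalence of these formulas can be checked numerically in Python; the small data set below is hypothetical, and the standard deviations use divisor n, matching the definition of r used here:

```python
# Verify numerically that b_yx from deviations equals r*(sigma_y/sigma_x),
# that b_xy equals r*(sigma_x/sigma_y), and that r is their geometric mean.
x = [2, 3, 5, 7, 9, 10]
y = [4, 5, 8, 10, 13, 14]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
dx = [p - xbar for p in x]
dy = [q - ybar for q in y]

sdxdy = sum(p * q for p, q in zip(dx, dy))
byx = sdxdy / sum(p * p for p in dx)            # regression coefficient of y on x
bxy = sdxdy / sum(q * q for q in dy)            # regression coefficient of x on y

sig_x = (sum(p * p for p in dx) / n) ** 0.5     # sigma_x (divisor n)
sig_y = (sum(q * q for q in dy) / n) ** 0.5     # sigma_y (divisor n)
r = sdxdy / (n * sig_x * sig_y)                 # correlation coefficient

print(abs(byx - r * sig_y / sig_x) < 1e-12)     # True
print(abs(bxy - r * sig_x / sig_y) < 1e-12)     # True
print(abs(r - (byx * bxy) ** 0.5) < 1e-12)      # True
```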

Example 14.3: The following data relate to the scores obtained by 9 salesmen of a


company in an intelligence test and their weekly sales (in Rs 1000's)

(a) Obtain the regression equation of sales on intelligence test scores of the
salesmen.

(b) If the intelligence test score of a salesman is 65, what would be his expected
weekly sales?           [HP Univ., MCom, 1996]

Solution: Assume weekly sales (y) as dependent variable and test scores (x) as
independent variable. Calculations for the following regression equation are shown
in Table 14.3.

Table 14.3: Calculation for Regression Equation


 

Substituting values in the regression equation, we have

For a test score x = 65 of a salesman, we have

Hence we conclude that the weekly sales are expected to be Rs 53.75 (in Rs 1000's) for
a test score of 65.

Example 14.4: A company is introducing a job evaluation scheme in which all jobs
are graded by points for skill, responsibility, and so on. Monthly pay scales (Rs in
1000's) are then drawn up according to the number of points allocated and other
factors such as experience and local conditions. To date the company has applied
this scheme to 9 jobs:

1. Find the least squares regression line for linking pay scales to points.
2. Estimate the monthly pay for a job graded by 20 points.

Solution: Assume monthly pay (y) as the dependent variable and job grade points
(x) as the independent variable. Calculations for the following regression equation
are shown in Table 14.4.

 
 

Table 14.4: Calculation for Regression Equation

(a) 

Since the mean values x̄ and ȳ are not integers, deviations are taken from
assumed means as shown in Table 14.4.

Substituting values in the regression equation, we have

(b) For job grade point x = 20, the estimated average pay scale is given by

Hence, likely monthly pay for a job with grade points 20 is Rs 5986.
Example 14.5: The following data give the ages and blood pressure of 10 women.

1. Find the correlation coefficient between age and blood pressure.


2. Determine the least squares regression equation of blood pressure on age.
3. Estimate the blood pressure of a woman whose age is 45 years.

[Ranchi Univ. MBA; South Gujarat Univ., MBA, 1997]

Solution: Assume blood pressure (y) as the dependent variable and age (x) as the


independent variable. Calculations for regression equation of blood pressure on age
are shown in Table 14.5.

Table 14.5: Calculation for Regression Equation

(a) Coefficient of correlation between age and blood pressure is given by

 
 

We may conclude that there is a high degree of positive correlation between age and
blood pressure.

(b) The regression equation of blood pressure on age is given by

Substituting these values in the above equation, we have

This is the required regression equation of y on x.

(c) For a woman whose age is 45 years, the estimated average blood pressure will be

Hence, the likely blood pressure of a woman of 45 years is 134.

Example 14.6: The General Sales Manager of Kiran Enterprises—an enterprise


dealing in the sale of readymade men's wear—is toying with the idea of increasing
his sales to Rs 80,000. On checking the records of sales during the last 10 years, it
was found that the annual sales proceeds and advertisement expenditure were highly
correlated to the extent of 0.8. It was further noted that the annual average sales have
been Rs 45,000 and the annual average advertisement expenditure Rs 30,000, with a
variance of 1600 in sales and 625 in advertisement expenditure respectively.

In view of the above, how much expenditure on advertisement would you suggest
the General Sales Manager of the enterprise to incur to meet his target of sales?
[Kurukshetra Univ., MBA, 1998]

Solution: Assume advertisement expenditure (y) as the dependent variable and
sales (x) as the independent variable. Then the regression equation of advertisement
expenditure on sales is given by

y − ȳ = r (σy/σx)(x − x̄)

Given r = 0.8, σx = 40, σy = 25, x̄ = 45,000, ȳ = 30,000. Substituting these values in the
above equation, we have

y − 30,000 = 0.8 (25/40)(x − 45,000)   or   y = 0.5x + 7500

When the sales target is fixed at x = 80,000, the estimated amount likely to be spent
on advertisement would be

y = 0.5 (80,000) + 7500 = Rs 47,500

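A quick numerical check of this example in Python, using byx = r σy/σx with the given summary figures:

```python
# Example 14.6: regression of advertisement expenditure (y) on sales (x)
# from the summary statistics given in the problem.
r = 0.8
sigma_x, sigma_y = 40, 25          # sd of sales and of advertisement expenditure
xbar, ybar = 45_000, 30_000

byx = r * sigma_y / sigma_x        # 0.8 * 25/40 = 0.5

def ad_budget(sales):
    """Estimated advertisement expenditure for a given sales figure."""
    return ybar + byx * (sales - xbar)

print(ad_budget(80_000))           # budget suggested for a sales target of 80,000
```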
Example 14.7: You are given the following information about advertising


expenditure and sales:

  Advertisement(x) Sales (y)


(Rs in lakh) (Rs in lakh)

Arithmetic mean,  10 90

Standard deviation,σ 3 12

Correlation coefficient = 0.8

1. Obtain the two regression equations.


2. Find the likely sales when advertisement budget is Rs 15 lakh.
3. What should be the advertisement budget if the company wants to attain sales target of
Rs 120 lakh?
[Kumaon Univ., MBA, 2000, MBA, Delhi Univ., 2002]

Solution: (a) The regression equation of x on y is given by

x − x̄ = r (σx/σy)(y − ȳ)

Given x̄ = 10, r = 0.8, σx = 3, σy = 12, ȳ = 90. Substituting these values in the above
regression equation, we have

x − 10 = 0.8 (3/12)(y − 90)   or   x = 0.2y − 8

The regression equation of y on x is given by

y − ȳ = r (σy/σx)(x − x̄),   that is, y − 90 = 0.8 (12/3)(x − 10)   or   y = 3.2x + 58
(b) Substituting x = 15 in the regression equation of y on x, the likely average sales
volume would be

y = 3.2 (15) + 58 = 106

Thus the likely sales for an advertisement budget of Rs 15 lakh is Rs 106 lakh.

(c) Substituting y = 120 in the regression equation of x on y, the likely
advertisement budget to attain the desired sales target of Rs 120 lakh would be

x = 0.2 (120) − 8 = 16

Hence an advertisement budget of Rs 16 lakh should be sufficient to attain
the sales target of Rs 120 lakh.
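The whole example can be reproduced in a few lines of Python from the summary statistics:

```python
# Example 14.7: both regression lines from the given means, sds and r.
r = 0.8
xbar, ybar = 10, 90            # advertisement (x) and sales (y), Rs in lakh
sigma_x, sigma_y = 3, 12

byx = r * sigma_y / sigma_x    # 0.8 * 12/3 = 3.2, for the line of y on x
bxy = r * sigma_x / sigma_y    # 0.8 * 3/12 = 0.2, for the line of x on y

def likely_sales(budget):              # regression of y on x
    return ybar + byx * (budget - xbar)

def required_budget(sales_target):     # regression of x on y
    return xbar + bxy * (sales_target - ybar)

print(round(likely_sales(15), 2))      # 106.0 -> Rs 106 lakh
print(round(required_budget(120), 2))  # 16.0  -> Rs 16 lakh
```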

Example 14.8: In a partially destroyed laboratory record of an analysis of


regression data, the following results only are legible:

Variance of x = 9

Regression equations: 8x – 10y + 66 = 0 and 40x – 18y = 214

Find on the basis of the above information:

1. The mean values of x and y


2. Coefficient of correlation between x and y and
3. Standard deviation of y.

[Pune Univ., MBA, 1996; CA May 1999]

Solution: (a) Since the two regression lines always intersect at the point (x̄, ȳ)
representing the mean values of the variables involved, we solve the given regression
equations to obtain the mean values x̄ and ȳ as shown below:

8x − 10y = −66
40x − 18y = 214

Multiplying the first equation by 5 and subtracting it from the second, we have

32y = 544   or   ȳ = 17

Substituting this value of y in the first equation, we get 8x = 10(17) − 66 = 104, that is, x̄ = 13.

(b) To find the correlation coefficient r between x and y, we need to determine the
regression coefficients byx and bxy.

Rewriting the given regression equations so that the coefficient of the
dependent variable is less than one in at least one equation:

10y = 8x + 66,   that is, byx = 8/10 = 0.80

40x = 18y + 214,   that is, bxy = 18/40 = 0.45

Hence the coefficient of correlation r between x and y is given by

r = √(byx × bxy) = √(0.80 × 0.45) = √0.36 = 0.60

(c) To determine the standard deviation of y, consider the formula byx = r (σy/σx). Since the variance of x is 9, σx = 3, and therefore

σy = byx σx / r = (0.80 × 3)/0.60 = 4
Example 14.9: There are two series of index numbers, P for price index and S for
stock of a commodity. The mean and standard deviation of P are 100 and 8 and of S
are 103 and 4 respectively. The correlation coefficient between the two series is 0.4.
With these data, work out a linear equation to read off values of P for various values
of S. Can the same equation be used to read off values of S for various values of P?

Solution: The regression equation to read off values of P for various values of S is
given by

P − P̄ = r (σP/σS)(S − S̄)

Given P̄ = 100, S̄ = 103, σP = 8, σS = 4, r = 0.4. Substituting these values in the above
equation, we have

P − 100 = 0.4 (8/4)(S − 103)   or   P = 0.8S + 17.6

This equation cannot be used to read off values of S for various values of P. To
read off values of S for various values of P we use the other regression equation, of the
form:

S − S̄ = r (σS/σP)(P − P̄)

Substituting the given values in this equation, we have

S − 103 = 0.4 (4/8)(P − 100)   or   S = 0.2P + 83
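Both lines of this example can be generated in Python from the summary figures:

```python
# Example 14.9: regression lines for P (price index) and S (stock index).
r = 0.4
pbar, sbar = 100, 103
sigma_p, sigma_s = 8, 4

b_ps = r * sigma_p / sigma_s       # coefficient of P on S: 0.4 * 8/4 = 0.8
b_sp = r * sigma_s / sigma_p       # coefficient of S on P: 0.4 * 4/8 = 0.2

def P_from_S(s):                   # P = 0.8*S + 17.6
    return pbar + b_ps * (s - sbar)

def S_from_P(p):                   # S = 0.2*P + 83
    return sbar + b_sp * (p - pbar)

# Each line passes through the point of means (100, 103).
print(P_from_S(103), S_from_P(100))   # 100.0 103.0
```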

Example 14.10: The two regression lines obtained in a correlation analysis of 60


observations are:

What is the correlation coefficient and what is its probable error? Show that the ratio
of the coefficient of variability of x to that of y is 5/24. What is the ratio of variances
of x and y?

Solution: Rewriting the regression equations:

x = (6/5)y + …,   that is, bxy = 6/5 = 1.20

y = (768/1000)x + …,   that is, byx = 768/1000 = 0.768

Hence r² = bxy × byx = 1.20 × 0.768 = 0.9216

Since both bxy and byx are positive, the correlation coefficient is positive and hence r =
0.96. Its probable error is 0.6745 (1 − r²)/√n = 0.6745 × 0.0784/√60 = 0.0068.

Solving the given regression equations simultaneously, we get x̄ = 6 and ȳ = 1, because
the regression lines pass through the point (x̄, ȳ). Further, σx²/σy² = bxy/byx =
1.20/0.768 = 25/16, which is the ratio of the variances; hence σx/σy = 5/4, and the ratio
of the coefficients of variation is (σx/x̄)/(σy/ȳ) = (5/4) × (1/6) = 5/24.

14.7.3 Regression Coefficients for Grouped Sample Data

The method of finding the regression coefficients bxy and byx is a little different from the method discussed earlier when the data set is grouped or classified into a frequency distribution of either variable x or y or both. The values of bxy and byx are calculated using the formulae:

bxy = (h/k) × [n Σf dxdy – (Σf dx)(Σf dy)]/[n Σf dy² – (Σf dy)²]

byx = (k/h) × [n Σf dxdy – (Σf dx)(Σf dy)]/[n Σf dx² – (Σf dx)²]

where h = width of the class interval of sample data on x variable
      k = width of the class interval of sample data on y variable
and dx, dy are the step deviations of the mid-values of the x and y classes, with f the corresponding cell frequency.
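The step-deviation formula gives exactly the same coefficient as the ordinary raw-data formula applied to the expanded observations; a sketch verifying this on a small hypothetical frequency table (the class limits, frequencies and assumed means below are illustrative, not the book's Table 14.6):

```python
# Step-deviation b_yx on a bivariate frequency table, cross-checked against
# the raw-data formula on the expanded (x, y) pairs. Data are hypothetical.

x_mids = [10, 20, 30]          # mid-values of the x classes (width h = 10)
y_mids = [5, 15]               # mid-values of the y classes (width k = 10)
freq = [[2, 1],                # freq[i][j] = frequency of cell (x_mids[i], y_mids[j])
        [3, 4],
        [1, 2]]
h, k = 10, 10
Ax, Ay = 20, 5                 # assumed means for the step deviations

n = Sdx = Sdy = Sdx2 = Sdxdy = 0
for i, xm in enumerate(x_mids):
    for j, ym in enumerate(y_mids):
        f = freq[i][j]
        dx, dy = (xm - Ax) // h, (ym - Ay) // k   # step deviations
        n += f; Sdx += f * dx; Sdy += f * dy
        Sdx2 += f * dx * dx; Sdxdy += f * dx * dy

b_yx = (k / h) * (n * Sdxdy - Sdx * Sdy) / (n * Sdx2 - Sdx ** 2)

# Cross-check: expand the table into raw pairs and use the ungrouped formula.
pairs = [(xm, ym) for i, xm in enumerate(x_mids)
                  for j, ym in enumerate(y_mids)
                  for _ in range(freq[i][j])]
Sx = sum(p[0] for p in pairs); Sy = sum(p[1] for p in pairs)
Sxy = sum(p[0] * p[1] for p in pairs); Sx2 = sum(p[0] ** 2 for p in pairs)
b_yx_raw = (n * Sxy - Sx * Sy) / (n * Sx2 - Sx ** 2)

assert abs(b_yx - b_yx_raw) < 1e-12    # grouped and raw formulas agree
print(round(b_yx, 4))
```

The factor k/h (or h/k for bxy) simply undoes the change of scale introduced by working in step-deviation units.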

Example 14.11: The following bivariate frequency distribution relates to sales


turnover (Rs in lakh) and money spent on advertising (Rs in 1000's). Obtain the two
regression equations

Estimate (a) the sales turnover corresponding to advertising budget of Rs 1,50,000,


and (b) the advertising budget to achieve a sales turnover of Rs 200 lakh.

Solution: Let x and y represent sales turnover and advertising budget respectively. Then the regression equation for estimating the sales turnover (x) on advertising budget (y) is expressed as:

Table 14.6: Calculations for Regression Coefficients


 

Similarly, the regression equation for estimating the advertising budget (y) on sales turnover (x) is written as:

The calculations for regression coefficients bxy and byx are shown in Table 14.6.

 
 

Substituting these values in the two regression equations, we get

1. Regression equation of sales turnover (x) on advertising budget (y) is:


 

 
For y = 150, we have x = 116.65 – 0.414 × 150 = Rs 54.55 lakh
2. Regression equation of advertising budget (y) on sales turnover (x) is:

 
For x = 200, we have y = 76.457 – 0.04 (200) = Rs 68.457
thousand.

Self-Practice Problems 14A


14.1 The following calculations have been made for prices of twelve stocks (x) at the
Calcutta Stock Exchange on a certain day along with the volume of sales in
thousands of shares (y). From these calculations find the regression equation of
price of stocks on the volume of sales of shares.

[Rajasthan Univ., MCom, 1995]


14.2 A survey was conducted to study the relationship between expenditure (in Rs)
on accommodation (x) and expenditure on food and entertainment (y) and the
following results were obtained:

  Mean   Standard Deviation

• Expenditure on accommodation 173 63.15

• Expenditure on food and entertainment 47.8 22.98

Coefficient of correlation r = 0.57  

    Write down the regression equation and estimate the expenditure on food and
entertainment if the expenditure on accommodation is Rs 200.
[Bangalore Univ., BCom, 1998]

14.3 The following data give the experience of machine operators and their
performance ratings given by the number of good parts turned out per 100 pieces:

    Calculate the regression lines of performance ratings on experience and estimate


the probable performance if an operator has 7 years experience.
[Jammu Univ., MCom; Lucknow Univ., MBA, 1996]

14.4 A study of prices of a certain commodity at Delhi and Mumbai yield the
following data:
  Delhi Mumbai

• Average price per kilo (Rs) 2.463 2.797

• Standard deviation 0.326 0.207

• Correlation coefficient between prices at Delhi and Mumbai r = 0.774    


Estimate from the above data the most likely price (a) at Delhi corresponding to
the price of Rs 2.334 per kilo at Mumbai (b) at Mumbai corresponding to the
price of 3.052 per kilo at Delhi.
14.5 The following table gives the aptitude test scores and productivity indices of 10
workers selected at random:
Aptitude scores (x)     : 60 62 65 70 72 48 53 73 65 82
Productivity index (y) : 68 60 62 80 85 40 52 62 60 81
Calculate the two regression equations and estimate (a) the productivity index of
a worker whose test score is 92, (b) the test score of a worker whose productivity
index is 75.
[Delhi Univ., MBA, 2001]

14.6 A company wants to assess the impact of R&D expenditure (Rs in 1000s) on its
annual profit; (Rs in 1000's). The following table presents the information for the
last eight years:
Year R & D expenditure Annual profit

1991 9 45
1992 7 42
1993
1994
5 41
1995 10 60
1996 4 30
1997 5 34
1998 3 25
2 20

Estimate the regression equation and predict the annual profit for the year 2002
for an allocated sum of Rs 1,00,000 as R&D expenditure.
[Jodhpur Univ., MBA, 1998]

14.7 Obtain the two regression equations from the following bivariate frequency
distribution:

 
Estimate (a) the sales corresponding to advertising expenditure of Rs 50,000,
(b) the advertising expenditure for a sales revenue of Rs 300 lakh, (c) the
coefficient of correlation.
[Delhi Univ., MBA, 2002]

14.8 The personnel manager of an electronic manufacturing company devises a


manual test for job applicants to predict their production rating in the assembly
department. In order to do this he selects a random sample of 10 applicants. They
are given the test and later assigned a production rating. The results are as follows:

Fit a linear least squares regression equation of production rating on test score.
[Delhi Univ., MBA, 200]

14.9 Find the regression equation showing the capacity utilization on production


from the following data:
  Average   Standard Deviation

• Production (in lakh units) : 35.6 10.5

• Capacity utilization (in percentage) : 84.8 8.5

• Correlation coefficient r = 0.62    

Estimate the production when the capacity utilization is 70 per cent.


[Delhi Univ., MBA, 1997; Pune Univ., MBA, 1998]

14.10 Suppose that you are interested in using past expenditure on R&D by a firm to
predict current expenditures on R&D. You got the following data by taking a random
sample of firms, where x is the amount spent on R&D (in lakh of rupees) 5 years ago
and y is the amount spent on R&D (in lakh of rupees) in the current year:

x : 30 50 20 80 10 20 20 40
y : 50 80 30 110 20 20 40 50
 

1. Find the regression equation of y on x.


2. If a firm is chosen randomly and x = 10, can you use the regression to predict the value
of y? Discuss.

[Madurai-Kamraj Univ., MBA, 2000]

14.11 The following data relates to the scores obtained by a salesmen of a company


in an intelligence test and their weekly sales (in Rs. 1000's):

1. Obtain the regression equation of sales on intelligence test scores of the salesmen.
2. If the intelligence test score of a salesman is 65, what would be his expected weekly sales?

[HP Univ., M.com., 1996]

14.12 Two random variables have the regression equations:

3x + 2y – 26 = 0 and 6x + y – 31 = 0

1. Find the mean values of x and y and coefficient of correlation between x and y.


2. If the variance of x is 25, then find the standard deviation of y from the data.

[MD Univ., M.Com., 1997; Kumaun Univ., MBA, 2001]

14.13 For a given set of bivariate data, the following results were obtained:

x̄ = 53.2, ȳ = 27.9,

Regression coefficient of y on x = – 1.5, and Regression coefficient of x on y = – 0.2.

Find the most probable value of y when x = 60.
14.14 In trying to evaluate the effectiveness of its advertising campaign, a firm
compiled the following information:
Calculate the regression equation of sales on advertising expenditure. Estimate
the probable sales when advertisement expenditure is Rs. 60 thousand.
Year Adv. expenditure (Rs. 1000's) Sales (in lakhs Rs)

1996 12 5.0

1997 15 5.6

1998 17 5.8

1999 23 7.0

2000 24 7.2

2001 38 8.8

2002 42 9.2

2003 48 9.5
[Bharathidasan Univ., MBA, 2003]

Hints and Answers


14.1 x̄ = Σx/n = 580/12 = 48.33; ȳ = Σy/n = 370/12 = 30.83

Regression equation of x on y:

x – x̄ = bxy(y – ȳ)
x – 48.33 = –1.102(y – 30.83)
or x = 82.304 – 1.102y

14.2 Given x̄ = 173, ȳ = 47.8, σx = 63.15, σy = 22.98, and r = 0.57

Regression equation of food and entertainment (y) on accommodation (x) is given by

y – ȳ = r(σy/σx)(x – x̄) or y = 11.917 + 0.207x

For x = 200, we have y = 11.917 + 0.207(200) = 53.317


14.3 Let the experience and performance rating be represented by x and y respectively.
x̄ = Σx/n = 80/8 = 10; ȳ = Σy/n = 648/8 = 81

Regression equation of y on x

14.4 Let price at Mumbai and Delhi be represented by x and y, respectively


(a) Regression equation of y on x

For x = Rs 2.334, the price at Delhi would be y = Rs 1.899.


(b) Regression on equation of x on y
 

For y = Rs 3.052, the price at Mumbai would be x = Rs 3.086.
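Both estimates in the hint to Problem 14.4 follow directly from the regression-line formula based on means, standard deviations and r; a quick check:

```python
# Verifying the hint for Problem 14.4 from the given summary statistics.

mx, my = 2.797, 2.463    # mean price per kilo at Mumbai (x) and Delhi (y)
sx, sy = 0.207, 0.326    # standard deviations
r = 0.774                # correlation coefficient

# (a) Delhi price for a Mumbai price of Rs 2.334 (regression of y on x)
y_est = my + r * (sy / sx) * (2.334 - mx)
# (b) Mumbai price for a Delhi price of Rs 3.052 (regression of x on y)
x_est = mx + r * (sx / sy) * (3.052 - my)

print(round(y_est, 3), round(x_est, 3))   # 1.899 and 3.086
```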


14.5 Let aptitude score and productivity index be represented
by x and y respectively.

(a) Regression equation of x on y

(b)

14.6 Let R&D expenditure and annual profit be denoted by x and y respectively

 
 

Regression equation of annual profit on R&D expenditure

For x = Rs 1,00,000 as R&D expenditure, we have from above equation y = Rs


439.763 as annual profit.
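The hint's figure can be reproduced from the raw pairs of Problem 14.6; a sketch (the tiny difference from the printed 439.763 is rounding in the hand calculation):

```python
# Recomputing the hint for Problem 14.6 from the raw (x, y) pairs.

x = [9, 7, 5, 10, 4, 5, 3, 2]           # R&D expenditure (Rs in 1000s)
y = [45, 42, 41, 60, 30, 34, 25, 20]    # annual profit (Rs in 1000s)
n = len(x)

b = (n * sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(a * a for a in x) - sum(x) ** 2)
a0 = sum(y) / n - b * sum(x) / n        # intercept

profit = a0 + b * 100                   # x = 100, i.e. Rs 1,00,000 of R&D
print(round(profit, 3))                 # ~439.75, matching the hint up to rounding
```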
14.7 Let sales revenue and advertising expenditure be denoted
by x and y respectively

(a) Regression equation of x on y

(b) Regression equation of y on x


 

(c) 
14.8 Let test score and production rating be denoted by x and y respectively.

Regression equation of production rating (y) on test score (x) is given by

14.9 Let production and capacity utilization be denoted by x and y, respectively.


(a) Regression equation of capacity utilization (y) on production (x)

 
(b) Regression equation of production (x) on capacity utilization (y)

Hence the estimated production is 2,42,647 units when the capacity utilization is
70 per cent.
14.10 x̄ = Σx/n = 270/8 = 33.75; ȳ = Σy/n = 400/8 = 50

Regression equation of y on x

14.11 Let intelligence test score be denoted by x and weekly sales by y

Regression equation of y on x :

 
 

14.12 (a) Solving the two regression lines:

3x + 2y = 26 and 6x + y = 31

we get the mean values x̄ = 4 and ȳ = 7
(b) Rewriting the regression lines as follows:

3x + 2y = 26 or y = 13 – (3/2)x, So byx = – 3/2

6x + y = 31 or x = 31/6 – (1/6)y, So bxy = – 1/6

Correlation coefficient, r = –√(byx × bxy) = –√(3/2 × 1/6) = – 1/2

Given, Var(x) = 25, so σx = 5. Calculate σy using the formula byx = r(σy/σx):

σy = byx σx/r = (– 3/2)(5)/(– 1/2) = 15
14.13 The regression equation of y on x is stated as: y – ȳ = byx(x – x̄)

Given, x̄ = 53.20, ȳ = 27.90, byx = – 1.5, bxy = – 0.2

Thus y – 27.90 = – 1.5(x – 53.20) or y = 107.70 – 1.5x
For x = 60, we have y = 107.70 – 1.5(60) = 17.7
 

14.14 Let advertising expenditure and sales be denoted by x and y respectively.

Thus regression equation of y on x is:

When x = 60, the estimated value of y = 3.869 + 0.125(60) = 11.369
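The coefficients in the hint to 14.14 can be recomputed from the table; a sketch (the exact least-squares values differ slightly from the printed rounded coefficients, which is hand-calculation rounding):

```python
# Recomputing the hint for Problem 14.14 from the table of the problem.

x = [12, 15, 17, 23, 24, 38, 42, 48]           # advertising expenditure (Rs 1000s)
y = [5.0, 5.6, 5.8, 7.0, 7.2, 8.8, 9.2, 9.5]   # sales (Rs in lakh)
n = len(x)

b = (n * sum(a * c for a, c in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(a * a for a in x) - sum(x) ** 2)
a0 = sum(y) / n - b * sum(x) / n               # intercept

print(round(a0, 3), round(b, 3), round(a0 + b * 60, 3))   # ~3.782, 0.127, 11.41
```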

14.8 STANDARD ERROR OF ESTIMATE AND PREDICTION INTERVALS

The pattern of dot points on a scatter diagram is an indicator of the relationship


between two variables x and y. Wide scatter or variation of the dot points about the
regression line represents a poor relationship. But a very close scatter of dot points
about the regression line represents a close relationship between two variables. The
variability in observed values of dependent variable y about the regression line is
measured in terms of residuals. A residual is defined as the difference between an
observed value of dependent variable y and its estimated (or fitted) value   
determined by regression equation for a given value of the independent variable x.
The residual about the regression line is given by

Residual ei = yi –  i

 
The residual values ei are plotted on a diagram with respect to the least squares
regression line   = a + bx. These residual values represent error of estimation for
individual values of the dependent variable and are used to estimate the variance σ² of
the error term. In other words, residuals are used to estimate the amount of
variation in the dependent variable with respect to least squares regression line.
Here it should be noted that the variations are not the variations (deviations) of
observations from the mean value in the sample data set, rather these variations are
the vertical distances of every observation (dot point) from the least squares line as
shown in Fig. 14.3.

Since sum of the residuals is zero, therefore it is not possible to determine the total
amount of error by summing the residuals. This zero-sum characteristic of residuals
can be avoided by squaring the residuals and then summing them. That is

This quantity is called the sum of squares of errors (SSE).
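These ideas are easy to compute directly; a minimal sketch with made-up numbers showing that the residuals about a least squares fit sum to zero while their squares give SSE:

```python
# Residuals about a fitted least squares line y-hat = a + b*x, and SSE.
# Data are illustrative, not from the text.

x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
n = len(x)

# Least squares slope and intercept.
b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = sum(y) / n - b * sum(x) / n

residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)   # sum of squares of errors

print(round(sum(residuals), 10))       # zero-sum property of residuals
print(round(sse, 4))
```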

The estimate of the variance σ² of the error term, denoted S²yx, is obtained as follows:

The denominator, n – 2 represents the error or residual degrees of freedom and is


determined by subtracting from sample size n the number of parameters β0 and
β1 that are estimated by the sample parameters a and b in the least squares equation.
The subscript ‘yx’ indicates that the standard deviation is of dependent variable y,
given (or conditional) upon independent variable x.

The standard error of estimate Syx, also called the standard deviation of the error term, measures the variability of the observed values around the regression line, i.e. the amount by which the observed y values (dot points) deviate from the fitted ŷ values. In other words, Syx is based on the deviations of the sample observations of y-values from the least squares line, i.e. the estimated regression line of ŷ values. The standard deviation of error about the least squares line is defined as:

 
 

Figure 14.3 Residuals

To simplify the calculations of Syx, generally the following formula is used

The variance S²yx measures how well the least squares line fits the sample y-values. A large variance and standard error of estimate indicate a large amount of scatter or dispersion of the dot points around the line. The smaller the value of Syx, the closer the dot points (y-values) fall around the regression line, the better the line fits the data and the better it describes the average relationship between the two variables. When all dot points fall on the line, the value of Syx is zero, and the relationship between the two variables is perfect.

A smaller variance about the regression line is considered useful in predicting the
value of a dependent variable y. In actual practice, some variability is always left
over about the regression line. It is important to measure such variability due to the
following reasons:

1. This value provides a way to determine the usefulness of the regression line in predicting
the value of the dependent variable.
2. This value can be used to construct interval estimates of the dependent variable.
3. Statistical inferences can be made about other components of the problem.

Figure 14.4 displays the distribution of conditional average values of y about a least


squares regression line for given values of the independent variable x. Suppose the amount of deviation in the values of y, given any particular value of x, follows a normal distribution. Since the average value of y changes with the value of x, we have a different normal distribution of y-values for every value of x, each having the same standard deviation. When a relationship between the two variables x and y exists, the standard error of estimate is less than the standard deviation of all the y-values in the population computed about their mean.

Based on the assumptions of regression analysis, we can describe sampling


properties of the sample estimates such as a, b, and Syx, as these vary from sample to
sample. Such knowledge is useful in making statistical inferences about the
relationship between the two variables x and y.

Figure 14.4 Regression Line Showing the Error Variance

The standard error of estimate can also be used to determine an approximate


interval estimate based on sample data (n < 30) for the value of the dependent
variable y for a given value of the independent variable x as follows:

where value of t is obtained using t-distribution table based upon a chosen


probability level. The interval estimate is also called a prediction interval.
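A sketch of such a prediction interval on made-up data (the t value below is the usual table entry t(0.025, 3 df) = 3.182; in practice it would be read from a t-distribution table for the chosen probability level and n − 2 degrees of freedom):

```python
import math

# Approximate 95% prediction interval y-hat +/- t * S_yx for a given x,
# using S_yx = sqrt(SSE / (n - 2)). Data are illustrative, not from the text.

x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
n = len(x)

b = (n * sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y)) / \
    (n * sum(xi ** 2 for xi in x) - sum(x) ** 2)
a = sum(y) / n - b * sum(x) / n

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
s_yx = math.sqrt(sse / (n - 2))    # standard error of estimate, n - 2 df

t = 3.182                          # t(0.025, 3 df) from a t-table
x0 = 4                             # value of x at which to predict y
y_hat = a + b * x0
lower, upper = y_hat - t * s_yx, y_hat + t * s_yx
print(round(y_hat, 3), round(lower, 3), round(upper, 3))
```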

Example 14.12: The following data relate to advertising expenditure (Rs in lakh)


and their corresponding sales (Rs in crore)

 
 

1. Find the equation of the least squares line fitting the data.
2. Estimate the value of sales corresponding to advertising expenditure of Rs 30 lakh.
3. Calculate the standard error of estimate of sales on advertising expenditure.

Solution: Let the advertising expenditure be denoted by x and sales by y.

(a) The calculations for the least squares line are shown in Table 14.7

Table 14.7: Calculations for Least Squares Line

(a) Regression equation of y on x is ŷ = a + bx, where parameter a = 8.608 and b = 0.712, i.e. ŷ = 8.608 + 0.712x.

Table 14.8 gives the fitted values and the residuals for the data in Table 14.7. The
fitted values are obtained by substituting the value of x into the regression equation
(equation for the least squares line). For example, 8.608 + 0.712(10) = 15.728.
The residual is equal to the actual value minus fitted value. The residuals indicate
how well the least squares line fits the actual data values.

Table 14.8: Fitted Values and Residuals for Sample Data

(b) The least squares equation obtained in part (a) may be used to estimate the sales
turnover corresponding to the advertising expenditure of Rs 30 lakh as:

ŷ = 8.608 + 0.712x = 8.608 + 0.712 (30) = Rs 29.968 crore

(c) Calculations for standard error of estimate Sy.x of sales (y) on advertising
expenditure (x) are shown in Table 14.9.

Table 14.9: Calculations for Standard Error of Estimate

14.8.1 Coefficient of Determination: Partitioning of Total Variation

The objective of regression analysis is to develop a regression model that best fits the sample data, so that the residual variance S²yx is as small as possible. But the value of S²yx depends on the scale with which the sample y-values are measured. This drawback restricts its interpretation unless we consider the units in which the y-values are measured. Thus, we need another measure of fit
called coefficient of determination that is not affected by the scale with which the
sample y-values are measured. It is the proportion of variability of the dependent
variable, y accounted for or explained by the independent variable, x, i.e. it
measures how well (i.e. strength) the regression line fits the data. The coefficient of
determination is denoted by r2 and its value ranges from 0 to 1. A particular r2 value
should be interpreted as high or low depending upon the use and context in which
the regression model was developed. The coefficient of determination is given by

where SST = total sum of squared deviations (or total variation) of the sampled response variable y-values from the mean value of y

      SSE = sum of squares of error, or unexplained variation, in the response variable y-values about the least squares line due to sampling errors, i.e. it measures the residual variation in the data that is not explained by the predictor variable x

      SSR = sum of squares of regression, or explained variation, in the sample values of the response variable y accounted for by variation among the x-values
          = SST – SSE

The three variations associated with the regression analysis of a data set are shown in Fig. 14.5. Thus

r² = 1 – (variation of response variable y-values from the least squares line)/(total variation of response variable y-values)

where r² = fraction of the total variation that is explained or accounted for.

Figure 14.5 Relationship Between Three Types of Variations

Since this formula for r² is not convenient to use, an easier computational formula for the sample coefficient of determination is given by

For example, the coefficient of determination that indicates the extent of


relationship between sales revenue (y) and advertising expenditure (x) is calculated
as follows from Example 14.1:
 

The value r2 = 0.9352 indicates that 93.52% of the variance in sales revenue is
accounted for or statistically explained by advertising expenditure.
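The three equivalent routes to r² can be shown side by side; a sketch with made-up numbers demonstrating that SSR/SST, 1 − SSE/SST and the square of the correlation coefficient all agree:

```python
import math

# Three equivalent computations of the coefficient of determination.
# Data are illustrative, not from the text.

x = [2, 4, 6, 8, 10]
y = [3, 7, 5, 11, 14]
n = len(x)

xm, ym = sum(x) / n, sum(y) / n
sxy = sum((a - xm) * (c - ym) for a, c in zip(x, y))
sxx = sum((a - xm) ** 2 for a in x)
syy = sum((c - ym) ** 2 for c in y)

b = sxy / sxx                   # least squares slope of y on x
a0 = ym - b * xm                # intercept

sst = syy                                                     # total variation
sse = sum((c - (a0 + b * a)) ** 2 for a, c in zip(x, y))      # unexplained
ssr = sst - sse                                               # explained

r2_from_ssr = ssr / sst
r2_from_sse = 1 - sse / sst
r2_from_r = (sxy / math.sqrt(sxx * syy)) ** 2   # square of correlation coeff.

print(round(r2_from_ssr, 4))
```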

A comparison between bivariate correlation and regression, summarized in Table 14.10, provides further insight into the relationship between the two variables x and y in a data set.

Table 14.10: Comparison between Linear Correlation and Regression

  Correlation Regression

• Measurement level  Interval or ratio scale  Interval or ratio scale

• Nature of variables  Both continuous, and linearly related  Both continuous, and linearly related

• x – y relationship  x and y are symmetric  y is dependent, x is independent; the regression of x on y differs from that of y on x

• Coefficients  rxy = ryx : the correlation between x and y is the same as the correlation between y and x  bxy ≠ byx, in general

• Coefficient of determination  Explains the common variance of x and y  Proportion of variability of x explained by its least-squares regression on y

Conceptual Questions 14A


1.
a. Explain the concept of regression and point out its usefulness in
dealing with business problems.
[Delhi Univ., MBA, 1993]

b. Distinguish between correlation and regression. Also point out the


properties of regression coefficients.
2. Explain the concept of regression and point out its importance in business forecasting.
[Delhi Univ., MBA, 1990, 1998]
3. Under what conditions can there be one regression line? Explain.
[HP Univ., MBA, 1996]

4. Why should a residual analysis always be done as part of the development of a regression
model?
5. What are the assumptions of simple linear regression analysis and how can they be
evaluated?
6. What is the meaning of the standard error of estimate?
7. What is the interpretation of y-intercept and the slope in a regression model?
8. What are regression lines? With the help of an example illustrate how they help in
business decision-making.
[Delhi Univ., MBA, 1998]

9. Point out the role of regression analysis in business decision-making. What are the
important properties of regression coefficients?
[Osmania Univ., MBA; Delhi Univ., MBA, 1999]

10.
a. Distinguish between correlation and regression analysis.
[Dipl in Mgt., AIMA, Osmania Univ., MBA, 1998]

b. The coefficient of correlation and coefficient of determination are


available as measures of association in correlation analysis. Describe
the different uses of these two measures of association.
11. What are regression coefficients? State some of the important properties of regression coefficients.
[Dipl in Mgt., AIMA, Osmania Univ., MBA, 1989]

12. What is regression? How is this concept useful to business forecasting?
[Jodhpur Univ., MBA, 1999]

13. What is the difference between a prediction interval and a confidence interval in regression analysis?
14. Explain what is required to establish evidence of a cause-and-effect relationship between y and x with regression analysis.
15. What technique is used initially to identify the kind of regression model that may be appropriate?
16.
a. What are regression lines? Why is it necessary to consider two lines of regression?
b. In case the two regression lines are identical, prove that the correlation coefficient is either + 1 or – 1. If two variables are independent, show that the two regression lines cut at right angles.
17. What are the purpose and meaning of the error terms in regression?
18. Give examples of business situations where you believe a straight-line relationship exists between two variables. What would be the uses of a regression model in each of these situations?
19. ‘The regression lines give only the best estimate of the value of the quantity in question. We may assess the degree of uncertainty in the estimate by calculating a quantity known as the standard error of estimate.’ Elucidate.
20. Explain the advantages of the least-squares procedure for fitting lines to data. Explain how the procedure works.

Formulae Used
1. Simple linear regression model
   y = β0 + β1x + ∊
2. Simple linear regression equation based on sample data
   y = a + bx
3. Regression coefficient in sample regression equation
   b = [nΣxy – (Σx)(Σy)]/[nΣx² – (Σx)²];  a = ȳ – b x̄
4. Residual representing the difference between an observed value of dependent variable y and its fitted value
   e = y – ŷ
5. Standard error of estimate based on sample data
   • Deviations formula: Syx = √[Σ(y – ŷ)²/(n – 2)]
   • Computational formula: Syx = √[(Σy² – aΣy – bΣxy)/(n – 2)]
6. Coefficient of determination based on sample data
   • Sums of squares formula: r² = SSR/SST = 1 – SSE/SST
   • Computational formula: r² = [nΣxy – (Σx)(Σy)]²/{[nΣx² – (Σx)²][nΣy² – (Σy)²]}
7. Regression sum of squares: SSR = SST – SSE
8. Interval estimate based on sample data: ŷ ± tdf Syx

Chapter Concepts Quiz

True or False

1. A statistical relationship between two variables does not indicate a perfect relationship.
(T/F)

2. A dependent variable in a regression equation is a continuous random variable.


(T/F)

3. The residual value is required to estimate the amount of variation in the dependent
variable with respect to the fitted regression line.
(T/F)

4. Standard error of estimate is the conditional standard deviation of the dependent


variable.
(T/F)

5. Standard error of estimate is a measure of scatter of the observations about the regression
line.
(T/F)

6. If one of the regression coefficients is greater than one the other must also be greater than
one.
(T/F)

7. The signs of the regression coefficients are always same.


(T/F)

8. Correlation coefficient is the geometric mean of regression coefficients.


(T/F)

9. If the sign of two regression coefficients is negative, then sign of the correlation
coefficient is positive.
(T/F)

10. Correlation coefficient and regression coefficient are independent.


(T/F)

11. The point of intersection of two regression lines represents average value of two variables.
(T/F)

12. The two regression lines are at right angle when the correlation coefficient is zero.
(T/F)

13. When value of correlation coefficient is one, the two regression lines coincide.
(T/F)

14. The product of regression coefficients is always more than one.


(T/F)

15. The regression coefficients are independent of the change of origin but not of scale.
(T/F)

Multiple Choice
16. The ‘line of best fit’ to measure the variation of observed values of the dependent variable in the sample data is

1. regression line
2. correlation coefficient
3. standard error
4. none of these

17. Two regression lines are perpendicular to each other when

1. r = 0
2. r = 1/3
3. r = – 1/2
4. r = ± 1

18. The change in the dependent variable y corresponding to a unit change in the


independent variable x is measured by

1. bxy
2. byx
3. r
4. none of these

19. The regression lines are coincident provided

1. r = 0
2. r = 1/3
3. r = – 1/2
4. r = ± 1

20. If byx is greater than one, then bxy is

1. less than one


2. more than one
3. equal to one
4. none of these

21. If bxy is negative, then byx is

1. negative
2. positive
3. zero
4. none of these

22. If two regression lines are: y = a + bx and x = c + dy, then the correlation


coefficient between x and y is

1.
2.
3.
4.

23. If two regression lines are: y = a + bx and x = c + dy, then the ratio of standard
deviations of x and y are

1.
2.
3.
4.

24. If two regression lines are: y = a + bx and x = c + dy then the ratio of a/c is


equal to

1. b/d

2.

3.

4.
25. If two coefficients of regression are 0.8 and 0.2, then the value of coefficient of
correlation is

1. 0.16
2. – 0.16
3. 0.40
4. –0.40

26. If two regression lines are: y = 4 + kx and x = 5 + 4y, then the range of k is
1. k ≤ 0
2. k ≥ 0
3. 0 ≤ k≤ 1
4. 0 ≤ 4k ≤1

27. If the two regression lines are: x + 3y = 7 and 2x + 5y = 12, then x̄ and ȳ are respectively

1. 2,1
2. 1,2
3. 2,3
4. 2,4

28. The residual sum of square is

1. minimized
2. increased
3. maximized
4. decreased

29. The standard error of estimate Sy.x is the measure of

1. closeness
2. variability
3. linearity
4. none of these

30. The standard error of estimate is equal to

1.
2.
3.
4.

Concepts Quiz Answers

1. T    2. T    3. T    4. T    5. T
6. F    7. T    8. T    9. F    10. F
11. T   12. T   13. T   14. F   15. T
16. (a) 17. (a) 18. (b) 19. (d) 20. (a)
21. (a) 22. (d) 23. (d) 24. (b) 25. (c)
26. (d) 27. (b) 28. (a) 29. (b) 30. (a)

Review Self-Practice Problems


14.15 Given the following bivariate data:

1. Fit a regression line of y on x and predict y if x = 10.


2. Fit a regression line of x on y and predict x if y = 2.5.

[Osmania Univ., MBA, 1996]

14.16 Find the most likely production corresponding to a rainfall of 40 inches from


the following data:
  Rainfall (in inches) Production (in quintals)

Average 30 50

Standard deviation 5 10
    Coefficient of correlation r = 0.8.
[Bharathidasan Univ., MCom, 1996]

14.17 The coefficient of correlation between the ages of husbands and wives in a


community was found to be + 0.8, the average of husbands’ ages was 25 years and that of wives’ ages 22 years. Their standard deviations were 4 and 5 years respectively.
Find with the help of regression equations:

1. the expected age of husband when wife’s age is 16 years, and


2. the expected age of wife when husband’s age is 33 years.

[Osmania Univ., MBA, 2000]

14.18 You are given below the following information about advertisement


expenditure and sales:
  Adv. Exp. (x) (Rs in crore) Sales (y) (Rs in crore)

Mean 20 120
Standard deviation 5 25

Correlation coefficient 0.8

1. Calculate the two regression equations.


2. Find the likely sales when advertisement expenditure is Rs 25 crore.
3. What should be the advertisement budget if the company wants to attain sales target of
Rs 150 crore?

[Jammu Univ., MCom, 1997; Delhi Univ., MBA, 1999]

14.19 For 50 students of a class the regression equation of marks in Statistics (x) on


the marks in Accountancy (y) is 3y – 5x + 180 = 0. The mean marks in Accountancy
is 44 and the variance of marks in Statistics is 9/16th of the variance of marks in
Accountancy. Find the mean marks in Statistics and the coefficient of correlation
between marks in the two subjects.
14.20 The HRD manager of a company wants to find a measure which he can use to
fix the monthly income of persons applying for a job in the production department.
As an experimental project, he collected data on 7 persons from that department
referring to years of service and their monthly income.

1. Find the regression equation of income on years of service.


2. What initial start would you recommend for a person applying for the job after having
served in a similar capacity in another company for 13 years?
3. Do you think other factors are to be considered (in addition to the years of service) in
fixing the income with reference to the above problems? Explain.

14.21 The following table gives the age of cars of a certain make and their annual
maintenance costs. Obtain the regression equation for costs related to age.

14.22 An analyst in a certain company was studying the relationship between travel
expenses in rupees (y) for 102 sales trips and the duration in days (x) of these trips.
He has found that the relationship between y and x is linear. A summary of the data
is given below:
Σx = 510; Σy = 7140; Σx2 = 4150; Σxy= 54,900, and Σy2 = 7,40,200

1. Estimate the two regression equations from the above data.


2. A given trip takes seven days. How much money should a salesman be allowed so that he
will not run short of money?

14.23 The quantity of a raw material purchased by ABC Ltd. at specified prices


during the post 12 months is given below.

1. Find the regression equations based on the above data.


2. Can you estimate the approximate quantity likely to be purchased if the price shoots up to
Rs 124 per kg?
3. Hence or otherwise obtain the coefficient of correlation between the price prevailing and
the quantity demanded.

14.24 With ten observations on price (x) and supply (y), the following data were
obtained (in appropriate units): Σx = 130, Σy = 220, Σx2 = 2288, Σy2 = 5506, Σxy =
3467. Obtain the line of regression of y on x and estimate the supply when the price
is 16 units. Also find out the standard error of the estimate.
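Problem 14.24 can be worked entirely from the given sums; a sketch using the computational formula for the standard error of estimate, Syx = √[(Σy² − aΣy − bΣxy)/(n − 2)]:

```python
import math

# Problem 14.24 worked from the aggregated sums only: fit y on x, estimate
# the supply at price x = 16, and compute S_yx from the computational formula.

n, Sx, Sy, Sx2, Sy2, Sxy = 10, 130, 220, 2288, 5506, 3467

b = (n * Sxy - Sx * Sy) / (n * Sx2 - Sx ** 2)   # slope of y on x
a = Sy / n - b * Sx / n                         # intercept

supply_at_16 = a + b * 16
s_yx = math.sqrt((Sy2 - a * Sy - b * Sxy) / (n - 2))
print(round(supply_at_16, 2), round(s_yx, 3))   # ~25.05 and ~2.497
```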
14.25 Data on the annual sales of a company in lakhs of rupees over the past 11
years is shown below. Determine a suitable straight line regression model y = β0 +
β1x + ∊ for the data. Also calculate the standard error of regression of y for values
of x.

From the regression line of y on x, predict the values of annual sales for the year
1989.
14.26 Find the equation of the least squares line fitting the following data:

 
 

Calculate the standard error of estimate of y on x.


14.27 The following data relate to the number of weeks of experience in a job involving the wiring of electric motors and the number of motors rejected during the past week for 12 randomly selected workers.
Workers Experience (weeks) No. of Rejects

1 2 26

2 9 20

3 6 28

4 14 16

5 8 23

6 12 18

7 10 24

8 4 26

9 2 38

10 11 22

11 1 32

12 8 25

1. Determine the linear regression equation for estimating the number of components
rejected given the number of weeks of experience. Comment on the relationship between
the two variables as indicated by the regression equation.
2. Use the regression equation to estimate the number of motors rejected for an employee
with 3 weeks of experience in the job.
3. Determine the 95 per cent approximate prediction interval for estimating the number of
motors rejected for an employee with 3 weeks of experience in the job, using only the
standard error of estimate.
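As a numerical check on part 1, the least-squares coefficients can be computed directly from the table as printed above. This is a Python sketch; the values follow from the figures exactly as tabulated, so a published solution rounded from slightly different data may not match to the last digit.

```python
# Weeks of experience (x) and number of rejects (y) from the table above
x = [2, 9, 6, 14, 8, 12, 10, 4, 2, 11, 1, 8]
y = [26, 20, 28, 16, 23, 18, 24, 26, 38, 22, 32, 25]

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

# Least-squares slope and intercept of y on x
b = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # negative: rejects fall as experience rises
a = (sy - b * sx) / n

# Part 2: estimated rejects for a worker with 3 weeks of experience
y3 = a + b * 3
print(round(b, 3), round(a, 2), round(y3, 1))
```

The negative slope confirms the inverse relationship asked about in part 1.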

14.28 A financial analyst has gathered the following data about the relationship
between income and investment in securities in respect of 8 randomly selected
families:
 

1. Develop an estimating equation that best describes these data.


2. Find the coefficient of determination and interpret it.
3. Calculate the standard error of estimate for this relationship.
4. Find an approximate 90 per cent confidence interval for the percentage of income
invested in securities by a family earning Rs 25,000 annually.

[Delhi Univ., MFC, 1997]

14.29 A financial analyst obtained the following information relating to return on security A and that of market M for the past 8 years:

1. Develop an estimating equation that best describes these data.


2. Find the coefficient of determination and interpret it.
3. Determine the percentage of total variation in security return being explained by the
return on the market portfolio.

14.30 The equation of a regression line is

and the data are as follows:

 
    Solve for residuals and graph a residual plot. Do these data seem to violate any of
the assumptions of regression?
14.31 Graph the following residuals and indicate which of the assumptions underlying regression appear to be in jeopardy on the basis of the graph:

Hints and Answers


14.15 x̄ = Σx/n = 21/8 = 2.625; ȳ = Σy/n = 4/8 = 0.50

Regression equation:

Regression equation:

14.16 Let x = rainfall and y = production. The expected yield corresponding to a rainfall of 40 inches is given by the regression equation of y on x.
Given ȳ = 50, σy = 10, x̄ = 30, σx = 5, r = 0.8

byx = r(σy/σx) = 0.8 × (10/5) = 1.6

Regression equation of y on x: y − 50 = 1.6(x − 30), i.e. y = 2 + 1.6x

For x = 40, y = 2 + 1.6(40) = 66 quintals.
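The computation in this hint can be sketched in a few lines of Python, using only the given means, standard deviations, and r:

```python
# Given: mean and s.d. of production (y) and rainfall (x), and correlation r
y_bar, sd_y = 50, 10
x_bar, sd_x = 30, 5
r = 0.8

# Regression coefficient of y on x: byx = r * (σy / σx)
byx = r * (sd_y / sd_x)        # 0.8 * 2 = 1.6
a = y_bar - byx * x_bar        # 50 - 48 = 2, so y = 2 + 1.6x

# Expected yield for 40 inches of rainfall
yield_40 = a + byx * 40        # 66 quintals
print(yield_40)
```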


14.17 Let x = age of husband and y = age of wife.
Given x̄ = 25, ȳ = 22, σx = 4, σy = 5, r = 0.8

1. Regression equation of x on y:
bxy = r(σx/σy) = 0.8 × (4/5) = 0.64
x − 25 = 0.64(y − 22), i.e. x = 10.92 + 0.64y
When the wife's age is y = 16: x = 10.92 + 0.64(16) = 21.16 ≈ 21 (husband's age)
2. Left as an exercise.

14.18

1. Regression equation of x on y


 

 
Regression equation of y on x
 

 
2. When advertisement expenditure is Rs 25 crore, the likely sales are:
 

 
3. For y = 150, x = 0.8 + 0.16y = 0.8 + 0.16(150) = 24.8
14.19 Let x = marks in Statistics and y = marks in Accountancy.

Regression coefficient of x on y, bxy = 3/5


Coefficient of regression

Hence 3r = 2.4 or r = 0.8


14.20 Let x = years of service and y = income.

1. Regression equation of y on x


 

 
2. When x = 13 years, the average income would be

 
14.21 Let x = age of cars and y = maintenance costs.
The regression equation of y on x

 
 

14.22 x̄ = Σx/n = 510/102 = 5; ȳ = Σy/n = 7140/102 = 70


Regression coefficients:
byx = (nΣxy − ΣxΣy)/(nΣx² − (Σx)²) = (102 × 54,900 − 510 × 7140)/(102 × 4150 − 510²) = 19,58,400/1,63,200 = 12
bxy = (nΣxy − ΣxΣy)/(nΣy² − (Σy)²) = 19,58,400/2,45,20,800 ≈ 0.0799

Regression lines:
y − 70 = 12(x − 5), i.e. y = 10 + 12x;  x − 5 = 0.0799(y − 70), i.e. x = −0.59 + 0.0799y

For a seven-day trip (x = 7): y = 10 + 12(7) = Rs 94.
14.23 Let price be denoted by x and quantity by y

1. Regression coefficients:
 

 
Regression lines:
 
 
2. For x = 124,

14.24 (a) Regression line of y on x is given by

(b)When x = 16,

(c)
14.25 Take years as x = −5, −4, −3, −2, −1, 0, 1, 2, 3, 4, 5, where 1983 = 0. The regression equation is

 
14.26 x̄ = Σx/n = 15/5 = 3, ȳ = Σy/n = 20/5 = 4
The regression equation is:

Standard error of estimate,

14.27 (a) Regression equation: ŷ = 35.57 − 1.40x
Since b = −1.40, it indicates an inverse (negative) relationship between weeks of experience (x) and the number of rejects (y) in the sampled week.
(b) For x = 3, we have ŷ = 35.57 − 1.40(3) ≈ 31

95 per cent approximate prediction interval

 
 

14.30 Residuals: 4.724, −0.983, −0.399, −6.753, 2.768, 0.644

14.31 The error terms appear to be non-independent.

Case Studies

Case 14.1: Made in China

The phrase ‘made in China’ has become an issue of concern in the last few years, as Indian companies try to protect their products from overseas competition. In these years a major trade imbalance in India has been caused by a flood of imported goods that enter the country and are sold at lower prices than comparable Indian-made goods. One prime concern is electronic goods, whose total imports increased steadily from the 1990s through 2004. Worried by complaints about product quality, worker layoffs, and high prices, Indian companies have spent millions in advertising to produce electronic goods that will satisfy consumer demands. Have these companies been successful in stemming the flood of imported goods purchased by Indian consumers? The given data represent the volume of imported goods sold in India for the years 1989–2004. To simplify the analysis, we have coded the year using the coded variable x = Year − 1989.

Year   x = Year − 1989   Volume of Import (in Rs billion)
1989          0                  1.1
1990          1                  1.3
1991          2                  1.6
1992          3                  1.6
1993          4                  1.8
1994          5                  1.4
1995          6                  1.6
1996          7                  1.5
1997          8                  2.1
1998          9                  2.0
1999         10                  2.3
2000         11                  2.4
2001         12                  2.3
2002         13                  2.2
2003         14                  2.4
2004         15                  2.4

Questions for Discussion


1. Find the least-squares line for predicting the volume of import as a function of year for the years 1989–2000.
2. Is there a significant linear relationship between the volume of import and the year?
3. Predict the volume of import of goods, using 95% prediction intervals, for each of the years 2002, 2003 and 2004.
4. Do the predictions obtained in Question 3 provide accurate estimates of the actual values observed in these years? Explain.
5. Add the data for 2001–2004 to your database, and recalculate the regression line. What effect have the new data points had on the slope? What is the effect on SSE?
6. Given the form of the scatter diagram for the years 1989–2004, does it appear that a straight line provides an accurate model for the data? What other type of model might be more appropriate?
