Regression Analysis and a Data Analysis Overview - Multiple Regression
Multiple Regression
Recall that the error term included the effects on the dependent variable of variables other than the independent variable. It may be desirable to include explicitly some of these variables in the model. In a prediction context, their inclusion will improve the model's ability to predict and will decrease the unexplained variation. In terms of understanding, it will introduce the impact of other variables and therefore elaborate and clarify the relationships.
In our example it might be hypothesized that store size will influence store traffic, since larger stores will usually tend to attract more customers. Thus, if store size is known, the prediction of store traffic should improve. Also, it might be desirable to include a variable that takes the value one if the store is located in a suburban area and zero if it is located in an urban area. Such a zero-one variable is termed a dummy variable and is often convenient and useful.
Our model now becomes
Y = β0 + β1X1 + β2X2 + β3X3 + e
where
X1=advertising expenditures on the previous day
X2=size of the store in thousands of square feet
X3 = dummy variable taking on the value one if the store is suburban and zero if it is urban
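The model above can be sketched as a small simulation. This is a minimal illustration, not the book's data: the store count, expenditure ranges, and parameter values are all assumed for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
n = 50                          # hypothetical number of stores

# Hypothetical predictors (ranges are illustrative assumptions):
x1 = rng.uniform(100, 500, n)   # X1: advertising expenditures, previous day
x2 = rng.uniform(5, 50, n)      # X2: store size, thousands of square feet
x3 = rng.integers(0, 2, n)      # X3: dummy, 1 = suburban, 0 = urban

# Assumed "true" parameters beta0..beta3 for the simulation
beta = np.array([20.0, 0.4, 3.0, 25.0])
e = rng.normal(0, 10, n)        # error term

# Y = beta0 + beta1*X1 + beta2*X2 + beta3*X3 + e
y = beta[0] + beta[1] * x1 + beta[2] * x2 + beta[3] * x3 + e
```

The dummy variable simply shifts the predicted traffic up by β3 for suburban stores while leaving the other slopes unchanged.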
As before, the first logical step is to obtain data from the random sample on the independent variables and the dependent variable and use this information to estimate the four model parameters. With three independent variables, it is no longer possible to illustrate the process as was done in Figure 20-2; however, the logic is the same. Parameter estimates, also termed regression coefficients, are derived by a computer program that will minimize the resulting unexplained variation in Y, thus:
Unexplained variation in Y = Σ (Yi − Ŷi)², summed over i = 1 to n
where
Ŷi = b0 + b1X1 + b2X2 + b3X3
and
b0, b1, b2, and b3 = regression coefficients selected to minimize the unexplained variation in Y
Ŷi = regression model estimate of store traffic for store i
Yi = actual store traffic for store i
n = number of stores in the random sample
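The least-squares estimation described above can be carried out with NumPy. This is a hedged sketch on simulated data (the sample size, ranges, and "true" parameters are assumptions), showing that the fitted coefficients are those that minimize the sum of squared residuals.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200  # hypothetical sample of stores

# Design matrix: intercept column plus X1, X2, X3
X = np.column_stack([
    np.ones(n),                # column of ones for b0
    rng.uniform(100, 500, n),  # X1: advertising expenditures
    rng.uniform(5, 50, n),     # X2: store size
    rng.integers(0, 2, n),     # X3: suburban dummy
])
true_b = np.array([20.0, 0.4, 3.0, 25.0])  # assumed parameters
y = X @ true_b + rng.normal(0, 5, n)       # simulated store traffic

# Least squares picks b0..b3 to minimize sum((Yi - Yhat_i)**2)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = X @ b
unexplained = np.sum((y - yhat) ** 2)  # the quantity being minimized
```

With a reasonably large sample, the estimates b0..b3 land close to the parameters used to generate the data.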
The regression coefficients will be unique to the random sample that happened to be selected. If another random sample were taken, the regression coefficients would be slightly different. This sampling variation in the regression coefficients is measured by the standard error associated with each of them. The computer calculates this standard error and provides it as one of the outputs:
sb0, sb1, sb2, and sb3 are the standard errors of the regression coefficients.
As in the single-variable model, each independent variable will have associated with it a t-value. For example, the store size t-value is b2 divided by sb2 and is used to test the hypothesis that β2 is zero. Table 20-2 shows the regression coefficients and their associated standard errors and t-values for our example.
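The standard errors and t-values can be computed directly from the least-squares fit. A minimal sketch on simulated data follows, using the standard OLS formulas (residual variance times the diagonal of (XᵀX)⁻¹); the data-generating values are assumptions, not the book's numbers.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100
X = np.column_stack([np.ones(n),
                     rng.uniform(100, 500, n),   # X1: advertising
                     rng.uniform(5, 50, n),      # X2: store size
                     rng.integers(0, 2, n)])     # X3: suburban dummy
y = X @ np.array([20.0, 0.4, 3.0, 25.0]) + rng.normal(0, 5, n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b
k = X.shape[1]                        # number of estimated coefficients
s2 = resid @ resid / (n - k)          # estimate of the error variance

# Standard errors sb0..sb3: sqrt of the diagonal of s2 * (X'X)^-1
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))

# t-value for each coefficient, e.g. store size: t = b2 / sb2
t_values = b / se
```

A large t-value means the nonzero estimate is unlikely to be a sampling accident, i.e., the hypothesis that the underlying β is zero can be rejected.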
Parameter Interpretation in Multiple Regression
Parameters of the multiple regression model are interpreted identically to those of the single-variable model, with one important qualification. The parameter β1 is interpreted as the expected change in store traffic if advertising expenditures were increased one unit and the X2 and X3 variables were not changed (or if X2 and X3 were held constant). The added qualification that the remaining independent variables remain unchanged is an important one.
If X2 and X3 were not included, the interpretation of β1 would be less clear. It could be that an apparent positive impact of advertising on store traffic was due only to the fact that larger stores tend to advertise at higher levels and that a larger advertising expenditure meant that a larger store was involved. However, in the multiple regression context, the analysis has controlled for the store size and the β1 coefficient reflects the advertising effect with store size held constant.
The major assumption of multiple regression is that all the important and relevant variables are included in the model. If an important variable is omitted, the predictive power of the model is reduced. Further, if the omitted variable is correlated with an included variable, the estimated coefficient of the included variable will reflect both the included variable and the omitted variable.9 In our example the coefficient of advertising in the single-variable model was inflated (0.74) because it reflected not only the impact of advertising but also the impact of store size. Since larger stores tend to advertise more heavily than smaller stores, store size was correlated with advertising expenditures.
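This omitted-variable effect can be demonstrated with a small simulation. The numbers below are assumed for illustration (they are not the book's 0.74 figure): advertising is constructed to be correlated with store size, and dropping size from the model inflates the advertising coefficient.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
size = rng.uniform(5, 50, n)            # store size, thousands of sq ft

# Assumption: larger stores advertise more, so advertising is
# correlated with store size.
adv = 10 * size + rng.normal(0, 20, n)

# Assumed true model: traffic depends on both variables.
y = 20 + 0.4 * adv + 3.0 * size + rng.normal(0, 5, n)

# Full model: advertising's coefficient is estimated with size held constant.
X_full = np.column_stack([np.ones(n), adv, size])
b_full = np.linalg.lstsq(X_full, y, rcond=None)[0]

# Omitting size: advertising's coefficient absorbs part of size's effect.
X_short = np.column_stack([np.ones(n), adv])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]
```

In this simulation the short model's advertising coefficient is noticeably larger than the true 0.4, while the full model recovers it, mirroring the footnoted point that a correlated omitted variable ends up in the error term and biases the included coefficient.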
9Recall the assumption that the error term is not correlated with the independent variables. If the error term includes an omitted variable that is correlated with an independent variable, this assumption will not hold.
Evaluating the Independent Variables
When regression analysis is used to gain understanding of the relationships between variables, a natural question is: which of the independent variables has the greatest influence on the dependent variable? One approach is to consider the t-values for the various coefficients. The t-value, already introduced in the single-variable regression case, is used to test the hypothesis that a regression coefficient (i.e., βi) is equal to zero and that a nonzero estimate (i.e., bi) was simply a sampling phenomenon.10 The variable with the largest t-value can be interpreted as the one least likely to have a zero β parameter. In Table 20-2, that would mean the store size variable (X2), closely followed by the advertising variable (X1).
A second approach is to examine the size of the regression coefficients; however, when each independent variable is in different units of measurement (store size, advertising expenditures, and so on), it is difficult to compare their coefficients. One solution is to convert regression coefficients to "beta coefficients." Beta coefficients are simply the regression coefficients adjusted by expressing each variable in units of its estimated standard deviation instead of its original units of measurement.11 The beta coefficients can be compared to each other: the larger the beta coefficient, the stronger the impact of that variable on the dependent variable. In Table 20-2, an analysis of the beta coefficients indicates that the store size and the advertising variables have the most explanatory power, the same conclusion that the analysis of t-values showed.
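The rescaling to beta coefficients can be sketched as follows. This is an illustrative simulation, assuming the common standardization beta_j = b_j * sd(X_j) / sd(Y); the variable ranges and parameters are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.uniform(100, 500, n)             # advertising (dollars)
x2 = rng.uniform(5, 50, n)                # store size (thousand sq ft)
x3 = rng.integers(0, 2, n).astype(float)  # suburban dummy
y = 20 + 0.4 * x1 + 3.0 * x2 + 25.0 * x3 + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), x1, x2, x3])
b = np.linalg.lstsq(X, y, rcond=None)[0]

# Beta coefficient: slope re-expressed in standard-deviation units,
# beta_j = b_j * sd(x_j) / sd(y), making slopes directly comparable
# even though x1 and x2 are measured in different units.
sds = X[:, 1:].std(axis=0, ddof=1)
betas = b[1:] * sds / y.std(ddof=1)
```

Because the units cancel, the betas can be ranked directly: the variable with the largest beta contributes the most, per standard deviation, to movements in the dependent variable.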
r2 Revisited
The term r2 has the same definition and interpretation in multiple regression as it has in single-variable regression.12 It is still the ratio of the explained variation to the total variation.
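That ratio can be computed directly from a fitted model. A minimal sketch on simulated data (all numbers assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 150
X = np.column_stack([np.ones(n),
                     rng.uniform(100, 500, n),   # advertising
                     rng.uniform(5, 50, n),      # store size
                     rng.integers(0, 2, n)])     # suburban dummy
y = X @ np.array([20.0, 0.4, 3.0, 25.0]) + rng.normal(0, 5, n)

b = np.linalg.lstsq(X, y, rcond=None)[0]
yhat = X @ b

total = np.sum((y - y.mean()) ** 2)    # total variation in Y
unexplained = np.sum((y - yhat) ** 2)  # residual (unexplained) variation

# r^2 = explained variation / total variation
r_squared = 1 - unexplained / total
```

The value always lies between zero and one, and adding an informative independent variable can only shrink the unexplained portion.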