Regression Analysis and a Data Analysis Overview - Prediction
The regression model can, of course, be used as a predictive tool. Given an advertising expenditure, the model predicts the store traffic that will be generated. For example, if an advertising expenditure of $200 is proposed, the model-based estimate of store traffic would be
$$\hat{Y} = a + bX = 275 + 0.74(200) = 423$$
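The calculation above can be sketched in a few lines of Python. The intercept 275 and slope 0.74 come from the text; the function name `predict_traffic` and the constant names are only illustrative.

```python
# Coefficients as reported in the text: intercept a = 275, slope b = 0.74.
A_INTERCEPT = 275.0
B_SLOPE = 0.74


def predict_traffic(ad_spend, a=A_INTERCEPT, b=B_SLOPE):
    """Return the model-based store traffic estimate a + b * ad_spend."""
    return a + b * ad_spend


# Advertising level proposed in the text.
print(round(predict_traffic(200)))
```

Rounding is applied only for display, since 0.74 has no exact binary floating-point representation.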
Two cautionary comments are in order. First, prediction using extreme values of the independent variable (such as X = 2000) can be risky, because the random sample provided no information about extreme values of advertising. Second, if the market environment changes, such as a competitive chain opening a series of stores, then the model parameters probably will be affected. The data from the random sample were obtained under one set of environmental conditions; if those conditions change, the model may no longer apply.
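One way to make the first caution operational is to flag any prediction that falls outside the range of advertising levels actually sampled. The sketch below assumes such a range is known. The text does not report it, so the bounds 100 and 900 are purely hypothetical, as is the function name.

```python
import warnings


def predict_with_range_check(ad_spend, observed_min, observed_max, a=275.0, b=0.74):
    """Predict traffic, warning when ad_spend lies outside the sampled range."""
    if not (observed_min <= ad_spend <= observed_max):
        warnings.warn(
            f"ad_spend={ad_spend} is outside the sampled range "
            f"[{observed_min}, {observed_max}]; the estimate is an extrapolation.",
            stacklevel=2,
        )
    return a + b * ad_spend


# Hypothetical sampled range, chosen only for illustration.
SAMPLED_MIN, SAMPLED_MAX = 100.0, 900.0

inside = predict_with_range_check(200, SAMPLED_MIN, SAMPLED_MAX)    # no warning
outside = predict_with_range_check(2000, SAMPLED_MIN, SAMPLED_MAX)  # emits a UserWarning
```

The function still returns a number for X = 2000. The warning only signals that the sample offers no evidence the fitted line holds that far out.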
How Good Is the Prediction, r²?
A natural question is: how well does the model predict? Consider store 8 in Figure 20-2, which reported an advertising expenditure of $500 and a store traffic of 870. Applying the model to store 8, a model-based prediction results:
$$\hat{Y}_8 = a + b(500) = 275 + 0.74(500) = 645$$
The difference between this estimate and the actual store traffic for store 8 is 870 less 645, or 225. If this difference is squared for each store (which converts all errors to positive numbers) and then summed over all 20 stores, the result is a measure of model performance termed the variation in Y that is unexplained by the regression model:
$$\text{Unexplained variation in } Y = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$
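As a minimal sketch, the unexplained variation can be computed with a short function. Because the 20-store data set is not reproduced here, the example evaluates only the single term contributed by store 8, using the figures quoted in the text. The function name is an assumption made for illustration.

```python
def unexplained_variation(actual, predicted):
    """Sum of squared differences between observed and model-predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted))


# Store 8 from the text: observed traffic 870, model prediction 645.
store8_term = unexplained_variation([870], [645])
print(store8_term)  # store 8's squared error, (870 - 645) ** 2
```

Passing all 20 observed and predicted values to the same function would give the full sum.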
To evaluate the predictive ability of the model, some standard of comparison is needed. The standard used is the best prediction that could be generated in the absence of any knowledge of the independent variable. In that case, the best estimate of store traffic would be Ȳ, the sample average of store traffic, or 540. Thus, our best guess of store traffic for store 8 would be 540 if we had no information about advertising expenditures. Our error then would be 870 less 540, or 330, which is greater than the error of 225 obtained when the advertising expenditure was known and the model was applied. On average, the error should be smaller when the model is used, if the model has any value at all. A measure of the quality of estimates without the model is obtained by summing the squared deviations from Ȳ and is termed the total variation in Y:
$$\text{Total variation in } Y = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
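The comparison for store 8 can be checked numerically. The sketch below uses only the figures quoted in the text (observed traffic 870, model prediction 645, sample mean 540). The helper `total_variation` and its optional `mean_y` argument are illustrative assumptions, added so the reported mean can be supplied without the full data set.

```python
def total_variation(actual, mean_y=None):
    """Sum of squared deviations of the observed values from their mean.

    If mean_y is not supplied, it is computed from the values passed in.
    """
    if mean_y is None:
        mean_y = sum(actual) / len(actual)
    return sum((y - mean_y) ** 2 for y in actual)


# Store 8's squared error when only the sample mean (540) is known.
store8_baseline_term = total_variation([870], mean_y=540)

# Store 8's squared error when the model prediction (645) is used.
store8_model_term = (870 - 645) ** 2

print(store8_baseline_term, store8_model_term)
```

For this one store, the squared error falls by more than half once advertising is taken into account. The model's overall value depends on the sums over all 20 stores.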
The difference between the total variation of Y and the variation that remains unexplained by the regression model is termed the variation explained by the regression model.6 Figure 20-3 illustrates. The measure of the regression model's ability to predict is termed r² and is the ratio of the explained variation to the total variation:
$$r^2 = \frac{\text{total variation} - \text{unexplained variation}}{\text{total variation}} = \frac{\text{explained variation}}{\text{total variation}}$$
For our example, in Figure 20-2, r² is equal to 0.35. Thus, 35 percent of the total variation of Y is explained or accounted for by X. The variation in Y was reduced by 35 percent by using X and applying the regression model.7
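The ratio can be written as a one-line function. The text reports only the resulting value of 0.35, not the underlying sums, so the totals used below (1000 and 650) are hypothetical numbers chosen to reproduce that ratio.

```python
def r_squared(total, unexplained):
    """Share of the total variation in Y that the regression model explains."""
    if total <= 0:
        raise ValueError("total variation must be positive")
    return (total - unexplained) / total


# Hypothetical sums of squares: 650 of 1000 units remain unexplained.
example_r2 = r_squared(total=1000.0, unexplained=650.0)
print(round(example_r2, 2))
```

Read this way, an r² of 0.35 means that 65 percent of the variation in store traffic is still unaccounted for after advertising is considered.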
The r² term is the square of the correlation between X and Y.8 Thus, it lies between zero and one. It is zero if there is no linear relationship between X and Y in the sample. It is one if a plot of the X and Y points forms a perfectly straight line. A good way to interpret r, the sample correlation, is therefore to interpret r² instead, which has a natural interpretation as the percentage reduction in variation.
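The link between r² and the correlation can be checked on a small example. The sketch below fits a least-squares line by hand and compares the explained share of variation with the squared sample correlation. The six data points are hypothetical and are not the stores of Figure 20-2; all function names are illustrative.

```python
from math import sqrt


def fit_line(x, y):
    """Ordinary least-squares intercept a and slope b for y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def correlation(x, y):
    """Sample (Pearson) correlation between x and y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def r_squared_from_fit(x, y):
    """Explained variation divided by total variation for the fitted line."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    unexplained = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    total = sum((yi - mean_y) ** 2 for yi in y)
    return (total - unexplained) / total


# Hypothetical data for illustration only.
ad_spend = [100, 200, 300, 400, 500, 600]
traffic = [310, 480, 420, 610, 870, 650]

r2 = r_squared_from_fit(ad_spend, traffic)
r = correlation(ad_spend, traffic)
print(abs(r2 - r ** 2) < 1e-9)  # the two quantities agree up to rounding error
```

The agreement holds for a simple regression with one independent variable and an intercept, which is the case discussed here.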