Building a Multiple Regression Model Using SigmaXL

By Issa Bass

In the Simple Linear Regression model that we built to determine the effect of the variation of the price of gas on the quantities sold, we only used one independent variable, the price per gallon. In that example, the variations of the quantities sold could have been explained by other factors such as new people moving to or out of the city, the building of a new freeway, the building of a new factory and so on. All these independent variables could have had a significant impact on the variations of Y and need to be taken into account for a more accurate regression model.

A model that takes into account more than one independent variable is called a multiple regression model.

The equation for the probabilistic simple regression model that we had was under the form of

 

In that case, we only had one independent variable and  represented the slope of the regression line.

If we consider that more than one independent variable contribute to the variation of the response variable Y, we obtain

  

 if the number of independent variables is.

 

  is called the partial regression coefficient of the  independent variable, it measures the variation in the value of y that is due to a variation of by one unit.  represents the slope of the regression line when all the other independent variables are held constant.

Suppose that we consider the population growth, the building of new road and the price per gallon to be factors that explain the quantities sold. The table below represents 15 samples taken. Each level of quantity sold is associated with a price point, a population growth rate and a number of new roads built. We want to create a model that will help predict the future quantities sold for given population growth, price points and new roads. 

 

Years

Quantity sold

Price per gallon

Pop growth

New roads

Year 1

159

1.62

55

12

Year 2

160

1.667

56

13

Year 3

163

1.69

58

14

Year 4

166

1.7

60

15

Year 5

167

1.72

63

17

Year 6

167

1.73

65

18

Year 7

168

1.736

66

18

Year 8

167

1.74

66.7

18

Year 9

167.9

1.75

66.9

19

Year 10

168.9

1.755

67.4

19

Year 11

169

1.756

67.9

19

Year 12

169

1.77

68

20

Year 13

170

1.767

68.5

18

Year 14

171

1.756

68.8

17

Year 15

172

1.77

68.9

18

Table 1

The computation of a multiple regression is very strenuous and conducive to errors if done by hand, since it is almost always done using a computer, we will solve the following example using SigmaXL. A trial version of SigmaXL can be downloaded from www.sigmaXL.com.


 Using
SigmaXL

Select the area that contains the data and the from the menu bar, click on SigmaXL and choose Statistical Tools from the menu and Regression from the submenu and then Multiple Regression.

 

The Multiple Regression box appears. Click on the Next>> button

The Numeric Response (Y) will be the “Quantity Sold”, and the Continuous Predictors (X) will the independent variables.


Press the OK>> button



The coefficients are used to build the model, in this case, the model will be:

Quantity sold = 57.412Price per gallon + 0.534Pop growth – 0.631New Roads + 44.112

 Interpretation of the Results.

Interpreting R

        R-Square

 R-square is the coefficient of determination; it measures the proportion in the variation in the Y variable (in the case, the quantity sold) that is explained by variations in the independent factors.

Note that R-square is the square of Multiple R which is the coefficient of correlation. R-Square = 0.947 means that 94.68% of the variations in the quantities sold are explained by the independent variables, i.e. the population growth, the price point and the new roads.

        Adjusted R-Square

 The adjusted R-Square takes into account the factors that can contribute to inflating the results, it is R-Square minus the inflation factors.
 

Residual Analysis or How far are the results of our analysis from the true values

The regression equation can only be meaningful if we can with certitude use it to make predictions. If for instance the price per gallon goes up to $2, the population growth is 70 and 15 new roads are built, we should expect the quantity sold to be

Quantity sold = (57.412 x 2) + (0.534 x 70) – (0.631 x 15) + 44.112 = 186.8

But how can we be sure that we will really obtain 186.8? What makes us believe that we can use that equation to make accurate predictions? How can we test the equation for fitness for use?
Since we have the X and Y values that were used to build the model, we can use them to see how far the regression model is from its predicted values. We replace the Xs that we had in the regression equation to obtain the predicted Ys.

We will proceed by replacing the Xs that we used to build the model into the regression equation.

The first line in Table 1 was:

 

Years

Quantity sold

Price per gallon

Pop growth

New roads

Year 1

159

1.62

55

12

Let us replace the values for “Price per gallon”, “pop growth” and “New roads” in the regression equation to see if we will get 159, which was the quantity sold.

Quantity sold = (57.412 x 1.62) + (0.534 x 55) – (0.631 x 12) + 44.112 = 158.9

We did not obtain the expected value of 159. The difference between 174.0614 and 159 is called residual error.

Residual error = 159 – 158.9 = 0.105

Since the result that we obtained did not match our expectations, let’s test the whole table. Since we know how to obtain the residuals, to make things easy, let us use Excel.

Let us open Data Analysis from the Tools menu for the MS Excel versions prior to 2007 and from the Data menu for the Ms Excel version 2007.

 

From the Data Analysis dialog box, let us select ‘regression’.

When the Regression box appears, let us fill it out as indicated below

Let us press on the OK button to obtain the results below


  The Standard Error

If the regression model were perfect, all the residuals would have been equal to zero. Since they are not all equal to zero and we allowed ourselves a confidence level of 95%, how do we interpret all the numbers in the residual column?

It would be a lot easier if we could have a single number that gives an account of all the residuals. That number is called the Standard error. The standard error is the standard deviation for the residuals.

It is the square root of the Sum of square of Error (SSE) divided by the degree of freedom.

Where

Y is the Actual quantity sold and is the predicted quantity.

Since Standard Error of Estimate SEE is the square root of the Sum of square of Error (SSE) divided by the degree of freedom it will be

We have 15 rows and 4 columns, therefore the degree of freedom in 15 – 4 = 11, consequently

Which matches the results that we obtained from Excel

 

Interpreting the P – values

The P – Values in the following SigmaXL output show the significant factors in the equation. The significant factors are those that have an effect on the response factor, i.e. the quantity sold.

Two factors have P – Values lower than 0.05 (“Price per gallon” and “Population growth”), therefore those two are significant. The other variable “New roads” is not significant.



When the P – Value of a variable is too high (in general more that 0.10), the presence of that variable is consider to have no value for the equation.

I hope that this will help. I do not know if you liked it but, I enjoyed every second of it. It was really fun crunching these numbers!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Buy a copy of SigmaXL. It is cheap and it does just about everything Minitab does for about one fifth of the price.

Download a 30 day trial version from http://www.sigmaxl.com/Free%20Downloads.html

 

About the author
Issa Bass is the managing editor of SixSigmaFirst. He can be reached at issa@sixsigmafirst.com

Tell us what you think about this article. Send a note to the Editor.

www.manorhouseassociates.com

 

Place your Ad here
Six Sigma Statistics
Order "Six Sigma Statistics with Excel and Minitab," the new book by Issa Bass.