|
Building a Multiple Regression Model Using SigmaXL |
By
Issa Bass
In the
Simple Linear Regression
model that we built to determine the effect of the variation of the
price of gas on the quantities sold, we only used one independent
variable, the price per gallon. In that example, the variations of
the quantities sold could have been explained by other factors such
as new people moving to or out of the city, the building of a new
freeway, the building of a new factory and so on. All these
independent variables could have had a significant impact on the
variations of Y and need to be taken into account for a more
accurate regression model.
A model that takes into account more
than one independent variable is called a multiple regression model.
The equation for the probabilistic
simple regression model that we had was under the form of

In that case, we only had one
independent variable and
represented the slope of the regression line.
If we consider that more than one
independent variable contribute to the variation of the response
variable Y, we obtain

if the number of independent
variables is .

is
called the partial regression coefficient of the
independent variable, it measures
the variation in the value of y that is due to a variation of
by one unit.
represents the slope of the
regression line when all the other independent variables are held
constant.
Suppose that we consider the
population growth, the building of new road and the price per gallon
to be factors that explain the quantities sold. The table below
represents 15 samples taken. Each level of quantity sold is
associated with a price point, a population growth rate and a number
of new roads built. We want to create a model that will help predict
the future quantities sold for given population growth, price points
and new roads.
|
Years |
Quantity sold |
Price per gallon |
Pop growth |
New roads |
|
Year 1 |
159 |
1.62 |
55 |
12 |
|
Year 2 |
160 |
1.667 |
56 |
13 |
|
Year 3 |
163 |
1.69 |
58 |
14 |
|
Year 4 |
166 |
1.7 |
60 |
15 |
|
Year 5 |
167 |
1.72 |
63 |
17 |
|
Year 6 |
167 |
1.73 |
65 |
18 |
|
Year 7 |
168 |
1.736 |
66 |
18 |
|
Year 8 |
167 |
1.74 |
66.7 |
18 |
|
Year 9 |
167.9 |
1.75 |
66.9 |
19 |
|
Year 10 |
168.9 |
1.755 |
67.4 |
19 |
|
Year 11 |
169 |
1.756 |
67.9 |
19 |
|
Year 12 |
169 |
1.77 |
68 |
20 |
|
Year 13 |
170 |
1.767 |
68.5 |
18 |
|
Year 14 |
171 |
1.756 |
68.8 |
17 |
|
Year 15 |
172 |
1.77 |
68.9 |
18 |
Table 1
The computation of a multiple
regression is very strenuous and conducive to errors if done by
hand, since it is almost always done using a computer, we will solve
the following example using SigmaXL. A trial version of SigmaXL can
be downloaded from
www.sigmaXL.com.
Using
SigmaXL
Select the area that contains the data
and the from the menu bar, click on SigmaXL and choose Statistical
Tools from the menu and Regression from the submenu and then
Multiple Regression.
The Multiple Regression box
appears. Click on the Next>> button

The Numeric Response (Y) will be the
“Quantity Sold”, and the Continuous Predictors (X) will the
independent variables.

Press the OK>> button

The coefficients are used to build the model, in this case, the
model will be:
Quantity sold = 57.412Price per gallon + 0.534Pop growth – 0.631New
Roads + 44.112
Interpretation
of the Results.
Interpreting R
R-Square
R-square
is the coefficient of determination; it measures the proportion in
the variation in the Y variable (in the case, the quantity sold)
that is explained by variations in the independent factors.
Note that R-square is the square of
Multiple R which is the
coefficient of correlation.
R-Square = 0.947 means that 94.68% of the variations in the
quantities sold are explained by the independent variables, i.e. the
population growth, the price point and the new roads.
Adjusted
R-Square
The
adjusted R-Square takes into account the factors that can contribute
to inflating the results, it is R-Square minus the inflation
factors.
Residual
Analysis or How far are the results of our analysis from the true
values
The regression
equation can only be meaningful if we can with certitude use it to
make predictions. If for instance the price per gallon goes up to
$2, the population growth is 70 and 15 new roads are built, we
should expect the quantity sold to be
Quantity sold = (57.412 x 2) + (0.534 x 70) – (0.631 x 15) + 44.112
= 186.8
But how can we
be sure that we will really obtain 186.8? What makes us believe that
we can use that equation to make accurate predictions? How can we
test the equation for fitness for use?
Since we have the X and Y values that were used to build the model,
we can use them to see how far the regression model is from its
predicted values. We replace the Xs that we had in the regression
equation to obtain the predicted Ys.
We will
proceed by replacing the Xs that we used to build the model into the
regression equation.
The first line
in Table 1 was:
|
Years |
Quantity sold |
Price per gallon |
Pop growth |
New roads |
|
Year 1 |
159 |
1.62 |
55 |
12 |
Let us
replace the values for “Price per gallon”, “pop growth” and “New
roads” in the regression equation to see if we will get 159, which
was the quantity sold.
Quantity sold = (57.412 x
1.62)
+ (0.534 x
55)
– (0.631 x
12)
+ 44.112 =
158.9
We did not obtain the expected value
of 159. The difference between 174.0614 and 159 is called residual
error.
Residual error = 159 – 158.9 = 0.105
Since the
result that we obtained did not match our expectations, let’s test
the whole table. Since we know how to obtain the residuals, to make
things easy, let us use Excel.
Let us open
Data Analysis from the Tools menu for the MS Excel
versions prior to 2007 and from the Data menu for the Ms
Excel version 2007.

From the
Data Analysis dialog box, let us select ‘regression’.
When the
Regression box appears, let us fill it out as indicated below

Let us
press on the OK button to obtain the results below

The
Standard Error
If the regression model were perfect,
all the residuals would have been equal to zero. Since they are not
all equal to zero and we allowed ourselves a confidence level of
95%, how do we interpret all the numbers in the residual column?
It would be a lot easier if we could
have a single number that gives an account of all the residuals.
That number is called the Standard error.
The standard error is the standard
deviation for the residuals.
It is the
square root of the Sum of square of Error (SSE) divided by the
degree of freedom.

Where
Y is the
Actual quantity sold and is the
predicted quantity.

Since
Standard Error of Estimate SEE is the square root of the Sum of
square of Error (SSE) divided by the degree of freedom it will be

We have 15
rows and 4 columns, therefore the degree of freedom in 15 – 4 = 11,
consequently

Which
matches the results that we obtained from Excel

Interpreting the P – values
The P – Values in the following
SigmaXL output show the significant factors in the equation. The
significant factors are those that have an effect on the response
factor, i.e. the quantity sold.
Two factors have P – Values lower than
0.05 (“Price per gallon” and “Population growth”), therefore those
two are significant. The other variable “New roads” is not
significant.

When the P – Value of a variable is too high (in general more that
0.10), the presence of that variable is consider to have no value
for the equation.
I hope
that this will help. I do not know if you liked it but, I enjoyed
every second of it. It was really fun crunching these
numbers!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Buy a copy
of SigmaXL. It is cheap and it does just about everything Minitab
does for about one fifth of the price.
Download a 30 day trial version from
http://www.sigmaxl.com/Free%20Downloads.html
About
the author
Issa
Bass is the managing editor
of SixSigmaFirst. He can be reached at
issa@sixsigmafirst.com
Tell us what you think about this
article. Send a
note to the Editor.
www.manorhouseassociates.com
|