Analysis Of Variance (ANOVA) in short-sleeved English!
By
Issa Bass
The purpose of this article is to present the Analysis Of Variance under
less confusing auspices, in plain English.
It is obvious that since most computers are equipped with some form of
spreadsheet, especially Excel, using the embedded Data Analysis tool
would make it painless to obtain the information that we are seeking.
But since the Statistics software tools do not provide any
interpretation of the results, nor do they explain the process that
leads to the results, understanding the step by step process of ANOVA
becomes paramount.
At the end of this
article we will illustrate our reasoning with 2 practical examples
using two different methods.
The difference between Regression Analysis and ANOVA
Regression analysis enables the experimenter to make predictions about the
evolution of the object of the experiment by determining the
correlation between the predicted variable and the predictors of
the experiment, while ANOVA helps determine whether there is a difference
between the means of several treatments.
For instance, if
we want to know by how much a given population of a given city will
grow within the next 5 years, we can collect enough data that are
relevant to the population fluctuation in that city and conduct a
Regression analysis and be able to predict the population growth.
The Regression
Analysis enables the experimenter to build a model under the form of
an algebraic function that makes it easy to make predictions.
The function can be a first-degree polynomial function of the form:
Y = f(x) = ax + b
where Y is the expected response (population in 5 years), x is the
variable that explains the changes, a measures the propensity for a
change (the slope) and b is a constant that is equal to Y when x is
equal to zero.
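As a rough illustration of how such a line can be obtained, the sketch below fits Y = ax + b by ordinary least squares. The yearly population figures are hypothetical and were made up for this example only, and the function name `fit_line` is an assumption, not something from the article.

```python
# Hypothetical yearly population figures (in thousands); NOT data from the article.
years = [0, 1, 2, 3, 4]
population = [100.0, 102.0, 104.5, 106.0, 108.5]


def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


a, b = fit_line(years, population)
forecast = a * 9 + b  # prediction five years after the last observed year
print(f"Y = {a:.3f}x + {b:.3f}; forecast at x = 9: {forecast:.1f}")
```

Once a and b are known, a prediction is obtained simply by plugging a future value of x into the fitted function.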
The Analysis Of
Variance is about determining if there is a difference between
several treatments. For example, if we want to know if three
different types of unleaded fuel (Premium, Medium and regular)
impact the longevity of a car engine differently, ANOVA would be the
right tool.
One Way ANOVA
Let’s suppose that
we have a soap manufacturing machine that is used by employees
grouped in three shifts composed of an equal number of employees. We
want to know if there is a difference in productivity between the
three shifts.
Had it been two shifts, we would have used a t test, which is based on
the standard error of the difference between two means, to determine
whether a difference exists. Since we have three shifts, we would need
three pairwise t tests, and every additional test increases the
probability of wrongly rejecting a true null hypothesis (a Type I error).
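The growth of that error probability can be approximated with a short calculation. The sketch below assumes each pairwise t test is run at a 5% significance level and treats the tests as independent, which is a simplification (pairwise tests on the same data are not truly independent).

```python
from math import comb

alpha = 0.05             # assumed significance level of each individual t test
shifts = 3
pairs = comb(shifts, 2)  # number of pairwise comparisons among three shifts

# Probability of at least one false rejection across all pairwise tests,
# under the simplifying assumption that the tests are independent
familywise_error = 1 - (1 - alpha) ** pairs
print(f"{pairs} tests, familywise error of about {familywise_error:.4f}")
```

Under these assumptions the chance of at least one false alarm is close to three times the nominal 5%, which is why a single ANOVA test is preferred.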
In either case, we
will formulate a null hypothesis about the productivity of the three
shifts before proceeding with the testing. The null hypothesis for
this particular case will stipulate that there is no difference
between the productivity of the three groups.
The null hypothesis will be:
H0: the productivity of the first shift = productivity of second
shift = productivity of third shift.
And the alternate hypothesis will be:
H1: there is a difference between the productivity of the three
shifts.
Some conditions must be met in order for the results derived from
the test to be valid:
· the treatment data must be normally distributed,
· the variance must be the same for all treatments,
· all samples are randomly selected,
· and the samples are independent.
Seven samples of data have been taken for every shift and summarized in
the table below. What we are comparing is not the productivity by day
but the productivity by shift; the days only supply repeated
observations.
So the shift is, in this case, the single factor under study, the three
shifts are its levels, called treatments, and the daily productivities
are the observations.
| | First Shift | Second Shift | Third Shift |
|---|---|---|---|
| Monday | 78 | 77 | 88 |
| Tuesday | 88 | 75 | 86 |
| Wednesday | 90 | 80 | 79 |
| Thursday | 77 | 83 | 93 |
| Friday | 85 | 87 | 79 |
| Saturday | 88 | 90 | 83 |
| Sunday | 79 | 85 | 79 |
The objective is to determine if the differences are due to random
errors (individual variations within the groups) or to variations
between the groups.
If the differences are due to variations between the three shifts, we
reject the null hypothesis. If they are due to variations within
treatments, we cannot reject the null hypothesis. Let’s note that
statisticians do not accept the null hypothesis: a hypothesis is either
rejected or the experimenter fails to reject it.
The variability of a set of data depends on the sum of squares of the
deviations from the mean:
$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

In the analysis of
Variance, the Total variance is subdivided between the variance due
to the treatment and the variance due to random error or within
treatment.
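SSk is the Sum of Squares between treatments:
$$SS_k = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$$
SSE is the Sum of Squares due to error:
$$SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$$
TSS is the total Sum of Squares:
$$TSS = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2$$
where
$i$ = a given observation within a treatment level
$j$ = a treatment level
$k$ = number of treatment levels
$n_j$ = number of observations in treatment level $j$
$\bar{X}$ = grand mean
$\bar{X}_j$ = mean of treatment level $j$
$X_{ij}$ = particular value (observation $i$ of treatment level $j$)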
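The three sums of squares can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `sums_of_squares` and the toy data are assumptions made for the example, and the toy numbers are hypothetical.

```python
def sums_of_squares(groups):
    """Return (SSk, SSE, TSS) for a list of treatment groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)

    ss_between = 0.0
    ss_error = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        # Between treatments: squared gap between group mean and grand mean,
        # weighted by the number of observations in the group
        ss_between += len(group) * (group_mean - grand_mean) ** 2
        # Error: squared gap between each observation and its own group mean
        ss_error += sum((x - group_mean) ** 2 for x in group)

    # Total: squared gap between each observation and the grand mean
    ss_total = sum((x - grand_mean) ** 2 for x in all_values)
    return ss_between, ss_error, ss_total


# Hypothetical toy data: three treatments with three observations each
toy = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]
ssk, sse, tss = sums_of_squares(toy)
print(ssk, sse, tss)
```

On any data set the first two values should add up to the third, which is a convenient sanity check.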
It is customary to
summarize the results of the Analysis Of Variance in a table.
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Statistic |
|---|---|---|---|---|
| Between Treatments | SSk | k-1 | MSk = SSk/(k-1) | F = MSk/MSE |
| Error | SSE | N-k | MSE = SSE/(N-k) | |
| Total | TSS | N-1 | | |
Degree of Freedom
The concept of
degree of freedom is better explained through an example. Let’s
suppose that a person has 10 dollars to spend on 10 different items
that cost a dollar a piece. At first his degree of freedom is 10
because he has the freedom to spend the 10 dollars however he wants,
but after he has spent 9 dollars his degree of freedom becomes 1
since he does not have more than one choice.
The concept of degree of freedom is widely used in Statistics to
derive an unbiased estimator.
· k - 1 is the degree of freedom between treatments; it is the number of
treatments minus 1.
· N - k is the degree of freedom for the Error, where N is the total
number of observations.
· N - 1 is the total degree of freedom.
Mean Square
The mean square is the ratio of the sum of squares to its degree
of freedom.
MSk = SSk/(k – 1)
MSE = SSE/(N – k)
The F-statistic is better explained using an example. It is the ratio
of MSk to MSE, but its interpretation goes beyond that.
So, let’s get back
to our example to build the ANOVA table.
There are several
ways to build the table; we will use two of them. First, we will use
the formulas above step by step.
First Method
| | First Shift | Second Shift | Third Shift |
|---|---|---|---|
| Monday | 78 | 77 | 88 |
| Tuesday | 88 | 75 | 86 |
| Wednesday | 90 | 80 | 79 |
| Thursday | 77 | 83 | 93 |
| Friday | 85 | 87 | 79 |
| Saturday | 88 | 90 | 83 |
| Sunday | 79 | 85 | 79 |
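Let us first find SSk, the Sum of Squares between treatments:
$$SS_k = \sum_{j=1}^{k} n_j (\bar{X}_j - \bar{X})^2$$
The table is presented under the form of a grid of values $X_{ij}$,
with $j = 1, 2, 3$ indexing the shifts (treatments) and
$i = 1, \dots, 7$ indexing the observations within a shift.
$\bar{X}$ is the mean of all the observed data. It is equal to the sum
of all the observations divided by 21: $\bar{X} = 1749/21 = 83.2857$.
$\bar{X}_j$ is the mean of each treatment. For the first shift, it is
equal to 83.571, for the second shift, it is 82.429 and for the third
shift, it is 83.857.
$\bar{X}_j - \bar{X}$ represents the difference between the mean for
each treatment and the mean of all the observations.
$$SS_k = 7\left[(0.2857)^2 + (-0.8571)^2 + (0.5714)^2\right] \approx 7 \times 1.1429 = 8$$
The Sum of Squares Between Treatments is therefore equal to 8.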
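The treatment means, the grand mean and the Sum of Squares Between Treatments can be reproduced with a short script. This is only a sketch of the step described above, using the shift data from the table; the variable names are illustrative.

```python
shifts = {
    "First":  [78, 88, 90, 77, 85, 88, 79],
    "Second": [77, 75, 80, 83, 87, 90, 85],
    "Third":  [88, 86, 79, 93, 79, 83, 79],
}

all_obs = [x for obs in shifts.values() for x in obs]
grand_mean = sum(all_obs) / len(all_obs)          # mean of the 21 observations

ssk = 0.0
for name, obs in shifts.items():
    mean_j = sum(obs) / len(obs)                  # mean of this treatment
    ssk += len(obs) * (mean_j - grand_mean) ** 2  # weighted squared deviation
    print(f"{name} shift mean: {mean_j:.3f}")

print(f"Grand mean: {grand_mean:.4f}")
print(f"SSk: {ssk:.4f}")
```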
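Now we will find the SSE, the Sum of Squares for the Error:
$$SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X}_j)^2$$
We first find the difference between each observation and its
treatment mean. The next step consists in squaring each of these
differences. The squares add up to 177.7143 for the first shift,
175.7143 for the second shift and 176.8571 for the third shift, so the
SSE, their sum over the 21 observations, is 530.2857.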
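The two intermediate tables for this step (the deviations from each treatment mean, then their squares) can be regenerated with a few lines of Python. The sketch below uses the shift data from the table; the layout of the printed output is an arbitrary choice.

```python
shifts = [
    [78, 88, 90, 77, 85, 88, 79],   # first shift, Monday to Sunday
    [77, 75, 80, 83, 87, 90, 85],   # second shift
    [88, 86, 79, 93, 79, 83, 79],   # third shift
]

# Deviation of every observation from the mean of its own shift
deviations = [[x - sum(obs) / len(obs) for x in obs] for obs in shifts]
# Square of each deviation
squared = [[d ** 2 for d in row] for row in deviations]
sse = sum(sum(row) for row in squared)

# One printed line per day: three deviations, then the three squares
for dev_row, sq_row in zip(zip(*deviations), zip(*squared)):
    print("  ".join(f"{d:8.3f}" for d in dev_row), "|",
          "  ".join(f"{s:8.3f}" for s in sq_row))
print(f"SSE: {sse:.4f}")
```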

Now we can find the Total Sum of Squares, TSS:
$$TSS = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (X_{ij} - \bar{X})^2$$
Let’s remember the value of the grand mean: $\bar{X} = 83.2857$.
We subtract $\bar{X}$ from every observation, then square each of the
21 differences. The Total Sum of Squares is the sum of these squared
differences, which is 538.2857.
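As a cross-check, the Total Sum of Squares can be computed directly and compared with the sum of the between-treatment and error components. A minimal sketch with the shift data, where the variable names are illustrative:

```python
shifts = [
    [78, 88, 90, 77, 85, 88, 79],   # first shift
    [77, 75, 80, 83, 87, 90, 85],   # second shift
    [88, 86, 79, 93, 79, 83, 79],   # third shift
]

all_obs = [x for obs in shifts for x in obs]
grand_mean = sum(all_obs) / len(all_obs)

# Total: every observation compared with the grand mean
tss = sum((x - grand_mean) ** 2 for x in all_obs)

# The two components, recomputed here so the check is self-contained
ssk = sum(len(o) * (sum(o) / len(o) - grand_mean) ** 2 for o in shifts)
sse = sum((x - sum(o) / len(o)) ** 2 for o in shifts for x in o)

print(f"TSS = {tss:.4f}, SSk + SSE = {ssk + sse:.4f}")
```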

Now that we have solved the most difficult problems, we can find the
degrees of freedom.
Since we have 3 treatments, the degree of freedom between treatments
will be 2 (3 – 1).
We have 21 observations, so the df for the error will be 18 (the number
of observations minus the number of treatments, 21 – 3).
The mean square for the treatment will be the ratio of the Sum of
Squares to its df (8/2 = 4).
The mean square for the Error will be the ratio of the Sum of Squares
for the Error to its df (530.2857/18 = 29.4603).
The F-Statistic is the ratio of the “Between Treatments” mean square to
the Error mean square (4/29.4603 = 0.13578).
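The degrees of freedom, mean squares and F-Statistic follow mechanically from the two sums of squares, as the sketch below shows. The helper name `anova_summary` and its dictionary keys are assumptions made for this example.

```python
def anova_summary(ssk, sse, k, n_total):
    """Derive degrees of freedom, mean squares and F from the sums of squares."""
    df_between = k - 1            # number of treatments minus 1
    df_error = n_total - k        # number of observations minus treatments
    msk = ssk / df_between
    mse = sse / df_error
    return {
        "df_between": df_between,
        "df_error": df_error,
        "df_total": n_total - 1,
        "MSk": msk,
        "MSE": mse,
        "F": msk / mse,
    }


# Values from the three-shift example: SSk = 8, SSE = 530.2857
summary = anova_summary(ssk=8.0, sse=530.2857, k=3, n_total=21)
for key, value in summary.items():
    print(f"{key}: {value:.5f}" if isinstance(value, float) else f"{key}: {value}")
```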
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Statistic |
|---|---|---|---|---|
| Between Treatments | 8 | 2 | 4 | 0.13578 |
| Error | 530.2857 | 18 | 29.4603 | |
| Total | 538.2857 | 20 | | |
Using MS Excel
Had we chosen to use Excel, we would have obtained the same ANOVA table.
To use Excel, we need to have Data Analysis installed. If it is not,
follow these steps:
Click on Tools
Select Add-ins
On the pop-up window, put check marks on all the boxes
Then go back to Tools and select Data Analysis
Select “Anova: Single Factor” and then click on OK
The Anova: Single Factor menu pops up
Select the range of data to be inserted in the Input Range
Click OK
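For readers without the Data Analysis add-in, a comparable single-factor summary can be sketched in plain Python. This is not a reproduction of Excel's output format; the function name `one_way_anova` and the column layout are assumptions made for illustration.

```python
def one_way_anova(groups):
    """Return rows (source, SS, df, MS, F) of a single-factor ANOVA table."""
    values = [x for g in groups for x in g]
    n_total, k = len(values), len(groups)
    grand_mean = sum(values) / n_total
    ssk = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    tss = sum((x - grand_mean) ** 2 for x in values)
    sse = tss - ssk                       # uses TSS = SSk + SSE
    msk, mse = ssk / (k - 1), sse / (n_total - k)
    return [
        ("Between Treatments", ssk, k - 1, msk, msk / mse),
        ("Error", sse, n_total - k, mse, None),
        ("Total", tss, n_total - 1, None, None),
    ]


shifts = [
    [78, 88, 90, 77, 85, 88, 79],   # first shift
    [77, 75, 80, 83, 87, 90, 85],   # second shift
    [88, 86, 79, 93, 79, 83, 79],   # third shift
]

table = one_way_anova(shifts)
for source, ss, df, ms, f in table:
    ms_text = f"{ms:10.4f}" if ms is not None else " " * 10
    f_text = f"{f:8.5f}" if f is not None else ""
    print(f"{source:<20}{ss:10.4f}{df:4d}{ms_text}  {f_text}")
```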

The F-Statistic by itself does not provide ground for rejection or
non-rejection of the null hypothesis. It needs to be compared with the
critical F value, which is found on a separate F-table.
If the calculated F is greater than the critical F value on the
F-table, then the null hypothesis is rejected; if not, we cannot reject
the null hypothesis.
In our case, from the F-table, the critical value of F for a
significance level of 0.05 with the degrees of freedom df1 = 2 and
df2 = 18 is 3.55.
Since 3.55 is greater than 0.13578, we cannot reject the null
hypothesis. We conclude that there is not a statistically
significant difference between the means of the three shifts.
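The table lookup can be double-checked numerically. When the numerator has exactly 2 degrees of freedom, the upper-tail probability of the F distribution has the closed form (1 + 2f/df2)^(-df2/2), which can be solved for the critical value. The sketch below relies on that shortcut and assumes a 5% significance level, which is consistent with the tabulated 3.55; the function name is hypothetical and the formula does not apply to other numerator degrees of freedom.

```python
def f_critical_df1_2(alpha, df2):
    """Upper-tail critical value of F with 2 numerator degrees of freedom.

    For df1 = 2 the survival function is (1 + 2f/df2) ** (-df2/2), which
    can be inverted in closed form. NOT valid for other values of df1.
    """
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


f_calculated = 0.13578                      # from the ANOVA table
f_critical = f_critical_df1_2(alpha=0.05, df2=18)
reject_null = f_calculated > f_critical     # decision rule from the text

print(f"critical F = {f_critical:.2f}, reject H0: {reject_null}")
```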
Method II
The first Method was very detailed and explicit but also long and
maybe cumbersome. There should be a way to do it faster and
painlessly.
Since the following equation is true, we may not need to calculate
all the variables.
TSS = SSk + SSE.
Let’s suppose that we are comparing the effects of 4 different system
boards on the speed at which the model XYT printer prints out papers
of the same quality. The null hypothesis is that there is not any
difference in the speed of the printer no matter what type of system
board is used, and the alternate hypothesis is that there is a
difference. The speed is measured in seconds and the following
samples were taken at random for each system board:
| Sys I | Sys II | Sys III | Sys IV |
|---|---|---|---|
| 7 | 4 | 8 | 7 |
| 4 | 4 | 6 | 3 |
| 5 | 5 | 4 | 6 |
| 7 | 5 | 3 | 3 |
| 6 | 3 | 5 | 6 |
| 4 | 8 | 5 | 6 |
| 3 | 5 | 5 | 5 |
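The sum of all the observations is 142 and the sum of the squares of
the 28 observations is 780, so the Total Sum of Squares TSS will be:
$$TSS = \sum X^2 - \frac{(\sum X)^2}{N} = 780 - \frac{142^2}{28} = 780 - 720.14286 = 59.85714$$
The totals of the 7 observations of the four different system boards
are respectively 36, 34, 36 and 36. The Sum of Squares between
treatments will be:
$$SS_k = \sum_{j=1}^{k} \frac{T_j^2}{n_j} - \frac{(\sum X)^2}{N} = \frac{36^2 + 34^2 + 36^2 + 36^2}{7} - \frac{142^2}{28} = 720.57143 - 720.14286 = 0.42857$$
where $T_j$ is the total of the observations of treatment $j$.
Now that we have the TSS and the SSk, we can find the SSE by just
subtracting the SSk from the TSS.
Therefore:
$$SSE = TSS - SS_k = 59.85714 - 0.42857 = 59.42857$$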
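The shortcut can be sketched in Python as follows: only the raw totals and the sum of squared observations are needed, and the error term is obtained by subtraction. The variable names (such as `correction` for the shared term) are illustrative choices.

```python
boards = {
    "Sys I":   [7, 4, 5, 7, 6, 4, 3],
    "Sys II":  [4, 4, 5, 5, 3, 8, 5],
    "Sys III": [8, 6, 4, 3, 5, 5, 5],
    "Sys IV":  [7, 3, 6, 3, 6, 6, 5],
}

values = [x for obs in boards.values() for x in obs]
n_total = len(values)                        # 28 observations
correction = sum(values) ** 2 / n_total      # (sum of all observations)^2 / N

# Total: sum of squared observations minus the correction term
tss = sum(x ** 2 for x in values) - correction
# Between treatments: squared treatment totals over group size, minus correction
ssk = sum(sum(obs) ** 2 / len(obs) for obs in boards.values()) - correction
# Error: whatever is left of the total
sse = tss - ssk

print(f"TSS = {tss:.5f}, SSk = {ssk:.5f}, SSE = {sse:.5f}")
```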

Now that we have the Total Sum of Squares (TSS), the Sum of Squares
Between Treatments (SSk) and the Error Sum of Squares (SSE), we need to
determine the degrees of freedom.
Since we have 4 treatments, the df Between Treatments will be 3. We
have 28 observations, so the degree of freedom for the Error will be 24
(28 minus the number of treatments, which is 4). The Total degree of
freedom will be 27 (24 plus 3).
The next step will be the determination of the Mean Squares.
The Mean Square for Treatment (MSk) will be the ratio of the SSk to
its degree of freedom:
$$MS_k = \frac{SS_k}{k - 1} = \frac{0.42857}{3} = 0.14286$$
The Mean Square for Error (MSE) will be the ratio of the SSE to its
degree of freedom:
$$MSE = \frac{SSE}{N - k} = \frac{59.42857}{24} = 2.47619$$
Now that we have the MSk and the MSE, we can easily determine the
calculated F-Statistic:
$$F = \frac{MS_k}{MSE} = \frac{0.14286}{2.47619} = 0.05769$$

We can now put all
the results in the ANOVA table
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Statistic |
|---|---|---|---|---|
| Between Treatments | 0.4285 | 3 | 0.14286 | 0.05769 |
| Error | 59.4285 | 24 | 2.476 | |
| Total | 59.8571 | 27 | | |
Based on this information alone, we cannot decide whether or not to
reject the null hypothesis until we compare the calculated F-Statistic
with the critical F value found on the F-table.
The critical F value from the table for df1 equal to 3 and df2 equal
to 24 is 3.01, which is greater than 0.05769, the calculated
F-Statistic. Therefore, we cannot reject the null hypothesis.
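An alternative to the table lookup is to compute the p-value, the probability of observing an F at least as large as the calculated one when the null hypothesis is true. The sketch below integrates the F density numerically with Simpson's rule using only the standard library. It is an illustration, not the method used in the article: the function names are hypothetical, and the code assumes at least 2 numerator degrees of freedom so that the density is finite at zero.

```python
from math import exp, lgamma, log


def f_pdf(x, d1, d2):
    """Density of the F distribution with (d1, d2) degrees of freedom, d1 >= 2."""
    if x == 0:
        return 1.0 if d1 == 2 else 0.0
    log_pdf = (lgamma((d1 + d2) / 2) - lgamma(d1 / 2) - lgamma(d2 / 2)
               + (d1 / 2) * log(d1 / d2)
               + (d1 / 2 - 1) * log(x)
               - ((d1 + d2) / 2) * log(1 + d1 * x / d2))
    return exp(log_pdf)


def f_p_value(f_stat, d1, d2, steps=2000):
    """Upper-tail probability P(F > f_stat), via Simpson's rule on [0, f_stat]."""
    h = f_stat / steps                       # steps must be even
    total = f_pdf(0, d1, d2) + f_pdf(f_stat, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return 1 - total * h / 3


p_value = f_p_value(0.05769, d1=3, d2=24)
print(f"p-value = {p_value:.3f}; reject H0 at the 5% level: {p_value < 0.05}")
```

A p-value far above 0.05 leads to the same decision as comparing the calculated F with the critical value of 3.01.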
Had we used Excel, we would have obtained the same table.
About the author
Issa Bass is the managing
editor of SixSigmaFirst. He can be reached at
issa@sixsigmafirst.com