Analysis Of Variance (ANOVA) in short sleeved English!

By Issa Bass

The purpose of this article is to present the Analysis Of Variance in plain English.
Most computers are equipped with some form of spreadsheet, especially Excel, and its embedded Data Analysis tool makes it painless to obtain the figures that we are seeking. But statistics software tools neither interpret the results nor explain the process that leads to them, so understanding the step by step process of ANOVA becomes paramount.

At the end of this article we will illustrate our reasoning with 2 practical examples using two different methods.

 

 

The difference between Regression Analysis and ANOVA

Regression Analysis enables the experimenter to make predictions about the evolution of the object of the experiment by determining the relationship between the predicted variable and the predictors of the experiment. ANOVA, by contrast, helps determine if there is a difference between the means of several treatments.

For instance, if we want to know by how much the population of a given city will grow within the next 5 years, we can collect enough data relevant to the population fluctuation in that city, conduct a Regression Analysis and predict the population growth.

Regression Analysis enables the experimenter to build a model in the form of an algebraic function that makes it easy to make predictions.

The function can be a first degree polynomial of the form:

Y = f(x) = ax + b

 

Where Y is the expected response (population in 5 years), x is the variable that explains the changes, a measures the propensity for a change (the slope) and b is a constant that is equal to Y when x is equal to zero.
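To make the function Y = ax + b concrete, here is a minimal sketch of a least-squares fit in Python. The population figures are invented purely for illustration and are not taken from any real city; the names `years`, `population` and `predict` are our own.

```python
# Least-squares fit of Y = a*x + b using only the standard library.
# The data below are made up for illustration; they are not from the article.
years = [0, 1, 2, 3, 4]                            # x: years since the first count
population = [100.0, 102.0, 104.0, 106.0, 108.0]   # Y: population in thousands (hypothetical)

n = len(years)
mean_x = sum(years) / n
mean_y = sum(population) / n

# Slope a = sum of cross-deviations / sum of squared x-deviations
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, population)) / \
    sum((x - mean_x) ** 2 for x in years)
# Intercept b = value of Y when x is zero
b = mean_y - a * mean_x


def predict(x):
    """Expected response Y for a given x under the fitted line."""
    return a * x + b


print(f"a = {a:.3f}, b = {b:.3f}, prediction 5 years after the last count: {predict(9):.1f}")
```

With real data the points would not fall exactly on a line, and the fitted a and b would only approximate the trend.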

The Analysis Of Variance is about determining if there is a difference between several treatments. For example, if we want to know if three different types of unleaded fuel (Premium, Medium and Regular) impact the longevity of a car engine differently, ANOVA would be the right tool.

 

One Way ANOVA

Let’s suppose that we have a soap manufacturing machine that is used by employees grouped in three shifts composed of an equal number of employees. We want to know if there is a difference in productivity between the three shifts.

Had there been only two shifts, we would have used a t-test, which is based on the standard error, to determine if a difference exists. But since we have three shifts, we would need a separate t-test for every pair of shifts, and running several tests increases the probability of wrongly rejecting a true null hypothesis.
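A rough calculation shows how quickly the risk grows. The sketch below assumes a significance level of 0.05 for each t-test and, as a simplification, treats the pairwise tests as independent; neither assumption comes from the article.

```python
from math import comb

alpha = 0.05               # assumed significance level of each individual t-test
shifts = 3
pairs = comb(shifts, 2)    # number of pairwise comparisons among the shifts

# Probability of at least one false rejection across all pairs,
# under the simplifying assumption that the tests are independent.
familywise_error = 1 - (1 - alpha) ** pairs

print(f"{pairs} pairwise tests, chance of at least one false alarm: {familywise_error:.4f}")
```

Under these assumptions the chance of at least one mistake is close to three times the 0.05 we thought we were accepting, which is why a single ANOVA is preferred.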

In either case, we will formulate a null hypothesis about the productivity of the three shifts before proceeding with the testing. The null hypothesis for this particular case will stipulate that there is no difference between the productivity of the three groups.


The null hypothesis will be:


H0: the productivity of the first shift = productivity of second shift = productivity of third shift.

And the alternate hypothesis will be:


H1: there is a difference between the productivity of the three shifts.


Some conditions must be met in order for the results derived from the test to be valid:

 

·         the treatment data must be normally distributed,

·         the variance must be the same for all treatments,

·         all samples are randomly selected

·         and the samples are independent

 

Seven samples of data have been taken for every shift and summarized in the table below. What we are comparing is not the productivity by day but the productivity by shift; the days merely supply the repeated observations.

In ANOVA terminology, the shift is the factor, the three shifts are its levels (also called treatments) and the daily productivities are the observations.

 

 

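| Day | First Shift | Second Shift | Third Shift |
|---|---|---|---|
| Monday | 78 | 77 | 88 |
| Tuesday | 88 | 75 | 86 |
| Wednesday | 90 | 80 | 79 |
| Thursday | 77 | 83 | 93 |
| Friday | 85 | 87 | 79 |
| Saturday | 88 | 90 | 83 |
| Sunday | 79 | 85 | 79 |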
 

The objective is to determine if the differences are due to random errors (individual variations within the groups) or to variations between the groups.

If the differences are due to variations between the three shifts, we reject the null hypothesis. If they are due to variations within treatments, we cannot reject it. Let’s note that statisticians do not accept the null hypothesis: a hypothesis is either rejected or the experimenter fails to reject it.

The variability of a set of data depends on the sum of squares of the deviations from the mean:

$\sum_{i=1}^{n} (x_i - \bar{x})^2$


 

 

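In the Analysis Of Variance, the total variation is subdivided between the variation due to the treatment and the variation due to random error, or within treatment.

SSk is the Sum of Squares between treatments:

$SS_k = \sum_{j=1}^{k} n_j (\bar{x}_j - \bar{\bar{x}})^2$

SSE is the Sum of Squares due to error:

$SSE = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{x}_j)^2$

TSS is the total Sum of Squares:

$TSS = \sum_{j=1}^{k} \sum_{i=1}^{n_j} (x_{ij} - \bar{\bar{x}})^2$

i = a given observation within a treatment level

j = a treatment level

k = number of treatment levels

$n_j$ = number of observations in a treatment level

$\bar{\bar{x}}$ = grand mean

$\bar{x}_j$ = mean of a treatment group level

$x_{ij}$ = particular value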
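The three sums of squares can be computed directly from these definitions. The following is a minimal sketch in plain Python; the function name `sums_of_squares` is our own, and the data are the shift productivities from the table earlier in the article.

```python
def sums_of_squares(groups):
    """Return (SSk, SSE, TSS) for a list of treatment groups.

    `groups` is a list of lists; each inner list holds the observations
    of one treatment level.
    """
    all_values = [x for g in groups for x in g]
    grand_mean = sum(all_values) / len(all_values)

    ssk = 0.0   # between treatments
    sse = 0.0   # within treatments (error)
    for g in groups:
        n_j = len(g)
        mean_j = sum(g) / n_j
        ssk += n_j * (mean_j - grand_mean) ** 2
        sse += sum((x - mean_j) ** 2 for x in g)

    # Total: every observation against the grand mean
    tss = sum((x - grand_mean) ** 2 for x in all_values)
    return ssk, sse, tss


shifts = [
    [78, 88, 90, 77, 85, 88, 79],   # first shift
    [77, 75, 80, 83, 87, 90, 85],   # second shift
    [88, 86, 79, 93, 79, 83, 79],   # third shift
]
ssk, sse, tss = sums_of_squares(shifts)
print(f"SSk = {ssk:.4f}, SSE = {sse:.4f}, TSS = {tss:.4f}")
```

The function does not require the groups to have the same size, although in our example every shift has seven observations.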

 

It is customary to summarize the results of the Analysis Of Variance in a table.

 

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Statistic |
|---|---|---|---|---|
| Between Treatments | SSk | k-1 | MSk = SSk/(k-1) | F = MSk/MSE |
| Error | SSE | N-k | MSE = SSE/(N-k) | |
| Total | TSS | N-1 | | |

Degree of Freedom

The concept of degree of freedom is better explained through an example. Let’s suppose that a person has 10 dollars to spend on 10 different items that cost a dollar apiece. At first his degree of freedom is 10 because he has the freedom to spend the 10 dollars however he wants, but after he has spent 9 dollars his degree of freedom becomes 1 since he does not have more than one choice.
The concept of degree of freedom is widely used in Statistics to derive an unbiased estimator.

·         k – 1 is the degree of freedom between treatments; it is the number of treatments minus 1.

·         N – k is the degree of freedom for the Error, where N is the total number of observations.

·         N – 1 is the total degree of freedom.

 

Mean Square
The mean square will be the ratio of the sum of square to the degree of freedom.

 

MSk = SSk/(k – 1)

 

MSE = SSE/(N – k)

 

The F-statistic is the ratio of MSk to MSE, but its interpretation goes beyond that and is better explained using an example.
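Once the two sums of squares are known, the rest of the ANOVA table follows mechanically. Here is a small sketch of that bookkeeping; the function name `anova_table` and the dictionary keys are our own choices, and the input values are the ones obtained for the three-shift example worked out later in the article.

```python
def anova_table(ssk, sse, k, n_total):
    """Build the one-way ANOVA table entries from the two sums of squares.

    k       -- number of treatments
    n_total -- total number of observations (N)
    """
    df_between = k - 1
    df_error = n_total - k
    msk = ssk / df_between     # mean square between treatments
    mse = sse / df_error       # mean square for the error
    return {
        "df_between": df_between,
        "df_error": df_error,
        "df_total": n_total - 1,
        "MSk": msk,
        "MSE": mse,
        "F": msk / mse,
        "TSS": ssk + sse,
    }


# Sums of squares from the three-shift example: 3 treatments, 21 observations.
table = anova_table(ssk=8.0, sse=530.2857, k=3, n_total=21)
for name, value in table.items():
    print(f"{name:>10}: {value:.5g}")
```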

 

So, let’s get back to our example to build the ANOVA table.

There are several ways to build the table; we will use two of them. First, we will use the formulas above step by step.

 

First Method 

 

 

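| Day | First Shift | Second Shift | Third Shift |
|---|---|---|---|
| Monday | 78 | 77 | 88 |
| Tuesday | 88 | 75 | 86 |
| Wednesday | 90 | 80 | 79 |
| Thursday | 77 | 83 | 93 |
| Friday | 85 | 87 | 79 |
| Saturday | 88 | 90 | 83 |
| Sunday | 79 | 85 | 79 |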
 

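Let us first find SSk, the Sum of Squares between treatments.

The table is presented in the form of values $x_{ij}$, with j = 1, 2, 3 for the three shifts (treatments) and i = 1, ..., 7 for the observations within each shift.

$\bar{\bar{x}}$ is the mean of all the observed data. It is equal to the sum of all the observations divided by 21, that is 1749/21 = 83.286.

$\bar{x}_j$ is the mean of each treatment. For the first shift, it is equal to 83.571, for the second shift, it is 82.429 and for the third shift, it is 83.857.

$(\bar{x}_j - \bar{\bar{x}})$ represents the difference between the mean for each treatment and the mean of all the observations.

$SS_k = 7\,[(83.571 - 83.286)^2 + (82.429 - 83.286)^2 + (83.857 - 83.286)^2] = 7 \times 1.1429 = 8$

The Sum of Squares Between Treatments is therefore equal to 8.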
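The intermediate figures behind this result can be reproduced with a few lines of Python. This is only a sketch of the step by step calculation; the variable names are our own.

```python
shifts = {
    "First shift":  [78, 88, 90, 77, 85, 88, 79],
    "Second shift": [77, 75, 80, 83, 87, 90, 85],
    "Third shift":  [88, 86, 79, 93, 79, 83, 79],
}

all_values = [x for values in shifts.values() for x in values]
grand_mean = sum(all_values) / len(all_values)      # 1749 / 21

ssk = 0.0
for name, values in shifts.items():
    n_j = len(values)
    mean_j = sum(values) / n_j
    diff = mean_j - grand_mean            # treatment mean minus grand mean
    contribution = n_j * diff ** 2        # weighted by the 7 observations
    ssk += contribution
    print(f"{name:<13} mean={mean_j:.3f} diff={diff:+.4f} n*diff^2={contribution:.4f}")

print(f"grand mean = {grand_mean:.4f}, SSk = {ssk:.4f}")
```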

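Now we will find the SSE, the Sum of Squares for the Error:

$SSE = \sum_{j} \sum_{i} (x_{ij} - \bar{x}_j)^2$

First we find the difference between each observation and its treatment mean.

The next step consists in squaring each of these differences and adding them up. The squared differences sum to 177.714 for the first shift, 175.714 for the second shift and 176.857 for the third shift, so SSE = 530.2857.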
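These two steps, taking the deviations from each treatment mean and then squaring them, can be sketched as follows. The dictionary `within_ss`, which keeps the subtotal for each shift, is our own addition.

```python
shifts = {
    "First shift":  [78, 88, 90, 77, 85, 88, 79],
    "Second shift": [77, 75, 80, 83, 87, 90, 85],
    "Third shift":  [88, 86, 79, 93, 79, 83, 79],
}

within_ss = {}   # sum of squared deviations inside each shift
for name, values in shifts.items():
    mean_j = sum(values) / len(values)
    deviations = [x - mean_j for x in values]     # observation minus its treatment mean
    squared = [d ** 2 for d in deviations]        # square of each deviation
    within_ss[name] = sum(squared)
    print(name, [round(d, 3) for d in deviations], round(within_ss[name], 4))

sse = sum(within_ss.values())
print(f"SSE = {sse:.4f}")
```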

 

Now we can find the Total Sum of Squares, TSS:

$TSS = \sum_{j} \sum_{i} (x_{ij} - \bar{\bar{x}})^2$

Let’s remember the value of the grand mean: $\bar{\bar{x}} = 83.286$.

We subtract the grand mean from every observation, and then we square each of the resulting differences.

The Total Sum of Squares is the sum of all these squared differences: TSS = 538.2857.
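The same calculation in Python, together with a check that the total splits exactly into the between-treatments part and the error part. This is a sketch with our own variable names.

```python
shifts = [
    [78, 88, 90, 77, 85, 88, 79],   # first shift
    [77, 75, 80, 83, 87, 90, 85],   # second shift
    [88, 86, 79, 93, 79, 83, 79],   # third shift
]

all_values = [x for g in shifts for x in g]
grand_mean = sum(all_values) / len(all_values)

# Subtract the grand mean from every observation, square, and add up.
squared_deviations = [(x - grand_mean) ** 2 for x in all_values]
tss = sum(squared_deviations)

# The two components, recomputed here so the block stands on its own.
ssk = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in shifts)
sse = sum((x - sum(g) / len(g)) ** 2 for g in shifts for x in g)

print(f"TSS = {tss:.4f}, SSk + SSE = {ssk + sse:.4f}")
```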

 

 

 

Now that we have solved the most difficult problems, we can find the degrees of freedom.
Since we have 3 treatments, the degree of freedom between treatments will be 2 (3 – 1).
We have 21 observations, so the df for the error will be 18 (the number of observations minus the number of treatments, 21 – 3).

The mean square for the treatment will be the ratio of the Sum of Squares between treatments to its df (8/2 = 4).
The mean square for the Error will be the ratio of the Sum of Squares for the Error to its df (530.2857/18 = 29.4603).

The F-Statistic is the ratio of the “Between Treatments” mean square to the Error mean square (4/29.4603 = 0.13578).

 

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Statistic |
|---|---|---|---|---|
| Between Treatments | 8 | 2 | 4 | 0.13578 |
| Error | 530.2857 | 18 | 29.4603 | |
| Total | 538.2857 | 20 | | |

Using MS Excel
Had we chosen to use Excel, we would have gotten the following table.


To use Excel, we need to have Data Analysis installed. If it is not, follow these steps:

·         Click on Tools

·         Select Add-Ins

·         On the pop-up window, put check marks on all the boxes (Data Analysis is provided by the Analysis ToolPak add-in)

·         Go back to Tools and select Data Analysis

·         Select “Anova: Single Factor” and then click on OK

·         The Anova: Single Factor menu pops up

·         Select the range of data to be entered in the Input Range

·         Click OK

 

 

 

The F-Statistic by itself does not provide grounds for rejection or non-rejection of the null hypothesis. It needs to be compared with the critical F-value, which is found on a separate F-table. If the calculated F is greater than the critical F-value on the F-table, then the null hypothesis is rejected; if not, we cannot reject the hypothesis.

In our case, from the F-table, the critical value of F for α = 0.05 with the degrees of freedom df1 = 2 and df2 = 18 is 3.55.

Since 3.55 is greater than 0.13578, we cannot reject the null hypothesis. We conclude that there is not a statistically significant difference between the means of the three shifts.
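The critical value does not have to come from a printed table. When the numerator has exactly 2 degrees of freedom, as it does here, the upper tail of the F distribution has a simple closed form, P(F > f) = (1 + 2f/df2)^(-df2/2), which can be inverted by hand. The sketch below relies on that special case and assumes a significance level of 0.05; the function name `f_critical_df1_2` is our own, and the shortcut does not apply to other numerator degrees of freedom.

```python
def f_critical_df1_2(alpha, df2):
    """Upper-tail critical value of F with 2 and df2 degrees of freedom.

    Inverts the closed form P(F > f) = (1 + 2*f/df2) ** (-df2/2),
    which holds only when the numerator degrees of freedom equal 2.
    """
    return (df2 / 2) * (alpha ** (-2 / df2) - 1)


alpha = 0.05        # assumed significance level
df2 = 18            # error degrees of freedom in the shift example
f_calc = 0.13578    # calculated F-statistic from the ANOVA table

f_crit = f_critical_df1_2(alpha, df2)
p_value = (1 + 2 * f_calc / df2) ** (-df2 / 2)   # probability of an F at least this large

print(f"critical F = {f_crit:.4f}, p-value = {p_value:.4f}")
print("reject H0" if f_calc > f_crit else "cannot reject H0")
```

The p-value tells the same story as the comparison with the critical value: it is far above 0.05, so the data give no ground for rejecting the null hypothesis.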

 

 

Method II


The first Method was very detailed and explicit but also long and maybe cumbersome. There should be a way to do it faster and painlessly.


Since the following equation is true, we may not need to calculate all the variables.

TSS = SSk + SSE.


Let’s suppose that we are comparing the effects of 4 different system boards on the speed at which the model XYT printer prints out papers of the same quality. The null hypothesis is that there is no difference in the speed of the printer no matter what type of system board is used, and the alternate hypothesis is that there is a difference. The speed is measured in seconds and the following samples were taken at random for each system board:

 

| Sys I | Sys II | Sys III | Sys IV |
|---|---|---|---|
| 7 | 4 | 8 | 7 |
| 4 | 4 | 6 | 3 |
| 5 | 5 | 4 | 6 |
| 7 | 5 | 3 | 3 |
| 6 | 3 | 5 | 6 |
| 4 | 8 | 5 | 6 |
| 3 | 5 | 5 | 5 |

 

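The sum of all the observations is 142 and the sum of the squares of the 28 observations is 780, so the Total Sum of Squares TSS will be:

$TSS = 780 - \frac{142^2}{28} = 780 - 720.1429 = 59.8571$

The totals of the 7 observations of the four different system boards are respectively 36, 34, 36 and 36. The Sum of Squares between treatments will be:

$SSK = \frac{36^2 + 34^2 + 36^2 + 36^2}{7} - \frac{142^2}{28} = \frac{5044}{7} - \frac{20164}{28} \approx 0.4286$

Now that we have the TSS and the SSK, we can find the SSE by just subtracting the SSK from the TSS.
Therefore:

$SSE = TSS - SSK = 59.8571 - 0.4286 = 59.4286$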

 

 

 

 

Now that we have the Total Sum of Square (TSS), the Sum of Square Between Treatments (SSK) and the Error Sum of Square (SSE), we need to determine the Degrees of freedom.

Since we have 4 treatments, the df  Between Treatment will be 3. We have 28 observations, the degree of freedom for the Error will be 24 (28 minus the number of treatments which is 4). The Total degree of freedom will be 27 (24 plus 3).
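The shortcut calculations for the printer example, along with the degrees of freedom, can be reproduced with the sketch below. The variable names, including `correction` for the term (sum of all observations)²/N, are our own.

```python
boards = {
    "Sys I":   [7, 4, 5, 7, 6, 4, 3],
    "Sys II":  [4, 4, 5, 5, 3, 8, 5],
    "Sys III": [8, 6, 4, 3, 5, 5, 5],
    "Sys IV":  [7, 3, 6, 3, 6, 6, 5],
}

all_values = [x for values in boards.values() for x in values]
n_total = len(all_values)                          # 28 observations
grand_total = sum(all_values)                      # 142
sum_of_squares = sum(x * x for x in all_values)    # 780
correction = grand_total ** 2 / n_total            # (sum of observations)^2 / N

tss = sum_of_squares - correction
# Square each treatment total, divide by its number of observations, add up.
ssk = sum(sum(v) ** 2 / len(v) for v in boards.values()) - correction
sse = tss - ssk                                    # no need to compute it directly

k = len(boards)
df_between, df_error, df_total = k - 1, n_total - k, n_total - 1

print(f"TSS = {tss:.4f}, SSK = {ssk:.4f}, SSE = {sse:.4f}")
print(f"df: between = {df_between}, error = {df_error}, total = {df_total}")
```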

The next step will be the determination of the Mean Squares.

The Mean Square for Treatment (MSK) will be the ratio of the SSK to its degree of freedom:

$MSK = \frac{0.4286}{3} = 0.14286$

The Mean Square for Error (MSE) will be the ratio of the SSE to its degree of freedom:

$MSE = \frac{59.4286}{24} = 2.4762$

Now that we have the MSK and the MSE, we can easily determine the calculated F-Statistic:

$F = \frac{MSK}{MSE} = \frac{0.14286}{2.4762} = 0.05769$

 

 

  

 

We can now put all the results in the ANOVA table

 

| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-Statistic |
|---|---|---|---|---|
| Between Treatments | 0.4286 | 3 | 0.14286 | 0.05769 |
| Error | 59.4286 | 24 | 2.4762 | |
| Total | 59.8571 | 27 | | |

Based on this information alone, we can neither reject nor fail to reject the null hypothesis; we must first compare the calculated F-Statistic with the critical F-value found on the F-table.

The critical F-value from the table (at α = 0.05) for df1 equal to 3 and df2 equal to 24 is 3.01, which is greater than 0.05769, the calculated F-Statistic. Therefore, we cannot reject the null hypothesis.
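For readers without an F-table at hand, the tail probability can be approximated by integrating the F density numerically. The sketch below uses composite Simpson integration with the standard library only. The function names `f_pdf` and `f_cdf`, the number of steps and the 0.05 significance level are our own assumptions, and the code is meant for numerator degrees of freedom of 2 or more.

```python
from math import gamma


def f_pdf(x, d1, d2):
    """Density of the F distribution with d1 (>= 2) and d2 degrees of freedom."""
    beta = gamma(d1 / 2) * gamma(d2 / 2) / gamma((d1 + d2) / 2)
    return ((d1 / d2) ** (d1 / 2) * x ** (d1 / 2 - 1)
            * (1 + d1 * x / d2) ** (-(d1 + d2) / 2)) / beta


def f_cdf(f, d1, d2, steps=20000):
    """P(F <= f) by composite Simpson integration on [0, f]; steps must be even."""
    h = f / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(f, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3


f_calc = 0.05769                                 # calculated F for the printer example
p_value = 1 - f_cdf(f_calc, 3, 24)               # chance of an F at least this large
tail_at_critical = 1 - f_cdf(3.01, 3, 24)        # should be close to 0.05

print(f"p-value = {p_value:.4f}, tail beyond 3.01 = {tail_at_critical:.4f}")
```

A p-value well above 0.05 leads to the same conclusion as the comparison with the critical value: we cannot reject the null hypothesis.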

Had we used Excel, we would have had the following table

 

 



About the author
Issa Bass is the managing editor of SixSigmaFirst. He can be reached at issa@sixsigmafirst.com
