By Issa Bass
1-1 Measures of central tendency
In most cases, a single value can help describe a set of data; that value is referred to as a measure of central tendency. The most common measures of central tendency are the arithmetic mean, the mode and the median. The geometric mean is used less often, but it is useful for averaging percentages, ratios and growth rates.
1-1.1 Arithmetic mean
1-1.1.a Arithmetic mean for raw data
For ungrouped data, that is data that has not been grouped in a frequency distribution, the arithmetic mean of a population is the sum of all the values in that population divided by the number of values in the population.
Arithmetic mean: $\mu = \dfrac{\sum_{i=1}^{N} X_i}{N}$
Where:
$\mu$ represents the arithmetic mean of the population
$X_i$ represents the ith value observed
$N$ is the number of items in the observed population
$\sum_{i=1}^{N} X_i$ is the sum of the X values
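Example 1: The following table shows how many computers are produced during a five-day work week. What is the average daily production?
| Days | Production |
| --- | --- |
| Day 1 | 500 |
| Day 2 | 750 |
| Day 3 | 600 |
| Day 4 | 450 |
| Day 5 | 775 |
Solution:
$\mu = \dfrac{500 + 750 + 600 + 450 + 775}{5} = \dfrac{3075}{5} = 615$ computers per day.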
Example 2: The following table shows the daily production of four teams of workers over a period of 4 days. Each team has a different number of workers. What is the average daily production per worker during that period?
| Days | Teams | Number of workers per team | Production |
| --- | --- | --- | --- |
| Day 1 | Team 1 | 15 | 750 |
| Day 2 | Team 2 | 13 | 400 |
| Day 3 | Team 3 | 12 | 700 |
| Day 4 | Team 4 | 10 | 600 |
Total production over the four days = 750 + 400 + 700 + 600 = 2450
Total number of workers = 15 + 13 + 12 + 10 = 50
Average production per worker = 2450 / 50 = 49
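The two examples above can be checked with a few lines of Python. This is only a minimal sketch: the function name `arithmetic_mean` and the variable names are illustrative choices, not part of the article.

```python
# Daily production from Example 1 (values taken from the table above).
daily_production = [500, 750, 600, 450, 775]


def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)


mean_daily = arithmetic_mean(daily_production)

# Example 2: total production divided by the total number of workers.
workers = [15, 13, 12, 10]
production = [750, 400, 700, 600]
per_worker = sum(production) / sum(workers)

print(mean_daily)   # 615.0
print(per_worker)   # 49.0
```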
1-1.1.b Arithmetic mean of grouped data
Sometimes, the available data are grouped in intervals or classes and presented in the form of a frequency distribution. Data on the income or age of a population are often presented in that way. Because the individual values are no longer available, a measure of central tendency cannot be determined exactly, so it is approximated using the midpoints of the intervals and their frequencies.
Arithmetic mean of grouped data: $\bar{X} = \dfrac{\sum fX}{N}$
Where:
$\bar{X}$ is the sample arithmetic mean
$X$ is the midpoint of each interval
$f$ is the frequency in each interval
$fX$ is the midpoint times the frequency
$\sum fX$ is the sum of these products
$N$ is the total of the frequencies
Example: The net revenues for a group of companies are organized in the table below. Determine the estimated arithmetic mean revenue of the companies.
| Revenues ($ millions) | Number of companies |
| --- | --- |
| 18 - 22 | 3 |
| 23 - 27 | 17 |
| 28 - 32 | 10 |
| 33 - 37 | 15 |
| 38 - 42 | 9 |
| 43 - 47 | 3 |
| 48 - 52 | 14 |
| 53 - 57 | 5 |
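Solution:
| Revenues ($ millions) | Number of companies (ƒ) | Midpoint of revenues (X) | ƒX |
| --- | --- | --- | --- |
| 18 - 22 | 3 | 20 | 60 |
| 23 - 27 | 17 | 25 | 425 |
| 28 - 32 | 10 | 30 | 300 |
| 33 - 37 | 15 | 35 | 525 |
| 38 - 42 | 9 | 40 | 360 |
| 43 - 47 | 3 | 45 | 135 |
| 48 - 52 | 14 | 50 | 700 |
| 53 - 57 | 5 | 55 | 275 |
| Total: | 76 | | 2780 |
$\bar{X} = \dfrac{\sum fX}{N} = \dfrac{2780}{76} \approx 36.579$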
So the mean revenue per company is $36.579 million.
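The grouped-data estimate can be reproduced as a short sketch. The class limits and frequencies come from the revenue table; the variable names are illustrative.

```python
# Class limits and frequencies from the revenue table (in $ millions).
classes = [(18, 22), (23, 27), (28, 32), (33, 37),
           (38, 42), (43, 47), (48, 52), (53, 57)]
frequencies = [3, 17, 10, 15, 9, 3, 14, 5]

# Midpoint of each class: (lower limit + upper limit) / 2.
midpoints = [(low + high) / 2 for low, high in classes]

# Sum of frequency x midpoint, divided by the total frequency.
total_fx = sum(f * x for f, x in zip(frequencies, midpoints))
total_f = sum(frequencies)
grouped_mean = total_fx / total_f

print(total_f, total_fx)        # 76 2780.0
print(round(grouped_mean, 3))   # 36.579
```

Because every company in a class is represented by the class midpoint, the result is an approximation of the true mean, not its exact value.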
1-1.2 Geometric mean
The geometric mean is used to find the average of ratios, indexes or growth rates. It is the nth root of the product of n values.
Geometric mean: $GM = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n}$
Let's suppose that a company's revenues grew by 15% last year and 25% this year. The average increase is not 20%, that is (15 + 25)/2.
$GM = \sqrt{15 \times 25} = \sqrt{375} \approx 19.365$, that is, about 19.365%.
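A minimal sketch of the calculation, using only the standard library. The first result reproduces the article's figure. The second is an assumption that is not in the article: a common alternative is to average the growth factors 1.15 and 1.25 and subtract 1, which gives a slightly different answer.

```python
import math


def geometric_mean(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))


# The article's example: geometric mean of the two percentages.
gm_percent = geometric_mean([15, 25])
print(round(gm_percent, 3))   # 19.365

# Assumption (not in the article): compound the growth factors instead.
gm_factor = geometric_mean([1.15, 1.25]) - 1
print(round(gm_factor * 100, 2))   # about 19.9
```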
1-1.3 Mode
The mode is not a very frequently used measure of central tendency but it is still an important one. It represents the value of the observation that appears most frequently.
Consider the following sample measurements:
75 60 65 75 80 90 75 80 67
75 appears most frequently (three times), thus it is the mode.
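Counting occurrences is easy to sketch with `collections.Counter`. Returning a list of modes is a choice made here for illustration, so that ties are handled too; the article's sample has a single mode.

```python
from collections import Counter

measurements = [75, 60, 65, 75, 80, 90, 75, 80, 67]

# Count how often each value occurs.
counts = Counter(measurements)
highest = max(counts.values())

# Keep every value that reaches the highest count (handles ties).
modes = [value for value, count in counts.items() if count == highest]

print(modes, highest)   # [75] 3
```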
1-1.4 Median
The median of a set of data is the value of x such that half the measurements are less than x and half are greater.
Consider the following set of data:
12 25 15 19 40 17 36
n = 7 is odd. If we rearrange the data in order of increasing magnitude, we obtain:
12 15 17 19 25 36 40
The median is the 4th value, 19, because its position in the ordered list is 0.5(7 + 1) = 4.
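The procedure can be sketched as follows. The article only covers an odd number of values. The even case, which averages the two middle values, is the usual convention and is added here as an assumption.

```python
def median(values):
    """Middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the value at position 0.5 * (n + 1), counting from 1.
        return ordered[middle]
    # Even n (assumption, not covered in the article): average the two
    # middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


data = [12, 25, 15, 19, 40, 17, 36]
print(median(data))   # 19
```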
1-2 Measures of dispersion
The measures of central tendency only locate the center of the data; they do not show how the data are spread. The measures of dispersion, or variability, provide that information. If the values are closely clustered around the mean, that is, if the measures of dispersion are small, the mean is a good and reliable representation of the data.
Variation is very important in quality control. For instance, if we are manufacturing tires, an excessive variation in the depth of the treads of the tires would imply a high rate of defective products.
The study of variability also helps compare the spread of more than one distribution. Suppose that the arithmetic mean of the daily production of cars in two car plants is 1000. We might conclude that the two plants produce the same number of cars every day. But observation over a period of time might show that one produces between 950 and 1050 cars a day and the other between 450 and 1550. So the second plant's production is more erratic.
The most widely used measures of dispersion are the range, the variance and the standard deviation.
1-2.1 Range
The range is the simplest of all measures of variability. It is the difference between the highest and the lowest values of a data set.
Range = highest value - lowest value.
Example:
The weekly output on a production line is given in the table below.
| Days | Production |
| --- | --- |
| Day 1 | 700 |
| Day 2 | 850 |
| Day 3 | 600 |
| Day 4 | 575 |
| Day 5 | 450 |
| Day 6 | 900 |
| Day 7 | 300 |
Table 1-2.1
The range = 900 - 300 = 600
The concept of range will be explored further when we study Statistical Process Control (SPC).
1-2.2 Mean Deviation
The range is very simple; in fact it is too simple, since it only considers two values in a data set. It is not informative about the other values. The mean deviation, the variance and the standard deviation provide more information about all the data observed.
The mean deviation measures the amount by which the values in a population are dispersed around the mean.
It is the sum of the absolute values of the deviations from the mean divided by the number of observations in the sample. The absolute values of the deviations are used because the sum of the deviations from the mean, $\sum (X_i - \bar{X})$, is always equal to zero.
Mean deviation: $MD = \dfrac{\sum_{i=1}^{n} \lvert X_i - \bar{X} \rvert}{n}$
Where:
$X_i$ is the value of each observation
$\bar{X}$ is the arithmetic mean of the observations
$X_i - \bar{X}$ is the deviation from the mean
$n$ is the number of observations in the sample.
Example: Let's use Table 1-2.1 to find the mean deviation of the weekly production.
Solution:
We need to find the arithmetic mean first.
$\bar{X} = \dfrac{700 + 850 + 600 + 575 + 450 + 900 + 300}{7} = \dfrac{4375}{7} = 625$
We will add another column for the absolute values of the deviations from the mean.
| Days | Production | $\lvert X_i - \bar{X} \rvert$ |
| --- | --- | --- |
| Day 1 | 700 | 75 |
| Day 2 | 850 | 225 |
| Day 3 | 600 | 25 |
| Day 4 | 575 | 50 |
| Day 5 | 450 | 175 |
| Day 6 | 900 | 275 |
| Day 7 | 300 | 325 |
| Total | | 1150 |
$MD = \dfrac{1150}{7} \approx 164.29$
The mean deviation is 164.29 items per day; in other words, daily production deviated from the mean by 164.29 items on average during that week.
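The range and the mean deviation of the weekly output can be verified with a short sketch. The data come from Table 1-2.1; the variable names are illustrative.

```python
# Weekly output from Table 1-2.1.
weekly_output = [700, 850, 600, 575, 450, 900, 300]

# Range: highest value minus lowest value.
data_range = max(weekly_output) - min(weekly_output)

# Mean deviation: average of the absolute deviations from the mean.
mean = sum(weekly_output) / len(weekly_output)
abs_deviations = [abs(x - mean) for x in weekly_output]
mean_deviation = sum(abs_deviations) / len(weekly_output)

print(data_range)                 # 600
print(mean)                       # 625.0
print(round(mean_deviation, 2))   # 164.29
```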
1-2.3 Variance
Because $\sum (X_i - \bar{X}) = 0$ and the use of absolute values does not always lend itself to easy manipulation, the square of the deviation from the mean is used instead.
The variance is the average of the squared deviations from the arithmetic mean (for the remainder of this chapter, whenever we say "mean" we understand arithmetic mean).
The population variance is denoted by $\sigma^2$:
$\sigma^2 = \dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
If we want to find the variance for the example in Table 1-2.1, we add a new column for the squared deviations.
| Days | Production (X) | $(X - \mu)^2$ |
| --- | --- | --- |
| Day 1 | 700 | 5625 |
| Day 2 | 850 | 50625 |
| Day 3 | 600 | 625 |
| Day 4 | 575 | 2500 |
| Day 5 | 450 | 30625 |
| Day 6 | 900 | 75625 |
| Day 7 | 300 | 105625 |
| Total | | 271250 |
$\sigma^2 = \dfrac{271250}{7} = 38750$
The variance is not only a large number, it is also hard to interpret because it is expressed in squared units. For that reason we will consider the variance as a transitory step in the process of obtaining the standard deviation.
1-2.4 Standard Deviation
The standard deviation is the most important measure of variability. It is the square root of the variance.
$\sigma = \sqrt{\sigma^2} = \sqrt{\dfrac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
So for the previous example, the standard deviation would be:
$\sigma = \sqrt{\dfrac{271250}{7}} = \sqrt{38750} \approx 196.85$
Note that the computation of the variance and standard deviation for a sample is slightly different from that for a whole population. The sample variance is denoted $S^2$ and the sample standard deviation $S$.
$S^2 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$ and $S = \sqrt{S^2}$
Sample variances and standard deviations are used as estimators of population variances and standard deviations. Using n - 1 instead of N results in a better estimate of the population variance.
Note that the smaller the standard deviation, the more closely the data cluster around the mean. If the standard deviation is zero, all the data observed are equal to the mean.
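The difference between the population and sample formulas can be seen by computing both for the weekly output data. This is a sketch only: treating the same seven values first as a population and then as a sample is done here purely to compare the two divisors.

```python
import math

# Weekly output from Table 1-2.1.
weekly_output = [700, 850, 600, 575, 450, 900, 300]
n = len(weekly_output)
mean = sum(weekly_output) / n

# Sum of squared deviations from the mean.
sum_squares = sum((x - mean) ** 2 for x in weekly_output)

# Population formulas divide by N.
population_variance = sum_squares / n
population_sd = math.sqrt(population_variance)

# Sample formulas divide by n - 1.
sample_variance = sum_squares / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(population_variance)         # 38750.0
print(round(population_sd, 2))     # 196.85
print(round(sample_variance, 2))   # 45208.33
print(round(sample_sd, 2))         # 212.62
```

Dividing by n - 1 always gives a slightly larger value than dividing by N, and the gap shrinks as the number of observations grows.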
1-2.5 Chebycheff's theorem
Chebycheff's theorem allows us to determine the minimum proportion of the values that lie within a specified number of standard deviations of the mean.
Given a number k greater than or equal to 1 and a set of n measurements $X_1, X_2, \ldots, X_n$, at least $1 - \dfrac{1}{k^2}$ of the measurements lie within k standard deviations of their mean.
Example:
A sample of bolts taken out of a production line has a mean diameter of 2" and a standard deviation of 1.5. At least what percentage of the bolts lie within ±1.75 standard deviations of the mean?
Solution:
$1 - \dfrac{1}{k^2} = 1 - \dfrac{1}{1.75^2} = 1 - \dfrac{1}{3.0625} \approx 0.6735$
At least 67.35% of the bolts are within ±1.75 standard deviations of the mean.
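The bound is a one-line computation. The function name `chebyshev_lower_bound` is an illustrative choice, and rejecting k below 1 simply mirrors the condition stated in the theorem.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be greater than or equal to 1")
    return 1 - 1 / k ** 2


bound = chebyshev_lower_bound(1.75)
print(round(bound * 100, 2))   # 67.35
```

Note that the bound does not depend on the mean or the standard deviation themselves, only on k.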
1-2.6 Coefficient of variation
A direct comparison of two or more measures of variability is not always meaningful. We cannot compare the standard deviation of the production of bolts to that of the availability of parts. If the standard deviation of the production of bolts is 5 and that of the availability of parts is 7 for a given time frame, we cannot conclude that the availability of parts is more variable than the production of bolts, because the two are measured on different scales. For a meaningful comparison to be made, a relative measure called the coefficient of variation is used.
The coefficient of variation is the ratio of the standard deviation to the mean:
$CV = \dfrac{\sigma}{\mu} \times 100\%$
for a population and
$CV = \dfrac{S}{\bar{X}} \times 100\%$
for a sample.
Example:
A sample of 100 students was taken to determine income and expenditure on books. The standard deviations and means are summarized in the table below. How do the relative dispersions for income and expenditure on books compare?
| Statistics | Income | Expenditure on books |
| --- | --- | --- |
| $\bar{X}$ | 750 | 70 |
| S | 15 | 9 |
Solution:
For the students' income:
$CV = \dfrac{15}{750} \times 100\% = 2\%$
For their expenditure on books:
$CV = \dfrac{9}{70} \times 100\% \approx 12.86\%$
The expenditure on books is more than 6 times as variable as the students' income.
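The comparison can be reproduced with a short sketch. Expressing the coefficient of variation as a percentage is a presentation choice made here; the function and variable names are illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation relative to the mean, as a percentage."""
    return std_dev / mean * 100


cv_income = coefficient_of_variation(15, 750)
cv_books = coefficient_of_variation(9, 70)

print(round(cv_income, 2))              # 2.0
print(round(cv_books, 2))               # 12.86
print(round(cv_books / cv_income, 2))   # 6.43
```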
1-3 Measures of Association
Measures of association are statistics that provide information about the relatedness between variables.
The three most widely used measures of association are the covariance, the correlation coefficient and the coefficient of determination.
1-3.1 Covariance
The covariance shows how the variable y reacts to a variation of the variable x. Its formula is given as:
$\sigma_{xy} = \dfrac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$ for a population and:
$s_{xy} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$ for a sample
Example:
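Based on the data below, how does the variable y react to a change in x?
Solution:
| x | y | $x - \bar{x}$ | $y - \bar{y}$ | $(x - \bar{x})(y - \bar{y})$ |
| --- | --- | --- | --- | --- |
| 9 | 10 | 2.5 | 2.75 | 6.875 |
| 7 | 9 | 0.5 | 1.75 | 0.875 |
| 6 | 3 | -0.5 | -4.25 | 2.125 |
| 4 | 7 | -2.5 | -0.25 | 0.625 |
| Total | | | | 10.5 |
With $\bar{x} = 6.5$ and $\bar{y} = 7.25$, the sum of the products is 10.5. Dividing by N = 4 gives a population covariance of 2.625; dividing by n - 1 = 3 gives a sample covariance of 3.5. Either way, the covariance is positive.
x and y vary in the same direction: in this data set, whenever x is greater than its mean, so is y.
The covariance is limited in describing the relatedness of x and y. It can show the direction in which y moves when x changes, but it does not show the strength of the relationship between x and y. If we say that the covariance is 2.65, it does not tell us much except that x and y change in the same direction.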
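The covariance of the four (x, y) pairs can be computed step by step as a sketch. Both divisors are shown, N for a population and n - 1 for a sample; the variable names are illustrative.

```python
x = [9, 7, 6, 4]
y = [10, 9, 3, 7]
n = len(x)

mean_x = sum(x) / n   # 6.5
mean_y = sum(y) / n   # 7.25

# Product of the paired deviations from the two means.
products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
sum_products = sum(products)

population_cov = sum_products / n
sample_cov = sum_products / (n - 1)

print(products)                      # [6.875, 0.875, 2.125, 0.625]
print(sum_products)                  # 10.5
print(population_cov, sample_cov)    # 2.625 3.5
```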
A better measure of association based on the covariance is used by statisticians.
1-3.2 Correlation coefficient
The correlation coefficient r is a number that ranges between -1 and +1. The sign of r is the same as the sign of the covariance. When r equals -1, there is a perfect negative linear relationship between X and Y; in other words, an increase in X is accompanied by a proportional decrease in Y. When r equals 0, there is no linear relationship between the variation in X and the variation in Y. When r equals +1, there is a perfect positive linear relationship: X and Y change in the same direction and in the same proportions. Any other value of r is interpreted according to how close it is to -1, 0 or +1.
The formula for the correlation coefficient is the covariance divided by the product of the two standard deviations:
for a population
$\rho = \dfrac{\sigma_{xy}}{\sigma_x \sigma_y}$
and for a sample
$r = \dfrac{s_{xy}}{s_x s_y}$
Since the formula is rather complex and we have software available to solve these kinds of operations, I would advise their use for the sake of speed and accuracy. Excel and Minitab provide good tools for that purpose.
Example 1:
Given the data below, find the coefficient of correlation between the availability of parts and the level of output.
| Weeks | Parts | Output |
| --- | --- | --- |
| Week 1 | 256 | 450 |
| Week 2 | 250 | 445 |
| Week 3 | 270 | 465 |
| Week 4 | 265 | 460 |
| Week 5 | 267 | 462 |
| Week 6 | 269 | 465 |
| Week 7 | 270 | 466 |
Using Excel, we can determine r by entering the formula above step by step.
Excel also provides shortcuts, such as its built-in CORREL function, that return the result without having to type out the formula.
The coefficient of correlation r is 0.9977201, which is very close to 1, so we conclude that there is a strong positive correlation between the availability of parts and the level of output.
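For readers without a spreadsheet at hand, the same coefficient can be computed with a short sketch. The helper name `pearson_r` is an illustrative choice. The n - 1 divisors of the sample formula cancel between numerator and denominator, so sums of deviations are used directly.

```python
import math

parts = [256, 250, 270, 265, 267, 269, 270]
output = [450, 445, 465, 460, 462, 465, 466]


def pearson_r(xs, ys):
    """Correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = pearson_r(parts, output)
print(round(r, 4))   # 0.9977
```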
1-3.3 Coefficient of determination
The coefficient of determination $r^2$ measures the proportion of the changes in the dependent variable Y that is explained by the independent variable X. It is the square of the correlation coefficient r, and for that reason it is never negative and ranges between 0 and 1.
When the coefficient of determination is zero, the variations of Y are not explained by the variations of X.
When $r^2 = 1$, the changes in Y are 100% explained by X. Any other value of $r^2$ must be interpreted according to how close it is to 0 or 1.
Note that even though the coefficient of determination is the square of the coefficient of correlation, the coefficient of correlation is not necessarily the positive square root of the coefficient of determination, because r can be negative.
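The last point can be illustrated with a small sketch. The data below are hypothetical, chosen so that Y falls as X rises; they do not come from the article.

```python
import math

# Hypothetical data with a perfect negative relationship.
x = [1, 2, 3, 4]
y = [8, 6, 4, 2]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
syy = sum((yi - mean_y) ** 2 for yi in y)

r = sxy / math.sqrt(sxx * syy)
r_squared = r ** 2

print(r)                      # -1.0
print(r_squared)              # 1.0
print(math.sqrt(r_squared))   # 1.0, which is not equal to r
```

Squaring discards the sign, so the direction of the relationship has to be read from r or from the covariance.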
About the author
Issa Bass is the managing editor of SixSigmaFirst. He can be reached at issa@sixsigmafirst.com