Descriptive Statistics

By Issa Bass
 

1-1 Measures of central tendency

In most cases, a single value can help describe a set of data. That value is referred to as a measure of central tendency. The most common measures of central tendency used to describe data are the arithmetic mean, the mode and the median. The geometric mean is used less often, but it is useful for finding the mean of percentages, ratios and growth rates.

1-1.1 Arithmetic mean

1-1.1.a Arithmetic mean for raw data

For ungrouped data, that is data that has not been grouped in a frequency distribution, the arithmetic mean of a population is the sum of all the values in that population divided by the number of values in the population.

Arithmetic mean:   \( \mu = \frac{\sum_{i=1}^{N} X_i}{N} \)

Where:

\( \mu \)   represents the arithmetic mean of the population

\( X_i \)   represents the ith value observed

\( N \)   is the number of items in the observed population

\( \sum_{i=1}^{N} X_i \)   is the sum of the X values

Example 1:   The following table shows how many computers are produced during a five-day work week. What is the average daily production?

Days     Production
Day 1    500
Day 2    750
Day 3    600
Day 4    450
Day 5    775

Solution:

\( \mu = \frac{500 + 750 + 600 + 450 + 775}{5} = \frac{3075}{5} = 615 \) computers per day.
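The solution to Example 1 can be checked with a few lines of Python. This is only a sketch that applies the definition directly (the sum of the values divided by their count); the function and variable names are illustrative and do not come from the article.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by how many values there are."""
    return sum(values) / len(values)

# Daily production for Day 1 through Day 5, from the table in Example 1.
daily_production = [500, 750, 600, 450, 775]

average_daily_production = arithmetic_mean(daily_production)
print(average_daily_production)  # 615.0
```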

Example 2:   The following table shows the daily production of four teams of workers over a period of 4 days. Each team has a different number of workers. What is the average daily production per worker during that period?

Days     Team      Number of workers per team    Production
Day 1    Team 1    15                            750
Day 2    Team 2    13                            400
Day 3    Team 3    12                            700
Day 4    Team 4    10                            600

Total production over the four days = 750 + 400 + 700 + 600 = 2450

Total number of workers = 15 + 13 + 12 + 10 = 50

Average production per worker = 2450/ 50 = 49
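The same arithmetic can be sketched in Python. The tuple layout (workers, production) is an assumption made for this illustration; the figures are the ones in the table of Example 2.

```python
# (number of workers, production) for Team 1 through Team 4, from Example 2.
teams = [(15, 750), (13, 400), (12, 700), (10, 600)]

total_production = sum(production for _, production in teams)
total_workers = sum(workers for workers, _ in teams)

# Dividing the two totals gives the production per worker.
production_per_worker = total_production / total_workers

print(total_production, total_workers, production_per_worker)  # 2450 50 49.0
```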

1-1.1.b Arithmetic mean of grouped data

Sometimes, the available data are grouped in intervals or classes and presented in the form of a frequency distribution. Data on the income or age of a population are often presented that way. Because the individual values are no longer known, a measure of central tendency cannot be determined exactly, so it is approximated using the midpoints of the intervals and the frequencies of the distribution.

Arithmetic mean of grouped data:   \( \bar{X} = \frac{\sum fX}{N} \)

Where:

\( \bar{X} \)   is the sample arithmetic mean

\( X \)   is the midpoint of each interval

\( f \)   is the frequency in each interval

\( fX \)   is the midpoint times the frequency

\( \sum fX \)   is the sum of these products

\( N \)   is the total of the frequencies

Example:   The net revenues of a group of companies are organized in the table below. Determine the estimated arithmetic mean revenue of the companies.

Revenues ($ millions)    Number of companies
18 - 22                  3
23 - 27                  17
28 - 32                  10
33 - 37                  15
38 - 42                  9
43 - 47                  3
48 - 52                  14
53 - 57                  5

Solution:

Revenues ($ millions)    Number of companies (f)    Midpoint of revenues (X)    fX
18 - 22                  3                          20                          60
23 - 27                  17                         25                          425
28 - 32                  10                         30                          300
33 - 37                  15                         35                          525
38 - 42                  9                          40                          360
43 - 47                  3                          45                          135
48 - 52                  14                         50                          700
53 - 57                  5                          55                          275
Total:                   76                                                     2780

\( \bar{X} = \frac{\sum fX}{N} = \frac{2780}{76} \approx 36.579 \)

So the estimated mean revenue per company is $36.579 million.
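The midpoint method used in this solution can be sketched in Python as follows. The representation of each class as a (lower bound, upper bound, frequency) tuple and the function name are assumptions made for the illustration.

```python
# (lower bound, upper bound, number of companies) per class, in $ millions.
classes = [
    (18, 22, 3), (23, 27, 17), (28, 32, 10), (33, 37, 15),
    (38, 42, 9), (43, 47, 3), (48, 52, 14), (53, 57, 5),
]

def grouped_mean(classes):
    """Estimate the mean from class midpoints weighted by class frequencies."""
    total_fx = sum(((low + high) / 2) * f for low, high, f in classes)
    total_f = sum(f for _, _, f in classes)
    return total_fx / total_f

estimated_mean = grouped_mean(classes)
print(round(estimated_mean, 3))  # 36.579
```

The result is an estimate only: every company in a class is treated as if its revenue were equal to the class midpoint.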

1-1.2 Geometric mean

The geometric mean is used to find the average of ratios, indexes or growth rates. It is the nth root of the product of n values.

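Geometric mean:   \( GM = \sqrt[n]{X_1 \times X_2 \times \cdots \times X_n} \)

Let's suppose that a company's revenues grew by 15% last year and 25% this year. The average increase is not the arithmetic mean, (15 + 25)/2 = 20.

\( GM = \sqrt{15 \times 25} = \sqrt{375} = 19.365 \)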
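A minimal Python sketch of the nth-root definition reproduces the figure above. It also prints a second variant that is not in the article and is included only as an assumption for comparison: applying the geometric mean to the growth factors 1.15 and 1.25 rather than to the percentages themselves, which is another common way of averaging growth rates.

```python
import math

def geometric_mean(values):
    """Return the nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

# The computation in the text: geometric mean of the two percentages.
gm_of_percentages = geometric_mean([15, 25])
print(round(gm_of_percentages, 3))  # 19.365

# Assumption (not from the text): average the growth factors instead,
# then convert the result back to a percentage increase.
gm_of_factors = (geometric_mean([1.15, 1.25]) - 1) * 100
print(round(gm_of_factors, 3))
```

Both results fall below the arithmetic mean of 20.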

1-1.3 Mode

The mode is not a very frequently used measure of central tendency but it is still an important one. It represents the value of the observation that appears most frequently.

Consider the following sample measurements:

75 60 65 75 80 90 75 80 67

75 appears most frequently (three times), thus it is the mode.
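Counting how often each value occurs can be sketched with Python's standard `collections.Counter`; the variable names are illustrative.

```python
from collections import Counter

measurements = [75, 60, 65, 75, 80, 90, 75, 80, 67]

# most_common(1) returns a list holding the (value, count) pair with the highest count.
counts = Counter(measurements)
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 75 3
```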

1-1.4 Median

The median of a set of data is the value of x such that half the measurements are less than x and half are greater.

Consider the following set of data:

12 25 15 19 40 17 36

n = 7 is odd. If we rearrange the data in order of increasing magnitude, we obtain:

12 15 17 19 25 36 40

The median is the 4th value, 19, because its position is 0.5(n + 1) = 0.5(7 + 1) = 4.
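The sort-then-pick-the-middle procedure can be sketched in Python. The article only covers an odd number of values; the handling of an even count below (averaging the two middle values) is a common convention and is an assumption here, not something taken from the text.

```python
def median(values):
    """Return the middle value of the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: zero-based index n // 2 is the 0.5(n + 1)-th value.
        return ordered[middle]
    # Even n (assumed convention): average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2

data = [12, 25, 15, 19, 40, 17, 36]
print(sorted(data))   # [12, 15, 17, 19, 25, 36, 40]
print(median(data))   # 19
```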

1-2 Measures of dispersion

The measures of central tendency only locate the center of the data; they do not provide information on how the data are spread. The measures of dispersion or variability provide that information. If the values are closely clustered around the mean, the mean is a good representation of the data and a reliable average.

Variation is very important in quality control. For instance, if we are manufacturing tires, an excessive variation in the depth of the treads of the tires would imply a high rate of defective products.

The study of variability also helps compare the spread in more than one distribution. Suppose that the arithmetic mean of the daily production of cars in two car plants is 1000. We might conclude that the two plants produce the same number of cars every day. But an observation over a certain period of time might show that one produces between 950 and 1050 cars a day and the other between 450 and 1550. The second plant's production is more erratic.

The most widely used measures of dispersion are the range, the variance and the standard deviation.

1-2.1 Range

The range is the simplest of all measures of variability. It is the difference between the highest and the lowest values of a data set.

Range = highest value - lowest value.

Example:

The weekly output on a production line is given in the table below.

Days     Production
Day 1    700
Day 2    850
Day 3    600
Day 4    575
Day 5    450
Day 6    900
Day 7    300

Table 1-2.1

The range = 900 - 300 = 600
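As a short Python sketch, the range of the weekly output is the difference between the built-in `max` and `min` of the data; the list name is illustrative.

```python
# Daily production for Day 1 through Day 7, from the weekly output table.
production = [700, 850, 600, 575, 450, 900, 300]

data_range = max(production) - min(production)
print(data_range)  # 600
```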

The concept of range is explored in more depth in the study of Statistical Process Control (SPC).

1-2.2 Mean Deviation

The range is very simple; it is in fact too simple, since it considers only the two extreme values of a data set. It is not informative about the other values. The mean deviation, the variance and the standard deviation provide more information about all the data observed.

The mean deviation measures the amount by which the values in a population are dispersed around the mean.

It is the sum of the absolute values of the deviations from the mean divided by the number of observations in the sample. The absolute values of the deviations are used because the sum of the signed deviations, \( \sum (X_i - \bar{X}) \), is always equal to zero.

Mean Deviation:   \( MD = \frac{\sum |X_i - \bar{X}|}{n} \)

Where:

\( X_i \)   is the value of each observation

\( \bar{X} \)   is the arithmetic mean of the observations

\( X_i - \bar{X} \)   is the deviation from the mean

\( n \)   is the number of observations in the sample.

Example:    Let's use table 1-2.1 to find the Mean Deviation of the weekly production.

Solution:

We need to find the arithmetic mean first.

\( \bar{X} = \frac{700 + 850 + 600 + 575 + 450 + 900 + 300}{7} = \frac{4375}{7} = 625 \)

We will add another column for the absolute values of the deviations from the mean.

Days     Production    |X - 625|
Day 1    700           75
Day 2    850           225
Day 3    600           25
Day 4    575           50
Day 5    450           175
Day 6    900           275
Day 7    300           325
Total                  1150

\( MD = \frac{1150}{7} \approx 164.29 \)

The mean deviation is 164.29 items a day. In other words, daily production deviated from the mean by 164.29 items on average during that week.
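The mean deviation of the weekly production can be sketched in Python as below. The sketch also prints the sum of the signed deviations to show why absolute values are needed; the function and variable names are illustrative.

```python
def mean_deviation(values):
    """Average absolute distance of the values from their arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

# Daily production for Day 1 through Day 7, from the weekly output table.
production = [700, 850, 600, 575, 450, 900, 300]
mean = sum(production) / len(production)

# Signed deviations cancel out, so their sum carries no information.
sum_of_signed_deviations = sum(x - mean for x in production)

print(mean)                                   # 625.0
print(sum_of_signed_deviations)               # 0.0
print(round(mean_deviation(production), 2))   # 164.29
```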

1-2.3 Variance

Because the sum of the deviations from the mean is always zero and the use of absolute values does not always lend itself to easy manipulation, the square of the deviation from the mean is used instead.

The variance is the average of the squared deviation from the arithmetic mean (for the remainder of this chapter, whenever we say "mean" we understand arithmetic mean).

The variance of a population is denoted by \( \sigma^2 \):

\( \sigma^2 = \frac{\sum (X_i - \mu)^2}{N} \)

If we want to find the variance for the example in table 1-2.1, we will add a new column for the squared deviation from the mean of 625.

Days     Production X    (X - 625)^2
Day 1    700             5625
Day 2    850             50625
Day 3    600             625
Day 4    575             2500
Day 5    450             30625
Day 6    900             75625
Day 7    300             105625
Total                    271250

\( \sigma^2 = \frac{271250}{7} = 38750 \)

The variance is not only a large number but is also hard to interpret because it is expressed in squared units. For that reason we will consider the variance as an intermediate step in the process of obtaining the standard deviation.
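The squared-deviation column and the resulting variance can be reproduced with a short Python sketch. It treats the seven days as a whole population (dividing by the number of values), which is an assumption consistent with the definition of the variance as the average squared deviation.

```python
# Daily production for Day 1 through Day 7, from the weekly output table.
production = [700, 850, 600, 575, 450, 900, 300]

mean = sum(production) / len(production)
squared_deviations = [(x - mean) ** 2 for x in production]

# Population variance: the average of the squared deviations.
population_variance = sum(squared_deviations) / len(production)

print(sum(squared_deviations), population_variance)  # 271250.0 38750.0
```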

1-2.4 Standard Deviation

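The standard deviation is the most important measure of variability. It is the square root of the variance:

\( \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X_i - \mu)^2}{N}} \)

So for the previous example, with a sum of squared deviations of 271250 over 7 days, the standard deviation would be:

\( \sigma = \sqrt{\frac{271250}{7}} = \sqrt{38750} \approx 196.85 \)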

Note that the computation of the variance and standard deviation from a sample differs slightly from the computation for a whole population. The sample variance is denoted \( S^2 \) and the sample standard deviation \( S \):

\( S^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} \)   and   \( S = \sqrt{S^2} \)

Sample variances and standard deviations are used as estimators of population variances and standard deviations. Using n - 1 instead of N results in a better estimate of the population variance.

Note that the smaller the standard deviation, the more closely the data are clustered around the mean. If the standard deviation is zero, all the observed values are equal to the mean.
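The difference between dividing by N and dividing by n - 1 can be sketched in Python using the weekly production figures. Treating the same seven values first as a population and then as a sample is done here purely for illustration; the function name and its `sample` flag are hypothetical.

```python
import math

def standard_deviation(values, sample=False):
    """Square root of the variance; divides by n - 1 when sample is True."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    divisor = n - 1 if sample else n
    return math.sqrt(sum_of_squares / divisor)

production = [700, 850, 600, 575, 450, 900, 300]

population_sd = standard_deviation(production)
sample_sd = standard_deviation(production, sample=True)

print(round(population_sd, 2))  # 196.85
print(round(sample_sd, 2))      # 212.62

# Identical values have a standard deviation of zero.
print(standard_deviation([625, 625, 625]))  # 0.0
```

The sample figure is always the larger of the two for the same data, because the same sum of squares is divided by a smaller number.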

1-2.5 Chebycheff's theorem

Chebycheff's theorem allows us to determine the minimum proportion of the values that lie within a specified number of standard deviations of the mean.

Given a number k greater than or equal to 1 and a set of n measurements \( X_1, X_2, \ldots, X_n \), at least \( 1 - \frac{1}{k^2} \) of the measurements lie within k standard deviations of their mean.

Example:

A sample of bolts taken out of a production line has a mean of 2" in diameter and a standard deviation of 1.5. At least what percentage of the bolts lie within ±1.75 standard deviations from the mean?

Solution:

\( 1 - \frac{1}{k^2} = 1 - \frac{1}{1.75^2} = 1 - \frac{1}{3.0625} \approx 0.6735 \)

At least 67.35% of the bolts are within ±1.75 standard deviations from the mean.
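The bound in Chebycheff's theorem depends only on k, so it can be sketched as a one-line Python function. The function name is illustrative; the check on k reflects the condition stated in the theorem.

```python
def chebycheff_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k < 1:
        raise ValueError("k must be greater than or equal to 1")
    return 1 - 1 / k ** 2

# The bolt example: k = 1.75 standard deviations.
print(round(chebycheff_lower_bound(1.75) * 100, 2))  # 67.35

# A few other values of k for comparison.
for k in (1, 2, 3):
    print(k, chebycheff_lower_bound(k))
```

Note that the mean and standard deviation of the bolts do not enter the calculation; the guarantee holds for any data set.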

1-2.6 Coefficient of variation

A direct comparison of two or more measures of variability is not always meaningful, because they may be expressed in different units or relate to very different means. We cannot compare the standard deviation of the production of bolts to that of the availability of parts. If the standard deviation of the production of bolts is 5 and that of the availability of parts is 7 for a given time frame, we cannot conclude that the availability of parts is more variable than the production of bolts. For a meaningful comparison to be made, a relative measure called the coefficient of variation is used.

The coefficient of variation is the ratio of the standard deviation to the mean:

\( CV = \frac{\sigma}{\mu} \) for a population and

\( CV = \frac{S}{\bar{X}} \) for a sample.

Example:

A sample of 100 students was taken to determine income and expenditure on books. The standard deviations and means are summarized in the table below. How do the relative dispersions for income and expenditure on books compare?

Statistics         Income    Expenditure on books
\( \bar{X} \)      750       70
S                  15        9

Solution:

For the students' income:   \( CV = \frac{15}{750} = 0.02 \), or 2%

For their expenditure on books:   \( CV = \frac{9}{70} \approx 0.1286 \), or 12.86%

The expenditure on books is more than 6 times as variable as the students' income.
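The comparison can be sketched in Python by dividing each standard deviation by its mean. The function name is illustrative; the figures are those of the student example.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Ratio of the standard deviation to the mean."""
    return standard_deviation / mean

cv_income = coefficient_of_variation(15, 750)
cv_books = coefficient_of_variation(9, 70)

print(round(cv_income * 100, 2))         # 2.0
print(round(cv_books * 100, 2))          # 12.86
print(round(cv_books / cv_income, 2))    # 6.43
```

Although the standard deviation of income (15) is larger than that of expenditure on books (9), the relative dispersion of income is much smaller.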

1-3 Measures of Association

Measures of association are statistics that provide information about the relatedness between variables.

The three most widely used measures of association are the covariance, correlation coefficient and the coefficient of determination.

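1-3.1 Covariance

The covariance shows how the variable y reacts to a variation of the variable x. Its formula is given as:

\( \sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \) for a population and:

\( s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \) for a sample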

Example:

Based on the data below, how does the variable y react to a change in x?

X    Y
9    10
7    9
6    3
4    7

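Solution:

The mean of x is 26/4 = 6.5 and the mean of y is 29/4 = 7.25.

x        y        x - 6.5    y - 7.25    (x - 6.5)(y - 7.25)
9        10       2.5        2.75        6.875
7        9        0.5        1.75        0.875
6        3        -0.5       -4.25       2.125
4        7        -2.5       -0.25       0.625
Total                                    10.5

Dividing the total by N = 4 gives a covariance of 10.5/4 = 2.625 if the four pairs are treated as a population; dividing by n - 1 = 3 gives 10.5/3 = 3.5 if they are treated as a sample.

The covariance is positive, so X and Y vary in the same direction: when X is greater than its mean, so is Y.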
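The covariance of the four (x, y) pairs can be sketched in Python as below. Because the example does not say whether the pairs form a population or a sample, the sketch computes both divisors; that choice, and the function name, are assumptions made for the illustration.

```python
def sum_of_cross_products(xs, ys):
    """Sum of (x - mean of x) * (y - mean of y) over all pairs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

xs = [9, 7, 6, 4]
ys = [10, 9, 3, 7]

total = sum_of_cross_products(xs, ys)
population_covariance = total / len(xs)        # divide by N
sample_covariance = total / (len(xs) - 1)      # divide by n - 1

print(total, population_covariance, sample_covariance)  # 10.5 2.625 3.5
```

Either divisor leaves the sign unchanged, and the sign is what indicates the direction of the relationship.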

The covariance is limited in describing the relatedness of x and y.  It can show the direction in which Y moves when X changes but it does not show the magnitude of the relationship between X and Y.  If we say that the covariance is 2.65, it does not tell us much except that X and Y change in the same direction.

A better measure of association based on the covariance is used by statisticians.

1-3.2 Correlation coefficient

The correlation coefficient r is a number that ranges between -1 and +1. The sign of r is the same as the sign of the covariance. When r equals -1, we conclude that there is a perfect negative relationship between the variations of X and the variations of Y. In other words, an increase in X is accompanied by a proportional decrease in Y. r equals 0 when there is no linear relation between the variation of X and the variation of Y. When r equals +1, we conclude that there is a perfect positive relationship between the two: the changes in X and the changes in Y are in the same direction and in the same proportions. Any other value of r is interpreted according to how close it is to -1, 0 or +1.

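The formula for the correlation coefficient is:

\( \rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \) for a population

and \( r = \frac{s_{xy}}{s_x s_y} \) for a sample, where the numerator is the covariance of x and y and the denominator is the product of their standard deviations.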

Since the formula is rather complex and we have software available to solve these kinds of operations, I would advise their use for the sake of speed and accuracy.  Excel and Minitab provide good tools for that purpose.

Example 1:

Given the data below, find the coefficient of correlation between the availability of parts and the level of output.

Weeks    Parts    Output
Week1    256      450
Week2    250      445
Week3    270      465
Week4    265      460
Week5    267      462
Week6    269      465
Week7    270      466

Using Excel, we can determine r.

Excel also provides "short cuts", in the form of built-in functions, that allow us to get the result without having to enter the formula.

The coefficient of correlation r  is 0.9977201 which is very close to 1, so we conclude that there is a strong correlation between the availability of parts and the level of the output.
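For readers without Excel or Minitab, the same coefficient can be obtained with a short Python sketch. It uses the standard product-moment form (the sum of cross products divided by the square root of the product of the two sums of squares), which is an assumption about the formula intended by the article; the function name is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

# Weekly availability of parts and level of output, Week1 through Week7.
parts = [256, 250, 270, 265, 267, 269, 270]
output = [450, 445, 465, 460, 462, 465, 466]

r = pearson_r(parts, output)
print(round(r, 4))  # 0.9977
```

The divisor (N or n - 1) cancels between the numerator and the denominator, so the population and sample forms give the same value.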

1-3.3 Coefficient of determination

The coefficient of determination \( r^2 \) measures the proportion of the changes in the dependent variable Y that is explained by the independent variable X. It is the square of the correlation coefficient r, and for that reason it is never negative and ranges between 0 and 1.

When the coefficient of determination is zero, the variations of Y are not explained by the variations of X.

When \( r^2 = 1 \), the changes in Y are 100% explained by X. Any other value of \( r^2 \) must be interpreted according to how close it is to 0 or 1.

Note that even though the coefficient of determination is the square of the coefficient of correlation, the coefficient of correlation is not necessarily the positive square root of the coefficient of determination, because r can be negative.
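The last point can be illustrated with a Python sketch. It reuses the parts and output figures from the correlation example and then, purely as a hypothetical, flips the sign of the output values to create a perfectly mirrored negative relationship. Both data sets give the same coefficient of determination, but taking its square root recovers r only in the positive case.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient of paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

parts = [256, 250, 270, 265, 267, 269, 270]
output = [450, 445, 465, 460, 462, 465, 466]

# Hypothetical data, not from the article: the same output with its sign flipped.
mirrored_output = [-y for y in output]

r_positive = pearson_r(parts, output)
r_negative = pearson_r(parts, mirrored_output)

print(round(r_positive ** 2, 4))  # 0.9954
print(round(r_negative ** 2, 4))  # 0.9954

# The square root of r squared is always non-negative, so it loses the sign of r.
print(math.sqrt(r_negative ** 2) > 0, r_negative < 0)  # True True
```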


About the author
Issa Bass is the managing editor of SixSigmaFirst. He can be reached at issa@sixsigmafirst.com
