PRACTICAL MATHEMATICS

Statistics

Contents

  1. Population Statistics
  2. Histograms
  3. Analysis of Variance
  4. Standardized Variables
  5. Two Dimensional Populations
  6. Analysis of Variance in Two Dimensions

Population Statistics

A complete set of observations is called a population. A statistic is a number that summarizes some property of these observations.

EXAMPLE

Twenty people are male or female as follows:
F F F F M M F F M F F M F M F M M F F F
This population is divided into two categories M and F with frequencies:
7 males, 13 females.
The frequency distribution may be expressed as percentages (35% males, 65% females) or as relative frequencies:
f_1 = 0.35, f_2 = 0.65.

In a relative frequency distribution with n categories we have:

f_1 + ... + f_n = 1.
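
As a minimal sketch in Python (the variable names are illustrative and not part of the original notes), the frequencies of the twenty observations above can be counted and the relative frequencies checked to sum to 1:

```python
from collections import Counter

# The twenty observations from the example above.
observations = "F F F F M M F F M F F M F M F M M F F F".split()

counts = Counter(observations)   # absolute frequencies per category
n = len(observations)            # size of the population

# Relative frequency of each category: f_i = n_i / n.
relative = {category: count / n for category, count in counts.items()}

print(counts["M"], counts["F"])  # 7 13
print(relative)
print(sum(relative.values()))    # should be 1, up to rounding
```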

EXAMPLE

The following ten numbers make a population:
1.7, 4.5, 3.8, 2.7, 0.6, 4.8, 1.1, 5.7, 3.4, 2.2.

Statistics commonly used with populations of numbers x_1, ..., x_n are the mean

mu = (x_1 + ... + x_n)/n,
and the variance
sigma^2 = [(x_1 - mu)^2 + ... + (x_n - mu)^2]/n.
The statistic sigma is called the standard deviation. The mean is a measure of the center of the population, and the standard deviation is a measure of the scattering of the numbers about the mean. The most convenient formula for calculating the variance is:
sigma^2 = (x_1^2 + ... + x_n^2)/n - mu^2.
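
The two variance formulas can be compared numerically. The following Python sketch (function names are illustrative) applies both to the ten numbers of the example. It divides by n throughout, as the definitions above do.

```python
import math

def mean(values):
    return sum(values) / len(values)

def variance(values):
    # Definition: mean of the squared deviations from the mean.
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def variance_shortcut(values):
    # Convenient form: mean of the squares minus the square of the mean.
    mu = mean(values)
    return sum(x * x for x in values) / len(values) - mu ** 2

population = [1.7, 4.5, 3.8, 2.7, 0.6, 4.8, 1.1, 5.7, 3.4, 2.2]

mu = mean(population)
sigma = math.sqrt(variance(population))
print(round(mu, 2), round(sigma, 2))
print(variance(population), variance_shortcut(population))
```

Both formulas should give the same variance up to rounding. In floating-point arithmetic the shortcut form can lose accuracy when the mean is large compared with the standard deviation, so the definition is the safer choice in a program.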

EXERCISE

Show that for the ten numbers in the above example: mu = 3.05, and sigma = 1.59.


These results can be shown in a diagram as in Fig. 1.

Fig. 1. A population of ten numbers with their mean and standard deviation.

EXERCISE

Show that the mean square deviation s^2 of the numbers x_1, ..., x_n from the value x, defined by

s^2 = [(x_1 - x)^2 + ... + (x_n - x)^2]/n,

is a minimum when x = mu. The minimum value of s^2 is the variance sigma^2.

Histograms

Observed numerical data may be grouped into classes with a given class interval. The numbers of observations falling in the classes form the frequency distribution of the data.

A histogram is a diagram which consists of rectangles with areas proportional to the relative frequencies of the classes.

EXAMPLE

Table 1. One Hundred Numbers Grouped into Classes

Class No.   Class Interval   Central Value   Frequency   Relative Frequency
    i                             x_i           n_i             f_i
    1          40 - 50             45             5            0.05
    2          50 - 60             55            18            0.18
    3          60 - 70             65            42            0.42
    4          70 - 80             75            27            0.27
    5          80 - 90             85             8            0.08

Fig. 2. A histogram of the data in Table 1. The symbol above the histogram shows the mean and standard deviation of the data.

The mean and standard deviation of the data are calculated by summations over the classes as follows:

n = n_1 + ... + n_5,   f_i = n_i/n,
mu = f_1 x_1 + ... + f_5 x_5 = (n_1 x_1 + ... + n_5 x_5)/n,
sigma^2 = f_1 (x_1 - mu)^2 + ... + f_5 (x_5 - mu)^2 = (n_1 x_1^2 + ... + n_5 x_5^2)/n - mu^2.
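
A short Python sketch of these class summations, using the central values and frequencies of Table 1 (the variable names are illustrative):

```python
import math

# Central values x_i and frequencies n_i from Table 1.
central_values = [45, 55, 65, 75, 85]
frequencies = [5, 18, 42, 27, 8]

n = sum(frequencies)                              # total number of observations
rel = [n_i / n for n_i in frequencies]            # relative frequencies f_i

# Mean and variance as sums over the classes, weighted by f_i.
mu = sum(f * x for f, x in zip(rel, central_values))
variance = sum(f * (x - mu) ** 2 for f, x in zip(rel, central_values))
sigma = math.sqrt(variance)

print(n, round(mu, 1), round(sigma, 2))
```

Because every observation in a class is represented by the central value of that class, the results describe the grouped data rather than the original hundred numbers.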

EXERCISE

Show that, for the data in Table 1, mu = 66.5, sigma = 9.73.

Analysis of Variance

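Suppose a population of n measurements x_{i,j} is divided into m categories of sizes n_i having means mu_i and variances sigma_i^2, where i = 1, ..., m, as follows:

n = n_1 + ... + n_m,
mu_i = (x_{i,1} + ... + x_{i,n_i})/n_i,
sigma_i^2 = [(x_{i,1} - mu_i)^2 + ... + (x_{i,n_i} - mu_i)^2]/n_i.

For the whole population we have:

mu = (n_1 mu_1 + ... + n_m mu_m)/n,
sigma^2 = [(x_{1,1} - mu)^2 + ... + (x_{m,n_m} - mu)^2]/n
        = [n_1 sigma_1^2 + ... + n_m sigma_m^2]/n + [n_1 (mu_1 - mu)^2 + ... + n_m (mu_m - mu)^2]/n.

The first term in the last expression for sigma^2 is the mean of the category variances, and the second term is the variance of the category means.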

EXERCISE

Prove the formula:
sigma^2 = [n_1 sigma_1^2 + ... + n_m sigma_m^2]/n + [n_1 (mu_1 - mu)^2 + ... + n_m (mu_m - mu)^2]/n.

EXAMPLE

Table 2. Twenty Numbers Divided into Three Categories

Number:     4.0  4.5  0.5  3.5  4.9  5.6  9.4  1.6  1.0  3.4
Category:    A    A    A    A    B    A    C    A    C    A
Number:     0.5  7.1  7.9  5.1  4.3  5.5  8.2  9.8  2.2  3.3
Category:    B    C    A    C    B    B    C    A    B    B

Using the above formulas we get:

Table 3. Statistics of the Twenty Numbers Divided into Three Categories

Category            n_i    mu_i    sigma_i^2   sigma_i
A                     9    4.533     7.524      2.743
B                     6    3.433     2.892      1.701
C                     5    6.160     8.658      2.943
Whole population     20    4.610     7.437      2.727

Mean of category variances = 6.418
Variance of category means = 1.019

Figure 3 shows the mean and standard deviation of each category and the whole population. The variance of the category means is much smaller than the mean of the category variances, so the division of this population into three categories may not be significant.
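
The decomposition of the variance can be sketched in Python as follows. The function names are illustrative, and the data are entered as printed in Table 2. The sketch checks that the mean of the category variances and the variance of the category means add up to the variance of the whole population.

```python
from collections import defaultdict

def mean(values):
    return sum(values) / len(values)

def variance(values):
    # Population variance (divide by n, as in the text).
    mu = mean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def decompose_variance(numbers, categories):
    # Collect the numbers belonging to each category.
    groups = defaultdict(list)
    for x, c in zip(numbers, categories):
        groups[c].append(x)
    n = len(numbers)
    mu = mean(numbers)
    # First term: mean of the category variances, weighted by n_i.
    mean_of_variances = sum(len(g) * variance(g) for g in groups.values()) / n
    # Second term: variance of the category means, weighted by n_i.
    variance_of_means = sum(len(g) * (mean(g) - mu) ** 2 for g in groups.values()) / n
    return mean_of_variances, variance_of_means

numbers = [4.0, 4.5, 0.5, 3.5, 4.9, 5.6, 9.4, 1.6, 1.0, 3.4,
           0.5, 7.1, 7.9, 5.1, 4.3, 5.5, 8.2, 9.8, 2.2, 3.3]
categories = list("AAAABACACA" "BCACBBCABB")

within, between = decompose_variance(numbers, categories)
total = variance(numbers)
print(round(within, 3), round(between, 3), round(total, 3))
```

The first two printed values should add up to the third, and all three should come out close to the corresponding entries of Table 3.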

Fig. 3. Statistics of the categories and the whole population for the data in Table 2. The symbols above the diagrams show the means and standard deviations of the data.

EXERCISE

Analyse the data below in this way. Is the division of the population into two categories likely to be significant?

  x:          3.5   6.1  13.4   6.9  16.8   2.9  10.1   8.7  16.6  15.2   4.0   6.7
  Category:    G     G     H     G     H     G     G     H     H     H     G     G

Standardized Variables

We may represent each measurement x by its position relative to the mean mu as a multiple of the standard deviation sigma. This gives us the measurement in the standardized form
z = (x - mu)/sigma.
A distribution of standardized observations has mean mu_z = 0 and standard deviation sigma_z = 1.
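
A minimal Python sketch of standardization (the function name standardize is illustrative), using the population definitions of mu and sigma given earlier:

```python
import math

def standardize(values):
    # Return z = (x - mu)/sigma for every x, with population mu and sigma.
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return [(x - mu) / sigma for x in values]

values = [1.7, 4.5, 3.8, 2.7, 0.6, 4.8, 1.1, 5.7, 3.4, 2.2]
z = standardize(values)

# Mean and variance of the standardized values.
mu_z = sum(z) / len(z)
var_z = sum((v - mu_z) ** 2 for v in z) / len(z)
print([round(v, 3) for v in z])
print(mu_z, var_z)
```

The last line should print values equal to 0 and 1 up to rounding error.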

EXERCISE

Calculate the standardized values

z_i = (x_i - mu)/sigma

of the numbers

1.7, 4.5, 3.8, 2.7, 0.6, 4.8, 1.1, 5.7, 3.4, 2.2,

and check that mu_z = 0 and sigma_z = 1 by direct calculation.

Two Dimensional Populations

Suppose a population consists of a number of points (x_1, y_1), ..., (x_n, y_n) in two dimensions.

EXAMPLE

Table 4. A Two-Dimensional Population
  x:   6.24  6.42  6.56  6.41  6.57  6.64  6.62  6.76  6.72  6.76
  y:   3.21  3.26  3.39  3.31  3.41  3.21  3.29  3.31  3.39  3.39
  x:   6.62  6.72  6.92  6.94  6.84  6.88  6.82  6.98  7.16  7.08
  y:   3.43  3.51  3.35  3.31  3.44  3.49  3.51  3.51  3.52  3.54

These points may be plotted on a scatter diagram as in Fig. 4.

Fig. 4. Scatter diagram of data points in two dimensions. The centroid and a line summarizing the data points are also shown.

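We may also compute for the variables x and y the means mu_x, mu_y and the variances sigma_x^2, sigma_y^2 as before:

mu_x = (x_1 + ... + x_n)/n,   mu_y = (y_1 + ... + y_n)/n,
sigma_x^2 = [(x_1 - mu_x)^2 + ... + (x_n - mu_x)^2]/n,   sigma_y^2 = [(y_1 - mu_y)^2 + ... + (y_n - mu_y)^2]/n.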

EXAMPLE

For the data in Table 4 we have:
mu_x = 6.733, mu_y = 3.389, sigma_x^2 = 0.0510, sigma_y^2 = 0.0105.
The data in this example can be summarized by the centroid (mu_x, mu_y), and the line (see Fig. 4) given by the equation
(x - mu_x)/sigma_x = (y - mu_y)/sigma_y.
Another statistic is the covariance sigma_xy defined by:
sigma_xy = [(x_1 - mu_x)(y_1 - mu_y) + ... + (x_n - mu_x)(y_n - mu_y)]/n = (x_1 y_1 + ... + x_n y_n)/n - mu_x mu_y.
This is a measure of how much changes in x are associated with changes in y. If sigma_xy > 0, then x and y tend to increase or decrease together. If sigma_xy < 0, then y tends to decrease as x increases, and vice versa. If sigma_xy = 0, or is very small, then there is little or no linear association between x and y.

Because the magnitude of the covariance depends on the dispersion of x and y, a better measure of these associations is given by the correlation coefficient rho_xy. The correlation coefficient is the covariance of the data in standardized form:

rho_xy = sigma_xy/(sigma_x sigma_y).
The correlation coefficient may also be calculated from the formula:
rho_xy = [n Sum(xy) - Sum(x) Sum(y)] / sqrt{[n Sum(x^2) - (Sum(x))^2] [n Sum(y^2) - (Sum(y))^2]},
where Sum(xy) = x_1 y_1 + ... + x_n y_n, Sum(x) = x_1 + ... + x_n, and so on.

The correlation coefficient can have values in the range -1 to +1.
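
Both ways of computing the correlation coefficient can be sketched in Python. The function names are illustrative, and the data are the twenty points of Table 4.

```python
import math

x = [6.24, 6.42, 6.56, 6.41, 6.57, 6.64, 6.62, 6.76, 6.72, 6.76,
     6.62, 6.72, 6.92, 6.94, 6.84, 6.88, 6.82, 6.98, 7.16, 7.08]
y = [3.21, 3.26, 3.39, 3.31, 3.41, 3.21, 3.29, 3.31, 3.39, 3.39,
     3.43, 3.51, 3.35, 3.31, 3.44, 3.49, 3.51, 3.51, 3.52, 3.54]

def covariance(xs, ys):
    # Mean product of the deviations from the two means (divide by n).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / n

def correlation(xs, ys):
    # rho = sigma_xy / (sigma_x sigma_y); covariance(v, v) is the variance of v.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def correlation_from_sums(xs, ys):
    # The alternative formula written in terms of raw sums.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(a * b for a, b in zip(xs, ys))
    sxx = sum(a * a for a in xs)
    syy = sum(b * b for b in ys)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

cov_xy = covariance(x, y)
rho_xy = correlation(x, y)
print(round(cov_xy, 4), round(rho_xy, 3), round(correlation_from_sums(x, y), 3))
```

The two correlation functions should agree up to rounding error, since the second formula is the first with numerator and denominator multiplied by n^2.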

EXAMPLE

For the data in Table 4 we obtain:
sigma_xy = 0.0158 and rho_xy = 0.682.

EXERCISE

Draw a scatter diagram and calculate the means, variances, covariance, and correlation coefficient for the following data.

  x:    10   17    8   20   14   10    6   15   18   13   15   12
  y:   155  150  180  135  156  168  178  160  132  145  139  152

Analysis of Variance in Two Dimensions

Standardized Scatter Diagram

In order to analyse the relation between x and y in two-dimensional populations we plot the scatter diagram in standardized form. The standardized variables for x and y are

z_x = (x - mu_x)/sigma_x,   z_y = (y - mu_y)/sigma_y.

The left diagram in Fig. 5 shows the standardized scatter diagram for the points in Table 4. The centroid of the points in Fig. 5 (left) is the origin z_x = 0, z_y = 0, and the standard deviations are sigma_{z_x} = 1, sigma_{z_y} = 1. The correlation coefficient is rho_{z_x,z_y} = 0.682, which is the same as rho_xy.

Fig. 5. Left: Standardized plot of the data in Table 4. Right: The same data plotted relative to the principal axes.

Principal Axes

The uv-axes at angles of 45° to the z_x z_y-axes in Fig. 5 (left) are the principal axes for the population. Relative to these axes the data in the population are represented by points (u,v) as in the right diagram in Fig. 5. The points (u,v) have the following statistics:

mu_u = 0,   mu_v = 0,
sigma_u^2 = 1 + rho_xy,   sigma_v^2 = 1 - rho_xy,
sigma_uv = 0.

In this example the variances of u and v are therefore:

sigma_u^2 = 1.682, sigma_v^2 = 0.318.

These statistics give an alternative description of the standardized data in terms of the uncorrelated variables u and v relative to the principal axes.
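
The rotation to the principal axes can be sketched in Python for the data of Table 4. The sketch assumes the rotation u = (z_x + z_y)/sqrt(2), v = (z_y - z_x)/sqrt(2). The direction chosen for the v-axis is an assumption made here and does not affect the variances. The function names are illustrative.

```python
import math

x = [6.24, 6.42, 6.56, 6.41, 6.57, 6.64, 6.62, 6.76, 6.72, 6.76,
     6.62, 6.72, 6.92, 6.94, 6.84, 6.88, 6.82, 6.98, 7.16, 7.08]
y = [3.21, 3.26, 3.39, 3.31, 3.41, 3.21, 3.29, 3.31, 3.39, 3.39,
     3.43, 3.51, 3.35, 3.31, 3.44, 3.49, 3.51, 3.51, 3.52, 3.54]

def standardize(values):
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]

def mean_product(a, b):
    # For variables with zero mean this is the covariance (or the variance if a is b).
    return sum(p * q for p, q in zip(a, b)) / len(a)

zx, zy = standardize(x), standardize(y)
rho = mean_product(zx, zy)          # correlation coefficient of x and y

# Assumed 45 degree rotation to the uv-axes.
r = math.sqrt(0.5)
u = [r * (a + b) for a, b in zip(zx, zy)]
v = [r * (b - a) for a, b in zip(zx, zy)]

var_u = mean_product(u, u)
var_v = mean_product(v, v)
cov_uv = mean_product(u, v)
print(round(var_u, 3), round(var_v, 3), round(cov_uv, 6))
```

The printed variances should equal 1 + rho and 1 - rho, and the covariance of u and v should vanish up to rounding error.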

Analysis of Variance

The total variance of the standardized two-dimensional population may be defined as:

sigma_total^2 = sigma_{z_x}^2 + sigma_{z_y}^2 = 2.

The total variance is unchanged by a rotation of the axes. Therefore, we also have

sigma_total^2 = sigma_u^2 + sigma_v^2 = 2.
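
This invariance can be checked with a short derivation, written here in LaTeX. It assumes that the principal axes are given by the 45° rotation u = (z_x + z_y)/sqrt(2), v = (z_y - z_x)/sqrt(2). The sign convention for v is an assumption and does not change the result. The derivation uses the facts that the standardized variables have zero means and unit variances, and that their covariance is rho_xy.

```latex
\begin{aligned}
\sigma_u^2 &= \tfrac{1}{2}\left(\sigma_{z_x}^2 + 2\,\sigma_{z_x z_y} + \sigma_{z_y}^2\right)
            = \tfrac{1}{2}\left(1 + 2\rho_{xy} + 1\right) = 1 + \rho_{xy}, \\
\sigma_v^2 &= \tfrac{1}{2}\left(\sigma_{z_x}^2 - 2\,\sigma_{z_x z_y} + \sigma_{z_y}^2\right)
            = \tfrac{1}{2}\left(1 - 2\rho_{xy} + 1\right) = 1 - \rho_{xy}, \\
\sigma_u^2 + \sigma_v^2 &= (1 + \rho_{xy}) + (1 - \rho_{xy}) = 2 .
\end{aligned}
```

The sign of rho_xy therefore decides which of the two variances is the larger.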

When rho_xy > 0, we have sigma_u^2 > sigma_v^2. We then think of sigma_u^2 as the variance in the direct relation between x and y, and sigma_v^2 as the variance of the scatter due to other causes outside the relation between x and y.

If rho_xy < 0, we have sigma_u^2 < sigma_v^2. In this case we think of sigma_v^2 as the variance in the inverse relation between x and y, and sigma_u^2 as the variance of the scatter due to other causes.

For the data in Table 4, rho_xy > 0, and we have

sigma_u^2/sigma_total^2 = 1.682/2 = 0.841,
sigma_v^2/sigma_total^2 = 0.318/2 = 0.159.

Therefore 84.1% of the total variance is in the relation between x and y, and 15.9% of the total variance has causes outside this relation.

EXERCISE

Draw a standardized scatter diagram for the following data. Analyze the total variance of the data.

  x:    10   17    8   20   14   10    6   15   18   13   15   12
  y:   155  150  180  135  156  168  178  160  132  145  139  152

By R. H. B. Exell, 2001, modified 2003. King Mongkut's University of Technology Thonburi.