F F F F M M F F M F F M F M F M M F F F
This population is divided into two categories, M and F, with frequencies:
7 males, 13 females.
The frequency distribution may be expressed as percentages (35% males, 65% females) or as relative frequencies:
$$f_1 = 0.35, \quad f_2 = 0.65.$$
In a relative frequency distribution with $n$ categories we have:
$$f_1 + \dots + f_n = 1.$$
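The counting above can be reproduced with a short script. This is only a sketch: the names `population`, `counts` and `rel_freq` are chosen here for illustration and are not part of the original notes.

```python
from collections import Counter

# The population of 20 individuals listed above.
population = "F F F F M M F F M F F M F M F M M F F F".split()

n = len(population)
counts = Counter(population)                      # absolute frequencies
rel_freq = {c: k / n for c, k in counts.items()}  # relative frequencies f_i

for category in sorted(counts):
    print(f"{category}: count = {counts[category]}, f = {rel_freq[category]:.2f}")

# The relative frequencies of all categories add up to 1.
print("sum of relative frequencies:", sum(rel_freq.values()))
```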
1.7, 4.5, 3.8, 2.7, 0.6, 4.8, 1.1, 5.7, 3.4, 2.2.
Statistics commonly used with populations of numbers $x_1, \dots, x_n$ are the mean
$$\mu = \frac{x_1 + \dots + x_n}{n},$$
and the variance
$$\sigma^2 = \frac{(x_1 - \mu)^2 + \dots + (x_n - \mu)^2}{n}.$$
The statistic $\sigma$ is called the standard deviation. The mean is a measure of the center of the population, and the standard deviation is a measure of the scattering of the numbers about the mean. The most convenient formula for calculating the variance is:
$$\sigma^2 = \frac{x_1^2 + \dots + x_n^2}{n} - \mu^2.$$
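As an illustration, the two variance formulas can be applied to the ten numbers listed above. The sketch below uses plain Python; the variable names are illustrative choices, not notation from the notes.

```python
from math import sqrt

# The population of ten numbers given above.
x = [1.7, 4.5, 3.8, 2.7, 0.6, 4.8, 1.1, 5.7, 3.4, 2.2]
n = len(x)

mu = sum(x) / n                                          # mean
var_definition = sum((xi - mu) ** 2 for xi in x) / n     # mean squared deviation
var_shortcut = sum(xi ** 2 for xi in x) / n - mu ** 2    # mean of squares minus squared mean
sigma = sqrt(var_definition)                             # standard deviation

print(f"mean                  = {mu:.4f}")
print(f"variance (definition) = {var_definition:.4f}")
print(f"variance (shortcut)   = {var_shortcut:.4f}")
print(f"standard deviation    = {sigma:.4f}")
```

For these numbers the mean is 3.05 and both formulas give a variance of 2.5345. The shortcut formula subtracts two nearly equal quantities when the mean is large compared with the spread, so in floating-point arithmetic the definition form is usually the safer choice.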
These results can be shown in a diagram as in Fig. 1.
Fig. 1. A population of ten numbers with their mean and standard deviation.
Show that the mean square deviation $s^2$ of the numbers $x_1, \dots, x_n$ from the value $x$, defined by
$$s^2 = \frac{(x_1 - x)^2 + \dots + (x_n - x)^2}{n},$$
is a minimum when $x = \mu$. The minimum value of $s^2$ is the variance $\sigma^2$.
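One possible line of argument, offered as a sketch rather than as the intended solution: write each deviation as $x_i - x = (x_i - \mu) + (\mu - x)$ and expand the square. The cross term vanishes because the deviations from the mean sum to zero.

$$s^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2 + \frac{2(\mu - x)}{n}\sum_{i=1}^{n}(x_i - \mu) + (\mu - x)^2 = \sigma^2 + (\mu - x)^2.$$

Since $(\mu - x)^2 \ge 0$, with equality only when $x = \mu$, the smallest value of $s^2$ is $\sigma^2$, attained at $x = \mu$.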
A histogram is a diagram which consists of rectangles with areas proportional to the relative frequencies of the classes.
| Class No. $i$ | Class Interval | Central Value $x_i$ | Frequency $n_i$ | Relative Frequency $f_i$ |
|---|---|---|---|---|
| 1 | 40 - 50 | 45 | 5 | 0.05 |
| 2 | 50 - 60 | 55 | 18 | 0.18 |
| 3 | 60 - 70 | 65 | 42 | 0.42 |
| 4 | 70 - 80 | 75 | 27 | 0.27 |
| 5 | 80 - 90 | 85 | 8 | 0.08 |
Fig. 2. A histogram of the data in Table 1. The symbol above the histogram shows the mean and standard deviation of the data.
The mean and standard deviation of the data are calculated by summations over the classes as follows:
$$n = n_1 + \dots + n_5, \qquad f_i = \frac{n_i}{n},$$
$$\mu = f_1 x_1 + \dots + f_5 x_5 = \frac{n_1 x_1 + \dots + n_5 x_5}{n},$$
$$\sigma^2 = f_1 (x_1 - \mu)^2 + \dots + f_5 (x_5 - \mu)^2 = \frac{n_1 x_1^2 + \dots + n_5 x_5^2}{n} - \mu^2.$$
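The class summations can be checked numerically on the five classes tabulated above. The following is a minimal sketch; the list names are illustrative assumptions.

```python
from math import sqrt

# Central values x_i and frequencies n_i of the five classes in the table above.
central_values = [45, 55, 65, 75, 85]
frequencies = [5, 18, 42, 27, 8]

n = sum(frequencies)                        # total number of observations
rel_freq = [ni / n for ni in frequencies]   # relative frequencies f_i

# Mean and variance as frequency-weighted sums over the classes.
mu = sum(f * x for f, x in zip(rel_freq, central_values))
var_weighted = sum(f * (x - mu) ** 2 for f, x in zip(rel_freq, central_values))
var_shortcut = sum(ni * x ** 2 for ni, x in zip(frequencies, central_values)) / n - mu ** 2
sigma = sqrt(var_weighted)

print(f"n = {n}, mean = {mu:.2f}, variance = {var_weighted:.2f}, sd = {sigma:.3f}")
```

With these class values the mean is 66.5 and the variance is 94.75. Because every observation in a class is replaced by the central value of that class, these are approximations to the statistics of the underlying ungrouped data.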
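For a population divided into $m$ categories, where category $i$ contains $n_i$ numbers with mean $\mu_i$ and variance $\sigma_i^2$, the variance of the whole population is the mean of the category variances plus the variance of the category means:
$$\sigma^2 = \frac{n_1\sigma_1^2 + \dots + n_m\sigma_m^2}{n} + \frac{n_1(\mu_1 - \mu)^2 + \dots + n_m(\mu_m - \mu)^2}{n}.$$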
| Number | 4.0 | 4.5 | 0.5 | 3.5 | 4.9 | 5.6 | 9.4 | 1.6 | 1.0 | 3.4 |
| Category | A | A | A | A | B | A | C | A | C | A |
| Number | 0.5 | 7.1 | 7.9 | 5.1 | 4.3 | 5.5 | 8.2 | 9.8 | 2.2 | 3.3 |
| Category | B | C | A | C | B | B | C | A | B | B |
Using the above formulas we get:
| Category | $n_i$ | $\mu_i$ | $\sigma_i^2$ | $\sigma_i$ |
|---|---|---|---|---|
| A | 9 | 4.533 | 7.524 | 2.743 |
| B | 6 | 3.433 | 2.892 | 1.701 |
| C | 5 | 6.160 | 8.658 | 2.943 |
| Whole population | 20 | 4.610 | 7.437 | 2.727 |

Mean of category variances = 6.418

Variance of category means = 1.019
Figure 3 shows the mean and standard deviation of each category and the whole population. The variance of the category means is much smaller than the mean of the category variances, so the division of this population into three categories may not be significant.
Fig. 3. Statistics of the categories and the whole population for the data in Table 2. The symbols above the diagrams show the means and standard deviations of the data.
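The split of the total variance into the mean of the category variances and the variance of the category means can be checked with a short script. This is a sketch using the twenty numbers and category labels exactly as printed in Table 2; the helper names `mean` and `variance` and the dictionary `groups` are illustrative. Run on the printed data, categories A and C reproduce the tabulated figures, while category B comes out slightly different (a mean of 3.450 rather than 3.433), so the tabulated statistics may have been computed from marginally different data.

```python
# Numbers and category labels as printed in Table 2, in reading order.
numbers = [4.0, 4.5, 0.5, 3.5, 4.9, 5.6, 9.4, 1.6, 1.0, 3.4,
           0.5, 7.1, 7.9, 5.1, 4.3, 5.5, 8.2, 9.8, 2.2, 3.3]
categories = list("AAAABACACA" "BCACBBCABB")

def mean(values):
    return sum(values) / len(values)

def variance(values):
    # Population variance (divisor n), as used throughout these notes.
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

# Group the numbers by category.
groups = {}
for value, cat in zip(numbers, categories):
    groups.setdefault(cat, []).append(value)

n = len(numbers)
mu = mean(numbers)
total_var = variance(numbers)

# Mean of category variances and variance of category means, weighted by n_i.
within = sum(len(g) * variance(g) for g in groups.values()) / n
between = sum(len(g) * (mean(g) - mu) ** 2 for g in groups.values()) / n

for cat in sorted(groups):
    g = groups[cat]
    print(f"{cat}: n = {len(g)}, mean = {mean(g):.3f}, variance = {variance(g):.3f}")
print(f"whole population: mean = {mu:.3f}, variance = {total_var:.3f}")
print(f"mean of category variances = {within:.3f}")
print(f"variance of category means = {between:.3f}")
```

Whatever the data, `within + between` equals `total_var` up to rounding error, which is a convenient check on the arithmetic.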
| x: | 3.5 | 6.1 | 13.4 | 6.9 | 16.8 | 2.9 | 10.1 | 8.7 | 16.6 | 15.2 | 4.0 | 6.7 |
| Category | G | G | H | G | H | G | G | H | H | H | G | G |
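An observation $x$ from a population with mean $\mu$ and standard deviation $\sigma$ is expressed in standardized form by the transformation
$$z = \frac{x - \mu}{\sigma}.$$
A distribution of standardized observations has mean $\mu_z = 0$ and standard deviation $\sigma_z = 1$.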
Calculate the standardized values
$$z_i = \frac{x_i - \mu}{\sigma}$$
of the numbers
1.7, 4.5, 3.8, 2.7, 0.6, 4.8, 1.1, 5.7, 3.4, 2.2,
and check that $\mu_z = 0$ and $\sigma_z = 1$ by direct calculation.
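A numerical check of this exercise might look like the sketch below. It is not a substitute for working the calculation by hand, and the variable names are illustrative.

```python
from math import sqrt

x = [1.7, 4.5, 3.8, 2.7, 0.6, 4.8, 1.1, 5.7, 3.4, 2.2]
n = len(x)

mu = sum(x) / n
sigma = sqrt(sum((xi - mu) ** 2 for xi in x) / n)

# Standardized values z_i = (x_i - mu) / sigma.
z = [(xi - mu) / sigma for xi in x]

# Direct calculation of the mean and standard deviation of the z values.
mu_z = sum(z) / n
sigma_z = sqrt(sum((zi - mu_z) ** 2 for zi in z) / n)

print("z =", [round(zi, 3) for zi in z])
print(f"mean of z = {mu_z:.6f}, standard deviation of z = {sigma_z:.6f}")
```

In floating-point arithmetic the mean of the standardized values is zero only up to rounding error, so a comparison with a small tolerance is more appropriate than a test for exact equality.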
| x: | 6.24 | 6.42 | 6.56 | 6.41 | 6.57 | 6.64 | 6.62 | 6.76 | 6.72 | 6.76 |
| y: | 3.21 | 3.26 | 3.39 | 3.31 | 3.41 | 3.21 | 3.29 | 3.31 | 3.39 | 3.39 |
| x: | 6.62 | 6.72 | 6.92 | 6.94 | 6.84 | 6.88 | 6.82 | 6.98 | 7.16 | 7.08 |
| y: | 3.43 | 3.51 | 3.35 | 3.31 | 3.44 | 3.49 | 3.51 | 3.51 | 3.52 | 3.54 |
These points may be plotted on a scatter diagram as in Fig. 4.

Fig. 4. Scatter diagram of data points in two dimensions. The centroid and a line summarizing the data points are also shown.
We may also compute for the variables $x$ and $y$ the means $\mu_x$, $\mu_y$ and the variances $\sigma_x^2$, $\sigma_y^2$ as before. The line through the centroid $(\mu_x, \mu_y)$ with slope $\sigma_y/\sigma_x$ has the equation
$$\frac{x - \mu_x}{\sigma_x} = \frac{y - \mu_y}{\sigma_y}.$$
Another statistic is the covariance $\sigma_{xy}$ defined by:
$$\sigma_{xy} = \frac{(x_1 - \mu_x)(y_1 - \mu_y) + \dots + (x_n - \mu_x)(y_n - \mu_y)}{n} = \frac{x_1 y_1 + \dots + x_n y_n}{n} - \mu_x \mu_y.$$
This is a measure of how much changes in $x$ are associated with changes in $y$. If $\sigma_{xy} > 0$, then $x$ and $y$ tend to increase or decrease together. If $\sigma_{xy} < 0$, then $y$ tends to decrease as $x$ increases, and vice versa. If $\sigma_{xy} = 0$, or is very small, then there is little or no linear association between $x$ and $y$.
Because the magnitude of the covariance depends on the dispersion of $x$ and $y$, a better measure of these associations is given by the correlation coefficient $\rho_{xy}$. The correlation coefficient is the covariance of the data in standardized form:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}.$$
The correlation coefficient may also be calculated from the formula:
$$\rho_{xy} = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left(n\sum x^2 - \left(\sum x\right)^2\right)\left(n\sum y^2 - \left(\sum y\right)^2\right)}},$$
where $\sum xy = x_1 y_1 + \dots + x_n y_n$, $\sum x = x_1 + \dots + x_n$, and so on.
The correlation coefficient can have values in the range -1 to +1.
For the data in Table 4 we get $\sigma_{xy} = 0.0158$ and $\rho_{xy} = 0.682$.
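The quoted covariance and correlation coefficient can be reproduced from the twenty $(x, y)$ pairs tabulated earlier, using both the definition and the formula in terms of raw sums. This is a sketch with illustrative variable names.

```python
from math import sqrt

# The twenty (x, y) pairs tabulated earlier, in reading order.
x = [6.24, 6.42, 6.56, 6.41, 6.57, 6.64, 6.62, 6.76, 6.72, 6.76,
     6.62, 6.72, 6.92, 6.94, 6.84, 6.88, 6.82, 6.98, 7.16, 7.08]
y = [3.21, 3.26, 3.39, 3.31, 3.41, 3.21, 3.29, 3.31, 3.39, 3.39,
     3.43, 3.51, 3.35, 3.31, 3.44, 3.49, 3.51, 3.51, 3.52, 3.54]
n = len(x)

mu_x, mu_y = sum(x) / n, sum(y) / n
var_x = sum((xi - mu_x) ** 2 for xi in x) / n
var_y = sum((yi - mu_y) ** 2 for yi in y) / n

# Covariance and correlation coefficient from the definitions.
cov_xy = sum((xi - mu_x) * (yi - mu_y) for xi, yi in zip(x, y)) / n
rho_xy = cov_xy / sqrt(var_x * var_y)

# Correlation coefficient from the raw sums.
sx, sy = sum(x), sum(y)
sxx = sum(xi * xi for xi in x)
syy = sum(yi * yi for yi in y)
sxy = sum(xi * yi for xi, yi in zip(x, y))
rho_sums = (n * sxy - sx * sy) / sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

print(f"covariance = {cov_xy:.4f}")
print(f"correlation (definition) = {rho_xy:.3f}")
print(f"correlation (raw sums)   = {rho_sums:.3f}")
```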
| x: | 10 | 17 | 8 | 20 | 14 | 10 | 6 | 15 | 18 | 13 | 15 | 12 |
| y: | 155 | 150 | 180 | 135 | 156 | 168 | 178 | 160 | 132 | 145 | 139 | 152 |
In order to analyse the relation between $x$ and $y$ in two-dimensional populations we plot the scatter diagram in standardized form. The standardized variables for $x$ and $y$ are
$$z_x = \frac{x - \mu_x}{\sigma_x}, \qquad z_y = \frac{y - \mu_y}{\sigma_y}.$$
The left diagram in Fig. 5 shows the standardized scatter diagram for the points in Table 4. The centroid of the points in Fig. 5 (left) is the origin $z_x = 0$, $z_y = 0$, and the standard deviations are $\sigma_{z_x} = 1$, $\sigma_{z_y} = 1$. The correlation coefficient is $\rho_{z_x z_y} = 0.682$, which is the same as $\rho_{xy}$.

Fig. 5. Left: Standardized plot of the data in Table 4. Right: The same data plotted relative to the principal axes.
The $uv$-axes at angles of 45° to the $z_x z_y$-axes in Fig. 5 (left) are the principal axes for the population. Relative to these axes the data in the population are represented by points $(u, v)$ as in the right diagram in Fig. 5. The points $(u, v)$ have the following statistics:
$$\mu_u = 0, \quad \mu_v = 0, \quad \sigma_u^2 = 1 + \rho_{xy}, \quad \sigma_v^2 = 1 - \rho_{xy}, \quad \sigma_{uv} = 0.$$
In this example the variances of $u$ and $v$ are therefore:
$$\sigma_u^2 = 1.682, \qquad \sigma_v^2 = 0.318.$$
These statistics give an alternative description of the standardized data in terms of the uncorrelated variables $u$ and $v$ relative to the principal axes.
The total variance of the standardized two-dimensional population may be defined as:
$$\sigma_{\mathrm{total}}^2 = \sigma_{z_x}^2 + \sigma_{z_y}^2 = 2.$$
The total variance is unchanged by a rotation of the axes. Therefore, we also have
$$\sigma_{\mathrm{total}}^2 = \sigma_u^2 + \sigma_v^2 = 2.$$
When $\rho_{xy} > 0$, we have $\sigma_u^2 > \sigma_v^2$. We then think of $\sigma_u^2$ as the variance in the direct relation between $x$ and $y$, and $\sigma_v^2$ as the variance of the scatter due to other causes outside the relation between $x$ and $y$.
If $\rho_{xy} < 0$, we have $\sigma_u^2 < \sigma_v^2$. In this case we think of $\sigma_v^2$ as the variance in the inverse relation between $x$ and $y$, and $\sigma_u^2$ as the variance of the scatter due to other causes.
For the data in Table 4, $\rho_{xy} > 0$, and we have
$$\sigma_u^2 / \sigma_{\mathrm{total}}^2 = 1.682/2 = 0.841,$$
$$\sigma_v^2 / \sigma_{\mathrm{total}}^2 = 0.318/2 = 0.159.$$
Therefore 84.1% of the total variance is in the relation between $x$ and $y$, and 15.9% of the total variance has causes outside this relation.
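The analysis of the total variance can be sketched in code as follows. The rotation convention used here is an assumption, since the notes give no explicit formula: $u = (z_x + z_y)/\sqrt{2}$ lies along the line $z_x = z_y$ and $v = (z_y - z_x)/\sqrt{2}$ is perpendicular to it. The helper `standardize` and the other names are illustrative.

```python
from math import sqrt

# The twenty (x, y) pairs of Table 4, in reading order.
x = [6.24, 6.42, 6.56, 6.41, 6.57, 6.64, 6.62, 6.76, 6.72, 6.76,
     6.62, 6.72, 6.92, 6.94, 6.84, 6.88, 6.82, 6.98, 7.16, 7.08]
y = [3.21, 3.26, 3.39, 3.31, 3.41, 3.21, 3.29, 3.31, 3.39, 3.39,
     3.43, 3.51, 3.35, 3.31, 3.44, 3.49, 3.51, 3.51, 3.52, 3.54]
n = len(x)

def standardize(values):
    # Subtract the mean and divide by the population standard deviation.
    m = sum(values) / len(values)
    s = sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / s for v in values]

zx, zy = standardize(x), standardize(y)

# Rotate the axes by 45 degrees (assumed convention, see the text above).
u = [(a + b) / sqrt(2) for a, b in zip(zx, zy)]
v = [(b - a) / sqrt(2) for a, b in zip(zx, zy)]

# u and v have zero mean, so their variances are their mean squares.
var_u = sum(ui ** 2 for ui in u) / n
var_v = sum(vi ** 2 for vi in v) / n
cov_uv = sum(ui * vi for ui, vi in zip(u, v)) / n
total = var_u + var_v

print(f"var_u = {var_u:.3f}, var_v = {var_v:.3f}, cov_uv = {cov_uv:.6f}")
print(f"share of total variance along u: {var_u / total:.3f}")
print(f"share of total variance along v: {var_v / total:.3f}")
```

The same script can be applied to other data by replacing the lists `x` and `y`; when the correlation is negative, the larger share of the variance falls on `v` instead of `u`.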
Draw a standardized scatter diagram for the following data. Analyze the total variance of the data.
| x: | 10 | 17 | 8 | 20 | 14 | 10 | 6 | 15 | 18 | 13 | 15 | 12 |
| y: | 155 | 150 | 180 | 135 | 156 | 168 | 178 | 160 | 132 | 145 | 139 | 152 |
By R. H. B. Exell, 2001, modified 2003. King Mongkut's University of Technology Thonburi.