PRACTICAL MATHEMATICS

Estimation and Significance Testing

Contents

  1. Estimation
  2. Confidence Intervals for Means
  3. Small Samples
  4. Estimating Correlation Coefficients
  5. Testing the Significance of an Estimated Correlation Coefficient
  6. Comparing an Observed Frequency Distribution with a Theoretical Distribution

Estimation

Let x1, ..., xn be the members of a random sample of size n taken from a population having mean mu and variance sigma2. Let m be the sample mean, and let s2 be the sample variance (computed with divisor n).

Because E[m] = mu, we use m as an estimator for mu.

However, because E[s2] = (n - 1).sigma2/n, we use:

S2 = n.s2/(n - 1)
as an estimator for sigma2.
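
The correction from s2 to S2 can be sketched in a few lines of Python. This is only an illustration: the function name and the sample values are assumptions, not part of the notes.

```python
from statistics import fmean, pvariance, variance

def estimate_mean_and_variance(sample):
    """Return (m, s2, S2) for a list of numbers, in the notation of the text."""
    n = len(sample)
    m = fmean(sample)        # sample mean, the estimator for mu
    s2 = pvariance(sample)   # sample variance with divisor n
    S2 = n * s2 / (n - 1)    # corrected estimator for sigma2
    return m, s2, S2

# Hypothetical sample, chosen only for illustration.
sample = [4.0, 7.0, 6.0, 5.0, 8.0]
m, s2, S2 = estimate_mean_and_variance(sample)

# The standard library's variance() already uses divisor n - 1,
# so it should agree with S2.
check = variance(sample)
```

Here s2 is the variance with divisor n, and S2 agrees with the value that statistics.variance computes directly.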

Confidence Intervals for Means

With large samples the distribution of m is approximately normal with variance S2/n. Then the mean of the population satisfies the inequality

m - S.z /sqrt(n) < mu < m + S.z /sqrt(n),
with a probability determined by the value chosen for z, where z refers to the standard normal distribution. The values of z for some commonly used probabilities are shown in Table 1 (rounded to three significant figures). These probabilities are called confidence levels.

Table 1. Values of z for Confidence Levels in a Normal Distribution
Confidence Level: 50% 95% 99%
z: 0.674 1.96 2.58

The quantity S.z/sqrt(n) at the 50% confidence level is called the probable error of mu.

EXAMPLE

Suppose a sample has size n = 40, mean m = 100, and variance s2 = 400. Then the estimated standard deviation of the population is:

S = sqrt(40×400/39) = 20.25.
The population mean mu satisfies the following relations at the 95% confidence level:
100 - 20.25×1.96/sqrt(40) < mu < 100 + 20.25×1.96/sqrt(40).
In other words the population mean lies between 93.72 and 106.28 with probability 95%.
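
The arithmetic of this example can be checked with a short Python sketch. The function name confidence_interval is an assumption made for illustration; the formula is the one given above, with s2 taken as the sample variance and S2 = n.s2/(n - 1).

```python
from math import sqrt

def confidence_interval(n, m, s2, z):
    """Large-sample confidence interval for mu.

    n  : sample size
    m  : sample mean
    s2 : sample variance (before the n/(n - 1) correction)
    z  : value from Table 1 for the chosen confidence level
    """
    S = sqrt(n * s2 / (n - 1))       # estimated population standard deviation
    half_width = S * z / sqrt(n)
    return m - half_width, m + half_width

low, high = confidence_interval(n=40, m=100, s2=400, z=1.96)
print(f"{low:.2f} < mu < {high:.2f}")   # prints "93.72 < mu < 106.28"
```

Other confidence levels are obtained by passing a different value of z from Table 1.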

EXERCISE

Find the 50% confidence interval for mu given by a sample with size n = 65, mean m = 50, and variance s2 = 500.

Small Samples

In small samples the normal distribution is not a good approximation to the distribution of m, especially when n is very small. Instead, when the population itself is normal, the quantity (m - mu).sqrt(n)/S has a distribution called Student's t distribution with n - 1 degrees of freedom, which is explained in textbooks on statistics. However, for practical purposes we may often use the approximate rule:

If the sample size n is not less than 20, then we may use z = 2 for a confidence level approximately 95%.
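
One way to get a feeling for this rule is to simulate it. The sketch below is an illustration under stated assumptions: the population is taken to be normal, and the function name, number of trials and seed are arbitrary choices. It draws many samples of size n = 20 and counts how often the interval m - 2S/sqrt(n) < mu < m + 2S/sqrt(n) contains the true mean.

```python
import random
from math import sqrt
from statistics import fmean, stdev

def coverage_with_z2(n=20, trials=2000, mu=0.0, sigma=1.0, seed=1):
    """Fraction of simulated samples whose z = 2 interval contains mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        m = fmean(sample)
        S = stdev(sample)              # standard deviation with divisor n - 1
        half_width = 2 * S / sqrt(n)
        if m - half_width < mu < m + half_width:
            hits += 1
    return hits / trials

coverage = coverage_with_z2()
print(f"observed coverage: {coverage:.3f}")
```

The observed fraction should be close to 95%, which is what the rule asserts. Repeating the experiment with a much smaller n shows how the approximation deteriorates.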

Estimating Correlation Coefficients

Let (x1, y1), ..., (xn, yn) be a random sample (with replacement) from a large two-dimensional population. An estimator for the correlation coefficient rho of the population is the correlation coefficient r of the sample.

EXAMPLE

Suppose the data in Table 2 are for a sample taken from a large population.

Table 2. A Sample of Data from a Two-Dimensional Population
  x:   6.24  6.42  6.56  6.41  6.57  6.64  6.62  6.76  6.72  6.76
  y:   3.21  3.26  3.39  3.31  3.41  3.21  3.29  3.31  3.39  3.39
  x:   6.62  6.72  6.92  6.94  6.84  6.88  6.82  6.98  7.16  7.08
  y:   3.43  3.51  3.35  3.31  3.44  3.49  3.51  3.51  3.52  3.54

These data are shown in a scatter diagram in Fig. 1.

Fig. 1. Scatter diagram of the data in Table 2.

The correlation coefficient in this sample is r = 0.682. Therefore we estimate the correlation coefficient rho of the population to be 0.682.
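
The sample correlation coefficient can be computed from the usual product-moment formula. The following Python sketch uses the twenty pairs as read from Table 2; the function name correlation is an assumption made for illustration.

```python
from math import sqrt

# The twenty (x, y) pairs as read from Table 2.
x = [6.24, 6.42, 6.56, 6.41, 6.57, 6.64, 6.62, 6.76, 6.72, 6.76,
     6.62, 6.72, 6.92, 6.94, 6.84, 6.88, 6.82, 6.98, 7.16, 7.08]
y = [3.21, 3.26, 3.39, 3.31, 3.41, 3.21, 3.29, 3.31, 3.39, 3.39,
     3.43, 3.51, 3.35, 3.31, 3.44, 3.49, 3.51, 3.51, 3.52, 3.54]

def correlation(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / sqrt(sxx * syy)

r = correlation(x, y)
print(round(r, 3))
```

The result should agree, to three decimal places, with the value of r quoted above.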

Testing the Significance of an Estimated Correlation Coefficient

For samples which are not too small (n not less than 20), if the population is not correlated the quantity r.sqrt[(n - 2)/(1 - r^2)] has a distribution which is approximately standard normal. Therefore, when the population is not correlated, the following inequalities are true with approximately 95% probability:

-2 < r.sqrt[(n - 2)/(1 - r^2)] < +2.

Therefore, if these inequalities are true we cannot conclude that the population is correlated. In other words, the sample could be from an uncorrelated population. But if the quantity r.sqrt[(n - 2)/(1 - r^2)] is outside the interval -2 to +2, then we can conclude, with 95% confidence, that the correlation coefficient of the population is different from zero.

EXAMPLE

In the previous example we have n = 20 and r = 0.682. Therefore:

r.sqrt[(n - 2)/(1 - r^2)] = 0.682×sqrt[(20 - 2)/(1 - 0.682^2)] = 3.96.

Since 3.96 is outside the interval -2 to +2, we conclude that r is significantly different from zero; in other words, we conclude that the population is correlated.
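
The test is easy to mechanise. In the Python sketch below the function names and the default limit of 2 are assumptions for illustration; the limit corresponds to the approximate 95% rule used in this section.

```python
from math import sqrt

def correlation_test_statistic(r, n):
    """Return r * sqrt((n - 2) / (1 - r^2)) for sample correlation r and size n."""
    return r * sqrt((n - 2) / (1 - r ** 2))

def is_significant(r, n, limit=2.0):
    """True if the statistic lies outside the interval -limit to +limit."""
    return abs(correlation_test_statistic(r, n)) > limit

t = correlation_test_statistic(0.682, 20)
print(f"{t:.2f}")                    # prints "3.96"
print(is_significant(0.682, 20))     # prints "True"
```

The same two functions can be applied to any other pair of values r and n, provided n is not less than 20.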

EXERCISE

Show that if, in a two-dimensional sample, n = 20 and r = 0.40, then we cannot conclude with 95% confidence that the population is correlated.

Comparing an Observed Frequency Distribution with a Theoretical Distribution

Suppose a population is distributed among a simple list of c categories. Let f1, ..., fc be the observed frequencies in these categories in a sample of size n taken (with replacement) from the population. Then

f1 + ... + fc = n.

Let e1, ..., ec be a corresponding set of theoretically expected frequencies for the same categories such that

e1 + ... + ec = n.

Then the quantity

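chi2 = (f1 - e1)^2/e1 + ... + (fc - ec)^2/ec = f1^2/e1 + ... + fc^2/ec - n
has approximately a chi-square distribution when the sample comes from a population with the theoretical frequencies. The second form follows by expanding each square and using the fact that the fi and the ei both add up to n. The categories and the sample size must be such that in every category i we have ei > 5.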

In this distribution there are c - 1 degrees of freedom. This is because, when the observations in the sample have been distributed among c - 1 categories, the remaining observations must go into the remaining category. In complicated classification tables with more restrictions the number of degrees of freedom is less than c - 1.

If the observed value of chi2 is less than the critical value given in Table 3, then we conclude, with 95% confidence, that the sample could have been taken from a population with the theoretical frequencies. But if chi2 is greater than the critical value in the table, then we conclude that the actual frequencies in the population differ significantly from the theoretical frequencies.

Table 3. Critical values of Chi-Square at the 95% Confidence Level
  Degrees of freedom:       1     2     3     4     5     6     7     8     9     10
  Critical value of chi2:   3.84  5.99  7.81  9.49  11.1  12.6  14.1  15.5  16.9  18.3
  Degrees of freedom:       11    12    13    14    15    16    17    18    19    20
  Critical value of chi2:   19.7  21.0  22.4  23.7  25.0  26.3  27.6  28.9  30.1  31.4
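
The calculation of chi2 and the comparison with Table 3 can be sketched in Python as follows. The function names and the observed frequencies are hypothetical, chosen only for illustration; the critical values are those of Table 3. Both forms of the formula for chi2 are included so that their agreement can be checked.

```python
# Critical values of chi-square at the 95% level, copied from Table 3.
CRITICAL_95 = {
    1: 3.84, 2: 5.99, 3: 7.81, 4: 9.49, 5: 11.1,
    6: 12.6, 7: 14.1, 8: 15.5, 9: 16.9, 10: 18.3,
    11: 19.7, 12: 21.0, 13: 22.4, 14: 23.7, 15: 25.0,
    16: 26.3, 17: 27.6, 18: 28.9, 19: 30.1, 20: 31.4,
}

def chi_square(observed, expected):
    """Sum of (f - e)^2 / e over all categories."""
    return sum((f - e) ** 2 / e for f, e in zip(observed, expected))

def chi_square_shortcut(observed, expected):
    """Equivalent form: sum of f^2 / e, minus n (needs sum(f) == sum(e))."""
    n = sum(observed)
    return sum(f * f / e for f, e in zip(observed, expected)) - n

def consistent_with_theory(observed, expected):
    """True if chi2 is below the Table 3 value for c - 1 degrees of freedom."""
    degrees_of_freedom = len(observed) - 1
    return chi_square(observed, expected) < CRITICAL_95[degrees_of_freedom]

# Hypothetical frequencies in six categories; they add up to n = 120.
observed = [22, 18, 15, 25, 24, 16]
expected = [20] * 6

print(chi_square(observed, expected))
print(consistent_with_theory(observed, expected))
```

The simple rule of c - 1 degrees of freedom used here applies only to a simple list of categories, as described above.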

EXAMPLE

Suppose that the observed frequencies in six categories in a sample are:

f1 = 25, f2 = 17, f3 = 15, f4 = 23, f5 = 24, f6 = 16,
and the expected frequency in each category is 20, so that n = 120. Then chi2 = 5.00. The number of degrees of freedom is 5. Since chi2 is less than the critical value 11.1 for 5 degrees of freedom, we conclude, with 95% confidence, that the sample could have come from a population with equal frequencies in all categories.

EXERCISE

A sample has frequencies 21, 50, 45, 31, 79, 14 in six categories. Could this sample have come from a population with equal frequencies in these categories?



By R. H. B. Exell, 1998. King Mongkut's University of Technology Thonburi.