
# The Significance of the Difference between two Proportions

Author: R.J.Edwards G4FGQ © 5th March 2004

An array of four integer counts, together with its marginal totals, is known as a 2-by-2 Contingency Table. It is used to compare two proportions statistically, a comparison which arises in a great variety of circumstances.

For example, suppose that in a random sample of 400 manufactured items taken from a large batch there are 2 defective. But in a sample of 300 taken from another large batch there are 7 defective. The question arises, is the quality of one batch better than the other or could the observed difference be solely due to random sampling when both batches are of identical quality? The null hypothesis is that the quality is the same in both batches. It may be that the cost of further inspection is excessive and a decision has to be made.

Another example. In a group of 80 people from a large population 5 had caught the pox since vaccination. In a group of 700 unvaccinated people 86 had caught the pox. Has vaccination had a beneficial effect or could the apparent improvement be due entirely to random sampling? The null hypothesis is that vaccination is ineffective. Should larger samples be taken or is there sufficient evidence already to say vaccination is effective with 95 percent confidence? Or it may be that a higher confidence level of 99 percent is required before action is taken. The probability of 5 or fewer cases in a sample of 80 is 5%.

```
TABLE:   A  B  E     Input data are A, B, C, D
         C  D  F     Row totals are E, F
         G  H  N     Column totals are G, H
                     Grand total is N

Principal proportions are:
         By row:     A/E, B/E, C/F, D/F
         By column:  A/G, C/G, B/H, D/H
```
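As a small illustration of the table layout, the marginal totals follow directly from the four input counts. The function name `margins` is hypothetical and not part of the program; the data are from the first example (2 defective in 400, 7 defective in 300).

```python
def margins(a, b, c, d):
    """Return the row totals E, F, column totals G, H and grand total N."""
    e, f = a + b, c + d      # row totals
    g, h = a + c, b + d      # column totals
    n = e + f                # grand total
    return {"E": e, "F": f, "G": g, "H": h, "N": n}


# First example: A = 2 defective, B = 398 sound; C = 7 defective, D = 293 sound.
totals = margins(2, 398, 7, 293)
print(totals)
```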

The top row describes the first random sample. There is a proportion A out of E items which possess a particular attribute. The 2nd row describes another random sample with a proportion C out of F possessing the SAME attribute. The 1st and 2nd columns may be described in a general way as "successful" and "unsuccessful". The other numbers in the table are needed for the calculations carried out by this program. Exceedingly large numbers are involved in the calculations such as the products of factorials of all numbers in the table.
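The products of factorials mentioned above can be handled without overflow by working with logarithms. The sketch below uses the standard hypergeometric expression for the probability of one particular table with fixed margins, E! F! G! H! / (N! A! B! C! D!). It is an assumption that the program uses this form; the function names are illustrative only.

```python
import math


def log_factorial(k):
    """Natural log of k!, via the log-gamma function to avoid huge numbers."""
    return math.lgamma(k + 1)


def table_probability(a, b, c, d):
    """Probability of this exact table, given its marginal totals."""
    e, f, g, h = a + b, c + d, a + c, b + d
    n = e + f
    log_p = (log_factorial(e) + log_factorial(f)
             + log_factorial(g) + log_factorial(h)
             - log_factorial(n)
             - log_factorial(a) - log_factorial(b)
             - log_factorial(c) - log_factorial(d))
    return math.exp(log_p)


p_observed = table_probability(2, 398, 7, 293)
print(p_observed)
```

Summing logarithms keeps every intermediate value small, even though 700! itself is far beyond ordinary floating-point range.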

The null hypothesis is that in the two large populations from which the samples are taken the true proportions are identical. The question is asked, what is the probability that results as extreme as those observed could have been due to chance sampling alone? If the probability is small enough then it may be concluded that a real difference between proportions actually exists.
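One common way to obtain such a probability is the one-tailed exact (hypergeometric) test: with all marginal totals held fixed, add up the probabilities of every table whose top-left count is no larger than the observed A. This is a sketch under that assumption; the source does not state which method the program uses, and `tail_probability` is a hypothetical name.

```python
from math import comb


def tail_probability(a, b, c, d):
    """P(top-left count <= a) under the null hypothesis, margins fixed."""
    e = a + b                  # size of first sample
    g = a + c                  # total "successful" in both samples
    n = a + b + c + d          # grand total
    lowest = max(0, e + g - n)  # smallest top-left count the margins allow
    favourable = sum(comb(g, k) * comb(n - g, e - k)
                     for k in range(lowest, a + 1))
    return favourable / comb(n, e)


# First example: is 2 defective out of 400 unusually few?
p_tail = tail_probability(2, 398, 7, 293)
print(p_tail)
```

Python integers are exact, so the binomial coefficients here do not overflow; only the final division produces a floating-point number.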

It is usual to take the 10% probability level as suggesting that a difference exists, the 1% level indicating that a difference very likely exists, and a 0.1% level indicating that a difference almost certainly exists. And so on.

It is usual to describe the results of an experiment as Not Significant, Suggestive, Significant, Highly Significant or Extremely Highly Significant according to the decreasing probability that the extreme values observed are due entirely to the random sampling of items from the two large populations.

In experimental design it is essential to ensure samples are taken purely at random without any possibility of bias affecting the proportions of interest.

Depending on how a table is arranged it may be that the "greater than" probability is the appropriate limit to be used. Some experience is needed.

The relationship between the two samples can be considered as a type of correlation. The significance limits of the correlation coefficient are shown. The actual value of the correlation coefficient between the populations lies within the stated limits with confidence probabilities of 95% and 99%.

The Cross Product Ratio, (A*D)/(B*C), is a statistic which is meaningful in some types of experiment. Calculated probabilities apply to this also.
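A minimal sketch of the cross product ratio, with the product B*C in the denominator. The function name is illustrative, and the guard against a zero denominator is an added assumption rather than documented program behaviour.

```python
def cross_product_ratio(a, b, c, d):
    """Return (A*D)/(B*C) for a 2-by-2 table."""
    if b == 0 or c == 0:
        raise ValueError("cross product ratio is undefined when B or C is zero")
    return (a * d) / (b * c)


# First example: 2 of 400 defective against 7 of 300 defective.
ratio = cross_product_ratio(2, 398, 7, 293)
print(ratio)
```

A ratio of 1 corresponds to equal proportions in the two samples.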

The abbreviations UL and LL are Upper Limit and Lower Limit. Probabilities in this program are areas, p%, under one tail of the normal distribution curve, with the corresponding probability in the other tail being (100 - p)%.

In the table of "Raw data proportions" the "expected" sample proportions are based on the hypothesis that the population proportions are the same.
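Under the hypothesis of equal population proportions, the natural estimate of the common proportion is the pooled value G/N, and the expected counts follow from it. The sketch below assumes that this is how the "expected" figures are formed; the names are hypothetical.

```python
def expected_under_null(a, b, c, d):
    """Pooled proportion and expected 'successful' counts for each sample."""
    e, f = a + b, c + d
    n = e + f
    pooled = (a + c) / n            # G/N, common proportion under the null
    return {
        "pooled_proportion": pooled,
        "expected_A": e * pooled,   # expected count in first sample
        "expected_C": f * pooled,   # expected count in second sample
        "observed_A_over_E": a / e,
        "observed_C_over_F": c / f,
    }


result = expected_under_null(2, 398, 7, 293)
print(result)
```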

In the table "Correlation Coefficient r", calculated r is the observed value of r in the samples. The probability shown is the probability that an absolute value as high as r could arise on the assumption of no correlation between the populations.
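For a 2-by-2 table the usual correlation measure is the phi coefficient, (A*D - B*C) / sqrt(E*F*G*H). The source does not say which formula the program uses, so the following is only a sketch under the assumption that r is the phi coefficient.

```python
import math


def phi_coefficient(a, b, c, d):
    """Correlation between row and column classification of a 2-by-2 table."""
    e, f, g, h = a + b, c + d, a + c, b + d
    denominator = math.sqrt(e * f * g * h)
    if denominator == 0:
        raise ValueError("a marginal total is zero; r is undefined")
    return (a * d - b * c) / denominator


r = phi_coefficient(2, 398, 7, 293)
print(r)
```

The sign of r depends only on how the rows and columns are arranged, which is why the absolute value is the quantity of interest.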

In the table "Correlation Coefficient" are the confidence limits on the actual but unknown correlation coefficient between the two populations.

In the table "Confidence limits for proportions" are estimates of the actual proportions which exist separately in the two populations. The possible range of each proportion widens as the probability of the true value lying within that range increases from 95% to 99%. It will be noticed that, as intuition suggests, the accuracy of the estimates increases as sample sizes increase.
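The behaviour described here can be illustrated with the simple normal-approximation interval, p ± z·sqrt(p(1-p)/n). This is a swapped-in textbook method, not necessarily the one the program uses, and `proportion_limits` is a hypothetical name.

```python
from statistics import NormalDist


def proportion_limits(successes, sample_size, confidence):
    """Approximate lower and upper limits for a population proportion."""
    p = successes / sample_size
    # Two-sided normal deviate, about 1.96 for 95% and 2.58 for 99%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * (p * (1 - p) / sample_size) ** 0.5
    return max(0.0, p - half_width), min(1.0, p + half_width)


# Second example: 86 cases among 700 unvaccinated people.
ll95, ul95 = proportion_limits(86, 700, 0.95)
ll99, ul99 = proportion_limits(86, 700, 0.99)
print(ll95, ul95)
print(ll99, ul99)
```

The 99% limits enclose the 95% limits, and a sample ten times larger with the same proportion gives a narrower range. The approximation is poor for very small counts.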

Do not allow any of the totals E, F, G, H to be less than 2 or the program will abort. Do not enter data A, B, C, D greater than 500, as numerical overflow may occur. For large values of A, B, C, D other, simpler formulae are just as accurate and, in general, firm conclusions can be drawn from the data without calculation of probabilities.
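These input limits could be checked before any calculation begins. The sketch below is an illustration only: the function name, the default parameters and the error messages are assumptions, not the program's own behaviour.

```python
def check_table(a, b, c, d, max_entry=500, min_total=2):
    """Raise ValueError if the data break the stated input limits."""
    entries = {"A": a, "B": b, "C": c, "D": d}
    for name, value in entries.items():
        if not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
        if value > max_entry:
            raise ValueError(f"{name} exceeds {max_entry}")
    totals = {"E": a + b, "F": c + d, "G": a + c, "H": b + d}
    for name, value in totals.items():
        if value < min_total:
            raise ValueError(f"total {name} is less than {min_total}")


check_table(2, 398, 7, 293)   # first example passes silently
```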