Population stratification is always a concern in association analysis. The extent of the problem in less extreme situations is debated (Thomas and Witte [1], Wacholder et al. [2]). Wacholder et al. [3] and Ardlie et al. [4] showed that hidden population structure is not a serious threat to case-control designs. We propose a method for assessing the seriousness of population stratification before an association study is designed. If population stratification is not a serious problem, one may consider a case-control study instead of a family-based design to gain power. In a case-control design, we compare the chi-square statistics from a structured population (a union of two subpopulations) and from a homogeneous population with the same penetrance and allele frequencies. We provide an explicit formula for calculating the chi-square statistics from 17 parameters, such as the proportions of the subpopulations and the allele frequencies in the subpopulations. We choose these factors because they have the potential to cause false associations. Each parameter takes a random value in a chosen range. We then calculate the likelihood of reaching opposite conclusions in the structured and the homogeneous populations. This is the likelihood of a false positive caused by population stratification. The advantage of this method is that it provides a cost-effective way to choose between case-control data and family data before any data are collected. We conclude that sample size has a significant effect on the likelihood of false positives caused by population stratification: the larger the sample, the more likely a false positive is if the population structure is ignored. If budget constraints keep the sample size below 200, a case-control study may be the better choice because of its power.

After Human Genome Project, the studies of genetic variation in human population have been developed extensively [

Wacholder et al. [

Ardlie et al. [

If population stratification is a serious problem, the reliability of case-control studies will be doubtful, and any positive results from case-control studies have to be reconfirmed by studies based on family-member controls. On the other hand, if this is not a serious problem, we do not have to spend valuable resources on collecting family-based data to just prevent bias caused by population stratification. Many people believe that case-control study is more powerful than family-based study (Morton and Collins [

In this paper we propose a method to assess the seriousness of population stratification. To study the bias caused by population stratification quantitatively, we consider two populations that have exactly the same marker allele frequencies, the same disease allele frequencies, and the same penetrance. One population is structured (Population I) and the other is homogeneous (Population II). Seventeen factors in a population are analyzed; we choose them because they have the potential to cause false associations. In a case-control design, at a biallelic marker, a standard chi-square statistic is used to test the association between the marker locus and an unknown disease locus. We want to know when data from the structured population and from the homogeneous population yield different conclusions, that is, when neglecting the population structure produces a false positive (or a false negative). Our approach is to calculate the chi-square statistic from 17 parameters. We randomly choose each parameter within its range and then compare the chi-square statistics for the structured and the homogeneous populations. The percentage of false conclusions (positive or negative) is recorded; this is the likelihood of a false conclusion caused by population stratification. The key step in our approach is an explicit formula for the marker allele frequencies among affected people and among normal people, given in Section 4. Since the rate of false positives depends on the ranges chosen for the parameters, we implement the explicit formula in a computer program in which the user chooses the ranges of all parameters, and the program calculates the likelihood of a false conclusion caused by population stratification under the chosen circumstance.

We will use the following notations:

1) I: population I (structured), which consists of two homogeneous subpopulations 1 and 2.

2) II: population II (homogeneous).

3) 1: subpopulation 1.

4) 2: subpopulation 2.

5) D: diseased people.

6) N: normal people.

7) M: a marker allele.

8) A: a disease allele.

9) ϕ i , i = 0 , 1 , 2 : penetrance, which is the probability of being affected given genotype A ¯ A ¯ , A A ¯ , AA, respectively.

10) P ( ⋅ ) : probability.

Population I is a union of two homogeneous subpopulations, and there is no admixture. The reason we choose two subpopulations instead of three or more is the general belief that the effect of population stratification will decrease as the number of subpopulations increases, and we want to consider the worst case scenario. We will compare population I (structured) with population II (homogeneous). In order to compare populations I and II, they have to have something in common. We assume that they have the same allele frequencies and penetrance.

In a case-control design, consider a biallelic marker locus with alleles $M$ and $\bar{M}$. Suppose allele $M$ appears more often in cases than in controls. Suppose the disease is caused by an unknown disease gene with several disease alleles $A_1, A_2, \cdots, A_s$ and a normal allele $\bar{A}$. We assume that the disease alleles were introduced into the general population at different times and that there were multiple ancestral haplotypes. Suppose the disease allele $A_j$ was introduced into the general population $n_j$ generations ago. Let $n \le \min\{n_j\}$ be a lower bound of the age of the latest mutant disease allele. Suppose that, $n$ generations ago, the conditional probability of a chromosome having allele $M$ given that it has $A_j$ is $P^{(n)}(M \mid A_j)$. Note that the unknown ages of the mutant disease alleles are absorbed into the unknown incomplete initial association, so they cause no additional difficulty. Suppose that $A_1, \cdots, A_s$ are functionally equivalent disease alleles, i.e. the penetrance is $P(D \mid A_i A_j) = \phi_2$ for $1 \le i, j \le s$, $P(D \mid A_i \bar{A}) = \phi_1$ for $1 \le i \le s$, and $P(D \mid \bar{A}\bar{A}) = \phi_0$, where $D$ indicates the disease phenotype. Letting $A = \cup_{j=1}^{s} A_j$, we have $\sum_{j=1}^{s} P(A_j) = P(A)$. For populations I and II, we look at the statistics

$$X_1 = \frac{(n_{11} + n_{12} + n_{21} + n_{22})\,(n_{11} n_{22} - n_{12} n_{21})^2}{(n_{11} + n_{12})(n_{21} + n_{22})(n_{11} + n_{21})(n_{12} + n_{22})}$$

$$X_2 = \frac{(m_{11} + m_{12} + m_{21} + m_{22})\,(m_{11} m_{22} - m_{12} m_{21})^2}{(m_{11} + m_{12})(m_{21} + m_{22})(m_{11} + m_{21})(m_{12} + m_{22})}$$

$X_1$ and $X_2$ are chi-square statistics with one degree of freedom for populations I and II, respectively. Consider a sample with $N_1$ cases and $N_2$ controls. Instead of taking a random sample, we calculate $n_{ij}$ and $m_{ij}$ using the following formulas, where D and N indicate diseased and normal, and I and II indicate populations I and II.

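$$n_{11} = 2 N_1 P(M \mid D \text{ and I}), \quad n_{21} = 2 N_1 \left(1 - P(M \mid D \text{ and I})\right)$$

$$n_{12} = 2 N_2 P(M \mid N \text{ and I}), \quad n_{22} = 2 N_2 \left(1 - P(M \mid N \text{ and I})\right)$$

$$m_{11} = 2 N_1 P(M \mid D \text{ and II}), \quad m_{21} = 2 N_1 \left(1 - P(M \mid D \text{ and II})\right)$$

$$m_{12} = 2 N_2 P(M \mid N \text{ and II}), \quad m_{22} = 2 N_2 \left(1 - P(M \mid N \text{ and II})\right)$$

With these definitions, the allele counts for population I are arranged as follows (the counts $m_{ij}$ for population II are arranged in the same way):

| | Diseased | Normal |
|---|---|---|
| $M$ | $n_{11}$ | $n_{12}$ |
| $\bar{M}$ | $n_{21}$ | $n_{22}$ |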
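As an illustration, the expected counts and the statistic can be computed directly from the two conditional marker frequencies. The sketch below is a minimal Python version; the function names `allele_counts` and `chi_square_2x2` and the example frequencies 0.55 and 0.45 are illustrative assumptions, not taken from the authors' program.

```python
def allele_counts(p_m_cases, p_m_controls, n_cases, n_controls):
    """Expected allele counts (n11, n12, n21, n22).

    n11, n21: chromosomes with M / without M among cases (2 * n_cases in total).
    n12, n22: chromosomes with M / without M among controls (2 * n_controls in total).
    """
    n11 = 2 * n_cases * p_m_cases
    n21 = 2 * n_cases * (1 - p_m_cases)
    n12 = 2 * n_controls * p_m_controls
    n22 = 2 * n_controls * (1 - p_m_controls)
    return n11, n12, n21, n22


def chi_square_2x2(n11, n12, n21, n22):
    """Chi-square statistic with one degree of freedom for a 2x2 table."""
    total = n11 + n12 + n21 + n22
    numerator = total * (n11 * n22 - n12 * n21) ** 2
    denominator = (n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22)
    return numerator / denominator


# Hypothetical marker frequencies: 0.55 among cases, 0.45 among controls.
x_small = chi_square_2x2(*allele_counts(0.55, 0.45, 100, 100))
x_large = chi_square_2x2(*allele_counts(0.55, 0.45, 200, 200))
print(x_small, x_large)
```

With fixed conditional frequencies the counts are proportional to the sample size, so the statistic grows linearly with it: doubling the numbers of cases and controls doubles the value (here from about 4 to about 8). This is one way to see why sample size matters for the false rate.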

Note that X 1 depends on P ( M | D and I ) and P ( M | N and I ) , and X 2 depends on P ( M | D and I I ) and P ( M | N and I I ) . These conditional probabilities P ( M | D and I ) , P ( M | N and I ) , P ( M | D and I I ) , and P ( M | N and I I ) depend on 17 parameters. We will give an explicit formula in (7)-(10) for calculating these conditional probabilities when given values of the parameters. The parameters are as follows:

1) P ( 1 ) is the proportion of subpopulation 1.

2) P ( 2 ) is the proportion of subpopulation 2.

3) P ( M | 1 ) is the frequency of marker allele M in subpopulation 1.

4) P ( M | 2 ) is the frequency of marker allele M in subpopulation 2.

5) P ( A | 1 ) is the frequency of disease allele A in subpopulation 1.

6) P ( A | 2 ) is the frequency of disease allele A in subpopulation 2.

7) n is a lower bound of the age of the latest mutant disease allele.

8) θ is the genetic distance between marker locus and the disease gene.

9) m = N 1 = N 2 is the number of cases, and it is also the number of controls.

10) ϕ 0 ( 1 ) is the likelihood of getting affected in subpopulation 1 given genotype A ¯ A ¯ .

11) ϕ 1 ( 1 ) is the likelihood of getting affected in subpopulation 1 given genotype A A ¯ .

12) ϕ 2 ( 1 ) is the likelihood of getting affected in subpopulation 1 given genotype AA.

13) ϕ 0 ( 2 ) is the likelihood of getting affected in subpopulation 2 given genotype A ¯ A ¯ .

14) ϕ 1 ( 2 ) is the likelihood of getting affected in subpopulation 2 given genotype A A ¯ .

15) ϕ 2 ( 2 ) is the likelihood of getting affected in subpopulation 2 given genotype AA.

16) P ( n ) ( M | A and 1 ) is the initial association between M and A in subpopulation 1, n generations ago.

17) P ( n ) ( M | A and 2 ) is the initial association between M and A in subpopulation 2, n generations ago.

Populations I and II have the same allele frequencies and the same penetrance:

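$$P(M \mid \mathrm{I}) = P(M \mid \mathrm{II}), \quad P(A \mid \mathrm{I}) = P(A \mid \mathrm{II}) \tag{1}$$

$$\phi_0(\mathrm{I}) = \phi_0(\mathrm{II}), \quad \phi_1(\mathrm{I}) = \phi_1(\mathrm{II}), \quad \phi_2(\mathrm{I}) = \phi_2(\mathrm{II}) \tag{2}$$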

We also assume that, n generations ago, populations I and II have the same initial association between the marker allele M and the disease allele A, which is

$$P^{(n)}(M \mid A \text{ and II}) = P^{(n)}(M \mid A \text{ and I}) = P(1)\, P^{(n)}(M \mid A \text{ and } 1) + P(2)\, P^{(n)}(M \mid A \text{ and } 2) \tag{3}$$

We now calculate the likelihood of a false conclusion caused by population stratification in different circumstances. We first choose the ranges for the parameters; each parameter is then chosen randomly within its range. For each set of parameters, we calculate the chi-square statistics $X_1$ and $X_2$ for the structured population I and the homogeneous population II. At the 5% level, if $X_1$ and $X_2$ lie on different sides of 3.8414, i.e. either $X_1 < 3.8414 < X_2$ or $X_2 < 3.8414 < X_1$, we call it a false conclusion (a false positive or a false negative). This means that, at the 5% level, treating the structured population as homogeneous (ignoring the subpopulation structure) leads to a wrong conclusion. We then record the percentage of false conclusions. We do the same at the 1% level, using 6.6345 instead of 3.8414. The ranges of the parameters are the following:

Circumstance 1.

1) 0 ≤ P ( 1 ) ≤ 1 , P ( 2 ) = 1 − P ( 1 ) .

2) 0.1 ≤ P ( M | 1 ) ≤ 0.9 , 0.1 ≤ P ( M | 2 ) ≤ 0.9 .

3) 0.05 ≤ P ( A | 1 ) ≤ 0.5 , 0.05 ≤ P ( A | 2 ) ≤ 0.5 .

4) 5 ≤ n ≤ 5000 .

5) 0 ≤ θ ≤ 100 (in cM).

6) m = 100 .

7) 0 ≤ ϕ 0 ( 1 ) ≤ 0.1 ≤ ϕ 1 ( 1 ) ≤ 0.3 ≤ ϕ 2 ( 1 ) ≤ 0.6 , 0 ≤ ϕ 0 ( 2 ) ≤ 0.1 ≤ ϕ 1 ( 2 ) ≤ 0.3 ≤ ϕ 2 ( 2 ) ≤ 0.6 .

8) 0 ≤ P ( n ) ( M | A and 1 ) ≤ 1 , 0 ≤ P ( n ) ( M | A and 2 ) ≤ 1 .

One million simulations have been run, and the rate of having different conclusions in populations I and II has been recorded, which is called the false rate.
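The sampling step and the decision rule described above could be sketched as follows. This is not the authors' program: uniform draws within each range, the integer draw for n, the dictionary keys, and the `statistics` callback (a hypothetical function that maps a parameter set to the pair of statistics for populations I and II) are all assumptions made for illustration.

```python
import random

# Critical values quoted in the text for the 5% and 1% levels.
CRIT_5_PERCENT = 3.8414
CRIT_1_PERCENT = 6.6345


def sample_circumstance_1(rng):
    """Draw one parameter set from the Circumstance 1 ranges.

    Uniform draws are an assumption; the text only says "randomly".
    """
    p1 = rng.uniform(0.0, 1.0)

    def penetrance():
        # 0 <= phi0 <= 0.1 <= phi1 <= 0.3 <= phi2 <= 0.6
        return (rng.uniform(0.0, 0.1), rng.uniform(0.1, 0.3), rng.uniform(0.3, 0.6))

    return {
        "P1": p1,
        "P2": 1.0 - p1,
        "PM1": rng.uniform(0.1, 0.9),
        "PM2": rng.uniform(0.1, 0.9),
        "PA1": rng.uniform(0.05, 0.5),
        "PA2": rng.uniform(0.05, 0.5),
        "n": rng.randint(5, 5000),            # generations
        "theta_cM": rng.uniform(0.0, 100.0),  # genetic distance in cM
        "m": 100,                             # number of cases = number of controls
        "phi_1": penetrance(),
        "phi_2": penetrance(),
        "M_given_A_1": rng.uniform(0.0, 1.0),
        "M_given_A_2": rng.uniform(0.0, 1.0),
    }


def is_false_conclusion(x1, x2, critical_value):
    """True when the two statistics fall on different sides of the critical value."""
    return x1 < critical_value < x2 or x2 < critical_value < x1


def false_rate(statistics, n_draws, critical_value, seed=0):
    """Fraction of random parameter sets that give opposite conclusions.

    `statistics` is a hypothetical callable: params -> (x1, x2).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        x1, x2 = statistics(sample_circumstance_1(rng))
        hits += is_false_conclusion(x1, x2, critical_value)
    return hits / n_draws
```

Passing the statistics as a callback keeps the sampling and the decision rule separate from the formula for the chi-square statistics, so other parameter ranges can be tried by swapping only the sampler.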

The false rate is 4.84% at the 5% significance level, 2.25% at the 1% level, and 0.93% at the 0.1% level. The simulations were also run ten million times, giving 4.82%, 2.27%, and 0.94%, respectively, so one million runs are accurate enough. The above ranges are so wide that, in a case-control study using 100 cases and 100 controls, the possibility of getting a false positive caused by ignoring unknown population structure can be considered small.
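A rough check on the Monte Carlo error, assuming the simulated parameter sets are independent so that the false rate behaves like a binomial proportion: with $N = 10^6$ runs and an estimated rate $\hat{p} = 0.0484$,

$$\mathrm{SE}(\hat{p}) = \sqrt{\frac{\hat{p}(1 - \hat{p})}{N}} = \sqrt{\frac{0.0484 \times 0.9516}{10^{6}}} \approx 2.1 \times 10^{-4},$$

that is, about 0.02 percentage points. This is the same size as the difference between the one-million and ten-million results at the 5% level (4.84% versus 4.82%).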

Next, we want to investigate the effect of each parameter on the false rate.

Note that in Circumstance 1, the maximum possible ratio of marker allele frequencies in two subpopulations is 9. From

The disease models and penetrance are difficult to estimate in practice. From

The genetic distance between the marker and the disease gene is of course unknown. From

The age of the latest disease mutation and the initial association between marker allele and the disease allele are hard to estimate. From

| Parameter ranges | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| $0.1 \le P(M \mid 1) \le 0.9$, $0.1 \le P(M \mid 2) \le 0.9$ | 4.84% | 2.25% | 0.93% |
| $0.05 \le P(M \mid 1) \le 0.95$, $0.05 \le P(M \mid 2) \le 0.95$ | 6.60% | 3.35% | 1.53% |
| $0.01 \le P(M \mid 1) \le 0.99$, $0.01 \le P(M \mid 2) \le 0.99$ | 8.18% | 4.42% | 2.15% |

| Parameter ranges | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| $0.1 \le P(A \mid 1) \le 0.5$, $0.1 \le P(A \mid 2) \le 0.5$ | 3.33% | 1.35% | 0.47% |
| $0.05 \le P(A \mid 1) \le 0.5$, $0.05 \le P(A \mid 2) \le 0.5$ | 4.84% | 2.25% | 0.93% |
| $0.01 \le P(A \mid 1) \le 0.5$, $0.01 \le P(A \mid 2) \le 0.5$ | 6.42% | 3.31% | 1.57% |

| Parameter ranges | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| $0.05 \le P(M \mid 1) \le 0.95$, $0.05 \le P(M \mid 2) \le 0.95$; $0.01 \le P(A \mid 1) \le 0.5$, $0.01 \le P(A \mid 2) \le 0.5$ | 8.43% | 4.68% | 2.39% |
| $0.01 \le P(M \mid 1) \le 0.99$, $0.01 \le P(M \mid 2) \le 0.99$; $0.01 \le P(A \mid 1) \le 0.5$, $0.01 \le P(A \mid 2) \le 0.5$ | 10.22% | 5.97% | 3.21% |

| Parameter ranges | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| $0 \le \phi_0(1) \le 0.1 \le \phi_1(1) \le 0.15 \le \phi_2(1) \le 0.3$, $0 \le \phi_0(2) \le 0.1 \le \phi_1(2) \le 0.15 \le \phi_2(2) \le 0.3$ | 2.63% | 1.13% | 0.43% |
| $0 \le \phi_0(1) \le 0.1 \le \phi_1(1) \le 0.3 \le \phi_2(1) \le 0.6$, $0 \le \phi_0(2) \le 0.1 \le \phi_1(2) \le 0.3 \le \phi_2(2) \le 0.6$ | 4.84% | 2.25% | 0.93% |
| $0 \le \phi_0(1) \le 0.1 \le \phi_1(1) \le 0.5 \le \phi_2(1) \le 1$, $0 \le \phi_0(2) \le 0.1 \le \phi_1(2) \le 0.5 \le \phi_2(2) \le 1$ | 8.08% | 4.29% | 2.02% |

| Genetic distance | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| $\theta = 0$ | 11.38% | 8.83% | 6.27% |
| $\theta = 1$ | 4.95% | 2.32% | 0.97% |
| $\theta = 100$ | 4.83% | 2.24% | 0.92% |

| | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| | 5.90% | 3.02% | 1.38% |
| | 4.89% | 2.29% | 0.95% |
| | 4.83% | 2.24% | 0.92% |

| | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| | 4.84% | 2.25% | 0.93% |
| | 4.83% | 2.25% | 0.92% |
| | 4.84% | 2.25% | 0.93% |
| | 4.84% | 2.25% | 0.93% |

The proportion of a subpopulation in the whole population is an important factor affecting the false rate. From

proportion changes from 10% to 50%. The worst case occurs when the whole population is a union of two equal parts. If only a small part of the sample is from a different population (for example 10%), the chance of having a false positive is small.

A surprising result comes from

We will give an explicit formula for allele frequencies among cases and controls in populations I and II. The frequencies of marker allele and disease allele in population I are

The penetrance in population I are
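One way to compute these pooled quantities is sketched below, under the assumptions that each subpopulation is in Hardy-Weinberg equilibrium and that the penetrance in population I is read as the probability of disease given the genotype in the pooled population. The helper names are hypothetical, and this reading is an assumption rather than a reproduction of the authors' formulas.

```python
def pooled_frequency(p1, freq_1, freq_2):
    """Allele frequency in population I as a mixture of subpopulations 1 and 2."""
    return p1 * freq_1 + (1 - p1) * freq_2


def pooled_penetrance(p1, pa_1, pa_2, phi_1, phi_2):
    """Penetrance (phi0, phi1, phi2) in population I.

    phi_k = (phi0, phi1, phi2) is the penetrance in subpopulation k.
    Each pooled value is P(D | genotype and I), obtained by weighting the
    subpopulation penetrance by how often that genotype comes from each
    subpopulation. Requires 0 < pa_1, pa_2 < 1 so that no weight sum is zero.
    """
    def genotype_frequencies(pa):
        # Hardy-Weinberg frequencies of (normal/normal, A/normal, A/A).
        return ((1 - pa) ** 2, 2 * pa * (1 - pa), pa ** 2)

    g1 = genotype_frequencies(pa_1)
    g2 = genotype_frequencies(pa_2)
    pooled = []
    for i in range(3):
        w1 = p1 * g1[i]
        w2 = (1 - p1) * g2[i]
        pooled.append((w1 * phi_1[i] + w2 * phi_2[i]) / (w1 + w2))
    return tuple(pooled)


# Hypothetical inputs: 30% of population I belongs to subpopulation 1.
p_m_pooled = pooled_frequency(0.3, 0.2, 0.8)
phi_pooled = pooled_penetrance(0.3, 0.1, 0.4, (0.05, 0.2, 0.5), (0.02, 0.1, 0.3))
```

When both subpopulations share the same penetrance, the pooled penetrance reduces to that common value, whatever the mixing proportion.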

The two subpopulations 1 and 2 and population II are assumed to be homogeneous, so Hardy-Weinberg equilibrium holds in each of them. The disease prevalence in these populations can be calculated as follows:

| | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| | 1.20% | 0.51% | 0.20% |
| | 5.23% | 2.37% | 0.98% |
| | 8.88% | 4.36% | 1.84% |

| | 5% level | 1% level | 0.1% level |
|---|---|---|---|
| | 1.77% | 0.60% | 0.16% |
| | 4.84% | 2.25% | 0.93% |
| | 9.91% | 5.75% | 3.06% |
| | 16.78% | 11.22% | 7.15% |

Since population I is not homogeneous, Hardy-Weinberg equilibrium does not hold. In particular, the disease prevalence in population I cannot be calculated as above.
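For reference, with the shorthand $p_k = P(A \mid k)$ (introduced here for brevity), the Hardy-Weinberg prevalence of a homogeneous subpopulation takes the standard form

$$P(D \mid k) = \phi_0(k)\,(1 - p_k)^2 + 2\,\phi_1(k)\,p_k (1 - p_k) + \phi_2(k)\,p_k^2, \quad k = 1, 2,$$

while the prevalence of the structured population follows from the law of total probability rather than from the Hardy-Weinberg formula applied to pooled frequencies:

$$P(D \mid \mathrm{I}) = P(1)\,P(D \mid 1) + P(2)\,P(D \mid 2).$$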

Next, we will calculate the frequency of marker allele M among cases in a homogeneous population, for example subpopulations 1 and 2, and population II. Since the argument holds for all three populations, we will not specify the population. Let, , , and be the penetrance and disease prevalence in the population. We assume Hardy-Weinberg equilibrium in the population. Let and be the genotype frequencies among diseased individuals. It is clear that

We now consider an ordered pair of haplotypes. Let be the probability of a person having an ordered pair of haplotypes. Let be the frequency of the haplotype. We then have

Similarly,

Then

where

Next, we calculate. It is easy to see that, , , and. Replacing, b, and c by, , and, we have

We will calculate the frequency of haplotype in a homogeneous population. Let be the linkage-disequilibrium (LD) between the disease locus and the marker locus n generations ago, where is the haplotype frequency of n generations ago. From standard genetic theory (Equation (1.10) of Hartl [

where is the recombination fraction between the disease locus and the marker locus. Thus,

We then have

Substituting (6) into (4) and (5) yields the frequencies of allele M among cases and controls in population II:

where
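The calculation for a homogeneous population can be sketched numerically as follows. The sketch assumes Hardy-Weinberg equilibrium, allele frequencies that stay constant over the n generations, LD that decays by a factor of (1 − θ) per generation, and θ expressed as a recombination fraction (the conversion from centimorgans is not specified here). Function and variable names are illustrative, and the code is a derivation under these assumptions rather than a transcription of the authors' equations.

```python
def current_haplotype_frequency(p_a, p_m, m_given_a_initial, theta, n):
    """Present-day frequency of the haplotype carrying both M and A.

    The initial LD is P(A) * (P_n(M | A) - P(M)); it shrinks by (1 - theta)
    in each of the n generations.
    """
    ld_initial = p_a * m_given_a_initial - p_a * p_m
    return p_a * p_m + (1 - theta) ** n * ld_initial


def marker_frequency_cases_controls(p_a, p_m, hap_ma, phi):
    """Return (P(M | D), P(M | N), P(D)) in a homogeneous population.

    phi = (phi0, phi1, phi2). Independently chosen inputs can imply an
    infeasible haplotype frequency (hap_ma > p_m); this is not checked here.
    """
    phi0, phi1, phi2 = phi
    # Disease risk of a person, given the disease-locus allele on one chromosome;
    # the other chromosome carries A with probability p_a.
    risk_with_a = p_a * phi2 + (1 - p_a) * phi1
    risk_without_a = p_a * phi1 + (1 - p_a) * phi0
    prevalence = p_a * risk_with_a + (1 - p_a) * risk_without_a
    hap_m_normal = p_m - hap_ma  # haplotype with M and the normal allele
    joint_m_d = hap_ma * risk_with_a + hap_m_normal * risk_without_a
    p_m_cases = joint_m_d / prevalence
    p_m_controls = (p_m - joint_m_d) / (1 - prevalence)
    return p_m_cases, p_m_controls, prevalence


# Hypothetical example: complete initial association and no recombination.
hap = current_haplotype_frequency(p_a=0.2, p_m=0.5, m_given_a_initial=1.0, theta=0.0, n=50)
cases, controls, prevalence = marker_frequency_cases_controls(0.2, 0.5, hap, (0.05, 0.2, 0.5))
```

In this example the marker allele is enriched among cases and depleted among controls. When the initial association equals the marker frequency, so that there is no LD, both conditional frequencies reduce to P(M).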

We now calculate the frequency of allele M among cases and controls in population I (the structured population). Since population I is not homogeneous, Hardy-Weinberg equilibrium does not hold. We cannot use the above formula. Instead we have the following:

Note that

We then have

where
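For the structured population, the mixing step can be sketched with Bayes' rule: cases from subpopulation k are weighted by P(k)P(D | k) and controls by P(k)(1 − P(D | k)). The function name and the numerical inputs below are illustrative assumptions. In the example neither subpopulation has any association (the marker frequency is the same in cases and controls within each subpopulation), yet the pooled frequencies differ.

```python
def marker_frequency_structured(p1, prevalence, p_m_cases, p_m_controls):
    """Return (P(M | D and I), P(M | N and I)) for a two-subpopulation mixture.

    prevalence, p_m_cases and p_m_controls are pairs (subpopulation 1, subpopulation 2).
    """
    proportions = (p1, 1 - p1)
    # Share of each subpopulation among cases and among controls (unnormalized).
    case_weights = [w * k for w, k in zip(proportions, prevalence)]
    control_weights = [w * (1 - k) for w, k in zip(proportions, prevalence)]
    cases = sum(w * f for w, f in zip(case_weights, p_m_cases)) / sum(case_weights)
    controls = sum(w * f for w, f in zip(control_weights, p_m_controls)) / sum(control_weights)
    return cases, controls


# Hypothetical inputs: equal-sized subpopulations with different prevalence
# (0.1 and 0.3) and different marker frequencies (0.2 and 0.8), and no
# within-subpopulation association.
cases_i, controls_i = marker_frequency_structured(
    0.5, prevalence=(0.1, 0.3), p_m_cases=(0.2, 0.8), p_m_controls=(0.2, 0.8)
)
print(cases_i, controls_i)
```

Here the marker frequency is about 0.65 among cases and about 0.46 among controls, a difference produced entirely by the population structure. This is the kind of spurious signal that the comparison with the homogeneous population II is designed to detect.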

We provide a formula for calculating the likelihood of false positives caused by population stratification given the ranges of the parameters, and we have implemented it in a computer program. Tables 2-10 show that, without any knowledge about the structure of the population (i.e. when each parameter has a wide range of possibilities), the chance of getting a false positive from ignoring the population structure is small. Sample size has a significant effect on the likelihood of false positives caused by population stratification: the larger the sample, the more likely a false positive is if the population structure is ignored. For small samples (the total number of cases and controls is smaller than 200), the chance of a false positive when unknown population structure is ignored is less than 5%. We suggest using sample size as a factor in choosing the study design (case-control or family-based): if budget constraints keep the sample size below 200, a case-control study may be the better choice because of its power. Of course, cases and controls should be carefully matched. Even if some unknown population differences between cases and controls remain, the chance of a false positive caused by unknown population structure is less than 5%.

The authors declare no conflicts of interest regarding the publication of this paper.