Zero-inflated distributions are common in statistical problems where there is interest in testing homogeneity of two or more independent groups. Often, the underlying distribution that has an inflated number of zero-valued observations is asymmetric, and its functional form may not be known or easily characterized. In this case, comparisons of the groups in terms of their respective percentiles may be appropriate as these estimates are nonparametric and more robust to outliers and other irregularities. The median test is often used to compare distributions with similar but asymmetric shapes but may be uninformative when there are excess zeros or dissimilar shapes. For zero-inflated distributions, it is useful to compare the distributions with respect to their proportion of zeros, coupled with the comparison of percentile profiles for the observed non-zero values. A simple chi-square test for simultaneous testing of these two components is proposed, applicable to both continuous and discrete data. Results of simulation studies are reported to summarize empirical power under several scenarios. We give recommendations for the minimum sample size necessary to achieve suitable test performance in specific examples.
Zero-inflated distributions (a mixture of a point distribution and one or more non-zero distributions) are common in biomedical studies. Typically, there is interest in comparing groups that follow such distributions, as in testing the homogeneity of distributions across treatment groups or other populations of interest. In addition to excess zeros, analysis is made more difficult by the distribution of the non-zero data, which is most likely asymmetric or multimodal. Zhang et al. [
Zero-inflated distributions arise under a variety of circumstances. For example, some laboratory procedures may be unable to detect values below a certain threshold, and such values are recorded by default as zero, the lowest detection level, or some other value. In another scenario, subjects from a portion of the population do not exhibit symptoms (their responses are recorded as zero since their response cannot be measured) while others do have symptoms and some measurements are taken. Data may also be categorized, and categories may be combined to simplify the analysis. For instance, subjects below a certain age may be coded with a single number, which censors the observed data at a point and creates a point distribution at this value. Count data with excess zeros are one of the most prevalent examples of zero-inflated distributions, with zero-inflated Poisson and negative binomial models used extensively in a variety of fields [
Lachenbruch [
As an extension of the procedure outlined in [
The proposed general strategy for comparing percentile profiles [
Let
1) Combine the K samples and calculate the combined sample percentile estimate of
2) For each of the K samples, sort the observations into categories or bins with cutoffs based on the combined sample percentile estimates,
3) Arrange the categorized data from Step (2) in a K × (p + 1) contingency table where each row, respectively, consists of p + 1 sorted sets of observations for one of the samples.
4) Perform the test of homogeneity of the percentile profiles in terms of Pearson’s chi-square statistic with p(K - 1) degrees of freedom.
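The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the nearest-rank rule is an assumed percentile estimator (the text does not specify one), and bins are taken as right-closed intervals. The toy data are invented and far too small for the chi-square approximation to be trusted; they only show the mechanics.

```python
from bisect import bisect_left
from math import ceil


def combined_cutoffs(samples, profile):
    """Step 1: percentile estimates from the pooled sample.

    Assumption: nearest-rank estimator, i.e. the smallest pooled value
    with at least q% of the pooled observations at or below it.
    `profile` must be in ascending order.
    """
    pooled = sorted(x for s in samples for x in s)
    n = len(pooled)
    return [pooled[max(ceil(q * n / 100), 1) - 1] for q in profile]


def contingency_table(samples, cutoffs):
    """Steps 2-3: K x (p + 1) table with bins (-inf, c1], (c1, c2], ..., (cp, inf)."""
    table = []
    for s in samples:
        row = [0] * (len(cutoffs) + 1)
        for x in s:
            # bisect_left gives the index of the first cutoff >= x
            row[bisect_left(cutoffs, x)] += 1
        table.append(row)
    return table


def pearson_chi2(table):
    """Step 4: Pearson chi-square statistic with p(K - 1) degrees of freedom.

    Assumes every bin (column) contains at least one observation.
    """
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_tot) - 1)
    return stat, df


# Invented toy data: K = 2 samples, profile of p = 3 percentiles.
samples = [
    [1, 2, 3, 4, 5, 6, 7, 8],
    [5, 6, 7, 8, 9, 10, 11, 12],
]
profile = [25, 50, 75]
cutoffs = combined_cutoffs(samples, profile)
table = contingency_table(samples, cutoffs)
stat, df = pearson_chi2(table)
print(cutoffs, table, stat, df)
```

The statistic would then be referred to a chi-square distribution with `df` degrees of freedom.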
To illustrate, suppose we are given three samples and we wish to test H0: Q1 = Q2 = Q3 where the percentile profile Qh = (25, 50, 75), h = 1, 2, 3. Further, suppose estimates of these percentiles were found to be q1 = (4.6, 6.4, 10.4), q2 = (6.2, 8.3, 12.3), and q3 = (4.8, 7.9, 10.6), respectively, in samples from the three distributions. The combined sample estimates of these percentiles are
Let D be a zero-inflated distribution;
Consider, for example, testing the equality of groups, such as race/gender cohorts, with respect to their degree of tobacco smoke exposure assessed in terms of the tobacco biomarker cotinine (details later). Within the groups, there may be some people with undetectable or nonexistent levels of cotinine; we will consider these to be non-smokers who have not been exposed to measurable amounts of secondhand smoke. Mixed with this unexposed population (within the groups) are those who currently smoke, have a history of smoking, or have been exposed to measurable amounts of secondhand smoke and thus have cotinine levels above the detection limit. We will consider these people to be exposed to smoking, either directly or through people around them smoking. It is informative to test the equality of groups comprised of mixtures of exposed and unexposed populations with respect to both the proportion unexposed and the percentile profiles that reflect the severity of exposure in each group.
Sample | Bin 1 | Bin 2 | Bin 3 | Bin 4 | Total |
---|---|---|---|---|---|
1 | 66 | 59 | 48 | 47 | 220 |
2 | 35 | 50 | 49 | 62 | 196 |
3 | 55 | 47 | 59 | 47 | 208 |
Total | 156 | 156 | 156 | 156 | 624 |
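As a worked check, the Pearson statistic for the three-sample, four-bin table of counts above can be computed by hand. This is an illustration derived from the tabulated counts, not a result quoted from the text. Here O_{ij} and E_{ij} denote observed and expected counts, n_{i·} and n_{·j} the row and column totals, and N = 624 the grand total; this notation is introduced only for the example.

```latex
E_{ij} = \frac{n_{i\cdot}\, n_{\cdot j}}{N}:\qquad
E_{1j} = \frac{220 \times 156}{624} = 55,\quad
E_{2j} = \frac{196 \times 156}{624} = 49,\quad
E_{3j} = \frac{208 \times 156}{624} = 52

\chi^2 = \sum_{i=1}^{3}\sum_{j=1}^{4} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}
       = \frac{121 + 16 + 49 + 64}{55} + \frac{196 + 1 + 0 + 169}{49} + \frac{9 + 25 + 49 + 25}{52}
       = \frac{250}{55} + \frac{366}{49} + \frac{108}{52} \approx 14.09

df = p(K - 1) = 3 \times (3 - 1) = 6
```

Because every column total equals 156, the expected count is constant within each row, which keeps the arithmetic short.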
The general percentile test outlined in the previous section can be used to simultaneously test the equality of proportions of zeros (non-smokers) in addition to the equality of the percentile profiles (distributions of smokers) across populations. Let Xi denote the observed cotinine assessment for a person who is randomly selected from the ith population, and let πi (i = 1, …, K) be the proportion of zeros in the ith population. Then P[Xi ≤ 0] = πi, and the value of any percentile less than 100 × πi is 0.
Thus, 0 could be considered another percentile estimate: the sample estimate of the (100 × πi)th percentile. Since any percentile less than 100 × πi is equal to 0, we can select an arbitrary percentile such that each population’s estimate is 0. However, because we use the combined sample percentile estimate to create bins for the contingency table, we must select a percentile less than the combined sample proportion of zeros, denoted
Suppose we wish to test the equality of a percentile profile Q across K populations with the proportion of zeros in the combined population equal to
For example, suppose the data used to create
As previously mentioned the point distribution of interest is not restricted to 0 but can be any value that is the minimum of the data. Also, this procedure is applicable to point distributions at the maximum value. Instead of adding a percentile near 0 to Q, an arbitrary percentile close to 100 can be chosen, such as 99. If
Sample | Bin 1 | Bin 2 | Bin 3 | Bin 4 | Bin 5 | Total |
---|---|---|---|---|---|---|
1 | 15 | 51 | 59 | 48 | 47 | 220 |
2 | 28 | 7 | 50 | 49 | 62 | 196 |
3 | 42 | 13 | 47 | 59 | 47 | 208 |
Total | 85 | 71 | 156 | 156 | 156 | 624 |
could test homogeneity of proportions of both point distributions as well as the percentile profile with Q = (1, Q1, …, Qp, 99).
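A small sketch may help show how adding a low percentile to the profile isolates the zeros in the first bin. It is an illustration under stated assumptions rather than the authors' code: the function names and data are invented, and the nearest-rank percentile estimator is an assumption. The check inside `augmented_cutoffs` enforces the requirement described above, that the added percentile be smaller than the combined sample percentage of zeros.

```python
from math import ceil


def augmented_cutoffs(samples, profile, low=1):
    """Prepend a low percentile so that the first cutoff is the point mass at 0."""
    pooled = sorted(x for s in samples for x in s)
    n = len(pooled)
    zero_share = 100 * sum(1 for x in pooled if x <= 0) / n
    if not low < zero_share:
        raise ValueError("added percentile must be below the combined percentage of zeros")
    full_profile = [low] + list(profile)
    # Assumed estimator: nearest-rank percentile of the pooled sample.
    cutoffs = [pooled[max(ceil(q * n / 100), 1) - 1] for q in full_profile]
    return full_profile, cutoffs


def count_bins(sample, cutoffs):
    """Counts in bins (-inf, c1], (c1, c2], ..., (cp, inf)."""
    counts = [0] * (len(cutoffs) + 1)
    for x in sample:
        # The bin index is the number of cutoffs strictly below x.
        counts[sum(1 for c in cutoffs if c < x)] += 1
    return counts


# Invented data: 30% zeros in group_a, 10% zeros in group_b.
group_a = [0, 0, 0, 1.2, 2.5, 3.1, 4.8, 6.0, 7.7, 9.4]
group_b = [0, 2.0, 3.6, 4.1, 5.5, 6.9, 8.2, 9.9, 11.3, 12.8]

full_profile, cutoffs = augmented_cutoffs([group_a, group_b], [50, 75])
row_a = count_bins(group_a, cutoffs)
row_b = count_bins(group_b, cutoffs)
print(full_profile, cutoffs, row_a, row_b)
```

The first entry of each row is that group's number of zeros, so the resulting table carries the comparison of zero proportions alongside the comparison of the non-zero percentile bins.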
Some asymptotic properties of the percentile test with zero-inflated distributions were investigated for both continuous and discrete data. Power simulations were conducted for several scenarios by varying the proportion of zeros and/or the properties of the non-zero distributions. For this paper, point distributions at zero mixed with non-zero gamma distributions were considered for the continuous case (
Sample Size (n = m) | π1 | π2 | Gamma parameters α1 = 2, β1 = 2 | Gamma parameters α1 = 2.2, β1 = 2.2 | Gamma parameters α1 = 2.3, β1 = 2.3 | Gamma parameters α1 = 2.4, β1 = 2.4 |
---|---|---|---|---|---|---|
50 | 0.1 | 0.1 | 0.0467 | 0.1301 | 0.2536 | 0.4214 |
50 | 0.1 | 0.2 | 0.1605 | 0.2575 | 0.3918 | 0.5512 |
50 | 0.1 | 0.3 | 0.4914 | 0.5865 | 0.6786 | 0.7845 |
100 | 0.1 | 0.1 | 0.0511 | 0.2349 | 0.4961 | 0.7627 |
100 | 0.1 | 0.2 | 0.3089 | 0.5310 | 0.7169 | 0.8843 |
100 | 0.1 | 0.3 | 0.8419 | 0.9103 | 0.9582 | 0.9832 |
200 | 0.1 | 0.1 | 0.0500 | 0.4684 | 0.8387 | 0.9785 |
200 | 0.1 | 0.2 | 0.5975 | 0.8551 | 0.9672 | 0.9969 |
200 | 0.1 | 0.3 | 0.9938 | 0.9982 | 0.9998 | 1.0000 |
500 | 0.1 | 0.1 | 0.0527 | 0.8954 | 0.9987 | 1.0000 |
500 | 0.1 | 0.2 | 0.9619 | 0.9994 | 1.0000 | 1.0000 |
500 | 0.1 | 0.3 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
50 | 0.2 | 0.2 | 0.0476 | 0.1212 | 0.2326 | 0.3845 |
50 | 0.2 | 0.3 | 0.1211 | 0.2033 | 0.3123 | 0.4592 |
50 | 0.2 | 0.4 | 0.3719 | 0.4598 | 0.5620 | 0.6719 |
100 | 0.2 | 0.2 | 0.0493 | 0.2140 | 0.4599 | 0.7229 |
100 | 0.2 | 0.3 | 0.2178 | 0.4140 | 0.6173 | 0.8144 |
100 | 0.2 | 0.4 | 0.7015 | 0.8032 | 0.8920 | 0.9539 |
200 | 0.2 | 0.2 | 0.0508 | 0.4294 | 0.7991 | 0.9691 |
200 | 0.2 | 0.3 | 0.4218 | 0.7373 | 0.9243 | 0.9893 |
200 | 0.2 | 0.4 | 0.9585 | 0.9878 | 0.9967 | 0.9997 |
500 | 0.2 | 0.2 | 0.0503 | 0.8592 | 0.9971 | 1.0000 |
500 | 0.2 | 0.3 | 0.8459 | 0.9932 | 1.0000 | 1.0000 |
500 | 0.2 | 0.4 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Sample Size (n = m) | π1 | π2 | Poisson parameters λ1 = 5, λ2 = 5 | Poisson parameters λ1 = 5.5, λ2 = 5 | Poisson parameters λ1 = 6, λ2 = 5 | Poisson parameters λ1 = 6.5, λ2 = 5 |
---|---|---|---|---|---|---|
50 | 0.1 | 0.1 | 0.0465 | 0.0876 | 0.2476 | 0.5205 |
50 | 0.1 | 0.2 | 0.1561 | 0.2080 | 0.3850 | 0.6307 |
50 | 0.1 | 0.3 | 0.4850 | 0.5472 | 0.6791 | 0.8342 |
100 | 0.1 | 0.1 | 0.0510 | 0.1489 | 0.5009 | 0.8671 |
100 | 0.1 | 0.2 | 0.2944 | 0.4472 | 0.7256 | 0.9458 |
100 | 0.1 | 0.3 | 0.8241 | 0.8839 | 0.9578 | 0.9937 |
200 | 0.1 | 0.1 | 0.0520 | 0.2588 | 0.8443 | 0.9960 |
200 | 0.1 | 0.2 | 0.5757 | 0.7803 | 0.9692 | 0.9995 |
200 | 0.1 | 0.3 | 0.9911 | 0.9974 | 0.9996 | 1.0000 |
500 | 0.1 | 0.1 | 0.0476 | 0.6354 | 0.9986 | 1.0000 |
500 | 0.1 | 0.2 | 0.9549 | 0.9954 | 1.0000 | 1.0000 |
500 | 0.1 | 0.3 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
50 | 0.2 | 0.2 | 0.0496 | 0.0867 | 0.2345 | 0.4736 |
50 | 0.2 | 0.3 | 0.1187 | 0.1755 | 0.3174 | 0.5459 |
50 | 0.2 | 0.4 | 0.3757 | 0.4284 | 0.5616 | 0.7326 |
100 | 0.2 | 0.2 | 0.0504 | 0.1426 | 0.4624 | 0.8281 |
100 | 0.2 | 0.3 | 0.2111 | 0.3362 | 0.6256 | 0.8935 |
100 | 0.2 | 0.4 | 0.6941 | 0.7737 | 0.8940 | 0.9763 |
200 | 0.2 | 0.2 | 0.0502 | 0.2526 | 0.7978 | 0.9910 |
200 | 0.2 | 0.3 | 0.4098 | 0.6332 | 0.9326 | 0.9972 |
200 | 0.2 | 0.4 | 0.9596 | 0.9771 | 0.9978 | 1.0000 |
500 | 0.2 | 0.2 | 0.0501 | 0.6168 | 0.9978 | 1.0000 |
500 | 0.2 | 0.3 | 0.8422 | 0.9719 | 1.0000 | 1.0000 |
500 | 0.2 | 0.4 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
Because the problem is multivariate, finding an exact probability for the test or expressing a single measure of difference between distributions is a challenge. The power of the test is a complex function of the difference in probability of observing a zero, combined with the probabilities of the non-zero values being placed in the particular bins given the probability of a certain proportion of zeros. Furthermore, the choice of percentiles affects the power of the test, as the relative proportions of the bins affect the chi-square test. However, we attempt to relate the power of the test to certain characteristics of the data and choice of percentiles in situations that may be common in applications. All power estimates are based on 10,000 replicate samples, and all procedures were programmed and carried out with R 3.1.2.
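The kind of power simulation described here can be sketched as follows. This is not the authors' R code: it is a Python illustration with hypothetical function names, far fewer replicates than the 10,000 used in the paper, an assumed nearest-rank percentile estimator, and an assumed shape/scale reading of the gamma parameters (Python's `random.gammavariate(alpha, beta)` takes shape and scale, and the text does not say which parameterization was used). The profile (1, 50, 75, 90) is the one used for the gamma simulations, giving 4 degrees of freedom for two groups; 9.488 is the upper 5% point of that chi-square distribution.

```python
import random
from math import ceil

CRITICAL_DF4 = 9.488  # upper 5% point of chi-square with 4 degrees of freedom


def zero_inflated_gamma(n, pi_zero, shape, scale, rng):
    """Draw n values: 0 with probability pi_zero, otherwise gamma(shape, scale)."""
    return [0.0 if rng.random() < pi_zero else rng.gammavariate(shape, scale)
            for _ in range(n)]


def percentile_test_statistic(samples, profile):
    """Pearson chi-square for the table binned at pooled nearest-rank percentiles."""
    pooled = sorted(x for s in samples for x in s)
    n = len(pooled)
    cutoffs = [pooled[max(ceil(q * n / 100), 1) - 1] for q in profile]
    table = []
    for s in samples:
        row = [0] * (len(cutoffs) + 1)
        for x in s:
            row[sum(1 for c in cutoffs if c < x)] += 1  # right-closed bins
        table.append(row)
    col_tot = [sum(c) for c in zip(*table)]
    stat = 0.0
    for row in table:
        row_tot = sum(row)
        for obs, col in zip(row, col_tot):
            if col:  # skip an empty bin rather than divide by zero
                expected = row_tot * col / n
                stat += (obs - expected) ** 2 / expected
    return stat


def empirical_power(n, pi1, pi2, gamma1, gamma2, reps=500, seed=1):
    """Share of replicates in which the two-group test rejects at the 5% level."""
    rng = random.Random(seed)
    profile = (1, 50, 75, 90)
    rejections = 0
    for _ in range(reps):
        g1 = zero_inflated_gamma(n, pi1, gamma1[0], gamma1[1], rng)
        g2 = zero_inflated_gamma(n, pi2, gamma2[0], gamma2[1], rng)
        if percentile_test_statistic([g1, g2], profile) > CRITICAL_DF4:
            rejections += 1
    return rejections / reps


# Null scenario (identical groups) and one alternative (different zero proportions).
size = empirical_power(100, 0.1, 0.1, (2, 2), (2, 2))
power = empirical_power(100, 0.1, 0.3, (2, 2), (2, 2))
print(f"empirical size {size:.3f}, empirical power {power:.3f}")
```

With only 500 replicates the estimates are noisy, so they should be read as rough counterparts of the tabulated values rather than reproductions of them.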
The results of the simulations using data generated from a mixture of a gamma distribution with zeros and testing the profile Q = (1, 50, 75, 90) are presented in
For example, for any sample size with π1 = π2, α1 ≠ α2, and β1 ≠ β2, power is greater when the overall proportion of zeros is lower. In these cases, the first bin, which contains all the zero observations, has equal probability for both samples, but the profiles of the remaining bins are unequal. Thus, scenarios with larger bins corresponding to unequal profiles have the greatest power. Simulations where π1 = π2 = 0.1 always have greater power than those where π1 = π2 = 0.2 because the bins with unequal profiles have larger relative proportions within the contingency table. Similarly, for any sample size with π1 = π2/2, α1 = α2, and β1 = β2 (the ratio of zeros and the non-zero distribution are both held constant), power is greater when the overall proportion of zeros is greater. In cases where π1 ≠ π2, α1 ≠ α2, and β1 ≠ β2, the relationship remains but is complicated by differences in the row profiles corresponding to each bin. If we examine situations where the ratio of π1 to π2 is constant (π1 = π2/2) with unequal non-zero profiles, we still observe greater power with greater
The results of simulations with zero-inflated Poisson distributions (
Urinary triclosan data from the 2011-2012 National Health and Nutrition Examination Survey (NHANES) were used to illustrate the use of the percentile test with zero-inflated distributions. Specifically, we examined the measurements of adult, non-Hispanic white and black participants between the ages of 18 and 79. Triclosan is a broad-spectrum phenolic biocide used in toothpastes, cleaning supplies, and personal-care products. Its use in consumer products has recently been investigated due to potential safety concerns. In experimental animal models, triclosan has been reported to alter hormones [
The lower detection limit (LDL) of urinary triclosan for this laboratory method is 2.3 nanograms per milliliter (ng/ml). Per NHANES procedures, any measurement less than the LDL is replaced with an imputed value of the LDL divided by the square root of two. For urinary triclosan, this results in a point distribution at 1.63 mixed with continuous values above the detection limit. Suppose one was interested in examining differences in the distribution of urinary triclosan between independent populations. The difference in the proportion of non-detectable measurements, coupled with differences in the detectable measurements, would be of interest. For illustrative purposes, consider testing the homogeneity of percentile profiles of independent groups: 1) black females and white females and 2) black males and white males. To test the homogeneity of the percentile profiles, one must follow the steps in Section 2 with the added percentile to account for “zeros”, which in this case is 1.63. Since triclosan is a potentially harmful substance, we are particularly interested in percentiles close to 100. We chose to test for homogeneity of the 1st, 50th, 60th, 70th, 80th, and 90th percentiles (with the 1st percentile used to test the proportion of observations below detection). The combined samples have roughly 25% undetectable measurements, so any percentile less than the 25th is adequate for Qz.
The contingency tables for comparing races with respect to gender-specific triclosan differences are shown in
If one compares the observed values with the expected (in parentheses), the differences between the groups are not as straightforward as a shift in the distribution. When dealing with large populations with diverse participants and behaviors, such as NHANES, multimodal distributions are expected; other tests may not detect the subtleties that such data often contain. With this method, the analyst can test for differences between groups with greater control and detect differences in specific regions of the distribution.
Females | Bin 1: ≤1.6 | Bin 2: (1.6, 7.2] | Bin 3: (7.2, 13.6] | Bin 4: (13.6, 23.9] | Bin 5: (23.9, 84.3] | Bin 6: (84.3, 258] | Bin 7: >258 |
---|---|---|---|---|---|---|---|
Black Females | 58 (63.3) | 58 (46.2) | 23 (21.4) | 17 (21.8) | 21 (21.8) | 15 (21.8) | 26 (21.8) |
White Females | 90 (84.7) | 50 (61.8) | 27 (28.6) | 34 (29.2) | 30 (29.2) | 36 (29.2) | 25 (29.2) |
Males | Bin 1: ≤1.6 | Bin 2: (1.6, 5.9] | Bin 3: (5.9, 9.5] | Bin 4: (9.5, 19.9] | Bin 5: (19.9, 53.4] | Bin 6: (53.4, 173] | Bin 7: >173 |
---|---|---|---|---|---|---|---|
Black Males | 68 (68.5) | 52 (53.6) | 24 (22.5) | 24 (24.3) | 33 (23.9) | 25 (23.9) | 15 (24.3) |
White Males | 84 (83.5) | 67 (65.4) | 26 (27.5) | 30 (29.7) | 20 (29.1) | 28 (29.1) | 39 (29.7) |
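One way to see which regions of the distribution drive a difference is to split the Pearson statistic into per-bin contributions. The sketch below does this for the observed male triclosan counts tabulated above. It is an illustrative decomposition rather than a step described in the text, and the function name is hypothetical.

```python
def bin_contributions(table):
    """Per-bin share of the Pearson chi-square statistic, summed over the rows."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    contributions = []
    for j, col in enumerate(col_tot):
        part = 0.0
        for i, row in enumerate(table):
            expected = row_tot[i] * col / total
            part += (row[j] - expected) ** 2 / expected
        contributions.append(part)
    return contributions


# Observed counts from the male triclosan table, bins 1 to 7.
males = [
    [68, 52, 24, 24, 33, 25, 15],  # Black males
    [84, 67, 26, 30, 20, 28, 39],  # White males
]
parts = bin_contributions(males)
# Bin numbers (1-based) ordered from largest to smallest contribution.
ranked = sorted(range(1, len(parts) + 1), key=lambda b: parts[b - 1], reverse=True)
for b in ranked:
    print(f"bin {b}: {parts[b - 1]:.3f}")
```

For these counts, nearly all of the statistic comes from bins 5 and 7, in the upper part of the distribution, while the bin of non-detectable values contributes almost nothing, which is the pattern visible when comparing observed and expected counts in the table.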
Serum cotinine data from the 2011-2012 NHANES were used as a second example to illustrate the procedure. Cotinine is the primary metabolite of nicotine and is currently regarded as the best biomarker of tobacco smoke exposure, for both active smoking as well as “passive smoking” [
Results of the percentile test indicate significant differences between percentile profiles for both within-gender comparisons. The profiles of black females and white females are significantly different (χ² = 75.9, df = 6, p < 0.0001), with the proportion of black females below the detection limit lower than expected and the proportion of white females higher than expected (
The cumulative distribution of log (serum cotinine) for black and white males is plotted in
Females | Bin 1: ≤0.011 | Bin 2: (0.011, 0.04] | Bin 3: (0.04, 0.09] | Bin 4: (0.09, 0.57] | Bin 5: (0.57, 68.9] | Bin 6: (68.9, 234] | Bin 7: >234 |
---|---|---|---|---|---|---|---|
Black Females | 134 (192) | 158 (152) | 86 (68) | 93 (69) | 93 (69) | 58 (69) | 64 (69) |
White Females | 291 (233) | 179 (185) | 65 (83) | 59 (83) | 59 (83) | 94 (83) | 88 (83) |
Males | Bin 1: ≤0.011 | Bin 2: (0.011, 0.16] | Bin 3: (0.16, 1.93] | Bin 4: (1.93, 84.5] | Bin 5: (84.5, 202] | Bin 6: (202, 313] | Bin 7: >313 |
---|---|---|---|---|---|---|---|
Black Males | 78 (123) | 201 (188) | 80 (62) | 81 (62) | 77 (62) | 47 (63) | 59 (62) |
White Males | 217 (172) | 249 (262) | 69 (87) | 68 (87) | 72 (87) | 104 (88) | 88 (86) |
Equivalently, it is the proportion of the cumulative distribution between these two values multiplied by the sample size. The width is the proportion of the sample between the vertical lines. Essentially, we are testing the equality of widths (the change in the cumulative distribution) between a set of combined sample percentile estimates.
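The female cotinine result reported above (χ² = 75.9, df = 6, p < 0.0001) can be checked directly from the observed counts in the female table. The following is a minimal sketch with hypothetical function names. The p-value helper uses the closed-form upper tail of the chi-square distribution that holds for even degrees of freedom, chosen here only to avoid a dependency on a statistics library.

```python
from math import exp, factorial


def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for a K x B table."""
    row_tot = [sum(r) for r in table]
    col_tot = [sum(c) for c in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_tot[i] * col_tot[j] / total
            stat += (obs - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_tot) - 1)
    return stat, df


def chi2_upper_tail_even_df(stat, df):
    """P(X > stat) for a chi-square variable with even df (closed-form sum)."""
    if df % 2:
        raise ValueError("this closed form only covers even degrees of freedom")
    half = stat / 2
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


# Observed counts from the female cotinine table, bins 1 to 7.
females = [
    [134, 158, 86, 93, 93, 58, 64],  # Black females
    [291, 179, 65, 59, 59, 94, 88],  # White females
]
stat, df = pearson_chi2(females)
p_value = chi2_upper_tail_even_df(stat, df)
print(round(stat, 1), df, p_value < 0.0001)  # prints: 75.9 6 True
```

The expected counts are computed from the row and column totals of the observed counts, so small differences from the rounded expected values shown in the table are to be anticipated.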
When dealing with zero-inflated data it is useful to compare both the proportion of zero values and the shape of the non-zero values, as measured by a percentile profile. Zero-inflated distributions are frequently encountered in biomedical studies which typically require some hypothesis testing of the equality of the distributions or equality of specific parameters, such as the median. With the proposed procedure, the analyst is able to simultaneously test for differences in the proportion of zeros, along with differences in any number of percentiles selected by the analyst. This flexibility allows the analyst to select the percentiles that best characterize the data and is especially useful for comparing asymmetric or multimodal distributions mixed with one or more point distributions. We find this procedure to be straightforward and easily implemented. It is also appropriate to use when the distributions have unusual shapes.
The proposed procedure has several other advantages when compared to other tests for homogeneity of zero-inflated distributions: 1) the test is nonparametric and can be used for any unimodal or multimodal continuous or discrete distribution; 2) the test allows for multiple point distributions of any value, not necessarily zero exclusively; and 3) it can be used to test for homogeneity of more than two groups simultaneously. As seen in the illustrative examples, the method can distinguish important differences between distributions of populations. For applications with large sample sizes, such as NHANES data, the procedure is particularly useful in testing equality of many percentiles simultaneously.
A limitation is that the test relies on large sample theory, and further study is needed to evaluate the severity of this restriction. Simulations show that empirical alpha is adequate by sample size 50 for comparisons of three percentiles in addition to the proportion of zeros; however, the minimum sample size required to achieve the desired alpha depends on the number and choice of percentiles. It is important to remember that there are more powerful tests of overall equality of distributions (Wilcoxon, KS test) or of specific changes in parameters (t-test, ANOVA). However, none of these tests is appropriate for identifying specific segments of distributions that are significantly different. Further research could be done on deriving a closed-form solution for the power of the percentile test for a given percentile profile based on the features of the samples, such as the probability of zeros, the sample size, and the underlying distributions.
This research is supported by 1 U54 GM104940 from the National Institute of General Medical Sciences of the NIH, which funds the Louisiana Clinical and Translational Science Center. William Johnson also receives support from the National Center For Complementary & Integrative Health and the Office of Dietary Supplements of the National Institutes of Health under Award Number 3 P50AT002776 which funds the Botanical Research Center of Pennington Biomedical Research Center and the Department of Plant Biology and Pathology in the School of Environmental and Biological Sciences (SEBS) of Rutgers University. The content of this manuscript is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
William D. Johnson, Jeffrey H. Burton, Robbie A. Beyl, Jacob E. Romer (2015) A Simple Chi-Square Statistic for Testing Homogeneity of Zero-Inflated Distributions. Open Journal of Statistics, 05, 483-493. doi: 10.4236/ojs.2015.56050