**Open Journal of Statistics**

Vol.07 No.02(2017), Article ID:75592,14 pages

10.4236/ojs.2017.72019

Estimation of Attributable Risk from Clustered Binary Data: The Case of Cross-Sectional and Cohort Studies

Mohamed Shoukri^{1*}, Allan Donner^{2,3}, Futwan Al-Mohanna^{1 }

^{1}Department of Cell Biology, Research Center King Faisal Specialist Hospital & Research Center, Riyadh, KSA

^{2}Department of Epidemiology and Biostatistics, Schulich School of Medicine and Dentistry, University of Western Ontario, London, Ontario, Canada

^{3}Robarts Clinical Trials, Robarts Research Institute, London, Ontario, Canada

Copyright © 2017 by authors and Scientific Research Publishing Inc.

This work is licensed under the Creative Commons Attribution International License (CC BY 4.0).

http://creativecommons.org/licenses/by/4.0/

Received: March 2, 2017; Accepted: April 21, 2017; Published: April 24, 2017

ABSTRACT

Effect sizes are estimated from several study designs when the subjects are individually sampled. When the samples are the aggregate cluster of individuals, the within cluster correlation must be accounted for to construct correct confidence intervals, and to conduct valid statistical inference. The purpose of this article is to propose and evaluate statistical procedures for the estimation of the variance of the estimated attributable risk in parallel groups of clusters, and in a design dividing each of k clusters into two segments creating multiple sub-clusters. The estimated variance is the first order approximation and is obtained by the delta method. We apply the methodology and propose a Wald type confidence interval on the difference between two correlated attributable risks. We also construct a test on the hypothesis of equality of two correlated attributable risks. We evaluate the power of the proposed test via Monte-Carlo simulations.

**Keywords:**

Correlated Binary Responses, Effect Size, Split-Cluster Design, Correlated Attributable Risks, Confidence Intervals, Monte-Carlo Simulations

1. Introduction

In the epidemiological research, it is important that the collected data are translated into interpretable results which can be easily communicated to clinicians. The need for “translatable” evidence from research studies is of prime importance in the evaluation of clinical interventions, because they hold the potential to immediately influence the course of patient treatment. When evaluating these studies, the examination of “Effect Size” or (ES) can be a useful measure of the comparative efficacy of the treatment under investigation. In randomized clinical trials, an effect size estimate quantifies the direction and magnitude of an effect of an intervention.

When exposure and disease risk are measured on a binary scale, several measures of effect size are in current use [1] . The odds ratio (OR), the relative risk (RR), and the population attributable risk (AR) are the most commonly used measures of effect size in clinical as well as analytic epidemiology.

The concept of AR was introduced in [2] and is a widely used measure of the amount of disease that can be attributed to a specific risk factor. The AR combines the relative risk (RR) and the prevalence of exposure $P\left(E\right)$ to measure the public health burden of a risk factor by estimating the proportion of cases of a disease that would not have occurred if we remove the risk factor.

The concept of AR and its statistical characteristics have been reviewed in [3] and in several publications [4] [5] . Statistical inferences on AR require the availability of data from subjects randomly assigned to intervention groups. However, when the sampling strategy involves aggregate or clusters of individuals, adjusting for the effect of intracluster correlation is essential in order to conduct valid statistical inferences [6] and the references therein. However, the statistical pro- perties of estimators of AR when clusters are sampled have not yet been fully explored. The fundamental objective of our work is to fill the gap of performing statistical inference on AR under the clustered binary data situation.

In this paper, we obtain the variance of the estimated AR under cluster sampling, focusing on cohort and cross-sectional designs. In Section 2, we construct an AR estimator, and in Section 3, we derive its large sample variance adjusted for the intracluster correlation (ICC). In Section 4, we consider the split cluster design, and describe situations where we compare two correlated AR parameters. In Section 5, we conduct a Monte-Carlo experiment to evaluate the empirical power of Wald’s test on the null hypothesis of equality of two correlated attri- butable risk parameters. At the end of each section, we provide an example.

2. AR from Cluster Sampling

We start with a parallel group design where k clusters are exposed to a specified risk factor, and l clusters are not exposed, as in the data layout given in Table 1.

In Table 1, we assume that
$k$
clusters have been selected at random from a well-defined population of exposed individuals, where the
${j}^{th}$
cluster has
${n}_{j}$
units. All individuals in this sample are assumed to be exposed to the risk factor. Let
${x}_{ij}\left(i=1,2,\cdots k,j=1,2,\cdots {n}_{i}\right)$
with
${x}_{ij}=1$
and 0, denoting positive and ne- gative responses corresponding to the presence of the exposure with
${\pi}_{i}={P}_{r}\left[{x}_{ij}=1|\text{exposedcluster}i\right]$
. Similarly we assume that
$l$
clusters have been selected from the population of unexposed individuals, where the
${l}^{th}$
^{ }cluster has
${m}_{l}$
units. All units in the clusters can serve as controls assuming the

Table 1. Typical data layout for clustered data in two groups: exposed and unexposed.

absence of exposure. In the unexposed clusters, let ${y}_{rs}\left(r=1,2,\cdots l,s=1,2,\cdots {m}_{r}\right)$ with ${y}_{rs}=1$ and 0 denote positive and negative responses with ${Q}_{r}={P}_{r}\left[{y}_{rs}=1|\text{unexposedcluster}r\right]$ . Furthermore, let ${X}_{i}={\displaystyle {\sum}_{j=1}^{{n}_{i}}{x}_{ij}}$ and ${Y}_{r}={\displaystyle {\sum}_{s=1}^{{m}_{r}}{y}_{ij}}$ denote respectively the total number of events in the exposed and non-exposed groups; provided that the misclassification error is zero. Therefore, conditional on ${\pi}_{i}$ , ${X}_{i}$ has binomial distribution with parameters $\left({n}_{i},{\pi}_{i}\right)$ . Similarly, conditional on ${Q}_{r},{Y}_{r}$ has binomial distribution with parameters $\left({m}_{r},{Q}_{r}\right)$ . To introduce a within cluster correlation, we assume that ${\pi}_{i}$ follows a beta distribution $B\left(a,b\right)$ with probability density function (pdf) given in (1).

$f\left({\pi}_{i}|a,b\right)=\frac{\Gamma \left(a+b\right)}{\Gamma \left(a\right)\Gamma \left(b\right)}{\pi}_{i}^{a-1}{\left(i-{\pi}_{i}\right)}^{b-1}$ (1)

and that ${Q}_{r}$ follows a similar beta distribution where pdf is denoted $B\left(\alpha ,\beta \right)$ . The effect of the intracluster correlation among the responses may be accounted for as follows.

Under the transformations, $P=\frac{a}{a+b}\text{and}$ , ${\rho}_{1}={\left(1+a+b\right)}^{-1}$ , the mean and variance of ${\pi}_{i}$ are given respectively by $E\left({\pi}_{i}\right)=P$ and $\text{Var}\left({\pi}_{i}\right)=P\left(1-P\right){\rho}_{1}.$

Similarly, $=\alpha /\left(\alpha +\beta \right)$ and ${\rho}_{2}={\left(1+\alpha +\beta \right)}^{-1}$ . Therefore, we have $E\left({Q}_{i}\right)=Q$ , $\text{Var}\left({Q}_{i}\right)=Q\left(1-Q\right){\rho}_{2}$ . Consequently, the unconditional distribution of ${x}_{i}$ is beta binomial with, $E\left({x}_{i}\right)={n}_{i}\rho $ , and $\text{Var}\left({X}_{i}\right)={n}_{i}P\left(1-P\right)\left[1+\left({n}_{i}-1\right){\rho}_{1}\right]$ . Similarly; $E\left({Y}_{i}\right)={m}_{i}Q$ , and

$\text{Var}\left({Y}_{i}\right)={m}_{i}Q\left(1-Q\right)\left[1+\left({m}_{i}-1\right){\rho}_{2}\right].$

It should be noted that the beta distribution assumptions imposed on the model parameters is not necessary, and one may adopt a quasi-likelihood set-up, by specifying the first two moments for ${\pi}_{i}\text{and}{Q}_{i}\text{asshown}$ in [7] and [8] . This set-up, would lead to the same expressions for $\text{Var}\left({X}_{i}\right)$ and $\text{Var}\left({Y}_{i}\right)$ . The reason for introducing the beta distribution here is that while it serves as mechanism to create within cluster correlation, it will form the basis for generating data from beta binomial distribution using Monte-Carlo simulations in Section 5.

The parameters ${\rho}_{1}$ and ${\rho}_{2}$ are respectively interpreted as the within cluster correlations among all pairs of scores in the group of exposed and unexposed. We may obtain consistent estimators of $\text{Var}\left({X}_{i}\right)$ and $\text{Var}\left({Y}_{i}\right)$ on replacing the parameters, $Q$ , ${\rho}_{1}$ , and ${\rho}_{2}$ with appropriate estimators from the data as will be shown in the next sections. We shall now construct unbiased point estimators for the parameters $P$ and $Q$ .

From [8] and [6] , we have $X={\displaystyle {\sum}_{i=1}^{k}{X}_{i}}$ has $E\left(X\right)=NP$ , $\text{Var}\left(X\right)=NP\left(1-P\right)\left[1+\left({n}_{0}-1\right){\rho}_{1}\right]$ . Similarly, $Y={\displaystyle {\sum}_{r=1}^{l}{Y}_{i}}$ has $E\left(Y\right)=MQ$ ,

$\text{Var}\left(Y\right)=MQ\left(1-Q\right)\left[1+\left({m}_{o}-1\right){\rho}_{2}\right]$ , where $N={\displaystyle {\sum}_{i=1}^{k}{n}_{i}}$ , $M={\displaystyle {\sum}_{r=1}^{l}{m}_{i}}$ ,

${n}_{0}={\displaystyle {\sum}_{i=1}^{k}{n}_{i}^{2}/N}$ , and ${m}_{o}={\displaystyle {\sum}_{r=i}^{l}{m}_{i}^{2}/M}$ . Clearly $X/N$ and $Y/M$ are unbiased point estimators for $P\text{and}Q$ respectively.

The data, under the above set up can then be summarized in a 2 ´ 2 table as shown in Table 2.

Formally, the AR is defined in [2] as:

$AR=\left\{P\left(D\right)-P\left(D|\overline{E}\right)\right\}/P\left(D\right)$ (2)

where $P\left(D\right)$ is the percentage of disease in the population, and $P\left(D|\overline{E}\right)$ is the percentage of disease in the population in the absence of exposure to the risk factor. Levin [2] defines the Attributable Risk $\left(AR\right)$ as “the amount of disease that can be attributed to a specific risk factor”.

Using Bayes theorem, and from [ [3] ; page 73], Equation (2) may be written as:

$AR=\frac{p\left(E\right)\left(RR-1\right)}{1+p\left(E\right)\left(RR-1\right)}.$ (3)

Here; RR is the relative risk or the risk ratio, and $P\left(E\right)$ is the risk of expo-

sure. The RR is defined by $RR=\frac{p\left(D|E\right)}{P\left(D|\overline{E}\right)}$ . In terms of population parameters,

the AR as defined in (3) is equivalent to:

$AR=\frac{P\left(1-Q\right)-Q\left(1-P\right)}{P+Q}=\frac{P-Q}{P+Q}.$ (4)

Under the transformation, $\Psi =\frac{1-AR}{1+AR}$ , we get $Q=\Psi P$ . We shall use this

transformation to facilitate the derivation of the large sample variance of AR.

The sample estimator of AR, is obtained using the data in a 2 ´ 2 cross classification as given in Table 2.

Epidemiologists use this statistic quite frequently to assess the consequences of an association between a binary outcome of interest $\left(D\right)$ and exposure to a risk factor $\left(E\right)$ . The total number of observations in the non-exposed and the exposed groups are given respectively by $M$ and $N$ , assumed fixed.

For a cross sectional or cohort study designs the $AR$ estimator is from [3]

Table 2. Disease-exposure cross classification in a 2 ´ 2 table.

given by:

$\widehat{AR}=\frac{X\left(M-Y\right)-Y\left(N-X\right)}{\left(X+Y\right)M}.$ (5)

Following [9] , we shall derive the asymptotic variance of $\widehat{\theta}=\mathrm{ln}\left(1-\widehat{AR}\right)$ .

We first write, $\text{Var}\left(X\right)=NP\left(1-P\right){c}_{1}$ , and, $\text{Var}\left(Y\right)=MQ\left(1-Q\right){c}_{2}$ ,

where ${c}_{1}=1+\left({n}_{0}-1\right){\rho}_{1}$ , ${c}_{2}=1+\left({m}_{0}-1\right){\rho}_{2}$ , ${n}_{0}={\displaystyle {\sum}_{i=1}^{k}{n}_{i}^{2}|N}$ , and ${m}_{0}={\displaystyle {\sum}_{i=1}^{l}{m}_{i}^{2}|M}$ .

Using the delta method [10] we can show to the first order of approximation that:

$\text{Var}\left(\stackrel{\u2322}{\theta}\right)=\frac{N\left(1-P\right){C}_{1}}{P{\left(N+M\Psi \right)}^{2}}+\frac{{N}^{2}\left(1-P\Psi \right){C}_{2}}{MP\Psi {\left(N+M\Psi \right)}^{2}}.$ (6)

A consistent estimator of $\text{Var}\left(\stackrel{\u2322}{\theta}\right)$ may be obtained on replacing the para-

meters $P,{c}_{1},{c}_{2}\text{and}\Psi $ by their moment estimators. An $\left(1-\alpha \right)100\%$ confidence interval on AR is thus given as:

$\left(1-\mathrm{exp}\left(\stackrel{\u2322}{\theta}+{z}_{\alpha /2}\sqrt{\mathrm{var}}\left(\stackrel{\u2322}{\theta}\right)\right),1-\mathrm{exp}\left(\stackrel{\u2322}{\theta}-{z}_{\alpha /2}\sqrt{\mathrm{var}}\left(\stackrel{\u2322}{\theta}\right)\right)\right).$

The moment estimators of the intraclass correlations are obtained separately from the groups of exposed and unexposed clusters. The moment estimator of ${\rho}_{1}$ is given by:

${\widehat{\rho}}_{1}=\frac{MSB-MSW}{MSB+\left({n}_{0}-1\right)MSW}$ (7)

where r,

$MSW=\frac{1}{n-k}{\displaystyle {\sum}_{i=1}^{k}{\displaystyle {\sum}_{j=1}^{{n}_{i}}\frac{{x}_{ij}\left({n}_{ij}-{x}_{ij}\right)}{{n}_{ij}}}}$ (8)

$MSB=\frac{1}{k-1}{\displaystyle {\sum}_{i=1}^{k}{\displaystyle {\sum}_{j=1}^{{n}_{i}}\frac{{\left({x}_{ij}-{x}_{i}\right)}^{2}}{{n}_{ij}}}}.$ (9)

Similar expressions for the (MSW, MSB) are obtained for the clusters of unexposed. The quantities (MSW, MSB) are estimated from the one-way ANOVA model when the responses are measured on the binary scale. For details the readers are referred to [6] .

We now consider two examples, the first is from data arising from a cross sectional study and the second example is on data from a randomized prospective trial.

Example 1: Cross-Sectional Study: The effect of consanguinity on congenital heart defects (CHD).

The Saudi Arabian CHD registry [11] was established in 1998, and by 2003 the registry evolved into Multi-Institutional research collaboration. The prime aim of this institution is to develop a registry whereby data from major referral hospitals across the country can provide patient information.

The participating hospitals are from regions that cover the country making the registry a nationwide data repository for the Kingdom of Saudi Arabia [Congenital Heart Disease Registry 2013]. The present example uses data on a major congenital heart disease; Patent Ductus Arteriosus (PDA). The incidence of PDA has been reported to be approximately, 1 in 2000 births, which accounts for 5% to 10% of all congenital heart diseases with female to male ratio of almost 2:1 [12] . The PDA was found to occur with increased frequency in several genetic syndromes, with precise mechanisms resulting in persistent PDA not yet clear [13] [14] .

Arab countries are notorious for consanguineous marriages, with first cousin types being the most common. For example in Jordan the prevalence of consanguinity was reported in [15] as 51.3%, Yemen, 40% as reported [16] , and almost 57% in Saudi Arabia as reported [17] [18] . More recently, a survey of Saudi families conducted in [19] , estimated the prevalence of consanguinity to be as high as 56%.

For illustrative purposes of the methodologies presented in this section, we sampled two children from the registry whose mother are non-diabetic, with maternal age less than 40 years. Each sampled child was classified according to the presence/absence of PDA, and the type of parental consanguinity (exposure variable). Therefore, for the exposed (children from consanguineous marriages restricted to first degree cousin) and non-exposed (children from non-consangui- neous marriages) the cluster size is $n=m=2$ . The data are presented in Table 3.

Direct applications using Equations (5), (8), (9), and (10) we get:

$\text{AR}=\frac{\left(53\right)\left(66\right)-\left(30\right)\left(99\right)}{\left(83\right)\left(96\right)}=6.6\%$ , $P\left(E\right)=0.61$

${\rho}_{1}=0.325$ , ${\rho}_{2}=0.332$ , $P={P}_{r}\left(D|\text{Consanguineous}\right)=0.39$ ,

$Q={P}_{r}\left(D|\text{Non}-\text{Consanguineous}\right)=0.31,\text{and}RR=\frac{0.349}{0.313}=1.11$ .

The square root of Equation (6) gives $se\left(\widehat{\theta}\right)=0.085$ , and the 95% confidence interval of AR is: −0.104 < AR < 0.210.

The AR estimate is interpreted as follow: if among infants born with CHD, gi- ven that PDA among infants with CHD is a preventable event, then prohibiting first degree relatives’ marriages will reduce the chance of having PDA by 6%.

Example 2: Prospective Cohort study (Weil’s data)

The data in this example was given first in [20] taken from [21] , and gives the

Table 3. Consanguinity and PDA.

results from an experiment comparing two treatments. One group of 16 pregnant female rats was fed a control diet during pregnancy and lactation, while the diet of a second group of 16 pregnant females was treated with a chemical. For each cluster (litter consisting of the pups born to a female rat), the number n of pups alive at 4 days and the number y of pups that survived at 31 day lactation period were recorded. The data are given as a fraction $y/n$ in Table 4.

In Table 4, the numerator is the number of dead pups during 21 days lactation period, and the numerator is the number who survived past 4 days. The purpose of the experiment was to determine if the chemical treatment significantly affects the survival rate among the pups. That is, we need to test the null hypothesis ${H}_{0}:P=Q$ . The data are presented in a 2 × 2 format in Table 5.

$P=0.24$ and $Q=0.10$ , giving relative risk $RR=2.4$ .

$AR=\frac{\left(35\right)\left(142\right)-\left(16\right)\left(112\right)}{\left(51\right)\left(158\right)}=39.4\%.$

$\rho \left(\text{control}\right)=0.029$ , $\rho \left(\text{treated}\right)=0.040$ , ${n}_{0}=9.84$ , ${m}_{0}=9.16$ , and $se\left(\widehat{\theta}\right)=0.0555$ , and the 95% confidence interval on AR is: 0.33 < AR < 0.45.

3. Split Cluster Design

Split-cluster experiments are being used by investigators in health sciences when naturally occurring aggregates of individuals with nested subgroups may be assigned to different treatments. Cited examples include split mouth trials, in which a subject’s mouth is divided into two segments that are randomly assigned to different treatment groups. In other situation, randomization to treatment conditions may be possible at the person level within the cluster. In this case, when the treatment conditions are available within each cluster, the design is referred to as a multisite or split cluster design (SCD). The major attractiveness of this design is that it removes a large portion of the inter-subject variation from the estimate of treatment effect; and hence has the potential to require a lesser number of subjects than a parallel arm design with the same power. When the response variable of interest is binary, statistical methods developed to evaluate the effect of intervention depends on non-parametric methods, as shown in [22] .

In this section we present the data layout for the SCD (see Table 6) and derive the large sample variance of the AR as a measure of effect size.

Under a similar set up to that we presented in the previous section and with appropriate change in notations the random variables ${X}_{i}={\displaystyle {\sum}_{j=1}^{{n}_{i}}{x}_{ij}}$ and ${Y}_{i}={\displaystyle {\sum}_{j=1}^{{m}_{i}}{y}_{ij}}$ will have the same beta-binomial distributions, but they are no- longer independent.

Table 4. Weil’s data: Mortality due to exposure is a two arms clinical trial.

Table 5. Weil’s data collapsed in a 2 ´ 2 table.

Table 6. Data layout for the split-cluster design.

$\text{Var}\left(X\right)=NP\left(1-P\right)\left[1+\left({u}_{1}-1\right){\rho}_{1}\right]$

$\text{Var}\left(Y\right)=MQ\left(1-Q\right)\left[1+\left({u}_{2}-1\right){\rho}_{2}\right].$

The correlation parameters ${\rho}_{1}\text{and}{\rho}_{2}$ are estimated as shown in Equations (7)-(9).

Although the AR estimator maintains the same expression under split clusters, its variance is affected by the correlations within the sub-clusters, and between units in the exposed and the non-exposed sub-clusters.

Using the delta method, we can therefore show that

$\begin{array}{l}\text{Var}\left(\widehat{\theta}\right)=\frac{N\left(1-P\right){c}_{1}}{P{\left[N+M\psi \right]}^{2}}+\frac{{N}^{2}\left(1-\psi P\right){c}_{2}}{M\psi P{\left[N+M\psi \right]}^{2}}\\ \text{}+\left\{\frac{-2NP{\rho}_{12}}{M\psi P{\left[N+M\psi \right]}^{2}}{\left[NM\psi \left(1-P\right)\left(1-\psi P\right){c}_{1}{c}_{2}\right]}^{1/2}\right\}\end{array}$ (10)

where ${c}_{1}=1+\left({u}_{1}-1\right){\rho}_{1}$ , ${c}_{2}=1+\left({u}_{2}-1\right){\rho}_{2}$ , ${u}_{1}={{\displaystyle \sum}}_{i=1}^{k}{n}_{i}^{2}/N$ and

${u}_{2}={{\displaystyle \sum}}_{i=1}^{k}{m}_{i}^{2}/M$ .

Here, ${\rho}_{1}$ is the intraclass correlation among the individuals in the sub-clus- ters of exposed, and ${\rho}_{2}$ is the intraclass correlation among the individuals in the sub-clusters of unexposed. Both correlations are estimated from the one-way ANOVA layout as explained in Equations (7)-(9). The cross-clusters correlation which is interpreted as an intercluster correlation denoted by ${\rho}_{12}$ is similarly estimated from the data by first ignoring the splitting structure of the data, and then use the one-way ANOVA to obtain the within and between mean squares. Substituting these quantities in (7) we obtain a moment estimator of ${\rho}_{12}$ .

Example 3: Split-Mouth Trial

For illustrating the proposed methodology, as a third example, we consider data from a split-mouth trial on 23 patients evaluating the effect of chlorhexidine in the treatment of gingivitis [22] . The data are presented in Table 7. The chlorhexidine and control treatments were randomly applied to four sites located in the patient’s left and right sides of the upper and lower jaws. We are interested here in testing the effect of treatment on the presence or absence of plaque, as based on the measurements taken two weeks after baseline and summarized in Table 8. The sample estimates and standard errors (SE) of P and Q, the proportion of patients having plaque in the chlorhexidine and control groups, are estimated at 0.89 (SE = 0.0343), and 0.77 (SE = 0.0491), respectively. The intra-class and inter-class correlation coefficients are estimated as ${\widehat{\rho}}_{1}=0.0395$ , ${\widehat{\rho}}_{2}=0.087$ , as shown in the previous section, pooled estimate $\widehat{\rho}=0.070$ The sample estimate of the relative risk RR is 1.155.

${\rho}_{1}=0.0395$ , ${\rho}_{2}=0.087$ , ${\rho}_{12}=0.039$ .

$AR=\frac{\left(82\right)\left(21\right)-\left(10\right)\left(71\right)}{\left(153\right)\left(92\right)}=\frac{1722-710}{14076}=7.19\%.$

$se\left(\widehat{\theta}\right)=0.092$ , and the 95% CI on AR is (−0.112, 0.225).

4. Testing the Equality of Two Correlated AR Parameters

Interest is focused on studying the change in disease-exposure etiology under va- rying conditions. We illustrate this situation using the published data [23] .

For example in the case of family data we may be interested in evaluating the effect of disease status of a parental exposure variable on their siblings, which can be divided into males and females within the same family. In this case, we

Table 7. Number of sites with plaque in four sites $\left({m}_{ij}=4\right)$ treated with chlorhexidine and control in 23. The data are adapted from [22] and is given in Table 7.

Table 8. Disease distribution among males and females according to father (exposure variable) disease status.

have two correlated attributable risk estimator, one describing the disease-ex- pose etiology for males, and the other for females. The main interest here is to compare the AR of males to that of females from the same sib-ship.

Example 4: Correlated AR’s from Cross Sectional Study: Family Data

We now consider a highly structured clustered familial data that has a two level hierarchy with blood measurements taken on parents (level two) and their offspring (level one) together with other anthropometric features [23] . Familial data sets are known to have considerable “within-cluster” correlation due to the homogeneous nature of family members. The goal is to classify the offspring blood pressure status based on parents BP and other anthropometric features. The data set contains 223 families with a mean number of siblings equal to 3 siblings per family. The outcome variable in this data set is a binary variable defined as offspring blood pressure status. If simultaneously SBP > 130 and DBP > 80, then an offspring is considered diseased $\left({D}^{+}\right)\text{or}$ otherwise normal $\left({D}^{-}\right)$ . The exposure variable in this example is whether a parent (here we select the father) has the condition (presence of exposure) or does not have the condition (absence of exposure). The data are presented in Table 8.

We present the general methodology as follows: Testing for gender difference in the population $AR$ is formulated as testing the null hypothesis ${H}_{0}:A{R}_{1}=A{R}_{2}$ against a general unspecified alternative ${H}_{1}:A{R}_{2}=A{R}_{1}+\Delta $ . Note that testing this null hypothesis is equivalent to testing ${H}_{0}:{\theta}_{1}={\theta}_{2},\text{or}{H}_{0}:\Delta =0$ .

Let the point estimators be denoted by ${\widehat{\theta}}_{1}$ and ${\widehat{\theta}}_{2}$ . The difference $D={\widehat{\theta}}_{1}-{\widehat{\theta}}_{2}$ is asymptotically unbiased and has variance $\mathrm{var}\left(D\right)=\mathrm{var}\left({\widehat{\theta}}_{1}\right)+\mathrm{var}\left({\widehat{\theta}}_{2}\right)-2\mathrm{cov}\left({\widehat{\theta}}_{1},{\widehat{\theta}}_{2}\right)$ .

Hence the null hypothesis is rejected whenever $Z=D/\sqrt{\mathrm{var}\left(D\right)}$ falls in the interval $Z>{z}_{\alpha /2}$ or $Z<-{z}_{\alpha /2}$ , where ${z}_{\alpha /2}$ is the $\left(1-\alpha /2\right)100\%$ cut off point on the standard normal curve. With a slight difference in notation, $\mathrm{var}\left({\widehat{\theta}}_{i}\right)$ is similar to the expression in Equation (6). We derive $\mathrm{cov}\left({\widehat{\theta}}_{1},{\widehat{\theta}}_{2}\right)$ using the delta method. In general, the data will have a structure similar to that given in Table 9.

We define the moment estimator as before:

${\widehat{AR}}_{j}=\frac{{x}_{j}\left({M}_{j}-{Y}_{j}\right)-{Y}_{j}\left({N}_{j}-{X}_{j}\right)}{{M}_{j}\left({X}_{j}+{Y}_{j}\right)}j=1,2.$

Let ${\widehat{\theta}}_{j}=\mathrm{ln}\left(1-{\widehat{AR}}_{j}\right)$ , then similar to the first situation, we have:

${\tau}_{j}^{2}=\mathrm{var}\left({\widehat{\theta}}_{j}\right)=\frac{{N}_{j}\left(1-{P}_{j}\right){c}_{1j}}{{P}_{j}{\left({N}_{j}+{M}_{j}{\psi}_{j}\right)}^{2}}+\frac{{N}_{j}^{2}\left(1-{P}_{j}{\psi}_{j}\right){c}_{2j}}{{M}_{j}{P}_{j}{\psi}_{j}{\left({N}_{j}+{M}_{j}{\psi}_{j}\right)}^{2}}.$

Table 9. Collapsed data for the analysis of correlated AR parameters.

Here;

${N}_{j}={\displaystyle {\sum}_{i=1}^{{k}_{j}}{n}_{ji}}$ , ${M}_{j}={\displaystyle {\sum}_{i=1}^{{l}_{j}}{m}_{ji}}$ , ${c}_{1j}=1+\left({n}_{0j}-1\right){\rho}_{1j}$ , ${c}_{2j}=1+\left({m}_{0j}-1\right){\rho}_{2j}$ ,

and ${n}_{0j}=\frac{1}{{N}_{j}}{\displaystyle {\sum}_{i=1}^{{k}_{j}}{n}_{ji}^{2}}$ , ${m}_{0j}=\frac{1}{{M}_{j}}{\displaystyle {\sum}_{i=1}^{{l}_{j}}{m}_{ji}^{2}},\text{}{\psi}_{j}=\left(1-A{R}_{j}\right)/\left(1+A{R}_{j}\right)$ , and

${P}_{j}=\text{rateofexposuretotheriskfactorunderthe}jth\text{condition}$ .

Moreover, ${\rho}_{1j}$ is the intracluster correlation of the exposed clusters under $jth$ condition, and ${\rho}_{2j}$ is the intracluster correlation of the unexposed clusters under $jth$ condition. They two parameters are estimated as described in (7).

For simplicity we assume that these correlations are constant among the exposed and unexposed.

Using the delta method we can show after some algebra that:

$\mathrm{cov}\left({\widehat{\theta}}_{1},{\widehat{\theta}}_{2}\right)=\rho [{\alpha}_{1}{\alpha}_{2}{\gamma}_{1}{\gamma}_{2}+{\alpha}_{2}{\beta}_{1}{\gamma}_{2}{\delta}_{1}+{\alpha}_{1}{\beta}_{2}{\gamma}_{1}{\delta}_{2}+{\beta}_{1}{\beta}_{2}{\delta}_{1}{\delta}_{2}].$ (11)

The correlation $\rho $ which, under both conditions is the average correlation among the responses, is estimated as described in Section 3.

The values inside the square bracket are given by:

${\alpha}_{j}=\frac{\overline{\partial}{\theta}_{j}}{\partial {x}_{j}}=-\frac{1}{{P}_{j}\left({N}_{j}+{M}_{j}{\psi}_{j}\right)}$

${\beta}_{j}=\frac{\overline{\partial}{\theta}_{j}}{\partial {y}_{j}}=-\frac{{N}_{j}}{{M}_{j}{\psi}_{j}{P}_{j}\left({N}_{j}+{M}_{j}{\psi}_{j}\right)}$

${\gamma}_{j}^{2}=\mathrm{var}\left({x}_{j}\right)={N}_{j}{P}_{j}\left(1-{P}_{j}\right){c}_{1j}$

$\begin{array}{c}{\delta}_{j}^{2}=\mathrm{var}\left({y}_{j}\right)={M}_{j}{Q}_{j}\left(1-{Q}_{j}\right){c}_{2j}\\ ={M}_{j}{\psi}_{j}{P}_{j}\left(1-{\psi}_{j}{P}_{j}\right){c}_{2j}\text{}j=1,2.\end{array}$

Therefore;

$\mathrm{var}\left({\widehat{\theta}}_{1}-{\widehat{\theta}}_{2}\right)={\tau}_{1}^{2}+{\tau}_{2}^{2}-2\rho [{\alpha}_{1}{\alpha}_{2}{\gamma}_{1}{\gamma}_{2}+{\alpha}_{2}{\beta}_{1}{\gamma}_{2}{\delta}_{1}+{\alpha}_{1}{\beta}_{2}{\gamma}_{1}{\delta}_{2}+{\beta}_{1}{\beta}_{2}{\delta}_{1}{\delta}_{2}].$ (12)

Using the data in Table 8 we get:

Males: ${P}_{1}=.59$ , ${M}_{1}=128,A{R}_{1}=.19$ , $\mathrm{var}\left({\widehat{\theta}}_{1}\right)=.0142$ .

Females: ${P}_{2}=.63$ , ${M}_{2}=244,A{R}_{2}=.29$ , $\mathrm{var}\left({\widehat{\theta}}_{2}\right)=.00578$

$\mathrm{var}\left({\widehat{\theta}}_{1}-{\widehat{\theta}}_{2}\right)=.01987$ , $z=\frac{-.211+.342}{.141}=0.929$ , and $p-\text{value}=0.353$ .

Therefore there is not enough evidence in the data to support the hypothesis of presence of gender differences for the paternal effect on the siblings’ hypertension status.

5. Simulations

We carried out a Monte-Carlo study generating the observations from bivariate beta binomial distribution. We restricted our simulations to the situation when the intracluster and the cross clusters correlation are equal. We also assumed a fixed number of observations within each cluster. The purpose was to limit the number of scenarios under which we examine the properties of the proposed test statistic Z. The statistic $Z=D/\sqrt{\mathrm{var}\left(D\right)}$ is computed when both ${P}_{1}$ and ${P}_{2}$ are strictly positive with additional restriction, $\rho <\left({\tau}_{1}^{2}+{\tau}_{2}^{2}\right)/2[{\alpha}_{1}{\alpha}_{2}{\gamma}_{1}{\gamma}_{2}+{\alpha}_{2}{\beta}_{1}{\gamma}_{2}{\delta}_{1}+{\alpha}_{1}{\beta}_{2}{\gamma}_{1}{\delta}_{2}+{\beta}_{1}{\beta}_{2}{\delta}_{1}{\delta}_{2}]$ . If these conditions are not satisfied, the sample is replaced until a total of 1000 iterations are obtained for each parameter combination. Table 10 shows the empirical levels and powers of the test assuming that the number of clusters is the same under both conditions and for the exposed and the non-exposed clusters. The main conclusions from Table 10, is that the proposed test statistic hold its empirical Type I error rate levels. There is an increase in the test power when the correlation increases, and when the prevalence of exposure parameters ${P}_{1}$ and ${P}_{2}$ are away from the boundaries of the interval (0, 1). We also note an appreciate increase in the power when the number of clusters is above 50, and naturally when the tested parameters are well separated.

6. Discussion

The population Attributable risk, like the odds ratio and relative risk is a measure of disease risk association. However it has a special appeal to public health epidemiologists as it measures the percent reduction in the chances of having the outcome among subjects who are exposed to the risk factor. Clearly, not everyone in the population is exposed to the risk factor. For example, in evaluating the relationship between consanguinity and the risk of PDA, not all parents are relatives. We assume say that 55% of women in the population (as in the Saudi traditional society) are married to a first cousin. To determine how much of a reduction there would be in PDA among CHD newborns we have 0.55 × 0.06 = 3.3%.

We have developed estimators of the variance and the confidence interval on AR when the units of sampling are aggregates of individuals under three study designs. In all situations the estimation of the intraclass correlation is crucial to

Table 10. Empirical type i error rates and powers based on 1000 replications from the bivariate beta binomial distribution. We set AR_{1} = 0.05, and therefore,
$A{R}_{2}=A{R}_{1}-\Delta $
.

conduct valid statistical inferences.

One of the objectives of this paper was to develop and evaluate simple test statistic that could be used to compare dependent attributable risks in the case of clustered dichotomous outcome variables.

7. Conclusions

1) Through simulations, a major finding of our work is that to test the equality of correlated attributable risks, either from cross sectional or cohort studies, one needs a much larger number of clusters than that expected to achieve high power.

2) An interesting extension of our study is to construct model-based inference on the AR. This would require the development of a semi parametric model similar to the generalized estimating equation, or a full probabilistic model such as generalized linear mixed model where the effect of multiple covariates may be accounted for.

3) A limitation of the simulation study is the restrictions that the number of observations in all clusters is held constant (balanced design) and that the within cluster and the cross cluster correlations are equal. The reason for this assumption is to limit the number of factors which affect the power so that reasonable conclusions can be made. But we believe that these restrictions should not affect the overall conclusions.

Conflict of Interest

The authors declare that there is no conflict of interest.

Cite this paper

Shoukri, M., Donner, A. and Al-Mohanna, F. (2017) Estimation of Attributable Risk from Clustered Binary Data: The Case of Cross-Sec- tional and Cohort Studies. Open Journal of Statistics, 7, 240-253. https://doi.org/10.4236/ojs.2017.72019

References

- 1. Richardson, J. (1996) Measures of Effect Size. Behavior Research Methods, Instruments and Computers, 28, 12-22. https://doi.org/10.3758/BF03203631
- 2. Levin, M.L. (1953) The Occurrence of Lung Cancer In Man. Acta Unio Internationalis Contra Cancrum, 9, 531-541.
- 3. Fleiss, J. (1982) Statistical Methods for Rates and Proportions. 2nd Edition, John Wiley, New York.
- 4. Fletcher, R.H., Fletcher, S.W. and Wagner, E.H. (1996) Clinical Epidemiology: The Essentials. Lippincott Williams & Wilkins, Philadelphia.
- 5. Gordis, L. (1996) Epidemiology. WB Saunders Co., Philadelphia.
- 6. Donner, A. and Klar, N. (2000) Design and Analysis of Cluster Randomized Trials in Health Research. Arnold, New York.
- 7. McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. 2nd Edition, Chapman & Hall/CRC.
- 8. Cox, D.R. and Snell, E.J. (1989) Analysis of Binary Data. 2nd Edition, CRC Press, Boca Raton, FL.
- 9. Walter, S.D. (1975) The Distribution of Levin’s Measure of Attributable Risk from Case-Control Studies. American Journal of Epidemiology, 106, 206.
- 10. Kendall, M. and Stuart, A. (1987) Advanced Theory of Statistics. Vol. 1, 5th Edition, Griffin, London.
- 11. Saudi Congenital Heart Disease Registry. http://rc.kfshrc.edu.sa/chd_program
- 12. Mitchell, S.C., Korones, S.B. and Berendes, H.W. (1971) Congenital Heart Disease, in 56,109 Births. Incidence and Natural History. Circulation, 43, 323-332. https://doi.org/10.1161/01.CIR.43.3.323
- 13. Satoda, M., Pierpont, M.E., Diaz, G.A., Bornemeier, R.A. and Gelb, B.D. (1999) Char Syndrome, an Inherited Disorder with Patent Ductus Arteriosus, Maps to Chromosome 6p12-p21. Circulation, 99, 3036-3042. https://doi.org/10.1161/01.CIR.99.23.3036
- 14. Satoda, M., Zhao, F., Diaz, G.A., Burn, J., Goodship, J., Davidson, H.R., Pierpont, M.E. and Gelb, B.D. (2000) Mutations in TFAP2B Cause Char Syndrome, a Familial Form of Patent Ductus Arteriosus. Nature Genetics, 25, 42-46. https://doi.org/10.1038/75578
- 15. Khoury, S.A. and Massad, D. (1992) Consanguineous Marriages in Jordan. American Journal of Medical Genetics, 43, 769-775. https://doi.org/10.1002/ajmg.1320430502
- 16. Jurdi, R. and Saxena, P.C. (2003) The Prevalence and Correlates of Consanguineous Marriages in Yemen: Similarities and Correlates with Other Arab Countries. Journal of Biosocial Sciences, 35, 1-13. https://doi.org/10.1017/S0021932003000014
- 17. El-Hazmi, M.A., Al-Swailem, A.R. and Warsey, A.S. (1995) Consanguinity among the Saudi Arabian Population. Journal of Medical Genetics, 32, 623-626. https://doi.org/10.1136/jmg.32.8.623
- 18. Becker, S.M., Al Halees, Z., Molina, C. and Paterson, R.M. (2001) Consanguinity and Congenital Heart Disease in Saudi Arabia. American Journal of Medical Genetics Part A, 99, 8-13. https://doi.org/10.1002/1096-8628(20010215)99:1<8::AID-AJMG1116>3.0.CO;2-U
- 19. El-Mouzan, M., Al-Salloum, A., Al-Herbish, A., Qurashi, M. and Al-Omar, A. (2008) Consanguinity and Major Genetic Disorders in Saudi Children: A Community-Based Cross-Sectional Study. Annals of Saudi Medicine, 28, 169-174. https://doi.org/10.4103/0256-4947.51726
- 20. William, D.A. (1975) The Analysis of Binary Responses from Toxicological Experiments Involving Reproduction and Teratogenicity. Biometrics, 31, 949-952. https://doi.org/10.2307/2529820
- 21. Weil, C.S. (1971) Selection of the Valid Number of Sampling Units and a Consideration of Their Combination in Toxicological Studies Involving Reproduction, Teratogenesis or Carcinogenesis. Food Cosmetics Toxicology, 8, 177-182.
- 22. Donner, A., Klar, N. and Zou, G. (2004) Methods for the Statistical Analysis of Binary Data in Split-Cluster Designs. Biometrics, 60, 919-925. https://doi.org/10.1111/j.0006-341X.2004.00247.x
- 23. Miall, W.E. and Oldham, P.O. (1955) A Study of Arterial Blood Pressure and Its Inheritance in a Sample of the General Population. Clinical Science, 14, 459-487.