Missing data are always an issue in community-based longitudinal studies, raising questions about the representativeness of samples and about bias in the conclusions the research generates. Missing data may arise from the difficulty of implementing random sampling procedures in these studies and/or the inherent difficulty in sampling hard-to-reach segments of the population being studied. In fact, the ability to accurately study hard-to-reach populations in light of potential bias created by missing data remains an open question. In this study, missing data are defined as both failure to interview potential research participants identified in the sampling frame and failure to retain enrolled research participants longitudinally. Using the sample from the Mobile Youth Survey, a multiple-cohort, longitudinal study of adolescents living in highly impoverished neighborhoods in Mobile, Alabama, we examined sample representativeness and dropout to determine whether missing data led to a nonrepresentative, and therefore biased, sample. Results indicate that even though random procedures were not strictly used to draw the sample, (a) the sample appears to be largely representative of the population that was studied, and (b) attrition is largely uncorrelated with characteristics of those who dropped out. This suggests that it is possible to study hard-to-reach populations in community settings with validity.
In research, some units of the analysis are more difficult to study than others. In survey research, where people are the units of analysis, some people (or more generally, groups of people with certain characteristics) are more difficult to study than others, and the term “hard-to-reach” has been used to invoke the resources needed to sample, recruit, and secure participation from individuals with these characteristics [
Lambert and Wiebel [
Several factors may account for the difficulty of conducting research on these populations. For remote or geographically isolated populations, traveling to conduct the research, coupled with possible language and cultural barriers, may be prohibitively expensive (for examples, see [
The result is missing data, and despite best intentions of researchers, missing data are common in survey research involving hard-to-reach populations. This may take either (or both) of two forms: (a) lack of participation by cases identified in the sampling frame (or that would have been identified had the sampling frame been known) and targeted for recruitment; and (b) participant dropout in longitudinal studies (A third form of missing data, failure to answer specific questions in a survey, will not be addressed here; indeed, others argue that this form of missing data is “little more than a nuisance” [
Consider, for example, how missing data might occur in a study of inner-city adolescents. Inner-city neighborhoods are, by definition, impoverished; and given the overlap of poverty and race/ethnicity in the U.S. [
Impoverished households are also more prone to residential mobility [
Beyond the risks of selection bias in studies that focus on hard-to-reach populations, a more insidious problem involving exclusion of underrepresented populations has been noted [
Of course, the easiest way to address missing data is to avoid it. Community-based participatory research [
An alternative approach involves data analysis. While a number of data analysis procedures accommodate missing data (e.g., multiple imputation, maximum likelihood), they all assume that missing data are ignorable. In all cases, missing data are non-ignorable if missingness (R’) depends on the outcome variable (Y); to the extent that missing data are non-ignorable (also termed Missing Not at Random, MNAR), results are potentially biased [
Of course, it is difficult (if not impossible) to determine definitively whether or not the outcome or predictor variables are related to missingness, since by definition they are not measured for missing cases. While empirical means of exploring whether missing data are ignorable have been developed, (e.g., [
Graham and colleagues [
None of this means that the conclusion reached by Graham and colleagues [
The present study addresses this question through the analysis of Mobile Youth Survey (MYS) data; the MYS was a community-based longitudinal study of poverty and adolescent risk conducted with an inner-city sample in Mobile, Alabama. As is typical of a study of this type, the sampling frame was poorly defined and high rates of missing data occurred. However, the study was able to access complete public school records for both study participants and potential participants. Thus, auxiliary data were available for both observed and missing cases each year of the study. This information is important, in that it extends the findings of Graham and colleagues [
The aim of this study is to determine whether systematic differences exist between the sample of students who (a) enrolled in the MYS (E) and those who did not, (b) participated in the MYS during any given year (P) and those who did not, and (c) were retained as MYS participants during consecutive waves of data collection (R) and those who were not. In the following notation, $i$ denotes a person (or case), while $j$ and $k$ denote years (or waves) in the longitudinal data collection sequence. We define participation as a matrix P, where $p_{ij} = 1$ if person $i$ participated in the MYS during year $j$. We define enrollment as a vector e, where $e_i = 1$ if $p_{ij} = 1$ for any $j$ and $e_i = 0$ if $p_{ij} = 0$ for every $j$. Generally speaking, E allows us to examine the representativeness of the sample, while P allows us to examine within-case year-to-year variability in representativeness. We define wave-to-wave retention as a matrix R (which varies by adjacent data waves $j$ and $k$), where $r_{ijk} = 1$ if $p_{ij} = 1$ and $p_{ik} = 1$; $r_{ijk} = 0$ if $p_{ij} = 1$ and $p_{ik} = 0$; and $r_{ijk}$ is undefined if $p_{ij} = 0$ or $i$ fails to meet the inclusion criteria (see the Methods Section) for either year $j$ or year $k$. R allows us to examine whether missing data due to dropout are informative. In conducting these analyses, we examine how both demographic (i.e., age, sex, race) and functional (i.e., cognitive and behavioral) variables are related to missingness.
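The participation, enrollment, and retention indicators defined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy matrix `P`, the `eligible` flags, and the function names are hypothetical and are not taken from the MYS data or analysis code. Retention is stored only where it is defined, that is, where the person participated in wave j and met the inclusion criteria in both adjacent waves.

```python
from typing import Dict, List, Tuple


def enrollment(P: List[List[int]]) -> List[int]:
    """e_i = 1 if person i participated in at least one wave, else 0."""
    return [1 if any(row) else 0 for row in P]


def retention(
    P: List[List[int]], eligible: List[List[bool]]
) -> Dict[Tuple[int, int, int], int]:
    """r_ijk for adjacent waves j and k = j + 1.

    An entry exists only when p_ij = 1 and person i meets the inclusion
    criteria in both waves; otherwise r_ijk is undefined (key absent).
    """
    R: Dict[Tuple[int, int, int], int] = {}
    for i, row in enumerate(P):
        for j in range(len(row) - 1):
            k = j + 1
            if row[j] == 1 and eligible[i][j] and eligible[i][k]:
                R[(i, j, k)] = 1 if row[k] == 1 else 0
    return R


# Hypothetical toy data: 3 people, 3 waves.
P = [
    [1, 1, 0],  # participated in waves 0 and 1, then dropped out
    [0, 0, 0],  # never participated (not enrolled)
    [0, 1, 1],  # joined in wave 1, but ineligible in wave 2 (see below)
]
eligible = [
    [True, True, True],
    [True, True, True],
    [True, True, False],  # e.g., aged out of the inclusion range
]

e = enrollment(P)
R = retention(P, eligible)
print(e)
print(R)
```

Representing R as a dictionary keyed by (i, j, k) keeps "undefined" distinct from "not retained", which matters when retention rates are computed only over defined entries.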
The Mobile MSA has a population of 540,258. The largest city within this MSA is Mobile, which has a population of approximately 200,000. In 2000, 46.1% of the residents of Mobile were Black American and 22.4% of this population lived in poverty, where the median household income was $31,445 [
A total of 8708 adolescents enrolled in the MYS between 1998 and 2007; this constitutes the core sample for the study, although, as noted in the Inclusion section, the final sample size is somewhat smaller. In 1998, the initial wave of MYS data collection was conducted in the 13 most impoverished neighborhoods in the Mobile MSA (98% of respondents reported that they were Black). In 1990, the population in these neighborhoods was over 95% Black; median household income was $5190 and the poverty rate was 73% [
When it began, the sampling frame for the MYS was 10- to 18-year-old adolescents who lived in these neighborhoods, with the goal of recruiting the entire population of eligible adolescents. This goal was pursued using both an active (random) and passive (non-random) recruitment strategy (see [
In 1998, 1771 respondents completed the MYS. The response rate is difficult to calculate because a definitive census of the sampling frame was not available. However, a conservative estimate of a response rate among those actively recruited is 60% [
Research participants were paid for their time each year ($10 between 1998 and 2004; $15 after 2004). The MYS typically required between 60 and 90 minutes to complete.
The first criterion for inclusion in this study is that youths must be students in the Mobile County Public School System (MCPSS). MCPSS records were used in this study as an auxiliary dataset to assess the characteristics of cases with missing data (including those who were eligible to participate but were never enrolled in the study); therefore, only MYS respondents who could be matched to school records were selected for analysis in this study. Ninety percent of all MYS participants were matched to school records; these were supplemented with non-MYS participants who lived in MYS neighborhoods and attended the same schools as MYS participants. However, among the 10% of enrollees who were not matched to school records, approximately half could not be verified by any other source (i.e., records from the Mobile County Juvenile Court, the Mobile Housing Authority, the Mobile Police Department’s Family Intervention Team program for at-risk youths, or the LexisNexis public records database), and we assume that these cases are not genuine (some adolescents may have given fake names so as to be allowed to participate multiple times and receive multiple cash incentives for their participation). Thus, we obtain an effective match to MCPSS records of approximately 95%. An analysis of Family Intervention Team program records shows that only five of 656 (0.7%) MYS enrollees in the program attended non-public schools, and none were home schooled. This is consistent with national statistics, showing that (a) fewer than 5% of Black American youths living in households with annual incomes less than $20,000 attended private schools [
The second inclusion criterion involves age. Public school records provide more-or-less complete data for adolescents living in the MYS neighborhoods, with one major caveat: Alabama law allows students to drop out of school when they turn 16; thus age is used as an exclusionary factor in this study. While age was limited by the MYS’ design (eligibility of participants limited to 10 to 18 years old), in the current study, data are limited to students aged 10 through 15. Because students are legally required to attend school through the age of 15, a neighborhood sample of youths under age 16 (adjusted for students who do not attend private schools and are not home schooled) and the MCPSS census of students under age 16 are coincident. Because there is little home-schooling and private-school attendance in these neighborhoods, the adjustment makes little difference. Beyond age 15, however, the neighborhood youth census and the MCPSS census begin to deviate, due to the possibility of school dropout; since dropouts are more likely to engage in risk behaviors [
The third inclusion criterion involves neighborhood. Addresses of MYS participants were geocoded using GIS software, and MYS neighborhoods were identified as geographical areas bounded by major man-made (streets, railroad tracks) or natural (bodies of water) barriers where MYS participants were clustered. Schools serving these neighborhoods that were attended by five or more MYS participants were selected for study. Addresses of all appropriately-aged (i.e., <16) students attending these schools were geocoded, and those living in the MYS neighborhoods were included in this study (these included both MYS participants and non-participants). Through the use of this inclusion criterion, we eliminated geographical outliers, who may also have been statistical outliers in terms of their characteristics.
Based on these inclusion criteria, the final sample consisted of 7142 MYS enrollees and 25,442 students who were not MYS enrollees but who lived in MYS neighborhoods.
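The three inclusion criteria (a match to MCPSS records, age 10 through 15, and residence in an MYS neighborhood) amount to a simple filter over student-year records. The sketch below assumes a hypothetical record layout; the class `StudentYear` and its field names are illustrative only and do not describe the actual MCPSS or MYS data files.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class StudentYear:
    """One student in one school year (hypothetical layout)."""
    student_id: str
    age: int
    matched_to_mcpss: bool      # criterion 1: matched to school records
    in_mys_neighborhood: bool   # criterion 3: geocoded into an MYS neighborhood


def meets_inclusion(rec: StudentYear) -> bool:
    """Apply the three inclusion criteria described in the text."""
    age_ok = 10 <= rec.age <= 15  # criterion 2: below the legal dropout age of 16
    return rec.matched_to_mcpss and age_ok and rec.in_mys_neighborhood


def apply_inclusion(records: List[StudentYear]) -> List[StudentYear]:
    return [r for r in records if meets_inclusion(r)]


# Hypothetical examples.
records = [
    StudentYear("s1", 12, True, True),    # included
    StudentYear("s2", 16, True, True),    # excluded: age 16 or older
    StudentYear("s3", 14, False, True),   # excluded: no school-record match
    StudentYear("s4", 11, True, False),   # excluded: geographical outlier
]
included_ids = [r.student_id for r in apply_inclusion(records)]
print(included_ids)
```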
Demographic variables examined in this study include sex, race, free lunch eligibility status, and grade, all based on MCPSS records. Sex is straightforward, and for analysis purposes, girl = 1 and boy = 0. Race/ethnicity was identified in these records as Black, White, Asian, Native American, Hispanic, and unspecified. Overall, 99.8% of MYS enrollees were classified as Black or White; this is consistent with MYS self-reported race, where 99.8% also reported themselves to be Black or White. We therefore treated cases that were coded by MCPSS as Asian, Native American, Hispanic, or unspecified as missing. For analysis purposes, Black = 1 and White = 0. School lunch status was coded in the MCPSS records as free, reduced-cost, and paid, with the vast majority of MCPSS participants receiving free lunches; for convenience, and because of the small number of MYS participants who were in the other two categories, the categories were combined to yield a dichotomous measure: “Free” and “Not Free” status. For analysis purposes, free lunch = 1 and paid lunch = 0. Free lunch status was a proxy for vulnerability associated with low SES. For ease of interpretation, grade was centered for analysis purposes at 3rd grade = 0. Finally, 48 MYS neighborhoods were identified; these were divided into two groups: original target neighborhoods and expansion neighborhoods.
In this study, functional characteristics include school achievement, as indicated by Stanford Achievement Test (SAT) percentile ranks, and by violation of school code-of-conduct; each is a component of MCPSS student records.
Student achievement in Alabama was assessed annually using the SAT. The SAT, 9th edition was completed by 3rd through 11th grade students through the 2001-2002 academic year. Beginning in spring 2003, the SAT, 10th edition, was administered annually to 3rd through 8th grade students. While results for the SAT 9th edition are not directly comparable to results from the SAT 10th edition, this study compares students within year; our use of standardized percentile scores across years makes such comparisons less problematic. The reading and math subscales of the SAT are used in these analyses.
The MCPSS records violations of the school code of conduct. Violations were classified as A, B, C, D, or E violations, in order of severity, with “A” violations indicating minor infractions (e.g., not wearing clothes conforming to a school’s color code) and “E” violations indicating major infractions (e.g., bringing a gun to school). As an overall measure, violations were weighted for severity (A = 1, B = 2, C = 3, D = 4, E = 5) and summed for each student each year.
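The severity-weighted violation score can be illustrated with a short Python sketch. The weights come from the text (A = 1 through E = 5); the input format, a list of (student, year, class) tuples, and the function name are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Severity weights as stated in the text.
SEVERITY_WEIGHT = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}


def weighted_violations(
    records: Iterable[Tuple[str, int, str]]
) -> Dict[Tuple[str, int], int]:
    """Sum severity weights per (student, year).

    `records` holds hypothetical (student_id, year, violation_class) tuples.
    """
    totals: Dict[Tuple[str, int], int] = defaultdict(int)
    for student_id, year, violation_class in records:
        totals[(student_id, year)] += SEVERITY_WEIGHT[violation_class]
    return dict(totals)


# Hypothetical violation records.
records = [
    ("s1", 1999, "A"),
    ("s1", 1999, "C"),
    ("s1", 2000, "E"),
    ("s2", 1999, "B"),
]
scores = weighted_violations(records)
print(scores)
```

Students with no recorded violations in a year do not appear in the result; in an analysis file they would presumably be assigned a score of 0, which is an assumption rather than something the text states.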
In this study, we conduct statistical analyses to determine whether demographic and functional characteristics affect enrollment, participation, and retention. We also examine the magnitude of each observed effect to determine how much it might potentially have biased the MYS sample. Each of the ten waves of MYS data was paired to the subsequent school year (e.g., 1998 MYS (wave 1) corresponds to the 1998-1999 school year). While such comparisons cannot indicate whether the missing data are ignorable or non-ignorable (no analysis can strictly demonstrate this), they can provide an indication of whether the MYS sample is representative of the larger neighborhood population from which it was drawn on indicators that reflect potential sources of bias.
Strictly speaking, a sample is representative only if its characteristics do not differ from the population in terms of specific study variables. This requirement can be relaxed to some extent, although conclusions about representativeness are stronger as the characteristics used to assess representativeness align more closely with the study variables. Thus, while representativeness does not guarantee MAR or MCAR, these conditions do guarantee representativeness; but as the characteristics used to assess representativeness more closely approximate study variables, any loss of correspondence between representativeness and MAR or MCAR will have decreasing importance and can be safely ignored.
To assess whether missingness in MYS data were affected by demographic and functional characteristics (i.e., whether the MYS sample is representative of the population on these variables), models were estimated using a Generalized Estimating Equations (GEE) framework [
There is no general agreement on how to calculate effect sizes for GEE models, so we calculated means for different levels of each characteristic as a proxy. For categorical predictors, these are least squares estimates of the probability of the outcome (e.g., enrollment) for each category of the predictor variable. For continuous predictors, these are probability estimates of the outcome variable when the predictor is one standard deviation below and above the mean, with all other variables held at their means (or modal category).
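For a continuous predictor, the probability estimates at one standard deviation below and above the mean can be sketched as follows. The example assumes a logit link, which is a common choice for binary outcomes but is an assumption here, and all numeric inputs are hypothetical rather than values from the fitted models.

```python
import math
from typing import Tuple


def inv_logit(x: float) -> float:
    """Convert a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def prob_at_mean_plus_minus_sd(
    eta_at_mean: float, slope: float, sd: float
) -> Tuple[float, float]:
    """Probabilities at M - S and M + S for one continuous predictor.

    eta_at_mean: linear predictor with every covariate at its mean or
                 modal category (hypothetical value).
    slope:       coefficient of the predictor of interest.
    sd:          standard deviation of that predictor.
    """
    low = inv_logit(eta_at_mean - slope * sd)
    high = inv_logit(eta_at_mean + slope * sd)
    return low, high


# Hypothetical inputs, chosen only to illustrate the calculation.
low, high = prob_at_mean_plus_minus_sd(eta_at_mean=-1.0, slope=0.1, sd=2.0)
print(round(low, 3), round(high, 3))
```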
Because differences between raw probabilities are difficult to interpret (e.g., a difference [Δp] of 0.05 is more meaningful in the tail of a distribution than at its center), we converted these differences into a pseudo-measure (h’) of effect size, using a procedure suggested by Cohen [
$$h' = \left| 2 \sin^{-1}\!\left(\sqrt{p_1}\right) - 2 \sin^{-1}\!\left(\sqrt{p_2}\right) \right|.$$
Note that h’ is not equivalent to Cohen’s measure of effect size (h) for the simple test of proportions, since h does not take into account the complexity of repeated measures or multiple covariates, nor is it meant to compare two points in a continuous distribution (e.g., M ± S). Nonetheless, it provides a guide for interpreting the magnitude of reported effects. Cohen specifies small effect sizes in the range of h = 0.2, medium effect sizes in the range of h = 0.5, and large effect sizes in the range of h = 0.8.
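As a worked check, the effect size can be computed directly from a pair of estimated probabilities. The sketch below uses Cohen's arcsine transformation applied to the square root of each proportion; with that form, it reproduces effect sizes reported in the enrollment results (for example, 0.387 for race, from probabilities 0.083 and 0.218). The function name `h_prime` is illustrative.

```python
import math


def h_prime(p1: float, p2: float) -> float:
    """Absolute difference of arcsine-transformed proportions."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return abs(phi1 - phi2)


# Probabilities taken from the enrollment results reported in this study.
race_h = h_prime(0.083, 0.218)    # White vs. Black
grade_h = h_prime(0.221, 0.238)   # grade at M - S vs. M + S
print(round(race_h, 3), round(grade_h, 3))
```

The same raw difference in probability yields a larger value near 0 or 1 than near 0.5, which is the property that motivates using the transformation instead of the raw difference.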
Total | Enrolled in the MYS | Not Enrolled in the MYS | |||||||
---|---|---|---|---|---|---|---|---|---|
N = 32,584 | N = 7142 | N = 25,442 | |||||||
Observationsa = 107,689 | Observations = 27,761 | Observations = 79,928 | |||||||
Observations | Percent | Observations | Percent | Observations | Percent | ||||
Race | |||||||||
Black | 101,874 | 95.6 | 27,546 | 99.5 | 74,328 | 94.2 | |||
White | 4728 | 4.4 | 147 | 0.5 | 4581 | 5.8 | |||
Missing | 1087 | 68 | 1019 | ||||||
Sex | |||||||||
Girl | 52,816 | 49.0 | 13,182 | 47.5 | 39,634 | 49.6 | |||
Boy | 54,873 | 51.0 | 14,579 | 52.5 | 40,294 | 50.4 | |||
Missing | 0 | 0 | 0 | ||||||
Free Lunch | |||||||||
No | 25,193 | 23.4 | 3247 | 11.7 | 21,946 | 27.5 | |||
Yes | 82,493 | 76.6 | 24,514 | 88.3 | 57,979 | 72.5 | |||
Missing | 3 | 0 | 3 | ||||||
Observations | Mb | Sc | Observations | M | S | Observations | M | S | |
Grade | 107,686 | 7.23 | 1.82 | 27,760 | 7.07 | 1.77 | 79,919 | 7.28 | 1.83 |
Reading SAT | 71,649 | 35.42 | 24.38 | 18,284 | 29.38 | 21.98 | 53,365 | 37.49 | 24.81 |
Math SAT | 71,783 | 39.49 | 24.97 | 18,320 | 34.24 | 23.00 | 53,463 | 41.30 | 25.36 |
Weighted Violations | 107,689 | 3.75 | 6.01 | 27,761 | 4.99 | 7.03 | 79,928 | 3.31 | 5.55 |
Total | Participated in the MYS | Did Not Participate in the MYS | |||||||
---|---|---|---|---|---|---|---|---|---|
Observationsa = 107,689 | Observations = 14,448 | Observations = 93,241 | |||||||
Observations | Percent | Observations | Percent | Observations | Percent | ||||
Race | |||||||||
Black | 101,874 | 95.6 | 14,352 | 99.6 | 87,522 | 94.9 | |||
White | 4728 | 4.4 | 61 | 0.4 | 4667 | 5.1 | |||
Missing | 1087 | 35 | 1019 | ||||||
Sex | |||||||||
Girl | 52,816 | 49.0 | 6931 | 48.0 | 45,885 | 49.2 | |||
Boy | 54,873 | 51.0 | 7517 | 52.0 | 47,356 | 50.8 | |||
Missing | 0 | 0 | 0 | ||||||
Free Lunch | |||||||||
No | 25,193 | 23.4 | 1446 | 10.0 | 23,747 | 25.5 | |||
Yes | 82,493 | 76.6 | 13,002 | 90.0 | 69,491 | 74.5 | |||
Missing | 3 | 0 | 3 | ||||||
Observations | Mb | Sc | Observations | M | S | Observations | M | S | |
Grade | 107,686 | 7.23 | 1.82 | 14,447 | 7.18 | 1.77 | 93,232 | 7.24 | 1.83 |
Reading SAT | 71,649 | 35.42 | 24.38 | 9389 | 28.49 | 21.98 | 62,260 | 36.46 | 24.58 |
Math SAT | 71,783 | 39.49 | 24.97 | 9398 | 33.86 | 23.00 | 62,385 | 40.34 | 25.14 |
Weighted Violations | 107,689 | 3.75 | 6.01 | 14,448 | 5.59 | 7.03 | 93,241 | 3.46 | 5.68 |
Total | Retained in the MYS | Not Retained in the MYS | |||||||
---|---|---|---|---|---|---|---|---|---|
Observationsa = 10,691 | Observations = 7795 | Observations = 2896 | |||||||
Observations | Percent | Observations | Percent | Observations | Percent | ||||
Race | |||||||||
Black | 10,625 | 99.4 | 7754 | 99.7 | 2871 | 99.4 | |||
White | 41 | 0.4 | 23 | 0.3 | 18 | 0.6 | |||
Missing | 25 | 18 | |||||||
Sex | |||||||||
Girl | 5115 | 47.8 | 3735 | 47.9 | 1380 | 47.7 | |||
Boy | 5576 | 52.2 | 4060 | 52.1 | 1516 | 52.3 | |||
Missing | 18 | 0 | 0 | ||||||
Free Lunch | |||||||||
No | 1191 | 11.1 | 715 | 9.2 | 476 | 16.4 | |||
Yes | 9500 | 88.9 | 7080 | 90.8 | 2420 | 83.6 | |||
Missing | 0 | 0 | 0 | ||||||
Observations | Mb | Sc | Observations | M | S | Observations | M | S | |
Grade | 10,690 | 7.51 | 1.58 | 7794 | 7.51 | 1.57 | 2896 | 7.53 | 1.59 |
Reading SAT | 6524 | 27.42 | 21.38 | 4834 | 26.86 | 21.03 | 1690 | 29.02 | 22.28 |
Math SAT | 6514 | 32.44 | 22.40 | 4819 | 32.19 | 22.18 | 1695 | 33.13 | 5.68 |
Weighted Violations | 10,691 | 6.24 | 7.68 | 7795 | 6.44 | 7.89 | 2896 | 5.68 | 7.04 |
eligible to participate in the MYS during at least one pair of consecutive years. Observations reflect MCPSS records for the second of each pair of eligible years. Students who participated during the second year were classified as retained for that year, and students who did not participate during the second year were classified as not retained. Thus, the same student could be classified in each of the two groups for different years. The number of observations is considerably smaller than for enrollment and participation (10,691 total observations for retention versus 107,689 for enrollment and participation). As with
Model 2 shows that after controlling for demographic factors, both reading and math SAT percentiles and school violations statistically affect the probability of MYS enrollment. For SAT percentile scores, the effect was negative. For reading, the estimated probability of enrollment decreases from p = 0.251 for students one standard deviation below the reading mean to p = 0.227 for students one standard deviation above the reading mean; for math, it decreases from p = 0.244 to p = 0.233. In contrast, weighted school violations have a positive effect on enrollment, with the estimated probability increasing from p = 0.234 (M − S) to p = 0.244 (M + S).
Effect | Parameter | Model 1 Estimate | Model 1 SE^a | Model 1 Z | Model 1 p | Model 2 Estimate | Model 2 SE | Model 2 Z | Model 2 p | LS Mean^b | M^c − S^d | M + S | Effect Size (h′)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Intercept | b0 | −1.687 | 0.104 | −16.29 | <0.001 | −1.545 | 0.128 | −12.07 | <0.001 | | | |
Race | | | | | | | | | | | | | 0.387
White | b1 | −1.150 | 0.130 | −8.86 | <0.001 | −1.458 | 0.151 | −9.63 | <0.001 | 0.083 | | |
Black | | | | | | | | | | 0.218 | | |
Sex | | | | | | | | | | | | | 0.009
Boy | b2 | −0.023 | 0.029 | −0.79 | 0.428 | 0.002 | 0.029 | 0.08 | 0.937 | 0.134 | | |
Girl | | | | | | | | | | 0.137 | | |
Free Lunch Status | | | | | | | | | | | | | 0.044
Not Free Lunch | b3 | −0.129 | 0.014 | −9.31 | <0.001 | −0.153 | 0.019 | −7.78 | <0.001 | 0.128 | | |
Free Lunch | | | | | | | | | | 0.143 | | |
Neighborhood | b4 | | | | <0.001 | | | | <0.001 | | | |
Grade | b5 | 0.029 | 0.003 | 9.36 | <0.001 | 0.018 | 0.005 | 3.96 | <0.001 | | 0.221 | 0.238 | 0.040
SAT Reading | b6 | | | | | −0.003 | 0.004 | −6.71 | <0.001 | | 0.251 | 0.227 | 0.056
SAT Math | b7 | | | | | −0.001 | 0.004 | −3.13 | 0.002 | | 0.244 | 0.233 | 0.026
School Violations | b8 | | | | | 0.005 | 0.001 | 3.59 | <0.001 | | 0.234 | 0.244 | 0.023
^a Standard Error. ^b Least Squares means controlling for other variables in the equation; these have been converted to probabilities. ^c Mean. ^d Standard Deviation.
In addition to the terms reported in
By definition, participation rates are lower than enrollment rates (unless retention equals unity, which was not the case in the MYS, and rarely occurs in a community-based study).
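The overall enrollment, participation, and retention rates follow directly from the observation counts in the descriptive tables. The short calculation below reproduces them; the counts are those reported in the tables, while the variable names are illustrative.

```python
# Observation counts as reported in the descriptive tables.
TOTAL_OBS = 107_689               # student-year observations
ENROLLED_OBS = 27_761             # observations for students ever enrolled
PARTICIPATED_OBS = 14_448         # observations with participation that year
RETENTION_ELIGIBLE_OBS = 10_691   # observations eligible for retention
RETAINED_OBS = 7_795              # observations classified as retained

enrollment_rate = ENROLLED_OBS / TOTAL_OBS
participation_rate = PARTICIPATED_OBS / TOTAL_OBS
retention_rate = RETAINED_OBS / RETENTION_ELIGIBLE_OBS

print(round(enrollment_rate, 3))     # 0.258
print(round(participation_rate, 3))  # 0.134
print(round(retention_rate, 3))      # 0.729
```

Because retention is below 1, the participation rate is necessarily lower than the enrollment rate, as the paragraph above notes.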
Model 2 shows the effects of functional characteristics on participation, controlling for demographic characteristics. As was the case with enrollment, both reading and math SAT percentile scores were negatively associated with participation. For reading, the estimated probability of participation decreases from p = 0.150 for students one standard deviation below the reading mean to p = 0.122 for students one standard deviation above the reading mean; for math, it decreases from p = 0.144 to p = 0.127. Weighted school violations had a positive effect on participation, with the estimated probability increasing from p = 0.119 (M − S) to p = 0.154 (M + S).
Effect | Parameter | Model 1 Estimate | Model 1 SE^a | Model 1 Z | Model 1 p | Model 2 Estimate | Model 2 SE | Model 2 Z | Model 2 p | LS Mean^b | M^c − S^d | M + S | Effect Size (h′)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Intercept | b0 | −3.305 | 0.191 | −17.29 | <0.001 | −3.164 | 0.238 | −13.29 | <0.001 | | | |
Race | | | | | | | | | | | | | 0.381
White | b1 | −1.957 | 0.189 | −10.38 | <0.001 | −1.662 | 0.207 | −8.04 | <0.001 | 0.015 | | |
Black | | | | | | | | | | 0.095 | | |
Sex | | | | | | | | | | | | | 0.010
Boy | b2 | 0.033 | 0.029 | 1.11 | 0.246 | −0.055 | 0.034 | −1.63 | 0.103 | 0.039 | | |
Girl | | | | | | | | | | 0.037 | | |
Free Lunch Status | | | | | | | | | | | | | 0.094
Not Free Lunch | b3 | −0.497 | 0.027 | −18.34 | <0.001 | −0.565 | 0.039 | −14.46 | <0.001 | 0.030 | | |
Free Lunch | | | | | | | | | | 0.048 | | |
Neighborhood | b4 | | | | <0.001 | | | | <0.001 | | | |
Grade | b5 | 0.124 | 0.006 | 20.99 | <0.001 | 0.114 | 0.008 | 13.58 | <0.001 | | 0.119 | 0.159 | 0.116
SAT Reading | b6 | | | | | −0.005 | 0.001 | −6.01 | <0.001 | | 0.150 | 0.122 | 0.082
SAT Math | b7 | | | | | −0.002 | 0.001 | −2.36 | 0.019 | | 0.144 | 0.127 | 0.050
School Violations | b8 | | | | | 0.017 | 0.001 | 15.12 | <0.001 | | 0.119 | 0.154 | 0.102
^a Standard Error. ^b Least Squares means controlling for other variables in the equation; these have been converted to probabilities. ^c Mean. ^d Standard Deviation.
As was the case with enrollment, a supplemental version of Model 2 was also run with the twelve demographic × functional interaction terms. Two were statistically significant predictors of participation: school lunch status × SAT reading percentile (Z = −2.53, p = 0.011) and grade × school violations (Z = 2.55, p = 0.011). These results are not reported in
Effect | Parameter | Model 1 Estimate | Model 1 SE^a | Model 1 Z | Model 1 p | Model 2 Estimate | Model 2 SE | Model 2 Z | Model 2 p | LS Mean^b | M^c − S^d | M + S | Effect Size (h′)
---|---|---|---|---|---|---|---|---|---|---|---|---|---
Intercept | b0 | 0.172 | 0.304 | 0.57 | 0.571 | −0.107 | 0.431 | −0.25 | 0.804 | | | |
Race | | | | | | | | | | | | | 0.443
White | b1 | −0.893 | 0.364 | −2.46 | 0.014 | −0.658 | 0.431 | −1.53 | 0.127 | 0.351 | | |
Black | | | | | | | | | | 0.570 | | |
Sex | | | | | | | | | | | | | 0.006
Boy | b2 | −0.011 | 0.051 | −0.21 | 0.837 | −0.022 | 0.067 | −0.33 | 0.742 | 0.457 | | |
Girl | | | | | | | | | | 0.460 | | |
Free Lunch Status | | | | | | | | | | | | | 0.256
Not Free Lunch | b3 | −0.514 | 0.067 | −7.66 | <0.001 | −0.590 | 0.100 | −5.89 | <0.001 | 0.396 | | |
Free Lunch | | | | | | | | | | 0.523 | | |
Neighborhood | b4 | | | | <0.001 | | | | <0.001 | | | |
Grade | b5 | 0.011 | 0.016 | 0.69 | 0.489 | 0.057 | 0.025 | 2.25 | 0.024 | | 0.724 | 0.711 | 0.029
SAT Reading | b6 | | | | | −0.004 | 0.002 | −1.98 | 0.048 | | 0.715 | 0.665 | 0.108
SAT Math | b7 | | | | | 0.003 | 0.002 | 1.74 | 0.082 | | 0.701 | 0.680 | 0.045
School Violations | b8 | | | | | 0.013 | 0.011 | 1.21 | 0.225 | | 0.677 | 0.703 | 0.056
^a Standard Error. ^b Least Squares means controlling for other variables in the equation; these have been converted to probabilities. ^c Mean. ^d Standard Deviation.
of retention: for students with a SAT reading percentile score one standard deviation below the reading mean, p = 0.715 compared with p = 0.665 for students with a SAT reading percentile score one standard deviation above the mean.
None of the 12 demographic × functional terms included in the supplemental analysis were statistically significant.
The study of minority youths living in poverty has proven difficult, resulting in non-representative samples and high attrition rates (see [
In some studies, like those involving poverty, random sampling strategies may not be successful due to factors like high residential mobility within impoverished populations [
While matching sample demographic characteristics to those of the population is an important step in any study, it is seldom sufficient to demonstrate representativeness; the reason is largely due to the previously-discussed nesting of vulnerability that is particularly evident in hard-to-reach populations. When research questions involve beliefs, attitudes, and behaviors, it is also important to determine whether the sample is representative of the population in terms of functional characteristics like cognitive abilities and behaviors. This study assesses the extent to which the MYS sample deviates from the population of adolescents living in MYS neighborhoods, in terms of their enrollment, year-to-year participation, and retention, and therefore, the extent to which missing data may be nonignorable.
Results suggest that demographically, the MYS sample falls short of being representative of the population in three key areas. Sex, however, had no effect on any of the outcome variables, as measured by both statistical significance and effect size. In contrast, grade, race, and school lunch status did show some statistically significant results. In terms of grade, statistically different rates of MYS enrollment and participation were evident; for example, students one standard deviation below the grade mean were less likely to enroll than students one standard deviation above the mean (p = 0.221 versus p = 0.238). The difference, while statistically significant, is quite small (Δp = 0.017; h’ = 0.040). The effect for participation is also significant and also small (Δp = 0.040; h’ = 0.116). The statistical significance coupled with the small effect size is likely due to the very large sample size used for these analyses (e.g., for the enrollment demographic analysis, N = 107,689 observations across 10 years), which reduces the standard error of the estimate dramatically. We see this same outcome (a statistically significant estimate coupled with a small effect size) throughout the results. The differences for grade, although small, may nonetheless suggest the possibility that parents of younger students were more likely to withhold consent for MYS enrollment or participation; it may also suggest the possibility that younger students were less likely to venture out of their homes and walk to the survey administration site.
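The interplay between sample size and statistical significance can be illustrated with a simple two-proportion z statistic. This is a deliberate simplification: it treats the two groups as independent samples and ignores the repeated measures and covariates handled by the GEE models, so the group sizes below are hypothetical and the z values are not the study's test statistics. Only the two probabilities (0.221 and 0.238, the grade effect on enrollment) come from the results.

```python
import math


def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> float:
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return (p2 - p1) / se


# Same difference in probability (0.017), increasing hypothetical group sizes.
P_LOW, P_HIGH = 0.221, 0.238
z_by_n = {n: two_proportion_z(P_LOW, P_HIGH, n, n) for n in (500, 5_000, 50_000)}

for n, z in z_by_n.items():
    print(n, round(z, 2))
```

The effect size is identical in all three cases; only the standard error shrinks, in proportion to one over the square root of the group size, so a difference that is far from significant with 500 cases per group becomes highly significant with 50,000.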
Race has a significant effect on both enrollment and participation; but here, effect sizes border on robust (Δp = 0.135, h’ = 0.387 for enrollment; Δp = 0.080, h’ = 0.381 for participation). Perhaps because very few White students lived in the MYS neighborhoods (4.4%), the perception may have developed that the MYS was a study of Black adolescents, and White adolescents largely self-selected out of the study. Alternatively, demographic distributions may not be geographically uniform within any given neighborhood, and there is evidence [
In the introduction, we suggested that missing data are a bigger problem in studying vulnerable and hard-to-reach populations (e.g., racial minorities, those living in poverty) than in studying populations that are not vulnerable or hard to reach. Even though our reported results suggest that the MYS sample may not be strictly representative of the population, those for race and school lunch status run contrary to what we would expect in studying hard-to-reach populations: racial minorities and impoverished youths were more likely to enroll and participate than their non-minority and less-impoverished counterparts. Thus, since the purpose of the MYS was to study the most vulnerable youths, these differences suggest that it largely succeeded in its mission. We should note, however, that the neighborhoods were overwhelmingly Black and impoverished, so these differences may be less important than their probabilistic magnitude would suggest. One additional explanation might argue that since MYS participants were paid for their time, those living in greater poverty (i.e., receiving free lunches) would be more likely to participate. While this explanation may be partially valid, its importance is undermined by three factors: (a) the size of the payments ($10 or $15, depending on the year) was small; (b) the neighborhoods were impoverished, and even those students who did not qualify for free lunch did not live in wealthy households; and (c) the fact that Blacks were overrepresented in the MYS, even controlling for free lunch status, suggests that the most vulnerable adolescents in the neighborhoods chose to participate in the MYS.
Even though SES is related to cognitive ability [
Overall, demographic variables did not interact with functional variables to affect outcomes. Of the 36 variable pairs tested across three outcomes, only four results achieved statistical significance, and only one pair achieved significance across two different outcomes: the grade by school violations interaction was a significant predictor of both enrollment and participation, but not of retention (p = 0.241). Further examination showed that the effect of school violations on both enrollment and participation became increasingly positive as grade increased. The MYS was not conducted in schools during the school year, so school discipline (e.g., suspension) as a response to violations would not explain the effect. However, younger adolescents are more likely than older adolescents to be subject to parental monitoring and restrictive rules―rules that might prevent them from MYS participation. Thus adolescents who were “in trouble” (e.g., who had been disciplined for violating school rules) may well have come under even greater parental monitoring and restrictive rules, and their MYS enrollment and participation may have suffered as a consequence. But overall, demographic and functional characteristics did not interact in predicting outcomes, and it is reasonable to treat them as separate characteristics in the analysis.
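One way to see why four significant interactions are unremarkable is to ask how many would be expected by chance alone. The sketch below assumes, purely for illustration, that each of the 36 variable pairs was tested once per outcome (108 tests in total), that the tests are independent, and that the significance level is 0.05; none of these assumptions is stated in this section.

```python
from math import comb


def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_pairs = 36      # variable pairs, from the text
n_outcomes = 3    # enrollment, participation, retention
alpha = 0.05      # ASSUMED significance level (not stated in this section)

# ASSUMPTION: every pair is tested against every outcome, independently.
n_tests = n_pairs * n_outcomes

# Number of "significant" results expected if no interaction truly exists.
expected_false_positives = n_tests * alpha

# Probability of seeing at least four significant results by chance alone.
p_at_least_four = binom_tail(n_tests, 4, alpha)

print(f"tests = {n_tests}, expected by chance = {expected_false_positives:.1f}")
print(f"P(at least 4 significant | no true effects) = {p_at_least_four:.2f}")
```

Under these assumptions, four significant results are fewer than the number expected by chance, which is consistent with treating demographic and functional characteristics as separate in the analysis. If the 36 tests were instead the total across all outcomes, `n_tests` would be 36 and the expected count would be correspondingly lower.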
One methodological note is of importance. The rates of enrollment (0.258) and annual participation (0.134) are quite low, although the annual retention rate (0.729) is quite reasonable in a study of this type. In the 13 initial target neighborhoods, the enrollment and participation rates were much higher than these estimates suggest; for example, in the largest of the initial target neighborhoods, the 1998 participation rate was approximately 0.50, and the enrollment rate for that neighborhood was approximately 0.75 [
The general conclusion, then, is that, with the exception of race (and to a lesser extent, school lunch status), the demographic and functional characteristics of the MYS sample were very similar to those of students living in MYS neighborhoods who were not enrolled in the study. Even the racial differences in enrollment, participation, and retention and the school lunch difference in retention provide an indication that missing data was least common in the most vulnerable segment of this population. Thus, the hardest-to-reach segment of this hard-to-reach population was not merely as likely as less vulnerable segments to enroll, participate, and remain in the study, but actually more likely to do so. In terms of the question posed earlier, we apparently were able to study this population without sampling or retention bias. It is, then, possible to reach hard-to-reach populations, in this case low-income minority adolescents.
We should briefly describe how we conducted the study, because that potentially influenced the results we were able to obtain. First, written parental consent was obtained for all enrollees. In doing so, we explicitly asked for consent for each adolescent to participate each year until he or she turned 19 (when the participant aged out). This allowed each enrollee to participate each year, whether or not direct contact with him or her or a parent/guardian occurred during that year. Word spread quickly that the MYS was in progress each year, and it was well regarded in each of the neighborhoods (to the point where study participants were overheard bragging to each other about how many times they had participated). Thus, each year many adolescents participated even though they were not individually and/or directly recruited.
How did the MYS establish a positive reputation in these neighborhoods? We can only speculate, but several factors may be relevant. First, it was a community-based survey, and the research team spent considerable person-hours in each neighborhood each year knocking on doors and talking with both adults and youths. In other words, the MYS had a very visible presence in each neighborhood, and the survey became a special event in neighborhoods where special events are rare. Certainly, the fact that participants were paid for their time was important, but the rate of pay was considerably lower than what is often paid in similar studies. Surveys were also administered in the neighborhoods where each participant lived or, if that was inconvenient for the participant, in his or her home. Arguably, then, the fact that the research team was willing to go into residents' neighborhoods (which many team members recognized as potentially dangerous places) and their homes earned a great deal of respect.
Finally, a word about the members of the research team is warranted. Each year, an internship was offered, at first to college students in Alabama, then increasingly to students nationally. Small stipends were provided to the interns to cover their living expenses. In deciding whom to select for the internship (applicants typically outnumbered available positions by a factor of three to four), priority was given to applicants with (a) research experience (preferably in the field) and (b) an appreciation for the effects of poverty and a strong desire to better understand how it affects people’s lives. Thus, the interns were very respectful of the people with whom they interacted, and they were very good listeners, learning as much from their day-to-day experiences in the neighborhoods and interactions with neighborhood residents as from the actual data they collected.
Strengths and Limitations
The commitment of the research team, coupled with the time it spent in each neighborhood recruiting and surveying adolescents, allowed it to gain the respect and trust of neighborhood residents. While undoubtedly a strength of the MYS, this came at a monetary cost: approximately $200,000 per year to survey between 2000 and 3000 respondents. Not all research endeavors will have this type of budget. Moreover, the fact that the MYS research team returned year after year also contributed to the trust and respect that was established (the smallest sample was obtained during Year 1, and the sample increased nearly each year thereafter); this long-term commitment also increased the overall budget for the study. So, the conclusions about studying a hard-to-reach population without selection bias may be limited by budgetary constraints.
Second, the auxiliary dataset we used to establish the representativeness of the study was age-limited. Within the MYS neighborhoods, nearly all youths attend public schools; but they are only required to do so through age 15, since they can legally drop out at age 16. School dropout was a particular problem in the MYS neighborhoods, where high-school graduation rates were as low as 30%. Thus, we limited the analysis of MYS participants and non-participants to youths aged 10 through 15. We are therefore not able to draw any conclusions about the representativeness of the older segment (aged 16 - 18) of the MYS sample. However, given the findings for the younger MYS participants (aged 10 - 15), and without any theoretical reason to believe that they should be different for older youths, this may be a relatively minor limitation.
Finally, measures of functional characteristics used in this study do not directly correspond to risk behaviors in the MYS; therefore, we cannot say with certainty that the representativeness we found extends to all cognitive, attitudinal, and behavioral domains. However, even the limited set of measures we use goes well beyond what most studies have available to address the issues of representativeness and missing data in hard-to-reach populations.
This study suggests that researchers can conduct studies of impoverished adolescents without bias, so long as they carefully attend to issues of establishing legitimacy, respect, and trust in the communities they study. These findings may also generalize to other hard-to-reach populations, although more research is needed to make this leap. Unfortunately, we do not know definitively how to establish legitimacy, respect, and trust in vulnerable communities. Our previous discussion suggests ways that this may be accomplished, and other papers (e.g., [
Over 60 papers have been published using the MYS data. This study benefits those papers (and future papers that use MYS data) by suggesting that missing data in the MYS sample are largely ignorable. Our results indicate that although the MYS sample (ages 10 through 15) is not strictly representative of the population demographically, the deviations suggest that those who were eligible but did not participate did so for largely ignorable reasons. Moreover, the survey research literature shows lower response rates for minorities and people living in poverty. The fact that we find higher rates of enrollment and year-by-year participation for Black adolescents and those who are eligible for free lunch suggests that the strategy of focusing on the most at-risk segments of the MYS neighborhoods was successful, and it further supports the idea that the neighborhood per se is an inappropriate sampling frame. Further, results show that, functionally, although there are significant effects for reading and math scores and for school violations, these differences are quite small and show higher enrollment, participation, and retention rates for the most vulnerable segment of the population. Thus, the results provide support for treating missing data in the MYS as ignorable.
Finally, and perhaps most important, this research extends results and theoretical arguments by others (e.g., [
The research reported here was partially supported by the National Institutes of Health Office for Research on Minority Health through a cooperative agreement administered by the National Institute for Child Health and Human Development (HD30060); a grant from the Center for Substance Abuse Treatment, Substance Abuse and Mental Health Services Administration (TI13340); a grant from the National Institute on Drug Abuse (DA017428); a grant from the Centers for Disease Control and Prevention (CE000191); a grant from the National Institute for Child Health and Human Development (HD058857); The University of Alabama; the cities of Mobile and Prichard; the Mobile Housing Board; and the Mobile County Health Department.
Bolland, A.C., Tomek, S. and Bolland, J.M. (2017) Does Missing Data in Studies of Hard-to-Reach Populations Bias Results? Not Necessarily. Open Journal of Statistics, 7, 264-289. https://doi.org/10.4236/ojs.2017.72021