2012. Vol.3, Special Issue, 762-774
Published Online September 2012 in SciRes (
Copyright © 2012 SciRes.
Scalar Equivalence in Self-Rated Depressive Symptomatology as
Measured by the Beck Depression Inventory-II: Do Racial and
Gender Differences in College Students Exist?
Lisa M. Hooper1,3, Lixin Qu1, Cindy A. Crusto2, Lauren E. Huffman1
1The University of Alabama, Tuscaloosa, USA
2Yale School of Medicine, The Consultation Center, New Haven, USA
3Hooper Research Lab, Tuscaloosa, USA
Received June 9th, 2012; revised July 12th, 2012; accepted August 11th, 2012
Using item response theory and confirmatory factor analysis, the current investigation examined the
equivalence in responses derived from the widely used 21-item Beck Depression Inventory-II (BDI-II;
Beck, Steer, & Brown, 1996) among 1229 college students (mean = 21.15, SD = 6.19) in the United
States. Results from differential item functioning analyses indicated that the items endorsed by Black
American and White American college students were slightly different. However, items endorsed by fe-
male and male college students were almost invariant. The results of the study found partial support for
using the BDI-II in college student populations. Directions for future culturally tailored assessment and
research are proffered.
Keywords: Differential Item Functioning (DIF); Depression; Depressive Symptoms; Race; Gender;
Scalar Equivalence; Item Response Theory (IRT); Confirmatory Factor Analysis;
American College Students
In 2007, depression-related suicide was the third leading
cause of death for adolescents and emerging adults (ages 12 to
24) (Centers for Disease Control and Prevention, 2010). Depression
is one of the most significant, disabling, and deleterious mental
health disorders in all populations, including college students
(American College Health Association, 2009; Blanco et al.,
2008; Hankin, 2002; World Health Organization, 2002). For
emerging adults, depression is the most common clinical disorder,
with prevalence rates estimated to approach 11% (American
College Health Association, 2009; Blanco et al., 2008). Importantly,
findings from the National Comorbidity Study-Revised (NCS-R;
Kessler et al. 2003) suggested that many adults who had a re-
ported history of a depressive episode in the previous year
failed to receive adequate treatment (i.e., guideline-concordant
care such as pharmacotherapy or psychotherapy; see American
Psychiatric Association, 2000b; Gonzalez, Vega, Williams, Tar-
raf, West, & Neighbors, 2010) for their depression. Therefore,
there is an urgent need to better understand why depression
continues to be undertreated and often undetected, including in
college student populations (Carmody, 2005; Kadison, 2004;
Kisch, Leino, & Silverman, 2005; Tjia, Givens, & Shea, 2005).
One factor that may account for the undertreatment of de-
pression is the lack of detection of depressive symptoms in in-
dividuals (American College Health Association, 2009; Car-
mody, 2005; Hooper, 2010; Leino & Kisch, 2005). An obvious
first step in uncovering factors that may impede the effective
treatment of depression is clarifying the barriers that affect
providers’ ability to detect depressive signs and symptoms and
their competency to make an accurate diagnosis. Toward this
end, instruments or assessment tools that produce reliable and
valid scores are paramount (Boughton & Street, 2007). An ad-
ditional consideration is the extent to which instruments are
culturally, linguistically, and clinically sensitive (Anderson &
Mayes, 2010; Manly, 2006). With the increasing focus on racial
and cultural diversity in the human helping disciplines (see
Chao & Otsuki-Clutter, 2011; Day, 1996; McHorney & Fleischman,
2006) discussions on the extent to which assessment, diagnosis,
and treatment methods are culturally responsive and relevant
are important and timely. McHorney and Fleischman (2006)
suggested, “If items in outcome measures are biased, detection
rates can be biased (overestimated or underestimated), leading
to over- and under-detection and over- and under-treatment” (p.
In this article, we first provide a brief overview of the im-
portance of measures that compose scores that reliably and
validly assess for depressive symptoms. Then we review the
empirical literature on depression and depressive symptoms in
college students, including the implications of race and gender
for the presentation, detection, and screening of depression.
Next we describe the research design and summarize the results
of the current investigation conducted to test the extent to
which the Beck Depression Inventory-II (BDI-II; Beck, Steer,
& Brown, 1996) demonstrates scalar equivalence for female
and male college students and for Black American and White
American college students. We describe two rigorous different-
tial item functioning (DIF) analytic procedures—item response
theory and confirmatory factor analysis—that were employed
to detect if the endorsements of depressive symptoms among
these four groups are biased or equivalent. We conclude with
the implications of the findings and directions for culturally
tailored assessment and future research.
Background Literature
The Healthy People 2020 Initiative characterizes major de-
pression as a national priority (US Department of Health and
Human Services, n.d.). Empirical studies have suggested that
the manifestation and characterization of depressive symptoms
may be influenced by demographic, familial, and ecological
factors (e.g., race, gender, neighborhoods, and discrimination)
(Anderson & Mayes, 2010; Gregorich, 2006; Iwata & Buka,
2002). Therefore, another step in optimally treating depression
and accurately detecting depressive signs and symptoms in-
volves ensuring that racially and culturally diverse individuals
are included in research studies that examine diagnosis and
treatment methods and measures for medical conditions and
mental health disorders such as depression (see Manly, 2006).
The mandate from the National Institutes of Health (NIH) and
the recently established National Institute of Minority Health
and Health Disparities explicitly underlines this proposition: the
inclusion of racial and cultural minority participants in research
studies to inform practice, which includes reliable assessments
and accurate diagnoses (National Institutes of Health, 2002;
National Institute of Minority Health, 2010). Other mental
health services and intervention researchers have also under-
scored the importance of inclusion of vulnerable and racial
minority populations in the development of measures and con-
trolled clinical trials (Anderson & Mayes, 2010; Manly, 2006;
McHorney & Fleischman, 2006; National Institute of Mental
Health, 2010; National Institutes of Health, 2002; Paniagua,
1994; Sperry, 2010).
Depression in College Students
Some scholars have asserted that depression and anxiety
disorders are the leading clinical issues with which college
students must contend and with which health care providers
(e.g., college counseling center staff) must be prepared and
competent to face (Carmody, 2005; Kisch et al., 2005). Depression
care guidelines and published recommendations have relevance
for the current investigation about one element of depression
care: the assessment of depressive symptoms. With regard to
the criticality of assessment in depression care, empirical research
consistently has found an association between depressive symp-
toms and suicide behavior as well as other negative sequelae (e.g.,
anxiety symptomatology, disordered eating behaviors and at-
titudes, alcohol and drug use, and interpersonal violence) in
college student populations (Centers for Disease Control and
Prevention, 2010; Kisch et al., 2005; Wilcox et al., 2010). Taken
together, these are significant clinical issues and functional con-
sequences that are often evinced in college and university popu-
lations (American College Health Association, 2009; Arria et
al., 2009; Blanco et al., 2008; Centers for Disease Control and
Prevention, 2010; Eisenberg, Gollust, Golberstein, & Hefner, 2007;
Furr, Westefeld, McConnell, & Jenkins, 2001; Kisch et al., 2005;
Nolen-Hoeksema, 1990; Wilcox et al., 2010). As previously men-
tioned, suicide is also currently the third leading cause of death
among adolescents and emerging adults (Centers for Disease
Control and Prevention, 2010). Because of the life-threatening
element (i.e., suicidal ideation and suicidality) highly associated
with depression, scores that are derived from measures that
reliably and validly capture depression are paramount.
Toward this end, reliable and valid assessments (i.e., scores)
to capture depressive symptoms are needed, but assessments
that demonstrate cultural and linguistic equivalence also are
needed (Anderson & Mayes, 2010; Eisenberg et al., 2007; Manly,
2006). Iwata and Buka (2002) stated, “Specific response pat-
terns and psychometric properties of assessment instruments
across ethnic/cultural populations require further investigation”
(p. 2243). Consistent with most psychometricians’ suggestions
(see Borsboom, 2006), we believe the ideal scenario is that
widely used assessments such as the BDI-II (Beck et al., 1996)
should be equivalent (i.e., absent from item- and scale-level
biases) across cultural and ecological factors, such as race, gen-
der, geographical regions, socioeconomic statuses, and so forth.
Depression and Gender
A commonly reported claim is that depression is more pre-
valent in females than males (Beck et al., 1996; Hankin, 2002;
Kessler et al., 1994; Nolen-Hoeksema, 1990; World Health Or-
ganization, 2002), although this commonly recounted assertion is
based primarily on cross-sectional studies. The results that have
accumulated from epidemiological studies offer some support
for gender-related differences in depression and depressive symp-
toms, and they add to the results derived from cross-sectional stu-
dies. Epidemiological studies have suggested that gender dif-
ferences in depression and depressive symptoms emerge during
adolescence (see Hankin, 2002; Hankin & Abramson, 2001;
Nolen-Hoeksema, 1990; Nolen-Hoeksema & Girgus, 1994; Rao
& Chen, 2009). Results from studies composed of college students
have suggested the relation among gender and depression and
depressive symptoms is inconsistent (see Gladstone & Koenig,
1994; Nolen-Hoeksema & Girgus, 1994; Silverstein, 1999; Steer
& Clark, 1997). Therefore, a more accurate refrain may be that
gender-related differences in depression and depressive symptoms
are equivocal, in particular among college-aged populations.
Indeed, gender-related variances in depressed mood and
symptoms are unclear and not well understood—not only during
the developmental stage of emerging adulthood but also across
the entire lifespan (Eaton et al., 2011; Hooper, 2010; Rao &
Chen, 2009). Moreover, the commonly reported assertion that
women have higher levels of depressive symptoms and greater
prevalence rates of major depressive disorder is not consistently
found in the empirical literature. For example, Steer and Clark
(1997) found that male college student-respondents reported
levels of depressive symptoms similar to those reported by
female college student-respondents. In another example, Silverstein
(1999) suggested that gender differences disappear when anxi-
ety and somatic symptoms are statistically controlled for. In
other words, when somatic and anxiety symptoms are statisti-
cally controlled for gender differences related to depression are
nonexistent. Silverstein found “large gender differences in the
prevalence of anxious somatic depression among samples of
high school students, college students, and adults” (p. 480). He
concluded, however, that there are no gender differences in pure
depression. This conclusion is buttressed by findings in other
empirical studies (see Gladstone & Koenig, 1994; Nolen-Hoe-
ksema & Girgus, 1994).
In contrast, in a study regarding the psychometric properties
of the BDI-II in a college student sample, Carmody (2005)
found statistically significant gender differences. Specifically,
Carmody reported that female participants had higher mean
scores for depressive symptoms than their male counterparts.
Moreover, the college student participants’ scores reported in
Carmody’s study were comparable to those self-rated scores of
Copyright © 2012 SciRes. 763
college students found in the Beck et al.’s (1996) validation
study. Carmody also explored differences in respondents’ item-
level scores derived from the BDI-II based on ethnicity and
gender, finding differences at the item level based on gender
(i.e., BDI-II items 1, 15, 10, and 20). Osman and colleagues (1997)
also found statistically significant differences in gender-based
comparisons of depressive symptoms: Female college students
reported higher levels on six items (BDI-II items 1, 7, 10, 17,
20, and 21) than male college students reported. With regard to
BDI-II total scale score comparisons in Osman et al.’s study,
there were significant difference between males (mean = 9.41)
and females (mean = 11.88) as well.
Relevant to the current investigation, few studies have
investigated the psychometric properties (i.e., scalar equivalence
for females and males) of the measures that are often used to
assess for gender differences. The current investigation allows
for cross-gender comparisons at the item level. Determining
whether the often-reported gender differences in depression and
depressive symptomatology are true and real, rather simply
reflecting differences in items or measurement bias (i.e., DIF)
evidenced on the BDI-II, has implications for assessment,
diagnosis, and treatment on college campuses and in the broader
clinical community. Clarifying the role of gender in the mani-
festations of depression has important implications for optimal
depression care and management (assessment, diagnosis, and
treatment). It may be that gender-focused and gender-tailored
treatment for depression may be more efficacious and effective
than current treatment practices (see Anderson & Mayes, 2010;
Eaton et al., 2011; van de Vijver & Tanzer, 2006). Hankin and
Abramson (2001) and Rao and Chen (2009) indicated that the
gender differences seen in some studies are consistent and that
they consistently occur across race and ethnic groups, although
Culbertson (1997) suggested that the converse is true, that is,
gender differences vary based on cultural factors such as racial
and ethnic group membership.
Importantly, some of the disagreement in the literature
related to the differential effects of gender on depression and
depressive symptoms may be explained by measurement issues
(Anderson & Mayes, 2010). Boughton and Street (2007) con-
tended that gender differences related to depression may be a
result of the questions or items that appear on depression
screening tools. Specifically, they asserted, “These questions
may reflect too narrow of a definition of depression that fails to
include symptoms associated with depression in men” (p. 194).
They concluded that some self-rated assessment tools may
overestimate depressive symptoms in women and underestimate
depressive symptoms in men, leading to inaccurate depression
care (assessment, diagnosis, and treatment recommendations)
and prevalence rates.
Cochran and Rabinowitz (2003) advocated for gender-sensitive
assessment and intervention strategies of depression given that
the correlates, symptom presentation, and course of depression
can be and often are different for men and women. For instance,
compared to women, depression in men is more likely to be
related to issues such as gender-role conflict, and the symptom
presentation is more likely to be related to issues such as ag-
gression, physical and sexual risk-taking, chronic anger, inter-
personal conflict, work-related conflict, substance use and abuse,
and criminal behavior (Kilmartin, 2005). Given the masculine-
related presentation of depressive symptoms that may exist, is it
essential that instruments used to assess depression are sen-
sitive to these gender-related differences and accurately identify
and classify depression in males and females (see Fields &
Cochran, 2011).
Depression and Race
Compared to what is known about the relation between
depression (and depressive symptoms) and gender, even less is
known about possible differences in the manifestations of
depression and depressive symptoms based on race (Coyne &
Marcus, 2006; George & Lynch, 2003). The Unequal Treatment
report of the Institute of Medicine (2002) outlined numerous
factors that may relate to presentation of mental health symptoms,
misdiagnosis of mental health disorders, and differential treat-
ments and services, including depression care, based on race.
Specifically, racial and cultural factors have long been con-
jectured to relate to the establishment of accurate diagnoses
(e.g., depression, schizophrenia, bipolar disorder, and eating
disorders). However, as contended by George and Lynch, “The
existence, nature, and strength of race differences in mental
health remain unclear after several decades of research” (p.
353). In addition, George and Lynch suggested, “Despite a
voluminous research base, the basic question of whether blacks
and whites differ in levels of depression and psychological
distress remains unclear” (p. 353). A careful review of the
empirical literature reveals the lack of consistency in race-focused
empirical studies. Therefore, similar to gender differences, the
most accurate refrain for the variance in depression and depressive
symptoms based on race may be that race-related variances in
depression and depressive symptoms are equivocal.
The lack of clarity and definitiveness about real differences
in depression and depressive symptoms based on race have
been described in the literature. George and Lynch (2003)
suggested that some of the lack of clarity in the literature results
because researchers have drawn conclusions based on the
combined differential effects both of a diagnosis of depression
and of depressive symptoms. When a total scale score from a
given measure (e.g., BDI-II, Center for Epidemiologic Studies
Depression Scale [CES-D; Radloff, 1977], and so forth) is used
to make comparisons, the results may be different from those
obtained from comparisons based on the items of the measure
(see Teresi, Ramirez, Lai, & Silver, 2008).
Some empirical evidence derived from adult-focused studies
has shown differential depressive symptoms as well as a dif-
ferential probability of being diagnosed with depression may be
based on race (Coyne & Marcus, 2006; Dunlop, Song, Lyons,
Manheim, & Chang, 2003). For example, Leino and Kisch
(2005), in their study of college students, reported that Black
American students were less likely to be diagnosed with de-
pression than their White American counterparts. Moreover,
cross-sectional and epidemiological studies have—for the most
part—demonstrated racial and ethnic differences in depression
and depressive symptoms in adult samples (Kessler et al., 1994).
With relevance to the current study, few studies have ex-
amined DIF in depression measures. Most studies that have
been conducted have focused on the CES-D (see Teresi et al.,
2008, for a comprehensive review). Fewer studies have ex-
amined race-related endorsement patterns based on items of the
BDI-II (Beck et al., 1996). Only one study was located that has
examined DIF in college student populations: Carmody (2005),
who examined the psychometric properties of the BDI-II with a
sample of racially diverse college students. He found variance
Copyright © 2012 SciRes.
in items endorsed by racially diverse American students.
Specifically, DIF was evidenced on three items (BDI-II items
11, 14, and 17). White American students had higher scores on
item 11 (agitation) and item 14 (worthlessness) than did Asian
American students. White American students also had higher
scores on item 17 (irritability) than did Latino American stu-
dents. It is noteworthy that Carmody’s study of college students
found no differences in the BDI-II total score based on racial
groups. Carmody concluded that the lack of differences in
depressive symptom profiles based on race (only three items
resulted in DIF) may be related to the commonality of the
college experience, or else “college school culture” may have
superseded or attenuated any racial or ethnic differences re-
lative to the depressive symptomatology.
The BDI-II has been used with a range of populations, in-
cluding racially diverse college students (Carmody, 2005; Storch,
Roberti, & Roth, 2004; Whisman, Perez, & Ramel, 2000; Wilcox
et al., 2010). Despite considerable research supporting differ-
ences in depression and depressive symptoms based on race,
some evidence indicates there are no differences among racial
and ethnic groups as well. Given the dearth of DIF studies,
more research focused on scalar equivalence is clearly needed.
The Current Investigation
Our brief review of the empirical literature suggests that
cultural differences in the presentation, manifestation, and
endorsement of select depressive symptoms often—but not
always—vary by race and gender in many populations (Boughton
& Street, 2007; Eaton et al., 2011). However, the accumulated
results for racial and gender differences in the specific
population of college students remain most unclear.
As previously mentioned, one of the instruments most commonly
used to screen for a probable diagnosis of depression and de-
pressive symptomatology is the BDI-II (Beck et al., 1996). In
spite of its wide use with a range of diverse populations,
including college students, few studies have examined the
extent to which the inventory demonstrates scalar equivalence
in college student populations as well as other populations.
Some researchers have suggested that the factor structure of the
BDI-II differs based on the type of sample (e.g., clinical vs.
nonclinical) (Carmody, 2005; Storch et al., 2004; Teresi et al.,
2008). Likewise, it is assumed that the factor structure of the
BDI-II may vary based on the cultural background of the
sample (Black Americans vs. White Americans). Supporting
the need for the current investigation, Santor, Zuroff, Cervantes,
Palacios, and Ramsay (1995) stated, “How individuals endorse
items on a depression inventory may vary across items on a
single measure of depression, across measures of depression, as
well as across levels of depressive severity” (p. 131; emphasis
added). Moreover, the validity of the findings derived from the
BDI-II is only as good as the validity of the scores that are
derived from the BDI-II and the items that compose it (Harachi,
Choi, Abbott, Catalano, Bliesner, 2006; McHorney & Fleischman,
2006; van de Vijver & Tanzer, 2004). Given that establishing
validity is an ongoing process, studies that add to the ac-
cumulating evidence in the literature on the possible biases and
equivalence of the BDI-II scores and items are important and
needed (Schmidt & Hunter, 2003). The current investigation
fills a gap in and contributes to the depression literature by
examining DIF among the 21 BDI-II items. More specifically,
we used item response theory (IRT) modeling and confirmatory
factor analyses (CFA) to assess DIF among the BDI-II items.
Based on the gaps in the literature and the methodological
benefits of IRT, we established two research questions to guide
the current investigation: 1) To what extent does the BDI-II
(Beck et al., 1996) provide equivalent scalar measurement for
depressive symptoms in Black American and White American
college students? And 2) to what extent does the BDI-II (Beck
et al., 1996) provide equivalent scalar measurement for depressive
symptoms in female and male college students?
Participants and Procedure
Participants were a convenience cross-sectional sample of
1229 students from a large state university in the southeastern
region of the United States. The sample included 145 Black
American students and 1031 White American students. Gender
samples were nearly equivalent; the study sample included 684
female participants and 545 male participants. Mean age in
years for the sample was 21.15 (SD = 6.19). Year of school was
almost evenly distributed among freshman, sophomore, junior,
and senior levels (see Table 1). Participants reported low levels
of depressive symptomatology. BDI-II mean scores based on
self-reported race were 7.7 (SD = 7.5) and 8.7 (SD = 8.3) for
Black American and White American students, respectively.
BDI-II mean scores were 8.6 (SD = 8.2) and 8.5 (SD = 8.5) for
females and males, respectively.
Following approval from our Institutional Review Board, we
recruited participants in undergraduate-level classrooms and then
later by email. Study invitations were sent to students through
university email lists and individual class emails. We adminis-
tered the electronic survey packet online using a web-based
survey protocol. Before beginning the survey, participants viewed
and electronically signed the study’s informed consent form.
The online survey included a demographic information survey
and the BDI-II (Beck et al., 1996). The BDI-II and the demo-
graphic data sheet were presented in English. Extra course
credit was provided both as an incentive and as compensation
for time and effort involved in participating in the study.
Demographic Information. A researcher-designed demographic
information sheet was created for the investigation. Questions
inquired about the participant’s year in school, academic dis-
cipline, religious affiliation, age, gender, and racial and ethnic
Beck Depression Inventory-II. We used the BDI-II (Beck
et al., 1996) to assess each participant’s level of depressive
sympoms during the preceding 14 days. The BDI-II consists of
21 self-rated questions that assess for depressive symptomatology
consistent with the criteria for major depressive disorder de-
lineated in the Diagnostic and Statistical Manual of Mental
Disorders-IV (4th ed., text rev.; DSM-IV-TR; American Psy-
chiatric Association, 2000a). Participants are asked to select the
option that best corresponds to the way they have been feeling
during the past two weeks. Responses are self-rated on a four-
point Likert-type scale: 0 (absence of symptoms) to 3 (severe
presence of symptoms).
The BDI-II is scored by summing the participant’s response
for each of the 21 BDI-II items (Beck et al., 1996). Scores
range from 0 to 63; higher scores reflect greater severity of
Copyright © 2012 SciRes. 765
Copyright © 2012 SciRes.
Table 1.
Demographics of study samples by gender and race for beck depression Inventory-I.
Gender (n = 1229) Raceb (n = 1176 )
characteristic Female (n = 684) Male (n = 545) Black American (n = 145) White American (n = 1031)
No. of students (%)/
mean (SD)
No. of students (%)/
mean (SD)
No. of students (%)/
mean (SD)
No. of students (%)/
mean (SD)
Age, years 20.9 (3.8) 20.6 (3.0) 22.6 (6.3) 20.5 (2.6)
Gender, female 63 (43%) 454 (44%)
Black American 82 (12%) 63 (12%)
White American 571 (83%) 454 (83%)
School yeara
Freshman 140 (21%) 81 (15%) 15 (10%) 188 (18%)
Sophomore 200 (30%) 195 (36%) 55 (38%) 330 (32%)
Junior 208 (31%) 162 (30%) 38 (26%) 315 (31%)
Senior 115 (17%) 89 (17%) 31 (22%) 170 (17%)
BDI-II mean score 8.6 (8.2) 8.5 (8.5) 7.7 (7.5) 8.7 (8.3)
BDI-II score (0 to 12) 526 (77%) 414 (76%) 116 (80%) 785 (76%)
BDI-II score (13 to 19) 84 (12%) 68 (12%) 12 (8%) 133 (13%)
BDI-II score (20 to 63) 74 (11%) 63 (12%) 17 (12%) 113 (11%)
Note: aFive participants failed to report year of school; bFifty-three participants failed to report race.
depressive symptomatology and a greater probability of a clinical
diagnosis of major depression. Beck and colleagues reported
that scores of 16 or greater point to a probable diagnosis of
depression. Beck and colleagues also suggested the following
descriptions and interpretations related to severity: scores of 0
to 13 reflect minimal severity; 14 to 19 reflect mild severity; 20
to 28 reflect moderate severity; and scores of 29 or greater
reflect severe symptomatology.
With regard to reliability, scores from the BDI-II have been
shown to have sound internal stability. Studies using the BDI-II
have reported alpha coefficients ranging from .77 to .92 (Carmody,
2005; Dozois, Dobson, & Ahnberg, 1998; Hirsch, Webb, &
Jeglic, in press; Hooper & Doehler, 2011; Osman et al., 1997;
Whisman et al., 2000). For comparison, the original validation
study—composed in part of college student participants—
reported a Cronbach’s alpha value of .93 (Beck et al., 1996).
In terms of construct validity, although the BDI-II (or the
earlier versions) cannot confirm a diagnosis of depression, the
scores can point to probable depression (see Beck et al, 1996;
Beck, Ward, Mendelson, Mock, & Erbaugh, 1961). Findings
from Osman and colleagues (1997) suggested that BDI-II scores
yield sound convergent, construct, and discriminant validity.
Research conducted by Dozois and colleagues (1998) and
Storch and colleagues (2004) provided evidence for construct
validity based on the relations between scores on the BDI-II,
the State-Trait Anxiety Inventory-Depression (STAI-D; Spiel-
berger,1983; Spielberger, Gorsuch, & Lushene, 1972), and the
State-Trait Anxiety Inventory-Anxiety (STAI-A; Spielberger,
1983; Spielberger et al., 1972) factors scores. Dozois and col-
leagues provided guidance on recommended cutoff scores in
college student populations: scores 0 to 12 indicate not de-
pressed; scores 13 to 19 indicate dysphoria; and scores 20 to 63
indicate clinically depressed.
In the current investigation, Cronbach’s alpha values re-
sulting from the 21-item BDI-II reflected sound reliability of
scale scores in all four samples. Consistent with stability co-
efficients in other studies (Beck et al., 1996; Carmody, 2005;
Dozois et al., 1998; Hooper & Doehler, 2011; Whisman et al.,
2000), Cronbach’s alpha values were α = .90 for the Black
American study sample and α = .92 for the White American
study sample. The Cronbach’s alpha value for females was α
= .92; for males it was α = .92.
Missing Data
We examined missing data for all items on the BDI-II. All
analyses in the current investigation included subjects with
nonmissing values for all 21 items on the BDI-II. Therefore, only
observed values were used; no imputation was performed. Of
the 1229 participants, 53 participants failed to report their race and
thus were excluded from the analyses. We used responses from all
1229 participants for gender-related analyses (see Table 1).
Data Analysis Plan
To examine research questions 1 and 2 we used item response
theory (IRT) modeling to assess differential item functioning
(DIF) among the BDI-II items. In addition to the recommendations
put forward by numerous scholars (see Hambleton, 2006; Stark,
Chernyshenko, & Drasgow, 2006; Hays, Morales, & Reise,
2000; Samejima, 1969), several factors influenced our rationale
for the data analysis plan. For example, since the BDI-II is a
polytomous instrument (items with more than two response
options) in which items are measured on a four-point Likert-
type scale, we considered a graded response model (GRM) as
an appropriate method of IRT parameter estimation (Samejima,
1969). The GRM estimates for each item (i) a slope parameter
(αi) and a threshold parameter for each between-category
threshold (βij). For the four-item Likert-type scale specifically,
there are three between-category threshold parameters. The
threshold parameters are the points along the latent trait
continuum where respondents have a .50 probability of responding
above a threshold. Using the estimated parameters, category
response curves can be plotted to describe the probability that a
respondent with a certain trait level (θ) will endorse or agree with a
statement (i.e., item) at each point using the Likert-type scale.
DIF analysis in IRT is used to assess differences between
different groups of respondents with regard to the difficulty of
item endorsement. Each item has an estimated difficulty location
measured on the same scale as the trait level (θ). Items with
positive location estimates are harder to endorse, and those with
negative location estimates are easier to endorse. Items are
considered to be displaying DIF if the item location estimates
for two or more groups of respondents are significantly different
when all other parameters are held constant. In addition to IRT,
we used CFA to verify the unidimensionality of depressive
symptoms in our study samples. LISREL 8.80 (Scientific Software
International, 2007) was used for CFA analyses.
In sum, by employing a combined data analysis approach of
IRT and CFA in the same study, we used a rigorous and
conservative method (see Hays et al., 2000; Stark et al., 2006)
to explore the extent to which scalar equivalence exists for
Black American and White American college students and for
female and male college students. However, we also recognize
that there are alternative methods that have been recommended
as well (see Brown, 2006). Consistent with recommendations
put forward by Gregorich (2006) and others, we used confirmatory
factor analysis to determine if the construct validity of the
BDI-II is invariant for our two population groups: race (Black
American vs. White American college students) and gender
(female and male college students). We also used confirmatory
factor analysis to determine whether the group differences that
emerge are true differences in the construct under investigation
(depressive symptomatology) or are instead effects related to
some other factor pertaining to the demographics of the population
groups (e.g., group-specific attributes such as gender).
Results from our CFA indicated that a single dominant factor
underlies the BDI-II items. The one-factor CFA demonstrated
an adequate fit of the data for all comparison groups (see
Tables 2 and 3). Goodness-of-fit indices for the four groups are
as follows. Results for Black Americans were comparative fit
index (CFI) = .92, normed fit index (NFI) = .86, and nonnormed
fit index (NNFI) = .91; results for White Americans were CFI
= .95, NFI = .95, and NNFI = .95. Goodness-of-fit results for
the female college students were CFI = .95, NFI = .94, and
NNFI = .95; and for male college students results were CFI
= .95, NFI = .94, and NNFI = .94.
BDI-II: De sc riptive I te m S t a ti stics
As illustrated in Table 4, the BDI-II depressive items were
compared between Black American and White American
students using independent sample t tests. Significant mean
differences were evidenced on seven items (BDI-II items 2, 3, 5,
Table 2.
Analyses for unidimensionality and reliability for beck depression In-
ventory-II in Black American and White American college student
Black American White American
Cronbach’s Alpha Values.90 .92
CFI .92 .95
NFI .86 .95
NNFI .91 .95
Note: CFI = comparative fit index; NFI = normed fit index; NNFI = nonnormed
fit index.
Table 3.
Analyses for unidimensionality and reliability for beck depression In-
ventory-II in female and male college student participants.
Female Male
Cronbach’s Alpha Values.92 .92
CFI .95 .95
NFI .94 .94
NNFI .95 .94
Note: CFI = comparative fit index; NFI = normed fit index; NNFI = nonnormed
fit index.
Table 4.
Beck depression Inventory-II item scores in Black American and White American college student participants.
Mean ± Standard Deviation T (p Value)
Score Range: 0 - 3 White American (n = 1,031) Black American (n = 145)
BDI01—Sadness .34 ± .57 .33 ± .58 –.15 (.884)
BDI02—Pessimism .46 ± .60 .34 ± .62 –2.24 (.025)
BDI03—Failure .42 ± .63 .27 ± .56 –2.66 (.008)
BDI04—Loss of Pleasure .33 ± .59 .32 ± .60 –.03 (.973)
BDI05—Guilt .43 ± .62 .31 ± .58 –2.24 (.025)
BDI06—Punishment .26 ± .61 .23 ± .61 –.38 (.702)
BDI07—Self-Dislike .40 ± .70 .26 ± .60 –2.41 (.016)
BDI08—Self-Criticalness .55 ± .73 .30 ± .57 –4.05 (<.0001)
BDI09—Suicidal Thoughts .11 ± .36 .07 ± .28 –1.15 (.250)
BDI10—Crying .37 ± .65 .40 ± .79 .50 (.619)
BDI11—Agitation .42 ± .63 .33 ± .59 –1.56 (.120)
BDI12—Loss of Interest .32 ± .59 .30 ± .50 –.23 (.820)
BDI13—Indecisiveness .45 ± .78 .37 ± .63 –1.20 (.232)
BDI14—Worthlessness .23 ± .57 .12 ± .40 –2.19 (.029)
BDI15—Loss of Energy .52 ± .61 .56 ± .64 .75 (.456)
BDI16—Change in Sleep .85 ± .76 .82 ± .81 –.37 (.712)
BDI17—Irritability .35 ± .61 .35 ± .57 .01 (.991)
BDI18—Change in Appetite .58 ± .77 .56 ± .77 –.34 (.733)
BDI19—Concentration Difficulty .55 ± .76 .54 ± .78 –.24 (.814)
BDI20—Tiredness/Fatigue .54 ± .61 .55 ± .69 .17 (.863)
BDI21—Loss of Interest in Sex .20 ± .52 .37 ± .72 3.61 (.0003)
Note: Boldfaced values reflect a significant difference.
Copyright © 2012 SciRes. 767
Table 5.
Beck depression Inventory-II scores in female and male college student participants.
Mean ± Standard Deviation T (p Value)
Score Range: 0 - 3 Female (n = 684) Male (n = 545)
BDI01—Sadness .34 ± .58 .33 ± .53 .30 (.761)
BDI02—Pessimism .44 ± .60 .46 ± .62 –.61 (.542)
BDI03—Failure .39 ± .62 .42 ± .65 –.76 (.445)
BDI04—Loss of Pleasure .33 ± .57 .33 ± .62 .24 (.811)
BDI05—Guilt .41 ± .61 .43 ± .64 –.58 (.563)
BDI06—Punishment .25 ± .57 .26 ± .63 –.30 (.764)
BDI07—Self-Dislike .38 ± .69 .40± .68 –.41 (.682)
BDI08—Self-Criticalness .52 ± .71 .52 ± .73 .18 (.858)
BDI09—Suicidal Thoughts .11 ± .36 .09 ± .33 .82 (.412)
BDI10—Crying .41 ± .66 .32 ± .34 2.32 (.020)
BDI11—Agitation .42 ± .63 .40 ± .63 .62 (.535)
BDI12—Loss of Interest .30 ± .56 .35 ± .63 –1.47 (.141)
BDI13—Indecisiveness .44 ± .76 .43 ± .71 .46 (.648)
BDI14—Worthlessness .22 ± .58 .21 ± .54 .05 (.958)
BDI15—Loss of Energy .52 ± .61 .53 ± .64 –.05 (.958)
BDI16—Change in Sleep .83 ± .78 .84 ± .75 –.21 (.834)
BDI17—Irritability .37 ± .60 .32 ± .61 1.48 (.140)
BDI18—Change in Appetite .59 ± .76 .58 ± .79 .25 (.802)
BDI19—Concentration Difficulty .54 ± .75 .56 ± .76 –.51 (.613)
BDI20—Tiredness/Fatigue .55 ± .60 .55 ± .66 .01 (.992)
BDI21—Loss of Interest in Sex .22 ± .54 .21 ± .57 .28 (.776)
Note: Boldfaced values reflect a significant difference.
7, 8, 14, and 21). White American students had higher mean
scores on all items with the exception of item 21. BDI-II de-
pressive items also were compared between female and male
college students using t tests. Significant mean differences were
evidenced on one item only, item 10. In this case, as shown in
Table 5, females had higher mean scores for this item than
male respondents.
BDI-II: Dif f erential Item Functioning Analyses
To compare responses across the study samples, individ-
ual items on the BDI-II were assessed for DIF using the
MULTILOG 7.03 program (Thissen, 1991). MULTILOG uses
Marginal Maximum Likelihood (MML) estimation to evaluate
the significance of item location differences between groups.
All parameters except the location parameter were held constant
for Black American and White American respondents. A chi-square
statistic for the contrast between Black American and White
American item locations was used to test for significance of the
contrast. Items with chi-square values above the critical value at
a .05 alpha level with one degree of freedom are considered to be
displaying DIF. We followed the same analytic procedures for
female and male respondents.
As illustrated in Table 6, significant differences were found
in the item-level responses to the BDI-II based on race in the
current study. More specifically, five items on the BDI-II
(items 7, 8, 14, 15, and 21) displayed DIF in relation to Black
American and White American respondents. Of the five items,
only two items (BDI-II item 8, self-criticalness, and item 21,
loss of interest in sex) exhibited DIF based on both methods:
CFA and IRT analyses.
Table 7 shows that differences were also found in the item-
level responses to the BDI-II based on gender in the current study.
More specifically, two items on the BDI-II items (item 10,
crying; and item 12, loss of interest) displayed DIF in relation
to female and male student-respondents. Importantly, as can
also be seen in Table 7, the two DIF items that emerged in our
sample were observed from the CFA but not the IRT analyses.
This study used a convenience cross-sectional sample of
1229 American student-respondents to examine race- and gen-
der-related measurement equivalence (or bias) in the perfor-
mance of the BDI-II (Beck et al., 1996). More specifically, using
IRT and CFA, we tested the extent to which the BDI-II prov-
ides equivalent scalar measurement for depressive symptoms in
Black American and White American college students and in
female and male college students. The data from our convenience
cross-sectional sample of American students point toward four
main findings. We discuss these results in terms of our proposed
research questions.
Our first main finding relates to our racial group comparisons.
We used DIF analyses to examine research question 1: To what
extent does the BDI-II (Beck et al., 1996) provide equivalent
scalar measurement for depressive symptoms in Black Ameri-
can and White American college students? The data produced
differences in symptom endorsement based on race. Twenty-
three percent of the items on the BDI-II functioned differently
based on at least one comparison method (i.e., CFA or IRT).
More specifically, for these race-related comparisons, symptom
endorsement varied on five BDI-II items: items 7, self-dislike;
8, self-criticalness; 14, worthlessness; 15, loss of energy; and
21, loss of interest in sex. Therefore, five items functioned
differently, and 16 of the items functioned similarly in these
racial group comparisons.
These results align with empirical findings as well as theoriz-
Copyright © 2012 SciRes.
Table 6.
Differential Item Functioning (DIF) results from Confirmatory Factor Analysis (CFA) and Item Response Theory (IRT) methods for Black American
and White American student comparisons.
Chi-Square (Difference)
Model CFA (Δdf = 2) IRT (Δdf = 4)
Baseline Model (Referent: Item BDI01—Sadness) 2319.3 15845.7
Comparison Models
BDI02—Pessimism 4.6 11.0
BDI03—Failure 5.1 8.3
BDI04—Loss of Pleasure 4.6 7.7
BDI05—Guilt 4.9 9.9
BDI06—Punishment 2.8 5.6
BDI07—Self-Dislike 6.2 17.8DIF
BDI08—Self-Criticalness 22.2DIF 21.0DIF
BDI09—Suicidal Thoughts 2.1 2.0
BDI10—Crying 10.7 8.1
BDI11—Agitation 1.5 5.4
BDI12—Loss of Interest 6.2 9.5
BDI13—Indecisiveness .2 9.1
BDI14—Worthlessness 22.2DIF 5.6
BDI15—Loss of Energy 13.3DIF 10.0
BDI16—Change in Sleep 4.6 6.9
BDI17—Irritability 4.4 8.3
BDI18—Change in Appetite .3 1.7
BDI19—Concentration Difficulty 2.2 1.9
BDI20—Tiredness/Fatigue 7.1 10.9
BDI21—Loss of Interest in Sex 22.6DIF 23.6DIF
Total number of DIF Items 4 3
Note: In CFA, DIF is flagged if chi-square (x2) was > 11.98. In IRT, DIF is flagged if chi-square (x2) was >16.51. Boldfaced values reflect DIF flagged for both CFA and IRT.
Table 7.
Differential Item Functioning (DIF) results from Confirmatory Factor Analysis (CFA) and Item Response Theory (IRT) methods for female and male
student comparisons.
Chi-Square (Difference)
Model CFA (Δdf = 2) IRT (Δdf = 4)
Baseline Model (Referent: Item BDI01—Sadness) 2194.2 17167.1
Comparison Model
BDI02—Pessimism .7 3.9
BDI03—Failure 1.0 1.6
BDI04—Loss of Pleasure 1.3 10.0
BDI05—Guilt 1.3 6.7
BDI06—Punishment .8 7.2
BDI07—Self-Dislike .9 4.4
BDI08—Self-Criticalness .2 1.3
BDI09—Suicidal Thoughts 1.9 3.0
BDI10—Crying 14.4DIF 10.2
BDI11—Agitation .6 2.1
BDI12—Loss of Interest 16.2DIF 7.7
BDI13—Indecisiveness .8 8.5
BDI14—Worthlessness 9.1 8.9
BDI15—Loss of Energy 1.3 1.3
BDI16—Change in Sleep .1 7.1
BDI17—Irritability 3.9 7.9
BDI18—Change in Appetite .8 2.3
BDI19—Concentration Difficulty .5 2.4
BDI20—Tiredness/Fatigue 8.9 8.0
BDI21—Loss of Interest in Sex 2.3 4.1
Total number of DIF Items 2 0
Note: In CFA, DIF is flagged if chi-square (x2) was > 11.98. In IRT, DIF is flagged if chi-square (x2) was > 6.51. Boldfaced values reflect DIF flagged for both CFA and IRT.
Copyright © 2012 SciRes. 769
ing in the literature related to the differential presentation and
endorsement of depressive symptoms (i.e., scale scores and
item scores) in adult population based on varied racial groups
(see Teresi et al., 2008). Of significance, only a few studies
have explored differences in depressive symptoms at the item
level in particular using the BDI-II. More often, the compari-
sons have been done at the total score level. We can point to
several studies comparing scale scores of the BDI-II based on
race. Walker and Bishop (2005) found in their study, which was
composed of college students, that White American students
reported higher scores (BDI-II mean score = 9.1) on the BDI-II
than Black American students did (BDI-II mean score = 8.3).
Similarly, in the present investigation, our data indicated that
White American students reported higher total scores (BDI-II
mean score = 8.7) than did Black American college students
(BDI-II mean score = 7.7). However, in another study com-
posed of older adolescents (Miller & Taylor, 2011), depressive
symptoms as measured by the CES-D (Radloff, 1977) revealed
that older Black American adolescents had higher levels of
depressive symptoms than did their White American counter-
parts. It remains unclear if differences evinced in the literature
are true differences, differences based on the measure used, diff-
erences at the item level, or differences in sample or some other
unmeasured factor (Boughton & Street, 2007; Santor et al.,
Furthermore, those comparisons that have focused on DIF
have been based primarily on age or gender comparisons (e.g.,
Carmody, 2005; Kim, Pilkonis, Frank, Thase, & Reynolds,
2002; Teresi et al., 2008), not racial comparisons. In one study
that did include racial comparisons, Carmody (2005) investigated
item bias in the endorsement of symptoms on the BDI-II among
White, Asian, and Latino Americans. In his study, White
American students scored higher on three items (BDI-II items
11, agitation; 14, worthlessness; and 17, irritability) than did
Hispanic and Asian American students. However, Carmody’s
study made no comparisons between White American and
Black American students.
In the analytical approach that we employed (i.e., IRT and
CFA jointly) for research question 1, we found some dis-
crepancies in the identification of DIF. More specifically, we
found inconsistencies based on our CFA and IRT analyses. As
seen in Table 6, the CFA identified four DIF items, whereas
the IRT identified three DIF items. No clear guidelines exist
when there are discrepancies in DIF results when multiple data
analytic methods are employed (i.e., CFT and IRT), such as in
our study (see Borsboom, 2006; Hambleton, 2006).
Our second main finding relates to our gender group com-
parisons. We used DIF analyses to examine research question 2:
To what extent does the BDI-II (Beck et al., 1996) provide
equivalent scalar measurement for depressive symptoms in
female and male college students? The results from our data
revealed slight differences in symptom endorsement between
gender comparison groups. More specifically, for these com-
parisons, symptom expression varied on two BDI-II items:
items 10, crying; and 12, loss of interest. Nineteen of the items
on the BDI-II were found to function similarly for females and
males. Consequently, for most items (90%) on the BDI-II, no
bias was uncovered, and scalar equivalence was observed. Our
findings are in partial agreement with Carmody’s (2005) find-
ings, although more differences emerged in his group compari-
sons. In Carmody’s study, item-level scores differed based on
gender; females had higher scores on BDI-II items 1, 10, 15,
and 20.
Our results for between-group differences are consistent with
other studies of college student populations (Gladstone& Koe-
nig, 1994; Eisenberg et al., 2007; Nolen-Hoeksema, 1990). For
example, as reviewed by Boughton and Street (2007), several
studies have found gender differences to be invariant in popu-
lations specifically of college students. However, the dominant
view and theorizing holds that gender differences in depression
and depressive symptoms do exist (Boughton & Street; Kessler
et al., 2003). Moreover, the amassed empirical literature (cross-
sectional and epidemiological studies) on gender differences re-
lated to depression and depressive symptoms has suggested that
symptom profiles—in most populations—are different more
often than not (Carmody, 2005; Kessler et al., 2003; Leino &
Kisch, 2005; World Health Organization, 2002). In the end, it is
clear that much more research needs to be done to disentangle
the effects that gender has on depressive symptoms in college
student populations specifically.
As evidenced in our analysis related to race comparisons, we
found various levels of DIF based on our CFA and IRT
analyses. These emergent differences are important and require
additional consideration in future investigations. Moreover,
whether the items that showed DIF in our study need to be
replaced remains unclear. Additional studies need to be con-
ducted to test if the patterns evinced in the current study can be
In our third main finding, we demonstrated a method that can
be employed in examining item and scalar equivalence in
cross-cultural (e.g., race and gender) comparisons. Establishing
item and scalar equivalence—even for commonly used instru-
ments—is an important step that is often overlooked when
comparisons are made (Harach et al., 2006). Many researchers
and scholars have assumed that because instruments are used
widely (e.g., CES-D [Radloff, 1977], BDI-II, Brief Symptom
Inventory [BSI; Derogatis, 1993], and Health Outcomes Meas-
ure [SF-36; Ware & Sherbourne, 1992]) scalar equivalence is
established. Additionally, some researchers conclude if total
scale or subscale scores show no differences between diverse
comparison groups then scalar equivalence is established (Ter-
esi et al., 2008; van de Vijver & Tanzer, 2004). Drawing con-
clusions from these scale score and subscale score group
comparisons could lead to faulty assumptions or erroneous
conclusions (van de Vijver & Tanzer, 2004). Therefore, investi-
gations at the item level are paramount; the importance cannot
be overstated. Findings from our investigation add to the clinical
and research literature base on the psychometric properties of the
BDI-II and afford researchers and scholars alike confidence
when screening for depressive symptoms in college student-
Finally, we found that the reliability of the BDI-II total scale
score across all four groups was more than adequate. Cron-
bach’s alpha values in the target groups in the current invest-
igation demonstrated high internal reliability and ranged from .90
to .92. These values are also consistent with Beck et al.’s vali-
dation study (1996) and Dozois and colleagues’ (1998) findings.
These results—in conjunction with our other findings—also
support item equivalence of the BDI-II in our samples, although
alpha values by themselves should not be the sole method to
establish item equivalence (Hui & Triandis, 1985; Vandenberg
& Lance, 2000).
Copyright © 2012 SciRes.
Study Limitations and Directions for Future
This study contributes to the literature by examining the ex-
tent to which one of the most commonly used measures to assess
for depressive symptoms—the BDI-II (Beck et al., 1996)—
produced scalar differences in Black American and White
American college students and in female and male college
students. In other words, did the participants’ responses indicate
that the BDI-II items function equivalently in the four samples?
Concurrent with our results, limitations of the study must be
considered. First, the sample size of the two racial groups was
limited. The unequal sample sizes of the two groups could have
attenuated the results of the study. A second limitation, also
related to the study sample, is that the participants were from
one university and thus may not be representative of all college
students. A third limitation is that our study was composed of a
nonclinical population of college students. Although high levels
of depressive symptoms and a clinical diagnosis of depression
are often seen in college student populations, the majority of
the sample reported low levels of depressive symptoms (see
Beck et al., 1996; Dozois et al., 1998). Future studies should
attempt to replicate these findings with clinical samples with
clinical levels of depressive symptomatology.
We did not assess for social desirability—a fourth limitation.
Some scholars have concluded that social desirability could
explain DIF associated with racial and cultural groups (van de
Vijver & Tanzer, 2004). Thus, future studies should consider
the inclusion of a measure that assesses for social desirability.
A fifth limitation arises because the current study focused on
depression and depressive symptoms; however, there are many
other mental health disorders and problems with which college
and university populations are faced.
A sixth limitation is that we compared two racial groups only.
It remains unclear whether these findings are representative of
findings that would be evinced in different racial and cultural
groups, age groups, or clinical groups—or even in groups from
different geographical regions. Future studies should include
comparisons with additional racial and ethnic groups, including
Latino individuals, one of the largest ethnic minority groups in
the United States (see Humes, Jones, & Ramirez, 2011; Marin,
Escobar, & Vega, 2006). We employed several statistical tests
in the current study. As a result, some of our findings may be
based on chance. Thus, the number of tests run in the current
study serves as a seventh limitation.
Finally, it is plausible that our results were attenuated by the
homogeneity of our college student sample. For example, the
impact of the college experience, including the daily living
experiences on a college campus, could have created more
similarities than differences in our sample irrespective of race
and gender. In other words, the strength of the common college
experience could have been more powerful than the strength of
the cultural experiences evidenced in our college student sam-
ple (Carmody, 2005; Kadison, 2004). Future research is needed
to determine the applicability and generalizability of the results
in the current investigation.
These findings have implications for future research. Studies
that explore cross-cultural research, minority health and health
disparities, and racial and cultural differences in medical con-
ditions and mental health symptomatology must include meas-
ures that reliably capture the construct under investigation
(Manly, 2006). To be clear, researchers and clinicians alike
must be confident that the often-reported differences in de-
pression and depressive symptoms are true differences and not
artificial differences that are attributable to biased items on the
measure used (i.e., DIF). Therefore, more studies that examine
scalar and cultural equivalence of measures are needed (Hara-
chi et al., 2006; McHorney & Fleischman, 2006; van de Vijver
& Tanzer, 2004). The criticality of this need—even among the
most commonly used instruments—cannot be overstated and
therefore has far-reaching effects on science, practice, and po-
licy (Eisenberg et al., 2007). Finally, and of significance, most
of the most widely used instruments (e.g., BDI-II) were de-
veloped with predominantly or exclusively White American
samples, in monocultural contexts, and most often with college
student populations (see Beck et al., 1996; Boughton & Street,
2007; Carmody, 2005; Steer & Clark, 1997).
College students in general are an at-risk, high-priority popu-
lation when it comes to the development of depressive symp-
toms and depression (Gore & Aseltine, 2003; National Re-
search Council & Institute of Medicine, 2009). Cultural factors
such as race and gender may further complicate or exacerbate
mental health conditions, including depression. Future studies
must ensure that racially diverse individuals are included in
research studies that examine diagnosis and treatment methods
and measures for medical conditions and mental health dis-
orders such as depression (see Manly, 2006).
In addition, although our results found gender differences to
be invariant future studies may want to consider a two-pronged
culturally tailored approach to assess for gender differences in
depression. For example, several researchers have intimated
that the current DSM-IV criteria (American Psychiatric Asso-
ciation, 2000a) for depression (and thus indirectly the BDI-II)
may fail to capture fully men’s symptoms of depression. Sev-
eral scholars have suggested that a singular assessment tool
may be inadequate. Cochran and Rabinowitz (2003) suggested
a culturally sensitive approach by asking male clients typical
questions about depressive symptoms and questions that reflect
masculine-specific distress and symptoms. For example, ques-
tions about increased anger and agitation, decreased motivation,
increased somatic concerns, and a decrease in sexual interest, with
little change in sexual behavior. To address the possible limita-
tions of current assessments, Magovcevic and Addis (2008)
developed the Masculine Depression Scale. In their develop-
ment and refinement study, they found that males reported both
typical depressive symptoms evinced in the current DSM-IV
criteria (American Psychiatric Association, 2000a) as well as
masculine-specific depressive symptoms (e.g., getting mad and
feeling less confident). Other scholars have also suggested that
typical depressive symptoms are often filtered through a mas-
culine-focused lens and thus males may not be reporting symp-
toms that are evidenced on assessment tools (e.g., BDI-II and
CES-D) and/or filtering their depressive symptoms through a
masculine gender framework causing depression to go undiag-
nosed, undetected, and untreated. Future studies should consider a
culturally tailoring approach—using multiple instruments—
when examining depression in males (see Fields & Cochran,
The current investigation fills a gap in the depression litera-
ture. Our investigation is the first to assess scalar equivalence
of the BDI-II (Beck et al., 1996) in college students using two
rigorous methods jointly (see Hays et al., 2000; Stark et al.,
2006). More specifically, this was the first study to examine
DIF with the BDI-II using IRT and CFA concurrently. In our
Copyright © 2012 SciRes. 771
study sample we found DIF based on race and gender com-
parisons. Despite these differences, overall our results, suggest
the BDI-II scores appear to be a reliable and valid measure of
depressive symptoms for Black and White American college
students and female and male college students. This research
advances clinical knowledge about the utility, reliability, and
validity of the BDI-II scores in a racially and culturally diverse
college student population. Using DIF we found that 16 of the
21 items on the BDI-II functioned similarly in our racial group
comparisons, and 19 of the 21 items on the BDI-II functioned
similarly in our gender group comparisons. Before this investi-
gation, most cross-cultural comparison studies focused on total
BDI-II scale score analyses and therefore may have missed
important differences at the item level. We can conclude with
some confidence that many of the items on the BDI-II func-
tioned equivalently in our sample. Based on our preliminary
results, it appears that the BDI-II items do not need to be sig-
nificantly revised based on gender groups in college popula-
tions. However, researchers may want to consider to what ex-
tent the BDI-II items need to be slightly tailored based on gen-
der and racial groups or used in conjunction with other meas-
ures (e.g., Masculine Depression Scale; Magovcevic & Addis,
2008) in college populations, although more studies replicating
these findings are warranted before any changes are imple-
American College Health Association (2009). National college health
assessment II. Reference Group Executive Summary, Fall 2008. Bal-
timore, MD: American College Health Association.
American Psychiatric Association (2000a). Diagnostic and statistical
manual of mental health disorders (4th ed., text revision). Washing-
ton, DC: American College Health Association.
American Psychiatric Association (2000b). Practice guideline for the
treatment of patients with major depressive disorder. Washington,
DC: American College Health Association.
Anderson, E. R., & Mayes, L. C. (2010). Race/ethnicity and internalize-
ing disorders in youth: A review. Clinical Psychology Review, 30,
338-348. doi:10.1016/j.cpr.2009.12.008
Arria, A. M., O’Grady, K. E., Caldeira, K. M., Vincent, K. B., Wilcox,
H. C., & Wish, E. D. (2009). Suicide ideation among college students:
A multivariate analysis. Archives of Suicide Research, 13, 230-246.
Beck, A. T., Steer, R. A., & Brown, G. K. (1996). BDI-II manual. San
Antonio, TX: The Psychological Corporation.
Beck, A. T., Ward, C. H., Mendelson, M., Mock, J., & Erbaugh, J. (1961).
An inventory for measuring depression. Archives of General Psy-
chiatry, 12, 57-62.
Blanco, C., Okuda, M., Wright, C., Hasin, D. S., Grant, B. F., Liu, S.
M., & Olfson, M. (2008). Mental health of college students and their
non-college-attending peers: Results from the National Epidemi-
ologic Study on Alcohol and Related Conditions. Archives of Gen-
eral Psychiatry, 65, 1429-1437. doi:10.1001/archpsyc.65.12.1429
Borsboom, D. (2006). When does measurement invariance matter?
Medical Care, 44, s176-s181.
Boughton, S., & Street, H. (2007). Integrated review of the social and
psychological gender differences in depression. Australian Psychol-
ogist, 42, 187-197. doi:10.1080/00050060601139770
Brown, T. A. (2006). Confirmatory factor analysis for applied research.
New York: Guilford.
Carmody, D. P. (2005). Psychometric properties of the Beck Depres-
sion Inventory-II with college students of diverse ethnicity. Interna-
tional Journal of Psychiatry of Clinical Practice, 9, 22-28.
Centers for Disease Control and Prevention (2010). National Center for
Inquiry Prevention and Control, Web-Based Injury Statistics Query
and Reporting System.
Chao, R. K., & Otsuki-Clutter, M. (2011). Racial and ethnic differences:
Sociocultural and contextual explanations. Journal of Research on
Adolescence, 21, 47-60. doi:10.1111/j.1532-7795.2010.00714.x
Cochran, S. V., & Rabinowitz, F. E. (2003). Gender-sensitive recom-
mendations for assessment and treatment of depression in men. Pro-
fessional Psychology: Research and Practice, 34, 132-140.
Coyne, J. C., & Marcus, S. C. (2006). Health disparities in care for de-
pression possibly obscured by the clinical significance criterion. Ameri-
can Journal of Psychiatry, 163, 1577-1579.
Culbertson, F. M. (1997). Depression and gender. An international re-
view. American Psychologist, 52, 25-31.
Day, J. (1996). Population projections of the United States by age, sex,
race, and Hispanic origin: 1995-2050. Washington, DC: US Govern-
ment Printing Office.
Derogatis, L. R. (1993). Brief Symptom Inventory: Administration, scor-
ing, and procedures manual. Minneapolis, MN: National Computer
Systems, Inc.
Dozois, D. J. A., Dobson, K. S., & Ahnberg, J. L. (1998). A psychome-
tric evaluation of the Beck Depression Inventory-II. Psychological
Assessment, 10, 83-89. doi:10.1037/1040-3590.10.2.83
Dunlop, D. D., Song, J., Lyons, J. S., Manheim, L. M., & Chang, R. W.
(2003). Racial/ethnic differences in rates of depression among pre-
retirement adults. American Journal of Public Health, 93, 1945-1952.
Eaton, N. R., Keyes, K. M., Krueger, R. F., Balsis, S., Skodol, A. E.,
Markon, K. E., & Hasin, D. S. (2011). An invariant dimensional li-
ability model of gender differences in mental disorder prevalence:
Evidence from a national sample. Journal of Abnormal Psychology.
Eisenberg, D., Gollust, S. E., Golberstein, E., & Hefner, J. L. (2007).
Prevalence and correlates of depression, anxiety, and suicidality
among university students. American Journal of Orthopsychiatry, 77,
534-542. doi:10.1037/0002-9432.77.4.534
Fields, A. J., & Cochran, S. V. (2011). Men and depression: Current
perspectives for health care professionals. American Journal of Life-
style Medicine, 5, 92-100.
Furr, S. R., Westefeld, J. S., McConnell, G. N., & Jenkins, J. M. (2001).
Suicide and depression among college students: A decade later. Pro-
fessional Psychology, Research and Practice, 32, 97-100.
George, L. K., & Lynch, S. M. (2003). Race differences in depressive
symptoms: A dynamic perspective on stress exposure and vulnerabil-
ity. Journal of Health and Social Behavior, 44, 353-369.
Gladstone, T. R., & Koenig, L. (1994). Sex differences in depression
across the high school to college transition. Journal of Youth and
Adolescence, 23, 643-669. doi:10.1007/BF01537634
Gonzalez, H. M., Vega, W. A., Williams, D. R., Tarraf, W., West, B. T.,
& Neighbors, W. (2010). Depression care in the United States. Ar-
chives of General Psychiatry, 67, 37-46.
Gore, S., & Aseltine, R. H. (2003). Race and ethnic differences in de-
pressed mood following the transition from high school. Journal of
Health and Social Behavior, 44, 370-389. doi:10.2307/1519785
Gregorich, S. E. (2006). Do self-report instruments allow meaningful
comparisons across diverse population groups? Testing measurement
variance using the confirmatory factor analysis framework. Medical
Care, 44, s78-s94. doi:10.1097/01.mlr.0000245454.12228.8f
Hambleton, R. K. (2006). Good practices for identifying differential
item functioning. Medical Care, 44, s182-s188.
Hankin, B. L. (2002). Gender differences in depression from childhood
through adulthood: A review of course, causes, and treatment. Pri-
mary Psychiatry, 9, 32-36.
Copyright © 2012 SciRes.
Hankin, B. L., & Abramson, L. Y. (2001). Development of gender dif-
ferences in depression: An elaborated cognitive vulnerability-trans-
actional stress theory. Psychological Bulletin, 127, 773-796.
Harachi, T. W., Choi, Y., Abbott, R. D., Catalano, R. F., & Bliesner, S.
L. (2006). Examining equivalence of concepts and measures in di-
verse samples. Pre ve nt i on S c ie nc e, 7, 359-368.
Hays, R. D., Morales, L. S., & Reise, S. P. (2000). Item response theory
and health outcomes measurement in the 21st century. Medical Care,
38, s1128-s1142. doi:10.1097/00005650-200009002-00007
Hirsch, J. K., Webb, J. R., & Jeglic, E. L. (in press). Forgiveness, de-
pression, and suicidal behavior among a diverse sample of college
students. Journal of Clinical Psychology.
Hooper, L. M. (2010). The unmet needs of depressed adolescent pa-
tients: How race, gender, and age relate to evidence-based depression
care in rural areas. Primary Health Care Research & Development,
11, 339-348. doi:10.1017/S1463423610000277
Hooper, L. M., & Doehler, K. (2011). The mediating effects of differ-
entiation of self on body mass index and depressive symptomatology
among an American college sample. Counselling Psychology Quar-
terly, 24, 71-82. doi:10.1080/09515070.2011.559957
Hui, C. H., & Triandis, H. C. (1985). Measurement in cross-cultural
psychology—A review and comparison of strategies. Journal of
Cross-Cultural Psychology, 16, 131-152.
Humes, K. R., Jones, N. A., & Ramirez, R. R. (2011). Overview of race
and Hispanic origin: 2010.
Institute of Medicine (2002). Unequal treatment: Confronting racial
and ethnic disparities in health care. Washington DC: National
Academies Press.
Iwata, N., & Buka, S. (2002). Race/ethnicity and depressive symptoms:
A cross-cultural/ethnic comparison among university students in East
Asia, North and South America. Social Science and Medicine, 55,
2243-2252. doi:10.1016/S0277-9536(02)00003-5
Kadison, R. (2004). The mental-health crisis: What colleges must do.
The Chronicle of Higher Educ a t ion, B20.
Kessler, R. C., McGonagle, K. A., Zhao, S., Nelson, C. B., Hughes, M.,
Eshleman, S., Wittchen, H. U., & Kendler, K. S. (1994). Lifetime
and 12-month prevalence of DSM-III-R psychiatric disorders in the
United States. Results from the National Comorbidity Survey. Ar-
chives of General Psychiatry, 51, 8-19.
Kessler, R. R., Berglund, P., Demler, O., Jin, R., Koretz, D., Merikan-
gas, K. R., & Wang, P. S. (2003). The epidemiology of major de-
pressive disorder: Results from the National Comorbidity Survey
Replication (NCS-R). Journal of the American Medical Association,
289, 3095-3105. doi:10.1001/jama.289.23.3095
Kilmartin, C. (2005). Depression in men: Communication, diagnosis
and therapy. The Journal o f Men's Health & Gender, 2, 95-99.
Kim, Y., Pilkonis, P. A., Frank, E., Thase, M. E., & Reynolds, C. F.
(2002). Differential item functioning of the Beck Depression Inven-
tory in late-life patients: Use of item response theory. Psychology
and Aging, 17, 379-391. doi:10.1037/0882-7974.17.3.379
Kisch, J., Leino, V. E., & Silverman, M. M. (2005). Aspects of suicidal
behavior, depression, and treatment in college students: Results from
the spring 2000 national college health assessment survey. Suicide
and Life-Threatening Behavior, 35, 3-13.
Leino, E. V., & Kisch, J. (2005). Correlates and predictors of depress-
sion in college students: Results from the spring 2000 national col-
lege health assessment. American Journal of Health Education, 36,
Magovcevic, M., & Addis, M. E. (2008). The masculine depression
scale: Development and psychometric evaluation. Psychology of Men
and Masculinity, 9, 114-132. doi:10.1037/1524-9220.9.3.117
Manly, J. J. (2006). Deconstructing race and ethnicity: Implications for
measurement of health outcomes. Medical Care, 44, s10-s16.
Marin, H., Escobar, J. I., & Vega, W. A. (2006). Mental illness in His-
panics: A review of the literature. Focus, 4, 23-37.
McHorney, C. A., & Fleischman, J. A. (2006). Assessing and under-
standing measurement equivalence in health outcomes measures. Is-
sues for further quantitative and qualitative inquiry. Medical Care,
44, s205-s210. doi:10.1097/01.mlr.0000245451.67862.57
National Institute of Mental Health (2010). Office for Research on
Disparities and Global Mental Health: Overview.
National Institutes of Health. (2002). Outreach notebook for the inclu-
sion, recruitment and retention of women and minority subjects in
clinical research. URL.
National Research Council & Institute of Medicine (2009). Preventing
mental, emotional, and behavioral disorders among young people:
Progress and possibilities. Washington, DC: National Academies Press.
Nolen-Hoeksema, S. (1990). Sex differences in depression. Stanford,
CA: Stanford University Press.
Nolen-Hoeksema, S., & Girgus, J. S. (1994). The emergence of gender
differences in depression during adolescence. Psychological Bulletin,
115, 424-443. doi:10.1037/0033-2909.115.3.424
Osman, A., Downs, W. R., Barrios, F. X., Kopper, B. A., Guiterrez, P.
M., & Chiros, C. E. (1997). Factor structure and psychometric char-
acteristics of the Beck Depression Inventory-II. Journal of Psycho-
pathology and Behavioral Assessment, 19, 359-376.
Paniagua, F. A. (1994). Assessing and treating culturally diverse clients:
A practical guide. Thousand Oaks, CA: Sage.
Radloff, L. S. (1977). The CES-D scale: A self-report depression scale
for research in the general population. Applied Psychological Meas-
urement, 1, 385-401. doi:10.1177/014662167700100306
Rao, U., & Chen, L. (2009). Characteristics, correlates, and outcomes
of childhood and adolescent depressive disorders. Dial ogues in Clini-
cal Neuroscience, 11, 45-62.
Samejima, F. (1969). Estimation of latent ability using a response pat-
tern of graded scores. Psychometrika, Monograph Supplement No. 17.
Santor, D. A., Zuroff, D. C., Cervantes, Palacios, J., & Ramsay, J. O.
(1995). Examining scale discrimability in the BDI and CES-D as a
function of depression severity. Psychological Assessment, 7, 131-139.
Schmidt, F. L., & Hunter, J. E. (2003). History, development, evolution,
and impact of validity generalization and meta-analysis methods. In
K. R. Murphy (Ed.), Validity and generalization: A critical review
(pp. 31-65). Mahwah, NJ: Erlbaum.
Silverstein, B. (1999). Gender differences in the prevalence of clinical
depression: The role played by depression associated with somatic
symptoms. American Journal of Psychiatry, 156, 480-482.
Sperry, L. (2010). Culture, personality, health, and family dynamics:
Cultural competence in the selection of culturally sensitive treat-
ments. Family Journal, 18, 316-320.
Spielberger, C. D. (1983). Manual for the State-Trait Anxiety Inventory
STAI. Palo Alto, CA: Mind Garden.
Spielberger, C. D., Gorsuch, R. L., & Lushene, R. E. (1972). Manual
for the State-Trait Anxiety Inventory Self-Evaluation Questionnaire.
Palo Alto, CA: Consulting Psychologists Press.
Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting dif-
ferential item functioning with confirmatory factor analysis and item
response theory: Toward a unified strategy. Journal of Applied Psy-
chology, 91, 1292-1306. doi:10.1037/0021-9010.91.6.1292
Steer, R. A., & Clark, D. A. (1997). Psychometric properties of the
Beck Depression Inventory-II with college students. Measurement
and Evaluation in Counseling a nd Development, 30, 128-137.
Storch, E. A., Roberti, J. W., & Roth, D. A. (2004). Factor structure,
concurrent validity, and internal consistency of the Beck Depression
Inventory-Second edition in a sample of college students. Depression
and Anxiety, 19, 187-189. doi:10.1002/da.20002
Teresi, J. A., Ramirez, M., Lai, J. S., & Silver, S. (2008). Occurrences
and sources of differential item functioning (DIF) in patient-reported
outcome measures: Description of DIF methods, and review of meas-
ures of depression, quality of life, and general health. Psychological
Copyright © 2012 SciRes. 773
Copyright © 2012 SciRes.
Science Quality, 50, 538-600.
Thissen, D. (1991). MULTILOG user’s guide: Multiple, categorical
item analysis and test scoring using item response theory. Chicago:
Scientific Software.
Tjia, J., Givens, J. L., & Shea, J. A. (2005). Factors associated with un-
dertreatment of medical student depression. Journal of American
College Health, 53, 219-224. doi:10.3200/JACH.53.5.219-224
US Department of Health and Human Services (n.d.) Healthy People
2020: Proposed Objectives.
Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of
the measurement invariance literature: Suggestions, practices, and re-
commendations for organizational research. Organizational Research
Methods, 3, 4-69. doi:10.1177/109442810031002
van de Vijver, F., & Tanzer, N. K. (2004). Bias and equivalence in
cross-cultural assessment: An overview. Review for Applied Psy-
chology, 54, 119-135. doi:10.1016/j.erap.2003.12.004
Walker, R. L., & Bishop, S. (2005). Examining a model of the relation
between religiosity and suicidal ideation in a sample of African
American and white college students. Suicide and Life-Threatening
Behavior, 35, 630-639. doi:10.1521/suli.2005.35.6.630
Ware, J. E., & Sherbourne, C. D. (1992). The MOS 36 Item Short Form
Health Survey (SF 36): Conceptual framework and item selection.
Medical Care, 30, 473-483.
Whisman, M. A., Perez, J. E., & Ramel, W. (2000). Factor structure of
the Beck Depression Inventory-Second Edition (BDI-II) in a student
sample. Journal of Clinical Psychology, 56, 545-551.
Wilcox, H. C., Arria, A. M., Caldeira, K. M., Vincent, K. B., Pin-
chevsky, G. M., & O’Grady, K. E. (2010). Prevalence and predictors
of persistent suicide ideation, plans, and attempts during college.
Journal of Affective Disorde r s , 127, 287-294.
World Health Organization (2002). The World Health Report 2002:
Reducing risks, promoting healthy life. Geneva: World Health Or-