Dimensionality and Measurement Invariance of the Serbian Version of the BDI-II: An IRT Approach

There have been debates about the dimensionality of the Beck Depression Inventory-II (BDI-II), its appropriate scoring, and gender-related measurement invariance. We addressed these questions using the Item Response Theory (IRT) approach in a clinical sample of 288 patients who completed a Serbian version of the BDI-II. We tested nine structural models using confirmatory full-information factor analysis and the IRT graded response model. We concluded that the BDI-II is essentially unidimensional. All items had high discrimination, and the test was most informative in the medium range of depression severity in the clinical sample. Although small to medium gender-related differential item functioning existed in several items, it did not affect the total score. Hence, the total score of the Serbian translation of the BDI-II is comparable between genders as a measure of severity of depression.

It is interesting that a one-factor solution has rarely been supported (Kim, Pilkonis, Frank, Thase, & Reynolds, 2002; Segal, Coolidge, Cahill, & O'Riley, 2008). Recently, however, several bifactor analytic studies have revitalized interest in the general factor (G). These studies agree that the BDI-II has a bifactor structure in which the G factor has a dominant influence on the common BDI-II item variance in comparison to the specific factors. Even though these studies differed in the number and type of specific factors extracted, with some reporting two (Al-Turkait & Ohaeri, 2010; de Miranda Azevedo et al., 2016; Osman, Barrios, Gutierrez, Williams, & Bailey, 2008; Subica et al., 2014; Ward, 2006) and some three (Bühler et al., 2014; Quilty, Zhang, & Bagby, 2010), they all agree that "...G is the key" (Brouwer, Meijer, & Zevalkink, 2013, p. 136). Hence, a majority of authors would argue that multidimensionality within the BDI-II items stems from different item content domains rather than from the presence of specific factors that deserve to be scored separately (but see Bühler et al., 2014, for a different interpretation of the importance of the specific factors). Therefore, the instability of the factor solutions might stem from incorrect model specification, in particular from a strong presence of the general factor (G) in all BDI-II items that could not be modeled in studies that did not apply bifactor modeling (Ward, 2006). Namely, items that are strongly saturated by the G factor might switch from one oblique factor to another across studies (Ward, 2006).
Others suggested sample-dependent error variance as an explanation for the various factor analytic solutions (Strunk & Lane, 2017). One goal of the present study is to provide a comprehensive test of the BDI-II structure in a sample of depressed patients from Serbia. Based on the reviewed studies, we expected that most of the common BDI-II item variance would be explained by the dominant G factor. If supported, this finding would inform clinicians about the best way of scoring the instrument when it is used for screening and research purposes. In particular, the Serbian version of the BDI-II would be scored as a total score, which is the original BDI-II scoring.
It is also possible that not all BDI-II items function uniformly (e.g., discriminate equally) across the latent trait of severity of depression and/or across different groups of individuals. For example, item difficulty (i.e., the locations of category thresholds in polytomous items) might differ among demographic groups, which would suggest the presence of differential item functioning (DIF), or a lack of measurement invariance. DIF exists when individuals who belong to different demographic groups but have the same latent trait level (theta) respond differently to a test item. Such item characteristics can be tested within the item response theory (IRT) framework, which has already been employed several times in testing the psychometric characteristics of the BDI-II (de Sá Junior, de Andrade, Andrade, Gorenstein, & Wang, 2018; de Sá Junior et al., 2019; Kim et al., 2002; Wu & Huang, 2010). If our structural analyses show that the BDI-II is highly saturated by the G factor, we will fit a unidimensional IRT model with two aims: to better understand how the BDI-II items function as a measure of severity of depression, and to examine whether there is DIF between genders.
In IRT models, specifically the graded response model (GRM; Samejima, 1969), the slope parameter (a), together with the categories' location parameters (b), indicates how well the item response categories (0-3 in the case of the BDI-II) discriminate between levels of severity of depression, considered as a latent trait. In the GRM, the categories' location parameters (thresholds) indicate the point on the latent trait at which the probability of responding in or above a given category is .50 (Reise et al., 2011). The thresholds can help one determine whether the item or test difficulty is adequate for the primary target population. For example, in a large sample of Brazilian university students, all BDI-II items were shown to have moderate-to-high discrimination parameters for the latent trait of severity of depression (de Sá Junior et al., 2019). To our knowledge, no study has reported detailed a and b parameters for the BDI-II items in a sample of clinically depressed individuals.
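To make the role of the a and b parameters concrete, the following minimal sketch computes GRM category probabilities for a single four-category item. The function name and the parameter values are hypothetical illustrations and are not estimates from this study.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Return [P(X = 0), ..., P(X = K)] under the graded response model.

    theta: latent trait level; a: item slope; thresholds: ordered b_1..b_K.
    """
    # Cumulative curves P(X >= k); by definition P(X >= 0) = 1 and P(X >= K + 1) = 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cumulative.append(0.0)
    # Each category probability is the difference between adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical four-category item (scored 0-3); these values are NOT from Table 5.
a_demo = 2.0
b_demo = [-1.0, 0.5, 1.5]

# Evaluate the item at theta equal to the second threshold.
probs_at_b2 = grm_category_probs(theta=0.5, a=a_demo, thresholds=b_demo)
print([round(p, 3) for p in probs_at_b2])
```

When theta equals the second threshold, the probabilities of categories 2 and 3 sum to .50, which is the threshold interpretation used in the text. A larger a makes the cumulative curves steeper, so neighboring trait levels are separated more sharply.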
Only two studies explored DIF between genders. In a student sample, de Sá Junior et al. (2019) concluded that BDI-II items #10 (Crying) and #21 (Loss of interest in sex) demonstrated DIF, with women having a greater tendency to endorse these items; however, the overall scores were negligibly affected by these differences. In a clinical sample, and using the BDI-IA, Santor, Ramsay, and Zuroff (1994) reported that only item #14 (Distortion of body image) tended to overestimate the degree of depression in women compared to men. Thus, there is clearly a need for a replication study on an independent patient sample.
Disentangling item bias from real differences between genders is particularly important given the inconsistent findings regarding gender differences registered via the BDI-II or its earlier version. For example, a number of studies reported that women obtained higher scores than men in different populations (e.g., in an adult clinical population; see Beck, Steer, & Brown, 2009). However, such differences were not always reported (e.g., in an adult general population; see Richter, Polak, & Eisemann, 2003).
Hence, the aims of the present paper were to provide a comprehensive assessment of various factor analytic models of the BDI-II (Serbian version) that would inform us about the optimal scoring of the BDI-II and, contingent upon its hypothesized unidimensionality, to examine item-level IRT characteristics and gender-related DIF in a sample of Serbian depressed patients.

Instruments
The Beck Depression Inventory-II (BDI-II; Beck et al., 1996; Mihić & Novović, 2019) is a multiple-choice, 21-item self-report measure of severity of depression. A previous study supported the validity and reliability of the translated version of the BDI-II (Mihić & Novović, 2019). Each answer is scored on a scale ranging from 0 to 3. The total score is the sum of all item scores. In the present study, the reliability coefficient was high, α = .94.

Data Analysis
Data were analyzed in the R environment (R Core Team, 2019) using the packages mvoutlier (Filzmoser & Gschwandtner, 2018), psych (Revelle, 2009), ltm (Rizopoulos, 2006), and mirt (Chalmers, 2012). Taking into account the large number of different factor solutions reported in the literature to date and the ordinal nature of the data, we used the IRT full-information confirmatory factor analysis (FA) approach (e.g., Bock, Gibbons, & Muraki, 1988).
This method of FA uses the frequencies of distinct response vectors as data (Bock et al., 1988). All fitted models were two-parameter logistic graded response models (Samejima, 1983). These models estimate location (b, difficulty) parameters for every threshold between an item's categories and one slope (a, discrimination) parameter per item. In addition, IRT analysis produces information functions for items and for the test, and it can be used to examine measurement invariance (Meade, 2016) through analyses of differential item and test functioning (DIF and DTF).

Latent Structure of the BDI-II
In order to determine the latent structure of the Serbian version of the BDI-II, nine structural models were tested. To provide continuity with previous studies, we explored the models tested in Brouwer et al.'s (2013) study, which at the time represented the most comprehensive empirical test of various BDI-II factor structures. We also included the bifactor model that was published subsequently (Bühler et al., 2014) and a two-factor correlated model previously reported in a Serbian university sample (Novović et al., 2011). Detailed specifications of all models are presented in Table 2.

INSERT TABLE 2 HERE
Conventional cutoff criteria (Hu & Bentler, 1999) were used as indicators of good model fit. Since more than half of the tested models were bifactor, and global fit indices tend to favor bifactor models due to their tendency to overfit (Bornovalova, Choate, Fatimah, Petersen, & Wiernik, 2020; Markon, 2019), we carefully inspected all the solutions, including the loadings on the general and group factors. Apart from model fit, we considered additional criteria while evaluating the models: a) the proportion of uncontaminated correlations (PUC), i.e., the proportion of all possible item correlations that are not contaminated by group factors because the two items belong to different group factors, b) the proportion of explained common variance (ECV), c) factor determinacy (FD), and d) the replicability index H. FD (the correlation between a factor and its factor scores) is considered satisfactory if > .90 (Gorsuch, 1983). The H index is a measure of factor replicability, with values > .80 suggesting well-defined latent variables (Hancock & Mueller, 2001). Finally, the ωh coefficient estimates how much of the total BDI-II score variance is attributable to the G factor, whereas ωhs reflects the systematic variance that remains once individual variability due to the general factor has been partitioned out (Reise, Scheines, Widaman, & Haviland, 2013). A value of ωh > .80 suggests that the scale can be regarded as measuring a unidimensional construct (Reise et al., 2013).
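As an illustration of how these indices follow from a standardized bifactor solution, the sketch below computes ECV, ωh, PUC, and H for the general factor using the formulas commonly given in the bifactor literature. It assumes orthogonal factors and that each item loads on at most one group factor. The function name and the loadings are hypothetical and are not the values reported in Table 4.

```python
def bifactor_indices(general, groups):
    """Compute ECV, omega-hierarchical, PUC, and H for the general factor.

    general: standardized loadings of every item on G.
    groups: one dict per group factor, mapping item index -> standardized loading.
    Assumes orthogonal factors and at most one group factor per item.
    """
    n_items = len(general)

    # ECV: share of common variance carried by the general factor.
    g_squared = sum(l * l for l in general)
    s_squared = sum(l * l for grp in groups for l in grp.values())
    ecv = g_squared / (g_squared + s_squared)

    # Unique variance of each item is 1 minus its communality.
    communality = [l * l for l in general]
    for grp in groups:
        for item, loading in grp.items():
            communality[item] += loading * loading
    unique_var = sum(1.0 - h2 for h2 in communality)

    # Omega-hierarchical: total-score variance attributable to G.
    general_var = sum(general) ** 2
    group_var = sum(sum(grp.values()) ** 2 for grp in groups)
    omega_h = general_var / (general_var + group_var + unique_var)

    # PUC: item pairs from different group factors over all item pairs.
    within_pairs = sum(len(grp) * (len(grp) - 1) / 2 for grp in groups)
    puc = 1.0 - within_pairs / (n_items * (n_items - 1) / 2)

    # H: construct replicability of the general factor.
    h_index = 1.0 / (1.0 + 1.0 / sum(l * l / (1.0 - l * l) for l in general))

    return {"ECV": ecv, "omega_h": omega_h, "PUC": puc, "H": h_index}

# Hypothetical six-item example with two group factors of three items each.
demo_general = [0.7] * 6
demo_groups = [{0: 0.3, 1: 0.3, 2: 0.3}, {3: 0.3, 4: 0.3, 5: 0.3}]
demo_indices = bifactor_indices(demo_general, demo_groups)
print({name: round(value, 3) for name, value in demo_indices.items()})
```

In practice these quantities would be read from the fitted model output; the sketch only shows how strong general loadings combined with weak group loadings push ECV and ωh toward 1.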
The limited-information goodness-of-fit statistic M2 (Table 3) was significant in all but three models (M5, M9, and M6), indicating lack of fit. The M2 statistic tests for exact fit, and it tends to be significant in models with large degrees of freedom (Maydeu-Olivares, 2013). Hence, we assessed approximate model fit based on other indices. More than half of the tested models satisfied all the criteria for good fit (Table 3).

INSERT TABLE 3 HERE
According to the Akaike Information Criterion (corrected for sample size), the best fitting model was M3 (the three correlated factors model), followed by models 5, 2, 4, and 9 (Table 3). The differences between the models' fit were also tested using likelihood ratio tests 2. The best fitting models according to the LR tests were 5 and 9, i.e., the two bifactor models that differed mainly in their definition of one group factor (Cognitive, Somatic, and Affective vs. Cognitive, Somatic, and Activation). However, inspection of the standardized item loadings in both models (Table 4) revealed an inadequate representation of the group factors, judged by the number of significant loadings and the presence of small negative loadings, indicating model misspecification (Reise, Kim, Mansolf, & Widaman, 2016). Also, all other indices in Table 4 were adequate for the general factor only, supporting rejection of the bifactor models.
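For readers less familiar with these comparison criteria, the following sketch shows the usual small-sample correction to the AIC and the likelihood ratio statistic for nested models. The log-likelihood values are hypothetical; only the sample size (288) and the parameter count of a unidimensional GRM for 21 four-category items (21 slopes plus 63 thresholds) follow from the design described here.

```python
def aicc(log_likelihood, n_params, n_obs):
    """Akaike Information Criterion with the small-sample correction."""
    aic = -2.0 * log_likelihood + 2.0 * n_params
    correction = 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return aic + correction

def lr_statistic(loglik_restricted, loglik_general):
    """Likelihood ratio statistic for two nested models.

    It is referred to a chi-square distribution whose degrees of freedom
    equal the difference in the number of estimated parameters.
    """
    return 2.0 * (loglik_general - loglik_restricted)

N_OBS = 288          # sample size reported in the paper
K_UNIDIM = 21 * 4    # 21 slopes + 63 thresholds for a unidimensional GRM

# Hypothetical log-likelihoods, used only to show the arithmetic.
loglik_unidim = -5000.0
loglik_bifactor = -4980.0

aicc_unidim = aicc(loglik_unidim, K_UNIDIM, N_OBS)
lr_demo = lr_statistic(loglik_unidim, loglik_bifactor)
print(round(aicc_unidim, 2), lr_demo)
```

Because the correction term grows quickly as the number of parameters approaches the sample size, heavily parameterized models are penalized more strongly under the corrected criterion than under the uncorrected AIC.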

INSERT TABLE 4 HERE
Two correlated-trait models, M3 and M2, had acceptable fit and acceptable values of the indices reported in Table 4, but the factor correlations were too high to consider the isolated factors as distinctive traits, or even as distinctive facets of the same trait (the correlations ranged from .89 to .98). These findings, in addition to the presence of a strong general factor in all bifactor models and the very high correlations between the loadings of the single-factor solution and those of the G factors from the five different bifactor solutions, suggested that the BDI-II is essentially a unidimensional instrument, despite the presence of some multidimensionality. Thus, we opted for the most parsimonious, unidimensional model, M1.

Item Analysis of the BDI-II
Given that the previous analyses supported the unidimensionality of the BDI-II in the presence of some multidimensionality, we present the factor loadings from the single-factor model and the item parameters of the IRT graded response model (Table 5). It is important to note that the overall IRT model fit was good. A certain level of misfit involved items #10, #1, and #12; however, it could be considered negligible 3.

INSERT TABLE 5 HERE
All items had adequate slope parameters (a), ranging from 1.38 (Loss of interest in sex) to 2.89 (Sadness), indicating high to very high discrimination (Baker, 2001). The difficulty of the item categories is expressed on the θ scale, which reflects the severity of depression. θ = 0 reflects an average level of depression, while positive and negative values of theta indicate depression above and below the average level, respectively. As can be seen in Table 5, category difficulty parameters ranged from -1.51 (Loss of pleasure) to .292 (Suicidal thoughts) for the threshold between categories 0 and 1, from -.04 (Loss of interest in sex) to 1.47 (Suicidal thoughts) for the threshold between categories 1 and 2, and from .58 (Loss of interest in sex) to 2.08 (Past failure) for the threshold between categories 2 and 3. Overall, the lower the value of a category threshold, the lower the level of depression needed to endorse that category. For example, one has to be much more depressed to endorse category 1 on the Suicidal thoughts item than on the Loss of pleasure item.
Since the GRM does not provide item locations, we calculated the mean of the category threshold parameters per item (LImean; Ali, Chang, & Anderson, 2015). According to LImean, the locations of the majority of items were slightly above 0, ranging from -.19 to .63, with the exception of Suicidal thoughts, which had LImean = 1.26. In terms of IRT, an item is most informative (precise) as a measure of a latent trait (θ) at its location. Based on the values of the thresholds and LImean reported in Table 5, one can conclude that the BDI-II is most appropriate, and thus most accurate, for measuring depression in individuals with above-average levels of depression.
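The link between item location and information can be sketched as follows. The code computes LImean as the mean of the thresholds and evaluates the standard GRM item information function, I(θ) = Σ_k (P_k')² / P_k, at the item location and at two distant trait levels. The function names and parameter values are hypothetical and are not taken from Table 5.

```python
import math

def cumulative_probs(theta, a, thresholds):
    """P(X >= k) for k = 0..K+1 under the graded response model."""
    inner = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    return [1.0] + inner + [0.0]

def item_information(theta, a, thresholds):
    """GRM item information: sum over categories of (dP_k/dtheta)^2 / P_k."""
    cum = cumulative_probs(theta, a, thresholds)
    info = 0.0
    for k in range(len(thresholds) + 1):
        p_k = cum[k] - cum[k + 1]
        # Derivative of a logistic curve P* is a * P* * (1 - P*).
        dp_k = a * (cum[k] * (1.0 - cum[k]) - cum[k + 1] * (1.0 - cum[k + 1]))
        info += dp_k * dp_k / p_k
    return info

def li_mean(thresholds):
    """Item location defined as the mean of the category thresholds."""
    return sum(thresholds) / len(thresholds)

# Hypothetical item; the values are illustrative only.
a_demo = 2.0
b_demo = [-1.0, 0.5, 1.5]

location = li_mean(b_demo)
info_at_location = item_information(location, a_demo, b_demo)
info_far_below = item_information(-4.0, a_demo, b_demo)
info_far_above = item_information(4.0, a_demo, b_demo)
print(round(location, 3), round(info_at_location, 3))
```

In this example the information near the item location is far larger than at trait levels several logits away, which is the sense in which an item measures most precisely around its location.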
As can be seen from Figure 1, this Serbian adaptation of the BDI-II is most informative in the θ range between -2 and 2 logits, with most of the information (about 63%) located in individuals with above-average levels of depression. Figure 1 also shows that the BDI-II has outstanding precision in this clinical sample. Based on Figure 1, when test information is converted to a classical reliability coefficient (Thissen, 2000), the BDI-II has reliability equal to or above .90 (information over 10) in the theta range from -1.65 to 2.46 logits, and equal to or above .95 (information over 20) in the theta range from -0.26 to 1.74 logits.
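The conversion from test information to a reliability-like coefficient can be sketched as follows, assuming the latent trait is scaled to unit variance so that reliability = 1 - 1/I(θ). The function names are hypothetical.

```python
import math

def information_to_reliability(information):
    """Convert test information to reliability, assuming var(theta) = 1."""
    return 1.0 - 1.0 / information

def standard_error(information):
    """Conditional standard error of measurement at a given trait level."""
    return 1.0 / math.sqrt(information)

# The two information levels mentioned in the text.
rel_at_10 = information_to_reliability(10.0)
rel_at_20 = information_to_reliability(20.0)
print(round(rel_at_10, 2), round(rel_at_20, 2))
print(round(standard_error(10.0), 3), round(standard_error(20.0), 3))
```

Under this assumption, information values of 10 and 20 correspond to reliabilities of .90 and .95, which matches the cutoffs used to shade Figure 1.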

Differential Item Functioning of the BDI-II
In order to investigate the measurement invariance of the BDI-II between genders, a differential item functioning (DIF) analysis was performed. DIF can be manifested as a difference in slopes (discrimination), a difference in location parameters (difficulties), or both. The existence of DIF was tested using the likelihood ratio test, comparing the fit of the baseline model with the fit of nested comparison models (see Meade, 2016).
Regarding differences in location parameters, gender-related DIF (using the anchor items approach proposed by Lopez Rivas, Stark, & Chernyshenko, 2009) 4 was detected in the following items: #20, #18, and #21. According to the standardized difference in expected scores (ESSD), which can be interpreted as Cohen's d (Meade, 2016), females had slightly higher mean expected scores on these three items than males with the same level of depression. The DIF effect size was trivial for item #20 (ESSD#20 = .196), small for item #18 (ESSD#18 = .20), and moderate for item #21 (ESSD#21 = .71). However, the maximum differences in the expected scores for items #18 and #21 occurred at below-average levels of depression, i.e., in the part of the latent trait continuum that is not crucial for the functioning of the BDI-II as a clinical instrument.
4 Details of the analyses are available from the first author upon request.
Item #10 (Crying) demonstrated a significant, small difference in slopes between genders (ESSD#10 = -.25). This means that the difference in expected scores between the two groups depended on the location on the theta continuum. As can be seen in Figure 2, assuming the same level of depression, males with θ below average had higher expected scores on item #10 than females. On the other hand, males with θ above average had lower expected scores than females with the same level of depression. The slope is somewhat steeper in the female subsample, meaning that item #10 differentiates better between levels of depression in females, which implies that crying is more closely related to the intensity of depression in females than in males.
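The crossing pattern that characterizes non-uniform DIF can be reproduced with a minimal sketch in which two groups share the same thresholds but differ in slope. The expected item score under the GRM equals the sum of the cumulative category curves. All parameter values below are hypothetical and are not the estimates behind Figure 2.

```python
import math

def expected_score(theta, a, thresholds):
    """Expected item score under the GRM: the sum of P(X >= k) for k = 1..K."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

# Hypothetical parameters: identical thresholds, steeper slope in one group.
shared_thresholds = [-1.0, 0.0, 1.0]
slope_flatter = 1.2   # stands in for the group with the flatter curve
slope_steeper = 2.0   # stands in for the group with the steeper curve

for theta in (-2.0, 0.0, 2.0):
    flat = expected_score(theta, slope_flatter, shared_thresholds)
    steep = expected_score(theta, slope_steeper, shared_thresholds)
    print(theta, round(flat, 3), round(steep, 3), round(flat - steep, 3))
```

With these symmetric thresholds the two curves cross at θ = 0: below the crossing point the flatter curve yields the higher expected score, and above it the steeper curve does, so the sign of the group difference depends on the trait level.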

INSERT FIGURE 2 HERE
However, the existence of these differentially functioning items did not affect the functioning of the test as a whole. The maximum difference in the expected test scores was less than 1 point (.99), occurring at a theta level of -.83 (a below-average level of depression). The effect size of this difference was trivial (ETSSD = -.003). The signed test difference in the sample (STDS), which allows differences between the items forming the total score to cancel out, was -.049, while the unsigned test difference in the sample (UTDS) was 2.06 points. The gap between UTDS and STDS for the Serbian adaptation of the BDI-II shows that, although some differences in functioning exist at the item level, they cancel each other out at the test level and do not affect the total score.
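The contrast between signed and unsigned test-level differences can be illustrated with a small sketch. It follows our reading of Meade's definitions: item-level differences in expected scores are averaged over the focal group's trait levels and then summed across items, either with their sign (STDS) or in absolute value (UTDS). The function names, item parameters, and theta values are hypothetical.

```python
import math

def expected_score(theta, a, thresholds):
    """Expected item score under the GRM: the sum of P(X >= k) for k = 1..K."""
    return sum(1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds)

def signed_unsigned_test_differences(focal_items, reference_items, focal_thetas):
    """Return (STDS, UTDS) summed over items, averaged over focal-group thetas.

    Each item is a tuple (a, thresholds) estimated separately in each group.
    """
    stds = 0.0
    utds = 0.0
    for (a_f, b_f), (a_r, b_r) in zip(focal_items, reference_items):
        diffs = [expected_score(t, a_f, b_f) - expected_score(t, a_r, b_r)
                 for t in focal_thetas]
        stds += sum(diffs) / len(diffs)                  # signed: items may cancel
        utds += sum(abs(d) for d in diffs) / len(diffs)  # unsigned: no cancellation
    return stds, utds

# Hypothetical two-item test: item 1 is easier for the focal group,
# item 2 is harder for the focal group by the same amount.
reference_items = [(2.0, [-1.0, 0.0, 1.0]), (2.0, [-1.0, 0.0, 1.0])]
focal_items = [(2.0, [-1.3, -0.3, 0.7]), (2.0, [-0.7, 0.3, 1.3])]
focal_thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]

stds_demo, utds_demo = signed_unsigned_test_differences(
    focal_items, reference_items, focal_thetas)
print(round(stds_demo, 3), round(utds_demo, 3))
```

In this constructed example the two items are biased in opposite directions, so the signed total is essentially zero while the unsigned total is clearly positive, which mirrors the pattern of a small STDS alongside a larger UTDS.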

Discussion
Our results demonstrated unequivocally that the BDI-II, when used in a sample of clinically depressed individuals, is best regarded as a unidimensional instrument that measures the severity of depression. Even though two bifactor models had the best fit indices, additional model indices did not justify the creation of subscales. For example, the group factors were not well defined by their loadings. Only the G factor had acceptable values of ECV (.88), FD (.98), H (.97), and ωh (> .90), suggesting that this factor accounted for most of the common variance and for more than 90% of the total score variance in the bifactor models, and that only the latent construct represented by the G factor is well defined and replicable. Finally, there were very high correlations between the factor loadings in the unidimensional model and the loadings of the G factor in all bifactor models considered in this study, supporting our conclusion about the unidimensionality of the Serbian version of the BDI-II. Hence, our results are in line with a number of previous studies which concluded that the BDI-II measures a single construct (depression severity), whereas the clustering of items tapping various aspects of this construct is not sufficient to form group or specific factors (e.g., Brouwer et al., 2013; Osman et al., 2008).
The results of the two-parameter IRT model suggested that all BDI-II items had high to very high discrimination (a parameter). The highest discrimination was demonstrated by the following items: Sadness, Loss of interest, Indecisiveness, and Concentration difficulty. It is noteworthy that these items are part of diagnostic criterion A for major depressive disorder (APA, 2013), and the first two of them are the so-called main depression symptoms (at least one of them has to be present, in addition to symptoms from the other seven, for a diagnosis). From a somewhat different angle, i.e., the network approach to psychopathology, recent studies in adult clinical samples have identified Sadness (Beard et al., 2016), Anhedonia (Bringmann, Lemmens, Huibers, Borsboom, & Tuerlinckx, 2015), and Fatigue (van Borkulo et al., 2015) as central symptoms. Central symptoms are those that have many and/or the strongest associations with other symptoms (Borsboom & Cramer, 2013). It can, then, be said that the most discriminative BDI-II items are, at the same time, the most central ones. Finally, these most discriminative items cover the whole range of depressive symptoms: affective, cognitive, and somatic-vegetative.
Based on the values of the categories' thresholds, which represent the point on the latent trait of depression at which the probability of responding above a certain category is .50, one can conclude that the thresholds were appropriate. Overall, the IRT threshold parameters for the BDI-II are spread over a reasonable range of the latent trait of depression severity (approximately from -2 to +2).
Based on the test information curve, in the clinical sample, the Serbian version of the BDI-II is the most informative in the medium range of depression severity (from -1.65 to 2.46 logits), but more (about 63% of information) in individuals with levels of depression above average. Since the test is primarily intended for clinical purposes, we can consider this to be optimal. In other words, the BDI-II is the most informative in the theta range where the majority of clinical subjects can be expected. Using the more familiar reliability terminology from the classical test theory, one can claim that the BDI-II has excellent reliability over a wide range of depression severities. It has excellent reliability (precision) just where it is needed the most.
Small to moderate DIF was found in three BDI-II items. Females tended to report more changes in appetite and interest in sex at all levels of depression than equally depressed males. It is likely that such a pattern of responding reflects the fact that women, in general, have lower body image satisfaction than men (Hartmann, Rieger, & Vocks, 2019) and have a greater tendency to self-disclose sexual problems (Okur, van der Knaap, & Bogaerts, 2017).
A small non-uniform gender DIF was found for item #10 (Crying). At below-average levels of depression, males tended to disclose more crying than equally depressed females. The pattern was the opposite at higher levels of depression severity, with females, on average, consistently opting for higher response options than their equally depressed male counterparts. Given that crying is a more common emotion-regulation strategy in females than in males (e.g., Vingerhoets & Scheirs, 2000), we can assume that for females at the lower end of the depression severity continuum it is difficult to discern whether crying represents an appropriate emotion-regulation strategy or a depressive symptom. Consequently, they do not report that they cry more than usual. In addition, the crying item is, in general, more discriminative for females than for males, i.e., it better reflects differences in depression severity in females than in males. It seems that crying becomes a more prominent and discriminative symptom of depression in females as they become more depressed. The lower discrimination of crying in males is, on the other hand, a consequence of the distribution of their answers. They opted mainly for the extreme, "not crying" categories. Thus, the majority of their answers fell in category 0 (47%; "boys don't cry" more than usual) and category 3 (28%; they want to cry but cannot). This pattern of results supports a linear relation between crying and depression in females, but not in males, for whom the inability to cry is related to mild as well as severe levels of depression (see Vingerhoets, Rottenberg, Cevaal, & Nelson, 2007, for a review of models of the depression-crying relationship).
Notwithstanding these differences that probably reflect different manifestations of depression between genders and/or their different help-seeking strategies, the total BDI-II score does not seem to be biased as an indicator of depression severity with respect to gender.

Limitations
First, the sample in this study was a convenience sample, composed of inpatients and outpatients of nine psychiatric hospitals in the Republic of Serbia. Since one of the aims of this paper was to examine gender-related measurement invariance, the unequal size of the gender groups (36.8% male, 63.2% female) could be a problem. However, this distribution is in line with the epidemiological data and the higher prevalence of depressive disorder in women (APA, 2013; Nolen-Hoeksema, 1987).

Figure 1. Test information curve for the Serbian version of the BDI-II
*The gray area marks the theta interval in which test reliability is > .90, and the patterned area marks the theta interval in which reliability is > .95