Psychometric evaluation and short form development of the Balanced Inventory of Desirable Responding (BIDR-6)

The goals of this research were to evaluate the Bosnian-Croatian-Serbian (BCS) translation of the BIDR-6 scale, to develop its short form, and to present its initial convergent/discriminative validation. The sample included 827 participants. MIRT CFA analysis revealed that a four-factor model (containing 32 of the 40 items) fits the data best, with Self-Deceptive Enhancement (SDE) and Impression Management (IM) both splitting into denial (SD-D and IM-D) and enhancement (SD-E and IM-E) factors. Fit and item properties were generally mediocre. The SD-D and IM-E subscales were the strongest sources of misfit, so only the SD-E and IM-D subscales were retained in the short form, which had good fit and replicated almost all of the main patterns of associations with other variables of interest (e.g., HEXACO personality traits) typically reported for the full SDE and IM scales in other research. Thus, the 17-item BIDR-6 short form, containing only the SD-E and IM-D subscales, is recommended for use in the BCS-speaking area.

strong misrepresentation of participants' trait levels. In low-stakes testing, SDR is viewed as a "general method variance that is not necessarily faking and that is not necessarily substance" (Holden & Passey, 2010, p. 449). In other cases, e.g., research on self-reported alcohol consumption and harms, SDR has been viewed as a significant threat to validity (Davis, Thake, & Vilhena, 2010).
Proposed techniques for SDR management (Bäckström & Björklund, 2013; Nederhof, 1985; Paulhus, 1991; Paulhus & Vazire, 2007) are fairly limited. Statistical control is arguably the best-known practice, and it requires administering an SDR scale alongside the other measures of interest. Subjects who score high on the SDR scale are either deleted, or their 'contaminated' scores are adjusted (Nederhof, 1985). The latter is typically done by statistically partialling the SDR scale scores out of the measures of interest in an attempt to 'purify' them (Nederhof, 1985; Paulhus & Vazire, 2007). While some authors explicitly advocate this approach (van de Mortel, 2008), others are openly against it, suggesting that it removes valid variance instead of fixing the problem (Paulhus & Vazire, 2007; Uziel, 2010). However, even if the latter position is probably correct, SDR scales are still the most convenient way to detect the issue, regardless of whether they can be used as a 'cure'.
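To make the partialling procedure concrete, the following is a minimal Python sketch, not the procedure of any cited author. It removes the linear SDR component from two measures by ordinary least squares and correlates the residuals, which yields the partial correlation. The function names and all data values are hypothetical and serve only as an illustration.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residualize(scores, sdr):
    """Return the part of `scores` not linearly predictable from `sdr`."""
    ms, mz = mean(scores), mean(sdr)
    slope = (sum((z - mz) * (s - ms) for z, s in zip(sdr, scores))
             / sum((z - mz) ** 2 for z in sdr))
    return [s - (ms + slope * (z - mz)) for s, z in zip(scores, sdr)]

# Hypothetical toy data: an SDR scale score and two measures of interest.
sdr = [2, 5, 3, 6, 4, 7, 1, 5]
trait = [10, 14, 11, 17, 12, 18, 9, 13]
outcome = [3, 6, 4, 6, 5, 8, 2, 7]

raw_r = pearson(trait, outcome)
# 'Purified' association: both measures have the SDR component removed.
partial_r = pearson(residualize(trait, sdr), residualize(outcome, sdr))
print(f"raw r = {raw_r:.3f}, partial r = {partial_r:.3f}")
```

The gap between the raw and the partial correlation is exactly what the two camps disagree about: it can be read as removed contamination or as removed valid variance.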
Recently, some authors have rethought the whole concept of SDR, especially impression management, arguing that it should be redefined as a measure of interpersonally oriented self-control, which characterizes individuals who demonstrate high self-control, particularly in social contexts (Uziel, 2010). De Vries, Zettler, and Hilbig (2014) confirmed this interpretation, showing that impression management might be an expression of the HEXACO Honesty-Humility personality trait (Ashton, Lee, & de Vries, 2014; Ashton et al., 2004; Lee & Ashton, 2004). Thus, SDR questionnaires could be useful tools for the study of individual differences in general, regardless of their potential merit as an actual SDR remedy.

Research problem
BIDR-6 (Paulhus, 1991, 1994; Paulhus & Reid, 1991) is one of the best-known SDR questionnaires (Li & Bagger, 2006), typically thought of as having two dimensions (Asgeirsdottir, Vésteinsdóttir, & Thorsdottir, 2016; Bobbio & Manganelli, 2011; Hart, Ritchie, Hepper, & Gebauer, 2015; Li & Bagger, 2006; Paulhus, 1984): Self-Deceptive Enhancement (SDE) and Impression Management (IM). However, three factors have also been proposed (Kroner & Weekes, 1996; Paulhus & Reid, 1991), with Paulhus and Reid (1991) observing the self-deceptive content of BIDR-6 splitting into enhancement (the claiming of positive attributes) and denial (the repudiation of negative attributes) facets. Li and Li (2008) observed the same split in the IM scale, thus obtaining four factors. Paulhus and Trapnell (2009) have also argued for a four-factor model of SDR. Li and Li (2008) further remarked that the BIDR-6 latent structure might be culturally dependent, requiring separate tests for different cultural backgrounds. Thus, the first goal of this research is to present a thorough psychometric evaluation of the official Bosnian-Croatian-Serbian (BCS) BIDR-6 translation, focusing on its dimensionality. Given the ambiguity in the number of factors, a confirmatory approach will be used to compare the plausible factor solutions. The primary framework for the analysis will be Item Response Theory (IRT), which allows us to study how underlying latent traits interact with item characteristics, such as difficulty and discrimination (Chalmers, 2012). This useful approach has been used only sporadically with the BIDR-6 (e.g., Asgeirsdottir et al., 2016; Cervellione, Lee, & Bonanno, 2008). Most recently, Asgeirsdottir and colleagues (2016) used IRT to shorten the BIDR-6 by retaining only the best 24 items. Other authors have also developed variations of BIDR-6 short forms (e.g., Bobbio & Manganelli, 2011; Hart et al., 2015). Short forms have an advantage over the full questionnaire because many researchers might be reluctant to use a 40-item SDR measure (Hart et al., 2015). Thus, the second goal of this study is to develop the BIDR-6 short form for the BCS-speaking area. Unlike Asgeirsdottir and colleagues (2016), who used a combination of confirmatory factor analysis (CFA) and unidimensional IRT to shorten the BIDR-6, we opted to rely upon a more sophisticated multidimensional variation of IRT (MIRT), which can also be used analogously to CFA, including the estimation of factor loadings and model fit (Chalmers, 2012).
The third goal of this article is to present an initial insight into the convergent and discriminative validity of the BIDR-6 BCS translation. Note that our view of SDR is more in line with the interpersonally oriented self-control perspective (de Vries et al., 2014; Uziel, 2010) than with the view of SDR measures as a way of 'weeding out bad variance'. Thus, we will primarily rely upon the findings of de Vries and colleagues (2014) as a benchmark for the BIDR-6 validation. Specifically, we expect that SDE will correlate with (low) Emotionality, Extraversion, and Conscientiousness, and that IM will correlate with Honesty-Humility, Conscientiousness, and Agreeableness, with the Honesty-Humility correlation being the strongest (de Vries et al., 2014). We also expect the BIDR-6 dimensions to correlate with other measures of SDR, namely with the Brief Social Desirability Scale (Haghighat, 2007a, 2007b). Finally, it is also important to test for gender and age differences, in order to have appropriate benchmark values for different subpopulations. In line with previous research, we expect that women will score higher on IM and men on SDE (e.g., Bobbio & Manganelli, 2011; de Vries et al., 2014), and that the BIDR-6 dimensions will not correlate with age (de Vries et al., 2014).

Measures
Balanced Inventory of Desirable Responding (BIDR-6), Form 40A (Paulhus, 1991, 1994, 2008; Paulhus & Reid, 1991). It consists of 40 Likert-type items answered on a 7-point scale (1 = "not true" through 7 = "very true"). The SDE and IM scales have 20 items each. Each scale also has Enhancement and Denial subscales (10 items each). Two scoring methods exist (Paulhus, 2008; Stober, Dette, & Musch, 2002): continuous (all answers are counted) and dichotomous (only extreme answers are counted). Following the recommendations of Stober and colleagues (2002), continuous scoring was used unless noted otherwise. Adaptation to BCS included two independent back-translations. Following suggestions from the translators and student contributors, several items were slightly modified for cultural reasons. For example, item 30, "I always declare everything at customs.", was modified into "I would always report everything at customs.", as only a few people from Bosnia and Herzegovina (B&H) have extensive personal experience with customs declarations. All item translations and adaptations were verified and approved by the questionnaire's original author (D. L. Paulhus).
Brief Social Desirability Scale, BSDS-V2 (Haghighat, 2007a, 2007b). It contains four true-false items (two are reverse scored), which measure a single dimension of SDR. This short measure was included for convergent validation purposes. The items were added up to create a summary score with a 0-4 range (Md = Mo = 2, M = 1.55, SD = 1.15). While the internal consistency reliability of the BSDS-V2 was relatively low (KR-20 = .64), it is comparable to typical values for other short SDR questionnaires and to its own referenced value (Haghighat, 2007b).
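The two BIDR-6 scoring methods described above can be illustrated with a short Python sketch. It rests on assumptions that are not taken from this article: the set of reverse-keyed items is hypothetical, and "extreme answers" is operationalized as a keyed response of 6 or 7, which should be checked against the scoring manual before any real use.

```python
def score_bidr(responses, reverse_keyed, method="continuous"):
    """Score one respondent on a set of 7-point BIDR-style items.

    responses     -- dict mapping item number to an answer in 1..7
    reverse_keyed -- set of item numbers to reverse (hypothetical here)
    method        -- "continuous" (sum of keyed answers) or
                     "dichotomous" (one point per keyed answer of 6 or 7;
                     this cut-off is an assumption of the sketch)
    """
    if method not in ("continuous", "dichotomous"):
        raise ValueError(f"unknown scoring method: {method}")
    total = 0
    for item, answer in responses.items():
        if not 1 <= answer <= 7:
            raise ValueError(f"item {item}: answer {answer} is outside 1..7")
        # Reverse-keyed items are flipped so that 7 is always the
        # socially desirable end of the scale.
        keyed = 8 - answer if item in reverse_keyed else answer
        if method == "continuous":
            total += keyed
        else:
            total += 1 if keyed >= 6 else 0
    return total

# Hypothetical respondent and keying, for illustration only.
answers = {1: 7, 2: 2, 3: 6, 4: 4}
reverse = {2}
print(score_bidr(answers, reverse, "continuous"))
print(score_bidr(answers, reverse, "dichotomous"))
```

Under dichotomous scoring a moderate answer contributes nothing, which is why the two methods can rank the same respondents differently.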

BIDR-6 dimensionality assessment and item properties
Initial model fits of the five tested models are shown in Table 1. The four-factor model (Model 3) clearly had the best fit. However, only the RMSEA value was good and the SRMSR acceptable, while the other indices were below the conventional cutoffs (Hooper, Coughlan, & Mullen, 2008). Since CFI and TLI tend to penalize models with a large number of indicators per latent variable (here: 10 per factor), especially when factor loadings (Λ) are in a lower range, this discrepancy in fit values is somewhat understandable (Kenny & McCoach, 2003; Sharma, Mukherjee, Kumar, & Dillon, 2005), but the values are low nevertheless. Regardless, as the best-fitting model, the four-factor solution was used as the basis for further analyses.
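For orientation, the following is a minimal Python sketch of the textbook chi-square-based definitions of RMSEA, CFI, and TLI. It is not the computation behind Table 1: MIRT software may derive these indices from limited-information fit statistics instead, and every number in the usage lines is hypothetical, apart from the sample size of 827.

```python
import math

def rmsea(chi2, df, n):
    """Root mean square error of approximation for sample size n."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to a baseline (null) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base

def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index; compares chi-square/df ratios of both models."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)

# Hypothetical fit statistics for a tested model and its baseline model.
print(round(rmsea(200.0, 100, 827), 3))
print(round(cfi(200.0, 100, 1100.0, 120), 3))
print(round(tli(200.0, 100, 1100.0, 120), 3))
```

CFI and TLI both depend on how badly the baseline model fits, whereas RMSEA does not, which is one reason the three indices can disagree for the same model.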
Factor loadings are relatively low, with the average Λs (sums of Λ² in brackets) for the SD-E, SD-D, IM-E, and IM-D factors being .59 (2.49), .52 (1.91), .51 (2.19), and .47 (2.27), respectively. Furthermore, the subscales exhibit an obvious Enhancement~~Enhancement and Denial~~Denial cross-scale correlation pattern, instead of an Enhancement~~Denial within-scale pattern (with an especially low SD-E~~SD-D correlation). This reinforces the notion that the subscales should be treated separately and that, given our data, combining them into SDE and IM scales is not advisable. Internal consistencies are generally moderate (with a noticeably lower value for SD-D) and roughly in line with the values expected for the BIDR-6 (Paulhus, 2008).
The majority (n = 25) of the items have moderate unidimensional discrimination (α), several (n = 5) have high discrimination, and items 34 and 11 have low and very high discrimination, respectively (labels follow Baker, 2001, p. 35). The majority (n = 21) of the items have low multidimensional discrimination (MDISC), with 11 items having moderate values. On average, at both the unidimensional and the multidimensional level, SD-E has the most discriminative items (M_α = 1.27, M_MDISC = 0.75), followed by IM-E (M_α = 1.05, M_MDISC = 0.62) and SD-D (M_α = 1.05, M_MDISC = 0.62), and lastly by IM-D (M_α = 0.92, M_MDISC = 0.54). The majority of SD-E item thresholds are negative, and the uppermost thresholds (β_6) are mostly not too high (M_β6 = 1.57), meaning that a 50% probability of choosing answer 7 ("very true") requires only moderately high levels of the latent trait. Consequently, participants mostly tended to agree with the statements, i.e., SD-E items are "easy". On the SD-D subscale, items 16 and 18 (which refer to the appreciation of criticism and to doubting one's own abilities as a lover, respectively) were noticeably "harder" than the other SD-D items, with elevated upper thresholds (β_6 and, to a degree, β_5) implying a low probability of participants strongly agreeing with the statements. The other SD-D items have slightly narrower item threshold ranges, grouped around the middle of the latent trait, suggesting a discrete uniform distribution of answers. IM-E also has several easier items with very low upper thresholds and narrow threshold ranges (items 36, 38, and 26), with items 30 and 40 also displaying signs of a discrete uniform distribution of answers. Finally, the IM-D subscale has two obviously hard items (31 and 25, referring to stealing and revenge, respectively). Items 23, 25, 29, and 31 also fall on the harder side, and items 21 and 37 on the easier side. Items 33, 35, and 39 display some discrete uniform distribution tendencies, but these are not as pronounced as in the SD-D and IM-E items mentioned above.
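The reading of the thresholds above can be checked with a small Python sketch of a unidimensional graded response model, which is a simplification of the multidimensional model used in this study. The first five threshold values are hypothetical; only the discrimination (1.27) and the uppermost threshold (1.57) echo the SD-E averages reported above. The labelling helper follows the cut-offs in the note to Table 2, and its handling of values that fall exactly between two bands is an assumption.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Category probabilities under a unidimensional graded response model.

    theta      -- latent trait level
    a          -- item discrimination
    thresholds -- ordered thresholds b_1 < ... < b_(k-1) for k categories
    """
    # Cumulative probabilities of answering in category j or higher.
    cum = [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    bounds = [1.0] + cum + [0.0]
    # Each category probability is a difference of adjacent cumulatives.
    return [bounds[j] - bounds[j + 1] for j in range(len(bounds) - 1)]

def discrimination_label(value):
    """Verbal label for a discrimination value (cut-offs from Table 2 note)."""
    if value < 0.35:
        return "very low"
    if value < 0.65:
        return "low"
    if value < 1.35:
        return "moderate"
    if value < 1.70:
        return "high"
    return "very high"

# Hypothetical 7-point item; the last threshold plays the role of beta_6.
b = [-2.0, -1.2, -0.5, 0.2, 0.9, 1.57]
probs = grm_category_probs(theta=1.57, a=1.27, thresholds=b)
print([round(p, 3) for p in probs])
print(discrimination_label(1.27))
```

When the trait level equals the uppermost threshold, the probability of the top category is one half whatever the discrimination, which is the sense in which β_6 marks the 50% point for answer 7.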
It is obvious from the analyses that both the model fit and the item properties of the BIDR-6 are generally mediocre, with few items standing out either positively or negatively, setting aside the eight assumption-violating items. Note, however, that all parameter values, while not fully comparable because of differences between the analyses, are equal to or better than the values presented in a recent IRT-based BIDR-6 analysis by Asgeirsdottir and colleagues (2016). Most obviously, item thresholds are much less extreme in our data.

BIDR-6 short form
Starting from Model 3.1, we developed the BIDR-6 short form through an iterative item removal process, guided by the Λ, α/MDISC, and β values and using the resulting model fit as a benchmark. After a few iterations, the SD-D and IM-E subscales deteriorated to the point that all of their items were removed. This implied that SD-D and IM-E conform to the 2PL GRM MIRT model worse than SD-E and IM-D do. Problems with SD-D and IM-E became even more apparent when we tentatively tried dichotomous scoring (Paulhus, 2008; Stober et al., 2002), after which the sums of Λ² sharply increased for SD-E (4.33) and IM-D (6.08), but sharply decreased for SD-D (1.66) and IM-E (3.45), with the absolute Λs of four SD-D and two IM-E items dropping below .32. SD-D and IM-E items tended to drop out regardless of the scoring method or the number of factors used as a starting point. This happened even when conventional CFA was used. The narrow item threshold ranges and the tendency towards a discrete uniform distribution of answers in several SD-D and (to a lesser degree) IM-E items are probably the reason. Thus, we opted to remove the SD-D and IM-E subscales completely, retaining only the SD-E and IM-D subscales (from Model 3.1). Paulhus and Reid (1991) have shown that the relative independence of enhancement and denial items is not simply a result of the item keying direction. We did retain one positively keyed (SD-E) and one negatively keyed (IM-D) subscale, implying that keying is not a deciding factor in our case, but given the high Enhancement~~Enhancement and Denial~~Denial cross-scale correlations, this might be worth exploring further via an experimental manipulation of the keying direction.

BIDR-6 convergent and discriminative validity
Given that the short BIDR-6, with only the SD-E and IM-D subscales, fit much better than the best-fitting full model, we tested convergent and discriminative validity using only SD-E and IM-D.
Discussion. The SD-E and IM-D subscales alone almost perfectly replicate the correlation patterns expected for the full SDE and IM scales, making the removal of the SD-D and IM-E subscales a non-issue. Specifically, our data replicate almost all of the hypothesized/benchmark correlations observed by de Vries and colleagues (2014), who used the full SDE and IM scales. The only obvious difference is the (very) low (Cohen, 1992) tendency of SD-E to increase, and of IM-D to decrease, with age in our sample, but this probably does not require separate age-group norming. A slight caveat also applies to the SD-E~~Honesty-Humility and SD-E~~eXtraversion correlations. In the first case, there is a significant positive correlation, while the benchmark value was .00. However, given that our correlation falls within the trivial effect size range (Cohen, 1992), the values are still comparable. In the second case, the benchmark correlation is somewhat higher (.46 versus .27). All other correlations fall in line almost perfectly with the benchmark values. This includes IM-D~~Honesty-Humility as the strongest observed correlation, which supports the previous notion that impression management might be a partial expression of Honesty-Humility (de Vries et al., 2014) and/or interpersonally oriented self-control (Uziel, 2010). A more elaborate investigation of the underlying reasons for this is beyond the scope of this article, but the BIDR-6 short form presented here could clearly be used for such an investigation in the BCS-speaking area.
SD-E and IM-D correlate with a short measure of SDR (BSDS-V2), as conceptually expected, even though the correlations were in a lower range (Cohen, 1992). Finally, SD-E and IM-D also replicated the typical IM/SDE gender trends (Bobbio & Manganelli, 2011; de Vries et al., 2014), with women scoring higher on IM-D and men on SD-E. However, given the small effect sizes, separate gender norms might not be needed for low-stakes testing.

General discussion
"Controversy over the dimensionality of SDR is ongoing" (Hart et al., 2015, p. 7). This includes the BIDR-6. According to our data, the best-fitting BIDR-6 structure was a four-factor one, with the SDE and IM scales each splitting into enhancement (SD-E and IM-E) and denial (SD-D and IM-D) dimensions. Paulhus and Reid (1991) also obtained the enhancement-denial split for the SDE scale, but not for the IM scale. Four factors have been theorized by Paulhus and Trapnell (2009) in recent SDR model conceptualizations, but, until now, a four-factor BIDR-6 structure had been observed only in a Chinese sample (Li & Li, 2008).
Even though (after the deletion of eight assumption-violating items) the psychometric properties of the items in our four-factor model were equal to or better than those in a recent IRT-based BIDR-6 analysis (Asgeirsdottir et al., 2016), the values were still mediocre, as was the overall model fit. The SD-D and IM-E subscales were the causes of the less-than-good fit, probably due to narrow item threshold ranges and some tendency towards a discrete uniform distribution of answers. This made the process of shortening the BIDR-6 very easy, as simply removing SD-D and IM-E and retaining the SD-E and IM-D subscales (17 items in total) produced a good fit. Several more items could have been removed because of their difficulty, but this offered no improvement in fit, and we argue that no further item removal is advisable until the performance of the retained items is investigated under a deliberate faking condition (Asgeirsdottir et al., 2016). Such an investigation would be an obvious next research step.
We also presented evidence that multidimensional IRT-based CFA (i.e., CIA) analysis has clear fit advantages over conventional CFA for the BIDR-6. Thus, we advise other researchers to consider using MIRT CFA. Expected item properties would likely still be mediocre, but since SDR represents either method variance (Holden & Passey, 2010) or a response style (of a sort) measuring interpersonally oriented self-control (de Vries et al., 2014; Uziel, 2010), it is unrealistic to expect anything better from the BIDR-6 (or other SDR scales).
It appears that the removal of the SD-D and IM-E subscales did not compromise the validity of the BIDR-6, as SD-E and IM-D retain almost all of the convergent and discriminative properties expected for the full SDE and IM scales. Namely, SD-E and IM-D replicate the benchmark correlations with the HEXACO personality traits, most importantly with the Honesty-Humility dimension, making this short form fully suitable for further investigation of the hypothesis that IM might be an expression of Honesty-Humility (de Vries et al., 2014). The short form also replicates the typical gender patterns (Bobbio & Manganelli, 2011; de Vries et al., 2014) and correlates (albeit at a lower level) with at least one other SDR measure (Haghighat, 2007a, 2007b). It differs only slightly from the benchmark with regard to age-related trends (de Vries et al., 2014).
In conclusion, the short form of this BIDR-6 BCS translation has adequate psychometric properties for general research purposes and is recommended over the long form.

Table 2
MIRT CFA analysis results for Model 3.1
Note. Item thresholds represent the multidimensional item difficulty (there are k-1 thresholds, where k is the number of item ranks). α = unidimensional discrimination (calculated only for the given factor); MDISC = multidimensional discrimination; discrimination values below 0.34 are considered very low, 0.35-0.64 are low, 0.65-1.34 are moderate, 1.35-1.69 are high, and values over 1.70 are very high.

Table 4
Correlations of the BIDR-6 (short form) factors with other variables
Note. * p < .05, ** p < .01, *** p < .001. All variables were recoded so that higher values represent higher SDR. † marks consistent and ‡ marks inconsistent correlations with the benchmark findings of de Vries and colleagues (2014). † and ‡ are not applicable for the BSDS-V2.