Psychometric evaluation and validation of the Serbian version of “Reading the Mind in the Eyes” test

The “Reading the Mind in the Eyes” test (RMET) is one of the most popular and widely used measures of individual differences in Theory of Mind (ToM) capabilities. Despite demonstrating good validity in differentiating various clinical groups exhibiting ToM deficits from unimpaired controls, previous studies raised the question of the RMET’s homogeneity, latent structure, and reliability. The aim of this study is to provide evidence on the psychometric properties, latent structure, and validity of the newly adapted Serbian version of the RMET. In total, 260 participants (61.9% females) took part in the study. The sample consisted of both unimpaired controls (76.5%) and a clinical group of participants who are believed to demonstrate ToM deficits (23.5%), namely, persons diagnosed with schizophrenia and bipolar disorder (54.1% females). The RMET demonstrated fair psychometric properties (KMO = .723; α = .747; H1 = .076; H5 = .465), successfully differentiating between the clinical group and controls [F (1,254) = 26.175, p < .001, ηp² = .093], while typical gender differences in performance were found only in the control group. Tests of several models based on the previous literature revealed that the affect-specific factors underlying performance on the RMET demonstrate poor fit. The best-fitting model was a reduced scale with a single factor underlying test performance (TLI = .953, CFI = .958, RMSEA = .020). Based on the fit parameters, we propose an 18-item short form of the Serbian version of the RMET (KMO = .797; α = .728; H1 = .129; H5 = .677) for economical, reliable, and valid measurement of ToM abilities.

Social cognition is a mental operation which lies at the basis of altruistic behavior, caused by empathizing or understanding hints made by other people which show a need for concealment, sharing and help (Mussen & Eisenberg, 1977). According to Addington and colleagues, there are four domains of social cognition: Theory of Mind (ToM), attributive style, perception of emotions, and social observation (Addington, Penn, Woods, Addington, & Perkins, 2008). Social cognition can be divided into lower-level processes such as recognition and perception of socio-emotional signs including facial expressions, depth of voice, gestures; and higher-level processes such as inferring conclusions about mental states of others (that is ascribing mental states), empathy and emotional regulation (Ochsner, 2008). The capacity for emotional investment in relationships and moral standards indicates the orientation of the society focused on the need, as opposed to investing in values, ideals, and interpersonal relations. Damage to social cognition is observed in different clinical entities, from pervasive disorders to endogenous psychosis, eating and personality disorders. ToM tests are frequently used for assessment of social cognition.

Theory of mind
Theory of mind (ToM) is a concept that describes people's ability to understand and describe the mental states of other people, their intentions and beliefs (Premack & Woodruff, 1978). More specifically, ToM studies the psychological processes that serve to understand others or make mental boundaries between self and others (Doherty, 2009). Scholars suggest that the basis of ToM is a kind of mental modeling in which the simulator uses his mental frame of mind as an analog model simulating the object (Gordon, 1986).
ToM is called a theory because it assumes that mental states of others are not directly detectable but must be generated through predictions about how others think and will behave. This theory was originally developed to describe the behavior of chimpanzees (Premack & Woodruff, 1978), and then was expanded to describe the development of children and their ability to predict the perspective of others (Wellman, Cross, & Watson, 2001). Later on, this model was applied in describing the social and communicative deficits in specific clinical populations, mostly from the autism spectrum (Baron-Cohen, Leslie, & Frith, 1985). It is considered conceptually similar or equivalent to cognitive empathy (Baron-Cohen et al., 2015) because both constructs include conclusions about the mental state of another person. There are two disciplines studying ToM: social science, exploring the neural basis of ToM, and developmental psychology, interested in how these capabilities develop (Mahy, Moses, & Pfeifer, 2014). There are four major theories of ToM development in children: modularity, simulation, executive and theory theories (Mahy et al., 2014).
Neuroimaging studies provided some evidence on the neural basis of ToM. Functional magnetic resonance imaging studies assess the neural substrates of ToM in situations where respondents are thinking about their own or someone else's mental state. These studies demonstrated the activation of the posterior superior temporal sulcus and temporoparietal junction, medial prefrontal cortex, temporal poles and precuneus in ToM type tasks (Frith, 2007). Affective ToM seems to be based on a phylogenetically older emotional system in the lower frontal gyrus, while the cognitive ToM is likely dependent on the functioning of the ventromedial prefrontal gyrus (Shamay-Tsoory, Harari, Aharon-Peretz, & Levkovitz, 2010). The role of the ventromedial prefrontal cortex is controversial given the numerous connections of ventromedial prefrontal cortex with other regions such as the amygdala, superior temporal sulcus and anterior insula (Shamay-Tsoory, Tibi-Elhanany, & Aharon-Peretz, 2006).
"The blindness of the mind" is the opposite of ToM. That is a cognitive disorder characterized by an inability to ascribe a mental state to self or another person. This feature appears in people with Asperger's syndrome, autism, and schizophrenia as well as in other disorders that show a deficit of social insight. A person with this disorder is unable to understand or predict mental states of other people (Frith, 2001; Pijnenborg, Spikman, Jeronimus, & Aleman, 2013).
While the ToM is usually considered as one unitary construct, some authors have described it as multiple constructs which include perception, attention, beliefs, desires, intentions, and emotions (Astington, 2003). According to this approach, the tests used for assessment of ToM should be multiple, assessing subconstructs (Slaughter & Repacholi, 2003). However, in practice, researchers and clinicians use unidimensional tests such as the "Reading the Mind in the Eyes" test.

Reading the Mind in the Eyes test
The "Reading the Mind in the Eyes" test (RMET) is considered to be a measure of nonverbal aspects of ToM. RMET is commonly used for ToM assessment both in general and clinical populations, with a special focus on the autistic spectrum disorders. The test is designed to measure the first level of ToM, attribution, which identifies the relevant mental state, as opposed to the second level in which the content of mental state is inferred (Baron-Cohen, Wheelwright, Hill, Raste, & Plumb, 2001). This test has been developed under the assumption that ToM heavily relies on the perception of eye gaze of the person being observed (Baron-Cohen, Jolliffe, Mortimore, & Robertson, 1997; Baron-Cohen et al., 2001), since eye gaze is considered an important aspect of social interaction and communication (Emery, 2000).
The original version of the test (Baron-Cohen et al., 1997) consists of a series of 25 photographs depicting the area around the eyes, with two descriptors of mental states presented with each photograph. The participant's task is to select the alternative s/he considers to be the most suitable description of feelings or thoughts expressed by the person in the photograph. In order to resolve some of the issues the test was facing, a revised version of the instrument was designed (see Baron-Cohen et al., 2001). The second version of RMET consists of 36 male and female photographs (approximately equalized) of the area around the eyes with four descriptors of mental states offered, out of which only one is the correct description of feelings or thoughts expressed by the person in the photo. So far, this test has been adapted and translated into a variety of languages, e.g. Italian (Vellante et al., 2013), French (Prevost et al., 2014), Romanian (Miu, Pana, & Avram, 2012), Bosnian (Schmidt & Zachariae, 2009), Spanish (Fernández-Abascal, Cabello, Fernández-Berrocal, & Baron-Cohen, 2013), German (Pfaltz et al., 2013), Turkish (Girli, 2014; Yildirim et al., 2011), Swedish (Hallerbäck, Lugnegård, Hjärthag, & Gillberg, 2009), Japanese (Adams et al., 2009; Kunihira, Senju, Dairoku, Wakabayashi, & Hasegawa, 2006), Persian (Khorashad et al., 2015), etc.

Present study
This study aims to explore the aforementioned issues through an examination of the psychometric properties, latent structure, and validity of the Serbian adaptation of RMET. On the following pages, we present a psychometric evaluation of the newly adapted Serbian version of the RMET and compare its psychometric quality with that of other adaptations. The study addresses the latent structure of RMET by comparing several competing models found in previous literature, trying to establish whether the object of RMET's measurement is unidimensional and general in nature, or multidimensional and affect-specific. Moreover, the validity of the instrument was examined by testing its power to differentiate between entities that are supposed to demonstrate ToM-specific deficits, namely persons suffering from schizophrenia and bipolar disorder (Bora et al., 2016; Bora et al., 2009), and unimpaired controls, as well as by testing typically observed gender differences, i.e. "female superiority" in the performance on RMET (see Baron-Cohen et al., 2015; Baron-Cohen, Knickmeyer, & Belmonte, 2005; Baron-Cohen et al., 2001; Khorashad et al., 2015; Schiffer, Pawliczek, Muller, Gizewski, & Walter, 2013; Vellante et al., 2013). Finally, based on the results obtained, we propose a short, economical version of the instrument and contrast it with other short versions of the test suggested in the previous literature.

Method

Participants
A sample of 260 participants, age range 18 to 64 (M = 32.44, SD = 11.47; 61.9% females), took part in the study. Participants' years of education varied from 8 to 22, with a mean of approximately 14 years (M = 13.66, SD = 2.59). In order to cover the full spectrum of variability of the construct measured, and to test the diagnostic validity of the instrument, the sample consisted of participants from both the student and the general population (76.5%), as well as the clinical population (23.5%; 54.1% females). More specifically, persons diagnosed with schizophrenia (49.2%) and bipolar disorder (50.8%) were included in the sample since previous studies showed that these entities demonstrate ToM-specific deficits (Bora et al., 2016; Bora et al., 2009). Subjects participated in the study on a voluntary basis and signed an informed consent.

Instrument
Translation and cross-cultural adaptation of the test followed the instructions of the Autism Research Centre (ARC; www.autismresearchcentre.com) and relied on the experience of other researchers who have had the same adaptation done in other cultural environments. Adaptation of the original instrument (Baron-Cohen et al., 2001) was carried out using the standard backward translation method, i.e. by researchers bilingual in English and Serbian, as well as by a professional translator. The preliminary Serbian version was tested on 40 subjects, after which, with minimal corrections, the test was submitted to the ARC for approval. Upon approval, the test was administered to participants in line with the instructions provided by the ARC (Baron-Cohen et al., 2001).
The revised version of the "Reading the Mind in the Eyes" test (RMET) (Baron-Cohen et al., 2001) consists of 36 photographs which present eyes region of different individuals (19 male stimuli, 17 female stimuli).Each of the photographs is presented along with the four descriptors of complex mental states (Figure 1).Participants' task was to, among the descriptors offered, select the one which seems to be the most appropriate description of feelings or thoughts expressed by the individuals presented in the photograph.Among the descriptors offered, within each item, there is one target word and three foils.

Procedure
Following the practice section in which participants were familiarized with the task, they were successively presented with 36 eyes photographs each followed by four descriptors offered.Participants' task was to select the most appropriate one among four descriptors of mental state (feelings or thoughts) of a person presented in the photo.Glossary of mental states has been provided and participants could consult it at any time during testing.Testing was not time-limited, but participants were given an instruction not to contemplate too much on individual items.

Results
Table 1 displays the percentage of participants who chose each option within every item. As shown, the proportion of participants who chose the target word ranges from .46 to .91, with items in most cases being successfully solved by at least 50% of participants. Furthermore, it can be noted that some items exhibited specific patterns of option selection. More specifically, most of the items have one salient distractor that competes with the target word, while the other options are seldom chosen. For example, the odds of option 4 being (wrongly) selected as the target word within item 3 are 9 times higher than for option 1 and 14 times higher than for option 2. A similar disproportion can be found within item 6, for example. The number of items with more than one dominant option competing for the correct answer is disproportionally low (for example, items 8, 9, 11, 13, 15, etc.).
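The option-level item analysis described above can be sketched in a few lines of Python. This is only an illustration: the response vector is invented, the function names are hypothetical, and the sketch compares selection proportions, whereas the text speaks of "odds" without specifying the exact computation.

```python
from collections import Counter


def option_proportions(responses, n_options=4):
    """Proportion of participants choosing each option (1..n_options)."""
    counts = Counter(responses)
    n = len(responses)
    return {opt: counts.get(opt, 0) / n for opt in range(1, n_options + 1)}


def foil_ratio(props, foil_a, foil_b):
    """How many times more often foil_a was chosen than foil_b."""
    return props[foil_a] / props[foil_b]


# Hypothetical responses of 100 participants to one item (target = option 3).
responses = [3] * 70 + [4] * 27 + [1] * 3
props = option_proportions(responses)
difficulty = props[3]            # proportion choosing the target word
ratio = foil_ratio(props, 4, 1)  # salient foil vs. a rarely chosen foil
```

In this made-up item, one foil attracts nearly all incorrect answers, which is the pattern the item analysis flags as a salient distractor.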
Twenty-eight out of 36 items fell within the range of item difficulties obtained in previous studies (Table 1). Of the remaining eight items, five proved to be easier and three more difficult compared to the other versions of RMET. However, for six of these eight items, the deviation did not exceed 5% of increase/decrease in item difficulty as compared to other versions of the instrument.
Note (Table 1). Item difficulties in previous studies were calculated as mean percentages of correct responses provided for the German (Pfaltz et al., 2013), Turkish (Yildirim et al., 2011), Spanish (Fernández-Abascal et al., 2013), Italian (Vellante et al., 2013), French (Prevost et al., 2014), and Persian (Khorashad et al., 2015) adaptations of RMET, as well as the values provided in the original publication (Baron-Cohen et al., 2001). Classification of the emotional valence of the target stimuli is based on Harkness et al. (2005).
Following the score calculation, descriptive statistics were obtained. The distribution of participants' scores was severely skewed (zSK = -6.642, p < .01) and elongated (zKu = 4.439, p < .01), indicating a departure from the normal distribution toward higher scores in a leptokurtic manner (K-S = 1.821, p < .01). Individual scores on RMET ranged from .14 to .97, with participants correctly solving, on average, .70 of the items (SD = .14).
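A standardized skewness such as zSK is commonly obtained by dividing the sample skewness by its standard error. The sketch below uses the adjusted Fisher-Pearson skewness and the usual small-sample standard errors; that these are the formulas behind the reported values is an assumption, and the function names and example scores are hypothetical.

```python
import math


def skewness(xs):
    """Adjusted Fisher-Pearson sample skewness (G1)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def se_skewness(n):
    """Standard error of skewness for a sample of size n."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def se_kurtosis(n):
    """Standard error of excess kurtosis for a sample of size n."""
    return 2.0 * se_skewness(n) * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))


def z_skewness(xs):
    """Skewness divided by its standard error."""
    return skewness(xs) / se_skewness(len(xs))


# Hypothetical proportion-correct scores with a tail toward low values.
scores = [0.2, 0.6, 0.7, 0.7, 0.8, 0.8, 0.8, 0.9]
z = z_skewness(scores)  # negative: the tail points toward lower scores
```

Under these formulas the standard error of skewness for N = 260 is about .15, so the reported zSK would correspond to a raw skewness of roughly -1.0.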
In order to examine whether the Serbian version of RMET successfully discriminates between entities that are supposed to have ToM deficits and participants without those deficits, and to test whether females perform better than males, a two-factor analysis of covariance (ANCOVA) was performed, with age and number of years of education taken as covariates. Levene's test indicated equality of error variances across groups [F (3,256) = 0.358, p = .784]. Results of the ANCOVA indicated a significant main effect of group [F (1,254) = 26.175, p < .001, ηp² = .093], with the clinical group performing significantly worse (M = .58, SD = .16) than unimpaired controls (M = .74, SD = .10). On the other hand, the main effect of gender [F (1,254) = 1.152, p = .284] and the group x gender interaction [F (1,254) = 1.777, p = .184] did not reach statistical significance.
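The reported effect size can be cross-checked from the F statistic and its degrees of freedom, since partial eta squared equals SS_effect / (SS_effect + SS_error), which can be rewritten as F·df_effect / (F·df_effect + df_error). A minimal sketch (the function name is hypothetical):

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Main effect of group reported above: F (1,254) = 26.175.
group_effect = partial_eta_squared(26.175, 1, 254)
print(round(group_effect, 3))  # 0.093
```

The result agrees with the reported value of .093.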
In order to cover the full spectrum of variability of the construct measured, the psychometric analysis was performed on both groups taken together. Psychometric characteristics of the test were calculated using the Rtt10g macro (Knežević & Momirović, 1996). Full-scale item sampling adequacy (KMO) was .723, indicating lower representativeness of the items sampled for measuring the given ability. Internal consistency of the test was overall satisfactory, α = .747. Both the average inter-item correlation (H1 = .076) and the proportion of variance accounted for by the first principal component relative to other components whose reliability exceeds zero (H5 = .465) indicated lower test homogeneity.
Individual items' sampling adequacy varied between .240 and .884, with not a single item exceeding the level of .90 (Appendix A). The proportion of variance of a given item predicted from the remaining items of the test (item's reliability) was relatively low for most of the items, ranging from .083 to .332. On the other hand, both measures of items' internal validity detected numerous items achieving moderate positive corrected item-total correlations (range .080-.527), as well as a number of items whose correlations with the principal object of measurement can be considered satisfactory (range .002-.461). Yet, both measures indicated several items whose correlations with the object of measurement approach zero, pointing to their poor discriminative power and specificity in the context of the remaining items.
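To illustrate how internal consistency relates to the average inter-item correlation, here is a minimal sketch of Cronbach's alpha from raw item scores and of the standardized alpha obtained from the number of items and the mean inter-item correlation. How the Rtt10g macro computes α is not specified in the text, so these are the textbook formulas and an assumption about the general approach; the function names are hypothetical.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows = participants, columns = item scores (e.g. 0/1)."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def standardized_alpha(k, mean_r):
    """Standardized alpha from k items and the average inter-item correlation."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Consistency check using values reported in the text (H1 read as mean r).
full_scale = standardized_alpha(36, 0.076)  # 36-item scale
short_form = standardized_alpha(18, 0.129)  # 18-item short form
```

With the reported H1 values, the standardized formula gives approximately .75 for the full scale and .73 for the short form, close to the reported α values. This also shows why a long test can reach acceptable α despite a low average inter-item correlation.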
In order to examine the latent structure of the instrument, exploratory factor analysis (EFA) was carried out. Maximum likelihood extraction was used along with Promax rotation. The Guttman-Kaiser criterion suggested retention of 14 factors, while the scree plot showed a change of slope after the second factor. Following the latter criterion, the number of factors was fixed to two. The two retained factors accounted for 12.24% of the items' variance. The pattern matrix is presented in Table 2. The correlation between the two extracted factors was moderate (r = .453). Inspection of the primary factor loadings did not suggest any interpretation of the items' grouping in terms of the type of mental state depicted in the image or other stimulus characteristics.
On the basis of theoretical expectations and previous empirical findings, several confirmatory factor analyses (CFA) were performed. A summary of the models tested is presented in Table 3, and factor loadings for the seven models tested are presented in Appendix B. First, by examining the model fit for the single-factor full-scale solution, we wanted to determine whether the test is unidimensional, i.e. whether all the items successfully measure a single latent trait, as suggested by Baron-Cohen et al. (2001). Results have shown that the full-scale single-factor model has a poor fit, with a low average loading of .275 (Appendix B). Secondly, we tested the model obtained in the EFA, with two interrelated factors underlying the performance on all the items. The estimated correlation between factors was high (r = .621), with average loadings of .321 and .279 for the first and second factor, respectively. Overall, this model has shown poor fit as well. Furthermore, four models subsuming previous empirical findings were examined. The affect-specific three-factor model of positive, negative, and neutral factors (Harkness et al., 2005) underlying performance on the RMET has shown poor fit, with average loadings of .254, .311, and .300 for the positive, neutral, and negative factor, respectively. Estimated correlations between factors were high for all the factor pairs: positive and negative (r = .666), positive and neutral (r = .763), and neutral and negative (r = .921). The two-factor model of positive and negative affect (Konrath et al., 2014) demonstrated a somewhat better, but still unsatisfactory fit, with a very high positive estimated correlation between factors (r = .944), and average loadings for the first and the second factor of .218 and .336, respectively. On the other hand, the reduced model of Konrath et al. (2014) has shown fair fit according to all fit indices, with an average loading of .292, while the model of Olderbak et al. (2015) demonstrated a somewhat poorer fit, with an average loading of .302.
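For readers less familiar with the fit indices used to judge these models, the sketch below shows how RMSEA, CFI, and TLI are conventionally derived from the chi-square statistics of the tested model and of the baseline (independence) model. The chi-square values are invented for illustration; the degrees of freedom (135 and 153) are those of a one-factor model with 18 indicators under a standard covariance-structure setup, which is an assumption about the estimation and not something stated in the text.

```python
import math


def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 in the denominator)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative fit index relative to the baseline model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def tli(chi2_model, df_model, chi2_base, df_base):
    """Tucker-Lewis index (non-normed fit index)."""
    ratio_base = chi2_base / df_base
    ratio_model = chi2_model / df_model
    return (ratio_base - ratio_model) / (ratio_base - 1.0)


# Hypothetical chi-square values for a one-factor, 18-item model, N = 260.
fit = {
    "RMSEA": rmsea(150.0, 135, 260),
    "CFI": cfi(150.0, 135, 500.0, 153),
    "TLI": tli(150.0, 135, 500.0, 153),
}
```

Some programs use N instead of N - 1 in the RMSEA denominator, so small differences from software output are to be expected.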
In order to arrive at the most appropriate and reliable model of the Serbian adaptation of RMET, based on the theoretical expectation of a single factor underlying the ability measured, we eliminated items which exhibited low factor loadings within the full-scale single-factor solution (< .30), and tested this reduced model. According to all fit parameters (Hu & Bentler, 1999), the final reduced 18-item single-factor model has shown satisfactory fit, with an average factor loading of .360.
Psychometric properties were again calculated for the single-factor 18-item version of the RMET. Results have shown that item sampling adequacy achieved a more satisfactory level (KMO = .797), ranging from .680 to .887 for individual items. Reliability of individual items ranged between .093 and .247, with overall internal consistency remaining at a fair level despite the exclusion of half of the initial item pool (α = .728). Likewise, homogeneity of the 18-item version of the instrument improved as well (H1 = .129; H5 = .677), achieving a more satisfactory level. Consequently, the range of internal validity indices for the individual items in the 18-item short version improved: corrected item-total correlations ranged from .336 to .608, while corrected correlations with the principal component extracted from the scale ranged from .356 to .564. In terms of items' content, i.e. stimulus gender and emotional valence of target words (based on the classification of Harkness et al. (2005)), the final version resulted in ten female stimuli and eight male stimuli, with five negative (1 male, 4 female stimuli), eleven neutral (6 male, 5 female stimuli), and two positive target words (1 male, 1 female stimulus).
Note. Since Konrath et al. (2014) reported only the target word (not the item number) for both the two-factor and the reduced version, and since three of the target words used appear twice in the test, we iteratively tested all combinations of the aforementioned items in order to arrive at the best set of items as indicated by fit parameters. The results of the two Konrath et al. (2014) models presented in Table 3 and Appendix B are based on the best-fitting models including the given items.
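The item-reduction rule used for the short form (dropping items whose loading in the full-scale single-factor solution is below .30) can be sketched as follows. The loadings are invented for illustration and the function name is hypothetical; in practice the reduced model is refitted, so its loadings are re-estimated and do not simply equal the retained full-scale values.

```python
def select_items(loadings, threshold=0.30):
    """Keep items whose single-factor loading is at least the threshold."""
    return sorted(item for item, value in loadings.items() if value >= threshold)


# Hypothetical full-scale loadings keyed by item number.
loadings = {1: 0.12, 2: 0.41, 3: 0.29, 4: 0.30, 5: 0.55}
kept = select_items(loadings)
mean_loading = sum(loadings[i] for i in kept) / len(kept)
```

Here items 1 and 3 fall below the cut-off and are dropped, while an item exactly at .30 is retained.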
In order to demonstrate that the short form of the Serbian version of RMET kept its diagnostic power in differentiating between participants with and without ToM deficits, the analysis of covariance (ANCOVA) was performed once again, with age and number of years of education taken as covariates. Levene's test has shown equality of error variances across groups [F (3,256) = 1.650, p = .178]. Results indicated a significant main effect of group [F (1,254) = 24.885, p < .001, ηp² = .089], with the clinical group performing significantly worse than the group without deficits. Once again, the main effect of gender was not observed [F (1,254) = 0.593, p = .442], while the group x gender interaction approached the threshold of statistical significance [F (1,254) = 3.474, p = .064, ηp² = .013], mainly deriving from the gender differences between participants in the control group [F (1,195) = 8.814, p = .003, ηp² = .043].
Item analysis has shown that the majority of items of the Serbian adaptation of RMET behave similarly, in terms of difficulty, to those of other RMET adaptations and of the original version of the instrument. Namely, the deviation of individual items from the difficulty measures provided in previous studies can be considered negligible, especially bearing in mind the wide range of individual item difficulties documented in previous studies. Contrasting the Serbian version of RMET with other adaptations and the original version of the instrument revealed that the Serbian version significantly deviates only from the Persian one.
Item analysis of RMET has shown that a number of items have an unbalanced frequency of selection of foils. Similar results were obtained in previous studies using this instrument (e.g. Baron-Cohen et al., 2001; Fernández-Abascal et al., 2013; Girli, 2014; Khorashad et al., 2015; Prevost et al., 2014; Vellante et al., 2013). Namely, a number of items contain foils that are relatively poor distractors, whose improvement should, in our opinion, be considered for the second revision of the test. Additionally, the distribution of scores was severely skewed and elongated despite the fair representation of the population which is considered to have ToM deficits, thus questioning the assumption of a normal distribution of this measure.
Results of the full-test psychometric analysis have shown that the Serbian version of RMET overall has fair psychometric properties. Bearing in mind that previous studies reported wide variability in RMET internal consistencies, typically falling in the range from .40 to .70 (Harkness et al., 2010; Khorashad et al., 2015; Prevost et al., 2014; Ragsdale & Foley, 2011; Vellante et al., 2013; Voracek & Dressler, 2006), the Serbian version of RMET can be considered fairly reliable compared to other adaptations (e.g. Girli, 2014; Khorashad et al., 2015; Prevost et al., 2014; Vellante et al., 2013). On the other hand, item sampling adequacy indicated lower representativeness of items for the measurement of the ToM construct. Similarly, homogeneity parameters pointed to a relatively small amount of commonality between items, indicating more than a single source of variance underlying test performance. Consistent with this, results of the EFA have shown that the two extracted factors accounted for only about 12% of the RMET's variance. Additionally, these factors seem to be difficult to interpret in a meaningful way, i.e. in terms of abilities recruited in the detection of affect-specific mental states presented in the items.
PSIHOLOGIJA, 2017, Vol. 50(4), 483-502
The fact that a test designed to measure the unitary construct of ToM exhibited low homogeneity, an issue raised by previous studies as well (e.g. Olderbak et al., 2015), led us to examine the latent structure of the Serbian version of RMET by testing several models based on previous literature. As in previous studies (e.g. Olderbak et al., 2015; Vellante et al., 2013), the full-scale single-factor model exhibited poor fit according to most of the CFA parameters used. The affect-specific three-factor solution (Harkness et al., 2005) could not account for performance on the test, resulting in poor structural validity, as suggested by previous studies as well (Olderbak et al., 2015). The same was true for the affect-specific two-factor model (Konrath et al., 2014) and the two-factor model obtained in the EFA within this study.
On the other hand, reduced single-factor models based on short forms of the RMET proposed in previous studies (Konrath et al., 2014; Olderbak et al., 2015) exhibited much better structural validity. This indicates that the optimal way to increase RMET's structural validity is to eliminate items deviating from the unitary ability measured, and suggests that the current RMET setting and item pool do not have the potential to detect any affect-specific ability on a latent level (if there is such) that would account for the performance on affect-specific content in a meaningful way.
Following the results of the item analysis and the full-scale single-factor loadings, an 18-item short form of the instrument assessing ToM has been proposed. The 18-item RMET has shown satisfactory internal psychometric properties and a latent structure in line with the theoretical expectation of a single trait underlying the ToM abilities captured by this instrument. In terms of the items selected, the 18-item version of RMET closely corresponds to those suggested by Olderbak et al. (2015) and Konrath et al. (2014), since it includes 70% of the items from the former and 65% of the items from the latter scale, thus indicating concordance between the Serbian version and other short forms of the test.
Finally, both the complete and the short version of the Serbian adaptation of RMET have shown satisfactory diagnostic validity in differentiating between participants that are supposed to have ToM-specific deficits and unimpaired controls (Bora et al., 2016; Bora et al., 2009). On the other hand, the typically observed "female superiority" (see Baron-Cohen et al., 2015; Baron-Cohen et al., 2005; Baron-Cohen et al., 2001; Khorashad et al., 2015; Schiffer et al., 2013; Vellante et al., 2013) in the performance on RMET was not obtained in the whole sample. Several previous studies pointed to the absence of gender differences on RMET as well (see Girli, 2014; Olderbak et al., 2015; Baron-Cohen et al., 2015). However, the trend-level interaction between participants' group and gender, which derives from gender differences in the control group, is directly comparable with those obtained and elaborated by Baron-Cohen and collaborators (see Baron-Cohen et al., 2015; Baron-Cohen et al., 2005).

Conclusion
Overall, the Serbian adaptation of RMET has demonstrated fair psychometric properties and satisfactory correspondence to both the original version and other adaptations of the instrument. The proposed short version of the test has shown a satisfactory latent structure that supports the premise of a unitary object of measurement, i.e. general ToM abilities. Additionally, the instrument demonstrated a satisfactory level of validity in differentiating between persons with ToM deficits and unimpaired controls. However, future studies should further address and provide additional evidence on the construct and predictive validity of this test using alternative measures of ToM capabilities on diverse groups sampled from both general and clinical populations.

Figure 1. Example of an item from the RMET

Table 1
Percentage of participants who have chosen each option in each item (item difficulty/target words are marked bold), item difficulties in previous studies, stimulus gender, and emotional valence of stimuli

Table 2
EFA's pattern matrix