Psychometric properties of the Serbian version of the Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV)

The Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV) is an individually administered clinical instrument designed for the assessment of intellectual abilities of adolescents and adults. The WAIS-IV consists of 15 subtests (10 core and 5 supplemental) reflecting the efficacy of cognitive functioning in four domains (verbal comprehension, VCI; perceptual reasoning, PRI; working memory, WMI; and processing speed, PSI) and general intellectual ability (Full-Scale IQ, FSIQ). The WAIS-IV was administered to a sample of 262 respondents: 104 respondents from a sample representative of the wider Belgrade area, 62 patients with schizophrenia, 63 patients with depression, and 33 patients with intellectual disability. Psychometric properties of the WAIS-IV subtests were analysed within the frameworks of Item Response Theory (IRT) and Classical Test


Highlights:
• The latest WAIS-IV scale is based on the Cattell-Horn-Carroll theory of intelligence.
• Psychometric properties are highly similar to those obtained on the US standardisation sample.
• The WAIS-IV enables reliable assessment of the full span of intellectual abilities.
The Wechsler Intelligence Scales are the most frequently used measures of intelligence worldwide (Lichtenberger & Kaufman, 2009). The Wechsler Adult Intelligence Scale-Fourth Edition (WAIS-IV; Wechsler, 2008a) was introduced in 2008, and its greatest modification was the elimination of the dual IQ, i.e., the verbal and performance IQ scales. This dichotomy was replaced by four composite indices representing functioning in specific cognitive domains, i.e., the Verbal Comprehension Index (VCI), Perceptual Reasoning Index (PRI), Working Memory Index (WMI), and Processing Speed Index (PSI). As the four specific domains are correlated, they can be used to calculate the higher-order Full-Scale score, FSIQ (Wechsler, 2008b). This structure of the WAIS-IV represents Wechsler's view of intelligence, which is conceptualised as a global capacity relevant for functioning in everyday life with several specific cognitive domains (Wechsler, 2008b).
The theoretical grounding for the WAIS-IV is the Cattell-Horn-Carroll (CHC) theory (Alfonso, Flanagan, & Radwan, 2005). The CHC theory of cognitive abilities is a combination of Horn and Cattell's theory of fluid and crystallized intelligence (Gf-Gc theory) and Carroll's theory of cognitive abilities proposing three strata, i.e., general intelligence (g), 10 broad cognitive abilities, and more than 100 narrow abilities (McGrew, 2009). The broad cognitive abilities include Fluid Reasoning (Gf), Comprehension-Knowledge (Gc), Short-term Memory (Gsm), Visual Processing (Gv), Auditory Processing (Ga), Long-term Retrieval (Glr), Processing Speed (Gs), Reading and Writing (Grw), Quantitative Knowledge (Gq), and Decision/Reaction Time or Speed (Gt) (Evans, Floyd, McGrew, & Leforgee, 2002). Empirical evidence supports this viewpoint and indicates that intelligence is composed of specific abilities clustering into higher-order cognitive ability domains (Carroll, 1993; Keith, 1990; Keith & Reynolds, 2010). The CHC theory is considered to be the most validated model of cognitive abilities (Evans et al., 2002; Flanagan, 2000). Lately, the CHC theory has been the most used theoretical framework for the development of intelligence tests and is considered very important in defining and interpreting cognitive ability constructs (Alfonso et al., 2005; Newton & McGrew, 2010).
The theoretical model behind the WAIS scale accommodates both theoretical cognitive constructs and empirical findings. To date, no single measure has covered all CHC abilities (Alfonso et al., 2005; McGrew, 1997). In the WAIS-IV, five broad cognitive abilities from the CHC theory were included, i.e., crystallised knowledge (Gc), fluid intelligence (Gf), short-term memory (Gsm), visual processing (Gv), and processing speed (Gs) (Alfonso et al., 2005). In developing this revised version of the scale, the concepts of Fluid Intelligence (Gf), Working Memory (WM), and Processing Speed (PS) were of special importance, since they are considered key aspects of intellectual functioning (Lichtenberger & Kaufman, 2009).
The WAIS-IV scale consists of 15 subtests (10 core and 5 supplemental), providing information on four composite indices confirmed by factor analyses (Wechsler, 2008a). The core WAIS-IV subtests are Block Design, Similarities, Digit Span, Matrix Reasoning, Vocabulary, Arithmetic, Symbol Search, Visual Puzzles, Information, and Coding. The supplemental subtests are Letter-Number Sequencing, Figure Weights, Comprehension, Cancellation, and Picture Completion (Table 1). Administration of the core subtests takes between 60 and 90 minutes on average (Wechsler, 2008a, 2008b), while administration of all subtests takes about 2 hours on average.

Enhancements made with the WAIS-IV scale
Several enhancements were made with the latest scale, the WAIS-IV, i.e., enhancement of theoretical underpinnings and improvement of psychometric qualities, clinical utility, and user-friendliness (Wechsler, 2008b). In the WAIS-IV, nine of the original 11 subtests from VITI (Vekslerov Individualni Test Inteligencije; Berger, Marković, & Mitić, 1991, which is a Serbian adaptation of WAIS-R, Wechsler, 1955) were retained, but they were assigned to different groupings (Table 1). Two subtests from VITI that were still used in WAIS-III but excluded from WAIS-IV were Picture Arrangement and Object Assembly. From WAIS-III, 12 out of 14 subtests were retained in WAIS-IV, and three subtests were added: Figure Weights, Visual Puzzles, and Cancellation. However, in the retained subtests, the item content and administration and/or scoring procedures were revised (Wechsler, 2008b). Importantly, the dependency on time bonuses in WAIS-IV subtests has been significantly reduced or eliminated so that the influence of declining processing speed on the scores of older adults is lessened (Lee, Gorsuch, Saklofske, & Patterson, 2008). Previous studies show that these changes improved the theoretical foundation of the scale (Grégoire, 2013) and that the WAIS-IV is superior to the WAIS-III in its measurement, scoring, and structural models for measuring FSIQ (Taub & Benson, 2013). The subtests Picture Arrangement and Object Assembly, which were part of previous versions (i.e., VITI, Berger et al., 1991, and WAIS-III, Wechsler, 1997), were dropped because of their lengthy administration and the heavy manipulative demands of the subtests. In addition, they contained many pieces that could be damaged, lost, or administered inconsistently. Furthermore, Object Assembly was excluded to decrease dependence on time bonus points (Coalson, Raiford, Saklofske, & Weiss, 2010; Larrabee, 2004; Wechsler, 2008b).
PSIHOLOGIJA, 2018, OnlineFirst, 1-17
To strengthen the measurement of Gf, two subtests were added as measures of PRI, i.e., Figure Weights and Visual Puzzles. The Cancellation subtest was added as a new measure of Processing Speed (Wechsler, 2008b). The Visual Puzzles subtest was developed as a Perceptual Reasoning subtest intended as a substitute for Object Assembly. The Figure Weights subtest was developed as a Perceptual Reasoning subtest for ages 16:0-69:11, aimed at measuring quantitative and analogical reasoning, which involves reasoning emphasising inductive and deductive logic (Carroll, 1993). The Cancellation subtest (for ages 16:0-69:11) was based on similar existing tasks (e.g., Geldmacher, Fritsch, & Riedel, 2000) and developed as a supplemental subtest targeting Processing Speed.
Another significant change was that the Information subtest was chosen as a core subtest over Comprehension, which was made a supplemental subtest. The reasons for this decision were mostly psychometric, i.e., reliability, subtest floor, gradient, and ceiling. Moreover, the correlations between WISC-IV UK and WAIS-IV UK composites were slightly better for Information than for Comprehension. Furthermore, administration time and ease of recording and scoring favoured Information over Comprehension (Wechsler, 2008b). The subtests assessing working memory, Digit Span and Arithmetic, were revised (Wechsler, 2008b). Arithmetic was chosen over Letter-Number Sequencing as a core subtest. Revisions in the Arithmetic subtest were made to increase its applicability across different cultures and countries (Wechsler, 2008b). Specifically, the reference to currency and to UK system units of measurement was eliminated, some items were reworded to increase clarity, some new items were developed to improve floor, ceiling, and difficulty gradient, and time bonuses were eliminated (Wechsler, 2008b).
In the Digit Span subtest, a third task, Digit Span Sequencing, was added. The main reason for this change was to increase the working memory demands, since previous research indicated different cognitive demands for the Digit Span Forward and Digit Span Backward tasks (Reynolds, 1997). Another advantage for practitioners is that the separate process scores for each of the three tasks allow evaluation of differential performance across the tasks.
Psychometric characteristics of the scale were improved through new norms, an increase in the FSIQ span, improvement of the subtest and total scale reliabilities, and new evidence on the validity of the scale (for an overview, see Wechsler, 2008b). An important enhancement achieved with the WAIS-IV was the improvement of clinical utility through co-norming with the Wechsler Memory Scale-Fourth Edition (WMS-IV; Wechsler, 2009) and through a series of studies on 13 special groups (e.g., several groups of intellectually dysfunctional, intellectually gifted, traumatic brain injury, borderline intellectual functioning, etc.).
An increase in developmental appropriateness was achieved by adding demo items and examples and by decreasing the use of professional terms in instructions, since a growing body of research suggested that the understanding of instructions is affected by their mere formulation and by the age of the respondent (Wechsler, 2008b). In addition, developmental appropriateness was achieved through a reduction of emphasis on motor skilfulness and a reduction of time bonuses (Salthouse, 2004). Furthermore, adaptation to participants of advanced age was achieved through a reduction of auditory discrimination and visual acuity demands. Finally, testing was made easier for test administrators by shortening the testing time, revising instructions, and redesigning the record form.
The current study aims to investigate the psychometric properties of the WAIS-IV scale adapted for the Serbian population. Specifically, the reliability of the subtests and the discriminativity and difficulty of the items were assessed using Item Response Theory (IRT) in order to evaluate the quality of the Serbian adaptation of the WAIS-IV. IRT analysis is considered superior to classical test theory, especially when developing scales and when psychometric properties of intelligence tests are examined (Bortolotti, Tezza, de Andrade, Bornia, & de Sousa Júnior, 2013). IRT is of special importance for cognitive measures, since neither the discrimination along the whole continuum of a dimension nor the item order according to difficulty is of such critical importance for noncognitive measures. The main advantage of IRT is that it relies on the principle of invariance, where the item parameters are not dependent on the respondents' latent traits, and the individual parameters are not dependent on the presented items. Since the central element in IRT analysis is the item, using IRT in testing the psychometric properties of the scale allows comparison of individuals from different populations completing questionnaires with common items, comparison of individuals submitted to different tests, and comparison between respondents and items (Bortolotti et al., 2013). In other words, with IRT the respondents and the items are located on the same scale, i.e., the respondents are positioned on the latent variable, while the items are positioned according to the level of the latent trait at which they discriminate best. The reliabilities, item sampling adequacy, and homogeneity of the subtests were also analysed within the CTT framework.
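The idea that respondents and items share one scale can be illustrated with a minimal sketch. It assumes a one-parameter logistic (Rasch) model for a dichotomous item, which is an assumption made for illustration only, since the text does not name the specific IRT model. The function name `rasch_probability` and all numeric values are hypothetical.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of a correct response under an assumed Rasch (1PL) model.

    theta: respondent ability in logits (hypothetical value)
    b:     item difficulty in logits (hypothetical value)
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# A respondent whose ability equals the item difficulty has a 50% chance of success.
p_equal = rasch_probability(theta=1.0, b=1.0)

# The same item is easier for a more able respondent and harder for a less able one.
p_high = rasch_probability(theta=2.0, b=1.0)
p_low = rasch_probability(theta=0.0, b=1.0)

print(p_low, p_equal, p_high)
```

Under this assumed model, an item is located at the ability level where the probability of success is one half, which is one way to read the statement that items and respondents are positioned on the same latent variable.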

Method

Sample
The WAIS-IV was administered to a sample of 262 respondents; specifically, 104 respondents from a sample representative of the wider Belgrade area, and 62 patients with schizophrenia, 63 patients with depression, and 33 patients with intellectual disability from Belgrade clinics. The average age was 39.89 (SD = 14.70). Descriptive characteristics of the whole sample and the subsamples are provided in Table 2. The subjects were intentionally sampled from various population substrata to increase variability in the achievement scores. Given that the instrument is primarily meant to be used in the clinical setting, affective and schizophrenic patients were included as the diagnostic groups typical for psychiatric disorders.
Participants from the general population were drawn from a sample representative of the wider Belgrade area. A representative sample of the Serbian population was designed, but due to financial limitations only the wider Belgrade area was tested. All respondents from the general population were tested by trained research assistants. Participation was voluntary, and respondents were paid for their contribution. The sample universe was based on 2002 Census data. A two-stage stratified random representative sample design was employed. In the first stage, the sampling units were households, and households were selected by a random route technique starting from a given address based on the dwelling register. In the second stage, a respondent within a household represented the secondary sampling unit; respondent selection was based on the last birthday in the household within the given age quota.
The household/respondent selection method was defined in advance, and a starting point, i.e., a particular address, was specified prior to the field survey. The maximum number of respondents per starting point was 10. The research assistants began at the given address or at the house nearest to it. If the testing session was successful, the research assistant counted the houses/apartments in a row and walked to the 10th apartment or to the 5th house. If a selected respondent declined to participate or was not found at the address after two attempts, the testing session was considered unsuccessful, and the research assistant chose the next nearest apartment or house. In no case were more than five test sessions conducted within the same apartment building. If the planned number of testing sessions in a certain street could not be accomplished, the research assistant went to the next nearest street. The sample was created to represent the population of the wider Belgrade area according to age group (16-17, 18-19, 20-24, 25-29, 30-34, 35-44, 45-54, 55-64, 65-69, 70-74, 75-79, 80-84, 85-90), sex, settlement type (urban-rural), and education level. All respondents from the general population were native speakers of Serbian.
Participants from the clinics were inpatients with a diagnosis of schizophrenia or major depression (in remission) from the Institute of Psychiatry and the Neuropsychiatric Clinic "Laza Lazarević" - Padinska skela, Belgrade. Respondents with intellectual disability were from the Military Medical Center - Karaburma and the Institute of Psychiatry, Belgrade. All patients were diagnosed according to ICD-9 (World Health Organisation, 1992). Sociodemographic data were collected for all participants in the clinical samples. All patients were native speakers of Serbian. All respondents from the clinical samples were tested by trained clinicians with long-term experience in administering Wechsler's scales.

Ethical Agreement
Respondents in the sample from clinics were in regular care, and no written consent was therefore required. Each participant was informed about the possible use of the collected data for research purposes and could withdraw from the study at any point. Collected data were made anonymous. The study was conducted in accordance with ethical principles for medical research involving humans (WMA, Declaration of Helsinki).

Instrument and procedure
The WAIS-IV was translated from English to Serbian and back-translated by the authors of this report. In addition, a professional proof-editor was consulted during the process of translation. Some items in the verbal subtests were adapted to fit local requirements with respect to content. Specifically, three items in the Information subtest and four items in the Comprehension subtest (i.e., sayings) had to be adapted.
In the Letter-Number Sequencing subtest, some of the letters used in the items had to be changed due to differences between the English and Serbian alphabets, and the order of the letters in the two alphabets was considered when the items were adapted. In addition, special care was taken to create items so that participants predominantly using the Cyrillic or the Latin Serbian alphabet (depending on which alphabet was first learned in school) are treated equally. This means that the letters used in the items were selected so that the correct answers would always be the same regardless of the alphabet. In other words, the adaptation was conducted in such a way that a respondent's alphabet preference would not influence his/her achievement.
In the subtests Similarities, Vocabulary, Digit Span, Arithmetic, Cancellation, Block Design, Matrix Reasoning, Visual Puzzles, Picture Completion, Figure Weights, Coding, and Symbol Search, the content was not changed compared to the original English version of the WAIS-IV (Wechsler, 2008a).
All items of the subtests were administered to all respondents, regardless of the general administration instructions, i.e., the discontinue rule was not applied. The WAIS-IV uses 10 core subtests to produce the FSIQ. The Verbal Comprehension Index and Perceptual Reasoning Index are each composed of three core subtests, while the Working Memory Index and Processing Speed Index are each composed of two core subtests. Supplemental subtests are provided to substitute for a core subtest if necessary, but three supplemental subtests (Figure Weights, Letter-Number Sequencing, and Cancellation) are not available for 70- to 90-year-olds, and these were not administered to participants older than 70 years. The WAIS-IV subtests along with the corresponding index scores are presented in Table 1.
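The composition described above can be written down as a small lookup structure. This is a sketch only: the core subtest names come from the text, while the assignment of each subtest to an index follows the published WAIS-IV structure rather than anything reproduced in this section. The names `CORE_SUBTESTS`, `SUPPLEMENTAL_SUBTESTS`, `AGE_RESTRICTED`, and `available_subtests` are hypothetical, and the age cut-off of 70 is taken from the 16:0-69:11 range given for the restricted subtests.

```python
# Core subtests grouped by index (3 + 3 + 2 + 2 = 10 core subtests).
CORE_SUBTESTS = {
    "VCI": ["Similarities", "Vocabulary", "Information"],
    "PRI": ["Block Design", "Matrix Reasoning", "Visual Puzzles"],
    "WMI": ["Digit Span", "Arithmetic"],
    "PSI": ["Symbol Search", "Coding"],
}

# Supplemental subtests grouped by the index they can contribute to.
SUPPLEMENTAL_SUBTESTS = {
    "VCI": ["Comprehension"],
    "PRI": ["Figure Weights", "Picture Completion"],
    "WMI": ["Letter-Number Sequencing"],
    "PSI": ["Cancellation"],
}

# Supplemental subtests available only for ages 16:0-69:11.
AGE_RESTRICTED = {"Figure Weights", "Letter-Number Sequencing", "Cancellation"}


def available_subtests(age_years):
    """Return the subtests that can be administered at a given age in years."""
    subtests = []
    for index in CORE_SUBTESTS:
        subtests.extend(CORE_SUBTESTS[index])
        subtests.extend(SUPPLEMENTAL_SUBTESTS[index])
    if age_years >= 70:
        subtests = [s for s in subtests if s not in AGE_RESTRICTED]
    return subtests


n_core = sum(len(names) for names in CORE_SUBTESTS.values())
print(n_core, len(available_subtests(30)), len(available_subtests(75)))
```

Converting sums of scaled scores into index scores and the FSIQ requires the published norm tables, which are deliberately not sketched here.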

Results
Descriptive parameters, i.e., social, demographic, and cognitive characteristics for the subsamples and the whole sample, are displayed in Table 2. Table 3 presents descriptive parameters for all WAIS-IV subtests on the subsamples and the whole sample of respondents. All scores were calculated according to the US norms (Wechsler, 2008a). Our results show that the average FSIQ for the respondents from the general population (wider Belgrade area) is 102.5. Compared to the results obtained on the special groups in the US standardisation, our respondents from the major depression group show lower FSIQ achievement (98.6 in the US vs. 83.8 in our sample). Respondents from the intellectual disability group show performance in line with that obtained in the US standardisation (Wechsler, 2008b). In the US standardisation study, data were not collected on schizophrenic patients, but our data are in line with some other studies using the WAIS-IV with clinical groups (e.g., Bulzacka, Meyers, Boyer, Le Gloahec, Fond, Szöke, et al., 2016).
Intercorrelations obtained between the subtests and the total score are provided in Table 4. As can be seen, the subtests corresponding to a specific index correlate moderately to highly with one another, and all subtests correlate highly with the general score (FSIQ).
Note. Infit - inlier-pattern-sensitive fit statistic (mean-square); outfit - outlier-sensitive fit statistic (mean-square); KMO - Kaiser-Meyer-Olkin measure of sampling adequacy; h2 - Momirović's measure of homogeneity (Knežević & Momirović, 1996). a Reliability for FSIQ was calculated as composite reliability for congeneric measures (Raykov, 1997). b Psychometric parameters are not displayed since the data were not entered for each item, but an overall number of correct answers was registered instead.
Homogeneity measures are high and indicate that almost all subtests are unidimensional, with the exception of three subtests, i.e., Digit Span, Letter-Number Sequencing, and Picture Completion. This result indicates that some subtests capture more than one (CHC) ability, which is not surprising since several abilities are engaged in these subtests, e.g., short-term memory, working memory, and logical reasoning.
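Among the CTT indices reported for the subtests, Cronbach's α is the simplest to illustrate. The sketch below uses the standard formula α = k/(k-1) · (1 - Σ item variances / variance of total scores) on a small made-up response matrix. The function name `cronbach_alpha` and the toy data are hypothetical and are not the study's data. Momirović's h2 is a different homogeneity measure and is not reproduced here.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(item_scores[0])                      # number of items
    columns = list(zip(*item_scores))            # one tuple of scores per item
    item_variance_sum = sum(variance(col) for col in columns)
    total_variance = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - item_variance_sum / total_variance)


# Toy data: 5 respondents x 4 dichotomous items ordered from easy to hard.
toy_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(toy_scores)
print(round(alpha, 3))  # 0.8 for this toy matrix
```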
Overall, our results show that all subtests discriminate participants well along the whole continuum of intellectual abilities (item-person maps and item misfit order measures for all items of all subtests are available in Supplementary materials 1 and 2). As can be seen from the item-person maps, all subtests cover the whole range of intellectual abilities. Analysis of the overall misfit order measures for the subtests shows that all have very good values, i.e., values range between 0.8 and 1.3, which are considered excellent. Analysis of item misfit order shows that almost all items in all subtests have adequate infit and outfit measures, which indicates that the items function well. In the subtests Digit Span and Similarities, all items demonstrated good infit and outfit measures (see Supplementary materials 2).
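To show what infit and outfit mean-squares respond to, here is a minimal sketch for one dichotomous item. It assumes a Rasch model with known abilities and difficulty, which is an assumption for illustration and not a description of the software or estimation procedure used in the study. The names `rasch_probability` and `item_fit` and all values are hypothetical.

```python
import math


def rasch_probability(theta, b):
    """Expected probability of success under an assumed Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def item_fit(responses, thetas, b):
    """Return (infit, outfit) mean-squares for one dichotomous item."""
    squared_residuals = []
    variances = []
    for x, theta in zip(responses, thetas):
        p = rasch_probability(theta, b)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: unweighted mean of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / w for r, w in zip(squared_residuals, variances)) / len(responses)
    # Infit: information-weighted mean-square (sensitive to inlying patterns).
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit


thetas = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]   # hypothetical abilities in logits
difficulty = 0.0                             # hypothetical item difficulty

# Pattern in line with the model: low-ability respondents fail, high-ability pass.
infit_expected, outfit_expected = item_fit([0, 0, 0, 1, 1, 1], thetas, difficulty)

# Same pattern, except that the most able respondent unexpectedly fails.
infit_surprise, outfit_surprise = item_fit([0, 0, 0, 1, 1, 0], thetas, difficulty)

print(infit_expected, outfit_expected)
print(infit_surprise, outfit_surprise)
```

In this toy case a single unexpected failure by the most able respondent inflates outfit much more than infit, which matches the reading of poor outfit as a sign of occasional surprising responses from respondents far from the item's difficulty.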
In the Comprehension subtest, the first item had poor outfit measures, which can indicate that participants with higher abilities unexpectedly perform poorly on it. It is possible that the item is unexpectedly easy, so that respondents become confused and provide a response of lower quality. In the Information subtest, items 3 and 5 had poor outfit measures. In the case of item 3, the translation should be improved, which could prevent this item from having a poor outfit measure. As for item 5, it is possible that its position in the administration order should be changed and moved closer to the middle part of the test.
In subtests where the first item is not the start point, i.e., Block Design, Matrix Reasoning, Arithmetic, Visual Puzzles, Figure Weights, and Picture Completion, the items preceding the start point have poor outfit measures. This indicates that on these items people with higher abilities can occasionally perform poorly, or it reflects lucky guesses by extreme respondents. Therefore, as prescribed, these items should not be administered unless the reverse rule is required due to poor achievement on the start items.
In the Vocabulary subtest, items 13 and 30 had poorer outfit measures, which suggests that for these two items unexpectedly low achievement of highly intelligent respondents can occur, or that respondents who are not high achievers provide a correct response by lucky guessing. Based on these results, item 13 in Vocabulary should probably be changed. The last item in the Vocabulary subtest (i.e., item no. 30, Palliative) discriminates highly intelligent respondents well (respondents with an FSIQ of 142 have a 50% chance of answering it correctly), but occasionally less intelligent people answer it correctly. This could be due to the content of the item itself, since the word is widespread in everyday life and language (e.g., palliative care as the last help for terminal patients), but providing an accurate definition of the term can sometimes be difficult even for respondents with above-average cognitive abilities. However, the discontinue rule (after 3 consecutive scores of 0) significantly lowers the possibility that a person with lower abilities answers this item correctly.
In the Letter-Number Sequencing subtest, items 2, 9, and 30 had poorer outfit measures. In the case of item 2, the reason for the poor outfit measure may be that the item itself is extremely easy, so that highly capable respondents are confused by the easiness of the task and provide a low-quality answer. For the last item, i.e., number 30, the poorer outfit measure should not pose a problem, since the discontinue rule that must be applied in testing would significantly lower the possibility of lucky guessing. The reasons for the poorer outfit index of item number 9 in this subtest are still not clear, and future studies should inspect this issue in more detail. It is also important to note that the poorer fit measures of some items may be caused by only a few observations. Therefore, before making a final judgement on the potentially lower quality of some items, data on the complete standardization sample should be considered.
For Symbol Search and Coding, reliability was not calculated because the data were not entered for each item; only the overall number of correct answers was registered. In the case of Cancellation, reliability was calculated from the two sums of the two types of correctly identified patterns. Analyses showed that the separation index (the signal-to-noise ratio, i.e., the ratio of "true" variance to error variance) of the FSIQ is 3.89, which indicates highly effective discriminative power of the test. It means that the scale can discriminate at least four groups of respondents according to their cognitive abilities.
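The link between a separation index and the number of distinguishable groups can be sketched with two conventions that are common in Rasch measurement. Both are assumptions here, not the authors' stated computation: the reported 3.89 is read as a separation G expressed in standard-deviation units (true SD over error SD, so that G² is the variance ratio), reliability is taken as G²/(1 + G²), and the number of statistically distinct strata as (4G + 1)/3. The function names are hypothetical.

```python
def separation_to_reliability(g):
    """Reliability implied by separation G, assuming R = G^2 / (1 + G^2)."""
    return g ** 2 / (1.0 + g ** 2)


def separation_to_strata(g):
    """Number of distinct strata, assuming the convention H = (4G + 1) / 3."""
    return (4.0 * g + 1.0) / 3.0


G_FSIQ = 3.89  # separation index of the FSIQ as reported in the text

reliability = separation_to_reliability(G_FSIQ)
strata = separation_to_strata(G_FSIQ)

print(round(reliability, 3), round(strata, 2))
```

Under these assumptions the implied reliability is about .94 and the strata count is about 5.5, which is compatible with the statement that the scale separates at least four groups of respondents.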

Discussion
Overall, the results of this study demonstrate that the WAIS-IV scale enables highly reliable assessment of the full span of intellectual abilities, from intellectual disability to intellectual giftedness. These results are in line with other studies (for an overview, see Wechsler, 2008b) showing that the WAIS-IV scale allows for extensive, high-quality assessment of cognitive abilities. This result is a telling one regarding the test itself, but also regarding the nature of intelligence, bearing in mind that the Serbian adaptation of the WAIS-IV involved only slight changes to the original items (see the Instrument section) and that the scores were calculated relying on the US norms. (Interestingly, the average IQ for the rather small but random sample from the wider Belgrade area, based on these norms, was 102.5, very close to the expected value of 100 for the general population.)
Our analyses showed that minor changes could improve the psychometric properties of some subtests. Namely, in the Vocabulary subtest, one item (no. 13) should be changed in the final version of the Serbian WAIS-IV. Future studies should clarify the quality of item number 30 in the Vocabulary subtest and of items 2, 9, and 30 in the Letter-Number Sequencing subtest. These and other issues (such as the introduction of a couple of new items in the verbal reasoning tests to improve discrimination of respondents with high abilities) will be addressed in the ongoing Serbian standardisation of the instrument.
A special quality of the WAIS-IV is the variety of high-quality supplemental subtests, which allows the scale to be used when assessing cognitive abilities of different clinical groups and to obtain additional information on cognitive functioning. The high quality of all subtests in the Serbian adaptation indicates that using only the core WAIS-IV subtests, when the time for administration is limited or when testing respondents older than 70 years, would also be appropriate. Moreover, further reduction of the number of subtests in assessing IQ is a frequent trend, especially in research settings (Axelrod, 2002). There are recommendations for the use of seven-subtest versions of WAIS-III and WAIS-IV (Meyers, Zellinger, Kockler, Wagner, & Miller, 2013; Wymer, Rayls, & Wagner, 2003), as well as for assessment with four subtests (WASI, The Psychological Corporation, 1999; Axelrod, 2002). Due to the high reliability of these subtests, correlations of IQ measures derived from these abbreviated versions and full WAIS scales are above .95 (Ryan, Carruthers, Miller, Souheaver, Gontkovsky, & Zehr, 2003).
As regards the achievement of the respondents from the intellectual disability group, our results are in line with the results obtained in the US standardisation of the WAIS-IV (Wechsler, 2008b). Like our results, the results of previous studies demonstrated that individuals with intellectual disability have the poorest performance on Working Memory tasks (Baddeley & Jarrold, 2007; Conners, Rosenquist, Arnett, Moore, & Hume, 2008; Van der Molen, Van Luit, Jongmans, & Van der Molen, 2007) and Perceptual Reasoning tasks (Caffrey & Fuchs, 2007; Fontana, 2004).
Future studies should be conducted on a sample representative of the whole Serbian population and should provide norms. In addition, studies should give evidence of the factor structure of the scale and of the predictive validity of the WAIS-IV in the assessment of cognitive strengths and weaknesses of respondents from different clinical groups.

Table 1
WAIS-R (VITI -Serbian version of WAIS-R), WAIS-III and WAIS-IV Subtests and corresponding index scores

Table 2
Social, demographic, and cognitive characteristics of the study sample

Table 3
Descriptive statistics and results of ANOVA with LSD post hoc test for subtests on four groups from the sample
a Supplemental subtests; b Subtests for ages 16:0-69:11 only.

Table 5
IRT reliabilities, Misfit order measures, KMO measures of representativeness, Cronbach's α, and homogeneity of the subtests