The role of word frequencies in detecting unfamiliar terms and their effect on response quality

Research into cognitive aspects of survey response has identified unfamiliar terms as one of the psycholinguistic determinants of question comprehensibility problems. In this paper, estimates of wording familiarity based on text corpora for the English and Slovenian languages were used to detect potentially incomprehensible wordings in two web survey questionnaires for international exchange students at the University of Ljubljana, one for incoming students (English) and the other for outgoing students (Slovenian). Two versions of the questionnaire were developed for each language, one with low-frequency (complex) and the other with high-frequency (improved) wordings, and compared in a split-ballot experiment. The results show a lower drop-out rate and a decreased subjective perception of difficulty for the improved language versions.

Preparing good survey questions is a complex task in which several decisions need to be made regarding conceptual and technical issues. Research into the design of survey questionnaires has mainly focused on the structural characteristics of questions such as the type, format and order of questions and response categories. In contrast, there is substantially less research on the particular words used to form a question and the response categories, which is perhaps the most difficult aspect as each question can be worded in several ways. Ideally, a question should be worded so that it is clearly focused on a concept and its meaning is uniformly understood by all respondents without using too many words (Krosnick & Fabrigar, forthcoming). However, in survey practice many questions are worded poorly and this affects the response process. A specific threat is that even educated questionnaire developers have difficulties because they use overcomplicated language that might be too demanding for the respondent (Sheatsley, 1983: 200).
Corresponding author: ana.slavec@fdv.uni-lj.si
Unfamiliar terms are one of the text features that research into the cognitive aspects of survey response has associated with comprehensibility problems (Tourangeau, Rips, & Rasinski, 2000). The use of terms that may be unfamiliar to some respondents can affect response quality; some respondents might not provide a response or give a non-substantive response (e.g. "don't know"), while others will try to guess the meaning from the context or associate it with a more familiar term, thus generating measurement errors. Moreover, even respondents who understand the meaning of the term might take some time and effort to cognitively process a difficult word. In psycholinguistics, the term low-frequency word is used to describe unfamiliar terms, which implicitly operationalises the concept. Namely, the frequency of a word is the number of times it occurs in text corpora, i.e. large electronic databases of authentic texts. Psychologists have observed that the more frequently a word occurs in a language, the faster it is processed (Broadbent, 1967). Moreover, eye-tracking studies reveal a longer gaze time for unfamiliar terms that have a low frequency in large corpora (Inhoff & Rayner, 1986; Jurafsky, 2003).
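To make the operational definition of word frequency concrete, the following minimal Python sketch counts word occurrences in a toy text collection and normalises them per million words. The two sentences, the tokenisation rule and the function names are illustrative assumptions and do not come from any of the corpora used in this study.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Count how often each lower-cased word form occurs in a list of texts."""
    counts = Counter()
    for text in texts:
        # Very simple tokeniser (an assumption): runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

def per_million(count, corpus_size):
    """Normalise a raw count so that corpora of different sizes are comparable."""
    return count * 1_000_000 / corpus_size

# Toy stand-in for a corpus: hypothetical sentences, not real corpus data.
toy_corpus = [
    "Please rate the course and rate the teacher.",
    "We evaluate the course every year.",
]
counts = word_frequencies(toy_corpus)
total = sum(counts.values())
```

In this toy collection "rate" occurs twice and "evaluate" once, so "rate" would be treated as the more familiar wording. Real corpora differ mainly in scale and in the care taken with tokenisation and lemmatisation.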

Pre-testing methods for detecting unfamiliar wordings
To avoid comprehensibility problems, survey researchers use pre-testing methods. For instance, participants in cognitive interviews can point out comprehensibility issues (Willis, Schechter, & Whitaker, 1999). To evaluate the effect of changes based on recommendations from cognitive interviewing, Willis (2005) performed a linguistic analysis on a set of questions about drug use. One of the observed characteristics was "big words", which is another term that can be used to describe unfamiliar words. The original version had 53 unfamiliar terms and that was reduced to 43 unfamiliar terms in the improved version. Even though cognitive pre-testing enabled the questionnaire designers to improve 10 terms, several unfamiliar terms remained undetected. Another method that can be used to detect unfamiliar wordings is expert review (Lessler & Forsyth, 1996; Akkerboom & Dehue, 1997), but it is based on the personal judgment of experts who are usually educated people that, as mentioned above, tend to use overcomplicated language and might not perceive which terms are difficult to understand for somebody with a narrower vocabulary.
Further possibilities to pre-test questions are supported by computer software. The support of modern information and communication technologies (ICT) has revolutionized the survey process in the last few decades, particularly with the increasing integration of the entire process (Vehovar, Petrovčič, & Slavec, 2014). Within this context, computer tools have been developed for detecting problems in survey questions, for instance the Survey Quality Predictor (SQP) (Saris & Gallhofer, 2007). The SQP is based on a meta-analysis of multitrait-multimethod (MTMM) experiments for more than 3,000 questions and allows users to obtain predictions of reliability and validity, including for any new question. However, it mainly focuses on the structural and formative characteristics of survey questions and only on a few linguistic indicators (length of syllables, words and sentences in a question introduction/request).
In the context of wording problems, a more relevant tool is the Question Understanding Aid (QUAID) (Graesser, Kennedy, Wiemer-Hastings, & Ottati, 1999). It employs different psycholinguistic determinants of question complexity and is an important attempt to create software able to detect corresponding problems in questionnaire wording, including unfamiliar terms (Graesser, Cai, Louwerse, & Frances, 2006). However, QUAID does not return exact numerical frequencies of terms that were flagged as unfamiliar. Another shortcoming is that many of the wordings that are indicated as problematic are actually not problematic, i.e. they are false positives (Graesser, Wiemer-Hastings, Wiemer-Hastings, & Kreuz, 2000). It would be more practical if the user were able to choose a threshold. Last but not least, it does not suggest alternative wordings, i.e. synonyms or hyponyms that are more familiar to the respondent and would improve the question. Researchers have to find them on their own (e.g. with thesauri in word processors), which is subjective and not systematized. Thesauri that are sufficiently complete are proprietary software that cannot be used for free, while those that are free are usually not comprehensive enough.
In contrast, the WordNet lexical database (Miller, 1995), which contains sets of interchangeable synonymous words (synsets) and is considered a gold standard in computational linguistics, has not yet been utilized for retrieving synonymous words for purposes of questionnaire design, at least to our knowledge.

Research on word frequency effect on response quality
Although it does not mention word frequency, a relevant survey experiment in the context of wording familiarity was conducted by Blasius and Friederichs (2009), who varied the phrasing of seven items using low-brow (everyday) or high-brow (elaborated) language. In fact, it can be assumed that high-frequency words are typical for low-brow language, while unfamiliar, low-frequency words are usually used in high-brow language. Response distributions differed significantly for three of the seven items. The authors suggest using low-brow wording, as it resulted in more diverse responses across different socio-demographic and attitudinal variables.
In survey research, the effect of low-frequency words on response quality was examined in greatest detail by Timo Lenzner (2011). Six other text features were also covered, namely vague or imprecise relative terms, vague or ambiguous noun phrases, complex syntax, complex logical structures, low syntactic redundancy and bridging inferences. First, Lenzner, Kaczmirek, and Lenzner (2010) compared response times, drop-out rates and survey satisficing (i.e., very short response times, neutral responses, acquiescence and primacy effects) in a randomised split-ballot trial, in which one version had well-formulated questions, while the other contained suboptimal wordings (for four questions, the manipulation was a low-frequency word). They found that response times in the suboptimal version were significantly longer in 12 out of 28 question items but in only two of the four low-frequency wordings. On the other hand, there were no differences in drop-out rates, item non-response (which was also very low) and satisficing. The results of this study were extended in an eye-tracking study (Lenzner, Kaczmirek, & Galesic, 2011), which showed that questions with suboptimal text features had longer fixation times, higher fixation counts and longer question fixation times.
Lenzner (2012) also examined the effect of question comprehensibility on response quality in more detail in another split-ballot experiment incorporating a bigger sample size and also controlling for verbal intelligence and motivation. The same questionnaire was repeated after two weeks to assess the reliability of responses. This study found that less comprehensible questions reduced response quality (i.e., they increased the number of non-substantive responses and the number of neutral responses). However, only four out of 28 text manipulations consisted of replacing a high-frequency word with a low-frequency synonym (Lenzner et al., 2010).
Although there are several studies in survey methodology that have compared alternative question wordings in a split-ballot experiment (e.g., Kalton, Collins, & Brook, 1978; Smith, 1987; Duncan & Schuman, 1980; Rasinski, 1989), Lenzner's studies presented above are the only experiments, at least to our knowledge, that are based on a psycholinguistic text analysis. Thus, further empirical evidence is needed to better study the effect of wording frequencies on response quality. In particular, more research is needed in different languages, as current research has only been done for German.

The present study
In this paper, we present a procedure that complements and builds on previous attempts to detect unfamiliar wordings in survey items. The procedure is based on resources used in computational linguistics, a field that uses statistics and computer science to model natural language. Linguistic corpora and lexical databases have had many applications in various fields, both within and outside linguistics. In survey methodology, the only known application is the QUAID tool described in the introduction. However, QUAID has several shortcomings, and linguistic corpora remain underutilised in survey research, except in some research by Willis (2005) and Lenzner (Lenzner, 2012; Lenzner et al., 2010).
In our procedure, frequencies in text corpora are used as estimates of wording familiarity and lexical databases are used to find alternative wordings. Our approach differs from earlier studies in that we operate with actual numbers (frequencies) from different text corpora. Moreover, when listing alternatives, we use lexical databases instead of regular thesauri. Thus, we can better distinguish between words that are true synonyms and those that are only similar. Furthermore, an important difference is that previous studies (Blasius & Friederichs, 2009; Lenzner, 2012; Lenzner et al., 2010) were done for German; in contrast, we research the word frequency effect for English and Slovenian. However, it should be noted that we treat the two languages as two distinct case studies that are very different and should not be directly compared.
Our aim is to improve survey question comprehensibility by using simpler and clearer wordings based on linguistic corpora. Through a linguistic analysis of two questionnaires (English and Slovenian), we produced low-frequency and high-frequency versions that were compared in two split-ballot experiments (one for each case study). In contrast to Lenzner (Lenzner, 2012; Lenzner et al., 2010), we only focused on low-frequency words, so that we could better understand the relationship between word frequencies and response quality. Thus, we were also able to produce a greater number of wording changes between the control (complex, low-frequency) and experimental (improved, high-frequency) versions of the questionnaire. In addition, we introduced subjective indicators of response burden as a measure of response quality.
First, we aim to explore how to use text corpora to evaluate question wordings and detect unfamiliar words in survey questions. Second, we want to evaluate the effect of wording improvements on response quality. Does using words that have a higher frequency in text corpora improve response quality in terms of response times, breakoff rate, item nonresponse, satisficing and various indicators of response burden?

Text corpora
Several approaches in linguistics need structured data to study language, and special datasets have been prepared to support linguistic research methods. One approach is corpus linguistics, which analyses language by collecting samples of language in a natural context such as books and newspapers. Corpora are electronic databases of authentic texts that are created according to specific criteria and aims. For instance, the Brown University Standard Corpus of Present-Day American English (Kučera & Francis, 1967) was one of the first corpora to be compiled in English and was followed by several others in English and other languages.
In this paper, we use three different English corpora (British National Corpus, Corpus of Contemporary American English and enTenTen) and one Slovenian corpus (Kres). Every corpus has its advantages and disadvantages, so restricting the analysis to only one would limit our understanding.
The British National Corpus (BNC; http://www.natcorp.ox.ac.uk) is a 100-million-word text corpus of written and spoken present-day British English taken from a wide range of sources (Burnard, 1995). It covers the period from 1960 to 1994, although over 93% of the texts are from 1985-1994. It might be slightly outdated but it has the widest range of sub-genres and includes spoken texts, which also gives us coverage of informal conversations. About 90% of the corpus is written texts, such as excerpts from newspapers, specialist periodicals and journals, books (academic and fiction), letters and memoranda, essays and other types of texts. It is encoded so that it represents both the output from the automatic part-of-speech tagger and the structural properties of the texts.
The Corpus of Contemporary American English (COCA; http://corpus.byu.edu/coca/) contains more than 450 million words, so it is more than four times as large as the BNC. In fact, it is the biggest freely available genre-balanced corpus of any language (Davies, 2010). It covers the period from 1990 to 2012 and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts in (American) English. The BNC and COCA complement each other nicely. The COCA is larger and more up to date, while the BNC has a much wider range of sub-genres and better coverage of informal, everyday conversations.
The enTenTen corpus is a web corpus created by web crawling and processed with boilerplate cleaning and de-duplication tools. It belongs to the TenTen family of multilingual corpora (https://www.sketchengine.co.uk/documentation/wiki/Corpora/TenTen), which covers 10 billion words in several languages (Jakubíček, Kilgarriff, Kovář, Rychlý, & Suchomel, 2013). The BNC and COCA cover fewer texts than enTenTen but their advantage is that they are designed and genre-balanced.
The Kres corpus (http://www.slovenscina.eu/korpusi/kres) is a balanced subsample of almost 100 million words from Gigafida, which is a corpus of written Slovenian that contains more than 1.2 billion words, 77% of which are from newspapers and magazines, while only 6% of words are from books.Kres is weighted so that 20% are Internet texts, 17% are fiction, 18% are non-fiction, 20% are newspapers, 20% are magazines and 5% other (Logar Berginc & Krek, 2012).
There are other useful resources in computational linguistics such as machine-readable dictionaries and derived lexical databases like WordNet (Miller, 1995). The WordNet project (http://wordnet.princeton.edu/) is an attempt to organize lexical information in terms of word meanings rather than word forms as is the practice in conventional dictionaries. Word meanings are represented by word definitions and WordNet maps between the many forms and meanings of words. Some forms have several meanings (polysemy), and some meanings can be expressed by several different forms (synonymy) (Miller, Beckwith, Fellbaum, Gross, & Miller, 1993). True synonyms are rare so a weaker definition is applied in WordNet, i.e. words that denote the same concept and are interchangeable in many contexts. Apart from the English WordNet developed by researchers at Princeton University (Fellbaum, 1998; Miller, 1995), there are corresponding wordnets in other languages, for example sloWNet for Slovenian (Fišer, 2009).

The questionnaire
A web questionnaire used to assess European international student exchange programmes, such as Erasmus, was used for the case study. The questionnaire asks students about their knowledge, skills, and the study environment, focusing on a comparison between their host and home universities. Two questionnaires were prepared, one for incoming and the other for outgoing students at the University of Ljubljana, although they are almost the same. The questionnaire for the outgoing students was translated into Slovenian. Complex, low-frequency wordings were intentionally chosen by the translators.
The questionnaire is 11 screens long and there are one to three questions on each page. The main part of the questionnaire consists of 21 questions, amounting to 79 items when counting the response options. The word count is 785 for the English version (1,040 when also including the list of countries in the dropdown list) and 791 for the Slovenian version (1,046 including the list of countries in the dropdown list). For both versions, we made a list of different nouns, verbs, adjectives and adverbs that appear in the questionnaire and manually searched for their synonyms and other related words for the corresponding meaning in WordNet (for English) and sloWNet (for Slovenian). Some of the new words were arbitrarily excluded because they did not sound natural in the context sentence. In most cases, we were limited only to single words but for three wordings in English ("critical assessment", "exam mark", and "subject field") and for four wordings in Slovenian ("delo s tabelami", "editiranje tekstov", "ekstrakurikularne aktivnosti", and "študijski materiali") we considered phrases.
For words that have at least one synonym, we manually searched the word frequencies of the original wording and other alternatives in the British National Corpus (for English) and in Kres (for Slovenian). Where more than one alternative was possible, we kept only the one with the highest wording frequency. In some cases, we replaced the original wording (in the control version) with a lower frequency word to make it more complex. As mentioned, the Slovenian version was already translated in such a way that it contained a lot of low-frequency wordings.
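The selection rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the procedure as we carried it out (our lookups were manual); the frequency counts, the extra candidate "component" and the function name `best_alternative` are hypothetical.

```python
def best_alternative(original, alternatives, freq):
    """Return the most frequent candidate that beats the original wording.

    Returns None when no alternative has a higher frequency than the original.
    Words missing from the frequency table are treated as having frequency 0.
    """
    baseline = freq.get(original, 0)
    candidates = [w for w in alternatives if freq.get(w, 0) > baseline]
    if not candidates:
        return None
    return max(candidates, key=lambda w: freq[w])

# Hypothetical counts for illustration only (not taken from the BNC or Kres).
freq = {"constituent": 1200, "part": 60000, "component": 4000,
        "evaluate": 900, "rate": 15000}

# Synonym candidates as they might be collected from a lexical database.
synonyms = {
    "constituent": ["part", "component"],
    "evaluate": ["rate"],
    "part": ["constituent"],
}

replacements = {word: best_alternative(word, alts, freq)
                for word, alts in synonyms.items()}
```

With these made-up counts, "constituent" would be replaced by "part" and "evaluate" by "rate", while "part" would be kept because its only candidate is less frequent. A check of whether a candidate sounds natural in the context sentence still has to be made by hand.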
Following the described procedure, we were able to find an alternative wording with a higher frequency in at least one of the corpora for 23 words in the English version (Table 1), while for the Slovenian version we were able to find 39 cases (Table 2). In addition, for English we compared the wording frequencies for the selected cases in the COCA and enTenTen corpora, which we have also listed in Table 1. The last column in both tables shows the number of times a wording change was made in the questionnaire.

Table 1
Words used in the control and experimental groups and their frequencies according to the British National Corpus (BNC), the Corpus of Contemporary American English (COCA) and the enTenTen (web) corpus.
We changed 23 different wordings but some appeared just once, while others appeared several times; the most frequent were "adequate" (8 times) and "evaluate" (9 times). As mentioned, there were three cases where we examined a phrase and not a single word. If we had considered the single-word frequency for "assessment" and "mark", we would have arrived at a different decision.
The wording frequencies in the three different English corpora are usually consistent: if a word has a relatively low frequency in one corpus, it is also low in the other two. However, there are exceptions. For instance, the words "evaluate" and "laboratory" are less frequent than "rate" and "lab" according to the COCA and enTenTen but more frequent according to the BNC. Similarly, "furthermore" and "oral" are more frequent than "moreover" and "spoken" according to the BNC and the COCA, but less frequent according to enTenTen.
In the Slovenian version, 39 different wording changes were made. Most of them appeared only once but some appeared several times, most frequently "evalvirati" (6 times) and "pedagog" (4 times). As mentioned, there are four wordings where we looked up the phrase and not a single word. The word frequency for "material" is lower than "gradivo" so the decision would be different if we had focussed on individual wordings. That might also be the case for some other words in the table, but this is a point for further exploration.
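The cross-corpus comparison of the English wordings described above amounts to checking, for each pair of alternatives, whether every corpus prefers the same word. A minimal Python sketch of such a check follows; the numbers are hypothetical per-million frequencies chosen only to mirror the direction of the "evaluate" versus "rate" example, and the function names are our own.

```python
def preferred(word_a, word_b, freq):
    """Return whichever of the two words is more frequent in one corpus."""
    return word_a if freq[word_a] >= freq[word_b] else word_b

def disagreements(pairs, corpora):
    """Return the word pairs whose more frequent member differs between corpora."""
    flagged = []
    for a, b in pairs:
        winners = {name: preferred(a, b, freq) for name, freq in corpora.items()}
        if len(set(winners.values())) > 1:
            flagged.append(((a, b), winners))
    return flagged

# Hypothetical per-million frequencies, NOT the values from Table 1.
corpora = {
    "BNC": {"evaluate": 30, "rate": 20, "constituent": 5, "part": 600},
    "COCA": {"evaluate": 25, "rate": 60, "constituent": 4, "part": 550},
    "enTenTen": {"evaluate": 28, "rate": 70, "constituent": 6, "part": 500},
}
pairs = [("evaluate", "rate"), ("constituent", "part")]
flagged = disagreements(pairs, corpora)
```

Here only the pair ("evaluate", "rate") is flagged, because the made-up BNC values prefer "evaluate" while the other two corpora prefer "rate". Flagged pairs are the ones that would need a judgment call on which corpus best reflects the respondents' language.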
There are three words for which the wording in the control version actually has a higher frequency than in the experimental version ("komuniciranje", "komunikacijski" and "socialen"). We decided to allow this exception for stylistic reasons: many of the words in the control group are words of foreign origin and their alternatives are Slovenian synonyms of these words. Thus, the control version has a style that employs a lot of foreign words, while the experimental version uses domestic alternatives. These three words are also foreign in origin and we thus decided to have them in the complex control version and use the more Slovenian wordings in the experimental version.
At the end of both the control and experimental versions of both questionnaires (English and Slovenian), we included a block of questions that measure respondent satisfaction and questionnaire difficulty. The following questions were asked:
- How much did you enjoy completing the questionnaire? A great deal, A lot, A moderate amount, A little, Not at all.
- How difficult was it for you to interpret the meanings of questions in this questionnaire? Extremely difficult, Very difficult, Moderately difficult, Slightly difficult, Not difficult at all.
- How difficult was it for you to generate answers to the questions in this questionnaire? Extremely difficult, Very difficult, Moderately difficult, Slightly difficult, Not difficult at all.
- How many times did you not understand a certain word in a question? Please give at least an approximate answer. If there were no such words, please write 0.
In addition, we were interested in the respondents' multitasking behaviour and assumed that respondents are less prone to perform other activities (e.g. visiting other websites) if the questionnaire is less demanding for them. We measured multitasking with two questions, one for multitasking on electronic devices, and the other for other multitasking activities. Both questions had eight different activities listed and multiple answers were possible (a check-all-that-apply format). The question wording was: What, if anything else, have you been doing on any electronic device while responding to this survey? And: What, if anything else, have you been doing while responding to this survey?

Results
The study was carried out in April and May 2014 on Erasmus exchange students at the University of Ljubljana. The survey invitation (and one reminder) was sent to 1,147 incoming (international) and 917 outgoing (Slovenian) students. Following a random allocation, about half the respondents were allocated to the control (complex) and half to the experimental (improved) version that we described in the previous section. In total, 230 (20%) incoming students and 205 (22%) outgoing students started responding to the survey. The incoming students responded to the English version and the outgoing students to the Slovenian version.
The incoming students who responded to the English version came from 27 different countries, mostly European. The largest group were Spanish students (11% of the respondents). No students were from an English-speaking country but five reported they are native speakers of English. It should be noted that for most of the respondents English was not their first language, which makes them more prone to comprehension difficulties.
We observed differences in six indicators of response quality: item non-response, drop-outs, straightlining, response time (average and median), subjective burden, and multitasking. Drop-outs are those who left the survey between the second and penultimate page of the questionnaire. The item non-response rate was computed by counting the number of items (out of 64) that were left blank. Straightlining is a manifestation of satisficing and is defined as always selecting the exact same response in a matrix question, either the middle point or another answer in the matrix. We computed straightlining for the four matrices that had more than three items: Q4 has eleven items and three response options (inadequate, just adequate, more than adequate), Q8 has eight items and five response options (much lower, lower, approximately the same, higher, much higher), Q14 has five items and six response options (no information, a little, a moderate amount, a lot, a great deal of information) and Q15 has eight items and five response options (no information, a little, a moderate amount, a lot, a great deal of information). Drop-outs were removed when computing the item non-response and straightlining. In addition, when computing the average and median response time we removed item non-respondents and outliers (those who took more than one hour to respond).
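The two item-level indicators defined above can be sketched as follows. This is a minimal Python illustration with made-up answers; representing a blank item as `None` and not counting a matrix with blank items as straightlined are assumptions of the sketch, not rules stated in the text.

```python
def item_nonresponse(answers):
    """Number of items a respondent left blank (blank items are None)."""
    return sum(a is None for a in answers)

def is_straightliner(matrix_answers):
    """True if every item of a matrix question received the same answer.

    Assumption of this sketch: a matrix with any blank item (or with no
    items at all) is not counted as straightlined.
    """
    if not matrix_answers or any(a is None for a in matrix_answers):
        return False
    return len(set(matrix_answers)) == 1

# Made-up answers of one respondent to an eight-item, five-point matrix.
q8_same = [3, 3, 3, 3, 3, 3, 3, 3]
q8_varied = [3, 4, 3, 2, 3, 3, 5, 3]
q8_partial = [3, 3, None, 3, 3, 3, 3, 3]
```

With these inputs, `is_straightliner(q8_same)` is true and the other two are false, while `item_nonresponse(q8_partial)` counts one blank item. In the full analysis the non-response count would run over all 64 items of a respondent rather than a single matrix.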
Subjective burden was measured with four indicators, namely: enjoyment in completing the questionnaire, the difficulty of interpreting the meanings of questions, the difficulty of generating answers to questions, and the number of times the respondent did not understand a certain word. Although the variables have ordinal measurement scales, we treated them as interval scales and computed averages. Multitasking was measured with four questions, but we only analyse the first two: multitasking on a computer (or other device) and multitasking without a device. For both, we counted the number of boxes the respondent ticked and classified the respondent as an on- or off-computer multitasker if they ticked at least one.

Analysis
We applied different statistical tests for different measures. A chi-square test was conducted for drop-outs and straightliners, which were measured as dummy variables; the tables show the percentage of those who were classified as a drop-out or straightliner. For all other measures we calculated averages and medians. For averages we carried out Student t-tests for independent samples, while for medians we used the nonparametric Mann-Whitney U test. We also computed Cohen's d and r as effect sizes for all tests (Cohen, 1988). The results for the English version are presented in Table 3 and for the Slovenian version in Table 4.
The group that responded to the improved English version had a lower drop-out rate (20%) than the group that responded to the complex (control) version (30.8%), a difference of roughly 10 percentage points. It is an important difference and turns out to be significant at the 0.06 level (chi square = 3.53); although the sample is small, there is a small effect size (Cohen's d = 0.25).
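As a hedged check of the English drop-out comparison, the chi-square statistic can be recomputed from a 2x2 table in plain Python. The cell counts below are not reported directly in the text: they are inferred from the per-page drop-out counts and percentages (37 of about 120 respondents in the control version, 22 of about 110 in the improved version) and should be read as an assumption. The conversion from chi-square to Cohen's d via the phi coefficient is likewise one common formula, not necessarily the one used for the reported value.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p-value for one degree of freedom.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For df = 1 the chi-square survival function equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

def cohens_d_from_chi2(chi2, n):
    """Convert a 1-df chi-square to Cohen's d through the phi coefficient r."""
    r = math.sqrt(chi2 / n)
    return 2 * r / math.sqrt(1 - r ** 2)

# Inferred counts (assumption): drop-outs and completers per version.
control_drop, control_stay = 37, 83      # complex, low-frequency version
improved_drop, improved_stay = 22, 88    # improved, high-frequency version

chi2, p_value = chi_square_2x2(control_drop, control_stay,
                               improved_drop, improved_stay)
d = cohens_d_from_chi2(chi2, 230)
```

Under these inferred counts the statistic rounds to 3.53, the p-value is close to 0.06 and d rounds to 0.25, which agrees with the figures reported above.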
Furthermore, we also checked the drop-out per page. Most of the drop-outs occurred on the first page: 20 (17%) in the low-frequency version and 12 (11%) in the high-frequency version. Note that there was one wording change on this page ("constituent" vs "part"). The remaining drop-outs occurred on the second page or later: 17 cases (14%) in the low-frequency and 10 cases (9%) in the high-frequency version. Per-page differences go in the direction of our hypothesis; however, the cell sizes are too small to generalize.
On the other hand, there were no differences in item non-response, straightlining, response times, and in the subjective burden indicators. Either the sample size was too small or changing the 23 wordings does not have any effect on different measures of response quality (other than drop-out).
In contrast, in the Slovenian version, where 39 wordings were changed, the results are somewhat different (Table 4). Improving the wording of the Slovenian questionnaire decreased the perceived difficulty of understanding questions, with the average score rising from 4.0 to 4.8 points (t=-6.17, p=0.00), and the perceived difficulty of providing answers, with the average score rising from 4.4 to 4.6 points (t=1.76, p=0.08); higher scores mean less difficulty. The differences are also confirmed by the Mann-Whitney U test: for both the difficulty of understanding the question (z=4.98, p=0.00) and the difficulty of providing an answer (z=2.24, p=0.03) the median increases from four to five in the improved version (meaning less difficulty). The effect size for the difficulty of understanding is medium (d=0.60) for the t-test and large for the Mann-Whitney U test (d=0.84), while for the difficulty of providing an answer the effect is small both for the t-test (d=0.30) and the Mann-Whitney U test (d=0.35). Looking only at the average value, there is also a significant difference in the number of words not understood, from 1.3 to 0.1 (t=5.36, p=0.00), but it is not confirmed by the nonparametric median test and the sample size is too small to give it statistical power.
On the other hand, the decrease in drop-out rates is smaller than in the English version, less than five percentage points (from 38.3% to 34%), and not even close to significant (chi square = 0.77, p=0.38). Moreover, there are no significant differences in item non-response, straightlining and response times.
Finally, it should be noted that three out of the 64 total changes (of 39 different wordings) in the Slovenian version were not in line with other changes. While most of the changed wordings in the improved version were words with a higher frequency than in the control version, those three changes went in the opposite direction. However, they appeared towards the end of the questionnaire and represent a minimal share (about 5%) of all changes, the rest of which were made in the proper direction.

Discussion
In this paper, we first reviewed the challenges of using text corpora and lexical databases to improve survey question wording, which is an under-researched topic in the field of questionnaire design. In particular, we summarized Lenzner's (Lenzner, 2012; Lenzner et al., 2010) research on the effect of different text features on response quality and we outlined an empirical study based on his research. However, we only concentrated on the effect of wording frequencies, allowing us more focus, and instead of German we applied our study to two other languages: English and Slovenian.
The study confirmed that the specific action of improving question wording by using words with higher frequencies can have a certain effect on some indicators of response quality. Although the results are somewhat different from those of Lenzner, we also confirmed some basic tendencies from his studies. Let us summarize the key findings.
First, as in Lenzner's studies (Lenzner, 2012; Lenzner et al., 2010), we were not able to observe any difference in item nonresponse and satisficing, either in the English or in the Slovenian questionnaire. However, it should be noted that Lenzner looked into four indicators of satisficing (very short response times, neutral responses, acquiescence, and primacy effects), while we looked only into one (straightlining).
Second, although Lenzner hypothesized that word frequency might have an effect on drop-out rates, his evidence showed no significant differences for this indicator. In our experiment, on the other hand, we observed a small effect on drop-out rates, which was also supported by the effect size. The replacement of 23 wordings in the English version with alternative wordings of higher frequency reduced the drop-out rate by roughly 10 percentage points. Moreover, we can also see a lower drop-out rate for the Slovenian language version, as the 39 wording changes decreased the drop-out rate by almost five percentage points; however, the Slovenian findings are not significant and cannot be generalized due to the small sample size. Nevertheless, the tendency goes in the same direction as in the English version.
Third, we were not able to observe significant changes in response times, whereas an effect on response times is one of the main results of Lenzner's research (Lenzner, 2012; Lenzner et al., 2010). Although response times differed slightly between the control and improved versions of the questionnaires for both languages, the differences were not significant. In any case, the sample is too small for this comparison to have adequate statistical power.
Fourth, what is novel in our experiment is that we also looked into some subjective measures of response burden, namely how much the participants enjoyed responding, the difficulty of understanding the questions, the difficulty of providing answers, and the number of times a certain word was not understood. For the English language questionnaire there were no effects, but for the Slovenian language questionnaire we observed a moderate effect for the difficulty of understanding and a small effect for the difficulty of providing answers. There was also a significant difference in the average number of times the respondents did not understand a certain word; however, the power analysis did not confirm this effect.
The differences in some research findings between our study and Lenzner's study can be explained primarily by differences in the methodological approach, i.e., per-item observations in Lenzner's study versus the aggregated effect of a series of changes observed in ours. In addition, the specifics of the study populations, the language, and the questionnaire are just as important in explaining the differences.
With respect to certain differences in the strength of conclusions between our English and Slovenian studies, it should be emphasized that they differed in the nature and number of wording alterations. Moreover, as stated in the introduction, the two experiments are distinct case studies that represent two different populations and two different languages, which should not be directly compared. While the lower perceived number of incomprehensible words and the decreased perception of difficulty in the improved Slovenian questionnaire could be explained by the larger number of wording improvements in the Slovenian version, there is no immediate explanation, except for some cultural effects, for the less pronounced decrease in the drop-out rate compared to the English version.

Limitations
The study confirmed that changes in wording frequencies can affect response quality; however, at least with the current data, it is difficult to accurately evaluate the specific effect of question wording on different response quality indicators. In fact, the current study has certain conceptual and methodological limitations. Additional experiments would be needed to further explore how exactly question wording is related to response quality.
The first limitation of the study is the relatively small sample size from only one university. Since we want to estimate small proportions (i.e., the percentage of drop-outs), a sample of at least 400 units per group is needed. Implementing the study on a larger population (other universities, the general population) would also strengthen the results. However, the relatively narrow population and small sample size do not jeopardize the internal validity of the findings, which clearly point to a potentially strong effect of word frequency on response quality.
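The relation between sample size and the precision of an estimated proportion can be illustrated with the standard normal-approximation (Wald) formulas. This is only a sketch: the proportions and the target margin below are hypothetical, the function names are introduced here for illustration, and the approximation becomes less reliable for very small proportions or small samples.

```python
from math import ceil, sqrt
from statistics import NormalDist


def margin_of_error(p: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the Wald confidence interval for a proportion p with n units."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


def n_for_margin(p: float, margin: float, confidence: float = 0.95) -> int:
    """Smallest n for which the Wald half-width does not exceed `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Hypothetical proportions for illustration only (not the study's data).
for p in (0.10, 0.25, 0.50):
    moe = margin_of_error(p, n=400)
    print(f"p = {p:.2f}, n = 400: margin of error = ±{100 * moe:.1f} percentage points")

# Units needed for a ±5 percentage point margin under the most conservative p = 0.5.
print(n_for_margin(p=0.50, margin=0.05))
```

Under the most conservative assumption (p = 0.5), a margin of about five percentage points at 95% confidence requires a sample of the same order as the 400 units per group mentioned above; smaller proportions, such as typical drop-out rates, yield narrower intervals at the same sample size.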
Second, the design of the questionnaire does not allow for a very accurate measurement of item response times. Since there is more than one item on each page, it is impossible to measure the time needed to respond to a specific item. A paging design with one item per page should be used in future experiments to enable the calculation of question response times.
Third, the number of experimental groups is a serious limitation. Two groups are enough to evaluate all differences as one factor but are not sufficient to study the effect of more specific factors, such as the nature and origin of alternative words, the specific effects of single words, the effect of the total number of words changed, the extent of change (moderate vs. high difference), the topic of the questionnaire, and also the role of specific factors of the target population (culture, language, socio-demographics). The possible variations are almost countless, and each would require additional experimental cells.

Future research
The above limitations could be the subject of future research. Another improvement in future studies could be to study the frequencies of longer strings of words and to consider the context of a sentence instead of focusing only on single words. Computational linguistic technologies already allow this kind of analysis; however, this complex issue requires specific software support. In any case, it seems worthwhile to further explore the potential of text corpora and other computational linguistic technologies. Since the presented experiment does not allow for generalizable conclusions, a meta-analysis of a series of experiments is needed to fully understand the impact of changes in word frequencies. Based on future studies, we might be able to generate a model to estimate the critical thresholds of wording changes.
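As a minimal sketch of what counting the frequencies of longer strings of words involves, the following example counts n-grams in a tokenized text and normalizes the counts per million. The toy corpus, the two example phrases, and the helper names (tokenize, ngram_counts, per_million) are hypothetical; a real analysis would rely on a large reference corpus and its own tooling, and would respect sentence boundaries, which this sketch ignores.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and extract word tokens (Unicode letters, optional apostrophe)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)?", text.lower())


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count all contiguous sequences of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def per_million(phrase: str, tokens: list[str]) -> float:
    """Occurrences of `phrase` per million n-grams of the same length."""
    phrase_tokens = tuple(tokenize(phrase))
    n = len(phrase_tokens)
    if n == 0:
        return 0.0
    counts = ngram_counts(tokens, n)
    total = sum(counts.values())
    return 1_000_000 * counts[phrase_tokens] / total if total else 0.0


# Toy corpus for illustration only; a real study would use a large reference corpus.
toy_corpus = (
    "Many students study abroad. Some study abroad for one term. "
    "Few undertake a mobility period."
)
corpus_tokens = tokenize(toy_corpus)

# Compare two hypothetical candidate wordings by their relative frequency.
for candidate in ("study abroad", "mobility period"):
    print(candidate, round(per_million(candidate, corpus_tokens)))
```

The candidate with the higher normalized frequency would be treated as the more familiar wording, in the same way as single-word frequencies are used in this paper.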

Conclusions
Despite some methodological limitations, we can confirm the basic findings of Lenzner (2011) that word frequencies can have some effect on question comprehension and response quality. However, due to the differences in methodological approaches, some effects found in our study were different. While Lenzner found an impact on response times, we found effects on drop-out rates and on the subjective perception of response burden. Nevertheless, both studies found no effect on item nonresponse and satisficing. In any case, our study outlines directions for future research in three ways: (a) we identified key factors for potential inclusion in experimental designs, (b) we pointed out existing linguistic resources that could also be used to study strings of words, and (c) we identified a need for systematic meta-analytic studies to discover key factors and key effects in this complex matter.

Table 2
Words used in the control and experimental groups and their frequencies according to the Kres corpus.

Table 3
Comparison of the control and experimental English versions

Table 4
Comparison of the control and experimental Slovenian versions