The subjective frequency of word n-grams

When asked to think about the subjective frequency of an n-gram (a group of n words), what properties of the n-gram influence the respondent? It has been recently shown that n-grams that occurred more frequently in a large corpus of English were read faster than n-grams that occurred less frequently (Arnon & Snider, 2010), an effect that is analogous to the frequency effects in word reading and lexical decision. The subjective frequency of words has also been extensively studied and linked to performance on linguistic tasks. We investigated the capacity of people to gauge the absolute and relative frequencies of n-grams. Subjective frequency ratings collected for 352 n-grams showed a strong correlation with corpus frequency, in particular for n-grams with the highest subjective frequency. These n-grams were then paired up and used in a relative frequency decision task (e.g. Is green hills more frequent than weekend trips?). Accuracy on this task was reliably above chance, and the trial-level accuracy was best predicted by a model that included the corpus frequencies of the whole n-grams. A computational model of word recognition (Baayen, Milin, Djurdjevic, Hendrix, & Marelli, 2011) was then used to attempt to simulate subjective frequency ratings, with limited success. Our results suggest that human n-gram frequency intuitions arise from the probabilistic information contained in n-grams.

The predominant view in psycholinguistics today is that language is made up of streams of words, and thus the word has become the dominant unit of linguistic activity in psycholinguistic research. The next largest unit is usually the phrase or sentence and the next smallest unit is usually the morpheme. Recently some researchers have begun to look at groups of words called n-grams (Shaoul & Westbury, 2011). N-grams are any combination of two or more words, and are not restricted to complete, compositional phrases (both the red hat and give the red are considered n-grams). N-grams exist above words in a natural hierarchy: any stream of language can be broken down into its component n-grams in the same way that a word can be segmented into morphemes or letters. N-grams have similar statistical properties to other units: each n-gram has a probability of occurring at any point in time that can be empirically estimated, and that probability will change depending on the context. The probability of any n-gram occurring is usually estimated from its frequency of occurrence in a corpus, and the larger the corpus, the more accurate the estimate (Kilgarriff & Grefenstette, 2011). In this work we will estimate n-gram probabilities from the frequency information found in what is currently the largest publicly available sample of English written text, a one trillion-word corpus of English web documents created and released by Google (Brants & Franz, 2006) known as the Web1T dataset. These probabilities have the potential to explain aspects of language behavior that are beyond the reach of non-probabilistic psychological models of language.
Some long-standing theories of language predict that there should be no effects for the transitional probabilities of words in sentences or n-gram probabilities (Harris, 1951; Chomsky, 2005). These theories do not allow for exposure information to be implicitly or explicitly tied to words or n-grams. In a generative framework of language the amount or type of language experience need not impinge on lexical sequence processing. Experience is dismissed as being theoretically unimportant because a system of rule representations, once in place, is static: once the parameters of the grammar have been properly set, the system is assumed to operate independently of experience. This is analogous to how a computer program, once constructed, does not change over the course of its use. In contrast, our perspective as empiricists is that there is no point in differentiating competence from performance in any empirical psycholinguistic research. Ullman (2001), for example, describes language as a mental lexicon of memorized words that are arranged by rules, which are stored in a "mental grammar". The procedural operations in this model work by assembling larger structures from hierarchical compositions of smaller structures (morphemes into words, words into sentences). When these compositions are fully productive (e.g. walk - walked or ideas - green ideas), they are posited to be purely rule driven. Any effects of n-gram probability or co-occurrence statistics are incompatible with these models because rule processing operations should not be affected by the amount or the type of experience with a stimulus. This common refrain, which we summarize as "words are stored in the mental lexicon", is inherently localist: each word gets a node, and data related to that word is contained in or around that node.
Compositional semantics is another area where rule-based theories of representation and processing have been popular. Jackendoff (2007) has offered models that build semantic combinations from a set of lexical items and relationships, but the empirical validations of this model are not forthcoming.
The assumption of this and other semantic models is that rules govern the combining of words, and all words have the power to encapsulate meaning. The meaning of larger structures is a simple outcome of various operations on the meaning of the words. This dualistic view of language processing, whether framed as words and rules (Pinker & Ullman, 2002; Walenski & Ullman, 2005) or as meanings and rules (Jackendoff, 2007), is pervasive. Any models that allow for continuous learning and modification of the language system are not compatible with this view.
The inherent unwieldiness of localist/dualist models has spurred demand for more parsimonious models that can explain our linguistic capabilities. These emergentist theories of language propose that experience is used to build dynamic systems for processing linguistic input without any need for mental lexicons and systems of grammatical rules (Baayen, Milin, et al., 2011; E. A. Bates & Elman, 1993; Bod, 2009; Dilkina, McClelland, & Plaut, 2010; Elman, 1990; Frank & Bod, 2011; Goldberg, 2006; Tomasello, 2003). Why use the word emergent to describe our linguistic systems? Reductionism has long been at the core of many theories of language (e.g. a word is just the sum of its spelling, sound and meanings). Instead of trying to understand the whole by studying the parts, these new theories attempt to capture properties of the whole by understanding how the parts interact. These theories are also united in their position on learning, where learning is integral to the development of the system, and hand-coded rules are left out (Ramscar, 2010). Our definition of the emergentist school of thought is broad and inclusive, but the trait that links these models is consistent: these models all include effects of linguistic context along with content and allow context and content to interact as experience grows.
The following summary of current research on n-gram processing provides evidence for broad, probabilistic effects of linguistic experience on language processing, in turn providing support for this emergentist school of thought (for a more in-depth review of the literature, see Shaoul and Westbury, 2011).
First we shall look at probabilistic n-gram effects, in particular n-gram frequency effects. Bannard and Matthews (2008) studied children's production of n-grams, and found that n-gram frequencies influence their accuracy when children repeat back short phrases that differ only by one word. Arnon and Snider (2010) replicated this effect using similar stimuli, a reading task and undergraduate student participants. They found that participants read the more frequent n-grams faster than the less frequent n-grams. In both studies the effect was not due to the frequency of the individual words or substrings and it was observed across the entire frequency range (for low-, mid- and high-frequency n-grams). Arnon and Cohen Priva (2013) studied elicited and spontaneous speech and found that n-gram frequency influenced phonetic duration. Higher frequency n-grams took less time to produce whether they were constituents or non-constituents. A constituent is a verb phrase, noun phrase or prepositional phrase that can stand alone as an utterance, such as "the red hen". A non-constituent phrase would be "will give the". Matthews and Bannard (2010) found that the verbal production of higher frequency n-grams was more accurate than that of lower frequency n-grams. The experimenters asked 2- and 3-year-olds to repeat n-grams back to them, measuring how close their version was to the original. Even after controlling for multicollinearity in the frequency measures, they found an n-gram frequency effect.
In the studies mentioned so far, the authors limited all of their stimuli to n-grams that were constituents or intonational phrases, meaning that they did not cross over traditional phrase boundaries. The first study to look at reading times for n-grams that were sampled without requiring that stimuli be constituents was done by Tremblay, Derwing, Libben, and Westbury (2011). They used only nonconstituent n-grams in a self-paced reading experiment and found that there was a whole n-gram frequency advantage. Tremblay and Baayen (2010) followed up with an ERP study for an immediate free recall task for sets of three nonconstituent 4-grams. They found that whole n-gram probability as well as internal word and 3-gram frequency predicted recall as well as P1 and N1 amplitudes. These results suggest that n-gram frequency is contributing something to the language system, and that n-gram effects may be similar to word effects.
Eye tracking experiments have also been used to look at n-gram frequency effects. Siyanova-Chanturia, Conklin, and van Heuven (2011) presented subjects with two types of 3-grams: binomial phrases (bride and groom) and those same phrases reversed (groom and bride). These two types of n-grams are naturally very closely matched on many lexical variables, and they proposed that any differences in processing must arise from effects of n-gram frequency. The binomial 3-grams had an average frequency in the BNC that was roughly nine times that of the reversed 3-grams (2.473 per million versus 0.274 per million). Thirty 3-grams of each type were embedded in sentences and read by participants in the eye tracker. They found that binomial phrases were read faster than reversed phrases. They also found that phrasal frequency facilitated reading even after taking into account the effect of phrase type, which is further evidence that increased exposure to an n-gram contributes to its entrenchment.
Language is undeniably a stream of sounds or letters, and n-grams can be thought of as groups of letters of different lengths. Language users make use of the information in the environment to learn, and that learning is not necessarily explicit. Can humans implicitly learn patterns in their environmental input? Remillard (2010) recently reported that their subjects were able to implicitly learn 5th-order and 6th-order sequential probabilities of certain non-linguistic stimuli. In their experiment they taught their participants to push one of six buttons corresponding to the location of a box on the screen. After two sessions of training spread over two days, subjects showed improved speed and accuracy in their responses. After 16 sessions of training were completed, participants were able to reliably predict the 5th element of a sequence based on the conditional probability of the previous four elements. The subjects were not aware of the contextual dependency they were relying on to do this task. This result provides support to the idea that it is possible for humans to implicitly learn n-gram transitional probabilities for 2, 3, 4 and 5-grams. In a continuation of this line of research, Remillard (2011) replicated these results for fourth-order sequential probabilities in a purely perceptual task, showing the common architecture of the learning systems in perception and action.
In a related line of research, implicit sequence learning ability has been shown to be linked to performance on language processing tasks by Conway, Bauernschmidt, Huang, and Pisoni (2010). They looked for individual differences in their participants' perception of degraded speech, a task that is highly dependent on the ability to predict upcoming words based on context. They found that a participant's sensitivity to sequential structure during implicit learning was the best predictor of these individual differences, even after taking into account their performance on tasks measuring short-term and working memory, attention and inhibition, and vocabulary.
Moving beyond orthographic frequency, other probabilistic measures are now being studied. Tremblay and Tucker (2011) investigated the influence of two additional measures, conditional logarithmic (log) probability and Pointwise Mutual Information (PMI), on the recognition and production of 4-grams. Conditional probability is a measure of the likelihood of seeing a word given a specific context, that is, its predictability. PMI is an index of how strongly words are associated with each other and is calculated by dividing the probability of the whole n-gram by the product of the individual word probabilities and taking the logarithm of that ratio. They asked participants to read 432 4-grams as quickly as possible after viewing them and they recorded the onset time (the time taken to read the 4-gram and prepare for the production) and duration of the utterance. N-gram frequency was found to explain more of the unexplained variance in production durations than conditional probability or PMI, leading the authors to conclude that n-gram frequency relates to the fluency of production due to entrenchment from exposure. Recognition time, as measured by the onset latency, had more deviance explained by conditional probability and PMI, with a smaller contribution from frequency. Since conditional probability measures how predictable an n-gram is in context, the superiority of conditional probability measures in explaining recognition time implies that the degree of competition between n-gram family members is the main process underlying recognition. This dovetails nicely with recent work on competition-based models of recognition of compound words (Juhasz & Berkowitz, 2011; Kuperman, Schreuder, Bertram, & Baayen, 2009). In terms of which length n-gram contributed most to explaining deviance in onset latencies, probabilistic measures for the 3-grams were strongest, followed by unigram probabilities. For production duration, unigram probabilities were the dominant measure in reducing unexplained variance. Tremblay and Tucker propose that the 3-gram is a key unit of language that is long enough to contain complex meaning, but short enough to be processed efficiently. This pattern of results points to a complex, dynamic system, with information from internal n-grams influencing the processing of the wholes.
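As a rough illustration of the two measures described above, the following Python sketch computes PMI and conditional log probability from probabilities and counts. The function names, the choice of base-2 logarithms, and the toy numbers are assumptions made for this example; they are not taken from Tremblay and Tucker (2011).

```python
import math

def pmi(p_ngram, word_probs):
    """Pointwise mutual information (in bits, an assumed unit) of an n-gram:
    log of the whole n-gram probability over the product of word probabilities."""
    return math.log2(p_ngram / math.prod(word_probs))

def conditional_log_prob(count_ngram, count_context):
    """Log probability of the final word given the preceding words,
    estimated as count(whole n-gram) / count(context)."""
    return math.log2(count_ngram / count_context)

# Hypothetical toy values, not corpus data.
independent = pmi(0.01, [0.1, 0.1])   # words co-occur no more often than chance predicts
associated = pmi(0.05, [0.1, 0.1])    # words co-occur five times more often than chance
predictability = conditional_log_prob(count_ngram=25, count_context=100)
```

A PMI near zero means the words co-occur about as often as their individual probabilities predict, while larger positive values indicate stronger association.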
These studies all provide evidence for general n-gram frequency sensitivity, using different types of stimuli and different experimental paradigms. Is corpus frequency merely a reasonable way to estimate the familiarity of an n-gram? Frequency effects can also be thought of as complex phenomena that arise from more than just pure exposure. The key realization is that repetition implies contextual diversity, and so repetition itself may not be what gives high frequency n-grams their advantage (McDonald & Shillcock, 2001).
Frequency is inevitably correlated with many other measures. McDonald and Shillcock (2001) identified contextual distinctiveness (CD) as a measure that can explain effects of orthographic frequency. CD was expressed as the relative entropy between a word's context and the context for all words in the language.
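Relative entropy (Kullback-Leibler divergence) can be sketched in a few lines of Python. The toy context distributions below are hypothetical, and the details of how McDonald and Shillcock (2001) estimated context distributions (window size, smoothing) are not given here, so this should be read as an illustration of the general quantity rather than of their procedure.

```python
import math

def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits.

    p and q map context words to probabilities; q must be non-zero
    wherever p is non-zero.
    """
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)

# Hypothetical toy distributions over three context words.
prior = {"the": 0.5, "of": 0.3, "hat": 0.2}             # contexts of all words
generic_word = {"the": 0.5, "of": 0.3, "hat": 0.2}      # same contexts as the prior
distinctive_word = {"the": 0.1, "of": 0.1, "hat": 0.8}  # concentrated contexts

cd_generic = relative_entropy(generic_word, prior)
cd_distinctive = relative_entropy(distinctive_word, prior)
```

A word whose contexts mirror the language as a whole has a divergence of zero, whereas a word that appears in unusually specific contexts receives a higher value.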
In a similar vein, Baayen (2010) calculated the contribution of 17 lexical variables from many categories (frequency, genre distribution, CD, syntactic entropy, morphological entropy, and orthographic features) in predicting lexical decision response time (LDRT). Once the other predictors were used to predict RT, orthographic frequency did not contribute to the final model. This idea could be called the frequency-effect-as-epiphenomenon position. As with McDonald and Shillcock (2001), frequency effects emerged from models that did not use lexical frequency counts. In the experiments we will report here we will consistently employ n-gram frequency in our statistical models, but it is critically important to state that there is much more than frequency at work: it is a combination of many other probabilistic measures of language, such as those proposed by Baayen (2010), that will eventually help us understand the system. Our overly simple frequency-based analyses are a good beginning, but much more work needs to be done to disentangle the complexities of n-gram frequency.
The theories that allow for learning from context to take place make clear predictions about n-grams: over time and exposure, the n-grams to which we are exposed will become more and more familiar. This familiarity with a word sequence (its subjective frequency), in line with other effects of familiarity for other stimuli, will influence the reading of n-grams. In this study we aim to delve deeper into the question of n-gram subjective frequency and to better understand what is driving these varying degrees of word sequence familiarity.
The first question to be addressed in this work is: How does the probability of an n-gram in a large corpus of text relate to the subjective frequency of the n-gram? In the first part of the paper we will attempt to detect any contribution of n-gram frequency to subjective frequency ratings. This evidence will provide a basis for n-gram probability in the formation of n-gram subjective frequency. The second question addressed is: How sensitive is the language system to the relative probabilistic information contained in language? Subjective frequency judgments are by definition on a fixed scale (i.e. from VERY FREQUENT to VERY RARE), but relative frequency judgments change depending on what n-grams are being compared. Comparing two very common n-grams may be different from comparing two very uncommon n-grams. Yet relative frequency judgments should tap into the same implicit familiarity knowledge that is used to generate subjective frequency ratings. In the second part of the paper the impact of n-gram probability on subjective relative frequency judgments is investigated. Will there be an impact of the frequency of the internal n-grams, the whole n-gram or both? Our goal is to better understand how the probabilistic information contained in n-grams influences their processing.
Finally, in the General Discussion, we will use a naive discriminative learning (NDL) model (Baayen, Milin, et al., 2011) to build computational simulations of subjective frequency ratings and relative frequency judgments to see how well a learning model can predict behavior in these tasks. We will attempt to find out whether sub-lexical learning is giving rise to these n-gram effects in our subjective frequency tasks. Any type of model that simulates n-gram frequency effects without storing any n-gram data is interesting because it lays bare the problems with the false dichotomy between n-gram "storage" and "computation". In a model that learns from experience, computation and memory are concurrent and unified, rendering the "storage versus computation" debate moot.

WORDS AND N-GRAMS
One theme in this research is the similarities between n-grams and words. Evidence for this conjecture has come from many sources. Kuperman, Bertram, and Baayen (2008) studied compound words, and found that compound word frequencies, constituent lexeme frequencies, and conditional probabilities for all the morphemes in the compound word had a role to play in their model of compound word reading. Compound words are in many ways similar to 2-grams, leading us to speculate that models of n-grams may need to take similar information into account. Since n-grams have been shown by Arnon and Snider (2010), Tremblay, Derwing, et al. (2011) and others to have a word-like frequency advantage, it is possible that words and n-grams have even more in common. We will first look at subjective frequency, a well-studied aspect of word knowledge.

Subjective and objective frequency of words and n-grams
The subjective frequency of words has been investigated by psycholinguists since the 1960s (see Gernsbacher, 1984 for a review). Connine, Mullennix, Shernoff, and Yelen (1990) found subjective frequency to be predictive of word naming times when the stimuli were presented auditorily, but found no effect for orthographic frequency in this modality. This led Connine et al. to conclude that objective and subjective frequency effects for words were task and modality dependent. Furthermore, subjective frequency was concluded to be a post-lexical component that was related to ease of production. Balota, Pilotti, and Cortese (2001) investigated what influences subjective frequency and they settled on objective frequency and meaningfulness¹. They found that meaningfulness was a better predictor of subjective frequency for low frequency words and orthographic frequency was a better predictor of subjective frequency for high frequency words. More recently, Colombo, Pasini, and Balota (2006) used Italian words and found that subjective frequency and meaningfulness explained variance in lexical decision response times, but not in naming response times. Orthographic frequency explained variance for both tasks. Thompson and Desrochers (2009) found lower correlations between the orthographic frequency of low frequency words and their subjective frequencies, replicating the results of Balota et al. (2001), but with French words. Baayen, Feldman, and Schreuder (2006) attempted to explain the variability in subjective frequency ratings using various objective predictors. They built a statistical model that absorbed more than two thirds of the variance in subjective word frequency ratings using predictors such as orthographic frequency, written-spoken ratio, word category (noun or verb), noun-verb ratio, orthographic neighborhood density, derivational entropy and inflectional entropy. These predictors are also important inputs into most models of visual lexical decision response time and word naming response time. The parallels between the two sets of predictive variables support the notion that subjective frequency is an "off-line inverse of visual lexical decision" (Baayen, Feldman, & Schreuder, 2006, p. 305).
What is subjective frequency? Subjective frequency is nothing more or less than a self-reported measure that expresses a person's introspective understanding of their amount of exposure to a stimulus. Lexical subjective frequency data is collected by asking people to rate how frequently they have encountered a word. The instructions in these experiments define encounters as hearing the word, saying the word or reading the word. The variance in these subjective frequency norms for words has been used to explain variance in lexical decision tasks, word naming tasks and others. Taking an emergentist stance, we posit that the subjective frequency rating for a word arises from the same emergent process that is in play when we use words, namely from the interactions of various processes that operate according to very basic principles of non-symbolic processing (Elman, 2011). If n-grams and words are similar, an n-gram's subjective frequency should be available to people during a task, just as a word's subjective frequency is known to be available. In our first experiment we collected subjective frequency norms for a set of n-grams and then analyzed these ratings to see how strong their relationship to objective frequency was. Our hypothesis is that if n-grams have a word-like subjective frequency, corpus frequency should be strongly correlated with subjective frequency when the effects of constituent word and n-gram frequencies are taken into account. Furthermore, the direction of the correlation should be positive (higher ratings for more frequent n-grams), and the correlation should be strongest for the most frequent n-grams, replicating the results of Balota et al. (2001).

¹ Defined by Toglia (2009) as a rating of "How meaningful is this word?" on a survey.
It may seem that this work is quibbling over the obvious fact that n-grams that are encountered more frequently should feel more frequent. The reason that this question is of vital importance is connected to the theoretical underpinnings of this work. Our conservative estimate of the number of representations that would need to be maintained in a localist model of language that included all words and n-grams (2, 3, 4 and 5-grams) experienced in a person's lifetime is around 10⁹. This is a gargantuan mental lexicon that would have to be consulted during every task we asked our participants to do in our experiments. A localist storage model becomes biologically implausible at this scale, and on such an account there is no reason to expect subjective frequency judgments to be related to corpus frequency. But if people can reliably judge the absolute and relative frequencies of n-grams, then it is time to embrace probabilistic models of language processing.

EXPERIMENT 1
There are many data sets available that provide subjective frequency ratings for words (Balota et al., 2001), but there are no previous reports of the collection of subjective frequency norms for n-grams. To see if n-grams would have a stable, subjective frequency in the same way that words do, we collected ratings and looked for similarities between n-gram ratings and word ratings².

Participants
One thousand five hundred and forty-eight students at the University of Alberta participated in this experiment in exchange for partial course credit. The mean age was 19.2 years old (sd = 2.1 years), 64% were females and 74% of the students were native English speakers. Our results were not affected by the inclusion of those who were not native English speakers, and so their data is included in the following analyses. All subjects gave written consent to participate in the experiment, which was conducted with the approval and in accordance with the regulations of the University of Alberta Research Ethics Board.

Methods and Materials
179 pairs of n-grams were chosen from the Google Web1T data set (Brants & Franz, 2006): 60 pairs of 2-grams, 43 pairs of 3-grams, 36 pairs of 4-grams and 38 pairs of 5-grams. The n-grams were chosen to cover a broad range of frequencies and relative frequencies. They were also grouped into pairs and matched on the geometric mean of their constituent word frequencies. This was done so that there would be no bias caused by the relative lexical frequency of the items when they were later used in a relative frequency judgment task. Arnon and Snider (2010) chose to only use n-grams that were intonational phrases, that is, n-grams that sound complete when uttered on their own. Our stimuli were not restricted to clausal or intonational units so as to demonstrate that n-gram effects are not limited to those types of constructions. The n-grams had frequencies ranging from the very frequent (1139 per million, to the) to the very infrequent (0.00006 per million, to know and keep the).

² The data files, the analysis reported and other supplementary materials related to this paper are available to be downloaded at http://www.sfs.uni-tuebingen.de/~cshaoul/ as well as at the Potsdam Mind Research Repository at http://read.psych.uni-potsdam.de/.
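The matching of pairs on the geometric mean of constituent word frequencies can be sketched as follows. The tolerance `tol_log10`, the helper names, and the toy frequencies are all hypothetical; the matching criterion actually used for the stimuli is not stated in this section.

```python
import math
from statistics import geometric_mean

def constituent_geomean(word_freqs):
    """Geometric mean of the frequencies of the words inside one n-gram."""
    return geometric_mean(word_freqs)

def is_matched(freqs_a, freqs_b, tol_log10=0.1):
    """True if two n-grams have geometric mean word frequencies within
    tol_log10 log10 units of each other (tolerance is an assumption)."""
    gap = abs(math.log10(constituent_geomean(freqs_a))
              - math.log10(constituent_geomean(freqs_b)))
    return gap <= tol_log10

# Hypothetical per-million word frequencies for two 2-grams.
pair_a = [400.0, 25.0]    # geometric mean of 100
pair_b = [1000.0, 10.0]   # geometric mean of 100
matched = is_matched(pair_a, pair_b)
```

The geometric mean is a natural choice here because word frequencies span several orders of magnitude, so averaging on a log scale keeps one very frequent word from dominating the match.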
In this paper we use the following convention to label the frequencies of the n-grams contained within an n-gram of larger size. The letters b, t, and q stand in for bigram, trigram and quadragram. The letter f denotes frequency, and the number following it indicates which position it has within the larger n-gram. Thus the abbreviation tf2 stands for Second Trigram Frequency, and would be the frequency of the second trigram in to know and keep the, which is know and keep. A full description of all these abbreviations is given in Table B4 (in Appendix B).
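The labeling convention can be illustrated with a short Python sketch that maps each label to the internal n-gram it refers to. The function name is hypothetical, and the sketch covers only contiguous internal n-grams; any additional labels defined in Table B4 (such as non-contiguous combinations) are not handled here.

```python
def internal_ngram_labels(ngram):
    """Map labels such as 'bf1' or 'tf2' to the contiguous internal
    n-grams of a larger n-gram (b = bigram, t = trigram, q = quadragram)."""
    words = ngram.split()
    letters = {2: "b", 3: "t", 4: "q"}
    labels = {}
    for size, letter in letters.items():
        if size >= len(words):
            continue  # only n-grams strictly smaller than the whole
        for start in range(len(words) - size + 1):
            labels[f"{letter}f{start + 1}"] = " ".join(words[start:start + size])
    return labels

labels = internal_ngram_labels("to know and keep the")
second_trigram = labels["tf2"]
```

Looking up the frequency of each labeled string in the corpus would then give the predictor of the same name.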
Subjects were administered a web-based survey with a seven-point scale next to each n-gram. The n-grams were presented in the same pseudo-random order to all participants. The instructions stated: "Please rate how frequently the phrases below are used. A rating of almost never means that the phrases are used very rarely. A rating of very often means that the phrases are used very frequently." The two extremes of the scale were labeled, but the intermediate ratings were not labeled. Each person was asked to rate 31 n-grams, providing us with approximately 130 ratings per n-gram. Each subject also rated the frequency of three nonsense n-grams (e.g. sanity toast blanket) to confirm that they understood the instructions.

Results
To confirm that our participants understood the task, we analyzed the responses they made for the nonsense n-grams in the experiment. The mean rating³ for the nonsense n-grams was μ̂ = 1.35, σ̂ = 0.2, on our scale of 1 to 7. The mean rating for all the sensible n-grams was μ̂ = 3.83, σ̂ = 1.07. The nonsense n-gram responses were removed from the rest of the analyses. We measured inter-rater reliability using the intra-class correlation coefficient (ICC, Shrout and Fleiss, 1979). We chose the version of the ICC with a random effect of raters, known as ICC(2,k). For all of the sub-groups of subjects who rated the same set of 32 items, all ICCs were greater than 0.98, comparable to the values found in similar lexical rating tasks.
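For readers unfamiliar with ICC(2,k), the following Python sketch computes it from the mean squares of a two-way layout, following the standard formula ICC(2,k) = (BMS - EMS) / (BMS + (JMS - EMS) / n), where BMS, JMS and EMS are the between-items, between-raters and residual mean squares and n is the number of items. The function name and the small rating matrix are illustrative assumptions, not the data or code of this study.

```python
def icc_2k(ratings):
    """ICC(2,k): two-way random effects, average of k raters.

    ratings[i][j] is the rating of item i by rater j (complete matrix).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    item_means = [sum(row) / k for row in ratings]
    rater_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]
    ss_items = k * sum((m - grand) ** 2 for m in item_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_items - ss_raters
    bms = ss_items / (n - 1)               # between-items mean square
    jms = ss_raters / (k - 1)              # between-raters mean square
    ems = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (bms - ems) / (bms + (jms - ems) / n)

# Small illustrative matrix: 6 items (rows) rated by 4 raters (columns).
toy_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
toy_icc = icc_2k(toy_ratings)
```

Because ICC(2,k) describes the reliability of the mean over k raters, it rises as more raters contribute to each mean, which is one reason values can be very high when over a hundred ratings are averaged per item.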
To understand the relationship between the ratings that we gathered and the corpus frequency of the n-grams, we investigated the impact of internal n-gram frequencies on the subjective ratings. Analyzing these relationships is not straightforward: all of these n-gram frequencies are inevitably highly inter-correlated, and entering all the predictors simultaneously into a regression model could lead to spurious effects of suppression or enhancement. Facing the same problem, Matthews and Bannard (2010) chose to use Principal Component Analysis (PCA) to reduce the multicollinearity of the component frequencies of their 4-grams. We considered using PCA as a potential way to reduce multicollinearity in our predictors, but we chose not to use it because PCA replaces the original variables with orthogonal components which are often difficult to interpret.
To deal with the problem of multicollinearity while properly assessing which predictors are most relevant, we made use of an analytical technique called random forests. The advantage of random forests is that they are able to detect true relationships in the presence of many highly multicollinear predictors. The full details of how we used random forests are reported in Appendix C, which gives a detailed description of our statistical methods and of how we determined which variables were important predictors.
The results of our random forest analysis can be summarized as follows:
• For 2-grams, the whole n-gram and second word frequencies were important.
• For 3-grams, the whole n-gram frequency was important, with a smaller contribution from the third word frequency and the third bigram's frequency. Interestingly, the third bigram frequency, bf3, is the frequency with which the first and third words appear together in a corpus as a contiguous bigram, which we call a split-gram.
• For 4-grams, the whole n-gram, the first 2-gram and the second 3-gram frequencies were important.
• For 5-grams, the first 4-gram and the whole n-gram frequencies were important.
Was n-gram frequency in the trillion word corpus helpful in predicting our outcome variable? To find out we used the variables identified as important by the random forest analysis and created models for each size of n-gram with the whole n-gram frequency. We compared them with models that added the other variables identified by the random forests. We compared the Akaike Information Criterion (AIC, Akaike, 1974) of all the models to determine which one was better. The AIC is a measure of the quality of a model that incorporates both the goodness of fit and the number of free parameters in the model. Models with fewer parameters that have a better fit with the data are given a lower AIC. This means that the absolute value of the AIC is not important, but rather the difference between two AIC values for two models indicates which model is better, and how much better. The results of these comparisons of nested models are shown in Table 1.
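The logic of an AIC comparison between nested regression models can be sketched as follows. The sketch assumes ordinary least-squares models with Gaussian errors, for which the AIC equals n ln(RSS/n) + 2k up to an additive constant shared by all models fit to the same data. The residual sums of squares and item count below are hypothetical and are not the values behind Table 1.

```python
import math

def gaussian_aic(rss, n_obs, n_params):
    """AIC of a least-squares model, up to an additive constant that is
    identical for all models fit to the same data."""
    return n_obs * math.log(rss / n_obs) + 2 * n_params

# Hypothetical residual sums of squares for two nested models.
n_items = 120
aic_whole_only = gaussian_aic(rss=60.0, n_obs=n_items, n_params=2)
aic_with_internal = gaussian_aic(rss=52.0, n_obs=n_items, n_params=3)

# A negative difference favors the larger model; a difference near +2 means
# the extra parameter bought essentially no improvement in fit.
delta_aic = aic_with_internal - aic_whole_only
```

Only the difference between the two values is interpretable, which is why the shared additive constant can be dropped.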
The picture for the relationship between objective and subjective frequency for n-grams is more complicated than the one for words described by Balota et al. (2001); it is not merely a linear relationship between the meaningfulness of words or their simple whole form corpus frequency. There were effects of the internal n-gram frequencies that came into play. We also found a non-linear effect of final word frequency on n-gram ratings in the 3-gram data. For this reason, the results of the 3-gram analysis will be reported separately from the others.
To assess the reliability of these effects, we will report the effect size of each predictor in our models. The effect size of a predictor in a linear regression model can be measured using Cohen's f², an appropriate statistic according to Cohen (1988). He suggested that f² values of 0.02, 0.15, and 0.35 should be considered, respectively, small, medium, and large effect sizes. Each model was re-fit 1000 times with bootstrapped replicates, giving a distribution of f² values⁴. The 95% CI of the effect size from this distribution is reported below. For 2-grams, the subjective frequency ratings were predicted by both the 2-gram's frequency (f² = 0.45, 95% CI 0.3, 0.56) and the second word's frequency (f² = 0.07, 95% CI 0.02, 0.14)⁵.
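As a rough illustration of the bootstrapped effect size procedure, the sketch below computes Cohen's f² = (R²_full − R²_reduced) / (1 − R²_full) and a percentile bootstrap interval for a single-predictor regression. It is a simplified stand-in under stated assumptions: the models in this study had several predictors, the data here are simulated, and all function names are hypothetical.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, i.e. R^2 of a one-predictor regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)


def cohens_f2(r2_full, r2_reduced=0.0):
    """Cohen's f^2 for the predictors added between the reduced and full model."""
    return (r2_full - r2_reduced) / (1.0 - r2_full)


def bootstrap_f2_ci(x, y, n_boot=1000, seed=1):
    """Percentile 95% CI of f^2, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(x)
    f2_values = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        f2_values.append(cohens_f2(r_squared(xs, ys)))
    f2_values.sort()
    return f2_values[int(0.025 * n_boot)], f2_values[int(0.975 * n_boot) - 1]


# Simulated log frequencies and ratings (illustration only, not study data).
data_rng = random.Random(42)
log_freq = [i / 10 for i in range(60)]
rating = [1.0 + 0.5 * f + data_rng.gauss(0, 1.0) for f in log_freq]
low, high = bootstrap_f2_ci(log_freq, rating)
print(low, high)
```

With more than one predictor, R²_reduced would come from the model without the predictor of interest, and both models would be re-fit on every bootstrap sample.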
Table 1. Regression Model Comparisons for Experiment 1. Two models for predicting the mean subjective frequency ratings of n-grams are given for each size of n-gram. The first model is nested within the second. Models in bold type were the best models. ∆df denotes the change in the number of free parameters between the two models being compared. For the 5-grams, the best model did not include the whole n-gram frequency, as shown by the equivalence of models 9 and 10. Since model 9 is simpler, it is taken to be the best model.
For the 4-grams, a more complicated model was the best fitting. The whole n-gram frequency had the largest effect (f² = 0.34, 95% CI 0.19, 0.52), followed by a weak effect of the first bigram (f² = 0.08, 95% CI 0.01, 0.18) and an unreliable effect of the second trigram (f² = 0.03, 95% CI 0, 0.12).

For the 5-grams, the addition of the whole n-gram frequency did not improve the model, so the simpler model prevailed. This simpler model had a strong effect of the initial 4-gram's frequency (f² = 0.27, 95% CI 0.13, 0.42).
For the 3-grams, the non-linear effect of wf3 meant that we could not use linear regression models or the bootstrapped f² statistic to analyse the data. Instead we applied a generalized additive model (GAM) to fit non-linear splines (Wood, 2006). A visual check of the confidence intervals around the spline shown in Figure 1 implies a strong effect of n-gram frequency and weak effects of the third word frequency wf3 and the split-gram frequency bf3. In particular, the effect of final word frequency appears to be strongest for the middle of the frequency range, with less confidence for the highest and lowest frequency words. Even the non-linear effect of wf3 is in the positive direction until it reaches the higher frequency words, where the model's confidence intervals become very wide. To confirm that there was no danger of misinterpretation due to intercorrelated predictors, we checked for multicollinearity. In all the analyses above the amount of multicollinearity between the predictors was reasonable (in all models, κ < 8).
Finally, we noted that Balota et al. (2001) had found that the group of words with the highest subjective frequency ratings had a strong relationship between objective (n-gram) and subjective frequency, and that the opposite was true for the words with the lowest subjective frequency ratings. Our n-gram ratings replicated this result: we performed a median split on all of the items in this experiment based on their average subjective frequency rating, and calculated a bootstrapped Pearson correlation between subjective rating and corpus frequency⁶ for each of the two groups. The magnitude of the correlation was larger for the set of items with the higher subjective frequency ratings: for the upper half, r(177) = 0.55, 95% CI 0.45, 0.63, and for the lower half, r(176) = 0.24, 95% CI 0.09, 0.37. Without this median split, the correlation for all items was r(355) = 0.64, 95% CI 0.58, 0.69. With an r² of 0.41 for the full data set, it is clear that n-gram frequency explains a non-trivial amount of the variance in our subjective frequency ratings.
6 For the 5-grams, qf1 was used instead of whole n-gram frequency.
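The median-split analysis can be sketched as follows. This is a minimal illustration with made-up (rating, log frequency) pairs; the helper names are hypothetical, items tied at the median are assigned to the upper half as an assumption, and the bootstrap step used for the reported confidence intervals is omitted for brevity.

```python
import statistics


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def median_split(items):
    """Split (rating, log_freq) pairs into lower and upper halves by rating.

    Assumption: items exactly at the median go to the upper half.
    """
    med = statistics.median(rating for rating, _ in items)
    lower = [it for it in items if it[0] < med]
    upper = [it for it in items if it[0] >= med]
    return lower, upper


# Made-up items: (mean subjective rating, log corpus frequency).
items = [(1, 0.5), (2, 0.1), (3, 0.9), (4, 1.0), (5, 2.0), (6, 3.1)]
lower, upper = median_split(items)
r_lower = pearson_r([r for r, _ in lower], [f for _, f in lower])
r_upper = pearson_r([r for r, _ in upper], [f for _, f in upper])
print(r_lower, r_upper)
```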

Discussion
In this exploratory look at the frequency measures that influence the subjective ratings for n-grams we found a complex pattern of evidence for n-gram frequency effects. Each n-gram size had a different pattern of frequency effects, with no clear, over-arching pattern. The first interesting result was the lack of a whole n-gram frequency effect in the 5-gram data. Participants were found to be sensitive to the frequency of the first four words of the 5-gram, but not to the whole n-gram frequency or the other n-gram frequencies. This implies that the subjective frequency estimation process did not use information about the probability of 5 words occurring together to accomplish this task. This could be due to inherent limitations in our language system on how much context we can use when learning sequential probabilities. Remillard (2011) has shown that it is possible for people to learn fourth-order sequential probabilities implicitly, as well as second- and third-order sequential probabilities. This may be an analogous manifestation of the limits of our contextual learning capabilities.
For the 2-, 3-, and 4-grams, the whole n-gram frequencies had very large effects whereas the internal frequencies had relatively weak effects. The key finding in this experiment was that, excluding the 5-grams, there was strong evidence in the 2-, 3-, and 4-gram data for a dominant effect of whole n-gram frequency (for all these effects, Cohen's f² > 0.34) and a subordinate effect of the sub-frequencies (for all these effects, Cohen's f² < 0.08). This supports our hypotheses about the sources of implicit frequency judgments, and provides the justification for the next experiments on relative subjective frequency estimation reported below.
We found an effect in this analysis that we could not have predicted when we designed this study, but that we were able to detect due to the correlational design of this experiment. By choosing stimuli that covered a broad span of frequencies we were able to detect trends that spanned the whole range. In our 3-gram ratings we found that both the n-gram frequency and the split bigram frequency bf3 contributed to predicting the data. This result suggests that the first and third words are salient for subjective frequency judgments in 3-grams, but not for other n-grams. These split-grams may be related to the non-contiguous subtrees proposed by Bod (2009). They are used in Bod's data-oriented parsing (DOP) model to help explain our ability to parse nonadjacent dependencies such as "BA carried more people than cargo in 2005" (Bod, 2009, p. 764). This non-contiguous subtree, more XX than, bears a striking resemblance to the split 3-gram, and the influence of the split 3-gram's frequency might provide some behavioral support for parsing models that allow these non-contiguous constructions. In contrast, all of the other split-grams that we included in our analyses (see Appendix A for the full list) had no detectable influence on the outcomes. More evidence on split-gram processing will need to be collected before any links can be made between probabilistic reading models and syntactic models that involve split-grams.

Relative frequency of words
With evidence from our subjective frequency rating task pointing towards n-gram frequency effects for subjective frequency ratings, the next place we looked for effects was in a more complicated task: relative frequency judgments. In general, rating tasks are limited by the use of absolute Likert scales, which are not immune to artifacts (Carifio & Perla, 2007; Jamieson, 2004). To avoid these issues, we chose to develop a relative frequency task. In this type of task the participants are shown two items at once and are asked to judge which one of them, in their experience, is more frequent. We manipulated the relative corpus frequency of the items, both in absolute terms (low frequency vs. low frequency, high frequency vs. high frequency) and in relative terms (a very small difference in frequency relative to each other or a very big difference). The power-law distribution of words and n-grams provides ample examples of items that fall into all of these categories. We chose the stimuli to cover a broad swath of the frequency spectrum and to make sure that our results were generalizable to a majority of words and n-grams.
Before attempting this task with n-grams, we sought to confirm that a relative subjective frequency judgment task was reasonable and feasible with simpler stimuli. We created a single word task that we could later extend to n-grams, and looked for evidence that our paradigm was valid for investigating relative frequency judgments.

Participants
Thirty-three students from the University of Alberta participated in this experiment in exchange for partial course credit. The mean age was 19.4 years (sd = 1.33 years) and 57% of the participants were female. All were right-handed native English speakers. None had any visual or neurological disabilities that would interfere with their participation. All subjects gave written consent to participate in the experiment, which was conducted with the approval and in accordance with the regulations of the University of Alberta Research Ethics Board.

Methods and Materials
One hundred and twenty pairs of words were chosen to meet specific experimental criteria. To avoid any effects of a relative difference in orthographic neighborhood size, each pair of words had a minimal difference in Orthographic Levenshtein Distance (OLD; Yap & Balota, 2009). The mean of the differences in OLD across all of the word pairs in our stimuli was 0.007 with a standard deviation of 0.2, meaning that each word was matched with a word with an orthographic neighborhood of almost identical size. We also used words of different lengths: there were 51 pairs of four-letter words, 37 pairs of five-letter words and 32 pairs of six-letter words. Each word pair was selected to provide the broadest possible coverage of the frequency ratio space (from large to small ratios, for high and low frequency words). The breadth of the distribution of the item frequencies is shown in Figure 2. The position of the higher frequency word was counterbalanced so that it appeared at the top of the screen 50% of the time. We used the ACTUATE experiment presentation package (Westbury, 2007) to collect RT and accuracy data in our task. Each trial began with the display of a fixation cross for a random period of time between 500 ms and 1000 ms. At that point the fixation cross was removed and the pair of words was displayed directly above and below the location of the cross. The words were displayed in 18-point Times Roman font on a white background. Participants were instructed to press the k key if the word on top was used more frequently or the m key if the word on the bottom was used more frequently. Each subject completed ten practice trials with feedback, and then all the word pairs were presented in pseudo-random order without any feedback.

Results
We first used a graphical analysis to understand the relationship between our two dependent variables and our predictors of interest. In Figure 3 (A) the mean item accuracy increased with the ratio of the orthographic frequencies (Kendall's τ = 0.5, bootstrapped 95% CI 0.41, 0.59). In Figure 3 (B), we saw a negative relationship between the corpus frequency ratio and RT (Kendall's τ = −0.31, bootstrapped 95% CI −0.41, −0.18). To quantify these effects, we created statistical models and fitted them to the data. We used generalized linear mixed effects models, or GLMMs (from the R package lme4), to understand the relationship between the independent variables and the accuracy of the participants' judgments (Baayen, Davidson, & Bates, 2008; D. M. Bates, in preparation; R Development Core Team, 2013). As with the subjective frequency data models above, we compared AIC values to find the best fitting model, and the results of those comparisons are shown in Table 2. All models include two crossed random factors, subject and item.
The number of letters in the words did not improve the models, and so it was removed from all the models. The best fitting, simplest model was an additive model that had the following structure: random intercepts for subjects and items, and random slopes for the effect of the part of the screen that the higher frequency word was placed in (stimulus position) for each subject. There was no main effect of stimulus position. The ratio of the frequencies was a strong predictor of accuracy, with greater frequency ratios producing greater accuracy. Adding random slopes for the effect of the frequency ratio on each subject improved the model fit, implying that some subjects were more sensitive to the frequency ratio information than others (Table 2, last line). The slope of the regression coefficient for the frequency ratio remained significantly different from zero, as shown in Table 3.
Figure 3. (A) Relationship between item accuracy and log frequency ratio for all the word pairs in Experiment 2. The dashed line is at the 50% accuracy level. (B) Relationship between frequency ratio and response time for all the word pairs in Experiment 2. In both of these graphs, Kendall's τ is reported rather than Pearson's r due to the heteroskedasticity of the distribution, and we have included bootstrapped 95% CIs. The gray lines show the LOWESS (locally weighted scatterplot smoothing) smooths.
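For readers unfamiliar with the rank correlation reported here, the following sketch computes Kendall's τ by counting concordant and discordant pairs. It implements the tie-corrected τ-b variant, which is an assumption, since the text does not say which variant was used. The function name is hypothetical and the simple O(n²) loop is chosen for readability rather than speed.

```python
import math


def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((C + D + Tx) * (C + D + Ty)).

    C and D count concordant and discordant pairs; Tx and Ty count pairs
    tied only on x or only on y. Pairs tied on both are ignored.
    """
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue
            elif dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom


# Made-up log frequency ratios and item accuracies (illustration only).
tau = kendall_tau_b([0.1, 0.5, 1.2, 2.0, 3.3], [0.52, 0.49, 0.61, 0.74, 0.90])
print(tau)
```

Because τ depends only on the ordering of the values, it is not distorted when the spread of accuracy or RT changes across the range of frequency ratios.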
We also performed a linear mixed effects model comparison for the log transformed response times obtained in this experiment to look at the processing load involved in making this type of judgment. Before beginning the analysis, we removed 88 outlier observations from the data set (RTs that were two and a half standard deviations above or below the grand mean RT, which made up 2% of the data). Again, all of our models contained crossed random effects for Subject and Item, but in this analysis, we included the log transformed RT from the previous trial (the first trial for each subject was assigned that subject's mean RT). This predictor was inserted to account for inter-trial temporal dependencies, which were pronounced in this experiment (Baayen & Milin, 2010). The other predictors were the ratio of the word frequencies, the button pressed, and the length of the word in letters. In Table 4 we present the results of this model comparison.
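The two preprocessing steps described above, trimming RTs beyond 2.5 standard deviations of the grand mean and building the previous-trial RT predictor, might be sketched as follows. The function names are hypothetical, and two details are assumptions: the first trial is given the mean of the subject's log RTs (the text does not say whether the mean was taken before or after the log transform), and the order in which the two steps were applied is not specified.

```python
import math
import statistics


def trim_outliers(rts, n_sd=2.5):
    """Keep RTs within n_sd standard deviations of the grand mean."""
    mean = statistics.mean(rts)
    sd = statistics.stdev(rts)
    lo, hi = mean - n_sd * sd, mean + n_sd * sd
    return [rt for rt in rts if lo <= rt <= hi]


def previous_trial_log_rt(subject_rts):
    """Log RT of the preceding trial, in presentation order, for one subject.

    Assumption: the first trial has no predecessor, so it is assigned the
    mean of that subject's log RTs.
    """
    logs = [math.log(rt) for rt in subject_rts]
    return [statistics.mean(logs)] + logs[:-1]


# Made-up RTs in milliseconds (illustration only).
kept = trim_outliers([500] * 20 + [5000])
prev = previous_trial_log_rt([100, 200, 400])
print(len(kept), prev)
```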
Table 2. Accuracy Regression Model Comparisons for Experiment 2. The dependent measure is the response of the participant (top word or bottom word). All models contain crossed random effects of Subject and Item as well as random slopes for each subject based on their sensitivity to the location of the higher frequency word. FreqRatio is the log transformed ratio of the top word's frequency to the bottom word's frequency. In Model 3, random slopes were also fitted for each subject based on their sensitivity to the item's frequency ratio.
From the model comparison we can infer that word length and the frequency ratio are important predictors and the addition of the possible confounding covariates (button and previous trial RT) did improve the model fit. Another potential source of variation in any experiment is the position of the trial in the experiment. We analyzed the effect of experimental position, and we found that there was no benefit in adding the trial number into the model (χ²(1) = 0.06, p = 0.8), implying an absence of fatigue or adaptation effects.

The best model included by-subject random slopes for the frequency ratio, length and button choice. The fact that this model was superior to all the others suggests that there was some variation in each subject's sensitivity to those three variables. The directions of the relationships in the best model are shown in Table 5. There was a negative relationship between frequency ratio and RT, meaning that there was facilitation when the frequency ratio was larger. The opposite direction was found for word length, as longer words take more time to read. The effect of Previous Trial RT was also positive, suggesting that participants exhibited a spillover effect of RT across trials. There was a trend to press the top button faster than the bottom button, reflecting a top-to-bottom bias in reading.
Table 5. Markov chain Monte Carlo (MCMC) based estimates for the coefficients for the fixed effects in the linear mixed effects model fitted to the observed RT in Experiment 2. Button is the button pressed in each trial. FreqRatio is the log-transformed ratio of the word frequencies, Length is the length of the word in letters and PrevTrialRT is the log-transformed RT for the preceding trial.

Discussion
After creating a novel relative frequency judgment task for pairs of words we found that the ratio of the words' frequencies was a powerful predictor of the participants' accuracy in detecting the more frequent word as well as the time taken to complete the task. By matching word pairs on orthographic neighborhood size, we avoided potential confounds caused by orthographic neighborhood size. Word pairs that were very close in frequency were much more difficult to judge accurately. Word pairs that were very close in frequency also took longer to process, suggesting that it is harder to distinguish the relative frequency of items that are very similar in their orthographic frequency.
Our next step was to extend this paradigm to the judgment of the relative frequency of n-grams, making it possible to compare participants' performance on multi-word stimuli to their performance on single word stimuli.

Relative frequency of n-grams
In this experiment we applied the experimental paradigm that we found to be sensitive to lexical frequency ratios in Experiment 2 to pairs of n-grams instead of pairs of words. Our hypothesis is that n-gram relative frequency will influence the choices our subjects will make when they compare them and make a subjective relative frequency judgment. The bigger the ratio of the n-gram frequencies, the greater the effect that ratio should have on the response of the participant. We can make a further prediction based on the results of Experiment 2: response times should be faster for item pairs that have larger ratios.

Participants
Forty-nine students from the University of Alberta participated in this experiment in exchange for partial course credit. The mean age was 19.3 years (sd = 1.79 years), and 65% were female. All were right-handed native English speakers. None of them reported any visual or neurological issues that would interfere with their ability to participate in the experiment. None had participated in Experiments 1 or 2. All subjects gave written consent to participate in the experiment, which was conducted with the approval and in accordance with the regulations of the University of Alberta Research Ethics Board.

Materials
The same 179 pairs of n-grams that were rated by subjects in Experiment 1 were used to create pairs of n-grams that covered a wide range of frequency ratios. We wanted to control the influence of the cue of word frequency in the n-grams and so we calculated the geometric mean of the word frequencies of the words in each n-gram using the unigram frequencies from the Google Web1T corpus (Brants & Franz, 2006). We then matched each n-gram with an n-gram that had a very similar geometric mean. By doing this, we hoped to eliminate any relative frequency cues coming from individual words in the n-grams, cues that we knew to be salient, as we found they influenced performance in the relative frequency judgment task in Experiment 2. Figure 4 shows the distribution of ratios for all the stimuli in this experiment. The stimuli covered most of the lower left quadrant of the frequency space, while the upper left and upper right quadrants of the space cannot be filled due to the nature of language corpora and our stimulus matching criteria. In particular, there were no n-gram pairs in which one whole n-gram frequency was many times greater than the other while the requirement that the geometric means of the word frequencies be matched was still met. This is because the frequency of an n-gram is bounded by the frequencies of its component words: an n-gram cannot be more frequent than the least frequent of its words.
With the effect of lexical frequency balanced on each trial, we restricted the source of variation to other types of information. The distributions of the frequency ratios of all of the n-grams used in this study are given in Appendix B.
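The matching step can be illustrated with a small sketch. The geometric mean is computed on the component word frequencies, and n-grams are then paired with the neighbour closest in geometric mean. The pairing rule used here (sort, then pair adjacent items) is an assumption made for illustration, since the exact matching procedure is not described; the function names and all frequency counts are hypothetical.

```python
import math


def geometric_mean(freqs):
    """Geometric mean of positive frequencies, computed in log space."""
    return math.exp(sum(math.log(f) for f in freqs) / len(freqs))


def match_pairs(word_freqs_by_ngram):
    """Pair n-grams whose component-word geometric means are adjacent in rank.

    Hypothetical greedy rule: sort by geometric mean and pair neighbours.
    An unpaired leftover item (odd count) is dropped.
    """
    ranked = sorted(word_freqs_by_ngram,
                    key=lambda ng: geometric_mean(word_freqs_by_ngram[ng]))
    return [(ranked[i], ranked[i + 1]) for i in range(0, len(ranked) - 1, 2)]


# Made-up unigram counts for the words of each n-gram (illustration only).
word_freqs = {
    "green hills": [10, 1000],
    "weekend trips": [100, 110],
    "red hat": [5, 20],
    "give the": [3, 40],
}
pairs = match_pairs(word_freqs)
print(pairs)
```

With matched geometric means, the words in the two n-grams of a pair are, on average, equally frequent, so any remaining preference has to come from information about the word combinations.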

Methods
We used the same method as in Experiment 2. After ten practice trials with feedback, all of the n-gram pairs were presented in pseudo-random order for each participant, with no feedback. The more frequent n-gram appeared on top of the less frequent n-gram 50% of the time. The presentation format and instructions were identical to those used in Experiment 2.

Results: Accuracy
The overall accuracy with which our participants identified the higher frequency n-gram was above chance. We computed a bootstrapped confidence interval around the proportion of responses that correctly identified the n-gram that was more frequent in the Google Web1T corpus. We found that for 2-grams, the mean accuracy for all subjects on all items was 0.6 (95% CI: 0.58, 0.61), for the 3-grams it was 0.62 (95% CI: 0.6, 0.64), for the 4-grams it was 0.57 (95% CI: 0.55, 0.6), and for the 5-grams it was 0.55 (95% CI: 0.52, 0.57). This was an aggregate analysis, not a trial-level model, and for our trial-level model, we hypothesized that the ratio of the two n-gram frequencies would be the key predictor. If one n-gram was more frequent than the other, and if it was on the top of the computer screen, the trial-level model should predict a greater probability of choosing the n-gram on the top of the screen.
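A percentile bootstrap around a proportion correct, of the kind reported above, can be sketched as follows. The responses are simulated, the function name is hypothetical, and the sketch resamples individual trials; whether trials, items, or subjects were the unit of resampling in the reported analysis is not stated, so that choice is an assumption.

```python
import random


def bootstrap_proportion_ci(correct, n_boot=2000, seed=0):
    """Percentile 95% CI for the mean of 0/1 responses.

    Resamples trials with replacement (an assumed resampling unit).
    """
    rng = random.Random(seed)
    n = len(correct)
    props = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    return props[int(0.025 * n_boot)], props[int(0.975 * n_boot) - 1]


# Simulated responses: 60 correct and 40 incorrect (illustration only).
responses = [1] * 60 + [0] * 40
ci_low, ci_high = bootstrap_proportion_ci(responses)
print(ci_low, ci_high)
```

Accuracy is taken to be above chance when the lower bound of the interval lies above 0.5.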
We used GLMMs to understand the relationship between the stimuli and the trial-level responses of the participants (Baayen, Davidson, & Bates, 2008). All of our models included the random effect of item. There was no improvement in the models when we added the crossed random effect of subject, and so it was left out. In the stepwise elimination process we also compared our best models with more complex models that included predictors such as trial number and all the individual word and smaller n-gram frequencies, but these uniformly fit worse, and are not reported here. Despite the lack of random subject effects, we had noted that some participants commented on their personal strategies after completing the experiments, and these comments led us to believe that some of our participants were more careful than others in judging the relative frequency of the n-grams. If the level of consideration truly differed among participants, our analysis would benefit from taking each participant's level of effort into account. One way to diagnose this confound would be to look for a speed-accuracy trade-off, where we should see that faster-responding subjects were using a "give up quickly when unsure" strategy to pick the n-gram they thought was more frequent. Other, slower responding participants might have used a different strategy. To see if there was a modulation of accuracy based on each subject's strategy, we calculated a mean of all the RTs on all trials (for all n-gram sizes) for each of the 49 participants in our experiment. We then entered this number, the subject's average speed, as a predictor in our models to see if it interacted with our predictor of interest, the n-gram frequency ratio⁷.
We compared three models for each n-gram size: a model with no fixed effects, a frequency ratio model without any interactions, and a model with an interaction between frequency ratio and subject speed (all models included the random effect of item). The results of the model comparison are shown in Table 6. From the model comparison we see that for the 2-, 3-, and 4-grams, the ability of the models to predict trial-level accuracy improved when there was an interaction allowed in the models. For the 5-grams, the three models were equally good, implying that there was no effect of n-gram frequency ratio.
The coefficients for the three best GLMMs are shown in Table 7. All of the coefficients were in the directions we predicted. For the interaction models, a larger n-gram frequency ratio increased the probability of choosing the more frequent n-gram, but this effect was modulated by the response speed of the subjects. The effect was stronger for slower subjects, and weaker for faster subjects. The interaction plots in Figure 5 show the modulating effect of subject speed on the sensitivity to n-gram relative frequency.
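One way to write down an interaction model of this kind is given below. It is a sketch rather than the exact specification that was fitted: the symbols are introduced here for illustration, with r_i the frequency ratio for item pair i (assumed to be log transformed), v_s the mean RT of subject s, and b_i the random intercept for item i. Whether a main effect of subject speed was included alongside the interaction is also an assumption.

```latex
\operatorname{logit} P(y_{si} = 1)
  = \beta_0 + \beta_1 r_i + \beta_2 v_s + \beta_3 \, r_i v_s + b_i,
\qquad b_i \sim \mathcal{N}(0, \sigma_b^2)
```

Under this parameterization the slope of the frequency ratio for subject s is β₁ + β₃ v_s, so the reported pattern (a stronger ratio effect for slower subjects) corresponds to β₃ having the same sign as β₁ when v_s is measured as mean RT.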
7 Why didn't the generalized linear mixed effects models detect this strategy difference in the random effects for each subject? The reason could be that the fixed effect of subject speed interacted with another fixed effect, the stimulus frequency ratio. Random slopes were also fit, but the models with random slopes were not as good as the models with fixed effects for subject speed based on χ² tests.

Discussion: Accuracy
The relationship between the n-gram frequency ratios and the n-grams chosen by our participants as the more frequent n-gram was consistent across 2-, 3-, and 4-grams when taking into account the response speed of our participants. These results support a broad sensitivity for n-gram frequency in this task, extending the size of this sensitivity from single words (that we found in Experiment 2) to 2-, 3-, and 4-word n-grams. We found that by using more complex models that accounted for a speed-accuracy trade-off strategy across our participants we were able to create better models.
We found that when we averaged across participants, the accuracy of the responses for the 5-grams was above chance, but just barely (55%). At the trial level, there were no n-gram frequency ratio effects on accuracy found for 5-grams. The reason for this may be that judging the relative frequency of 5-grams is beyond our capabilities. This could also be linked to the lack of a relationship between 5-gram frequency and 5-gram subjective frequency ratings in Experiment 1. Subjective frequency effects for 5-grams will need further investigation; as of now the evidence for such effects is weak.
Figure 5. The n-gram frequency ratio is the frequency of the n-gram presented on the top divided by the frequency of the n-gram presented on the bottom. The dependent measure is the probability of choosing the n-gram presented on the bottom. The subject speed is the grand mean of each participant's response times in the experiment. The interaction is plotted at 5 intervals spread equally across the range of the subjects' speeds.
What about the predictors which did not enter into the models during model selection? Crucially, the individual word frequencies were not found to improve the fit of any of the models for accuracy for any of the n-gram types in the experiment, meaning that our method of matching pairs of n-grams to reduce the influence of word frequency was successful. The component n-gram frequencies and frequency ratios (i.e. for 3-grams: bf1, bf2 and bf3 and their ratios) did not improve any of our models either. It appears that embedded n-gram frequency ratios were not relevant to the n-gram relative frequency judgment task. In contrast, they were relevant in the subjective frequency rating task in Experiment 1. The reason for this difference is unclear.

Results: Response Time
We created linear mixed effects models for the RTs for each of our n-gram sizes, with crossed random effects for subject and item. Before looking at other covariates, we tested the effect of one common time-related predictor: previous trial response time. It was found to be a reliable predictor in all of our data and was entered into all the models. Previous trial RT had a consistently positive influence on RT: when a trial took longer, the next trial was also longer. We then looked at the n-gram frequency ratios, but they only predicted the response time for 3-grams. For the 3-grams there was a negative slope for the coefficient, indicating that a larger log ratio of the n-gram frequencies predicts a shorter reaction time. For the 2-grams, no n-gram frequencies had reliable effects. For the 4- and 5-grams, only one of the n-gram frequencies was a significant predictor. For the 4-grams, it was the frequency of the bottom n-gram and for the 5-grams, it was the frequency of the top n-gram. The direction of the relationship for the 4- and 5-grams was also negative, meaning that the more frequent the n-gram, the faster the task was completed.
After fitting the best models, we performed model criticism by removing data that had residuals that were 2.5 times greater than or 2.5 times less than the mean of the residuals. This procedure did not change the outcome of any of our analyses and indicated that our models were not overly influenced by extreme values. The estimated coefficients for all of the fixed effects in these models and their 95% highest posterior density intervals are shown in Table 8.
For completeness, we also performed model comparisons between simpler models without the effects of any n-gram frequency ratios and ones with the effect of n-gram frequency ratios. In brief, all the comparisons showed improvements in model fit after the addition of the frequency ratios (all p < 0.01 for the χ² tests). We also found that there was no benefit in adding the higher frequency stimulus position into the model (p > 0.05 for all χ² tests), implying that participants were not speeding up based on the location of the correct answer.

Discussion: Reaction Time
Response times in this task were predicted by the frequencies or ratios of frequencies of the n-grams in the stimuli for the 3-, 4-, and 5-grams. It is unclear why there were no frequency effects on the RTs for the 2-gram judgments. The results for the 4- and 5-grams differed from the results in the single word task in Experiment 2, where, as with the 3-grams, the response time increased as the frequency ratio decreased. For the 4- and 5-grams, the only effect was one of facilitation for the more frequent n-grams, mirroring the results of Arnon and Snider (2010), who found that more frequent n-grams were read faster. The reason that the frequency ratio effect went away may be that the impact of the frequency ratios on the time course of the n-gram relative frequency judgment task is smaller than in the single word task.
Thus far, we have extended the relative frequency judgment task from pairs of words to pairs of n-grams. N-gram frequencies again predicted the accuracy in detecting the more frequent n-grams. This result suggests that the subjective frequency of n-grams is something that is accessible to us when it is useful. In summary, we found in Experiment 3 that the probability that the participants could correctly identify the n-gram with the higher corpus frequency was linked with the relative frequency of the n-grams. What kind of model of linguistic processing could help explain these results? In the General Discussion we will apply a computational model of lexical learning and discrimination to try and answer the question "What makes some n-grams seem more frequent than others?"

GENERAL DISCUSSION
In the three experiments presented here we looked at the subjective frequency effects in words and n-grams and how they were related to objective frequency. In Experiment 1, we found that the subjective ratings of frequency for n-grams were correlated with their corpus frequencies, just as they were for single words. In Experiment 2 we introduced a relative frequency judgment task and applied it to the relative frequency of words. In Experiment 3 we extended this task to n-grams, and we saw that the frequency ratio of n-grams can predict the likelihood of correctly choosing the higher frequency n-gram in a forced choice task. Our efforts to remove lexical frequency cues by matching stimuli by the geometric mean of their component word frequencies were successful, as we saw no predictive input from word frequencies or the ratio of their word frequencies. N-gram frequencies were the key predictors of accuracy in Experiment 3. These results imply that people have some type of knowledge that is connected to the relative frequency of n-grams and that they are able to access implicitly. Does this mean that people "store" n-grams? Is it conceivable that there is a mental lexicon containing all the n-grams a person has seen or heard before? This possibility looks increasingly untenable. Forster and Hector (2002) propose that we search our lexicon for items, and, as we noted, we estimate that the number of representations that would need to be searched in a localist model of language that included words and n-grams in a lexicon would be at least 10⁹. Even if this search could proceed at speeds faster than the fastest known parallel search algorithms, it would still be too slow to be plausible. Our thinking is that an emergent account of lexical processing that does not depend on unique representations for n-grams is the only logical possibility.
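The two quantities that this summary relies on, the log frequency ratio of a stimulus pair and the geometric mean of the component word frequencies used for matching, can be illustrated with a minimal sketch. The function names and all counts below are hypothetical; they are not taken from the stimuli or from the Web1T data.

```python
import math

def geometric_mean(freqs):
    """Geometric mean of positive frequency counts, computed in log space."""
    return math.exp(sum(math.log(f) for f in freqs) / len(freqs))

def log_frequency_ratio(freq_top, freq_bottom):
    """Log of the ratio of two n-gram frequencies (top item over bottom item)."""
    return math.log(freq_top / freq_bottom)

# Hypothetical counts for one stimulus pair (illustrative values only).
top = {"ngram_freq": 120_000, "word_freqs": [4_000_000, 9_000_000]}
bottom = {"ngram_freq": 15_000, "word_freqs": [6_000_000, 6_000_000]}

# The component words are matched on their geometric mean, so word
# frequency offers no cue; only the whole n-gram frequencies differ.
gm_top = geometric_mean(top["word_freqs"])
gm_bottom = geometric_mean(bottom["word_freqs"])
ratio = log_frequency_ratio(top["ngram_freq"], bottom["ngram_freq"])
print(gm_top, gm_bottom, ratio)
```

A positive log ratio means that the top item is the more frequent n-gram, and a ratio near zero marks a pair that should be hard to judge.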
Another critique of the storage model comes from the literature on learning. Learned frequency knowledge is often used implicitly in many tasks, linguistic and non-linguistic, such as word segmentation (Saffran, Aslin, & Newport, 1996), lexical recognition (Seidenberg & McClelland, 1989), visual object perception (Kirkham, Slemmer, & Johnson, 2002) and many others. In that sense, it is not surprising to see subjective frequency effects for groups of words, but the sheer number of n-grams that we are exposed to in our lives makes it difficult to see how it is possible to keep track of our exposure to each n-gram. The definition of the "mental lexicon" has recently been criticized by Elman (2009, 2011) and Dilkina et al. (2010). Taking our cue from Elman's ideas, we feel that our research supports the notion of n-gram processing as a dynamic, interactive relationship between many types of non-symbolic knowledge. Memory systems incessantly interact with perceptual systems and production systems when reading, and learning is taking place at all times, irrespective of context or amount of previous experience (Ramscar & Dye, 2011). In this view of the linguistic system, recall of episodic memory traces, ease of articulatory simulation and ease of semantic accessibility all contribute to our ability to judge the absolute and relative frequency of n-grams. Frequency of exposure to n-grams will contribute to what is learned and what is unlearned in all of these mental systems, and this could explain why our data show such a consistent influence of n-gram frequency on performance in our tasks.
Another way that n-gram subjective frequency may emerge is from the sensation of fluency which some n-grams produce. Much as lexical processing takes longer for words that are rare, n-gram processing may take longer for n-grams that are new to us or rarely seen. If n-gram subjective frequency emerges from the same processes that produce lexical subjective frequency, and if subjective frequency is related to the speed of lexical recognition, then we can look at recent models of word recognition for ideas on how this may happen. Some recent models posit a process of accumulation of evidence when we read and recognize words (Baayen, Milin, et al., 2011; Dilkina et al., 2010; Norris & Kinoshita, 2008). One of these models, Naive Discriminative Learning (NDL), has already been applied to modeling the reading of n-grams, so we applied it to our data to see how well it could simulate performance on our three tasks.
It is important to point out here that the current NDL implementations do not assume separate representations for word forms or n-gram forms, but rather show the emergence of morphological and lexical effects using nothing but sub-lexical probabilistic information. The cues (letter n-grams) and the error that arises from seeing or not seeing a specific outcome are what allow a discrimination learning model to learn (Ramscar, Yarlett, et al., 2010). Baayen, Hendrix, and Ramscar (2012) used an NDL model to predict reading times for the n-gram stimuli used by Arnon and Snider (2010). The NDL model predicted the reading time from the model's knowledge of the statistical properties of the patterns of letters and letter bigrams in the input, replacing the n-gram frequency predictor in the original model. We wanted to know if NDL could replace n-gram frequency in our models for subjective frequency ratings and judgments in our experiments.
Using a procedure similar to that reported by Baayen, Hendrix, and Ramscar (2012), we created a sub-lexical NDL model of the English language. All the simulations described below were implemented using the NDL package version 0.2.7 (Shaoul, Arppe, Hendrix, & Baayen, 2013) within the R programming environment, version 3.0.0 (R Development Core Team, 2013). We trained our NDL model on a 500 million word corpus of USENET posts (Shaoul & Westbury, 2009) so as to have an input corpus closer in size to the 1 trillion word Google Web1T corpus. Letter trigrams were used as cues, and words were used as outcomes. We calculated NDL activations for the words in our n-grams using all the cues in the full n-gram as input to the NDL perceptron network. We then entered the word activations into our statistical models and inspected how the addition of NDL-derived predictors affected the predictive power of n-gram frequency.
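To make the cue and outcome structure concrete, the following minimal sketch trains a toy network with letter trigrams as cues and words as outcomes, and then sums the weights from the cues of a full n-gram to one word to obtain an activation. It is not the NDL package's implementation: the incremental Rescorla-Wagner style update, the '#' boundary symbol, the learning rate, the function names and the four-phrase corpus are all assumptions made for illustration.

```python
from collections import defaultdict

def letter_trigrams(text):
    """Set of letter trigrams in text; '#' marks spaces and string edges."""
    padded = "#" + text.lower().replace(" ", "#") + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def train(events, outcomes, rate=0.01, epochs=10):
    """Error-driven (Rescorla-Wagner style) learning of cue-to-outcome weights."""
    weights = defaultdict(float)  # (cue, outcome) -> association strength
    for _ in range(epochs):
        for cues, present in events:
            for outcome in outcomes:
                # Prediction error: target minus summed support from active cues.
                predicted = sum(weights[(c, outcome)] for c in cues)
                target = 1.0 if outcome in present else 0.0
                delta = rate * (target - predicted)
                for c in cues:
                    weights[(c, outcome)] += delta
    return weights

def activation(weights, cues, outcome):
    """Summed weight from a set of cues to one word outcome."""
    return sum(weights.get((c, outcome), 0.0) for c in cues)

# Toy corpus (hypothetical): each phrase is one learning event.
corpus = ["green hills"] * 3 + ["green tea"]
events = [(letter_trigrams(p), set(p.split())) for p in corpus]
outcomes = {w for p in corpus for w in p.split()}
weights = train(events, outcomes)

# Activation of "hills" from the cues of its own n-gram versus another n-gram.
own = activation(weights, letter_trigrams("green hills"), "hills")
other = activation(weights, letter_trigrams("green tea"), "hills")
print(own, other)
```

In this toy setting the word "hills" receives more support from the trigrams of "green hills" than from those of "green tea", because the shared trigrams of "green" are penalized whenever "hills" fails to appear.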
For the models from Experiment 1, entering the NDL activations did not reduce the influence of n-gram frequency. The only model which was improved by the addition of NDL activations was the model for predicting 3-gram mean ratings. The sum of the activations of the three words in the trigram took the place of the split-gram frequency bf3 and the final word frequency wf3 in a better fitting model (∆AIC = 15.1). The coefficients for this model are shown in Table 9. This model had an increase in the adjusted R² of 10 percentage points over the original model (from 45% to 55%).
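The size of this ∆AIC can be put in perspective with a small calculation. The sketch assumes the common conventions AIC = 2k − 2 ln L and ∆AIC = AIC(original model) − AIC(new model); the sign convention, the helper names and the log-likelihoods are assumptions chosen only so that the difference equals the reported 15.1.

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * n_params - 2 * log_likelihood

def relative_likelihood(delta_aic):
    """Relative likelihood of the higher-AIC model: exp(-delta / 2)."""
    return math.exp(-delta_aic / 2.0)

# Hypothetical fits: the new model swaps two predictors for one.
aic_original = aic(log_likelihood=-210.55, n_params=4)
aic_with_ndl = aic(log_likelihood=-204.0, n_params=3)
delta = aic_original - aic_with_ndl

print(delta, relative_likelihood(delta))
```

Under these conventions, a difference of about 15 leaves the original model with a relative likelihood well below one in a thousand.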
For Experiment 2, the ratio of the activations made for a much better fitting model of the response choice (∆AIC = 24.8), but the word frequency ratios continued to contribute to the prediction independent of the word activation ratios (see Table 10 for the generalized linear model coefficients). The directions of the NDL activation ratio effect and the word frequency ratio effect were both negative. Finally, we attempted to apply our NDL model to the relative frequency judgments in Experiment 3. We created ratios of the summed activations for the words in each n-gram. We then added these ratios along with the n-gram ratios into generalized linear models predicting the proportion of responses (top versus bottom), and in all the models the addition of the activation ratio did not reduce the effectiveness of n-gram frequency or improve the models.
The conclusion that we draw from the results of the NDL simulations is that the discrimination learning in our sub-lexical NDL model is not the main source of our participants' subjective frequency knowledge. Unlike the visual reading task used by Arnon and Snider (2010), and modeled using NDL by Baayen, Hendrix, and Ramscar (2012), subjective frequency tasks are necessarily slower and more complicated (some of our subjects took up to 8 seconds to complete one trial). We envisage an interplay between bottom-up and top-down processes. The simplicity of our NDL model hampered its ability to capture this complexity. The current model does not have the ability to understand what a word boundary is, which might be crucial in determining the subjective frequency of an n-gram.
Despite the inability of our NDL model to supplant n-gram frequencies in our analysis of our data, we feel that subjective frequency is within the scope of discrimination learning to simulate. Perhaps what is missing from our current NDL implementation is a word-level language model. The sub-lexical model we attempted to use here is unaffected by word order. There is undeniably a strong effect of word order on the subjective frequency of an n-gram (e.g., appear to be versus be to appear). One variant of our NDL model has the potential to model this type of inter-word dependency relationship: a lexical-level discrimination model. We intend to build such a model, one that uses context words as cues to lexical targets, and see if it can do what the sub-lexical model cannot do. This type of model will learn which words reduce the uncertainty of seeing a particular word. Unlike a simple Markov chain, this model will contain the full power of a discriminative learning system, with cue co-learning and error learning shaping the association between cues and outcomes. In this type of model the two n-grams alarm bells and bells alarm would not be equivalent, and hence the simulation of the subjective frequency would become feasible. If this model is capable of modeling our experimental data, it will be a system that can simulate n-gram processing without any local representations for n-grams, which is our goal. What else could explain the subjective frequency of n-grams? In general, any model of language processing that incrementally learns co-occurrence patterns at different grain-sizes might be a candidate. To test these models, more data relating to n-gram processing will need to be collected for analysis. There is much left to be done.
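As a rough illustration of why a lexical-level model would be sensitive to word order, the sketch below uses the words that precede a target as cues and the target word as the outcome. This is only one possible reading of the proposal, not the model we intend to build: the cue definition, the Rescorla-Wagner style update, the learning rate, the function names and the two-sentence corpus are all hypothetical.

```python
from collections import defaultdict

def context_events(sentence):
    """Yield (cues, target): all preceding words are cues for each later word."""
    words = sentence.split()
    for i in range(1, len(words)):
        yield set(words[:i]), words[i]

def train_context_model(sentences, rate=0.1, epochs=5):
    """Error-driven learning of word-to-word weights with cue competition."""
    vocab = {w for s in sentences for w in s.split()}
    weights = defaultdict(float)  # (cue_word, outcome_word) -> strength
    for _ in range(epochs):
        for sentence in sentences:
            for cues, target in context_events(sentence):
                for outcome in vocab:
                    predicted = sum(weights[(c, outcome)] for c in cues)
                    goal = 1.0 if outcome == target else 0.0
                    delta = rate * (goal - predicted)
                    for c in cues:
                        weights[(c, outcome)] += delta
    return weights

def ngram_support(weights, ngram):
    """Summed support for each word of an n-gram from the words before it."""
    return sum(
        sum(weights[(c, target)] for c in cues)
        for cues, target in context_events(ngram)
    )

# Toy corpus (hypothetical): "alarm" precedes "bells", never the reverse.
sentences = ["the alarm bells rang", "alarm bells woke us"]
weights = train_context_model(sentences)

forward = ngram_support(weights, "alarm bells")
reversed_order = ngram_support(weights, "bells alarm")
print(forward, reversed_order)
```

Because cues only ever point forward in the sentence, the two orders of the same word pair receive different support, which is the property that the sub-lexical model lacks.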
The work presented in this paper supports the notion that n-gram probability has a contribution to make to the understanding of language processing, one that will allow us to explore language processing in new ways. The vast majority of models for word and sentence processing have thus far avoided dealing with the impact of n-gram probability on behavior. We have presented experimental evidence that the granularity of the statistical relationships being learned by readers extends beyond words to n-grams, and that the probability of being exposed to n-grams influences their subjective frequency. This leads us to the conclusion that the time has come to embrace probabilistic models and apply them to larger groups of words. There may be fundamental upper bounds to the complexity of the probabilistic information that we can use when reading n-grams, and understanding those constraints will require further investigation.

A3. 5-grams used in Experiment 3.
and to see that the : to the case on the
win friends and influence people : was rumoured that he had
at the end of each : to change the lives of
is the purchase of a : or for the development of
about what can happen to : and learn everything there is
the beginning of the next : of what the year has
all water under the bridge : is at least four times
thank you so much for : always ready to help you
which they have already received : had a distinct impact on
of their registered owners and : serve as a guide to
there are plenty of opportunities : given over a long period
is less like an annoying : are paying close attention to
if we did not know : gives us a sense of
is the name of a : of the city by the
support the full range of : be able to accept a
couple of weeks or so : a very active forum for
gave birth to a beautiful : help you organize your home
used as a kind of : all of whom had the
that the changes in the : and that he is a
data that can not be : in the front or back
it did not seem to : to help you prepare for
be implemented in the future : but good enough for a
so you can find out : a chance of showers and
play an active role in : safer to keep it here
was sentenced to six months : opportunity to introduce ourselves as
here and there in the : and at the beginning the
preparation for life in the : the result of arbitrary and
has nothing to do with : ask to speak to a
with an interesting story or : occurred early in the project
ways to get rid of : at least one year after
that all words are spelled : we propose to carry out
would like to see this : were also of the opinion
appear within a few moments : keep in mind when picking
finally took the plunge and : stable at room temperature for
of the ability of his : to know and keep the
going to have to get : the various properties of the
the first step in the : and can be used for
of the last day of : may not be on the
is a leader in the : and now the process of

Figure C1. Importance for predictors in a random forest model of mean item rating in Experiment 1. After creating random forest models, we calculated the relative importance of all of the log-transformed n-gram frequency variables in predicting mean subjective frequency ratings, adjusted for correlations between predictor variables (both for the main effects and the interactions). The dotted red lines mark the selection criterion of 3% mean decrease in accuracy.

[Figure C1 panel: 5-grams; axis: mean decrease in accuracy]

Figure 2. Distribution of relative frequencies of stimuli for all word pairs presented in Experiment 2.

Figure 4. Distribution of relative frequencies of stimuli for all n-gram pairs presented in Experiment 3.

Figure 5. Interaction plots for the GLMMs for each n-gram size. The n-gram frequency ratio is the frequency of the n-gram presented on the top divided by the frequency of the n-gram presented on the bottom. The dependent measure is the probability of choosing the n-gram presented on the bottom. The subject speed is the grand mean of each participant's response times in the experiment. The interaction is plotted at 5 intervals spread equally across the range of the subjects' speeds.

Table 3. Coefficients for the fixed effects in the generalized linear mixed effects model fitted to the observed accuracy for word pairs in Experiment 2 from Model 3 in Table 2. FreqRatio is the log-transformed ratio of the word frequencies. This model also included crossed random intercepts of subject and item as well as random slopes for each subject based on their sensitivity to the item's frequency ratio.

Table 4. RT Regression Model Comparisons for Experiment 2. PrevTrialRT is the log-transformed RT on the previous trial. FreqRatio is the log-transformed ratio of the word frequencies. Length is the number of letters in the word. Button is the trial-level choice made by each subject.

Table 6. Response choice GLMM comparisons for Experiment 3. All n-gram frequency ratios were log-transformed and all models contained the random effect of item. Note the lack of a difference between the models for the 5-gram data.

Table 7. Coefficients for the fixed effects in the generalized linear mixed effects model fitted to the observed responses on n-gram pairs in Experiment 3.

Table 8. Markov chain Monte Carlo (MCMC) estimates of the coefficients for the fixed effects in the linear mixed effects model fitted to the observed RTs on n-gram pairs in Experiment 3. All models contain crossed random effects for subject and item.

Table 9. Coefficients for the fixed effects in a linear model predicting 3-gram subjective frequency ratings using NDL simulated activations and n-gram frequencies. The NDL activation for the 3-gram has superseded wf3 and bf3.

Table 10. Coefficients for the fixed effects in a generalized linear model predicting relative frequency judgments for words using NDL simulated activation ratios and word frequency ratios.