Simultaneous effects of inflectional paradigms and classes in processing of Serbian verbs

In this paper we show that the processing of inflected verb forms is simultaneously influenced by the distributional properties of their inflectional paradigm (all the inflected forms of the given verb) and by those of their inflectional class (all the verbs that conjugate in the same manner). Thus, we generalize a finding that was previously observed with nouns. We demonstrate that a divergence of the frequency distribution within the inflectional paradigm from the frequency distribution within the inflectional class (operationalized as Relative entropy between the two frequency distributions) is detrimental to processing. We present the results of a visual lexical decision experiment and the results of a simulation that was run in the Naive Discriminative Reader, a simple computational model based on basic learning principles. We show that Relative entropy between an inflectional paradigm and an inflectional class predicts both empirically observed and simulated processing latencies. By doing so, we add to the body of research that investigates processing effects of information-theoretic descriptions of language. We also demonstrate that the effect of Relative entropy on the processing of morphology can arise as a consequence of the principles of discriminative learning in a system that maps input cues to outcomes, with no specification of morphology per se.

This paper continues a line of research which demonstrated that the processing of nouns is affected both by the properties of the frequency distribution of their inflected forms and by the properties of the summed (or average) frequency distribution of the inflected forms of all the nouns that belong to the same inflectional class (Baayen, Milin, Filipović Đurđević, Hendrix, & Marelli, 2011; Milin, Filipović Đurđević, & Moscoso del Prado Martín, 2009). More precisely, it was shown that an increase in the divergence between the two distributions was followed by longer processing latencies and more errors in the visual lexical decision task. Here, we aimed to demonstrate that this finding could be generalized to other word classes, namely verbs. In other words, we wanted to demonstrate that verb processing was also more costly if the difference between the two distributions was large. Additionally, we aimed to demonstrate that this effect could arise as a consequence of some basic learning principles. We did so by running simulations in the Naive Discriminative Reader (NDR; Baayen et al., 2011), a model based on the principles of Naive Discriminative Learning (NDL) which had already been used to account for several morphological effects.

Information Theory and morphology research
Processing effects of variables based on information theory descriptions were demonstrated for many language-related phenomena (Ackerman, Blevins, & Malouf, 2009; Ackerman & Malouf, 2013; Baayen et al., 2011; Balling & Baayen, 2012; Filipović Đurđević, 2007; Filipović Đurđević, Đurđević, & Kostić, 2009; Frank, 2010, 2013; Hale, 2001, 2003, 2006; Kemps, Wurm, Ernestus, Schreuder, & Baayen, 2005; Kostić, 1991, 1995; Kostić & Havelka, 2002; Kostić, Marković, & Baucal, 2003; Kostić & Mirković, 2002; Levy, 2008; Milin, Kuperman, Kostić, & Baayen, 2009; Moscoso del Prado Martín, Kostić, & Baayen, 2004; Pluymaekers, Ernestus, & Baayen, 2005; Tabak, Schreuder, & Baayen, 2005; Wurm, Ernestus, Schreuder, & Baayen, 2006). For example, in a series of research papers Kostić and his colleagues demonstrated that processing of inflected forms of nouns, adjectives, and verbs was influenced by the complexity of the given inflected form. In their research, this complexity was expressed as information load (surprisal; Shannon, 1948) derived from the average frequency per syntactic function/meaning of that form (i.e., its role in a sentence; Filipović & Kostić, 2003; Kostić, 1991, 1995; Kostić & Havelka, 2002; Kostić et al., 2003; Kostić & Mirković, 2002). The average frequency per number of syntactic functions/meanings of the inflected form in question was expressed relative to other inflected forms of the same lemma (stem), indicating the importance of the full inflectional paradigm. The influence of complexity of the full inflectional paradigm on processing was further demonstrated in a line of research that reported processing effects of the Inflectional Entropy (Baayen, Feldman, & Schreuder, 2006; Moscoso del Prado Martín et al., 2004; Tabak et al., 2005).
Furthermore, subsequent research revealed that processing was affected not only by the information complexity of the given inflected form and of the full set of related inflected forms, but also by the relation between the set of inflected forms of the given word (the inflectional paradigm) and the set of all the words which take the same exponents, that is, build their inflected forms in the same way (the inflectional class; Baayen et al., 2011; Hendrix, Bolger, & Baayen, 2016; for the same effects in derivational morphology see Kuperman, Bertram, & Baayen, 2010; Milin, Kuperman et al., 2009). The concepts of inflectional paradigm and inflectional class were defined in linguistics (Anderson, 1992; Aronoff, 1994; Blevins, 2006). In psycholinguistic research the relation between the inflectional paradigm and the inflectional class was operationalized as Relative entropy, or the Kullback-Leibler divergence between the two frequency distributions. The frequency distribution of the paradigm consisted of surface frequencies of all inflected forms of the given noun, whereas the frequency distribution of the class consisted of suffix (exponent) frequencies, that is, cumulative frequencies of the given inflected forms of all the nouns that belong to the same class (take on the same exponents; e.g., feminine nouns). The larger the deviance of the paradigm frequency distribution from that of the class, the costlier the processing was. This indicated that words that had typical inflected form frequency distributions were easier to process than those with rare and atypical distributions of form frequencies.
The detrimental processing effect of dissimilarity between frequency distributions of the paradigm and the class was demonstrated in several different experimental settings. It was first demonstrated in a visual lexical decision task (Milin, Filipović Đurđević, & Moscoso del Prado Martín, 2009; see also Milin, Kuperman et al., 2009), but it was soon also observed in a self-paced reading and primed lexical decision task (Baayen et al., 2011), as well as in an eye-tracking experiment (Kuperman et al., 2010). Baayen and his colleagues (2011) extended the finding to English prepositional phrases, showing that the (dis)similarity of the frequency distribution of the co-occurrence of the noun with different prepositions (i.e., the prepositional phrases of the given noun) and the overall prepositional phrase frequency distribution (i.e., prepositional phrases of all the nouns) also influenced processing. Recently, the same effect was documented in a primed picture-naming task, where effects on both processing latencies and ERPs were observed (Hendrix et al., 2016).

Naive discriminative learning perspective
The effects of relative entropy were simulated in the NDR (Baayen et al., 2011), a computational model based on the principles of discriminative learning. This model entails a very simple architecture that maps orthographic input onto meaning (output). The input is specified as a set of unigrams, bigrams, or trigrams. For example, an inflected form of the word gleda (he looks) is segmented into bigrams #g, gl, le, ed, da, a#. The output refers to the meaning in general and is specified by the language unit that refers to the experience that is to be discriminated from other sets of experiences. This is usually achieved by using a lemma as the output unit (e.g., gledati - to look), but the same can be achieved by using multi-word phrases. It is important to note that the output units are not interpreted as representational units per se, but rather as links between the orthographic input and the experience that the input is referring to. In order to separate the NDR output units from lexemes and other traditional language units, the authors termed the NDR output units lexomes (Baayen, Shaoul, Willits, & Ramscar, 2015). In the course of training, the model learns to separate the cues that are good predictors of the outcomes from the cues that are not. This is achieved by calculating the strengths of association between the cues and the outcomes. This process is based on the learning principle formulated by Rescorla and Wagner (1972). This principle involves a constant updating of cue-outcome associations in such a way that the strength of association is increased if both the cue and the outcome are present, whereas it is decreased if the cue is present without the outcome. In the absence of the cue the association strength remains unchanged. Since this process implies a constant updating, the system never reaches the equilibrium state.
Therefore, in order to efficiently estimate cue-outcome association strengths, Danks (2003) proposed a set of equations which describe the equilibrium state of the system by introducing the assumption that in the stable state, all changes in association strengths equal zero. This reduced the estimation of association strengths to finding values which optimize conditional probabilities of the outcomes given the probabilities of the set of cues. Upon this estimation, the activation of the given outcome is calculated as the sum of association strengths between this outcome and all the cues that are present in the set of the input cues (e.g., all the bigrams that constitute a word). Finally, simulated reaction time is modelled as the inverse of this activation.
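The learning and activation steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the NDR implementation: it applies the Rescorla-Wagner update iteratively instead of solving the equilibrium equations, and the learning rate, the number of passes over the data, the four toy word forms, and all function names are assumptions made for the example.

```python
def bigram_cues(word):
    """Split a word into letter bigrams, with '#' marking the word boundaries."""
    padded = "#" + word + "#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def rescorla_wagner(events, outcomes, rate=0.01, lam=1.0, epochs=200):
    """Estimate cue-outcome association strengths from (word, outcome) events.

    For each cue present in an event, the association with an outcome is
    pushed toward lam when that outcome is present and toward 0 when it is
    absent. Associations of cues that are absent from the event do not change.
    rate, lam and epochs are illustrative values, not taken from the paper.
    """
    weights = {}  # (cue, outcome) -> association strength
    for _ in range(epochs):
        for word, present in events:
            cues = set(bigram_cues(word))
            for outcome in outcomes:
                total = sum(weights.get((c, outcome), 0.0) for c in cues)
                target = lam if outcome == present else 0.0
                delta = rate * (target - total)
                for c in cues:
                    weights[(c, outcome)] = weights.get((c, outcome), 0.0) + delta
    return weights


def activation(weights, word, outcome):
    """Sum the association strengths between the word's cues and one outcome."""
    return sum(weights.get((c, outcome), 0.0) for c in set(bigram_cues(word)))


# Toy training set (hypothetical): two inflected forms for each of two lemmas.
events = [("gledam", "gledati"), ("gledaš", "gledati"),
          ("pevam", "pevati"), ("pevaš", "pevati")]
weights = rescorla_wagner(events, outcomes=["gledati", "pevati"])

act_own = activation(weights, "gleda", "gledati")
act_other = activation(weights, "gleda", "pevati")
```

In this toy setting the cues of gleda support the lemma gledati far more strongly than pevati, because the stem bigrams discriminate between the two lemmas while the shared suffix bigrams do not.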
Even though morphology is not specified in the model, several morphological effects that were observed in various tasks were successfully simulated (Baayen et al., 2011; Hendrix, 2016; Hendrix et al., 2016). Although generally similar to the family of parallel distributed computational models regarding the unspecified, but emergent morphology (e.g., Gonnerman, Seidenberg, & Andersen, 2007; Plaut & Gonnerman, 2000; Rumelhart & McClelland, 1986; Seidenberg & Gonnerman, 2000; Seidenberg & Plaut, 2014), this model is simpler, as it does not include hidden layers and does not imply feedback activation. The core process in this model, and the process that makes the learning possible and dynamic, is the process of cue competition, as described by Ramscar, Yarlett, Dye, Denny, and Thorpe (2010).

The current study
In this study we first aimed to demonstrate that the findings of Milin, Filipović Đurđević, and Moscoso del Prado Martín (2009) could be generalized to verbs. In other words, we aimed to show that processing of inflected forms of verbs was influenced by the divergence of the frequency distribution of the inflected forms of the given verb (its inflectional paradigm) from the summed frequency distribution of the inflected forms of all the verbs that are conjugated in the same way (its inflectional class). Secondly, we wanted to show that this effect could be simulated as a consequence of simple learning principles, namely the principles of naive discriminative learning, as suggested by Baayen et al. (2011).

Serbian verbs.
The morphology of Serbian verbs is characterized by several attributes. Depending on the syntactic role that they have in the sentence, verbs appear in different inflected forms. These forms denote several grammatical meanings. There are three persons (1st, 2nd, 3rd) and two numbers (sg, pl) denoting who performs the action. Also, there are three verb moods denoting the speaker's attitude (imperative, potential, and future II), and several verb tenses denoting the time of the action (e.g., present, future, etc.). Tenses are formed either by adding an inflectional suffix to the root (e.g., Ja gleda-m - I look; Ti gleda-š - You look), or by combining certain inflected forms of the verb with the appropriate inflected forms of auxiliary verbs (e.g., Ja sam gleda-o - I looked; Ti si gleda-o - You looked). Additionally, for some verb forms, agreement with the noun is denoted by grammatical number (e.g., Devojčica je gleda-la - The girl looked; Devojčice su gleda-le - The girls looked) and grammatical gender (e.g., Devojčica je gleda-la - The girl looked; Dečak je gleda-o - The boy looked; Dete je gleda-lo - The child looked). There are other verb-related grammatical attributes, such as aspect, which describes whether the verb denotes an action which is completed, one which is still in progress, or one which is repetitive. However, in the Serbian language, aspect is coded in the lexical meaning, not in the inflectional morphology of the verb. The same holds for verb valence and/or transitivity, which describe the potential of the verb to attract other elements within a sentence. Table 1 summarizes the inflected forms of Serbian verbs and their relations to grammatical categories of tense, person, number and gender (where applicable), whereas a simple list of inflected forms is presented in Table 2.
For the purposes of illustration, we presented the inflected forms of verbs which have a root ending with the phoneme /a/ (the remaining verb categories differ to some extent with respect to the distribution of suffixes, but the overall pattern is similar).

Relative entropy.
For each inflected verb form two associated frequencies can be obtained: its surface frequency and the frequency of its inflectional suffix, as illustrated in Table 2. The frequency of an inflectional suffix is the cumulative frequency of the given inflected form (e.g., the inflected form ending with -mo, as in gleda-mo) of all the verbs that belong to the same class (e.g., gleda-mo, peva-mo, zeva-mo, etc.). The surface frequencies of all inflected forms of the given verb constitute the frequency distribution of the paradigm of that particular verb, whereas the suffix frequencies constitute the frequency distribution of the inflectional class of the given verb. As suggested by Milin, Filipović Đurđević, and Moscoso del Prado Martín (2009), the divergence of the paradigm distribution from the class distribution could be operationalized as relative entropy or the Kullback-Leibler (KL) divergence (D(p||q); Cover & Thomas, 1991; Kullback, 1959):

$$D(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = \sum_{i} \frac{f(w_i)}{f(w)} \log \frac{f(w_i)/f(w)}{f(e_i)/f(e)} \tag{1}$$

The form frequency distribution of the inflectional paradigm is denoted by p(x) in Equation (1), whereas q(x) denotes the form frequency distribution of the inflectional class. The inflected forms within each given distribution are indexed by i. The frequency of the i-th inflected form of the word w is marked by f(w_i), and f(w) stands for the cumulative frequency of all the inflected forms of the word w (stem frequency of w). The frequency of the i-th exponent, that is, the sum of the frequencies of the i-th inflected forms of all the words from the given inflectional class, is denoted by f(e_i). Finally, f(e) stands for the cumulative frequency of all the inflected forms of all the words that belong to the given inflectional class.
Table 2 illustrates how relative entropy for the verb gledati (to look) is calculated based on Equation (1).
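As a complement to the tabulated calculation, the computation of relative entropy from raw frequencies can be sketched in Python. The function name, the base-2 logarithm, and the frequency values in the usage lines are assumptions made for illustration; they are not the values reported in Table 2.

```python
import math


def relative_entropy(paradigm_freqs, class_freqs):
    """Relative entropy D(p||q) between a paradigm and its inflectional class.

    paradigm_freqs: surface frequencies f(w_i) of one verb's inflected forms.
    class_freqs: exponent frequencies f(e_i), listed in the same order.
    The base-2 logarithm (bits) is an assumption of this sketch.
    """
    if len(paradigm_freqs) != len(class_freqs):
        raise ValueError("both distributions must cover the same inflected forms")
    f_w = sum(paradigm_freqs)  # stem frequency of the verb, f(w)
    f_e = sum(class_freqs)     # summed frequency of the whole class, f(e)
    total = 0.0
    for f_wi, f_ei in zip(paradigm_freqs, class_freqs):
        if f_wi == 0:
            continue  # a form with zero probability contributes nothing
        p = f_wi / f_w
        q = f_ei / f_e  # non-zero whenever f_wi > 0, as the class includes the verb
        total += p * math.log2(p / q)
    return total


# Hypothetical frequencies for three inflected forms.
typical = relative_entropy([10, 20, 30], [100, 200, 300])
atypical = relative_entropy([30, 20, 10], [100, 200, 300])
```

A verb whose forms are distributed in the same proportions as its class yields a value of zero, whereas a verb with a reversed distribution yields a positive value.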

Experiment
We presented the inflected verb forms in a visual lexical decision task in order to demonstrate that their processing was affected by relative entropy between the distribution of frequencies of the inflected forms of the particular verb and the distribution of cumulative frequencies of the inflected forms of all the verbs that conjugate in the same manner (calculated based on Equation 1). We predicted that, as observed with nouns (Milin, Filipović Đurđević, & Moscoso del Prado Martín, 2009), Relative entropy would be positively correlated with processing latencies, indicating that verbs with more prototypical distributions of inflected form frequencies are less costly to process than verbs with more deviant frequency distributions.

Method
Participants. Sixty-nine students from the Department of Psychology at the Faculty of Philosophy, University of Novi Sad, took part in the experiment. They were all native speakers of Serbian with normal or corrected-to-normal vision, who reported no reading difficulties. The participants were randomly assigned to one of four groups based on the list of the stimuli with which they were to be presented.
The research was approved by the Ethical Committee of the Department of Psychology at the Faculty of Philosophy, University of Novi Sad. All participants signed the informed consent form.

Materials and design.
We selected 105 verbs of type V and 152 verbs of type VI, as categorized in the grammars of the Serbian language (Stanojčić & Popović, 1992). Type V verbs were those with a stem ending with /a/ (e.g., pevati - to sing) and type VI verbs were those with a stem ending with /i/ (e.g., raditi - to work). The verbs were selected based on the Frequency Dictionary of Contemporary Serbian Language (Kostić, 1999). The selection criterion for a verb to be included in the stimuli list was for it to have as many inflected forms with non-zero frequency as possible.
The verbs were presented in two different inflected forms of the present tense: the 1st person singular (pevam - I sing; radim - I work) and the 2nd person singular (pevaš - You sing; radiš - You work). All of the verbs appeared in both of these forms, but not within the same list: the four lists were formed in such a way that all of the verbs within a list had the same inflectional suffix (as illustrated in Table 3; see Appendix A for the full list of stimuli).
We created pseudo words that mirrored the structure of type V and type VI words, and within each group we created lists of pseudo words that ended with the same inflectional suffixes as the verbs from that particular list. The selection of stimuli and the design were chosen to make our study comparable to that of Milin and colleagues (2009). Along the same lines, each participant was presented with a single list of stimuli, that is, with the verbs and pseudo words that ended with the same inflectional suffix.
The dependent variable was response latency and the crucial predictor was the KL divergence between probability distribution of the inflected forms of the given verb (inflectional paradigm) and the summed distribution of the verb's class (either type V or type VI verbs). The additional independent variables were the order of trial presentation, word length in letters, (log) surface frequency, and (log) lemma frequency.
Procedure. OpenSesame experimental software (Mathôt, Schreij, & Theeuwes, 2012) was used for stimuli presentation and data recording. We presented the participants with a visual lexical decision task. Each trial started with a blank screen that was presented for 500 ms, followed by a fixation point presented for 1000 ms. After that, the target stimulus appeared and remained on the screen until the participant responded, or until the 1500 ms timeout. The responses were given by mouse-button press, with yes being mapped onto the index finger pressing the left mouse button (the dominant action), and no being mapped onto the middle finger pressing the right mouse button. The order of stimuli presentation was randomized separately for each participant. Prior to the main session, the participants were presented with 10 practice trials in order to become familiarized with the task. The stimuli presented in the practice session were not presented in the main session and were not included in the analyses.

Results and discussion
Prior to the analysis, we excluded one participant and 11 verbs with an error rate above 25%. The overall error rate was low (9.4% on average).
The reaction times were log transformed in accordance with the recommendations described in Baayen and Milin (2010). In order to control for collinearity, all of the predictors were centred to zero and divided by their standard deviation (Gelman & Hill, 2007). Collinearity in our dataset was close to high (28.38), as tested using the Kappa coefficient (Belsley, Kuh, & Welsch, 1980; see Appendix B for pair-wise correlations of predictors). Therefore, we analysed the data using the mgcv package (Wood, 2006, 2011) in R statistical software (R Core Team, 2015). We fitted mixed-effect generalized additive models to processing latencies, as this analysis is less sensitive to collinearity in the set of predictors. In order to avoid the possibility of the coefficients being influenced by extreme values, the models were refitted after excluding points with residuals that exceeded the range of ±2.5 standard units. As this procedure did not bring any substantial change in the structure of the results, we reported the coefficients from the refitted model. In order to test whether inclusion of the given parameter in the model was justified by the data, we employed the itsadug package (van Rij, Wieling, Baayen, & van Rijn, 2015) to compare different variants of the model. This was achieved by testing the differences between the values of maximum likelihood (ML) as the measure of goodness of fit (AIC scores were not compared, as our model included a component that accounted for autocorrelation in the data, in which case AIC scores are not reliable, as noted by the authors of the package). In the process of ML score comparisons, we took into account the statistical significance of Chi-square. In addition to that, as suggested by the authors of the package, we also considered the ML difference values (a difference in scores which is less than five is considered too small).
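The preprocessing steps described above (log transformation, standardization of predictors, and exclusion of points with extreme residuals) can be sketched as follows. This is a simplified Python illustration under stated assumptions: the analyses themselves were run in R, the use of the sample standard deviation is assumed, and the function names and the toy residuals are hypothetical. The residuals would come from a fitted model, which is not reproduced here.

```python
import math
import statistics


def log_transform(rts_ms):
    """Natural-log transform of reaction times given in milliseconds."""
    return [math.log(rt) for rt in rts_ms]


def standardize(values):
    """Centre values to zero and divide them by their sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def trim_by_residual(rows, residuals, limit=2.5):
    """Keep only the rows whose standardized residual lies within +/- limit."""
    z_scores = standardize(residuals)
    return [row for row, z in zip(rows, z_scores) if abs(z) <= limit]


# Toy residuals (hypothetical): nineteen unremarkable points and one extreme one.
rows = list(range(20))
residuals = [0.0] * 19 + [10.0]
kept = trim_by_residual(rows, residuals)
```

After trimming, the model would be refitted on the retained rows and the two sets of coefficients compared.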
We tested for the effects of the order of trial presentation, verb type, suffix, word length in letters, (log) surface frequency, (log) lemma frequency, relative entropy, as well as random effects of participants and items. The coefficients from the final model are presented in Table 4. The best model that was justified by the data was the one that included order of trial presentation, word length in letters, the interaction between (log) lemma frequency and relative entropy, as well as by-participant smooths for the order of trial presentation and a by-item random intercept. As expected, an increase in word length was followed by longer processing latencies. Along the same lines, both (log) lemma frequency and relative entropy affected processing in the expected direction: it took less time to process words of higher frequency and it took more time to process words with larger relative entropies. Additionally, there was an interaction between (log) lemma frequency and relative entropy that revealed a joint effect of the two predictors. As can be seen in Figure 1, the fastest responses were observed for words of high frequency and low relative entropy (dark shade in the graph), whereas the response time showed a tendency to become longer as the values of frequency decreased and the values of relative entropy increased (lighter shades in the graph). Model comparisons revealed that this interaction was justified by the data: when taking the number of parameters in the model into account, the fit achieved by this model (as indicated by ML values) was superior both to the model with separate linear effects of (log) lemma frequency and relative entropy (ML difference = 10.427, AIC difference = 14.85, χ² = 10.427, p = .0003), and to the model with separate smooths for the two predictors in question (ML difference = 6.441, AIC difference = 7.95, χ² = 6.441, p = .0003).
Our results showed that verbs with atypical distribution of inflected form frequencies were processed more slowly than verbs with inflected form frequency distribution that resembled the average distribution of the inflectional class to which they belong. Thus, we demonstrated that findings of Milin and colleagues (2009) could be generalized to verbs.

Simulation
In the next step, we tried to demonstrate that such an effect could appear as a consequence of patterns that arise from the mapping of orthography to semantics, without explicit coding of morphological units. We did so by running a simulation in the Naive Discriminative Reader (Baayen et al., 2011), a simple computational model based on the Rescorla-Wagner equations (Rescorla & Wagner, 1972), which we described earlier.

Method
The simulation was performed using the R statistical software (R Core Team, 2015) and the ndl package (Arppe et al., 2015), the details of which are described in Baayen et al. (2011).
In order to build the database used for the model, for every lemma that was included in the study (257 in total: 105 of Type V and 152 of Type VI), we selected its inflected forms which were found in the Frequency Dictionary of Contemporary Serbian Language (Kostić, 1999) and their surface frequencies. In total, our database consisted of 4385 words and their corresponding frequencies. In the first step, every word was split into bigrams, and these were used as cues in the model, whereas the lemmas were treated as outcomes. Association strengths were estimated for every cue-outcome pair. Based on them, we calculated activations for every lemma that was provided by each individual inflected form (i.e., the set of cues that were contained in that form). Finally, we modelled simulated reaction time as the negative value of the association strength. Earlier simulations in the NDR revealed that including some additional information improved the overall fit of simulated data to empirically observed data (Baayen et al., 2011). However, our goal here was not to demonstrate the full power of the model in question, but to demonstrate that the model can account for the effect of relative entropy that was observed empirically. Therefore, we opted for the simplest approach in modelling simulated RT. Finally, we log-transformed simulated RT in order to correct the skewness of the distribution (in order to be able to apply the log-transformation, we added 1.1 to each simulated RT, thus bringing all the values into the range to which the log function can be applied, i.e., above zero).
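The final transformation described above can be written out as a short Python sketch. It assumes that the quantity being negated is the summed activation of the lemma given the cues of the presented form; the function name and the example activation values are hypothetical.

```python
import math


def simulated_log_rt(activation, shift=1.1):
    """Turn a lemma activation into a log-transformed simulated reaction time.

    The simulated RT is the negative of the activation. The constant shift
    (1.1 in the text) moves every value above zero before the logarithm
    is applied.
    """
    shifted = -activation + shift
    if shifted <= 0:
        raise ValueError("shift is too small for this activation value")
    return math.log(shifted)


# Hypothetical activations: a well-supported form and a weakly supported one.
fast = simulated_log_rt(0.9)
slow = simulated_log_rt(0.2)
```

Higher activation therefore maps onto a shorter simulated latency, which is the direction needed for comparison with the empirical reaction times.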

Results and discussion
We first calculated the Pearson correlation coefficient between the empirically observed response latencies to every inflected form of the verb and their simulated counterparts. Our results revealed a small, but significant positive correlation between the two: r = .13, t(512) = 3.035, p = .003. The observed coefficient was smaller than the one reported earlier in similar simulations (Baayen et al., 2011). However, this was not surprising, as we deliberately applied the simplest possible way to model simulated RT, whereas the earlier simulations included many additional parameters to improve the fit.
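For readers who want to see how such a test statistic relates to the coefficient, the following Python sketch computes a Pearson correlation and the corresponding t value, t = r·sqrt((n − 2)/(1 − r²)), with n − 2 degrees of freedom. The function names and the toy data are assumptions made for illustration, and the sketch does not reproduce the analysis of the actual data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)


def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Toy data (hypothetical): perfectly linearly related values.
r_toy = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])

# 512 degrees of freedom correspond to 514 paired observations.
t_rounded = t_for_r(0.13, 514)
```

Because the coefficient in the text is rounded to two decimals, a t value recomputed from it will be close to, but not exactly equal to, the reported one.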
Next, we fitted a linear regression model to (log) simulated reaction times in order to test whether the effects of the predictors that accounted for empirical reaction times could be observed here as well. The coefficients from the model are presented in Table 5. The model fitted to simulated RT revealed effects very similar to those of the model fitted to the empirically observed processing latencies. Firstly, as illustrated in Figure 2, we observed that word length in letters had an inhibitory effect on simulated RT, as was the case with empirical data. Along the same lines, (log) lemma frequency had a facilitatory influence on both observed and simulated RT. Crucially, the effect of relative entropy was significant here as well. As with empirical data, an increase in relative entropy was followed by an increase in simulated RT. The interaction between (log) lemma frequency and relative entropy was also significant in the case of simulated data, as was the case with empirical data. As illustrated in Figure 3 (produced by using the loess and predict functions for the purposes of illustration), the joint effect of (log) lemma frequency and relative entropy was similar to that observed on empirical data and illustrated in Figure 1. We therefore demonstrated that the detrimental effect of the divergence from the prototypical distribution of inflected form frequencies could emerge as a consequence of orthography-to-semantics mapping in a model that has no explicit representation of morphology. We tried to keep the model as simple as possible, which inevitably led to many limitations, and consequently to a low value of the correlation coefficient. Firstly, we introduced the lemma (or stem) as a shortcut to the semantics of the word. This was done for the sake of simplicity, not because we argue that semantic representations are local by nature.
On the contrary, we believe that they are highly distributed, as suggested by numerous research studies (for a review see McRae & Jones, 2013;Meteyard, Rodriguez Cuadrado, Bahrami, & Vigliocco, 2012), and that introducing this into the model would improve the fit to the empirical data. Additionally, when calculating simulated RT, we took into account only the inverse activation strength for the inflected form in question, thus ignoring its orthographic neighbours and other factors that could improve the fit.

General discussion
Our results revealed that processing of inflected verbs was affected by the relation between two frequency distributions to which they simultaneously belong: the distribution of surface frequencies of all inflected forms of the given verb (i.e., its inflectional paradigm) and the distribution of frequencies of the exponents which build those inflected forms (i.e., its inflectional class). The divergence of the paradigm frequency distribution from that of the class, as operationalized through Relative entropy (or the Kullback-Leibler divergence), affected processing in such a way that verbs with divergent paradigm distributions (i.e., high Relative entropy) took more time to process. In other words, verbs with atypical frequency distributions of their inflected forms were more demanding in terms of processing.
This result is a direct continuation of the findings of Milin, Filipović Đurđević, and Moscoso del Prado Martín (2009), who observed the same for nouns. It is also in accordance with the results of several other investigations that demonstrated similar effects in different tasks and different stimulus presentation conditions. For example, Nenadić, Tucker, and Milin (2016) observed the same effect in an auditory lexical decision task. Importantly, these authors presented participants with a list of mixed inflected forms, thus demonstrating that the effect of relative entropy as observed in Milin, Filipović Đurđević, and Moscoso del Prado Martín (2009) and in our experiment was not an artefact of a processing strategy developed by participants being presented with a single inflected form. Baayen et al. (2011) reported a similar effect in a primed visual lexical decision and self-paced reading task. Here, processing of the inflected target noun was affected by the Weighted Relative entropy between the form-frequency distribution of the target noun and its preceding prime. The same authors calculated the Relative entropy for English nouns by looking at the distribution of prepositional phrases. The frequency distribution of the noun occurring with different prepositions was considered as its paradigm distribution, and the general frequency distribution of prepositions occurring with all the nouns was considered as its class. The Relative entropy between the two distributions was again positively correlated with processing time. Recently, Hendrix et al. (2016) documented the effect of Relative entropy on ERP measures as well. They observed that theta range oscillations (with the temporal onset comparable to that of word frequency and mostly prominent in parietal and occipital areas) were of greater amplitude for higher levels of Relative entropy, thus indicating that words with atypical paradigmatic frequency distributions demand additional processing.
Some recent investigations demonstrated that the effect of Relative entropy could be generalized to adjectives as well, although some additional considerations need to be taken into account (Filipović Đurđević & Milin, 2018). For example, in the case of adjectives, the relevant unit of frequency distribution was not the frequency of inflected form, but of the grammatical case. This could be related to the fact that adjectives are usually followed by nouns and that adjectival meanings and syntactic role are fully realized in an adjective-noun phrase. In the field of derivational morphology effects of Relative entropy were documented by Milin, Kuperman et al. (2009).
None of the traditional models of processing of morphologically complex words could account for the observed effect of Relative entropy. Both decomposition models (Taft & Forster, 1975; Taft, 1979, 1994) and full-form models (Manelis & Tarp, 1977; Butterworth, 1983) address the characteristics of the individual inflected form and therefore make no predictions regarding either the effects of the full set of inflected forms of the word or the effects of the relation between the distributions of paradigm and class.
As stated by Hendrix et al. (2016), the effect of Relative entropy between the two distributions could be accommodated within the exemplar-based approach. However, this would place a heavy load on the cognitive system either in terms of storage (a very high number of exemplars due to a large number of rare events in language, as explained by Baayen, Hendrix, & Ramscar, 2013) or in terms of processing (on-line calculation of distances between distributions). A much simpler explanation of the Relative entropy effect arises through the principles of discriminative learning (Arnon & Ramscar, 2012; Ramscar et al., 2010; Ramscar, Dye, & Clein, 2013; Rescorla & Wagner, 1972). These principles are integrated in the Naive Discriminative Reader model (NDR; Baayen et al., 2011), a simple computational model that only searches for the optimal way of mapping the structure of the input layer to that of the output layer by applying the equations of Rescorla and Wagner (1972) and the equilibrium equations of Danks (2003). We therefore ran the simulations using the NDR computational model and observed a significant correlation between the observed and the simulated reaction times. Additionally, the regression analyses revealed a fully comparable pattern of results for observed and simulated reaction times. As with the empirical data, Relative entropy was positively correlated with simulated reaction time. This adds to the body of evidence that the effect of Relative entropy can arise as a consequence of basic learning principles (Baayen et al., 2011; Hendrix et al., 2016).
It is important to note that, in the naive discriminative learning approach, the account of the Relative entropy effect comes for free. We did not feed the model any information that would point to the relation between the two distributions (paradigm and class). The model was trained only on mappings of a word's bigrams to its lemma. In this model, the effect of Relative entropy appears simply as a consequence of the way the distributional properties of the language system "shape the associations between orthographic input cues and semantic outcomes across sequences of words" (Hendrix et al., 2016, p. 3). As the inflected form of a word marks the syntactic role of that word in the sentence, it entails the syntactic context in which it occurs. Therefore, the mapping of different inflected forms to their common semantic outcome implies the mapping of the different contexts in which they occur. Thus, this process is shaped by the distributional properties of the language system and its relation to experience with the world. According to the discriminative learning perspective, the goal of language use is to discriminate, among the many cues in the linguistic input, those that are good at predicting semantic outcomes from those that are not. Here, semantic outcomes denote the experiences with the world that linguistic cues refer to. It could also be stated that discriminative learning operates by minimizing the uncertainty between the set of cues and the set of outcomes (Baayen et al., 2013). This point of view is very similar to the one taken by Information Theory (Shannon, 1948). As Ramscar and Port (2016) state, both Information Theory and Discriminative Learning deal with the reduction of uncertainty about the current state of the system among all the possible states that the system could be in.
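The kind of learning described above can be illustrated with a minimal sketch of Rescorla-Wagner updating, in which letter bigrams serve as cues and lemmas serve as outcomes. This is not the NDR implementation: it uses incremental trial-by-trial updates rather than equilibrium equations, and the toy corpus, the token counts, the learning rate, and all function names are assumptions made for the illustration.

```python
from collections import defaultdict


def bigrams(word):
    """Set of letter bigrams of a word, with '#' marking word boundaries."""
    padded = f"#{word}#"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def train_rescorla_wagner(events, outcomes, rate=0.01, lam=1.0):
    """Update cue-to-outcome association weights one learning event at a time.

    events   : sequence of (inflected_form, lemma) pairs
    outcomes : all lemmas that can be predicted
    rate     : learning rate (assumed equal for present and absent outcomes)
    lam      : maximum association strength for a present outcome
    """
    weights = defaultdict(float)  # (cue, outcome) -> association strength
    for form, lemma in events:
        cues = bigrams(form)
        for outcome in outcomes:
            # Summed support for this outcome from all cues in the input.
            support = sum(weights[(cue, outcome)] for cue in cues)
            target = lam if outcome == lemma else 0.0
            delta = rate * (target - support)  # prediction-error correction
            for cue in cues:
                weights[(cue, outcome)] += delta
    return weights


def activation(weights, form, lemma):
    """Summed weight from the bigrams of a form to a lemma outcome."""
    return sum(weights[(cue, lemma)] for cue in bigrams(form))


# Toy corpus with hypothetical token counts per epoch (illustration only).
epoch = (
    [("pevam", "pevati")] * 3 + [("pevaš", "pevati")] * 1
    + [("radim", "raditi")] * 1 + [("radiš", "raditi")] * 3
)
weights = train_rescorla_wagner(epoch * 500, outcomes=["pevati", "raditi"])

print(activation(weights, "pevam", "pevati"))  # support for the correct lemma
print(activation(weights, "pevam", "raditi"))  # support for the competing lemma
```

Nothing in this sketch refers to paradigms or classes. The weights depend only on how often each bigram co-occurs with each lemma, which is the sense in which distributional effects can emerge without explicit morphological units.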
Although we primarily interpreted our findings in the light of the discriminative learning approach (Baayen et al., 2011), it is important to note that our goal was not to point to this approach as the single possible explanation for the observed pattern. Our goal was to point out that morphological effects can be accounted for without the explicit introduction of morphological units. In that respect, our results would fit well with the broad category of connectionist models, which have already been applied in accounting for various similar effects both inside (Joanisse & Seidenberg, 1999; Plaut & Gonnerman, 2000; Woollams, Joanisse, & Patterson, 2009) and outside of morphology (Cree & McRae, 2003; Harm & Seidenberg, 1999; Seidenberg & McClelland, 1989). For example, Mirković, Seidenberg, and Joanisse (2011) demonstrated that a connectionist network can simulate the complex pattern of Serbian noun inflection. Unlike the performance of a rule-based system, the performance of the connectionist network that was trained to generate inflected forms was sensitive to the same variables that influenced the behaviour of native speakers. One such variable was inflectional neighbourhood size, "the proportion of items in a corpus behaving in the same way across inflectional forms (friends)" (Mirković et al., 2011, p. 663). This variable is highly similar to the concept of inflectional class as we applied it. The basic difference between the two is that the inflectional neighbourhood is empirically established, whereas the inflectional class is adopted from grammars. Although there is much overlap between the words that would belong to each of the two sets, there are some potential differences, as some inflectional classes could be divided into several inflectional neighbourhood sets (e.g., due to slight changes to the stems).
However, Mirković and colleagues (2011) examined the effects of the size of such a set, whereas we looked into the relation between the distribution of the average probabilities of inflected forms across the set of inflectional "friends" and the distribution of the inflected form probabilities of the particular word. Additionally, the two studies did not apply the same task (inflected form generation vs. visual lexical decision). Therefore, although the two studies converge in their general conclusions, further research would be needed to fully compare the concepts they applied.

Conclusion
In this paper we offered two main insights: we generalized a previously reported empirical effect, and we demonstrated that this effect could arise as a consequence of simple learning principles. We reported the results of one experiment and one simulation.
The results of the experiment demonstrated that processing of inflected verbs is affected by the typicality of the frequency distribution of all inflected forms of that particular verb (i.e., by the divergence of the frequency distribution of the inflectional paradigm from the frequency distribution of the inflectional class). This effect has previously been demonstrated for nouns, and our paper generalized the finding to a novel word class, one that does not follow nominal inflection but another type of inflection (conjugation). It therefore revealed that a universal principle is applicable to different manifestations of inflectional morphology.
The results of the simulation revealed that the observed reaction times could be simulated by a simple computational model based on the principles of discriminative learning (NDR; Baayen et al., 2011). It therefore added to the body of evidence that demonstrates how very complex morphological phenomena can naturally occur in a system with no explicit representation of morphology, as a consequence of simple and universal learning principles.
Appendix A
Stimuli presented in the experiment: lemma (illustrated by the infinitive form), lemma frequency, inflected forms presented to the participants (1st and 2nd person singular), their surface frequencies, and relative entropy; all frequencies were obtained from a corpus of 2 million words.