CONSTRAINTS ON PROBABILITY DISTRIBUTIONS OF GRAMMATICAL FORMS

In this study we investigate the constraints on probability distributions of grammatical forms within morphological paradigms of the Serbian language, where a paradigm is defined as a coherent set of elements with defined criteria for inclusion. Thus, for example, in Serbian all feminine nouns that end with the suffix "a" in their nominative singular form belong to the third declension, the declension being a paradigm. The notion of a paradigm can be extended to other criteria as well; hence, we can also think of noun cases, irrespective of grammatical number and gender, or noun gender, irrespective of case and grammatical number, as paradigms. We took relative entropy as a measure of the homogeneity of the probability distribution within a paradigm. The analysis was performed on 116 morphological paradigms of typical Serbian, and the relative entropy was calculated for each paradigm. The results indicate that for most paradigms the relative entropy values fall within a range of 0.75 to 0.9. The nonhomogeneous distribution of relative entropy values allows for estimating the relative entropy of the morphological system as a whole. This value is 0.69 and can tentatively be taken as an index of stability of the morphological system.

declension or conjugation. In the present study we refer to declensions and conjugations as paradigms, where by paradigm we mean a coherent set of elements with defined criteria for inclusion. Thus, for example, in Serbian all feminine nouns that end with the suffix "a" in their nominative singular form belong to the third declension, feminine nouns that end with a consonant belong to the fourth declension etc. The notion of a paradigm could be extended to other criteria as well; hence, we can also think of noun cases, irrespective of grammatical number and gender, or noun gender, irrespective of case and grammatical number, as paradigms. Thus, for example, in Serbian, the case paradigm includes six elements (cases), the gender paradigm includes three elements (masculine, feminine and neuter), the grammatical number paradigm two elements (singular, plural) etc. In other words, paradigms can include individual grammatical forms as well as elements of various grammatical categories.
Elements of different paradigms appear with unequal probabilities. The nominative case is more frequent than the dative, feminine gender is more frequent than neuter gender, singular is more frequent than plural etc. In table 1 we give an example of the unequal probability distribution of cases, singular and plural, of masculine nouns in Serbian. We may ask whether unequal probability distributions within paradigms are unsystematic or whether some systematic factor guides the observed distributions. Specifically, we ask whether probability distributions within paradigms vary freely, with no obvious regularity, or whether there are constraints that allow probability variation within a limited margin only. In order to answer this question we need to specify the metric in which different probability distributions can be described. The obvious metric derives from Information Theory. Specifically, it is the entropy of a paradigm, defined in Information Theoretic terms. In the forthcoming paragraphs we discuss the notion of entropy in more detail.

THE AMOUNT OF INFORMATION AND ENTROPY
In Information Theory information is defined in terms of probability. The amount of information carried by an event is inversely related to its probability: higher probability is paralleled by a smaller amount of information and vice versa. The unit in which we express the amount of information is the bit (equation 1).

$$h_i = -\log_2 p_i \qquad (1)$$
In equation 1, $h_i$ refers to the amount of information (in bits) carried by an event $i$ within a system, where $p_i$ is the probability (proportion) of that event, transformed by $\log_2$ and multiplied by -1. The obtained value is the amount of information (bits) carried by a particular event within some system. Note that probability is defined in terms of proportion relative to the proportions of other events within a given system. Let us now apply equation 1 to cases singular and plural of Serbian masculine nouns (table 2: probabilities, expressed as proportions, and the amount of information carried by Serbian cases singular and plural masculine nouns). Inspection of table 2 indicates conspicuous variation in the amount of information carried by singular and plural cases of Serbian masculine nouns. The data presented in table 2 can be interpreted as the amount of information carried by each element of a system, where the elements are grammatical forms and the system is the paradigm of masculine nouns. We can now express the amount of information carried by the system (i.e. the paradigm) in terms of entropy, standardly defined as the average amount of information carried by a system (equation 2).

$$H = -\sum_{i} p_i \log_2 p_i \qquad (2)$$
In equation 2, $H$ is the entropy of a system, $p_i$ is the probability of an event $i$ within the system and $-\log_2 p_i$ is the amount of information carried by the event $i$. For each element of the system we calculate the product of its probability and its amount of information. The sum of these products is the entropy of the system, also expressed in bits. Let us now return to the example from table 2. The entropy of the paradigm of masculine nouns that includes all cases singular and plural is 2.976 bits.
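The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the function names are our own, and the four-element distribution is hypothetical (it is not taken from table 2).

```python
import math

def information_bits(p):
    """Amount of information (bits) carried by an event of probability p (equation 1)."""
    return -math.log2(p)

def entropy_bits(probs):
    """Entropy (bits): probability-weighted average information (equation 2)."""
    # Elements with zero probability contribute nothing to the sum.
    return sum(p * information_bits(p) for p in probs if p > 0)

# Hypothetical proportions for a four-element paradigm (NOT the values of table 2).
example = [0.5, 0.25, 0.125, 0.125]
info = [information_bits(p) for p in example]  # information per element
H = entropy_bits(example)                      # entropy of the paradigm
```

With real data, `example` would be replaced by the twelve proportions of table 2, for which the text reports an entropy of 2.976 bits.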
Assume that all singular and plural cases of masculine nouns appear with equal probability, where the probability of each of the twelve forms is 0.083. What will be the entropy of such a paradigm? Applying equation 2, the entropy is 3.585 bits. A system in which all elements appear with equal probabilities is said to be in the state of maximum entropy. In other words, maximum entropy equals $\log_2$ of the number of elements within a system (equation 3).

$$H_{max} = \log_2 N \qquad (3)$$

Here $N$ is the number of elements of the system.
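The step from equation 2 to the maximum-entropy value can be written out explicitly. The derivation below is a short sketch for the twelve equiprobable forms discussed above; the symbol $N$ for the number of elements is our notation.

```latex
% N = 12 equiprobable forms, so p_i = 1/12 for every i
\begin{align*}
H &= -\sum_{i=1}^{12} \frac{1}{12}\,\log_2 \frac{1}{12}
   = -12 \cdot \frac{1}{12}\,\log_2 \frac{1}{12}
   = \log_2 12 \approx 3.585 \text{ bits} \\
\text{and in general, for } p_i = \tfrac{1}{N}: \quad
H &= -\sum_{i=1}^{N} \frac{1}{N}\,\log_2 \frac{1}{N} = \log_2 N
\end{align*}
```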
Equation 3 implies that systems with a greater number of elements will have higher values of maximum entropy. This implies that entropies of two systems with unequal numbers of elements which appear with unequal probabilities will also generally differ. This feature may have considerable consequences for our study because it prevents us from directly comparing paradigms with unequal numbers of elements. If entropies of two paradigms with unequal numbers of elements (e.g. the case paradigm with six elements vs. the gender paradigm with three elements) differ, it remains unclear whether this difference is due to probability distributions within the paradigms or due to the difference in the number of elements. With this in mind it becomes clear that entropy as standardly defined (equation 2) may not be the proper descriptor for comparisons among paradigms.
The Information Theoretic descriptor that is not sensitive to the number of elements of a paradigm is the relative entropy (equation 4).

$$H_r = \frac{H}{H_{max}} \qquad (4)$$

The values of relative entropy range from 1 (maximum entropy) downwards, asymptotically approaching 0. The greater the differences in probabilities of elements within a system, the smaller the value of relative entropy, and vice versa. In other words, the more homogeneous the probability distribution, the greater the value of relative entropy. The value of relative entropy can therefore be taken as an index of homogeneity of the probability distribution of elements within a system or, put differently, of how far the system is from the state of maximum entropy.
The difference between the maximum value of relative entropy (1) and the observed relative entropy is the redundancy (equation 5), which can also be used as an index of homogeneity of the probability distribution of elements within a given system.

$$C = 1 - H_r \qquad (5)$$

Because relative entropy is a ratio, it is not sensitive to the number of elements of a system, which makes it a proper descriptor for comparisons among paradigms with different numbers of elements.
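As a worked illustration of equations 4 and 5, the sketch below applies them to the masculine noun paradigm, using the entropy of 2.976 bits reported above and its twelve elements. The function names are our own; the resulting relative entropy and redundancy are simple arithmetic on the reported figures, not values quoted from the tables.

```python
import math

def relative_entropy(H, n_elements):
    """H_r = H / H_max, with H_max = log2(n_elements) (equations 3 and 4)."""
    return H / math.log2(n_elements)

def redundancy(H, n_elements):
    """C = 1 - H_r (equation 5)."""
    return 1.0 - relative_entropy(H, n_elements)

# Masculine nouns: entropy 2.976 bits over 12 forms (six cases, singular and plural).
hr_masc = relative_entropy(2.976, 12)   # about 0.83
c_masc = redundancy(2.976, 12)          # about 0.17

# Equiprobable paradigms of different size have different entropies
# (log2(3) vs. log2(6) bits) but the same relative entropy of 1.
hr_three = relative_entropy(math.log2(3), 3)
hr_six = relative_entropy(math.log2(6), 6)
```

The last two lines show why the ratio is the suitable descriptor here: the number of elements cancels out, so only the shape of the distribution is compared.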
With relative entropy being used as an index of homogeneity of probability distributions, paradigms can now be directly compared with respect to the probability distributions of their grammatical forms or categories. Our initial question can now be rephrased. We ask whether paradigms vary unsystematically with respect to their relative entropy values or whether there is some preferred range of relative entropy values. In order to answer this question, the probabilities of grammatical forms should be estimated first, which will enable us to calculate the relative entropy values for different paradigms.

PROBABILITY ESTIMATES AND CRITERIA FOR PARADIGM SELECTION
Probabilities of grammatical forms were estimated for Serbian, which is a highly inflected language with, to a great extent, free word order. The probabilities were estimated from samples of daily press and poetry and then averaged for the two registers (Kostić, Đ. 1965a). Each sample consisted of about one million words. The two registers are part of the Corpus of Serbian Language that consists of 11 million words (Kostić, Đ. 2001). The Corpus is diachronic and spans the Serbian language from the 12th century to the contemporary language. The sample of contemporary language consists of about 7 million words and encompasses five registers (daily press, poetry, prose, scientific texts and political texts). Each word in the Corpus is manually annotated for its grammatical status with a system of annotation that distinguishes about 2000 grammatical forms in Serbian. The Corpus was compiled and annotated in the mid fifties and transferred into electronic format in the late nineties.
Probabilities were specified at the level of grammatical form and at the levels of different grammatical categories. The most detailed estimate was at the level of grammatical form (e.g. if a word is a noun, what is the probability for it to be a masculine noun in genitive singular). The coarser specifications were performed at the level of grammatical categories (e.g. what is the probability of a noun to be in the nominative case, or what is the probability of a noun to be in singular etc.).
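The relation between the two levels of specification can be sketched as follows: category-level probabilities are obtained by summing form-level frequencies over the categories that are disregarded. The counts, the key layout `(gender, case, number)` and the function name are hypothetical and serve only as an illustration; they are not the corpus figures.

```python
from collections import Counter

# Hypothetical frequency counts of noun forms, keyed by (gender, case, number).
form_counts = {
    ("masculine", "nominative", "singular"): 40,
    ("masculine", "genitive", "singular"): 30,
    ("feminine", "nominative", "singular"): 20,
    ("feminine", "genitive", "plural"): 10,
}

def category_probabilities(counts, position):
    """Collapse form counts onto one category (0 = gender, 1 = case, 2 = number)
    and normalise the totals to proportions."""
    totals = Counter()
    for key, n in counts.items():
        totals[key[position]] += n
    grand_total = sum(totals.values())
    return {value: n / grand_total for value, n in totals.items()}

# Probability of a noun being in a given case, irrespective of gender and number.
case_probs = category_probabilities(form_counts, 1)
# Probability of a noun being singular or plural, irrespective of gender and case.
number_probs = category_probabilities(form_counts, 2)
```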
As noted earlier, we define a paradigm as a coherent set of elements with specified criteria for inclusion in the set. According to this definition, each declension and conjugation could be treated as a paradigm. Likewise, a set of noun cases, irrespective of grammatical gender and number, can also be considered a paradigm because the members of the set belong to the same grammatical category (case) within a defined class (nouns). This is also true for other categories like, for example, noun gender, verb person, adjectival number etc. In order to get a better understanding of the relative entropy variation of different paradigms, we need an exhaustive list of possible paradigms of Serbian nouns, adjectives and verbs. In addition to the very basic level of grammatical forms, paradigms will be specified at the level of grammatical categories as well.

Inspection of Graph 1 indicates that most values of relative entropy range between 0.75 and 0.9.

PROBABILITY DISTRIBUTIONS WITHIN ADJECTIVAL PARADIGMS
Because adjectives have the same grammatical characteristics as nouns (i.e. case, number and gender), paradigms of adjectives parallel those of nouns. However, unlike nouns, adjectives are characterized by three levels of comparison: positive, comparative and superlative, each level being characterized by the same grammatical properties (case, number and gender). Thus the number of paradigms should be multiplied by three (one set for each level of comparison). Likewise, there is a distinct paradigm for comparison (the probability of an adjective to be in positive, comparative and superlative).
Values of relative entropy for adjectival paradigms are presented in tables 21-80 in the Appendix. The distribution of relative entropy values for adjectival paradigms is presented in Graph 2 (X-axis: value of relative entropy, Y-axis: number of paradigms that fall into a defined value range (0.05)).

PROBABILITY DISTRIBUTIONS WITHIN VERB PARADIGMS
Serbian verbs are characterized by person, grammatical number, tense, aspect and sometimes gender. Aspect is omitted from the current analysis because it does not generate systematic paradigms. Grammatical gender, on the other hand, is marked only for the past tense and the plusquamperfect. Therefore, paradigms with respect to gender can be specified only within those tenses. With this in mind, the following paradigms can be generated for verbs:
1. Verb

Unlike values for noun and adjectival paradigms, relative entropy values for verb paradigms range mainly between 0.4 and 0.9. Some paradigms of verbs are characterized by extremely high redundancy (i.e. low relative entropy). A conspicuously low relative entropy is observed for the plusquamperfect gender (table 109) because the masculine gender is far more frequent than the feminine and neuter genders (Hr = 0.129). Likewise, an extremely low relative entropy has also been observed for verb person in the past tense (table 104) due to the extremely high probability of the third person singular (Hr = 0.296).

RELATIVE ENTROPY DISTRIBUTION IRRESPECTIVE OF WORD TYPE
For the three word types analyzed in this study 116 paradigms were specified and their relative entropies calculated. We may ask what the distribution of relative entropy values is, irrespective of word type. In order to do this, we sum up the number of paradigms with a given entropy value (Graph 4, X-axis: value of relative entropy, Y-axis: number of paradigms with a particular relative entropy value, and Appendix). Inspection of Appendix 3 and Graph 5 indicates that almost two thirds (64%) of the relative entropy values fall within a range of 0.75-0.9, while one third (33%) of the values fall within a range of 0.8-0.85.
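The tabulation described above amounts to binning the relative entropy values and computing the share of paradigms that fall within a given range. The sketch below assumes bins of width 0.05 keyed by their lower edge and an inclusive range; the ten values are hypothetical and stand in for the 116 values listed in the Appendix.

```python
import math
from collections import Counter

def bin_counts(values, width=0.05):
    """Count values per bin of the given width; bins are keyed by their lower edge."""
    counts = Counter()
    for v in values:
        # The small epsilon guards against floating-point error at bin edges.
        index = math.floor(v / width + 1e-9)
        counts[round(index * width, 2)] += 1
    return dict(sorted(counts.items()))

def share_in_range(values, low, high):
    """Proportion of values lying in the closed interval [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)

# Hypothetical relative entropy values for ten paradigms (NOT the Appendix data).
hypothetical_hr = [0.13, 0.30, 0.62, 0.78, 0.81, 0.82, 0.82, 0.84, 0.86, 0.91]
histogram = bin_counts(hypothetical_hr)
share = share_in_range(hypothetical_hr, 0.75, 0.90)
```

Whether the boundaries of the reported ranges were treated as inclusive is not stated in the text, so the closed interval used here is an assumption.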

RESTRICTION TO PRINCIPAL PARADIGMS
The cumulation of relative entropy values within the observed range should be taken with caution. It may not be surprising that the values cumulated within some margins because there are many instances where the same grammatical categories appear in different paradigms. Take, for example, case. Case is a property of nouns, adjectives and some pronouns. Each noun and adjective can take six cases in singular and plural. Nouns, like adjectives, appear in three genders, while adjectives also appear in three levels of comparison. Cases cross gender and number, while for adjectives they also cross three levels of comparison. With this in mind, 9 paradigms can be specified for nouns and 27 for adjectives. Is there a reason to assume that probability distributions of cases in singular should differ from those in plural, or that the distribution of cases for masculine gender should differ from the distribution for feminine or neuter gender? Likewise, is there a reason to assume that case probability distributions differ for adjectives in positive as opposed to adjectives in comparative and superlative? Finally, there is strong reason to assume that probability distributions of case, grammatical number and gender for adjectives should parallel those of nouns because adjectives and nouns have to agree in case, number and gender. These assumptions were empirically confirmed. Additional analyses revealed extremely high correlations among case probability distributions for nouns of different gender and number, as well as between distributions for adjectives and nouns. The same is true for grammatical number and gender. Note that each of those instances was treated as a distinct paradigm with its own relative entropy value. With this in mind, we may ask whether the observed cumulation of relative entropy values may be an artifact of the way paradigms were specified.
This problem is two-fold. On the one hand, it is true that the same category (for example, case) is multiplied in a number of paradigms. On the other hand, it is also true that this multiplication is a fact of language.
So, is there any distinct range of the relative entropy distribution that may not be an artifact of paradigm selection criteria? In order to answer this question, we have to restrict our analysis to probability distributions within paradigms that are unique. Inspection of grammatical properties of nouns reveals that case, grammatical number and gender are their principal grammatical properties; for adjectives those are case, grammatical number, gender and comparison, and for verbs person, grammatical number and tense. In table 3 we give the relative entropy values for those categories, with probability distributions within paradigms being calculated at the coarsest level. For example, for noun cases it means probability distributions irrespective of grammatical number and gender, for noun gender it means probability distributions irrespective of case and grammatical number etc. Inspection of table 3 indicates that values of relative entropy for six out of seven principal grammatical categories across three word types range between 0.7 and 0.87, the exception being adjectival comparison (0.21). This range is not substantially different from the one observed for all paradigms as presented in Graph 4, where two thirds of the relative entropy values range between 0.75 and 0.9. The average value of relative entropy is 0.73.

HOMOGENEITY OF RELATIVE ENTROPY DISTRIBUTIONS
Inspection of the distribution of relative entropy values, as depicted in Graph 4 and Appendix 3, indicates cumulation of relative entropy values. Thus, for example, there are 16 paradigms with a relative entropy value of 0.82, 6 with the value of 0.86, 4 with the value of 0.78 etc. In other words, the number of paradigms with a given relative entropy value is distributed nonhomogeneously. At this point we may ask what the degree of the observed nonhomogeneity is. Again, the metric will be the relative entropy. However, this metric needs to be elaborated in more detail. In Graph 4 relative entropy values are presented on the X-axis and the number of paradigms with a given relative entropy value on the Y-axis. Now we may rephrase the Y-axis in an alternative metric. Instead of asking how many paradigms take a particular entropy value, we may ask what the probability is that a given paradigm has a particular entropy value. To do this, we need to transform the raw values (i.e. the number of paradigms with a particular entropy value) into proportions. By doing this, we can now treat the distribution of all 116 paradigms as a single paradigm with unequal probabilities of events, an event being a given value of relative entropy. (Probabilities of entropy values are given in Appendix 3.) The relative entropy of the whole system is 0.69.
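The second-order calculation described above can be sketched as follows. The sketch makes two assumptions that the text does not spell out: relative entropy values are grouped after rounding to two decimals, and the maximum entropy is taken as log 2 of the number of distinct values actually observed. The function name and the eight input values are hypothetical.

```python
import math
from collections import Counter

def system_relative_entropy(hr_values, decimals=2):
    """Relative entropy of the distribution of paradigm-level relative entropies.

    Events are the distinct (rounded) relative entropy values; the probability
    of an event is the proportion of paradigms taking that value.
    """
    counts = Counter(round(v, decimals) for v in hr_values)
    if len(counts) < 2:
        raise ValueError("at least two distinct values are needed")
    total = sum(counts.values())
    probs = [n / total for n in counts.values()]
    H = -sum(p * math.log2(p) for p in probs)
    # Assumption: N in H_max = log2(N) is the number of distinct observed values.
    H_max = math.log2(len(probs))
    return H / H_max

# Hypothetical input: four paradigms at 0.82, two at 0.86, one each at 0.78 and 0.30.
hypothetical_hr = [0.82] * 4 + [0.86] * 2 + [0.78, 0.30]
hr_system = system_relative_entropy(hypothetical_hr)
```

A different choice of event set, for example all possible two-decimal values between 0 and 1, would give a larger maximum entropy and hence a smaller system-level value, so the result depends on this choice.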

What may be the interpretation of the relative entropy value for the whole system? At this point we take it as a tentative index of the overall stability of a system of inflected morphology. This assumption will be elaborated in more detail in the forthcoming paragraphs.

EXTENSION TO OTHER ASPECTS OF LANGUAGE
The observed range of relative entropy variation was obtained for the exhaustive set of morphological paradigms. We may ask whether this range is specific to morphology, or whether it is more general. Will the same range of entropy variation be observed for paradigms derived from other aspects of language like, for example, probability distributions of individual words? In order to answer this question the following paradigms were investigated:
a. Probability distribution of lemmas
b. Probability distribution of forms of lemmas
c. Probability distribution of lemmas within a given word type
Likewise, we can treat probability distributions of word types and probability distributions of all grammatical forms as distinct paradigms.
These additional paradigms may not be an exhaustive set of possible paradigms for aspects other than morphology, but they may be sufficient to get some preliminary insights about the generality of the range observed with morphological paradigms. Probabilities for the additional paradigms, like those for morphological paradigms, were derived from the "Frequency Dictionary of Contemporary Serbian Language" (Kostić, Đ., 1999). The Dictionary was compiled from samples of daily press and poetry, each containing about one million words. The Dictionary contains 65 000 lemmas and their frequencies and about 240 000 forms of lemmas (and their frequencies). The obtained relative entropy values for the respective paradigms are given in table 4.
Inspection of table 4 indicates that relative entropy values of paradigms for other aspects of language also range mainly between 0.7 and 0.9. This outcome suggests that the observed range is not morphology specific, but more general, and may apply to all aspects of language. The results also indicate that relative entropy variation is more generic than initially assumed. This, in turn, suggests some global constraints imposed on permissible probability variations of language events.

GENERAL DISCUSSION
In the present study we investigated probability distributions for various paradigms of Serbian inflected morphology. Specifically, we investigated probability distributions within noun, adjective and verb paradigms. The aim of the study was to evaluate whether there is some systematic patterning in probability distributions of grammatical forms belonging to different paradigms, or whether this distribution is nonsystematic and arbitrary. The first step was to estimate the probabilities of grammatical forms for different paradigms. This estimate was derived from a study in which probabilities of all grammatical forms in Serbian were given at the level of individual grammatical forms and at the various levels of grammatical categories (Kostić, 1965a). The next step was to specify the metric in which to express probability distributions in a way that allows for comparison among paradigms with different numbers of elements. This metric proved to be the relative entropy, which can be treated as a measure of homogeneity of probabilities within a given paradigm or, alternatively, of how far a paradigm is from the state of maximum entropy.
It is reasonable to make an a priori assumption that probabilities of grammatical forms may not be homogeneous, nor should they, by the same token, be close to the state of maximum entropy. It is less clear, however, whether probability distributions of grammatical forms should vary freely, or whether their variation should be constrained within relatively narrow margins. Put differently, there is no obvious reason why it would not be possible for a number of paradigms to take any value of relative entropy between, say, 0.1 and 0.9. Assume some paradigms having one grammatical form with extremely high probability, while other forms appear with small probabilities. In such a case the paradigm is characterized by high redundancy, i.e. low relative entropy. Likewise, it is conceivable to have a paradigm with minimal probability differences among grammatical forms, in which case the paradigm would have high relative entropy and low redundancy. If there are no constraints on probability distributions within paradigms, theoretically it could be expected that paradigms distribute homogeneously with respect to their relative entropy values.
Although theoretically plausible, the above assumptions were not empirically confirmed. Two thirds of relative entropy values for 116 paradigms ranged between 0.75 and 0.9. The same is true for the reduced number of paradigms, where six out of seven paradigms ranged between 0.7 and 0.9. The observed concentration suggests that probability distributions of grammatical forms within paradigms are not arbitrary, nor is, by the same token, the variation of relative entropy values among paradigms. Most probability distributions within paradigms are restricted to a relatively narrow margin of +/- 0.1 of relative entropy values, ranging between 0.7 and 0.9. This conclusion, as noted earlier, has to be taken with caution. The caveat is related to the criterion of paradigm selection. Namely, some grammatical categories like, for example, case and grammatical number, cross many paradigms with minimal variation in probability of grammatical forms or subordinate grammatical categories. Consequently, cumulation of relative entropy values within a narrow margin may not be surprising. In order to eliminate this we introduced a somewhat restrictive criterion where only principal grammatical categories, specified at the most molar level (e.g. case of a noun, irrespective of gender and grammatical number), are taken into consideration. The obtained range of variation did not differ substantially from the one observed for 116 paradigms. Such an outcome suggests that the observed range may indicate the preferred relative entropy values of paradigms or, put differently, the preferred probability distributions of grammatical forms within paradigms.
Before we address the problem of "preference" for a given range of relative entropy variation, we need to elaborate the statement that the observed cumulation may be an artifact of paradigm selection. Although the subsequent analyses on the restricted number of paradigms indicated similar ranges of entropy variation, it is true that the observed cumulation obtained for 116 paradigms is also due to the fact that the same categories cross a number of paradigms. At this point we have to distinguish two aspects of the problem: one related to the range of relative entropy variation, and the other related to the cumulation of relative entropy values within a given range. The first aspect is not an artifact of paradigm selection because similar ranges have been observed for both selection criteria. The latter aspect, however, requires additional elaboration. We defined a paradigm as a coherent set of elements with specified criteria for inclusion. Applied to inflectional morphology this allows for creation of a finite number of paradigms with clear selection criteria: whatever can be specified in terms of a coherent class with n grammatical elements can be treated as a morphological paradigm. This specification implies a continuum ranging from the molecular level of a given set of grammatical forms (e.g. nominative singular, masculine nouns) up to the molar level of grammatical categories (e.g. cases, irrespective of word type, grammatical number and gender). Once such a criterion has been adopted, we inevitably face the multiplication of the same set of elements across a number of paradigms. The question is whether this is a matter of description or a fact of language.
It is a fact of language that case is a property of three distinct word types (nouns, adjectives and some pronouns). Likewise, it is a fact of language that grammatical number is a property of nouns, adjectives, verbs and pronouns. The same is true to some extent for grammatical gender. Criteria for paradigm selection should map onto these properties within and across word types if their purpose is to encompass exhaustive sets of possible paradigms at different grain sizes of description. If so, multiplication of paradigms with equivalent probability distributions and, by the same token, cumulation of paradigms with the same relative entropy value is a natural consequence of language properties. With this in mind it can be stated that the observed cumulation is a matter of language rather than of the criteria by which paradigms were specified. This does not rule out the possibility that the probabilities of particular relative entropy values are inflated to some extent.
The fact that the distribution of relative entropy values is not homogeneous and that there is a preferred margin within which those values cumulate may suggest that the distribution of relative entropies is constrained in a way that conserves some overall divergence of relative entropy values. This divergence can also be expressed in terms of relative entropy, this time as the relative entropy of the whole system. As demonstrated, the value of this entropy is around 0.7 and can tentatively be taken as an index of the system's overall stability. Any conspicuous change of relative entropy for a given paradigm (or paradigms) may change the relative entropy of the morphological system and push it to a state of potential instability. If such a change happens over a wider time span, compensatory changes of probabilities have to happen in order to conserve the overall probability divergence (or coherence) expressed in terms of the overall relative entropy value.
Let us elaborate this assumption in more detail. Probability fluctuation is an inherent property of language and can be observed in diachronic studies of different linguistic phenomena. Take, for example, a hypothetical decrease of the probability of one case in the case paradigm over some period of time. Such a decrease will cause a decrease of relative entropy across a number of paradigms. Consequently, this will cause the relative entropy of the system to decrease as well, thus shifting the system towards a state of instability. In order to conserve the stable state (i.e. the relative entropy value of the system) other changes have to occur as well (not necessarily in the case system) which will compensate for the case probability decrease. In other words, any change in probability within one paradigm has to be compensated by some proportional probability change (or changes) in another paradigm (or paradigms). This assumption, however, needs empirical evaluation in diachronic studies of language.

APPENDIX 1 Probability distribution within paradigms of nouns, verbs and adjectives

Inspection of Graph 2 indicates that most values of relative entropy cumulate within a range of 0.75 to 0.95.

Distribution of relative entropy values for verb paradigms
person, irrespective of grammatical number, tense and gender
2. Grammatical number, irrespective of person, tense and gender
3. Tense, irrespective of person, grammatical number and gender
4. Present tense, verb person, irrespective of number
5. Present tense, grammatical number, irrespective of