The complexity of linguistic structures



Despite the pivotal role played by the notion of complexity in theories of language structure and processing, linguistic complexity remains a rather ill-defined concept. Much research has been aimed at providing a concise and workable definition of linguistic complexity. A recurring idea is that the complexity of a language is best measured by the minimal size of the grammar required to fully describe it. However, no concrete method for estimating that optimal grammar length is available, and the definitions are often reduced to counting occurrences of certain phenomena or structures that are chosen as representative of "complexity" on the basis of some theoretical premises. This results in considerable vagueness and subjectivity with regard to what is more or less complex in linguistic terms, and has indeed led to heated discussions.

To avoid this, some researchers have taken a theory-free approach, investigating to what extent linguistic utterances can be compressed. The irreducible information that is left after compression provides a useful index of the informational content of the signal: its Algorithmic Information Content (AIC) in the sense of Kolmogorov (1965) and Chaitin (1987). Indeed, AIC-based approaches are capable of uncovering important properties of language (Moscoso del Prado, submitted). Although useful, the AIC by itself is a measure of randomness rather than of actual complexity. As befits a measure of complexity, fully regular sequences are highly compressible, and thus have very low AIC values. However, the AIC attains its maximal value for completely random sequences, as these cannot be compressed at all. This does not fit well with an intuitive idea of complexity: few would argue that a large random text typed by a roomful of monkeys armed with typewriters is in any way more complex than a literary work, or than the utterances produced by an average four-year-old child.
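The contrast between regular and random sequences can be illustrated with a minimal sketch. It assumes that the output size of a general-purpose compressor (here zlib) is an acceptable stand-in for the AIC; strictly speaking it is only a crude upper bound, since the true AIC is not computable. The function names and the test sequences are hypothetical, chosen for illustration only.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (a crude upper bound on the AIC)."""
    return len(zlib.compress(data, 9))


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size; values near 1 mean incompressible."""
    return compressed_size(data) / len(data)


# A fully regular sequence: the same two bytes repeated 5000 times.
regular = b"ab" * 5000

# A pseudo-random sequence of the same length (seeded so the run is repeatable).
rng = random.Random(0)
noise = bytes(rng.getrandbits(8) for _ in range(10000))

print("regular:", compression_ratio(regular))
print("random: ", compression_ratio(noise))
```

The regular sequence shrinks to a small fraction of its length, while the pseudo-random one barely compresses at all, which is exactly why the AIC alone ranks noise above structured text.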

Recently, I have used M. Gell-Mann's concept of Effective Complexity to estimate the actual minimal size of the grammar required to generate a language (Moscoso del Prado, submitted). The basic idea is to separate the random aspects of a text from the size of the grammar required to describe its regularities. I have used the AIC of different manipulated versions of a text to achieve this separation. When the size of the text is taken to its infinite limit, this method directly provides a strict measure of the complexity of language. This finding has important consequences for linguistic and cognitive theories. Analyses of multiple corpora using this technique reveal that it is not possible to build a grammar capable of generating all the valid sentences in a language, and only those. Any grammar will either be unable to generate some valid sentences, or it will generate sentences that are plainly ungrammatical. This confirms E. Sapir's famous dictum "All grammars leak".
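The general idea of comparing compressed sizes across manipulated versions of a text can be sketched as follows. This is not the procedure used in the work cited above, whose manipulations are not described here. The sketch assumes one hypothetical manipulation (shuffling word order), uses zlib as a rough proxy for the AIC, and runs on a made-up toy sample. It does not estimate Effective Complexity; it only shows that destroying regularities changes how far a text can be compressed.

```python
import random
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 text (rough proxy for the AIC)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def shuffle_words(text: str, seed: int = 0) -> str:
    """Hypothetical manipulation: keep the same words but destroy their order."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


# Toy sample (an assumption for illustration, not corpus data).
sample = "the dog chased the cat . the cat saw the bird . " * 200

original_size = compressed_size(sample)
shuffled_size = compressed_size(shuffle_words(sample))

# Extra bytes needed once word-order regularities are gone.
order_information = shuffled_size - original_size

print("original:", original_size)
print("shuffled:", shuffled_size)
print("difference attributable to word order:", order_information)
```

The shuffled version contains the same words with the same frequencies, so any increase in compressed size reflects regularities carried by word order rather than by the vocabulary.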

We have also used the concept of Effective Complexity to compare the complexity of English with that of Tok Pisin (an English-based creole that is an official language of Papua New Guinea), to investigate the controversial claim that "The world's simplest grammars are creole grammars" (McWhorter, 2001). Using strict measures of complexity on a diachronic parallel corpus spanning 25 years, we did not find any support for this claim. Although Tok Pisin's grammar may indeed be simpler than that of English, the difference is within the range of variation that one observes among other, non-creole languages (a preliminary draft will appear here soon). In addition, tracking the complexity measure throughout the temporal span covered by the corpus makes it possible to visualize the process of "creolization": the dynamic change in the linguistic complexity of an originally very simple language as its speakers begin using it as their main language.

Currently, in collaboration with other linguists, we are using the Effective Complexity concept to investigate the complexity of different morphological systems across languages. The richness and structure of morphological paradigms exhibit a large degree of variation: from the near-complete lack of morphological structure in Mandarin Chinese to the breathtaking richness of the paradigms found in languages such as Estonian. Although it is commonly assumed that the "difficulty" or "complexity" of an inflectional system boils down to the number of distinct forms in which a word may appear, our intuition is that the informational regularities in the system may play a more important role than the sheer number of forms.