Despite the pivotal role played by
the notion of complexity in theories of language structure and
processing, linguistic complexity remains a rather ill-defined concept.
Much research has been aimed at providing a concise and workable
definition of linguistic complexity. A recurring idea is that the
complexity of a language is best measured by the minimal size of the
grammar required to fully describe it. However, no concrete method for
estimating that optimal grammar length is available, and the
definitions are often reduced to counting the occurrence of certain
phenomena or structures that are chosen as representative of
"complexity" based on some theoretical premises. This results in
considerable vagueness and subjectivity with regard to what is more or
less complex in linguistic terms, and has indeed led to heated
discussions.
To avoid this, some have taken a theory-free approach,
investigating to what extent linguistic utterances can be compressed.
The irreducible information that is left after compression provides a
useful index of the informational content of the signal: its
Algorithmic Information Content
(AIC) in the sense of Kolmogorov (1965) and Chaitin (1987). Indeed,
AIC-based approaches are capable of uncovering important properties of
language
(Moscoso del Prado,
submitted).
Although useful, by itself the
AIC is a measure of randomness rather than actual
complexity. As befits a measure of complexity, fully regular sequences
are highly compressible, and thus have very low values of the AIC.
However, AIC measures attain their maximal value for completely random
sequences,
as these cannot be compressed at all. This does not fit well with an
intuitive idea of complexity: few
would argue that a large random text typed by a roomful of monkeys
armed with typewriters is in any way more complex than a
literary work, or than the utterances produced by an average
four-year-old child.
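The contrast between regular and random sequences can be illustrated with a minimal sketch. It assumes that the output size of a general-purpose compressor (here zlib) serves as a rough upper bound on the AIC, which itself is not computable; the function names and the toy sequences are illustrative assumptions, not part of the work described here.

```python
import random
import zlib


def compressed_size(text: str) -> int:
    """Bytes left after zlib compression: a crude upper bound on the AIC."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_ratio(text: str) -> float:
    """Compressed size divided by raw size (lower means more regular)."""
    raw = text.encode("utf-8")
    return compressed_size(text) / len(raw)


rng = random.Random(0)  # fixed seed so the run is repeatable
alphabet = "abcdefghijklmnopqrstuvwxyz "

# A fully regular sequence and a "monkeys at typewriters" sequence,
# both 10000 characters long (toy data chosen for illustration).
regular = "ab" * 5000
monkeys = "".join(rng.choice(alphabet) for _ in range(10000))

print("regular:", compression_ratio(regular))
print("random: ", compression_ratio(monkeys))
```

The regular sequence shrinks to a tiny fraction of its size, while the random one stays close to the limit set by its alphabet. Under this proxy the random text scores as the most "informative" of the two, which is exactly the mismatch with intuitive complexity described above.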
Recently, I have used M. Gell-Mann's concept of
Effective Complexity to estimate
the actual minimal required size of the grammar that can generate a
language
(Moscoso del Prado,
submitted). The basic idea is to separate
the random aspects of a text from the size of the grammar required to
describe its regularities. I have used the AIC of different manipulated
versions of text to achieve this separation. When the size of the text
is taken to its infinite limit, this method directly provides a strict
measure of the complexity of language. This finding has important
consequences for linguistic and cognitive theories. Analyses of
multiple corpora using this technique reveal that it is not possible to
build a grammar capable of generating all the valid sentences in a
language, and only those. Any grammar will either be unable to generate
some valid sentences, or it will generate sentences that are plainly
ungrammatical. This confirms E. Sapir's famous dictum
"All
grammars
leak".
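The text above does not say which manipulations of the text were used. As one hypothetical illustration of the general idea of comparing the AIC of manipulated versions, the sketch below compares the compressed size of a text with that of a word-shuffled copy. Word shuffling, zlib as the compressor, the toy data, and all function names are assumptions made for this example; this is not the method of the cited work.

```python
import random
import zlib


def normalize(text: str) -> str:
    """Collapse whitespace so both versions differ only in word order."""
    return " ".join(text.split())


def compressed_size(text: str) -> int:
    """Bytes left after zlib compression (a crude stand-in for the AIC)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def shuffled_words(text: str, seed: int = 0) -> str:
    """Same words in random order: word-order regularities are destroyed."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


def order_information(text: str, seed: int = 0) -> int:
    """Hypothetical index: extra bytes needed once word order is scrambled."""
    original = compressed_size(normalize(text))
    scrambled = compressed_size(shuffled_words(text, seed))
    return scrambled - original


# Toy text with strong word-order regularities (assumed example data).
sample = "the dog chased the cat . the cat saw the bird . " * 200
print(order_information(sample))
```

A positive value indicates that the compressor was exploiting word-order regularities that the shuffled copy no longer has. This only gestures at how regular and random components might be pulled apart; a real estimate would need far larger corpora and a more careful choice of manipulations.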
We have also used the concept of Effective Complexity to compare
the
complexity of English with that of Tok Pisin (an English-based creole
that is an official language of Papua New Guinea), to investigate the
controversial claim that "The world's simplest grammars are creole grammars"
(McWhorter, 2001). Using strict measures of complexity on a diachronic
(spanning 25 years) parallel corpus, we did not find any support for
making such a claim. Although Tok Pisin's grammar may indeed be simpler
than that of English, the difference is within the range of variation
that one observes in other, non-creole, languages (a preliminary draft
will
appear here soon). Also, tracking the complexity measure throughout
the temporal span covered by the corpus makes it possible to visualize the
process of "creolization", the dynamical change in the linguistic
complexity of an originally very simple language as its speakers begin
using it as their main language.
Currently, in collaboration with other
linguists, we are using the Effective Complexity concept to investigate
the complexity of different morphological systems across languages. The
richness and structure of morphological paradigms exhibit a large
degree of variation: from the near complete lack of morphological
structures in Mandarin Chinese, to the breathtaking richness of
paradigms found in languages such as Estonian. Although it is commonly
assumed that the "difficulty" or "complexity" of an inflectional system
boils down to the number of distinct forms in which a word may appear,
our intuition is that the informational regularities in the system may
play a more important role than the sheer number of forms.