On unsupervised grammar induction from untagged corpora


Damir Čavar

(& Giancarlo Schrementi, Joshua Herring, Toshikazu Ikuta)

Indiana University


We will present work on corpora from Indo-European, Asian, Native American, and African languages on the unsupervised learning of structural properties of these languages, i.e. phonotactic regularities, morphology, and syntax. We focus on unsupervised algorithms that use minimal prior knowledge to generate a highly accurate morphological structure of the languages represented in the corpus. We assume no prior knowledge except that the language is represented as a sequence drawn from a finite set of symbols, whether in phonemic/phonetic or orthographic transcription. The algorithms are severely restricted with respect to computational resources such as time and memory. We can show that the resulting morphological structure can be used by the same type of algorithms to induce lexical properties of words as well as syntactic structure. The learning strategies are based on Minimum Description Length, Mutual Information, Clustering, and Alignment Based Learning. We will present our work as a computational approach to the automatic acquisition of lexical and grammatical information that is related to general linguistic, psycholinguistic, and cognitive aspects of language acquisition. We will show that basic linguistic properties can be induced from plain (non-annotated) corpora using a truly unsupervised bootstrapping approach, including the lexical subsets of open and closed class words, the noun-verb distinction, the headedness of lexical categories, and basic syntactic structure. This work is relevant not only for corpus linguistic work, e.g. large-scale automatic annotation of corpora, but also for the discussion of usage- versus grammar-oriented theories of the language faculty and for questions of language learning and learnability in general.
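
As a purely illustrative sketch of one of the strategies named above, Mutual Information, the following Python fragment computes pointwise mutual information for adjacent symbols in a list of untagged word forms. It is not the authors' implementation: the function name bigram_pmi, the toy word list, and the choice of character bigrams as the unit of analysis are all assumptions made for this example. Symbol pairs with high scores co-occur more often than chance would predict, which is the kind of signal an unsupervised learner might use when looking for recurring sub-word units.

```python
import math
from collections import Counter


def bigram_pmi(words):
    """Pointwise mutual information (in bits) for adjacent symbol pairs.

    `words` is any iterable of strings; each string is treated as a
    sequence of symbols (hypothetical setup, for illustration only).
    """
    unigrams = Counter()
    bigrams = Counter()
    for w in words:
        unigrams.update(w)              # count single symbols
        bigrams.update(zip(w, w[1:]))   # count adjacent symbol pairs
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        pmi[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return pmi


# Toy input, assumed for the example: four untagged word forms.
scores = bigram_pmi(["walked", "talked", "walks", "talks"])
```

A positive score for a pair such as ("l", "k") in this toy input only says that the two symbols occur together more often than their individual frequencies predict; turning such scores into morpheme boundaries or word classes would need further steps that the abstract does not spell out.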

