On a root-based concept of an electronic lexicon for Polish
Joanna Rabiega-Wiśniewska (Warsaw University)
One of the problems in Text Technology research is how to create a lexicon that can be easily adapted to any application. At present, there are several methods of carrying out automatic inflectional analysis and lexical entry description for Polish, and several programs have been built on the basis of these methods (Hajnicz and Kupść 2001). It has turned out that the traditional approach to morphological analysis, according to which word-forms are segmented into two parts, the stem and the ending, has to be abandoned (Rabiega-Wiśniewska and Rudolf 2003). This particular lexicon design means that any morphological analysis delivers only a lemma for a given word-form, together with the grammatical characteristics assigned to it. The question is what information should be provided for each entry so that the lexicon can serve a wider range of purposes.
The paper aims to present a root-based concept of an electronic lexicon for Polish. It will be demonstrated that a lexicon entry can be described by means of the principles of morphological construction. The root-based lexicon has already been exploited in the formal model of Polish nominal derivation (Rabiega-Wiśniewska 2005), since it gives separate access to all stems of a given lemma and to all of its endings. It can also be adapted for automatic inflectional analysis and/or synthesis.
To prepare a lexicon of roots we have used the grammatical dictionary of the AMOR analyser (Rabiega-Wiśniewska and Rudolf 2003). As a result, we have obtained entries that contain the complete inflectional characteristics and details of internal alternations. The two-level morphology model by Koskenniemi (1983) served as the theoretical background for the study. The conversion of the AMOR data into the lexicon of roots will be presented in the following steps: a. synthesis of all word-forms of a given lemma, b. segmentation of the word-forms into two parts, the stem and the ending, c. collection of all stems of a given lemma, d. description of internal alternations, and e. the final choice of the root. Each lexicon entry is represented by a surface lemma, a root and two sets. The first set contains textual representations of all internal alternations within the lemma, together with grammatical codes describing their distribution. The second set comprises the full package of endings. The example below shows the entry for suseł 'gopher':
suseł, suSEŁ
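Steps b. to e. of the conversion can be illustrated with a minimal Python sketch. It is not the AMOR-based procedure itself: the word-forms and endings are a hand-written fragment of the paradigm of suseł, the grammatical tags (`sg:nom` etc.) are hypothetical placeholders rather than AMOR codes, and the reading of the root notation (upper case marks the part of the lemma stem that alternates across the paradigm) is an assumption made for illustration only.

```python
from collections import defaultdict

# Hypothetical input standing in for step a. (synthesis):
# (word-form, ending, grammatical tag) for part of the paradigm of "suseł".
FORMS = [
    ("suseł", "", "sg:nom"),
    ("susła", "a", "sg:gen"),
    ("susłowi", "owi", "sg:dat"),
    ("susłem", "em", "sg:inst"),
    ("suśle", "e", "sg:loc"),
    ("susły", "y", "pl:nom"),
]


def common_prefix(strings):
    """Return the longest prefix shared by all strings."""
    prefix = strings[0]
    for s in strings[1:]:
        while not s.startswith(prefix):
            prefix = prefix[:-1]
    return prefix


def build_entry(lemma, forms):
    stems = defaultdict(list)   # stem -> tags of the forms that use it
    endings = set()
    for form, ending, tag in forms:
        # Step b.: split the word-form into stem and ending.
        stem = form[:len(form) - len(ending)]
        assert stem + ending == form
        # Step c.: collect all stems of the lemma.
        stems[stem].append(tag)
        endings.add(ending)

    # Step d.: the part after the shared prefix is where stems alternate.
    invariant = common_prefix(list(stems))
    alternations = {stem[len(invariant):]: tags for stem, tags in stems.items()}

    # Step e. (assumed convention): root = invariant part in lower case
    # plus the alternating part of the lemma stem in upper case.
    lemma_stem = next(f[:len(f) - len(e)] for f, e, _ in forms if f == lemma)
    root = invariant + lemma_stem[len(invariant):].upper()

    return {
        "lemma": lemma,
        "root": root,
        "alternations": alternations,
        "endings": sorted(endings),
    }


entry = build_entry("suseł", FORMS)
print(entry["lemma"], entry["root"])  # suseł suSEŁ
```

Under these assumptions the sketch yields the two sets described above: the alternating stem segments with the tags of the forms in which they occur, and the list of endings. A shared-prefix comparison is a deliberate simplification; it would not handle lemmas whose alternations occur stem-initially or in several separate positions.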
The root-based lexicon may form part of an automatic inflectional as well as a derivational device. The morphological entry description proposed in this presentation seems to open a new perspective on the creation of linguistic resources for Polish.
Hajnicz, E., Kupść, A. (2001). Przegląd analizatorów morfologicznych dla języka polskiego (A survey of morphological analysers for Polish language). IPI PAN # 937 (pp. 60). Warsaw.
Koskenniemi, K. (1983). Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Publication No. 11, University of Helsinki. Helsinki.
Rabiega-Wiśniewska, J., Rudolf, M. (2003). Towards a bi-modular automatic analyzer of large Polish corpora. In Kosta, R., Błaszczak, J., Frasek, J., Geist, L., Żygis, M. (Eds.), Investigations into Formal Slavic Linguistics. Contributions of the Fourth European Conference on Formal Description of Slavic Languages - FDSL IV, held at Potsdam University, November 28-30th, 2001 (pp. 363-372). Frankfurt am Main: Peter Lang GmbH.
Rabiega-Wiśniewska, J. (2005). A formal model of Polish nominal derivation. In Z. Vetulani (Ed.), Human Language Technologies as a Challenge for Computer Science and Linguistics, 2nd Language & Technology Conference, April 21-23, 2005, Poznań, Poland, Proceedings (pp. 323-327). Poznań: Wydawnictwo Poznańskie Sp. z o.o.