Enhancing the potential of POS-tagged corpora with positional tagsets and ambiguities


Adam Przepiórkowski

Institute of Computer Science, Polish Academy of Sciences


The aim of this talk is to argue that corpora need more structured part-of-speech (POS) annotation than is usually adopted, and to present a concordancer that is able to take advantage of such structured annotation.


For morphologically rich languages, such as the Slavic languages, with a substantial number of morphosyntactic categories and their possible values (e.g., for Polish, 7 cases, at least 5 genders, etc.), atomic tagsets of the kind assumed for English, where all morphosyntactic information is lumped into atomic symbols, do not seem viable. Instead, tags should have internal structure, separately representing the POS and the values of the morphosyntactic categories specific to this POS. Such positional tagsets are used, e.g., within the Czech National Corpus and were proposed for a number of languages within the Multext-East project.
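To make the contrast with atomic tags concrete, here is a minimal Python sketch of a positional tag. The colon-separated format, the category names and their order, and the helper `parse_tag` are all assumptions made for illustration; they do not reproduce the tagset of the Czech National Corpus, of Multext-East, or of any other project.

```python
# Hypothetical positional tag layout: the POS comes first, followed by
# category values in a fixed order that depends on the POS.
# Category names and their order are illustrative assumptions only.
CATEGORIES = {
    "noun": ("number", "case", "gender"),
    "verb": ("number", "person", "aspect"),
}


def parse_tag(tag: str) -> dict:
    """Split a colon-separated tag into its POS and named category values."""
    pos, *values = tag.split(":")
    names = CATEGORIES[pos]
    if len(values) != len(names):
        raise ValueError(
            f"expected {len(names)} values for {pos!r}, got {len(values)}"
        )
    return {"pos": pos, **dict(zip(names, values))}


tag = parse_tag("noun:sg:gen:f")
# With internal structure, a single category can be tested directly,
# without enumerating every atomic tag that happens to be genitive.
is_genitive = tag["case"] == "gen"
```

With an atomic tagset, the same test would require listing every tag symbol that encodes the genitive, which grows quickly with the number of categories and values.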


Less obviously, some morphosyntactic ambiguities cannot be resolved in a non-arbitrary way. In Polish, for example, this happens in a sentence containing both i) a verb that optionally subcategorises, without any change in meaning, for either a genitive or an accusative noun phrase, cf. (1) below, and ii) a noun syncretic between the genitive and the accusative, as in (2):



(1) a. Pożądał ją.
       desired.MASC her.ACC
       'He desired her.'

    b. Pożądał jej.
       desired.MASC her.GEN

(2) Pożądała go.
    desired.FEM him.ACC/GEN
    'She desired him.'
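One way to keep such an ambiguity in the annotation, rather than resolving it arbitrarily, is to store a set of interpretations per token instead of a single tag. The Python sketch below illustrates the idea under our own assumptions: the `Token` class, the `interp` and `matches` helpers, and the labels `pron`, `acc` and `gen` are hypothetical and do not describe how any existing concordancer stores or queries its data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    orth: str
    # Each interpretation is a tuple of (category, value) pairs.
    interpretations: frozenset


def interp(**values):
    """Build one hashable interpretation from category=value pairs."""
    return tuple(sorted(values.items()))


def matches(token, *, require_all=False, **constraints):
    """Check the token's interpretations against category constraints.

    By default a single matching interpretation is enough; with
    require_all=True every interpretation must satisfy the constraints.
    """
    wanted = set(constraints.items())
    results = [wanted <= set(i) for i in token.interpretations]
    return all(results) if require_all else any(results)


# "go" keeps both readings, since neither can be chosen non-arbitrarily.
go = Token("go", frozenset({interp(pos="pron", case="acc"),
                            interp(pos="pron", case="gen")}))
# "ją" is unambiguously accusative.
ja = Token("ją", frozenset({interp(pos="pron", case="acc")}))

possibly_genitive = matches(go, case="gen")
certainly_genitive = matches(go, case="gen", require_all=True)
```

The two query modes separate tokens that may be genitive from tokens that must be genitive, a distinction that is lost once a single tag is forced on each token.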


To the best of our knowledge, currently available concordancers cannot take full advantage of positional tagsets and of information about morphosyntactic ambiguities. In this talk, we present (and demonstrate, if equipment and time constraints allow) POLIQARP (POLish Indexing Query and Retrieval Processor), a concordancer developed at the Institute of Computer Science, Polish Academy of Sciences, which understands such structured tagsets with ambiguities.

