A finite-state approach to super-long German nouns

Marcin Junczys-Dowmunt (Adam Mickiewicz University, Poznań)

I will summarize the results of my Master's thesis and the main points of a lecture I gave at the seminar of the Department of Applied Logic at Adam Mickiewicz University in Poznań. I will give a short overview of the structure of German compounds and of recent research on the role of the so-called interfixes in the segmentation of super-long compounds.

After a short introduction to finite-state automata and transducers, I will describe the construction of a transducer for naïve compound segmentation. This transducer is obtained by taking the Kleene closure of a deterministic finite-state dictionary of possible compound segments, which are either basic lexicalized nouns or interfixes. I will show that the resulting transducer, which loses its determinism through the Kleene closure, cannot be determinized; this is a direct consequence of the structure of German compounds and their theoretically unrestricted length.
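The ambiguity introduced by the Kleene closure can be illustrated without any finite-state toolkit. The following is only a minimal sketch in Python, not the transducer construction itself: it enumerates every way of reading a word as a sequence of dictionary entries, which is what a nondeterministic traversal of the closed dictionary would accept. The function name `segment`, the tags "N" and "I", and the tiny word lists are assumptions made for this illustration.

```python
def segment(word, nouns, interfixes):
    """Return every split of `word` into known nouns ("N") and interfixes ("I").

    This mimics a nondeterministic walk through the Kleene closure of a
    dictionary: after each recognized entry we start again at the
    dictionary's initial state.
    """
    results = []

    def walk(pos, path):
        if pos == len(word):
            results.append(path)
            return
        for end in range(pos + 1, len(word) + 1):
            piece = word[pos:end]
            if piece in nouns:
                walk(end, path + [(piece, "N")])
            if piece in interfixes:
                walk(end, path + [(piece, "I")])

    walk(0, [])
    return results


# Hypothetical toy dictionary, chosen only to provoke an ambiguity.
NOUNS = {"stau", "staub", "becken", "ecken"}
INTERFIXES = {"s", "en"}

readings = segment("staubecken", NOUNS, INTERFIXES)
for reading in readings:
    print(reading)
```

With this toy dictionary the word "staubecken" receives two readings, Stau + Becken and Staub + Ecken. Each further ambiguous position in a longer compound can multiply the number of readings again, which gives an informal picture of why no bounded amount of lookahead resolves the choice.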

In a second step, the identified segments are analyzed to obtain additional data for the subsequent disambiguation process. I will show how transducers can be used to identify segments that carry relevant suffixes or that have more than a certain number of syllables. Both properties have a strong impact on the distribution of the aforementioned interfixes. Because doubled vowels can occur in the nucleus of a syllable, rules have to be constructed for every possible nucleus.
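The segment analysis can be approximated with regular expressions, which are themselves finite-state devices. The sketch below is an assumption-laden illustration rather than the rule set of the thesis: the nucleus inventory in `NUCLEUS`, the suffix list in `SUFFIXES`, the tag names, and the default threshold are all hypothetical. Syllables are counted by counting nuclei, and multi-letter nuclei are listed before single vowels so that a doubled vowel such as "ee" counts once.

```python
import re

# Hypothetical nucleus inventory: doubled vowels and diphthongs come first
# so that they are matched as ONE nucleus, then single vowels.
NUCLEUS = re.compile(r"aa|ee|oo|ie|ei|ai|au|eu|äu|[aeiouäöüy]")

# Hypothetical list of suffixes considered relevant for interfix choice.
SUFFIXES = ("heit", "keit", "ung", "schaft", "ion", "tät")


def count_syllables(segment):
    """Approximate the syllable count by counting orthographic nuclei."""
    return len(NUCLEUS.findall(segment.lower()))


def annotate(segment, threshold=2):
    """Return a set of tags for one segment.

    "SUF:<suffix>" marks a relevant suffix; "LONG" marks a segment with
    more than `threshold` syllables.
    """
    s = segment.lower()
    tags = set()
    for suffix in SUFFIXES:
        if s.endswith(suffix) and len(s) > len(suffix):
            tags.add("SUF:" + suffix)
    if count_syllables(s) > threshold:
        tags.add("LONG")
    return tags


print(count_syllables("see"), annotate("freiheit"), annotate("universität"))
```

Counting on spelling alone is only an approximation: a letter sequence such as "ei" can also span a morpheme boundary, so a real rule set would need more context than this sketch uses.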

The third step of the analysis uses transducer rules to determine distributional relations between the segments and the interfixes located between them. This is where the final disambiguation takes place. Two types of morphological context are modeled: the local relation between a segment and the following interfix, and the global relations between two or more basic noun segments.
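A local rule of this kind can be sketched as a filter over candidate readings. The example is hypothetical and covers only the local context: it assumes that a non-final noun segment ending in one of the suffixes in `S_SUFFIXES` must be followed by the interfix "s", and that an interfix may only stand between two nouns. The rule inventory, the function names, and the tags "N" and "I" are assumptions for illustration; the global relations between noun segments are not modeled here.

```python
# Hypothetical suffixes that are assumed to require a following "s" interfix.
S_SUFFIXES = ("heit", "keit", "ung", "schaft")


def well_formed(reading):
    """An interfix ("I") must stand between two noun segments ("N")."""
    for i, (_, tag) in enumerate(reading):
        if tag == "I":
            if i == 0 or i == len(reading) - 1:
                return False
            if reading[i - 1][1] != "N" or reading[i + 1][1] != "N":
                return False
    return True


def local_rule(reading):
    """A non-final noun with a listed suffix must be followed by ("s", "I")."""
    for i, (seg, tag) in enumerate(reading[:-1]):
        if tag == "N" and seg.endswith(S_SUFFIXES):
            if reading[i + 1] != ("s", "I"):
                return False
    return True


def disambiguate(readings):
    """Keep only the readings that satisfy every rule."""
    return [r for r in readings if well_formed(r) and local_rule(r)]


# Two hand-written candidate readings of "gesundheitsamt".
candidates = [
    [("gesundheit", "N"), ("s", "I"), ("amt", "N")],
    [("gesundheit", "N"), ("samt", "N")],
]
kept = disambiguate(candidates)
print(kept)
```

Here the reading Gesundheit + Samt is rejected because "gesundheit" is not followed by the interfix "s", so only Gesundheit + s + Amt survives.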

The three steps of segmentation, analysis, and disambiguation are linked in a transducer cascade: the transducers are interpreted as mappings, and the cascade is treated as a composition of mappings from a single input word to a set of tag-annotated segmentations.
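The idea of a cascade as a composition of mappings can be sketched with plain Python functions, where each stage maps one item to a set of items and composition takes the union of all images. This is an illustration under simplifying assumptions, not a transducer implementation: `compose`, the three toy stages, the hand-written `CANDIDATES` table, and the tags "NEEDS_S" and "PLAIN" are all hypothetical.

```python
def compose(*stages):
    """Compose relations; each stage maps one item to a set of items."""
    def run(item):
        current = {item}
        for stage in stages:
            current = {out for x in current for out in stage(x)}
        return current
    return run


# Stage 1 (toy): look up hand-written candidate segmentations.
CANDIDATES = {
    "gesundheitsamt": {("gesundheit", "s", "amt"), ("gesundheit", "samt")},
}

def segment(word):
    return CANDIDATES.get(word, set())


# Stage 2 (toy): annotate every segment with a tag.
def analyze(reading):
    tagged = tuple(
        (seg, "NEEDS_S" if seg.endswith("heit") else "PLAIN") for seg in reading
    )
    return {tagged}


# Stage 3 (toy): map a rejected reading to the empty set.
def disambiguate(tagged):
    for (_, tag), (nxt, _) in zip(tagged, tagged[1:]):
        if tag == "NEEDS_S" and nxt != "s":
            return set()
    return {tagged}


pipeline = compose(segment, analyze, disambiguate)
result = pipeline("gesundheitsamt")
print(result)
```

Because every stage returns a set, a stage can multiply readings (segmentation), transform them one to one (analysis), or delete them (disambiguation), and the composed mapping still sends a single input word to a set of tag-annotated segmentations.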