PLM 2004: Mihailov

Translation Pairs from Parallel Corpora

Mihail Mihailov

University of Tampere

The automatic extraction of translation equivalents from parallel text corpora has been widely discussed during the last decade (Oakes 1998). The two main approaches to the problem rely either on string similarity measures or on co-occurrence patterns (Tiedemann 2003). In most cases, however, the language pairs studied have been related languages. In this paper these methods are tested on a Russian-Finnish parallel corpus of literary texts. The texts of the corpus were aligned at the paragraph level, and the word lists were lemmatized. The aim of the experiments was to find out whether translation pairs can be extracted from parallel texts in unrelated languages. As expected, similarity-based extraction of word pairs was not effective in our case. Search methods based on co-occurrence patterns proved much more powerful. The Kulczynski coefficient (KUC) (see Oakes 1998: 171) was chosen as the most suitable measure.
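As an illustration of the co-occurrence approach, the sketch below scores candidate pairs over paragraph-aligned, lemmatized text. It assumes the form commonly given for the Kulczynski coefficient, KUC = a/2 · (1/(a+b) + 1/(a+c)), where a is the number of aligned paragraphs containing both words, b those containing only the source word, and c those containing only the target word. The function names, the threshold, the toy data, and the use of paragraph counts as the frequency measure are all assumptions made for the example; the abstract does not describe the actual search utility.

```python
from collections import Counter
from itertools import product


def kulczynski(a, b, c):
    """KUC for a = both words present, b = source only, c = target only."""
    if a == 0:
        return 0.0
    return (a / 2) * (1 / (a + b) + 1 / (a + c))


def best_pairs(aligned, min_freq=1, max_freq=float("inf"), threshold=0.5):
    """Return {source_lemma: (target_lemma, score)} for the best-scoring pairs.

    `aligned` is a list of (source_lemmas, target_lemmas) paragraph pairs.
    Frequencies here are paragraph counts, a simplifying assumption.
    """
    src_freq, tgt_freq, both = Counter(), Counter(), Counter()
    for src_lemmas, tgt_lemmas in aligned:
        src_set, tgt_set = set(src_lemmas), set(tgt_lemmas)
        src_freq.update(src_set)
        tgt_freq.update(tgt_set)
        both.update(product(src_set, tgt_set))

    pairs = {}
    for (s, t), a in both.items():
        # Keep only source words inside the chosen frequency band.
        if not (min_freq <= src_freq[s] <= max_freq):
            continue
        score = kulczynski(a, src_freq[s] - a, tgt_freq[t] - a)
        if score >= threshold and score > pairs.get(s, ("", 0.0))[1]:
            pairs[s] = (t, score)
    return pairs


# Toy data: transliterated Russian lemmas paired with Finnish lemmas.
aligned = [
    (["dom", "bolshoj"], ["talo", "suuri"]),
    (["dom", "staryj"], ["talo", "vanha"]),
    (["kniga", "staryj"], ["kirja", "vanha"]),
    (["kniga", "bolshoj"], ["kirja", "suuri"]),
]
pairs = best_pairs(aligned)
```

On the toy data each source lemma is paired with the target lemma that appears in exactly the same paragraphs. On a real corpus the frequency band and threshold would need tuning.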

The search was effective within the frequency band from 20 to 220 occurrences. The search utility examined the 7290 Russian words in this frequency band and found 2080 word pairs, of which 1871 (90%) were correct, 102 (5%) were partly correct and only 107 (5%) were wrong. Thus, although recall is low, the error rate is also quite low.
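The reported proportions can be rechecked from the raw counts with a few lines of arithmetic. The "coverage" figure below (pairs found per word examined) is only a rough proxy for the low recall mentioned above. It is an assumption of this sketch and not a figure given in the abstract.

```python
# Counts taken from the text above.
studied = 7290          # Russian words examined in the frequency band
found = 2080            # word pairs proposed by the search utility
judged = {"correct": 1871, "partly correct": 102, "wrong": 107}

# The three judgement classes should account for every proposed pair.
total_judged = sum(judged.values())

# Share of each class among the proposed pairs.
shares = {label: n / found for label, n in judged.items()}

# Hypothetical proxy for recall: proposed pairs per word examined.
coverage = found / studied

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"coverage: {coverage:.1%}")
```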

The method makes it possible to generate simple bilingual glossaries, which can be used for various technical tasks. At least the following problems should be taken into account:

low-frequency words make up most of the word lists but receive no attention at all;
multi-word units vs. composite words;
words having more than one common equivalent.

References:

Oakes, Michael 1998: Statistics for Corpus Linguistics. Edinburgh University Press.

Tiedemann, Jörg 2003: Recycling Translations. Acta Universitatis Upsaliensis, Uppsala.
