Volume 41 (2006) Abstracts

Adam Pawłowski: Chronological analysis of textual data from the ‘Wrocław corpus of Polish’. PSiCL 41 (2006). Pages: 9-29.

Abstract

This paper describes the construction of a corpus of press texts from Poland’s communist period (1944–1990) now being assembled at the University of Wrocław. This corpus will contain chronologically ordered text fragments drawn from the daily press. It is assumed that the language of the official press is the most comprehensive expression of the ideological world construct propagated by the communist regime. Typical ‘sequential event patterns’, defined as predictable series of salient changes in relevant word frequencies related to events or phenomena in a given real-life category over a period of time, are introduced, and a few such patterns of representative data from 1953 issues of the daily newspaper Trybuna Ludu (press organ of the ruling communist party) are also tentatively investigated.


Adam Przepiórkowski: The potential of the IPI PAN corpus. PSiCL 41 (2006). Pages: 31-48.

Abstract

The aim of this article is to present the IPI PAN Corpus (http://korpus.pl/), a large morphosyntactically annotated XML-encoded corpus of Polish developed at the Institute of Computer Science, the Polish Academy of Sciences, and describe its potential and actual use in natural language processing and in linguistic research. In particular, various quantitative information about the corpus and its publicly available subcorpora is provided, including sizes in terms of orthographic words and interpretable segments, tagset size measured in types and tokens, etc., as well as some interesting facts about Polish: frequencies of words of different lengths, frequencies of grammatical classes and categories, and initial frequency lists of Polish lemmata.


Włodzimierz Sobkowiak: Automatic phonetic annotation of corpora for EFL purposes. PSiCL 41 (2006). Pages: 49-55.

Abstract

Corpora of English texts (both native and non-native) are now taken for granted as a resource in teaching and learning English as a Foreign Language (EFL). So far they have, however, been exploited mostly on the lexical, morpho-syntactic and stylistic level. The phonetic potential of raw-text corpora (as opposed to the few expensive acoustically treated and annotated ones) has not been discovered.
In this contribution a method is presented of automatically annotating raw-text corpora with EFL phonetic tags coming from a suitably treated electronic word-list. The focus of the presentation is on phonetic lapsology, i.e. on annotating English text for probable Polglish (Polish-English interlanguage) pronunciation problems and errors, as well as for the overall level of pronouncing difficulty. Two examples from my research are presented and discussed: (1) phono-lapsological analysis of definitions in the Macmillan English Dictionary for Advanced Learners on CD-ROM (MEDAL; Sobkowiak 2004b) and (2) my current work on TIMIT sentences in the context of the Boulder-Poznań CSLR Colorado Literacy Tutor project. It is demonstrated that on top of the automatic phonetic transcription of raw text, which is now conceptually and technologically rather trivial, a sophisticated L1-sensitive automatic phonetic annotation is feasible, with a variety of EFL-related functions, in particular text/sentence selection (e.g. TIMIT) and evaluation (e.g. MEDAL) for lexicographic, pedagogical and research purposes (more in Sobkowiak 2004a, b and unpublished).


Damir Ćavar, Joshua Herring, Toshikazu Ikuta, Paul Rodrigues, and Giancarlo Schrementi: On unsupervised grammar induction from untagged corpora. PSiCL 41 (2006). Pages: 57-71.

Abstract

In this paper we describe the theoretical motivation and the implementation of an unsupervised morphology induction algorithm. We show that we can induce morphologies with very high precision, using simple unsupervised statistical algorithms. The results are not only relevant from a theoretical point of view, but they also have various potential applications in natural language processing.


Elżbieta Dura: Extracting current language use from the web. PSiCL 41 (2006). Pages: 73-85.

Abstract

The web is an essential source of language data. Neither dictionaries nor large corpora can even remotely approximate completeness in their coverage of examples of use. Therefore, concordancers mounted on web search engines constitute an important complement to other corpus tools. Web concordancing tools should not be underestimated, especially by linguists interested in extracting novel or rare language constructions, as these often cannot be found anywhere but on the web. This paper compares the possibilities offered by some NLP-enhanced concordancers mounted on web search engines. Such tools can enable searches for syntactic structures, collocations or new words, which are as precise as those obtained with special corpus tools attached to, e.g., the British National Corpus. The main drawback to concordancers driving web search engines is that they do not provide accurate statistics, since the counts are reported from the engines, which are not always reliable. Still, web concordancers, such as Lexware Culler, which features prominently here, can be excellent tools for linguists, particularly when quick insights into current usage are desired.


Urszula Okulska: Historical corpora and their applicability to sociolinguistic, discourse-pragmatic and ethno-linguistic research. PSiCL 41 (2006). Pages: 87-109.

Abstract

This paper focuses on the use of historical corpora in research on language internal and external parameters that have influenced the historical development of English from its systemic and social perspectives. Selected problems of the applicability of diachronic databases will be discussed in relation to morphological, syntactic, discursive, and pragmatic aspects of English at the earlier stages of its historical development. Additionally, some sociolinguistic criteria, such as social origin, networks, age, gender, and migration, of selected systemic variables will be tackled to show the usefulness of historical corpora for macro- and micro-level studies tracing the impact of external factors on the diffusion of language change across the centuries. Furthermore, as large text collections emerge as an important source of data for ethno-, text linguistic, and cultural research, examples of projects bridging qualitative and quantitative methodologies in dealing with the diachronic development of generic macro-structural units will be demonstrated. Systematic observation of discourse types and information flow in such forms of communication as private and official correspondence, drama, and court trials archived in historical corpora has enabled scholars to uncover the evolution of the cultural system of values in various domains of social life in English history. It is articulated, for instance, in the expression of emotions in private epistolary writings, the emergence of new forms of communication in diplomatic spheres, and discursive construal of characters in early works of drama and records of trial proceedings.


Małgorzata Fabiszak and Przemysław Kaszubski: Studying metaphor with the BNC. PSiCL 41 (2006). Pages: 111-129.

Abstract

This paper is an exercise in corpus-driven metaphor research. In the main part, we demonstrate the use of collocation evidence for the construction of Fillmorian semantic frames for four expressions in the ‘War’ domain – FIGHT WAR, DECLARE WAR, BATTLEFIELD and BATTLEGROUND. We combine manual categorisation of collocates identified in concordance lines extracted from the British National Corpus (BNC) with automatically derived collocation statistics (z-score). We then show how corpus citations demonstrate the scalar nature of metaphor. Finally, we use the L-Word Query function in the BNC Sara concordancer to identify and categorise metaphorical compounds containing the word war. General comments regarding the usefulness of corpus-based empirical methods for metaphor research are offered throughout the paper and in the conclusions.


David Y.W. Lee: Corpus-based linguistics and the uninitiated. PSiCL 41 (2006). Pages: 131-150.

Abstract

In this paper, some of the practical issues surrounding research under the paradigm of ‘corpus-based linguistics’ are discussed, illustrating how, in this fairly new and expanding area of study, there lie many pitfalls for the unsuspecting student or researcher approaching it from the background and point of view of a non-computer scientist. Poised as it is between the world of academic linguistics and the real-world commercial realities of NLP (natural language processing), the world of corpus-based linguistics is one where only the multi-skilled or multiply-connected succeed in doing really substantive or original work. Going in as a linguist, with a primary interest in language, not computation, it can be a long and uphill struggle to adapt oneself to not only the tools and techniques of the trade but, more importantly, their limitations. Also, many of the best ideas, tools, resources and human experts are to be found not in academia, but in the booming commercial world of the NLP industry, and the best corpus-based linguists and corpus-based solutions and ideas are to be found situated in that difficult liminal area of interdisciplinarity. Present corpus-based linguistics courses do not always best prepare researchers for the job. Equipping oneself with just the right skills given a limited time frame, and knowing who and where to go to for help and how to get it, is thus an important aspect of doing corpus-based research as a postgraduate student or as an academic researcher. This paper proposes a more rounded syllabus for corpus-based linguistics courses and suggests some practical guidelines for people thinking of embarking on such research.


Tadeusz Piotrowski: Corpus linguistics in linguistic courses in Poland. PSiCL 41 (2006). Pages: 151-159.

Abstract

This paper argues that students in English departments in Polish universities should receive training first of all in linguistic methods that would allow them to analyze text. Among those methods corpus analysis tools seem to be especially useful for students.


Przemysław Kaszubski: Web-based concordancing and ESAP writing. PSiCL 41 (2006). Pages: 161-193.

Abstract

Concordance lines can outperform computer-based dictionaries in the amount and quality of linguistic information provided to learners (Cobb 2003, Gavioli and Aston 2001). Despite still indecisive empirical verification, the concordancing method (a.k.a. data-driven learning, DDL) has become a welcome supplement to the modern language learning syllabus, especially vocabulary study and (academic) writing (e.g. Garton 1996; Yoon and Hirvela 2004: 258). In this article, following an overview of DDL in the context of ESAP writing (English for Specific Academic Purposes, Dudley-Evans and St John 1998), I develop the premise that specialist learning purposes call for specially designed concordancing tools. I then attempt to build a list of features of a dedicated ESAP concordancer, referring to numerous off-line and online applications and projects. At the end, I introduce the ESAP online concordancing environment being developed at the School of English, Adam Mickiewicz University. In its current version, the tool integrates: 1) multiple corpus searching (general/specific, native/non-native etc.); 2) annotated list-driven searching; 3) error-driven concordancing; 4) an extensive user tracking facility for enhancing DDL observation. An agenda for future developments is sketched out. The early public sampler of the IFA Concordancer can be tried out at http://elex.amu.edu.pl/~przemka/PICLE_search.php.


Chris Tribble and Przemysław Kaszubski: EAP corpora – expert and apprentice performances in literary criticism. PSiCL 41 (2006). Pages: 195-218.

Abstract

In this paper we show how corpus phenomena which Scott (1997) calls clusters can be used to investigate contrast between expert and apprentice production in an area of academic writing. Working with a collection of Polish literature MA student dissertations and published literary criticism articles from the BNC, we demonstrate how three- and four-word clusters can be used to identify pedagogically useful contrasts between these text collections.


Mikhail Mikhailov: Translation pairs from parallel corpora. PSiCL 41 (2006). Pages: 219-235.

Abstract

In this paper, methods of automated extraction of translation equivalents from parallel corpora are tested using a 2.2×2 million-word Russian-Finnish corpus (ParRus) of literary texts aligned at the paragraph level. Textual forms are lemmatized. While similarity-based extraction of word pairs proves ineffective for non-cognate languages like Russian and Finnish, search methods based on co-occurrence patterns work more effectively (lower error rate), although the list of equivalents obtained this way is rather short: 2,080 word pairs. It is concluded that co-occurrence methods may be helpful in generating simple bilingual technical glossaries. As far as using parallel corpora in bilingual lexicography and translation studies is concerned, closer inspection of several translation pairs shows that corpora may produce skewed or misleading data, requiring modification. This is because many translations reveal over-reliance on bilingual dictionaries (wrong or obsolete equivalents), while some other equivalents prove unsuitable as dictionary equivalents. Regardless of such limitations, parallel corpora prove useful for improving bilingual dictionaries and for studying the translation process.


Maciej Machniewski: Analysing and teaching translation through corpora: lexical convention and lexical use. PSiCL 41 (2006). Pages: 237-255.

Abstract

The popularity of corpus methodologies in numerous areas of linguistic research is matched by their widespread use in other, related fields. In Translation Studies, it resulted in the development of a new methodology for inspecting translational-linguistic issues, i.e. Corpus-based Translation Studies. This paper seeks to suggest some benefits of using research data obtained from corpora for the teaching of translation. First, it gives an overview of the main corpus research trends in translation and their applications in translation teaching to date. Subsequently, on the basis of corpus findings obtained from an analysis of small, Polish-English and English-Polish corpora of translations performed by professional translators and translation trainees and their comparison with corpora of parallel texts in English and Polish, respectively, it suggests some other, potential uses of corpora in translation didactics.


Mirosława Podhajecka: Using the web in the translation process. PSiCL 41 (2006). Pages: 257-274.

Abstract

This paper concentrates on the use of the World Wide Web, the largest of all corpora available today, in the translation process. Although the structure of the Web is chaotic and uncontrollable, the corpus is unparalleled in terms of size, diversity and topicality. It is therefore argued that translators should know how to retrieve, select and assess linguistic data to be able to benefit from the rich resources available at the click of a button. The paper offers a brief look at a technique of ‘mining’ the Web, pointing to some of the pitfalls that inexperienced users may face en route.