The "Wrocław Corpus" of Polish Project (abstract)


Adam Pawłowski

University of Wrocław


The landscape of Polish corpus linguistics has changed significantly in recent years. New corpora are being created and old ones digitised. Among the most important projects one should mention:

This sudden upsurge of interest mostly concerns general language corpora. They offer standard text-mining tools, such as concordance look-up, raw frequencies of lemmatised lexemes and, in future, syntactic descriptions. In this respect they follow the guidelines of the major, long-established corpora of English, German and French (e.g. BNC, IDS and TLF).


Our aim is not to create another "hundred million" corpus of contemporary Polish. The "Wrocław Corpus" will instead contain chronological data representing the post-war history of Poland. It will be composed of samples from the daily press (1.2 million running words per year) covering the nearly 50-year period from 1944 to 1990. We intend to extend the range of search and analysis tools available. In particular, the application will provide on-line visualisation and modelling of trends in lexeme frequency over time. It will also make it possible to explore and model some statistical language laws (the Zipf and Menzerath laws). Context analysis of lexemes (collocations) will be extended with on-line calculation of the z-score and mutual information measures. The corpus will be available free of charge over the internet.
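
As an illustration of the collocation measures mentioned above, the following is a minimal sketch in Python, not the project's implementation. It assumes the commonly used simplified formulas: the expected number of co-occurrences is E = f(node) * f(collocate) * span / N, mutual information is log2(O / E), and the z-score is (O - E) / sqrt(E). All function and parameter names are hypothetical, and the frequencies in the usage example are invented for demonstration only.

```python
import math


def expected_cooccurrences(f_node, f_collocate, corpus_size, span=1):
    """Expected co-occurrence count under independence (simplified model)."""
    return f_node * f_collocate * span / corpus_size


def mutual_information(observed, f_node, f_collocate, corpus_size, span=1):
    """Mutual information score: log2 of observed over expected co-occurrences."""
    expected = expected_cooccurrences(f_node, f_collocate, corpus_size, span)
    if observed == 0:
        # log2(0) is undefined; treat a pair that never co-occurs as -infinity
        return float("-inf")
    return math.log2(observed / expected)


def z_score(observed, f_node, f_collocate, corpus_size, span=1):
    """Simplified z-score: (O - E) / sqrt(E). An assumption, not the project's formula."""
    expected = expected_cooccurrences(f_node, f_collocate, corpus_size, span)
    return (observed - expected) / math.sqrt(expected)


# Invented frequencies: node occurs 1000 times, collocate 4000 times,
# in a sample of 1,000,000 running words; they co-occur 16 times.
mi = mutual_information(16, 1000, 4000, 1_000_000)
z = z_score(16, 1000, 4000, 1_000_000)
print(f"MI = {mi:.2f}, z = {z:.2f}")
```

With these invented figures the expected count is 4, so the pair co-occurs four times more often than chance would predict. Mutual information tends to favour rare pairs, while the z-score also reflects the absolute size of the deviation, which is one reason the two measures are often reported together.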


