The "Wrocław Corpus" of Polish Project (abstract)
Adam Pawłowski
University of Wrocław
The landscape of Polish corpus linguistics has changed significantly in recent years. New corpora are being created and old ones digitised. The most important projects include:
the Polish Scientific Publishers corpus (PWN);
the "IPI-PAN Corpus" currently being prepared by the Institute of Computer Science of the Polish Academy of Sciences;
the "toy-corpus" of Polish, implemented by the IPI-PAN and the Ohio State University;
the digitised version of the legendary SFPW corpus, used for the first Frequency Dictionary of Modern Polish;
the PELCRA project.
This sudden upsurge of interest mostly concerns general language corpora. They offer standard text-mining tools, such as concordance look-up, raw frequencies of lemmatised lexemes and, in future, syntactic descriptions. In this respect they follow the guidelines of the largest and longest-established corpora of English, German and French (e.g. BNC, IDS and TLF).
Our aim is not to create another "hundred million" corpus of contemporary Polish. The "Wrocław Corpus" will include chronological data representing the post-war history of Poland and will be composed of samples from the daily press (1.2 million running words per year), covering the period from 1944 to 1990, nearly 50 years. We intend to extend the range of search and analysis tools available to users. In particular, the application will offer on-line visualisation and modelling of trends in lexeme frequency over time. It will also make it possible to explore and model some statistical language laws (the Zipf and Menzerath laws). The analysis of lexeme contexts (collocations) will be broadened with on-line calculation of the z-score and mutual information parameters. The corpus will be available free of charge on the internet.
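The two collocation parameters mentioned above can be illustrated with a minimal sketch. It is not the project's implementation: the function name `collocation_scores`, its parameters, and the choice of formulas are assumptions made here for illustration. Expected co-occurrence is taken as f(node) * f(collocate) * span / N, mutual information as log2(O/E), and the z-score in its simplified form (O - E) / sqrt(E).

```python
import math


def collocation_scores(observed, node_freq, collocate_freq, corpus_size, span=1):
    """Return (expected, mutual_information, z_score) for a node/collocate pair.

    Hypothetical helper; the formulas are common textbook variants,
    not necessarily those used by the Wrocław Corpus application.
    """
    if min(node_freq, collocate_freq, corpus_size, span) <= 0:
        raise ValueError("frequencies, corpus size and span must be positive")
    if observed < 0:
        raise ValueError("observed co-occurrence count cannot be negative")

    # Co-occurrences expected by chance within the given span.
    expected = node_freq * collocate_freq * span / corpus_size

    # Mutual information: log2 of the observed/expected ratio.
    # Undefined for zero co-occurrences, reported here as -infinity.
    mi = math.log2(observed / expected) if observed > 0 else float("-inf")

    # Simplified z-score: deviation from expectation in units of sqrt(E).
    z = (observed - expected) / math.sqrt(expected)

    return expected, mi, z


if __name__ == "__main__":
    # Made-up counts, purely for demonstration.
    e, mi, z = collocation_scores(observed=8, node_freq=100,
                                  collocate_freq=100, corpus_size=10_000)
    print(f"expected={e:.2f}  MI={mi:.2f}  z={z:.2f}")
```

With counts taken per year of the corpus, the same calculation could be repeated for each annual sample to follow a collocation over time.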