1 Which or what? A study of interrogative determiners in present-day English Bas Aarts, Evelien Keizer, Mariangela Spinillo and Sean Wallis Survey of English Usage University College London The determiner slot in English interrogative Noun Phrases allows for three possible occupants, namely which, what and whose, and their variants ending in -ever: whichever, whatever and whosever. In this paper we will be looking only at the first two of these, which and what, in structures like the following: Which films do you like? / What films do you like? These expressions are close in meaning, so the question arises what factors influence the choice of one over the other determiner. Our aim in this paper is twofold: first we will demonstrate how interrogative NPs such as those exemplified above can be retrieved from the British component of the International Corpus of English (ICE-GB), using Fuzzy Tree Fragments (FTFs). The second aim of the paper is to show how the search results can be used to investigate what determines the choice of determiner in NPs like those shown above. 2 The TALANA annotated corpus for French: some experimental results Abeillé Anne, Clément Lionel, Kinyon Alexandra, Toussenel François TALANA-UFRL. Université Paris 7 {abeille, clement, kinyon, ftoussen}@linguist.jussieu.fr This paper presents the first linguistic results exploiting the new annotated corpus for French developed at Talana-Paris 7 (Abeille & al 00). The corpus comprises 1 million words fully annotated and disambiguated for parts of speech, inflectional morphology, compounds and lemmas, and partially annotated with syntactic constituents. It is representative of contemporary normalized written French, and covers a variety of authors and subjects (economy, literature, politics, etc.), with extracts from newspapers ranging from 1989 to 93. After explaining how this corpus was built, we present some linguistic results obtained when searching the corpus for lexical or syntactic frequencies, for lexical or syntactic preferences, and explain why we think some of these results are relevant both for theoretical linguistics and psycholinguistics. 3 On the search for metadiscourse units Annelie Ädel annelie.adel@eng.gu.se English Department Göteborg University, Sweden 1. Introduction To what extent do writers anchor their discourse in the current discourse situation and make the presence of the writer and/or reader overt? What types of situations are at hand when writers refer to their text as text, to themselves or to their readers? To what extent are texts monologic or dialogic? These types of questions have recently attracted a lot of attention within research on metadiscourse. Metadiscourse refers to discourse about on-going discourse and is interesting to study from the perspective of how the writer's or reader's presence in the text are made explicit. It has been studied by various branches of linguistics, for example text linguistics (Mauranen 1993, Markkanen et al 1993) – also in a historical perspective (Taavitsainen 2000), pragmatics (Hyland 1998) and genre studies (Bäcklund 1998, Bondi 1999) and is rapidly becoming a dynamic field of research. One particularly interesting perspective is the study of cultural differences in the use of metadiscourse. Several researchers have shown that metadiscourse typically differs across cultures (e g Markkanen et al 1993, Mauranen 1993, Vassileva 1998, Bäcklund 1998). 
Cultural differences in writing have been studied within the field of contrastive rhetoric, on the basis of the primary hypothesis that there are culture-specific patterns of writing, and that these cause interference in L2 writing (see Connor 1996:90). In this paper, I will report ongoing research into the use of metadiscourse in written argumentative texts by native and non-native speakers of English. The study is corpus-based and comparative, contrasting the use of metadiscourse in the English writing of Swedish advanced learners with the writing of native speakers of British and American English.1 All writers are university students. The argumentative essays are full-length, and are available within the framework of the International Corpus of Learner English (Granger 1993). Because there may be cultural differences in the use of metadiscourse, the British English and American English parts of the control corpus are kept separate.
[Footnote 1: The native speaker corpus, called LOCNESS, also has one British English part consisting of A-level essays, which has been excluded from the investigation. The Swedish subcorpus will be referred to as SWICLE in the following. Examples will be marked either (Swicle), (BrE) or (AmE).]
One of the aims of my thesis is to investigate metadiscoursal patterns with explicit reference to the writer or reader. Some examples of metadiscourse specifying discourse acts that the writer intends to perform are, for instance:
(1a) to introduce a topic or state an aim: In this essay I will discuss some of the problem...; We must now consider the pros and cons of Britain joining "The Single Market".
(1b) to sum up a discussion: To make a short summary of what I have been trying to say in this essay…; I have presented some of the most important benefits of drug legalization…
(1c) to close the topic: …one may conclude that the American people does not approve of political leaders with low moral standards; The conclusion one might draw is rather depressing…
I will first discuss the definition of metadiscourse and how I delimit the area. This is not unproblematic, as metadiscourse has been defined differently by different researchers (see e g Markkanen et al 1993 and Mauranen 1993). The focus of the paper is on a subcategory within the broader field of metadiscourse which will be referred to as ‘metatext'. Metatext is text about the current text, and includes references to the evolving text itself rather than its subject matter. Since I am concerned with personalised examples here, this means linguistic elements through which the writer comments on his or her own discourse actions (see examples (1a-c) above). The method used in this work will only capture personalised types of metadiscourse, where the writer and/or reader have been explicitly mentioned in the texts, including expressions such as as I showed above and I will give an example, but not as shown above, to exemplify or this essay will show, etc. First person pronouns I and we, which are potential explicit references to the writer and/or the reader, have been retrieved from the corpus material and will be considered in the present paper. Further questions that will be posed in the course of the paper are: Who do the pronouns I and we include? How frequent are metatext units across corpora? What do they look like? And how are the metatext units distributed in the texts? The comparative perspective involves investigating the differences and similarities in the use of metadiscourse between learners and native speakers.
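The retrieval step mentioned above — pulling every occurrence of I and we out of the essay files so that each hit can then be classified by hand as metatext, commentary, involvement, etc. — can be illustrated with a short script. This is not part of the original study; it is a minimal sketch assuming plain-text essay files in a folder, and the folder name and context-window size are invented for the example.

```python
# Illustrative sketch (not from the original paper): collect every occurrence of
# the first person pronouns "I" and "we" from a set of plain-text essays, together
# with a small window of context, for later manual classification.

import re
from pathlib import Path

PRONOUN_RE = re.compile(r"\b(I|we)\b", re.IGNORECASE)
WINDOW = 40  # characters of context on each side of a hit (arbitrary choice)

def concordance(text):
    """Yield (pronoun, left context, right context) for each I/we in the text."""
    for match in PRONOUN_RE.finditer(text):
        left = text[max(0, match.start() - WINDOW):match.start()]
        right = text[match.end():match.end() + WINDOW]
        yield match.group(0), left, right

def extract_corpus(directory):
    """Collect concordance lines for every .txt essay in a directory."""
    hits = []
    for path in sorted(Path(directory).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for pronoun, left, right in concordance(text):
            hits.append((path.name, pronoun, f"...{left}[{pronoun}]{right}..."))
    return hits

if __name__ == "__main__":
    # "swicle_essays" is a hypothetical folder of essay files.
    for filename, pronoun, line in extract_corpus("swicle_essays"):
        print(f"{filename}\t{pronoun}\t{line}")
```

The classification itself (metatext versus involvement, inclusive versus generic we) still has to be done by a human reader; the script only produces the concordance lines to be judged.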
It should be stressed that this is work in progress, so no definite analyses will be given. 2. The definition and delimitation of metatext All uses of I or we are not metatextual. I also occurs when the writer appears in the text to express his or her feelings, or to talk about personal experiences. I call this category involvement, and do not include it in the concept of metadiscourse. The way the term is used here is somewhat different from Chafe's (1982) definition. Chafe refers to involvement with the audience as typical for a speaker, and ‘detachment’ from the audience as typical for a writer, and the concept of involvement includes a range of linguistic features that are more prevalent in speaking2. Involvement in the sense it is used in this paper adds a narrative quality, or the writer as a person in the discourse-external world adds some personal experience to the discourse, or expresses his or her personal feelings or attitudes towards phenomena in the ‘real world'. Involvement including we stresses shared personal experiences in discourse-external situations. One would not expect a great deal of involvement in argumentative essays, but this type is extremely frequent in the learner material. I make a distinction between two subcategories within metadiscourse, which are metatext and commentary. Commentary has to do with writer-reader relations. It refers to features used to address readers directly, and draw them into an implicit dialogue (Vande Kopple 1988). I plan to carry out a future study, looking at the following commentary features: second person pronoun you (your), questions, exclamations, and imperative forms, and possibly also lexical items such as dear reader (vocatives). Speaker attitude is not dealt with here, although it may have to do with the writerreader relationship and especially with persuasion. In integrative approaches to metadiscourse (e g Markkanen et al 1993), however, the writer's attitude to what is said is included. In order to distinguish metatext and commentary from other uses of the pronouns, I also make a distinction between (a) whether the action takes place within the world of discourse, or (b) within object language, dealing with the ‘real world'. This can be compared to the distinction between metalanguage and object language in linguistics, an idea which originates with Roman Jakobson (see for instance 1980:81-92). Jakobson described metalanguage as a language in which we speak about the verbal code itself. It was contrasted to object language, in which we deal with objects or items that are external to language itself. Similarly, in a text perspective, we can speak of discourse-internal versus discourse-external phenomena. Thus, one basic question I examine when analysing the data is: Is the ongoing discourse in focus or other, ‘worldly', activities that are external to the text? The notion of current text (cf Mauranen 1993) is also important to metadiscourse, meaning that we are interested in how texts refer to themselves, and not to other texts. Texts about other texts (alluding to or commenting on other texts) are described by the concept of intertextuality, which is not our concern here. This is also the reason why quote markers, or introductions to reported speech, which are often included in models of metadiscourse, will be disregarded in the present study. Reporting the speech of others or quoting from other sources than the current text are intertextual rather than metatextual activities. 
Criteria or parameters for metadiscourse are generally not specified by researchers; it is usually simply referred to as ‘text about text'. However, I have extracted features which work well for the analysis of the personalised, explicit types of metadiscourse present in my collection of corpus data. When looking at an occurrence of I, I decide whether it refers to the writer of the current text, or to an experiencing subject outside the realm of discourse. When the examples contain we, the question to ask is whether the pronoun includes [+writer persona] and [+reader persona], and not some other group that does not directly participate in the ongoing discourse. These parameters are summarised in figure 1 below. To a great extent, the parameters concern rhetorical roles taken on by the writer and offered to the reader.
[Footnote 2: These features are, for example, first and second person reference; the presence of the speaker's mental processes; monitoring by the speaker of the communication channel; emphatic particles that express enthusiastic involvement in what is being said, like just and really; vagueness and hedges; and direct quotes (see Chafe 1982).]
Fig. 1. Parameters for metatext.
Metatext
[+current discourse]
[+world of discourse]
[+writer persona] (I, we)
[+reader persona] (we)
In order to analyse an element as metatext, we need to ask whether it refers to the current discourse, whether it deals with the world of discourse, and, more specifically, concerning first person singular, whether I refers to the writer presenting him- or herself as a writer. Concerning first person plural we, it should include the writer persona and/or the reader persona, and no other persons that do not belong in the world of discourse.
3. Explicit references to the writer and reader: What do the pronouns refer to?
The first and second person pronouns are most typically used to refer to specific individuals identified in the situation of communication (Biber et al 1999:328). They may point to ‘the one who is speaking’ and ‘the one(s) who is/are being addressed’ (Wales 1996:51). These pronouns help us to capture the current writer and the current reader of the current text. In my search for metadiscourse, I look for what Quirk et al (1985:347) call ‘specific reference', i e when the first and second person pronouns are used to refer to the writer(s) and the reader(s), those directly involved in the discourse situation. The first person singular pronoun I is "unambiguous in referring to the speaker/writer" (Biber et al 1999:329), in contrast to we, which often involves a "fluidity and ambivalence of meaning" (Wales 1996:58). In my data, first person singular I points to the writer of the current text in the majority of cases, except when it occurs in quoted material or reported speech, or, for example, when the pronoun it has been misprinted and lost its last letter. In order to classify an example as metadiscourse, not only does the I have to point to the writer of the current text, but the action that the I performs has to be discourse-internal. The verb discuss, for example, involves a speech act which is quite frequently performed in argumentative essays. In the first example, however, the action does not take place within the realm of the current discourse:
(2) One does not need to read the papers to notice how the antagonism towards immigrants has increased. Lately I have discussed the increasing hostility towards immigrants with my friends, relatives and fellow workers.
Almost everyone I spoke to wants tougher immigration rules. (Swicle) The discussion in example (2) is held between the writer as a person in ‘real life’ and text-external individuals and it counts as involvement. Another example including the verbs discuss and analyse, which does concern the writer and the reader is the following, classified as metatext: (3) [] and secondly that the executive had been strengthened vis-à-vis the parliament. I will briefly discuss the Prime Ministers role and then elaborate on the Presidents functions. Then I will analyse each presidency showing how the presidents role has evolved. (BrE) Examples (10)-(16) in section 5 give further examples of metatext having I as subject. 3.1 We and metatext Let us consider some examples of the more complex first person pronoun we. Most occurrences in my material are not metadiscoursal, but relate to discourse-external phenomena. There is a fairly small number of examples that refers to the writer (but not as a writer) and other persons (who are not the readers of the current text), as in the following example: (4) [], but 2 a.m. struck and he bad [sic] to go. We talked some more in the lobby but we had to keep our voices down, out of respect. I wanted him to spend the night because [] (AmE) Examples of this kind have been left out. What I am primarily interested in is the type that has been called ‘inclusive authorial we’ (Quirk et al 1985:350). A small part of the overall material is of this kind (see section 4), for example: 6 (5) students ask themselves the question; is it worth the cost and the great effort it takes to study at the university? Therefore we are now going to look at some advantages of higher education, both in a short-term perspective and also over a longer period (Swicle) The effect of the fact that both the writer and the reader are addressed in this example is that cooperation is emphasised. The writer is showing his or her helpfulness and will to guide the reader in the discourse, thereby bonding with the reader, as it were. One subgroup of the so-called ‘exclusive we', according to Wales (1996) is the collective we, indicating several writers. This type does not occur in the selected material, since there are only single writers of the essays. A second subgroup is ‘editorial we', which is restricted to very few occurrences in the present material, for example “In the course of this essay, we shall attempt to analyse whether this is a belief founded in reality and, if it is, why it should cause such fear.” (BrE). Here, the single author uses the plural form for his own discourse actions. Quirk et al (1985:350) explain the motivation for using this type as a “desire to avoid I, which may be felt to be somewhat egotistical”. Another special usage of we is found in several of the literary essays of the British English material, where we is equivalent to ‘we the audience', as in the following extract: (6) Through Caligula's own dialogue, and through the opinions of Cherea, we are shown that there is a logic to Caligula's approach. Caligula insists on following this logic through to the completion (BrE) This reference is made explicit in another instance, through apposition: “We the audience can identify with her view” (BrE). These types have been omitted. Since the reader is excluded, it could be argued that these are not cases of metatext. 
Before leaving this category, I will bring up one more example, which is intertextual (throughout the play) rather than metatextual, discussing how readers of another text might interpret that other text: (7) In the play 'Caligula' Camus is dealing with the themes of death and the absurd, and throughout the play we can see different characters reacting in different ways. (BrE) The criterion that metatext instances should refer to the current text is not fulfilled here, but the writer is referring to another text, which belongs to a different genre even. Also, unless the writer assumes that the reader of his essay has read (or seen) the play, the we does not include the reader of the current text. Several examples of ‘we the audience’ type can be found in the British English material (as many as 75 out of a total of 262, which is nearly 30 per cent), but none at all in the other corpora. In this case, the use of first person plural we is probably different in the British English material due to the presence of a large number of literary argumentative essays, illustrating how different topics may influence linguistic features. A number of examples are ambiguous, for example: “[] However this system is once again proved wrong as Cacambo, Candide's manservant remains faithful disproving Martin's theory that Cacambo would run off with the money collected from Elderado. We can therefore assume that Voltaire is attacking Optimism in its context of a system, just as he criticises other systems in the book, such as the Church system, the military system and the caste system.” (BrE). Here, does we refer to the readers of Candide or to the writer and reader of the current text? Probably the latter, since it is possible to insert [from what I have said so far] in connection to the instance. As in the we the audience example, it is fairly often the case in the material that the pronoun is followed by a specification, which makes it easier to identify what it is intended to refer to. The ‘generic we', which is by far the most frequent type in the SWICLE and American English material, is sometimes modified by prepositional phrases and apposition: in Sweden/in the US/all/people/women. In some cases, a subsection of the whole is referred to by the addition of asphrases, e g as individuals/as borrowers. This type has also been called ‘rhetorical we’ (used in a collective sense: e g nation, party, see Quirk et al (ibid.)), and is particularly frequent in the SWICLE essays, presupposing solidarity with the reader. In example (8) below, the persons the writer is talking about are ‘people in general’ (also specified by most of us). This includes the writer him- or herself, but not in the role of a writer but rather as an experiencing person in the world. Moreover, other people than the reader are included. 7 (8) Most of us are rather selfish and if we are frank we have to confess that we can not compare ourselves with Mother Theresa when unselfishness is considered. (Swicle) It can be argued that this example is not metadiscoursal. It is still fairly persuasive, and it seems as if the writer very much would like to include the reader in the group of selfish mortals, but the most of still leaves some freedom to the reader to decide for him- or herself whether the description fits or not. This type of ‘freedom’ given to the reader, however, is rare. 
As Clark and Ivanic (1997:165) have pointed out, "[i]n building the dialogue with readers, writers in all genres often take for granted that readers are going to share their beliefs and values [], for example, [] by using the pronoun ‘we'. In this way they position their readers as consenting, part of an ‘usness’ that is hard to resist []". This is particularly true in the case of rhetorical we. The ‘usness’ is particularly hard to resist in a phrase such as as we all know, which includes everyone, also the reader. These examples have been classified as commentary. Interestingly, one example in the material seems to contain some awareness of the fact that we can be quite powerful in its inclusiveness:
(9) So, in the immediate future eastern Europe is an enormous market for consumers. At the same time "we" receive alarm reports from all over the world we have millions of people longing for things we have taken for granted. (Swicle)
The pronoun is put within quotation marks, which seems to have the effect of hedging the statement made by the writer. Thus, the reader is also made aware that the all-inclusiveness of we can be questioned.
4. Frequencies of pronouns and metatext units
There are huge differences in the overall frequencies between the three groups, in particular concerning I. What is most striking is that the Swedish learner material has overwhelmingly higher frequencies of first person pronouns.
Table 1. The overall frequency of I across corpora
Corpus | Approx. number of words in corpus | Raw frequency | Frequency per 10,000 words
SWICLE | 204,630 | 1,851 | 90
LOCNESS BrE | 95,508 | 83 | 9
LOCNESS AmE | 149,767 | 649 | 43
Table 2. The overall frequency of we across corpora
Corpus | Approx. number of words in corpus | Raw frequency | Frequency per 10,000 words
SWICLE | 204,630 | 1,893 | 93
LOCNESS BrE | 95,508 | 262 | 27
LOCNESS AmE | 149,767 | 426 | 28
We is the most preferred first person pronoun, except in the American English corpus, which uses the more individualistic I to a greater extent.3 The preference in the British English data for we over I (27 versus 9) is not in accordance with the results of a study by Vassileva (1998:167) on academic writing, who found that "the ‘I’ perspective clearly dominates in English" in comparison to the ‘we’ perspective. The main part of her English corpus consists of research articles written by speakers of the British English variety. The reason why academic writing and the British English argumentative essays differ remains to be solved. The overuse of I by Swedish learners in contrast to native speakers has been pointed out by Altenberg (1997) and by Ringbom (1996, as reported in Altenberg 1997:127). Previous research on learner writing has also found that learner writers within the ICLE project generally are much more overtly present within the discourse than native speaker writers (Altenberg 1997 and Petch-Tyson 1998), suggesting that this may be a general learner strategy.
[Footnote 3: However, first person plural forms are most frequent in the American English data in the accusative and possessive forms.]
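The normalised columns in Tables 1 and 2 are simply the raw counts scaled to a common text size, so that corpora of different lengths can be compared. The snippet below is not part of the original paper; it only restates the published figures and reproduces that arithmetic.

```python
# Illustrative sketch: normalising raw pronoun counts to occurrences per 10,000 words.
# The word counts and raw frequencies are taken from Tables 1 and 2 above.

corpora = {
    # corpus: (approximate word count, raw frequency of I, raw frequency of we)
    "SWICLE":      (204_630, 1_851, 1_893),
    "LOCNESS BrE": ( 95_508,    83,   262),
    "LOCNESS AmE": (149_767,   649,   426),
}

def per_10k(raw, corpus_size):
    """Frequency per 10,000 words."""
    return raw / corpus_size * 10_000

for name, (size, raw_i, raw_we) in corpora.items():
    print(f"{name}: I = {per_10k(raw_i, size):.0f}, "
          f"we = {per_10k(raw_we, size):.0f} per 10,000 words")
```

Run as is, this prints 90/93, 9/27 and 43/28 for SWICLE, LOCNESS BrE and LOCNESS AmE respectively, matching the normalised columns of Tables 1 and 2; Tables 3 and 4 below use the same calculation with a base of 100,000 words.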
The table below reveals the great difference in the number of metatext units per 100,000 words including I across corpora, particularly in the SWICLE in relation to the native speaker corpora.
Table 3. The frequency of I involved in metatext units compared across corpora
Corpus | Approx. number of words in corpus | Raw frequency | Frequency per 100,000 words
SWICLE | 204,630 | 275 | 134
LOCNESS BrE | 95,508 | 20 | 21
LOCNESS AmE | 149,767 | 55 | 37
These results raise several questions, for instance whether the proportions are similar across corpora also for impersonal types of metadiscourse, or whether the great preponderance in the learner material is due to the general tendency among Swedish learners (as evidenced in this corpus) to use writer (and reader) visibility in their writing. In other words, do the data reflect the general picture regarding proportions in the use of metadiscourse, or are they dependent on the overuse of I in the SWICLE, noted above?
Table 4. The frequency of we involved in metatext units compared across corpora
Corpus | Approx. number of words in corpus | Raw frequency | Frequency per 100,000 words
SWICLE | 204,630 | 57 | 28
LOCNESS BrE | 95,508 | 26 | 27
LOCNESS AmE | 149,767 | 25 | 17
In the British English material, there are more metatext units involving we than I, whereas the Swedish learner essays (in particular) and the American English material have the greatest figures in connection with I. The Longman grammar (Biber et al 1999:334) comments on the ‘unexpectedly high frequency’ of we/us in their categories of news and academic prose as being "connected with the multiple uses of this pronoun in written prose: to make generalisations, and to refer to the author and reader". When comparing the overall figures for we and the frequencies of we involved in metatext elements, we can see that the latter use is far less frequent in the present argumentative essays. There is a great difference in the learner material, where the metatext units involving we only amount to about a fifth of the units including I (57 versus 274). The overall occurrences of the pronouns I and we in the SWICLE, however, are roughly the same, even with a slight predominance of we. One conclusion to draw from this is that the Swedish learners, in their use of personalised metatext, very much prefer the pronoun I to we. The native speakers, on the other hand, use the pronouns more or less to the same extent. The British English material has slightly more occurrences of we (26, as against 20 for I), whereas the American English material has about twice as many metatext units including I (55, versus 25 for we). If we take normalised figures into account, it is still clear that the Swedish learners overuse personalised metatext, particularly with first person singular I as subject, provided that we regard the native speaker data as the norm. The results also show that there is a difference between the American and British English components of the LOCNESS corpus. The data suggest that writers of the American English variety use more metadiscourse (at least metatext) in their argumentative writing than writers of the British English variety. It may be the case that writing conventions concerning metadiscourse differ in the two cultures, in which case Swedish learners would be more inclined towards (or influenced by?) the American English style, and a lot less so towards the British English.
5. What do the metatext units look like?
In this section, I will briefly bring up one characteristic of the learner essays that seems to me to be important. In many respects, the learner essays are more tentative, hedged, and polite in their use of metatext compared to the native speaker examples.4 For example, try and would like to are very common in the learners' metatext units. Try is the most frequent verb after say, but it does not occur at all in the native speaker material. Examples including I are the following:
(10) In this essay I will try to give some examples of what imagination and dreaming have come to today with some references to the past, and I will also try to answer the question as to wether there is a place for imagination and dreaming in our modern world dominated by technology and science, or not. (Swicle)
(11) Should we accept all foreign cultures or discriminate some of them? In what way would our defence of other forms of Swedish culture and our attitude to foreigners and foreign things? I will try to answer these questions, as well as I can, but first I will try to describe the situation of the Swedish language today. (Swicle)
(12) To make a short summary of what I have been trying to say in this essay, technology will never make imagination and dreams unnecessary for two main reasons. (Swicle)
The first two examples announce what is going to follow in the discourse5: in (10) the writer will perform the discourse act of giving examples and answer an important topical question, and in (11) answer topical questions and describe a situation to the reader. These examples are placed at the beginnings of the essays they occur in. Example (12), in contrast, gives anaphoric reference and points backwards in the discourse6: the writer announces that a summary of the main points that have been made in the previous discourse is to come. The what I have been trying to say gives a modest and even insecure impression, as if the writer doubts his or her ability to get his or her ideas across in writing. Also note the occurrence of the phrase in this essay in both (10) and (12), which reflects the fact that essays in the learner material generally have a high degree of reflexivity, or awareness of the discourse as discourse. This is expressed particularly often in the first paragraphs, as in: "This paper is supposed to deal with immigrants coming to Sweden. To be more précis I am going to write about refugees, the biggest group of people who come here today." (Swicle). This helps explain why the verb choose is fairly frequent in the metatext units in the learner material; the writers explicitly comment on the choice of topic for the essay. The polite formula I would like to also has high frequencies in the learner texts. Example (13) below introduces the topic and tells the reader what the writer is going to focus on in the essay. Note the extreme situation awareness in the adverbs in my essay today.7
(13) One of Anita Brookner's books is called 'A friend from England'. I read it last autumn, so I might have forgotten some names. If so, I've got to re-name them. In my essay today I would like to concentrate on the relationship between Heather and Rachel. I will also discuss in what way this was an important personal relationship. (Swicle)
The insecure and hedged quality of many learner texts is also noticeable here, in that the writer explicitly comments on the fact that he or she may have forgotten some names in the novel he or she is about to discuss. However, the solution to this problem, to make up new names, is also brought up. Another example involving I would like to is the following:
(14) To become a dog owner is one of the greatest gifts in life. I will support this statement by discussing three positive effects. I would like to call the first one friendship. It is said that dogs are our best friends and I think that there lies a great deal of truth behind this statement. (Swicle)
In this example, the writer labels the first of three effects which are important supporting arguments for the writer's thesis that a dog is one of the greatest gifts in life. Naturally, the writer could have skipped the metatext element marked by italics above, and been content with just having ‘The first one is friendship'. The writer could also have chosen to express more certainty by using modal will instead of would like to. (15) is another example of I would like to, announcing to the reader that some examples supporting the writer's argument are to come:
(15) In this essay, which is of course wholly unscholarly, I would like to point out some examples, mostly by female writers where woman's place in society - or rather woman's possibility of controlling her won social position - is illustrated by the actions of the "archetypal English literary heroine". (Swicle)
The writer explicitly makes a point of the notion that his or her essay only provides the layperson's view on the topic (in this essay, which is of course wholly unscholarly), thus telling the reader something about how to interpret the text. The writer is careful not to present him- or herself as an expert in the field. This could be interpreted as if the learner writer does not feel confident in the role of an argumentative essay writer or with the possible objectivity and learnedness that traditionally accompany this task. A similar example is the following:
(16) Of course, you may wonder what the true reason for heavy taxation on these pleasures is. Is it really a concern about the average Swede's health and well being or is it, as many suspect, a concern about the government finances that is the reason? I have not the competence required to answer this question, but if you ask me if I think that taxes and restrictions are the right methods to keep people off the bottle and away from the cigarette, the answer is no. (Swicle)
A question is posed, but the writer chooses not to answer it on the grounds that he or she is not an authority on the area. In the following coordinated clause, however, the question is rephrased so that the writer feels that he or she rightly can answer it. Note the interactive expression if you ask me, which would also be classified as metatext.
[Footnote 4: This may also partly depend on cultural differences in writing. Vassileva (1998:168), studying English, German, French, Russian and Bulgarian academic writing, notes that "[i]n German, the author's intentions are, more often than not, mitigated (downtoned, hedged) by means of modal verbs, predominantly ich möchte ‘I would like’ []". Writing in Swedish may also follow this ‘mitigating’ convention, which may be transferred by the learners to their writing in English. In this contrastive study on authorial presence by means of first person pronouns, Vassileva found different cultural norms.]
[Footnote 5: This type has been called Announcement (Crismore et al 1993), preview (Crismore and Farnsworth 1990) and Advance Labelling (Tadros 1993).]
[Footnote 6: Types in which the writer tells the reader what he or she has already done in the discourse have been called Reminders (Crismore et al 1993), reviews (Crismore and Farnsworth 1990) and Recapitulation (Tadros 1993).]
[Footnote 7: Petch-Tyson (1998:112) examines the subcategory ‘reference to situation of writing/reading’ within the broader concept of writer/reader visibility in texts, including here, now and this essay. In a comparison of samples from the Dutch, Finnish, French and Swedish learner subcorpora within the ICLE project, both here and this essay were used about twice as often by the Swedish learners.]
6. How are the metatext units distributed in the texts?
The program WordSmith Tools (http://www.liv.ac.uk/~ms2928/index.htm) has been used to study the distribution of metatext elements. It gives the percentage of how far into the text file the search word occurs. The spread of the search word is examined in each text, starting at 1 and ending at 100 per cent. All percentages have been compiled and divided into tenths (1-10, 11-20, 21-30, etc). Measuring textual distribution in terms of percentages and not paragraphs can be criticised in several ways, but looking at the figures and tables in terms of percentages still gives a good picture of the spread of the metatext units through entire collections of texts.8
[Footnote 8: However, it may be the case, for instance, that paragraph length differs across corpora, which could influence the figures.]
In the Swedish learner corpus (see diagram 6a below), there is a clear peak in metatextual units including first person I in the first parts of the essays. As many as 34 per cent of the overall number of metatext units occur from 1 to 20 per cent into the essays. A certain rise is also noticeable at the very end, in the last percentage section. There are few occurrences of metatext having I as subject in the British English material, and the great majority of units occurs in the very first section (see diagram 6c below). Although the data are scarce, the overall picture is similar to the distribution in the Swedish learner essays. In the American English material (diagram 6b), on the other hand, the highest peak is found in the last section, 91-100 per cent into the texts, but also in the first and third sections. As expected, the results show that the beginnings and the ends of the essays are important sections for metadiscourse. Further analysis will reveal whether there are patterns in the use of individual verbs, for example whether some verbs have a preference for occurring at a certain point in argumentative essays. Thus, questions to be answered are: Which of the metatextual verbs occur most frequently at the beginning or at the end of texts? Why is that? What do the forms look like? What functions do they have in the texts?
Diagrams 6a-c and 7a-c. The distribution of metatext units involving I and we in the three corpora [diagrams not reproduced]:
Diagram 6a. The distribution of occurrences of I involving metatext in SWICLE
Diagram 6b. The distribution of occurrences of I involving metatext in US LOCNESS
Diagram 6c. The distribution of occurrences of I involving metatext in UK LOCNESS
Diagram 7a. The distribution of occurrences of we involving metatext in SWICLE
Diagram 7b. The distribution of occurrences of we involving metatext in US LOCNESS
Diagram 7c. The distribution of occurrences of we involving metatext in UK LOCNESS
If we compare the patterns of distribution of metatext involving I and we in the individual corpora, the neatest match is in the British English data, in which the highest numbers are in the very first section (8 occurrences in both cases, see diagrams 6c and 7c above). The figures in the American English data are only partly similar (diagrams 6b and 7b).
There is a peak at the end for both pronouns, but it starts already at 81-90% for I, and the numbers are twice as high. The increase of metatext including I at the beginning do not correspond to a rise in the results including we. The top score for metatext involving we, instead, occurs at 41-50 per cent. The distribution of metatext with I and we as subjects is basically reversed in the SWICLE material (diagrams 6a and 7a). In the case of I, there are two evident tops at the very beginning and at the very end of the essays. In the data involving we, on the other hand, the lowest numbers occur in those places, and the rise takes place at 21-30 and 31-40 per cent. 7. Conclusion Some evidence that the three groups represented differ both quantitatively and qualitatively in their use of metatext as well as in their overall use of first person pronouns has been presented in this paper. Any definite answer to the reason why the differences between the corpora exist – in particular with regard to the learner essays versus the native speaker texts – will have to be left until the material has been analysed in detail. However, one partial and tentative answer may be that the differences are related to different cultural conventions for writing. First of all, there is a possibility that the conventions for using metatext in argumentative writing are not the same in Swedish and English. In addition, there seem to be differences with regard to writer visibility and use of metatext among the two varieties of English in the native speaker corpus. For example, there is a clear dissimilarity in the use of I in the American and British English parts (43 versus 9 occurrences per 10,000 words, see table 1 above). This suggests that there may be cultural differences involved, and the division of the LOCNESS corpus into two parts made for the present investigation is endorsed. To some extent, the presence of the discourse participants in texts and the use of pronouns are also a question of politics (see Wales 1996:84). The critical discourse perspective (see e g 12 Fairclough 1995) deals very much with ideology in texts, which may be done for example by asking questions such as: Is the agent behind the text visible? Is the position taken by the writer explicitly stated as belonging to the writer or is it rendered as a general truth or common knowledge? In this field, linguistic forms and structures are studied in terms of power and ideology, such as the agentless passive construction which, in contrast to personalised expressions, leaves causality and agency unclear – possibly with the aim of conscious hedging or even deception. Acknowledgement I would like to thank professor Karin Aijmer for reading and commenting on my paper. References Altenberg B 1997 Exploring the Swedish component of the International Corpus of Learner English. In Lewandowska-Tomaszcyk B, Melia P J (eds) PALC'97: Practical applications in language corpora. Lódz, Lódz University Press, pp 119-132. Bäcklund I 1998 Metatext in professional writing: a contrastive study of English, German and Swedish. Texts in European Writing Communities 3. Uppsala, Uppsala Universitet. Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 The Longman grammar of spoken and written English. London, Longman. Bondi M 1999 English across genres: language variation in the discourse of economics. Modena, Edizioni Il Fiorino. Chafe W L 1982 Integration and involvement in speaking, writing, and oral literature. 
In Tannen D (ed) Spoken and written language: exploring orality and literacy. Norwood, Ablex, pp 35-53. Clark R, Ivanic R 1997 The politics of writing. London and New York, Routledge. Connor U 1996 Contrastive rhetoric: cross-cultural aspects of second-language writing. Cambridge, Cambridge University Press. Crismore A, Farnsworth R 1990 Metadiscourse in popular and professional science discourse. In Nash W (ed) The writing scholar: studies in academic discourse. Written Communication Annual: An international survey of research and theory: Volume 3. Newbury Park/London/New Delhi, Sage Publications, pp 118-136. Crismore A, Markkanen R, Steffensen M S 1993 Metadiscourse in persuasive writing: a study of texts written by American and Finnish university students. Written Communication 10: 39-71. Fairclough N 1995 Critical discourse analysis: the critical study of language. London and New York, Longman. Granger S 1993 The International Corpus of Learner English. In Aarts J, de Haan P, Oostdijk N (eds) English language corpora: design, analysis and exploitation. Amsterdam, Rodopi, pp 51-71. Hyland K 1998 Persuasion and context: the pragmatics of academic discourse. Journal of Pragmatics 30: 437-455. Jakobson R 1980 The framework of language. Michigan, Michigan Studies in the Humanities. Markkanen R, Steffensen M , Crismore A 1993 Quantitative Contrastive study of metadiscourse: problems in design and analysis of data. Papers and Studies in Contrastive Linguistics 28: 137-152. Mauranen A 1993 Cultural differences in academic rhetoric: a textlinguistic study. Frankfurt am Main, Peter Lang. Petch-Tyson S 1998 Writer/reader visibility in EFL written discourse. In Granger S (ed) Learner English on computer. London and New York, Longman, pp 107-118. Quirk R, Greenbaum S, Leech G, Svartvik J 1985 A comprehensive grammar of the English language. London, Longman. Taavitsainen I 2000 Metadiscursive practices and the evolution of early English Medical writing 1375- 1550. In Kirk J M (ed) Corpora galore: analyses and techniques in describing English. Papers from the nineteenth international conference on English language research on computerised corpora (ICAME 1998). Amsterdam and Atlanta, Rodopi, pp 191-207. Tadros A 1993 The pragmatics of text averral and attribution in academic texts. In Hoey M (ed) Data, description, discourse: papers on the English language in honour of John McH Sinclair. London, HarperCollins, pp 98-114. Vande Kopple W J 1988 Metadiscourse and the recall of modality markers. Visible Language XXII: 233-27. Vassileva I 1998 Who am I/who are we in academic writing?: A contrastive analysis of authorial presence in English, German, French, Russian and Bulgarian. International Journal of Applied Linguistics, Vol 8, No 2: 163-190. Wales K 1996 Personal pronouns in present-day English. Cambridge, Cambridge University Press. 13 Discourse particles in contrast Karin Aijmer English Department Göteborg University There are several new avenues to the study of discourse particles (well, now, actually, in fact, etc ) such as the study of discourse particles across different languages. Cross-linguistic comparison of discourse particles represents what Carlson calls ‘an extreme case of contrastive analysis’ providing ‘potential evidence for an underlying set of shared (perhaps universal) functional distinctions (Carlson 1984: 68). The external contrastive analysis implies that correspondences are established between elements in different languages. 
Investigating translation equivalents of discourse particles can be regarded as a typical corpus problem since it involves looking for all the possible equivalents in order to establish rules for how they are used. Translation data show how transfer of the semantic content results in a rich variety of lexical items. The results may be analysed in terms of formal, semantic, stylistic and pragmatic equivalence. For this study I have used the English-Swedish Parallel Corpus, a 2 million word corpus containing translations from English into Swedish and Swedish into English to initiate fine-grained cross-linguistic comparisons between languages and to test hypotheses (Aijmer et al 1996). For example, on the basis of theories of grammaticalisation it is possible to make cross-linguistic predictions that can be tested on the basis of the corpus. References Aijmer K, Altenberg B, Johansson M 1996 Text-based contrastive studies in English. Presentation of a project. In Aijmer K, Altenberg B, Johansson M (eds). Languages in contrast. Papers from a symposium on text-based cross-linguistic studies. Lund: Lund University Press. 73-85. Carlson L 1984. ‘Well’ in dialogue games: A discourse analysis of the interjection ‘well’ in idealized conversation. Amsterdam and Philadelphia: John Benjamins  14 John is a man of (good) vision: enrichment with evaluative meanings* Takanobu Akiyama University of Lancaster 1. Introduction  In human communication, the linguistic form of an utterance often provides the hearer with only very skeletal information, and thus the hearer requires pragmatic inferential processing for interpretation of assumptions that are communicated by the utterance. It is reasonable to state that this pragmatic inferential processing is crucially contingent on contexts. In general, however, the term context tends to be restricted either to the discourse or co-text or to the physical situation. Relevance theory (Sperber and Wilson 19952), in order to avoid this ad hoc explanation of the context, defines it as a subset of the hearer's beliefs and assumptions about the world which interacts with newly received information. This paper, taking these into account, attempts to provide an account of pragmatic enrichment with evaluative adjectival meanings which occurs in interpretation of man of + N construction (e.g. John is a man of (good) vision). Pragmatic enrichment (or simply ‘enrichment') is a pragmatic inferential process of fleshing out the logical form of an utterance in order to recover the explicature of the utterance. Interpretation of phrases such as a man of sense, taste, vision, property, etc. consistently requires enrichment processing with evaluative adjectival meanings, but there has been no research into the mechanism under which this sort of enrichment processing occurs. This problem will be discussed in the framework of ‘relevance theory’ on the lines proposed by Sperber and Wilson (19952); this theory appears to be well equipped to make headway in the murky area involving various pragmatic phenomena. Yet, our examination is also carried out on the basis of data retrieved from the British National Corpus (BNC), because an introspection-based analysis alone does not afford an opportunity to construct a truly convincing argument. The discussion below is organised as follows. 
Section 2 is devoted to a preliminary discussion of certain indispensable pragmatic concepts relating to arguments in this paper and to the establishment of a consistent perspective for analysing the target construction. Here we will attempt to define the term ‘context', because one may presume that enrichment processing with evaluative meanings in the target construction is independent of the context, if the term ‘context’ is merely confined to external (i.e. linguistic and physical) contexts. Section 3 presents a survey of the target construction on the basis of data retrieved from the British National Corpus (BNC). We will classify phrases that give rise to evaluative enrichment into two types. Section 4 discusses how enrichment processing in the man of + N construction is motivated by relevance theory and establishes a hypothesis which is concerned with what the hearer's evaluative viewpoint is based on. Section 5 concludes this paper with a brief summary. 2. Preliminary discussion  2.1. Explicature and implicature: two levels of communicated assumptions Before embarking on an analysis of enrichment with evaluative adjectival meanings, it is necessary to make explicit certain basic concepts relating to pragmatic processing and to establish, as far as possible, a consistent perspective for analysing it. Sperber and Wilson (19952) and followers of their theory (e.g. Blakemore (1987, 1992, 1995)) sort hearers’ assumptions regarding what is communicated by utterance into two levels, viz. explicature and implicature. Explicature is the first level of assumption in utterance interpretation, at which the hearer fleshes out the under-determined form produced by the speaker to a fully elaborated, and accordingly explicit propositional form. At this level of assumption, the hearer's inference is based on the original utterance form, and is usually recovered by three types of processing, viz. disambiguation, reference assignment, and enrichment. Implicature, on the other hand, is the second level of assumption communicated by the utterance, where the hearer's inference is not based upon the original utterance form but upon the explicature, and thus the implicature can be quite removed from the original utterance form. A distinction between explicature and implicature can be hence drawn in terms of whether the assumption is contingent on the original utterance form or not. To enhance our understanding of the distinction between these two communicated assumptions, we will examine a breakfast conversation, during a scene in which Philip (husband), Jane (wife), and David (their son) are eating slices of toast. (1) Philip: David, another piece of this.  15 Jane: I think he's had enough. (BNC: KCH 430) In the conversation above, both explicature and implicature are requisite for the interpretation of each utterance. Let us suppose that Jane's interpretation of Phillip's first utterance is: ‘David, you can have another piece of toast.’ This assumption can be called explicature, because it is recovered in terms of the enrichment of ‘you can have’ and the reference assignment of the pronoun ‘this’ with ‘toast', motivated by the under-determinacy of the original utterance. Moreover, Philip's interpretation of Jane's utterance could be that ‘I think David has had enough toast already’ (viz. explicature), and then ‘If he eats any more he'll be sick. So don't give him the toast’ (viz. implicature). 
In Jane's utterance, the foregoing three types of processing to recover the explicature can be found: the pronouns ‘I’ and ‘he’ refer to the speaker (i.e. Jane) and David, respectively (viz. reference assignment), the verb ‘have had’ is to be construed as ‘have consumed', not ‘have possessed’ (viz. disambiguation), and some word meanings (i.e. toast and already) are added (viz. enrichment). At issue here is what the recovery of these communicated assumptions is based on. Let us confine ourselves to discussing explicature for simplicity of exposition. We skate over implicature here. 2.2 Perspectives on ‘context': external contexts As we have seen in the conversation (1) above, the original utterance form often provides only a very skeletal clue as to the explicature to be recovered. It will thus be reasonable to state that the process of fleshing out the original utterance form into an explicature crucially hinges on contexts. In the conversation (1), for instance, we will have to make much more effort in recovering these explicatures if we do not know that this conversation occurs during a breakfast scene. Further, Philip's first utterance, i.e. ‘David, another piece of this', could not be interpreted properly, unless we knew from the context that the participants of the conversation were eating toast. The explication of explicature therefore stands in need of a consistent perspective on the term ‘context'. ‘Context’ has been a very widely used term in linguistics and, as a consequence, any account of its meaning will be required to specify exactly how it is being used. Yet, many of the previous analyses of the context, while aiming to reify this abstract term, have mainly directed attention to external contexts and several types of them have been put forward, e.g. linguistic context (co-text), physical context (setting), etc. The linguistic context (co-text) will be the most easily postulated as a definition of context. It refers to the sounds, words, phrases, and so on, which come immediately before and after a particular phrase or piece of text and help to explain its meaning. The major role of the linguistic context for interpretation of utterances will be concerned with cohesion (c.f. Verschueren (1999: 104)). In other words, hearers attempt to solve reference assignment, disambiguation, etc. by referring back to earlier discourse or projecting toward a future linguistic context in order to give coherence to utterance interpretation. In the conversation (1), for instance, interpretation of Jane's utterance ‘I think he's had enough’ will require as a linguistic context Philip's previous utterance ‘David, another piece of this', to resolve a reference assignment of ‘he’ as ‘David'. The physical context (setting) can be defined as ‘the spatio-temporal location of the utterance, i.e. as the particular time (moment) and particular place at which speaker gives utterance and the particular time and place at which the hearer hears or reads the utterance’ (Allan (1986: 36)). This type of context is, thus, in particular concerned with the reference assignment of deictic expressions of place (e.g. ‘here', ‘there') and of time (e.g. ‘yesterday', ‘last week'). Suppose that you are told by someone ‘Sit here; but don't sit here, please'. You will be thrown into confusion by the indeterminacy of ‘here’ in this utterance unless the speaker designates two locations (e.g. two chairs) alternatively. 
It is indubitable that utterance interpretation is very highly determined by contextual factors, but it will be inadequate to direct attention to only the linguistic and the physical contexts, say the external contexts. In the following utterances (2) and (3), the explicatures recovered by means of enrichment processing do not seem to be dependent on those external contexts. (2) a. Whatever else the mysterious Arianna might be, she was clearly a woman of taste.1 (BNC: JY7 4224) b. = …, she was clearly a woman of good/*bad taste. (3) a. Wordsworth was a man of feeling, for an Englishman, but well compared to Oor Rabbie quite frankly he was nothing to write home about. (BNC: B38 112) b. = Wordsworth was a man of strong/*weak feeling, …   Hereafter, italics in corpus sentences quoted from the BNC are inserted to highlight relevant parts of the example.   16 Application of those external (i.e. the linguistic and the physical) contexts to the explication of the enrichment in (2b) and (3b) will be both pointless and stultifying. In the next section, in order to avoid this impasse, we will turn attention to a ‘psychological construct’ as per Sperber and Wilson (19952). 2.3. A psychological construct as a context While the emphasis in the previous analyses of context has been placed mainly on the external contexts in which an utterance is made, Sperber and Wilson (19952:15ff) lay stress on the (internal) psychological construct, viz. the subset of the hearer's current knowledge, beliefs, assumptions, hypothesis, and cultural and social conventions about the world. Of interest here is that Sperber and Wilson do not categorise contexts into several types but unify them into a single one that is supposed to exist in our mind. In their view, understanding an utterance is a matter of integrating the proposition it expresses with a context of existing beliefs and assumptions. The set of premises used in interpreting an utterance […] constitutes what is generally known as the context. A context is a psychological construct, a subset of the hearer's assumptions about the world. (Sperber and Wilson 19952: 15)2 This will come close to saying that the ‘psychological construct’ here subsumes a notion ‘schema’ in cognitive psychology (cf. Blakemore 1992: 17: Sperber and Wilson 19952:138). The schema is ‘a structured cluster of concepts; usually it involves generic knowledge and may be used to represent events, sequences of events, percepts, situations, relations, and even objects’ (Eysenck and Keane 20004: 252). In what follows I will claim that the psychological construct of the schema goes hand in hand with the enrichment processing of the construction ‘man of + noun’ (see (2-3) above). 3. Enrichment in ‘Man of + N (noun)’ 3.1. Classification of the target construction First of all let us start our analyses of enrichment of the target construction by classifying this construction according to the meaning of nouns following the man of. Our examination in the following discussion is based on data from the BNC, because an introspection-based analysis alone does not afford an opportunity to construct a truly convincing argument. The BNC contains 2060 examples of a collocation man of in 970 different texts, and 260 types and 651 tokens extracted from these 2060 examples are ‘man of + noun (N)'. These 260 types of ‘man of + N’ fall roughly into the following three categories: (I) the N characterises the man's activity or occupation (e.g. a man of letters); (II) the N denotes a group, nationality etc (e.g. 
a man of French); (III) the N denotes the man's essential feature or his trait, which is based on his physical/psychological/social properties. (e.g. a man of sense; a man of principles)3 Category (I) includes as its member a man of action, commerce, letters, business, law, peace, etc. The nouns in this category characterise the man's activity, which leads to his occupation. For instance, a man of action denotes a man whose whole life is a soldier, sportsman, political leader, etc.; a man of peace denotes a man whose life is to bring about peace; a man of letters denotes a man who writes works of literature or writes about literature. Category (II) has as members a man of Europe, India, Borneo, China etc. In what follows, these two categories will be excluded from our consideration. Category (III), which is the object of our investigation, refers to a man's physical/psychological/social characteristic or some essential feature of his character. Some nouns appearing in this category bring about enrichment with evaluative meanings for interpretation of the whole phrase (e.g. a man of (good) sense). In my investigation there are 41 phrases belonging to this category; the examples and their frequencies are described in Table 1. Here we should notice that not all the phrases in the category (III) undergo enrichment with evaluative meanings for the interpretation.   Blakemore (1992: 87), following Sperber and Wilson, defines the term context as ‘the beliefs and assumptions the hearer constructs for the interpretation of an utterance either on the basis of her perceptual abilities or on the basis of the assumptions she has stored in memory or on the basis of her interpretation of previous utterances'.   There are a small number of phrases in which N refers to a man's name (e.g. a man of John). Yet, for this very small number of members, I do not set up another category.   17 Table 1 Phrases and Frequencies of Category (III) phrase freq. phrase freq. man of honour 19 man of experience 2 man of integrity 10 man of principles 2 man of principle 9 man of quality 2 man of power 8 man of ability 1 man of reason 8 man of ambition 1 man of influence 7 man of anger 1 man of property 7 man of caution 1 man of vision 5 man of contradictions 1 man of courage 4 man of excitement 1 man of culture 4 man of family 1 man of genius 4 man of fidelity 1 man of knowledge 4 man of goodwill 1 man of sense 4 man of humour 1 man of taste 4 man of ideas 1 man of character 3 man of individuality 1 man of dignity 3 man of passions 1 man of feeling 3 man of spirit 1 man of learning 3 man of self-discipline 1 man of passion 3 man of strength 1 man of standing 3 man of surprise 1 man of energy 2 3.2. Analysis of ‘man of + N’ 3.2.1. Phrases of truisms Phrases which obviously yield enrichment with evaluative meanings are listed up in Table 2. These phrases are retrieved from Table 1. Most of the phrases refer to a man's physical or psychological properties and others make reference to a man's social properties. Phrases concerning a man's physical or psychological properties are a man of power, vision, knowledge, sense, taste, character, feeling, energy, experience, principles, ability, ideas, spirit, and strength. Phrases in relation to a man's social properties are a man of property, culture, standing, quality, and family. Table 2 Phrases which Obviously Yield Enrichment with Evaluative Meanings for their Interpretation phrase freq. phrase freq. 
man of power 8 man of energy 2 man of property 7 man of experience 2 man of vision 5 man of principles 2 man of culture 4 man of quality 2 man of knowledge 4 man of ability 1 man of sense 4 man of family 1 man of taste 4 man of ideas 1 man of character 3 man of spirit 1 man of feeling 3 man of strength 1 man of standing 3 Phrases listed in Table 2 obviously give rise to evaluative enrichment for their interpretation. All Ns following man of in these phrases, which refer to human physical, psychological or social properties are considered as being the endowment of all humans ubiquitously. Another common denominator among these Ns is that in their lexical level they do not denote a specific (e.g. high or low; good or bad) degree. Power, for example, basically denotes a particular ability of the body or mind but does not specify its degree in its lexical level. Yet power denotes a high degree sense in the man of + N construction with  18 enrichment processing. (4) a. Pope Gregory the Great had spoken of taming the wild unicorn, symbol of the man of power. (BNC: HPT 616) b. = …, symbol of the man of great/*low power. Property, culture, standing, quality, and family are usually assumed to be possessed by humans in their social life. For instance, a man of property, in its linguistically decoded logical form without enrichment processing, denotes a man who owns things. This logical form itself does not carry any new information. A close look at the data in table 2 reveals that Ns following the collocation man of commonly denote properties of human beings. Moreover I reiterate that these properties are assumed as ubiquitous among humans. It is of significance therefore that the phrases in Table 2 can result in a ‘truism', if we do not interpret these phrases with enrichment processing. A ‘truism’ is a statement that is obviously true, in particular one that does not say anything important. On the other hand, sentences such as He is a man of anger, passion, excitement, goodwill, and surprise, all of which also make reference to a man's psychological property, are not likely to give rise to enrichment with evaluative meanings. This is because these Ns following man of clearly contain a gradable adjectival feature inherent in their basic senses. Anger, for example, denotes a ‘strong’ feeling of resentment, and thus denotes a strong feeling of wanting to harm or criticise someone because they have done something unfair. Besides, man of honour, integrity, courage, genius, dignity, caution, contradictions, fidelity, and self-discipline do not need enrichment processing with evaluative meanings for their interpretation. This will be because properties depicted by these Ns following man of are considered as distinctive or unique rather than ubiquitous among humans. (5) a. "I am a man of honour as well as a royal emissary." (BNC: H9C 1638) b. It is "mental bombast" as opposed to verbal bombast, and it is a "fault of which none but a man of genius is capable". (BNC: CDL 1530) 3.2.2. Phrases of Vagueness There are some more phrases that require enrichment processing with evaluative meanings for their interpretation in the man of + N construction: these are listed in Table 3. It will be normal and expected in our social life for humans to have properties described by Ns following man of in this table. However, we cannot say that every human being is assumed to possess these properties equally. 
Therefore linguistically decoded logical forms of these phrases do not obviously give rise to truism but rather vagueness of their meanings, if these phrases are interpreted without enrichment. Further it is not obvious but likely that enrichment processing with evaluative meaning for the phrase interpretation in Table 3 occurs. It follows from this that the possibility of occurrence of enrichment processing in the man of + N construction is a matter of degree. Table 3. Phrases which are Likely to Yield Enrichment with Evaluative Meanings for their Interpretation phrase freq. man of reason 8 man of influence 7 man of learning 3 man of humour 1 man of individuality 1 It will be admissible to state that the scales of properties described by Ns in these phrases are vague, and this vagueness provides a motivation for enrichment processing. (6) a. He was a man of influence in the literary world. (BNC: AC3 1486) b. = He was a man of great/*small influence in the literary world. (7) a. I'm a man of individuality where garden embellishment is concerned. (BNC: ACX 184) b. = I'm a man of strong/*weak individuality… In (6a) the meaning of the logical form of a man of influence is vague. In other words, it is not obvious how much influence the man in question has in the literary world, if enrichment processing is not carried out. In (7a) a man of individuality in its logical form also denotes a vague sense, because the meaning of the lexical item individuality, i.e. the quality that makes someone different from all other people, does not  19 carry a specific sense in a degree. I claim that a common motivation for enrichment shared by truism-types of phrases (e.g. a man of taste) and vagueness-types of phrases (e.g. a man of influence) is a lack of relevance, a technical notion about communication put forward by Sperber and Wilson (19952), which will give a unitary account of mechanism of enrichment processing in man of + N construction. In the next section we will discuss the relationship between this notion and enrichment processing in the construction under discussion. 4. Relevance Theory Approach to Enrichment Processing in Man of + N Construction  As it has been established that the N which appears in phrases in Table 2 commonly has essential properties (e.g. a man of taste), it will be reasonable to state that these features of Ns provide a clue to the exploration of the process of enrichment in ‘man of + N’ construction. When the noun denotes essential properties of human beings and is neutral as to quality/value in its basic sense, the construction ‘man of + N’ will become a kind of truism. So information in the truism will not cause the hearer to add any new information, or strengthen his/her existing assumption, or change her/his mind. In this section, we will discuss mechanism of enrichment processing of the target construction in the framework of ‘relevance theory’ on the lines proposed by Sperber and Wilson (19952). 4.1. Communicative Principle of Relevance According to the relevance theory, ‘every aspect of communication and cognition is governed by the search for relevance’ (Wilson and Sperber 1998: 9). This theory assumes that human communication and cognition are relevance oriented; we pay attention to information that seems relevant to us. The technical notion ‘relevance’ is measured in terms of the relationship between ‘contextual effects’ and ‘processing effort'. 
Contextual effects are a kind of added set of conclusions that newly received information gives rise to, by interacting with a context, in order for the information to be relevant to the hearer. There are three ways in which newly received information may interact with the context to give rise to a contextual effect: (i) it may combine inferentially with contextual assumptions to yield a contextual implication; (ii) it may strengthen an existing assumption; (iii) it may contradict and lead to the elimination of an existing assumption (see Blakemore (1992: 30), (1995: 445); Sperber and Wilson (19952: 108ff); Wilson and Sperber (1998: 8)). Other things being equal, the more contextual effects there are, the greater the relevance of particular information. However, these contextual effects, say context-dependent benefits in communication, do not occur without any cost but require ‘processing effort', and therefore ‘An assumption is relevant to an individual to the extent that the effort required to achieve these positive cognitive effects is small’ (Sperber and Wilson 19952: 266). In other words, contextual effects need to be economically achieved, and thus the harder a hearer has to try to interpret the information, the less relevant it is. To summarise, a highly relevant utterance has large contextual effects for small processing effort, and an utterance of small relevance has a processing effort which exceeds its potential contextual effect. Taking these notions into account, Sperber and Wilson (19952) enunciate the following fundamental principle about human communication: (8) (Communicative) Principle of Relevance: every ostensive stimulus communicates a presumption of its own optimal relevance. (Sperber and Wilson 19952: 158) This will come close to saying that every ostensive (i.e. deliberate and overt) communication involves a presumption that utterance will have adequate contextual effects for the minimum necessary processing. Let us turn back to the discussion of our original problem. We have seen that the ‘man of + N’ construction, several instances of which give rise to enrichment with evaluative meanings for their interpretation (e.g. a man of taste), would be understood as a truism, if the phrase were interpreted without the enrichment processing. (That is, every man can be assumed to have some kind of taste.) In this case, the processing effort for the truism-interpretation will be minimal because this interpretation is merely dependent on the hearer's lexical knowledge of the lexical items (i.e. man, of, and taste), but at the same time this truism-interpretation also yields a minimal contextual effect. There will be no relevance in this interpretation. Yet, in order to follow the communicative principle of relevance, the hearer tries to find enough contextual effects and adds some meaning (e.g. good in a man of good taste). In this case, the processing effort is likely to be small, but the contextual effect will be adequately large to avoid a truism and to characterise the man in some positive, informative way. The contextual effect in ‘a man of good taste’ is applied to case (i), viz. ‘it (i.e. newly received information) may combine inferentially with  20 contextual assumptions to yield new conclusions'. Moreover, analogous comments hold in the case of a man of influence', which is a member of the category depicted in Table 3 above. 
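Sperber and Wilson do not put numbers on contextual effects or processing effort, but the comparative reasoning just applied to a man of (good) taste can be made concrete with a toy sketch. The scores and the selection rule below are invented solely for exposition (any values with the same ordering would do); the sketch is not a claim about how these quantities could actually be measured.

# Toy illustration (not Sperber & Wilson's formalism): candidate interpretations of
# "John is a man of taste", scored on the two dimensions summarised in Table 4 below.
# The numeric scores are invented; only their ordering matters.
candidates = [
    # (paraphrase,                        contextual effects, processing effort)
    ("John is a man of taste (truism)",   0,                  1),   # bare logical form
    ("John is a man of good taste",       3,                  2),   # enriched explicature
]

ADEQUATE_EFFECTS = 1   # minimum pay-off for the utterance to be worth processing at all

def preferred(interpretations):
    """Keep interpretations with adequate effects, then pick the least effortful one."""
    adequate = [c for c in interpretations if c[1] >= ADEQUATE_EFFECTS]
    return min(adequate, key=lambda c: c[2]) if adequate else None

print(preferred(candidates)[0])   # -> "John is a man of good taste"

The same comparison carries over to a man of influence, to which the discussion now returns.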
The interpretation of this phrase, if it does not involve enrichment processing, will yield vagueness, or indeterminacy in the meaning of this phrase (some degree of influence, whether great or small). Indeterminacy of meaning motivates enrichment processing, and thus a man of influence will be enriched to a man of considerable/great influence. Since the linguistically decoded logical form a man of influence brings about vagueness rather than a truism, it yields a small contextual effect with minimal processing effort. However, the hearer tries to make explicit the meaning of this phrase in order to solve the vagueness of the logical form, and thus he/she enriches the phrase meaning. Let us here recapitulate the foregoing discussion of the two scales (i.e. contextual effect and processing effort) in relation to enrichment processing of the target construction in the form of an outline summary in Table 4. Table 4. Two Scales Relating to Enrichment Processing in ‘a man of taste/influence’ Logical Form and Explicature Contextual Effect Processing Effort John is a man of taste minimal minimal (as a logical form [i.e. truism]) John is a man of great taste (adequately) large small (as an explicature) John is a man of influence small minimal (as a logical form [i.e. vagueness]) John is a man of great influence (adequately) large small (as an explicature) What is common to both these phrases (i.e. a man of taste and a man of influence) for their interpretation is that enrichments are motivated by the lack of relevance of the linguistically decoded logical forms of their original utterances. Comparison of the contextual effects of the logical forms between a man of taste and a man of influence suggests that the lack of relevance is a matter of degree. 4.2. Advantage-disadvantage hypothesis So far we have discussed how enrichment is processed in the ‘man of + N’ construction, but our discussion has not accounted for the fact that meanings added in explicature of this target construction tend to have positive evaluative meanings (e.g. good in a man of good taste). In order to shed light on this problem, I put forward the following hypothesis: an ‘advantage-disadvantage’ hypothesis. (9) Advantage-disadvantage: The hearer's evaluative viewpoint is based on the ‘advantage-disadvantage’ that he/she conceives of in his/her knowledge or assumptions about the world (e.g. schemata). ‘Advantage-disadvantage’ refers to a human's psychological response to situations in which things happen. Further, I claim that a hearer's judgement of advantage/disadvantage is made in terms of a human's schemata (or encyclopaedic knowledge) in his/her beliefs and assumptions about the world. We can apply the advantage-disadvantage hypothesis to the elucidation of enrichment processing in the ‘man of + N’ construction (e.g. a man of (good) taste). Possession of properties such as taste, sense, influence, learning will be advantageous for a human's life and thus it will be evaluated as positive. It is for this reason that phrases such as a man of taste or a man of influence usually receive positive evaluative meanings for their explicatures. In order to make the preceding hypothesis defensible, we must examine more examples. (10) a. This milk has a smell. b. = This milk has a *good/bad smell. The state in which milk has a smell is supposed to be abnormal and the milk in this abnormal state will be unhealthy (i.e. disadvantage). 
This inferential process on the basis of the advantage-disadvantage hypothesis gives rise to the additional sense of bad, i.e. a negative evaluative meaning, to the explicature presented in (10b). On this basis, the foregoing hypothesis seems to be on the right track. 5. Conclusion This paper has attempted to elucidate the enrichment processing with evaluative meanings in the man of + N construction. Our corpus investigation of the target construction reveals that phrases which yield  21 enrichment with evaluative meanings can split into two types, viz. the truism type (e.g. a man of taste) and the vagueness type (e.g. a man of influence). The common motivation for enrichment between these two types is a lack of relevance. That is, the linguistically decoded logical forms of these types of phrases cannot bring about contextual effects, which should be large in context in order for the newly impinging information (i.e. utterance) to be relevant. In order to avoid the lack of relevance, or in other words, to follow the communicative principle of relevance, the hearer enriches the logical form to give rise to an appropriate contextual effect. Moreover, we found that enrichment of the man of + N construction tended to require positive evaluative meanings (e.g. a man of good taste). Consequently I put forward the advantage-disadvantage hypothesis for the explication of the hearer's evaluative viewpoint. One may claim that this hypothesis comes into conflict with the principle of relevance methodologically, because the principle of relevance is supposed to be a unitary principle which governs all human communication. However, the hypothesis I have provided is not a rule but a description of a tendency regarding the hearer's evaluative viewpoints. While the hearer's evaluative viewpoint tends to be based on normal states in his/her schemata, the choice of the normal state which is utilised for interpretation of an utterance is contingent on the hearer's contextual assumption at a given moment, in order to follow the principle of relevance. The advantage-disadvantage hypothesis is likely to be applied to the explication of evaluative enrichment in other constructions or even in constructions in other languages. With regard to the possibility of generalisation of this hypothesis, however, we have to wait for further research. *This is a revised version of Akiyama (2000: Unpublished paper). References Akiyama, T. 2000 Pragmatic enrichment in the man of + N construction. Unpublished paper. University of Lancaster. Allan, Keith (1986) Linguistic Meaning, vol.1. London: Routledge & Kegan Paul. Blakemore, D. 1987 Semantic constraints on relevance. Oxford, Blackwell. Blakemore, D. 1992 Understanding utterances: an introduction to pragmatics. Oxford, Blackwell. Blakemore, D. 1995 Relevance Theory. In Verschueren, J., J. Östman, J. Blommaert (eds.) Handbook of pragmatics, manual. Amsterdam: John Benjamins. p.p. 443-52. Eysenck, M. W., M. T. Keane 20004 Cognitive psychology: a student handbook. Hillsdale, N. Jersey: Lawrence Erlbaum. Sperber, D., D. Wilson 19952 Relevance: communication and cognition, Oxford: Blackwell. Verschueren, Jef (1999) Understanding Pragmatics. London: Arnold Wilson, D., D. Sperber 1998 Pragmatics and Time. In Carston, R., S. Uchida (eds.) Relevance theory: applications and implications, Amsterdam: John Benjamins Publishing Company. 
Word order variations and spoken man-machine dialogue in French: a corpus analysis on the ATIS domain Jean-Yves Antoine, Jérôme Goulian VALORIA, University of South Brittany IUP Vannes, rue Yves Mainguy, F-56000 Vannes, France Phone: +33 2 97 68 32 10 — Email: {Jean-Yves.Antoine,Jerome.Goulian}@univ-ubs.fr WWW: http://www.univ-ubs.fr/valoria/antoine 1. Introduction During the last decade, spoken man-machine dialogue has seen significant improvements that should shortly lead to the development of systems for real use. In spite of these indisputable advances, numerous limitations still restrict the spread of spoken dialogue systems in common use. In particular, current research in spoken man-machine communication seriously lacks genericity. Most spoken dialogue systems are indeed concerned with a single application domain: transport information (the ATIS domain). This task is very restricted, which allows the development of ad hoc processing methods that ignore most of the structure of the sentence — see for instance (Minker, 1999) for an overview concerning speech understanding. Although these approaches show significant robustness on spontaneous speech, their portability to other application domains remains an open issue (Hirschman, 1998): one may reasonably assume that less restricted tasks require a more detailed linguistic analysis. As a result, future advances in man-machine communication depend on the improvement of current spoken language models. In our opinion, corpus linguistics should be of great help for the development of such improved models: - the analysis of large task-specific corpora should provide a precise characterisation of the linguistic phenomena that occur in the application domain concerned. This characterisation is very useful for building a system prototype1, but it is helpful for evaluation purposes too (Antoine et al, 1999). - the comparison of different corpora should usefully assess the linguistic variability that occurs in spoken man-machine dialogue, and its causes (task influence, familiarisation with the task, kind of user, etc.). It should therefore provide answers to the important problem of genericity. In this paper, we present a corpus analysis which concerns specifically word order variations in spoken French. This analysis has been carried out on a corpus of spoken man-man dialogue — the Air France corpus (Morel et al., 1989) — that corresponds to the ATIS domain. First, the paper briefly presents the problem of word order freedom and its implications for natural language processing. We then detail the main results of this corpus analysis from a strictly linguistic point of view. In particular, we assess the influence of familiarisation with the task by means of a comparison of two subparts of the corpus (see section 3). We finally discuss the consequences of these linguistic observations for the development of spoken dialogue systems. 2. Word order and natural language processing Word order is an important question for human language technologies. For instance, the problem of word order freedom motivated the development of dependency grammars (Tesnière 1990, Mel'cuk 1988) in response to some weaknesses of standard phrase structure grammars2. 1 Several works have shown, for instance, that the errors of probabilistic language models may be the result of systematic failures on a restricted number of linguistic phenomena.
2 This controversy is still open: see for instance (Rambow and Joshi, 1994) or (Pollard and Sag, 1994). Likewise, stochastic language models (N-grams) depend to a large extent on word order. As a result, any increase in word order variation is likely to increase the perplexity of the language model harmfully. Generally speaking, two kinds of word order freedom should be distinguished (Holan et al., 2000): - weak word order freedom — called freedom of constituent order within a continuous head domain by Holan — where a constituent is free to move to several places but always remains continuous. The corresponding utterance therefore respects the constraint of projectivity. For instance: (1) in the morning Paul used to go shopping. - global word order freedom, which corresponds on the contrary to a relaxation of continuity. In such cases, some extracted elements are allowed to appear outside the constituent they are supposed to belong to: the corresponding utterance is therefore non-projective. Consider for instance the following wh-extraction (Hudson, 2000): (2) who do you think that Mary claims that Sarah likes Global word order freedom concerns above all free word order languages (Russian, Finnish, Czech, ...), whereas rigid languages (English for instance) are more concerned with weak variability (Holan et al., 2000). Written French should be considered a rigid word order language (Covington, 1990). Spoken French, however, can hardly be identified with written French (Blanche-Benveniste et al., 1990). There is therefore no linguistic evidence that spontaneous spoken French presents a word order variability restricted to weak variation only. Besides the important question of the processing of non-projective structures (global word order freedom), weak variability constitutes a not inconsiderable problem for spoken language technologies. Since speech recognition usually provides several hypothetical utterances (N-best sequences), any increase of ambiguity / perplexity due to weak variability is likely to affect the robustness of the system noticeably. The variability of spontaneous spoken French is therefore an important problem from a computational point of view. The corpus analysis detailed in this paper aims precisely at answering this question on a specific task domain. 3. Air France corpus This analysis has been carried out on a speech corpus which was transcribed from the recording of real dialogues between a hostess of an air transport information service (ATIS domain) and several customers (Air France corpus). This corpus represents 103 dialogues that correspond to 5149 speech turns and 49703 words. It has been divided into two sub-corpora which correspond respectively to individual customers and travel agents (Table 1), in order to assess the influence of familiarisation with the task.
Table 1 — Description of the Air France corpus.
Corpus                 dialogues   speech turns   words   familiarisation with the task
individual customers   68          3676           n.c.    low
travel agents          35          1473           n.c.    high
Total                  103         5149           49703   —
This corpus does not correspond to a man-machine interaction, but on the contrary to a dialogue between two humans. As a matter of fact, our purpose is to characterise the real usages that should be modelled by spoken dialogue systems. 4. Corpus analysis We have made an exhaustive inventory of all the extractions that occur in the Air France corpus. Every observed phenomenon has been characterised according to several features (Gadet, 1992).
1) direction of the extraction ¾ anteposition (movement of an element to the left of the utterance) or postposition (movement to the right), 24 2) kind of construction ¾ we have distinguished the following kind of extracted constructions : - simple inversion (extraction without any specific linguistic mark) : sur Héraklion on n'a qu'un seul tarif special (AF.II.17.O14 ) [ for Heraklion we have only one unique special fare ] - dislocations (extraction marked by a clitic) le visa on l'a eu au consulat (AF.I.48.C6) [ the visa we have obtained it at the consulate ] - presentative structures (among which cleaved sentences) j'ai quelqu'un qu'est allé prendre des billets charters pour moi (AF.I.43.C9) [I have (there is) someone that went and took charter tickets for me] c'est une personne de nationalité tunisienne qui a eu ce billet (AF.I.4.O7) [This is a Tunisian that had this ticket ] 3) syntactic function of the extracted element ¾ subject, argument, adjunct or finally sentence complement ¾ also called associés (associated elements) in (Blanche-Benveniste, 1997). 4) projectivity ¾ continuous or discontinuous extraction. These linguistic features have been characterised by their frequencies of occurrence in the Air France corpus. Since the notion of sentence is not relevant in spoken French (Blanche-Benveniste et al., 1990), speech turns has been used as unit of segmentation for the computation of these probabilities. 5. Quantitative importance of word order variations in spontaneous spoken French At first, we present some conclusions that can be drawn from the observation of the whole corpus. The influence of the familiarisation with the task will be discussed in the following section, which concerns the comparison between the two sub-corpora (individual customers and travel agents). First of all, spontaneous spoken French seems to be ¾ given the considered task ¾ noticeably more flexible than written French. Table 2 shows3 indeed that a not inconsiderable part (13.6%) of the speech turns presents at least one word order variation. Table 2 — Frequency of word order variations in the Air France corpus (mean number of speech turns presenting at least one extraction). Corpus mean frequency standard variation minimal frequency on a dialogue maximal frequency on a dialogue individual customers 14.9% 6.9% 0.0 % 29.8 % travel agents 10.1% 8.2% 0.0 % 30.8 % Total 13.6 % 7.5 % 0.0 % 30.8 % Besides, the statistical distribution of these frequencies presents a high standard deviation. The use of word order variations is therefore very variable from a dialogue to an other. It is quite difficult to explain this variability by means of a unique cause. For instance, dialogic context (negociation, reformulating,...) is undoubtedly a noticeable source of variability, but one might reasonably assume that idiosyncratic factors can intervene too. Anyway, it appears that word order variations are rather common in spoken French. This is why they can not be ignored in the prospect of a robust spoken 3 This table and all the following ones present global results computed on the whole corpus as well as particular results observed on the sub-corpora that concern respectively “individual customers” and “travel agents”. The comparison of these last two corpora will be discussed in section 8. 25 man-machine communication. Fortunately, a detailed analysis of the observed extractions shows that the latter respect to a certain extent some rigid word order constraints. 6. 
Constrained extractions : projectivity and SVO canonical order As shown by table 3, most of word order variations correspond unsurprisingly (Gadet, 1992) to antepositions (82.5 % of the observed variations). This difference between antepositions and postpositions is statistically significant (c2 test4 of an identical distribution : CHIAF = 0.997). Table 3 — distribution of the extractions according to their direction (mean percentage of antepositions and postpositions). Corpus anteposition postposition standard deviation individual customers 82.9% 17.1% 18.4% travel agents 81.2% 18.8% 24.1% Total 82.5 % 17.5 % 20.4 % This predominance of the antepositions can be related to the distribution of the extracted elements according to their syntactic function (table 4). Word order variations concern above all subjects (30.7 % of the observed variations), sentence complements (30.0 %), adjuncts (27.4%), whereas subcategorized arguments represent only 12.0 % of the observed variations. This lesser occurrence of argument extractions is statistically significant (Student mean test5 of identical distribution of subject and arguments : Tsub/arg = 3.652 ; T(0.01) = 2.600). On the contrary, there is no significant difference between the three other functions ( Student mean test : Tsub/adj = 0.911 ; Tsub/scpl = 1.059 ; T(0.1) = 1.652). Table 4 — distribution of the word order variations according to the syntactic function of the extracted element. Corpus subjects arguments adjuncts sentence complements individual Mean 29.6% 12.6% 28.8% 29.0% customers (St. Dev.) (26.0%) (15,2 %6). (24.4%) (22.4%) travel agents Mean 34.6% 9.4% 22.5% 33.5% (St. Dev.) (35.6%) (16,1 %) (30.2%) (28.3%) Total Mean 30.7 % 12.0 % 27.4 % 30.0 % (St. Dev). (29.6 %) (15.5 %) (26.5 %) (24.5 %) This distribution seems to be coherent from a linguistic point of view. Generally speaking, written French follows a canonical SVO (subject-verb-object) order. Since adjuncts or sentence complements are not concerned by this ordering constraint, they are relatively free to move inside the sentence. Likewise, subject extractions follow in most cases a SVO order since they correspond very frequently to an anteposition (table 5). Subject extraction is consequently rather free. On the opposite, the position of arguments is rigidly fixed by the SVO canonical order. Argument extractions are thus unsurprisingly less frequent in our corpus. Table 5 — distribution of the subject extractions according to their direction (mean percentage of antepositions and postpositions). 4 see (Dudewicz & Mishra, 1988) 5 see (Dudewicz & Mishra, 1988). 6 This value of the standard deviation, which is greater than the corresponding mean value, shows simply that these distributions do not follows a Gauss distribution. 26 Corpus subject anteposition subject postposition standard deviation Individual customers 82.9 % 17.1 % n.c. Travel agents 77.3 % 22.7 % n.c. Total 80.6 % 19.4 % 20.4 % All things considered, most of the observed extractions preserve the canonical SVO order (table 6a). Thus, in spite of a frequent use of extractions, spoken French infringe hardly some fundamental ordering constraints. The inventory of non projective structures supports clearly this observation. Discontinuous extractions are indeed very rare in the Air France corpus (table 6b) : non projective structures, which are very embarrassing for most of parsers or language models, do not concern therefore spoken French. 
Table 6 — relative importance of the extractions that follow a canonical SVO order (6a : left) and relative importance of projective extractions (6b : right) Corpus % of extractions with SVO order % of speech turns with SVO order % of continuous extractions % of continuous speech turns customers 90.4 % 98.6 % 97.5 % 99.5 % travel agents 90.2 % 99.0 % 98.4 % 99.8 % Total 90.3 % 98.7 % 97.7 % 99.6 % In conclusion, spoken man-machine dialogue in the ATIS domain seems to be noticeably concerned by a weak word order variability that preserves nevertheless a SVO canonical order, whereas global word order freedom is not really observed. 7. Functions and extracted structures Extracted structures follow some regularities that should be usefully considered for computational purposes. Table 7 presents for instance the distribution of word order variations according to the extracted construction used. This distribution presents a rather high dispersion. It is however possible to distinguish simple inversion as the most frequent extracted construction (60.6 % of the extractions). This predominance is statistically significant (Student mean test of identical distributions : T = 4.473 ; T(0.01) = 2.600). On the opposite, the difference between dislocations (24.9 %) and presentative structures (among which cleaved sentences ; 13.2 %) is not statistically significant (Student mean test of identical distributions : T = 1.366 ; Tinv(1.366) = 0.174). Table 7 — distribution of the word order variations according to their structure. Corpus simple inversions dislocations presentatives (cleaved sentences) other constructions individual Mean 60.8 % 24.0 % 13.8% 1.4 % customers (St. Dev.) (27.0 %) (17.3 %) (22.7%) (8.8 %) travel agents Mean 60.2 % 28.3 % 10.5% 1.0 % (St. Dev.) (36.0 %) (36.1%) (20.6%) (8.1 %) Total Mean 60.6 % 24.9 % 13.2 % 1.3 % (St. Dev). (30.2 %) (25.5 %) (22.1 %) (8.6 %) A detailed analysis of these distributions according to the syntactic function of the extracted element provides further conclusions on these structural regularities. Thus, most of subjects and to a lesser extent most of arguments extractions are linguistically marked (dislocations and cleaved structures : table 8). Table 8 — distribution of subjects and arguments extractions according to their structure. 27 Corpus subject extractions argument extractions inversion dislocation + presentative inversion dislocation + presentative customers 4.2 % 95.8 % 30.5 % 69.5 % travel agents 6.1 % 93.9 % 44.4 % 55.6 % Total 4.6 % 95.4 % 32.7 % 67.3 % This predominance of marked extractions for the argument function is statistically significant (c2 test on a not significant predominance : CHIAF = 0.945). On the opposite, adjuncts and sentence complements extractions corresponds almost always to a simple inversion (table 9). Table 9 — distribution of adjuncts and arguments extractions according to their structure. Corpus adjunct extractions sentence complement extractions inversion dislocation + presentative inversion dislocation + presentative customers 97.1 % 2.9 % 100.0 % 0.0 % travel agents 95.3 % 4.7 % 100.0 % 0.0 % Total 96.8 % 3.2 % 100.0 % 0.0 % Once again, these observations are coherent from a linguistic point of view. On the one hand (subject or argument extraction), cleaved structures or clitics in dislocations compensate partially for an eventual change of the canonical SVO order. On the other hand, adjuncts or sentence complements extraction does not require such marked constructions, since their position is relatively free. 8. 
Influence of familiarisation with the task In the previous sections, we have only considered global observations on the whole Air France corpus. Now, any significant difference between the “customers” and the “travel agents” corpora may reveal an interesting influence of familiarisation with the task on word order variations. The distinction between these two corpora seems a priori relevant. Dialogues are indeed more direct with travel agents, whereas negotiations and reformulations are noticeably more frequent with individual customers. This observation squares with the fact that dialogues with travel agents are shorter than those with individual customers. A Wilcoxon-Mann-Whitney test7 shows that this difference is statistically significant (Z = 3.819; Z(0.01) = 2.576).
Table 10 — dialogue length according to kind of user (mean number of speech turns per dialogue).
Corpus                 mean dialogue length   standard deviation
individual customers   54.1                   33.5
travel agents          42.1                   26.4
Familiarisation with the task therefore has a noticeable influence on the dialogue structure. Does this influence concern extractions too? An exhaustive comparison of the results detailed in Tables 2 to 9 suggests that word order variations are independent of this familiarisation. Student mean tests (Table 11) indeed show that there is no statistically significant influence on the different features that characterise word order variations. This observation is obviously interesting for genericity purposes.
Table 11 — Statistical tests (Student mean test) of significance of a feature difference between the “individual customers” and “travel agents” corpora (critical value T(0.1) = 1.660).
Feature                                    T       Tinv(T)
frequency of occurrence                    0.628   0.532
direction                                  0.284   0.777
kind of construction: inversion            0.212   0.833
kind of construction: dislocation          0.943   0.348
kind of construction: presentative         0.715   0.476
syntactic function: subject                0.503   0.616
syntactic function: argument               0.220   0.827
syntactic function: adjunct                0.213   0.832
syntactic function: sentence complement    0.154   0.878
projectivity                               0.380   0.705
7 see (Dudewicz & Mishra, 1988). 9. Conclusion: extractions and NLP for man-machine communication Since they usually ignore most of the syntactic structure of the sentence, current spoken dialogue systems have so far not been concerned with the problem of word order freedom. They indeed circumvent this problem thanks to ad hoc approaches that take advantage of the very restricted nature of the task considered. This would not be the case with richer applications or finer dialogue models. As a result, the question of word order freedom will soon arise because of the increasing need8 for more detailed language modelling. Now, this corpus analysis provides several lessons on word order freedom that are interesting from a computational point of view. First of all, the question of discontinuity (global word order freedom), which is very problematic for natural language processing, fortunately does not concern spoken French in the ATIS domain. Since discontinuous extractions appear to be very rare, the parsing of these non-projective structures does not constitute a relevant problem for future research on spoken dialogue systems. On the contrary, the processing of weak word order freedom will become increasingly important as more complex applications are considered by spoken man-machine communication. The frequent occurrence of extracted constructions in the Air France corpus shows that this question can no longer be set aside.
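As a brief note on reproducibility before closing: the comparisons reported in sections 5 to 8 rest on standard significance tests (χ² tests, Student mean tests and a Wilcoxon-Mann-Whitney test), which can be re-run with off-the-shelf tools. The sketch below assumes SciPy is available and uses invented per-dialogue figures as placeholders; it does not reproduce the Air France data, and the exact test variants used by the authors may differ in detail.

# Illustrative only: the per-dialogue figures below are invented placeholders, not the
# Air France data. They show how tests of the kind used in sections 5-8 can be run.
from scipy import stats

# Proportion of speech turns with at least one extraction, per dialogue (fake values)
customers = [0.12, 0.18, 0.09, 0.21, 0.15, 0.10, 0.17, 0.14]
agents    = [0.08, 0.13, 0.05, 0.16, 0.11, 0.07]

# Student mean test of an identical distribution between the two sub-corpora
t, p_t = stats.ttest_ind(customers, agents, equal_var=True)

# Wilcoxon-Mann-Whitney test (used in section 8 for dialogue length)
u, p_u = stats.mannwhitneyu(customers, agents, alternative="two-sided")

# Chi-squared test of an identical anteposition/postposition split (invented counts)
chi2, p_chi2, dof, _ = stats.chi2_contingency([[120, 25],   # customers: ante / post
                                               [ 45, 10]])  # agents:    ante / post

print(f"t = {t:.3f} (p = {p_t:.3f}), U = {u:.1f} (p = {p_u:.3f}), "
      f"chi2 = {chi2:.3f} (p = {p_chi2:.3f})")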
Our inventory of several structural regularities (canonical SVO order, specific use of each extracted construction) fortunately suggests that the robust processing of extractions is not an impossible task. Finally, this corpus analysis did not reveal any influence of familiarisation with the task on word order variations. This is an interesting result that guarantees, to some (restricted) extent, the genericity of spoken dialogue systems. However, this study did not investigate the independence of word order variations from the application domain. In order to answer this important question, we are currently analysing two additional corpora whose application domain is tourism information. Since the corresponding tasks are clearly less restricted than in the ATIS domain, we hope to obtain interesting conclusions on genericity. Preliminary results suggest that word order variations are rather independent of the application domain, but also that other factors (degree of interactivity, for instance) may noticeably affect the frequency of occurrence of spoken extractions (Antoine, 2001). One aim of this paper was to show the benefit that spoken man-machine communication should obtain from a rigorous analysis of representative corpora. Beyond the question of word order variations, we hope that this paper has, at least partially, lived up to this expectation. Acknowledgements The authors are grateful to Agnès Hamon and Valérie Monbet (SABRES Laboratory of Applied Statistics, Vannes, France), who provided great assistance in carrying out the statistical tests of significance presented in this paper. 8 For instance, see (Chelba and Jelinek, 2000) for an illustration of the need for structured language models in speech recognition. References Antoine J-Y Goulian J 2001 Linguistique de corpus et ingénierie des langues appliquée à la CHM orale : étude des phénomènes d'extraction en français parlé sur deux corpus de dialogue oral finalisé, TAL, Hermès Paris France (submitted). Antoine J-Y Siroux J Caelen J Villaneau J Goulian J Ahafhaf M 2000 Obtaining predictive results with an objective evaluation of spoken dialogue systems: experiments with the DCR assessment paradigm, In Proceedings of the 2nd Conference on Language Resources and Evaluation, LREC'2000, Athens, Greece. Blanche-Benveniste C Bilger M Rouget C and van den Eynde K 1990 Le français parlé : études grammaticales, CNRS Editions Paris France. Blanche-Benveniste C 1997 Approches de la langue parlée en français, Ophrys Paris France. Chelba C Jelinek F 2000 Structured language modeling, Computer Speech and Language 14(4) pp. 283-332, Academic Press London UK. Covington M 1990 A dependency parser for variable-order languages, Research Report AI-1990-01, University of Georgia, USA. Dudewicz E J Mishra S N 1988 Modern mathematical statistics, Wiley series in probability and mathematical statistics, John Wiley & Sons New York, USA. Gadet F 1992 Le français populaire, PUF Paris France. Hirschman L 1998 Language understanding evaluations: lessons learned from MUC and ATIS. In Proceedings of the 1st Conference on Language Resources and Evaluation, LREC'98, Granada Spain, pp 117-122. Holan T, Kubon, Oliva K, Plátek M 2000 On complexity of word order, TAL 41(1) pp. 273-300, Hermès Paris. Hudson R 2000 Discontinuity, TAL 41(1) pp. 15-56, Hermès Paris. Mel'cuk 1988 Dependency syntax: theory and practice, State University of New York Press, Albany, USA. Minker W, Waibel A, Mariani J 1999 Stochastically based semantic analysis, Kluwer Academic,
Amsterdam the Netherlands. Morel M-A 1989 Analyse linguistique de corpus, Publications de la Sorbonne Nouvelle Paris France. Pollard C, Sag I 1994 Head-driven Phrase Structure Grammar, University of Chicago Press, Chicago, USA. Rambow O, Joshi A 1994 A formal look at dependency grammars and phrase-structure grammars with special considerations of word-order phenomena, In Wanner L. (ed.) Current issues in Meaning- Text Theory, Pinter London UK. Tesnière L 1959 Elements de syntaxe structurale, Klincksiek Paris France. 30 Sociopragmatic annotation: New directions and possibilities in Historical Corpus Linguistics Dawn Archer and Jonathan Culpeper Department of Linguistics and Modern English Language, Lancaster University {d.archer, j.culpeper}@lancaster.ac.uk The bias of computer searches towards form (e.g. a letter or string of letters) can be a major difficulty for linguistic analyses of texts. In particular, how form relates to context in interactive texts tends to be overlooked. Corpus linguists have been seeking to improve our understanding of the relationships between forms (e.g. collocations, lexical networks, keywords); in other words, they focus on text and co-text. However, little work has considered context in terms of situational, sociological and cultural phenomena. Consequently, our aim in this paper is to demonstrate how a sophisticated annotation scheme can help bridge the gap between text and contexts. Our work is located in the area of historical corpus linguistics, an area where there are particular difficulties relating to the retrieval of contextual factors (e.g. the possibility of asking speakers is not available!) and where little work has been done on annotation of any kind. Specifically, we will introduce an annotated sub-section of the Corpus of English Dialogues (CED), constructed by Jonathan Culpeper and Merja Kytö (Uppsala University). The Sociopragmatic Corpus, as it is called, covers a 120 year time span (1640-1760), and consists of more than 240,000 words of “authentic” and constructed speech drawn from two text-types: trial proceedings and drama. We will then describe how the annotation system that we have developed: (i) Accommodates the investigation of language set in various context(s) (for example, speaker/hearer relationships, social roles, and sociological characteristics such as gender), and (ii) Treats contexts as dynamic (cf. other annotation systems, such as the spoken sub-section of the BNC, which concentrate upon the relatively static characteristics of speakers). More generally, this paper shows how computational linguistics can be used in pragmatics research. 31 A corpus for interstellar communication Eric Atwell eric@comp.leeds.ac.uk http://www.comp.leeds.ac.uk/eric/ John Elliott jre@comp.leeds.ac.uk http://www.comp.leeds.ac.uk/jre/ Centre for Computer Analysis of Language And Speech, School of Computing, University of Leeds, Leeds, Yorkshire, LS2 9JT England. 1. Introduction: SETI, the Search for Extra-Terrestrial Intelligence Many researchers in Astronomy and Astronautics believe the Search for Extra-Terrestrial Intelligence is a serious academic enterprise, worthy of scholarly research and publication (e.g. Burke-Ward 2000, Couper and Henbest 1998, Day 1998, McDonough 1987, Sivier 2000, Norris 1999), and large-scale research sponsorship attracted by the SETI Institute in California. Most of this research community is focussed on techniques for detection of possible incoming signals from extra-terrestrial intelligent sources (e.g. 
Turnbull et al 1999), and algorithms for analysis of these signals to identify intelligent language-like characteristics (e.g. Elliott and Atwell 1999, 2000). However, recently debate has turned to the nature of our response, should a signal arrive and be detected. For example, the 50th International Astronautical Congress devoted a full afternoon session to the question of whether and how we should respond to an initial message identified to be of extraterrestrial origin. Interestingly, we (the authors of this paper) were the only corpus linguists present at this session: the Congress seemed to assume that the design of potentially the most significant communicative act in history should be decided by astrophysicists. We believe that others should be aware of and contribute to what is effectively a corpus design project; and that the Corpus Linguistics research community has a particularly significant contribution to make. 2. Past ideas on how to signal our existence to extra-terrestrials Speculations about how to signal our existence to extraterrestrials began at least a century ago. Early ideas focussed on pictorial messages, transmitted visually by drawing over very large expanses of the Earth's surface. “For example, the Pythagorean theorem could be illustrated visually during the daytime by clearing vast expanses of forest in Siberia to show the areas surrounding a right-angled triangle. Or during the night, canals dug into the Sahara desert in the shape of a circle could be filled with kerosene; when lit, the flames would provide a pictorial signal of our existence.” (Vakoch 1998a). More recently, the Pioneer and Voyager spacecraft, sent to explore planets in our solar system but then left to drift out into interstellar space, carried messages to any extraterrestrials who might intercept them in their travels beyond the solar system. On the Pioneer plaque, an outline of the Pioneer spacecraft is seen behind figures of two humans. At the bottom of the plaque, the same spacecraft is shown in a smaller scale as it passes through the solar system on its journey from Earth. A diagram of fifteen converging lines shows the Earth's location in time and space in relation to prominent pulsars. (Sagan et al 1972, Vakoch 1998a). The Voyager spacecraft each bear similar diagrams, and in addition a record (with player and encoded instructions on how to play) illustrating basics of human knowledge of mathematics and physics, and a wide variety of pictures of our world. (Sagan 1978, Vakoch 1998a). There have also been attempts to deliberately transmit messages from the Earth's surface. Most notably, in 1974 astronomers at the Arecibo radio-telescope in Puerto Rico sent a signal of 1,679 radiowave pulses to M13, a star-cluster 25,000 light-years away. 1679 is the product of two prime numbers, 23 and 73; arranging the pulses into a rectangle of 23 columns by 73 rows creates a pictogram showing a radio-dish, a human, and some basic scientific information. (Couper and Henbest 1998, Vakoch 1998a). 32 3. Current SETI ideas on message construction The Arecibo experiment was a deliberate attempt at message transmission. Humanity has been transmitting radio signals on a much larger scale for decades, since radio transmissions intended for terrestrial reception are also beamed into outer space; thus an extra-terrestrial first encounter with human culture may well be through accidental reception of television and radio broadcasts, as foreseen in the novel and subsequent film Contact (Sagan 1988). 
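The arithmetic behind the Arecibo signal described above is worth spelling out, since it is the entire decoding key: 1,679 has a unique factorisation into two primes, so a receiver has only two candidate rectangles to try. The sketch below illustrates this; the bit string is a random placeholder, not the actual Arecibo pulse sequence. Accidental broadcast leakage, by contrast, requires no such design.

# Why 1,679 pulses? 1679 = 23 * 73, and both factors are prime, so there are only two
# candidate rectangles (23x73 or 73x23) into which a receiver can arrange the bits.
# The bit string below is a random stand-in, NOT the real Arecibo message.
import random

def prime_factor_pairs(n):
    """Return all (a, b) with a * b == n and 1 < a <= b."""
    return [(a, n // a) for a in range(2, int(n ** 0.5) + 1) if n % a == 0]

N = 1679
print(prime_factor_pairs(N))          # -> [(23, 73)]

bits = [random.randint(0, 1) for _ in range(N)]   # placeholder pulses

def as_grid(bits, columns):
    """Arrange a flat bit list row by row into a picture of the given width."""
    rows = [bits[i:i + columns] for i in range(0, len(bits), columns)]
    return "\n".join("".join("#" if b else "." for b in row) for row in rows)

# The 1974 transmission was meant to be read as 73 rows of 23 columns.
picture = as_grid(bits, columns=23)
print(picture.count("\n") + 1)        # -> 73 rows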
Reception of such “unintended” messages may prompt Extra-Terrestrials to initiate first contact; but many in the SETI research community (e.g. Vakoch 1999) feel it is important to plan a more deliberately designed, well-thought-out response message. (Vakoch 1998b) argues for “... the need for more intensive investigations of the linguistic aspects of SETI before a message is received”. (Vakoch 1998c, p705) also identifies several benefits of beginning work on construction of a reply message immediately, even before an incoming extraterrestrial message has been received and recognised: “(1) concretely understanding the challenge of creating an adequate reply; (2) helping decode messages from extraterrestrials; (3) creating interstellar compositions as a new form of art; (4) having a reply ready in case we receive a message; (5) providing a sense of concrete accomplishment; (6) preparing for an active search strategy; and (7) gaining public support for SETI.” In 1974 a signal of 1,679 bits was considered potentially significant and challenging to technology of the time, e.g. it took three minutes to transmit; a quarter of a century later, we are used to processing messages of megabytes, gigabytes, or bigger in terrestrial communication networks such as the Internet. It is clear that we could look beyond a single pictogram or collection of diagrams, to design a much larger Corpus of data to represent humanity. (Vakoch 1998c) advocates that the message constructed to transmit to extraterrestrials should include a broad, representative collection of perspectives rather than a single viewpoint or genre; this should strike a chord with Corpus Linguists for whom a central principle is that a corpus must be “balanced” to be representative. The consensus at the 50th International Astronautical Congress seemed to be to transmit an encyclopaedia summarising human knowledge, such as the Encyclopaedia Britannica, to give ET communicators an overview and “training set” key to analysis of subsequent messages. Furthermore, this should be sent in several versions in parallel: the text; page-images, to include illustrations left out of the text-file; and perhaps some sort of abstract linguistic representation of the text, using a functional or logic language (Ollongren 1999, Freudenthal 1960). 4. Enriching the message corpus with multi-level linguistic annotations The idea of “enriching” the message corpus with annotations at several levels should also strike a chord with Corpus Linguists. Natural language exhibits highly complex multi-layered sequencing, structural and functional patterns, as difficult to model as sequences and structures found in more traditional physical and biological sciences. Corpus Linguists have long known this, on the basis of evidence such as the following: · Language datastreams exhibit structural patterns at several interdependent linguistics levels, including: phonetic and graphemic transcription, prosodic markup, part-of-speech wordclasses, collocations, phraseological and collegational patterns, semantic word-sense classification, syntax or grammatical phrase structure, functional dependency structure, semantic predicate structure, pragmatic references, discourse or dialogue structure, communication act or speech act patterns. · Even within one such linguistic level, structural analysis is complex, with further interdependent sublevels. 
For example, the European Expert Advisory Group on Language Engineering Standards (EAGLES) report on parsing annotations (Leech et al 1996) recognises at least 7 separate yet interdependent sublayers of grammatical analysis which a full parser should aim to recognise; yet none of the state-of-the-art parsers evaluated in (Atwell 1996, Atwell et al 2000a) were capable of providing all 7 layers of analysis in their output. Different parsers analysed different subsets of these sublayers of grammatical information, making cross-parser comparisons and performance evaluations difficult if not meaningless. 33 · Furthermore, linguistic analysis at one level may depend on or require other levels of linguistic information; for example, (Demetriou and Atwell 2001) demonstrated that lexicalsemantic word-tagging subsumes or combines several knowledge sources including thesaurus class, semantic field, collocation preferences, and dictionary definition. · Some corpora have been annotated with several layers or levels of linguistic knowledge in parallel; for example, the SEC corpus (Taylor and Knowles 1988) has speech recordings, transcriptions, prosody markup, PoS-tags, parse-trees; the ISLE corpus (Menzel et al 2000, Herron et al 1999, Atwell et al 2000b) has language-learner speech recordings, transcriptions, corrections, prosody, expert evaluations. Other annotations can be added automatically by software, e.g. semantic tags (Demetriou and Atwell 2001), ENGCG Constraint Grammar dependency structures (Karlsson et al 1995, Voutilainen et al 1996). 5. Natural language learning In the 1980s, most NLP researchers used their ‘expert intuitions’ to guide development of large-scale grammars; a language model was essentially an `expert system’ encoding the knowledge of a human linguistics expert. This kind of knowledge model was harder to `scale up’ to cover more and more language data, and it relied on existing expert knowledge. More recently, this has given way to the use of corpora or large text samples, some of which are annotated or `tagged’ with expert analyses. Tagged and parsed corpora can be used by linguists as a testbed to guide their development of grammars (see, for example Souter and Atwell 1994); and they can be used to train Natural Language Learning or data-mining models of complex sequence data. Several initiatives are under way to collect language datasets for language modelling research, for example, ICAME, the International Computer Archive of Modern and medieval English (based in Bergen); ELRA, the European Language Resources Association (based in Paris); LDC, the Linguistic Data Consortium (based at the University of Pennsylvania). A growing number of NLP researchers are looking into ways to utilise these new training-set resources: the Association for Computational Linguistics has established a Special Interest Group in Natural Language Learning (machine-learning of language sequence-patterns from corpus data) which holds annual conferences, e.g. CoNLL'2000. Given appropriate annotated Corpus data, many NLP problems can be generalised to “mappings” between linguistic levels of analysis, for example: · Word-class identification (mapping words into syntactic/semantic sets or classes), e.g. (Atwell and Drakos 1987, Hughes 1993, Finch 1993, Hughes and Atwell 1994, Teahan 1998) · Part-of-Speech wordtagging (mapping word-sequences onto wordclass-tag sequences), e.g. 
(Leech et al 1983, Atwell 1983, Eeg-Olofsson 1991, Brill 1993, Atwell et al 1984, 2000a); · Sentence-structure analysis or parsing (mapping word- and/or word-class sequences onto parses), e.g. (Sampson et al 1989, Atwell 1987, 1988, 1993, Black et al 1993, Bod 1993, Briscoe 1994, Jelinek et al 1992, Joshi and Srinivas 1994, Magerman 1994, O'Donoghue 1993, Schabes, Roth and Osborne 1993, Sekine and Grishman 1995) · Semantic analysis or word-sense tagging (mapping word-sequences onto semantic tags or meaning-analyses), e.g. (Demetriou 1993, Demetriou and Atwell 1994, 2001, Bod et al 1996, Kuhn and de Mori 1994, Weischedel et al 1993, Wilson and Rayson 1993, Wilson and Leech 1993, Jost and Atwell 1993) · Machine Translation (mapping a source-language word-sequence onto a target-language word-sequence), e.g. (Brown et al 1990, Berger et al 1994, Gale and Church 1993) · Speech-to-text recognition (mapping a speech signal onto a phonetic and graphemic transcription word-sequence), e.g. (Demetriou and Atwell 1994, Giachin 1995, Jelinek 1991, Kneser and Ney 1995, Yamron 1994, Young and Bloothooft 1997). Researchers have tried casting these NLP mapping subtasks in terms of Natural Language Learning models, such as Hidden Markov Models (HMMs), Stochastic Context-Free Grammar (SCFG) parsers, and Data-Oriented Parsing (DOP) models. The complex patterns found in language data call for sophisticated stochastic modelling. For example, Hidden Markov Models have become widely used in Language Engineering applications because they are well-understood and computationally tractable (e.g. Young and Bloothooft 1997, Manning and Schutze 1999, Jurafsky and Martin 2000, Huang 1990, MacDonald 1997, Elliott et al 1995, Woodward 1997). Although (Chomsky 1957) famously demonstrated that a finite-state model is a theoretically inadequate approximation for certain aspects of language modelling, Language Engineers have come to realise that HMMs can be adapted to work most of the time, and that the theoretically problematic cases alluded to by Chomsky are infrequent enough in "real" applications to be ignored in practice. Language Engineering researchers have been searching for higher-level models which effectively extend Hidden Markov Models in limited ways without extending the computational cost prohibitively, for example higher-order Markov models, limited stochastic context-free grammars, and hybrid statistical/knowledge-based models. Linguists have found 'Universal' features which appear to be common to and characteristic of all human languages (e.g. Zipf 1935, 1949); but few of these have been stated in terms of or related to stochastic models. We know how to extract low-level linguistic patterns from raw text using unsupervised learning algorithms (e.g. Atwell and Drakos 1987, Hughes 1993, Finch 1993, Hughes and Atwell 1994, Elliott and Atwell 1999, 2000, Elliott et al 2000a,b, 2001, Manning and Schutze 1999, Jurafsky and Martin 2000); a "Rosetta Stone" key to English, annotated with rich linguistic analyses, should help ET communicators map between symbols and meanings using supervised as well as unsupervised learning algorithms. 6. A corpus linguistics SETI advisory panel Astronomers have not sought to consult Corpus Linguists on the design of this Corpus for Interstellar Communication; but we can and should make an informed contribution. The parallel corpus and multi-annotated corpus are not new concepts to Corpus Linguistics.
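To make the "mapping" view of natural language learning concrete, here is a minimal, self-contained sketch of supervised learning of the word-to-wordclass mapping from a tagged sample: transition and emission counts are estimated from a toy annotated corpus and an unseen sentence is tagged with Viterbi decoding. This is purely illustrative Python, not taken from any of the systems cited above; the toy corpus, tag names and crude add-one smoothing are all invented for the example.

from collections import defaultdict

# Toy tagged corpus: the kind of annotated resource a "Rosetta Stone" corpus would supply.
TAGGED = [
    [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("sat", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("slept", "VERB")],
]

def train(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions from tagged sentences."""
    trans = defaultdict(lambda: defaultdict(int))
    emit = defaultdict(lambda: defaultdict(int))
    tags = set()
    for sent in sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            tags.add(tag)
            prev = tag
    return trans, emit, tags

def viterbi(words, trans, emit, tags):
    """Return the most likely tag sequence under a bigram model (crude add-one smoothing)."""
    def p(table, a, b):
        total = sum(table[a].values()) + len(tags)   # illustrative smoothing only
        return (table[a][b] + 1) / total
    best = [{t: (p(trans, "<s>", t) * p(emit, t, words[0]), None) for t in tags}]
    for i, w in enumerate(words[1:], start=1):
        col = {}
        for t in tags:
            score, back = max(
                ((best[i - 1][s][0] * p(trans, s, t) * p(emit, t, w), s) for s in tags),
                key=lambda x: x[0],
            )
            col[t] = (score, back)
        best.append(col)
    last = max(best[-1], key=lambda t: best[-1][t][0])
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        seq.append(last)
    return list(reversed(seq))

trans, emit, tags = train(TAGGED)
print(viterbi(["a", "dog", "sat"], trans, emit, tags))   # expected: ['DET', 'NOUN', 'VERB']

Given richer annotation layers of the kind listed in Section 4, the same supervised set-up extends in principle to the parsing, word-sense tagging and translation mappings listed above.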
We have a range of standards and tools for design and annotation of representative corpus resources. Furthermore, we know which analysis schemes are more amenable to supervised learning algorithms; for example, the BNC tagging scheme and the ICE-GB parsing scheme have been demonstrated to be machine-learnable in a tagger and parser respectively. An Advisory Panel of corpus linguists could design and implement an extended Multi-annotated Interstellar Corpus of English. The following are ideas for the Advisory Panel to consider: · augment the Encyclopaedia Britannica with a collection of samples representing the diversity of language in real use. Candidates include the LOB and/or BNC corpus; · as an additional “key”, transmit a dictionary aimed at language learners which has also been a rich source for NLP learning (e.g. Demetriou and Atwell 2001); a good candidate would be LDOCE, the Longman Dictionary of Contemporary English, which uses the Longman Defining Vocabulary; · supply our ET communicators with several levels of linguistic annotation, to give them a richer training set for their natural language learning attempts. We suggest that initial (i) raw text and (ii) page-images should be augmented with some or all of (iii) XML markup, (iv) PoS-tagging, (v) phrase structure parses, (vi) dependency structure analyses, (vii) coreference markup, (viii) dialogue act markup, (ix) semantic analyses. · Add translations of the English text into other human languages; although the International Astronautical Congress seemed to assume Humanity should be represented by English, multilingual annotations may actually be useful in natural language learning algorithms. This calls for a large-scale corpus annotation project, which may not seem immediately justifiable to computational linguistics research funding bodies such as the UK Engineering and Physical Sciences Research Council (EPSRC). However, the International Astronautical Congress also discussed plans to 35 proactively make interstellar contact using existing astronautical technology, by firing a satellite-based laser cannon at a range of nearby (in astronomical terms) potentially suitable targets. If this succeeds and we receive a message back, the need for our Interstellar Corpus Advisory Panel becomes more urgent. Of course, this Interstellar Corpus Advisory Panel should be chaired by an acknowledged expert in English grammar and semantics (eg Quirk et al 1972, 1985, Wilson and Leech 1993, Leech 1969, 1971, 1974, 1983, 1994), English language learning (e.g. Leech 1986, 1994, Quirk et al 1972, 1985 ), and corpus design, implementation, annotation, standardisation, and analysis (e.g. Leech et al 1983, 1996, Atwell et al 1984, Garside et al 1987, Black et al 1993, Leech 1991, 1992, 1993a,b): Professor Geoffrey Leech. 7. References Atwell E 1983 Constituent likelihood grammar. ICAME Journal 7: 34-65. Atwell E 1987 A parsing expert system which learns from corpus analysis. In Meijs W (ed) Corpus Linguistics and Beyond: Proceedings of the ICAME 7th International Conference on English Language Research on Computerised Corpora, Amsterdam, Rodopi, pp227-235. Atwell E 1988 Transforming a parsed corpus into a corpus parser. In Kyto M, Ihalainen O, Risanen M (eds) Corpus Linguistics, Hard and Soft: Proceedings of the ICAME 8th International Conference on English Language Research on Computerised Corpora. Amsterdam, Rodopi, pp61-70. Atwell E 1993 Corpus-based statistical modelling of English grammar. 
In Souter C, Atwell E (eds), Corpus-based computational linguistics: proceedings of the 12th conference of the International Computer Archive of Modern English. Amsterdam, Rodopi, pp195-214. Atwell E 1993 Linguistic constraints for large-vocabulary speech recognition. In Atwell E (ed), Knowledge at Work in Universities: Proceedings of the second annual conference of the Higher Education unding Council's Knowledge Based Systems Initiative. Leeds University Press, pp26-32. Atwell E 1996 Machine learning from corpus resources for speech and handwriting recognition. In Thomas J, Short M (eds), Using corpora for language research: studies in the honour of Geoffrey Leech, Harlow, Longman, pp151-166. Atwell E 1996 Comparative evaluation of grammatical annotation models. In Sutcliffe R, Koch H, McElligott A (eds), Industrial Parsing of Software Manuals. Amsterdam, Rodopi. Atwell E, Leech G, Garside R 1984 Analysis of the LOB corpus: progress and prospects. In Aarts J, Meijs W (eds) Corpus Linguistics: Proceedings of the ICAME 4th International Conference on the Use of Computer Corpora in English Language Research, Amsterdam, Rodopi, pp40-52. Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, Wilcock S 2000 A comparative evaluation of modern English corpus grammatical annotation schemes. ICAME Journal 24: 7-23. Atwell E, Howarth P, Souter C, Baldo P, Bisiani R, Bonaventura P, Herron D, Menzel W, Morton R, Wick H 2000 User-guided system development in interactive spoken language education. Natural Language Engineering, 6(3): 188-202. Atwell E, Drakos N 1987 Pattern recognition applied to the acquisition of a grammatical classification system from unrestricted English text. In Maegaard B (ed), Proceedings of the Third Conference of European Chapter of the Association for Computational Linguistics. New Jersey, Association for Computational Linguistics, pp56-63. Atwell E, Hughes J, Souter C 1995 Automatic extraction of tagset mappings from parallel-annotated corpora. In Tzoukermann E, Armstrong S (eds) From text to tags - issues in multilingual language analysis: Proceedings of EACL-SIGDAT. Dublin, Association for Computational Linguistics, pp10-17. 36 Berger A, Brown P, Cocke J, Pietra S, Pietra V, Gillett J, Lafferty J, Mercer R, Printz H, Ures L 1994 The candide system for machine translation. In Proceedings of the ARPA workshop on Human Language Technology. San Mateo, Morgan Kaufmann, pp152-157. Black E, Garside R, Leech G (eds) 1993 Statistically-driven computer grammars of English: the IBM/Lancaster approach. Amsterdam, Rodopi. Bod R 1993 Using an annotated corpus as a stochastic grammar. In Proceedings of the 6th EACL. Utrecht, Association for Computational Linguistics. Bod R, Bonnema R, Scha R 1996 A data-oriented approach to semantic interpretation. In Proceedings of the ECAI'96 workshop on corpus-oriented semantic analysis. Budapest, ECAI. Brill E 1993 A corpus-based approach to language learning. PhD thesis, University of Pennsylvania. Briscoe E 1994 Prospects for practical parsing of unrestricted text: robust statistical parsing techniques. In Oostdijk N, de Haan P (eds) Corpus-based research into language. Amsterdam, Rodopi. Brown P, Cocke J, Pietra S, Pietra V, Jelinek F, Lafferty J, Mercer R, Roossin P 1990 A statistical approach to machine translation. Computational Linguistics 16(2): 79-85. Burke-Ward R 2000 Possible existence of extra-terrestrial technology in the solar system. . Journal of the British Interplanetary Society 53(1&2): 1-12. Chomsky N 1957 Syntactic structures. 
The Hague, Mouton. Couper H, Henbest N 1998 Is anybody out there? London, Dorling Kindersley. Day P (ed) 1998 The search for extraterrestrial life. Oxford, Oxford University Press and the Royal Institution. Demetriou G 1993 Lexical disambiguation using CHIP. In Proceedings of the 6th EACL. Utrecht, Association for Computational Linguistics, pp431-436. Demetriou G, Atwell E 1994 Machine-learnable, non-compositional semantics for domain independent speech or text recognition. In Proceedings of the 2nd Hellenic-European Conference on Mathematics and Informatics (HERMIS), Athens. Eeg-Olofsson M 1991 Word-class tagging: some computational tools. PhD thesis, University of Lund. Elliott R, Lakhdar A, Aggoun J, Moore R 1995 Hidden Markov models: estimation and control. London, Springer-Verlag. Elliott J, Atwell E 1999 Language in signals: the detection of generic species-independent intelligent language features in symbolic and oral communications. In Proceedings of the 50th International Astronautical Congress. Amsterdam, paper IAA-99-IAA.9.1.08. Elliott J, Atwell E 2000. Is there anybody out there?: The detection of intelligent and generic languagelike features. Journal of the British Interplanetary Society 53(1&2): 13-22. Elliott J, Atwell E, Whyte B 2000 Language identification in unknown signals. In Proceedings of COLING'2000 18th International Conference on Computational Linguistics, Saarbrucken, Association for Computational Linguistics (ACL) and San Francisco, Morgan Kaufmann Publishers, pp1021-1026. Elliott J, Atwell E, Whyte B 2000 Increasing our ignorance of language: identifying language structure in an unknown signal. In Daelemans W (ed) Proceedings of CoNLL-2000: International Conference on Computational Natural Language Learning. Lisbon, Association for Computational Linguistics. 37 Elliott J, Atwell E, Whyte B 2001 A toolkit for visualisation of combinational constraint phenomena in linguistically interpreted corpora. In Proceedings of CLUK'4: Computational Linguistics in the United Kingdom. Sheffield. Finch S 1993 Finding structure in language. PhD thesis, Edinburgh University. Freudenthal H 1960 LINCOS, Design of a language for cosmic intercourse. North Holland. Gale W, Church K, 1993 A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1): 75-102. Garside R, Leech G, Sampson G (eds) 1987 The computational analysis of English: a corpus-based approach. London, Longman. Giachin E 1995 Phrase bigrams for continuous speech recognition. In Proceedings of ICASSP'95. Detroit. Herron D, Menzel W, Atwell E, Bisiani R, Daneluzzi F, Morton R, Schmidt J 1999 Automatic localization and diagnosis of pronunciation errors for second language learners of English. In Proceedings of EUROSPEECH'99: 6th European Conference on Speech Communication and Technology. Budapest. Huang X 1990 Hidden Markov models for speech recognition. Edinburgh, Edinburgh University Press. Hughes J 1993 Automatically acquiring a classification of words PhD thesis, University of Leeds. Hughes J, Atwell E 1994 The automated evaluation of inferred word classifications. In Cohn A (ed) Proceedings of ECAI'94: 11th European Conference on Artificial Intelligence. Chichester, John Wiley, pp535-539. Jelinek F 1991 Self-organised language modeling for speech recognition. In Waibel A, Lee K (eds), Readings in speech recognition. San Mateo, Morgan Kaufmann, pp450-506. Jelinek F, Lafferty J, Mercer R 1992 Basic methods of probabilistic context-free grammars. 
In Laface P, de Mori R (eds) Speech recognition and understanding. Berlin, Springer-Verlag, pp347-360. Joshi A, Srinivas B 1994 Disambiguation of Super Parts of Speech (or Supertags): almost parsing. In Proceedings of COLING'94. Kyoto. Jost U, Atwell E 1993 Deriving a probabilistic grammar of semantic markers from unrestricted English text. In Lucas S (ed) Grammatical Inference: theory, applications, and alternatives, IEE Colloquium Proceedings 1993/092. London, Institute of Electrical Engineers, pp91-97. Jurafsky D, Martin J 2000 Speech and language processing. Prentice-Hall Karlsson F, Voutilainen A, Heikkila J, Anttila A (eds) 1995 Constraint grammar. Berlin, Mouton de Gruyter. Kneser R, Ney H 1995 Improved backing-off for n-gram language modelling. In Proceedings of IEEE ICASP'95. Detroit, pp49-52. Kuhn R, de Mori R 1994 Recent results in automatic learning rules for semantic interpretation. In Proceedings of the International Conference on Spoken Language Processing. Yokohama, pp75-78. Leech G 1969 Towards a semantic description of English. London, Longman. Leech G 1971 Meaning and the English verb. London, Longman. Leech G 1974 Semantics. Harmondsworth, Penguin. 38 Leech G 1983 Principles of pragmatics. London, Longman. Leech G 1986 Automatic grammatical analysis and its educational applications. In Leech G, Candlin C (eds) Computers in English language teaching and research: selected papers from the British Council Symposium. London, Longman, pp204-214. Leech G 1991 The state of the art in corpus linguistics. In Aijmer K, Altenberg B (eds) English corpus linguistics: essays in honour of Jan Svartvik. London, Longman, pp8-29. Leech G 1992 100 million word of English: the British National Corpus (BNC). Language Research 28(1): 1-13 Leech G 1993 100 million words of English. English Today 33: 9-15 Leech G 1993 Corpus annotation schemes. Literary and Linguistic Computing. 8(4): 275-281. Leech G 1994 Students’ grammar – teachers’ grammar – learners’ grammar. In Bygate M, Tonkyn A, Williams E (eds) Grammar and the language teacher. Prentice Hall International. Leech G, Garside R, Atwell E 1983 The automatic grammatical tagging of the LOB corpus. ICAME Journal 7: 13-33. Leech G, Barnett R, Kahrel P 1996 EAGLES Final Report and guidelines for the syntactic annotation of corpora, EAGLES Report EAG-TCWG-SASG/1.5. http://www.ilc.pi.cnr.it/EAGLES96/home.html MacDonald I 1997 Hidden Markov and other models for discrete-valued time series. London, Chapman and Hall. Magerman D 1994 Natural language parsing as statistical pattern recognition. PhD thesis, Stanford University. Manning C, Schutze H 1999 Foundations of statistical natural language processing. Cambridge, MIT Press. McDonough T 1987 The search for extra-terrestrial Intelligence. John Wiley and Sons. Menzel W, Atwell E, Bonaventura P, Herron D, Howarth P, Morton R, Souter C 2000 The ISLE corpus of non-native spoken English. in Gavrilidou M, Carayannis G, Markantionatou S, Piperidis S, Stainhaouer G (eds) Proceedings of LREC2000: Second International Conference on Language Resources and Evaluation. Athens, European Language Resources Association (ELRA), vol.2 pp.957- 964. Norris R 1999 How old is ET? In Proceedings of 50th International Astronautical Congress. Amsterdam, paper IAA-99-IAA.9.1.04. O'Donoghue T 1993 Reversing the process of generation in systemic grammar. PhD thesis, University of Leeds. Ollongren A 1999 Large-size message construction for ETI. In Proceedings of the 50th International Astronautical Congress. 
Amsterdam, paper IAA-99-IAA.9.1.06. Quirk R, Greenbaum S, Leech G, Svartvik J 1972 A grammar of contemporary English. London, Longman. Quirk R, Greenbaum S, Leech G, Svartvik J 1985 A comprehensive grammar of the English language. London, Longman. Sagan C (ed) 1978 Murmers of earth: the voyager interstellar record. New York, Random House. Sagan C 1988 Contact. London, Legend. 39 Sampson G, Haigh R, Atwell E 1989 Natural language analysis by stochastic optimisation. Journal of Experimental and Theoretical Artificial Intelligence 1: 271-287. Schabes Y, Roth M, Osborne R 1993 Parsing of the Wall Street Journal with the inside-outside algorithm. In In Proceedings of the 6th EACL. Utrecht, Association for Computational Linguistics, pp341-46. Sekine S, Grishman R 1995 A corpus-based probabilistic grammar with only two non-terminals. In Proceedings of IWPT International Workshop on Parsing Technologies. Prague University. Sivier D 2000 SETI and the historian: methodological problems in an interdisciplinary approach. . Journal of the British Interplanetary Society 53(1&2): 23-26. Souter C, Churcher G, Hayes J, Hughes J, Johnson S 1994 Natural language identification using corpus-based models. HERMES Journal of Linguistics. 13: 183-204. Souter C, Atwell E 1994 Using parsed corpora: a review of current practice. In Oostdijk N, de Haan P (eds) Corpus-based research into language. Amsterdam, Rodopi, pp143-158. Taylor L, Knowles G 1988 Manual of information to accompany the SEC corpus: The machine readable corpus of spoken English. University of Lancaster: Unit for Computer Research on the English Language. Available from http://kht.hit.uib.no/icame/manuals/sec/INDEX.HTM Teahan B 1998. Modelling English text. PhD Thesis, University of Waikato, New Zealand. Voutilainen A, Jarvinen T 1996 Using the English constraint grammar parser to analyse a software manual corpus. In Sutcliffe R, Richard, Koch H, McElligott A (eds.) Industrial parsing of software manuals. Amsterdam, Rodopi, pp57-88. Turnbull M, Smith L, Tarter J 1999 Project Phoenix: Starlist2000. In Proceedings of the 50th International Astronautical Congress. Amsterdam, paper IAA-99-IAA.9.1.02. Vakoch D 1998a Pictorial messages to extraterrestrials, SETIQuest 4(1): 8-10 (Part 1), 4(2): 15-17 (Part 2). Vakoch D 1998b Constructing messages to extraterrestrials: an exosemiotic perspective. Acta Astronautica 42(10-12): 697-704. Vakoch D 1998c The dialogic model: representing human diversity in messages to extraterrestrials. Acta Astronautica 42(10-12): 705-710. Vakoch D 1999 Communicating scientifically formulated spiritual principles in interstellar messages. In Proceedings of the 50th International Astronautical Congress. Amsterdam, paper IAA-99- IAA.9.1.10. Weischedel R, Meteer M, Schwarz R, Ramshaw L, Palmucci J 1993 Coping with ambiguity and unknown words through probabilistic models. Computational Linguistics 19(2): 359-382. Wilson A, Rayson P 1993 The automatic content analysis of spoken discourse. In Souter C, Atwell E (eds), Corpus based computational linguistics. Amsterdam, Rodopi, pp215-226. Wilson A, Leech G 1993 Automatic content analysis and the stylistic analysis of prose literature. Revue Informatique et Statistique dans les Sciences Humaines 29: 219-234. Young S, Bloothooft G (eds) 1997 Corpus-based methods in language and speech processing. Dordrecht/Boston, Kluwer Academic Publishers. Zipf G 1935 The psycho-biology of language. Boston, Houghton-Mifflin Zipf G 1949 Human behaviour and the principle of least effort. 
New York, Addison Wesley. 40 From EAGLES to CT tagging: a case for re-usability of resources Manuel Barbera SSLMIT Trieste b.manuel@inrete.it 1. Abstract Re-usability has been recently identified as one of the main requisites a corpus annotation project must accomplish (cf. Garside, Leech and McEnery 1997: 5; Leech - Wilson 1999, etc.). This subject, developed mainly to favour morphosyntactic tagging of corpora, naturally holds more general implications for corpus linguistics as a whole, and the need for resources to «be reusable, interchangeable, shareable» (Monachini and Calzolari 1999: 149) is now strongly agreed upon even at an institutional level. It is not by chance that international initiatives in this sense have multiplied in the last few years (cf. Monachini and Calzolari 1999: 149-150). In fact, beside the obvious economic and practical reason, there is also a more theoretic attitude toward a more “ecological” and “democratic” conception of linguistic computer sciences, where resources can be shared by fields of diverse nature. Corpus Taurinense (CT) annotation in connection with EAGLES standard guidelines is, I believe, an excellent example of this approach from many points of view. Not only, in fact, was the CT-tagset conceived according to standards that would allow the CT to be (re)used for extremely different purposes (cf. § 4), but its very conception is an example of how experiences previously accumulated in rather diverse fields may be re-used (cf. § 3). The CT, in fact, is a POS tagged corpus of old Italian formulated for prevalently linguistic-philological aims. EAGLES standards, instead, have been introduced for eminently practical (commercial, economic, etc.) purposes. The fact that technologies designed for society can become valuable to humanistic research, in a sort of cycle and recycle of intercommunication between the two sectors, is a rather new, fortunate situation. The present paper will present a short documentation of an example of this lucky match. 2. The Corpus Taurinense (CT). Before fully entering into the subject, it is worthwhile dedicating a few introductory notes to the Corpus Taurinense; while it is certainly not necessary in this paper to spend time on the EAGLES guidelines, so familiar to all. The Corpus Taurinense1, as I have already mentioned, is a tagged corpus of old Italian texts (more specifically, old Florentine dated between 1251 and 1300) of 258,310 tokens (for 19,235 forms) which has been developed by Carla Marello and me2. The choice of texts was not our responsibility, however, since the CT is truly the annotated reincarnation, improved in the tokenization, of the Padua Corpus, which is a subset of the collection of texts of the TLIO (Tesoro della lingua italiana delle origini) under costruction at the OVI (Opera del vocabolario Italiano)3. This collection was made available by Pietro Beltrami and chosen by Lorenzo Renzi and Giampaolo Salvi (cf. Renzi 1998: 29) as the base for the compilation of “ItalAnt – Grammatica dell'italiano antico”, a syntax of Old Italian which is considered an ideal prosecution of the “Grande grammatica italiana di consultazione” (Renzi - Salvi 1988, 1991, 1995). The linguistic annotation which we have implemented is a morphosyntactic tagging, additionally enriched by lemmatic annotation. At present, we are working on the disambiguation of the transcategorizations and on the treatment of multiword entries (Barbera and Marello 2000). 
In the future we hope to be able to add at least a third level of annotations of textual nature. The CT is available for UNIX/LINUX on the Corpus Work Bench (CWB) system, built by IMS Stuttgart (cf. Christ and Schulze 1996). A demo of an Internet query interface is already online and under testing at Stuttgart4. The CT, thanks to the versatility of the CQP (Corpus Query Processor) of the IMS Corpus Work Bench, allows for the simultaneous display and query of both linguistic (lemma, POS, morphosyntax) and philological annotations (e.g. corrections, text structure, etc.).
1 Its name, analogously to the Padua Corpus, which will be discussed shortly, is taken from the place where the co-financed group is located, namely Torino (Turin, Italy, in Latin Augusta Taurinorum). We could not have simply called it “Corpus di Torino” because, aside from our love for Latin, there already exists a corpus of texts in English put together by students from the University of Turin which is internationally known in studies of applied linguistics as, in fact, the “Turin Corpus”.
2 In the field of research co-financed by the CNR “Per una grammatica testuale dell'italiano antico”, directed by Bice Morara Garavelli, and coordinated with “Ricerche linguistiche sull'italiano antico”, directed by Lorenzo Renzi.
3 Cf. the OVI homepage at the URLs http://www.ovisun199.csovi.fi.cnr.it/italnet/OVI or http://www.lib.uchicago.edu/efts/ARTFL/projects/OVI.
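To give a flavour of the kind of combined query this makes possible, the toy sketch below (plain Python, not CQP syntax; the attribute names and tag values are invented for illustration) filters a token stream on linguistic and philological attributes at the same time, which is essentially what CQP does over the CT's positional attributes.

# Toy "annotated corpus": each token carries linguistic and philological attributes.
# Attribute names and values are invented for illustration; the real CT is queried via CWB/CQP.
corpus = [
    {"word": "da",    "lemma": "da",    "pos": "E",  "corrected": False},
    {"word": "la",    "lemma": "la",    "pos": "RD", "corrected": False},
    {"word": "mia",   "lemma": "mio",   "pos": "PD", "corrected": True},
    {"word": "parte", "lemma": "parte", "pos": "S",  "corrected": False},
]

def query(tokens, **constraints):
    """Return the tokens matching all attribute=value constraints, e.g. pos='PD'."""
    return [t for t in tokens if all(t.get(k) == v for k, v in constraints.items())]

# A linguistic constraint (all PD tokens) ...
print(query(corpus, pos="PD"))
# ... combined with a philological one (only forms emended by the editor).
print(query(corpus, pos="PD", corrected=True))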
ELM-DE PRON personal refl poss demo idf rel interrog sg;pl sg;pl sg;pl inflect non-inflect sg;pl sg;pl 1;2;3 1;2;3 sg;pl du mich seines dieser mancher man die wen +MSF +gend +case +case +gend +case +gend +case +gend +case – +gend +case +gend +case DET poss demo idf rel interrog sg;pl sg;pl inflect non-inflect inflect non-inflect sg;pl sg;pl seine dieser manche manch dessen welchen wessen +MSF +gend +case +gend +case +gend +case – – +gend +case – Table 1. For Italian, however, as demonstrated by the ELM-IT solutions, a simpler division was preferred. 4 Cf. the “CT-WWW-Demos des Corpus Query Processor CQP” homepage at the URL http://www.ims.unistuttgart. de/projekte/CQPDemos/italant/. Access is reserved, but it can be freely granted by asking me (b.manuel@inrete.it) or Carla Marello (marello@cisi.unito.it). 5 Moreover the Penn-Helsinki parsed corpus of Middle English (http://www.ling.upenn.edu/mideng/) and the Tycho Brahe parsed corpus of Historical Portuguese (http://www.ime.usp.br/~tycho/corpus/index.html) which are perhaps the most relevant experiences in this sector, are both treebanks, and present, therefore, problems which are often different from ours. We do, however, know of some experiments on morphological tagging of Old Italian texts at the CiBIt (Centro interuniversitario Biblioteca Italiana Telematica) in Pisa: http://cibit.humnet.unipi.it/index_ra.htm. 6 This area of the tagset was dealt with specifically in Parallela IX Congress (Barbera 2000); for a general standard description of the CT-Tagset cf. Barbera 2000/2001. 42 ELM-IT PRON pers poss dem indf int rel excl strg weak nom obl obl io mi me mio quello alcuni che? che che! +MSF +pers +gend +numb +pers +gend +numb +pers +gend +numb +pers +gend +numb +gend +numb +gend +numb +gend +numb +gend +numb +gend +numb DET poss dem indf int rel excl 1,2,3 mio quello alcuni che? che che! +MSF +gend +numb +gend +numb +gend +numb +gend +numb +gend +numb +gend +numb Table 2 Having seen the EAGLES proposals, we began to consider the idea that the scheme of the tagset, in its typed and non-typed components, could be correlated to the scheme of the lemmary, so the system could be optimized, while considering the specific difficulties Old Italian presented. These are not only due to the variations introduced by diverse philological practices used in the editions initially used, but also to the fluid and not yet prescribed nature of the original texts. The result is a staggering number of graphic and linguistic variations of all forms and the creation of multiple problems in identifying tokens, especially in relation to the presence of pronominal clitics (and less frequently adverbial clitics), particularly abundant in this state of language. The aim was to render the annotation as distinct and suitable to old Italian as possible, while using the least number of tags necessary7: CT-Tagset PD pers poss dem indf int rel excl strg weak strg weak strg weak nom obl obl io mi me mio ÷ma quello ne alcuni che? che che! +MSF +pers +gend +numb +pers +gend +numb +pers +gend +numb +pers +gend +numb +pers +gend +numb +pers +gend +numb – +pers +gend +pers +gend +pers +gend +pers +gend Table 3 In doing so, as can be seen from the results, we have also managed to remain closer than the ELMIT to the native “naive”8 grammatical tradition. 
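As a toy illustration of how a corpus-specific scheme of this kind can be projected back onto EAGLES-style categories (the "automatic mappability" requirement taken up below), the following Python sketch maps invented CT-like "PD" tags onto coarse pronoun/determiner classes; the tag strings are placeholders for illustration only, not the actual CT tagset.

# Hypothetical CT-style tags (invented for this example) mapped onto EAGLES-style
# pronoun/determiner categories. A real table would be built from the CT documentation.
CT_TO_EAGLES = {
    "PD:pers:strg": "PRON (personal, strong)",
    "PD:pers:weak": "PRON (personal, weak)",
    "PD:poss":      "PRON/DET (possessive)",
    "PD:dem":       "PRON/DET (demonstrative)",
    "PD:indf":      "PRON/DET (indefinite)",
    "PD:int":       "PRON/DET (interrogative)",
    "PD:rel":       "PRON/DET (relative)",
}

def map_tag(ct_tag):
    """Map a CT-style tag onto its EAGLES-style category; unmapped tags are flagged."""
    return CT_TO_EAGLES.get(ct_tag, f"UNMAPPED({ct_tag})")

for tag in ["PD:poss", "PD:rel", "PD:xyz"]:
    print(tag, "->", map_tag(tag))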
The modern perspective of re-usability of computer data, in fact, has underlined more than once that «it is a good idea for annotation schemes to be based as far as possible on consensual or theory-neutral analyses of the data» (Leech 1997: 7). From this point of view the principal novelty of the CT-Tagset, compared to ELM-IT, has been to re-organize the pronominal area (Pronouns and Determiners) in one single POS, which we have called “PD”, prevalently morphologically based (leaving the syntactic level to a later and different phase), and with an internal organization studied expressly from the point of view of the tagset's structure and its suitability to Old Italian. It was a pleasant surprise for us, as well as a confirmation that we had worked in the right direction, to find that Geoffrey Leech and Andrew Wilson, in the most recent Guidelines, published after the formulation of ELM-IT, had reached conclusions similar to ours:

The parts-of-speech Pronoun, Determiner and Article heavily overlap in their formal and functional characteristics and different analyses for different languages entail separating them out in different ways. For the present purpose, we have proposed placing Pronouns and Determiners in one 'super-category', recognizing that for some descriptions it may be thought best to treat them as totally different part-of-speech. There is also an argument for subsuming Articles under Determiners. The present guidelines do not prevent such a realignment of categories, but do propose that articles (assuming they exist in a language) should always be recognized as a separate class, whether or not included within determiners. (Leech - Wilson 1999: 63-64).

They then concluded that “the requirement is that the descriptive scheme adopted should be automatically mappable into the present one” (ibidem): and our scheme certainly and easily is.
7 In fact, the computational advantages offered by a limited tagset are well-known. For example, if the tagset contains no more than 70 hierarchical tags, the annotated corpus will be most effective as training corpus for a stochastic annotator (cf. Heid 1998).
8 Cf. also the linguistic notion of “concetto ingenuo” worked out by Graffi 1991.
4. From CT towards the future
This brings us back to the second point of view of re-usability which we bore in mind as we planned the CT. From this other perspective, as far as information retrieval is concerned, since the CT tagset is fully EAGLES-conformant, information provided by the CT is fully comparable with the main modern language tagged corpora, without even exiting from the same working system environment. That the interlinguistic comparison is so greatly facilitated is obvious. But the typological perspective is certainly not the only one to benefit from this approach. For example, if historical linguists and historians of the Italian language have annotated corpora of Old and Modern Italian, which can be easily compared, they could verify the empirical basis of their theories more easily, or even make new discoveries. A small example of what new observations are made possible by this corpus is found in the study of multiword entries that I am pursuing. I have discovered that collocations of the structure “dalla parte mia/tua/…” with variable ending are apparently unknown to thirteenth-century Florentine: in our corpus only the type “dalla mia/tua/… parte”, with a variable internal element, is found.
Last but not least, corpora annotated in the same way for diverse chronological phases of the Italian language, could be used for lexical acquisition by historical dictionaries. An elementary example could be the evidence of the change in the subcategorization frame of many verbs from the thirteenth century to the modern language. This can be only roughly studied when simple concordance programmes are used. Moreover, since the CT, through the mediation of the Padua Corpus, already constitutes in itself a reuse of the lexicographic resources from the OVI, its ability to return this favour by making a new source of information retrieval available, is another demonstration of that non vicious circle of resources from which our small consideration on re-usability began. 5. Conclusions At the beginning of this contribution we underlined how technologies designed for society could become valuable for humanistic research. One of the great merits of computational studies, in fact, has been just that: to have built a bridge between these two worlds, which previously had relatively little contact. The example of the CT, in this way, is, I hope, even more noteworthy in that it involves historical linguistics and philology, subjects that until now have benefited from this profitable circulation of resources less then other doctrines, such as logic, applied linguistics and lexicography. 6. References Barbera M 2000 Pronomi e determinanti nell'annotazione dell'italiano antico. La POS “PD” del Corpus Taurinense. Paper presented at Neuntes österreisch-italienisches Linguistentreffen / Nono incontro italo-austriaco dei linguisti PARALLELA IX. “Text - Variation - Informatik / Testo - variazione - informatica”, Salzburg, 1.-4. November 2000. Barbera M 2000/2001 Italiano antico e linguistica dei corpora: un Tagset per ItalAnt, in VI Convegno Internazionale SILFI “Tradizione & Innovazione”: La linguistica e filologia italiana alle soglie di un nuovo millennio. Gerhard-Mercator-Universität Duisburg 28.06.-02.07.2000. Atti. (Forthcoming). Barbera M, Marello C 1999/2001 L'annotazione morfosintattica del Padua Corpus: strategie adottate e problemi di acquisizione. Paper presented at Italiano antico e corpora elettronici. Padova, 19-20 febbraio 1999. Incontro seminariale. Forthcoming in Revue romane 36(1). Barbera M, Marello C 2000 (Forthcoming in Revue française de linguistique appliquée.) Entrées de multimots et étiquetage de parties du discours dans le Corpus Taurinense. Paper presented at AFLA 2000, Paris, 6-8 juillet 2000. Christ O, Schulze BM 1996 CWB. Corpus Work Bench, Ein flexibles und modulares Anfragesystem für Textcorpora. In Feldweg H, Hinrichs E (eds) Lexikon und Text. Tübingen, Niemeyer. Garside R, Leech G, McEnery A (eds) 1997 Corpus annotation. Linguistic information from computer 44 text corpora, London - New York, Longman. Graffi G 1991 Concetti ‘ingenui’ e concetti ‘teorici’ in sintassi. Lingua e stile 26: 347-363. Heid U. 1998 Annotazione morfosintattica di corpora ed estrazione di informazioni linguistiche. Paper presented at Annotazione morfosintattica di corpora e costruzione di banche di dati linguistici. Torino, 26-XI-1998. Leech G, Wilson A 1999 Standards for tagsets. In van Halteren 1999, pp. 55-80. Monachini M, Calzolari N 1996 Synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora. A common proposal and application to European languages. Pisa, EAGLES Document EAG-CLWG-MORPHSYN/R, May 1996. 
Monachini M 1996 ELM-IT: EAGLES Specifications for Italian morphosyntax. Lexicon specifications and classification guidelines. Pisa, EAGLES Document EAG-CLWG-ELM-IT/F, May.
Renzi L 1998 Perché una grammatica dell'italiano antico: una presentazione. In Renzi L (ed), ITALANT: per una Grammatica dell'Italiano Antico. Padova, Centro Stampa di Palazzo Maldura, pp 21-32.
Renzi L, Salvi G (eds) 1988, 1991, 1995 Grande grammatica italiana di consultazione I.-III. Bologna, Il Mulino.
Teufel S 1996 ELM-EN. EAGLES Specifications for English morphosyntax. Draft version. Stuttgart, EAGLES Document, July.
Teufel S, Stöckert Ch 1996 ELM-DE. EAGLES Specification for German morphosyntax. Lexicon specification and classification guidelines. Stuttgart, EAGLES Document EAG-CLWG-ELM-DE/F, März.
van Halteren H (ed) 1999 Syntactic wordclass tagging. Dordrecht - Boston - London, Kluwer Academic Publishers, Text, speech and language technology 9.

Optimisation of corpus-derived probabilistic grammars
Anja Belz
CCSRC, SRI International
23 Millers Yard, Mill Lane
Cambridge CB2 1RQ, UK

1 Overview
This paper examines the usefulness of corpus-derived probabilistic grammars as a basis for the automatic construction of grammars optimised for a given parsing task. Initially, a probabilistic context-free grammar (PCFG) is derived by a straightforward derivation technique from the Wall Street Journal (WSJ) Corpus, and a baseline is established by testing the resulting grammar on four different parsing tasks. In the first optimisation step, different kinds of local structural context (LSC) are incorporated into the basic PCFG. Improved parsing results demonstrate the usefulness of the added structural context information. In the second optimisation step, LSC-PCFGs are optimised in terms of grammar size and performance for a given parsing task. Tests show that significant improvements can be achieved by the method proposed.
The structure of this paper is as follows. Section 2 discusses the practical and theoretical questions and issues addressed by the research presented in this paper, and cites existing research and results in the same and related areas. Section 3 describes how LSC-grammars are derived from corpora, defines the four parsing tasks on which grammars are tested, describes data and evaluation methods used, and presents a baseline technique and baseline results. Section 4 discusses and describes different types of LSC and demonstrates their effect on rule probabilities. Methods for deriving four different LSC-grammars from the corpus are described, and results for the four parsing tasks are presented. It is shown that all four types of LSC investigated improve results, but that some lead to overspecialisation of grammars. Section 5 shows that LSC-grammars can be optimised for grammar size by a generalisation technique that at the same time seeks to optimise parsing performance for a given parsing task. An automatic search method is described that carries out a search for optimal generalisations of the given grammar in the space of partitions of nonterminal sets. First results are presented for the automatic search method which show that it can be used to reduce grammar size and improve parsing performance. Parent node information is shown to be a particularly useful type of LSC, and the results for the complete parsing task achieved with the corresponding grammar are better than any previously published results for comparable unlexicalised grammars.
Preliminary tests for LSC grammar optimisation show that it can drastically reduce grammar size and significantly improve parsing performance. In one set of experiments, a partition was found that increased the labelled F-Score for the complete parsing task from 72.31 to 74.61, while decreasing grammar size from 21,995 rules and 1,104 nonterminals to 11,254 rules and 224 nonterminals. Results for grammar optimisation by automatic search of the partition space show that improvements in grammar size and parsing performance can be achieved in this way, but do not come close to the big improvements achieved in preliminary tests. It is concluded that more sophisticated search techniques are required to achieve this.

2 Background and related research
The research reported in this paper covers a range of issues: (i) corpus-derived grammars; (ii) the usefulness of structural context information in making parsing decisions; (iii) automatic construction methods for specialised grammars that take corpus-derived grammars as a starting point; (iv) the (in)adequacy of PCFGs as a grammar formalism; and (v) the question of whether parsing strategies that do without lexical information can come closer to the performance of lexicalised systems. Each of these issues will be discussed in more detail over the following sections.
Corpus-derived grammars. Over the last five years, a range of research projects—e.g. Charniak (1996), Cardie & Pierce (1998), Johnson (1998, 2000), Krotov et al. (2000)—have looked at probabilistic grammars that have been directly derived from bracketted corpora (or treebanks, hence the term "treebank grammar" coined by Charniak, 1996). The basic idea in grammar derivation from corpora is simple. For each distinct bracketting found in a corpus, a grammar rule is added to the grammar and the rule's probability is derived in some way (often by maximum-likelihood estimation with some smoothing method) from the frequency of occurrence of the bracketting in the corpus. For instance, the bracketting (NP (DT the) (NN tree)) would yield the production rule NP → DT NN. However, because the number of rules in grammars derived in this entirely straightforward manner is infeasibly large, at least in the case of the WSJ Corpus, and because their parsing performance moreover tends to be poor, some techniques are usually applied to reduce grammar size and to improve performance. All approaches edit the corpus in some way, e.g. eliminating single child rules, empty category rules, functional tags, co-indexation tags, and punctuation marks. Different compaction methods (such as eliminating rules with frequency less than some n) have been investigated that reduce the size of grammars without too much loss of performance (in particular by Charniak and Krotov et al.). To improve parsing performance, e.g. Charniak relabels auxiliary verbs with a separate POS-tag and incorporates a "right-branching correction" into the parser to make it prefer right-branching structures. As a result of such techniques, the final grammars for which performance results are reported tend to have little in common with the rule set underlying the corpus from which they were derived. Several other grammar building and training methods are similar to treebank grammar construction: Bod & Scha's DOP1 method which extracts tree fragments rather than rules from corpora, MBL2 methods (Daelemans et al.)
for building parsing systems from corpora, and—more generally—any method that estimates the likelihood of brackettings (or of brackettings converted into taggings) from a corpus, since such methods directly utilise both the brackettings and their frequencies as found in the corpus. The existing results for corpus-derived grammars that do not undergo significant further development demonstrate their limitations: they cannot compete with state-of-the-art parsing results (see Section 2). It will be argued in this paper that grammars directly extracted from corpora do, however, provide a useful starting point for further automatic grammar construction methods.
Context-free grammars that incorporate structural context. It is frequently observed (e.g. Manning & Schütze (1999, p. 416ff)) that PCFGs are inadequate as a grammar formalism because of the very strong independence assumptions inherent in them, reflecting on the one hand a complete lack of lexicalisation, and on the other a lack of structure dependence. It is true that in conventional PCFGs the probability of, say, a given NP bracketting is independent of the identity of the head noun as well as its structural context (e.g. whether the NP is in subject or object position). However, this independence is not due to the formal characteristics of PCFGs, but rather to the way they tend to be used. If the set of nonterminals of a PCFG does not distinguish between, say, NPs in subject position and NPs in object position, then the probabilities of any rules containing the nonterminal NP are necessarily independent of the subject/object distinction. However, it is straightforward to introduce such a dependence into a PCFG by splitting the category NP into two categories NP-SBJ and NP-OBJ. Similarly, categories (nonterminals) can be divided on the basis of lexemes, lexical categories or semantic classes. PCFGs may not be able to accommodate lexical and structural information in the most elegant fashion, but the point here is not about representational elegance and efficiency. Rather, the fact that PCFGs encode languages that make up the formal class of context-free languages is entirely separate from their ability to reflect the dependence of rule probabilities on lexical and structural context. Examining different kinds of structural context within the PCFG framework (as done in this paper) has two advantages: firstly, there are polynomial-time algorithms for finding most likely parses, and secondly, there is a simple measure of the complexity added to a grammar by the introduction of a piece of structural information such as the subject/object distinction, namely the resulting increase in the number of rules in the grammar.
Automatic grammar construction. It is sometimes observed that deriving probabilistic grammars from corpora in the way described above is not an automatic grammar learning method because all that is done is to extract the PCFG that underlies the corpus and is encoded in its sentences, brackettings and occurrence frequencies. As was pointed out above, creating a grammar in this way is simply one of many ways to utilise the brackettings and frequencies of corpora, a feature shared with many computational learning approaches to automatic grammar construction. However, as previously mentioned, the limitations of grammars directly extracted from corpora indicate that using them as starting points for further grammar development is the more useful approach.
1 Data-Oriented Parsing. 2 Memory-Based Learning.
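To make the treebank-grammar idea concrete, the following self-contained Python sketch (illustrative only, not the code used in this paper) reads Penn-style bracketings, counts the productions they contain, and converts the counts into rule probabilities by the maximum-likelihood estimation mentioned above (and spelled out in Section 3.1); corpus editing, the grammar/lexicon split and smoothing are omitted.

from collections import defaultdict

def parse_bracketing(s):
    """Parse a Penn-style bracketing such as "(S (NP (DT the) (NN cat)) (VP (VBD sat)))"
    into a nested (label, children) tuple; leaves are plain word strings."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0
    def read():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]; pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos]); pos += 1
        pos += 1                      # consume the closing ")"
        return (label, children)
    return read()

def count_rules(tree, counts):
    """Add one count for every bracketing: LHS -> sequence of child labels (or words)."""
    label, children = tree
    rhs = []
    for c in children:
        if isinstance(c, tuple):
            rhs.append(c[0])
            count_rules(c, counts)
        else:
            rhs.append(c)             # word leaf: yields a lexical rule, e.g. DT -> the
    counts[label][tuple(rhs)] += 1

def mle_probabilities(counts):
    """P_MLE(N -> rhs) = C(N -> rhs) / sum_i C(N -> rhs_i)."""
    probs = {}
    for lhs, rhss in counts.items():
        total = sum(rhss.values())
        for rhs, c in rhss.items():
            probs[(lhs, rhs)] = c / total
    return probs

counts = defaultdict(lambda: defaultdict(int))
for bracketing in [
    "(S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))",
    "(S (NP (DT the) (NN dog)) (VP (VBD sat)))",
]:
    count_rules(parse_bracketing(bracketing), counts)

for (lhs, rhs), p in sorted(mle_probabilities(counts).items()):
    print(f"{lhs} -> {' '.join(rhs)}   {p:.2f}")

On this toy input the sketch assigns, for instance, probability 0.5 each to VP -> VBD and VP -> VBD PP, which is exactly the kind of rule-frequency information a treebank grammar encodes.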
Grammar/Parser                    Grammar size   Performance (WSJ unseen): LR    UR    LP    UP    CB
Fully lexicalised:
Collins (2000)                    –                                        90.1  –     90.4  –     0.73
Charniak (2000)                   –                                        90.1  –     90.1  –     0.74
Collins (1999)                    –                                        88.5  –     88.7  –     0.92
Collins (1997)                    –                                        88.1  –     88.6  –     0.91
Charniak (1997)                   –                                        87.5  –     87.4  –     1.00
Magerman (1995) SPATTER           –                                        84.6  –     84.9  –     1.26
Nonlexicalised:
Charniak (1996)                   10,605                                   –     78.8  –     80.4  –
  without frequency 1 rules       3,943                                    –     78.2  –     80.7  –
Krotov et al. (2000)              15,420                                   74.1  77.1  77.4  80.6  2.13
  without frequency 1 rules       6,514                                    74.4  77.5  76.9  80.2  2.18
WSJ 15–18 treebank PCFG           6,135                                    69.1  –     71.4  –     2.67
Table 1: Performance of comparable lexicalised and nonlexicalised grammars on full parsing.

Reference                         Method                          LP    LR    F-Score
Lexicalised:
Tjong Kim Sang et al. (2000)      System combination              94.2  93.6  93.9
Muñoz et al. (1999)               SNoW                            92.4  93.1  92.8
XTAG Research Group (1998)        XTAG + Supertagging             91.8  93.0  92.4
Ramshaw & Marcus (1995)           Transformation Based Learning   91.8  92.3  92.0
Veenstra (1998)                   MBL                             89.0  94.3  91.6
Nonlexicalised:
Argamon et al. (1999)             MBL                             91.6  91.6  91.6
Cardie & Pierce (1998)            Error-Driven Grammar Pruning    90.7  91.1  90.9
WSJ 15–18 treebank PCFG           –                               89.2  87.6  88.4
Table 2: Performance of comparable lexicalised and nonlexicalised grammars on NP-chunking.

Creating a starting point for grammar learning in this way is particularly useful because context-free grammars cannot be learnt from scratch from data. At the very least, an upper bound must be placed on the number of nonterminals allowed. Even when that is done, there is no likelihood that the grammars resulting from an otherwise unsupervised method will look anything like a linguistic grammar whose parses can provide a basis for semantic analysis3.
3 Any linguistic CFG can be converted into a normal form that encodes the same set of sentences, but whose derivations and substructures are not semantically meaningful.
Parsing with(out) lexical information. Corpus-derived grammars tend to be nonlexicalised PCFGs, hence the existing research cited above can be seen as investigations into the results that can be achieved in parsing without taking into account lexical information. In syntactic parsing tasks, nonlexicalised methods are generally outperformed by lexicalised approaches. In the case of complete (non-shallow) parsing, nonlexicalised methods are outperformed by large margins. Table 1 shows an overview of state-of-the-art nonlexicalised and lexicalised results for statistical parsing systems (U/LR = Un-/Labelled Recall, U/LP = Un-/Labelled Precision, see Section 3). For comparison, the last row of the table shows this paper's baseline result for the complete parsing task (see Section 3.4). In NP-chunking, a shallow syntactic parsing task that has become a popular research topic over the last decade (for details see Section 3.2 below), nonlexicalised systems also tend to lag behind lexicalised ones, although by much smaller margins. Table 2 shows a range of results for the baseNP chunking task and data set given by Ramshaw & Marcus (1995). Again, the corresponding baseline result from this paper is included in the last row. It is clear from this overview that the difference between lexicalised and unlexicalised systems is far smaller for this parsing task than for complete parsing. There are several reasons for investigating how well parsers can do without lexicalisation. Apart
from the theoretical interest, optimising grammars before adding lexicalisation may improve their overall performance, as lexicalised systems often perform worse than comparable nonlexicalised systems when the lexical component is taken out. E.g. Collins (1996) includes results for the system with lexical information removed, which reduces LR from 85.0 to 76.1 and LP from 85.1 to 76.6 in one test – worse than the comparable results reported below in Section 4.3 (78.78 and 77.16). Furthermore, the results shown in Tables 1, 2 and 4 indicate that shallow parsing tasks require lexical information to a far lesser extent than nonshallow ones, so that the added expense of lexicalisation might be avoidable in the case of such tasks.

3 Grammars, parsing tasks, data and evaluation
3.1 Grammars from corpora
The basic procedure used for deriving PCFGs from WSJ Sections 15–18 can be summarised as follows4:
1. In the first step, the corpus is iteratively edited by deleting (i) brackets and labels that correspond to empty category expansions; (ii) brackets and labels containing a single constituent that is not labelled with a POS-tag; (iii) cross-indexation tags; (iv) brackets that become empty through a deletion; and (v) functional tags.
2. In the second step, each remaining bracketting in the corpus is converted into a production rule. The rules are divided into nonlexical ones (those that form the grammar), and lexical ones (those that form the lexicon).
3. In the final step, a complete PCFG is created. The set of lexical rules is converted into a lexicon with POS-tag frequency information. The set of nonterminals is collected from the set of rules. Each set is sorted, the number of times each item occurs is determined, and duplicates are removed. Probabilities P for rules N → ζ are calculated from the rule frequencies C by Maximum Likelihood Estimation (MLE): P_MLE(N → ζ) = C(N → ζ) / Σ_i C(N → ζ_i)

3.2 Four parsing tasks
Results are given in the following and subsequent sections for four different parsing tasks:
1. Full parsing: The task is to assign a complete parse to the input sentence. A full parse is considered 100% correct if it is identical to the corresponding parse given in the WSJ Corpus.
2. Noun phrase identification: The task is to identify in the input sentence all noun phrases, nested and otherwise, that are given in the corresponding WSJ Corpus parse.
3. Complete text chunking: This task was first defined in Tjong Kim Sang & Buchholz (2000), and involves dividing a sentence into flat chunks of 11 different types. The target parses are derived from WSJ parses by a deterministic conversion procedure.
4. Base noun phrase identification: First defined by Abney (1991), this task involves the recognition of non-recursive noun phrase chunks (so-called baseNPs). Target parses are derived from WSJ parses by a simple conversion procedure.

3.3 Data and evaluation
Sections 15–18 of the Wall Street Journal (WSJ) corpus were used for grammar derivation, and Section 01 from the same corpus was used for testing parsing performance. Parsing performance was tested with the commonly used evalb program by Sekine & Collins5. The program evaluates parses in terms of the standard PARSEVAL evaluation metrics Precision, Recall and crossing brackets. For a parse P and a corresponding target parse T, Precision measures the percentage of brackets in P that match the target brackettings in T. Recall is the percentage of brackettings from T that are in P.
Crossing brackets gives the average number of constituents in one parse tree that cross over constituent boundaries in the other tree. (See e.g. Manning & Schütze (1999), p. 432–434.) For Precision and Recall there are unlabelled and labelled variants. In the latter, both the pair of brackets and the constituent label on the bracket pair have to be correct for the bracketting to be correct, whereas in the unlabelled variant only the brackets have to be correct. In this paper, unless otherwise stated, Precision and Recall always mean Labelled Precision and Recall; in particular, all new results presented are the labelled variants. Precision and Recall are commonly combined into a single measure, called F-Score, given by F_β = (β² + 1) · Precision · Recall / (β² · Precision + Recall). In this paper, β = 1 throughout. All grammars tested are nonlexicalised, therefore input sentences are sequences of POS-tags, not words. In the tests, sentences of a length above 40 words (consistently close to 7.5% of all sentences in a corpus section) were left out. All grammars are formally probabilistic context-free grammars (PCFGs). The parsing package LoPar (Schmid (2000)) was used to obtain Viterbi parses for data sets and grammars. If LoPar failed to find a complete parse for a sentence, a simple grammar extension method was used to obtain partial parses instead.
4 Throughout this paper, WSJ refers to the PENN II Treebank version.
5 Available from http://cs.nyu.edu/cs/projects/proteus/evalb/.

3.4 Baseline
A baseline grammar "BARE" was extracted from WSJ Sections 15–18 by the method described in Section 3.1, applied to the four parsing tasks defined in Section 3.2, and tested and evaluated as described in the preceding section. This yielded the following set of results, which forms the baseline for the purpose of this paper. (Results include 9 partial parses.)

                          LR      LP      F
Full parsing              69.08   71.43   70.24
NP identification         74.97   81.62   78.15
BaseNP chunking           87.6    89.21   88.4
Complete text chunking    89.63   88.99   89.31

4 Introducing structural context into PCFGs
4.1 Different types of structural context
In this section, the effects of introducing three different types of structural context (SC) into PCFG BARE are examined: (i) the grammatical function of phrases, (ii) their depth in the parse tree, and (iii) the category of the parent phrase. All three types of structural context are local to the immediate neighbourhood of the phrase node for which they provide the expansion probability conditions. Other local SC types that could be considered include position among the children of the parent node, and identity of immediate sibling nodes. Useful nonlocal SC types might be the identity of more distant ancestors than the parent node and of more distant sibling nodes.
Grammatical function. As mentioned above, the WSJ corpus subdivides standard phrase categories such as NP by attaching functional tags to them that reflect the grammatical function of the category, e.g. NP-SBJ and NP-OBJ. However, the corpus is not consistently annotated in this fashion (the same type of phrase may have zero, one or more functional tags). Parsing results for grammar FTAGS might be better if the grammar is derived from a more consistently annotated corpus. The rule that expands a noun phrase to a personal pronoun is a strong example of the extent to which grammatical function can affect expansion probabilities. In the WSJ, 13.7% of all NPs expand to PRP as subject, compared to only 2.1% as object.
Of all object NPs, 13.4% expand to PRP as first object, compared to 0.9% as second object (source: Manning & Schütze, 1999, p. 420).

Depth of embedding. The depth of embedding of a phrase is determined as follows. The outermost bracketting (corresponding to the top of the parse tree) is at depth 1, its immediate constituents are at depth 2, and so on. In the parsed sentence (S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat))))), S is at depth 1, the first occurrences of NP and VP are at depth 2, the first occurrences of DT and NN as well as VBD and PP at depth 3, IN and the second NP at depth 4, and the second occurrences of DT and NN are at depth 5. It is not obvious that the depth of embedding of a phrase captures linguistically meaningful parts of its local structural context. However, different phrases of the same category do occur at certain depths with higher frequency than at others. This is most intuitively clear in the case of NPs, where subject NPs occur at depth 2, whereas object NPs occur at lower depths. More surprisingly, VPs too have preferences for occurring at certain levels. Table 3 (previously shown in Belz (2000, p. 49)) provides clear evidence of this. The first column shows the six most frequent WSJ VP expansion rules, the second column shows their canonical probabilities (calculated over all WSJ VP rules). The remaining columns show how these probabilities change if they are made conditional on depths of embedding 2–7. For each depth, the highest rule probability is highlighted in boldface font, and the second highest in italics.

                     Canonical   Depth of embedding
                                 2      3      4      5      6      7
p(VP → TO VP)        0.089       0.004  0.067  0.136  0.127  0.135  0.130
p(VP → MD VP)        0.056       0.075  0.043  0.055  0.062  0.050  0.047
p(VP → VB NP)        0.054       0.001  0.036  0.052  0.073  0.088  0.096
p(VP → VB NP PP)     0.039       0.004  0.049  0.047  0.042  0.044  0.055
p(VP → VBZ VP)       0.038       0.069  0.034  0.037  0.025  0.023  0.021
p(VP → VBD S)        0.026       0.090  0.016  0.005  0.005  0.004  0.003

Table 3: Rule probabilities at different depths of embedding for 6 common VP rules.

At depth 2, for instance, the most likely rule is the one with the fourth highest canonical probability, and at depth 5, the second most likely rule is the one with the third highest canonical probability. In fact, there is only one depth (4) at which rule probabilities appear in their canonical order, which shows how strongly even VP rules are affected by depth of embedding.

Parent node. The parent node of a phrase is the category of the phrase that immediately contains it. In (S (NP (DT the) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat))))), S is the parent of NP and VP, VP is the parent of PP, which is the parent of NP. Thus, distinguishing between NP-S (an NP with S as its parent) and NP-PP captures part of the subject/object distinction. The advantage of using parent node information was previously noted by Johnson6 (1998).

4.2 Four LSC-Grammars

Grammars incorporating local structural context—or LSC grammars—were extracted from the corpus by the same procedure as described in Section 3.1 above, except that during Step 2, each bracket label that is not a POS tag was annotated with a tag representing the required type of LSC. Four different grammars were derived in this way, PCFGs FTAGS, DOE, PN and DOEPN. All four grammars incorporate the functional tags present in the WSJ Corpus.
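As a concrete illustration of this annotation step, the sketch below attaches depth-of-embedding and parent-category tags to the non-POS node labels of a bracketed tree before rules are read off. The nested-list tree encoding and the tag formats (e.g. NP-2 for depth, NP^S for parent) are assumptions made for the example, not necessarily the formats used in the reported experiments.

```python
# Illustrative sketch: annotate non-POS node labels with local structural
# context (LSC) before rule extraction. A tree is assumed to be a nested list
# [label, child_1, ..., child_k]; POS-level nodes are [pos_tag, word].
# Tag formats ("NP-2" for depth of embedding, "NP^S" for parent) are assumptions.

def is_pos_node(node):
    return len(node) == 2 and isinstance(node[1], str)

def annotate_lsc(node, depth=1, parent=None, use_depth=True, use_parent=False):
    if is_pos_node(node):
        return node                       # POS tags are left unannotated
    label = node[0]
    new_label = label
    if use_depth:
        new_label += "-%d" % depth        # DOE-style depth-of-embedding tag
    if use_parent and parent is not None:
        new_label += "^" + parent         # PN-style parent-category tag
    children = [annotate_lsc(child, depth + 1, label, use_depth, use_parent)
                for child in node[1:]]
    return [new_label] + children

tree = ["S", ["NP", ["DT", "the"], ["NN", "cat"]],
             ["VP", ["VBD", "sat"],
                    ["PP", ["IN", "on"],
                           ["NP", ["DT", "the"], ["NN", "mat"]]]]]
print(annotate_lsc(tree, use_depth=True, use_parent=True))
# e.g. ['S-1', ['NP-2^S', ['DT', 'the'], ['NN', 'cat']], ['VP-2^S', ...]]
```

The depths assigned by the sketch match the worked example above (S at depth 1, the outer NP and VP at depth 2, PP at depth 3, the inner NP at depth 4).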
Additionally, for grammar DOE, each nonterminal was annotated with a tag representing the depth of embedding at which it was found, for grammar PN, nonterminals were annotated with tags encoding their parent node, and for grammar DOEPN, nonterminals were given both depth and parent node tags. The resulting grammars are significantly larger than the baseline grammar BARE. Grammar sizes and numbers of nonterminals (excluding POS tags) are as follows:

Grammar Type      BARE    FTAGS    DOE      PN       DOEPN
Size (n rules)    6,135   10,118   21,995   16,480   33,101
Nonterminals      26      147      1,104    970      4,015

4.3 Performance on parsing tasks

In calculating labelled bracketting Recall and Precision for the LSC-grammar results, all labels starting with the same category prefix, e.g. NP, are considered equivalent (standard in evalb). The idea is that the additional information encoded in the LSC-tags attached to category labels helps select the correct parse, not that it should be retained in the annotation for further analysis. Table 4 shows parsing results for the unseen data in WSJ Section 01 (the results for baseline grammar BARE are also included for comparison). Best F-Scores are highlighted in boldface font, and second-best F-Scores in italics. The best results in Table 4 are better than those reported by Charniak (1996) and Krotov et al. (2000), even though the previous results were obtained after using ca. 10/11 of the WSJ corpus as a training set (compared to 3/25 used here):

                        UF      LF
Krotov et al. (2000)    79.12   76.09
Charniak (1996)         79.59   –
PN-Grammar              80.51   77.96

6 Johnson calls it grandparent node, but means the same thing.

Grammar Type               BARE    FTAGS   DOE     PN      DOEPN
Partial parses             9       9       25      20      62
Full parsing:
  LR                       69.08   71.41   72.72   78.78   74.33
  LP                       71.43   73.06   71.9    77.16   70.61
  F-Score                  70.24   72.23   72.31   77.96   72.42
  Crossing brackets        2.76    2.51    2.53    1.91    2.59
  % 0 CBs                  32.34   35.43   35.75   44.40   37.0
NP identification:
  LR                       74.97   77.22   78.2    83.86   81.02
  LP                       81.62   81.02   77.56   81.22   74.30
  F-Score                  78.15   79.07   77.88   82.52   77.51
BaseNP chunking:
  LR                       87.6    87.35   87.02   90.27   87.05
  LP                       89.21   88.68   87.03   89.52   84.11
  F-Score                  88.4    88.01   87.02   89.89   85.55
Complete text chunking:
  LR                       89.63   89.49   89.17   90.84   89.24
  LP                       88.99   88.64   87.28   89.46   85.85
  F-Score                  89.31   89.06   88.21   90.14   87.51

Table 4: Parsing results for the four LSC-grammars and WSJ Section 01.

Incorporating different types of LSC affects results for the four parsing tasks in different ways. It is clear from the results in Table 4 that some kinds of contextual information are useful for some tasks but not for others. For example, adding parent phrase information improved results (from grammar BARE to grammar PN) by almost 8 points (F-Score 70.24 to 77.96) for the complete parsing task, by about 4.5 points (F-Score 78.15 to 82.52) for NP identification, by 1.5 points (F-Score 88.4 to 89.89) for baseNP chunking, and by just under one point (F-Score 89.31 to 90.14) for complete text chunking. It is likely that adding depth of embedding information indiscriminately (as in grammars DOE and DOEPN) results in overspecialisation. Looking at results for seen data (part of the training corpus) confirms this. Table 5 shows results for the baseline grammar and the four LSC grammars on WSJ Section 15, i.e. one of the sections used during grammar derivation. On seen data, grammar DOEPN performs best on all parsing tasks. Tables 4 and 5 together imply that adding depth of embedding information for all depths to all rules simply overfits the training data and results in undergeneralisation.
Similarly, it is likely that not all the information added in the four LSC grammars is useful for all parsing tasks. Distinguishing 27 depths of embedding is probably too much for all parsing tasks; e.g. distinguishing depths above 20 is generally unlikely to be useful, as the occurrence of rules at such depths is rare. Techniques for eliminating the information that makes no useful contribution for a given parsing task are discussed in the following section.

5 Automatic optimisation of LSC-Grammars

5.1 Initial assumptions

If it is true that some of the LSC information added to the grammars tested so far makes little or no contribution to a grammar's performance on a given parsing task, then it should be possible to reduce grammar size without loss of parsing performance by selectively taking out some of the added information. At the same time, if it is true that some of the LSC-grammars are overspecialised (overfit the data), then it should be possible to improve the grammars' performance by selectively generalising them. As pointed out above in Section 4.3, it is clear from the LSC results that adding different kinds of LSC information to a grammar has different effects on the results for different parsing tasks. It should therefore be possible to optimise a grammar for a given parsing task by selectively taking out the information that is not useful for the given task. The idea behind the experiments reported in the following section was to see to what extent the LSC grammars can be optimised in terms of size and parsing performance by grammar partitioning for each of the parsing tasks.

Grammar Type               BARE    FTAGS   DOE     PN      DOEPN
Partial parses             0       0       0       0       0
Full parsing:
  LR                       71.48   75.15   82.81   84.64   90.39
  LP                       75.03   78.64   84.86   85.94   91.43
  F-Score                  73.21   76.86   83.82   85.29   90.91
  Crossing brackets        2.57    2.15    1.37    1.31    0.75
  % 0 CBs                  34.48   41.85   56.31   57.33   73.46
NP identification:
  LR                       76.54   79.26   84.51   87.46   91.17
  LP                       84.89   85.61   88.79   88.75   92.61
  F-Score                  80.5    82.31   86.6    88.1    91.88
BaseNP chunking:
  LR                       90.21   90.28   92.68   94.40   95.99
  LP                       92.59   92.70   94.54   95.66   97.19
  F-Score                  91.38   91.47   93.60   95.03   96.59
Complete text chunking:
  LR                       91.68   91.67   93.59   94.25   96.45
  LP                       92.46   92.56   94.19   95.02   96.84
  F-Score                  92.07   92.11   93.89   94.63   96.64

Table 5: Parsing results for the LSC-grammars and WSJ Section 15 (seen data).

5.2 Preliminary definitions

The addition of structural context as described in previous sections can be viewed in terms of split operations on nonterminals, e.g. in the FTAGS grammar, the nonterminal NP is split into NP-SUBJ and NP-OBJ (among others). This results in grammar specialisation, i.e. the new grammar parses a subset of the set of sentences parsed by the original one. The reverse, replacing NP-SUBJ and NP-OBJ with a single nonterminal NP, can be seen as a merge operation, and results in grammar generalisation, i.e. the new grammar parses a superset of the sentences parsed by the original one. An arbitrary number of such merge operations can be represented by a partition on the set of nonterminals of a grammar. A partition is defined as follows.

Definition 1 (Partition) A partition of a nonempty set A is a subset Π of 2^A such that ∅ is not an element of Π and each element of A is in one and only one set in Π.

PCFGs can be defined as follows.
Definition 2 (Probabilistic Context-Free Grammar, PCFG) A PCFG is a 4-tuple (W, N, N_S, R), where W is a set of terminal symbols {w_1, ..., w_u}, N is a set of nonterminal symbols {n_1, ..., n_v}, N_S ⊆ N is a set of start symbols {n^s_1, ..., n^s_v}, and R is a set of rules with associated probabilities {(r_1, p(r_1)), ..., (r_x, p(r_x))}. Each rule r is of the form n → ζ, where ζ is a sequence of terminals and nonterminals. For each nonterminal n, the values of all p(n → ζ_i) sum to one.

Given a PCFG G = (W, N, N_S, R) and a partition Π_N = {N_1, ..., N_v} of the set of nonterminals N, the partitioned PCFG G' = (W, N', N'_S, R') is derived by the following procedure:
1. Assign a new nonterminal name to each of the non-singleton elements of Π_N.
2. For each rule r_i in R, and for each nonterminal n_j in r_i, if n_j is in a non-singleton element of Π_N, replace it with the corresponding new nonterminal.
3. Find all rules in R of which there are multiple occurrences as a result of the substitutions in Step 2, sum their frequencies and recalculate the rule probabilities.

If start symbols are permitted to be merged with non-start symbols, then there are two ways of determining the probability of a rule expanding the nonterminal resulting from such a merge: either its frequency is the sum of the frequencies of all nonterminals in the merge set, or it is the sum of just the frequencies of the start symbols in the merge set. The latter option was chosen in the tests reported below.

5.3 "Proof of concept"

The discussion and results in this section provide preliminary confirmation of the prediction made in Section 4.3 that for the different LSC grammars there exist (non-trivial) partitions that outperform the original base grammar. More formally, the "proof of concept" provided below shows the following for most of the grammar/task combinations: Given a base grammar G = (W, N, N_S, R) and a parsing task P, a partition of the set of nonterminals N can be found such that the derived grammar G' = (W, N', N'_S, R')
1. is smaller than G (i.e. |R'| < |R|), and
2. performs better than G on P.

Some of the five LSC-PCFGs can be derived by partition from one of the others. For example, BARE can be derived from all others, FTAGS can be derived from DOE, PN and DOEPN, and DOE and PN can both be derived from DOEPN. This means that for some of the grammars, the results given in Section 4.3 in themselves show that there exists at least one (non-trivial) partition that is smaller than and outperforms the original grammar. E.g. for the baseNP chunking task, the partition that derives PN from DOEPN achieves nearly a 3 point improvement (F-Score 87.63 to 90.23), while reducing grammar size from 33,101 rules to 16,480, and the number of nonterminals from 4,015 to 970. In the remainder of this section it is shown that there are other partitions of the DOE grammar that improve its performance and reduce its size.

Grammar type    Depth bands       Grammar size   Nonterminals
DOE             1, 2, ..., 27     21,995         1,104
                1, 2, 3, rest     12,933         312
                1, 2, rest        11,254         224
                1, rest           10,165         170
FTAGS           –                 10,118         147
BARE            –                 6,135          26

Table 6: Sizes and depth bands of DOE grammar and 5 of its partitions.

From the parsing results for the DOE grammar it appears that indiscriminately adding depth of embedding information does not help improve parsing performance for shallow parsing tasks on unseen data: while there is a significant improvement for the complete parsing task (F-Score 70.24 to 72.31), the F-Scores for the other three parsing tasks are worse.
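For concreteness, the partitioning procedure of Section 5.2 can be sketched as follows. The rule representation (a mapping from (left-hand side, right-hand side) pairs to frequencies) and the naming scheme for merged nonterminals are assumptions made for illustration, and the special treatment of start symbols described above is left out of the sketch.

```python
from collections import defaultdict

# Illustrative sketch of deriving a partitioned PCFG (Section 5.2).
# Rules are assumed to be given as {(lhs, rhs_tuple): frequency}; the "+"-joined
# names for merged nonterminals are an arbitrary choice for the example.

def apply_partition(rules, partition):
    """partition: iterable of sets of nonterminals; unmentioned symbols stay singletons."""
    rename = {}
    for element in partition:
        names = sorted(element)
        merged = names[0] if len(names) == 1 else "+".join(names)
        for nt in names:
            rename[nt] = merged

    def relabel(symbol):
        return rename.get(symbol, symbol)       # POS tags and terminals pass through

    # Steps 1 and 2: rename nonterminals; rules that become identical pool frequencies.
    freq = defaultdict(float)
    for (lhs, rhs), count in rules.items():
        freq[(relabel(lhs), tuple(relabel(s) for s in rhs))] += count

    # Step 3: recompute rule probabilities by Maximum Likelihood Estimation.
    lhs_total = defaultdict(float)
    for (lhs, _), count in freq.items():
        lhs_total[lhs] += count
    return {rule: count / lhs_total[rule[0]] for rule, count in freq.items()}

# Merging two depth-annotated NP nonterminals back into a single nonterminal:
rules = {("NP-2", ("DT", "NN")): 30, ("NP-3", ("DT", "NN")): 10, ("NP-3", ("PRP",)): 10}
print(apply_partition(rules, [{"NP-2", "NP-3"}]))
# {('NP-2+NP-3', ('DT', 'NN')): 0.8, ('NP-2+NP-3', ('PRP',)): 0.2}
```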
That there is any improvement shows that some useful information is added. It is likely that distinguishing all depths simply leads to overspecialisation of the grammar, resulting in a large increase in parse failures on the one hand, and the selection of bad, previously unlikely, parses on the other. If this is so, then partitioning the DOE grammar in a way equivalent to distinguishing broader depth bands rather than each individual depth will improve results. To test this hypothesis, three different partitions of the DOE grammar were created. The partitions (too large to be shown in their entirety) correspond to distinguishing between the different depths shown in the second column of Table 6, e.g. in the case of the fourth row, all nonterminals NT-n with a depth tag n greater than 1 are merged into a single nonterminal NT-rest. The last two columns show the number of rules and nonterminals in each grammar. The last two rows show the corresponding numbers for the BARE and FTAGS grammars (DOE-type grammars all incorporate functional tags). The partitioned DOE grammars all improve results (compared to grammars BARE, FTAGS, and DOE) for the full parsing task, with the DOE-{1, 2, rest} grammar performing the best. For the NP identification task, grammar DOE achieved a worse F-Score than grammar BARE, yet all the partitioned DOE grammars achieve a better F-Score than grammar BARE, with the DOE-{1, 2, rest} grammar again performing the best. On the baseNP chunking task and the complete text chunking task, grammar BARE performs the best, but all the derived DOE grammars outperform the nonpartitioned DOE grammar. On the baseNP chunking task, the BARE grammar's F-Score is closely matched by the DOE-{1, rest} grammar.

Grammar Type               BARE    FTAGS   DOE     DOE-{1,rest}  DOE-{1,2,rest}  DOE-{1,2,3,rest}
Full parsing:
  LR                       69.08   71.41   72.72   73.35         74.13           74.14
  LP                       71.43   73.06   71.9    74.48         75.1            74.73
  F-Score                  70.24   72.23   72.31   73.91         74.61           74.43
  Crossing brackets        2.76    2.51    2.53    2.32          2.19            2.21
  % 0 CBs                  32.34   35.43   35.75   38.83         39.1            38.56
NP identification:
  LR                       74.97   77.22   78.2    77.39         77.95           78.29
  LP                       81.62   81.02   77.56   81.20         81.31           80.85
  F-Score                  78.15   79.07   77.88   79.25         79.59           79.55
BaseNP chunking:
  LR                       87.6    87.35   87.02   87.82         87.73           87.74
  LP                       89.21   88.68   87.03   88.93         88.59           88.11
  F-Score                  88.4    88.01   87.02   88.37         88.16           87.93
Complete text chunking:
  LR                       89.63   89.49   89.17   89.58         89.71           89.70
  LP                       88.99   88.64   87.28   88.54         88.52           88.29
  F-Score                  89.31   89.06   88.21   89.06         89.11           88.99

Table 7: Parsing results of DOE grammar and 5 of its partitions.

[Figure 1: Partition tree for a set with three elements.]

These results show that partitions can be found that not only drastically reduce grammar size but also significantly improve parsing performance on a given parsing task.

5.4 Search for optimal partition of LSC-Grammars

Given. A PCFG G = (W, N, N_S, R), a data set D, and a set of target parses D_T for D.

Search space. The search space is defined as the partition tree for the set of nonterminals N in the given grammar G. Each node in the tree is one of the partitions of N, such that each node's partition has fewer elements than all of its ancestors, and the partition at each node can be derived from its parent by merging two elements of the parent's partition. The single node at the top of the tree is the trivial partition corresponding to N itself. Each node whose partition has n elements is the parent of ½(n² − n) children, and each level of the tree reduces the number of states (partition elements) by one.
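To make the search space concrete, a child of a node in this tree is obtained by merging exactly two elements of the node's partition, which is where the count of ½(n² − n) children comes from. A small sketch, with partitions represented as sets of frozensets (an assumption for illustration):

```python
from itertools import combinations

# Illustrative sketch: generate the children of a node in the partition tree.
# Each child merges exactly two elements of the parent's partition, so a
# partition with n elements has n*(n-1)/2 children.

def children(partition):
    parts = list(partition)
    for a, b in combinations(parts, 2):
        rest = [p for p in parts if p is not a and p is not b]
        yield frozenset(rest + [a | b])

trivial = frozenset({frozenset({0}), frozenset({1}), frozenset({2})})
for child in children(trivial):
    print(sorted(sorted(element) for element in child))
# prints the three two-element partitions of {0, 1, 2}, i.e. the middle level of Figure 1
```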
The complete partition tree for a set with three elements looks as shown in Figure 1.

Search method. The partition tree is searched top-down by a variant of beam search. A list of the n current best candidate partitions is maintained (initialised to the trivial partition). For each of the n current best partitions a subset of size b of its children in the partition tree is generated and evaluated (b thus defines the width of the beam). From the set of current best partitions and the newly generated candidate partitions, the n best elements are selected and form the new current best set. This process is iterated until either no new partitions can be generated that are better than their parents, or the lowest level of the partition tree is reached. In the current version of the evaluation function, only the F-Score achieved by candidate solutions on the test data is taken into account. Search stops if in any iteration (depth of the partition tree) no solution is found that outperforms the current best solutions. That is, size is not explicitly evaluated at all. Candidate solutions are evaluated on a subset of the test data, because evaluating each candidate solution on all 1,993 sentences of WSJ Section 01 makes the cost of the search procedure prohibitive. There are three variable parameters in the partition tree search procedure: (i) the number n of partitions (nodes in the tree) that are further explored, (ii) the size x of the subset of the test data that candidate solutions are evaluated on, and (iii) the width b of the beam.

5.5 Results for LSC-Grammar optimisation by search of partition tree

Table 8 shows some results for automatic optimisation experiments carried out for grammar PN and the baseNP chunking and complete text chunking tasks. The first three columns show the variable parameters b (beam width), n (size of the list of best solutions maintained), and x (size of the data subset used in evaluation). The fourth column shows the number of runs results are averaged over, and the fifth and sixth columns show the number of iterations and evaluations carried out before search stopped. Column 7 gives the average number of nonterminals the best solution grammars had, and column 8 their average evaluation score. The last two columns show the overall change in F-Score (calculated on all of WSJ Section 01) and grammar size for the given grammar and parsing task.

b      n   x    Runs  Iter.  Eval.   Nonterms  F-Score (sub)  F-Score +/-      Size +/-

Grammar: PN; Grammar Size: 16,480/970; Task: BaseNP chunking; F-Score: 89.89
100    2   50   4     4      45      968.25    95.93          +0.032 (89.92)   -0.25
100    10  50   4     6.75   341.5   967.25    97.25          +0.048 (89.94)   -2
500    1   50   4     5.25   499     967.5     97.49          +0.06 (89.95)    -2.25

Grammar: PN; Grammar Size: 16,480/970; Task: Complete Text Chunking; F-Score: 90.14
1,000  1   10   4     5      523.75  967       100.00         +0.06 (90.2)     -0.75

Table 8: Results for automatic optimisation tests.

Current results show insensitivity to the precise values of parameters b and n. What appears to matter is just the total number of evaluations, results being better the more candidate solutions are evaluated. Results indicate a greater sensitivity to the value of x: a data subset size of 10 is clearly too small, as search quickly finds solutions with an F-Score of 100 and then stops (last row of Table 8). Overall, results are not nearly as good as might have been expected after the preliminary tests described above. Only small numbers of nonterminals were merged, and small improvements achieved, before search stopped.
However, the fact that every single run achieved an F-Score improvement and almost all runs resulted in a decrease in grammar size even for small numbers of merged nonterminals indicates that the basic approach is right, but that some way has to be found of overcoming the local optima at which search in the reported experiments stopped, by widening the beam, changing the evaluation function, or by using a more sophisticated search method.

6 Conclusions and further research

The first part of this paper looked at the effect of adding three different kinds of local structural context—grammatical function, parent node and depth of embedding—to a basic PCFG derived from the Wall Street Journal Corpus. Grammars were tested on four different parsing tasks differing in complexity and shallowness. Results showed that all three types of context improve performance on the complete parsing task, but that only parent node information improves performance on all parsing tasks. The PCFG with parent node information was particularly successful and achieved better results on the complete parsing task than the best previously published results for nonlexicalised grammars and WSJ corpus data. In the second part of the paper, a new method for optimising PCFGs was introduced that has the effect of overcoming overspecialisation by generalising grammars. It was shown that partitions can be found that drastically reduce grammar size and significantly improve parsing performance. First results were reported for applying an automatic search method to a PCFG that incorporates parent node information, and the tasks of baseNP chunking and complete text chunking. Results are promising, but indicate that in order to achieve radical improvements in parsing performance and grammar size, a different evaluation function and/or more sophisticated search methods may be required.

7 Acknowledgements

The research reported in this paper was carried out as part of the Learning Computational Grammars Project funded under the European Union's TMR programme (Grant No. ERBFMRXCT980237).

References

Krotov A, Hepple M, Gaizauskas R, Wilks Y. 2000. Evaluating two methods for treebank grammar compaction. Natural Language Engineering, 5(4):377–394.
Belz A. 2000. Computational Learning of Finite State Models for Natural Language Processing. Ph.D. thesis, COGS, University of Sussex.
Cardie C, Pierce D. 1998. Error-driven pruning of treebank grammars for base noun phrase identification. In Proceedings of COLING-ACL '98, pp 218–224.
Manning C, Schütze H. 1999. Foundations of Statistical Natural Language Processing. MIT Press.
Tjong Kim Sang E, Buchholz S. 2000. Introduction to the CoNLL-2000 shared task: Chunking. In Proceedings of CoNLL-2000 and LLL-2000, pp 127–132.
Charniak E. 1996. Tree-bank grammars. Technical Report CS-96-02, Department of Computer Science, Brown University.
Charniak E. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of NCAI-1997, pp 598–603.
Charniak E. 2000. A maximum-entropy-inspired parser. In Proceedings of NAACL-2000, pp 132–139.
Schmid H. 2000. LoPar: Design and implementation. Bericht des Sonderforschungsbereiches "Sprachtheoretische Grundlagen für die Computerlinguistik" 149, Institute for Computational Linguistics, University of Stuttgart.
Ramshaw L, Marcus M. 1995. Text chunking using transformation-based learning. In Proceedings of the Third ACL Workshop on Very Large Corpora, pp 82–94. Association for Computational Linguistics.
Collins M. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of ACL and EACL '97, pp 16–23.
Johnson M. 1998. The effect of alternative tree representations on tree bank grammars. In Proceedings of the Joint Conference on New Methods in Language Processing and Computational Natural Language Learning (NeMLaP3/CoNLL'98), pp 39–48.
Collins M. 1999. Head-driven statistical models for natural language parsing. Ph.D. thesis, Department of Computer and Information Science, University of Pennsylvania.
Collins M. 2000. Discriminative reranking for natural language parsing. In Proceedings of ICML 2000.
Abney S. 1991. Parsing by chunks. In Berwick R, Abney S, Tenny C (eds), Principle-Based Parsing, pp 257–278. Kluwer.

"But this formula doesn't mean anything!"

Ylva Berglund, Uppsala University
Oliver Mason, University of Birmingham

When working with quantitative measures of even simple statistics, one is sometimes confronted by colleagues who fail to see the significance of such studies, claiming there was no meaning in a statistical formula. Even when formulae are used to measure some textual parameter, the results may be difficult to interpret. What does it actually mean if a text has an average word length of 4.67 and a TTR of 0.34? In this paper we start from earlier research into measuring the performance of non-native speakers (or rather writers). We have previously shown (Proc of TaLC 2000, to appear) that automatic stylistic assessment can be used to distinguish between different kinds of texts, both those of different genres and, with even better results, native versus non-native authorship. Taking into account a variety of ‘surface’ parameters which were measured for essays written by Swedish learners of English at University level, we will now look at a possible mapping between a number of numerical values and the quality of the essay, which cannot be directly quantified itself. Using PCA and cluster analysis we try to demonstrate what gives an English text its ‘Englishness', and how non-native speakers develop their language ability in this ‘textual parameter space'. We also hope to show that even if a formula as such does not mean anything by itself, quantitative measures are a valuable means that can be used to enhance our understanding of intuitive judgements about texts.

Comparing cohesive devices: a corpus-based analysis of conjunctions in written and spoken learner discourse

Roumiana Blagoeva
Department of English and American Studies
Sofia University St. Kliment Ohridski

1. The aims of this paper are: to describe the process of developing two learner corpora of English at Sofia University; to mention some problems associated with the collection of the data as well as some of the applications of corpora of this type; and to present the findings from the research carried out so far.

2. The International Corpus of Learner English (ICLE) project, launched at the University of Louvain in Belgium to compile an international corpus of written learner language, was joined by the Bulgarian team in 1996. To conform to the requirement for comparability all the teams participating in the project are collecting the same type of data, the only difference between the sub-corpora being the learners’ mother tongue. At present the Bulgarian sub-corpus contains about 112 000 words, which is 55% of the total amount of data required, and work on it is still going on.
The data consist largely of argumentative essays written by Sofia University students at the beginning of their second academic year, so they can be described as adult advanced learners of English. Each of the 112 learners contributed about 1000 words on one or two of the essay topics suggested by the ICLE members. In 1995, a complementary project was conceived in Louvain to compile a corpus of spoken learner language. The Louvain International Database of Spoken English Interlanguage (LINDSEI) project is the first of its kind and was soon joined by several other countries, including Bulgaria. In order to ensure comparability of the data each sub-corpus was to contain transcripts of 50 fifteen-minute interviews with non-native university students of English. The Bulgarian sub-corpus has already been compiled, and the transcription and keying-in of the data has been fully completed. The number of words collected approaches 110 000, including the speech of the interviewers, who are native speakers of English.

3. A prime concern of this corpus creation activity is to collect data from a homogeneous population. Among the variables that need to be controlled are learning environment, age, mother tongue, stage of learning and nature of the task. The relevant biographical information about each contributing learner, such as years of English at school, stay in an English speaking country, knowledge of other foreign languages, use of reference tools, is encoded in a learner profile questionnaire which learners fill in (Granger, 1994: 26). The group of learners who contributed to the Bulgarian sub-corpora is homogeneous in terms of all the variables listed above. The subjects are all second-year students of English and American Studies at Sofia University; aged 20 to 22; the reference tools used when writing the essays are monolingual dictionaries; their nationality, mother tongue and language spoken at home is Bulgarian with only 2 exceptions where one of the parents speaks Turkish at home; the medium of instruction at school and at university has been English and Bulgarian; and very few students (only 6) have spent from four months up to two years in an English speaking country. Another essential consideration in developing the corpora is the representativity of the data to be collected. In compiling the two Bulgarian sub-corpora there was no selection of contributors on the basis of academic record or any personal preferences on the part of teachers and students. The essay writing as well as the interviews, their transcription and conversion into electronic form were incorporated in the students’ written and oral assignments. In this way all Bulgarian students of English and American Studies at Sofia University could participate in the two projects and they did their best to fulfil the tasks. There are inevitable differences in the previous learning experience of the subjects, but I consider them an advantage rather than a hindrance because this fact practically excludes the influence of one and only one teaching strategy on the learners, thus rendering the data representative of a much wider population than the one sampled. It should also be noted that since the ratio between male and female students at the Department is 1:10, the number of female participants is much higher (90%) than the number of male ones. If the two sexes were to be represented equally the collection of data would have extended over a much longer period of time.

4.
A learner corpus is very different from a native corpus because of the nature of the material collected. A native corpus contains data from a natural language and can be used on its own for the investigation of characteristic features of this language. A learner corpus presents evidence of an interlanguage; and an interlanguage, regardless of its stages of development, can only be a simulation of the natural language that is the target aimed at in the process of FLT. Therefore, any learner corpus would be of little value on its own, but it could be a useful tool for investigating a particular interlanguage when compared to a relevant native corpus. The choice of this native-speaker corpus is dependent on the aims of FLT. In preceding decades comparisons were largely carried out between learner language and the norm of the target language described in grammar books and dictionaries. Isolated examples of TL and IL were analysed with the purpose of finding erroneous structures and elements in the IL. So analyses tended to overlook the fact that learner language is not characterised only by misuse, but also by over/underuse of words, syntactic structures and discourse features. This is especially true for the highest levels of FLA, where errors are rare but we still feel that learners have not achieved near-native competence and learner production still differs from what native speakers would produce in similar situations. If the final goal of FLT/FLA is to achieve an ability to use the TL the way it is used by native speakers for the fulfillment of certain real-life tasks then a study of interlanguage will always need a corpus of authentic samples of the foreign language to compare with learner production. 5. For this reason two computerized native-speaker corpora of nearly the same amount of data were also developed at Sofia University. The written native language corpus is a collection of newspaper articles and essays on various topics; the spoken native language corpus contains transcripts of interviews, dialogues, announcements, extracts from radio programmes etc. All of the texts in these two corpora are used as teaching materials in- or outside class, and in test papers. The choice of text types and sources is justified by the fact that the most immediate contact Bulgarian students have with the English language is established through teaching and the media rather than personal contact with native speakers. Moreover, the students are trained to be specialists in English language and culture and on graduating should be able to communicate in English in a variety of situations as well-educated people. Therefore, a comparison of their production with the types of native corpora mentioned is relevant for the present study. 6. Having such electronic collections of natural textual data enables us to carry out research on a large scale and to explore characteristic features of interlanguages across a variety of backgrounds in a quick and reliable way. Instead of analysing isolated, invented sentences we now have the opportunity to explore language in use with its purposes and functions in mind. 
“… for us the actual text, not the invented sentence, must be the essential linguistic unit … In the coming millennium, this prospect can now finally be documented and clarified by working with very large corpora of authentic texts, whereby we can hope to uncover some of the vital and delicate missing links between ‘language’ and ‘text'.” (Beaugrande, 2000) As noted by Halliday (Halliday, 1976: 2), “a text does not consist of sentences; it is realized by, or encoded in sentences”. Hence, only the study of longer stretches of discourse can give an insight into the resources that English has for creating texture.

7. Some of the first searches applied to the Bulgarian sub-corpora are connected to the use of conjunctive elements by the learners. Conjunctions seem to be the most explicit way of establishing relations in a text since they indicate how what is to follow is systematically connected to what has gone before in the text (Halliday, 1976: 227). They are a means that text producers use “to exert control over how relations are recovered and set up by the receivers” (Beaugrande, 1983: 74). Conjunctive elements with their intrinsic function to signal relations in a text are rather different from the rest of the lexicon in a language: they are relatively independent of context in the sense that we can expect them to be present in a text no matter what the particular topic of this text is. The study of their use by foreign learners of English can provide the first step to the understanding of the ideas that learners have about text structure and to the investigation of their ability to construct a text.

8. The present analysis examines fifty words and phrases that, according to Halliday's (Halliday, 1976: 242-243) classification of conjunctive relations, express additive, adversative, causal and temporal relations in a text (and, being one of the most frequently used additive conjunctions, is the subject of a separate study and is excluded from this analysis). Using WordSmith Tools (Scott, 1997), concordances were produced for the fifty conjunctions in each of the four corpora. In Corpus 3, containing the learner spoken data, the interviewers’ words were manually excluded and only the interviewees’ speech was taken into account. This slightly reduced the size of this corpus, so the frequency of occurrence of each item was calculated as a percentage of the total number of tokens in the corpora. The results of this first search are summarised in Table 1.

Percentage of occurrences in each corpus

                           Corpus 1           Corpus 2          Corpus 3          Corpus 4
                           Learner written    Native written    Learner spoken    Native spoken
                           100 000 words      100 000 words     70 000 words      100 000 words
50 conjunctions studied    3.627              1.894             4.328             3.364

Table 1. Frequency of occurrence of 50 cohesive devices in four corpora

8.1. There are two striking facts that these figures reveal: first, the greater use of connectors by both native speakers and learners in speaking than in writing; and, second, a clear overuse of conjunctions by learners in written and spoken production respectively in comparison with native speakers.

8.1.1. A number of scholars studying spoken language (e.g. Labov, 1972; Chafe, 1979; Cicourel, 1981; Goffman, 1981; Biber, 1995) have reached the conclusion that conjuncts are more formal and therefore more typical of written rather than spoken language because the speaker is usually less explicit than the writer.
It is true that the speaker can resort to a number of means of expression such as gestures, posture, tone, etc., but it is equally true that in most situations the speaker is under the pressure of time-limitations as he observes his interlocutor and has to respond quickly to his reactions, sometimes modifying what he is saying and making it clearer and more concise. The writer, on the contrary, can construct a text at leisure, can spend more time on the choice of particular words and syntactic structures, or can go over the whole text and edit it in many different ways. This allows the writer to avoid certain unnecessary repetitions (including the repetitions of formal connectors) and to find other means of structuring his text in a logical way. Whatever the reason, I am fully aware of the fact that working with corpora of 100 000 words each cannot give a detailed picture of the existing state of written and spoken language. What this study presents is based only on the observations of the raw data from the discourse stretches in the particular corpora under investigation. I believe that the extension of the data and further analyses and collaboration with the other teams may explain this phenomenon.

8.1.2. In trying to provide reasons for each deviant use of the target language by learners we most often turn to the factors influencing FLA and the psycholinguistic processes central to second language learning as determined by Selinker (1972: 37). The overuse of conjunctions by learners of English could be due to some teaching/learning strategies. In most classrooms, and more specifically in Bulgaria, speaking and writing are distinguished as different skills and are trained separately and very often independently. Teaching materials used in English classes and in the special writing courses place too great an emphasis on text structure and the importance of formal connectors in general, while at the same time leaving very little space for other ways of achieving coherence. This may lead learners to overgeneralization of some TL rules and may bring them to the conclusion that a well-written and logically structured text can be produced only if such language items are abundantly present in it. Such an idea results in a tendency on the part of learners to “paste” connectors mechanically to the texts they produce in an attempt to give them some “shape” (Blagoeva, 2000: in press). By contrast, the development of speaking skills usually goes along the lines of free discussions on suggested topics. Little instruction is given on how to speak coherently and to participate effectively in the act of communication. Greater attention is paid to grammatical accuracy and pronunciation as well as to knowledge connected with the topics discussed. It should be noted, however, that the contributors to the two learner corpora are highly influenced by long and constant exposure to written forms of language. Since our Department offers simultaneous courses in Linguistics and English and American Literature the students’ work is mostly directed towards reading and writing tasks. This probably accounts for the overuse of conjunctive elements and the more formal register they use in their speech.

9. The overall overuse of conjunctions by the learners could be misleading if concrete examples were not examined in greater detail. Table 2 presents the occurrences of the top 15 connectors in descending order of frequency.
Percentage of occurrences in each corpus

Top 15 connectors             Corpus 1          Corpus 2         Corpus 3          Corpus 4
                              Learner written   Native written   Learner spoken    Native spoken
                              100 000 words     100 000 words    70 000 words      100 000 words
But                           0.632             0.426            0.988             0.632
Because                       0.203             0.062            0.534             0.238
So                            0.132             0.033            0.432             0.429
However                       0.108             0.032            0                 0.012
Then                          0.107             0.090            0.234             0.299
For instance / For example    0.098             0.017            0.062             0.058
Though / Although             0.091             0.077            0.058             0.072
Of course                     0.066             0.020            0.074             0.107
Therefore                     0.056             0                0                 0.020
Thus                          0.050             0.002            0.004             0
Rather                        0.040             0.016            0.020             0.025
Actually                      0.035             0.013            0.235             0.135
In fact                       0.033             0.014            0.122             0.059
Yet                           0.022             0.015            0.003             0.010
I mean                        0.006             0                0.245             0.144

Table 2. Frequency of occurrence of the top 15 connectors

9.1. Even a cursory glance at the first two columns shows that the greater number of instances of conjunctions in the written learner production is almost evenly distributed among all the 15 most frequently used connectors. In a recent paper Blagoeva (2000: in press) also reports that the results obtained from the Bulgarian written learner data confirm the findings made by Granger (1994: 27-28) about the overuse of the same connectors in the writing of French learners. (For further details see Blagoeva, 2000: in press.)

9.2 The data extracted from the spoken corpus, however, demonstrates a distribution of connectors in the speech of the learners different from that of the native speakers. Two of the connectors, so and then, have nearly the same ratio in Corpus 3 and Corpus 4, probably because they are also perceived by the learners as pause fillers. But, because, in fact, I mean, actually seem to be favourites of Bulgarian speakers of English and it is quite understandable if NL interference as a factor influencing learners’ choices of words is considered. The English in fact can introduce “a contradiction or an opinion which is different from something that has just been said” (COBUILD, 1994) and is therefore classified as an adversative, contrastive conjunction (Halliday, 1976: 242-243). The Bulgarian translation equivalent of in fact, vsăštnost, means only in reality and is used to introduce some clarification or to add details to a previously mentioned statement (Bulgarian Language Dictionary, 1994). In this way it functions rather as an additive, and since this is the more frequent relation in a text it is quite natural to find more instances of this connector in the Bulgarian learner speech. The other conjunctions that show greater frequency in the Bulgarian-English interlanguage have functions similar to those of their translation equivalents in the NL of the learners. But in English and no in Bulgarian, for example, have the same contrastive adversative function; because and zaštoto in English and Bulgarian respectively “introduce the reason for a statement or the answer to a ‘why’ question” and are both causal conjunctions. There seems to be no place for confusion here and it is true that no instances of misuse were encountered. Still NL interference could work not only because of formal differences between languages but also through the transfer of different speaking or writing habits in the mother tongue due to some cultural differences. One hypothesis at this point is that Bulgarians tend to be more explicit when stating reasons or objections. Yet such a conclusion could be confirmed only after comparisons with relevant Bulgarian native corpora, which are still being compiled at Sofia University.
At the same time several connectors on the list point to a tendency for underuse in learner speech. Obviously, some language items are used at the expense of others whenever students do not feel confident enough in their knowledge of the foreign language.

10. One could argue that the higher or lower frequency of formal connectors in learner language, as long as they are used correctly, may not lead to serious communication breakdowns. Still, it is my view that it could interfere with a receiver's comprehension of a text and could contribute to the artificiality of learner English. At an advanced stage of FLA students should be made aware that they tend to stick to some language structures and should be encouraged to turn to other means of achieving cohesion. Another salient point is connected with the differences between spoken and written language and the choice of teaching materials for the development of speaking and writing skills. By revealing characteristic features of learner language, corpus analysis studies offer ways of diagnosing the true learners’ needs for the different purposes of communication. The results of such studies can turn the attention of teachers to the fact that speaking is not equivalent to mere talking but is a special skill that can be trained in a systematic way. Naturally, further corpus-based investigation of other discourse features is likely to be the way forward to developing learner resources that will bring interlanguages closer to the kind of language used by native speakers of English.

References:
Andreichin L, et al 1994 Bălgarski tălkoven rečnik [Bulgarian Language Dictionary]. Sofia: Nauka i izkustvo.
Beaugrande R 2000 Text linguistics at the millennium: Corpus data and missing links. Text, 20.
Beaugrande R, Dressler W 1983 Introduction to text linguistics. London and New York: Longman.
Biber D 1995 Variation across speech and writing. Cambridge: Cambridge University Press.
Blagoeva R 2000 Comparing cohesive devices: conjunctions and other cohesive relations and their place in the Bulgarian-English interlanguage. Paper presented at the Third international conference for research in European studies, Veliko Turnovo, Bulgaria.
Chafe W L 1979 The flow of thought and the flow of language. In Givon T (ed).
Cicourel A 1981 Language and the structure of belief in medical communication. In Sigurd B, Startvik J (eds), Proceedings of AILA 81. Studia Linguistica 5: 71-85.
Goffman E 1981 Forms of talk. Oxford: Basil Blackwell.
Granger S 1994 The learner corpus: a revolution in applied linguistics. English Today 39: 25-29.
Halliday M A K, Hasan R 1976 Cohesion in English. London and New York: Longman.
Labov W 1972 Sociolinguistic patterns. Philadelphia: University of Pennsylvania Press.
Scott M 1997 WordSmith Tools. Version 2. Oxford: Oxford University Press.
Selinker L 1972 Interlanguage. International Review of Applied Linguistics 10: 209-31.
Sinclair J (ed. in chief) 1994 Collins Cobuild English Language Dictionary. London: HarperCollins Publishers.

Frame Semantics as a framework for describing polysemy and syntactic structures of English and German motion verbs in contrastive computational lexicography

Hans C. Boas1
ICSI/UC Berkeley

This paper addresses the question of how to account for verbal polysemy from a contrastive point of view.
By examining the syntactic and semantic distribution of arguments of a selected number of English and German motion verbs, I intend to demonstrate the usefulness of Fillmore's (1982) Frame Semantics for describing verbal argument realization patterns across languages. In this connection it will be shown that frame-semantic descriptions offer a unified way of relating the full range of lexical units2 that instantiate the same semantic concept. In addition to these theoretical considerations, practical applications of the frame-semantic approach to lexical organization are discussed.

1. Polysemy of English and German motion verbs

The data in (1) – (2) illustrate a small range of senses associated with the motion verbs run and walk expressed in terms of distinct syntactic argument realization patterns. In (1a), run is used in a Self-motion sense to describe a situation in which a Self-mover (Julie) arrives at a Goal (the store) as the result of her moving under her own power.3 In (1b), run is used in a Caused-motion sense to describe a situation in which an Agent (Julie) causes a Theme (Pat) to end up in a location, in this case a Goal (off the street).

(1) a. Julie ran to the store.
    b. Julie ran Pat off the street.
(2) a. Rod walked to the door.
    b. Rod walked Melissa to the door.

The semantics of walk in (2a) is similar to the semantics of run in (1a) in that it describes the motion of a Self-mover (Rod) towards a Goal (the door). Following the terminology developed by Johnson et al. (2001), I classify the usages of run and walk in (1a) and (2a) as Self-motion. Walk differs from run in two respects. First, the manner of motion expressed by walk is of a slower nature than that expressed by run.4 Second, the semantics of walk in (2b) differs from the semantics of run in (1b) in terms of contact between the two event participants and their relation to each other. That is, whereas run in (1b) incorporates a notion of force, (2b) does not. In contrast to the Caused-motion semantics attributed to run in (1b), the Cotheme semantics associated with the use of walk in (2b) implies that the two event participants (i.e., frame elements), Rod (the Self-mover) and Melissa (the Cotheme), started walking together from an unmentioned common Source along an unmentioned Path to their final destination, the Goal (to the door). In German, the basic types of situations described by run and walk in (1a) and (2a) are typically expressed by rennen and gehen, respectively. (3a) illustrates that the basic Self-motion sense of rennen is the translation equivalent of the basic Self-motion sense of run in (1a). Note, however, that although the basic Self-motion sense of run shows considerable semantic and syntactic overlap with the basic Self-motion senses of rennen, there is no such overlap between run in (1b) and rennen in (3b).

(3) a. Tina rannte zum Geschäft.
       Tina ran to the store
       ‘Tina ran to the store.’

1 The research reported here has been made possible by a postdoctoral fellowship from the “Deutscher Akademischer Austauschdienst” (German Academic Exchange Service) under the “Gemeinsames Hochschulprogramm III von Bund und Ländern” Program for conducting research with the members of the FrameNet project (http://www.icsi.berkeley.edu/~framenet) (NSF Grant No. IRI-9618838, P.I. Charles Fillmore) at the International Computer Science Institute in Berkeley, California.
2 A lexical unit is a word in one of its senses.
3 For the sake of clarity, names of frame elements (i.e., semantic roles) and semantic frames are capitalized.
4 The difference in speed between run and walk is classified by Levin (1993: 265) as a difference in manner of motion. This leads her to classify run and walk as “manner of motion verbs.”

    b. *Tina rannte Enno von der Straße ab.
        Tina ran Enno from the street off
    c. Tina drängte Enno (beim Rennen) von der Straße ab.
       Tina pushed Enno (while running) from the street off
       ‘Tina pushed Enno off the street (while running).’

A comparison between the Caused-motion sense associated with run in (1b) and the sentence in (3b) shows that German rennen is not conventionally associated with a Caused-motion sense. The phenomenon exemplified by the distribution of run and rennen in (1b) and (3b) is a case of what has been called “divergence” in recent work on machine translation (cf., e.g., Dorr (1990) and Heid (1994)). Divergences are cases in which different languages use different means to convey a given meaning. In the case of German rennen, this means that the translation equivalent of the Caused-motion sense associated with run in (1b) is expressed by a different type of verb in German, in this case abdrängen in (3c). A careful comparison of the Caused-motion meaning of run in (1b) with the meaning conveyed by abdrängen in isolation in (3c) shows that the semantics of the two verbs do not exhibit an exact overlap. That is, abdrängen without further specifications does not encode the manner in which the Theme (i.e., Enno in (3c)) has been caused to move to its end location. Information about the manner in which the caused-motion activity took place must be supplied by a separate phrase, beim Rennen. The comparison between the sentences in (1) and (3) shows that English and German verbs may differ with respect to how the semantics of Caused-motion is lexicalized. Whereas English may supply its motion verbs with a specific syntactic frame to express Caused-motion semantics, German does not allow for such an option with rennen. The language provides a different type of verb to express Caused-motion and leaves open the option of specifying the particular manner in which it happened. Building on Talmy's (1985) terminology from work on motion expressions, I refer to the type of realization of Caused-motion semantics exemplified by run in (1b) as construction-framed semantics. That is, the abstract semantic concept of Caused-motion in (1b) is lexicalized in terms of a construction-specific syntactic frame occurring with the same verb as the basic sense, i.e., run. German abdrängen in (3c) is an example of what I refer to as verb-framed semantics. In this case, the caused-motion semantics is not lexicalized in terms of a specific syntactic frame occurring with the same lexical unit expressing the basic sense. Instead, it is a lexicalized concept inherent to the semantics of a different lexical unit, in this case abdrängen.5 Turning to the German translation equivalents of the two senses of walk in (2) above, note that the use of gehen in (4a) exhibits the same basic Self-motion sense as walk in (2a). A comparison between (2a) and (4a) illustrates that in contrast to walk, which is associated with a construction-framed Cotheme semantics, gehen does not exhibit this pattern. Instead, German forces the use of a different verb, namely begleiten, in (4c), to express the Cotheme semantics exhibited by walk in (2b).

(4) a. Bernd ging zur Tür.
       Bernd walked to the door
       ‘Bernd walked to the door.’
    b. *Bernd ging Anna zur Tür.
        Bernd walked Anna to the door
    c. Bernd begleitete Anna zur Tür.
       Bernd accompanied Anna to the door
       ‘Bernd accompanied Anna to the door.’

The difference in lexicalization patterns of Cotheme semantics in (2) and (4) shows parallel properties to the differences in lexicalization patterns of Caused-motion semantics observed above in (1) and (3). That is, whereas the Cotheme semantics is lexicalized in terms of a construction-framed semantics with walk, German chooses to lexicalize the translation equivalent in terms of a verb-framed semantics by employing the verb begleiten (cf. (4c)). This section has shown that English and German motion verbs may differ with respect to how abstract semantic concepts are lexicalized. It has been shown that English and German motion verbs display different types of polysemy networks, i.e., they are not all associated with an equal number of different semantic concepts. The following section addresses the issue of describing the similarities and differences exhibited by run, walk, rennen, and gehen with a set of devices that allows for crosslinguistic abstractions across different lexicalization patterns.

5 To be more precise, the Caused-motion semantics is already lexicalized in the German verb drängen. In this case, the separable prefix ab serves as a preverb responsible for specifying the path and goal of the caused-motion semantics.

2. The role of Frame Semantics in contrastive lexicography

Most traditional approaches to lexicographic descriptions regard the notion of headword as central to the organization of dictionaries and list the different senses associated with a headword in a lexical entry. For each of the individual senses associated with a headword, traditional dictionaries list information about its meaning, its usage, register, etc. (cf. Atkins 1995: 26). Whereas this approach to documenting polysemy of lexical units typically centers around a dictionary example for each sense of a word in order to exemplify the context in which it is used, here I explore an alternative type of lexical organization for bilingual polysemy structures of the type illustrated in (1)-(4) above. Adopting ideas from previous work on lexical organization (e.g., Fillmore & Atkins (1992), Heid (1994), Atkins (1995), and Fontenelle (2000)), I propose that the different senses of English and German motion verbs are related to each other in terms of frame semantic descriptions.

2.1 Frame Semantics

Charles Fillmore's (1982) Frame Semantics centers around the idea that in order to understand the meanings of words in a language one must first have knowledge of the semantic frames, or conceptual structures, that underlie their usage. Frames serve as a type of cognitive structuring device that provide the background knowledge and motivation for the existence of words in a language as well as how they are used in discourse.6 Fillmore's most often cited example of a frame is the commercial transaction frame, which involves a scenario with different frame elements such as Buyer, Seller, Goods, and Money which participate in a commercial transaction. In this frame, “one person acquires control or possession of something from a second person, by agreement, as a result of surrendering to that person a sum of money.
The needed background requires an understanding of property ownership, a money economy, implicit contract, and a great deal more.” (Fillmore & Atkins 1992: 78) Lexical units belonging to this frame are verbs such as buy, sell, spend, or charge, nouns such as price, goods, or money, and adjectives such as cheap and expensive. While all of these lexical units belong to the same semantic frame (the commercial transaction frame), a specific choice of a lexical unit reveals a particular perspective from which the commercial transaction frame is viewed. Consider the following examples. (5) a. Miriam bought a book (from Collin) (for $20). b. Collin sold a book (to Miriam) (for $20). Both sentences in (5) describe the same commercial transaction event but from different perspectives. Whereas (5a) views the transaction from the viewpoint of the buyer, (5b) views the transaction from the perspective of the seller. The main point is that the two verbs buy and sell both refer to the same underlying frame and evoke the same type of underlying knowledge about commercial transaction events. Note also, that the syntactic realization of the individual frame elements differs depending on the type of verb employed to describe the commercial transaction event. Whereas buy requires syntactic realization of the frame elements Buyer and Goods, the Seller and the Price do not have to be realized syntactically as is indicated by parentheses. Sell requires the Seller and the Goods to be realized at the syntactic level, but leaves the realization of the Buyer and the Price an option. A complete frame semantic description of a lexical unit belonging to the commercial transaction frame includes not only information about the types of frame elements that make up the underlying frame, but also information about how these frame elements are realized at the syntactic level. For example, the lexical entry for the verb buy includes information about the fact that the frame element Buyer has to be realized as a NP in subject position, whereas the frame element Goods has to be realized as a NP in object position. The entry also records the fact that the frame elements Seller and Price may optionally occur in postverbal position. Capturing semantic and syntactic information about lexical units in terms of frame semantic descriptions facilitates the creation of inventories of lexical units according to the types of frames to which they belong. This type of lexical organization differs from that of traditional dictionaries in that in a frame semantics dictionary the “concept of ‘headword’ becomes obsolete, for the whole frame is the definiendum,” as Atkins (1995: 27) points out. Note also that the frame-semantic approach to dictionary organization has practical advantages for the everyday dictionary user. By switching the 6 For a detailed survey of the main concepts underlying Frame Semantics, see Petruck (1996). 67 definitional burden from the level of the individual sense listed under the category of a headword to a frame-semantic level, it becomes easier to understand the entire structured background of knowledge that underlies the usage of a word. Since the description of a semantic frame also includes a list of the words that evoke the frame, the dictionary user has access to the interrelationships that hold between the class of words belonging to a common semantic frame. 
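To make the shape of such a description concrete, the sketch below records, for the verbs buy and sell, the frame they evoke, their frame elements, and the syntactic realization and optionality of each element, following the account given above. It is only an illustrative data structure, assuming invented class and field names (and the frame label Commercial_transaction); it is not the actual FrameNet representation.

```python
# A minimal sketch, assuming invented names: one lexical unit, the frame it evokes,
# and how each frame element is realized syntactically for that unit.
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Realization:
    phrase_type: str   # e.g. "NP" or "PP[from]"
    position: str      # e.g. "subject", "object", "postverbal"
    obligatory: bool

@dataclass
class LexicalUnit:
    lemma: str
    frame: str
    elements: Dict[str, Realization] = field(default_factory=dict)

# Hypothetical entries following the description above: buy must realize the Buyer
# and the Goods, sell must realize the Seller and the Goods; the remaining frame
# elements are optional postverbal phrases (cf. examples (5a) and (5b)).
buy = LexicalUnit("buy", "Commercial_transaction", {
    "Buyer":  Realization("NP", "subject", True),
    "Goods":  Realization("NP", "object", True),
    "Seller": Realization("PP[from]", "postverbal", False),
    "Money":  Realization("PP[for]", "postverbal", False),
})
sell = LexicalUnit("sell", "Commercial_transaction", {
    "Seller": Realization("NP", "subject", True),
    "Goods":  Realization("NP", "object", True),
    "Buyer":  Realization("PP[to]", "postverbal", False),
    "Money":  Realization("PP[for]", "postverbal", False),
})

# Indexing entries by frame rather than by headword groups buy and sell together.
frame_index = {}
for lu in (buy, sell):
    frame_index.setdefault(lu.frame, []).append(lu.lemma)
print(frame_index)   # {'Commercial_transaction': ['buy', 'sell']}
```

In an organization of this kind the frame, rather than any individual headword, is the unit under which related words are grouped and looked up.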
This means that understanding the meaning of a single word based on a frame semantic description facilitates a more direct understanding of all of the words belonging to the same frame.7 The next section illustrates the advantages of a frame-semantic approach to lexical organization for describing the different polysemy structures of English and German motion verbs discussed above. 2.2 Describing contrastive polysemy structures We are now in a position to contrast systematically the polysemy structures of English and German motion verbs on the basis of frame semantic principles. In our frame semantic treatment of the data in (1)-(4), there is one important feature that sets our approach apart from traditional approaches to bilingual lexicography. That is, our lexico-semantic descriptions do not refer to the notion of headword as the structuring element of our dictionary. This means that instead of referring to a specific sense of a headword only, the description of a lexical unit will be stated in terms of a frame as structuring device. By turning to this alternative level of lexical organization, it becomes possible to make higher order generalizations about meanings of words across different languages. In order to tease apart the relationships that hold between the individual senses of the verbs surveyed in section 1, consider first the “basic” senses of English run and walk. As pointed out above, the usages in (1a) and (2a) are instances of Self-motion in which “a living being (the Self-mover) moves under its own power in a directed fashion.” (Johnson et al. (2001: 148)) Using the terminology of Johnson et al. (2001), we identify in (6) the frame elements belonging to the Self-motion frame as follows. The frame element Self-mover is a living being which moves under its own power (i.e., Julie in (6a) and Rod in (6b)). The frame element Goal gives information about where the Self-mover ends up as a result of its motion (i.e., to the store in (6a) and to the door in (6b)).8 (6) a. Julie ran to the store. b. Rod walked to the door. Capturing the information about the distribution of frame elements of the self-motion frame as they are realized by run and walk in (6) results in partial sets of simplified frame-semantic descriptions as in (7).9 (7) Partial frame-semantic descriptions of Self-motion senses of run and walk a. runSelf-Motion [ Self-mover Goal ] NP PP b. walkSelf-Motion [ Self-mover Goal ] NP PP The simplified partial frame-semantic descriptions in (7) identify each verb as belonging to the Selfmotion frame and give information about the syntactic realization of the two frame elements Selfmover and Goal. Whereas Self-motion is realized as an NP, the Goal is realized as a PP with both verbs. Next, consider the corresponding German verbs rennen and gehen in (3a) and (4a), repeated in 7 By including sets of example sentences exemplifying the use of a word in context, the dictionary user also has access to information about the entire range of syntactic realization patterns of frame elements. In the Berkeley FrameNet database (for detailed descriptions, see Lowe et al. (1997), Baker et al. (1998), Fillmore & Atkins (1998), and Johnson et al. (2001)), each lexical entry includes corpus examples from the British National Corpus that have been annotated with semantic tags representing frame elements (see Gahl (1998)). 8 Besides Self-mover and Goal, this frame also includes the frame elements Source, Path, Manner, Distance, and Area. For more details, see Johnson et al. (2001: 148 – 150). 
9 Note that in a full-fledged frame-semantic description of FrameNet, lexical entries include the entire range of corpus-attested syntactic patterns exhibited by a lexical unit, including information about how frame elements are realized by each syntactic pattern. They also include semantically annotated corpus sentences serving as examples (see Lowe et al. (1997), Baker et al. (1998), and [http://www.icsi.berkeley.edu/~framenet] for details). 68 (8), and their corresponding simplified partial frame-semantic descriptions in (9) identifying them as belonging to the Self-motion frame. (8) a. Tina rannte zum Geschäft. b. Bernd ging zur Tür. (9) Partial frame-semantic descriptions of Self-motion senses of rennen and gehen a. rennenSelf-Motion [ Self-mover Goal ] NP PP b. gehenSelf-Motion [ Self-mover Goal ] NP PP A comparison of the partial frame-semantic descriptions in (7) and (9) shows that all four verbs evoke the Self-motion frame and exhibit the same type of syntactic realization of the frame elements Selfmover and Goal. Taking our observations to a higher level of abstraction, we arrive at a generalization about how the frame elements belonging to the Self-motion frame are realized by the four verbs in the two languages as in (10). (10) The Self-motion frame as a common structuring device for English and German Self-mover Goal Self-Motion runSelf-Motion [Self-mover Goal] rennenSelf-Motion [Self-mover Goal] NP PP NP PP The top of the diagram in (10) contains a fragment of the Self-motion frame and illustrates how the individual English and German verbs instantiate the respective frame elements of that frame. The arrows connecting the verb's individual frame semantic descriptions with the frame elements of the underlying Self-motion frame illustrate the mapping between the frame elements of the Self-motion frame and their syntactic realizations in the two languages. A comparison of the mapping properties between the frame elements of the Self-motion frames and run and gehen in (10) shows that the two verbs have identical mapping properties, i.e., they map the Self-mover as a preverbal NP and the Goal as a postverbal PP. Similar observations can be made for the mapping of frame elements between walk and gehen discussed in (7b) and (9b) above. So far, it has been demonstrated how the frame elements of the Self-motion frame are realized in similar ways by run, walk, gehen, and rennen. The next section turns to a discussion of cases in which verbs from different languages exhibit different types of mapping of frame elements due to a difference in lexicalization patterns of semantic frames. In section 1 it was shown that whereas run is associated both with a Self-motion and a Caused-motion frame, German rennen does not display a Caused-motion use parallel to that of run. Instead, German offers a verb-framed lexicalization of Caused-motion (i.e., abdrängen) to describe those types of situations that are expressed by the construction-framed Caused-motion sense of run. Adopting Johnson et al.'s (2001: 132) terminology to describe the Caused-motion frame, we can say that “an Agent causes a Theme to undergo directed motion” which may be “described with respect to a Source, Path and/or Goal.” The frame elements relevant for describing the Caused-motion senses of run and abdrängen include Agent, Theme, and Goal.10 The following diagram illustrates how these three frame elements are realized by the Caused-motion sense of run and the Caused-motion sense of abdrängen. 
10 Other frame elements included in the Caused-motion frame are Source, Path, Distance, and Area (cf. Johnson et al. (2001: 131- 133). 69 (11) The Caused-motion frame as a common structuring device for English and German Agent Theme Goal Caused-motion runCaused-motion[Agent Theme Goal ] abdrängen Caused-motion [Agent Theme Goal ] NP NP PP NP NP PP (12) a. Julie ran Pat off the street. b. Tina drängte Enno von der Straße ab. Tina pushed Enno from the street off ‘Tina ran Enno off the street.’ Note that the frame-semantic descriptions of the Caused-motion senses of run and abdrängen in (11) express similar types of scenarios as exemplified in (12). When comparing the similarities and differences between the two diagrams in (10) and (11), we see that the notion of semantic frame offers a convenient way of comparing and contrasting distributions of semantic concepts among different lexical units.11 In particular, our frame-semantic descriptions give information about the use of run to express both Self-motion as well as Caused-motion, and the use of gehen only to express Self-motion.12 The advantage of organizing a bilingual dictionary along frame-semantic concepts should be clear by now. For example, users of bilingual dictionaries requiring information about how to express a specific semantic concept in a different language, are offered multiple ways of gaining access to the information. The first possibility includes looking up a specific word in the target language in order to see whether it may be used in the same pattern as the word in the source language. In the case of Caused-motion, the dictionary user might expect that because run and rennen are associated with similar Self-motion senses, the two verbs share a similar usage pattern when it comes to expressing Caused-motion. By looking up the caused-motion sense of run, the dictionary user would then get to a description of the Caused-motion frame and subsequently discover that there is no Caused-motion equivalent for rennen but that he must use abdrängen. In this case, a frame-semantic description underlying a diagram such as (11) allows the dictionary user to understand more easily the meaning of abdrängen, because it is described in terms of the same type of underlying structuring devices, namely the Caused-motion frame and its frame elements. Furthermore, by providing example sentences such as (12) which include information about how the given sense of a word is used in context, the dictionary user gains full access to usage examples from both languages. The second way of accessing the desired information needed to express a given situation has to do with cases in which a dictionary user is not completely sure about what types of words to use in either language. Here, the frame semantic dictionary serves as a combination of dictionary and thesaurus. The user is able to consult the frame-dictionary and look up lists of descriptions of semantic frames including the types of words belonging to the frame. Based on the definition of frames and in combination with examples illustrating the use of the individual words belonging to the frame, the dictionary user can pick the word which best describes the situations in mind. Take, for example, our comparison of walk and gehen in (2b) and (4b) above. We have seen that whereas walk is associated with a construction-framed Cotheme semantics, gehen is not. 
To be more precise, whereas walk is associated with a sense describing the motion of two distinct objects (Selfmover and Cotheme) moving towards a goal, gehen does not allow for such a construction-framed 11 Note that abdrängen by itself does not characterize in detail the manner in which the Caused-motion is carried out. That is, it only lexicalizes the force dynamics of Caused-motion (pushing). In contrast, run X off does not only lexicalize the abstract force dynamic semantics of Caused-motion in terms of a syntactic frame, but it also gives information about the manner in which the Caused-motion activity is carried out. 12 The proposals put forward in this paper are in contrast to many generative accounts of verbal polysemy such as Jackendoff (1972), Pustejovsky (1995), and Rappaport Hovav & Levin (1998). These accounts typically suggest that verb meanings should be split up into different verb class clusters and subsequently reduced to include only very minimal lexical semantic information. In order to arrive at multiple verb meanings, these accounts propose to employ different sets of generative mechanisms to ensure the generation of different verb senses and their related argument realization patterns from underspecified minimal lexical entries. With respect to the range of application of generative accounts of verb meaning, Weigand (1998: viii) points out that “the generative approach (...) reaches its limits in so far as the rule-governed mode-oriented approach, in principle, cannot tackle all the varieties and idiosyncrasies of language use and therefore remains restricted to a subset of examples.” 70 association with the semantics of the Cotheme frame. Instead, it calls for a different verb, begleiten, to express the same type of Cotheme semantics. This is illustrated by the following diagram. (13) The Motion-Cotheme frame as a common structuring device for English and German accompanyCotheme [S.-mover Cotheme Goal ] begleitenCotheme [S.-mover Cotheme Goal ] NP NP PP NP NP PP Self-mover Cotheme Goal Cotheme walkCotheme [Self-mover Cotheme Goal ] NP NP PP (14) a. Rod walked Melissa to the door. b. Rod accompanied Melissa to the door. c. Bernd begleitete Anna zur Tür. ‘Bernd accompanied Anna to the door.’ At its center, diagram (13) contains a partial set of frame elements from the Motion-Cotheme frame.13 The arrows linking the frame-descriptions of the individual verbs to the Motion-Cotheme frame indicate that each of the three verbs are associated with the semantics of the Motion-Cotheme frame, as well as how the frame elements are realized by each verb respectively. Coming back to the problems that a dictionary user is faced with, we might consider a German speaker who wishes to describe a Motion-Cotheme scenario. By consulting the frame-index, the speaker looks up the Motion-cotheme frame description and sees two possibilities for expressing such a scenario in English. This is the point where example sentences exemplifying the usage of the respective words in the frame become crucial. For example, when choosing between the construction-framed lexicalization pattern of the Motion-Cotheme frame with walk or the verb-framed lexicalization pattern of the Motion-Cotheme frame with accompany, the dictionary user may want to emphasize the fact that the Motion-Cotheme scenario was carried out by means of walking. In this case, he chooses the construction-framed lexicalization of the Motion-Cotheme semantics with walk (cf. (14a)). 
In contrast, if the speaker chooses to remain silent about the manner of motion, he chooses the verb-framed lexicalization of the Motion-Cotheme semantics with accompany (cf. (14b)) which exhibits the same type of Motion-Cotheme lexicalization pattern as begleiten (i.e., verb-framed) (cf. (14c)).14 The third possibility of gaining access to information about how a specific concept is expressed in a language is by making reference to individual frame elements. For example, when the dictionary user is interested in expressing information about a person's motion, he may look up the definition for the frame element Self-mover. Based on this definition, the dictionary user has automatic access to all semantic frames that include this frame element in their frame description. For the example frames discussed in this paper, this means that referring to Self-mover offers direct access to the lexical items belonging to the Self-motion (e.g., run, walk, rennen, and gehen), Caused-motion (e.g., run, walk, and abdrängen), and Cotheme-motion (e.g., walk, accompany, and begleiten) frames, among many others. Offering multiple ways of accessing information about the distribution of lexical items in a bilingual dictionary demonstrates the usefulness of a frame-semantic approach for lexical organization. In contrast to traditional dictionaries employing the notions of headword and isolated examples to guide the dictionary user in finding the proper translation equivalent for a specific sense of a word, bilingual 13 The frame elements of this frame include Self-mover, Cotheme, Source, Path, Goal, Manner, Distance, and Area (cf. Johnson et al. (2001: 133–135). 14 Note that although all three verbs walk, accompany, and begleiten are all associated with Motion-Cotheme semantics, they describe the semantics of the Motion-Cotheme from different angles. That is, accompany and begleiten do not explicitly mention the manner of motion, whereas the Cotheme use of walk makes explicit reference to the manner of motion. According to Snell- Hornby‘s (1983: 33-35) classification of verb descriptivity, verbs like accompany and begleiten exhibit a variable degree of descriptivity, whereas the Cotheme sense of walk exhibits a low degree of descriptivity. In this connection, see also Leisi‘s (1975:77) discussion of “expressive verbs.” 71 dictionaries employing the notion of semantic frames facilitate the finding and understanding of lexical items because the common device for understanding meaning is the whole semantic frame. Using semantic frames as structuring devices also facilitates the learning of polysemous structures since the types of polysemy exhibited by a word in the source language might not be mirrored by a similar polysemy network of the corresponding word in the target language.15 By constructing bilingual dictionaries along frame semantic guidelines and supplementing them with a large number of corpus examples exemplifying the various usages of a lexical item in context, it may become possible to circumvent a major problem for users of bilingual dictionaries pointed out by Snell-Hornby (1983: 215): “It is perhaps the stumbling-block of the conventional bilingual dictionary that it operates with words in isolation, yet functions according to the principle of working equivalence, whereby a context would be required.” 3. 
Practical applications for bilingual computational lexicography and educational tools Employing frame-semantic principles for the construction of bilingual dictionaries requires effective means of representing vast amounts of lexical information. Rather than being confined to traditional representation tools such as printed materials, a dictionary design employing an electronic database architecture will facilitate the representation and looking up of lexical units, their frame semantic descriptions, and semantic relationships among frames (e.g., inheritance and blending (cf. Fillmore & Atkins (1998)). A basis for building such a bilingual electronic database incorporating the main ideas proposed in section 2 is the mono-lingual Berkeley FrameNet database for English.16 Without going into the details of its full architecture (see Lowe et al. (1997) and Baker et al. (1998) for details), I will briefly outline how some of its features may be incorporated into a computer-based bilingual framesemantic database.17 At the center of such a database is the description of a frame, its frame elements, and the lexical units belonging to that frame. For each lexical unit of a language, a lexical entry contains a “traditional” definition combined with a frame semantic description with an exhaustive list of the semantic and syntactic combinatorial properties. In addition, the lexical entry includes annotated corpus examples exemplifying the syntactic valence patterns in which the frame elements occur. As mentioned above, the database user has the possibility to access information about translation equivalents of a lexical item in different ways. The first option consists of an alphabetic list of abstract frames that contains a description of each frame as well as its frame elements and the lexical units for both languages that participate in this frame. By clicking on a frame name, the user is informed of the frame description and may then proceed to gather information about individual English and German lexical items that are members of this frame.18 The second option of accessing information starts from an alphabetical list containing all lexical units (different lists for English and German). By clicking on a lexical unit, the user will see its lexical entry and may proceed from there either directly to the corresponding lexical unit(s) of the other language or to the full description of the semantic frame. At the level of semantic frames, the user is informed of all of the lexical units from both languages that participate in the frame and may choose the corresponding lexical item of the other language from there. Under the third option, as outlined in section 2.2, the dictionary user may access information about lexical items and their translation equivalents by choosing a frame element from an alphabetical list of frame elements in order to gain access to an overview of all of the frames in which the frame element occurs. From this list, individual frame descriptions may be accessed for an overview of lexical items that instantiate this frame element. The various ways of accessing frame-semantic descriptions of lexical items and their translation equivalents presented in this section constitute only a small set of options for representing lexical information in a bilingual frame-semantic database. The important point is that in a database that is as 15 See, for example, our discussion of run and rennen in section 1. 
There we have seen that run is associated with both a Selfmotion sense as well as a Caused-motion sense. In contrast, rennen is only associated with a Self-motion sense. 16 See http://www.icsi.berkeley.edu/~framenet 17 For similar proposals, see Fontenelle (2000) on bilingual lexical data bases combining frame semantics and Meaning-Text Theory (Mel'cuk et al. (1984)). Heid (1994) presents an outline of the DELIS project which is aimed at developing lexicographic tools for corpus-based lexicography. Within this project, Frame Semantics is employed to develop semantic descriptions of lexicon fragments for English, French, Italian, Danish, and Dutch lexical items. The proposals put forward in this paper share many theoretical and practical considerations underlying the architecture of DELIS. 18 An example of this is the simplified representation in (13) to which the dictionary user would get after clicking on “Motion- Cotheme.” In addition to information regarding which lexical units belong to this frame, the database user may also have access to full-fledged frame-semantic descriptions of the individual lexical items by clicking on them. 72 flexible as the type outlined above, the possibilities for accessing different types of relevant information is manifold. Moreover, dealing with polysemous structures across languages is facilitated because semantic frames offer a convenient way of structuring polysemy in terms of a unified descriptive vocabulary. Finally, consider the advantages a frame-semantic bilingual database can offer to the field of second language acquisition, especially with respect to educational tools needed for foreign language instruction. Traditional learning tools such as printed textbooks are limited in the size and scope of exercises they offer to students. Incorporating a bilingual frame-semantic database into electronic learning tools for foreign language pedagogy would not only offer students access to more efficient ways of learning vocabulary by being able to relate to a common structuring device, i.e., semantic frames. It would also give foreign language teachers the opportunity to design individual exercises for students incorporating different types of pedagogically relevant information from the database that are needed for specific learning tasks not covered by the standard learning software employed in the classroom. Lastly, with semantically annotated example sentences from corpora, students would be offered the opportunity of learning the vocabulary of a foreign language in context. Such an opportunity would allow students to increase their understanding of the usage patterns of the respective lexical items as opposed to simply learning vocabulary by memorizing lists consisting of only “words” and their translation equivalents.19 4. Summary This paper has outlined the theoretical and practical advantages of adopting Fillmore's Frame Semantics for the description of different types of polysemy networks in English and German.20 By examining the syntactic and semantic distribution of arguments of a selected number of English and German motion verbs, this paper has shown how semantic frames can be employed as unified structuring devices for bilingual dictionaries and databases. By describing lexical units with respect to their underlying semantic frames, there is no need for the traditional notions of ‘headword’ or ‘basic sense’ as organizational devices for structuring the lexicon. 
Frame Semantics can thus be regarded as a true semantic metalanguage for both linguistic theory and applied lexicography, because it refers to scenarios typically shared by speakers of all languages. 5. References Atkins B 1995 The role of the example in a frame semantics dictionary. In Shibatani, M, Thompson, S (eds), Essays in Semantics and Pragmatics in Honor of Charles J. Fillmore. Amsterdam/Philadelphia, Benjamins, pp 25-42. Baker C, Fillmore C, Lowe J 1998 The Berkeley FrameNet Project. In Proceedings of ACL/COLING 1998. Dorr B 1990 Solving Thematic Divergences in Machine Translation. In Proceedings of the 28th Annual Conference of the Association of Computational Linguistics. Fillmore C 1982 Frame Semantics. In Linguistic Society of Korea (ed.), Linguistics in the Morning Calm. Seoul, Hanshin, pp 111-138. Fillmore C, Atkins B 1992 Toward a frame-based lexicon: The semantics of RISK and its neighbors. In Lehrer A, Kittay E (eds), Frame, fields, and contrasts: New essays in semantic and lexical organization. Hillsdale, Erlbaum, pp 75-102. 19 Since lexical entries contain semantically annotated corpus examples showing how a lexical item is used in context, a framesemantic database would also greatly increase what Neubert (2000) calls the “five parameters of translational competence.” These include “(1) language competence, (2) textual competence, (3) subject competence, (4) cultural competence, and, last but not least, (5) transfer competence.” (Neubert (2000: 6)) 20 The full range of polysemy networks of English and German motion verbs is, of course, much larger. Creating inventories of the frame semantic descriptions of these verbs and comparing their different sense distributions is a task that is far beyond the limits of this paper. For an example of a detailed study discussing the polysemy networks of two related verbs, see Fillmore & Atkins (2000) for a comparison of English crawl and French ramper. 73 Fillmore C, Atkins B 1998 FrameNet and Lexicographic Relevance. In Proceedings of the First International Conference on Language Resources and Evaluation, Granada. Fillmore C, Atkins B 2000 Describing Polysemy: The Case of ‘Crawl.’ In Ravin Y, Laecock C (eds), Polysemy. Oxford, Oxford University Press, pp 91-110. Fontenelle T 2000 A Bilingual Lexical Database for Frame Semantics. International Journal of Lexicography: 232-248. Gahl S 1998 Automatic extraction of subcategorization frames for corpus-based dictionary-making. In Fontenelle T, Hiligsman P, Michiels A, Moulin A, Theissen S. (eds), Euralex ’98 Proceedings. 8th International Congress of the European Association for Lexicography. Liège, Université de Liège, pp 445-52. Heid U 1994 Contrastive Classes – Relating Monolingual Dictionaries to Build an MT Dictionary. In Kiefer F, Kiss G, Pajzs J (eds), Papers in Computational Lexicography – COMPLEX 1994. Budapest, Research Institute for Linguistics/Hungarian Academy of Sciences, pp 115-126. Jackendoff R 1972 Semantic Interpretation in Generative Grammar. Cambridge, Mass., MIT Press. Johnson C, Fillmore C, Wood E, Ruppenhofer J, Urban M, Petruck M, Baker C 2001. The FrameNet Project: Tools for Lexicon Building. Manuscript. Berkeley, CA, International Computer Science Institute. Leisi E 1975 Der Wortinhalt. Seine Struktur im Deutschen und Englischen. Heidelberg, Carl Winter. Levin B 1993 English Verb Classes and Alternations. Chicago, Chicago University Press. Lowe J, Baker C, Fillmore C 1997 A frame-semantic approach to semantic annotation. 
In Tagging Text with Lexical Semantics: Why, What, and How? Proceedings of the Workshop. Special Interest Group on the Lexicon, Association for Computational Linguistics, pp 8-24. Mel'cuk I et al. 1984 Dictionnaire Explicatif et Combinatoire du Français Contemporain. Montréal, Presses de l'Université Montréal. Neubert A 2000 Competence in Language and Translation. In Schäfner C, Adab B (eds), Developing Translation Competence. Amsterdam/Philadelphia, Benjamins, pp 3-18. Petruck M 1996 Frame Semantics. In Östman J-O, Blommaert J, Bulcaen C (eds), Handbook of Pragmatics. Amsterdam/Philadelphia, Benjamins, pp 1-13. Pustejovsky J 1995 The Generative Lexicon. Cambridge, Mass., MIT Press. Rappaport Hovav M, Levin B 1998 Building Verb Meaning. In Butt M, Geuder W (eds), The Projection of Arguments. Stanford, CSLI Publications, pp 92-124. Snell-Hornby M 1983 Verb descriptivity in English and German: A contrastive study in semantic fields. Heidelberg, Carl Winter. Talmy L 1985 Lexicalization Patterns. In Shopen T (ed.), Language typology and syntactic description. Cambridge, Cambridge University Press, pp 57-147. Weigand E 1998 Foreword. In Weigand E (ed.), Contrastive Lexical Semantics. Amsterdam/Philadelphia, Benjamins, pp vii-ix. 74 Nouns and their prepositional phrase complements in English Rhonwen Bowen Department of English Göteborg University 1. Introduction This paper concerns one aspect of work in progress on a PhD thesis on the complementation of nouns in English (See Bowen (forthcoming)). The purpose of the thesis is to give a description of nouns and their complements which will prove of interest to linguists but will also be of interest, from a pedagogical point of view, to learners of English. Although there are extensive surveys of the noun phrase and its integral parts in grammars1 and research alike, the complementation of nouns is often only discussed in connection with verb and adjective complementation2. Recent traditional grammars, Greenbaum (1996) for example, discuss noun complementation more fully. Biber et. al. (1999:604- 656) discuss the different structural types of post-nominal patterns including: that-complement clauses, exemplified here in (1); to-infinitival complement clauses shown in example (2); whcomplement clauses shown in (3) and of + -ing constructions shown in example (4). Further, in Biber et. al. (1999:634), prepositional phrases are included in postmodification and dealt with separately. In the present study, however, the term complement is used to denote the postnominal elements which are bound elements selected by a head noun, including both clausal complements as in examples (1) - (3) and prepositional phrase complements (PPC) as in examples (4) - (7): 1) the fact that he will win 2) his need to win, 3) the decision whether it is worthwhile 4) the thought of losing the match 5) the involvement of the teacher 6) her reliance on drugs 7) their belief in the dollar The focus of this paper is on the prepositional phrase complements illustrated in examples (5) – (7). Moving away from prototypical complements like the examples above, this paper explores other aspects of complementation. One consideration of interest is the type of noun which permits a PPC. In other words, do only nominalisations allow PPCs or are other nouns found with complements and to what extent do nominalisations "inherit" their complementation patterns? Another question addressed in this paper is the type of preposition found heading a PPC. 
Also of interest are the number of complements which are permissible with one noun and the question of the order of complements when more than one is present. Finally, instances of postponement of the complements will be discussed. These issues are addressed in this paper. 2. Method In order to analyse and study the patterns of nouns used in English today, a corpus of authentic sentences was collected. As a point of departure, all the nouns requiring all types of post-nominal elements, i.e. both phrasal and clausal elements were extracted from two learners' dictionaries of English. This extraction included 1904 nouns. As a qualitative rather than quantitative study of nouns is part of the aim, a 25% sub-set of these nouns was systematically selected to form the basis of this study. Sentences including examples of the subsequent 476 nouns have been extracted from the CobuildDirect and BNC corpora. The frequency of the nouns in the corpora varied considerably which entailed that for some of the nouns, a complete search was possible, e.g. seizure with 166 instances in the CobuildDirect corpus, whereas for extremely frequent nouns such as place with 30,337 instances, a percentage of the examples was randomly extracted. 1 See Quirk et. al. (1985:1238ff) 2 See Herbst (1988:266) for a discussion of previous research. 75 An analysis of the sentences was conducted with the help of a number of syntactic and semantic criteria3 in order to establish complementhood. The distinction between complements and adjuncts is far from clear cut and should be seen in terms of a cline. PP-adjuncts which are not bound to the head noun are usually markers of description such as the wines of France or a book of poems or PPs denoting place and time such as a book on the floor or the car in the garage. At the other end of the cline, PPCs are typically bound to their heads as in a ban on smoking, an introduction to linguistics or the request for money. Between these prototypical cases, the cline includes partitive complements such as a chunk of cheese, and intermediate complements such as the cradle of free enterprise. For 65 of the nouns, the prepositional phrases functioned as clear cut adjuncts as in a blaze of colour or the shank of the screwdriver and as such were excluded from the survey. The analysis showed that 411 nouns were found with PP-complements. 3. Which nouns take complements? In order to determine if patterns of complement-taking nouns (CTNs) exist, the nouns were grouped into three word formation categories. The first group comprises derived nouns which contain distinguishable overt suffixes as in decision and ability and which have been labelled "derived". The second group comprises nouns which have the same form as other word classes, i.e. they have no overt suffixation as in hope, right and answer. These are referred to as either verb or adjective "linked" nouns. The last group includes nouns which are non-derived, i.e. they have no evident connection to other words classes, such as idea and fact. The reader must keep in mind that a division of this kind is, however, problematic and that the focus of this work is on complementation and not morphology as such. Therefore, there are some cases where the categorisation is difficult, e.g. proclivity with an overt suffix but with no corresponding verb or adjective. 
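As a rough illustration of the three-way division just described, the following sketch encodes the categorisation as a simple heuristic. The suffix list and the idea of supplying the related word classes as input are invented for exposition; in the study itself the categorisation was done by hand and, as noted above, is not always clear-cut.

```python
# A minimal heuristic sketch of the derived / linked / non-derived split.
# The suffix list and the caller-supplied set of related word classes are
# placeholders; the actual categorisation was done manually.
def categorise(noun, related_word_classes,
               suffixes=("ion", "ment", "ance", "ence", "ity", "ness")):
    """related_word_classes: word classes ('verb', 'adjective') sharing the noun's
    form or stem, e.g. {'verb'} for hope, an empty set for idea."""
    if any(noun.endswith(s) for s in suffixes):
        return "derived"        # overt suffix, e.g. decision, ability
    if related_word_classes:
        return "linked"         # same form as a verb or adjective, e.g. hope, answer
    return "non-derived"        # no evident link to another word class, e.g. idea, fact

print(categorise("decision", {"verb"}))   # derived
print(categorise("hope", {"verb"}))       # linked
print(categorise("idea", set()))          # non-derived
```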
The word formation division of the 411 corpus nouns which take complements is illustrated in table 1:

Table 1 Morphological distribution of corpus nouns

Derived Nouns    Linked Nouns    Non-Derived Nouns
196              118             97

Additionally, both the derived and linked categories of nouns have been further sub-divided into nouns with deverbal derivation, as in reliance on and argument against, de-adjectival derivation, such as allergy to and evidence on, or denominal derivation, such as relationship between. The derived and linked nouns together represent 77% of all the CTNs. The category of non-derived CTNs is also found with all types of PPCs, such as moratorium on, antidote to, clue to, spokesperson for. In general, it can be said that the derived and linked nouns "inherit" the same prepositions as the verbs and adjectives to which they are related, e.g. adherence to, belief in and collision with. For some nouns, such as ability, which predominantly takes to-infinitival clause complements (cf. the adjective able to-v), a variety of PPCs have been found, e.g. the jumping ability of your animal; his ability at mathematics; his ability for schoolwork and the children's ability on language skill. In a few cases there are differences in complementation patterns, such as the adjectives weak at/in/from/with and fond of compared to weakness for and fondness for. The alternation between prepositions for one and the same noun may also entail a change in meaning, as in adoration of and adoration for.

4. Which prepositions head PPCs?

The following table illustrates a break-down of the nouns which take complements headed by different prepositions. The nouns are divided into their morphological categories and the number of nouns is supplied for each preposition. The right-hand column supplies the total number of nouns for each individual preposition. It must be kept in mind that one and the same head noun can take more than one type of PPC, as in the accusation of/by/against NP or a search of/for/by NP.

3 Following, amongst others, Aarts (1997), Allerton (1982), Chomsky (1970), Collins (1991), Haegeman (1994), Jackendoff (1977) and Radford (1988).

Table 2 Distribution of nouns and prepositions

Preposition   Derived Nouns   Linked Nouns   Non-derived Nouns   Total
of            186             121            96                  403
for            36              27            21                   84
to             35              15            11                   61
by             39              12             1                   52
on             25              14             4                   43
from           21              18             1                   40
against        17              16             3                   36
between        18              11             7                   36
in             18              10             6                   34
with           15               7             4                   26
about          13               6             1                   20
over            4              13             2                   19
towards        10               5             1                   16
at              6               4             1                   11
behind          1               -             -                    1
into            1               -             -                    1

It is noteworthy that of the 411 CTNs in the corpus, 403 or 98% take of-PPCs. The prevalence of of-PPCs can be explained by the versatility of this preposition, i.e. it is found across the full spectrum of of-phrases. Its use is frequent in subject- and/or object-related complements but also in partitive complements and intermediate complements. For the remaining prepositions, their use is not so much a question of a cline but one which can involve subcategorisation, as in solution to and dependence on, or where the meaning of the preposition is decisive, as in a book on linguistics or a misunderstanding about the money. The by-PPCs differ in that they only represent subject-related complements, as in the ruling by the court.

5. Number and preferred order of complements

For the vast majority of CTNs, only one complement is typically present.
The PPC of a head noun may represent subject- or object-related complements as in the love of God or be subcategorised by the noun as in her belief in her daughter but also combinations of two or three complements do occur: 8) However, his accusation of corruption against the Prime Minister and her Cabinet members results in the Government. . . (CobuildDirect times/10). 9) Thus the elevation of potters from the humble status of craftsmen to bona fide artists was given its greatest boost in. . . (CobuildDirect ukmags/03). In example (8) the head noun accusation is followed by two complements, an of-PPC and an against- PPC, whereas in example (9) the head noun elevation has no fewer than three complements represented by an of-PPC followed by a from-PPC and finally a to-PPC. In the literature, for example, Huddleston (1984:261) states that "NPs with more than one complement are generally rather infrequent". The corpus investigation shows that of the 411 CTNs in this study, 76 head nouns were found with double complements. This figure may give the impression that double complements are fairly frequent, but the 250 sentences found with this construction represent a very small section of the corpus material. The number of complements found with a head noun is restricted partly by the type of head noun but also by the length of the PPCs. Having established that some nouns take more than one complement, it is of interest to determine whether there is a preferred order between complements. In cases where two complements are present, as in, the exclusion of members by the committee the object-related complement of members precedes the subject-related complement, by the committee. This order is found in the majority of double-complement sentences, however, examples in the corpus show that the order is by no means fixed: 77 10) . . . the seizure by Moscow of the provinces of Bukavina and Bessarabia. (CobuildDirect bbc/06). 11) The seizure of two vessels by armed Chinese security forces who. . . (CobuildDirect oznews/01). 12) The Contact Group believes recognition of Bosnia by Mr Milosevic would . . . (CobuildDirect oznews/01) 13) This success was due to a persistent advocacy and a recognition by others of his sincere concern. (CobuildDirect ukbooks/08) Other nouns found with alternating order of complements include: acquisition, assault, ban and boycott, to name a few. All the nouns found with alternating order of complements were either derived or linked head nouns; no non-derived nouns were found with more than complement. From the corpus investigation, it appears that there are a number of influencing factors which determine the order of complements. To begin with, it is difficult to postulate anything concrete about text-type when the sentences are taken out of context, but it is obvious that the ordering of elements to give focus or end-weight is a deciding factor particularly in journalese, for example. Also the length of the PPCs appears to influence the order so that shorter PPCs tend to precede longer ones as can be seen in examples (10) – (13). 6. Postponement of complements The usual position of prepositional complements in English is adjacent to the head noun. Cases where the complement has been moved rightwards to the end of the sentence do occur in the corpus, although they are rather rare. Radford (1988:448) illustrates this phenomenon thus: 14) [A review -] has just appeared of my latest book. 
Using Radford's terminology the dash in example (14) represents a "gap" which the extraposed PP has left behind4. Examples from the corpus include: 15) A ban also has been introduced recently on the purchase of shares by senior officials although residents say that many city leaders succeeded in getting rich before it took effect. (CobuildDirect bbc/06). 16) The search is still going on for the two gunmen who carried out the shooting. (CobuildDirect ukspok/04) In example (15) the on-PPC has been moved rightwards from the head noun ban, and in example (16) the for-PPC has been similarly moved away from search. In the case of example (16) the postponement may have been used to avoid the imbalance of a heavy subject NP followed by a relatively light predicate. In the case of example (15) this is not the case, however. This construction is rare in the corpus with only 66 examples found. Not all the examples are as clear as examples (15) and (16); a certain amount of ambiguity exists: 17) The ban was imposed on Florey, of Bracknell, Berks, at a Jockey club disciplinary hearing. (CobuildDirect today/11) 18) And each time you spend on your card, a small donation is made to a worthwhile cause. (CobuildDirect sunnow/17) From the examples in the corpus, it can be concluded that the typical syntactic function of the NP in these constructions is that of subject and that the majority of NPs were found with indefinite articles as determiners. 7. Summary 4 Scholars differ in their approach to this phenomenon, see Culicover & Rochemont for a ”basegenerated” model. 78 The above sub-sections show the complexity of noun phrases and their complements and give some of the tentative results of this investigation. In summary, it can be said that approximately 80% of the CTNs in this material were derived or linked to verbs and adjectives. This entails that a considerable number of CTNs do take complements in spite of their lack of connection to other word classes. The majority of head nouns ”inherit” the same complementation patterns as the verbs and adjectives to which they are related. From the 411 nouns which have been found to take PP complements, 98% were found with of-PPCs. Head Nouns are typically found with one complement, however, when both subject- and object-related complements are present, factors such as end-weight and focus influence the order. Few examples of PPC-postponement were found in the material but as the most common function of the NP is that of subject, the use of postponement avoids instances of imbalance in a sentence where a heavy NP subject is followed by a relatively light predicate. References Advanced Learners’ Dictionary of Current English. 1995 & 2000. Oxford, Oxford University Press. Aarts B, 1997 English Syntax and Argumentation. London, Macmillan Press Ltd. Allerton D.J, 1982 Valency and the English Verb. London, Academic Press. Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman Grammar of Spoken and Written English. London, Longman. Bowen R, (forthcoming) Noun Complementation in Present-day English. Burnard L, 1995 The British National Corpus. Chomsky N, 1970 Remarks on Nominalization in Readings. English Transformational Grammar. In Jacobs R, and Rosenbaum P (eds), pp 184-221. Collins Cobuild English Language Dictionary. 1987 & 1995 London, Harper Collins Publishers Ltd. Collins P.C, 1991 Cleft and Pseudo-cleft Constructions in English. London, Routledge. Culicover P.W, Rochemont M.S, 1990 Extraposition and the complement principle. 
Linguistic Inquiry, 21:23-47. Greenbaum S, 1996 The Oxford English Grammar. Oxford, Oxford University Press. Haegeman L, 1991 & 1994 Introduction to Government & Binding Theory. Oxford, Blackwell. Herbst T, 1987 A Valency model for nouns in English. Journal of Linguistics (24): 265-301. Huddleston R, 1984 Introduction to the Grammar of English. Cambridge, Cambridge University Press. Jackendoff R, 1977 X-bar Syntax: A Study of Phrase Structure. Cambridge, Massachusetts, MIT. Longman Dictionary of Contemporary English. 1987 & 1995 London, Longman. de Mönnink I, 2000 On the Move The mobility of constituents in the English noun phrase: a multi method approach. Amsterdam & Atlanta, Rodopi. Radford A, 1988 Transformation Grammar A First Course. Cambridge, Cambridge University Press. Sinclair J, 1987 The CobuildDirect Corpus. 79 From dictionary to corpus to self-organizing dictionary: learning valency associations in the face of variation and change Ted Briscoe Computer Laboratory University of Cambridge ejb@cl.cam.ac.uk http://www.cl.cam.ac.uk/users/ejb 1. Introduction I use the term valency in an extended sense as a relatively theory-neutral term to refer to lexical information concerning a predicate's realization as a single or multiword expression (such as a phrasal verb), the number and type of arguments that a particular predicate requires, and the mapping from these syntactic arguments to a semantic representation of predicate-argument structure which also encodes the semantic selectional preferences on these arguments. Thus, I use the term valency (frame) to subsume (syntactic) subcategorization and realization, argument structure, selectional preferences on arguments, and linking and/or mapping rules which relate the syntactic and semantic levels of representation. For example, the verb, believe, in its primary sense and one (rare) realization takes a NP subject, NP object and infinitival VP complement where the subject is interpreted as the `believer', and therefore will typically denote the kind of entity capable of belief, the object NP is interpreted as the subject of the infinitival VP, and the proposition denoted by their composition is interpreted as `the belief', as in Most voters believe the election to have been seriously mishandled. Most grammatical frameworks treat valency almost entirely as a lexical property of predicates, although the inventory of valency frames, which varies between both major syntactic categories and between languages, can be described somewhat independently of individual words. Some theories, such as Construction Grammar (e.g. Goldberg, 1995) argue that the valency frame itself, or `construction' contributes aspects of the overall meaning; for example, the dative frame is said to denote a `change of possession': hence, The barman slid Fred the beer but not ?The barman slid the end of the table the beer. Such examples raise complications for a theory of the association between valency frames and predicates because it appears that the meaning of slide is coerced to entailing `change of possession’ by virtue of insertion in the dative frame. Therefore a predicate's participation in alternant valency frames can result in predictable modification of meaning. 
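As a rough picture of what a single valency-frame association in this extended sense has to record, the sketch below encodes the believe example just given: the syntactic frame, the linking of syntactic arguments to the semantic representation, and a selectional preference on the subject. The field names and the representation itself are invented for exposition and are not Briscoe's actual lexicon format.

```python
# A hypothetical record for one valency frame of "believe"
# (NP subject, NP object, infinitival VP complement), as described above.
believe_np_np_vpinf = {
    "predicate": "believe",
    "sense": "primary",
    "syntax": ("NP_subj", "NP_obj", "VP_inf"),     # subcategorization / realization
    "linking": {                                   # syntactic argument -> semantics
        "NP_subj": "Believer",
        "NP_obj": "understood subject of the infinitival VP",
    },
    "proposition": "NP_obj + VP_inf together denote the Belief",
    "selectional_preferences": {"Believer": "entity capable of belief"},
    "relative_frequency": None,   # to be estimated from corpus data
}
```

A full lexicon would hold many such records per verb form, one for each frame it is attested with, together with the relative frequencies discussed below.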
Abstracting over specific lexically-governed particles and prepositions and specific predicate selectional preferences, but including some `derived' / `alternant' semi-productive, and therefore only semipredictable, bounded dependency constructions, such as particle or dative movement, there are at least 163 valency frames associated with verbal predicates in (current) English (Briscoe, 2000). In this paper, I will review the work that my colleagues and I have done to learn (semi-)automatically this very large number of associations between individual verbal predicates and valency frames. Access to a comprehensive and accurate valency lexicon is critical for the development of robust and accurate parsing technology capable of recovering predicate-argument relations (and thus logical forms) from free text or transcribed speech. Without this information it is possible to `chunk’ input into phrases but not to distinguish arguments from adjuncts or resolve most phrasal attachment ambiguities. Furthermore, for statistical parsers it is not enough to know the associations of predicates to valency frames, it is also critical to know the relative frequency of such associations given a specific predicate. Such information is a core component of that required to `lexicalize’ a probabilistic parser, and it is now well–established that lexicalization is essential for accurate disambiguation (e.g. Collins, 1997, Carroll et al, 1998). While state-of-the-art wide-coverage grammars of English, capable of recovering predicateargument structure and expressed as a unification-based phrase structure grammar, have on the order of 1000 rules, it is clear that the number of associations between valency frames and predicates needed in a lexicon for such a grammar will be much higher. 80 2. The dictionary-based approach In the mid-eighties automated corpus analysis was not good enough to derive useful descriptions of the grammatical contexts in which predicates occurred, so we resorted to enhancing the valency information painstakingly gathered by lexicographers and available to us in the machine-readable versions of advanced learners’ dictionaries, such as the Longman Dictionary of Contemporary English (LDOCE) (Boguraev and Briscoe, 1987; Boguraev et al, 1987). The resulting lexicon and subsequent similar efforts (e.g. Comlex syntax distributed by the LDC, Grishman et al, 1994) have high precision (approx. 95%) but disappointing recall (appox. 76% for ANLT, 84% for Comlex) which means that despite the large amount of lexicographical and linguistic resources deployed 24-16% of the required associations between predicates and valency frames were omitted for an open-class vocabulary of about 35,000 words (Briscoe and Carroll, 1997). However, given that there are more than 60,000 associations between words and valency frames (ignoring word sense differences) in the ANLT dictionary, perhaps the result is not so surprising. Many of the omitted associations are quite unremarkable; for example combining the associations from ANLT and Comlex would still leave the sentence types with seem in 1) (based on cited examples) unanalyzed: 1a) It seemed to Kim insane that Sandy should divorce. 1b) That Sandy should divorce seemed insane to Kim. 1c) It seemed as though Sandy would divorce. 1d) Kim seemed to me (to be) quite clever / a prodigy. 1e) (For Kim) to leave seemed to be silly. 1f) The issue now seems resolved. In addition, neither of these projects yielded (public-domain) lexicons which associated predicates (i.e. 
word senses as opposed to word forms) with valency frames, nor did they record the relative frequency of a frame given a specific word, nor did they fully specify the mapping to argument structure or the selectional preferences on the arguments. Though some of these limitations might be overcome by further work, inherently, the dictionary-based approach is untenable, because such associations vary, both absolutely and also in relative strength, depending on subject domain and genre (Roland and Jurafsky, 1998), and they change over time at a rate consonant with lexical sense change rather than grammatical change. For example, swing has a core sense of spatial motion with optional volitional / causative components, explaining its occurrence with locative and/or source/goal locational arguments: Gay swung her into the saddle (after the LOB corpus). However, in the financial domain, the extended sense of (share) price movements on a one-dimensional range is more common: UAL's shares swung violently between an all time low of $6 and $36 in yesterday's trading (after the WSJ). Sense specialization, such as that of swinger to mean sexually promiscuous, leads to new valency associations: Kim swung with much of NY's literary society in the course of a long weekend (after the Guardian). Therefore, at least the relative frequency of a valency frame varies depending on the relative frequency of the sense of a word, and in many cases valency frames are different under sense extensions. For example, with swing the selectional preferences and prepositional items are different for the financial usages, while with the example of slide above, or the following attested example (due to Annie Zaenen) She smiled herself an upgrade, the entire valency frame is only available under the extended sense. 3. The corpus-based approach By the mid-nineties it was possible to reliably extract accurate phrasal analyses as well as part-of-speech (PoS) tags from corpora, and thus to identify the realization of predicate forms in phrasal contexts (e.g. Abney, 1996). That is, it became possible to `chunk’ PoS-tagged input into verb groups, bare (unpostmodified) noun phrases, prepositional phrases, and so forth. Around this time, we developed a system to acquire the associations between predicate forms and valency frames based on on our own intermediate parsing technology (Briscoe and Carroll, 1997). 81 The system consists of : · A tagger, a first-order HMM PoS and punctuation tag disambiguator, is used to assign and rank tags for each word and punctuation token in sequences of sentences (Elworthy, 1994). · A lemmatizer is used to replace word-tag pairs with lemma-tag pairs, where a lemma is the morphological base or dictionary headword form appropriate for the word, given the PoS assignment made by the tagger. We use an enhanced version of the GATE project stemmer (Cunningham et al, 1995). · A probabilistic parser, trained on a treebank, returns ranked analyses (Briscoe and Carroll, 1993) using a grammar written in a feature-based unification grammar formalism which assigns ‘intermediate’ phrase structure analyses to the PoS tag ‘lattices’ returned by the tagger (Briscoe and Carroll,1995; Carroll and Briscoe, 1996). · A pattern extractor which extracts local syntactic frames, including the syntactic categories and head lemmas of constituents, from sentence subanalyses which begin/end at the boundaries of (specified) predicates. 
· A pattern classifier which assigns patterns to valency frames or rejects patterns as unclassifiable on the basis of the feature values of syntactic categories and the head lemmas in each pattern.
· A lexical filter which evaluates sets of frames gathered for a (single) predicate, constructing putative lexical entries and filtering the latter on the basis of their reliability and likelihood.
For example, the sentence in 2a) is PoS-tagged as in 2b) and the parser returns the highest ranked analysis, shown as a labelled bracketing in 2c). Assuming that we are acquiring an entry for attribute, the extractor yields the pattern in 2d), which is classifiable as a known NP-NP-PSing valency frame with additional lexical information such as the preposition and the heads of the NP arguments and of the NP and VP arguments of the PP, further parameterizing the frame hypothesized. In mapping from the analysis in 2b) to a valency frame, the classifier is able to repackage information from the fairly simple X-bar PSG used by the intermediate parser. In effect, the parser factors information between VSUBCAT and PSUBCAT features and the classifier looks at these, and sometimes lexical information, to deduce, in this case, that the appropriate frame is one in which the prepositional complement is a non-finite progressive clause.
2a) He attributed his failure, he said, to no-one buying his book.
2b) he_PPHS1 attribute_VVD his_APP$ failure_NN1 ,_, he_PPHS1 say_VVD ,_, to_II no-one_PN buy_VVG his_APP$ book_NN2
2c) (Tp (V2 (N2 he_PPHS1) (V1 (V0 attribute_VVD) ) (N2 (DT his_APP$) (N1 (N0 (N0 failure_NN1) (Ta (Pu ,_,) (V2 (N2 he_PPHS1) (V1 (V0 say_VVD))) (Pu ,_,))))) (P2 (P1 (P0 to_II) (N2 no-one_PN) (V1 (V0 buy_VVG) (N2 (DT his_APP$) (N1 (N0 book_NN2)))))))))
2d) ((((he:1 PPHS1)) (VSUBCAT NP_PP) ((attribute:6 VVD) ((failure:8 NN1)) ((PSUBCAT SING) ((to:9 II)) ((no-one:10 PN)) ((buy:11 VVG)))))
We call the level of analysis exemplified in 2c) `intermediate' because the parser finds singly rooted trees rather than simply chunking the input. However, many attachment decisions are `canonical'; for example, the attachment of the comma-delimited text adjunct (Ta) he said as a nominal postmodifier is incorrect but convenient for this application, because it hides the text adjunct from the pattern extractor which computes the local syntactic context of attribute. The parser returns the highest ranked analysis using a purely structural probabilistic model for ranking alternative analyses. This makes training the parser on realistic amounts of data and using it in a domain-independent fashion feasible. However, it also means that patterns extracted from the highest ranked analysis are noisy, because quite often the parser has no mechanism for choosing the correct analysis. For example, the correct analysis for 3a) is shown in 3c) and the correct analysis for 3b) in 3d).
3a) He looked up the word.
3b) He looked up the hill.
3c) (Tp (V2 (N2 he_PPHS1) (V1 (V0 (V0 look_VVD) (P0 up_RP)) (N2 (DT the_AT) (N1 (N0 word_NN1)))))
3d) (Tp (V2 (N2 he_PPHS1) (V1 (V0 look_VVD) (P2 (P1 (P0 up_RP) (N2 (DT the_AT) (N1 (N0 hill_NN1)))))))
However, the parser cannot reliably select between the derivations in 3c) and 3d) because it does not have access to any lexical information, such as the likelihood of look up being a phrasal verb or the differing selectional restrictions on the NP as either a PP or a verbal argument. The classifier rejects some noisy patterns which do not conform to the known valency frames for English.
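In outline, the classifier's decision can be thought of as a lookup from the extractor's feature values into the frame inventory. The Python sketch below is a deliberately simplified illustration of that idea: the feature names, string values and frame labels are hypothetical stand-ins for the system's actual feature structures, and only a handful of the 163 frames are shown.

def classify_pattern(vsubcat, psubcat=None):
    """Map extractor feature values to a valency frame name, or None to reject
    the pattern as unclassifiable. The inventory is an illustrative fragment."""
    frames = {
        ("NONE", None): "INTRANS",
        ("NP", None): "NP",
        ("NP_PP", "NP"): "NP-PP",
        ("NP_PP", "SING"): "NP-NP-PSing",  # PP complement is a progressive clause
    }
    return frames.get((vsubcat, psubcat))

# The pattern in 2d) carries VSUBCAT NP_PP and PSUBCAT SING:
print(classify_pattern("NP_PP", "SING"))   # -> NP-NP-PSing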
However, many classifiable patterns are still incorrect for the reasons noted above, so the filter accrues evidence for associations between the specified predicate form and specific valency frames. These are then filtered using a statistical confidence test which utilizes the overall error probability that a particular frame will be hypothesized, and the amount of evidence for an association of that frame with the predicate form in question. Specifically, we think of occurrences of predicate forms with putative frames as a sequence of independent (Bernoulli) trials and use the error probability, pe_i, that a predicate form will be associated with an incorrect valency frame, i, to formulate the null hypothesis. The probability of an event with probability p happening exactly m times out of n such trials is given by the binomial distribution:
P(m, n, p) = [n! / (m! (n - m)!)] p^m (1 - p)^(n - m)
The probability of an event happening m or more times is:
P(m+, n, p) = Σ_{k=m..n} P(k, n, p)
So P(m+, n, pe_i) is the probability that m or more occurrences of valency frame i will be associated with a predicate form occurring n times. The threshold for rejecting this null hypothesis was set to 0.05, yielding a 95% or better confidence that a high enough proportion of frames has been seen, given the underlying error probability, to accept the hypothesis that the predicate form really is associated with the frame. The error probability for a given frame i was estimated by:
pe_i = (1 - |preds in frame i| / |preds|) × (|frames for i| / |frames|)
where the counts for frames were obtained by running the parser and extractor software on the entire Susanne corpus (Sampson, 1995) and the estimate of the number of predicates associated with frame i is obtained by counting the number of predicates in the ANLT lexicon paired with that frame. Suppose that the parser and extractor predict that verb tokens are associated with frame i ¾ of the time, and that only ¼ of the verb types in ANLT are associated with i; then the error probability for associating verbs with i will be slightly over half. If the frame is only hypothesized ¼ of the time but linked to verb types ¾ of the time, the error probability will be one sixteenth. Note, however, that the estimate of the true incidence of the association, being dictionary-based, doesn't take account of the relative frequency of tokens of the verb types. The performance of the system can be evaluated by recording true positives (TP) and true negatives (TN), that is, cases where the filter correctly accepts or rejects an association between a predicate form and a frame, and false positives (FP) and false negatives (FN), where the filter incorrectly accepts or rejects such an association. The incidence of the four outcomes can be calculated either by comparing the lexical entries produced by the system to a set of accurate lexical entries, or to a manually compiled set of entries based on the same data that the system used to acquire the entries. The advantage of the former approach is that it is quicker – we can create accurate entries by intelligently merging the information in the ANLT and Comlex lexicons. However, the disadvantages are that we have no way of knowing whether the data the system used actually exemplified all the frames in the resulting lexical entries and, since these lack information about the relative frequency of frames for specific predicates, no way of knowing whether the system has acquired accurate frequencies.
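To make the filtering step concrete, the following Python sketch computes the binomial tail probability P(m+, n, pe_i) and applies the 0.05 threshold. The numbers in the usage line are purely illustrative; in the real system m and n come from the extractor's counts and pe_i from the Susanne/ANLT estimate described above.

from math import comb

def binomial_tail(m, n, p):
    """P(m+, n, p): probability of m or more successes in n Bernoulli trials."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(m, n + 1))

def accept_association(m, n, pe_i, alpha=0.05):
    """Accept the predicate-frame association if seeing frame i for m out of n
    occurrences of the predicate is unlikely under the error probability pe_i."""
    return binomial_tail(m, n, pe_i) < alpha

# Illustrative values only: a verb seen 50 times, 6 of them hypothesized with
# frame i, where frame i has an estimated error probability of 0.03.
print(accept_association(6, 50, 0.03))   # True: 6 hits are unlikely to be noise alone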
We therefore supplement the dictionary-comparison approach with manual analysis of the data given to the system. This allows us to measure type recall and precision against the gold standard entries and the corpus analysis, and token recall and ranking accuracy against the corpus analysis, as defined below:
Type recall: TP ÷ (TP + FN types in dict/corpus)
Type precision: TP ÷ (TP + FP types in dict/corpus)
Token recall: TP ÷ (TP + FN tokens in corpus)
Ranking accuracy: % of pairs of TPs whose ranking by relative frequency is the same as in the corpus
For 14 pseudo-randomly chosen verbs and training data of between 200 and 1000 exemplars per verb, the system achieved the results shown in the table below:
                    Dictionary (14 vbs)   Corpus (7 vbs)
Type precision      65.7%                 76.6%
Type recall         35.5%                 43.4%
Token recall                              80.9%
Ranking accuracy                          81.4%
These results are considerably worse than the (type) precision and recall results for ANLT and Comlex, reported above, but were nevertheless promising. Analysis of the system behaviour showed that for associations seen fewer than 10 times, the binomial hypothesis test performed no better than chance. Therefore, much of our subsequent work has focused on how to improve the filtering and entry construction component of the system.
4. Better filtering
Briscoe, Carroll and Korhonen (1997) demonstrated that iteratively optimizing the error probabilities used in the above experiment on the basis of the correct entries resulted in a system yielding nearly 9% improvement in type precision, over 20% improvement in type recall, and a 10% improvement in ranking accuracy on a further 20 test verbs. This result shows that the dictionary-based estimation of error probabilities is far from optimal. However, the other critical problem with the binomial hypothesis test -- that it deals poorly with low frequency events -- remains. Korhonen, Gorrell and McCarthy (2000) compared results using the binomial test, the log likelihood ratio test and an empirically determined thresholding scheme. A log likelihood test has been proposed by Dunning (1993) as appropriate for non-normally distributed data with many rare events. However, on this task the log likelihood test performed significantly worse than the binomial, resulting in more FNs for high frequency frames, more FPs for medium frequency frames and no improvement in performance on low frequency frames. The thresholding scheme simply involved converting the hypothesized counts for each frame for a given predicate form into a conditional probability for each frame given the predicate, P(frame_i | predicate_j), using the maximum likelihood estimate (MLE, i.e. the ratio of the count for predicate_j with frame_i over the count for predicate_j), and then rejecting any association with a probability lower than an empirically determined optimum on non-test data. This resulted in a 24% improvement in precision and a 2% improvement in recall. However, thresholding still results in many FNs for low frequency associations. The reason why hypothesis testing does not work well for this task is that not only is the underlying distribution Zipfian, but also there is very little correlation between the unconditional distribution on valency frames independent of specific predicates and the conditional distribution of frames given a specific predicate. For example, believe occurs mostly with a sentential complement, but the sentential complement frame, in general, is rare.
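A minimal Python sketch of the thresholding scheme follows; the counts and the threshold value of 0.01 are illustrative only (in the experiments the threshold was tuned empirically on non-test data).

def mle_threshold(frame_counts, threshold=0.01):
    """Convert hypothesized frame counts for one predicate form into conditional
    probabilities P(frame | predicate) by MLE and keep frames above the threshold."""
    total = sum(frame_counts.values())
    probs = {frame: count / total for frame, count in frame_counts.items()}
    return {frame: p for frame, p in probs.items() if p >= threshold}

# Hypothetical counts returned by the extractor/classifier for one verb form:
counts = {"NP": 412, "NP-PP": 57, "S-COMP": 3, "INTRANS": 1}
print(mle_threshold(counts))   # the rarest, probably noisy, frames are dropped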
Any test that involves reference to the unconditional distribution in order to filter hypotheses about the conditional distributions will therefore perform badly. Unfortunately, the same observation also undermines any attempt to use the unconditional distribution as the prior in a Bayesian scheme, or to smooth the conditional distributions with it to compensate for the known poor performance of MLE on rare / unseen events. Korhonen (2000) compares the performance of the MLE thresholding scheme with a version smoothed using the conditional distributions for a predicate related to those whose associations the system is attempting to learn. One simple approach to smoothing one distribution with another to infer the probabilities of unseen events is linear interpolation (e.g. Manning and Schutze, 1999:218f). The conditional probability P(frame_i | predicate_j) is smoothed using the conditional probability P(frame_i | predicate_k), where predicate_k stands in some specified relationship to predicate_j, according to the formula below:
P(frame_i | predicate_j) = λ1 P_MLE(frame_i | predicate_j) + λ2 P(frame_i | predicate_k)
where the λ denote weights which sum to 1, and can be optimized with held-out data so that most of the smoothed probability for P(frame_i | predicate_j) is determined by its MLE estimate. Korhonen (2000) demonstrates that the distribution of frames given predicates is better correlated for hypernyms and synonyms of target predicates (derived from WordNet) than for the unconditional distribution. She then compares results for acquiring frames for 60 test verbs from 10 semantically-defined classes (based on Levin's (1993) classification), using as a baseline MLE thresholding with no smoothing, and comparing this to linear interpolation of the MLE estimates with the unconditional distribution for all verb frames and also with the merged conditional distributions for 3 other verbs from the same class. Her results are shown in the table below:
                    Baseline   Unrelated smoothing   Sem.-related smoothing
Type precision      78.5%      71.4%                 87.8%
Type recall         63.3%      64.1%                 68.7%
Ranking accuracy    79.2%      67.6%                 84.4%
The measures shown are based on comparison with the corpus data used by the system. The baseline performance is better than smoothing with the poorly correlated unconditional distribution. However, smoothing against the conditional distributions of semantically related predicates results in a significant improvement in performance. All but 3 of the 151 low frequency FNs rejected by MLE and thresholding exceed the threshold after semantically-related smoothing, suggesting that this approach provides an effective way of dealing with low frequency associations. There is, of course, a cost to the smoothing approach, as it is necessary to obtain the smoothed distributions for all the different classes of predicates exemplified in the English lexicon. However, based on analysis and extension of Levin's (1993) classification, it seems likely that the total number of classes needed for good performance on this task is unlikely to exceed 50, so it should be possible to seed such a system by obtaining about 200 conditional distributions semi-automatically for smoothing purposes. The underlying reason why this approach is effective is that similar predicates have similar, though by no means identical, `paradigms' of associations to valency frames, because they tend to undergo the same types of frame alternation processes as other predicates in the same class.
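The interpolation step can be sketched in a few lines of Python. The weight of 0.9 and the two distributions below are illustrative assumptions; in Korhonen (2000) the weights are optimized on held-out data and the backing-off distribution is merged from several members of the same Levin-style class.

def smooth(p_target, p_related, lam=0.9):
    """Linear interpolation: lam * MLE distribution of the target predicate plus
    (1 - lam) * distribution of a semantically related predicate (or a merged
    class distribution). lam = 0.9 is an illustrative value."""
    frames = set(p_target) | set(p_related)
    return {f: lam * p_target.get(f, 0.0) + (1 - lam) * p_related.get(f, 0.0)
            for f in frames}

# A frame unseen for the target verb receives some probability mass from its
# class neighbours, so a genuine low-frequency association can survive thresholding:
p_target = {"NP": 0.85, "NP-PP": 0.15}
p_related = {"NP": 0.70, "NP-PP": 0.20, "S-COMP": 0.10}
print(smooth(p_target, p_related))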
This more semantically-driven approach to learning valency associations seems to be getting us closer to the tested performance of existing manually-derived valency dictionaries such as ANLT or Comlex, with the added advantage that we recover the relative frequencies of these associations. However, it also raises the issue of polysemy, and of what precisely the associations hold between. I argued initially that associations hold between predicate senses and specific frames. However, our corpus-based work so far has resulted in associations between target predicate forms and frames. In order to classify predicate forms into semantic classes, Korhonen used the predominant sense of the form as defined by WordNet. However, it is clear that the results obtained with the technique would be improved if it were possible to classify occurrences of predicate forms into appropriate senses. Similarly, the corpus-based work so far has resulted in the recovery of lists of lemmas which occur as the heads of arguments in frames with given predicates, but the induction of selectional preferences from these lists requires their sense disambiguation too.
5. Predicate and argument sense disambiguation
The performance of word sense disambiguation systems (e.g. Wilks and Stevenson, 1998) has improved to the point where it is viable to integrate such technology into our system. McCarthy (1997) and McCarthy and Carroll (2000) develop a system which utilizes the WordNet semantic hierarchy on nouns and the lists of head lemmas in argument slots in valency frames returned by the pattern extractor to infer a probability distribution on semantic classes occurring in a given argument position in a given frame for specific predicates. This probability distribution characterizes the selectional preference(s) of the predicate on that argument. The approach utilizes the minimum description length principle (MDL; e.g. Li and Vitanyi, 1998) and aims to find the smallest model over semantic classes which describes the data (the list of head lemmas) most succinctly. The head lemmas are assigned to WordNet semantic classes, and counts of the classes exemplified for a particular frame argument are maintained and propagated through the hierarchy. If a lemma is ambiguous between classes then the counts are evenly distributed between these classes. For example, for the direct object argument in the transitive frame for eat, lemmas such as food and chicken might be seen. In WordNet food is classified as a substance and is itself a semantic class food, while chicken is variously classified as food, bird and person, with only the food class itself classified as a substance. Therefore, the counts would be divided and propagated so that food and substance each obtained 1 1/3 counts, while bird and person obtained 1/3 each. These counts are turned into MLE estimates, but the low probability classes are often filtered out using MDL since they add to model complexity without compressing the description of the lemmas that occur. The resulting selectional preference distributions have been used for WSD on nouns occurring in frame slots, with competitive results of around 70% precision (Kilgariff and Rosenzweig, 2000), and also to identify whether a specific predicate participates in a frame alternation. For example, many verbs with a causative component to their meaning participate in the so-called causative-inchoative alternation, exemplified in 4).
4a) Kim broke the window.
4b) The window broke.
4c) Kim galloped the horse.
4d) The horse galloped.
Identifying the precise class semantically is difficult if not impossible (e.g. Levin and Rappaport, 1995), probably because the alternation is semi-productive and partly driven by item familiarity (e.g. Goldberg, 1995). An alternative is to look for near-identical selectional preference distributions on argument slots between valency frames putatively related by such alternations. In this case, we would expect the direct object slots in 4a) and 4c) to share similar distributions with the subject slots in 4b) and 4d), respectively. This approach to such alternations has several potential advantages from the perspective of learning valency frames for predicates, such as more economical representation of valency lexicons, statistical estimation of the semi-productivity of alternation rules and thus the use of such rules as a further aid in the induction of low frequency frame associations (e.g. Briscoe and Copestake, 1999), and, perhaps most importantly, inferring when the basic sense of a predicate will be systematically and predictably modified by association with a frame, as with the example of slide and dative movement discussed in the introduction. McCarthy (2000) reports results of an experiment with two alternation rules (causative-inchoative and conative) which yielded a 75% accuracy rate at classifying predicates as participants in these alternations. Most errors were FPs, and it may once again be that polysemy of the predicate forms accounts for much of this noise. Most recently, McCarthy in unpublished work has begun to disambiguate verb forms into WordNet-defined senses using the distribution of nouns in the argument slots with which the verb form has been associated. If this work is successful, it should be possible to tie frame associations to predicate senses directly, and thus to reduce some of the noise inherent in the work that relies on sorting predicates into semantic classes before identifying the frames they are associated with, the selectional preferences on these frames, and so forth.
6. Related work
There seems to have been very little work done on corpus-based acquisition of valency frame associations prior to the nineties. Seminal work was done by Brent (1991), establishing the importance of precise cues and introducing the binomial hypothesis test. Manning (1993) extended this work by performing finite-state text chunking before extracting patterns for frames. Ushioda et al (1993) were the first to attempt to extract the relative frequency of different frames. These systems recognized a maximum of 23 frames and had performance rates of around 80% token recall, in line with the original system we developed, which recognized 161 frames. More recent papers are Gahl (1998), Carroll and Rooth (1998), Lapata (1999), Stevenson and Merlo (1999) and Zeman and Sarkar (2000). None of these utilizes a set of frames as large as ours or reports results suggesting more accurate performance. Carroll and Rooth dispense with hypothesis testing and use expectation maximization without thresholding, prefiguring the results of Korhonen, Gorrell and McCarthy (2000) to some extent. Zeman and Sarkar also compared the log likelihood ratio to the binomial hypothesis test and preferred the latter, but start from preparsed data. Lapata and Stevenson and Merlo focus on learning which verbs participate in a small number of alternations. 7.
The future: self-organizing dictionaries The recent experiments we have undertaken are based on applying the pattern extractor to 20M words of the BNC (Leech, 1992). There is no reason, in principle, for us not to parse more data, however, the overall performance of the system would be improved if parse selection accuracy were better. Carroll, Minnen and Briscoe (1998) demonstrate that using the valency frame associations acquired by our system to rerank the derivations returned by the parser improves its accuracy by about 10%. Incrementally integrating this and other lexical information into the probabilistic ranking of derivations is a current avenue of research. Similarly, there are other incremental improvements that can and will be made to the different subsystems being developed for identifying predicate senses, selectional preferences and alternation behaviour. Nevertheless, no matter how much data is analysed however accurately, this data will still be inadequate from a statistical perspective for the acquisition of an accurate and comprehensive valency lexicon. In the limit, the arguments that I made against the dictionary-based approach also apply to the corpus-based approach when it is applied to a finite corpus. Zipf (1949) demonstrated that several distributions derived from natural language approximate to power laws in that the probability mass is distributed non-linearly between types with a few of the most frequent types taking the bulk of the probability mass and a very long tail of rare types. Both the unconditional distribution of valency frames and the conditional distributions of frames given specific predicates are approximately Zipfian (e.g. Korhonen, Gorrell and McCarthy, 1998). Two conclusions that can be drawn from this are: 1) that, because the power law is scaling invariant, any finite sample will not be representative in the statistical sense, and 2) that power law distributions are very often a clue that we are not sampling from a stationary source but rather from a dynamical system (e.g. Casti, 1994). Baayen (1991) develops a dynamical model of word generation which predicts a Zipfian distributed vocabulary. Briscoe (2001a) develops a more general model of language as a dynamical system, the aggregate output of a changing population of partly heterogeneous generative grammars. From this perspective it is not surprising that classical statistical models of learning, which rely on representative 87 samples from stationary sources, do not perform optimally. A better model of a valency lexicon, given these observations, is of an adaptive self-organizing knowledge base which continually monitors data to update associations and the strengths of associations between predicates and valency frames. The techniques outlined in this paper will need to be improved before we will be able to meaningfully develop such a model. However, such models can be developed as an extension of the statistical techniques used here by augmenting these techniques with the ability to incrementally update and renormalize the conditional distributions acquired from data (see Briscoe 2001b) for an implementation of Bayesian incremental updating of parameter values. References Abney, S. 1996. Part-of-speech tagging and partial parsing. In Church, K., Young, S. and Bloothoft, G. Corpus-based Methods in Speech and Language, Dordrecht, Kluwer. Baayen, H. 1991. A stochastic process for word frequency distributions. Proc of 29th Assoc. For Comp. Ling. 271-278. Boguraev, B.K. and Briscoe, E.J. 1987. 
Large lexicons for natural language processing: utilising the grammar coding system of the Longman Dictionary of Contemporary English, Computational Linguistics 13.4, 219-240. Boguraev, B.K., Briscoe, E.J., Carroll, J., Carter, D. and Grover, C. 1987. The derivation of a grammatically-indexed lexicon from the Longman Dictionary of Contemporary English, Proc. of 25th Assoc. for Comp. Ling., 193-200, Morgan Kaufmann, Palo Alto, CA. Brent, M. 1991. Automatic acquisition of subcategorization frames from untagged text. Proc. of 29th Assoc. For Comp.Ling., 209-214, Morgan Kaufmann, Palo Alto, CA. Briscoe, E.J. 2000. Dictionary and System Subcategorisation Code Mappings, Ms. Computer Laboratory. Briscoe, E.J. 2001a. Evolutionary perspectives on diachronic syntax, Diachronic Syntax: Models and Mechanisms. (eds) Pintzuk, S., Tsoulas, G. and Warner, A.}Oxford University Press, Oxford. Briscoe, E.J. 2001b. Grammatical Acquisition and Linguistic Selection. In Linguistic evolution through language acquisition: formal and computational models. (ed.) Briscoe, E.J. Cambridge University Press, Cambridge. Briscoe, E.J. and Carroll, J.A. 1993. Generalized probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational Linguistics, 19.1, 25-60. Briscoe, E.J. and Carroll, J.A. 1995. Developing and Evaluating a Probabilistic LR Parser of Part-of- Speech and Punctuation Labels, 4th Int. Workshop on Parsing Technologies (IWPT95) Morgan Kaufmann, Palo Alto, CA. Briscoe, E.J. and Carroll, J.A. 1997. Automatic extraction of subcategorisation from corpora. Proc. of 5th Assoc. For Comp. Ling. Conf. on Applied Nat. Lg. Proc., Morgan-Kaufmann. Palo Alto, CA. Briscoe, E.J. and Copestake, A.A. 1999. Lexical Rules in Constraint-based Grammars, Computational Linguistics, 25.4, 487-526. Carroll, J.A. and Briscoe, E.J. 1996. Apportioning development effort in a probabilistic LR parsing system through evaluation. 2nd Conference on Empirical Methods in Natural Language Processing, 92-100, Morgan-Kaufmann. Palo Alto, CA. 88 Carroll, J.A. and McCarthy, D. 2000. Word sense disambiguation using automatically acquired verbal preferences. In Computers and the Humanities. Senseval Special Issue, 34, 1-2. Carroll, J.A., Minnen, G., and Briscoe, E.J. 1998. Can subcategorisation probabilities help a statistical parser? 6th Workshop on Very Large Corpora, 35-41, Morgan-Kaufmann. Palo Alto, CA. Carroll, G. And Rooth, M. 1998. Valence induction with a head-lexicalized PCFG. 3rd Conference on Empirical Methods in Natural Language Processing, Granada, Spain. Casti, J.L. 1994. Complexification. Harper Collins, New York. Collins, M. 1997. Three generative lexicalised models for statistical parsing, Proc of 35th Assoc. for Comp. Ling. 16-23, Morgan-Kaufmann. Palo Alto, CA. Cunningham, H., Gaizauskas, R. \& Wilks, Y. 1995. A general architecture for text engineering (GATE) - a new approach to language R&D. Research memo CS-95-21, Department of Computer Science, University of Sheffield, UK. Elworthy, D. 1994. Does Baum-Welch re-estimation help taggers? Proc. of 4th Conf. Applied Nat. Lang. Processing, Morgan-Kaufmann. Palo Alto, CA. Gahl, S. 1998. Automatic extraction of subcategorization frames from a part-of-speech tagged corpus. Proc. of Assoc. For Comp. Ling. Morgan-Kaufmann. Palo Alto, CA. Goldberg, A. 1995. A construction grammar approach to argument structure, Chicago UP. Grishman, R., Macleod, C. and Meyers, A. 1994. Comlex syntax: building a computational lexicon. Int. Conf. 
on Computational Linguistics, 268-272, Kyoto, Japan Kilgariff, A. and Rosenzweig, J. 2000. English Senseval: report and results. Ms. ITRI, Brighton, UK (www.itri.bton.ac.uk) Korhonen, A. 2000. Using semantically motivated estimates to help subcategorization acquisition. Proc. of Jnt. Conf. on Empirical Methods in NLP and Very large Corpora, Morgan-Kaufmann, Palo Alto, CA. Korhonen, A., Gorrell, G. and McCarthy, D. 2000. Statistical Filtering and Subcategorization Frame Acquisition. Proc. of Jnt. Conf. on Empirical Methods in NLP and Very large Corpora, Morgan- Kaufmann, Palo Alto, CA. Lapata, M. 1999. Acquiring lexical generalizations from corpora: a case study for diathesis alternations. Proc. of 37th Assoc. For Comp. Ling. 397-404, Morgan-Kaufmann, Palo Alto, CA. Leech, G. 1992. 100 million words of English: the British National Corpus. Language Research 28.1, 1-13. Levin, B. 1993. English verb classes and alternations. Chicago Univ. Press, Chicago. Levin, B and Rappaport, H. 1995. Unaccusativity, MIT Press, Cambridge, MA. Li, M. and Vitanyi, P.1998. An Introduction to Kolmogoroff Complexity and Its Applications, Springer-Verlag, Heidelberg. Manning, C. 1993. Automatic acquisition of a large subcategorisation dictionary from corpora. Proc. of 31st Assoc. for Comp. Ling. 235-242, Morgan-Kaufmann, Palo Alto, CA. 89 Manning, C. and Schutze, H. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge MA. McCarthy, D. and Korhonen, A. 1998. Detecting verbal participation in diathesis alternations. Proc. of 36th Assoc. For Comp. Ling., Morgan-Kaufmann. , Palo Alto, CA. McCarthy, D. 2000. Using semantic preferences to identify verbal participation in role switching alternations. Proc. of 1st Nth. Am. Assoc. For Comp. Ling. Morgan-Kaufmann. , Palo Alto, CA. Roland, D. and Jurafsky, D. 1998. How Verb Subcategorization Frequencies are Affected by Corpus Choice. Proc. of 36th Assoc. for Comp. Ling., Morgan-Kaufmann, Palo Alto, CA. Sampson. G. 1995. English for the Computer. Oxford University Press, Oxford. Stevenson, S. and Merlo, P. 1999. Automatic verb classification using distributions of grammatical features. Proc. of 9th Conf. Of Eur. Assoc. For Comp. Ling. 45-52., Morgan-Kaufmann, Palo Alto, CA. Ushioda, A., Evans, D., Gibson, T. and Waibel, A. 1993. The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora. SIGLEX ACL Workshop on the Acquisition of Lexical Knowledge from Text, Columbus, Ohio. Wilks, Y. and Stevenson, M. 1998. Word sense disambiguation using optimised combinations of knowledge sources. Proc. of 36th Assoc. for Comp. Ling., Morgan-Kaufmann, Palo Alto, CA. Zeman, D. and Sarkar, A. 2000. Automatic extraction of subcategorization frames for Czech. Proc. of Int. Conf. On Comp. Ling., 691-697, Saarbrucken, Germany. Zipf, G. 1949. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge, MA. 90 Semi-automatic tagging of intonation in French spoken corpora Estelle Campione & Jean Véronis Equipe DELIC Université de Provence 29, Avenue Robert Schuman, 13100 Aix-en-Provence (France) Estelle.Campione@up.univ-aix.fr, Jean.Veronis@up.univ-mrs.fr Abstract The transcription of spoken corpora using the punctuation of the written language is far from satisfactory. On the other hand, the manual transcription of prosody is an extremely time-consuming activity, which requires highly specialised experts, and is prone to errors and subjectivity. 
Automating the prosodic transcription of corpora can be interesting both in terms of effort reduction, and in terms of objectivity of the markup. Full automation is not achievable in the current state of the technology, but we present in this paper a technique that automates critical steps in the process, which results in a substantial annotation time reduction, and improves the objectivity and coherence of the annotation. In addition, the necessary human phases do not require a highly specific training in phonetics, and can be achieved by syntax students and corpus workers. The technique is applied to French, but most of the modules are language-independent, and have been tested on other languages. Keywords: spoken corpora, annotation, prosody, intonation, French 1. Introduction The transcription of spoken corpora is a difficult issue. It has been noted many times that transcriptions using written language punctuation are unsatisfactory and misleading (Blanche-Benveniste & Jeanjean, 1987; Leech, 1997), since the set of written punctuations is far from parallel to that of prosodic phenomena in speech. Leech (1997:90) calls the transcription of spoken language using ordinary orthography (and written punctuation) “a pseudo-procedure the only excuse for which is that it would be prohibitively expensive to attempt anything else”. Because of this inadequacy, some teams like ours have developed transcription conventions that do not make use of any of the written punctuations. For reasons of feasibility, these conventions usually mark only a very limited subset of prosody phenomena. In our case, for example, a minimalist stance has been taken and only pauses are marked (Blanche-Benveniste & Jeanjean, 1987; Blanche-Benveniste, 1990). However, this type of transcription is not entirely satisfactory either, because ambiguities appear in the resulting “text”. In French, for example, most discourse markers can belong to another category and fulfil another function. Very often meaning and context are not enough to find the correct interpretation, which requires prosodic clues. In the example below, quoi can be a discourse marker (more or less similar to you see or I mean), but also a pronoun (what). The unpunctuated fragment, je ne sais pas quoi can therefore be interpreted either as I don't know what or as I don't know, you see: écrire un un petit euh je sais pas quoi un petit recueil qui qui explique comment les étapes qu'il faut suivre Another common ambiguity consists in “floating” segments, usually complements, that can be attached to what comes either before or after in the utterance (see Bilger et al., 1997): elle arrive moi je m'en vais à une demi-heure près on travaille pas ensemble In a noticeable proportion of cases, the interpretation spontaneously adopted by corpus users is the wrong one. Sometimes they do not even notice the double reading. 91 The only satisfactory solution would be to faithfully transcribe the prosody of utterances1, but this was attempted only for a handful of corpora (e.g. the London-Lund Corpus or the Lancaster/IBM Spoken English Corpus). Prosodically transcribed corpora are totally lacking for most languages (such as French). The reasons for this shortage lies in the difficulty of prosodic transcription stressed by many authors. It is highly time-consuming and requires from the annotators a type of phonetic-oriented competence that is not common among syntax scholars. 
In addition, the very subjective nature of prosodic labelling reduces the trustworthiness of the results or requires careful control by counterexperts thus increasing the cost yet more. Automating the prosodic transcription of corpora can be interesting both in terms of effort reduction, and in terms of objectivity of the markup. Full automation is not achievable in the current state of the technology, but we present in this paper a technique that automates critical steps in the process, which results in a substantial annotation time reduction, and improves the objectivity and coherence of the annotation. In addition, the necessary manual steps do not require a highly specific training in phonetics, and can be achieved by syntax students and corpus workers. The technique is applied to French, but most of the modules are language-independent, and have been tested on other languages. 2. Overview Many schools have developed prosodic theories and annotation schemes and there is no consensus on a transcription system. It has even been said that each new monograph introduces a different coding system (Hirst, 1979; Mertens, 1990:159). ToBI (Silverman et al., 1992) has gained a wide popularity for American English, but it is not easy to adapt to other languages, even to other varieties of English (Nolan & Grabe, 1997; Leech, 1997). ToBI labelling also relies on linguistic judgements made by experts and is consequently difficult to carry out automatically, although research in that direction is underway (Wightman & Ostendorf 1992; Ostendorf & Ross, 1997). More importantly, ToBI-like systems are too detailed for many linguistic purposes, especially for syntactic studies. Leech (1997:89) makes a distinction between spoken language corpora and speech corpora, the former consisting of usually large, naturally occurring samples of continuous language or discourse, the latter referring to “laboratory speech”, usually words or sentences out of context. While a “narrow” transcription can be useful and even necessary for speech corpora, and phonetic studies, a high density of fine-grained prosodic symbols (typically one or more per word) is likely to be difficult to read and to blur the important facts in spoken corpora. As far as intonation is concerned, for example, most of the smaller melodic movements are the result of local phonotactic rules and contraints (such as the number of syllables in a word or the position of the lexical stress), and do not reflect communicative choices from the speaker (for French, see for example Hirst & Di Cristo, 1984; Di Cristo, 1999a and b). The following utterance, for example, can be either conclusive: (la maison du voisin)L [the neighbor's house] or continuative: (la maison du voisin)H In the first case, the speaker uses a final falling tone (L) to show that he/she is temporarily finished, and that this is a place for the interlocutor to take over. In the second case, the speaker marks his/her intention to continue by means of a final rising tone (H), which implicitly invites the interlocutor not to interrupt. 
In both cases, the inner movements of the sequence are controlled by the syntactic and lexical organisation of the utterance, and the speaker has no real choice concerning them (if no word is accented): (la maison) H L(du voisin) H If the utterance becomes longer, phonotactic rules impose a breakdown into smaller, less prominent prosodic groups: 1 Of course we do no claim that it would solve all ambiguities in spoken corpora (no more than punctuation solves all ambiguities in the written language). 92 (la maison) H L(du fils) H L(du voisin) H [the neighbor's son's house] It seems to us that, apart from being the only feasible approach, a broad prosodic transcription, marking only the major prosodic events is sufficient and more readable for most uses of spoken corpora. We will be as theory-neutral as possible, and simply consider that utterances are composed of consecutive prosodic segments, delimited by pauses and large pitch movements. These segments can be prosodically autonomous or depend on their neighbours. In the example above, we made the implicit assumption that the segment la maison du voisin was isolated from the rest of the discourse, for example by a long pause (marked --) : (la maison du voisin)H -- However, the same sequence of words with the same rising intonation could be prosodically coupled with the next segment, as in the following dislocation : (la maison du voisin)H (elle a brûlé) L -- [the neighbour's house, it burned down] Prosodic segments must therefore be grouped together into prosodic units. Prosodic units have an internal prosodic cohesion, and are independent from each other in the discourse flow. Our goal is therefore to detect the prosodic segments, to tag them and group them into prosodic units. Five steps are involved in the process: (1) pauses are automatically detected in the speech signal; (2) the fundamental frequency or F0 curve is stylised in order to eliminate its smaller, irrelevant details; (3) the stylised curve is reduced to a sequence of discrete symbols encoding the pitch movements; (4) the recording is orthographically transcribed and synchronised to pauses and major pitch movements; (5) the sequence of melodic movements is filtered and translated to a final prosodic coding. The orthographic transcription is entirely manual. The other steps are automatic, but require some hand correction, as summarised in Figure 1. Pause detection (automatic + manual correction) Stylisation of F0 curve (automatic + manual correction) Discretisation of pitch movements (automatic) Orthographic transcription (manual) Prosodic coding (automatic) Figure 1. Overview of transcription and annotation process 3. Pause detection The first step of the annotation process consists in an automatic detection of (silent) pauses. As simple as it may seem, this task is far from straightforward to perform by automatic means. Long pauses can be interrupted by ambient noise (frequent outside laboratory conditions). On the other hand, very short 93 pauses are extremely difficult to distinguish from plosives. For example, some very brief breathing pauses can be as short as 60 ms, which is shorter than many occurrences of plosives. We use a pause detector based on F0 detection, which behaves reasonably well in terms of robustness to ambient noise. Using a threshold of 350 ms, very few false detections occur. However, some very short pauses are not detected, and must be added by hand. 
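As a rough illustration of this first automatic step, the Python sketch below assumes a per-frame F0 track in which frames without a detected F0 carry None, and reports runs of such frames longer than the 350 ms threshold as candidate pauses. This is a deliberately simplified stand-in for the actual F0-based detector, whose internal details are not given here.

def candidate_pauses(f0, frame_ms=10, min_ms=350):
    """Return (start_ms, end_ms) spans where no F0 was detected for at least min_ms.
    f0 is a per-frame sequence with None (or 0) for frames without a detected F0."""
    pauses, run_start = [], None
    for i, value in enumerate(list(f0) + [1.0]):   # sentinel flushes a final run
        if not value:                              # no F0 detected in this frame
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if (i - run_start) * frame_ms >= min_ms:
                pauses.append((run_start * frame_ms, i * frame_ms))
            run_start = None
    return pauses

# 300 ms of voiced speech, a 400 ms silence, then more speech (10 ms frames):
print(candidate_pauses([120.0] * 30 + [None] * 40 + [118.0] * 20))   # [(300, 700)]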
The task is not particularly difficult, with the help of a graphic editor that enables visualisation of the signal and playback segment by segment. Typically the manual correction takes ca. one hour for a 15-minute recording, and does not require highly specialised phonetic expertise. At the end of this phase, silent pauses are categorised into three groups:
(1) very short pauses (< 350 ms, coded ^)
(2) short pauses (≥ 350 ms and < 1.5 s, coded -)
(3) long pauses (≥ 1.5 s, coded --).
Even if no subsequent prosodic treatment is planned, this technique leads to a much greater reliability in pause transcription than direct transcription from a tape recorder, and is therefore advisable in any spoken corpus work. Pause transcription is a very difficult exercise when done entirely manually, and we have noticed that most linguists, even highly competent ones, tend to miss many pauses, especially when they are coupled with other phenomena (such as hesitation or syllable lengthening). For example, in the fragment that we will use as a running example in the next sections:
on fait pas que le pressing on fait aussi la blanchisserie plus la blanchisserie d'ailleurs - les draps les nappes la restauration
the pause (marked with a dash) was missed by the (skilled) linguists who transcribed and verified the corpus. The part in bold is three-way ambiguous. The form d'ailleurs is either a locative adverb (from elsewhere) or a discourse marker (similar to actually) which can be attached to what comes before or after. The three interpretations are therefore:
1. We don't do only dry cleaning, we also do the laundry plus laundry from other places: sheets, tablecloths, catering...
2. We don't do only dry cleaning, we also do the laundry. More the laundry actually: sheets, tablecloths, catering...
3. We don't do only dry cleaning, we also do the laundry. More the laundry. Actually sheets, tablecloths, catering...
The pause is an important clue for the disambiguation, but is not sufficient in itself. The intonation contour must be taken into account.
4. Stylisation of the F0 curve
F0 curves can be seen as the combination of a macroprosodic component governed by syntactic and pragmatic rules, which reflects the intonational intention of the speaker, and a microprosodic component which is entirely dependent on the effects of the particular phonemes in the utterance (lowering of F0 for voiced obstruents, etc.). Stylisation consists in extracting the macroprosodic component from the F0 curve, while the microprosodic component is factored out. Various stylisation methods have been proposed since the sixties (Cohen & t'Hart, 1965; t'Hart, Collier, & Cohen, 1990; D'Alessandro & Mertens, 1995; Fujisaki & Hirose, 1982; Taylor, 1994; etc.), and rely on more or less complex models. The method used in this work (MOMEL, standing for MOdélisation de MELodie) was proposed by Hirst & Espesser (1993) (see also Hirst, Di Cristo & Espesser, 2000). It has some appealing features compared to other methods: (1) it is language-independent; (2) it does not require any pre-segmentation of the signal (e.g. into syllables); (3) it does not require any training on the data; (4) it performs automatically with a very good success rate; (5) the stylised curve is perceptually undistinguishable from the original. The technique consists in reducing the intonation contour to a series of target points, which represent the relevant pitch movements (Figure 2).
Once interpolated by a quadratic spline curve (unvoiced segments are interpolated so that the resulting curve presents no discontinuities), the series of target points produces an F0 contour perceptually undistinguishable from the original, apart from a few detection errors that must be corrected by hand. A quantitative assessment showed that the algorithm produces about 5% of errors. A large part of these errors (approximately 3%) were moreover systematically of two or three different types, in particular missing targets in transitions from voiced to voiceless segments of speech, which suggests that an improved algorithm could probably eliminate the majority of them (Campione, forthcoming). Again, the correction is easy to perform and does not require a specialised training. The signal can be played segment by segment, and the original can be compared to the version re-synthesised using the stylised curve. If the two differ perceptually, the target points can be moved via a graphic interface until the re-synthesis is judged similar to the original. The correction phase takes around one hour for a 15 minute recording, as for the pause detection. For the moment, the two are done separately due to the use of different tools, but we plan to integrate them in the future, thus reducing the total correction time substantially. les draps les nappes la restaur ation on fait pas que le pressing on fait aussi la blanc hisserie plus la blanchisserie d'ail leurs - Figure 2. Stylisation and discretisation of the F0 curve 5. Discretisation of pitch movements The next step consists in converting the target points into a sequence of discrete symbols encoding the pitch movements. The requirements we set for this operation are as follows: (1) the set of symbols should be as small as possible; (2) it should be possible to generate from the sequence of symbols an F0 curve perceptually undistinguishable from the original; (3) it should be language independent. After experimentation with several coding systems, including INTSINT (Hirst, 1991; Hirst & Di Cristo, 1998), we have developed a mathematical model that enables a reduction of the initial curve to an alphabet of 7 symbols without substantial loss of information. The model is based on the observation that the distribution of target points is approximately normal (Campione & Véronis, 1998). Details of the model are outside the scope of this paper, but the reader can find a description in Véronis & Campione (1998). The alphabet of symbols is as follows: L+ large falling movement; L medium falling movement; L- small falling movement; S very small or null movement; H- small rising movement; H medium rising movement; H+ large rising movement. 95 These symbols have no phonological value, and consist only in an extremely compact representation of the F0 curve. We showed, by an evaluation on a large multilingual database (4 hours 20 minutes of speech, 50 speakers, 5 languages), that the encoding enables regeneration of ca. 99% of points at less than 2 semi-tones than the original (Véronis & Campione, 1998). The F0 curves re-generated from the encoding using the mathematical model are therefore virtually undistinguishable perceptively from the original. The model has interesting properties. 
In particular, movements of the same amplitude in (semi-tones) do not necessarily have the same coding, depending on the place at which they occur in the speaker's range, thus reflecting the fact that pitch variation towards the extremes requires more articulatory effort than pitch variation in the speaker's medium area. As a consequence, the model also predicts the downdrift effect which is actually observed in speech (Figure 3) without requiring a specific downdrift parameter. L H L L H H Figure 3. Downdrift effect 6. Orthographic transcription The speech signal, already segmented at pauses (see section 3), is further segmented at large pitch movements (coded H+ or L+). In the example above, four new breakpoints are inserted (three H+ and one L+)2, thus delimiting six segments: _______ (H+) _______ (H+) _______ (H+) _______ (L+) _______ (pause) _______ (pause) The corpus is then orthographically transcribed, using a graphic interface that enables playback segment by segment. During this phase, two additional prosodic phenomena are manually encoded, but only when they occur at the end of a segment: (1) accents (coded *) (2) final syllable lengthening (coded :). Another important information results from the transcription itself, and consists in filled pauses (hesitations) which are transcribed as special lexical items (euh). These cues are necessary for the correct interpretation of pitch movements in the last step (see section 7). In the example above, the segment by segment transcription yields: on fait pas que* (H+) le pressing (H+) on fait aussi la blanchisserie (H+) plus la blanchisserie (L+) d'ailleurs (pause) les draps les nappes la restauration (pause) Orthographic transcription using this strategy is less time-consuming than the usual technique with a tape recorder, and more reliable. The pre-segmentation in small units, that can be replayed at will, facilitates the transcriber's task, and helps avoiding errors (missing hesitations, repeats, etc.). 2 This is not the norm. For the sake of brevity, the example was chosen because it contained several interesting phenomena in a short time span. 96 Transcribing and correcting a 15 minutes recording takes about two hours, as opposed to three hours with no assistance. 7. Prosodic coding We distinguish two types of prosodic events, depending on their scope. A first type of prosodic event is of a punctual nature: it consists of a change of pitch or a plateau at the end of the segment, regardless of the sequence of pitch movements in that segment. In the final segment of our example: les draps les nappes la restauration the rising movement on the last syllable of restauration (H-) is enough to indicate continuation (in the context of the following pause). The sequence of preceding movements in the segment consists of a rising-falling alternation, governed by syntactic and lexical constraints. Other prosodic events bear on an entire segment. In this case, the normal rising-falling alternation is replaced by a continuous rising, falling or flat sequence on the entire segment. In the example above, the sequence plus la blanchisserie is composed only of falling movements (L- L L+). It has a communicative role since it indicates, in this particular case, that the speaker has an afterthought and corrects what she has just said. It is confirmed by the next segment d'ailleurs whose flat intonation (coded S) is typical of a discourse marker that “tags on” the preceding segment. 
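The paper does not spell out the decision procedure for this step, so the Python sketch below is one possible reading rather than the authors' algorithm: it codes a segment from its discretised symbol sequence, marking a whole-segment event when all movements share a single direction (as in plus la blanchisserie) and otherwise coding the segment by its final movement. The length condition standing in for "more than two syllables" is a rough approximation, since the symbol sequence does not give syllable counts.

def code_segment(symbols):
    """Return (code, whole_segment) for one prosodic segment, where code is
    'rising', 'falling' or 'flat' and whole_segment signals a brace-marked event.
    symbols is the segment's discretised movement sequence, e.g. ['L-', 'L', 'L+']."""
    def direction(s):
        if s.startswith("H"):
            return "rising"
        if s.startswith("L"):
            return "falling"
        return "flat"                      # the S symbol
    dirs = [direction(s) for s in symbols]
    whole = len(symbols) > 2 and len(set(dirs)) == 1
    return (dirs[0] if whole else dirs[-1]), whole

print(code_segment(["L-", "L", "L+"]))           # ('falling', True): {plus la blanchisserie}
print(code_segment(["H", "L", "H", "L", "H-"]))  # ('rising', False): ...la restauration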
In French, the combination of a downstepped intonation followed by a plateau is a common strategy for self-correction or retrospective message modification. The available prosodic information (including the presence of accent) therefore disambiguates among the three interpretations listed in Section 3, in favour of the following: We don't do only dry cleaning, we also do the laundry. More the laundry actually: sheets, tablecloths, catering... The first step of the algorithm marks the prosodic events at the end of each segment using three arrow codes: ↑ rising (H-, H, H+), ↓ falling (L-, L, L+) and → flat (S). Example:
les draps les nappes la restauration ↑
Braces indicate that a prosodic event bears on an entire segment (for segments of more than two syllables): {plus la blanchisserie}
The output of this first step on our example yields:
on fait pas que* ↑ le pressing ↑ on fait aussi la blanchisserie ↑ {plus la blanchisserie} ↓ d'ailleurs → les draps les nappes la restauration ↑
The second step of the algorithm consists in grouping the prosodic segments together into prosodic units, or, to put it another way, in detecting the prosodic boundaries between units. This step requires that all the available clues are taken into account: (1) pitch movements; (2) silent pauses; (3) accents; (4) final syllable lengthening; (5) filled pauses (hesitations). The interaction of these clues is complex. For example, in French, a falling movement is perceived as a conclusive boundary when it is followed by a short or long pause, but not if it is preceded by a syllable lengthening or a filled pause. On the other hand, rising movements followed by a short or long pause mark continuative boundaries, even if they are preceded by a syllable lengthening or a filled pause. In order to maximise readability, prosodic units are separated by paragraph marks. Each prosodic unit is preceded by its start time in seconds. In addition, redundancies are removed. Since accents always appear with a rising intonation, the arrow following the accent mark (*) is removed. In the same way, flat or falling movements before an internal pause are not marked (unless they bear on an entire segment), because they can be interpreted only as hesitation. An example of output on a larger sample is given in Figure 4.
Figure 4. Example of prosodic transcription
8. Conclusion and future work
The strategy and algorithms outlined in this paper enable semi-automatic transcription of prosodic information in spoken corpora. The transcription aimed at is a "broad" prosodic transcription, in which only the major prosodic events are annotated.
We claim that a broad transcription is more suitable for corpus-based syntactic and pragmatic studies, since most of the smaller melodic movements are the result of local phonotactic rules and lexical contraints. A “narrow” annotation (in addition to being impossible to carry out on a large scale) would be difficult to read and unnecessary for many purposes. At the moment, several separate tools are used in the various phases of the transcription. An obvious direction for further development would consist in integrating these tools into a single “prosodic 98 annotation workstation”. This would reduce the transcription time substantially, since the two phases of manual correction (pause detection and F0 stylisation) could be merged, and accomplished during the orthographic transcription itself. We estimate that, using an integrated environment, the transcription and correction time of a 15 minute recording could be reduced to about three hours. This figure is similar to the time currently required for the orthographic transcription and correction alone using a simple tape recorder. Therefore, it seems possible to add very useful prosodic information to spoken corpora at little or no extra cost. In addition, our strategy provides a segment by segment alignment of the transcription with the audio signal, which can be useful for many purposes (e.g. listening to the fragment corresponding a concordance line). Other directions of research are concerned with the fine-tuning of the various tools. For example, the discretisation of the pitch movements uses two parameters, the mean frequency and variance of target points for a speaker. Currently, these values are computed for the entire recording, but one of the features of spontaneous speech is the presence of switches in speaking style, with bursts of greater (or smaller) variation in F0. A relatively simple pre-processing could enable us to segment the recording into sections of coherent F0 mean and variance. Other tools could also be included in the processing chain. For example, we have started experimenting the use of a filled pause detector, which could assist in the transcription of this important parameter. Although the detector used was developed for Japanese (Goto, Itou & Hayamizu, 1999), and would require tuning and adaptation for French, preliminary results are encouraging. Acknowledgements In this study we made use of the Transcriber software developed by Claude Barras (DGA) and the Signaix speech tools developed by Robert Espesser (CNRS). We are particularly indebted towards Robert Espesser for his help and assistance throughout this project. We thank Masataka Goto for making his filled pause detector available to us, although extensive experimentation has not been possible yet. We are also grateful to Claire Blanche-Benveniste, Albert Di Cristo, José Deulofeu, Daniel Hirst and Frédéric Sabio for their advice and comments. Remaining errors are of course ours. References Bilger M, Blasco M, Cappeau P, Pallaud B, Sabio F, Savelli M-J 1997 Transcription de l'oral et interprétation ; illustration de quelques difficultés. Recherches sur le français parlé, 14:57-86. Blanche-Benveniste C (ed) 1990 Le français parlé : études grammaticales. Paris, CNRS éditions. Blanche-Benveniste C, Jeanjean C 1987 Le français parlé : transcription et édition. Paris, Didier Erudition. Campione E, Véronis J 2001 Une évaluation de l'algorithme de stylisation mélodique MOMEL. 
Travaux Interdisciplinaires du Laboratoire Parole et Langage d'Aix-en-Provence, 19 [in press]. Campione E, Véronis J 1998 A statistical study of pitch target points in five languages. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98), Sidney, pp 1391- 1394. Cohen A, t'Hart J, 1965 Perceptual Analysis of Intonation Pattern. In Proceedings of 5ème Congrès International d'Acoustique, Liège, 1-4. D'Alessandro C, Mertens P 1995 Automatic Pitch Contour Stylisation Using a Model of Tonal Perception. Computer, Speech and Language, 9:257-288. Di Cristo A 1999a Le cadre accentuel du français : essai de modélisation : première partie. Langues, 2(3):184-205. Di Cristo A 1999b Le cadre accentuel du français : essai de modélisation : seconde partie. Langues, 2(4): 258-269. Fujisaki H, Hirose K 1982 Modelling the dynamic characteristics of voice fundamental frequency with application to analysis and synthesis of intonation. In Proceedings of 13th International Congress of Linguists, Tokyo, pp 57-70. Goto M, Itou K, Hayamizu S 1999 A Real-time Filled Pause Detection System for Spontaneous Speech Recognition. In Proceedings of the 6th European Conference on Speech Communication and Technology (Eurospeech ’99), Budapest, pp.227-230. Hirst, DJ 1979 The transcription of English intonation. Studia Phonetica, 17:29-39. 99 Hirst, DJ 1991. Intonation models. Towards a third generation. In Proceedings of ICPhS, I, pp 305-310. Hirst DJ, Di Cristo A 1984 French Intonation : a parametric approach. Die Neueren Sprachen, 83(5):554-569. Hirst DJ, Di Cristo A 1998 A survey of intonation systems. In Hirst DJ, Di Cristo, A (eds). Intonation Systems: A Survey of Twenty Languages. Cambridge, Cambridge University Press, pp 1-44. Hirst DJ, Di Cristo A Espesser R 2000 Levels of representation and levels of analysis for the description of intonation systems. In Horne M (ed), Prosody: Theory and Experiment. Dordrecht, Kluwer Academic Publishers. Hirst DJ, Espesser R 1993 Automatic Modelling of Fundamental Frequency Using a Quadratic Spline Function. Travaux de l'Institut de Phonétique d'Aix-en-Provence, 15:75-85. Leech G, McEnery A, Wynne M 1997 Further Levels of Annotation. In Garside R, Leech G, McEnery A (eds), Corpus Annotation : Linguistic Information from Computer Text Corpora, London, Longman, pp 85-101. Mertens P 1990 Intonation. In Blanche-Benveniste C (ed), Le français parlé: études grammaticales. Paris, CNRS Edition, pp 159-176. Nolan F, Grabe E 1997. Can ’ToBI’ transcribe intonational variations in British English? In Proceedings of ESCA Workshop Intonation: Theory, Models and Applications, Athens, pp 259-262. Ostendorf M, Ross K 1997 A multi-level model for recognition of intonation labels. In Sagisaka, Campbell and Higuchi (eds), Computing Prosody. Springer, Berlin, pp 291-308. Silverman K, Beckman M, Pitrelli J, Ostendorf M, Wightman C, Price P, Pierrehumbert J, Hirschberg J 1992 ToBI: a standard for labelling English prosody. In Proceedings of ICSLP'92, Banff, pp 867-870. Svartvik J (ed) The London Corpus of Spoken English: Description and Research. Lund, Lund University Press. t'Hart J, Collier R, Cohen A 1990 A Perceptual Study of Intonation, an experimental-phonetic approach to speech melody, Cambridge, Cambridge University Press. Taylor LJ, Knowles G 1988 Manual of information to accompany the Spoken English Corpus. Technical Report, UCREL, University of Lancaster. Taylor P 1994 The Rise/Fall/Connection Model of Intonation. Speech Communication, 15(1,2):169- 186. 
Véronis J, Campione E 1998 Towards a reversible symbolic coding of intonation. In Proceedings of the 5th International Conference on Spoken Language Processing (ICSLP'98), Sidney, pp 2899-2902. Wightman CW, Ostendorf M 1992 Automatic Recognition of Intonational Features. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, San Francisco, pp 221-224. 100 An attempt to improve current collocation analysis Pascual Cantos-Gómez Universidad de Murcia (Spain) Abstract Our goal is to review the main stream of research on collocation analysis and try to overcome some of its intrinsic problems, such as the optimal span and the reason for undesired collocates (statistically significant collocates, though lexically and semantically not related with the node word). The idea is not just to calculate significant collocates of a chosen node word or to elucidate which is the best statistical procedure to achieve this goal. However, we shall focus on the way words socialise with other words, forming complex network-like structures or units: lexical constellations; something that cannot be explained using current collocation analyses. 1. Introduction Conventional or prepatterned expressions are often loosely referred to as collocations. However, since the term collocation is also used in a stricter sense to denote a special kind of lexical relationship, it is convenient to clarify these concepts. In its broadest sense, collocation is more or less equivalent to: recurrent word combination. In the Firthian tradition, however, it is generally used in a stricter sense: a collocation consists of two or more words which have a strong tendency to be used together. According to Firth (1968: 181), “collocations of a given word are statements of the habitual or customary places of that word.” That they are habitually co-occurring lexical items or mutually selective lexical items (Cruse 1986: 40). For example, in English you say rancid butter, not sour butter, or burning ambition instead of firing ambition. Both interpretations imply a syntagmatic relationship between linguistic items, but whereas the broader sense focuses on word sequences in texts, the stricter one goes beyond this notion of textual cooccurrence and emphasises the relationship between lexical items in language (Greenbaum 1974: 80). It follows that the former, but not necessarily the latter, includes idioms, compounds and complex words, and that the latter extends also to discontinuous items. There are also intermediate uses of the term, where collocations can be thought of as lying on a continuum. At one end of the continuum lie the idioms like as soft as butter. At the other end of the continuum lie free combinations. These are word combinations that can be formed by simply applying grammatical rules. For example, the table would be the result of the grammatical rule that states that the definite article precedes a countable noun. The Firthian notion of collocation is thus a more extensive lexical concept than recurrent word combination. Consequently, methodologically it is viewed as essentially a probabilistic phenomenon consisting of identifying statistically significant collocations and excluding fortuitous combinations. This means that collocation analysis can be dealt with quantitatively and to some extent also automatically. Sinclair, within the Firthian tradition, defines collocation as “the occurrence of two or more words within a short space of each other in a text” (1991: 170). 
In collocation analysis, interest normally centres on the extent to which the actual pattern of these occurrences differs from the pattern that would have been expected, assuming a random distribution of forms. Any significant difference can be taken as, at least, preliminary evidence that the presence of one word in the text affects the occurrence of the other in some way. In what follows, we shall briefly discuss: (1) the ways in which actual and expected patterns of cooccurrence can be computed and compared, and (2) the measures that can be applied to the results to assess their significance and explore their implications. 2. Extracting Collocations A significant collocation can be defined in statistical terms as the probability of one lexical item co-occurring with another word or phrase within a specified linear distance or span being greater than might be expected from pure chance. It is not our goal here to discuss and compare the various measures, though some points are important to note. Probably, the most commonly used statistical measures to determine collocate significance are z-score (Berry-Rogghe 1973), t-score and mutual information (Church and Hanks 1990). All three significance measures can be used to highlight words that appear to be most strongly 101 collocated with the node word. Strong collocates are somehow equally highlighted by all three measures; however, z-score and the mutual information measure artificially inflate the significance of low-frequency co-occurring words because of the nature of their formulae. There are important differences between the information provided by the three measures: more, perhaps, between t-score and the other two than between z-score and mutual information themselves. It is difficult, if not impossible, to select one measure that provides the best possible assessment of collocates, although there has been ample discussion of their relative merits (see, for example, Church et al. 1991; Clear 1993; or Stubbs 1995). It is probably better to use as much information as possible in exploring collocation and to take advantage of the different perspectives provided by the use of more than one measure. Geffroy et al. (1973) produced a formula for the strength of collocation (called C) which took into account both the frequency of co-occurrence and the proximity of the collocates to each other. They employed cut-off points for minimum values of C and f (frequency) such as 5 and 20, respectively. Similarly, Smadja et al. (Smadja and McKeown 1990, Smadja 1991) take into account word distance as well as word strength (spread) for a measure of word association (height) within a limited span: 5 words on either side (see Martin et al. 1983). By means of XTRACT, Smadja proposes an approach for automatically acquiring co-occurrence knowledge from statistical analysis of large corpora. Smadja assumes that two words are associated if they appear in the same sentence and are separated by less than five words. For each pair of collocating words, a vector of 10 values is created. In a flexible collocation, the words may be inflected, the word order may vary and the words can be separated by any number of intervening words (i.e. to take steps, took immediate steps, steps were taken). If the word order and inflections are fixed and the words are in sequence, the collocation is said to be rigid (e.g. International Human Rights). Kita et al. (1994) also propose a mathematical approach to automatically compiling collocations: the cost criteria. 
However, their approach differs slightly from the others mentioned so far in that they consider collocations to be cohesive word clusters, including idioms, frozen expressions and compound words. This approach focuses just on linearly fixed multiple word units, and is unable to determine discontinuous co-occurrences and/or word associations. Kita et al.'s measure relies on the length of the collocate or word sequence and on its frequency. This approach is much more in line with the research that understands collocation in the broader sense (see above) and consequently fails to account for word associations, lexical relations or discontinuous collocations. There are also other approaches, such as Lafon's use of combinatorics to determine collocational significance (Lafon 1984), Daille's approach to monolingual terminology extraction (Daille 1995) or Dunning's log-likelihood measure, also known as g-square or g-score (Dunning 1993), among others.
3. Collocations Revisited
In what follows, we shall re-examine some basic notions regarding collocations with reference to real data. This, we are confident, will give us a more comprehensive view of the complex interrelationships (semantic, lexical, syntactic, etc.) between co-occurring linguistic items. Three starting assumptions were made: (1) we understand collocates in the stricter sense, as already discussed, that is, as words that have a strong tendency to co-occur around a node word in a concordance; (2) in order to avoid being biased in favour of any data, we concentrated on the occurrences of a lemma within a single semantic domain: the Spanish noun mano (extracted from the CUMBRE Corpus, 16.5 million tokens, by SGEL); and (3) no optimal span was taken for granted. This explains why we took the full sentence as the concordance (though important collocates can be missed due to anaphoric reference). To make sure that we were actually focusing on the same semantic contexts, we first extracted all instances (full sentences) containing mano, both in singular and plural form, and next classified all occurrences semantically by means of the various definitions of mano. This preliminary distributional analysis allowed us to isolate the various meanings and to concentrate on individual meanings or identical semantic contexts. In the analysis that follows, we shall concentrate on one of the meanings of mano (“layer of paint, varnish or other substance put on a surface at one time”). As already noted, no preliminary assumptions were made regarding the relevant or optimal window size or span. We simply started by taking the whole sentence, in the belief that full sentences are more likely to contain complete ideas, meanings, etc. Fixing a span beforehand might often distort the unity of meaning or complete meaning in a sentence. We tend to think that words outside the range of a sentence are not likely to affect the meaning unity of other sentences, nor are they strongly associated with node words within other sentences. The distribution of significant collocates (z-scores) results in the following graph (Figure 1), where the x-axis displays the position relative to the node word and the y-axis the number of statistically significant collocates.
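As an illustration of how such a positional profile can be obtained, the sketch below uses a common textbook z-score formulation (observed against expected co-occurrences within the examined sentences) and then tallies the positions, relative to the node, at which significant collocates occur; the formula, the function names and the toy counts are assumptions made for this example and may differ in detail from the computation behind Figure 1.

```python
import math
from collections import Counter

def z_score(observed, coll_freq, window_tokens, corpus_tokens):
    """One common z-score variant: observed vs. expected occurrences of a
    collocate inside the examined windows (here, whole sentences)."""
    p = coll_freq / corpus_tokens            # chance probability of the collocate
    expected = p * window_tokens
    return (observed - expected) / math.sqrt(expected * (1 - p))

def positional_profile(concordance, node, significant):
    """Count significant collocates at each position relative to the node;
    `concordance` is a list of tokenised sentences containing the node."""
    profile = Counter()
    for tokens in concordance:
        node_pos = tokens.index(node)
        for i, tok in enumerate(tokens):
            if i != node_pos and tok in significant:
                profile[i - node_pos] += 1
    return profile

# toy illustration (counts are invented, not the CUMBRE figures)
print(round(z_score(17, 3000, 40000, 16_500_000), 2))
sentence = ("si algún tono no tiene la intensidad deseada "
            "aplique otra mano de pintura").split()
significant = {"algún", "tono", "intensidad", "deseada", "aplique", "pintura"}
print(positional_profile([sentence], "mano", significant))
# dividing each count by the absolute distance to the node, as done for
# Figure 2, would turn this profile into the "gravity" curve
```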
In order to get a more accurate and neater visual representation of the distribution of the significant collocates than the zigzag curves above, we normalised the data on the y-axis (number of significant collocates) by dividing the number of significant collocates found at each position by its distance from the node word (Figure 2). The y-axis represents the relative number of significant collocates. Figure 2 visually represents the extent of influence of the node on its immediate environment: the lexical gravity. Note that our notion of gravity differs radically from the one proposed by Sinclair et al. (1998; see also Mason 1997): (1) we do not calculate the extent of influence of the node on its immediate verbal environment, but take the node itself; (2) Mason's is a more heavily computed statistic; and (3) instead of displaying gravity as crater shapes, we display it as peaks. The graph stresses the attraction of MANO (layer of paint, etc.), which is greatest in the immediate environment and wears off with distance, which is an already established fact. If we observe and compare the actual gravity with its trendline, we realize that there is a kind of correlation between the relative number of significant collocates and the node distance, particularly close around the node word. Intuitively, we can state with some confidence that the most relevant collocational context for MANO (layer of paint, etc.) is between -8 and +6. It is also interesting that the data around the node tend to have a more normal distribution than the data overall. However, evidence shows that fixing an optimal span could be misleading, as significant collocates are likely to appear beyond these pre-established limits.
Fig. 1. Distribution of significant collocates vs trendline (z-score: mano)
Fig. 2. Gravity vs trendline (mano)
Against the background of this data, it becomes clear that the distribution of significant collocates is not necessarily normal. This has important consequences, as it speaks against generic assumptions on optimal spans (see e.g. Jones and Sinclair 1973; Martin et al. 1983; Smadja 1989; or Berber Sardinha 1997, among others). Phillips is right to claim that “it is probably sensible to err in favour of overinclusiveness” (Phillips 1989: 23). This led us to state two hypotheses on collocates: H1: each word has a unique idiosyncratic linguistic behaviour (different attraction strength or power), at least lexically speaking; and H2: not all significant collocates of a node word are actually attracted by it, only some. Others are likely to be attracted by other significant collocates within the same context (sentence, phrase, concordance line, etc.). Regarding H1, it is clear that our empirical data is not sufficient to accept or reject this hypothesis. However, other related research, such as Mason's (1997) gravity study and Sinclair et al. (1998), clearly supports it. They talk about the importance of establishing a reasonable span for the collocational analysis of a word. They go even further by saying that there is a need to have some means of justifying the span to be used for each word.
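Purely as an illustration of what justifying the span for each word might look like in practice, the fragment below reads a word-specific span off a gravity profile like the one in Figure 2 by walking outwards from the node while the normalised number of significant collocates stays above a threshold; the procedure and the threshold are assumptions for the sketch, not a method proposed in the paper.

```python
def estimate_span(gravity, threshold=0.5):
    """Walk outwards from the node (position 0) and keep positions while the
    gravity value (relative number of significant collocates) stays above
    `threshold`; returns the left and right limits of the estimated span."""
    left = right = 0
    while gravity.get(left - 1, 0.0) >= threshold:
        left -= 1
    while gravity.get(right + 1, 0.0) >= threshold:
        right += 1
    return left, right

# invented gravity values, loosely shaped like the peak in Figure 2
profile = {-9: 0.1, -8: 0.6, -7: 0.7, -6: 0.9, -5: 1.2, -4: 1.5, -3: 1.8,
           -2: 2.1, -1: 2.4, 1: 1.9, 2: 1.1, 3: 0.6, 4: 0.3}
print(estimate_span(profile))   # (-8, 3) with these invented values
```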
The key is to discover the extent of the node's effect on its environment (whether the optimum span varies according to the node's grammatical class, semantic range, richness of lexis and so on, or between the word-forms of a lemma, remains to be seen). These findings clearly depart from those of Martin et al. (1983) and Jones and Sinclair (1973), who fixed the span of influence of the node on its environment irrespective of the grammatical class of the node, its semantic range, etc., and concluded that the optimal span is -5 +5 and -4 +4, respectively. Summing up, this means that each word has a distinctive range of influence on other words and we necessarily need to determine the gravity of each word in order to arrive at or calculate its significant span. H2 states that not all significant collocates are actually attracted by the same word or node; only some are. Others are likely to be attracted by other significant collocates. This hypothesis does not focus on the span; it is even span-independent. What matters here is not just the span or extent of the influence of the node on its immediate environment but the fact that collocates (statistically significant ones) within the same environment are not necessarily attracted by the same node. This introduces a striking difference with regard to H1. By H2, we understand that collocates within the same environment build ordered and structured frames: not flat lexical frames (the traditional view of collocates, see H1), but complex interrelated hierarchies similar to constellations consisting of a nucleus (e.g. the Sun) which attracts various other stars, with each star attracting various other moons. Collocates form lexical conceptual multi-dimensional frames: lexical constellations. Constellations themselves can be substructures (subsets) of others (e.g. the Solar System as part of the Milky Way) or superstructures (supersets) subsuming other structures (e.g. the Solar System containing Jupiter and its moons). Both hypotheses seem plausible, particularly the first one, if we consider the mainstream and recent advances made in corpus linguistic research on collocational analysis (see e.g. Daille 1995; Mason 1997; Smadja 1992; Smadja et al. 1996; Sinclair et al. 1998). However, H1 somehow fails to conform to one major psychological feature of linguistic data: efficiency of storage of lexical items (Wolff 1991). This speaks against the economising role or purpose of language production (Coulmas 1981, Cowie 1981, Nattinger 1980, 1988, etc.). There are features that indicate that the effect or influence of a word on others varies significantly among words (see Berry-Rogghe 1974). However, the range of variability cannot be unlimited or very large. Consequently, H1 is, at least to its full extent, not fully plausible psychologically, as humans have a limited memory and storage capacity. H2 seems in this respect more realistic, both psychologically and linguistically. Evidence shows that words exert some influence on others, forming units (semantic units, tone units, syntactic structures such as phrases, etc.). Furthermore, these units are to some extent cognitively, spatially and temporally limited, though not necessarily universally fixed or equally limited. Several principles speak in favour of this: minimal attachment, right association and lexical preferences (Allen 1995: 160-2).
This means that the attraction domain of the node needs to be not just limited but also constrained, though not necessarily fixed or universally predetermined: the extent of influence is likely to vary depending on grammatical category, semantic range, etc. This assumption significantly reduces the range of variability of the various possible attraction domains (optimal spans) of node words. In addition, this also fits in with the unification-based assumption of the constellation principle: ordered hierarchical elements forming autonomous structures, sub-structures or super-structures. This explains the appearance of significant collocates for MANO (layer of paint, etc.) at positions -99, -89, -38, -28, -19 and +20, among others. If they are not attracted by the node word, then which items exert their influence on them and attract them? We might explain these facts as follows: the high z-scores categorise them as significant collocates within the sentence contexts of MANO (layer of paint, etc.); however, they are likely to be attracted by other words, not necessarily by mano. That is, these collocates are part of another unit, not the immediate unit or vicinity of the node word. From the data it becomes apparent that, for instance, profesora (-118), a significant collocate within the concordance sentence of MANO (layer of paint, etc.), might not be semantically attracted by mano, but by other lexical items within the same context, e.g. by Universidad (position -115). And Autónoma (-114) is probably attracted by Universidad (position -115), too. The reason for these significant collocates within the context sentences of MANO (layer of paint, etc.) is not the influence the noun mano exerts on them, but the fact that they are attracted by other items. Indeed, they seem to be part of a different unit (i.e. part of a different phrase). This means that the direct influence of words on others is not just limited but also somehow structured. To illustrate this, let us take one concordance sentence for MANO (layer of paint, etc.): “Si algún tono no tiene la intensidad deseada, aplique otra {MANO} de pintura” (“If the shade hasn't got the desired intensity, apply another layer of paint”). This mano sentence contains just six statistically significant collocates: algún, tono, intensidad, deseada, aplique and pintura, distributed from -9 to +2 (a positively skewed influence distribution). This gives the following collocational distribution (each token occurring once): -10 Si, -9 algún, -8 tono, -7 no, -6 tiene, -5 la, -4 intensidad, -3 deseada, -2 aplique, -1 otra, +1 de, +2 pintura. A simple flat collocational analysis would just reveal that there are six collocates likely to be statistically significant for mano. Furthermore, if we had fixed the optimum span to -5 +5 or -4 +4, we would have missed at least algún and tono. To know more about the possible lexical hierarchy, we started by calculating the expected co-occurrence probability of all possible significant collocate tuples (combinations) within the same sentence, that is, the probability that event x occurs given that event y has already occurred.
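The calculations just described are simple enough to restate as code. The short fragment below reproduces, for the tono/intensidad pair analysed in the next paragraph, the two conditional co-occurrence probabilities, the frequency-based direction of attraction, and the total of 63 possible combinations; it only restates the arithmetic given in the text, with function names chosen for the illustration.

```python
from math import factorial

def conditional_probs(cooc, n_x, n_y):
    """P(y | x) = cooc / n_x and P(x | y) = cooc / n_y, where n_x and n_y are
    the total occurrence counts of x and y and cooc their joint count."""
    return cooc / n_x, cooc / n_y

def attraction_direction(x, y, n_x, n_y):
    """The more frequent item is taken to attract the less frequent one
    (cf. Sinclair's downward vs. upward collocation)."""
    return (x, y) if n_x >= n_y else (y, x)

def n_combinations(n, r):
    """C(n, r) = n! / ((n - r)! r!), the number of order-independent tuples."""
    return factorial(n) // (factorial(n - r) * factorial(r))

# counts from the tono / intensidad table below: 11 joint occurrences,
# 1048 occurrences of tono and 651 of intensidad
p_int_given_tono, p_tono_given_int = conditional_probs(11, 1048, 651)
print(round(p_int_given_tono, 4), round(p_tono_given_int, 4))  # 0.0105 0.0169
print(attraction_direction("tono", "intensidad", 1048, 651))   # ('tono', 'intensidad')

# 6 items (mano plus 5 content-word collocates) yield 63 non-empty combinations
print(sum(n_combinations(6, r) for r in range(1, 7)))          # 63
```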
For example, taking the occurrence data for tono and intensidad we get:
tono: 11 occurrences with intensidad, 1037 without intensidad, 1048 in total
intensidad: 11 occurrences with tono, 640 without tono, 651 in total
P(intensidad | tono) = 11 / 1048 = 0.0105
P(tono | intensidad) = 11 / 651 = 0.0169
This shows that tono is lexically less tied to intensidad and more likely to appear without it, whereas intensidad is much more dependent on tono and less likely to occur without it. The probability figures indicate that the lexical attraction is mutual and bi-directional, but not equidistant: tono exerts a greater influence and attraction on intensidad than vice versa. Sinclair (1991: 115-116) already noted this fact, coining the terms upward and downward collocation. These two types of collocation are explained using the term node (for the word that is being studied) and the term collocate (for any word that occurs in the specified environment of a node). Coming back to our previous example, if we take tono as the node and intensidad as the collocate, what we have is downward collocation: collocation of a node (tono) with a less frequent word (intensidad). On the contrary, if intensidad is the node and tono the collocate, then we have upward collocation. The systematic difference between upward and downward collocation is that the former is the weaker pattern in statistical terms, and the words tend to be elements of grammatical frames, or subordinates. Downward collocation by contrast gives us a semantic analysis of words (Sinclair 1991: 116). Once the attraction direction procedure was established (upward or downward collocation), we determined all possible tuples of mano by means of the given statistically significant collocates (namely tono, intensidad, deseada, aplique and pintura). We discarded algún as it is not a content word or open-class lexical item. Potentially, the possible tuples or combinations for mano within the same context sentence range from single-word ones up to six-word clusters (the total number of statistically significant collocates plus the node word mano), that is, the sum of all combinations of r objects (number of co-occurring items) from n (the set of statistically significant collocates). Thus, the number of combinations of r objects from n is calculated by means of: C(n, r) = n! / ((n - r)! r!). This gives a total of 63 possible order-independent combinations for mano (6 + 15 + 20 + 15 + 6 + 1). However, the singletons (single-word combinations) are practically irrelevant, as they just indicate their own probability of occurrence in the corpus, irrespective of co-occurrence with other items. According to the data and probability calculations for tono and intensidad above, we believe that the attraction direction is somehow inherently determined by the observed frequency (consider also Sinclair's notion of downward and upward collocation (Sinclair 1991: 115-116)). This means that, between two items, the one that occurs more often in a language model (i.e. a corpus) is likely to exert a greater influence on the less frequent one(s). This explains why we do not care about the actual order of the combinations, as this is to some extent predetermined by the observed frequencies of their constituents. In order to concentrate on the really significant combinations, we took a number of preliminary decisions.
First, all singletons were discarded as they do not give any information as to which other items they combine with. Next, we discarded all combinations with an occurrence frequency of 1. The reason is that all significant collocates occur at least once, namely when they co-occur in the concordance sentence itself. Consequently, combinations with a frequency lower than 2 are irrelevant, as they do not contribute anything to their association power with other items. So, among all potential combinations, we discarded the hapax tuples (combinations with an occurrence frequency of 1) and the singletons (single words). This resulted in the following significant combinations for mano:
TOKEN(S)                    FREQ(corpus)   PROB(corpus)
mano, pintura               17             6.19820E-14
tono, intensidad            11             4.01060E-14
pintura, tono               8              2.91680E-14
mano, tono                  7              2.55220E-14
mano, intensidad            6              2.18760E-14
intensidad, deseada         3              1.09380E-14
mano, aplique               2              7.29200E-15
tono, deseada               2              7.29200E-15
tono, intensidad, deseada   2              4.40306E-22
We find that the most likely tuple for mano is mano-pintura. This means that, in fact, mano is responsible for its co-occurrence with pintura, but not necessarily the other way round, as mano is more likely to be found than pintura. This also applies to the tuple tono-intensidad, producing these two ordered pairs: mano (pintura), tono (intensidad). This means that the collocate intensidad, though significant within the context of MANO (layer of paint, etc.), is not directly attracted by mano but by another significant collocate (tono) within the same context. The third most likely combination is pintura-tono. Here we now have the missing link between mano-pintura and tono-intensidad. The two previous combinations can be merged and reduced into a single structure, giving: mano (pintura (tono (intensidad))). The next tuple, mano-tono, is already entailed in the structure above, and indicates that mano does indeed attract tono. However, the evidence shows that it is actually pintura that exerts a greater influence on tono, or, in other words, mano attracts tono by means of pintura. If it were not for pintura, it is unlikely that tono would be a collocate of mano or, at least, directly attracted by it. A similar case is that of mano-intensidad. The next combination, intensidad-deseada, enlarges the structure into: mano (pintura (tono (intensidad (deseada)))). The tuple mano-aplique brings along a new directly attracted collocate of mano: mano (pintura (tono (intensidad (deseada)))) (aplique). Finally, tono-deseada and tono-intensidad-deseada are already entailed in the structure above. We conclude that, in this example sentence, within the context of MANO (layer of paint, etc.), mano is the dominant collocate or node, as it directly attracts pintura and aplique and indirectly attracts most of the collocates (tono and intensidad), except for deseada. The resulting collocational hierarchy is: MANO (pintura (tono (intensidad (deseada)))) (aplique), or, in a more visual mode, very much like a 3D star constellation.
4. Conclusions
Constellations represent lexical relations among collocates in a structured hierarchical way, as against the standard simple representation of most collocational analyses. It is precisely this flat view of standard collocational work that misses two main features of collocational behaviour: (1) their non-linear dependency: not all collocates are necessarily attracted by the same node word; and, furthermore, (2) those that co-occur within a determined unit (phrase, sentence, etc.)
form a kind of complex lexical hierarchy. It is also interesting that by means of constellational analysis we might also get rid of the optimum span dilemma, as this analysis always takes the sentence unit. What matters is not the span, but the lexical hierarchy collocates form within a linguistic unit (sentence, phrase, concordance line, etc.). The lexical hierarchies are neither predetermined nor assumed (as happens in most semantic networks or word sense hierarchies); they result from the co-occurrence probabilities and the attraction calculations, which might lead to divergences depending on the corpus or language samples analysed. The potential implementation of this constellation analysis could, of course, go much further by means of incorporating lemmatisation and part-of-speech tagging. This would prevent duplicate lexical relations (i.e. homonymy, polysemy) and part-of-speech ambiguity, and it would refine frequencies, producing more accurate constellational models. Notice that we have not tackled here the problem of which statistical collocation model or method to use, as this is not a basic issue for constellational analysis. Any statistical model would, in principle, do; this is up to the researcher's preferences. Of course, much research is still needed to totally configure this new trend in collocational analysis and to sort out hierarchical discrepancies/similarities (e.g. very close statistical significances). We are confident that this new view within collocational analysis can be a positive contribution. The approach presented is transparent in nature and empirically testable, producing a good description of the storehouse of vocabulary items. That is, how words are connected through their associations and how lexical knowledge is schematic and associative. This could lead to improved results in computational lexicography (development of collocational dictionaries, automatic construction and typification of proto-senses, etc.), language pedagogy (the teaching/learning of collocational knowledge for L2 learners) and Machine Translation (particularly to EBMT), to mention some important applications. Similarly, there are also various research interests within the lexical constellation framework. This includes the capturing and/or mapping of syntactic structures (i.e. exploring whether lexical constellations capture the representation of syntax) by means of various linguistic models (Compositional Grammars, Construction Grammars, etc.) or specific parsing strategies (e.g. Data Oriented Parsing). References Allen J 1995 Natural Language Understanding. Redwood: The Benjamin/Cummings Publishing Company. Berber Sardinha AP 1997 Lexical Co-occurrence: A Preliminary Investigation into Business English Phraseology. Letras & Letras, 13/1: 15-23. Berry-Rogghe GLM 1973 The Computation of Collocations and their Relevance in Lexical Studies. In AJ Aitken, R Bailey, N Hamilton-Smith (eds) The Computer and Literary Studies. Edinburgh: Edinburgh University Press. Berry-Rogghe GLM 1974 Automatic Identification of Phrasal Verbs. In J.L. Mitchell (ed) Computers in the Humanities. Edinburgh: Edinburgh University Press. Church KW, Hanks P 1990 Word Association Norms, Mutual Information and Lexicography. Computational Linguistics, 16/1: 22-9. Church KW, Gale W, Hanks P, Hindle D 1991 Using Statistics in Lexical Analysis. In U. Zernik (ed) Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon. Hillsdale, NJ: Lawrence Erlbaum Associates, 115-164. 
Clear J 1993 From Firth Principles: Computational Tools for the Study of Collocation. In M. Baker et al. (eds) Text and Technology. Amsterdam: Benjamins, 271-292. Coulmas F 1981 Introduction: Conversational Routine. In F Coulmas (ed) Conversational Routine: Explorations in Standardized Communication Situations and Pre-patterned Speech. The Hague: Mouton, 1-17. Cowie AP 1981 The Treatment of Collocations and Idioms in Learners's Dictionaries”. Applied Linguistics, 2: 223-235. Daille B 1995 Combined Approach for Terminology Extraction: Lexical Statistics and Linguistic Filtering. UCREL Technical Papers. Volume 5. Lancaster: University of Lancaster Press. Dunning T 1993 Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19/1: 61-74. Firth J 1968 A Synopsis of Linguistic Theory 1930-1955. In FR Palmer (ed) Selected Papers of J.R. 108 Firth 1952-59. Bloomington: Indiana University Press, 1-32. Geffroy A, Lafon P, Seidel G, Tournier M 1973 Lexicometric Analysis of Co-occurrences. In AJ Aitkien, R Bailey and N Hamilton-Smith (eds) The Computer and Literary Studies. Edinburgh: Edinburgh University Press. Greenbaum S 1974 Some Verb-Intensifier Collocations in American and British English. American Speech, 49: 79-89. Jones S, Sinclair J 1973 English Lexical Collocations. A Study in Computational Linguistics. Cahiers de Lexicologie, 23/2: 15-61. Kita K, Kato Y, Omoto T, Yano Y 1994 Automatically Extracting Collocations from Corpora for Language Learning. In A Wilson, T McEnery (eds) UCREL Technical Papers. Volume 4. Corpora in Language Education and Research. A Selection of Papers from TALC94. University of Lancaster. 53-64. Lafon P 1984 Dépouillements et statistiques en lexicométrie. Geneva: Slatkine-Champion. Martin W, Al B, van Sterkenburg P 1983 On the Processing of a Text Corpus: From Textual Data to Lexicographical Information. In R Hartmann (ed) Lexicography: Principles and Practice. London: Academic Press. Mason O 1997 The Weight of Words: An Investigation of Lexical Gravity. In Proceedings of PALC'97, 361-375. Nattinger J 1980 A Lexical Phrase Grammar for ESL. TESOL Quarterly, 14: 337-344. Nattinger J 1988 Some Current Trends in Vocabulary Teaching. In C. Carter and M. McCarthy (eds) Vocabulary and Language Teaching. London: Longman, 62-82. Phillips M 1989 Lexical Structure of Text. Birmingham: ELR/University of Birmingham. Sinclair J 1991 Corpus, Concordance, Collocation. Oxford: Oxford University Press. Sinclair J, Mason O, Ball J, Barnbrook G 1998 Language Independent Statistical Software for Corpus Exploration. Computers and the Humanities, 31: 229-255. Smadja FA 1989 Lexical Co-occurrence: The Missing Link. Literary and Linguistic Computing, 4/3: 163-168. Smadja FA 1992 XTRACT: An Overview. Computers and the Humanities, 26/5-6: 399-414. Smadja FA 1991 Macro-coding the Lexicon with Co-occurrence Knowledge. In U Zernik (ed) Lexical Acquisition: Exploiting On-line Resources to Build a Lexicon. Hillsdale, NJ: Lawrence Erlbaum Associates, 165-189. Smadja FA, McKeown KR 1990 Automatically Extracting and Representating Collocations for Language Generation. Proceedings of the 28th Annual Meeting of the Association for Computational Linguistics, 252-259. Smadja FA, McKeown KR, Hatzivassiloglou V 1996 Translating Collocations for Bilingual Lexicons: A Statistical Approach. Computational Linguistics, 22/1: 1-38. Stubbs M 1995 Collocations and Semantic Profiles: On the Cause of the Trouble with Quantitative Methods. Functions of Language, 2/1: 1-33. 
Wolff JG 1991 Towards a Theory of Cognition and Computing. Chichester: Ellis Horwood.
The Design of Czech Lexical Database
František Čermák, Institute of Czech National Corpus, Philosophical Faculty, Prague, Czech Republic, Frantisek.Cermak@ff.cuni.cz
Jana Klímová, Institute of Czech Language, Czech Academy of Science, Prague, Czech Republic, Jana.Klimova@ff.cuni.cz
Karel Pala, Dept. of Information Technologies, Faculty of Informatics, Masaryk University Brno, Czech Republic, pala@fi.muni.cz
Vladimír Petkevič, Institute of Theoretical and Computational Linguistics, Philosophical Faculty, Charles University, Prague, Czech Republic, Vladimir.Petkevic@ff.cuni.cz
1. Introduction
The aim of the paper is to present a conception of the Czech Lexical Database (CLD) that should later become a basis for the new representative Czech dictionary. The main purpose of this enterprise is to build a representative Czech Lexical Database that would serve as a source of lexical information and also as a partial knowledge representation in various NLP applications (Ingria, Boguraev, Pustejovsky, 1992, p.341-365)1. The basic units in CLD can be either single lemmata like dům (house) or standard collocations such as vysoká škola (university). The assumed size of the designed CLD is approximately 50 000 entries, and we set as our primary task to concentrate on Czech verbs as much as possible, i.e. the number of the described verbs should be about 20 000 (40% of the CLD size; the estimated number of verbs in Czech is about 40 000 items). This is based on the fact that verbs represent the main relational element in natural languages, around which the other elements, mostly nouns, are centered.
2. The basic structure of CLD
It can be described by a DTD (most probably designed in XML) that will consist of the following parts (fields, see e.g. Faber, Usón 1999, p.20)2:
a1) about the sound structure of the expressions constituting a given entry. This in fact means that we will be trying to develop a (parallel) speech database for Czech that would form a data collection for building algorithms able to process speech signals, e.g. speech recognition, synthesis and encoding, as well as recognition and verification of the speaker. The speech database data will be appropriately included in the lexical database. Some interesting tasks have to be solved in this respect: in particular, additional word forms will have to be generated by the speech synthesis module, since it would be virtually impossible to account for all forms of all words in the lexical database (there are approximately 5–5.5 million word forms in Czech).
a2) about the structure of the entry (for Czech) – it will yield the information about POS and all the respective grammatical categories associated with that POS, plus the information about the basic segmentation. For nouns this can be captured dynamically, since we assume that a morphological analyzer/generator AJKA will be integrated into CLD (Sedláček)3, which would offer the morphological information on demand.
1 Ingria R, Boguraev B, Pustejovsky J 1992 Dictionary/Lexicon. in Encyclopedia of Artificial Intelligence (ed. by Shapiro S. C.), New York, John Wiley, pp.341-365. 2 Faber P, Usón R M 1999 Constructing a Lexicon of English Verbs. Berlin – New York, de Gruyter. 3 Sedláček R, Morfologický analyzátor pro češtinu (Morphological analyser for Czech). Diploma Thesis, Faculty of Informatics, Brno.
For verbs this typically includes 8 categories (attributes).
Their values would be accessed dynamically; to get this information the morphological analyzer/generator will be used in a similar way as for nouns. Word formation information will also be included here in a subfield; it should show the relevant formally justified links between the respective entries, including their semantic consequences and cross-POS relations like the direction of derivation (as in práce – pracovat (work – to work)). This poses the task of formulating the word derivation rules as formally as possible (see below, and Klímová, Pala, 2000, p.987-991).
a3) where for each sense the following should be given:
a3.2) the semantic information that can be associated with an entry – possibly based on the EuroWordNet Top Ontology and hypero/hyponymy hierarchies (trees) or their parts (subtrees or clusters) (Vossen, 1999)4. It is to be examined how large parts of the trees or subtrees can be employed – we estimate that the plausible number of the nodes used here may be about 5.
a3.3) genus proximum definitions and distinguishers (differentia specifica), which will typically be given for the entries containing nouns. In fact the genus proximum definitions can be viewed as subsets of the hypero/hyponymy trees where only two nodes are considered. The distinguishers represent a sort of problem: in our view they quite successfully resist the attempts to formalize them. This can be demonstrated by the fact that the particular dictionaries differ most in the way in which they treat the distinguishers – there is no general agreement as to which distinguishers should or should not be selected and included in the particular entries.
a3.4) – for verbs the genus proximum definitions may not work as reliably as for nouns, or they may be used reliably only for a small number of them; we therefore suggest indicating here the semantic class a verb belongs to. In this respect we are preparing a semantic classification of Czech verbs similar to Levin's (Levin, 1995)5, though in Czech this task appears more complicated because of the category of aspect (owing to which Czech verbs regularly occur in pairs). On the other hand, it is also obvious that the semantic classes of verbs are closely related to the valency frames (verb frames), and we set as our task to reflect these links in the database as well.
a3.1) the synsets that can be found for a given lexical unit (entry, lemma), possibly in WordNet fashion. The reason for having synsets follows from the fact that the relation of synonymy (and antonymy) can serve as one of the few relatively reliable ways of characterizing meaning.
a4) about the combinatorial properties of the entry and the expressions that are related to it. It is obvious that typically the syntactic properties of the given item are strongly related to the particular sense of the entry and distinguish it from the other senses. The information given in this field will be captured through valency frames for all the POS where it makes sense, i.e. for verbs, nouns, adjectives, numerals and also some adverbs. It is evident that in this respect we have to distinguish syntactic (or superficial) valency frames, which in Czech will include the combinatorial information about the morphological cases (there are seven of them in Czech), and semantic (or deep) valency frames containing the necessary information about the semantic cases (roles) that are expressed by the morphological cases.
The notation linking syntactic and semantic valencies is indicated in the examples below; however, we take it as preliminary, since the final inventory of deep cases for Czech has not been established yet (see e.g. Fillmore and Atkins, 1998, p. 417-423)6. That is not all: in our view it is also very useful to include the particular lexical information in the valency frames. Typically, it is not enough to know just the respective values of the morphological (superficial) cases, but their lexical "cast" as well; see e.g. the vital difference between the two accusatives occurring in držet v ruce knihu (hold a book in the hand) and držet tvar (to maintain, keep the shape). It can be objected that the semantic valencies should capture these differences in the senses, but for practical NLP applications it certainly appears practical to have this kind of information in CLD in an explicit form.
a5) contexts typical of the given entry, e.g. hezká dívka (pretty girl) or šikovný chlapec (smart boy) etc.; they will be obtained from the corpus.
a6) examples, e.g. držet knihu v ruce (to hold a book in the hand), otočit hlavu (to turn the head); we should get them from the corpus texts.
a7) collocations with appropriate subcategorizations, i.e. an attempt has to be made to find a semantic classification of the collocations. We will not go into details here, but it can be shown that it would be very useful to have e.g. the verbal collocations classified in accordance with the semantic classes of verbs mentioned above. A similar technique can be applied to the noun collocations as well, but we are aware of the fact that this task will require a lot of corpus data and their laborious analysis.
a8) – more structured information about the register and stylistic properties of the item, including regional information and other data; however, we would like to cover only the basic types of this information.
a9) – i.e. short etymological information related to the item.
a10) – this field will include the information about the logical type of the item based on Transparent Intensional Logic (TIL) (Materna 2000, Pala 2000, p.109-114)7,8. In TIL the types are built on the ramified theory of types, which, we hope, may lead to formally more consistent semantic representations of NL expressions. This, together with hierarchic hypero/hyponymic structures, will enable us to use CLD also as a part of knowledge representation systems. We would like to try to establish the relations between the EuroWordNet Top Ontology and the Type Ontology as defined within TIL. This should yield more precise and less arbitrary semantic classifications for the semantic hierarchies, semantic relations and semantic features as well, though on the other hand we are aware that this enterprise may also lead to some problems, e.g. it may be feasible only for some entries or parts of speech (verbs, nouns, adjectives, adverbs).
a12) encyclopedic information – can be included in CLD where it would be useful or even necessary for possible NLP applications; this may hold e.g. for the entries that are related to information technologies.
4 Vossen P et al. 1999 Final Report on EuroWordNet-2, 2D041. CD ROM, v.1, Amsterdam, University of Amsterdam. 5 Levin B 1995 English Verb Classes and Alternations. Chicago, The University of Chicago Press. 6 Fillmore Ch, Atkins B 1998 FrameNet and Lexicographic Relevance, in Proceedings of the First National Conference on Language Resources and Evaluation (eds. Rubio A, Gallardo N, Castro R, Tejada A), vol. 1, Paris, ELRA, pp. 417-423.
The question is whether we should look for a kind of knowledge representation language that would make it possible to represent the encyclopedic information, or to approach the problem pragmatically and follow the present encyclopedic resources in their current form. In the examples below we give only some arbitrary explanations, but at the beginning it appears more reasonable to follow the latter path.
3. Resources for CLD
Thanks to the favourable situation with regard to the Czech National Corpus (CNK, at Charles University in Prague, Faculty of Arts, abbrev. FF UK) and the corpus ESO (at Masaryk University in Brno, Faculty of Informatics, abbrev. FI MU), we assume that the building of CLD can be based mainly on the CNK and ESO data. Other resources would be used as well, particularly the two existing Czech dictionaries: 1) the large Dictionary of Literary Czech (SSJČ)9 and 2) the smaller Dictionary of Written Czech (SSČ)10. Additional resources will have to be sought as well, particularly other existing dictionaries, especially terminological ones. We are also aware of the fact that a well-founded reader programme has to be established sooner or later, which should later be closely connected with the preparation of the New Czech Dictionary.
4. Tools
Thanks to the interesting results in the NLP research being performed both at Charles University in Prague (Institute of the Czech National Corpus, Institute of Formal and Applied Linguistics, Institute of Theoretical and Computational Linguistics) and at Masaryk University in Brno (Faculty of Informatics, NLP Laboratory), we have at our disposal a basic set of tools that can be used for building CLD. In particular, we will take advantage of the morphological analyzer AJKA (mentioned above), parsers (DIS and GT; Žáčková, Popelínský, Nepil, pp.219-225; Horák, Smrž, pp.43-50)11, disambiguators (Oliva, Petkevič et al., pp.3-8)12, a corpus manager and graphical interface using client-server architecture (Rychlý, 2000)13, and a dictionary editor and browser based on the XML format that can handle any dictionary converted into XML. The modified version of the browser can also be used for processing any WordNet data in a way that is now possible with the Polaris database (Vossen, 1999). Other tools include various conversion programs, programs for corpus maintenance and corpus preparation, heuristic programs for obtaining valency frames from corpus texts, a Czech morphological database, and programs for automatic word derivation that would capture word derivation chains like učit (to teach) – učení (teaching) – učitel (teacher) – učitelka (she-teacher) – učený (scholarly) – učenec (scholar) – výuka (tuition, lesson), etc.
7 Materna P 2000 Type-theoretical analysis as a preparation of analyzing expressions of a natural language. Prague - Brno, Faculty of Informatics MU, manuscript, pp.110. 8 Pala K 2000 Word Senses and Semantic Representations - Can We Have Both? in Proceedings of TSD 2000, Berlin, Springer Verlag, pp.109-114. 9 Slovník spisovného jazyka českého (Dictionary of Written Czech Language) 1960, Praha, Academia. 10 Slovník spisovné češtiny (Dictionary of Literary Czech), Praha, Academia. 11 Žáčková E, Popelínský L, Nepil M, Recognition and Tagging of Compound Verb Groups in Czech. in Proceedings of CoNLL-2000 and LLL-2000, Lisbon, ACL New Brunswick, pp.219-225. Horák A, Smrž P, Large Scale Parsing of Czech. in Proceedings of Efficiency in Large-Scale Parsing Systems Workshop, COLING'2000, Universitat des Saarlandes, Saarbruecken, pp.43-50. 12 Oliva K, Petkevič V et al. The Linguistic Basis of a Rule-Based Tagger of Czech. in Proceedings of TSD 2000, Berlin, Springer Verlag, pp.3-8.
It has to be decided whether these data should be included in CLD directly, or rather whether they would be obtained dynamically from the morphological module (Klímová, Pala, 2000, p.987-991)14. We touched on this question above when discussing the morphological information about the entries. In the examples given in the next section we have not tried to show the word formation semantic information, because we hope that it can be obtained dynamically as the output of the word formation engine that is presently being built for Czech at the NLP Lab at the Faculty of Informatics MU.
5. Conclusions
In this short contribution we have presented the underlying assumptions from which the building of the Czech Lexical Database can start. We are aware of the fact that a number of the discussed points will have to be elaborated more deeply and systematically to obtain fully applicable results. This explains the fact that the examples of the entries are in several points tentative skeletons rather than full and complete entries. However, it is our hope that the described techniques, resources and tools will allow us to reach our goal.
5.1 An example15
An example of the entry for the Czech verb držet (hold) follows. [The entry, given in an XML-like notation, specifies for each sense of držet a valency frame, a synset, WordNet-style semantic features and a semantic class (e.g. verbs of holding and keeping, verbs of maintaining and keeping shape, verbs of keeping a position, verbs of possession, verbs of growing, verbs of reservation and booking, verbs of emotional attitudes, verbs of social grouping), typical contexts and corpus examples, collocations with their semantic class, a logical type (e.g. a relation-in-intension between two individuals) and encyclopedic information.]
13 Rychlý P, Korpusové manažery a jejich efektivní implementace (Corpus Managers and their Effective Implementation). Ph.D. Dissertation, Brno, Faculty of Informatics MU. 14 Klímová J, Pala K 2000 Application of WordNet ILR in Czech Word-formation. In Proceedings of LREC Conference, Athens, ELRA, pp.987-991. 15 Note that in the examples English is used to describe the fields; however, we doubt whether it makes sense to translate the Czech lexical data within the fields. We decided not to give the translations and to offer the necessary explanations during the poster session when discussing the particular points.
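Since the entry itself cannot be reproduced here, the following fragment gives a purely illustrative impression of what one sense of an XML-encoded CLD entry along the lines of Section 2 might look like, built with Python's standard library; all tag names, the valency notation and the synset wording are assumptions made for this sketch and do not reproduce the authors' actual DTD.

```python
# Illustrative sketch only: the tag set and values are invented for this
# example; the paper leaves the final DTD of the CLD open.
import xml.etree.ElementTree as ET

entry = ET.Element("entry", lemma="držet", pos="verb")
sense = ET.SubElement(entry, "sense", n="1")
ET.SubElement(sense, "synset").text = "držet, mít v ruce"        # assumed synset
ET.SubElement(sense, "semclass").text = "verbs of holding and keeping"
valency = ET.SubElement(sense, "valency")
ET.SubElement(valency, "arg", case="nom", role="agent")          # k1 in Czech case notation
ET.SubElement(valency, "arg", case="acc", role="patient")        # k4, assumed here
ET.SubElement(sense, "context").text = "držet dveře"
ET.SubElement(sense, "collocation").text = "držet knihu v ruce"
ET.SubElement(sense, "logtype").text = "relation-in-intension between two individuals"

print(ET.tostring(entry, encoding="unicode"))
```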
5.2 Example of the entry for the Czech noun hlava (head). [The entry for hlava is structured in the same way, distinguishing senses such as the body part (with WordNet-style holonym/meronym features – tělo (body); oči, nos, tváře, ústa (eyes, nose, cheeks, mouth) – and synonyms such as šiška), mind or consciousness (mysl, vědomí), boss or chief (náčelník), and a unit into which the contents of a book are divided (chapter), each again with synsets, corpus examples, collocations with their semantic class, and encyclopedic information.]

Using a corpus of school children's writing to investigate the development of vocabulary diversity
N. CHIPERE, D. MALVERN, B. RICHARDS and P. DURAN, School of Education, University of Reading, UK
Abstract
The paper shows how a corpus-based approach can be used to investigate the development of vocabulary diversity during the school years. The theoretical and pedagogical motivations for the investigation are outlined and the advantages of using a corpus-based approach are discussed. The problem of text-length effects on type-token ratios is presented, followed by a description of a recent mathematical solution to the problem (Richards and Malvern, 1997). Data for the investigation consisted of a corpus of 899 narrative essays from school children aged 8 to 15. The essays were grouped into three age categories (Key Stages 1, 2 and 3) and, within each Key Stage, essays were further categorised according to a maximum of eight possible levels of writing ability as defined by the British National Curriculum for English. An analysis was carried out for each Key Stage to determine the relationship between level of linguistic ability and vocabulary diversity. The paper presents results from the analysis and discusses some theoretical and pedagogical implications. Future applications of the mathematical model to the investigation of diversity in categories of linguistic form other than vocabulary are discussed.
1 Introduction
Computer-based corpus analysis can reveal meaningful patterns in first language development during the school years. Results are reported here from an analysis of vocabulary diversity in the writing of school children aged eight to fifteen. The analysis is part of a three-year funded research project which seeks to identify quantitative measures of the written language skills of school children. The paper begins by outlining some key theoretical and pedagogical motivations for studying children's writing. A discussion of some of the methodological issues involved is then followed by a report of the study and its results.
2 Theoretical and pedagogical motivations for the study
Recent policy changes in first language pedagogy in England are compelling linguists to reconsider long-held assumptions about language development.
It has been assumed that procedural grammatical competence is attained without instruction by age five or soon thereafter. This assumption is congruent with previous educational practice which excluded the teaching of first language grammar. However, the recent ‘National Literacy Strategy’ now requires the grammar of English to be taught in English schools. This policy shift, brought about by public concerns with standards of literacy, makes an implicit but clear theoretical statement about grammatical development: that it is not complete by age five and that it can be facilitated by formal instruction. The British government has therefore adopted a view of grammatical development which can loosely be described as empiricist. The empiricist view is that language is learned with feedback from experience (e.g. Salzinger, 1975 and Sampson, 1999). However, it is a nativist axiom that competence in language is attained without assistance very early in life (e.g. Pinker, 1994 and Marcus, 1994) and that, as far as the purely grammatical component of linguistic competence is concerned, the role of experience is merely to trigger innate knowledge. While empiricists might therefore regard the attempt to teach first language grammar as fully justified, nativists might see it as misguided and redundant. The theoretical and pedagogical importance of settling the question needs no emphasis. However, it is not easy to tease empiricist and nativist claims apart on the basis of pre-school language acquisition data. In view of the evidence that grammatical development continues during the school years (e.g. Karmiloff-Smith 1986 and Perera 1986), it is necessary to extend the investigation to cover this period as well. It is also necessary to use both experimental and observational methods of data collection. For instance, one of the authors reports variations in the procedural grammatical competence of 18 year-old native English speakers and relates these variations statistically to differences in academic ability (Chipere, in press and Chipere, 1999). The literature reviews in the just cited work indicate that levels of procedural grammatical competence are also statistically related to levels of formal education. While 127 this evidence supports the empiricist view, the necessarily tight focus of experimental studies restricts the range of subjects and materials. Corpus-based analyses can complement experimental studies by providing a wider coverage of subjects and materials. Corpus-based analyses can also produce data which has pedagogical applications. For instance, corpus analysis might reveal linguistic features which characterise good versus poor writing among school children. It is known that skilled writers use more diverse vocabulary than less skilled writers. Corpus analysis techniques can make it possible to measure vocabulary diversity scores for pieces of children's writing and these scores could then enable teachers to decide how much time and effort a given pupil should spend in improving vocabulary knowledge. Vocabulary diversity scores might also be useful for assessment purposes. There is a growing interest in the possibility of computerised assessment of writing and vocabulary diversity scores might inform automated assessment of lexical richness. 3 Framework of analysis Thus there are both theoretical and pedagogical motivations for studying language development during the school years. Some of the methodological issues surrounding such an investigation now need to be discussed. 
The framework of analysis which has been adopted for the current project is that of Biber (1988). This framework 'is based on the assumption that strong co-occurrence patterns of linguistic features mark underlying functional dimensions' (Biber, 1988:12). For instance, conversational interaction represents a functional dimension of language which is characterised by a given pattern of co-occurring linguistic features. This pattern differs markedly from the pattern which characterises, for instance, the delivery of technical information. Biber's analytical procedure therefore involves calculating the frequencies of a large number of selected linguistic features and then deriving a set of functional dimensions through factor analysis. Biber proposes that one application of his framework is in composition research. It might be the case, for instance, that good and poor writing are marked by different co-occurrence patterns. Grabe and Biber (1987, cited in Biber, 1988) found only small differences between good and poor essays in their study. However, it is possible that large differences will be found if a) the sample represents the whole range of writing ability and b) texts are analysed for the whole range of linguistic features known to mark language development. These two considerations inform the current project. It is intended to analyse the writing produced by children who represent a wide range of writing ability. Their writing will then be analysed in terms of quantitative measures of language development reported in the literature (e.g. Johnson, 1944; van der Geest, Gerstel, Appel and Tervoort, 1973; Barnes, Gutfreund, Satterly and Wells, 1983; Fletcher and Peters, 1984; Bennett-Kastor, 1988; Klee, 1992 and Snow, 1996). This literature identifies the ratio of different types of words to the total number of words in a text, or type-token ratio (TTR), as one of the most important indicators of language development. TTR is taken as a measure of vocabulary diversity and it is usually expected that TTRs will be positively correlated with other measures of language development.

4 The type-token ratio
A serious flaw in the calculation of TTR, however, has been identified by several writers, from as early as Chotlos (1944) to Biber (1988). Richards and Malvern (1997) provide an extensive discussion of the issues and propose a solution which will be described presently. The following paragraphs summarise the key points in that discussion. TTR is calculated by dividing the number of different types (V) by the total number of tokens (N) in a text. Many researchers have mistakenly assumed that the ratio is constant over a given text. Richards (1987) shows that the ratio is closely related to text length with the following simple demonstration. Consider a simple case of a two-word text in which the same word occurs twice. In that case, TTR = 1 type divided by 2 tokens = 0.5. Now consider the case of a three-word text in which the same word occurs three times. Then, TTR = 1 type divided by 3 tokens = 0.33. In a four-word text with the same word occurring 4 times, TTR = 1 type divided by 4 tokens = 0.25. Finally, in a five-word text with the same word occurring five times, TTR = 1 type divided by 5 tokens = 0.2. Thus four texts which have exactly the same range of vocabulary yield four different values of TTR. Additionally, longer texts will tend to produce smaller TTRs.
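To make the text-length effect concrete, the demonstration above can be reproduced with a few lines of code. This is only an illustrative sketch (not part of the original study); the repeated word and the choice of Python are assumptions for illustration.

```python
# Type-token ratio: distinct word types divided by the number of tokens.
def ttr(tokens):
    return len(set(tokens)) / len(tokens)

# Richards' demonstration: the same single word repeated 2, 3, 4 and 5 times.
for n in (2, 3, 4, 5):
    text = ["word"] * n
    print(n, "tokens ->", round(ttr(text), 2))

# Prints: 2 tokens -> 0.5, 3 tokens -> 0.33, 4 tokens -> 0.25, 5 tokens -> 0.2.
# Identical vocabulary, but TTR falls as the text gets longer.
```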
Richards and Malvern show how failure to recognise this flaw has resulted in contradictory research findings in the child development literature. Examples are cases where the text length effect produces results which indicate a) no differences in TTRs taken from transcripts of children at different levels of 128 language development; b) lower TTRs for more advanced versus less advanced children and c) a lack of correlation between TTR and other measures of language development. Several researchers have long been aware of the problem, however, and have tried to correct it. Solutions have taken the form of controls on text length or transformations of TTR. As shown below, all these solutions are either inherently flawed or subject to practical limitations which make them inappropriate for the analysis of children's writing. 5 Controls on text length In the child language development literature, Stickler (1987) proposes standardising text length by using 50 utterances taken from the middle of a transcript. However, this method does not eliminate the text length effect, since more advanced children produce more words per utterance than less advanced children. Thus, not only will the measure of lexical diversity be distorted, but there is a possibility that the TTR values of more advanced children will be smaller than those of less advanced children. A demonstration of this anomaly is provided in Richards and Malvern (1997: 26). Another solution is to standardise the number of tokens. While this solution does eliminate text length effects, there are practical problems. Standardising text length is practicable for a given corpus but it is difficult to arrive at a standard text length which can be applied to all corpora. Thus the standard text lengths have varied from 1000 tokens (Wachal and Spreen, 1973 and Hayes & Ahrens, 1988) to 400 tokens (Biber, 1988 and Klee, 1992) to 350 tokens (Hess et al, 1986) to 50 tokens (Stewig, 1994). This variation is problematic because TTRs which are calculated on the basis of shorter text lengths will be higher than those calculated from longer text lengths. It is therefore not possible to compare the two sets of TTRs. The difficulty of arriving at a universal standard cannot be solved simply by consensus because of wide variations in the lengths of transcripts from different sources. For instance, many transcripts of child language data are much shorter than those from adult language data. Standards based on the length of child language transcripts will therefore involve wasting a considerable amount of adult language data and possibly reducing the reliability of the measure. The crux of the problem is that TTR continues to fall with increasing text length and measuring TTR at any one point is inherently unsatisfactory. The final solution based on standardising text length to be discussed here is the Mean Segmental Type Token Ratio or MSTTR (Johnson, 1944). This measure involves calculating the mean TTR for consecutive equal-length segments of text. The advantage of this method over standardising the number of tokens is that a) the size of the smallest transcript in a corpus can be used as the size of the segment and b) nearly all the data are used. However, a problem remains in that it is not possible to compare cross-corpus MSTTRs based on different-sized segments, since MSTTRs based on short segments will be higher than those based on longer segments. 
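As a minimal sketch of the segmental idea just described (an illustration only, not Johnson's original procedure or any of the cited implementations; the 50-token segment size is an arbitrary assumption), MSTTR can be computed as follows.

```python
def msttr(tokens, segment_size=50):
    """Mean Segmental Type-Token Ratio: average the TTRs of consecutive,
    equal-length segments; an incomplete final segment is discarded so
    that every TTR is based on the same number of tokens."""
    segments = [tokens[i:i + segment_size]
                for i in range(0, len(tokens), segment_size)]
    complete = [seg for seg in segments if len(seg) == segment_size]
    if not complete:
        raise ValueError("text shorter than one segment")
    return sum(len(set(seg)) / segment_size for seg in complete) / len(complete)
```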
Thus, while MSTTR might appear to be an elegant solution, it too does not fully overcome the problem of text length.

6 Transformations of TTR
Attempts to transform TTR in various ways also fail to eliminate the effect of text length. Guiraud (1960) divides the number of types by the square root of the number of tokens to derive the root type-token ratio or RTTR. Herdan (1960) divides the logarithm of the number of types by the logarithm of the number of tokens to obtain the Bilogarithmic TTR. Carroll (1964) divides the number of types by twice the square root of the number of tokens to derive the corrected type-token ratio or CTTR. Ultimately, however, none of these transformations overcomes the effect of text length, since "any apparent reduction of the relationship with sample size is an artefact of the change in scale and will be accompanied by a reduction in the sensitivity of the measure due to the use of smaller units" (Richards and Malvern, 1997: 33).

7 Mathematically modelling diversity
Most of the solutions discussed above, with the exception of MSTTR, fail to eliminate text-length effects because they do not utilise the fact that diversity in a text is better represented in terms of a curve described by values of TTR taken at successive points along the length of the text rather than in terms of a single value taken at one point. A more successful class of solutions has focussed on mathematically modelling the way TTR falls with increasing token counts. A detailed discussion of the development of these models is provided in Richards and Malvern (1997). They present an equation which describes the family of curves obtained when TTR values are plotted against token counts. These curves lie between the two extremes of total diversity and zero diversity. In the case of total diversity, the number of types equals the number of tokens throughout the text and TTR = 1 at each successive point along the abscissa, resulting in a straight line with a zero slope. In the case of zero diversity, the total number of types is 1 throughout the text and TTR = 1/N for increasing values of N, where N is the number of tokens. The result is a curve which falls steeply from an initial value of 1 along the ordinate and then gradually flattens as it asymptotically approaches the abscissa. The TTR-Token curves of different texts will therefore lie between the two extremes, with increasing lexical diversity represented by increasingly shallower slopes and decreasing diversity represented by increasingly steeper slopes. The equation found by Malvern and Richards to describe this family of curves is:

Fig. 1   TTR = (D/N) [ (1 + 2N/D)^(1/2) - 1 ]

where TTR = type-token ratio, N = number of tokens and D is a constant which serves as the index of diversity.

8 Implementation and Validation of D
An algorithm for computing values of D from transcripts is described in McKee, Malvern and Richards (2000). Points for a TTR-Token curve are obtained by calculating type-token ratios for increasing values of N from N = 35 to N = 50. Each point is averaged from 100 sub-samples drawn randomly from the text without replacement. D is then obtained through a curve-fitting procedure. The algorithm has been implemented in a C program called vocd, also described in McKee et al., which runs on UNIX, PC and Macintosh platforms as part of the CLAN suite of programs (MacWhinney, 2000).
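For illustration only, the transformations and the sampling-and-curve-fitting procedure just described can be sketched as follows. This is a rough approximation under stated assumptions, not the vocd program itself: the use of NumPy/SciPy, the curve_fit starting value and the helper names are my own choices.

```python
import random
import numpy as np
from scipy.optimize import curve_fit

def rttr(types, tokens):        # Guiraud's root type-token ratio
    return types / tokens ** 0.5

def bilog_ttr(types, tokens):   # Herdan's bilogarithmic TTR
    return np.log(types) / np.log(tokens)

def cttr(types, tokens):        # Carroll's corrected TTR, as described above
    return types / (2 * tokens ** 0.5)

def ttr_curve(n, d):
    # Malvern and Richards' model: TTR = (D/N) * ((1 + 2N/D)**0.5 - 1)
    return (d / n) * (np.sqrt(1 + 2 * n / d) - 1)

def estimate_d(tokens, n_min=35, n_max=50, trials=100):
    """Average empirical TTRs over random sub-samples of n_min..n_max tokens
    (drawn without replacement) and fit D by least squares."""
    sizes = np.arange(n_min, n_max + 1)
    mean_ttrs = []
    for n in sizes:
        ratios = [len(set(random.sample(tokens, n))) / n for _ in range(trials)]
        mean_ttrs.append(sum(ratios) / trials)
    (d_hat,), _ = curve_fit(ttr_curve, sizes, np.array(mean_ttrs), p0=[10.0])
    return d_hat
```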
D has been validated in a number of analyses on corpora containing data from first and foreign language learning and academic writing (Malvern and Richards, 2000). The current project seeks to extend the application of D to the analysis of first language development during the school years. A description of a study carried out as part of the project now follows. The study was concerned with discovering patterns in the development of vocabulary diversity in school children aged between 8 and 15.

9 Background to the study
There are well-established differences in the written language abilities of school-going children. These differences are found between and within age groups. Between-group differences suggest that language development continues during the school years, while within-group differences suggest individual differences in language development. It is interesting to find out how such differences might be related to objective measures of language development. Children's writing in England has been assessed by a government body called the Qualifications and Curriculum Authority. QCA mark schemes for the assessment of writing focus on three major aspects of written language: Purpose and Organisation, which is concerned with discourse-level aspects; Style, which is concerned with sentence structure and vocabulary; and Punctuation, which is concerned with the use of punctuation conventions and spelling. These criteria are applied by markers who give a global score for each of the three aspects. The three scores are then added up and used to classify each script in terms of eight levels of writing ability. The assessment is therefore qualitative in nature. However, QCA has recently developed a quantitative instrument consisting of a set of coding frameworks for writing. The frameworks are used by trained markers to count the frequencies of selected features in 100-word samples of scripts. These features include correct and incorrect uses of various punctuation marks, types of spelling errors, word tokens belonging to various word classes, subordinate and co-ordinate clauses and so on. The fact that the coding is done manually limits the size of the sample per child and the variety of features which can be studied. The current project grew out of an attempt by one of the authors to automate the coding process. The study reported here measured vocabulary diversity as a first step towards automated analysis.

10 Aims of the study
The primary aim was to analyse the lexical diversity of school children and gauge the extent to which it is sensitive to age and ability level.

11 Materials
899 narrative essays at least 50 words long were analysed. The essays were obtained from various schools in England. They cover a cross-section of three age groups referred to as Key Stages 1, 2 and 3 in the English education system (i.e. age groups 8, 11 and 14 years) and seven (out of a possible eight) levels of writing ability. All the students were asked to write a narrative essay beginning with the sentence 'The gate was always locked, but on that day it was open …'.

12 Data Preparation
Ten markers assigned a score to the scripts on the basis of National Curriculum Level descriptors (see QCA 2000 for instance). The markers were unaware of the age or ability level of the pupils. Scores for each essay were assigned separately by at least two markers and later averaged to obtain the final score.
In a few cases, scores from different markers diverged considerably and the final score was decided through negotiation. The final score was then used to assign each script to one of eight National Curriculum Levels of writing ability. A breakdown of the numbers of scripts in each Key Stage and Ability Level is shown in Table 1 in terms of the percentage of the total number of scripts in each category. The table needs some explanation. Data for each Key Stage is presented in a column which is further divided into two columns. The first of these columns, which should be read vertically, shows the percentage of pupils in that Key Stage who were assigned to the different levels. The second column, which should be read horizontally, shows the percentage of pupils in each Level who belonged to different Key Stages.

Table 1: % of scripts in each Key Stage and Level

                   Key Stage 1      Key Stage 2      Key Stage 3     % of all
                   % KS1  % Level   % KS2  % Level   % KS3  % Level   scripts
Level 1              11     92         1      8         0      0          4
Level 2              68     76        19     21         2      2         31
Level 3              19     30        30     47        15     23         23
Level 4               2      3        36     57        25     40         22
Level 5               0      0        13     33        27     68         12
Level 6               0      0         1      4        24     96          6
Level 7               0      0         0      0         7    100          2
% of all scripts     32      -        43      -        25      -        100

The graded scripts were converted into machine-readable form by a typist. It was necessary to correct spelling errors in order to prevent misspellings from being treated as different types and thereby inflating the type counts of poor spellers. Spelling errors were corrected using a special utility program. The procedure for correcting spelling errors was as follows. Firstly, a list of all the words in the scripts was compiled. Any words found in the list which were not also found in a dictionary list were considered as potential spelling errors by the program. It was then up to the human editor to decide if a specific word was indeed a spelling error and, if so, what the correct spelling ought to be. Instances of spelling error in the corpus were then sought, found and edited through a search-and-replace dialogue box. This method had the advantage of speed over manual correction involving reading through the scripts. In addition, the method provided assurance that all the spelling errors flagged by the program were accounted for. However, there are at least two disadvantages with the method. Firstly, cases where correctly spelled words were used incorrectly, such as homophones, were missed. However, the margin of error thus incurred was deemed acceptable. Secondly, the scripts were altered in such a way that the original spelling could not be recovered. Refinements will be made to the utility program in future to overcome both problems.

13 Analytical Procedure
Essays were analysed using vocd (Malvern and Richards, 2000) via the CLAN interface. Memory limitations in the software meant that the scripts could only be analysed in batches of fifty at a time. After all the scripts had been processed, the output files from vocd were concatenated into one file and another utility program was used to extract values of D for each essay and produce a spreadsheet of all the results.

14 Results
D values from all 899 scripts were subjected to a 2-way ANOVA with Key Stage and Level as the independent variables. Main effects were obtained for both Key Stage, F(2) = 221.8, p<0.001, and Level, F(2) = 92.965, p<0.001. Mean values of D are plotted in Figure 2 and standard deviations of the D values are shown in Table 2 by Key Stage, Level, by Key Stage collapsed over Level and by Level collapsed over Key Stage.
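Purely as an illustration of this analysis design (not the authors' actual scripts; the file name and column names below are hypothetical), the 2-way ANOVA and the post-hoc Tukey comparisons reported below could be run along the following lines with pandas and statsmodels.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Hypothetical spreadsheet of results: one row per essay, with its D value,
# Key Stage and National Curriculum Level.
scores = pd.read_csv("d_values.csv")

# Two-way ANOVA with Key Stage and Level as independent variables.
model = ols("D ~ C(KeyStage) * C(Level)", data=scores).fit()
print(sm.stats.anova_lm(model, typ=2))

# Post-hoc Tukey tests on each factor separately.
print(pairwise_tukeyhsd(scores["D"], scores["KeyStage"]))
print(pairwise_tukeyhsd(scores["D"], scores["Level"]))
```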
Figure 2: Mean values of D for each Key Stage (KS1, KS2, KS3) by Level (L1-L7).

There was no significant interaction between Key Stage and Level, F(7) = 1.344, p = 0.226, and Level appeared to have a greater effect size than Key Stage, judging from the mean square error generated by each factor: the mean square error due to Level = 419.866 while that due to Key Stage = 454.406. A surprising result was that Key Stage 2 consistently obtained higher mean values of D than Key Stage 3. However, the difference between Key Stage 2 and 3 was not significant. A post-hoc Tukey test shows that while there are significant differences between Key Stage 1 and Key Stage 2, p<0.001, and between Key Stage 1 and 3, p<0.001, there are no significant differences between Key Stage 2 and 3, p>0.5. Additionally, while there are significant differences at the p<0.01 level between each of Levels 1, 2, 3 and all other levels, there were no significant differences between Levels 4, 5, 6 and 7. (NB. There are only 3 data points contributing to the very high mean of D in Key Stage 2 Level 6 and this high value should not therefore be treated as a significant trend.)

Table 2: Standard deviations of D by Key Stage and Level

                   Key Stage 1   Key Stage 2   Key Stage 3   Level stdev
Level 1               12.376         8.006             -        12.351
Level 2               15.762        21.927         6.522        18.764
Level 3               16.426        20.833        15.809        20.448
Level 4               19.344        25.835        16.335        24.012
Level 5                    -        23.535        15.693        19.969
Level 6                    -        40.931        17.985        21.38
Level 7                    -             -        14.370        14.37
Key Stage stdev       16.517        25.866        17.753             -

15 Discussion
The results indicate that vocabulary diversity is significantly related to age and writing ability. Of the two factors, writing ability appears to have a greater effect size. Rather surprisingly, KS2 obtained higher scores of diversity than KS3. The reasons for this are not clear. One possibility is a dip in performance which is known to occur at Key Stage 3. This dip has been variously ascribed to the trauma of moving from primary school to secondary school or to maturational influences. However, there might also be a linguistic explanation. It may be that older pupils produce more coherent discourse which requires a greater level of lexical repetition and therefore results in lower diversity scores. If that possibility were true, then it might indicate a limit in the sensitivity of vocabulary diversity as a measure of writing ability. However, these comments are all purely speculative. To explain the result more fully, it is necessary at least to analyse Key Stage 4 scripts and obtain a wider view of the trend. The effects of writing ability and age do not in themselves discriminate between different theories of language development, since few would deny that vocabulary grows during the school years and that rates of growth may differ across individuals. However, these effects do raise significant questions for deeper investigation. It has been proposed that vocabulary growth during the school years can be accounted for in terms of the development of derivational morphology (Anglin, 1993). By analysing the co-occurrence of morphologically related forms in a text, a future study might determine whether increasing knowledge of derivational morphology accounts for the observed increases in vocabulary diversity.
If lexical diversity is found to be an effect of morphological productivity, then the observed effects of age and writing ability would suggest that morphological knowledge grows gradually during the school years but at different rates for different individuals. That observation might have both theoretical and pedagogical implications which would need to be explored. A second question for further research is also suggested. Inasmuch as considerable grammatical information is now stated in lexical terms, current linguistic theories posit a close relationship between the lexicon and sentence-level grammar. This relationship is not a purely descriptive artefact. Bates and Goodman (1997) have shown that there is a close relationship between lexical and grammatical development in first language acquisition, aphasia and on-line sentence processing. It has also been shown, by means of both corpus and on-line sentence processing data, that syntactic structures are often intimately associated with specific lexical properties (e.g. MacDonald, 1994 and Trueswell, 1996). The question which arises is whether the differences in lexical diversity reported here might be related in some way to differences in sentence-level grammatical development. A syntactically annotated corpus would make it possible to address the question. If such a relationship does exist, there would be theoretical and pedagogical implications to explore.

16 Conclusion
The paper has shown that the study of language development during the school years is a rich area for corpus-based investigation. It can help to discover objective developmental trends and it can also yield results of potential pedagogical value. It also offers methodological challenges for the corpus-based approach. It was shown how an apparently simple measure, vocabulary diversity, requires rather sophisticated modelling. The use of mathematical modelling to facilitate quantitative analysis of corpora will no doubt increase as linguistics develops further as a quantitative discipline. The question arises whether the model described here can be generalised to the study of diversity in forms of language other than vocabulary. For instance, syntactic annotation would allow the measurement of syntactic diversity while pragmatic annotation would allow the measurement of the diversity of pragmatic functions in a text. If these kinds of analysis yield useful results, the analysis of diversity could well provide a general method of measuring linguistic productivity for various pedagogical, clinical and other applications.

References
Anglin J 1993 Vocabulary development: a morphological analysis. Monographs of the Society for Research in Child Development, Serial No. 238, 58(10).
Barnes S, Gutfreund M, Satterly D, Wells CG 1983 Characteristics of adult speech which predict children's language development. Journal of Child Language 3(10): 65-84.
Bates E, Goodman J 1997 On the inseparability of grammar and the lexicon: Evidence from acquisition, aphasia and real-time processing. In Altmann G (ed.), Special issue on the lexicon, Language and Cognitive Processes 12(5/6): 507-586.
Bennett-Kastor T 1988 Analyzing children's language: methods and theories. Oxford, Blackwell.
Biber D 1988 Variation across speech and writing. Cambridge, CUP.
Carroll J 1964 Language and thought. Englewood Cliffs, Prentice-Hall.
Chipere N (in press) Variations in Native Speaker Competence: Implications for First Language Teaching. Language Awareness.
Chipere N 1999 Processing Embedding in Humans and Connectionist Models. Unpublished PhD thesis, Cambridge University.
Chotlos J 1944 Studies in language behavior: IV. A statistical and comparative analysis of individual written language samples. Psychological Monographs 56: 75-111.
Fletcher P, Peters J 1984 Characterizing language impairment in children: an exploratory study. Language Testing 1: 33-49.
Guiraud P 1960 Problèmes et méthodes de la statistique linguistique. Dordrecht, D. Reidel.
Grabe P, Biber D 1987 Freshman student writing and the contrastive rhetoric hypothesis. Paper presented at SLRF 7, University of Southern California. Cited in Biber D 1988 Variation across speech and writing. Cambridge, CUP.
Hayes D, Ahrens M 1988 Vocabulary simplification for children. Journal of Child Language 15(2): 395-410.
Herdan G 1960 Type-token mathematics: A textbook of mathematical linguistics. The Hague, Mouton.
Hess C, Sefton K, Landry R 1986 Sample size and type-token ratios for oral language of preschool children. Journal of Speech and Hearing Research 32: 536-40.
Johnson W 1944 Studies in language behavior: I. A program of research. Psychological Monographs 56: 1-15.
Karmiloff-Smith A 1986 Some fundamental aspects of language development after five. In Fletcher P, Garman M (eds) Language Acquisition. Cambridge, CUP, pp 455-474.
Klee T 1992 Developmental and diagnostic characteristics of quantitative measures of children's language production. Topics in Language Disorders 12(2): 28-41.
MacDonald M 1994 Probabilistic Constraints and Syntactic Ambiguity Resolution. Language and Cognitive Processes 9: 157-201.
McKee G, Malvern D, Richards B 2000 Measuring vocabulary diversity using dedicated software. Literary and Linguistic Computing 15(3): 323-338.
MacWhinney B 2000 The CHILDES Project Vol 1: Tools for Analysing Talk – Transcription Format and Programs. New Jersey, Lawrence Erlbaum.
Malvern D, Richards B 2000 A new method of measuring lexical diversity in texts and conversations. TEANGA 19: 1-12.
Marcus G 1993 Negative evidence in language acquisition. Cognition 46: 53-85.
Perera K 1986 Language acquisition and writing. In Fletcher P, Garman M (eds) Language Acquisition. Cambridge, CUP, pp 494-518.
Pinker S 1994 The language instinct: How the mind creates language. New York, Morrow.
QCA 2000 English Tests: Mark Schemes. London, HMSO.
Richards B, Malvern D 1997 Quantifying lexical diversity in the study of language development. The University of Reading, The New Bulmershe Papers.
Richards B 1987 Type/token ratios: what do they really tell us? Journal of Child Language 14: 201-9.
Salzinger K 1975 Are Theories of Competence Necessary? In Aaronson D, Rieber R (eds) Developmental Psycholinguistics and Communication Disorders. Annals of the New York Academy of Sciences 263: 178-196.
Sampson G 1999 Educating Eve. London, Continuum International.
Snow C 1996 Change in child language and child linguists. In Coleman H, Cameron L (eds) Change and language. Clevedon, BAAL in association with Multilingual Matters, pp 75-88.
Stewig J 1994 First graders talk about paintings. Journal of Educational Research 87(5): 309-16.
Stickler K 1987 Guide to analysis of language transcripts. Eau Claire, Thinking Publications.
Trueswell J 1996 The role of lexical frequency in syntactic ambiguity resolution. Journal of Memory and Language 35: 566-585.
van der Geest T, Gerstel R, Appel R, Tervoort B 1973 The child's communicative competence: language capacity in three groups of children from different social classes. The Hague, Mouton.
Wachal R, Spreen O 1973 Some measures of lexical diversity in aphasic and normal language performance. Language and Speech 16: 169-81.

Approaching irony in corpora
Claudia Claridge (Greifswald)

1. The problem of irony
1.1 In pluribus unum?
What actually is irony? We all usually recognize an ironic utterance when we come across one, but explaining why we do that and how irony works is not easy. The English lexeme "irony" is polysemic (cf. OED), covering among others the figure-of-speech sense, irony of situation and Socratic irony (pretense, dissimulation). Conversational irony is also a very varied phenomenon: its force ranges from comic irony to more offensive sarcasm (Leech 1983: 143) and it can be realized in many different surface forms, which means that practically any utterance (depending on the context) can be used for the purpose of irony. This makes it interesting, but also hard to grasp, in particular for corpus linguists. It has long been realized that the traditional definition of irony as an utterance "in which the intended meaning is the opposite of that expressed by the words used" (OED) is inappropriate for characterizing, let alone explaining, the concept. Let me illustrate this with some examples taken from Sperber/Wilson (1981: 300), set in the context of two people caught in a downpour, one of them producing one of the following utterances:
(1) What lovely weather.
(2) It seems to be raining.
(3) I'm glad we didn't bother to bring an umbrella.
(4) Did you remember to water the flowers?
Only (1) and (3) can be seen to express a meaning opposite to the one really intended or to the actual state of affairs (itself an important distinction). However, (2) and (4) in their literal sense are highly irrelevant remarks in the given context and thus force an ironic interpretation.1 But their literal meaning is not thereby cancelled, demonstrating that irony is not a clear example of non-literal language (such as metaphor). (2) achieves its inappropriateness both by stating the blindingly obvious and by doing it through understatement. In fact, irony is parasitic on other rhetorical means, also using hyperbole, rhetorical questions, metaphors, paradoxes, shifts of register/style (e.g. excessive politeness) etc. for its effect (Barbe 1995, Leech 1983: 82). Thus, these phenomena can serve as irony signals in the same way as facial expressions, intonation, laughing etc. sometimes do. Yet irony signals themselves are not constitutive of irony (Lapp 1992: 29): utterances without any such clear signals can be ironic (cf. 4), while utterances containing them can be absolutely serious. Some (few?) instances of irony, however, signal themselves rather clearly by virtue of being lexicalised, e.g. a fine friend (called common irony by Barbe). Moreover, ironic inconsistency can formally manifest itself on several levels, being inherent in one word only (cf. lovely in (1)), in the whole statement/utterance (cf. (4)), and/or in the general incompatibility of statement and context (cf. Barbe 1995: 29). Both the speaker's and the hearer's perspective have to be taken into account in a discussion of ironic utterances. On the one hand, ironic intention is part of the speaker meaning; on the other, the hearer has to recognize the irony for it to be successful.
It is important to keep these two aspects apart, because of the fact of unintentional irony (Barbe 1995: 78, Hamamoto 1998: 262), which is inferred by the hearer. It may be that this inference is connected to the irony-of-situation meaning. I will not answer here the question posed at the beginning (which would be most daring after such a brief discussion!), but what seems to apply to all instances of irony is that it points to some inconsistency or incongruity which the hearer is expected to notice, this inconsistency being located either on the level of the utterance or on the level of both utterance and situation (Barbe 1995: 16). This inconsistency can and often does, but need not, consist in an opposition between the literal and some other meaning. All in all, irony may be seen as a polysemous concept, with a range of more or less prototypical realizations and various meanings or functions. 1.2 The importance of being ironic People must be using irony for some purpose that cannot be served equally well by non-ironic expressions. There must be a clear gain through surplus meaning to outweigh the riskiness of the strategy (namely that the irony, and thus the real force of the utterance, may not be understood). 1 I will ignore Sperber/Wilson's interpretation of these examples as echoic for the moment. 135 Basically, irony is or transports an attitude, in the majority of cases a negative one. Ironic utterances are (proto)typically used to convey some criticism on the part of the speaker (Levinson 1983: 161), if in an indirect way. Dews et al. (1995) have experimentally uncovered the following reasons for selecting irony: (i) to be funny, (ii) to soften the edge of an insult, (iii) to show oneself in control of one's emotions, and (iv) to avoid damaging one's relationship with the addressee. While (ii)- (iv) all perform face-saving functions for either the speaker, or the hearer, or both, (i) highlights the strong connection of irony to humour in general (cf. Attardo 1994: 7 for a placement of irony within the semantic field of humour). Of course, jocular language can also be seen in face, i.e. politeness, contexts (cf. Brown/Levinson 1987). However, I think the underlying connection between (i) and the other functions is rather to be found in the aggressive element inherent in humour and laughter, which links up with the FTA implicit in (ii)-(iv). Thus, a certain balance between aggression and social control seems to be the hallmark of irony. Depending on the cultural context, the general balance will be tilted more towards one or the other of these poles. Sarcasm, which has to be seen as a type of irony, sacrifices social control for effective aggressive attack – indicating that restricting irony's function to face-saving might not be appropriate. In more concrete terms, ironic utterances can take the following forms. A very common form is saying something positive to express a negative attitude, i.e. to voice some criticism. Opposite meaning is not the only method for encoding ironic criticism, however (cf. the examples above). On the other hand, there are also ironic compliments (Dews et al. 1995: 348), in which a negative statement is used to express a positive value judgement (praise-by-blame). This makes the (supposed) compliment at the very least ambivalent, as it colours it with some negative touch (e.g. envy); some kind of unexpressed criticism is hidden within the compliment. Ironic utterances can be directed at (i.e. 
take as their victim) the hearer(s), some third (absent) party, the speaker him-/herself or the general situation. The latter case seems to favour the humorous function of irony. Speaker-directed irony can be seen as a means of emotional release. 1.3 Theoretical treatments of irony This is not the place to go into detail about theories of irony, but some points should be mentioned briefly. Most modern theories have tried to move away (more or less successfully) from traditional rhetoric's characterization of irony as meaning something different to what is said, in particular the opposite of what is said (i.e. substitution theories) (cf. Lapp 1992). Pragmatic treatments promise to be the most successful approaches, as irony is after all a conversational phenomenon constituting meaning in context and in interaction. Grice (1975: 312) described irony as flouting the first maxim of Quality, thereby producing the intended meaning by conversational implicature. Sperber/Wilson (1981), finding fault with Grice's approach, propose the interpretation of ironic utterances as echoic mentions of previous propositions, adding the speaker's evaluation of the latter. The sources of the echo are extremely varied and thus vague – actual utterances, thoughts, opinions, real or imagined sources etc. (Sperber/Wilson 1981: 310) – presenting a problem for the usefulness of the theory in the analysis of real data (and also of its explanatory power). The pretense theory of irony (Clark/Gerrig 1984) assumes that the ironic speaker is pretending to be someone else who might hold the opinion expressed in the ironic utterance, mocking both the opinion and the person it is attributed to. Both of the latter theories foreground the attitude of and evaluations made by the speaker (the psychological aspect), which is certainly very important for irony. As hinted at above, irony has also been seen in the context of politeness in language. Brown/Levinson (1987:221ff) list irony as one of their off-record strategies, giving it a face-saving function. Leech (1983) even proposes an Irony Principle (“If you must cause offence, at least do so in a way which doesn't overtly conflict with the P[oliteness]P[rinciple], but allows the hearer to arrive at the offensive point of your remark indirectly, by way of implicature”, p. 82), which helps to avoid open conflict and thus rescues the Politeness Principle on a different level. The last two theories emphasize more the social functions of irony. Analysing irony in pragmatic terms of course presupposes the existence of data, which leads us to the next problem. 2. The problem of data What strikes one when looking through many publications on irony is, on the one hand, the lack or rarity of authentic data, and, on the other, the little linguistic context given for the cited examples. Data found in the literature generally falls into three classes: 136 (i) Invented instances of ironic utterances, based on the researcher's intuition of what are good examples of irony. This is especially common in approaches in some way connected to the Gricean paradigm. (ii) Experimental data: psychological/psycholinguistic results concerning the processing, understanding and evaluation of irony. Experiments are usually based on invented examples, cf. (i). (iii) Authentic data sampled by tape recording or, in some cases, taken down as notes by the author. Each of these has its problems. 
As we have seen above, irony is a rather multi-faceted phenomenon and it is not likely that one researcher would come up with a sufficient range of ironic instances to mirror real-life variety. Individual differences of irony perception, or ironic preferences, will also play a role here. Thus, approach (i) is rather narrow and may not be adequate for supporting a comprehensive theory of irony. In so far as (ii) is based on data of the first kind, it shares the problems mentioned above. However, the resulting experimental data is certainly very important and not invalidated by its restricted basis. The third approach, in particular if pursued within a conversation analysis context, seems much better fitted to capture the concept of irony in all its variety. The problems here are that this method might not produce enough data (in terms of sheer frequency) and that the resulting data might not have a wide enough range (especially in the sociolinguistic sense). The latter concern is based on the fact that one will usually ask friends and (good) acquaintances for permission to make tape recordings (i.e. people of the researcher's own social and educational background)2 and, potentially more disturbing, one might subconsciously prefer to approach those people who have a certain reputation for irony (to increase the likelihood of sufficient data). The second problem is the question of the contextual embedding of the examples cited in the literature. This is a serious problem, in view of the strong contextual dependence of irony. A short, one- to two-sentence-long description of the larger extralinguistic context is usually given. As to the linguistic context, it is often only the ironic utterance as such that is provided; where there is more, it is usually the preceding utterance which triggers the ironic retort. The larger conversational context, in particular the response to the ironic utterance3, is notably absent. As important functions of irony include criticism on the one hand and getting the criticism across in a face-saving manner on the other, the response would seem to be an important part of the data. Furthermore, a larger preceding context might be interesting with respect to the echoic-mention theory, at least in so far as the echo can relate to an actual utterance not too far removed in time. Basically, I think the most promising approach would be a mixture of several approaches, combining data collection and psycholinguistic methods (experiments, questionnaires).4 A clear emphasis should be placed on a large amount and a wide range of authentic data, however. While the researcher's own recordings and field notes are valuable (partly because of better access to contextual knowledge), it would be a shame to ignore the opportunity offered by corpora to broaden our picture of irony in use, with regard both to the possible variety of ironic expressions and to the social spread of users of irony.

3. Irony and corpora
In 1996 McEnery/Wilson (98f) stated that there had been relatively little corpus-based research in pragmatics and discourse analysis so far, something which seems to me still to hold true today. This is doubtless due to the fact that corpus-linguistic methods and pragmatic problems do not on the whole meet easily: it is hard to have the computer automatically search and find a phenomenon that does not have a (range of) corresponding surface structure(s).

2 The largest-scale study that I am aware of in this respect is Hartung (1998) on irony in conversational German. He used tape recordings of 14 conversations with a total length of 18.5 hours, which yielded the remarkable number of 302 instances of irony. The recorded participants were all young adults with a middle-class background and university education (ibid. p. 63-69).
3 Sometimes this is added in the explanation in the surrounding text.
4 It might also be a good idea to feed authentic examples into these experimental methods.
5 A partial pre-publication distributed by the Linguistic Data Consortium. The WSC is published on the ICAME CD.
Irony is no exception here: as stated above, almost any utterance can be used ironically and there are no unambiguous irony markers. Nevertheless it is possible to find irony in corpora and some methods together with their results will be presented below. As I am interested in irony as a (mostly) conversational phenomenon, I will restrict myself to spoken corpus data here. I will draw on the spoken part of the BNC, the Wellington Spoken Corpus (WSC) and the Santa Barbara Corpus of Spoken American English (SBC, part of ICE-USA)5 in order to cover several varieties of English. 2 The largest-scale study that I am aware of in this respect is Hartung (1998) on irony in conversational German. He used tape recordings of 14 conversations with a total length of 18.5 hours, which yielded the amazing amount of 302 instances of irony. The recorded participants were all young adults with a middle-class background and university education (ibid. p. 63-69). 3 Sometimes this is added in the explanation in the surrounding text. 4 It might also be a good idea to feed authentic examples into these experimental methods. 5 A partial pre-publication distributed by the Linguistic Data Consortium. The WSC is published on the ICAME CD. 137 3.1 “Explicit” irony The expression “explicit irony” is taken from Barbe (1995), who uses it to describe phrases such as “it is ironic that …” when employed by writers (of letters to the editor, in her data) to comment on some fact or course of events. She hypothesizes that these uses are based on the writers’ understanding of implicit irony (ibid. 142) and thus can help find out about people's concept of irony. This might in fact be so, and looking for expressions including terms like ironic in spoken contexts might also lead one to situations where irony is actually being employed. The presence of such a metacomment might even mean that something in the communication is going ‘wrong', which is potentially interesting, as understanding irony can be a problem. Thus I looked for the following words in the spoken corpora: irony/ies, ironic, ironical, ironically, sarcasm, sarcastic, sarcastically. While the SBC yielded no instances6, there were 11 in the WSC and 114 in the BNC, on the whole more from the irony-field than from the sarcasm-field. Irony, ironical and ironically are most commonly used in the irony-of-situation meaning similar to Barbe's examples and as in (5): (5) The problem is, of course, that we the irony is that we are now in a period where we have a much bigger potential workforce who we are not employing as we might. (BNC KRE) This is a sense that seems to be rather salient in an English, or at least British, context. In contrast to Barbe, however, I doubt whether that particular usage tells us anything much about people's understanding of verbal irony. Occurrences where people comment on verbal irony are therefore potentially more enlightening. (6) is about the inappropriateness of irony in some situations or with some kinds of people, here by a café manager (AU) to her customers. 
(6) BY: <[>i don't <.>see well you say you're nice to them i can't imagine you being nasty with them AU: well short or abrupt or something or just not NICE not smiling <,,> BY: exhales yeah <,> i suppose it's times like that you just <.>even you can't even be ironic with them can you you <.>can't AU: yeah BY: you can't even <.>say AU: oh well there's <{><[>SO much work to do BY: <[>god i'm grumpy today you can't even say <,> <{><[>that's all (WSC dpc207) I am not quite sure whether the third statement after the ironic instance (italicised) is intended as an example of an ironic remark. An instance similar to (6) is presented by a lecturer who had to restrain her “sarcastic tongue” because her “students cannot handle it because it's a different kind of situation there are different vulnerabilities going on in that situation so i had to learn to challenge students in ways that were constructive” (WSC mul030). If she was talking about school students, especially very young ones, that would not be surprising, but she seems in fact to be talking about university students. Another instance clearly mentions the criticising function of irony or sarcasm: (7) DA: it's taken quite a long time to get there but they seem to be heading in the right direction at least i like to think that a couple of sarcastic references er from various people of this university actually helped them along the right way pointed out the errors of their ways i <.>the i think they've been getting it from quite a number of different sides actually (WSC dgz064) In contrast to ironic(ally), sarcastic(ally) was found in contexts where verbal irony was actually present or talked about. This might indicate that sarcasm (being more aggressive) is the more salient type of irony and that as a consequence the word might be extended to cover all forms of irony, including the less critical ones. The examples could lead one to the latter conclusion, as the comments referred to often do not sound very aggressive. Sarcastic is used to report other people's use of irony (8) or to (meta)comment on one's own use of it, as in (9), where it refers to an incongruity between a statement and the actual state of affairs. (8) There was as much in the loft as there was on the whole first floor. My the boss said a few things under his breath, came down the steps and we went back into the van. Cos we had to repack it you see because it was going to take a-- it was going to take probably not it was going to go right to the end with this lot on. So up it went higher. Not the six foot he'd thought, seven or eight foot we went up and started to pack away again. A little bit sarcastically my boss said er to the owners There's nothing else that we haven't seen is there? . Well have you seen the gar--? Well your wife said there w-- Oh well you'd better come and have a look, you see. (BNC KNC) 6 This is not surprising, as it is very small with only 14 conversations. 138 (9) Ken (…) By the way er we may er be gratified to know that erm thankfully Labour lost the ninety eighty seven election but in, in it's manifesto in nineteen eighty seven Labour proposed annual elections in local government. PS000 Hm. Ken don't recall, well in fact I'm being sarcastic because I know for a, we know that it wasn't in the manifesto for nineteen ninety two. What the impact of annual elections for local government would be on turn out is difficult to say. I suspect disastrous. 
(BNC F7T) Cases of (potential) misunderstanding are another environment for sarcastic, as is visible in (10), where a statement is first taken as ironic by the hearer (cf. unintentional irony), but then the misunderstanding is cleared up. (10) Caroline Don't walk away cos I'm connected to you. Okay we're going to <-|-> canteen <-|->. Lyne <-|-> You know what you have to <-|-> do. Caroline And food. Lucy's looking very nice in a very nice skirt today. Lyne You being sarcastic? Caroline No I she's looking very nice in a nice skirt. No I like it so I said it. Lyne Oh. (BNC KP3) Caroline's supposed ironic statement in fact contains possible irony markers that might have led Lyne to that interpretation, namely the great positive emphasis (twice very nice) and the laugh following her statement. If Caroline is honest in her denial, this is an example of irony markers being present in non-ironic statements. Of course, Lyne may also have used mutual knowledge (Caroline's usual assessment of Lucy, previous talks about Lucy) in reaching her conclusion; however, this is not accessible via the corpus data. Note that this is an example where missing following context might enforce an ironic meaning that is not there. Finally, a rather unfortunate fact has to be mentioned. It is often the case in the BNC (as in (11)) that -markup precedes or follows (or both) the mention of irony/sarcasm, so that it is impossible to put any sensible interpretation on these instances.7 (11) seems to present a harmless, jocular context between friends, in that case contrasting with the offensive meaning of sarcastic. (11) Cassie Erm what is this tape? Have you got band practice tonight then Dan? PS6U1 Cassie Don't be sarcastic with me matey. PS6U1 Give me a kiss. Cassie Are you alright? PS6U1 Yeah, think so. (BNC KP4) 3.2 Commonly used ironic expressions Certain words/expressions are more likely to occur in ironic remarks than others. Kreuz/Roberts (1995: 25), for example, have drawn up a “random irony generator” consisting of hyperbolic combinations of adverbs such as absolutely, certainly, perfectly, really with “extreme positive adjectives” like adorable, brilliant, gorgeous, magnificent, the best … in my life etc. If one randomly asks native speakers for their intuition on which words are often used ironically, they suggest expressions like great, how funny, big deal, haha, you don't say etc. - again positive terms in the majority. Seto (1998: 244ff), who assumes that a semantic reversal from an overdose of positive meaning to the negative opposite is taking place, also lists some linguistic devices that are typical of irony: single words such as genius, or miracle, modifiers/intensifiers (real, nice, such), superlatives, exclamations, focus topicalisation and excessive politeness. These examples obviously reflect the most typical or most easily noticed type of irony, that of blame by praise/saying something positive to convey criticism. The examples just mentioned partly overlap with Barbe's (1995: 22ff) concept of “common irony” for phrases which always, even out of context, induce an ironic interpretation, e.g. real winner, fine friend, likely. In contrast, “nonce irony” (ibid. 18ff) represents original creative instances of irony which are not in habitual use. 
The existence of such commonly used ironic expressions of course makes it possible to search for these in corpora, using the examples found in the literature, examples provided by asking native speakers, and words/phrase marked as potentially ironic in dictionaries. An OED search turned up the following as labelled “frequently ironic/sarcastic” or the like: (12) ironic: that's a laugh, knight in shining armour, what's the big idea?, Great British people, it's a great life, mein Herr, Merrie England(er), outpost, prop (up), go/pass etc. to one's reward, royal, sage, sir, sweetness and light, I will thank you 7 This problem also applies to examples of the type discussed in 3.2. 139 to do so-and-so, elegant variation, to have a good etc. war, in his/her etc. wisdom, wise guy, wonders will never cease, yet (clause-final), favour sarcastic: erudite, fine gentleman, firm (n.), genteel, martyr, merciful, oh-so, pain, sacred I decided to do pilot searches using some of the terms mentioned above, concentrating mainly on the overly positive ones. From the OED list I only checked clause-final yet and oh(-)so, as most of the others did not seem very relevant for spoken contexts. None of these two has yielded any ironic results so far, though there are so many instances in the BNC that I would not want to be definite on this point at the moment. The often quoted fine friend also yielded no instances in any of the corpora; there were a few possibly ironic examples, such as doing a fine job (BNC KCN), be on (sic) fine form (BNC KDA), but on the whole the corpus evidence has not pointed to an ironic specialization of fine so far. Let me now discuss a few of the examples found. (13) is taken from a conversation between Susan and Anne (colleagues) about the state of the Royal family and the way they are treated by the press. It is an example of word level irony (great), which is supported by its cotext, the conglomerate of positive (and somewhat belittling) descriptive terms, the short form of the name and the superfluous information about Philip's marital status. The irony, which has a benevolent mocking character (and is after all not directed at anybody present in the situation), lies in the incongruity between the epithet and the general public evaluation of Prince Philip. (13) Susan: … but this Royal Family, I E the, the Royal Family with which I grew up and Anne did were really sweet nice little Windsors who behaved themselves and that was what was, went into our psychic and there was the odd crack about Phil the Great who's the Queen's husband, you know and how he perhaps had an eye for the ladies, but there was never any photographs of him being or any evidence that it might have gone further than that (BNC J40) In contrast, (14) represents a more confrontational form of irony, with 12-year-old Paul criticizing his parents, Kevin and Ruth, for not being very helpful. Several features contribute to the ironic effect: individual positive words (fantastic, great, lot), the exclamation, and the focus topicalization (cf. Seto 1998). Paul's mother is not impressed by the criticism and Paul loses the ensuing short argument. (14) Kevin You might get an extra merit mark <-|-> if you do an extra session. <-|-> Ruth <-|-> I don't really know. <-|-> Paul I don't! Ruth I don't really know. Paul Oh fantastic! That's a great lot of <-|-> help you are. <-|-> Ruth <-|-> It'll do <-|-> you good to do G. Paul It won't do me good to do pipsqueak questions. Ruth If they're pipsqueak why are you asking me? 
Paul I'm not Two A three. Two A means two times two is four. (BNC KD0) In (15), with the same participants as (14), a quasi lexicalised ironic expression is used, i.e. an example of common irony (Barbe 1995). Evaluative big deal is practically not used in a positive sense any more. Such specialization (and/or overuse) can weaken the ironic force of the utterance (which might be increased again by repetition, an irony marker, cf. big, big deal in BNC KPV). While the criticism here is probably not very strong, it leads (after the comment by third party Kevin) to a first, unfortunately response, and then to a justificatory response by the “victim” of irony, Paul. (15) Ruth Well they'll have to be equal prizes wouldn't they? Paul I've got it! Whichever team wins th-- the cha-- children can give out the Christmas presents. Ruth What Christmas presents? Paul Those presents up there. Ruth Big deal! Kevin Can't fool you. Paul <-|-> <-|-> Ruth <-|-> Well I've <-|-> never <-|-> prize <-|-> Paul <-|-> It's only <-|-> it's only fun, I mean what are you gonna give them? The team that wins <-|-> gets their presents. <-|-> (BNC KD0) The next two examples have exclamatory structure (what a …, cf. also how …), which is also a common ironic device. The whole formulation of the utterance in (16) contributes to the ironic effect (especially brave, let loose) as does the situational inappropriateness: nowadays, it tends to be the press that is let loose on other people, and the trustees of the Tyneside Cinema Board are probably rather “harmless” people, more likely to be attacked by the press than the other way round. The trustees apparently do not fancy the task required of them very much, and Peter (the chief executive) thus feels it necessary to make conciliatory remarks in response. 140 (16) Peter Er it will be at ten. Ten till two or something like that. It's also very good if board members can attend to er not merely to support the staff er and to celebrate the event but also, if necessary, to talk to the press or to er engage er with guests and so on and so forth. Er it would be much appreciated if you were able to attend. Roger Any other other business? Colin What a brave person you are to let a board of trustees loose on the press. Peter Well we we we'll have an army of people to stand by you and <-|-> guide you and <-|-> PS000 <-|-> <-|-> Peter nudge you should you say anything. (BNC F7A) In (17), a conversation between two young women, the exclamation form is combined with word level irony and an imitated accent, which is additionally intended to mock the absent victim of the irony. AL's irony sounds somewhat bitter, as she has apparently been criticized before for not doing well certain sales behaviour that the American is boasting of. 
(17) AL: no i was talking to this guy um who's from knox BQ: <[1>laughs <[2>yeah mhm AL: um mark and he's um living in wellington for the summer cos he's being a lawyer BQ: oh yeah AL: working as a summer law clerk lucky bastard anyway he um he used to work at tauranga mcdonald's and like he was a crew trainer and assistant this and worked his way up you know wow what a hero and um he said that he used <.>to cos i was talking about that suggestive selling you know how they always used to tell me off for not being suggestive enough and um he said he used to go to people um when they bought like a small coke he'd say um would you like a twenty four pack of chicken mcnuggets with that or would you like eight big macs and four cheeseburgers with that laughs and just take the piss which i think is really good laughs (WSC DPF028) Searches for particular terms were often not successful with respect to these terms themselves being used ironically, but turned up other ironic material in their vicinity, thus broadening the scope of findings. In a House of Commons debate a certain Wilson MP produces the following statement (18), which makes a very ironic impression. The irony lies on the utterance and the contextual level, with the speaker alluding to irony of situation, namely the absurd fact that while attempts at reducing regulations are being pursued, new ones are being passed all the time. Through his choice of words and register (great enthusiasm, diving, anxious, merrily getting on with it) Wilson is also poking gentle fun at his colleagues. (18) Wilson here was that er we're debating banks and banking regulations building soc-- soc-- society orders, auditors regulations and financial services rules. Madam Deputy Speaker I didn't mention those at all in the speech I gave last week which was so warmly received by the house. Here are all my colleagues rushing upstairs with great enthusiasm, diving into the committee room, anxious to get on to curb the ever growing number of rules and regulations and whilst they're upstairs merrily getting on with it, here we are downstairs passing more things which we say we don't want to do. PS000 Will the honourable member give way? Wilson Yes. (BNC JSF) The next example is interesting, as it sounds to me like a good example of the echoic-mention theory. The bold-faced statement by Ian might be taken to echo what people (parents?, teachers?) said about himself and Edmund in their youth, and which in the meantime has been proved wrong (cf. the italicised statements further down). Ian is therefore criticizing (albeit in a good-humoured way) those people who did not believe in their abilities in the past through the contrast between beliefs and the actual state of affairs, while also putting himself in a good light. The absence of any linguistic indicators of irony in this ironic utterance is noticeable, with the accompanying laughter8 the only sign of a non-serious intention. (19) Ian <-|-> Yeah. <-|-> Well i-- i-- I always say that er it always comes a surprise that I was good at anything. And I always think back to the er the time I think Edmund and I were sitting in the back garden here and deciding that we'd go off and join the paratroopers Noel <-|-> <-|-> Enid <-|-> Really? <-|-> Ian because we weren't going to get any O levels. Enid Oh! Ian And that was about all we were good for . Enid I'm <-|-> sorry. <-|-> Noel <-|-> <-|-> Enid How funny. <-|-> But in fact you did get your O levels didn't you? 
<-|-> 8 It is striking how often mark-up for laughing is present around ironic utterances. This might be a further way of looking for irony. 141 Ian <-|-> So I remember well I got a handful <-|-> Enid Yes I <-|-> remember <-|-> Ian <-|-> <-|-> Enid Mm. Noel And Edmund got a BA. (BNC KC0) My last example is an instance of self-directed irony, i.e. the speaker ‘criticizing’ himself, here in the presence of others. The speaker at a church funds meeting makes a little grammatical mistake and proceeds to apologize for this in a very marked fashion, with hyperbole and borrowings from the religious register. Of course, the remark is rather jocular, but statements of this kind can function as a sort of preventive self-defence, pre-empting criticism from others. This might also be a case of Dews et al.'s function of the speaker showing himself to be in control (cf. 1.2 above), not of his emotions but of the social situation after a faux pas. (20) F87PS000 (…) Moderator I've asked for this new clause to come in immediately after seven because of what in accepting seven we've just done. (…) Now let me read the words for the s-- benefit of those who don't have them add new section eight and rem-- renumber eight urge his majesty's government to give continuing and careful concern to the many situations in which lack of financial resources are still causing elderly people grave hardship. Having read that I immediately repent me in dust and ashes for having committed a dreadful grammatical error. I was so caught up in my plurals or situations in hardship that I didn't notice that the subject in more senses than one is a singular lack, and the verb should be is and not are, therefore I must ask the indulgence of the general assembly to change the verb. (BNC F87) 3.3 Unusual collocations Louw (1993: 163) has argued that violations of semantic prosodies point either to the presence of irony or to the suspicion that the author might be insincere. One of his examples is “bent on selfimprovement” (David Lodge), contrasting with the fact that its usual collocates are overwhelmingly unpleasant or outright negative (e.g. mischief, resigning, destroying) (ibid. 164ff).9 Thus, instances of unusual collocations might be another way to unearth irony in corpora. However, Louw also points to problems posed by this approach, namely that semantic prosodies might be difficult to access through human intuition and, therefore, not that easy to look for and find in corpora (172f). This is a valid point, as I found out all too readily. There are not that many terms with clear prosodies that immediately spring to mind. Commit is one such clear example, as it has in one of its uses (transitive and without following to) a wonderfully unambiguous negative semantic prosody, being followed by such words as suicide, crime, atrocity, adultery, offence, arson, robbery10 etc. One might then assume that if it occurred with a positively evaluated object (such as favour), this might in fact produce irony. However, none of the spoken corpora yielded any unusual combinations for this word. Another search with set in and set about, for which Sinclair (1991: 74ff) had found rather clear patterns, was also unsuccessful. I then decided on a slightly more systematic, though still random, approach. I checked through the entries of the BBI Dictionary beginning with the letters f, g, h, i, j, k, l, m (random choice) to find promising candidates for a corpus search. Most searches (e.g. 
honour (v.), hark, flood (v.), frugal, languish, loyal, interlude, forgive) did not yield any results,11 either because the semantic prosodies were not pronounced/unambiguous enough in the first place or because there were no violations. However, there were a few successful, or at least interesting, hits such as the following example: (21) Maggie I'm gonna borrow your husband a minute <-|-> sit on the floor. <-|-> Betty <-|-> Alright. Just watch him cos he's <-|-> liable to be a bit amorous. Maggie <-|-> Right Dave? <-|-> PS000 <-|-> chuck you out <-|-> Betty <-|-> Oh dear dear dear. <-|-> (BNC KBE) To be liable to is an expression with a tendency to combine with words denoting unpleasant/disagreeable things or actions, and it also has a legalistic touch. Among the nouns prominently collocating with liable are bye-law, servitude, disqualification, negligence, damages, indictment, debts, defendant, prosecution, flooding etc. Other collocates include criminally, disciplinary, legally, penal and so on. A bit amorous is certainly the odd one out in this list, and the whole utterance above is humorous, perhaps also ironic. Betty makes this statement about her husband (David) in the presence of her daughter (Sally) and her friends (Rose, Julie, Maggie). If irony is involved here the victim is David, who does not react to it (unless the unattributed, and unfortunately mostly unclear PS000 utterance is by him). David is described in the BNC text header as a disabled unemployed man at the age of 55, while Maggie (who wants to 'borrow' him) is a 32-year-old shop assistant. The incongruity between their respective ages might play a role here for an ultimate interpretation of this exchange.

9 This effect seems to be similar to register or stylistic clashes, which can also produce irony. However, I do not think that the result of a semantic clash need always be ironic or insincere; it might just be a general humorous effect, which I would also see as a more likely interpretation of the Lodge passage in question.
10 Those were among the first 13 items of a collocates listing (based on mutual information) for commit done on the whole BNC.
11 These searches were only done with the BNC, because it is the only one big enough to yield viable collocation data.

The next example, which concerns verge (n.), was found incidentally during a search for marriage. Just like liable, verge has a rather negative semantic prosody, combining with collocates such as extinction, bankruptcy, starvation, tears, collapse, nervous (breakdown), death, war etc., with retirement being the only noun among the first 50 collocates that can take on either a positive or a negative evaluation. The phrase on the verge of marriage then at the very least seems somewhat unusual. (22) Liz Yes. I mean, I know it's hard it's hard doing <-|-> erm <-|-> PS000 <-|-> Yes! <-|-> PS000 And she wasn't going to marry, she never really considered marrying Rivers did she? <-|-> It said she was on the verge of marriage. <-|-> PS000 <-|-> I heard she . <-|-> PS000 She was <-|-> almost <-|-> PS000 <-|-> Er <-|-> PS000 <-|-> hypnotized <-|-> Liz <-|-> Yeah. <-|-> (BNC K60) However, the phrasing seems to produce neither humour nor irony in this instance. Another problem here is that it is a kind of quote occurring in the context of a literature lecture/discussion (with the lecturer, Liz, and two other (unknown) participants), so that one would have to take into consideration how the phrase is used and intended in the original literary work.
At any rate, this is not a truly conversational example for a collocational clash. The medium might in fact play an important role. Most of Louw's examples are taken from the written language, and it could be that unusual collocations intended to transport a special meaning and attitude are more common in carefully drafted written language. My last example is taken from an academic book publication. (23) When Tikhon was placed under house arrest in June 1922, one of these movements, the Living Church, was given numerous concessions by the regime, and at first looked set to take over the role and some of the property of the Orthodox. Trotsky went so far as to call the agreement “an ecclesiastical NEP”, implying a similar tolerance to that meted out to “kulaks” or to Nepmen, but this was a superficial and short-sighted judgement redolent with propaganda. The year 1922 in fact marked the start of the regime's long-term siege of the Church. (BNC A64) The combination of tolerance and mete out, whose most prominent collocates are punishment, treatment, justice (the last two especially in their negative instantiations), certainly leads to an ironic effect, used for the prototypical function of criticism. 4. Evaluation and outlook The above investigation has shown that it is possible to find instances of irony in corpora, and, what is more, instances of a very varied kind. The examples presented above do in fact all express an attitude, in many cases a somewhat negative or the prototypical critical one. They range in force from being humorous (e.g. (13)) to conveying confrontational criticism (e.g. (14)). The examples exhibit the diverse targets of ironic utterances: at the speaker him-/herself (20), at the situation as such (18), at the/a present addressee as the victim (14, 16, 21?), and at an absent victim (13, 17, 19). Formally, the irony is realized at the word level (13), at utterance level (16) or in a certain incongruity between the content utterance and the situation (19). Thus, the variety of irony treated in the literature can be illustrated with authentic corpus material. Moreover, the ironic instances found also show a nice social spread. They occurred both in informal contexts (family, friends, e.g. (14), (19), (21)) and in more formal ones (institutional settings, e.g. (18), (20)), with work-related contexts perhaps in the middle (13, 16). Irony is produced in the examples by people of different ages (from their teens (14) to their seventies (19)), and from different spheres of life (teachers, students, production workers, shop assistants, chief executives, housewives, civil servants). While basic sociolinguistic information is available in some corpora (relatively informative in the BNC, less so in the WSC), much contextual information that would be useful for the interpretation of irony is missing (e.g. details about personal relationships, mutual knowledge etc.). The cotext of the examples can be useful in this respect, however (cf. (19)). All three approaches pursued here are useful to a certain extent. Searching for explicit irony can give information about people's understanding of irony and it can even find conversations with 143 (supposedly) ironic remarks. It is a restricted approach, however, as the frequency of such instances is rather limited. The collocational method is probably the most complicated and also the least promising. At any rate, it is hard to pursue as long as no more is known about semantic prosodies than is the case at present. 
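Although the collocational method is judged the least promising here, the computation underlying it is straightforward (cf. footnote 10, which mentions a mutual-information collocate listing for commit on the whole BNC). The sketch below is only a minimal illustration of such a listing, assuming the corpus is available as a flat list of tokens; it is not the BNC software actually used, and the span size, threshold and node word are arbitrary.

# A minimal sketch of a mutual-information collocate listing of the kind mentioned
# in footnote 10 (section 3.3). It assumes the corpus is already tokenised and
# lower-cased; this is not the tool actually used on the BNC.
import math
from collections import Counter

def mi_collocates(tokens, node, span=4, min_pair_freq=3):
    """Rank words co-occurring with `node` within +/-span tokens by pointwise mutual information."""
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # collect the collocates in the window around this occurrence of the node word
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        pair_freq.update(window)
    scores = {}
    for collocate, f_pair in pair_freq.items():
        if f_pair < min_pair_freq:
            continue
        # PMI estimate: log2( p(node, collocate) / (p(node) * p(collocate)) )
        p_pair = f_pair / n
        p_node = word_freq[node] / n
        p_coll = word_freq[collocate] / n
        scores[collocate] = math.log2(p_pair / (p_node * p_coll))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# A positively evaluated word ranking high for a node like "commit" would be the kind of
# prosody violation that section 3.3 treats as a possible pointer to irony.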
The best way seems to be to look for instances of irony identified by native speakers and dictionaries, hoping that this will also turn up unexpected examples – because it has to be admitted that this approach contains the danger of finding only what one was looking for in the first place, i.e. a circular approach. The method has to be fine-tuned of course, e.g. by using a comprehensive dictionary database (the OED in its present edition might not have been the best choice here) and employing native speaker intuitions in a more systematic way by collecting them via a structured questionnaire technique. The size of the corpus used is also very important. The examples quoted above are almost all from the BNC, with just a few from the WSC and none at all from the SBC. The WSC proves that careful searching can make use of a one-million-word corpus for investigating irony, but anything below that (cf. SBC) is pushing the researcher's luck. References British National Corpus (BNC) Santa Barbara Corpus of Spoken American English Part 1 (SBC) Wellington Spoken Corpus (WSC) Attardo S 1994 Linguistic Theories of Humor. Berlin/New York, Mouton de Gruyter. Barbe K 1995 Irony in context. Amsterdam/Philadelphia, Benjamins. Benson M, Benson E, Ilson R 1997 The BBI dictionary of English word combinations. Amsterdam/ Philadelphia, Benjamins. Brown, P, Levinson, S 1987 Politeness. Some universals in language usage. Cambridge, Cambridge University Press. Clark H, Gerrig R J 1984 On the pretense theory of irony. In Journal of experimental psychology 113: 121-126. Dews S, Kaplan J, Winner E 1995 Why not say it directly? The social functions of irony. In Discourse Processes 19: 347-367. Grice H P 1991 (1975) Logic and Conversation. In Davis S (ed), Pragmatics: A reader. New York/Oxford, Oxford University Press, pp. 305-315. Hamamoto H 1998 Irony from a cognitive perspective. In Carston, R, Uchida, S (eds), Relevance Theory: Applications and implications. Amsterdam/Philadelphia, Benjamins, pp. 257-270. Hartung M 1998 Ironie in der Alltagssprache. Opladen/Wiesbaden, Westdeutscher Verlag. Kreuz, R J, Roberts, R M 1995 Two cues for verbal irony: Hyperbole and the ironic tone of voice. In Metaphor and Symbolic Activity 10(1): 21-31. Lapp E 1992 Linguistik der Ironie. Tübingen, Narr. Leech G 1983 Principles of Pragmatics. London/New York, Longman. Levinson S 1983 Pragmatics. Cambridge, Cambridge University Press. Louw B 1993 Irony in the Text or Insincerity in the Writer? The Diagnostic Potential of Semantic Prosodies. In Baker, M, Francis, G, Tognini-Bonelli, E (eds), Text and technology: In honour of John Sinclair. Philadelphia/Amsterdam, Benjamins, pp.157-176. McEnery T, Wilson A 1996 Corpus linguistics. Edinburgh, Edinburg University Press. Oxford English Dictionary (OED) 2nd edition, 1994 Oxford, Oxford University Press. Seto K 1998 On non-echoic irony. In Carston, R, Uchida, S (eds), Relevance Theory: Applications and implications. Amsterdam/Philadelphia, Benjamins, pp. 239-255. Sinclair J 1991 Corpus, concordance, collocation. Oxford, Oxford University Press. Sperber D, Wilson D 1981 Irony and the use-mention distinction. In: Cole P (ed), Radical pragmatics. New York, Academic Press, pp. 295-318. 
Corpus-based empirical analysis of form, function and frequency of characters used in Bangla
Niladri Sekhar Dash and Bidyut Baran Chaudhuri
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, 203, Barrackpore Trunk Road, Calcutta - 700 035, India
Email: {niladri/bbc@isical.ac.in}

Abstract
In this paper an attempt is made to understand the formal and functional aspects of Bangla characters used in the written texts compiled in a sample monitor corpus, designed systematically from language data collected from various text documents published between 1980 and 1995. The purpose of this study is to understand the form and function of the characters, trace their behavioural peculiarities, and, if possible, find out the reasons for such peculiarities. The study focuses on the formation of the characters, their structural change in compound and cluster formation, their contextual use, statistical analysis of their occurrence, and their position in words. The study also encompasses the use of different punctuation marks in the texts. Finally, some possible areas of application of such analysis are identified.
Key words: character, corpus, grapheme, allograph, vowel, consonant, cluster, frequency, statistics, punctuation.

1. Introduction
The Bangla1 language corpus used for the following empirical analysis is represented by a set of unique orthographic symbols (letters) arranged in a specific pattern, with appropriate punctuation marks for proper reading and comprehension by the language users. Most of the linguistic features of the language are captured by the character2 set, which makes the written text potentially representative of the spoken form of the language. The study of the characters constituting the corpus is important for accounting for their patterns of use in different contexts of the texts as well as for the comprehension of the general characteristics of the language. Thus, multi-layered information about the characters can be an important and necessary contribution to Natural Language Processing (NLP), Computational Linguistics (CL), Optical Character Recognition (OCR), key-board design, Word Sense Disambiguation (WSD), Parts-of-speech Tagging, cryptography, language teaching, Machine Translation (MT), besides other applied and interdisciplinary studies. Moreover, it can provide insight into how language is used by different users in different domains of knowledge representation. For the functional and structural analysis of characters used in Bangla, we have considered a written corpus of 4 million words comprising printed texts of nearly 85 disciplines, published between 1980 and 1995 (Dash and Chaudhuri 2000). To make our corpus non-skewed and sufficiently representative, we have tried to develop it after the standards proposed by Sinclair (1991), Atkins et al. (1992), Biber (1993) and others. The relative frequency of various published documents in the corpus is determined following the report of a survey conducted by the Central Institute of Indian Languages (CIIL), Mysore, as stated below:

1 Among the languages used in the world today, Bangla is fifth (after English, Chinese, Spanish and Hindi) in strength in regard to the number of speakers. It is the national language of Bangladesh and one of the national languages of India, spoken by the people of West Bengal and Tripura, two states of India. Moreover, a sizable number of people living in Asia, Europe and America also use this language.
2 The term character is used here in a linguistic or paleographic sense to mean the different orthographic symbols used for writing a natural language. In most cases, it includes letters, punctuation marks and other similar symbols.

Texts from genre    %-age
Mass media          30 %
Literature          15 %
Social science      15 %
Natural science     15 %
Commerce            10 %
Fine arts            5 %
Translation          5 %
Others               5 %
Table 1: Percentage of texts belonging to different genres included in the corpus

Moreover, an electronic Bangla dictionary of around 60,000 words is used for the current purpose. Thus, the corpus and the dictionary are used as a representative linguistic database for description as well as for verification of hypotheses about the use of characters in the language. They provide information on the use, shape, design, size, occurrence and change of the characters as well as insight into the role of punctuation marks employed in writing. The paper is organised as follows: section 2 highlights some unique features of Bangla characters, while section 3 gives an account of the shape analysis of characters in isolation as well as within words. It also gives some idea of tier division, compounding and clustering of characters. In section 4, we provide a few frequency counts of characters in the corpus, while in section 5, we try to provide a brief survey of the use of punctuation marks. The importance of such a study is evaluated and some application areas are identified in section 6.

2. Bangla characters
In printed Bangla texts, six types of character are noted: (i) vowel grapheme, (ii) vowel allograph3, (iii) consonant grapheme, (iv) consonant graphic variant, (v) graphic compound, and (vi) consonant cluster. Scholars like Banerji (1919), Chakravarty (1938), Diringer (1953), Sen (1993) and others more or less agree that Bangla characters developed from the proto-Bangla type. The present character set is claimed to have evolved from the hand-written letters inscribed in various Old Bangla documents and manuscripts (Sen 1993: 24). Through the years, however, notable modifications of the shape of the characters were introduced at different points of time. By the 15th and 16th centuries, the characters appear to have been fully developed. Indeed, during the 17th and 18th centuries, no change is registered in their design and structure. In the 19th century, the shape of the characters was stereotyped by the introduction of the printing press (Diringer 1953: 365). When the character set was mechanically designed and stratified for printing, the scope for structural modification was drastically reduced. However, the first character set, designed by Wilkinson in the 18th century, has since been modified to give the more recent shapes (Bandyopadhay 1981: 47). Vidyasagar made an attempt to rearrange the characters more scientifically and proposed 52 characters (12 vowels and 40 consonants) in place of the earlier 50 characters (16 vowels and 34 consonants) (Bandyopadhaya 1981: 100). As a result, the total number of characters has been reduced and some consonant clusters have been simplified in form (Mukhopadhaya 1981: 102). It is noted that all Bangla vowels, except a, have allographs. Probably, allographs were meant to speed up handwriting, because basic vowels, compared to their respective allographs, take more time, space and energy. It is statistically observed that the most recurrently used allograph is the most simplified in shape and the most suitably positioned in the writing system (Dash and Chaudhuri 2000).
Structurally, Bangla characters are cursive and twisted as Diringer (1953: 363) observes: 3 The term allograph is a cover term that includes all graphic variants found in the script. The variants are generally called vowel signs. Recently, Babulanam and Beena (1999) have mentioned them as shorthand signs which, however, we ignore for possibility of confusion with real shorthand signs. 146 "The Bengali was a peculiar cursive script with circular or semi-circular signs, hooks or hollow triangles attached to the left of the tops of the vertical strokes. The triangle itself is a modification of the top stroke with a semi-circle below, and this form is connected with the common form of thick top-strokes, rounded off at both ends". Ploughing through the corpus, the following features of Bangla character set are noted: (i) It has 12 vowels, 20 allographs and 39 consonants along with nearly 280 consonant clusters which are read and written from left to right sequence in word formation. (ii) In printed texts, the vowel e has two allographs: one is with and the other is without the ’head line’ (a short horizontal line put above most basic characters). The contexts of their use are also different. (iii) A consonant or cluster can use only one allograph at a time. (iv) Unlike vowel, in certain contexts, a consonant may be silent in utterance despite being physically present in the text. (v) The shape of an allograph is not grapheme dependent. However, in some cases it is changed depending on the shape of a consonant. For instance, the shape of allograph of the vowel u is changed when used with consonant g and sh, and cluster nt etc. (vi) The linear position of allographs with respect to graphemes is not uniform. They can occur before or after, above or below the consonants or clusters. (vii) Consonants also register some graphic variants (called reph and raphalaa, yaphalaa, vaphalaa etc.) which are generally used in cluster formation. (viii) A single grapheme most often represents a single phoneme with a slight phonetic variation (e.g. ii (long) and i (short) denote [i], u (short) and uu (long) denote [u], j and y denote [Ô], sh (palatal), s. (retroflex) and s (dental) denote [ò], n (alveolar) and n (dental) represent [n] etc.) (ix) Most of the consonants can join physically to form a cluster. Generally, clusters are formed by joining two consonants. However, clusters of three or four consonants are also possible. There are nearly 280 clusters of which those made by two consonants are around 240. Cluster of three and four consonants are nearly 35 and 5, respectively. (x) The sentence terminal markers consist of purnacched (full stop) [< Skt. purna "full" + cched "pause"], interrogation mark and exclamation mark. Most punctuation marks used in Bangla are borrowed from English punctuation system. 3. Shape analysis of characters Generally, the vowels and consonants are considered ’basic characters’ because of their independent existence and their role in governing the use of other characters in the texts. The entity and role of vowel allographs and consonant graphic variants are measured by their use within words or morphemes. The formation of graphic compounds is caused by the change of vowel allographs used with consonants or clusters. The consonant clusters are formed by joining two or more consonant graphemes the role of which is context-bound (because their role in the script can be properly studied when put in the context of words). 
3.1 Characters in isolation The shape of the basic characters is a mixture of straight line, circular and semi-circular curves, thick dot and conic shapes. All these shapes are not of equal length although the height of most of the basic 147 characters are identical. Moreover, the lines and curves are not always used in their full length in every occasion of character formation. Sometimes, the full length of a line or curve, sometimes the half of them, or sometimes just a portion of a line or curve is used for designing basic characters. However, the shape and design of some basic characters (ii, kh, g, gh, ch, j, th, ph, bh, s etc.) are more complex than that of other characters. The reason of their complexity probably lies in their process of formation where all the properties (i.e. dots, curves, straight lines and conic sections) are used. The head line (shirorekhaa) is considered as an important feature of the characters because it acts as a line of demarcation at the time of tier division (discussed in section 3.3) of the characters. It is a property by which the basic characters can be grouped into two broad classes: · characters with head line (32 in number) · characters without head line (14 in number). Moreover, according to the arrangement of different structures and properties, the basic characters can be grouped into three major classes: · characters formed with linear structures arranged in different angles (15 in number) · characters formed by dot and curve shapes (11 in number), and · characters having both kind of shapes (26 in number). The use of vertical line is maximum in the formation of the basic characters. Nearly 33 basic characters have vertical line in its full horizontal span. It is observed that while some characters (n, b, r, dh) contain vertical line at their rightmost side, some characters (c, ch, T, Dh, Rh, d) have it on their left most side. Similarly, while some characters (aa, jh) use vertical line twice where the second line is placed just parallel to the first one, some other characters (u, uu, ch, D, d, R) use only a halflength vertical line in their shape design. Moreover, the width of a character is not always proportionate to its height. While some characters (g, l, sh) are wider than their height some others (N, n) are less wide than their height. For some characters (k, b, r ) the width and height are nearly equal For automatic recognition an analysis of shape similarity of characters is required since some basic characters (a/aa, u/uu, o/tt, kh/th, k/ph, t/bh, l/n, sh/n, b/r, D/R, T/Dh/Rh, g/p, y/gh/s, ks/hm) are nearly similar in shape. They are generally considered as confusing characters because one can easily be confused with the other either by man or in machine recognition problem. 3.2 Characters in string Characters used within words sometimes differ from their features noted in their isolation, because context can add some more features. Generally, the following factors control their roles in context: · restrictions in positional use, · modifications in original form or shape, and · limitations in functional role These factors generally lead vowels to be converted into allographs at word intermediate and final positions despite their use in original shape at word-initial position. Therefore, the occurrence of vowels (i, o) in basic shape at word-middle or final position carries separate implication (e.g., a particle for emphasis). 
Similarly, some consonants (n, R, Rh, y) cannot occur at word-initial position because of the restriction in their positional use. Moreover, the variant of the consonant t (called khandata "half-t") is not entitled to accompany a vowel allograph. Therefore, whenever such situation arises (particularly at the time of using case markers) it changes into the basic character (t). The consonant r generally allows its two variants (reph and raphalaa) to occur only in cluster formation. While reph occurs above a consonant, raphalaa finds place at the bottom of the character. Moreover, reph is too weak to cause any structural change of the character but raphalaa is strong enough to modify the basic shape of some consonants. 148 3.3 Tier division of characters An analysis of tier division of the characters is important for understanding actual behaviour of the characters within the words. In Bangla words, the characters are arrayed in three tiers: upper, middle, and lower. While the upper tier contains signature of basic characters and allographs, some consonant graphic variants (candrabindu, reph etc.), the middle tier contains the bulk of character shape, and the lower tier contains some allographs (u-kaar, ri-kaar etc.) and some consonant graphic variants (raphalaa, vaphalaa etc.). The allographs, when used with consonants or clusters, are distributed in all three tiers (Pal and Chaudhuri 1995). This kind of analysis has helped us in OCR system development in Bangla. 3.4 Graphic compounds The graphic compounds, generally formed by joining two or three graphemes with allographs or graphic variants, can either be a combination of consonant and allograph, or a combination of consonant, graphic variant and allograph. In this process some change takes place in the original form of the characters. Compared to basic characters these are complex in form and design. It is noted that: (a) the allograph of u, when used with consonants, creates three different graphemedependent compound shapes: (i) with consonant r or a cluster with raphalaa (e.g., dr, gr, shr, br etc.) the allograph changes its shape and is attached at the right hand side of consonant or cluster. The notable point is, the change takes place only with consonant r, either in its original shape or in graphic variants, (ii) with consonant sh, g and cluster nt, it changes its shape and is attached just at the bottom of consonant or cluster, and (iii) with the consonant h it entirely merges with the grapheme making the shape of the grapheme little more twisted. (b) the allograph of uu also goes through structural change when used with consonant grapheme r and clusters with raphalaa (e.g. gr, shr). The allograph changes thoroughly in shape and is attached on the right hand side of the character. (c) the allograph of ri goes through a directional change (not structural change) when used with consonant h. Here, the allograph changes its direction from horizontal to vertical and is attached to the right hand side of grapheme. 3.5 Consonant clusters The consonants in clusters (formed by joining with other consonants) undergo three types of structural change: (i) the shape of the graphemes are entirely modified to generate a new shape (e.g., ks, kt, kr, ng, nc, tt, tr, hm etc.) (ii) the shapes of the graphemes are partly modified. We have found that for nearly 45 clusters, the shape of the first grapheme is modified, while for nearly 60 clusters, the shape of the last grapheme is affected. 
(iii) three (nearly 35) or four consonants (nearly 5) can also join to form a cluster, where the last grapheme (either in full or in part) is attached at the bottom or the right-hand side of the cluster. Recently, some compounds and clusters have been simplified in shape for transparency and easy access on the typewriter and computer (Sarkar 1993: 42), though this is yet to be accepted widely by printing organisations.

4. Some quantitative findings
The introduction of different sub-disciplines like quantitative linguistics, stylometrics, applied linguistics, forensic linguistics etc. has raised a demand for various statistical and quantitative analyses of the occurrence of characters in the texts of a language, for making observations and framing hypotheses, developing primers for language learners, designing tools for CL and NLP, etc. The use of statistics in language study has been rejuvenated now that huge computer-accessible corpora are available to investigators. A corpus with a huge collection of empirical data, with innumerable variations in the use of different characters, can easily be subjected to quantitative analysis, which allows us to discover which characters are used more regularly in the language and which occur rarely. It allows us to get a precise picture of the frequency and rarity of particular characters, and thus helps us to determine their relative normality or abnormality. This led Yule (1964: 15) to comment that linguists without adequate knowledge of statistical information about different properties of language can make mistakes in handling linguistic data as well as in their observations. The Bangla language was probably first studied quantitatively by Chatterji (1926/1993), who made a frequency count of lexical items in a Bangla dictionary and in some writings of Old Bangla. Bhattacharya (1965), on the other hand, made various statistical analyses of phonemes, syllables, words and sentences in a collection of prose texts. Similarly, Das et al. (1984) made some statistical studies of global character occurrence in Bangla, Assamese and Manipuri on some selected texts. The efforts of Mallik and Nara (1994, 1996) are mostly centred around the writings of the poet Tagore. To the best of our knowledge, multi-dimensional quantitative analysis of a Bangla corpus can be credited to Chaudhuri and Dash (1998) and Chaudhuri and Ghosh (1998). Before starting the frequency count, some issues regarding character identification had to be resolved. The decisions made at this stage were followed throughout the programmed analysis of the corpus, which saved us from problems of wrong observations or deductions. The decisions are as follows: · in the character-level frequency count, importance is given to each character's position in the corpus; · each grapheme, allograph, graphic variant, consonant cluster and graphic compound is considered a single character; · the percentage for the vowels is obtained by adding the occurrences of their basic forms and their allographs, while that for the consonants is obtained by adding the occurrences of their basic forms and their graphic variants (where applicable); · for uniformity in the processing and identification of the characters, punctuation marks are uniformly separated from the characters in the texts. The following section presents the frequency statistics of different characters used in the corpus along with some discussion of the findings. Among statistical methods, the frequency count is the most straightforward and rudimentary approach to working with quantitative data; a minimal sketch of such a count is given below.
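The following sketch is only an illustration of such a character-level count under simplifying assumptions: the corpus file name is hypothetical, the text is assumed to be available as a plain Unicode file (which may differ from the encoding actually used for the corpus described here), and allographs, graphic variants and clusters are treated as single code points rather than according to the fuller scheme adopted in this paper.

# A minimal sketch of a character-level frequency count over a plain-text Bangla corpus.
# The file name is hypothetical and the character scheme is simplified to code points.
import unicodedata
from collections import Counter

def character_frequencies(corpus_path):
    """Count Bangla characters in a text file and return (character, count, percent) rows."""
    counts = Counter()
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            for ch in line:
                # keep only Bengali-script characters; punctuation, spaces and Latin text are skipped
                if "BENGALI" in unicodedata.name(ch, ""):
                    counts[ch] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    return [(ch, n, 100.0 * n / total) for ch, n in counts.most_common()]

# Example use (hypothetical file): print the ten most frequent characters with their percentages.
# for ch, n, pct in character_frequencies("bangla_corpus.txt")[:10]:
#     print(ch, n, f"{pct:.3f}%")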
In this process the characters in corpus are classified according to a particular scheme and an arithmetical count is made on the number of items within corpus which belong to each classification of the scheme. The frequency count provides number of occurrences of each type used in corpus. The results may be used for estimating position of different characters in script, to design primers for language users, as well as to develop different tools for computer implementation. Moreover, various hypotheses presented by earlier scholars about the use and occurrence of characters are also verified by the findings. Keeping these factors in mind, we have counted a few simple statistics of the characters which are given below: (i) Among total number of characters used in the corpus, the occurrence of aa is maximum followed by e, r and i. Among the vowels, aa (11.965%) including allographs comes first, followed by e (9.793%), i (7.745%), u (2.379%) and o (2.027%), while among the consonants, r (8.633%) is maximally used followed by n (5.033%), k (4.898%), t (4.312%), b (3.800%), s (2.942%), l (2.866%), m (2.826%), p (2.562%), y (2.143%) and d (2.127%). The occurrence of r is maximum in corpus probably because of the 150 occurrence of its two variants. Similarly, the percentage of use of t is increased due to the presence of its variant. Among the first 10 most recurrently used characters in the corpus, 6 are consonants while the remaining 4 are vowels and all these characters are easier to articulate than others present in the language. In Hindi corpus also, the vowel aa occurs maximally in the script (Tripathi 1971: 26). (ii) The occurrence of vowel (39.63%) and consonant (52.76%) consists nearly 92.39% of the total characters used in the corpus. Though there are nearly 280 types of cluster in the script, their use is quite less (07.61%). The percentage of cluster is higher in a similar context if the text is written in chaste (saadhu) version which is older than the colloquial (calita) version now in vogue. This supports our argument that Bangla is gradually simplified in utterance and pronunciation and the clusters are gradually replaced by single consonants. (iii) It is found that words starting with the consonant k (9.81%) is highest in the language followed by p (8.68%), b (8.58%), s (8.24%), e (5.43%), aa (4.85%), n (4.64%), m (4.63%), t (4.50%), d (4.47%) and h (4.45%). Moreover, words starting with consonants (81.52%) is more in number than words starting with vowels (18.48%). It provides an interesting insight into the nature of the language. Out of top ranking 20 characters, the vowels are 5, while the remaining 15 are consonants which are easy to articulate. It supports the commonly used statement that Bangla is easier to speak and perhaps sweeter to listen than many other Indian languages. (iv) Among allographs, the occurrence of aa (34.20%), similar to Hindi (Khan et al. 1991: 272), is highest in the script followed by e (29.44%), i (18.75%), u (06.64%) and o (04.50%), consecutively. The counting of a is not possible as it has no allograph, though our common belief is that the occurrence of a would have been highest in the corpus, because in most cases a consonant or cluster if not attached with any allograph carries an inherent [] with it. However, this hypothesis can only be authenticated once a speech corpus is analysed. 
(v) The use of cluster in language is gradually reduced because in corpus the number of words without cluster is much more (81.57%) than words having one (14.25%), two (2.72%) or three (1.00%) clusters. Words having more than three clusters are very rare, mostly tatsama compounds. The statistics hints for almost complete loss of clusters from the language in future. Among the first 20 most frequently used consonant clusters, pr, ks, tr, st, sv, ny, sth, gr, by, jn and shy can occur at any position of words, while the remaining clusters can occur only at word-intermediate and/or final position. Among these, pr (8.16%) occurs the most because of its maximum occurrence at word-initial position, and the graphemes comprising the cluster, are quite frequent in the language. (vi) Among consonant graphic variants raphalaa (37.94%), yaphalaa (26.66%) and reph (22.70%) are highest in occurrence. They are used in the script for the purpose of cluster formation. Probably, because of their recurrent use in the language they are made simple in shape so that they can be used easily in the script. Here we argue with statistical support that because of their recurrent use in script they are most simple in shape and designed to make writing easier and faster. (vii) Following Miller et al. (1958) the statistics of relative frequency of use of punctuation marks in Bangla corpus is taken. It is noted that the use of comma, like that of English (Bayraktar et al. 1998), is highest (22.32%) followed by purnacched (17.26%), semicolon (15.27%), hyphen (8.89%), note of interrogation (7.38%), colon (6.16%) and note of exclamation (04.36%), respectively. 5. Use of punctuation marks 151 Generally, some syntactic and semantic properties of a sentence are controlled by punctuation marks as they are used in texts to mark out strings of words into manageable groups. The primary role of punctuation marks is to show the pause of breath in a sentence besides clarifying meanings or, in some cases, preventing wrong meaning being deduced form a sentence. Traditionally, two functions of punctuation marks are considered: (i) grammatical function where they help the construction or structure of a sentence, and (ii) rhetorical function where they help to deduce the hidden implication of a statement. However, Crystal (1995) has defined four functions: grammatical, prosodic, rhetoric and semantic function. The role of punctuation marks in Bangla is not yet scientifically estimated though there have been some attempts by Saraswati Press (1956), Roy (1989), Chakravarti (1994), Chatterji (1973), Bhattacharya (1999) and others. We here try to present an estimate by citing examples from the corpus. The punctuation marks most commonly used in Bangla to divide a piece of prose writing are: purnacched, semicolon, comma, colon, note of interrogation, note of exclamation, apostrophe, quotation mark, brackets (round, braces and square), dash, ellipsis, hyphen and space, besides arrow mark, percentage mark, equal mark, therefore mark, underline, implication mark etc. Moreover, various mathematical and geometric symbols are used in scientific books and articles. Full stop, therefore, marks the main division into sentences, semicolon joins sentences, and comma which is most flexible in use, separates smaller elements with the least loss of continuity. Brackets and dashes also serve as separators - often more strikingly than commas. 
In the following sections only a brief outline is given: · In Bangla a purnacched is regularly used as a terminal marker at the end of a sentence which is a statement (not question or exclamation) either in descriptive, declarative, narrative or imperative sense. · The use of comma has a lot of variation in practice. Its primary role is to give detailed description to the structure of sentences, especially longer ones, and to make their meaning clear. Too many commas can be distracting; too few can make a piece of writing difficult to read or, worse, difficult to understand. It is widely used to separate main clauses of a compound sentence when they are not sufficiently close in meaning or content to form a continuous unpunctuated sentence, and are not distinct enough to warrant a semicolon. Moreover, it is used for a short time break during utterance that indicates a pause between parts of a sentence, or dividing items in a list, string of figures etc. Besides, it is used in pairs to separate elements in a sentence that are not part of the main statement, to stop for a short while in various sections within a sentence, to provide a slight pause between words in a sentence and between dates of days, month and year. Generally, this technique is not practised in Bangla though our corpus contains some such instances. It is used in numeral of four or five figures, to separate each group of figures starting from the right; before the reporting of speech; and sometimes used in place of parenthesis where it is used just before and after the parts of text. · Colon is generally used to introduce a quotation or a list of items; or to separate clauses when the second clause expands or illustrates the first. It acts to separate main clauses when there is a step forward from the first to the second, especially from introduction to main point, from general statement to example, from cause to effect, or from premises to conclusion; to deliver the terms (name of books, persons, countries or a statement of people) that have been mentioned in the preceding words; to introduce a list of items; to indicate time in hours, minutes and seconds in writing; and between numbers in a statement of proportion or ratio. · Semicolon is of intermediate value between a comma and a full stop as it denotes a timebreak which is longer than comma but shorter than full stop. The main function of semicolon is to unite sentences which are closely associated or which are complement or parallel to each other in some way; to join two sentences different in meaning or content; 152 to divide some parts of a sentence meaningfully where each part has commas for its own use etc. Moreover, it is used as a stronger division in a sentence that already includes division by means of commas and when the name and designation of some persons are to be shown in a sentence. · The question mark is primarily used to indicate that the sentence is an interrogative one. Sometimes it is used within brackets in the middle of a sentence to express uncertainty or doubt about a fact, word or phrase immediately following or preceding it. · The exclamation mark is generally used at the end of a sentence to denote a sense of exclamation. Moreover, it is also used after words within a sentence to express absurdity, command, warning, contempt, disgust, emotion, pain, encouragement, wish, regret, wonder, admiration, surprise, grief etc. 
· Apostrophe is used to indicate omission of letters or numbers; to denote loss or contraction of some letters in words; and to denote loss of digits in years. · Generally, two types of quotation mark are used in Bangla to indicate direct speech and quotation: single and double quotation. While single quotation is used to quote or mention title of books or other things; to denote special symbols or words used in texts; to mean that a word or phrase is cited in the sense of pun etc.; the double quotation is used to denote speech or dialogue in a story or novel; and to quote other's speech correctly in a piece of writing. It is noted that the closing quotation mark should come after any punctuation mark which is a part of the quoted matter, but before any mark which is not. · In corpus the use of three types of brackets are noted: parentheses, brace and square bracket. Among these, use of braces is very rare, mostly for denoting some mathematical symbols or notations while use of parentheses and square bracket is almost regular. Parentheses are used to enclose explanation and extra information or comment; to give reference or citation to any date or event or work or person etc.; and to enclose reference letter and number. Square brackets are used to enclose extra information which is attributive to something, someone or some place; and to convey special kinds of information, especially when parentheses are used for other purposes. For instance, in standard Bangla dictionaries they are used to give the etymologies at the end of the entries. · The purpose of using dash in writing or printing is to mark a break in words; or to represent omitted letters or words. It is also used in place of parenthesis. Sometimes a single dash is used to indicate a pause or hesitation in speech; to introduce an explanation or expansion of what comes before it; and to indicate omitted words especially slang or coarse words in reported speech. · The use of dot is very rare in Bangla. Generally, it is used to denote abbreviation or shortening of some portion of words. The use of ellipsis is also very less in Bangla. Generally, it is used to indicate fumbling of speakers or incompleteness of sentences which are already started. Moreover, if one does not want to quote the full texts of some writings one can use this sign in those places of the text where it is left. · Asterisk is used as a reference mark in the footnote to explain some ideas or to denote source of some texts. It is used just immediately before or after word or sentence form which one can directly refer to the footnote. It is also used to denote some words or characters silent in texts; to emphasise on or to draw attention of readers to a particular item in the text. For instance, in a list of books, some can be marked with asterisk to denote that these books are either very important or rare etc. Sometimes, more than one asterisk marks are used at a time to indicate a break or lapse of a section of an article or text. 153 · The alternative sign (generally called as stroke) is used to denote break of lines of a poetry, to describe two or more related words or items etc. In linguistic analysis it is used to denote phoneme or phonetic segment or a syllable. · The space between words is never considered as punctuation mark but its role in the text is of equal importance like other punctuation marks. What is a measured pause in speech is probably a calculated space in writing. 
It provides gap between two consecutive words in a writing to identify words in a sentence or texts. The hyphen is probably the most complex punctuation mark used in Bangla. It's role is not yet fully defined as there is no regularity in its use in the language. In standard Bangla dictionaries it is considered as a sign which joins two syllables or words. This definition is partly true as it captures only a fragment of the multiple roles of hyphen as noted in corpus. Its use in compounds is arbitrary, especially when elements of compounds are of one syllable. Except for some unavoidable situations, it is randomly used. In some occasions it is not used though needed, on the contrary, it is used in those occasions where it is not required. In the corpus more than twenty types of use are noted as given below. A hyphen is generally used: · more often in routine and occasional couplings, especially when reference to the sense of separate elements is considered important and unavoidable. It is used between compounds of two nouns (cor-Daakaat "thief and robber"), two adjectives (rogaa-moTaa "thin and thick"), noun and adjective (man-gaRaa "fabricated"), pronoun and noun (seidin "that day"), and cardinal adjective and noun (tin-purus "three generation") etc. · between two proper names (seli-kiTs "Shelley-Keats"), words of similar meaning (Taakaa-paysaa "rupees and pennies"), department and post (krisi-mantri "agriculture minister"), institution and position (skul-maasTaar "school teacher"), place and occasion, (baarlin-alimpik "Berlin Olympic"), for direction (uttar-pashcim "North- West") etc. · to connect words having syntactic link (kathaay-kathaay-raag-karaa mejaaj "gettingangry- in-every-word temper"), to link compounds and phrases used attributively (haajaar-haat-kaali "goddess Kali with a garland made of thousand cut-off hands"), to denote a sense of continuation (co-o-o-o-r "thief"), to indicate sounds of music or musical instruments (taa-dhin-dhin-taa), to write some names of non-Indian origin (maao-se-tung "Mao-Tse-Tung") etc. · for all similar words to stop repetition of the second constituent when the second constituent of compounds of a sentence are common. The second constituent is used only in the last word in the sentence (raajya juRe shramik-, bekaar-, bidyut-, khaadya-, jalsamasyaa dekhaa diyeche "The problem of labour, unemployment, electricity, food, water are noted in the state"). Hyphen here performs the role of concurrence. · between prefixes and nouns, (ku-najar "ugly look"), between monosyllabic noun and suffix (paa-Ti "the leg"), between monosyllabic pronoun and suffix (ka-Taa "how many"), in compounds where the first part is a single letter word (bhu-prakriti "geographical nature") etc. · between reduplicated words (maajhe-maajhe "sometimes"), onomatopoeic words (jhanjhan "tinkling"), echo words (bis-Tis "poison etc."), for exclamatory expression (ho-yaaT "what"), for emphasis (maa-i "mother herself"), for abbreviated words (bi-bi-si "BBC"), for avoiding awkward collision of homophonous characters (taap-prabar "heat-strong") etc. · when an inflected word form is used as noun and is added with further suffixes for linguistic analysis (moder-er byabahaar kameche "Use of 'moder' is declined") etc. 
· between a proper name and a case ending or suffix, whether the name is of a person (MilTan-er "of Milton"), an institution (aai.es.aai-te "in ISI"), a book (myaakbeth-er "of Macbeth"), a newspaper (sTeTsmyaan-e "in the Statesman"), a day (sombaar-e "on Monday") or a year (1947-e "in 1947"). However, such use is not a regular feature of the language: generally, case endings and suffixes are attached to proper names without inserting a hyphen in between.
· to retain the original structure of some words unchanged when they take suffixes (pad-er "of words", desh-er "of Desh"), though without this hyphenation the grammatical value of the words would not have changed.
· whenever a case marker is added to a word ending in half-t (bhabisyat-e "in future").
· after Bangla characters when they are subjected to grammatical or linguistic analysis (a-Taa baanglaar pratham varna "a is the first letter in Bangla").
· in vowel allographs (u-kaar "u-allograph") for grammatical reasons, and between some consonants and their place of articulation (dantya-sa "dental-s").
· with affixes and case markers when they are used for linguistic analysis. Usually, prefixes are written with a hyphen immediately after them (pr-, bi-), while suffixes and case markers are written with a hyphen immediately before them (-der, -ke, -te). The hyphen here works as a mark of their identity.
· between numbers (2-3), between years (1986-1990), and between numbers and words (6-phuT "6-feet").
· when native post-positions or case markers are used with non-nativised foreign words (mesin-er "of machine"). Such use is very rare in the corpus: generally, nativised foreign words take native post-positions or case markers without a hyphen (skuler "of school").
· when some English group verbs (pick up, by pass etc.) are transliterated in Bangla (pikaap, baai-paash).
· when a single-letter name is used to hide somebody's identity (ka-baabu "Mr. X").
· after an abbreviated word with a colon to indicate that the full form of the word is deliberately omitted (pr:- "question").
· between words which are not compounds but which by their peculiar combination denote a sense of hesitation (dicchi-debo "dilly-dally"), a sense of request (baabaabaachaa "appeasing"), an adverbial sense (saat-taaRaataaRi "in a haste"), or a sense of pun (jyoti-hin pashchimbanga "West Bengal without light").
As noted above, the use of the hyphen in Bangla is highly varied. It serves as a means of dissolving lexical ambiguity embedded within surface forms. Moreover, it helps us to find the actual meaning of words or phrases and keeps us from deducing a wrong one. The corpus provides some interesting instances where the deletion or addition of a hyphen changes the meaning of a word. Table 2 shows some examples.
Without hyphen   Meaning             With hyphen   Meaning
asukh            illness             a-sukh        not happy
kaTaa            brownish yellow     ka-Taa        some
paaTaa           plank               paa-Taa       the leg
amrita           nectar              a-mrita       not dead
aakaar           shape               aa-kaar       aa-allograph
caaTaa           lick                caa-Taa       the tea
ekaar            alone               e-kaar        e-allograph
maar             kill                maa-r         of mother
kushaasan        mat of Kush grass   ku-shaasan    bad ruling
Table 2: Change of meaning of words with and without a hyphen
Similarly, displacement of the hyphen changes the meaning of words: surat-ranga, for instance, means "game of coition" while sur-taranga means "waves of music", and akhyaatanaamaa means "notorious" while a-khyaatanaamaa means "non-famous".
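The disambiguating role of the hyphen can be made concrete with a small illustration. The sketch below is not part of the paper: it simply stores some of the Table 2 pairs in a lexicon keyed on the exact surface form, so that a dictionary or spell-checking tool would treat the hyphenated and unhyphenated spellings as distinct entries. The dictionary and function names are hypothetical.

```python
# Minimal sketch (not from the paper): a lexicon keyed on the exact surface form,
# so hyphenated and unhyphenated spellings from Table 2 receive different glosses.

GLOSSES = {
    "asukh": "illness",               "a-sukh": "not happy",
    "kaTaa": "brownish yellow",       "ka-Taa": "some",
    "paaTaa": "plank",                "paa-Taa": "the leg",
    "amrita": "nectar",               "a-mrita": "not dead",
    "maar": "kill",                   "maa-r": "of mother",
    "kushaasan": "mat of Kush grass", "ku-shaasan": "bad ruling",
}

def gloss(word: str) -> str:
    """Look up a romanised Bangla form; the hyphen is treated as part of the key."""
    return GLOSSES.get(word, "<unknown>")

if __name__ == "__main__":
    for w in ("asukh", "a-sukh", "kushaasan", "ku-shaasan"):
        print(w, "->", gloss(w))
```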
To divide a word with a hyphen at the end of a line is altogether a different matter, because it is not a regular feature of spelling. It is more common in print, where the text has to be accurately spaced and the margin has to be justified. With a little care it can be avoided entirely in handwritten, typed or word-processed material. In printing, words need to be divided carefully and consistently, taking account of their appearance and structure.
6. Conclusion
Researchers studying the evolution of thought processes in human societies believe that the development of language and script may also influence human cognitive powers. Script is a form of knowledge representation in which the use of graphemes requires humans to code and decode knowledge, convert auditory sounds into visual symbols, think deductively and order words to construct sentences. The study of the characters used in a language is important and useful for developing various tools for NLP and CL, such as OCR systems, cryptography, keyboard design for typewriters and printing, telegraphic code design, spell-checker design, dictionary preparation, machine translation, information-theoretic analysis of language, language teaching etc. Even from a purely linguistic point of view, such a study can help both primary and secondary language users to understand how the characters are designed and used in the language.
Acknowledgement: The Ministry of Information and Technology, Govt. of India is thanked for providing the corpus for the studies presented here.
Bibliography
Atkins S, Clear J, Ostler N 1992 Corpus Design Criteria. Literary and Linguistic Computing 7(1): 1-16.
Babulanam S M, Beena K F 1999 The User-Oriented Bengali Easy Orthography. Computers and the Humanities 33: 241-245.
Bandyopadhyay C (ed.) 1981 Dui Shataker Bangla Mudran O Prakashan (The printing and publishing history of Bangla in last two centuries). Calcutta, Ananda Publishers.
Banerjee R D 1919 The Origin of the Bengali Script. Calcutta, Calcutta University Press.
Bhattacharya N 1965 Some Statistical Studies of the Bangla Language. Unpublished PhD thesis, Indian Statistical Institute, Calcutta.
Bhattacharya S 1999 Tista Ksanakal: Biramcihna O Anyanya Prasanga (Wait for a While: The Bangla Punctuations and other Issues). Calcutta, Ananda Publishers.
Bayraktar M, Say B, Akman V 1998 An Analysis of English Punctuation: The Special Case of Comma. International Journal of Corpus Linguistics 3(1): 33-58.
Biber D 1993 Representativeness in Corpus Design. Literary and Linguistic Computing 8(4): 243-257.
Chakravarti N (ed.) 1994 Bangla: Ki Likhben, Kena Likhben (Bangla: What to Write and Why to Write). Calcutta, Ananda Publishers.
Chakravarty S N 1938 Development of the Bengali Alphabet from the Fifth Century AD to the End of the Muhammadan Rule. Journal of the Royal Asiatic Society of Bengal 4: 351-391.
Chatterji S K 1973 Bangla Bhasatattver Bhumika (An Introduction to Bangla Linguistics). Calcutta, Calcutta University Press.
Chatterji S K 1926/1993 The Origin and Development of the Bengali Language. Calcutta, Rupa Publications.
Chaudhuri B B, Dash N S 1998 Bangla Script: A Structural Study. Linguistics Today 2(1): 1-28.
Chaudhuri B B, Ghosh S 1998 A Statistical Study of Bangla Corpus. In Proceedings of the International Conference on Computational Linguistics, Speech and Document Processing (ICCLSDP'98): C32-37.
Crystal D 1995 The Cambridge Encyclopaedia of the English Language. Cambridge, Cambridge University Press.
Das G, Bhattacharya S, Mitra S 1984 Representing Assamese, Bengali and Manipuri Text in Line Printer and Daisy-Wheel Printer. Journal of the Institution of Electronics and Telecommunication Engineers 30: 251-256.
Dash N S, Chaudhuri B B 2000 The Process of Designing a Multidisciplinary Monolingual Sample Corpus. International Journal of Corpus Linguistics. (forthcoming)
Diringer D 1953 The Alphabet: A Key to the History of Mankind. London, Hutchinson's Scientific and Technical Publications.
Khan I, Gupta S K, Rizvi S H S 1991 Statistics of Printed Hindi Text Graphemes: Preliminary Results. Journal of the Institute of Electronics and Telecom Engineering 37(3): 268-275.
Mukhopadhaya B 1981 Bangla Mudraner Car Yug (Four Eras of Bangla Printing History). In Bandyopadhyay C (ed.), Dui Shataker Bangla Mudran O Prakashan (The printing and publishing history of Bangla in last two centuries). Calcutta, Ananda Publishers.
Mallik B P, Nara T (eds) 1994 Gitanjali: Linguistic Statistical Analysis. ILCAA, Tokyo University Press.
Mallik B P, Nara T (eds) 1996 Sabhyatar Sankat: Linguistic Statistical Analysis. Calcutta, Rabindra Bharati University.
Miller G A, Newman E B, Friedman E A 1958 Length-Frequency Statistics for Written English. Information and Control 2(1): 370-389.
Pal U, Chaudhuri B B 1995 Computer Recognition of Printed Bangla Script. International Journal of Systems Science 26(3): 2107-2123.
Roy A K 1989 Unis Sataker Bangla Gadya Sahitya: Ingreji Probhab (The Bengali Prose Literature of the 19th Century: The Impact of English). Calcutta, Jignasa Prakashani.
Saraswati Press 1956 Rules for Compositors and Readers. Calcutta, Saraswati Press.
Sarkar P 1993 Bangla Bhasar Yuktabyanjan (Consonant Clusters in Bangla Language). Bhasa 1(1): 23-45.
Sen S 1993 Bhasar Itivrittva (The History of Language). Calcutta, Ananda Publishers.
Sinclair J 1991 Corpus, Concordance, Collocation. Oxford, Oxford University Press.
Tripathi J N 1971 A Statistical Analysis of Devnagari (Hindi) Text Graphemes. Journal of the Institute of Electronics and Telecom Engineers 17(1): 25-27.
Yule G U 1964 The Statistical Study of Literary Vocabulary. Cambridge, Cambridge University Press.
Contrasting causal connectives on the Speaker Involvement Scale Liesbeth Degand, University of Louvain, Louvain-la-Neuve, Germanic studies, Place B. Pascal, 1, B-1348 Louvain-la-Neuve, e-mail: degand@exco.ucl.ac.be; Henk Pander Maat, University of Utrecht, Department of Dutch, Trans 10, NL-3512 JK Utrecht, e-mail: h.pandermaat@let.uu.nl In our talk we would like to contrast Dutch and French causal connectives on the Speaker Involvement Scale (Degand & Pander Maat, to appear; Pander Maat & Degand, to appear). This scale is an alternative proposal for classifying the use and meaning of connectives. Going beyond dichotomous and trichotomous classifications, we have proposed to represent (causal) coherence relations and connectives in a scalar way. This scalar representation reflects the fact that (causal) connectives are not strictly domain-specific, but that they nevertheless impose constraints on the contexts in which they can occur, with some contexts being more “natural” than others. In addition, a number of causal connectives seem to take an intermediate position between the traditional categories. In our view, this situation indicates the need for a scalar perspective on the spectrum reaching from non-volitional causality in the content domain to epistemic and speech-act causality.
The scale we have developed is one of speaker involvement (SI), on which the inherent expressive power of connectives can be represented. Our hypothesis is that the different causal relations can be ordered along a scale from minimal to maximal speaker involvement. SI refers to the degree to which the present speaker is implicitly involved in the construal of the causal relation. More specifically, SI increases with the degree to which both the causal relation and the related segments vehicle assumptions and actions of the present speaker. Four characteristics of coherence relations may enhance the prominence of speaker assumptions in the relation, and thus enhance the level of Speaker Involvement of the relation: The involvement of a conscious protagonist, a lack of isomorphism between the relation and states of affairs in the real world, proximity of the relation to the present speaker and the time of speaking, and the implicit vs. explicit realisation of the protagonist. The different causal relations we distinguish are, in order of increasing SI: causal non-volitional and volitional content relations; causality-based and noncausality based epistemic relations, and causal speech-act relations. Point of departure of our contrastive analysis are connectives which are very close in meaning and seem to be easily substitutable by one another within one language, like French donc and dès lors (‘so/hence'), car and parce que (‘for/because'), or Dutch dus and daarom, or between languages like the supposed translational equivalents puisque and aangezien (‘since'). In addition, these connectives also show highly diverging frequencies in a newspaper corpus. The question we would like to tackle in our presentation is whether these divergences in frequency and subtle meaning differences are reflected in diverging SI distribution patterns. This presupposes to uncover the semantic profile of the given connectives as well as their interaction with the surrounding discourse. In our view, analyses of the SI potential inherent to connectives cannot do without systematic corpus analyses combining distributional data and semantic intuitions. In our presentation, we will develop how we proceed to determine the SI of a connective and how this SI level accounts for the substitution effects created in (1) (when addressed to a traffic agent who pulls you over because you ignored a ‘one way’ traffic sign), while (2) is perfectly acceptable. (1a) Ik had haast, #dus /daarom hield ik me niet aan het inrijverbod. (1b) J'étais pressé, #donc /c'est pourquoi j'ai pris le passage interdit. (1c) I was in a hurry, ‘so’ / ‘that's why’ I ignored the one way sign. (2a) Ik had haast, dus /daarom nam ik een taxi. (2b) J'étais pressé, donc /c'est pourquoi j'ai pris un taxi. (2c) I was in a hurry, ‘so’ / ‘that's why’ I took a taxi. References Degand L, Pander Maat H to appear A contrastive study of Dutch and French causal connectives on the Speaker Involvement Scale. In Verhagen A, van de Weijer J (eds) Levels in Language and Cognition. Pander Maat, H. & Degand, L. in press Scaling causal relations and connectives in terms of Speaker Involvement. Cognitive Linguistics. 159 Corpus-based identification of temporal organisation in discourse Anne Le Draoulec, Marie-Paule Péry-Woodley ERSS, Université Toulouse Le Mirail 1. Introduction We report here on a corpus-based study of a particular form of discourse organisation, inspired by the notion of temporal discourse frames. 
In Charolles’ theory of Discourse Framing (Encadrement du Discours, Charolles 1997), a discourse frame is described as the grouping together of a number of sentences which are linked by the fact that they must be interpreted with reference to a specific criterion, realised in a frame-initial introducing expression. For instance, as regards perspective framing, According to X, … provides an essential element for the interpretation of the proposition which follows, and also potentially of several subsequent propositions – as frame-introducing expressions are characterised by their ability to extend their scope beyond the sentence in which they appear (cf. Péry-Woodley 2000). In the case which concerns us – temporal frames –, the introducing expressions are mostly adverbial expressions such as today, subordinate clauses such as when he left, and prepositional phrases such as {in | until | since | about…} 1989. Temporal expressions can occur in various places in the sentence, indeed they constitute a typical case of these mobile elements whose position (initial, median or final) has interested many generations of linguists (cf. Bally 1944, Firbas 1972, Givón 1979 inter alia). The best studied contrast is between initial and final position: a functional difference has been established between autonomous initial adverbials, which as adjoints have a scene-setting role outside the proposition, and final adverbials which have no autonomy and express a circumstance only modifying the proposition. The distinction is also expressed in terms of syntactic integration in the sentence: initial position is equated with non-integration, final position with integration. Yet position is not the whole story as regards autonomy, or integration: there is a complex interplay between position and punctuational detachment (a comma is a visual indication of detachment which argues for non-integration). Combining position and punctuation yields the following possibilities: initial-detached, initial-non-detached, final-detached, final-non-detached. However, initial position is recognised as having a major cognitive impact (cf. Lambrecht 1994), and in any case, for this position, punctuation seems somewhat erratic. On the other hand, there is a real difference between a detached and a non-detached final adverbial: detachment goes against the integration associated with final position. What is the relation between these classical theoretical accounts, and discourse framing? Charolles (1997, p.15) asserts that frame introducers are sentence adjoints, and suggests that they most often appear in initial position. As his focus is on the elaboration of the notion of discourse framing, he does not go into a systematic analysis of the nature of introducers. His intuition about frame introducers seems perfectly well founded, as there is an obvious link between autonomy (or non-integration) of an expression, and its ability to open a frame: the characteristic of a frame introducer is indeed to scope over several sentences. Our objectives in this corpus study are twofold: (a) Gather data to document the scoping potential of sentence initial temporal expressions, as well as validate the postulate of a difference in scoping ability linked to integration1. 
A first contribution of this study will be to provide a fairly systematic examination, in an extensive corpus, of phenomena which have until now been observed separately, mostly in theoretical mode and with made-up examples as concerns integration, and on the basis of limited illustrative data for the issue of scope. At this stage, we choose to concentrate on the most clearly-contrasted cases: initial (detached or non-detached) and final non-detached temporal expressions.1 Within these, we focus on prepositional phrases (PPs), for methodological reasons which we explain in the next section. (b) Examine the extension of frames. The left boundary is established by the introducer. But are there linguistic clues to the final boundary of a temporal frame?
1 An earlier corpus-based study (Thompson 1985), though set in a different theoretical framework, established the same kind of position-linked difference in the case of initial and final purpose clauses.
2. Corpus and method
Our initial corpus, the Atlas corpus, is composed of a single text: the book "Atlas de la France scolaire"2. This choice was motivated by the fact that, as a geographical and historical account of schooling in France, this book is essentially structured in a spatial and temporal way.
2.1. Preliminaries
Our decision to focus our analysis on PPs reflects in the first instance a methodological choice: the choice to simplify the parameters to be taken into account for the automatic identification of temporal expressions. Yet it is not completely arbitrary, as we observed that the temporal expressions in our corpus were mostly PPs. For instance, a cursory survey came up with an extremely small number of temporal subordinate clauses. Our method can be said to be semi-automatic: the initial automatic detection based on lexical markers was followed by a manual selection of the expressions we viewed as relevant. For the lexical markers, we constituted two lists: one of temporal nouns, based on a study by Berthonneau (1989), and another containing temporal prepositions. We constructed our search filters3 in such a way that we could detect:
- PPs composed of any preposition followed by a date (represented by a regular expression) or a temporal noun. Our list of nouns includes 78 items, among which: année, jour, février, siècle, moment, période, début.
- PPs introduced by a temporal preposition, i.e. a preposition which does not require a temporal noun to form a temporal expression: avant, après, pendant, durant, lors, dès, depuis, jusque.
We are well aware that these patterns leave out a number of temporal expressions, for instance those composed with a non-specialised preposition and an event noun (à la rentrée des classes, etc.), or those which are not introduced by a preposition (l'année précédente, deux ans après). Our familiarity with the corpus allows us to feel confident that our patterns cover the great majority of the temporal PPs that could be detected. In a study centred on discourse phenomena, we needed extensive context, measured in a discursively relevant unit: we opted for retaining extracts of three paragraphs, one on either side of the one containing the PP.
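As an illustration of the two detection patterns just described, the sketch below approximates them in Python. It is not the authors' actual Yakwa query: the noun and preposition lists are abbreviated from those quoted above, and the date regular expression is an assumption introduced here for illustration only.

```python
import re

# Minimal sketch of the two detection patterns described above (not the authors'
# actual Yakwa queries). Lists are abbreviated; the date regex is an assumption.

TEMPORAL_NOUNS = {"année", "jour", "février", "siècle", "moment", "période", "début"}
TEMPORAL_PREPS = {"avant", "après", "pendant", "durant", "lors", "dès", "depuis", "jusque"}
ANY_PREP = {"à", "au", "aux", "en", "entre", "vers", "sous"} | TEMPORAL_PREPS

DATE = re.compile(r"^\d{4}(-\d{2,4})?$")        # e.g. 1989, 1958-59, 1986-1990

def is_temporal_pp(tokens):
    """Pattern 1: any preposition + a date or temporal noun.
       Pattern 2: the PP starts with a specifically temporal preposition."""
    if not tokens:
        return False
    prep, rest = tokens[0].lower(), [t.lower() for t in tokens[1:]]
    if prep in ANY_PREP and any(DATE.match(t) or t in TEMPORAL_NOUNS for t in rest):
        return True                              # pattern 1
    return prep in TEMPORAL_PREPS                # pattern 2

if __name__ == "__main__":
    print(is_temporal_pp(["en", "1958-59"]))                 # True (pattern 1)
    print(is_temporal_pp(["depuis", "plusieurs", "années"])) # True (pattern 2)
    print(is_temporal_pp(["dans", "la", "classe"]))          # False
```

A manual pass of the kind described next would still be needed to discard false hits (for example non-temporal uses such as en même temps).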
Among the PPs identified, the following were manually eliminated as non-relevant: - non-temporal PPs: expressions such as en même temps in line 13 of example (1), which does not have a temporal meaning in this context; - temporal expressions which only bear on one sentence constituent (such as a noun or an adjective), or which are embedded in subordinate or parenthetical clauses. These expressions, which have a status below the sentence, seem to be of limited interest for a study of scoping properties. This sorting task was mostly concerned with eliminating non-relevant items, but occasionally, as in the analysis of the extended example (1) below, we have also retained expressions which, although not detected by our automatic procedure, are also relevant for our study. For example, we consider trente ans plus tard, (line 6), as sharing with the automatically identified surrounding PPs the discourse properties under study4. 2 Hérin & Rouault 1994. We thank P. Enjalbert et N. Malandain (Greyc, University of Caen) for making this text (56346 words) available to us. 3 This search was executed with Yakwa, a text analysis tool developed by L. Tanguy (an illustrated description is available on www.univ-tlse2.fr/erss/membres/tanguy/Yakwa.html). We are very grateful to L. Tanguy for his readiness to help. 4 Such an adjustment is somewhat arbitrary. Yet, in some cases, it is perhaps less so than allowing technical considerations to strictly determine the selection of expressions. 161 2.2. Analysis of an example We begin by presenting an extract from our corpus, which will serve as a basis for explaining our analytical approach (temporal PPs appear in italics): (1) 1. Si l'on définit la démocratisation de l'enseignement par l'accès d'un nombre croissant de jeunes aux formations et 2. aux diplômes de l'enseignement secondaire et de l'enseignement supérieur, il y a bien eu, en une génération, 3. démocratisation du système éducatif français. En effet, entre la fin des années 1950 et le début des années 1990, 4. le nombre des collégiens a triplé, celui des lycéens quadruplé et celui des étudiants sextuplé. Les taux de 5. scolarisation des adolescents et jeunes adultes ont considérablement augmenté: en 1958-59, à peine 20% des 6. jeunes de dix-huit ans étaient scolarisés; trente ans plus tard, la proportion dépasse 80% (apprentis compris) et 7. dans le même temps le nombre des admis au baccalauréat est passé d'à peine 60 000 à plus de 400 000. 8. Toutes les catégories sociales ont bénéficié de l'allongement des scolarités et de l'accès de plus en plus ouvert aux 9. formations des cycles secondaires et supérieurs: le nombre des étudiants français d'origine ouvrière était de 10. 30 000.au début des années 1960, il est de 130 000 au début des années 1990; celui des étudiants provenant 11. des catégories sociales aisées, patrons, professions libérales, cadres supérieurs, est passé dans le même temps de 12. 100 000 à 350 000 environ. Mais le rapprochement de ces derniers chiffres, s'il confirme l'ouverture des niveaux 13. supérieurs de formation aux jeunes des milieux économiquement modestes, indique en même temps les limites 14. de la démocratisation du système éducatif. Certes, la proportion d'étudiants d'origine ouvrière (salariés agricoles 15. et personnels de service compris) dans les universités a presque doublé au cours des vingt dernières années; 16. mais elle n'est encore que de 12%, alors que plus de 40% des jeunes d'âge scolaire sont d'origine ouvrière. Et 17. 
près du tiers des étudiants des universités proviennent des milieux aisés, alors que les enfants des familles de 18. cadres supérieurs et des professions libérales ne forment guère que 10% de la population scolarisable.
We will concentrate here on the analysis of two sets of tokens which lend themselves particularly well to exhibiting the functional difference between sentence-initial and sentence-final temporal expressions (from now on ITEs and FTEs). The first set consists of the ITEs: entre la fin des années 1950 et le début des années 1990, en 1958-59, trente ans plus tard, dans le même temps. The second one consists of the FTEs: au début des années 1960, au début des années 1990.5 The first set exemplifies the framing characteristics described in the introduction. Entre la fin des années 1950 et le début des années 1990 opens a frame which extends beyond the sentence where it appears: the interpretation criterion thus provided scopes over a number of sentences which constitute a segment, within which the next three ITEs (en 1958-59, trente ans plus tard, dans le même temps) open sub-frames.6 This can be visually represented in the following manner (diagram: the frame opened by Entre la fin des années 1950 et le début des années 1990 encloses the sub-frames en 1958-59, trente ans plus tard and dans le même temps). There is no such temporal framing with the expressions in the second set. The organisation is provided by the anaphoric link between the clauses, and the FTEs are only temporal modifiers within the clause (in other words, they only provide a specification).
5 dans le même temps in the same paragraph is not dealt with in our study as it is in a median position, a position which will be examined in further work. Though it appears to function here in the same way as an FTE, we do not want to prejudge the equivalence of these two positions.
6 The interpretation of whether an ITE opens a new frame or a sub-frame depends crucially on world knowledge: it is world knowledge that tells us that the periods 1958-59 and trente ans plus tard are included in entre la fin des années 1950 et le début des années 1990.
3. The scope of ITEs
As one of our main objectives is to investigate the scoping properties of temporal expressions, our first task is to examine the way in which ITEs structure the text. And insofar as they open segments which can extend over several sentences, the first question is to determine what may signal the final boundary of such segments. In other words, what indicates to the reader that the interpretation criterion provided by the ITE no longer applies? This preoccupation leads us to establish a distinction between ITEs opening a frame and ITEs opening a sub-frame: the above example (lines 3 to 7) shows that only the former can have an extended scope, whereas the scope of the latter is clearly limited to individual items of the enumeration. In our corpus, we have identified a number of parameters associated with the end of a frame (shown in our examples by ¨). Two major types of clues are directly connected to time: the first is the occurrence of another temporal expression referring to a time not included in the period denoted by the frame introducer; the second is a change of tense. Another clue of a more general nature is a change of paragraph. Further work will be needed to determine precisely the interaction between these clues. We can however already present some observations. It might seem that the presence of an ITE introducing a new frame should be a sufficient clue to a closing frame boundary.
And yet, in our corpus, an overwhelming number of these cases display a change of tense as well, as in the example below which combines the two parameters: the opening of a new frame is marked both by another ITE (en 1989) and by a change from a perfect to a present tense. (2) En juin 1992, 747 500 candidats se sont présentés à l'examen, dont 35 000 candidats individuels ; près des trois quarts ont été reçus ; mais pour les candidats individuels le taux de réussite a été à peine de 50 %. Pour la série collège (85 % de l'ensemble des candidats), 76 % des candidats des établissements scolaires ont obtenu le brevet, ceux des collèges privés sous contrat réussissant mieux que ceux des collèges publics (85 % de taux moyen de succès dans les premiers, 74 % dans les seconds). ¨ En 1989, tant les collégiens du privé que ceux du public ont de meilleurs résultats dans les départements des académies de l'Ouest où les élèves du privé sont nombreux, d'Orléans -Tours, Reims et Grenoble, ainsi que dans les Midis aquitain et méditerranéen. We may wonder why we do not find in our corpus any example of a date change without a tense change, even when this would be possible (as in (2) where the present could easily be replaced by a perfect). A hypothesis might be that tense should reflect the temporal difference, and indeed this is what happens in many cases. However, our above example intriguingly contradicts this hypothesis: the change from perfect to present does not follow a temporal logic, as the second frame is temporally anterior. What is at work here is definitely a discursive logic: change of tense is a way to make very clear to the reader that she is moving to a new segment. Change of tense does not inherently guarantee a frame boundary. But in cases when, in the absence of a new temporal expression, the tense changes from non-present to present, this can suffice to indicate such a boundary, as the use of the present is then generally an implicit reference to the time of utterance. The following example is a case in point, with a transition from perfect to present, made even more explicit by the temporal adjective actuelle in the NP la tendance actuelle. (3) Depuis le début des années 1960, la composition du corps enseignant a été diversifiée : les disciplines, catégories indiciaires et grades ont été multipliés, avec l'apparition de CAPES et de CAPET artistiques et techniques et la création de certains corps enseignants (PEGC et plus récemment, PLP1 et PLP2). ¨ Même si la tendance actuelle est à la simplification, les grades restent nombreux (près d'une quinzaine) de même que les statuts (titulaire, titulaire académique, titulaire remplaçant, stagiaire, non -titulaire). Le corps professoral demeure hétérogène. Rather than change of tense, the relevant clue is therefore the transition from non-present to present7. 7 This is a somewhat simplified account: with depuis, things are much more complex, as the period referred to includes the present time. And in French, as opposed to English, this property may lead to a present tense being used to express a situation which extends from past to present: Depuis 1993, j'habite à Toulouse / Since 1993, I have been living in Toulouse (English requires a perfect). 
In example (3), the present tense in la tendance actuelle est is a “true” present which only refers to the time of utterance, and implies that we are outside the frame introduced by depuis (which would be confirmed by the translation with a present tense in English: the present tendency is...).
Another example is given in (4), where two different past tenses (perfect and imperfect) coexist within the same frame opened by De la fin du siècle dernier jusqu'aux années 1950, and contrast with the present tense which is necessarily outside this frame.
(4) De la fin du siècle dernier jusqu'aux années 1950, l'école primaire a été le pilier du système scolaire français. Elle inculquait les connaissances de base, lire, écrire et compter, qui serviraient toute la vie. Elle avait aussi pour mission de former les citoyens de la République. Elle délivrait le certificat d'études qui, pour le plus grand nombre, attestait de la réussite des études et marquait l'entrée dans le monde du travail. ¨ Les sessions du certificat d'études n'ont plus lieu.
The third frame boundary clue mentioned above is change of paragraph, which is often regarded as linked to topic change (cf. Longacre 1979, Bessonat 1988). In our data, it appears to be indeed a relevant clue, as in examples (2) and (4). It is however not necessary, as shown in example (3), where a change of frame occurs without a change of paragraph. Is it sufficient? We have no case in which a new paragraph is still to be interpreted as part of an earlier-introduced frame: so a new paragraph seems to be a sufficient clue to indicate a frame boundary. Is it sufficient by itself? In most cases, the new paragraph is associated with some of the above clues. But we also find instances where a paragraph change, in the absence of any temporal clue, implies a frame boundary. In example (5), the paragraph change is accompanied neither by a new temporal expression nor by a tense change:
(5) Malgré cette progression rapide et générale, les écarts restent sensibles entre les académies et les départements. En 1990, la proportion de bacheliers par classe d'âge n'atteint pas 40 % dans les académies de la grande couronne parisienne. Elle se situe entre 40 et 45 % dans le Nord, en Lorraine et en Alsace, dans les académies de Toulouse, Limoges, Grenoble, Rennes ainsi qu'en Corse. En Île-de-France les écarts sont écrasants entre Paris (67 % de bacheliers dans la classe d'âge) et la Seine-Saint-Denis (29 %). Dans l'ensemble, la moitié sud du pays continue à avoir de plus fortes proportions de bacheliers que la moitié nord, Bretagne et, à présent, Lorraine exceptées. Mais ces différences s'atténuent. ¨ L'allongement des scolarités, dont l'augmentation continue du nombre des bacheliers donne la mesure, se traduit par la fréquence sans cesse croissante des scolarités longues, poursuivies jusqu'au terme des études secondaires, sans préjuger des entrées de plus en plus nombreuses dans les formations supérieures.
At this point in our study, we may feel that we have satisfactory confirmation of the specific functional role of ITEs. By extending their scope over several sentences, in conditions that we have gone some way towards exploring, they play a part in discourse organisation and more specifically discourse segmentation. We have focused on ITEs, as we started from the accepted view that FTEs can only give a time specification, and not open a frame. In this perspective, it should be out of the question for FTEs to share the scoping properties of ITEs. And yet, reconsideration of the behaviour of FTEs in the corpus questions this initial dichotomy.
We find examples of FTEs which appear to have extended scope:
(6) Le nombre d'élèves par classe stagne dans les collèges depuis plusieurs années : la baisse sensible du nombre d'élèves s'accompagne en effet de nombreuses suppressions de postes d'enseignants, récupérés pour les lycées.
(7) En moyenne, 600 à 700 écoles ont été fermées chaque année depuis 1980 : quelques-unes par fusion entre l'ancienne école de filles et l'ancienne école de garçons ; d'autres dans le cadre d'un regroupement pédagogique intercommunal ; les plus nombreuses par baisse des effectifs au-dessous du seuil de fermeture d'une école réduite au fil des années à une seule classe […]
(8) Plus des trois quarts des 800 000 élèves qui étaient en 5e en 1991-92 sont passés en 4e. La proportion est en augmentation continue depuis quatre ou cinq ans. En revanche, les orientations vers les préparations aux CAP de l'enseignement technique (2 % des élèves) sont de moins en moins fréquentes - près de 10 % des élèves sont maintenant orientés vers les 4e technologiques.
In each of these examples, one might think that the FTE provides an interpretation criterion which applies to the next sentence(s). Section 4 below proposes a closer look at the data in order to establish whether one can legitimately consider that FTEs have scoping properties, thus questioning the classical view of the discourse role linked to sentence position.
4. Do FTEs have scoping properties?
On closer examination, it appears that the sentences which might appear to be within the scope of an FTE always enter into specific relations with the sentence specified by this FTE. We approach these discourse relations within the framework of RST (Mann & Thompson 1988), without presenting the model, but only providing the elements which are necessary for our purposes. In examples (6) and (7), we identify respectively relations of Non-Volitional Cause and Elaboration: in the sentence(s) after the colon, elements are given which provide a cause for the situation described in the proposition specified by the FTE, or develop this proposition. As these relations do not in any way involve temporal considerations, it seems logical that the temporal specification continues to apply. The same is true of the Contrast relation in (8): a Contrast relation clearly extracts the elements in opposition, while all others are assumed to remain constant. Here, the contrast is established between two school orientations, and by default the time period is assumed to remain the same. We would therefore argue that what is at stake here is not a scoping property of FTEs, but a default continuation of a temporal specification. Nevertheless, we are aware that this point remains excessively based on the theoretical postulate that we precisely wanted to validate with our data. It would undoubtedly be more satisfactory to find empirical evidence to support our argument. In order to eliminate the possibility of temporal specification by default, we must envisage relations with temporal implications. It is in such cases that the scoping property may be fully brought to light. Sequence is such a relation: by definition, it involves progress in time.
If we find a temporal expression which extends over several sentences in a Sequence relation, we can be sure that this extension corresponds to a frame scope, and not to a default continuation: it cannot be a default if the relation implies a moving forward of the temporal reference point. We re-examined our data in this perspective, looking for this kind of configuration. In accordance with our hypothesis, we found no FTE exhibiting what would constitute evidence of a scoping property. Unfortunately, we did not find positive evidence either, i.e. ITEs extending their scope over sentences in a Sequence relation. It became clear to us that this could be due to the expository nature of our corpus. So with the same method of identification of temporal PPs, we investigated a corpus made up of film summaries and film reviews,8 more likely to contain narrative elements. Indeed, in this new corpus, the configuration we were looking for was frequent, and, in accordance with our expectations, only with ITEs. We give below two relevant examples:
(9) Flash-back : pendant l'été 1945, les bombardiers américains B-29 déversent des tonnes de bombes incendiaires sur la ville de Kobe. D'innombrables victimes périssent dans un gigantesque incendie. Parmi les rares survivants, dans un paysage d'Apocalypse, Seita un jeune garçon de quatorze ans protège sa petite sœur de quatre ans, Setsuko. [...]
(10) Au début des années 50, le poète Pablo Neruda, sous le coup d'un mandat d'arrêt au Chili parce que communiste, arrive en exil en Italie avec sa femme et s'installe dans une petite île du Sud. Mario Ruoppolo, fils de pêcheur en chômage, est recruté par le chef du bureau de poste local comme facteur auxiliaire dont Neruda sera l'unique client. Après une prise de contact assez froide, les deux hommes sympathisent […]
In both cases, we have a sequence of events which must be interpreted as taking place in the period referred to by the ITE. The existence of such examples, together with the absence of similar patterns with FTEs, at last provides clear evidence of what we wanted to show: only for ITEs is it appropriate to use the notion of scope. Another dimension is opened up by some work in progress seeking to compare the potential of ITEs for extending their scope and delimiting segments of text, with what happens with certain kinds of titles. Our Cinema corpus provides us with several examples of interest to this question. They contain temporal expressions which, although their status is unclear and needs investigating, share some characteristics with titles (in particular they constitute an autonomous punctuational unit): what is worth noting is that these temporal expressions behave very much like ITEs in their ability to scope over a – potentially large – number of sentences. Thus, Juillet 1914 in (11) works in a very similar way to pendant l'été 1945 in (9) or Au début des années 50 in (10).
8 The Cinema corpus (81397 words), extracted from two Internet sites, the television channel Canal+ and the daily newspaper Libération.
(11) Juillet 1914. Le château du vieux comte Pascal de Sainteville dans la France profonde entre "campagne bucolique" et "montagne magique". Pierre Mercadier, riche rentier et professeur dilettante d'histoire, Paulette, sa femme, qui "n'aime rien ni la musique ni l'amour" et leurs enfants, dont Gabriel, 12 ans, débarquent pour l'été chez leur oncle Pascal.
Ils sont déçus de constater que le comte, ruiné, a dû louer une partie du château aux Pailleron : Ernest, ouvrier puis contremaître qui a épousé Blanche, la fille du patron, avant de devenir patron lui-même, Suzanne, leur fille et Yvonne, son amie pianiste. Le charme de Blanche séduit Pascal, Pierre et Gabriel, mais Pierre a la préférence. Gabriel entreprend son initiation aux choses de l'amour avec Suzanne et Yvonne sans craindre de les faire souffrir alternativement. [...] 5. Conclusion The corpus analysis has enabled us to throw light on several aspects of temporal discourse framing. As concerns the distinction between initial and final temporal expressions, the homogeneous (expository) nature of the Atlas corpus was at once a disadvantage and an advantage. It did not provide data capable of clearly establishing the specificity of ITEs as regards scoping. But it revealed that all the contexts that could be interpreted as pointing to a similar scoping behaviour of ITEs and FTEs displayed the same kind of discourse relation. In turn, this observation led us to formulate the hypothesis that texts with a narrative component would be more likely to provide suitable evidence. And indeed this was the case. This little detour urged us to look into the relationship between discourse framing and another form of text organisation, i.e. discourse relations, as regards temporal structure. Furthermore, it clearly points to the necessity to take genre into account in any corpus-based study. As a preliminary analysis, the work presented here simplifies the issue in dealing exclusively with the two most distinct cases. To be exhaustive, it will need to be completed by a study of median and detached final temporal expressions. Our intuition is that their scoping behaviour is similar to that of FTEs, but this needs to be supported by a corpus analysis such at this one. Our second objective was to investigate the linguistic signalling of frame boundaries. We have started exploring three clues which seem to be commonly involved in the identification of the end of a frame. But we are very conscious that this exploration must be pursued much further, if it is to account for the complexity of the data. We remained somewhat evasive as to the first clue: occurrence of a temporal expression referring to a new time. In our examples the frame boundary coincided with the opening of a new frame by a second ITE. Yet we found examples (which we did not mention for the sake of simplicity) where what appeared as a new segment only contained a temporal specification: that this should be enough to indicate the end of the preceding frame shows that the end of a frame does not necessarily signify the opening of a new one. This has to do with the fact that the theory of discourse framing constitutes a partial model of text organisation: as not all segments are frames, some other inter-segment organising principles must intervene, such as those (mentioned above) that call upon discourse relations. Furthermore, we came across some examples where the discourse relation alone seemed to imply the end of a frame, in particular when a comment is made on information presented within a frame9. As regards the second clue (change of tense from non-present to present), the observations we made are indeed representative of the data in our corpus. 
However, given that in the French tense system, the perfect for instance is related to both present and non-present, the interaction between tenses is likely to lead to a much more complex situation As we are interested in text segmentation, the third clue (change of paragraph) should be looked at in relation to other forms of visual segmentation in the light of studies such as those of Virbel (1989, 2001) and Nunberg (1990). 9 As in the following example, where the comment on what has happened depuis quelques années (“for the last few years”) must clearly be interpreted outside this same temporal frame: Depuis quelques années, en prenant pour base de référence l' annonce ministérielle de l'objectif d'amener les trois quarts des jeunes de chaque classe d'âge aux niveaux de formation correspondant à la classe Terminale de lycée (1985), les caractéristiques principales du système éducatif français sont en cours de rapide mutation. Quelques chiffres donnent la mesure de ces changements : […] 166 A final comment: given our objective of finding clues to final boundaries, we acted as though these boundaries could always be unambiguously established. But we realise that there are also many cases when the boundary is left unspecified, without it hampering the reading process. There is nothing surprising about this, as underspecification is a general linguistic principle. However it would be worth quantifying this tendency with regard to temporal framing, and extending the quantitative study to different types of framing, such as the perspective framing mentioned in the introduction. References Bally C 1944 Linguistique générale et linguistique française, 2ème édition. Berne, Francke. Berthonneau A-M 1989 Composantes linguistiques de la référence temporelle - Les compléments de temps, du lexique à l'énoncé. PhD thesis, Université de Paris VII. Bessonat B 1988 Le découpage en paragraphes et ses fonctions. Pratiques 57: 81-105. Charolles M 1997 L'encadrement du discours : Univers, champs, domaines et espaces. Nancy, Université de Nancy2. Firbas J 1972 On the Interplay of prosodic and Non-prosodic Means of Functional Sentence Perspective. In Fried U (ed), The Prague School of Linguistics and Language Teaching. London, Oxford University Press. Givón T 1979 On Understanding Grammar. New-York / San Francisco / London, Academic Press. Lambrecht K 1994 Information structure and sentence form. Topic, focus and the mental representations of discourse referents. Cambridge, CUP. Longacre RE 1979 The paragraph as a grammatical unit. In Givón T (ed) Syntax and semantics vol 12. Discourse and syntax. New-York / London, Academic Press, pp 115-134. Luc C, Virbel J to appear 2001 Le modèle d'architecture textuelle - Fondements et expérimentation. Verbum 23 (1). Mann WC, Thompson SA 1988 Rhetorical Structure Theory: toward a functional theory of text organization. Text 8 (3): 243-281. Nunberg G 1990 The Linguistics of Punctuation. Menlo Park, CSLI. Péry-Woodley M-P 2000 Cadrer ou centrer son discours ? Introducteurs de cadres et centrage. Verbum 22 (1): 59-78. Thompson S A 1985 Grammar and Written Discourse: Initial vs. Final Purpose Clauses in English. Text 5 (1-2): 55-84. Virbel J 1989 The Contribution of Linguistic Knowledge to the Interpretation of Text Structures. In J. André J, Quint V, Furuta RK (eds), Structured Documents. Cambridge, CUP. 167 Measuring morphological productivity: Is automatic preprocessing sufficient? 
Stefan Evert & Anke Lüdeling, Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Germany. e-mail: {evert, anke}@ims.uni-stuttgart.de 1. Morphological productivity In this paper we want to focus on a small facet of morphological productivity: on quantitative measures and their applicability to “real life” corpus data.1 We will argue that – at least for German – there are at present no morphological systems available that can automatically preprocess the data to a quality necessary to apply statistical models for the calculation of productivity rates.2 Before coming to the quantitative aspects we want to clarify the notion morphological productivity. Morphological productivity has long been a topic in theoretical morphology (see for example Schultink 1961, Aronoff 1976, van Marle 1985, and Plag 1999). It has been defined in many ways. We choose a definition by Schultink (1961, p. 113) which contains three aspects that are important to us: We see productivity as a morphological phenomenon as the possibility for language users to coin unintentionally an in principle unlimited number of new formations, by using the morphological procedure that lies behind the form-meaning correspondence of some known words.3 The three important aspects are unintentionality, unlimitedness, and regularity. They are all interdependent. The first aspect – unintentionality – helps us to distinguish between productivity (which is a linguistic rule-based notion) and creativity (which is a general cognitive ability and cannot be captured within morphology alone): Words formed by productive processes are often not recognized or noticed as new words (this is true for speaker and hearer) while words formed by other (creative) processes are carefully produced and are perceived as new words. The second aspect is unlimitedness – if productive word formation patterns are in principle unlimited, it is not possible to give a finite list of words (some implications of this are discussed below). Both unlimitedness and unintentionality require that the words formed by a given process are morphosyntactically and semantically regular. Theoretical and descriptive works on word formation mostly focus on what Baayen (1992) calls the qualitative aspect of productivity: the morphological, phonolocical, syntactic, semantic and other restrictions of a specific word formation process are studied. The adjective-forming suffix –bar (roughly comparable to English –able), for example, takes as bases transitive activity verbs (lesbar “readable” from transitive lesen “to read”, but not *schlafbar from intransitive schlafen “to sleep” or *wissbar from stative wissen “to know”), the noun-forming circumfix Ge- -e does not take prefix verbs (Geschimpfe “continued or iterative scolding“ from simplex schimpfen “to scold” but not *Gebeschimpfe from prefixed beschimpfen “to insult”). The goal is an intensional description of the possible bases for a given word formation process. Next to qualitative approaches to productivity, quantitative approaches are suggested (Baayen 1992, Baayen and Lieber 1991, Baayen 2001). These aim at calculating the probability of finding a new word formed by a given morphological process in a text once a given amount of text is sampled (see Section 2). What is the relevance of quantitative approaches to productivity? What can quantitative approaches tell us that qualitative approaches cannot? 
At first glance the qualitative view of productivity differs fundamentally from the quantitative view: a pattern X can be fully productive in the qualitative analysis – it applies to all possible bases – but unproductive in the quantitative analysis, if there is a finite number of bases and all the words that pattern X can possibly form from these bases have been sampled. The probability of encountering new words formed by that word formation process is then 0. But qualitative and quantitative approaches really complement each other. First of all, quantitative approaches cannot be used without careful linguistic interpretation, as we will see below. From the opposite perspective, quantitative studies help us to find out more about the nature of word formation processes: Baayen and Neijt (1997) show, for example, that the words formed by some word formation patterns really belong to two different distributions: the lexicalized words behave statistically differently from the productive words. This insight is made possible by studying the shape of the distribution and analysing its anomalies. See also Baayen and Lieber (1991) for an overview of how linguistic knowledge and quantitative results profit from each other. In addition to being a valuable facet of the linguistic study of productivity, quantitative approaches also make a significant contribution to computational morphology. Many applications (for example machine translation, dialogue systems, text-to-speech systems) have to deal with unseen text. This involves not only parsing unseen sentences but also analysing new words. Because of productivity it is not possible to have a finite lexicon containing all the words of a language. Quantitative productivity studies help us determine which patterns can be listed and for which patterns rules have to be formulated. In Section 2 we will introduce a few statistical notions necessary for the understanding of productivity measures and sketch how statistical models dealing with productivity work. These methods depend on corpus data. But using corpus data is quite problematic, as discussed in Section 3. The data have to be thoroughly preprocessed before quantitative measures can be applied. We will investigate how automatic preprocessing compares with manual preprocessing in Section 4.
1 We illustrate our points with German data from the StZ corpus, which contains roughly two years of the newspaper Stuttgarter Zeitung (1991–1993, 36 million tokens). We suspect that the problems that arise when corpus data is used for quantitative approaches to productivity are similar for other languages.
2 We will not deal with the statistical models or the mathematics behind them in this paper. Excellent work on morphological productivity has been done by Harald Baayen in a series of articles over the last 10 years (starting with Baayen 1992). He gives a detailed overview of the LNRE models that can be used to calculate productivity in Baayen (2001).
3 Our translation of “Onder produktiviteit als morfologisch fenomeen verstaan we dan de voor taalgebruikers bestaande mogelijkheid door middel van het morfologisch procédé dat aan de vorm-betekeniscorrespondentie van sommige hun bekende woorden ten grondslag ligt, onopzettelig een in principe niet telbaar aantal nieuwe formaties te vormen.”
2. Vocabulary growth curves
A minimal understanding of the statistical ideas is essential as a background for the following sections.
We do not have the space to introduce the statistical assumptions and models in detail and have therefore tried to define the necessary notions rather intuitively. For a formal introduction to LNRE models refer to Baayen (2001). For a quantitative approach to productivity, we need a mathematical definition of the degree of productivity, based on observable quantities. Following Schultink's definition of morphological productivity, we define the vocabulary of a morphological process as the number of types (i.e. different lemmata) that the process can potentially generate. A productive pattern is, in theory, characterised by an infinite vocabulary (cf. the notion of unlimitedness in Schultink's definition), whereas a totally unproductive pattern is expected to have a finite, and often quite small, vocabulary. In order to estimate the vocabulary size S of pattern X from a corpus, we look at the subset of the corpus consisting of all tokens generated by X (e.g. all tokens with the suffix –bar), in the order in which they appear in the corpus. Let V be the number of different types among the first N tokens in the subset. Plotting V against N, we obtain vocabulary growth curves as shown in Figure 1. If we had an infinite amount of data, we would eventually sample the entire vocabulary of X, and V would converge to S. The left panel of Figure 1 shows the typical shape of the vocabulary growth curve of an unproductive process. After enough data has been sampled, the curve flattens out and converges to a constant value, the full vocabulary size S. In the case of an infinite vocabulary, V would continue to grow indefinitely (see the right panel in Figure 1).
Figure 1: Typical shapes of vocabulary growth curves (idealised).
Baayen (1992) uses the slope P of the vocabulary growth curve after the entire corpus has been sampled as a measure of productivity (P is referred to as the productivity index of pattern X). Intuitively speaking, P is the probability that the next token generated by X is a previously unseen type, i.e. a new complex form. Under certain simplifying assumptions we find that P = n1/N, where n1 is the number of hapax legomena (words occurring only once, cf. Baayen 2001, Chapter 2). Unfortunately, P cannot be used to compare the productivity of affixes with substantially differing sample sizes. If we look at the left panel in Figure 1, we see that for the full sample, P = 0, as we would expect from a completely unproductive process. However, had we looked at the first 15 or 20% of the curve (e.g. because we are working on a small corpus, or sampling a rare pattern), we would have observed a much larger value of P. Larger, perhaps, than that of the productive process in the right panel (for the full sample). Since we do not know where in the curve we are, we cannot simply compare P values for processes of different sizes. In order to compare different processes, we therefore need to be able to extrapolate the value of P to larger sample sizes N, or, equivalently, extrapolate the shape of the vocabulary growth curve. We cannot rely on standard statistical models for this purpose because the type frequency distributions of productive patterns are so-called LNRE distributions (for “large number of rare events”), in which low-frequency types (including hapax legomena, but also types occurring two, three, etc. times) account for a major part of the vocabulary.
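The quantities just defined – the growth of V against N, the hapax count n1, and the productivity index P = n1/N – can be computed directly from the token stream of a pattern. The following minimal sketch is not the authors' code, and the token sample is invented purely for illustration; it only shows how V, n1 and P relate to the token sequence.

```python
from collections import Counter

# Minimal sketch (not the authors' code) of the quantities defined above:
# the vocabulary growth curve V(N) and the productivity index P = n1/N,
# where n1 is the number of hapax legomena in the sample.

def vocabulary_growth(tokens):
    """Return the vocabulary size V after 1, 2, ..., N tokens."""
    seen, growth = set(), []
    for tok in tokens:
        seen.add(tok)
        growth.append(len(seen))
    return growth

def productivity_index(tokens):
    """P = n1 / N for the full sample (Baayen 1992)."""
    freqs = Counter(tokens)
    n1 = sum(1 for f in freqs.values() if f == 1)   # hapax legomena
    return n1 / len(tokens) if tokens else 0.0

if __name__ == "__main__":
    # Toy sample of -bar tokens, invented for illustration only.
    sample = ["lesbar", "machbar", "lesbar", "fahrbar", "denkbar",
              "machbar", "befahrbar", "lesbar"]
    print(vocabulary_growth(sample))              # [1, 2, 2, 3, 4, 4, 5, 5]
    print(round(productivity_index(sample), 3))   # 3 hapaxes / 8 tokens = 0.375
```

On such a tiny sample the value of P is of course meaningless; real estimates require the large samples and the extrapolation discussed next.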
Ordinary statistical techniques are based on assumptions that do not hold for LNRE distributions (the most prominent one being the law of large numbers). Hence, Baayen (2001) introduces specialised models, for which parameter estimates are obtained from the counts of low-frequency types in the corpus (the frequency spectrum). These models can then be used to extrapolate the vocabulary growth curves to larger values of N. Baayen's comparison of the degrees of productivity of different patterns is based on the predicted vocabulary sizes. Any errors in the type counts have a direct influence on the model parameters, and thus on the predicted vocabulary size. It is therefore essential to correct the input data for the errors described in Section 3. 4 Baayen (2001) also describes more sophisticated models (adjusted and mixture distributions) which are estimated from the (known) shape of a pattern's vocabulary growth curve. Just as is the case for the simpler models, errors in the type counts have a significant influence on the shape of the vocabulary growth curve and hence on the predicted vocabulary size. 3. Problems with corpus data Since we want to compute the probability P of a new complex word formed by a given word formation pattern in a given text type (for example newspaper data or technical text), and because the productivity of word formation patterns is highly dependent on text type (neoclassical words, for example, are much more likely in scientific texts than in everyday language; see also Baayen 1994), it is necessary to sample a corpus of that text type. Dictionary data cannot be used for our purpose for two reasons: (i) Dictionaries often contain obsolete words but typically do not contain regular newly formed words. (ii) The advanced statistical models introduced in Section 2 are based on vocabulary growth curves that can only be computed from corpus data. If we want to apply the measures described in Section 2 to real corpus data we encounter a number of problems. Lüdeling, Evert, and Heid (2000) describe these problems and show that the data have to be preprocessed before they can be used. In this section we want to summarise the problematic factors briefly, using two adjective-forming suffixes (–sam and –bar) for illustration. The intuition is that –sam is unproductive while –bar is productive. If we simply extract all adjectives ending in the letters sam and bar from the StZ corpus and calculate P from the frequency spectra, we get very similar graphs that suggest that both processes are productive (the solid lines in Figure 2). A closer look reveals that the data have to be corrected for the following factors: · incorrect data o mistagged items o misspelt items o corpus structure: repeated articles and sections influence the frequency distribution · linguistic factors: many complex words that look like they are formed by the word formation process in question are really formed by other processes: o compounds: these make up the largest portion of the hapax legomena of most word formation patterns. We have to distinguish between complex words where the compounding happens before the derivation (with the structure ((stem+stem)+affix)) and cases in which something attaches to an already affixed word (stem+(stem+affix)). The former cases have to be counted as new types while the latter cases have to be counted as instances of the affixed word. We will discuss an example in Section 4.
o other complex bases: likewise we have to distinguish between cases in which the affix in question attaches before other affixes or after them. For –bar adjectives that also contain a prefix, for example, we have to decide whether the prefix is attached to the –bar adjective (like the negation prefix un- which attaches to verzichtbar in unverzichtbar “indispensable” and thus should not be counted as a new type for the -bar pattern) or to the verb which is the base for –bar adjectivisation (fahrbar “ridable” is formed from the simplex verb fahren “to ride” while befahrbar “passable” is formed from the prefix verb befahren “to pass over, to drive on” – in this case we have two new types). o words that “accidentally” end in the same letters: Balsam “balm”, for example, does not contain the suffix –sam. o words formed by creative rather than by productive processes: kinobetriebsam is a creative formation merging betriebsam “busy” and Kinobetrieb “movie theatre business”. As we emphasised in Section 2, the correct application of the guidelines is so important because many of the problematic cases are hapax legomena and thus have a direct influence on the productivity index P. In Lüdeling, Evert, and Heid (2000) we manually corrected the data for –sam and –bar according to these guidelines. The dotted lines in Figure 2 show the vocabulary growth curves for the corrected data. They now conform to intuition: -sam is clearly unproductive while the curve for -bar still shows a productive pattern. Figure 2: Raw and manually corrected vocabulary growth curves for –sam and –bar. 4. Automatic preprocessing vs. manual preprocessing It is not feasible to correct the input for all word formation processes manually since some word formation patterns are very large. It would, therefore, be desirable to preprocess the input automatically. In this section we want to explore how automatic preprocessing compares to manual correction. We will find that automatic preprocessing – at least with the tools that are available to us – is not yet a usable alternative to manual preprocessing. Before we describe the programs we used we want to recapitulate the requirements for morphological preprocessing: in order to perform like a human corrector, a morphology program would have to be able to (a) find spelling and tagging mistakes, (b) make a morpheme analysis, and (c) provide the 171 correct hierarchical structure for complex words consisting of more than two morphemes. At least for German, there are no programs available that meet these requirements for unseen words.5 Because no fully fledged morphology analyser is available, we wanted to find out how much preprocessing can be done automatically with the available tools. We chose: · MORPHY, a freely available German morphological analyser (see Lezius, Rapp, and Wettler 1998). It was developed mainly for the analysis and lemmatisation of inflected forms, as well as context-sensitive part-of-speech tagging. In our experiment we used the built-in lexicon of MORPHY together with a simple heuristic compound analysis. · DMOR, a two-level morphology system developed at the IMS Stuttgart (cf. Schiller 1996). Like MORPHY, it is intended mainly for generating and analysing inflected forms, but it includes compound analysis as well: compounds are analysed as sequences of stems and linking elements (somewhat restricted by simple heuristic patterns). The DMOR lexicon contains 65,000 stems (mostly simplexes, but some complex words are listed). 
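To give a rough idea of what such stem-lexicon-based compound analysis involves, here is a toy Python sketch. It is our own illustration, not MORPHY's or DMOR's actual algorithm, and the stem list, linking elements and example words are invented.

```python
# Rough sketch (not MORPHY's or DMOR's actual algorithm) of a heuristic
# compound analysis of the kind described above: segment a word into known
# stems, optionally joined by a linking element. Lexicon and words are toys.
STEMS = {"garten", "haus", "kind", "zeitung"}
LINKERS = ["", "s", "es", "n", "en", "er"]

def split_compound(word, lexicon=STEMS, linkers=LINKERS):
    """Return a list of stems if the word can be segmented, else None."""
    w = word.lower()
    if w in lexicon:
        return [w]
    for i in range(1, len(w)):
        head, rest = w[:i], w[i:]
        if head not in lexicon:
            continue
        for link in linkers:
            if rest.startswith(link):
                tail = split_compound(rest[len(link):], lexicon, linkers)
                if tail:
                    return [head] + tail
    return None

print(split_compound("Gartenhaus"))     # ['garten', 'haus']
print(split_compound("Kinderzeitung"))  # ['kind', 'zeitung'] (via linker -er-)
print(split_compound("Balsam"))         # None: no segmentation into known stems
```

A segmenter of this kind only proposes flat sequences of stems; it says nothing about derivation or about the hierarchical structure of the result, which is exactly the limitation discussed next.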
Neither program analyses derivational morphology. The built-in stem lexicon of DMOR is considerably larger than that of MORPHY. We manually corrected the output for a number of word formation patterns and compared the vocabulary growth curves of uncorrected (raw), manually corrected and automatically corrected data. Figure 4 shows the manually and automatically corrected plots for –bar and –sam. The automatically corrected curves lie between the uncorrected curve and the manually corrected curve in both cases. For –bar, the results produced by MORPHY (dashed lines) are hardly different from the uncorrected curve; DMOR (broken lines) performs somewhat better. For –sam, the automatically corrected curves are much closer to the manually corrected curve. Qualitatively, the autocorrected curves are similar to the manual result and hint at an unproductive process (although not as clearly as the manually corrected data). Why does automatic correction work so much better for -sam than for -bar? This is due to the fact that almost all of the hapax legomena in the –sam spectrum are compounds involving only very few head types. The compound analysis components of DMOR and MORPHY seem to be able to deal with these cases quite well. The hapax legomena in the –bar spectrum are much more diverse: un- prefixations, many spelling errors and very few compounds – since MORPHY and DMOR do not have derivation analysis components, they cannot correct these cases. Figure 4: Vocabulary growth curves for –bar and –sam. The next process we look at is the diminutive suffix –chen which combines with nouns (and sometimes adjectives) to form nouns. Here again we have a pattern where the automatically corrected curves are 5 There are a number of morphology programs which analyse complex words http://services.canoo.com/MorphologyBrowser.html http://wortschatz.uni-leipzig.de/ http://www.linguistik.uni-erlangen.de/cgi-bin/orlorenz/dmm.cgi As far as we can see from the available versions, they all work with finite lexicons. There are no downloadable tools or word lists. None of them provides hierarchical structure. 172 somewhere between the uncorrected curve and the manual curve, with DMOR (with its larger lexicon) giving considerably better results than MORPHY (Figure 5). Note, however, that the manually corrected curve has become much shorter, i.e. the number of tokens has been reduced. This is due to the fact that many words “accidentally” end in –chen (such as Groschen “ten-pfennig piece”, Drachen “dragon”, or Zeichen “sign”). Because of the resulting smaller sample size the manually corrected curve cannot be compared directly to the other curves (compare Section 2). However, the shape of the former suggests that –chen is much less productive than the automatically corrected curves would predict. Figure 5: Vocabulary growth curves for –chen. To summarize the results so far: The improvement achieved by automatic preprocessing differs considerably for the various processes. It never reaches the quality of manual preprocessing: in the case of –bar there is little improvement over the raw data, in the case of –sam the automatically corrected data comes close to the manual curve, and for –chen, automatically corrected curves are approximately in the middle between the uncorrected curve and the manual curve. For –chen, we also observe the largest difference between the two morphology systems. 
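To make the effect of such corrections tangible, the following sketch (ours; the exclusion list and the compound mapping are invented toy examples, not the authors' correction tables) drops accidental string matches such as Groschen or Zeichen and reduces a stem+(stem+affix) compound to its affixed head before P is computed. On the toy data the uncorrected sample looks maximally productive, while the corrected one does not.

```python
# Sketch only (not the authors' correction tables): the kind of manual
# correction described above, applied before the productivity measures are
# computed. The exclusion list and the compound mapping are toy examples.
from collections import Counter

FALSE_POSITIVES = {"Groschen", "Drachen", "Zeichen"}   # merely end in "chen"
REDUCE_TO_HEAD = {"Gartenhäuschen": "Häuschen"}        # stem+(stem+affix): count
                                                       # as the affixed head

def correct(tokens):
    out = []
    for tok in tokens:
        if tok in FALSE_POSITIVES:
            continue                          # accidental string match: drop
        out.append(REDUCE_TO_HEAD.get(tok, tok))
    return out

def productivity_index(tokens):
    """P = n1 / N (hapax legomena over tokens)."""
    freqs = Counter(tokens)
    n1 = sum(1 for f in freqs.values() if f == 1)
    return n1 / len(tokens) if tokens else 0.0

raw = ["Häuschen", "Groschen", "Gartenhäuschen", "Bäumchen", "Zeichen"]
print(productivity_index(raw))            # 1.0: inflated by false positives
print(productivity_index(correct(raw)))   # ≈ 0.33 on the corrected sample
```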
Since we cannot predict how good the results are compared to manually corrected data, automatic preprocessing is no real alternative to manual correction. However, it would seem that the morphology systems always lead to some improvement and should therefore always be applied (manual checks may then be done on the automatically cleaned-up data). Yet when we look at another diminutive pattern (suffixation with –lein) and a process that is semantically similar (compounding with klein “small”), we see a strikingly different pattern (Figure 6). The vocabulary growth curves of klein compounds show the familiar pattern of a gradual improvement from raw to manually corrected data. If we look at the plots for –lein, the uncorrected curve and the manually corrected curve are very similar to those of the klein compounds. However, when we automatically correct the data with DMOR we obtain a curve that is far below the manual curve, and hence predicts a considerably lower productivity than we actually find. This is due to the fact that DMOR contains the noun stem Lein “flax” and therefore analyses many derived words as compounds with the head Lein. 173 Figure 6: Vocabulary growth curves for klein and –lein. One could argue that the Lein/-lein homography is an isolated case and can easily be corrected if we disallow reduction to the simplex Lein. The curves are given in Figure 7, which shows the familiar pattern where the automatically corrected curves are between the manually corrected curve and the uncorrected curve. Figure 7: Vocabulary growth curves for –lein (DMOR curve corrected). However, Figure 8 shows the same problem for the noun-forming affix –tum: again, the curve produced by MORPHY is above the reference curve, whereas the curve produced by DMOR is below it. –tum is an interesting case: DMOR finds roughly the same number of types as the human annotator, but a considerably larger number of tokens. Although this would explain the seemingly lower productivity indicated by the DMOR curve, the actual types are different, due to two factors: as in the case of –chen, there are many words ending with the letter sequence tum which are not derived by the suffix –tum (examples are Datum “date”, Faktum “fact”, or Quantum “quota”) – DMOR mistakenly counts these types. The other factor has to do with hierarchical structure: There are many compounds of the type NOUN+reich+tum which DMOR reduces to Reichtum “wealth”. The correct structure, however, is ((NOUN+reich)+tum) which means that these cases have to be counted as different types (an example is Kinderreichtum “having many children” which is derived from kinderreich “prolific”, and not a compound of Kinder “children” and Reichtum). Here we need careful linguistic analysis – a simple automatic solution is not possible. 174 Figure 8: Vocabulary growth curves for –tum. 5. Conclusion Corpus data are necessary for the application of statistical models to the quantitative analysis of morphological productivity. We summarised Lüdeling, Evert, and Heid (2000), who showed that corpus data have to be thoroughly preprocessed before they can be used in these models. Since manual preprocessing is not feasible for large word formation patterns, we wanted to find out whether automatic preprocessing is a usable alternative. We found that although automatic correction, based on the currently available morphology systems, yields an improvement over the uncorrected data in many cases, it is no replacement for manual correction.
In some cases, automatic preprocessing is possible – if only as a basis for further manual correction. However, the over-compensation we observed for –lein and –tum shows that fully automatic preprocessing may produce misleading results. Without a manually corrected reference curve we do not know where the automatic curves for a given word formation process lie. This means that we cannot even use automatic preprocessing as a reliable basis for manual correction. Only a morphology system which, in addition to derivation and compounding components, includes a model of the order in which processes operate on a simplex form (producing a hierarchical analysis of complex words) can be expected to lead to sufficiently reliable automatic correction results for quantitative studies of productivity.6 6 Work leading to such a morphology system is currently under way in the DeKo project (Schmid et al 2001). 6. References Aronoff M 1976 Word Formation in Generative Grammar. Cambridge, MIT Press. Baayen R H 1992 Quantitative aspects of morphological productivity. Yearbook of Morphology 1991: 109 – 150. Baayen R H 1994 Derivational productivity and text typology. Journal of Quantitative Linguistics 1: 16 – 34. Baayen R H 2001 Word Frequency Distributions. Dordrecht, Kluwer Academic Publishers. Baayen R H, Lieber R 1991 Productivity and English derivation: a corpus-based study. Linguistics 29: 801 – 843. Baayen R H, Neijt A 1997 Productivity in context: a case study of a Dutch suffix. Linguistics 35: 565 – 587. Lezius W, Rapp R, Wettler M 1998 A freely available morphological analyser, disambiguator, and context sensitive lemmatizer for German. In Proceedings of the COLING-ACL, pp 743 – 747. Lüdeling A, Evert S, Heid U 2000 On measuring morphological productivity. In Proceedings of the KONVENS 2000, Ilmenau, pp 57 – 61. Plag I 1999 Morphological Productivity. Structural Constraints in English Derivation. Berlin, Mouton de Gruyter. Schiller A 1996 Deutsche Flexions- und Kompositionsmorphologie mit PC-KIMMO. In Hausser R (ed), Linguistische Verifikation. Dokumentation zur Ersten Morpholympics 1994. Tübingen, Niemeyer. Schmid T, Lüdeling A, Säuberlich B, Heid U, Möbius B 2001 DeKo: Ein System zur Analyse komplexer Wörter. In Proceedings der GLDV, Giessen. Schultink H 1961 Produktiviteit als morfologisch fenomeen. Forum der Letteren: 110 – 125. van Marle J 1985 On the Paradigmatic Dimension of Morphological Productivity. Dordrecht, Foris. 176 Linguistic clues for corpus-based acquisition of lexical dependencies Cécile Fabre, Didier Bourigault ERSS - UMR 5610 Université de Toulouse Le Mirail, 5 allées A. Machado, 31058 Cedex, France. {cfabre,bourigault}@univ-tlse2.fr 1 SYNTEX: a tool for the extraction of lexical dependencies Disambiguation of prepositional phrase attachment is a crucial issue for all NLP applications that need textual resources enriched with syntactic knowledge. We are confronted with this problem in the process of designing SYNTEX, a shallow parser specialised in the extraction of lexical dependencies (such as adjective/noun, or verb/noun associations) from French technical corpora. These word-to-word associations will be used as material for the construction of semantic classes on a distributional basis. In this context, the first step towards the automatic discovery of such dependencies is to determine to which word a preposition must be attached.
For example, in the phrase disséquer le plateau rocheux en chevron, taken from a corpus in the domain of geomorphology1, the preposition en may potentially be attached to any of the three words disséquer, plateau, rocheux, as verbs, nouns or adjectives (and also adverbs) may govern a prepositional phrase. As lexico-syntactic information is part of what we want to extract from the text, we cannot rely on prior lexical resources: it is our belief, based on in-depth studies of corpora from technical domains, that words exhibit idiosyncratic uses from one domain to the other, not only at the semantic level, but also regarding their syntactic properties. As a consequence, our parser relies as much as possible on corpus-based information to solve the ambiguities of syntactic attachment. Our second claim is that such disambiguation process can be performed on the basis of linguistic clues, with only limited use of statistical measures. In this paper, we describe a method which relies on the search for linguistic indications to perform syntactic disambiguation. Our parser is based on two main ideas: first, the detection of unambiguous contexts is used as a starting point for the processing of ambiguous contexts. Second, the notion of productivity (more reliable than the simpler notion of frequency) is proposed to assess the strength of a word/preposition association: we define the productivity of a (word,preposition) pair as the ability of the word to appear with this preposition in various contexts. These two basic ideas are refined with the help of further linguistic clues. In particular, we use morphological information and we take into account semantic similarity between the words to evaluate the likelihood of the association between a word and a preposition. We first describe the modules of SYNTEX which perform prepositional phrase attachment. We present the various information that the analyser exploits to resolve ambiguity, focusing on the notion of productivity. We then analyse the linguistic relevance of this empirical notion of productivity: is it feasible to use productivity measure as a means to differentiate between various levels of lexical relations, and particularly to draw a frontier between arguments and non-arguments? The paper will present the first results in favour of this hypothesis. 2 Description of the strategy for PPs attachment Prepositional attachment resolution is usually considered as the first procedure that permits the delimitation of phrases for automatic parsing (Brent 1993, Basili et al. 1999). This is crucial for NLP applications such as text retrieval or information extraction (Grefenstette 1994). Our objective is also to provide data about lexico-syntactic relations between words that will help to construct semantic resources. Classes of words can be defined on the basis of their distributional properties, following the propositions of Harris (Harris et al., 1999). For example, in the geomorphology corpus, we are able to identify a set of four nouns - alluvions, sable, dépôts, cendres - all sharing at least two lexico-syntactic contexts with each other: - same V PREP N contexts: 1 All the examples in the paper are extracted from this geomorphology corpus. We are grateful to Danièle Candel (INALF) who has made it available. 
177 disparaître sous (des alluvions, du sable, des dépôts) enfouir sous (des alluvions, les dépôts) creuser dans (les alluvions, le sable, les cendres) tailler dans (des alluvions, des cendres) - same N PREP N contexts: manteau de (alluvions, dépôts) banc de (alluvions, sable) These four nouns indeed appear to be semantically close, all denoting some sort of sediment. The extraction of syntactically and lexically related pairs of words is therefore a means to discover corpus-specific classes with semi-automatic techniques. Research initiated in French towards this goal has suffered from the lack of automatic parsers for this language: (Habert et al. 1996), (Assadi and Bourigault 1995) were both taking as input the results of a NP parser, LEXTER (Bourigault, 1994), and were not able to exploit data concerning verbs. SYNTEX provides an extension to LEXTER by identifying all types of lexical dependencies involving verbs, nouns and adjectives. 2.1 An inductive approach The identification of lexical dependencies in texts may be performed by different techniques. The first consists in the projection of a-priori lexical resources. There are major well-known problems with this approach: firstly, such resources, which should not only contain information related to argument structure but also deal with complementation and modification phenomena, are simply not available. The process of constructing such lexical bases is long and costly, but most of all, constructing a-priori resources is an endless task since idiosyncratic uses of words are found in corpora. Work on technical corpora demonstrates that such texts exhibit a great variety of lexico-syntactic properties regarding how words associate with each others. Of course, corpora contain also highly predictible data: for example, we find that in the geomorphology domain, the verb débiter associates with the preposition en without a determiner, from occurrences such as débiter en blocs, en chicots, en granules, en lamelles, etc. This construction is described in any French dictionary. But corpus observations also show many configurations that are totally idiosyncratic. This is especially true concerning noun modification. To illustrate this point, we can point among numerous examples to the existence of a lexical pattern of the form N pour det N with the non argument-taking, non relational, headnoun salle (room), (in: salle pour l'étude des minéraux, salle pour la détermination des argiles). In this corpus, nouns very often appear with the prepositions en (vallée en canyon, section en cluse) and à (section à méandre, cirque à source), which are much more frequently used as postverbal prepositions in other types of texts. Since these properties do not concern argument selection and are very unstable from one corpus to another, such data cannot be listed to be used on any type of texts. Another argument against the use of prior resources is the variety of prepositional attachment patterns that exist for a single lexeme. If subcategorization, complementation and modification phenomena (Grimshaw 1990) are all taken into account, it appears that many verbs and nouns can potentially associate with a great range of prepositions. This is the case even in a limited domain. For example, the verb accumuler, besides the transitive construction, gets constructed with six different prepositions, corresponding to locative and instrumental interpretations (e.g. dans le creux, derrière l'obstacle, sur le névé). 
As a consequence, the listing of all complementation alternatives – if possible at all – will not in itself be sufficient to reduce ambiguity. Alternatively, other approaches are proposed to learn this information from corpora in order to avoid the preconstruction of lexical resources. Our strategy is in line with research adopting distributional methods to solve the ambiguities of syntactic analysis. (Brent 1993) was the first to take into account lexical association measures in texts to identify verbs' arguments. A similar approach has been adopted by (Hindle and Rooth 1993) or (Manning 1993), who defined methods to discover verbs' subcategorization frames in texts. More recently, (Federici et al. 1999) combine the tabula rasa approach and inductive techniques for the parsing of Italian texts. Similarly, our parser uses information extracted from the whole corpus to infer local decisions, exploiting lexical redundancy in technical corpora to acquire patterns of prepositional attachment. But the novelty of our approach can be characterized as the conjunction of three options: - SYNTEX deals not only with subcategorization but also with any type of lexical dependency involving prepositional attachment, - it maximizes the use of linguistic clues, limiting the development of statistical methods to perform disambiguation, - it is mainly based on the productivity measure, a key notion to reach a decision in ambiguous texts. 2.2 Disambiguation rules SYNTEX is based on inductive learning of complementation properties. The parser looks for triplets (governor, preposition, governee) linked by a dependency relation2. This relation may correspond to subcategorization patterns - as in the triplets (pénétrer, dans, pore) or (accessible, à, bateau) - or it may illustrate non-argumental associations, such as verb + circumstant – (déplacer, à, vitesse) -, noun + expansion (côte, à, fjord), etc. Learning is performed in two steps: first, properties of lexical associations are acquired from unambiguous contexts; second, they are used as indications to solve ambiguous cases throughout the corpus. This strategy is very close to the approach described in (Federici et al. 1999) for shallow parsing of Italian. More generally, it is not uncommon to see parsers exploiting unambiguous contexts to limit the complexity of the analysis. The novelty of SYNTEX is in the criteria used to identify the triplets likely to correspond to genuine dependency relations. 2.2.1 Detection of unambiguous contexts Learning is first performed by detecting attachment zones in the corpus. Such zones are delimited to the right by a preposition and to the left by a frontier, which may not, or may only rarely, be crossed by prepositional attachment. Such frontiers are punctuation marks, verbs, prepositions other than de and à, or typographical items such as parentheses. In the following examples, these zones are enclosed within brackets and underlined. i. On tend de plus en plus à ]insérer cette science dans] une géographie physique globale ii. transformations du relief des versants], conséquences des actions de l' homme sur] le sol Obviously, most frontiers are not entirely reliable. The prepositional link can be established over interpolated verbal phrases, over prepositions, etc., as in the two following examples, where the preposition and its governor are underlined: iii. La puissance est utilisée , on s'en souvient , en partie par le transport de la charge iv.
On peut mesurer la vitesse d'infiltration d'une goutte d'eau déposée sur la roche par sa disparition de la surface The segmentation module is therefore, for the moment, very rudimentary, but it enables us to considerably limit the complexity of the attachment procedure. As a first approximation, we have measured the silence due to "preposition jump" (a preposition finds its governor over a preposition other than de or à) as around 5% on one of our technical corpora, which is a relatively small proportion. Within these zones, all lexical units are viewed as potential governors of the preposition. Example ii contains three nominal candidates (conséquences, actions, homme), corresponding to three potential triplets: (conséquence, sur, sol), (action, sur, sol), (homme, sur, sol). One condition must be met for these triplets to be used in the learning process: they have to be found in unambiguous contexts - containing no other potential governor for the preposition. For example, the triplet (glisser,sur,bord) is extracted from the following unambiguous context: v. des nappes de gravité ]glissent sur] le bord des surélévations 2.2.2 Acquisition of reliable dependency relations Unambiguous cases are thus the starting point of the parser. The purpose of the acquisition module is then to use unambiguous contexts as clues for the resolution of ambiguous cases. Yet, we cannot consider all information found in unambiguous contexts as reliable. Several criteria are taken into account. Given a triplet (Gvr, Prep, Gvee) found in an ambiguous context, it is considered as a possibly genuine lexical relation if: 2 The governor is the word governing the prepositional phrase. It belongs to the categories verb, adjective or noun. The governee is the word governed by the preposition. It may be a noun (as in insérer dans une géographie) or an infinitive (as in tendance à former). Both appear in lemmatized form in our results. 179 · Rule 13: The same triplet has been found in an unambiguous context. Example: in the ambiguous context indique une légère tendance à l'enfoncement, where the preposition may be governed by the verb indique or the noun tendance, the latter is considered as the more probable governor, because the triplet (tendance, à, enfoncement) has been found in an unambiguous context. · Rule 2: The pair (Gvr, Prep) is productive. Productivity of a (Gvr, Prep) pair equals the number of different governees with which the pair (Gvr, Prep) occurs in the corpus. A word is considered as productive with a given preposition, if it combines with at least two different governees. For example, given the following unambiguous contexts found in the corpus, it appears that the verb disséquer is productive with the preposition en, with a productivity of 5. Figure 1: productivity of a (governor, preposition) pair The productivity measure allows us to assess the likelihood of a word being used as a governor of a preposition on a more reliable basis than the simpler frequency measure. High productivity indicates that the governor regularly associates with this preposition, in various occurrences. To illustrate this point, we can oppose three cases: - unfrequent and unproductive association of a word and a preposition: the string annuler en général is found only one time in the corpus. In this case, the occurrence corresponds to the association of a verb and an adverbial phrase (meaning generally), and it does not indicate that the verb is constructed with the preposition en. 
- frequent but unproductive association: the strings an en moyenne, croûtes en Afrique, allonger dans la direction are each found three times in the corpus. These findings may correspond to genuine lexical dependencies but they cannot be used to infer a regular association between the head and the preposition. - productive (and necessarily frequent) association: the pair (disséquer, en) seems to indicate a regular relationship between the verb and the preposition because they appear together in various contexts. · Rule 3: A word morphologically linked to Gvr is productive with the preposition. The exploitation of such morphological links is useful because of the richness of the morphological system in French. Relations between subcategorization properties of verbs and argument-taking nominals are not systematic. Yet, they are sufficiently frequent to provide a clue for disambiguation. We use a lexical resource, Verbaction4, which provides an inventory of process nominals that are constructed on a verbal base. This information is used to solve ambiguities such as the following: vi. la formation à blocs ayant soliflué plus rapidement, ]avec glissement rapide sur] l' arène lente The only information that the corpus provides on the two potential pairs (glissement, sur) and (rapide, sur), is that the verb glisser, morphologically related to the noun glissement, has been found as governor of the preposition sur in eight unambiguous contexts (example v is one of them). This indication is used to make the hypothesis that glissement may be a governor of the preposition sur. · Rule 4: A word semantically linked to Gvee has been found as governee of the pair (Gvr, prep). A word is considered as semantically related to another, if both have at least two governors in common in the corpus. They share some distributional properties. They are also said to be semantic neighbours. 3 The numbering of the rules does not indicate priority of application. 4 This morphological resource has been compiled by Nabil Hathout at the National Institute of French Language (INaLF). [Figure 1: productivity of a (governor, preposition) pair – the unambiguous contexts disséqué parfois en récif, disséquer en récif, disséqué en dents de scie, disséquer en terrasses, disséqués en bois de renne, disséquée en chevron yield the governees récif, dent, terrasse, bois, chevron for the pair (disséquer, en), i.e. prod=5.] vii. nous] terminerons par quelques notes sur] la morphologie de la lune et de Mars. The noun notes has been found as governor in an unambiguous context with a semantic neighbour of the noun morphologie, namely forme. The proximity of the two nouns morphologie and forme has been established on account of their sharing two unambiguous contexts: viii. compréhension de + det + (forme, morphologie) ix. s'intéresser à + det (forme, morphologie) The construction of semantic classes, which is the objective of our work, is thus also sketched out during the syntactic analysis to provide indications for prepositional attachment. 2.2.3 Resolution of ambiguous contexts Ambiguous cases are solved by using this combination of clues. In the following example, rules 1 (same triplet) and 2 (productivity) are used to choose between the three potential governors. x. L'érosion a disséqué le plateau rocheux en chevrons. The verb is productive (prod=5) and it has been found with the same governee in an unambiguous context. The noun plateau has been found only one time with the preposition en in an unambiguous context, with another governee (des plateaux en interfluve); the adjective rocheux is not found with this preposition.
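To make the mechanics of rules 1 and 2 concrete, here is a schematic Python sketch. It is ours, not the SYNTEX implementation: the class, its methods and the toy triplets are invented, and rules 3 and 4 (morphological relatives and semantic neighbours) are left out; in SYNTEX they would act as further clues before the system gives up.

```python
# Schematic sketch of rules 1 and 2 (not the SYNTEX implementation): choose a
# governor for a preposition among the candidates of an attachment zone, using
# triplets and (governor, preposition) productivity learned from unambiguous
# contexts. All names and toy data are invented for illustration.
from collections import defaultdict

class AttachmentModel:
    def __init__(self):
        self.triplets = set()               # (gvr, prep, gvee) seen unambiguously
        self.governees = defaultdict(set)   # (gvr, prep) -> set of governees

    def learn(self, gvr, prep, gvee):
        """Record a triplet extracted from an unambiguous context."""
        self.triplets.add((gvr, prep, gvee))
        self.governees[(gvr, prep)].add(gvee)

    def productivity(self, gvr, prep):
        """Rule 2: number of distinct governees seen with (gvr, prep)."""
        return len(self.governees[(gvr, prep)])

    def choose_governor(self, candidates, prep, gvee):
        """Rule 1 first (same triplet attested), then highest productivity
        (a pair counts as productive only with at least two governees)."""
        attested = [c for c in candidates if (c, prep, gvee) in self.triplets]
        if attested:
            return attested[0]
        best = max(candidates, key=lambda c: self.productivity(c, prep))
        return best if self.productivity(best, prep) >= 2 else None

model = AttachmentModel()
for gvee in ["récif", "dent", "terrasse", "bois", "chevron"]:
    model.learn("disséquer", "en", gvee)        # prod(disséquer, en) = 5
model.learn("plateau", "en", "interfluve")      # prod(plateau, en) = 1

# Example x: candidates for "en chevrons" are the verb, the noun, the adjective.
print(model.choose_governor(["disséquer", "plateau", "rocheux"], "en", "chevron"))
# -> "disséquer" (same triplet attested, and the pair is productive)
```

In this toy run the verb wins on both criteria, mirroring the resolution of example x above.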
This resolution module is currently under development. At the moment, precision is 86%, which is very satisfactory, but the recall measure is only 60%. These results have been obtained by comparing the results of SYNTEX to prepositional attachment manually performed on several thousands occurrences in three different corpora. This relatively low recall corresponds to two situations: in the first case, no potential governor has been found in the attachment zone. Recall must therefore be increased by improving the segmentation module, through the definition of more flexible frontiers for the attachment zones. In the second case, no indication could be used to choose between several potential governors. It is due to lack of corpus evidence, so we must consider developping a default strategy, or using some amount of prior knowledge when all corpus-specific information has been exploited. 3 Productivity: a measure that helps to detect different levels of lexical dependency? A further question is at issue in this experiment: what levels of linguistic information are we able to point out from the observation of lexical associations in corpora? More precisely, we want to know if the strength of the association between a governor and the preposition, that we have measured in terms of productivity, can be used to describe the type of relation - subcategorization or adjunction - that holds between the two words. (Brent 1993) claims that there is a connection between frequency of occurrence and type of complementation: two words are more frequently associated if they are associated by a grammatical relation. According to this view, heuristics based on frequency of cooccurrence should therefore enable us to make this fundamental distinction between arguments and adjuncts. On the contrary, (Basili et al. 1999) think that this distinction cannot et should not be made by automatic means, because adjuncts equally contribute to the verb semantics and are as regularly associated with the verb as arguments. Observations made on corpora indeed show that the frontier between the two types of complementation is very difficult to draw, even when we want to manually determine prepositional attachment. Yet, we wanted to know if it was feasible to go beyond the simple diagnosis of prepositional attachment, and to rely on the productivity measure to try and detect different types of prepositional phrases. The last part of this paper is devoted to the presentation of the first results regarding this issue. 3.1 Variety of lexical dependencies In the previous sections, we have encountered examples of prepositional attachments that illustrate the diversity of semantic relations between a governor and a PP. 
SYNTEX resolves prepositional attachments corresponding to different types of lexical associations, namely: - an argument-taking element with its argument verb: s'enfoncer dans + det + (alluvions,eau, fond, surface) adj: sujette à + det + (bouleversement, gel,variation, émiettement) noun: déversement dans + det + (bassin, cuvette) 181 - an argument-taking element with a complement verb: disparaître dans (le bassin,le gouffre,le lac,le puits) noun: ruissellement en (nappes, rigoles, films) - non argument-taking element with a complement adj: active dans + det + (la baie, la zone) noun: équilibre entre (forces,puissances) noun: vallée à (flancs,replats) All these relations prove to be useful for the construction of semantic classes: words may of course be grouped because they share arguments, or they may be grouped because they are arguments of the same words. But non-argumental relations are also useful for the construction of homogenous sets of words. To illustrate this point, we can point to the fact that nouns that appear in the same list of complements in the previous examples are closely related, such as the three nouns nappes (sheet), rigoles (rills), films (films), or the four nouns bassin (basin), gouffre (chasm), lac (lake), puits (well). As a consequence, it would be very useful to propose some clues in order to differenciate between these types of dependencies. One track that we are currently following consists in the measurement of different types of productivity. Our objective is to find criteria to differentiate between argument relations and other levels of complementation relations. With this in mind, we have tried to compare two types of productivity: the productivity of governor-preposition pairs, exemplified so far, and the productivity of preposition-governees pairs. In the latter case, saying that the governee is productive with a preposition means that it occurs in the scope of various governors. For example, the pair (par, mer), separated by a determiner, is productive because it occurs under the government of six different verbs: battues coupé déposé + par la mer prod=6 envahie occupée recouvert Figure 2: productivity of a (preposition,governee) pair We have compared the lexical information that is acquired from these two different criteria. Our first conclusions are illustrated by the observation of two prepositions: sur and à, in verbal and nominal contexts. 3.2 Verbal complementation: V sur det N phrases We compare two lists: the first one (figure 3) is made up of 15 verbs that are found to be most productive as governors of the preposition sur (+ determiner) in our corpus. The first line of the table indicates that the verb reposer has been found before the preposition sur in unambiguous contexts with 23 different right contexts, which are all nouns (reposer sur le banc, reposer sur la couche, etc.). 
governor governees prod reposer banc,couche,critère,critérium,dos,fond,galet,horizon,lit,marne,mesure,plancher,précédent,pét ition de principe,reg,remarque,restitution,roche,sable,socle,substratum,surface,étude 23 renseigner climat,condition,constitution,degré,direction,intensité,morphogenèse,mouvement,nature,pro venance,rapport,relief,sens,topographie,valeur,évolution 16 situer bord,crête,dôme,emplacement,face,front,ligne,passage,plan,trajet,versant,équateur 12 emporter accumulation,altération,amplitude,creusement,exportation,mode,proportion,roche,élément,ér osion 10 rencontrer croupe,côte,flanc,granit,niveau,paroi,pente,revers,roche 9 trouver bord,emplacement,glacier,mer,partie,pente,planète,rivière,roche 9 établir année,bloc,couverture,fond,marne,permien,pénéplaine,socle,surface 9 appuyer chronologie,connaissance,couverture,interstratification,pierre,pointement,étude,îlot 9 glisser bord,couche,flanc,fond,neige,pente,plaque,substratum 8 opérer arbre,modèle, pente, sol, échantillon, élément 6 poser glacier, planèze, plaque, pupitre, socle, sol 6 tomber glacier, interfluve, planète, région, sol, versant 6 fonder distinction, glacio-eutatisme, granulométrie, inégalité, loi, repère 6 insister composition, fixisme, inefficacité, ouvrage, rôle, épigénie 6 localiser sur + det bordure, contact, côte, emplacement, retombée, versant 6 Figure 3: 15 most productive verbal governors with the preposition sur 182 The second list (figure 4) is made up of the 15 nouns that are found most productive (productivity>=4) as governees of the preposition sur in our corpus after verbs. The first line of the table indicates that the noun fond (bottom) has been found after the preposition sur with 15 different right contexts (affleurer sur le fond, ancrer sur le fond, etc.). governors governee prod affleurer,ancrer,arrêter,balayer,concrétionner,frotter,glisser,reposer,rouler,stratifier,transp orter,traîner,triturer,élever,établir fond 15 accélérer,arriver,cascader,condenser,descendre,disperser,déboucher,faire,former,glisser,i nfluer,observer,opérer,rencontrer,trouver pente 15 affleurer,descendre,exercer,faire,lire,localiser,paralyser,passer,remonter,situer,tomber,tra vailler,épandre versant 13 déposer,développer,emporter,faire,fixer,manifester,rencontrer,reposer,trouver roche 9 désintégrer,obtenir,reposer,réfléchir,voir,égaliser,épancher,établir surface 8 agir,basculer,opérer,poser,rouler,séjourner,tomber sol 7 déferler,exister,localiser,rencontrer,retrouver,régulariser côte 6 coller,exercer,plaquer,produire,rencontrer,réfléchir paroi 6 compléter,effectuer,indiquer,pouvoir,repérer,étudier terrain 6 déplacer,manquer,trouver,échelonner,étendre partie 5 effectuer,glisser,situer,trouver bord 4 affleurer,glisser,jouer,rencontrer flanc 4 aligner,épancher,étaler,étendre kilomètre 4 couler,peser,reposer,étaler lit 4 carboniser,détruire,situer,submerger sur + det passage 4 Figure 4: nouns most productive as governees with the preposition sur If we compare the two tables, we see that verbs that are most productive with the preposition sur all instantiate one of these two cases: - the PP headed by the preposition sur is subcategorized by the verb (reposer, renseigner, situer, emporter, appuyer, opérer, poser, fonder, insister), - the PP denotes a localization that is expected given the semantics of the verb, which are all spatial verbs (rencontrer, trouver, établir, glisser, tomber, localiser). 
In the second list, we see that all PPs (sur le fond, sur la pente, sur le versant) tend to behave as autonomous phrases, conveying spatial information. 3.3 Noun complementation: N à N phrases The second illustration regards the study of simple nouns, with no morphological link to a verb, related by the preposition à. We wanted to know whether different types of à N expansions would emerge by the application of the two productivity measures. The parser extracted 50 nominal governors and 54 nominal governees with the preposition à. A subset of these results (productivity >= 3) is presented in figures 5 and 6. governor prepositional link governees carte à 1.10 000, 1.200 000,1.80 000 cas à Java, Montserrat, Nantasket cas à + dét lahar, pays; pays-bas craie à bélemnite,micraster,silex crête à Porolithon,cheminée,clocheton côte à falaise,fjord,plage,ria,skjär,structure kilomètre à + dét Nord,dizaine,pôle méthode à + dét potassium,strontium,uranium roche à diaclase,feldspath,feldspathoïde,grain roche à + dét extérieur,minéral,soleil région à cuesta,nappe,permafrost,plateaux,saison,sous-sol zone à cristal,pergélisol,pluie Figure 5: (noun, à) productive pairs, with or without a determiner governors prepositional link governees ablation; plage; terrasse à + dét aval degré; maximum; éprouvette à + dét dessous Ouest; actif; haut; rigole à droite calcaire; ensemble; granit; granite; grès; leucogranit; roche à grain 183 gneiss; granit; micaschiste à mica altitude; base; forme à peine dépression; glacis; grès; zone à + dét pied courant; glacis; plan à + dét sens affaire; ciselure; face à + dét surface concavité; côte; côté; pente à + dét vent Figure 6: (à, noun) productive pairs, with or without a determiner First, a few remarks about these two tables. The results are not as good as those found on verbs, certainly because this second type of dependency patterns is more unstable, less recurrent than subcategorization patterns. Both tables contain erroneous triplets, illustrating some defects of the analysis. Some errors are due to the tagging procedure (actif, haut should have been tagged as adjectives). Others also come from the non recognition of verbal phrases, such as être le cas: the noun cas cannot be considered as an autonomous governor. Despite these problems, if we observe the (preposition, governee) pairs, we can see that the two tables exhibit rather different lexical relationships. Productive governees appear in three structures: - prepositional locutions (à peine, au sens) - circumstancial PPs (à l'aval, au dessous, à droite, au pied, à la surface) - in only 2 out of 10 cases, N à N compounds (à mica, à grain). As for productive governors, they are heads of N à N compounds in 9 out of 14 cases. In these cases, the prepositional expansion denotes a qualification (according to the classification made by (Cadiot 1997)). These first observations indicate that the productivity of the governor is an indication that we are dealing with PPs which are cohesive with the head noun or verb. Conversely, a high productivity of (preposition, governee) pairs is rather an indication for autonomous PPs. 4 Conclusion This paper reports the results obtained by our parser, SYNTEX, in the task of prepositional attachment disambiguation. 
The attachment strategy, which does not limit the focus to subcategorization patterns but processes any type of prepositional dependency involving verbs, nouns and adjectives, is based upon a combination of linguistic clues, namely: productivity of a (word, preposition) pair, evidence about morphologically related words or about words showing similar distributional behaviours. These first results indicate that the productivity measure, which finds echoes in other areas of linguistic research (see for example (Baayen and Renouf 1996) in morphology), is a very reliable criterion to assess the likelihood of a prepositional attachment and to extract lexical patterns from texts. Further work is needed to improve recall, and particularly to find further indications to make a decision between several potential governors when the system lacks corpus evidence. In this paper, we have also reported our first observations concerning the use of the productivity measure in the differentiation of arguments and non-arguments among prepositional phrases. The opposition between cohesive and autonomous phrases, which emerges from the data we have presented concerning both verbal and nominal patterns, must be further investigated. But these first results certainly indicate a contrast between the PPs that are detected by these two productivity measures. Our next objective is to integrate this distinction into the disambiguation strategy. 5 References Assadi H, Bourigault D 1995 Classification d'adjectifs extraits d'un corpus pour l'aide à la modélisation des connaissances. In Actes des 3èmes Journées internationales d'analyse des données textuelles (JADT95), Rome. Baayen R H, Renouf A 1996 Chronicling the Times: productive lexical innovations in an English newspaper. Language 72(1): 69-96. Basili R, Pazienza M T, Vindigni M 1999 Adaptive Parsing and Lexical Learning. In Proceedings of VEXTAL'99, Venice. Bourigault D 1994 Lexter, un logiciel d'extraction de terminologie. Application à l'acquisition des connaissances à partir de textes. Unpublished PhD thesis, Ecole des Hautes Etudes en Sciences Sociales, Paris. Brent M 1993 From Grammar to Lexicon: Unsupervised Learning of Lexical Syntax. Computational Linguistics 19(2): 243-262. Cadiot P 1997 Les prépositions abstraites en français. Paris, Armand Colin/Masson. Federici S, Montemagni S, Pirrelli V, Calzolari N 1998 Analogy-based extraction of lexical knowledge from corpora: the SPARKLE experience. In Proceedings of the First International Conference on Language Resources and Evaluation, Granada, pp 75-82. Grefenstette G 1994 Explorations in Automatic Thesaurus Discovery. London, Kluwer Academic Publishers. Grimshaw J 1990 Argument Structure. Cambridge, Massachusetts, MIT Press. Habert B, Naulleau E, Nazarenko A 1996 Symbolic word-clustering for medium-size corpora. In Proceedings of the 16th International Conference on Computational Linguistics, COLING, Copenhagen, pp 490-495. Harris Z, Gottfried M, Ryckman T, Mattick Jr P, Daladier A, Harris T, Harris S 1989 The form of information in science: analysis of immunology sublanguage. Dordrecht, Kluwer Academic Publishers. Hindle D, Rooth M 1991 Structural Ambiguity and Lexical Relations. In Proceedings of the 29th Meeting of the Association for Computational Linguistics, ACL, Morristown, pp 229-236. Manning C D 1993 Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Meeting of the Association for Computational Linguistics, ACL, Columbus, pp 235-242.
512 CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model R. Rossini Favretti, F. Tamburini and C. De Santis CILTA - University of Bologna - Italy {rossini,tamburini,desantis}@cilta.unibo.it A corpus of written Italian – CORIS – has been under construction at the Centre for Theoretical and Applied Linguistics of Bologna University (CILTA) since 1998 and will soon be completed and made available on-line. The project aims at creating a representative and sizeable general reference corpus of contemporary Italian designed to be easily accessible and user-friendly. CORIS contains 80 million running words and will be updated every two years by means of a built-in monitor corpus. It consists of a collection of authentic texts in electronic form chosen by virtue of their representativeness of written Italian. It is aimed at a broad spectrum of potential users, from Italian language scholars to Italian and foreign students engaged in linguistic analysis based on authentic data and, in a wider prospective, all those interested in intra- and/or interlinguistic analysis. Besides the defined model, a dynamic model (CODIS) has been designed, which allows the selection of subcorpora pertinent to specific research and also the size of every single subcorpus, in order to adapt the corpus structure to different comparative needs. A number of tools have been developed, both for corpus access and for corpus POS tagging and lemmatisation. 185 Going out in style? Shall in EU legal English Richard Foley, Researcher University of Lapland Rovaniemi, Finland 1. Introduction The ambiguity of shall vis-à-vis may and must and its excessive and inconsistent use, or “traditional promiscuity” (Garner 1998:940), in legal English, have attracted the attention of draftspersons and reformers of legal language in the United States, Canada, the United Kingdom and Australia. The evidence brought to bear in arguments for reform in these jurisdictions typically relies on court decisions in disputes over the meaning of the word and the intuitive perceptions of usage held by the reformer. The present study describes research which expands the debate on shall to a new context – the legal language of the European Union – and in applying the systematic usage-based techniques of corpus linguistics (Biber 1996:172) to the issue introduces a comparatively innovative methodology as well. The problem of shall can be attributed in considerable measure to tradition. This is seen as a reluctance to depart from tried and true formulae, which is described by Mellinkoff (1963:294) as a fear of change: “Lurking in the dark background is the always present, rarely voiced lawyer's fear of what will happen if he is not “precise” in the way that the law has always been “precise.” Adherence to tokens of legalese such as shall not only sustains the myth of precision in legal language but also perpetuates a style and language that differentiates the genre from that of other professions (Bhatia 1993:101-2) and, by extension, general usage. The tradition of precision instils in the legal profession a prescriptivist orientation to language, exercised both consciously and subconsciously on new writers, e.g., draftspersons and practicing lawyers. 
While the legal profession seeks to contain complexity in response to and anticipation of litigation arising from ambiguity, the linguist in contrast, is - and must be - content to describe language practice, however complex, through a principled search for regularities in a representative corpus. As a new legal order in which language is uniquely unfettered by tradition and, at the same time, the object of a singularly well-resourced attempt at language engineering, the EU provides an interesting crucible for language reform and descriptivist-prescriptivist perspectives. For example, a semantic analysis of shall and the modals in general is essential for machine translation initiatives (Svendsen 1991) and an understanding of the use of shall will figure prominently both stylistically and semantically in implementing the policy of transparency, i.e., making the legal instruments of the EU more readable for the average citizen. The paper undertakes to inform these concerns by addressing the following questions: 1) Have the problems of ambiguity identified in the literature for shall in common-law jurisdictions “come over” from the United Kingdom and continued in EU legal English? 2) Is there evidence of the use of promiscuous shall in EU legislation, e.g., use as a mere stylistic marker, as has also been suggested in the literature concerning other jurisdictions? In other words, is one of the ‘senses’ of shall null? 3) What is the frequency of shall in EU legislative language compared to that in other English-speaking jurisdictions? What is its frequency vis-à-vis general usage? To what extent does shall serve as a sign making legal language exclusionary of the average citizen? The paper will proceed in the next section with a discussion of the problem of shall in English-speaking jurisdictions and a few of the solutions proposed. This serves to contrast the perspectives of the linguist and the lawyer and provides necessary background for the analyses. The third section describes the materials and methods used in the study. This is followed by a presentation of the analyses and a concluding section, which discusses the findings with a view to informing future research. 186 2. The problem of shall The lawyer and linguist both strive to understand the meaning of shall, but all that they share in this pursuit is the token on the page. To the lawyer, shall is a word of authority - a word conferring rights and obligations and prohibitions - whose function is to impose an obligation (Thornton 1979:86- 7) and to do so unambiguously, in keeping with the Golden Rule of drafting: ”[T]he competent draftsman makes sure that each recurring word or term has been used consistently. He carefully avoids using the same word or term in more than one sense....In brief, he always expresses the same idea in the same way and always expresses different ideas differently.” Dickerson The Fundamentals of Legal Drafting §2.3.1, at 15-16 (2d ed. 1986) cited in Garner (1998:940). Drafting guidelines such as the above that provide the lawyer with prescribed usage; usage deviating from this norm -- the source of the problem -- takes the form of entries in dictionaries of legal English and court cases. 
For example, the ambiguity of shall is attested in the following entry in Black's Law Dictionary, which after a paragraph substantiating obligatory shall continues: But [shall] may be construed as merely permissive or directory (as equivalent to may), to carry out the legislative intention in cases where no right or benefit to any one depends on its being taken in the imperative sense. Wisdom v. Board of Sup'rs of Polk County, 236 Iowa 669, 19 N.W. 2d 602, 607, 608 (Black 1990:1375) More extensive evidence can be had through an investigation of cases involving disputes over the meaning of shall, a source suggested in the entry above. One such study is that by Kimble (1992), who after reviewing over one hundred cases involving disputes over the meaning of shall concluded: “In summary, I'm afraid that shall has lost its modal meaning - for drafters and for courts. Drafters use it mindlessly. Courts read it any which way” (p.72). To the linguist, shall is a modal auxiliary, a lexico-semantic category studied in great detail (e.g., Leech (1971), Coates (1983), Perkins (1983), Palmer (1986)) to describe and account for the variety of meanings it exhibits. With the establishment of the computer corpus linguistics paradigm (Leech 1992:107-111), the evidence invoked in such investigations will typically be derived from a systematic empirical investigation of a corpus or corpora, although researchers may still rely on their own intuition regarding language usage. The dimensions of the evidence used in the two professions are depicted below in terms of two continua: anecdotal-systematic and usage-intuition.
[Figure 1. Dimensions of linguistic evidence: a two-by-two grid defined by the usage-intuition and systematic-anecdotal continua, with the quadrants numbered 1-4.]
The entry in Black's Law Dictionary for shall could be placed in quadrant 1. It cites cases in which courts have been called on to decide whether shall means obligation or permission - usage – but such cases are not systematic evidence for the study of shall. Kimble's (1992) study can be placed somewhat more to the systematic edge of the quadrant, although a collection of cases clearly cannot represent the typical range of use. Quadrant 2 represents corpus linguistics research. Noteworthy examples relevant in the present context are the studies of Coates (1983) and Trosborg (1997). Quadrant 3 describes the approach of many studies of linguistic competence in the 1970s, in particular the introspective investigations of syntax that figured prominently in the development of transformational grammar. Quadrant 4 comprises systematic studies using native informants, which frequently complemented the introspective approach. A second crucial difference between the lawyer and linguist is that they have a different conception of meaning. To be sure, both may seek to distinguish dictionary senses of words, but linguistic studies of modality (Coates 1983) have yielded acceptable models of meaning in terms of clines, or gradients of senses running between core and peripheral meanings, which are doubtless at variance with the categorical approach seen in the drafting guideline. In the main, solutions to the problem of shall have sought to restrict the word to a single sense, its original sense of obligation. Garner (1998:940) distinguishes eight senses of shall: (1a) “the court . . . shall enter an order for the relief prayed for . . . .”
(1b) “Service shall be made on the parties” (1c) “The debtor shall be brought forthwith before the court that issued the order.” (1d) “Such time shall not be further extended except for cause shown” (1e) “Objections to the proposed modification shall be filed and served on the debtor.” (1f) “The sender shall have fully complied with the requirement to send notice when the sender obtains electronic confirmation.” (1g) “The secretary shall be reimbursed for all expenses.” (1h) “Any person bringing a malpractice claim shall, within 15 days after the date of filing the action, file a request for mediation.” Only the first of these is deemed acceptable, the criterion being that the grammatical subject must be the person on whom the obligation has been imposed. Garner objects to shall in (1b) on the grounds that a duty is being imposed on an abstract thing, e.g., service, and to (1c) because a duty is being imposed on an unnamed actor; the agent is not specified. The lack of specification made possible by the agentless, short passive in English is not, however, in any way a consequence of the author's choice of shall. The choice of passive voice is a syntactic, not lexico-semantic, consideration. In fact, agents are readily identifiable: both verbs, ‘serve notice on’ and ‘bring before the court’, imply action on the part of a bailiff at the request of the court. In each case a duty is being imposed and it is possible to determine the nature of that duty, the person or persons who are to perform it and the object of that performance. Shall tells the reader in each case that a duty is involved. The approach confounds syntactic and semantic criteria and yields three senses where there is only one. While the foregoing will suffice to illustrate Garner's argumentation, (1f) merits comment as a case that is undoubtedly of the kind which prompted Bowers (1989:294) to conclude that shall is generally “used as a kind of totem, to conjure up some flavour of the law.” The function of the verb form in (1f) is to indicate completed action (establishment of compliance) in future time (when the sender obtains electronic confirmation), and no interpretation of obligation is possible. On the evidence of intuitive and empirical linguistic studies, respectively, Bowers (1989:34) and Trosborg (1997:136) propose that shall be restricted to indicating obligation where a human agent is specified or easily recoverable from context. This argument rests on the concept of a master speech act (Kurzon 1986:19), according to which the enactment clause in a legislative instrument establishes a global illocutionary force of obligation. Used in a non-agentive context in the enactment clause, shall actually undermines the purpose of the legislation in two respects: first, as the law is always speaking, if shall is interpreted as a future tense it creates perpetual futurity, meaning the provision will never come into force; second, “all cases where shall is propositional and non-agentive in fact weaken the superordinate force of the Act by suggesting that there is yet a further step to be taken before the enacted clause becomes reality” (Bowers 1989:34). Language engineering, in particular machine translation, can also be seen to embrace a similar interest in isolating a single sense for shall and the modals. The following is an analysis of modality that has been presented as Euroversal by Svendsen (1991:276).
                 Possible       Necessary     “others”
Epistemic        possibility    necessity     presumption
Deontic          permission     compulsion    obligation
Subject-oriented ability        volition      resolution
Table 1 Eurotra matrix for English modal meanings
The senses distinguished encompass the meanings to be expressed in the official languages of the EU and presumably one form and one form only should be assigned to each for each language. Shall certainly would be at home in “deontic obligation.” This would amount to termification of the modals using Euroversal concepts and an agreed form – a term – in each language. In the case of the modals, however, describing and prescribing a single sense is perhaps more daunting a task than in the case of legal terms of art proper. Significantly, the importance of context in determining meaning has been recognized by lawyers and linguists alike. For example, Asprey (1992:82) states on the question of replacing shall with must: “the reason why it is difficult to replace shall with a word that has all these subtle meanings is that shall never did it in the first place. Not on its own. It did in context.” The problems with specifying the meaning(s) of words of authority have a linguistic basis in that, as a closed system, their meanings are “reciprocally defining: it is less easy to state the meaning of any individual item than to define it in relation to the rest of the system” (Quirk et al 1972:46). Ideally, methods of investigation and evidence should accommodate both conceptions of meaning. Another proposed solution to the problem of shall is that it be replaced by must. This would accomplish little semantically, as the entry in Black's indicates: “but this [mandatory] meaning of the word is not the only one, and it is often used in a merely directory sense, and consequently is a synonym for the word ‘may’” (Black 1990:1019). The meaning of must will clearly have to be contained as well before it meets the needs of the legislator. Interestingly for the present study, Asprey (1992:77) suggests replacing shall with must in the sense of obligation on stylistic grounds, asserting that the use of shall “puts lawyers out of step with the language of the general community”. This proposal will be taken up in conjunction with frequency analyses below. 3. Material and methods The present study relies on four principal sources of data comprising a variety of corpora. The first is a corpus of EU primary and secondary legislation (EULEG) compiled by the author for research on modality in EU legal English. Table 2 gives a breakdown of the current composition of the corpus.
Text type     No. of texts   Word count   Avg. word count per text
Treaty        1              47102        47102
Regulations   4              39863        9966
Directives    4              55360        13840
Decisions     2              16488        8244
Total         11             158813       14437
Table 2 Composition of EULEG
The texts were obtained in electronic form from EUR-LEX, the WWW-based database of European legislation. Although the texts are not official translations, the likelihood that there are discrepancies in the use of the modal verbs vis-à-vis the authoritative version as published in the Official Journal was considered negligible. The ready availability of text in electronic form in a searchable database was seen as outweighing this criterion of authoritativeness. Legislation in the EU can be divided into primary and secondary. Primary legislation comprises the Treaties. There are three treaties establishing the European Communities and a number of conventions and documents by which these have been amended.
Complementing the Treaties are the instruments of secondary legislation, which comprise Regulations, Directives and Decisions. Regulations are directly applicable in that no national measures are required for a Regulation to become binding on the citizens of a Member State and that a Member State cannot undertake measures that would prevent the application of a Regulation. A directive is binding as to the result to be achieved; the form and methods by which these are implemented may be decided by the authorities in each Member State. Decisions are generally of an administrative nature and implement other Community rules. A decision is binding in its entirety on those to whom it is addressed. The text types chosen represent the genre of primary and secondary legislation exhaustively, and should therefore adequately represent the linguistic distribution of modal verbs. In light of Biber's (1993:252) criteria for representativeness, the figures in the table suggest that a larger number of samples, perhaps smaller in size, might be equally or more representative than those presently included. However, at this stage in the research, the author has opted to include the entire text, even in the case of the Treaty, for a number of reasons. First, stratified sampling without regard to the division of the texts into recitals, enactment section and annexes would have made it impossible to investigate the frequency of shall in the enactment section, which is crucial in light of the proposals by Bowers and Trosborg referred to above. Second, the entire text provides the most representative sample for intertextual comparison with corpora from other English-language jurisdictions. While the genre of legislation is in the main similar, no consistent structure could be observed which would have justified stratified sampling by text section across the jurisdictions. The texts in the corpus have not been chosen entirely at random, and EULEG can be described as opportunistic (Leech 1991:10) inasmuch as several of the texts had been gathered for teaching purposes and terminological research. On the whole, however, the texts do not represent any particular field of activity or time frame; a study of the frequency of modals along such parameters is beyond the scope of the present research. The regulations span economics and social security, the directives data protection and environmental protection, and the decisions trade in bananas. Given the nature of the linguistic feature - modal verbs - it is unlikely that any bias could be introduced by the subject matter. There is no evidence that one field of activity imposes more obligations and prohibitions or confers more rights than another. The second major source of material comprises the translations of the instruments in EULEG in Swedish, Finnish, French and German. These furnish parallel corpora which are instrumental in disambiguating the occurrences of shall. For example, the archetypical sense of ‘obligation’ (Crystal and Davy 1969:206-207) stands out in bold relief where one finds it translated with the Agent(genitive) + BE + present passive participle construction in Finnish. French and German, in turn, serve to reveal occurrences of shall in which futurity but not obligation is intended and in which the use of shall instead of will in the third person is unmotivated.
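The role of the parallel corpora in disambiguating shall can be made concrete with a small sketch. The following Python fragment is purely illustrative and is not the procedure used in the study: it assumes that sentence-aligned English-Finnish pairs are available as tuples, and it approximates the Finnish necessive construction (BE + present passive participle, as in on hyvitettävä) with a deliberately crude regular expression; the function name and the pattern are assumptions, not part of the paper.

import re

# Hypothetical illustration: given sentence-aligned English-Finnish pairs, flag
# occurrences of "shall" whose Finnish counterpart contains the necessive
# construction "on + -(t)tAvA", a rough signal that obligation is intended.
SHALL = re.compile(r"\bshall\b", re.IGNORECASE)
# Very crude approximation of "on" followed by a present passive participle.
FI_NECESSIVE = re.compile(r"\bon\s+\w+(ttava|ttävä|tava|tävä)\b", re.IGNORECASE)

def classify_shall(aligned_pairs):
    """aligned_pairs: iterable of (english_sentence, finnish_sentence) tuples."""
    results = []
    for en, fi in aligned_pairs:
        for _ in SHALL.finditer(en):
            label = "obligation?" if FI_NECESSIVE.search(fi) else "check futurity/style"
            results.append((en, fi, label))
    return results

if __name__ == "__main__":
    pairs = [("The amount shall be credited to the account of the creditor.",
              "Määrä on hyvitettävä velkojan tilille.")]
    for en, fi, label in classify_shall(pairs):
        print(label, "|", en)

Such a heuristic would only pre-sort candidates; the interpretive work illustrated with examples (2) and (3) below remains a manual, context-sensitive task.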
The use of the translations is predicated on the assumption that translators deverbalize the original message in the source text - whatever language this may be in - and express it in their native language. This is depicted as the “language-free semantic representation” in Figure 2 below.
[Figure 2. The process of translation (adapted from Bell 1991:56): the clause in the source text is analysed - 1) clause structure, MOOD and lexical choices, 2) propositional content, 3) thematic structure, 4) register features, 5) illocutionary force, 6) speech act - into a language-free semantic representation, from which the clause in the target text is produced.]
The thesis here is that, through the agency of the translator, the text of a piece of legislation in each of the official languages of the EU can be seen as encoding the same meaning. Due to lexical, syntactic and semantic differences among the languages, however, these features will be expressed in different ways. A serious qualification on the potential of deverbalization is in order, however, for translators often rely excessively on the surface structure. Evidence for translationese, represented by the dashed line between “Clause in source text” and “Clause in target text” in Figure 2, has been substantiated by Schmied and Schäffler (1996:48) in translation corpora of English and Norwegian. Trosborg (1997:159) acknowledges the phenomenon with regard to shall in legal language. Beeth and Fraser (1999:76) identify institutional pressures for perpetuating inadequate translations: One of the dangers of many of the translating tools like the Translator's Workbench (TWB) is that they provide the translator with ready-made segments of text in the target language (lifted from earlier documents), making it much easier to stay on the surface of a document. And yet in our hearts we know that what was an adequate translation for the document from which the segment originated is unlikely to be as adequate for the document we have before us now. A further complication in using parallel corpora is that it is not possible to establish the source and target text, as all are equally authentic and ‘source’ would imply an untoward political primacy. Article 53 of the Treaty on European Union reflects this principle: “This Treaty, drawn up in a single original in the Danish...Spanish languages, the texts in each of these languages being equally authentic...” The organizational reality mirrors this political reality in that it is nowhere stated for a particular piece of legislation which language it has been first drafted in and therefore which has served as the basis for the translations found in EUR-LEX. On balance, while one must be somewhat circumspect regarding the level of abstraction at which translators operate, the parallel texts represent a significant resource to tap systematic usage. The following examples from Regulation 95/48/EC illustrate the use of translations: (2a) en whereas the name given to the European currency shall be the 'euro'; whereas the euro as the currency of the participating Member States shall be divided into one hundred sub-units with the name 'cent'; (2b) fr que le nom de la monnaie européenne sera «euro»; que l'euro, qui sera la monnaie des États membres participants, sera divisé en cent subdivisions appelées «cent»; The corresponding Swedish text uses skall, a cognate of shall, which expresses futurity and obligation, as well as kommer att, a pure future. (2c) sv Namnet på den europeiska valutan skall vara "euro".
Som valuta för de deltagande medlemsstaterna kommer euron att delas upp i ett hundra underenheter med namnet cent. The English word ‘given’ allows an agent (‘by the participating Member States'), making obligation and thus shall possible. The French translation using the future tense suggests a strictly temporal use. Swedish skall indicates both obligation and futurity. In the second instance, the use of the future kommer att, coupled with the evidence from French, suggests that at least the second shall in (2a) is superfluous and that is or will would suffice. This is a reasonable interpretation also inasmuch as there is no recoverable agent; ‘shall be divided’ is the equivalent of ‘equals'. The following is an example in which the use of the Finnish construction BE + present passive participle, which indicates obligation unambiguously, argues in favor of recovering an agent and therefore accepting the use of shall. (3a) en The amount shall be credited to the account of the creditor in the denomination of his account, with any conversion being effected at the conversion rates. (3b) fi määrä on hyvitettävä velkojan tilille hänen tilinsä valuuttayksikön määräisenä [the-amount is to-be-credited of-the-creditor to-the-account…] As the third question in the introduction indicates, the present study seeks to investigate the distinctive features of the genre of legislative English and examine the characteristics of EULEG in comparison with comparable legislation from other jurisdictions. On the intuitive level, shall certainly seems to meet the criteria of frequency, distinctiveness and precision described by Crystal and Davy (1991:228-9). Following is a description of the several pilot corpora or representative texts that have been compiled to enable comparisons with EULEG for the stylistic analysis.
Corpus    No. of texts   Word count   Avg. word count per text
AMLEG     1              5365         5365
CANLEG    2              35213        17607
BRITLEG   1              6592         6592
FINNLEG   6              59316        9886
Total     9              -            -
Table 3 Pilot corpora
Clearly, great caution is called for in undertaking an analysis on the basis of a single text. Shall is a frequent enough linear feature (Biber 1993:252), however, to sustain a tentative assumption that even a single text may have some value as a sample. At the level of delicacy sought, they do have something to offer. As the author is interested in translationese, a corpus of translations from Finnish into English has also been compiled. The fourth principal source of data comprises the frequency of the occurrence of modals in the Brown and LOB corpora. The former comprises 1,013,737 words of American English, the latter 1,013,644 words of British English, and they are used here to represent the use of modals in the standard language. 4. Analysis In her study of shall in British statutes, Trosborg (1997) concluded that even though shall has been defined as a modal verb expressing legal obligation, “...65.1% of the observed instances of shall occurred with non-human subjects which could not be given orders or assigned obligations” (pp. 105-106). This semantic analysis is essential in overcoming a monolithic concept of, or belief in, a one-to-one correspondence of form and meaning in legal language.
This is reflected in Crystal and Davy's statement: “Shall is invariably used to express what is to be the obligatory consequence of a legal decision, and not simply as a marker of future tense, which is its main function in other varieties” (1969:206-7) and can be inferred in studies focusing on general usage, e.g., Coates (1983), which simply cite the legal or quasi-legal meaning of shall. An examination of the frequency of shall in the corpus by section of text is depicted in Figure 3 below.
[Figure 3. Frequency of the deontic modals shall, may and must (per 1000 words) by text section: recitals, enactment and annex.]
This frequency analysis of shall prompted examination of word frequencies overall in the corpus and revealed that the relative frequency of shall in EULEG places it squarely among semantically void function words. The relevant frequencies are summarized in the table below:
Word     Frequency (per 1000 words)
         Corpus    Enactment
The      90        100
Of       55        58
To       29        30
In       26        26
And      22        23
shall    15        20
a        14        15
Table 4 Comparison of most frequent tokens in EULEG by section of text
This evidence suggests that the occurrences of shall in the corpus merit closer analysis along the lines of Trosborg's study. It seems implausible that each occurrence of shall entails the imposing of an obligation. To investigate this hypothesis empirically, a random sample of 574 occurrences of shall was drawn from the enactment sections of the legislation in EULEG and analyzed interactively for the occurrence of a human subject, and where the sentence was in the passive and had a human agent, whether the agent was expressed in a by clause or had to be recovered from context. The percentages obtained are presented in Table 5 below:
Subject      Active    Passive, agentless    Passive, by + Agent
Human        40%       12%                   3%
Non-human    39%       6%                    -
Table 5 Use of shall in EULEG with human subject
If one adopts relatively strict criteria for agency, e.g., that the logical and grammatical subject coincide and that the subject be human, 60% of the occurrences of shall analyzed are unmotivated, indicating that legislation in the EU is only slightly more fastidious in this regard than that analyzed by Trosborg. If the passives are included as agents recoverable from context, then some 45% of the occurrences of shall in this data are unmotivated, an improvement over British statutes but nevertheless a level of promiscuous use that is certainly in no way reconcilable with the Golden Rule of drafting cited above. Perhaps the more significant observation to emerge in the analysis is that the criterion of human subject, which Bowers and Trosborg and, to an extent, Garner argue for, proves as difficult to specify as discrete senses of shall. The following are examples from the sample. (4a) It shall be for the controller to ensure that paragraph 1 is complied with. (4a) might not fit the “human subject” criterion, especially if the text were being analyzed by a machine, but is clearly equivalent to the “controller shall ensure...”. (4b) below represents the reverse case. This main clause would presumably meet the criterion of human subject, but the wider context “if it sees fit” renders the issue of obligation irrelevant in giving the authority discretion. (4b) If it sees fit, the authority shall seek the views of data subjects or their representative. Here, it is instructive to consult the parallel corpora. The Finnish version uses voi ‘can/may'; Swedish uses skall, and German and French the present tense.
The Swedish may be influenced by the English text. Obligation is not plausible. (4c) fi Jos viranomainen katsoo aiheelliseksi, se voi hankkia [If the authority sees fit, it can/may obtain the views] (sv skall, de holt ein, fr recueille) The four examples that follow are also potentially problematic, and must be accounted for in any attempt to assign shall to a particular sense. It is unclear whether Trosborg would have included any of these as human subjects (1997:105-106). The first three were counted in the present analysis; (4g) was not. Yet the actions involved - publishing, calculating, taking into account and creating a state in which requirements are provided in written form – all require a human agent. (4d) The report shall be made public (4e) The turnover of an undertaking concerned within the meaning of Article 1 shall be calculated by adding together (4f) The decision to publish shall take due account of the legitimate interest of undertakings (4g) the requirements shall be in writing or in another equivalent form. The analysis to follow addresses the third question in the introduction by examining the frequency of shall and the other modals in EU legislative language compared to that in other English-speaking jurisdictions and in general usage. The focal question is the extent to which modality in general and shall in particular, as the most frequently occurring modal, contribute to making legal language a distinctive genre, one perhaps exclusionary of the average citizen. Figure 4 below depicts the frequencies of modal auxiliaries in legal and general usage.
[Figure 4. Frequency of the modal auxiliaries would, will, can, could, may, should, must, might and shall (per 1000 words) in legal and general usage.]
General usage is represented by frequencies from the LOB and Brown corpora as described in the previous section. Legal usage is based on 200,000 words comprising EULEG, AMLEG, CANLEG and BRITLEG. Spearman's rank correlation coefficient for the differences between the nine auxiliaries was calculated and found to be significant at the 99% level. The frequency of shall here proves to be distinctive as a stylistic feature in terms of both criteria presented by Crystal and Davy (1969:21): that is, it occurs most frequently within the variety of language in question and is less shared by other varieties. Indeed, a nearly clean break can be established between the general language and legal language. In Coates’ (1983:189-9) analysis, shall in the sense of obligation is “virtually restricted to formal legal contexts. Its fossilisation is demonstrated by the fact that there are no examples in the informal spoken language of the Survey and only one in the more colloquial written language of the Lancaster fiction texts.” The following analysis compares the individual corpora to one another and each to general usage.
[Figure 5. Frequency of the deontic modals shall, may and must (per 1000 words) in the legal corpora EULEG, AMLEG, CANLEG, BRITLEG and FINNLEG.]
Chi-squared was calculated for the frequencies of modals in each corpus vis-à-vis the Brown and LOB corpora and found to be significant. Without shall, however, none of the texts or corpora differ significantly from the general language. The analysis places EU legislation well within the family of English-speaking jurisdictions. The high frequency of shall in the translation corpus FINNLEG merits further investigation.
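As an aside, the frequency comparisons behind Figures 4 and 5 and the significance tests just mentioned can be reproduced from raw counts along the following lines. This sketch is a minimal illustration only: the modal counts below are placeholders rather than the study's data, and reading "Spearman's rank correlation coefficient for the differences between the nine auxiliaries" as a correlation between the two per-1000-word profiles is an assumption.

from scipy.stats import spearmanr, chisquare

# Placeholder counts (NOT the paper's figures): raw occurrences of each modal
# and total corpus sizes, from which per-1000-word rates are derived.
modals = ["would", "will", "can", "could", "may", "should", "must", "might", "shall"]
legal_counts = {"shall": 2400, "may": 900, "must": 250, "will": 120, "would": 90,
                "can": 80, "could": 40, "should": 60, "might": 20}
general_counts = {"would": 2700, "will": 2500, "can": 2000, "could": 1600, "may": 1300,
                  "should": 1100, "must": 1000, "might": 700, "shall": 350}
legal_size, general_size = 200_000, 2_027_381  # legal corpora vs Brown + LOB

def per_1000(counts, size):
    return [1000 * counts[m] / size for m in modals]

legal_rates = per_1000(legal_counts, legal_size)
general_rates = per_1000(general_counts, general_size)

# Rank correlation between the two frequency profiles over the nine auxiliaries.
rho, p = spearmanr(legal_rates, general_rates)
print(f"Spearman rho={rho:.2f}, p={p:.3f}")

# Chi-squared: observed legal counts against expectations scaled from general usage.
observed = [legal_counts[m] for m in modals]
expected_raw = [general_counts[m] / general_size * legal_size for m in modals]
scale = sum(observed) / sum(expected_raw)   # match totals, as chisquare requires
expected = [e * scale for e in expected_raw]
chi2, p_chi = chisquare(observed, f_exp=expected)
print(f"chi-squared={chi2:.1f}, p={p_chi:.4g}")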
A first observation is that the translators have learned their legal English well. 5. Discussion The foregoing analyses have shed light on the questions presented in the introduction. From the standpoint of what might be termed legal hygiene, the research to date has revealed no ambiguities which might compromise a citizen's rights. This finding suggests the need for analysis at a greater level of delicacy or drawing more extensively on the parallel corpora for interpretations. With regard to linguistic hygiene, or the promiscuous use of shall, the analysis revealed superfluous use on the order of fifty percent. More significantly, the analysis revealed difficulties with the proposed solution of restricting shall to meaning obligation when imposed on a human subject. While there is much work yet to be done in building EULEG, this finding can be seen as an outcome of corpus linguistics' abiding interest in systematic empirical investigation. The lawyer must accept what the linguist has come to realize: usage is complex and must be investigated accordingly. Yet, the descriptive analysis raises a salient prescriptive issue: Might stylistic use of shall, perhaps a habit perpetuating the exclusionary nature of the genre, introduce ambiguities of obligation into the text for the (monolingual) English reader where, at least on the evidence of other languages, none was intended? In light of the analyses presented here, the modals in general and shall in particular pose significant challenges to machine translation. The need for language engineering is patent, the prospects for success somewhat less obvious. Descriptive analysis of the frequency of modals in EULEG and comparable texts or corpora established that legislative language in the EU has more in common with legal genres in other English jurisdictions than with the general language. Although hardly surprising as such, legislative language merits further empirical study given the European Union's express interest in transparency, which is comparable to the Plain Language Movement in the English-speaking countries. Worth noting in this regard is that it is the frequency of shall alone, rather than that of may or must, that binds these populations together statistically. Obsolescent in the sense of obligation in general language, shall clearly signals to the reader that he or she is dealing with a distinctive genre. Replacing shall with must, as mentioned earlier, would be a problematic enterprise semantically. Even if the number of justifiable uses of a modal of obligation were cut by half, the frequency of familiar must would rise to the point where the word would differ distinctively from its use in the general language and, in this respect, risk becoming a new shall. References Asprey M 1992 Shall Must Go. Scribes Journal of Legal Writing 3(79): 79-83. Beeth H, Fraser B 1999 The hidden life of translators. Translation and Terminology 2: 76-96. Bell R 1991 Translation and Translating. London, Longman. Bhatia V 1993 Analysing Genre: Language Use in Professional Settings. London, Longman. Biber D 1996 Investigating language use through corpus-based analyses of association patterns. International Journal of Corpus Linguistics 1(2): 171-197. Biber D 1993 Representativeness in corpus design. Literary and Linguistic Computing 8(4): 243-257. Bowers F 1989 Linguistic Aspects of Legislative Expression. Vancouver, University of British Columbia Press. Coates J 1983 The Semantics of the Modal Auxiliaries. London & Canberra, Croom Helm.
Crystal D, Davy D 1969 Investigating English Style. London, Longman. Crystal D 1991 Stylistic profiling. In Aijmer K, Altenberg B (eds) English Corpus Linguistics. London and New York, Longman, pp. 121-238. Garner B 1998 A Dictionary of Modern Legal Usage. Oxford, OUP. Kimble J 1992 The many misuses of "shall". Scribes Journal of Legal Writing 3(61): 65-75. Kurzon D 1986 It is Hereby Performed...Explorations in Legal Speech Acts. Amsterdam, John Benjamins Publishing. Leech G 1971 Meaning and the English Verb. London, Longman. Leech G 1991 The state of the art in corpus linguistics. In Aijmer K, Altenberg B (eds) English Corpus Linguistics. London and New York, Longman, pp. 8-29. Leech G 1992 Corpora and theories of linguistic performance. In Svartvik J (ed) Directions in Corpus Linguistics. Berlin, Mouton de Gruyter, pp. 105-122. Mellinkoff D 1963 The Language of the Law. Boston & Toronto, Little, Brown and Company. Palmer F 1986 Mood and Modality. Cambridge, CUP. Perkins M 1983 Modal Expressions in English. London, Frances Pinter. Quirk R, Greenbaum S, Leech G, Svartvik J 1972 A Grammar of Contemporary English. London, Longman. Schmied J, Schäffler H 1996 Approaching translationese through parallel and translation corpora. In Percy C, Meyer C, Lancashire I (eds) Synchronic corpus linguistics. Amsterdam, Rodopi, pp. 41-55. Svendsen U 1991 On the translation of modality in an MT-System Part I. Le Langage et l'Homme 25(4): 273-280. Thornton G 1979 Legislative Drafting. London, Butterworths. Trosborg A 1997 Rhetorical Strategies in Legal Language. Tübingen, Gunter Narr Verlag. Translating passives in English and Swedish: a text linguistic perspective Anna-Lena Fredriksson anna-lena.fredriksson@eng.gu.se English Department Göteborg University, Sweden 1 Introduction The aim of this work-in-progress report is to study whether the textual and grammatical structure of passive clauses is retained or altered in translations between English and Swedish. The passive is interesting to study in this perspective since it is a multifunctional category closely linked to textual structure. It constitutes a useful text-structuring tool in that it makes it possible to ‘thematize a Given element […] whilst simultaneously focalizing a New actor by shifting it to the end of the clause [agentful passives]’ (Granger 1983:299).1 Moreover, it allows the verb to occur without an agent (agentless passives). The study focuses on translation pairs in which passive clauses in English and Swedish original texts provide the starting-point. Four alternative situations are possible in the translated texts: a) no change in syntax – no change in thematic structure (1a-b), b) change in syntax – no change in thematic structure (2a-b), c) no change in syntax – change in thematic structure (3a-b), and d) change in syntax – change in thematic structure (4a-b): (1a) A wall of polythene sheeting was nailed up to screen the area from the rest of the house, and domestic catering operations were transferred to the barbecue in the courtyard. (1b) Ett plastsjok spikades upp för att skärma av byggplatsen från resten av huset, och hushållets matlagning flyttades ut till grillen på gården.2 (PM1) (2a) He was instructed to be 'as conspiratorial as possible' in order to minimise the dangers of police penetration. (2b) Han fick instruktioner att vara "så konspiratorisk som möjligt" för att minimera riskerna för polisinfiltration.
(CAOG1) (Lit: He got instructions to be …) (3a) Varje år utdelas cirka 300 Nordstjärneordnar och -medaljer till förtjänta utlänningar. (Lit: Every year, are-given around 300 orders and medals …) (3b) Every year, around 300 orders and medals of the Northern Pole Star are given to deserving foreigners. (GAPG1) (4a) Till och med ensamhetskänslan försvagades även om han aldrig blev helt fri från den. (Lit: Even the-loneliness-feeling was-diminished even though he …) (4b) Work even diminished his feelings of loneliness, although he could not make them disappear altogether. (KF1) Textual structure and thematic patterning in translations is an area which is relatively unexplored, but interesting for a number of reasons. Ventola (1995:88) remarks that changes in thematic structure in translations ‘may create somewhat different meanings. The readers are forced to focus on different things – orientation to ‘the starting-point’ in the forthcoming text is different'. Sometimes it may even be argued that original and translated texts ‘are not saying the same thing’ (ibid.). Enkvist (1984:48f) draws attention to the importance of contrastive work combining text linguistics and grammar. Such work can focus on, for example, ‘the interplay between the syntactic structure and the information structure of the clause and sentence, that is, the way in which the syntactic structure is brought into harmony with the desired distribution of old and new information in the sentence and the text'. Contrastive studies in English and Swedish or Norwegian on thematic structure include for example Altenberg (1998) and Hasselgård (1998). 1 For text linguistic discussions of the passive, see e.g. Duškovà 1971, Granger 1983, Halliday 1994:126, Péry-Woodley 1991, Sundman 1987:360f. Reference grammars also bring up textual functions of the passive, see Quirk et al 1985:1390f and Teleman et al 1999:4:380. 2 All examples are taken from the ESPC (see p. 2). The original text is given first as (a) and the translation as (b). The source text passive and the corresponding expression in the translation have been italicised. The text code referring to each corpus text, here PM1, has been given after each translation (see the list of references). In cases where the Swedish text differs from the English, a word-for-word gloss is provided for the part of the clause under discussion. 2 Definition of theme As a point of departure for the thematic analysis I have chosen Halliday's definition of theme according to which ‘[t]he Theme is the element which serves as the point of departure of the message; it is that with which the clause is concerned. The remainder of the message, the part in which the Theme is developed, is called […] the Rheme’ (Halliday 1994:37). The theme comes in first position in the clause extending ‘up to (and including) the first element that has a function in transitivity. This element is called the ‘topical Theme'; so we can say that the Theme of the clause consists of the topical Theme together with anything else that comes before it’ (Halliday 1994:53). In declarative clauses, the theme typically conflates with the subject, which is then referred to as an unmarked theme. Marked themes are realised by adjuncts or complements. Furthermore, the information structure often coincides with the thematic structure, so that the theme conveys given information and the rheme new information.
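As a rough computational illustration of the definition just given (the topical theme extending up to and including the first transitivity element, with the subject as unmarked theme in declaratives), theme-rheme segmentation can be approximated automatically. The sketch below is not part of the study's (manual) analysis; it assumes the spaCy library with its en_core_web_sm model and handles only simple single-clause declaratives.

import spacy

# Illustrative only: a crude approximation of theme identification for English
# declarative clauses, using dependency parses from spaCy.
nlp = spacy.load("en_core_web_sm")

def split_theme_rheme(sentence):
    doc = nlp(sentence)
    root = next((t for t in doc if t.dep_ == "ROOT"), None)
    if root is None:
        return sentence, ""
    # Unmarked theme: the subject of the main clause (plus any fronted material).
    subj = next((t for t in root.children if t.dep_ in ("nsubj", "nsubjpass")), None)
    if subj is not None:
        end = max(t.i for t in subj.subtree) + 1   # include the whole subject NP
        return doc[:end].text, doc[end:].text
    # Fallback: treat everything before the root verb as the (marked) theme.
    return doc[:root.i].text, doc[root.i:].text

theme, rheme = split_theme_rheme("Boys are chased down the street by drunken uncles.")
print("THEME:", theme)
print("RHEME:", rheme)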
The fact that the topical theme can consist of an adverbial or a prepositional phrase only (forming a marked theme) complicates the analysis of the thematic structure of passive clauses since such elements occur quite frequently in thematic position in the material analysed. Therefore, I have chosen to use a somewhat altered definition and disregard the presence of such adjuncts in order to see the passive subject as part of the theme. Referring to work by Ravelli and Matthiesen, Altenberg (1998:116) points out that there have been suggestions within systemic linguistics to allow the theme to include everything in the clause up to the finite verb. Such a definition might be useful for further analysis of this type of data. A further problem and a difference from English is that Swedish is a verb-second language. According to the V2 constraint, Swedish normally accepts only one clause element before the finite verb in declarative clauses, which results in inversion of the subject and the verb when the initial element is a non-subject (cf. Teleman et al. 1999:4:4ff. and Altenberg 1998). In such cases the subject becomes part of the rheme, while it is a thematic element in English. Apart from the V2 constraint, the order of thematic elements is generally similar in Swedish and English (cf. Teleman et al. 1999:4:25, 380, 390). Theme-rheme patterns are important in guiding the reader through the text. Daneš has proposed a model of three main text-structuring patterns (here taken from Fries 1995:7f). The first is called ‘linear thematic progression'. Here, ‘the content of the Theme of a second sentence (Theme 2) derives from the content of the previous Rheme (Rheme 1), the content of Theme 3 derives from Rheme 2, etc’ (Fries 1995:7). In the second pattern, called ‘Theme iteration', ‘the same Theme enters into relation with a number of different Rhemes. The result of this type of thematic progression is that the Themes in the text constitute a chain of (typically) co-referential items which extends through a sequence of sentences or clauses’ (ibid.). Thirdly, in ‘progression with derived Themes’ different themes are derived from a hypertheme, a superordinate notion. 3 Material The data used is derived from the combined parallel/translation corpus English-Swedish Parallel Corpus (ESPC).3 Six non-fiction texts were analysed: three Swedish original text samples and three English original text samples with their respective translations. Each text contains 10,000-15,000 words, resulting in a total of approximately 75,000 words in source texts and about the same number of words in translations. Although this is a fairly small sample, a total of 795 instances of passive constructions were retrieved and analysed. It may be argued that material consisting of text samples instead of complete texts is not ideal for a text-linguistic study (Johansson 1998:11). However, since the present study does not analyse textual structure across whole texts, but is restricted to the thematic and information structure in the context of the sentence containing a passive construction, the size of each text sample does not seem to be of major importance. 3 For a description of the corpus, see Altenberg, Aijmer & Svensson 1999.
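The paper does not describe how the 795 passive instances were retrieved from the text samples. Purely as an illustration of one possible first pass, the sketch below flags candidate English be-passives and Swedish s-passives with regular expressions; the patterns are assumptions, deliberately over-inclusive, and the candidates would still require the kind of manual analysis reported here.

import re

# Illustrative heuristic only (not the study's retrieval method):
# flag candidate English be-passives and Swedish s-passives for manual checking.
EN_BE_PASSIVE = re.compile(
    r"\b(am|is|are|was|were|be|been|being)\s+(\w+ed|\w+en)\b", re.IGNORECASE)
# Swedish s-passive: a verb form ending in -as/-des/-tes; crude and over-inclusive
# (it will also match deponent verbs, hence the need for manual checking).
SV_S_PASSIVE = re.compile(r"\b\w+(?:a|de|te)s\b", re.IGNORECASE)

def candidate_passives(text, pattern):
    return [m.group(0) for m in pattern.finditer(text)]

en = "A wall of polythene sheeting was nailed up to screen the area."
sv = "Ett plastsjok spikades upp för att skärma av byggplatsen."
print(candidate_passives(en, EN_BE_PASSIVE))   # ['was nailed']
print(candidate_passives(sv, SV_S_PASSIVE))    # ['spikades']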
The types of passives analysed are English (agentful and agentless) passives formed with the auxiliary be and the past participle, and the three types of Swedish (agentful and agentless) passives: the s-passive, formed by adding the inflection -s to the active form of the verb, and the periphrastic passives formed with bli or vara as auxiliaries together with the past participle of the main verb, analogous with the English be-passive. 4 Translation equivalents In a previous study (Fredriksson forthcoming) I found that there is a wide range of possible translation equivalents when an English passive is translated into Swedish, or a Swedish passive into English. The results of the present study also show a wide variety of translation possibilities.
Table 1. Translation equivalents in Swedish translations (EO → ST)4
ST category                no     %
s-passive                  149    49
paraphrase                 45     15
bli/vara-passive           44     14
direct active              25     8
active + man               13     4
omission                   9      3
adjectival construction    9      3
nominalisation             8      3
other                      3      1
Total                      305    100
Frequency per 1000 words of 305 EO passives: 8.13
4 English original text is referred to as EO, Swedish original text as SO, English translations as ET, and Swedish translations as ST.
In translations from English into Swedish (Table 1), the great majority of English original passives are rendered as passives in the Swedish translations; they follow what we may call a ‘passive-to-passive principle'. If the two passive categories are grouped together, they make up as much as 63%. Paraphrasing is an alternative rather commonly used. This group contains for example active renderings that are not the closest possible translation of the passive verbs, but are contextually suitable choices with similar meanings. Direct active counterparts of passive constructions are used in 8% of all occurrences, which indicates that this is not a preferred translation strategy. A difference between this and many of the other translation equivalents is that it entails changes in thematic structure since the clause elements are re-arranged (agent – subject, etc). Finally, it is interesting to note that the translation alternative called ‘active + man', which consists of combinations of the generic man (cf. French on, German man, English one) and the active form of the verb, occurs with a frequency of only 4%. This choice is often mentioned as a potential competitor to the agentless passive. Here, however, it is only one among several active alternatives. Table 2 shows the frequency of translation equivalents in English translations of Swedish passive constructions. The passive-to-passive principle is strong in this direction as well. The group ‘omission’ is surprisingly large. It contains two types of omissions: those where the passive construction alone has been left out in the translation, and those where the translator has chosen to leave out the whole clause or sentence(s). An interesting point about the frequency of source text passives is that the Swedish passive is more frequent than the English passive in this material. The English passive is usually regarded as a strong category used with high frequency, especially in non-fiction genres (see e.g. Baker 1992:102, Biber 1988:250f), whereas the Swedish passive has been considered weaker. A commonly suggested reason is the use of the generic man combined with an active verb. In Fredriksson (forthcoming), which is based on fiction texts, passives are more frequent in English texts than in Swedish. The present study shows the reverse result.
Table 2. Translation equivalents in English translations (SO → ET)
ET category                no     %
be-passive                 285    58
omission                   70     14
paraphrase                 45     9
non-finite construction    38     8
direct active              27     6
nominalisation             21     4
other                      5      1
Total                      491    100
Frequency per 1000 words of 491 SO passives: 13.09
5 Thematic and grammatical structure in translations The default case would be retaining both the grammatical and textual structures of the source text in the target text. However, the translator can choose to change either the thematic structure or the syntactic structure, or both.
Table 3. Distribution of grammatical and thematic structure in ST.
                        Thematic structure
Grammatical structure   + change       - change       Total
+ passive               43 (48%)       149 (69%)      192
- passive               47 (52%)       66 (31%)       113
Total                   90 (100%)      215 (100%)     305
Table 3 shows that the default case, i.e. no textual, no syntactic change (shown as -change, +passive in the table), occurs in a majority of cases (149 instances) in translations from English into Swedish. In 31% of all cases the translator has used a non-passive structure (-passive) in order to retain the textual structure. The total number of unchanged thematic structures is 215, whereas that of changed structures is 90. In the column showing changed textual structure (+change), we find that the distribution of passives and non-passives is nearly equal. We note that in 52% of the occurrences the textual change is accompanied by a syntactic change as well. The latter would probably produce a translation that differs from the source text to a considerable extent.
Table 4. Distribution of grammatical and thematic structure in ET.
                        Thematic structure
Grammatical structure   + change       - change       Total
+ passive               57 (32%)       228 (73%)      285
- passive               120 (68%)      86 (27%)       206
Total                   177 (100%)     314 (100%)     491
Table 4 shows the distribution in English translations. Here too, the default case, with unchanged thematic and syntactic structure, is most frequent, with 228 instances. The textual structure is preserved but not the passive in 27% of the cases. The total number of preserved thematic structures is 314 as compared with 177 instances of changed thematic structure. A difference from the other direction of translation is that the distribution within the group of changed thematic structure is uneven. The translators have changed both the thematic and syntactic structure in 68% of the total number of changed occurrences. 6 Preserved thematic structure The tendencies presented in the previous sections will now be illustrated with a few examples in order to look into the various translation strategies in more detail. We start by looking at a default case, i.e. one in which the grammatical as well as the thematic structure is retained in the translation (5a and b). The English agentful passive is rendered as a Swedish s-passive, the most frequent type of Swedish passive. (5a) It was a life with all the cheerful violence of a cartoon film. Boys brawl with each other, get involved in territorial gang fights, are chased down the street with razor blades by drunken uncles, whack each other over the head with Christmas trees, have rock fights and snowball fights. (5b) Det var en värld som var lika uppsluppet våldsam som en tecknad film. Pojkar bråkar, blir indragna i gängslagsmål, jagas längs gatan av fulla farbröder med rakblad, slår varandra i huvudet med julgranar, kastar sten och är invecklade i snöbollskrig. (RF1) The translation neatly follows the textual structure of the original text. The theme boys/pojkar is the theme of the whole sentence.
In the model of thematic progression outlined by Daneš (see section 2), this pattern is called theme iteration. The theme is constant but the rhemes are different. The second most frequent group of translation equivalents in translations in both languages is called ‘paraphrase’ and contains a mixture of constructions. A large number of them are expressions in which the translator has used an active form of a verb that is not the closest translation of the passive verb, but still conveys similar meaning: (6a) Our initial excitement had turned into anti-climax as the plans became more and more dog-eared and, for one reason or another, the kitchen remained untouched. Delays had been caused by the weather, by the plasterer going skiing, by the chief breaking his arm playing football on a motor-bike, by the winter torpor of local suppliers. (6b) Den första ivern hade förvandlats till antiklimax när ritningarna blev allt skrynkligare medan köket av någon anledning förblev orört. Förseningarna berodde på vädret, på att stuckatören var borta och åkte skidor, på att murarbasen hade brutit armen medan han spelade fotboll på motorcykel, på leverantörernas vintertröghet. (PM1) (Lit: The-delays depended on the-weather, …) A passive translation corresponding to the source text would have been possible – förseningarna hade orsakats av vädret … – but the chosen alternative is less formal. The thematic structure with theme iteration is retained in the translation. (7a) Tre kockar finns redan på plats. Hovtraktören och Erich Schaumburger, hans köksmästare även till vardags. Och en extrakock som kan behöva läras upp. (Lit: … who might need be-taught up.) (7b) Three cooks are already at work. The Restaurateur and Erich Shaumburger, who is also his regular Chief Cook. Plus an extra cook, who might need the training. (GAPG1) In (7b) the translator has used a nominalisation instead of a passive, although a passive could have worked just as well. It is difficult to explain why this choice has been made, since both texts have the same thematic structure. Another type of syntactic change is illustrated in (8b), where we find an adjectival construction instead of a verb: (8a) Till bondeståndet utsåg varje härad en riksdagsman. Valbara var bara de självägande bönderna och kronans åbor. Adelns bönder var uteslutna liksom de stora obesuttna samhällsgrupperna på landsbygden. (Lit: The-nobility's peasants were excluded …) (8b) For the Estate of the Peasants every rural district known as a ‘härad', or hundred, elected one member of the Riksdag. Only peasant proprietors and crown tenants were eligible. The nobility's peasants were ineligible as were the large tenantless social groups of the countryside. [AA1] The closest English equivalent of the Swedish verb phrase var uteslutna is were excluded/expelled. However, the author expresses a contrast between the preceding sentence with the adjective valbara [‘eligible'] and var uteslutna [‘were excluded']. By using an antonym of the preceding adjective - eligible/ineligible - instead of, for example, exclude or expel, the translator makes the contrast even more marked. 7 Altered thematic structure We will now look at a few examples in which the translators have chosen to deviate from the thematic structure of the original text. Thus, the following examples are all non-default cases. Many of the instances of retained syntactic structure despite thematic re-structuring are due to the fact that Swedish is a verb-second (V2) language whereas English is not. Temporal and spatial adverbials are highly frequent in initial position in the Swedish texts, which causes frequent alterations, one of which is found in (9a) and (9b):
Temporal and spatial adverbials are highly frequent in initial position in the Swedish texts which causes frequent alterations, one of which is found in (9a and b) (9a) Åbo blev den nya maktens säte och handelsplats. Här byggdes också den första domkyrkan för biskopen i Finland. (Lit: (9b) Åbo became the seat of the new power and a centre of trade. Here, too, the earliest cathedral for the Bishop of Finland was erected. (AA1) Here we find a marked theme, här/here and the remainder of the clause forms the rheme, which is where we find differences between the original and the translated texts. The subject is rather long and the differences in the rhematic structure create a different perspective for the reader. In the ET was erected is focalised in end position and is hence the most rhematic, whereas in the SO den första domkyrkan för biskopen i Finland/the earliest cathedral for the Bishop in Finland’ is most focalised. This difference is due to the V2 constraint. Ventola (1995:87) discusses a parallel situation in German and says that ‘[t]he further the subject is pushed from the first verbal element the less likely it is any longer encoding ‘Given’ information, but instead its ability as a [sic] encoder for ‘New’ information increases'. (10a) The Bolsheviks saw the Civil War from the beginning as part of a great Allied plot. In reality, the revolt of the Czechoslovak Legion had been prompted not by the Allies but by fears for its own survival after attempts by Leon Trotsky, now Commissar for War, to disarm it. (10b) Redan från början betraktade bolsjevikerna inbördeskriget som en del av en stor allierad komplott. I själva verket var det inte de allierade som hade framkallat den tjeckiska legionens revolt, utan dess oro för den egna överlevnaden sedan Lev Trotskij, nu folkkommissarie för krigsärenden, försökt avväpna legionen. (CAOG1) (Lit: In reality, was it not the allied who had prompted the Czechoslovak Legion's revolt, but …) Example (10b) illustrates the use of what I have called a ‘direct active', which is meant to be the closest active alternative to the passive. The clause elements have been re-arranged in accordance with ‘the traditional active-passive correspondence’ , i.e. the passive agent corresponds to the active subject, and the passive subject to the active object. Consequently, the translation has a changed thematic structure; the theme/subject connects to the rheme of the previous sentence (a great Allied plot). The pattern called linear thematic progression (see section 2) means that an idea expressed in the rheme of the preceding clause or sentence is picked up and forms a new theme. (11a) Gjorde han det skulle han bli presenterad för de rätta personerna. Immanuel, som var omgiven av otåliga fordringsägare och hotats med gäldstugan, tyckte sig ha litet att förlora på att resa. Han ansåg sig dock inte ha råd att ta med familjen. (Lit: Immanuel, who was surrounded by impatient creditors …) (11b) He urged Immanuel to come to Turku, where he could introduce him to the right people. Surrounded by impatient creditors and threatened by debtor's prison, Immanuel felt he had little to lose by going. He told his family they would follow as soon as circumstances allowed. (KF1) The original text contains a passive in an embedded relative clause which has been turned into a fronted non-finite clause in the translation. The Swedish alternative appears to be relatively unmarked, whereas the same elements in the English text are given greater prominence. 
In the original, the theme and subject, Immanuel, is the same theme as occurs in the preceding sentence; there is hence theme iteration, which facilitates the progression of discourse. In contrast, the re-organisation of the translated text with a postponed subject breaks this progression. (12a) These were not the kind of gentlemen farmers who spent their winters on the ski slopes or yachting in the Caribbean. Holidays here were taken at home during August, eating too much, enjoying siestas and resting up before the long days of the vendange. (12b) Det här var inga godsägare som tillbragte vintrarna i skidbacken eller på en lustjakt i Karibiska havet. Semester hade man hemma i augusti: då åt man för mycket, njöt långa siestor och vilade upp sig inför den långa skördesäsongen. (PM1) (Lit: Vacation had one at-home in August: …) The combination of an active verb-form and the generic pronoun man as subject (12b) is often described as a useful alternative to agentless passives in Swedish. This translation possibility is, however, not among the most frequent in the data. The reason for employing it in this example might be that the Swedish verb ta [‘take'] is unlikely to be used in the passive in collocation with semester [‘holidays']. The thematic element is the same in both texts and provides a cohesive link with the preceding sentence according to the principle of linear thematic progression. An idea expressed in the rheme of the preceding clause or sentence is picked up and forms a new theme (holidays). In accordance with the Swedish V2 constraint, the subject is placed after the verb. The addition of a subject in non-initial position and the fronting of the complement in the translation mean that the theme of this clause is marked whereas the one in the English source text is unmarked. 8 Conclusion The aim of this paper has been to study how textual and grammatical structure is retained or altered in translations. It is interesting to see that a range of syntactic alternatives are used in translations in both directions. Some are less frequent than others, but can still indicate certain tendencies. Some syntactic changes appear to have been made in order to keep the thematic structure intact, while in other cases both structures have been changed. However, the main tendency is that the default case with no changes is the most common. This suggests that a variety of translation strategies are in use, and further studies will describe such strategies in more detail. Acknowledgements This work was carried out with funding from the Bank of Sweden Tercentenary Foundation. I wish to thank Professor Karin Aijmer for her valuable comments on earlier versions of this paper. References Primary sources (a) Swedish corpus texts Åberg A 1985 Sveriges historia i fickformat. Stockholm, Natur & kultur. Translated by Elliot G as A concise history of Sweden. Stockholm, Natur & kultur, 1991. [AA1] Arvidsson G, Gullers P 1991 Vad händer på slottet? Stockholm, Gullers. Translated by Write Right as Inside the royal palace. Stockholm, Gullers, 1991. [GAPG1] Fant K 1991 Alfred Bernhard Nobel. Stockholm, Norstedts. Translated by Ruuth M as Alfred Nobel: a biography. New York, Arcade Publishing, 1993. [KF1] (b) English corpus texts Andrew C, Gordievsky O 1990 KGB. The inside story. London, Hodder & Stoughton. Translated by Malmsjö K as KGB inifrån. Stockholm, Albert Bonniers Förlag, 1991. [CAOG1] Mayle P 1989 A year in Provence. London, Hamish Hamilton. Translated by Wiberg C as Ett år i Provence.
Mayle P 1989 A year in Provence. London, Hamish Hamilton. Translated by Wiberg C as Ett år i Provence. Malmö, Richter, 1991. [PM1]
Ferguson R 1991 Henry Miller: a life. London, Hutchinson. Translated by Lindgren N as Henry Miller: ett liv. Stockholm, Wahlström & Widstrand, 1992. [RF1]
Secondary sources
Altenberg B 1998 Connectors and sentence openings in English and Swedish. In Johansson S, Oksefjell S (eds), Corpora and cross-linguistic research: theory, method, and case studies. Amsterdam/Atlanta, Rodopi, pp 115-139.
Altenberg B, Aijmer K, Svensson M 1999 The English-Swedish Parallel Corpus: manual. Department of English, University of Lund. (Also at http://www.englund.lu.se/research/espc.html.)
Baker M 1992 In other words: a coursebook on translation. London/New York, Routledge.
Biber D 1988 Variation across speech and writing. Cambridge, Cambridge University Press.
Dušková L 1971 On some functional and stylistic aspects of the passive voice in present-day English. Philologica Pragensia 14(3): 117-143.
Enkvist N E 1984 Contrastive linguistics and text linguistics. In Fisiak J (ed), Contrastive linguistics: prospects and problems. Berlin/New York/Amsterdam, Mouton Publishers, pp 45-67.
Fredriksson A-L forthcoming A contrastive study of English and Swedish passives in a textual perspective. In Byrman G, Levin M, Lindquist H (eds), Korpusar i forskning och undervisning.
Fries P H 1995 A personal view of theme. In Ghadessy M (ed), Thematic development in English texts. London/New York, Pinter, pp 1-19.
Granger S 1983 The be + past participle construction in spoken English with special emphasis on the passive. Amsterdam, North-Holland.
Halliday M A K 1994 An introduction to functional grammar. Second edition. London, Arnold.
Hasselgård H 1998 Thematic structure in translation between English and Norwegian. In Johansson S, Oksefjell S (eds), Corpora and cross-linguistic research: theory, method, and case studies. Amsterdam/Atlanta, Rodopi, pp 145-167.
Johansson S 1998 On the role of corpora in cross-linguistic research. In Johansson S, Oksefjell S (eds), Corpora and cross-linguistic research: theory, method, and case studies. Amsterdam/Atlanta, Rodopi, pp 3-24.
Péry-Woodley M-P 1991 French and English passives in the construction of text. Journal of French Language Studies 1(1): 55-70.
Quirk R, Greenbaum S, Leech G, Svartvik J 1985 A comprehensive grammar of the English language. London, Longman.
Sundman M 1987 Subjektval och diates i svenskan. Åbo, Åbo Academy Press.
Teleman U, Hellberg S, Andersson E 1999 Svenska Akademiens grammatik. 1-4. Stockholm, Svenska Akademien (Norstedts Ordbok i distr).
Ventola E 1995 Thematic development and translation. In Ghadessy M (ed), Thematic development in English texts. London/New York, Pinter, pp 85-104.
Phraseological approach to automatic terminology extraction from a bilingual aligned scientific corpus Frérot Cécile1*, Rigou Géraldine2, Lacombe Annik3 1 University of Paris 7-Denis Diderot, 2 place Jussieu, F-75251 Paris Cedex 05, France, frerot@cicrp.jussieu.fr 2 INRA-CRJ, Unité Centrale de Documentation, Translation and Terminology Unit, F-78352 Jouy-en-Josas Cedex, France, rigou@jouy.inra.fr, tel: +33 (0) 1.34.65.24.54, fax: +33 (0) 1.34.65.22.72 3 INRA-CRJ, Unité Centrale de Documentation, Translation and Terminology Unit, F-78352 Jouy-en-Josas Cedex, France, alacombe@jouy.inra.fr, tel: +33 (0) 1.34.65.24.55 * This work was carried out during a training period at the INRA Translation and Terminology Unit as part of a post-graduate degree (DESS Industrie de la Langue et Traduction Spécialisée), UFR Etudes Interculturelles de Langues Appliquées.
Abstract: The linguistic knowledge represented in specialised dictionaries should not be restricted to a collection of terms (single and complex terms). It should include phraseological units, i.e. more or less fixed multiword expressions – often called collocations – which cause serious problems for translators and technical writers, since translating them into the target language is rendered difficult by the syntactic and lexical characteristics of each Language for Specific Purposes (LSP). Corpus-based terminology acquisition tools make it easier to identify and collect these units and to generate terminology resources that better meet the needs of users. We present the results of an automatic terminology extraction from a French-English aligned corpus, with the aim of developing a scientific translation and writing tool for French researchers that puts the emphasis on the phraseological dimension. This experiment was conducted at the Translation and Terminology Unit of the French National Institute for Agricultural Research (INRA), in collaboration with two researchers working in automatic terminology extraction and bilingual terminology alignment. LEXTER extracted term candidates from a French-language scientific corpus, and TRINITY identified the translation candidates in the English corpus made up of the translations of the French texts. The translator-terminologists exploited the results of the extraction through a hypertext validation interface, entering relevant candidates into a database. After describing the different stages of the experiment, from preparing the corpus to data processing, we explain how we validated and exploited the results. We focus on collocations and the problems linked to their identification and terminographic description from a translation perspective, and consider the problems inherent to phraseology in LSP when efforts are made to improve natural language processing (NLP) tools. Keywords: terminology extraction, aligned corpus, phraseology, LSP, scientific translation.
1 Introduction
Despite considerable terminographic work carried out to collect the vocabulary of specialised subject fields, deficiencies still exist in emerging and/or evolving subject fields. Available terminological resources do not fully meet the needs of technical writers and translators. Resources are often limited to terminological units, whether simple or complex terms, and exclude phraseological units. However, a large number of problems are raised by the linguistic environment of terms, i.e. the syntagmatic extension of the term to the sentence (Blais, 1993).
The ability to deal with this linguistic environment is absolutely necessary for language specialists eager to respect the type of language used by the experts. Consequently, the study of lexical units other than nouns (e.g. verbs, adverbs and adjectives) is of utmost importance. Nevertheless, in spite of a growing interest in problems linked to identifying and collecting phraseology in LSP (Language for Special Purposes), there are still few electronic or print resources available for translators. Corpus-based terminology acquisition tools such as terminology extractors make it easier to generate terminological resources. These tools help identify simple and complex terminological units as well as expressions –often known as collocations– specific to a linguistic community and including other lexical units such as verbs, adjectives and adverbs. Since we placed the emphasis on usage linked to collocations, we took into account the definition given by Pesant and Thibault (1993) who define collocations as being different parts of speech which appear together in a text and combine to form expressions fixed by usage. Two examples of collocations extracted from our corpus are technologically valuable lactic acid bacteria and radio-tagged river trout. The present work describes the results of an automatic terminology extraction from a bilingual aligned scientific corpus. We aimed both at evaluating how well these tools performed and at analysing how they could be used to develop a scientific writing tool meeting the needs of French researchers at the National Institute for Agricultural Research (INRA). 2 Terminology extraction 2.1 Context of the experiment The experiment was conducted by the Translation and Terminology Unit in collaboration with Didier Bourigault1 who developed LEXTER, a tool for terminology extraction, and David Hull2 who developed TRINITY, a tool for bilingual word alignment. The experiment aimed at increasing the terminological database currently used by the translators of the Unit. In the long term, we intend to develop a computer-based English writing tool for the French researchers of the Institute. Nowadays, the vast majority of publications are written in English. It is of vital importance to publish one's results, especially in emerging subject fields where competition is fierce, and the level of English is one of the criteria for acceptance of papers. However, whereas researchers often have a good command of their field's terminology, they often encounter great difficulty in expressing their ideas clearly and, as a result, in writing a syntactically coherent text. At INRA, they have the support of in-house translators to write their papers. In order to compensate for the lack of terminological data, the translators search on-line bibliographical databases constituting immense corpora, from which they extract the terminology and phraseology they cannot find in dictionaries. The experiment aimed to collect terminology from a French-English translation corpus by validating the results of an automatic extraction. The analysis of noun phrases (NP) extracted by LEXTER helped us identify the terms and the extensions of these NPs. We then turned to the French list of nouns and verbs to collect verb phrases belonging mainly to general language and frequently used by researchers. 2.2 Preparation of the corpus This extraction was carried out from an aligned bilingual corpus made up of French to English translations. 
The corpus represented a total of 340,000 words and included, in decreasing order: research papers, press releases and publications aimed at the general public, a software user's guide, presentation leaflets, a licence contract and summaries of monographs. The corpus comprised texts from different subject fields, of which the most represented were: agricultural sciences, soil sciences, hydrobiology, environment, biometry and modelling, plant breeding and genetics, plant disease and weed science. Paragraph marks in both the French and English texts were used to manually align the corpus. We also removed from the corpus any element likely to generate noise during the extraction process (bibliographical references, mathematic symbols, figures, etc.).
1 Didier Bourigault, Equipe de Recherche en Syntaxe et Sémantique, Université Toulouse le Mirail
2 David Hull, Xerox Europe Research Centre, Grenoble
2.3 Data processing
Xerox tools were used to align the corpus sentence by sentence and tag it.
2.3.1 LEXTER3
LEXTER extracted term candidates (TC) from the tagged French corpus. TCs designate words and multiwords likely to be terminological units or collocations. The first stage in the process is a morphological analysis of the texts, followed by the identification of noun phrases of maximal length using linguistic rules of boundary detection. In the second stage, LEXTER uses a range of parsing rules to extract from each noun phrase of maximal length a set of sub-phrases which are likely to constitute terms by virtue of their position in the noun phrase (Bourigault, 1994). After statistical filtering, LEXTER supplies a list of TCs.
2.3.2 TRINITY4
TRINITY is an alignment system using statistical word alignment techniques to automatically construct a bilingual word and phrase lexicon from a collection of translated sentences. Statistical word alignment algorithms use common regular co-occurrence patterns between source and target language words to establish links between occurrences of these words in individual sentence pairs (Hull, to be published).
2.3.3 RESULTS: HYPERTEXT INTERFACE FOR VALIDATION
LEXTER automatically generates a list of term candidates corresponding to different grammatical categories (noun, adverb, adjective, verb) as well as adjective phrases and noun phrases. The LEXTER hypertext interface for validation gives access to a list of term candidates, either in decreasing frequency or in alphabetical order, depending on the lexical units mentioned above (see Appendix A). It is also possible to view the contexts in which the term candidates appear. Thanks to the output of the TRINITY system, the term candidates and the translation candidates are presented in their aligned contexts (see Appendix B). The number of occurrences is available for each item of data. For each noun phrase, the hypertext interface gives access to its extensions, i.e. the term candidates in which the noun phrase is the syntactic head or expansion. The interface also makes it possible to use a validation scale, which we determined on the basis of French and English word segmentation as well as terminology and translation relevance. Lastly, French term candidates and their English equivalents can be entered into a lexicon if they are considered relevant.
3 Analysis of the results
3.1 Noun phrase analysis
3.1.1 CREATION OF TERMINOLOGICAL ENTRIES
The structure of the lexicon supplied by the hypertext interface made it possible to enter only French term candidates and their English equivalents, together with comments.
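To make the extraction pipeline of section 2.3 more concrete, the following is a minimal Python sketch of boundary-based noun-phrase candidate extraction of the kind LEXTER performs; it is not LEXTER itself, whose morphological analysis, parsing rules and statistical filtering are far richer. The toy tagset, the example sentence and its tagging are assumptions made purely for illustration.

```python
# Minimal illustration of boundary-based noun-phrase candidate extraction
# (a toy stand-in for the approach described in section 2.3.1, not LEXTER).
# The tagset (NOUN, ADJ, PREP, DET, VERB, ...) and the example are assumptions.

NP_TAGS = {"NOUN", "ADJ", "PREP", "DET"}           # tags allowed inside a noun phrase
BOUNDARY_TAGS = {"VERB", "PRON", "CONJ", "PUNCT"}  # tags treated as phrase boundaries

def maximal_noun_phrases(tagged_sentence):
    """Return maximal runs of NP-internal tags, trimmed of leading/trailing
    determiners and prepositions (a crude boundary-detection rule)."""
    phrases, current = [], []
    for word, tag in tagged_sentence:
        if tag in NP_TAGS:
            current.append((word, tag))
        else:
            if current:
                phrases.append(current)
            current = []
    if current:
        phrases.append(current)
    trimmed = []
    for phrase in phrases:
        while phrase and phrase[0][1] in {"DET", "PREP"}:
            phrase = phrase[1:]
        while phrase and phrase[-1][1] in {"DET", "PREP"}:
            phrase = phrase[:-1]
        if phrase:
            trimmed.append(" ".join(w for w, _ in phrase))
    return trimmed

def sub_phrases(phrase):
    """Enumerate contiguous sub-phrases of a maximal NP as further term
    candidates (LEXTER instead uses parsing rules sensitive to position)."""
    words = phrase.split()
    return {" ".join(words[i:j])
            for i in range(len(words))
            for j in range(i + 1, len(words) + 1)}

if __name__ == "__main__":
    # An assumed tagging of a made-up sentence built around "addition de lipides à la ration".
    sentence = [("l'", "DET"), ("addition", "NOUN"), ("de", "PREP"),
                ("lipides", "NOUN"), ("à", "PREP"), ("la", "DET"),
                ("ration", "NOUN"), ("améliore", "VERB"),
                ("la", "DET"), ("croissance", "NOUN"), (".", "PUNCT")]
    for np in maximal_noun_phrases(sentence):
        print(np, "->", sorted(sub_phrases(np)))
```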
Since we aimed to collect additional information useful to both translators and researchers, we created a terminological entry that would best meet their needs and based its structure on the existing entry of the translators’ database. We therefore kept some of the fields (namely Term, Variant (several), Context, Subject field, Descriptor and Linguistic note) and added two collocation fields (Noun collocation and Verb collocation). We removed the Definition field (see Appendix C). The aim was not to formulate terminological definitions but to collect defining statements in order to understand the terms. It should be noted, however, that identifying defining statements proved quite unproductive. The reason is that the main type of publication contained in the corpus is the research paper: research papers relate to the expert-expert communicative setting, and concepts are rarely defined since experts are assumed to have the same or a very similar level of expertise.
3 For further details, see Bourigault D 1994.
4 For further details, see Hull D (to be published).
3.1.2 LEVELS OF ANALYSIS
We selected noun phrases according to three approaches: terminology, translation and scientific writing. The specific needs of translators and researchers were also considered. Translators must use the appropriate terminology as well as the linguistic expressions specific to a subject field, which reflect the language as used by researchers. This is rendered difficult by the diversity of INRA's subject fields. The selection of noun phrases is influenced by the linguistic skills and scientific expertise of the translator, which usually depend on the translator's experience (namely beginner, experienced, sub-contractor). Since translators need to understand the concept referred to by the author, we collected defining statements whenever possible. Researchers, for their part, are more concerned with phraseology than terminology. Two main reasons explain why they often have great difficulty in clearly expressing their ideas in English: first, their command of the target language can be quite limited and, secondly, they tend to apply French syntax, hence the likelihood of loan translations. Their knowledge of LSP phraseology is often very poor, even when it comes to general language expressions such as to put forward hypotheses, to conduct research or to take into account, hence the focus on collecting verb phrases belonging to LGP (language for general purposes; see 3.2). Generally speaking, researchers encounter difficulty when combining specialised terms with words from general language. From a theoretical point of view, this raises the issue of "the separation of terms on the one hand and the linking elements from LGP on the other [which] may not be maintained uncritically" (Picht, 1987). After identifying these needs, we selected noun phrases on the basis of the issues raised by:
- translators: are these noun phrases or collocations specialised enough to be selected? Is this collocation an original feature linked to the author's style or a rather set expression worth being collected?
- terminologists: does this noun phrase designate a specific concept?
- researchers: which words are used to describe the steps of an experiment? Which is the appropriate verb used to describe results, a comparison, or cause and effect? Which expressions would a native English speaker use? Is there a specific expression used to convey a given idea? What is the English equivalent of that specific French verb?
We analysed each noun phrase as follows: the term candidates and the translation candidates were analysed, as well as head and expansion productivity (see Appendix D), and the French and English contexts were read through to collect defining statements and collocations which LEXTER would not have extracted.
3.1.3 IDENTIFICATION AND RECORDING OF COLLOCATIONS
The analysis of collocations highlighted the lexical and syntactic differences between the two languages, as well as their unpredictability (Heid and Freibott, 1991) when translating from French into English. We bore these linguistic facts in mind when selecting collocations, which we define as usual associations of several words linked by prepositions and referring to one or several notions. We focused on prepositions since translating the phrases they are part of is a major problem for researchers. This definition excludes complex terms with patterns such as noun + noun or adjective + noun. The main difficulty consists in differentiating a term from a collocation when analysing a sequence of lexical units. The difference involves a change from a conceptual organisation to a linguistic environment. Consequently, collocations are subject to greater variation than terms, which is also related to the author's own style and to the creative power of phraseology (Blampain, 1993). It is difficult to determine whether experts in a given subject field often use a collocation or whether it is pure stylistic "extravagance". Frequency was not considered to be a criterion of selection, as most of the corpus belongs to a very specialised discourse, the terminology and phraseology of which are not widespread (Béjoint and Thoiron, 1992). Moreover, the corpus is of limited size and is therefore not representative of scientific discourse. To record collocations, it is necessary to consider the intellectual processes involved in translating and writing. Translators and researchers probably try to answer the question What words or verbs usually combine to describe a given process, or to show results? (Cohen, 1992) rather than What words or verbs usually combine? The concept is therefore crucial, and it provides some answers as to whether users should have access to one collocation referring to several notions together, or to each of these notions separately. In the collocation addition de lipides à la ration de la truie en lactation, the various notions referred to mean that three different entries need to be created (addition de lipides / ration / truie en lactation), but the unpredictability of the collocation led us to enter addition de lipides à la ration de la truie en lactation into the Noun collocation field of all three terminological entries. As it is more relevant and reliable to search for a term rather than a collocation when querying a database, a collocation is entered under the keyword (the base of the collocation) as well as under the co-occurrent; for example, the noun collocation dry matter content is recorded under the entries content and dry matter. Recording all the collocations related to a term in a single field would make the content of the entry difficult for users to read and would slow down access to the information. Therefore, we decided to create two collocation fields to improve the way in which noun collocations and verb collocations are identified.
3.2 Verb analysis
The extraction was performed from the French corpus and concerned simple verbs.
TRINITY did not in this case align the translation candidates in English, which meant that we had to:
- read the French contexts to analyse the linguistic environment of the verb and identify complex verb phrases,
- read the English contexts to identify the equivalents.
3.2.1 CREATION OF VERB ENTRIES
The structure of the verb entry differs from that of the NP terminological entry. We integrated the following fields: French verb, English equivalent and Linguistic note, but removed the Definition, Subject field and Descriptor fields. In fact, we focused on the information relating to the ways in which these verbs function and are used in context. We created a dozen fields to record the linguistic combinations in which the verb appears (see Appendix E). These combinations are more or less fixed and constitute sentence segments that are interesting from a translation point of view. In order to facilitate access to information, we created an entry for each verb phrase containing a support verb such as mettre, which produces approximately 20 complex verb structures.
3.2.2 SELECTION OF VERBS
From the list of "verb candidates", we selected those frequently used in original research papers, because such papers predominate among the documents translated or reviewed by the Translation Unit. An original research paper displays an IMRaD structure composed of the following sections: Introduction, Materials and methods (experimental protocol), Results (and) Discussion. We first excluded specialised verbs referring to a specific notion in a subject field, for several reasons: these verbs only create minor difficulties in translation, there is very little morphological or syntactic variation between the two languages, and researchers usually know the English equivalents of specialised verbs, as they do for specialised terms. This hypothesis was confirmed by analysing the list of specialised French verbs, which was relatively limited, and their English equivalents, as shown by the following examples: cloner (to clone), coder (to code). We selected verbs belonging to general language which create difficulties for researchers when translating them, as well as support verbs such as mettre, which produce approximately 20 complex verb structures (mettre en comparaison, mettre en évidence…). Here are some of the verbs we analysed: montrer, constituer, réaliser, conduire, mener, estimer, permettre, varier, sembler, apparaître, entraîner, rechercher.
3.2.3 TERMINOGRAPHIC DESCRIPTION
The verb entry should not be limited to a lexicon, i.e. a verb and its equivalent(s) in the target language (L'Homme, 1993). Problems in translating complex verbs are related to the syntactic structures specific to French and English. Users therefore need more information than is usually provided for a simple or complex term. It is necessary to record how the verb functions from a linguistic point of view and to:
- describe how the verb functions syntactically (What kind of complement can be used alongside this verb? What preposition is required with this verb? Is it a transitive or intransitive verb?),
- provide semantic information by proposing a synonym to specify the meaning of the French verb (the French synonym is usually the one most similar morphologically to the English equivalent),
- give an example of the verb in context, as it occurs in a sentence.
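As a rough illustration of the kind of verb entry just described, the sketch below models it as a small Python data structure. The field names follow the description above (the actual database has around a dozen combination fields); the class name and the sample content for mettre en évidence, apart from the equivalent to demonstrate given in the next paragraph, are assumptions for illustration only.

```python
# A rough sketch of the bilingual verb entry described in sections 3.2.1 and
# 3.2.3. Field names follow the text; the sample content is illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class VerbEntry:
    french_verb: str                      # French verb or complex verb structure
    english_equivalents: List[str]        # one or several English equivalents
    french_synonym: str = ""              # semantic hint, morphologically close to the English
    syntactic_note: str = ""              # transitivity, required preposition, complements
    linguistic_note: str = ""             # any further usage remark
    example_in_context: str = ""          # the verb as it occurs in a corpus sentence
    combinations: List[str] = field(default_factory=list)  # recurrent sentence segments

entry = VerbEntry(
    french_verb="mettre en évidence",
    english_equivalents=["to demonstrate", "to show", "to reveal"],  # assumed set of equivalents
    french_synonym="démontrer",
    syntactic_note="transitive; mettre qqch en évidence",
    example_in_context="Ces résultats mettent en évidence le rôle des lipides.",  # invented example
    combinations=["mettre en évidence le rôle de", "mettre clairement en évidence"],
)

print(entry.french_verb, "->", ", ".join(entry.english_equivalents))
```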
Interestingly, the English equivalents of complex verb structures having a support verb such as mettre as their nucleus are often simple verbs (mettre en évidence: to demonstrate), and a single French verb may have several equivalents in English (entraîner: to cause, to lead to, to result in).
3.2.4 PROSPECTS FOR VERB ANALYSIS
The analysis of NPs had revealed errors in the extraction performed by LEXTER. For example, the NP list contains a large number of sequences for which compte de is a head ([tenir] compte de l'accumulation préférentielle d'assimilats, [rendre] compte de l'orientation des particules, [prendre] en compte des contraintes techniques). These sequences, had they been correctly identified, would have allowed verb phrases such as rendre compte de, tenir compte de and prendre en compte to be extracted. Therefore, from the list of nouns extracted by LEXTER, we selected those presumed to have high co-occurrence productivity with verbs. Examples:
hypothèse: admettre une ~, émettre une ~, faire l'~ que, rejeter une ~, etc.
résultats: commenter des ~, discuter des ~, obtenir des ~, diffuser des ~, etc.
gènes: identifier des ~, localiser des ~, introgresser des ~, porter des ~, etc.
Of the nouns with high co-occurrence productivity with verbs, it appears that some belong to general language and are combined with verbs from general language (admettre une hypothèse), while some are terms (gène) and are combined either with verbs from general language (identifier des gènes) or with specialised verbs (introgresser des gènes). These examples illustrate the fact that general language intermingles with LSP in scientific documents, and they raise the issue of the supposed dichotomy between LGP and LSP, and of the widely acknowledged point of view which treats the limit between highly specialised jargon and words from general language as a boundary (Bourigault and Jacquemin, 2000). On the basis of the list of nouns extracted by LEXTER, we will further analyse complex verb structures starting with nouns from general language, since this type of structure raises major problems for INRA researchers when writing their papers. We will then continue with verb structures including terms. Recording these different types of verb structures raises the problem of how they can be accessed. For the time being, the structures we have analysed do not allow us to define the best approach to recording them: it is still unclear whether access should be given to the whole verb structure, or to the noun and verb separately, depending on what information we will provide. According to the information available in the corpus, we could provide two kinds of entries: a specific term with the various verbs it can be combined with, or a verb and the kinds of complements with which it can be used (L'Homme, 1997). This will probably need to be discussed with future users, as researchers and translators may not go about searching databases in the same way.
4 Conclusion and prospects
In the light of the corpus size and heterogeneity, the quality of the extraction, in French as well as in English, is undeniable. Terminology extractors are useful to translators, given the very little time they have to enter terms into a terminological database. Moreover, this kind of extraction allows experienced translators to record NPs they would not have entered otherwise, judging them to be too basic. As for verbs, using LEXTER enables verb phrases to be identified rapidly within a context.
It would be interesting to perform a similar automatic extraction with English verbs. We will continue exploiting the results of the extraction and, when the database has a sufficient number of entries, we will carry out a test in the field with researchers so as to make sure it meets their needs. The work carried out on verbs could be improved to better describe verbs in scientific discourse which, in turn, would make it possible to determine appropriate conceptual classes of co-occurrents and to classify verbs according to their usage. A more detailed analysis of verbs in LSP could contribute to improving corpus-based terminology acquisition tools and, more generally, to improving NLP. Once the users’ needs are taken into consideration and the analysis is no longer restricted to noun terms, other parts of speech (verb, adverb, adjective) emerge: it is then necessary to adopt a phraseological approach.
5 References
Béjoint H, Thoiron P 1992 Macrostructure et microstructure dans un dictionnaire de collocations en langue de spécialité. Terminologie et traduction 2/3: 513-522.
Blais E 1993 Le phraséologisme. Une hypothèse de travail. Terminologies nouvelles 10: 50-56.
Blampain D 1993 Notions et phraséologie. Une nouvelle alliance ? Terminologies nouvelles 10: 43-49.
Bourigault D 1994 Lexter, un logiciel d'extraction de terminologie. Application à l'acquisition des connaissances à partir de textes. Thèse en informatique linguistique, Ecole des Hautes Etudes en Sciences Sociales, Paris.
Bourigault D, Jacquemin C 2000 Construction de ressources terminologiques. In Pierrel J-M (ed), Ingénierie des langues. Paris, Hermès Sciences Publications, pp 215-233.
Cohen B 1992 Méthodes de repérage et de classement des cooccurrents lexicaux. Terminologie et traduction 2/3: 505-511.
Heid U, Freibott G 1991 Collocations dans une base de données terminologique et lexicale. Meta 36(1): 77-91.
Hull D (to be published) Software tools to support the construction of bilingual terminology lexicons. In Bourigault D, Jacquemin C, L'Homme M-C (eds), Recent advances in computational terminology. London, John Benjamins.
L'Homme M-C 1993 Le verbe en terminologie : du concept au contexte. L'actualité terminologique 26(2): 17-19.
L'Homme M-C 1997 Méthode d'accès informatisé aux combinaisons lexicales en langue technique. Meta 42(1): 15-23.
Pesant G, Thibault E 1993 Terminologie et cooccurrence dans la langue du droit. Terminologies nouvelles 10: 23-35.
Picht H 1987 Terms and their LSP environment – LSP phraseology. Meta 32(2): 149-155.
APPENDIX A: Hypertext interface for validation: TCs (NPs) by decreasing frequency
APPENDIX B: TCs and translation candidates in aligned contexts
APPENDIX C: Terminological entry
APPENDIX D: Head and expansion productivity
APPENDIX E: Verb entry
The METER corpus: A corpus for analysing journalistic text reuse Robert Gaizauskas†, Jonathan Foster‡, Yorick Wilks†, John Arundel‡, Paul Clough†, Scott Piao† Departments of Computer Science† and Journalism‡ University of Sheffield, Sheffield, S1 4DP (contact: R.Gaizauskas@dcs.shef.ac.uk; fax: (0114) 222 1810)
Abstract
As a part of the METER (MEasuring TExt Reuse) project we have built a new type of comparable corpus consisting of annotated examples of related newspaper texts. Texts in the corpus were manually collected from two main sources: the British Press Association (PA) and nine British national newspapers that subscribe to the PA newswire service.
In addition to being structured to support efficient search for related PA and newspaper texts, the corpus is annotated at two levels. First, each of the newspaper texts is assigned one of three coarse, global classifications indicating its derivation relation to the PA: wholly derived, partially derived or non-derived. Second, about 400 wholly or partially derived newspaper articles are annotated down to the lexical level, indicating for each phrase, or even individual word, whether it appears verbatim, rewritten or as new material. We envisage that this corpus will be of use for a variety of studies, including detection and measurement of text reuse, analysis of paraphrase and journalistic styles, and information extraction/retrieval. To illustrate these potential uses we briefly describe some work we have done with the corpus to develop algorithms for detecting text reuse.
1. Introduction
The aim of the METER (MEasuring TExt Reuse) project1 is to investigate how text is reused in the production of newspaper articles from newswire sources and to determine whether algorithms can be discovered to detect and quantify such reuse automatically. It is to be hoped that results will generalise beyond the newspaper-newswire scenario and provide broader insights into the nature of text derivation and paraphrase; but the newspaper-newswire scenario provides an ideal initial case study, and one with considerable potential practical application – see below. To assist in this study it was necessary to create a comparable corpus2 consisting of a selection of newswire texts and newspaper articles reporting the same stories, in some cases derived from the newswire texts and in some cases not. Because the Press Association, the major British domestic newswire service, is a collaborator in the METER project and has provided us with unrestricted access to its newswire service, we have used its archive as the source newswire for our corpus, and texts from a variety of its subscribers in the British press as the candidate derived texts. Having assembled the corpus and annotated it to assist in our study of text reuse, we believed the corpus would be of wider interest to the corpus linguistics, natural language processing and language engineering communities, and hence decided to package and release the corpus on its own. This paper describes the design, structure and contents of the corpus, and illustrates its potential by briefly describing some experiments we have carried out using it. The METER Corpus is available free of charge for research purposes. It should be stressed that the METER corpus is a pioneering corpus for the study of text reuse and that, as such, it is no doubt flawed and limited in various ways. Resource limitations have meant limiting the size of the corpus and the amount of inter-annotator verification carried out on the annotations. Ideas as to how it should be annotated continued to evolve during the process of annotation, which means that complete consistency across annotations has probably not been achieved. Our hope is that despite these limitations the corpus will still prove useful to others, even if only as a starting point for designing a better resource.
1 For further details of the METER project, see: http://www.dcs.shef.ac.uk/nlp/funded/meter.html.
2 Johansson et al. (1996: 3) define comparable corpus as: “corpora consisting of parallel original and translated texts in the same languages”.
2. Text Reuse in the British Press
The Press Association (PA) is the national news agency for the UK and Ireland. It provides regional, national and international news 24 hours a day, 365 days a year to its media customers throughout Britain. On a daily basis, the PA supplies 1,500 news, sport and feature stories and one hundred news and sport photographs to the newspaper industry. The PA also supplies listings information in, for example, finance, arts and entertainment and television. In addition, the PA supplies text to specific news and sport internet websites, as well as providing news copy and information for other leading commercial and public sector organisations. Those using its services include national and international newspapers, regional morning and evening papers and terrestrial radio and television broadcasters. Thousands of weeklies, periodicals and magazines also receive various types of PA output. Through its services, the PA quite clearly performs a critical function for the British media, whether they operate in print, audio or electronic forms. Because of its ongoing supply of both domestic and national news, it is clear that the PA has a critical role in setting the news agenda through its capacity to distribute salient information rapidly to news organisations. The PA, which has been operating since the nineteenth century, is in a unique position in the British media industry and is widely regarded as a most credible, authoritative and trustworthy journalistic source for the newspaper and broadcast enterprises. Since the PA is a primary supplier, the news it issues is widely reused, either directly or indirectly, in British newspapers. Even if not directly using PA copy, journalists invariably refer to the agency's newswire service during report production. This ensures verification of the ’facts’ in a story and also facilitates effective ’copy tasting’ decisions (copy tasting is the process of assessing which of all the news available at a given time should be included in the current newspaper edition). Therefore, rich “real” examples of text reuse can be expected from a collection of PA copy and related newspaper articles. The study of text reuse has, aside from its intrinsic academic interest, a number of potential applications. Like most newswire agencies, the PA does not monitor the uptake or dissemination of the copy it releases, because tools, technologies, and even the appropriate conceptual framework for measuring reuse are unavailable. For the PA, potential applications of accurately measuring reuse of its text include: 1) monitoring of source take-up to identify unused or little used stories; 2) identifying the most reused stories within the British media; 3) determining customer dependencies on PA copy; and 4) new methods for charging customers based upon the amount of copy reused. This could create a fairer and more competitive pricing policy for the PA3.
3 For a more comprehensive summary of the PA and public access to a portion of their newswire, see the PA website: http://www.ananova.com.
3. Construction of the METER corpus
The texts of the METER corpus were collected manually from the PA online service and the paper editions of nine British newspapers: The Sun, Daily Mirror, Daily Star, Daily Mail, Daily Express, The Times, The Daily Telegraph, The Guardian and The Independent4. Building a general newspaper corpus was beyond the time and resource limits of the METER project, so we have limited the corpus to just two domains: British law court reporting and show business stories.
4 The first five of these papers are published in tabloid format and are viewed as the “popular” press; the latter four are referred to as “broadsheets” and viewed as the “quality” press.
Court stories were chosen because of the substantial amount of data available in both the newspapers and the PA, and because of their regular recurrence in British news. Court stories also revolve around “facts” such as the name of the accused, the charge and the location of a trial or inquest, with limited scope for journalistic interpretation. This information is found in both newspaper and PA reports, even when newspapers do not use the PA as a source. Courts generally sit from Monday to Friday; therefore PA copy was collected for these days, and newspaper versions of the same story appearing on the succeeding day were also chosen. Court cases reported by newspapers for which the PA did not produce copy were ignored. Stories were collected for cases that lasted just one day, as well as for cases that stretched over much longer periods. The other domain included in the corpus is show business and entertainment news. This was chosen to contrast with the court domain. The show business press exhibits a more expansive style, with greater freedom of journalistic expression and interpretation, and show business stories tend to be reported in a more frivolous, light-hearted manner. Like court reporting, show business news is a stable and recurring feature of contemporary British news. The show business stories, however, form a secondary collection to that of the court stories, and less data from this domain is included in the corpus. Just as practical restrictions on corpus construction limited the scope of news domains included in the corpus, so too did they limit the temporal extent of the material included. Too narrow a date range might have led to a biased sample, but too wide a range would have led to too much material. In the end we settled on gathering material from a one-year period. The text collection spans 24 days of law court reporting and 13 days of show business stories, from 12 July 1999 to 21 June 2000. PA stories are classified under a number of discrete, identifiable news categories, such as Courts, Showbiz (show business), Politics, Education, etc. Under each of these categories, stories are further classified into sub-categories, such as Courts (Axe), Courts (Strangle) and Courts (Gamekeeper). Each of the sub-categories refers to an individual story, incident or event. Such a sub-category is called a catchline. For each catchline, the PA follows the development of the story and keeps releasing updated reports throughout the day. Each report occupies a single web page – termed a PA page. A catchline therefore contains one or more PA pages. On each day in the study, all PA catchlines and associated pages relating to courts or show business were identified through the PA categories Courts and Showbiz. The PA pages under a selected catchline were downloaded into separate electronic files. For each selected PA catchline, the final southern editions of the nine British national newspapers from the next day were then examined. The newspaper reports about law and court stories were compared against the PA catchlines, and all those that the PA had covered the day before were identified.
These newspaper articles were then manually scanned into separate electronic files (and later manually examined and spell-corrected). Original paper copies were used in constructing the METER corpus because web-published versions of the stories were neither reliably available nor, even if available, reliably identical to the published paper copy, which is still viewed as the definitive form of publication for a newspaper. Table 1 gives general statistical information about the METER corpus. As discussed, the corpus consists of two main parts, law and court reports and show business reports. These contain 1,430 texts (458,992 words) and 287 texts (76,158 words) respectively – here we use the term text to refer either to a PA page or to a newspaper article. In terms of PA-sourced and newspaper-sourced texts, 773 PA pages (239,679 words) versus 944 newspaper articles (295,471 words) were included. Of the PA-sourced texts, 661 court texts and 112 show business texts are associated with 205 Court catchlines and 60 Showbiz catchlines respectively.
Table 1: Statistics of the METER corpus (for the newspaper sources, texts are broken down into wholly derived / partially derived / non-derived, WD/PD/ND)
Source       | Law and Court                  | Show Business                | Total
             | Words    Texts (WD/PD/ND)      | Words   Texts (WD/PD/ND)     | Words    Texts
PA           | 206,354  661 (205 catchlines)  | 33,325  112 (60 catchlines)  | 239,679  773 (265 catchlines)
Other        | 1,269    0/3/2                 | 0       0/0/0                | 1,269    5
Times        | 34,794   24/41/46              | 2,966   5/7/2                | 37,760   125
Star         | 14,021   15/28/27              | 7,590   10/19/7              | 21,611   106
Express      | 21,956   17/27/18              | 5,270   1/8/5                | 27,226   76
Mirror       | 17,359   22/32/28              | 4,211   7/11/6               | 21,570   106
Mail         | 31,686   21/29/7               | 6,414   0/7/7                | 38,100   71
Guardian     | 38,499   12/46/37              | 3,805   4/6/3                | 42,304   108
Telegraph    | 45,768   30/62/35              | 2,985   6/7/2                | 48,753   142
Sun          | 18,597   18/37/15              | 6,010   4/24/7               | 24,607   105
Independent  | 28,689   7/37/46               | 3,582   2/7/1                | 32,271   100
Total        | 458,992  1,430                 | 76,158  287                  | 535,150  1,717
The primary aim of constructing the corpus was to provide a resource for studying the reuse of text between the PA and various subscribing newspapers, and not to provide a resource for studying how the same story is handled differentially across the British press. Nevertheless, there is a significant number of stories in the corpus which are carried across multiple newspapers. Figure 1 illustrates the distribution of shared catchlines across newspapers. Most catchlines have only a single newspaper article associated with them; but more than ten in the courts domain are to be found in all nine of the newspapers represented in the corpus.
Figure 1: Distribution of shared catchlines across newspapers (number of catchlines plotted against the number of newspapers reporting the same catchline, for the Courts and Showbiz domains)
The total effort expended in creating the corpus was about two person-years. Gathering stories, converting them to electronic form and manual annotation took about one person-year. Subsequent OCR error checking and correction, organising the electronic texts into a structure, designing the mark-up scheme and electronic annotation took approximately another person-year.
4. Structure of the METER corpus
An important issue in corpus construction is deciding upon an appropriate structure in which to store the data, so as to facilitate human and machine access to it. In the current METER corpus, the texts (PA pages plus newspaper articles) are arranged in a tree structure. Texts are clustered according to their origin, topic and date of release. This structure provides a unique identifier for each text within the corpus. Figure 2 illustrates the overall structure of the METER corpus (lower level details are shown only for the Courts domain – the Showbiz part of the corpus is structured similarly).
Figure 2: Topology of the METER corpus (a six-level tree: PA vs newspapers; Courts vs Showbiz; rawtext vs annotated; date of release; catchline; individual PA pages or newspaper files)
As shown in Figure 2, the PA and newspaper stories are stored in a six-level tree structure. At the topmost level, the corpus is divided according to the source of the materials, PA texts under one branch and related newspaper texts under the other. At the second level, both branches split into two according to the domain of the materials, Courts stories under one branch and Showbiz stories under the other. At the third level, all branches bifurcate again, into raw and annotated sub-branches. The raw sub-branch contains exact copies of all texts as downloaded from the PA website or scanned in and OCR error-corrected from the newspapers; the annotated sub-branch contains marked-up versions of the texts, as described in more detail in section 5 below. The fourth level of the tree classifies the texts by the date of release. All texts issued on the same day, both in the PA and newspaper divisions, are grouped together under the same directory, which takes the date as its name, in the form day.month.year, e.g. 12.04.99. Note, however, that the newspaper texts in the METER corpus are released one day later than the corresponding PA texts; for the sake of retaining a parallel data structure between PA and newspapers, the PA dates are used as directory names for both the PA and newspaper sections. Thus newspaper texts are actually released a day later than their directories indicate. At the fifth level, i.e. within date, in both the PA and newspaper divisions the texts are grouped into catchlines. It should be noted, however, that the texts under the same catchline are related in different ways within the PA and newspaper halves. In the PA division, all the PA pages reporting a given story on the same date are stored under the catchline under which the PA released them. In the newspaper division, all the newspaper versions of a given story are stored beneath the PA catchline for that story, for example the Times', Sun's and Telegraph's versions of reports about a killing. The sixth and final level of the corpus contains the leaf nodes of the tree structure, the actual text files which make up the corpus content. Each text is given a filename that conveys basic information about it. The general naming convention for PA texts is: catchline+PA-page-number+{story-type}.txt (“+” here indicates concatenation, “{}” indicates optionality). catchline and PA page have been explained previously. story-type indicates whether a PA text is a “non-standard” press release, such as a nightlead, snap, sub, etc. For example, the filename sergeant1snap.txt refers to a text which is a single sentence giving urgent, breaking news about an incident or event involving a sergeant. Table 2 lists the PA text types included in the METER corpus. The filenames given to newspaper texts, on the other hand, consist of three parts: catchline+filecode_newspaper-name.txt. The second component, filecode, is a unique number assigned to each newspaper article for identification.
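Since, as noted below, the directory and file names alone are enough to recover basic metadata about a text, the small Python sketch that follows illustrates the idea for the conventions just described. The exact ordering of directory levels, the regular expressions and the treatment of edge cases are simplifying assumptions rather than part of the corpus documentation.

```python
# A small sketch of recovering basic metadata from METER corpus paths and
# filenames alone, following the conventions described above. The regular
# expressions and the handling of directory levels are simplifying assumptions.
import re

PA_FILE = re.compile(r"^(?P<catchline>[a-z]+?)(?P<page>\d+)(?P<storytype>[a-z]*)\.(txt|sgml)$")
PAPER_FILE = re.compile(r"^(?P<catchline>[a-z]+?)(?P<filecode>\d+)_(?P<newspaper>[a-z]+)\.(txt|sgml)$")
DATE_DIR = re.compile(r"^\d{2}\.\d{2}\.\d{2}$")

def describe(path):
    """Guess the source, domain, annotation status, date and catchline of a
    corpus file from its directory path, plus the components of its filename."""
    *dirs, filename = path.split("/")
    info = {}
    for d in dirs:
        if d in ("PA", "newspapers"):
            info["source"] = d
        elif d in ("courts", "showbiz"):
            info["domain"] = d
        elif d in ("rawtext", "annotated"):
            info["annotation"] = d
        elif DATE_DIR.match(d):
            info["date"] = d
    pattern = PA_FILE if info.get("source") == "PA" else PAPER_FILE
    match = pattern.match(filename)
    if match:
        info.update(match.groupdict())
    return info

print(describe("meter_corpus/PA/courts/21.06.00/sergeant/sergeant1snap.txt"))
print(describe("meter_corpus/newspapers/annotated/courts/16.07.99/banker/banker125_telegraph.sgml"))
```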
For example, the newspaper filename thomas353_telegraph.txt refers to a newspaper article from the Telegraph newspaper, filed under the catchline thomas with the file code 353.
Table 2: Different PA file types
Example filename | Code | File type description
meter_corpus/PA/courts/21.06.00/sergeant/sergeant1.txt | (none) | A page of a story released by the PA (the number indicates the page).
meter_corpus/PA/courts/21.06.00/sergeant/sergeant1lead.txt | lead | A page of PA copy that summarises the major aspects of a story.
meter_corpus/PA/courts/21.06.00/sergeant/sergeant1nl.txt, meter_corpus/PA/courts/22.11.99/inqheart/inqheart1ld.txt | nl or ld | As lead - compiled in the afternoon or evening; especially useful for the next day's daily newspapers.
meter_corpus/PA/courts/21.06.00/sergeant/sergeant1snap.txt | snap | A single sentence giving urgent, breaking news.
meter_corpus/PA/courts/21.06.00/sergeant/sergeant1sub.txt | sub | A page of copy that develops previous material sent by the PA.
meter_corpus/PA/courts/07.10.99/wife/wifecorr.txt | corr | Amendment to earlier copy provided.
meter_corpus/PA/courts/16.07.99/care/care1nlcorr.txt | nlcorr | Nightlead correction.
meter_corpus/PA/showbiz/14.12.99/mccartney/mccartney1ff.txt | ff | Fact file - a series of bullet points to accompany an existing story.
The parallel structure of the PA and newspaper divisions in the METER corpus facilitates manual and automatic search for related PA and newspaper texts. Furthermore, the indicative directory and file names make it possible to retrieve basic information about a text automatically, e.g. data source, text domain, date of release and topic. This information may be obtained directly from the corpus structure and text filenames, without the need to look inside any of the texts, either at the text content or at the embedded markup. Such information provides one straightforward route to exploiting the corpus; however, all this information, and more, is also available from metadata embedded in the annotated portion of the corpus.
5. Annotation of the METER corpus
Leech (1997: 4) suggests that “corpora are useful only if we can extract knowledge or information from them”. While raw texts contain useful data, it is difficult to extract from them automatically anything much beyond basic information such as word frequencies and collocations; more complex information needs to be encoded explicitly in the corpus. In order to increase the utility of the METER corpus, we have encoded information pertaining to text reuse in the newspaper section of the corpus. Each of the newspaper articles has been manually classified into one of three general categories, indicating its degree of derivation from the PA. In addition, approximately 400 newspaper articles have been subject to detailed annotation down to the sentence, phrase, or even word level. The annotation of the METER corpus is described in the following subsections. Due to limitations of project resources and time, all annotations were carried out by one person, a professional journalist. However, we are in the process of obtaining second judgements from another expert to verify the choices made by the original journalist in the general classification task, for 5% of the texts in each category. Resources permitting, we will validate the more detailed annotations later.
5.1. General classification at the document level
As mentioned earlier, every newspaper text in the METER corpus shares a topic with one of the PA catchlines.
However, the newspaper articles are related to their PA counterparts in a variety of ways. Some of them use whole pieces of PA text without any change; some modify PA texts to fit their specific requirements; some supplement PA materials with other content that is not to be found in the PA materials; and some do not appear to have consulted the PA at all, even though the PA provided relevant materials for the story. In order to capture general information about the reliance of newspapers on the PA, each newspaper text is classified by an expert journalist into one of the following three categories:
a) Wholly derived (WD) – all content of the target text is derived only from the PA.
b) Partially derived (PD) – some content of the target text is derived from the source text. Other sources have also been used.
c) Non-derived (ND) – no content of the target text is derived from the source text. Although verbatim and rewritten text may appear in the target text, the context, overlap of entities or use of source text is not indicative of reuse.
Note that this classification is based upon judgements concerning the source of the content in the newspaper article, not simply upon surface criteria such as the presence of a certain number or length of shared tokens. Since we are interested in studying the mechanisms of text reuse, we cannot begin by presuming what we hope to discover; i.e. we cannot begin by defining text reuse in terms of surface linguistic criteria. Otherwise, we can never hope to discover more than our initial definition. Instead we rely on the judgement of expert journalists who bring years of experience to bear, comprising specialised linguistic and world knowledge. In a wholly derived newspaper text, all of the facts can be mapped, with varying degrees of directness, to the PA text(s) under the shared catchline. In the most direct cases, the whole or part of a PA text is copied verbatim to form the newspaper text. In other cases, a PA text is modified in various ways before being deployed in a newspaper article, including change of word order, substitution of synonyms, and paraphrase. In such cases, the relation between the newspaper text and its counterpart PA text(s) is often not so clear, and is sometimes even difficult for a human to infer. In a partially derived newspaper text, part of the text can be mapped to the corresponding PA text(s), but for other parts no related materials can be found in the PA texts. In other words, newspaper articles in this category contain new facts not found in the PA. This category represents an intermediate degree of dependency of newspaper texts on the PA. Within this category, the level of dependency varies considerably, from the majority of the text being derived to only one or two sentences being derived from the PA. The last category covers those newspaper articles that were written independently of the PA. Note that we are considering newspaper and PA texts covering the same event: the PA provides coverage of the event, but it has not been utilised by the newspaper. Instead, the newspaper has used other journalistic sources. This category represents the null dependency of the newspaper on the PA. The number of texts in each category can be obtained from Table 1. This three-way classification is encoded in the METER corpus as described in section 5.3 below.
With such information explicitly available, the METER corpus can be used for training and evaluating algorithms for detecting text reuse in the journalistic domain, for studying journalistic styles, and so on.
5.2. Detailed classification at the lexical or phrasal level
In addition to the document-level classification of the whole of the newspaper portion of the METER corpus, detailed annotation of about 400 of the wholly or partially derived newspaper articles was also carried out. In this annotation, individual words, phrases or sentences in the newspaper texts were tagged with information about their derivation from the PA. The detailed classification parallels the text-level classification discussed in the previous section. Three categories were used:
1) Verbatim: text that is reused from the PA word-for-word in the same context;
2) Rewrite: text that is reused from the PA, but paraphrased to create a different surface appearance; the context is still the same;
3) New: text that does not appear in the PA, or that is apparently verbatim or rewritten but is used in a different context.
Of these three categories, the rewrite appears in various forms, such as change of word order or paraphrase. Such modification of the PA texts occurs for four main reasons:
a) the PA text may be re-written to comply with the house style of the newspaper;
b) it may be re-written to fit the space available in the newspaper;
c) information dispersed throughout the PA texts may be re-arranged into a single, coherent sequence;
d) a PA text may be dramatised by replacing neutral words with more dramatic words.
Accordingly, a PA text can be modified in the following ways:
1) rearrangement of word/phrase/sentence order or position;
2) substitution of original terms with synonyms or other context-dependent substitutable terms;
3) deletion of original materials;
4) insertion of minor new materials (e.g. the addition of words like by in passivisation).
In the annotation, newspaper materials falling into any one of the above four categories were tagged as rewrite. There are some controversial cases in which a subjective judgement had to be made. On the other hand, the tagging of the verbatim and new materials was generally straightforward, although minor mistakes are inevitable in a manual annotation. All the information about the dependency of newspaper materials on the PA is encoded in an SGML annotation scheme, as described in the following sub-section.
5.3. Annotation of the METER corpus
The METER corpus is annotated in SGML (Standard Generalised Markup Language) to comply with the international mark-up standard (Goldfarb, 1990), and a customised SGML DTD (Document Type Definition) was developed for it. The DTD provides a framework for recording general information about the texts (such as date, catchline, etc.) as well as for recording information specifically pertaining to reuse. Figure 3 illustrates the structure of the METER SGML DTD5. As shown in Figure 3, the METER mark-up scheme consists of three levels of tags. The topmost is the header, which keeps general information about the file, including the file name, source newspaper, newspaper page number on which the report appeared, date of release and catchline. For newspaper articles it also includes the text-level classification of the article as wholly derived, partially derived or non-derived. This header mark-up is found in every text in the whole corpus. The second level consists of two elements: the title and the body of the document. The title is optional.
The third layer consists of three elements: verbatim, rewrite and new (for their definitions, see section 5.2). They are sub-elements of body, and apply to individual tokens or token sequences in the newspaper article, including single words, phrases, sentences and punctuation marks. Generally the annotation focuses on words rather than punctuation marks, but the latter were included in the annotation because they may play a significant role in some cases, e.g. in the recognition of quotations.
Figure 3: Structure of the METER DTD
<Header> (required). Attributes: filename: filename of the text (required); newspaper: the newspaper name (required); domain: courts or showbiz (required); classification: wholly-derived, partially-derived or non-derived (optional); pagenumber: the newspaper page number (optional); date: the date of publication (required)
<Title> (optional). No attributes
<Body> (required). No attributes
<Verbatim> (optional). Attributes: PA_src: the source PA sentence(s) (optional)
<Rewrite> (optional). Attributes: PA_src: the source PA sentence(s) (optional)
<New> (optional). Attributes: PA_src: the source PA sentence(s) (optional)
The detailed annotation was carried out in two stages. First, a journalist analysed and classified the contents of about 400 newspaper articles on paper. Later, the classification was transcribed into tags in the electronic version of the corpus using an annotation tool. Figure 4 displays a sample of annotated newspaper text. In this example the first line informs an SGML parser of the document type and the location of the DTD file. The header indicates that the sample is a newspaper report about a court story related to the PA catchline "banker", which appeared on page 4 of the Telegraph on 16 July 1999 and is wholly derived from the PA. The text body is broken into segments, each of which is tagged with one of the three categories: verbatim, rewrite and new. The value of the attribute PA_src, which indicates the location of the PA source sentence(s), is currently blank, but may be completed in a subsequent release of the corpus.
Figure 4: A sample annotated newspaper article from the METER Corpus
Original PA version:
BANKER'S BITTERNESS LED TO SYSTEMATIC THEFTS< By Lyndsay Moss, PA News< A middle-aged banker who stole more than £270,000 from his bosses because he resented younger staff being promoted over his head, was jailed for four years today.< Trusted Derek Boe, 48, used some of the money to splash out on holidays, buy a car and a caravan, and pay for expensive home improvements.<
Telegraph version:
A BANKER who stole more than £270,000 from his bosses because he resented younger staff being promoted over his head, was jailed for four years yesterday. Derek Boe, 48, used some of the money for holidays, to buy a car and a caravan, and to pay for home improvements.
Annotated Telegraph version:
<!DOCTYPE meterdocument SYSTEM "meter_corpus/dtds/meter.dtd"> <meterdocument filename="meter_corpus/newspapers/annotated/courts/16.07.99/banker/banker125_telegraph.sgml", newspaper="telegraph", domain = "courts", classification="wholly-derived", pagenumber="4", date="16.07.99", catchline="banker"> <body> <verbatim PA_src="">A </verbatim> <verbatim PA_src="">BANKER who stole more than </verbatim> <rewrite PA_src="">£270,000 </rewrite> <verbatim PA_src="">from his bosses because he resented younger staff being promoted over his head, was jailed for four years </verbatim> <rewrite PA_src="">yesterday. </rewrite> <verbatim PA_src="">Derek Boe, 48, used some of the money </verbatim> <rewrite PA_src="">for </rewrite> <verbatim PA_src="">holidays, </verbatim> <rewrite PA_src="">to </rewrite> <verbatim PA_src="">buy a car and a caravan, and </verbatim> <rewrite PA_src="">to </rewrite> <verbatim PA_src="">pay for </verbatim> <verbatim PA_src="">home improvements. </verbatim> </body>
6. Preliminary experiments with the METER corpus
Currently, the corpus is being used in the METER project for training and evaluating algorithms to detect text reuse. Our initial investigations have utilised the document-level annotations only and have addressed the following task: given a PA text and a candidate derived text, determine whether the text is wholly derived, partially derived or non-derived. We have tried three approaches to this task: the dotplot, n-gram overlap and text alignment. Details can be found in Clough et al. (2001); here we give only a brief overview of the work to illustrate how the corpus can be useful. The dotplot is a tool adapted from the biological domain to visualise the similarities and differences between input streams, whether text in electronic format or software code (Helfman, 1993). This technique was tested on METER texts from the three categories: wholly derived, partially derived and non-derived. It was found that, when compared with candidate PA source texts, newspaper texts from the three categories could generally be identified by distinct dotplot patterns.
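As a rough illustration of the dotplot idea just described, the following sketch marks the positions at which two token streams share a word; long diagonal runs of matches correspond to verbatim reuse. The tokenisation, the text rendering and all names are illustrative assumptions, not the METER project's implementation.

```python
# Minimal dotplot sketch (assumption: naive whitespace tokenisation).
def tokenize(text):
    return text.lower().split()

def dotplot(source, target):
    """Return the set of (i, j) pairs where source token i equals target token j."""
    src, tgt = tokenize(source), tokenize(target)
    positions = {}
    for j, tok in enumerate(tgt):
        positions.setdefault(tok, []).append(j)
    return {(i, j) for i, tok in enumerate(src) for j in positions.get(tok, [])}

def render(source, target):
    """Crude text rendering: one row per source token, '*' where tokens match."""
    src, tgt = tokenize(source), tokenize(target)
    points = dotplot(source, target)
    for i in range(len(src)):
        print("".join("*" if (i, j) in points else "." for j in range(len(tgt))))

if __name__ == "__main__":
    pa = "a banker who stole more than 270,000 pounds was jailed for four years today"
    telegraph = "a banker who stole more than 270,000 pounds was jailed for four years yesterday"
    render(pa, telegraph)   # a near-unbroken diagonal of '*' indicates verbatim reuse
```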
However, a disadvantage of the dotplot is that, without more complex processing, no quantitative similarity value is produced beyond the visual image. Building the dotplot for large input streams is also computationally expensive. An alternative approach is to measure similarity between texts simply by counting the number of word n-grams they share. From initial experiments, we found that derived texts appeared to share more n-grams of length 3 words and above. This follows the intuition that derived texts share longer matching strings than non-derived texts. However, we also found some instances of non-derived texts which contained shared 10-grams (due to shared directly quoted text). Using the METER corpus we trained the classifier by selecting optimal threshold values for various parameters. Results in testing ranged from 50-70% correct in classifying documents as wholly derived, partially derived or non-derived. The final approach to this classification task that we have explored so far is text alignment (see, e.g., Manning and Schütze (1999) for a review of statistical alignment techniques). Given a candidate derived newspaper text, we first carry out a best-match alignment at the sentence level. Then we estimate the dependency of the whole candidate derived text on the source, using parameters derived from the METER corpus. Preliminary results show 80-90% correct classification, depending on the setup. The METER corpus has provided invaluable data for the development and evaluation of this algorithm. These experiments are but a few examples of potential applications of the METER corpus. When it becomes widely available to the research community, we envisage that a much wider range of applications will be found for it.
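A minimal sketch of the n-gram overlap measure described above follows; the containment score and the thresholds are illustrative placeholders, not the parameters actually learned from the METER corpus.

```python
# Sketch: fraction of the newspaper's word n-grams that also occur in the PA text.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def ngram_containment(pa_text, news_text, n=3):
    pa = ngrams(pa_text.lower().split(), n)
    news = ngrams(news_text.lower().split(), n)
    return len(news & pa) / len(news) if news else 0.0

def classify(pa_text, news_text, n=3, wd_threshold=0.6, pd_threshold=0.2):
    """Map the overlap score onto the three document-level categories
    (thresholds are placeholders, not values tuned on the corpus)."""
    score = ngram_containment(pa_text, news_text, n)
    if score >= wd_threshold:
        return "wholly-derived"
    if score >= pd_threshold:
        return "partially-derived"
    return "non-derived"
```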
7. Conclusion
In this paper, we have described a new corpus – the METER corpus – built as part of a project to investigate text reuse in the world of newspaper journalism. It contains texts from the domains of law court and show business reporting. The texts were collected from the Press Association and nine British national newspapers. The data were manually collected and classified by a professional journalist. All of the newspaper articles are classified at the document level according to their dependency on the PA. Some of the newspaper articles are also annotated at the phrasal or even lexical level to indicate material that is verbatim, rewritten or new. This is an innovative corpus resource for the communities of corpus linguistics and natural language processing/engineering. The critical role of the PA in the contemporary British media industry and the breadth of the newspaper sources from which the data were collected mean that the METER corpus is highly representative of contemporary media in the given domains. The fact that all of the data in the corpus were manually selected and tuned ensures high quality and reliability. Until now, no corpora marked up with information about text reuse have been reported. The METER corpus fills this gap. We envisage that the corpus will be of use for a range of research, including the study of text reuse, plagiarism, journalistic style, lexicon building, document clustering and language generation.
Acknowledgements
The authors would like to acknowledge the UK Engineering and Physical Sciences Research Council for financial support for the METER project (GR/M34041). We would also like to thank the Press Association for supplying us with access to their newswire archive and for discussions regarding text reuse in the British press. Finally we would like to thank Andrea Setzer for supplying, and helping us to customise, the annotation tool we used to annotate the METER corpus.
References:
Clough P, Gaizauskas R, Piao S, Wilks Y 2001 METER: MEasuring TExt Reuse. Department of Computer Science Research Memorandum CS-01-03, University of Sheffield.
Goldfarb C 1990 The SGML Handbook. Oxford, Oxford University Press.
Helfman J 1993 Dotplot: a program for exploring self-similarity in millions of lines of text and code. Journal of Computational and Graphical Statistics 2(2): 153-174.
Johansson S, Ebeling J 1996 Exploring the English-Norwegian parallel corpus. In Percy C E, Meyer C F, Lancashire I (eds), Synchronic corpus linguistics. Amsterdam-Atlanta, GA, Rodopi B.V., pp 3-15.
Leech G 1997 Introducing corpus annotation. In Garside R, Leech G, McEnery A (eds), Corpus annotation – linguistic information from computer text corpora. London & New York, Longman, pp 1-18.
Manning C, Schütze H 1999 Foundations of Statistical Natural Language Processing. Cambridge, MA, MIT Press.
ROSETTA: Rhetorical and semantic environment for text alignment
Hatem Ghorbel, Afzal Ballim, Giovanni Coray
LITH-MEDIA group, Swiss Federal Institute of Technology, IN Ecublens, 1015 Lausanne, Switzerland
Phone: +41-21-693 52 83, Fax: +41-21-693 52 78
{hatem.ghorbel, afzal.ballim, giovanni.coray}@epfl.ch
Abstract
In the framework of machine translation of multilingual parallel texts, the technique of alignment is based on statistical models and shallow linguistic parsing methods.
When applying alignment to corpora in which different versions are derived or interpreted from the same source, further criteria are needed to account for the forms of disparity between these versions. In this article we propose a content-driven approach based on the semantic and pragmatic structure of texts to aid the process of aligning and comparing such versions.
1. Introduction
Alignment is the process of establishing the relationship between the different subparts of two or more comparable documents. Much of the early work on alignment is still used as the basis for more advanced systems. These methods are mainly based on statistical models of translated texts, some based on word or character frequencies, others on string occurrences. Whereas previous work in alignment has viewed texts as essentially a flat stream of characters or words, other approaches have incorporated some structural properties of the documents as further criteria. For example, the logical structure of the documents (e.g. sections, chapters, titles, etc.) is often used. In this paper, we propose an extension to the existing methods in which the semantic and pragmatic structure of texts serves as the basic criterion for establishing a correspondence between a parallel pair. We focus on parallel corpora in which different versions are derived from the same material (intra- or inter-lingual documents). The practical goal of alignment is to give experts, students or ordinary users a tool that facilitates the on-line comparative analysis of ancient texts and allows them to navigate through the various components of the different versions. Some experiments on ancient manuscripts in medieval French have shown the limits of statistical alignment, owing to the considerable variation between these versions, which exhibit omissions, insertions and substitutions that range from words to sentences and sometimes to larger spans of text. Despite these lexical and syntactic variations, the semantic content, i.e. the meaning, remains invariant. This suggests that the versions have very similar semantic structures. We support this hypothesis by observing that each text holds a semantic content within its lexicon and its linguistic structure. Generally, this content remains invariant when the text is translated or when a further version is produced. The variation lies at the lexical level or in the details that are added or omitted; the main ideas and the intentions of the writer, however, remain the same. Based on this hypothesis, if we manage to model the semantic content by means of an appropriate discourse structure, we can compare the content from the perspective of that structure. As an ultimate goal for alignment, we intend to provide an environment that facilitates the analysis of ancient texts and gives access to contextually relevant materials such as slides of original books, commentaries from rare books, and annotations added by domain experts. This meta-information is often divorced from the original versions; making it easily available to users therefore provides a powerful educational mechanism for specialised studies.
2. Background
When the idea of fully automated high-quality translation was given up by most researchers, semi-automatic and assisted translation became the goal of most translation projects. To support such approaches, many linguistic resources (machine-readable dictionaries, thesauri, tagged corpora, etc.) and knowledge bases (e.g. WordNet) were built.
Parallel corpora became an important multilingual resource complementing classical translation tools, for both humans and machines. Based on translated segments in parallel corpora, approaches to memory-based or example-based machine translation were developed (Sato & Nagao 1990, Sumita et al. 1990). Parallel corpora were then used for other natural language applications, for instance translation consistency checking, cross-language information retrieval, document comparison and other multilingual applications. The problem of alignment, that is of establishing correspondences between segments of documents, became an end in itself.
3. The alignment problem
3.1. Alignment of same-text versions
Same-text versions are different presentations of texts derived from or based on the same original document. These versions represent different same-language or cross-language translations preserving the semantic content, i.e. the meaning. Usually these versions are interpretations of epic stories that tell of gods or of the adventures of great heroes. They often represent the values and ideals of an entire country or people. An automatic alignment of such texts establishes a mapping that relates the content of one text to the content of the other, where the items being mapped are interpretations of the same span of the original text. This is an interesting problem that is closely related to the alignment of different language translations of a common source. Nevertheless, in the latter case the translation is usually so regular and homogeneous that a great deal of success is achieved using existing techniques (see the state of the art in section 4). Unfortunately, most of this work has focused on cross-language translations and very few projects have considered the alignment of different versions of ancient texts. Owen et al. (1998) focused on the alignment of Iliad, Odyssey and Bible versions in the framework of the HEARER HOMER project. Our motivation for performing this sort of alignment is to introduce a new tool that supports: 1. comparative navigation within a corpus of parallel versions of ancient texts, enabling a comparative analysis of translation style, linguistic and geo-linguistic features, and the dialectal properties of the language; 2. improved comprehension of the documents by grouping parallel segments from various versions that may exist in another textual or graphical form; 3. random access to information across the different textual streams, and retrieval of content and metadata attached to the translated content in the different versions, based on a search of an original document or an alternate translation.
3.2. Case study: medieval texts
In this project we are interested in French medieval manuscripts, in particular the manuscripts produced between the XIIth and the XVth century. These manuscripts are sets of different renderings, over time, of the same original texts. They were written by authors, often unknown, with different cultures and skills. Each manuscript thus reflects its own cultural and geo-linguistic features, which depend on a particular civilization. As a first step, we worked with extracts from versions of the manuscript "Ovide moralisé" (the Geneva, Paris, Lyon and Rouen versions) as well as versions translated into Latin and modern French by domain specialists. While manipulating these manuscripts we noticed that: 1. the structure is very different from one version to another.
Although there seems to be a certain isomorphism in the structure within one class of versions (verse or prose), the variation is striking when comparing two versions from different classes. These irregularities in structure make it quite difficult even to perform a manual alignment. 2. Sentence and paragraph structure is difficult to detect in many versions, owing to the loss of punctuation or uncertainty about the sentence structure. 3. Apart from the variation in structure between the different classes of documents, variation at the linguistic level shows the following characteristics: - Morphological variation is quite abundant in all the versions. For instance the word "plusieurs" can be found as "plusiours". Conjugated verbs in particular show a great number of variants of dialectal, orthographic or analogical (to modern French) origin. - Semantic variations are less abundant but quite important, particularly among versions from different classes. These variations mainly involve the use of synonyms or other groups of words expressing the same meaning (e.g. mal / mauvaise, prouffit / enseignement). - The pragmatic and discourse structures are very close, even if in some versions we find more elaborated parts than in others, or more detail is given, particularly in descriptions or argumentation. Despite this disparity in the expressiveness of the language, the core of the content is kept invariant across all the converted texts. After some experiments, we found that alignment among same-class versions can be done using standard techniques, particularly for texts of the verse class. This is due to the following two reasons: 1. The document structure is very close or in some cases identical: a poem is composed of titles (announcing the beginning of a division), verses (organized line by line) and annotations in the margins (often transcribed within the verse structure with a special markup). Graphical objects are inserted in the document without any prior order. The most significant variation at this level is the omission or insertion of a title, a block of annotation or, at most, a block of verse that does not exceed a couple of lines in the worst cases. 2. The linguistic variation is limited to the morphological level and, in many cases, to the semantic level, where a simple substitution of a word by its synonym has taken place. As previously noted, in this particular case we are dealing with a classical alignment problem for which statistical approaches have proved successful. The TALCC aligner (Ballim et al. 1998) has been used and achieved about 90% success when the documents follow the same generic model. The TALCC aligner is based on the Gale and Church algorithm, with the logical structure of documents added as a further criterion of alignment (see section 4).
3.3. The alignment problem with verse/prose documents
The first problem we faced when comparing verse and prose versions was the document structure. A verse manuscript is composed of titles and verses organized line by line. Each verse may be a clause or, in rare cases, a whole sentence. Punctuation does not exist in the original material but is introduced by experts. Annotations in the margins are transcribed within the verse structure with a special markup. A prose version, in contrast, is a block of text organized into paragraphs, each starting with a title that describes its content.
Sentences within the paragraphs are usually separated by original punctuation, though in some cases it is difficult to detect (figure 1). Linguistic variation: whereas verses are mainly segments of sentences expressed in an artistic style enclosing a form of rhyme, the content of sentences in prose is more compact and expressed in a lighter, descriptive style. Obviously the linguistic construction differs in many ways: the lexicon is subject to the usual morphological variation that can exist between two versions, as described before, but word order is not preserved. Syntactically the whole construction is different: a sentence in the prose version can correspond to one or more sentences in the verse version, and a single word in one version can be elaborated into a whole verse or even a block of verses. Usually the prose version is more compact than the verse one, with a ratio of nearly 3/4. Pragmatic variation: as regards the pragmatic structure, the difference lies in the depth of explanation and expressiveness. The verse versions are enriched with more elaboration, restatements, argumentation, details (temporal, location, manner, circumstances, etc.) and descriptions (comparisons, causative events). Comments and personal interpretations are often more abundant in the verse versions. This explains the difference in size between the two manuscripts. Nevertheless, the salient segments are kept invariant and undergo only lexical variation or synonym substitution.
Verses in medieval French: Se l'escripture ne me ment / Tout est pour nostre enseignement / Quant qu'il a es livres escript / Soient bien ou mal li escript
Prose in medieval French: Toutes escriptures soient bonnes et mauvaises sont pour nostre prouffit et doctrine faittes.
Figure 1: Example of alignment of verse and prose versions in medieval French.
4. State of the art
The problem of alignment seems to have first been raised when Brown (1988) and his colleagues tried to build a probabilistic model for automatic translation. Debili (1992) faced the same problem when he planned to set up dictionaries of bilingual expression transfers and synonyms. The alignment problem was then treated as only secondary or peripheral. Many authors now set the alignment problem in a more global framework. For instance, Warwick (1989) places alignment in the context of the implementation of lexicographic tools for linguists and translators, or, more recently, as an aid to the evaluation of translation quality. A good deal of work has already been done on alignment (Brown et al. 1991; Gale and Church 1991; Simard et al. 1992). Since then several other approaches have been used for sentence, word and character alignment (Papageorgiou et al. 1994; McEnery et al. 1995). All these methods are mainly based on statistics, some on word frequencies, others on character occurrences. Gale & Church's character-based algorithm uses only internal information and makes no hypothesis about the lexical content of the sentences. The authors started from the observation that the length of sentences in the source text and of their translations in the target text are strongly correlated: short sentences tend to be translated into short sentences and long sentences into long sentences. Furthermore, there seems to exist a rather constant ratio between the lengths of sentences, in number of characters, from one language to another.
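The length correlation just described can be turned into a simple alignment cost, as in the sketch below. The parameter values are illustrative only, and the published algorithm embeds such a cost in a dynamic programme over 1:1, 1:0, 0:1, 2:1, 1:2 and 2:2 sentence pairings.

```python
import math

def length_cost(l1, l2, c=1.0, s2=6.8):
    """Penalty for treating two sentences (l1 and l2 characters long) as a 1:1 pair.
    c is the expected ratio of target to source length and s2 a variance parameter;
    both values here are assumptions made for illustration."""
    if l1 == 0 and l2 == 0:
        return 0.0
    mean = (l1 + l2 / c) / 2.0
    delta = (l2 - c * l1) / math.sqrt(s2 * mean)
    return 0.5 * delta * delta   # negative log of an (unnormalised) Gaussian

# Sentences of similar length pair cheaply; very unequal lengths are penalised.
print(length_cost(120, 130))   # small cost: plausible 1:1 pairing
print(length_cost(120, 40))    # large cost: unlikely 1:1 pairing
```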
This method has been tested on a bilingual corpus of 15 economic reports published by the Swiss Banks Union in English, French and German, for a total of 14,680 words, 725 sentences and 188 paragraphs in English and their corresponding numbers in the two other languages. The method correctly aligns all sentences except for 4% of them. The same number of errors was found in the English-French and English-German alignments, showing that the method is relatively language-independent. The model proposed by Gale & Church has also been tested on a much larger sample of 90 million words, taken from the official Canadian Hansards corpus. Brown's word-aligner algorithm combines sentences according to the number of words included in each sentence. This algorithm is described by its author as a development of Gale & Church's algorithm, which computes the length of sentences from the number of their characters. The data of the official Hansards corpus (McEnery et al. 1996) of the official decrees of the Canadian parliament were first converted into a single English corpus and a corresponding French corpus. Each of these corpora was fragmented into tokens, and these tokens were combined into groups called sentences. Moreover, auxiliary data such as the numbering of parliamentary sessions, the names of speakers, the time index and the ordering of questions were used to add comments throughout the text. Each of these comments can be used as an anchor point in the alignment process. The alignment of anchor points is carried out in two passes, first for the main anchor points, then for the secondary ones. Chen's alignment model (Chen 1996) is built from a sample of data aligned in two languages, English and French, and tested on samples taken from the Hansards. This model, conceived on an ad hoc basis within a Bayesian framework, makes it possible to take into account the frequent cases where sentences do not align uniformly in a 1:1 ratio, but rather in a 2:1 ratio. The search strategy used, dynamic programming, allows a search that is linear in the length of the corpus. This strategy includes a separate mechanism for handling a large number of deleted sentences in one or other version of a bilingual corpus. Martin Kay's method (Kay, 1993) is based on an EM (Expectation-Maximization) type of algorithm. Here the alignment of sentences depends on the alignment of words, and that of words depends on the similarity of their distributions. The method proposed by Kay is more complicated to implement than the other methods already mentioned. Simard and his colleagues propose a simple method which attempts to align texts on the basis of related words (cognates). This method seems to give satisfactory results, but its drawback is that it can only operate on rather closely related language pairs. Cognates are words which are almost identical in both languages (artificial/artificiel), which implies that both languages must be written with the same alphabet (Simard et al. 1992). There has been some innovative work incorporating further criteria into alignment, such as linguistic knowledge and the structural properties of the documents. The use of linguistic knowledge mainly involves parsing (Dagan et al. 1996; Matsumoto et al. 1993) and tagging (Van der Eijk, 1993). Kupiec (1993) proposes an algorithm for finding matching nominal syntagms in a bilingual corpus.
In this algorithm, syntagms are recognized with the aid of a specific program, and the correspondences between these syntagms are determined with an algorithm based on simple statistical techniques. The use of external linguistic resources, mainly bilingual dictionaries, is quite effective in identifying lexical anchors (Catizone et al. 1989; Warwick & Russel 1990; Debili & Sammouda 1992). Structure-driven methods consider the text as a structured flow of information and exploit this meta-information about the organization of the text to aid the alignment process. Ballim et al. (1998) developed an aligner which takes advantage of the global structure that many documents have (e.g. sections, chapters, titles, etc.). This structural information is integrated with other similarity metrics, such as the number of characters in PCDATA, cognates, bilingual terms and parts of speech, to decide the correspondence between parallel segments. Tests and evaluations have shown that structure-driven alignment is effective with isomorphic documents having the same generic logical structure. However, it was much more difficult to deal with non-isomorphic documents, even when they refer to the same generic logical structure. Within the same framework of structure-driven alignment, Romary & BonHomme (2000) used the TEI annotation guidelines to calculate the best alignment pairs from multilingual texts at division, paragraph and sentence level.
5. Semantic and pragmatic approach
The success of one approach or another depends strongly on the nature of the documents. When aligning the Hansard corpus, for instance, statistical approaches were enough to reach high performance (nearly 97% with the Chen algorithm). This is mainly due to the nature of translation between French and English, where correspondence is mostly located at the word and sentence level. For structured documents, the logical structure provides a further hint to guide the alignment process, as was shown, for instance, by Ballim et al. in experiments with the Systematic Corpus of Federal Laws from the Swiss Federal Chancellery. However, when we address the alignment of other types of documents with different properties, we must find further similarity criteria. Little work has yet focused on the linguistic properties of the documents, because they are difficult and complex to compute and to deal with. Nevertheless, in some cases alignment becomes a task that requires a semantic and pragmatic understanding of the content of the texts. This is the case for the different versions of ancient texts, where even manual alignment needs expert competence. Our semantic and pragmatic approach therefore looks at texts from the perspective of their discourse structure rather than as a textual flow of data. The discourse structure is the way a text is organized to ensure the coherence and cohesion of its content. Most discourse theories (Hobbs 1985; Grosz & Sidner 1986; Mann & Thompson 1988) consider texts as a set of segments related by means of pragmatic relations. Parallel texts express similar semantic content, i.e. share the same meaning; therefore their discourse and pragmatic structures must be very close. Hence this structure can be considered as a further criterion for detecting similarities where classical approaches fail.
5.1. Description of the discourse structure
We have adopted Rhetorical Structure Theory (RST) (Mann & Thompson 1988) as our model of discourse organization. RST is a descriptive theory about the organization of natural texts, characterizing their structure basically in terms of a closed set of relations, called rhetorical relations, that may hold between their parts. The term rhetorical is not limited to relations that have a rhetorical sense but can be extended to other kinds of relations such as semantic, pragmatic, logical or even very specific domain-dependent relations. Texts are decomposed into non-overlapping units called discourse segments. Each segment is related to a span of segments by means of a relation and is called a nucleus or a satellite (there are a few exceptions to this rule: some relations can join two nucleus segments; these are called multinuclear relations). The distinction between nuclei and satellites comes from the empirical observation that the nucleus expresses what is more essential to the writer's purpose than the satellite, and that the nucleus of a relation is comprehensible independently of the satellite, but not vice versa. Text coherence in RST is assumed to arise from a set of constraints applied to the nucleus, to the satellite and to their combination. For example, in the sentence Although we obediently ate everything our mother prepared, my sister and I much preferred to eat our fruit crisp, we detect a concession relation: the situation described in the nucleus (the second clause of the example) is in contrast to that presented in the satellite; it involves a violated expectation. The model of discourse structure we are using obeys the constraints put forth by Mann and Thompson (1988) and Marcu (1996). It is a binary tree whose terminal nodes represent the elementary units and whose non-terminal nodes represent the relations holding between spans of text (in Figure 2, arrows point to the nucleus spans).
5.2. Semantic-pragmatic annotation
Recent developments in computational linguistics have created the means for the automatic derivation of rhetorical structures for unrestricted texts. Marcu (1996) suggested an algorithm that uses cue phrases and a simple notion of semantic similarity in order to hypothesize rhetorical relations among the elementary units. Nevertheless, these algorithms are still domain-dependent, and efficiency is their main drawback. To structure our texts, we carried out a manual annotation of a sample corpus (the Geneva and Paris versions) to evaluate the complexity of the task. We have defined a taxonomy of relations in which each class is composed of subclasses of more specific relations. We distinguish three main classes: semantic, inter-personal and textual relations. Semantic or informational relations mainly describe how information is conveyed, for instance elaboration, comparison, circumstance, condition and causative. Inter-personal or planning relations carry a pragmatic intention, for example interpretation, evidence, explanation and argumentation. Textual relations are relations that influence the logical structure of the text, for instance list, conclusion, disjunction, conjunction, summary, joint, topic-drift and sequence. Such a classification gives annotators more freedom to choose relations according to their own understanding, and makes it possible to build a similarity measure for the comparison process (Ghorbel 2001).
The first task of the annotation is the process of segmentation. Unlike previous work, where segmentation is basically situated at the clause level, we adopted a more global view: the sentence level and, in some cases, larger blocks of text. This kind of macro-segmentation allows us to define the elementary units of the discourse structure and, eventually, the units of the alignment process. The larger the segments, the easier the computation of correspondences, but the less precise the alignment. On the other hand, if we consider very short segments we end up with very large trees and the problem of complexity becomes important. The second task of the annotation consists of grouping the elementary units together by means of either a mononuclear or a multinuclear relation. This process creates spans of text, or discourse segments, related in the form of an ordered binary tree (figure 2).
Figure 2: RST model of discourse structure.
Within this tree we can detect certain paths formed by the nuclear nodes. This path structure will play an important role in the alignment process. Unlike previous work (Marcu for summarization (1998), automatic translation (2000) and essay scoring (2000), and Cristea (1998) for anaphora resolution), where the whole text is represented as a single tree, and since we are working with long texts, we found it more appropriate to consider the texts as a forest of trees. Separation between trees is viewed as a topic shift in the text. This notion of separation between trees remains subjective, as it depends on the annotator, but it does not have adverse effects, since in the alignment process a tree from the source text can be aligned with more than one tree from the target text.
6. Multi-criteria alignment
6.1. Semantic-pragmatic structure alignment
We have shown that the semantic and pragmatic structure of the discourse is very important for finding anchor points and hints to align segments of similar texts. In this section we propose some algorithmic approaches that use the knowledge stored in the suggested model to detect similarities and establish correspondences between segments.
Problem formulation. Consider a document $D_1$ to be aligned with a document $D_2$. $D_1$ and $D_2$ are represented as forests of $n$ and $m$ trees respectively: $D_1 = \{T^1_1, T^1_2, \ldots, T^1_n\}$ and $D_2 = \{T^2_1, T^2_2, \ldots, T^2_m\}$. Each tree $T^d_i$ (the $i$-th tree of document $d$) is composed of text spans structured as a binary tree. The terminal units of $T^d_i$ are text segments, ordered from left to right as they appear as sentences in the document. We denote these segments as follows: $T^d_i = \{s^d_{r_i}, s^d_{r_i+1}, \ldots, s^d_{s_i-1}, s^d_{s_i}\}$, where $r_i$ and $s_i$ are respectively the left and right boundaries of the tree $T^d_i$. We call a salient path (SP) in a tree the path followed when navigating from the root to the terminal elementary units, choosing the nucleus node whenever a relation node is encountered. A tree has exactly one SP if and only if all its relations are mononuclear, and $(s_i - r_i + 1)$ SPs if all its relations are multinuclear; in the general case, the number of SPs, denoted $p$, ranges between 1 and $(s_i - r_i + 1)$. Each SP points to a terminal elementary text segment. Let $\hat{T}^d_i$ denote the set of these terminal units in a tree; then $\hat{T}^d_i = \{s^d_j \mid r_i \le j \le s_i\}$, with $\mathrm{card}(\hat{T}^d_i) = p$.
Pragmatically, the set $\hat{T}^d_i$ stands for the span of text in tree $i$ of document $d$ which, as judged by the annotators when segmenting and attributing nucleus and satellite status to segments, constitutes the main part that helps readers understand the sense and content of the tree. For example, in Figure 2, $r_i = 2$, $s_i = 8$, $\hat{T}^d_i = \{s^d_5\}$ and $p = 1$; the SP is hence the path leading to the 5th segment. We call the k-nearest neighbourhood of an SP the $k$ satellite segments having the minimum distance to the SP. We define the metric distance $d_m$ from $T^d \times T^d$ to the set of natural numbers, where $T^d = \bigcup_{i=1}^{n} T^d_i$, by $d_m(s^d_i, s^d_j) = |j - i|$. We also define the structural distance $d_s$ between the nodes of a tree as the length of the minimum path between the nodes, where the length of a path is the number of nodes it contains between its two extremities.
The process of alignment. The semantic and pragmatic annotation described above models the text as a forest of trees. Each tree holds in its structure a set of segments related to each other by means of semantic or pragmatic relations, and ordered by their pragmatic importance in the way they hold and convey information; this is modelled by the hierarchical structure and the satellite/nucleus distinction. The first step in the process of alignment is to define a mapping M that establishes the correspondence between the trees composing the two documents. M can be defined according to three approaches.
Approach 1: alignment based on the SP segments. The window of comparison is limited to certain specific segments, in this case only those forming the set $\hat{T}^d_i$. M is defined as follows: $M(T^d_i) = T^{d'}_j$ iff $S(\hat{T}^d_i, \hat{T}^{d'}_j) \le h$, where $S$ is the similarity measure and $h$ an empirical threshold. $S$ is calculated on the basis of the lexical distance between the character streams and by reference to a thesaurus database (see the next section).
Approach 2: alignment based on the SP and its k-nearest neighbour segments. The window of comparison is limited to the segments forming the set $\hat{T}^d_i$ and their k-nearest neighbours. M is defined as follows: $M(T^d_i) = T^{d'}_j$ iff $D(\hat{T}^d_i, \hat{T}^{d'}_j) \le h$, where $D$ is the similarity distance measure and $h$ an empirical threshold. The k-nearest neighbours can be obtained by applying either the metric distance or the structural distance; in both cases we search for the $k$ satellites nearest to the SP. In the case of the structural distance, when the neighbour found is a sub-tree, we follow its local SP, which points to the first neighbour; the subsequent neighbours are then determined recursively from the first one.
Approach 3: alignment based on the nature of the relations. Relations are classified according to their semantic and pragmatic similarities into the three main classes introduced in section 5.2: semantic or informational relations, inter-personal or planning relations, and textual relations. Each subclass can itself be further refined. This approach to tree alignment is based on a statistical comparison of the frequency and order of the relations.
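The following sketch illustrates the salient-path idea and Approach 1 in code. The node representation, the greedy choice of the first candidate tree that satisfies the threshold, and all names are assumptions made for this example, not the ROSETTA implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    relation: Optional[str] = None            # None for terminal segments
    segment: Optional[str] = None             # text of a terminal elementary unit
    children: List["Node"] = field(default_factory=list)
    nuclei: List[int] = field(default_factory=list)   # indices of nucleus children

def salient_segments(node):
    """Terminal segments reached by following only nucleus children from the root."""
    if not node.children:
        return [node.segment]
    segments = []
    for i in node.nuclei:     # one nucleus (mononuclear) or several (multinuclear)
        segments.extend(salient_segments(node.children[i]))
    return segments

def map_trees(doc1, doc2, similarity, threshold):
    """Approach 1: map each tree of doc1 to a tree of doc2 whose salient segments
    are close enough under the given measure (first acceptable candidate is kept)."""
    mapping = {}
    for i, t1 in enumerate(doc1):
        for j, t2 in enumerate(doc2):
            if similarity(salient_segments(t1), salient_segments(t2)) <= threshold:
                mapping[i] = j
                break
    return mapping

if __name__ == "__main__":
    def leaf(text):
        return Node(segment=text)
    tree = Node(relation="elaboration", nuclei=[0],
                children=[leaf("tout est pour nostre enseignement"),
                          leaf("quant qu'il a es livres escript")])
    print(salient_segments(tree))   # ['tout est pour nostre enseignement']
```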
6.2. Lexical alignment of segments
Despite the great variation introduced in the conversion from one version to another, a certain local similarity between words remains, particularly for proper nouns, nouns, adjectives and verbs. Certain words may be subject to minor morphological variations (changes such as i/y, s/z, etc.). For these reasons, we found the word-comparison approach (Brown et al. 1991) more relevant than character comparison. This form of comparison is used only between certain segments of the text, hypothesized by the structural alignment (according to the previous approaches), to provide a further heuristic of similarity. Each segment is reduced to a word list from which noisy data (articles, determiners, pronouns, etc.) are eliminated. A first approach is to apply a stemming algorithm and then compare the stems obtained. The main drawback of this method is that inaccuracy in the rules can map different forms onto the same stem. A second approach consists in estimating the Levenshtein distance between each pair of words, taking into consideration some particular rules when estimating the costs of modification (substitution, omission and addition). A vector of word distances is thus obtained; the similarity distance measure is the cost of the minimum path through this vector. Besides the lexical similarity, we observed a kind of semantic similarity between the words used in the different versions of the manuscript, for example mal/mauvais, prouffit/enseignement, etc. A thesaurus providing a synonymy relation between words is under construction. When comparing two segments, the synonymy relation is checked between each pair of words and the vector of word distances is modified accordingly.
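A minimal sketch of the word-level comparison just described follows, assuming an illustrative stop-word list and using an average of best word-to-word edit distances as a simplification of the minimum-path cost over the vector of word distances.

```python
# Sketch: segment similarity via word-level Levenshtein distance (illustrative only).
STOP_WORDS = {"le", "la", "les", "de", "du", "des", "et", "que", "qui", "en", "a"}

def content_words(segment):
    return [w for w in segment.lower().split() if w not in STOP_WORDS]

def levenshtein(a, b):
    """Character-level edit distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def segment_distance(seg1, seg2):
    """Average, over the words of seg1, of the lowest edit distance to a word of seg2."""
    words1, words2 = content_words(seg1), content_words(seg2)
    if not words1 or not words2:
        return float("inf")
    best = [min(levenshtein(w1, w2) for w2 in words2) for w1 in words1]
    return sum(best) / len(best)

# Minor spelling variants ("plusieurs"/"plusiours") contribute a small cost,
# unrelated words a large one.
print(segment_distance("tout est pour nostre enseignement",
                       "toutes escriptures sont pour nostre prouffit"))
```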
7. Conclusions and future work
When dealing with comparable documents for which manual alignment requires an elaborate and deep understanding of the content, automatic alignment becomes a difficult task. Human interaction is then needed to aid the process of structuring and modelling the content. In this framework, we have proposed a content-driven alignment whose heuristic is partially based on a manual human annotation of the semantic and pragmatic structure of documents. The main drawback of this annotation is its subjectivity, which depends closely on the profile of the annotator. Much of the current work is devoted to defining a simplified gold standard and to automating some sub-tasks in order to ease the annotator's task. We have also presented in this article some approaches to content-driven alignment based on the semantic structure of the texts. The alignment process establishes a mapping between spans of text represented in a tree structure by comparing the segments that are significant from a semantic point of view. The semantic structure built upon the text therefore limits the window and drives the order of comparison. Lexical comparison still remains an important heuristic of similarity. Preliminary results seem encouraging, and much of the future work will focus on the evaluation of each of the proposed alignment approaches (structural, lexical and thesaurus-based) and on the integration of these heuristics. Alignment at the segment level is also one of the interesting points we are investigating, so that the granularity of the correspondence and of the comparison can be refined.
References
Ballim A, Coray G, Linden A, Vanoirbeek C 1998 The use of automatic alignment on structured multilingual documents. In Proceedings of the Seventh International Conference on Electronic Publishing, Saint Malo, pp 464-475.
Brown P, Della Pietra S, Della Pietra V, Mercer R 1988 A statistical approach to language translation. In Proceedings of the 12th International Conference on Computational Linguistics, Budapest, Hungary, pp 1-6.
Brown P, Lai J, Mercer R 1991 Aligning sentences in parallel corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California, USA, pp 169-176.
Burstein J, Marcu D 2000 Towards using text summarization for essay-based feedback. Septième Conférence Annuelle sur le Traitement Automatique des Langues Naturelles (TALN'2000), Lausanne, Switzerland, pp 51-59.
Catizone R, Russell G, Warwick S 1989 Deriving translation data from bilingual texts. In Zernik U (ed), Proceedings of the First Lexical Acquisition Workshop, Detroit, Mich., USA.
Chen S 1996 Building Probabilistic Models for Natural Language. PhD thesis, Harvard University.
Cranias L, Papageorgiou H, Piperidis S 1994 A matching technique in example-based machine translation. In Proceedings of the Fifteenth International Conference on Computational Linguistics, Kyoto, Japan, pp 100-104.
Cristea D, Ide N, Romary L 1998 Veins theory: a model of global discourse cohesion and coherence. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics (COLING-ACL), Montreal, Canada.
Dagan I 1996 Bilingual word alignment and lexicon construction. Tutorial notes, 34th Annual Meeting of the Association for Computational Linguistics, Santa Cruz, California.
Debili F, Sammouda E 1992 Appariement des phrases de textes bilingues. In Proceedings of the 12th International Conference on Computational Linguistics, Nantes, France, pp 517-538.
Gale W, Church K 1991 A program for aligning sentences in bilingual corpora. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, California, USA, pp 177-184.
Ghorbel H 2001 (forthcoming) Guidelines for semantic annotation. Internal report, Swiss Federal Institute of Technology, IN Ecublens, 1015 Lausanne, Switzerland.
Grosz B, Sidner C 1986 Attention, intentions, and the structure of discourse. Computational Linguistics 12(3): 175-204.
Hobbs J 1985 On the coherence and structure of discourse. Center for the Study of Language and Information, CSLI-85-37, Leland Stanford Junior University.
Kay M, Roescheisen M 1993 Text-translation alignment. Computational Linguistics 19(1): 121-142.
Kupiec J 1993 An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of the 31st Annual Meeting of the ACL, Columbus, Ohio, pp 17-22.
Mann W, Thompson S 1988 Rhetorical Structure Theory: toward a functional theory of text organization. Text 8(3): 243-281.
Marcu D 1997 The Rhetorical Parsing, Summarization, and Generation of Natural Language Texts. PhD thesis, Department of Computer Science, University of Toronto, Canada.
Marcu D 1998 Improving summarization through rhetorical parsing tuning. In Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, pp 206-215.
Marcu D, Carlson L, Watanabe M 2000 The automatic translation of discourse structures. In Proceedings of the 1st Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL'2000), Seattle, Washington, pp 9-17.
Matsumoto Y, Ishimoto H, Utsuro T 1993 Structural matching of parallel texts. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Columbus, Ohio, pp 23-30.
McEnery A, Oakes P 1995 Cognate extraction in the Crater project. In Proceedings of the EACL-SIGDAT Workshop, Dublin, pp 77-86.
McEnery A, Wilson A 1996 Corpus Linguistics. Edinburgh, Edinburgh University Press.
Owen C, Makedon F, Steinberg T 1998 Parallel text alignment. In Nikolaou C, Stephanidis C (eds), Research and Advanced Technology for Digital Libraries, special issue containing invited ECDL'98 papers, Lecture Notes in Computer Science, Springer Verlag, pp 235-260.
Romary L, BonHomme P 2000 Parallel alignment of structured documents. In Véronis J (ed), Parallel Text Processing. Dordrecht, Kluwer Academic Publishers.
Sato S, Nagao M 1990 Towards memory-based translation. In Proceedings of the 12th International Conference on Computational Linguistics (COLING'90), Helsinki, pp 247-252.
Simard M, Foster G, Isabelle P 1992 Using cognates to align sentences in bilingual corpora. In Proceedings of the Fourth International Conference on Theoretical and Methodological Issues in Machine Translation, Montreal, Canada, pp 67-81.
Sumita E, Hitoshi I, Hideo K 1990 Translating with examples: a new approach to machine translation. In Proceedings of the Third International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Language, Austin, Texas, pp 203-212.
Van der Eijk P 1993 Automating the acquisition of bilingual terminology. In Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics, Utrecht, The Netherlands, pp 113-119.
Véronis J 2000 Alignement de corpus multilingues. In Pierrel J-M (ed), Ingénierie des langues. Paris, Editions Hermès.
Warwick S, Russel G 1990 Bilingual concordance and bilingual lexicography. In Proceedings of Euralex'90, Malaga, Spain.
Is that a fact? A corpus study of the syntax and semantics of the fact that
Solveig Granath, Karlstad University
1. Introduction
Even though corpora have been used in language studies for the past few decades, up until now very few grammars of English have based their descriptions of the language on the wealth of information that can be gathered from them. There are indications that this is beginning to change. For example, Biber et al (1999) use a corpus-based approach in their handbook, the Longman Grammar of Spoken and Written English. It can be expected that this new approach will bring about both a revision and a refinement of descriptions of structure and patterns of usage in present-day English. The advantages of using corpora as the basis of grammatical descriptions should be obvious to anyone interested in grammar: not only is it possible to provide examples of actual use instead of invented sentences, but in addition, information on frequency of use, style, and the meaning of different features can be included. The use of the fact that (i.e. the fact followed by an appositional that-clause) is a case in point. Learner grammars tend to stress that the fact is used to link a preposition to a that-clause, since prepositions may not govern that-clauses in English (Svartvik & Sager 1996:331, Hasselgård et al. 1998:349).
Sometimes it is added that the use of fact presupposes that the that-clause expresses a fact. According to this description, fact can thus be regarded both as a function word and as a content word. As a reflection of this, it is interesting to note that the first two entries of fact in Collins Cobuild English Dictionary (1995) refer to the syntactic use rather than the meaning of the word. Thus the first entry says that “you use the fact that after some verbs or prepositions … to link the verb or preposition with a clause”, and in the second entry the reader is informed that “you use the fact that instead of a simple that-clause either for emphasis or because the clause is the subject of your sentence”. Only in the fifth entry is a semantic definition of the word given: “when you refer to something as a fact… you mean that you think it is true or correct”. In Quirk et al (1985:1001-02) the fact is even referred to as a marginal subordinator, which stresses its syntactic function over and above its semantic meaning. Whereas grammars and dictionaries thus demonstrate the need for the fact that in certain syntactic patterns, usage handbooks often denounce the use of the fact as “deadwood” (Perrin 1942: 515, 161): “The fact that is very often a circumlocution for which that alone would do as well: He was quite conscious [of the fact] that his visitor had some other reason for coming.” Even quite modern usage guides will warn users that the fact that “is often inserted quite superfluously into a sentence”, and exhort the user to avoid the phrase, because it is “ugly and pretentious” (Kahn and Ilson 1985:233). It is noteworthy that Quirk et al. (1985:659, 1002) also refer to prepositional phrases ending in the fact that as “stylistically clumsy”. Both grammars and handbooks thus tend to regard the fact primarily as a function word, which fulfils little semantic function in the clause. This is not supported by Mair (1988), who draws two major conclusions from his corpus study of spoken and written English: first, “the function of the fact is primarily semantic: it indicates that the speaker takes the following clause to refer to a fact,” and second, “the fact that is not a mere variant of the conjunction that but a genuinely suppletive form which substitutes for that in contexts where the latter is ruled out” (Mair 1988:70). The present study is an attempt to shed light on some syntactic and semantic features of the fact that. The paper will begin with a presentation of the frequency with which the fact that appears in different functions in the clause. Next, a closer analysis will be made of the use of the phrase in its functions as subject, subject complement, prepositional complement, and object of a transitive verb. One feature that is rarely noted is that the phrase in itself is not syntactically frozen, and examples of this will be given in section 7 below. Finally, the question whether fact is semantically meaningful or just a function word empty of meaning will be discussed (section 8). The results presented in the quantitative part (section 2) are based on two million-word corpora, FLOB and Frown, the Freiburg update of LOB and Brown with 1991 and 1992 used as sampling years. A pilot study of LOB and Brown, which contain texts that antedate FLOB and Frown by 30 years, showed results that were approximately the same. This makes it reasonable to assume that the figures are fairly representative, at least of these two varieties of English, but possibly also of English in general. 
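The raw totals reported in section 2 could in principle be reproduced with a simple surface search over the corpus files. The sketch below assumes plain-text files in one directory per corpus; it only counts surface occurrences, and the syntactic classification itself was of course done by hand.

```python
import re
from pathlib import Path

PATTERN = re.compile(r"\bthe fact that\b", re.IGNORECASE)

def count_occurrences(corpus_dir):
    """Count surface occurrences of 'the fact that' in all .txt files of a corpus."""
    total = 0
    for path in Path(corpus_dir).glob("*.txt"):
        total += len(PATTERN.findall(path.read_text(encoding="utf-8", errors="ignore")))
    return total

# e.g. count_occurrences("FLOB") and count_occurrences("Frown"), which should
# approximate the totals of 121 and 87 reported in section 2.
```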
The three sections that discuss the fact that in subject position, as subject complement, and as the complement of a preposition are likewise based on FLOB and Frown, whereas the three sections that deal with the fact that in object position and with the syntax and semantics of the phrase have culled examples not only from these two corpora, but also from a number of British and American newspaper and news broadcast corpora. The majority of examples are from written texts, but in the later sections a few examples of spoken English are included.
2. Survey of the functions of the fact that in the clause
The two corpora on which the first part of this study is based, the Frown corpus of American English and FLOB, which is a similar corpus of British texts, are rather small in comparison with some of the corpora in use today, but since they contain a variety of texts representing many different text categories, it can be assumed that they reflect general tendencies in the language fairly well. The total number of instances of the fact that was somewhat higher in the British corpus (121 in FLOB as compared to 87 in Frown), but when it comes to the syntactic function of the fact that, the results are similar (figure 1). Thus, in close to half of the sentences in both corpora, the fact that is the complement of a preposition, something that is reflected in grammatical descriptions of this feature, since we saw above that this is the context in which the fact is most frequently mentioned in grammars today. The second most common function of the fact that is that of subject; however, almost equally frequent is the fact as the head of an object clause, something that is all but ignored in grammatical descriptions. Syntactic accounts generally classify that-clauses as nominal clauses, which means that their function is similar to NPs (e.g. Quirk et al. 1985:1048). It is therefore curious that a subset of the transitive verbs that occur with the fact + a that-clause will take a that-clause only as the complement of a head noun (see below, section 6). The least common grammatical function is that of subject complement. It should be kept in mind that the figures are based on written corpora, and that they might be somewhat different in a corpus of spoken English.
Figure 1. Survey of the functions of the fact that in the clause (in percentages): subject, subject complement, object and prepositional complement in Frown and FLOB (bar chart; vertical scale 0-60%).
3. The fact that as subject
Clauses in subject position are often felt to be awkward in English, since they violate the principle of end-weight, which requires that "the part of the sentence following the verb should be as long as, and preferably longer than, the part that precedes the verb" (Quirk et al. 1985:1040). This is normally resolved by extraposition of the clause, which means that the clause is moved to the end of the sentence and anticipatory it takes its place as subject, so that (1b) is generally preferred to (1a):
(1) a. That the cable networks and the advertisers that found them will continue to encourage the trend is far from certain.
b. It is far from certain that the cable networks and the advertisers that found them will continue to encourage the trend.
If the that-clause is an appositional clause, however, extraposition is blocked, since it cannot be used in subject position when the extraposed constituent is an NP (2):
(2) * It is a real advantage the fact that Harry doesn't have a cart of his own.
There are certain conditions under which a subject the fact that can move to clause-final position. This happens for instance when the predicate phrase consists of a clause-initial adjective (often in the comparative) plus the copula, as in (3a).
The participle of a verb may also occur in clause-initial position, in which case auxiliary be follows the rest of the verb phrase (3b):

Figure 1. Survey of the functions of the fact that in the clause (in percentages). [Bar chart comparing Frown and FLOB for the functions Subject, Subject complement, Object and Prepositional complement.]

(3) a. More worrying, I fear, is the fact that most people will have terrible trouble just understanding what Condren is trying to argue. (Frown J58 144) b. Complicating the issues of social class is the fact that in the United States there is a large overlap between lower-middle-class, working-class, and lower-class membership (Frown J49 175) As can be seen from table 1, clause-final subjects are generally longer than clause-initial subjects, but it is still striking that when the fact that is in its normal subject position, it often “outweighs” the predicate part of the clause, as in (4), where the subject consists of 24 words and the verb phrase of only 3 (see also example (16a,b) below). (4) The fact that she is still somewhat tentative in the role and that her command of English is rather less secure than her arabesques, are minor blemishes. (FLOB C11 125)

Table 1. Survey of position and length of subject the fact that in Frown and FLOB
                                 Frown           FLOB
Clause-initial subject
  No. of sentences               19              21
  Length                         7–34 words      6–24 words
  Average length                 15 words        13 words
Clause-final subject
  No. of sentences               11              3
  Length                         17–41 words     14–35 words
  Average length                 31 words        21 words

It is often the case, however, that these clauses have long and information-loaded predicates in addition to the long subjects, as witness (5), which has a subject part consisting of 34 words, and a verb phrase containing as many as 37 words. (5) The fact that this is still going on, in both this and related semidetached systems (the term comes from the fact that only one of the stars is in contact with the critical surface), means that accretion flows onto the companion have played a role in the orbital dynamics and that this has fed back into the stellar evolution through the alteration of the mass and boundary conditions on the stars. (Frown J07 64) The explanation for the reversal of weight in sentences such as (4) most likely has to do with information structure, so that even if the subject in a sentence like (4) contains several items of new information, they are presented as “facts”, i.e. as something given that cannot be questioned by the addressee. It is interesting to note that an initial subject that-clause will also be understood factively, i.e. presupposed to be true, even when it is not preceded by the fact. This is shown in (6), where only (6c) presupposes that Smith had in fact arrived (see Kiparsky & Kiparsky 1971:167f, Granath 1997:36f). (6) a. The UPI reported that Smith had arrived. b. It was reported by the UPI that Smith had arrived. c. That Smith had arrived was reported by the UPI. In conclusion, when the fact that functions as the subject of the clause, the principle of end-weight no longer applies, so that we may find long and heavy elements with a high information load at the beginning of a sentence, contrary to the normal place of such elements in English. This is confirmed by the results quoted in Mair (1988:65), who found that there is a tendency to have the fact that rather than only that in English subject clauses that have not undergone extraposition.
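The weight comparisons behind this discussion of end-weight (and the figures in Table 1) can be made concrete with a small sketch that counts words on either side of a manually delimited subject. The function and the purely word-based measure are illustrative assumptions, not the procedure actually used in the study; example (4) serves as the test case.

    import re

    def word_count(span: str) -> int:
        """Count orthographic words, ignoring punctuation tokens."""
        return len(re.findall(r"[A-Za-z'-]+", span))

    def weight(sentence: str, subject: str) -> tuple:
        """Return (subject length, remainder length) in words; the subject span is supplied by hand."""
        rest = sentence.replace(subject, "", 1)
        return word_count(subject), word_count(rest)

    ex4_subject = ("The fact that she is still somewhat tentative in the role and that "
                   "her command of English is rather less secure than her arabesques")
    ex4 = ex4_subject + ", are minor blemishes."

    print(weight(ex4, ex4_subject))   # (24, 3), the counts given for example (4)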
It seems that the fact may serve as a structural marker of a long constituent, and that it will signal to the listener/reader to expect the verb phrase to appear late in the clause. As such, it serves a communicative purpose, and omission of it would enhance the complexity of the sentence and make it more difficult to comprehend, as is shown in (7a) and (7b). (7) a. That patients still have to wait 21 weeks to get into hospital for treatment and 12 weeks for day surgery demonstrates yet again that the health service has not got enough money to treat patients in a reasonable time. b. The fact that patients still have to wait 21 weeks to get into hospital for treatment and 12 weeks for day surgery demonstrates yet again that the health service has not got enough money to treat patients in a reasonable time." (FLOB A14 192) 237 4. The fact that as subject complement The least common function of the fact that is that of subject complement. Only one instance was found in Frown, compared to four in FLOB. Not only is the fact that rather rare in this function, it also seems that semantically, the fact serves only to reinforce the statement somewhat, and that the message is communicated almost as forcefully if it is omitted. The reader is welcome to test this on the following two examples from FLOB (my slashes). (8) a. “the most obvious encumbrance on this picture is /the fact/ that it is woefully late” (FLOB G49 140) b. Another component of her success was /the fact/ that she rated well under both the RORC and American CCA rules of the time. (FLOB E18 23) Considering the low number of sentences in which the fact that was used in this function, it seems likely that this construction is less frequent than that-clauses (without an introductory fact) as subject complement. 5. The fact that as the complement of a preposition Standard grammatical descriptions of English stress that prepositions cannot govern that-clauses in English (see e.g. Svartvik & Sager 1996:331, Hasselgård et al. 1998:349, Quirk et al. 1985:658f).1 This means that some other syntactic device is needed to link clauses to prepositional phrases. One such device is “to use an appositive construction with a ‘general’ noun such as fact” (Quirk et al. 1985:659). Both in grammars and in English usage guides, this usage is referred to as “clumsy” and “superfluous”, and the language user is told to avoid it if possible (for examples of how this can be done, see (9) and (11a) below). The advice from handbooks is usually simply to leave out the sequence preposition + fact, or in the case of the two prepositions despite and due to, to replace them with a subordinator such as although or because. However, evidence from the corpora shows that this is rarely an option for the user: in most cases where this structure is used, no other alternative exists. In Frown, 40 out of 87 sentences with the fact that were of the structure preposition + the fact. The number was even higher in FLOB, so that this structure was found in 65 out of 121 sentences. This means that this is by far the most common use of the fact that. In each corpus, there were only a handful of cases where the writer would have had the choice of leaving out the preposition + the fact. 
This was for instance the case in the four sentences where the preposition was the complement of an adjective (content with, grateful for, oblivious of/to, responsible for) and it was possible with two of the four nouns postmodified by preposition + the fact that in Frown and FLOB: recognition of, concern about. However, only in two cases would it have been possible to leave out preposition + fact as the complement of a verb, namely comfort oneself with and hint at; in all other cases an ungrammatical construction would have been the result.2 Some examples of sentences where omission of the prepositional phrase would have been possible are given in (9) (my slashes). (9) a. I am grateful /for the fact/ that we know so little about his life. (FLOB G08 202) b. In reality, its merger with another organization was recognition /of the fact/ that it carried “high negatives” in public opinion polls. (Frown D09 202) c. In the closing pages of the text, his attempted self-renunciation hints /at the fact/ that to abolish the reflective ego is to cease to be a human subject (FLOB J62 166) In about one fifth of the sentences, the fact was the complement of a preposition with no governing adjective, noun or verb. The most common among these were passive sentences, where the fact that took on the function of agent (16 out of a total of 105 sentences). Mair (1988:64f) interprets this as an application of the principle of end-weight, i.e. that long constituents are moved to the end of the clause. Again, it does indeed seem as if the language user sometimes has a choice, so that transforming (10a) into an active clause (10a') does not really produce a more complex structure. In other cases, the principle of given and new in the functional sentence perspective makes the passive stylistically the only viable alternative (cf. 10b and 10b'). (10) a. The value of the posters must have been enhanced by the fact that they were created not by an industry but by an individual. (FLOB E35 213) a'. The fact that they were created not by an industry but by an individual must have enhanced the value of the posters. b. This vulnerability is underlain by the fact that in 1988 66 per cent of goods and services were exported. (FLOB J42 33) b'. The fact that in 1988 66 per cent of goods and services were exported underlies this vulnerability. Other prepositions found with the fact that in the two corpora (in order of frequency) were despite, due to, given, than, aside from, beside(s), and except for. Of these, besides (but not beside), except and given (and more marginally than) will allow a following that-clause with no mediating noun (11a).

1 This rule does admit a number of different exceptions – indeed far more than might be expected – so that among the prepositions discussed in this section, except, given and than can indeed take that-clause complements. Syntacticians normally get around this problem by reclassifying these words, so that except that is regarded as a complex subordinator, besides and than are given a double classification (as a preposition and as a subordinator), and given is referred to as a marginal preposition (e.g. Quirk et al. 1985:667) precisely because it may indeed precede a that-clause. Further information about that-clauses as complements of prepositions can be found in Christophersen (1979) and Seppänen & Granath (forthcoming).
2 See Granath 1997 for a detailed account of the omission of prepositions in verb complementation.
With the majority of prepositions complementing verbs and phrases of various kinds, this possibility does not exist (11b,c). (11) a. In the year 600 the civilizations of Spain and Mexico were roughly comparable except /for the fact/ that the former had profited from the invention of the wheel. (Frown N13 55) b. … much of this stemmed from the fact that he rarely took on the more general questions of the organization (Frown J60 116) c. I didn't make anything of the fact that we didn't have a particularly good time in bed. (FLOB K21 283) In summary, out of the 105 sentences where the fact that was the complement of a preposition, there were only 15 altogether where it would have been possible to leave out preposition + fact. Another consideration is of course what will happen to the meaning of the sentence if this is done. Removing the fact that will certainly take away some of the communicative force of the sentence, which can be seen e.g. in the examples in (9) and (11a). 6. The fact that as direct object It has already been pointed out above (section 2) that that-clauses are traditionally classified as nominal clauses and thus put on a par with noun phrases syntactically. We would therefore expect that when the fact that serves as the object of a transitive verb, a that-clause on its own would do, even if it would be less emphatic than a construction with the fact. However, when one begins to consider corpus evidence, it is clear that this is not at all the case. There are several issues that deserve description here. First, even a very brief search of one of the big corpora will turn up quite a large number of the structure verb + the fact that, which means that this is a common structure in English. Furthermore, it is one that is generally ignored in grammars. Second, whereas there are a number of verbs where the fact can be said to be optional syntactically, there are others with which a that-clause complement is rare, and yet a further group where this possibility does not seem to exist at all. It is possible that this area of English syntax has been neglected precisely because it has been impossible to investigate the patterns in which these verbs occur without access to large corpora.3 For this part of the paper, several CD-ROMs of British and American newspapers and broadcast news were used. A pilot study of some of these quickly turned up more than 250 verbs followed by the fact that. In order to delimit the study somewhat, the results in this section are primarily based on the 120 verbs of this type found in The Guardian/The Observer 1999. Altogether, this corpus comprises approximately 40 million words. Other corpora were consulted to determine whether the fact could be considered to be optional or obligatory. It is possible that further corpus searches will revise the results somewhat, so that some of the verbs in group 3 (verbs that obligatorily require an NP head of the (appositional) clause) will have to be moved to group 2 (verbs where a that-clause complement is an alternative to the fact that). The first group consists of verbs that occasionally take the fact as an object head of a that-clause, but where the predominant structure is for the clause to follow the verb directly. This applies to roughly one fourth of the 120 verbs in the survey (table 2). 
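A first pass of the kind described above can be approximated with a short script that collects the word immediately preceding the fact that. The input file name is hypothetical, and the output is only a starting point: the actual survey required lemmatisation and manual inspection, for instance to conflate inflected forms, to discard non-verbal hits, and to catch phrasal verbs such as bring home.

    import re
    from collections import Counter
    from pathlib import Path

    PATTERN = re.compile(r"(\w+)\s+the\s+fact\s+that\b", re.IGNORECASE)

    def preceding_words(path: str) -> Counter:
        """Tally the word immediately before 'the fact that' (a crude proxy for the governing verb)."""
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        return Counter(match.group(1).lower() for match in PATTERN.finditer(text))

    # Hypothetical file: a plain-text dump of one year of newspaper material.
    candidates = preceding_words("guardian_observer_1999.txt")
    for word, freq in candidates.most_common(20):
        print(f"{word:15} {freq}")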
One interesting aspect is that when the Collins Cobuild Dictionary entries for the 120 verbs were checked, the 28 verbs in this group were all categorised as taking a “report-clause”, i.e. a that-clause, whereas the remaining 92 verbs were only marked as taking NP objects.

3 The corpus that Mair (1988) used for his paper comprised about 840,000 words, which was a large corpus at the time, but it is small by today's standards. It is noteworthy that of the nine verbs for which Mair claims that the plain that-clause is rejected (give, face, ignore, bring home, disguise, resent, obscure, raise, discuss) (Mair 1988:65), only two have been found not to take a plain that-clause in the present study, namely obscure and raise. This underscores the value of large corpora, since native speaker intuition cannot be trusted. Some speakers may very well reject what other speakers say, and it is impossible even for trained linguists to be able to imagine all the contexts in which a word may be used.

Table 2. Transitive verbs for which that-clause complementation is the normal pattern and the fact that less often used: accept, acknowledge, add, appreciate, believe, consider, dispute, emphasise, establish, explain, forget, grasp, illustrate, lament, mention, mind, note, notice, proclaim, publicise, recognise, regret, report, reveal, spot, stress, take (into account), volunteer.

In the second group of verbs, two types can be distinguished: for a few verbs the fact that occurs with almost the same frequency as a plain that-clause, but for the majority of verbs, the fact that is in fact the predominant pattern, with the plain that-clause a rare alternative. The reason why these two types have been collapsed into one group is that even the relatively large corpora used as the basis of this survey are too small to establish with any certainty how these verbs should be divided up. Thus reflect (in the sense ‘mirror') occurs 46 times with the fact that and 6 times with a plain that-clause (12a, b); face occurs once with a that-clause, 13 times with the fact that, but in addition to that, it occurs with a number of other NP heads, such as accusations, allegations, claims, complaints, criticism, the possibility, the realisation (12c, d). Other verbs, such as like, love, and hate were extremely rare with that-clause complements, and the examples that were found of this structure stem from spoken corpora. (12) a. The book looks old-fashioned. … Perhaps this reflects that the hey-day of the circus in Britain is long gone. (The Guardian, 17 April 1999, p. 9) b. The Bath course records high scores in the value-added measure of academic teaching. This reflects the fact that a high proportion of students with low entry qualifications are graduating with upper seconds and first-class honours. (The Guardian, 9 Nov 1999, p. 10) c. ‘I think it is a damning indictment of our society that my daughter is now having to face that you cannot trust any word spoken or written by a politician, let alone a secretary of state or prime minister. (The Guardian, 21 Aug 1999, p. 2) d. It is scary enough to face the fact that you yourself are faking it. (The Guardian, 30 Jun 1999, p. 8) Table 3 lists the verbs that were found to take predominantly the fact that rather than the plain that-clause. This group comprises close to half the 120 verbs, or 51 in all.
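The kind of frequency comparison reported for reflect and face above can be sketched as two crude pattern counts per verb. The file name and the hand-listed verb forms are assumptions, and the bare that-clause count in particular still needs manual weeding, since the pattern also matches determiner uses such as reflects that view.

    import re
    from pathlib import Path

    def pattern_counts(text: str, verb_forms: list) -> tuple:
        """Crude counts of VERB + 'the fact that' versus VERB + bare 'that' for the listed forms."""
        alternation = "|".join(verb_forms)
        with_fact = len(re.findall(rf"\b(?:{alternation})\s+the\s+fact\s+that\b", text, re.IGNORECASE))
        bare_that = len(re.findall(rf"\b(?:{alternation})\s+that\b", text, re.IGNORECASE))
        return with_fact, bare_that

    # Hypothetical file name; the concordance lines behind each figure still need manual checking.
    text = Path("guardian_observer_1999.txt").read_text(encoding="utf-8", errors="ignore")
    print("reflect:", pattern_counts(text, ["reflect", "reflects", "reflected", "reflecting"]))
    print("face:   ", pattern_counts(text, ["face", "faces", "faced", "facing"]))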
It needs to be stressed that for some reason, the possibility of using a that-clause complement is not indicated in Collins Cobuild English Dictionary, even though this is one of the first dictionaries to be based on a large corpus, namely the enormous Bank of English.

Table 3. Transitive verbs with which the fact that predominates but which occasionally take a that-clause complement: absorb, advertise, alter, avoid, belie, bemoan, bring home/into focus/out (into the open), celebrate, change, cite, conceal, cover up, credit, criticise, deplore, discuss, disguise, disregard, dread, enjoy, escape, expose, face, handle, hate, can't help, hide, highlight, ignore, include, like, love, mark, mask, neglect, omit, overlook, query, reflect, reinforce, relish, resent, respect, rumble, sell, can't stand, take (into account), uncover, underline, underscore, welcome.

The group of verbs that poses the greatest problems to explain in terms of syntactic behaviour is the third group, presented in table 4. In the material consulted for this study, these verbs were found to occur only with a that-clause as the apposition of a head noun. The fact was obviously found with all these verbs, but for some of them, a number of other head nouns were also found, so that dismiss, for instance, was recorded with well over 20 different NPs, such as speculations, accusations, hints, rumours etc. This group consists of 41 verbs, that is approximately one third of the verbs studied, but it should be remembered that further research will probably find that at least some of them will indeed also take plain that-clause complements.

Table 4. Transitive verbs which only take a that-clause complement with a mediating NP such as the fact: address, answer (inanimate subject), begrudge, blame, camouflage, cloak, cloud, concern, confront, decry, defend, dislike, dismiss, dodge, drop, duck, exploit, forgive, hold sth against, illuminate, invent, italicise, keep secret, laud, miss, negate, obfuscate, obscure, outweigh, overcome, overshadow, pardon, pinpoint, plug, promote, raise, rue, share, splash, square, withhold.

It is difficult to account for the reasons why these verbs should behave differently from the verbs in table 2. One possibility is that semantics has something to do with it. Language often works by analogy, so that words that have the same meaning tend to adopt similar syntactic patterns. It is apparent that many of the words are semantically close in meaning, so that we find groups of verbs that are converse terms, such as conceal/disclose, praise/condemn, welcome/regret, admit/deny, like/dislike, support/question etc. If meaning could be used to explain syntactic structure, we would expect that verbs with a similar meaning should behave identically syntactically, but this is not the case. Thus, in group 2 (verbs that take both the fact that and a plain that-clause) we find conceal, hide, and disguise, whereas in group 3 (verbs that do not take a that-clause complement) we find camouflage, cloak, cloud and obscure. One possibility that this suggests is that frequency matters, so that less frequent verbs more rarely take plain that-clauses. The scope of this paper has not made it possible to look more carefully into this, but even a quick look at table 4 indicates that at least some of the verbs are not at all uncommon in English, for instance blame, concern and defend (example 13). The reader is welcome to try to verify whether it is possible to omit the fact in these three sentences. (13) a. ‘...
She shouldn't blame the fact that she's a woman, or me,’ he gabbles, with more detail than is strictly necessary. (The Guardian, 31 Jul 1998, p. 2) b. Mr Clinton's most serious legal problems concerned the fact that he and Ms Lewinsky gave conflicting stories about how and why she gave back the presents he had given her (The Guardian, 21 Aug 1998, p. 13) c. But Bassay asserts that classroom application should not be the overwhelming criterion for research and defends the fact that academics make so much use of it. (The Guardian, 17 Mar 1998, p. 6) The present paper can only begin to outline the problems encountered in the analysis of these verbs. The problem appears to call for a syntactic rather than a semantic explanation, since a number of the verbs in group 3 will take a plain that-clause complement in other languages, such as Swedish. The fact that the structures are in some sense idiosyncratic in English is another reason why it is important to have a fuller account of these verbs than has so far been the case: there is at present no place where learners can find out about the erratic behaviour of these verbs, since not even dictionaries account fully for their colligations. What we are left with at present is a system that in Matthews’ terms is not codified but rather an area of partial codification, for which there are no explicit rules (Matthews 1981:20-21). To a syntactician, this is of course not the ideal situation, nor is it a happy situation for the language learner. But, as corpus linguists, working with actual language use and not the ideal speaker-hearer, we will probably have to accept that there are areas of a language that cannot be wholly explained in terms of one system or another. Still, the topic of this section, the fact that as the complement of transitive verbs, is an area where more research is needed both to provide a fuller account of this usage for learners, and in order to see if in the end perhaps it is possible to find at least some determining factors that can explain what to us looks like largely idiosyncratic behaviour. 7. Internal syntactic variation in the phrase the fact that One aspect of language that has become more obvious to linguists after the introduction of corpus-based research is that speakers only partly have an open choice when it comes to selecting vocabulary, syntactic forms etc. Much of what we say actually consists of pre-fabricated or semi-preconstructed chunks of words, which work together in phrases. This is usually referred to as the idiom principle (Sinclair 1991:110-115). A typical feature of idioms is their frozenness, i.e. lack of syntactic variation in number, tense, voice etc. It seems that the fact that would serve well as an example of this, since all mention of the phrase includes the definite article, the subordinator that, and fact in the singular. However, even though this is indeed the predominant structure of the phrase, some examples will demonstrate that it is in fact not invariable. Although the definite article is by far the most common determiner, this and that occur occasionally, both when the that-clause is restrictive, as in (14a and b), and when it is non-restrictive, as in (14c): (14) a. We haven't tried to hide this fact that, come November, we would have banned them ourselves. (The Observer, 19 Sept. 1999, p. 1) b. He kept this up in meetings and letters, despite that fact that we were never together in Paris. (The Guardian, 21 July 1998, p. 16) c.
Despite this fact, that language is so much more than just a huge store of isolated words, what you remember of learning your own native language is likely to be limited to memories of just that – the learning of new words. (Heny 1998:190) Examples where the indefinite article is used are rare. Instead we find a fact that in the phrase to know for a fact that, where the that-clause must be interpreted as the object of the verb know. An extraposed that-clause also often follows a fact in clauses of the type It is a fact that (…). In neither of these cases is the that-clause appositional. What we find instead is that a fact is often followed by a relative that-clause. The only example in the present investigation where the that-clause could be taken to be an appositional clause with the head noun a fact was actually one where fact was post-modified by a relative clause: (15) The clash between Schroder and Blair also underscored a fact that yesterday's Florence summit of the centre left had been intended to disguise: that there remains a huge gulf in thinking between Britain's post-Thatcherite Labour party and America's post-Reaganite Democrats on the one hand and the socialist and social democrat parties on the continent on the other. (The Guardian, 23 Nov. 1999, p. 2) An alternative analysis of the clause structure is to regard the that-clause as the direct object of the verb underscore, which, as was shown above (table 3), belongs to the group of verbs that may take a that-clause with no mediating noun. From a stylistic point of view it also seems logical that the determiner of fact is definite, since in the majority of cases it refers to something presupposed and thus given information. As regards the number of fact, the singular by far outnumbers the plural, for the simple reason that in general, only one “fact” is mentioned. Occasionally, we find the plural facts with appositive postmodification, something that is also noted by Quirk et al. (1985: 1261). This obviously occurs only when there is more than one appositive clause. However, with two or more appositive that-clauses, it is possible to use either the singular fact or the plural facts, as witness the examples in (16), where (a) and (b) contain that-clauses in apposition to the subject, and the ones in (c) and (d) belong to the object of the clause: (16) a. The fact that England did not even practice penalties, and that Hoddle characteristically refused to accept this as an error, left the coach looking conceited and complacent (The Observer, 4 Oct 1998, p. 5) b. The facts that the roads carried 81 per cent of all freight last year and that 86 per cent of passenger miles are taken in cars and vans cut no ice. (The Observer, 21 Nov. 1999, p. 28) c. He ignores the fact that Geffen reputedly offered her $1 million for it, and that her previous album Pretty On The Inside was a powerful musical landmark two years previously. (The Observer, 5 Jul 1998, p. 7) d. But such notions ignore the facts that Ramsay has already made three short films and that two of them have won prizes at Cannes. (The Guardian, 14 Aug. 1999, p. 4) Finally, the language user also has the option of omitting the subordinator that after fact. One could assume that that would be left out more often in informal than in formal contexts, but corpus data does not confirm this (cf. (17a and b), both of which contain the phrase lament the fact, where that is omitted in the longer example from the business section (17b)). (17) a.
He laments the fact that you can't trust politicians these days. (The Guardian, 1 Nov 1999, p. 12) b. Fellow Tory Eric Forth, a former minister, lamented the fact the Opposition was ‘conniving with his government’ to pass the legislation (The Guardian, 26 Oct 1999, p. 29) A cursory glance at the sentences where that is omitted after fact indicates that this happens more frequently when the embedded clause has a pronoun subject, but it is not restricted to this type of subject, as is shown in (18). Neither does there seem to be a limitation on that-omission due to the function of fact in the clause, so that we find it both when the fact is a subject (18a), an object (18b), and a prepositional complement (18c). (18) a. The fact the President must step down in 2000 … does not matter. (The Observer, 16 May 1999, p. 22) b. These examples of Germany's innovative and corporate success cannot, however, disguise the fact the economy inherited by Gerhard Schroder from Helmut Kohl has serious structural problems (The Guardian, 10 June 1999, p. 19) 242 c. After he was discharged the police asked him ‘to keep quiet about the fact a tank had crushed students … (The Guardian, 2 June 1999, p. 12) In conclusion, corpus evidence shows that the fact that is not the invariable phrase that it appears to be in grammatical descriptions and according to information given in dictionaries. 8. The semantics of fact The seminal article FACT by Kiparsky & Kiparsky (1971) presented the idea that certain predicates are factive, meaning that their complements are presupposed to be true. One criterion used for testing factivity was to negate the matrix verb. The predicate was classified as factive if the presupposition remained intact under negation unless explicitly contradicted (Kiparsky & Kiparsky 1971:351-52, Leech 1974:301-317). Thus I'm sorry that he lost his job and I'm not sorry that he lost his job both presuppose that he lost his job. In a sense then, such predicates can be said to be open to objective verification, that is, one should be able to show that they are facts. According to Kiparsky & Kiparsky (1971:145), “[o]nly factive predicates can have as their objects the noun fact with a gerund or thatclause.” Typical examples of factive predicates are regret, grasp, take into account, ignore, mind, and deplore. Predicates such as these make up the majority of the verbs in section 6 above, at least in the two groups that predominantly occur with the fact that. We also saw that that-clauses in subject position tend to be given a factive reading whether or not the fact is overtly stated. Therefore, one major conclusion to be drawn from the present investigation is that the fact is indeed used because the appositional clause is presupposed to be true. More interesting, perhaps, is whether fact is losing some of its meaning, and whether it is sometimes used for syntactic reasons alone. Even Kiparsky & Kiparsky (1971:147fn) refer to a colleague who had informed them “that for him factive and non-factive predicates behave in most respects alike and that even the word fact in his speech has lost its literal meaning and can head clauses for which no presupposition of truth is made.” Such usage of fact is what language guide books react against, so that a sentence such as I certainly do not accept the fact that Sir Patrick was remotely influenced by the timing of the leadership election (Radio 4) receives the comment that “[i]f someone does not accept a fact, then in their eyes it is not a fact” (Blamires 1998:117). 
The author goes on to suggest that suggestion should replace fact in this case. The way fact is described in grammars, e.g. by being referred to as a marginal subordinator (Quirk et al. 1985:1001-02) also indicates that the word has lost much of its original meaning, and that today, it can be used as a function word devoid of meaning. What indications are there in the corpora that this is true? Starting from Kiparsky & Kiparsky's list of non-factive verbs, we find that believe is a typical example of such a predicate. According to their rule, then, it should never occur with the fact as an object. Nevertheless, several examples of this construction were found (19a,b). Included here is also an example with buy in the sense ‘believe, accept’ (19c). (19) a. Her mother, Amanda, 24, said: ‘It's been very traumatic for her and I am just glad she's OK. But I just can't believe the fact that she was able to get hold of something like this in the first place. (The Guardian, 15 Dec 1999, p. 6) b. They paid a big price for that, though, and I think they paid it in the sense that Heidstra – he talks about a Bronco-like car leaving the scene of this crime, and if Christopher Darden and the prosecution can make the jury question the timing, but believe the fact that there's a car leaving, then his testimony has the effect of suggesting Simpson at the scene. (CNN Morning News, 26 Jul 1995) c. Quite frankly, it is not a tax break for the rich. … It would be first dollar coverage, it would be a high deductible, it would be very, very affordable for those people and unfortunately I just cannot buy the fact that it is a tax break for the rich. (CNN Domestic News, 25 Apr 1996) If we take a closer look at these examples, it is evident that (19a) expresses a fact that can be verified. The mother's use of can't believe the fact is just her way of signalling incredulity. The second example, (19b), might be a little more difficult to explain in terms of factuality. If you are going to make someone believe something, this usually signifies that what you make them believe is not true. However, the context here is the courtroom, and the fact referred to is an event that is a fact for the prosecution, whose task it is to make the jury believe the facts they present.4 Neither of these examples can therefore be explained as sentences where the fact functions as a prop word devoid of meaning. The last example, on the other hand, is a contradiction, because the speaker states emphatically that “it is not a tax break for the rich”, a “fact” that he immediately says he does not believe in. In this particular case the fact is probably more of a rhetorical flourish, used to add emphasis to what the person is saying.

4 Greetham (1999:3) outlines three “degrees of truth” from the rhetoric of litigation: raw facts, the common ground taken as evidence by both parties, truefacts, the facts appropriated by only one party, which “had acquired a new level of revealed truth by being frankly partisan, and thus more true than mere facts”, and factoids, presented by the opposing party but not espoused as facts by one's own side.

It ought not to be possible to refer to things that are counterfactual (i.e. made up or imagined) as “facts”, but (20) demonstrates precisely this. In both these cases, fact fulfils a syntactic function, since neither lay money on nor invent can take that-clause complements.5 Thus it does seem that we have two cases here of the fact that as a marginal subordinator. (20) a.
I'll lay money on the fact that he's found some other way to work out his aggressions. (Walters, Minette 1999 The Dark Room Pan Books, p. 447) b. When asked whether he still works behind the bar, Martin says: 'I used to love it, but I don't do it any more. Tim invented the fact that he worked behind the bar, but he never did. (The Observer, 19 Sep 1999, p. 4) Two other verbs often mentioned as typical examples of non-factive verbs are claim and assume. Claim is a verb that signals that the complement is the speaker's subjective opinion, and assume refers to a hypothetical condition. Consequently, it is not possible to verify the truth of the complement of either of these verbs (21a,b). In (21c), where the predicate is hear, it is obvious that the speaker uses the fact to refer to something he does not believe in. (21) a. Bobby Murcer, Mantle Teammate: Well, I didn't know Gehrig and Ruth – I just heard about those guys. I knew Mickey Mantle, so I can honestly claim the fact that I think Mickey Mantle was the greatest. (CNN World News, 15 Aug 1995) b. Eli Noam: It's true for the present, but if you look at the market share of broadcasters generally, and even assuming the fact that they will be able with new technology to squeeze more broadcast channels into their broadcast signals, even assuming that, they still are – will be big players, but among many, many other competitors. (NPR Morning: Business, 9 Aug 1995) c. At Marty's Center Tap, UAW members are not optimistic about reaching a contract agreement and Mark Samp says it will take a tremendous effort to overcome the effects of a strike, which has been terrible for everyone. Mark Samp: I get tired of hearing the fact that people believe that, well, you know, time heals all wounds. (NPR Morning Edition: Domestic News, 9 Jan 1996) It is significant that all these examples are from spoken language, and the speakers have thus not been able to edit their utterances, which means that they could all be explained as performance errors. However, if we assume that the speakers used the fact for some purpose, then it is noteworthy that in none of the sentences in (21) is the fact needed for syntactic reasons: assume, claim and hear all normally take that-clause complements. So why is the fact used here at all? I would like to suggest that in the context, it has a certain pragmatic force, so that in the examples in both (20) and (21) the fact is used to strengthen the force of what is said. The fact that examples such as these are found in corpora also supports the suggestion made by Kryk (1981) that the logical analysis in terms of truth conditions of these utterances should be replaced by a pragmatic approach, and the factive/non-factive dichotomy by a scale of factive and “not-so-factive” predicates. Whether the fact is necessary to avoid an ungrammatical construction, as in (20), or just added as some kind of rhetorical flourish, as in (21), speakers make use of it in order to get a message across, a message that would certainly be less forceful without the fact. 9. Conclusion The present paper has attempted to demonstrate that the use of the fact that in present-day English cannot be explained by referring to its syntactic function alone, but semantic as well as pragmatic meaning needs to be taken into account in a description of its function in utterances. This serves as an example to demonstrate in what way a corpus-based approach to syntax can be used to enhance existing accounts.
Traditional syntax can be amended, so that a narrow focus on structure can be replaced by a multifunctional approach. Instead of classifying words as function words or content words, a common procedure in introductory texts in semantics, we will have to acknowledge that very few words act as function words only. The result is a grammar where there is a synthesis of syntactic, semantic and pragmatic rules.

References
Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman Grammar of Spoken and Written English. Harlow, Pearson Education.
Blamires H 1998 The Cassell Guide to Common Errors in English. London, BCA.
Christophersen P 1979 Prepositions Before Noun Clauses in Present-Day English. In Chesnutt M, Faerch C, Thrane T, Caie G D (eds), Essays Presented to Knud Schibsbye. Copenhagen, Akademisk Forlag, 229-234.
Granath S 1997 Verb Complementation in English: Omission of Prepositions before that-clauses and to-infinitives. Göteborg, Acta Universitatis Gothoburgensis.
Greetham D 1999 Facts, Truefacts, Factoids; or, Why Are They Still Saying Those Nasty Things about Epistemology? Yearbook of English Studies 29:1-23.
Hasselgård H, Johansson S, Lysvåg P 1998 English Grammar: Theory and Use. Oslo, Universitetsforlaget.
Heny F 1998 The Structure of Sentences. In Clark V P, Eschholz P A & Rosa A F (eds), Language: Readings in Language and Culture. Boston and New York, Bedford/St Martin's, 189-224.
Kahn J E, Ilson R (eds) 1985 The Right Word at the Right Time: A Guide to the English Language and How to Use It. London, Reader's Digest Association.
Kiparsky P, Kiparsky C 1971 Fact. In Bierwisch M & Heidolph K E (eds), Progress in Linguistics: A Collection of Papers. The Hague, Mouton, 143-173.
Kryk B 1982 The Relation between Predicates and Their Sentential Complements: A Pragmatic Approach to English and Polish. Studia Anglica Posnaniensia: An International Review of English Studies 14:103-120.
Leech G 1974 Semantics. Penguin Books.
Mair C 1988 In Defense of the fact that: A Corpus-Based Study of Current British Usage. Journal of English Linguistics 21:1, 59-71.
Matthews P H 1981 Syntax. Cambridge, Cambridge University Press.
Perrin P G 1942 Writer's Guide and Index to English. Chicago, Scott, Foresman and Company.
Quirk R, Greenbaum S, Leech G, Svartvik J 1985 A Comprehensive Grammar of the English Language. London and New York, Longman.
Seppänen A, Granath S (forthcoming) That-Clauses and the Complementation of Prepositions.
Sinclair J 1991 Corpus, Concordance, Collocation. Oxford, Oxford University Press.

Building a text corpus for representing the variety of medical language
Benoît Habert (a), Natalia Grabar (b), Pierre Jacquemart (b), Pierre Zweigenbaum (b)
(a) LIMSI-CNRS & Université Paris 10
(b) DIAM - Service d'Informatique Médicale/DSI, Assistance Publique – Hôpitaux de Paris & Département de Biomathématiques, Université Paris 6
Abstract
The representation of specialized domains in reference corpora does not always cater for the internal diversity of genres. Similarly, most sublanguage studies have focussed on domain specialization, largely leaving genre an implicit choice that received less individual attention. Specialized domains, though, display a large palette of text genres. Medicine is a case in point, and has been the subject of much work in Natural Language Processing. We therefore endeavored to build a large corpus of medical texts with a representation of the main genres found in that domain.
We propose a framework for designing such a corpus: an inventory of the main genres of the domain, a set of descriptive dimensions and a standardized encoding of both meta-information (implementing these dimensions) and content. We present a proof of concept demonstrator encoding an initial corpus of text samples according to these principles. Keywords: Specialized language, genres, medicine, French, natural language processing, TEI, CES, XML, XSL. 1 Introduction The representation of specialized domains in reference corpora does not always cater for the internal diversity of genres. This is the case for instance in Brown and LOB, where the domain is the entry key – the small size of these corpora also accounts for this. Similarly, domain specialization seems to have been the focal criterion considered in all sublanguage studies performed in the eighties (e.g., (Grishman & Kittredge, 1986)), whereas genre was largely an implicit choice that received less individual attention. Specialized domains, though, display a large palette of text genres. This is the case of medicine. Medical narratives, including discharge summaries and imaging reports, have been the most studied type of text (Sager et al., 1987; Friedman, 1997; Rassinoux, 1994; Zweigenbaum & Consortium MENELAS, 1994). Short problem descriptions, such as signs, symptoms or diseases, have been the subject of much attention too, in relation to standardized vocabularies (Tuttle et al., 1998). Some authors have also examined abstracts of scientific literature (Grefenstette, 1994). And indeed, web pages are today the most easily available source of medical documents. These documents vary both in form and in content; it has even been shown that within a single document, subparts can consistently display very different language styles (Biber & Finegan, 1994). The natural language processing (NLP) tools that have been tailored for one document type may therefore be difficult to apply to another genre (Friedman, 1997)1. All these genres are found in the same overall domain: medicine. A large palette of genres can also be found within each medical specialty: domain and genre are clearly distinct factors of textual variety. This diversity has consequences for the design and development, or simply for the use, of natural language processing tools for medical information processing. Without better informed knowledge about the differential performance of natural language processing tools on a variety of medical text types, it will be difficult to control the extension of their application to different medical documents. We propose here to provide a basis for such informed assessment: the construction of a corpus of medical texts. We address this task for French language texts, but we believe the same reasoning and methods and part of the results are applicable to other languages too. This text corpus must be useful for testing or training NLP tools2. It must provide a variety of medical texts: diversity must be obtained in addition to mere volume, since our specific aim is to represent the many different facets of medical language. We need to characterize this diversity by describing it along appropriate dimensions: origin, genre, domain, etc. These dimensions have to be documented precisely for each text. This documentation must be encoded formally, as meta-information included with each document, so that sub-corpora can be extracted as needed to study relevant families of document types. Finally, text contents must also be encoded in a uniform way, independently of the many formats documents were written in originally. We present here a framework for designing a corpus of medical texts representing genre variety: a set of genres and descriptive dimensions, inspired in part by previous relevant literature, a standardized encoding of both meta-information (implementing these dimensions) and content, using the TEI XML Corpus Encoding Standard (Ide et al., 1996), and an initial set of text samples encoded according to these principles. This work takes place in the context of a larger corpus collection initiative, project CLEF (www.biomath.jussieu.fr/CLEF/), whose goal is to build a large, diversified corpus of French texts and to distribute it widely to researchers. After a brief review of related work (section 2), we explain in turn each of the main phases of the design of our corpus: (i) assessing document diversity, choosing dimensions to characterize this diversity, and implementing them in a standard XML DTD (section 3); (ii) selecting the main classes of documents we want to represent and documenting them with these dimensions, then populating the corpus with texts (section 4). We also explain how sub-corpora can be extracted from the corpus (section 4.3).

1 The precision of French taggers evaluated within the framework of GRACE (Adda et al., 1999), measured in relation to a manually tagged reference corpus, similarly shows significant variations depending on the part of the corpus under examination (Illouz, 1999). This corpus, containing 100,000 words, has been compiled from extracts from Le Monde (2 extracts) and from literary texts: memoirs (2 extracts), novels (6 extracts), essays (2 extracts). Thus an extract from memoirs results in important variations, positive and negative, among the taggers.
2 Taggers, shallow parsers which are able to build syntactic representations for any kind of text, named entity recognizers, etc.

2 Taking into account corpus state of the art The evolution of corpus techniques and standards in the past ten years makes it difficult to reuse existing (medical) corpora. The development of standards for the encoding of textual documents has been the subject of past initiatives in many domains (electronic publishing, aeronautics, etc.), using the SGML formalism, and now its XML subset. The Text Encoding Initiative was a major international effort to design an encoding standard for scholarly texts in the humanities and social sciences, including linguistics and natural language processing. It produced document type definitions (DTDs) which have been complemented with a Corpus Encoding Standard (CES) (Ide et al., 1996). The CES DTD is therefore the natural format for encoding a corpus that is targeted at NLP tools. Beyond bibliographic description, descriptive dimensions for characterizing text corpora have been proposed by Sinclair (Sinclair, 1996) and Biber (Biber, 1994) among others. A related strand of work is that around the standardization of meta-information for documenting web pages (Dublin Core Metadata Initiative, 1999); but this covers more limited information than we shall need. In the medical informatics domain, the standardization efforts of bodies such as HL7 (Dolin et al., 2000) and CEN (Rossi Mori & Consorti, 1999) focus on clinical documents for information interchange: both their aim and coverage are different from ours.
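To give a concrete flavour of what a uniform header-plus-content encoding can look like, here is a minimal Python sketch that wraps a text sample and a few descriptive dimensions in XML. Some element names (cesDoc, cesHeader, fileDesc, profileDesc) echo the CES vocabulary, but the dimension elements and the overall structure are simplified stand-ins for illustration, not the project's actual DTD.

    import xml.etree.ElementTree as ET

    def encode_document(text: str, title: str, dimensions: dict) -> str:
        """Wrap a text and some of its descriptive dimensions in a minimal, CES-flavoured XML document."""
        doc = ET.Element("cesDoc")
        header = ET.SubElement(doc, "cesHeader")
        file_desc = ET.SubElement(header, "fileDesc")
        ET.SubElement(file_desc, "title").text = title
        profile = ET.SubElement(header, "profileDesc")
        for name, value in dimensions.items():
            ET.SubElement(profile, "dimension", name=name).text = value   # simplified stand-in element
        ET.SubElement(doc, "body").text = text
        return ET.tostring(doc, encoding="unicode")

    # Illustrative values only; the real dimension inventory is the one listed in Table 2 below.
    print(encode_document(
        "Compte rendu d'hospitalisation ...",
        "Discharge report (sample)",
        {"genre": "Discharge report",
         "mode_of_transmission": "Electronic",
         "receiver_profile": "Medical professional"}))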
The available medical corpora we are aware of do not match the criteria underlying current standards. Firstly, medical textbooks and scientific literature have been collected in project LECTICIEL (Lehmann et al., 1995) for French for Special Purposes learning. A set of software tools were available to study the various parameters of documents from the corpus (lexical choices, grammatical connectors, text organization: titles, parts...). Users could add new texts to the database and compare them with the existing sub-corpora. The encoding standard however is an obsolete one and the resulting corpus far too small by our current expectations. Secondly, one medical corpus was specifically built for the purpose of linguistic study: MEDICOR (Vihla, 1998). Although its focus is on published texts (articles and books), with no clinical documents, it is an example of the kind of direction that we wish to take. Unfortunately the initial version of the corpus provides limited documentation about the features of each document (intended audience, genre and writer qualification), which is planned to be extended. Thirdly, even though very large collections of medical texts indeed exist within hospital information systems – the DIOGENE system being among the earliest ones (Scherrer et al., 1996) – the issue here is that of privacy and therefore anonymization, to which we return below. 247 3 Identifying and representing diversity dimensions 3.1 Assessing variety dimensions In our opinion, a corpus can only represent some limited subsets of the language, and not the whole of it. No corpus can contain every type of language use3. It is even true in specialized domains such as medicine or computer science. As a matter of fact, studies of sublanguages favored until recently very few types of textual documents (see above), “hiding” the variety of registers within each of these sublanguages. In order to gather a corpus, one must explicitly choose the language use(s) (s)he wants to focus on. One must identify the main underlying dimensions of diversity which are responsible for the major contrasts within the linguistic area (s)he wants to analyze. The variety factors are twofold: external and internal. External variety refers to the whole range of parameter settings involved in the creation of a document: document producer(s), document user(s), context of production or usage, mode of publication, etc. This issue is thoroughly addressed in (Sinclair, 1996) and (Biber, 1994). It is rather straightforward to describe documents according to these lines. However, internal variety must as well be taken into account. It follows from the range of registers corresponding to the main communicative tasks of the linguistic community. Indeed informants in a specific domain such as medicine have intuitions about the major relevant registers for the domain, even if they do have difficulties in establishing clear-cut borderlines. (Wierzbicka, 1985) relies on folk names of genres (to give a talk / a paper / an address / a lecture / a speech) as an important source of insight inside communicative characteristics of a given community. It has even been shown as well that, while there is no well-established genre palette for Internet materials, it is nevertheless possible, through interviewing users of Internet (students and teaching staff in computer science), to define genres that are both reasonably consistent with what users expect and conveniently computable using measures of stylistic variation (Dewe et al., 1998)4. 
This is why the very first step consists in asking people from the domain which main communicative routines or “speech acts” they identify. We thus started from a series of prototypic contexts, and listed the types of texts related to these starting points: medical doctor (in hospital or in town), medical student, patient (consumer); patient care, research; published and unpublished documents. It is now possible to restate more precisely what we mean by variety: a domain corpus should represent the main communicative acts of the domain and their parameter settings. This analysis leads us to slightly change Sinclair's definition of a corpus (Sinclair, 1996, p. 4) (“a corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language”): the criteria need to be situational and sociological as well (Habert, 2000). The definition of the dominant communicative routines of the domain and of their precise situational parameters is a prerequisite. 3.2 Genres of medical texts Trying to compile a complete list of the types of medical textual documents is probably a never-ending task, since new situations may lead to the creation of new document types, and since a finer-grain examination could always reveal finer distinctions. Our aim here is rather to identify the main kinds of medical texts that can be found in computerized form, and to characterize each of them by specifying values for a fixed set of orthogonal dimensions related to the external and internal factors of diversity. We considered four main contexts of production or use of medical documents (table 1). In the context of care, medical professionals produce information about a patient. In a hospital, this information is registered in the form of reports. Letters are a different series of genres where the receiver is a more targeted health care professional. In a university context, teaching involves material produced by faculty members (lecture notes as well as test questions for examinations) and by students (student notes). Dissemination of knowledge much resembles what can be found in other scientific disciplines. Medical professionals read and, for some of them, write articles in various sorts of journals and conference proceedings. Medical professionals or scientific reporters also write articles for the general public in the general press or in more specialized magazines. They also report on (often yet unpublished) conference lectures. Students also produce memos and reports, the most prominent of which is the doctoral dissertation. Direct computer-mediated discussion and exchange of information takes place in specialized newsgroups and electronic lists, which are as pervasive in the medical area as in other technico-scientific domains.

3 For instance, is it possible to find real-life, that is large-scale, samples of prayers or love small-talk in any existing corpus?
4 The following ten genres were distinguished: Informal, Private (Personal home pages); Public, commercial (Home pages for the general public); Interactive pages (Pages with feed-back: customer dialogue; searchable indexes); Journalistic materials (News, editorials, reviews, popular reporting, e-zines); Reports (Scientific, legal, and public materials; formal text); Other running text; FAQs; Link Collections; Other listings and tables; Discussions; Contributions to discussions; Usenet News material; Error Messages.
Production of information on a patient
  Reports: Discharge report; Surgery report; Examination report (Endoscopy; EKG; EEG; Anatomopathology; Imaging; Functional)
  Letters: Request for advice; Referral; Discharge letter; Additional prescription
Teaching
  Course material (professorial): Lecture notes; Test questions
  Course material (student): Student notes
Dissemination of knowledge
  Periodicals: Newspaper; Magazine; Bulletin; Journal
  Articles: Generalist press article; Scientific article; Article abstract; Conference report
  Student memo: Dissertation
  Electronic discussion: News group; Electronic list
Knowledge resources
  Reference knowledge: Book; Encyclopedia; Dictionary; Monograph
  Guidelines: General French medical guidelines; Consensus conference; Recommendation; Protocol
  Official: Bulletin officiel; Code of deontology; Convention; Informed consent
  Coding systems: Terminology (Nomenclature, Thesaurus, Classification)

Table 1: A (non-exhaustive) list of genres of medical documents.

Beyond the previous items, different kinds of knowledge resources are used in medical practice. Stable reference knowledge is found in dictionaries and encyclopedias or in monographs (e.g., all that needs to be known about a given drug). More operative knowledge takes the form of guidelines and protocols which often constitute rules that medical practitioners must follow. The “Références médicales opposables” (RMO, http://www.upml.fr/rmo/, translated here as “general French medical guidelines”) are national rules that state which medical acts should be avoided in certain situations. A “consensus conference” is a conference where an authoritative group of physicians agree on a statement, e.g., about the best treatment for a disease (e.g., www.chu-rouen.fr/ssf/recomfr.html). Protocols are precise plans of diagnosis or treatment for specific diseases, especially in oncology. Official documents regulate the legal or contractual aspects of medical practice: the official bulletin is a legal publication of the French government; the Code of deontology is a regulation of the medical profession; the “Convention” is an agreement between the medical profession and the government; and the “Informed consent” warns a patient about the potential risks and benefits related to his or her treatment. Finally, coding systems organize different types of terminologies used for the normalized description of medical information. 3.3 Dimensions These document types are difficult to classify into non-overlapping groups. Therefore modeling the corpus with descriptive dimensions is all the more useful. We studied three sets of dimensions proposed in the literature (Sinclair, 1996; Biber, 1994; Dublin Core Metadata Initiative, 1999). Most of them are attributes useful in medical text genres, and were kept in our final selection (table 2). The objective of these attributes is to characterize the different types of texts and each of their instances. Here again, the set presented here is liable to revision as more documents are added to the corpus. We divide these dimensions into three groups. Bibliographic dimensions are the traditional features that characterize the origin of the document, including its links to a larger document set (e.g., article in a journal or chapter in a book) and its status as a partial or full document. Non-textual parts are removed from our documents, but their descriptions can be included.
External dimension: bibliographic references
  Origin: Title; Author, Coauthor, Translator, Contributor, Editor, Text creator; Date of creation, of translation, etc.; Identifier; Localization (Page, etc.)
  Link: None; To a series; To another text
  Extract: Full; Article; Chapter; Paragraph
  Description of embedded non-textual data: e.g., Radiograph, Photograph, Table
External dimension: context of production and reception
  Mode of production: Typed; Dictated; Manuscript
  Mode of transmission: Oral; Electronic; Printed
  Software format: Plain text, html, etc.
  Producer: Plurality (Individual; Association; Company; Institution); Function (Scientist, Medical professional, Scientific reporter, Student, Terminologist, Patient, etc.)
  Receiver: Plurality (Unique; Multiple); Presence (Present; Absent); Profile (Medical professional; Non-medical professional)
  Objective: Record; Describe; Inform; Explain; Discuss; Persuade; Recommend; Teach; Order
  Publication status: Published; Unpublished
  Frequency of publication: Periodical; Punctual
  Coverage: Local; National; International
  Rights: Statement of ownership; Usage restrictions
More internal dimensions
  Language: French
  Size of text: In words, bytes, etc.
  Level of style: Low; Medium; High
  Quality of presentation: Raw; Revised; Advanced
  Interaction with public: Distant; Neutral; Close
  Personalization: Personalized; Impersonal
  Factuality: Informative factual; Intermediate; Imaginary
  Technicity: Non-technical; Intermediate; Specialized
Table 2: Dimensions.
The second group contains external dimensions that characterize the context of production or reception of the documents. The mode of production corresponds to the original authoring of the text, the mode of transmission to the form it had before inclusion in the corpus. Only the mode of transmission is considered in (Sinclair, 1996) and (Biber, 1994); we find it useful to distinguish between the successive forms of a document during its life cycle. The software format applies to electronic texts and helps in documenting conversion work. The producer's profile includes his or her “function”. This aspect is not mentioned in (Sinclair, 1996), although that scheme does cover the profile of the receiver (“audience constituency”). The only distinction in the receiver's profile that has seemed relevant so far is whether s/he is a medical professional or not. The “objectives” merge those of (Biber, 1994) and of (Sinclair, 1996). The publication status corresponds to a usual distinction. The frequency of publication was introduced as a general attribute to help differentiate periodicals from non-periodicals. “Coverage” comes from the Dublin Core (Dublin Core Metadata Initiative, 1999): it is meant to describe “the extent or scope of the content of the resource”. Typically this encompasses the spatial and temporal validity of the text: e.g., the national applicability of a law or the temporal validity of a terminology. The “rights” attribute, also from the Dublin Core, informs the corpus user about the permitted uses of the corpus. The last group contains more internal dimensions, which can generally be detected from the text itself. “Language” is mentioned in (Dublin Core Metadata Initiative, 1999). We introduced “size” to control sampling policy over the corpus. The “level of style” is a usual dimension, as is the “quality of presentation” (Sinclair, 1996).
Some of the remaining dimensions may be related to external dimensions (e.g., “Interaction with public” usually depends on the producer's and receiver's profiles), but we consider that they are more directly reflected, and more easily verifiable, in the text contents themselves.
3.4 Domains
Medicine has numerous specialized subfields, each of which sustains professional, teaching and scientific activities, with its own societies, journals and conferences. Several lists of medical specialties can be found, among which the official list of health care professions (“Nomenclatures des professions de santé”), that of the US National Library of Medicine's Medical Subject Headings thesaurus (MeSH, www.nlm.nih.gov/mesh/meshhome.html) and that of the CISMeF internet directory of French medical resources (www.cismef.org, (Darmoni et al., 2000)). We relied on the latter, which synthesizes the first two. We made some of the categories slightly more specific by ungrouping some clusters (e.g., “Angéiologie & cardiologie” separated into “Angéiologie” and “cardiologie”); we also removed or specialized a few themes that seemed too far-fetched (e.g., “Anthropology, Education, Sociology and Social Phenomena”, reduced to “Education”).
3.5 Exploiting diversity dimensions: corpus and document headers
A corpus without documentation is a (possibly huge) bag of “dead words”. For that very reason, within the TEI standardization group, much attention has been devoted to the definition of headers (Giordano, 1995). A header is a normalized way of documenting electronic texts. The corpus header caters for the documentation of the corpus as a whole, whereas each document header contains the meta-information for its text. Each document in a corpus has a header. This header describes the electronic text and its source (fileDesc, or file description, in figure 1), giving bibliographic information when available; it records the encoding choices made for the text (encodingDesc, or encoding description), such as editorial rationales and sampling policy; it holds non-bibliographical information that characterizes the text; and it keeps a history of updates and changes (revisionDesc, or revision description). In the non-bibliographical part of the header (profileDesc, or profile description), the text is “tagged” according to one or more standard classification schemes, which can mix both free indexes and controlled ones (such as standard subject thesauri in the relevant field). These classification schemes are thoroughly described in the corpus header. It is then possible to extract sub-corpora following arbitrarily complex constraints stated in these classification schemes. For instance, the interface to the BNC relies on such an approach (Dunlop, 1995) and makes it possible to restrict queries to sub-corpora (spoken vs written language / publication date / domain / fiction vs non-fiction ... and any combination of these dimensions). We followed the TEI proposals and, more precisely, the standard TEI XML CES model (Ide et al., 1996). We could find a mapping into the CES header for each dimension of our model, and therefore implemented the model in the CES framework. Generally, bibliographic dimensions (and document size) fit into the fileDesc; the definition of the other external and internal dimensions is located in the encodingDesc section of the corpus header, and each corpus document refers to it in its profileDesc section.
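By way of illustration only, a document header of the general shape just described might be assembled as follows. The element names cesDoc, cesHeader, fileDesc, encodingDesc, profileDesc, revisionDesc and text are those mentioned above, but the child elements, attributes and values in this sketch are invented for the purpose of the example and do not reproduce the actual CES DTD.

```python
import xml.etree.ElementTree as ET

doc = ET.Element("cesDoc")
header = ET.SubElement(doc, "cesHeader")

# fileDesc: the electronic text and its source (bibliographic information)
file_desc = ET.SubElement(header, "fileDesc")
ET.SubElement(file_desc, "title").text = "Discharge summary, cardiology (anonymized)"

# encodingDesc: editorial rationale, sampling policy, markup conventions
encoding_desc = ET.SubElement(header, "encodingDesc")
ET.SubElement(encoding_desc, "samplingNote").text = "Full document; proper names and dates masked"

# profileDesc: non-bibliographic classification against the corpus-wide scheme
profile_desc = ET.SubElement(header, "profileDesc")
for dimension, value in [
    ("genre", "Discharge report"),
    ("producer.function", "Medical professional"),
    ("receiver.profile", "Medical professional"),
    ("technicity", "Specialized"),
]:
    ET.SubElement(profile_desc, "category", {"dimension": dimension}).text = value

# revisionDesc: history of updates and changes
ET.SubElement(header, "revisionDesc").text = "Converted from Word; header filled in from the document-type template"

# the text itself, with minimal structural markup
text = ET.SubElement(doc, "text")
ET.SubElement(text, "p").text = "..."

print(ET.tostring(doc, encoding="unicode"))
```

The point of the sketch is simply that each dimension of the model has a natural place in one of the header sections, so that the classification scheme declared once in the corpus header can be referred to from every document.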
An added advantage is that the CES model provides some additional documentation dimensions, e.g., information about the corpus construction process (text conversion, normalization, annotation, etc.). The implemented corpus is a collection of texts, each with its own meta-information: an instantiation of the above dimensions. On top of these texts, it provides documentation on itself: on the one hand, bibliographic information of the same kind as that of its component texts; on the other hand, meta-information about the documents it contains. The latter comprises the definition of the descriptive dimensions along which each of its documents is described.
Figure 1: Overall corpus form: corpus header (<cesHeader>: upper rectangle) then documents (<cesDoc>), each containing a document header (<cesHeader>: lower, inner rectangle) and the actual <text>.
4 Building and exploiting the corpus
4.1 Giving a shape to the corpus: document sampling
Several parameters influence the overall contents of the corpus: we focus here on the types and sizes of the documents it will include. There is debate in the corpus linguistics community as to whether a corpus should consist of text extracts of constant size, as has been the case for many pioneering corpora, or of complete documents. The drawback with extracts is that textual phenomena with a larger span cannot be studied on such samples. The overall strategy of project CLEF is therefore to opt for full documents as much as possible. In some cases, however, text samples may be easier to obtain: it may be more acceptable for a publisher, because of property rights, to give away extracts rather than full books or journals. We plan to be pragmatic about this issue. To initiate the construction of our corpus, we selected an initial subset of text types as the target population for the corpus. As explained above, we tried to represent the main communicative acts of the domain. The main text types we aim to represent initially include types from all the groups of genres listed above: hospital reports, letters (discharge), teaching material (tutorials), publications (book chapters, journal articles, dissertations), guidelines (recommendations) and official documents (code of deontology). This will be achieved progressively; the current status is that of a proof of concept, which we describe below. We were careful not to over-represent web documents, which, because they are so easy to obtain, could bias the corpus balance. An additional interesting family of genres would be transcribed speech, but the cost of transcription is too high for this to be feasible. A generic documentation for each text type was prepared. The rationale for the implementation is then to encode a document header template for each text type: this template contains the prototypical information for texts of this type. This factorizes the documentation work, so that the remaining work needed to derive a suitable document header for an individual text is kept to a minimum. Document templates were implemented for the text types included so far in the corpus.
4.2 Populating the corpus with document instances
The addition of documents to the corpus comprises several steps. The documents must first be obtained. This raises issues of property. A standard contract has been established for the project with the help of the European Language Resources Association (ELRA), by which document providers agree to the distribution of the texts for research purposes.
For texts that describe patient data, a second issue is that of privacy. We consulted the French National Council for Informatics and Liberties (CNIL). They accepted that such texts be included provided that all proper names (persons and locations) and dates be masked. The contents of each document are then converted from their original form (HTML, Word) to XML format. Minimal structural markup is added, corresponding to the TEI CES level 1 DTD. This includes paragraphs (<p>; these are marked automatically) and optionally sections (<div>). The document header template for the appropriate document type is then instantiated. For series of similar samples (e.g., a series of discharge summaries), most of this instantiation can be performed automatically. Figure 2 shows a slice of the implemented corpus.
Figure 2: A slice of the implemented corpus: extracts of the document header of document 2 and the first lines of its contents (viewed with Xerces TreeViewer).
As a proof of concept, we integrated 374 documents in the corpus: 294 anonymized patient discharge summaries from 4 different sites and 2 different medical specialties (cardiology, from project Menelas (Zweigenbaum & Consortium MENELAS, 1994), and haematology), 78 anonymized discharge letters, one chapter of a handbook on coronary angiography and one consensus conference on post-operative pain. The total comes to 143 Kwords, with an average of 385 words per document. Many colleagues have kindly declared their intent to contribute documents, so that a few million words should be attainable. Adding new documents to the corpus and documenting them requires a varying amount of work depending on the type of document. Patient documents require the most attention because of anonymization. Their documentation also raises an issue: precise documentation would reintroduce information on locations and dates, so that we must here sacrifice documentation to privacy. A pre-specified model for document description is needed if a corpus is to be used by many different people. The dimensions of our model, implemented as taxonomic “categories”, will probably need some updating as the other main types of documents are introduced. We expect, however, that they will quickly stabilize.
4.3 Extracting sub-corpora
Adherence to an existing standard enabled us to implement our corpus model in a principled way with a very reasonable effort. Besides, the general move towards XML observed in recent years facilitates the conversion of existing documents and the subsequent exploitation of the corpus, which can be manipulated with standard XML tools. We used the Xerces Java XML library of the Apache XML project and James Clark's XT library under Linux, Solaris and HP-UX. The corpus was checked for syntactic well-formedness (“conformance”) and adherence to the xcesDoc DTD (“validity”). We use XSL stylesheets to produce tailored summaries of the corpus contents and to extract sub-corpora. An XSL stylesheet can specify transformations to be applied to an input XML file, here the whole corpus. These transformations include the selection of elements of the input file (here, individual texts) and the construction of a new document (a sub-corpus) embedding these elements. Additional material, such as a new corpus header, can be built on the fly as needed.
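The selection logic that such a stylesheet expresses can also be sketched procedurally. The following Python fragment is purely illustrative: it assumes the invented element and attribute names of the header sketch given earlier, not the actual xcesDoc DTD.

```python
import xml.etree.ElementTree as ET

def extract_subcorpus(corpus_path, dimension, value):
    """Return a new corpus element containing every document whose profile
    carries the requested dimension/value pair."""
    corpus = ET.parse(corpus_path).getroot()
    subcorpus = ET.Element("cesCorpus")
    # a tailored corpus header could be built and attached here on the fly
    for doc in corpus.iter("cesDoc"):
        for category in doc.iter("category"):
            if category.get("dimension") == dimension and category.text == value:
                subcorpus.append(doc)
                break
    return subcorpus

# e.g. every anonymized discharge report in the corpus (hypothetical file name)
# reports = extract_subcorpus("clef_corpus.xml", "genre", "Discharge report")
```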
Selection can operate on any of the features of the texts, including their documentation, so that all the previously discussed genres, dimensions and domains can serve as criteria for extracting corpus texts. We have written a few stylesheets to perform specific extractions. We are currently working on a generic user interface for specifying these extractions. An important need is to keep track of the origin of the corpus elements through successive extractions (Illouz et al., 2000).
5 Conclusion and Perspectives
We have proposed a framework for designing a medical text corpus and a proof-of-concept implementation: a set of descriptive dimensions, a standardized encoding of both meta-information (implementing these dimensions) and content, and a small corpus of text samples encoded according to these principles. This corpus, once sufficiently extended, will be useful for testing and training NLP tools: taggers, checkers, term extractors, parsers, encoders, information retrieval engines, information extraction suites, etc. We plan to distribute it to NLP and Medical Informatics researchers. We believe that the availability of such a resource may fill a gap in currently available corpora and support better study of the issues of genre in a specialized domain. The corpus should also allow more methodological, differential studies on the medical lexicon, terminology, grammar, etc.: e.g., terminological variation across genres within the same medical specialty, or the correlation of observed variation with documented dimensions, which should teach us more about the features of medical language.
6 Acknowledgments
We wish to thank the French Ministry for Higher Education, Research and Technology for supporting project CLEF, D Bourigault and P Paroubek of project CLEF's management board for useful discussions, B Séroussi and J Bouaud for help with the document genres, and the many colleagues who agreed to contribute documents to the corpus.
Bibliography
Adda G, Mariani J, Paroubek P, Rajman M, Lecomte J 1999 Métrique et premiers résultats de l'évaluation GRACE des étiqueteurs morpho-syntaxiques pour le français. In Amsili P (ed), Proceedings of TALN 1999 (Traitement automatique des langues naturelles), Cargèse, ATALA, pp 15–24. Biber D 1994 Representativeness in corpus design. Linguistica Computazionale, IX-X:377–408. Current Issues in Computational Linguistics: in honor of Don Walker. Biber D, Finegan E 1994 Intra-textual variation within medical research articles. In Oostdijk N, de Haan P (eds), Corpus-based research into language, number 12 in Language and computers: studies in practical linguistics. Amsterdam, Rodopi, pp 201–222. Darmoni S. J, Thirion B, Leroy J. P, Douyère M, Baudic F, Piot J 2000 CISMeF: a structured health resource guide for healthcare professionals and patients. In Proceedings of RIAO 2000: Content-Based Multimedia Information Access, Paris, France, C.I.D. Dewe J, Karlgren J, Bretan I 1998 Assembling a balanced corpus from the internet. In Proceedings of the 11th Nordic Conference on Computational Linguistics, Copenhagen, pp 100–107. Dolin R, Alschuler L, Boyer S, Beebe C 2000 An update on HL7's XML-based document representation standards. Journal of the American Medical Informatics Association, 7(suppl):190–194. Dublin Core Metadata Initiative 1999 The Dublin Core Element Set Version 1.1. WWW page http://purl.org/dc/documents/rec-dces-19990702.htm. Dunlop D 1995 Practical considerations in the use of TEI headers in large corpora. Computers and the Humanities, 29:85–98.
Friedman C 1997 Towards a comprehensive medical natural language processing system: Methods and issues. Journal of the American Medical Informatics Association, 4(suppl):595–599. Giordano R 1995 The TEI header and the documentation of electronic texts. Computers and the Humanities, 29:75–85. Grefenstette G 1994 Explorations in Automatic Thesaurus Discovery. Natural Language Processing and Machine Translation. London, Kluwer Academic Publishers. Grishman R, Kittredge R (eds) 1986 Analyzing Language in Restricted Domains. Hillsdale, New Jersey, Lawrence Erlbaum Associates. Habert B 2000 Des corpus représentatifs : de quoi, pour quoi, comment ? In Bilger M (ed), Linguistique sur corpus : Études et réflexions, volume 31 of Cahiers de l'Université de Perpignan. Presses universitaires de Perpignan, pp 11–58. Ide N, Priest-Dorman G, Véronis J 1996 Corpus Encoding Standard. Document CES 1, MULTEXT/EAGLES, http://www.lpl.univ-aix.fr/projects/eagles/TR/. Illouz G 1999 Méta-étiqueteur adaptatif : vers une utilisation pragmatique des ressources linguistiques. In Amsili P (ed), Actes de TALN'99 (Traitement Automatique des Langues Naturelles), Cargèse, ATALA, pp 185–194. Illouz G, Habert B, Folch H, Heiden S, Fleury S, Lafon P, Prévost S 2000 TyPTex: Generic features for text profiler. In Proceedings of RIAO 2000: Content-Based Multimedia Information Access, Paris, France, C.I.D., pp 1526–1540. Lehmann D, de Margerie C, Pelfrêne A 1995 Lecticiel – Rétrospective 1992–1995. Technical report, CREDIF – ENS de Fontenay/Saint-Cloud, Saint-Cloud. Rassinoux A.-M 1994 Extraction et Représentation de la Connaissance tirée de Textes Médicaux. Thèse de doctorat ès sciences, Université de Genève. Rossi Mori A, Consorti F 1999 Structures of clinical information in patient records. Journal of the American Medical Informatics Association, 6(suppl):132–136. Sager N, Friedman C, Lyman M. S (eds) 1987 Medical Language Processing: Computer Management of Narrative Data. Reading, Mass., Addison Wesley. Scherrer J.-R, Lovis C, Borst F 1996 DIOGENE 2, a distributed information system with an emphasis on its medical information content. In van Bemmel J. H, McCray A. T (eds), Yearbook of Medical Informatics ’95 — The Computer-based Patient Record. Stuttgart, Schattauer. Sinclair J 1996 Preliminary recommendations on Text Typology. WWW page http://nicolet.ilc.pi.cnr.it/EAGLES/texttyp/texttyp.html, EAGLES (Expert Advisory Group on Language Engineering Standards). Tuttle M, Olson N, Keck K, Cole W, Erlbaum M, Sherertz D, Chute C, Elkin P, Atkin G, Kaihoi B, Safran C, Rind D, Law V 1998 Metaphrase: an aid to the clinical conceptualization and formalization of patient problems in healthcare enterprises. Methods of Information in Medicine, 37(4-5):373–383. Vihla M 1998 Medicor: A corpus of contemporary American medical texts. ICAME Journal, 22:73–80. Wierzbicka A 1985 A semantic metalanguage for a crosscultural comparison of speech acts and speech genres. Language in society, 14:491–514. Zweigenbaum P, Consortium MENELAS 1994 MENELAS: an access system for medical records using natural language. Computer Methods and Programs in Biomedicine, 45:117–120. 
The use of the progressive in Swedish and German advanced learner English: a corpus-based study
Margareta Westergren Axelsson* and Angela Hahn°
*Department of English, Uppsala University
°English Language and Linguistics, Chemnitz University of Technology
Margareta.Westergren_Axelsson@engelska.uu.se, Angela.hahn@phil.tu-chemnitz.de
In our paper we approach the following research questions from a contrastive perspective: 1. What are the differences, if any, in the use of the progressive in Swedish and German learner English? 2. What functions of the progressive do learners typically choose? Are there any functions chosen other than the purely aspectual ones? 3. What types of non-native progressives are produced by these learners? The study is part of a cooperation project between the Chemnitz University of Technology and Uppsala University with questions of corpus comparability as its main concern. Our research draws on two learner corpora, namely the Uppsala Student English (USE) corpus (Axelsson 2000), and the German component of the International Corpus of Learner English (ICLE). To facilitate comparison we have chosen one text type only – the argumentative student essay. We have focused on the use of the progressive, since this is a feature not present in Swedish or German, and thus interesting from a contrastive point of view (cf. also Virtanen 1997, Hahn et al. 2000). The method entails two types of investigation: 1. A qualitative study of about 12,000 words from each corpus (randomly chosen essays) discussing the choice of form (simple or progressive) by advanced learners. 2. A quantitative description of the use of the progressive in material of about 70,000 words from each corpus.
References
Axelsson, M. W. 2000. USE – The Uppsala Student English Corpus: An instrument for needs analysis. ICAME Journal 24: 155-157. Hahn, A., S. Reich & J. Schmied. 2000, in press. Aspect in the Chemnitz Internet Grammar. Proceedings of the ICAME conference 1999, Freiburg. Virtanen, T. 1997. The progressive in NS and NNS student compositions: evidence from the international learner corpus. In: M. Ljung (ed), Corpus-Based Studies in English. Amsterdam: Rodopi: 299-309.
A reusable corpus needs syntactic annotations: the Prague Dependency Treebank
Eva Hajičová and Petr Sgall
Center for Computational Linguistics, Faculty of Mathematics and Physics, Charles University, Prague
e-mail: {hajicova,sgall}@ufal.mff.cuni.cz
The Prague Dependency Treebank (PDT, i.e. an annotated part of the Czech National Corpus) is conceived as a three-layer system of tags; the individual layers can be characterized as follows: (i) morphemic tagging capturing relatively disambiguated values of morphemic categories, based on a full morphemic analysis of Czech; (ii) syntactic tags at the so-called analytical level, capturing the functions of individual word forms; in the analytical tree structures (ATSs), every word token and punctuation mark has a corresponding node and is analyzed for its POS and morphemic value, as well as for the main syntactic functions (‘analytical functors', ‘afuns'); among the afuns, Subj, Obj and Adv are not classified in a more subtle way; (iii) syntactic tags at the tectogrammatical level (TGTSs), rendering the underlying (tectogrammatical) structure of the sentence, i.e., its syntactic structure proper (with a detailed classification of underlying syntactic functions).
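To make the three layers concrete, the following sketch (illustrative only; neither the representation nor the tag values are taken from the PDT itself) shows one way a word token carrying all three kinds of tags could be represented:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TokenNode:
    form: str                         # word form (or a node restored for a deletion)
    morph_tag: str                    # layer (i): disambiguated morphemic tag
    afun: str                         # layer (ii): analytical function, e.g. Subj, Obj, Adv
    functor: Optional[str] = None     # layer (iii): tectogrammatical function
    parent: Optional["TokenNode"] = None    # governing node in the dependency tree
    children: List["TokenNode"] = field(default_factory=list)

# illustrative values only
verb = TokenNode(form="čte", morph_tag="verb.pres.3sg", afun="Pred", functor="Predicate")
subject = TokenNode(form="Jan", morph_tag="noun.masc.sg.nom", afun="Subj",
                    functor="Actor", parent=verb)
verb.children.append(subject)
```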
In the sequel we focus on a brief characterization of the TGTSs and on issues that are specific for the PDT scenario and are crucial, especially from the linguistic point of view. These issues concern (i) the transition from ATSs to TGTSs, (ii) the assignment of the features of the information structure of the sentence (topic-focus articulation), and (iii) a tentative treatment of coreference relations. The TGTSs are based on dependency syntax; the tagging at this level is guided by the following principles: (a) a node of a TGTS represents an autosemantic (lexical) word; the correlates of synsemantic (functional, auxiliary) words are attached to the autosemantic words to which they belong; (b) in the cases of deletion in the surface shape of the sentence, further nodes are supplied into the TGTS to ‘recover’ a deleted word; (c) no non-projective structures are admitted in the TGTSs (they are supposed to be solved by movement rules between the ATS and the TGTS); (d) not only the direction of the dependence on the governing node (dependence to the left, dependence to the right) is taken into account, but also sister nodes are ordered (from left to right). 256 Using the BNC to produce dialectic cryptic crossword clues David Hardcastle, Birkbeck College, University of London 1. Overview This paper describes an attempt to generate seemingly meaningful cryptic crossword clues without trying to analyse meaning but relying solely on word occurrence statistics. It is a continuation of a project in which I developed an application toolkit for cryptic crossword clue compilers. The software described here assembles simple cryptic clues using the resources developed in the earlier project combined with the British National Corpus (BNC) Sampler. Some pieces of the process remain problematic making it tempting to look for recourse in grammatical and syntactic data or investigations with a high processing overhead. However, my aim is to try to extract as much mileage as possible from data derived from the BNC that can be processed with a limited overhead. All of the clues are of a particular type, which I term ‘dialectic’ clues using the taxonomy of D St P Barnard (Barnard 1963). A dialectic clue is a pair of synonyms for the answer word, appositely combined as a single short phrase. For example, “Delicate but dainty” and “Pretty light”1 are acceptable dialectic clues for the word “fair”. Ideally the apparent syntax of the clue should mislead the person solving the clue by strongly suggesting a different sort of answer. Clues for “fair” such as “Market average” and “Sound common” would fall into this category. To assemble such clues, the software must first evaluate all possible synonym pairs2 for the clue word, and decide which pairs are more apposite than the others. Once a list of suitable pairs has been found, the second task is to attempt to link the pairs together in a manner which ideally is meaningful, or failing that at least not too jarring. The principal focus of this paper is on the first of these two tasks, namely identifying apposite pairings and ranking lists of pairings for shared meaning. I shall then briefly address possible ways of linking the pairs together. 2. Finding and ranking apposite pairings 2.1 Retrieving the list of synonyms This part of the process is very straightforward. I required a list of synonyms for my MSc thesis (Hardcastle 1999), and downloaded a machine-readable version of Roget's 1912 thesaurus from the Gutenberg Project. 
I then assembled the data from it into a synonym dictionary. Unfortunately many of the synonyms are out-of-date and many form lists of co-hyponyms or loosely associated terms. As a result, the clues are often marred by poor synonyms. The fix for this is clearly to locate a more up-todate, machine-readable thesaurus, but for the moment the focus of my work and of this paper, is on combining the resulting pairs rather than their quality as synonyms for the clue word. 2.2 Language and meaning in cryptic crossword clues The aim of this project is the generation of cryptic crossword clues and not sentences. Although these may be seen as an analogue for natural language, there are key differences. Cryptic crossword clues usually have a simple minimal syntax which at best determines the rubric3 of the clue, and frequently merely acts as filler between the key elements. This is particularly the case for dialectic clues, since they have the simplest syntax of any of the groupings of clues in Barnard's taxonomy. The key difference between cryptic clues and sentences is that of meaning. Although cryptic clues do not have a reference in the real world, they potentially offer two other levels of meaning. The first is the rubric of the clue, a list of instructions which can be used to solve a clue. The second gives the reader a vague sense that the clue could refer to some situation and as such that it resembles a meaningful English sentence. 1 All example clues given are from output generated by the software, unless stated otherwise. 2 Lists of synonyms are derived from a synonym dictionary that I constructed for an earlier project 3 By ‘rubric’ I mean a simple set of instructions through which the clue may be solved 257 For example, the clue “effeminate English to embrace the church”4, a clue for the word “epicene”, is both a rubric and a sort of sentence. The answer means “effeminate”, and can be formed with the letter ‘E’ for English, then the word “pine” around the abbreviation “CE” for church, as such the clue is a set of instructions. Of the many alternatives to communicate one item around another, such as “cover”, “ensnare”, “surround”, the compiler chose the word “embrace”. Similarly the letter ‘E’ could have been clued differently, or the whole shape of the clue could have been different. I suggest that from a wide variety of possibilities, the human compiler settled on a particular set of words with a certain feel to them, since they had just enough in common to suggest a reference. Although the clue is not a proper sentence, it seems to have some sort of meaning, in that it appears to refer to something real. It is this sense of appropriate feel that I am aiming to capture in the ranking of the pairs of synonyms for the simpler dialectic clues. 2.3 Finding apposite pairings This is the central focus of this paper. Given a list of pairs of words, the software should put the list into rank order according to the extent to which each pair has something in common, rather like the common game of word association. Tightly linked pairs such as “bus” and “stop” should come at the top, then pairs with a tight scope of common context such as “spanner” and “mechanic”, then words with broad shared scope such as “write” and “work” and finally words which show scant promise of association, such as “bookcase” and “goaded”. One way of achieving this might be to consider the definitions of the words, or to map them according to their membership of certain thematic subsets. 
However I chose to consider only the extent to which they co-occur in the BNC in order to investigate whether this data alone would be sufficient to rank the list. 2.3.1 Getting the raw data Since the mark-up language of the BNC specifies a clear hierarchical structure, it is possible to break it down into many different sized chunks, from the level of a whole chapter of a book to a single sentence or even word boundary. Pairs that share a word boundary should be good candidates for the top-scoring set, those which share sentences or paragraphs should be good candidates for the second, and so on down to pairs which are not even much in evidence co-habiting the largest chunks. To examine pairs and determine the extent of their shared space in the corpus I constructed an index of the corpus sampler. In this index, every instance of the four hierarchical section boundaries (“<div1>” to “<div4>”) is sequentially numbered, as is each paragraph within them, each sentence in each paragraph and each word within each sentence. The resulting index is a dictionary file which provides a list of coded keys for each word in the corpus. bridge / 313|-1|-1|-1|8|1|39 / 327|102|60|-1|13|3|4 / 350|164|98|-1|6|4|18 Figure 1: A sample entry from the index to the BNC Sampler Figure 1 shows part of the entry for the word “bridge” and shows the keys for the first three occurrences of “bridge” in the BNC Sampler. The final key (350 164 98 –1 6 4 18) states that the word “bridge” can be found in <div1> number 350, <div2> number 164, <div3> number 98, in an area with no <div4> context and in the 18th word of the 4th sentence of the 6th paragraph of that section. Were the word “suspension” to have a key for example 350 164 98 –1 6 4 17 this would tell us that the phrase “suspension bridge” is evidenced once in the BNC Sampler. Were the word “bidding” to have a key 313 –1 –1 –1 8 1 10, thus sharing the same paragraph as the first key for “bridge”, this would tell us that the words “bidding” and “bridge” shared the same paragraph once in the BNC Sampler. 4 “Phi”, The Independent Saturday Crossword, No. 4461, 3rd February 2001 258 Using this index, the software can rapidly return the number of occurrences of any word in the BNC Sampler and also the number of co-occurrences within each level of scope5 for any pair of words. At present the coding does not take advantage of non-hierarchical scope data such as page breaks or clause boundaries, nor does it differentiate between different sources. Both of these enhancements could improve the quality of the raw data. The former would increase the depth of the data returned, and the latter may prove a useful enhancement to the “<div1>” data, since it could give an indication of size. While a few co-occurrences at the top hierarchical level of a novel, namely at chapter level, may not mean a lot, in the context of the spoken corpus or of ephemera where the top hierarchical level is likely to be much smaller it would be more significant. 2.3.2 Interpreting the raw data The raw numbers of co-occurrences returned at the different scope levels do give some rough indication of the level of shared context between the two input words. However, such data requires further refinement since common words tend toward the top of the list, and rarer words toward the bottom, regardless of how apposite the pairings. I examined the feasibility of using the statistical test chi-squared to determine the significance of the co-occurrences for each pair against a random distribution. 
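Before turning to those statistics, the lookup itself can be sketched as follows. This is a hypothetical reconstruction in Python: the key format follows Figure 1, but the function names and the exact logic are illustrative rather than the software's own.

```python
def parse_key(key):
    """Split a key of the form div1|div2|div3|div4|paragraph|sentence|word."""
    return [int(part) for part in key.split("|")]

def narrowest_shared_scope(key_a, key_b):
    """Return the smallest chunk that two occurrences share, or None."""
    a, b = parse_key(key_a), parse_key(key_b)
    labels = ["div1", "div2", "div3", "div4", "paragraph", "sentence"]
    shared = None
    for label, x, y in zip(labels, a, b):
        if x == -1 and y == -1:
            continue                  # this level is absent in this part of the corpus
        if x == y:
            shared = label
        else:
            return shared
    if shared == "sentence" and abs(a[6] - b[6]) == 1:
        return "word boundary"        # adjacent words in the same sentence
    return shared

# "suspension" and "bridge" from the example above share a word boundary;
# "bidding" and the first key for "bridge" share (at least) the same paragraph
print(narrowest_shared_scope("350|164|98|-1|6|4|17", "350|164|98|-1|6|4|18"))
print(narrowest_shared_scope("313|-1|-1|-1|8|1|10", "313|-1|-1|-1|8|1|39"))
```

Counting, for each pair, how many co-occurrences fall at each of these levels yields the raw data that the scoring discussed next has to interpret.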
Unfortunately chi-squared, and other tests for statistical significance, lend additional weight to larger samples. While this makes perfect sense for data such as polls, it only deepened the rift between common and rare words rather than promoting the more apposite pairs. For example, the pairing “work” and “go” scored 17,649 in a chi-squared test, whereas the pairing “ski” and “salopettes” scored only 4. The hypothesis behind the process of comparison was that the closer the association between a pair of words, the more likely it would be to find co-occurrences of the pair within small chunks of the BNC such as sentences or paragraphs. Therefore the scoring algorithm needed to lend extra weight to co-occurrences found within a small scope, such as a sentence. In order to determine whether the number of co-occurrences found at each level of scope counted as likely or unlikely, I constructed a control set of over 20,000 randomly selected pairs from a machine-readable dictionary (Mitton 1986) to measure the average number of co-occurrences at each level of scope for each pair. To counter the advantage that high-frequency words enjoyed in the earlier scoring systems, I recorded the ratio of co-occurrences to total occurrences of the pair rather than just the raw total of co-occurrences. The scoring algorithm records, at each level of scope, the factor of difference between the average ratio of the baseline pairs and the recorded ratio for the pair under examination. The score is the total of these factors of difference. The factoring process lends extra weight to co-occurrences at small scope levels, as the baseline ratios for sentences and paragraphs are extremely small. Word boundary co-occurrences are scored in raw quantity, and pairs which evidence such juxtapositions are promoted to the very top of the list regardless of their overall score. To return to our earlier example, “work” and “go”, which scored 17,649 with the chi-squared test, now scored 1,415, while “ski” and “salopettes” was promoted from a chi-squared score of 4 to a score of 3,785. There are some drawbacks to this scoring system. Firstly, the use of ratios discriminates heavily against unbalanced pairings where one word is very common and the other very rare. It is also open to some bizarre results where a word shows up just a few times in the BNC Sampler and all of the occurrences are in a particular atypical context. Finally, words that do not appear in the BNC Sampler score zero by default, thus potentially missing good pairings. These problems can all be addressed by using a larger portion of the BNC. In spite of these potential problems, the scoring system seems to function relatively well, although testing the system proved a difficult task. Since the shared context, or lack of one, between any two words is not a given, it is difficult to test the scores generated against some other property of the pairings. At some future stage I would like to compare the rankings of a set of input pairings with the results of a survey in which people rank the pairs according to how well they go together. For the time being I have run two simple tests on the scoring system. The first involved some consideration of the distribution of test scores for a large set of randomly generated pairings derived from the dictionary.
5 By ‘scope' I mean the size of the context within which a co-occurrence has been found, such as a sentence, a paragraph, or a chapter.
The resulting distribution is represented in Figure 2 and appears to be relatively promising. Firstly, it seems to be discriminatory, in that the vast majority of pairs failed to score, and that the frequency of scores follows quite a steep curve from frequent mediocrity to the rare sublime pairing at the top of the scale. A cursory inspection of some examples of the ranked pairings also suggested a relatively successful scoring mechanism (see Figure 2).
Score band | %age | Examples
1,001 to 10,000 | 0.1 | goodwill, contract (3,089); factory, working (2,436); jazz, waistband (1,753)
201 to 1,000 | 0.4 | rain, outlook (907); raid, escapee (440); fluid, stomach (330)
51 to 200 | 0.8 | club, sixty (187); holy, building (157); guest, tribute (101)
11 to 50 | 2.7 | waistcoat, guilt (26); blank, game (22); torch, wind (35)
1 to 10 | 3.1 | timing, accession (4); cardboard, scamper (4)
0 | 92.9 | haircut, garlic (0); radish, thump (0); Lutheran, shaker (0)
Figure 2: some examples of pairs in the different score bands following a test of the scoring system on a large sample of word pairs. The percentages represent the number of pairs in each score band as a percentage of the total pairs under examination.
Finally I tested the scoring mechanism with a set of chosen pairs involving the word “market” which I felt had more or less in common along a relatively clear scale. The resulting rankings do at least seem to group the pairings roughly into the correct halves, with most of the top ten being those which I picked because I felt that they shared a common context, and most of the bottom half of the table being those which I chose to represent a poor association (Figure 3). Four of the expected top five pairs received the highest scores, with “fruit” unexpectedly coming in 7th place. However, the words which received low scores did not do so simply because of low frequency: “fruit” scored 177 with “tree” and 871 with “veg”, while “church” and “Catholic” scored 1,703 as a pairing.
Expected top 5 scoring set: Average 1542 (1st); Sell 1000 (2nd); Money 580 (3rd); Europe 560 (4th); Fruit 47 (7th)
Expected bottom 5 scoring set: Church 145 (5th); Plate 67 (6th); Catholic 15 (8th); Ski 15 (8th); Headlights 5 (10th)
Figure 3: Scores for a set of words paired with “market”
2.3.3 Refining the scoring process – ‘third party co-referents'
Although this system of scoring provided a reasonable system of ranking pairings, it did not perform well for words with very low frequencies in the BNC Sampler. For example, the words “headlights” and “wipers”, which clearly have a strong association, receive an unimpressive score (9), with only 2 matches between them, and those at the widest scope. Despite the strong word association, it is unlikely that we will find many co-occurrences within the same sentence or the same paragraph. Indeed, even with a much larger corpus we may not find a significant number of co-occurrences within small scopes. To deal with pairs of words where one or both occurred relatively infrequently, I decided to examine which other words occurred in close proximity. Reversing the index shown in Figure 1, I generated a look-up table of all the words which co-occur in the same paragraph as the key. I will refer to the list of words which co-occur at paragraph level with both input words as ‘third party co-referents'. Initially the lists of co-referents were dominated by function words, pronouns and common verbs.
To determine the culprits of this ‘noise', I generated lists for a large sample of input words and identified words which appeared in more than 80% of the outputs. These words are filtered out of all lists of co-referents which the software produces. Figure 4 shows the most commonly co-occurring words within paragraph scope for “headlights”, “brake” and “donkey”, filtered for noise. The words “headlights” and “brake” share two of the top seven entries, and indeed many more of their respective full lists, while “donkey” shares none in the top seven and few in the full list with either of the other words. Such ‘third-party co-referents' provide a means to promote such a pair up the rankings. An important point is that the list of paragraph-level co-referents does not require that “headlights” and “brake” co-occur in any paragraphs themselves; instead the co-referents represent an intermediate measure of shared context to make up for the lack of direct evidence.
Headlights: Fog*; Grandma; Bumper; Vibrating; Spotlights; Bonnet; Balancing; Flash; Solenoid*; Flashing
Brake: Caravan; Solenoid*; Portable; Toot; Slate; LTD; Fog*; Fluid; Daimler; Starter
Donkey: Ballot; Votes; Rifle; Exhaustive; Candidate; Battalion; Voters; Camel; Ape; Spanish
Figure 4: co-referent lists; “fog” and “solenoid” act as third party co-referents between “headlights” and “brake”
Returning to the example of “headlights” and “wipers”, the software was now able to score a list of third party co-referents: words which co-occurred at paragraph level with “wipers” and also with “headlights”. Using this intermediary the pair now scored 62 rather than 9, a modest but notable improvement. The system also provided some measure for pairs which had previously scored zero, such as “wipers” and “brake”, which rose from 0 to 40. The system occasionally threw up some rather unusual pairing suggestions, such as “Catholic” and “sheep”, which scored 610, although further investigation always uncovered a rational explanation and an overlap of reference. At present this system of third party co-referents remains experimental, although the results so far have been very promising. I selected paragraph-level scope since I felt that it was wide enough to provide sufficient information to work with, while being sufficiently small that the co-reference could reasonably be said to mean something. It may be that a larger scope, such as page level, or a reduction to sentence level would produce cleaner or more informative result sets. It is notable that at present this system has been tested predominantly with nouns. Nouns, and count nouns in particular, seem intuitively to be more apt for this type of processing as they can be more readily ascribed to thematic subsets. Although many verbs and adjectives share this property, it is arguable that the majority have too general a function for this process to bear fruit. Whether or not this distinction holds is something I hope to explore as I develop this co-referencing process.
2.3.4 Third party co-referents and meaning lists
As I explored the first few lists of third party co-referents I observed that for many words the lists contained many key thematic elements that to some extent defined the context of the input word. Although an unedited list of these words would prove a crude measure of meaning, I felt that the cross-reference between this list and lists of co-hyponym sets might provide some measure of meaning. To clarify, the idea would be to compare an input word to a set of co-hyponym lists.
The crossreference would involve neither the input word, nor the co-hyponym key, but a cross-referencing between the list of ‘third party co-referents’ for the input word and the list of co-hyponyms under the keyword. A large overlap might indicate that the co-hyponym keyword represents some core aspect of the meaning of the input word. The thesaurus I have been using contained many lists of co-hyponyms, indeed many of the clues I have generated still suffer from a poor rubric as they contain co-hyponyms rather than synonyms. In order to reduce the extent of this problem I removed the tagged lists of co-hyponyms from the thesaurus. Figure 5 shows the results of cross-referencing the third party co-referents from the input words with the cohyponym lists from Roget. The percentages represent the percentage of cross-references between a coreferent and a co-hyponym which occurred in each category. Although the lists are crude and short, numbering less than 30 in total, the results quite clearly favour some reasonable and interesting keywords. Church: Social (23%) Religion (17%) Military (17%) Politics (12%) Horse: Animals (30%) Industry (9%) Military (9%) Doctor: Politics (18%) Social (15%) Science (9%) Medicine (7%) Figure 5: Examples of the results of cross-referencing the co-referent list for the input word with lists of co-hyponyms. With the help of the full version of the BNC and a fuller set of co-hyponym lists, I hope to be able to return a list of meaning keywords for most noun input words. As above, it is highly likely that while this system will work well for nouns and for count nouns in particular, it will not work so well for verbs, adjectives and adverbs since they are not so readily classified into sets. 3. Linking the pairs together Having ranked the pairs of synonyms for the clue solution word according to their shared context using the processes described above, the next step is to link the pairs together in as appropriate a fashion as possible. 3.1 Juxtaposed pairs Some of the pairs may have been found in juxtaposition within the corpus. Given that there is evidence of how to combine them already, the best chance for an idiomatic feel is to present them as they were found. For the most part, this approach seems to have been relatively successful producing clues such as “Still soft” for “gentle”, “Standing order” for “condition” and “Tax band” for “press”. Where the approach is less successful is where the pairing forms a part of a longer idiomatic phrase. While “Pretty big” sits happily as a unit on its own, the pairing “Great big” scores the same but is clearly inferior, as a third word is expected. It is possible that the difference in quality between “pretty big” and “great big” could be measured with closer examination of the surrounding tags and clause boundary markers. However, I feel that to set out in such a direction would detract too much from the core of this project, which is to produce dialectic cryptic clues at low processing cost, and using descriptive data from the BNC. Thus pairs evidenced in juxtaposition are left as they are found, and the software takes its chances with the results. Provided that the words are not function words or pronouns, unlikely in the context of synonyms for a clue word, the result should at the very least be acceptable and seems quite frequently to produce a nice pun. I describe the scoring of puns below (see below Section 4). 
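As a concrete illustration of the juxtaposition test just described, adjacent index keys of the kind shown in Figure 1 could be detected, and the attested order recovered, along the following lines. This is a hypothetical sketch in Python; the function and its name are illustrative, not the software's actual code.

```python
def juxtaposed_order(word_a, key_a, word_b, key_b):
    """If two occurrences are adjacent in the same sentence, return the pair
    in the order in which it was actually found; otherwise return None."""
    a = [int(part) for part in key_a.split("|")]
    b = [int(part) for part in key_b.split("|")]
    if a[:6] == b[:6] and abs(a[6] - b[6]) == 1:
        return (word_a, word_b) if a[6] < b[6] else (word_b, word_a)
    return None

# the evidence supports presenting "suspension bridge", not "bridge suspension"
print(juxtaposed_order("bridge", "350|164|98|-1|6|4|18",
                       "suspension", "350|164|98|-1|6|4|17"))
```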
3.2 Pairs requiring a link word
The remaining pairs of synonyms were not found in juxtaposition, and so they cannot be presented as a clue until a suitable way has been found to link them together. One way of achieving this is to examine what part of speech the words could be and to link them in appropriate ways: for example, singular noun plus “to” plus infinitive intransitive verb, infinitive transitive verb plus singular noun, adjective plus singular noun, noun plus “or” plus noun, and so on. While this approach generates some passable clues, they do not read very idiomatically and, since such an algorithm requires a decision on what part of speech each component is, homograph puns are likely to be missed. Furthermore, the system does not allow for idiomatic links between pairs, since any departure from the simplest grammatical structure is likely to generate unsightly exceptions. For example, given the words “able” and “resolve” we could choose to interpret them as an adjective and noun pair and produce “able resolve”. However, they could also be interpreted as an adjective and verb pair, but the phrase “able to resolve” would be a significant risk, as this structure is not guaranteed to work with the majority of adjective-verb pairings and may produce some awful phrases such as “red to drill” or “peaceful to dig”. The phrase “able to resolve” is acceptable since one can be “able to” do many things. Rather than set about listing the adjectives that could fit this pattern and build a prescriptive grammar, I decided to list the conjunctive phrases found immediately before and immediately after all of the words in my dictionary in the BNC. For want of a better source of conjunctive phrases I opted to use a list of function words (Mitton 1996) and to count a phrase as one, two or three function words in a group. Some examples of the resulting statistics are listed in Figure 6 and Figure 7. The entries for “Red” and “Grounds” in Figure 7 exemplify problems with adjective position and framing phrases; these issues are discussed in greater detail below.
Contact: with (63), between (8), by (5), from (3)
Popped: in (44), along (11), down (11), on (11), round (11), up (11)
Ready: to (45), for (30), and (6)
Figure 6: Examples of statistics for conjunctive phrases following dictionary entries; the numbers in brackets are percentages representing the frequency of each phrase in the total sample found.
Austria: in (25), of (25), and (12), from (12), to (12), with (12)
Perspective: in (35), of (14), on (14), and (7), from the (7), of the (7), that (7)
Red: a (19), of (12), in the (6), into the (5), that (5), with (4), and the (2) …
Grounds: on the (65), in the (10), of (7), on (2), outside the (2), within the (2) …
Figure 7: Examples of statistics for conjunctive phrases preceding dictionary entries; the numbers in brackets are percentages representing the frequency of each phrase in the total sample found.
Rather than attempt to combine the pairings by determining their part of speech and looking for a grammatical structure to accommodate them, the software looks at the function words which are evidenced preceding and following each word in the pair. If a fit is found between the preceding conjunctive phrases of one word and the following conjunctive phrases of another, it is scored according to the combined relative frequency of both. In this way the phrase “able to resolve” becomes a safe bet with a score of 186 out of a maximum of 200 rather than a gamble.
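The scoring of such a fit can be sketched as follows. This is an illustrative reconstruction only: the frequency figures for “able” and “resolve” are invented purely so that the example reproduces the score of 186, and the function is not the software's own code.

```python
# Percentages of occurrences of each word followed (respectively preceded) by a
# given function-word phrase; these numbers are invented for the illustration.
following = {"able": {"to": 93, "for": 4}}
preceding = {"resolve": {"to": 93, "and": 3}}

def best_link(word_a, word_b):
    """Find a conjunctive phrase that both follows word_a and precedes word_b,
    scoring it by the sum of the two relative frequencies (maximum 200)."""
    candidates = []
    for phrase, freq_after_a in following.get(word_a, {}).items():
        freq_before_b = preceding.get(word_b, {}).get(phrase)
        if freq_before_b is not None:
            candidates.append((freq_after_a + freq_before_b,
                               f"{word_a} {phrase} {word_b}"))
    return max(candidates) if candidates else None

print(best_link("able", "resolve"))   # e.g. (186, 'able to resolve')
```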
Other outputs from this process are listed in Figure 8. left a pot (46), pot on the left (34), left to pot (20), pot to the left (17), left in the pot (15), left on the pot(13) … likely to go (158), likely that go (13), likely a go (4) … pile of grass (82), pile in the grass (25), pile on the grass (16) … Figure 8 Some pairs generate several similar scores, such as “left” and “pot”, others just a single high score 263 Where a fit cannot be found the pair does not receive a score for linking and may be demoted to the benefit of other better fitting pairs. If none of the pairs with good context scores has a good fit, then a rough fit must be made by determining the what part of speech each member of the pair could be and assembling them according to a very rudimentary grammar. Although this approach produces successful results, there are still occasions when we are left with a clue which is one word short of an idiomatic phrase. The key problems which remain in the assembly of the pairs are: Intervening words. Most commonly adjectives are the culprit. Not only does the intervening adjective deprive the software of data about the conjunctive phrase and the noun, it also results in phrases such as “in the yellow” and “under the long” seeming acceptable. Cleaning out such data without losing idioms such as “in the red” and “in the main” will be difficult, although it may be possible to find the idioms through testing the whole combination for frequency. Frames. As with the previous problem, the result is phrases that feel like they are missing something, such as “for the reason” and “on the assumption”, both of which require the word “that” and a clause to follow. It is difficult to see how to remove these without a considerable processing overhead. Clause boundaries. This problem arises from the indexing of the spoken section of the BNC Sampler. Since the clause boundaries are coded differently, they are not accounted for in my index, and as a result pairs of words crossing word boundaries are recorded as contiguous. For example the word “hand” records “yes” as a following phrase, probably from the text “would you like a hand? Yes … “. This could also cause unusual results in the return of supposedly juxtaposed pairings. Fortunately the solution is not complex, only time-consuming, and will involve refreshing the index to take account of the difference in mark-up. 4. Scoring puns At present the resulting pairs pass through a very straightforward filter to score the puns which examines the thesaurus entry and checks to see if the pair is a combination of words from single or multiple entries of the thesaurus. Homograph puns feature fairly heavily in the resulting clues, in particular in the clues derived from juxtaposed pairs. The present system, although a crude measure, picks up the majority. A precise measure would require a substantial overhead, especially since homograph puns are encouraged by the software not differentiating between homographs in the processing of the synonym pairs. Indeed, for the bulk of completed clues, the software will at no point have investigated what part of speech the composite parts might be. Pairs scoring as puns for the word “fair” included “market average” and “sound common”. Nonscoring pairs included “pearly white” and “white light”. The latter of these is in fact a pun, “light” being suggested as a noun. 
However to determine this the software would have to return to the reference in the corpus and establish that the pairing “white light” which evidenced the juxtaposition was an adjective noun pair, and this requires a substantial overhead. 5. The production of a dialectic clue The following description provides an overview of the whole process of the generation of a dialectic clue with examples of each processing stage, using “reign” as the input word. The input word, “reign”, is used as a key against the thesaurus to generate a list of all the possible pairs of synonyms for the word. Each pair is scored using the index of the BNC Sampler against the baseline ratios for co-occurrence at various levels of scope. This process is refined by comparing the extent to which lists of all co-occurring words at paragraph level scope, third party co-referents, overlap for each pair. The pairs are sorted into ranked order on the co-occurrence and third party co-referent scores. Any pairs found in direct juxtaposition are separated off. 264 Lead weight (direct juxtaposition “lead weight”) Lead capability (direct juxtaposition “capability lead”) Authority pressure Importance authority Authority control Effect authority Figure 9 The top pairings for “reign”. Many of the synonyms are not particularly good, but the pairings share some common context. Data from the BNC Sampler on conjunctive phrases occurring before and after each member of each pair is examined to find a match. The top-scoring match is selected for each pair, and the score recorded. By combining the linking score with the context score, pairs which have less context but link together more idiomatically are promoted. Pressure of authority (23), authority under pressure (22), authority of pressure (20) … Importance of authority (64), importance of the authority (51), authority and importance (28) … Control of authority (57), control of the authority (44), authority to control (40) … Figure 10 The results of combining the top-scoring pairs using link words. The first two pairs (lead weight and capability lead) are not linked as they were found in juxtaposition A filter checks each pair to determine if it could be a homograph pun. Puns are promoted to the top of the list. The software selects the top-scoring punning clue, or the user can select their favourite from the head of a list of formatted clue suggestions. Lead weight (5) Capability lead (5) Authority under pressure (5) Importance of authority (5) Authority to control (5) Figure 11 Sample formatted clues for “reign” using varied link words 6. Evaluation I evaluated an early version of the software by generating dialectic clues for a set of six clue words which had been clued dialectically in broad-sheet crosswords. I selected the top five scoring clues from each clue word, and presented them to a group of crossword enthusiasts in groups with the broadsheet clues mixed in. The enthusiasts then scored the clues for readability without knowing the answer word. I then averaged the scores of the broadsheet clues. For readability the clues from the broadsheets scored 3.4, meaning that on average they were placed between 3rd and 4th place against 5 clues generated by the software, indicating stiff competition from the computer-generated clues on this front. The software-generated clues fared less well when the answers were revealed as they frequently contained good puns made with very poor synonyms. 
Some computer-generated clues were deemed to be of good quality, even when the answer was revealed. These included “market average”, “sound common” and “common market” for “fair”, “dictatorship of the regime” for “reign”, “tax band” for “press” and “separate part” for “isolate”. However, many clues were deemed to be insoluble as the synonyms were too difficult or inappropriate. For example, the clue “Lead weight (5)” was a neat pun, but could not be solved as “weight” is too remote from “reign”. Similarly “Degree of distinction (9)” again read well, but “distinction” for “condition” was felt to be unfair. 265 7. Summary The result of the evaluation suggested that the project has so far been largely successful and has generated some appealing dialectic cryptic crossword clues without human intervention. However, an improved thesaurus is clearly much-needed if clue quality is to be raised. The software identifies apposite pairings without recourse to definitions or tables of meaning, but using the levels of co-occurrence in the BNC Sampler and through indirect co-occurrence by means of lists of ‘third party co-referents'. It identifies appropriate bridging phrases to link pairs together where they are not evidenced in juxtaposition without recourse to syntactic or grammatical guides, but purely through frequency data gleaned from the BNC Sampler. References cited Barnard D. St. P 1963 The Anatomy of the Crossword Bell Hardcastle, D 1999 SphinX Crossword Compiler's Toolkit Unpublished MSc Thesis, Birkbeck College, University of London Mitton R, 1986 “A partial dictionary of English in computer-usable form” Literary and Linguistic Computing 1(4): 214-5 Mitton, R, 1996, English Spelling and the Computer, Longman, Appendix 2 266 Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University 1. Introduction This paper describes research in using the Brill tagger (Brill 94,95) to learn to identify incorrect commas in Danish. Trained on a part-of-speech tagged corpus of 600,000 words, the system identifies incorrect commas with a precision of 91% and a recall of 77%. The system was developed by randomly inserting commas in a text, which were tagged as incorrect, while the original commas were tagged as correct. Then the tagger was trained to recognize the contexts in which incorrect commas occur. In what follows, we first describe the corpora and tag sets used in this research, and give background on the Brill Tagger. We then describe the methodology for learning to identify comma errors, and then we examine some of the principles that the system learned to identify comma errors. Finally, test results are presented, and we discuss plans for future research. The method used here is quite general, and could be applied fairly directly to a wide range of grammar checking problems, in Danish or other languages. 2. Background · Corpora and tag sets This research uses two Danish corpora: the Parole corpus (Parole 1998) and the Bergenholtz corpus (Bergenholtz 1988). The Brill tagger was trained on the manually tagged Parole corpus to recognize Danish part of speech tags. The Danish Parole tag set consists of 151 distinct Tags, containing information such as syntactic category, number, gender, case, tense and so on. As described below, we have used a reduced version of the Danish Parole tag set for the current project. · Brill tagger The Brill tagger learns by first tagging raw text with an Initial State Tagger, which tags words with their most frequent tag. 
The resulting file is termed Dummy, and is compared to a file called Truth, which has been manually tagged, and is thus assumed to be completely correct. 1 The system Contextual-Rule-Learn searches for transformations that can be used to make Dummy more closely resemble Truth. The system searches among transformations that instantiate the following templates: Change tag a to tag b when: 1. The preceding (following) word is tagged z. 2. The word two before (after) is tagged z. 3. One of the two preceding (following) words is tagged z. 4. One of the three preceding (following) words is tagged z. 5. The preceding word is tagged z and the following word is tagged w. 6. The preceding (following) word is tagged z and the word two before (after) is tagged w. Learning proceeds iteratively as follows: Contextual-Rule-Learn tries every instantiation of the transformations templates, and finds the transformation that results in the greatest error reduction. (See Fig. 1.) This transformation is output to the Context Rules list, and the transformation is applied to Dummy. The process continues until no transformation results in an improvement above a preset 1 We ignore the lexical rules, which are learned in a separate phase. These are not relevant to the present study. See Brill 94, 95 for details. 267 threshold. The tagger can then be run with rules that determine part of speech tagging for Danish, based on the Danish Parole Corpus. We term this the Base Tagger. Fig 1. Training the Base Tagger 3. Training the comma checker We produced the training file by tagging 600,000 words of text from the Bergenholtz corpus, using the Base Tagger. We converted the tags to the Reduced Parole Tag Set. This was done to facilitate the learning of generalizations such as “no comma between a preposition and a noun”. In the original tag set, there are 23 tags for common nouns, because of differences in number, gender, etc. In the reduced tag set, there are just two: N (common noun), and N_GEN (genitive noun). Other categories have similarly reduced numbers of tags. To use this tagged file as the training corpus for developing a comma checking system, we make the simplifying assumption that all existing commas are correct, and that no additional commas would be correct. Thus all existing commas in the training corpus are given a new tag, GC (good comma). Next, two copies of the training corpus are created, Truth and Dummy. In each of these, commas are inserted at random positions (in the same positions in each file). The inserted commas are labeled BC in Truth, and GC in Dummy file. Thus the only differences between the two files are that the randomly inserted commas are tagged with BC in Truth and GC in Dummy. Then, Contextual-Rule-Learn is run on these two files. The result is an ordered list of Error Context Rules for commas. (See Fig. 2.) Raw Text: PAROLE Corpus Initial State Tagger Dummy Contextual-Rule-Learn POS Context Rules Truth: Tagged PAROLE Corpus 268 Fig 2. Learning Comma Context Rules Thus what the system learns is contexts in which a comma's tag should be changed from GC to BC, and in this way marked as an error. The list of such contexts is produced by the learner as an ordered list of rules, specifying when the comma tag should be changed. It is important to note that these rules are ordered, so that a decision specified by a rule early on the list will sometimes be reversed by a rule later on the list. In all, 166 Error Context Rules for commas were produced. The first 12 rules are shown below: 1. 
GC -> BC if one of the three following tags is End-of-sentence
2. GC -> BC if one of the two previous tags is Beginning-of-sentence
3. GC -> BC if the next tag is Preposition
4. GC -> BC if one of the two following tags is Verb(Infinitive)
5. GC -> BC if the previous tag is Conjunction
6. BC -> GC if the previous tag is Interjection
7. GC -> BC if one of the two previous tags is Subordinating Conjunction
8. GC -> BC if the previous tag is Preposition and the following tag is N
9. GC -> BC if the previous tag is Pronoun and the following tag is N
10. GC -> BC if the previous tag is Verb(past) and the following tag is Pronoun(personal)
11. BC -> GC if one of the next two tags is Subordinating Conjunction
12. GC -> BC if the previous word is er (is)

The first two rules state that a comma is marked bad (“BC”) if it is within 3 words of the end of a sentence, or within 2 words of the beginning of the sentence. These rules were learned because there were comparatively few correct commas in these environments in the Truth file, and a large number of incorrect commas in these environments. However, the system soon learns that these rules are overly general. For example, the sixth rule states that a comma is correct if preceded by an interjection. This occurs typically near the beginning or end of a sentence, as in the following example from the training corpus:

Naa/INTERJ ,/GC I/PRON_PERS sidder/V_PRES stadig/RGU og/CC hygger/V_PRES jer/PRON_PERS ./XP
‘Well, you sit still and enjoy yourselves.’

Rule 8 doesn't permit commas between prepositions and nouns, and Rule 7 doesn't permit commas near the beginning of a subordinate clause. This is related to the fact that a comma typically introduces a subordinate clause in Danish. This fact is partially captured in Rule 11, which permits commas just before subordinating conjunctions. Rule 9 disallows commas between a Pronoun and a Noun. In the Parole corpus, there is no category for Determiner, and words like the and a are tagged as pronouns.

4. The resulting system

We build a system that corrects commas in raw text, based on the rules learned above. Text is first tagged by the Base Tagger, and then all commas are tagged GC. Next the Comma Corrector is executed – this is the tagger run with the Comma Error Rules. In the output, any incorrect commas are tagged with BC. (See Figure 3.) Here is a sample run of the system, with different comma positions in the (constructed) sentence Det er godt, at du kom (It is good, that you came):

Input: Det er godt, at du kom.    Output: Det er godt , at du kom .
Input: Det er godt at, du kom.    Output: Det er godt at ,/BC du kom .
Input: Det er godt at du, kom.    Output: Det er godt at du ,/BC kom .
Input: Det, er godt at du kom.    Output: Det ,/BC er godt at du kom .
Input: Det er, godt at du kom.    Output: Det er ,/BC godt at du kom .

Of the five different comma positions, only the first is correct in Danish.2 The system correctly labels all the other alternatives as incorrect (BC).

2 This is in fact not entirely clear, since there are at least two distinct systems for placing commas in Danish, and this position may not be considered correct in one of the two systems. While it is difficult to get confident judgments, all my Danish informants agree that this example has only one possible comma position, which is the one accepted by the comma correction system.

Fig. 3 Comma Correction System
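The ordered application of such learned rules can be pictured with a small sketch. It reimplements only rules 1, 3 and 6 from the list above; the tag names partly follow the examples in the paper (INTERJ, SP, XP) and are otherwise assumptions, and this is not the Brill-tagger machinery actually used.

```python
# Minimal sketch of applying ordered comma context rules (GC -> BC / BC -> GC).
# Each token is a (word, tag) pair; commas initially carry the tag "GC".
# Rules run in learned order, so a later rule can reverse an earlier decision,
# as described for the interjection rule (rule 6) above.

END = "XP"  # assumed end-of-sentence tag, following the tagged example above

def rule_1(tags, i):   # GC -> BC if one of the three following tags is end-of-sentence
    return ("GC", "BC") if END in tags[i + 1:i + 4] else None

def rule_3(tags, i):   # GC -> BC if the next tag is a preposition (SP)
    return ("GC", "BC") if i + 1 < len(tags) and tags[i + 1] == "SP" else None

def rule_6(tags, i):   # BC -> GC if the previous tag is an interjection
    return ("BC", "GC") if i > 0 and tags[i - 1] == "INTERJ" else None

RULES = [rule_1, rule_3, rule_6]   # the real system learned 166 such rules

def correct_commas(tokens):
    tokens = [(w, "GC") if w == "," else (w, t) for w, t in tokens]
    for rule in RULES:                       # one pass over the text per rule
        tags = [t for _, t in tokens]        # snapshot of tags for this pass
        for i, (w, t) in enumerate(tokens):
            change = rule(tags, i)
            if change and w == "," and t == change[0]:
                tokens[i] = (w, change[1])
    return tokens

if __name__ == "__main__":
    good = [("Det", "PRON"), ("er", "V_PRES"), ("godt", "ADJ"), (",", ","),
            ("at", "CS"), ("du", "PRON_PERS"), ("kom", "V_PAST"), (".", END)]
    bad = [("Det", "PRON"), ("er", "V_PRES"), ("godt", "ADJ"), ("at", "CS"),
           (",", ","), ("du", "PRON_PERS"), ("kom", "V_PAST"), (".", END)]
    print(correct_commas(good))   # comma keeps the tag GC
    print(correct_commas(bad))    # comma is relabelled BC by rule 1
```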
5. Empirical results

The system was tested on a file of text from the Bergenholtz corpus distinct from the training data, containing 14,044 words. The file contains 869 commas. 389 additional commas were introduced in random positions, as errors. The system marked 327 commas as errors, of which 299 actually were errors. This gives a precision of 91.4% and a recall of 76.9%. Here is a list of the first 10 examples where the system incorrectly marked a comma as an error:

1. Hulgaard/EGEN ,/BC Århus/EGEN
2. mener/VPRES ,/BC vi/PRONPERS
3. mener/VPRES ,/BC han/PRONPERS
4. menneskemassen/VPRES ,/BC der/UNIK
5. 17-13/NUM ,/BC Norris-Paulsen/N
6. morderiske/VPRES ,/BC psykopatiske/VINF
7. Sørensen/EGEN ,/BC Århus/EGEN
8. nabokommunen/N ,/BC på/SP
9. systemet/N ,/BC kan/VPRES
10. de/PRONDEMO aktive/ADJ ,/BC servicefunktionerne/N

In items 1 and 7 a line break was incorrectly placed immediately before the text in question. Items 4 and 6 involve mis-tagging: menneskemassen (“mass of people”) and morderiske (“murderous”) are both nouns, mis-tagged as verbs. Item 10 is an interesting case: De aktive, servicefunktionerne (the active, service workers). The comma is marked as incorrect because of the following rule:

GC -> BC if the previous tag is ADJ and the next tag is N

This is normally correct; commas don't tend to appear between an ADJ and an N. Here, however, “the active” is a complete NP, on a par with, e.g., “the rich”, and “service workers” is a separate NP.

Total number of commas: 1258
Incorrect commas: 389
Total system corrections: 327
Valid system corrections: 299
Precision: 91.4% (299/327)
Recall: 76.9% (299/389)
Table 1. Results

6. Discussion and further work

The system was developed using the transformation-based learning system of the Brill tagger. This learning system is limited in various ways: for example, only three words or tags before or after a position are examined. It is likely that certain patterns involving commas could be learned if that locality restriction were loosened. Furthermore, the learning system of the Brill tagger maximizes overall success rate, using a greedy strategy. We believe precision is a more relevant measure in grammar checking problems. Thus it would be interesting to modify the learner so that it optimizes precision or some related measure, and we suspect that greedy learning may be problematic in this case. We are contemplating various experiments related to these issues. It is also possible that the precision and recall of the system would be substantially increased with a larger training corpus. Work is proceeding on this. Finally, we plan to apply similar techniques to a wide variety of grammar problems, both in Danish and other languages.

References
Bergenholtz, H 1988 Et korpus med dansk almensprog. Hermes.
Brill, E 1994 Some advances in rule-based part of speech tagging. In Proceedings of the Twelfth National Conference on Artificial Intelligence (AAAI-94), Seattle, WA.
Brill, E 1995 Transformation-based error-driven learning and natural language processing: a case study in part of speech tagging. Computational Linguistics 21(4).
Golding, A R and Schabes, Y 1996 Combining trigram-based and feature-based methods for context-sensitive spelling correction. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics.
Jacobsen, H G and Jørgensen, P S 1991 Politikens Håndbog i Nudansk. Politikens Forlag.
Parole.
1998. http://coco.ihu.ku.dk/~parole/par_eng.htm 272 Tracking lexical change in present-day English Raymond Hickey Essen University For several centuries English has been well-known for frequent cases of conversion (word-class change without any formal alteration). In recent decades a further development can be observed which for the want of a better word could be termed univerbation. By this is meant that structures consisting of several words are reduced to one, as when a verbal phrase is compacted to a single word, e.g. we spent the night in Vienna -> we overnighted in Vienna. My contention is that such cases illustrate a process which is part of a long-term typological shift in English. The latter is what has been observed in the shift from a morphologically complex to an inflectionally simplified language and is conventionally referred to as a move from synthetic to analytic. The current process can be viewed as a later stage in an analytic language where lexical compaction is in evidence and can thus be interpreted as part of a typological cycle. A side-effect of this compaction is that the subcategorisation rules for existing verbs can be altered (usually expanded to include a new type) as in This door is fitted with an alarm -> This door is alarmed, i.e. the valency of alarm has been altered to allow for both animate and inanimate objects. The paper will look at several cases illustrating the matters just alluded to and comment on the theoretical ramifications for the structure of English. 273 Orality and noun phrase complexity: a corpus-based study of British and Kenyan writing in English Diana Hudson-Ettle*, Tore Nilsson° and Sabine Reich* *Department of English Linguistics, Chemnitz University of Technology, Reichenhainer Str. 39, 0917 CHEMNITZ, Germany ° Department of English, Uppsala University, Box 527, SE-751 20 UPPSALA, Sweden. diana.hudson-ettle@t-online.de, toren@bahnhof.se, sabine.reich@epost.de According to previous research (Biber 1988, Chafe 1982, Halliday 1989), both spoken and written texts display evidence of orality such as lack of planning and elaboration, indicated by the relative frequency of specific linguistic features. On the basis of this evidence, we examine selected text types to determine their position on a cline of orality with a view to investigating whether noun phrase complexity in terms of adjectival premodification can also be said to be sensitive to this concept of orality. Our first hypothesis is that noun phrases in texts showing more oral features are characterized by a relatively lower degree of adjectival premodification. Our data consist of English texts from Britain and Kenya. The first step of the study focusses on three genres from the Kenyan subcorpus of ICE-EA (The East African component of The International Corpus of English compiled at Chemnitz University of Technology). These genres are social letters, reportage feature articles and institutional editorials. The aim of this first step is to establish the position of each text type on the orality cline. In the second step our attention is focussed on reportage feature articles alone. Results obtained in the first step are seen in comparison with those gained from analyses of the same text type taken from two corpora of British English, UPC (the Uppsala Press Corpus) and a corpus of British Tourism and Travel Texts, both compiled at Uppsala University. Our second hypothesis is that the texts from Kenya will contain more evidence of orality than the British feature articles. 
References Biber, Douglas (1988) Variation across Speech and Writing. Cambridge: Cambridge University Press. Chafe, Wallace L. (1982) “Integration and involvement in speaking, writing and oral literature”. In Deborah Tannen (ed.), Spoken and Written Language: Exploring Orality and Literacy, 35-54. Norwood, NJ: Ablex. Halliday, M.A.K. (1989) Spoken and Written Language. 2nd edition. Oxford: Oxford University Press. 274 The American National Corpus: A standardized resource for American English Nancy Ide* and Catherine Macleod† *Department of Computer Science Vassar College Poughkeepsie, NY 12604-0520 USA ide@cs.vassar.edu †Computer Science Department New York University New York, New York 10003-6806 USA macleod@cs.nyu.edu 1 Introduction Linguistic research has become heavily reliant on text corpora over the past ten years. Such resources are becoming increasingly available through efforts such as the Linguistic Data Consortium (LDC) in the US and the European Language Resources Association (ELRA) in Europe. However, in the main the corpora that are gathered and distributed through these and other mechanisms consist of texts which can be easily acquired and are available for re-distribution without undue problems of copyright, etc. This practice has resulted in a vast over-representation among available corpora of certain genres, in particular newspaper samples, which comprise the greatest percentage of texts currently available from, for example, the LDC, and which also dominate the training data available for speech recognition purposes. Other available corpora typically consist of technical reports, transcriptions of parliamentary and other proceedings, short telephone conversations, and the like. The upshot of this is that corpusbased natural language processing has relied heavily on language samples representative of usage in a handful of limited and linguistically specialized domains. A corpus is intended to be "a collection of naturally occurring language text, chosen to characterize a state or variety of a language" (Sinclair, 1991). As such, very few of the so-called corpora used in current natural language processing and speech recognition work deserve the name. For English, the only true corpora that are widely available are the Brown Corpus (Kucera and Francis, 1967) and the British National Corpus (BNC) (Leech, 1994). Although it has been extensively used for natural language processing work, the million words of the Brown Corpus are not sufficient for today's largescale applications. For example, for tasks such as word sense disambiguation, many word senses are not represented, or they are represented so sparsely that meaningful statistics cannot be compiled. Similarly, many syntactic structures occur too infrequently to be significant. The Brown Corpus is also far too small to be used for computing the bigram and trigram probabilities that are necessary for training language models used in a variety of applications such as speech recognition. Furthermore, the Brown corpus, while balanced for different written genres, contains no spoken English data. The 100 million words of the BNC provide a large-scale resource and include spoken language data; however, this corpus is not representative of American English and is so far available only within Europe for purposes of research. As a result, there is no adequately large corpus of American English available to North American researchers for use in natural language and speech recognition work. 
To meet the need for a corpus of American English, a proposal was put forward at the 1998 Language Resources and Evaluation Conference (LREC) to create a large, heterogeneous, uniformly annotated corpus of contemporary American English comparable to the BNC (Fillmore, et al., 1998). Over the past two and a half years the project has developed, and a consortium of supporters including American, Japanese, and European dictionary publishers, as well as industry, has been formed to provide initial funding for development of the American National Corpus (ANC). At present, the creation of the ANC is underway, using texts contributed by some consortium members and supported by membership fees. The Linguistic Data Consortium, which will manage and distribute 275 the corpus, and is contributing manpower, software, and expertise to create a first version of the corpus, a portion of which should be ready for use by consortium members at the end of this year. 2 Why we need a corpus of American English There is a need for a corpus of American English that cannot be met by the data in the British National Corpus, due to the significant lexical and syntactic differences between British and American English. Well-known variations are: "at the weekend" (Br.) vs. "on the weekend" (U.S.), "fight (or protest) against <something>" (Br.) vs. "fight (or protest) <something>" (U.S.), "in hospital" (Br.) vs. "in the hospital (U.S.), "Smith, aged 36,…" (Br.) vs. "Smith, age 36…" (U.S.), "Monday to Wednesday inclusive" (Br.) vs. "Monday through Wednesday" (U.S.), "one hundred and one" (Br.) vs. "one hundred one" (U.S.), etc. Also, in British English, collective nouns like "committee", "party", and "police" have either singular or plural agreement of verb, pronouns, and possessives, which is not true of American English. British English often makes use of a to-infinitive complement where American English does not. In the following examples from the BNC “assay”, “engage”, “omit” and “endure” appear with a to-infinitive complement, there were no examples found in a small corpus (comprised of selections from the Brown Corpus, the Wall Street Journal, the San Jose Mercury News, Associated Press, and the Penn Treebank) of this construction, although the verbs themselves did appear. For the first two verbs, one can argue that there is not an equivalent verbal meaning in American English, but, for the last two, the meaning can be paraphrased in American English by the gerund, as shown below. (Note that the British English examples are from the BNC and the American English examples are paraphrases.) Verb Eng. Example sentences assay B.E. Jerome crept to the foot of the steps, and there halted, baulked, rather, like a startled horse, drew hard breath and ASSAYED TO MOUNT, and then suddenly threw up his arms to cover his face, fell on his knees with a lamentable, choking cry, and bowed himself against the stone of the steps. engage B.E. A magnate would ENGAGE TO SERVE with a specified number of men for a particular time in return for wages which were agreed in advance and paid by the Exchequer. omit B.E. “What did you OMIT TO TELL your priest?” A.E. “What did you OMIT TELLING your priest?’ endure B.E. But Carteret's wife, who frequented health spas, could not ENDURE TO LIVE with him or he with her: there were no children. A.E. But Carteret's wife, who frequented health spas, could not ENDURE LIVING with him or he with her: there were no children. 
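A check of this kind can be run mechanically over any POS-tagged corpus. The sketch below is not the procedure the authors used (which is not described in detail); it merely illustrates one way to look for these verbs followed by a to-infinitive complement in a tagged text, using Penn-Treebank-style tags, and its lemma handling is deliberately crude.

```python
# Hypothetical sketch: find "omit/endure/engage/assay + to + infinitive" in a
# POS-tagged corpus represented as a list of (word, tag) pairs (Penn tags).

TARGET_LEMMAS = {"omit", "endure", "engage", "assay"}
VERB_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}

def crude_lemma(word):
    """Very rough stemming, sufficient only for the four verbs of interest."""
    w = word.lower()
    for suffix in ("ted", "ing", "ed", "es", "d", "s"):
        if w.endswith(suffix) and w[: -len(suffix)] in TARGET_LEMMAS:
            return w[: -len(suffix)]
    return w

def to_infinitive_hits(tagged):
    """Yield (verb, matched phrase) for verb + 'to' + base-form-verb sequences."""
    for i, (word, tag) in enumerate(tagged):
        if tag in VERB_TAGS and crude_lemma(word) in TARGET_LEMMAS:
            if (i + 2 < len(tagged)
                    and tagged[i + 1][0].lower() == "to"
                    and tagged[i + 1][1] == "TO"
                    and tagged[i + 2][1] == "VB"):
                yield word, " ".join(w for w, _ in tagged[i:i + 3])

if __name__ == "__main__":
    sample = [("What", "WP"), ("did", "VBD"), ("you", "PRP"),
              ("omit", "VB"), ("to", "TO"), ("tell", "VB"),
              ("your", "PRP$"), ("priest", "NN"), ("?", ".")]
    print(list(to_infinitive_hits(sample)))   # [('omit', 'omit to tell')]
```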
Verb complementation containing prepositions often differs from British English to American English John Algeo (1988) gives a number of examples. In British English, “cater for” and “cater to” both occur, but “cater to” has a pejorative connotation and is less frequent. In American English, only “cater to” is used and is not considered pejorative. British English “claim for” contrasts with American English “claim” + NP (claim for benefits vs claim benefits), and, conversely, “agree” + NP is acceptable in British English but not in American English, which demands a preposition such as upon, on, about, or to. Algeo's example of British English “..yet he refused to agree the draw” would be “..yet he refused to agree to a draw” in American English Similarly, the bare infinitive after "insist", "demand", "require", etc. (e.g., "I insist he be here by noon.") is common in American English but rare in British English. Adverbial usage is also different. The British English use of “immediately” in sentence initial position is not allowed in American English For example, British English “Immediately I get home, I will attend to that.” is incorrect in American English, in which one would say “As soon as I get home, I will attend to that.” Other syntactic differences include formation of questions with the main verb “have”. In British English, one can say, “Have you a pen?” where American English speakers must use “do” (“Do you have a pen?”). Support verbs for nominalizations also differ : for example, the British English “take a decision” vs the American English “make a decision”. 276 There are also considerable semantic differences between the two brands of English: in addition to well-known variations such as lorry/truck, pavement/sidewalk, tap/faucet, presently (currently)/soon, autumn/fall, etc., there are numerous examples of more subtle distinctions, for example: "tuition" is not used to cover tuition fees in British English; "surgery" in British English is "doctor's office" in American English; "school" does not include higher education in British English, etc. Usage not only differs but can be misleading, for example, British English uses "sick" for the American "nauseous", whereas "sick" in American English is comparable to "ill" in British English; British "braces" are U.S. "suspenders", while "suspenders" in British English refers to something else entirely. Overall, the distribution of various semantic classes will also distort a British and an American corpus differently, for example, names of national institutions and positions (Whitehall, Parliament, Downing Street, Chancellor of the Exchequer, member of parliament, House of Lords, Royal Family, the queen, senate, president, Department of Agriculture, First Family, and heavy use of the word "state", etc.) and sports (baseball terms will be more frequent in an American corpus, whereas hockey--itself ambiguous between British and American usage--and soccer will predominate in the BNC). Idiomatic expressions also show wide variation between British and American English. Of course, spoken data between the two brands of English are not comparable at all. The above comprise only a few examples, but it should be clear that when a uniquely British corpus is used, such examples skew the representation of lexical and syntactic phenomena. For applications that rely on frequency and distributional information, data derived from samples of British English are virtually unusable. 
The creation of a representative corpus of American English is critical for such applications. 3 Makeup of the ANC Our model for the contents of the ANC includes, ideally, a balanced representation of texts and transcriptions of spoken data, as well as a large component of annotated speech data. In addition, the ANC should include representative samples for different dialects of American English (including Canadian English). Finally, samples of other major languages of North American, especially Spanish and French Canadian, should also comprise a portion of the corpus and, ideally, be aligned to parallel translations in English. However, in view of the high cost of collection and annotation of speech data, creation of a speech component of the ANC has been put off for the near future. Similarly, we are delaying attempts to provide comprehensive and balanced representation of regional dialects and collection of samples of North American Spanish and French. Our goals for the next few years of the project therefore include collection of only textual data and transcriptions of spoken data. The ANC will contain a static component and a dynamic component. The static component will comprise approximately 100 million words and will remain unchanged, thus providing a stable resource for comparison of research results as well as a snapshot of American English at the end of the millennium. This portion of the corpus will be comparable in balance to the BNC; although there is no set definition of "balance" in a corpus, we will follow the BNC criteria in terms of domain and medium1 to enable cross-linguistic studies between British and American English. However, unlike the BNC, the ANC will include primarily contemporary texts (1990 onward). The selection of contemporary texts is important for both lexicography and NLP, particularly in view of the significant changes in language usage over the last few years (e.g., changes brought about by electronic communication). Therefore, the ANC static corpus will overlap with only the second time period of the BNC. A static corpus cannot keep up with current usage, and so the ANC will also include a dynamic component comprised of additional texts added at regular intervals. At present, we plan to add approximately ten- percent new material every five years in a layered organization, thus enabling access to all layers and the static core in chronological order. In this way, we hope to provide the advantages of both a static corpus such as the BNC and a dynamic corpus (e.g., the COBUILD corpus), while at the same time providing a resource for studies of change in American English over time. Beyond the 100 million words comparable to the BNC, the ANC will also include additional texts from a wide range of styles and domains that will be varied rather than balanced; i.e., it will include smaller samples of a greater variety of texts rather than differing percentages of texts according to their representative importance in the language. To some extent the contents of this portion of the corpus 1 See the BNC User's Reference Guide (Burnard, 1995) for details of the criteria for balance in the BNC. 277 will be dictated by availability: we hope to take advantage of the availability of large quantities of contemporary texts such as email, rap music lyrics, etc., as well as to add historically significant novels and other writings. 
Up to now, much NLP research has been focused on newspaper or newswire text, reflecting the availability of common corpora and annotated corpora in these areas. However, other genres are becoming not only common but also available in massive quantities, including unedited electronic data, email, web announcements and discussion groups, technical writing in computer manuals, help files and telegraphic reports. These genres differ in vocabulary, names and "named entity'’ structures (e.g., formulas, addresses, currencies, etc), syntax, lexical semantics, and discourse structure. It has been shown that adapting to genre-specific language can significantly improve analysis performance for syntactic structures and preferences (Sekine, 1997) and for semantic or selectional preferences. A standard multi-genre corpus can foster research on genre adaptation, where some experiments can be conducted on raw text data and others can be effective with small amounts of syntactically-annotated data. 4 Encoding and annotation of the ANC An American Natural Corpus will be most useful if it is more than just a collection of words. The corpora that have become most useful to both publishers and researchers in natural language and speech research have been those which are annotated. The paradigm example of this is the Brown Corpus, which has been the cornerstone of language-related research across disciplines in the United States, indeed in psychology as much as in natural language processing. Tagging of the Brown Corpus has played an essential role across disciplines, both in the original version (Kucera and Francis, 1967), and in the various on-line tagged versions, such as the Penn Treebank version (Marcus et al., 1993). For example, many modern part-of-speech taggers are trained on the Penn Treebank tagged corpus (see, for example, Brill, 1995). Part-of-speech tagged data has been used to automatically acquire subcategorization dictionaries (Manning, 1993), in spell checking, and for applications that require partial parsing, etc. The syntactic parse trees that annotate the Brown Corpus in the Penn Treebank have played a similarly fundamental role in the training and evaluation of parsing systems. The overall plan for development of the ANC is in two broad stages. In the first stage, a "base level" encoding (conformant to a Level 0 encoding as specified by the Corpus Encoding Standard (Ide, 1998a,b)) of the data will be provided, by automatically transducing original printer codes to XCES markup for gross logical structure (title, paragraph, etc.). Header information regarding target audience, text type, etc. will be inserted manually; at this stage, only minimal header information will be provided, based on the headers used in the BNC. This will allow us to test the applicability of the basic BNC header to our corpus at an early stage and give us the opportunity to tune it to the needs of the ANC in the final version. The base level encoding and annotation will be performed by the Linguistic Data Consortium at the University of Pennsylvania (Penn). All texts in the both the base and final versions of the corpus will be marked for major structural divisions, paragraphs, and sentence boundaries, as well as part of speech. The base corpus will be automatically tagged, using the part-of-speech tags of the Penn TreeBank. 
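As a toy illustration of what base-level structural markup of this kind produces, the sketch below wraps plain, paragraph-delimited text in minimal XML with paragraph and sentence elements. The element and attribute names are simplified placeholders chosen for readability; they are not the actual XCES vocabulary, and the sentence splitter is deliberately naive.

```python
# Hypothetical sketch of "base level" markup: raw text in, gross logical
# structure (paragraphs, sentence boundaries) out.  Element names are
# illustrative only, not the real XCES tag set.
import re
import xml.etree.ElementTree as ET

def to_base_markup(raw_text, doc_id="doc1"):
    doc = ET.Element("text", id=doc_id)
    paragraphs = [p for p in raw_text.split("\n\n") if p.strip()]
    for p_num, para in enumerate(paragraphs, start=1):
        p_el = ET.SubElement(doc, "p", id=f"{doc_id}.p{p_num}")
        # Naive sentence segmentation: split after ., ! or ? followed by space.
        for s_num, sent in enumerate(re.split(r"(?<=[.!?])\s+", para.strip()), start=1):
            s_el = ET.SubElement(p_el, "s", id=f"{doc_id}.p{p_num}.s{s_num}")
            s_el.text = sent
    return doc

if __name__ == "__main__":
    sample = ("The ANC is under construction. It will hold 100 million words.\n\n"
              "A second paragraph follows.")
    print(ET.tostring(to_base_markup(sample), encoding="unicode"))
```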
At this stage, only spot checking of the data will be done; the object of this step is to harmonize the data to the extent possible using only automated means, thus avoiding the time and cost of hand-work. The resulting base level corpus should be sufficient for many needs (in particular, those of dictionary publishers), such as concordance generation. Software for viewing and analyzing the corpus data in this format will be made available to consortium members along with the data, although the data will also be available separately. The second stage of development will be undertaken in parallel with the first, but the exact time-frame of the work is dependent on funding. In this stage, the corpus will be produced in its "final" form, with the goals of (1) marking as much information in the ANC as possible while providing for maximal search and retrieval capability, and (2) providing a "gold standard" corpus, consisting of some portion (possibly 10%) of the entire ANC, for use in natural language processing work for training, etc. In the final corpus, annotation for linguistic phenomena (part of speech, syntactic annotation, etc.) will follow de facto standards such as those established by EAGLES.2 2 It will be necessary to use a larger tagset than the Penn set in the final corpus. The Penn set was designed to be used with a corpus that was parsed, not merely tagged, and hence eliminates information that is only recoverable from a parsed corpus, such as the distinction between prepositions and subordinating conjunctions. These were 278 Encoding of the ANC will also adhere to international standards. The corpus will be encoded according to the specifications of the eXtensible Markup Language (XML) (Bray, et al., 1998) version of the Corpus Encoding Standard (XCES)3 (Ide, et al., 2000), part of the Guidelines developed by the Expert Advisory Group on Language Engineering Standards (EAGLES).4 XCES was developed expressly to serve the needs of corpus-based work in language engineering applications by providing XML encoding conventions for various linguistic phenomena in text and speech, as well as several types of linguistic annotation. Because XCES is an XML application, its use for encoding the ANC guarantees that access to and use of the corpus will be supported by tools and mechanisms designed for data delivered via the World Wide Web. The XCES specifies a flexible document structure that provides for "layering" annotation and related documents that may be added incrementally at later stages. The separation of linguistic annotation in distinct documents facilitates retrieval from different annotations (including variants of the same kind of annotation--e.g., part of speech analysis by several taggers). This strategy of “remote” or “stand-off” markup is well-suited to the XML environment, due especially to the development within the XML framework of the Extensible Style Language (XSL). XSL provides a powerful transformation language (Clark, 1999) that can be used to create new XML documents from one or several others by selecting, rearranging, modifying and adding information to it. Thus, a user of a corpus encoded following the XCES model need not be aware of the underlying document architecture, and will see only a new document containing all and only the information he or she is interested in, in any desired configuration and encoding. 
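The stand-off idea itself can be pictured with a small sketch: the primary document carries only tokens with identifiers, each annotation layer lives in its own document pointing at those identifiers, and a merged view is generated on demand. In the ANC this role is played by XSLT over XCES documents; the XML vocabulary and merging code below are assumptions made purely for illustration.

```python
# Hypothetical sketch of stand-off ("remote") annotation: part-of-speech tags
# are stored in a separate document that refers to token ids in the primary
# text, and a merged view is produced only when a user asks for one.
import xml.etree.ElementTree as ET

PRIMARY = """<text>
  <s id="s1">
    <tok id="s1.t1">The</tok><tok id="s1.t2">corpus</tok><tok id="s1.t3">grows</tok>
  </s>
</text>"""

POS_LAYER = """<annotation type="pos" target="primary">
  <ann ref="s1.t1" value="DT"/><ann ref="s1.t2" value="NN"/><ann ref="s1.t3" value="VBZ"/>
</annotation>"""

def merged_view(primary_xml, layer_xml):
    """Return a new document with the annotation layer folded into the text."""
    text = ET.fromstring(primary_xml)
    layer = {a.get("ref"): a.get("value") for a in ET.fromstring(layer_xml)}
    for tok in text.iter("tok"):
        if tok.get("id") in layer:
            tok.set("pos", layer[tok.get("id")])   # add a pos attribute to the view
    return text

if __name__ == "__main__":
    print(ET.tostring(merged_view(PRIMARY, POS_LAYER), encoding="unicode"))
```

The design point is that the primary text is never modified: competing annotations (for example, output from several taggers) can coexist as separate layers, and a reader combines only the layers of interest.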
This will enable use of the ANC for a potentially limitless set of applications, including not only computational linguistics research but also education, etc. The XCES architecture also provides for distribution of development and enhancement, by enabling different sites to develop separate documents containing annotations for the primary ANC data, all of which are ultimately linked together and retrievable as a hyper-document. In the second stage of ANC development, at least the following tasks will be undertaken: · Validation and refinement of existing markup, e.g., changing paragraph markers to more precise tags such as list, quote, etc., marking highlighted words for function (e.g., foreign word, emphasis, etc.); · Provision of a full XCES-compliant header, including a full description of provenance and all encoding formats utilized in the document. In this phase we will correct, where necessary, categories and other header information drawn from the BNC in the first phase, and substantially add to it; · Insertion of additional markup for sub-paragraph elements, such as tokens, names, dates, numbers, etc. Identification of these elements will, to the extent possible, be done automatically; · Hand validation of markup for sub-paragraph elements, including sentence, token, names, dates, etc., in the "gold standard" portion of the corpus; · Transduction of part-of-speech markup to XCES specifications, and possible transduction of annotation categories to a standard scheme such as the EAGLES morpho-syntactic categories (Monachini and Calzolari, 1996); · Hand-validation of part-of-speech tags in the "gold standard" portion of the corpus; · Implementation of the layered data architecture for annotations; · Adaptation and/or development of search and retrieval software, together with development of XSLT scripts for common tasks such as concordance generation, etc. 5 The ANC consortium Founding consortium members contribute US$21,000 over 3 years in annual installments of $7,000, which will be used to support the development of the base level corpus. In addition, publishers and other members are expected to provide contributions of data for inclusion in the corpus. Consortium combined into the single tag IN in the Penn tagset, since the tree-structure of the sentence disambiguated them (subordinating conjunctions always precede clauses, prepositions precede noun phrases or prepositional phrases). 3 See http://www.cs.vassar.edu/XCES. 4 http://www.ilc.pi.cnr.it/EAGLES/home.html 279 members who join after March 31, 2001, contribute $40,000 in two annual installments. Consortium members receive the data as soon as it is processed and have exclusive commercial rights to it for a period of five years after the date of the first release of data, currently anticipated to begin at the end of this year. Current consortium members are: · Pearson Education · Random House Publishers · Langenscheidt Publishing Group · Harper Collins Publishers · Cambridge University Press · LexiQuest · Microsoft Corporation · Shogakukan,Inc. · Associated Liberal Creators Press · Taishukan Publishers · Oxford University Press · Kenkyusha Publishers · International Business Machines Corporation All ANC data will be freely available to non-profit educational and research organizations from the outset (aside from a nominal fee for licensing and distribution). 
There will be no restrictions on obtaining the corpus based on geographical location; restrictions on the distribution of the BNC, which has so far been unavailable outside the European Union, have limited large-scale and comparative research based on the corpus. We hope to encourage comparative research by providing global access. The Linguistic Data Consortium will obtain licenses from text providers and provide licenses to users. In general, the license will prohibit redistribution of the corpus and the publication or similar use of substantial portions of text drawn from the corpus without the permission of its original publisher. For dictionary makers, who comprise a large portion of the current consortium membership, usage of short portions of text in published dictionary examples etc. is allowed under legal definitions of "fair use". We also plan to provide for an “open sub-corpus”, licensed to permit redistribution on the model of open-source software. The size of this corpus will be determined by the contributors. Development of the Level 1 corpus and the "gold standard" sub-corpus will necessarily begin later than development of the base-level version, due to the need to secure substantial funding from external sources to support it. In addition, this development requires time for significant planning to ensure that the corpus is maximally usable by a broad range of potential applications and meets the needs of the research and industrial communities. We are currently soliciting input from the research community to feed this development. A meeting on the topic of annotation and encoding formats and data architectures for large corpora was held at last year's ANLP/NAACL conference in Seattle in early May5; another more comprehensive workshop on the same topics was held preceding the LREC conference in Athens in June, 2000 (Broeder, et al., 2000). By taking into account past experience, current and developing technologies, and user needs, we hope to be able to provide a state-of-the-art platform for universal access to the ANC. 6 Summary The ANC, initially proposed at the first LREC in 1998, is now well on its way to realization. Within the year, the first data in its base level representation will be available to the NLP community and consortium members. The final corpus in its fully marked and annotated form should be available within three years. A corpus of contemporary American English is a valuable resource not only for commercial applications and research, but also for educators, students, and the general public. It is also an important historical resource: the corpus will provide a "snapshot" of American English at the turn of the millennium, valuable for linguistic studies in the decades to come. 5 Papers available at http://www.cs.vassar.edu/~ide/ANLP-NAACL2000.html. 280 Acknowledgments The first ANC meeting in Berkeley, California was funded by National Science Foundation grant ISI- 9978422. We would like to thank Sue Atkins, Michael Rundell, and Rob Scriven for their support and for providing information concerning the creation of the BNC. We would also like to acknowledge the contribution of Wendalyn Nichols, Frank Abate, and Yukio Tono, who have been instrumental in obtaining the support of the publishing community. References Algeo J 1988 British and American grammatical differences. International Journal of Lexicography, 1:1-31. Bray T, Paoli J, Sperberg-McQueen CM (eds) 1998 Extensible markup language (XML) Version 1.0. W3C recommendation. 
http://www.w3.org:TR/1998/REC-xml-19980210. Brill E 1995 Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-566. Broeder D, Cunningham H, Ide N, Roy D, Thompson H, Wittenburg P (eds) 2000 Proceedings of the EAGLES/ISLE Workshop on Meta-Descriptions and Annotation Schemas for Multimodal/Multimedia Language Resources and Data Architectures and Software Support for Large Corpora. Paris, European Language Resources Association. Burnard L 1995 British National Corpus: User's reference guide for the British National Corpus. Oxford, Oxford University Computing Service. Clark J (ed) 1999 XSL transformations (XSLT). Version 1.0. W3C recommendation. http://www.w3.org/TR/xslt Clark J (ed) 1999 XSL transformations (XSLT). Version 1.0. W3C recommendation. http://www.w3.org/TR/xslt. Fillmore C, Ide N, Jurafsky D, Macleod C 1998 An American National Corpus: A proposal. In Proceedings of the First Annual Conference on Language Resources and Evaluation, Granada, pp 965-969. Ide N 1998a Encoding linguistic corpora. In Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, pp 9-17. Ide N 1998b Corpus Encoding Standard: SGML guidelines for encoding linguistic corpora. Proceedings of the First International Language Resources and Evaluation Conference, Granada, pp 463-70. Ide N, Bonhomme P, Romary L 2000 XCES: An XML-based standard for linguistic corpora. In Proceedings of the Second Annual Conference on Language Resources and Evaluation, Athens, pp 825-30. Kucera H, Francis W 1967 Computational analysis of present-day American English. Providence, Brown University Press. Leech G, Garside R, Bryant M 1994 CLAWS4: The tagging of the British National Corpus. Proceedings of COLING-94, Nantes, pp 622-628. Manning C 1993 Automatic acquisition of a large subcategorization dictionary from corpora. Proceedings of ACL, Columbus, pp 235-242. Marcus M, Santorini B, Marcinkiewicz 1993 Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2): 313-330. Monachini M, Calzolari N. 1996 Synopsis and comparison of morphosyntactic phenomena encoded in lexicons and corpora: A common proposal and applications to European languages. EAGLES report EAG-CLWG-MORPHSYN/R. http://www.ilc.pi.cnr.it/EAGLES96/morphsyn/. Sekine S 1997 The domain dependence of parsing. In Proceedings of the Fifth Conference on Applied Natural Language Processing. Sinclair J 1991 Corpus, concordance, and collocation. Oxford, Oxford University Press.  281 The positions of reporting clauses of speech presentation with special reference to the Lancaster Speech, Thought and Writing Presentation Corpus Reiko Ikeo, Lancaster University (rikeo@hotmail.com) Abstract This paper reports the corpus findings on the positions of reporting clauses of speech presentation. In direct speech, the preferred positions of reporting clauses vary according to text types, and the results are compared with those of Biber et al. (1999). In indirect speech, more than 90% of reporting clauses are found in the initial position. Although fronted reported clauses of indirect speech in fictional texts have been treated as an intermediate form between indirect and free indirect speech, the corpus data from news reports shows textual/contextual conditions where reported clauses tend to be fronted. This phenomenon in the press data seems to have different motivations from the one in fiction which often manipulates shift of viewpoints. 1. 
Introduction The report of other people's discourse is at the core of narrative and journalistic discourse. Reporting clauses in discourse presentation are one of the most explicit linguistic devices which introduce other people's discourse in texts. The various positions of reporting clauses have both syntactic and pragmatic importance in discourse presentation. The positions of reporting clauses can affect the syntactic relationship between the reporting and the reported clauses. In direct speech (DS), clear markers of the reported clauses such as quotation marks, the verb tenses and pronoun uses distinguish the reported from reporting clauses even when reporting clauses are placed after the reported clauses. Biber et al. (1999: 196) define the syntactic status of reporting clauses of DS as intermediate between independent and dependent clauses, and this definition does not seem to be affected by the varied positioning of reporting clauses. On the other hand, in the case of indirect speech (IS), the syntactic relation between reported and reporting clauses can be affected by their relative positions. If reported clauses are placed before reporting clauses, a ‘that'-complementiser is compulsorily omitted. Placing reporting clauses in the medial and final positions also allows a question form or exclamatory sentence in the reported clauses. As a result, the reported clauses may gain more syntactic freeness than when they are subordinated to reporting clauses in the initial position. Point of view is another issue which is deeply involved in discourse presentation. In DS, the reported speaker(s)’ point of view is clearly distinguished from that of the reporter or other reported speakers by quotation marks and choices of deixis and verb tense. Fronted reported clauses (i.e. before reporting clauses) can increase the effect of immediacy. In IS, on the other hand, the narrator's point of view intervenes more in the reported clauses and is reflected in the choice of pronouns, lexis and verb tenses. When reported clauses are placed before reporting clauses, the reader might not notice the reporting clause and may at first take the reported clause to be free indirect speech (FIS) or narration before recognising the reporting clauses. This phenomenon has been an interest in the field of stylistics, and several researchers1 have discussed it in relation to FIS in fictional texts. Leech and Short (1981) refer to the effect of inversion in IS and locate the constructions ‘somewhere in between indirect speech and free indirect speech’ (p.333). However, this phenomenon has hardly been examined outside literary texts. One of the purposes of this paper is to examine the inversion of the reporting and reported clauses of IS in journalistic texts. Different modes of presentation of reported clauses seem to have different preferences for the positions of reporting clauses. These differential preferences are for syntactic and pragmatic reasons. The type of texts in which discourse presentation occurs also may affect the positions of reporting clauses. Biber et al. (1999) report from their corpus findings that the final position of reporting clauses of direct discourse presentation is highly favoured in both news and fiction. The source of my data in this paper is the Lancaster Speech, Thought and Writing Presentation Corpus (ST&WP Corpus). 
This corpus of 260,000 words was annotated manually for categories of speech, thought and writing presentation using a tagset which was developed by the Lancaster research team and based on Leech and Short's 1981 model. Biber et al. report their findings based on a sample of 100,000 words from the Longman Spoken and Written English Corpus (the LSWE Corpus). The ST&WP Corpus is 2.6 times larger than the LSWE sample corpus. The ST&WP Corpus has three sections of texts: news, fiction and (auto)biographies. Each of three sections is subdivided into    Reinhart (1975), Banfield (1983) and Fludernik (1993)   282 serious/popular divisions. These divisions make comparisons between serious and popular types of texts possible as well as between different genres. In this paper, I concentrate on speech presentation and make the following general points: (1) The preferred positions of reporting clauses of DS seem to vary according to text types and the serious/popular divisions. (2) Although more than 90 % of reporting clauses of IS are placed before the reported clauses in all three genres, news reports seem to have particular textual/contextual patterns where the reporting clauses of IS are placed either in the middle or at the end of the reported clauses. 2. Reporting clauses of DS There are 2878 reporting clauses in the corpus, and about 94% of reporting clauses are attached either to DS or IS: 53.2% are attached to DS and 40.6% are attached to IS (Table 1). About 6% of reporting clauses accompany other types of speech presentation such as free direct speech (FDS) or FIS. DS (%) IS (%) DS+IS (%) Total (%) 1530 (53.2) 1169 (40.6) 2699 (93.8) 2878 (100) Table 1 The numbers of reporting clauses of DS and IS Table 2 shows that news reports and fiction both have about 600 reporting clauses of DS, and (auto)biographies have 360. Thus the (auto)biography samples have less DS than news reports or fiction. This is probably because biographers tend to lack direct access to the original speech of the people who are involved and the nature of the non-fiction genre discourages the biographer to quote protagonists’ speech as a faithful reproduction. Initial (%) Final (%) Medial (%) Total (%) News 394 (66.4) 197 (33.2) 2 (0.3) 593 (100.0) Fiction 115 (20.0) 404 (70.2) 56 (9.8) 574 (100.0) (Auto)biography 163 (44.9) 156 (43.0) 44 (12.1) 363 (100.0) Table 2 The positions of reporting clauses of DS The preferred positions of reporting clauses for DS vary in the three genres. In news reports, the initial position is more preferred than the final or the medial position. In fiction, on the contrary, the final position is much preferred. In (auto)biographies, reporting clauses are evenly distributed in the initial and the final positions. Reporting clauses in the middle position are hardly found in news reports while in fiction and (auto)biographies about 10% of the reporting clauses are inserted in the middle of the reported clauses. The sample corpus from the LSWE Corpus suggests different distributions of reporting clauses of DS in the three positions. Initial (%) Final (%) Medial (%) News 35 50 10 Fiction 15 65 15 Table 3 The positions of reporting clauses of DS in the LSWE Corpus2 The two corpora show similar results in fiction; reporting clauses in the final position are preferred in both corpora. On the other hand, in news reports, most reporting clauses are found in the initial position in the ST&WP Corpus whereas half of the reporting clauses are found in the final position in the LSWE Corpus. 
While reporting clauses in the medial position are scarcely found in the ST&WP Corpus, 10% of reporting clauses are found in the medial position in the LSWE Corpus. One possible explanation for the different results about the most preferred positions and the wide gap in the percentages of the middle position in the news texts is that the data sources are differently comprised in the two corpora. The news texts in the LSWE Corpus include articles on other topics than news reports such as sports and culture whereas the ST&WP Corpus concentrates on news reports on international/domestic matters. The articles on cultural topics could have different styles from that of typical news reports and may have more reporting clauses in the final and medial positions. Another difference in the news texts in the two corpora is that in the LSWE Corpus 60% of the data is from broadsheets and 40% is from regional papers while, in the ST&WP Corpus, 50% of data is from broadsheets and 50% is from tabloids3. The proportion of tabloid papers as a data source in the ST&WP  2 The figures are based on Table 11.1 of Biber et al. (1999: 923). 3 Broadsheets: Independent on Sunday, Guardian, Independent, Daily Telegraph, and Times. Tabloids: News of the World,  283 Corpus can be higher than in the LSWE Corpus, assuming that regional papers are either more similar to broadsheets or in-between of broadsheets and tabloids. The higher proportion of the tabloids may be one of the factors which raise the percentage of the reported clauses in the initial position in the ST&WP Corpus. As the data from ST&WP Corpus shows in the next section, initial reporting clauses are preferred in tabloid papers compared with the broadsheets. 3. Reporting clauses of DS and the serious/popular divisions From now on, my discussion concentrates on my findings based on the ST&WP Corpus. In this section, DS reporting clauses are examined across text types and their popular/serious divisions. As a general tendency, the popular divisions of all three genres have more DS than the serious divisions, reflecting the style difference which the serious/popular contrast would predict (Table 4). Broadsheets/Serious (%) Tabloids/Popuar (%) Total (%) News 234 (39.5) 359 (60.5) 593 (100.0) Fiction 244 (42.5) 330 (57.5) 574 (100.0) (Auto)biography 81 (22.3) 282 (77.7) 363 (100.0) Table 4 DS in the popular/serious divisions In news reports, the tabloids have roughly half as much DS again compared with the broadsheets. Furthermore, the tabloids tend to have more reporting clauses in the initial position than the other two positions (Table 5). In the DS mode, the reported clause can produce more immediacy than IS and allow a wider range of registers from formal public speech to casual colloquial speech. The initial position of reporting clauses makes the reader's processing of the speech presentation easier than other positions. The tabloid papers may well favour these advantages of DS with reporting clauses in the initial position. In contrast, the broadsheets show no distinct preference for either the initial or the final position. News Initial (%) Final (%) Medial (%) broad tabloid broad tabloid broad tabloid Total No. of DS (%) Broad=234 (100) Tabloid=359 (100) 117 (50.0) 277 (77.2) 117 (50.0) 80 (22.3) 0 (0.0) 2 (0.5) Table 5 The positions of reporting clauses of DS in News In fiction, DS is also preferred in the popular division compared with in the serious division (Table 6). 
As for the positions of reporting clauses, both popular and serious fiction prefer the final position. Since DS in fiction is most often indented as well as accompanied by quotation marks, the reader is probably already aware that a character's speech is being presented from the beginning of the reported clause even if the reporting clause is postponed until the end of the sentence. Such graphological practice in fiction can help the reader's processing. In addition, in fictional contexts the range of possible speakers is much more limited than in journalistic contexts, which often offer an almost unlimited choice of speakers. The reader may have less difficulty in recognising a reported speaker with the reporting clause at the final position in fiction.

Fiction   Initial (%)   Final (%)    Medial (%)   Total No. of DS (%)
Serious   45 (18.4)     174 (71.3)   25 (10.3)    244 (100)
Popular   70 (21.2)     229 (69.4)   31 (9.4)     330 (100)
Table 6 The positions of reporting clauses of DS in Fiction

By placing the reported clauses before the reporting clauses, the narrator's intervention is postponed and the characters' speech can be presented almost like a play script. Fronted DS clauses give vivacity to characters' speech presentation and generate dramatic effects in fictional worlds. In (auto)biographies, there are 363 reporting clauses attached to DS; 282 reporting clauses are found in popular (auto)biographies whereas only 81 reporting clauses are found in serious (auto)biographies (Table 7). Popular (auto)biography does not show a particular preference for either the initial or the final position of reporting clauses. But it does have a higher percentage of reporting clauses in the medial position, compared with the serious (auto)biographies or the other two genres.

(Auto)biography   Initial (%)   Final (%)    Medial (%)   Total No. of DS (%)
Serious           52 (64.2)     26 (32.1)    3 (3.7)      81 (100)
Popular           114 (40.4)    131 (46.5)   40 (13.1)    282 (100)
Table 7 The positions of reporting clauses of DS in (Auto)biographies

Out of 40 examples of reporting clauses in the medial position in the popular division, 35 examples appear in the autobiographies, and 5 in the biographies. In popular autobiographies, the protagonists' speech often embeds another narrative: it can be a story or a joke. The following example is from an autobiography by the actor Michael Caine, to whom a comedian, Eric Sykes, is talking at a party. He is joking that Brigitte Bardot, a famous actress, ignores him because she loves him. (1) 'She is in love with me,' he whispered, 'and she can't stand it. Watch her as she goes by,' (Michael Caine, What's It All About?) The reporting clause comes at the clause boundary in the reported clause. The second reported clause gives a reason why the superstar ignores the comedian, which creates a humorous effect. By inserting the reporting clause before the punch line, the reader's expectation is suspended and the punch line is made more effective. Out of 40 examples, 27 examples have reporting clauses at a clause boundary. Such interruption suspends the onset of the latter part of the speech momentarily, making the reader more attentive to it. The subjects of popular autobiographies are often famous actors, athletes and comedians, whose images and tones of voice the reader is likely to be familiar with.
A casual, colloquial style which reflects the celebrity's image is commonly shared in such autobiographies, and this tendency is especially strong in direct speech. When a reporting clause is inserted in the middle of a reported clause with a colloquial register, the reader gains processing time to imagine the character's speech with more immediacy and to get ready for the following speech, which often requires more inferential work than the first part. On the other hand, since serious (auto)biographies tend to aim for accuracy of record of past events rather than immediacy and drama, specifying the speaker of the following speech presentation at the beginning of the sentence is a reasonable measure to avoid ambiguity over the speakers.

4. Reporting clauses of IS

In IS, the initial position of reporting clauses is a more established, syntactically-determined pattern. More than 90% of reporting clauses are fronted in all three genres (Table 8).

                  Initial (%)   Final (%)   Medial (%)   Total (%)
News              650 (92.9)    42 (6.0)    8 (1.1)      700 (100.0)
Fiction           110 (90.1)    2 (1.7)     9 (7.4)      121 (100.0)
(Auto)biography   327 (94.0)    3 (0.8)     18 (5.2)     348 (100.0)
Table 8 The positions of reporting clauses of IS

A few reporting clauses come after the reported clauses or are inserted in the middle of the reported clauses, which, as we have already noticed, occurs much more commonly in DS. Stylistic accounts (Reinhart 1975, Banfield 1982, Fludernik 1993) regard such inversion of reporting and reported clauses as an intermediary form between IS and FIS and link it with free indirect discourse in literary texts. However, in these discussions other aspects of a character's subjectivity, such as lexis, ejaculative expressions and deixis in the reported clauses, are also taken into consideration. Above all, FIS involves more of the characters' points of view and less intervention from the narrator than IS. It should be further examined whether placing a reported clause before a reporting clause can automatically generate shifts of viewpoint. News reports offer useful data in this respect because reported clauses in news texts tend to contain fewer subjective elements than those of fiction and the inversion of the reporting and the reported clauses can be analysed independently of other elements of FIS. If the inversion itself has an effect which suggests a clearly speaker-oriented perspective, it will push the reported clause toward FIS.

5. Reporting clauses of IS in the final/middle position of news reports

Table 8 shows that reporting clauses of IS in the final position occur most frequently in news reports while fiction has far fewer. This is quite remarkable when we remember that in DS reporting clauses in the final position are most preferred in fiction. The phenomenon looks contradictory at first sight because the fronted reported clause of IS can sometimes be ambiguous with narration, and news reports are supposed to avoid any ambiguities concerning speech presentation. By examining the examples of news reports, three major textual patterns can be found in which the inversion of the reporting and the reported clauses does not necessarily generate ambiguities or involve shifts of point of view.

5.1. The first sentence of the body text

The first major textual pattern is that the sentence appears as the leading sentence of the body text immediately after the headline. As Table 8 indicates, there are 42 examples of IS with reporting clauses after the reported clauses. Out of these 42 examples, 13 (31%) are the first sentence of the body text.
This tendency is the same for both broadsheets and tabloids. Out of these 13 cases, in 12, the contents of the reported clauses are previously announced by the headlines. Consider: (2) <headline> Britain ready to pull troops out of Bosnia <body text> BRITAIN could start withdrawing its troops from Bosnia within weeks if the warring factions reject the latest plans for a settlement, Douglas Hurd, the Foreign Secretary, said yesterday. (The Times, 5/12/94, Britain ready to pull troops out of Bosnia) The headline gives the reader a general idea on the topic of the article; more precisely, the headline summarises the reported clause of the first sentence of the body text. The content of the reported clause gives more detailed information in relation to the main headline. The combination of the headline and the reported clause of the first sentence repeats the general content of the headline and reinforces the impact of information which the article conveys. From the perspective of 'given/new information', the reported clause of the first sentence is textually 'given' since it is previously mentioned in the headline. On the other hand, the reporting clause, which specifies the information source, is 'new'. The construction of the first sentence observes the 'end-weight' principle of informational structure (Quirk et al. 1985: 1365-6) although it can deviate from the normal syntactic structure of IS. It should be noted, however, that 'end-weight' does not necessarily entail 'prominence', or, in more practical terms, 'news value'. In some cases, the sources of information are mentioned in less specific ways. Compare: (3) <headline> 2 BRITS AMONG DEAD IN PLANE HORROR <body text> A BRITISH couple were among the 109 killed in Saturday's Florida jet crash, it was revealed last night. (The Sun, 13/5/96, 2 Brits among dead in plane horror) Such an unusual order of reporting and reported clauses in the first sentence of the body text makes a contrast with the normal clausal order in a similar textual environment. When the reporting clause is at the beginning of the first sentence, the importance of the syntactic subject of the reporting clause is stressed. In the following example, the subject of the reporting clause appears in the headline, and the full name of the speaker is again indicated in the first sentence. Compare: (4) <headline> Blair backs Straw over a new role for the monarchy <body text> TONY BLAIR confirmed yesterday that he intended to make fundamental changes to the role of the Royal Family an issue in the next general election. (The Times, 5/12/94, Blair backs Straw over a new role for the monarchy) There are 23 cases where the speaker of the indirect speech appears in the fronted reporting clause as well as in the headlines. In such cases, the source of information, the speaker, can express the 'topic' and the reported clause can express the 'comment'4. In (4), the 'topic' is Blair, who is already known to the reader, and the 'comment' is what he said. The pragmatic functions of 'topic' and 'comment' explain why famous individuals tend to be found as the syntactic subject of fronted reporting clauses. Out of 23 cases, 18 have personal names of politicians, show business people or royals in the fronted reporting clauses. The inversion of the reported and reporting clauses in the first sentence of a news article seems to have a different motivation from that of fiction. In news reports, the inversion of the two clauses is mainly related to shifts of focus between the speaker, i.e.
the source of information, and the content of speech. 4 Van Dijk (1977: 116) defines a topic as 'some function determining about which item something is being said; it is often associated with what is already known in some context, or what is presupposed.' He refers to the 'comment' as what is 'unknown' and asserted. In the above examples, the positions of reporting clauses are especially affected by the relationship between the headline and the first sentence. On the other hand, in fiction, the issue has more to do with changing points of view and the distance which the narrator intends to take between the character who speaks and him/herself. As a representative example which clearly demonstrates shifts of viewpoint, I will quote a short passage by V. Woolf outside the corpus. There are three reporting clauses in the passage, and two of them are in the final position, one in the middle. (5) (a) 'Do you write many letters, Mr. Tansley?' asked Mrs. Ramsay, pitying him too, Lily supposed; for that was true of Mrs. Ramsay - she pitied men always as if they lacked something - women never, as if they had something. (b) He wrote to his mother; otherwise he did not suppose he wrote one letter a month, said Mr. Tansley, shortly. (Virginia Woolf, To the Lighthouse p.93; alphabetising mine) In (a), the first reporting clause is attached to Mrs. Ramsay's direct speech. The second reporting clause is inserted after an adverbial phrase with a present participle 'pitying him'. This reporting clause suggests that the point of view is suddenly shifted from the narrator to Lily Briscoe, and she takes over the role of telling Mrs. Ramsay's inner state when Mrs. Ramsay passes a conversational turn to Mr. Tansley. The fronted indirect form in (b) suggests both immediacy and Mr. Tansley's reserved manner. The final reporting clause not only specifies the speaker of the preceding speech but also signals whose point of view is presented from now on; a new paragraph which follows the present passage concentrates on reporting Mr. Tansley's thought. In fiction, the position of reporting clauses is one of the devices which can manipulate the characters' or the narrator's points of view. Especially in IS, even if the reported clause has no particular signs which are more attributable to the character than to the narrator, such as deixis or lexis, postposing the reporting clause after the reported clause generates a certain character-bound effect. This kind of rapid viewpoint switching is not usually a feature of news reports. In news reports, the position of the reporting clause seems to depend more on textual organisation and pragmatic factors than on shifting viewpoints.

5.2. Continuous speech presentation by the same speaker

Another textual feature of news reports surrounding a postponed reporting clause is that a fronted reported clause comes immediately after another speech presentation by the same speaker. Sixteen of the 50 cases with inverted reporting clauses are part of a continuous speech presentation by one speaker. All of the 16 cases except one are found in broadsheets. In these cases, the reporters seem to reorder reporting and reported clauses as one of the devices for re-constructing the speaker's speech. In (6), each reporting and reported clause is alphabetised for ease of reference. (6) (a) Mr Major warned yesterday of the dangers of Britain being left behind if a group of European Union members pushed ahead with a single currency.
(b) Nobody could be certain, (c) he said, (d) of the economic impact on the UK. (Independent on Sunday, 11/12/94, Blair puts Labour troops on alert for snap election) In (6), the reporter summarises Mr Major's speech by using the mode of IS in (a). After this summary, the reported speaker's more specific wording is introduced in (b) and (d) in order to back up the previous summary. It is clear that the reporter intentionally chose a particular part out of Mr Major's speech which would most strongly support his/her previous generalisation. By fronting the reported clause, the rhetorical connection between the summary (a) and the specification in (b) and (d) is tightened. In (7), too, the reporter seems to apply a similar strategy. (7) (a) But Mr Lilley said (b) Labour tactics could prompt many ordinary people who would otherwise support Labour to turn to the Conservatives. (c) "I regret very much that they have put the future of the monarchy into the political domain," (d) he said on BBC1's Breakfast with Frost. (e) "But having done so, I think that they risk losing the support of a lot of their voters." (f) While Labour activists were Left-wing, Labour voters were usually "very pro-monarchy, very pro-Britain" and the Conservatives would "vigorously defend" the Queen and the Royal Family. (g) Having abandoned its policies on the economy and education, Labour desperately needed something new to please the Left, (h) he said. (The Daily Telegraph, 5/12/94, Labour in row over Royal role) A generalisation of Mr Lilley's speech is introduced as IS in (a) and (b). The following DS in (c) and (e), FIS in (f) and the inverted reported clause of IS in (g) all not only represent Mr Lilley's specific wording but also explain why Labour tactics could cost Labour potential voters to the Conservatives. These reported clauses, however, may not be presented in the same sequence as that of Mr Lilley's actual speech. The reporter could have rearranged the order of these clauses so that each segment is organised according to the reporter's own logic. Fronting the reported clause (g) seems to be one of the strategic devices for linking the reported clause logically with the reporter's summary of Mr Lilley's speech.

5.3. An introduction of a new speaker at the end

The third pattern of placing reporting clauses in the final or medial position is more context-dependent than affected by textual organisation. Reporting clauses can 'abruptly' appear after fronted reported clauses without any prior introduction of the speaker. There are 11 such cases in the corpus, out of which 10 reporting clauses are at the end. These 'abrupt' reporting clauses are equally distributed across the broadsheets and tabloids. In those examples, the contents of the reported clauses are often additional or secondary to the preceding information, which has more importance or impact. In the following example, the reported clause has its reporting clause at the end. The content of the reported clause (b) has less news value than the preceding information (a), and consequently, the information source specified in (c) is even less important. (8) (a) A powerful bomb ripped through a bus in Pakistan yesterday, killing at least 40 people. Most of the passengers were returning home to celebrate the Muslim festival of Eid al-Adha. (b) A second bomb was found near the wrecked bus and safely defused, (c) the official Associated Press of Pakistan news agency said.
(The Times, 29/4/96, Bomb on crowded bus kills 40 in Pakistan) In (8), no clues that indicate speech presentation are found in (b) after quite a long stretch of narration. The reader will not recognise (b) as a reported clause until he/she reads the reporting clause of (c). A possible reason for the inversion here is that the reporting clause receives the least emphasis because the information of the reported clause is secondary as news. Different degrees of emphasis on the reporting and the reported clauses seem to affect the order of the two clauses. Another aspect of this structure is that the information sources in the reporting clauses are often institutions or groups rather than individuals, such as 'the official Associated Press of Pakistan news agency' and 'the UN'. Out of 13 cases, 6 mention groups and organisations as information sources. Such institutional group speakers/writers tend to be backgrounded because they have less news value. This makes a contrast with the situation where famous individuals such as politicians and celebrities are often found in the fronted reporting clauses in the first sentence of the body text.

6. Conclusion

This research supported our intuitive perceptions about the positions of reporting clauses in direct speech and indirect speech with quantitative data obtained from the ST&WP Corpus: DS has more reporting clauses in the middle and the final positions of the reported clauses compared with IS, while in IS the order of the reporting and the reported clauses is syntactically established and the inversion of the two clauses occurs much less often. The corpus findings also quantitatively suggested that text types are closely related to the preferred positions of reporting clauses in DS. These results were compared with those of the LSWE sample corpus of Biber et al. (1999). In IS, the press data gave a new perspective to the inversion of the reporting and the reported clauses. Although placing the reported clause before the reporting clause can be an unusual syntactic structure, this construction tends to be more affected by textual organisation and often obeys the informational principle of end-weight. Here, not only the quantitative analysis of the corpus data but also the qualitative approach successfully revealed some of the motivations for this construction in news reports, which seem to be different from those in fiction.

References
Banfield A 1982 Unspeakable sentences: narration and representation in the language of fiction. London, Routledge & Kegan Paul.
Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman grammar of spoken and written English. London, Addison Wesley Longman Dictionaries.
Fludernik M 1993 The fictions of language and the languages of fiction: the linguistic representation of speech and consciousness. London, Routledge.
Leech G, Short M 1981 Style in fiction. London, Longman.
Quirk R, Greenbaum S, Leech G, Svartvik J 1985 A comprehensive grammar of the English language. London, Longman.
Reinhart T 1975 Whose main clause?: point of view in sentences with parentheticals. Harvard Studies in Syntax and Semantics 1: 127-71.
van Dijk T 1977 Text and context: explorations in the semantics and pragmatics of discourse. London, Longman.

Texts
Caine M 1992 What's it all about? London, Century.
Woolf V 1992 To the lighthouse. London, Penguin.
Multiple-level knowledge discovery from corpus

Wang JianDe, Chen ZhaoXiong, Huang HeYan
Institute of Computing Technology, Chinese Academy of Sciences
PO Box 2704, Beijing 100081, China
wangjiande@sina.com

Multiple-level knowledge includes knowledge of morphology, syntax, semantics, and style. These four levels of knowledge correspond to the levels of natural language, and thus can be of use in language learning, language engineering, language processing and other areas related to linguistics. Corpora contain large amounts of language data, marked up with a set of attributes; we can derive knowledge of linguistic rules from the annotated corpus. 1) Lexical knowledge can be obtained from statistics derived from the corpus. 2) Knowledge of syntax can also be obtained from corpus data, by means of machine learning algorithms. 3) Semantic rules can be found in documents classified by semantic type, through a statistical model and a semantic learning function. 4) A style model can also be derived from these documents, according to the arrangement of feature words and syntactic structures in the documents and the learning algorithm of the style model. The four levels of knowledge are not independent; they interconnect, and can in some cases be converted into one another. The system is intelligent: information from user operations and from the corpus is stored as a translation resource. With the statistical model and machine learning algorithms, knowledge can be extracted from this resource. The system combines different methods of machine translation, drawing on the advantages of RBMT and EBMT, and uses a database to store human-translated sentences. Translation knowledge is divided into three kinds (public, protected and private), which can be made available to different translators. As the MT system uses this knowledge, it becomes more and more intelligent. The above discovery can be automatic, but it needs much more corpus data, especially annotated data, so we are designing a computer-aided multiple-level knowledge discovery tool.
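Point (1) above, deriving lexical knowledge from corpus statistics, can be illustrated with a minimal sketch. The file name and the word/TAG token format assumed below are hypothetical and do not describe the authors' actual corpus or system.

from collections import Counter, defaultdict

# Minimal sketch: word frequencies and per-word POS-tag distributions from a
# POS-annotated corpus. "annotated_corpus.txt" and the word/TAG token format
# are assumptions made for illustration only.
word_freq = Counter()
tag_dist = defaultdict(Counter)

with open("annotated_corpus.txt", encoding="utf-8") as f:
    for line in f:
        for token in line.split():
            if "/" not in token:
                continue
            word, tag = token.rsplit("/", 1)
            word_freq[word] += 1
            tag_dist[word][tag] += 1

# Report the 20 most frequent words and the tag each is most often given.
for word, freq in word_freq.most_common(20):
    print(word, freq, tag_dist[word].most_common(1)[0][0])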
Corpus approaches to antonymy

Steven Jones
University of Central Lancashire

1. Introduction

The word "antonymy" was coined in 1867 by CJ Smith1 to describe word-pairs - commonly known as "opposites" - such as hot/cold, girl/boy and buy/sell. Some linguists (e.g. Lyons (1977) and Cruse (1986)) apply the term "antonymy" restrictively and would only identify the first of these three pairs as being truly antonymous. However, this seems counter-intuitive in many respects, and this paper will use "antonymy" in its broader sense to refer to all word-pairs which could reasonably be identified as "opposites" by speakers of English. Antonymy is often defined simply as "oppositeness of meaning" (Palmer 1976: 94). However, the problem with an exclusively semantic definition is that it fails to explain, or even acknowledge, the tendency for certain words to become enshrined as "opposites" in language while others do not.
For instance, rich and poor would be regarded as antonyms because they occupy opposite ends of the same scale, namely the scale of wealth. Affluent and broke also occupy opposite ends of this scale, but, intuitively, one would be reluctant to describe them as antonyms. Therefore, antonymy should be defined according to lexical as well as semantic criteria. It is a phenomenon "specific to words rather than concepts" (Justeson & Katz, 1991: 138). This paper will examine some of the ways in which advances in corpus technology enable antonymy to be investigated afresh. Firstly, the categories to which antonymous pairs have been logically assigned will be summarised. Secondly, a set of new, corpus-based categories will be presented. Thirdly, a statistical analysis of co-occurrence rates among antonyms will be offered. And finally, ways of identifying new antonyms, again using corpus data, will be explored.

2. Traditional classes of antonymy

The meanings of antonymous pairs have been logically examined by a number of linguists (e.g. Lyons 1977, Kempson 1977, Cruse 1986, etc.) and antonyms have been classified according to their theoretical differences, perhaps at the expense of their intuitive similarity. Using terminology favoured by Leech (1974), each of the traditional categories of antonymy will now be outlined.

2.1. Binary Taxonomy

The name given by Leech to antonymous pairs such as man/woman, alive/dead and married/unmarried is "binary taxonomy" (1974: 109). Other writers (see Palmer 1972, Jackson 1988, Carter 1987) prefer to speak of "complementarity". Kempson - whose favoured term is "simple binary opposition" - describes examples of Binary Taxonomy as "the true antonyms" (1977: 84). However, this description is particularly confusing in light of the unwillingness of other linguists - namely Cruse (1986) and Lyons (1977) - to acknowledge Binary Taxonomy as a form of antonymy at all. The criterion necessary for an opposition to be considered binary is that the application of one antonym must logically preclude the application of the other. For instance, if X is a smoker, X cannot also be a non-smoker; if X is baptised, X cannot also be unbaptised, and so on.

2.2. Multiple Taxonomy

Multiple Taxonomy - also known as Multiple Incompatibility (Carter 1987: 19) - is a borderline classification of antonymy that refers to pairs such as summer/winter and north/south. In some respects, this category is akin to Binary Taxonomy. The pair male and female, for example, belongs to a two-member system, such that X can never be simultaneously more than one member; solid, liquid and gas, by comparison, belong to a three-member system, such that X can never be simultaneously more than one member; similarly, clubs, diamonds, hearts and spades belong to a four-member system, such that X can never be simultaneously more than one member. And so on. Thus, Multiple Taxonomy may be seen as Binary Taxonomy extended to three or more terms. Whether such examples remain within the boundaries of antonymy is debatable.

1 See Muehleisen (1997) for more details or "Introductory Matter" in Webster's Dictionary of Synonyms (1951: vii-xxxiii).

2.3. Polar Opposition

Polar Opposition differs from Binary Taxonomy because one antonym is not automatically debarred by the other's application. In other words, it is possible to be neither tall nor short in a way that it is not possible to be neither male nor female.
Thus, tall/short are said to be a Polar Opposition, as are the majority of everyday opposites (old/new, cold/hot, wet/dry, etc.). Because Polar Oppositions are not mutually exclusive, they are readily modified (quite happy, extremely happy, fairly happy, etc.) and can take both comparative (happier) and superlative (happiest) form. Indeed, many commentators (e.g. Lyons 1977, Cruse 1986, Jackson 1988) prefer to label such pairs Gradable Antonymy.

2.4. Relative Opposition

An example of Relative Opposition is tenant/landlord. The statement X is the landlord of Y entails and is entailed by Y is the tenant of X. Therefore, landlord and tenant belong to a reciprocal relationship, also reflected by pairs such as teach/learn, buy/sell and above/below. The majority of semanticists label this phenomenon "converseness" and Kempson notes that if the variables X and Y are converse verbs, the statement A X B implies B Y A and the statement A Y B implies B X A. In other words, B precedes A implies that A follows B, and A follows B implies that B precedes A (1977: 85). A fertile area for Relative Opposition is the field of kinship relations. If X is the grandparent of Y, then Y must be the grandchild of X; if X is the husband of Y, then Y must be the wife of X.

2.5. Other Categories

Leech identifies two further categories of antonymy: "hierarchy" (1974: 106) is similar to Lyons's notion of "rank" and describes the relationship between sets of terms such as January/February/March and one/two/three; "inverse opposition" describes pairs such as all/some and remain/become2, which may not otherwise be regarded as "opposites". Other types of antonymy include "orthogonal" and "antipodal" opposition (Lyons, 1977: 286). Orthogonal - meaning perpendicular, at right angles - describes the antonymy holding between the words man, woman, girl and boy. Each of these four words contrasts with two of the other three. So man can be the antonym of boy and woman, but not girl; and boy can be the antonym of girl and man but not woman. An example of an antipodal opposition would involve the terms north, east, south and west. Here, words only contrast in one direction. So north is an antonym of south, but not east or west; and west is an antonym of east but not north or south.

3. New classes of antonymy

The categories outlined above are useful if we wish to look at antonymy from a logical perspective. However, it has now become possible for antonymy to be approached from a corpus-based angle, with new classes being created to describe not what antonyms are, but what antonyms actually do in text.

3.1. Data and Methodology

In order to ascertain and quantify the various textual functions of antonymy, a database of 3,000 sentences was constructed. Each of those sentences was retrieved from a large corpus and features both members of a recognisable antonymous pair. The corpus which I chose to use consists of about 280 million words from The Independent. All stories printed in the newspaper between 1 October 1988 and 31 December 1996 are included in the corpus. Journalistic corpora are suitable for studies of this nature because they are large, genre specific and reflect a natural, modern, non-fictional use of written language. Thus, an overview of how antonymy is used in the field of broadsheet newspaper journalism is possible, although it should be acknowledged that antonymy might be found to function differently in other corpora. Selecting a representative sample of antonymous pairs is more problematic.
It is difficult to imagine a list of antonyms which would not raise a single eyebrow, either because of words included but not considered to be "good opposites", or because of "good opposites" which might be conspicuous in their absence from the list.

2 Leech's reasoning is that some artists have no formal training is synonymous with not all artists have formal training and she did not become a smoker is synonymous with she remained a non-smoker.

Other corpus-based investigations into antonymy, such as Justeson & Katz (1991) and Mettinger (1994), resolved this problem by using an existing index of antonyms. Justeson & Katz chose to make use of the 40 "historically important" (1991: 142) antonyms identified by Deese (1965). This list of antonyms was based entirely on the results of word association tests. Deese took 278 adjectives3 and used them to elicit responses from 100 informants. When a pair of contrast words successfully elicited one another more than any other word, they were added to the list of antonymous pairs, which ultimately numbered 40. Though all antonyms cited by Deese fulfilled this requirement, some passed the test with alarmingly low scores. For example, given the stimulus together, only 6% of informants replied alone; given alone, only 10% replied together. However, this was evidently enough to make these answers more popular than any other, even though the fact remains that a minimum of 84% of informants failed to give either alone as a response to together, or together as a response to alone. Indeed, of the 278 adjectives tested, only one word succeeded in eliciting its antonym on a majority of occasions (left, to which 51% of informants gave right). Therefore, though there remains a strong tendency for informants to provide antonyms as responses to given stimuli in word association tests, it may not be wise to treat Deese's 40 antonyms as being in any sense exhaustive or definitive. A different approach was taken by Mettinger (1994), who used Roget's Thesaurus as his source for antonyms. Created in the middle of the nineteenth century, Roget's Thesaurus attempted to catalogue language, not in alphabetical order, but according to "ideas". This is of relevance to a study of antonymy because Roget chose, where possible, to present these ideas in opposition to one another. Thus, the thesaurus begins by listing words associated with existence, then considers words associated with inexistence. Following next are substantiality and insubstantiality, then intrinsicality and extrinsicality. Neither using the Deese Antonyms nor turning to thesaural listings is ideal. Essentially, one is still dependent on the intuitions of others to identify antonymous pairs. In the case of Roget, these intuitions are 150 years out of date and "contain a number of lexical items that are hardly used in contemporary English" (Mettinger, 1994: 94); in the case of the Deese antonyms, one is dependent on the criteria for antonymy established by 1960s schools of psychology. However, it is impossible to rely on anything other than intuition when it comes to a psycholinguistic phenomenon such as antonymy. No exhaustive list of antonyms will ever be produced because the process which gives a pair of words antonymous status is complex and dynamic. Indeed, this status could only really be gauged by consensus, as definitions of antonymy vary not only from one linguist to the next, but also from one mental lexicon to the next.
With these limitations in mind, I decided that the best approach would be to create a new list of antonyms, customised to meet the demands of this research and relevant to a 21st Century investigation of antonymy: active/passive, advantage/disadvantage, agree/disagree, alive/dead, attack/defend, bad/good, badly/well, begin/end, boom/recession, cold/hot, confirm/deny, correct/incorrect, difficult/easy, directly/indirectly, discourage/encourage, dishonest/honest, disprove/prove, drunk/sober, dry/wet, explicitly/implicitly, fact/fiction, fail/succeed, failure/success, false/true, fast/slow, female/male, feminine/masculine, gay/straight, guilt/innocence, happy/sad, hard/soft, hate/love, heavy/light, high/low, illegal/legal, large/small, long/short, lose/win, major/minor, married/unmarried, new/old, officially/unofficially, old/young, optimism/pessimism, optimistic/pessimistic, peace/war, permanent/temporary, poor/rich, private/public, privately/publicly, punishment/reward, quickly/slowly, right/wrong, rightly/wrongly, rural/urban, strength/weakness.
Table One: Antonymous pairs selected for inclusion in the database

Any native speaker could reasonably be expected to identify the antonym of each of the above 112 words. Most core antonyms are represented and the list also features a number of lower frequency pairs, including antonymous nouns, adverbs and verbs, which previous investigations have been inclined to overlook.

3 Deese's list included words such as above, inside and bottom which function as adjectives less often than they function as other parts of speech.

In total, 2,844 sentences were retrieved at random from the corpus, all of which feature both members of one of the above antonymous pairs. A further 156 sentences were retrieved in which a word co-occurred with an un- version of itself. As un- is the most prolific morphological marker of antonymy in English, this was a useful way to enrich the database with antonymous pairs which, though often less familiar than the 56 pairs selected for analysis, still reflect opposition and are instantly recognisable as antonyms because of their morphology.

3.2. Classifying the database

All 3,000 database sentences have been classified according to their textual function. Eight categories are presented below, in alphabetical order only, together with three illustrative examples.

3.2.1. Ancillary Antonymy

Sentences attributed to this category contain two contrasts: that between the established pair of antonyms (in bold) and that between a pair of words or phrases which would not usually be interpreted contrastively (in italics). Here, it would appear that the antonyms function as lexical signals. They serve an "ancillary" role, helping us to process another, perhaps more important, opposition nearby.
· ind9024: At Worcester on Wednesday, Botham - apart from bowling well - was wandering around in a T-shirt with the message: 'Form is temporary, class is permanent'.
· ind913: Broadly speaking, the community charge was popular with Conservative voters and unpopular with Labour voters.
· ind891: Robin Cook, Labour's health spokesman, demanded: 'How can it be right to limit the hours worked by lorry drivers and airline pilots, but wrong to limit the hours of junior hospital doctors undertaking complex medical treatment?'

3.2.2. Comparative Antonymy

This category is home to sentences in which a pair of antonyms are set up in comparison with one another. This function of antonymy is often expressed by a lexico-syntactic framework such as more X than Y or X is more [adjective] than Y.
· ind891: And it is possible to accept both that Dr Higgs was a lot more right than wrong in her diagnoses, but that it is now impossible for her to return.
· ind923: 'Well,' said Cage, completely unabashed, 'some living composers are more dead than alive'.
· ind903: Training would be based upon rewarding good behaviour, because behaviourists, Skinner argued, had found that reward is more effective than punishment.

3.2.3. Co-ordinated Antonymy

The antonymous pair in each of the examples below is presented in a unified, co-ordinated context. The function of such antonyms is to identify a scale, then exhaust that scale. The contrastive power of each pair remains untapped because their purpose is to express inclusivity. Mostly, antonyms which serve this role are conjoined by and or or.
· ind953: He showed no disloyalty, publicly or privately, to Virginia Bottomley though it must have irked him that she was in the Cabinet and he was not.
· ind921: Whitehall was yesterday unable to confirm or deny other simulated devolutions.
· ind941: Again in debates over genetic research it is significant that Christians, Muslims and Jews have united, implicitly and explicitly, in condemning a low view of the value of embryonic life.

4 The notation gives detail about where and when each sentence was published (newspaper; year; quarter). For example, this sentence was published in the second quarter of 1990 by The Independent.

3.2.4. Distinguished Antonymy

The sentences below refer, in a metalinguistic fashion, to the semantic dissimilarity between antonyms. The framework which houses the antonyms most frequently is n between X and Y, where n is difference or a synonym thereof.
· ind892: But far from that, Mortimer's father had not given him even a basic moral education, such that today he still doesn't know the difference between right and wrong, or so he said.
· ind931: But it made the point that the division between gay and straight is one of many rifts in our society.
· ind884: Mr Craxi's fresh-faced deputy, Claudio Martelli, also dissented, saying that 'one must distinguish between hard and soft drugs'.

3.2.5. Extreme Antonymy

Sentences classified in terms of Extreme Antonymy are similar to Co-ordinated Antonymy examples. The difference is that here a contrast is set up, not between antonyms, but between both ends of a semantic scale, on one hand, and the semantic space in between, on the other. Typical frameworks show antonyms linked by or or and, and premodified by an extremity-signalling adverb such as very or too.
· ind892: No-one can afford to go to law except the very rich and the very poor and it can't possibly get any worse.
· ind903: The advantages are that the track does not need watering, and can be used when conditions are either too dry or too wet for racing on turf.
· ind964: Freud maintained in Civilization and Its Discontents that human beings feel a deep hate and a deep love for civilization.

3.2.6. Idiomatic Antonymy

Many antonymous pairs co-occur as part of a familiar expression, proverb or cliché. Such examples have been assigned to the category of Idiomatic Antonymy.
· ind944: The long and the short of it is that height counts.
· ind893: They evidently knew they could teach this old dog a few new tricks.
· ind892: Whoever said the female of the species was more deadly than the male hadn't met Lord William Whitelaw.
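Many of the eight classes in this section are signalled by recurrent lexico-syntactic frameworks of this kind (more X than Y, both X and Y, n between X and Y, and, in the two categories that follow, X not Y and from X to Y). As a purely illustrative aside, the sketch below shows how sentences containing a known antonym pair could be flagged for some of these frameworks; it is a hypothetical simplification in Python and is not the procedure by which the database sentences in this paper were actually classified.

import re

def frameworks(sentence, x, y):
    """Return rough framework labels for a sentence containing the pair (x, y).
    Illustrative patterns only; real classification would need manual checking."""
    labels = []
    for a, b in [(x, y), (y, x)]:          # either member may come first
        if re.search(rf"\bboth {a} and {b}\b", sentence, re.I):
            labels.append("co-ordinated")
        if re.search(rf"\bmore {a} than {b}\b", sentence, re.I):
            labels.append("comparative")
        if re.search(rf"\bbetween {a} and {b}\b", sentence, re.I):
            labels.append("distinguished")
        if re.search(rf"\bfrom {a} to {b}\b", sentence, re.I):
            labels.append("transitional")
        if re.search(rf"\b{a},? not {b}\b", sentence, re.I):
            labels.append("negated")
    return labels

print(frameworks("Form is temporary, not permanent, they say.", "temporary", "permanent"))
# -> ['negated']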
3.2.7. Negated Antonymy

Arguably the purest function of antonymy, the sentences below each negate one antonym in order to place additional emphasis on the other or to identify a rejected alternative. The most common framework for this class is X not Y.
· ind884: Well, without the combination of an arms race and a network of treaties designed for war, not peace, it would not have started.
· ind912: Democracy means more than the right to pursue one's own self-interest - government must play an active, not passive, role in addressing the problems of the day.
· ind893: However, the citizen pays for services to work well, not badly.

3.2.8. Transitional Antonymy

The function of antonyms belonging to this category is to help describe a movement from one state to another. This transition is usually expressed by a framework such as from X to Y or hinges around the verb to turn.
· ind923: Her film career similarly has lurched from success to failure, with enormous periods out of work.
· ind934: The atmosphere of the negotiations was tense, discussion uneven, the mood in both camps swung from optimism to pessimism.
· ind923: Inflation is a tax which redistributes wealth to the sophisticated from the unsophisticated.

3.3. Frequency of New Classes

The eight classes outlined above collectively account for 2,894 of the 3,000 database sentences. The remaining 3.5% of contexts demonstrate that antonyms can also function in unusual, sometimes innovative, ways about which it is difficult to generalise. The table below shows the statistical distribution of all database sentences.

                          ancill  co-or  comp  distin  trans  negat  extre  idiom  other  total
active/passive            53      14     9     6       6      6      -      -      2      96
advantage/disadvantage    15      14     2     -       4      1      -      -      -      36
agree/disagree            26      17     3     -       -      -      -      -      3      49
alive/dead                16      26     9     1       -      1      -      -      1      54
attack/defend             10      15     3     -       -      2      -      -      -      30
bad/good                  55      47     4     4       3      1      -      2      1      117
badly/well                31      15     4     1       -      2      -      -      -      53
begin/end                 24      23     3     -       -      1      -      -      -      51
boom/recession            12      3      4     -       5      -      -      -      -      24
cold/hot                  21      23     -     -       2      -      2      11     -      59
confirm/deny              -       34     -     -       -      -      -      -      -      34
correct/incorrect         6       11     -     1       -      -      -      -      -      18
difficult/easy            19      5      -     -       -      -      1      -      2      27
directly/indirectly       21      57     1     -       -      -      -      -      -      79
discourage/encourage      16      8      2     -       -      2      -      -      -      28
dishonest/honest          8       4      -     -       -      -      -      -      -      12
disprove/prove            -       14     -     -       -      -      -      -      -      14
drunk/sober               8       7      -     -       1      1      -      -      1      18
dry/wet                   11      9      3     1       3      -      4      -      -      31
explicitly/implicitly     6       19     2     -       -      3      -      -      -      30
fact/fiction              5       5      2     11      2      4      1      -      6      36
fail/succeed              30      27     5     -       -      1      -      -      -      63
failure/success           38      20     10    12      1      6      1      -      -      88
false/true                10      34     3     11      -      1      1      -      2      62
fast/slow                 17      7      2     1       -      -      -      -      1      28
female/male               23      43     1     4       1      -      -      2      13     87
feminine/masculine        37      18     2     3       -      -      1      -      7      68
gay/straight              3       20     7     1       1      1      -      -      -      33
guilt/innocence           5       27     3     5       1      2      -      -      1      44
happy/sad                 22      17     2     -       2      -      2      -      -      45
hard/soft                 17      3      2     1       3      2      3      -      1      32
hate/love                 40      44     7     2       1      2      2      -      6      104
heavy/light               46      19     5     1       4      -      2      -      -      77
high/low                  20      3      2     3       1      1      1      1      -      32
illegal/legal             10      17     1     -       3      -      -      -      -      31
large/small               17      23     4     2       2      -      2      -      -      50
long/short                22      7      4     1       -      1      -      1      -      36
lose/win                  27      25     5     -       -      1      -      -      -      58
major/minor               11      9      -     3       4      -      -      -      -      27
married/unmarried         4       14     8     5       -      -      -      -      -      31
new/old                   81      76     21    19      10     3      1      6      37     254
officially/unofficially   14      10     1     -       -      -      -      -      -      25
old/young                 20      34     6     5       -      1      3      -      -      69
optimistic/pessimistic    30      12     3     -       1      -      1      -      -      47
optimism/pessimism        9       1      2     -       6      1      1      -      1      21
peace/war                 3       5      1     1       1      2      -      -      2      15
permanent/temporary       6       12     5     1       3      1      -      -      -      28
poor/rich                 46      16     6     24      1      -      5      -      4      102
private/public            36      68     6     13      5      2      -      -      4      134
privately/publicly        20      24     2     1       -      -      -      -      -      47
punishment/reward         6       5      4     -       -      3      -      -      1      19
quickly/slowly            16      6      2     -       -      -      4      -      -      28
right/wrong               36      13     1     5       1      -      -      -      4      60
rightly/wrongly           1       43     -     -       -      -      -      -      -      44
rural/urban               7       13     -     2       1      -      1      -      -      24
strength/weakness         11      6      6     -       4      4      -      -      4      35
un-words                  58      60     15    10      7      3      1      -      2      156
TOTAL:                    1162    1151   205   161     90     62     40     23     106    3000
TOTAL (%):                38.7    38.4   6.8   5.4     3.0    2.1    1.3    0.8    3.5    100
Table Two: Statistical breakdown of database classes

Table Two demonstrates that the most popular category is that of Ancillary Antonymy, to which 38.7% of all sentences have been attributed. Recording only 11 fewer sentences is Co-ordinated Antonymy, which accounts for 38.4% of all database sentences. These two classes are significantly larger than any others and collectively account for 77.1% of sentences. The third largest category identified is Comparative Antonymy, but this is only a fraction of the size of the two major categories. 205 sentences have been attributed to the class of Comparative Antonymy, less than 7% of the database. Distinguished Antonymy accounts for a further 5.4% of sentences, but no other class of antonymy is represented by more than 3% of the total sample. Perhaps most remarkable is that the majority of pairs, regardless of their word class, follow a similar pattern of distribution. For example, Ancillary Antonymy and Co-ordinated Antonymy are the most commonly occurring categories, but this is not just because they are each strongly favoured by a small number of pairs. Rather, this pattern is consistent among almost all pairs. Indeed, in the case of 44 of the 56 pairs sampled, Ancillary Antonymy and Co-ordinated Antonymy each account for more sentences than any other category. This suggests that any given word-pair is likely to have a predictable textual profile. However, some pairs may have unusual individual distributions. For example, all 34 sentences retrieved which feature confirm/deny are assigned to the class of Co-ordinated Antonymy. This is because refusing to confirm or deny a proposition has become a cliché among politicians and other public figures.

4. Antonym Co-occurrence

One of the questions raised by my research is this: exactly how widespread is antonymy in language? Gauging the true answer is very difficult. Firstly, there is the problem of defining antonymy: the stricter the definition one uses, the less pervasive the phenomenon will appear. Then there is the even greater problem of counting: in order to arrive at an estimate of the proportion of sentences which feature antonyms, one would need to identify every single antonymous pair in use, then retrieve all sentences which feature both of those words. But that would not be all - one would then need to edit all of these sentences manually (which would number over a million in my corpus) to eliminate those in which the word-pair does not function antonymously (coincidental co-occurrence is common among higher frequency pairs, especially those which feature a polysemous term, such as well). Only then could one arrive at an approximation of the proportion of corpus sentences which feature antonyms, and this approximation would still fail to account for inter-sentential antonymous usage. An easier way to estimate the prevalence of antonymy in text is to compare the expected co-occurrence rate of antonyms with their observed co-occurrence rate. Therefore, I shall now examine each of the 56 antonymous pairs in my sample to determine whether those antonyms co-occur more or less than would be expected by chance. Listed below are the 56 antonymous pairs selected for study in this thesis.
Each pair is followed by five columns of figures: columns one and two simply record the raw frequency of each antonym in the corpus; column three records the number of sentences one would expect to feature both antonyms if those words co-occurred at random; column four records the number of sentences in the corpus which, in reality, contain both antonyms; column five records the Observed/Expected ratio, which is generated by dividing the figure in column four by the figure in column three.

word one (W1) / word two (W2)   raw frequency of W1   raw frequency of W2   Expected Co-occurrence   Observed Co-occurrence   Observed/Expected
active/passive            11411     2033      1.8      172     95.6
advantage/disadvantage    21531     2483      4.2      69      16.4
agree/disagree            18196     2472      3.5      153     43.7
attack/defend             43395     9198      31.0     273     8.8
cold/hot                  16466     16026     20.5     751     36.6
correct/incorrect         10529     1484      1.2      34      28.3
dead/alive                32214     11661     29.2     565     19.3
deny/confirm              7514      6595      3.9      335     85.9
difficult/easy            54244     31395     132.4    434     3.3
directly/indirectly       14172     1377      1.5      492     328.0
drunk/sober               4730      1878      0.7      56      80.0
dry/wet                   10978     5109      4.4      348     79.1
encourage/discourage      12586     1614      1.6      77      48.1
end/begin                 145438    19682     224.6    740     3.3
explicitly/implicitly     1320      813       0.1      32      320.0
fact/fiction              78900     7391      45.3     503     11.1
fail/succeed              10963     8258      7.0      131     18.7
fast/slow                 22625     17374     30.6     350     11.4
feminine/masculine        1191      903       0.1      140     1400.0
good/bad                  181876    47247     668.1    4804    7.2
guilt/innocence           4229      3804      1.3      162     124.6
happy/sad                 28217     9420      20.7     140     6.8
hard/soft                 68635     11960     63.8     526     8.2
high/low                  93232     41088     297.8    2847    9.6
honest/dishonest          6922      1084      0.6      28      46.7
legal/illegal             40832     11208     35.6     302     8.5
light/heavy               36832     22898     65.6     297     4.5
long/short                131582    52119     533.2    2168    4.1
love/hate                 42541     6108      20.2     511     25.3
major/minor               45452     10624     37.5     432     11.5
male/female               16930     14883     19.6     2556    130.4
married/unmarried         25581     1033      2.1      101     48.1
new/old                   341832    113065    3004.9   9426    3.1
officially/unofficially   6025      394       0.2      33      165.0
old/young                 113065    83247     731.8    2704    3.7
optimistic/pessimistic    7123      1984      1.1      96      87.3
optimism/pessimism        5717      1163      0.5      91      182.0
prove/disprove            20968     258       0.4      35      87.5
permanent/temporary       10413     7878      6.4      351     54.8
poor/rich                 34054     20999     55.6     2027    36.5
public/private            133056    61202     633.1    6741    10.6
publicly/privately        8108      6406      4.0      282     70.5
punishment/reward         6363      6152      3.0      38      12.7
quickly/slowly            25129     8958      17.5     83      4.7
recession/boom            22707     8678      15.3     334     21.8
right/wrong               125712    42376     414.2    2677    6.5
rightly/wrongly           4558      2681      1.0      182     182.0
rural/urban               8600      7923      5.3      515     97.2
small/large               86908     69219     467.7    2928    6.3
straight/gay              21672     9734      16.4     277     16.9
strength/weakness         19866     5971      9.2      441     47.9
success/failure           47816     24438     90.8     971     10.7
true/false                35357     10245     28.2     227     8.1
war/peace                 81293     38258     241.8    2586    10.7
well/badly                178431    15772     218.8    712     3.3
win/lose                  76372     27771     164.9    1125    6.8
TOTAL:                    2662409   955994    8441.8   55411
AVERAGE:                                                       6.6
Table Three: Co-occurrence of Antonymous Pairs

Table Three shows that all antonymous pairs selected for study co-occur at a statistically significant rate, at least three times more often than would be expected by chance. Some pairs record an enormous Observed/Expected ratio, but this is often attributable to their low individual frequencies. For example, feminine and masculine only arise on about 1,000 occasions each in the entire corpus. Therefore, they can be expected to co-occur in only 0.1 sentence. In fact, they co-occur in 140 sentences, generating an Observed/Expected ratio of 1400.
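The paper does not spell out how the Expected column is calculated, but the published figures are consistent with treating co-occurrence as chance co-occurrence of two independently distributed words across the sentences of the corpus. The sketch below reproduces the active/passive row under that assumption; the sentence count N is a hypothetical value (the corpus size is reported in words, not sentences).

# Minimal sketch of the Expected and Observed/Expected columns of Table Three,
# assuming "expected co-occurrence" means chance co-occurrence of two
# independently distributed words over N corpus sentences. N is hypothetical.
def expected_cooccurrence(freq1, freq2, n_sentences):
    # P(word1 in a sentence) * P(word2 in a sentence) * number of sentences,
    # approximating each probability by raw frequency / number of sentences.
    return (freq1 / n_sentences) * (freq2 / n_sentences) * n_sentences

N = 12_900_000                                    # hypothetical sentence count
expected = expected_cooccurrence(11411, 2033, N)  # active / passive
observed = 172
print(round(expected, 1), round(observed / expected, 1))  # roughly 1.8 and 95.6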
This is anomalous, but even pairs of words with relatively high individual frequencies (female/male, cold/hot, poor/rich, etc.) are able to record healthy co-occurrence figures. Indeed, according to this experiment, antonyms co-occur 6.6 times more often than would be expected by chance. When Justeson & Katz conducted a similar test on the Deese antonyms, they found that antonyms co-occurred in the same sentence 8.6 times more often than chance would allow (1991: 142). Their results were based on a corpus much smaller than the one from which the above statistics are drawn and this inevitably affects their reliability. For example, Justeson & Katz calculate an Observed/Expected ratio of 19.2 for happy/sad, based on individual frequencies of 89 and 32 respectively and observed co-occurrence in just one sentence. My corpus yields an Observed/Expected ratio of 6.8 for happy/sad, based on individual frequencies of 28,217 and 9,420 respectively and observed co-occurrence in 140 sentences. It seems fair to conclude that statistics derived from the latter corpus will be more trustworthy. However, despite the difference in corpus size, the two average Observed/Expected rates (6.6 and 8.6) are close enough to prove that antonyms do co-occur in text at a relatively high rate.

5. New Antonyms

A further question which may be considered with the help of corpus data is: how and why do certain pairs of words become enshrined as antonyms? To address this issue, the productivity of frameworks associated with some of the new classes of antonymy will be investigated. Productivity here refers to the "statistical readiness" (Renouf & Baayen, 1996) of lexico-syntactic constructions to incorporate other related terms. In other words, if antonyms occupy certain lexical environments in text, which other words also occupy those environments and could some of those words be seen as new, developing antonyms? Most new classes of antonymy were found to favour certain lexical environments in text. Three lexico-syntactic frameworks will be used to assess the antonymous profiles of given words:
both X and Y [Co-ordinated Antonymy]
between X and Y [Distinguished Antonymy]
whether X or Y [Co-ordinated Antonymy]
The productivity of these frameworks will be tested by placing a word in the X-position, extracting all concordances which feature that word-string from the corpus, then examining which items occupy the Y-position. Three words will be placed in the X-position for each framework. I shall begin by investigating an antonym from my sample, good. If these frameworks are in any sense productive, intuition demands that they should retrieve bad and, to a lesser extent, evil in Y-position with high frequency. I shall then examine two new words (natural and style) to discover whether this strategy can be used to identify potential antonyms for words which do not have established antonyms.

5.1. Seed Word: good

· both good and ...
The lexical word-string both good and appears in a total of 63 corpus sentences. In 45 of those 63 sentences, it is followed immediately by bad. A further 4 of the 63 sentences recorded evil appearing immediately after both good and. This leaves 14 occurrences of both good and which are followed by neither bad nor evil.
Those 14 remaining contexts are listed below, together with the noun-head they modify:

- both good and flawed (King Hassan's reign)
- both good and pathetic (years)
- both good and wicked (people)
- both good and nasty (youths)
- both good and hard (times)
- both good and not green (God)
- both good and true (a story)
- both good and lasting (friends)
- both good and powerful (patriotism)
- both good and new (a paper)
- both good and friendly (a service)
- both good and inimical to the Labour Party (Conservative belief)
- both good and non-sexually explicit (a novel)
- both good and great (wines)

The concordances above make interesting reading. Some Y-position words are very useful contrast terms for good (flawed and pathetic are synonymous with bad; wicked and nasty are synonymous with evil). One would not intuitively identify hard as an antonym of good, but this contrast is perfectly valid within its given context - hard times are quite the opposite of good times. However, the phrase both X and Y does not always reflect an obvious contrast. For example, a story is described as being both good and true; one would not want to consider these terms as potential antonyms. In such contexts, it would appear that the framework signals unlikely inclusiveness. Similarly, no contrast is generated between good and either powerful, new, or friendly. Finally, both good and great is an interesting example because a distinction is made, but that distinction is not at the usual point on the scale of quality (i.e. between good and bad). Rather, this context distinguishes between good and something better than good.

· between good and ...

The framework between good and Y occurs in 140 corpus sentences, more than double the number of both good and Y. However, the distribution of bad and evil is very different. Of the 140 examples of between good and Y, 50 feature bad in Y-position and 78 feature evil in Y-position. Only 12 sentences feature neither bad nor evil in Y-position. These contexts are listed below, together with their corresponding noun-head, where appropriate:

- between good and poor (schools)
- between good and poor (performance)
- between good and lousy (comprehensives)
- between good and harmful (foods)
- between good and greed (a struggle within Lewis)
- between good and suspicious (toadstools)
- between good and good to soft (the going)
- between good and very good
- between good and very good
- between good and excellent (Melbourne's eateries)
- between good and really great (wine)
- between good and the best

Once again, some of the occurrences at the lower end of the frequency scale are valid contrast terms and others are not. Two contexts show poor occupying Y-position in the between good and Y framework. This is an excellent contrast term, as is lousy. Equally interesting are the distinctions made between good and other, more extreme points on the scale of quality. On two occasions, very good is contrasted with good, and excellent, really great and the best each appear in opposition on one occasion. This is reminiscent of the both good and great word-string retrieved earlier. Although one would expect good to contrast exclusively with negative items in language, it would seem that many writers choose to exploit its latent contrast with “super-positive” terms instead.

· whether good or ...

Of the three lexico-syntactic frameworks analysed, whether X or Y is the least common. In the corpus, only 8 sentences feature the word-string whether good or Y.
In 7 of those sentences, bad fills the Y-position; in the eighth, evil fills the Y-position.

5.2. Summary of good

This analysis of good has demonstrated that it is possible to retrieve contrast words from the corpus using productive lexico-syntactic frameworks. Collectively, the three frameworks examined occur on 211 occasions in the corpus. On 102 of those occasions (48.3%), the given word-string is followed by bad. This is compatible with our intuitions - one could predict that bad would be set up in opposition against good most commonly. Indeed, one could also predict that evil would be runner-up; evil fills the Y-position in frameworks analysed on 83 occasions (39.3%). However, the purpose of this experiment is not to prove that bad and evil are antonymous with good; rather, it is to show that the three frameworks identified are fertile enough to be deemed productive. This seems indisputable.

5.3. Seed Word: natural

· both natural and ...

- both natural and accurate (their response to the camera)
- both natural and artificial (light)
- both natural and artificial (lighting)
- both natural and artificial (light)
- both natural and artificial (the essence of man)
- both natural and artificial (everything that exists)
- both natural and assisted (fertility)
- both natural and beneficial (high altitude)
- both natural and coloured (light)
- both natural and heraldic (devices)
- both natural and human (perturbations)
- both natural and inevitable (process)
- both natural and inevitable (that ...)
- both natural and lucid (her acting)
- both natural and man-made (components)
- both natural and man-made (beauty)
- both natural and man-made (beauty)
- both natural and man-made (disasters)
- both natural and man-made (facilities)
- both natural and man-made (polymers)
- both natural and market (forces)
- both natural and prudent (paying debts)
- both natural and safe (white sugar)
- both natural and sensible (idea)
- both natural and social (sciences)
- both natural and social (sciences)
- both natural and spiritual (creatures)
- both natural and superb (a history of vodka)
- both natural and synthetic (fibres)
- both natural and taboo (a child's sexuality)
- both natural and technical (the effect)
- both natural and violent (causes)
- both natural and vital (USA action)

The output generated by a search for both natural and comprises 33 concordances, all of which are listed above. It can be seen that both natural and occurs less frequently in text than both good and, but that the Y-position output is more diverse. However, this is not to say that no patterns emerge: of the 33 occurrences of this lexico-syntactic framework, 6 are followed by man-made and 5 are followed by artificial. Both of these terms make excellent contrast words for natural. Some of the words retrieved in Y-position on one occasion only are non-contrastive, but interesting and valid oppositions of natural include market (in terms of forces), synthetic (in terms of fibres), violent (in terms of death) and assisted (in terms of fertility).

· between natural and ...

- between natural and artificial (ozone)
- between natural and artificial (worlds)
- between natural and artificial (worlds)
- between natural and created (forms)
- between natural and cultivated (areas)
- between natural and juridical (persons)
- between natural and man-made (assets)
- between natural and metal (packaging)
- between natural and moral (evil)
- between natural and supernatural

Ten corpus sentences feature the phrase between natural and Y.
The only word to occupy Y-position more than once is artificial. It is interesting to note that between natural and man-made also appears. This suggests that man-made also shares a strong contrastive profile with natural. Significantly, it also confirms that similar words occupy the Y-position in the frameworks both natural and Y and between natural and Y. All of the six words retrieved on one occasion reflect contrast to a lesser degree, with supernatural perhaps being the most interesting because it ties in with spiritual, which was picked up by both natural and Y.

· whether natural or ...

- whether natural or artificial (hormones)
- whether natural or electric (light)
- whether natural or imposed (punishment)
- whether natural or man-made (environment)
- whether natural or man-made (beauty)
- whether natural or otherwise (phenomena)
- whether natural or step (parents)
- whether natural or through external intervention (chemical changes)

Eight contexts were found to include the word-string whether natural or Y. Pleasingly, both artificial and man-made arise in Y-position. This means that all three lexico-syntactic frameworks have successfully retrieved both of these words. One-off contrast terms again include valid and useful examples. For example, step is not the kind of word one would intuitively identify as a potential opposite of natural. However, within the given context of parentage, this contrast is not only legitimate but very interesting. It is also reassuring to note the appearance of otherwise in Y-position. Though this term is not a valid opposite of natural in itself, otherwise effectively functions as a proform for unspecified contrast words in text.

5.4. Summary of natural

Collectively, the three frameworks examined as part of this study feature natural in X-position in a total of 51 sentences. In 9 of those sentences, artificial occupies Y-position and, in a further 9 of those sentences, man-made occupies Y-position. This strongly suggests that those two words are the primary textual contrast terms of natural. Indeed, over one third of all frameworks examined feature either artificial or man-made in opposition with natural. This output is particularly interesting if analysed in light of the range of antonyms which lexicographers have paired intuitively with natural. For example, Webster's Dictionary of Synonyms (1951) lists three antonyms: artificial, adventitious and unnatural. The inclusion of the first-mentioned of this trio is supported by this experiment, but adventitious does not occupy Y-position at all. This is not surprising given that the word occurs only seven times in the entire corpus (or about once per 40 million words in text). More notable is the non-appearance of unnatural in textual opposition with natural. This could be interpreted as a flaw in the retrieval strategy or it could be interpreted as revealing an interesting aspect of natural: namely, that it prefers to contrast with lexical opposites rather than its morphological opposite. Collins Cobuild Dictionary5 (1985) cites six antonyms of natural, beginning with unnatural. The other contrast words suggested are surprising (not retrieved in text), contrived (not retrieved), artificial (retrieved 9 times), man-made (retrieved 9 times), and processed (not retrieved). Chambers Dictionary of Synonyms and Antonyms (1989) suggests unnatural, artificial, man-made, affected and contrived.
Therefore, some correlation emerges between intuitively identified antonyms and antonyms identified by productive lexico-syntactic frameworks: all three dictionaries cite artificial as a good opposite and only the oldest of the three fails to cite man-made. However, I would suggest that other recommended antonyms (adventitious, processed and even unnatural) are not placed in textual opposition against natural as often as may have been anticipated. Moreover, it could be argued that such words are less valid contrast terms of natural than synthetic, supernatural, assisted and other Y-position words which have not been cited by lexicographers, but which have been retrieved in this experiment.

5.5. Seed Word: style

· both style and ...

- both style and a demonstration of reaching speed
- both style and achievement
- both style and commercial space
- both style and content (x4)
- both style and date
- both style and emotion
- both style and fashion
- both style and feeling
- both style and heart
- both style and history
- both style and performance
- both style and personality
- both style and personnel
- both style and policy
- both style and prices
- both style and qualifications
- both style and reputation
- both style and standards
- both style and substance (x5)

The word-string both style and arises in 26 corpus sentences. Two words are set up in opposition to style more commonly than anything else - substance appears in Y-position five times and content appears in Y-position four times. These contexts reflect a trend for style to be seen as meaningless or superficial, which licenses its opposition with more “weighty” terms. Other words which reflect this trend but only occur once include performance, policy, achievement and standards. From examining its antonymous profile, one might infer that style has developed a pejorative sense in the language.

· between style and ...

- between style and content (x4)
- between style and disorder
- between style and grape
- between style and political ideology
- between style and quality (x2)
- between style and subject
- between style and substance (x5)

Four of the 15 between style and constructions are followed by content and five are followed by substance. This means that only 6 of the 15 occurrences of this lexico-syntactic framework feature other terms. On two occasions, this term is quality, which conforms to the underlying trend for style to be treated negatively in text and to be synonymous with emptiness or absence of quality. However, a reminder that these frameworks are not exclusively inhabited by contrast words is provided by grape. The sentence from which this word-string is taken actually explores the relationship between the style of a given wine and the nature of the grape used in its production. Of course, grape could hardly be seen as a valid or useful opposite of style, merely an unlikely instantial collocation.

5 This dictionary made use of a corpus which was smaller than my own (about 20 million words), but which was not newspaper specific.

· whether style or ...

This lexico-syntactic framework was not found in the corpus at all, probably because style functions most commonly as a noun and nouns do not lend themselves readily to this construction.

5.6. Summary of style

The textual profile of style shows that substance is most commonly retrieved in Y-position (10 hits; 24.4%). In second place is content (8 hits; 19.5%). Between them, these two words are retrieved in Y-position in 43.9% of frameworks.
It is interesting to note that style is never seen to contrast with concepts such as inelegance or tastelessness, as Chamber's Dictionary of Synonyms and Antonyms (1989) suggests. 5.7. Analysis of output The aim of this exercise was to identify where new antonyms come from. The process by which “opposites” are created is complex, but it is reasonable to speculate that in order for a pair of words to become enshrined as antonyms in any language, they must first receive a significant amount of exposure. This exposure will be in contexts which are more often associated with established pairs of antonyms. Based on this rationale, man-made and artificial have been identified as potential antonyms of natural; and substance and content as potential antonyms of style. This output may be seen as initial evidence that it may be possible to automatically identify embryonic antonyms in text. Arguably the most dramatic antonymous formation of recent times involved gay and straight, which held no obvious semantic relation in the middle of the last century, but have now achieved clear antonymous status. The increasing attention given to different sexual preferences must have contributed to the establishment of gay and straight as new “opposites”, though this antonymity was surely enshrined by repeated co-occurrence in lexical environments similar to those examined here. 6. Conclusions This paper has demonstrated that corpus-based approaches are relevant to an investigation of antonymy. Based on evidence from broadsheet newspaper corpora, I have argued that: · In addition to logical distinctions, antonymous pairs are also receptive to classification according to their textual function. Data show that the two most common text-based classes of antonymy are Co-ordinated Antonymy (in which antonyms are joined by and or or and express exhaustiveness or inclusiveness) and Ancillary Antonymy (in which antonyms act as a lexical signal of a further, nearby contrast). · All 56 antonymous pairs examined co-occur intra-sententially at least three times more often than chance would allow. On average, antonyms record an Observed/Expected ratio of 6.6. · Using lexico-syntactic frameworks associated with the co-occurrence of established antonymous pairs, it is possible to identify new textual oppositions. Such research may shed light on the process by which a pair of words achieve antonymous status in language and allow us to identify new antonyms in their infancy. Bibliography Carter R 1987 Vocabulary. London, Allen & Unwin. Chambers Dictionary of Synonyms and Antonyms 1989. Cambridge, Chambers. Collins Cobuild English Dictionary 1995. London, HarperCollins. Cruse DA 1986 Lexical Semantics. Cambridge, Cambridge University Press. Deese J 1964 The Associative Structure of Some Common English Adjectives. Journal of Verbal Learning and Verbal Behaviour, 3: pp 347-357. Jackson H 1988 Words and their Meaning. Cambridge, Cambridge University Press. Justeson JS, Katz SM 1991 Redefining Antonymy: the textual structure of a semantic relation. Literary and Linguistic Computing 7: pp 176-184. Kempson RU 1977 Semantic Theory. Cambridge, Cambridge University Press. Leech G 1974 Semantics. Middlesex, Penguin. Lyons J 1977 Semantics: Volume 2. Cambridge, Cambridge University Press. Mettinger A 1994 Aspects of Semantic Opposition in English. Oxford, Oxford University Press. Muehleisen V 1997 Antonymy and Semantic Range in English. Unpublished PhD dissertation, Northwestern University. Palmer FR 1972 Semantics. 
Cambridge, Cambridge University Press.
Renouf A, Baayen H 1996 Chronicling the Times: Productive Lexical Innovations in an English Newspaper. Language, volume 72, number 1.
Roget's Thesaurus 1952. London, Sphere Books.
Webster's Dictionary of Synonyms 1951. Menasha, Merriam.

Variation across Korean text registers
Beom-mo Kang and Hung-gyu Kim
Korea University

Our paper describes the results of applying Biber's (1988, 1995) multivariate statistical analysis of text registers to Korean text registers and styles, comparing our results with Kim's (1990). Here, registers are conceived of as kinds or types of texts with respect to text/discourse situations and are not distinguished from genres in a broad sense. The text corpus on which our study is based is composed of 334 text samples of 36 registers/genres of Korean texts. Each text sample is a computerised one and consists of about 1,000 words, the total size of the corpus being about 370,000 words. Among 13 spoken text registers, we included not only transcriptions of real conversations but also some texts written to be spoken--transcripts of TV dramas, plays, and movies. Among 23 written text registers, we included various kinds of informative and imaginative ones--newspaper reports, editorials, essays, encyclopaedia, academic, informative books, etc. For the multivariate statistical analyses, we counted the occurrences of 82 linguistic features in each text sample and performed factor analysis, cluster analysis, and canonical discriminant analysis. Before we counted linguistic features occurring in texts, we had to perform morphological tagging because many of the linguistic features could only be counted from morphologically analysed texts. We used an automatic morphological analyser first, and then checked and corrected the tags manually. By multidimensional factor analysis using SAS, we found that there are six dimensions of Korean texts: D1. Informal interaction vs. Elaboration; D2. On-line/situated production vs. Informative content; D3. Narrative vs. Abstract concerns; D4. Formal statement of opinions; D5. Public report; D6. Public mention of modern things. By cluster analysis, we found that there are eight text types which are solely based on linguistic characteristics and which are different from conventional genres or registers. These eight text types are: T1. Abstract content; T2. Modelled informal interaction; T3. Event description; T4. Opinion; T5. Planned presentation of information; T6. On-line production; T7. Explanation of facts and procedures; T8. Public report. Our analysis is an application of Biber's multivariate statistical analysis to Korean texts. The results show that this method is a fruitful one for investigating aspects of Korean text registers and styles, and reveal similar as well as different textual characteristics of English and Korean.

References
Biber D 1988 Variation across Speech and Writing. Cambridge, CUP.
Biber D 1995 Dimensions of Register Variation: A Cross-linguistic Comparison. Cambridge, CUP.
Kim Y 1990 Register Variation in Korean: A Corpus-Based Study, Ph.D. thesis, USC.

Tracing idiomaticity in learner language: the case of BE
Przemysław Kaszubski
School of English, Adam Mickiewicz University, Poznań, Poland

1. Introduction

It is a widely known fact that language learners, especially less advanced ones, tend to rely excessively on flexible, high-frequency, 'core' vocabulary items in their foreign language use1.
One such commonly overused verb is the primary verb lemma BE, whose multifarious nature must discourage many corpus researchers from devoting time to it. In this paper I attempt to construct a version of the traditional tripartite scale of idiomaticity (frozen - restricted - free combinations) in order to encode various types of occurrence of lexical BE and test the extent(s) to which particular level(s) of fixedness are responsible for the reported overuse. The matter is vital, for core words are prone to forming extensions of all kinds which, contrary to the simple 'building-block' metaphor of learner lexical performance (cf. Kjellmer 1991: 124), indicate proficiency rather than non-proficiency. Before we announce that learners overuse the commonest words and possibly give them make-up work, it is useful to find out what exactly learners do with the core lemmas2. The premises underlying the idiomatic chart proposed here rest both on the traditional, grammatical criteria for idiomaticity (semantic opacity, lexical/syntactic fixedness, lexical/syntactic anomaly, cf. Moon 1997: 44, Hudson 1998: 8-9) and on corpora-inspired views of conventionality (viz. frequency and distribution, as reported in LDOCE3) and pragmatic specialisation in discourse (formulae). Since many tendencies regarding EFL vocabulary production are transfer-related, the distribution of BE's postulated idiomaticity bands will be shown in a contrastive scheme comprising both EFL learner and control non-learner and L1-based text collections. The practical goal of these examinations is to characterise quantitatively the use of BE by Polish advanced EFL learner-writers. An underlying methodological objective of the study is to demonstrate how the needs of learner language phraseological research fail to be served by modern, robust, corpus-driven methods of text analysis.

2. Idiomatic BE: a major challenge for corpus-driven methodology

The lemma BE poses a major challenge to corpus researchers because of its versatility and extremely high frequency. In a phraseological study, one of the first tasks that needs resolving is, of course, the separation of grammatical and semantic3 (here also called lexical) uses of BE. Semantic BE is generally to be identified not only with the existential, intransitive uses of this verb but also with its linking (copular) functions, which likewise translate lexically into other languages, a point of importance when the impact of L1 interference on EFL language production is recalled. The two basic cases of auxiliary use (cf. Quirk et al. 1985: 129-135) to be excluded from analyses of lexical BE are: 1) 'central' passives (as opposed to semi-passives and pseudo-passives in which BE functions as a copula, e.g. 'This difficulty can be avoided in several ways' [central passive] vs. 'Leonard was interested in linguistics' [semi-passive] vs. 'The building is already demolished' [pseudo-passive]; cf. Quirk et al. 1985: 167-171); and 2) the use of BE as the progressive aspect auxiliary (e.g. 'Ann is learning Spanish').

1 Frequency analyses of learners' language, such as Ringbom's (1998), Altenberg's (1997) or Hasselgren's (1994), clearly point this way. Resorting to safe lexical items is a frequent communication strategy not only of learners.
2 Although some corpus linguists consider idiomatic bonds to best operate between wordforms, I follow a lemma-based approach out of conviction, after Aitchison (1994), that the lemma is the basic lexical unit of the mental lexicon (cf. Howarth's lexemic approach, 1998).
3 Semantic uses of verbs most often correspond to the main verb function in a clause, but can also be represented by non-finite forms (infinitives and participles) and gerunds (in non-count forms, i.e. 'being' but not 'a being').

Especially the first group proves extremely difficult to tackle with contemporary text-processing software. Another source of complication for disambiguation is multi-word instances of what are called 'verbs of intermediate function': neither entirely semantic nor grammatical (Quirk et al. 1985: 96-128, 136f). Two pertinent sub-classes of such verbs are: the modal idiom 'BE to <do sth>', and the more open set of semi-auxiliaries, which include 'BE able to <do sth>', 'BE about to <do sth>', 'BE apt to <do sth>', 'BE bound to <do sth>', 'BE due to <do sth>', 'BE going to <do sth>', 'BE likely to <do sth>', 'BE meant to <do sth>', 'BE obliged to <do sth>', 'BE supposed to <do sth>', 'BE willing to <do sth>', etc. In the performed analysis two other 'verb idioms which express modal or aspectual meaning' (Quirk et al. 1985: 143) have been added: 'BE inclined to <do sth>' and 'BE allowed to <do sth>' (= 'may' or 'have permission', as applied by some Polish users). The differentiation between modal idioms and semi-auxiliaries is essential insofar as the latter approximate the lexical (linking) uses of BE. Once all the above enumerated uses of BE can be successfully identified and set aside, we can proceed to study the remaining lexical uses, which, as with any other verb, exhibit inclinations to form idiomatic (or frozen), phraseological (restricted / collocational), and open combinations with other words.

First, the FROZEN idiomatic level of BE may be postulated as consisting of those phrases in which the verb is literally 'frozen' both lexically and formally (as a particular wordform). Such uses are few and typically specialised functionally, e.g. 'that is (to say)' (used to mark repetition), 'to be sure' (epistemic modality), or 'for the time being' (time disjunct). Phrasal/prepositional occurrences of BE can rarely be taken as integrated semantic units, but more like instances of phrasal/prepositional complementation, e.g. 'BE around' (= 'BE available'), 'BE on' (= 'BE working/ running/ playing'), 'BE into <sth>' (= be interested in sth). They will be associated with the RESTRICTED collocational level, discussed below. One notable exception is the frozen perfective expression 'been around' (= having had many and varied experiences), as in 'a young executive who has been around', where the meaning of 'BE around' acquires an extended metaphorical meaning.

Published sources offer little help regarding collocational habits of BE. Collocation dictionaries (cf. Benson et al., Kozłowska & Dzierżanowska) present very modest entries for lexical BE, or do not present them at all. This is because BE is an 'upward collocate' (Sinclair 1991) of so many words that it makes little practical sense to list all of them. A closer look at corpus data, however, proves that a good percentage of BE tokens and types are somehow conditioned or conventional, i.e. that they transcend the simple slot-and-filler generative paradigm which links words according to pre-selected syntactic choices. As mentioned, lexical BE comes in two basic variants, copular (or linking) and intransitive, the former usually outnumbering the latter significantly. One important fact about copular verbs is that they require obligatory complementation, which may be of three structural types (Quirk et al.
1985: 1171-4). Two of them are simple and prototypical: 1) by an adjective phrase (‘BE <adj>'; e.g. ‘the menace from the plant is serious'), and 2) by a noun phrase (‘BE <noun>'; e.g. ‘the movies are a form of fiction'). The third kind of complementation involves the use of a (predication) adjunct, whose most frequent surface manifestation is a prepositional phrase. This type of complementation may be functionally ambiguous, since its role may be either that of an obligatory adverbial (e.g. representing the relation ‘BE <place>’ or ‘BE <time >') or of a subject complement resembling a noun phrase or an adjective phrase (as in the pattern ‘BE of <sth>': e.g. ‘BE of consequence/ substance/ importance'4 etc.; Quirk et al. 1985: 732). Quite importantly, many of the prepositional phrases functioning as subject complements of BE are multi-word units, often internally idiomatised (i.e. displaying lexical fixedness or syntactic abnormality), with corresponding adjectival synonyms, e.g.: ‘BE out of breath’ (cf. ‘BE breathless'), ‘BE of no importance’ (cf. ‘BE unimportant'), ‘BE not at ease’ (cf. ‘BE not relaxed'), ‘BE in love', ‘BE in good condition'. In contrast to their function as subject complements, prepositional phrases acting as obligatory adverbials seem to merely describe circumstances relating to the subject's — a person's, object's or event's — ‘being’ (i.e. presence or happening). They thus appear much less tied to the verb BE, which assumes a decontextualised, intransitive, existential rather than typically copular function. Such a relation is perceptible between BE and prototypical obligatory adverbials (time, space and metaphorical space) and, perhaps less strongly, also between BE and other adjunct complements (recipient, purpose, reason, accompaniment) (cf. Quirk et al. 1985: 731). 4 ‘BE of <noun>’ is an interestingly productive sub-type of prepositional-phrase subject complement. 314 Even less transparent/ compositional (and therefore classifiable as ‘restricted') seem to be the cases of complementation by: 1) means adjuncts (often conventionalised and/or lexically fixed, or else possibly replacing a passive or different predicate; e.g. ‘Transport is by ferry', ‘Entrance is by special invitation', ‘such contracts are (= are signed) with people who...'); 2) stimulus adjuncts (rare, stylistically marked, and greatly restricted by the subject, which controls the preposition following BE, e.g. ‘His main interest was in sport'; 3) agent adjuncts (restricted semantically to, most typically, artistic authorship, e.g. ‘The book was by an unknown writer'5); 4) measure adjuncts (contracting a non-prototypical sense of BE, though obviously a frequent and salient one among non-beginner English learners, e.g. ‘The jacket was 10 pounds'6). From the above survey of the complementation patterns of lexical BE, a general rule can be inferred that the verb tends to be followed by complements which either constitute idiomatic phrases, or restrict (specialise) BE's realm of reference (by influencing its subject collocates), or which otherwise constitute simple, ad-hoc, compositional phrases (adjectival, nominal or prepositional). I would like to propose for the first two of these types to be joined into a complementation super-pattern ‘BE <idiom>', which will be henceforth associated with a RESTRICTED level of collocability of lexical BE, on the grounds that: 1) copular BE, by definition, requires a complement (or adverbial), and 2) the type of complement (or adverbial) considered is itself idiomatic. 
Examples of restricted collocations representing the two prototypical complementation patterns ('BE <adj>' and 'BE <noun>') will include: 1) BE + adjectival idioms or collocations (predicatively unified, often substitutable by a single verb, e.g. 'BE conditional upon <sth>', 'BE worth <(doing) sth>', 'BE alive' (cf. 'live'), 'BE fraught with <sth>', 'BE sorry for <sb>' (cf. 'sympathise')); 2) predicative pseudo-passives and semi-passives (e.g. 'BE composed of <sth>', 'BE connected with <sth>', 'BE interested in <sth>', 'BE used to <(doing) sth>', 'BE situated <somewhere>'); 3) BE + adjectival/past-participial predicate + to-clause (e.g. 'BE liable to <do sth>', 'BE reluctant to <do sth>'); 4) BE + nominal idiom (e.g. 'BE a bitter pill (for <sb>) (to swallow)', 'it BE high time', 'BE the case (with <sb/sth>)').

Following the criteria of pragmatic specialisation and frequency, another sub-category of idiomatically RESTRICTED uses should be associated with lexicalised discourse-related formulae, which in the case of BE are quite numerous. On account of the transparent, prototypical semantics of BE and absence of (strong) lexical and syntactic restrictions operating on it, formulaic uses should not, I believe, be regarded as frozen. Table I below provides a brief summary of suitable subtypes and instances of formulae:

Table I: Restricted, discourse-conditioned phrases with lexical BE
Pattern/Subtype -- Example/Sub-pattern
conventional discourse formulas and linking phrases -- 'that/this BE why/ the reason why...' etc. (often sentence initially); 'there is every/no reason (for <sb>) to <do sth>'
'<sth> BE that...'* -- 'the idea/problem/thing is that...'
'<sth> BE to <do sth>'** -- 'his purpose/task/approach is to <do sth>'
idiomatic referential uses -- 'BE so/otherwise'; '<sb/sth> BE one/those that/who ...'
BE + clause: other formulae*** -- '<sth: the question etc.> BE whether ...'; '<sth> BE how <sth> <happened>'; 'it/this BE because ...'
Other formulae -- '<sth> BE for <sb> to <do sth>'; '<sth> BE as follows/the following' etc.
* a prominent discourse prefacing formula
** a prominent explicational formula, also common in prefacing
*** this sub-type is arguably the least restricted (formulaic) of all the ones tabulated here; it has been added on account of semantic analogy to other prefacing formulas

Apart from all the restricted occurrences, lexical BE also features certain independent stylistic/rhetorical uses that are difficult to categorise within the bounds of idiomaticity. One such type of expression is cleft and pseudo-cleft sentences, where the use of BE (in bold type in the examples below) appears to be a result of a transformation from an underlying non-emphatic predication (underlined below) rather than a typically lexical instance comparable to cases described earlier:

After all, it is marriage, the beginning of a family that constitutes the very basic part of every nation and society and as such it is no longer a private affair between two people.
All his people ask for is no more war.

5 Many complementations of this type may be regarded as idiomatic equivalents of the long passive, e.g. 'The book was/had been written by an unknown writer'.
6 The pattern of such expressions may be written out as 'BE <sth>' but its semantic structure is totally incongruous with the defining quality of prototypical noun complementation, captured by sentences such as 'The prize was 10 pounds'.
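To make the stratification easier to follow, the bands introduced so far can be thought of as an ordered set of pattern lists against which a concordance line is checked, with anything unmatched falling into the free-combination residue discussed next. The snippet below is only a schematic illustration with a handful of patterns taken from the examples above; it is not the coding procedure actually applied in this study.

# Schematic illustration only: a few of the phrases mentioned above encoded as
# regular expressions, with unmatched lines defaulting to the free/other residue.
import re

BANDS = [
    ("frozen", [r"\bthat is(,| to say)", r"\bto be sure\b",
                r"\bfor the time being\b", r"\bbeen around\b"]),
    ("restricted: BE + idiom", [r"\bis (composed of|interested in|fraught with)\b",
                                r"\bis worth \w+ing\b", r"\bis high time\b"]),
    ("restricted: formulae", [r"\b(that|this) is why\b",
                              r"\b(idea|problem|thing|question) is (that|whether)\b"]),
]

def classify(concordance_line):
    """Return the first band whose patterns match; otherwise the free/other residue."""
    line = concordance_line.lower()
    for band, patterns in BANDS:
        if any(re.search(pattern, line) for pattern in patterns):
            return band
    return "free/other"

# e.g. classify("The problem is that learners overuse BE.") -> 'restricted: formulae'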
Another stylistically motivated feature is the use of subject-to-subject raising with an optional infinitival phrase to be, found in copulas ‘SEEM (to be)', ‘APPEAR (to be)', ‘TURN out (to be)', ‘PROVE (to be)', or in complementations of some mental verbs, especially in the passive voice, e.g. ‘BE found/thought etc. (to be)’ 7. Some of these structures may be frequent enough (the most common ‘SEEM (to be) <adj>/<noun>’ may yield up to 100 occurrences in a 100,000-word corpus) to skew other findings for BE, depending on whether the optional infinitive is included or excluded from global counts. By an arbitrary decision, in the findings presented below, such optional occurrences of ‘to be’ have been added up with total scores. Lastly, it is posited that FREE-COMBINATIONAL uses of BE should comprise all the remaining occurrences of this verb, in particular cases of non-idiomatic complementation within the two prototypical patterns: ‘BE <adj>’ (including -ed and -ing adjectives) and ‘BE <noun>'8. The third major sub-category of free combinations will be made up of the de-selected instances of obligatory but semantically ancillary adverbial complementation (adjuncts): 1) ‘BE <adjunct: time, space, metaphorical space>’ (e.g. ‘Pure fire (the stars) is in the heavens.', ‘It was 10 years ago.'); and 2) ‘BE <adjunct: purpose, accompaniment, measure, etc.>’ (e.g. in ‘BE with <sb>; BE for <sth> (=purpose); ‘BE about <sth>', as when used of a book, television programme etc.) The existential use of ‘there BE’ or the use of ‘BE’ after the anticipatory ‘it’ are, in the scheme proposed here, assumed as resulting from transformations of the basic theme-rheme informational model (performed, e.g., to satisfy stylistic or contextual needs) and, unless lexicalised or specialised (e.g. ‘there is every reason that ...'), such forms will be treated as free-combinational, regardless of their various detailed functions (cf. Biber et al. 1999: 951-953). The presented stratification of the potential occurrences of the verb BE demonstrates that, even if only on account of its highly diversified complementation, it is not justifiable to apply one yardstick to all instances of lexical BE present in a text. On the contrary, BE seems to feature its own model of the idiomatic cline, whose investigation may provide useful material not only for EFL researchers. 3. Automatic interfaces to corpus-bound phraseology It goes without saying that the technological potential an average applied corpus linguist may have at his/her disposal will fall short of resolving all the delicacies necessary to describe the idiomatic distribution of BE. At the heart of the problem lies the discrepancy in the way the basic term collocation is understood by applied linguists and the way in which it is implemented by corpora researchers. While the traditional, applied sense of collocation associates it with co-occurrence between items forming a syntactically interpretable unit (noun phrase, verb phrase etc.), corpus-driven methods are usually focused on what is easily countable in electronically held text. However, as we shall see, statistical results based on word spans or adjacent word clusters, although useful in surveying large bodies of text, cannot fully satisfy due to both incompleteness and overgeneralisation. 7 Quirk et al. 
(1985: 1173) report that in both British and American English certain copular verbs (APPEAR, LOOK, FEEL, SEEM, SOUND, REMAIN, STAY, BECOME, END UP, PROVE, TURN, TURN OUT, WIND UP) prefer infinitive constructions before noun phrase complements. The statistical results collated in the present study for the most frequent of the verbs, SEEM, have not confirmed this preference in native-speaker written production, instead pointing to a generally much more widespread adjectival complementation in which the highly more frequent (by 3-5 times) option is the one without ‘to be'. Interestingly, this last finding showed an opposite tendency (i.e. preference for ‘SEEM to be <adj>') among both advanced and intermediate Polish learners of English. 8 One further refinement (not pursued here) within the above group might be to isolate commentmaking patterns beginning with ‘it’ (which touch upon discourse specialisation) in opposition to clauses containing nominal subjects (cf. ‘it would be irresponsible to attempt...’ vs. ‘Attempting ... would be irresponsible.') 316 Precision and recall can be improved by using POS-tagged corpora, but, unless we put in a major effort to re-edit manually, some data will still slip through, due to systematic inaccuracy of taggers. It is practically impossible to automatise the labelling of the central passives (as opposed to semi-passives and pseudo-passives). The deep-tagging program tried for this project, TOSCA-ICLE tagger (Aarts et al 1997), despite the claimed 95-6% accuracy (de Haan 1997: 218), notoriously misinterpreted ‘BE used to <doing sth>’ as ‘BE used to <do sth>’ and likewise labelled as passive each instance of ‘BE related to <sth>', ‘BE concerned about <sb/sth>’ or ‘BE satisfied with <sth>'. Equally failing may be attempts at automatising the retrieval of significant collocations. One of the crudest ways is to extract ‘recurrent word combinations’ (also called ‘word clusters', ‘word bundles', ‘word strings’ etc.). Altenberg (1993, 1998) rightly showed that many of them can exhibit important, pragmatic functions, particularly in spoken discourse; however, most are not, by definition, ‘idiomatic’ (Biber et al. 1999: 990), and prove difficult to interpret and sub-classify as a group. Their significance is further undermined by the fact that many genuine collocations and multi-word expressions are not contiguous (Kennedy 1998: 114) and do not form fixed word strings. Clusters are certainly appealing for large-corpus research: Biber et al. devote over 30 pages to these combinations and only about 13 to all other multi-word, idiomatic expressions (Biber et al. 1999: 990-1024). However, they can uncover only very selective and very incomplete lexical associations, hidden amongst results that better indicate dominant topics (cf. ‘BE allowed to adopt children’ or ‘with Down's syndrome BE’ in Polish learner data) or stylistic mannerisms (e.g. ‘it BE obvious that', ‘and that BE why') than reveal collocational bonding. Many clusters signify no units at all (‘it BE', ‘that they BE') but cannot be stop-listed since the commonest words (e.g. prepositions) play important roles in many other, meaningful clusters. Another approach to automatising collocation extraction is to apply co-occurrence statistics. These express arithmetically the (relative) strength of the association bond between words that tend to appear within a specified span (window) of words. 
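The kind of computation involved can be illustrated with a short sketch: counting collocates within a 4:4 window around a node word and scoring one of them with pointwise mutual information in one of its common formulations. The tokenisation, the window size and the log base are simplifying assumptions here, not the settings of the tools discussed below.

# Sketch of span-based collocate counting and a pointwise MI score
# (window size, tokenisation and log base are simplifying assumptions).
import math
from collections import Counter

def collocates(tokens, node, span=4):
    """Count tokens appearing within +/- span positions of each occurrence of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            counts.update(tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span])
    return counts

def mutual_information(f_node, f_collocate, f_joint, n_tokens, span=4):
    """Pointwise MI: log2 of observed window co-occurrence over chance expectation."""
    expected = f_node * f_collocate * 2 * span / n_tokens
    return math.log2(f_joint / expected)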
Two commonly used co-occurrence formulas (with variants frequently experimented upon) are mutual information (MI) and the Z-score (McEnery & Wilson 1996: 71), the latter sometimes replaced by the more accurate t-score. The philosophy behind MI makes it difficult to apply in studies targeting phrases with high-frequency vocabulary. MI can be applied successfully to identifying 'idiosyncratic collocations' (Oakes 1998: 90) and those which typify domain sublanguages because it privileges 'rare events' (Oakes 1998: 177). The limited helpfulness of MI in pointing at significant collocates was confirmed by the fact that even when the collocate frequency threshold was lowered to 3 (while 5 is said to be a decent statistical minimum), tests for lemmatised BE (carried out with WordSmith Tools) failed to produce any significant results within the 4:4 span. This, in view of the stratified phraseological system introduced above, is a rather questionable result. More prolific can be MI calculations performed for each attested wordform of BE. The Polish students' essay-writing corpus displayed connections between 'I' and 'sure', 'I' and 'against', and 'I' and 'afraid', but mostly exhibited relations with infrequent, topic-induced nouns ('monarchism', 'delusion', 'ritual', 'centuries') and adjectives ('doubtful', 'conspicuous', 'annoying'), which have less to do with genuinely significant lexicogrammatical patterning of BE.

A measure of co-occurrence which is less 'resistant' to common collocates is the Z-score. Running Collgen (a sub-program of the freeware package TACT) on the same corpus of Polish learners' essay-writing reported associations (p-level 0.01; span 4:4)9 between the lemma BE and the adjectives 'able', 'likely', 'supposed' and 'afraid'. It also pointed out the habitual co-occurrence of BE and 'there' (a most likely indicator of heavy use of the existential 'there BE' structure, indeed popular with learners), or BE and 'concerned' (indicative of learners' frequent over-reliance on the structure 'as far as <sb/sth> BE concerned'). However, a problem with rated collocate lists (even those sorted by the Z-score or t-score) is that they only indicate potentially interesting cases that require much effort and close textual analysis to verify (e.g. the association of BE and 'there' may also imply the verb phrases 'BE there'). Even access to annotated corpora is not immediately helpful, since tagging on-the-fly may go wrong and the number of tag combinations to be queried for a particular type of association is often not entirely predictable, complicating search patterns and/or prolonging computer processing time.

In short, when precision in obtaining data for a pedagogically oriented study is at stake, reliance on automatic extraction means often proves insufficient because: 1) too much 'noise' is generated in the data, which, in the case of smaller corpora, may considerably slow down analysis; 2) collocations can spread beyond the typically heuristic 4:4 span, in which case they will be blocked out, while extending the span would needlessly escalate the 'noise' effect; 3) sometimes only grouping data uncovers a meaningful kind of association, whereas co-occurrence extractors work with orthographic words and easily skip over, e.g., variants of one idiomatic expression (cf.
Stubbs 1998); 4) learner data (especially at lower proficiency levels) contribute to a further lowering of the statistical 'precision' and 'recall' of automated procedures, because of grammatical, stylistic, orthographic and other mistakes and errors. These, unless annotated or corrected in advance by hand, will often confuse taggers and/or skew statistics10.

9 This is an approximation. Collgen actually calculates co-occurrence statistics from generated word clusters. In this case 2-5 word clusters (including the node word) were examined.
10 This point was worth mentioning although BE, the simplest of verbs to use and one of the first to learn in writing, poses few problems ('where'/'were' and *'ben' for 'been' were the only reported cases).

4. Contrastive Interlanguage Analysis and the applied corpus network

Learner corpora studies benefit strongly when a multi-corpus network with native and non-native reference data can be applied. Such is the framework of Contrastive Interlanguage Analysis (CIA, Granger 1996), which involves two kinds of comparisons: 1) comparison of non-native and native varieties of the same language (e.g. to identify errors, or to trace 'foreign-soundingness' in patterns of overuse and underuse); and 2) comparison of different non-native varieties of the same language (e.g. to examine if a given IL phenomenon is bound with a given L1 background or is more universal/ developmental in nature). CIA gains further diagnostic and predictive power when connected with classical CA, carried out on translation or, as in the corpus network outlined below, parallel corpora. The English corpora gathered for this project fall into five pre-established proficiency categories, the central one of which is the advanced-EFL band containing the major Polish learner corpus, IFA-PICLE. In turn, the contrastive Polish part of the network represents two proficiency levels, expert/professional and college/secondary school learner, which mirror the native English control data.

Table II: The stratification of the corpora used in the study (word token counts: hyphens within words)
PLLC (non-native 'apprentice', 1. Intermediate): Polish intermediate EFL -- 92,712 tokens
SPAN11 (non-native 'apprentice', 2. Upper-intermediate): Spanish (upper-)intermediate EFL -- 94,965 tokens
FREN (non-native 'apprentice', 3. Advanced): Belgian-French advanced EFL -- 101,442 tokens
IFA-P(ICLE) (non-native 'apprentice', 3. Advanced): Polish advanced EFL -- 107,990 tokens
LOCN(ARG) (native 'apprentice', 4. College): British and American college learner English -- 106,255 tokens
MCONC12 (native 'expert', 5. Professional): British academic writing -- 97,914 tokens
LOB&BROWN (native 'expert', 5. Professional): British and American quality press -- 94,421 tokens
POL-STUD ('apprentice' corpus, 4. College level): Polish college compositions -- 103,382 tokens
POL-EXP ('expert' corpus, 5. Professional level): Polish academic papers + quality-press articles -- 101,348 tokens

The Polish advanced EFL corpus IFA-PICLE13 belongs, alongside SPAN, FREN and LOCNARG, to the International Corpus of Learner English (ICLE) resource, which primarily samples 500-1000-word argumentative essays submitted by university students of English in EFL countries (Granger 1994 and 1998 ed.).

11 The Spanish learner corpus (SPAN), although officially regarded as 'advanced' in the ICLE Project structure, had to be relegated to a lower level as it contained many more grammatical mistakes and decisively poorer vocabulary in comparison to the other advanced-level EFL data.
12 LOCNARG is a selection of argumentative essays written by English and American secondary school and college students; the whole resource constitutes the LOCNESS corpus (= Louvain Corpus of Native English eSSays), the primary control native corpus within the ICLE family, which, arguably, is more comparable with the non-native learner data than are professionally written text samples.
Because problems were encountered finding equivalent texts (in genre and style) and equivalent sample sizes to those in the ICLE material, the expert corpora, in particular MCONC, as well as POL-EXP, occasionally include longer and/or incomplete extracts of text cut out of larger publications. No topic homogeneity could be enforced, either, but efforts were made to include, in the first place, themes typically represented in IFA-PICLE and the other ICLE learner corpora (e.g. youth and social problems: violence, drugs, TV-addiction, etc.). PLLC is an extract from the Polish part (over 500,000 tokens) of the 10-million-word Longman Learner Corpus (LLC) including short essay writings, some of which feature personal rather than argumentative discourse (hobbies, interests, plans for the future, etc.). MCONC is a collection of manually extracted, jargon-free academic English texts derived from the MicroConcord text collection B. Academic texts (1993). LOB&BROWN is a collection of mostly quality- and popular-press extracts retrieved from the LOB and Brown corpora (ICAME Collection of English Language Corpora 1991), exclusively from text Category B ('Press: Editorial') and text Category F ('Popular Lore'), including analyses of political events, popular science articles, columns and editorials on everyday life, etc., but excluding short press reports.

5. How Polish advanced EFL writers overuse BE: selective findings

Let us begin with a few procedural remarks. Due to the unmanageably high frequency of BE in each corpus, most of the demanding disambiguation tasks14 involving non-frozen expressions (passives, semi-auxiliaries, and restricted / idiomatic phrases) had to be performed on samples of random concordance lines (500) drawn from each English corpus. When projecting the samples-based scores onto whole corpora, approximations using the standard error (0.5-2.0%) were performed, and confidence ranges established, assuming, for the easiest fit, a normal distribution and the minimum 95% confidence level. Some of the values presented below will consequently appear as (partially overlapping) continua rather than as single scores. Instead of sophisticated statistical testing, which is often dropped in applied studies of this kind (Granger 1998a), the results (comparisons of frequencies and percentages) are assessed and commented upon impressionistically. Secondly, due to the lack of topic homogeneity in the corpora, unexpectedly skewed and possibly topic-induced frequencies had to be identified. This was done by taking standard deviation measures for each recorded expression type and group across all the seven corpora and applying a heuristically established threshold of 2 to discriminate between proportionate and skewed distributions. The latter cases were then assessed as either instances of genuine quantitative difference or, if text inspection confirmed consistent connections with a uniquely (over)represented topic, rejected from further counts.

Quantitative results obtained for the first disambiguation stage (auxiliary vs. lexical BE) showed a clear underuse of the central passives among Polish intermediate learners, possibly resulting from a more personal and casual content of their texts. The complex semi-auxiliary structure 'BE going to <do sth>' (predominantly spoken and perhaps stylistically weak, cf. Biber et al.
1999: 489) was found to be a characteristic of (less proficient) learner writing that contrasted deeply with native English expert data, especially its academic variety. Overused informal expressions will reappear throughout this section, becoming a frequent feature of many EFL-based findings. Amongst semi-auxiliaries where BE functions as a linking verb, another instance of informality is a rather significant, consistent overuse of the structure 'BE able to <do sth>', noticeable especially in the performance of advanced-level native and non-native writers. The generally high statistical frequency of this 'core phrase' (over 100 occurrences in a million words, Biber et al. 1999: 517) spreads proportionally across various registers and text-types (conversation, fiction, news, academic writing), implying that the expression is very familiar to most (foreign) students of English from early stages in their education. This familiarity may be a conducive factor for the reported overuse, since the phrase is a safe option for selection in almost any language task.

Passing on to the idiomatic uses of lexical BE, three specific expectations were developed and tested: 1) negative correlation between rising proficiency and increasing frequencies of single-word (non-idiomatic) uses and/or with underrepresentation of idiomatic BE; 2) prolific presence of favourite expressions ('core phrases') in EFL learner data; and 3) traceability of (at least some of) the favourite expressions to L1 (Polish).

13 IFA-PICLE is an extract of the PICLE corpus, containing over 230,000 words of running text (365 essays). Full information on PICLE can be found at: http://main.amu.edu.pl/~przemka.
14 Performed with Concord, one of WordSmith Tools.

Table III: Lexical BE: Major summary results calculated from 500-line concordance findings (95% confidence intervals)
Corpora, left to right: LOB&BR and MCONC (5. Professional), LOCN (4. College), IFA-PICLE and FREN (3. Advanced), SPAN (2. Upper-intermediate), PLLC (1. Intermediate)

Estimated standardised frequency per 100,000 words:
Frozen uses: 5-76 | 24-127 | 2-75 | 27-138 | 10-98 | 25-134 | 6-93
Restricted: BE + idiom: 314-525 | 282-504 | 325-552 | 324-568 | 299-529 | 228-443 | 188-391
Restricted: formulae: 213-397 | 290-514 | 138-306 | 280-512 | 317-551 | 173-367 | 191-396
Cleft sentences: 47-159 | 49-173 | 54-177 | 82-234 | 36-152 | 8-96 | 0-75
Free combinations: 1,717-1,990 | 1,778-2,086 | 2,005-2,290 | 2,346-2,686 | 2,209-2,528 | 2,574-2,870 | 3,317-3,611
Total: 2,552-2,892 | 2,713-3,115 | 2,775-3,148 | 3,406-3,793 | 3,176-3,553 | 3,260-3,657 | 3,968-4,300

Estimated % of lexical BE in a corpus:
Frozen uses: 0.2-2.8% | 0.8-4.4% | 0.1-2.5% | 0.8-3.8% | 0.3-2.9% | 0.7-3.9% | 0.1-2.3%
Restricted: BE + idiom: 11.5-19.3% | 9.7-17.3% | 11.0-18.6% | 9.0-15.8% | 8.9-15.7% | 6.6-12.8% | 4.5-9.5%
Restricted: formulae: 7.8-14.6% | 10.0-17.6% | 4.7-10.3% | 7.8-14.2% | 9.4-16.4% | 5.0-10.6% | 4.9-9.6%
Cleft sentences: 1.7-5.9% | 1.7-5.9% | 1.8-6.0% | 2.3-6.5% | 1.1-4.5% | 0.2-2.8% | 0.0-1.8%
Free combinations: 63.1-73.1% | 61.0-71.6% | 67.7-77.3% | 65.2-74.6% | 65.7-75.1% | 74.4-83.0% | 80.2-87.4%
Total: 100% | 100% | 100% | 100% | 100% | 100% | 100%

The summary results presented in Table III find a good deal of agreement with the overall proficiency-based predictions expressed in the first hypothesis. Lower-proficiency students (especially PLLC) appear to use fewer collocational idioms (i.e. 'BE + idiom') than the remaining groups (corpora), and many more free combinations.
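The confidence continua reported in Table III can be illustrated with a short sketch that projects a proportion observed in a 500-line sample onto a whole corpus as a 95% range under a normal approximation. The figures used below and the exact standard-error correction applied in the study are not reproduced here; the function is only meant to show the shape of the calculation.

# Illustrative only: projecting a 500-line sample proportion onto a corpus as a
# 95% confidence range per 100,000 words (normal approximation; the exact
# correction applied in the study is not reproduced here).
import math

def confidence_range(hits, sample_size, total_be_tokens, corpus_tokens, z=1.96):
    """Return (low, high) standardised frequency per 100,000 words."""
    p = hits / sample_size
    se = math.sqrt(p * (1 - p) / sample_size)
    scale = total_be_tokens / corpus_tokens * 100_000
    return max(p - z * se, 0.0) * scale, (p + z * se) * scale

# e.g. confidence_range(hits=80, sample_size=500, total_be_tokens=3500, corpus_tokens=100_000)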
In frequency terms, it is perhaps surprising to see the restricted level best represented in the two EFL advanced corpora, which is due to a high share of formulae in these texts (comparable to the level characterising English academic writing, MCONC). At the same time, frequencies as well as percentage data show that the two advanced-level EFL corpora and the native learner corpus share a similarly extensive predilection for the application of free combinations. LOCNARG could have approached the position of the expert English corpora much more closely were it not for the lower figures recorded for formulae, particularly the prefacing structures of the type '<sth: idea, purpose etc.> BE that/to...'. However, quantitative studies of formulae are a shaky matter when EFL data are involved, since learner language has been found to feature many contextually unsuitable prefabs, and so their pure calculation without fathoming the context may be misleading (cf. Granger 1998b, de Cock et al. 1998).

Passing on to the frozen and restricted levels of phraseology, the data obtained point to the presence of several expressions and collocations that are favoured by learners, and to many instances probably attributable to the Polish L1. Thus, the second and third of the formulated hypotheses have also found at least some confirmation in the tests. With respect to the idiomatic levels of BE, among the scarcely represented frozen expressions one worth noting is Polish learners' apparent overuse of the finite clausal structure 'what is more' in the function of an addition/reinforcement adverbial, e.g.:

It is no wonder, that then some easily influenced Poles, who have not been exposed to many such films so far, might want to try living in a similar manner. What is more, after watching another “Rambo-like” film an average Pole may be led to thinking that committing a crime is a part of people's existence. (IFA-PICLE)

Although a legitimate idiomatic expression (LDOCE3: 1628), 'what is more' (often contracted to 'what's more') is an emphatic and rather spoken, though infrequent, 'polyword' (cf. Altenberg 1998: 117 or Biber et al. 1999: 1008f, 1014f). The origins of the overuse may be sought in the stylistically rhetorical (usually written) Polish transitional phrase 'co więcej', frequently employed by writers and orators to emphasise and/or extend an argument. Both native Polish corpora consulted (expert POL-EXP and learner POL-STUD) agreed in pointing to a stable frequency of 9-10 instances per 100,000 words for this use. Although not as high as the one attested for 'what is more' in IFA-PICLE and PLLC, the value is significant enough to indicate Polish-English cognateness (or possibly transliteration, since the phrase is fairly compositional) as a likely factor enhancing the detected overuse. Interestingly, Polish users also favour a similar finite structure 'what is more <adj: important, significant etc.>' as a sentence-initial emphasising adverbial, a use more naturally rendered by single-word adverbs like 'importantly' or 'significantly'. This, too, can be traced to habitual Polish connectors such as 'a co ważne / najważniejsze / najistotniejsze' etc., a point which returns in the discussion of discourse formulae below.

Restricted / collocational associations are much better attested and therefore more convincing.
Within the super-pattern 'BE + idiom', the following findings concerning Polish EFL essay writers are worth mentioning:
· the expression 'BE full of <sth>' appears overused by intermediate learners (12 occurrences in PLLC and 10 in SPAN), and also distinguishes itself in Polish advanced learners' writing (6); a transfer trigger mechanism is likely, as the expression is more informal in English (MCONC and LOB&BROWN contain only 1 instance each) than in Polish (POL-STUD: 5 occurrences), where formal persuasive discourse readily features the directly corresponding 'być pełnym <czegoś>', as in 'Świat jest pełen pokus, a natura ludzka słaba' (the world is fraught with temptations and the human nature is feeble);
· 'BE present' appears overused in IFA-PICLE (cf. the Polish semi-formal 'być obecnym', as in 'Telewizja jest obecna w życiu każdego z nas' (television is present in the life of every one of us), POL-STUD), but not in PLLC (possibly due to genre inconsistency), so it is impossible to diagnose the case fully;
· the semantic set 'BE connected/associated etc. with <sb/sth>', and especially the phrase 'BE connected with <sb/sth>', appears to be a strong favourite of Polish learners, with IFA-PICLE recording 12 and PLLC 15 occurrences (as opposed to 0-1 in all the remaining corpora); transfer influence is only partially justified, since the translational equivalent (although not a cognate) 'być związanym' registers only a few times (at most 4) in native Polish expert and learner writing alike;
· 'BE concerned with <sth>' (= 'deal with sth') is in prevalent use among professional English writers (8-9), but Poles (and French Belgians) prefer 'BE concerned' to operate in the linking structure 'as far as <sth> BE concerned', which may be the reason why they do not apply it to other contexts; frequencies show that the latter phrase may be a favourite with many EFL advanced learner populations;
· 'BE slow to <do sth>' is arguably less transparent than most other phrases (cf. the Polish 'ociągać się') and may be avoided by, or is perhaps unknown to, EFL learners, while it is recorded in moderate use among native expert writers.
With respect to discourse formulae and polywords, the following instances deserve mention:
· 'that/this BE why' appears to be a highly typical Polish-style linker, featuring 49 occurrences in IFA-PICLE and 56 in PLLC (other corpora: FREN 23, SPAN 13, native-speaker data 1-4, including LOCNARG). Over 30% of all the occurrences appear in the short form 'that's why'. The popularity may be L1-related (POL-EXP and POL-STUD both feature dozens of sentence-initial 'Dlatego (właśnie)' ('that is (precisely) why') as well as several 'Z tego powodu/względu' ('for this reason') etc., which seem to be its translational pragmatic equivalents). The expression is generally not very formal and is unsuitably typical of Polish EFL learners' written English discourse;
· (sentence-initial) 'what is more <adj: important etc.>' registers a few times in the IFA-PICLE corpus, although no (heavy) overuse has been detected; interestingly, native-English sources tend to employ it in clefts (e.g. 'what is more important is that...'), while adverbial uses are typically covered by adverbs ('importantly', 'significantly' etc.);
· 'BE:' (e.g. 'The question was: ...') is a discernible convention in native writing (especially academic MCONC), but most EFL learners use it twice as often, which may indicate undue reliance on this simple clause structure.
To summarise this section, the cases characterising Polish learners' habits mainly concern overuse and derive from their falling back on L1-inspired options and/or on common, familiar, spoken (or universal) phraseology.
Similar stylistic infelicities may also be observed in the preferred free-combination uses of BE. For instance, existential 'there BE', perhaps more characteristic of spoken English, tends to typify lower-proficiency written performance (with PLLC and SPAN recording by far the largest frequencies). Another 'spoken habit' of Polish users, both advanced and intermediate, is their resorting to anticipatory 'it' clauses in preference to longer structures with full nominal subjects, which usually enhance textual cohesion.

6. Conclusion

Quantitative studies of learner phraseology, marked by heavy disambiguation of instances, require finer, small-corpus-based comparisons rather than coarse, corpus-driven, statistical methods. Results such as those presented above would not have been possible without painstaking manual analysis allowing us to reach deep into the corpus data. Researchers of learner corpora should strive to capture the hard-to-retrieve covert types of 'error': the overapplication and avoidance of words, expressions, structures etc. Indeed, with more advanced learners, it is unnaturally distributed rather than incorrectly applied items that most prominently characterise 'foreign-sounding' style. Regardless of the creative side of language, much of what native users say and write is influenced by conventions which, at least in statistical terms, are also to be expected in learners' texts and speech, particularly at university level. The underlying philosophy is unavoidably prescriptive, in that it presupposes native 'norms' against which learners' performance can be assessed. Respectable voices advise caution against idealising learner corpus evidence, especially when it is confronted with such norms (cf. Leech 1998). However, it seems that unless we falsely hail learner corpus research as the one and only guide to native-like competence, instead of simply naming it a contributor to more successful learning, a touch of simple, educational prescriptivism should do little harm, especially if we beware of obvious methodological pitfalls, ask demanding questions, carefully prepare and scrutinise data, and avoid drawing arrogant, foregone conclusions. The message emerging from the presented exercise is that EFL learners do tend to overapply the simplest uses of the verb BE in comparison with native writers, and that the trend is, happily, less marked at the advanced level than at the intermediate level. It is not only free combinations, however, that add to the overall impression of overuse: a number of popular collocational and frozen expressions with BE, often inspired by the L1 or borrowed from spoken language, also contribute strongly.

Bibliography

--- 1995 TACT (Textual Analysis Computing Tools). Version 2.1. Toronto, University of Toronto.
Aarts J, Barkema H, Oostdijk N 1997 The TOSCA-ICLE tagset. Tagging manual [accompanying the TOSCA-ICLE tagger/lemmatiser version 1.0]. Nijmegen, University of Nijmegen.
Aitchison J 1994 Words in the mind [2nd ed.]. Oxford - Cambridge, Mass., Blackwell.
Altenberg B 1993 Recurrent verb-complement constructions in the London-Lund Corpus. In Aarts J, de Haan P, Oostdijk N (eds), English language corpora: design, analysis and exploitation. Papers from the 13th international conference on English language research. Amsterdam, Rodopi, pp 227-245.
Altenberg B 1998 On the phraseology of spoken English: the evidence of recurrent word combinations. In Cowie A P (ed), Phraseology. Oxford, Clarendon Press, pp 101-122.
Benson M, Benson E, Ilson R 1997 The BBI dictionary of English word combinations. Amsterdam - Philadelphia, John Benjamins Publishing Company.
Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman grammar of spoken and written English. Harlow, Pearson Education Limited.
de Cock S, Granger S, Leech G, McEnery T 1998 An automated approach to the phrasicon of EFL learners. In Granger S (ed), Learner English on computer. London, Addison Wesley Longman, pp 67-79.
de Haan P 1997 An experiment in English learner data analysis. In Aarts J, de Mönnink I, Wekker H (eds), Studies in English language research and teaching: in honour of Flor Aarts. Amsterdam - Atlanta, Rodopi, pp 215-29.
Granger S (ed) 1998 Learner English on computer. London, Addison Wesley Longman.
Granger S 1994 The learner corpus: a revolution in applied linguistics. English Today 39 (10/3): 25-29.
Granger S 1996 From CA to CIA and back: an integrated approach to computerized bilingual and learner corpora. In Aijmer K, Altenberg B, Johansson M (eds), Languages in contrast. Papers from a symposium on text-based cross-linguistic studies, Lund 4-5 March 1994. Lund, Lund University Press, pp 37-51.
Granger S 1998a The computer learner corpus: a versatile new source of data for SLA research. In Granger S (ed), Learner English on computer. London, Addison Wesley Longman, pp 3-18.
Granger S 1998b Prefabricated patterns in advanced EFL writing: collocations and formulae. In Cowie A P (ed), Phraseology. Oxford, Clarendon Press, pp 145-60.
Howarth P A 1998 The phraseology of learners' academic writing. In Cowie A P (ed), Phraseology. Oxford, Clarendon Press, pp 161-86.
Hudson J 1998 Perspectives on fixedness. Lund, Lund University Press.
Kennedy G 1998 An introduction to corpus linguistics. Harlow, Addison Wesley Longman.
Kjellmer G 1991 A mint of phrases. In Aijmer K, Altenberg B (eds), English corpus linguistics: studies in honour of Jan Svartvik. London, Longman, pp 111-127.
Kozłowska C D, Dzierżanowska H Selected English collocations [revised and enlarged edition]. Warszawa, Państwowe Wydawnictwo Naukowe.
Leech G 1998 Preface. In Granger S (ed), Learner English on computer. London, Addison Wesley Longman, pp xiv-xx.
McEnery T, Wilson A 1996 Corpus linguistics. Edinburgh, Edinburgh University Press.
Moon R 1997 Vocabulary connections: multi-word items in English. In Schmitt N, McCarthy M (eds), Vocabulary: description, acquisition and pedagogy. Cambridge, Cambridge University Press, pp 40-63.
Oakes M P 1998 Statistics for corpus linguistics. Edinburgh, Edinburgh University Press.
Quirk R, Greenbaum S, Leech G, Svartvik J 1985 A comprehensive grammar of the English language. London, Longman.
Ringbom H 1998 Vocabulary frequencies in advanced learner English: a cross-linguistic approach. In Granger S (ed), Learner English on computer. London, Addison Wesley Longman, pp 41-52.
Scott M 1996 WordSmith: software language tools for Windows. Oxford, Oxford University Press.
Sinclair J 1991 Corpus, concordance, collocation. Oxford, Oxford University Press.
Stubbs M 1998 A note on phraseological tendencies in the core vocabulary of English. Studia Anglica Posnaniensia XXXIII: 399-410.
Summers D (ed) 1995 Longman dictionary of contemporary English [3rd edition]. Harlow, Longman Group Ltd.
[LDOCE3]

Identifying parallel corpora using Latent Semantic Indexing

Yuliya Katsnelson, Highland Technologies, 4831 Walden Lane, Lanham, MD 20706, Tel: 301-306-2849, Fax: 301-306-8201, ykatsnelson@htech.com
Charles Nicholas, UMBC, 1000 Hilltop Circle, Baltimore, MD 21250, Tel: 410-455-2594, Fax: 410-455-3969, nicholas@cs.umbc.edu

Abstract

Identifying parallel corpora can be an important step in a variety of tasks related to information retrieval. At present, however, identifying parallel corpora requires human experts to examine the texts and evaluate their respective contents. We assume in this research that texts which are translations of each other have similarities in their semantic structure which are absent between independent documents. Latent Semantic Indexing (LSI) (Landauer 1989, Deerwester 1990) is a statistical technique that brings out correlations between documents based on term co-occurrence patterns identified using the method of Singular Value Decomposition. LSI does not involve any knowledge of the actual content of the documents and therefore has no need for human intervention. If LSI could be used for parallel corpora identification, it would lower the costs incurred in this task. Our purpose is to determine whether it is possible to identify a parallel corpus using the method of Latent Semantic Indexing. We present evidence that LSI reveals similarities between parallel documents that do not exist between non-parallel documents and is therefore useful for identifying parallel corpora.

1 Introduction

Parallel corpora identification can be considered an area of Cross-Language Information Retrieval with applications both inside and outside of the area of IR. Recognizing the relevance of documents in multiple languages also includes recognizing the relevance of parallel documents. Parallel documents, in fact, make good test cases for evaluating the effectiveness of a Cross-Language IR system, since if a document is relevant to a query, then its translation should also be relevant. Also, being able to tell the parallel part of a corpus from the rest of the corpus allows one to reduce the amount of search, since only one portion of the parallel subsection needs to be explored. Parallel corpora analysis, on the other hand, relates largely to the area of document translation, human- or computer-aided. Parallel documents are included in a corpus in order to provide the reader with a choice of language in which to read the document. This means that the different versions of the document need to convey the same information. This objective introduces a number of questions. Is the corpus parallel? What part of the corpus has a corresponding parallel subset in the same corpus? How close are the various multilingual translations? Which translation is the best among all the available versions in the same language? In this research we aim to answer the first question, believing that the insight that we get in the process will enable us to approach the others.

Latent Semantic Indexing (LSI) is a statistical technique that uses the method of Singular Value Decomposition (SVD) to represent the terms and documents in the corpus as points in a multidimensional space. The dimensions of this space represent various patterns of term co-occurrence. Thus, documents that have similar characteristics with respect to a particular term co-occurrence pattern have similar coordinates in the corresponding dimension of the LSI space.
SVD allows a reduction of the dimensionality of the LSI space with minimal information loss. We report on the results of applying Latent Semantic Indexing to identifying parallel corpora. In our research we used the unsupervised learning approach, relying on the hypothesis that the content "signature" of the collection is going to outweigh the difference in language-related "noise" and still supply meaningful results. As measures for comparing the LSI representations of parallel and non-parallel corpora we use correlation coefficient analysis and visual inspection. We used corpora in the English, French, Russian and Italian languages to assess the behavior of this method in the cases of more similar (English/French) and less similar (French/Russian) language pairs. The corpora included excerpts from "The Little Prince" by A. de Saint Exupery, short stories and the novel "Smoke Bellew" by J. London, "The Adventures of Sherlock Holmes" by Sir A. Conan Doyle, and "The Black Tulip" by A. Dumas. Our choices of documents were driven largely by their availability in different languages. We present evidence that LSI reveals relationships between parallel documents that do not exist between independent documents and thus provides a means of identifying parallel corpora.

The remainder of this paper is organized as follows. Section 2 lists the necessary definitions and gives a brief description of the Vector Space model and Latent Semantic Indexing. Section 3 explains the research hypotheses and experimental design. Section 4 discusses the results, and Section 5 gives the conclusions and outlines future work.

2 Definitions and Background

In this section we introduce the main concepts and methods used in the course of our research. We also briefly describe the Vector Space model (VSM) of Information Retrieval and explain in more detail the method of Latent Semantic Indexing (LSI) and its advantages as compared with the VSM. We will also discuss the reasons behind choosing LSI for the task of parallel corpora identification.

2.1 Definitions

Before going any further, we list the definitions of concepts that are going to be referred to throughout the paper.
1. Parallel corpus – a collection of documents in which at least a subset of documents has translations within the corpus.
2. Fully parallel corpus – a corpus in which each document in the corpus has a translation within the corpus.
3. Partially parallel corpus – a corpus in which only a subset of documents has a translation within the corpus.
4. Literal translation – translation in which the goal is to remain as close to the original choice of words and structures as possible.
5. Literary translation – translation in which the goal is to remain as close to the original meaning and imagery as possible.
6. Synonymy – different terms are used to describe the same concept.
7. Polysemy – the same term describes more than one concept.
8. Mixed LSI space – all documents in the corpus are used to create the LSI space for further analysis.
9. Separate LSI space – a separate LSI representation is created for each monolingual sub-collection of the corpus under consideration.
10. Cognates – words in different languages that are spelled similarly and have the same meaning.

2.2 Vector space model

The three main models of IR are the Boolean, vector space and probabilistic models. We are only going to discuss the Vector Space model in this paper as the most relevant to the issue in question. Information on this and other models may be found in (Baeza-Yates 1999).
In the vector space model, both documents and queries are represented as vectors. The vector components are term frequencies, often normalized by the vector length to account for different lengths of documents. There are a number of methods of computing similarity between these vectors. The most popular is the cosine of the angle between the vectors. The vector space model presents more flexibility for evaluating the degree of similarity between documents than the Boolean model (Baeza-Yates 1999). However, it does not account for the fact that the occurrence of one query term in the document may influence the likelihood of the occurrence of another term. For instance, if the query is "computer networks security", then the fact that the term "computer" occurred in a document makes it more likely that the term "networks" occurs in the same document than, for instance, the term "knitting".

2.3 Latent Semantic Indexing

Latent Semantic Indexing (LSI) can be considered an extension of the vector space model. LSI claims that the meaning of a document is not determined by its set of terms, but rather by a latent semantic structure that manifests itself by how each term is used with all other terms across the entire corpus. This means that replacing one (or more) term(s) in this structure with their synonyms does not change the meaning of the document, provided that the latent semantic structure remains intact. The goal of LSI then is to bring forth this latent semantic structure (Landauer 1989). The starting point for this method is constructing the term-document matrix A for the corpus. The size of the matrix A is [m x n], where m is the number of terms in the corpus and n is the number of documents. Element aij is the frequency of occurrence of term i in document j. Once the term-document matrix is constructed, the LSI method uses Singular Value Decomposition (SVD) to create the LSI space. The result of the SVD is a 3-tuple of matrices U, S, and VT such that

A[m x n] = U[m x m] x S[m x n] x VT[n x n]

where matrices U and V are orthonormal (i.e. U*UT = V*VT = I), and matrix S is diagonal (i.e. sij = 0 for all i ≠ j). Matrix U has size [m x m]. An element uij of U contains the coordinate of term i in dimension j of the LSI space. Matrix VT has size [n x n]. An element vij of VT contains the coordinate of document j in dimension i of the LSI space. The matrix S contains the singular values sorted in descending order. The size of this matrix is [m x n]. The number of non-zero diagonal elements in this matrix equals the rank of the original term-document matrix, which can be at most min(m,n). Therefore, only min(m,n) singular values at the most are significant for the analysis. Also, since the singular values decrease as they go down the main diagonal of S, the dimensions corresponding to those values are of decreasing significance. The first k dimensions of the LSI space represent the best rank-k approximation of the original LSI space (Berry 1995). However, it may not be a good practice to establish a preset cutoff point for the number of dimensions k under analysis. In the course of our experiments we established 10 ≤ k ≤ 15 as the number of dimensions that appeared to be the most useful for the purpose of parallel corpora identification.

3 Method and experimental design

The goal of the present research is to see whether it is possible to identify parallel corpora with computational means rather than through inspection by a trained multi-lingual human professional.
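To make the decomposition concrete, here is a minimal numpy sketch (not from the paper): a tiny invented term-document matrix, the cosine similarity used in the VSM, and the SVD with a rank-k truncation giving the S*VT document coordinates used in the analyses below.

import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (invented counts).
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 3, 0, 2],
              [0, 1, 2, 1]], dtype=float)

def cosine(x, y):
    """Cosine of the angle between two document vectors (the VSM similarity)."""
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

print("VSM similarity of documents 0 and 2:", cosine(A[:, 0], A[:, 2]))

# SVD: A = U @ diag(s) @ Vt, with orthonormal U, Vt and singular values s in
# descending order; keeping the first k dimensions gives the rank-k approximation.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_coords = np.diag(s[:k]) @ Vt[:k]   # the S*VT document coordinates, truncated to k dims
print("document coordinates in the first", k, "LSI dimensions:")
print(doc_coords)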
We assume that given an original document A in a particular language, any adequate translation of this document into another language is going to possess some similarity to the original document. The hypothesis that is based on this assumption is that Latent Semantic Indexing (LSI) is a suitable method for showing the similarities between parallel documents. Thus, the approach taken in the course of this research is to represent both parallel and non-parallel document collections in LSI space and measure the similarities between corresponding documents. In this section we discuss the rationale behind applying LSI to identifying parallel corpora and describe the particular ways LSI was applied for this purpose. We are also going to describe the experimental strategy taken and the data that we used for the experiments.

3.1 Method

The intuitive way of parallel corpora identification is to create some kind of a dictionary, which provides a mapping between a term in one language and its equivalents in other languages. One problem with such a method is that it requires knowledge of what particular languages appear in the corpus. It is also subject to synonymy and polysemy problems. A good method would allow analyzing a corpus without needing to know the languages of the documents that comprise it. Latent Semantic Indexing has one property that is very attractive for processing multilingual corpora: no knowledge of the term meanings is required for the analysis, thus eliminating the need for dictionaries. Once the term-document matrix has been constructed, the only "reality" that exists in the analysis space is the frequencies of term occurrences, with no inherent meaning associated with those frequencies. Of course, this property can also be construed as a shortcoming of this method, since some built-in knowledge could be useful in some situations. For instance, it is possible that there exist two documents with completely unrelated contents, but very similar or identical lexical structure. In that case, LSI will regard the two documents as similar, despite their semantic differences. However, in our experience such occurrences are unlikely. The LSI method provides us with the means to compare the co-occurrence patterns within the corpora in ways that will hopefully bring out the similarities between parts of the corpus that belong to different languages.

3.2 Structure of LSI space

The critical question that needs to be answered when applying the Latent Semantic Indexing method is how to construct the LSI space. The issue of finding translations (mates) within a corpus also relates back to the work of M. Littman, S. Dumais and T. Landauer on Cross-Language Information Retrieval using LSI (Littman 1996). Their approach was to create a training corpus in which each document consisted of corresponding multilingual documents pasted together. This technique allowed them to establish co-occurrence mappings between languages. Thus, when a newly introduced query produced a pattern similar to the one in the training corpus, the documents that had similar co-occurrence characteristics in all languages participating in the training corpus were returned as relevant. Our assumption that translations of the same document are similar in ways that independent documents are not allows us to use an alternative strategy, i.e. to eliminate the training corpus and consider only the correlation between coordinates in LSI space of multi-lingual documents.
We used two different ways of mapping documents from a multilingual corpus into the LSI space. One way was to process the entire multilingual corpus simultaneously and construct a single term-document matrix. This means that every vector in matrix U of the LSI decomposition contains the values for all the terms, and that the LSI space will be based on the term co-occurrence patterns not only across documents, but across language boundaries as well. Such a space is called mixed LSI space throughout this paper. The second way was to create a separate LSI space for every monolingual part of the corpus and compare the representations of corresponding documents or document vectors in the same way as in the case of mixed LSI space. We used both mixed LSI space and separate LSI space approaches in this work, but an in-depth analysis of their respective merits is beyond the scope of this paper. This research does not rely on the existence of cognates (i.e. words in different languages that are spelled similarly and have the same meaning). In the case of the mixed LSI space, every column of the term-document matrix contains frequencies, raw or weighted, for every term in the corpus. If cognates were present in significant numbers, then the actual frequency with which a given term occurred in patterns in a particular language would be obscured. The assumption for this analysis is that no or very few cognates exist in the corpora and that they have no significant impact on its outcome.

3.3 "Mate" dimensions

Since we are interested in measuring parallelism of corpora, rather than concentrating on individual documents, the objects of our analysis are the vectors of document coordinates within a particular dimension, rather than the coordinates of a particular document in the LSI space. By comparing the monolingual parts of these vectors to each other, we claim that if the corpus is, indeed, parallel, then there is at least one pair of dimensions in the LSI space in which the corresponding monolingual sub-patterns are similar. This claim implies that the sub-vectors of different dimensions may be mated, i.e. the patterns in one language reflected in one dimension may be coupled with a similar pattern in another language. Therefore, for some, if not all, dimensions there exists a "mate" which has patterns of the multilingual sub-vectors that are similar in a quantifiable way. This hypothesis is a "twist" on the idea of finding mates developed by Dumais et al. (Dumais 1997). In the case of separate LSI spaces, one can argue that, considering the inherent similarities in the subcorpora of a parallel corpus, there would also be similarities in their LSI representations that would manifest themselves despite the grammatical differences between the languages in which these subcorpora are expressed. This might mean that the idea of mate dimensions applies also to separate LSI, although in this case there is no common "context" of the kind created in the case of mixed LSI by processing the entire corpus into a single LSI space. A special case of this "mate dimension" hypothesis exists for languages with similar syntactic structure (e.g. English and French). That is, the closer the grammatical structures of the languages, the more the "mate" dimensions converge, to the point where in some dimensions the sub-vectors corresponding to each language form a similar pattern.

3.4 Data

We constructed a parallel corpus from various digital libraries. We included literary works by Jack London, Sir Arthur Conan Doyle, A.
de Saint-Exupery, A. Dumas père, and the United Nations corpus. The experiments were performed on the following language combinations: English/French, English/Russian, French/Russian, and English/Russian/Italian. The texts were divided into excerpts of 550-600 words (about one page long). The terms used for this research were 5-grams (Damashek 1995). An element aij of the term-document matrix A was the frequency of occurrence of the 5-gram i in the document j. Thus, the number of terms in each document was around 3000. The use of n-grams allowed us to use the same parser for each language, bypassing stemming and stop-listing. N-grams have also proven useful with other LSI-based techniques (Nicholas 1998). The documents in the corpus were grouped together by language. In a parallel corpus consisting of n monolingual documents and their translations, document i has document i+n/2 as its corresponding translation. The number of dimensions in the LSI spaces constructed in the course of this research varied from 20 to 80.

3.5 Plotting and measuring

We used the standard plotting capability provided by MATLAB to plot the values in the S*VT matrix. The plots were all two-dimensional. The X-axis shows the sequence number of the document (and of its corresponding translation) in the collection, and the Y-axis shows the coordinate of that document in a particular dimension. Lines connect the document points in order to make the patterns easier to trace. We also analyzed the correlation coefficients between all permutations of monolingual sub-vectors of each row of the S*VT matrix. This correlation analysis produced as a result a matrix of correlation coefficients of size [n, n]. Of this matrix only the highest values were chosen to illustrate the results of the experiment. These results are shown as in Table 1, below. The first and the second column hold the LSI dimension numbers of the sub-vectors participating in the correlation analysis. The third column holds the actual correlation values. The header row indicates the languages of the sub-collections that produced the corresponding sub-vectors. The highest absolute values were selected for display. We also present graphs in which the two sub-vectors for each dimension are plotted for comparison, as shown in Figure 1, for example.

Table 1. Correlation coefficient table sample.
English sub-vector dimension   Russian sub-vector dimension   Correlation coefficient
1                              2                              0.8334
3                              3                              0.8902
3                              4                              0.8610
3                              5                              0.7996
7                              7                              0.7831
7                              8                              0.8342

Fig. 1. Plotting schema. The coordinate points for the documents are connected in the plots in order to make the pattern formed by them more evident. Each language participating in the experiment has its own color legend: red for English, green for French, blue for Russian, and black for Italian.

4 Results

This section is structured in the following way: we will first show the difference between the parallel and non-parallel corpora representations; we will then show, within parallel corpora, examples of grammatically similar and dissimilar language combinations. We will also discuss literal vs. non-literal translation. The majority of the research was done on mixed LSI spaces. However, the separate LSI space experiments produced interesting results that will be mentioned briefly in this section.

4.1 Mixed LSI Space

4.1.1 Parallel vs. non-parallel corpora
The results shown in this category are very important because they confirm the main hypothesis of this research – that the representation of a parallel corpus presents similarities between parallel parts of the corpus that are absent in a non-parallel corpus. The documents in this collection come from "Smoke Bellew" by Jack London, namely the parts called "The Taste of the Meat" and "The Race for Number One". In the case of a parallel corpus, the excerpts from both novels were combined into a single collection with their respective translations and the LSI space was created from them. In the case of a non-parallel corpus, the English part of the corpus was comprised of "The Race for Number One" excerpts and the Russian part from "The Taste of the Meat" excerpts. The highest-value portion of the correlation coefficient matrix for the parallel corpus part of this experiment (see Table 2) shows that the values are well above the tentative threshold of 0.75 assumed in the course of the experiments. Also, the S*VT plot in Figure 2 shows some obvious similarities. For instance, the sub-vectors in dimensions 5 and 7 present opposite trends, whereas dimension 6 has similar patterns for both sub-vectors. The English sub-vector in dimension 4 is very similar to the Russian sub-vector in dimension 5. Thus, the results of this experiment support the hypothesis.

Table 2. Correlation coefficients for parallel corpus from "Smoke Bellew".
English sub-vector dimension   Russian sub-vector dimension   Correlation coefficient
3                              3                              0.8621
4                              4                              0.8793
4                              5                              0.9318
6                              6                              0.7890

The analysis of the parts of this corpus separately showed that "The Taste of the Meat" was less precisely translated into Russian than "The Race for Number One". This might account for the lower values for the former collection. Also, the patterns formed by each sub-corpus in this experiment remain distinct when the collections are combined. The first 15 dimensions for the experiment involving the non-parallel collection from the same texts did not produce nearly as high similarity estimates as the previous experiment (see Table 3). However, there is still some similarity of behavior visible in dimensions 1 and 2, 3 and 4, and 7 and 8 of the S*VT matrix plot (see Figure 3). The reason may be that the novels were written by the same author, and on a similar topic. However, in this experiment, the similarities in the visual representation of the corpus are accompanied by low values of correlation coefficients.

Table 3. Correlation coefficients for non-parallel corpus from "Smoke Bellew".
English sub-vector dimension   Russian sub-vector dimension   Correlation coefficient
1                              3                              0.6368
7                              17                             0.6168
7                              18                             0.6173
1                              2                              0.5439
4                              3                              0.3629
8                              7                              0.2568

Figure 2. Parallel corpus example, English and Russian.
Figure 3. Non-parallel corpus example, English and Russian.

We can see that the highest correlation coefficient for the non-parallel corpus is 0.64 and the lowest correlation coefficient for the parallel corpus shown above is 0.78. We therefore set a tentative threshold of 0.75 for the correlation coefficients in judging a corpus as parallel. The data for the non-parallel corpora that we have tested with so far have proven consistent in not producing correlation coefficient measures above 0.70 (usually much lower).

4.1.2 Similar vs. dissimilar languages

The texts examined in this category are the French original of "The Little Prince" by Antoine de Saint-Exupery with its Russian and English translations.
The analysis procedure for this category is similar to the one demonstrated in Section 4.1.1. In this series of experiments, the ranges of values for grammatically similar and grammatically dissimilar languages are not significantly different. The highest correlation coefficient for the English/French pair is 0.99, and 0.98 for the French/Russian pair. In other words, both grammatically more and less similar languages participating in parallel corpora produce high correlation values, and do not produce them in a non-parallel corpus. However, depending on the style of the translation, the difference between these values may be more significant, although the experiments done so far show that these values still fall within the 0.75-1.0 range. We made the following observations about the S*VT matrix. In the French/English collection of excerpts from "The Little Prince" we saw that, starting with dimension 1, similar patterns are most often observed within the same dimension, e.g. dimension 1, dimension 3, etc. In dimension 4 we saw an effect of opposing patterns, where the English pattern is the same as in the previous dimension, but the French is the opposite. Thus, the idea of "mate" dimensions converging towards the same dimension seems to be supported. The French/Russian representation also presents strong similarity trends, but here the "mate" dimensions are more dispersed, e.g. similar patterns occur in dimensions 1 and 2, 3 and 4, 5 and 6, 8 and 12, etc. This also seemed to confirm the "mate" dimensions suggestion. The highest correlation coefficients for this series of experiments were between 0.88 and 0.99.

4.1.3 Literal vs. literary translation

A literary translation is intended to transmit both the meaning and the spirit of the original work. This means that, considering that different languages have different imagery mechanisms, code words, and idioms, it is almost impossible to translate a literary text literally into another language and obtain a translation that is worth reading. Thus, the only claim that can be made about the usability of LSI in application to literary translation is that it measures not so much the literality of a translation as its consistency; the style of the original and of the translation needs to have a similar "texture". For instance, if the author of the original work uses a single image for describing the blue sky and the translator uses three different images for the same purpose, the consistency of the translation will suffer. The examples that we worked with here are the excerpts from "Smoke Bellew", specifically parts of "The Race for Number One" and "The Taste of the Meat". In this pair, "The Taste of the Meat" is the less literal translation. The correlation coefficient values for this experiment are not very high (the highest is 0.86), nor are there many dimensions in which they are high. However, these values lie within the assumed threshold, and the dimensions that possess these high similarity values are within the first 10-15 dimensions. The analysis of the second element in the selected pair, "The Race for Number One", produced much higher correlation coefficient values (the highest 0.9), and there is a significant number of dimension pairs with similarity values above 0.75. The separate LSI space approach presented similar results to those described for the mixed LSI space, which seems to support our hypothesis.
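As a recap of the procedure described in Sections 3 and 4, the following self-contained Python sketch (not the authors' MATLAB code) tokenizes documents into character 5-grams, builds a mixed LSI space with an SVD, and correlates the monolingual sub-vectors of the S*VT rows against the 0.75 threshold. The miniature sentences are invented and far too small to be meaningful; the real experiments used 550-600-word excerpts and 20-80 dimensions.

import numpy as np
from collections import Counter

def five_grams(text):
    """Character 5-grams as language-independent 'terms' (Damashek-style)."""
    s = " ".join(text.lower().split())
    return [s[i:i + 5] for i in range(len(s) - 4)]

def term_document_matrix(docs):
    counts = [Counter(five_grams(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    return np.array([[c[t] for c in counts] for t in vocab], dtype=float)

def mate_dimension_correlations(docs_lang1, docs_lang2, k=5):
    """Mixed LSI space over both languages; correlate the monolingual
    sub-vectors of the S*VT rows for every pair of dimensions."""
    A = term_document_matrix(docs_lang1 + docs_lang2)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    svt = np.diag(s) @ Vt                       # rows = LSI dimensions
    n1 = len(docs_lang1)
    k = min(k, svt.shape[0])
    pairs = [(i, j, np.corrcoef(svt[i, :n1], svt[j, n1:])[0, 1])
             for i in range(k) for j in range(k)]
    return sorted(pairs, key=lambda p: -abs(p[2]))

# Invented miniature 'corpus': three documents per language, in parallel order.
english = ["the little prince lives on a very small planet",
           "every morning he waters a single proud rose",
           "a lonely fox asks the prince to tame him"]
french  = ["le petit prince vit sur une toute petite planete",
           "chaque matin il arrose une seule rose orgueilleuse",
           "un renard solitaire demande au prince de l'apprivoiser"]

for dim_en, dim_fr, r in mate_dimension_correlations(english, french)[:3]:
    verdict = "above 0.75 threshold" if abs(r) >= 0.75 else "below threshold"
    print(f"dimensions ({dim_en}, {dim_fr}): r = {r:+.2f}  ({verdict})")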
For additional information on these and our other experiments, we refer the reader to the first author's thesis (Katsnelson 2000).

5 Conclusion

The application of the LSI method and the subsequent correlation analysis showed that parallel corpora produce larger correlation coefficient values for rows of the S*VT matrix of the term-document matrix. The row plots of this matrix are also visibly more similar between the parts of a parallel corpus than between the parts of a non-parallel corpus. We analyzed several aspects of corpora comparison, such as parallel vs. non-parallel collections, literal vs. non-literal translations, and document collection pairs in languages with different degrees of grammatical similarity. The proposed threshold of 0.75 held for all types of parallel corpora. The experiments also showed that visual analysis is as important for the application of our method to parallel corpora identification as the statistical analysis. The limitation of our experiments has been that the number of dimensions that we have worked with was low, due to the computational expense of SVD and the limited number of documents. In the future we are planning to address this issue by both increasing the dimensionality of the experimental LSI space and considering alternative methods of dimensionality reduction that are less computationally demanding. We also hope to extend this method to evaluating the quality of translation.

References

T. K. Landauer, M. L. Littman, "Computer information retrieval using latent semantic structure". U.S. Patent No. 4,839,853, Jun 13, 1989.
S. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the Society for Information Science, 41(6), 391-407, 1990.
R. Baeza-Yates and B. Ribeiro-Neto, "Modern Information Retrieval", ACM Press, New York, 1999.
M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using Linear Algebra for Intelligent Information Retrieval. SIAM Review, 37(4), pp. 573-595, 1995.
M. L. Littman, S. T. Dumais, and T. K. Landauer. Automatic Cross-Language Retrieval Using Latent Semantic Indexing. SIGIR96 Workshop on Cross-Linguistic Information Retrieval, 1996.
S. T. Dumais, T. A. Letsche, M. L. Littman, and T. K. Landauer. "Automatic cross-language retrieval using Latent Semantic Indexing", AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, March 1997.
M. Damashek, "Gauging similarity with n-Grams: Language-Independent Categorization of Text", Science, February 1995, Vol. 267, pp. 843-848.
C. Nicholas and R. Dahlberg. Spotting Topics with the Singular Value Decomposition. PODDP'98, St. Malo, pp. 82-91, March 1998.
Y. Katsnelson, "Parallel Corpora Identification Using LSI", M.S. thesis, UMBC Department of Computer Science and Electrical Engineering, January 2000, http://www.cs.umbc.edu/~ykatsn1.

Exploiting large corpora: A circular process of partial syntactic analysis, corpus query and extraction of lexicographic information

Hannah Kermes & Stefan Evert
Institut für Maschinelle Sprachverarbeitung, University of Stuttgart, Germany

1 Introduction

Our approach follows the work of Eckle-Kohler (1999), who used a regular grammar to extract lexicographic information from text corpora. We employ a system that allows us to improve her query-based grammar, especially with respect to recall and speed, without reducing accuracy.
In contrast to Eckle-Kohler (1999), we do not attempt to parse a whole sentence or phrase at once during the extraction process, but build complex structures incrementally. The intermediate results are annotated in the corpus and used as input for following iterations.1 This concept enables us to accommodate new aspects such as agreement information and the use of annotated structures together with their features (for partial parsing as well as for extraction purposes). Our goal is to design a tool that can be used with both small and large corpora. In addition to partial syntactic analysis we provide queries (based on the parsing results) for interactive use. The idea is to build up flat annotations of (maximal) syntactic constituents (noun phrases (NP), prepositional phrases (PP), adjectival phrases (AP), adverbial phrases (AdvP) and verbal complexes (VC)) incrementally, using a multi-pass algorithm. The chunks/phrases allow embedding of chunks/phrases of other categories as well as recursive embedding, but no PP attachment. The incremental structure-building procedure enables us to analyse chunks/phrases independently of their immediate context, i.e., even if we cannot parse the whole sentence or phrase we might still parse part of it. Analyses of complex chunks/phrases have to be executed only once, the results being annotated. Consequently, even when dealing with very large corpora, interactive queries and the extraction process are relatively fast. Besides, as features specifying the character of chunks/phrases are annotated, structures having certain characteristics can be extracted easily and quickly. We are also able to include agreement morphology in the process. Thus, we can check the content of chunks/phrases with respect to agreement features such as case, number, and gender. Agreement information of chunks/phrases is disambiguated to the extent possible without guessing and added to the structural mark-up.

2 Technical framework

Our tools are based on the IMS Corpus Workbench2 (CWB). The CWB is an environment for the storage and querying of large corpora with shallow annotations. Currently, the maximum size of a single corpus can be approximately 300 million words, depending on the number and type of annotations. The CWB provides fast access to corpora by using a separate lexicon and a full index for each annotation level. The data are stored in a compact proprietary format, and compressed with specialised algorithms (Huffman coding for the token sequence and variable-length encodings for the indices). The CWB was initially developed for corpora annotated at the token level only (typically with part-of-speech (PoS) and lemma values). Later, support for flat, non-overlapping structural annotation was added (referred to as structural attributes). Since this mark-up was intended for the annotation of document structure (e.g., source files, paragraphs, and sentences), the regions of structural attributes are neither hierarchical nor do they allow recursion. Besides, compression algorithms are not necessary to store the relatively small number of sentences, paragraphs, etc. in a corpus. Queries can be specified in terms of regular expressions over tokens and their linguistic annotations, using the Corpus Query Processor component (CQP). In contrast to most CFG-based parsers, the CQP query language allows complex expressions at the basic token level.
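As a rough, language-neutral illustration of what a query over annotated tokens (rather than over characters) looks like, the Python sketch below matches a determiner-adjective-noun pattern against a toy part-of-speech-tagged token list. This is only an analogy; it is not CQP syntax, and the token list and tags are invented.

import re

# Toy corpus: (word, PoS) pairs standing in for a CWB corpus with a 'pos' attribute.
tokens = [("das", "ART"), ("alkoholfreie", "ADJA"), ("Bier", "NN"),
          ("schmeckt", "VVFIN"), ("gut", "ADJD"), (".", "$.")]

def match_pos_pattern(tokens, pattern):
    """Match a regular expression over the PoS sequence and map the match
    back to token spans (a crude stand-in for token-level corpus queries)."""
    pos_string = " ".join(pos for _, pos in tokens)
    spans = []
    for m in re.finditer(pattern, pos_string):
        start = pos_string[:m.start()].count(" ")    # index of first matched token
        length = m.group().count(" ") + 1            # number of matched tokens
        spans.append(tokens[start:start + length])
    return spans

# Determiner, any number of attributive adjectives, then a noun: ART ADJA* NN.
for span in match_pos_pattern(tokens, r"\bART(?: ADJA)* NN\b"):
    print([word for word, _ in span])                # ['das', 'alkoholfreie', 'Bier']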
These include regular expression matching of tokens and annotated strings (optionally ignoring case and/or diacritics), tests for membership in user-specified word lists, and arbitrary Boolean expressions over feature-value pairs. Additionally, "global constraints" can be used to specify dependencies between arbitrary tokens in a CQP query. CQP includes a simple macro language, based on string replacement with interpolation of up to 10 arguments. The "body" of a macro (which is substituted for each macro "call" in a query) may contain further (non-recursive) macro invocations. Thus, complex queries can be broken down into small parts, similar to the rules of a context-free grammar. Macro definitions are loaded from text files and can be modified at run-time.

1 A similar approach was pursued by Steve Abney for English, using a cascaded finite-state parser (cf. Abney 1991 and 1999).
2 See Christ (1994) for an overview. More information on the IMS Corpus Workbench is available from http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench.

The CWB includes support for set-valued annotations, encoded as string values using a special disjunctive notation. This allows the treatment of agreement features, which is one of our improvements over the work of Eckle-Kohler (1999). For example, all possible combinations of case, gender, and number values a certain noun may have are stored as a single, set-valued annotation (a feature vector). Unification of feature values (which is the basis of most complex grammar formalisms) is equivalent to the (set-theoretic) intersection of the corresponding feature vectors (illustrated with a small sketch in Section 5 below). Special operators based on regular expressions are available to test for the presence of features (e.g. a noun phrase which might have genitive case) as well as the uniqueness of feature values (a noun phrase uniquely identified as genitive).

We considered a number of alternative approaches for the parsing stage as well, in particular those for which standard tools are available.
· Complex grammars (e.g., in the LFG or HPSG framework) can model the hierarchical structure of language, and are well-suited for handling attachment ambiguities. Drawbacks include slow parsing speed, lack of robustness, dependence on an extensive lexicon as a prerequisite, and the complex interactions between rules, which seriously complicate both grammar development and adjustment to a particular domain. Furthermore, complex grammars usually return a large number of possible analyses for each sentence, which cannot be stored and queried efficiently for large corpora. Thus, an additional, and probably rather unreliable, disambiguation component or labour-intensive manual disambiguation would be necessary.
· Context-free grammars (CFGs) are modular (i.e. there is little interaction between different rules) and allow for fast parsing. In most CFG-based systems, however, modelling agreement and special (lexical) constructions requires large numbers of additional rules, which makes grammars unwieldy and slows down the parsing process. For the automatic analysis of large amounts of text, further "robustness" rules are needed. Partial or full disambiguation of agreement information is difficult to achieve.
· Probabilistic context-free grammars (PCFGs) extend the CFG formalism with a statistical model of lexical information such as subcategorisation frames and collocations. In contrast to complex grammars, probabilistic "lexicon entries" are learned from the input text and training data without human intervention.
PCFG-based parsers are slower than their CFG counterparts and require a considerable amount of working memory for their large parameter sets. A particular problem for PCFGs is posed by marked constructions and other special cases, where the parser almost inevitably prefers a more frequent unmarked alternative. In general, PCFG parsers perform a full disambiguation of agreement features, involving guesswork, rather than the partial disambiguation that we prefer.

To sum up, the advantages which led us to use the CWB as a framework for our tools are: (i) The possibility to work with large corpora. After compression, the surface forms and lexical annotations (lemma, etc.) require approximately 30 bits/token of disk space, whereas categorical annotations (PoS, agreement features, etc.) require 10 bits/token or less. (ii) CQP efficiently evaluates complex queries on large corpora. Disk files are accessed directly and do not have to be loaded into memory first. It is this feature which makes a multi-pass algorithm (in which CQP is frequently restarted in order to re-use intermediate results; see Section 4 for details) feasible at all. (iii) The query language is modular and allows easy treatment of special cases (using additional rules, word lists, or structural mark-up of multiword entities). (iv) The same representation formalism and query language can be used at the parsing stage, for interactive querying of the final results, and for the extraction of lexical information.

3 General concept

The general concept of our approach is to combine two usually separate processes (see also Figure 1): (i) annotation of syntactic structures and (ii) extraction of linguistic information. Central to our approach is a set of hierarchical query macros. These macros serve basically two purposes: (i) they can be used by an annotator tool to analyse syntactic structure and (ii) they can be used interactively to extract linguistic information. In the first case, the results of the queries are annotated as structural mark-up in the corpus, incrementally building up larger and larger structures. In the latter case, the queries are used by a human user to extract linguistic information from the corpus. For that purpose, the queries use the structural mark-up the annotator tool constructed by means of the queries themselves.

Figure 1: general concept (components: queries, annotator tool, corpus, linguistic information)

The results of the extraction process (in the form of linguistic or lexicographic information) may then be used either directly by the user or they may be fed back into the queries to improve them. Useful information for the latter includes word lists, subcategorization information, etc. The annotator tool as well as the human user can then use the refined queries to optimise both the annotation and the extraction process.

4 The parser

The parser is a combination of Perl scripts and CQP queries applied in multiple passes. The results produced by CQP are post-processed by Perl scripts. These scripts check the results with respect to agreement and possibly other criteria, rejecting inappropriate structures. The remaining structures are then annotated in the corpus. This procedure is repeated several times in order to build more and more complex structures. The annotated results of previous steps are taken as additional input for further steps. In general, there are two different types of passes: (i) a primary "preparing" pass that is executed only once, and (ii) a secondary pass that is run several times (see Figure 2).
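To make the two-pass design concrete, here is a toy Python sketch (not the authors' Perl/CQP implementation): a primary pass marks simple chunks once, and a secondary pass, which in the real system is iterated, grows larger phrases around the previously annotated chunks. Deletion of smaller structures subsumed by larger ones is omitted here; tags and tokens are invented.

# Toy tokens (word, PoS) and chunk annotations as (label, start, end) spans.
tokens = [("die", "ART"), ("neue", "ADJA"), ("Regierung", "NN"),
          ("des", "ART"), ("kleinen", "ADJA"), ("Landes", "NN")]

def primary_pass(tokens):
    """Primary pass (run once): mark simple, non-recursive adjective chunks."""
    return [("AC", i, i + 1) for i, (_, pos) in enumerate(tokens) if pos == "ADJA"]

def secondary_pass(tokens, chunks):
    """Secondary pass (iterated in the real system): grow NPs of the form
    ART? AC* NN around chunks annotated in earlier passes."""
    annotated = {(s, e) for _, s, e in chunks}
    found = []
    i = 0
    while i < len(tokens):
        start = i
        if tokens[i][1] == "ART":
            i += 1
        while i < len(tokens) and ("AC", i, i + 1) in annotated:
            i += 1                       # re-use previously annotated AC chunks
        if i < len(tokens) and tokens[i][1] == "NN":
            found.append(("NP", start, i + 1))
            i += 1
        else:
            i = start + 1
    return chunks + found

chunks = primary_pass(tokens)
chunks = secondary_pass(tokens, chunks)  # further iterations could embed these NPs again
print(chunks)    # [('AC', 1, 2), ('AC', 4, 5), ('NP', 0, 3), ('NP', 3, 6)]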
Figure 2: the parsing process (Pass 1 followed by iterated Pass 2)

In the primary pass simple non-recursive chunks/phrases are built. These chunks/phrases are mostly syntactically and/or lexically special chunks, i.e. chunks/phrases that are either special with respect to their distribution, their grammatical function, or their structure. Collecting and annotating these chunks in a preliminary step enables us to model them independently of other chunk/phrase rules, which considerably improves performance and speeds up the parsing process. The queries can be adjusted to the special characteristics (lexical or structural) of the chunks/phrases. These features are made accessible in the form of chunk annotations. Thus, in further steps these chunks/phrases may be either treated as regular chunks or according to their special character.

The secondary pass constitutes the main part of the parsing process. It is designed to run several times. The number of runs is not predetermined. Currently, we run it three times, which seems to provide a sufficient depth of embedding for most of the text in our corpora. Each iteration of the secondary pass takes the annotated structural mark-up of all previous steps (primary or secondary) as input to build more complex chunks/phrases. The macro queries resemble the rules of a context-free grammar (CFG), i.e., they state that chunks/phrases may contain a specifier, certain embedded chunks/phrases (we do not distinguish between modifiers and complements) and a head. Yet, in contrast to Eckle-Kohler (1999) and other CFG approaches, the embedded chunks/phrases do not have to be re-analysed for every query but can be accessed as structural mark-up (annotated in one of the previous steps). This makes it easier and faster to query optional elements, optional coordination and the special elements/chunks described above (see Table 1 and Table 2 for details). Recursion, in this context, is simply the inclusion of chunks/phrases of the same category that have been built in previous steps. Chunks/phrases built in intermediate steps need not be the most complex structure possible, but are annotated to serve as input for further steps. Whenever a larger structure is found, the Perl scripts delete the smaller structure in favour of the new and more complex one. If the most complex structure of a phrase is not found, because of tagging errors or gaps in the grammar rules or the lexicon, the intermediate structures remain and may, nevertheless, contribute to future extraction.

5 Grammar details

In the primary step special chunks/phrases are annotated along with features specifying their characteristics. These chunks are identified using one or more of the following criteria: (i) PoS-tags (multiword proper nouns), (ii) certain markers in the text itself such as brackets or quotes (terminology, multiword units), (iii) a lexicon in the form of word lists or subcategorisation information (temporal nouns, measure nouns, complex adjectival phrases), (iv) morpho-syntax (invariant or verbally derived adjectives). The features annotated with adjectival phrases and noun phrases are listed in Table 1 and Table 2 below. Besides these lexical/structural features, we have also annotated (partially) disambiguated agreement features and the head lemma. As in complex grammars such as LFG or HPSG, the characteristics of the head project to the chunk/phrase.
Similarly, the characteristics of intermediate chunks project to larger chunks/phrases with the same head. This projection of features is made possible by the use of Perl scripts that copy annotated features from smaller to larger structures, modifying them where necessary. For instance, agreement disambiguation is performed in every intermediate step (where possible). The partially disambiguated agreement features are annotated along with the chunk/phrase. Later disambiguation steps use the annotated agreement features of relevant chunks/phrases (e.g., in the case of noun phrases, we use the agreement morphology of the determiner, the embedded adjectival phrases, and the noun chunk containing the head). Thus, (partial) disambiguation for a certain set of words does not have to be repeated. A special method was chosen to check the agreement of multiple APs embedded in an NP. Every AP is annotated with a feature specifying its suffix (more precisely, the last letter of the adjective head, which is extracted by a Perl script in the first pass). All APs sharing the same suffix are assumed to agree in their morpho-syntactic features as well.3 APs whose heads are invariant adjectives are ignored.

3 The method was adopted from the German Gramotron grammar (cf. Schulte 2000).

Table 1: annotation features of adjectival chunks/phrases
feature   description                       examples
invar     invariant adjective               Berliner Justiz; fünfziger Jahre
vder      verbally derived adjective        Beginnender Aufschwung; überzeugende Antworten
meas      AC/AP embedding a measure noun    Hektar groß; Meter langen
norm      regular AC/AP
quot      AC/AP in quotes                   "falsche"; "allzu plötzlich"
pp        AP embedding a PP or year date    von seiner Frau geborgten; schon vor Jahren entdecktes

Table 2: annotation features of noun chunks/phrases
feature   description               examples
ne        proper noun               Walter Holm; Mecklenburg-Vorpommern
time      temporal noun             Mai, Feierabend
year      year date                 1999
meas      measure noun              Dollar Strafe; Hektar Pachtland
news      news agencies             LONDON (afp / rtr / AP)
address   street name and number    Musikantenweg 14; Museumsgasse 1
tel       telephone numbers         Tel. 23 33 25 oder 23 15 47
trunc     nouns with truncs         Schadenersatz- oder Schmerzensgeldanspruch; Verantwortungs-, Solidaritäts-, Gerechtigkeits-, Gleichheits- und Freiheitsdenken
quot      NC/NP in quotes           "Autonome Organisation"; "United Democrats"
brac      NC/NP in brackets         (LKA); (Framingham Heart Study)

The chunks/phrases are built using relatively simple queries. A noun phrase, for example, consists of a noun or year date as its head and only obligatory element, which may have truncated elements (cf. Table 2). In pre-head position there are optionally a determiner, a cardinal number and a (theoretically) unlimited number of adjectival phrases. Cardinal numbers and adjectival phrases may be coordinated. Post-head there are optionally an unlimited number of genitive NPs, plus a noun chunk in quotes and/or in brackets:

das alkoholfreie Bier "Kelts" (the nonalcoholic beer "Kelts")
den Namen "Werner-Herr-Haus" (the name "Werner-Herr-Haus")
das Quartett "Itchy Fingers" (the quartet "Itchy Fingers")
der Telefonnummer 602-316 (Herrn Borns) (the telephone number 602-316 (of Mister Born))

The pre-head as well as the post-head elements have to occur in the given order, yet they do not depend on their predecessors. Table 3 gives a graphical overview of the noun phrase structure we use.

Table 3: elements of noun chunks/phrases
pre-head (optional): determiner; CARD*; adjectival phrases*
head: noun (possibly with truncs*); proper noun; year date; substitutive pronoun
post-head (optional): genitive NC/NP; proper noun; NC in brackets; NC in quotes
(* optional coordination)
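Before turning to adjectival phrases, the set-valued agreement annotation introduced in Section 2 and the step-wise disambiguation described above can be pictured with a small Python sketch (a loose analogy, not the CWB encoding): each chunk carries the set of case/gender/number readings it still allows, and combining chunks intersects these sets. The German readings below are hand-picked and simplified for illustration.

# Each reading is a (case, gender, number) triple; a chunk's agreement annotation
# is the set of readings it still allows (a "feature vector" in the paper's sense).
det_der   = {("nom", "masc", "sg"), ("gen", "fem", "sg"),
             ("dat", "fem", "sg"), ("gen", "*", "pl")}
adj_alten = {("gen", "*", "sg"), ("dat", "*", "sg"), ("acc", "masc", "sg"),
             ("nom", "*", "pl"), ("gen", "*", "pl"), ("dat", "*", "pl"), ("acc", "*", "pl")}
noun_frau = {(c, "fem", "sg") for c in ("nom", "gen", "dat", "acc")}

def unify(a, b):
    """Unification of agreement annotations = intersection of reading sets,
    with '*' acting as a wildcard that matches any value."""
    def compatible(r1, r2):
        return all(x == y or "*" in (x, y) for x, y in zip(r1, r2))
    def merge(r1, r2):
        return tuple(y if x == "*" else x for x, y in zip(r1, r2))
    return {merge(r1, r2) for r1 in a for r2 in b if compatible(r1, r2)}

np_readings = unify(unify(det_der, adj_alten), noun_frau)
print(np_readings)                          # partially disambiguated: gen/dat fem sg
print({case for case, _, _ in np_readings}) # 'der alten Frau' stays ambiguous between gen and dat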
Optional pre-head elements are adverbial phrases, a modifying particle (allzu [großer] (far too [big]); zu [hohen] (too [high])), and, in the case of verbally derived adjectives or adjectives subcategorising PPs, a prepositional phrase or a year date (see Table 4). A prepositional phrase simply consists of a preposition as its head and an obligatory noun phrase (which may optionally be coordinated).

Table 4: elements of adjectival chunks/phrases
  pre-head (optional): adverbial phrase, modifying particle; for verbally derived adjectives or adjectives subcategorising PPs also a prepositional phrase or year date
  head:                adjective

6 Evaluation

For evaluation purposes, we applied our parser to a 40-million-word newspaper corpus.4 The corpus was preprocessed and part-of-speech tagged with standard tools (cf. Schmid 1995 and Schiller 1996). In order to reduce memory consumption during the parsing process, the corpus was automatically split into slices of approximately 500,000 tokens. Each slice was encoded as a separate corpus on which the parser was run. The structural annotations of all slices were then recombined and annotated in the original corpus. The parsing process5 took 13.5 hours on a standard 933 MHz Pentium III notebook with 128 MBytes of RAM. This amounts to an average speed of 3 million words per hour. Processing speed varied across different slices, with some slices taking almost twice as long as the “fastest” ones. For a first quality assessment, we selected 100 sentences at random (excluding incomplete sentences and sentences containing spelling mistakes) and manually annotated noun phrases according to the extended chunk concept introduced in section 5. Unlike other manual annotation tasks, there is little ambiguity in the assignment of noun phrase boundaries when PP attachment is excluded, and agreement between the authors was easily reached.

4 The Frankfurter Rundschau (FR) corpus from the ECI Multilingual CD-ROM I.
5 Including the splitting and recombination steps, but not including preprocessing and part-of-speech tagging.

We did not evaluate the agreement features assigned to the phrases because they are only partially disambiguated and there is no guesswork on the side of the parser (any errors in case assignment etc. are due either to the morphological analysis or to tagging errors). For the same reason, other phrase types (such as PPs or APs) were not evaluated. PPs, for instance, can easily be identified by the parser once the corresponding NP has been found. In the 100 test sentences, we found 477 noun phrases. Automatic parsing yielded 487 NPs, of which 440 were true positives. This corresponds to a precision of 90% (i.e. 90% of the NPs identified by the parser were correct) and a recall of 92% (i.e. 92% of the manually annotated NPs were also found by the parser). Looking at the results in detail, two major factors account for the number of false positives and NPs that were not found: tagging errors (e.g., the colon : tagged as a noun) and proper nouns that were not correctly identified (e.g., Thomas Doll, where Doll was erroneously tagged as an adjective).
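The reported figures follow directly from the raw counts; the short check below simply reproduces the arithmetic:

    # 100 test sentences: 477 manually annotated NPs (gold standard),
    # 487 NPs returned by the parser, 440 of them correct (true positives).
    gold, found, true_positives = 477, 487, 440
    precision = true_positives / found        # 440/487 -> 90%
    recall = true_positives / gold            # 440/477 -> 92%
    false_positives = found - true_positives  # 47, as discussed below
    print(f"precision {precision:.0%}, recall {recall:.0%}, false positives {false_positives}")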
If we correct such errors (which can easily be done using lists of proper nouns etc.), the number of false positives drops from 47 to 13, and we obtain precision and recall values of 97% and 98%, respectively. Another four of the eight noun phrases that our parser still cannot identify contain adjectives with PP complements. Feeding adjective subcategorisation frames extracted from the annotated corpus back into the parsing process will thus further improve the results. 7 Extraction As mentioned above macro queries may not only be used for annotation but also for extraction. For this purpose they build on the results of the annotation process, searching on structural mark-up that they have produced before. In this case, the annotated features are an important knowledge source making morpho-syntactic information and characteristics of the chunks/phrases easily accessible. The extraction process itself can be automatic or semi-automatic depending on the query, i.e., the results of the queries may need manual checking before they can be used for the different purposes (e.g., lexicographic or linguistic). Interesting for the extraction process are, e.g., chunks/phrases enclosed in brackets or quotes. These “structural” markers are relatively secure signs of elements belonging together. Thus, the elements enclosed can give hints regarding subcategorisation information of various kinds, but also with respect to multiword units, idiomatic expressions and collocations. Multiword units, in particular multiword proper nouns, sometimes occur in brackets or quotes, where they can be assembled securely. The same holds for abbreviations of terms, which often occur in brackets preceded by the respective term. “Teenage Mutant Hero Turtles” (FC Italia Frankfurt) „Club Marienthaler Carnevalisten“ „Rocky Horror Picture Show“ „Johann Strauß Ensemble Frankfurt“ (Sprecher Wolf Hardy Pulina) „Arbeiter Samariter Bund“ Technische Überwachung Hessen (TÜH) Deutscher Aktienindex (Dax) Daimler-Benz Inter Services (Debis) Stickstoffdioxyd (NO2) The annotation of structural mark-up can also help to model sentence positions in which certain elements occur without having to parse the whole sentence. Thus, the information these positions include can be accessed, even if the parse of the whole sentence is not successful. In the first position of German main clauses, for example, which is referred to as Vorfeld within the framework of the topological field model (cf. Wöllstein-Leisten et al. 1997), only one constituent may occur. Adverbs following NPs in this position can, consequently, be supposed to belong to the class of post-noun modifiers. Das “modernistische” Konzept hingegen lebt ... The „modernistic“ concept however lives … 339 Er selbst berichtet ... He himself reports ... Die Rationalität alleine ist ... The rationality alone is ... Herrn Frank persönlich wünsche ... Mister Frank personally wishes ... Das Volk indessen lässt ... The people meanwhile let .... It is also possible to overgenerate certain structures, annotating them only temporarily in the corpus. If they are not embedded in a larger construction or prove to be correct in another way, they may be deleted again, otherwise they remain annotated in the corpus. Due to this possibility, structures can be annotated that would need subcategorisation information not available in the lexicon. These structures can than be queried and taken as evidence for certain subcategorisation frames. 
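One way to picture the temporary annotation just described, before turning to the concrete case below, is the following sketch (invented names; the real system expresses this with CQP queries and deletes the temporary mark-up after the last annotation step):

    # Sketch: overgenerated candidate structures are kept only if a later pass
    # embeds them in a larger construction; otherwise they are deleted again.

    def filter_overgenerated(candidates, final_chunks):
        kept = []
        for cand in candidates:
            embedded = any(c["start"] <= cand["start"] and cand["end"] <= c["end"]
                           and (c["start"], c["end"]) != (cand["start"], cand["end"])
                           for c in final_chunks)
            if embedded:
                kept.append(cand)
        return kept

    # the candidate AP "dafür erforderlichen" (tokens 1-2) is kept because the
    # final NP "Die dafür erforderlichen 300 000 Mark" (tokens 0-4) embeds it
    aps = [{"start": 1, "end": 2, "cat": "AP"}]
    nps = [{"start": 0, "end": 4, "cat": "NP"}]
    print(filter_overgenerated(aps, nps))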
An example is provided by adjectives subcategorising PPs, which can build complex adjectival phrases with the respective PP. In this case, we overgenerate APs, allowing all adjectives, as an intermediate step, to build complex APs with preceding PPs or year dates. The APs are deleted again, unless they are embedded in an NP after the last annotation step, i.e., they are preceded by a cardinal number or determiner belonging to the head noun of the NP.

Einem auf die Betreuung Aidskranker spezialisierten Sozialarbeiter
A on the care of people suffering from AIDS specialised social worker
“A social worker specialising in the care of people suffering from AIDS”

Der für chinesische Verhältnisse kleinen 20 000 Einwohner zählenden Stadt
The for Chinese standards small 20 000 inhabitant having city
“The city with 20 000 inhabitants, a small city by Chinese standards”

Die dafür erforderlichen 300 000 Mark
The for this needed 300 000 Marks
“The 300 000 Marks needed for this”

Die für eine Aufrufung des Rates notwendigen 60 Abgeordneten
The for a summoning of the council necessary 60 delegates
“The 60 delegates necessary to summon the council”

8 Acknowledgments

Part of our work was done within the framework of the DEREKO project. The DEREKO (Deutsches Referenzkorpus) project is a joint project of the Institut für deutsche Sprache (IDS) in Mannheim, the Institute for Natural Language Processing (IMS) in Stuttgart, and the Seminar für Sprachwissenschaft (SfS) in Tübingen. The project is funded by the Ministry of Science, Research and the Arts of the State of Baden-Württemberg. The goal of the project is to improve the infrastructure for text-based linguistic research and development by making accessible a large, well-balanced German text corpus. This corpus is intended as a source for linguistic and lexicographic information. The target group for the resulting infrastructure are lexicographers, dictionary publishers, manufacturers of terminological databases and ontologies, as well as linguists.

9 References

Abney S 1991 Parsing by chunks. In Berwick R, Abney S, Tenny C (eds), Principle-based parsing. Dordrecht, Kluwer Academic Publishers.
Abney S 1999 Partial parsing via finite-state cascades. In Proceedings of the ESSLLI ’96 Robust Parsing Workshop.
Christ O 1994 A modular and flexible architecture for an integrated corpus query system. In Papers in Computational Lexicography COMPLEX ’94. Budapest, Hungary, pp 22-32.
Christ O, Schulze B M, Hofmann A, König E 1991 Corpus Query Processor (CQP). User's Manual. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Germany.
Eckle-Kohler J 1999 Linguistisches Wissen zur automatischen Lexikon-Akquisition aus deutschen Textcorpora. Berlin, Logos Verlag.
Schulte im Walde S 2000 The German statistical grammar model: development, training and linguistic exploitation. In Arbeitspapiere des Sonderforschungsbereichs 340 Linguistic Theory and the Foundation of Computational Linguistics 162. Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, Germany.
Wöllstein-Leisten A, Heilmann A, Stepan P, Vikner S 1997 Deutsche Satzstruktur. Grundlagen der syntaktischen Analyse. Staufenberg Verlag, Tübingen.

A tagset for the morphosyntactic tagging of Arabic
Shereen Khoja
Computing Department, Lancaster University, Lancaster, LA1 4YR.
s.khoja@lancaster.ac.uk
Tel: 01524 592329 Fax: 01524 593608
Roger Garside
Computing Department, Lancaster University, Lancaster, LA1 4YR.
rgg@comp.lancs.ac.uk
Tel: 01524 593803 Fax: 01524 593608
Gerry Knowles
Department of Linguistics and Modern English Language, Lancaster University, Lancaster, LA1 4YT.
g.knowles@lancaster.ac.uk
Fax: 01524 843085

A morphosyntactic tagging system for Arabic has been around for centuries. When analysing sentences, Arabic grammarians give to each word a detailed morphosyntactic tag or part-of-speech. These tags contain a large amount of information, perhaps more than would be found in Indo-European tags. This is because Arabic words are formed by following fixed patterns, and by knowing the tag of the word, we know its pattern, and can thus predict various properties and meanings of the word. It is from this tagging system that we derive our Arabic tagset. The Arabic tagset we describe does not follow the EAGLES recommendations for the morphosyntactic annotation of corpora, but this is to be expected since Arabic is very different from the languages for which EAGLES was designed, and belongs to the Semitic family rather than the Indo-European one. Following a normalised tagset and the EAGLES recommendations would not capture some of Arabic's relevant information. However, the Arabic tagset can be mapped onto the EAGLES recommendations. In Arabic we have three major categories: Noun, Verb and Particle. To handle other common features of Arabic we have included another three categories, Residual (foreign words, abbreviations, etc), Numerals, and Punctuation (comma, period, etc). The EAGLES recommendations include the major categories described above (except the Particle) plus eight other categories. These are Adjective, Pronoun/Determiner, Article, Adverb (which in Arabic are all considered to be subcategories of the Noun), and Adpositions, Conjunctions, and Interjections (which are subcategories of the Particle in Arabic). The Unique category in the EAGLES recommendations is applied to categories with a unique membership, which do not follow any of the standard part-of-speech tags. In this paper we will show that although the Arabic tagset we have derived is very different from any Indo-European tagset, it is appropriate for Arabic. A detailed description of the tagset, along with examples of how the tags map onto real words, will be included in this paper. We will also describe the results of using this tagset with an Arabic part-of-speech tagger to tag an Arabic corpus.

Web as corpus
Adam Kilgarriff
ITRI, University of Brighton

1. Introduction

The corpus resource for the 1990s was the BNC. Conceived in the 80s, completed in the mid 90s, it was hugely innovative and opened up myriad new research avenues for comparing different text types, sociolinguistics, empirical NLP, language teaching and lexicography. But now the web is with us, giving access to colossal quantities of text, of any number of varieties, at the click of a button, for free. While the BNC and other fixed corpora remain of huge value, it is the web that presents the most provocative questions about the nature of language. It also presents a convenient tool for handling and examining text. Compared to LOB, the BNC is an anarchic object, containing ‘texts’ from 25 to 250,000 words long, screeds of painfully formulaic entries from the Dictionary of National Biography, conversations monosyllabic and incoherent, sermons, pornography and the electronic discourse of the Leeds United Football Club Fan Club. Compared to the web, the BNC is an English country garden.
Whatever perversities the BNC has, the web has in spades. First, not all documents contain text, and many of those that do are not only text. Second, it changes all the time. Third, like Borges's Library of Babel, it contains duplicates, near duplicates, documents pointing to duplicates that may not be there, and documents that claim to be duplicates but are not. Next, the language has to be identified (and documents may contain mixes of language). Then comes the question of text type: to gain any perspective on the language we have at our disposal in the web, we must classify some of the millions of web pages, and we shall never do so manually, so corpus linguists, and also web search engines, need ways of telling what sort of text a document contains: chat or hate-mail; learned article or bus timetable. These may sound like arguments for not studying the web: for scientific progress, we need to fix certain parameters so we can isolate the features we want to look at, and the web is not a good environment for that. This is true. For the web to be useful for language study, we must address its anarchy. If the web is a torrent and nothing more, it is not useful; for it to be useful, we must channel off manageable quantities to irrigate the pastures of scientific and technological progress. 2. The D3CI We are developing the D3CI (Distributed Data Distributed Collection Initiative) a framework for distributed corpora. This will comprise a set of corpora contributed by anyone with an on-line corpus to offer, where each corpus comes in the form of a set of URLs. The “virtual multicorpus” website will then be a place to visit for anyone wishing to download a corpus of some known language-variety. Corpus measures (Kilgarriff 2001) will be used to identify the homogeneity of each submitted corpus. We shall check for duplicates (Bouyad-Agha and Kilgarriff 1999). We shall provide a program which will go and collect a set of web pages and deliver it to a user. We shall develop links with that part of the WWW community which is examining ways of using links between documents and other strategies to automatically identify interesting clusters of interconnected pages (e.g. Chakrabarti 2000). Our medium-term goal is to set up a suite of web-based corpora that can be used by linguists and language technologists to answer questions of the form: “my theory/algorithm/program works well on the text type I developed it for: I wonder how well it generalises to other text types.” The use of the web addresses the hobgoblin of corpus builders: copyright. If material is on the web, it has been published and can be downloaded without infringing copyright. If I wished to store that material, put it on a CD and distribute that CD, I would be infringing copyright. If I merely present a list of URLs and announce to the world that this URL set comprises a corpus (of a given text type which I also describe) then I am clearly not infringing copyright. There are also no administrative, CDburning or postage costs associated with web-based corpora. To the objection that web pages die, so a corpus defined as a set of URLs would be forever shrinking, we propose the following solution. Our virtual corpora are monitored by an agent, which periodically checks that all URLs are still live. On discovering that one no longer is, the agent, which has gathered a statistical profile of each of its pages, sets out to find a new page or pages to replace the deceased. 
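A minimal sketch of such a monitoring agent, using only the Python standard library, might look as follows (the URL store, the profile format and the replacement step are placeholders; how replacements are actually found is described next):

    import urllib.request
    import urllib.error

    def is_live(url, timeout=10):
        """Check whether a page in the virtual corpus still responds."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.status < 400
        except (urllib.error.URLError, ValueError):
            return False

    def refresh(corpus_urls, profiles, find_replacements):
        """Keep live URLs; for dead ones, ask for replacement pages chosen on
        the basis of the statistical profile stored for the lost page."""
        refreshed = []
        for url in corpus_urls:
            if is_live(url):
                refreshed.append(url)
            else:
                refreshed.extend(find_replacements(profiles[url]))
        return refreshed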
First, it submits a web search, using the terms in the deceased as search terms. This gathers in a set of candidates. Then, using corpus similarity measures, it identifies which of the candidates do in 343 fact have the same linguistic form as the deceased. It then adds them to the corpus. The virtual corpus will evolve. Some may object, “but that is not suitable as use as a corpus because the texts that are there today are not identical to those that were there yesterday, so how can we compare results?” Results can be compared because the text type is the same. To demand more is to demand that tomorrow's experiments on the water flow in the River Lune involve the same water molecules as yesterday's. 3. Related Work We are not the first to note the web's usefulness for corpus research, despite its short history. Since the mid-nineties, the net has commonly been used by summarisation researchers as a source of documents to summarise. In this context, Radev and McKeown (1997) use internet-accessible newswire as a knowledge source for a language generation system. More recently, researchers have used collections of papers found on the web for a very wide range of purposes. Grefenstette and Nioche (2000) and Jones and Ghani (2000) explore the potential of the web as a source of language corpora for languages where electronic resources are in short supply, and Resnik (1999), as a source for bilingual parallel corpora. Fujii and Ishikawa (2000) use the web to generate encyclopaedia entries. Grefenstette (1999) presents prospects and experiments regarding the web as a source of lexical information; as the web provides thousands of contextualised instances of even fairly rare words, for many languages, it offers vast opportunities for automatic distillation of lexical entries from empirical evidence. Varantola (2000) pursues a similar theme, showing how translators, when confronted with a rare term, can find ample evidence of the term, its contexts, and associated vocabulary, through the simple use of a search engine. Specialised ‘lexicographic’ search engines have been produced (see http://www.webcorp.org.uk ) though their relative merits compared to, e.g., google (which provides some linguistic context for each occurrence of the word, all at breathtaking speed) remains an open question. Mihalcea and Moldovan (1999) and Agirre and Martinez (2000) use the web as a lexical resource, and as a source of test data, for Word Sense Disambiguation. Jacquemin and Bush (2000) use it as a source for harvesting lists of named entities. There has recently been a Web track in the TREC Information Retrieval Competition (see http://pastime.anu.edu.au/WAR/webtrax.html). A field such as this, with its newness and no entry costs, is immediately appealing to students and others, and the list above is of course incomplete. It does indicate how the use of the web as a corpus is taking off fast. 4. Conclusion To conclude: the BNC was one of the greatest innovations for linguistics in the 1990s. Now the world has moved on. As corpus linguists, we are in the fortunate position of having a particular perspective and channel of attack for examining the web --perhaps the most extraordinary phenomenon of our time -- which also just happens to provides solutions to many of our practical problems and an endless stream of new data. We have presented a model which uses the web as a source of data, the web as a delivery medium, and in which the web, and its language, are objects to be explored. 
The corpus of the new millennium is the web. 5. References Agirre E and Martinez D. Exploring automatic word sense disambiguation with decision lists and the web. In proceedings of COLING Workshop on Semantic Annotation and Intelligent Content, Saarbruecken, Germany. August 2000. Bouayad-Agha N and Kilgarriff A. Duplication in Corpora. In proceedings of Second Computational Lingusitics in the UK Colloquium, Essex, 1999. Chakrabarti S. Invited talk. Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora. Hong Kong. October 2000. (http://www.cse.iitb.ernet.in/~soumen) Fujii A and Ishikawa T. Utilizing the world wide web as an encyclopaedia: Extracting term descriptions from semi-structured text. In proceedings of the 38th Meeting of the ACL, Hong Kong, October 2000, pp. 488-495. Grefenstette G. The WWW as a Resource for Example-Based MT Tasks. Invited Talk, ASLIB ‘Translating and the Computer’ conference, London. October 1999. Grefenstette G and Nioche J. Estimation of English and non-English Language Use on the WWW. In 344 proceedings of RIAO (Recherche d'Informations Assistee par Ordinateur), Paris, 2000. Jacquemin C and Bush C. Combining Lexical and Formatting Clues for named entity acquisition from the web. In proceedings of Joint SIGDAT Conference on Empirical Methods in NLP and Very Large Corpora. Hong Kong. October 2000, pp. 181-189. Jones R and Ghani R. Automatically building a corpus for a minority language from the web. 38th Meeting of the ACL, Proceedings of the Student Research Workshop. Hong Kong. October 2000, pp. 29-36. Kilgarriff A. 2001 (in press) Comparing Corpora. International Journal of Corpus Linguistics. Mihalcea R and Moldovan D. A method for word sense disambiguation of unrestricted text. In proceedings of the 37th Meeting of ACL. Maryalnd, USA, June 1999, pp. 152-158. Radev D and McKeown K. Building a generation knowledge source using internet-accessible newswire. In proceedings of the Fifth Applied Natural Language Processing conference. Washington D. C.., April 1997, pp. 221-228. Resnik P. Mining the web for bilingual text In proceedings of the 37th Meeting of ACL. Maryalnd, USA, June 1999, pp. 527-534. Varantola K. Translators and disposable corpora. In proceedings of CULT (Corpus Use and Learning to Translate). Bertinoro, Italy. November 2000. 345 A long-standing problem in Corpus-Based Lexicography and a proposal for a viable solution Dimitrios Kokkinakis Sprakdata, Göteborg University Box 200, SE-405 30, Sweden Dimitrios.Kokkinakis@svenska.gu.se Abstract This paper describes the application of a framework for text analysis to the problem of distinguishing unusual or non-standard usage of words in large corpora. The need to identify such novel uses, and augment machine-readable dictionaries is a constant battle for professional lexicographers that need to update their resources in order to keep up with the development of the dynamic and evolving aspects of human language. Of equal importance is the need to devise automatic means upon which we can evaluate to what extent a (defining) dictionary accounts for what we find in corpus data. A combination of both semi-, and automatic means have been explored, and it seems that Machine Learning might be a plausible solution towards the stated goals. 1. Introduction Lexicography is concerned with the study of how words are used in a natural language by abstracting away surface differences between them. 
This paper deals with the question of creating a methodology that can aid professional lexicographers in distinguishing novel or non-standard usage of (lexicalized) words. The process is automated to a large extent, using state-of-the-art implemented Natural Language Processing (NLP) software for written Swedish, such as different kinds of annotators and analysis tools, coupled with large repositories of static and dynamic lexical knowledge. The NLP tools are geared towards the goal of distinguishing non-standard usage in very large bodies of text. With these tools, surface differences between words are reduced while at the same time linguistic information is successively added to the data, in order to create a suitable representation that can be matched and measured by Machine Learning software. The results which do not conform to the norm, according to the lexical resources and the method explored, need manual inspection in order to decide whether a new or extended usage has actually been identified. However, the inspection process can be automated, and accelerated, as soon as large portions of text have been appropriately marked and analyzed, since Machine Learning techniques can be applied to calculate the difference or distance between previously analyzed and newly processed material. A distance to the nearest marked instance above a certain threshold might indicate that an unusual or novel usage has been identified, while a distance below the threshold might indicate a prototypical usage. The presentation that follows is not about word sense disambiguation (WSD), sense or semantic tagging per se; it is rather about how to use WSD, and other supporting NLP technologies, in a practical situation. Moreover, the discussion that follows is not a polemic against the defining lexicon we chose, or any similar lexicons of that kind. We are fully aware of the limitations in terms of space, time, resources etc. that stand in the way of producing dictionaries with better coverage and richer content. Dictionaries are and will always be incomplete, and it is simply impractical to give descriptions of all possible sense nuances of a word, regardless of the theoretical model incorporated in the dictionary. Still, a typical scenario for many scholars, such as professional lexicographers, is the need for sophisticated software in their everyday work in order to cope with and organize the rapidly growing bulk of large corpora in electronic form, to automatically aid the evaluation of (defining) dictionaries against textual data, and to automate the process of discovering non-standard usage of words. The operational definition of non-standard usage we adopt is tightly connected to the use of an existing dictionary. Accordingly, corpus instances can either fit into the dictionary descriptions provided (standard usage) or not (non-standard usage). This paper is organized as follows.
We start by describing what we mean by “non-standard” use and give some background information regarding other efforts towards the same or similar goals, chapter (2); a brief presentation of various static and dynamic lexical resources as well as annotation and analysis tools for written Swedish follows, chapter (3); chapter (4) describes the method explored for the discovery of non-standard usage; chapter (5) presents a worked example and discusses some preliminary results obtained; finally, chapter (6) ends the presentation by giving some general conclusions and directions for future research.

2. Background

2.1 What is non-standard use?

There is a justified need to distinguish new senses and novel usage of words, and accordingly augment defining dictionaries with “fresh” material. This is regarded as a constant battle for professional lexicographers, who need to update their artifacts reliably and rapidly in order to keep up with the development of the dynamic and evolving aspects of human language, and its constant and rapid change. In this paper we will use the terms new, novel and non-standard usage or sense rather interchangeably, referring to roughly the same thing. Moreover, when we talk about these terms we should always try to keep in mind that they are closely related to given dictionary senses, that is, “any numbered section of a defining dictionary entry which supports its own definition or requires separate treatment from the surrounding material” (Atkins 1987). Thus, the definition we adopt for the work presented in this paper, regarding what might constitute a non-standard use of a particular word, is based on the lexical resources at our disposal, in our case the Gothenburg Lexical Database (GLDB). Consequently, if a word in a specific context does not match any readings of the descriptions provided by the lexicographers in this specific dictionary, then it is worth examining it manually, in order to establish whether a non-standard usage has been detected. Manual inspection is justified by the fact that deviating readings might simply be productive extensions,1 (easily) covered by applying some kind of mapping rule or process to the available readings, or they might be errors produced by the complex interaction of the different tools applied to the test data.

1 For instance, the compound word brandvägg ‘fire-wall’ is not found as an entry in GLDB. The head vägg, however, is, with a sub-sense (1c) intended to cover metaphoric or extended usage for that specific word. Consequently, the (metaphoric) sub-sense for brandvägg, namely “a way to protect a computer against unauthorised access”, is covered this way.

For similar experiments, based on manually inspecting large numbers of text instances for the purpose of evaluating the Generative Lexicon, see Kilgarriff (2000). Kilgarriff acknowledges that close reading of definitions from a published dictionary does not provide an ideal method for distinguishing standard from non-standard uses of words. However, the method has no fundamental flaws, he says, and there is no better method available. A disadvantage of the work presented by Kilgarriff is that he starts by first identifying a set of words to test whether the Generative Lexicon accounts for a novel word use. However, this is not always practical: working with multi-million-word corpora and the totality of a natural language, at least as described by a dictionary, makes the suggested approach infeasible. Therefore, what we need are better, more general means that can be applied to more than a handful of words. Nevertheless, I agree that manual judgement is still necessary, regardless of the qualitative or quantitative view of the approach adopted.
Comparisons between established machine-readable dictionaries (MRDs) and text collections have shown that there is a gap between the two in different dimensions. Since accurate, robust analysis tools for large corpora and machine-readable dictionaries were, until recently, rare even for over-explored languages such as English, only very coarse estimates of the coverage of the dictionaries have been made. These usually take the form of identifying ‘new’ words, words not in a ‘master wordlist’ of ‘existing’ words, or of counting how many of the words found in various text collections were in various machine-readable resources, such as the Longman Dictionary of Contemporary English (LDOCE) and the COLLINS dictionary (Renouf (1993), Krovetz (1994)). Renouf (1993), for instance, discusses the use of several analytical tools or ‘filters’ to identify ‘new’ words in a flow of data on the basis that they are not in a list of ‘existing words'. It is easy to confuse the finding of ‘new’ words and the identification of ‘new’ senses, since there are some similarities between the two. However, the identification of the latter requires more sophisticated and precise software tools, and is a much more difficult task. Therefore, the lack of greater sophistication in the software tools that were available for working with large corpora has prevented researchers from going a step further and trying to discover not only new words but also new senses of existing, lexicalized words. Clear (1994) comments that if one had software that could reliably categorise citations (i.e. concordance lines2) into semantic subsets, one could find answers quickly and easily to questions such as which citations out of the set do not appear to match any of a pre-defined set of word sense categories. Accordingly, potentially new uses of this word could be identified. After all, in the very relevant research area of WSD the interest has been persistently focused, until very recently, on quality WSD of a handful of target words, rather than quantity (Yarowsky (1995); Leacock et al. (1996)). Some large-scale WSD efforts on all content words are described in Kilgarriff & Palmer (2000) in the framework of the SENSe EVALuation (SENSEVAL) exercise.

2.2 Previous research

The idea of employing automatic techniques for discovering new or novel usage is not new; on the contrary. In early work by Wilks (1980), within the “Preference Semantics” system developed there, methods for dealing with extensions of word-senses are discussed. These are based on the incorporation of richer semantic structures, called pseudo-texts, and the observation of unexpected contexts. The shortcoming of the approach presented there, however, was its inability to deal with many forms of lexical ambiguity. The elaborate mechanism of templates for all parts of speech that was developed had to be both too general, for the creation of semantic representations, and too specific, to aid disambiguation. In his experiments, however, Wilks (1980) observes that extended or new usage is actually the norm in ordinary language use (at least for English).
Syntactic cues are used by Dorr & Jones (1996) for the derivation of semantic information and for augmenting on-line dictionaries for novel verbal senses. The syntactic cues are divided into distinct groupings that correlate with different word senses. For a very large number of verbs, the syntactic signatures, or syntactic patterns, are used. These are compared with Levin's, Levin (1993), verb classes and information from LDOCE. According to the algorithm presented by the two authors: if a verb is in Levin's lists it is classified accordingly; if not, WordNet synsets are used (lists of synonym semantic concepts, Miller et al. (1990)), and if the synonym is in Levin's classes they select the class that has the closest match with canonical LDOCE codes; if there are no synonyms, or LDOCE codes, a new verb class is created. Syntactic signatures are of the form: ‘X broke the vase to pieces’ which, according to Dorr & Jones, becomes ‘[np, v, np, pp(to)]'. Similarly, Wilks et al. (1996) describe a method, referring to an unpublished manuscript by Jim Cowie at CRL-NMSU, on an effort to piggyback a dictionary from a corpus and a seed MRD. By applying the described method, all the occurrences of a word in a corpus are classified as belonging to sets of senses defined by a lexicographer who has examined a subset of the occurrences of the word using concordances. The authors argue that after a sufficient number of example senses have been marked it should be possible to classify the remaining instances of a word using different techniques. More interestingly, it may be possible to highlight unusual usages (or different unclassified senses) by identifying instances where the overlap occurrence is low, and subsequently it is necessary to examine these instances manually. Cases where the overlap is high may indicate archetypical example usages. Hanks (1996) argues that the semantics of verbs are determined by their complementation patterns, discussing an empirical, semi-automatic approach, where it is necessary to identify typical subjects, objects and adverbials, and then group individual lexical items into sets. In creating the behavioural profiles of verb lemmas, such as ‘urge’ in large corpora, Hanks showed that 10% of the uses of ‘urge’ are metaphors and figuratives, while the most common patterns account for 61% of the occurrences “a person urging another person to do something”. Moreover, Hanks proposed, that for unusual uses of words it is advisable to statistically sort their collocates into relevenat sets, to give a name and note possible correlations among different sets in particular roles (subject, object), and explain the relation by appealing to criteria of ellipsis, rhetoric, etc. Finally, Tapanainen & Järvinen (1998) describe a tool that utilizes syntactic information and produces dependency syntax-based concordances between lemmatised words. The tool can be used for detecting relatively invariable phenomena, which, according to the authors, are collocations with fixed order and strict precedence. The tool can be used for studying, for instance, long distance dependencies by clustering sentences according to different syntactic structures. Although the authors do not directly claim that their tool can be used for discovering new usage of words, we think that it can certainly be an important step towards that direction as well. 
2 A concordance line is a formatted version or display of all the occurrences or tokens of a particular type in a corpus, the type is usually called the keyword or target or search item. Concordances form the main source of information in computer assisted lexicography. 348 3. Lexical and algorithmic resources 3.1 Lexical resources The lexical resources that are used in this work consist of the GLDB, which is the largest, most comprehensive lexical resource for modern Swedish, upon which a number of defining dictionaries have been produced, Malmgren (1992), and the extended content of the Swedish SIMPLE semantic lexicon (Lenci et al. (1998)), over 25,000 entries, Kokkinakis et al. (2000). For the classification of proper names into semantic classes, a named-entity recognizer (NE) is used, Kokkinakis (1998). The semantic classes in the NE recognizer fall into the categories LOCATION, HUMAN, TIME and ORGANIZATION. Proper names are both frequent and have a serious impact for the disambiguation of the surrounding context. The NE module is also classifying personal pronouns referring to humans as well as appositive nouns (e.g. he, she, professor) to the class HUMAN. 3.2 Algorithmic resources & machine learning The most important NLP tools that comprise the algorithmic resources are a rule-based part-of-speech tagger; a semantic tagger, Kokkinakis et al. (2000); a sense tagger (for content words), Kokkinakis & Johansson Kokkinakis (1999a); and a cascaded finite-state parser, Kokkinakis & Johansson Kokkinakis (1999b). Moreover, various finite-state based software that identify and mark idioms, multiword expressions, phrasal verbs, and perform heuristic compound segmentation and lemmatisation are used. The idioms consist of approx. 4,500 different ones, according to the GLDB. Compound segmentation is based on the distributional properties of graphemes, trying to identify grapheme combinations that are non-allowable when considering non-compound forms in the Swedish language, and which carry information of potential token boundaries. The heuristic technique behind the segmentation is based on producing 3-gram and 4-gram character sequences from several hundreds of non-compound lemmas, and then generating 3-gram and 4-grams that are not part of the lists produced. After manual adjustments and iterative refinement a list of such graphemes has been produced and used for segmentation. Ambiguities are unavoidable, although the heuristic segmentation has been evaluated for high precision. Finally, lemmatisation is based on the output from the part-of-speech tagger and the rich feature representation that can be found in the part-of-speech tags. Examples of such grapheme sequences are ‘ngs|s’ and ‘iv|b', e.g. forskningsskola ‘research school', skrivbord ‘writing desk'; ‘|’ denotes where the segmentation will take place. Machine learning techniques, Mitchell (1997) are used in order to automate the calculation of the overlap of the contexts between word that are candidates of defining a novel sense. More specifically, we adopt a supervised, inductive, classification-based variant of Machine Learning called Memory-Based Learning (MBL), and a specific implementation by Daelemans et al. (1999) called TiMBL. Using such techniques, the contexts, or instances or analyzed, modified concordance lines, can be sorted by calculating the distance of a new processed context with the distance to the nearest instance (or neighbour) of each test instance already processed. 
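A rough sketch of this kind of memory-based lookup is given below (plain feature overlap only, with toy three-position vectors; the actual study uses TiMBL with 37-position vectors and the information-gain weighting described in what follows):

    # Sketch: nearest-neighbour search over fixed-length symbolic feature vectors.
    # The distance is the number of feature positions on which two vectors differ;
    # a large distance to the nearest training instance may point to non-standard
    # usage, a small distance to prototypical usage.

    def overlap_distance(x, y):
        return sum(1 for a, b in zip(x, y) if a != b)

    def nearest(test_vector, training):
        """Return the sense of the closest memory instance and its distance."""
        best_sense, best_dist = None, float("inf")
        for features, sense in training:
            d = overlap_distance(test_vector, features)
            if d < best_dist:
                best_sense, best_dist = sense, d
        return best_sense, best_dist

    training = [(("skyffla", "NOUN:HUMAN", "NOUN:ABSTRACT"), "1/2"),
                (("skyffla", "NOUN:HUMAN", "NOUN:MATTER"), "1/1")]
    print(nearest(("skyffla", "NOUN:HUMAN", "NOUN:ARTIFACT"), training))  # ('1/2', 1)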
Here, the distance between two contexts is defined as the difference between the features within the instances. Training and test instances consist of fixed-length vectors of n symbolic feature-value pairs (in the study presented in this paper n=37, for verbs and for nouns) and a field containing the classification of that particular feature-value vector. During classification an unseen example X, a test instance, is presented to the system and a distance metric D between X and the instances Y in memory is calculated, D(X,Y). The algorithm tries to find the nearest neighbour and outputs its class as the prediction for the class of the test instance, as well as the distance from the nearest neighbour among the training instances.

4. Design and methodology

The methodology and design proposed in this report consist of an integrated approach that unifies the results of the software outlined previously, which manipulates the content of lexical and textual resources, in order to aid the recognition of potential non-standard use of lexicalized words. Different tools add various types of annotation to the data, such as grammatical and semantic information. Such annotations are a means of giving a text added value, in the sense that the added information can be used for a multitude of purposes; cf. Leech (1997). The model for the interpretation of non-standard use that is presented here is triggered by the use of a mixture of collocations and colligations, i.e. collocation patterns based on syntactic information rather than individual words. This type of syntactic patterning demands extra information about the words in the corpus, information provided by the different software described later. The method is inspired by the description, given above, of the work by Jim Cowie discussed in Wilks et al. (1996:240) and by the work of Kilgarriff (2000). In particular, I am interested in producing different modified views of typical concordance lines (see figure 1) and then using a combination of manual and automatic techniques for inspecting the obtained results. Moreover, when enough modified concordance lines have been produced, inspected and marked, given a particular sense from the available lexical resources, the overlap between old and new material can be measured, and can aid a human towards the identification of novel senses, according to the lexical resources used, here the GLDB.

Figure 1. A more abstract representation of a concordance, using a combination of lemmata, features and labels of various types

Essentially, the following processing steps are considered:
(1) Gather a large number of sentences (or concordance lines, or contexts) for every word to be examined from a (newspaper) corpus
(2) Annotate these sentences with any possible type of information available (such as part-of-speech, sense and semantic information)
(3) Parse with a syntactic analyzer
(4) Normalize the information obtained by the different tools (lemmatisation, uppercase-to-lowercase conversion, keeping the head of long chunks recognized during parsing, etc.)
(5) Decide the format for the MBL vectors
(6) Create the fixed-format vectors by gathering the information provided by steps (2), (3) and (4) above, and use them as training data
(7) Perform the same steps again, this time on sentences taken from another text genre, and use the result as test data
(8) Run MBL with the training and test data, and calculate the distance between the test and training instances
(9) Inspect the results: zero distance? (identical instances), small distance?
(possibly prototypical sense), large distance? (possibly, non-standard sense or processing errors) By applying these steps, concordance lines can be transformed from raw text to a more abstract, annotated representation, upon which MBL can calculate the overlap between the material, see the worked example given in chapter (5). 4.1 Thresholds Using MBL different threshold values are tested depending the provided algorithms. For instance, one is using the normalized Information Gain, (i.e. Gain ratio), which is measured by computing the difference in uncertainty (i.e. entropy) between the situations without and with knowledge of the value of a feature. Another threshold is used when the algorithm tested is the nearest neighbor search (k-NN or IB1), using weighted overlap and information gain weighting. Accordingly, the instances with the 350 highest threshold values produced by the TiMBL software are manually examined, (identical instances, complete overlap, produce ‘0’ zero distance). Of course, the returned values are dependent on the amount of training data, therefore, for each test performed the thresholds are adjusted in accordance with the results. 5. A case study and results The purpose of the experiment I conducted and describe in this chapter is twofold. First, to investigate whether the proposed method is sufficient or not for the discovery of non-standard usage of words, and second, to devise automatic methods for evaluating to what extent a dictionary accounts for whatever we can find in corpora. The last point can be seen as a side-effect of the first, since both are closely interrelated. In order to test the applicability of the proposed architecture, and methodology, I chose to use as training material, data taken from the Swedish language bank (http://spraakbanken.gu.se/) which consists, to a large extent, of newspaper material. As testing material I chose to use data downloaded from the Internet, this time by searching on Swedish sites for sentences containing the words chosen for my experiments. The reason I chose the test material from the Internet is because the corpora that has been used for developing training data, and some of the tools that I will describe later, have also been used by the lexicographers for the production of the static resources, particularly the GLDB. However, there is nothing that excludes that there might be cases in the training material alone where one can find non-standard sense or usage not used or ignored in the production and description of the material in the lexicon. Similarly, we can speculate, and certainly anticipate, that the test data may contain prototypical use of the words for the majority of the cases downloaded. 5.1 A worked example In order to make the methodology described in chapter (4) clearer I will provide some examples showing how different tools process few sample instances. The key word under investigation in the small sample given, is the verb skyffla, which according to the GLDB has two senses, the first is similar to ‘to shovel’ while the second is most similar to ‘to shove (away)'. All instances are taken from the Swedish language bank, while the annotations provided are, in some cases, simplified. Four such lines, with a very approximate interpretation, are: (1) Socialdemokraterna försöker skyffla diskussionen om EMU under mattan. ‘The social democrats are trying to shovel the discussion about EMU under the rug.’ (2) Han försökte ocksa skyffla ansvaret för utvecklingen rörande Cypern pa EU. 
‘He also tried to shove the responsibility regarding the development in Cyprus on EU.’ (3) Sakic skyfflade över pucken till Ozolinsh. ‘Sakic shovelled the puck to Ozolinsh.’ (4) Morfar skyfflade kol i fabriken i femtio ar. ‘Grand father has shovelled coal in the factory for fifty years.’ 5.1.1 Tokenization, part-of-speech annotation & lemmatisation Tokenization not only identifies graphic words (tokens) but also recognizes idioms, phrasal verbs and multi-word expressions. The part-of-speech tags shown below are edited and simplified for the sake of simplicity and readability. The original tagset is using a slightly modified version of the Swedish PAROLE morphosyntactic description, (http://spraakdata.gu.se/parole/lexikon/swedish.parole.lexikon.html). Lemmatisation is based on the output from the previous tool, and it is implemented as a finite state mechanism with three distinct states, one for verbs, one for nouns and one for adjectives. The suffix of the content words, along with their rich morphosyntactic features returned by the tagger are sufficient for establishing the base-form of each content word with very high accuracy. (1a) Socialdemokraterna/NOUN försöker/AUX-VERB skyffla/VERB diskussionen/NOUN om/PREP EMU/ABBREVIATION under_mattan/IDIOM ./PUNC Socialdemokraterna[socialdemokrat] försöker[försöka] skyffla [skyffla] diskussionen[diskussion] om EMU under_mattan . (2a) Han/PRONOUN försökte/AUX-VERB ocksa/ADVERB skyffla/VERB ansvaret/NOUN för/PREP utvecklingen/NOUN rörande/PREP Cypern/PROPER-NOUN pa/PREP EU/ABBREVIATION ./PUNC 351 Han försökte[försöka] ocksa skyffla[skyffla] ansvaret[ansvar] för utvecklingen[utveckling] rörande Cypern pa EU. (3a) Sakic/PROPER-NOUN skyfflade/VERB över/PARTICLE pucken/NOUN till/PREP Ozolinsh/PROPER-NOUN ./PUNC Sakic skyfflade[skyffla] över pucken[puck] till Ozolinsh . (4a) Morfar/NOUN skyfflade/VERB kol/NOUN i/PREP fabriken/NOUN i/PREP femtio/NUMERAL ar/NOUN ./PUNC Morfar[morfar] skyfflade[skyffla] kol[kol] i fabriken[fabrik] i femtio ar[ar] . 5.1.2 NE-recognition and semantic annotation The named-entity labels are obtained by the NE-recognition, these are underlined in the sample below. The semantic annotation is following the SIMPLE model, colon ‘:’ designates the hyper-hyponym relation in the semantic hierarchy, e.g. IDEO:HUMAN:CONCRETE means that the class IDEO “humans identified according to an ideological criterion” is hyponym of the semantic class HUMAN, which in turn is hyponym of the class CONCRETE. Note, that not all nouns get a semantic class annotation since the entries in the SIMPLE lexicon (at the time this work was completed) did not account for more than 25,000 noun senses. (1b) Socialdemokraterna/IDEO:HUMAN:CONCRETE försöker skyffla diskussionen/ABSTRACT om EMU under_mattan . (2b) Han/HUMAN försökte ocksa skyffla ansvaret/COGNITIVE-FACT:ENTITY: ABSTRACT för utvecklingen rörande Cypern/LOCATION pa EU/AGENCY: ENTITY:ABSTRACT . (3b) Sakic/HUMAN skyfflade över pucken/ARTIFACT:OBJECT:NON-LIVING:CONCRETE till Ozolinsh/HUMAN . (4b) Morfar/BIO:HUMAN:CONCRETE skyfflade kol/MATTER:NON-LIVING: CONCRETE I fabriken/FUNCTIONAL-SPACE:LOCATION:NON-LIVING:CONCRETE i femtio ar/TIME . 5.1.3 Shallow parsing Shallow parsing or chunking is based on the output from the part-of-speech tagging. During analysis, only the lemmatised head of each chunk (noun, adverbial, and adjective phrases and verbal groups) is preserved, as well as particles and the preposition heading a prepositional phrase. 
Here ‘NP’ is a noun phrase, ‘VG’ a verbal group, ‘PP’ a prepositional phrase ‘RP’ temporal adverbial phrase. Chunking is used for a couple of reasons; to obtain the head of the chunks, since phrases can be arbitrarily long and complex, and to give a shorthand name to constituents lying at a certain distance on the left and right of the word under investigation, e.g. different types of clauses. (1c) NP[Socialdemokraterna/NOUN] VG[skyffla/VERB] NP[diskussionen/NOUN] PP[om/PREP NP[EMU/ABBREVIATION]] IDIOM[under_mattan/IDIOM] ./PUNC (2c) NP[Han/PRONOUN] VG[skyffla/VERB] NP[ansvaret/NOUN] PP[för/PREP NP[utvecklingen/NOUN]] PP[rörande/PREP NP[Cypern/PROPER-NOUN]] PP[pa/PREP NP[EU/ABBREVIATION]] ./PUNC (3c) NP[Sakic/PROPER-NOUN] VG[skyfflade/VERB över/PARTICLE] NP[pucken /NOUN] PP[till/PREP NP[Ozolinsh/PROPER-NOUN]] ./PUNC (4c) NP[Morfar/NOUN] VG[skyfflade/VERB] NP[kol/NOUN] PP[i/PREP NP[fabriken/NOUN]] RP[i/PREP NP[ar/NOUN]] ./PUNC 5.1.4 Sense annotation Sense annotation is given for content words, nouns, main verbs and adjectives. The notation provided, according to the GLDB, gives the lemma number followed by the lexeme or sense number. The underlying model adopted in GLDB is the so called lemma-lexeme model described in Allén (1981). The lemmas are grammatical paradigms, comprising formal data, e.g. technical stem, spelling variations, part of speech, inflection(s), pronunciation(s), stress, morpheme division, compound boundary, abbreviated form(s) and much more. The lexemes, are the numbered senses of a lemma, and are divided into two main categories, a compulsory kernel sense and a non-compulsory set of one or more sub-senses, called the cycles. A large number of metonymic uses of words are encoded under separate lexemes or cycles for a lemma. (1d) Socialdemokraterna:1/1 försöker skyffla:1/2 diskussionen:1/1 om EMU under_mattan. (2d) Han försökte ocksa skyffla:1/2 ansvaret:1/1 för utvecklingen:1/1 rörande Cypern pa EU. 352 (3d) Sakic skyfflade:1/2 över pucken:1/1 till Ozolinsh. (4d) Morfar:1/1 skyfflade:1/1 kol:1/2 i fabriken:1/1 i femtio ar:1/1. 5.1.5 Creating vectors The vectors used by ML are of a fixed-format, and we have experimented with vectors of different size and content. We think that for optimal results and for different part-of-speech categories we need to model the content of the vectors differently. Therefore, the vectors for verbs consist of four ‘contexts’ to the left and four ‘contexts’ to the right of a verb under investigation (v1); while for nouns we model less context, namely two chunks to the left and two to the right (v2). If the noun under investigation is found in a prepositional phrase the preposition is also a part of the vector, finally the nearest to the noun modifier (if any) is also used in the vector. (v1) Key-Word Part-of-Speech Byte-Offs {Left-Context} Key-Word {Right-Context} Class (v2) Key-Word Part-of-Speech Byte-Offs {Left-Chunk} Prep Modifier Key-Word {Right-Chunk} Class The left and right ‘contexts’ in (v1) and the chunks in (v2) are defined as clusters of four features of the form: TOKEN:MORPHOSYNTAX:SEMANTIC-TAG-or-NE:SENSE-TAG With ‘MORPHOSYNTAX’ we mean a part of speech (if the context concerns a single token, which is the head of a chunk, within the clause where the keyword appears) or a larger syntactic label (if the context concern a syntactic unit outside the keyword's own clause, e.g. ‘CLAUSE'). 
With ‘SEMANTIC-TAG-or-NE’ is meant the result obtained by the semantic annotation and the NErecognition, while with ‘SENSE-TAG’ is meant the result from the sense annotation (a GLDB lemma and lexeme number). Any of the features, or even all, can be absent, in this case a missing feature is marked with a ‘dummy’ character, an equal sign ‘='. The reason to it is that MBL requires that the vectors are of equal size, one of the few disadvantages of the method in general. Given the sample of the worked examples, the results are gathered in the format below, and then converted to a fixed format of equal size for all vectors. Truncation is performed when a lot of context is available. ‘BYTE-OFFS’ is simply the position of the key word in the discourse, a mechanism that is inhereted by the tokenizer and helps linking the results with the original text from where it was taken. (1e) SKYFFLA VERB BYTE-OFFS = = = socialdemokrat:NOUN:HUMAN:1/1 skyffla:VERB:=:1/2 diskussion:NOUN:ABSTRACT:1/1 om:PREP:=:= EMU:ABBR:=:= under_mattan:IDIOM:=:= (2e) SKYFFLA VERB BYTE-OFFS = = = Han:PRONOUN:HUMAN:= skyffla:VERB:=:1/2 ansvar:NOUN: ABSTRACT:1/1 för:PREP:=:= utveckling:NOUN:=:1/1 rörande:PREP:=:= Cypern:PROPER-NOUN: LOCATION:= pa:PREP:=:= EU:PROPER-NOUN:AGENCY:= (3e) SKYFFLA VERB BYTE-OFFS = = = Sakic:PROPER-NOUN:HUMAN:= skyffla: VERB:=:1/2 över:PREP:=:= puck:NOUN:ARTIFACT:1/1 till:PREP:=:= Ozolinsh:PROPER-NOUN:HUMAN:= (4e) SKYFFLA VERB BYTE-OFFS = = = morfar:NOUN:HUMAN:1/1 skyffla:VERB:=:1/1 kol:NOUN: MATTER:1/2 i:PREP:=:= fabrik:NOUN:FUNCTIONAL-SPACE:1/1 i:PREP:=:= ar:NOUN:TIME:1/1 5.2 Small scale evaluation We carried out a small evaluation of the presented approach by producing vectors such as the ones given in section 5.1.5, for a large sample of both training (newspaper articles) and testing material (gathered from the Internet) for some verbs and a number of common nouns. More specifically, some of the verbs we looked at were: skyffla ‘to shovel, to shove away’ (2 senses, 100 training contexts and 15 test contexts), publicera ‘to publish’ (1 sense, 300 training contexts and 20 test contexts) and the phrasal verb hoppa in ‘to step in, to interfere’ (2 senses, 200 training contexts and 10 test contexts). The results produced by MBL were sorted according the distance to the nearest instance in the training 353 sample, a distance calculated on the metrics gain ratio and information gain, using the IB1 algorithm. The instances with the highest distance from the training material in every case were: (5) ColorFusion-kortet skyfflar hela 9 MB videodata per sekund. ‘The ColorFusion-card shovels 9 megabyte video-data per second.’ (6) Framtidens taxibolag publicerar sina bilars positioner pa Internet. ‘The future's taxi companies publish their car's positions on the Internet.’ (7) Vid trafikavbrott kan den ena ringen hoppa in och ersätta den andra. ‘During interruption of traffic the one ring can interfere and replace the other.’ The characteristic for the test instances with the longest distance, in all three cases, has been the fact that a concrete object (e.g. ‘card', ‘ring') is initiating an action usually performed by a human or organization in the training material. While the longest distance for the instance of the verb ‘to publish’ has to do with publishing a ‘location’ or ‘position’ while the majority of the cases in the training sample has been to publish a concrete object (e.g. ‘article', ‘report'). 
Over half of the test instances for the verb ‘to publish’ had a short distance to the training material, i.e. they reflected prototypical usage, while the opposite could be observed for the other two verbs. Regarding the examined nouns, among others plattform ‘platform', we provide here a more detailed picture of what the results look like. According to the GLDB, plattform has one main sense (lexeme) and four sub-senses (cycles). Roughly, these sub-senses are: ‘a’ platform (concrete), ‘b’ tram, ‘c’ oil-rig and ‘d’ platform (abstract); ‘d’ was the commonest sense in the training material and was used as the default in all testing instances. Some of the results produced by the ML software, using 372 training and 18 testing examples, are given below (9-12). The character and number on the right designate the suggested sense for the keyword and the distance of the example from the training instances. The underlined features simply help to distinguish which features belong together (see 5.1.5). A small number (distance) indicates that there are examples in the training material with many features in common with the test instance. Examples (9a-12a) show the nearest instance found in the training examples. Figure (2) shows the examples in concordance format. Figure 2. Concordance lines before processing (9) plattform N 00 dator N APPARATUS 1 bli V = = = = = = dominerande A = 1 PLATTFORM för S = = tillgång N = 2 till S = = world_wide_web Y = = d d 2.306898 (9a) tåg N VEHICLE 1 bli V = = = = = = = = = = PLATTFORM för S = = samtal N ABSTRACT 1 om S = = utveckling N = = (10) plattform N 00 = = = = = = = = i_och_med S = = öppen A = = PLATTFORM som = = = JavaCard = = = ha V = = företag N AGENCY 1 d d 2.061991 (10a) = = = = = = = = = = = = = = = = PLATTFORM som = = = bära V = = = = = = system N = = (11) plattform N 00 medlem N SITU = sluta_sig_samman V = 1 kring S = = politisk A = 1 PLATTFORM = = = = = = = = = = = = = = = = d d 1.799718 (11a) folk N HUMAN 1 identifiera_sig V = 1 med S = = politisk A = 1 PLATTFORM = = = = = = = = = = = = = = = = (12) plattform N 00 Internet N = = vara V = = = = = = ny A = 2 PLATTFORM som = = = få V = = support N = = = = = = d d 1.074738 (12a) Internet N = = vara V = = = = = = ny A = 2 PLATTFORM = = = = = = = = = = = = = = = = 6. Conclusions and further research This paper has outlined an approach to creating a framework that aids the identification of novel senses of words in large corpora. We think that important variation in word usage is ‘hidden’ and hard to identify by merely looking at thousands of (non-processed) concordance lines. Therefore any means of organising the bulk of the data is absolutely necessary for future enhancements of the dictionary content with ‘new’ corpus-based material. Preliminary empirical results on a small sample of words have shown that, although a lot of noise is produced by the different modules of the system in the form of errors in part-of-speech, sense or semantic annotation, sense differences between the 354 annotated concordance instances can be observed under the threshold conditions briefly outlined. The noise caused by the limited reliability of some of the tools has been the main reason for the production of a number of false positive instances. However, while the error rates encountered are high and the method is too immature to allow fully automatic integration of new data into the lexicon, it can be used as a supporting tool that helps lexicographers identify new usage.
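To illustrate how the distances discussed above could drive such a supporting tool, the following is a minimal Python sketch that flags test instances whose nearest training neighbour lies beyond a distance threshold. The simple weighted-overlap metric, the feature weights and the threshold value are illustrative assumptions standing in for the IB1 distances produced by TiMBL, not a description of the actual system.

# Illustrative sketch: flag candidate novel usages by their distance to the
# nearest training instance; candidates are then inspected by a lexicographer.

def overlap_distance(a, b, weights):
    # Weighted overlap: sum the weights of the feature positions that differ.
    return sum(w for x, y, w in zip(a, b, weights) if x != y)

def flag_candidates(test, training, weights, threshold=2.0):
    # test, training: lists of (feature_tuple, sentence) pairs.
    candidates = []
    for vec, sentence in test:
        dist, nearest = min((overlap_distance(vec, tvec, weights), tsent)
                            for tvec, tsent in training)
        if dist > threshold:
            candidates.append((dist, sentence, nearest))
    # Longest distances first, as in the ranking discussed in section 5.2.
    return sorted(candidates, reverse=True)

Instances such as (5)-(7), where a concrete object fills a slot normally occupied by a human or an organization, are exactly the kind of outliers such a threshold is meant to surface, while the final decision about what enters the dictionary remains with the lexicographer.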
Actually, we never intended to provide fully automatic means, since the manual refinement, inspection and judgement will always be the crucial factor left for the lexicographer; that is to make a final decision regarding what will be included and what will be left out from a dictionary. The design presented in this paper is dictionary-dependent but the method is not specific to Swedish, as long as there are tools that can contribute with various types of morphological, syntactic, lexical semantic or other information to corpus data. Although our experiments are limited in magnitude, the results showed that the coverage of GLDB is pretty good. The future direction for the work presented will operate on a larger sample of the language and try to define more rigorous evaluation criteria. References Allén S. 1981 The Lemma-Lexeme Model of the Swedish Lexical Database. In Rieger B. (ed) Empirical Semantics. Bochum pp. 376–387 Atkins B.T. 1987 Semantic ID Tags: Corpus Evidence for Dictionary Senses. In Proceedings of the 3rd OED. Waterloo Canada Clear J. 1994 I Can't See the Sense in a Large Corpus. In Kiefer F. Kiss G. and Pajzs J. (eds.). Papers in Computational Lexicography COMPLEX ’94. Budapest pp. 33–45 Daelemans W. Zavrel J. van der Sloot K. 1999 TiMBL: Tilburg Memory Based Learner version 2. ILK Technical Report 99-01. Paper available from http://ilk.kub.nl/~ilk/papers/ilk9901.ps.gz Dorr B. and Jones D. 1996 Role of Word Sense Disambiguation in Lexical Acquisition: Predicting Semantics from Syntactic Cues. In Proceedings of the 16th COLING. Vol. 1. Copenhagen Denmark pp. 322-327 Hanks P. 1996 Contextual Dependency and Lexical Sets. Journal of Corpus Linguistics. Benjamins 1(1): 75–98 Kilgarrif A. 2000 Generative Lexicon Meets Corpus Data: the Case of Non-Standard Word Uses. In Bouillon P. Busa F. (eds) Word Meaning and Creativity. Cambridge UP Kilgarriff A. and Palmer M. 2000 Introduction to the Special Issue on SENSEVAL. International Journal of Computer and the Humanities. Special Issue on SENSEVAL. 00:1-13 Kluwer Academic Publishers Kokkinakis D. 1998 AVENTINUS GATE and Swedish Lingware. In Proceedings of the 11th NODALIDA Conference (Nordisk Datalingvistik). Copenhagen Denmark pp. 22–33 Kokkinakis D. and Johansson-Kokkinakis S. 1999a Sense Tagging at the Cycle-Level Using GLDB. In Proceedings of the NFL Symposium (Nordic Association of Lexicography). Gothenburg Sweden. Paper available from: http://svenska.gu.se/~svedk/publics/nfl.pdf Kokkinakis D. and Johansson-Kokkinakis S. 1999b A Cascaded Finite-State Parser for Syntactic Analysis of Swedish. In Proceedings of the 9th EACL. Bergen Norway. Paper available from: http://svenska.gu.se/~svedk/publics/eaclKokk.ps Kokkinakis D. Toporowska-Gronostaj M. and Warmenius K. 2000 Annotating Disambiguating & Automatically Extending the Coverage of the Swedish SIMPLE Lexicon. In Proceedings of the 2nd LREC. Athens, Hellas (2000) Krovetz R. 1994 Learning to Augment a Machine-Readable Dictionary. In Proceedings of the EURALEX '94. Amsterdam Holland pp. 107–116 Leacock C. Towell G. and Voorhees E.M. 1996 Towards Buidling Contextual Representations of Word Senses Using Statistical Models. Boguraev B. Pustejovsky J. (eds.): Corpus Processing for Lexical Acquisition. Bradford pp. 98–113 Leech G. (1997) Introducing Corpus Annotation In Corpus Annotation. Linguistic Information from Computer Text Corpora pp. 1-18 Longman Lenci et al. 1998 SIMPLE WP2 Linguistic Specifications Deliverable 2.1 Pisa Levin B. 
1993 English Verb Classes and Alternations: a Preliminary Investigation. UCP Malmgren S.G. 1992 From Svenska ordbok (‘A dictionary of Swedish') to Nationalencyklopediensordbok (‘The Dictionary of the National Encyclopedia'). In Tommola H. Varantola K. Salmi-Tolonen T. Schopp J. (eds). In Proceedings of the EURALEX ’92 Vol. 2. Tampere Finland pp. 485–491 355 Miller G.A. (ed.) 1990 WordNet: An on-line Lexical Database. International Journal of Lexicography Special Issue 3(4) Mitchell T. M. 1997 Machine Learning. McGraw-Hill Series on Computer Science Renouf A. 1993 A Word in Time: First Findings from the Investigation of Dynamic Text. Aarts J. de Haan P. Oostdijk N. (eds.) English Language Corpora: Design Analysis and Exploitation. Rodopi Tapanainen P. and Järvinen T. (1998) Dependency Concordances. Journal of Lexicography. OUP 11(3): 187–203 Wilks Y. 1980 Frames Semantics and Novelty. In Metzing D. (ed) Frame Conceptions and Text Understanding. de Gruyter pp. 134–163 Wilks Y. Slator B. and Guthrie L. 1996 Electric Words Dictionaries Computers and Meanings. MIT Yarowsky D. 1995 Unsupervised Word Sense Disambiguation Rivaling Supervised Methods. In Proceedings of the 33rd ACL. Cambridge MA pp. 189–196 356 Using bilingual corpora for the construction of contrastive generation grammars: issues and problems Julia Lavid Dep. English Philology I Universidad Complutense de Madrid 28040 Madrid (Spain) Phone and fax: +34-91-518-5799 e-mail: julavid@filol.ucm.es This paper reports on the use of corpora for the construction of a computational grammar of Spanish, contrastive with English, in the application context of Multilingual Natural Language Generation (MLG). The theoretical framework for this work is Systemic Functional Linguistics (SFL) and the computational context provided by KPML (Komet Penman Multilingual), an extensive grammar development environment and generation engine that supports large-scale multilingual development (Bateman 1997). The initial phenomena which are being investigated contrastively belong to three different functional regions of the grammar, i.e., particular subareas of the grammar that are concerned with particular areas of meanings. These regions are transitivity (ideational meaning), thematicity (textual meaning) and mood (interpersonal meaning). The present study concentrates on textual meaning (thematicity and focus) as an illustration. Following what has now established itself as a standard methodology for empirically-based Natural Language Generation (Bateman 1998a, Reiter and Dale 1997), the following steps were carried out: first, a bilingual corpus (English-Spanish) was selected. For Spanish, a sample of spoken texts from the MacroCorpus of the educated linguistic standard of the main cities of the Spanish-speaking world was used, while for English a comparable sample was selected from the British National Corpus Sampler. This was motivated by the need to provide a realistic account of the behaviour of the linguistic phenomena investigated in unplanned and spontaneous contexts of use. The second step was to carry out a contrastive analysis of the phenomena mentioned before. Finally, the results of the analysis were coded up as resources/processes for generation. In the case of Spanish, these had to be created anew. 
In the case of English, as the KPML already includes an English generation grammar, this last step consisted in checking the coverage of the existing specifications, extending them when they could not cover the instances found in the corpus, and adapting them for effective MLG. Given the nature of the NLG process, which typically converts communicative goals expressed in some internal representation into surface forms, the kind of information that is most readily usable for NLG is statements of mappings from functions to forms. Therefore, the corpus analysis phase for NLG usually includes an explicit, and usually quite lengthy, linguistic analysis in which the analyst seeks possible realisations of communicative functions, which restricts the size of the corpus that can be realistically considered. This paper describes the different steps carried out for the generation of the linguistic phenomena mentioned above, discussing the problems encountered during the corpus analysis phase, and the computational representation derived from it, as well as some of the decisions taken to overcome them. 357 1. Introduction Natural Language Generation (NLG), the subfield of Artificial Intelligence and Computational Linguistics that investigates the automated production of texts by machine, is typically a process that converts communicative goals expressed in some internal representation into surface forms. As such, it touches upon different linguistic areas of inquiry such as text planning and discourse organisation, lexical semantics, grammatical and lexical choice, and the relationship between all of these. In fact, one of the main concerns of NLG is the construction of computational accounts of the linguistic system capable of generating texts in one language (monolingual NLG) or in several (multilingual NLG, henceforth MLG). This theoretical task poses unusual demands on the linguist, who has to provide explicit and detailed accounts of how language works and confront the results of the application of his/her theoretical claims in concrete computational systems. A more practical concern is the creation of computational systems capable of producing acceptable text from various sources and for different types of applications: well-known examples include the generation of weather reports from meteorological data (Kittredge et al. 1986), the generation of letters responding to customers (Springer et al. 1991), and other systems applied in areas such as technical documentation and instructional texts (Not and Stock 1994, Paris et al. 1995, Lavid 1995), patent claims (Sheremetyeva et al. 1996), information systems (Bateman and Teich 1995), computer-supported co-operative work (Levine and Mellish 1994), patient health information and education (DiMarco et al. 1995), and medical reports (Li et al. 1986), to mention a few.1 While the first generation systems were limited to the random generation of grammatically correct sentences, the field has experienced a very rapid growth over the past ten years, both as a research area bringing a unique perspective on fundamental issues in artificial intelligence, cognitive science, and human-computer interaction, and as an emerging technology capable of partially automating routine document creation and playing an important role in human-computer interfaces. In this sense, as practical systems became more sophisticated, it was necessary to provide a good understanding of the notion of “textuality” and all the factors involved in the creation of different text types.
In this context, it was only natural that corpora started to be used as the empirical basis both for the theoretical investigation of textual phenomena, and as part of the requirement analysis phase for NLG systems. As a result, the use of corpora has now been integrated as part of the standard methodology for NLG, both in theoretically-oriented research and in the development of concrete generation systems, to such an extent that computational tools have been developed to support, among other functionalities, the analysis of machine-readable monolingual or multilingual corpora for NLG (Alexa and Rostek 1997). In this paper, we report on the use of bilingual corpora for the construction of a computational grammar of Spanish, contrastive with English, in the application context of Multilingual Natural Language Generation (MLG), concentrating on the textual phenomenon of thematicity and its relationship with the related notion of focus as an illustration. Section 2 describes the two corpora selected and the criteria for concentrating on two comparable samples as the empirical basis for computational specifications. Section 3 presents the theoretical framework selected for this study and the issues which must be explored contrastively with respect to the textual phenomena mentioned before. Section 4 describes the corpus analysis phase of the bilingual samples. Section 5 presents a computational specification of the results of the analysis as resources for generation. The specification is based on the notion of functional typology as developed in SFL and implemented in the KPML development environment. Finally, section 6 provides a summary and discusses the implications of these results for corpus-based MLG. 2. Bilingual corpora for NLG Two electronic corpora were initially targeted as the empirical basis for the contrastive work proposed in this paper. These were the Macrocorpus de la norma lingüística culta de las principales ciudades del mundo hispánico (Samper et al. 1998), a corpus of the educated linguistic standard of the main cities of the Spanish-speaking world, and the British National Corpus. These two corpora were initially selected for two main reasons: they both describe the educated speech of Spanish and English respectively, and they contain samples of unplanned and spontaneous speech, which are necessary to provide a realistic account of the behaviour of the linguistic phenomena investigated in their contexts of use. 1 For an extensively documented review of the field see Bateman (1998b) 358 In the case of NLG, it is common to find computational specifications of linguistic phenomena based on the analysis of specific text types, most of them written monologues, which, though useful for the generation of texts of a specific domain, can only provide a partial view of the phenomena studied. As the purpose of this study is to investigate the behaviour of textual phenomena for a multipurpose contrastive grammar of English and Spanish, the choice of samples from unplanned, spontaneous speech offered a rich and unexplored empirical basis for the study of the textual phenomena mentioned before. For the study of Spanish, a sample of spoken texts from the Spanish Macrocorpus was selected. This Macrocorpus includes the transliteration of 84 hours of recording by 168 native speakers representative of the educated speech of twelve Hispanic cities, nine of them from South America and three of them from the Iberian Peninsula. The recordings are basically unstructured interviews conducted in a conversational style where the interviewer introduces a few questions and topics to stimulate the conversation and to establish some uniformity in the topics discussed by the speakers. In general, the speakers were left free to talk about the topics suggested by the interviewer and to introduce any new topics.
The sample selected for this study consists of 10 interviews from the three cities of the Iberian Peninsula contained in the Macrocorpus, i.e. from Madrid, Seville and Las Palmas de Gran Canaria. For the study of English, a comparable sample was selected from the BNC Sampler Corpus, a subcorpus of the British National Corpus, consisting of approximately one-fiftieth of the whole corpus, viz. two million words. The Sampler Corpus consists of 184 different texts, comprising just under 127,000 sentences and two million words. From this sampler corpus, a subcorpus of spoken texts was chosen for the purposes of the present study, consisting of 10 conversations. 3. The textual resources of thematicity and focus in English and Spanish The linguistic phenomena selected for illustrating the use of corpora in NLG belong to the textual region of the grammar, and have been the subject of several functional accounts. The theoretical framework for the study of these phenomena is SFL, as this is the theoretical basis for the implementation of the current English generation grammar Nigel. In SFL, the textual clause grammar is composed of two complementary systems, the systems of Theme and Information, characterised as assigning two different kinds of textual prominence to elements of the clause: thematic prominence, in the case of the system of Theme, and prominence as news, in the case of the system of Information. The notion of Focus, however, has not received a comparable attention within SFL, but has been the subject of a number of discourse-oriented linguistic and computational studies (Lavid 2000, McCoy and Cheng, McKeown, among others), and has been treated in Dik´s Functional Grammar (Dik 1978) as a pragmatic function which assigns more salience to some clausal constituent with respect to the contextual (pragmatic) information between language producers and receivers. In view of the need to provide form-function mappings of these phenomena which can be used by a MLG system, a corpus-based analysis for generation purposes must explore at least the following issues contrastively: 1. Do both English and Spanish grammars have a Theme and a Focus function ? 2. How do English and Spanish realise the Theme and the Focus functions - e.g.: sequence, inflection, adposition, intonation? 3. Are there marked and unmarked Theme and Focus selections in both languages, and to what extent do they depend on other systems (e.g. Mood, Voice)? 4. Are there resources in English and Spanish to combine thematicity and focus? With respect to the first question, different linguistic studies acknowledge the existence of these two functions in different languages (see Caffarel et al. in preparation; Dik et al. 1981), so this issue will not be further explored here. The rest of the issues, however, are central for a functional characterisation of the textual phenomena selected for this study, and their corpus-based empirical study is the basis for the specification of resources required for MLG. Therefore, the next section will describe the corpus analysis carried out for this study and the problems encountered when attempting to investigate the issues mentioned above. 359 4. Contrastive corpus analysis for MLG: problems and decisions As in the area of discourse analysis, the type of information that NLG needs from corpus investigations is one which basically consists of statements of mappings from functions to forms. 
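As a concrete illustration of what such statements of mappings from functions to forms might look like in machine-usable form, the following sketch records one mapping per attested realisation. The record structure is our own assumption for exposition (it is not the KPML or Nigel resource format), and the patterns and corpus examples anticipate those discussed in section 5.

# Illustrative sketch: function-to-form mapping statements of the kind an NLG
# component can consume; the representation is an assumption for exposition.

from dataclasses import dataclass

@dataclass
class Mapping:
    language: str       # 'English' or 'Spanish'
    function: str       # textual function, e.g. a Theme markedness feature
    realisation: str    # structural pattern observed in the corpus
    example: str        # attested corpus instance

mappings = [
    Mapping("English", "absolute theme (theme matter)",
            "'As for/as to' + NG, + predication",
            "As for the past, I have adopted the doctrine of anamnesis"),
    Mapping("Spanish", "absolute theme (theme matter)",
            "'en cuanto a' + NG, + predication",
            "En cuanto a los libros que me gusta leer, ..."),
    Mapping("English", "predicated theme",
            "it + be + X + that/who ...",
            "It was only a relatively small Arab army that arrived in Egypt"),
]

spanish_resources = [m for m in mappings if m.language == "Spanish"]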
Therefore, when investigating specific discourse phenomena, a NLG system requires as a first step a specification of the mappings from functions to forms with the purpose of duplicating the text analysed. In this sense, the corpus analysis phase raises some problematic issues which must be considered by NLG practitioners. One of these issues is the size of the corpus: if, at is usually the case, the analyst has to intervene looking for possible realisations of communicative functions, the use of large corpora becomes impractical. Also, it is not possible to mark-up large quantities of texts according to functional categories if they cannot be recognised automatically on the basis of tagged texts or syntactic analysis. In view of these problems, the following decisions had to be taken to investigate the issues mentioned above: 1.- With respect to the first issue, i.e., the ways in which English and Spanish realise the Theme and the Focus functions, the corpus analysis was based on the following assumptions, based on previous linguistic studies: a) The Theme function was recognised as realised by clause initial position in both languages, as several linguistic studies have demonstrated (Lavid in press; Taboada 2000). b) The Focus function was recognised on the basis of non-prosodic realisations, such as word order patterns, focus markers and characteristic constructions, since the BNC corpus does not include prosodic annotation. 2 2.- With respect to the second issue, i.e., the existence and realisations of marked and unmarked theme and focus selections in both languages, the corpus analysis was based on the following assumptions: a) marked and unmarked themes were recognised in specific mood contexts, such as declarative, interrogative and imperative options. Absolute themes were recognised by the presence of a pause and a comma separating them from the rest of the predication. b) unmarked focus was assumed to coincide with the last lexical element of the clause.3 3.- With respect to the third issue, i.e., the existence and variation of resources which combine thematicity and focus in both languages, the corpus analysis concentrated on the so-called cleft and pseudo-cleft sentences as realisation strategies used by both English and Spanish to combine the Theme and the Focus functions. These were semi-automatically distinguished in both corpora on the basis of their characteristic form. 5. Towards a computational specification for MLG The second step in what has now established itself as a standard methodology for empirically-based NLG is the codification of the corpus analysis results as resources/processes for generation. The current implementation of the English generation grammar contained in the KPML development environment already includes a computational specification for the Theme function in English as a textual region of the grammar. However, as will be shown in the following sections, it became necessary to modify the existing specification to account for instances found in the corpus and to ensure maximal sharing of resources for contrastive generation. For Spanish a new specification was created on the basis of the contrastive corpus analysis. 
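Returning briefly to the third decision above, the semi-automatic retrieval of cleft and pseudo-cleft candidates on the basis of their characteristic form could be approximated with surface patterns along the lines of the following sketch. The regular expressions are simplified assumptions made for exposition, not the procedure actually used in the study, and every match still requires confirmation by the analyst.

# Illustrative sketch: retrieve cleft/pseudo-cleft candidates by surface form
# for manual confirmation; the patterns are deliberately simplified.

import re

PATTERNS = {
    "en_it_cleft":       re.compile(r"\bit\s+(is|was|'s)\b.+\b(that|who)\b", re.I),
    "en_pseudo_cleft":   re.compile(r"^\s*what\b.+\b(is|was)\b", re.I),
    "es_pseudo_cleft_1": re.compile(r"^\s*lo que\b.+\b(es|fue|era)\b", re.I),
    "es_pseudo_cleft_2": re.compile(r"\b(es|fue|ha sido)\b.+\b(lo que|el que|la que)\b", re.I),
}

def cleft_candidates(sentences):
    # Yield (pattern name, sentence) pairs to be checked by the analyst.
    for s in sentences:
        for name, pattern in PATTERNS.items():
            if pattern.search(s):
                yield name, s
                break

sample = [
    "It was only a relatively small Arab army that arrived in Egypt",
    "what we were able to do on one occasion was er to raise enough money",
    "Lo que yo necesito es un poco más del estilo árabe",
]
for name, s in cleft_candidates(sample):
    print(name, "->", s)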
It should be noted, however, that it is not the purpose of this study to provide a full specification of the behaviour of the textual phenomena of thematicity and focus in both languages, but rather to discuss and illustrate with some examples some problems and issues raised by corpus-based contrastive MLG specifications. 2 Considering that Focus in English is predominantly realised by marked prosodic prominence (Martínez-Caro 1999), this apparent limitation of the corpus-analysis phase must be overcome by explicitly representing this realisation in the computational specification for English. 3 According to Halliday (1994), unmarked focus falls on the last lexical element of the tonic group, which in unmarked circumstances coincides with the clause. 360 The following sections, therefore, will illustrate some of these problems and the solutions suggested in the context of a MLG architecture based on functional typology. 5.1. Thematic resources for English and Spanish: a functional-typological characterisation This section presents a partial specification of the thematic resources available in English and Spanish on the basis of the contrastive corpus-based analysis. More specifically, the area of theme markedness will be discussed in detail, as there exist some interesting commonalities and differences which must be accounted for when generating English-Spanish thematic variants. The approach rests on the notion of functional typologies, as pursued in systemic-functional linguistics (cf. Halliday 1978), later developed for MLG purposes (see Bateman et al. 1999). Functional typologies are constructed by means of classification hierarchies called system networks (Halliday 1966), where each disjunction, or grammatical system, is seen as a point of abstract functional choice, capturing those minimal points of alternation offered by a linguistic representation level. This representation of the so-called paradigmatic axis of linguistic description is complemented with the corresponding syntagmatic realisation, i.e. the structural expression of the choices made in the paradigmatic axis and associated with individual grammatical features. This task is carried out by the so-called realisation statements, which set allowable constraints on configurations of syntactic constituents such as linear precedence, immediate dominance, or "unification" of functional constituents. This type of approach, where the functional coherence of the paradigmatic organisation is preferred over generalisations concerning possible structures, has been found to provide effective multilingual linguistic descriptions and computational representations which maximise the factoring out of generalisations across languages (cf. Bateman et al. 1991, 1999). Therefore, following this approach, and on the basis of empirical corpus analysis, a specification was created to account for Theme markedness in Spanish, as shown in Figure 1 below. Figure 1: Spanish Theme markedness systems. As the system network shows, the primary distinction in Spanish is between marked and unmarked themes, and within marked themes between absolute and non-absolute themes. Absolute themes do not map onto any transitivity or interpersonal functions, and are normally separated from the transitivity/interpersonal structure of the clause by a comma in writing and by a pause in speech.
They can be of two types: representing a theme matter or not. If non-absolute, they can be predicated or non-predicated. When predicated, they are normally realised by pseudo-cleft sentences, as will be discussed in detail below. If they are not predicated, thematic status may be assigned to any role within the transitivity structure of the clause (participants, processes or circumstances). With respect to English, the existing specification was found not to account for all the language instances found in the corpus. Therefore, on the basis of the corpus-based analysis and to provide a maximally effective specification for the purposes of MLG, the following specification was created for English: 361 Figure 2. English Theme markedness systems. As can be seen, an attempt has been made at capturing the commonalities with the Spanish specification while maintaining the English-specific integrity and difference. Thus, while English shares with Spanish the primary distinction between marked and unmarked themes, previously existing specifications (see Matthiessen 1995: 540) did not include a distinction between absolute and non-absolute themes. However, in order to enforce maximal sharing of resources across languages, this system was also included in the English network. In this sense, it should be noted that up to this point both languages are similar, but differ as the features become more delicate: while absolute themes in Spanish may refer to a theme matter or to any other transitivity role coreferential with a constituent within the predication, English has a gap in this common paradigm (indicated by an asterisk in the English network) since it only has the option of a theme matter as absolute theme. Also, English includes more delicate options within the systems of Theme predication, as it includes the possibility of having a local theme selection, and includes a system called Theme substitution, which is not available in the Spanish paradigm. In this new specification, predicated themes have been considered as marked in the sense that they map the Focus constituent, and, in many instances, the New element, onto the Theme (see Halliday 1994: 59 on predicated Themes). Contrastive examples extracted from the corpora are discussed in detail below. With respect to the syntagmatic realisations of the options presented in the networks above, both languages present divergent grammaticalisations of their common paradigmatic potential. Table 1 below illustrates the different features of the systems of both languages, together with their divergent realisations. These are specified as sets of constraints which specify the syntactic and lexical properties of the linguistic units being generated. For example, the feature ’theme substitute’ has the realisation statement +Pro-theme, which means ’insert a Pro-theme', and ’Pro-theme / Subject', which means ’conflate or unify with the grammatical function Subject'. The rest of the realisation specifies the ordering of constituents in relation to one another: thus ’Rheme ^ Theme’ means order the Rheme before the Theme. 362 feature English realisation Spanish realisation System: Markedness marked absolute theme matter ’As to/as for/... + NG, + Pred. ’en cuanto a + NG', + Pred.
not theme matter Left Dislocation + Predication non-absolute predicated ’It + be + X..that/who’ Relative ^ Copula ^ NP NP ^ Copula ^ Relative Copula ^ NP ^ Relative unmarked local marked local ’NP it + be+ that’ non-predicated Part, Pro. or Circ. non-Subj. Part, Pro. or Circ. non-Subj. unmarked non-substitute substitute +Pro-theme/ Subject ^ Rheme ^Theme Table 1: Syntagmatic realisations of theme markedness options in English and Spanish Some of the divergent realisations in both languages are the following: 1. Absolute Themes Absolute themes refer to constituents which are not integrated within the transitivity/interpersonal structure of the clause, and are, therefore, marked in speech with a pause, and with a comma in writing. As shown in Figure 1 above, Spanish has two options when choosing absolute themes: as theme matter and as not-theme matter. Example (1) below illustrates the choice of a theme matter in Spanish, an instance extracted from the Spanish Macrocorpus: (1) En cuanto a los libros que me gusta leer, t~ sabes que a m me gusta leer todo (SE- 10) As for the books that me likes to read, you know that me likes to read everything ’As for the books that I enjoy reading, you know that I enjoy reading everything’ A contrastive example in English is illustrated by example (2) below, extracted from the BNC sampler corpus: (2) As for the past, I have adopted the doctrine of anamnesis (AEA) When the absolute theme is not a theme matter, in Spanish it may be coreferential with an element of the structure of the clause, as example (3) illustrates: And the citizen, the Madrid Madrid (one), to him notice (3-sing) one change ’ And the citizen, the Madrid citizen, can one notice a change in him...?’ In (3), the constituent ’el ciudadano’ is not integrated in the transitivity structure of the clause, though it is coreferential with an element functioning as beneficiary in the clause, expressed by the clitic ’le’ (italics in the example). English, however, lacks this option in its paradigm, and this is indicated by an asterisk in the system network in Figure 2. 2. Predicated Themes 363 Both English and Spanish include predicated themes as a choice between presenting a Theme with a special feature of identification or without that special feature. In fact, this choice is a textual strategy which combines the Theme and the Focus functions into one single realisation, which, depending on the language, may be a cleft, a pseudo-cleft, an identifying clause or a combination of the three. Thus, English typically uses clefts with the form ('it + be + X + that ...) to realise predicated themes (italics) and to mark focus constituents (boldface) as in examples (4) and (5) extracted from the BNC sample: (4) It was only a relatively small Arab army that arrived in Egypt (JXL) (5) It is not the death of the body that is important to us, it ’s the soul (NHE3) Textually speaking, this construction serves two textual functions: a) to identify the Theme in the transitivity role it serves in the clause: by using the identifying type of clause we are achieving a textual distribution of meaning with the added feature of identification. For example, in (4) ’the thing that arrived in Egypt’ is identified as ’a relatively small Arab army'. Similarly, in (5) the ’the thing that is important to us’ is identified as not the death of the body but as the soul. According to this, the segment ’it's not the death of the body’ would be the Theme of the clause, and ’that is important to us’ would be the Rheme. 
b) to define a specific constituent as having the Focus function, which, in many cases, is presented as contrastive to another alternative. Example (5) is a case of what Dik would call counterpresuppositional Focus, more specifically Substitute Focus. According to his definition, "the information presented is opposed to other, similar information which the Speaker presupposes to be entertained by the Addressee" (Dik 1989: 282). In (5), the speaker rejects the information which he/she presupposes to be entertained by the Addressee (the death of the body) and corrects it with information which the speaker considers to be correct (the soul). English also uses pseudo-cleft and identifying clauses to combine the Theme and Focus functions. Example (6) below illustrates a pseudo-cleft construction where the Theme is realised by a nominalising clause introduced by ’what', and the rest of the clause serves as a Rheme. This construction also serves to mark the focus constituent as the one following the copular verb: (6) what we were able to do on one occasion was er to raise enough money (FYJ) Spanish, by contrast, tends to use only pseudo-cleft clauses to combine the Theme and Focus functions. The range of possible syntagmatic realisations in Spanish is more varied than in English, and includes three main types: 1. Constructions where the relative clause appears in initial position, followed by the copular verb and the nominal group functioning as an attribute. Example (7), extracted from the Spanish Macrocorpus, illustrates this first type: (7) Lo que yo necesito [...] es un poco más del estilo árabe... (SE-01) What I need is a bit more of the style Arabian ’What I need is a bit more of the Arabian style’ 2. Constructions where the nominal group is in initial position, the verb in medial position and the relative clause at the end. Example (8) illustrates this type: (8) Eso es lo que pienso de la familia (SE-04) That is what think (1-sing.) of the family ’That is what I think of the family’ 3. Constructions with the verb in initial position followed by the nominal group and the relative clause. Example (9) illustrates this type: (9) pero ha sido precisamente el avance de la anestesia [...] la que ha podido hacer que se [...] pudiera hacer una serie de operaciones (SE-05) 364 But was (3-sing) precisely the advance of anaesthesia that could allow [...] that could be made certain operations ’But it was precisely the advance of anaesthesia that allowed certain operations to be done’ 3. Local Theme selection In English, when theme predication is selected and it is realised by means of a construction of the type it + be + ...[[that ...]], that clause opens up the potential for an additional thematic contrast - a contrast local to that clause. In this case, the Theme local to the identifying clause may be unmarked (as it is in most of the cases) or marked. If it is marked, it will coincide with the Complement / Identifier of the identifying clause and will be given thematic prominence in initial position of the clause. No examples have been found in the BNC sample, probably due to the fact that it is a small one. However, as other studies have proved its existence with empirical evidence, we include the example provided by Matthiessen as an illustration: (10) There we fell and my leg it was that broke (Matthiessen 1995: 567) 4.
Substitute Theme In English, when the Theme is unmarked and not predicated, there is a choice whether to re-introduce the Theme/Subject at the end of the clause so as to give it a "thematic culmination". With substitute Theme, the Theme/Subject is a pronominal nominal group and the substitute Theme is typically a full lexical nominal group, though it is also possible to find examples with this, one, etc. In examples (11) and (12) below, the Theme/Subjects are realised by the demonstrative ’that', and the lexical items ’piano’ and ’day release’ are the substitute Themes, which serve as a textual reprise of the thematic referents: (11) That ’s good, that ’s good for you, the piano you just had there (kp8) (12) That ’s what you want to be after, day release. 6. Summary and concluding remarks As the contrastive examples above have illustrated, a detailed corpus analysis in search for form-function mappings of linguistic phenomena is an indispensable step as a basis for empirically-based computational specifications for MLG. The study of the phenomenon of thematicity and its relationship with focus in both languages, though partial and purely illustrative, has served to show how a functional analysis of selected corpus samples may shed light on the paradigmatic features and their syntagmatic realisations of those phenomena, which are the basis for computational specifications for MLG architectures based on functional typologies. However, the requirements of NLG, similar in many respects to the goals of discourse analysis, raise important problems for corpus analysis: the need for semantic and contextual (pragmatic) analysis of the selected corpora, which cannot always be recognised automatically and so cannot be extracted from large-scale corpora without an analyst's intervention, makes, in most cases, the use of large corpora and form-based corpus tools unsuitable for NLG purposes. Given the close relation between discourse analysis and NLG, it can be hoped that new tools and specialised environments are developed to extract more from current electronic corpora than what can actually be obtained from form-based quantitative corpus analysis tools. References Alexa, M and Rostek, L 1996 Computer-assisted corpus-based text analysis with TATOE, Technical Report, German National Research Center for Information Technology, Institute for Integrated Information and Publication Systems, Darmstadt, Germany. Bateman, J 1997 Enabling technology for multilingual natural language generation: the KPML development environment. Journal of Natural Language Engineering 3 (1): 1-41. 365 Bateman, J 1998b Automated discourse generation. In Kent A and Hall, C.M (eds), Encyclopedia of Library and Information Science, Vol 62, New York, Marcel Dekker, pp 1-54. Bateman J, Teich E 1995 Selective information presentation in an integrated publication system: an application of genre-driven text generation. Information Processing and Management 31(5): 753-767. Bateman J, Matthiessen C, Nanri K , Zeng L 1991 The Re-use of linguistic resources across languages in multilingual generation components. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, pp 966-971. Bateman J, Matthiessen C, and Zeng, L 1999 Multilingual Natural Language Generation for Multilingual Software: a Functional Linguistic Approach. Applied Artificial Intelligence 13: 607-639. Caffarel A, Martin J R, and Matthiessen C in preparation Language typology: a functional perspective. Dik S C 1978 Functional Grammar. 
Dordrecht, Foris. Dik S C, Hoffmann M, de Jong JR, Djiang S, Stroomer H, and de Vries L 1981 On the typology of focus phenomena. In Hoekstra et al. (eds) Perspectives on functional grammar. Dordrecht, Foris. DiMarco C, Hirst G, Wanner L, Wilkinson J 1995 Healthdoc: customizing patient information and health education by medical condition and personal characteristics. In Proceedings of the Workshop on Patient Education, Glasgow. Halliday M A K 1966 Some notes on "deep grammar". Journal of Linguistics 2 (1): 57-67. Halliday M A K 1978 Language as social semiotic. London, Edward Arnold. Halliday M A K 1994 An introduction to functional grammar. London, Edward Arnold. Kittredge R, Polguere A, Goldberg E 1986 Synthesizing weather reports from formatted data. In Proceedings of the 11th International Conference on Computational Linguistics, Bonn, Germany, pp 563-565. Lavid J 1995 From interpersonal option to thematic realisation in multilingual instructions. In Kittredge R (ed) Working notes of the IJCAI-95 workshop on text generation. Montreal. Lavid J 2000 Theme, focus, given and other dangerous things: linguistic and computational approaches to information in discourse. Revista Canaria de Estudios Ingleses : Lavid J in press La noción gramatical de tema en un contexto multilingüe: una perspectiva Levine J, Mellish C 1994 Corect: Combining CSCW with natural language generation for collaborative requirements capture. In Proceedings of International Joint Conference on Artificial Intelligence, Montreal, Canada, pp 1398-1404. Li P, Evens M, Hier D 1986 Generating medical case reports with the linguistic string parser. In Proceedings of the 5th National Conference on Artificial Intelligence, Philadelphia, pp 1069-1073. McCoy K, Cheng J 1990 Focus of attention: constraining what can be said next. In Paris et al. (eds) Natural language generation in artificial intelligence and computational linguistics. Kluwer, Dordrecht. McKeown, K 1995 Text generation: Using discourse strategies and focus constraints to generate natural language text. Cambridge, Cambridge University Press. Matthiessen C 1995 Lexicogrammatical cartography: english systems. Tokyo, International Language Sciences Publishers. Not E, Stock, O 1994 Automatic generation of instructions for citizens in a multilingual community. In Proceedings of the European Language Engineering Convention, Paris. Paris C, Van der Linden K, Fisher M, Hartley A, Pemberton L, Power R, Scott D 1995 A support tool for writing multilingual instructions. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, pp 1398-1404. Reiter E, Dale R 1997 Building applied natural language generation. Journal of Natural Language Engineering 3(1): 57-87. Sheremetyeva S, Nirenburg S, Nirenburg I 1996 Generating patent claims from interactive input. In Proceedings of the 8th International Workshop on Natural Language Generation, Herstmonceaux, England, pp 61-70. 366 Springer S, Buta P, Wolf T 1991 Automatic letter composition for customer service. In Smith R, Scott C (eds) Innovative applications of artificial intelligence 3. Menlo Park, Ca AAAI Press. Taboada M 2000 Collaborating through talk: the interactive construction of task-oriented dialogue in English and Spanish. Unpublished PhD thesis, Universidad Complutense de Madrid.
English Philology I Universidad Complutense de Madrid 28040 Madrid (Spain) Phone and fax: +34-91-518-5799 e-mail: julavid@filol.ucm.es This paper reports on the use of corpora for the construction of a computational grammar of Spanish, contrastive with English, in the application context of Multilingual Natural Language Generation (MLG). The theoretical framework for this work is Systemic Functional Linguistics (SFL) and the computational context provided by KPML (Komet Penman Multilingual), an extensive grammar development environment and generation engine that supports large-scale multilingual development (Bateman 1997). The initial phenomena which are being investigated contrastively belong to three different functional regions of the grammar, i.e., particular subareas of the grammar that are concerned with particular areas of meanings. These regions are transitivity (ideational meaning), thematicity (textual meaning) and mood (interpersonal meaning). The present study concentrates on textual meaning (thematicity and focus) as an illustration. Following what has now established itself as a standard methodology for empirically-based Natural Language Generation (Bateman 1998a, Reiter and Dale 1997), the following steps were carried out: first, a bilingual corpus (English-Spanish) was selected. For Spanish, a sample of spoken texts from the MacroCorpus of the educated linguistic standard of the main cities of the Spanish-speaking world was used, while for English a comparable sample was selected from the British National Corpus Sampler. This was motivated by the need to provide a realistic account of the behaviour of the linguistic phenomena investigated in unplanned and spontaneous contexts of use. The second step was to carry out a contrastive analysis of the phenomena mentioned before. Finally, the results of the analysis were coded up as resources/processes for generation. In the case of Spanish, these had to be created anew. In the case of English, as the KPML already includes an English generation grammar, this last step consisted on checking the coverage of the existing specifications and extending them when could not cover the instances found in the corpus, and adapting them for effective MLG. Given the nature of the NLG process, which typically converts communicative goals expressed in some internal representation into surface forms, the kind of information that is most readily usable for NLG are statements of mappings from functions to forms. Therefore, the corpus analysis phase for NLG usually includes an explicit, and usually quite lengthy linguistic analysis where the analyst seeks possible realisations of communicative functions, which restricts the size of the corpus that can be realistically considered. This paper describes the different steps carried out for the generation of the linguistic phenomena mentioned above, discussing the problems encountered during the corpus analysis phase, and the computational representation derived from it, as well as some of the decisions taken to overcome them. 357 1. Introduction Natural Language Generation (NLG), the subfield of Artificial Intelligence and Computational Linguistics that investigates the automated production of texts by machine, is typically a process that converts communicative goals expressed in some internal representation into surface forms. 
As such, it touches upon different linguistic areas of inquiry such as text planning and discourse organisation, lexical semantics, grammatical and lexical choice, and the relationship between all of these. In fact, one of the main concerns of NLG is the construction of of computational accounts of the linguistic system capable of generating texts in one language (monolingual NLG) or in several (multilingual NLG, henceforth MLG). This theoretical task poses unusual demands on the linguist who has to provide explicit and details accounts of how language works and confront the results of the application of his/her theoretical claims in concrete computational systems. A more practical concern is the creation of computational systems capable of producing acceptable text from various sources and for different types of applications: well-known examples include the generation of weather reports from meteorological data (Kittredge et al. 1986), the generation of letters responding to customers (Springer et al. 1991), and other systems applied in areas such as technical documentation and instructional texts (Not and Stock 1994, Paris et al. 1995, Lavid 1995), patent claims (Sheremetyeva et al. 1996), information systems (Bateman and Teich 1995), computersupported co-operative work (Levine and Mellish 1994), patient health information and education (DiMarco et al. 1995), and medical reports (Li et al. 1986), to mention a few.1 While the first generation systems were limited to the random generation of grammatically correct sentences, the field has experienced a very rapid growth over the past ten years, both as a research area bringing a unique perspective on fundamental issues in artificial intelligence, cognitive science, and human-computer interaction, and as an emerging technology capable of partially automating routine document creation and playing an important role in human-computer interfaces. In this sense, as practical systems became more sophisticated, it was necessary to provide a good understanding of the notion of “textuality” and all the factors involved in the creation of different text types. In this context, it was only natural that corpora started to be used as the empirical basis both for the theoretical investigation of textual phenomena, and as part of the requirement analysis phase for NLG systems. As a result, the use of corpora has now been integrated as part of the standard methodology for NLG, both in theoretically-oriented research and in the development of concrete generation systems, to such an extent that computational tools have been developed to support, among other functionalities, the analysis of machine-readable monolingual or multilingual corpora for NLG (Alexa and Rostek 1997). In this paper, we report on the use of bilingual corpora for the construction of a computational grammar of Spanish, contrastive with English, in the application context of Multilingual Natural Language Generation (MLG), concentrating on the textual phenomenon of thematicity and its relationship with the related notion of focus as an illustration. Section 2 describes the two corpora selected and the criteria for concentrating on two comparable samples as the empirical basis for computational specifications. Section 3 presents the theoretical framework selected for this study and the issues which must be explored contrastively with respect to the textual phenomena mentioned before. Section 4 describes the corpus analysis phase of the bilingual samples. 
Section 5 presents a computational specification of the results of the analysis as resources for generation. The specification is based on the notion of functional typology as developed in SFL and implemented in the KPML development environment. Finally, section 6 provides a summary and discusses the implications of these results for corpus-based MLG. 2. Bilingual corpora for NLG Two electronic corpora were initially targeted as the empirical basis for the contrastive work proposed in this paper. These were the Macrocorpus de la norma lingüística culta de las principales ciudades del mundo hispánico (Samper et al. 1998), a corpus of the educated linguistic standard of the main cities of the Spanish-speaking world, and the British National Corpus. These two corpora were initially selected for two main reasons: they both describe the educated speech of both Spanish and English and contain samples of 1 For an extensively documented review of the field see Bateman (1998b) 358 unplanned and spontaneous speech, which are necessary to provide a realistic account of the behaviour of the linguistic phenomena investigated in their contexts of use. In the case of NLG, it is frequent to find computational specifications of linguistic phenomena based on the analysis of specific text types, most of them written monologues, which, though useful for the generation of texts of a specific domain, can only provide a partial view on the phenomena studied. As the purpose of this study is to investigate the behaviour of textual phenomena for a multipurpose contrastive grammar of English and Spanish, the choice of samples from unplanned, spontaneous speech offered a rich and unexplored empirical basis for the study of the textual phenomena mentioned before. For the study of Spanish, a sample of spoken texts from the Spanish Macrocorpus was selected. This Macrocorpus includes the transliteration of 84 hours of recording by 168 native speakers representative of the educated speech of twelve Hispanic cities, nine of them from SouthAmerica and three of them from the Iberian Peninsula. The recordings are basically unstructured interviews conducted in a conversational style where the interviewer introduces a few questions and topics to stimulate the conversation and to establish some uniformity in the topics discussed by the speakers. In general, the speakers were left free to talk about the topics suggested by the interviewer and to introduce any new topics. The sample selected for this study consists of 10 interviews from the three cities of the Iberian Peninsula contained in the Macrocorpus, i.e. from Madrid, Seville and Las Palmas de Gran Canaria. For the study of English, a comparable sample was selected from the BNC Sampler Corpus, a subcorpus of the British National Corpus, consisting of approximately one-fiftieth of the whole corpus, viz. two million words. The Sampler Corpus consists of 184 different texts, comprising just under 127,000 sentences and two million words. From this sampler corpus, a subcorpus of spoken texts was chosen for the purposes of the present study, consisting of 10 conversations. 3. The textual resources of thematicity and focus in English and Spanish The linguistic phenomena selected for illustrating the use of corpora in NLG belong to the textual region of the grammar, and have been the subject of several functional accounts. 
The theoretical framework for the study of these phenomena is SFL, as this is the theoretical basis for the implementation of the current English generation grammar Nigel. In SFL, the textual clause grammar is composed of two complementary systems, the systems of Theme and Information, characterised as assigning two different kinds of textual prominence to elements of the clause: thematic prominence, in the case of the system of Theme, and prominence as news, in the case of the system of Information. The notion of Focus, however, has not received a comparable attention within SFL, but has been the subject of a number of discourse-oriented linguistic and computational studies (Lavid 2000, McCoy and Cheng, McKeown, among others), and has been treated in Dik´s Functional Grammar (Dik 1978) as a pragmatic function which assigns more salience to some clausal constituent with respect to the contextual (pragmatic) information between language producers and receivers. In view of the need to provide form-function mappings of these phenomena which can be used by a MLG system, a corpus-based analysis for generation purposes must explore at least the following issues contrastively: 1. Do both English and Spanish grammars have a Theme and a Focus function ? 2. How do English and Spanish realise the Theme and the Focus functions - e.g.: sequence, inflection, adposition, intonation? 3. Are there marked and unmarked Theme and Focus selections in both languages, and to what extent do they depend on other systems (e.g. Mood, Voice)? 4. Are there resources in English and Spanish to combine thematicity and focus? With respect to the first question, different linguistic studies acknowledge the existence of these two functions in different languages (see Caffarel et al. in preparation; Dik et al. 1981), so this issue will not be further explored here. The rest of the issues, however, are central for a functional characterisation of the textual phenomena selected for this study, and their corpus-based empirical study is the basis for the specification of resources required for MLG. Therefore, the next section will describe the corpus analysis carried out for this study and the problems encountered when attempting to investigate the issues mentioned above. 359 4. Contrastive corpus analysis for MLG: problems and decisions As in the area of discourse analysis, the type of information that NLG needs from corpus investigations is one which basically consists of statements of mappings from functions to forms. Therefore, when investigating specific discourse phenomena, a NLG system requires as a first step a specification of the mappings from functions to forms with the purpose of duplicating the text analysed. In this sense, the corpus analysis phase raises some problematic issues which must be considered by NLG practitioners. One of these issues is the size of the corpus: if, at is usually the case, the analyst has to intervene looking for possible realisations of communicative functions, the use of large corpora becomes impractical. Also, it is not possible to mark-up large quantities of texts according to functional categories if they cannot be recognised automatically on the basis of tagged texts or syntactic analysis. 
In view of these problems, the following decisions had to be taken to investigate the issues mentioned above:
1.- With respect to the first issue, i.e., the ways in which English and Spanish realise the Theme and the Focus functions, the corpus analysis was based on the following assumptions, drawn from previous linguistic studies: a) the Theme function was recognised as realised by clause-initial position in both languages, as several linguistic studies have demonstrated (Lavid in press; Taboada 2000); b) the Focus function was recognised on the basis of non-prosodic realisations, such as word order patterns, focus markers and characteristic constructions, since the BNC corpus does not include prosodic annotation.2
2.- With respect to the second issue, i.e., the existence and realisations of marked and unmarked theme and focus selections in both languages, the corpus analysis was based on the following assumptions: a) marked and unmarked themes were recognised in specific mood contexts, such as declarative, interrogative and imperative options; absolute themes were recognised by the presence of a pause and a comma separating them from the rest of the predication; b) unmarked focus was assumed to coincide with the last lexical element of the clause.3
3.- With respect to the third issue, i.e., the existence and variation of resources which combine thematicity and focus in both languages, the corpus analysis concentrated on the so-called cleft and pseudo-cleft sentences as realisation strategies used by both English and Spanish to combine the Theme and the Focus functions. These were semi-automatically distinguished in both corpora on the basis of their characteristic form (see the illustrative sketch below).
2 Considering that Focus in English is predominantly realised by marked prosodic prominence (Martínez-Caro 1999), this apparent limitation of the corpus-analysis phase must be overcome by explicitly representing this realisation in the computational specification for English.
3 According to Halliday (1994), unmarked focus falls on the last lexical element of the tonic group, which in unmarked circumstances coincides with the clause.

5. Towards a computational specification for MLG

The second step in what has now established itself as a standard methodology for empirically-based NLG is the codification of the corpus analysis results as resources/processes for generation. The current implementation of the English generation grammar contained in the KPML development environment already includes a computational specification for the Theme function in English as a textual region of the grammar. However, as will be shown in the following sections, it became necessary to modify the existing specification to account for instances found in the corpus and to ensure maximal sharing of resources for contrastive generation. For Spanish a new specification was created on the basis of the contrastive corpus analysis. It should be noted, however, that it is not the purpose of this study to provide a full specification of the behaviour of the textual phenomena of thematicity and focus in both languages, but rather to discuss and illustrate with some examples some problems and issues raised by corpus-based contrastive MLG specifications. The following sections, therefore, will illustrate some of these problems and the solutions suggested in the context of an MLG architecture based on functional typology.
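The paper does not spell out the patterns used for the semi-automatic retrieval of clefts and pseudo-clefts mentioned in point 3 of the previous section. The following Python sketch shows one way such surface-form matching could be set up; the regular expressions, the function name flag_candidates and the sample sentences (drawn from the corpus examples discussed below) are illustrative assumptions, not the procedure actually used in the study, and every hit would still require manual checking.

```python
import re

# Illustrative surface patterns only (assumed for this sketch, not taken from the study):
#   English it-clefts:        "it + be (+ not) + X + that/who ..."
#   English wh-pseudo-clefts: "what ... + be ..."
#   Spanish pseudo-clefts:    "lo que / el que / la que / quien ... + ser ..."
PATTERNS = {
    "en_it_cleft": re.compile(r"\bit\s+(?:is|was)\s+(?:not\s+)?\w[^.?!]*?\s+(?:that|who)\b", re.I),
    "en_pseudo_cleft": re.compile(r"\bwhat\s+[^.?!]*?\s+(?:is|was|were)\b", re.I),
    "es_pseudo_cleft": re.compile(r"\b(?:lo que|el que|la que|quien)\s+[^.?!]*?\s+(?:es|era|fue|ha sido)\b", re.I),
}

def flag_candidates(sentences):
    """Return (pattern_name, sentence) pairs whose surface form matches a cleft pattern."""
    return [(name, s) for s in sentences for name, pat in PATTERNS.items() if pat.search(s)]

if __name__ == "__main__":
    sample = [
        "It was only a relatively small Arab army that arrived in Egypt.",
        "What we were able to do on one occasion was to raise enough money.",
        "Lo que yo necesito es un poco más del estilo árabe.",
    ]
    for name, sentence in flag_candidates(sample):
        print(name, "->", sentence)
```

A pattern-based pass of this kind can only over-generate candidates; the functional analysis (Theme, Focus, markedness) still has to be assigned by the analyst, which is precisely why the corpus samples were kept small.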
5.1. Thematic resources for English and Spanish: a functional-typological characterisation

This section presents a partial specification of the thematic resources available in English and Spanish on the basis of the contrastive corpus-based analysis. More specifically, the area of theme markedness will be discussed in detail, as there exist some interesting commonalities and differences which must be accounted for when generating English-Spanish thematic variants. The approach rests on the notion of functional typologies, as pursued in systemic-functional linguistics (cf. Halliday 1978), later developed for MLG purposes (see Bateman et al. 1999). Functional typologies are constructed by means of classification hierarchies called system networks (Halliday 1966), where each disjunction, or grammatical system, is seen as a point of abstract functional choice, capturing those minimal points of alternation offered by a linguistic representation level. This representation of the so-called paradigmatic axis of linguistic description is complemented with the corresponding syntagmatic realisation, i.e. the structural expression of the choices made in the paradigmatic axis and associated with individual grammatical features. This task is carried out by the so-called realisation statements, which set allowable constraints on configurations of syntactic constituents such as linear precedence, immediate dominance, or "unification" of functional constituents. This type of approach, where the functional coherence of the paradigmatic organisation is preferred over generalisations concerning possible structures, has been found to provide effective multilingual linguistic descriptions and computational representations which maximise the factoring out of generalisations across languages (cf. Bateman et al. 1991, 1999). Therefore, following this approach, and on the basis of empirical corpus analysis, a specification was created to account for Theme markedness in Spanish, as shown in Figure 1 below:

[Figure 1: Spanish Theme markedness systems. System network with the choices: marked vs. unmarked (transitivity role); within marked, absolute vs. non-absolute; within absolute, theme matter vs. not theme matter; within non-absolute (Predication), predicated vs. non-predicated.]

As the system network shows, the primary distinction in Spanish is between marked and unmarked themes, and within marked themes between absolute and non-absolute themes. Absolute themes do not map onto any transitivity or interpersonal functions, and are normally separated from the transitivity/interpersonal structure of the clause by a comma in writing and by a pause in speech. They can be of two types: representing a theme matter or not. If non-absolute, they can be predicated or non-predicated. When predicated, they are normally realised by pseudo-cleft sentences, as will be discussed in detail below. If they are not predicated, thematic status may be assigned to any role within the transitivity structure of the clause (participants, processes or circumstances). With respect to English, the existing specification was found not to account for all the language instances found in the corpus. Therefore, on the basis of the corpus-based analysis and to provide a maximally effective specification for the purposes of MLG, the following specification was created for English:

[System network with the choices: marked vs. unmarked (transitivity role); within marked, absolute vs. non-absolute; within absolute, theme matter vs. not theme matter (*, a gap in English); within non-absolute, predicated (with a Local theme selection: unmarked local vs. marked local) vs. non-predicated; plus a Theme Substitution system: substitute vs. non-substitute.] Figure 2.
English Theme markedness systems As can be seen, an attempt has been made at capturing the comnonalities with the Spanish specification while maintaining the English-specific integrity and difference. Thus, while English shares with Spanish the primary distinction between marked and unmarked themes, previously existing specifications (see Matthiessen 1995: 540) did not include a distinction between absolute and non-absolute themes. However, in order to enforce maximal sharing of resources across languages, this system was also included in the English network. In this sense, it should be noted that up to this point both languages are similar, but differ as the features become more delicate: while absolute themes in Spanish may refer to a theme matter or to any other transitivity role coreferential with a constituent within the predication, English has a gap in this common paradigm (indicated by an asterisk in the English network) since it only has the option of a theme matter as absolute theme. Also, English includes more delicate options within the systems of Theme predication, as it includes the possibility of having a local theme selection, and includes a system called Theme substitution, which is not available in the Spanish paradigm. In this new specification, predicated themes have been considered as marked in the sense that they map the Focus constituent, and, in many instances, the New element, onto the Theme (see Halliday 1994: 59 on predicated Themes). Contrastive examples extracted from the corpora are discussed in detail below. With respect to the syntagmatic realisations of the options presented in the networks above, both languages present divergent grammaticalisations of their common paradigmatic potential. Table 1 below illustrates the different features of the systems of both languages, together with their divergent realisations. These are specified as sets of constraints which specify the syntactic and lexical properties of the linguistic units being generated. For example, the feature ’theme substitute’ has the realisation statement +Pro-theme, which means ’insert a Pro-theme’ and ’Pro-theme / Subject’ which means ’conflate or unify with the grammatical function Subject'. The rest of the realisation specifies the ordering of constituents in relation to one another: thus ’Rheme ^ Theme’ means order the Rheme before the Theme. 362 feature English realisation Spanish realisation System: Markedness marked absolute theme matter ’As to/as for/... + NG, + Pred. ’en cuanto a + NG', + Pred. not theme matter Left Dislocation + Predication non-absolute predicated ’It + be + X..that/who’ Relative ^ Copula ^ NP NP ^ Copula ^ Relative Copula ^ NP ^ Relative unmarked local marked local ’NP it + be+ that’ non-predicated Part, Pro. or Circ. non-Subj. Part, Pro. or Circ. non-Subj. unmarked non-substitute substitute +Pro-theme/ Subject ^ Rheme ^Theme Table 1: Syntagmatic realisations of theme markedness options in English and Spanish Some of the divergent realisations in both languages are the following: 1. Absolute Themes Absolute themes refer to constituents which are not integrated within the transitivity/interpersonal structure of the clause, and are, therefore, marked in speech with a pause, and with a comma in writing. As shown in Figure 1 above, Spanish has two options when choosing absolute themes: as theme matter and as not-theme matter. 
Example (1) below illustrates the choice of a theme matter in Spanish, an instance extracted from the Spanish Macrocorpus: (1) En cuanto a los libros que me gusta leer, tú sabes que a mí me gusta leer todo (SE- 10) As for the books that me likes to read, you know that me likes to read everything ’As for the books that I enjoy reading, you know that I enjoy reading everything’ A contrastive example in English is illustrated by example (2) below, extracted from the BNC sampler corpus: (2) As for the past, I have adopted the doctrine of anamnesis (AEA) When the absolute theme is not a theme matter, in Spanish it may be coreferential with an element of the structure of the clause, as example (3) illustrates: (3) .Y el ciudadano, el madrileno madrileno, se le nota un cambio.... (MA-10) And the citizen, the Madrid Madrid (one), to him notice (3-sing) one change ’ And the citizen, the Madrid citizen, can one notice a change in him...?’ In (3), the constituent ’el ciudadano’ is not integrated in the transitivity structure of the clause, though it is coreferential with an element functioning as beneficiary in the clause, expressed by the clitic ’le’ (italics in the example). English, however, lacks this option in its paradigm, and this is indicated by an asterisk in the system network in Figure 2. 2. Predicated Themes 363 Both English and Spanish include predicated themes as a choice between presenting a Theme with a special feature of identification or without that special feature. In fact, this choice is a textual strategy which combines the Theme and the Focus functions into one single realisation, which, depending on the language, may be a cleft, a pseudo-cleft, an identifying clause or a combination of the three. Thus, English typically uses clefts with the form ('it + be + X + that ...) to realise predicated themes (italics) and to mark focus constituents (boldface) as in examples (4) and (5) extracted from the BNC sample: (4) It was only a relatively small Arab army that arrived in Egypt (JXL) (5) It is not the death of the body that is important to us, it ’s the soul (NHE3) Textually speaking, this construction serves two textual functions: a) to identify the Theme in the transitivity role it serves in the clause: by using the identifying type of clause we are achieving a textual distribution of meaning with the added feature of identification. For example, in (4) ’the thing that arrived in Egypt’ is identified as ’a relatively small Arab army'. Similarly, in (5) the ’the thing that is important to us’ is identified as not the death of the body but as the soul. According to this, the segment ’it's not the death of the body’ would be the Theme of the clause, and ’that is important to us’ would be the Rheme. b) to define a specific constituent as having the Focus function, which, in many cases, is presented as contrastive to another alternative. Example (5) is a case of what Dik would call counterpresuppositional Focus, more specifically Substitute Focus. According to his definition, "the information presented is opposed to other, similar information which the Speaker presupposes to be entertained by the Addressee" (Dik 1989: 282). In (5), the speaker rejects the information which he/she presupposes to be entertained by the Addressee (the death of the body) and corrects it with information which the speaker considers to be correct (the soul). English also uses pseudo-cleft and identifying clauses to combine the Theme and Focus functions. 
Example (6) below illustrates a pseudo-cleft construction where the Theme is realised by a nominalising clause introduced by ’what', and the rest of the clause serves as a Rheme. This construction also serves to mark the focus constituent as the one following the copular verb: (6) what we were able to do on one occasion was er to raise enough money (FYJ) Spanish, by contrast, tends to use only pseudo-cleft clauses to combine the Theme and Focus functions. The range of possible syntagmatic realisations in Spanish is more varied than in English, and includes three main types: 1. Constructions where the relative clause appears in initial position, followed by the copular verb and the nominal group functioning as an attribute. Exampe (7), extracted from the Spanish Macrocorpus, illustrates this first type: (7) Lo que yo necesito [...] es un poco más del estilo árabe... (SE-01) What I need is a bit more of the style Arabian ’What I need is a bit more of the Arabian style’ 2. Constructions where the nominal group is in initial position , the verb in medial position and the relative clause at the end. Example (8) illustrates this type: (8) Eso es lo que pienso de la familia (SE-04) That is what think (1-sing.) of the family ’That is what I think of the family’ 3. Constructions with the verb in initial position followed by the nominal group and the relative clause. Example (9) illustrates this type: (9) pero ha sido precisamente el avance de la anestesia [...] la que ha podido hacer que se [...] pudiera hacer una serie de operaciones (SE-05) 364 But was (3-sing) precisely the advance of anaesthesia that could allow [...] that could be made certain operations ’But it was precisely the advance of anaesthesia what allowed certain operations to be done’ 3. Local Theme selection In English when theme predication is selected, and it is realised by means of a construction of the type it + be + ...[[that ...]], that clause opens up the potential for an additional thematic contrast - a contrast local to that clause. In this case, the Theme local to the identifying clause may be unmarked (as it is in most of the cases) or marked. If it is marked, it will coincide with the Complement / Identifier of the identifying clause and will be given thematic prominence in initial position of the clause. No examples have been found in the BNC sample, probably due to the fact that it is a small one. However, as other studies have proved its existence with empirical evidence, we include the example provided by Matthiessen as an illustration: (10) There we fell and my leg it was that broke (Matthiessen 1995: 567) 4. Substitute Theme In English, when the Theme is unmarked and not predicated, there is a choice whether to re-introduce the Theme/Subject at the end of the clause so as to give it a "thematic culmination". With substitute Theme, the Theme/Subject is a pronominal nominal group and the substitute Theme is typically a full lexical nominal group, though it is also possible to find examples with this, one, etc. In examples (11) and (12) below, the Theme/Subjects are realised by the demonstrative ’that', and the lexical items ’piano’ and ’day release’ are the substitute Themes, which serve as a textual reprise of the thematic referents: (11) That ’s good, that ’s good for you, the piano you just had there (kp8) (12) That ’s what you want to be after, day release. 6. 
Summary and concluding remarks As the contrastive examples above have illustrated, a detailed corpus analysis in search for form-function mappings of linguistic phenomena is an indispensable step as a basis for empirically-based computational specifications for MLG. The study of the phenomenon of thematicity and its relationship with focus in both languages, though partial and purely illustrative, has served to show how a functional analysis of selected corpus samples may shed light on the paradigmatic features and their syntagmatic realisations of those phenomena, which are the basis for computational specifications for MLG architectures based on functional typologies. However, the requirements of NLG, similar in many respects to the goals of discourse analysis, raise important problems for corpus analysis: the need for semantic and contextual (pragmatic) analysis of the selected corpora, which cannot always be recognised automatically and so cannot be extracted from large-scale corpora without an analyst's intervention, makes, in most cases, the use of large corpora and form-based corpus tools unsuitable for NLG purposes. Given the close relation between discourse analysis and NLG, it can be hoped that new tools and specialised environments are developed to extract more from current electronic corpora than what can actually be obtained from form-based quantitative corpus analysis tools. References Alexa, M and Rostek, L 1996 Computer-assisted corpus-based text analysis with TATOE, Technical Report, German National Research Center for Information Technology, Institute for Integrated Information and Publication Systems, Darmstadt, Germany. Bateman, J 1997 Enabling technology for multilingual natural language generation: the KPML development environment. Journal of Natural Language Engineering 3 (1): 1-41. Bateman, J 1998a Using corpora for uncovering text organization. IV-V Jornades de corpus lingüístics. Barcelona, Institut Universitari de Lingüística Aplicada, Universidad Pompeu Fabra. 365 Bateman, J 1998b Automated discourse generation. In Kent A and Hall, C.M (eds), Encyclopedia of Library and Information Science, Vol 62, New York, Marcel Dekker, pp 1-54. Bateman J, Teich E 1995 Selective information presentation in an integrated publication system: an application of genre-driven text generation. Information Processing and Management 31(5): 753-767. Bateman J, Matthiessen C, Nanri K , Zeng L 1991 The Re-use of linguistic resources across languages in multilingual generation components. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, Sydney, pp 966-971. Bateman J, Matthiessen C, and Zeng, L 1999 Multilingual Natural Language Generation for Multilingual Software: a Functional Linguistic Approach. Applied Artificial Intelligence 13: 607-639. Caffarel A, Martin J R, and Matthiessen C in preparation Language typology: a functional perspective. Dik S C 1978 Functional Grammar. Dordrecht, Foris. Dik S C, Hoffmann M, de Jong JR, Djiang S, Stroomer H, and de Vries L 1981 On the typology of focus phenomena. In Hoekstra et al. (eds) Perspectives on functional grammar. Dordrecht, Foris. DiMarco C, Hirst G, Wanner L, Wilkinson J 1995 Healthdoc: customizing patient information and health education by medical condition and personal characteristics. In Proceedings of the Workshop on Patient Education, Glasgow. Halliday M A K 1966 Some notes on "deep grammar". Journal of Linguistics 2 (1): 57-67. Halliday M A K 1978 Language as social semiotic. 
London, Edward Arnold. Halliday M A K 1994 An introduction to functional grammar. London, Edward Arnold. Kittredge R, Polguere A, Goldberg E 1986 Synthesizing weather reports from formatted data. In Proceedings of the 11th International Conference on Computational Linguistics, Bonn, Germany, pp 563- 565. Lavid J 1995 From interpersonal option to thematic realisation in multilingual instructions. In Kittredge R (ed) Working notes of the IJCAI-95 workshop on text generation. Montreal. Lavid J 2000 Theme, focus, given and other dangerous things: linguistic and computational approaches to information in discourse. Revista Canaria de Estudios Ingleses : Lavid J in press La noción gramatical de tema en un contexto multilingüe: una perspectiva funcional-tipológica. In Treinta anos de la Sociedad Espanola de Lingüística. Madrid, Gredos. Levine J, Mellish C 1994 Corect: Combining CSCW with natural language generation for collaborative requirements capture. In Proceedings of International Joint Conference on Artificial Intelligence, Montreal, Canada, pp 1398-1404. Li P, Evens M, Hier D 1986 Generating medical case reports with the linguistic string parser. In Proceedings of the 5th National Conference on Artificial Intelligence, Philadelphia, pp 1069-1073. Martínez-Caro E 1999 Gramática del discurso: foco y énfasis en inglés y en espanol. Barcelona, Promociones y Publicaciones Universitarias. McCoy K, Cheng J 1990. Focus of attention: constraining what can be said next. In Paris et al. (eds) Natural language generation in artificial intelligence and computational linguistics. Kluwer, Dordrecht, 1990. McKeown, K 1995 Text generation: Using discourse strategies and focus constraints to generate natural language text. Cambridge, Cambridge University Press. Matthiessen C 1995 Lexicogrammatical cartography: english systems. Tokyo, International Language Sciences Publishers. Not E, Stock, O 1994 Automatic generation of instructions for citizens in a multilingual community. In Proceedings of the European Language Engineering Convention, Paris. Paris C, Van der Linden K, Fisher M, Hartley A, Pemberton L, Power R, Scott D 1995 A support tool for writing multilingual instructions. In Proceedings of the International Joint Conference on Artificial Intelligence, Montreal, pp1398-1404. Reiter E, Dale R 1997 Building applied natural language generation. Journal of Natural Language Engineering 3(1): 57-87. Samper JA, Hernández Cabrera CA, Troya Déniz M 1998 Macrocorpus de la norma lingüística culta de las principales ciudades del mundo hispánico. Universidad de las Palmas de Gran Canaria. Sheremetyeva S, Nirenburg S, Nirenburg I 1996 Generating patent claims from interactive input. In Proceedings of the 8th International Workshop on Natural Language Generation, Herstmonceaux, England, pp 61-70. 366 Springer S, Buta P, Wolf T 1991 Automatic letter composition for customer service. In Smith R, Scott C (eds) Innovative applications of artificial intelligence 3. Menlo Park, Ca AAAI Press. Taboada M 2000 Collaborating through talk: the interactive construction of task-oriented dialogue in English and Spanish. Unpublished PhD thesis, Universidad Complutense de Madrid. 367 Tracing referent location in oral picture descriptions Maarten Lemmens Université de Lille Katholieke Universiteit Leuven Dept. of English I.L.T. 
(Swedish) BP 149 Dekenstraat 6 59653 Villeneuve d'Ascq Cedex, France 3000 Leuven research grant: URM 8528 SILEX du CNRS Belgium Against the background of the typological distinction between verb-framed and satellite-framed languages, the present paper will show that the opposition between verbs of POSITION (e.g. sit, lie, stand, etc.) and verbs of EXISTENCE (e.g. be, be found, etc.) can also be regarded as a parameter in the typological distinction. At the same time, some of the claims made in the literature on manner of motion verbs will have to be nuanced when applied to verbs of POSITION/EXISTENCE. The empirical basis of this contrastive research project (initially focussing on Germanic languages) consists of standard text corpora as well as elicited oral picture descriptions. 369 A corpus study of impersonalisation strategies in newspaper discourse in English and Spanish1 Juana Marín-Arrese, Elena Martínez-Caro, Soledad Pérez de Ayala Becerril Universidad Complutense de Madrid 1. Introduction Strategies of impersonalisation in English and Spanish include the use of passives, resultatives, anticausatives, impersonal constructions, nominalizations, and various forms of lexical underspecification (Nedjalkov 1988, Shibatani 1988, Gómez Torrego 1992, Fox and Hopper 1994, Mendikoetxea 1999). The use of these strategies in newspaper discourse reporting political events reflect issues in language and ideology, such as the intentional mystification of agency and vagueness of responsability in discourse (Fairclough 1989, Fowler 1991, Gruber 1993, Curran and Seaton 1997, van Dijk 1998). This paper reports on on-going research as part of a major project on the use of these strategies in British and Spanish newspaper articles on political issues. In the collection of texts, we have established a gradient, from those news reports where neither Britain nor Spain is alluded to or implicated, to a situation where each country is both mentioned and implicated in the event. The purpose of this study is twofold: (1) the identification of qualitative and quantitative differences in the use of impersonalisation strategies between the two languages2, i.e. whether the same type of strategies are used in the two languages and the extent to which they are used; and (2) the correlation between the use of strategies and the degree of implication which the newspaper articles reflect in both languages. 2. Impersonalisation Strategies 2.1. Impersonalisation We have used the term ‘impersonalisation strategies’ to refer to a variety of linguistic means which allow for mystification of the role of agency. In our study we have focused on the following: agentless passives, ed-participles, resultatives, impersonals, anticausatives, impersonal pronouns, infinitive clauses, nominalizations, existentials, and a set of miscellaneous occurrences of lexical underspecification, including metonymy and others. In terms of recoverability of the identity of the agent, the examples chosen represent a gradient in implicitnes of agency. In some cases, the underlying agent is recoverable from the preceding or following co-text. In others, it may be inferred on the basis of shared knowledge, or shared event or context models, which allows us to predict the type of agent characteristically involved in the event, though the identity of the agent is not recoverable to the point where unique reference can be established. Finally, there are cases where deagentivization is absolute. 
The reasons for omission of the agent in the passive may be based on relevance criteria (Sperber and Wilson 1986). As Biber et al. (1999:477) note, in news reports the agent “may be easy to infer, uninteresting, or already mentioned”. But this exclusion may also be the result of “possible ideologically motivated obfuscation of agency, causality and responsibility”, as Fairclough (1989:124) puts it. In this way, the passive allows not only for mystification of agency but also for claiming ignorance about the identity of the agent, thus obscuring responsibility for negative action. Impersonal constructions or impersonal use of pronouns such as people, they, someone, no one typically exclude both Speaker/Writer and Addressee/Reader from the action, exonerating them from responsibility and implication. The use of we, you, one, on the other hand, allows for the inclusion of the Speaker/Writer (and the Addressee/Reader), in this way creating an expectation of implication and responsibility in the action. Very often, however, they also reflect the distinction established between ingroups and outgroups of various kinds (van Dijk 1998). In the inchoative or anticausative3 construction, intentional actions are presented as ‘events’ occurring spontaneously. Spontaneous events are typically cases of autonomous or absolute construal. We may distinguish between ‘intrinsically spontaneous events’ (Kemmer 1993) or ‘internally-caused’ events (Levin and Rappaport 1995), that is, those viewed as occurring without the direct initiation of an external cause, and ‘non-intrinsically spontaneous events', involving some abstract and schematic cause, the evocation of the notion of ‘external causation’ (Langacker 1991), which may be agent, instrument, natural force or circumstance. In the case of non-intrinsically spontaneous events, the expression of the causative event is realized by means of an unmarked lexical causative construction; the same verb form is used in both causative and inchoative in English, the so-called labile alternation. Spanish presents a far more complex situation, though, in general terms, non-intrinsically spontaneous events tend to be coded by means of the marked anticausative construction with se, while intrinsically spontaneous events are coded by an unmarked intransitive (unaccusative) construction (Marín-Arrese 2000). Nominalizations represent a step further in impersonalisation. The examples chosen were those where the agent is not mentioned and the patient participant is defocused. The actional component is obscured, and the event is presented as ‘fact'. Referential underspecification is found in various forms of lexicalization used to identify the participants. Agents may be described in general or abstract terms, or may be referred to through the use of metonymic expressions. Other such devices include ethnonyms, role description, attribute description, etc. Choices in the level of specificity in describing the actions may also result in the use of abstract nouns denoting events, rather than intentional actions.
1 This paper is based on work supported by the Universidad Complutense de Madrid under research project n (Project Director: Juana Marín-Arrese).
2 In this paper for the proceedings we only present and discuss the data for Spanish, due to limitations of space.
3 There is considerable variation with regard to the terminology used in the literature. The terms ergative and pseudo-passive have also been found.
In view of the above mentioned strategies, we may posit a continuum in ‘agency', ranging from implicit reference to the agent to some abstract and schematic notion of causation, and a parallel continuum in ‘actionality', with actions at one polar end, and facts at the other (adapted from Marín- Arrese, in press). IMPLICIT <------ NON-RECOVERABLE --------> SCHEMATIC Passive Nominalization Anticausative Abstract nominals ACTION <------ EVENT -------> FACT Passive Anticausative Nominalization Abstract nominals These notions are related to the more general conceptual dimension ‘relative elaboration of event'', which, as Kemmer (1994:211) suggests, “can be thought of as the degree to which different schematic aspects of a situation are separated out and viewed as distinct by the speaker”. This dimension subsumes the semantic parameter relative ‘distinguishability of participants'. In passive events, for example, the Initiator or Agent participant is defocused and thus the degree of distinguishability of participants is lower than in active causative events. Similarly, in spontaneous events, the single participant coded is construed as the Initiator and also as the Endpoint, since it undergoes some change of state as well. Langacker (1991:372) observes that “since transitivity depends on the conception of distinct, well-differentiated participants, it is potentially influenced by the extent of their differentiation along not only the objective but also the subjective axis”. This participant distinctiveness thus involves not only more objective features, such as the conceptual distinction of entities into separate participants (agent vs. patient), or the relative salience of these entities with respect to each other and from their background, but also a subjective component, comprising parameters such as “the precision and detail of its type specification” (basic level vs. superordinate category), “the degree of definiteness, which pertains to whether the speaker and hearer have succeeded in making mental contact with a particular instance of the type in question”, and finally “the profile's extension”, that is whether the entity is presented as compact (vs. extended/diffuse) in terms of the oppositions: “spatially compact vs. spatially extended; participant vs. setting; singular vs. plural; count vs. mass; concrete vs. abstract; and restricted portion of a reference mass vs. the mass as a whole”. 2.2. Towards a Taxonomy of Impersonalisation Strategies in Spanish Impersonalisation strategies in Spanish include the periphrastic passive with ser and the resultative with estar, as well as the ed-participle and a variety of motion and result verbs in nonprototypical and resultative passives. 371 (1) a. Una persona fue detenida por resistir a la autoridad (CSp01) one person BE.PAST.3SG arrested ‘One person was arrested for resisting ... ‘ b. , una vez vencido el terrorismo, (CSp10) once defeat.ED the terrorism ‘Once terrorism is defeated’ c. .Que por qué estamos amenazados? (CSp02) that why BE.1PL threatened You mean, why are we threatened? d. , la propuesta va dirigida contra ... (CSv04) the proposal GO.3SG directed against ‘the proposal is directed against ...’ Spanish makes use of a reflexive element se for both the foregrounding and backgrounding, or promotional and non-promotional, reflexive passive4 (Foley and van Valin 1984; Givon 1990), as well as for the impersonal reflexive. 
The expression of non-intrinsically spontaneous events in Spanish also involves the use of se in the anticausative construction (Marín-Arrese 1992, 2000). (2) a. Aunque el informe se elaborará más tarde, (CSv05) although the report SE prepare.FUT.3SG later ‘Although the report will be prepared later, ...’ b. que se importaran harinas cárnicas , (CSa02) that SE import.SUBJ.3PL flour.PL meat.PL ‘that compound feed (made from meat) should be imported, c. no se ha detenido a nadie (CSp01) not SE have.3SG arrested to.ACC nobody ‘Nobody has been arrested’ d. que se les equipare a los funcionarios (CSp04) that SE them.ACC put on a level.SUBJ.3SG to.DAT the civil servants ‘that they should be put on a level with civil servants’ e. no se les reconoce derechos fundamentales de las personas (CSp09) not SE them.DAT recognize.3SG rights.PL fundamental of the persons ‘They do not recognize their fundamental rights as people/their fundamental rights are not recognized’ f. se cree que podría ser decisivo ... (CSa04) SE believe.3SG that might.3SG be decisive ‘It is believed/They believe that it might be decisive ... g. mientras no se avance. (CSp03) while not SE advance.SUBJ.3SG ‘while there is no advance’ h. ‘sorprendentemente’ empezaron a producirse los reiterados incendios de ... (CSp07) surprisingly begin.PAST.3PL to produce.SE the repeated fires of ‘surprisingly, the repeated fires (of ...) started to take place’ Intentional actions may also be portrayed as intrinsically spontaneous events by means of an intransitive (unaccusative) construction (Martínez-Caro 1999). (3) a. porque la lucha empieza ahora. (CSp05) because the fight begin.3SG now ‘because the fight begins now’ 4 We have established a distinction between the promotional reflexive passive, where the Endpoint nominal acquires features of subjecthood, and the non-promotional passive, where there is no verbal agreement. In both cases, the expression of the agent is typically not allowed. Finally, we have included within the impersonal category all those instances where there is no Endpoint nominal element (Marín-Arrese 1992) 372 b. estas propuestas han surgido ...(CSp04) these proposals have.3PL arisen ‘these proposals have arisen ...’ As in the case of English, other impersonalization strategies in Spanish include impersonal use of pronouns, infinitive clauses, existentials, and nominalizations. (4) a. “En el sur nos explotáis, en el norte nos expulsáis” (CSv05) in the south us.ACC exploit.2PL, in the north us.ACC expel.2PL ‘In the South you exploit us, in the North you expel us’ b. donde le informaron que ... (CSv03) where him inform.3PL that ‘where they informed him that ...’ c. para luchar contra el terrorismo (CSa01) to fight.INF against the terrorism ‘to fight against terrorism’ d. hay que acabar con el tabú de ... (CSp10) have.3SG that finish.INF with the taboo of ‘we have to put an end to the taboo of ...’ e. urge poner en común una política de ... (BSv02) urge.3SG put.INF in common a policy of ... ‘it is urgent to negotiate an immigration policy...’ f. No hubo torturas. (BSv03) not have.PAST.3SG torture.PL ‘There was no torture/He was not tortured’ g. la diseminación de modelos culturales (ASa03) the dissemination of models cultural ‘the dissemination of cultural models’ Finally, we find miscellaneous occurrences of referential underspecification and lexical vagueness, where the actors are described at various levels of generality, or by means of metonymy, metaphor, etc., and the actions are presented as abstract nominals. (5) a. 
si la gente altera el balance de la creación (ASa03) if the people alter.3SG the balance of the creation ... ‘if people alter the balance of creation ...’ b. Los extranjeros desconfían de la buena voluntad de ... (CSa05) the foreigners mistrust.3PL of the good will of ‘Foreigners mistrust the good will of ...’ c. mataderos que incumplen la ley. (CSp01) slaughterhouses that break.3PL the law ‘slaughterhouses that break the law’ d. esta reliquia criminal le ha arrancado a companeros como ...(CSp02) this relic criminal him.DAT have.3SG torn off to.ACC comrades like ... ‘this criminal relic has taken away from him comrades such as...’ e. los sucesos en la comarca de Almería (CSv05) the events in the region of Almería ‘the events in the region of Almería’ f. que las responsabilidades puedan llegar al Ministerio de Agricultura (CSp07) that the responsibilities may.3PL arrive at.the Ministry of Agriculture ‘that the responsibilities may reach the Ministry of Agriculture’ 3. Text Collection and Data 3.1. The Texts 373 The corpus used for this paper consists of a total of 60 texts extracted from three Spanish newspapers: 30 from El País, 15 from ABC and 15 from La Vanguardia (approx. 45.000 words). ABC and La Vanguardia are rather conservative, and therefore represent ideologies that are close to the present Government in Spain. El País is more radical, and thus more critical with the present governmental policy. ABC and El País are national newspapers, while La Vanguardia is issued in Cataluna, though there is also an edition printed in Madrid. The texts collected are news reports of a political nature, chosen from the National and International sections in the papers. The texts vary in degree of ‘potential’ implication: Type A includes international news reports, with no national implication, i.e. the news do not affect Spain in an immediate and direct way, and the country is not even mentioned (e.g. the recent U.S. presidential election); Type B is a selection of international news reports which affect Spain in a direct way (e.g. the conflict concerning the presence of the nuclear submarine Tireless in Gibraltar); Type C includes national news reports, which obviously affect the country in a direct way. The codification system that has been used for examples of each text is the following: AS: International [- implication], SPANISH BS: International [+ implication], SPANISH CS: National, SPANISH. p: El País, a: ABC, v: La Vanguardia 01, 02...: text number. For example, CSp01 refers to a National item of news taken from the Spanish newspaper El País, text number 1. 3.2. Categories of Impersonalisation Strategies The types of impersonalisation strategies found in the Spanish texts have been organized and numbered in the following way: (i) Agentless Passive (ii) –Ed participle (-agent) (iii) Resultative estar (iv) Non-prototypical passive/resultative (-agent) (v) Passive se (foregrounding or promotional) (vi) Passive se (backgrounding or non-promotional) (vii) Impersonal se (viii) Anticausative se, Unmarked intransitive (spontaneous events) (ix) Impersonal use of pronouns (they, you, one, we, ...) (x) Infinitive clauses (-agent) (xi) Modalised impersonal expressions (hay que, urge, ...) (xii) Existential (xiii) Nominalisations (xiv) Miscellaneous lexical strategies 4. Discussion of results The following table (Table 1) shows the number of instances of each strategy found in the three groups of texts: international without implication (A); international with implication (B); and national (C). 
Within each type, differences between conservative (ABC and La Vanguardia) and liberal newspapers (El País) are shown. The third column of each group of texts contains the global percentages of use for each strategy in comparison to the other groups of texts. For example, we have found a total of 58 cases of strategy I (Agentless periphrastic passive); of this amount, 23.3% represent instances found in A texts (both conservative and liberal), 34.4% in B texts and 36.2% in C texts. The last line contains the total number of examples of all the strategies in each type of text and the percentages that these represent; it also gives the total number of cases for the three groups of texts analysed, adding up to a total of 859 instances. The last column of the table includes the total number of instances of each strategy, followed by the percentage each category represents with regard to the overall number of cases. These figures give us an idea of the relative distribution of impersonalisation strategies in discourse.

Strategy | A texts (15,696 words): Cons. / Liberal / % | B texts (15,234 words): Cons. / Liberal / % | C texts (14,088 words): Cons. / Liberal / % | TOTAL (n, %)
I | 7 / 10 / 23.3 | 12 / 8 / 34.4 | 2 / 19 / 36.2 | 58 (6.7%)
II | 6 / 14 / 25.3 | 6 / 5 / 13.9 | 21 / 27 / 60.7 | 79 (9.1%)
III | 1 / 1 / 13.3 | 2 / 4 / 40 | 5 / 2 / 46.6 | 15 (1.7%)
IV | 2 / 3 / 27.7 | 1 / 4 / 27.7 | 5 / 3 / 44.4 | 18 (2%)
V | 14 / 18 / 19.8 | 16 / 26 / 26 | 38 / 49 / 54 | 161 (18.7%)
VI | 0 / 1 / 12.5 | 1 / 3 / 50 | 0 / 3 / 37.5 | 8 (0.9%)
VII | 8 / 5 / 30.9 | 6 / 5 / 26.1 | 6 / 12 / 42.8 | 42 (4.8%)
VIII | 4 / 2 / 11.7 | 2 / 11 / 25.4 | 16 / 16 / 62.7 | 51 (5.9%)
IX | 6 / 8 / 36.8 | 2 / 5 / 18.4 | 13 / 4 / 44.7 | 38 (4.4%)
X | 7 / 3 / 18.8 | 7 / 4 / 20.7 | 18 / 14 / 60.3 | 53 (6.1%)
XI | 2 / 3 / 21.7 | 8 / 2 / 43.4 | 5 / 3 / 34.7 | 23 (2.6%)
XII | 0 / 1 / 10 | 0 / 3 / 30 | 2 / 4 / 60 | 10 (1.1%)
XIII | 25 / 23 / 27.7 | 20 / 13 / 19 | 43 / 49 / 53.1 | 173 (20.1%)
XIV | 10 / 22 / 24.6 | 7 / 10 / 13 | 31 / 50 / 62.3 | 130 (15.1%)
TOTAL | 92 / 114 / 23.9 | 90 / 103 / 22.4 | 205 / 255 / 53.5 | 859
Table 1: Number of instances of impersonalisation strategies and percentages within the three groups of texts (A, B and C) and in total

In this table we may observe a number of interesting features. Going from the more general to the more specific, one striking aspect is that both the A and the B texts have similar total figures and percentages, while the total figures for the C texts are more than double those of A and B. This, we think, may reflect the higher degree of implication of the writers in the national news reports (C texts), who also resort to a higher number of impersonalisation strategies than the writers of international news. Although the B texts represent news in which Spain is also implicated, the data seem to indicate that the degree of implication is lower, and writers appear to use impersonalisation strategies in a similar way to writers of international news with no implication (A texts). Among the more specific results, we would like to highlight the following aspects: (a) With respect to the differences in the use of the various strategies, the impersonalisation device most used in Spanish seems to be nominalisation (strategy XIII), followed by the use of the se passive (V), and, somewhat less frequently, -ed participles (II). The miscellaneous category (XIV) also shows high figures, but the results are not as significant since it is a mixed category comprising various lexical forms of underspecification. (b) The figure for the ordinary or periphrastic passive (strategy I) is remarkably low, especially if compared with the use of the passive in English. Biber et al. (1999:476) find that passives are quite common in news, “occurring about 12,000 times per million words”.
The ratio in English would thus be 0.012, while the ratio of use of the agentless passive in the Spanish texts is 0.0012, that is, one tenth of the frequency in English (of both agented and agentless passives). It is also interesting to compare this figure to that of the se passive (V), which shows that Spanish writers tend to use se passive constructions twice as often as the ordinary passive construction. Nonetheless, if we add up the results for passive in Spanish (strategies I, IV, V, VI), the ratio of use , 0.005, is comparable to that found for agentless passives in English, 0.006, (Marín-Arrese 1997). (c) If we now compare the percentages for texts A, B and C, as we already mentioned, the figures are in general much higher in the C texts. This is especially the case with respect to the use of the -ed participle (II), the se passive (V), anticausatives and intransitives (VIII), impersonal infinitive clauses (X), existential haber, existir (XII) and nominalisations (XIII). 375 With respect to the use of each strategy, it is worth considering the following: (a) The cases of agentless periphrastic passive can be divided into two groups, from the point of view of the writer's intention to hide the identity of the agent. On the one hand, as mentioned above, there are many examples where the agent is not mentioned simply because it is unimportant or unknown. On the other hand, and less frequently, there are also examples which seem to show that the writer is using this syntactic device to intentionally make the agent disappear. Compare, in this respect, examples 6 and 7, which respectively show the two tendencies mentioned: (6) (Pinochet instructs the military auditor to hide the fact that Mr. Eugenio Ruiz-Tagle was tortured in the following way:) “El senor Eugenio Ruiz-Tagle O. fue ejecutado en razón a los graves cargos que existían contra él. No hubo torturas.” (BSv03) (‘Mr. Eugenio Ruiz-Tagle was executed due to the grave accusations that existed against him. There was no torture/He was not tortured') (7) ... miembros de la mafia que residen en Espana y que están pendientes de ser juzgados o han sido condenados por graves delitos en su país. (BSv01) (‘... members of the mafia living in Spain and who still have to be tried or have been sentenced for serious crimes in their country'). (b) In some of the strategies it is interesting to note that there is a tendency for most examples to come from direct quotations. This is typically the case of the impersonal use of pronouns (IX) and the impersonal se (VII), and is also found in strategies such as the non-prototypical passive (IV). See the following : (8) “En este país todos sabemos qué es lo que tenemos que hacer para ...” (CSp02) (‘In this country we all know what we have to do to ... ‘) (c) Most examples of impersonal se occur with verbal processes of the type of decir, asegurar, confirmar: (9) En medios diplomáticos y políticos se aseguraba ayer que si fracasa la operación Peres... (ASp08) (‘In diplomatic and political circles they assured that ...') (d) It often happens that writers use more than one category, or the same category more than once, in the same stretch of text, thus creating a cumulative effect of mystification of agency. See for example: (10) “Nosotros (los jueces) no criticamos, sólo observamos y se resuelve sobre lo que se observa. Y mientras aquí se siga creyendo que un juez toma partido, seremos subdesarrollados mentalmente”, remarcó el magistrado chileno. 
(BSa01) (‘We (the judges) do not criticize, we only observe, and rule according to what is observed. And while people here continue to believe that a judge takes sides, we will be mentally underdeveloped') (11) Los querellantes aseguran que se ha permitido al Tireless navegar por aguas espanolas, no ha sido inspeccionado por técnicos espanoles, no se ha adecuado plan de emergencia alguno en previsión de incidentes en la reparación del submarino, se han ocultado datos relevantes, se ha mentido por el presidente del Gobierno (sic) respecto a la existencia de dictámenes del Consejo de Seguridad Nuclear y se ha puesto en riesgo a los habitantes del Campo de Gibraltar. (BSp04) (‘...the submarine Tireless has been permitted ..., it has not been inspected by ..., no emergency plan has been adopted ..., relevant facts have been concealed ..., the President of the Government has lied with respect to ... and the inhabitants of Gibraltar have been endangered ...') Example (11) is particularly interesting, because the cumulative effect of mystification of agency (the various members of the Government) is suddenly broken by the ‘allegedly ungrammatical’ inclusion of the adjunct “por el presidente del Gobierno”, expressing the identity of the agent of the action denoted by an impersonal passive clause. The effect is such that this agent is 376 immediately attributed direct responsibility for all the previous events denoted. (e) It has been observed that many cases of mystification of agency may have to do with the use of linguistic politeness, that is, writers use impersonalisation strategies to avoid direct facethreatening acts. Brown and Levinson (1987) include the phenomenon of impersonalisation as negative politeness strategy number 7 (see also Pérez de Ayala 2001). In our corpus we have found the following: (12) “Hay actitudes y comportamientos que sobrepasan los límites más elementales del sentido común”, comentó (Josep Piqué) (BSp04) (‘There are attitudes and behaviours that excede the most elemental limits of common sense.') It should be taken into account that the results here presented are rather tentative, since the corpus used is relatively limited. 5. Conclusion The present paper has investigated the use of impersonalisation strategies in Spanish in a corpus of newspaper articles on political issues. The texts represent various degrees of ‘potential’ implication on the part of the writer with regard to the topic of the news item. The results found seem to point to a direct relationship between degree of implication and more frequent use of the various strategies of impersonalisation. As regards the relative use of each of the strategies, Spanish newspaper discourse seems to favour the use of passive se and that of nominalisations. Although the factor of relevance is undoubtledly crucial in the omission of the agent, we may also surmise that the varied use of all these strategies cognitively contributes to construct, in van Dijk's (1998) words, ‘preferred models’ of a situation, and, socio-politically, to hide institutional or elite group responsibility. References Biber, D., Johansson, S. Leech, G., Conrad, S. and E. Finegan 1999 Longman Grammar of Spoken and Written English. London, Longman. Brown, P. and S. Levinson 1987 Politeness: Some universals in language usage. Cambridge, Cambridge University Press. Curran, J. and J. Seaton 1997 Power without Responsibility. The Press and Broadcasting in Britain. London, Routledge. van Dijk, T. 1998 Ideology: A Multidisciplinary Approach. 
London, Sage. Fairclough, N. 1989 Language and Power. London, Longman. Foley, W.A. and R.D. van Valin 1984 Functional Syntax and Universal Grammar. Cambridge, Cambridge University Press. Fowler, R. 1991 Language in the News: Discourse and Ideology in the Press. London, Routledge. Fox, B. and P. Hopper (eds.) 1994 Voice: Form and Function. Amsterdam, John Benjamins. Givon, T. 1990 Syntax: A Functional-Typological Introduction, Vol.II. Amsterdam, John Benjamins. Gómez Torrego, L. 1992 La Impersonalidad Gramatical: Descripción y Norma. Madrid, Arco libros. Gruber, H. 1993 Political language and textual vagueness. Pragmatics 3(1): 1-28. Kemmer, S. 1993 The Middle Voice. Amsterdam, John Benjamins. Kemmer, S. 1994 Middle voice, transitivity and the elaboration of events. In: B. Fox and P. Hopper (eds.), pp. 179-230. Langacker, R. 1991 Foundations of Cognitive Grammar. Vol. II. Stanford, CA, Stanford University Press. Levin, B. and M. Rappaport Hovav 1995 Unaccusativity: at the Syntax-Lexical Semantics Interface. Cambridge, MA, MIT Press. Marín Arrese, J. 1992 La Pasiva en Inglés: Un Estudio Funcional-Tipológico. PhD dissertation, published 1993. Madrid, Editorial Complutense. Marín-Arrese, J. 1997 Cognitive and discourse-pragmatic factors in passivisation. ATLANTIS XIX (1): 203-218. Marín-Arrese, J. 2000 On ‘thematic subject’ constructions in English and Spanish. Paper presented at the International Conference on ‘Cognitive Typology', 12-14 April 2000, University of 377 Antwerp, Antwerp. Marín-Arrese, J. in press The linguistic coding of spontaneous and facilitative events in English: A cognitive perspective. To appear in: New Waves: 21st Century English Grammar. Huelva, Universidad de Huelva. Martínez-Caro, E. 1999 Gramática del Discurso: Foco y Énfasis en Inglés y en Espanol. Barcelona, PPU. Mendikoetxea, A. 1999 Construcciones con se: medias, pasivas e impersonales. In: Bosque, I. and V. Demonte (eds.,1999) Gramática Descriptiva de la Lengua Espanola. Madrid, Espasa Calpe. Myhill, J. 1997 Toward a functional typology of agent defocusing. Linguistics 35: 799-844. Nedjalkov, V. (ed.) 1988 Typology of Resultative Constructions. Amsterdam, John Benjamins. Pérez de Ayala Becerril, S. 2001 FTAs and Erskine May: Conflicting needs? Politeness in Question Time. Journal of Pragmatics 33: 143-169. Shibatani, M. (ed. ) 1988 Passive and Voice. Amsterdam, John Benjamins. Sperber, D. and D. Wilson 1986 Relevance: Communication and Cognition. Oxford, Basil Blackwell. 378 Is there such a thing as a translator's style? Mikhail Mikhailov and Miia Villikka Department of Translation Studies, University of Tampere, Finland {Mihail.Mihailov, Miia.Villikka}@uta.fi. Like women, translations should be either beautiful or faithful (cf. Mounin 1994) 1. General It would be naive to think of a literary translation as an exact copy of the original, its simple replica in another language. Even a situation in which two translators are given the same source text and instructed to translate it as faithfully to the original as possible would result in two clearly different translations. In translation studies the question of loyalty towards the original is one of the most discussed ones. It has been widely argued that literary translations should maintain the style and structure, the “spirit” of the original as intact as possible (see e.g. Chesterman 1997, Nida & Taber 1974). After all, it is the original author's name that is printed on the very cover of the translation, i.e. 
even the translation is considered to be a work of the author, not of the translator.1 However, it must be borne in mind that translators are also individuals and it is impossible for them to totally set aside their own personality and “get under the original author's skin”. There are several factors influencing the translation process and, consequently, the final product, the translation. One important factor – the key focus area of this research – is the fact that many translation solutions must be decided upon independently. Translating is not just a simple decoding-recoding action performed merely on the level of words – and even if it were, even the best of dictionaries could not possibly offer equivalents and examples for every possible context of a certain word or expression. Also, there is often more than one suitable equivalent, and the use of these variants is in most cases solely up to the translator's choice. Further, the relations between two different language (grammar) systems are not stable either: rather seldom is there only one possible way to translate a certain structure. Naturally, every translator aims at producing as fluent a text in the target language as possible, but even opinions on fluency, let alone the means of achieving it, vary greatly between different individuals.
1 Recent theories of translation (and literary theories, as well) have strongly questioned the authority of the original author (e.g. Oittinen 1995). However, the translations analysed in this research have mostly been produced around mid-20th century and thus we find it appropriate to look at the translations against the theoretical framework and background of their own time.

2. “Stylistic fingerprints”

The ‘stylistic fingerprint’ problem is nowadays widely discussed in applied linguistics. Most scholars agree that every author has a unique and identifiable style. However, there is no shared opinion on the criteria which can be used for authorship attribution. For example, the proportion of nouns or adjectives, and so-called “marker words” (e.g. while - whilst, upon - on) are frequently used. New methods like principal components analysis of the most common words, vocabulary richness measures and even letter frequency analysis are being developed (Holmes & Forsyth 1995; Tweedie & Opas-Hänninen 2000). However, the existence of a translator's stylistic fingerprints is less self-evident. It is inevitable that the translator is always under the strong influence of the original s/he is translating. Still, is there something personal the translator adds to the target text? Is it possible to define this something and to use it to identify the translator? The aim of the research carried out at the Department of Translation Studies of the University of Tampere is to find out whether translators also have ‘stylistic fingerprints'. The research is based on a parallel corpus of Russian fiction texts and their translations into Finnish. We compared, on the one hand, original Russian texts written by the same author and by different authors and, on the other hand, analysed Finnish translations of different texts performed by the same translator and, in one case, translations of the same text performed by different translators.

3. Vocabulary richness

One of the widely used methods of authorship attribution is the use of vocabulary richness measures. D. Holmes and R.
D. Holmes and R. Forsyth (1995) used the following quotients for their analysis of the Federalist Papers:

R = 100 · log N / (1 − V1/V)    (1)

K = 10^4 · (Σ_i i²·Vi − N) / N²    (2)

W = N^(V^−a)    (3)

where N is the text length, V the total number of different words used in the text, V1, V2, Vi the number of words used 1, 2, i times, and a = 0.172. The higher the number of words which were used only once (hapax legomena), the higher is the R quotient. The more high-frequency words in the text, the higher is the K quotient. The more different words there are in the text, the higher is the W quotient. We used these quotients in our research. The R, K, and W values were calculated for different original Russian texts and for Finnish translations.

Table 1. Vocabulary richness. Russian texts.2

Title  R         K       W
R1     1107.778  58.414  8.307
R2     1087.733  59.892  8.941
R3     1100.168  60.117  8.183
S1     1026.101  50.405  9.003
S2     1073.588  47.996  8.425
S3     1094.196  47.124  7.834
S4     1023.956  50.285  8.839
S5     1078.889  42.718  8.058
T1     1067.322  46.582  8.191
T2     998.055   49.245  8.638
T3     982.655   43.654  8.835

2 Texts are referred to with the initial letter of their author's (for translations also the translator's) name (see list of texts below).

Table 1 shows that the values of R, K, and W are not identical for the same author (however, in most cases they are quite close, e.g. cf. R1 vs. R2, S1 vs. S4). Still, texts by different authors might sometimes have pretty close values of vocabulary richness measures, e.g. Juri Trifonov's novel Dom na naberezhnoj (The Building on the Embankment, T1) is closer to Arkadi and Boris Strugatski's Piknik na obochine (Roadside Picnic, S2) than to other works by Trifonov (T2, T3). The Strugatskis' works fall into two groups, S1, S4 and S2, S3, S5, which differ notably from one another (one possible explanation might be a different degree of participation of the two co-authors). Only Rasputin's works really demonstrate close vocabulary richness. Thus, it can be stated that the vocabulary richness of a certain author is not a stable factor; variation between early and later works may be explained by the fact that, quite naturally, the author's style changes all the time along with his or her taste, prejudices, habits, and so on.

The same method was then used to compare translations. Most of the texts were translated by the same person, Esa Adrian; for one of the texts — Dostoyevski's Zapiski iz podpolja (Notes from the Underground) — we have two translations, by E. Adrian and by V. Kallama; and one text — Lermontov's Geroj nashego vremeni (Hero of our time) — was translated by U.-L. Heino. As demonstrated in Table 2 below, the texts translated by E. Adrian (DA, OA, R1A, SA) have quite different R, K, and W values, which seems to indicate that the vocabulary of a translation is to a large degree dependent on that of the original. This assumption is supported by the observation that the vocabulary richness values of the original R1 (see Table 1) and its translation R1A are quite close to each other (substantial deviation is found in the K-factor only, which can be attributed to the difference between the two language systems). Another important point is that the R, K, and W values for the two Finnish translations of Dostoyevski's story (Adrian, DA, and Kallama, DK) are almost identical.

Table 2. Vocabulary richness. Finnish translations from Russian

Title  R        K      W
DK     1038.76  40.03  8.54
DA     1034.74  40.94  8.48
OA     1021.40  30.17  7.91
R1A    1105.85  44.70  8.09
SA     1092.77  32.70  8.17
LH     1059.37  32.88  8.06
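The three quotients are straightforward to derive from a frequency spectrum. The following is a minimal sketch (not the software actually used in the study) of how R, K, and W might be computed for a tokenised text; the whitespace tokenisation in the usage example is a placeholder assumption, whereas in the study the input would be whole lemmatised novels.

```python
import math
from collections import Counter

def vocabulary_richness(tokens, a=0.172):
    """Compute Honore's R, Yule's K and Brunet's W for a list of tokens."""
    n = len(tokens)                      # N: text length in running words
    freqs = Counter(tokens)              # word -> number of occurrences
    v = len(freqs)                       # V: number of different words
    spectrum = Counter(freqs.values())   # i -> V_i: number of words used i times
    v1 = spectrum.get(1, 0)              # hapax legomena

    r = 100 * math.log(n) / (1 - v1 / v)                                    # formula (1)
    k = 1e4 * (sum(i * i * vi for i, vi in spectrum.items()) - n) / n ** 2  # formula (2)
    w = n ** (v ** -a)                                                      # formula (3)
    return r, k, w

# Toy usage; real input would be a full lemmatised text.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
print(vocabulary_richness(tokens))
```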
4. Most frequent words

Another method which could help to distinguish different authors is a study of the most frequent words in the texts in question. Our method is based on a comparison of lemmatised word lists (this is because people use lexemes rather than word forms). Two texts are compared by selecting the 40 most frequent words from their word lists. Then the F-index is calculated: 3 points are added for each word with close relative frequencies, 2 points for each word with different relative frequencies, 1 point for each word with quite different frequencies, and 1 point is deducted for each word absent in the other list. The results for the original texts were as follows:

Table 3. Most frequent words. Russian texts

Texts      F-index
R1 — R2    40
R1 — R3    53
R2 — R3    43
R1 — S     34
R2 — S     35
R3 — S     34
O — R1     31
S — O      33

It is obvious that if the texts were written by the same author, the frequent word lists overlap and many words have close frequencies. One could assume that the same thing might happen if the topic of the texts is close enough, but it doesn't. For instance, Rasputin's novels have the same topic as Shukshin's short stories (country life); moreover, these authors belong to the same "school" of country prosaists. However, their F-index (35 or 34) doesn't differ much from that of Shukshin vs. Olesha (33) or Olesha vs. Rasputin (31). The table clearly indicates that if the F-index is less than 40, the texts in question are quite likely written by different authors. The results of the study of the translations (presented in Table 4 below) were entirely different.

Table 4. Most frequent words. Finnish translations

Translations  F-index
DK — DA       63
SA — R1A      44
LH — DK       32
LH — DA       29
DA — R1A      28
SA — OA       28
LH — OA       26
OA — R1A      25

It is evident that the F-index for the texts translated by the same person is high only if the topic is close enough (Shukshin vs. Rasputin, SA — R1A). In other cases the F-index for texts translated by the same person doesn't significantly differ from that for texts translated by different persons. Confirming obvious expectations, the F-index has its highest value when the two different translations of the same text are compared.

5. Favourite words

Every person speaks his/her own idiolect, which means that a certain, more or less unique list of favourite words can be compiled from everybody's vocabulary. Although variation is possible, no dramatic changes can be expected. In this respect, we carried out the following experiment: the word lists of two texts were compared against the data of a large text corpus and two lists of words with frequencies much higher than in the corpus were generated. Then these two lists were compared and the number of coincidences (FW-index) was calculated. The higher this FW-index is, the closer is the language of the texts and the more probable it is that the texts were written by the same author.

Table 5. Favourite words. Russian texts

Texts      FW-index
R1 — R2    385
R1 — R3    577
R2 — R3    426
R2 — S     242
R2 — O     148
O — S      124

The comparison of the translations, again, shows that the language of different translations of the same text performed by different people is closer than that of different texts translated by the same translator.

Table 6. Favourite words. Finnish translations

Translations  FW-index
DK — DA       360
R1A — SA      74
R1A — OA      71
LH — DA       45
R1A — DA      31
R1A — DK      21
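As a rough illustration of the comparison described in sections 4 and 5, the sketch below scores the overlap of the 40 most frequent lemmas of two texts. The paper does not state the thresholds for "close", "different" and "quite different" relative frequencies, so the ratio bands used here are hypothetical values chosen purely for illustration.

```python
from collections import Counter

def f_index(tokens_a, tokens_b, top=40, close=1.25, quite_different=2.0):
    """F-index-like score for two lemmatised texts.
    `close` and `quite_different` are assumed ratio thresholds, not the study's."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    na, nb = len(tokens_a), len(tokens_b)
    top_a = {w for w, _ in fa.most_common(top)}
    top_b = {w for w, _ in fb.most_common(top)}

    score = 0
    for w in top_a | top_b:
        if w not in top_a or w not in top_b:
            score -= 1                      # word absent from the other list
            continue
        ra, rb = fa[w] / na, fb[w] / nb     # relative frequencies
        ratio = max(ra, rb) / min(ra, rb)
        if ratio <= close:
            score += 3                      # close relative frequencies
        elif ratio <= quite_different:
            score += 2                      # different relative frequencies
        else:
            score += 1                      # quite different frequencies
    return score
```

The FW-index of section 5 would follow the same pattern, except that each text's word list is first filtered against reference-corpus frequencies and the score is simply the number of shared items.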
6. What is specific?

However, although the translator is a 'chameleon' and the language, style, and core vocabulary of the translation depend on the author's style, we still believe that a translator's style is indeed an existing phenomenon. Despite the strong dependence on the original, all translators have favourite equivalents and patterns of language usage. The analysis of Finnish equivalents for Russian modal markers shows how different translators use different equivalents in analogous situations. Further, the analysis also reveals some patterns of equivalent usage, i.e. certain translators being more fond of certain words than others. This tendency is clearly present in the different translations of the same text as well. This analysis is a good example of the inadequacy of dictionaries as all-embracing guidelines for translators and of the inevitability of the translator's own choices. As indicated in Figure 1 below, the Finnish equivalents used for the Russian word kazhetsja ('it seems (to be)') are tuntuu, taitaa, tuskin, ehkä, ilmeisesti, luultavasti, tietääkseni, kai, kaipa and mielestäni. The most widely recognised Russian-Finnish dictionary gives the following equivalents: 1. näyttää; 2. tuntua; 3. kai, taitaa. It is worth noting that the first equivalent offered in the dictionary was not used in the translations at all and, on the other hand, the most widely used translation, ehkä, is not mentioned in the dictionary at all.

Figure 1. Finnish equivalents for kazhetsja in different translations (frequencies of the ten equivalents listed above in SA, R1A, OA, DA, DK and LH).

Based on this data, it can roughly be concluded that translating kazhetsja as taitaa is typical of E. Adrian; this translation is not used by the other translators at all. Compared to the other translators, V. Kallama seems to prefer the word mielestäni and U.-L. Heino appears to be especially fond of the word ilmeisesti. An especially noteworthy point is that the frequencies of these words in the two translations of the same novel (see DA and DK) are quite different; thus, their usage does not seem to depend entirely on the original. Similar conclusions can be drawn from the data on the equivalents for vse-taki (presented in Figure 2 below). It appears that E. Adrian does not like the expression kaikesta huolimatta at all and that kuitenkin is mostly typical of U.-L. Heino. It can also be claimed that V. Kallama's repertoire of equivalents is the broadest: he is the only one of these translators who has used all the equivalents analysed.

Figure 2. Finnish equivalents for vse-taki in different Finnish translations (frequencies of sittenkin, sittenkään, kuitenkin, kuitenkaan, silti, kumminkin, toki, kaikesta huolimatta, joka tapauksessa and sentään in SA, R1A, OA, DA, DK and LH).

Interesting results can be discovered by comparing the word, sentence, and paragraph counts of originals and translations. The ratio of the number of words in the original to the number of words in the translation (W-quotient), the corresponding ratio for sentences (S-quotient) and the corresponding ratio for paragraphs (P-quotient) are in fact stable values and depend on the pair of languages (Mikhailov 2001).
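A minimal sketch of how these three ratios might be derived from a source text and its translation is given below; the sentence and paragraph splitting shown here (sentence-final punctuation and blank lines) is a crude assumption for illustration, not the segmentation actually used in the study.

```python
import re

def length_quotients(source: str, translation: str):
    """Return (W, S, P) quotients: original count divided by translation count."""
    def counts(text):
        words = len(text.split())
        sentences = len(re.findall(r"[.!?]+", text)) or 1        # naive sentence split
        paragraphs = len([p for p in text.split("\n\n") if p.strip()]) or 1
        return words, sentences, paragraphs

    (w1, s1, p1), (w2, s2, p2) = counts(source), counts(translation)
    return w1 / w2, s1 / s2, p1 / p2
```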
However, Table 7 shows that the values of these three quotients are closer for the texts translated by the same person. It is also evident that the two translations of Dostoyevski (DA and DK) differ in this respect. This might be explained by the translator's attitude to the structure of the original: some translators try to generate a text of the same length and the same structure as the source text, while others believe that good style and the best possible readability in the target language are more important than fidelity to the original.

Table 7. Word, sentence and paragraph ratios for Finnish translations of Russian texts

Text  W-quotient  S-quotient  P-quotient
R1A   1.010       0.929       0.975
OA    1.074       0.979       1.001
DA    1.113       0.860       0.975
S4A   1.069       0.953       1.084
S1A   1.073       0.947       1.081
LH    1.033       0.668       0.879
DK    1.069       0.956       0.979

7. Conclusions

As was argued at the beginning of this article, it is inevitable that a translator makes a great number of independent decisions during the translation process. However, when translations were analysed with some widely used authorship attribution methods (e.g. vocabulary richness, frequent words), it appeared as if translators did not have a language and a style of their own. Still, every translator has a personal set of instruments and stylistic devices. Therefore, in the search for the translator's identity (personal features), the most important indicators could be the use of modal words, particles, conjunctions, grammar forms, etc., as well as the splitting or joining of sentences and paragraphs and the expanding or shortening of the text.

List of texts

Russian original texts3
3 If no official English translation of the title was found, our own translation is used in inverted commas.

O: Olesha Ju. Zavist' ('Envy')
R1: Rasputin V. Zhivi i pomni (Live and Remember)
R2: Rasputin V. Poslednij srok ('The Deadline')
R3: Rasputin V. Proshchanie s Materoj (Farewell to Matyora)
S: Shukshin V. Short stories.
S1: Strugatski A. & B. Paren' iz preispodnej ('The Guy from Hell')
S2: Strugatski A. & B. Piknik na obochine (Roadside Picnic)
S3: Strugatski A. & B. Ponedel'nik nachinaetsja v subbotu ('Monday Begins on Saturday')
S4: Strugatski A. & B. Popytka k begstvu (Escape Attempt)
S5: Strugatski A. & B. Trudno byt' bogom (Hard to Be a God)
T1: Trifonov Ju. Dom na naberezhnoj ('The Building on the Embankment')
T2: Trifonov Ju. Predvaritel'nyje itogi ('Preliminary Results')
T3: Trifonov Ju. Obmen ('Exchange')

Finnish translations

DA: Dostoyevski F. Zapiski iz podpolja (Notes from the Underground). Finnish title: Kirjoituksia kellarista. Translator: E. Adrian.
DK: Dostoyevski F. Zapiski iz podpolja (Notes from the Underground). Finnish title: Kellariloukko. Translator: V. Kallama.
LH: Lermontov M. Geroj nashego vremeni (Hero of Our Time). Finnish title: Aikamme sankari. Translator: U.-L. Heino.
OA: Olesha Ju. Zavist' ('Envy'). Finnish title: Kateus. Translator: E. Adrian.
R1A: Rasputin V. Zhivi i pomni (Live and Remember). Finnish title: Elä ja muista. Translator: E. Adrian.
SA: Shukshin's short stories translated by E. Adrian.
S1A: Strugatski A. & B. Paren' iz preispodnej ('The Guy from Hell'). Finnish title: Poika helvetistä. Translator: E. Adrian.
S4A: Strugatski A. & B. Popytka k begstvu (Escape Attempt). Finnish title: Pakoyritys. Translator: E. Adrian.

References

Baayen R, Tweedie FJ, Neijt A, Krebbers L 2000 Back to the Cave of Shadows: stylistic fingerprints in authorship attribution. In The 12th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, University of Glasgow, 21-25 July, 2000, pp. 156-158.
Chesterman A 1997 Memes of translation: the spread of ideas in translation theory. Amsterdam, Benjamins.
Holmes DI, Forsyth RS 1995 The Federalist revisited: new directions in authorship attribution. Literary and Linguistic Computing 10(2): 111-129.
Mikhailov M 2001 Two approaches to automated text aligning of parallel texts in fiction. Across Languages and Cultures (forthcoming).
Mounin G 1994 Les belles infidèles. Presses universitaires de Lille.
Nida E, Taber CR 1974 The theory and practice of translation. Leiden, United Bible Societies.
Oittinen R 1995 Kääntäjän karnevaali. Tampere, Tampere University Press.
Tweedie FJ, Opas-Hänninen L 2000 A comparison of methods for the attribution of authorship of popular fiction. In The 12th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, University of Glasgow, 21-25 July, 2000, pp. 105-107.

Corpus analysis and results visualisation using self-organizing maps
Dr. H. Moisl and Dr. J. Beal
Centre for Research in Linguistics, University of Newcastle
{Hermann.Moisl, Joan.Beal}@ncl.ac.uk

This paper addresses the related issues of statistical analysis of text corpora and of intuitively accessible representation of the results of such analysis, with reference to the Newcastle-Poitiers Electronic Corpus of Tyneside English (NPECTE) project. It proposes topographic mapping as a tool for analysis and visualization of the NPECTE corpus, and is in three main parts. The first gives a brief account of the NPECTE project, the second explains the nature of topographic mapping and the motivation for its use, and the third gives an example of how topographic mapping can be implemented using the Self-Organizing Map artificial neural network architecture.

1. The NPECTE project

The NPECTE project is based on two separate corpora of recorded speech:

(i) The earlier of the two corpora was gathered during the Tyneside Linguistic Survey (TLS) (Strang 1968, Pellowe 1972) in the late 1960s, and consists of 86 loosely-structured 30-minute interviews. The informants were drawn from a stratified random sample of Gateshead in North-East England, and were equally divided among various social class groupings of male and female speakers, with young, middle, and old-aged cohorts. Some transcription and analysis was done on this material at the time, but little of it was published, and work on it languished until 1995, when Joan Beal of DELLS secured funding from the Catherine Cookson Foundation to salvage the original reel-to-reel tapes to audio cassette format and to catalogue and archive the cassettes. This material is now housed in the Catherine Cookson Archive of Tyneside and Northumbrian Dialect in the Department of English Literary and Linguistic Studies (DELLS), University of Newcastle upon Tyne.

(ii) The more recent corpus was collected in the Tyneside area in 1994 for an ESRC-funded project 'Phonological Variation and Change in Contemporary Spoken English' (PVC). This data is in the form of 18 DAT tapes, each of which averages 60 minutes in length. Dyads of friends or relatives were encouraged to converse freely with minimal interference from the fieldworker, and informants were again equally divided between various social class groupings of male and female speakers in young, middle, and old-age cohorts.
This material is housed in the Department of Speech, University of Newcastle upon Tyne. Recently, an AHRB grant was awarded under the Resource Enhancement Scheme to combine the TLS and PVC collections into a single corpus and to make it available to the research community in a variety of formats: digitised sound, phonetic transcription, standard orthographic transcription, and various levels of tagged text, all aligned.

2. Topographic mapping and its application to NPECTE

a) Topographic mapping

The aim of topographic mapping is to represent relationships among data items of arbitrary dimensionality n as relative distance in some m-dimensional space, where m < n. In practice, it is used in applications where there is a large number of high-dimensional data items, and the interrelationships of the dimensions are not obvious: the data items are typically represented as a set of length-n real-valued vectors V = {v1, v2, …, vk}, and these vectors are mapped to points on a 2-dimensional surface such that the degree of similarity among the vi is represented as relative distance among points on the surface.

b) Motivation and application to NPECTE

Corpus analysis is often concerned with discovering regularities in the interrelationships of certain features of interest in the data – correlations of phonetic or graphemic features, for example, or of such things as social class, age, gender and geography with aspects of linguistic usage. Cluster analysis (Everitt 1993) has been widely and successfully used for this purpose (Manning and Schütze 1999), and topographic mapping is in fact a variety of cluster analysis. Its chief advantage over standard cluster analysis techniques is the intuitive accessibility with which analytical results can be displayed: projection of a large, high-dimensional data set onto a two-dimensional surface gives an easily interpretable spatial map of the data's structure. With regard to the application of topographic mapping to the NPECTE corpus in particular, the project's aim is not only to create an electronic resource, but to make that resource the basis of analytical research projects. We are therefore developing software tools to supplement those generally used in corpus analysis, and topographic mapping is the first of these.

3. Implementation of topographic mapping using the SOM architecture

There are several ways of implementing topographic mapping, that is, of forming two-dimensional projections of data distributions in high-dimensional spaces: principal component analysis (Jolliffe 1986, Everitt 1993), multidimensional scaling (Borg and Groenen 1997, Everitt 1993), and self-organizing maps (SOM) (Kohonen 1995). This paper adopts the last of these because SOMs have been successfully used in natural language corpus processing, and the relevant work provides a good basis for development of the applications required for the NPECTE. This section briefly describes the SOM architecture, then gives pointers to current applications of SOM in processing of textual corpora, and finally presents an example of how a SOM can be used in analysis of corpora like the NPECTE.

a) SOM

The self-organizing map, also known as the Kohonen net after its inventor, is a k-dimensional surface of processing units, where k is usually 2. Associated with each unit is a set of connections from an input buffer such that, for a buffer of length n, there are n connections per unit (for clarity, only sample connections are shown in Figure 1):

Figure 1: A self-organizing map.

Given a set V of input vectors of length n, such a net can, using the SOM training algorithm, learn to approximate the similarity relations among the vi ∈ V in n-dimensional space on the two-dimensional surface of processing units. After training is complete, each of the vi is associated with a specific unit uj in the sense that it activates uj more strongly than any other; when the activations for all the vi are plotted on the unit surface, the distances among activated units represent the similarity relations in the input vector space. Details of the network training algorithm can be found in most textbooks on artificial neural networks (for example Haykin 1999, Rojas 1996); the standard reference is Kohonen 1995.
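As a concrete illustration of the training procedure just outlined, the following is a minimal sketch of a SOM training loop in Python with NumPy. The linearly decaying learning-rate and neighbourhood schedules are simplifying assumptions for illustration, not the exact decrement scheme used in the example in section 3c below.

```python
import numpy as np

def train_som(data, side=9, iterations=10000, lr0=0.9, radius0=None, seed=0):
    """Train a side x side SOM on `data` (an array of shape [k, n])."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    weights = rng.random((side, side, n))            # one weight vector per unit
    coords = np.stack(np.meshgrid(np.arange(side), np.arange(side), indexing="ij"), axis=-1)
    radius0 = radius0 or side / 2

    for t in range(iterations):
        lr = lr0 * (1 - t / iterations)              # decaying learning rate
        radius = max(radius0 * (1 - t / iterations), 1.0)
        v = data[rng.integers(len(data))]            # pick a random input vector
        dists = np.linalg.norm(weights - v, axis=-1) # distance of v to every unit
        bmu = np.unravel_index(np.argmin(dists), dists.shape)   # best-matching unit
        grid_dist = np.linalg.norm(coords - np.array(bmu), axis=-1)
        influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
        weights += lr * influence[..., None] * (v - weights)    # pull neighbourhood towards v

    return weights

def best_matching_units(data, weights):
    """Map each input vector to the grid coordinates of its best-matching unit."""
    return [np.unravel_index(np.argmin(np.linalg.norm(weights - v, axis=-1)),
                             weights.shape[:2]) for v in data]
```

After training, plotting each input's best-matching unit on the grid gives the kind of structure map discussed below.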
b) SOM and corpus analysis

SOMs have found application in a wide range of disciplines (Kohonen 1995, chapter 7). In natural language processing (Kohonen 1995, pp. 237-249, 301; Honkela 1997), the main application to date has been in the classification of texts in large document collections. In particular, Kohonen and his research group have developed WEBSOM (Kohonen et al 2000, Kaski et al 1998, Lagus et al 1999), a system that has successfully classified over one million web documents on the basis of their lexical content. WEBSOM underlies development of the NPECTE-specific analytical tool being described here.

c) Example

Assume the existence of a phonetically-transcribed spoken corpus, like NPECTE, consisting of a fairly large number of interviews, each labelled for region, age, gender, and social class. One is interested in, say, the region and age distribution of phonetic segments in two environments, that is, of segments that occur between two specific (phonetic prefix - phonetic suffix) pairs. To carry out the analysis, the transcribed corpus is scanned for the relevant prefix-segment-suffix sequences, labelling each such sequence with the regional and age information associated with the interview from which it came. The aim is to show how a SOM can generate and display a topographic map of the structure of such a data set. To show this, data with a known structure is required; for clarity of exposition, a small, artificially constructed data set D1 will be used. The phonetic segment of interest comprises 5 variants, V1 - V5, distributed as shown in Figure 2:

Figure 2: The structure of the example data set D1

· In Region 1 all age categories use variant 1 in environment 1, and variant 2 in environment 2
· In Region 2 age categories 1 and 2 use variant 3 in environment 1 and variant 4 in environment 2, but age category 3 uses variant 4 in both environments
· In Region 3 all age categories use variant 5 in all environments

The first step is to encode the environmental prefixes and suffixes and the segment variants for processing by a SOM. This means some form of vector encoding. Again for clarity, a binary encoding is adopted, where 1 and 0 represent the presence and absence respectively of a phonetic feature; prefixes and suffixes are encoded as 3-bit and segment variants as 6-bit binary vectors:

E1 prefix: 010   E1 suffix: 101   E2 prefix: 011   E2 suffix: 110
V1: 100011   V2: 100111   V3: 010011   V4: 010111   V5: 001111

The encodings are arbitrary, and are not intended to be interpretable as specific phonetic features. The data set corresponding to the structure in Figure 2 is thus:

1. R1 A1 E1 0 1 0 1 0 0 0 1 1 1 0 1
2. R1 A1 E2 0 1 1 1 0 0 1 1 1 1 1 0
3. R1 A2 E1 0 1 0 1 0 0 0 1 1 1 0 1
4. R1 A2 E2 0 1 1 1 0 0 1 1 1 1 1 0
5. R1 A3 E1 0 1 0 1 0 0 0 1 1 1 0 1
6. R1 A3 E2 0 1 1 1 0 0 1 1 1 1 1 0
7. R2 A1 E1 0 1 0 0 1 0 0 1 1 1 0 1
8. R2 A1 E2 0 1 1 0 1 0 1 1 1 1 1 0
9. R2 A2 E1 0 1 0 0 1 0 0 1 1 1 0 1
10. R2 A2 E2 0 1 1 0 1 0 1 1 1 1 1 0
11. R2 A3 E1 0 1 0 0 1 0 1 1 1 1 0 1
12. R2 A3 E2 0 1 1 0 1 0 1 1 1 1 1 0
13. R3 A1 E1 0 1 0 0 0 0 1 1 1 1 0 1
14. R3 A1 E2 0 1 1 0 0 0 1 1 1 1 1 0
15. R3 A2 E1 0 1 0 0 0 0 1 1 1 1 0 1
16. R3 A2 E2 0 1 1 0 0 0 1 1 1 1 1 0
17. R3 A3 E1 0 1 0 0 0 0 1 1 1 1 0 1
18. R3 A3 E2 0 1 1 0 0 0 1 1 1 1 1 0

Hierarchical cluster analysis (squared Euclidean distance, average linkage) reveals the structure of this data (Figure 3):

Figure 3: A hierarchical cluster analysis of D1

The two main clusters (1-3) and (4-6) correspond to environments E1 and E2, and within both of these there is subclustering first by region and then by age. This is the structure which the SOM is expected to discover from the data. A SOM was trained on D1 with the following parameters:

Map axis: 9 (that is, a 9 x 9 unit layer)
Initial learning rate: 0.9
Learning rate decrement: 0.01
Learning rate decrement interval: 10 iterations
Initial neighbourhood: 9
Neighbourhood decrement interval: 40 iterations
Number of training iterations: 10000

After training the D1 vector set was presented to the net, with the following result (Figure 4):

Figure 4: The SOM's analysis of D1

The main E1 and E2 clusters are clearly separated from one another on the left and right sides of the map. The groups in the E2 region are equidistant from one another, corresponding to the E2 subtree in Figure 2; the distance match with the hierarchical cluster tree, where the distances among 4, 5, and 6 are slightly asymmetrical, is not exact, but the map approximates this as closely as possible given its coarse granularity. The E1 region also closely reflects the E1 cluster subtree since, in both, 2 and 3 are closer to one another than they are to 1. In addition, the cluster tree shows that the structure of group 2 is unlike that of the other groups 1 and 3-6 in that R2A3E1 differs substantially from the other two members of the group; the map shows a corresponding distance relation. It can, therefore, be said that the SOM gives a good 2-dimensional spatial representation of the vector similarity relations in the data set, which itself encodes regional, age, and phonetic environment variation in our hypothetical corpus.
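For readers who want to reproduce the hierarchical baseline just described, the sketch below builds the D1 vectors from the encodings in section 3c and the distribution in Figure 2, and runs an average-linkage clustering on squared Euclidean distances with SciPy. It is a minimal illustration rather than the software used in the project.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import pdist

# Binary encodings given in section 3c (prefix + variant + suffix = 12 bits).
prefix = {"E1": "010", "E2": "011"}
suffix = {"E1": "101", "E2": "110"}
variant = {1: "100011", 2: "100111", 3: "010011", 4: "010111", 5: "001111"}

def variant_for(region, age, env):
    """Variant choice per Figure 2."""
    if region == 1:
        return 1 if env == "E1" else 2
    if region == 2:
        if age == 3:
            return 4
        return 3 if env == "E1" else 4
    return 5                                  # Region 3: variant 5 everywhere

labels, rows = [], []
for region in (1, 2, 3):
    for age in (1, 2, 3):
        for env in ("E1", "E2"):
            bits = prefix[env] + variant[variant_for(region, age, env)] + suffix[env]
            labels.append(f"R{region}A{age}{env}")
            rows.append([int(b) for b in bits])

D1 = np.array(rows)
# Squared Euclidean distances, average linkage, as in Figure 3.
Z = linkage(pdist(D1, metric="sqeuclidean"), method="average")
dendrogram(Z, labels=labels)                  # requires matplotlib for display
```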
Now, the hierarchical cluster tree is easily as clear about the structure of the data as the SOM. What, therefore, is the advantage of SOMs over established cluster analysis methods like hierarchical analysis in corpus work? The answer is that, as data sets grow larger and their structure more complex, existing hierarchical methods become increasingly difficult to interpret, whereas SOMs remain clear. Consider, for example, another artificial data set D2 of 1000 length-24 real-valued vectors. These were generated by a process which subdivided them into 5 main groups, numbered (0-200), (201-500), (501-800), (801-950), and (951-1000). Each was then given some subsidiary structure, and finally noise was injected into the whole set by randomly selecting two components of each vector and incrementing the values found there by a small random amount. Figures 5 and 6 show the results of hierarchical cluster analysis (Euclidean distance, single linkage) and SOM analysis respectively.

Figure 5: A hierarchical cluster analysis of D2

Figure 6: A SOM analysis of D2

Both are clear with respect to the main structure of the data, but the subsidiary structure within the main clusters is much clearer in Figure 6. Both hide some structural information. In Figure 5 it is almost entirely opaque and difficult if not impossible to comprehend. In Figure 6 not all 1000 vectors are present on the map: many vectors are mapped to a single unit and, since only one vector label can be represented at any given location, only one vector location can be displayed. Thus, the upper right-hand corner of the SOM represents vectors 1-200, but only a minority of these is visible. The solution in both cases is to implement a graphical interface that permits interactive browsing of the structure display. For Figure 5 this would reveal ever more detailed subtrees, but, as data sets grow larger, such zooming-in soon makes it difficult to see the selected subtree region in relation to the structure tree as a whole. For Figure 6 there are at least two possibilities. On the one hand, selection of a given unit could display a list of vectors associated with that unit in a way that allows the analyst to maintain a clear view of its place in the overall structure map, as in Figure 6a. On the other, and more interestingly, one could use a hierarchical feature map (Merkl 2000) in which each unit of the main SOM has its own SOM associated with it, allowing the structure of the vectors mapped to the node of interest to be displayed, as in Figure 6b.

Conclusion

Topographic mapping is a nonhierarchical clustering technique that can project high-dimensional data sets onto low, usually two-dimensional surfaces such that the similarity relations of the data are represented as the spatial distribution of points on the surface. In relation to text corpus analysis, its main advantage over standard hierarchical cluster analysis methods for this purpose is the intuitive clarity of results visualization as a spatial structure map, and the scope of that visualization for interactive exploration of the map. The aim is to develop a SOM-based implementation of a topographic mapping tool for analysis of the NPECTE corpus.

References

Borg I, Groenen P 1997 Modern Multidimensional Scaling - Theory and Applications. Springer.
Everitt B 1993 Cluster Analysis, 3rd ed. E. Arnold.
Haykin S 1999 Neural Networks. A Comprehensive Foundation. Prentice Hall International.
Honkela T 1997 Self-Organizing Maps in Natural Language Processing. PhD thesis, Helsinki University of Technology, Espoo, Finland.
Jolliffe I 1986 Principal Component Analysis. Springer.
Kaski S, Honkela T, Lagus K, Kohonen T 1998 WEBSOM--self-organizing maps of document collections. Neurocomputing 21: 101-117.
Kohonen T 1995 Self-Organizing Maps, 2nd ed. Springer.
Kohonen T, Kaski S, Lagus K, Salojärvi J, Paatero V, Saarela A 2000 Self organization of a massive document collection. IEEE Transactions on Neural Networks 11(3): 574-585.
Lagus K, Honkela T, Kaski S, Kohonen T 1999 WEBSOM for textual data mining. Artificial Intelligence Review 13(5/6): 345-364.
Manning C, Schütze H 1999 Foundations of Statistical Natural Language Processing. MIT Press.
Merkl D 2000 Text data mining. In Dale R, Moisl H, Somers H (eds), Handbook of Natural Language Processing. Dekker: 889-903.
Pellowe J et al 1972 A dynamic modeling of linguistic variation: the urban (Tyneside) linguistic survey. Lingua 30: 1-30.
Rojas R 1996 Neural Networks. A Systematic Introduction. Springer.
Strang B 1968 The Tyneside Linguistic Survey, Zeitschrift für Mundartforschung, Neue Folge 4: 788- 94. 392 Pragmatic and discursive aspects of German modal particles: a corpus-based approach Martina Möllering Department of European Languages, Macquarie University, Sydney Modal particles fulfil important pragmatic and discursive functions in German. Their meaning is complex and highly dependent on linguistic as well as situational features of the context. Following the premise that German modal particles occur with greater frequency in the spoken language, the paper reports on an analysis which is based on corpora representing spoken German. The concept of ‘spoken language’ is discussed critically with regard to the corpora chosen for analysis and narrowed down concerning the use of modal particles. The analysis is based on the following corpora: Freiburger Korpus, Dialogstrukturenkorpus and Pfeffer-Korpus, which are all kept at the “Institute for the German Language” (Institut für deutsche Sprache) and can be accessed via the institute's on-line system COSMAS. In addition, a collection of telephone conversations (Brons-Albert 1984) was scanned into computer readable files and analysed with MicroConcord (Scott and Johns 1993). A quantitative analysis was carried out on all corpora; the qualitative analysis was limited to the telephone conversations. With regard to these analyses, the paper discusses: · the realization of the concept of ‘spoken language’ in the corpora under discussion · the limitations of computer-based analysis for the language feature investigated · collocational patterns which help to identify pragmatic and discursive functions of modal particles References Brons-Albert R 1984 Gesprochenes Standarddeutsch: Telefondialoge. Tübingen:Günter Narr Institut für deutsche Sprache 1999 COSMAS. http://www.ids-mannheim.de/kt/cosmas.html Scott M, Johns T 1993 MicroConcord. Oxford: Oxford University Press. 393 Evidence of Australian cultural identity through the analysis of Australian and British corpora Rachel Muntz Department of Linguistics, University of Wales, Bangor 1. Abstract This paper reports on findings of a keywords analysis comparing the ACE corpus of written Australian English (AusE) and the Flob corpus of written British English (BrE). The main aims of the study are to provide evidence of the cultural representativeness of ACE and to identify significant features of lexical frequency associated with written AusE which are distinct from written BrE. This paper shows how the physical environment has helped shape the Australian lexicon. It also demonstrates clear differences in the use of colour and personal reference terms. Although this study has discovered many lexical features which confirm that Australia's identity is very separate from that of Britain's, it is demonstrated that the British influence on Australian English and culture is still pervasive. 2. Background 2.1 Setting the context of this study: Australia's cultural and linguistic identity On November 6, 1999 the people of Australia voted narrowly against creating an Australian republic. They voted to retain Britain's queen as its head of state, and not to replace her with an Australian citizen. However, the referendum was highly politically charged and the no vote was carried more due to factors such as fears that the proposed constitutional changes might induce political and economic instability, and rather less due to Australians feeling culturally tied to Britain (Stephens, 1999). 
The referendum was a defining point in several decades of debate on Australia's cultural identity. During the late 1980s and early 1990s (the time period ACE and Flob represent) former prime minister Paul Keating vigorously promoted such debate, encouraging Australians to reflect on the question “What does it mean to be Australian?” (Snow, 1999). Elements of this coming-of-age discourse included the growing awareness of Australia's economic role and geographical position within the Asia-Pacific region; greater acknowledgement of the breadth of the population's ethnicity due to worldwide immigration; and increased recognition of indigenous communities. In discussing who and what it was, Australia was also coming to terms with who and what it wasn't anymore: Britain. Obviously, one of the greatest, enduring influences the British had on Australia was linguistic (Gorlach, 1991). In the 200 years plus since English was transported to Australia, first with the convicts and administration, and then with the free settlers, AusE and BrE have developed in their own unique directions. Of course, the most striking difference now is phonological (Trudgill & Hannah, 1985). An Australian accent, indeed any accent, is a marker not only of linguistic identity, but of cultural identity too. It marks the speaker as a member of a speech community. That is, we can tell from this accent that the person shares common ground in the form of knowledge, beliefs and assumptions with the other members of that community - the community here being at the level of the nation (Clark, 1996). But what about subtler linguistic features, at syntactic and lexical levels? Do they also provide clear signals as to the cultural identity of the speaker? This paper seeks to answer this question by analysing a very specific characteristic at the lexical level: frequency of word use in written AusE, as compared to written BrE. Such research is valuable, not only because it adds breadth and depth to the literature on linguistics and corpus linguistics, but also because it can contribute to a young country's search for self-knowledge and sense of identity. 2.2 Previous studies and theories of culture and linguistics For many years now linguists have been investigating links between culture and language. Any discussion of this subject inevitably refers to Benjamin Lee Whorf and the theories of linguistic relativity and linguistic determinism (Gumperz & Levinson, 1996). Although these theories are focused on the binary factor of the presence versus absence of a word for a particular concept in different languages, here they are applied to information on the frequency of use of a word within different varieties of the same language. The basic premise of linguistic relativity is that our thoughts and points of view are influenced by the language we speak due to the way the language is syntactically and semantically structured. Linguistic determinism is a more radical and less popular theory which claims that the language we speak dictates the way we think about our physical and social world (Clark, 1996). Clark expands the linguistic relativity theory to include not merely major language 394 communities, but “any cultural community that corresponds to people's social identities” (1996: 353). So applying the theory of linguistic relativity to a cross-cultural study of two varieties of English which exist in very different parts of the physical world seems a good way to test this theory. 
Could the English that the convicts and settlers took with them to Australia cope with the demands of a new physical environment? Did it need to change as some words became less useful and others more useful? Undoubtedly it did, and this paper seeks to uncover the nature of these changes. On the subject of Australian culture, work by Wierzbicka (1992) has provided interesting linguistic insights into its nature by examining Australians’ fondness for using diminutive forms of nouns and proper nouns. Of course, this is a not just an academic observation. The Lonely Planet guide to Australia recommends that “if you want to pass for a native try speaking slightly nasally, shortening any word of more than two syllables and then adding a vowel to the end of it, making anything you can into a diminutive” (1998: 56). Seriously, though, introspection and consultation of Wierzbicka's work and cultural guidelines found in travel guides provided a starting point for identifying semantic categories which might be present in ACE, such as struggle/adversity and mateship. However, the true beginning point for this paper was a study by Leech and Fallon entitled Computer corpora – What do they tell us about culture? (1992). This paper was the first systematic attempt to use computer corpora as a source of cultural information. They investigated significant differences in word frequencies between American and British English, based on Hofland and Johanssen's (1982) published frequency tables, and attempted to identify which differences were attributable to cultural difference. A particular strength of their paper was its frank discussion of the limitations of corpus research and clear description of the goals of their study, which excluded analysing what they termed ‘linguistic contrasts’ which included differences in the two language varieties due to spelling conventions or ‘lexical’ differences where the two varieties simply, through convention, use different lexemes to denote the same thing. Examples of such a lexical linguistic difference between Australian and BrE would be the British crisps and lorry whose Australian counterpart terms are chips and truck. Whilst interesting to note, and whist unique to a particular community's lexicon, these are superficial differences in convention not indications of one community's need to refer to a something more often than another's. “Words evolve in a community in direct response to their usefulness and usability in that community” (Clark, 1996: 341). And corpus frequency data can tell us whether the same word is more useable in a certain community and culture. These are Fallon and Leech's ‘non-linguistic contrasts', those which cannot be explained by linguistic code or variety. Excluding proper nouns, Leech and Fallon grouped their results into the following 15 domain categories which occurred naturally within their dataset: sport; transport and travel; administration and politics; social hierarchy; military and violence; law and crime; business; mass media; science and technology; education; arts; religion; personal reference; abstract concepts; and ifs, buts and modality. 
Their generalised conclusion on the cultural differences evidenced in American and British English were that American culture was… …masculine to the point of machismo, militaristic, dynamic and actuated by high ideals, driven by technology, activity and enterprise – contrasting with one of British culture as more given to temporizing and talking, to benefiting from wealth rather than creating it, and to family and emotional life, less actuated by matters of substance than by considerations of outward status. (Leech & Fallon, 1992: 44-45). 2.3 The present study: cultural differences in AusE and BrE To a certain extent this paper is a replication of Leech and Fallon's study, but with the following important differences: 1. This study contrasts different language varieties - British and Australian English 2. This study analyses language from a different time period - the late 1980s/early 1990s 3. The minimum significance level used in this study was p=0.01, meaning its results are of a higher statistical significance than those of Leech and Fallon's study which used p=0.05 4. This study has developed its own more suitable domain categories 5. This study includes proper nouns referring to place names, as it was felt they could provide information on a country's place within the world, which, in turn, contributes to its identity 6. This study includes spelling conventions as they provide evidence of cultural choices and language variation in progress 7. Increased effort has been made in the present study to verify concordances of results, making it less likely that polysemous words and biased dispersion have compromised results 8. This study is more asymmetrical– it focuses on observations about the Australian variety of English as compared with BrE, rather than the other way round. 395 2.4 About the corpora The study is a near-synchronic study of contemporary Australian and British Englishes. The instruments used are the Australian Corpus of (written) English (ACE), and the Freiburg Lancaster- Oslo/Bergen (Flob) corpus of British written English. The two corpora are comparable in size, date, composition and parentage (Peters & Smith, online; Hundt et al, online). Both contain around 1 million words from 500 texts (around 2000 words per text). ACE consists of texts published in 1986, whilst Flob's texts come from 1991. Both were modelled on the LOB (Lancaster-Oslo/Bergen) corpus which represented the BrE of 1961. The compilers of both Flob and ACE used random selection methods to select their component texts. These commonalties make the two corpora highly comparable and this contributes to the robustness of results. However, perfect replication of content was not possible due to shortages of Australian publications in certain fiction genres, namely mystery/detective fiction (category L), science fiction (category M), romance (category P) and western and adventure fiction (category N). The shortfall in these areas was compensated for by the inclusion of human relationship and fantasy genres in the romance category, bush fiction in the western and adventure fiction genre and through the addition of two additional sections: historical fiction (S) and women's fiction (W), for which there are no corresponding categories in Flob. It is noted throughout this paper where it is felt these differences may have skewed the study's results. The categories and corresponding number of texts of each corpus are described in the table below. 
Table 1: Text composition of ACE and Flob (number of texts)

Category                                                  ACE                                                     Flob
A: Press: Reportage                                       44                                                      44
B: Press: Editorial                                       27                                                      27
C: Press: Review                                          17                                                      17
D: Religion                                               17                                                      17
E: Skills, trades and hobbies                             38                                                      38
F: Popular Lore                                           44                                                      44
G: Belles lettres, biographies, essays                    77                                                      77
H: Miscellaneous (e.g. government and industry reports)   30                                                      30
J: Science                                                80                                                      80
K: General Fiction                                        29                                                      29
L: Mystery and detective fiction                          15                                                      24
M: Science fiction                                        7                                                       6
N: Adventure and western                                  8 (including bush fiction)                              29
P: Romance and love story                                 15 (including fantasy and human relationship fiction)   29
R: Humour                                                 15                                                      9
S: Historical fiction                                     22                                                      -
W: Women's fiction                                        15                                                      -
Total texts:                                              500                                                     500

3. Objectives

The main objectives of this study of frequency data are:
1. To assess ACE's representativeness of AusE and Australian culture
2. To identify some key semantic domains and lexemes within these domains which occur more in Australian written English than in British written English
3. To provide evidence that Australia's cultural comparator is still Britain

4. Method

Using the computer software program Wordsmith, a keywords or keyness analysis was performed by comparing the entire frequency wordlists of both the ACE and Flob corpora. A log-likelihood test was applied to the data, using a p-value of ≤ 0.01. The default on the number of results returned by the program was suspended (minimum frequency = 1; database minimum frequency = 1; associate minimum frequency = 5). This yielded a total number of 5323 results (2486 results for the Flob corpus; 2837 results for the ACE corpus).
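As an illustration of the keyness statistic behind such a comparison, the following sketch computes a Dunning-style log-likelihood for each word across two frequency lists. This is a minimal illustration of the calculation, not WordSmith's implementation; the 6.63 threshold quoted in the comment is the standard chi-square critical value for one degree of freedom at p = 0.01.

```python
import math
from collections import Counter

def log_likelihood(freq_a, total_a, freq_b, total_b):
    """Dunning-style log-likelihood for one word across two corpora."""
    expected_a = total_a * (freq_a + freq_b) / (total_a + total_b)
    expected_b = total_b * (freq_a + freq_b) / (total_a + total_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

def keywords(tokens_a, tokens_b, threshold=6.63):   # ~ chi-square, 1 d.f., p = 0.01
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    na, nb = len(tokens_a), len(tokens_b)
    results = []
    for w in set(fa) | set(fb):
        ll = log_likelihood(fa[w], na, fb[w], nb)
        if ll >= threshold:
            # flag whether the word is relatively more frequent in corpus A
            results.append((w, ll, fa[w] / na > fb[w] / nb))
    return sorted(results, key=lambda x: -x[1])
```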
Unfortunately, due to time constraints, it was not possible to check the distribution of the other categories and, as such, caution should be taken in interpreting these results. At this stage of the analysis each entry is checked for instances of polysemy. Any meanings which fall outside the target semantic domain were also discounted. This type of analysis is more rigid (and more time consuming!) than that used by Fallon and Leech and ensures more robust results. By eliminating unsuitable results, the numerical data given by the computer in the first instance on the keyness of each word was altered and recalculated manually using Johannsen and Hofman's frequency tables as a guide to determine the significance of the new result. 5. Results 5.1 Place names: proper nouns referring to other countries ACE and Flob mention certain countries significantly more often than the other. Each writes about itself the most and then about its near geographical neighbours. ACE mentions Asia-Pacific countries such as New Zealand, Papua New Guinea, Timor, Malaya and Vietnam and Flob mentions European countries such as Ireland, France, Portugal, and Germany. Political events at the times the corpora were compiled have impacted on these results. For example, Iran, Iraq and Bosnia are due to this. 5.2 Origin/ethnicity adjectives This trend also holds true for adjectives such as British and Australian. Aboriginal people are represented in ACE's results. Also present is the diminutive Aussie, evidence that the custom of abbreviating words is so entrenched that it is strong enough to occur in the written language, not just the spoken language. Interestingly, like the proper nouns results, Flob contains many more origin/ethnicity adjectives than ACE. These observations on place names and origin, while fairly straightforward, provide evidence of the representativeness of the two corpora - results are as expected, with each country writing about itself more than all others. 5.3 Multiculturalism, prestige forms and borrowings into English Other related terms which occurred significantly more often in ACE included multicultural, migrants, migration. These seem relevant to Australia's increasing awareness, in the media at least, of itself as a multicultural country made up of migrants. However, as the media input into ACE is substantial, there is potential overrepresentation of the word multiculturalism in particular, as it has been something of a buzzword since the Keating years (late 80s, early 90s). Evidence in ACE of multiculturalism's influence on AusE is negligible. There is no trend of the Italian, Greek or Chinese 397 languages, which came to Australia with large numbers of speakers, making any inroads into written English. This is in contrast to the British corpus, which shows the influence of the French language on its English. Several French words are significant in the Flob corpus: le, dans, qui, tout, elle. But it appears that this tendency to drop in quelques mots du français is a habit that Australians do not practice. The reason for their inclusion in BrE is probably due to French words being considered prestige forms in Britain. If this is not the case in Australia, then what are Australia's equivalent prestige forms? It is important to note that French is not an immigrant language in Britain. Following this reasoning, perhaps it can be predicted that Italian, Greek and Chinese are not likely to be sources of prestige borrowing in AusE in the future. 
5.4 American spelling conventions One form of borrowing which Australians seem to be incorporating into their writing is American spelling of words that the British spell with containing ‘ou', such as color, humor, harbor, behavior, neighborhood . The ACE results contain 11 such lexemes, whilst Flob yielded one American spelling of a different kind: center. Whilst the majority of Australians prefer the British spelling, it will be interesting to observe if this is in fact evidence of language variation in progress and whether in 50 years time the old-world ’u’ in such words will be obsolete in AusE, as has happened in the new-world English of America. 5.5 Colour terms One of the most unexpected findings of this study was the discovery of the significance of colour terms in ACE. In descending order of frequency, the following colours occurred significantly more often in ACE: white, red, blue, green, yellow, brown, pink, orange, beige. Intriguingly, there were no colours of particular British significance. This list from ACE includes eight of Berlin and Kay's (1991) 11 basic colour groups, plus a supplementary colour – beige. The absence of the other three groups (black, purple and grey) means these are used with broadly equal frequency in both Australian written English and British written English. In addition to this, the word colours itself was a significant result in ACE. How can these convincing results be explained? Berlin and Kay's work was used to disprove the strong form of the linguistic relativity theory and to support the existence of linguistic universals. But they were only concerned with the existence, not the frequency of terms. Here we have clear evidence of one linguistic community finding use for colour terms far more often than another. From this data, we can say with some certainty that Australians feel the need to classify things in terms of colour to a much greater extent than their British counterparts. One possible explanation is geophysical, due to the difference in the quantity or quality of natural light in Australia. But this is only a theory based on introspection and a similar theory by Van Wijk (cited in Berlin & Kay, 1991) was rejected by Berlin and Kay. Further analysis of the concordances of these colour terms would reveal the reasons behind this result. 5.6 Geography This section centres around physical and geographical features. Obviously, there are great differences between Australia and Britain in this respect and this is borne out in the data. However, the results show that Australians write about these features much more, evidence that such things are more central to the Australian psyche, e.g. land, landscape, rocks, hill(s), mountains, soil, parks, ground. This compares to Flob's few entries such as meadow, woodlands, moor, terms, which, in any case, border on being linguistic rather than non-linguistic contrasts. Here it must be noted that the differences in the component texts of the two corpora may have had an effect on the results. The inclusion of bush fiction in ACE must be taken into account, but equally, Flob contains western fiction which ACE does not. Words pertaining to the coast were present in roughly equal numbers in both corpora, though ACE's results were related to the beach and surf, whilst Flob's included items of a maritime theme: buoy, moorings. In a largely arid country such as Australia, the importance of sources of water is also evident. 
Creek, drainage, swamp, waterhole, lagoon and dam are all present in the ACE results, whilst Flob yields only reservoir. An interesting avenue to investigate would be to see if there is any link between ACE's plethora of geographical terms to the colour terms which were so prevalent. 5.7 Housing and communities We are also able to see how the two countries arrange their territory. Whereas Australia has blocks of land, with the ubiquitous shed in the backyard, in Britain there exist estates (both of the housing estate and country estate variety), manors, castles and mews. People live in villages in Britain, whereas in Australia there are towns and remote townships. However, in both countries, most people live in urban areas and this is a big theme with cities, suburb, suburbs and suburban all present in 398 ACE. Difference in housing design is also evident, with Australian dwellings tending to have a veranda (h). 5.8 Fauna and flora In a similar vein to the geography category, large categories for flora and fauna appeared in ACE, but not in Flob. Surprisingly, there was only one entry in the Flob column (dolphin) against 59 in ACE. The ACE results consisted mainly of native animals and plants, such as dingo, kangaroo, gum, wattle, which of course would not be expected to be found in Flob. But Britain has its own native animals and plants which are not found in Australia, e.g. squirrel, yet these are absent from the results. Both countries have many kinds of animals in common, such as foxes, rats, the bird and cat. But these are more significant to AusE and there are no corresponding entries for Flob. Generic terms such as tree, leaf, plants, ferns, grass are all significant in ACE, with no counterparts in Flob. Another interesting result is croc which, as an abbreviated form, shows the common Australian practice of shortening words. As with the term Aussie, the fact that croc appears in the written form means it is firmly entrenched in the Australian lexicon. 5.9 Weather and clothing Not surprisingly, these domains show a trend of rain and cold weather words in Flob e.g. precipitation, freezing, against Australia's heat and high winds, e.g. sun, hot, cyclone, wind. Corresponding clothing for each kind of weather was also noted: shorts and bikini against suits, boot, and cloak. 5.10 Personal reference This category yielded some interesting results. A preference for third person singular personal pronouns and possessives is evident in Flob with he, him, himself, his, she, her, herself, it, itself all more significant. In ACE, first person singular and first and third person plural pronouns are present: I, my, myself, us, our and they. A complementary result was found for personal reference nouns. It seems where BrE prefers the third person pronoun, Australians choose from a number of nouns: men, bloke(s), boys, father(s), husband, or the idiomatic mate(s). In the same category, Flob yields just gentlemen and sir. Nouns referring to women in ACE include woman, women, women's, girls, mum, daughters. Feminists and wimmin are certainly due to the presence of women's fiction texts in ACE and it is also a possibility that the results referring to men may be influenced by this, as well as those for children which include kid(s), teenagers, and toddlers. Reflecting the phonological changes in AusE, Australians have begun to write meself and ya, mainly in reported speech. 
5.11 Abstract concepts: challenge and adversity
One small, uniquely Australian category emerged from the data: that of struggle and adversity. If you come from the wrong side of the tracks, you "grew up in Struggletown". The figure of the "Aussie battler" still looms large in the national psyche. Even if you come from the right side of the tracks, "life's not meant to be easy". Struggles, tough, battler, conflict, sweat, feat and overcoming are results from ACE which reflect this.
6. Discussion and conclusion
6.1 Representativeness of ACE
This study has found that ACE is highly representative of Australian written English, since anticipated results, such as certain proper nouns, were present. The fact that the text make-up of ACE is different from that of Flob needed to be borne in mind, but was not found to have a great influence on the results overall.
6.2 Significant semantic domains
Overall then, what lexical frequency evidence have these corpora provided to show that Australia's cultural identity is significantly different from Britain's? Results from 11 significant semantic domains were identified and discussed. Because of the focus of this paper, most of these domains deal with lexemes which are more significant to AusE than to BrE. Several of the domains are related to geography or the physical environment (categories 1, 2, 6, 7, 8 and 9), and it seems that members of the Australian speech community have more use for such terms than members of the British speech community. This shows how the language, as part of the people, has adapted to a new physical environment. Evidence of multiculturalism was found in ACE, although for a country which talks up this aspect of its identity, there is not the corresponding number of origin/ethnicity adjectives (category 2) which would support this image. Category 1 confirmed that Australia is very well aware of its location in the Asia-Pacific region. Certain BrE prestige forms were shown not to have any AusE equivalents – potential verification of what is known as the 'tall poppy syndrome' in Australia, where anyone acting above their station is quickly mown down. This notion goes hand in hand with the ethos of struggle and adversity. The best finds of this study, though, are those relating to the colour and personal reference terms. Clear trends were found in the different cultural uses of these terms, and further study of concordances would reveal the reasons behind these trends.
6.3 Cultural comparator
Despite all this linguistic evidence of Australia's independent identity, this study provides evidence that Britain may still be Australia's everyday cultural comparator. The results closely reflect some of the most commonly observed facts and stereotypes about Australia and its culture: the size of its coastline, barren interior, native animals and hot weather. Whilst this assures us of the validity of the corpora, it also tells us that Australians have inherited a way of thinking about such things from Britain. Australia does have a massive coastline, but only in comparison with Britain, not in comparison with the United States. It is a dry country in comparison with Britain, but not compared with Middle Eastern countries. Australians and Britons think of it as a hot country, but the Aborigines probably didn't think in such terms, etc. If the ACE corpus had been compared with its American counterpart, Frown, for instance, then presumably another set of significant words would have been yielded.
But I suspect that these results would not have matched so closely with observations by social commentators about characteristics of Australian culture as the Flob results, partly because Australia has shared physical similarities (size of country, melting pot population, indigenous inhabitants). So although AusE has many of its own lexical frequency characteristics, at a deeper level – in the way Australians think – perhaps they are not so different from the British. In Whorfian terms then, perhaps the fact that Australians speak a type of English dictates the way Australians think about their physical environment, if not the frequency with which they use those words to describe it. 7. References Berlin B, Kay P 1991 Basic color terms: their universality and evolution. Berkeley, University of California Press. Clark HH 1996 Communities, commonalities and communication. In Gumperz JJ, Levinson SC (eds), Rethinking linguistic relativity. Cambridge, Cambridge University Press: 324-355. Gorlach M 1991 Englishes: studies in varieties of English 1984 -1988. Amsterdam, John Benjamins. Gumperz JJ, Levinson SC (eds) 1996 Rethinking linguistic relativity. Cambridge, Cambridge University Press. Hofland K, Johansson S 1982 Word frequencies in British and American English. Harlow, Longman. Hundt M, Sand A, Siemund R Manual of information to accompany the Freiburg - LOB corpus of British English (‘FLOB'). Accessed 25/1/01 at: http://khnt.hit.uib.no/icame/manuals/flob/INDEX.HTM Lonely Planet 1998 Australia. Melbourne, Lonely Planet Publications. Leech G, Fallon R 1992 Computer corpora – what do they tell us about culture? ICAME Journal 16: 29-50. Peters P, Smith A Manual of information to accompany the Australian corpus of English (ACE). Accessed 25/1/01 at: http://www.hd.uib.no/icame/ace/aceman.htm Snow, D 13 November 1999 When no means yes. Sydney Morning Herald Accessed 20/01/01 online at: http://www.smh.com.au/news/review/9911/13/review4.html Stephens, T 13 November 1999 The great divide. Sydney Morning Herald Accessed 20/01/01 online at: http://www.smh.com.au/news/review/9911/13/review6.html Trudgill P, Hannah J 1985 International English: a guide to varieties of standard English (second edition) London, Edward Arnold. Wierzbicka A 1992 Semantics, culture, and cognition: universal human concepts in culture-specific configuration. Oxford, Oxford University Press. 400 Investigating characteristic lexical distributions and grammatical patterning in Swedish texts translated from English P-O Nilsson English Department Göteborg University Box 200, SE 405 30 Göteborg, Sweden. Tel. +46(0)31-773 5134. Fax +46(0)31 773 4726 per-ola.nilsson@eng.gu.se This paper is a work-in-progress report about a study within the field of corpus-based descriptive translation studies. It outlines a contrastive investigation in progress using the English-Swedish Parallel Corpus (ESPC), a combined comparable and parallel corpus of English and Swedish original and translated fiction and non-fiction texts. The paper discusses some initial results of corpus investigations. The investigation involves studying typical distributions of lexical items and patterns in original and translated texts in the corpus. An example of typical lexical distribution in the fiction part of the ESPC is represented by the very frequent Swedish grammatical word av (‘of', ‘by’ etc), which is one and a half times as common in Swedish texts translated from English as in original Swedish texts. 
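As an illustration of how such an over-representation figure is obtained, the sketch below compares the relative frequency of a word in the translated and original subcorpora; the token counts are invented for the example and are not the actual ESPC figures.

```python
def per_thousand(word_count, corpus_size):
    """Relative frequency per 1,000 running words."""
    return 1000.0 * word_count / corpus_size

# Invented counts for 'av' in two equally sized Swedish fiction subcorpora.
translated = per_thousand(word_count=9_100, corpus_size=600_000)  # Swedish translated from English
original = per_thousand(word_count=6_050, corpus_size=600_000)    # original Swedish

print(f"av: {translated:.1f} vs {original:.1f} per 1,000 words; "
      f"over-representation ratio {translated / original:.2f}")
```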
In the investigation, this characteristic lexical distribution is used as a starting point for the search for patterns with specific distribution in translated texts: The collocates of the over-represented word are used to give a picture of the constructions of which the word is a part, with the aim of describing possible patterns with specific distribution in translated Swedish texts. The fact that the ESPC is not only a comparable corpus but also an aligned parallel corpus makes it possible to then track the sources of target text constructions back to the original texts. The final step of the investigation is to describe the paradigm of correspondences between constructions involving characteristically distributed constructions in translations and their actual sources in the original texts. 401 Using feature structures as a unifying representation format for corpora exploration Julien Nioche LexiQuest & Université Paris X Nanterre julien.nioche@lexiquest.fr Benoît Habert LIMSI-CNRS & Université Paris X Nanterre habert@limsi.fr Abstract In this paper we report on the use of feature structures to represent the linguistic information of a corpus. This approach has been adopted in TyPTex, a project which aims at providing a generic architecture for corpora profiling. After a brief overview of the Typtex project, we show that corpora exploration requires manipulating linguistic features in order to obtain a required level of linguistic information or changing the set of features to get a new point of view on the data. We show that feature structures formalism can help the building and management of linguistic features with Meta-Rules based on unification. Finally, we provide an example of marking which uses a mixed approach between projection of information from a static lexicon and contextual marking via Meta-Rules. Results tend to show that the use of feature structures can improve the coverage and reliability of the marking. 1 Introduction Huge tagged or parsed corpora are used in a broad number of language-related studies (McEnery and Wilson, 1996), with very different goals, such as lexical acquisition, speech processing, language learning or discourse analysis. The common point of these works is the annotation of these corpora according to linguistic features. Those features may be of various kind, going from simple morphological or lexical information to more complex grammatical, semantic, functional or phonetic features. Global information (like the number of words, the average word length, etc.) can be also used for corpora processing. Some kinds of features can be easily identified in texts, such as morphological information which can be obtained with simple n-gram computing, or lexical features which requires only word segmentation. On the other hand, more complex tools or resources may be necessary to get the awaited level of information. This is the case of syntactic analysis, obtained with a parser, and of semantic features, which require dictionaries, or anaphora resolution. Depending on the goal of the study, the set of features observed in corpora changes. For example, readability measures consider the length of words and their average frequency compared to frequencies from a reference corpus, while stylometry studies, the stylistic analysis of texts for the purpose of author attribution, require other kind of linguistic information, like lexical features, syntactic constructions or textual organization. Therefore the choice of features is task dependent. 
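The task-dependence of feature choice can be made concrete with two toy extractors applied to the same token sequence; the features shown (word length, reference frequency, type/token ratio) are generic illustrations rather than the feature sets of the studies cited.

```python
def readability_features(tokens, reference_freq):
    """Features of the kind used in readability measures: word length and
    average frequency relative to a reference corpus."""
    return {
        "mean_word_length": sum(len(t) for t in tokens) / len(tokens),
        "mean_reference_freq": sum(reference_freq.get(t.lower(), 0) for t in tokens) / len(tokens),
    }

def stylometry_features(tokens):
    """A different feature set over the same text, closer to stylometric practice."""
    return {
        "type_token_ratio": len({t.lower() for t in tokens}) / len(tokens),
        "token_count": len(tokens),
    }

tokens = "The choice of features is task dependent".split()
print(readability_features(tokens, reference_freq={"the": 60000, "of": 35000, "is": 10000}))
print(stylometry_features(tokens))
```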
Some corpusbased studies need to combine different kind of features at the same time. This is typically the case of distinguishing among language registers (Biber 1993, 1995) and style (Tambouratzis et al., 2000). For example, Tambouratzis et al. (2000) combine morphological, lexical, grammatical and structural features to bring out style differences within a Greek corpus. Sets of heterogeneous features are thus put together in order to compare subparts of a text collection. This kind of studies concerns corpora exploration, since the choice of a feature set changes the results obtained and reveals different aspects of the data. Thus linguistic features must be manipulated easily, which raises the issue of their representation. A lot of formalisms are used to handle linguistic annotation1. They are nowadays mostly based on manipulation of XML/SGML entities. Tools are also provided in order to create, search or browse corpora within these formalisms. The lack of standards to represent and manipulate linguistic information is a problem for Natural Language Processing, since processing corpora out of their foreseen use or in combination with textual resources in a different format requires an extensive work to build conversion and/or manipulation tools. Attempts are made to solve this problem. The AMALGAM (Atwell et al., 1994) project aimed at developing methods of automatically mapping between the annotation schemes of the most widely known corpora, for both POS tag sets and phrase structure grammar schemes, to improve the reusability of the data. More recently, Bird et al. (2001) propose a formal framework for linguistic annotation, in the context of Speech Processing. This 1 See the Linguistic Annotation Page at www.ldc.upenn.edu./annotation. 402 framework aims at providing a core representation format, which regroups the common features of existing annotation schemes. In the domain of morphosyntactic annotation, the EAGLES project recommends a common formalism2, where information is represented by the position of characters in a tag. There is an obligatory position for the POS features, which has a closed set of possible values. Other linguistic features can be freely encoded with a symbol within the tag. This paper explores the benefits that feature structures offer for the representation and manipulation of linguistic information in corpora. Experience on corpora profiling in the TyPTex project shows that the use of this format helps to handle information in a clean way, to manipulate and modify sets of linguistic features, and improves reusability of both data and experiences. Using feature structures can also improve the marking of linguistic phenomena. 2 Overview of the TyPTex project The goal of the TyPTex project is to provide a generic architecture for corpora profiling. This project is financed by ELRA (European Language Resources Association) and is carried out jointly at LIMSI and UMR 85033. Work within this project has been previously described by Illouz et al. (2000) and Folch et al. (2000). 2.1 Background The underlying idea is that the reliability of the knowledge acquired form a corpus depends on the homogeneity of its data and is decreased by its heterogeneity. In the domain of morphosyntactic tagging, Biber (1993: 223) used the LOB (Lancaster-Oslo-Bergen) corpus to show that the probability of occurrence of a morphosyntactic category depends on the domain of the text. 
The same is true with sequences of morphosyntactic categories, which frequencies vary according to the domain. Sekine (1997) compared the performances of a probabilistic syntactic parser with different configurations for training and testing, using 8 domains of the Brown corpus. This work proved that the quality of the parsing in terms of precision and recall falls as the domains of the texts used for training and testing differs. Ruch and Gaudinat (2000) compared the lexical ambiguity between medical and general texts and underlined the necessity to build domain-adaptable tools for Natural Language Processing. These studies lead to the conclusion that the use of important corpora requires profiling tools in order to get indications about lexical and morphosyntactic uses of their subparts and thus determine their homogeneity or heterogeneity. Corpora profiling and tuning can globally improve the performances of NLP tools, as shown in Illouz, (2000). 2.2 Previous works The approach in TyPTex consists in developing a typology of texts through inductive methods. It means that the text types are defined in terms of sets of correlated linguistic features obtained through multivariate statistical techniques from annotated corpora. This approach is based on Biber's (1988, 1995) work. Biber uses 67 features corresponding to 16 different categories (verb tense and aspect markers, interrogatives, passives, etc.). He examines their distribution in the first 1.000 words of 4.814 contemporary English texts from reference corpora. The identification of the 67 features in the corpus is done automatically on the basis of a preliminary morphosyntactic tagging. The accuracy of the tagging is checked by a linguist. The sets of correlated features (the dimensions) are obtained through a multivariate statistical technique (factor analysis). Each dimension consists of two complementary groups of features which can be interpreted as positive and negative poles. In other words, when one group of features occurs in a text, the other group is avoided. Statistical methods are then used to group texts into clusters according to their use of the dimensions. These clusters correspond directly neither to text “genres” nor to language style or registers. 2.3 Data, tools and methods The corpus used in TyPTex to test and tune the system represents 5 million words and is a part of the corpus gathered by G. Vignaud (INALF – Institut National de la Langue Française) and B. Habert within the European project PAROLE4. The texts are tagged according to the TEI (Text Encoding Initiative) recommendations. Queries are then performed to extract a subset of texts which are relevant for a determined study or application. The next step is to achieve a morphosyntactic tagging which 2 Available at http://www.ilc.pi.cnr.it/EAGLES/annotate/annotate.html. 3 See http://www.limsi.fr and http://www.ens-lsh.fr. 4 See http://www.elda.fr/Fr/cata/doc/parole.html. 403 associates each lexical item (or polylexical item) with a given lemma, a part of speech and other morphosyntactic information. The tagger used currently is Sylex-Base. It is based on the work of P. Constant (Ingenia, 1995), and proved to be robust during the tagger evaluation program GRACE (Adda et al., 1998). The second step is typological marking. It consists of replacing the information generated by the morphosyntactic tagger with higher-level linguistic features. 
These new features are obtained on top of the morphosyntactic tags and vary according to the oppositions the user wishes to bring out. Section 3 will explain more in detail why and how such manipulations of linguistic features are effected. From the resulting marked corpus several matrices are generated, in particular the matrix containing the frequencies of each feature in each text of the corpus under study. The resulting matrix is then analysed by statistical software programs. The analysis of the matrix aims at, on the one hand, identifying features that reveal a certain kind of opposition among the subparts of the corpus, and on the other hand, making an inductive classification of texts. 3 Corpora exploration and manipulation of linguistic features A lower level tagging used in TyPTex includes shifters, modals, presentatives (“il y a” and “c'est”), tense use, passives, certain classes of adverbs (negation, grading), determiners, etc. From the features tagged initially (around 300 available with Sylex and 170 with Cordial (Synapse, 1998)), about 40 were kept and divided into 2 subsets. The first subset comprises functional elements which role is the organization of discourse and sentence. The second subset comprises open categories like nouns, adjectives or verb tense. The features available with the initial POS tagging may not be sufficient for a given study. There is often a gap between what one gets at the output of a tagger and what is aimed at. In TyPTex, we call typological marking the set of features that is presumed useful to bring out different types of texts. Features has to be manipulated in order to get this awaited level of typological marking. However a set of linguistic features can not be settled once and for all. Typological marking requires a lot of explorations : one needs to test a set of linguistic features by analysing the distinctions it brings within a corpus. Sometimes features can be too fine-grained and lead to a scattering of occurrences which makes contrasts imperceptible. This was the case in one of the pilot studies (Illouz et al., 1999) for the TyPTex project with the verb category, which was divided into some 50 features (due to the morphology of French). Most of those features had a only a few occurrences and were not statistically significant. The other problem with this splitting of the features was that it offered no indication about the use of the verb in general. Thus it was impossible to check whether a under-use of Nouns in a subpart of the corpus was related to an over-use of Verbs in the same subpart. This is why some elementary features had to be regrouped inside “super features”, covering larger categories which were not available with the initial tagging. Some features can also be too rough and hide real oppositions. For example, the same tag can be used by a tagger to cover indifferently quantity indicators, as well as dates. We can presume that splitting that general “Cardinal Number” feature into two sub-features (quantity and dates) would create finer distinctions among the corpus. For instance, it can be necessary to gather tags corresponding to the same function but belonging to different morphosyntactic categories, such as some punctuation marks and some conjunctions, which can be regrouped as textual markers. For example, this could be the case of punctuation symbols marking an incision in the discourse, such as quotes, parenthesis or long dashes. 
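The regrouping and splitting described here can be pictured as a simple mapping layer between tagger output and the typological feature set. The tag names below are invented stand-ins, not the actual Sylex or Cordial labels, and the date test is a deliberately crude placeholder.

```python
import re

# Invented fine-grained tags standing in for the tagger's own labels.
SUPER_FEATURE = {
    "Vmip1s": "VERB", "Vmip3s": "VERB", "Vmis3s": "VERB",   # dozens of verb tags folded into one
    "Ncms": "NOUN", "Ncfs": "NOUN", "Ncmp": "NOUN",
}

def typological_mark(token, tag):
    """Map a fine-grained POS tag onto the coarser or finer typological feature."""
    if tag == "Mc":
        # Splitting a single 'Cardinal Number' tag into dates versus quantities.
        return "NUM_DATE" if re.fullmatch(r"(1[89]|20)\d\d", token) else "NUM_QUANTITY"
    return SUPER_FEATURE.get(tag, tag)   # tags without a super feature pass through unchanged

print(typological_mark("mange", "Vmip3s"))   # VERB
print(typological_mark("1993", "Mc"))        # NUM_DATE
print(typological_mark("trois", "Mc"))       # NUM_QUANTITY
```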
A good illustration of this point is the use of the distinction between qualitative and relational adjectives made by Habert and Salem (1995). The aim of their study was to reveal differences of language use in a sociologic corpus of open answers. A first feature set, using “traditional” POS features (Verb, Adjective, Noun, etc.), revealed a difference in the use of verbs (overused by lesseducated persons) and nouns (overused by educated persons) between two different groups of answers. At this stage, adjective was considered as an “atomic” feature. A modified set of features introduced an opposition between qualitative and relational uses inside the adjective category. Relational adjectives are sometimes called “pertainyms”, since they mean something like “of, relating/pertaining to, or associated with” some noun and play a role similar to that of a modifying noun (Fellbaum, 1990). For example, geographical in “a geographical map” refers to geography, as presidential in “presidential election” is linked to president. Adjectives that are not relational are considered to have a “qualitative” function (which modifies the quality of a noun), such as “nice” in “a nice child” or “good” in “a good 404 practice”. This notion of relational use of adjectives is used in WordNet5 (Miller, 1998), where relational adjectives are linked to their corresponding name, the other adjectives being mainly studied through antonymy and contrast. This distinction enabled to refine the cluster of “educated persons” between non-graduate and graduate persons. In that case, splitting the adjective category into two finer categories provided a better description of the corpus. Feature manipulation is required to get the awaited level of linguistic information, by completing and modifying the feature set available with the initial POS tagging or changing the features retained for typological marking. It is necessary therefore to be able to group features for one contrast, to divide others, and at times even to start afresh tagging and marking. Corpus exploration requires flexibility in the manipulation of the feature set which in turn introduces constraints on the representation formalism. In the Typtex project, feature structures are used to represent linguistic features of the words in the corpora. 4 Using feature structures as representation format 4.1 From tags to features A morphosyntactic tagger associates a given piece of information with a lexical token by the way of a tag. According to the software used, the content of this information may vary, and its form as well. action action Ncfs (CORDIAL) “action” nom : féminin singulier (SYLEX) Figure 1 Examples of different POS tagger outputs Figure 1 presents the output of two taggers CORDIAL and SYLEX for the word “action”, where the linguistic information is the same (except for the lemma, which is not present here with Sylex) but changes in form. In fact, there's often a confusion between the graphical forms that programs manipulate (tags) and the linguistic features used by the linguist. Tags may be different but represent the same linguistic feature. In the TyPTex project, an intermediate format is used to represent linguistic information, independently of the tags provided by the preceding process (POS tagging, for example). 4.2 Representation Feature structures such as those employed in unification grammars are used in order to represent the linguistic information contained in a corpus. 
The format used in TyPTex is inspired from the PATR formalism (Shieber, 1986). A feature structure associates values with a set of features, and can be represented by an equation, where the feature is written between < > and the value is placed after a symbol equals (=). Feature structures can be atomic (one feature associated with a value) or complex (a value is itself a feature structure, in a recursive way). The example of tagging provided above will give the following feature structure : <form> = action <lemma> = action <category> = noun <type> = common <agreement gender> = feminine <agreement number> = singular . Figure 2 Equation of a feature structure In this example, some features have an atomic value (<form> for instance), whereas another as a complex one (<agreement>). Feature structures can also have a graphical representation with a DAG (Directed Acyclic Graph), as on figure 3. 5 See http://www.cogsci.princeton.edu/~wn/. 405 Figure 3 DAG representation of a feature structure 4.3 Modifiers Modifying operators can also be used in the equation. This is the case of negation, marked by a tilde ~, which implies that a feature shall not have a given value, as in : <agreement gender> = ~neutral. Disjunction, is an other aspect. It is the possibility to associate multiple values with a feature, by putting these values between braces { }6. At last, two or more structures can share the same value (co-indexation) when it is placed between parenthesis. This enables, for example, to specify an agreement in number and gender between an adjective and a noun. In that case, the agreement features of the adjective and the noun are not only equals, but share the same value. 4.4 Use in the TyPTex project Each word of the corpus used in TyPTex is represented by a feature structure. Thus, it is possible to modelise more precisely the information resulting from marking, in the style of Gazdar et al. (1988). This kind of representation format allows one to manipulate linguistic features directly instead of tags. A mapping (comparable to the mapping done in Atwell et al., 1994) is effected between the output of the morphosyntactic tagger used upward and the corresponding linguistic feature structures, the downward processing gaining in independence from the format of the tagger used. It is possible to use an other program for the POS tagging-task and keep the same level of information thanks to a mapping into feature structures. This enables to compare the results provided by different taggers and see their influence on the quality of the marking. An other advantage is that feature structures can encode any type of linguistic features, at any level, morphological, syntactic, semantic or functional. Compared to a tag, a feature structure has the following qualities : 1. Linguistic information is named explicitly and does not depend on the position of a character inside a linear string, like in a tag. A feature structure (ex: <category>=noun, <type>=common, <gender>=feminine, <number>=singular) is easier to read than a tag and less ambiguous (ex: Ncfs). 2. Linguistic information is structured and hierarchized. A structure such as (<category>=noun, <type>=common, <agreement gender>=feminine, <agreement number>=singular), is more logical than (<category>=noun, <type>=common, <gender>=feminine, <number>=singular). In this example, the sub-features <gender> and <number> can be manipulated through the whole complex feature <agreement>. 
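A minimal sketch of this tag-to-feature-structure mapping is given below, using nested dictionaries for the structures; the parsing of the two output formats is deliberately simplified and only covers the noun example of Figure 1.

```python
def from_cordial(form, lemma, tag):
    """Read a CORDIAL-style positional tag such as 'Ncfs' into a nested feature structure."""
    assert tag[0] == "N", "only the noun case is sketched here"
    return {
        "form": form, "lemma": lemma,
        "category": "noun",
        "type": "common" if tag[1] == "c" else "proper",
        "agreement": {"gender": {"f": "feminine", "m": "masculine"}[tag[2]],
                      "number": {"s": "singular", "p": "plural"}[tag[3]]},
    }

def from_sylex(form, label):
    """Read a SYLEX-style verbose label ('nom : féminin singulier') into the same structure."""
    category, traits = label.split(" : ")
    gender, number = traits.split()
    return {"form": form, "lemma": None,   # no lemma in this Sylex output
            "category": {"nom": "noun"}[category], "type": "common",
            "agreement": {"gender": {"féminin": "feminine"}[gender],
                          "number": {"singulier": "singular"}[number]}}

fs_cordial = from_cordial("action", "action", "Ncfs")
fs_sylex = from_sylex("action", "nom : féminin singulier")
print(fs_cordial["agreement"] == fs_sylex["agreement"])   # True: same features behind different tags
```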
However, one of the most remarkable aspects of the feature structure formalism is that it provides useful mechanisms for feature manipulation, which is strongly required for corpora exploration. 5 Manipulation of data with Meta-Rules based on unification Since some linguistic features one wants to obtain for typological marking may not be available at the output of a morphosyntactic tagger, it becomes necessary to add, retrieve or modify some features to get the awaited level of information for typological marking. On top of that, corpora exploration requires to test several feature sets in order to bring out new contrasts inside the corpus, which also implies the use of tools for features manipulation. 6 This is a good way to represent the output of a non-deterministic tagger, which can give different possible tags for a single token. 406 5.1 Unification of feature structures Feature structure are based on the concept of unification, which can be defined by the following : “Unification of two feature structures A and B is the minimal structure containing both A and B”. Unification checks the compatibility between two feature structures (a feature must not have two different values) and when possible produces a structure containing all the information of both structures. For example, the unification of the structure : <category> = noun <type> = common <agreement gender> = feminine with the structure <category> = noun <type> = common < agreement number> = singular yields: <category> = noun <type> = common <agreement gender> = feminine < agreement number> = singular 5.2 Meta-Rules Meta-Rules (as in Jacquemin 1994a, 1994b and 1997) based on unification are used to modify feature structures. Basically, a Meta-Rule is a feature structure consisting in two parts : a source part and a target part (in an equation, the source is separated from the target by the symbol “=>”). A Meta- Rule is applied on the feature structures representing a word (or a sequence of words) of the corpus : if unification is possible between the source part of the rule and the feature structure of the corpus, then the last is replaced by the target part of this rule. Thus Meta-Rules do not only add information to a given feature structure of the corpus but totally rewrite it. The role of unification is to check the compatibility between the source part of the Meta-Rule and a word of the corpus. However the unification process used here slightly differs from usual unification. It could be called constrained unification, because it stipulates that the feature structure representing the word must contain at least all the information of the source part of the Meta-Rule (same features with compatible values). Thus, the source part of the Meta-Rule subsumes the feature structures of the word. This condition (unification + subsumption) constraints the writing of the rules and ensures that they are used in the correct cases. For example, using normal unification with the Meta-Rule : <tense>=present => <tense>=present <deictic>=yes changes any Noun represented minimally by the feature structure : <category> = noun <type> = common < agreement number> = singular into : <tense>=present <deictic>=yes 407 The use of constrained unification prevent such cases. 5.3 Positional information One important aspect of using feature structures is that it can help to modify the linguistic information of a corpus in a contextual way. 
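Before the contextual use of Meta-Rules is developed, the constrained unification of sections 5.1 and 5.2 can be sketched procedurally. Feature structures are flattened to plain dictionaries here, and the rule mirrors the deictic example above; this is an illustration of the mechanism, not the project's implementation.

```python
def subsumes(source, word_fs):
    """True if every feature of the rule's source part is present in the word's
    structure with an identical value (the subsumption condition)."""
    return all(word_fs.get(feature) == value for feature, value in source.items())

def apply_meta_rule(source, target, word_fs):
    """Constrained unification: the word is rewritten to the target part only if
    the source part subsumes its feature structure; otherwise it is left as is."""
    # A real rule would co-index and restate the features to be kept, as in Figure 6.
    return dict(target) if subsumes(source, word_fs) else word_fs

# Source and target parts of the Meta-Rule adding <deictic>=yes to present-tense forms.
source = {"tense": "present"}
target = {"tense": "present", "deictic": "yes"}

verb = {"category": "verb", "lemma": "aller", "tense": "present"}
noun = {"category": "noun", "type": "common", "number": "singular"}

print(apply_meta_rule(source, target, verb))   # rewritten: gains deictic=yes
print(apply_meta_rule(source, target, noun))   # unchanged: the source does not subsume it
```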
In this case word tokens are not submitted to Meta-Rules separately, but inside sequences (corresponding to a paragraph, for example). Thus, the scope of a Meta-Rule can be larger than one slot in the sequence and covers several words. One tries to apply each Meta-Rule starting from each position inside the sequence (by checking whether the source-part of the Meta-Rule subsumes and can be unified with a part of the sequence of feature structures). All Meta- Rules are tested on position 1 in the sequence, then on position 2, and so on until the end of the sequence. In case of success, the slots inside the sequence covered by the source part are replaced by the target-part of the Meta-Rule. Positional information is added to the Meta-Rule in order to specify the distance between the words in the sequence. This information is present in the feature structures of the rule as a feature by itself. The notation depends on the type of the distance, which can be fixed (noted by a “p” + number of the token in the sequence, ex: p4), free (noted by a “*” + number of the token, ex: *4), or limited (noted by a “*” + number of the token + “+”+ maximal distance, ex: *4+3). The position 1 corresponds to the current position of the processing inside the sequence. <p1 lemma>=" <p1 category >=punctuation <*2+5 lemma>=" <*2+5 category >=punctuation => <p1 lemma>=" <p1 category >=punctuation <p1 type>=start_short_citation <*2+5 lemma>=" <*2+5 category >=punctuation <*2+5 type>= end_short_citation . Figure 4 Distances in Meta-Rule Figure 4 shows a Meta-Rule, which adds a feature <type> to the quotes of a text, with a value start_short_citation or end_short_citation, if there is a distance of up to 5 words between the positions of the quotes. In that example, the distance is limited to 5. This Meta-Rule could be used to differentiate cases of reported discourse or to distinguish citations marking a phenomenon of distanciation form the speaker. 5.4 Meta-rules in TyPTex Meta-Rules are used in order to manipulate linguistic features by regrouping them into larger categories, or in the opposite splitting into finer features. This is a convenient tool for managing the feature sets used in a study and for comparing the results that they provide. It also helps to add features not available at the output of a tagger and thus get the level of information awaited for typological marking. One of the most powerful aspect of the Meta-Rules is that they are contextual : one can manipulate the content of a feature structure (basically a word) dependently of the context of that structure (other words in the same sentence or paragraph). In the rest of that paper, we will show that the use of Meta-Rules can improve the marking of such a subtle distinction as the one between qualitative and relational adjectives. 6 An application case: distinguishing between qualitative and relational adjectives in a corpus In the Typtex project, an opposition is projected in the corpora between relational and qualitative adjectives (see section 3). The description of relational adjectives provided by Habert and Salem (1995) is followed : 1. they are equivalent to a sequence of nouns : presidential election / election of the president 2. they are never gradable : *a very geographical map 3. they cannot have a predicative function : *response to the virus was immune. 
408 We realized that the addition of a distinction between qualitative and relational adjectives improved the description of a corpus in Habert and Salem (1995) by refining the groupings of texts obtained after a multivariate analysis. But although this opposition seems to be useful for corpora description, its use is far from being obvious. As noticed by Bartning and Noailly (1993), a lot of adjectives can be analysed either as relational or qualitative, depending on the context. This is the case with the French adjective économique, which has a relational function in “la politique économique” (related to economics), but a qualitative one in “une formule économique”( which is not expensive). One can even assume that any relational adjective can take a qualitative function in a given context. That is why the distinction between these two aspects is somehow difficult to process automatically without information about the context of use. The solution adopted in Typtex was to combine two approaches to distinguish between qualitative and relational adjectives : on the one hand a list of potentially non-ambiguous adjectives was used to annotate the corpus (information is projected on the texts) while on the other hand a set of Meta-Rules was intended to disambiguate the adjectives in context (information is then extracted from the texts). Here we compare these two approaches. 6.1 An empirically-build static list A list of relational and qualitative adjectives has been constituted manually using press articles of the French newspaper Le Monde, and taken from the PAROLE corpus. The 14 million words subpart Press of the corpus has been built by random selection of full issues of Le Monde and gathers issues from 1987, 1989, 1991, 1993 and 1995. To build the list we extracted the thousand most frequent adjectives of the whole corpus Le Monde and analysed them manually, to check whether they have a priori a relational or qualitative function out of context. Relational annuel automobile bancaire budgétaire cardinal constitutionnel exécutif … Qualitative absolu actuel ambitieux ami ancien beau bien … Ambiguous commercial français historique humanitaire idéologique judiciaire logique … Figure 5 A sample of the adjective list In its final state the list contains 264 non-ambiguous adjectives with 244 qualitative and 20 relational. All the other adjectives studied were judged too dependent on the context and thus ambiguous. Figure 5 shows a sample of this list. 6.2 Description of the Meta-Rules Afterwards we created a set of Meta-Rules using feature structures for disambiguation. These rules are the following : An ambiguous adjective is considered as qualitative if : 1. it is directly preceded by a grading adverb 2. it is directly preceded by a stative verb 3. it is directly preceding a noun, with the same value of number and gender7 4. it is directly preceded by a qualitative adjective and a conjunction 5. it is directly preceding a conjunction and a qualitative adjective 6. it stands alone between two double quotes An ambiguous adjective is considered as relational if : 7. it is directly preceding a conjunction and a relational adjective 8. it is directly preceded by a relational adjective and a conjunction 7 This is at least true in French, where the only position of the relational adjective is after the noun. 
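A reduced sketch of the combined approach is given below: the lexicon projection plus rules 1 and 3 only, with abbreviated adjective lists and a simplified token format. It is meant to show the control flow, not to reproduce the project's rule set.

```python
QUALITATIVE = {"absolu", "actuel", "ambitieux", "beau"}   # abbreviated from the 244-item list
RELATIONAL = {"annuel", "bancaire", "budgétaire"}         # abbreviated from the 20-item list

def classify_adjective(tokens, i):
    """Return 'qualitative', 'relational' or None for the adjective at position i."""
    adjective = tokens[i]
    # Step 1: projection of the static, non-ambiguous list.
    if adjective["lemma"] in QUALITATIVE:
        return "qualitative"
    if adjective["lemma"] in RELATIONAL:
        return "relational"
    # Step 2: contextual rules (only rules 1 and 3 are sketched).
    previous = tokens[i - 1] if i > 0 else {}
    if previous.get("category") == "adverb" and previous.get("type") == "grading":
        return "qualitative"                               # rule 1: preceded by a grading adverb
    following = tokens[i + 1] if i + 1 < len(tokens) else {}
    if (following.get("category") == "noun"
            and following.get("agreement") == adjective.get("agreement")):
        return "qualitative"                               # rule 3: prenominal position, same agreement
    return None                                            # left ambiguous

sentence = [{"lemma": "de_plus_en_plus", "category": "adverb", "type": "grading"},
            {"lemma": "tendu", "category": "adjective"}]
print(classify_adjective(sentence, 1))                     # -> qualitative
```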
409 Rules 4, 5, 7 and 8 are based on the hypothesis that coordination is only possible between adjectives sharing the same function as whether qualitative or relational (e.g., compare young and nice with *beautiful and geographical). Hatzivassiloglou and Wiebe (2000: 300) report on similar property of conjunctions for assigning semantic orientation to adjectives (e.g., compare corrupt and brutal with * corrupt but brutal). Figure 6 shows the equation corresponding to rule 1. <p1 form>=(1) <p1 lemma>=(2) <p1 category>=adverb <p1 type>=grading <p2 form>=(3) <p2 lemma>=(4) <p2 category>=adjective => <p1 form>=(1) <p1 lemma>=(2) <p1 category>=adverb <p1 type>=grading <p2 form>=(3) <p2 lemma>=(4) <p2 category>=adjective <p2 qualitative>=true <p2 relational>=false. Figure 6 Example of Meta-Rule used for disambiguation of adjectives This rule indicates that any adjectives directly preceded by a grading adverb has a qualitative function. If the source part of the Meta-Rule subsumes and can be unified with feature structures representing a sequence of words, these structures will be replaced by the target part of the Meta-Rule. In this example, features <qualitative>=true and <relational>=false are added to the adjective. This Meta-Rule will correctly assign a qualitative function to the adjective tendu in the sequence “un climat de plus en plus tendu” (a situation more and more tense), witch can be minimally represented by the following sequence of feature structures : <form> = un <lemma> = un <category> = determiner <type> = particle <defined> = false. <form> = climat <lemma> = climat <category> = noun <type> = common. <form> = de_plus_en_plus <lemma> = de_plus_en_plus <category> = adverb <type> = grading. <form> = tendu <lemma> = tendu <category> = adjective <qualitative>= true <relational>= true. In this case, the value of the feature <relational> will be changed to false by the Meta-Rule. However this rule is not aimed at recognizing the qualitative function of the adjective dangereux in “un climat dangereux a tous égards ” (a dangerous situation in all respect). <form> = un <lemma> = un <category> = determiner <type> = particle 410 <defined> = false. <form> = climat <lemma> = climat <category> = noun <type> = common. <form> = dangereux <lemma> = dangereux <category> = adjective <qualitative>= true <relational>= true. <form>= a_tous_égards <lemma>= a_tous_égards <category>= adverb <type>= general . This sequence of feature structures will remain unchanged by the Meta-Rule (and the function of the adjective ambiguous) because of the mismatch between the postposition of the adverb and the value of its feature <type> which value is not equal to grading. 6.3 Building of a reference corpus For this comparison between a fixed list approach and the rule-based approach, we used a sample of 13 papers from Le Monde taken from the PAROLE corpus. These texts has been extracted from the Economy section of the newspaper and represent around 10.000 words. The corpus was first tagged using the CORDIAL8 tagger and then converted into feature structures. A refinement of the original tag set has been provided by adding an information via Meta-Rules about grading adverbs for 129 frequent adverbs. Thus, adverbs such has “tres” (very), “plus” (more) or “extremement” (extremely) gained a new feature <type>= grading which was not present after the original tagging made by CORDIAL and its conversion into feature structures. The next step was a manual categorization of adjectives between relational and qualitative. 
No adjectives were left ambiguous. A rectification of the data was necessary in order to correct the errors made by the POS tagger. These were typically verbs erroneously analysed as adjectives, or bad tokenisations ("Ministre de l'Intérieur" (Minister of the Interior) recognized as a single token, but "Premier Ministre" (Prime Minister) identified as an Adjective followed by a Noun). After this correction, the corpus contained 507 adjective occurrences, with 378 qualitative and 129 relational uses.
6.4 Results and discussion
This corrected corpus serves as the reference for evaluating and comparing the two approaches to adjective categorization. At the beginning of the experiment a version of the corpus was created in which all adjectives were ambiguous (the values of their features <qualitative> and <relational> were both true). The goal of the evaluation is to measure the recall and precision provided by the different methods of adjective marking. By recall we mean the proportion of adjectives which have been disambiguated, correctly or not ((#total - #ambiguous) / #total * 100), while precision indicates the proportion of well-tagged adjectives after disambiguation (#well-tagged / (#total - #ambiguous) * 100). Three tests were carried out on the ambiguous version of the corpus, using respectively the plain adjective list described above, the set of Meta-Rules, and a combination of list and rules. Results are compared against the reference corpus in order to determine which of these approaches is the most effective at distinguishing between qualitative and relational uses of the adjectives. Figure 7 shows the results obtained. 8See http://www.synapse-fr.com.

                 List    Rules   List + Rules
Correct           222      178        290
Wrong               6        0          6
Ambiguous         278      328        210
Total             506      506        506
Recall (%)      45.05    35.17      58.49
Precision (%)   97.36   100.00      97.97

Figure 7 Compared results for relational / qualitative tagging
The values in the row Correct indicate how many adjectives were correctly categorized as either qualitative or relational by the different methods. Wrong gives the number of wrongly marked adjectives, while the numbers in Ambiguous refer to the occurrences of adjectives covered by neither the list nor the Meta-Rules. Using only the contextual Meta-Rules improves the precision of the adjective categorization compared to the use of the fixed list, but at the cost of a drop in recall of roughly 10 percentage points. The best solution seems to be the combined use of the list and the rules, which improves recall while keeping precision at roughly the same level. However, this gain in recall is relatively moderate. It can be explained by a partial overlap in coverage between the two approaches (the adjectives recognized are often the same). This example of linguistic feature marking (a new feature is added to the description of a corpus) illustrates the use of feature structures as a representation format. The combined use of fixed information and contextual rules improved the marking of such a subtle opposition as that between relational and qualitative adjectives, compared with the projection of a lexicon alone. This method characterizes individual word occurrences, rather than word types, without requiring a substantial learning phase (Hatzivassiloglou, 2000). Another advantage is that this mixed method allows unusual cases to be marked correctly, such as adjectives whose function in context differs from their most probable one (ex: "a very Parisian atmosphere").
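The recall and precision values of Figure 7 follow directly from the counts and the definitions given above; the short recomputation below reproduces them to within rounding.

```python
def recall_precision(correct, wrong, ambiguous):
    """Recall and precision as defined above for the adjective-marking evaluation."""
    total = correct + wrong + ambiguous
    disambiguated = total - ambiguous
    recall = 100.0 * disambiguated / total
    precision = 100.0 * correct / disambiguated
    return recall, precision

results = {"List": (222, 6, 278), "Rules": (178, 0, 328), "List + Rules": (290, 6, 210)}
for method, counts in results.items():
    recall, precision = recall_precision(*counts)
    print(f"{method:12s}  recall {recall:6.2f}%  precision {precision:6.2f}%")
```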
However the results of this experience could be surely improved by refining the content of the fixed list and of the rules. Some semantic information would be interesting to solve the ambiguity between relational and qualitative functions of adjectives (this is required to disambiguate the example with “économique” provided at the beginning of this section), as well as regular expression operators in Meta-Rules in order to use morphological information (word endings). 7 Conclusions In this paper we report on the use of feature structures to represent the linguistic information of a corpus. Experience of corpora profiling in the TyPTex project shows that this approach helps to represent any kind of linguistic information, independently of the tools used for tagging. Feature structures is an unifying format which can be used to map from an annotation scheme to an other, in the spirit of Atwell et al. (1994). Feature structures formalism also helps to handle a set of features with Meta-Rules based on unification. We showed that corpora exploration requires to modify the linguistic features in order to obtain new results and thus to change the point of view on the data. Another aspect is that the features available at the output of a POS tagger may not be sufficient for a given experimentation, one needs to add some information to get the awaited level of marking. By defining Meta-Rules to operate on the feature structures representing a corpus, one can modify the information in a contextual way. An example of a mixed approach between projection of information from a static list and contextual marking via Meta-Rules showed that feature structures can improve the reliability and coverage of the marking. Acknowledgements The authors wish to thank Marianne Dabbadie and Lee Humphrey (LexiQuest) for their help during the writing of this paper. References Adda G, Mariani J, Lecomte J, Paroubek P, Rajman M 1998 The GRACE French Part-Of-Speech Tagging Evaluation Task. In Proceedings of LREC'98 (1st International Conference on Language Resources and Evaluation), Granada, Spain, pp 2433-2441. 412 Atwell E, Hughes J, Souter C 1994 AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation Models. In Klavans J, Resnik P (eds), The Balancing Act : Combining Symbolic and Statistical Approaches to Language. Las Cruces, Association for Computational Linguistics, pp 11-21. Bartning I, Noailly M 1993 Du relationnel au qualificatif: flux et reflux. In L'information grammaticale 58(1): 27-32. Biber D 1988 Variation across speech and writing. Cambridge, Cambridge University Press. Biber D 1993 Using register-diversified corpora for general language studies. Computational Linguistics, 19(2): 243-258. Biber D 1995 Dimensions of register variation : a cross-linguistic comparison. Cambridge, Cambridge University Press. Bird S, Liberman M 2001 A formal framework for linguistic annotation. Speech Communication 33(1): 23-60. Fellbaum C, Gross D, Miller K 1990 Adjectives in WordNet. International Journal of Lexicography 3(4): 265-277. Folch H, Heiden S, Habert B, Fleury S, Illouz G, Lafon P, Nioche J, Prévost S 2000 TyPTex: Inductive typological text classification analysis for NLP systems tuning/evaluation. In Proceedings of the Second International Conference on Language Resources and Evaluation, Athens, pp 141-148. Gazdar G, Pullum G, Carpenter R, Klein E, Hukari T E, Levine R D 1988 Category structures. Computational Linguistics, 14(1): 1-19. 
Habert B, Salem A 1995 L'utilisation de catégorisations multiples pour l'analyse quantitative de données textuelles. TAL, 36(1-2): 249-276. Hatzivassiloglou V, Wiebe J 2000 Effects of adjective orientation and gradability on sentence subjectivity. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrücken, pp 299-305. Illouz G, Habert B, Fleury S, Folch H, Heiden S, Lafon P 1999 Maîtriser les déluges de données hétérogenes. In Condamines A, Fabre C, Péry-Woodley M-P (eds), Corpus et traitement automatique des langues : pour une réflexion méthodologique, Cargese, pp 37-46. Illouz G, Habert B, Folch H, Fleury S, Heiden S, Lafon P, Prévost S 2000 TyPTex: Generic features for Text Profiler. In Content-Based Multimedia Information Access (RIAO'00), Paris, pp 1526-1540. Illouz G 2000, Typage de données textuelles et adaptation des traitements linguistiques. Doctorat d'informatique, Université Paris-Sud. Ingenia 1995 Manuel de développement Sylex-Base. Paris, Ingenia – Langage naturel. Jacquemin C 1994a FASTR: A unification-based front-end to automatic indexing. In Proceedings of Intelligent Multimedia Information Retrieval Systems and Management (RIAO'94), New York, pp 34-47. Jacquemin C 1994b FASTR: A unification grammar and a parser for terminology extraction from large corpora. In Proceedings of Journées IA'94, Paris, pp 155-164. Jacquemin C 1997 Variation terminologique : Reconnaissance et acquisition automatiques de termes et de leurs variantes en corpus. Mémoire d'Habilitation a Diriger des Recherches en informatique fondamentale, Université de Nantes. McEnery A, Wilson A 1996 Corpus linguistics. Edinburgh, Edinburgh University Press. Miller K 1998 Modifiers in WordNet. In Fellbaum C (ed), WordNet: an electronic lexical database. Cambridge, MIT Press, pp 47-67. Ruch P, Gaudinat A 2000 Comparing corpora and lexical ambiguity. In Proceedings of the Comparing Corpora Workshop (38th Annual Meeting of the Association for Computational Linguistics), HongKong, pp 14-20. Sekine S 1998 The domain dependence of parsing. In Proceeding of the Fifth Conference on Applied Natural Language Processing (Association for Computational Linguistics), Washington, pp 96- 102. Shieber S 1985 An introduction to unification-Based Approaches to Grammar. Stanford, CSLI Lecture Notes 4 (Center for the Study of Language and Information). Tambouratzis G, Markantonatou S, Hairetakis N, Vassilliou M, Tambouratzis D, Carayannis G 2000 Discriminating the registers and styles in the Modern Greek language. In Proceedings of the Comparing Corpora Workshop (38th Annual Meeting of the Association for Computational Linguistics), HongKong, pp 35-43. 368 Annotated corpora for assistance with English-Polish translation. Barbara Lewandowska-Tomaszczyk, University of Lodz, Poland, Michael P. Oakes, University of Sunderland, England, Paul Rayson, University of Lancaster, England. Alignment of a large bilingual corpus of original material and an acceptable translation facilitates a number of automated and partially automated approaches to translation. (Kay and Roescheisen 1993, p 121). The approach to the automatic alignment of Polish and English texts taken in this paper is that of Gale and Church (1993). Once the texts have been aligned, they can then be displayed to the translator as required using Scott's (1996) “WordSmith” concordancing tool. 
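The sketch below gives a much-reduced version of the length-based idea behind Gale and Church's method: a dynamic programme over sentence lengths (in characters) that considers only 1-1 and 1-0/0-1 beads and uses a simple squared length-difference cost instead of the original probabilistic score. It is illustrative only, and the sentence lengths at the end are invented.

```python
def align(src_lengths, tgt_lengths, skip_penalty=450.0):
    """Reduced Gale & Church-style alignment over sentence lengths in characters."""
    n, m = len(src_lengths), len(tgt_lengths)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            if i < n and j < m:                                   # 1-1 bead
                step = (src_lengths[i] - tgt_lengths[j]) ** 2 / max(src_lengths[i], 1)
                if cost[i][j] + step < cost[i + 1][j + 1]:
                    cost[i + 1][j + 1] = cost[i][j] + step
                    back[i + 1][j + 1] = (i, j, "1-1")
            for di, dj, bead in ((1, 0, "1-0"), (0, 1, "0-1")):   # unmatched sentence
                if i + di <= n and j + dj <= m and cost[i][j] + skip_penalty < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cost[i][j] + skip_penalty
                    back[i + di][j + dj] = (i, j, bead)
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        i, j, bead = back[i][j]
        beads.append(bead)
    return list(reversed(beads))

# Invented character lengths of four English and four Polish sentences.
print(align([71, 120, 33, 54], [80, 131, 30, 60]))   # ['1-1', '1-1', '1-1', '1-1']
```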
Using WordSmith, sentences and their translations can be retrieved and shown to the translator if they contain specified words, phrases or word fragments. The power of the bilingual concordancing tool can be enhanced by using annotated corpora for alignment. The two types of annotation we have employed are a) Part of Speech tagging, provided by the CLAWS tagger (Garside and Smith, 1997), and b) semantic tagging, provided by the ACASD suite of computer programs (Thomas & Wilson, 1996). With part of speech tagging, every word in the corpus is automatically assigned its most likely grammatical class, e.g. the notation “book_NN1” shows that a particular instance of the word “book” is most probably a singular common noun. If each word in the aligned text has its own part of speech category, we can use the search term “book” to retrieve every aligned region containing the word “book”, irrespective of whether it occurs as a noun, verb, or any other grammatical category. On the other hand, the notation “book_NN1” will only retrieve aligned regions in which the word “book” occurs as a singular common noun, enabling us to see the various translations in Polish of the English word “book” when it occurs as a singular common noun. Semantic tagging involves assigning each word in the corpus with a label showing the semantic category (akin to a thesaurus category such as “colour”, “power” or “education”) to which each word belongs. For example, the notation “clinical_B2” would retrieve only aligned regions containing “clinical” as a medical term, while “clinical_E1-“ would retrieve only those regions containing the word “clinical” where it means “lacking emotion”. The notation “clinical” allows the retrieval of aligned regions containing the word “clinical” irrespective of its meaning, “_B2” will allow the retrieval of aligned regions containing any medical term, and “_E1-” will allow us to view all the synonyms in the English corpus meaning “without emotion” alongside their Polish equivalents. At present only the English text has been fully annotated, since a Polish semantic tagger is currently unavailable. We will discuss our approach whereby a machine readable Polish-English dictionary and an alignment of semantically tagged English text with the equivalent unannotated Polish text might be used to semantically tag certain words in the Polish part of the text. References Gale W A, Church K W 1993 A program for aligning sentences in bilingual corpora, Computational Linguistics 19(1): 75-102. Garside R, and Smith N 1997 A hybrid grammatical tagger: CLAWS4, in Garside R, Leech G, and McEnery A. (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 102-121. Kay M, Roescheisen M 1993 Text-translation alignment, Computational Linguistics 19(1):121-142.. Scott M 1996 WordSmith Tools Manual, Oxford University Press. Thomas J, Wilson A 1996 Methodologies for studying doctor-patient interaction, in Thomas J, Short M (eds) Using corpora for language research, London & New York, Longman. 413 OpenText.org: the problems and prospects of working with ancient discourse Matthew Brook O'Donnell, Stanley E. Porter and Jeffrey T. Reed University of Surrey Roehampton and OpenText.org 1. Introduction The vast majority of studies in corpus linguistics have focused upon contemporary usage of modern languages. 
However, although there have been a number of studies of the earlier periods of some of these languages, such as Old English and Old French, they have tended to adopt the methods developed for modern languages. In our theoretical paper (Porter and O'Donnell 2001), we have explored the methodological challenges and questions posed by the study of an epigraphic language, such as Hellenistic Greek.1 In particular, we have found the need for closer attention to the criteria used in the compilation of a corpus, the integration of the levels of annotation applied to a corpus, and maintaining a focus upon both traditional referential access and narrative or sequential textanalysis. 2 As a result, our approach is more than simply computer-aided text analysis, but in some ways fulfils the goals of the originators of corpus-based linguistics. It does this by performing a full analysis of the language, utilizing a structured corpus, rather than analyzing just a small portion of a much larger corpus. The textual orientation of classical and New Testament scholarship is compatible with and, in fact, requires a micro-pattern analysis, and the reading and analysis of full texts. Rather than placing it outside the scope of corpus linguistics, we argue that this perspective offers new avenues for the discipline, ones that we are exploring in OpenText.org. OpenText.org is a web-based initiative dedicated to creating resources for the linguistic analysis of Hellenistic Greek, and especially the Greek of the New Testament, in collaboration with interested scholars around the world. OpenText.org aims to make use of the insights and methods of corpus linguistics, specifically in terms of building a representative corpus of Hellenistic Greek, richly annotated according to a functional discourse model. This bottom-up discourse model relies upon the notion of levels of formal analysis. We parse all forms, beginning with morphology, but categorize from the word group up. We have found that the word group constitutes the smallest meaningful unit for discourse analysis. The increasingly higher levels are those of the clause, paragraph and discourse. One of the major principles of our annotation scheme is to mark features at the level of discourse at which they function. This has required the development of level-reflective and levelsensitive notational categories, as certain elements such as a conjunction may operate either at the word group, clause or paragraph level. These notational categories serve in a horizontal dimension both to specify the function of the individual element and to indicate its relationship to other elements within the structural unit. Each structural unit then constitutes a minimal unit at the next highest level of discourse. Each level of analysis builds upon the previous level, and thus analysis at a particular level can reach down to include features from lower levels. This vertical dimension of analysis is also segregated according to the components of register, field, tenor and mode. As a result, various features at a given level will serve different register meta-functions. This schematization provides a means for moving from the elements of text to the context of situation. The intended result is a complex calculus of features and functions that enables the analysis of the discourse. At the heart of this method, therefore, is annotation. 
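One way to picture this bottom-up organisation is as a containment hierarchy in which each unit is built from units of the level below, so that analysis at one level can reach down to lower-level features. The sketch is purely illustrative: the class and field names are not those of the OpenText.org schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Word:
    text: str
    morphology: dict                      # all forms are parsed, but categorisation starts higher up

@dataclass
class WordGroup:
    head: Word                            # the smallest meaningful unit for discourse analysis
    modifiers: List[Word] = field(default_factory=list)

@dataclass
class Clause:
    groups: List[WordGroup]

@dataclass
class Paragraph:
    clauses: List[Clause]

@dataclass
class Discourse:
    paragraphs: List[Paragraph]

    def word_groups(self):
        """Reach down two levels: every word group in the discourse."""
        return [group
                for paragraph in self.paragraphs
                for clause in paragraph.clauses
                for group in clause.groups]
```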
More than that, we have found that the process of annotation itself constitutes a major part of the analytical process, raising questions as to the function of components of discourse within their respective units and the nature of text itself as these elements constitute the discourse (DeRose et al. 1990; Renear et al. 1996; Leech 1994). This paper presents two examples utilizing this methodology, drawn from the Greek New Testament. The first demonstrates how the OpenText.org annotation model can facilitate a full discourse analysis of the letter to Philemon. We have consciously selected this text for a number of reasons. The letter to Philemon is a small, complete text that encapsulates an interesting moment in discourse. It allows us to present a synchronic view of this text, while still facilitating analysis of individual textual components. The result is, we think, a presentation of most of the major features of discourse as they are contained and displayed in our discourse model. We unfortunately will not be able to present all of these dimensions here.
1 We are defining Hellenistic Greek as that Greek written by native and non-native Greek speakers throughout the Hellenistic and Roman worlds from approximately the fourth century B.C. to the fourth century A.D.
2 The third element aligns corpus linguistics with discourse analysis, sharing a common concern for the analysis of real language usage, the observation of patterns of linguistic usage, and the (quantitative) filtering of large amounts of data. On referential and narrative methods of access to corpus data, see McCarty 1996.
Though the shortest Pauline letter, Philemon has received considerable discussion in scholarly circles; however, the major focus of this discussion has been the attempt to reconstruct the probable context of situation that gave rise to the discourse (see Pearson 1999 for an overview of major positions). Few of these attempts are predicated upon a thorough analysis of the text as discourse, at least as we are defining and using the concept here. We think that insight can be gained into the context of situation of this provocative and intriguing New Testament text through close analysis of the elements of discourse, particularly the questions of the relationships among the major participants (Paul and Philemon) and of the function that Onesimus plays in these relationships. We think that our discourse model pushes interpretation forward by pointing out textual relationships that previous interpreters have often overlooked. In that sense, questions of the tenor of discourse, among others, are specifically addressed in this example. The second example illustrates the use of semantic-domain annotation of the book of Revelation to explore structure, specifically through the identification of cohesive units. In this example, rather than examining a single discourse from a variety of perspectives, we are examining one particular component of register over a larger and potentially more complex text. The structure of the book of Revelation suggests a set of complex semantic relations often linked with shifts in text-type. Our analysis suggests that through a variety of textual means the author has structured discourse themes to progress the thematic content of the book. This thematic content is introduced through lexis.
Analysis of the semantic-domain patterns over the length of the discourse shows how the thematic material is not only introduced and treated in individual units, but also how it is used to create cohesion across the individual paragraph units and the entire discourse. Whereas the example from Philemon examines a number of levels in order to discuss the tenor of discourse, this examination treats broad patterns of semantic-domain structure to say something about field and mode of discourse.
2. Philemon and participant structure
As we mentioned above, our method of analysis begins with the word group, and here we present three word groups from the book of Philemon. We present these because each illustrates important elements; we then move to the analysis of a single set of clauses that constitutes part of a paragraph. Under the field component of register the semantic relationships between words in a word group are annotated. We have defined four forms of modification, and one type of conjunction. The four modifiers are: specifier (sp), including articles and prepositions; definer (df), adjectives and appositional words; qualifier (ql), genitive and dative modifiers; and preposition (pr), a prepositional phrase that modifies a substantive. The conjunction or connection (cn) relationship is used to join two words within the group. It is helpful to visualize these semantic relationships through a series of nested boxes. Each box represents a word with slots for each of the relationships below the word. The boxes for modifiers (and their associated modifiers) are drawn within the relative slot of the word they modify. Thus word groups can be represented through the application of a recursive process. The first example is word group 1 (see fig. 1) in the book of Philemon, part of what is traditionally called the epistolary salutation, 'Paul, prisoner of Christ Jesus'. This word group consists of four words, with the head term being Paulos in the nominative, or what might be called the subject case, but this implies analysis beyond the word group. Syntagmatically, the three words follow in sequential order the word Paulos (syntagmatic order is indicated by the word identifier, e.g., w10), but there are two types of semantic relations indicated. The first is that of definer, in which desmios (prisoner) defines who Paul is. This relationship often describes what is traditionally called an epexegetical relationship that restates the head term. The relationship of Iēsou (Jesus) and Christou (Christ) is also one of definition. However, the relation between Christou (Christ) and desmios (prisoner) is one of qualification. This is not a relationship that defines but one that qualifies who the servant is: he belongs to Christ (modifiers in the genitive case in Greek, as this one is, often have qualifier relations). This is a fairly straightforward example of a word group. This first word group is annotated in XML in the following manner and visualized in figure 1.
<wg:group id="wg1" head="w1">
<w id="w1">Paulos</w>
<w id="w2" modify="w1" rel="define">desmios</w>
<w id="w3" modify="w2" rel="qualify">Christou</w>
<w id="w4" modify="w3" rel="define">Iēsou</w>
</wg:group>
Fig. 1. Word group 1 'Paul, prisoner of Christ Jesus'
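Although not part of the OpenText.org tools themselves, a short sketch may help to make the recursion explicit: a word-group record like the one above can be read back into exactly the tree that the nested-box diagrams visualize. This is a minimal illustration only, written against the fragment quoted above; the namespace URI is invented for the example, and the Greek words are given in transliteration.

import xml.etree.ElementTree as ET

WG_NS = "http://example.org/wordgroup"  # hypothetical namespace URI

xml_fragment = f"""
<wg:group xmlns:wg="{WG_NS}" id="wg1" head="w1">
  <w id="w1">Paulos</w>
  <w id="w2" modify="w1" rel="define">desmios</w>
  <w id="w3" modify="w2" rel="qualify">Christou</w>
  <w id="w4" modify="w3" rel="define">Iesou</w>
</w>
</wg:group>
""".replace("</w>\n</wg:group>", "</wg:group>")  # keep the fragment well formed

def print_word_group(fragment):
    """Print the head word of a word group with its modifiers nested below it."""
    group = ET.fromstring(fragment)
    words = {w.get("id"): w for w in group.findall("w")}
    # children[x] lists the (relation, word id) pairs of the words that modify word x
    children = {wid: [] for wid in words}
    for wid, w in words.items():
        target = w.get("modify")
        if target is not None:
            children[target].append((w.get("rel"), wid))

    def walk(wid, depth=0):
        print("  " * depth + wid + " " + words[wid].text)
        for rel, child in children[wid]:
            print("  " * (depth + 1) + "[" + rel + "]")
            walk(child, depth + 1)

    walk(group.get("head"))

print_word_group(xml_fragment)

Run as it stands, this prints Paulos with desmios nested one level below it, Christou below that and Iesou at the deepest level, mirroring the structure of figure 1.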
The second word group is number 3 (see fig. 2; appendix fig. 8 for annotation), 'Philemon, the beloved and fellow worker of us'. It is also part of the salutation of the book of Philemon, except that here the head term, Philēmoni (Philemon), is in the dative case, or what might be called a complement case, that is, a case in which a complement can occur.3 In word group 3 we have a slightly different kind of semantic relationship among the elements, one that is not linear as word group 1 is. Here, the head term is defined by two separate modifiers, both adjectives, agapētō (beloved) and synergō (fellow worker). The first of these is preceded by a specifier, the article (tō), and the second is followed by a qualifier, the pronoun hēmōn (our).
Fig. 2. Word group 3 'Philemon, the beloved and fellow worker of us'
These two modifiers are connected in their modifying function by a conjunction (kai), operating at the word-group level.4 Our word-group analysis is economical insofar as the distribution of elements is concerned, so we only analyze the article as occurring as a specifier of the first adjective and the pronoun as a qualifier of the second adjective. The annotation principle behind this is to indicate each modifying word as connected to the one other word in the group to which it is most closely attached in terms of grammatical function. So far, from the structure of the word groups, one might well think that these groups are somewhat similarly structured, and probably perform similar functions. Word group 54, the third example to consider (see fig. 3; appendix fig. 9 for annotation), consists of a significantly larger number of words than either word group 1 or 3, 'Onesimus, the then useless to you, but now useful to you and to me'. In some ways, the structure here is similar to word group 3. There is a head term, Onēsimon (Onesimus), in the accusative case, another of what might be called the complement cases (again extending analysis beyond the word group). There are also two definers, both adjectives. The first is achrēston (useless) and the second euchrēston (useful). Each of these, however, has a number of further modifiers. The adjective achrēston has a specifier, the article ton, and two qualifiers, pote (then) and soi (you), the first an adverb and the second a pronoun. The second definer, euchrēston, has three qualifiers, the adverb nyni (now), and the pronouns soi (you) and emoi (me).
3 However, at the level of the word group this is not significant, since here we are analyzing relations of the words within the word group, not its relations outside of it.
4 The word synergos can be used as a substantive, in which case the phrase could be analyzed as consisting of two separate word groups, 'Philemon the beloved' and 'fellow worker of us', joined at the clause level by the conjunction kai. The head term of the first word group, Philēmoni, has a single definer, agapētō, and the second, synergō, a qualifier hēmōn. The annotation scheme can allow for alternative analyses where necessary.
There are a number of instances of parallelism here worth noting. The first is the structural parallelism with the two definers; the second is the temporal alternation between past and present; the third is the use of the common root of the adjectives; and the fourth is the repetition and expansion of scope indicated by the pronouns. Some of this parallelism emerges more clearly because of the use of the word group conjunctions. The first, de (but), joins the two defining phrases, while the second, kai (and), joins two qualifiers.
The parallelism, as well as the sheer bulk of the construction, seems to give significant weight, at least at the level of the word group, to this particular construction.
Fig. 3. Word group 54 'Onesimus, the then useless to you, but now useful to you and to me'
These three word groups are important for a number of reasons. The first is that these are the word groups in which the major participants of the discourse of the book of Philemon are introduced, Paul, Philemon and (according to most scholars) Onesimus, a point we will come back to below. There are a number of other participants also mentioned in Philemon (e.g. Christ Jesus, who is, theology aside, in the position of a qualifier of a definer in relation to the head term, Paul; Timothy; Apphia; Archippus; and 'the church in your house'; etc.; see fig. 6), but, as we would see if we analyzed more of the discourse, these are not given prominence as the others are. Each of the major participants is introduced in a word group in which there is significant modification, which is not found in the word groups for the other participants (see word groups 2, 4 and 5, where the modification is noticeably less). For each major participant, there is a grammaticalized reference by name, and then appropriate modification to indicate their role and relation. Subsequent participant reference in the discourse is made by a combination of grammaticalized, reduced and implied reference. For word groups 3 and 54, there is also internal modification that references Paul (and in word group 54 Philemon as well), thus further indicating participant relations at the word-group level. On the basis of this evidence, it appears that Onesimus is at least as important, at least at the level of the word group, as any of the other major participants, and certainly more important than the minor ones. That is consistent with traditional examination of Onesimus (word group 54, see fig. 3) within this Pauline letter by the vast majority of New Testament scholars. Though our treatment of annotation at the word-group level has focused upon the interpretative and discourse significance of individual word groups, it should be clear how this detail of annotation, applied across a larger corpus of texts, could be exploited both through the traditional referential retrieval paradigm (i.e. searching for particular words, participants or grammatical features in a certain modification position) and through more discourse-oriented narrative approaches.5 The next level of analysis is that of the clause. Within the book of Philemon, there are 47 clausal units, arranged within 5 paragraphs. Paragraph one consists of two clauses. The first has six word groups, two of which are presented above. They form the two major groups that constitute the structure of clause one. In paragraph three, however, there are 16 clauses. We wish here to analyze clauses 15-24, since it is within this complex of clauses that the word group that introduces Onesimus into the discourse is found.
5 The annotation of semantic relationships (specifier, definer, qualifier and preposition) within a word group is just one element of the field of discourse. Other features marked at the word-group level include semantic domains (field), part of speech and lexical information (mode) and participant reference (tenor). For detailed specification of the OpenText.org annotation model for each of the levels of discourse and the associated XML DTDs, see http://www.opentext.org/specifications.
Clausal structure, in our analysis, consists of subject, predicate, complement and adjunct. The predicate is the major unit of the Greek clause, but whether the subject is grammaticalized, reduced or implied in relation to that predicate, and its placement in first or later position in the clause, are very important for the information structure. We analyze this information structure in terms of prime and subsequent elements in the clause. Once the clauses have been analyzed, we examine the relations of the clauses at the paragraph level. A primary clause usually has a finite verb form or, in the absence of a finite verb, a word group functioning similarly. These clauses are used to convey the main thread or backbone of the discourse. Secondary clauses are usually relative, participial and infinitive clauses. These clauses are used to add further specification at the present point in the discourse, and do not progress the discourse in the horizontal plane. The paragraph level, under the mode of discourse, is the appropriate point in the model to analyze the connection and relationship between clauses. These connections determine the level of a clause. For example, a clause with a finite verb form that might be classified as primary, but is clearly connected to a secondary clause (e.g. a relative clause), is classified as functioning at the secondary level. The appendix contains the clause and paragraph level analysis of Philemon 10-14. Figure 5 presents a visual representation of this annotation (see appendix, fig. 10 for XML annotation of clauses). Clauses are represented by boxes. The single primary clause (c15), parakalō se peri tou emou teknou, is placed on the left-hand side of the diagram. All secondary level clauses connected to this clause are joined to the right-hand side of this main clause. Other clauses not connected directly to c15 but to another of the secondary clauses are placed to the right of the clause to which they connect, unless they are embedded within the clause (e.g. clause c23, poiēsai, is a complement of clause c22), in which case they are positioned within the box of the clause in which they are embedded.
Fig. 5. Display of clause level and connections for Philemon 10-14
What is important for this paper is the analysis of word group 54 within such a paragraph analysis. Clause 15 is the primary clause of the unit that consists of clauses 15-24, with a series of four secondary clauses related to it, three of them linked by relative pronouns (clauses 16, 17 and 19). Clause 15, 'I urge you concerning my child', refers to the three participants that we mentioned before: Paul, this time implied through the use of the first-person singular verb; Philemon, referred to by use of a reduced pronoun; and a figure cited in a prepositional phrase (adjunct) as 'my child', not yet named. The author then defines this child in the three relative secondary clauses.
The first one has complement-predicate-adjunct-complement structure, with a split complement because of the relative pronoun (hon). Again, Paul is the implied subject of the clause, but it is here that Onesimus is introduced as the complement of this secondary clause. A number of further statements are made about this 'child', Onesimus, though without using his name, in the remaining secondary relative clauses, all with Paul as the implied subject (clauses 17 and 19), and the further secondary clauses (clauses 18, 20, 21). In other words, even though Onesimus has often been analyzed as a major participant in the book of Philemon, with much secondary discussion concerned with who he is and how he relates to Paul and Philemon, the discourse structure does not confirm this analysis. Paul especially, but also Philemon, are the major participants, and their relationship is in fact the major element of the tenor component. This is confirmed by how many times Paul and Philemon, whether in grammaticalized, reduced or implied form, are the subjects or complements of the primary clauses, but also of the secondary clauses.
Participant        Primary clause   Secondary clause   Total
1 Paul                   26               14             40
2 Jesus Christ            9                3             12
3 Timothy                 2                0              2
4 Philemon               21               13             34
5 Us                      4                0              4
6 Apphia                  2                0              2
7 Archippus               2                0              2
8 You (plural)            4                0              4
9 God                     3                0              3
10 The Saints             1                1              2
11 Onesimus               4                9             13
Fig. 6. Summary of participant reference according to clause level in Philemon
Figure 6 shows a summary of the number of references to each of the participants annotated in the discourse. Onesimus is confined for the most part to peripheral status as a complement of secondary clauses, often in reduced or implied form (9 of the 13 references to Onesimus occur in secondary clauses and 8 of these references occur in the complement slot of the clause). He is not even introduced by name the first time he is referred to, as are the other major participants, but is referred to in a reduced form by means of a noun, 'child', that fulfils a discourse function of characterizing Onesimus in relation to the major participants. The implications of this analysis for interpretation of the discourse are significant. The focus of future analysis will need to be upon the major participants as supported by the discourse itself, Paul and Philemon, recognizing that, as important as Onesimus may be in terms of a catalytic function in their interpersonal relations, his function is secondary to the primary relation. Further, rather than the status of Onesimus as a slave or a runaway, and whether or not he took property that was not his, the major issue seems to be how it is that Paul as apostle relates to Philemon as a significant figure in the church to which Paul writes. References to Paul in the letter according to clause component are: subject 8x, predicate 16x, complement 12x and adjunct 6x. For Philemon, these figures are: subject 1x, predicate 9x, complement 17x and adjunct 7x. These basic figures require filtering according to the causality (voice of verbal forms, indicating actor or patient status) and position within a word group (see above). Some have noted that Paul uses a number of discourse techniques in his communication with the church there, although they hold back from characterizing these techniques as at all manipulative or forcefully persuasive (Wilson 1992; Fitzmyer 2000; cf. Porter 1999). Such a reading, however, seems much closer to what the discourse supports than does a reading of this letter as an exposition of slave and master relations in the ancient world, in which Paul has only an incidental interest.
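A tabulation such as that in Figure 6 can in principle be produced mechanically once participant reference has been annotated at the clause level. The following sketch is illustrative only: the cl:clause elements follow the fragments given in the appendix, but the ref elements, their participant attribute and the namespace URI are invented stand-ins, since the actual OpenText.org participant encoding is not reproduced in this paper.

import xml.etree.ElementTree as ET
from collections import Counter

CL_NS = "http://example.org/clause"  # hypothetical namespace URI

sample = f"""
<cl:clauses xmlns:cl="{CL_NS}">
  <cl:clause id="c15" level="primary">
    <ref participant="Paul"/><ref participant="Philemon"/><ref participant="Onesimus"/>
  </cl:clause>
  <cl:clause id="c16" level="secondary" connect="c15">
    <ref participant="Paul"/><ref participant="Onesimus"/>
  </cl:clause>
</cl:clauses>
"""

def tally_participants(xml_text):
    """Count references to each participant, broken down by clause level."""
    root = ET.fromstring(xml_text)
    counts = {}
    for clause in root.findall("{%s}clause" % CL_NS):
        level = clause.get("level")
        for ref in clause.findall("ref"):
            counts.setdefault(ref.get("participant"), Counter())[level] += 1
    return counts

for participant, levels in tally_participants(sample).items():
    print(participant, levels["primary"], levels["secondary"], sum(levels.values()))

On the toy input this prints one line per participant with primary, secondary and total reference counts, that is, the same three columns as Figure 6.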
3. Revelation, semantic-domains and cohesion
The book of Revelation is much larger than Philemon. Philemon has a total of about 335 words compared to the approximately 9850 in Revelation. This extra length provides enough discourse scope for the development of a number of different patterns. The way several of these patterns relate to each other is what we would like to examine in this second example. As discussed in the introduction, a common problem for both discourse analysis and traditional referential corpus linguistics is the analysis and summarization of large amounts of data in order to carry out a full analysis of even a single feature of discourse, such as lexical cohesion, over even a medium-sized discourse, such as the book of Revelation. Considerable amounts of information must be tracked. The standard reference for the study of cohesion in discourse is the work of Halliday and Hasan (1976). This has served as the basis for a number of recent discourse annotation schemes (Wilson and Thomas 1997) and theoretical studies (Hoey 1991). In addition, a number of more computationally orientated studies (Morris and Hirst 1991; Kozima and Furugori 1994) have developed algorithmic approaches to the study of lexical cohesion. One of the ways the study of Hellenistic Greek, particularly the Greek of the New Testament, is more advanced than comparable study of English is in the availability of a semantic-domain lexicon (Louw and Nida 1988; Nida and Louw 1993). This lexicon distinguishes 93 broad semantic domains, covering all of the semantic fields of the Greek found in the New Testament. Within these 93 domains are a large number of sub-domains. Words are categorized within these domains and sub-domains, with some words being found within as many as four or five domains; function words such as prepositions may be categorized in as many as six domains. This is not the place to discuss the shortcomings of this important work, although one of the intended outcomes of our study is to construct a more principled set of domains, more closely linked to patterns of New Testament usage. In the meantime, however, this tool has proved invaluable in the study of lexis and cohesion in the New Testament. For our study of lexical cohesion in the book of Revelation, we focus upon content words only (verbs, nouns, adjectives, and adverbs). Each of these word types in the text was annotated with its major domain number from the lexicon, with no attempt to disambiguate multiple classifications. We then made use of a simple algorithmic technique to identify the number of semantic chains found within a fifty-word window measured at intervals of ten words throughout the text (a short illustrative sketch of this procedure is given below). The first window begins at word 1 and extends to word 50, the second window at word 11 extending to word 60, etc. At each interval, the number of domains exhibiting chains of five or more words in the subsequent window is noted. For example, the first window (words 1-50) has two chains, one from domain 33, communication, and one from domain 93, names of persons and places. The following window (words 11-60) has only a single chain from domain 33. In a number of ways this fits the pattern that one might expect at the beginning of a book.
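The windowing procedure just described is simple enough to sketch. The following is an illustrative implementation only, assuming the annotated text is available as a list of (word, domain) pairs; the window size, step and chain threshold are those stated above, and the toy data at the end are invented.

from collections import Counter

def chain_counts(tagged_text, window=50, step=10, threshold=5):
    """Return (start_position, number_of_chains) pairs over a sliding window.

    A 'chain' is a semantic domain represented by `threshold` or more words
    within the current window; words without a content-word domain carry None
    and are never counted.
    """
    results = []
    for start in range(0, max(len(tagged_text) - window + 1, 1), step):
        domains = Counter(
            domain for _, domain in tagged_text[start:start + window]
            if domain is not None
        )
        chains = sum(1 for count in domains.values() if count >= threshold)
        results.append((start + 1, chains))  # report 1-based word positions
    return results

# Invented toy data: 60 words alternating between domains 33 and 93.
toy = [("w%d" % i, 33 if i % 2 else 93) for i in range(60)]
print(chain_counts(toy))

Plotting the second element of each pair against the first gives the kind of line-plot of chain counts used below to provide a macro-view of the discourse.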
We have found that domain 33 is probably the most common domain in New Testament books, especially the epistles, and it indicates what one might expect in a theological text: the desire to communicate is activated, and this chain is often maintained through much of the discourse. It is also worth noting, though not necessarily of great significance, that the beginning introduces at least some of the people or places involved in the action. This is a relatively low-level use of vocabulary. Figure 7 contains a line-plot with the position in the discourse in terms of word number on the x-axis and the number of significant semantic chains (domains with five or more words) at each point indicated on the y-axis. This graph, through the alternation of peaks and troughs of semantic chain usage, provides a helpful macro-view of lexically cohesive units within the book of Revelation (and perhaps insights into thematization), and serves as a tool to highlight sections for closer micro-level analysis. The genre of the book of Revelation is a complex one, since it has a mix of text-types. The peaks and troughs of the semantic chains are coordinated with the major structural divisions of the text-types. For example, within a new text-type, such as the letter section, a recurring pattern is the use of a semantic chain that peaks and then recedes. This seems to be a consistent pattern, in which the new text-type activates a set of related terms. The text-type continues, even though the semantic chains may not continue, before a new text-type activates a new set of domains.
Fig. 7. Lexical cohesion at 10-word intervals across the book of Revelation
Two micro-level analyses serve to illustrate the kinds of observations that emerge from this close study. The first is from Revelation 4 and the other from Revelation 16. The window beginning at word 1611 (corresponding to Rev. 3.22, the last verse of chapter 3) seems to mark a significant shift in the discourse. The preceding sample windows have had only one or two semantic chains. One of these, domain 33 (communication), is relatively low-level in significance, but the other chains anticipate the next structural unit, by introducing words from semantic domain 12 (supernatural beings and powers) and domain 37 (control, rule). This confirms a similar finding concerning the way in which grammatical features such as tense-form and voice anticipate and mark transitions in discourse (Biber, Conrad, and Reppen 1998; Porter 1994). Beginning with word 1611, we find a long chain of domain 12 that extends over 200 words. A number of additional but shorter chains are also activated at this point at the opening of the unit. These include a continuation of semantic domain 37, and the introduction of domains 85 (existence in space), 6 (artefacts), 60 (number), and 14 (physical events and states). Shorter chains dispersed throughout this paragraph (extending from words 1611 to 1931) include semantic domains 2 (natural substances), 4 (animals), 67 (time), 57 (possess, transfer, exchange), and 41 (behavior and related states). These shorter chains are usually between two and four windows long, a maximum of 80 words. This is a semantic-chain based description of the unit. The scene itself in the book of Revelation marks a transition from the epistolary sections (the so-called letters to the seven churches) to the apocalyptic vision that constitutes much of the rest of the book. Most commentators would see these as the two major sections of the text.
The first scene of this apocalyptic section is the dramatic appearance of the heavenly throne room. In this room, there is a majestic throne surrounded by other thrones with heavenly beings seated upon them. These beings have crowns and jewels and golden lamps. These two different descriptions are complementary. The latter is an aesthetically based account of the various elements depicted. The semantic-chain description is an attempt to quantify and even explain the aesthetic account by classifying the data that make up the account. The second section is in many ways similar to the first. Here we are concerned with the unit from word 6621 to word 6751, which is preceded by a 200-word section with virtually no significant semantic chains. In contrast to the section discussed above, there is no preparatory transitional material that anticipates this unit. The semantic chains in this section come from semantic domains 15 (linear movement), 13 (be, become, exist, happen), 14 (physical events), 41 (behavior and related states), 57 (possess, transfer, exchange), 78 (degree), 79 (features of objects), 85 (existence in space), and 91 (discourse markers). The semantic domains in the Louw-Nida lexicon are arranged along a continuum, with contiguous domains having overlapping semantic features. Thus a given word may be classified within several contiguous domains. That is the case in this particular episode, as demonstrated by the domains activated. The result is that the significance of these chains may be less than these data first suggest, since a number of the chains activated may contain the same word. In other words, there are probably 4 or 5 major semantic areas activated here, rather than the 7 or 9 noted above. Also in contrast with the example above, this intense confluence of semantic chains is relatively short. Attention to the text reveals that the section marks the climax of a unit beginning around word 6300 (Rev. 16.1), which begins: 'I heard a loud voice from the temple saying to the seven angels, "Go pour out the seven bowls of wrath upon the earth"'. Then follows the account of each of the seven angels distributing the contents of their bowls over the earth. This section describes the action and result of the seventh angel, completed with a voice from the throne saying that 'It is done' (Rev. 16.17). This is followed by dramatic environmental disruption, including lightning, thunder, and earthquakes. Commentators have noted the climactic nature of this event in the discourse, which can be confirmed by the pronounced accumulation of semantic chains. We introduced this part of our paper by talking about how such semantic-chain patterning relates to textual cohesion. Cohesion is a concept that seems to be useful on at least two levels. One is in terms of the macro-patterns of usage that unite an entire discourse. We have seen that there are a number of semantic chains that are activated throughout the book of Revelation, such as domain 33 (communication). There are a number of domains that function at this low level to activate a number of basic concepts that unite the discourse. Cohesion also seems to function in terms of the relationship between various units or paragraphs in the discourse. Part of the cohesive function of semantic chains is to activate accumulations of chains at appropriate times. These serve the function of delineating the units of the discourse by closing and opening units, and marking transitions.
The boundary here between cohesion and prominence is therefore not a firm one, since those semantic chains that are marked as prominent because of their relatively infrequent but high-level activation only become prominent when they are seen in relation to the semantic chains that give the discourse its cohesion.
4. Conclusion
In this paper, we have tried to show some of the practical differences that working with an epigraphic language such as Hellenistic Greek makes for corpus linguistics. The kind of micro-analysis that we have briefly introduced in the above two examples seems to be required if one is to utilize the limited corpus size from the ancient world. Micro-analysis allows for close scrutiny of a finite set of elements, but each is seen to function within a variety of levels of discourse. The definition and classification of these levels enables each element to provide maximal data for interpretation. Such micro-analysis can only take place, however, if the corpus of texts is richly annotated to provide as much information as possible. In the first example, that of Onesimus and his relation to Paul and Philemon, we saw that at one level, that of the word group, Onesimus seems to be grammaticalized in similar fashion to the other two major participants. However, when the word group in which he is introduced is placed in the larger frame of being a component of paragraph structure, his prominence fades, and he is seen to be relegated to a peripheral participant role. In the examples from Revelation, we have examined a more narrowly circumscribed set of features in terms of the entire discourse. Here we have noted that by using an intensive semantic-domain study, and correlating this with paragraph boundaries, one can observe how semantic chain shifts are coordinated from paragraph to paragraph. The result is a clearer demarcation of how the major subject matter of each paragraph unit is shifted and developed, but also of how cohesion is both created within the individual paragraph units and extended over the entire discourse. One of the facts of corpus-based studies as usually conceived is that they can generate huge quantities of data for analysis, and this is thought to be a desirable feature, allowing more precise generalizations to be reached on the basis of a larger sample surveyed. One of the complaints about discourse analysis is that it generates too much data to study within the confines of a single analysis. These two are not necessarily reconcilable, especially when a non-finite corpus is involved. Even for an epigraphic language such as Hellenistic Greek, an abundance of data can be generated for analysis. In the light of the finite corpus size, however, these data are to be desired, and to be maximized for their use in corpus-based discourse studies.
References
DeRose S J, Durand D G, Mylonas E, Renear A H 1990 What is text, really? Journal of Computing in Higher Education 1(2): 3-26.
Fitzmyer J A 2000 The letter to Philemon. New York, Doubleday.
Halliday M A K, Hasan R 1976 Cohesion in English. London, Longman.
Hoey M 1991 Patterns of lexis in text. Oxford, Oxford University Press.
Kozima H, Furugori T 1994 Segmenting narrative text into coherent scenes. Literary and Linguistic Computing 9(1): 13-19.
Leech G 1994 Corpus annotation schemes. Literary and Linguistic Computing 8(4): 275-281.
Louw J P, Nida E A 1988 Greek–English lexicon of the New Testament based on semantic domains. 2 vols. New York, United Bible Societies.
McCarty W L 1996 Peering through the skylight: towards an electronic edition of Ovid's Metamorphoses. In Hockey S, Ide N (eds), Research in humanities computing 4: selected papers from the ALLC/ACH conference, Christ Church, Oxford, April 1992. Oxford, Clarendon Press, pp 240-262.
Morris J, Hirst G 1991 Lexical cohesion computed by thesaural relations as an indicator of the structure of text. Computational Linguistics 17: 21-48.
Nida E A, Louw J P 1993 Lexical semantics of the Greek New Testament. Atlanta, Scholars Press.
O'Donnell M B 1999 The use of annotated corpora for New Testament discourse analysis: a survey of current practice and future prospects. In Porter S E, Reed J T (eds), Discourse analysis and the New Testament: results and applications. Sheffield, Sheffield Academic Press, pp 71-116.
Pearson B W R 1999 Assumptions in the criticism and translation of Philemon. In Porter S E, Hess R S (eds), Translating the Bible: problems and prospects. Sheffield, Sheffield Academic Press, pp 253-280.
Porter S E 1994 Idioms of the Greek New Testament. Sheffield, Sheffield Academic Press.
Porter S E 1999 Is critical discourse analysis critical? An evaluation using Philemon as a test case. In Porter S E, Reed J T (eds), Discourse analysis and the New Testament: results and applications. Sheffield, Sheffield Academic Press, pp 47-70.
Porter S E, O'Donnell M B 2001 Theoretical issues for corpus linguistics raised by the study of ancient languages. In Proceedings of Corpus Linguistics 2001.
Renear A H, Mylonas E, Durand D 1996 Refining our notion of what text really is: the problem of overlapping hierarchies. In Hockey S, Ide N (eds), Research in humanities computing 4: selected papers from the ALLC/ACH conference, Christ Church, Oxford, April 1992. Oxford, Clarendon Press, pp 263-280.
Wilson A 1992 The pragmatics of politeness and Pauline epistolography: a case study of the letter of Philemon. JSNT 48: 107-119.
Wilson A, Thomas J 1997 Semantic annotation. In Garside R, Leech G, McEnery A (eds), Corpus annotation: linguistic information from computer text corpora. London, Longman, pp 53-65.
Appendix: annotation examples
<wg:group id="wg3" head="w9">
<w id="w9">Philēmoni</w>
<w id="w10" modify="w11" rel="specify">tō</w>
<w id="w11" modify="w9" rel="define">agapētō</w>
<w id="w12" join="w11" to="w13" rel="connect">kai</w>
<w id="w13" modify="w9" rel="define">synergō</w>
<w id="w14" modify="w13" rel="qualify">hēmōn</w>
</wg:group>
Fig. 8. Annotation of word group 3 (Philemon 1b)
<wg:group id="wg54" head="w145">
<w id="w145">Onēsimon</w>
<w id="w146" modify="w149" rel="specify">ton</w>
<w id="w147" modify="w149" rel="qualify">pote</w>
<w id="w148" modify="w149" rel="qualify">soi</w>
<w id="w149" modify="w145" rel="define">achrēston</w>
<w id="w150" modify="w156" rel="qualify">nyni</w>
<w id="w151" join="w149" to="w156" rel="connect">de</w>
<w id="w153" modify="w156" rel="qualify">soi</w>
<w id="w154" join="w153" to="w155" rel="connect">kai</w>
<w id="w155" modify="w156" rel="qualify">emoi</w>
<w id="w156" modify="w145" rel="define">euchrēston</w>
</wg:group>
Fig. 9. Annotation of word group 54 (Philemon 10b-11)
<cl:clause id="c15" level="primary">
parakalō se peri tou emou teknou
</cl:clause>
<cl:clause id="c16" level="secondary" connect="c15">
hon egennēsa en tois desmois Onēsimon ton pote soi achrēston nyni de soi kai emoi euchrēston
</cl:clause>
<cl:clause id="c17" level="secondary" connect="c15">
hon anepempsa soi auton
</cl:clause>
<cl:clause id="c18" level="secondary" connect="c17">
tout' estin ta ema splanchna
</cl:clause>
<cl:clause id="c19" level="secondary" connect="c15">
hon egō eboulomēn
<cl:clause id="c20" level="secondary" connect="c19">
pros emauton katechein
</cl:clause>
</cl:clause>
<cl:clause id="c21" level="secondary" connect="c19">
hina hyper sou moi diakonē en tois desmois tou euangeliou
</cl:clause>
<cl:clause id="c22" level="secondary" connect="c15">
chōris de tēs sēs gnōmēs ouden ēthelēsa
<cl:clause id="c23" level="secondary" connect="c22">
poiēsai
</cl:clause>
</cl:clause>
<cl:clause id="c24" level="secondary" connect="c22">
hina mē hōs kata anankēn to agathon sou ē alla kata hekousion
</cl:clause>
Fig. 10. Annotation of clause level and connection (Philemon 10-14)
Spelling out the optionals in translation: a corpus study
Maeve Olohan
Centre for Translation and Intercultural Studies, UMIST
PO Box 88, Manchester M60 1QD
maeve.olohan@umist.ac.uk
Abstract
While the use of translations in parallel corpora, mostly for the purposes of contrastive linguistic analysis, is relatively well established, the analysis of translated language as an object of study in its own right has only fairly recently been made possible through the development of corpus resources designed specifically for this purpose. The Translational English Corpus (TEC) at UMIST was the first corpus consisting exclusively of translations, in English, from a variety of source languages and text types. Much of the research carried out thus far using TEC (e.g. Laviosa-Braithwaite 1996, Kenny 1999 and 2000) has been interested in identifying and confirming features of translated language such as explicitation, normalisation, simplification and levelling out (Baker 1996). This kind of research is based on the assumption that, by retrieving and analysing data from TEC and a comparable corpus (e.g. the British National Corpus), it is possible to pinpoint consistent differences in syntactic or lexical patterning between translated English and original English. Some of these may arise from deliberate translation strategies on the part of the translator who wishes to make his/her text more explicit, to normalise or simplify etc. However, TEC can also be used as a means of identifying linguistic patterning which translators will not have been aware of producing, but which occurs as a result of the complex nature of the translation activity itself. Against this background, this paper presents an investigation of explicitation in translation. Preliminary studies using TEC and a subcorpus of the BNC (Burnett 1999, Olohan and Baker 2000) have shown that patterns of use of the optional that with reporting verbs are rather different in translated English than in original English, with translated English very much favouring the use of that, even in contexts which do not warrant it, e.g. for purposes of disambiguation or for the signalling of more formal style.
This paper will present further analysis of optional syntactic features in English and their occurrence in TEC and the BNC, and will test the hypothesis that translated English displays a higher incidence of a range of optional syntactic features than is observed in a comparable corpus of original English, and that this is direct evidence of subconscious processes of explicitation in translation.
1. Corpus-based translation studies
Corpus-based translation studies is a relatively new area of research within translation studies, motivated by an interest in the study of translated texts as instances of language use in their own right. This is in contrast to the not uncommon perception of translations as 'deviant' language use, a view which has generally led to the exclusion of translated texts from most 'standard' or 'national' corpora (Baker 1999). While translations have been seen as useful in parallel bilingual or multilingual corpora, this has usually been for contrastive linguistic analysis which has studied the relationship between source and target language systems or usage. Parallel corpora are naturally also of interest to the translation scholar as they facilitate investigation of the relationship between a translation and its source. Recent work using corpora in translation studies has, however, been more concerned with building corpora of translations so that the use of language in translations may be studied. The first corpus of this nature was the Translational English Corpus at UMIST (described below) which, since its inception, has provided the impetus and inspiration for a number of similar projects for other languages, including Italian, German, Spanish, Finnish, Catalan and Brazilian Portuguese. One of the fundamental concepts in corpus-based translation studies has been the notion of comparable corpus, defined by Baker (1995: 234) as 'two separate collections of texts in the same language: one corpus consists of original texts in the language in question and the other consists of translations in that language from a given source language or languages…both corpora should cover a similar domain, variety of language and time span, and be of comparable length'. Baker's initial groundbreaking work posited a number of features of translation, or 'translation universals', which could be investigated using comparable corpora (Baker 1996). While the term universal in this context is somewhat controversial, not least because of the practical difficulties involved in testing whether something holds true across diverse languages (for many of which corpora of translations and/or original writing do not exist), it has been suggested, for example, that translations tend to be more explicit on a number of levels than original texts, and that they simplify and normalise or standardise in a number of ways. Much of the corpus-based work carried out to date has focused on syntactic or lexical features of translated and original texts which may provide evidence of these processes of explicitation, simplification or normalisation. It should be stressed that, while translators may at times consciously strive to produce translations which are more explicit or simplified or normalised in some way, the use of comparable corpora also allows us to investigate aspects of translators' use of language which are not the result of deliberate, controlled processes and of which translators may not be aware.
2. Corpus data
The Translational English Corpus is a corpus of translated English held at the Centre for Translation and Intercultural Studies at UMIST. It was designed specifically for the purpose of studying translated texts and it consists of contemporary written translations into English from a range of source texts and languages. At the time of writing, it has over 6.4 million words. TEC consists of four text types – fiction, in-flight magazines, biography and newspaper articles – with fiction representing 82%, and biography and fiction together making up 96% of the corpus. The translations were published from 1983 onwards and were produced by translators, male and female, with English as their native language or language of habitual use. The corpus of original English put together for this particular study is a subset of the BNC made up of texts from the imaginary domain. It is thus comparable in terms of genre and publication dates (from 1981 onwards). The texts have been produced by native speakers of English, both male and female. A minor difference between the two corpora which is not significant for current investigations is that TEC consists of full running texts whereas some of the BNC texts are extracts (some as long as 40,000 words). There is a little variation in size between the two corpora, with TEC now slightly bigger than the BNC subcorpus. As TEC continues to grow, new texts will be added to the BNC subcorpus so that the corpora remain comparable in all respects. The data discussed here was extracted from these two untagged corpora using WordSmith Tools V.3.0.
3. Explicitation
The analyses reported on here arose from an interest in studying processes of explicitation in translation, where explicitation refers to the spelling out in a target text of information which is only implicit in the source text. This has long been considered a feature of translation and has been investigated by a number of scholars (e.g. Vanderauwera 1985; Blum-Kulka 1986; Laviosa-Braithwaite 1996; Laviosa 1998; Baker 1995, 1996) who have identified different means or techniques by which translators make information explicit, e.g. using supplementary explanatory phrases, resolving source text ambiguities, and making greater use of repetitions and other cohesive devices. The current research focuses, insofar as this is possible, on subconscious processes of explicitation and their realisation in linguistic forms in translated texts. Since the starting point is the linguistic form, we have concentrated on optional syntactic features, hypothesising that, if explicitation is genuinely an inherent feature of translation, translated text might manifest a higher frequency of the use of optional syntactic elements than original writing in the same language, i.e. translations may render grammatical relations more explicit more often – and perhaps in linguistic environments where there is no obvious justification for doing so – than original writers.
4. Analysis of optional syntactic features in English
Linguists may present the optional syntactic features of English in different ways, but we opted to base this study on Dixon's (1991: 68-71) omission conventions for English, presented in summary form as follows:
A. Omission of subject NP
B. Omission of complementiser that
C. Omission of relative pronoun wh-/that
D. Omission of to be from complement clause
E. Omission of predicate
F. Omission of modal should from a THAT complement
G. Omission of preposition before complementisers that, for and to
H. Omission of complementiser to
I. Omission of after/while in (after) having and (while) *ing
J. Omission of in order
These features span a range of linguistic phenomena, from frequently occurring relative pronouns to much less common constructions (e.g. to be in a complement clause), and they do not focus exclusively on optionality of omission. As will be obvious from the discussion below, they also vary considerably in terms of their identification and quantifiability in a corpus which is neither tagged nor parsed. In some instances, as can be seen in 4.3, 4.4, 4.9 and 4.10, omission is difficult to measure but occurrence, i.e. inclusion, can be traced and compared across corpora to give an indication of differences in usage of the longer surface form between corpora.
4.1. Omission of subject NP
This refers to omission of a subject NP in a number of circumstances, e.g. under coordination, in subordinate time clauses, from an ING complement clause or from a modal (FOR) TO complement clause. There is no obvious way of finding instances of these in a corpus which is not tagged for parts of speech.
4.2. Omission of complementiser that
Dixon states that 'the initial that may often be omitted from a complement clause when it immediately follows the main clause predicate (or predicate-plus-object-NP where the predicate head is promise or threaten)' (1991: 70). An extensive analysis of the use of that/zero-connective with the reporting verbs SAY and TELL, with reference to TEC and BNC, is presented in Olohan and Baker (2000). The results are summarised in Tables 1 and 2 below, which present both the absolute values (i.e. occurrences) and the percentages for each form:
Form            that            zero
say (TEC)       316 (55.5%)     253 (44.5%)
say (BNC)       323 (26.5%)     895 (73.5%)
said (TEC)      267 (46.5%)     307 (53.5%)
said (BNC)      183 (19.2%)     771 (80.8%)
says (TEC)      116 (40.4%)     171 (59.6%)
says (BNC)      64 (12.8%)      435 (87.2%)
saying (TEC)    76 (67.3%)      37 (32.7%)
saying (BNC)    142 (43.0%)     188 (57.0%)
Table 1: SAY + that/zero in BNC and TEC
Form            that            zero
tell (TEC)      247 (62.8%)     146 (37.2%)
tell (BNC)      300 (38.2%)     486 (61.8%)
told (TEC)      353 (60%)       233 (40%)
told (BNC)      584 (43.6%)     755 (56.4%)
tells (TEC)     55 (68.7%)      25 (31.3%)
tells (BNC)     28 (37.5%)      52 (62.5%)
telling (TEC)   64 (73.6%)      23 (26.4%)
telling (BNC)   85 (42.3%)      115 (57.7%)
Table 2: TELL + that/zero in BNC and TEC
It is immediately clear that the that-connective is far more frequent in TEC than in BNC. With the exception of said and says, that occurs more often than zero for all forms of SAY and TELL in TEC. By contrast, the zero-connective is more frequent for all forms of both verbs in the BNC corpus. These differences have been shown to be statistically significant. Furthermore, the results of the SAY and TELL study were consistent with findings by Burnett (1999), who reviewed use of the verbs SUGGEST, ADMIT, CLAIM, THINK, BELIEVE, HOPE and KNOW in TEC and BNC. While that study did not include all forms of these verbs, the data available show that the that-connective is far more common, relative to the zero-connective, in translated than in original English for forms of all seven of the verbs investigated. The hypothesis that the optional that in reporting constructions occurs proportionately more frequently in translated texts than in original English texts is thus supported.
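The kind of count reported in Tables 1 and 2 can be approximated quite easily on plain text. The sketch below is illustrative only and is not the procedure used in Olohan and Baker (2000): it simply checks whether the word immediately following a reporting-verb form is that, which is cruder than inspecting full concordance lines, and the commented-out file name is invented.

import re
from collections import Counter

REPORTING_FORMS = ["say", "says", "said", "saying", "tell", "tells", "told", "telling"]

def that_zero_counts(text):
    """Tally that vs. zero immediately after each reporting-verb form."""
    counts = {form: Counter() for form in REPORTING_FORMS}
    tokens = re.findall(r"[a-z']+", text.lower())
    for i, token in enumerate(tokens[:-1]):
        if token in counts:
            counts[token]["that" if tokens[i + 1] == "that" else "zero"] += 1
    return counts

print(that_zero_counts("He said that he would come. She said he was late."))
# with open("tec_fiction_sample.txt", encoding="utf-8") as f:  # invented file name
#     print(that_zero_counts(f.read()))

A tagged or parsed corpus would allow the same tally to be restricted to genuine complement clauses; on untagged text a count like this will also pick up, for example, demonstrative that, which is one reason manual inspection of concordance lines remains necessary.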
Although Olohan and Baker (2000) highlight the relative vagueness with which omission and inclusion are accounted for in the linguistics literature, and the lack of guidance on this in reference works for users of English, there are clear patterns of usage in contemporary English writing as evidenced in the BNC corpus, and there is an equally clear contrast between these patterns and those observed in translated English. A brief analysis of one of the verbs suggested by Dixon, namely PROMISE, serves as further illustration and corroboration. Table 3 and Figure 1 below show that, although the number of instances of promise + that/zero was almost identical in the two corpora (135 in BNC and 131 in TEC), the relationship between that and zero in TEC (that = 67.9%, zero = 32.1%) is almost directly inverse to that in BNC (that = 34.1%, zero = 65.9%).
Corpus                        zero     that     Total
BNC   Count                   89       46       135
      % within Corpus         65.9%    34.1%    100.0%
      % within That/zero      67.9%    34.1%    50.8%
      % of Total              33.5%    17.3%    50.8%
TEC   Count                   42       89       131
      % within Corpus         32.1%    67.9%    100.0%
      % within That/zero      32.1%    65.9%    49.2%
      % of Total              15.8%    33.5%    49.2%
Table 3: PROMISE + that/zero in BNC and TEC
Figure 1: Occurrences of PROMISE + that/zero in BNC and TEC
A breakdown of each lexical item (Table 4 and Figure 2) shows that this holds true for all forms of the verb, although some have low occurrences in general (e.g. promises + that/zero occurs only twice in TEC and not at all in BNC).
Figure 2: All forms of PROMISE + that/zero in BNC and TEC
Form       Corpus                       zero     that     Total
promise    BNC   Count                  38       19       57
                 % within Corpus        66.7%    33.3%    100.0%
                 % within That/zero     64.4%    41.3%    54.3%
                 % of Total             36.2%    18.1%    54.3%
           TEC   Count                  21       27       48
                 % within Corpus        43.8%    56.3%    100.0%
                 % within That/zero     35.6%    58.7%    45.7%
                 % of Total             20.0%    25.7%    45.7%
promises   TEC   Count                  1        1        2
                 % within Corpus        50.0%    50.0%    100.0%
                 % within That/zero     100.0%   100.0%   100.0%
                 % of Total             50.0%    50.0%    100.0%
promised   BNC   Count                  46       20       66
                 % within Corpus        69.7%    30.3%    100.0%
                 % within That/zero     69.7%    27.8%    47.8%
                 % of Total             33.3%    14.5%    47.8%
           TEC   Count                  20       52       72
                 % within Corpus        27.8%    72.2%    100.0%
                 % within That/zero     30.3%    72.2%    52.2%
                 % of Total             14.5%    37.7%    52.2%
promising  BNC   Count                  5        7        12
                 % within Corpus        41.7%    58.3%    100.0%
                 % within That/zero     100.0%   43.8%    57.1%
                 % of Total             23.8%    33.3%    57.1%
           TEC   Count                  -        9        9
                 % within Corpus        -        100.0%   100.0%
                 % within That/zero     -        56.3%    42.9%
                 % of Total             -        42.9%    42.9%
Table 4: All forms of PROMISE + that/zero in BNC and TEC
4.3. Omission of relative pronoun wh-/that
This frequently occurring construction is difficult to measure in an untagged corpus. Thus far, only total counts of occurrence of which have been taken, with 11,201 in BNC and 23,607 in TEC. A first step in discarding irrelevant instances was to identify sentence-initial and sentence-final/clause-final which. Their removal leaves 10,457 concordance lines in BNC and 22,483 in TEC, indicating considerably higher usage of which in TEC. Obviously further detailed analysis of these instances is required to identify the occurrences in relative clauses where the coreferential NP is not in subject function in the relative clause, i.e. where omission could have taken place.
4.4. Omission of to be from complement clause
From a very frequent feature above, we come to a very infrequent structure. Dixon is referring here to the omission of to be with 'some verbs taking a Judgement TO complement clause, whose VP begins with be' (1991: 70), with an example of thought + to be + modifier.
Both THINK + to be and FIND + to be were investigated in the corpora (see Table 5). The most common occurrence in both corpora was for the past tense forms (thought and found), and TEC exhibits a greater tendency overall to include to be, but the number of occurrences was very small in both corpora.
Form                     BNC   TEC
THINK (+ *)(+ *) to be   2     6
FIND (+ *)(+ *) to be    4     7
Table 5: think + to be and find + to be in BNC and TEC
4.5. Omission of predicate
The omission of the predicate in coordinated clauses is difficult to capture in an untagged corpus and this has therefore not yet been investigated.
4.6. Omission of modal should from a THAT complement
This refers to the omission of modal should from a THAT complement, with examples of the verbs ORDER and SUGGEST. Neither is particularly common, and both occur predominantly in the past tense form (ordered and suggested). A greater proportion of omission is seen in TEC (see Table 6).
Form                      BNC   TEC
ORDER + that + should     1     6
ORDER + that + zero       2     7
SUGGEST + that + should   19    19
SUGGEST + that + zero     43    58
Table 6: ORDER and SUGGEST + that + should/zero in BNC and TEC
4.7. Omission of preposition before complementisers that, for and to
Some transitive verbs with a preposition as last element in their lexical form which may take a complement clause in object function will omit the preposition before that, for and to, e.g. he confessed to the crime, he confessed to strangling her, but he confessed that he had strangled her. This is not an optional omission and is therefore not of interest in this study.
4.8. Omission of complementiser to
According to Dixon, the complementiser to is optional following HELP or KNOW. The form help was analysed, first discarding all uses of help as noun, as reflexive verb, verb + ING complement and verb + preposition, and then looking at occurrences of help (*) (*) to in detail (Table 7).
Form                        BNC total   BNC relevant   TEC total   TEC relevant
Occurrences of help         2374        300            1792        365
help + to                   62          26             72          38
help + * + to               67          50             98          80
help + * + * + to           19          3              35          19
Total help (+*) (+*) + to               79                         137
help (+*) (+*) + zero                   229                        228
Table 7: help (+*) (+*) + to in BNC and TEC
This data tells us that, although the word form help is more frequent in TEC, its verbal use in both corpora is quite similar, with help (+*) (+*) + to/zero occurring slightly more often in TEC than in BNC; the complementiser to is used in 37.5% of the TEC instances, compared with 26% of the BNC occurrences.
4.9. Omission of after/while in (after) having + participle and (while) *ing
As in 4.3 and 4.4 above and 4.10 below, we can more readily measure occurrence of these features rather than omission. Concordances of while *ing are pruned, discarding constructions such as all the while *ing, after/in/for a while *ing, worth your while *ing. The while *ing construction is much more frequent in TEC overall and in relation to the gerundial use (Table 8).
Form                            BNC   TEC
Total while *ing concordances   150   360
Relevant concordances           138   330
Table 8: while *ing in BNC and TEC
A count of after *ing *ed (which obviously does not take irregularly formed past participles into account) also shows a tendency for TEC to use this construction more frequently than BNC (Table 9).
Form             BNC   TEC
after *ing *ed   11    65
Table 9: after *ing *ed in BNC and TEC
4.10. Omission of in order
According to Dixon, in order is usually omitted before to and may occasionally be omitted before for or that.
While the investigation of every instance of the items to, that and for to see whether an in order has been omitted is not practical, we can easily measure usage of in order to, in order for and in order that and compare results from the two corpora. This investigation yields the following (Table 10): 429 Form BNC TEC in order to 250 1225 in order for 1 14 in order that 12 18 Total 263 1257 Table 10: in order to/for/that in BNC and TEC This does not conclusively prove that in order has been omitted more often in BNC but certainly indicates that the longer forms of the conjunctions appear with markedly higher frequency in TEC. 5. Correlations, contractions, co-occurrences To return to the notion of explicitation then, it could be claimed, on the basis of these measures of inclusion and/or omission of optional syntactic elements above, that the language of TEC makes explicit grammatical and lexical relations which are less likely to be made explicit in original English. Furthermore, this tendency not to omit optional syntactic elements may be considered subliminal or subconscious rather than a result of deliberate decision-making of which the translator is aware – most translators do not have a conscious strategy for dealing with optional that, for example. It can be argued that it is the nature of the process of translation and the cognitive processing which it requires which produces the kind of patterning seen here. However, inclusion or omission of syntactic features do not reveal the whole story. Olohan and Baker (2000) pointed out that the optional that data discussed in that study revealed potentially different patterns in other features, such as use of modifiers, pronominal forms, modal constructions etc. in TEC compared with the BNC. Thus, although a specific syntactic or lexical structure can be investigated in terms of overall occurrence and of its usage within the narrow context of a concordance line, the wider issue of co-occurrence and interdependency of features must be considered. Research of this kind on the language of translation still has a long way to go; however a small example can be used to illustrate the possible significance of interdependencies and how they might be investigated further. We can take the data referred to earlier in relation to promise and re-examine it in relation to a number of linguists’ suggestions that that is more likely to be omitted in informal usage (for example Storms 1966; Elsness 1984; Dixon 1991). 
If we also accept that the use of contracted forms constitutes evidence of informal style, then a search for contracted forms, within the promise concordance line only, reveals the following (Table 11): Form BNC TEC promise total 57 48 promise with contracted forms in concordance line 41 (72%) 21 (43.75%) promise + that with contracted forms in concordance line 7 (17%) 4 (19%) promise + zero with contracted forms in concordance line 34 (83%) 17 (81%) promise with no contracted forms in concordance line 16 (28%) 27 (56.25%) promise + that with no contracted forms in concordance line 12 (75%) 23 (85%) promise + zero with no contracted forms in concordance line 4 (25%) 4 (15%) Table 11: Co-occurrence of promise +that/zero and contracted forms in BNC and TEC From this we can see that, although that occurs with much higher frequency in TEC than in BNC, promise co-occurs with contracted forms to a much higher degree in BNC than in TEC, and that, when the that/zero usage is correlated with contracted forms and then compared across corpora, there is actually little difference between the two corpora. Using contracted forms as a measure of informality, this would indicate, firstly, that there is a correlation between inclusion of that and level of formality, and, secondly, that the language of TEC may thus be judged more formal. A large-scale study of contracted forms based on production and pruning of word lists for both corpora yielded the following Table 12 and Figure 3): 430 Form BNC forms BNC totals TEC forms TEC totals apostrophe 5,851 5,269 *'s 4,818 4,623 *'ll 212 9,651 43 4,799 *'d 111 10,645 29 5,349 *'t 48 40,782 30 20,316 *'ve 53 7,768 17 4,068 *'re 12 7,344 8 4,250 it's 9,554 5,046 that's 4,650 2,640 there's 2,655 1,424 he's 2,628 1,951 she's 2,266 1,154 what's 1,601 1,021 let's 913 654 who's 396 334 where's 241 117 how's 146 36 here's 132 89 e's 102 0 I'm 8,773 4,256 d’ = do 3 418 3 84 t’ = the 99 126 0 0 y’ = you 22 53 7 7 Table 12: Contracted forms in BNC and TEC Figure 3: Contracted forms in BNC and TEC as percentage of total occurrence across corpora The most frequent form with apostrophe is *'s, which in the vast majority of cases is a possessive marker rather than a contraction of is or was; many of the *'s occurrences are with names, and many occur only once or a couple of times in the corpus. For this reason, individual occurrences of *'s have not been counted, apart from the most common *'s contractions in BNC (it's, that's, there's, he's, she's, what's let's, who's, where's, how's, here's, e's). Without looking at data for individual occurrences for *'s forms, we can see from the figures above that the total number of *'s forms is similar for both corpora. This is in stark contrast with all other categories, which represent true contractions rather than grammatical markers. For all other contracted forms counted, a very clear and consistent pattern emerges; they are much more frequently used in BNC than in TEC. As mentioned above, one of the conclusions of the linguistics literature in relation to the optional that is that omission is more likely in informal usage. This may also be the case for omission of the relative pronoun that or which and the other optional features discussed above. The only exception is perhaps the modal should following verbs such as suggest and order; if the modal is omitted, the subjunctive is used, which arguably constitutes more formal style than the should construction. 
Interestingly this is the only feature above for which TEC seems to favour omission rather than inclusion. On all other optional forms, TEC is considerably more likely to use the optional item and longer surface form. According to the co-occurrence patterns which Biber (1988) and Biber et al. (1998) suggest as underlying the five major dimensions of English, that-deletion and contractions are in the top three 431 features at the positive end of one scale (Dimension 1); this is indicative of their tendency to co-occur in texts of shared function. These and the other features grouped with them are associated with ‘involved, non-informational focus, related to a primarily interactive or affective purpose and on-line production circumstances’ (Biber et al. 1998: 149). Biber et al. continue to describe certain of these positive features, including the two we have dealt with here – that-deletions and contractions – as constituting a reduced surface form which results in a ‘more generalized, less explicit content’ (ibid.). They talk of two separate communicative parameters, i.e. purpose of the writer (informational vs. involved) and production circumstances (allowing careful editing vs. constraints of real-time production). Dimension 1 is therefore labelled ‘involved versus informational production’ (ibid.). Relating this to the findings above, it would appear than the BNC writing is more involved, more generalised, less explicit, less edited than the writing in TEC; the original writer's purpose is more involved, the translator's less so. The translator's surface form is not reduced to the same extent as the original writer's, the translator is thus more explicit, less generalised in both form and content. The translation is perhaps more carefully edited; are original writers more concerned with the creative content and translators with explicitation of linguistic relations? 6. Conclusion In terms of concrete findings of this kind in corpus-based translation studies, there is considerable scope for further studies, particularly in the area of co-occurrence. Mauranen's (2000) study of research on comparison of co-selectional restrictions in Finnish translation and original English is one of the few which tackles collocational and colligational patterning in translated language using comparable corpora, and much more research of this nature needs to be done. In addition, the other co-occurrence features proposed by Biber for this and the other four dimensions of English could be investigated and compared across the corpora. Ongoing work in Saarbrücken using a tagged version of TEC is likely to yield interesting results in this respect, and research is continuing at UMIST to identify and investigate relationships between linguistic features of translated language and the cognitive and social factors which may give rise to them. References Baker M 1995 Corpora in translation studies: an overview and some suggestions for future research. Target 7(2): 223-243. Baker M 1996 Corpus-based translation studies: the challenges that lie ahead. In Somers H (ed) Terminology, LSP and translation: studies in language engineering, in honour of Juan C. Sager, Amsterdam and Philadelphia, John Benjamins, pp 175-186. Baker M 1999 The role of corpora in investigating the linguistic behaviour of translators. International Journal of Corpus Linguistics 4(2): 281-298. Biber D 1988 Variation across speech and writing. Cambridge, CUP. 
Biber D, Conrad S and Reppen R 1998 Corpus linguistics: investigating language structure and use. Cambridge, CUP. Blum-Kulka S 1986 Shifts of cohesion and coherence in translation. In House J and Blum-Kulka S (eds) Interlingual and intercultural communication: discourse and cognition in translation and second language acquisition studies. Tübingen, Gunter Narr. pp 17-35. Burnett S 1999 A corpus-based study of translational English, Unpublished MSc dissertation, UMIST. Dixon R M W 1991 A new approach to English grammar, on semantic principles. Oxford, Clarendon Press. Elsness J 1984 That or Zero? A Look at the Choice of Object Clause Connective in a Corpus of American English. English Studies 65: 519-533. Kenny D 1999 Norms and creativity: lexis in translated text. Unpublished PhD thesis, UMIST. Kenny D 2000 Lexical hide-and-seek: looking for creativity in a parallel corpus. In Olohan M (ed) Intercultural faultlines: research models in translation studies 1, Manchester, St. Jerome. pp 93- 104. Laviosa-Braithwaite S 1996 The English Comparable Corpus (ECC): a resource and a methodology for the empirical study of translation. Unpublished PhD thesis, UMIST. Laviosa S 1998 The English Comparable Corpus: a resource and a methodology. In Bowker L, Cronin M, Kenny D. & Pearson J. (eds) Unity in diversity: current trends in translation studies. Manchester, St. Jerome Publishing. pp 101-112. Olohan M, Baker M 2000 Reporting that in translated English: evidence for subconscious processes of explicitation? Across Languages and Cultures 1(2): 141-158. 432 Mauranen A 2000 Strange strings in translated language: a study on corpora. In Olohan M (ed) Intercultural faultlines: research models in translation studies 1, Manchester, St. Jerome. pp. 119-  Storms G 1966 That-clauses in Modern English. English Studies 47: 249-270. Vanderauwera R 1985 Dutch Novels Translated into English: The Transformation of a ’Minority’ Literature. Amsterdam, Rodopi. 433 Patterns in scientific abstracts Constantin Orasan School of Humanities, Languages and Social Sciences University of Wolverhampton C.Orasan@wlv.ac.uk http://www.wlv.ac.uk/~in6093 1. Introduction In today's world, large amounts of information have to be dealt with, regardless of the field involved. Most of this information comes in written format. Computers seem the right choice for making life easier, by processing text automatically, but in many cases at least partial understanding (if not full understanding) is necessary in order to automate a process. Luhn (1958) proposed a method for producing abstracts which works regardless of the type of the document. Although further research carried out in this area is based on his work, it has become apparent that general methods are not a solution. Instead, more and more research has been done into restricted domains, where certain particularities are used for “understanding”. A well known case is that of DeJong (1982) where the structure of newspaper articles was used in order to get the gist of the articles and then generate summaries. However, all these methods, due to their inherent specificity, require prior knowledge about the characteristics of the genre to which they are applied. Whether they refer to the distribution of words throughout a document or to the overall structure of the document, such textual features have to be identified in a corpus, by applying methodologies from corpus linguistics. 
For example, Biber (1998) investigated four different genres: conversation, public speeches, news reports and academic prose, at lexical, grammatical and discourse level. The findings showed that each genre had its own characteristics which distinguish it from the other genres. One can use these findings for the automatic identification of a genre, or for improving the results of an automatic method for one of these genres. In this paper, the characteristics of a very narrow genre, that of scientific abstracts, are explored on three different levels: lexical, syntactic and discourse. The hypothesis is that it is possible to find patterns, which could be used at a later stage to find similarities between abstracts. It is hoped that these patterns will be useful for improving the results of automatic summarisation methods when applied to scientific texts. The patterns identified in this paper are not only useful for automatic abstracting or computational linguistics in general, but they can also be used in order to teach students how to write abstracts. As is explained in the next section, both reading and writing an abstract are not a trivial task, and many students experience difficulties. Those students who are learning English as a second language have even greater problems with such tasks. The patterns which are identified in this paper could help them to write abstracts. 2. What is an abstract and why is it useful? The notion of an abstract is part of everyday language, but there is more than one definition accepted for it. According to Cleveland (1983: 104) “an abstract summarises the essential contents of a particular knowledge record and is a true surrogate of the document”. A similar definition is given by Graetz (1985): “the abstract is a time saving device that can be used to find particular parts of the article without reading it; … knowing the structure in advance will help the reader to get into the article; … if comprehensive enough, it might replace the article” These two definitions emphasise the most important function of an abstract (i.e. its role as a replacement for an entire document). However, these definitions refer to ideal abstracts produced by professional summarisers. This paper argues that it is very unlikely that an abstract produced by the author(s) of a paper (as in the case for most of the abstracts in the corpus used for this paper) is to be used as a replacement for the whole document. Therefore, a simpler definition of an abstract is considered more appropriate in this context: “a concise representation of a document's contents to enable the reader to determine its relevance to a specific information” (Johnson, 1995). So, the abstract is no longer a “mirror” of the document, but instead draws attention to the most important information contained within the document. Moreover, the main purpose of this definition (i.e. to highlight what is important in the document) can be applied to automatic abstracting. Regardless of the differences between the definitions mentioned above, they all highlight the use of abstracts as filtering devices. In the present days, people are constantly being bombarded by large 434 amounts of information. Scientists and academics use abstracts to filter the existing literature when conducting research into a certain topic or when trying to keep up-to-date with the latest advances in their field of interest.1 On the basis of the abstract, they can decide if an entire document is worth reading or not. 
Swales (1990) considers the process of writing an abstract to be a “rite de passage” for gaining entry into the scientific community via a demonstration of increasing mastery of the academic dialect. This is true, writing an abstract is not a trivial task given that it does not allow redundancies and forces the writers to use a lot of compound words. As Halliday (1993) points out, in scientific texts, which include scientific abstracts, lexical density is very high, which makes it difficult to both read and write such texts. There are several ways of classifying abstracts. One way is to divide them, according to their usage. The categories used in this case are: indicative, informative and critical (Lancaster, 1991). An indicative abstract provides a brief description to help the reader understand the general nature and scope of the original document without going into a detailed step by step account of what the source text is about. An informative abstract is more substantial than an indicative abstract and is often used as an alternative to the original document, following the main ideas presented in the document. Usually, these two categories are combined, an abstract performing both an indicative and informative function. A critical abstract gives not only a description of the contents, but also a critical evaluation of the original document. Usually, in this case, it is not the author of the paper who writes the abstract. After analysing the corpus of abstracts used in this paper, it became evident that the vast majority of the abstracts are indicative. Some of them also have an informative function, but cannot be classified as informative because they cannot replace the original document. No critical abstract was included in the corpus. There are also other ways of classifying an abstract: by the way they are used, and by their author, but these classifications are not relevant to this paper. Most of the abstracts in the corpus present the disadvantages of abstracts which are written by the authors of the papers, instead of trained summarisers. As is shown is section 6, most of the abstracts do not always follow the Problem- Solution-Evaluation-Conclusion structure (Swales, 1990). One reason for this could be because the authors do not consider the abstract to be particularly important, in many cases it is written just before the paper is submitted. Moreover, in some cases the content of the abstract does not necessarily reflect the content of the paper, as Cleveland (1983: 110) has remarked “authors as abstractors have been known to use their abstracts to promote the paper; this can create a misleading abstract and is unfair to the user”.2 3. The corpus In order to analyse the structure of scientific abstracts, a corpus of abstracts has been built. It consists of 917 abstracts with 146,489 words. For research in corpus linguistics this may seem a very small corpus, but building a corpus of abstracts which has a large number of words is a tremendous task, given the small size of one abstract. Moreover, as Sinclair (2000) has pointed out, small corpora are not necessarily bad; in some cases a small corpus is the right choice. The research presented in this paper required a lot of human input, and therefore its size had to be kept down to make the analysis possible. However, whenever possible, automatic processing was used. Two sources were used for building the corpus. 
The first one was the Journal of Artificial Intelligence Research (thereafter JAIR), from which 141 abstracts, with 24,509 words, were extracted. As the name suggests, this journal publishes articles in the field of artificial intelligence. Due to the fact that the size of this corpus was too small and the author wanted to compare abstracts from different areas, the INSPEC database was used as a second source of abstracts. The INSPEC database contains abstracts of papers from more than 4,200 journals and 1,000 conferences. Six topics have been selected and the first few abstracts have been included in the corpus. Table 1 presents some details about each topic. 1 Given the purpose of this paper, the use of news abstracts for reporters, politicians or businesspersons is not considered. However, it can be argued that, in such a context, summaries of a single document are not so useful, digests (multidocument summaries) being more appropriate 2 An example is the following sentence from an abstract of an article from conference proceedings: “By mastering the fundamental issues discussed in this paper, you will increase the return of your organisation's investment in data warehouses” 435 Topic No. of words No. of files In proceedings In journals Artificial intelligence 82,141 512 230 282 Computer science 21,467 137 117 20 Biology 16,081 100 50 50 Linguistics 6413 50 26 24 Chemistry 12,096 68 43 25 Anthropology 7717 50 24 26 Total 146,489 917 490 427 Table 1: The characteristics of the corpus Several remarks have to be made regarding the corpus. In this research, articles from artificial intelligence are of particular interest; therefore, most of the abstracts are from this field and other related areas (machine learning, information retrieval, etc.). However, other areas, like anthropology, chemistry and biology have been included for making comparisons between different genres. When computer searches were carried out in order to obtain abstracts which could be analysed in this paper, those abstracts found were from both journal articles and conference proceedings. In one case, that of the biology abstracts, the first fifty abstracts returned by the search came from conference proceedings. As a result, a second search has been performed in this area, this time restricting the search to abstracts of documents published in journals. These two categories of abstracts allow the author to see if there is a difference between the abstracts of papers published in conference proceedings and those published in journal articles. In some cases, an abstract can belong to more that one category. For example, it has been noticed that some abstracts considered as belonging to the field of linguistics, are concerned with some computational aspect of linguistics, and therefore could also be considered as belonging to the field of artificial intelligence. For each topic, the number of abstracts published in conference proceedings and journals is shown in Table 1. No conditions were imposed on abstracts’ place of publication or the author's(s') mother tongue. Therefore, not all of them are written in perfect English. However, it was considered that this would better reflect the use of English in the research community. 4. Length The length of abstracts was considered to be the first way of comparing them. Given that usually a conference paper is shorter than one published in a journal, the author expected to find that the abstracts of journal articles were longer than those of conference papers. 
Also, in the case of the former, the editors impose a strict control on the quality of the article, and subsequently on their abstracts. The statistics showed that abstracts of the journal articles are noticeably longer, both in terms of sentences and words, than the ones belonging to the conference papers (Table 2). The shortest abstracts are the ones belonging to humanist disciplines (linguistics and anthropology). However, due to the small number of abstracts taken from these disciplines it is not possible to conclude that all the abstracts from the humanities are short. Journal Proceedings Total Topic Sent/Abs Words/Abs Sent/Abs Words/Abs Sent/Abs Words/Abs Artificial Intelligence 8.20 165.67 6.05 152.44 7.24 159.74 Computer Science 9.58 232 5.94 163.30 6.40 159.32 Biology 7.9 196.18 5.65 130.02 6.78 163.43 Linguistics 5.78 149.52 5.92 108.65 5.85 127.83 Chemistry 8.58 215.08 6.34 163.30 7.14 181.85 Anthropology 6.23 157.88 6.08 154.43 6.16 156.26 Total 7.39 174.56 5.96 147.94 6.61 160.31 Table 2: The length of abstracts in sentences and words The length of an abstract ranges from 1 sentence to 21 sentences in the case of abstracts taken from journal articles, and 1 to 16 sentences in the case of abstracts in conference proceedings. In both cases it is possible to find abstracts which have only one sentence, but usually this sentence just enumerates the topic covered by the article. An interesting result was obtained when the length in words was computed. The longest abstract has been published in conference proceedings. As a result of this, it is possible to conclude that overall, the abstracts published in journals are longer than the ones published in conference proceedings, both in terms of words and sentences. However, there is no rule which states this. 5. Lexical level There are several ways of analysing a corpus. The most basic form is by displaying and analysing lists of characteristics. The analyses can involve very simple wordlists or be more sophisticated, 436 including the classic concordance format (Kennedy, 1998). In this paper the author starts by analysing word frequency lists and lists of n-grams. These lists are also compared with the same lists generated from a general purpose corpus, the BNC (Burnard, 1995). In the next section, grammatical features of the texts are considered by analysing subject-predicate pair lists. Whenever it was necessary, specially designed programs were used to display the context. 5.1 Word frequency lists Given the small size of the corpus it was thought unwise to make significant generalisations, but even by analysing the lists interesting features can be noticed. For this analysis and the ones which follow, the corpus was tagged using the FDG tagger (Tapanainen, 1997). Using this tagged version of the corpus, word frequency and lemma frequency lists were produced. For each case, two different lists were generated, with and without considering the part-of-speech of each word. These lists were compared with similar lists produced from BNC. It should be pointed out that the results of the comparison have to be treated with caution. The first reason is the huge difference in the size of the two corpora. Therefore, the decision was made not to draw a comparison between the two corpora in terms of frequencies, but using their position in the list instead. Of course the frequencies could have been normalised, but given the small size of the texts in the corpus, it was thought that the results would still be unreliable. 
The second problem with this comparison comes from the language. BNC is a corpus of British English, whereas the corpus of abstracts was not filtered on the basis of the variety of English used. Moreover, it can be argued that most of the English used in the scientific domain is written in American English. For example, the word summarisation and its derived forms (e.g. summarising, summarise, etc.) appears 137 times. In 114 of the cases the American English spelling (AE) was used, whereas the British English spelling (BE) was used in only 23 cases.3 The same problem was found with the word generalise (out of 83 occurrences, 78 used AE and only 5 BE) and characterise (56 occurrences, 46 used AE and 10 BE). However, it can be argued that there is no word with a high frequency of occurrence, which has different spellings in AE and BE. Word freq. list from abs 7132 the 5908 of 4156 and 3082 a 2925 to 2485 in 1953 is 1575 for 1310 The 1204 that 1093 are 910 on 886 with 790 by 717 an 698 be 607 this 585 as 528 system 521 which Word freq. list with tags from abs 7132 the DET 5908 of PREP 4156 and CC 3069 a DET 2466 in PREP 1953 is V 1747 to TO 1569 for PREP 1310 The DET 1178 to PREP 1093 are V 894 on PREP 886 with PREP 789 by PREP 717 an DET 698 be V 597 that CS 575 that PRON 563 this DET 528 system N Lemma freq. list from abs. 8442 the 5913 of 4543 be 4162 and 3293 a 3010 to 2890 in 1644 for 1398 this 1277 that 1165 we 940 use 933 on 909 with 906 system 823 by 781 an 735 have 694 it 636 as Lemma freq. list with tags from abs. 8442 the DET 5913 of PREP 4179 be V 4162 and CC 3270 a DET 2870 in PREP 1823 to TO 1635 for PREP 1269 this DET 1187 to PREP 1165 we PRON 917 on PREP 909 with PREP 906 system N 822 by PREP 781 an DET 718 have V 694 it PRON 615 that PRON 597 that CS Word freq. list from BNC 5538939 the 3086807 of 2631593 to 2574912 and 2091285 a 1824289 in 1088658 that 983593 is 917103 was 897690 I 849027 for 847109 it 802227 ‘s 712502 on 661109 be 657574 with 631554 The 619043 as 596588 you 501209 at Figure 1 Different frequency lists Figure 1 shows the different frequency lists for the first 20 entries in the lists. It is evident that the first 6 entries of the word frequency list, from the corpus of abstracts and from BNC, are almost identical. However, further down in the list, differences appear. The word was, which in BNC occupies the 9th position, in the corpus of abstracts appears in the 51st position. This can be explained by the small number verbs which appear in the past tense. In the frequency list of the corpus the most frequent noun is system, appearing in the 19th position, but if the lemmatised version of the list is taken into consideration instead, it is in the 15th position (even higher if the different types of systems are taken into consideration e.g. eco-system, geo-system). After this position, the nouns are quite frequent; paper occupies the 21st position (24th in lemmatised list), data 23rd (25th), information 27th (26th). In BNC, the first noun on the list is time, which occupies the 71st position, and there are not many nouns in the first 3 However, it should be pointed out that in BNC both spellings of the word summarise can be found, although the British spelling is more frequent that the American one (1220 BE, 751 AE) 437 200 words (e.g. people 94th, years 120th, etc.). It is also evident that the types of nouns are completely different. 
Also, a quick check on the lists reveals that many of the most frequent words from BNC, belong to more than one grammatical category. This is not true with the most frequent words in the corpus of abstracts, especially the ones which are nouns. This may indicate that the abstracts focus more on abstract states, objects and processes. A similar result was obtained by (Biber, 1998) studying nominalization in scientific texts. 5.2 N-gram lists N-grams are groups of consecutive N words in the corpus. Punctuation marks were not considered as being part of an n-gram, therefore all of those containing punctuation marks were removed. When sorted by their frequency, n-grams uncover frequent patterns in a corpus. N-grams (with N from 2 to 9) have been generated. Initially the idea was to compare them with the ones produced from BNC, but it became apparent that there is not much of a link between the two, except for the very frequent patterns of the, in the, which are not very useful. However, this is not surprising, given that the n-grams are an indicator of a document's contents. Figure 2 shows the first 29 entries of 2-grams, 3-grams and 4-grams lists. The lists with a higher number of words had much lower frequencies and for space reasons they are not displayed here. 2-grams 1276 of the 640 in the 360 this paper 320 on the 319 of a 311 and the 306 to the 273 have be 258 in this 256 for the 250 can be 242 base on 215 in a 204 it be 201 be a 198 be use 196 of this 192 with the 174 to be 166 that the 163 show that 144 be the 143 use to 142 the system 141 number of 139 by the 138 as a 123 artificial intelligence 121 such as 3-grams 143 in this paper 115 be use to 72 the use of 61 base on the 58 be base on 53 a set of 50 show that the 48 we show that 47 the problem of 47 the development of 46 the number of 44 this paper present 43 one of the 43 be apply to 42 we present a 42 this paper we 41 a number of 39 this paper describe 39 can be use 37 a variety of 35 in term of 35 be able to 34 of the system 33 the performance of 33 base on a 32 we propose a 31 with respect to 31 the result of 28 some of the 4-grams 41 in this paper we 26 can be use to 20 this paper present a 20 in the context of 17 the world wide web 17 it be show that 17 be one of the 16 the size of the 16 a wide range of 15 one of the much 15 be base on a 14 this paper we present 14 on the other hand 14 in the form of 14 be base on the 13 this paper describe the 13 of this paper be 13 in the field of 12 the performance of the 12 on the basis of 12 in the size of 11 with respect to the 11 this paper describe a 11 this paper be to 11 the use of a 11 can be apply to 10 the development of a 10 of a set of 10 in the presence of Figure 2: The lists of 2,3,4-grams from the lemmatised version of the corpus It seems that the lists are not seriously influenced by the type of abstract. A comparison of the lists of n-grams produced from abstracts published in journals and the ones from proceedings did not reveal many differences. When the 3-grams are considered, in both cases the first element on the list is in this paper, followed by be use to. However, the third element from the first list, we show that, appears in the 67th position in the second list. In the list of 2-grams, show that, appears in the 14th position in the first category and only 48th on the second category. Given that the size of the two subcorpora is almost the same, such a result is unexpected. 
However, it could be explained by the fact that in many cases, the conference papers present work in progress and, therefore the conclusion is not necessarily the strongest point of a paper and therefore no reference is made to the conclusion in the abstract. As is argued in section 6, there are cases when the abstracts do not have an evaluation section, but this happens less frequently with the abstracts belonging to journal papers. The n-gram lists also uncover terms for specific domains (e.g. information retrieval, neural networks, world wide web, etc.). Although this is a possible use of them, this paper does not intend to investigate this aspect. 438 5.3 The case of the noun paper The analysis of word lists and n-gram lists represent a very easy and powerful way to find patterns in texts. For example, if the word paper is taken, it is appears 499 times, in 473 abstracts, which means that more than half of the abstracts use it. In one abstract, it is used four times, its authors introducing each move using the following constructions: this paper investigates, this paper introduces, this paper describes, this paper ends. There is another abstract in which the word paper is used 3 times. In 24 abstracts it is used twice, although in the rest of the abstracts it appears only once. In addition to this the word study is used as a noun 170 times, research 154 times and work 111 times. Even though the nouns study, research and work are not always synonymous with the word paper, these three words together with paper strongly indicate that most abstracts make a reference to the paper from which they are derived using constructions like: in this paper, in this study etc. The n-gram lists strengthen this aforementioned conclusion. The word paper usually appears in constructions like this paper (360 times) or the paper (115). Constructions such as this study (18), this research (14), this work (25), this article (27) are also found. In many cases, the word paper is used as the subject of verbs like: present (62 times), describe (50), be (45), introduce (15). Clearer patterns appear when more words are considered: in this paper (143 times), this paper presents (44), this paper we (42), this paper describes (39). By increasing the number of contextual words, the patterns become less frequent: in this paper we appears 41 times, this paper presents a (20 times), this paper we present (14 times), this paper describes the (13 times). Even when 5 words are considered, the first element on the list is in this paper we present (14 times). As a result of this analysis, it is reasonable to conclude that the patterns found with regards to the word paper, are not accidental and cannot merely be explained by a high number of occurrences of the word paper. Instead, they represent patterns specific for abstracts. If the most frequent noun (i.e. system) is considered, such patterns do not appear. 6. Grammatical level Analyses of the word frequency lists and n-gram lists proved to be a very useful way of discovering patterns. However, they can only reveal patterns between words which are adjacent. As a result of this, it is possible that many patterns were missed due to some modifiers or adverbs. In this section, grammatical structure is used for uncovering patterns in the abstracts. All the abstracts were tagged using the FDG tagger. This tagger, in addition to assigning part-of-speech tags to each word, also provides partial dependency relations between words. 
These dependency relations were then used for finding common noun-verb pairs. Consequently, two lists were generated. The first list represents subject-predicate pairs and the second one contains pairs of nouns, which are not the subject of the sentence (e.g. objects), and verbs. It should be pointed out that the process of generating these lists was completely automatic; therefore they contain some errors. However, the number of these errors is relatively low, and they do not influence the validity of the results. Figure 3 presents the first 20 entries from the two lists. 120 it be 106 we present 88 we show 86 there be 84 that be 66 we propose 63 paper present 58 which be 56 we describe 50 paper describe 35 we discuss 31 result show 30 it show 28 we introduce 26 they be 25 approach be 24 we use 24 we develop 24 this be 65 be system 53 be problem 38 be information 37 present paper 37 be data 36 present system 36 be knowledge 33 present approach 29 be it 28 be model 27 be agent 25 be research 25 be first 24 play role 24 be number 24 be method 23 be science 23 be process 23 solve problem Figure 3: Pairs of subject-predicate and verb-noun As expected, the patterns found using n-grams also appear in these lists, but with an increased frequency. This is normal given the fact that intervening adverbs do not affect the patterns. For example, the pair we-present is found 106 times, an increase of 5 from the list of 2-grams. This is because of groups like we also present. The increase is greater in the case of we-show, from 71 to 88. 439 In the list of subject-predicate pairs some interesting pairs can be noticed. The first pair in the list is it-be. After manually checking all the appearances in the corpus, it was noticed that in only 8 cases it was used as an anaphoric pronoun. This finding is not surprising given that it has been shown by previous research that the pronoun it is frequently used in the scientific domain as non-anaphoric. In addition to the 112 pairs of it-be, where it is used non-anaphorically, there are 86 appearances of existential there as subject for the verb be. All these cases suggest that existential sentences are frequently used in abstracts. The subject we appears with a closed set of verbs (e.g. present, show, describe, discuss, etc.). This set includes verbal processes, in Halliday's terms (Halliday 1994), and presentational processes; they can be used to determine the different types of moves (as is shown in the next section). The subject is usually the one which realises an action. Eight out of the first 20 pairs contain the subject we in them, which suggests that the author is present in the abstract as the one who presents, shows, etc. The large number of be predicates in the second list (14 times in the first 20 most frequent pairs) reiterates the fact that existential sentences are quite frequent. The pair present-paper appears usually because passive voice is used (e.g. …is presented in this paper), and therefore it is an instance of the subject-predicate pair paper-present. The other pairs suggest that systems and approaches are presented and problems are solved. 7. The structure of abstracts The most distinctive feature of abstracts is their rhetorical structure. Gopnik (1972) has identified three basic types of scientific paper: the ‘controlled experiment', the ‘hypothesis testing’ and the ‘technique description'. 
Each type has its own structure, but according to Hutchins (1977) they can be reduced, either by degradation or by amelioration, to a problem-solution structure. However, this structure is too general for the purposes of this paper. A more detailed organisation can be identified in the scientific papers: background information about the domain, the problem, the solution to the problem, evaluation of the solution and conclusion. Sometimes, they are referred to as moves. Graetz (1985) and Swales (1990) claim that an abstract should have the following structure: problem-solutionresults- conclusion. However, Salanger-Meyer (1990a) analysed a corpus of 77 abstracts from the medical domain for this structure and found that only 52% of the abstracts followed the structure. In each abstract, the moves have been manually identified. Given the large number of abstracts in the corpus and the difficulty of annotation, only 67 abstracts have been selected and annotated. Therefore the results presented in this section are just preliminary, in the future a semiautomatic procedure will be used. Out of the 67 annotated abstracts, 35 were published in journals and 32 in conference proceedings. The patterns found in each move are similar to the ones reported in Salanger- Meyer (1990b) for abstracts from the medical domain. During the annotation, five moves have been considered: Introduction, Problem, Solution, Evaluation, Conclusion. Ideally, an abstract should contain all five moves, but an abstract with only Problem, Solution, and Evaluation moves, has been considered to be perfectly acceptable and correct. It should be pointed out that the annotation process is difficult and in many cases highly subjective. At present, the corpus has been annotated by only one person, the author, but in the future at least some of the abstracts have to be annotated by another person in order to compute the reliability of the annotation using interannotator agreement measures. Out of 67 abstracts, it was found that only 39 of them (58%) could be considered to follow the expected structure. In the rest of the cases, either an important move was missing or the moves were in logical order. In quite a few cases, it was noticed that the evaluation was given before the method was presented. The abstracts of journal papers proved to be slightly better than the ones of conference papers in terms of organisation (21 abstracts from journal papers and 18 abstracts from conference papers), but more data have to be investigated. 7.1 The Introduction section The introduction section is meant to provide the reader with some background information, to explain what has been done in the field, etc. Given the constraints on the size of an abstract, this move is not compulsory. Moreover, abstracts are written for relatively informed readers and therefore they should not provide too many background details. However, by analysing the abstracts in the corpus, the author has noticed that there are cases when the introduction is quite long, in some cases almost half of the abstract. Usually this section makes references to previous work using expressions like existing approaches, prior work, previous work. The sentences in this section contain general truths (e.g. Storing and accessing texts … has a number of advantages over traditional document retrieval methods or Discourse analysis plays an important role in natural language understanding) usually expressed through the present simple tense. 
In addition to the present simple tense, the present perfect is used quite often for stating generic truths, but is usually used for emphasising the weaknesses of the 440 previous work (e.g. It is usually expected … but … or The standard approach … has been to … Such an approach becomes problematic). The introduction also gives hints about the problem which is going to be solved, highlighting in an appropriate way the weaknesses of previous approaches. In some cases it was quite difficult to make a clear distinction between the Introduction and the Problem. 7.2. The Problem section The introduction section, prepares the reader for the problem section. In this section, the problem with which the article deals, is expressed. There are cases when the problem is not clearly stated, but the reader can usually infer it from the introduction and the solution sections. Even though the reader can guess the problem, it is not desirable to have an abstract which does not state the problem explicitly. Given the fact that there are often some very frequent patterns, this section can be identified relatively easily. Usually it is explicitly signalled through phrases like: we describe, we present, a formalism is presented, we outline, etc. The preferred tense for stating the problem is the present simple. In some cases comparison with previous work is used for stating the problem: because of …existing methods can erase …we propose …. In these cases the Problem section also serves also as introduction. 7.3. The Method section In this section of the abstract, the author(s) should explain how the problem is resolved. This section is very important for the reader because it enables him/her to understand the kind of approach to solve the problem that was used. Some sentences from this section are marked overtly using phrases like: an alternative solution, the approach described here uses …. Some of the patterns used for stating the problem, also appear in the method section (e.g. This paper reports on a Japanese information extraction system that merges information using a pattern matcher and discourse processor) In this example, as well as in the other cases where a pattern which would be usually found in the Problem section appears, the phrase has a double role. On one hand, it reiterates the problem, or state the problem if it has not been stated already, and on the other hand, it explains the method. The verb tense seems to be the present tense, although Biber (1990) found that the past test is more frequently used in this move. Normally, this move should describe each step taken in the research for solving the problem, and therefore the past simple should be preferred. However, in the abstracts of the corpus used for this paper, the method is described in general terms, and not as a sequence of steps. 7.4 The Evaluation section Another important section is the one which summarises the evaluation. An abstract without an evaluation section is not considered to be correct. This is because, the evaluation proves the validity of the method proposed. If a researcher is trying to find a solutions for a problem similar to the one discussed in the abstract, he or she needs this section in order to judge the usefulness of the solution. If the evaluation suggests that the solution is appropriate for the given problem, one can read the whole article in order to obtain a full explanation of the method and a detailed evaluation. 
It has been noticed that whenever the verb show is used, it appears in the evaluation move in phrases like: we show, it is shown. In addition to the verb show, other phrases are also used for explicitly marking this move (e.g. this work provides, reveal, yield, investigate, find etc.) Besides these phrases, words which either refer to measures or ways of measuring are used (e.g. limited, compare, quantify, are tuned, shortcomings, this allows us to measure etc.) There are cases when the evaluation is not presented. Instead, the authors indicate that an evaluation was performed (e.g. we briefly describe some experiments, we present three case studies, etc.) On the basis of such an abstract, one cannot decide if the article is relevant or not, and therefore it is not really useful. The preferred verb tense is the present simple tense, but a large number of verbs in conditional tense have been noticed. This could indicate that the authors do not want to overestimate their results (e.g. can be adapted, can use, etc.) However, in addition to being part of the evaluation, these sentences can be also used as a conclusion. Connectors, which hardly appear in the other moves, are quite frequent in this one (e.g. however, in addition, etc.), being used for justifying the evaluation. 7.5 The Conclusion section Many research papers have a conclusion section in which the results of the method are placed in a broader context. However, it is not absolutely necessary to have this section separate in the abstracts, but its presence makes an abstract more valuable in a broader context. Therefore, an abstract that does not contain such a section is not considered to be incorrectly structured. In many cases the evaluation section also plays the role of conclusion. 441 In the majority of cases, this move contains an explicit reference to the abstract (e.g. this work provides, these observations suggest, etc.) In addition, phrases like this paper concludes, as a conclusion, or adverbs (e.g. therefore, as a result, etc.) are used to mark the beginning of the conclusion move. 8. Conclusion and future work In this paper it has been shown that regardless of the level of analysis, lexical, syntactic or discourse, it is possible to extract patterns from scientific abstracts. The simplest way of analysing, the lexical level, uncovered groups of words which usually go together. As an example the word paper has been analysed, noticing that it appears in frequent patterns. By using dependency relations between subject and predicate, pairs were generated. Many of these pairs were also found in the lists of 2-grams, but given that in this case the intervening adverbs do not matter anymore, the frequency of the pairs is higher, so giving more reliable figures. By analysing the discoursal structure of abstracts, it became evident that the scientific abstracts, written by the authors of the papers, do not necessarily follow the structure which the literature predicts. In each abstract, the moves were manually annotated and those groups of words which signal the type of a move were identified. Some of the words seem to indicate reliably a certain type of move (e.g. the verb show appears usually in the evaluation section). The question which arises at this point is how these patterns can be used in computational linguistics and especially for automatic summarisation. For the beginning it has been noticed that the word paper appears usually only once in the abstract, in constructions like in this paper we. 
These constructions are very similar to the findings reported in Paice (1981), where common patterns from full-length papers, called indicating phrases, were identified in scientific papers and used for producing a summary. Therefore, a sentence from a document which contains a pattern similar with one previously identified, is more likely to be important, and consequently worth including in the abstract. However the usefulness of each pattern has to be assessed in a corpus of scientific papers. The patterns, which have been identified, are not only frequent in the abstracts, but in many cases, they also indicate a certain move. This suggests that it could be possible to design an automatic procedure for identifying each move in the text. However, it has been shown that more than one pattern is used to introduce a move, therefore for each move it is possible to find more that one way of introducing it. This suggests that it is possible to find general templates for each move. Such an approach would not be new, Paice and Jones (1993) proposing a similar method for building abstracts. However, in their case the templates are very specific for a certain domain. Patterns in abstracts are useful not only for automatic abstracting. They are also useful for helping researchers to produce their own abstracts. Narita (2000) proposed a system for Japanese, which can be help writing abstracts by displaying sentences and collocations from a corpus of annotated abstracts. Acknowledgements The author of this paper would like to thank Ramesh Krishnamurthy, Dr. Andrew Caink and Richard Evans for their comments provided at different stages of this paper, and IEE and Journal of Artificial Intelligence Research for the permission to use their abstracts for this research References Biber D, Conrad S and Rippen R 1998 Corpus Linguistics: Investigating Language Structure and Use, in Cambridge Approaches to Linguistics Series, Cambridge University Press Burnard L 1995 Users Reference Guide: British National Corpus Version 1.0, Oxford University Computing Services, UK. Cleveland DB 1983 Introduction to Indexing and Abstracting, Libraries Unlimited Inc. DeJong G 1982 An overview of the FRUMP system. In W. G. Lehnert and M. H. Ringle (eds), Strategies for natural language processing. Hillsdale, NJ: Lawrence Erlbaum, pp. 149 – 176 Graetz N 1985 Teaching EFL students to extract structural information from abstracts. In Ulign J. M and Pugh A. K. (eds) Reading for Professional Purposes: Methods and Materials in Teaching Languages, Leuven: Acco, pp. 123 – 135 Halliday M.A.K and Martin JR 1993 Writing Science: Literacy and Discursive Power, The Falmer Press Hutchins WJ 1977 On the structure of scientific texts, UEA Papers in Linguistics, 5(3) pp. 18 – 39 Johnson F 1995 Automatic abstracting research, Library review 44(8) Kennedy G 1998 An Introduction to Corpus Linguistics, Longman Lancaster FW 1991 Indexing and abstracting in theory and practice, Library Association Publishing Ltd. 442 Luhn H P 1958 The automatic creation of literature abstracts. IBM Journal of research and development, 2(2): 159 – 165 Narita M 2000 Constructing a Tagged E-J Parallel Corpus for Assisting Japanese Software Engineers in Writing English Abstracts, in Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'2000), Athens, Greece Paice, CD 1981 The automatic generation of literature abstracts: an approach based on the identification of self-indicating phrases. In Oddy, R. N., Rijsbergen, C. J. 
and Williams, P.W. (eds.) Information Retrieval Research, London: Butterworths, pp. 172 – 191 Paice, CD and Jones PA 1993 The identification of important concepts in highly structured technical papers. In Proceedings of ACM-SIGIR'93, pp. 123 – 135 Salanger-Meyer F 1990a Discoursal flaws in Medical English abstracts: A genre analysis per research- and text-type, Text, 10(4), pp. 365 – 384 Salanger-Meyer F 1990b Discoursal movements in medical English abstracts and their linguistic exponents: a genre analysis study, INTERFACE: Journal of Applied Linguistics 4(2) pp. 107 – 124 Sinclair JM 2001 Preface. In Ghadessy, M., Henry, A. and Roseberry, R. L. (eds) Small Corpus Studies and ELT: Theory and Practice, John Benjamins Swales JM 1990 Genre Analysis: English in academic and research settings, Cambridge University Press Tapanaine P and Jarvinen P 1997 A Non-Projective Dependency Parser. In Proceedings of the 5th Conference of Applied Natural Language, pp. 64 – 71 443 An attempt to develop a lemmatiser for the Historical Corpus of Hungarian Gabriella Kiss and Júlia Pajzs Department of Lexicography and Lexicology Research Institute for Linguistics, Budapest {gkiss, pajzs}@nytud.hu For the project of the Historical Dictionary of Hungarian a carefully selected representative corpus was collected (24.5 million running words). The texts were chosen from three centuries. A morphological analyser programme was successfully run on the modern texts, but the analysis of the earlier texts was problematic. In our paper we will describe a method for the conversion and analysis of archaic texts, without losing the original word forms. The Historical Dictionary project itself, and the analyser developed for the modern text, will also be reported briefly. 1. The Historical Dictionary project and its corpus The project first started in the late 19th century with the collection of old fashioned dictionary slips. The idea was to compile an OED-like dictionary which covers the vocabulary of the Hungarian language from 1772 up to the time of collection. (There exists an historical dictionary for the former period : Czuczor G & Fogarasi J 1862). The collection of the slips continued until about 1960. From time to time there was an attempt to compile some draft entries based on this collection, but somehow or other these experiments always happened to fail. Possibly these repeated failures were partly due to the lack of an adequate personality in charge of a project at this scale (Hutás, 1974). Without a dedicated and convinced chief editor not even a one volume dictionary can ever be completed, not to mention an OED-like several volume reliable historical dictionary. During the years 1950-1960 a fairly valuable seven-volume monolingual dictionary was prepared using the slips collected for the historical dictionary (Bárczi & Országh 1959-1962). It contains some illustrative quotations, but not necessarily to each meaning. Bibliographic information is not supplied with the quotations (only the author's name in abbreviated form is given in some cases). It was not published as a historical dictionary, but as an explanatory one, and it still continues to be the largest existing Hungarian explanatory dictionary (Országh 1960). In the late 60's a one-volume abridged version was produced on its basis. It is still in print, and is now being revised to be published in 2002. This revised version is also based on newly but traditionally collected data, not on a corpus. 
In 1985 the Hungarian Academy of Sciences decided to start the historical dictionary project all over again, based on a computerised corpus to be compiled first. The collection method seems rather naive these days: it was a combination of old fashioned collection and corpus building. At that time nobody in Hungary had any experience about efficient corpus collection methods, or about realistic expectations from a corpus of a given size or type. At the outset we planned to collect a corpus of 10 million running words. To establish the source material for a sound historical dictionary, several small excerpts were carefully chosen by the literary historians of each period. Since the short sample texts usually contained only some book pages we found that optical character recognition was not really efficient for this task. Therefore the selected text parts were keyboarded manually, which, of course, took a very long time (keyboarding was finished at the very end of 2000), and, despite repeated controlling, many keyboarding and other kinds of errors still exist. Currently the corpus consists of 24.5 million running words, with a majority from the 20th century (16 million words), 6.8 million words from the 19th century and 1.7 million from the late 18th century. The size of the samples is varied: since every poem and each part of a book is considered a separate sample, the number of different samples is quite large: over 21,000. The average sample length is around 1200, but there are also extremely short samples (e.g. a very short poem of two words), and there are surprisingly large ones: a text of 34,812 words is the current maximum. Although the literary experts tried to be as objective as possible and the selection was reviewed by different experts, the extremely overrepresented works obviously happened to be their favourite ones, and some very well known authors and/or works were simply left out for some reason. Most of the texts are different kinds of prosaic texts (prosaic fiction: 31%, other kinds of prose: 51%, poetry: 8,5%, drama: 5,7%). In order to supply exact philological data for the planned historical dictionary, bibliographic data and the estimated (or sometimes exact) year of production were recorded for each 444 sample text. This information along with the data of keyboarding, updating etc. are stored in the header of each sample file in SGML format. When we started keyboarding SGML did not exist, so for a long time special codes were used to mark paragraph ends, stanzas, quotations, etc. As soon as we learnt about SGML (in 1987) we started to convert the earlier material into this format, and after a while keyboarding contitued in it, there are, however, still too many formal errors in the texts keyboarded previously. (This is a very serious point to consider for projects starting these days: one should try to rely on XML and TEI or other standard recommendations as strictly as possible, and should not think that it is always possible to correct the errors easily by a good conversion programme.) The first draft dictionary entries based on the corpus were prepared seven or eight years ago, when the corpus was only about half as large as it is today. When we first realised that we can not make reliable historical dictionary entries based solely on the corpus, we hoped to be able to fill in the obvious gaps either by the traditional slips or by the enlargement of the corpus. 
Recently, we had to cope with the fact that, although it is theoretically possible to compile a more or less historical-like dictionary from the available sources, this would require a despairingly long time and/or several times larger staff and facilities than we can hope for. (Just to give you a hint of our possibilities: right now nine full time compilers are working on the project, but only three of them have some experience in writing a dictionary, and none of them have a formal education in lexicography. And this is about the largest team we can expect for the future, as well.) Faced with this situation, we are about to redesign the whole project. At this turning point there are three alternative plans: one is to further improve the historical corpus, and forget about dictionary writing, at least for the near future. The other is a less maximalistic historical dictionary than the one planned originally, but it is sill a very traditional historical approach, which wishes to build a dictionary that includes only the chronologically first quotation to each known sense of the words (either from the slips or the corpus), chiefly due to space considerations (this plan imagines an eight-volume dictionary). Our personal view, which is shared by most in our team, is to produce a good, up-to-date one volume dictionary, at least as a first step, with an electronic version including many additional facilities. This, in many respects, would rather resemble the modern corpus-based monolingual English dictionaries (COBUILD, CIDE, LDOCE, or even the new OALD) more than the OED. We would like to make clear-cut entries with easily understandable definitions, synonyms, antonyms, hyponyms, etc. The main specificity of this dictionary compared with the above mentioned modern English ones would be the illustrative quotations not only from the current part of the corpus, but from the earlier periods, as well. In the electronic version the exact philological reference of the citations would be supplied, as it is done in traditional historical dictionaries. The electronic version could also contain many more examples from the corpus, plus several additional properties — like the generation of the inflected forms of the words (which is a much more difficult task and therefore even more necessary for Hungarian than for English). We believe that with a brand new and up-to-date concept a really good modern dictionary could be compiled, combining the advantages of modern corpus based dictionaries and traditional historical ones. We are convinced that it could be produced in a realistic period of time even with a relatively small team of compilers, and this product would find its market. 2. The morphological analysis of the corpus 2.1. The analyser programme The HUMOR analyser programme was developed by the MorphoLogic Ltd in co-operation with our department in the late 80's (Pajzs 1991, Prószéky 1996, Prószéky & Kis 1999). The first version was based on the entries of the above mentioned seven-volume dictionary, which were supplied with morphological codes. Encoding was further improved during the development of the programme. Now it contains a complex classification including information about the types of suffixes that can follow the given entry and the matching suffix variant. Hungarian morphology is extremely complex: it is mainly agglutinative, several suffixes may follow each other, at least in theory. 
Based on the large analysed corpus we have evidence that the highly complex combinations practically never occur (Pajzs & Papp 1998). The combination of two suffixes is quite frequent, but due to vowel harmony, if the first suffix is a back variant, the next suffix must also be back, therefore the actual number of real suffix combinations is not that large. The information on the possible forms of the suffixes was stored in the databases used by the analyser. The databases of the entries and the suffixes are disjunct, and the programme only checks whether the elements actually found in the text can be matched according to the information given in the databases. It uses the unification method for choosing the correct 445 solution(s). When the analyser finds a possible match, it can also identify the root of the word, even when the actual form of the root is different from the entry (e.g. the original root ló ’horse’ becomes lov in front of the suffixes, and its analysed version contains both the original lemma in front of the „=” sign and the actual root after it: ló[FN]=lov+aink[PSt1i]+nak[DAT] ‘for our horses'). This analyser programme was applied on the corpus in several steps. First it was only tested on the contemporary part of the corpus, later on the 19th century part, but this test already raised numerous problems. Hungarian orthography was standardised only in the late 1930's, therefore the earlier texts contain many alternative orthographic possibilities. Since the analyser is also used as the engine of a Hungarian spell-checker, the old (currently unacceptable) alternatives could not be included in its databases. Vowels in some words which now should be spelt with long accents used to be spelt either with short or with long accents (even within the same text). Several compounds which are now written in one word used to be written in separate words or with a hyphen. So when we applied the analyser on the texts from the 19th century, only 90 per cent of the words were recognised by the programme, while in the 20th century part 95 per cent were recognised. Naturally, the recognition of the analyser is not always correct, sometimes the words are analysed as non existing compounds or as strange derivates. When the programme finds several possible analyses it outputs each variant. For the disambiguation of these alternatives a local rule-based programme was tested and run on the whole 19th-20th century corpus (Pajzs 1997, Pajzs 1998). Although the result of this attempt was far from the expected correctness rate, because the texts were very varied, its outcome was a usable lemma-oriented corpus, where one could directly search the entries without enumerating every possible form of the word or having too much surplus data. (So for example if you try to search the word ad ’give’ in the non-analysed corpus, you will get every word which happens to start with the same character string, while if you search it in the analysed version, you can explicitly define that you only wish to search the verb ad.) When using the analysed corpus (of 17 million running words at that time) we realised some of its drawbacks. In the meantime the keyboarding of the late 18th century texts also started which meant a new range of problems. The orthography of that time hardly resembles that of modern texts. Some typical problems: · In the earlier prints there were several characters which are not used any more. 
For example, instead of the short and long vowels ö,o or ü,u, standard nowadays, there used to be several different accents in between, which represent either short or long vowels. As the historical linguists considered these specialities important, we were bound to keep this information somehow when keyboarding the texts. As these characters were used sometimes instead of the short, sometimes instead of the long variant, they could not be converted directly to their current form.
· There were several suffixes, old root forms and words that are not used anymore.
· Some of the still existing words used to be spelt in completely different ways: words which are now written in two words sometimes (not always!) used to be written in one word, and others the other way round.
· Since there was no standardised orthography at all, the very same consonant phonemes were sometimes spelt differently even in various occurrences of the same word within the same text, e.g. lly or lyly or jj or lj for [j]. The s letter was often represented by one of two archaic variant letter forms (keyboarded as s41 and s43, see section 2.2.1), and the letter z also had an archaic version. To make things even more complicated, the phoneme now spelt as zs was spelt as a single s or one of its old forms, or with the combination of old s with old z, or with any other combination of these.
All this information was kept during keyboarding. The special characters were represented by a combination of letters and digits. This way we could keep all the required information in an easily convertible format (so the representation of the characters did not depend on any operating system or the facilities of any given word processor). Not only the archaic characters, but the current accented characters are also represented in this way, chiefly for the sake of portability. (The letters of the English alphabet and the digits are always the same in every code table.) When, however, we intend to retrieve the old words together with the new corpus, we must either leave the problem of searching the different possible forms to the lexicographers, who should be able to retrieve the concordances as fast as possible, or find a way to standardise the old words while keeping the original old form. For this aim we designed a special format:
Keyboarded form:
Honnat-is nem kis bos43zs43zusa1g-te1tellel bu20no20s43u20lnek azok, kik tellyes43se1ggel
Current ASCII form:
Honnan is nem kis bosszúságtétellel bunösülnek azok, kik teljességgel
'where from those persons will be punished by a not small annoyance, who are completely...'
Lemmatized form:
<w><o>Honnat</o> <t>honnan</t> <a>honnan[HA]</a></w>
-
<w> <t>is</t> <a>is[KOT]</a> </w>
<w> <t>nem</t> <a>nem[MOD]</a> </w>
<w> <t>kis</t> <a>kis[MN]</a></w>
<w><o>bos43zs43zusa1g</o> <t>bosszu1sa1g</t> <a>bosszu1sa1g[FN]</a></w>
-
<w> <t>te1tellel</t> <a>te1tel[FN][INS]</a></w>
<w><o>bu20no20s43u20lnek</o> <t>bu3no2su2lnek</t> <a>bu3no2su2l[IGE][t3]</a></w>
<w> <t>azok</t> <a>az[NM][PL]</a></w>
,
<w> <t>kik</t> <a>ki[NM][PL]</a></w>
<w><o>tellyes43se1ggel</o> <t>teljesse1ggel</t> <a>teljesse1g[FN][INS]</a></w>
In the field tagged by <o> the original, old version is kept if it is different from the current spelling. In the field tagged by <t> either the converted version of the original old form or, if it does not differ from the current norms, the token as it was found in the text is kept. If there is more than one possible and analysable conversion, each one is kept, separated by a '|'.
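The keyboarded form above illustrates the letter-and-digit encoding of special characters. As a rough sketch of how such codes might be expanded back into characters, the following fragment decodes a keyboarded token; the code inventory and the mapping of the archaic two-digit codes are only partial guesses based on the examples quoted in this paper, not the project's actual conversion tables.

```python
import re

# Codes for current accented letters, inferred from the examples in the text
# (a1 = á, e1 = é, o2 = ö, u3 = ű, ...); the full inventory is an assumption.
MODERN = {'a1': 'á', 'e1': 'é', 'i1': 'í', 'o1': 'ó', 'u1': 'ú',
          'o2': 'ö', 'u2': 'ü', 'o3': 'ő', 'u3': 'ű'}

# Archaic characters keyboarded with two digits; each may stand for a short
# or a long vowel, so several candidates are kept (a simplification).
ARCHAIC = {'s43': ['s'], 's41': ['s'],
           'o20': ['ö', 'ő'], 'o23': ['o', 'ó'], 'o24': ['ö', 'ő'],
           'u20': ['ü', 'ű'], 'u23': ['u', 'ú'], 'u24': ['ü', 'ű']}

CODE = re.compile(r'[a-z]\d\d|[a-z]\d')   # try the two-digit codes first

def decode(keyboarded):
    """Expand a keyboarded token into its possible character-level readings."""
    forms = ['']
    pos = 0
    while pos < len(keyboarded):
        m = CODE.match(keyboarded, pos)
        if m and m.group() in ARCHAIC:
            options, pos = ARCHAIC[m.group()], m.end()
        elif m and m.group() in MODERN:
            options, pos = [MODERN[m.group()]], m.end()
        else:
            options, pos = [keyboarded[pos]], pos + 1
        forms = [f + o for f in forms for o in options]
    return forms

print(decode('s43zaba1su1'))        # ['szabású']
print(decode('bu20no20s43u20lnek'))  # eight candidates, e.g. 'bűnösülnek'
```

Tokens containing archaic codes yield several candidate readings, which is why the analyser has to try every combination, as described below; further orthographic rules (digraphs, vowel length) would still have to be applied to reach the current spelling.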
If the analyser is able to recognise the token, the analysed version is kept in the field marked by <a>. If it could not find the correct analysis a special tag NE ’not analysed’ is given. If there are more than one possible analyses each one is given. The analysed (or rather tagged) version consists of the recognised lemma, the part of speech code and the suffix codes. The superfield word is marked by the <w> tag. The main advantage of this format is that we can keep both the original archaic form as it occurs in the text, the normalised tokens and the lemmatised version. For the time being we do not intend to disambiguate the analysed version. The retrieval interface takes care of finding the lemma in the analysed field first. If the searched word is not present in this field, then the token field or even the original field can be searched. After the search the result is displayed either from the original field (when there is one) or from the token field if there was no original version. Of course this storing format is very redundant, it requires plenty of disk space, but nowadays hard disks are becoming less and less expensive. We have been using the Open Text SGML text retrieval software, but we are considering to switch to a more modern and efficient software tool. From this tagged version we can easily convert the text to any other format preferred by different tools. 2.2. The conversion of the archaic forms The regularly occurring orthographic and grammatical alterations were aimed to be converted at this phase. 2.2.1. Old characters and variant orthography The variant spellings of the letter s , were keyboarded as s41 and s43, respectively, and then those can be converted to s. There are several strange accented forms for the letter u and o, which were keyboarded as u20, o20, u23, o23, u24, o24 etc. and can be converted to ü ö, u and o, respectively, according to current orthography. If there are many of these old characters in one word, each of the possible combinations must be generated. The HUMOR programme tries to analyse each version and outputs every seemingly correct analysis to the analysed field, and from the converted tokens only those having a corresponding analysis will be kept. Other archaic character combinations: The phoneme currently represented by the letters cs used to be written by ts The phoneme currently represented by the letter c used to be written by tz or cz The long versions of the digraphs are now spelt by repeating only the first character of the digraph, earlier it was variable. (E.g. a suffixed form of asszony ’woman’ is spelt asszonnyal ’with a woman', while in the archaic texts it could either be aszszonynyal, asszonynyal or asszonnyal, and, of course, any of the s-es or z-s could have been old ones.) 447 2.2.2. Phonological rules regularly appearing in the texts Consonants often lengthened in intervocalic position. Those which are represented by digraphs had to be handled separately. old: segitto (keyboarded as: s43egitto24), new: segíto old: gyilkossa (keyboarded as: gyilkos34s43a), new: gyilkosa old: tallyigába new: talyigába old: fénnyiben, new: fényében Spelling according to pronunciation old: akarattyán, new: akaratján old: tilcsa or tiltsa, new: tiltja old: tanúji, new: tanúi old: eladgyák, new: eladják Vocals lengthened before l, n old: mozdúlásra new: mozdulásra old: múnkái, new: munkái old: óldva, new: oldva The phoneme now spelt zs sometimes used to be spelt s old: strásának, new: strázsának old: désa, new: dézsa 2.2.3. 
Morphological variants The third person plural possessive suffix was often spelt -jok, which is now -juk. old: búzájok new: búzájuk old: hazájok new: hazájuk The third person singular possessive suffix also had a variant form -ok, which is now -uk. The verbal causative derivational suffix -ít sometimes used to be written as -it. The use of long and short variants of the same vowels was much less regular than it is today. Some words had variant root forms, which are not used anymore. 2.3. The process of corpus analysis The programme which made the above described conversions and tried to analyse the converted tokens was only one module of the process. 2.3.1. The first module is a PERL programme which picked the running words from the corpus. Its output was ordered by the ‘sort’ and ‘uniq’ unix commands. The result is a file containing 1,467,230 different tokens. 2.3.2. The HUMOR analyser was run on this list. (The number of analysed words after this phase was 896,153). After the analysis the unanalysed words were separated from the output. The number of unanalysed tokens was roughly half a million at this phase. 2.3.3. On the unanalysed list a converter programme was run, which contains a series of PERL regular expressions based on the above scetched grammatical rules. It converts the possible variant forms of the non- recognised words, and then tries to analyse them with the help of the HUMOR programme. If it finds at least one analysable version, the corresponding token and the analysed version is outputted in the format described in 2.1. If there is no analysis even after the conversion, one possible converted form is still kept in the token field, and in the analysed field the code „NE” marks the missing analysis. 2.3.4. The output of 2.3.2 is merged with the output of 2.3.3. The result is put into an ACCESS database, which contains the original running word as it was found in the corpus in the first field and the analysed version in the format described in 2.1 in the second field. The database is indexed for the first field. 2.3.5. A programme runs on the whole corpus that reads and copies the header of the text files to the analysed text files, then reads the running words one by one, searches them in the first field of the database created in 2.3.4, and outputs the result from its second field. The rest of the information 448 found in the text (SGML tags, punctuation etc.) is copied to the analysed version without any change. In some sample texts there are notes from the original texts. They are kept separately at the end of each sample, and are copied into the end of the analysed file. 2.4. Evaluation of the lemmatisation method With the software package described in 2.3 the whole corpus has been analysed. Here are some examples for the problems of the conversion and identification from 18th-century samples. 2.4.1. Some examples of successful analysis after the conversion <w><o>kapkodgyon</o><t>kapkodjon</t><a>kapkod[IGE][Pe3]</a></w> The current form is kapkodjon ‘imperative of ‘snatch''. The written from, based on the pronounced assimilated word, contained the letters dgy instead of the current dj. Since it is a regular deviation, it was correctly converted and analysed. <w><o>eggy</o><t>egy</t><a>egy[DET]|egy[SZN][NOM]</a></w> The current form is egy ’one'. Again it is a regular deviation, so it was handled correctly. <w><o>melly</o><t>mely</t><a>mely[NM][NOM]</a></w> The current form is mely ’which'. Regular, correct. 
<w><o>alats43ony</o><t>alacsony</t><a>alacsony[MN][NOM]</a></w> <w><o>s43zaba1su1</o><t>szaba1su1</t><a>szaba1su1[MN][NOM]</a></w> <w><o>ts43ats43ogni</o><t>csacsogni</t><a>csacsog[IGE][INF]</a></w> <w><o>kits43al</o><t>kicsal</t><a>kicsal[IGE][e3]</a><t>kicsal</t><a>kicsal[IGE][e3]</a></w> <w><o>kapts43olatok</o><t>kapcsolatok</t><a>kapcsolat[FN][PL]</a><t>kapcsolatok</t><a>kapcsolat[FN][PL]</ a></w> <w><o>vis43zontag</o><t>viszontag</t><a>viszontag[HA]</a></w> In these words only the characters had to be converted (s43 to s, ts to cs). After the conversion the analysis was correct. <w><o>tis43zta1talansa1gokat</o><t>tiszta1talansa1gokat</t> <a>tiszta1talansa1g[FN][PL][ACC]|tisztatalansa1g[FN][PL][ACC]</a></ w> After the character conversion the analysis is only partially correct: the lemma tisztátalanság ’impurity’ is correctly identified, but the suffix -ok used here is an old form of the possessive suffix -uk, and not the current plural suffix -ok. The second analysis is a misinterpretation: the supposed lemma is tisztatalanság, which is a supposed derivation of the word tiszta+talan+ság ’clean'+privative+adjective-to-noun nominal suffix, but an actually non-existing word. The accusative is correctly identified at the end of the word. Although this analysis is not quite correct, the main point is to give the good lemma, and it is also given there. Usually the conversion was the most successful when only one or two old characters had to be converted and the root or the suffixes were same as the current form. 2.4.2. Some examples for unsuccessful analysis after the conversion . <w><o>o2s43zvekapts43olva</o><t>o2szvekaptsolva</t><a>NE</a></w> The current form of this word is: összekapcsolva ’connected'. In order to be able to recognise this, the variant lemma összve ’together’ must be added to the database of the entries. Then the conversion of s43 to s, and that of ts to cs would be sufficient for the recognition of this word. <w><o>eggyu2gyu2</o><t>egyju3gyu3</t><a>NE</a></w> The current from of this word is együgyu ‘foolish'. The attempt to convert the ggy to gyj was not successful, and also only one of the short ü-s should have been replaced by the long u. <w><o>s43zo2me1rmetes</o><t>szo2me1rmetes</t><a>NE</a></w> This is an archaic version of an obsolete word szemérmetes ’coy, prudent', nowadays only used ironically as a common saying originating from a well known mid-19th century ironic epic poem. <w><o>gyu24lekezo24tt</o><t>gyu2lekezo3t</t><a>gyu2lekezo3[FN][ACC]|gyu2lekezo3[MN][ACC]</a></w> The current word would either be gyülekezet ’assembly (noun)', or gyülekezett ’past tense of ‘gather’ (verb)’ or gyülekezot ’present participle of ‘gather’ (verb) with an accusative suffix'. In the current example the second case would have been the correct choice, but it was not among the converted forms, because the formerly frequently occurring -ött variant form of the -ett current suffix is not included in the suffix database. In the ancient texts many ö-s occurred where today e-s are written, because this used the be a frequent pronunciation and orthographic variant. Nowadays it is rather just a regional alternative, and appears in written form only occasionally. <w><o>o24rizko2dgy</o><t>o3rizko2dj</t><a>NE</a></w> 449 The current form is orizkedj ’the imperative of ‘refrain from''. The conversion was partially correct (dgy to dj), again the alternative -öd form of the -ed suffix should have been included in the suffix database. 
<w><t>mennyen</t><a>menny[FN][SUP]</a></w>
The current form is menjen, the imperative of 'go'. The given analysis mistakes it for the superessive of the noun menny 'heaven', and the word seems to have been analysed already in the first analyser phase (2.3.2), so it does not appear among the words to be converted.
<w><o>Tragyo24dia1nak</o><t>tragyo3dia1nak</t><a>NE</a></w>
The current form is tragédiának, the dative of 'tragedy'. The letter g is not often replaced by gy in the old texts, and é is also rarely replaced by ö/o, so there was no conversion rule for this word.
<w><o>Tana1ts43be1li</o><t>tana1csbe1li</t><a>tana1csbe1l[FN][IKEP][NOM]</a></w>
The current form is tanácsbeli 'belonging to the council'. Although the conversion was partially correct (s43 to s, ts to cs), the suffix -béli was not converted to -beli, and the converted token was mistakenly recognised by the analyser as tanács+bél+i, 'council'+'intestine'+noun-to-adjective derivative suffix.
2.4.3. An analysed example sentence
Old keyboarded form:
Hallom s43ivednek foha1s43zkoda1s43ait, e1rtem azon bu1tsu1t, melly ne1ked a Kira1lyne1to3l adatott, de erro2l ma1skor les43z s43zo1llanunk, s43okkal nyomos43s43abb gondok foglalnak el lelku2nket.
Current orthographic ASCII form:
Hallom szívednek fohászkodásait, értem azon búcsút, mely néked a Királynétól adatott, de errol máskor lesz szólanunk, sokkal nyomósabb gondok foglalnak el lelkünket.
A sentence with a similar meaning in current Hungarian:
Értem, mennyire fáj a szíved amiatt, hogy búcsut kell venned a királynétól, de errol majd máskor beszéljünk, most sokkal fontosabb gondjaink vannak.
'I sympathise with your feelings on saying farewell to the queen, but we have to talk about it later, because we have much more urgent problems to solve at the moment.'
+ <w> <t>hallom</t> <a>hallik[IGE][Te1]|hall[IGE][Te1]|hall[FN][PSe1]|hallom[FN]</a></w>
- <w><o>s43zivednek</o> <t>szivednek</t> <a>NE</a></w>
+ <w><o>foha1s43zkoda1s43ait</o><t>foha1szkoda1sait</t><a>foha1szkoda1s[FN][PSe3i][ACC]</a></w>
,
+ <w> <t>e1rtem</t> <a>e1n[NM][CAU][]|e1rt[IGE][Te1]|e1r[IGE][Me1]|e1r[IGE][TMe1]|e1rt[MN][PSe1]|e1rik[IGE][Me1]|e1rik[IGE][TMe1]</a></w>
+ <w> <t>azon</t> <a>azon[NM]|az[NM][SUP]</a></w>
+ <w><o>bu1tsu1t</o> <t>bu1csu1t</t> <a>bu1csu1[FN][ACC]</a></w>
,
+ <w><o>melly</o> <t>mely</t> <a>mely[NM]</a></w>
+ <w> <t>ne1ked</t> <a>te[NM][DAT]</a></w>
+ <w> <t>a</t> <a>a[DET]</a></w>
'
- <w><o>Kira1lyne1to2l</o> <t>kira1lne1to3l</t> <a>NE</a></w>
+ <w> <t>adatott</t><a>adatott[MN]|adat[IGE][Me3]|adatik[IGE][Me3]|ad[IGE][MUV][Me3]</a></w>
,
+ <w> <t>de</t> <a>de[HA]|de[KOT]</a></w>
+ <w><o>erro2l</o> <t>erro3l</t> <a>ez[NM][DEL]</a></w>
+ <w> <t>ma1skor</t> <a>ma1skor[HA]|ma1skor[FN]|ma1s[FN][TEM]|ma1s[NM][TEM]|ma1s[MN][TEM]</a></w>
+ <w><o>les43z</o> <t>lesz</t> <a>lesz[IGE][e3]</a></w>
- <w><o>s43zo1llanunk</o> <t>szollanunk</t> <a>NE</a></w>
,
+ <w><o>s43okkal</o> <t>sokkal</t> <a>sokk[FN][INS]|sok[SZN][INS]</a></w>
+ - <w><o>nyomos43s43abb</o> <t>nyomosabb</t> <a>nyomos[MN][FOK]</a></w>
+ <w> <t>gondok</t> <a>gond[FN][PL]</a></w>
+ <w> <t>foglalnak</t> <a>foglal[IGE][t3]</a></w>
- + <w> <t>el</t> <a>el[IK]</a></w>
+ <w> <t>lelku2nket</t> <a>le1lek[FN][PSt1][ACC]</a></w>
The sentence contained 22 running words. Those marked with a '+' sign at the beginning of the line are analysed correctly, or rather, at least one of the given analyses is correct. Those marked with '-' are not analysed correctly.
There is one with a '+-' sign: the analysis is nearly correct, but not completely (the current lemma would be nyomós 'strong, powerful', not nyomos, 'trail, trace'+noun-to-adjective derivative suffix, which is just a supposed but non-existing form). Those words which were recognised as archaic versions had a first field marked with the <o> </o> tags. There were 10 words like this in the above sentence; out of these, 6 were correctly analysed after the conversion. Altogether 18 words out of the 22 were analysed correctly in this example.
2.5 Statistical evaluation of the results
Cent. | Running words (A) | Analysed words after 2.3.2 (B) | Analysed words after 2.3.5 (C) | Archaic forms (D) | Analysed archaic forms (E) | Not analysed words (F) | Percentage of not analysed words
18th | 1,689,735 | 1,433,100 | 1,727,697 | 379,898 | 294,597 | 231,319 | 13.6%
19th | 6,839,688 | 6,666,168 | 6,873,935 | 339,326 | 207,767 | 684,328 | 10%
20th | 16,155,007 | 16,117,793 | 16,159,742 | 169,404 | 41,949 | 798,562 | 4.9%
Note that the number of analysed words is sometimes larger than that of the running words, because in many cases the words have more than one supposedly correct analysis. The ratio of unanalysable words in the texts from the 18th century was 22.48 per cent (1-[(A-D)/A]) before using the conversion programme. After the use of the above described conversion rules most of the formerly unrecognised words yielded an analysis (E/D = 0.775). Although there still remained a much larger number of unanalysed words than in either the 19th or the 20th century texts, it is clear that this algorithm had the most successful effect on the targeted part of the texts: while the ratio of recognised words was raised by 22.49 per cent (E/[A-D]) in the 18th century part of the corpus, in the 19th century part the improvement was 3.1 per cent, and in the 20th century part 0.2 per cent. It might be surprising that the above described rules could have been used in the 20th century texts at all, but remember that Hungarian orthography became standardised only in the 1930s. Naturally, in many cases the attempted conversions are misleading, so the resulting analysis sometimes has nothing to do with the correct recognition of the lemma. We have also examined the number of rules employed during the conversion: we gave an identification number to each group of rules, and this number was put into an additional field of the resulting database. Using this field we could investigate the effectiveness of the rules. In most archaic words only one conversion rule was employed (42.57 per cent), in 12 per cent of cases two rules were employed, and three rules were used in only 2.39 per cent. The largest number of different rules employed for the same word was 7, but this occurred for only 6 different running words (0.001 per cent). We are planning to study this database further.
3. Conclusion and further research
The attempt to make lemmata searchable in historical texts has proved to be promising. Our purpose at this stage was to handle the difficulties raised by a diachronic corpus in a relatively straightforward way. Instead of preparing a completely different analyser programme for the archaic texts we have been trying to find regular alternations and convert the archaic words to forms as similar to the recent ones as possible. Although plenty of problematic cases remain unsolved, we still believe that we are on the right track.
From the preliminary results described above it is possible to draw new conclusions: whenever we look at the current output, we can find some new regularities which can be added to the programme, thus improving the ratio of correct analysis. The often occurring irregularities can also be added to one of the databases used by HUMOR, called the inclusion database. This is a very simple database where you can add a running word in the first field and put its correct analysis in the next column (so for example the frequent archaic word forms like mennyen, 451 vala can be added here). HUMOR also uses a similar list for excluding some incorrect analyses, so whenever we find an error among the analyses given by the programme, we can just eliminate them by adding them to the exclusion database. The further development of the conversion rules combined with the careful use of HUMOR's inclusion/exclusion databases can make the lemmatisation process of the archaic texts much more accurate. References Czuczor G, Fogarasi J 1862 A magyar nyelv szótára I.-VI. ’The dictionary of Hungarian’ Pest, Emich Gusztáv Magyar Akadémiai Nyomdász Benko L (ed) 1991–1992 A magyar nyelv történeti nyelvtana I–II., ’The historical grammar of Hungarian’ Budapest, Akadémiai Kiadó. Bárczi G, Országh et all. (eds.) 1959-1962 A magyar nyelv értelmezo szótára I.-VII. ’The explanatory dictionary of Hungarian’ Budapest, Akadémiai Kiadó Elekfi L 1994 Magyar ragozási szótár – ’Dictionary of Hungarian Inflections', Budapest, Research Institute for Linguistics Kiefer F (ed) 2000 Strukturális magyar nyelvtan 3. Morfológia, ’A Structural Grammar of Hungarian 3. Morpology'. Akadémiai Kiadó, Budapest Olsson, Magnus 1992 Hungarian Phonology and Morphology, Lund, Lund University Press, Országh L 1960 Problems and Principles of the New Dictionary of the Hungarian Language. Acta Linguistica X/3-4. Budapest, Research Institute for Linguistics of the Hungarian Academy of Sciences, pp 211-273. Pajzs J 1991 The Use of a Lemmatized Corpus for Compiling the Dictionary of Hungarian Using Corpora Proceedings of the 7th Annual Conference of the OUP & Centre for the New OED and Text Research. Waterloo, University of Waterloo, pp 129-136. Pajzs J 1997 Synthesis of results about analysis of corpora in Hungarian. Linguistica Investigationes XXI-2 John Benjamins, Amsterdam pp 349-365 Pajzs J, Papp F 1998 Statistical Examination of the Hungarian Noun Paradigm Proceedings of ALLC/ACH Debrecen, Lajos Kossuth University pp 89-93. Papp I Leíró magyar hangtan, ’Hungarian descriptive phonology’ Budapest, Tankönyvkiadó, 1966. Prószéky G, Tihanyi L 1992 A Fast Morphological Analyser for Lemmatizing Corpora of Agglutinative Languages. Proceedings of COMPLEX '92. Budapest, Research Institute for Linguistics, pp 275-278. Prószéky G (1996). HUMOR - A Morphological System for Corpus Analysis. Proceedings of the first TELRI Seminar in Tihany. Budapest, Research Institute for Linguistics pp 149-158. Prószéky G, Kis B 1999 Agglutinative and Other (Highly) Inflectional Languages. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics College Park, Maryland, USA pp 261–268 R. Hutás M 1974 Az Akadémiai Nagyszótár történetének vázlata (1898-1952) ‘The brief history of the Unabridged Dictionary of Hungarian’ Nyelvtudományi Közlemények. LXXV. Budapest, Akadémiai Kiadó, pp 447-465. 452 Korean grammatical collocation of predicates and arguments Byungsun Park, Beom-mo Kang Korea University 1. 
Introduction
This paper investigates the grammatical collocation of predicates and arguments in Korean in an objective way, through statistical methods. We investigate the statistical meaning of the co-occurrence of two words and the semantic relationship between them.1 In the light of Korean grammatical structure, we study the collocation of the argument and the predicate using statistical measures. There is no clear-cut definition of collocation in Korean; on the basis of previous studies, however, we can say that it involves words with adjacency and high co-occurrence relations. Collocation has been an active field of study in English, and the results of these studies have been used to do part-of-speech tagging and to analyse the grammatical structure of large corpora. Collocation studies of western languages generally focus on words close to each other. In studying Korean, on the other hand, more meaningful results are produced by examining the co-occurrence of words in a grammatical relationship within a sentence, such as subject and object. We first investigate to what extent the correlation between the occurrences of two words is statistically meaningful, and then analyse the semantic relationship between them. The Korean language has many auxiliary words used for case marking. For example, the subjective case is marked by the auxiliary words '-i' and '-ka'. Korean has developed its case marking system in this way, and therefore Korean word order is relatively flexible. In this view, rather than the relationship between a keyword and adjacent words, the relation between keywords and words that are grammatically related to the keyword is more meaningful for collocation. This paper focuses especially on predicates as keywords. In this paper, we focus on a methodology for analysing data, so we introduce Korean grammatical collocation with just one word, 'boda' (see), as an example. The verb 'boda' was extracted from the Korea-1 corpus (H. Kim & B. Kang 1996), which was balanced on a 10 million word scale.
2. Analysing data by statistics
2.1 Data processing
First, the data in this paper were selected from the 100 most frequent verbs in the Korea-1 corpus. Then, concordances of these verbs were extracted with a concordance package, and the arguments of every sentence were marked by hand. An example is given below.
S seongyosa-neun amu maleobsi i O chobeol-eul [bo-goman] isseul ppun …
S missionary-sub without saying anything this O punishment-obj [see(stem)–only(ending)] stand just …
"The missionary, without saying anything, stood there, seeing this punishment…"
(S: subject marker, O: object marker)
In this way, the grammatical markers for every concordance sentence were marked, and the frequency of the marked word for each argument was extracted.
2.2 Statistical approach
Up to now, a large part of linguistic research has focused on determining whether the expressions in a language are appropriate or not through intuition. This type of research is usually based specifically on the intuition of native speakers, rather than on objective data. In our research, however, we are interested in finding out which collocations are meaningful, not through intuition but through actual language use. Since the amount of language data used in corpus linguistics is usually large, a statistical approach is essential.
1 This paper is part of the research project in which the authors collaborated with Jong-sun Hong and Ho-cheol Choe of Korea University.
Moreover, a statistical approach reveals important information for natural language processing and specific data for use in a general theoretical approach. When two words that have a grammatical relation to each other co-occur, and when the frequency of their co-occurrence is higher than we would usually expect, we can say that the two words form a collocation. From this point of view, statistical methods are used in this research. How exactly the statistical measures are used will be described in the following sections.
2.2.1 t-score
The t-score shows the difference between the expected frequency of co-occurrence in the population and the observed frequency in the sample. In this paper, the population is the Korea-1 corpus. The bigger the difference, the higher the degree of collocation. If a word occurs x times in a population of y running words, its probability of occurrence can be calculated as x/y. For example, if a word A occurs 5000 times in the 10 million word corpus, and the size of the observed text is 1000 words, we can expect A to occur 0.5 times in the observed text. If the observed frequency of A is larger than 0.5, it can be said that A is in collocation with the node. However, we consider not only the difference between observed frequency and expected frequency but also its statistical significance. Because of the nature of the t-score, the larger the observed frequency is compared to the expected frequency, the more exaggerated the t-score becomes. Consequently, the t-score does not simply express the difference between two words A and B: for example, if the t-score of A is 6.22 and the t-score of B is 3.11, it does not mean that A's degree of collocation is twice that of B. In this view, statistical significance is important. The t-score formula is shown below:2

$t = \frac{O - E}{\sqrt{O}}$

where O = the observed frequency of occurrence of the word within the span and E = the expected frequency of occurrence of the same word. E is calculated as:

$E = f_{\text{population}} \times \frac{\text{span size}}{\text{population size}}$

where $f_{\text{population}}$ is the frequency of occurrence of the word in the population.
2.2.2 MI-score (Mutual Information score)
The MI-score expresses collocation differently from the t-score, which measures the collocation of a fixed node and a co-occurring word. The MI-score does not express the difference between observed and expected frequencies in terms of standard deviation, but represents the amount of information by which each of the two words, the node and its collocate, are related. So, the MI-score provides information about the words in relation to each other by comparing the observed probability of their co-occurrence with the expected probability, assuming that they are distributed randomly. In the MI-score, it is assumed that the amount of information that each of the two words contains is the same. However, in this research, the MI-score measure shows that Korean is very different from English. Collocation studies of western languages generally focus on words close to each other; we believe, on the other hand, that more meaningful results are produced by examining the co-occurrence of words in a grammatical relationship within a sentence. We therefore attempted to use the MI-score in Korean collocation research, but the result was not appropriate for it. The MI-score formula is shown below:

$I = \log_2 \frac{O}{E}$

3. The collocation of the main argument
In this paper, the collocation is calculated with the statistical measures stated above.
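As a rough sketch of how these two measures could be computed from raw frequency counts (the figures, function names and span handling here are illustrative assumptions of ours, not details taken from the paper):

```python
import math

def expected_freq(collocate_freq, population_size, span_size):
    # E = frequency in the population * (span size / population size)
    return collocate_freq * span_size / population_size

def t_score(observed, expected):
    # t = (O - E) / sqrt(O): difference between observed and expected
    # co-occurrence frequency, scaled to reflect statistical significance
    return (observed - expected) / math.sqrt(observed)

def mi_score(observed, expected):
    # I = log2(O / E): observed vs. expected probability of co-occurrence
    return math.log2(observed / expected)

# Hypothetical figures: the candidate collocate occurs 5000 times in a
# 10-million-word corpus, the span around the node verb totals 1000 words,
# and the collocate is observed 4 times within that span.
E = expected_freq(5000, 10_000_000, 1000)   # 0.5
print(t_score(4, E))    # ~1.75
print(mi_score(4, E))   # 3.0
```

With these hypothetical figures the t-score falls just below the 1.96 significance level assumed in this paper, while the MI-score is comparatively high despite the low observed frequency, which mirrors the contrast between the two measures discussed below.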
The population is 10 million words of the Korea-1 corpus. The sample, the span, is calculated based on the fact that one sentence in Korean usually consists of 12 words on average (B. Kang 1999, B. Park 1997), so the span is 12 * the frequency of concordance. Even if the score is a negative number, it is also meaningful, because negative numbers mean that the two words usually do not co-occur. Especially when a negative number also has statistical significance, it gives a meaningful result.
2 This formula and the explanation were described in Barnbrook (1996).
3.1 The collocation of 'boda (see)' arguments
The verb 'boda' needs a subject and an object as arguments. The meanings of the verb 'boda' are 'to see' and 'to induce or deduce'. The verb 'boda' which means 'to see' only needs a subject and an object, but when it means 'to induce or deduce', it needs a subject (nominative), an object, and also an adverbial expression. The results of the statistical measures are shown in the tables below. Statistical significance depends on the individual research propositions, so, in this paper, we assume the level of significance of the t-score and MI-score to be over 1.96.
3.1.1 The main arguments
Table 1: t-score of the verb 'boda'3
Argument mark | Word form | Frequency | t-score
O | moseub('shape')-eul(Obj) | 39 | 5.187
O | geol('thing') | 28 | 4.462
O | na('I')-leul(Obj) | 29 | 4.045
Oko | issda('be-past')-ko(Q) | 47 | 3.914
O | yeonghoa('film')-leul(Obj) | 20 | 3.838
O | eolgul('face')-eul(Obj) | 18 | 3.655
S | Taisu(P) | 13 | 3.602
O | nunchi('sense')-leul(Obj) | 12 | 3.288
S | na('I')-neun(Sbj) | 71 | 3.076
O | sigieoi('watch')-leul(Obj) | 8 | 2.702
S | jeo('Ip')-neun(Sbj) | 13 | 2.688
O | Taisu(P)-leul(Obj) | 7 | 2.644
O | ccog('side')-eul(Obj) | 8 | 2.570
O | bakk('outside')-eul(Obj) | 7 | 2.519
O | sonhai('loss')-leul(Obj) | 7 | 2.444
O | kkol('side')-eul(Obj) | 6 | 2.315
O | pihai('damage')-leul(Obj) | 8 | 2.194
S | jeoi(Ip)-ga(Sbj) | 9 | 2.049
O | byeol('star')-eul(Obj) | 5 | 2.037
Oko | eobda('nothing')-go(Q) | 14 | 2.036
S | nu('who')-ga(Sbj) | 12 | 2.032
O | Soohai(P) | 4 | 1.996
O | duismoseub('back')-eul(Obj) | 4 | 1.956
3 Suj: subject case word, Obj: object case word, P: proper noun, Q: quotation mark word.
As Table 1 shows, the object arguments mostly score over 1.96, which is statistically significant, and some subject arguments (nominatives) are also included in the table, but no other arguments appear. Some proper nouns in subject position are not meaningful, because their occurrence depends on the specific texts of the corpus; they therefore appear in the table without any collocational meaning.
Table 2: MI-score of the verb 'boda'4
Argument mark | Word form | Frequency | MI-score
O | Taisu(P)-leul(Obj) | 7 | 10.219
S | Taisu(P) | 13 | 10.113
S | Byeonghoa(P)-neun(T) | 2 | 9.412
S | Yeongjin(P) | 2 | 9.412
S | juinajeossi('owner')-neun(T) | 2 | 9.412
O | Suhai(P)-leul(Obj) | 4 | 8.827
Cr | silyeon('loss of love')-eulo(C) | 1 | 8.412
Cr2 | yongdo('usage')-lona(C) | 1 | 8.412
Cx | bangjeung('corroboration')-eulo(C) | 1 | 8.412
Cx | sayong('using')-ilagoman(C) | 1 | 8.412
Cx | junggandangieoi('middle step')-lo(C) | 1 | 8.412
4 C: adverbial case word.
The MI-score, unlike the t-score, shows that co-occurring words with low frequencies are usually above the statistically significant score, 1.96, so the words with a high MI-score are usually low-frequency items. This is very different from the high t-score word list. In English collocation research, by contrast, the high t-score words and the high MI-score words form similar lists. In this respect, collocation in Korean is very different from collocation in English. This paper presents only a sample of the statistically significant words.
We can conclude from this result that the degree of dependency between the two words is very different in Korean. The assumption behind the MI-score measure is that the dependency between the two words is the same; Korean, however, is somewhat different in this respect, so other statistical measures are needed for a mutual-dependency approach. Since the MI-score analysis is therefore not suitable, it will not be discussed further here.
3.1.2 The argument in the subject place
Table 3: the t-score of the subject place
Argument mark | Word form | Frequency | t-score
S | Taisu(P) | 13 | 3.602
S | na('I')-neun(Sbj) | 71 | 3.076
S | jeo(IP')-neun(Sbj) | 13 | 2.688
S | jeo(IP)-ga(Sbj) | 9 | 2.049
S | nu('who')-ga(Sbj) | 12 | 2.032
S | namdeul('others')-i(Sbj) | 4 | 1.818
S | sonyeon('boy')-eun(T) | 4 | 1.792
S | nai('I')-ga(Sbj) | 35 | 1.734
S | Byeonghoa(P)-neun(T) | 2 | 1.412
There are 5 words that have statistical significance in the subject place of 'boda (see)'. The word with the highest t-score is a proper noun but, as previously stated, this just depends on the specific texts of the corpus; due to this dependency on the text, it is not very meaningful from a semantic point of view. Apart from this, the word with the highest t-score is a first person pronoun. The words 'na (I)', 'jeo (I, modest expression)' and 'jieo (I, modest expression)' are Korean first person pronouns. From this perspective, the grammatical collocation of the verb 'boda (see)' is related to visual cognition and to induction or deduction. We can also conclude from this that the first person pronoun is the main subject of cognition and of induction or deduction. The interrogative word 'nuga (who)' is also related to a person, so we can draw the same conclusion from it.
Table 4: the t-score of the object place
Argument mark | Word form | Frequency | t-score
O | moseub('shape')-eul(Obj) | 39 | 5.187
O | geol('thing') | 28 | 4.462
O | na('I')-leul(Obj) | 29 | 4.045
Oko | issda('be-past')-go(Q) | 47 | 3.914
O | yeonghoa('film')-leul(Obj) | 20 | 3.838
O | eolgul('face')-eul(Obj) | 18 | 3.655
O | nunchi('sense')-leul(Obj) | 12 | 3.288
O | sigieoi('watch')-leul(Obj) | 8 | 2.702
O | Taisu(P)-leul(Obj) | 7 | 2.644
O | ccog('side')-eul(Obj) | 8 | 2.570
O | bakk('outside')-eul(Obj) | 7 | 2.519
O | sonhai('loss')-leul(Obj) | 7 | 2.444
O | kkol('side')-eul(Obj) | 6 | 2.315
O | pihai('damage')-leul(Obj) | 8 | 2.194
O | byeol('star')-eul(Obj) | 5 | 2.037
Oko | eobda('nothing')-go(Q) | 14 | 2.036
O | Suhai(P) | 4 | 1.996
O | duissmoseub('back')-eul(Obj) | 4 | 1.956
In this table, the mark 'Oko' is an object quotation clause; this is an argument of the verb 'boda (see)' in the sense of an opinion and an induction or deduction. There are 19 word forms with statistical significance in the object argument; of these, 2 appear in the 'Oko' place. The other word forms are usually related to the main meaning of 'boda', that is, visual cognition. The words 'nunchi (sign)', 'pihai (damage)' and 'habui (consent)' are object arguments of 'boda (see)', which means that the meaning of 'boda' is extended to the cognition of abstract things. In the dictionary, such expressions are recorded as idioms. This is an important piece of information for translation and for teaching Korean as a second language.
Table 5: the t-score of the other argument place
Argument mark | Word form | Frequency | t-score
C | nun('eye')-eulo(C) | 5 | 1.568
C | bigoanjeog('pessimistic')-eulo(C) | 2 | 1.379
C | eolgul('face')-lo(C) | 2 | 1.049
C | geungjeongjeog('optimistic')-eulo(C) | 2 | 1.047
C | silyeon('loss of love')-eulo(C) | 1 | 0.997
C | yongdo('usage')-lona(C) | 1 | 0.997
C | bangjeung('corroboration')-eulo(C) | 1 | 0.997
C | sayong('using')-ilagoman(C) | 1 | 0.997
C | junggandangieo('middle step')-lo(C) | 1 | 0.997
The other arguments do not appear with statistical significance. We can suppose that these other arguments support the meaning of the object; their meanings are usually an instrument of the verb 'boda (see)'.
4. The negative score
As previously described, if the t-score is negative but has statistical significance, we should still consider the word form statistically meaningful, because a negative t-score means that the two words have a tendency not to co-occur. There are 90 word forms that have statistical significance with negative scores. Even if we take a stricter significance threshold of 3.96, 71 word forms still meet this standard. So we can predict that the arguments of the verb 'boda' appear in various word forms. The table below shows a sample:
Table 6: the t-scores of the statistically significant negative word forms5
Argument mark | Word form | Frequency | t-score
Sp | geos('thing')-eun(T) | 1 | -86.080
Op | geos('thing')-i(C) | 1 | -81.200
O | uli('we') | 1 | -64.038
Op | geos('thing')-eun(T) | 2 | -60.160
Op | geos('thing')-eul(Obj) | 1 | -52.169
Oko | geos('thing')-elo(C) | 1 | -46.019
S | uri('we') | 2 | -44.575
Sp | na('I')-neun(T) | 1 | -44.084
O | geo('he')-neun(T) | 1 | -33.096
Sp | nai('I')-ga(Sbj) | 1 | -23.739
S | nai('I') | 1 | -23.032
O | geos('thing')-do(T) | 1 | -20.903
Op | geos('thing')-do(T) | 1 | -20.903
O | deung('etc')-eul(Obj) | 1 | -18.857
Op | Hangug('Korea') | 1 | -17.727
O | geos('thing') | 4 | -16.872
O | geogeos('that')-eun(T) | 2 | -16.586
Sp | geu('he')-nuen(T) | 4 | -15.048
5 Sp: a subject that follows the predicate, Op: an object that follows the predicate.
As this table shows, the frequencies of most word forms are low, but the absolute values of the t-scores are very high. In this table, however, it is very noticeable that the first person pronoun 'na (I)' has a negative t-score. This is contrary to the previous observation that a first person pronoun is the main subject of cognition and induction or deduction. The reason for this difference is the position in the sentence: the collocation depends on whether the word form appears in front of the predicate or not.
5. Conclusion
To sum up, this research focused on a grammatical collocation approach using the verb 'boda (see)' as an example. As a result, the verb 'boda (see)' was found to have many statistically significant collocations in the object argument. In the subject argument, there were 5 word forms that had statistical significance. These word forms are usually first person pronouns, and this is related to the meaning of the verb 'boda (see)'. This type of research is expected to be very useful for natural language processing and for teaching Korean as a second language.
References ((K) – Korean)
B. Kang. 1999 Frequencies and Language descriptions. Research for language information – 1. Seoul: Yonsei University, Center for Language and Information Development. (K) B. Kang 1999 The text genre and language character of Korean. Seoul, Korea University Press. (K) B. Park. 1997 The character of word use in Korean spoken language. MA thesis, Korea University (K) D. Lee. 1998 The semantic research of Korean collocation.
MA thesis, Korea University, Seoul. (K) G. Barnbrook. 1996 Language and Computers. Edinburgh University Press. H. Kim & B. Kang 1996 Korea-1 Corpus: the design and the organization. Korean Linguistics 3. Seoul, The Association for Korean Linguistics. (K) J. Hong et al. 1998 The dictionary of modern Korean verb structure. Seoul, Doosandonga Press. (K) J. Hong, B. Kang, & H. Choe 2000 The research of applied analyzing of Korean collocation information. Korean Linguistics 11. Seoul, The Association for Korean Linguistics. (K) J. Kim. 2000 Korean collocation research. PhD thesis, Kyunghee University. (K) J. Sinclair. 1991 Corpus, Concordance, Collocation. Oxford and New York, Oxford University Press. J. Yun. 1997 Korean structure analyzing by co-occurrence based word relation. PhD thesis, Yonsei University. (K) U. Paik. 1996 The introduction of statistics. Seoul, Jayuacademi Press. (K)
Corpus-based terminology extraction applied to information access
Anselmo Penas, Felisa Verdejo and Julio Gonzalo
{anselmo,felisa,julio}@lsi.uned.es
Dpto. Lenguajes y Sistemas Informáticos, UNED, Spain
Abstract
This paper presents an application of corpus-based terminology extraction in interactive information retrieval. In this approach, the terminology obtained in an automatic extraction procedure is used, without any manual revision, to provide retrieval indexes and a "browsing by phrases" facility for document access in an interactive retrieval search interface. We argue that the combination of automatic terminology extraction and interactive search provides an optimal balance between controlled-vocabulary document retrieval (where thesauri are costly to acquire and maintain) and free-text retrieval (where complex terms associated with domain-specific concepts are largely overlooked).
1 Introduction
Although thesauri are widely used in Information Retrieval, their development requires labor-intensive processes with a high manual cost. On the other hand, new domains with specific conventions and new terminology are continuously appearing. The development of terminology lists is a preliminary step in thesaurus building that allows the use of automatic techniques in order to ease documentalists' labor. Terminology lists can be seen as an intermediate point between free and controlled access to information. They are used as indexes for access to documents and resources. In the context of education, they are used in schools, libraries and documentation centers for database access (ERIC). Although several educational thesauri are available, the domain of new technologies in education is more specific and requires the addition of new terms. This work has been developed inside the European Treasury Browser (ETB) project, which aims to build the structures needed to organize and retrieve educational resources on a centralized web site server. One of the main resources being developed inside ETB is a multilingual thesaurus whose terms will be used to describe the educational resources. We describe the corpus-based terminology extraction procedure used to extract and suggest to documentalists Spanish terms in the domain of new technologies and primary and secondary education. The method is based on the comparison of two corpora extracted from the web: the first is a corpus in the domain of interest, and the second is a corpus in a different and more general domain (international news in our case).
The comparison of terms in both corpora facilitates the detection of terms specific to our domain. Section 2 will describe the methodology followed for the Terminology Extraction (TE) procedure based on corpora and morphosyntactic analysis. Exploration and evaluation of the results will be given in section 3. Thesauri and controlled vocabularies are widely used in Information Retrieval (IR). We argue that the cost of thesaurus construction can be skipped for IR purposes if the constraints on terminology are relaxed. This is possible in a framework of interactive text retrieval. TE, then, becomes an appropriate tool for providing richer indexing terms to IR indexes. Section 4 analyses the differences between IR and TE which allow the TE process to be relaxed, and shows a first prototype where corpus-based TE is applied to interactive IR.
2 Corpus-based Terminology Extraction
Terminology Extraction (TE) tasks deal with the identification of terms which are frequently used to refer to the concepts in a specific domain. Typically, automatic terminology extraction (TE, Term Extraction; ATR, Automatic Terminology Recognition) is divided into three steps (Bourigault, 1992) (Frantzi et al., 1999):
1. Term extraction via morphological analysis, part-of-speech tagging and shallow parsing.
2. Term weighting with statistical information. The weight is a measure of the term's relevance in the domain.
3. Term selection, ranking and truncation of terminological lists by thresholds of weight.
These steps need a previous one in which corpora are obtained and prepared for the TE task. We will distinguish between one-word terms (mono-lexical terms) and multi-word terms (poly-lexical terms), extracted with different techniques. The following subsections explain the whole TE process performed for the domain of interest: multimedia educative resources for primary and secondary school in Spanish.
2.1 Construction of the corpora
The corpora have been constructed from web pages harvested with crawlers. These web pages need preprocessing because they contain information which can disturb the term extraction process:
1. Treatment of html tags.
2. Deletion of pages in languages other than Spanish. For this task a language recognizer has been used.
3. Deletion of repeated pages and chunks (a sketch of this step is given at the end of this subsection). Due to the continuous updating of web site contents, pages with different names but the same content are very frequent. This becomes a problem, since identical sequences of words in different documents give positive evidence of terminological expressions: repeated pages and chunks produce noise in the statistical measures.
As mentioned above, the automatic terminology extraction method (Manning and Schütze, 1999) used in this work is based on the use of two corpora:
Educative Resources Corpus
The first corpus is related to the domain and contains useful terminology for classifying, organizing and retrieving multimedia resources for secondary school. The pages of the two following web sites have been collected with a crawler:
· Programa de Nuevas Tecnologías: http://www.pntic.mec.es/main_recursos.html
· Aldea Global: http://sauce.pntic.mec.es/~alglobal
This corpus has 1,075 documents and 670,646 words.
International News Corpus
With the aim of discarding frequent terms which are not domain-specific, a second corpus has been collected from the web. This corpus is composed of 7,364 international news articles from an electronic newspaper (http://www.elpais.com), and has a size of 2.9 million words.
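A minimal sketch of how the deletion of repeated pages and chunks (step 3 above) might be implemented; the chunk size, hashing scheme and function names are assumptions of ours, not details given in the paper.

```python
import hashlib

def chunk_fingerprints(text, chunk_size=200):
    """Split a page into fixed-size word chunks and hash each chunk."""
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_size])
              for i in range(0, len(words), chunk_size)]
    return {hashlib.md5(c.encode('utf-8')).hexdigest() for c in chunks}

def remove_repeated_pages(pages):
    """Keep a page only if it contributes at least one unseen chunk."""
    seen, kept = set(), []
    for page in pages:
        fps = chunk_fingerprints(page)
        new_chunks = fps - seen
        if new_chunks or not fps:   # empty pages are passed through unchanged
            kept.append(page)
        seen |= fps
    return kept
```

Filtering at the chunk level rather than the whole-page level is one way of also catching partially repeated pages, which would otherwise inflate the frequency counts used in the statistical measures below.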
As explained below, the comparison of term frequencies in both corpora gives a relevance measure for domain terminology. 2.2 Term detection Texts were first tokenized. Abbreviations, erroneous strings and words from other languages than Spanish were ignored. In order to obtain mono-lexical terms, texts were tagged on their part of speech using (Márquez et al., 1997; Rodríguez et al., 1998) and only nouns [N], verbs [V] and adjectives [A] were extracted. Different forms of the same word are counted by considering their lemma. In order to detect poly-lexical terms, syntactic pattern recognition has been applied to the collection. Complex noun phrases can be splitted into simpler ones, but all of them must be considered for terminology extraction. The selection of the appropriate, even correct ones will be decided in subsequent steps. For example, “distance education teachers” is a noun phrase which contains a simpler one: “distance education”. Term detection phase requires the extraction of all candidate phrases. The use of syntactic patterns is adequate for this task. Patterns enable to find all the phrases that match them in all the documents of the collection. Patterns are defined as morphosyntactic tag sequences. If the text contains a word sequence whose tags match the pattern, then a new phrase has been recognized. The patterns do not attempt to cover all the possible constructions of noun phrases, but only the ones that appear more frequently in terminological expressions. They were obtained empirically after an iterative refinement of prototypes. The patterns used are listed in figure 1. Along the pattern recognition process, a record of phrase occurrences and the documents in which they appear is built. 460 Patterns recognized 72.453 candidate phrases in the “Educational Resources Corpus”. 75% of them appear only once in the whole corpus and largely consist of wrong expressions or irrelevant terms for the domain. As discussed later, it is not necessary to discard wrong expressions for Information Retrieval purposes, but for term extraction, the correctness of the identified expressions is preferable even if some relevant expressions are lost. In other words, precision is more important than recall. Therefore, for the Terminology Extraction task a threshold for the number of phrase occurrences has been defined, and all the expressions that appear only once in the educational corpus have been discarded. The results of the term detection phase are two lists: 1. a list of lemmas (mono-lexical terms), and 2. a list of terminological phrases (poly-lexical terms). Every term is associated with its frequency and the number of different documents in which it appears. Such statistics are obtained both for the educational corpus and for the newspaper corpus. 2.3 Term weighting Term weighting gives a relevance value to every detected term in order to select the most relevant terms in the domain. We have defined a weighting formula that satisfies the following constraints: 1. Less frequent terms in the domain corpus should have less relevance. 2. Highly frequent terms in the domain corpus should have higher relevance, unless they are also very frequent in the comparison corpus or they appear in a very small fraction of the documents in the domain corpus. The formula considers: 1. Term frequency in the collection. 2. Document frequency of terms in the collection. 3. Term frequency in a more general domain. 
The relevance of a term t is computed as:

Relevance(t, sc, gc) = 1 - 1 / log2( 2 + (Ft,sc · Dt,sc) / Ft,gc )

where Ft,sc: relative frequency of the term t in the specific corpus sc; Ft,gc: relative frequency of the term t in the generic corpus gc; Dt,sc: relative number of documents in sc where t appears. Although a majority of documents in the educational corpus are related to the domain under study, some documents contain very specific terms belonging to different domains (e.g. tales for children about witches). If such documents are long enough, they can give high frequencies for non-relevant terms. To solve this problem, the measure considers document frequency: a term must appear in several documents of the domain to be considered relevant.

Figure 1. Syntactic patterns for Spanish: 1. N N; 2. N A; 3. N [A] Prep N [A]; 4. N [A] Prep Art N [A]; 5. N [A] Prep V [N [A]].

2.4 Term selection Three criteria have been used in order to reduce the number of candidate terms: 1. Removal of infrequent terms in the educational corpus (threshold = 10). Terms that are not frequent in the corpus have a low probability of being representative of the domain. 2. Removal of very frequent terms in the newspaper corpus (threshold = 1000). Terms which appear very frequently in other domains have a low probability of being specific to our domain. 3. Selection of the first n (n = 2000) terms ranked according to the relevance measure. The thresholds used for this list truncation depend on the number of terms that will be handled in the following phases. As we wanted to evaluate manually the precision of the automatic term extraction process, the thresholds were adjusted in order to obtain between 2000 and 3000 candidate terms. Poly-lexical term frequencies do not behave like mono-lexical frequencies. Poly-lexical terms are much less frequent than mono-lexical terms, but a couple of occurrences of a poly-lexical term give strong evidence of a lexicalised expression. For this reason, further criteria were needed to add poly-lexical terms to the selection list based on the relevance of their components. Through the exploration of poly-lexical terms we confirmed that very few compounds without relevant components were themselves relevant. Those terms were ignored, and all the poly-lexical terms with relevant components were added to the term list for manual revision. 3 Evaluation of the term extraction procedure 3.1 Visual exploration of results Visual exploration of results is needed during the whole process: · to help in the decisions of prototype development and refinement, · to evaluate measures and techniques, and suggest modifications and improvements, · to give documentalists the possibility of exploring data in order to assist final decisions in thesaurus construction. We needed a simple, intuitive and comfortable way to explore the data. Thanks to hyperlinking, the use of html pages has proved appropriate for this task. The html pages containing the extracted data were automatically generated in each iteration of the prototype. Figure 2 shows the pages with the statistical data for the mono-lexical relevance measure computation. In this case, terms are ordered by relevance weight. The columns contain the following data: 1. Term frequency in the educational corpus. 2. Document frequency in the educational corpus. 3. Term frequency in the newspaper corpus. 4. Number of compounds containing the term. 5. Relevance weight. 6. Whether the term is contained in an electronic dictionary or not. 7. 
Hyperlink to the page with all the contexts where the term appears in any inflected form. 8. Hyperlink to the page with all the compounds which contain the term in any inflected form. The pages for poly-lexical terms contain, again, statistical information for each term, and hyperlinks to keyword in context (KWIC) exploration pages. The KWIC pages have hyperlinks to the documents the context belongs to in two versions, text and part of speech tagged files. These links allow a deeper analysis of term contexts and better discrimination of the term senses in the collection. 462 3.2 Evaluation The final list of candidate terms contained 2,856 mono and poly-lexical. The terms were manually revised and classified to test the accuracy of the extraction process. Terms were classified as: · Incorrect, when the term is not acceptable in the language (Spanish in the example). · Non lexicalised, when it is correct but it does not have a specific meaning further than the combination of meanings of its components. · Not in the domain, when the term is lexicalised but does not belong to the domain. · Adequate. · Specific domain, when it should be part of a microthesaurus inside the domain. · Computers domain. · Variant, when the term has already been considered in some other flexive form. Tables 1 and 2 show the classification of the candidate terms. Adequate Specific domain Computers Variants Total of terms 1235 43.24% 513 17.96% 59 2.07% 78 2.73% 2856 100% Table 1. Correct terms. Incorrect Not lexicalised Not domain Total of terms 151 5.29% 515 18.03% 305 10.68% 2856 100% Table 2.Incorrect and not adequate terms. The appropriate terms with the manual classification were used as the input to documentalists to produce the thesauri. They were 66% of the terms automatically selected, an indication that the automatic procedure can be useful in detecting phrases for Information Retrieval purposes, as it is discussed in the next sections. Figure 2. Visual exploration of terms 463 4 Terminology based Information Retrieval Traditional Information Retrieval only uses mono-lexical terms for collection indexing. Arbitrary consideration of N word sequences (n-grams) generate indexes too large to be useful. An intermediate approach is to add only terminological phrases to the collection index. In this way, the term extraction procedure described above has been applied to indexing tasks for Information Retrieval. However, besides the addition of useful terms to document indexes, the use of terminology gives another possibility: instead of navigating through the collection documents, it's possible to navigate through the collection terminology and access the documents from the relevant terms (Anick et al., 1999; Jones et al., 1999). Furthermore, the consideration of two areas, one for document ranking and a second for term browsing, opens an interesting way for interactive information retrieval. Along these lines, a first monolingual prototype has been developed in order to explore term browsing possibilities for accessing information. Indexing and retrieval are described in the following subsections. 4.1 Indexing In the terminology extraction task, the goal is to decide which terms are relevant in a particular domain. In the Information Retrieval task, on the other hand, users decide the relevant terms according to their information needs. 
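The weighting and truncation machinery of sections 2.3 and 2.4, which this interactive setting can afford to relax, can be sketched in a few lines. The sketch below is a minimal Python illustration; the per-term data layout and the epsilon guard for terms unseen in the generic corpus are assumptions, not details taken from the paper.

import math

def relevance(f_sc, d_sc, f_gc, epsilon=1e-9):
    # Relevance(t, sc, gc) = 1 - 1 / log2(2 + (F_t,sc * D_t,sc) / F_t,gc).
    # epsilon is an assumed guard for terms that never occur in the generic corpus.
    return 1.0 - 1.0 / math.log2(2.0 + (f_sc * d_sc) / max(f_gc, epsilon))

def select_terms(terms, min_domain_freq=10, max_generic_freq=1000, top_n=2000):
    # terms: list of dicts holding raw counts and the relative frequencies
    # f_sc, d_sc, f_gc for each candidate term.
    kept = [t for t in terms
            if t["domain_freq"] >= min_domain_freq          # criterion 1
            and t["generic_freq"] <= max_generic_freq]      # criterion 2
    kept.sort(key=lambda t: relevance(t["f_sc"], t["d_sc"], t["f_gc"]),
              reverse=True)
    return kept[:top_n]                                     # criterion 3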
This implies that, for IR purposes, precision at indexing time should be sacrificed in favor of a higher recall, delaying the determination of which phrases are relevant until the user poses a particular query. For this reason, the terminology extraction procedure described above has been used here ignoring one step: term list truncation. The term extraction process has been adapted to keep all indexable phrases, ranked but without a cutting threshold. The ranking is used to structure and organize the relevant phrases for a given user's query. In the indexing phase, the lemmas of nouns and adjectives are linked to the phrases in which they participate. The phrases, in turn, are linked to the documents in which they appear. Indexing levels are shown in figure 3. The first level of indexing provides access to phrases from the query's isolated terms. The second level supplies access to documents from the selected phrases. 4.2 Retrieval From the above indexing schema, the retrieval process follows this steps: 1. Through the interface shown in Figure 4, the user provides to the system the searching terms (‘Look for’ text area) . Terms do not need to have any syntactic structure as phrases or sentences. 2. The system obtains the mono-lexical terms related to the query through their lemmatization and categorization. 3. Poly-lexical terms are retrieved from mono-lexical ones through the corresponding index. 4. From poly-lexical terms, documents which contain them are retrieved. 5. According to their relevance weight, terms are organized and shown to users (rightmost area in Figure 4) to allow document access directly from terms. 6. Documents are also ranked and shown in a different area (leftmost area in Figure 4). Ranking criteria are based on the number of identified terms contained in documents.# Lemma Phrase Document Figure 3. Indexing Levels 464 Both areas, term area and document area have two kind of links: 1. Links for exploring the selected documents. 2. Links for exploring the term contexts in the collection (Figure 5). From these contexts user can select and access the relevant documents. Figure 4. Website Term Browser interface. Figure 5. Term contexts with links to documents. 465 5 Conclusions and future work The work has been developed in the context of ETB project with two objectives. First, to provide to documentalists the Spanish terminology for a thesaurus building in the domain of new technologies, primary and secondary education. This thesaurus will provide multilingual structure for resources organization and retrieval. This paper has shown the methodology used for the automatic extraction of Spanish terms. Second, the extracted terminology has been used in a first prototype for information access. The developed search engine gives an intermediate way for information retrieval between free searching and thesaurus-guided searching in an interactive framework. In this prototype, documents are accessible from the terminological phrases suggested by the system after user's query. As users usually don't make use of the same phrases contained in the collection, this approach bridges the distance between the terms used in queries and the terminology used in the collection. Our present interest is focused in extending this work to cross-language information access, where the use of phrases provides not only a way for document accessing but also an excellent way for reducing ambiguity in query translation (Ballesteros et al., 1998). 
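Returning briefly to the prototype of section 4, the two-level index (lemma to phrase, phrase to document) and the retrieval walk-through of section 4.2 can be sketched as follows. The function names, the data layout and the scoring of documents by number of matched terms are illustrative assumptions consistent with the description above, not the system's actual code.

from collections import defaultdict

def build_indexes(phrase_docs, phrase_lemmas):
    # phrase_docs: phrase -> set of document ids in which it occurs.
    # phrase_lemmas: phrase -> lemmas of the nouns and adjectives it contains.
    lemma_to_phrases = defaultdict(set)
    for phrase, lemmas in phrase_lemmas.items():
        for lemma in lemmas:
            lemma_to_phrases[lemma].add(phrase)     # first indexing level
    return lemma_to_phrases, phrase_docs            # second level is given

def retrieve(query_lemmas, lemma_to_phrases, phrase_docs, phrase_weight):
    # Steps 2-3: from the query's lemmas to the poly-lexical terms containing them.
    phrases = set()
    for lemma in query_lemmas:
        phrases |= lemma_to_phrases.get(lemma, set())
    ranked_phrases = sorted(phrases, key=phrase_weight, reverse=True)
    # Steps 4-6: from phrases to documents, ranked by number of identified terms.
    doc_scores = defaultdict(int)
    for phrase in phrases:
        for doc in phrase_docs.get(phrase, set()):
            doc_scores[doc] += 1
    ranked_docs = sorted(doc_scores, key=doc_scores.get, reverse=True)
    return ranked_phrases, ranked_docs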
We plan to extend this way of disambiguation not only for query translation but also for query expansion and term variation. For query expansion and translation we plan to use the synonymy, hyper/hyponymy and meronymy relations of EuroWordNet (Vossen 1998), developing a complete multilingual framework for interactive information access. Acknowledgments This work has been partially supported by the European Commission, ETB project IST-1999- 11781. References Anick P G and Tipirneni S 1999 The Paraphrase Search Assistant: Terminological Feedback for Iterative Information Seeking. Proceedings of 22nd ACM SIGIR Conference Research and Development in Information Retrieval. 153-159. Ballesteros L and Croft W B 1998 Resolving Ambiguity for Cross-Language Information Retrieval. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 64-71. Bourigault D 1992 Surface grammatical analysis for the extraction of terminological noun phrases. Proceedings of 14th International Conference on Computational Linguistics, COLING'92. 977- 981. ERIC: http://ericae.net Frantzi K T, Ananiadou S 1999 The C-value/NC-value domain independent method for multiword term extraction. Journal of Natural Language Processing. 6(3):145-180. Jones S, Staveley M S 1999Phrasier: a System for Interactive Document Retrieval Using Keyphrases. Proceedings of the 22nd ACM SIGIR Conference on Research and Development in Information Retrieval. 160-167. Manning C, Schütze H 1999 Foundations of Statistical Natural Language Processing. MIT Press. Cambridge, MA. Márquez L, Padró L 1997 A flexible POS tagger using an automatically acquired language model. Proceedings of ACL/EACL'97. Rodríguez H, Taulé M, Turmo J 1998An environment for morphosyntactic processing of unrestricted Spanish text. Proceedings of LREC'98. Vossen P 1998 Introduction to EuroWordNet. Computers and the Humanities, Special Issue on EuroWordNet. 466 Multi-word Unit Alignment in English-Chinese Parallel Corpora 1. Introduction Multi-word unit (MWU) alignment in bilingual/multilingual parallel corpora is an important goal for natural language engineering. An efficient algorithm for aligning MWUs in different languages could be of use in several practical applications, including machine translation, lexicon construction and cross-language information retrieval. A number of algorithms have been proposed and tested for this purpose, including collocation and association-strength testing statistics (Dagan et al., 1994; Smadja et al., 1996), n-gram, approximate string matching techniques (ASMT), finite state automata, (McEnery et al., 1997), bilingual parsing matching (Wu, 1997), and a hybrid connectionist framework (Wermter et al., 1997). Despite the work undertaken to date, however, reliable and robust MWU alignment remains an elusive goal. In this paper, we describe an algorithm which combines the n-gram approach, linguistic filters and cooccurrence statistical metrics to extract and align English and Chinese nominal MWUs in English- Chinese parallel corpora. We decided to focus on nominal MWUs as they are stable and form the largest group of MWUs in natural language1. This algorithm has been evaluated on a sentence aligned English-Chinese Parallel corpus (Piao, 2000). It obtained precision rates of 92.17% and 87.37% for English and Chinese nominal MWU extraction respectively while it obtained a precision rate of 80.63% for English-Chinese MWU alignment. 2. 
Related works As noted, a number of techniques have already been applied to the problem of MWU extraction and alignment, including Smadja et al.’ (1996) collocation translation system, McEnery et al.'s (1997) approximate string matching techniques (ASMT) and finite state automata, and Wu's (1997) stochastic inversion transduction grammars used to align Chinese-English phrases. Dagan and Church (1994) also describe a semi-automatic tool, Termight, which extracts technical terms and translations from parallel corpus data. This tool first extracts candidate multi-word noun phrases and single words from a POS tagged corpus using a syntactic pattern filter. Then it groups terms by head words, and sorts terms within each group in reverse word order. Finally, concordance lines are produced so that human experts can distinguish true terms from the candidate terms. Then, the Termight aligns bilingual MWUs using aligned words. For each source term, the tool identifies a candidate translation by selecting a sequence of target words whose first and last word are aligned with any of the words in the source term. Bilingual concordances are produced for selected term pairs to allow terminologists to verify true translations. Dagan et al, (1994) tested Termight on 192 terms found in English and German versions of a technical manual. They found that in 40% and 7% of the cases the first and second target language candidates respectively were the correct translations. In other cases, the correct translation was always somewhere in the concordances. The Champollion system of Smadja et al.’ (1996) can produce translations of the source collocations in the target language. It is based on a MWU extraction system, Xtract, which was also developed by Smadja et al. (1993). They first extracted source language (English) collocations with Xtract. After that, for each source collocation, they extracted its translationd in the target language (French) by testing Dice-score and collocation lists. Champollion was tested on the Hansard corpus and an accuracy of between 65%-78% was reported. 1 Biber et al. (1999: 231) observe: “In the news report, nominal elements make up about 80 per cent of the text (measured in terms of words). The corresponding figures for the other text samples are approximately: academic prose 75 per cent, fiction 70 percent, and conversation 55 per cent. In other words, nominal elements make up between a half and four-fifth of the text.” Scott Songlin Piao Department of Computer Science University of Sheffield Email: s.piao@dcs.shef.ac.uk Tony McEnery Department of Linguistics and MEL Lancaster University Email: mcenery@comp.lancs.ac.uk 467 McEnery et al. (1997) tested extracting multi-word cognates from parallel corpora using ASMT and finite state automata. In this algorithm, all n-grams from source text are compared against all potential m-grams in an aligned region of the parallel text. Dice's similarity coefficient is calculated for each pair (n, m). The (n, m) pair which gains the highest score is selected as the best potential MWU cognate. For 3,142 windows matched in a million-words English-Spanish parallel corpus, an overall precision of 96.5% was reported. They also developed finite state automata for typical English and Spanish compound noun constructions to extract alignment candidates, then filtered the candidates with a similarity score. This effectiveness of the technique was found to be directly linked to the strength of the similarity scores computed for candidate terms. 
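As a point of reference for the approaches cited above, Dice's similarity coefficient, the score used in the ASMT work, can be sketched over character bigrams. This is an illustrative reading rather than a reproduction of the cited systems; the bigram representation and the helper names are assumptions.

from collections import Counter

def char_bigrams(s):
    s = s.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def dice(a, b):
    # Dice's coefficient: 2 * |X intersect Y| / (|X| + |Y|),
    # computed here over the character-bigram multisets of the two strings.
    x, y = char_bigrams(a), char_bigrams(b)
    total = sum(x.values()) + sum(y.values())
    return 2.0 * sum((x & y).values()) / total if total else 0.0

def best_candidate(source_ngram, target_mgrams):
    # Pick the target m-gram with the highest Dice score for a source n-gram.
    return max(target_mgrams, key=lambda m: dice(source_ngram, m))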
Wu (1997) approached Chinese-English phrase alignment by bilingual parsing with stochastic inversion transduction grammars. Wu reports that his approach produced phrase alignments as long as 12 English tokens and 15 Chinese characters. After pruning, a precision rate of 81.5% (based on random samples drawn from about 2,800 phrasal alignments) was reported. 3. A hybrid algorithm for aligning English-Chinese nominal MWUs In this paper, we describe an algorithm of nominal MWU alignment in which a number of techniques outlined in the previous section are combined. Our approach uses n-grams, POS filters and cooccurrence metrics in concert. The n-gram approach is used to extract candidate MWUs. POS filters are then used to extract a shortlist of candidate nominal MWUs. After this, co-occurrence metrics are used to extract true nominal MWUs and their alignments (explained in detail later in this paper). To show the effectiveness of this approach, an English-Chinese parallel corpus is used as a testbed. This corpus contains 61,534 English words and 98,537 Chinese characters. The English data in the corpus was POS tagged with the Lancaster CLAWS tagger (Garside et. al. 1987) and the Chinese data was tagged with Zhang et al's (2000) Chinese tagger. Additionally, the corpora have been sentence aligned using Piao's (2000) program. One assumption underlying the approach taken to nominal MWU alignment in this paper is that nominal MWUs in the source language are generally translated into a nominal MWU (hereafter, MWU in this paper refers to nominal MWU) in the target language, and hence their occurrences in the parallel translation texts are correlated. Based on this assumption, we propose that nominal MWU alignment can be approached through a) first extracting significant MWUs from each language in the parallel corpus, then b) aligning them based on their co-occurrence affinity. An algorithm was designed to implement this approach. Two main stages are involved in this algorithm: a) English and Chinese MWU extraction, b) MWU alignment. 3.1. Candidate Nominal MWU extraction The first stage of our algorithm is to extract candidate nominal English and Chinese MWUs from the English-Chinese parallel corpus. We assume that most MWUs are stable continuous strings of words. Therefore we adopted an n-gram approach to extracting candidate MWUs from the corpus texts. In order to remove irrelevant candidates from the process, simple POS filters are used to filter out n-grams whose POS structures are unlikely to constitute nominal MUWs. Firstly, candidate MWUs are extracted from the corpus with the following algorithm: (1) Extract English and Chinese n-grams (2 <= n <= 6)2 from the English and Chinese section of the corpus respectively. Considering the unreliability of statistical scores for low frequency items, those n-grams whose frequency is lower than three were ignored. (2) An English and Chinese POS filter is used to filter out those n-grams whose POS patterns are unlikely nominal MWUs . In the first step of this algorithm, any one of the n-grams extracted, from bi-grams to 6-grams, can be candidate MWUs. However, as noted, many of these n-grams have POS sequences that make them 2 Because only twelve 6-grams with frequencies equal or greater than three were found after POS filtering in the tesbed corpus, we assumed that the nominal MWUs longer than six words are non-existent in this corpus. 468 unlikely candidate nominal MWUs. 
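Step (1) of the algorithm above can be sketched as follows. The sketch assumes the tagged text is available as a list of (word, tag) pairs; the frequency threshold of three follows the description above.

from collections import Counter

def candidate_ngrams(tagged_tokens, n_min=2, n_max=6, min_freq=3):
    # tagged_tokens: [(word, tag), ...] for one language side of the corpus.
    # Returns {(words, tags): frequency} for n-grams seen at least min_freq times.
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tagged_tokens) - n + 1):
            gram = tagged_tokens[i:i + n]
            words = tuple(w for w, _ in gram)
            tags = tuple(t for _, t in gram)
            counts[(words, tags)] += 1
    return {gram: f for gram, f in counts.items() if f >= min_freq}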
These false MWUs cause “noise” for the algorithm, and hence it is desirable to filter them out before proceeding. In filtering the candidate n-grams, the POS sequence associated with each n-gram is matched against simple POS filters. The English POS filter3 is a simple rule system that excludes any of the candidates if the any of the following conditions are met: 1) For the initial word of a n-gram, if a) the initial code of its POS tag is any one of “A”, ‘C', ‘D', ‘G', ‘I', ‘M', ‘P', ‘R’ or ‘T', or b) the initial code of its POS tag is ‘V’ and the last code is not ‘G’ or ‘N', or c) it has the POS tag of “AT”. 2) For the last word of a n-gram, if its initial POS code is not ‘N'. Table 1 shows the English POS categories denoted by the POS tags or tag-initials used in the English filter (for English C7 tagset see appendixIII in ). Code Category * Any codes A* Possessive pronouns and articles C* Conjunction D* Determiner G* Genitive markers ‘ and ‘s I* Preposition M* Numeral N* Noun/Proper noun P* Pronoun R* Adverbs T* Infinitive marker “to” V* Verb V*G -ing verb V*N Past participle Table 1: English POS categories denoted by the codes in the English POS filter The Chinese POS filter4 works by excluding candidate Chinese n-grams if any of the following conditions are met: 1) For the initial word, a) the initial code of its POS tag is any of ‘C', ‘D', ‘M', ‘P’ and ‘V', c) its POS tag is “AUX”, “AV0” or “NMW”. 2) For the last word, the initial code of tag is not ‘N'. Table 2 shows the Chinese POS categories denoted by the codes in the Chinese filter. Code Category AUX Auxiliary word AV0 Adverb C* Conjunction D* Markers “ ”, “ ” and “ ” M* Numeral P* Pronoun and determiner V* Verb 3 For C7 tagset, see appendix III in Roger Garside, Geoffrey Leech, Anthony McEnery (eds) (1997) Corpus annotation, London & New York, Longman. 4 The Chinese tagset used by Zhang et al's tagger was modified. For details, see appendix. 469 Table 2: Chinese POS categories denoted by the codes in the Chinese POS filter A notable feature of the filter is that it does not thoroughly examine n-grams in the sense that it only considers the initial and last words of the n-grams. In spite of the simplicity of this filtering mechanism, it generally performs well, particularly on bi-grams and tri-grams. For example, in an experiment, these filters filtered out 6,009/4853 irrelevant English/Chinese bigrams from 6767/5599 English/Chinese bigrams. Thus while the filter is effective, it does not seem to exclude good candidates while demonstrating a fair degree of success at filtering out bad candidates. 3.2. Seed-gram extraction After the POS filter has eliminated a proportion of the bad candidate n-grams, a further filtering process, to identify the true nominal MWUs from the candidate list, is necessary. In order to achieve this, we identify seed-grams. Seed-grams are short MWUs, including bi-grams and tri-grams, in which the element-tokens are strongly associated. Such seed-grams are identified by testing their cooccurrence associations (for bi-seed-grams) and some specific POS patterns (for tri-seed-grams). It is assumed that a good nominal MWU must contain one or more seed-grams. Therefore, seed grams can be used to find longer significant nominal MWUs, i.e. a seed-gram can “grow” longer. Bi-seed-grams are extracted as follows. For each bi-gram, the mutual information (MI)5 and t-score is calculated. These scores reflect the co-occurrence affinity between the two tokens of the bi-gram. 
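The English POS filter of section 3.1 can be written as a small predicate over an n-gram's tag sequence. The sketch below is illustrative rather than the authors' code; the tag initials follow the C7-style codes of table 1, and the rule logic mirrors conditions 1) and 2). The Chinese filter of table 2 can be expressed in the same way.

BAD_FIRST_INITIALS = set("ACDGIMPRT")

def passes_english_filter(tag_sequence):
    # Reject an n-gram whose first or last POS tag violates the filter conditions.
    first, last = tag_sequence[0], tag_sequence[-1]
    if first[0] in BAD_FIRST_INITIALS:                       # condition 1a
        return False
    if first[0] == "V" and not first.endswith(("G", "N")):   # condition 1b
        return False
    if first == "AT":                                        # condition 1c
        return False
    return last.startswith("N")                              # condition 2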
These two scores are calculated by the following formulas:

(1) MI = log2( a² / ((a+b)(a+c)) )

(2) t = ( prob(Wa,Wb) - prob(Wa) · prob(Wb) ) / sqrt( prob(Wa,Wb) / (a+b+c+d) )

where prob(Wa,Wb) = a / (a+b+c+d), prob(Wa) = (a+b) / (a+b+c+d), prob(Wb) = (a+c) / (a+b+c+d), and a, b, c and d are elements of a contingency table. For example, given a bi-gram containing tokens x and y, a = number of bi-grams in which both x and y occur; b = number of bi-grams in which only x occurs; c = number of bi-grams in which only y occurs; d = number of bi-grams in which neither x nor y occurs. For the purposes of seed-extraction, thresholds of 1.65 for MI and -3 for t-score are used6. If a given bi-gram produces both an MI and a t-score greater than the thresholds, it is accepted as a seed-gram. The process does not vary across languages: the same algorithm is applied to both English and Chinese. When applied to the testbed corpus, this algorithm extracted 414 English seed-grams and 315 Chinese seed-grams out of the 758 and 746 English and Chinese candidates respectively. All of the extracted seed-grams were found to be meaningful and were accepted as true seed-grams, giving the technique a precision of 100%. However, the recall of the technique for both English and Chinese seed-gram extraction is significantly lower, at 52.74% and 42.22% respectively. Figures 1 and 2 show examples of English and Chinese seed grams. 5 In formula 1, the numerator a is squared, as a previous study shows that this modified formula performs better than the original one in which the exponent of a is one (see Piao, 2000). 6 These parameters were established empirically for the corpus on which we tested the algorithm. We accept that these parameters may vary depending upon the corpus being exploited.

MI t-score f Seed gram POS pattern
4.717 6.527 43 Education Commission NN1 NNJ
4.373 4.990 25 Nobel Prize NP1 NN1
4.334 4.683 22 Lianhe Zaobao NP1 NP1
4.272 5.087 26 hanyu pinyin NN1 NN1
4.164 4.682 22 Hong Kong NP1 NP1
4.005 6.864 48 higher learning JJR NN1
3.816 4.236 18 United States NP1 NP1
3.700 3.603 13 Chen Qing-shan NP1 NP1
Figure 1: A sample of top English seed grams

MI t-score f Seed gram POS pattern
4.431 9.497 98 AJ0 NN0
4.130 5.627 32 NN0 NN0
4.104 7.539 59 AJ0 NN0
3.286 3.990 16 NN0 NN0
3.225 8.569 85 AJ0 NN0
3.126 5.919 36 NN0 NN0
2.831 3.311 11 NN0 NN0
2.645 5.405 30 NN0 NN0
Figure 2: A sample of top Chinese seed grams

For trigrams, POS matching patterns are used for extracting seed-grams, as shown below. Trigrams are matched against POS patterns and deemed to be seed grams if they match the following patterns for English and Chinese: (1) English POS patterns: [Noun/Proper_noun + of/genitive_marker + Noun/Proper_noun] (2) Chinese POS patterns: [Noun/Adjective/Proper_noun + de ( ) + Noun/Proper_noun] These heuristic patterns were developed on the basis of the grammatical properties of English and Chinese noun phrases. This proved to be an effective technique: when tested on the corpus, all of the tri-grams extracted with these POS patterns proved to be significant nominal MWUs. Fig. 3 shows sample tri-seed-grams extracted in this way. f Eng. tri-seed-gram POS pattern f Chi. 
tri-seed-gram POS pattern ------------------------------------------------------------------------------------------------------------------ 33 People ’s Republic NN GE NN1 6 NP0 DE1 NN0 30 Republic of China NN1 IO NP1 5 NN0 DE1 NN0 8 institutions of China NN2 IO NP1 5 NP0 DE1 NN0 6 China ’s education NP1 GE NN1 5 NP0 DE1 NN0 6 Ministry of Education NN1 IO NN1 4 AJ0 DE1 NN0 6 number of people NN1 IO NN 4 AJ0 DE1 NN0 6 number of women NN1 IO NN2 4 NN0 DE1 NN0 6 women ’s education NN2 GE NN1 4 AJ0 DE1 NN0 Fig. 3: A sample of English and Chinese seed tri-grams As shown previously, the n-gram approach, simple POS filters and co-occurrence metrics combined provide an efficient algorithm for extracting significant short MWUs, or seed-grams. The process of 471 ‘growing’ these seed-grams to identify nominal MWUs of length greater than 3 is described in the following section. 3.3. Extracting longer MWUs based on2 and 3 length seed-grams In natural languages, a nominal MWU can clearly be longer than three words. As discussed previously, our supposition was that if an MWU is significant, it is likely to contain one or more seed-grams; conversely, if an MWU contains one or more seed-grams, it is likely to be significant. Based on this assumption, we used seed-grams to identify true MWUs of a length greater than 3. To determine the usefulness of this approach to extracting nominal MWUs, we applied the the hypothesis to n-grams of length 3 – 67. The POS filters used to filter out candidate nominal MWUs work well on short n-grams, but work less efficiently on longer n-grams. Hence we applied a further filter to candidate n-grams (3 <= n <=6) as follows: a) English n-grams containing tags “VV*” (verbs) or “APPGE” (pre-nominal possessive pronoun) are filtered, b) Chinese n-grams containing “VV0” (verbs), “VM” (modal verbs) are filtered, where the asterisk denotes any letter(s). The candidates survived the pruning are taken as candidate nominal MWUs. Each of them is matched against the seed-grams. Those which contain one or more seed-grams are accepted as nominal MWUs. In the experiment, This approach extracted 626 English nominal MWUs and 467 Chinese nominal MWUs from the corpus. Figures 4 and 5 show samples of the extracted English and Chinese noun MWUs respectively. f MWU POS --------------------------------------------------------------------- 6 Goh Chok Tong NP1 NP1 NP1 3 Gross Domestic Product JJ JJ NN1 7 HIV infection NP1 NN1 3 Hanyu pinyin NN1 NN1 5 Harvard University NP1 NN1 3 Health Information NN1 NN1 3 Health Publications NN1 NN2 3 Health Publications Unit NN1 NN2 NN1 4 Health Service NN1 NN1 3 Heywood Stores NP1 NN2 Figure 4: A sample of automatically extracted English nominal MWUs f Nominal MWU POS Tags ------------------------------------------------------------------- 4 NN0 NMW PND NN0 30 AJ0 NN0 3 AJ0 NN0 NN0 98 AJ0 NN0 3 AJ0 NN0 CJ0 NN0 NN0 10 AJ0 NN0 NN0 14 AJ0 NN0 AJ0 NN0 3 NP0 NN0 6 NN0 NN0 3 AJ0 DE1 NN0 Figure 5: A sample of automatically extracted Chinese nominal MWUs 7 Note we still consider n-grams of length 3 at this stage, as these may be missed by the POS pattern matcher, but contain significant seed-grams of length 2. 472 A manual examination of the results showed precision rates of 92.17% and 87.37% for English and Chinese respectively. In a further analysis, it was found that 59 Chinese bad MWUs, or 20.34% of the mistakes, were caused by errors in the Chinese POS tagging. 
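The two association scores of formulas (1) and (2), together with the thresholds used to accept a bi-gram as a seed-gram, can be sketched as follows. This is a minimal illustration of the formulas as given above; the function names are not the authors'.

import math

def mi_score(a, b, c, d):
    # Formula (1): log2(a^2 / ((a+b)(a+c))), with the squared numerator
    # discussed in footnote 5. d is unused but kept for a uniform signature.
    return math.log2((a * a) / ((a + b) * (a + c)))

def t_score(a, b, c, d):
    # Formula (2): (P(Wa,Wb) - P(Wa)P(Wb)) / sqrt(P(Wa,Wb) / N), with N = a+b+c+d.
    n = a + b + c + d
    p_ab, p_a, p_b = a / n, (a + b) / n, (a + c) / n
    return (p_ab - p_a * p_b) / math.sqrt(p_ab / n)

def is_seed_gram(a, b, c, d, mi_threshold=1.65, t_threshold=-3.0):
    # A bi-gram is accepted when both scores exceed their thresholds.
    return mi_score(a, b, c, d) > mi_threshold and t_score(a, b, c, d) > t_threshold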
This means that, if a more accurate Chinese POS tagger is available, a higher success rate could reasonably be expected for Chinese. Given the considerably high precision yielded throughout the various stages of the technique developed for monolingual nominal MWU extraction, we use the output from the English and Chinese algorithms as the basis upon which alignment of MWUs between the two languages is attempted. 3.4. Aligning English and Chinese MWUs With the English and Chinese MWUs extracted the next step is to align them. As assumed previously, the English and Chinese translation equivalents are generally expected to co-occur in corresponding sentence translations. Since the test-bed corpus is aligned at sentence level, it is possible to test cooccurrence affinity between the candidate MWUs at sentence level. Again, the mutual information (MI) and t-score (see formulae 1 and 2) are used for testing cooccurrence correlation. For each English MWU xi (i = 1, 2, …, m) every Chinese MWU yj (j = 1, 2, …, n) is considered as a potential translation (n and m denote the numbers of English and Chinese MWUs). For a given xi, a contingency table is extracted against every yj. The contingency table contains elements, a, b, c and d which are defined as follows: (1) a denotes the number of aligned English-Chinese sentence pairs in which both xi and yj occur; (2) b denotes the number of aligned English-Chinese sentence pairs in which only yj occurs; (3) c denotes the number of aligned English-Chinese sentence pairs in which only xi occurs; (4) d denotes the number of aligned English-Chinese sentence pairs in which none of xi and yj occur. It was found that letter case distinctions in English caused considerable “noise” by dispersing frequencies. For instance, “GREAT BRITAIN” and “Great Britain” are indexed as different items despite the fact that they are variants of a single MWU, dividing their common frequency between them. In order to avoid this problem, all lowercase was taken as the canonical form of English MWUs. The MWU translations are identified as follows: 1) For each English MWU, all of the Chinese candidate MWUs are collected; 2) The Chinese candidates with t-scores lower than 1.658 are filtered out; 3) The candidates with MI lower than –0.2 are removed; 4) Finally, the surviving candidates are sorted by MI into descendent order. 5) The top one accepted as true Chinese translation of the given English MWU. 6) If no Chinese candidate survives the filtering, the given English MWU is ignored. In the experiment, out of the 626 English nominal MWUs and 467 Chinese nominal MWUs extracted by the monolingual extraction process, the alignment algorithm extracted 191 potential English- Chinese MWU alignments. Figure 6 shows a sample of aligned MWU pairs. In the sample, the first set of square brackets encloses the frequency of the English MWU and the second set of square brackets contains the MI-score, co-occurrence frequency and frequency of the Chinese MWU. 1) [48] chinese_JJ culture_NN1: (1) [2.3339 ; 22; 44] _NP0 _NN0 ----------------------------- 2) [8] chinese_JJ intellectual_NN1: (1) [0.9475 ; 6; 14] _NN0 _NN0 (2) [-0.5670 ; 3; 5] _NN0 _NN0 _NN0 ----------------------------- 3) [7] chinese_JJ intellectual_NN1 and_CC cultural_JJ elite_NN1: (1) [1.1402 ; 6; 14] _NN0 _NN0 (2) [-0.3744 ; 3; 5] _NN0 _NN0 _NN0 ----------------------------- 4) [10] chinese_JJ language_NN1 teachers_NN2: (1) [0.3455 ; 6; 17] _NN0 _NN0 8 A t-score threshold of 1.65 indicates the significance level of MI is above 95%. 
473 ----------------------------- 5) [16] chinese_JJ singaporeans_NN2: (1) [0.9475 ; 6; 7] _NN0 _NP0 _NN0 (2) [-1.5850 ; 4; 12] _NP0 _NN0 (3) [-1.7370 ; 6; 45] _NP0 _NN0 (4) [-1.8301 ; 3; 6] _NP0 _NN0 ----------------------------- 6) [12] chinese_JJ studies_NN2: (1) [3.3339 ; 11; 11] _NN0 _NN0 (2) [0.3808 ; 5; 8] _NP0 _NN0 (3) [0.3808 ; 5; 8] _NP0 _NN0 _NN0 (4) [0.0589 ; 5; 10] _NN0 _NN0 ----------------------------- 7) [4] cigarette_NN1 rolling_JJ tobacco_NN1 box_NN1: (1) [-0.5850 ; 2; 3] _NN0 _NN0 ----------------------------- 8) [3] coffee_NN1 shops_NN2: (1) [0.8480 ; 3; 5] _NN0 _NN0 (2) [-0.1699 ; 2; 3] _NN0 _NN0 ----------------------------- 9) [8] cold_JJ weather_NN1: (1) [0.1699 ; 3; 3] _AV0 _AJ0 _DE1 _NN0 (2) [0.1699 ; 3; 3] _AJ0 _DE1 _NN0 (3) [-1.5850 ; 2; 3] _AJ0 _NN0 ----------------------------- Figure 6: A sample of aligned English-Chinese nominal MWUs As shown in figure 6, for most of the English MWUs more than one candidate survives the filtering. But three of them, “Chinese culture”, “Chinese language teachers” and “cigarette rolling tobacco box” are precisely matched with unique candidates “ ” “ ” and “ ”9. Due to multiple translations, some English MWUs may have more than one true translation in Chinese. For example in figure 6, “Chinese Singaporean” is translated as both “ ” and “ ”. In this particular case, the English MWU and the two Chinese translations occurred for 16 , 7, and 12 times in the corpus. Of the Chinese candidates, “ ” co-occurred with the English MWU 6 times (in the same aligned English-Chinese sentence pairs) while “ ” co-occurred with the English MWU only four times. This shows that, in the corpus “Chinese Singaporean” is translated as “ ” more often than “ ”. Their MI-scores, 0.9475 and -1.5850 reflect the situation accurately. Considering that for a given nominal MWU more than one true alignment may exist, the process of identifying true and false alignments is somewhat complex. To represent this complexity, in the evaluation of the MWU alignment algorithm, two categories are used for ‘correct’ alignments, precise match and partial match. Precise match refers to an exact match between the source MWU (English in this case) and the top target MWU (Chinese in this case) in the candidate list. Partial match refers to cases in which alignments are approximate matches or where the correct translation is the second ranked item from the candidate list. For example in figure 6, Pair (1) is judged to be a pair of precise matches while pairs (2), (3) and (8) are judged to be partial matches. A manual examination revealed 99 precise matches, 55 partial matches and 37 mismatches among the total 191 potential MWU alignments. If precise matches only are taken into account, the precision of the technique is 51.83%; if both precise and partial matches are considered, the precision score increases to 80.63%. In both cases recall is significantly lower; recall is calculated by dividing the number of English MWUs with the number of resultant MWU alignments, giving a recall (100% ´ 191/626=) 30.51%. 4. Conclusion Bilingual/multilingual MWU alignment is a challenging but worthwhile task. An efficient and effective system for MWU alignment will be of use to a number of areas including bilingual/multilingual 9 All of these MWUs reflect major topics in the corpus. 474 contrastive studies, machine translation and multilingual lexicon building. Although much effort has been made in this area, no satisfactory solution has been found yet. 
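The alignment step of section 3.4 can be sketched as follows, reusing the mi_score and t_score helpers from the earlier sketch. The data layout, one pair of MWU sets per aligned sentence pair, is an assumption; the thresholds are those quoted above.

def contingency(eng_mwu, chi_mwu, sentence_pairs):
    # sentence_pairs: list of (set of English MWUs, set of Chinese MWUs),
    # one entry per aligned English-Chinese sentence pair of the corpus.
    a = b = c = d = 0
    for eng_set, chi_set in sentence_pairs:
        in_e, in_c = eng_mwu in eng_set, chi_mwu in chi_set
        if in_e and in_c:
            a += 1
        elif in_c:
            b += 1
        elif in_e:
            c += 1
        else:
            d += 1
    return a, b, c, d

def align(eng_mwu, chi_mwus, sentence_pairs, t_min=1.65, mi_min=-0.2):
    # Keep Chinese candidates passing both thresholds, sort by MI, take the top one.
    scored = []
    for chi in chi_mwus:
        a, b, c, d = contingency(eng_mwu, chi, sentence_pairs)
        if a == 0:
            continue
        mi, t = mi_score(a, b, c, d), t_score(a, b, c, d)
        if t < t_min or mi < mi_min:
            continue
        scored.append((mi, chi))
    scored.sort(reverse=True)
    return scored[0][1] if scored else None   # best candidate, or no alignment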
In this paper, we described a hybrid algorithm of English-Chinese MWU alignment which combines the n-gram approach, POS filters and co-occurrence coefficients. Given a sentence aligned English- Chinese parallel corpus, this algorithm automatically identifies and aligns nominal English and Chinese MWUs with high precision, but relatively low recall. As the result shows, it is a practical algorithm for extracting MWU alignments. So while a limited success has been achieved, it provides an inexpensive but practical tool for aligning nominal English-Chinese MWUs with a high degree of precision. Also, although not tested, the possibility exists that this algorithm could be ported to other language pairs by modifying the POS filters. References: Garside, Roger, Leech, Geoffrey and Sampson, Geoffrey 1987 The Computational Analysis of English, London, Longman. McEnery Tony, Langé Jean-Marc, Oakes Michael, Véronis Jean 1997 The exploitation of multilingual annotated corpora for term extraction. In Garside Roger, Leech Geoffrey, McEnery Anthony (eds), Corpus annotation --- linguistic information from computer text corpora, London & New York, Longman, pp 220-230. Piao Scott Songlin 2000 Sentence and word alignment between Chinese and English. Ph.D. thesis, Lancaster University. Smadja Frank 1993 Retrieving collocations from text: Xtract. In Computational Linguistics 19(1): 143- 177. Smadja Frank, McKeown Kathleen R., Hatzivassiloglou Vasileios 1996 Translating collocations for bilingual lexicons: a statistical approach. In Computational Linguistics 22(1): 1-38. Wermter Stefan, Joseph Chen 1997 Cautious steps towards hybrid connectionist bilingual phrase alignment. In Proceedings of International Conference Recent Advances in Natural Language Processing, Tzigov Chark, Bulgaria, pp 364-368. Wu Dekai 1997 Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. In Computational Linguistics 23(3): 377-401. Zhang Min, Li Sheng 1997 Tagging Chinese corpus using statistics techniques and rule techniques. In Proceedings of 1997 International Conference on Computer Processing of Oriental Languages (ICCPOL'97), Hong Kong, pp 503-506. 475 Appendix: Modified Zhang et al.'s Chinese tagset No. Part-of-Speech Classification (Eng) Part-of-Speech Classification (Chi) Original Tags Modified Tags 1 Punctuation mark w XX0 2 Idiom I IDM 3 Positional Noun s NNC 4 Pronoun r PN0 5 Verb v VV0 6 Directional f NND 7 Non-linguistic Code x ZZ0 8 Adverb d AV0 9 Suffix k SUF 10 Acronym j ACN 11 Preposition p PRP 12 Conjunction c CJ0 13 Classifier q NMW 14 Noun n NN0 15 Prefix h PRF 16 Determiner b PND 17 Temporal Noun t TIM 18 Numeral m MC 19 Exclamatory word e ITJ 20 Onomatopoeic word o OP0 21 Idiomatic Expression l FIX 22 Adjective a AJ0 23 Auxiliary of Mood y AS0 24 Morpheme g ELM 25 State word * U z AJ0 26 Auxiliary Word u AUX Added POS Categories 27 Proper Noun NP0 28 Modal Verb VM 29 Attribute Marker de “ ” DE1 30 Adverbial Marker de “ ” DE2 31 Complement Marker de “ ” DE3 32 Auxiliary suo “ ” SUO 476 Syntactic change in Abidjanee French Katja Ploog (ERSS a Bordeaux) Abidjanee popular French has acquired some notoriety in the scientific world during the seventies. But in spite of its very well-known sociolinguistic emergence context (a highly heterogeneous contact in urban area, French becoming mother-tongue), we still do not know the factors that condition its actual structures. 
Furthermore, research on variation, and especially on this non-standard variety, has generally not been interested in syntactic considerations. In fact, its study poses serious methodological problems because of its high variability. The aim of my communication is to show how it is possible to initiate a description of the syntactic change of Abidjanee French. The analysis is based on about 20 hours of spontaneous speech produced by young Abidjanees; the corpus has been fully transcribed into a database of 10,000 propositional units, systematically annotated (morphological form and syntactic constituents) and analysed. The description is based on a derivational framework, which conceives of structure elaboration as proceeding from a deep predicative structure. After a short inventory of the structure types gathered (concerning null arguments and the co-occurrence of clitics), my work sheds some light on the internal articulation of the Abidjanee system by focussing on the pragmatic elaboration of the Abidjanee verb and the structure of its actants. My communication tackles in detail the structural hypothesis of change. We notice that some of the reorganizations (for example the limiting of the preverbal clitic constituents) also exist in other spoken varieties of French and in French Creole languages. Even if it is quite impossible to prove this hypothesis, in this case we can imagine that the reasons for change are inherent in the target language. Some other structural particularities may derive from the various linguistic or cultural substrata, especially the subcategorisation of nouns, which is very close to that of West African languages. Furthermore, it may be asked whether a part of the change is due to universal cognitive and/or general discursive constraints, like the reanalysis of subject-verb agreement. HPSG-based syntactic treebank of Bulgarian (BulTreeBank) Kiril Simov, Gergana Popova1, Petya Osenova2 The BulTreeBank Project Linguistic Modelling Laboratory - CLPPI, Bulgarian Academy of Sciences Acad. G.Bonchev Str. 25A, 1113 Sofia, Bulgaria Tel: (+3592) 979 28 25, Fax: (+3592) 70 72 73 kivs@bgcict.acad.bg, gdpopo@essex.ac.uk, osenova@slav.uni-sofia.bg Our paper will be mostly dedicated to a project about to start at the LML, BAS. Its main objective is to create a high-quality set of syntactic structures of Bulgarian sentences within the framework of HPSG. We will discuss the methodology of the project with emphasis on the new aspects of the adopted approach, as well as its expected results and their applications. Methodology. An annotation scheme usually has to be theory-independent in order to allow different interpretations of the tagged texts in different linguistic frameworks. We think, however, that at a certain level of granularity (and the linguistic descriptions in the BulTreeBank will be very detailed in order to demonstrate the information flow in the syntactic structure) we will have to exploit some linguistic descriptions that are theory-dependent. We choose HPSG for the following reasons: (1) HPSG is one of the major linguistic theories based on rigorous formal grounds; (2) HPSG allows for a consistent description of linguistic facts on every linguistic level: phonetic and phonological, morphological, syntactic, even the level of discourse. 
Thus, it will ensure the easy incorporation of linguistic information which does not belong to the level of syntax if such is needed for the correct analysis of a given phenomenon; (3) HPSG allows for both integration and modularisation of descriptions and will therefore enable different experts to work on different parts or levels of analysis. (4) The formal basis of HPSG allows easy translation to other formalisms. We not only choose HPSG to be the linguistic theory within which we will explicate the syntactic structures, but make a step further and choose the actual logical formalism that we will use in the annotation process: namely, SRL for HPSG. For the annotation we will use descriptions called feature graphs. Such detailed descriptions will be extremely useful in the future exploitation of the Tree-Bank, but they might be difficult to use in the annotation process. Here we hope to use the (special) inference mechanisms of the logic and some of the HPSG principles in order to allow the annotator to provide only part of the needed information with the rest of it being inferred automatically. In order to minimise the necessary human intervention, we will exploit all possibilities to provide an automatic partial analysis of the input string before the actual annotation starts. We would also use the partial information entered by the annotator in order to predict or constrain the possible analyses in other parts of the whole description of the element. In this way we will exploit all the constraints available from pre-encoded grammars. Expected results. At the end of the project we expect to have a set of Bulgarian sentences marked-up with detailed syntactic information. These sentences will be mainly extracted from authentic Bulgarian texts. They will be chosen with two criteria in mind. First, they will have to cover the variety of syntactic structures of Bulgarian. Second, they should reflect the statistical distribution of these phenomena in real texts. A core set of sentences will be extracted to serve as a test-suite for software applications incorporating syntactic processing Bulgarian texts. The project should result also in a reliable partial grammar for automatic parsing of phrases in Bulgarian. This grammar will be extensively tested and used during the creation of the TreeBank. It will be used as a module separate from the TreeBank in tasks which require only partial parsing of natural language texts such as information retrieval, information extraction, data mining from texts and etc. Work on the TreeBank will require the creation of software modules for compiling, manipulating and exploring the data. This software will support both the creation of the TreeBank, and its use for different purposes such as automatic extraction of grammars for Bulgarian. 1 PhD student, Department of Language and Linguistics, University of Essex. 2 Also at the Bulgarian Language Division, Faculty of Slavonic Languages, St. Kl. Ohridsky University, Sofia, Bulgaria 477 Linguistic expressions as a tool to extract information1 Sylvie Porhiel – LaTTICe, LIMSI When searching for multiple words in a corpus database, we all have encountered the problem of not coming up with ‘the right thing'. Indeed, very often the results produced by the computer do not meet the analysis we would have produced as humans. In a text, in order to understand it correctly, we rely not only on the semantic content of the words but also on more grammatical means, which provide reading instructions. 
Those are the ones we will be dealing with here and we will use French prepositions (pour ce qui est de, en ce qui concerne, a propos de, au sujet de, etc.) which have the syntactic properties of being either detached and generally in an initial position or, of being dependent on another morphosyntactic constituent. This syntactic behaviour is of importance in the way in which information is processed and corresponds to two discoursal functions. When detached, these prepositions introduce a thematic (Charolles 1997) and mark an author's pragmatic intentions. When introducing a thematic they can integrate one or more propositions. When they depend on another constituent they simply focalise and do not have that integrative property. The ultimate aim of this ongoing analysis is to engineer a tool using linguistic expressions to extract thematic information. The purpose of this paper is three fold: to detail how linguistic information concerning the thematic introducers is captured; to show how they help in segmenting texts; and to give a technical overview of the system currently being developed by the LaLIC team, utilising this linguistic information. 1. Capture of linguistic knowledge about thematic introducers This section analyses the linguistic markers and captures the linguistic information related to them. The data collection goes through different steps. 1.1. Description of linguistic markers The first step consists of describing the linguistic markers identified as potential thematic introducers (from now on T). Four criteria have to be taken into account: · Morphological variations of the linguistic pattern which can be broken into: - number variations for the few that undertake them: au chapitre (de), aux chapitres (de), sur le chapitre (de), sur les chapitres (de) ‘in the matter of'; - aspectual variations: among those, only the variation used are declared, i.e.: en ce qui concerne ‘as far as... is concerned', en ce qui concernait, en ce qui concernera but not *en ce qui a concerné.2 · Case variation: all T can start a sentence and consequently start with an upper case but can also be after a word or a group of words, a comma or a semicolon, etc., in which case they start with a lower case. · The unbroken or broken structure of the T: written as follows a propos de, the constituents of which are only separated by spaces is an unbroken linguistic pattern, whereas a propos+de, the constituents of which are linked by the metacharacter ‘+’ is a broken linguistic pattern which matches: a propos notamment de, a propos tout particulierement de, etc. The elements which fit between the basic constituents of the T are called ‘insertions'. An analysis of the linguistic markers shows that there is before the base a single point of insertion in which case it is an adjective, and after the base there are two points of insertion, an adverb (a propos de) or an adjective and an adverb (sur le sujet de, sur le chapitre de). Pour ce qui est de accepts insertions such as: pour ce qui est bien sur de X ‘as far as of course X is concerned', pour ce qui est d'abord de X ‘as far as first X is concerned', pour ce qui est par exemple de X ‘as far as for example X is concerned'. 1 This research takes place within the project ‘Modele d'exploration sémantique de textes guidé par les points de vue du lecteur’ under the direction of G. Sabah. The LIMSI, the LATTICE, the CEA and the LaLIC of the CAMS contribute to it. I am indebted to J.L Minel and T. Samuels who reviewed this paper. 
2 Unlike the English as far as sth is/are concerned, en ce qui concerne does not undertake number variation. 478 · Paradigmatic variations, which concern the prepositions before and after the base and the determiners before the noun-base. Firstly, when chapitre is a constituent of a T it can be preceded by two prepositions a or sur: au chapitre, sur le chapitre; and concerne by either en, pour or pour tout: en ce qui concerne, pour (tout) ce qui concerne. Indeed, the more paradigmatic variations possible, the more linguistic patterns will have to be captured. Pour ce qui est de varying aspectually and paradigmatically, we currently have: pour/pour tout + ce qui + est/était/sera. Only a few T present preposition and determiner variations after the base: en ce qui touche (a) is an example of this. Secondly, determiners can vary and be definite, indefinite, and demonstrative. This raises two questions: a) should we consider all determiners for each T and b) should we mix the demonstratives with the others? Because of insertions, the definite and the indefinite article have three forms, respectively le3, l', les and un, des, d': sur les sujets délicats de X et Y; sur des sujets aussi délicats que ceux de X et de Y, sur un sujet aussi délicat que celui de Y, etc. As for demonstratives such as in sur ce chapitre and sur ce chapitre du chômage, they play a resumptive role in texts. This particular role is of importance in text segmentation, which explains why demonstratives are not part of the paradigm of the determiners. 1.2. Pointing towards the correct prepositional phrases Software can be said to have located the proper prepositional phrase, i.e. a T, when the located phrase possesses the following characteristics: it is syntactically initially placed and prefixed; it sorts information and places it in ‘boxes'; and it integrates one or more propositions, even a whole paragraph (Charolles 1997). This section deals with the recognition of the proper prepositional phrases and gives four scenarios which highlight the difficulties of locating the correct ones. The first three occur within the syntactic boundaries of the sentence and the fourth one exceeds this boundary. First, a system might encounter difficulties as far as syntactic segmentation is concerned. In (1) and (2), the phrase a propos de is placed initially and prefixed, meeting the syntactic properties of T: (1) A propos de la grenouille, qui va s'en occuper ? ‘Concerning the frog, who is going to look after it ?’ (2) A propos, de la rouge ou de la bleue, laquelle préferes-tu ? ‘By the way, between the red or the blue, which is your favourite ?’ Only the phrase in (1) is a T and the difference between the two phrases lies in the presence of a comma after propos. To be extracted the T have to be described correctly so that it accepts insertions but no comma. Second, the system might have to overcome lexical ambiguities resulting from the polysemy of the base of the T. Let's consider (3)-(5): (3) Au chapitre des insectes, il est incollable ‘On the matter of insects, he is impossible to catch out’ (4) Au chapitre des insectes, on sent que le scientifique est vraiment dans son élément. ‘In the chapter about insects, you feel that the scientist is in his element’ (5) Au chapitre des dépenses, les institutions répertorient (...) ‘In the section on expenditure, the institutions list (...)’ All au chapitre de are syntactically detached, followed by the same plural determiner and a noun, though only (3) contains a T. 
In French, chapitre has several meanings which happen to combine in the same syntactic environment and to be positioned syntactically alike. Nonetheless, their semantic compatibilities differ: chapitre when referring to a book can be followed by the topic and the author; chapitre, related to a budget is also followed by specific nouns (recette, dépenses, etc.) and may also have the name of an administration. As a result, to optimise the chance for a system to locate the right phrase, au chapitre de must not be found in combination with an author's name or a specific noun. Third, the positional characteristic has to be reviewed and the hypothesis of the initial position rephrased. In (6), though not initially, a propos de and au chapitre de are T as (1) and (3): (6) Dis donc, a propos de la grenouille, qui va s'en occuper ? ‘Hey, concerning the frog, who is going to look after it ?’ (7) Pourtant, au chapitre (des) insectes, il est incollable. ‘And yet, on the matter of insects, he is impossible to catch out’ Indeed, in a text, it is fairly common for T to be preceded by 1 to 3 groups of elements such as: spatial expressions, temporal expressions, articulative conjuncts, etc. Currently 24 groups which combine in different ways have been identified: Par exemple, au chapitre de X (...), Comme, par exemple, en ce qui concerne X (...), Ainsi, par exemple, en matiere de X (...), Il en va ainsi, par exemple, pour ce qui touche a X (...). Lastly, the fourth difficulty encountered by a system lies in the recognition of thematic utterances from truncated ones, when the sentence prototypically consists of the marker plus its complement. Two cases can be considered: first, the T is placed initially; second, it appears after another group of elements. (8)-(9) illustrate the configuration in which the T, strictly followed by their complements, are placed initially or are preceded by other elements (10): 3 All the noun base are masculine. 479 (8) Il me semble qu'il m'a raconté une anecdote. A propos de Staline. ‘It seems to me that he told an anecdote. About Stalin.’ (9) [§] A propos de démocratie: Jabotinski se définissait comme un libéral et défendait avec fermeté le systeme parlementaire. (...). ‘About democracy: Jabotinski defined himself as a liberal and battled in favour of a parliamentary system. (...)’ (10) J'aurais besoin de votre aide demain. Notamment en ce qui concerne cette note de service. ‘I will require your help tomorrow. ‘Particularly as far as this memo is concerned.’ Generally speaking, the prepositional phrase in truncated sentences and thematic ones share characteristics: it has no verb, it is not followed by a proposition, it can be preceded by a group of elements and ends either with [.], [:] or [...]. It also has differences: - the intonation goes down in truncated sentences (8) (10) and rises in thematic ones (9); - the referent is not present in the interdiscourse in truncated sentences whereas it is in thematic ones (9); - they tend to occupy different syntactic positions: in the middle of a paragraph with truncated sentences (8) (10), at the beginning of one with thematic sentences (9); - truncated sentences require the help of the linguistic context and depend on a morphosyntactic constituent in the preceding sentence (8) (10), and can introduce an answer; whereas thematic ones do not. 
- some groups of elements as well as punctuation tend to favour the interpretation of sentences as truncated or thematic ones: interjections favour a thematic reading, selective markers a truncated one (10); the configuration in which a T is followed by [...] favours a thematic reading: (Dis donc), a propos de femme... ‘(Hey) concerning women...’ - truncated sentences do not have an integrative property, thematic sentences do. 2. Thematic introducers and text segmentation This section is about the segmentation of texts in large corpora (Le Monde Diplomatique, Frantext) and deals with frame openers and closers; examples are used to illustrate these. 2.1. T as frame openers and text segmenters To use T as signals for text segmentation, it is necessary to consider their place in the text and their combination with other markers found in the cotext. As with time and locative adverbials (Virtanen 1992), a series of initial T can either appear within paragraphs (11) or start paragraphs (12). As such, they create text strategy continuity (Virtanen 1992): (11) [§]4 Marie s'est résolue a quitter sa maison et a aller vivre en appartement. Pour ce qui est de ses meubles, elle laisse son fauteuil a bascule a sa femme de ménage. Le secrétaire sera pour Ludovic. La salle a manger revient a sa niece. Enfin, elle gardera la télé. Concernant sa voiture, elle la donne a son petit fils.(...), etc. Quant a son chien, le voisin s'en occupera. ‘Marie brought herself to leave her house and to move into an apartment. As regards her furniture, she leaves her rocking chair to her cleaner. The secretaire is for Ludovic. The dining room suite goes to her niece. Lastly, she will keep the TV. As far as her car is concerned, she gives it to her grandchild (...) etc. As for his dog, the neighbour will take care of him.’ (12) [...] Cette situation a des conséquences décisives sur trois variables: X, Y et Z. ‘This situation has decisive consequences upon three variables: X, Y and Z: [§] En ce qui concerne X, (...) ‘As far as X is concerned, (...)’ [§] Quant a Y, (...) ‘Regarding Y (...)’ [§] Enfin, pour ce qui est de Z, (...) ‘Lastly, concerning Z (...)’ On the one hand, (11) starts with a sentence, which does not tell how the information will be broken down: the referents are implicit. As each referent is introduced by a T, pour ce qui est de, concernant, quant a, a frame opens while closing the previous one. The last T, quant a, signals the last item in the list. The frame instantiated by pour ce qui est de exemplifies the integrative property of T: the referent meubles ‘furniture’ is a hyperonym encompassing the different pieces of furniture listed. Enfin ‘lastly’ heralds firstly the end of the furniture list, and secondly the end of the scope of the T. (12) provides an example of a derived thematic progression: the paragraph finishes with an introducing sentence explicitly specifying that three variables X, Y and Z, will be reused and developed in the following paragraphs. Each three T, en ce qui concerne, quant a, pour ce qui est de, prefixed and at the beginning of a paragraph or following a connector (enfin), signals a shift from one variable to the next. The paragraph structure is of importance as the idented line tells the reader that s/he has just dealt with a sense unit and that s/he is going to start on a new one. 
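As a first approximation of how a series of T such as those in (11) and (12) can be used to segment a text, the following sketch opens a new frame at each sentence recognised as beginning with a T and closes the previously opened one; the recognition predicate is only a stand-in for the procedures of section 1, and the other frame closers discussed in 2.2 below are ignored:

def segment_into_frames(sentences, starts_with_T):
    """Return (start, end) sentence indices of thematic frames: a frame opens
    at each sentence recognised as beginning with a T and closes when the next
    one opens (or at the end of the text). Frame closers other than a new T
    (section 2.2) are ignored here."""
    openers = [i for i, s in enumerate(sentences) if starts_with_T(s)]
    frames = []
    for k, start in enumerate(openers):
        end = openers[k + 1] if k + 1 < len(openers) else len(sentences)
        frames.append((start, end))
    return frames

# Example (11), simplified: each new T closes the frame opened by the previous one.
sentences = [
    "Marie s'est résolue a quitter sa maison et a aller vivre en appartement.",
    "Pour ce qui est de ses meubles, elle laisse son fauteuil a bascule a sa femme de ménage.",
    "Le secrétaire sera pour Ludovic.",
    "Concernant sa voiture, elle la donne a son petit fils.",
    "Quant a son chien, le voisin s'en occupera.",
]
is_T = lambda s: s.startswith(("Pour ce qui est de", "Concernant", "Quant a"))
print(segment_into_frames(sentences, is_T))   # [(1, 3), (3, 4), (4, 5)]

In real texts, of course, the typographical paragraph and the cotextual markers are themselves part of the signal, as the discussion below makes clear.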
However, if the notion is to be of use as a segmentation indicator, it does not have to be overstressed: first because typographical paragraphs and thematic ones do not always coincide and second, because stylistic and balancing parameters enter the matter (Bessonnat 1988, Virtanen 1992). In both these examples T open a series of thematic frames which are examples of thematic strategic continuity. But each example is different as to the level of 4 [§] indicates the beginning of a paragraph. 480 structuration. In (11) a structuration indicates a shift from one particular point to another one within the paragraph in accordance with the first sentence; and in (12) the structuration is running in more than one paragraph5, indicating a thematic break each time. As different groups of elements are likely to precede T, it will be necessary to find out which ones play a part in the structuration of text. Here, the role of enfin ‘lastly’ has to be outlined. In (11), it signals the end of a list as well as the end of the integration under pour ce qui est de. In (12), it signals, with redundancy, the shift into the last variable; it also opens the last frame with the T pour ce qui est de. The scope of a connector such as enfin is different when within a paragraph or placed initially in a paragraph (Bessonnat 1988). To conclude, T are linguistic features which open new frames and consequently close the already instantiated one; they mark textual boundaries; they segment texts on different levels (within a paragraph with a linear thematic succession or across paragraphs with a hierarchical thematic one); they can be preceded by groups of elements, the place of which brings scope variation; and in addition the place of T indicates a textual organisation on the part of the text producers, which makes them reliable indicators for text summarisation. 2.2. Frame closers The opening boundary is relatively easy to identify as it overlaps with the presence of the T in a text. Things are different, though, as far as the closing boundary is concerned. Indeed, even if a T is at the beginning of a paragraph, it does not necessarily means that it integrates all the propositions of that paragraph. Markers likely to induce the closing of a thematic frame are for example: a) T themselves among which some tend to indicate the end of a list, quant a (11) (12), b) logical connectors (11) (13) (16), c) spatial and temporal adverbials (14), resumptive anaphors (15) (17a), discourse markers (17c), sentence adjuncts (17b), aspectual changes, typographical means, etc. (13) [...] Cette situation a des conséquences décisives sur trois variables: X, Y et Z. ‘This situation has decisive consequences upon three variables: X, Y and Z: [§] En ce qui concerne X, (...) ‘As far as X is concerned, (...)’ [§] Quant a Y, (...) ‘Regarding Y (...)’ [§] Enfin, Z, (...) ‘Lastly, Z (...) (14) [§] En ce qui concerne l'allocation des ressources, en Côte d'Ivoire, l'excédent (...). En Corée du Sud (...) [§] ‘[§] As far as the resource allowance is concerned, in Ivory Coast, the excess (...). In South Korea (...) [§]’ (15) [..] En ce qui concerne les déchets toxiques, les autorités affirment (...). Mais cet argument ne tient pas (...) ‘[§] As far as toxic waste is concerned, authorities claim (...) But such an argument does not hold (...)’ (16) A propos des sous-marins, les auteurs (...) A leurs yeux, la marine (...) ‘Concerning submarines, the authors (...) 
According to them, the Navy (...)'
(17) [§] S'agissant de la coopération, le président s'est félicité (...) 'As regards co-operation, the president was pleased (...)'
a) Cette coopération (...) 'This co-operation (...)'
b) Personnellement, nous aurions (...) 'Personally, we would have (...)'
c) Mais revenons a X (...) 'But to return to my point (...)'
These frame closers, all text segmenters interacting with one another and taking part in discourse management, have a double property: while indicating a break they also signal textual continuity, and hence coherence. In (13) the connector enfin, not followed by a T, closes the preceding frame and opens the last one in the derived thematic progression. (14) illustrates the case where more than one frame appears at the beginning of a sentence, one being subordinate to another. In this instance, en ce qui concerne l'allocation des ressources gives the general topic of the paragraph while the locative adverbials, en Côte d'Ivoire and en Corée du Sud, are subordinate and deal with particular points related to the topic. In (15)-(17) the presence of the writer is noticeable through resumptive anaphors preceded (or not) by a logical connector (15) (17a), sentence adjuncts indicating an attitude (17b), and discourse markers (17c) which signal the reintroduction of a previous topic of the discourse.
There are also cases where no frame closers appear on the textual surface. The linguistic tools listed so far are then of little use. A solution can be found in looking at changes in the text vocabulary. Such experiments have been conducted by Ferret, Grau and Masson (1998). They worked out two methods, whose outcome varies according to the type of text. Ideally, the combination of properly recognised linguistic tools (T and connectors) and a statistical method should optimise the results produced in the segmentation of text, as well as improve the thematic coherence of the summary.

3. Technical overview
This section describes the architecture of the ContextO software and gives a technical overview of how texts are processed.
5 Except with spatial or temporal adverbials, it is rare to find a text in which T run from the beginning to the end.

3.1 The architecture
The ContextO software runs under the FilText platform (figure 1) (Minel et al., forthcoming), both of which have been developed by the LaLIC team at Paris IV. It identifies specific semantic information in texts and extracts relevant sentences meeting the applied criteria. To achieve this, the system uses linguistic data declared in a database and contextual rules written according to the contextual exploration method (Desclés 1997). In ContextO, the database is used to store the relevant linguistic data needed to locate T. This linguistic knowledge is grouped into classes, and individual classes can be combined into compound classes in order to better describe markers. Here are examples of classes and of the sets of items belonging to them:
&marqueurs_liaison = {cependant, Cependant, enfin, Enfin}
&prep_de = {de, d', du, des}
&ponctuation = {,}
&a_propos_min = {a propos, en fait}
&introducteurs_thématiques = {&avant_motbase_pdg_en_maj+ce qui +&vb_etre, &avant_motbase_pdg_en_maj+ce qui +&vb_concerne}6
Contextual rules are contained in Java methods and use the class information to process texts. The notion of context used here is specific to the method: a context is determined by indicators (here T) and consists of sentences which are not necessarily adjacent. It is in that delimited space that constraints are applied.
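To give a rough, language-neutral idea of how such class declarations and contextual rules fit together (the actual rules are Java methods inside ContextO), the two example rules given just below, R1 and R2, might be paraphrased as follows; the class contents simply copy the declarations above, and &a_propos_maj is assumed to hold the upper-case counterpart of &a_propos_min:

# The class declarations above rendered as Python sets (a paraphrase of the
# ContextO knowledge base, not its actual format).
marqueurs_liaison = {"cependant", "Cependant", "enfin", "Enfin"}
prep_de = {"de", "d'", "du", "des"}
ponctuation = {","}
a_propos_maj = {"A propos"}   # assumed upper-case counterpart of &a_propos_min

def rule_R1(sentence):
    """R1: a propos ... de is a T if placed initially, not followed by a comma,
    and with an item of &prep_de within five words (allowing insertions)."""
    for form in a_propos_maj:
        if sentence.startswith(form):
            rest = sentence[len(form):]
            if rest.lstrip().startswith(tuple(ponctuation)):
                return False
            return any(w.strip(",;") in prep_de for w in rest.split()[:5])
    return False

def rule_R2(sentence):
    """R2: en ce qui concerne/concernait/concernera is a T if an item of
    &marqueurs_liaison occurs in its left context."""
    position = sentence.find("en ce qui concern")
    if position == -1:
        return False
    return any(m in sentence[:position] for m in marqueurs_liaison)

def tag_frames(sentences):
    """Tag a sentence 'frame' if either rule fires (cf. the processing overview in 3.2)."""
    return [(s, rule_R1(s) or rule_R2(s)) for s in sentences]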
For example, to locate the T a propos de and en ce qui concerne in a text, the rules R1 and R2 have to be written:
R1 extracts sentences in which no comma appears after a propos but still allows for insertions between a propos and de:
R1: Items belonging to &a_propos_maj = T
IF placed initially
AND IF &a_propos_maj is not followed by an item of &ponctuation
AND IF &prep_de appears within 5 words
R2 extracts sentences in which &introducteurs_thématiques are preceded by enfin or cependant:
R2: Items belonging to &introducteurs_thématiques = a T
IF in the left context there is an item belonging to &marqueurs_liaison
[Figure 1. FilText Architecture — textual units; task clues and contextual rules; contextual exploration engine; specialised agents; linguistic knowledge database]
6 This implies that the mentioned classes, as well as the items pertaining to them, have been declared beforehand: &vb_concerne = {concerne, concernait, concernera}, &vb_etre = {est, était, sera}, &avant_motbase_pdg_En_maj = {En, Pour, Pour tout}. Only the relevant tenses of the verbs are declared.

3.2 Processing overview
The following sentences, used as a corpus, will show how a text is actually processed by the ContextO software when the specific task 'recognition of thematic frames' has been chosen: A propos de la grenouille, qui va s'en occuper ?; A propos, de la rouge ou de la bleue, laquelle préferes-tu ?; A propos notamment des chemins de fer, l'État a décidé (...); Cependant, en ce qui concernait ce probleme particulier, on aurait pu (...). The text is first segmented into its component sentences. Each sentence is then analysed by applying the individual rules. The results of applying our example rules R1 and R2 to the corpus are as follows:

Sentence                                                                        R1      R2
A propos de la grenouille, qui va s'en occuper ?                                True    False
A propos, de la rouge ou de la bleue, laquelle préferes-tu ?                    False   False
A propos notamment des chemins de fer, l'État a décidé (...)                    True    False
Cependant, en ce qui concernait ce probleme particulier, on aurait pu (...)     False   True

Sentences for which either rule R1 or R2 is true receive a 'frame' tag and are thus extracted.
The approach presented here relies on linguistic knowledge to analyse texts and, more specifically, to segment texts automatically, independently of any particular field of application. We first detailed the analytical steps required for a system to identify a T rather than a mere string of matching characters. The second part focused on frame openers and closers: T can be used to open text segments which can be closed, for example, by other linguistic tools. As such linguistic tools are not always available on the textual surface, it is suggested that combining the use of T with statistical methods would lead to more promising results and should improve the thematic coherence of summarised texts. The last part gave a technical overview of the implementation of the captured linguistic knowledge in the ContextO software developed by the LaLIC team. The next step of this research will consist in extending the list of T, refining the contextual rules and testing these rules on full texts.

References
Bessonnat D 1988 Le découpage en paragraphes et ses fonctions. Pratiques 57: 81-105.
Charolles M 1997 L'encadrement du discours - Univers, champs, domaines et espace. Cahier de recherche linguistique 6.
Desclés JP 1997 Systeme d'exploration contextuelle.
Co-texte et calcul du sens, Presses Universitaires de Caen, pp. 215-232. Ferret O, Grau B, Masson N 1999 Thematic segmentation of texts: two methods for two kinds of texts. In Proceedings of the ALC-COLING'98, Montréal, pp. 392-396. Minel JL, Descles J.P. 2000 Résumé automatique et filtrage des textes. In Pierrel J.M. (ed), Ingénierie des langues. Paris, Hermes, pp. 253-270. Minel et al. (forthcoming) Résumé automatique par filtrage sémantique d'informations dans des textes. Présentation de la plate-forme FilText. Technique et science informatique 3. Virtanen T. 1992 Discourse Functions of Adverbial Placement in English. Abo, Abo Akademi University Press. 483 Theoretical Issues for Corpus Linguistics Raised by the Study of Ancient Languages Stanley E. Porter and Matthew Brook O'Donnell University of Surrey Roehampton and OpenText.org Corpus linguistics has focused largely upon the analysis of modern languages, usefully compiling large corpora for linguistic analysis. In the course of our study of ancient Greek from a corpus-based perspective, a number of issues regarding corpus linguistics have come to the fore. In this paper we wish to highlight and discuss three of these issues. (1) Corpus size and compilation criteria The potential size of a corpus of a modern language, such as English, is virtually infinite, limited perhaps only by storage capacity and compiler effort. These limitations are less and less significant due to developments in technology and automated corpus building. However, the situation is very different for the study of a language such as ancient Greek, where the corpus size has been limited by historical accident. The number of texts is restricted to those that have for some reason been preserved from the ancient world. Early studies in corpus linguistics, faced with technological limitations, were forced to address issues of representativeness and sampling, similar to those facing compilers of a corpus of ancient texts. More recent corpus studies have tended to emphasize the size of the corpus over the contours of its composition. However, significant linguistic results, such as those from Svartvik's study of the English voice system, can result from a carefully compiled corpus of limited size. (2) Annotation and levels of analysis In the light of the limited size of the available corpus of ancient Greek, it is necessary to include greater detail and levels of linguistic annotation in the corpus than is common in large modern language corpora. A limited corpus demands that as much information as possible be garnered from the available data. This involves a much more detailed analysis than the general lexical and morphological patterning that takes place in large-scale corpus studies. In order to facilitate this study, we have had to develop a consistent system of annotations of higher linguistic levels. This has not only generated much useful data, but forced us to examine a number of issues previously unaddressed by corpus annotators. (3) Analysis of texts One of the strengths of corpus linguistics has been the observation of linguistic patterns across large samples of text. In contrast, the textual orientation of Classical and New Testament scholarship requires close attention to the linguistic features and function of individual documents as well as overall patterns in the language. This requires sensitivity to and the analysis of contextual and co-textual features and patterns. 
This micro-analysis is more suitable to this kind of limited corpus, and has provided useful data for studying various text-types and created new possibilities for the use of discourse analysis in corpus-based studies. UNIVERSITY CENTRE FOR COMPUTER CORPUS RESEARCH ON LANGUAGE Technical Papers Volume 13 - Special issue. Proceedings of the Corpus Linguistics 2001 conference edited by Paul Rayson, Andrew Wilson, Tony McEnery, Andrew Hardie and Shereen Khoja. ISBN 1 86220 107 2. Lancaster University (UK), 29 March - 2 April 2001 ii Table of contents Preface vi Bas Aarts, Evelien Keizer, Mariangela Spinillo & Sean Wallis: Which or what? A study of interrogative determiners in present-day English 1 Anne Abeillé, Lionel Clément, Alexandra Kinyon & François Toussenel: The TALANA annotated corpus for French : some experimental results 2 Annelie Ädel: On the search for metadiscourse units 3 Karin Aijmer: Discourse particles in contrast 13 Takanobu Akiyama: John is a man of (good) vision: enrichment with evaluative meanings 14 Jean-Yves Antoine & Jérôme Goulian: Word order variations and spoken man-machine dialogue in French : a corpus analysis on the ATIS domain 22 Dawn Archer and Jonathan Culpeper: Sociopragmatic annotation: New directions and possibilities in Historical Corpus Linguistics 30 Eric Atwell & John Elliott: A Corpus for Interstellar Communication 31 Manuel Barbera: From EAGLES to CT tagging: a case for re-usability of resources 40 Margareta Westergren Axelsson & Angela Hahn: The use of the progressive in Swedish and German advanced learner English: a corpus-based study 45 Anja Belz: Optimisation of corpus-derived probabilistic grammars 46 Ylva Berglund & Oliver Mason: "But this formula doesn't mean anything!" 58 Roumiana Blagoeva: Comparing cohesive devices: a corpus based analysis of conjunctions in writen and spoken learner discourse 59 Hans Boas: Frame Semantics as a framework for describing polysemy and syntactic structures of English and German motion verbs in contrastive computational lexicography 64 Rhonwen Bowen: Nouns and Their Prepositional Phrase Complements in English 74 Ted Briscoe: From dictionary to corpus to self organising dictionary: Learning valency associations in the face of variation and change 79 Estelle Campione & Jean Véronis: Semi-automatic tagging of intonation in French spoken corpora 90 Pascual Cantos-Gomez: An attempt to improve current collocation analysis 100 Roldano Cattoni, Morena Danieli, Andrea Panizza, Vanessa Sandrini & Claudia Soria: Building a corpus of annotated dialogues: the ADAM experience 109 Frantisek Cermák, Jana Klimová, Karel Pala, Vladimir Petkevic: The Design of Czech Lexical Database 119 Ngoni Chipere, David Malvern, Brian Richards & Pilar Duran: Using a corpus of school children's writing to investigate the development of vocabulary diversity 126 Claudia Claridge: Approaching Irony in Corpora 134 Niladri Sekhar Dash and Bidyut Baran Chaudhuri: Corpus based Empirical Analysis of Form, Function and Frequency of Characters used in Bangla 144 Liesbeth Degand & Henk Pander Maat: Contrasting causal connectives on the Speaker Involvement Scale 158 Anne Le Draoulec & Marie-Paule Péry-Woodley: Corpus based identification of temporal organisation in discourse 159 Stefan Evert & Anke Lüdeling: Measuring morphological productivity: Is automatic preprocessing sufficient? 167 iii Cécile Fabre & Didier Bourigault: Linguistic clues for corpus-based acquisition of lexical dependencies 176 Richard Foley: Going out in style? 
Shall in EU legal English 185 Anna-Lena Fredriksson: Translating passives in English and Swedish: a text-linguistic perspective 196 Cécile Frérot, Géraldine Rigou & Annik Lacombe: Phraseological approach to automatic terminology extraction from a bilingual aligned scientific corpus 204 Robert Gaizauskas, Jonathan Foster, Yorick Wilks, John Arundel, Paul Clough, Scott Piao: The METER Corpus: A corpus for analysing journalistic text reuse 214 Hatem Ghorbel, Afzal Ballim, & Giovanni Coray: ROSETTA: Rhetorical and semantic environment for text alignment 224 Solveig Granath: Is that a fact? A corpus study of the syntax and semantics of the fact that 234 Benoît Habert, Natalia Grabar, Pierre Jacquemart, Pierre Zweigenbaum: Building a text corpus for representing the variety of medical language 245 Eva Hajicova & Petr Sgall: A reusable corpus needs syntactic annotations: Prague Dependency Treebank 255 David Hardcastle: Using the BNC to produce dialectic cryptic crossword clues 256 Daniel Hardt: Comma Checking in Danish 266 Raymond Hickey: Tracking lexical change in present-day English 272 Diana Hudson-Ettle, Tore Nilsson & Sabine Reich: Orality and noun phrase complexity: a corpusbased study of British and Kenyan writing in English 273 Nancy Ide & Catherine Macleod: The American National Corpus: A Standardized Resource for American English 274 Reiko Ikeo: The Positions of the Reporting Clauses of Speech Presentation with Special Reference to the Lancaster Speech, Thought and Writing Presentation Corpus 281 Wang JianDe, Chen ZhaoXiong & Huang HeYan: Multiple-Level Knowledge Discovery from Corpus 289 Yu Jiangsheng & Duan Huiming: POS Estimation of Undefined Chinese Words 290 Steven Jones: Corpus Approaches to Antonymy 297 Beom-mo Kang & Hung-gyu Kim: Variation across Korean Text Registers 311 Przemyslaw Kaszubski: Tracing idiomaticity in learner language - the case of BE. 312 Yuliya Katsnelson & Charles Nicholas: Identifying Parallel Corpora Using Latent Semantic Indexing 323 Hannah Kermes & Stefan Evert: Exploiting large corpora: A circular process of partial syntactic analysis, corpus query and extraction of lexicographic information 332 Shereen Khoja, Roger Garside & Gerry Knowles: A Tagset for the Morphosyntactic Tagging of Arabic 341 Adam Kilgarriff: Web as corpus 342 Dimitrios Kokkinakis: A Long-Standing Problem in Corpus-Based Lexicography and a Proposal for a Viable Solution 345 Julia Lavid: Using bilingual corpora for the construction of contrastive generation grammars: issues and problems 356 Maarten Lemmens: Tracing referent location in oral picture descriptions 367 Barbara Lewandowska-Tomaszczyk, Michael Oakes & Paul Rayson: Annotated Corpora for Assistance with English-Polish Translation 368 Juana Marín-Arrese, Elena Martinez-Caro & Soledad Pérez de Ayala Becerril: A corpus study of impersonalization strategies in newspaper discourse in English & Spanish 369 iv Mikhail Mikhailov & Miia Villikka: Is there such a thing as a translator's style? 
378 Hermann Moisl & Joan Beal: Corpus analysis and results visualisation using self-organizing maps 386 Martina Möllering: Pragmatic and discursive aspects of German modal particles: a corpus-based approach 392 Rachel Muntz: Evidence of Australian cultural identity through the analysis of Australian and British corpora 393 P-O Nilsson: Investigating characteristic lexical distributions and grammatical patterning in Swedish texts translated from English 400 Julien Nioche & Benoît Habert: Using feature structures as a unifying representation format for corpora exploration 401 Matthew Brook O'Donnell, Stanley E. Porter & Jeffrey T. Reed: OpenText.org: the problems and prospects of working with ancient discourse 413 Maeve Olohan: Spelling out the Optionals in Translation: A Corpus Study 423 Constantin Orasan: Patterns in scientific abstracts 433 Gabriella Kiss and Júlia Pajzs: An attempt to develop a lemmatiser for the Historical Corpus of Hungarian 443 Byungsun Park & Beom-mo Kang: Korean grammatical collocation of predicates and arguments 452 Anselmo Penas, Felisa Verdejo & Julio Gonzalo: Corpus-Based Terminology Extraction Applied to Information Access 458 Scott Songlin Piao & Tony McEnery: Multi-word unit alignment in English-Chinese parallel corpora 466 Katja Ploog: Syntactic change in abidjanee French 476 Sylvie Porhiel: Linguistic expressions as a tool to extract information 477 Stanley E. Porter & Matthew Brook O'Donnell: Theoretical Issues for Corpus Linguistics Raised by the Study of Ancient Languages 483 Helena Raumolin-Brunberg: Temporal aspects of language change: what can we learn from the CEEC? 484 Andrea Reményi: Use logbooks and find the original meaning of representativeness 485 Antoinette Renouf: The Web as a Source of Linguistic Information 492 Jérôme Richalot: The influence of the passive on text cohesion and technical terminology 493 Rema Rossini Favretti, Fabio Tamburini & Cristiana De Santis : CORIS/CODIS: A corpus of written Italian based on a defined and a dynamic model 512 Hans-Jörg Schmid: Do women and men really live in different cultures? Evidence from the BNC 513 Josef Schmied: Exploring the Chemnitz Internet Grammar: Examples of student use 514 Mark Sebba & Susan Dray: Is it Creole, is it English, is it valid? Developing and using a corpus of unstandardised written language 522 Mirjam Sepesy Maucec & Zdravko Kacic: Language Model Adaptation For Highly-Inflected Slovenian Language In Comparison To English Language 523 Noëlle Serpollet: The mandative subjunctive in British English seems to be alive and kicking… Is this due to the influence of American English? 531 Serge Sharoff: Through the looking glass of parallel texts 543 Kiril Simov, Zdravko Peev, Milen Kouylekov, Alexander Simov, Marin Dimitrov, Atanas Kiryakov: CLaRK - an XML based system for corpora development 553 Kiril Simov, Gergana Popova & Petya Osenova: HPSG-based syntactic TreeBank of Bulgarian (BulTreeBank) 561 Simon Smith & Martin Russell: Determining query types for information access 562 v Josef Szakos & Amy Wang: Not last, even if least: Endangered Formosan aboriginal languages and the corpus revolution 571 Elke Teich & Silvia Hansen: Methods and techniques for a multi-level analysis of multilingual corpora 572 Dan Tufis & Ana-Maria Barbu: Accurate automatic extraction of translation equivalents from parallel corpora 581 Tamás Váradi: The Linguistic Relevance of Corpus Linguistics 587 Serge Verlinde & Thierry Selva: Corpus-based vs intuition-based lexicography. 
Defining a word list for a French learner's dictionary 594 Jean Véronis: Sense tagging: does it make sense ? 599 Adriana Vlad, Adrian Mitrea, & Mihai Mitrea: A Corpus-Based Analysis of How Accurately Printed Romanian Obeys to Some Universal Laws 600 Martin Volk: Exploiting the WWW as a corpus to resolve PP attachment ambiguities 601 Martin Weisser: A corpus-based methodology for comparing and evaluating different accents 607 Anne Wichmann & Richard Cauldwell: Wh-Questions and attitude: the effect of context 614 Kay Wikberg: His breath a thin winter-whistle in his throat: English metaphors and their translation into Scandinavian languages 615 Karen Wu Rongquan: Public Discourse as the mirror of ideological change: A keyword study of editorials in People's Daily 616 Richard Xiao Zhonghua: A Corpus-Based Study of Interaction Between Chinese Perfective -le and Situation Types 625 vi Preface All the papers in this collection are based upon talks or poster presentations given at the Corpus Linguistics (CL2001) conference, held at Lancaster University between 29th March and 2nd April 2001 organised by members of UCREL, from the Departments of Linguistics & Modern English Language and Computing. The conference attracted over 100 participants from the language engineering and corpus linguistics communities in over 20 countries world-wide. The presentations represented a truly impressive rainbow of languages in corpus research, ranging from ancient languages, eastern and western European languages, to Semitic and Asian languages. One of the aims of CL2001 was to celebrate the works of Geoffrey Leech who reached 65 in 2001. Geoffrey, a pioneer in the construction and exploitation of machine-readable corpora, has had considerable influence in many areas of linguistics over the years. A selection of papers from the conference will appear in an edited collection to be published in honour of Geoffrey Leech. Paul Rayson Andrew Wilson Tony McEnery Andrew Hardie Shereen Khoja Lancaster University, March 2001. 484 Temporal aspects of language change: what can we learn from the CEEC? Helena Raumolin-Brunberg, University of Helsinki My paper discusses the time course of a number of morphosyntactic changes in Renaissance English. It is surprisingly seldom that linguistic changes have been attributed a more accurate timing than a ‘full’ period, such as Late Middle English or Early Modern English. My study will give a more precise timing to the diffusion of the changes under scrutiny among the population of England. Both the macro- and micro-level will be taken into account and changes presented as S-curves. In addition to timing, the following questions will be raised on the macro-level. How should we define the beginning and end of a change? At what rate do changes proceed? What are the factors that play a role in the progression of a change? On the micro-level, individual speakers are in focus. How do individuals behave in relation to ongoing changes? Do they change their language during their lifetimes? If they do, where shall we look for reasons? The data are retrieved from the Corpus of Early English Correspondence (CEEC), compiled at the University of Helsinki by the project ‘Sociolinguistics and Language History'. The CEEC contains around 6000 letters from 1417-1681, forming a corpus of 2.7 million words especially designed for studies in historical sociolinguistics. 
The following changes will be dealt with: replacement of subject YE by YOU, third person singular suffix -TH versus -S, loss of multiple negation, object of gerund constructions, possessives MINE and THINE versus MY and THY, and introduction of possessive ITS. References Nevalainen, Terttu & Helena Raumolin-Brunberg (eds.) 1996. Sociolinguistics and Language History. Studies Based on the Corpus of Early English Correspondence. Amsterdam/Atlanta GA: Rodopi. Nevalainen, Terttu & Helena Raumolin-Brunberg (in preparation). Historical Sociolinguistics. Longman. 485 Use logbooks and find the original meaning of “representativeness” Andrea Ágnes Reményi Research Institute for Linguistics, Hungarian Academy of Sciences Budapest, Hungary (remenyi@nytud.hu) Debates about overall problems in general synchronic corpus design seem to have settled. Yet, solutions whether and how to pre-structure one's target population and corpus are still based either on practical considerations of comparability or on intuitive proportions, partly due to an exclusively textual view of representativeness, and partly to the fact that designers deny the possibility to estimate the relative distribution of texts among media, genres, registers, etc. of a language. In this paper first general problems are tackled again: what do ‘representativeness', ‘balance’ and ‘influentialness’ mean? In the second part a relatively simple and cost-efficient method to estimate those distributions is described, and results of a two-step pilot study are analysed. Finally I will suggest how both textual and demographic representativeness can be controlled in a modular corpus. General synchronic corpora wish to grasp the totality of a language in some sense. Electronic corpora support reliable quantitative studies only if the sample selected from this totality represents the totality as fully as possible. Leech (1991) states that “a corpus is ‘representative’ in the sense that findings based on an analysis of it can be generalised to the language as a whole or a specified part of it” (as cited by Kennedy 1998: 62). In my view a sample can be called ‘representative’ only if it aims to reproduce the statistical variance of the population to the highest possible degree – in terms of the distribution of text types and of linguistic features (cf. Biber 1993: 243), but not excluding demographic factors. 1 Theoretical considerations The requirement of representativeness poses a major difficulty in corpus design, as designers are supposed to find valid bases to delimit the concept of “every text in the given language”, i.e. the target population, on the one hand, and to find the most reliable sampling methods to select from that, on the other. 1.1 Statistical sampling. Branches of social science applying statistical sampling and inference most often employ either random sampling or stratified random sampling methods. In the former case every member of the target population has an equal chance to appear in the sample. When the latter method is applied, researchers previously establish a few basic categories considered by the researchers/the research community/the research paradigm to be structuring the target population in some essential way (for example sex, age, schooling, location; medium, genre, domain, etc.), and random sampling is achieved only within these ‘strata'. 1.2 Stratified random sampling. 
When defining the target population of a general synchronic corpus, designers must also start by deciding which sampling method to apply, that is, whether or not they should pre-structure the population by a classification of text types or language users, a structure that is maintained in the sample. Johansson (1980:26), for example, supports a textual stratification over simple random sampling, stating that “the true ‘representativeness’ of the LOB Corpus arises from the deliberate attempt to include relevant categories and subcategories of texts rather than blind statistical choice. Random sampling simply ensured that, within the stated guidelines, the selection of individual texts was free of the conscious or unconscious influence of personal taste or preference.” Note, however, that this is not only a methodological, but also a conceptual problem, as ‘strata’ of the population are pre-defined by the researchers, necessarily reflecting what they consider to be the most essential structuring factors of that population: textual, demographic or other factors.1 1 Apart from this, the most obvious advantage of stratified random sampling over simple random sampling is that it requires a smaller sample. 486 1.3 What are the units of observation? The basis of this meta-theoretical issue is the double nature of the units of observation in corpus linguistics. While in sociology the units of observation are mostly unambiguous (individual human beings), this is not so in corpus design. Should language users (text producers and receivers) or texts (the products of language use) be chosen as the units of observation? While census or similar statistical data about the totality of language users of a country are easily accessible in most countries, no similar methods have been developed to estimate either the totality of the target population of texts or the distribution of text types within it. Thus in text-based sampling the basis of the composition and proportioning of ‘strata', i.e. text types, is statistically unjustifiable. To my knowledge, most megacorpora (Brown, LOB, Cobuild, BNC, ICE, Longman, etc.) are structured according to medium, register, genre, domain, discourse function and/or subject in a way that proportions of these text types are determined by the practical consideration of balance, that is, by including comparable amounts of texts within subcategories. On the other hand, corpora organised by demographic proportions would not support the criterion of ‘sample variability matching population variability’ as far as text types are concerned. Biber is right in stating that “a corpus with this design might contain roughly 90% conversation and 3% letters and notes, with the remaining 7% divided among registers such as press reportage, popular magazines, academic prose, fiction, lectures, news broadcasts, and unpublished writing. [...] Such a corpus would permit summary descriptive statistics for the entire language represented by the corpus. These kinds of generalisations, however, are typically not of interest for linguistic research” (1993: 247). Some linguistic research, however, e.g. sociolinguistics or L2-lexicography and textbook methodology, may find interest in a demographically structured corpus. To sum up, the problems of ‘representativeness’ are mostly due to the double nature of the unit of observation in corpus design: either the diversity of language users, or that of text types is eclipsed. 1.4 A combination of the two types of units of observation. 
Among the above mentioned corpora, the spoken component of the British National Corpus (BNC) tried to solve this dilemma, pairing text-based and demographic sampling methods. In the text-based (‘context-governed') part of the corpus both major and minor a priori categories were proportioned to be balanced (four equal-sized contextually based categories were established, each divided into 40 per cent monologue and 60 percent dialogue) (Burnard 1995: 23), consonant with the proportioning of the written component. In the demographically sampled part of the BNC's spoken component 124 individuals were recruited based on random location sampling, asked to carry a tape recorder and to record all their conversations for 2-7 days. “Recruits were chosen in such a way as to make sure there were equal numbers of men and women, approximately equal numbers from each age group, and equal numbers from each social grouping.” (BNC Online 1997.) Note that stratified sampling was performed only in terms of location, but not in terms of the other basic demographic variables, e.g. sex, age, and socio-economic group,2 i.e. the proportions of age groups or socioeconomic groups did not follow those in the UK population. Thus, while the BNC has a complex structure of ‘strata', stratified sampling according to these basic demographic variables within none of them is achieved. The conversation subcorpus of the Longman Spoken and Written English Corpus was also sampled on demographic bases: “a set of informants was identified to represent the range of English speakers in the country (UK or USA) across age, sex, social group, and regional spread” (Biber et al. 1999: 29), but the informants may not have been randomly selected for that corpus, either. 1.5 Is it possible to estimate the distribution? Designers usually dismiss the possibility to take the daily distribution of spoken and written medium, domain, genre, register, discourse function or subject variation into account (e.g. Biber 1993:247, Burnard 1995: 20, Kennedy 1998: 63). There are direct methods developed to obtain the production and indirect methods to estimate the reception of published written and internet texts (book and periodical publication lists; best seller, library lending and periodical circulation statistics; internet search engines and click-on per internet page measures, respectively). Direct quantitative methods to assess the reception of published written texts, and the production and reception of unpublished written texts have not been found, and neither are there objective measures to define the 2 Tamás Váradi directed my attention to this point. 487 target population of the spoken medium available. Kennedy writes: “No one knows what proportion of the words produced in a language on any given day are spoken or written. Individually, speech makes up a greater proportion than does writing of the language most of us receive or produce on a typical day. However, […] a broadcast conversation on radio or television will reach many more ears than a commercial encounter involving just a customer and a salesperson. Within a written corpus, balance is equally intractable. […] How to get a balance between the few writers and speakers who are prestigious and the great majority of text producer and speakers who have no special claim to fame is not simple” (1998: 63). 
Biber also stresses the factor of influentialness: “proportional samples are representative only in that they accurately reflect the relative numerical frequencies of registers in a language — they provide no representation of relative importance that is not numerical” (Biber 1993:247-8; the author shifts the problem to culture studies, similarly to Sinclair 1991: 13). These authors seem to be mixing up the concept of ‘representativeness’ with the concept of an external factor, that of ‘influentialness'. 1.6 ‘Influentialness’ As I have stated, a general corpus should be as diverse as possible to fulfil the statistical axiom that every effort is to be made to reproduce the variance of the population in the sample. But the noncontrolled early introduction of influentialness rules out the possibility of the study of variable influentialness of texts later, in the analysing stage (as, for example, in a sociolinguistic investigation about the factors of certain sayings becoming proverbs, while others not). An example from sociology may illustrate my point. Network analysts can undoubtedly suppose that certain members of the population are more influential, and their sampling methods support the study of the effect of this influentialness as a variable. It would be a mistake to collect data only of the ‘influential’ members of society the same way as it would have been a mistaken move by early demographers and sociologists to include only those individuals (i.e. well-educated upper-/middleclass men over, say, 40 years of age) in their samples, but, fortunately, the tradition never developed in that discipline. 2 The logbook method How can we estimate the daily distribution of medium, domain, genre, register, discourse function and/or subject variation in the population of a given language satisfactorily? Data collection based on tabular format logbooks filled in by a demographically representative sample of (adult) speakers yields a reliable picture of these distributions. The logbook consists of sheets of tables informants are asked to carry along their daily activities, and to fill in rows of cells whenever they use language, excluding self-talk and meta-notes (‘filling in the logbook'). Information is to be collected about: · the duration of the activity (in minutes or seconds) · whether the informant is the producer, a receiver or a third party (role-changes indicated in the same row) · the approximate number of participants (if known) · brief demographic details of other participants (if assessable) · the medium (spoken or written) · the setting (home, office, street, shop, church, etc.) · genre (based on a list of possible genres, but extendible). Other factors (e.g. the domain, or the subject or aim of language use) can be added, controlled the same way by extendible lists given in the short guide that is fastened to the booklet of the fill-in tables. A more detailed manual must be given to each informant, who must also be shortly trained. Data-collection per informant must take two to seven days, at least two days when the informants’ activities are characteristically different, e.g. a weekday and a weekend day. To develop the logbook method, I have conducted two pilot studies. 2.1 Pilot study 1 First a small pilot study was carried out to figure out if the logbook method with the structure described above was at all feasible. Two individuals took notes while travelling on public transport for a few days. 
The procedure was executable, and, apart from notes on conversations with acquaintances and strangers, greetings, chance remarks to strangers (e.g. ‘Sorry!'), newspaper and book reading, the 488 study yielded the yet unacknowledged genres of browsing mega-posters, eavesdropping on others’ conversation, reading shopping lists, etc. 2.2 Pilot study 2 The seven informants participating in this data collection (five females and two males) filled in the logbook tables for a full day whenever they used Hungarian, starting either early morning and finishing when going to sleep, or starting at any time on one day, and finishing at the same time the following day. Data was produced about three weekdays and three weekend days (the seventh informant started on Sunday, and finished on Monday). The logbook-booklet included several sheets of A4 tables (see the format below), fastened together with a short guide. (For the guide and a list of abbreviations, see the Appendix.) Informants were also verbally instructed how to fill in the tables, what to watch for, and were informed about the aim of the study. Informant:................................. Date: ....................................... Started at (hour, minute): ........................ duration You (P/R/3) No. of Rs3 the others wr/sp setting genre (subject) (aim) A new row in the table was to be filled in every time when either the number of participants, the medium, the setting or the genre changed, when, for example, a new participant joined or left the interaction. As the subject was not of primary research importance, subjects within a type of linguistic activity were allowed to be listed within one cell. When two types of language use were either alternating or happening parallelly, the informants were asked to connect the two rows describing them with brackets. E.g. if the informant was reading and at the same time listening to other people talking; or if spontaneous conversation was regularly interrupted by interesting news on TV. When I collected the logbooks, I asked the informants to comment on it or the task: they clarified ambiguous details, and gave plenty of comments, both about the pitfalls of the logbook and their experiences. They were all surprised to realise that so much time was spent on so many different types of linguistic activities. One of them even called it consciousness-raising task, because now she realised that her whole life was being spent almost on nothing else but giving and receiving verbal, visual and other signals.4 2.3 Results Table 1 shows some quantitative results: the overall duration5 of linguistic activity by the seven informants in minutes (mean: 622.86 minutes, standard dev.: 240.2), also broken down by medium. As it can be expected, a majority of all these people's language use was in the spoken medium, though with a highly varying proportion. 2.3.1 Spoken activities The proportion of longer spontaneous face-to-face conversations (LSFC) with family, acquaintances or strangers were calculated for each informant (see Table 1). The remaining time within the spoken medium was spent on telephone conversations, fleeting interactions (greetings, saying Sorry!, paying in a shop), watching movie films or TV-programmes, listening to radio programmes or to others’ conversations and loudspeaker announcements while travelling on public transport, mixed classroom activities, etc. 3 “Number of participants” would have been a better label. 4 All my informants were extremely helpful, for which I am indebted. 
5 In the case of parallel or alternating linguistic activities, the duration spent on each activity was calculated separately.

age   sex      time of week      sum (min.)   written (min.)*   spoken (min.)*   LSFC**
40    female   weekend day          739        95 (13%)          644 (87%)        503 (78%)
39    female   weekday             1058       509 (48%)          549 (52%)        451 (82%)
24    male     weekend day          736       175 (24%)          561 (76%)        540 (96%)
46    female   weekend day          653       196 (30%)          457 (70%)        455 (99%)
12    female   weekend-weekday      502       114 (23%)          388 (77%)        125 (32%)
31    female   weekday              400⁶      192 (48%)          208 (52%)        141 (68%)
36    male     weekday              272⁷       32 (12%)          240 (88%)        227 (95%)

Table 1. Duration of Hungarian linguistic activity by informant in pilot study 2 (in minutes); * = percentage given in proportion to sum; ** = percentage of longer spontaneous face-to-face conversations (LSFC) given in proportion to spoken activity.

2.3.2 Written activities
Informants' activities included reading newspapers (various types and columns), reading books (non-fiction: popular science, professional, school textbook; fiction), browsing megaposters, reading a map, writing and reading e-mails, taking notes while reading, checking one's own paper manuscript, browsing a library catalogue, leafing through books and magazines in a library or bookshop, reading posters, billboard advertisements, tourist signs, bus signs, street-name plates and sign-boards in the street, reading out a bedtime story to a child, entering data into a mobile phone, writing a school test and homework, filling in a library request card, and using a Hungarian-language word-processor while correcting an English text. The bedtime tale, a regular activity for parents with young children, is a genre that was difficult to classify, because it is a mixture of a written text read aloud and spontaneous conversation.

2.4 Informant sensitisation
A disadvantage of the logbook method may be that it is not robust enough to counterbalance differences in informant sensitivity. To raise the method's reliability level, informants should be carefully trained to be able to consciously check all their activities. The pilot study indicated that certain communicative activity types were not acknowledged, or were not broken down in sufficient detail, by some informants. For example, classroom interaction is a multi-genre linguistic activity. Similarly, several genres mix in a morning TV or radio programme: news-reading (the informant is a recipient of a text written down to be read aloud), interviews (the informant is a third party to others' spontaneous conversation) and commercials (the informant can be either or both). The detailed logbook manual must also include informative definitions of setting, genre, subject, etc., with plenty of supporting examples.

2.5 Advantages
This method, while not yielding language-use data, provides a statistically reliable picture of the daily distribution of text types in the demographic population, which can be exploited in the compilation of a demographically oriented corpus. While methods to determine the target population of published written and internet texts are available, the logbook method offers a solution for assessing the target population of spoken and unpublished written texts. It also makes possible a comparison with an existing general corpus based on genre, register, or other classification. Moreover, new genres can be found, and the connection between the number of recipients and the 'relative importance' or 'influentialness' of texts can be studied. The logbook method seems simple and relatively cost-efficient, thus it can be based on a large sample of speakers.
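To indicate how logbook records of the kind described above could be tabulated into distribution estimates like those in Table 1, here is a minimal sketch; the field names and the sample entries are invented for the illustration and are not the actual logbook data:

from collections import defaultdict

# One row per logbook entry, following the columns described in section 2
# (duration in minutes, role P/R/3, medium wr/sp, setting, genre).
entries = [
    {"minutes": 35, "role": "P", "medium": "sp", "setting": "home",   "genre": "conversation"},
    {"minutes": 20, "role": "R", "medium": "sp", "setting": "street", "genre": "loudspeaker announcement"},
    {"minutes": 45, "role": "R", "medium": "wr", "setting": "home",   "genre": "newspaper"},
    {"minutes": 10, "role": "P", "medium": "wr", "setting": "office", "genre": "e-mail"},
]

def minutes_by(entries, field):
    """Total duration per value of a logbook field (e.g. medium or genre)."""
    totals = defaultdict(int)
    for entry in entries:
        totals[entry[field]] += entry["minutes"]
    return dict(totals)

total = sum(entry["minutes"] for entry in entries)
for field in ("medium", "genre"):
    for value, minutes in minutes_by(entries, field).items():
        print(f"{field}={value}: {minutes} min ({100 * minutes / total:.0f}%)")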
2.6 Modular corpora One may ask: “Do we need representative corpora at all? Monitor corpora are collected with a primary emphasis on the quantity of texts, and if the texts are well-documented, the user can decide on the proportions.” The problem is that most users (lexicographers, syntacticians, etc.) are not informed about problems of corpus design, and have no tools to assess populations but their own 6 She reported having spent most of her day writing a paper in English. 7 He reported having been writing an English language computer programme all day. 490 intuitions. It is the corpus designer's and the sociolinguist's task to assess the textual and demographic population not only for reference, but also for monitor corpora. If texts in a monitor corpus are well-documented as far as the relevant categories (‘strata') are concerned, a user-interface picking texts according to given criteria can produce any composition of texts, with the possible control over other criteria. Thus, with the help of such an interface either a balanced build-up of texts can be composed (e.g. for a multi-genre comparative analysis of a syntactic feature), or a demographically proportional one — based on the results of a survey using the logbook method (e.g. for a sociolinguistic study on ‘influentialness', or for another study on the spread pattern of politically correct phrases). The end-user's preferred composition could also be set (e.g. including only the coded spoken texts for a prosodic analysis, or texts produced in a given year for a study on new coinages), or the totality of the modular corpus's texts could be used, as well. When the interface has randomly picked the texts according to the given criteria, it must give summary statistics about the composition of texts, so that the analyst could change the criteria if the composed corpus is too small. Misbalance in the original monitor corpus (causing also end-user corpora to be too small) can be adjusted by cyclical fine-tuning (Biber 1993, Váradi 1998) by including missing text types. Such an interface may help to exploit the possibilities of monitor corpora. References Biber D 1993 Representativeness in corpus design. Literary and Linguistic Computing 8 (4): 243-257. Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman grammar of spoken and written English. Harlow, Longman. BNC Online (1997) http://info.ox.ac.uk/bnc/what/spok_design.html Burnard L 1995 The BNC Handbook. Oxford, Oxford University Press. Johansson S 1980 The LOB Corpus of British English texts: Presentation and comments. ALLC Journal 1 (1): 25-36. Kennedy G 1998 Introduction to corpus linguistics. London-New York, Longman. Leech G 1991 The state of the art in corpus linguistics. In Aijmer K, Altenberg B (eds), English corpus linguistics: Studies in honour of Jan Svartvik. London, Longman, pp 8-29. Sinclair J 1991 Corpus, concordance, collocation. Oxford, Oxford University Press. Váradi T 1998 Nyelv és korpusz: a reprezentativitás a korpusznyelvészetben (Language and corpus: representativeness in corpus linguistics). Manuscript. Budapest, Research Institute for Linguistics. 491 Appendix: The logbook guide for pilot study 2. Carry the booklet with this guide for a full day wherever you go, and fill in the lines every time you use the Hungarian language. (Except when thinking/talking to yourself or when filling in the tables.) duration: of the linguistic activity (how many minutes/seconds did it take?) You (P/R/3): were you a Producer/Receiver/third party? No. 
References
Biber D 1993 Representativeness in corpus design. Literary and Linguistic Computing 8 (4): 243-257.
Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman grammar of spoken and written English. Harlow, Longman.
BNC Online 1997 http://info.ox.ac.uk/bnc/what/spok_design.html
Burnard L 1995 The BNC Handbook. Oxford, Oxford University Press.
Johansson S 1980 The LOB Corpus of British English texts: Presentation and comments. ALLC Journal 1 (1): 25-36.
Kennedy G 1998 Introduction to corpus linguistics. London-New York, Longman.
Leech G 1991 The state of the art in corpus linguistics. In Aijmer K, Altenberg B (eds), English corpus linguistics: Studies in honour of Jan Svartvik. London, Longman, pp 8-29.
Sinclair J 1991 Corpus, concordance, collocation. Oxford, Oxford University Press.
Váradi T 1998 Nyelv és korpusz: a reprezentativitás a korpusznyelvészetben (Language and corpus: representativeness in corpus linguistics). Manuscript. Budapest, Research Institute for Linguistics.

Appendix: The logbook guide for pilot study 2
Carry the booklet with this guide for a full day wherever you go, and fill in the lines every time you use the Hungarian language. (Except when thinking/talking to yourself or when filling in the tables.)
duration: of the linguistic activity (how many minutes/seconds did it take?)
You (P/R/3): were you a Producer/Receiver/third party?
No. of Rs: an approximate number of Receivers
others: brief demographic details of other participants, if known (sex, age, profession, residence: Budapest/town/village)
wr/sp: written/spoken (or other: e.g. written text read aloud)
setting: e.g. shopping, cinema, conference, religious service, etc.
genre: OTHER GENRES CAN BE NAMED, TOO!
· two- or several-party personal conversation (e.g. with family, acquaintances, strangers in the street, in a shop, in the doctor's office, etc., or listening to others' conversation as a third party)
· fleeting interaction with acquaintance, stranger (e.g. only greetings, saying sorry, asking for a journal at the news-stand, etc.)
· giving a presentation in front of a present audience/listening to one
· listening to a loudspeaker announcement
· work meeting
· radio/TV: news, interview, debate, sports commentary
· radio/TV/tape recorder/CD: music with text (if you listen to the text)
· radio/TV/movie film, theatre play
· radio/TV: commercial
· reading: newspaper, magazine, internet (column: news report, editorial, fiction, advertisement, etc.)
· reading: book (fiction, science, law, etc.)
· reading: short message (letter, shopping list, etc.)
· reading: megaposter, poster, announcement, brand names on buildings or clothes, sign-board
· writing: short message (letter, shopping list, etc.)
· writing: creation of a longer text (diary, paper)
Optional: subject
aim: e.g. playing, giving or receiving information/orders, amusement, killing time, etc.
If two types of language use are either alternating or happening in parallel, connect the two rows describing them with brackets. Thanks a lot for your effort!

The Web as a Source of Linguistic Information
Antoinette Renouf
Research and Development Unit for English Studies, University of Liverpool.
However large and up-to-date the available electronic text corpora are, there will always be aspects of the language which are too rare or too new to be evidenced in them. The WWW, by contrast, is the largest existing repository of texts across a range of textual domains. It is not surprising that individual corpus linguists have increasingly hit upon the idea of querying the standard web search engines in order to retrieve the more recondite or newly-minted instances of language use. Whilst this strategy can yield useful linguistic results, the standard engines are not designed for the purpose: the procedure is prohibitively slow and the output requires extensive post-editing. Last year, the Research and Development Unit for English Studies at Liverpool moved on from being such users, taking on board the needs of the community and beginning to develop 'WebCorp', an Internet search system which allows on-line access to web texts as linguistic rather than information sources. A demonstration tool is available at: http://www.webcorp.org.uk. This paper will report on the research initiative and highlight some of the issues involved.

The influence of the passive on text cohesion and technical terminology
A corpus-based study of research article abstracts from the domain of electrical engineering
Jérôme Richalot
Institut National des Sciences Appliquées de Lyon (INSA)
jerome.richalot@insa-lyon.fr
Abstract
The passive is used extensively in research article abstracts. It is often agreed that the passive helps focus on the actual content of the article rather than on its author(s).
The abundance of passive sentences is often frowned upon and regarded as unnecessarily burdening scientific discourse. The aim of this paper is to go beyond this assumption and to show that the use of the passive serves a clear purpose: underlining argumentation and therefore strengthening text cohesion. A study of passive sentences in 83 research article abstracts from IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control (UFFC) is conducted. Passive sentences are extracted and classified. The first criterion for classification is the part of the abstract the sentence belongs to (IMRaD). A study of verbs, their semantics and informative content is then carried out. Subsequently, the grammatical subjects of passive sentences are examined. The appearance of technical terms from the domain of electrical engineering and their prospective or retrospective orientation receive particular attention. An attempt is made at characterising subject noun groups. The corpus-based approach helps underline the blandness of verb semantics as opposed to the richness of subject noun groups in sentences that often appear unbalanced. It is thereby established that the use of the passive in the 83 abstracts studied only partially matches the description of the passive based on general English. Indeed, in the abstracts that make up the corpus, passives do not seem either to characterise the grammatical subject of the sentence or to put any particular stress on the result of the process as referred to by the verb. It is suggested that, by receiving so much emphasis, the subject noun group becomes the centre of predication for the sentence. The "castaway" verb group carries little information and can sometimes even take on a purely metalinguistic role. This topicalisation process is seen as particularly efficient in helping authors structure their abstracts and therefore helps strengthen text cohesion. Technical terminology seems to benefit greatly from this use of the passive. Terms appear in a cotext helping to situate them in the author's conceptual framework, thus contributing to a better understanding of the specialised field. Both the corpus approach and the results yielded could be seen as valuable contributions to terminology processing and TEFL.
1. Introduction1
1 Quotations from French references have been translated and marked [translation]. The French is given in footnotes when necessary.
Many books have been dedicated to "English for science" aimed at specialists in areas other than English. The ones I am referring to in this introduction all deal with abstracts in more or less detail. In the instructions to contributors of IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control (UFFC), it is clearly specified that "Each contribution must contain an abstract (not more than 200 words for papers and 50 words for correspondences and Letters)." This is the only instruction given to contributors on the abstract itself. However, in the confidential proof-reading form given to each author, under the heading Summary of evaluation, one of the first criteria regarding presentation is "Is the summary an adequate digest of the work reported in the paper?" This remark underlines the importance of the abstract as well as its conventional aspect.
Robert A. Day (1989: 28) goes as far as to consider the abstract a "miniversion of the paper." The aim of the abstract is to help the distribution of the article it refers to, hence this statement by Day (1989: 28): "The abstract should provide a brief summary of each of the main sections of the paper." Lobban and Schefter (1992: 47) add another characteristic to the abstract, which they see as a "self-contained synopsis of the report". Sites (1992: 113) also endorses this remark, stating that "abstracts are self-sufficient". These elements of definition, however concise they may seem, underline the secondary aspect of the abstract. Day (1989: 28) uses the term "secondary publication" with reference to secondary services in which an abstract can be published independently from the article it sums up (Biochemical Abstracts, Chemical Abstract, etc.). It therefore seems appropriate to consider the abstract as a metatext. It accompanies research articles but usually has its own layout. In the same way as literary criticism, it is based on another text. The subject of the abstract is the article. According to Lobban and Schefter (1992: 47), "The emphasis in an abstract is on the results and conclusions. It should have only the objectives from the Introduction, and only a brief reference to the Materials and Methods (unless the experiments focused on methods)." They explicitly refer to the Introduction, Methods and Materials, Results and Conclusion parts of the article. In many cases the structure of the abstract does indeed reflect the IMRaD structure of the article. Day (1989: 28) clearly lays out the following principles: "The Abstract should (i) state the principal objectives and scope of the investigation, (ii) describe the methodology employed, (iii) summarize the results, and (iv) state the principal conclusion." Sites (1992: 113) makes a difference between informative abstracts and descriptive abstracts, qualifying the latter as "little more than prose table of contents". In the study of the passive conducted here, I shall try to show that, but for the negative connotation, Sites' remark is very appropriate. The very functional elements of definition given above are usually followed by a series of rules and regulations for a "good abstract". Here are just a few examples. For Day (1989: 159), "Most of the abstract should be in the past tense because you are referring to your own present results." For Vernon Booth (1993: 13), "The passive voice, commonly used to describe results, sometimes makes clumsy constructions. Turn a passive voice to direct style when you can." These instructions seem difficult to apply and sometimes even to understand. More pragmatic advice such as Lobban and Schefter's (1992: 57) is certainly preferable: "However, there are divergent and often strong opinions about whether use of the passive is good or bad. [Ask what your professor wants!] One of our colleagues aggressively promotes the passive. ("The report is about the experiment, not about you!" she exclaims.) Another colleague equally hotly denounces the use of the passive as "Victorian prudery" that leads to "committee writing [style], ponderous discussions, and avoidance of responsibility."" In the literature for specialists in areas other than English, there seem to be as many "recipes" for a good abstract as books published on the topic.
Still, however awkward some of the instructions might sound, they show that abstracts do have their own style. If the authors quoted above seem to find it difficult to define this style clearly – Lobban and Schefter (1992: 57) even resort to a so-called "scientific etiquette" – they at least fully recognise its existence.
2. A linguistic approach
Since linguistic studies are one of the only research areas in which the object being studied is also the one used to report on those studies, a clear metalanguage should be defined. I shall therefore first set out as clearly as possible the definitions of the terms used throughout this study. In the case of the passive, this seems all the more important as the metalanguage gives us a first insight into the actual operations at work when building a passive sentence. As Henri Adamczewski (1993: 180) points out, the very terms passive verb and passive form do not seem appropriate. They tend to restrict the passive to a phenomenon mainly concerned with the verb group. The verb group is indeed at the forefront of morphological and syntactic changes, and no research on the passive can possibly be conducted without a thorough study of the verbs being used in the passive; however, this study focuses on sentences (or utterances2), and I shall therefore, depending on the context, use either passive sentence or passive voice, in which voice refers to subject-verb-object ordering. Although it seems awkward to do without the very term passive, particularly in language teaching, linguistic grammars have made it difficult to actually justify its use. Jean-Rémi Lapaire and Wilfried Rotgé (1991: 364) propose the following etymology. Passive as an adjective comes from the Latin pati, "suffer" or "undergo". Passive would therefore be motivated by the appearance, in a passive sentence, of a subject undergoing the process described by the verb. It seems obvious that this feature greatly depends on the semantics of the verb used. In (1), John does indeed undergo a process.
(1) John was hit by Paul.
But the situation is altogether different in (2).
(2) John was greeted by Paul on the doorstep.
Furthermore, the emergence of a subject undergoing the process described by the verb is not, in itself, a prerequisite for a passive sentence, as in (3).
(3) John suffered the same fate as his brother.
2 Marie-Line Groussier and Claude Riviere define the utterance as a basic unit of study. "An utterance is produced throughout a unique uttering act by an utterer for whom it makes up a whole, which has consequences on prosody." [translation] Utterances are therefore not exclusively oral. Groussier and Riviere point out that utterances and sentences often correspond. But featuring a verb is not a prerequisite, and they consider the French "Moi, un voleur !" as an utterance.
The inaccuracy these three basic examples underline most probably arises from the confusion between the linguistic and the extralinguistic, or in other words between a universe of experience and one of representation. André Joly's and Dairine O'Kelly's remarks (1990: 152 [translation]) offer significant help: "[...] subject and object are grammatical functions, they refer to imaginary behaviours existing as thoughts and more particularly thoughts related to language.
On the contrary, the impressions which define the notions of agent or patient do not correspond to functions – that is to roles played by a noun or its substitute in a sentence –, but to situations, states originating in the world of sensations.” This distinction enables me to clearly disconnect the active voice from the passive voice. The active/passive couple is an inheritance from descriptive grammars in which a sentence with a transitive verb “in the active voice” had its equivalent “in the passive voice”. Whenever “John was eating an icecream” that very ice-cream was systematically “being eaten by John.” The trend was most certainly further accentuated by generative grammars. In Syntactic Structures (1957: 80), Chomsky does wonder about the possibility of considering passive sentences as part of the kernel of English, but rejects the idea faced with the complexity of the transformations then required: “When we actually try to set up for English, the simplest grammar that contains a phrase structure and transformational part, we find that the kernel consists of simple, declarative, active sentences (in fact, probably a finite number of these), and that all other sentences can be described more simply as transforms.” Lucien Tesniere (1959: 242) opened new perspectives considering the passive in the wider scheme of transitivity. He defined four diatheses3: § the active diathesis. “Alfred hit Bernard.” § the passive diathesis. “Bernard was hit by Alfred.” § the reflective diathesis. “Alfred killed himself.” § the reciprocal diathesis. “Alfred and Bernard killed each other.” Each diathesis is a virtual realisation of transitivity. Adamczewski (1993: 185) as well as Joly and O'Kelly (1990: 151) praised the coherence of Tesniere's model but also underlined its limits. In Tesniere's model, the passive diathesis remains closely linked to the active diathesis. Nowadays, the so-called subordination of the passive to the active is widely recognised as irrelevant and a few simple examples effectively stress the point. Adamczewski (1993: 186 [translation]) sees the passive as “a phenomenon inscribed in the workings of the human language.” He presents the following verb classification4: § Verbs with an active or a passive orientation (also known as notional passives) (4) They grow tomatoes. (4') The tomatoes won't grow this year. (5) She reads the play well. (5') The play reads well. § Exclusively passive verbs (6) He was addicted to heavy drinking. (7) She is bound to find it if you're not careful. § Non-passivable verbs - subject oriented and non reversible (middle verbs) (8) The young soldier weighed six stones. (8') * Six stones were weighed by the young soldier. - with a cognate object (9) She shrugged her shoulders. (9') * Her shoulders were shrugged by her. - reflective (10) John praised himself. (10') Himself was praised by John. Still, I do not intend to reject as a whole previous studies on the passive and will not deny that the active voice often sheds light on the operations being carried out when a passive sentence is produced. The metalanguage used in the studies I quote as references is often reminiscent of the traditions of descriptive grammars. When Adamczewski (1991: 187 [translation]) proposes the following model for the passive voice: N2 {be} + past participle {by} N1 the very labels N2 and N1 are reminiscent of the N1-Vt-N2 model for the active voice. 3 The examples are a translation of Tesniere's. 4 The examples are Adamczewski's. 
Joly and O'Kelly (1990: 155) do propose the term resultative voice5, which is however first introduced in parallel with their operative voice. It therefore seems appropriate for Lapaire and Rotgé (1991: 365) to point out an ambiguous dependence of the passive on the active. And it should come as no surprise that Adamczewski (1993: 181 [translation]) himself refers to the active voice as a structure "perceived as primary though with a certain unease." However, Adamczewski goes on to explain the reasons for this uneasiness. He sets out the minimal ternary model N1 Vt N2 as corresponding to the traditional SVO model of transitivity. He argues that the SVO model of transitivity can be considered primary, being "as close as possible to the extralinguistic universe: in the SVO model, the autonomous subject uses the semantics of the verb to somehow modify or affect the object of the verb. Transitivity can be nothing else than this movement or dynamics, which in itself is the language parallel of man's action on the material world." (1993: 186-187 [translation]) Joly and O'Kelly (1990: 154 [translation]) quote Moignet (1974 and 1981), adding that "any event expressed by a verb has a causation: an initiating reference. It also has an effection6: upon completion of the operation represented by an event, a result is necessarily attained." The model they propose (figure 1) features the same "movement" or "dynamics" referred to by Adamczewski:
[Figure 1: causation → operation (event) → effection]
The passive therefore does appear as a secondary construction, the reversing not of the constituents of a hypothetical active sentence but of a predicative movement. The "affected" – for which we might venture the term "effected" – becomes the origin of the sentence, the second term of the equation being the verbal predicate. The newly founded relation between the effected (N2) and the verbal predicate (Vt) is the very raison d'etre of the passive. Adamczewski (1993: 187 [translation]) puts it as straightforwardly as possible: "given an abstract clause: N1 Vt N2, N2 has been chosen as the origin of the utterance and what the utterer is aiming at is the N2-Vt relation, N1 being absent." The passive can clearly be considered as an operation pertaining to thematisation. The syntactic components of a passive sentence are most interesting with regard to the very nature of this phenomenon. As seen above in (3), having a subject undergoing the process described by the verb is not enough for a sentence to be considered passive. Lapaire and Rotgé (1991: 348 [translation]) consider the use of be a "capital phenomenon" through which "the predicate [is] formally "separated" from the grammatical subject [and] is placed under the control of the utterer." They propose (1991: 347-348 [translation]) a very interesting brief diachronic study of be which should certainly be considered here. According to them, the original meaning of be was "to occupy a place", but it soon became "to have a place in the universe, thus to exist." They therefore define be as "a marker or instigator of existence, either absolutely (I think therefore I AM), or relatively, enabling the utterer to assign the grammatical subject a permanent or temporary characteristic (e.g.
I AM tall, she IS angry.)" They do see in be "an actuating power – in the philosophical meaning of the term "give a reality to...", "state to the world the reality of..." – and a characterising power: the utterer takes a defining point of view with regard to the grammatical subject by assigning it one or several traits he deems essential or at least relevant (at least in a given situation)." This hints at the double effect of be. In terms of position in the sentence, between the grammatical subject and the verb, it breaks up the direct predicative link and freezes the grammatical subject. Lapaire and Rotgé (1991: 368 [translation]) quite efficiently use the term "freeze frame". They remind us that this impression is further accentuated by the fact that "in terms of semantics, the subject of a passive construction does not control the lexical verb." Let us add that most passive sentences are in fact agentless, which might have a role to play in freezing the grammatical subject. The second effect of be is a highlighting one. As can be understood from the previous paragraphs, be is an attributive auxiliary: it looks towards the grammatical subject of the sentence and actualises or characterises it. The model Joly and O'Kelly (1990: 161) propose (figure 2) underlines this:
(11) The door is locked.
[Figure 2: Joly and O'Kelly's model applied to "The door is locked"]
5 "voix résultative" in French as used by Joly and O'Kelly.
6 Causation and effection in French as used by Joly and O'Kelly.
Trying to replace be with another auxiliary offers interesting contrasts. Statistically, get is the most likely candidate, but become, remain, seem, appear, feel and look are also possible, with an effect close to that of get. Let us consider the following examples:
(12) He got run over by the whole pack.
(13) He became irritated and soon lost his temper.
(14) He seemed affected by the sad news.
(15) He appeared relieved when she left.
(16) He feels overwhelmed.
(17) He looked disappointed.
Although these sentences remain subject-oriented, they tend to let us see the process described by the verb (Vt-EN) as unfolding. Joly and O'Kelly (1990: 161 [translation]) do remind us of the existence of the auxiliary weorþan in Old English, meaning 'become', the use of which, in passive sentences, was close to the one I describe here. Adamczewski (1993: 189 [translation]) points out that "in German werden (a metalinguistic operator derived from werden = become) is used and not sein (be)." The semantics, or semic programme7, applied to the grammatical subject is of course that of the transitive verb (Vt-en). The very form of the verb, the past participle, is revealing. We might instinctively recognise Vt-EN and V-ED (preterit) as quite close; however, the two forms are quite different. According to Lapaire and Rotgé (1991: 446 [translation]), "thanks to -EN, the predicate is linked to a past period of time (cf. past participle)." Joly and O'Kelly (1990: 159 [translation]) offer a slightly different perspective: "[...] The form of the past participle, e.g. broken, marks the verb as having reached the end of its tension: it is its effection form. The verb is then reduced to the result of the operation called operatively by break." The past participle form of the verb therefore signals the operation of the verb as "effective"; from this point on, what is being considered is the result of this operation. The syntax of a passive sentence enables us to look back on the operations carried out by the speaker. Of course, it supports the potential of the passive as described above.
With the help of be, the speaker enters the predicative relation and makes it his own. From this point on, he handles discourse objects and gets further away from the extralinguistic universe. Lexis as defined by Groussier and Riviere (1996: 112-113 [translation]) could help sum up what I have developed so far. Groussier and Riviere make a difference between predicable lexis and predicated lexis8. The predicable lexis is the notional model of a process in terms of actors, and it is up to the speaker to assign a value to each of the actors described in the model. The label given to the first actor, "x0", which becomes "a" or "C0" once assigned a value, symbolises its position as origin of the message. The predicated lexis is therefore the result of predication applied to the predicable lexis. Passive predication is "the choice as first argument, not of the actor in the x0 position of the lexis model (the source of the primitive relation), but of the actor in the x1 position (aim of the primitive relation)." So far I have tried to show that using the passive is in fact setting up a predicative model the aim of which is to promote the secondary notion in the predicable lexis to a primary status in the predicated lexis. The operation means defining the starting point of the sentence, its theme. In this study of highly specialised but also very controlled texts (see the definition and characterisation of the abstract above), aimed at a well-defined discourse community, thematisation takes on an even greater importance. For the authors of the abstracts which make up the corpus I am studying, choosing the theme of a sentence means deciding which information should be considered as given, as determined enough to be shared with their discourse community with no further explanation. I therefore expect these theme segments to reveal some of the characteristics of the discourse community the abstracts are aimed at. This could be particularly interesting with regard to the technical terminology shared by the discourse community. The theme segments should feature a high density of technical terms related to the specialised field being considered.
7 Semic programme refers to the semantic features or semes of the verbs. These traits represent the full semantic potential of the verb, only part of which realises at predication.
8 "Lexis prédicable" and "lexis prédiquée" as used in French by Groussier and Riviere.
3. A corpus-based study
The purpose of the following corpus-based analysis is to test my hypothesis on the passive. No in-depth study of a corpus of research articles is in fact necessary to underline the frequency of passives. A mere count of the occurrences of be (be/being/is/are/was/were) over several articles will actually yield very little noise. I decided to conduct a study of passive sentences in research article abstracts. I selected 83 abstracts from IEEE Transactions on Ultrasonics, Ferroelectrics and Frequency Control (UFFC)9, that is, the abstracts from the three issues of the first semester of 1999. Thanks to Mike Scott's WordSmith Tools suite, I extracted the passive sentences from the corpus. Using the Concord tool, passive sentences can be extracted automatically by concordancing for be or any of its derived forms (being, is, are, was, were) with *ed or any of the infamous "irregular past participles" to its right. The concordance should exclude past participles left of be and look for the past participle two to three words right of be. Minor post-editing is necessary.
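The sketch below approximates the concordancing set-up just described with a single regular expression: a form of be followed, within two words, by an -ed form or one of a short list of irregular past participles. The irregular-participle list is deliberately truncated and the pattern over-generates, so, as noted above, manual post-editing would still be required; this is an illustration of the procedure, not the actual WordSmith query used for ABCORP.

```python
import re

BE = r"(?:be|being|is|are|was|were)"
# A handful of irregular past participles; the list used for ABCORP would be much longer.
IRREGULAR = r"(?:shown|given|known|found|chosen|made|done|seen|taken|written|kept|set)"
PARTICIPLE = rf"(?:\w+ed|{IRREGULAR})"

# A form of be, at most two intervening words, then a participle to its right,
# e.g. "can be easily computed", "results are then shown".
PASSIVE = re.compile(rf"\b{BE}\b(?:\s+\w+){{0,2}}?\s+(?P<vten>{PARTICIPLE})\b", re.IGNORECASE)

def passive_candidates(sentences):
    """Yield (sentence, participle) pairs that look like passive sentences.

    The pattern over-generates (adjectival -ed forms, 'is interested in', etc.),
    so the manual post-editing mentioned above remains necessary.
    """
    for sentence in sentences:
        match = PASSIVE.search(sentence)
        if match:
            yield sentence, match.group("vten").lower()

abstract = [
    "Frequency spectra and modes are computed and examined.",
    "The author presents a new model.",
    "These effects can be reduced or eliminated by using narrow-band experiments.",
]
print([vten for _, vten in passive_candidates(abstract)])  # ['computed', 'reduced']
```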
Using this set up, I extracted 358 passive sentences from ABCORP. Throughout the study of these 358 sentences I shall use and adapt the terminology first used by Adamczewski. I shall therefore refer to the grammatical subject of the sentence and to the verbal predicate (a transitive verb) using respectively N2 10 and Vt-en11. I shall first try to classify the Vt-en's used in the passive sentences in ABCORP. Then I shall study their N2's to try to get a better understanding of the Vt/N2 relation which I defined above as being the raison d'etre of the passive. 4. Categorisation of transitive verbs in passive sentences The first data that can be used upon processing of the corpus are the raw results given by Concord. From the 358 passive sentences I have extracted 368 occurrences of Vt-en's. The difference is due to sentences in which two or more Vt-en's are co-ordinated by and or or without be being repeated, as in the two examples below. (18) Frequency spectra and modes are computed and examined. (19) These effects can be reduced or eliminated by using narrow-band experiments. The 368 occurrences correspond to 145 different verbs. Their frequency is interesting. On the one hand, 30% of those Vt-en's appear only either once (81)12 or twice (23). On the other hand the four most frequent Vt-en's represent 20% of the 368 occurrences. These four verbs are in fact somehow emblematic of the IMRaD structure. They are use (29) for the Method section, present (20) for the Introduction section, obtain (16) for the Result section, and propose (13) for the Discussion section. However this mere frequency count should not lure us into hasty conclusion. Table 1 below maps those verbs over the IMRaD part they appear in. It also adds other less frequent verbs with a semantic content close to them. The methodology I am using here is somewhat similar to the one suggested by Biber (1998: 123). Biber describes the register of research articles in experimental science as “one of the few English registers that clearly distinguishes among internal purpose-shifts. […] Because each of [the IMRaD sections] is overtly marked in the text and has distinct communicative functions […]” This table leaves out some of the verbs likely to be classified at this stage but which only occurred once throughout the corpus. I consider these verbs of no statistical relevance. The verbs in Table 1 all have a very large semantic content, i.e. a low number of semes or semantic features. All are widely used outside the scientific register and their meaning does not vary significantly whether within or without this register. None of them can be said to belong exclusively to a specialised scientific domain. I shall refer to these verbs as category 1 verbs. 9 The abstracts are available under HTML format from IEEE's web site at http:\\uffc.brl.uiuc.edu\Trans\ I shall henceforth refer to this collection of abstracts as ABCORP. 10 N2 can refer to a simple noun as well as a complex noun phrase. 11 In metaoperational linguistic grammars such as Adamczewski's the use of Vt-en to refer to the past participle of a transitive verb underlines the regularity of the system. In the system, the verb (V) can be V-ing, V-ed or V-en. 12 The figure in brackets indicates the number of occurrence of form being referred to. 
Part of the abstract: Vt-en
Introduction: achieve (1/4), apply (2/7), base (2/7), consider (3/5), describe (4/7), determine (1/4), discuss (3/7), find (2/7), improve (1/5), investigate (3/6), know (2/3), obtain (2/16), present (6/20), propose (7/13), study (2/5), use (6/29)
Method and Materials: achieve (1/4), apply (4/11), base (5/6), consider (1/5), demonstrate (1/3), describe (3/7), determine (3/4), discuss (2/7), give (1/6), identify (3/3), investigate (2/6), observe (1/5), obtain (3/16), perform (6), present (5/20), produce (1/4), propose (5/13), show (1/10), study (3/5), suggest (1/2), use (20/29)
Results: achieve (2/4), apply (1/11), consider (1/5), demonstrate (2/3), find (5/7), give (5/6), investigate (1/6), observe (2/5), obtain (11/16), present (8/20), produce (2/4), show (9/10), suggest (1/2), use (1/29)
Discussion: discuss (2/7), observe (2/5), present (1/20), produce (1/4), propose (1/13), use (2/29)
Table 1
13 The first figure in brackets refers to the number of occurrences of the verb in the part of the abstract; the second figure refers to the total number of occurrences of the verb in the corpus.
The second category of verbs extracted from the passive sentences of the corpus gathers verbs with a more restricted semantic content. Those verbs describe processes central to research in science or technology. Table 2 below shows that they are used more frequently in a scientific context than in "general English". To compare the frequency of these verbs in ABCORP and in "general English", I have chosen the British National Corpus14 as an element of comparison. The size of the BNC, 100.1 million words, and its diversity, some 4124 different files, make it the corpus most representative of "general English" available today. It could be argued that the BNC represents British English whereas ABCORP is composed of abstracts published in an American journal. However, apart from minor spelling differences (e.g. analyse vs. analyze), I do not think that the abstracts are overtly marked by their "Americanness", and I am convinced that the comparison made with the BNC yields reliable results. Because I do not think they can be of any statistical relevance for this study, I have excluded from the comparison and the resulting categorisation the verbs that could have been taken into consideration at this point but which only occurred once throughout the corpus, even once lemmatised (e.g. adjust in (20)). I have kept the verbs describing processes that I considered central to research in science or technology, even though they occurred only once, if they somehow participated in a specialised term of the domain of ferroelectricity, ultrasonics and frequency control (e.g. enlarge in (21) was kept because asymmetrical bandwidth enlargement belongs to the terminology of UFFC and occurs in ABCORP). I have also kept the verbs likely to be categorised here but which only appeared once in the corpus if a word of the same family appeared in at least one file other than the one the passive had been isolated in (e.g. tune in (22), because cavity-tuned hydrogen masers appears in a file different from the one (22) appears in).
14 Data cited herein has been extracted from the British National Corpus Online service, managed by Oxford University Computing Services on behalf of the BNC Consortium. All rights in the texts cited are reserved.
(20) The time interval between two consecutive samples can be continuously adjusted to avoid undesirable sample volumes.
(21) At higher pressure levels, it even may happen that, in a steady, unidirectional flow (which should generate only positive Doppler frequencies), the Doppler spectrum is enlarged up to the point that negative Doppler shifts also are produced.
(22) As an example of this new class of voltage-tunable chip SAW devices, a voltage-controlled oscillator (VCO) is presented in which the output frequency can be tuned by an applied gate voltage.
For the purpose of this comparison, I have used one of the frequency lists provided by Kilgarriff15 (RF2). The list is a lemmatised frequency list. It features the words occurring more than 800 times in the BNC and gives their part of speech. I have lemmatised the verbs describing processes which can be considered as scientific or technological and which appear in passive sentences in ABCORP, and calculated their raw frequency (RF1). I have then calculated a mean frequency (MF1) over 1,000 words for each of them. In a similar way, I have calculated a mean frequency (MF2) for the same verbs in the BNC. I have then subtracted each verb's MF2 from its MF1. A positive result means that on average the verb occurs more frequently in ABCORP than in the BNC; a negative result means that on average the verb appears more frequently in the BNC, or in "general English", than in ABCORP. Table 2 shows the ten most and least frequent verbs from this second category. The top ten undoubtedly describe a process central to research in science or technology. I have decided to keep the verbs with an MF1-MF2 close to zero or negative, arguing that those verbs have a rather large semantic content which realises somewhat differently in "general English" and in ABCORP. The category gathers 78 verbs, which I shall refer to as category 2 verbs.

Verb (infinitive) | ABCORP RF1 | MF1 | BNC RF2 | MF2 | MF1-MF2
measure | 16 | 0.8676 | 6683 | 0.0668 | 0.8009
compare | 16 | 0.8676 | 12591 | 0.1258 | 0.7418
detect | 13 | 0.7050 | 3231 | 0.0323 | 0.6727
calculate | 12 | 0.6507 | 3922 | 0.0392 | 0.6115
provide | 19 | 1.0303 | 47923 | 0.4788 | 0.5516
analyze | 9 | 0.4880 | 4106 | 0.0410 | 0.4470
simulate | 8 | 0.4338 | <=800 [16] | 0.0080 | 0.4258
operate | 8 | 0.4338 | <=800 | 0.0080 | 0.4258
design | 10 | 0.5423 | 11810 | 0.1180 | 0.4243
develop | 12 | 0.6507 | 24205 | 0.2418 | 0.4089
…
realize | 2 | 0.1085 | 9726 | 0.0972 | 0.0113
fix | 1 | 0.0542 | 5282 | 0.0528 | 0.0015
drive | 3 | 0.1627 | 16477 | 0.1646 | -0.0019
train | 2 | 0.1085 | 11907 | 0.1190 | -0.0105
prefer | 1 | 0.0542 | 6854 | 0.0685 | -0.0142
point | 2 | 0.1085 | 12844 | 0.1283 | -0.0199
receive | 4 | 0.2169 | 24111 | 0.2409 | -0.0240
expect | 1 | 0.0542 | 27221 | 0.2719 | -0.2177
carry | 1 | 0.0542 | 31258 | 0.3123 | -0.2580
set | 1 | 0.0542 | 40381 | 0.4034 | -0.3492
Table 2
15 Kilgarriff's frequency lists are available from http://www.itri.bton.ac.uk/~Adam.Kilgarriff/bncreadme.html or via ftp from ftp.itri.bton.ac.uk/bnc.
16 Kilgarriff does not list lemmatised forms with a frequency smaller than 800. For the purpose of my calculations, I have considered the RF2 of verbs not listed by Kilgarriff as 800.
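A small sketch of the MF1/MF2 comparison described above: raw frequencies are normalised to occurrences per 1,000 words in each corpus, and the difference MF1 - MF2 is taken. The BNC size of 100.1 million words comes from the text; the size of ABCORP is not stated in the paper, so the figure used below is an assumption chosen only to reproduce the order of magnitude of Table 2.

```python
def mean_frequency_difference(rf_abcorp, rf_bnc, abcorp_words, bnc_words=100_100_000):
    """Return (MF1, MF2, MF1 - MF2): mean frequencies per 1,000 words in each corpus."""
    mf1 = 1000 * rf_abcorp / abcorp_words
    mf2 = 1000 * rf_bnc / bnc_words
    return round(mf1, 4), round(mf2, 4), round(mf1 - mf2, 4)

# 'measure': 16 occurrences in ABCORP, 6683 (lemmatised) in the BNC.
# ABCORP_WORDS is an assumed corpus size chosen only to reproduce the order of
# magnitude of Table 2 (MF1 ~ 0.87, MF2 ~ 0.067, MF1 - MF2 ~ 0.80).
ABCORP_WORDS = 18_400
print(mean_frequency_difference(16, 6683, ABCORP_WORDS))
```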
Table 3 maps these verbs over the IMRaD parts of the abstracts in which they appear.
Part of the abstract: Vt-en
Introduction: align (1/2), analyze (1/4), compare (1/7), construct (1/2), design (1/2), detect (1/5), develop (5/8), establish (1/2), expect (1/1), fabricate (1/4), fix (1/1), monitor (1/1), operate (1/2), relate (1/3)
Method and Materials: align (1/2), analyze (3/4), assess (1/3), average (1/2), calculate (6/6), calibrate (2/2), choose (1/1), compare (5/7), compute (3/5), control (1/1), convert (1/1), correct (1/1), decrease (1/1), degrade (1/1), derive (3/3), detect (2/5), develop (3/8), dominate (1/1), drive (1/1), establish (1/2), evaluate (1/2), fabricate (3/4), formulate (1/1), insert (2/2), introduce (1/1), locate (1/1), manufacture (1/1), map (1/1), measure (2/2), model (1/1), modify (2/2), observe (1/5), operate (1/2), position (1/1), predict (1/1), propagate (1/1), provide (1/1), realize (1/1), record (1/1), scan (3/3), select (1/1), separate (2/2), set (1/1), substitute (2/2), test (1/1), train (1/1), transform (3/3), utilize (1/1), validate (1/1)
Results: affect (1/1), assess (2/3), attenuate (1/1), attribute (1/1), average (1/2), carry (1/1), compare (1/7), compute (2/5), confirm (1/1), construct (1/2), detect (2/5), enlarge (1/1), evaluate (1/2), examine (1/1), illustrate (1/1), increase (1/1), indicate (1/1), model (1/1), observe (2/5), point (1/1), rate (1/1), receive (1/1), reconstruct (1/1), reduce (1/1), relate (2/3), report (1/1), tune (1/1), verify (1/1)
Discussion: characterize (2/2), design (1/2), implement (1/1), observe (2/5), prefer (1/1), quantify (1/1), reconstruct (1/1), require (1/1), simulate (1/1), solve (2/2)
Table 3
The third category of verbs extracted from the passive sentences of ABCORP gathers verbs that would only occur in a scientific register17, and quite specifically in the domain of electrical engineering. This category is made up of only two verbs: pole (2) and repole (1). The two verbs appear in the same abstract; pole occurs twice in the Discussion part whereas repole occurs in the Method and Materials part. As stated earlier in this paper, I have decided to study the passive as a sentence-level operation. I therefore have not calculated a mean frequency count18 for passives by section as proposed by Biber (1998: 124-125), since his count is word-based. The example given by Biber, a study conducted over 19 research articles from either the New England Journal of Medicine or the Scottish Medical Journal published in 1985, points to Methodology sections being "marked by their extremely frequent use of agentless passives". Unlike Biber's, this study is based on a corpus of research article abstracts rather than full-length articles, and so far it has made no distinction between agentless passives and passives with an agent. However, the three categories that I have defined hint at similar results in terms of the frequency of passives over the IMRaD sections of abstracts. Table 4 below shows the breakdown of passive sentences by IMRaD section.
17 Pole as a verb only occurs three times in the BNC with a meaning similar to the one it has in the following example: "This is how we pole a raft and just because a white man is watching through his funny machine we aren't going to do it any differently." (Barnes J 1990 A History of the World in 10½ Chapters. London, Picador). Repole does not occur in the BNC.
18 For an explanation of the mean frequency count, see Biber's methodology box 8, the unit of analysis in corpus-based studies (1998: 269).
Part of the abstract | First category | Second category | Third category | Total
Introduction | 26% (53) | 13% (18) | - | 20% (71)
Method and Materials | 42% (84) | 55% (80) | 33% (1) | 47% (165)
Results | 28% (56) | 23% (33) | - | 26% (89)
Discussion | 4% (9) | 9% (13) | 66% (2) | 7% (24)
Total | (202) | (144) | (3) | (349)19
Table 4
19 The restrictions in the categorisation of verbs resulted in the exclusion of 33 verbs from the study.
Biber's brief study is given as an example for the study of discourse characteristics, and it concludes that agentless passives contribute to "presenting events impersonally, with no acknowledged agent." The following aims to explain that this effect, although sensed by most readers, is probably a secondary effect for the targeted discourse community.
5. Categorisation of subject noun-groups in passive sentences
Building on the above categorisation, a study of the verbs' grammatical subjects (N2's) is particularly revealing. I shall focus on noun determination. N2's at first appear very diverse, as shown by the examples below:
(23) The effects of the array parameters on the array performance, such as the selectivity of Lamb modes and effectiveness of Lamb wave generation are investigated.
(24) A new procedure for preparing lead zirconate titanate (PZT)/poly(vinylidene fluoride-trifluoroethylene) (P(VDF-TrFE)) 1-3 composites with both phases piezoelectrically active is described.
(25) A Electromechanical coupling mechanisms in piezoelectric bending actuators are discussed.
(26) These findings are shown to be in excellent agreement with previously reported theoretical predictions by the authors.
(27) Three specimens are used in the study: a block of Plexiglas that has a linear attenuation, a layer of a special rubber compound with an attenuation proportional to f1.38, and a phantom made of castor oil that has an attenuation proportional to f1.67.
(28) Some important factors are studied for the bimodal ultrasonic motor design.
(29) The composite disks have been fabricated into transducers with air-backing and with no front face matching layer, and their performance characteristics have been evaluated in water.
(30) It was found that the ~680 µm spot size of the experimental zone-plate did not vary appreciably with changing frequency, whereas the focal length increased markedly with increasing frequency (from ~5 mm at 450kHz up to ~15 mm at 900kHz).
(31) Tissue-characterization parameters, which have been used successfully by other authors, were calculated for each segment.
Examples (23) to (31) illustrate the nine different criteria I used for N2 classification. Sentences (23), (24) and (25) are examples of the use of the, a, and A (zero-determination). (26) features a deictic (these). The N2's in (27) and (28) are both determined by quantifiers, whether numerical or not. In (29), the N2 is determined by a possessive adjective. (30) and (31) both have a pronoun as an N2. Beyond this simple labelling, a further classification is necessary. Each of the above determiners (and for the purpose of this study, I will consider pronouns as having a built-in determination) contributes to a thematic or rhematic environment. N2 determination therefore participates in text cohesion and in the use of technical terminology. In this part I shall study N2 determination in relation to the abstracts' IMRaD parts, but also in relation to the Vt-en categorisation shown above.
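As an illustration of the nine-way labelling applied in examples (23) to (31), the sketch below assigns a determination label to a subject noun group from its first word. The word lists are illustrative assumptions, not the coding scheme actually used in the study, and a real classification would of course be checked manually.

```python
DEICTICS = {"this", "that", "these", "those"}
POSSESSIVES = {"my", "your", "his", "her", "its", "our", "their"}
QUANTIFIERS = {"some", "several", "many", "few", "all", "most", "two", "three"}
PRONOUNS = {"it", "they", "which", "who"}

def classify_n2(noun_group):
    """Label the determination of a passive sentence's grammatical subject (N2)."""
    words = noun_group.strip().split()
    first = words[0].lower()
    if first == "the":
        return "the"
    if first in ("a", "an"):
        return "a/an"
    if first in DEICTICS:
        return "deictic"
    if first in POSSESSIVES:
        return "possessive"
    if first in QUANTIFIERS:
        return "quantifier"
    if first in PRONOUNS and len(words) == 1:
        return "pronoun"
    return "A"   # zero-determination, marked A in the tables above

for n2 in ["The effects of the array parameters",
           "A new procedure",
           "Electromechanical coupling mechanisms",
           "These findings",
           "Three specimens",
           "their performance characteristics",
           "It"]:
    print(classify_n2(n2), "-", n2)
```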
Table 5 provides an overview of determination patterns in the introductions of abstracts in ABCORP.
Introduction
A | 20% (14)
a/an | 35% (25)
The | 38% (27)
Others | 7% (5)
Total | (71)
Table 5
Zero-determination (marked A) is seen by Adamczewski (1993: 210) as a direct reference to the extralinguistic universe20 with no thematisation. It occurs in 14 of the 71 passive sentences in the Introduction sections in ABCORP. Eleven sentences feature verbs from the first category defined above (i.e. with a broad semantic content). Example (32) below is typical of zero-determination.
(32) Quasi-monolithical integration of thin GaAs/InGaAs/AlGaAs-quantum well structures on LiNbO3 SAW devices is achieved using the epitaxial lift-off (ELO) technique.
By giving the chemical composition of the components, the speaker can hardly be any closer to the extralinguistic universe. Other N2's in this category are technical terms which are absolutely central to UFFC, such as deformation, acoustic fields, electromechanical coupling mechanisms, piezoelectric bending actuators, piezo helical springs, curved unimorphs, or piezo springs, or concepts derived from such terms, such as Improvement of sensitivity in ultrasonic fields of piezocomposite transducers. The information is apparently not proposed as given but as new, which may seem incompatible with a passive sentence in which the grammatical subject is proposed as given. I shall address this apparent conflict of interests further below. Although only three sentences in the Introduction section of abstracts in ABCORP feature an "A N2" sequence and a category 2 verb (i.e. a verb describing a scientific or technical process), it is interesting to note that the observation made above does not seem to apply to these three sentences. Their N2's are small cracks, known defects, and previous studies, which are indeed technical terms, but not in the domain of UFFC.
In ABCORP, 25 sentences featuring a passive as part of the introduction of an abstract also feature a/an as a determiner for the N2 of the sentence. Using a/an enables the speaker to refer to a specific occurrence of a notion, thereby limiting its scope. A/an also conveys minimal classification. Adamczewski (1993: 213) gives the following example:
- Let's not talk about all this. It was only a dream.
The speaker recognises the reference of this as being "an occurrence of a dream" and classifies it as such. Using a/an is a twofold operation which first entails acknowledging the existence of a reference and then classifying it, albeit minimally. The operation sets the noun group in Adamczewski's phase 1,21 i.e. a rhematic phase. The classifying aspect of the operation is of course particularly interesting for anyone concerned with the technical terminology of a given area of specialisation. A/an therefore introduces the speaker's conception of the noun or noun group it determines. He refers to the prototypical occurrence of a notion to acknowledge the existence of his own occurrence of the notion. This appears quite clearly in this subsection of ABCORP, in which six of the 25 N2's feature the adjectives new or novel, as does example (33) below.
(33) A new numerical model of a short-term stability measuring system of quartz crystal resonators is presented.
However, the main characteristic of the N2's extracted from this subsection of ABCORP is their extreme categorisation, as shown in the following three examples:
(34) A resonant liquid capillary wave theory which extends Taylor's dispersion relation to include the sheltering effect of liquid surface inclination caused by air flow is presented.
(35) A condition monitoring nondestructive evaluation (NDE) system, combining the generation of ultrasonic Lamb waves in thin composite plates and their subsequent detection using an embedded optical fiber system is described. (36) In this paper a novel ultrasound tomography imaging system is presented. 20 The absence of any visible or audible morpheme for this determination is symbolic of the lack of intervention of the speaker on the extralinguistic universe. 21 Adamczewski has defined a system of 2 phases for sentences. Phase 1 is rhematic or nonpresupposing whereas phase 2 is considered thematic or presupposing. Typically, a sentence in the simple present or simple past will be in phase 1 whereas one featuring the be + V-ing aspect will be in phase 2. 504 (34) and (35) show several levels of embedded clauses and (36) exemplifies heavy noun premodification. All three N2's thereby reach a level of classification which seems hardly compatible with the rhematic operation described above. I shall try to reconcile these two aspects further below. Eight sentences from this subset feature verbs from the second category however, only four different verbs are used (develop, fabricate, construct, and design). Their N2's seem to have characteristics similar to those of category 1 verbs. Having said that the grammatical subject of a passive sentence is chosen as the starting point or theme of the passive sentence and is therefore considered as given information as opposed to new information, it should come as no surprise that a majority of the N2's in the passive sentences in ABCORP are determined by the. Adamczewski's views on determination using the are particularly helpful. “The indicates that the noun group is in phase 2. The relation is presented as thematic by the utterer.” (1993: 215 [translation]) He gives the following example: - Mother, did I ever tell you? I am lucky ! - “ No, you never did “, said the mother. The first occurrence of “mother” is a direct reference to the person the sentence is directed to. However, the second occurrence , “does not refer to the partner in the conversation but to the mother already mentioned.” This shows the discourse has moved a step further away from the extralinguistic universe. It underlines the role of the speaker who filters the extralinguistic universe. Adamczewski refers to “distanciation from phase 1”. He further defines the notion when he deals with the ± generic effect of the the operator (1993: 218 [translation]):”The signals that the utterer somehow works on the notion's semic programme. We no longer have a truthful image of reality but a filtered one; there is a discrepancy with the extralinguistic universe. The relative alteration of the notion's semantics is modulated by the context and the situation.” These few lines are of an even greater relevance when studying scientific discourse. The specific operation of determining with the, and therefore of filtering the extralinguistic universe, is in total agreement with passivisation as described earlier. The speaker distanciates himself from the extralinguistic universe, he handles discourse objects to distribute information. Adamczewski (1993: 218 [translation]) offers the following summary: “The is related to phase 2 and enables the utterer to play on notion extension depending on his intentions and the requirements of the context.” Just under 40% of N2's in the Introduction section of abstracts in ABCORP feature the as their determiner. 
This represents 27 sentences out of 71 of which 24 feature category 1 verbs. Here are two examples. (37) The time-frequency distribution (TFD) of Doppler blood flow signals is usually obtained using the spectrogram, which requires signal stationarity and is known to produce large estimation variance. (38) The optically pumped cesium beam clock named Cs~IV is operated with a new short Ramsey cavity satisfying strict requirements on the microwave leakage level. The starting point of those sentences has been selected and is presented as such by the speaker. Through the use of the passive, he has chosen as N2 a noun group which he reckons is understood by the discourse community he is targeting. He has made of the understanding of the N2 a defining characteristic of his discourse community. In (37), the speaker's intervention is further underlined by the use of the abbreviated form in brackets stressing his endorsing of the terminology. In (38), the same intervention is made even more explicit through the embedded “named Cs~IV” thereby fixing Cs~IV as a term in a highly definitory context. In the 27 sentences, abbreviated forms appear seven times. In ABCORP, this set-up highlights such terms as longitudinal leaky surface waves (LLSW), crosscorrelation method (CCM), coherent population trapping (CPT), Doppler angle, direct digital frequency synthesizers (DDFS), digital-to-analog (DAC) output, surface acoustic waves (SAW), piezoelectric leaky surface waves, among others. The operation is further emphasised by the category the verb belongs to. By having a very broad semantic content, the Vt-en actually reveals very little of his N2. In ABCORP The, a/an and A account for more than 90% of N2 determination in passive sentences from Introduction sections. The remaining ten percent (five sentences) are occurrences of deictics, quantifiers, relative pronouns or possessive adjectives. They do not seem to have any statistical relevance at this point of the study. Although I have clearly stated above that the passive contributes to promote the grammatical subject of the sentence to the position of theme, of the three operators studied so far, the is the only one to apparently follow suit. 505 In fact, A/an and A seem to do quite the opposite — that is put forward new information. Still they are heavily used throughout the introduction of abstracts in ABCORP. Example (39) may offer clues as to the actual value of this combination. (39) A new numerical model of a short-term stability measuring system of quartz crystal resonators is presented. This is the first sentence of an abstract entitled “Modeling of a short-term stability measuring system of quartz crystal resonators.” In itself, this sentence does not bring any more information than already present in the title of the abstract. Its purely informative content is null. I believe the role of such a sentence is therefore not linguistic per se but metalinguistic. What the passive promotes is the operation carried out on the N2 and not the N2 itself. In (39) the author of the abstract acknowledges the existence of an already extremely complex notion, that of numerical model of a short-term stability measuring system of quartz crystal resonators and of an occurrence of this notion. His acknowledgement of the existence of such a notion is a trace of the categorisation operated by the speaker i.e. one of the building blocks of the experience he is trying to recreate through his text. He is setting the conceptual boundaries of his discourse. 
This metalinguistic feature can be implemented by other means than the passive. Examples (40) and (41) are also the first sentences of abstracts. They define the conceptual boundaries of the speaker's experience more explicitly.
(40) Recent research has shown that, for a rotating phantom, the speckle pattern may not replicate the phantom motion, rather it may show a large lateral translation component in addition to rotation.
(41) Recent papers have shown that focused ultrasound therapy may be feasible in brain through an intact human skull by using phased arrays to correct the phase distortion induced by the skull bone.
I am assuming that the same operation occurs when the N2 of a passive sentence in the Introduction section of an abstract is determined by A or the. What is promoted is the determination of the N2, not the N2 itself. Because of the very nature of A, the operation leads to a definition of wider conceptual boundaries; hence, in ABCORP, the occurrence of such terms central to UFFC as deformation, acoustic fields, electromechanical coupling mechanisms, etc. (see the list above). Inversely, the conceptual boundaries set through the use of the are only central to the research article itself. By being promoted to N2's, they take on the role of key concepts; hence the high density of technical terms occurring as the N2 of a Vt-en in the Introduction section of abstracts in ABCORP. However, the above description of determination is very close to what it would be in a context other than the passive. I have emphasised that the passive is a well-thought-out reorganisation of an initial pattern for a purpose. It clearly underlines the intent of the speaker, that is, to relate the scientific process he has been through and to recreate it for the benefit of his reader. It seems that the passive contributes to bringing forward the two-way dimension of communication. It does not just enable the speaker to reorganise and, as shown above, reconceptualise his own experience of the process he wants to relate; it also calls for the reader's (co-utterer's) acknowledgement and acceptance of the delimitation/definitions operated over the extralinguistic universe.
As already mentioned above, the Method sections of abstracts in ABCORP contain more passives than any other part. Whereas introductions contain 71 passive sentences, Method sections contain 165. The breakdown by verb category is also quite different: 84 of the Vt-en's are from category 1, 79 from category 2 and 1 from category 3. This breakdown confirms that the "quantum of information", as it is referred to by Halliday (1994: 34), is not evenly spread over the clause. It shows that, even in a section in which one expects accounts of procedures and therefore a sizeable amount of material processes, a majority of the verbs in passive sentences, though a slim one, still convey very little of those processes, most of the information being conveyed by their N2's. A comparison of N2 determination in this subsection of ABCORP and in the one previously studied also shows interesting results. Those results are summed up in Table 6.
Determiner | Introduction | Method
A | 20% (14) | 24% (40)
a/an | 35% (25) | 15% (25)
The | 38% (27) | 46% (75)
Others | 7% (5) | 15% (25)
Total | (71) | (165)
Table 6.
When considering zero-determination in passive sentences, the comparison between the Introduction section and the Method section of abstracts in ABCORP highlights a similar frequency.
Of the 40 N2's determined by A in the Method section of abstracts in ABCORP, 21 are subjects of category 1 verbs and 19 of category 2 verbs. However, the N2's determined by A are quite different from the ones undergoing similar determination in introductions. In the Method section, when the verb belongs to the first category, a majority of N2's refer to such concepts as computations, method, technique, measurements, simulations, or experiments. When premodified, those N2's take on a more restrictive semantic content than in introductions, as can be seen in examples (42) and (43).

(42) Optical interferometric measurements of pellicle displacement at discrete frequencies in tone-burst fields are converted to acoustic pressure, and the hydrophone for calibration is substituted at the same point, allowing sensitivity in volts per pascal to be obtained directly.

(43) Moreover, to guarantee the convergence of identification and tracking errors, analytical methods based on a discrete-type Lyapunov function are proposed to determine the varied learning rates of the FNNI and the optimal learning rate of the adaptive controller.

As in introductions, the other N2's are also concepts essential to UFFC or, more precisely, to the subject of the study being carried out. In (44) the concepts of location and extent are central to the study describing an application to detect prostatic carcinoma.

(44) For these patients, location and extent of the carcinoma were known from histological findings after radical prostatectomy.

Quite expectedly, there are fewer occurrences of determination using a/an in the Method sections of abstracts in ABCORP than in Introduction sections. In general, categorisation seems to have already been carried out. The boundaries the author has set for his research are well in place once he gets down to explaining his methodology. This is emblematically highlighted by the one and only occurrence of the sequence a/an new N, whereas the same sequence was concordanced in 24% of the a/an-determined N2's of passive sentences in introductions. Still, categorisation does occur in 15% of all passive sentences in the Method sections of abstracts in ABCORP. It seems however less marked than in introductions. Examples (45) and (46) are particularly relevant at this point.

(45) A line-of-sight optical projection through a test object is identified from an amplitude null and a sharp phase transition produced by diffusive waves originating from two in-phase (initial phase 0°) and out-of-phase (initial phase 180°) light emitting diode sources.

(46) A second air-coupled capacitance detector (apertured to 200 µm) was scanned in the field of the zone-plate source in order to image the generated ultrasonic field at various frequencies of operation.

In (45), the author of the abstract only makes use of one prepositional group for postmodification, whereas in examples (34) and (35) three whole embedded clauses are used. In (46) extra information is provided in brackets whereas it could perfectly well have been embedded as a clause. When in need of detailed categorisation, authors seem to resort to premodification. Example (47) takes this to great lengths.

(47) A 0.91Pb(Zn1/3Nb2/3)O3-0.09PbTiO3 (PZN-PT) single crystal with a high electromechanical coupling factor (k33) > 90% has been used to fabricate a 40-channel phased array ultrasonic probe with greater sensitivity and broader bandwidth than conventional probes.
Premodification offers a better integration of the modifier but it also blurs the relation between the modifier and the head noun, making it implicit. The resulting noun group is therefore “denser” and more is demanded of the reader to understand it correctly. The same is true of abbreviation as used in example (48).

(48) Then, because the dynamic characteristics of the USM are complicated and the motor parameters are time varying, an AFNNC is proposed to control the rotor position of the USM.

As already hinted at in the study of a/an determination of N2's in introductions, a “semantically rich” N2 seems to call for a “semantically poor” Vt-en (category 1 verb). In keeping with the observations made when studying N2 determination in introductions, it comes as no surprise that 15 out of the 25 Vt-en's extracted in this subsection of ABCORP belong to category 1. Once again, the quantum of information is very unevenly spread over the sentence. On the one hand the speaker sets out and acknowledges the existence of highly specialised notions; on the other hand he tells us very little about them and the actual role they might play in his research. The fact that the sentence is in the passive emphasises the effect. It makes it clear to the reader that he is expected to fully apprehend the whole semantic content of the N2 with very little co-textual help. His understanding requires appealing to his very own experience of the specialised field. By using the passive, the speaker gives the impression of setting out the N2's he has selected as the “compulsory checkpoints” in his research. Whoever would not go through those checkpoints would not understand his research. As mentioned in the introduction of this paper when dealing with the actual purpose of the research paper abstract, it appears quite clearly at this stage that through the abstract, the speaker carefully signposts22 the way for his reader. In this respect, the role of the abstract is very close to that of a table of contents. The function of the passive in this case seems to be a metalinguistic one pertaining to emphasising discourse23 cohesion. This metalinguistic function of the passive is even clearer when N2's are determined by the. In this case the modifiers and the head noun are sealed into a concept and labelled as a term, as in the Lagrange multiplier method, the dynamic range (DR) of the new TIAOC, the backpropagation algorithm, which all occur as N2's in the Method section of abstracts in ABCORP. As could be expected, these N2's are less premodified than when determined by a/an and, of course, by A. In the Method section of abstracts, the author takes his readers through the successive steps of the method he implemented. The fact that this section features the highest frequency of passive sentences but also of N2 determination by the shows that a particular emphasis is put on signposting this section more than any other. This is confirmed by the very high frequency of category 1 Vt-en's. Although some of the research described in ABCORP involves extremely precise processes, category 1 Vt-en's still account for almost half of the Vt-en's in the Method section. For example, such a process as “sintering”, which is central to piezoelectric ceramics production, does occur in ABCORP but never as a verb, and although “poling” and “prepoling” are essential they only rarely occur as verbs.
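As a side note, a claim like this (that a process word occurs in the corpus but rarely as a verb) can be checked mechanically on a part-of-speech tagged version of the corpus. The sketch below is purely illustrative and is not the procedure actually used for ABCORP; the tagged tokens and tag labels are invented for the example.

from collections import Counter

# Illustrative (word, part-of-speech) tokens; in practice these would be read
# from a POS-tagged version of the corpus, which is assumed here to exist.
tagged_tokens = [
    ("sintering", "NN"), ("poling", "NN"), ("prepoling", "NN"),
    ("sintered", "VVN"), ("poling", "NN"), ("is", "VBZ"),
]

def pos_profile(tokens, targets):
    """Count the part-of-speech tags observed for each target word form."""
    profiles = {t: Counter() for t in targets}
    for word, tag in tokens:
        w = word.lower()
        if w in profiles:
            profiles[w][tag] += 1
    return profiles

# Do "sintering", "poling" and "prepoling" surface mainly as nouns or as verbs?
for form, tags in pos_profile(tagged_tokens, ["sintering", "poling", "prepoling"]).items():
    print(form, tags.most_common())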
The occurrence of N2's as intrinsically “weak” as the model or the experimental data with category 2 Vt-en's is marginal and can hardly be considered a trend. Example (49), which features a rather weak N2, would prove the opposite, but it being the only occurrence of a category 3 verb in this subsection of ABCORP, drawing any conclusion would be hazardous, to say the least.

(49) Sintered PZT rods are inserted into a prepoled copolymer matrix, and the composite is repoled under a lower electric field.

A closer observation of the category labelled “Others” in Table 6 provides a fitting conclusion to this part, in which I have tried to bring forward the role of the passive as a means for discourse cohesion. “Others” represent 15% (25 occurrences) of N2 determination in the Method section of abstracts in ABCORP. Among these, a majority of determiners are deictics (9 occurrences, which represents just over 5% of the total for Method sections), all of which are occurrences of this or these. With this or these as a determiner, the N2 is introduced as rhematic with a reference to the extralinguistic universe. Example (50) features the only case of a premodified head noun in an N2 from this subsection of ABCORP.

(50) By using an extension of the slowly varying functions method, this differential equation is transformed into a nonlinear differential system with perturbation terms as the right-hand side.

The other N2's are this method, these methods, these parameters, this analysis, these data, these devices, this error, this line-of-sight. All trigger sentences developing to the right of the verb group and strengthen discourse cohesion.

22 “Signposting” is a term used by Salager-Meyer (1990) and quoted by Gledhill (2000: 42).
23 At this stage in the study, through the use of discourse rather than text as first announced in the title of the paper, I intend to emphasise the role of the speaker in the production of the abstract. This goes against the idea that the abstract would result from the mere selection of key points (usually on methods and results) by the author.

In the above paragraphs, I have tried to show that the operation carried out by the passive in the Method section of abstracts in ABCORP is intrinsically identical to the operation carried out by the passive in the Introduction section of abstracts in ABCORP. However, there being a major “purpose-shift” between the introduction of an abstract and its Method section, and the two having distinct “communicative functions”, the effect of passivisation can be quite different. In the Method sections most passives seem to highlight the key points of the research conducted. These points have been selected by the author to signpost his abstract. It is usually made more or less explicit, in the instructions for contributors, that abstracts should state at least the method used for the study and the results achieved. Still, the Results section of abstracts in ABCORP is not as overtly marked by passives as the Method section, as Table 7 shows.

        Introduction   Method     Results
A       20% (14)       24% (40)   31% (28)
a/an    35% (25)       15% (25)   17% (14)
The     38% (27)       46% (75)   31% (28)
Others   7% (5)        15% (25)   21% (19)
Total   (71)           (165)      (89)

Table 7.

Because of yet another “purpose-shift” from the Method section to the Results section, operations carried out by the speaker, although similar, take on a different interpretation. Zero-determination occurs for 31% of N2's in the Results section of abstracts in ABCORP, which is the highest frequency for this type of determination.
However, the N2's determined by A are quite different from the ones in the previous subsections. Indeed, this subsection of ABCORP features ten occurrences of results and data, to which can be added occurrences of such N2's as measurements, comparison, or performance. (51) and (52) are two examples:

(51) Data are given showing the results of using the linear quadratic Gaussian (LQG) technique to steer remote hydrogen masers to Coordinated Universal Time (UTC) as given by the United States Naval Observatory (USNO) via two-way satellite time transfer and the Global Positioning System (GPS).

(52) Results are presented demonstrating that the new method has both satisfactory tracking performance and the potential for practical real-time implementation.

These two examples show the cataphoric function of the N2 and the sentence developing to the right of the verb group. The N2's play a role identical to the one played by deictics in the previous part. They do take on a metalinguistic function and strengthen discourse cohesion. It is no coincidence to find nine cases of determination by deictics and four occurrences of a cataphoric it as an N2, also in this subsection of ABCORP. In this perspective, the role of the N2, and hence of the passive, in (51), (52) and (53) is certainly similar.

(53) It was found that the ~680 µm spot size of the experimental zone-plate did not vary appreciably with changing frequency, whereas the focal length increased markedly with increasing frequency (from ~5 mm at 450 kHz up to ~15 mm at 900 kHz).

The structure seems even more efficient with category 1 verbs, which is the case in 21 of the 28 passive sentences of this subsection. The technical terms determined by A in the Results section of abstracts in ABCORP are usually more specific to the study than to UFFC in general. The proportion of N2's determined by a/an in this section is almost identical to that in the previous section, and quite similarly the Results section does not seem to be the place to either categorise or introduce new notions. There are 28 N2's determined by the in this subsection of ABCORP; 15 are grammatical subjects of category 1 verbs and 13 of category 2 verbs. What differs at this point of the study is the nature of the N2's. Only 13 of them are actual specialised terms, and most of the N2's refer directly to their co-text. As examples (54) and (55) show, specialised terms can take part in the N2 but are no longer the head nouns of the N2.

(54) In addition, the effectiveness of the adaptive fuzzy-neural-network (AFNN) controlled USM drive system is demonstrated by some experimental results.

(55) The resulting improvement in radial and lateral blind deconvolution is demonstrated on six short ultrasound image sequences recorded in vitro or in vivo.

Between the Method section of abstracts in ABCORP and the Results section, the “purpose-shift” does not seem to be as marked as it is between the Introduction section and the Method section. Therefore the effects of the passive in Method sections and in Results sections bear similarities, and emphasis is put on text cohesion. Few specialised terms need to be introduced at this late stage of the study. They tend to be put in relation with another noun as part of the N2 rather than to head the N2. As Table 8 shows, only 24 passive sentences occur in the Discussion section of abstracts in ABCORP, which barely represents 7% of passive sentences in the whole corpus.
Very little conclusive evidence can actually be drawn from such a small sample. I shall therefore only underline that, by definition, the Discussion section of an abstract is the part of the study the reader cannot anticipate. The bulk of the informative content is provided by the author. The effects conveyed by the passive do not seem to match the communicative function of this section.

        Introduction   Method     Results    Discussion
A       20% (14)       24% (40)   31% (28)   21% (5)
a/an    35% (25)       15% (25)   17% (14)   12% (3)
The     38% (27)       46% (75)   31% (28)   50% (12)
Others   7% (5)        15% (25)   21% (19)   17% (4)
Total   (71)           (165)      (89)       (24)

Table 8.

6. Conclusion

As underlined in the introduction to this paper, the abstract is often considered as a “mini-version” of the research paper and should be “self-contained.” The IMRaD sections of the research paper also characterise the abstract, and each section participates in a rhetorical development. Throughout this study I have tried to show that, far from betraying stylistic shortcomings, the use of the passive in research article abstracts participates in this rhetoric. I have first established that the passive is not a mere mechanical transformation of a hypothetical primary active sentence but a clear and well-thought-out reversal of an original pattern (predicable lexis). I have then underlined two effects of the use of the passive. The passive first enables the author to set out the terminology he intends to use. In this respect, and with regard to the consistency of use of the passive, I have tried to bring forward the passive as another example of “the way phraseology helps to shape a specific view of transitivity at the same time as framing terms stereotypically”, as underlined by Gledhill (2000: 167). The passive as a means of staging technical terminology is particularly present in the Introduction section of abstracts. In abstracts, the passive also contributes to signposting. It clearly marks the compulsory “checkpoints” the reader has to go through in order to understand the abstract. In this role, the passive strengthens discourse cohesion, which is particularly noticeable in the Method and Results sections of abstracts in ABCORP. Salager-Meyer (1990: 378)24 has underlined the lack of cohesive devices in “unsuccessful abstracts” and the remark certainly applies to ABCORP. However, when lexical cohesive devices occur, they tend to do so in active sentences or to co-occur with “weak” N2's. Quite logically, also is the link-word occurring most frequently in passive sentences; it introduces a rather neutral link, unlike more marked link-words such as thus or therefore. Table 9 shows the results of a concordance of also over ABCORP.

Number of occurrences of also: in active sentences, 13; in passive sentences with a semantically weak subject, 7; in passive sentences with a semantically strong subject, 5.

Type of N2 in the passive sentences:
§ Experimental results also are presented
§ […] the same data also can be averaged
§ Data also are shown […]
§ It is also substantiated […]
§ This improvement also can be observed […]
§ The numerical results also are compared to […]
§ Also, it was observed that […]
§ A dispersive perfectly matched layer (DPML) boundary condition, which is suitable for boundary matching to such a dispersive media whole space, is also proposed […]
§ Modern beam forming techniques such as apodization, dynamic aperture, elevational focusing, multiple transmit focusing, and dynamic receiving focusing also can be simulated.
§ […] the free and soft planar baffle also can be considered.
§ Good image reconstructions based on simulations and real objects also are provided […]
§ […] negative Doppler shifts also are produced.
§ A general overview of the theory behind the LQG technique also is given.

Table 9.

24 As quoted in Gledhill (2000: 42).

In this study, I have categorised verbs occurring in passive sentences. The categories I have proposed are based on the semantic content of the verbs. They somehow differ from the categories set out by Gledhill in his study of the phraseology of cancer research papers. Gledhill (2000: 213) defines four main process types: research, empirical, clinical and biomedical. He points out that “these four dimensions form a continuum in which they represent the relative involvement of the author in scientific activity (either in experiment or writing up).” I have tried to show that the passive is one of the clues of this involvement, my category 1 largely overlapping Gledhill's research and empirical ones. The categorisation of N2's I have proposed, based on their determination, has underlined the lexical cohesion of discourse through labelling (Francis' term, 1994). N2's become grammatical metaphors contributing to, in Gledhill's words (2000: 204), “the distribution of thematic roles within the clause and at the same time [being] a key mechanism in the construction of new meanings.” In keeping with Gledhill's (2000: 166) and Nwogu and Bloor's (199125) findings, the metalinguistic role of the passive contributing to strengthening discourse cohesion confirms that “abstracts tend to employ simple thematic progression, linearly converting rheme to theme.” Authors are often unaware of whether they “should” use the passive or not. This study has tried to show that the passive is used with great consistency in research paper abstracts, through which “grammatical collocations” (Gledhill's term) have emerged. This would favour the existence of what Gledhill (2000: 167) terms a scientific “voice”.

25 As quoted in Gledhill (2000: 166).

Acknowledgements
The author wishes to thank Professor M. Clay from Lyon III University for his suggestions. This study was supported by INSA-Lyon's ESCHIL, Centre for Humanities, and Department of Electrical Engineering.

References
Adamczewski H 1993 Grammaire linguistique de l'anglais. Paris, Armand Colin.
Biber D, Conrad S, Reppen R 1998 Corpus linguistics. Cambridge, Cambridge University Press.
Booth V 1993 Communicating in science. Cambridge, Cambridge University Press.
Brown G, Yule G 1983 Discourse analysis. Cambridge, Cambridge University Press.
Couper-Kuhlen E 1979 The prepositional passive in English. Tübingen, Niemeyer.
Day R 1989 How to write and publish a scientific paper. Cambridge, Cambridge University Press.
Francis G 1994 Labelling discourse: an aspect of nominal group cohesion. In Coulthard (ed), Advances in written text analysis. London, Routledge, pp 83-101.
Gledhill C 2000 Collocations in science writing. Tübingen, Gunter Narr Verlag.
Granger S 1983 The be + past participle construction in spoken English. Amsterdam, North Holland.
Groussier ML, Riviere C 1996 Les mots de la linguistique, lexique de linguistique énonciative. Paris, Ophrys.
Halliday MAK 1994 An introduction to functional grammar. London, Edward Arnold.
Joly A, O'Kelly D 1990 Grammaire systématique de l'anglais. Paris, Nathan.
Lapaire R, Rotgé W 1991 Linguistique et grammaire de l'anglais. Toulouse, Presses Universitaires du Mirail.
Lobban C, Schefter M 1992 Successful lab reports. Cambridge, Cambridge University Press.
Nwogu KN, Bloor T 1991 Thematic progression in professional and popular medical texts. In Ventola (ed), Functional and systemic linguistics: approaches and uses. Den Haag, Mouton de Gruyter, pp 369-384.
Salager-Meyer F 1990 Discoursal flaws in medical English abstracts. Text 10(4): 365-384.
Sides C 1991 How to write and present technical information. Cambridge, Cambridge University Press.
Svartvik J 1966 On voice in the English verb. Den Haag, Mouton de Gruyter.
Tesnière L 1959 Éléments de syntaxe structurale. Paris, Klincksieck.

Do women and men really live in different cultures? Evidence from the BNC
Hans-Jörg Schmid, University of Bayreuth, Germany

In her bestseller You just don't understand Deborah Tannen claimed that “talk between women and men is cross-cultural communication” (1990: 18). Two years later, Geoffrey Leech and Roger Fallon (1992) published a paper with the title “Computer corpora – what do they tell us about culture?”. They showed how the frequencies of words in the Brown and the LOB corpora mirror the importance of certain concepts in American and British culture. In a note they expressed their hope that “by the year 2000, it will be possible to make use of these corpora [i.e. BNC and COBUILD] for cross-cultural studies on a much larger scale than is now possible on the limited basis of the Brown and LOB corpora” (1992: 47). To a large extent thanks to Geoffrey Leech's own contribution to corpus linguistics, their hopes were not in vain. If we combine Tannen's claim with Leech and Fallon's method, we arrive at an obvious question: Can the BNC tell us whether Tannen is right? The paper addresses this question by comparing the frequencies of words and collocations as used by women and men in the spoken part of the BNC. Words from the following domains have been investigated:
· personal references (personal pronouns, male and female proper names)
· family
· personal relationships
· home
· food and drink
· body and health
· clothing
· car and traffic
· computing
· sport
· public affairs
· abstract notions
· alleged “women's” and “men's” words
· swearwords
· hesitators, fillers, backchannel behaviour
· linguistic politeness markers
· linguistic markers of uncertainty and tentativeness
· linguistic markers of conversational cooperation and support
The data indicate that Tannen's claim is indeed true, since most of the words and collocations investigated exhibit significant differences which seem to be gender-determined. On the whole, there is strong converging evidence that women's speech style tends to be marked by proximity and involvement, and men's by distance and detachment. This can indeed be interpreted as reflecting some kind of cultural difference. Nevertheless, not all the distribution patterns are in conformity with what is suggested in the gender-linguistic literature. The expletive bloody, for example, is much more often used by women than by men, particularly frequently in fact by women in the 45 to 59 years age band. (The factors age, social class and education are also taken into consideration but not focussed on.) Depending on the time/space available, a selection of this data will be discussed in detail. The rest of the material will be summarised in tables in order to leave room for an adequate interpretation of the findings.

Exploring the Chemnitz Internet Grammar: examples of student use
Josef Schmied
English Language & Linguistics, Chemnitz University of Technology, Germany

1. Research context and design
1.1. Context

The Chemnitz Internet Grammar (CING) is a research tool and a teaching aid at the same time. The aim of the research project1 is to induce guiding principles for the development of interactive, learner-specific information retrieval (grammar) programs for the internet and to produce a program based on aspects of English grammar which applies these principles. The target group of users consists of advanced learners of English, mainly with German as a mother tongue; thus it concentrates on those areas of English grammar where substantial differences occur between English and German. The research results on learners' behaviour and preferences are always directly re-implemented in the grammar program, which serves as a learning tool. In contrast to other internet grammars, ours is essentially a double reference work: a database that enables inductive language learning using authentic examples, and a description of the grammar that enables deductive language learning using the “rules” contained within it. In many ways our grammar is based on the complex system of pedagogic grammar [PG] proposed by Corder as early as 1973, illustrated in his circular figure on PG in teaching materials with its four interdependent key elements: inductive exercises, data and examples, explanations and descriptions, and testing exercises. Through the related exercises and links, the reference work can be used for university-level language teaching and in-service teacher training. Generally, our approach can be characterized by the following key words: learner-centred; contrastive in the deductive component and data-based in the inductive component; interactive and learner-adaptive, e.g. providing immediate feedback and correction, in the exercise component.

1.2. Comparison with other grammars in hypertext and book form

The Chemnitz Internet Grammar can of course be compared with other grammars that were published recently. The only other empirical and corpus-based internet grammar is the London Internet Grammar (1996-98), which, however, is neither contrastive nor has an inductive component. The last point also goes for the other grammar books published recently: the Longman Grammar of Spoken and Written English (Biber/Johansson/Leech/Conrad/Finegan 1999) and Mindt (1999) are also corpus-based but they are not directly EFL-oriented. Our grammar is written explicitly for the foreign language learner, like Celce-Murcia/Larsen-Freeman 1999, which however, in contrast to all the others mentioned, is not corpus-based. By combining corpus linguistic approaches with an EFL rule-based grammar we can also make it clear to the advanced learner that most language rules are not absolute but rather relative, and that there is a wide (acceptable) range between prototypical and creative constructions. An important aspect of our deductive grammar (although the grammar corpus can be used to investigate all grammatical questions inductively) is that it is not a complete grammar (like Biber et al. 1999 or Celce-Murcia/Larsen-Freeman 1999) but concentrates on exemplary grammar areas that make English special in many ways.
In the German-English contrastive perspective, it covers three different areas so far, verbal, nominal and clausal:
· In the tense/aspect/modality section it emphasizes for instance the progressive aspect, which is not grammaticalized in verbal form at all in German; the present perfect, where the distribution in German is completely different; and modulation, where the preference is more towards the adverbial rather than the modal auxiliary construction (Schmied/Schäffler 1996).
· From the noun phrases our grammar discusses prepositions, as complements as well as adverbials, again since there are interesting differences between the two typologically closely related languages German and English.
· The clausal level is represented by relative constructions, where the basic forms are parallel to their German equivalents but again the distribution is different, and conditional clauses, where the sequence of tenses is unusually strict in English.

1 The project has been financed by the German Research Association (DFG) since 1998 as part of the New Media research group at Chemnitz University of Technology (cf. Schmied 1999 for conceptual and Gorlow et al. 2001 for technical aspects). It also serves as a basis for other e-learning projects. I wish to thank my collaborators Naomi Hallan, Diana Hudson Ettle, Christoph Haase, Angela Hahn and Sabine Reich for many interesting and thought-provoking discussions.

The following presentation concentrates on the specific research aspects of our grammar, i.e. analysing learner behaviour. It illustrates the sociobiographic questionnaire, the tracking mechanisms and the first results of the experiments on student behaviour in the CING. It uses the verb phrase as an example and compares the deductive explorations of ‘grammar rules', the inductive searches for corpus samples and the work on the exercise component. Many of the results are preliminary, which seems obvious since the CING is an on-going research project, which will keep us occupied for another four years.

2. Basic guidelines and assumptions

2.1. Tracking user behaviour

One of the basic assumptions of the CING research project is that different user types use different learning strategies, depending on age, language skills, exposure, computer literacy and other variables that might have an influence on learning style. That is why our questionnaires cover a wide range of variables under three lists: general, language-related and computer-related. The influence of these variables on learning strategies is then compared in a statistical analysis where the sociobiographic data are compared with the recorded usage data. Of course, one of the central variables for us is the proportion of pages used in the explanations and the discovery sections. Fig. 1 thus shows that for many learners grammar is mainly a deductive exercise; only a few venture into the inductive section. On this basis we can explore, for instance, which type of learner uses a relatively high proportion of the discovery section (which has less than half as many pages as the explanations section); a schematic example of this kind of comparison is sketched after the quotation below.

Figure 1: User diagram

The first results of our tests confirm the tentative conclusion by Yan-Ping (1991: 272) for Chinese learners of English:

With respect to the more complex properties such as the semantic meanings of the present perfect, explicit instruction does not show any superiority to implicit instruction. ... A tentative conclusion can be drawn that explicit instruction is effective with simple rules but not with complex rules.
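The following is a minimal, purely illustrative sketch of the comparison just mentioned: relating the proportion of discovery-section page views per user to one questionnaire variable. The record format, the variable name and the values are hypothetical; the actual CING analysis is carried out on the recorded tracking data and SPSS files described in section 3.1.

from collections import defaultdict

# Hypothetical page-view records from the tracking log, already joined with one
# questionnaire variable ("computer_experience"); not the real CING data format.
log = [
    {"user": "u01", "section": "discovery",    "computer_experience": "high"},
    {"user": "u01", "section": "explanations", "computer_experience": "high"},
    {"user": "u02", "section": "explanations", "computer_experience": "low"},
    {"user": "u02", "section": "explanations", "computer_experience": "low"},
    # ... one entry per page view
]

# Proportion of discovery-section pages per user (cf. Fig. 1).
per_user = defaultdict(lambda: {"discovery": 0, "total": 0, "group": None})
for row in log:
    u = per_user[row["user"]]
    u["total"] += 1
    u["group"] = row["computer_experience"]
    if row["section"] == "discovery":
        u["discovery"] += 1

# Mean proportion per questionnaire group, as a first descriptive step before a
# fuller statistical comparison (e.g. in SPSS).
groups = defaultdict(list)
for u in per_user.values():
    groups[u["group"]].append(u["discovery"] / u["total"])
for group, proportions in groups.items():
    print(group, round(sum(proportions) / len(proportions), 2))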
Of course, whether learners can actually apply later on in tests what they have “learnt” by looking at the screen depends on sociobiographic variables (like computer experience and language skills) as well as on presentation variables. In this context, the different perspectives on “learning” by specialists in linguistics, artificial intelligence and psychology have become obvious.

2.2. Hypertext advantages and disadvantages

2.2.1. Advantage: user-specific presentation of information

A notorious feature of internet presentations is that they are so “evasive”, i.e. they can be changed quickly and adapted to new conditions or knowledge - or disappear. For some research purposes it is a great advantage if we can experiment with various types of presentations and thus measure the “explanatory value” of presentations. Explanatory value here does not mean whether a linguistic concept or theory can explain all or the majority of the cases, but whether a presentation appeals more or less to an advanced language learner and (thus) has a better effect on tested practical language skills (more than theoretical knowledge). For the English tense/aspect complex, for instance, we have several options: Leech and Svartvik (²1994: 150f) have developed a comprehensive overview of tense and aspect, which can be used as a summary of the topic. The little diagrams there are, however, not directly related to the diagrams we used (cf. Fig. 2), which we thought would be palatable for advanced learners of English (cf. Hahn/Reich/Schmied fc.). We used Reichenbach's concepts as a starting point and presented speech time versus reference time as central to verbal time relations. In the students section we used prototypical examples and even animations to bring the basic ideas across; in the specialist section we used non-prototypical examples and problem cases (cf. also Schmied 1998).

Figure 2: Speech and reference time

By changing the concepts, in some cases even the (syntax) theory displayed, we can determine the explanatory value in this sense. We can for instance demonstrate clines and genre distributions as in Biber et al. (1999) and Mindt (1999) and see whether advanced learners can use them to produce more prototypical sentences - far beyond the usual right-wrong dichotomy of “rules”. Mindt (1999: 249) for instance just lists 9 meanings of the progressive: incompletion, temporariness, iteration/habit, highlighting/prominence, emotion, politeness/downtoning, prediction, volition/intention and matter-of-course (and exemplifies them with prototypical examples), but afterwards he singles out the first three as progressive meaning in contrast to non-progressive (ibid: 250). This seems a good compromise for showing a cline from more to less central meanings. Although many students appreciate this type of guideline, for a corpus-based grammar this approach may be surprising, since even Swan (1980) emphasises the collocational aspect when discussing the usage of the progressive after certain verb classes and adverbs (s.v. progressive).

2.2.2. Advantage: cross-referencing

One of the basic advantages of an internet grammar, apart from the possibilities of recording user behaviour, is that information can be presented in a hypertext format, where more than one direction of thought and argumentation can be offered to the reader.
Thus cross-referencing from progressive forms to collocations like stative verbs, prepositions and adverbs, which support or contradict the “progressive” interpretation of meaning, can show the complex network of grammar more impressively than any linear description. We had seen it as a main advantage to be able to link the main “branches” of our internet grammar not only to the “stem” but also to other “twigs” pointing in the same direction. This also implies “understanding difficult contrasts in tense-aspect combinations”, as they are called by Celce-Murcia/Larsen-Freeman (1999: 124-128). The idea of the grammar as a system, “the tense and aspect system” (ibid: chapter title), is essential for foreign learners of English whose first languages do not have aspect grammaticised in the verb phrase, since the contrast to “alternative” constructions has to be referred to constantly.

2.2.3. Disadvantage: orientation and navigation

One of the well-known problems of the internet is that its users tend to “get lost in cyberspace”. Our grammar is no exception to this rule, although references outside of the (so far) five hundred pages are few. Although we had clear frame signals indicating which section the learner was in, explanations/discovery or student/advanced (cf. Fig. 2 above), and although we provided a bookmark option and an index, our students obviously found orientation and navigation difficult in the hypertext maze we had created, where “one could not even quote page numbers”. So we had to put in a user history for learners who wanted to trace back visually what they had read during their current session, a tree diagram to show the structure of the section, and a complex reference system on every page for those who wanted to quote certain pages.

2.2.4. Disadvantage: self-contained pages

One of the major claims of the Internet Grammar had been that it facilitates cross-referencing, but the non-linear structure also has clear disadvantages, since web pages have to be more or less self-contained. Thus a headline like “Why is there a choice?” is, of course, ambiguous if one does not get there directly from a page that has engraved it deeply into the reader's mind that we are talking about the contrast between simple and continuous tenses. In the index a simple headline like this could refer to many “crossroads” in a grammar.

3. Research results and their implications

3.1. Writer-user interaction

One of the basic problems for grammar writers has always been that it is very difficult to anticipate the needs and learner strategies of the user. This is a special case of expert-layman interaction, although, of course, a certain grammatical awareness (in the sense of familiarity with linguistic thinking and terminology) can be assumed in a foreign-language context (in contrast to British schools, where its lack has been deplored for a long time). In our internet grammar we can not only discuss with our students to gain qualitative information from actual users and distribute questionnaires to collect more quantitative data; both measures are relatively subjective and have to take into consideration that, even if students actually know what they are doing, their answers might be shaped by the need to make a good impression on the grammar teachers.
With the help of computers, two more objective ways of measuring how students use grammars can be applied:
· In off-line experiments we can use an eye tracker to record how students read texts and how they gather information (in this case on grammar) from a webpage. Here hesitation phenomena may indicate problems of terminology, a negative interpretation, as well as inspiring thoughts, a positive interpretation.
· For on-line recording we have developed a special procedure which records all the URLs that were called up, the reading time, the user input and the user ID. The recorded data in their raw format are not easy to interpret; we therefore wrote a special software programme to make the figures of the individual sessions more palatable for the interpreter. Through a different little programme the data of individual users are transferred into an SPSS file so that a comparative quantitative analysis can be undertaken.

3.2. Terminology: boring stuff - catchy language?

One of the great prejudices against grammar is that it is boring. In writing our grammar we therefore experimented with more “adapted”, user- or age-group-specific language, such as “Is it simple: be progressive, use the continuous” (Fig. 3), which we thought would be appropriate for the students. This headline indicates at least three ideas.
· Although there is an important difference between English and German in the use of the progressive, it is not difficult.
· The use of the progressive or continuous forms is a modern feature which is part of grammatical change and progress in British English.
· Progressive and continuous are two terms that indicate more the function and the form respectively but ultimately refer to the same thing.
Most of these thoughts or even innuendoes were lost on our students. They considered grammar writing a serious business and were not trapped by the “young” language at all. They rather considered some of it inappropriate and distracting. This indicates that witty grammar writing, at least on advanced topics, is only something for the very advanced user. Thus, catchy language seems to be more appropriate for expert-to-expert communication than for expert-to-layman communication. A further problem with this style is consistency. Thus we have not been able to come up consistently with witty headlines or direct reader-specific questions and relapsed into traditional headlines like “Verbs of state and mental states”, “Verbs of bodily perception”, etc. - and this tradition was sometimes not too bad after all. It has to be remembered that consistency in grammatical terminology seems very desirable from the learners' point of view. Whereas applied linguists often play with overlapping concepts, students struggling with concrete problems only find it confusing when “similar” ideas can sometimes be found under the heading “continuous” and at other times under “progressive” (thus most grammars rightly use only one term as central).

Figure 3: Be progressive

3.3. A bilingual approach?

One result of our bilingual basis and contrastive analysis was that we also used the translation database to exemplify meaning in the explanations section and in the exercises (Fig. 4)2.
Unfortunately, German students very often looked at the German equivalents first; that means that the old notion of the monolingual teaching tradition is challenged, since the students suggested that, in particular in cases where there is a marked contrast between the first language German and the second language English, it would be very useful to look at the translation equivalents. Maybe, however, the students should not be given the easy option of the German equivalents? In an internet grammar one can of course change the presentation for a certain group of users and investigate whether the test results or the reactions are significantly different.

4. Outlook: cultural and tutorial systems

In the second phase the CING will be expanded into a more diversified and comprehensive learning tool (and thus research instrument). In this context, more diversified means that learners in the grammar will be given more choices or that different learner groups will be presented with different, more appropriate versions. More comprehensive means that learners will not only be given grammar sections to read but also complete authentic texts, so that they can also practise and we can also measure the comprehension of lexemes, idioms and cultural conventions. This will be achieved by introducing more “distanced” texts in the learning tool: texts from non-European English-using cultures like East Africa, where English is used as a second language and has to be adapted to the specific environment and sociocultural conventions.

Figure 4: MC-EXERCISE

We will not only present new texts but also new learning aids such as a culture-specific lexicon that enables the non-initiated user to acquire not only grammar, but also “culture”, as it is encoded for instance in such contrastive concepts as askari (Kiswahili for watchman/policeman) or the jua kali (for “under the hot sun”, i.e. informal) sector. It should be mentioned here that the verb phrase also includes culture-specific variation. The expansion of the continuous forms to stative verbs in cases like it is costing a lot is a very common feature in second-language varieties of English in Africa and Asia and has mainly intralinguistic reasons. The reduced usage of modal verbs can be seen as culture-specific; most second-language users would consider it exaggerated to use too many Would you be so kind to/Could you please ... forms. With the expansion of the internet, anyone in Europe for instance has access to East African newspapers on the internet even before the direct readership in East Africa can hold them in their hands. This raises the more general question of how cultural outsiders cope with culture-specific expressions, and our research could make a contribution to the fast-growing area of intercultural communication. In order to measure that we also intend to develop new test methods for the internet. A second new development could be user-specific tutorial systems. Obviously many users have, initially at least, found it difficult to use the system efficiently, as we can see from the tracking records, their help calls and their interview comments. Thus the option of a “guided tour” for newcomers seems appropriate.

2 This figure also serves as an illustration of our (traditional) multiple choice questions and the reactions generated automatically by our system. After completing the exercise students can receive a summary of their results and recommendations for further study.
Although this offers new research perspectives, we have to bear in mind that this also skews our original research, which concentrated on monitoring unrestricted user choice according to individual learning strategies, and that initial “guidance” influences personal learning styles. For many learning systems, however, tutorials are a decisive factor; gathering information on their usage and influence therefore only adds a modern dimension to the old problem of how grammars are used. There are of course different tutorial systems possible for the Internet Grammar:
· A straightforward walk-through simply illustrates what is written in the existing guidelines anyway, and animation is generally more appreciated by internet users than running text. The linear sequence of slides, like the well-known PowerPoint presentations, can be implemented without much effort.
· A more interactive tutorial would be more in line with the original principles of the CING. Such a tutorial uses learner input either immediately or from earlier sessions to select appropriate versions of grammar sections. The effect of a machine recommending “You need to practice more PROGRESSIVE TENSES” suggests objectivity and urgent need to many students.
· A fully adapted tutorial includes much more: learner-specific information from the sociobiographical questionnaire and results from placement tests etc. Here even inconsistent input can be identified, and learner-type-specific information, such as successful patterns of learning strategies, can be included so that a very diversified and modular system evolves. This is a major step towards fully autonomous learning systems without a personal tutor.
This brief outlook shows that the learning system developed under the name CING can make a more general contribution to the development of our society, which is said to be moving towards a new internet and knowledge society. Whether modern human beings that use an internet-based system like the CING really appreciate this development is not clear, but we can at least indicate in a very limited area where user-group-specific preferences tend to lie.

References
Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman grammar of spoken and written English. London, Longman.
Celce-Murcia M, Larsen-Freeman D ²1999 The grammar book. An ESL/EFL teacher's course. Heinle & Heinle.
Corder S P 1973 Introducing applied linguistics. Harmondsworth, Penguin.
Gorlow E, Haase C, Hallan N, Hudson-Ettle D, Schmied J 2001 Internet Grammar. Technical documentation. http://www.tu-chemnitz.de/phil/InternetGrammar/manual.html
Hahn A, Reich S, Schmied J fc Aspect in the Chemnitz Internet Grammar. ICAME Proceedings Freiburg 1999. Amsterdam, Rodopi.
Leech G, Svartvik J ²1994 A communicative grammar of English. London, Longman.
London Internet Grammar 1996-98 http://www.ucl.ac.uk/internet-grammar/home.htm
McDonough S H 1999 Learner strategies. Language Teaching 32: 1-18.
Mindt D 2000 An empirical grammar of the English verb system. Berlin, Cornelsen.
Schmied J 1998 To choose or not to choose the prototypical equivalent. In: Schulze R (ed), Making meaningful choices in English. On dimensions, perspectives, methodology, and evidence. Tübingen, Narr, pp 207-222.
Schmied J 1999 Applying contrastive corpora in modern contrastive grammars: the Chemnitz Internet Grammar of English. In: Hasselgard H, Oksefjell S (eds), Out of corpora. Studies in honour of Stig Johansson. Amsterdam, Rodopi, pp 21-30.
Schmied J, Schäffler H 1996 Approaching translationese through parallel and translation corpora. In: Synchronic corpus linguistics. Papers from the 16th international conference on English language research on computerized corpora (ICAME 16), Toronto 1995. Amsterdam, Rodopi, pp 41-56.
Swan M 1980 Practical English usage. Oxford, Oxford University Press.
Yan-Ping Z 1991 The effect of explicit instruction on the acquisition of English grammatical structures by Chinese learners. In: James C, Garrett P (eds), Language awareness in the classroom. London, Longman, pp 254-277.

Is it Creole, is it English, is it valid? Developing and using a corpus of unstandardised written language
Mark Sebba and Susan Dray
Department of Linguistics and Modern English Language, Lancaster University

This paper will present our experiences of developing and using two computer corpora of written Creole. By ‘Creole’ here we mean English-lexicon creoles of Caribbean origin, which are also used in Britain. Creole, both written and spoken, is unstandardised and subject to a high degree of variability at all levels – especially in grammar, phonology and orthography – because of its relationship with Standard English. The corpora we will discuss are part of an on-going project at Lancaster University, investigating the practices of writers using Creole. They are:
· the Corpus of Written British Creole (CWBC), a collection of texts written wholly or partly in Creole by West Indians whose formative years have been spent in Britain
· a Corpus of Written Jamaican Creole (CWJC), a collection of texts of diverse types written by Jamaicans in Jamaica
We will briefly discuss issues which arose when setting up the corpora, in particular the practical issue of the identification of texts for inclusion, and the specific problems that this raises with respect to English-lexicon Creoles. Can a text containing Creole features be included in the corpus if the writer's target appears to be Standard English? Where Creole occurs together with Standard English, to what extent is it necessary to include the Standard English (SE) parts in the corpus – and where is it legitimate to make a ‘cut'? We then focus on some potentially complex linguistic questions which arose during the annotation procedure, e.g. to what extent do graphological features, such as punctuation, layout, and the use of upper and lower case letters in a text, form part of a ‘naturally’ developing orthography? Is it important for these features to be indicated in the corpus? To what extent do an individual's literacy skills affect what is written, and what are the implications of this for the representativeness of the corpora? We will end with some remarks on the potential of these corpora as a research tool in order to illustrate the application of the annotation method. Linguistic items and their orthographic representations can be compared and contrasted with variables such as demographic information (about writers or voices within texts), country, text type, and social context. This could provide information, for example, not only on orthographic variations within or across texts, but also on how writing practices may vary according to social, cultural, generational and educational factors as well as over time.

Language model adaptation for highly-inflected Slovenian language in comparison to English language
Mirjam Sepesy Maučec and Zdravko Kačič
University of Maribor, Faculty of Electrical Engineering and Computer Science
Smetanova 17, SI-2000 Maribor, Slovenia
mirjam.sepesy@uni-mb.si

Abstract

The language model in this article is meant as an information source for a speech recogniser. The environment in which the recogniser will be used is topic-specific. The idea is to try to adapt the model of general language to the target domain of discourse. We concentrate on a feature extraction process devoted to highly-inflected languages. Results of experiments on English and Slovenian corpora are reported.

1. Introduction

A language model aims to provide a representation of language. We concentrate on statistical aspects of modelling and not on grammatical ones. Statistical models rely on the assumption that the future use of a language will follow similar linguistic patterns to those used in the past. A few years ago the problem of statistical language modelling was defined as the sparse data problem. Almost all language model research has adopted the “bigger is better” approach, where enormous volumes of training text are analysed in order to derive more reliable statistics. However, improvements with size did not yield much better language models for speech recognition. The training corpus should be used in a more advanced way.

2. Basic language modelling

The task of a language model is to assign the probability $P(W)$ to every conceivable word string $W$. N-gram language models are most widely used (Jelinek, 1998). An N-gram is a model that uses the last N-1 words of the history as its sole information source. Although they are very simple, experiments have shown that they are surprisingly difficult to improve on. We use trigram models, which restrict the history to the two immediately preceding words:

$P(W) = \prod_{i=1}^{n} P(w_i \mid w_1 \ldots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2} w_{i-1})$   (1)

We should point out that the word refers to a word form defined by its spelling. Two differently spelled inflections or derivations of the same stem are considered different words. This does not lead to a problem in modelling English, but it does in modelling highly-inflected languages. The great number of different word forms derived from one lemma causes enormous vocabulary growth. This problem can be solved by choosing another basic unit instead of the word (for example, the morpheme) (Byrne et al., 2000). In this article we are not concerned with basic language modelling. Deriving trigram and bigram probabilities is always a sparse estimation problem; probability smoothing was performed by the Katz back-off algorithm (Katz, 1987).

3. Topic adaptation

N-gram techniques seem to capture short-term dependencies well. They lack any ability to exploit the linguistic nuances between domains. The environment in which the recogniser will be used is topic-specific. If we could effectively identify the domain of discourse, a model appropriate for the current domain could be used. We do not assume the target domain to be equivalent to one predefined topic. The target domain can be seen as a combination of several elemental topics. The goal of the adaptation is to lower the language model perplexity by providing a higher probability for words and word sequences which are characteristic of the domain of discourse. Adapted models were built on three semantic levels:
· general language model (G). It was built by using all available training text.
· topic model (T). It was built by using only the text of the predefined topic most similar to the target domain.
· general topic model (T10). It was built by using the text of the 10 predefined topics most similar to the target domain.
We built two interpolated models:
· combined model (C):
$P_C(w) = \lambda P_T(w) + (1 - \lambda) P_G(w)$   (2)
· novel model (N):
$P_N(w) = \lambda_1 P_T(w) + \lambda_2 P_{T10}(w) + \lambda_3 P_G(w)$   (3)
Figure 1 shows the adaptation scheme.

Figure 1: The adaptation scheme.

The adaptation scheme consists of the following three steps:
· corpus organisation. With the growing availability of textual data in electronic form, large topically diverse corpora are constructed. Stories that share similar topics are gathered together into a set of clusters.
· topic classification. A classifier is used to find the clusters that are most similar in topic to the sample from the target environment.
· language model building. Language models at different semantic levels are built. The models are interpolated at the word level.
The first step is corpus organisation. We need a corpus organisation which enables us to extract target-topic-similar parts of the whole collection and treat them as more representative. Given a corpus with keywords assigned to each story, topic clusters are simply created by defining each keyword as a label for a cluster. Unfortunately, for minority languages such corpora are often not available. Automatic generation of document clusters needs to be used. The success of automatic clustering is conditioned by the quality of the document and topic cluster representation. Before introducing the right representation we will describe the main characteristics of Slovenian.

4. Inflectional morphology of Slovenian

Slovenian is a South Slavic language with a speech area wedging into the Croatian, Italian, German, and Hungarian linguistic territories. The morphological complexity of Slovenian in comparison to English will be described. It is Slovenian inflectional morphology which formally distinguishes the two languages. Using a highly simplified notion, word formation in Slovenian does not differ much from that of other languages where new word forms are created using a stem with the addition of derivational suffixes. Slovenian morphology introduces three main concepts: word classes, inflection, and grammatical categories. The main feature of word classes is their division into
· inflectional classes: substantive words, adjective words, and verb, and
· non-inflectional classes: adverb, predicate, preposition, conjunction, copula, and interjection.
The basic grammatical categories of Slovenian are: gender, number, case, degree, person, tense, mood and aspect. Slovenian shares its grammatical categories with other Slavic languages, except that it also has the category of dual in addition to singular and plural, and the Slovenian nominal system does not possess a category to express an appeal. A match of the grammatical categories and word classes results in different inflectional patterns. We will describe each grammatical category with examples.

4.1 The category of gender

Slovenian distinguishes three genders: the masculine, the feminine and the neuter. In the majority of the Slavic languages, gender is inherent in substantives, inflected in adjectives, and not expressed in pronouns. Slovenian has extended gender to personal pronouns, and marginally to verbal inflections. Table 1 shows some examples.
            Masculine               Feminine                Neuter
            Slovenian   English     Slovenian   English     Slovenian   English
Noun        brat        brother     sestra      sister      dete        baby
Adjective   lep         pretty      lep-a       pretty      lep-o       pretty
Verb        dela-l      worked      dela-la     worked      dela-lo     worked
Pronoun     moj         my          moj-a       my          moj-e       my
Table 1: Examples of the application of gender.

4.2 The category of number
Besides the singular and the plural, Slovenian morphology also uses the dual, when referring to two persons or objects. The category of number applies to nouns, adjectives and pronouns. Table 2 shows some examples.

Singular                  Dual                       Plural
Slovenian     English     Slovenian     English      Slovenian    English
en-a miz-a    one table   dv-e mizi     two tables   tr-i miz-e   three tables
en-o mest-o   a town      dv-e mest-i   two towns    tr-i mesta   three towns
lep-a         pretty      lep-i         pretty       lep-e        pretty
on            he          onadva        they two     oni          they
Table 2: Examples of the application of number.

4.3 The category of case
One of the most striking differences between Slovenian and English morphology is the use of six cases in Slovenian, which denote the relationship of individual words in the sentence. In Slovenian, different case patterns are used for the following word classes: nouns, adjectives, and pronouns. The sentences in Table 3 illustrate the use of the cases of the word mesto (Eng. "the town") and show the main differences between Slovenian and English.

Case           Slovenian                       English
Nominative     To je mest-o.                   This is a town.
Genitive       Ne vidim nobenega mest-a.       I can't see any town.
Dative         Približujem se mest-u.          I'm walking towards this town.
Accusative     Kako bi opisal to mest-o?       How would you describe this town?
Locative       Kdo živi v tem mest-u?          Who lives in this town?
Instrumental   Pod tem mest-om teče reka.      There is a river beneath the town.
Table 3: Examples of the use of cases.

4.4 The category of degree
The gradation of adjectives and adverbs in Slovenian is quite similar to that in English. As in English, there are three degrees of comparison in Slovenian. Table 4 shows some examples.

Positive                Comparative                 Superlative
Slovenian   English     Slovenian    English        Slovenian       English
bel         white       bolj bel     more white     najbolj bel     most white
star        old         star-ejši    older          naj-star-ejši   the oldest
Table 4: Examples of the use of gradation.

4.5 The category of person
Verbal forms are marked for the three persons. Table 5 shows the conjugation of the verb "to work".

              Singular                Dual          Plural
              Slovenian   English     Slovenian     Slovenian   English
1st person    dela-m      I work      dela-va       dela-mo     we work
2nd person    dela-š      you work    dela-ta       dela-te     you work
3rd person    dela        he works    dela-ta       dela-jo     they work
Table 5: Examples of the application of the category of person.

4.6 The category of tense
There are four tenses in the Slovenian language. Table 6 shows their use.

Tense         Slovenian        English
Present       dela-m           I work
Past          dela-l sem       I worked
Future        dela-l bom       I shall work
Pluperfect    dela-l sem bil   I had worked
Table 6: Examples of the use of tenses.

4.7 The category of mood
There are three moods in Slovenian: indicative, imperative and conditional. Table 7 shows their use.

               Slovenian    English
Indicative     dela-m       I work
Imperative     del-aj       work
Conditional    dela-l bi    I would work
Table 7: Examples of the use of mood.

4.8 The category of aspect
Every verb obligatorily belongs to one of two classes of aspect: perfective or imperfective. The contrast between them is expressed not only by different suffixes, but also by a radical alternation of the stem. Table 8 gives some examples.
Perfective                Imperfective
Slovenian    English      Slovenian     English
dvig-n-iti   to lift      dvig-a-ti     to be lifting
se-č-i       to reach     se-ga-ti      to be reaching
pri-ti       to come      pri-h-ajati   to be coming
Table 8: Examples of the use of aspect.

4.9 Morphemic alternations
There is one additional feature of Slovenian. Besides the extensive set of suffixes, words are also subject to a process of alternation. Two types of alternation are relevant to the written form of Slovenian: vocalic and consonantal. See Table 9.

Alternation in Slovenian       Meaning in English
vet-e-r / vetra                wind
bolez-e-n / bolezni            illness
jo-k-ati / jo-č-em             to cry
zgu-b-iti / zgu-blj-en         lost
Table 9: Examples of morphemic alternation.

On the basis of the above description of the morphological structure of Slovenian in comparison to English, two main points can be emphasized:
· Slovenian displays extremely rich inflectional morphology;
· Slovenian is characterized by various types of morphemic alternation in both stems and suffixes during inflection.

5. Document and topic cluster representation
Given a corpus with keywords assigned to each story, topic clusters are simply created by defining each keyword as the label for a cluster. Unfortunately, such a corpus is not yet available for Slovenian. Automated generation of clusters of documents based on some similarity measure needs to be used. First we have to find a suitable representation. Documents and clusters are represented as sets of features. In most applications words are used as features. It has been argued that maximum performance is often achieved not by using all available features, but by using a good subset of them only: features which do not help to discriminate between topics add noise. We want to show that it makes sense to group features into clusters, at least for languages with rich morphology. We want to group all words with the same meaning, but different grammatical form, into one cluster and represent them as one feature. We propose a novel approach to feature extraction based on soft comparison of words.
To avoid the use of an additional knowledge source such as a lexicon, we define a set of membership functions. Each cluster defines its own membership function. The membership function assigns to each word in the vocabulary a number representing the grade of membership of that word in the cluster. The membership function \mu_{\tilde{c}} of cluster c is defined through

\tilde{c} = \{ (w, \mu_{\tilde{c}}(w)) \mid w \in V \}   (4)

where \tilde{c} denotes the fuzzy set of cluster c. Cluster membership functions are based on a fuzzy comparison function. Each word also defines its own fuzzy set

\tilde{w} = \{ (u, \mu_{\tilde{w}}(u)) \mid u \in V \}   (5)

The comparison function sees a word as a sequence of characters. It returns the value 1 if the compared words are the same and 0 for extremely different words; in other cases it returns a value between 0 and 1. The comparison function is created by using fuzzy rules, which provide a natural way of dealing with partial matching. We define three sets of rules:
· language-independent rules,
· rules describing English, and
· rules describing Slovenian.
The rules are expressed as fuzzy implications, which use linguistic variables to express the grade of similarity (for example: not very similar, quite similar). To give an impression of the language-independent rules we present two examples. Let a denote a word with n characters and b a word with m characters. The fuzzy implication

IF the characters of the words are different THEN the words are not very similar
(6)

is transformed into the predicate

p_1(i, a, b) = 1 if \exists j : a(i) = b(j), and 0 otherwise   (7)

which is summed over both words as

e_1(a, b) = \sum_{i=1}^{n} p_1(i, a, b) / (n + m) + \sum_{i=1}^{m} p_1(i, b, a) / (n + m)   (8)

The fuzzy implication

IF two-character sequences of the words are the same THEN the words are quite similar   (9)

is transformed into the predicate

p_2(i, a, b) = 1 if \exists j : a(i) = b(j) \wedge a(i+1) = b(j+1), and 0 otherwise   (10)

e_2(a, b) = \sum_{i=1}^{n} p_2(i, a, b) / (n + m - 2) + \sum_{i=1}^{m} p_2(i, b, a) / (n + m - 2)   (11)

The predicates are scaled by linguistic variables, whose values are chosen empirically. The final value of the comparison function is computed using the scaling

\mu_{\tilde{a}}(b) = \max(e_{Similar}(a, b)) / [ \max(e_{Similar}(a, b)) + \max(e_{NotSimilar}(a, b)) ]   (12)

where e_Similar denotes the set of predicates which describe similarity and e_NotSimilar the set of predicates which describe distinctness. Language-dependent rules for English are taken from the suffix-stripping rules provided by (Porter, 1980). Language-dependent rules for Slovenian describe, in simplified form, the inflectional morphology described in Section 4 (Popovič).
The membership function of word w_i in cluster c_j is computed by using a modified single-link agglomerative clustering (Voorhees, 1986). Similarity values of word pairs can be represented as a weighted, undirected graph in which nodes represent words and edge weights represent the similarity of the words they connect. To save space, we keep only edges with weights greater than a prespecified threshold. The single-link hierarchy results in locally coherent clusters. To avoid a chaining effect, and consequently elongated clusters, we modify the merging criterion: a word is added to the cluster if its average similarity with all words in the cluster is the largest among all the words not yet clustered. Clusters are built one at a time; we start building a new cluster as soon as the largest similarity value no longer exceeds a prespecified threshold. Each cluster defines one feature, and the number of clusters determines the feature vector length.

6. Topic detection
Once we have training documents, topic clusters and test samples represented as feature vectors, we use topic detection to determine the similarity between two feature vectors. Topic detection is performed with the TFIDF classifier (Joachims, 1996), which is used to determine the similarity between two documents or clusters.

7. Experiments
In our experiments we used the broadcast news corpus (1996 CSR Hub-4 Language Model) for English and a newspaper news corpus (Večer) for Slovenian, chosen for their semantic richness. The English broadcast news corpus contains 100 million words. It was organised into topic-specific clusters of documents based on manually assigned keywords. We experimented with topic clusters that have at least 300 articles; 244 clusters satisfy this constraint. Language model adaptation was performed on 20 randomly chosen topics. 80% of each topic cluster's text was used for language model training, 10% for interpolation parameter estimation, and 10% as the test sample. All words from the corpus were used for feature extraction. Before word clustering was performed, words from a stop word list were removed. Using language-independent word clustering, the feature vector size was reduced from 170,000 to 36,000. A small code sketch of the comparison function is given below; Table 10 then shows a sample of clusters, in which even misspelled words are clustered correctly.
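To make the comparison step more concrete, the following is a minimal Python sketch of equations (7)-(12), restricted to the two language-independent example rules above. The weights that stand in for the linguistic variables ("quite similar", "not very similar") and the way the two rules feed the Similar and NotSimilar predicate sets are our assumptions, not the values used in the paper's implementation.

```python
def p1(i, a, b):
    # predicate (7): character a[i] also occurs somewhere in b
    return 1 if a[i] in b else 0

def p2(i, a, b):
    # predicate (10): the two-character sequence starting at a[i] also occurs in b
    return 1 if len(a[i:i + 2]) == 2 and a[i:i + 2] in b else 0

def e1(a, b):
    # equation (8): shared characters, normalised by the combined length n + m
    n, m = len(a), len(b)
    return (sum(p1(i, a, b) for i in range(n)) +
            sum(p1(i, b, a) for i in range(m))) / (n + m)

def e2(a, b):
    # equation (11): shared bigrams, normalised by n + m - 2
    # (looping only to n - 1 / m - 1, since p2 looks one character ahead)
    n, m = len(a), len(b)
    if n + m <= 2:
        return 0.0
    return (sum(p2(i, a, b) for i in range(n - 1)) +
            sum(p2(i, b, a) for i in range(m - 1))) / (n + m - 2)

def similarity(a, b):
    # equation (12): best "similar" predicate scaled against the best "not similar" one.
    # The 0.75 / 0.5 scaling factors are illustrative stand-ins for the linguistic variables.
    similar = [0.75 * e2(a, b)]               # rule (9): shared bigrams -> quite similar
    not_similar = [0.5 * (1.0 - e1(a, b))]    # rule (6): differing characters -> not very similar
    s, ns = max(similar), max(not_similar)
    return s / (s + ns) if s + ns > 0 else 0.0

# toy check: inflected forms of one stem score high, unrelated words score low
print(similarity("admire", "admired"))   # close to 1
print(similarity("admire", "because"))   # close to 0
```

In a full implementation the language-dependent rules would simply add further entries to the two predicate lists before the scaling in (12), and the resulting pairwise similarities would feed the modified single-link clustering described above.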
aadmirable admirable admirably admira admirally admire admired admirer admires admir admirers
bbecause becau becaue
chinasports sports sport sporto sporty sported spotrer sportin sporting sportscar sportsman sportsmen sportcoat sportless
cilton clnton cinton
conferenced conferences conferencing teleconference teleconferenced teleconferences videoconferences
Table 10: Sample of English clusters.

For each test sample we wanted to model, all topic clusters were ranked by similarity value. When language-dependent rules were used in the feature extraction process, the top 10 topics did not change. Four types of language models were built: the general language model (G), the topic model (T), the combined model (C) and the novel model (N). All language models were trigram models with a vocabulary of the 64,000 most frequent words. Results for 5 topics are shown in Table 11. Averaged over all 20 experimental topics, the perplexity of the adapted language model was reduced by 15%.

Topic             PP(G)   PP(T)   PP(C)   PP(N)
Automobiles       58      247     55      53
Middle East       55      145     27      27
Clinton, Bill     48      90      46      41
Holidays          62      331     49      48
Simpson, O. J.    45      59      25      23
Table 11: Test set perplexities for English.

Unfortunately, the Slovenian newspaper corpus of 20 million words is not yet annotated with keywords, so clustering was done automatically. Documents were merged into 100 clusters iteratively by the use of agglomerative clustering (Voorhees, 1986), the TFIDF classifier and the feature vectors built in the feature extraction process. Using language-independent word clustering, the feature vector size was reduced from 200,000 to 21,000. A sample of clusters is shown in Table 12. Words in italics are, from a semantic point of view, not correctly clustered. Adding language-dependent rules reduced the feature vector size to 18,000.

afer afera aferah afere aferi afero aferami fer
bančna bančne bančnem bančni bančno bančnih bančnik bančnim bančnikom bančnikov bančniški
cestnemu mestnemu mestnem mestne mestnega mestna mestni mestno
nihanje nihanj nihanja nihanju ihan
dobojevati izbojevati izbojeval
Table 12: Sample of Slovenian clusters.

Three test samples were created manually, each consisting of 5 documents similar in topic. Language models were built in the same way as in the previous experiment with the English corpus, with one difference: all words from the test samples were added to the vocabulary, to avoid the problem of out-of-vocabulary words. Using language-independent feature extraction we obtained a 14% perplexity reduction; adding language-dependent rules reduced the perplexity by up to 30%. Results are given in Table 13.

Language-independent feature extraction
Topic               PP(G)   PP(T)   PP(C)   PP(N)
Sport               200     621     199     197
Weather forecast    154     201     153     141
Politics            220     598     210     215
Language-dependent feature extraction
Topic               PP(T)   PP(C)   PP(N)
Sport               455     183     171
Weather forecast    168     150     115
Politics            598     201     189
Table 13: Test set perplexities for Slovenian.

8. Conclusion
In our experiments we have shown that topic adaptation does result in a decrease in perplexity. To train a language model, it does not make sense to use only a small portion of topic-specialised text. The results have also shown that word clustering delivers a significant topic detection improvement for highly inflected languages and almost no improvement for English. The main drawbacks of the experiments on the Slovenian corpus were the corpus size and the absence of keyword labels.

9. References
Jelinek F 1998 Statistical methods for speech recognition. Cambridge, MIT Press.
Byrne W, Hajič J, Ircing P, Krbec P, Psutka J 2000 Morpheme based language models for speech recognition of Czech. In Proceedings of the Third International Workshop on Text, Speech and Dialogue, pp 211-216.
Katz S M 1987 Estimation of probabilities from sparse data for the language model component of a speech recognizer. IEEE Transactions on Acoustics, Speech, and Signal Processing 35(3): 400-401.
Porter M F 1980 An algorithm for suffix stripping. Program 14(3): 130-137.
Voorhees E M 1986 Implementing agglomerative hierarchic clustering algorithms for use in document retrieval. TR 86-765, Cornell University.
Joachims T 1996 A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. CMU-CS-96-118, Carnegie Mellon University.
Popovič M Implementation of a Slovene language-based free-text retrieval system. Unpublished PhD thesis, University of Sheffield.

The mandative subjunctive in British English seems to be alive and kicking… Is this due to the influence of American English?1
Noëlle Serpollet
Lancaster University

1. Introduction
Johansson & Norheim (1988: 34) wrote at the end of their 1988 paper: "results from elicitation tests […] suggest that the mandative subjunctive may be on the increase in British English. To study such changes, we need two new comparable British and American corpora". With the completion of the one-million-word corpora FLOB (Freiburg-LOB, 1991) and Frown (Freiburg-Brown, 1992), the two 1990s counterparts of the 1960s LOB and Brown, it is now possible to study these changes. These four corpora being available, the objective of this paper is to study the variation of a grammatical category of modern British English (henceforth BrE), namely the mandative constructions (modal should and the mandative subjunctive), over a thirty-year period. In order to carry out this analysis, I will first analyse two general categories – Press and Learned Prose – of two corpora of BrE: the Lancaster-Oslo/Bergen (LOB) and Freiburg-LOB (FLOB) corpora.
(1) […] but it is also very important that they should be fair. (LOB Press, B)
(2) […] nor to obtain an order that the child be accommodated by them […] (FLOB Learned Prose, H)
(3) […] usually by recommending that politicians or administrators introduce incentive […]. (FLOB Learned Prose, J)
Then, in order to know whether the tendency observed in my findings is intrinsic and limited to the two text categories studied, or is indeed verified throughout the other genres (Fiction and General Prose), I will carry out a thorough analysis of the whole LOB and FLOB corpora. I have applied a grammatical approach to corpus data, using Xkwic, a fast concordance programme, to carry out an analysis which involved developing complex queries on grammatically tagged, comparable and computerized corpora. My results confirmed previous studies (Övergaard, 1995; Hundt, 1998) and showed that the use of the mandative subjunctive is increasing, whereas the use of mandative should is declining. So far, a description of two complementary phenomena has been provided, but we are still in need of a possible explanation for this "revival" of the subjunctive suspected by Johansson and proven by other linguists' observations. If the mandative subjunctive is indeed 'alive and kicking', is its health sustained by American English (henceforth AmE)? Is the increase of the subjunctive in BrE therefore due to the Americanization of the British language?
The final part of this paper will attempt to provide an answer to this question by analysing two corpora of American English: Brown and its 1992 counterpart Frown.

2. Background
2.1. Previous studies
The grammarians of the 1950s and 1960s considered the extinction of the subjunctive either imminent or already accomplished. They stated that the subjunctive was gradually dying out of the language, that it was fossilised, that its death throes could be observed in literary English, or that living English had no subjunctive at all. Harsh (1968: 11) reports that Sir Ernest Gowers, in his 1965 revision of Fowler's A Dictionary of Modern English Usage, decided to leave intact what Fowler had written in 1926, "writing off" the subjunctive in the following statement: "it is moribund except in a few easily specified uses" (1965: 595). According to Harsh (1968: 12), "the inflected subjunctive, though hardly in a state of robust good health, has been taking a long time to die. But that it is still dying, as Fowler noted, can hardly be denied". Johansson and Norheim (1988: 27) nonetheless state without any doubt that "English verbs have distinctive forms under certain circumstances which differ from the normal indicative forms and convey the meaning of 'non-fact', which is characteristic of the subjunctive in other languages".
1 Acknowledgement: the research reported here was supported by an award from the Economic and Social Research Council (UK).
Quirk et al. (1985: 1012-1013) define the subjunctive occurring "in that-clauses after verbs, adjectives, or nouns that express a necessity, plan, or intention for the future" as the mandative subjunctive, and it is this use of the subjunctive that "[this] corpus-based investigation of language change in progress" focuses on. The expression in inverted commas above is borrowed from Mair & Hundt (1995), who use it as the subtitle of their article, a pioneering effort on the subject. The analysis by Johansson and Norheim (1988), which is the starting point of this study, aimed at verifying whether the subjunctive was more frequently used in AmE than in BrE, examining for that purpose the two comparable Brown and LOB corpora. The results showed that the mandative subjunctive was favoured in the American corpus while its number was very low in the British one, and that the construction with should was preferred to the subjunctive in the British material. This confirms the observations of Quirk et al. (1985: 1012-1013), who emphasise that the mandative subjunctive is used especially in American English, whereas in BrE mandative should with the infinitive is more common. While a synchronic study of mandative constructions in two varieties of English had been carried out, what was needed in 1988 was a diachronic analysis of these constructions, to observe their evolution and to check whether the following statements by Quirk et al. (1985) were still applicable to the English of the 1990s:
- the present subjunctive occurs more frequently in AmE than in BrE;
- its use in that-clauses seems to be increasing in BrE.
Several studies of "language change in progress" have been undertaken since, and analyses focusing on the evolution of different grammatical features have been conducted on parallel corpora, more specifically on LOB / Brown and on the new comparable corpora FLOB / Frown2.
Indeed, recent studies using new corpus resources (Asahara, 1994; Övergaard, 1995; Hundt, 1998) have analysed the diachronic evolution of mandative constructions in BrE and AmE. They have presented findings which indicate a remarkable increase in the use of the mandative subjunctive in British English, especially in the late 20th century. Apparently, this use of the subjunctive, although not very frequent, is far from becoming extinct. In sections 4 and 5, more will be said about previous work on the mandative constructions. With four carefully matched corpora now available, an exhaustive corpus-based study of language change in progress over a thirty-year period can be conducted. Such a study can analyse and compare synchronic corpora, to examine for example the possible influence of American English on British English with FLOB/Frown, or diachronic corpora, to study the evolution of linguistic features in BrE with LOB/FLOB.

2.2. The "mandative constructions" (non-inflected or morphological subjunctive & periphrastic construction with the modal should): a definition
In this section I will describe the "mandative constructions", set out the criteria used to recognise the subjunctive, and clarify the terminology used. Etymologically, "mandative" is derived from the verb 'mandate', itself coming from the Latin manda-re: to enjoin, command. The term "mandative expressions" is used in reference to verbs, nouns and adjectives (which I also call triggers) that express a demand, request, intention, proposal, suggestion, recommendation, etc. This expression is borrowed from Algeo (1992: 599), who himself adapted it from the term "mandative subjunctive" used by Quirk et al. (1985: 156) for one of the three verb forms in the that-clause that follows certain expressions of resolution, intention, etc. By extension, I use the term "mandative constructions" for the different verb forms which can follow mandative expressions. Therefore, instead of using the expression periphrastic construction with the modal should (Övergaard, 1995; Hundt, 1998) to designate the construction with the modal, which is one variant of the mandative constructions, I use the term "mandative SHOULD".
(4) I insisted that he should take part in the concert, Alan said. (LOB Fiction, P)
(5) During the stand-up confrontation, which took place shortly after the new year at Highgrove, the Prince of Wales's Gloucestershire home, Charles insisted that his son have a more conventional celebration in the newly refurbished Orchard Room in the house. (The Sunday Times (6th February 2000), "Charles and William in nightclub row")
Different verb forms can follow the mandative expressions: we can have a mandative subjunctive as in example (6), or the non-distinctive form (7) that I will define below, or the modal auxiliary should followed by an infinitive (8), or the indicative (9). The last construction, namely the indicative, will not be considered further in this paper.
2 These four corpora will be described in detail in section 3.
(6) She insisted that he leave early.
(7) He suggests that we leave early.
(8) Her wish was that he should leave early.
(9) She was eager that he left early.
(examples from Algeo (1992: 599), apart from the non-distinctive form in (7))
The subjunctive is difficult to identify because it is identical to the base form of the verb.
According to Asahara (1994: 2), "the present subjunctive refers to a grammatical form that takes only the base form of the verb regardless of tense contrast, person and number concord". Therefore, with a plural subject, there is no difference between the indicative and subjunctive forms. The non-inflected or morphological subjunctive is distinguishable from the indicative (through morphological criteria) in the following cases:
- in the 3rd person singular present tense (no -s) (10);
- in past contexts (no sequence of tenses) (11);
- in finite forms of be (base form for all persons and no tense marker) (12), (13);
- in negated clauses (no do-periphrasis, and not is placed before the verb) (14).
(10) […] he proposes to Isabella that she join his plan to frame Mariana […]. (FLOB General Prose, G)
(11) Russia insisted that the Western powers take immediate measures to put an end to the unlawful and provocative actions of the Federal German Republic in West Berlin. (LOB Press, A)
(12) Hence it is important that the process be carried out accurately. (FLOB Learned Prose, H)
(13) Conditions have dictated that operations be scaled down enabling overheads to be reduced […]. (LOB Press, A)
(14) Moreover, it requires that the concepts F(x) and G(x) not themselves contain any quantification […]. (FLOB Learned Prose, J)
I included in my counts, as part of the mandative subjunctive forms, not only the distinctive/genuine subjunctive forms but also the non-distinctive forms which are indistinguishable from the indicative, as in (15a).
(15a) I will guard your house for you on condition that you bake me an apple pie every day. (LOB General Prose, F)
In that case, we can perform a substitution test by putting a third person singular subject in the place of "you", and we can see that we obtain a distinctive subjunctive form:
(15b) I will guard his house for him on condition that he bake me an apple pie every day.

3. Material used and purpose of this paper
3.1. Data
The four carefully matched corpora that have been analysed are the following:
Two well-known and widely used corpora representing the language of the 1960s:
· The Brown corpus, compiled at Brown University, consists of one million words made up of 500 texts of American English from 1961, spread over 15 categories3.
· The Lancaster-Oslo/Bergen Corpus (LOB) has been compiled, computerized and word-tagged by research teams at Lancaster, Oslo and Bergen. It consists of 500 British English texts of about 2,000 words each, printed in 1961, divided into 15 different genre categories, and contains one million words. It is synchronically parallel to the Brown corpus.
Two parallel one-million-word corpora matching the original LOB and Brown corpora, developed at Freiburg University to enable linguists to study language change in progress:
· The Freiburg-LOB Corpus (FLOB, 1991) has been modelled on LOB; it consists of one million words of British English texts printed in 1991.
· The Freiburg-Brown Corpus (Frown, 1992), modelled on Brown, is synchronically parallel to FLOB and diachronically parallel to Brown.
3 They are listed in the notes to table 1.

3.2. Objectives
As indicated in section 3.1 above, this study only undertakes the analysis of written data. It is intended to be an observation and a description of linguistic change in contemporary English.
On that aspect, I refer to the studies of grammatical change in present-day English by Mair (1997) and Hundt (1997), and I quote Holmes (1994: 37):
The prospect of using corpora data to infer language change over time is an exciting one. It is clearly possible to make suggestive and interesting comparisons between the frequencies of items in corpora of similar size and composition which have been constructed at different points in time.
The Press and Learned Prose genres (which amount to about 2/5 of each corpus) are represented in the figure below by two rectangles with upward (LOB) and downward (FLOB) diagonal patterns.
[Figure 1: British English 30 years on, a description – AmE, a possible explanation for the changes in BrE? (Diagram of the four corpora: LOB 1961, FLOB 1991, Brown 1961, Frown 1992.)]
The paper will develop a descriptive analysis based on a thorough and exhaustive observation of the data. Therefore, in order to describe what has been happening over a thirty-year period, I have carried out a comparative study of two parallel and computerised corpora, LOB and FLOB, to find out whether there is an increase in the use of the mandative subjunctive form and, concurrently, a decline in the use of mandative should over the years. My objective is also to examine in which text genres this has taken or is taking place.
· Thus I will first analyse the Press and Learned Prose categories, to see if two different genres follow the same trend, i.e. have experienced the same evolution regarding the mandative constructions.
· Then I will verify whether the trend is found in the rest of the two corpora. Is it general to BrE or genre-specific?
· Finally, the analysis of American data will enable me to see whether the ongoing change in BrE is dependent on diachronic developments in AmE or on the synchronic influence of AmE.
A detailed qualitative analysis of the data is beyond the scope of this paper. Nonetheless, I hope that the last part of this study will shed some light on the evolution of mandative constructions and provide the beginning of a possible explanation.

4. Methods and pilot results
4.1. Method used
I have applied both a grammatical approach to corpus data and a corpus linguistics methodology, using computer tools and retrieval software such as Xkwic. First, I carried out a concordance of should in WordSmith Tools4 (Version 3.00.00), an integrated suite of programs used to investigate how words behave in texts, in order to retrieve the total number of occurrences of the modal.
4 More detailed information can be found at http://www.liv.ac.uk/~ms2928/ and in Scott (1996).

Table 1: Frequency of should in the LOB and FLOB corpora
Categories (should)          LOB     FLOB    Difference (abs)   Difference (%)
Press [5] (A-B-C)            285     185     - 100              - 35.1
Fiction [6] (K-L-M-N-P-R)    214     250     + 36               + 16.8
General Prose [7] (D-E-F-G)  472     330     - 142              - 30.1
Learned Prose [8] (H-J)      330     382     + 52               + 15.8
TOTAL                        1301    1147    - 154              - 11.8

These results show that the overall number of occurrences of should decreased between the 1960s and the 1990s. However, this trend does not generalise to all genres: we can note a decrease in the Press and General Prose categories and yet an increase in Fiction and in Learned Prose. Identifying the mandative uses of the modal, on which the present study focuses, required a semantic analysis leading to a classification of each of the retrieved occurrences. This process was laborious and time-consuming, so I limited the analysis to the Press and Learned Prose categories of LOB and FLOB.
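As an illustration of this first retrieval step, here is a minimal Python sketch that counts the occurrences of should per category file and prints a few keyword-in-context lines for each, in the spirit of a WordSmith concordance. The directory layout and the assumption of one plain-text file per genre category are ours, not a description of the setup actually used in the study.

```python
import re
from pathlib import Path

# hypothetical layout: one plain-text file per genre category (A.txt, B.txt, ...)
CORPUS_DIR = Path("corpora/FLOB")
PATTERN = re.compile(r"\bshould\b", re.IGNORECASE)

def kwic(text, match, width=40):
    """Return a keyword-in-context line for one match."""
    start, end = match.start(), match.end()
    left = text[max(0, start - width):start].rjust(width)
    right = text[end:end + width].ljust(width)
    return f"{left} [{text[start:end]}] {right}"

counts = {}
for path in sorted(CORPUS_DIR.glob("*.txt")):
    text = path.read_text(encoding="utf-8")
    hits = list(PATTERN.finditer(text))
    counts[path.stem] = len(hits)
    for m in hits[:3]:                      # show a few concordance lines per category
        print(path.stem, kwic(text, m))

print("Total occurrences of should:", sum(counts.values()))
for category, n in sorted(counts.items()):
    print(f"{category}: {n}")
```

Totals obtained in this way can be compared across LOB and FLOB, as in Table 1 above; deciding which of the retrieved hits are mandative still requires the manual semantic classification, and then the more constrained queries described next.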
Then, I used a concordancing tool to find the mandative constructions without having to go through a detailed manual analysis. I started with WordSmith and, as I progressed with my search queries, I realised that the package was not suited to the type of complex queries I needed to create in order to retrieve only the mandative constructions: it could not simultaneously support wildcards (*), windows {n}, interval operators {n, m} and tags. Therefore I had to switch to Xkwic, which could handle much more complex queries, as it has a more powerful search language, and so I could limit the number of hits I was getting. Xkwic is part of the IMS Corpus Workbench9, and is a Motif-based user interface to the Corpus Query Processor (CQP). CQP, the concordance engine itself, receives as input a query entered in Xkwic and returns the result to Xkwic once it has been computed. The analysis reported hereafter involved developing complex queries to retrieve only the relevant instances of both the modal and the subjunctive – the latter not being part-of-speech tagged as such. I used four fully comparable, grammatically tagged and computerized corpora, and therefore I could run exactly the same retrieval queries in all of them (LOB / FLOB and Brown / Frown). This is where the originality of my research comes from: this use of the same search patterns provided me with comparable findings. As I have already mentioned in section 1, some of the previous studies of mandative constructions presented results that were somewhat incomplete. Asahara (1994) did not use computerised corpus data and relied on a rather small number of examples; nonetheless, her results are very interesting. A similar situation is found with Övergaard (1995), who did not use truly parallel corpora: she analysed the Brown and LOB corpora for the 1960s, but worked with four other corpora that were not computerised for the more recent period. The instances in these corpora are random examples that she recorded when she encountered them, hence the unreliability of one part of her research. Hundt's analysis (1998: 162) appears very valuable but incomplete, as she used her own findings on FLOB and Frown (which was not yet complete at the time) as well as results from Johansson and Norheim (1988) for Brown and LOB. Therefore, the range of governing expressions (triggers) was limited to only 17 verbs and their related nouns. Nonetheless, these previous studies indicate a trend in the evolution of mandative constructions that I intend to verify in the two sets of parallel and comparable corpora. The occurrences of mandative constructions have been retrieved using the concordance programme Xkwic and an appropriate search query. Some issues involved with the search criteria are presented below.
5 A = reportage, B = editorial, C = reviews [88 texts, ca. 176,000 words]
6 K = general fiction, L = mystery & detective fiction, M = science fiction, N = adventure & western fiction, P = romance & love story, R = humour [126 texts, ca. 252,000 words]
7 D = religion, E = skills, trades & hobbies, F = popular lore, G = Belles Lettres, bibliography, essays [176 texts, ca. 352,000 words]
8 H = miscellaneous, mainly government documents, J = learned & scientific writings [110 texts, ca. 220,000 words]
9 The IMS Corpus Toolbox has been developed at the University of Stuttgart.
Further details can be obtained from the following web page: http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/ and in Christ (1994).
When a triggering expression was followed by more than one that-clause, only the first clause was included in the final results. The decision could have been different if the data had been analysed only by hand; however, Xkwic stops at the first clause and does not account for any following that-clause triggered by the same expression. In the case of several verbs appearing in a that-clause, I took the decision to include only the first one in my counts, as the concordancer only accounts for the first verb following that. On that issue, I have taken the same approach as Johansson and Norheim (1988). I should also indicate that I accounted for the that-deleted clauses in my results (even though they are very rare) and, as mentioned in section 2.2, I also included the non-distinctive forms in my counts (details are given in the tables). I limited my research to a finite set of suasive trigger expressions: 64 suasive verbs, 52 corresponding nouns and 40 adjectives. A few words now about precision and recall:
- Precision error: the search output given by Xkwic contained instances that were not mandative constructions (they did indeed contain a trigger and the modal should or a base form, but the construction had a non-mandative meaning). These were false instances, or noise, that needed to be removed manually. Even with an automatic computer search, the manual intervention of the analyst is still needed, and indeed necessary, in order to provide accurate results.
- Ratio of error: about 4 hits out of 10 had to be discarded. The precision rate therefore generally varied between 57% and 67%, i.e. an average of 63%.
- Recall error: having retrieved all the occurrences of mandative should in the Press category of LOB and FLOB (using WordSmith), I tested my queries in Xkwic in order to see whether I was retrieving all the occurrences of the modal. In fact, the concordancer failed to retrieve some genuine occurrences of mandative constructions because they fell outside the search criteria set up in the Xkwic query (it would appear that the syntax is more complex in Learned Prose and that some of the mandative forms fall outside the limitations of the search queries).

Table 2: Retrieval rate of Xkwic for mandative should (comparison of manual analysis and automatic retrieval, that-deleted clauses included)
Retrieval rate (should)    LOB     FLOB
Press                      97%     91%
Learned Prose              90%     75%

A possible remedy would be to extend the search scope. Unfortunately, such an extension proved unmanageable, as too much noise was encountered, and a practical choice had to be made. However, the instances that have not been retrieved are limited in number, and the occurrences retrieved still form a large subset of the total number of possible occurrences contained in the corpus.

4.2. Results in two genres of British English
This section presents the results of the queries carried out in only two genres of the two corpora of BrE:
· Press (A, reportage; B, editorial; C, reviews), ca. 176,000 words
· Learned Prose (H, miscellaneous, mainly government documents; J, learned & scientific writings), ca. 220,000 words
It would seem that the Press would be more prone to evolution, as it tends to reflect the changes occurring in modern written language.
We might therefore expect to witness in this genre a decline of the mandative subjunctive, which belongs to a formal, legalistic style such as the one found in the Learned Prose category, considered more formal and conservative.

Table 3: Frequency of mandative should in LOB and FLOB [A-B-C] with verbs, nouns and adjectives as triggers (that-deleted clauses included)
          Should in LOB (Press)               Should in FLOB (Press)
          VERBS    NOUNS   ADJ.   TOTAL       VERBS    NOUNS   ADJ.   TOTAL
          21 [10]  8       7      36          12 [11]  6       1      19

These figures show that mandative should decreased in the Press category from the 1960s to the 1990s by 47.2% across the three types of triggers, which matches the general tendency of all uses of should in the Press. It decreased from 36 to 19 occurrences, which means going from 12.6% of the total number of occurrences of should to 10.3%.
(16) They will accompany Mr. Heath next month when he goes to Brussels, headquarters of the Common Market Commission, or wherever the Six decide negotiations should be held. (LOB Press, A)
(17) The suggestion that Royton should be demolished for the delight of the yuppie mugwumps of Oldham will alarm many Roytoners. (FLOB Press, B)

Table 4: Frequency of genuine mandative subjunctives and non-distinctive forms (ND) in LOB and FLOB [Press] with verbs, nouns and adjectives as triggers (that-deleted clauses included)
             Subjunctives and ND in LOB (Press)       Subjunctives and ND in FLOB (Press)
             VERBS   NOUNS   ADJECT.   TOTAL          VERBS   NOUNS   ADJECT.   TOTAL
Subj.        2       2       0         4              3       1       0         4
Non-dist.    3       0       1         4              4       0       0         4
TOTAL        5       2       1         8              7       1       0         8

The results of the concordances on the base form carried out in the Press category of the two corpora show that the total number of mandative subjunctives and non-distinctive forms is identical in the two corpora, with four subjunctives and four non-distinctive forms. In this genre, although the total number of subjunctive forms is constant from the 1960s to the 1990s, the variation in the number of triggers does not follow one particular direction: there are more occurrences triggered by verbs in FLOB and fewer instances triggered by nouns and adjectives.
(18) Since Peking realises just how much Britain needs the deal, it demanded that Mr Major - and his kudos as world leader – come in person to sign it. (FLOB Press, A)

Table 5: Frequency of mandative should in LOB and FLOB [H-J] with verbs, nouns and adjectives as triggers (no that-deleted clauses)
          Should in LOB (Learned Prose)       Should in FLOB (Learned Prose)
          VERBS   NOUNS   ADJ.   TOTAL        VERBS   NOUNS   ADJ.   TOTAL
          24      6       14     44           20      6       1      27

In Learned Prose we can note that mandative should decreased by 36.8%, mainly after the triggering adjectives, where we can see a sharp drop in numbers, whereas should as a whole increased in this genre. The construction, which used to represent 13.3% of the total number of occurrences of should, now represents only 7% of this total.
(19) It is essential that the landscaping should be designed for ease of maintenance […]. (LOB Learned Prose, H)
10 This count of 21 occurrences includes two occurrences of SHOULD in two that-deleted clauses triggered by the verbs decide and agree.
11 This count of 12 occurrences includes one occurrence of SHOULD in a that-deleted clause triggered by the verb propose.
(20) We recommend that the Department should re-appraise and update such calculations at frequent intervals […].
(FLOB Learned Prose, H)

Table 6: Frequency of genuine mandative subjunctives and non-distinctive forms (ND) in LOB and FLOB [Learned Prose] with verbs, nouns and adjectives as triggers (that-deleted clauses included)
             Subjunctives and ND in LOB (Learned Prose)    Subjunctives and ND in FLOB (Learned Prose)
             VERBS   NOUNS   ADJECT.   TOTAL               VERBS   NOUNS   ADJECT.   TOTAL
Subj.        1       0       0         1                   9       3       1         13
Non-dist.    4       0       1         5                   4       0       1         5
TOTAL        5       0       1         6                   13      3       2         18

In the Learned Prose category, the mandative subjunctive is on the increase by 1211% [sic] if we consider the genuine subjunctive forms on their own. If we include the non-distinctive forms, which remained stable at five occurrences in each corpus, we still have a very substantial increase of 198%, encountered mainly after trigger verbs. In that case, there is no doubt that we are witnessing a rise in the mandative subjunctive over a thirty-year period and, although the numbers are very small, this would tend to show that the subjunctive is not dying in Learned Prose in BrE.
(21) The petitioner would then request that the house overrule the injunction or, alternatively, make a clear determination on where the suit ought best to be tried. (FLOB Learned Prose, J)
So far, the findings from two genres confirm the trend presented by previous studies, i.e. that the use of the modal should as a periphrastic alternant to the non-inflected subjunctive is declining. Regarding the remarks that I made at the beginning of this section, a closer examination of the results leads me to say that in the Press category the counts of the subjunctive are inconclusive, as no variation has been noted, i.e. the form is stable. However, a "revival" of the subjunctive can be noted in Learned Prose. Hundt (1998: 167) indicates that "it is hardly surprising that a genre [Academic prose, category J] which is resisting the trend towards a more colloquial written style should be the vanguard of a change that is reviving a formal syntactic option".

5. Full results in all genres
The tables that follow present the results obtained after the analysis of the two corpora of written BrE in their entirety. This will enable me to have a full picture (nonetheless limited to one million words and four general genres) of some specific grammatical changes that happened between 1961 and 1991, and to see whether the trend previously identified is verified in all genres or specific to some.

Table 7: Raw frequency and proportion of genuine mandative subjunctives and non-distinctive forms in LOB and FLOB
Mandative subjunctive forms    LOB           FLOB
Genuine subjunctive            14 (48.3%)    33 (58.9%)
Non-distinctive                15 (51.7%)    25 (43.1%)
Total                          29            58

From LOB to FLOB, the number of mandative subjunctives has doubled and the proportion of non-distinctive forms has decreased by 10.6 percentage points. This means that in FLOB 58.9% of the subjunctive forms are genuine subjunctive forms. Hence, even if the non-distinctive forms are not included in the quantitative analysis, the direction of the evolution is not skewed; the increase is even more remarkable (from 100% to 135.7%). In tables 8 and 9 below, both the raw frequencies and the frequencies normalized per 100 texts are presented12. The latter count is needed because the different categories are not balanced: the general genres, which regroup several text categories, each contain a different number of texts (see table 1).
The normalized frequencies provide a better idea of the repartition of the mandative forms in each genre and give us an insight into the stylistic distribution.
12 A raw frequency of five subjunctives within 176 texts in General Prose is normalized to 2.8 occurrences within 100 texts.
Looking at the raw frequencies for FLOB in table 8, one would think that there were more genuine subjunctives in General Prose, then in Learned Prose, in Fiction and finally in Press. In reality, the normalised frequencies show us that it is the Learned Prose category which ranks first with the highest number of mandative subjunctive forms, then General Prose, Press and Fiction. This classification does not really come as a surprise, as the two "Prose" genres tend to use a more formal style, with legalistic writing (category H) and academic prose (category J) on the one hand, and religious writings (category D) and Belles Lettres, bibliographies and essays (category G) on the other hand.

Table 8: Frequency of genuine mandative subjunctives and non-distinctive forms across genres in LOB and FLOB
                                                LOB                          FLOB
Genre                                           Subj.   Non-dist.   Total    Subj.   Non-dist.   Total
Press (88 texts)        frequency (n.)          4       4           8        4       4           8
                        per 100 texts           4.5     4.5         9        4.5     4.5         9
Fiction (126 texts)     n.                      4       2           6        7       4           11
                        per 100 texts           3.2     1.6         4.8      5.6     3.2         8.8
General Prose (176)     n.                      5       4           9        9       12          21
                        per 100 texts           2.8     2.3         5.1      5.1     6.8         11.9
Learned Prose (110)     n.                      1       5           6        13      5           18
                        per 100 texts           0.9     4.6         5.5      11.8    4.6         16.4

The number of mandative subjunctives stayed constant in the Press category, whereas in Learned Prose the number of occurrences increased by 198%. Table 10 summarises the increases per genre for all the mandative constructions.

Table 9: Frequency of mandative should across genres in LOB and FLOB
Genre                                           LOB     FLOB
Press             frequency (n.)                36      19
                  per 100 texts                 40.9    21.6
Fiction           n.                            19      9
                  per 100 texts                 15      7.1
General Prose     n.                            56      28
                  per 100 texts                 31.8    15.9
Learned Prose     n.                            44      27
                  per 100 texts                 40      24.5

The use of the periphrastic construction with should has decreased in all four genres, between 38.6% and 100%. To make it easy to check the evolution of all the mandative constructions studied, the two tables above are summarised below. I have indicated in parentheses the rank of each category, from the highest rise or decline to the lowest.

Table 10: Summary of the evolution of the mandative constructions from LOB to FLOB
Categories        Genuine subjunctive    Subj. incl. non-distinctive forms    Should
Press             (same number)          (same number) (4)                    - 47.2% (3)
Fiction           + 75%                  + 83% (3)                            - 52.6% (2)
General Prose     + 82%                  + 133% (2)                           - 100% (1)
Learned Prose     + 1211%                + 198% (1)                           - 38.6% (4)

We find almost the same ranking between the categories containing the highest number of mandative subjunctive forms and the categories having experienced the highest increase, the only difference being between Press and Fiction, which have exchanged their third and fourth places. Regarding mandative should, its use has decreased by between 38.6% in Learned Prose and 100% in General Prose. And although there does not seem to be a correspondence within genres between the fall of one construction and the rise of the other, Hundt (1998: 163) notes about results13 on the whole LOB and FLOB corpora that "[f]or the British corpora, a chi-square test proves that the increase in mandative subjunctives and the concomitant decrease of the periphrastic construction is highly significant (p ≤ 0.001)".
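A test of the kind referred to here can be reproduced in a few lines. The sketch below runs a chi-square test on a 2×2 table built from the totals reported in Tables 7 and 9 of this paper (mandative should vs. all mandative subjunctive forms: 155 vs. 29 in LOB, 83 vs. 58 in FLOB); these are not Hundt's own figures, so the resulting p-value is only illustrative of the procedure, not a reproduction of her result.

```python
from scipy.stats import chi2_contingency

# rows: LOB (1961), FLOB (1991); columns: mandative should, mandative subjunctive forms
table = [[155, 29],
         [83, 58]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.6f}")
# a p-value well below 0.001 would agree with the "highly significant" result quoted above
```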
The only feature of interest is the fact that, while Learned Prose is the category where the most important rise in the subjunctive has taken place, it is also the very category where mandative should has decreased the least. This tends to reconfirm our observation that this genre is formal and conservative, and that the only change experienced is the revival of a formal syntactic option. An observation of the data and a description of the results, which confirmed previous analyses, have been provided. They showed that, amongst stylistic variations from one genre to the next, general and real grammatical changes concerning mandative constructions are taking place in BrE. Now we need a possible explanation for the evolution witnessed in BrE. Where do this "revival" of the mandative subjunctive and decline of mandative should come from? Is this due to the Americanization of the British language? This possibility is worth investigating, as many observations on American vs. British English have suggested the influence of the former on the latter. The American influence has been referred to as an 'accelerator of change' within BrE (Barber, 1964: 141). Would this, then, be the main reason for these linguistic changes? The final section of this paper will thus analyse the two corpora of American English, Brown and Frown, in an attempt to explain these changes.

6. Analysis of two corpora of American English: a possible explanation?
The following examples are extracted from Brown and Frown:
(22) It was essential that he should restore his formidable reputation as a rip-roaring, ruthless gunslinger and this was the time-honored Wild West method of doing it. (Brown Fiction, N)
(23) The panel recommended that public affairs preparations should be included in the planning for future military operations […]. (Frown General Prose, E)
(24) The doctors had suggested Scotty remain most of every afternoon in bed until he was stronger. (Brown Fiction, K)
(25) Lattimore's attention to these events was distracted by a request from the United Nations that he head a technical assistance mission to Afghanistan […]. (Brown General Prose, G)

Table 11: Repartition of the mandative subjunctive forms per genre in the four corpora [14]
Genres           LOB    FLOB            Brown    Frown
Press            8      8 (0)           29       23 (- 20.7%)
Fiction          6      11 (+ 83%)      21       35 (+ 66.7%)
General Prose    9      21 (+ 133%)     42       27 (- 35.7%)
Learned Prose    6      18 (+ 198%)     15       22 (+ 46.7%)
Total            29     58 (+ 100%)     107      107 (0)

13 Her results display frequencies (expressed in percentages) which are not dissimilar to mine – see table 12.
14 The results for the two American corpora are still provisional.
We can see that the status of the mandative subjunctive in the Press genre in AmE is even worse than in BrE (a decrease of 20.7%). This would tend to indicate that the status quo of the subjunctive observed in FLOB might only be the prelude to a future decline, as is the case in AmE. The increase of the subjunctives in Fiction is more marked in FLOB than in Frown, although the number of instances in FLOB (11) is lower than in Brown (21) and far from the number reached in Frown (35). In General Prose, a rise can be observed in FLOB as opposed to a decline in Frown. Number-wise, BrE has almost, but not yet, caught up with AmE (21 compared to 27 occurrences), and the proximity in numbers seems to be due to the fact that the subjunctive is decreasing in this category of AmE. Here the results are inconclusive. Which variety of English is leading the changes?
Finally, Learned Prose shows an increase in the two 1990s corpora, and BrE is catching up with AmE, even if the number of instances in FLOB (18) lies between those in Brown (15) and Frown (22). Here BrE seems to be following the trend set by AmE. The findings on mandative should in AmE (see table 12) present a decrease of this construction throughout the genres (- 23% from the 1960s, with 26 occurrences, to the 1990s, with 20 instances); the numbers are very small (between 3 and 7 for each genre) and little variation is shown. If AmE is indeed leading the way on the path of linguistic change (as the results indicate), BrE appears to be lagging behind but is doing its best to catch up.

Table 12: Frequency of mandative should vs. mandative subjunctive forms in LOB/FLOB and Brown/Frown
        Brown                         Frown                         LOB                           FLOB
        Should   Subj.   Non-dist.    Should   Subj.   Non-dist.    Should   Subj.   Non-dist.    Should   Subj.   Non-dist.
n.      26       91      16           20       78      29           155      14      15           83       33      25
%       19.6%    68.4%   12.0%        15.8%    61.4%   22.8%        84.2%    7.6%    8.2%         58.9%    23.4%   17.7%
Subjunctive forms (Subj. + Non-dist.): Brown 80.4%, Frown 84.2%, LOB 15.8%, FLOB 41.1%.

The proportion of mandative subjunctive forms represents around 80% of mandative constructions in AmE and has risen by only 3.8 percentage points. Could there be a slowing down of the evolution of the mandative constructions, leading to a future stabilisation? In BrE the proportion went from 15.8% to 41.1%, a rise of 25.3 percentage points. The subjunctive seems to be the preferred option in AmE (we can note that should has not disappeared yet), whereas the periphrastic construction with should still seems to be favoured by British speakers, although the trend could reverse in the future. The trend observed in BrE regarding mandative should (- 46.5%) is mirrored in AmE, where it is also decreasing across the genres (by 23%). The hypothesis presented by Johansson and Norheim (1988) is verified: the mandative subjunctive forms are on the increase in BrE (by 100%), whereas the subjunctive has stabilised in AmE, with great disparity between text categories. Hence the spread and rise of the subjunctive observed in the first half of the 20th century must have slowed down in this variety of English.

7. Conclusion
This descriptive and exploratory study has identified specific trends in the evolution of mandative constructions which are specific both to genres and to varieties of English. The linguistic changes observed are much more drastic in BrE than in AmE, where the language change in progress seems to be slowing down. Three hypotheses for the evolution of BrE can be presented:
· Americanization. Several studies have presented the hypothesis (confirmed by their analyses) that AmE is more innovative than BrE in ongoing morphological and syntactic changes. It would therefore influence BrE and lead to an americanization of the language.
· Grammaticalization.
· Colloquialization. Mair (1997) states that very few genuine and significant instances of grammatical change can be observed. The changes "are not due to the fact that the grammar of the language itself has changed. Rather, these developments [increased frequency of the progressive and the going-to future, greater use of contracted forms] show that informal options which have been available for a long time are chosen more frequently today than would have been the case thirty years ago" (1997: 203).
Therefore, he prefers talking about a "colloquialization" of the norms of written English rather than about a "grammaticalization15" of the language when a syntactic structure is being replaced by an older grammatical form.
But can this be argued in the face of the rise in the use of the mandative subjunctive, which used to be reserved for formal genres? This hypothesis could be verified in the Press section, with either the status quo or a decrease in the mandative constructions. However, this possible explanation needs to be rejected for the other genres: in Learned Prose, a very slight decrease of should and a substantial increase of the subjunctive have been noted, hence the formality of this genre is not losing any ground. The same can be said about Fiction, with less extreme variations. Our work presents some limitations, and an even more detailed study by type of trigger and/or by single text category might provide more explanations. We also need to be extremely cautious with the conclusions drawn from the results, as the corpora analysed are rather small (only one million words each). It would be worth carrying out the same study on a bigger corpus, such as the British National Corpus, containing 90 million words of written BrE texts. If the trends observed in BrE were confirmed in the BNC data, this would show that the changes noted in the present analysis are not due to chance and/or to the sampling and size of the data.

References
Algeo J 1992 British and American mandative constructions. In Banks C (ed), Language and Civilisation: A Concerted Profusion of Essays and Studies in Honour of Otto Hietsch, Vol. 2, Frankfurt, Peter Lang, pp 599-617.
Asahara K 1994 English Present Subjunctive in Subordinate That-Clauses. Kasumigaoka Review 1: 1-30.
Christ O 1994 A modular and flexible architecture for an integrated corpus query system. COMPLEX'94, Budapest.
Fowler H W 1965 A dictionary of modern English usage. 2nd ed., revised by Gowers E. Oxford, Clarendon Press.
Harsh W 1968 The subjunctive in English. University of Alabama, University of Alabama Press.
Holmes J 1994 Inferring language change from computer corpora: Some methodological problems. ICAME Journal 18: 27-40.
Hundt M 1997 Has BrE been catching up with AmE over the past thirty years? In Ljung M (ed), Corpus-based studies in English, Papers from the seventeenth international conference on English language research on computerized corpora (ICAME 17). Amsterdam, Rodopi, pp 135-151.
Hundt M 1998 It is important that this study (should) be based on the analysis of parallel corpora: On the use of the mandative subjunctive in four major varieties of English. In Lindquist H et al. (eds), The Major Varieties of English, Papers from MAVEN 97, Växjö University, pp 159-175.
Johansson S, Norheim E H 1988 The subjunctive in British and American English. ICAME Journal 12: 27-36.
Mair C, Hundt M 1995 Why is the progressive becoming more frequent in English? – A corpus-based investigation of language change in progress. Zeitschrift für Anglistik und Amerikanistik 43: 123-132.
Mair C 1997 Parallel corpora: a real-time approach to the study of language change in progress. In Ljung M (ed), Corpus-based studies in English, Papers from the seventeenth international conference on English language research on computerized corpora (ICAME 17). Amsterdam, Rodopi, pp 195-209.
Övergaard G 1995 The Mandative Subjunctive in American and British English in the 20th Century. Stockholm, Almqvist & Wiksell International, Acta Universitatis Upsaliensis, Studia Anglistica Upsaliensia, Vol. 94.
Quirk R, Greenbaum S, Leech G N, Svartvik J 1985 A Comprehensive Grammar of the English Language. London, Longman.
Scott M 1996 WordSmith (Computer program). Oxford, Oxford University Press.
15 Grammaticalization: the study of how grammatical morphemes develop out of lexical material.

Through the looking glass of parallel texts
Serge Sharoff (sharoff@aha.ru)
Russian Research Institute for Artificial Intelligence / Alexander von Humboldt Fellow at the University of Bielefeld, LiLi Faculty
Postfach 10 01 31, D-33501 Bielefeld, Germany

1. Introduction
The project reported in this paper was aimed at the comparative analysis of the meanings of linguistic expressions taken in their context, in particular, how meanings intended by the speaker/writer are realized via lexical items. The goal of the project was to assess the influence of context, on the one hand, and of different languages, on the other, on the meanings delivered by the same lexical item. Since there is no direct access to an enumeration of the originally intended meanings, the method of analysis was based on the investigation of corpora. Several types of lexical items (verbs of motion, names of plants and animals, and size adjectives) were chosen from the English or Russian corpora. Their translations into Russian or English, respectively, were then checked. The corpora used in the project include two technical texts:
· the Microsoft Word'97 User Manual and its translation into Russian;
· excerpts from the AutoCAD v.13 User's Manual (Chapter 2: Drawing objects) and its translation into Russian1;
and three literary texts:
· Lewis Carroll's Alice's Adventures in Wonderland and three of its translations into Russian, by Demurova, Nabokov and Zakhoder;
· Vladimir Nabokov's The Vane Sisters and two of its Russian translations, made by Ilyin and Barabtarlo;
· Vladimir Nabokov's Lolita and its Russian translation, made by Nabokov himself.
1 The texts were prepared within AGILE, an EU project aimed at the multilingual generation of CAD/CAM manuals, cf. (Bateman et al, 2000).
All the texts were originally written in English and translated into Russian. The corpora were aligned at the sentence level in order to compare expressions of the same state of affairs both between languages and between translators. The corpus (it comprises about 250 thousand words in total) is relatively small in comparison with parallel corpora developed in other projects, for example, the Chemnitz English-German/German-English corpus (Schmied, Schäffler, 1996). However, its size made it possible to check most occurrences of the lexical items under investigation manually within a reasonable amount of time (about 3200 instances of verbs of motion and 1100 size adjectives were examined in total). The paper starts with a description of the XML representation for multilingual concordances and the Perl-based tools developed for creating, maintaining and consulting them. Then, Section 3 discusses the results of the annotation using lexical semantic features taken from dictionaries. Finally, Section 4 presents a network of oppositions which are important for modeling the lexical semantics of verbs of motion and size adjectives in English and Russian. Here, lexical semantics is represented by means of a systemic network, of the kind used in multilingual generation applications within the KPML environment (Bateman et al, 1999).

2. Software for working with parallel concordances
Perl-based software has been developed within the project to create and maintain parallel concordances, which are represented in XML. The concordance structure encodes:
1. the general information about each text;
2. the alignment between sentences in parallel texts;
3. morphosyntactic and lexical-semantic properties of the words of each text.
A concordance consists of three types of entities: a concordance text, a sentence and a word.
The general description of a document to be concordanced includes the following attributes:
1. docid—the identifier of the document (typically its file name);
2. lang—the language in which the document is written (the lang attribute actually sets the default value for the interpretation of all words in the concordance; the default value may be redefined for a separate sentence or a word, for example, when a foreign-language citation is found);
3. author—the author of the document.
The sentence description includes the following attributes:
1. sentenceid—the identifier of a sentence (composed of the document identifier plus the sentence index within the document);
2. correspondencelist—the list of identifiers of the corresponding sentences in other parallel texts (if a sentence corresponds to several sentences in one text, this is also reflected in the list).
The word description includes the following attributes:
1. wordid—the identifier of a word instance (composed of the sentence identifier plus the word index within the sentence);
2. lemma;
3. POS—the part of speech, e.g. POS="verb";
4. morphfeatures—morphological features of the word instance, e.g. morphfeatures="3pers, sing";
5. lexfeatures—lexical features of the word instance, e.g. lexfeatures="wordnet5, motion, physical, away, source-fg".
The concordance structure is quite flexible for describing parallel texts and is suitable as a basis for research in contrastive semantics. It also allows the use of several languages and several translations of the same text; in particular, the German translation of Alice is planned to be added to the concordance. The structure mostly conforms to the EAGLES guidelines (EAGLES, 1996); in particular, the corpus annotation is separated from the corpus itself. The most important difference from the EAGLES scheme concerns keeping the parts of speech (major categories in EAGLES terms) in one attribute (POS) and other values (like person and number) in another (morphfeatures). This simplification is suitable for the current purposes of the concordance development, i.e. the consultation of uses of lexical items. For instance, a concordance query which consults uses of the verb leave in the corpus can easily discard the noun leaves and the adjective left, which are irrelevant for the query, by specifying the condition POS="verb" in the query. As a result of the XML conversion, the concordance is significantly larger than the original text (approximately by a factor of ten); however, the XML representation of attributes keeps the size of the corpora smaller than the position-based encoding used, for example, in the Susanne corpus (Sampson, 1995). Alignment of the text pairs has been performed by means of the Marc Alister software package (Paskaleva, Mihov, 1997), which implements the Gale-Church algorithm for language-independent alignment (Gale, Church, 1993). Since the language-independent alignment algorithm frequently fails to detect translation equivalents, the alignment results have to be corrected manually. The alignment of Alice in Wonderland with its Russian translations has been corrected to the extent possible; the alignment of the other texts has mostly been left unchanged. However, the quality of the automatic alignment of the other texts was much better, because they contain less "noise" and the Gale-Church algorithm is noise-sensitive.
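As an illustration, the following is a minimal sketch in Python (the project's own tools are written in Perl) of a word-level record following the attribute scheme just described, together with the kind of query that discards leaves and left when looking for verbal uses of leave. The element names <text>, <s> and <w>, the identifiers and the feature values are invented for the example; only the attribute names come from the scheme above.

import xml.etree.ElementTree as ET

# Hypothetical fragment: docid/lang/author on the text, sentenceid and
# correspondencelist on sentences, wordid/lemma/POS/morphfeatures/lexfeatures
# on words. The tag names <text>, <s>, <w> are assumptions for the sketch.
SAMPLE = """
<text docid="alice-en" lang="en" author="Lewis Carroll">
  <s sentenceid="alice-en.12" correspondencelist="alice-ru-dem.12 alice-ru-nab.13">
    <w wordid="alice-en.12.1" lemma="she" POS="pronoun">She</w>
    <w wordid="alice-en.12.2" lemma="leave" POS="verb"
       morphfeatures="past" lexfeatures="motion, physical, away">left</w>
    <w wordid="alice-en.12.3" lemma="leaf" POS="noun" morphfeatures="plur">leaves</w>
  </s>
</text>
"""

root = ET.fromstring(SAMPLE)

# The query discussed above: occurrences of "leave" used as a verb,
# discarding the noun "leaves" by checking the POS attribute.
hits = [w for w in root.iter("w")
        if w.get("lemma") == "leave" and w.get("POS") == "verb"]
for w in hits:
    print(w.get("wordid"), w.text, w.get("lexfeatures"))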
The "noise" in Alice that disrupts the alignment is related to the reported speech, which is coded quite differently in the original English text and in the Russian translations. For this reason, sentence boundaries and their alignment cannot be determined automatically. In any case, the alignment errors that remain in the corpora do not cause significant problems when searching for translation equivalents, because the consultation function (see below) can present a wide context for each occurrence of a lexical item. A Perl script allows an incremental annotation of lexical items that conform to a condition. Initially, the lexical features were based on the set of senses of the lexical items under investigation taken from WordNet (Miller, 1990) for English and from the ECD and (Apresjan, 2000) for Russian. In the course of the project, the set of features for annotation was extended to include the oppositions discussed in Section 4. The concordance supports the development of an instance-based vocabulary, which provides another layer of representation of the semantic properties of lexical items. A lexical item is an XML entity with the following attributes:
1. lexid--the identifier of the lexical item (typically the same as the lemma, but it may have indices for representing homonyms);
2. lemma;
3. lang--language;
4. lexfeatures--the lexical features of the lexical item; they correspond to the set of features which are pertinent to the lexical item in all or most typical cases of its usage;
5. comments.
In addition, the XML entity for a lexical item has the following multiple-value attributes:
1. use--its values refer to the identifiers of the sentences in which the lexical item occurs;
2. synonym--in addition to the sentence identifier, it specifies the lexical item that is considered synonymous in the context of the sentence;
3. translation--in addition to the sentence identifier, it specifies the identifier and the language of the lexical item that is considered a translational equivalent in the context of the sentence.
The lexid, lemma, lang, and use attributes are filled automatically when the vocabulary is constructed from the concordance, while lexfeatures, synonym and translation are filled manually in the process of corpus analysis. The vocabulary is shared across all concordances and their languages. The consultation function produces an HTML file with a KWIC (KeyWords In Context) list, whose items are sentences that conform to a set of search criteria specified as a dedicated Perl function. The search criteria are applied to one sentence and may include all the information stored in the concordance about this sentence, e.g. the co-occurrence of words within the sentence, specific morphological or lexical features of words, as well as their synonyms or translation equivalents. The search criteria can restrict the length of the context which is extracted from the found sentences and presented to the user. The output can also be sorted with respect to the left or right context of the found items, or according to any other condition defined by a Perl function. As for translation equivalents, the corresponding parallel sentence, or a hyperlink to it, may also be included in the output. Each sentence in the output list can be explored with respect to its wider context, since it is hyperlinked to its position in the original text.
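A search criterion of the kind passed to the consultation function could look roughly as follows; this is an illustrative Python stand-in for the Perl predicate, reusing the invented <s>/<w> layout of the previous sketch, with the lemma sets chosen to mimic the query shown in Figure 1.

# Criterion: a size adjective co-occurring with a noun denoting a child.
# Lemma sets and element names are illustrative assumptions, not the
# project's actual inventory.
SIZE_ADJ = {"little", "small", "large", "big"}
CHILD_NOUN = {"girl", "boy", "child", "children"}

def criterion(sentence):
    lemmas = {w.get("lemma") for w in sentence.iter("w")}
    return bool(lemmas & SIZE_ADJ) and bool(lemmas & CHILD_NOUN)

def kwic(text_root, criterion):
    """Return (sentenceid, surface text, parallel sentence ids) for matches."""
    rows = []
    for s in text_root.iter("s"):
        if criterion(s):
            surface = " ".join((w.text or "") for w in s.iter("w"))
            rows.append((s.get("sentenceid"), surface,
                         s.get("correspondencelist", "")))
    return rows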
The screen shot in Figure 1 shows the selection of English and Russian sentences in which a size adjective (such as little or small) is accompanied by a noun denoting a child (girl, boy, child or children).

Figure 1: A concordance query result

The software developed allows incremental research into the lexical semantics of specific groups of lexical items. For example, the dictionary provides all lexical items, from which a list of, say, the verbs of motion that appear in the corpus can be extracted. The concordance query uses the list in order to classify the contexts in which such verbs can be used. As a result, specific contexts can be marked in the concordance (the information is stored in the lexfeatures attribute) and can be used in further concordance queries.

3. Two methodologies in lexical semantic representation
Approaches to lexical meanings fall roughly into two groups, which can be (superficially) labelled as logic- and communication-centered paradigms2. The first paradigm assumes that lexical meanings are concepts that belong to an ontology, which represents real-world objects and their properties. Lexical items may refer to one or several concepts, and by virtue of this reference they are endowed with a meaning. This relationship between words and meanings is primary with respect to communication, since the ontology exists and the mapping from words to concepts is defined independently of any possible act of communication. Within computational approaches to lexical semantics, good representatives of this paradigm are WordNet (Miller, 1990) and the Explanatory Combinatorial Dictionary, ECD (Mel'chuk, 1988). The second paradigm assumes the primacy of communication: human languages are not aimed at the correct representation of the world, but at the communication of experience. In this view, language is a tool for acting in the world, and words are hints which refer to the meanings intended by the speaker. This view can be defined as the meaning-as-use position. It is shared by a wide community of philosophers of language and linguists, for example, Wittgenstein, Harman and Halliday. If the meaning of a word depends on its contribution to the ongoing exchange between the speaker (S) and the hearer (H), it should be analysed in terms of its occurrence in an utterance taken in its context. The two paradigms seem to be contradictory. At the same time, they are complementary both in terms of their purposes and of their results. Thus, they can mutually benefit from their interaction. On the one hand, the proposed description utilizes lexicographical resources such as WordNet, the ECD, and (Apresjan, 2000), as well as the analysis of verbs of motion in (Levin, 1993). On the other hand, it is based on communication-centred systemic-functional linguistics, SFL (Halliday, Matthiessen, 1999). The communication-oriented approach pursued in the project contrasts with the representation of the meanings of lexical items as definitions in a dictionary. A dictionary entry lists senses as concepts that can be referred to by means of the respective lexical item. Each concept in the list of senses is considered a separate item, which may be related to other concepts that are other senses of the same word, but this is not specified explicitly in its definition. When a word is used in an utterance, its meaning is an element selected from the list of senses. A use of a word is considered ambiguous when it refers to more than one element. The length of the list of senses of a word depends on the word and on the lexicographer; however, the list is typically long.
For example, for the verb leave it contains 17 senses in WordNet, 15 senses (without idioms) in the Random House Webster's, and 31 senses (also without idioms) in the Oxford English Dictionary3. Analogously, (Apresjan, 2000) analyzes 19 senses of vyjti, which is the most typical translation equivalent of leave in Russian. When a formal lexicographic description (of the type found in WordNet or the ECD) is applied to corpora, two problems arise. On the one hand, some examples of real and felicitous usage do not fit into the fixed list of senses in the definition of a lexical item. On the other hand, some examples fall simultaneously under several senses. This is not a matter of disambiguation, since all the senses are relevant for a human judge, who, moreover, does not consider the usage ambiguous. Another problem concerns translated texts: the piece of reality described in parallel texts (or in parallel translations of the same text) is assumed to be essentially constant, but its lexicogrammatical realization varies significantly and is not restricted to the choice of different synonyms or translation equivalents. The concordance software and the parallel aligned corpus (reported above) made it possible to analyse the English and Russian corpora with respect to:
1. the usage of verbs referring to motion, e.g. go, leave, vyjti;
2. the usage of nouns referring to natural-kind objects, such as names of trees and animals, e.g. camomile, rabbit;
3. the usage of adjectives referring to size.
The annotation of lexical features was based on their senses from WordNet for English and from the ECD and (Apresjan, 2000) for Russian. The experiment shows that, in the case of verbs of motion, 6% of the uses of English verbs in the corpus do not fit into any sense from WordNet, while about 35% are ambiguous, i.e. more than one sense can be used for the annotation, in spite of the fact that their use is not ambiguous. For example, WordNet contains two subsets of senses of leave, which refer, respectively, to leaving a place and leaving a person. The first case contains two senses: go away and move out of. In many utterances the distinction between them cannot be drawn, so the use should be considered ambiguous. The second case requires death or divorce, so no sense from WordNet is applicable to:
(1) But her sister sat still just as she left her
The same situation holds for size adjectives. Properties of objects with respect to their size are particularly important for the story of ‘Alice', but the usage of such words as large, little and small does not always follow their typical definitions, for example, of large as ‘greater than average size, quantity, or degree'.

2 According to (Matthiessen, Bateman, 1992: 54-55), the origins of both the logic- and the communication- (rhetorical) centered paradigms date back to the very beginning of thinking about language in the Western tradition. Nowadays, the two perspectives are partly mirrored in the opposition of formal and functional linguistics.
3 The senses for the noun and for phrasal verbs were not counted.
Whenever such properties are mentioned, they refer to more than just the physical characteristics of an object in comparison to its average size, because the size of an object is mentioned only if the reference to it is appropriate for some purpose of the author, often in comparison to the size of Alice:
(2) a small passage, not much larger than a rat-hole (so that Alice could not pass through it with the size she had at that moment)
(3) [she went on crying] until there was a large pool all round her.
As a result, some references to size are not translated into Russian at all or are expressed in a different way. For example, in Nabokov's translation of ‘Alice', (3) is rendered as:
(4) … posredine zaly obrazovalos’ glubokoe ozero
    in-center-of hall appeared-3sg deep lake
Large vs. glubokij (deep, in this sense) and pool vs. ozero (lake) cannot be considered translation equivalents in a normal dictionary. From the logical viewpoint, “a large pool” is in no way synonymous with “a deep lake”. Yet basically the same meaning is delivered in the translation. As for the annotated occurrences, 11% of the uses of English size adjectives are not covered in WordNet and about 65% are ambiguous. This is partly related to the unclear design of lexical entries for size adjectives; for example, WordNet has the following senses of little:
1. limited or below average in number or quantity or magnitude or extent;
5. of little importance or influence or power; of minor status;
6. (informal terms) small and of little importance;
8. contemptibly narrow in outlook, e.g. "a little mind consumed with trivia"; "petty little comments";
11. used of persons or behavior; characterized by or indicative of lack of generosity.
It is unclear which sense is implied in:
(5) - Oh, you wicked little thing! - cried Alice, catching up the kitten, and giving it a little kiss to make it understand that it was in disgrace.
Both occurrences of little in the example correspond simultaneously to several senses from the set: little thing means 1, 5, 6 and 8 (also, little4--young), while a little kiss means 1, 6 and 11.

4. Words as Resources for Communication
The picture reported in the previous section is mostly negative: the real usage of words does not necessarily conform to their definitions in a dictionary. However, the concept and the format of modern monolingual dictionaries depend on the history of their development4 and on the function they serve in society, namely, to be an authoritative source which helps in understanding a specific use of a word or in checking/ensuring the correctness of its use. As a result, a dictionary entry is designed as a list of senses that denote events, objects or their properties. In other words, it statically describes the result of the dynamic process in which words are used by S to refer to events, objects and their properties5. Moreover, S uses words not only for the purpose of referring, but also for acting on H by means of referring. The model proposed in this paper is aimed at describing how words are used in purposeful communication. The model is based on Systemic-Functional Linguistics (SFL), cf. the analysis of its traditions for representing lexical meanings in (Wanner, 1997).

4 The origins of the tradition of sense enumeration in a monolingual dictionary are related to the development of printed discourse, particularly the new periodicals, in England in the eighteenth century. This brought about a reevaluation of the nature of meaning, cf. (Kilgariff, 1997).
5 The Collins COBUILD English Dictionary is an exception in this respect; it presents senses according to the communicative potential of lexical items.
In spite of all the differences in approaches, the computational formalism for describing lexicogrammatical meanings in SFL is the systemic network, which represents choices between interrelated oppositions. For instance, the classification of English mood starts with the features ‘indicative’ vs. ‘imperative'. Semantically, this corresponds to the opposition between speech acts that exchange information and those that issue commands. Thus, the grammatical choice is controlled by an inquiry into relevant parameters which are beyond the grammar, cf. (Matthiessen, Bateman, 1992). The proposed communication-oriented description of lexical items separates the potential of possibilities, which defines the possible usage of lexical items and is represented by oppositions in the systemic network, from the instantiation of the potential, which is developed for their instances in the context of an utterance. The set of features which are fired for a lexical item as a result of its use in the discourse may correspond to its sense in a dictionary. However, specific communicative goals of S, or S's idiolect, may also result in a different set of features. The meaning of a lexical item in context depends not only on the set of features chosen for the item itself, but also (a) on its contribution to the lexicogrammatical structure of the utterance (the principle of compositionality) and (b) on the inquiries which are responsible for the choice of those features, i.e. how features correspond to extralinguistic concepts, to traditions of the register and to S's communicative goals. SFL also distinguishes between paradigmatic classifications, which are represented by the feature inheritance network, and syntagmatic realizations, which are implied by the choice of features. In the case of lexical semantics, realization statements constrain the choice of lexical items. The proposed description also addresses the issue of multilingual differences between senses. The systemic network is based on the notion of (multiple) inheritance of features. The least delicate, more general, choices tend to be shared across languages, while more delicate choices tend to be language-specific (Bateman et al, 1999). For example, many languages have the type of motion verbs and also distinguish, as its subtypes, between motion towards and motion away from the reference object, e.g. enter vs. leave. However, in Russian there are further subtypes, which are lexicalized using prefixes; for example, the away-from motion subtype can be designated by such verbs as vykhoditj, otkhoditj and ukhoditj (all of them typically translated as leave), but they denote different configurations of source and destination properties: vykhoditj implies motion from within the source, otkhoditj implies motion from the vicinity of the source, and ukhoditj implies motion towards a remote destination. Such subtypes are language-dependent choices. Thus, the systemic network represents both the commonalities and the differences between languages. The following subsections contain systemic network fragments that are required for representing the lexical semantics of verbs of motion and size adjectives. The description is oriented towards the automatic generation of expressions for denoting the respective events and properties.
4.1 A fragment of the systemic network for verbs of motion
Most designations of motion processes in English and Russian are based on the following oppositions6:
The type of motion: physical vs. imaginary
(6) He left the room vs. He left the job.
Most often, the reference to physical or imaginary motion is not an inherent feature of a lexical item, since either interpretation is possible depending on context: to go through a lot of trouble; to advance the claim; to jump to conclusions. Within physical motion there may be a difference in the manner of motion (run, crawl, fly), or the manner may be left unspecified. The lexical item may also express the cause of the motion. Two clear cases are (a) external circumstances, e.g. to quit the job (compare to leave the job); and (b) habits or legal requirements, e.g. withdraw (Apresjan, 2000). Within the imaginary motion subtype, an opposition is drawn between those processes that retain some kind of metaphorical motion and those that use verbs of motion to designate an activity without an overt relation to spatial properties, e.g. The boy left school in the middle of the year (where an institution is considered as a space to be left) vs. He had left off writing on the slate. In Russian, the options for imaginary motion include, among others, processes for changing one's status with respect to a social institution (cf. Figure 2), in particular:
(7) vyjti, vypisat'sja iz bol'nicy (a patient leaves a hospital);
(8) vyjti, osvobodit'sja iz tjur'my (a prisoner leaves a prison);
(9) vyjti na rabotu, vyjti na pensiju, vyjti v otstavku (to start one's duties, to become a pensioner, to resign);
(10) vyjti iz otpuska (to end one's vacation).
Figure 2: The systemic network for Russian verbs of motion (the “away” subtype)
The direction: unspecified vs. towards vs. away
(11) She ran for two hours vs. She entered the room.
A lexical item may either designate an inherently directed motion with respect to a reference point (arrive, depart, descend, return) or be indifferent to it (roll, slide, walk). In the case of inherently directed motion, it can be treated as away from a reference point or towards it (to enter a room vs. to leave a room). This opposition is also applicable to imaginary motion.
The focal point: source- vs. destination-foregrounded
(12) He came in with a teacup vs. (13) A torrent of Italian music came from an open window.
This opposition differs from the direction of motion with respect to the reference point, since the motion may be directed towards the reference point while the source of the motion is foregrounded, cf. (12) and (13). The opposition does not necessarily concern the semantics of the prepositional phrase. The meaning of the focal point of an utterance can be incorporated into the verb as well, cf. ausgehen (to go out for an entertainment) in German or vyjti v svet (a publication is issued) in Russian. Thus, example (1) can be represented by the following sequence of features: motion-process, physical, nonspecific, directed-motion, away, destination-fg.

6 There are other oppositions that influence the lexical meanings of verbs of motion, e.g. activity vs. achievement (Levin, 1993). However, they are not specific to verbs of motion.

4.2 A fragment of the systemic network for size adjectives
As with verbs of motion, the investigation of size adjectives goes beyond the domain of pure size descriptions. The following oppositions are considered (they are shared by English and Russian):
The type of measurement: spatiotemporal vs. amount
Firstly, the majority of adjectives which refer to size are not specific to this purpose. They may also refer to the description of quantities, e.g. little door vs. little hope vs. little pattering of feet. From the cognitive viewpoint, a reference to size implies a mapping from a qualitative state onto a one-dimensional space which represents the measure for the state. Many words that designate this mapping are therefore naturally applicable both to sizes and to quantities. However, the opposite is not true, because some expressions are applicable only to amounts, e.g. a lot of, insignificant. Secondly, many words that refer to size can also be used to refer to time. In language (at least, in Indo-European languages) temporal qualities are often expressed by the same means as spatial ones, e.g. little stay, long duration (as well as from April to June). In the case of spatiotemporal measurements, the properties denoted by an adjective may be either directional (deep, early) or non-directional (small). It is not easy to classify adjectives along measured directions; an example is the adjective deep, which has a set of senses referring to all possible directions: a deep well (vertical), a deep shelf (horizontal), a deep border (broad). However, all the senses of deep are related to the presence of a remote surface, cf. also the metaphorical uses deep space, deep thoughts, deep sleep. The described set of senses of the English adjective deep corresponds to its Russian translation equivalent glubokij.
The reference of measurement: small vs. big
This opposition is simultaneous with the distinctions in the type of measurement. Both references to spatiotemporal and amount properties, as well as metaphorical expressions, can be classified in this respect, cf. deep sorrow, shallow thoughts; large breakthrough, small problem. In Russian, the realization of this opposition is often based on diminutive suffixes to render the smaller side of the scale. In the translation of large pool in (4), the contextually important features, such as the large physical measurement and the water surface (to swim in), are preserved, thus making the translation contextually appropriate.
The interpersonal dimension: sympathy vs. neutral vs. antipathy
A reference to a property of an object is often aimed at producing a rhetorical impact on H. The interpersonal dimension of the way properties are denoted is responsible for what Sinclair (1991) calls semantic prosody. The reference to the size of an object or person may be mentioned to justify the need to take care of it or to be afraid of it. Often, the small end of the size scale is associated with sympathy, cf. example (5). However, the sympathy vs. antipathy opposition can be detected even in roughly synonymous lexical items. Let us consider examples from the corpus in which the adjectives little and small are used to refer to a child, and check their translations7:
(14) ignorant little girl (D) strashnaja nevezhda (awful ignorant), (N, Z) durochka (fool-femin-diminut)
(15) I'm a little girl (D) Ja malen'kaja devochka (I'm a small girl) (Z) Ja devochka (I'm a girl)
(16) very few little girls of her age (D) nemnogo najdjotsja devochek ejo vozrasta (few find-refl girls of her age)
(17) A small boy and a begrimed, bowlegged toddler lurked behind them. Malen'kij mal'chik i zamyzgannyj, kolchenogij mladenec zamajachili gde-to za nimi. (Lolita)

7 Translations with insignificant differences are omitted. D stands for Demurova, N for Nabokov, Z for Zakhoder.
(18) there came a blinding flash, and beaming Dr. Braddock, two orchid-ornamentalized matrons, the small girl in white, and presumably the bared teeth of Humbert Humbert were immortalized
The selection provides two observations. Firstly, the reference to age may be omitted from the English original without a loss in the representation of the person. Thus, the reference is used mostly to add a specific semantic flavour. The reference to age is omitted in some Russian translations and preserved in others. Secondly, in spite of the fact that most dictionaries, including WordNet, the Random House Webster's and the OED, treat little and small as synonymous for denoting young children, every time small is used, the context is unfavourable (though this observation may be related to the limited size of the corpus).

5. Conclusions
The project reported in this paper presents an attempt at a contrastive semantic study on the empirical basis of a corpus-driven description. The study of verbs of motion and size adjectives in terms of their use in texts illustrates that, even though lexicographic research aimed at enumerating the senses of a lexical item provides an invaluable resource for the lexical semantic analysis of texts, a dictionary cannot account for all the possible felicitous uses of words. When the list of senses gets more elaborate, more uses become ambiguous, since they often start to refer to more senses from the dictionary. The communication-centred perspective on lexical semantics shifts the attention from the meaning of a lexical item to the possible uses of groups of lexical items. The question of lexical semantic analysis in this case is not "What is the meaning of a lexical item?", but "How are things meant by lexical items?" This also corresponds better to the social foundations of linguistic interaction: the end of using language is acting on others. The future direction of the reported research consists in the development of a large-scale corpus-based comparative description which models the usage of motion verbs in English, German and Russian, and its implementation for automatic generation within the KPML environment. Another outcome of the reported research is the development of a parallel English-Russian corpus aligned at the sentence level. No corpus of this type existed at the beginning of the project. The project also led to the development of special-purpose software to create multilingual concordances and interact with them. Fortunately, Perl as the programming language and XML as the underlying representation allowed for rapid prototyping.

Acknowledgments
The research reported in this paper has been funded by research support grant RSS321/1999 from the Open Society Institute, and by a Fellowship awarded by the Alexander von Humboldt Foundation, Germany.

References
Apresjan, J.D. (2000) Systematic lexicography. Oxford: Oxford University Press.
Bateman J.A., Matthiessen C.M.I.M., & Zeng L. (1999) Multilingual natural language generation for multilingual software: a functional linguistic approach. Applied Artificial Intelligence, 13, 607-639.
Bateman J.A., Teich E., Kruijff G-J., Kruijff-Korbayova I., Sharoff S., & Skoumalova H. (2000) Resources for multilingual text generation in three Slavic languages. In Proc. Language Resources and Evaluation Conference LREC2000, Athens, Greece, May 30-June 2, 2000, pp. 1763-1767.
EAGLES (1996) Recommendations for the morphosyntactic annotation of corpora. EAG-TCWG-MAC/R.
Available from ftp://ftp.ilc.pi.cnr.it/pub/eagles/corpora/annotate.ps.gz
Gale W., & Church K. (1993) A Program for Aligning Sentences in Bilingual Corpora. Computational Linguistics, 19(1): 75-102.
Halliday, M.A.K., & Matthiessen C.M.I.M. (1999) Construing experience through meaning: a language-based approach to cognition. London: Cassell.
Kilgariff, A. (1997) "I don't believe in word senses". Computers and the Humanities 31 (2), 91-113.
Levin, B. (1993) Towards a lexical organization of English verbs. Chicago: University of Chicago Press.
Matthiessen C.M.I.M., & Bateman J.A. (1992) Text generation and systemic functional linguistics: experiences from English and Japanese. London: Pinter Publishers.
Mel'chuk I.A. (1988) Semantic description of lexical units in an Explanatory Combinatorial Dictionary: basic principles and heuristic criteria. International Journal of Lexicography 1, 165-188.
Miller G., ed. (1990) WordNet: an online lexical database. International Journal of Lexicography 3 (special issue).
Paskaleva E., Mihov S. (1997) Second Language Acquisition from Aligned Corpora. In Proceedings of the International Conference "Language Technology and Language Teaching", Groningen.
Sampson G. (1995) English for the computer: the SUSANNE corpus and analytic scheme. Oxford: Clarendon Press.
Sinclair J. (1991) Corpus, Concordance, Collocation. Oxford: Oxford University Press.
Schmied J., Schäffler H. (1996) Approaching Translationese through Parallel and Translation Corpora. In Percy C.E., Meyer C.F. & Lancaster I. (eds), Synchronic Corpus Linguistics. Amsterdam: Rodopi.
Wanner L. (1997) Exploring lexical resources for text generation in a systemic functional language model. PhD Thesis, University of Saarbrücken.

CLaRK - an XML-based System for Corpora Development1
Kiril Simov, Zdravko Peev, Milen Kouylekov, Alexander Simov, Marin Dimitrov2, Atanas Kiryakov3
The CLaRK Programme
Linguistic Modelling Laboratory - CLPPI, Bulgarian Academy of Sciences
kivs@bgcict.acad.bg, zpeev@bgcict.acad.bg, mkouylekov@dir.bg, adis_78@dir.bg, marin@sirma.bg, naso@sirma.bg

1 Introduction
In this paper we describe the architecture and the intended applications of the CLaRK system. The development of the CLaRK system started under the Tübingen-Sofia International Graduate Programme in Computational Linguistics and Represented Knowledge (CLaRK). The main aim behind the design of the system is the minimization of human work during the creation of corpora. Corpus creation is still an important task for the majority of languages, such as Bulgarian, where the effort invested in such development is very modest in comparison with more intensively studied languages such as English, German and French. We consider the corpus creation task as the editing, manipulation, searching and transformation of documents. Some of these tasks will be performed on a single document or on a set of documents, others on a part of a document. Besides the efficiency of the corresponding processing at each stage of the work, the most important investment is the human work. Thus, in our view, the design of the system has to be directed at minimizing the human work. For document management, storage and querying we chose XML technology because of its popularity and its ease of understanding. Very soon XML technology will be part of our lives and it will be the predominant language for data description and exchange on the Internet.
Moreover, many already developed standards for corpus description, such as CES (Corpus Encoding Standard 2001) and TEI (Text Encoding Initiative 1997), have already been adapted to XML requirements. The core of the CLaRK system is an XML editor which is the main interface to the system. With the help of the editor the user can create, edit or browse XML documents. To facilitate corpus management we enlarge the XML inventory with facilities that support linguistic work. We added the following basic language processing modules: a tokenizer with a module that supports a hierarchy of token types; a finite-state engine that supports the writing of cascaded finite-state grammars and facilities that search for finite-state patterns; the XPath query language, which supports navigation over the whole set of mark-up of a document; mechanisms for imposing constraints over XML documents which are applicable in the context of certain events; and a database over the XML mark-up and tokenized content of the documents. We envisage several uses for our system:
1) Corpus markup. Here users work with the XML tools of the system in order to mark up texts with respect to an XML DTD. This task usually requires an enormous human effort and comprises both the mark-up itself and its subsequent validation. Using the available grammar resources, such as morphological analyzers or partial parsing, the system can state local constraints reflecting the characteristics of a particular kind of text or mark-up. One example of such a constraint is the following: a PP according to a DTD can have as parent an NP or a VP, but if its left sister is a VP then the only possible parent is a VP. The system can use this kind of constraint in order to support the user and minimize his or her work.
2) Dictionary compilation for human users. The system will support the creation of the actual lexical entries, whose structure will be defined via an appropriate DTD. The XML tools will also be used for corpus investigation that provides appropriate examples of word usage in the available corpora. The constraints incorporated in the system will be used for writing a grammar of the sublanguages of the definitions of the lexical items, and for imposing constraints over elements of lexical entries and over the dictionary as a whole.

1 The work on the system is currently supported by the BulTreeBank project, funded by the Volkswagen-Stiftung, Federal Republic of Germany, under the programme “Cooperation with Natural and Engineering Scientists in Central and Eastern Europe”, contract I/76 887.
2 Currently at the OntoText Lab, Sirma AI Ltd, Sofia.
3 Currently at the OntoText Lab, Sirma AI Ltd, Sofia.

The structure of the paper is as follows: in the next section we give a short introduction to the main notions of the XML language and the XPath query language, the third section describes the main components of the CLaRK system and their functionality, and the last section outlines some directions for future development.

2 XML technology
XML stands for eXtensible Markup Language (see XML 2000) and it emerged as a new-generation language for data description and exchange on the Internet. The language is more powerful than HTML and easier to implement than SGML. Starting as a markup language, XML has evolved into a technology for structured data representation, exchange, manipulation, transformation and querying. The popularity of XML and its ease of learning and use made it a natural choice as the basis of the CLaRK system.
This section presents in an informal way some of the most important notions of the XML technology. For a more rigorous and complete presentation the reader is directed to the corresponding literature at the following address: http://www.w3c.org/
XML defines the notion of a structured document in terms of sequences and inclusions of elements in the structure of the document. The whole document is considered as an element which contains the rest of the elements. The elements of the structure of a document are marked up by means of tags. Tags can surround the content of an element, or they can mark certain points in the document. In the first case the beginning of an element is marked up by a so-called open tag, written as <tagname>, and the end is marked up by a closing tag, written as </tagname>. For example, TEI documents include at the top level the following two elements:
<TEI.2>
<TeiHeader> ... content of the TEI header element ... </TeiHeader>
<text> ... content of the text element ... </text>
</TEI.2>
Tags of the second kind, the so-called empty elements, are represented as <tag/>. For example, a line break within a sentence can be represented in the following way:
<s> ... first line of text ... <lb/> ... second line of text ... </s>
Each element can be connected with a set of attributes and their values. The currently assigned set of attributes of an element is recorded within the open tag of the element, or before the closing slash in an empty element. Some of the attribute-value pairs are assigned to some tags by default and thus it is not obligatory to list them. One important requirement for an XML document is that elements having common content must be strictly nested one within the other. This means that overlapping of elements is forbidden. Such documents are called well-formed. For example, the following document is not well-formed and thus it is not an acceptable XML document:
<doc><el1> ... <el2> ... </el1> ... </el2></doc>
XML technology defines a set of mechanisms for imposing constraints over XML documents. A very basic mechanism of this kind is the so-called DTD (Document Type Definition). A DTD defines the inclusion of elements and the possible sequences of elements within the content of an element. These definitions are given as ELEMENT statements in the DTD. Each ELEMENT statement has the following format:
<!ELEMENT tagname content_definition>
where tagname is the name of the element and content_definition is a definition of the content of this kind of element. Apart from some reserved words such as EMPTY and ANY, the content definition is represented as a regular expression over tag names. This regular expression determines the tags and their order in the content of the enclosing element. Additionally, the DTD can contain definitions of the allowed attributes for the elements, entity declarations and others. For more details the interested reader is directed to the literature on the corresponding notions – see the address above. An XML document containing elements whose content obeys the restrictions stated in a DTD is said to be valid with respect to this DTD.
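For concreteness, here is a minimal sketch of the well-formedness/validity distinction using the third-party lxml library (this is not part of the CLaRK system); the toy DTD and element names are invented for the example.

from io import StringIO
from lxml import etree

# A toy DTD: a <doc> must contain one <el1> followed by one <el2>.
dtd = etree.DTD(StringIO("""
<!ELEMENT doc (el1, el2)>
<!ELEMENT el1 (#PCDATA)>
<!ELEMENT el2 (#PCDATA)>
"""))

good = etree.fromstring("<doc><el1>a</el1><el2>b</el2></doc>")  # well-formed and valid
bad = etree.fromstring("<doc><el2>b</el2><el1>a</el1></doc>")   # well-formed but invalid

print(dtd.validate(good))  # True
print(dtd.validate(bad))   # False: the element order violates the content model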
Another important language defined within the XML world and used within the CLaRK system is the XPath language. XPath is a powerful language for selecting elements from an XML document. The XPath engine considers each XML document as a tree where the nodes of the tree represent the elements of the document, the document itself is the root of the tree, and the children of a node represent the content of the corresponding element. The attributes of each element and their values are represented as an addition to the tree. Each expression in the XPath language is evaluated with respect to a chosen node in the tree, called the context node, and consists minimally of two parts: an axis and a node test. Optionally, one can impose an additional predicate expression. The axis part determines the direction, with respect to the context node, in which the expression will be evaluated. The node test determines the type of the nodes that we are interested in. This can be text, tag names and so on. Among all the nodes of the required type in the specified direction there may be nodes that we want to choose. These nodes are further filtered by a predicate expression, which can be evaluated as true or false over a node. The XPath syntax allows for recursive XPath expressions and also for unions of XPath expressions. The complete definition of the XPath language can be found in XPath (1999). Here we give a very simple example: the expression
/descendant-or-self::gram[contains(string(child::text()),"V")]
will be evaluated to a list of all <gram> nodes whose textual value contains the letter "V".

3 CLaRK System
At the heart of the CLaRK system is XML technology as a set of utilities for structuring, manipulating and managing data. We started with basic facilities for the creation, editing, storage and querying of XML documents, and developed this inventory further towards a powerful system for processing not only single XML documents but an integrated set of documents and constraints over them. The main goal of this development is to allow the user to add to the XML documents a desirable semantics reflecting the user's goals. Inside the system, the core structure of the representation of XML documents follows the DOM Level 1 specification (DOM 1998). When an XML document is imported into (or created in) the system, it is stored in this internal representation, and in this way the user has access to it only via the facilities of the system. This restriction allows us to maintain the consistency of the data represented in the system. We plan to exploit this feature of the system even further in the future for the automatic support of the construction of XML documents that reflect the content of a corpus or a set of documents. The CLaRK system includes the following components: XML Engine, XML Editor, Database, Document Transformation, Tokenizer, Constraints Engine, XPath Engine and FSA Engine. In this section we describe each of these components in turn. Some of these components are not directly accessible by the user; they are used within the other components to support the corresponding functionality. The most important components of this type are the XPath Engine and the FSA Engine. The first is a module which evaluates XPath expressions over a document, and the second is a module dealing with the compilation of regular expressions into finite-state automata and the determinization and minimization of the compiled automata. The following screen shot gives an overview of the main interface to the system. On the left side of the screen the tree view of the current document is displayed; under it, the attribute-value table shows the attributes and their values for the selected element. On the right side of the screen the textual representation of two documents is given. The message window is at the bottom of the screen.

3.1 XML Engine
The XML Engine offers a full set of facilities for processing XML documents.
This includes a DTD compiler, which compiles the element, entity and attribute definitions in a DTD and represents them in an internal format. For the elements, the internal format is a set of finite-state automata corresponding to the content definitions of the elements. These automata are determinized and minimized during the compilation. Attributes and entities are stored as hash tables. The second element of the XML Engine is the XML parser. This parser transforms an XML document into the system's internal DOM representation. During the parsing process the parser checks the well-formedness of the document and reports the corresponding errors. The third component is the Validator. This module checks the validity of the document with respect to a DTD. Each document which is loaded into the system has to be attached to a DTD. Once a document has been parsed into the internal representation of the system, it can be saved in this internal representation, and the next time it is used it will not need to be parsed again. The Validator is active for the document currently loaded in the editor, and when the user modifies the document the Validator reports the changes in the validity of the document, with pointers to the corresponding erroneous elements of the document.

3.2 XML Editor
Access to the system is via a structure-driven editor which allows the user to edit and manipulate XML documents. Each document loaded into the editor is presented to the user in two or more views. One of these views reflects the tree structure of the document, as described in the previous section. The other views of the document are textual. Each textual view shows the tags and the text content of the document. The tags in the textual view are elements separate from the rest of the text and cannot be edited. The user has the possibility of attaching to each textual view a filter which determines which tags, and the content of which elements, are displayed in the view. This option allows the user to hide some of the information in the document and to concentrate on the rest of the information. Different filters can be attached to different textual views of the same document. The editor supports a full set of editing operations, such as copy, cut, paste and so on. These operations are consistent with the XML structure of the document. Thus the user can copy or delete a whole subtree of the document. Some of these operations, such as search and replace, are defined in terms of XPath expressions. This allows the user to search not only in the textual content of the document but also with respect to the XML mark-up. The most powerful operation here is the XPath replace. This operation is used for various commands for restructuring the document. Generally, the scenario is the following: (1) a list of nodes (subtrees, text elements) is chosen by the Source XPath expression. In this way the elements which will be copied or moved in the document are defined; (2) a list of nodes is chosen by the Target XPath expression. In this way the place(s) where the source elements will be copied or moved are defined; (3) the elements from the source list are attached to the elements of the target list.
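The following Python sketch (using lxml, and not the CLaRK implementation itself) illustrates the source/target scenario on the partially marked-up sentence that reappears in Section 3.5; the XPath expressions and the "cut" behaviour chosen here represent only one of the possible option combinations.

from lxml import etree

doc = etree.fromstring(
    "<s><np>The man</np><v>saw</v><np>the boy</np><pp>in the garden</pp></s>")

source_xpath = "//pp"          # nodes to be moved
target_xpath = "//np[last()]"  # node(s) to which they are attached

sources = doc.xpath(source_xpath)
targets = doc.xpath(target_xpath)

for target in targets:
    for node in sources:
        # "cut" behaviour: detach the node from its old parent, then attach it
        node.getparent().remove(node)
        target.append(node)

print(etree.tostring(doc).decode())
# <s><np>The man</np><v>saw</v><np>the boy<pp>in the garden</pp></np></s>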
There are several options defining the way this action is performed. These concern such things as whether the source elements are copied or cut from the document before they are attached to the target, and the mapping between the source and target elements: it is possible for all the source elements to be attached to each element of the target, or for each element of the source to be attached to the corresponding element of the target. With its different options this operation becomes a very powerful means of modifying a document, or of entering new information when the source is given not as an XPath expression but as a fragment of an XML document or as text. In the future we plan to allow the evaluation of the source and the target expressions over different documents and thus to allow their merging. The editor allows the editing of the document's textual content or of its structure. The editing of the structure is supported by the DTD attached to the document. When the cursor is located at some point in the document structure, the user can enter a child, a sibling or a parent of the element pointed to. In each case the DTD is consulted and the list of tags allowed in this position is offered to the user.

3.3 Document Transformation
The system offers a general mechanism for the transformation of XML documents into other XML documents. This is done by implementing the XSLT language (XSLT 1999). Transformations can be applied in two modes: globally and locally. When a transformation is applied globally, it is applied to the whole document. In the future we plan for some transformations to be applicable to a set of documents. When the user wants to apply a transformation locally, he or she first selects an appropriate fragment of the document and the transformation is then applied only to this fragment. This last option provides a mechanism for constructing a set of transformations which the user applies depending on the current task, thus avoiding the necessity of writing very specific conditions on the applicability of the transformation. Some transformations, corresponding to small changes in the DTD of documents (such as the reordering of elements), are generated automatically.

3.4 Tokenizer
XML considers the content of each text element as a single string, which is unacceptable for corpus processing, where one usually needs to distinguish wordforms, punctuation and other tokens in the text. In order to cope with this problem the CLaRK system supports a user-defined hierarchy of tokenizers. At the most basic level the user can define a tokenizer in terms of a set of token types. In such a basic tokenizer each token type is defined by a set of UNICODE symbols. Above these basic-level tokenizers the user can define other tokenizers, for which the token types are defined as regular expressions over the tokens of some other tokenizer, the so-called parent tokenizer. Within the system, tokens are used in different processing modules. For each tokenizer an alphabetical order over the token types is defined. This order is used for operations such as comparing two tokens, sorting and the like. Sometimes the user will want to apply different tokenizers in different parts of one document. For instance, in a multilingual corpus the sentences in different languages will need to be tokenized by different tokenizers. In order to allow this functionality, the system allows tokenizers to be attached to documents via the DTD of the document. To each DTD the user can attach a tokenizer which will be used for the tokenization of all textual elements of the documents corresponding to that DTD. Additionally, the user can override the DTD tokenizer for some of the elements by attaching other tokenizers to them.
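A rough Python sketch of the tokenizer hierarchy idea: a basic tokenizer whose token types are character classes, and a derived tokenizer whose token types are defined over the parent tokenizer's output. The token type names and the "ordinal" rule are invented for illustration and do not reproduce CLaRK's actual tokenizers.

import re

# Basic tokenizer: each token type is defined by a class of UNICODE symbols.
BASIC = [
    ("LAT",   r"[A-Za-z]+"),         # Latin-alphabet word
    ("CYR",   r"[\u0400-\u04FF]+"),  # Cyrillic word
    ("NUM",   r"[0-9]+"),
    ("PUNCT", r"[^\sA-Za-z0-9\u0400-\u04FF]"),
]

def tokenize_basic(text):
    pattern = "|".join(f"(?P<{name}>{rx})" for name, rx in BASIC)
    return [(m.lastgroup, m.group()) for m in re.finditer(pattern, text)]

# Derived tokenizer: token types defined over the parent tokenizer's types,
# e.g. an ordinal such as "13th" = NUM followed directly by LAT.
def tokenize_derived(text):
    tokens = tokenize_basic(text)
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i][0] == "NUM" and tokens[i + 1][0] == "LAT":
            out.append(("ORDINAL", tokens[i][1] + tokens[i + 1][1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

print(tokenize_derived("AutoCAD v.13th release"))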
3.5 Constraints
The general syntax of a constraint in the CLaRK system is the following:
(Selector, Condition, Event, Action)
where the selector defines to which node(s) in the document the constraint is applicable; the condition defines the state of the document in which the constraint is applied. The condition is stated as an XPath expression, which is evaluated with respect to each node selected by the selector. If the evaluation of the condition is a non-empty list of nodes, then the constraint is applied. The event defines under which conditions of the system the constraint is checked for application. Such events can be: the selection of a menu item, the pressing of a key shortcut, an editing command such as entering a child or a parent, and the like. The action defines the way the constraint is actually applied. At the moment the following kinds of constraints are implemented in the system:
FSA constraints
In this kind of constraint the action is defined as a regular expression which is evaluated over the content of each element selected by the selector. If the word formed by the content of the element can be recognized as belonging to the language of the regular expression, then the constraint is evaluated as true. Otherwise it is evaluated as false and an appropriate message is given. Because the content of the elements can contain text and tags, one problem here is how to determine the word which corresponds to the content of the element. For example, if all the wordforms in a sentence are surrounded by a <w> tag, then the content of a sentence element will be a list of <w> tags, which is obviously not acceptable. In order to overcome this problem we allow the "letters" used in the definition of the regular expressions to be of three kinds: tag values, token types, and token values. A tag value is a string which is the result of the evaluation of an XPath expression over an element. The appropriate XPath expression for each tag is attached to the DTD. When the content of an element is converted into a word to be checked by the FSA constraint, for each non-textual element of the content the corresponding XPath expression is evaluated and the first node in the returned list is considered as a string. If this first node is a text node, then its first token is taken as the value. If the node is an attribute, then the first token of the attribute value is taken. Otherwise the tag of the node is taken as the value. Token type and token value "letters" correspond to the tokenized textual content of the element. When the constraint is applied to an element, the element's immediate children are first processed, tokenizing any textual data and evaluating all tag values, and the result is then sent to the FSA representing the regular expression. A token's string value has a higher priority than the token's category. Since the evaluation of the FSA is linear, the FSA may reject some sequences that are valid. For example, consider the category LAT to be the category of all Latin words. Then, given the regular expression (LAT,<a>)|("to",<b>) and evaluating it on the element <el>to<a/></el>, the FSA will answer that the content of el is not valid.
Number Constraints
This kind of constraint is defined in terms of an XPath expression, which is evaluated to a list of nodes, and MIN and MAX values, where MIN and MAX are natural numbers. The constraint is satisfied (evaluated as true) if the length of the list returned by the XPath expression is between MIN and MAX.
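To make the (Selector, Condition, Event, Action) scheme concrete, here is a small Python sketch of a number constraint; the event component, i.e. when the check is triggered, is not modelled, the code uses lxml rather than CLaRK's own engine, and the element names, expressions and bounds are invented.

from lxml import etree

def number_constraint(doc, selector, condition, counted, min_n, max_n):
    """For each selected node whose condition yields a non-empty node list,
    check that the number of nodes returned by `counted` is within bounds."""
    violations = []
    for node in doc.xpath(selector):
        if not node.xpath(condition):   # constraint not applicable to this node
            continue
        n = len(node.xpath(counted))    # the action: count and compare
        if not (min_n <= n <= max_n):
            violations.append((node.getroottree().getpath(node), n))
    return violations

doc = etree.fromstring("<text><s><w>The</w><w>man</w></s><s/></text>")

# "Every non-empty sentence must contain exactly one <w> child."
print(number_constraint(doc, selector="//s", condition="node()",
                        counted="w", min_n=1, max_n=1))
# [('/text/s[1]', 2)] -- the first sentence has two <w> children;
# the empty <s/> is skipped because its condition returns no nodes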
Number constraints This kind of constraint is defined in terms of an XPath expression, which is evaluated to a list of nodes, and MIN and MAX values, where MIN and MAX are natural numbers. The constraint is satisfied (evaluated as true) if the length of the list returned by the XPath expression is between MIN and MAX. Value constraints These constraints determine the possible children or the possible parent of an element in a document. They apply when the user enters a new child or a new parent of an element. In both cases a list of possible children or parents is determined by the DTD, but depending on the context in the document an additional reduction of these lists is possible. In case the only possible child of an element is text, these constraints determine the possible text values for the element. Let us take as an example the following definitions in a DTD: <!ELEMENT np ((np, pp) | ...) > <!ELEMENT vp ((vp, pp) | ...) > which in part define that a PP can be attached to an NP or a VP. Then let us take the partially marked-up sentence: <s> <np>The man</np><v>saw</v><np>the boy</np><pp>in the garden</pp> </s> For the PP "in the garden" there are still two possibilities for a parent - an NP or a VP. But if the user enters the new information that "saw the boy" is a VP, then for the PP "in the garden" there is only one possible parent - a VP. This information can be encoded in the system as a value constraint on the parent of PP elements. In future versions of the system we envisage such constraints being compiled from grammars represented in a grammar development environment. 3.6 Cascade Finite State Automata Grammars With finite state automata facilities implemented in the system, it is relatively simple to use them for regular grammar development. The CLaRK system incorporates mechanisms for writing cascaded finite state grammars as defined in Abney (1996). The evaluation of the regular expressions follows the longest-match strategy. Again, the regular expressions can be defined over tag values, token types and token values. The new category for each recognized word can be represented in the document by any kind of mark-up, but usually this is done by a surrounding tag with appropriate attributes. 3.7 Database A relational database over the content of the documents imported into the system is maintained in order to support the evaluation of a limited class of XPath expressions over a set of documents. The database stores information about the tags, the attributes and the tokens of the documents; the documents themselves are stored separately as files in an internal format. The evaluation of a query with respect to the database returns a list of the documents satisfying the query, which are then processed further if necessary. 3.8 Other facilities Here we describe two more functions of the system which are useful in the process of corpus development. The first is concerned with sorting the elements of a document according to keys defined over these elements. The sorting is defined in terms of two XPath expressions. The first expression determines which elements will be sorted; it is evaluated with respect to the root of the document as the context node. The second XPath expression defines the key for each element and is evaluated for each node returned by the first expression. The list of nodes returned by the first expression is sorted according to the keys of the nodes, and the nodes are then returned to the document in the new order.
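The sorting facility can be pictured with a short Python sketch using lxml; the toy document and the two XPath expressions are made up for illustration, and the simple re-append strategy for putting the nodes back is a simplification rather than a description of what CLaRK actually does.

```python
from lxml import etree

doc = etree.fromstring(
    "<lex><entry><w>zebra</w></entry>"
    "<entry><w>apple</w></entry>"
    "<entry><w>mango</w></entry></lex>")

elements_xpath = "/lex/entry"      # first expression: which elements to sort
key_xpath      = "string(./w)"     # second expression: the key for each element

def sort_elements(root, elements_expr, key_expr):
    """Sort the selected elements by their XPath-defined key and
    re-attach them to their parent in the new order."""
    nodes = root.xpath(elements_expr)
    nodes.sort(key=lambda n: n.xpath(key_expr))
    parent = nodes[0].getparent()
    for node in nodes:
        parent.append(node)        # appending an existing child moves it to the end

sort_elements(doc, elements_xpath, key_xpath)
print(etree.tostring(doc).decode())
# <lex><entry><w>apple</w></entry><entry><w>mango</w></entry><entry><w>zebra</w></entry></lex>
```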
A concordance tool is implemented on the basis of the XPath engine and the sorting module. The first step in constructing a concordance is to extract the relevant information from the current document. This is done by an XPath expression which is evaluated, and the returned list of nodes is stored as a separate document. The extracted elements are then ordered in an appropriate way with the help of the sorting module. For example, given an appropriately marked-up corpus, one can extract all verbs and order them with respect to the first noun to the right of the verb (not necessarily the first word to the right). The system also supports the definition of different keyboards for different languages. At the moment we support the standard American keyboard and a Bulgarian keyboard for entering Cyrillic letters. 4 Future developments The CLaRK system will be used very intensively within the BulTreeBank project, which has just started at the Linguistic Modelling Laboratory - see Simov, Popova, Osenova (2001). We plan to extend the system in the following directions: External programs. A mechanism for calling external programs which receive fragments of an XML document as input and also return fragments of an XML document. We envisage the actual communication with the external programs being implemented via transformations of fragments of documents to and from special interface XML documents. In this way an external program will be declared within the system only once, and the user will be able to use the program with XML documents of different structure. Schemes of dependencies between elements in several documents. This is in connection with databases. We can consider each XML DTD as a conceptual scheme over XML documents; we can then use a set of DTDs to describe interconnected schemes. We plan to implement support for such schemes. Because the task can prove very hard, we will start with one basic DTD and auxiliary DTDs defining the interconnections in table format. We also plan to extend the set of events and actions available to the user for defining constraints; the target here is a macro language for the definition of actions. In addition, we plan to make the constraints more active, so that the result of evaluating one constraint can be used as the activating event of another. In this way we will have a mechanism for propagating information from one constraint to others. We also plan to add a statistical facility for evaluating quantitative characteristics of the documents; the result will be a table of the relative frequency of some mark-up with respect to some other mark-up in the document. Finally, we plan to add other views over documents that are not naturally represented in the textual or tree view of an XML document, such as a graph view reflecting the ID references inside a document, or other interpretations of the content of a document. References Abney St 1996 Partial Parsing via Finite-State Cascades. In: Proceedings of the ESSLLI'96 Robust Parsing Workshop, Prague, Czech Republic. Corpus Encoding Standard 2001 XCES: Corpus Encoding Standard for XML. Vassar College, New York, USA. http://www.cs.vassar.edu/XCES/ DOM 1998 Document Object Model (DOM) Level 1. Specification Version 1.0. W3C Recommendation. http://www.w3.org/TR/1998/REC-DOM-Level-1-19981001 Simov K, Popova G, Osenova P 2001 HPSG-based syntactic treebank of Bulgarian (BulTreeBank). In: Proceedings of Corpus Linguistics 2001, Lancaster, UK. Text Encoding Initiative 1997 Guidelines for Electronic Text Encoding and Interchange. Sperberg-McQueen C M, Burnard L (eds). XML 2000 Extensible Markup Language (XML) 1.0 (Second Edition). W3C Recommendation.
http://www.w3.org/TR/REC-xml XPath 1999 XML Path Lamguage (XPath) version 1.0. W3C Recommendation. http://www.w3.org/TR/xpath XSLT 1999 XSL Transformations (XSLT) version 1.0. W3C Recommendation. http://www.w3.org/TR/xslt 562 Determining query types for information access Simon Smith and Martin Russell School of Electronic and Electrical Engineering, University of Birmingham smithsgj@eee.bham.ac.uk Abstract A body of research exists on the statistical characterization of speech acts, drawing on the features of the utterance at various linguistic strata. If queries to a multi-functional information access system could be similarly analysed, it might be possible to determine, without user intervention, the module to which control should be passed for a given task. Query determination is taken to be a special case of the general multi-class problem, also exemplified by topic spotting. Therefore, after introducing some previous work on the problem and providing motivation for the present research, the paper describes a preliminary topic spotting experiment, along with the results. Future directions for the query typing work are then set out. 1. Introduction Most of those involved in information research are actively engaged in designing Information Access tools. Information retrieval, information extraction, text mining, data mining, spoken document retrieval, named entity extraction: all are concerned, in the end, with the mapping of some user query on to some set of relevant data, and passing that data back to the user. What we wish to undertake is probably less ambitious. We assume that there exists a comprehensive information access system: discrete components of the system can retrieve documents, extract information from documents, retrieve data from fixed format files (for example the on-board address book or calendar manager), and perform certain user requested actions, such as connecting a phone call. The system would probably be mounted on a PDA-type device, and incorporate a voice input facility. We further assume that the system, for all its capabilities, is entirely piecemeal: what is lacking is a central control or shell mechanism which can determine, from the form of the user query, which of the four components is right for the task at hand. What we shall be attempting to do, therefore, is to examine the linguistic evidence present in a query, and on the basis of that evidence determine the query type. This task is similar in many respects to the assignment of speech act type, utterance type, illocutionary force, or dialogue act type to an utterance in Discourse Analysis; an important strand of the work, therefore, will be to investigate the applicability of traditional linguistics insights to a very modern problem. Research on speech act classification, as well as the strategy proposed for query typing, is closely paralleled by work on language identification (LID). If it is assumed that word boundaries are known, as is the case for text LID, “the algorithm used can be derived from basic statistical principles [and] is not based on hand-coded linguistic knowledge, but can learn from training data”, according to Dunning (1994). Interestingly, Dunning also notes that the same algorithm has been implemented for species identification in the domain of biochemistry; moreover, his keyword-counting approach bears comparison to Garner's (1997) dialogue act classification. 
Such techniques can be generalized to a means of solving other multi-class problems: Gorin (1995) worked on a method of determining the topic of a telephone call, so that it could automatically be routed correctly. This is an appropriate application for LID, too: for example, in a multilingual culture, it would be helpful if calls could be directed to an operator fluent in the language of the caller (Lazzari, Frederking and Minker 1999). Now, imagine a machine translation system which consists of components for various language pairs. If the source language were not specified by the user, it might be determined by a keyword-based language spotter, so that control could be passed to the requisite MT component. In both these cases, then, we have something akin to a topic spotter acting as a command or shell function, determining the correct module for the task itself. The same idea can be extended to information access: a personal digital assistant might have a limited number of command and query strategies at its disposal, each employing a distinct technology. The choice of strategy would, in principle, be governed by the user's expectation of the results of the query or command, so one approach would be to require him or her to select a query type before entering the query itself. Far better, though, and more in accordance with one's expectations of a personalisable device, if the right module could be determined from features of the query itself. 563 As well as addressing the practical matter of an approach to a real world problem, the work is also an exercise in statistical language analysis. We plan to consolidate the extension of topic-spotting to multi-class problems in general, and to the determination of query types in particular; then build from training data a set of convenient query features at various significant linguistic strata. 2. A query typology What, then, are the possible query types, and what applications may they be associated with? Possible types might include command action, open exploration, factual enquiry, how-to-do enquiry, buying enquiry. Many of these, it turns out, can actually be decomposed into what might be termed query primitives: a buying enquiry, for example, could be taken to consist of open exploration, factual enquiry and command action phases. Open exploration corresponds to what is technically known as information retrieval. As with a web search, it is not answers to questions that are returned, but details of (or hypertext links to) documents in which the user might find the answers. IR retrieves pointers to information, not the information itself. Factual enquiry seeks dates, names or other data in direct response to the query. Two subtypes can be distinguished: the first exploits the relatively new technologies question answering and information extraction, where the contents of texts and other documents are analysed and data returned. Far more straightforward is data retrieval, or database enquiry: this would apply to local files of known formats, such as the address book or calendar manager found on all PDA and similar systems. Command action needs no elaboration, beyond stating that it is not strictly a “query” type, and the task application is one of synthesis rather than analysis. Example queries of each type are now presented. (1) IR we need a map of/information on Manchester Airport. (2) IE What terminal does Singapore Airlines use at Manchester Airport? (3) DR What's the name of the guy I'm meeting at the airport on Tuesday? 
(4) CA Call him, please. It is immediately clear that there is little in the way of an intuitive connection between the query type and the vocabulary used. It is true, perhaps, that one might associate information with the IR type, wh- question words with factual enquiries, and call or please with command actions; but it is easy to think of counter-examples, and most real queries will not contain any of these keywords. In any case, the examples are not real queries but have been invented by us. The goal, however, is not to handcraft rules of association, but to observe whether, over a reasonably large training set, any patterns do emerge; whether, and to what extent the query type can be predicted on the basis of query keywords. 3. Lexical speech act determination Samuel et al (1998) used a transformation-based learning (TBL) heuristic algorithm to assign speech act type to texts. They derived a set of ordered rules and applied them in turn, so that each utterance was first labelled with a default speech act tag. The application of subsequent rules led to utterances containing known keywords (dialogue act cues) being re-tagged. In their data, over 71% of utterances were correctly tagged by this means. Samuel et al noted limitations in their TBL model, namely that the rule-templates had to be hand-crafted, and that the algorithm could suggest speech act tags but had no means of asserting its confidence in those tags; they identify a possible solution for the first problem and a workaround for the second. Lager and Zinovjeva (1999) applied the TBL algorithm to the Edinburgh Map Task, a corpus which incorporates dialogue act mark-up, and attained an accuracy of 62.1%. The corpus consists of a series of conversations about the contents and features of a map, and each utterance is marked up for dialogue move: Go to the left is an INSTRUCT move, What's behind the church? is classified as QUERY-W, Right and OK as ACKNOWLEDGE, and so on. In all, there are 12 different dialogue moves. Garner (1997) undertook the same task, based on the same corpus, using a maximum likelihood model. Garner reports a classification success rate of 47%, which does not sound particularly promising until it is recognized that his model treats both moves and keywords as independent, that is to say no account is taken of contextual information. Furthermore, the approach is based purely on lexical frequency, and no appeal is made to other linguistic strata. The annotation of the Map Task corpus is reported on by Carletta et al (1998). 564 There is a clear parallel in Garner's experiments with the query typing work here proposed. Furthermore, since we propose only four distinct query types, it is reasonable to expect a higher classification accuracy. 4. Experiment – a topic spotter As an initial experiment in multi-class problem solving, we built a baseline topic-spotting system, implementing algorithms of (and effectively replicating the work of) Garner (1997) and Wright et al (1995). The approach is rather different from that of standard IR systems or web search engines: it attempts to find the single best solution to a multi-class problem from a very constrained set of candidates, or alternatively to rank each member of that set. IR, on the other hand, offers in general a ranked list of possible solutions from a virtually unlimited set of possibilities. In principle, however, the topic-spotter could be extended to handle a very much larger topic set. 
Our aim was to build a system which would correctly identify the topic of an unseen news text by examining the frequencies of significant keywords used. To do this, we first decided on ten appropriate topics: domestic politics, foreign politics, health, finance, crime, sport, media, technology, environment and education. Two sets of training data were assigned to each category, one set consisting of eight news stories, the other — a subset of the first — of three stories. All the texts were taken from BBC Online News, where they are arranged in categories not dissimilar to ours; the fact that multiple category membership of texts is commonplace in the BBC scheme is what led us to develop our own. Often, though, the topic of a text seemed to motivate its assignment to more than one class (a story about Elizabeth Taylor and her AIDS-related work belonged intuitively to both media and health categories, for example). In such cases, we simply excluded the text from the training data. Once training data had been prepared, the probable topic of an unseen text was computed by maximizing probability (5) (Garner 1997) over all keywords and all topics:

(5) P(x | m_i, D) = P(w_1 | m_i, D) P(w_2 | m_i, D) ... P(w_K | m_i, D)

Here, P(x | m_i, D) represents the probability that an observation x, instantiated as the unseen text, will occur in the training data D associated with a topic m_i. Whenever a word encountered in the unseen text is found in the training data for a particular topic, its probability of occurrence P(w_j | m_i, D) is computed by dividing the number of tokens of that word in D by the total number of words in D; if D does not attest the word in question, the probability defaults to an arbitrarily small value. As (5) shows, these probabilities are then multiplied together to yield P(x | m_i, D). Normally, P(x | m_i, D) would in turn be multiplied by the topic prior probability before maximization, but that step is skipped in this implementation, as it is assumed that all topics are equally likely. When the small (three stories per topic) training set was used, 30 out of 50 stories were assigned to the right topic, while 42 stories were correctly matched by means of the eight-story training set.
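The maximum-likelihood decision in (5) is easy to express in code. The following Python sketch is our own illustration, not the authors' program: it trains per-topic unigram probabilities from word counts, backs off to an arbitrarily small floor for unseen words, and skips the prior because all topics are taken to be equally likely. The tiny training texts are invented.

```python
import math
from collections import Counter

# Invented toy training data: topic -> concatenated training text.
TRAIN = {
    "finance": "bank shares merger bank yen profits shares",
    "sport":   "boxing olympic medal spurs match boxing goal",
    "health":  "cancer breast screening tamoxifen trial cancer",
}
FLOOR = 1e-6   # arbitrarily small probability for unseen words

# Pre-compute P(w | m, D) for every topic.
models = {}
for topic, text in TRAIN.items():
    counts = Counter(text.split())
    total = sum(counts.values())
    models[topic] = {w: c / total for w, c in counts.items()}

def classify(text):
    """Return the topic maximizing (5); log-probabilities avoid underflow."""
    best_topic, best_score = None, -math.inf
    for topic, probs in models.items():
        score = sum(math.log(probs.get(w, FLOOR)) for w in text.split())
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic

print(classify("shares fell after the bank announced a merger"))  # finance
```

Words unseen in every topic contribute the same floored probability to all scores, so only the attested keywords actually discriminate between topics.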
4.1 Data pruning One of the basic principles on which this work is founded is the use of information-theoretic measures, such as usefulness (Garner 1997) or salience (Gorin 1995), to identify structure which is, in some sense, optimal for discriminating between the different classes in question. These techniques are most commonly applied at the word level, either to text, verbatim transcriptions of speech, or the output of a speech recognition system. However, in principle they are equally applicable to any symbolic representation of data. For example, Gorin et al (1999) have applied similar techniques to the output of an automatic phone recognizer. Of course, first-order statistics describing the occurrence of individual phones are unlikely to provide strong evidence for the classification of an utterance. However, by extending these techniques to phone sequences, it may be possible to detect, automatically, phone sequences which describe discriminative lexical or even syntactic or semantic structure. Potentially useful sequences will be characterized by their frequent occurrence in different contexts and can be detected automatically using methods such as the Context-Adaptive Phone (CAP) analysis described in Moore et al (1994). Variations in the instantiation of the same underlying sequence, due to deletion, insertion or substitution of symbols, can be accommodated using various schemes based on dynamic programming (Sankoff and Kruskal 1983), and this could be formalized, for example, by representing each sequence as a statistical model, such as a hidden Markov model (Jelinek 1999). Similar techniques could be applied to sequences of primitive acoustic symbols, produced by a data-driven cluster analysis of the output of a speech signal processing system, or, as in the case of the research proposed here, to more symbolic representations such as those used to describe prosodic information, such as that described by Shriberg et al (1998). Whilst the experiments conducted so far have been confined to lexical analysis, the ultimate goal is to process input from an unsegmented speech stream. Thus, it is intended to build models based on subword units, probably phones.

4.1.1 Usefulness thresholding Our program then applied (6) (Wright et al 1995) to determine a usefulness score for each vocabulary item in the training data.

(6) U_k = P(w_k | T) log [ P(w_k | T) / P(w_k | ¬T) ]

A word w is thereby said to be useful when it is frequent in training texts of topic T, and occurs relatively rarely in other texts; the usefulness score describes the discriminatory contribution of keywords to the topic of the text. Thresholding or pruning is then carried out so that only the n most useful keywords are searched for when determining the topic of an unseen text. Table 1 shows, for each news topic, what were computed by the usefulness algorithm to be the top ten keywords. The lists probably correspond to most people's intuitions of words that would epitomize each topic.

Table 1
finance: banks, yen, merger, bank, jp, shares, sega, debenhams, chase, banking
computers: internet, web, websites, engines, lottery, computer, information, sites, neurons, identity
crime: plane, victim, suharto, bomb, victims, cheng, police, trial, musharraf, li
domestic politics: party, tories, kennedy, labour, mp, ira, donaldson, ulster, trimble, romsey
education: students, schools, college, oxford, university, pupils, curriculum, comprehensive, state, i
environment: environment, masts, gm, fuel, crops, we, oil, pioneer, environmental, prince
foreign politics: eritrea, lazio, ethiopia, eritrean, speight, monitors, kosovo, fiji, ethiopian, electoral
health: cancer, tamoxifen, breast, parodi, fruit, everson, boots, krishnamurthy, removed, pacemaker
media: film, olsen, laurel, travolta, films, magazine, hardy, hurley, comedy, book
sport: boxing, sydney, olympic, ham, spurs, garcia, talent, robson, mcgrath, edwards

However, when topics were assigned to unseen (test) news stories, greater accuracy was on the whole achieved when all training data was considered than when thresholding was applied. Table 2 shows how many of 50 stories were correctly identified when usefulness thresholding was applied at various levels n: that is to say, when only the n most useful words in the training corpus for each topic were taken into account.

Table 2
threshold      usefulness   modified usefulness   salience
10             10           30                    10
20             13           31                    10
30             12           35                    15
40             12           35                    19
50             13           37                    18
60             12           39                    24
70             11           38                    28
80             12           35                    30
90             13           34                    30
100            13           34                    29
150            12           35                    31
200            16           36                    34
249            16           37                    35
299            16           37                    36
349            16           38                    36
399            16           42                    42
449            16           39                    42
499            19           41                    43
549            22           41                    42
599            23           41                    40
no threshold   42           42                    42

Table 2 also shows the results of thresholding under a reformulation of the usefulness calculation which we term modified usefulness, also suggested by Garner (1997).
There is a marked improvement here, although the best performance is still achieved when no thresholding is carried out. The reformulation, shown at (7), differs from (6) only in that the absolute probability of occurrence of a word is ignored.

(7) U_k = log [ P(w_k | T) / P(w_k | ¬T) ]

The salience results will be discussed presently. Part of the discrepancy between the results from the two formulations of usefulness arises because the original form tends to rank function words unduly highly, which leads the application to treat these items as keywords. Table 3 shows the ranking of the determiner the for each training corpus, under both formulations of usefulness.

Table 3
corpus        modified   original
computer      931        1032
crime         1026       1132
domestic      896        32
education     946        146
environment   1101       172
finance       912        1012
foreign       928        29
health        1038       518
media         951        1046
sport         1003       957

4.1.2 Salience thresholding Gorin (1995) postulates salience as "an information-theoretic measure of how meaningful a word is for a particular device [i.e. task]". Gorin's work focused on unconstrained speech-driven routing of telephone calls to one of a number of human operators, each dealing with one task, such as reverse-charge calls, credit card calls and directory enquiries. The salience of a word w for a class T is computed by (8).

(8) sal(w, T) = P(T | w) log [ P(T | w) / P(T) ]

P(T) is ignored for our purposes, as all topics are equally likely. We do not know the probability of the class given the word, but it can be derived from Bayes' Law, as shown at (9).

(9) P(T | w) = P(w | T) P(T) / P(w)

P(T) is again ignored; P(w) is the number of tokens of a particular word in the whole of the training data (in principle divided by the total number of tokens of any word, but this is a constant); P(w|T) is the number of tokens of the given word divided by the total word count for the topic, as per the usefulness calculation. It will be seen that whereas usefulness compares the likelihood of a token in on-topic and off-topic texts, salience relates its incidence in on-topic texts to that in all texts, whether on- or off-topic. As Table 2 shows, at low pruning thresholds the salience measure performs as badly as the original usefulness formulation, but its performance accelerates towards that of the modified usefulness. We are not unduly discouraged by the poor performance of the data pruning algorithms. They were implemented by Gorin and Garner with short utterances, rather than news stories, in mind; we will in due course be adapting these models to handle queries, which are by their nature short. What we have attempted at this stage of the research is a baseline maximum-likelihood classifying program, and we are satisfied that it is working correctly.
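For concreteness, here is a small Python sketch, ours rather than the authors' code, that computes the three pruning scores from per-topic word counts: usefulness (6), modified usefulness (7) and salience via (8) and (9) with a uniform topic prior, keeping the n highest-scoring words per topic. The counts and the smoothing floor are invented for the example.

```python
import math
from collections import Counter

# Invented per-topic word counts standing in for the training data.
COUNTS = {
    "finance": Counter({"bank": 30, "shares": 20, "the": 100, "match": 1}),
    "sport":   Counter({"match": 25, "goal": 18, "the": 110, "bank": 1}),
}
EPS = 1e-9                 # small floor to avoid log(0) and division by zero
P_T = 1.0 / len(COUNTS)    # uniform topic prior, as assumed in the paper

def scores(topic, word):
    on = COUNTS[topic]
    off = Counter()
    for t, c in COUNTS.items():
        if t != topic:
            off.update(c)
    p_on  = on[word]  / sum(on.values())                                      # P(w | T)
    p_off = off[word] / sum(off.values())                                     # P(w | not-T)
    p_w   = (on[word] + off[word]) / (sum(on.values()) + sum(off.values()))   # P(w)
    usefulness = p_on * math.log((p_on + EPS) / (p_off + EPS))    # (6)
    modified   = math.log((p_on + EPS) / (p_off + EPS))           # (7)
    p_t_given_w = p_on * P_T / (p_w + EPS)                        # (9)
    salience = p_t_given_w * math.log((p_t_given_w + EPS) / P_T)  # (8)
    return usefulness, modified, salience

def top_n(topic, n, measure=0):
    """Prune the vocabulary: keep the n words with the highest chosen score."""
    return sorted(COUNTS[topic], key=lambda w: scores(topic, w)[measure], reverse=True)[:n]

print(top_n("finance", 2))              # by usefulness: ['shares', 'bank']
print(top_n("finance", 2, measure=2))   # by salience
```

Note how the frequent function word "the" is driven down by all three measures, which is exactly the behaviour the pruning is meant to provide.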
5. Speech act determination from non-lexical features We plan, next, to begin building other query features into the model. Of the knowledge sources described above, we expect discourse context, the likelihood of the next query type given the current selection, to prove the least difficult to integrate. This is because it involves the recycling of information that the program was designed to provide - the value assigned to the first utterance is simply passed as a parameter in the computation of the next, and, unlike the prosodic input, does not have to be determined externally. 5.1 Evidence from discourse context Nagata and Morimoto (1993) trained a trigram model of utterances classified by speech act type on a corpus of conference-booking dialogues, and attempted to predict, using mutual information, the following utterance type given the current one. Their approach is entirely probabilistic, and ignores any linguistic or lexical considerations: they report a classification accuracy of 61% amongst the nine speech act types catered for. One would expect, certainly, patterns of speech act type to emerge at the discourse level: given a question, it is reasonable (if facile) to suppose that the next utterance will be an answer. An example of a sequence of query types was noted above with reference to the "buying enquiry", and (1) to (4) were intended to convey some sort of logical progression in an imaginary user's query trail. Whatever "answer" the query may generate is not part of the process here, of course, but there is no reason to suppose that monologue cannot be just as effectively modelled, statistically, as dialogue. A couple of drawbacks would apply to this approach to query typing. First, in an isolated query, there is no "next" or "previous" utterance; and if there is, how does one decide whether contemporaneous queries are in fact related? And how recent does the previous query have to be for it to count? Jurafsky et al (1997) labelled a portion of the Switchboard telephone corpus with 42 different utterance types. They combined discourse grammar (estimation of type based on adjacent utterances, as with Nagata and Morimoto) with keyword-based classification, and claimed an accuracy of 64.6% (compared to 42.8% for their own keyword-only classification) in assigning utterance types to a test set. 6. Evidence from prosodic features Jurafsky et al took into account information from another linguistic stratum in their classification: namely, prosodic features of the speech stream. They set up several dozen utterance-wide prosodic feature types, including f0_max_utt (the maximum fundamental frequency reached), rel_nrg_diff (ratio of RMS energy of final and penultimate phrasing region), and mean_enr_utt (mean speaking rate value); then they trained CART-style decision trees (Breiman et al 1984) whose task was to distinguish between two utterance types. The path through the tree, and the eventual classification, was achieved through comparison of the feature values to constants stipulated at decision points. In the proposed work, we would initially adopt Jurafsky et al's prosodic features. The CART tree-building technique employs binary recursive partitioning. Given training data (speech segments, the dialogue act types they represent and the prosodic feature values associated with them), configurations of parent nodes with two children are hypothesized. Each bifurcation represents a decision point on a particular feature; the algorithm examines all possible values for that feature, and attempts to find a splitting rule that maximizes the discriminatory power of the feature. Thus, while the choice of features is made by the experimenters, the trees are derived by data-intensive means.
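To illustrate the binary recursive partitioning just described, the following Python sketch fits a small CART-style tree with scikit-learn over the three prosodic features named in the text. The feature values and the two utterance-type labels are entirely synthetic, so this shows only the shape of the approach, not a reproduction of Jurafsky et al's experiment.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic training data: rows of [f0_max_utt, rel_nrg_diff, mean_enr_utt].
FEATURES = ["f0_max_utt", "rel_nrg_diff", "mean_enr_utt"]
X = [
    [250.0, 1.4, 4.1],   # rising pitch, higher final energy
    [240.0, 1.3, 3.9],
    [180.0, 0.7, 4.5],   # falling pitch, lower final energy
    [175.0, 0.8, 4.2],
]
y = ["yes_no_question", "yes_no_question", "statement", "statement"]

# Binary recursive partitioning: at each node the learner picks the feature
# and threshold that best separate the two utterance types.
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)

print(export_text(tree, feature_names=FEATURES))
print(tree.predict([[255.0, 1.5, 4.0]]))   # -> ['yes_no_question']
```

As in the description above, the features are chosen by the experimenter, while the split points and their ordering are induced from the data.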
Although the incorporation of prosodic information secured only a marginal improvement in performance, it is fairly well motivated, as there is a significant literature in the Discourse Analysis domain of theoretical linguistics on the relationship between intonation and speech act type (for example Crystal (1969), and Brazil (1985) for the Birmingham discourse intonation perspective). The chief success of Jurafsky et al, in recruiting a prosodic contribution, seems to spring from the observation that, in English, a yes/no question tends to end with rising intonation. Queries to information access systems probably only rarely take the form of a yes/no question, so one might justly be sceptical about the application of prosodic techniques here: it is reiterated, however, that the aim of the work we propose is to establish what patterns of usage, if any, do emerge, with respect to the query types described. There is a body of related work exploiting prosodic features in speech recognition, including that of King (1998), who integrates language model, dialogue context and intonational information to improve recognition of spontaneous dialogue speech. Jensen et al (1994) present a scheme for phrase-level recognition of intonation contours, and show how it can help compute the perceived pitch of voiceless utterance segments (where no fundamental frequency measurement is available). Hirschberg et al (1999) found that prosodic features can be used to predict recognition errors: their work was based on dialogues from the TOOT train information corpus. Carey et al (1996), working on speaker identification, established that prosodic features are less sensitive to noise distortion than cepstral coefficients. 7. Conclusions and future work The next logical step is the extension of the topic spotter to determine dialogue moves in the Map Task corpus, following Garner, and this work is now under way. The extraction of annotated utterances to move-specific training corpora, using XML tools, is complete; some preliminary tests with held-out data indicate that usefulness has more of a role to play in dialogue move assignment than with the topic-spotter. Experiments using bigrams, rather than single words, will in due course be conducted. 569 The Map Task is not, however, a collection of queries, and identification of suitable query corpora is a high priority. The ATIS (Airline Travel Information Service) is a possible candidate, as are query logs from IR systems, some of which are publicly available on the web. Once an appropriate corpus has been located, we will almost certainly have to do substantial annotation work. We plan to annotate using XML tags, for two reasons: first, we are currently working with the XML version of the Map Task, so fewer modifications to the dialogue move program will be needed. Secondly, the use of what is likely to become the standard annotation of choice would mean our work was more likely to be of use to other researchers, in the future. It may be found more practical to use more than one corpus, although there could be methodological difficulties here, particularly if each corpus held only one query type (at the very least, the discourse context approach to query typing would thus be ruled out). Crucially, though, the corpora must represent true queries to either an on-line or Wizard of Oz system, rather than likely-sounding utterances coined for the purpose. 
It may prove necessary to filter out very short queries: it might be difficult to distill any features at all from a query which consists of just one keyword, where the discourse context, prosodic and syntactic analyses are apparently unavailable. On the other hand, we might find that query length has a direct bearing on query type. In the longer term, we shall be collecting data on the discourse context and prosody of spoken queries, to supplement the key word/phone information. Ultimately, we hope to employ data fusion techniques to develop a model of query type determination based on all these linguistic sources. Once this has been completed to our satisfaction, we should like to consider exploiting a further source of information present in the query term which we have not discussed here in any detail. Intuitively, it seems plausible that the syntactic structure of a query may reveal information about its nature: a command action may be more commonly associated with the use of the imperative form of the verb, for instance, than other types, and one might expect a factual enquiry to invoke a more complex grammatical structure, perhaps including long distance dependencies. Gorniak (1998) used a syntax based feature extraction algorithm to determine the topic of email messages. It is likely, too, that other topic-spotting techniques may make implicit use of syntactic information, perhaps through n-tuple based methods or statistical grammars. This possibility will be investigated further if it is decided to proceed with the syntactic work. However, we know of no work so far that has attempted to evaluate or use the syntax of short texts or, specifically, queries. Such work would be very interesting and challenging. It is true that parsers and part of speech taggers are available and freely downloadable, and assertions about syntactic structure can be made with more confidence than is usual for prosodic structure, but the difficulty lies in establishing which features are to be taken into account, and what weighting is to be assigned to them. The work we have embarked on is of potential importance to a number of disciplines. Information Access will benefit from the work, although it is not in itself an IA application; the prosody findings may be of general application in Speech Recognition; ultimately, it is envisaged, the work could form part of a production PDA device. It is, we believe, a novel and interesting approach, and can reasonably be expected to make a central contribution to the state of the art. Acknowledgements Thanks are due to Graham Tattersall of Snape Signals Research for stimulating discussion on queries and query typology; to Harvey Lloyd-Thomas of Ensigma Technologies for help with BBC Online News stories; and to Steve Isard of Edinburgh University for supplying the annotated MapTask, and for useful advice. References Breiman L, Friedman R, Olshen R, Stone C 1984 Classification and regression trees. Pacific Grove CA, Wadsworth. Brazil, D 1985 The communicative value of intonation in English. Birmingham, University of Birmingham English Language Research. Carey M, Parris E, Lloyd-Thomas H, Bennett S 1996 Robust prosodic features for speaker identification. In Proceedings of the 4th International Conference on Spoken Language Processing, Philadelphia, pp 1800-1803. Carletta J, Isard A, Isard S, Kowtko J, Doherty-Sneddon G, Anderson A 1998 The reliability of a dialogue structure coding scheme. http://www.cogsci.ed.ac.uk/~jeanc/maptask-codinghtml/ rerevised.html. 
Crystal D 1969 Prosodic systems and intonation in English. Cambridge, Cambridge University Press. 570 Dunning T 1994 Statistical identification of language. Technical report CRL MCCS-94-273, Computing Research Lab, New Mexico State University. http://www.comp.lancs.ac.uk/computing/users/paul/ucrel/papers/lingdet.ps. Garner P 1997 On topic identification and dialogue move recognition. Computer Speech and Language 11(4): 275-306 Gorin A 1995 On automated language acquisition. Journal of the Acoustical Society of America 97: 3441-3461 Gorin A, Petrovska-Delacrétaz D, Riccardi G, Wright J 1999 Learning spoken language without transcriptions. In Proceedings of the Workshop on Automatic Speech Recognition and Understanding, Keystone, Colorado. Gorniak P 1998 Sorting email messages by topic, University of British Columbia Computer Science Dept. http://www.cs.ubc.ca/spider/pgorniak/um/bucfe.html Hirschberg J, Litman D, Swerts M 1999 Prosodic cues to recognition errors. In Proceedings of the Workshop on Automatic Speech Recognition and Understanding, Keystone, Colorado. Jelinek F 1999 Statistical methods for speech recognition. Cambridge MA, MIT Press. Jensen U, Moore R, Dalsgaard P, B Lindberg 1994 Modelling intonation contours at the phrase level using continuous density hidden Markov models. Computer Speech and Language 8(3): 247-260. Jurafsky D, Bates R, Coccaro N, Martin R, Meteer M, Ries K, Shriberg E, Stolcke A, Taylor P, Van Ess-Dykema C 1997 Automatic detection of discourse structure for speech recognition and understanding. In Proceedings of the Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, pp 88-95. King S 1998 Using information above the word level for automatic speech recognition. Unpublished PhD thesis, Edinburgh University. Lager T, Zinovjeva N 1999 Training a dialogue act tagger with the µ-TBL system. In Proceedings of the Third Swedish Symposium on Multimodal Communication, Linköping. Lazzari G, Frederking R, Minker W 1999 Speaker-language identification and speech translation, Multilingual information management: current levels and future abilities, Pittsburgh, Carnegie Mellon University Computer Science Dept. Moore R, Russell M, Nowell P, Downey S, Browning S 1994 A comparison of phoneme decision tree (PDT) and context adaptive phone (CAP) based approaches to vocabulary-independent speech recognition. In Proceedings of the International Conference On Acoustics, Speech, And Signal Processing, Adelaide, I: 541-545. Nagata M, Morimoto T 1993 An experimental statistical dialogue model to predict the speech act type of the next utterance. In Proceedings of the International Symposium on Spoken Dialogue, Tokyo, pp 83-86. Samuel K, Carberry S, Vijay-Shanker K 1998 Dialogue act tagging with transformation-based learning. Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics, Montreal, pp 1150-1156. Sankoff D, Kruskal J 1983 Time warps, string edits, and macromolecules: the theory and practice of sequence comparison. London, Addison-Wesley Shriberg E, Bates R, Taylor P, Stolcke A, Ries K, Jurafsky D, Coccaro N, Martin R, Meteer M, Van Ess-Dykema C 1998 Can prosody aid the automatic classification of dialog acts in conversational speech? Language and Speech: 41, 443-492 Wright J, Carey M, Parris E 1995 Improved topic spotting through statistical modelling of keyword dependencies. 
In Proceedings of the International Conference On Acoustics, Speech, And Signal Processing, Detroit, pp 313-317. 109 Building a corpus of annotated dialogues: the ADAM experience Roldano Cattoni, ITC-irst, Trento, Italy, <cattoni@itc.it> Morena Danieli, Loquendo, Torino, Italy, <Morena.Danieli@loquendo.it> Andrea Panizza, Universita del Piemonte Orientale “A. Avogadro”,Vercelli, Italy, <Andrea.Panizza@CSELT.IT> Vanessa Sandrini, ITC-irst, Trento, Italy, <sandrini@itc.it> Claudia Soria, ILC-CNR, Pisa, Italy, <soria@ilc.pi.cnr.it> 1. Introduction ADAM1 is a corpus of annotated spoken dialogues currently being developed as part of the Italian national project SI-TAL2. Each dialogue is annotated at five levels of linguistic information: prosody, morphosyntax, syntax, semantics and pragmatics. The five levels were chosen for both practical (their interest for real applications) and scientific reasons (the possibility to investigate inter-level phenomena). For each level a corresponding annotation scheme has been defined that provides annotation instructions, examples and criteria. The result of each annotation is an XML file that encodes the content of a dialogue with respect to a particular level according to the annotation scheme of that level. The aim of this paper is therefore to present the ADAM corpus and the experience gained in defining and building such multi-level corpus. Section 2 describes the ADAM spoken corpus that includes both human-human and human-machine dialogues in the semantic domain of tourism and railways transportation. Section 3 provides a detailed introduction to the transcription format and to the five annotation schemes, one for each level of linguistic information. Section 4 focuses on the architectural issues of the ADAM corpus: essential requirements that drove the design process – like corpus reusability – are presented and discussed. 2. Corpus description The ADAM spoken corpus is a collection of 450 vocal dialogues: they are both human-human (200 dialogues) and human-machine (250 dialogues). All the dialogues are recordings and transcriptions of telephone conversations in the semantic domain of tourism and railway transportation. The format of the audio files is the standard format for telephone signal data recommended by the SPEECHDAT project directions. The human-human dialogues are simulated telephone conversations between two experimental subjects, playing the roles of a travel agent and of a caller, respectively. They had to perform predefined scenarios, including the request for train or flight timetables, the request for hotel arrangements, the reservation of both transportation and hotel, and the communication of credit card information. Each of the two acoustic signals (agent and caller) has been captured by two microphones (one directional and one “close-talk”), recorded on a digital tape as signed linear PCM 16bit at 16kHz. The human-human dialogues are longer than the human-machine ones, since the amount of information exchanged is greater. The total amount of recorded speech is more than 7 hours, for a total number of 58377 words for the human-human dialogues, while the length of the human-machine dialogues amounts to around 1250 utterances. The human-machine dialogues were collected on the field: they are interactions between the automatic telephone information service of the Italian railway company and callers, recorded during an experimental phase of that service3. The callers called the system during night hours (from nine p.m. to eight p.m.). 
The calls came from all Italy: each call is from a different caller and several varieties of standard Italian are represented in the speech data. The system was designed to provide callers with information about the Italian train timetable. The dialogue manager allowed the user to enter several task parameters in a single turn; in case of recognition errors, the dialog manager exploited a set of repair strategies for recovering from the misunderstanding and being able to access the timetable database with the correct task parameters. The transaction success rate of the automatic service was around 85%, so 15% out of the dialogues present miscommunication phenomena. The speech signal was recorded at 8kHz and stored according the PCM-Ulaw 8 bit protocol. 1 The acronym ADAM stands for “Architecture for Dialogue Annotation on Multiple Levels”, see Soria, Cattoni and Danieli (2000). 2 The ADAM Corpus will be released by the end of 2001. A pilot set of 30 dialogues is currently available upon request. 3 The automatic service is based on CSELT speech and dialogue technologies (see Baggia, Castagneri, and Danieli 2000 for some details on service architecture and system performance). 110 3. Transcription and annotation practices Each dialogue in the ADAM corpus is represented by an orthographic transcription (physically an XML file), which in turn is linked to an audio file containing the corresponding recording. In addition, the transcription of each dialogue is associated to five XML annotation files, according to five different levels or layers of linguistic information, namely prosody, morphosyntax, syntax, semantics and pragmatics. The five levels of annotation were mainly chosen in consideration of their interest for practical applications of the annotated material. In spite of the number of levels considered, and their sometimes conflicting requirements, we tried to develop a coherent, unitary approach to design and application of annotation schemes. In particular, in developing the different annotation schemes for the five levels envisaged, attention was paid to be consistent with criteria of robustness, wide coverage and compliance with existing standards and previous efforts in annotation of spoken dialogues. As a general criterion, however, attention has been paid so as to design the various annotation schemes and transcription conventions to be general enough to accommodate as many different annotation practices as possible. 3.1 Transcription The representation of dialogues in terms of orthographic transcription implies making several choices about which aspects and phenomena of dialogues to represent. In accordance with the EAGLES recommendations for the representation of spoken language (see Gibbon 1999), the following information is made explicit: - turns and speakers: according to common practice, a turn is identified with a stretch of talk produced by a single speaker. 
Each turn is numbered and the corresponding speaker is identified with a conventional label - words: the transcription conventions adopted represent each recognisable word in orthographic form; non-recognisable words and word fragments are represented through an approximated orthographic rendering of the perceived sound - pauses are represented by means of specific symbols; the length of pauses is also specified - hesitation signals or fillers are given a specific status and classified as a separate category - non-linguistic phenomena such as coughs, laughs or noises from the surrounding context are classified and signalled via dedicated symbols. For an illustration, see the following example: t_002B: {a} buongiorno mi scusi mi chiamo annamaria degasperi [ breath ] senta io avrei bisogno di prenotare un treno [ ... ] [ breath ] che parte da roma lunedi` magari verso le otto di mattina non piu` tardi [ ... ] [ breath ] {e} [ puff ] per verona pero` per cortesia mi basterebbe un posto solo [ puff ] c’ e` qualcosa [ puff ] As a general criterion, we follow the practice of recording only actually pronounced words, without making assumptions about the nature and type of words in cases of word truncation and disfluency phenomena. The same holds for non-standard forms such as dialectal expressions or mispronounced words. In these cases, the actually pronounced word or word fragment is represented. Optionally, the annotator can further specify whether the form is to be considered a non-standard form or an interrupted form. It is also possible, if needed, to specify the target form possibly intended by the speaker. This range of information can be expressed via a set of dedicated attributes that come as optional extensions of the basic annotation machinery. 3.2 Prosodic annotation In the ADAM corpus, the Prosodic Annotation (PA) is used to represent the prosodic structure of utterances at the suprasegmental level. This structure is represented by means of a subset of the ToBI scheme (Tones and Break Indices, see Silverman et al. 1992). ToBI is a widely used system since its five different levels of break indexes are able to capture a variety of intonation phenomena. The ToBI subset used in the ADAM project concerns the annotation of prosodic phrasing. Five break indexes are used, from 0 to 4. They are applied on the basis of the following criteria: · index 0: used in clitic groups, and every time a word is atonic and leans on the following word (for example: I [0] am; a [0] car); · index 1: used wherever there is no word separation (a word boundary stronger than a clitic but without an intonational cut, for example: I [0] saw [1] him [1] yesterday, where between the words “saw” and “him”, and between “him” and “yesterday”, there is neither separation nor pause); · index 2: used to mark anomalous intermediate boundaries (mostly when a break occurs in an unexpected place, for example I [0] bought [2] a [0] new [1] car, where between the words “bought” and “a” there is an unexpected pause); · index 3: used to mark intermediate intonation boundaries. In the ADAM corpus, the spontaneous nature of the utterances makes it difficult to distinguish those cases where it is better to use index 3 from those where index 4 would be more appropriate.
Therefore, index 4 is used according to the following criteria: · index 4: used to mark complete disjunctions (an obvious case is at the end of turns) and used to mark intermediate disjunctions too, if and only if they are conclusive, for example in the cases of clear logic separations and intonation patterns changing (Ok, I booked you a seat. Do you need more information?...) In this example, the two parts of the utterance are clearly separated and there is a change in the intonation pattern (from assertive to interrogative). In addition to these five break indexes, we can use the following symbols for PA: · symbol [ p ]: combined with the break index symbol (ex. 2p, 3p) and used every time the boundary has a feature of hesitation; · symbol [ - ]: used to signal an uncertainty in attributing levels; · symbol [u ]: used to signal uncertainty combined with hesitation. 3.3 Morpho-syntactic annotation The ADAM proposal for morphosyntactic and syntactic annotation is a two-layer annotation structure, containing respectively information on word category and morphosyntactic features (pos tagging), and non recursive phrasal nuclei (called chunks). The morphosyntactic annotation level encodes the following information: a) identification of morphological words and linking to their corresponding orthographic counterparts; b) annotation of their pos-category; c) annotation of morphosyntactic features (such as number, gender, person, tense, etc.); d) annotation of their corresponding lemma. The particular tag set, though adapted to representation of Italian, is compliant with EAGLES recommendations (Gibbon 1999). An example is given below: <mw id=“mw_004” lemma=“BUONGIORNO” pos=“I” mfeats=“X”>buongiorno</mw> <mw id=“mw_005” lemma=“ESSERE” pos=“V” mfeats=“S1IP”>sono</mw> <mw id=“mw_006” lemma=“ANNAMARIA” pos=“SP” mfeats=“FS”>annamaria</mw> <mw id=“mw_007” lemma=“DEGASPERI” pos=“SP” mfeats=“NN”>degasperi</mw> <mw id=“mw_008” lemma=“VOLERE” pos=“V” mfeats=“S1DP”>vorrei</mw> <mw id=“mw_009” lemma=“PRENOTARE” pos=“V” mfeats=“F”>prenotare</mw> <mw id=“mw_010” lemma=“UN” pos=“RI” mfeats=“MS”>un</mw> <mw id=“mw_011” lemma=“VIAGGIO” pos=“S” mfeats=“MS”>viaggio</mw> In addition, the tag set is structured into a core scheme, supplying basic means for annotating morphological information, and a periphery tag set, which serves the purpose of making provision for further linguistic annotation to be added to obligatory information. This is the case, for instance, of a set of optional tags devised in order to annotate the so-called “discourse marker” class of words, i.e. a range of words belonging to several traditional grammatical categories and characterising themselves as a compact class as to their discursive or dialogic function. 
In this case, additional features are provided as an orthogonal dimension to recommended classification, so as to make it possible to express the fact that a given word is functioning as a discourse marker: <mw id=“mw_104” lemma=“LE” pos=“PQ” mfeats=“FS3”>le</mw> <mw id=“mw_105” lemma=“ANDARE” pos=“V” mfeats=“S3IP”>va</mw> <mw id=“mw_106” lemma=“BENE” pos=“B” mfeats=“X”>bene</mw> <mw id=“mw_107” lemma=“SI'“ pos=“B” mfeats=“X” dfeats=“PF”>si'</mw> <mw id=“mw_108” lemma=“QUINDI” pos=“C” mfeats=“X” dfeats=“MD”>quindi</mw> <mw id=“mw_109” lemma=“OTTO” pos=“N” mfeats=“NN”>otto</mw> <mw id=“mw_110” lemma=“E” pos=“CC” mfeats=“X”>e</mw> <mw id=“mw_111” lemma=“UN” pos=“RI” mfeats=“MS”>un</mw> 112 Robustness and coverage were a crucial aspect in the development of the two schemes, in particular for what concerns i) syntactic constructions specific of spoken dialogues (ellipses, anacolutha, non verbal predicative sentences etc.), and ii) disfluencies (repetitions, false starts, trailing off etc.). The syntactic annotation level is built on top of the previous one and consists in identification of non-recursive phrasal nuclei (called chunks) and annotation of their category4, as well as of their internal structure. The preference given to shallow parsing over, e.g., phrase structure trees is chiefly motivated by the locality of the analysis offered by this approach, a useful feature if one wants to prevent a local parsing failure from backfiring and causing the entire parse of an utterance to fail. This is particularly desirable when dealing with particularly noisy and fragmented input such as spoken dialogue transcripts. For an illustration of syntactic annotation, see the examples below: <pn id=“pn_004” type=“INT” href=“dial_002_mor.xml#id(mw_004)”> buongiorno <h id=“h_004” href=“dial_002_mor.xml#id(mw_004)”/> </pn> <pn id=“pn_005” type=“FV” href=“dial_002_mor.xml#id(mw_005)”> sono <h id=“h_005” href=“dial_002_mor.xml#id(mw_005)”/> </pn> <pn id=“pn_006” type=“N” href=“dial_002_mor.xml#id(mw_006)”> annamaria <h id=“h_006” href=“dial_002_mor.xml#id(mw_006)”/> </pn> <pn id=“pn_007” type=“FV” href=“dial_002_mor.xml#id(mw_007)”> degasperi <h id=“h_007” href=“dial_002_mor.xml#id(mw_007)”/> </pn> <pn id=“pn_008” type=“FV” href=“dial_002_mor.xml#id(mw_008)..id(mw_009)”>vorrei prenotare <d id=“d_001” type=“modal” href=“dial_002_mor.xml#id(mw_008)”/> <h id=“h_008” href=“dial_002_mor.xml#id(mw_009)”/> </pn> <pn id=“pn_009” type=“N” href=“dial_002_mor.xml#id(mw_010)..id(mw_011)”>un viaggio <h id=“h_009” href=“dial_002_mor.xml#id(mw_011)”/> </pn> 3.4 Semantic annotation The framework developed for ADAM allows the annotation of the semantic information through concepts. A concept is a typed structure that, using an ontology (e.g. a set of symbols that encode the apriori information), represents the semantic information in a synthetic and non-ambiguous form. The result of the conceptual annotation of a dialogue is therefore an XML file in which a (possibly empty) collection of concepts is associated to each turn. 
For the design of the annotation scheme of the conceptual level we have identified the following requirements: (1) soundness: the scheme should refer to well studied and formally sound representational approaches; (2) expressiveness: the scheme should allow the representation of the content of complex dialogues; (3) minimality: each turn should be annotated in a unique way (this requirement is rather strong since it is difficult to identify the “best” abstract level for the semantic content; therefore the requirement actually means that the annotation scheme should provide practical rules and criteria for this problem); (4) simplicity: the syntactic complexity of the language describing the concepts is to be minimised; (5) locality: each turn is independent of the previous turns and, in general, of the dialogue history; (6) portability: the annotation scheme should be domain-independent. The ADAM annotation scheme takes inspiration from the so called “Frame-based Description Languages” (Cattoni and Franconi 1990) a framework developed in the field of the Knowledge Representation. In our scheme a concept is encoded like a “frame”, a typed structure with “slots”. Slots represent the properties of the concept and its relations with other concepts. Slots are encoded with the couple <slot-name, slot-value>: the former contains the name of a property, the latter either a simple value or a reference to another concept. This recursion allows the encoding of complex and structured semantics information. There are different types of concepts according to the content to be represented (e.g. “time”, “trip”, “room”). An example is needed at this point: given the simple sentence “The train leaves from Rome”, the corresponding semantic annotation is: <concept id=“c_001” ctype=“trip”> <slot sname=“transportation-type” svalue=“train”/> <slot sname=“origin” svalue=“rome”/> </concept> 4 Syntactic annotation in ADAM is done automatically with manual check. 113 In this case only a concept (of type “trip”) is used, with two slots (properties): “transportationtype” with value “train” and “origin” with value “rome”. Complex concepts can be encoded using reference to simpler ones. For example the sentence “The train leaves from Rome at eight on Saturday fifteen” is annotated with : <concept id=“c_001” ctype=“trip”> <slot sname=“transportation-type” svalue=“train”/> <slot sname=“origin” svalue=“rome”/> <slot sname=“departure-time” svalue=“*c_002”/> </concept> <concept id=“c_002” ctype=“time”> <slot sname=“hour” svalue=“8:00”/> <slot sname=“week-day” svalue=“saturday”/> <slot sname=“month-day” svalue=“15”/> </concept> The simpler concept of type “time” (with identifier “c_002”) encodes a specific point in time. The more complex concept of type “trip” refers to such time to identify the value of the property “departure-time” - to do this the star ‘*’ character followed by the identifier of the referenced concept is used. As far as the ontology is concerned, three categories of symbols may be distinguished: (1) symbols that identify the type of concept (the value of the “ctype” attribute of <concept>), (2) symbols that identify the name of concepts’ property (the value of the “sname” attribute of <slot>), (3) symbols that identify the value of concepts’ property (the value of the “svalue” attribute of <slot>). 
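As an illustration of how such conceptual annotation might be consumed by a tool, the following sketch (assuming the concept elements of a turn are wrapped in a single root element, and using plain Python dictionaries as the target representation, which is an assumption of this sketch rather than part of the ADAM specification) parses the concepts of a turn and resolves slot values beginning with '*' into nested structures:

import xml.etree.ElementTree as ET

SAMPLE = """
<turn>
  <concept id="c_001" ctype="trip">
    <slot sname="transportation-type" svalue="train"/>
    <slot sname="origin" svalue="rome"/>
    <slot sname="departure-time" svalue="*c_002"/>
  </concept>
  <concept id="c_002" ctype="time">
    <slot sname="hour" svalue="8:00"/>
    <slot sname="week-day" svalue="saturday"/>
    <slot sname="month-day" svalue="15"/>
  </concept>
</turn>
"""

def load_concepts(xml_text):
    root = ET.fromstring(xml_text)
    concepts = {c.get("id"): c for c in root.findall("concept")}
    def build(concept, seen=()):
        frame = {"ctype": concept.get("ctype")}
        for slot in concept.findall("slot"):
            value = slot.get("svalue")
            # a value starting with '*' is a reference to another concept: recurse into it
            if value.startswith("*") and value[1:] in concepts and value[1:] not in seen:
                value = build(concepts[value[1:]], seen + (concept.get("id"),))
            frame[slot.get("sname")] = value
        return frame
    return {cid: build(c) for cid, c in concepts.items()}

print(load_concepts(SAMPLE)["c_001"])
# the "trip" frame now embeds the "time" frame under its departure-time property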
The user is free to adopt his/her conventions to encode the three categories of symbols of the ontology; nevertheless a good reference are the rules and symbols adopted by the C-STAR Consortium (Waibel 1996) for the inter-lingua: they have been developed on the basis of the experience on six different (Asiatic and European) languages and this appears to guarantee a good inter-lingua portability. For the semantic annotation of the ADAM corpus we actually adopted the C-STAR conventions: i) all symbols are English words, ii) complex symbols of categories (1) and (2) are obtained by means of the dash ‘-’ character (e.g. week-day, interval-time), iii) complex symbols of categories (3) are obtained by means of the underscore ‘_’ character (e.g. new_york). It is important to emphasise here that the scheme is domain-independent so that the annotation is portable. In fact even if the domain changes or the ontology is enriched with new symbols, the annotation scheme and the corresponding representation in XML doesn't change. For example, let us change the domain from the Transport Information context to that of Hotel Reservation: given the sentence “I would like a single room in Venezia for Saturday fifteen”, the annotation scheme doesn't change even if the types of concepts, the name and values of their properties change: <concept id=“c_001” ctype=“room”> <slot sname=“quantity” svalue=“1”/> <slot sname=“type” svalue=“single”/> </concept> <concept id=“c_002” ctype=“location”> <slot sname=“value” svalue=“venice”/> </concept> <concept id=“c_003” ctype=“time”> <slot sname=“week-day” svalue=“saturday”/> <slot sname=“month-day” svalue=“15”/> </concept> Although most of the concepts encode strictly domain-dependent information, some domainindependent (or cross-domain) concepts do exist, like the temporal expressions. An user is clearly free to annotate temporal expressions as he/she likes; nevertheless for ADAM we have defined a set of predefined concepts to represent temporal expressions, taking inspiration from the Verbmobil TEL (Temporal Expression Language) (Reithinger 1999). 3.5 Pragmatic annotation In several recent works on dialogue, there is an underlying assumption about the explicative power of the dialogue act notion for characterising discourse attitudes in human-human and humanmachine dialogue (for example, Isard and Carletta 1995 and Di Eugenio et al. 1998), and some authors argue for the use of dialogue act tagging schemes for dialogue system evaluation. The ADAM metascheme5 for pragmatic annotation is based on that widely held assumption, i.e. the pragmatic dimension 5 For an explanation of the concept of “meta-scheme” see Section 4. 114 of the dialogues is characterised in terms of the “linguistic acts and the contexts in which they are performed” (Stalnaker 1970). The goal of the pragmatic annotation level is the attribution of one (or more than one) dialogue act tags to each utterance of the dialogue. The scheme is a modified version of the tagging schemes DASML (Core and Allen 1997) and SWITCHBOARD-DAMSL (Jurafsky et al. 1997). In particular, it shares with DAMSL the features used to capture the communicative dimension of a dialogue turn. 
The ADAM annotation meta-scheme allows a three-layer pragmatic annotation practice: at the first layer, each dialogue turn is characterised with respect to its communicative level; at the second layer, the annotation captures the illocutionary dimension of the utterance(s) included in the turn; at the third layer, the discourse relationships among different utterances are characterised. 3.5.1 The communicative dimension Table 1 reports the tags, and some examples provided by the ADAM pragmatic scheme for annotating the communicative status of the dialogue turns. The annotation of this dimension is largely inspired by Core and Allen (1997). The four tags may be used by the annotators to characterise the following aspects: 1. TASK: the turn provides a contribution to the fulfilling of the goal of the conversation 2. TASK-MANAGEMENT: the turn addresses some specific features of the problem-solving process related to the task; 3. COMMUNICATION-MANAGEMENT: the turn addresses phenomena connected with the maintenance of the communication channel. 4. OTHER-LEVEL: the communication content of the turn cannot be characterised with any of the previous three tags (for example, jokes, word-plays, cliché, etc...) TAG EXAMPLES TASK “Do you want to reserve on that flight?” TASK-MANAGEMENT “I'm taking note of your requiremnts for the flight reservation. “ COMMUNICATION-MANAGEMENT “Please, hold on!” OTHER-LEVEL “Better late than never” Table 1 3.5.2 The dialog act dimension The dialog act labels provided by the ADAM pragmatic annotation scheme are reported in Table 2. The pragmatic annotation meta-scheme allows to characterise each utterance of the dialog with one or more dialog act tags on the basis of the role(s) of the utterance in the discourse. At the beginning of the annotation practice, we tried to apply to the annotation of this linguistic level the same minimality principle stated for the conceptual level, i.e. tagging each utterance one and only one dialogue act label. However, we soon realised that this simplification was not applicable at the pragmatic level, even in a task-oriented domain and for “simple” (question answering) dialogues. For example, indirect speech acts were hard to capture. At present, the pragmatic meta-scheme allows to attribute a primary label (characterising the direct speech act) and one, or more, secondary ones (for coding the indirect act). The secondary label is optional6. The dialogue act tags cover a large set of illocutionary functions in a task-oriented domain, and we believe that the scheme may be re-used, at some extent, for different domains. However, as for the other levels of the ADAM corpus, the problem of re-usability has been approached in terms of the pragmatic meta-scheme and its formal realisation (see below, section 4). Under this aspect, the distinction between the communicative dimension of a turn and the illocutionary content(s) of its utterance(s) is likely to be re-used in several annotation tasks. 6 Under this respect the pragmatic annotation scheme presented here and applied in the ADAM corpus differs from the one described in Soria, Cattoni and Danieli (2000). 115 LABEL EXAMPLE Statement I'm leaving today Request I'd need a double room Accept The flight leaving at ten is nice for me Accept-Part Yes, but I'd need an extra bed for my child Open-Option Do you want me to reserve the return flight? 
Action-Directive Please, reserve two seats on the BA3476 Repeat-Rephrase Oh, you said BA3476, the one leaving at 10 pm Collaborative-Completion …and I want to leave from NY next Sunday Conventional-Opening Hello, this is the Tourist Information Desk Conventional-Closing Good-bye Backchannel/Acknowledge Yes, of course Backchannel/Question Is that ok? Summarize/Reformulation So, you want to leave around 8 p.m. Or-Question Do you prefer a room with view on the garden or on the street? Apology Excuse me Thanking Thank you for calling Offer-Commit I've to check if there is a reduced fare available Yes/No Question Do you want to reserve the return flight on Thursday? Open-Question Which company do you prefer to travel? Reject No, I don't like to travel with this air company Yes-Answer Yes No-Answer No Response-Acknowledgement I agree Dispreferred-Answers No, I'd prefer to have a smoking room Opinion I believe this is the best solution Appreciation I enjoyed very much to work with you Abandoned/Uninterpretable I thin… Suggestion Perhaps we could try with another travel agent Signal-Non-Understanding Pardon? Signal-Understanding I see 3rd-Party-Conversation Fido, stop barking, I can't hear a word! Other You know, I'd need to take a week off Table 2: Dialogue-act tag set 4. Architectural framework The ADAM approach is mainly driven by the need of making the corpus widely reusable across different research and application purposes. In short, the concept of corpus reusability could be rephrased as the sum of the two concepts of “cross disciplinary acceptability” and “wide circulability”. The two concepts refer, respectively, to the fact that a corpus a) either express a consensual or standard view about the type of information encoded (the content or semantics of annotation) and b) express a consensual view for what concerns the way information is physically encoded (the encoding syntax or markup language). The latter goal seems to have been reached through adoption of XML as a de-facto standard. On the other hand, the former point is more tricky and represents the motivation for the many well-known standardisation efforts. Adoption of common or standard annotation schemes, however, seems to be at least only partially practicable, for the very simple reason that establishing what the needs of future users might be is hardly feasible. An alternative to the search for standards is thus desirable. A way out from the standardisation stricture is represented by concentrating our efforts in corpus design on maximisation of corpus flexibility. We claim that the degree of flexibility of a corpus depends on the extent to which the annotation is easily and quickly modifiable at a moderately low cost by subsequent users of the corpus. To clarify a little, we can think of at least two possible scenarios where a user might need to customise the annotation provided with a corpus. First, it might be the case that a user wishes to reuse a corpus which is annotated for several types of linguistic information, but lacks of a particular annotation type; the potential user could nevertheless be interested in the existing annotations, and would like to supplement them with a new one. On the other hand, it might be the case that a user is interested in some annotation only (e.g., pos-tagging or syntactic structure) and s/he might want to leave aside other annotation types. 
Reusability of an annotated corpus can thus be thought of as a function of the extent to which new levels of linguistic information can be added, or uninteresting ones can be removed. This is what we call the vertical dimension of customisation in annotated corpora. Second, for each level of linguistic analysis, an annotated corpus is likely to be reused depending on the extent to which existing annotation can be changed, so as to accommodate different annotation practices. It is often the case that a corpus which is annotated with a given annotation scheme “hard- 116 wires” the annotation so as it is impossible to replace the annotation without reverting to the raw text and rebuilding the annotation from scratch, which is enormously expensive. This is what we call the horizontal dimension of customisation of an annotated corpus. The extent to which an annotated corpus can be flexible enough to be compliant with these two requirements clearly depends on the particular choices made at the design level about the organisation and structuring of annotation. If, for instance, all types of annotation are flattened onto a single representation level, it is clear that the customising operations above become hardly feasible. In ADAM we aim at maximising corpus flexibility by appealing to the two related notions of modularity of annotation and use of annotation meta-schemes. The notion of modularity of annotation refers to data architecture (see MATE, Dybkjaer et al. 1998). In an annotated corpus, several different types of annotation or linguistic information may be present in relation to the same input data. These types of information can be thought of as independent, yet related, levels or dimensions of linguistic description. We thus can think of a level of prosodic analysis, another of pos-tagging, another of semantic analysis, etc. By annotation modularity we mean that the different layers of annotation are to be kept independent one of another. In the ADAM Corpus synchronisation among the different analyses and between these and the speech signal is ensured by the different annotations (stored as separate files) making reference to the same input file. This file, containing the transcription of the dialogue, is in turn linked to the audio file in PCM (a-low or u-low) format. Support for this structure is provided by the use of XML as mark-up language. By adopting this structure, annotation layers are linguistically heterogeneous and mutually orthogonal, so that changing one of them affects others only to a limited extent; layers are nevertheless indirectly related through a) their hinging on a common reference file (the “raw” text represented by the transcription file); b) the indirect correlation of the linguistic information they convey. This vertical modularity of the ADAM approach has interesting consequences for the purposes of reusability. A potential user of the ADAM Corpus is left free to select, among the proposed levels of annotation, those which best reflect his/her theoretical and practical interests. (S)he can also feel the need for adding a new layer of information, not contemplated in today's ADAM realisation. By the way, level modularity is also of theoretical interest, since most annotation schemes we know differ mainly in the way pieces of linguistic information categorised, rather than in the intrinsic nature of these levels. 
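A minimal sketch of what this file-level modularity can look like in practice: two independent annotation layers, each stored in its own XML file, refer to the tokens of the shared transcription by identifier, and a consumer merges only the layers it needs. The element names and the simplified href convention below are placeholders for this sketch, not the actual ADAM DTDs or Xlink syntax.

import xml.etree.ElementTree as ET

TRANSCRIPTION = '<transcript><w id="w1">vorrei</w><w id="w2">prenotare</w></transcript>'
POS_LAYER     = '<mor><mw href="w1" pos="V"/><mw href="w2" pos="V"/></mor>'
CHUNK_LAYER   = '<syn><pn type="FV" href="w1 w2"/></syn>'

def merge(transcription, *layers):
    """Index the transcription by token id, then attach each layer's
    annotations to the tokens they reference."""
    words = {w.get("id"): {"form": w.text, "annotations": []}
             for w in ET.fromstring(transcription).iter("w")}
    for layer in layers:
        root = ET.fromstring(layer)
        for elem in root:
            for ref in elem.get("href", "").split():
                if ref in words:
                    words[ref]["annotations"].append((elem.tag, dict(elem.attrib)))
    return words

merged = merge(TRANSCRIPTION, POS_LAYER, CHUNK_LAYER)
print(merged["w1"])

A user interested only in pos-tagging would simply not pass the chunk layer; adding a new level of analysis amounts to adding one more file with the same kind of references to the transcription.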
Moreover, level modularity seems to have a useful impact on our theoretical understanding of the linguistic phenomena at stake, since it is capable of expressing correlation relationships between layers, and ultimately between dimensions of linguistic analysis. Horizontal customisation in annotated corpora can be enhanced by implementing the concept of annotation meta-schemes. According to our view, an annotation meta-scheme is a general descriptive framework in which different annotation schemes can be accommodated. In many cases the same unit of linguistic information can be annotated in different, arguably mutually incompatible ways, which are nonetheless all compatible with the recommended vertical modularity described above: so it is better to provide the potential user with the possibility of adopting any arbitrary annotation scheme without being forced to re-build the annotation from scratch or to forcefully comply with some other annotation scheme, no matter how standardised. To do so, it is necessary to have a representation format for the annotation that is general enough for competing schemes to be mutually substitutable. In other words, it is necessary to make the representation of annotation schemes as much scheme-independent as possible. It should be noted how the ADAM different annotation schemes do not, in fact, merely amount to another set of ready-made annotation schemes, but actually are represented in their XML annotation format in such a way that, for each annotation level, those features that are common to several competing schemes become slots or descriptive element tags to be associated with linguistic elements; the values of these attributes can be any arbitrary set of tags. Let us consider, for instance, the case of pragmatic annotation. The main difference between annotation schemes for this level of analysis lies in the particular types of dialogue act chosen rather than in the notion of dialogue act itself, which appears to be uncontroversial. If, however, we adopt a scheme where the basic descriptive element of any arbitrarily long set of words is the general tag <dialogue act>, further described by an attribute “type”, different schemes can be applied to the same corpus without totally discarding the existing annotation: a substitution in the set of values will be enough. Conversion from one annotation scheme into another is easily done through XSLT transformations. It is our belief that enforcing this practice in the design of annotation schemes will bring us to more effective corpora exchange and reuse7. 7 In addition, the meta-scheme can be seen as a tool for effective comparison of alternative annotation schemes. 117 Finally, it must be noted that actual corpus reusability crucially depends also on the physical format or mark-up language used for corpus encoding8. As already stated throughout the paper, the mark-up language used for the encoding of the ADAM Corpus is XML. XML proved to be the ideal candidate for a number of reasons, all related to corpus reusability. First, it is an emerging and widespread standard, which ensures a good degree of corpus reusability in the times to come. Second, because of its platform-independence it enhances the potential for wide circulation of the annotated material, together with a considerable flexibility of use. More crucially, however, XML proved essential for implementation of the architectural choices described above. Annotation modularity is supported via extensive use of Xlink elements (DeRose et al. 2000). 
Each XML element in the annotation files is actually an hypertextual link which refers to an element (or set of elements) in the transcription file. All annotations for each dialogue are thus connected to the same input reference source (the transcription), thus ensuring synchronisation of the different annotations and still preserving their independence. On the other hand, the concept of annotation meta-scheme is easily implemented in XML, thanks to translation of the different annotation schemes content-independent. In other words, a general preference was given towards representing the different annotation tags as values of generic, scheme-independent attributes of XML elements. In this way the different annotation schemes (represented as different DTDs) are represented in a generic enough way, so that a future user of the corpus will only need to change the values of the different attributes for the entire annotation scheme to be changed. We believe that this approach represents a further value of the ADAM Corpus. 5. Conclusions and future work In this paper we have described the methodological assumptions, the annotation practice and general architectural framework underlying the ADAM Corpus, which is a corpus of annotated spoken dialogues currently being developed as part of the Italian national project SI-TAL. As far as annotation is concerned, our next step is to proceed to a validation phase, where the annotations performed by several annotators will be evaluated according to the metrics of (Isard and Carletta 1995). In addition to provide a concrete annotation experience, we have introduced what we believe to be an essential aspect to bear in mind in corpus design, namely the requirement of reusability. We have claimed that, for effective circulation and re-use of corpora, it is essential to make provision for as many practices of dialogue annotation as possible, as well as approaches to annotation at different levels, instead of providing fixed levels and schemes of analysis, no matter how standardised. Corpora will have a chance to be reused as far as it will be easy and relatively inexpensive to adapt them to different needs and application purposes. Use of XML as mark-up language is a further step toward this end. 6. References Baggia P, Castagneri G, Danieli M 2000 Field trials of the Italian ARISE train timetable system. Speech Communication 31: 355-367. Cattoni R, Franconi E 1990 Walking through the semantics of frame-based description languages: a case study. In Proceedings of the Fifth International Symposium ISMIS ‘90, Knoxville, TN, pp 234-241. Core M, Allen J 1997 Coding dialogues with the DAMSL annotation scheme. In Working Notes of the AAAI Fall Symposium on Communicative Actions in Humans and Machines, Cambridge, MA, pp 28-35. DeRose S, Maler E, Orchard D, Trafford B 2000 XML linking language (Xlink). W3C Working Draft, 21 February 2000. http://www.w3.org/TR/xlink/. Di Eugenio B, Jordan P, Moore J D, Thomason, R 1998 An empirical investigation of proposals in collaborative dialogues. In Proceedings of the 17th International Conference on Computational Linguistics and the 36th Annual Meeting of the Association for Computational Linguistics, Montréal, Canada, pp 325-329. Dybkjaer L, Bernsen N O, Dybkjaer H, McKelvie D, Mengel A 1998 The MATE markup framework. MATE Deliverable 1.2. http://mate.nis.sdu.dk. Gibbon D (ed) 1999 Handbook of standards and resources for spoken language systems. First supplement, EAGLES LE3-4244, Spoken Language Working Group. 
8 For a similar view see Ide and Brew (2000). 118 Ide N, Brew C 2000 Requirements, tools, and architectures for annotated corpora. In Proceedings of the Workshop on Data Architecture and Software Support for Large Corpora, LREC 2000, Athens, Greece, pp 1-5. Isard A, Carletta J 1995 Replicability of transaction and action coding in the Map Task corpus. In Walker M, Moore J D (eds), AAAI Spring Symposium: Empirical Methods in Discourse Interpretation and Generation. Stanford, CA, pp 60-66. Jurafsky D, Shriberg E, Briasca D 1997 Switchboard DAMSL labeling project coder's manual. Technical Report 97-02, University of Colorado, Institute of Cognitive Science, Boulder, Reithinger N 1999 Robust information extraction in a speech translation system. In Proceedings of Eurospeech ‘99, Budapest, pp 2427-2430. Silverman K, Beckman M, Pitrelli J, Ostendorf M, Wightam C, Price P, Pierrehumbert J, Hirschberg J 1992 TOBI: A standard for labeling English prosody. In Proceedings of ICSLP 1992, pp 867-870. Soria C, Cattoni R, Danieli M 2000 ADAM: An architecture for xml-based dialogue annotation on multiple levels In Dybkjaer L, Hasida K, Traum D. (eds), Proceedings of the First SIGDial Workshop on Discourse and Dialogue, Association for Computational Linguistics, Hong Kong, pp 9-18. Stalnaker, R C 1970 Pragmatics. Synthese 22, pp 272-289. Waibel A 1996 Interactive translation of conversational speech. Computer 29(7): 41-48. 571 Not last, even if least: endangered Formosan aboriginal languages and the corpus revolution Dr. Josef Szakos and Amy Wang English Department, Providence University, Taichung, Taiwan In our paper we intend to report on the present stages and uses of corpus creation for the Austronesian languages of Taiwan. Our main attention goes to the Tsouic tribes, including Northern Tsou, Kanakanavu and Sa’ arua. The latter two can be regarded as highly endangered as there are only dozens of old speakers left. Our corpus serves a three-fold purpose: Authentic documentation of the oldest remnants of early Austronesian languages, making them analysable for the linguistic community around the Globe, and enabling a revitalisation and stabilisation of language use in the aboriginal communities. As these languages have lacked writing systems until now, the concomitant task of alphabetisation has to be solved, too. The dozens of hours of transcribed speech data are searchable by any kwic programs and concordancers. The innovation of our approach is the linking of sound files (the length of intonational units) with the corpus, and consequently with the output of searches. This makes not only the instruction of speech patterns for young people easier, but it also provides the researchers with the phonetic context. The quick comparison of intonation patterns can give additional information for semantic subdistinctions. The corpus is constantly growing, providing a series of CD-s, while we also intend to combine it with web-availability. For the young learners of their mother-tongue, we have created a sympathetic user interface including a combined vocabulary. Remaining problems are the possible automatization or partial automatization of the recording and segmentation procedure, search for topics, themes and balancing the corpus by including the speech of more persons of the last speakers. Since the users, the mother-tongue learners and researchers need an interface in Chinese or English, we still have to work out better solutions for that (like parallel corpus arrangement). 
We hope that the overview and introduction of some problems may raise the attention of some experts who could solve the further problems. 572 Methods and techniques for a multi-level analysis of multilingual corpora Elke Teich Silvia Hansen Institute for Applied Linguistics, Institute for Applied Linguistics, Translation and Interpreting (FR 4.6) Translation and Interpreting (FR 4.6) University of Saarland, Germany; University of Saarland, Germany & S.Hansen@mx.uni-saarland.de Department of Linguistics University of Sydney, Australia E.Teich@mx.uni-saarland.de 1 Introduction The present paper discusses the application of a set of computational corpus analysis techniques for the analysis of the linguistic features of translations. The analysis task is complex in a number of respects. First, a multi-level analysis (clause, phrases, words) has to be carried out; second, among the linguistic features selected for analysis are some rather abstract ones, ranging from functionalgrammatical features, e.g., Subject, Adverbial of Time, etc, to semantic features, e.g., semantic roles, such as Agent, Goal, Locative, etc.; third, monolingual and contrastive analyses are involved. This places certain requirements on the computational techniques to be employed both regarding corpus annotation and information extraction. We show how a combination of commonly available techniques can fulfil these requirements to a large degree and point out their limitations for application to the research questions raised. The paper is organized as follows. Section 2 describes the concrete analysis scenario at hand, including the corpus design and the kinds of linguistic features we are interested in extracting from the corpus. Section 3 discusses the application of a range of computational tools in different stages of corpus analysis, ranging from encoding and alignment in the corpus preparation stage over part-ofspeech tagging and grammatical and semantic annotation in the linguistic annotation stage to concordance programs in the information extraction stage. Section 4 concludes the paper with a summary. 2 The analysis of a multilingual corpus One of the primary goals of the kind of corpus analysis described here is to test some hypotheses about the specific features of translations compared to their source language (SL) originals and to comparable original texts in the target language (TL). The language pair we are interested in primarily is English-German. Among the hypotheses tested are Toury's law of growing standardization (Toury 1995)/Baker's normalization (Baker 1995) and Toury's law of interference (Toury 1995). The first says essentially that translations are even more typical of the target language than are original texts in the same language, exaggerating the typical features of the TL (TL normalization). The second says that what makes translations a particular kind of text is that the source language always shines through in one way or another (SL shining-through). Given the kinds of relations that are referred to in these hypotheses, we need a corpus that consists of SL originals and their translations as well as comparable original texts in the TL (see Figure 1). Figure 1: Multilingual corpus and relations between sub-corpora We make a couple of starting assumptions relating to the kind of analysis that needs to be carried out to test the hypotheses. 
First, the difference between translations and both SL original texts and comparable original TL texts is one of degree, and can thus be measured on a quantitative basis. That is, we can analyze the relations between sub-corpora in terms of frequencies of occurrence of particular linguistic features and compare their distributions across sub-corpora. Second, while a text in a language l1 and its translation into a language l2 are comparable simply because they are in a translation relation, for two original texts in a language l1 and a language l2 and for two original texts in the same language we need to make sure that they are comparable using some other criterion. For a definition of the notion of 'comparable', we draw on the concept of register, i.e., linguistic variation according to function in situational context (cf. Quirk et al. 1985; Halliday 1978). The features selected for analysis are taken from contrastive and monolingual register analysis (Biber 1995; Halliday 1998; Beneš 1981; Fluck 1997). They range from syntactic features such as verb complementation patterns and voice, over semantic features, such as agency, to textual features, such as theme-rheme structure. As will be seen in Section 3, some of these (e.g., passive) can be extracted on the basis of sequences of parts-of-speech, but others (such as agentive) have to be extracted on the basis of manually coded text. The relation of these features to the testing of the two hypotheses is the following. For example, for the register of scientific writing it is commonly known that English texts from this register are characterized by the frequent use of passives. In German, the passive also constitutes a register feature of scientific texts; however, the grammar of German offers other possibilities with similar functions which are "quasi passives" (examples (1) and (2)), so that the core passive occurs less frequently in German original texts than in English original texts of the given register. (1) Somit lassen sich auch bei diesen Spielen verschiedene Strategien gegenüberstellen. thus let themselves also with these games different strategies oppose "For these games, too, it would be possible to compare different strategies." (2) Dabei ist eine sehr bemerkenswerte Verlagerung der Schwerpunkte zu verzeichnen. thereby is a very remarkable shift of emphases to note "There has also been a remarkable shift in emphasis." Comparing the frequency of core passives in German translations to their frequency in English originals, two things may happen: either there is a significant difference, or there is no significant difference. If there is a significant difference, the frequency of core passives may either be closer to the one in the corpus of SL originals or it may be closer to the one in comparable TL originals. The first would be an indication of SL shining-through, the second an indication of TL normalization.1 Syntactic features, such as passive, can be extracted more or less straightforwardly on the basis of text annotated with parts-of-speech. With more abstract features, this is not possible any more. Agency, an attribute of the clause with two values, agentive (Agent involved; see example (3)) and non-agentive (no Agent involved; see example (4)), is a case in point. Features such as agency are tested in a similar way to that described above for passive, i.e., distributions are compared across corpora and interpreted as SL shining-through or TL normalization.
(3) She was moving the horses. (4) The horses were moving. Finally, if we want to analyze features such as passive or agency according to different registers, we would like to be able to formulate queries such as ”Search for all agentive clauses in passive voice in the register of popular-scientific writing”. This would require that more than one level of annotation can be referred to at the same time. We will see in Section 3 that this is not trivial with the tools available to date. Thus, the following requirements are placed on the tools to be used for the kinds of analysis we carry out: · Syntactic features need to be extracted. Searches on raw text, which can well be successful if one is interested in lexical material, are therefore pointless in the present context. The corpus has to be annotated at least with part-of-speech information so as to enable the extraction of instances of particular syntactic constructions. · Semantic features need to be extracted. Since semantic analysis cannot be carried out automatically, we need a mechanism for manual annotation, on the one hand, and a query mechanism that is responsive to that annotation. 1 For details of the methodology and analysis results see (Teich in progress). 574 · Contrastive data need to be extracted. Since we need to carry out contrastive analyses, we need the tools to be employed to be applicable for more than one language, on the one hand, and we need querying facilities that are responsive to more than one language at a time. · Multiply-annotated data needs to be referred to (cf. above). In the following section we discuss the computational techniques we have employed in corpus preparation, linguistic annotation and for information extraction. The benefits and limitations of each technique for the present analysis requirements are assessed. 3 Computational techniques 3.1 Corpus preparation Since part of the analysis relates to translations and their SL originals, the parallel corpus needs to be aligned. For this purpose we use the alignment program Déja Vu.2 See Figure 2 showing an SL and a TL text aligned with this tool. Figure 2: Multilingual corpus alignment Déja Vu aligns a text and its translation, storing the aligned texts in one file or in two separate files depending on the requirements of the information extraction tool used in later stages of analysis. Files can be exported to translation workbenches and to Microsoft Excel and Access. Figure 3 shows a Déja Vu output in a TSV (tab separated vector) format. "17 One such is hydrogen chloride, HCl, and textbooks often write this process as HCl H+ + Cl-." "17a Salzsaeure, HCl, ist eine solche Verbindung. 17b In Lehrbuechern wird dieser Prozess oft durch die Gleichung HCl H+ + Cldargestellt." "18 But H+ is a ""bare"" proton, and as it has an overwhelming attraction to any electron pair in its vicinity it cannot exist apart from a molecule." "18 Da aber H+ ein ""nacktes"" Proton ist und jedes in seiner Naehe befindliche Elektronenpaar anzieht, kann es nicht wirklich ausserhalb eines Molekuels existieren." Figure 3: Déja Vu alignment format Also, we encode each text of the corpus in terms of a header that provides some meta-information (including title, author, publication, translator, etc) as well as register information (field, tenor, mode). 2 http://www.atril.com/ 575 Each file is encoded in XML using a modified version of TEI3 (illustrated in Figure 4) and employing a standard XML editor (here: XML Spy4). 
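Concretely, a header along these lines could be generated as in the sketch below. The values are taken from the example in Figure 4, while the element names beyond those visible there (in particular register, field, tenor and mode as elements) are placeholders for this sketch rather than the project's actual modified-TEI tag set.

import xml.etree.ElementTree as ET

def make_header(meta):
    """Build a simplified corpus-file header with meta-information and register information."""
    tei = ET.Element("tei.2")
    header = ET.SubElement(tei, "teiHeader")
    file_desc = ET.SubElement(header, "fileDesc")
    for key in ("filename", "subcorpus", "language"):
        ET.SubElement(file_desc, key).text = meta[key]
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = meta["title"]
    register = ET.SubElement(header, "register")   # placeholder element name
    for key in ("field", "tenor", "mode"):
        ET.SubElement(register, key).text = meta[key]
    return ET.tostring(tei, encoding="unicode")

print(make_header({
    "filename": "code_tl_e.txt",
    "subcorpus": "popular-scientific (trans_en)",
    "language": "English",
    "title": "Code breaking",
    "field": "types of code and their function",
    "tenor": "expert to educated layperson",
    "mode": "written",
}))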
The text body is annotated for headings, sentences, paragraphs, etc. <tei.2> <teiHeader> <fileDesc> <filename>code_tl_e.txt</filename> <subcorpus>popular-scientific (trans_en)</subcorpus> <language>English</language> <titleStmt> <title>Code breaking Ewald Osers The Overlook Press Woodstock 1999 German-English Verschlüsselte Botschaften German Rudolf Kippenhahn popular-scientific types of code and their function exposition communication expert to educated layperson unequal maximal constitutive graphic written Modified TEI Figure 4: XML corpus encoding 3.2 Linguistic annotation Depending on how abstract the linguistic features to be analyzed are, the linguistic corpus annotation can be done automatically (using morphological analysis tools, part-of-speech taggers or parsers) or it can be computer-aided (using corpus annotation tools). A fairly reliable method of syntactic annotation is part-of-speech tagging. The tagger we employ is the TnT tagger, a statistical part-of-speech tagger that analyzes trigrams, incorporating several methods of smoothing and of handling unknown words (Brants 1999). The system is trainable on different languages and comes with the Susanne tagset5 for English and the Stuttgart-Tübingen tagset6 for German. It includes a tool for tokenization, which is a preparatory step in the tagging process. In the basic mode, the tagger not only adds a part-of-speech tag to each token, but it omits alternative tags, 3 http://www.tei-c.org/index.html 4 http://www.xml-spy.com 5 http://www.cogs.susx.ac.uk/users/geoffs/RSue.html 6 http://www.sfs.nphil.uni-tuebingen.de/Elwis/stts/stts.html 576 together with a probability distribution. It analyzes between 30,000 and 60,000 tokens per second and has an accuracy of about 97 per cent. Figure 5 shows a sample output of TnT, which is in a TSV format. Figure 5: TnT sample output When more abstract features are to be coded, annotation has to be carried out manually. Tools supporting such annotation allow the definition of annotation schemes and support manual coding by graphical user interfaces (GUI's). One such tool is Coder (O'Donnell 1995). Coder has five functionalities: chunking-up of texts into units for coding, definition of coding schemes, annotation of texts with a coding scheme, calculating of basic descriptive statistics for a coded corpus and outputting concordances. The chunking mechanism chunks up the text into sentences and paragraphs. If units smaller than sentences need to be annotated, chunking must be done manually. The definition of a coding scheme is supported by a GUI, additions and changes to schemes are straightforward. Coding is supported by another GUI that highlights the unit currently being coded and presents the coding options. The system keeps a record of the codings, on the basis of which a simple descriptive statistics can be calculated and exported. Finally, there is the reviewing function, which is a concordance function operating on the coded features. Figure 6 displays a coding scheme used for coding agency. Figure 6: Coder's interface for annotation scheme definition The annotated texts are written out in an XML/SGML-like format. See Figure 7 for an example. Let VV0 us PPIO2 consider VV0 , YC by II a AT1 simple JJ illustration NN1 , YC the AT problem NN1 of IO depositing VVG a AT1 coding VVG with IW sender NN1 and CC receiver NN1 . YF Coding VVG keys NN2 are VBR much DA1 like VV0 the AT keys NN2 we PPIS2 use VV0 in II our APPG daily JB lives NN2 . YF 577
The story was about a boy who had a bucket a net and a dog and the dog took the bucket and the boy took the net and they walked over to go to the pond to catch a frog but when they went they looked all over almost all day but they couldn't find the frog.
Figure 7: Coder output format 3.3 Information extraction In order to extract particular kinds of linguistic information from the corpus annotated in the ways described above, tools for querying the corpus in terms of the features annotated are needed. For extracting syntactic information, we use the IMS Corpus Workbench (Christ 1994). The IMS Corpus Workbench is a concordance tool with which it is possible to query for words and/or part-of-speech tags on the basis of regular expressions. Moreover, it allows queries on parallel corpora (aligned translation corpora). The IMS Corpus Workbench consists of two modules: the Corpus Query Processor (CQP) and the user interface (Xkwic). Figure 8 shows Xkwic with a query for passive extraction. Figure 8: User Interface of the IMS Corpus Workbench The query is based on the part-of-speech tags VB.* (forms of the verb 'be') followed by VVN.* (past participle), with zero to three words in between. The results are displayed in the KWIC (keyword in context) list, which also indicates the number of matches. For the extraction of text coded for more abstract features that have been annotated with Coder, the review and statistics functions of Coder can be used for further processing. With the statistics function it is possible to create descriptive statistics for the analysis; the review function is a concordance function with which it is possible to extract text instances that have been annotated with a particular feature. See Figure 9, which presents an example of extraction of text annotated with the feature agentive. Figure 9: Review function of Coder It is not possible to investigate multilingual corpora with Coder. 4 Summary and conclusions The analysis task we are faced with in the corpus-based investigation of the linguistic properties of translations places a number of requirements on the computational tools to be used in corpus analysis (cf. Section 2). We have described the application of a set of techniques for the analysis of multilingual corpora, ranging from alignment and encoding over linguistic annotation to information extraction (cf. Section 3). Taken together, the tools we have discussed can support the kinds of analysis we carry out, but there are some remaining problems. One has to do with the input and output representations and formats of the individual tools, the other has to do with information extraction. Each of the tools used for encoding, annotation and extraction employs different input formats that do not necessarily match straightforwardly. While for encoding we have employed XML, the IMS Corpus Workbench requires as input a tokenized text with syntactic annotations in a TSV format, and Coder requires as input raw text (segmentation is done within Coder). Also, the outputs that are generated from coding are again different across tools: TnT produces a TSV format, Coder produces an XML/SGML-like format. While part of this problem can be dealt with simply by format transformations (e.g., a TSV format can be straightforwardly transformed into an XML format by a Perl script, and an XML-like format can be straightforwardly transformed into XML with the help of XSLT (W3C-XSLT 2000); see Figure 10 for a PoS-tagged text in a simple XML notation), there are some more principled questions involved here, to do with the fact that we do multi-level annotation.
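As a rough illustration of the TSV-to-XML transformation just mentioned, the short sketch below turns TnT-style tab-separated output (one token and tag per line, blank line between sentences) into a simple XML notation. The element and attribute names (text, s, tok, pos) are placeholders chosen for this sketch, not necessarily the notation used in Figure 10.

import xml.etree.ElementTree as ET

def tsv_to_xml(lines):
    """Convert TnT-style output (token<TAB>tag per line, blank line = sentence boundary)
    into a simple XML notation."""
    text = ET.Element("text")
    sent = ET.SubElement(text, "s")
    for line in lines:
        line = line.strip()
        if not line:                      # sentence boundary
            if len(sent):
                sent = ET.SubElement(text, "s")
            continue
        token, tag = line.split("\t")[:2]
        tok = ET.SubElement(sent, "tok", pos=tag)
        tok.text = token
    return ET.tostring(text, encoding="unicode")

sample = ["Let\tVV0", "us\tPPIO2", "consider\tVV0", ",\tYC",
          "by\tII", "a\tAT1", "simple\tJJ", "illustration\tNN1"]
print(tsv_to_xml(sample))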
Figure 10: PoS-tagged text in XML format If, for instance, clause annotations as we have done them using Coder are to be integrated with PoS annotations like the ones given in Figure 10 into one uniform representation, different units of annotation have to be merged. Again, this would be feasible simply by operating on the different formats, but in a more principled treatment the units of annotation would have to be defined explicitly in the first place. This is a typical task of document type definition, as handled by, for instance, XML. A possible document type definition (DTD) for our annotation purposes could look as displayed in Figure 11. Figure 11: XML DTD for multi-level annotation This defines a formal grammar for annotation specifying the units of annotation (sentence, clause, phrase, token) and their attributes (exemplified here with the unit token and the attribute PoS (part-of-speech)). '+' denotes 'one or more occurrences of'. In information extraction, problems arise for similar reasons. Unless format transformations are carried out (where possible), different tools have to be employed for information extraction and the corpus can only be queried with respect to one level of annotation at a time. It therefore seems desirable, after all, to have a uniform representation that is built on first principles (see Figure 12 for illustration). Again, this would require employing a document encoding standard in which document types can be properly defined. Figure 12: Integrated representation A recent development in such a direction is the MATE system (Mengel 1999; Mengel and Lezius 2000), which allows for multi-level annotation in a uniform representation, using XML. However, MATE would still need to be tested in a multilingual application of the kind we are involved in here. Finally, the query mechanisms available in the tools we have tested are either simple Boolean searches on strings (as in the case of Coder) or they are based on regular expressions (as in the case of the IMS workbench). This limits the possibilities of corpus querying. In particular, since the queries in the IMS workbench have to be formulated on sequences of PoS tags, the queries can become quite complex. In our immediate future work, we are going to test more expressive query systems, such as the one implemented in G-Search (Keller et al. 1999), which allows searching with context-free grammars. To conclude, with a complex corpus analysis task such as the one discussed in this paper, we cannot expect to find the one ideal tool that can deal with all aspects of annotation and fulfil our particular requirements on information extraction. The concrete tools we have discussed here are exemplars of standard techniques used in corpus linguistics and we would thus expect other linguists with similar analysis requirements to run into the same kinds of problems. Finally, we have formulated the desideratum of a uniform representation of corpus annotation, so that on that basis searches on multiple levels of annotation can be carried out. This currently remains an unsolved problem in our analysis scenario. Acknowledgments Thanks to Peter Fankhauser and Frank Klein for advice on XML.
References Baker M 1995 Corpora in translation studies: An overview and some suggestions for future research. Target 7(2):223-245. Beneš E 1981 Die formale Struktur der wissenschaftlichen Fachsprachen aus syntaktischer Hinsicht. In Bungarten T (ed) Wissenschaftssprache. München, Fink, pp 185-212. Biber D 1995 Dimensions of register variation: A cross-linguistic comparison. Cambridge, Cambridge University Press. Brants T 1999 TnT - A Statistical Part-of-Speech Tagger (User manual). Department of Computational Linguistics, Universität des Saarlandes, Saarbrücken, Germany (http://www.coli.unisb. de/~thorsten/tnt/). Christ O 1994 A modular and flexible architecture for an integrated corpus query system. In Proceedings of COMPLEX 94, 3rd Conference on Computational Lexicography and Text research, Budapest, pp 23–32 (http://www.ims.uni-stuttgart.de/projekte/CorpusWorkbench/). Fluck H R 1997 Fachdeutsch in Naturwissenschaft und Technik: Einführung in die Fachsprachen und die Didaktik/Methodik des fachorientierten Fremdsprachenunterrichts. Heidelberg, Groos. Halliday MAK 1978 Language as social semiotic. Arnold, London. Halliday MAK 1998 Things and relations: Regrammaticising experience as technical knowledge. In Martin J, Veel R (eds) Reading Science. Critical and functional perspectives on discourses of science. London, Routledge. Quirk R, Greenbaum S, Leech G, Svartvik J 1985 A comprehensive grammar of the English language. London, Longman. Keller F, Corley M, Corley S, Crocker M, Trewin S 1999: Gsearch: A Tool for Syntactic Investigation of Unparsed Corpora. In Uszkoreit H, Brants T, Krenn B (eds) Proceedings of the EACL Workshop on Linguistically Interpreted Corpora, Bergen, pp 56-63. Mengel A 1999 Die integrierte Repräsentation linguistischer Daten In: Gippert J (ed) Multilinguale Corpora. Codierung, Strukturierung und Analyse (11. Jahrestagung der Gesellschaft für Linguistische Datenverarbeitung). Prag, enigma corporation, pp 115-121. Mengel A, Lezius W 2000 An XML-based representation format for syntactically annotated corpora. In Proceedings of LREC 2000, Athens, pp. 121-126 (http://mate.mip.ou.dk). O'Donnell M 1995 From Corpus to Codings: Semi-Automating the Acquisition of Linguistic Features. In Proceedings of the AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, Stanford University, California, pp 120-124 (http://cirrus.dai.ed.ac.uk:8000/Coder/index. html). Teich E in progress Contrast and commonality between English and German in system and text. A methodology for investigating the contrastive-linguistic properties of translations and multilingual texts. Department of Applied Linguistics, Translating and Interpreting, Universität des Saarlandes, Saarbrücken, Germany. Toury G 1995 Descriptive Translation Studies and beyond. Amsterdam, John Benjamins. W3C-XSLT 2000 XSL Transformations (XSLT), Version 1.0 (http://www.w3c.org/TR/xslt). 581 Accurate automatic extraction of translation equivalents from parallel corpora Dan Tufiú and Ana-Maria Barbu RACAI-Romanian Academy Center for Artificial Intelligence 13, “13 Septembrie”, RO-74311, Bucharest, 5, Romania {tufis, abarbu}@racai.ro Abstract The paper describes a simple but very effective approach to bilingual lexicons extraction from parallel corpora. After briefly describing the method we present the evaluation for six pairs of languages in terms of precision and recall and processing time. 
We conclude by discussing the merits and the drawbacks of our method in comparison with other work and comment on further developments. 1. Introduction The vast and continuously growing amount of information available nowadays on the WEB poses numerous and challenging problems which have strongly influenced current approaches to natural language processing. One of the most obvious tendencies in NLP and related technologies is a distinct preference for shallow processing. The term usually implies not only a partial coverage of the difficult problems in computational linguistics but also the acceptance of a limited imprecision or inaccuracy in the automatic decisions made in order to achieve a useful degree of language analysis or generation. The motivation for this trend, to a large extent imposed by the Internet industry and market, is given by the requirement to process very large quantities of text in real time. It has been noticed that many useful tasks (document indexing, document classification, information retrieval, etc.) may be achieved even with quite superficial processing of natural language texts. Increasing the amount of linguistic processing in more and more web-based applications is a constant preoccupation of many R&D professionals and companies. The development of large linguistic resources (thesauri, ontologies, translation memories, multilingual dictionaries, annotated corpora etc.) has become a very active area for both academic research and industry. As one would expect, printed dictionaries and lexicons were primary sources for constructing lexical databases. However, relying on human-dedicated lexical sources was not as successful as expected for computer programs, for various reasons, among which one could mention variation of lexical stock and lexical gaps, shifts of meaning, sense granularity too fine for automatic discrimination, etc. The costs involved in turning the human-oriented linguistic knowledge sources into useful computational resources are also high (copyright fees, man-power costs), so the idea of using computers to extract and organise linguistic information according to specific needs is natural, and significant research is carried out along these lines. Combining supervised and unsupervised linguistic knowledge acquisition is a trade-off of time and cost against quality and completeness. As basic language resources (meant for training/learning) become cleaner and larger, the acquisition of more structured information and higher levels of linguistic knowledge is possible with less human supervision and with simpler computational means. We showed (Tufiş), for instance, that the accuracy of tagging new texts can go higher than 99% when very clean training corpora are used and language modelling is adequate with respect to the underlying technology. For this example (based on HMMs), we argued (Tufiş et al. 2000) that preserving in the language model distinctions that cannot be reliably made by a specific distributional analysis model is a source of noise and a cause of performance deterioration. The tagging technology is more or less the same in many projects, but the accuracy of the results varies (even for the same language), and the explanations can be found both in the quality of the training data and (mainly) in the quality of the tagsets. Aligning parallel texts is a very good example where simple techniques and limited linguistic knowledge (if any at all) can ensure surprisingly good results for the problem of interest.
Since charAlign (Gale, Church 1993) was published, many variants, refinements and improvements of this program have been implemented, but the basic underlying ideas remained extremely simple and thus easy to implement and fast in operation. Extracting bilingual dictionaries from corpora can be seen as a very fine-grained alignment process, where the aligned units are not paragraphs or sentences but words and phrases. Most approaches to this problem rely on statistical means to build translation lexica from bilingual texts, roughly falling into two categories: the hypothesis-testing approach, e.g. (Gale, Church 1991), (Smadja et al. 1996), and the estimating approach, e.g. (Brown et al. 1993), (Kupiec 1993), (Hiemstra 1997). The first approach involves a generative device that produces a list of translation equivalence candidates (TECs), each of them being subjected to a statistical independence test. The TECs that pass the test are assumed to be translation-equivalence pairs (TEPs). The second approach assumes building a statistical model from the data, the parameters of which are to be estimated according to a given set of assumptions. There are pros and cons for each type of approach, some of them discussed in (Hiemstra 1997). Our method does not fit neatly into either of these two categories, but is closer in spirit to the hypothesis-testing approach, in that it first generates a list of translation equivalence candidates and then iteratively extracts the most likely translation-equivalence pairs. The candidate list is constructed from the translation/alignment units (TU). That is to say, the translation of an item in a source-language sentence is looked for only in the corresponding aligned sentence(s) of the target language.
Source file                     Target file                        Alignment file
TOK  It      it\Pp3ns\P         LSPLIT  Într-   =\Spsay\S           …
TOK  was     be\Vmis3s\AUX      TOK     o       un\Tifsr\T
TOK  a       a\Di\D             TOK     zi      =\Ncfsrn\N
TOK  bright  bright\Af\A        TOK     senină  senin\Afpfsrn\A
…                               …                                   …
Table 1: The format of the input data

For instance, in Table 1, the first TU is made of the sentence identified by the id "Oro1.2.2.1" (Într-o zi senină…) and the two sentences identified by the ids "Oen.1.1.1.1" (It was a bright…) and "Oen.1.1.1.2" (Winston Smith, his chin…). The format of the input data conforms to the conventions adopted within the MULTEXT-EAST (MTE) project (which developed the "1984" multilingual corpus we worked with). Various preprocessing steps (tokenisation, alignment and tagging) were initially achieved by using tools developed within the MULTEXT project, of which MTE was a follow-up. Because of the interest in this corpus, it was extended with new languages within the TELRI project and further cleaned up within the CONCEDE project. We provide more details on the corpus in the "Experiments and results" section.

3. The baseline and the iterative algorithm

Based on the alignment, the first step is to compute a list of translation equivalent candidates (TECL). This list contains several sub-lists, one for each POS considered in the extraction procedure. Obviously, if the parallel text contains alignment and tagging errors, several real translation equivalents will not be found because they will not be members of the corresponding TECpos. Each POS-specific sub-list contains several pairs of tokens of the corresponding POS that appeared in the same TUs. These pairs (translation equivalence candidates - TECs) are generated by a Cartesian product of the set of tokens (of the given POS) in one half of the TU with the set of tokens (of the same POS) in the other half. Each pair has attached to it the number of occurrences of the respective association throughout all the TUs. The baseline algorithm consists in retaining from the list of TECs only those pairs which cannot be considered to co-occur just by chance. This hypothesis can be tested by different statistical tests (we used S. Banerjee's and T. Pedersen's Bigram Statistics Package, which includes chi-square, dice, mutual information and log-likelihood statistical measures). For instance, when the chi-square test is used, the coefficients (computed for each TEC) given by the formula

χ² = n** · (n11·n22 − n12·n21)² / ((n11+n12)·(n11+n21)·(n12+n22)·(n21+n22))

may be used to select the most likely candidates as TEPs. For a 99.9% confidence level, the threshold condition for rejecting the null hypothesis (that TS and TT co-occurred by chance) would be χ² > 10.83. One could also impose a minimal number of occurrences for a candidate pair (usually this is 3). This baseline algorithm may be enhanced in many ways (using a dictionary of already extracted TEPs to eliminate the generation of spurious TECs, stop-word lists, considering token string similarity etc.). An algorithm with such extensions (plus a few more) is described in (Gale, Church 1991). In spite of being extremely simple, this algorithm was reported to provide impressive results (Canadian Hansard, precision about 98% and recall about 50%). However, the response time is not among its assets, and it is not clear how or whether different translations of the same item are extracted. The iterative algorithm we propose is also very simple but significantly faster than the baseline. It can be enhanced in many ways (including those discussed above).
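The baseline test above can be sketched as follows (a minimal illustration, not the authors' Perl code): for each TEC a 2×2 contingency table is built from the TU co-occurrence counts and the chi-square statistic is compared against the 10.83 threshold. Deriving the marginal counts from the candidate list alone, as done here, is a simplification; the BSP package mentioned above provides the same measures ready-made.

```python
# Baseline selection of translation equivalence pairs via the 2x2 chi-square test.
def chi_square(n11, n12, n21, n22):
    """n11 = both tokens in the same TU, n12/n21 = only one of them, n22 = neither."""
    total = n11 + n12 + n21 + n22
    denom = (n11 + n12) * (n11 + n21) * (n12 + n22) * (n21 + n22)
    return total * (n11 * n22 - n12 * n21) ** 2 / denom if denom else 0.0

def baseline_teps(tec_counts, n_tus, min_occ=3, threshold=10.83):
    """tec_counts: {(src, tgt): co-occurrence count}; marginals derived from it (simplified)."""
    src_tot, tgt_tot = {}, {}
    for (s, t), n in tec_counts.items():
        src_tot[s] = src_tot.get(s, 0) + n
        tgt_tot[t] = tgt_tot.get(t, 0) + n
    teps = []
    for (s, t), n11 in tec_counts.items():
        if n11 < min_occ:
            continue
        n12 = src_tot[s] - n11           # s without t
        n21 = tgt_tot[t] - n11           # t without s
        n22 = n_tus - n11 - n12 - n21    # neither
        if chi_square(n11, n12, n21, n22) > threshold:
            teps.append((s, t))
    return teps
```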
The algorithm gets as input the aligned parallel corpus and the maximum number of iterations. At each iteration step, the pairs that pass the selection (see below) are removed from TECL, so that this list is shortened after each step and may eventually be emptied. Based on TECL, a contingency table (TBLk) is constructed for each POS, as shown in Table 2:

        TT1  …  TTn
TS1     n11  …  n1n   n1*
…       …    …  …     …
TSm     nm1  …  nmn   nm*
        n*1  …  n*n   n**
Table 2: TBLk - Contingency table with counts for TECs at step k

The rows of the table are indexed by the distinct source tokens and the columns are indexed by the distinct target tokens (of the same POS). Each cell (i,j) contains the number of occurrences in TECL of the corresponding TEC: nij = occ(TSi, TTj); ni* = Σj=1..n nij; n*j = Σi=1..m nij; and n** = Σi=1..m Σj=1..n nij. The selection condition is expressed by the equation:

(1) TPk = {<TSi; TTj> | ∀p,q: (nij ≥ niq) ∧ (nij ≥ npj)}

This is the key idea of the extraction algorithm, and it expresses the requirement that in order to select a TEC as a translation equivalence pair at step k, the number of associations of TSi with TTj must be higher than (or at least equal to) its number of associations with any other TTq (q ≠ j) represented in TBLk; the same must hold the other way around (for TTj with respect to any other TSp, p ≠ i). If TSi is translated in more than one way, the rest of the translations will be found in subsequent steps (if frequent enough). The most used translation of a token TSi will be found first.

4. Experiments and results

We conducted experiments on one of the few publicly available multilingual aligned corpora, namely the "1984" multilingual corpus (Dimitrova et al 1998) containing 6 translations of the English original. This corpus was developed within the Multext-East project, published on a CD-ROM (Erjavec et al 1998) and recently improved within the CONCEDE project (soon to be released to the research community; see CONCEDE's homepage: www.itri.brighton.ac.uk/projects/concede/). Each monolingual part of the corpus (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene) was tokenised, lemmatised, tagged and sentence aligned to the English hub.

Language                Bulgarian  Czech  English  Estonian  Hungarian  Romanian  Slovene
No. of wordforms        15093      17659  9192     16811     19250      14023     16402
No. of lemmas           8225       8677   6871     8403      9729       6626      7157
No. of >2-occ lemmas*   3350       3329   2916     2876      3294       3052      3189
Table 3: The lemmatised monolingual "1984" overview (* the number of lemmas does not include interjections, particles or residuals)

The number of lemmas in each monolingual part of the multilingual corpus, as well as the number of lemmas that occurred more than twice, are shown in Table 3. For validation purposes we set the step limit of the algorithm to 4. The evaluation was done fully for Estonian, Hungarian and Romanian and partially for Slovene (the first step was fully evaluated, while for the remaining steps randomly selected pairs were evaluated). The evaluators judged a translation pair as correct, partially correct or incorrect. Pairs that were judged as partially correct (all such cases appeared, as expected, in Hungarian and Estonian) corresponded to English multiword equivalents of the source words. For instance, the Estonian word "armastusministeerium" means "ministry of love", but our algorithm found as translation equivalents (armastusministeerium = ministry) and (armastusministeerium = love).
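A minimal sketch (in Python, not the authors' Perl implementation) of this iterative selection: at each step a candidate is accepted as a TEP if its count is maximal both in its row and in its column of the contingency table, accepted pairs are removed from the candidate list, and the procedure is repeated. POS-specific sub-lists, string-similarity heuristics and the other extensions mentioned above are omitted.

```python
# Iterative extraction of translation equivalence pairs per equation (1).
def extract_teps(tec_counts, max_steps=4, min_occ=3):
    """tec_counts: {(src_lemma, tgt_lemma): count}; returns a list of selected TEPs."""
    counts = {pair: n for pair, n in tec_counts.items() if n >= min_occ}
    extracted = []
    for _ in range(max_steps):
        row_max, col_max = {}, {}
        for (s, t), n in counts.items():
            row_max[s] = max(row_max.get(s, 0), n)
            col_max[t] = max(col_max.get(t, 0), n)
        selected = [(s, t) for (s, t), n in counts.items()
                    if n == row_max[s] and n == col_max[t]]
        if not selected:
            break
        extracted.extend(selected)
        for pair in selected:          # shorten TECL before the next iteration
            del counts[pair]
    return extracted
```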
For all such cases we used the BSP package to test whether the correct multiword equivalent passed a collocation test and, if so, we included the partially correct TEP in the class of correct pairs. Such a decision may be motivated on the grounds that a preliminary collocation analysis and recognition phase would have dealt with this issue. The figures in Table 4 show the results and the evaluation. In what follows, the notion of correctness of a pair is taken in the above-mentioned interpretation. The extracted bilingual lexicons are available at http://www.racai.ro/bi-lex/.

The precision (Prec) was computed as the number of correct TEPs divided by the total number of extracted TEPs. The recall (Rec*) was computed as the number of correct TEPs divided by the number of lemmas in the source language with at least 3 occurrences. When the (usual) threshold of a minimum of 3 occurrences is considered, the algorithm provides a high precision and a good recall. As one can see from the figures in Table 4, the precision is higher than 98% for Romanian and Slovene, almost 97% for Hungarian and more than 96% for Estonian. The recall (our Rec*, as defined above) ranges from 50.92% (Slovene) to 63.90% (Estonian). We ran the extractor for the Ro-En bitext without imposing a step limit. The program stopped after 25 steps with 2765 extracted pairs, out of which 113 were wrong. The precision decreased to 95.91%, but the recall improved significantly: 86.89%. We should mention that Rec*, as we compute it, is slightly different from the usual recall, because Rec* reports the percentage of correct translations found for the lemmas (occurring more often than the threshold) in the source language. Let us assume that in the source language there are N different lemmas occurring more often than the preset threshold and that the program found M correct translation equivalents. Further, let us assume that each lemma is used, on average, with S different senses, each of them occurring more often than the set threshold. Then our Rec* will be M/N, whereas it should be M/(N·S). As we specified before, different translations of the same lemma are usually found in subsequent steps. Since we set the number of iteration steps to 4, multiple valid translations will be found only for a few (very frequent) words. That is to say that Rec ≈ Rec*. However, when the number of iteration steps is increased, Rec* becomes an overestimate of Rec.

In an initial version of this algorithm we used a chi-square test (as in the baseline algorithm) to check the selected TEPs. Since the selection condition (EQ1) is very powerful, the vast majority of the selected TEPs passed the chi-square test, while many pairs that used to pass the chi-square threshold did not pass condition (EQ1); we therefore eliminated the supplementary and time-consuming statistical tests. This is certainly one of the reasons for the speed (see next section) of our extraction algorithm. If one source word has different translations in the target language (either lexicalisations of different senses of a polysemous source word or different synonyms for the target word), they are in general found, if frequent enough, in different iteration steps. For instance, when processing the Ro-En bitext of the "1984" parallel corpus, 10 correct TEPs were extracted for "mare" (big, great, large, vast, sea, long, main, thick, general, important), but none of them would have been found unless each pair appeared in TECL more than twice.
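To make the difference between Rec* and true recall concrete, the following is a hypothetical calculation with invented figures (not taken from Table 4):

```python
# Illustrative (invented) figures for the gap between Rec* and true recall
# when lemmas have, on average, several frequent translations.
N = 1000        # source lemmas above the occurrence threshold
S = 1.5         # average number of frequent senses/translations per lemma
M = 600         # correct translation pairs actually extracted

rec_star = M / N          # 0.60 - what the paper reports as Rec*
true_rec = M / (N * S)    # 0.40 - recall against all expected translations
print(rec_star, true_rec)
```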
         Bg-En             Cz-En             Et-En              Hu-En              Ro-En              Sl-En
         pairs Prec/Rec*   pairs Prec/Rec*   pairs Prec/Rec*    pairs Prec/Rec*    pairs Prec/Rec*    pairs Prec/Rec*
Step 1   1336  NA/NA       1399  NA/NA       1216  99.50/42.07  1299  98.61/38.88  1394  99.71/42.74  1177  99.91/36.87
Step 2   1741  NA/NA       1886  NA/NA       1617  97.89/55.04  1737  97.63/51.48  1867  99.30/52.23  1489  99.52/46.47
Step 3   1896  NA/NA       2085  NA/NA       1807  96.63/60.84  1863  96.99/54.85  2067  99.03/54.84  1589  99.06/49.63
Step 4   1986  NA/NA       2188  NA/NA       1911  96.18/63.90  1935  96.89/56.92  2182  98.57/56.36  1646  98.66/50.92
Table 4: The results after 4 iteration steps and partial evaluation

From the results shown in Table 4 one can notice that most of each bilingual lexicon is extracted in the first step (between 63% and 71%). We also tried to extract translation equivalents for words that appeared in the source language only twice. For validation reasons we did this experiment only for the Ro-En bitext. The experiment considered an extra iteration step, after the last one, with the occurrence threshold set to two. The number of pairs extracted in this step (1311) was more than half of the pairs extracted in all the previous steps together, but the number of erroneous pairs (297) was also almost 10 times higher than the previous number of errors (31). However, when the occurrence threshold was lowered from the first step, the number of correct pairs was almost the same (3174 versus 3165) but the number of errors was significantly higher (417 versus 328), with a global error rate increase from 9.38% to 11.61%.

5. Implementation

The extraction program is written in Perl and runs under practically any platform (Perl implementations exist not only for UNIX/Linux but also for Windows and MacOS). Table 5 shows the running time for each bitext in the "1984" parallel corpus. The program was run under Linux on a Pentium III/600 MHz with 96 MB RAM.

Language    Bg-En  Cz-En  Et-En  Hu-En  Ro-En (4 steps)  Ro-En (25 steps)  Sl-En
Time (sec)  181    148    139    220    183              415               157
Table 5: Extraction time (in seconds) for each of the bilingual lexicons

A quite similar approach to ours (also implemented in Perl) is presented in (Ahrenberg et al. 1998) and (Ahrenberg et al. 2000). The languages considered in their experiments are English and Swedish. For a novel of about half the length of Orwell's "1984" their algorithm needed 55 minutes on an UltraSparc 1 workstation with 320 MB RAM, and the best results reported are 96.7% precision and 54.6% recall. For a computer manual containing about 45% more tokens than our corpus, their algorithm needed 4.5 hours, with the best results being 85.6% precision and 67.1% recall. Unlike us, they do not rely on tagging and lemmatisation, although they use a "morphology" module that achieves some kind of stemming and grouping of inflectional variants. This strategy makes it possible to link low-frequency source expressions belonging to the same suffix paradigm. The obvious advantage of not using POS categorisation is that their approach would be able to identify TEPs where the POS of the source token is different from the POS of the target token. An explanation of the much better response time in our case, besides not using statistical tests (they use the t-test), is that the search space in our case is probably several orders of magnitude smaller.

6. Conclusions and further work

We presented a simple but very effective algorithm for extracting bilingual lexicons, based on a 1:1 mapping hypothesis.
We showed that if a language-specific tokeniser able to recognise and "pack" compounds is responsible for pre-processing the input to the extractor, the 1:1 mapping approach is no longer a limitation. The MULTEXT tokeniser, for instance, allows for the recognition of generic multiword expressions (dates, literally expressed numbers) or of specific ones based on external resources containing lists of compounds, proper names and idiomatic expressions. If the compounds cannot be dealt with in the segmentation pre-processing phase, one may consider either extending the bilingual lexicon extractor's model to an N:M paradigm or using a monolingual tool as a pre-processor for recognising the compounds. We are currently considering both options. For the first one we have started the implementation of a new tokeniser including collocation recognition. As most of the multiword tokens (found by the collocation extraction module) are not expected to be in the lexicon, a post-tagging module will check (based on very simple chunking grammars) whether the assigned tag is compatible with the tags recorded in the lexicon for the constituent (known) words. For the second option, we are carrying out some preliminary experiments with a slightly modified version of the program presented in this paper. Conceptually, the modified version of the program can be seen as receiving the same text as source and target input file, with all the sentence alignments being 1:1. Two additional modifications are:
· the TECL must not include pairs made of identical strings; this condition is necessary for limiting the search space to only the potential collocations
· the POS condition is removed; this restriction is no longer necessary since most sequences of words that should be translated as one unit are not characterised by the same POS.
A new and customisable version (implemented in C++) of the algorithm described in this paper, incorporating BSP, is under construction.

Acknowledgements

The research reported here started as an AUPELF/UREF co-operation project, coordinated by Patrick Paroubek of LIMSI/CNRS (CADBFR). Special thanks are due to Heiki Kaalep, Csaba Oravecz and Tomaž Erjavec for the validation of the Et-En, Hu-En and Sl-En extracted dictionaries.

References

Ahrenberg L, Andersson M, Merkel M 1998 A Simple Hybrid Aligner for Generating Lexical Correspondences in Parallel Texts. In Proceedings of COLING'98, Montreal, pp 29-35.
Ahrenberg L, Andersson M, Merkel M 2000 A knowledge-lite approach to word alignment. In Véronis J (ed), Parallel Text Processing. Text, Speech and Language Technology Series, Kluwer Academic Publishers, pp 97-116.
Brown P, Della Pietra S, Della Pietra V, Mercer R 1993 The mathematics of statistical machine translation: parameter estimation. Computational Linguistics 19(2): 263-311.
Dimitrova L, Erjavec T, Ide N, Kaalep H, Petkevič V, Tufiş D 1998 Multext-East: Parallel and Comparable Corpora and Lexicons for Six Central and East European Languages. In Proceedings of the 36th Annual Meeting of the ACL and 17th COLING International Conference, Montreal, pp 315-319.
Gale W, Church K 1991 Identifying word correspondences in parallel texts. In Fourth DARPA Workshop on Speech and Natural Language, pp 152-157.
Gale W, Church K 1993 A program for aligning sentences in bilingual corpora. Computational Linguistics 19(1): 75-102.
Erjavec T, Lawson A, Romary L 1998 East Meets West: A Compendium of Multilingual Resources. TELRI-MULTEXT EAST CD-ROM, 1998, ISBN: 3-922641-46-6.
Hiemstra D 1997 Deriving a bilingual lexicon for cross-language information retrieval. In Proceedings of Gronics, pp 21-26.
Kupiec J 1993 An algorithm for finding noun phrase correspondences in bilingual corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, pp 17-22.
Smadja F, McKeown K, Hatzivassiloglou V 1996 Translating collocations for bilingual lexicons: a statistical approach. Computational Linguistics 22(1): 1-38.
Tufiş D 2000 Using a Large Set of Eagles-compliant Morpho-Syntactic Descriptors as a Tagset for Probabilistic Tagging. In Proceedings of the Second Conference on Language Resources and Evaluation, Athens, pp 1105-1112.
Tufiş D, Dienes P, Oravecz C, Váradi T 2000 Principled Hidden Tagset Design for Tiered Tagging of Hungarian. In Proceedings of the Second Conference on Language Resources and Evaluation, Athens, pp 1421-1426.

The linguistic relevance of Corpus Linguistics
Tamás Váradi
Department of CL, Research Institute for Linguistics, Hungarian Academy of Sciences
varadi@nytud.hu

1. Introduction

The present paper is intended as a review of some basic principles of corpus linguistics (CL) from a linguistic point of view. In particular, it examines the aims and methods of CL in terms of how they stand up to some basic linguistic tenets regarding the nature and use of data for linguistic analysis. Central to the discussion will be the problem of representativeness, which will be discussed with reference to an influential paper by Douglas Biber (1993). The perspective from which CL is examined in this paper is its utility and relevance for linguistic description. This is just one of the contexts in which CL is used today, and indeed it may not account for the bulk of its applications. It can hardly be doubted that CL has proved its utility in areas of human language technology too numerous to mention. Within fields closer to linguistics, its role in lexicography is nothing short of revolutionizing the whole practice of the discipline, introducing a new technology that has produced a new generation of dictionaries. These are undeniable achievements that have established CL as a thriving and dynamically growing field. Amidst this burgeoning activity it is perhaps opportune to take this occasion to reflect on some underlying assumptions and methodological principles characterizing current practice in CL.

2. The concept of language and corpus linguistics

Fundamentally, CL undertakes an empirical study of language. As Leech (2000: 685) rightly points out, this means that CL is 'patently' concerned with documenting and analysing performance, or the Saussurean notion of parole. The notion of a performance grammar is somewhat misleading in that CL is obviously concerned with the actual product of language use as against the processes involved in speech production. The main theoretical claim that CL can make as a contribution to linguistics is that, through the compilation and analysis of masses of data, corpus linguistics can provide a solid, objective, empirical foundation on which to build a grammar of a language; indeed, the technological facility to sift through large amounts of data leads to new insights into the structure of the language. One key aspect of language use that CL is particularly well suited to reveal is its quantitative characteristics. It is to be noted, though, that the very concept of language that CL aims to capture is most forcefully rejected by Chomsky.
Earlier, he consistently defined language as a set of sentences (Chomsky 1957, 1965), which, at least on the face of it, seemed to be congruous with the concerns of CL in the sense that, typically, CL also approaches language in terms a set of sentences. However, in later works he came to regard the traditional concept of language, which he now prefers to call Externalized-language, as “an epiphenomenon at best” (Chomsky 1986: 25) a notion so laden with “complex and obscure sociopolitical, historical and normative-teleological elements (Chomsky 1991: 31) that he is doubtful if the concept can be given a coherent interpretation at all. The primary object of linguistic investigation in Chomsky's view is the Internalized-language, the mental grammar that each native speaker has internalized and which they draw upon in their linguistic communication. Even if one does not share Chomsky's dismissive opinion on the relative standing of E-language visa- vis I-language, the dichotomy is one which CL also has to face. The issue boils down to the need to account for I-language from the facts of E-language as displayed in a corpus. It seems reasonable to take the view that language primarily exists in the minds of the speakers. Hence the starting point and focus of the linguistic enquiry should be the individual speakers with their I-language, their communicative competence and the actual products of their language use. Focusing on the actual products of actual language use by individual speakers of a language should, on the face of it, be quite amenable to the methods of CL. 588 Even at the level of the idiolect, the I-language of individual speakers, the research program raises some inherent theoretical problems stemming from the need to extrapolate from finite data to an infinite system. The difficulties are compounded, however, by the attempt to deal with the linguistic output of a group of speakers, let alone that of a whole language community. The hypothesis of an ideal hearerspeaker of a homogeneous speech community (Chomsky 1965) does save the theoretical linguist from a host of complications but CL cannot operate from this perspective. The ambition of CL was from the very beginning to provide an empirical record of a language at the level of the whole speech community. As an interesting historical aside, we should note that this ambition was absent from what McEnery and Wilson (1996) term early phase of CL i.e. prior to Chomsky. Without the technological support ensuring a semblance of feasibility, one could not even entertain the idea of capturing and processing enough data for the whole language community. Nor was this the intent. Harris (1951:12), advocates a method of corpus compilation that proceeded in close cooperation with informants. This piecemeal, interactive procedure was driven by the expectations of the field workers, based on experience, about the completeness and consistency of the grammar that they were seeking to build. 3. E-language as a sampling problem As noted above CL deals with the realm of actual language use. The primary raw data it encounters is a set of utterances produced by the language community. At first sight, this seems to be an infinite set. However, the number of words produced either in speech or writing are limited by obvious human physiological limits. Biber et al. (1999: 27) quote the figure of around 7000 words per hour as the average speech rate observed in the conversational part of the Longman Spoken and Written English Corpus. 
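For illustration, a back-of-the-envelope calculation of the kind implied here; every figure except the 7000 words per hour speech rate quoted above is invented:

```python
# A rough, purely illustrative estimate of the total spoken output of a
# (hypothetical) speech community over one year.
speakers = 10_000_000        # assumed size of the speech community
hours_of_talk_per_day = 2    # assumed average conversation time per speaker
words_per_hour = 7_000       # rate quoted from Biber et al. (1999: 27)

words_per_year = speakers * hours_of_talk_per_day * words_per_hour * 365
print(f"{words_per_year:.2e} words per year")   # on the order of 5e13 - finite, but vast
```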
Therefore, once the inevitable fuzziness in the geographical, social and temporal boundaries of the notion of the language community is somehow resolved, one can put a number to the totality of language produced by any set of speakers over a given period of time. Nevertheless, even if, as the above thought experiment suggested, the set of utterances produced by a language community in a given time interval is finite in size, it is unrealistic to expect that the totality of language production could ever be captured on electronic media. The bottleneck is not necessarily storage or technological capacity. While the incredible rate of advancement of computerization and the exponential spread of the Internet may eventually make the bulk of written language output at least accessible, capturing spoken output in a corresponding manner is not only unfeasible and even imponderable. This fact leaves us to conclude that any corpus, however large, is and will necessarily remain a mere sample of the totality of language output. That is how the issue of representativeness in the design of the corpus assumes key importance. From the very beginning the aim of CL was to compile a corpus that was representative of a language. In terms of the concepts introduced above, this means nothing less than to design a corpus that models the totality of language use of a speech community. This is certainly a tall order, given the complexity and the scope of the phenomena that it undertakes to cover. In practice, though, the task was attempted from the outset with some reasonable limitations in the temporal and geographic dimensions of the data. The pioneering Brown corpus (Kueera and Francis 1967) set out to capture the written language of the United States of the year 1963. It was intended to be a general purpose, balanced corpus of American English of the period. 4.2 discusses how representativeness was achieved and Table 1 displays the corresponding figures. Another claim that Corpus Linguistics makes is that it shows up ‘language as is spoken', real language in its rawness and richness. This intention in obviously inherent in the whole corpus linguistic enterprise of capturing vast amount of actual data. Apart from marketing purposes, CL only needed emphasizing this in contraposition to the ruling generative linguistic school, which tended to base its findings on introspective evidence. 4. Basic design issues It is clear that the key issue for Corpus Linguistics to make good its promises lies in the scope and composition of the data that it provides. This will be the focus of our attention for the rest of the paper. It is widely agreed that a corpus is not simply an archive of texts but rather a principled collection of texts. One of the first and most important principles referred here concern the selection of texts to go into the corpus. 589 The first question that arises in examining this issue is whether we should care too much about the composition of the corpus. Accordingly, there developed two kinds of schools of thought supporting two kinds of corpora: the so-called opportunistic and the balanced corpora. 4.1 Monitor corpus vs. balanced corpus It is fairly easy to deal with the opportunistic kind as it denies that there is any principled way to balance a corpus and it makes recourse to the law of large numbers. Perhaps size will automatically sort out all questions of ‘balance’ in the structure of the data. This approach is vigorously represented by Sinclair (1991 pp. 
23-24) who proposes instead the idea of a monitor corpus – a very large corpus, which after reaching some sort of a saturation point will undergo a partial self-recycling: the new material flowing in will be subjected to an automatic monitoring process which will only retain those parts of the incoming data which show some significantly different features than the stable part of the data. Once it is decided that some sort of scheme will be set up to compile a corpus in some principled way, the question that confronts us is whose job is it to do so. Sinclair (op. cit.: 13) holds that it is a task that should belong to the students of culture rather than corpus linguists. They should only undertake it as a matter of necessity. The use of language, Sinclair seems to argue, should be studied in the wider cultural context, which goes beyond the competence of the corpus linguist. 4.2 Units of sampling Another important sampling issue to decide is the units of the overall population in terms of which the sample will be compiled. Should the sample be compiled in terms of the speakers or language? If the latter is chosen, as it was originally done, (without apparently considering any alternative) what are to be the linguistic units in terms of which the population is sampled: words, sentences, texts, speech situations etc.? In the first generation of balanced corpora, the Brown and the LOB corpus, this issue was decided by a panel of experts who designed a scheme where different varieties of language, called genres, are represented in specific proportions. Table 1 shows how the 1 million word corpus is divided into 15 genres and how many texts of 2000 word length each are allocated into each category. Note how despite the professed intention to develop a replica of the pioneering American corpus for British English, the internal composition of the LOB corpus was slightly changed in categories E,F and G). These subtle changes were introduced so as to accommodate the structure of the corpus to the peculiarities of British culture. As for the selection of the particular texts, apparently, a great deal of effort was spent into making sure that the texts within each category were chosen at random but I am not aware of any public arguments offered in justification for the particular ratios used between the categories. 4.3 Methods of sampling Choosing things at random suggests itself as a safe procedure to eliminate any bias or skewing in the result. However, purely random sampling works against the selection of items that are relatively rare in the population, out of which the sample is made. An important principle that a sample should meet in order to be representative of the population is that the sample should show the same ratios between elements within the sample as they have in the population. Samples are, as it were, severely scaled down versions of the population. The more frequent an item is in the population, the better chance it stands of being selected at random. Therefore, it may easily happen that items which occur pretty rarely in the population, will not be selected by the random process at all. Alternatively, if for some reason or other, we would like to see the rare items included in the sample, we would have to increase the size of the sample, perhaps out of all manageable proportions. 
Genres                                                                  No. of texts
                                                                        Brown   LOB
A  Press: report                                                        44      44
B  Press: editorial                                                     27      27
C  Press: reviews                                                       17      17
D  Religion                                                             17      17
E  Trades, hobby, leisure                                               36      38
F  General lore                                                         48      44
G  Belles lettres, biography, essays                                    75      77
H  Misc. government documents, public reports, university catalogues    30      30
J  Scientific journals                                                  80      80
K  General fiction                                                      29      29
L  Crime fiction                                                        24      24
M  Science fiction                                                      6       6
N  Adventure and Western                                                29      29
P  Romance                                                              29      29
R  Humour                                                               9       9
Total                                                                   500     500
Table 1: Composition of the BROWN and the LOB corpus

One solution devised to overcome the above difficulty is to use stratified random sampling. Under this procedure the population is first divided into a number of categories (strata), and random sampling is only applied to fill up the chosen categories with items selected at random. The question of how many categories to set up into which the population is arranged, and how much data should be collected for each category, is decided beforehand. (These are indeed the figures shown in Table 1 for the BROWN and the LOB corpus.) The taxonomy of the categories is established independently of statistical considerations. Yet it has a direct bearing on the quantitative results as well. Once a category is established, it is bound to be represented in the sample. For example, if we have a general category for reviews, chance will decide whether the random sampling will select any articles on reviews of early twentieth century travel books. (Chance will be helped by the number of such articles in the whole population, in that the more there are the higher the chances that a purely random method will select them.) If, on the other hand, a special category is adopted to cover travel books, this is taken as a target to be met and the selection procedure is considered incomplete until data is selected for that category as well. Hence, the granularity of the classification scheme will affect the structure of the sample as well. An even more direct intervention in the workings of chance is the setting of target figures for the amount of data to be collected within each category (i.e. the figures against the categories in Table 1 representing the number of texts, each about 2000 words long). In order for a sample to be representative of the population for the set of categories in terms of which the sample is compiled, the sample should conform to the principle of proportionality. This requires the various categories in the sample to be represented in the same ratio as they are in the total population. For the BROWN corpus to qualify as a representative sample of the totality of written American English for 1963 as regards humorous writing, it would have to be established that humorous writings did make up 1.8% of all written texts created within that year in the US. This single requirement serves to illustrate the enormous difficulty, if not impossibility, of the task. Surely, it is simply not feasible to put a figure on the amount of text within the various genres in the totality of texts produced by a speech community. Yet this is what the statistical concept of a representative sample calls for. Note that the difficulty is not necessarily that of dealing with an infinite set. It is, rather, inherently a logical one. If sampling is done in terms of text type, a representative sample would require knowledge about the whole population that is simply not available.
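A minimal sketch of stratified random sampling with fixed quotas, of the kind described above; the genre labels, quota figures and document pool are illustrative assumptions, not the actual Brown/LOB selection procedure.

```python
import random

def stratified_sample(pool, quotas, seed=0):
    """pool: {genre: [document, ...]}; quotas: {genre: number of texts to draw}."""
    rng = random.Random(seed)
    sample = {}
    for genre, k in quotas.items():
        candidates = pool.get(genre, [])
        if len(candidates) < k:
            raise ValueError(f"not enough texts for stratum {genre!r}")
        sample[genre] = rng.sample(candidates, k)   # random selection within the stratum
    return sample

# e.g. quotas = {"A Press: report": 44, "R Humour": 9}  # target figures as in Table 1
```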
If it were, we would hardly need a sample, and in order to find out about proportions obtaining in the population, one would obviously like to examine a sample of it.

4.4 Demographic vs. context-based sampling

How can we break this vicious circle? One lesson obviously is that one can only provide a representative sample of the population in terms of features about which one has reliable knowledge from some independent source. One such source of outside knowledge is indeed available in data about the speakers. One could consult National Census figures to find out about chief characteristics of speakers such as age, gender, schooling, type of settlement they live in, etc. It is then feasible to compile a representative sample of speakers for such selected features. This type of demographic sample of informants is a well-established procedure in opinion poll surveys and in psychological or socio-linguistic research. For corpus linguistics, the total output of such a representative group of speakers would ipso facto amount to a representative corpus of the speech population. This procedure was indeed used for the spoken component of the British National Corpus (cf. Burnard 1995: 20-25). 124 adults were selected so that, as far as practical limitations allowed, they would be represented in equal numbers in terms of sex, age (divided into six age groups) and social class (defined in four main categories). The recruited informants were asked to record their conversations, unobtrusively whenever possible, for a period of up to a week. Approximately four million words were collected in this manner, a little under half of the spoken component of the BNC, which in turn, for obvious practical constraints, made up one tenth of the 100 million word corpus. The rest of the spoken component, termed the context-governed part, was selected by "a priori linguistically motivated categories" defined in terms of a hierarchy of categories, with the four context categories educational, business, public/institutional and leisure at the top, and three regional and two interaction type categories providing further subdivisions. It should be noted that the demographic sample used by the BNC cannot be considered representative in the sense of the sample being proportional to the population. Curiously, for reasons not disclosed in the manual, the BNC did not make use of actual demographic figures about the relative proportions of sex, age and social class obtaining in the UK population. Instead, "the intention was, as far as possible, to recruit equal numbers of men and women, equal numbers from each of the six age groups, and equal numbers from each of four social classes" [emphasis added] (Burnard op. cit.: 21). This methodological laxness in almost the only area where the information required to compile a representative sample was, in fact, available is hardly compensated for by the care taken to use "established random location sampling procedures" to select individual members within the groups. Despite the undeniable practical difficulties of implementing it, the demographic sampling technique was applied in a limited way on purpose. The Reference Guide notes that 'many types of spoken text are produced only rarely in comparison with the total output of all "speech producers": for example, broadcast interviews, lectures, legal proceedings and other texts produced in situations where – broadly speaking – there are few producers and many receivers.
A corpus constituted solely on the demographic model would thus omit important spoken text types'. (Burnard op. cit.: 20) 5. Biber's notion of representativeness The issues reviewed so far are certainly nothing new to practitioners of the field. With predictable regularity a discussion flares up on the Corpora List around the notion of the balanced corpus. Newcomers to the discussion are often referred to Douglas Biber's article “Representativeness in Corpus Design” (Biber 1993), which is indeed one of the most comprehensive discussions of the topic available in print1. The rest of the paper will concentrate on this as a canonical text, particularly as it reflects views that are upheld by Biber in essentially the same form in more recent works (Biber 1998, Biber et al 1999). Biber distinguishes three possible approaches to corpus design depending on whether they are aimed at covering text production, text reception and texts as products. The first two are basically different from the third in that they both define the population in terms of the agents (i.e. speaker/hearer) of language use, while the third covers it in terms of the output i.e. language. Accordingly, the first two approaches would call for a demographic sample. However, Biber also rejects demographic samples on the grounds that “they would not represent the range of text types in a language, since many kinds of language are rarely used, even though they are important on other grounds.’ […] It would thus be difficult to stratify a demographic corpus in such a way that it would insure representativeness of the range of text categories. Many of these categories are very important, however, in defining a culture” [emphasis added] (op. cit.: 245). This revealing passage spells out some assumptions that may be difficult to reconcile with some basic assumptions about the role of corpus linguistics. One of the fundamental aims of Corpus linguistics as I understand it is to show up language as is actually attested in real life use. However, Biber seems to argue 1 There is one unwritten item that comes to mind: there was a live debate held in Oxford between prominent advocates of the two corpus design philosophies Quirk aided by Leech speaking up for the balanced corpus vs. Sinclair and Meijs arguing for the open-ended monitor corpus. Oral tradition has it that the debate was decided by the audience in favour of the Sinclair team. 592 that in designing a corpus one should apply a notion of importance that is derived from a definition of culture. For lack of any means of operationalizing this criterion of relative importance in culture, this throws the door wide open to subjective judgment in the compilation of the body of data that is expected to provide solid empirical evidence for language use. Biber seems to think very little of the value of a corpus assembled on demographic criteria. “Such a corpus would permit summary descriptive statistics for the entire language represented by the corpus. These kinds of generalizations, however, are not typically of interest for linguistic research”, “… it is not necessary to have a corpus to find out that 90% of the texts in a language are linguistically similar (because they are all conversations)”; rather, we want to analyse the linguistic characteristics of the other 10% of the texts since they represent the large majority of the kinds of registers and linguistic distributions in a language” (op. cit.: 248). 
Biber concedes that there is no a priori way to establish the relative proportions of the different genres obtaining in the population hence a representative sample would have to be demographic by definition. This impasse leads Biber to conclude that the notion of representativeness as we know it from statistics do not apply in corpus linguistics. What lies at the root of the problems to implement representativeness is the principle of proportionality that has been discussed above. Biber not only considers proportional sampling difficult or unfeasible to implement in any other way than the demographic approach but also goes as far as to simply reject the notion of proportional sample as an appropriate concept. In justifying his position he makes the following curious argument: “proportional samples are representative only (sic!) in that they accurately reflect the relative numerical frequencies of registers in a language – they provide no representation of relative importance that is not numerical. Registers, such as books, newspapers, and news broadcasts are much more influential than their relative frequencies indicate.” [emphasis added] (op. cit. : 248) First, it is disingenuous to find fault with proportional sampling for something it is not intended for i.e. to reflect this non-numerical relative importance. Second, there is no suggestion how this kind of importance can be established, let alone quantified in any objective manner. No attempt is made to show how to measure and accommodate the extent of the influence of the above registers. Earlier, we already noted the potential methodological danger for arbitrary decisions creeping in the corpus design principles. One cannot avoid feeling that once recourse is made to non numerical factors such as importance in compiling the corpus, this makes the whole enterprise of corpus design so vulnerable to subjective value judgments that any amount of methodological rigour applied in the random selection of the items for categories looks like the farcical effort of searching for the lost key where there is light. Rejecting the traditional notion of representative sampling based on the principle of proportionality, Biber blandly declares that “language corpora require a different notion of representativeness”, “researchers require language samples that are representative in the sense that they include the full range of linguistic variation existing in a language.” (op. cit.: 247) First of all, one must voice serious misgivings about any attempt to divest such a key term of its well-established meaning, which has a clear interpretation to statisticians and the general public alike. Of course, any self-respecting corpus would like to advertise itself as a representative corpus. There is such a strong and unanimous expectation from the public and scholars alike for corpora to be representative that it is an assumption that is virtually taken for granted. However, to meet this demand by the semantic exercise of redefining the content of the term is a move that hardly does credit to the field. 6. Conclusions My aim with this brief overview of the issues in corpus design has been to highlight the linguistic implications of the choices that are made. By highlighting on the uncertainties, inconsistencies and methodological fudges currently employed in corpus linguistics, my intention was to show up where further effort is needed. 
The picture that emerges helps to dispel the unintended disparity in scientific rigour: in order to live up to the expectations placed on it, corpus linguistics must put its methodology on a more solid footing, and users of corpus linguistics would do well to be aware of the linguistic issues at stake and the extent to which they can expect ready solutions.

References

Biber D 1993 Representativeness in corpus design. Literary and Linguistic Computing 8(4): 243-257.
Biber D, Conrad S, Reppen R 1998 Corpus Linguistics. Investigating Language Structure and Use. Cambridge, Cambridge University Press.
Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman Grammar of Spoken and Written English. London, Longman.
Burnard L (ed) 1995 British National Corpus. Users Reference Guide for the British National Corpus. Oxford, University Computing Service.
Chomsky N 1957 Syntactic Structures. The Hague, Mouton.
Chomsky N 1965 Aspects of the Theory of Syntax. Cambridge Mass., MIT Press.
Chomsky N 1986 Knowledge of Language: Its Nature, Origin and Use. New York, Westport, London, Praeger.
Chomsky N 1991 Linguistics and cognitive science: Problems and mysteries. In Kasher A (ed) The Chomskyan Turn. Oxford, Blackwell, pp 26-53.
Kučera H, Francis W 1967 Computational Analysis of Present-Day American English. Providence RI, Brown University Press.
Leech G N 2000 Grammars of Spoken English: New Outcomes of Corpus-Oriented Research. Language Learning 50(4): 675-724.
McEnery T, Wilson A 1996 Corpus Linguistics. Edinburgh, Edinburgh University Press.
Harris Z S 1951 Methods in Structural Linguistics. Chicago, University of Chicago Press.
Johansson S, Leech G N, Goodluck H 1978 Manual of information to accompany the Lancaster-Oslo/Bergen Corpus of British English, for use with digital computers. Department of English, University of Oslo.
Sinclair J 1991 Corpus, Concordance, Collocation. Oxford, Oxford University Press.

Corpus-based versus intuition-based lexicography: defining a word list for a French learners' dictionary
Serge Verlinde and Thierry Selva
Modern Language Institute, K.U. Leuven (Belgium)

1. Introduction

Although French lexicographers were among the first to integrate corpus analysis into the dictionary-making process, with the Trésor de la langue française project in the early seventies and its corpus of 170 million words, corpus-based lexicography is certainly not a common practice in contemporary lexicography in France. There are a few exceptions, however, e.g. the Oxford-Hachette English-French/French-English translation dictionary (Corréard 1997) and the Dictionnaire d'apprentissage du français des affaires (DAFA - Binon, Verlinde, Van Dyck, Bertels 2000), but mainstream lexicography is undoubtedly intuition-based. As far as we know, no comparative studies have been made of the results of the two lexicographic approaches. The aim of this paper is to present such a comparative study on a selection of words to be described in a learners' dictionary of French. On the one hand, we have a recent learners' dictionary, published by one of the leading French dictionary publishers (Dictionnaires Le Robert), where the selection of the entries is intuition-based (Dictionnaire du français - DF, Rey-Debove 1999). On the other hand, we have the DAFLES (Dictionnaire d'apprentissage du français langue étrangère ou seconde), an (electronic) French learners' dictionary we are presently working on. We try to use an objective frequency criterion to select the words and multiword units described in our dictionary.
Therefore we use an automated statistical analysis of a 50 million word corpus of newspaper texts, taken from the 1998 issues of Le Monde (France) and Le Soir (French speaking part of Belgium). In the two first sections of this paper, we will present this corpus and the analyses made on it. In the third section, we will compare the two lists of words in order to reveal the most important differences between these two lists. In the last section, we will make another comparison, namely with the only large frequency list existing for French, which has been published along the Trésor de la langue française (TLF), the Dictionnaire des fréquences (Imbs 1971). 2. The corpus Two important questions arise when building a corpus: its representativeness and its size. For the French language there is currently no project like the British National Corpus (BNC 2000) or the Bank of English (BOE 2000); therefore we must rely on the texts that are freely accessible (see Verlinde, Selva (forthcoming) for an overview of available corpora for French). The choice is limited to literature in the public domain and newspaper texts published on archive CD-ROMs. In order to cover actual language, we have chosen the 1998 issues of two newspapers: Le Monde (France) and Le Soir (French speaking part of Belgium). Both CD-ROMs permit the texts of all articles to be exported. In the case of Le Soir, exporting can be done by date or by newspaper section. In the case of Le Monde, there is no clear-cut classification of the articles. Therefore, only exporting by date makes it possible to export all articles. The corpus has a total size of 54 260 926 words, both subcorpora having approximately the same size. We cannot say that our corpus is perfectly balanced, but it is made up of the kind of texts that the potential users of our dictionary will have to deal with. At the first stage, we cut all documentary information about the articles from the corpus. Indications about source, date and page have been coded separately in the format of the text analysis software we use (see below). The whole corpus was then tagged and lemmatised with the Cordialsoftware, splitting up all multi-word units like chemin de fer and pomme de terre and removing proper nouns. The result of this analysis was processed in order to restore the aspect of the original texts. We submitted the entire lemmatised corpus (51 845 143 words) to Wordcruncher, a well-known text analysis tool. As Wordcruncher was not able to merge both subcorpora, we have merged the two separate frequency lists of both subcorpora to create a frequency list for the whole corpus. This frequency list has been corrected on some minor points. For example, frequent words written with a 595 hyphen that were split up during the lemmatisation process have been extracted from the original corpus and added to the list. Some errors of lemmatisation have also been corrected. Our corpus is smaller than the two big English corpora but it seems to be large enough, taking into account our objective of writing a learners’ dictionary with a selection of the most common vocabulary, collocations and grammatical structures of the current language. 3. Frequency list and dictionary word list For the lexicographer, it is particularly difficult to define the importance of a dictionary word list. We see that in many cases, the number of entries (macrostructure of the dictionary) is much more important than the content of each entry (microstructure of the dictionary). 
Thus the accent lies on single words more than on word combinations (collocations), for instance. This is paradoxical because, for productive purposes, learners need this information on collocations much more (Bogaards 1996, 1998) than an impressive number of isolated words, many of which will never be looked up. We decided provisionally to limit the word list to 12 156 words, selecting all lemmas that appear at least 100 times in our corpus. These 12 156 lemmas represent approximately 93.14% of all the words of the corpus, proper nouns not included. Extending this word list to 22 000 words, as in the DF, would only increase the coverage of the texts by 1%. It is surprising to see that this limited list contains a large number of words that are very common in spoken language: maman, papa, job, sympa, bosser, for example. There are also a lot of words that should perhaps not appear in a learners' dictionary because they are immediately linked to current affairs (bosniaque, kosovar) or to the local pages of the newspapers (brabançon, brainois, borain).

Another aspect of vocabulary use that can easily be studied with the frequency list is the spread of English words in French. It is well known that the French authorities follow a policy of "defending" the French language by suggesting, quite systematically, French equivalents for English words. We might suppose that newspapers, mainly the French ones, would try to reinforce this policy, but this seems not to be the case in light of the frequency of some English words (Table 1).

                   frequency Le Monde   frequency Le Soir
business           446                  471
coach              83                   1424
cool               108                  169
design             305                  312
fast-food          42                   88
goal               69                   108
holding            573                  505
joint(-)venture    84                   95
leasing            29                   69
lobbying           137                  114
marketing          847                  767
team               76                   590
trader             48                   35
Web/web            1057                 514
Table 1: frequency of some English words in the Le Monde and Le Soir corpora

Even for those words that have an accepted and well-known French equivalent (affaires for business, but for goal, équipe for team), the English word seems to be used quite regularly. In some cases, the French equivalent is not used at all (mercatique for marketing). The fact that we are working with corpora from two different language communities also makes it possible to compare the vocabulary used in both communities and to extract words that are specific to one of these communities by comparing the relative frequency of their occurrences in both corpora. Table 2 shows an extract of the list of typical French and Belgian words and abbreviations.

typical French words            typical Belgian words
ballottage                      échevin
préfectoral                     maïeur/mayeur
départemental                   communal
baccalauréat                    députation
minitel                         tram
préfet                          subside
lycéen                          play-off
intéressement                   coach
cantonal                        voirie
interministériel                urbanistique

typical French abbreviations    typical Belgian abbreviations
mdc                             asbl
insee                           prl
cfdt                            cpas
smic                            psc
cgt                             rtbf
Table 2: list of typical French and Belgian words and abbreviations

Such information on geographical variants is rarely mentioned in the essentially France-oriented French dictionaries.
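A minimal sketch (not the authors' Wordcruncher-based workflow) of the two operations just described: merging the Le Monde and Le Soir lemma frequency lists into one list with the 100-occurrence cut-off, and extracting community-specific words by comparing relative frequencies. The file names, the tab-separated format and the ratio threshold are illustrative assumptions.

```python
from collections import Counter

def read_freq_list(path):
    """Read a lemma frequency list, assumed to contain one 'lemma<TAB>count' entry per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            lemma, count = line.rstrip("\n").split("\t")
            counts[lemma] += int(count)
    return counts

le_monde = read_freq_list("lemonde_lemmas.txt")   # hypothetical file names
le_soir = read_freq_list("lesoir_lemmas.txt")

# Merged word list with the >= 100 occurrences cut-off (12 156 lemmas in the paper).
merged = le_monde + le_soir
word_list = [lemma for lemma, n in merged.most_common() if n >= 100]

def typical_of(freq_a, freq_b, min_occ=50, min_ratio=10.0):
    """Words whose relative frequency in corpus A is much higher than in corpus B."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    out = []
    for lemma, n_a in freq_a.items():
        if n_a < min_occ:
            continue
        rel_a = n_a / total_a
        rel_b = (freq_b.get(lemma, 0) + 1) / total_b   # add one to avoid division by zero
        if rel_a / rel_b >= min_ratio:
            out.append(lemma)
    return out

belgian_words = typical_of(le_soir, le_monde)   # expected to surface items such as "échevin"
```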
As the authors do not say that they integrated a corpus analysis, it is possible to make a comparison between a corpusbased approach and an intuition-based approach, at least for the word list of the dictionary. Similar comparisons could be made for the collocations and the syntactic structures that are described in the dictionary. Table 3 shows to what extent the word list of the DF matches the words of the corpus frequency list. corpus frequency ranges number of words not mentioned in the DF percent cumulative frequency cumulative percent 0-500 0 0 0 0 501-1000 2 0,4 2 0.2 1001-1500 3 0,6 5 0.3 1501-2000 1 0,2 6 0.3 2001-2500 10 2 16 0.6 2501-3000 16 3,2 32 1.1 3001-3500 18 3,6 50 1.4 3501-4000 28 5,6 78 2 4001-4500 40 8 118 2.6 4501-5000 45 9 163 3.3 5001-5500 48 9,6 211 3.8 5501-6000 61 12,2 272 4.5 6001-6500 58 11,6 330 5.1 6501-7000 67 13,4 397 5.7 7001-7500 39 7,8 436 5.8 7501-8000 87 17,4 523 6.5 8001-8500 80 16 603 7.1 8501-9000 102 20,4 705 7.8 9001-9500 120 24 825 8.7 9501-10000 115 23 940 9.4 597 10001-10500 110 22 1050 10 10501-11000 129 25,8 1179 10.7 11001-11500 154 30,8 1333 11.6 11501-12000 124 24,8 1457 12.1 Table 3: corpus frequency ranges and DF word list The conclusion that can be drawn from this table is that 12.1% of the 12 000 most frequent words of our corpus do not appear in the DF. The differences in coverage are limited up to frequency 4000 with a difference less than 10%. From frequency 4 000 on, and mainly from frequency 8 500 on, the differences in coverage increase seriously (up to 20% and more). In the list of ‘forgotten’ words and abbreviations, we find investisseur, budgétaire, entité, concertation, restructuration, infrastructure, forum, info, privatisation, amendement for example. These words need to be mentioned in a general purpose dictionary. When we have a look at the words mentioned in the DF that do not appear in our frequency list, we notice that these words can not really be considered as current words (table 4, excerpt from the beginning of the letter A). a fortiori abetissant abreuvoir accessoiriste a gogo abjurer abricotier accotement a jeun ablution abrutir accouder (s') a.z.t. aboiement abrutissant accoudoir abasourdi abois (aux) abscisse accoutrement abat-jour abominablement absenter (s') accoutrer abats abortif abyssin accroupir (s') abattant aboutissants acadien accumulateur abattis abracadabrant acariâtre accus abetir abrasif accablement achalandé Table 4: DF words not appearing in the corpus frequency list The authors of the DF identify the frequent and important words by marking them with a blue triangle. In our learners’ dictionary we classify the words into six frequency ranges (table 5). frequency range Range occurrences text coverage 1 <= 427 >= 11 183 66 % 2 <= 990 >= 5 273 75 % 3 <= 1 926 >= 2 482 82 % 4 <= 3 920 >= 854 88 % 5 <= 12 156 >= 100 93 % 6 < 100 100 % Table 5: DAFLES frequency ranges Both frequency indications can be linked and compared. Once again, the intuitive approach seems to be less rigorous than a corpus-based approach: words like acajou, adipeux, ablation, affairé and affublé are “frequent and important” according to the authors of the DF but not a, année, américain, allemand, afin de/que. Looking into detail at the whole list of “frequent and important” words for the letter A of the DF reveals however also some weaknesses of our corpus-based approach. Some everyday life words as s'absenter do not appear into our frequency list. They need certainly to be added to a learners’ dictionary word list. 4. 
Comparing two corpus-based frequency lists: literature and newspapers As it was mentioned above, the Dictionnaire des fréquences (Imbs 1971) is a frequency list published along the TLF. It is the only frequency list based on a large corpus (170 million words) for 598 French with literary texts from the beginning of the nineteenth century to the sixties. We selected the first 12 174 lemmas of this list in order to compare them to our frequency list. It is not surprising that a lot of current words as régional, match, euro, championnat, football, culturel, télévision and festival, for example, do not appear in the frequency list of the Dictionnaire des fréquences. Words that do not appear in our list mainly characterise personal feelings (sottise, fâché, gémir, tressaillir) and things that do not exist anymore (pardessus, sou, écu). In addition to everyday life words mentioned above, words expressing feelings in general form a second important group of words to be added to the word list of our dictionary. 5. Conclusion From the comparison of both lexicographic approaches (corpus-based and intuition-based), we can conclude that corpus-based lexicography gives a strong and necessary empirical evidence to the lexicographer's personal intuition, even if this personal intuition remains helpful in filling the gaps in our corpus. These gaps are undoubtedly due to the fact that the corpus is unbalanced. Taking into account this observation, there is a strong need to design and construct for French, and for other languages as well, a carefully selected corpus with a large variety of texts, in order to improve the quality of (learners') dictionaries, and vocabulary learning and teaching in general. References Binon J, Verlinde S, Van Dyck, J, Bertels A 2000 Dictionnaire d'apprentissage du fraçais des affaires. Paris, Didier. (an electronic version of this dictionary can be found at http://www.projetdafa.net). Bogaards P 1996 Dictionaries for learners of English. International Journal of Lexicography 9(4): 277-320. Bogaards P 1998 Des dictionnaires au service de l'apprentissage du français langue étrangere. Cahiers de lexicologie 72(1): 127-167. Corréard M-H (ed.) 19972 The Oxford-Hachette French Dictionary. Paris-Oxford, Hachette-OUP. Imbs P 1971 Dictionnaire des fréquences. Vocabulaire littéraire des XIXe et XXe siecles, I - Table alphabétique, II - Table des fréquences décroissantes. Nancy-Paris, CNRS-Didier. Rey-Debove J (ed.) 1999 Dictionnaire du français. Référence. Apprentissage. Paris, CLE International-Dictionnaires le Robert. TLF. Imbs P 1971-1994 Trésor de la langue française. Paris, CNRS-Gallimard. Verlinde S, Selva Th forthcoming Nomenclature de dictionnaire et analyse de corpus. Cahiers de lexicologie. Websites BNC 2000: http://info.ox.ac.uk/bnc/ BOE 2000: http://titania.cobuild.collins.co.uk/boe_info.html 599 Sense tagging: does it make sense? Jean Véronis Université de Provence 29, Avenue Robert Schuman, 13100 Aix-en-Provence (France) Jean.Veronis@up.univ-mrs.fr Sense tagging is probably one of the challenges that corpus linguists have to face in the near future. So far, computerisation of this task has yielded very modest results despite numerous efforts, and sense tagging is turning out to be a touchy task. Difficulties stem from various sources, extracting disambiguating information from the context. However, one of the main problems that lies upstream of the disambiguating process is the sense inventory itself. 
Most tagging efforts rely on traditional dictionaries to supply the reference senses, or on computer-oriented resources such as WordNet, which do not differ significantly from traditional dictionaries in terms of sense division. The present paper shows that human taggers perform very poorly when given a traditional dictionary as the reference, and that machines should therefore not be expected to perform any better if the same kind of resource is used. A detailed analysis reveals the lack of distributional criteria in dictionary entries: traditional dictionaries are chiefly concerned with meaning definition, and not with the surface clues (syntactic, collocational, etc.) that are required to match a given sense with a given corpus occurrence. It is argued that no fundamental progress can be made until large-scale lexical resources have been built that incorporate extensive distributional information, and that, until that time, any massive sense tagging efforts based on traditional dictionaries or computer-oriented resources such as WordNet would not only be premature but also questionable in terms of resource management. Keywords: sense tagging, polysemy judgements, interannotator agreement, dictionaries, distributional information, word sense disambiguation. 600 A corpus – based analysis of how accurately printed Romanian obeys to some universal laws Adriana Vlad, Adrian Mitrea, and Mihai Mitrea “POLITEHNICA” University of Bucharest Faculty of Electronics and Telecommunications 1-3 Iuliu Maniu Bvd., Bucharest, Romania, vadriana@vala.elia.pub.ro A main objective of the paper is how accurately printed Romanian complies with the stationarity hypothesis. A statistical approach to NL stationarity, based on the mgram structure is presented. The statistical inferences are: estimation theory with multiple confidence intervals, test of the hypothesis that probability belongs to an interval and test of the equality between two probabilities. The b size of the type II statistical error plays a special role in the designing of a corpus for mathematical purposes. The stationarity investigation was also used to investigate how accurately printed Romanian complies with two frequency–rank laws. Key words: natural language stationarity, frequency–rank laws, multiple confidence intervals for probability. 601 Exploiting the WWW as a corpus to resolve PP attachment ambiguities Martin Volk University of Zurich Department of Computer Science, Computational Linguistics Group Winterthurerstr. 190, CH-8057 Zurich volk@ifi.unizh.ch 1. Introduction Finding the correct attachment site for prepositional phrases (PPs) is one of the hardest problems when parsing natural languages. An English sentence consisting of a subject, a verb, and a nominal object followed by a prepositional phrase is a priori ambiguous. The PP in sentence 1 is a noun attribute and needs to be attached to the noun, but the PP in 2 is an adverbial and thus part of the verb phrase. (1) Peter reads a book about computers. (2) Peter reads a book in the subway. If the subcategorisation requirements of the verb or the competing noun are known the ambiguity can sometimes be resolved. But many times there are no clear requirements. Therefore, there has been a growing interest in using statistical methods that reflect attachment tendencies. This new line of research was kicked off by Hindle and Rooth (1993). They tackled the PPattachment ambiguity problem (for English) by computing lexical association scores over a partially parsed corpus. 
If a sentence contains the sequence V+NP+PP, the triple V+N+P is observed, with N being the head noun of the NP and P being the head of the PP. The probabilities are estimated from co-occurrence counts of V+N and of N+P. They evaluated their method on manually disambiguated verb-noun-preposition triples. It resulted in 80% correct attachments. In the meantime the method has been improved and extended. The best reported results are from Stetina and Nagao (1997: up to 88% correct attachment). They use a supervised learning approach (they train the disambiguator over the Penn Treebank) and a semantic dictionary to cluster the words.
We applied unsupervised statistical methods to German. Since there is no large German treebank available, we first worked with a partially parsed corpus. The gathering of co-occurrence data is more complicated for German because of its variable constituent ordering. In Langer et al. (1997) we show that we can achieve around 76% attachment accuracy for the decidable cases. But many cases cannot be decided because of sparse data. Therefore we have experimented with using the WWW, a corpus that is orders of magnitude larger than our locally accessible corpora. With the help of a WWW search engine we obtain frequency values ("number of pages found"). In querying a search engine we lose some precision compared to corpus analysis. Our hypothesis is that the size of the WWW will compensate for our rough queries.
Our method for determining co-occurrence values is based on a simple formula. We use the frequency of a word co-occurring with a given preposition against the overall frequency of this word. For example, if some noun N occurs 100 times in a corpus and this noun co-occurs with the preposition P 60 times, then the co-occurrence value of N+P will be 60/100 = 0.6. The general formula is (where X can be either a noun N or a verb V):

cooc(X,P) = freq(X,P) / freq(X)

In Volk (2000) we have explored this formula in detail. We have shown that the WWW frequencies can be used for the resolution of PP attachment ambiguities if the difference between the competing co-occurrence values is above a certain threshold. In this way the co-occurrence values served to decide 58% of our test cases with an attachment accuracy of 75%.
In the more successful experiments for PP attachment in English (Stetina and Nagao 1997, Collins and Brooks 1995) the co-occurrence statistics included the noun within the PP. The motivation behind this becomes immediately clear if we compare the PPs in the example sentences 3 and 4. Since both PPs start with the same preposition, only the noun within the PP helps to find the correct attachment.
(3) Peter saw the thief with his own eyes.
(4) Peter saw the thief with the red coat.
In a new round of experiments we have included the head noun of the PP in the queries. This means we are now working with the extended formula:

cooc(X,P,N2) = freq(X,P,N2) / freq(X)

Let us look at an example sentence from our corpus:
(5) Unisource hat die Voraussetzungen für die Gründung eines Betriebsrates geschaffen.
    Unisource has set up the prerequisites for the foundation of a work council.

          freq(X,P,N2)                             freq(X)                            cooc(X,P,N2)
X = N1    freq(Voraussetz.,für,Gründung) = 274     freq(Voraussetzungen) = 255'010    cooc(Voraussetz.,für,Gründung) = 0.001074
X = V     freq(geschaffen,für,Gründung) = 139      freq(geschaffen) = 172'499         cooc(geschaffen,für,Gründung) = 0.000805

The co-occurrence value cooc(N1,P,N2) is higher than cooc(V,P,N2), and thus the model correctly predicts noun attachment for the PP.
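To make the calculation concrete, here is a minimal Python sketch of the extended co-occurrence measure using the page counts from example (5); the function name and hard-coded figures are purely illustrative and are not part of the original experiments.

    # Illustrative sketch: cooc(X,P,N2) = freq(X,P,N2) / freq(X), computed from
    # "number of pages found" values returned by a web search engine.
    def cooc(freq_x_p_n2: int, freq_x: int) -> float:
        """Co-occurrence value of X (a noun N1 or a verb V) with preposition P and noun N2."""
        return freq_x_p_n2 / freq_x if freq_x else 0.0

    # Page counts from example (5):
    cooc_n1 = cooc(274, 255_010)   # Voraussetzungen ... für ... Gründung  -> approx. 0.001074
    cooc_v = cooc(139, 172_499)    # geschaffen ... für ... Gründung       -> approx. 0.0008
    print("noun attachment" if cooc_n1 > cooc_v else "verb attachment")

2.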
Preparation of the test corpus We manually compiled a treebank as a test suite for the evaluation of our method. We semiautomatically disambiguated and annotated 3000 sentences. In order to be compatible with the German NEGRA treebank we used the same annotation scheme as Skut et al. (1997). We selected our evaluation sentences from the 1996 volume of the ComputerZeitung, a weekly computer magazine that is available on CD-ROM (Konradin-Verlag 1998). We tagged the text and selected 3000 sentences that contained 1. at least one full verb and 2. at least one sequence of a noun followed by a preposition. With these conditions we restricted the sentence set to those sentences that contain a prepositional phrase in an ambiguous position. Manually assigning a complete syntax tree to a sentence is a labour-intensive task. This task can be facilitated if the most obvious phrases are automatically parsed. We used our chunk parser for NPs and PPs to speed up the manual annotation. We also used the NEGRA Annotate-Tool (Brants et al. 1997) to semi-automatically assign syntax trees to all (preparsed) sentences. This tool comes with a built-in parser that can suggest categories over selected nodes. The sentence structures were judged by two linguists to minimize errors. Finally, completeness and consistency checks were applied to ensure that every constituent was included into the sentence structure. We then used a Prolog program to build the nested structure and to recursively work through the annotations in order to obtain sextuples with the relevant information for the PP classification task: 1. the full verb (a separated verbal prefix is reattached), 2. the real head noun N1 (the noun which the PP is attached to), 3. the possible head noun N1 (the noun that immediately precedes the PP; this noun leads to the attachment ambiguity), 4. the preposition of the PP, 5. the core noun of the PP (called N2), and 6. the attachment decision (as given by the human annotators). Let us illustrate this with some example sentences. (6) Das Dorfmuseum gewährt nicht nur einen Einblick in den häuslichen Alltag vom Herd bis zum gemachten Bett. The village museum allows not only insights into the everyday life from the oven to the bed. (7) ... nachdem dieses wichtige Feld seit 1985 brachlag. ... since this important field lay idle since 1985. (8) Das trifft auf alle Waren mit dem berüchtigten "Grünen Punkt'' zu. This holds true for all goods with the ill-famed "Green Dot''. 603 These corpus sentences will lead to the following sextuples: verb real N1 possible N1 prep. N2 (in PP) function of the PP gewährt Einblick Einblick in Alltag postnominal modifier gewährt Alltag Alltag vom Herd postnominal modifier gewährt Alltag Herd bis Bett postnominal modifier brachlag / Feld seit 1985 verb modifier zutrifft Waren Waren mit Punkt postnominal modifier Each sextuple represents a PP with the preposition occurring in a position where it can be attached either to the noun or to the verb. Note that the PP auf alle Waren in 8 is not in such an ambiguous position and thus does not appear in the test cases. In sentence 6 we observe the difference between the real head noun and the possible head noun. The PP bis zum gemachten Bett is not attached to the possible head noun Herd but to the preceding noun Alltag. Obviously, there is no real head noun if the PP attaches to the verb (as in 7). In the following tests we use the real reference noun N1 if it is present else the possible reference noun N1. 
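The sextuple can be represented directly as a small record type; the following Python sketch is only an illustration of the data structure described in the enumeration above, and the field names are mine rather than the paper's.

    # Illustrative sketch of one test case extracted from the treebank.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TestCase:
        verb: str                  # full verb, with a separated prefix reattached
        real_n1: Optional[str]     # noun the PP is actually attached to (None for verb attachment)
        possible_n1: str           # noun immediately preceding the PP
        preposition: str
        n2: str                    # core noun of the PP
        attachment: str            # "noun" or "verb", as decided by the human annotators

    # e.g. the first sextuple derived from sentence (6):
    # TestCase("gewährt", "Einblick", "Einblick", "in", "Alltag", "noun")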
Our test corpus consists of 4383 test cases, out of which 63% are noun attachments and 37% verb attachments.
3. Disambiguating with WWW frequencies
We queried AltaVista in order to obtain the frequency data for our co-occurrence values. For all queries we use AltaVista advanced search restricted to German documents. For co-occurrence frequencies we use the NEAR operator.
· For nouns and verbs we query for the word form by itself.
· For co-occurrence frequencies we query for Verb NEAR preposition NEAR N2 and N1 NEAR preposition NEAR N2, again using the verb forms and noun forms as they appear in the corpus.
The NEAR operator in AltaVista restricts the search to documents in which its argument words co-occur within 10 words. We then compute the co-occurrence values for all cases in which both the word form frequency and the co-occurrence frequency are above zero. We evaluate these co-occurrence values against our test corpus using the following disambiguation algorithm.

    if (cooc(N1,P,N2) && cooc(V,P,N2)) then
        if (cooc(N1,P,N2) > cooc(V,P,N2)) then noun attachment
        else verb attachment
    else noun attachment

If both co-occurrence values exist, the attachment decision is based on the higher value. If one or both co-occurrence values are missing, we decide in favour of noun attachment, since 63% of our test cases are noun attachment cases. The disambiguation result is summarized in table 1.

                    correct   incorrect   accuracy
noun attachment       2553       1129      69.34%
verb attachment        495        206      70.61%
total                 3048       1335      69.54%

Table 1: Attachment accuracy for the complete test corpus

The attachment accuracy is improved by 6.5% compared to pure guessing. But it is way below the accuracy that we computed for the decidable cases in earlier experiments. Even in the WWW many of our test triples do not occur. Only 2422 (55%) of the 4383 test cases can be decided by using both co-occurrence values. The attachment accuracy for these test cases is 74.32% and thus about 5% higher than when forcing a decision on all cases (cf. table 2).

                    correct   incorrect   accuracy
noun attachment       1305        416      75.83%
verb attachment        495        206      70.61%
total                 1800        622      74.32%

Table 2: Attachment accuracy when requiring both cooc(N1,P,N2) and cooc(V,P,N2)

3.1. Using the co-occurrence values against a threshold
A way of tackling the sparse data problem lies in using partial information. Instead of insisting on both the cooc(N1,P,N2) and cooc(V,P,N2) values, we can back off to either value for those cases with only one value available. Comparing this value against a given threshold, we decide on the attachment. If, for instance, cooc(N1,P,N2) is available (but no cooc(V,P,N2) value), and if this value is above the threshold, then we decide on noun attachment. If cooc(N1,P,N2) is below the threshold we take no decision. Thus we extend the disambiguation algorithm as follows:

    if (cooc(N1,P,N2) && cooc(V,P,N2)) then
        if (cooc(N1,P,N2) > cooc(V,P,N2)) then noun attachment
        else verb attachment
    elseif (cooc(N1,P,N2) > threshold) then noun attachment
    elseif (cooc(V,P,N2) > threshold) then verb attachment

Now the problem arises of how to set the threshold. It is obvious that the attachment decision gets more reliable the higher we set the threshold. At the same time the number of cases that are decidable decreases. We suggest setting the threshold in such a way that using this partial information is not worse than using both the cooc(N1,P,N2) and cooc(V,P,N2) values. That means that we set the threshold so that we keep the overall attachment accuracy at around 75%.
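The extended decision procedure can be written compactly; the following Python sketch simply restates the algorithm above (the names and the default threshold value are illustrative), with missing co-occurrence values represented as None.

    # Illustrative sketch of the disambiguation algorithm with threshold back-off.
    from typing import Optional

    def decide(cooc_n1: Optional[float], cooc_v: Optional[float],
               threshold: float = 0.001) -> Optional[str]:
        if cooc_n1 is not None and cooc_v is not None:
            return "noun" if cooc_n1 > cooc_v else "verb"
        if cooc_n1 is not None and cooc_n1 > threshold:
            return "noun"
        if cooc_v is not None and cooc_v > threshold:
            return "verb"
        return None   # undecidable; may be defaulted to noun attachment, the majority class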
We thus set the threshold to 0.001 and obtain the result in table 3. The attachment rate (the number of decidable cases) has risen from 55% to 63%; 2768 out of 4383 cases can be decided, based either on both co-occurrence values or on the comparison of one co-occurrence value against the threshold. Noun attachment is still better than verb attachment.

                    correct   incorrect   accuracy
noun attachment       1448        446      76.45%
verb attachment        629        245      71.97%
total                 2077        691      75.04%

Table 3: Attachment accuracy when requiring either cooc(N1,P,N2) or cooc(V,P,N2)

3.2. Using the co-occurrence values of word forms and base forms
The above frequencies were based on word form counts. But German is a highly inflecting language for verbs, nouns and adjectives. If a rare verb form (e.g. a conjunctive verb form) or a rare noun form (e.g. a new compound form) appears in the test corpus, it often results in a zero frequency for the triple. We may safely assume that the co-occurrence tendency is constant over the different verb forms. We may therefore substitute the rare verb form with a more frequent form of this verb. We decided to query with the given verb form and with the corresponding verb lemma (the infinitive form). For nouns we also query for the lemma. As a special case we reduce compound nouns to the last compound element and we compute the lemma for the last element (e.g. Informationssystemen → System). We do the same for hyphenated compounds (e.g. GI-Kongresses → Kongress). We also reduce company names ending in GmbH or Systemhaus to these keywords and use them in place of the lemma (e.g. CSD Software GmbH → GmbH). The co-occurrence value is thus computed as (where X is the verb V or the reference noun N1):

cooc(X,P,N2) = (freq(Xform,P,N2) + freq(Xlemma,P,N2)) / (freq(Xform) + freq(Xlemma))

The disambiguation algorithm is the same as above and we use the same threshold of 0.001. As table 4 shows, the attachment accuracy stays at around 75% but the attachment rate increases from 63% to 71% (3109 out of 4379 test cases can be decided).

                    correct   incorrect   accuracy
noun attachment       1615        459      77.87%
verb attachment        735        300      71.01%
total                 2350        759      75.59%

Table 4: Attachment accuracy including threshold and lemmas

In order to complete the picture we evaluate without using the threshold. We get an attachment accuracy of 74.72% at an attachment rate of 65%. This is a 10% increase over the result we computed for word forms (cf. table 2). If, in addition, we use any single co-occurrence value (i.e. we set the threshold to 0), the attachment accuracy slightly decreases to 74.23% at an attachment rate of 85%. This means that for 85% of our test cases we have at least one co-occurrence value from the WWW frequencies. If we default the remaining cases to noun attachment we end up with an accuracy of 73.08%, which is significantly higher than our initial result of 69.54% (reported in table 1).
3.3. Conclusion
The most important lesson from these experiments is that triples (X,P,N2) are much more reliable than tuples (X,P) for deciding the PP attachment site. Using a large corpus such as the WWW helps to obtain frequency values for many triples and thus provides co-occurrence values for most cases. Furthermore, we have shown that querying for word form and lemma substantially increases the set of decidable cases and thus the attachment rate without any loss in the attachment accuracy. The accuracy is 74% for all decidable test cases and 73% for all test cases.
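The pooled word-form/lemma query just summarised can be sketched in Python as follows; the helper page_count is a placeholder of my own for a search-engine query (e.g. an AltaVista-style NEAR query) and does not correspond to any API used in the experiments.

    # Illustrative sketch: pool the counts for the attested word form and its lemma.
    def page_count(*terms: str) -> int:
        """Placeholder for a search-engine query returning the number of pages
        on which the given terms co-occur (e.g. a NEAR query restricted to German)."""
        raise NotImplementedError

    def cooc_with_lemma(form: str, lemma: str, p: str, n2: str) -> float:
        """cooc(X,P,N2) computed over word-form and lemma counts combined."""
        numerator = page_count(form, p, n2) + page_count(lemma, p, n2)
        denominator = page_count(form) + page_count(lemma)
        return numerator / denominator if denominator else 0.0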
We can further enhance the cooccurrence frequencies by querying for all word forms, as long as the WWW search engines index every word form separately. If we are interested only in highly reliable disambiguation cases (80% accuracy) we may lower the number of decidable cases by increasing the threshold (or by requiring a minimal distance between cooc(V,P,N2) and cooc(N1,P,N2) as we have shown for tuples in Volk, 2000). When using frequencies from the WWW the number of decidable cases should be higher for English since the number of English documents in the WWW by far exceeds the number of German documents. Still the problem remains that querying for co-occurrence frequencies with WWW search engines using the NEAR operator allows only for very rough queries. For instance, the query P NEAR N2 does not guarantee that the preposition and the noun co-occur within the same PP. It matches even if the noun precedes the preposition. There are various possibilities for improved queries. 1. X NEAR "P DET N2" with an appropriate determiner DET will query for the sequence "P DET N2'' and thus for P and N2 co-occurring in a standard PP. 606 2. X NEAR (P NEXT 3 N2) will query for N2 as one of the three tokens following P. The NEXT operator is often available in information retrieval systems but not in the WWW search engines that we are aware of. This query is more flexible than querying for a standard PP. 3. "N1 P" NEXT 3 N2 will query for noun N1 and preposition P immediately following each other as is most often the case if the PP is attached to N1. 4. V SAME_SENTENCE (P NEXT 3 N2) will query for the verb V co-occurring within the same sentence as the PP. From a linguistic point of view this is the minimum requirement for the PP being attached to the verb. In fact, to be linguistically precise we must require the verb to co-occur within the same clause as the PP. But none of these operators is available in current search engines. Obviously, any of these constraints will reduce the frequency counts and may thus lead to sparse data. We will therefore have to counterbalance this with querying for words that behave similarly with respect to PP attachment, for instance, words from the same semantic class. Acknowledgement We thank Charlotte Merz for comments and corrections on earlier versions of this paper. References Brants T, Skut W, Krenn B 1997 Tagging grammatical functions. In Proc. of EMNLP-2, Providence, RI. Collins M, Brooks J 1995 Prepositional phrase attachment through a backed-off model. In Proc. of the Third Workshop on very large corpora. Hindle D, Rooth M 1993 Structural ambiguity and lexical relations. Computational Linguistics, 19(1): 103-120. Konradin-Verlag 1998 Computer Zeitung auf CD-Rom. Volltextrecherche aller Artikel der Jahrgänge 1993 bis 1998. Leinfelden-Echterdingen, Konradin-Verlag. Langer H, Mehl S, Volk M 1997 Hybride NLP-Systeme und das Problem der PP-Anbindung. In Wermter S, Busemann S, Harbusch K (eds), Berichtsband des Workshops "Hybride konnektionistische, statistische und symbolische Ansätze zur Verarbeitung natürlicher Sprache" auf der 21. Deutschen Jahrestagung für Künstliche Intelligenz, KI-97 (auch erschienen als DFKIDocument D-98-03), Freiburg. Skut W, Krenn B, Brants T, Uszkoreit, H 1997 An annotation scheme for free word order languages. In Proceedings of the 5th Conference on Applied Natural Language Processing, Washington, DC, pp 88-95. Stetina J, Nagao M 1997 Corpus based pp attachment ambiguity resolution with a semantic dictionary. 
In Zhou J, Church K (eds), Proc. of the 5th Workshop on very large corpora, Beijing and Hongkong, pp 66-80. Volk M 2000 Scaling up. Using the WWW to resolve PP attachment ambiguities. In Proc. of Konvens- 2000. Sprachkommunikation, Ilmenau, VDE Verlag, pp 151-156. 607 A corpus-based methodology for comparing and evaluating different accents Martin Weisser Department of Linguistics and Modern English Language Lancaster University 1. Learner corpora and their uses The past few years have seen a growing interest in the development of learner corpora (cf. Granger, 1998). However, so far the main emphasis for this kind of corpus has been on collecting and marking up written texts produced by native and non-native speakers. This was mainly done in order to be able to determine where the most common problems and differences in the usage of written language, both as far as grammar and lexis are concerned, lie for the learners. Learner corpora of spoken language are still relatively rare and even if they do exist, often consist of written orthographic transcriptions, possibly minimally enriched with symbols to indicate lengthening, etc. only. Attempts at producing learner data relating to pronunciation have so far, to my knowledge, been restricted to setting up a flatfile database documenting the realisations of school children, as in the Austrian Learner's Database (Wieden/Nemser, 1991b), and to parts of the ISLE Project1. 1.1. Corpora vs. LE speech databases As pointed out above, learner corpora are so far mainly being used to investigate learner behaviour with regard to grammar and lexis, but very little attention has been paid to issues of pronunciation. Phonetic transcriptions of spoken language on a corpus scale are generally the domain of Language Engineering (LE), where the purpose has mainly been to collect speech data that may be used for applications such as speech recognition, speech synthesis, telebanking applications or spoken dialogue systems. However, the material recorded for these purposes is in many cases severely limited with respect to the particular requirements of the application it is being collected for, e.g. series of numbers or certain keywords needed for communicating with the application. The first notable exception to this kind of corpus is the aforementioned Austrian Learner Database, which documents the pronunciation of Austrian schoolchildren of varying age-groups and from different regions. The purpose of this database, amongst others, was to establish which particular problems in pronunciation for theses learners may be introduced by their L1. However, this database had some serious drawbacks and flaws due of its design and implementation, some of which will be discussed with the relevant topics further below. The second exception is the material collected for the EU-funded ISLE project, which aims to provide an architecture for incorporating speech recognition technology into language-learning software products. 1.2. Transcription, encoding and fonts Due to the relative high use of UNIX-based workstations in LE, transcriptions in LE speech databases are often based on ASCII representation schemes of the IPA character sets, such as SAMPA (Wells et al., 1992) or use other mapping algorithms, which work well for the computer, but often make it difficult for the human reader to understand the transcriptions, especially if the latter contain many diacritics. 
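As a toy illustration of the readability problem, the following Python sketch converts a few English SAMPA symbols into IPA characters; the mapping table is deliberately partial and is my own, not part of any of the resources discussed here.

    # Toy illustration: map a handful of English SAMPA symbols to IPA characters.
    SAMPA_TO_IPA = {
        "tS": "tʃ", "dZ": "dʒ", "A:": "ɑː", "O:": "ɔː", "3:": "ɜː",
        "S": "ʃ", "Z": "ʒ", "T": "θ", "D": "ð", "N": "ŋ",
        "@": "ə", "{": "æ", "I": "ɪ", "U": "ʊ", "V": "ʌ", "E": "ɛ",
    }

    def sampa_to_ipa(transcription: str) -> str:
        out, i = [], 0
        while i < len(transcription):
            for length in (2, 1):                      # longest match first
                chunk = transcription[i:i + length]
                if chunk in SAMPA_TO_IPA:
                    out.append(SAMPA_TO_IPA[chunk])
                    i += length
                    break
            else:
                out.append(transcription[i])           # pass unknown symbols through
                i += 1
        return "".join(out)

    # sampa_to_ipa("D@ kO:p@s")  ->  'ðə kɔːpəs'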
The transcription and encoding system used in the Austrian database is even more complex since it not only uses numerical codes to represent the transcriptions themselves, but also includes codes that represent information about the relative closeness of the pronunciations to RP and any deviations from the implicit norms underlying the analysis. For example, one of the representations of the word rubber given in Wieden/Nemser, 1991b (p. 354) is “00004540000200001000000050020000”. Coding schemes such as this may make it easy to conduct relatively exact statistical comparisons containing minute details of pronunciation, but are extremely difficult to handle and may lead to results biased more towards the quantitative than qualitative side. To facilitate research of a more qualitative nature, the use of standard IPA (True-type) fonts is thus preferable. 1 http://nats-www.informatik.uni-hamburg.de/~isle/speech_text.html 608 2. Spoken language models applied to learner English 2.1. RP vs. native speaker corpus-based reference models When learners of English are assessed or tested, this is usually done against implicit reference models such as RP (Received Pronunciation) for British English or GenAm (General American) for American English because these represent the standardised varieties the learners have supposedly been taught. Furthermore, learners are often expected to enunciate far more clearly and possess a far higher rhetorical skill than native speakers of different varieties of English. Not only is this unrealistic and an unfair practise towards the learners, but it is also, at least as far as British English is concerned, an unrepresentative model of the language as RP is only spoken by about 3 % of the overall population (Hughes/Trudgill, 1997: 3), and even then in different forms (Wells, 1982: 279-301). However, so far no attempts at creating more realistic models of native speakers English have been made that could serve as an adequate basis for comparison between native and non-native speakers. 3. Creating speaker/learner population models The methodology I propose here represents a synthesis of ideas from corpus linguistics – especially learner corpora – and LE. Creating a kind of reference model from the native speaker realisations in the first instance makes it possible to compare native and non-native speaker data by establishing an adequate basis for comparison, rather than relying on the more abstract established teaching models, such as RP. Such a model would highlight tendencies and patterns in native speaker variation against which one can then evaluate the performance of non-native speakers realistically, and in turn identify problem areas that suggest possible changes in teaching methodology and practice. Apart from comparing native and non-native speakers in this way, there are also other implications and usages of such a methodology, which are discussed towards the end of this paper. 3.1. What needs to be stored? In order to create suitable models for analysing the spoken language of different speaker populations a number of different types of information need to be stored. Just like in corpora of written language, it is important to store orthographic representations of the spoken material and as far as possible enrich these by at least incorporating morphosyntactic information in the form of grammatical tags2. In my implementation, the orthographic material, i.e. 
a dialogue created and read by 7 native and 10 nonnative speakers, is stored word-by-word in a table and information about the word-classes, i.e. grammatical tags, in another. More than for purposes of handling written language, this type of information needs to be complemented by detailed information about the speakers, such as age, sex, place of birth, etc. because these factors may have a strong influence on the speech behaviour. For non-native speakers it is also important to record details about their exposure to the target language, in order to be able to establish ‘proficiency’ levels. The part that makes a spoken corpus essentially most different from a written one is that not only does a spoken corpus/database need to incorporate the transcriptions themselves, but also needs to give the user access to the original recordings in form of sound files, so that the original transcriptions can easily be verified and potentially corrected. This is why simple annotated text files generally do not represent a suitable storage format for spoken data. More often than not, genuine spoken corpora therefore tend to take on the form of applications that allow linking in and controlling analysis and playback tools (cf. Deutsch et al., 1998). The methodology I describe in this paper is based around an MS Access 2000 database application that already allows for some of these features, while some others are yet to be implemented. It is, for example, already possible to start a speech analysis program with a specified soundfile and at a specific offset, in order to do transcriptions and to control this program to some extent from a form via Visual Basic for Applications (VBA). 2 For more information on different types of annotation, see Leech et al. (1998). 609 3.2. Transcription issues As already mentioned above, the use of fonts may present some difficulties, especially if maximum compatibility between different operating system platforms is required. However, with the increasing importance of Unicode3 and at least some Linux implementations now supporting the use of True-type fonts, the use of IPA fonts has become less of an issue. Working on a Windows-based system, the obvious choice for my implementation is to use an IPA font whose character representations are saved in the transcription table(s). Since MS Access unfortunately does not allow for multiple fonts within the same table and the font I use only contains a limited amount of numbers, which would therefore make it unsuitable for representing large IDs for linking pronunciations to words, transcriptions and comparisons in the database are made accessible via MS Access forms that can display different fields of a table using different fonts. Figure 1 and Figure 2 below demonstrate the difference between the two forms of representation. Figure 1 – A snapshot of the Realisations table. Figure 2 – The Realisations form, open for speaker E03. As Figure 2 shows, the Realisations form also provides buttons that make it easier to input some of the characters and diacritics that would normally have to be input via character codes as they are not mapped onto the keyboard. Something else that can be seen from the illustration above is that in order to be able to conduct precise phonetic and phonemic analyses, a great amount of detail should be included in the initial transcription. 
This is necessary because in empirical analyses like mine, and especially when comparing speakers from different countries, it is extremely difficult to predict which 3 Although existing Unicode fonts still often lack in typographic quality as far as phonetic characters are concerned. 610 features may turn out to be relevant. For example, after having listened to some of my data, I had initially assumed that one of the major differences between the native and non-native speakers may be the amount of creak in the realisations of the latter, but it later turned out that both speaker populations use creaky voice, while this feature only becomes a distinguishing marker in certain contexts. A further issue in the transcription spoken data is consistency. In a large scale implementation of my methodology, it would be extremely important to have a sufficient amount of transcriptions verified by different transcribers since this is the only way in which any internal consistency could be guaranteed – and thus also any statistical validity. 3.3. Phonetic categories Unlike in a written corpus, where words are generally separated by whitespace or punctuation marks, in transcriptions of spoken data it is important to look at the context in which each word appears. Transitions between words thus represent an important and relatively easily categorisable feature of pronunciation and often serve a distinctive markers between different speaker populations. It is therefore important to keep a record of the phenomena occurring between words alongside the general transcription. Figure 2 above shows a list-box on the right-hand side from which the different transitions can be selected. These transitions contain both simple types, such as assimilation, elision, etc., but also information about pauses and other complex types, such as assimilation + elision, assimilation before a short pause, assimilation before a long pause, etc. Information about stress patterns can easily be incorporated by including primary and secondary stress marks in the transcriptions, complemented by information about possible multi-word-units, such as compounds, etc. Furthermore, other suprasegmental information, such as the length of the realisation of different objects in the text, i.e. the length of the dialogue (text) itself, of individual sentences, phrases, and down to the level of individual words if necessary. At present, my database(s) only contains information down to the level of individual sentences. 3.4. ‘Phonetics’ vs. ‘phonology’ As previously pointed out, for comparing different speaker populations often a great level of phonetic detail may be required in order to detect relevant distinguishing features. However, this wealth of detail may sometimes make it difficult to arrive at high-level observations. For this reason, my application contains a VBA routine that creates a copy of the original data that can be ‘filtered’ by stripping out any diacritics deemed irrelevant for any particular part of the analysis, in order to make the remaining data more ‘legible'. For example, if it is unimportant for analysis purposes whether initial plosives are aspirated or not, all the aspiration can simply be filtered out in the comparison. In this way, we can move step by step from a phonetic to a more phonemic representation of the data. 
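A minimal Python sketch of such a filtering step might look as follows; it only mimics the idea of the VBA routine described above (stripping a user-specified set of symbols from a copy of the transcriptions) and makes no claim about the actual implementation.

    # Minimal sketch: remove selected diacritics from a copy of a transcription.
    def filter_transcription(transcription: str, to_strip: str) -> str:
        """Return the transcription with every character listed in `to_strip` removed."""
        unwanted = set(to_strip)
        return "".join(ch for ch in transcription if ch not in unwanted)

    # Hypothetical usage: strip 'normal' aspiration before comparing realisations of a word.
    # filtered = [filter_transcription(t, "ʰ") for t in realisations_of("town")]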
Figure 3 below shows an ‘unfiltered’ comparison of the realisations and transitions associated with the word town and Figure 4 shows a ‘filtered’ representation of the same word with ‘normal’ aspiration removed4. The ‘filtering’ routine is started by clicking on the “Filter Occurrences” button at the top of the form displayed in Figure 3, which then prompts the user to input a series of characters to be stripped out, separated by spaces. 4 Note that the stronger, more unusual aspiration represented by an [s] is still shown. 611 Figure 3 – Realisations + Transitions (‘unfiltered') Figure 4 – Realisation Comparisons (‘filtered') 4. Data Storage Apart from what is being recorded in the kind of corpus under discussion, it is also important to understand how the different types of information are stored and how they can be related to one another in a meaningful way. As far as data storage is concerned, we end up with two fundamentally different types of data, the digitised soundfiles, which are stored as 16 bit, 16kHz .wav files, which can be played back by many player applications both on Windows and Linux systems, and the transcriptions and annotations, which are stored in the relational MS Access database(s). 4.1. Relational model The choice of using a relational, rather than a flatfile database model was an easy one because a relational database allows to store different types of information together and relate them to each other when needed. It therefore provides a highly flexible and easily expandable architecture to which other types of information can be added if and when necessary. 612 4.2. Levels of information and their coordination The most basic levels of information in the database are represented by the transcriptions of individual word tokens, their transitions and the word tokens they are related to in the dialogue. Through this, it is not only possible to combine and display all the different realisations for a particular word in a specific context, but also generate a kind of extended pronouncing dictionary with information about the different possibilities of pronunciation for a specific word in different contexts. At the next level, we can incorporate grammatical tags in order to see whether specific realisation phenomena, such as for example the deletion of complete words, are related or even restricted to specific word classes for individual speaker populations. We can also investigate whether specific groups of word classes, such as deictica (e.g. pronouns or locatives) are often realised with final release, indicating a potentially unusual degree of emphasis. Next, we can incorporate punctuation to determine speaker behaviour at phrase, sentence or turn boundaries and see how it relates to the realisation of pauses or final lengthening indicated in the word transition categories. This could then, as a further step, be complemented by prosodic information to investigate the proportion of long or short pauses in relation to the use of prosodic indicators of textual cohesion. In those cases where we may find particularly unusual realisations that do not fit any speaker population patterns, we can make recourse to the information stored about individual speakers in order to verify whether anything in their backgrounds may have triggered such particular speech behaviour, such as e.g. long periods of time spent abroad, etc. In a similar way, we may also investigate variation within particular populations based upon the age of the speakers. 5. 
Comparison strategies Comparisons between different speaker populations can take on various forms. As an initial step, all the realisations (potentially also including the transitions) for each particular population have to be combined into one table. This can be achieved by running SQL union queries against the database. Once the query results exist, additional queries can be run against them and aggregate functions, such as counts, averages or standard deviations can be used to establish frequency information for particular realisations, transitions, etc. Just like with concordance programs for analysing written corpora, wildcard searches in these queries help to focus in on particular parts of the data. If all that is needed for an analysis is to establish the deviances of one population from another, one can even write programs that analyse the results of the union queries and write these out to a table. However, according to my experience this may not necessarily help the researcher to understand all the relevant differences between the populations. Therefore, contextualised side-by side comparisons, such as illustrated in Figures 3 and 4 above are a much more preferable way of analysing the data, since this way, tendencies for both populations can relatively easily be spotted and compared. Once a particular model for a native speakers population has been established, it is of course also possible to evaluate the performance of individual non-native speakers against this model and to identify in which areas there is scope for improvement. 6. Applications and implications The methodology I have developed for my PhD thesis provides a way of using relational databases in order to store transcriptions of native and non-native speaker data and for capturing the differences between them statistically. This is a rather different approach from the one taken by most other studies investigating the performance of non-native speakers, who simply assume that a convenient basis for comparison exists in form of a standard. This kind of methodology has implications for both language teaching and also testing, as it may provide a more realistic account of the way that native speakers speak and therefore also what should potentially be taught to foreign learners in order to enable them to communicate efficiently with the former. As far as language testing is concerned, the implications are that a) it is possible to assess the speech of learners in a relatively objective way, rather than having to relay on the largely impressionistic marking schemes that are still currently in use (cf. Heaton, 1975: 100 and Weir, 1993: 43/44), and b) that this provides an eminently better and fairer way of assessing the speech of foreign learners in comparison to native speaker performance. 613 Apart from its use for the evaluation of non-native speaker accents, the methodology can also easily be applied to the study of different native speaker accents, not only for purely linguistic research purposes, but also potentially in order to establish criteria that may be used to improve speech recognition and other language engineering technology, such as dialogue systems (cf. Leech and Weisser, 2001). References Deutsch W, Vollman R, Noll A, Moosmüller S 1998 An Open Systems Approach for an Acoustic- Phonetic Continuous Speech Database: The S_Tools Database-Management System (STDBMS). In: Nerbonne J (ed.) 1998. Linguistic Databases. Center for the Study of Language and Information: Stanford, California. pp 77-92. 
Chollet G, Cochard J-L, Constantinescu A, Jaboulet C, Langlais P 1998 Swiss French PolyPhone and PolyVar: Telephone Speech Databases to Model Inter- and Intra-speaker Variability. In: Nerbonne J (ed.) 1998. Linguistic Databases. Center for the Study of Language and Information: Stanford, California. pp 117-135. Granger S (ed.) 1998 Learner English on Computer. London: Longman. Heaton JB 1975 Writing English Language Tests. London: Longman. Hughes A., Trudgill P 1987 English Accents and Dialects. London Edward Arnold. Leech G, Weisser M forthcoming 2001 Pragmatics and Dialogue in: Mitkov R (ed.). The Oxford Handbook of Computational Linguistics. Oxford: OUP. Leech G, Weisser M, Wilson A, Grice M. 1998 Survey and Guidelines for the Representation and Annotation of Dialogue. In: Gibbon D, Mertins I, Moore R (eds.) 2000 Handbook of Multimodal and Spoken Dialogue Systems. Dordrecht: Kluwer Academic Publishers. Nerbonne J (ed) 1998 Linguistic Databases. Center for the Study of Language and Information: Stanford, California. Wieden W, Nemser W 1991a The Pronunciation of English in Austria. Tübingen: Gunter Narr. Wieden W, Nemser W 1991b Compiling a database on regional features in Austrian-German English. In: Wieden W, Nemser W 1991 The Pronunciation of English in Austria. Tübingen: Gunter Narr. pp. 350-363. Wells, J. 1982. Accents of English. Cambridge: CUP. (Vol. 2) Wells J, Barry W, Grice M, Fourcin A, Gibbon D 1992 Standard Computer-compatible transcription. Esprit project 2589 (SAM), Doc. no. SAM-UCL-037. London: Phonetics and Linguistics Dept., UCL. Weir C 1993 Understanding and Developing Language tests. Hemel Hempstead: Prentice Hall International (UK). 614 Wh-questions and attitude: the effect of context Anne Wichmann and Richard Cauldwell University of Central Lancashire and University of Birmingham One of the most commonly cited functions of intonation is its ‘attitudinal’ function, and yet this remains its most elusive aspect. The profusion of meanings frequently ascribed to one and the same contour serves to show that the contour itself means none of them. However, we know intuitively that such meanings are generated, and recently there have been some attempts to investigate what these meanings are and how they arise. A persistent problem in the investigation of prosodically generated ‘attitudes’ is the tendency by many to conflate many different kinds of affective meanings, in particular emotion and attitude. There have been some attempts to identify and categorise different affective states (see Wichmann 2000, Scherer forthcoming). These studies have been motivated mainly by an upsurge in interest in the direct effects of emotion on the voice, and, for this reason, the indirect, context dependent affective meanings which can be observed in interaction have so far been neglected. Cauldwell (2000) observed that the attitudinal meaning conveyed by an utterance (a WH-question) in isolation was absent when the utterance was heard in its original conversational context. In this paper I also focus on WH-questions, extracted from the ICE GB corpus, and observe the various affective (or neutral) meanings both in isolation and in context. These perceived meanings include those which reflect the emotional state of speaker, the affectively coloured beliefs or predispositions of the speaker to a person or proposition, or an interpersonal stance (an affective stance taken toward another person in a specific interaction) (after Scherer forthcoming). 
I attempt to explain the contribution of intonation to the presence or absence of perceived affective colouring on the basis of auditory prosodic analysis. References Wichmann A 2000 The attitudinal effects of prosody, and how they relate to emotion. In Proceedings of ISCA workshop on Speech and Emotion, Belfast. Cauldwell RT 2000 Where did the anger go? The role of context in interpreting emotion in speech. In Proceedings of ISCA workshop on Speech and Emotion, Belfast. Scherer KR (forthcoming) Psychological models of emotion. To appear in J.Borod (ed) The neuropsychology of emotion. New York, Oxford University Press 615 His breath a thin winter-whistle in his throat: English metaphors and their translation into Scandinavian languages Kay Wikberg Department of British and American Studies University of Oslo k.b.wikberg@iba.uio.no Fax: 47 22 85 68 04 In a previous study (Wikberg, forthcoming) based on 11 fiction extracts from the ENPC corpus I show that the translation of innovative metaphors from English into Swedish and Finnish are mostly based on equivalent images and that most of the changes that occur in the rendering of the metaphors are simply due to the constraints of the target languages. Another finding was that the overwhelming majority of the metaphors have other than plot-advancing functions. In this study I will examine the translation of metaphors and idioms with a metaphorical meaning in a wider selection of texts in the ENPC corpus. My aim is further to examine the functions these nonliteral expressions play in discourse and to investigate the relation between metaphorical structure and function (cp Cacciari 1998). Finally, I will address the question of how metaphors can be handled in a discourse model (cp Werth 1999). References Cacciari, C. 1998. Why do we speak metaphorically? Reflections on the functions of metaphor in discourse and reasoning. In Albert N. Katz, Cristiana Cacciari, Raymond W. Gibbs & Mark Turner, Figurative Language and Thought, New York & Oxford: Oxford University Press, 119- 157. Werth, P. 1999. Text Worlds: Representing Conceptual Space in Discourse. Longman: London. Wikberg, K. (forthcoming) Studying the translation of metaphors in a multilingual corpus (English into Swedish and Finnish). In K.M. Jaszczolt & K. Turner (eds.), Proceedings from the Second International Conference in Contrastive Semantics and Pragmatics, Cambridge, 11-13 September 2000. Elsevier Science. 616 Public discourse as the mirror of ideological change: a keyword study of editorials in People's Daily Karen Wu Rongquan English and Communication Department, City University of Hong Kong enkarenw@cityu.edu.hk 1. Overview China, in its 50 years of communist rule since liberation, has experienced cultural and ideological change on a massive scale. The editorials from the People's Daily, the typical public discourse representing the Communist ideology of China, encoded these ideological changes in its formulations. This study will present the meaning changes of some political-social keywords in editorials over the years and try to interpret the changing content of ideology by virtue of these semantic findings. 
While many communication studies have investigated the change reflected in news and media, noting its functions in Communist party-state (Liu, 1971, 1981; Bishop, 1989; He and Chen, 1998; Chang, Wang, and Chen, 1994; Lee, 2000; Zhao, 1998; He, 2000), or have concentrated on the politics of information flow from the party-state center to the mass media (Wu, 1994), few have devoted themselves to the semantic study of typical public discourse regarding China’ s cultural and ideological transformation over its history. This study will examine editorials from the People’ s Daily (“PD” hereafter) since 1949 – the year the Chinese nation was officially “stood up” by the Communist Party, to 1995, the year in which the Party celebrated its 46 years of leadership over the state. Taking the notion that discourse is language in use (Fairclough and Wodak, 1997), the current study will apply and develop the social theories concerning discourse and social change. Analysis will be conducted at two levels. At the lexical level, it will draw on contextual meaning theory to examine the meanings of keywords by observing the collocations occurring in 9-word span. At the textual level, a text analysis will be introduced to describe the meaning of keywords under particular practical discursive environment. At both levels, meaning analysis will be processed regarding the words immediate context (particular editorial text in which particular keyword occurs) and the far-reaching context (the changing social process viewed from an ideological perspective). This keywords study will put emphasis on the semantic field of ideology study. The ideological implications of the keywords concerned have undergone great transformation in Chinese official media discourse along the years. Focusing on the ideological significance that discourse and the lexicon encoded, this analysis aims to be a bridge linking our understanding of particular words and the ideological changes that China has experienced along its social development. This study will argue that, for the aim of critical discourse analysis, instead of merely taking social ideological change for granted, detailed analysis on keywords is helpful in understanding the content of the changing ideology. 2. Literature Review 2.1. The concept of ideology Among numerous scholars (de Tracy, [1796]; Marx and Engels, 1974[1845]; Lenin, 1970[1905]; Lukacs, 1971; Mannheim, 1960[1933]) contributing to ideology studies, Marx and Engels stand out significantly for they tried to construct a complete conceptual system of ideology. Their concept of ideology is pejorative in essence and different from previous studies (e.g., Napoleon’ s negative concept of ideology) in that they input Hegel's dialectics into their philosophy as well their theory of ideology. Marx and Engels constructed ideological theory from the perspective of economic stratification together with class domination and defined the term “ideology” as systemized ideas originating from interests of the dominant class. Ideology, as the old-fashioned, controversial term, is caught within even more debates in contemporary studies (Geertz, 1964; Althusser, 1965; Shils, 1968; Foucault, 1972; Pecheux, 1982; Harbermas, 1988; Bourdieu, 1977, 1991; Gee, 1990). Among these debates, the Marxian twodimensional stance in understanding ideology has been a prominent influential one upon recent language-related ideological studies (van Dijk, 1998; Hodge and Kress, 1993). 
Hodge and Kress’ s (1993) work is an example of some linguistic studies trying to avoid one-dimensional surface understanding of ideology and attempting to investigate ideology in a dialectic way. With Marx and Engels’ ideology theory as the basic framework, some of the contemporary studies offer this old-fashioned, controversial term some reasonable definitions (Hodge and Kress, 1993; 617 Fairclough, 1995b; van Dijk, 1998). But due to the more or less different perspectives and goals, it is still difficult to say one is more correct than others. Unlike those studies which thoroughly adopted Marx’ s ideological theory as their basis, this study will add its own “ strategy” in defining ideology: to define the concept by revealing its relations with discourse. 2.2. Discourse and ideology The necessity to define the concept of ideology in relation to discourse resides in the knowledge that the relation between signifiers and their meaning cannot be easily interpreted and controlled without considering the relation between discourse and ideology. As stated by Stalin, language is not in itself ideological, discourse is (Stalin, 1951). On this issue, the scope and sophistication of the thought of contemporary French thinkers, like Althusser (1970), Bourdieu (1980) and Foucault (1979, 1980a, 1980b), are imposing. Drawing on Malrieu’ s (1999) definition of discourse, and adopting Marx and Engels’ two dimensional stance in defining ideology, this study defines ideology as a set of values and ideas advocated by the social dominant groups that guide actions and regulate the relationship of power and are expressed in a conventional discourse. In other words, ideology is a system that consists of three essential components: (1) a set of values and ideas advocated by the dominators according to their interests; (2) the guiding function of those values and ideas on actions and the structure of power they endorse; and (3) a discursive convention that expresses, perpetuates and determines the contents of the ideology. 2.3. Keyword studies Some scholars have tried to interpret culture through keywords. Researchers whose names should not be unmentioned in the literature of keywords study include Firth (1935), Williams (1961, 1983), Said (1978), Fairclough (1990, 1992), Wierzbicka (1992, 1997) and Stubbs (1996). 3. Hypothesis and research questions This study treats the past of PRC as a history composed of three social periods, i.e. 1949-1965; 1966-1978 and 1979-1995. This is called the “ three-fold-division method”. The “ three-fold-division method” is based on the following assumptions: 1966 is the year in which the Cultural Revolution was initiated. The abnormal social situation in China caused by the Cultural Revolution didn’ t finish until early 1979, when the reform strategies indicated in the Eleventh Party Congress of the CCP (held in 1978,18/12-22/12) began to be implemented in reality. The period between 1949-1965 represents the time during which Communist China built it’ s national integration (Liu, 1971), and 1979-1995 is the period in which China resurrected its economy by practising a market-oriented open-door policy. The three periods are comparable in terms of the time span and how each represents a radical change. In the study, the following research questions will be addressed: a) Are there any semantic changes in particular keywords in editorials? b) What explanations can be offered for the meaning change or stability of particular keywords? 
c) How can we understand the changing content of ideology by comparing the meanings of keywords in each period? 4. Methodology 4.1. Data collection and analysis The data corpus was built from editorials in the People's Daily from January 1, 1949 to December 31, 1995. The total number of PD editorials in this period was 6,812. After eliminating those outside the fields of politics and economics, 6,064 articles remain under investigation. Analysis with editorials as the target data offers a reasonable approach to understanding the function of apparatus discourse in the changing context of ideology. As a typical kind of "discourse of apparatus", the editorials of the People's Daily represent to a large degree what He (2000) labels "discursive conventions". Moreover, as argued by van Dijk (1988) and Hodge and Kress (1993), the editorial, as the preferred place for a newspaper to formulate its political and ideological attitudes, most clearly reflects the real standpoint and perspective of those who run the newspaper. It is therefore reasonable for the current study to take a semantic approach to interpreting the ideological sense encoded in particular keywords. I will argue that, in terms of transmission medium and audience, there are basically two kinds of public ideological discourse in China, i.e. theoretico-ideological discourse and practico-ideological discourse (or pure ideological discourse and applied ideological discourse). Ideology is expressed explicitly in the former kind, in the form of official party documents of Communist China, and is strategically manipulated to convince people in the latter kind, which appears in media forms. Ideological keywords should be located within theoretico-ideological discourse and be understood in their practico-ideological discursive context. In this study, the meanings of some 30 ideological keywords, drawn from the political reports of the Chinese Communist Party Congresses (hereafter CCPC), will be investigated in their practical discursive environment, i.e. the editorials of the PD. By observing the collocations occurring in a 9-word span, meaning analysis will take account of each word's immediate context (the particular editorial text in which the keyword occurs) and its far-reaching context (the changing social process viewed from an ideological perspective). Discourse analysis here aims to be the bridge linking our understanding of particular word usage with the social process. The analysis will consist of four stages: Stage 1: Locating Chinese basic ideological keywords on the basis of theoretico-ideological discourse, i.e. the political reports of the CCPC. Scholars have set out academic definitions of keywords (Williams, 1961; Wierzbicka, 1997), which will be taken as the basic criteria for locating keywords. From somewhat different perspectives, various categories of keywords can be distinguished, all of them revealing for ideology studies: for example, keywords that appear in all social periods, ideologically expressive words once prevalent in a certain period but which later disappeared, and words coined as new ideological expressions for new social periods.
This study will focus on the basic ideological keywords, i.e. the ideological keywords used in all the political reports of the CCPC (from the 8th CCPC in 1956 to the 15th CCPC in 1997) during the social periods concerned (Murata, 1998; see Table 1 for examples of basic ideological keywords). Stage 2: Locating the keywords in the practico-ideological discourse, i.e. the editorials. Based on the results of Stage 1, some 30 ideological keywords will be chosen for further analysis. These keywords will be located and investigated in terms of their usage in the editorials of the PD. Stage 3: Observing the collocation lists of the keywords and determining the meaning components of the keywords according to their collocation lists.1 To identify the meaning of a particular keyword, rather than relying merely on the intuition of native speakers, the keyword will be investigated in its context. By context, we mean a 9-word collocation span (Firth, 1935; Stubbs, 1996), an 80-character context (Scott, 1991), as well as the text as a whole. One needs to decide the scale of the context according to the purpose of the particular study. This study will observe the keywords in the context adopted in Firth's (1935) and Stubbs' (1996) keyword studies, that is, the 9-word collocation span (four words to the left and right). In this study, the data for the three social periods, referred to as sub-corpora for convenience of description, will be processed separately (for the basis of this division, see section 3). In each sub-corpus, the collocations of a given keyword will be listed and categorized according to their fields (what is going on; what is that collocation doing?). The list of fields thus developed is the list of meaning components of the keyword. For example, in the collocation list of the keyword "revolution", some collocations, such as "worker", "Chairman Mao" and "proletarian", accompany revolution to indicate the participants of revolutionary activity, while others, such as "improve" and "maintain", describe the expectations that the editorial producers hold towards revolution. In that case, "participant" and "expectation" are the fields that the collocations stand for, and are therefore two of the meaning components of the word "revolution" (see Table S-1 in the Sample Analysis below). With the help of the computer, the collocations under each meaning component will be sorted according to their frequency of occurrence. Different collocations with the same or related meanings will be grouped together and treated as a single unit when the frequencies of collocations of a particular meaning are sorted. Collocations (or collocation units) ranked higher in the list will be regarded as the closer contextual meanings of the word in that particular social period. Stage 4: Analysing the collocations (with a focus on frequency and reference) and interpreting the use of the keywords through observation of the collocations of each component; comparing the meaning components of keywords across the different social periods on the basis of the results of the collocation analysis; and interpreting variation in keyword meaning in relation to changes in the social context (see the findings and discussion sections in the Sample Analysis attached). Through analysis of the collocations, the main uses of the keywords, under each meaning component, can be listed in a meaning table (see Table S-2 in the Sample Analysis attached).
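To make the window-based procedure in Stage 3 concrete, the following minimal sketch (not part of the original study, which used manual coding and the Winmax software) shows how collocates in a 9-word span, four words to the left and right of each occurrence of a keyword, can be extracted and counted per sub-corpus. The tokenisation, the sub-corpus texts and the keyword are hypothetical placeholders; a real run would use the segmented Chinese editorials.

from collections import Counter

SPAN = 4  # four words to the left and right of the node word, i.e. a 9-word span

def collocates(tokens, keyword, span=SPAN):
    # Yield every collocate found within +/- span tokens of each occurrence of the keyword.
    for i, token in enumerate(tokens):
        if token == keyword:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            for collocate in window:
                yield collocate

def collocation_table(subcorpora, keyword):
    # Return, for each sub-corpus (period), a frequency-ranked Counter of collocates.
    return {period: Counter(collocates(tokens, keyword))
            for period, tokens in subcorpora.items()}

# Hypothetical, pre-segmented sub-corpora for the three periods (English glosses stand in
# for the word-segmented Chinese editorial text used in the actual study).
subcorpora = {
    "1949-1965": "the proletarian revolution will continue under Chairman Mao".split(),
    "1966-1978": "we must maintain the revolution and consolidate its gains".split(),
    "1979-1995": "the revolution of the ownership system will speed up construction".split(),
}

for period, counts in collocation_table(subcorpora, "revolution").items():
    print(period, counts.most_common(5))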
4.2. Quantitative analysis In order to study keyword collocations across a large corpus, this research requires a method for summarizing collocation data and for calculating the frequency and the likelihood of association between collocations. (1 During the three stages of analysis, I referred to the Modern Chinese-English Dictionary (Foreign Language Teaching and Research Press, 1991) when translating the keywords and collocations from Chinese into English.) The quantitative work includes: a) identifying all occurrences of the keywords in the corpus and their frequency; b) keeping a record of the collocates of the keywords which occur in a window of defined size (e.g. four words to the left and right); c) counting the frequency of each collocation in the whole corpus and in each sub-corpus; d) sorting the collocations (and the collocation units, in which collocations share the same or related meanings) under each meaning component according to their frequencies in each sub-corpus. 4.3. Qualitative analysis A qualitative method will be employed to interpret the results of the quantitative collocation research in relation to the social and ideological context of China. Sample analysis: The change in the meaning of ge ming (revolution) in China from 1976 to 1980 1. Overview This analysis examines the change in meaning of the keyword ge ming (revolution) as it appeared in PD editorials during the period 1976-1980. Findings are presented in the form of collocation lists organized by the meaning components of ge ming. A further aim is to interpret the meaning change in relation to the ideological changes that China experienced during these four years. The main theoretical reference point is contextual meaning theory, which holds that word meaning is embodied in usage and that the semantic study of words should take context into careful consideration. The central idea is that words occur in characteristic collocations, which reveal the associations and connotations they have, as well as the assumptions they embody. It is also assumed that the collocations of a particular word encode the "meaning components" of that word (Wierzbicka, 1984). By identifying the collocations of the word ge ming in different periods of history, one can trace the change in meaning of this word over the years. 2. Data collection and processing Considering the historical significance of the year 1978, in which China released its reform strategies (at the Eleventh Party Congress of the CCP, held on 18-22 December 1978), this analysis takes data from the two years before and after 1978, i.e. from January 1, 1976 to January 1, 1980. The data are processed with reference to the two "turning points" for China in these four years: the downfall of the Gang of Four (October 1976) and the official call to "focus on the economy" (at the end of 1978). The corresponding editorials for the two points appeared on July 10, 1976 and January 1, 1979 respectively. The 44 editorials are therefore categorized into three groups according to these two points, and the comparison of collocations is carried out within this three-part frame.
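Before turning to the findings, the following minimal sketch (not part of the original study) illustrates steps (b)-(d) of the quantitative procedure above: grouping the collocates of a keyword under hand-assigned meaning components and ranking them by frequency within a sub-corpus. The component assignments and the counts are invented solely for illustration; in the study itself the assignment of collocates to components is done manually for each keyword.

from collections import Counter, defaultdict

# Hypothetical mapping from collocates (English glosses) to hand-assigned meaning components,
# in the spirit of Table S-1.
COMPONENT = {
    "worker": "participant", "Chairman Mao": "participant", "proletarian": "participant",
    "maintain": "expectation", "improve": "expectation", "speed up": "expectation",
    "education": "field", "ownership": "field", "construction": "field",
}

def component_table(collocate_counts):
    # Group collocate frequencies under meaning components and rank them within each component.
    table = defaultdict(Counter)
    for collocate, freq in collocate_counts.items():
        table[COMPONENT.get(collocate, "other")][collocate] += freq
    return {component: counts.most_common() for component, counts in table.items()}

# Invented collocate counts for one sub-corpus, e.g. the 1966-1978 period.
counts_1966_1978 = Counter({"maintain": 12, "worker": 9, "education": 5, "ownership": 2})
for component, ranked in component_table(counts_1966_1978).items():
    print(component, ranked)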
3. Findings Table S-1 (attached) shows part of the collocation table of ge ming (revolution) in editorials of the category "national politics" (January 1, 1976 - January 1, 1980). The categories in the left-most column (except "fixed phrases") are defined on the basis of the functions or referents of the collocations in contributing to the usage of ge ming, and can therefore be regarded as meaning components of the word (definitions are given in brackets after each category). Figures in square brackets show the collocation frequency of each component in the different social periods. While the figures in brackets in the first row show the number of editorials observed for each period, those in brackets following each collocation show the total number of occurrences in that portion of the data. The collocations are analysed below by meaning component. Fields: a) During the Cultural Revolution, the main fields of revolutionary activity were social institutions, such as education, public health, art, and science and technology. Revolution in these institutions meant politically changing the minds of the people working in them and destroying their traditional working principles. The "institutions" meaning occurred in 22.9 per cent (11/48) of the field collocations during the Cultural Revolution. b) After the Cultural Revolution and before late 1978, the "institutions" meaning dropped to 11.3 per cent (5/44). The main field became economic construction, including socialist construction (17/44: 38.6%), ownership (2), economy (1) and manufacturing industry (1), which together accounted for 47.7 per cent (21/44) of the collocations. c) From late 1978, the "institutions" meaning disappeared from the fields of revolution. The main field is ownership (5), which ranked first in the collocation list. Other collocations, e.g. attribute (2), rule (2), theory (2), cognition (1), general truth (1) and reason (1), indicate that the CCP began to reflect on revolution theoretically. Revolution became an object requiring study, rather than an unquestionable spirit or claim. In addition, the juxtaposition of revolution and construction indicates that revolution came to be considered a process whose aim was to liberate productivity and clear the way for social construction. Social construction now began to take the place of revolution in terms of the important role it played in Chinese society. Participants and Opposites: a) Categorization: During the Cultural Revolution, subjects were categorized from the perspective of social class, as can be seen from the distinction between the proletarian (the people, cadres, the masses) and the capitalist. After the Cultural Revolution, the categorization came to be based on social groups, e.g. workers, cadres, the masses, the peasantry, the PLA and criminals. It should be noted that the opposites of revolution in this period were criminals, who threatened social order and security, and no longer the capitalists, as the classic view of the Cultural Revolution had it. b) Alliance: Rather than making distinctions among the participants, discourse after the Cultural Revolution began to introduce the notion of alliance among the participants of revolution. This explains why words like worker-peasant alliance, foundation, nation and main force enter the collocations of the second period. The editorial producers tried to unite people from various sectors of society in order to strengthen the forces for social construction. c) Mao Zedong was no longer referred to as Chairman Mao after the Cultural Revolution; from late 1978, the producers use Comrade Mao instead.
This is not only a result of abandoning the "personality cult" but also, in rejecting a symbol of the traditional revolutionary era, an indicator of detachment from the old revolutionary thinking appealed to in Mao's era. d) Solidarity and power semantics: as a pronoun indicating 'others', the third-person pronoun they was used to refer to the opposites of the revolution during and after the Cultural Revolution (in 01/01/76-19/10/76, once for anti-revolutionaries; in 25/10/76-18/11/78, six times referring to anti-revolutionaries and once referring to members of the Left Wing or people who had made political mistakes). This contrasts sharply with the situation during the Cultural Revolution, in which they also referred to participants of the revolution three times. e) After the Cultural Revolution, they was no longer used to refer to the people or the masses. Instead, the editorial producers began to use we (once in each period) to refer to the readers (the audience) and to the people as well. With the pronoun we, moreover, the readers and the producers are mentioned together, both being regarded as participants of the revolution. This semantic change indicates that the top leaders of China were trying to establish a relationship of solidarity rather than of power between the CCP and the masses: to reduce the distance between the leaders and the masses and to make the power relation less explicit. f) After the Cultural Revolution, the phrase main forces was used metaphorically to refer to the mass of people participating in China's social construction. The 'main forces', historically portrayed as brave and patriotic, were considered a prototype of revolutionary spirit in China. Chinese leaders tried to use a revolutionary discourse that matched the values of the Chinese masses in order to encourage their devotion to social construction. Internal relationship between participants: Verbs encoding meanings of 'leading' and 'following', e.g. carry out, along with, cause, according to, direct, implement, guide, encourage, follow, instruct, learn, decreased rapidly over the three periods. During the Cultural Revolution, words with the "leading" meaning accounted for 82.3 per cent (14/17) of the relationship verbs. These verbs imply a particular relationship between the participants: the producers, as representatives of the top leaders of China, stand higher and see further than the masses, and the people need to be educated to know how to justify and participate in the revolution. The semantic choice of pronouns when referring to the masses also carries an underlying implication: at that time, although the top leaders often claimed that the people were the owners of the country, the people were actually treated as "others". With pronouns like they, the producers make a clear distinction between the masses and themselves, i.e. the members representing the power centre of Chinese politics. The masses are positioned in a 'lower' or 'remote' place and are in fact excluded from the power centre. They have no power to speak, judge or decide. The only approved and encouraged form of participation in the revolutionary process is very simple: to follow the leaders. Expectation: a) During the Cultural Revolution, the use of collocating verbs like require, do and engage indicates that "revolution" was mainly conceived of as a stative situation. Collocations with the meaning of 'maintaining' or 'consolidating' occurred in 43.9 per cent (29/66) of cases in this period.
The figure remained at almost the same level (42 per cent, 11/26) during the period 25/10/76-18/11/78. From late 1978, however, collocations with the meaning of "maintaining" disappeared. Instead, collocates describing revolution as a dynamic process appeared (shift, resurrect, speed up). These collocates, which indicate the meaning of "transformation", occurred in a total of 61 per cent (8/13) of cases. The changes indicate that people began to view revolution from an empirical and dynamic perspective. Contrary to the expectation of the Cultural Revolution, when revolution was considered a mature cause to be maintained, people now demanded more transformation of and innovation in it. Revolution was no longer regarded as a predetermined concept or an unquestioned slogan but rather as a process; people began to put something substantial into it and to hope that they could really gain something from it. "Revolution" in this period in fact became a metaphor for the reforms that China would undertake in all social areas, including politics, thought and the economy. Attitude: In all periods, the semantic prosody of revolution is very positive. The word revolution was used in connection with positive words like continue, inherit, develop, go ahead, deepen and enhance, which shows the positive attitude towards this word. The difference across the periods is that after the Cultural Revolution the collocations indicating this positive attitude decreased; in the period from late 1978, only one such collocation occurred. One possible explanation is that people's use of the word "revolution" decreased, and that they gradually lost interest in expressing their attitudes towards "revolution". Requirement: The qualities valued in revolution changed over the years. In contrast with the Cultural Revolution, in which people were required to show qualities like consciousness (the desire to do something from an inner drive) and to follow directions from the top, from late 1978 people were encouraged to be creative, i.e. to dare to think and dare to say. 4. Discussion 1976 and 1978 were significant years for China. Top leaders, including Premier Zhou Enlai, Mao Zedong and Zhu De, died in succession in 1976, and the Gang of Four was smashed in October of that year. The whole nation was left not only with no clearly recognized leader but also with no clear ideological and policy direction. People began to doubt the beliefs they had held to during the Cultural Revolution. The vacuum between the downfall of the Gang of Four and the release of the reform policy (1978) was a period in which people reflected on the horrors and violence of that decade. Also in this period, CCP leaders like Deng Xiaoping, who had returned to power by late 1978, kept trying to establish a new ideology for the whole nation. It was very clear to Deng and his reform allies that the ideological influence of Mao Zedong would not allow for radical social reforms. Yet the task of uniting the whole nation around the new ideology, in order to shift society from class struggle to economic reform, was urgent. To resolve this contradiction, one of Deng's strategies was gradually to change public political discourse in the service of social reform of productivity and social relationships. The change reflected in the use of the keyword revolution is typical evidence of such manipulation of discourse, and of ideological perception.
To a certain degree, these changes can be generalized as follows: a) The frequency of revolution decreased over the four years; people gradually lost interest in choosing the word, which indicates that the need for it decreased. b) The use of fixed phrases (FP) containing revolution increased over the years. FP served the aim of changing the traditional notions of revolution and encouraging people's acceptance of the new notions. Because revolution, as an old keyword of Chinese political discourse, could not simply be "kicked out" of people's habitual lexicon, the CCP tried to change the meaning of the word through the frequent use of FP. c) Viewed in terms of meaning components, the meaning changes can be represented as in Table S-2. d) The word revolution continued to be used in Chinese public discourse after the Cultural Revolution, while its meanings changed considerably. Moreover, the word came to be used mainly as an integral part of fixed phrases. As for the uses outside the FP, Deng took advantage of the term to introduce his own strategy of reform and socialist construction. Rather than simply rejecting a term whose meaning was deeply rooted in the political context of Mao's era, Deng infused new notions relating to social reform into the old keyword, and appealed to new social values while referring back to and reflecting on the traditional values of China. Given the relatively small amount of data, the coding in this sample analysis was done by hand; for the main corpus, the coding will be carried out with the help of the "Winmax" software, which is specifically designed for coding and organizing texts.
Table 1: Examples of basic ideological keywords
Stable frequency: socialism, people, politics, ideology, line, mass, central, locality, unite, liberate, consolidate, worker, Mao Zedong, Marx, Marxism-Leninism, Communism, army, Taiwan, advanced, peaceful coexistence.
Frequency dropping markedly during the Cultural Revolution: economy, development, construct, democracy, nationality, cadre, improvement, cooperate, peace, unify, science, technology, benefit, productivity, common, independence, stability, culture, theory.
Frequency dropping markedly since the reform and open-door policy: class, revolution, struggle, property, capital, dictatorship, Lenin, destroy, class struggle, revolutionary, war.
Table S-1: Collocations of ge ming (revolution) in PD editorials, 1976-1980 (only the "Fields" component is shown; the figure in brackets after each period is the number of editorials observed, the bracketed ratio gives the component's collocation count over the total number of collocations for that period, and the figure after each collocate is its number of occurrences).
Fields (the fields in which revolution is introduced, or what is referred to when talking about revolution):
01/01/76 - 19/10/76 (7) [48/318: 0.150]: the proletarian dictatorship (8), production (4), practice (4), public health (3), art (3), education (3), object (4), guiding principle (3), nature (3), task (2), future (2), socialist construction (1), theory (1), party (1), work (1), policy (1), tradition (1), the great debates in Qinghua University (1)...
25/10/76 - 18/11/78 (15) [44/292: 0.150]: socialist construction (17), ideological policy (2), politics (1), thought (1), economy (1), diplomacy (1), theory (1), practice (1), education (1), art (1), public health (1), technology (1), class struggle (1), production competition (1), scientific experience (1), private ownership (1), public ownership (1), manufacturing industry (1), standard (1), principle (1)...
01/01/79 - 01/01/80 (22) [21/160: 0.131]: ownership (3), attribution (2), rule (2), theory (2), modernization construction (1), private ownership (1), public ownership (1), machine building (1), recognition (1), general truth (1), movement (1), reason (1), aspect (1), center (1), production (1), relationship (1), material condition (1)...
Table S-2: The meaning change of ge ming (revolution), 1976-1980 (each row gives the values for 01/01/76 - 19/10/76, 25/10/76 - 18/11/78 and 01/01/79 - 01/01/80 in turn).
Fields: (destroying the orders and principles of) social institutions, as a weapon / social construction, as a weapon / production ownership and economic construction, as a theory.
Participants and opposites: classes, distinction, opposition / social groups, alliance / social groups, alliance.
Internal relationship between participants: power semantics / solidarity semantics / solidarity semantics.
Relationship between participants and opposites: hostile / hostile / hostile.
Expectation: stative situation, to be maintained and consolidated, as a concept or slogan / stative situation, to be maintained, as a concept and process / dynamic process, to be transformed, as a substantial process.
Attitudes: positive / positive / indifferent.
Evaluation: positive, eulogistic / positive, eulogistic / positive.
Requirement: consciousness, drive from inside / zeal, bravery / creativity.
References
Althusser L 1965 Pour Marx. Paris: Maspero.
Althusser L 1970 Idéologie et appareils idéologiques d'État (notes pour une recherche). La Pensée, 151: 3-38.
Bishop R L 1989 Qi Lai! Mobilizing One Billion Chinese: the Chinese Communication System. Ames: Iowa State University Press.
Bourdieu P 1977 Outline of a Theory of Practice. Cambridge: Cambridge University Press.
Bourdieu P 1980 Le mort saisit le vif. Actes de la Recherche en Sciences Sociales, 33: 3-14.
Bourdieu P 1991 Language and Symbolic Power. Cambridge: Harvard University Press.
Chang T K, Wang J, Chen C H 1994 News as social knowledge in China: the changing worldview of Chinese national media. Journal of Communication, 4(3): 52-69.
Fairclough N 1990 What might we mean by "enterprise discourse"? In Keat R, Abercrombie N (eds), Enterprise Culture. London: Routledge, 38-57.
Fairclough N 1992 Discourse and Social Change. Cambridge: Polity Press.
Fairclough N 1995a Media Discourse. London: Edward Arnold.
Fairclough N 1995b Critical Discourse Analysis: The Critical Study of Language. London: Longman.
Fairclough N, Wodak R 1997 Critical discourse analysis. In van Dijk T A (ed), Discourse Studies: A Multidisciplinary Introduction, Vol. 2. London: Sage, 258-284.
Firth J R 1935 The technique of semantics. Transactions of the Philological Society, 36-72.
Foucault M 1972 The Archaeology of Knowledge. London: Tavistock.
Foucault M 1979 Discipline and Punish: the Birth of the Prison. New York: Vintage Books.
Foucault M 1980a The History of Sexuality, Vol. I: An Introduction. New York: Vintage Books.
Foucault M 1980b Power/Knowledge: Selected Interviews and Other Writings, 1972-1977. New York: Pantheon Books.
Gee J P 1990 Social Linguistics and Literacies: Ideology in Discourses. London: Falmer Press.
Geertz C 1964 Ideology as a cultural system. In Apter D (ed), Ideology and Discontent. New York: The Free Press.
He Z 2000, forthcoming. Working with a dying ideology: dissonance and its reduction in Chinese journalism.
He Z, Chen H L 1998 The Chinese Media: A New Perspective. Hong Kong: Pacific Century Press.
Hodge R, Kress G 1993 Language as Ideology. London: Routledge.
Lee C C (ed) 2000 Money, Power and Media: Communication Patterns in Cultural China. Chicago: Northwest University Press.
Lenin V I 1970 [1905] What is to Be Done? London: Panther.
Liu A P L 1971 Communications and National Integration in Communist China. Berkeley: University of California Press.
Liu A P L 1981 Mass campaigns in the People's Republic of China during the Mao era. In Rice R E, Atkin C K (eds), Public Communication Campaigns. New York: Oceana.
Lukacs G 1971 History and Class Consciousness: Studies in Marxist Dialectics. London: Merlin Press.
Malrieu J P 1999 Evaluative Semantics: Cognition, Language and Ideology. London: Routledge.
Mannheim K 1960 [1933] Ideology and Utopia: an Introduction to the Sociology of Knowledge. London: Routledge and Kegan Paul.
Marx K, Engels F 1974 [1845] The German Ideology. London: Lawrence and Wishart.
Moon R 1994 The analysis of fixed expressions in text. In Coulthard M (ed), Advances in Written Text Analysis. London: Routledge, 117-135.
Moore T E (ed) 1973 Cognitive Development and the Acquisition of Language. New York and London: Academic Press.
Murata T 1998 A study on ideological changes through language in political reports of Chinese Communist Party's plenary sessions. In T'sou B K, Lai T B, Chan S W K, Wang S Y W (eds), Quantitative and Computational Studies on the Chinese Language. Hong Kong: City University of Hong Kong, Language Information Sciences Research Center, 209-234.
Pecheux M 1982 Language, Semantics, and Ideology: Stating the Obvious. London: Macmillan.
Said E 1978 Orientalism. London: Routledge and Kegan Paul.
Scott M 1991 Demystifying the Jabberwocky: a research narrative. PhD dissertation, University of Lancaster.
Shils E 1968 The concept and function of ideology. In International Encyclopedia of the Social Sciences, vol. 7. New York: The Macmillan Company and Free Press, 66-76.
Stalin I V 1951 On Marxism in linguistics. In The Soviet Linguistic Controversy (translated from the Soviet press by Murra J V, Hankin R M, Holling F). New York: King's Crown Press, 70-76.
Stubbs M 1996 Text and Corpus Analysis. Oxford: Blackwell.
Van Dijk T A 1988 News Analysis. New Jersey: Lawrence Erlbaum Associates.
Van Dijk T A 1998 Ideology: a Multidisciplinary Approach. London: Sage.
Wierzbicka A 1984 Lexicography and Conceptual Analysis. Ann Arbor: Karoma Publishers.
Wierzbicka A 1992 Semantics, Culture and Cognition: Universal Human Concepts in Culture-Specific Configurations. Oxford: Oxford University Press.
Wierzbicka A 1997 Understanding Cultures through Their Key Words: English, Russian, Polish, German and Japanese. New York: Oxford University Press.
Williams R 1961 Culture and Society 1780-1950. Harmondsworth: Penguin.
Williams R 1983 Keywords: A Vocabulary of Culture and Society. London: Fontana Press.
Wu G G 1994 Command communication: the politics of editorial formulation in the People's Daily. The China Quarterly, 137: 194-211.
Zhao Y Z 1998 Media, Market, and Democracy in China: Between the Party Line and the Bottom Line. Urbana and Chicago: University of Illinois Press.
A corpus-based study of interaction between Chinese perfective -le and situation types Zhonghua Xiao, Lancaster University Mandarin Chinese, as an aspect language (Norman, 1988:163), has a rich inventory of aspect markers, including the perfectives -le1 and -guo and the imperfectives -zhe and zai. Of these, -le is the most studied marker because of its puzzling behaviour. For example, is it necessary to differentiate between the perfective -le and the COS le? Does -le indicate completion or termination? Are there any constraints on the interaction of -le with various situation types? All of these issues have aroused much controversy, because different authors have invented different "acceptable" examples to support their arguments. Many of these examples, however, are rarely found in real language, though they serve the purpose of argumentation. In this study, I take another approach to these issues and seek evidence from authentic language data. An L1 Chinese corpus of 124,164 Hanzi (Chinese characters) was compiled for this purpose. The corpus was first automatically segmented and POS-tagged, and post-editing of the tagging of the marker LE was carried out by hand to ensure consistency and accuracy. All the clauses containing -le and le were then extracted into two databases, and the situation type of each instance was judged manually. A total of 1,208 occurrences of LE are found in our data, of which 1,019 are the perfective -le and 166 are the COS le. In the other 23 instances, where LE appears in sentence-final position, the morpheme has a dual function, indicating both perfectivity and change of state. Other functions of LE, e.g. as a full verb, as a modal particle and as a bound morpheme, are also found in the corpus; because they are irrelevant to the present study, these functions are not counted. The high frequency of LE and its rich functions justify this corpus as a good basis for a case study of the morpheme, despite the small corpus size. This paper is concerned with the three questions raised at the beginning and is organised as follows: Section 1 discusses the one-morpheme approach vs. the two-morpheme approach; Section 2 considers the type of closure indicated by the perfective -le; Section 3 examines the interaction between the perfective -le and situation types; and Section 4 concludes. 1. Verbal -le vs. sentential le There is unanimous agreement that -le is a perfective aspect marker (e.g., Chao, 1968; Henne, Rongen & Hansen, 1977; Smith, 1991, 1997; Zhang, 1995; Dai, 1997). Yet much controversy arises over whether the perfective -le and the COS le have the same functions.
While the two-morpheme approach focuses on their differences in terms of syntactic distribution, semantic function and etymological source, the one-morpheme approach focuses on their semantic similarities. Zhang (1995:120), for example, supports the unified treatment of LE and describes its major functions as denoting a change of state by termination and establishing a boundary between two different situations. However, despite her explicit preference for the one-morpheme approach, she has to turn to the two-morpheme approach to explain the interchangeability of -le and -guo (see Zhang, 1995:217-219). I argue in favour of the two-morpheme approach. As suggested above, the perfective -le and the COS le differ in terms of syntactic distribution, semantic function and etymological source. The terms "verb-final suffix" -le and "sentence-final particle" le (Li & Thompson, 1981) best illustrate their difference in syntactic distribution: the perfective -le occurs post-verbally while the COS le appears in sentence-final position. However, when an intransitive verb2 takes the sentence-final position, we have to take into account the different semantic functions of the two morphemes to determine which LE we have in front of us. The perfective -le focuses on the actualisation of a situation and presents it as a whole; the COS le, on the other hand, mainly indicates a change into a new situation and signals its current relevance3. There are three possibilities for LE in sentence-final position. It can be the COS le if the sentence only allows a change-of-state reading, the perfective -le if the sentence only has a perfective reading, or it can have a dual function if the sentence has both change-of-state and perfective readings (cf. Li & Thompson, 1981:296). In this last case, the additional COS le is absorbed into the preceding perfective -le, as Chinese "always avoids a repetition of the same syllable by way of haplology: -le le" (Chao, 1968:247)4. 1 In this study, the morpheme LE in verb-final position indicating perfectivity is glossed as -le, while that in sentence-final position indicating change of state (COS) is glossed as le. The capitalised LE refers to either. 2 As transitive verbs are always followed by their objects, they cannot appear in sentence-final position. 3 "Current" should be interpreted in relation to the reference time rather than to the speech time. Historically, the perfective -le and the COS le developed at different stages of evolution. The COS le is derived from the verb liao "to finish, to come to an end" (the same character with a different pronunciation), as in siliao "to settle out of court". When its sentence-final function was well established, it also developed a use in which it appears directly after the main verb (whether or not the verb is sentence-final), functioning to signal perfectivity. Diachronically, therefore, the COS le developed earlier and gave rise to the perfective -le (cf. Bybee, 1993:84-85). The evolution of these two morphemes also furnishes evidence in favour of the two-morpheme approach: if they were the same and one morpheme could function adequately, why would it be necessary for the other to be derived? The differentiation between the perfective -le and the COS le is also supported by the quantitative data in our corpus. Of a total of 1,208 occurrences of LE, 1,019 (84.36%) are the perfective -le, 166 (13.74%) are the COS le, and in 23 instances (1.90%) the morpheme denotes both COS and perfectivity.
The ratio of the perfective -le to the COS le is 6.139 (1,019/166). The higher frequency of the former is to be expected, because our corpus mainly contains narrative discourse, of which the perfective aspect is a prominent syntactic feature. Our finding here is in conformity with Christensen (1994), who finds a ratio of 6.818 for written narratives, as shown in Table 1 below:
Table 1: Frequency data for LE (counts, percentages and the -le : le ratio)
Our corpus: total 1,208; -le 1,019 (84.36%); le 166 (13.74%); dual -le/le 23 (1.90%); ratio 6.139.
Christensen's written data: total 86; -le 75 (87.21%); le 11 (12.79%); dual -le/le 0; ratio 6.818.
The table shows that the perfective -le is more productive than the sentential le and the dual-function LE. Christensen's written data show a higher frequency of -le because his data are purely narrative. It can be seen from the discussion above that the perfective -le differs from the COS le in many respects: (i) syntactically, -le appears in verb-final position whereas le appears in sentence-final position; (ii) semantically, -le signals perfectivity whereas le indicates change of state; (iii) etymologically, -le was derived later than le; and (iv) empirically, -le is more productive than le. All of these points argue strongly for the two-morpheme approach. 2. Completion vs. termination Another issue which is as controversial as the one discussed above is the type of closure signalled by the perfective -le. Traditionally, the perfective -le is considered to indicate completion of the action denoted by the verb. Chao (1968:247), for example, argues that the verbal -le has the class meaning of "completed action". Following Chao, Henne, Rongen & Hansen (1977:117) claim that -le indicates "the completed action of the verb to which it is attached". Similar views can also be found in Zhu (1981), Lü (1981:314-321) and Tiee (1986:96). But the traditional view cannot account for the puzzle in (1) below: (1a) zhe-ben xiaoshuo wo kan-le san-tian this-CL novel I read-le three-day I read the novel in three days (I finished reading it). (1b) zhe-ben xiaoshuo wo kan-le san-tian le this-CL book I read-le three-day le I have been reading this book for three days (I haven't finished reading it). 4 According to Chao (1968), in certain dialects such as Cantonese and the Wu dialects, there are separate morphemes to indicate actuality and change of state, and these can co-occur contiguously. Haplology of -le le occurs only in Mandarin Chinese (putonghua, "the common language"). Clearly (1a) and (1b) have different aspectual meanings. While the former indicates the completion of the reading event, the latter gives no such indication. If LE indicates completion, "why is a completed reading derived when one LE is used, but not allowed when an additional LE is used?"5 Interestingly, all of the scholars quoted above relate completion to the action of a verb rather than to a situation. Their approach is clearly incompatible with the definition of aspect6. In fact, the compositional nature of aspect is widely observed in the literature (e.g., Verkuyl, 1972, 1993; Smith, 1991, 1997; Brinton, 1988): the aspectual value of a situation is contributed to by the semantic features of all the sentential elements, though the verb plays an important role. More recent studies, however, recognize that the perfective -le does not necessarily indicate completion.
While perfectives with a resultative verb complement (RVC) unequivocally indicate completion, "simple perfectives" (Smith, 1997:264), i.e. sentences with -le alone and without RVCs, only present situations without indicating the type of closure7 (Smith, 1988:216, 218; Tai, 1984:291-292; Chu, 1976:48). This view has won growing popularity in the literature (e.g., Li & Thompson, 1981:215-216; Zhang, 1995:115-116; Christensen, 1994; Smith, 1997:264-265; Dai, 1997:21). While I agree with this recent view in principle, I argue that the type of closure indicated by -le is not as arbitrary as Smith (1988:228) claims: "Semantically, sentences without completive RVCs do not present a completed event; but pragmatically, they often do just that." Such arbitrariness has led to much confusion in her own studies. Let us consider Smith's examples (1988:218-219): (2a) wo zuotian xie-le yi-feng xin I yesterday write-le one-CL letter I wrote a letter yesterday. (2b) *wo zuotian xie-le yi-feng xin, keshi mei xie-wan I yesterday write-le one-CL letter, but not write-finish Lit.: I wrote a letter yesterday, but didn't finish it. (2c) wo zuotian xie-wan-le yi-feng xin I yesterday write-finish-le one-CL letter I finished writing a letter yesterday. Smith (1988:218) argues that sentences like (2a) "present events as terminated but not necessarily completed", but in Smith (1997) she contradicts her own assertion by admitting that, in fact, the most natural interpretation of (2a) would be that the letter was finished; in order to remedy the self-contradiction, she adds immediately that "the completive interpretation is conversational only: it can be cancelled by other information" (1997:265), as shown in (2b). But I argue that (2b) in fact sounds unacceptable semantically, if not grammatically, to a native ear (cf. also Teng, 1986). And like (2c), in which -le co-occurs with the RVC -wan "to finish", (2a) also indicates the completion of the writing event, i.e. the letter was finished. If we followed Smith's assumption that completive readings denoted by "simple perfectives" can be cancelled, we would have the following absurd situation: (3) *shanggeyue ta sheng-le yi-ge nanhai, keshi mei sheng-wan last month she give:birth-le one-CL boy but not give:birth-finish Lit.: Last month, she gave birth to a baby boy, but did not finish it. It is true that "simple perfectives" may indicate either completion or termination, but the type of closure depends on the type of situation. That is, telic situations8 are presented as completed whereas atelic situations are presented as terminated. When a telic situation is presented perfectively as a single unanalysable whole, its inherent final endpoint is naturally included, resulting in a completive reading. An atelic situation, on the other hand, does not have an inherent endpoint, so when it is presented perfectively only an arbitrary final endpoint is included, and a terminated reading is therefore appropriate.
(2b) above becomes acceptable if the quantified direct object is replaced by a bare noun9, as shown in (4): (4) wo zuotian xie-le xin, keshi mei xie-wan I yesterday write-le letters but not write-finish I wrote letters yesterday, but I didn't finish them. 5 Translated from Dai's (1997:21) quotation of Lü (1961), "The current task for researchers of Mandarin Chinese." 6 In our model we follow Smith (1997:1) and define aspect as "the semantic domain of the temporal structure of situations and their presentation." 7 Chu (1976), in his study of action verbs, also finds that the structure "action verb + -le" indicates only active attempt and actual performance rather than attainment of goal, while the structure "action verb + RVC" indicates all three. 8 The telic/atelic distinction is an important distinguishing feature for aspectual classification. A situation is telic if it has an inherent spatial final endpoint. 9 As there are no articles in Chinese, and the plural suffix -men is syntactically optional, bare nouns in Chinese can be regarded as the counterpart of bare plurals in English. The acceptability of (4) can be explained as follows. In this sentence the object xin "letters" is a bare noun, which is at best ambiguous between specific and non-specific readings. When it interacts with the accomplishment verb xie "to write", the resulting situation can naturally be understood as atelic. Thus the situation conveyed by the first clause in (4) has a terminated reading, and the further assertion can be made that the letters were not finished. The above analysis suggests that the type of closure indicated by the perfective -le is related to situation type. Smith (1988:218) also recognizes this point when she claims: The choice between termination and completion arises only with telic events, of course. Atelic events have no other possibility besides termination.10 While agreeing with the second part of this claim, I argue that no choice is open to telic situations either. That is, for telic situations only completive readings are possible. Let us examine the three examples11 Smith (1988) uses to support her claim. (5a) Zhangsan xue-le Fawen, keshi mei xue-hui Zhangsan study-le French but not learn-know Zhangsan studied French, but he still didn't know it. (5b) *wo mai-le san-ben shu, keshi mei mai-dao I buy-le three-CL book but not buy-succeed Lit.: I bought three books, but I didn't buy them. (5c) Zhangsan zhao-le ta de shoubiao, keshi mei zhao-dao Zhangsan look:for-le he DE watch but not look:for-succeed Zhangsan looked for his watch, but he didn't find it. The first point to be noted here is that Smith asserts that the completive readings in (5a)-(5c) are cancelled by the conjuncts (Smith, 1988:288). On closer examination, however, we find that xue Fawen "study French" and zhao ta de shoubiao "look for his watch" are both atelic events, because only xue-hui Fawen "to learn French" and zhao-dao ta de shoubiao "to find his watch" are telic (cf. Smith, 1988:220, 234; Tai, 1988:290). If -le in (5a) and (5c) did signal completive readings which were cancelled by the conjuncts, Smith would be contradicting her own claim, quoted above, that only termination is possible for atelic situations. Secondly, while Smith is right in saying that the first clauses in (5a) and (5c) do not have completive readings, she is wrong in the case of (5b). For the same reason discussed in the analysis of (2) above, mai san-ben shu "buy three books" in (5b) is a telic event, and thus its completive reading cannot be cancelled.
Therefore, this is an invalid example for her purpose. Smith is on the right track when she recognizes that "because telic events involve completion, they may be used to implicate completion" (Smith, 1988:228). But regrettably, she attributes the final, decisive role in determining the closure type to pragmatics. Tai (1984:291-292) also observes that "Vendler's examples of accomplishment expressions such as 'to paint a picture' and 'to write a letter' may or may not imply attainment of goal in Chinese" (ibid:291). Tai's observation is true to the facts, but the reason he provides for it, "depending on the particular context which a native speaker happens to be in" (ibid:291), is not. It is argued here that the closure types of these situations depend on how we translate these phrases. If we translate "to paint a picture" as huahua and "to write a letter" as xiexin, then they are atelic; when they are presented perfectively with the verbal -le, only terminated readings are possible. But if we translate "to paint a picture" as hua yi-fu hua and "to write a letter" as xie yi-feng xin, then they are telic situations and allow only completive readings when presented perfectively12. Tai (1984:291) argues that sentences like (2a) may imply the attainment of goal "for many native speakers", but that sentences like (2b) "suffice to show the implication is not absolute." Tai's argument is even less convincing than Smith's claim that "the completive interpretation is conversational only" (1997:265). One problem with Tai's argument is its unreliable theoretical basis: "for many native speakers" is a rather vague notion (how many? what percentage?), and unfortunately Tai has not carried out a demographic survey. Another problem is the acceptability of his counter-examples: as noted above, sentences like (2b) are in fact semantically unacceptable. If Tai had followed the convention of treating Chinese as an article-less language and had not translated these two phrases so literally, he might have seen the point. 10 But regrettably, even this claim is later negated by Smith herself: "But in Chinese perfectives termination and completion are expressed separately for all situation types" (1997:73), which in turn is contradicted by her own assertion that accomplishments may be either terminated or completed with the simple perfective viewpoint (1997:264). 11 This pair of examples is taken from Smith (1988:220), but the English translations of (a) and (c) are modified because, according to Tai (1984:290-291), xue and xue-hui have English equivalents: "study" for xue and "learn" for xue-hui. While xue and "study" are atelic, xue-hui and "learn" are telic. In (c) the same applies: while zhao and "look for" are atelic, zhao-dao and "find" are telic. 12 These two translations are both possible because Chinese has no articles. Our argument for the correlation between closure type and situation type13 does not depart far from Smith (1988:218), since she also agrees that atelic events14 can only be interpreted as terminated. We differ in our treatment of telic events. Smith's accomplishments are of two types: one is the simple form, like xie yi-feng xin "to write a letter"; the other is the RVC form, like xie-wan yi-feng xin "to write-finish a letter". Her second type of accomplishment falls within the category of achievements in our model15. As an achievement encodes result in itself and is punctual by nature, it is expected that once such a situation is realised, it is completed.
This prediction is supported by the empirical evidence. Of the 510 achievements taking the perfective -le found in our corpus, all have completive readings without exception. Here are some examples: (6a) na jiahuo shao-cheng-le hui, wo ye neng ren-chulai (File 9558601) that guy burn-become-le ash I too can recognise Even if that guy was burnt into ashes, I would recognize him. (6b) ta...zhidao yu-shang-le gaoshou (File 9560501) he...know encounter-le master-hand He knew that he had encountered a master-hand. Our difference with Smith in this respect revolves around the closure type of accomplishments (her simple form accomplishments) when they take the perfective -le. My argument is that accomplishments can only be interpreted as completed, whereas Smith assumes that this type of situation may have a choice between termination and completion. This assumption, however, is ungrounded, because the counter-examples she uses for the contradiction test, e.g., (2b) and (5b), are semantically unacceptable. Smith's assumption also lacks empirical evidence. Even if her intuition is correct when she invents such examples, these utterances are not supposed to be found in real language. In our corpus data, all of the 326 accomplishments taking the perfective -le can only allow completive readings. Let us consider an example cited from the corpus: (7a) wo jimang yi gaojia zhu-le yiliang Beijing jipuche, zhishi Wangzhuang (File 9560601) I hurriedly with high:price hire-le one-CL Beijing jeep direct:drive Wangzhuang I hurriedly hired a “Beijing” jeep at a high price, and headed direct for Wangzhuang. (7b) *wo jimang yi gaojia zhu-le yiliang Beijing jipuche, keshi mei zhu-dao I hurriedly with high:price hire-le one-CL Beijing jeep but not hire-succeed Lit.: I hurriedly hired a “Beijing” jeep at a high price, but didn't succeed hiring it. The situation “I hired a Beijing jeep” in (7a) is an accomplishment presented perfectively. (7b) shows that even if a conjoined second clause could cancel its completive reading, the second clause would clash with some other sentential element, i.e., “at a high price”. We normally assume that when the price is settled, the deal is done. Furthermore, if the completive reading of the actualised accomplishment could be cancelled, there would be no subsequent event “headed for Wangzhuang”. Therefore, our argument for the positive relation between telicity value of a situation and its closure type is supported by both theoretical analysis and empirical evidence. 3. Interaction between -le and situation types Before we go on to examine the interaction between the perfective -le and situation types, it is necessary to make a brief introduction to our aspect model. Following Smith (1991; 1997), aspect is taken to have two components, namely, situation aspect and viewpoint aspect. The former is concerned with the inherent temporal features of a situation while the latter provides a perspective to view the situation. Aspect is the synthetic result of these two components. 13 Pan (1993) also observes that “different situation types influence the interpretation of perfective”, with an accomplishment, -le indicates that the event started and finished later; with an activity, -le indicates it started and terminated later. 14 Smith argues that the perfective does not interact with statives in Chinese, which, according to our data, is not true (see Section 3). 15 In our model, all verbs that encode result are classified as achievements (See Section 3). 
In our model, situation aspect is concerned with both the lexical and the sentential levels. This two-level approach differs from Vendler (1967) and Smith (1991, 1997). At the lexical level, verbs are grouped into six classes on the basis of five distinguishing features16, as shown in Table 2:
Table 2: Feature matrix system of verb classes (features in the order [±dynamic] [±durative] [±bounded] [±telic] [±result])
activities: [+dynamic] [+durative] [-bounded] [-telic] [-result]
semelfactives: [+dynamic] [-durative] [±bounded] [-telic] [-result]
accomplishments: [+dynamic] [+durative] [+bounded] [+telic] [-result]
achievements: [+dynamic] [-durative] [+bounded] [+telic] [+result]
individual-level states: [-dynamic] [+durative] [-bounded] [-telic] [-result]
stage-level states: [±dynamic] [+durative] [-bounded] [-telic] [-result]
These verb classes interact with their arguments and adjuncts at three different levels according to the following rules:
A: Lexical level:
Rule 1: Verb[-telic/±bounded] + RVC → derived Verb[+result/+telic]
Rule 2: Verb[-telic/±bounded] + reduplicant → derived Verb[+bounded]
B: Core-sentence level:
Rule 3: NP + Verb[+telic] + NP[αcount] → Situation[αtelic]17
Rule 4: NP[αcount] + Verb[+telic] (+ NP) → Situation[αtelic]
Rule 5: NP + Verb[-telic] + PP[Goal] → Situation[+telic]
C: Full-sentence level:
Rule 6: Core-sentence[-bounded] + for-PP/from...to → Full-sentence[+bounded]
Rule 7: Core-sentence[+telic] + for-PP/from...to → Full-sentence[-telic]
Rule 8: Core-sentence[±bounded] + quantity NP → Full-sentence[+bounded]
Rule 9: Core-sentence[+telic] + progressive → Full-sentence[-telic]
Rule 10: Core-sentence[-result] + ba/bei-construction → Full-sentence[+result]
The interaction at these levels results in six basic situation types and five derived types, as shown below:
Table 3: Feature matrix of situation types (features in the order [±dynamic] [±durative] [±bounded] [±telic] [±result])
ILS: basic - + - - -; derived - + + - -
SLS: basic ± + - - -; derived ± ± + - -
ACC: + + + + -
ACT: basic + + - - -; derived + + ±18 - -
SEM: basic + - ± - -; derived + + ± - -
ACH: basic + - + + +; derived + + + + +
16 In addition to the three traditional features, two new features, [±bounded] and [±result], are introduced to separate verb classes from situation types. Both telicity and boundedness relate to a final endpoint, but the former is spatially defined while the latter is temporally defined. The feature [±result] refers to whether or not a verb encodes a result in itself. 17 α is a variable with the value of either plus or minus. [+count] NPs should be understood as singular or specific plural countable NPs, or "quantised" arguments in Krifka's (1987, 1989) terms, while [-count] NPs include mass nouns and bare plurals. The [±count] distinction is similar to Smith's count/mass opposition or Verkuyl's (1993) [±SQA]. 18 Derived activities have the value [±bounded] because they represent a complicated category: when basic activities are delimited by a specific time frame, they are [+bounded]; when accomplishment verbs take [-count] NPs or the progressive, or when semelfactives allow indefinite multiple-event readings, they are derived activities with the value [-bounded]. It should be noted that the situation types discussed here are the final result of composition processes at the full-sentence level. When basic states and activities are temporally bounded by delimiting mechanisms, bounded states and bounded activities result. Derived activities can also be obtained from accomplishments taking the progressive, from semelfactives occurring with the progressive or with temporal adverbials indicating an indefinite time frame, and from achievements taking the progressive or [-count] NPs. Accomplishments do not have a derived situation type19. The basic semelfactives have a single-event reading; when they occur with quantity NPs or with temporal adverbials indicating a definite time frame, they become derived semelfactives. When basic achievements take [+count] NPs, derived achievements result.
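To make these composition rules more concrete, here is a minimal sketch in Python (not from the paper; the feature values follow Tables 2 and 3, but the rule selection and the examples are simplified illustrations of my own) showing how Rule 1 and Rules 3/4 can be applied to derive a situation-level telicity value from a verb's lexical features and its arguments.

# Lexical feature values for three verb classes, following Table 2 (only the features
# needed for Rules 1, 3 and 4 are modelled here).
VERB_CLASSES = {
    "activity":       {"telic": False, "result": False},
    "accomplishment": {"telic": True,  "result": False},
    "achievement":    {"telic": True,  "result": True},
}

def apply_rvc(verb_features):
    # Rule 1: Verb[-telic] + RVC -> derived Verb[+result/+telic].
    derived = dict(verb_features)
    derived.update({"telic": True, "result": True})
    return derived

def situation_telicity(verb_features, np_is_count):
    # Rules 3/4 (simplified): a [+telic] verb with an [alpha-count] NP yields an
    # [alpha-telic] situation; a [-telic] verb yields an atelic situation.
    if not verb_features["telic"]:
        return False
    return np_is_count

# zhao "look for" is an atelic activity; adding the RVC -dao gives zhao-dao "find" (Rule 1).
zhao_dao = apply_rvc(VERB_CLASSES["activity"])
print(zhao_dao)  # {'telic': True, 'result': True}

# xie "write" is an accomplishment verb: with the [+count] object yi-feng xin "one letter"
# the situation is telic (completive reading); with the bare noun xin "letters" it is atelic.
xie = VERB_CLASSES["accomplishment"]
print(situation_telicity(xie, np_is_count=True))   # True
print(situation_telicity(xie, np_is_count=False))  # False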
The basic semelfactives have single-event reading; when they occur with quantity NPs or temporal adverbials indicating definite time frame, they become derived semelfactives. When basic achievements take [+count] NPs, derived achievements come as a result. Having discussed the temporal features of situation types, we are now in a position to examine the interaction between the perfective -le and situation types. More recently, this topic has attracted much interest. Smith (1997:70,264) and Pan (1993), for example, assert that the perfective -le is not available to states. Pan (1998) becomes aware of the distinction between stage-level and individual-level predicates and corrects his generalization as “perfective marker -le can be used only with stage-level predicates which include some of the statives”. “Some of the statives” here refer to stage-level states (SLS) like ta bing-le san-tian “He was ill for three days”. Smith and Pan's assertions suggest that the perfective -le is sensitive to the feature of dynamicity. On the other hand, Li (1999) argues that the perfective -le only appears in telic situations20 like accomplishments and achievements, but not in atelic situations like states and activities21. Yang (1995) is aware of the different natures of spatial and temporal endpoints. She argues that all situations with a spatial final endpoint (i.e., telic situations) can be presented with the perfective viewpoint marked by -le. In addition, atelic situations (including states), when they are temporally bounded by delimiting mechanisms, can also take the perfective -le. But without such delimiting devices providing a temporal boundary, atelic situations cannot felicitously co-occur with -le. The arguments made by these two authors suggest that the perfective -le is sensitive to spatial or temporal endpoint. Yang's observations appear to be closer to the fact, but her categorical statement that no [-bounded] situation can take -le (ibid:115) is arguable, because our data does not allow for a clear-cut distinction. Based on our corpus data, I argue the perfective -le is more sensitive to the feature of telicity and boundedness than to dynamicity as Smith and Pan suggest. But the sensitivity is rather a matter of degree. As can be seen in Table 3, activities and two types of states are inherently [-bounded] and [-telic], while accomplishments and achievements are intrinsically [+bounded] and [+telic]. Semelfactives are [-telic] but shift between [+bounded] and [-bounded]. Therefore, we expect the perfective -le to be more likely to co-occur with accomplishments and achievements. This prediction is in fact borne out of the corpus data. A breakdown of the situations taking -le in the corpus is given as follows: Table 4: A breakdown of situations taking the verbal -le: ILS SLS ACT SEM ACC ACH Total 29 19 109 26 326 510 1019 2.85% 1.86% 10.70% 2.55% 31.99% 50.05% 100% From these figures, it is clear that more than 80% of the total are telic situations. Furthermore, of the atelic situations (accounting for around 18% of the total), more than half involve a temporal boundary provided by delimiting devices. Specifically, 82 out of 109 activities, 16 out of 26 semelfactives, and 2 out of 19 SLS are temporally bounded, taking up 9.81% of the total. When these [+telic] and [+bounded] situations are taken together, they account for more than 90% of the situations taking the perfective -le in our data. The chi-square test shows that our result is highly significant. 
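Purely as an illustration of the statistical claim just made, the following sketch in Python runs a chi-square test over the Table 4 counts (assuming scipy is available). The paper does not state which expected distribution its test assumed, so a uniform expectation is used here, and the grouping into telic versus atelic situations is added only for convenience.

from scipy.stats import chisquare

observed = {"ILS": 29, "SLS": 19, "ACT": 109, "SEM": 26, "ACC": 326, "ACH": 510}
stat, p = chisquare(list(observed.values()))        # uniform expected counts by default
print(f"chi-square = {stat:.1f}, p = {p:.3g}")

# Share of -le tokens occurring in telic situations (ACC + ACH), as discussed in the text.
telic_share = (observed["ACC"] + observed["ACH"]) / sum(observed.values())
print(f"telic situations: {telic_share:.1%}")       # about 82%

Under this illustrative null the test comes out far beyond conventional significance thresholds, in line with the result reported in the text.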
This indicates 19 Because the derived situation types of accomplishments have exactly the same feature values as their basic types, these two are combined into one. 20 Although Li (1999) also uses the term “bounded”, she actually intends the term to mean “telic”, because in her model, “boundedness” actually refers to “the natural final point signaling change of state”. 21 Li (1999) does not differentiate between achievements and semelfactives, nor is she aware of the distinction between SLS and ILS. 632 a strong tendency for -le to occur with situations with spatial or temporal endpoints. In what follows, we'll discuss the interaction of the perfective -le with various situation types. States may hold for an indefinite interval and are therefore intrinsically open-ended. This feature explains their relatively low co-occurrence frequency with the perfective -le. ILS verbs are predicated of the more permanent dispositions or “properties” of an individual. Because -le only functions to present a situation in its entirety but does not provide any endpoint, the mere addition of -le to ILS normally does not result in grammatical sentences unless a temporal boundary is explicitly provided by an extra delimiting device. But this requirement is not absolute. Here are some corpus examples, in which ILS verbs are italicised. (8a) Yindu he Bajisitan ye you-le he nengli (File 9558801) India and Pakistan also have-le nuclear capacity India and Pakistan also had nuclear capacities. (8b) Yang Qinxian jiu jubei-le zhe-lei renwu de quanbu tezheng (File 9559901) Yang Qinxian then possess-le this-type people DE all characteristics Yang Qinxian bears all of the characteristics of a dangerous person. In both sentences, ILS situations are not bounded. The perfective -le indicates that these situations are presented as a single whole. But it should be noted that -le in these sentences can be omitted without significant change in meanings. This shows that ILS behave quite differently from other situation types in respect of aspectual marking: while the latter have to be marked aspectually, either overtly or covertly, to have a specific closed reading, the former do not have this requirement (c.f. also Yang, 1995:108; Moens, 1987). In this respect, SLS are more “event-like” because they also have to be marked aspectually. Compare the acceptability of the following: (9a) yi-ge laotaipo chulai, jian shi ji-ge jingcha, dunshi huang-le shen (File 9560701) one-CL old woman come:out see be some-CL police at:once scare-le spirit An old woman came out. She was scared out of her wits when she found the visitors were some policemen. (9b) shuo dao zher, Zhang Dandan shiran-le (File 9561301) say reach here Zhang Dandan at:ease-le Having said these, Zhang Dandan felt at ease. These two sentences denote SLS. If the perfective -le was removed, they would become ungrammatical. In this sense, SLS are more akin to non-statives than to ILS. Activities are intrinsically neither telic nor bounded unless there is an extra delimiting device providing them with a temporal boundary. Because the perfective -le is sensitive to endpoint, we predict activities taking -le are more likely to be temporally bounded. This prediction is in fact supported by our empirical data. Out of the 109 activities taking the perfective -le found in our corpus, 82 have a temporal endpoint provided by some delimiting mechanism, accounting for more than three quarters of the total. 
This piece of evidence also tells against the claim made by some scholars (e.g., Yang, 1995:116 and Li, 1999:216) that atelic or unbounded situations can never take -le. Rather, our data show that the compatibility is merely a matter of tendency. Consider the following corpus examples: (10a) ta pai-le wushu-ge huaqian-yuexia de baima-wangzi (File 9560301) he act-le countless-CL romantic DE white knight He has acted countless romantic white knights. (10b) yi-ge xiao nühai... beishang de ku-le qilai22 (File 9560701) one-CL little girl sadly DE cry-le start A little girl began to cry sadly. The situations described in (10) are both unbounded activities, but it is not hard to find them in real language. The verb pai “to play the part, act” in (10a) is an accomplishment verb, but its interaction with a [-count] object NP (modified by wushu-ge “countless”) results in an atelic situation; ku “to cry” in (10b) is also an activity with no endpoint. In these cases, the perfective -le simply focuses on the realisation of these situations and gathers them in their entirety. In comparison, bounded activities take the perfective -le more easily. Our data register a ratio of 3.04:1 between bounded and unbounded 22 The suffix -qilai is an imperfective aspect marker indicating inceptiveness. 633 activities. As activities are inherently unbounded, their temporal final endpoint is normally provided by an extra delimiting mechanism. Consider the following examples: (11a) xingxun yanxu-le san-ge xiaoshi (File 9556901) inquisition by torture last-le three-CL hour The inquisition by torture lasted as long as three hours. (11b) na hanzi zuoyou xunshi-le yi-fan, disheng dao... (File 9557601) that man left:right look-le one-CL low:voice say The man cast his eyes around, and said in a low voice... (11c) wo huitou wang-le wang zhe-ge popo-lanlan de jia (File 9560701) I turn:around look-le look this-CL worn out DE home I turned around and took a brief look at this run-down home. The activities denoted in the above sentences are bounded respectively by a temporal NP (11a), a quantity NP (11b) and a verb reduplicant (11c). It is clear that the aspect marker -le does not provide any endpoint information, rather it only indicates the occurrence or realisation of a situation. Because their inherent temporal boundary can be easily overridden when they shift from the singleevent reading to the multiple-event reading, semelfactives pattern with activities. But semelfactives differ from activities in that they may have the feature of [+bounded] even without an extra delimiting mechanism. Therefore we predict that semelfactives can take the perfective -le more freely. This prediction is supported by our data. Of the 26 occurrences of semelfactives taking -le, 16 are bounded by extra delimiting mechanisms, with a ratio of 1.6:1, lower than the ratio for activities 3.04:1. Our observations on the behavior of semelfactives also run against Yang (1995:118), who assumes that “delimiting mechanisms have to be employed to provide specific closed readings out of semelfactives.” Here is a corpus example of semelfactives without an extra delimiting device: (12) Fu Yiwei de xiao guzi da-le Chen Hua (File 9559301) Fu Yiwei de younger sister-in law beat-le Chen Hua Fu Yiwei's younger sister-in-law beat Chen Hua. When a semelfactive needs to be bounded, the same three delimiting devices also apply, as shown in the following examples: (13a) (tamen) da-le ni ji-tian? 
(File 9556901) they beat-le you how many-days (Temporal NP) For how many days did they beat you? (13b) Yang Qinxian zhui-shang-le ta, ju dao lian chi-le liu-xia (File 9559701) Yang Qinxian chase-up-le him raise knife successively stab-le six-CL (Quantity NP) Yang Qinxian caught up with him and stabbed him six times with his knife. (13c) laoren xiao-zhe dou-le dou shou (File 9560501) old man smile-DUR shake-le shake hand (Reduplicant) The old man shook his hand with a smile. While the interaction of the perfective -le with all other situation types is an issue that has aroused hot debate, there is an unanimous agreement that accomplishments and achievements can take -le without any trouble (e.g., Smith, 1997; Pan, 1998; Yang, 1995; Li, 1999). Accomplishments and achievements are both telic situations, this means that they have both spatial final endpoint and temporal boundary even without the help of an extra delimiting mechanism. As such, these two situation types interact with the perfective -le most naturally. From Table 4 above, we see that accomplishments and achievements combined account for more than 80% of the total number of situations taking the perfective -le found in our corpus data. This furnishes empirical evidence in favour of our assumption that the perfective -le is sensitive to endpoint, but the sensitivity is merely a matter of degree. In the following examples, situations in (14) are accomplishments and those in (14) are achievements. (14a) women you kaifa-le yixilie xin chanpin (File 9561401) we also develop-le a:series:of new product We also developed a series of new products. (14b) qunian shiyue, Yang Bingming xie-le liang-feng xin (File 9560401) last year October Yang Bingming write-le two-CL letter Last October, Yang Bingming wrote two letters. (15a) (tamen) di'er tian shangwu shi dian jiu dida-le mudidi (File 9558001) they 2nd day morning 10 o'clock already reach-le destination 634 They arrived at their destination at 10 o'clock the next morning. (15b) wo haishi kan-chu-le pozhan (File 9557301) wo still see-out-le weak:point But I still spotted his weakness. It should be noted that although accomplishments and achievements have both spatial and temporal endpoints, these endpoints are either encoded in basic or derived verbs themselves (achievements) or provided by their arguments or adjuncts (accomplishments). In other words, -le interacting with these two situation types only present them as an unanalysable whole. As with all other situation types, -le does not provide any endpoint. Summing up, it is clear that (1) the perfective -le interacts with all situation types in Chinese; (2) there is a strong tendency for -le to co-occur with spatially or temporally bounded situations; (3) as a perfective aspect marker, -le only focuses on the totality of a situation but does not provide any endpoint. 4. Conclusion In this study, we have cleared away some confusion over the perfective -le with empirical evidence from a Chinese corpus. Based on the discussions above, our answers to the three questions raised at the beginning are clear enough. First, as a perfective aspect marker, -le is different from the COS le. Their differences in respect of syntactic distribution, semantic function, etymological source, and productivity in the natural language all evidence that this is an unarguable linguistic fact. Second, the perfective viewpoint marked by -le can presents a situation either as completed or as terminated. 
The perfective - le only gathers a situation as a whole but does not provide any endpoint, so the closure type depends upon situation types. That is, telic situations are presented as completed whereas atelic situations are presented as terminated. Third, the perfective -le can interact with all situation types, but it demonstrates a strong tendency to co-occur with spatially or temporally situations. References Brinton L 1988 The development of English aspectual system. Cambridge, Cambridge University Press. Bybee J 1993 The evolution of grammar: tense, aspect and modality in the languages of the world. Chicago, University of Chicago Press. Chao Y 1968 A grammar of spoken Chinese. Berkeley, University of California Press. Christensen, M 1994 Variation in spoken and written Mandarin narrative discourse. Unpublished PhD thesis, Ohio State University. Chu C 1976 Some semantic aspects of action verbs. In Lingua 40: 43-45 Dai Y 1997 Xiandai hanyu shiti xitong yanjiu (A study of aspect in modern Chinese). Hangzhou, Zhejiang Educational Press. Henne H, Rongen O, Hansen L 1977 A handbook on Chinese language structure. Oslo, Universitetsforlaget. Li C, Thompson S 1981 Mandarin Chinese. Berkeley, University of California Press. Li M 1999 Negation in Chinese. Unpublished PhD thesis, Manchester University. Lü S 1981 Xiandai hanyu babai ci (800 words in Modern Chinese). Beijing, Commercial Publishing House. Moens M 1987 Tense, aspect and temporal reference. Unpublished PhD thesis, Edinburgh University. Norman J 1988 Chinese. Cambridge, Cambridge University Press. Pan H 1993 Interaction between adverbial quantification and perfective aspect. In Stevan L, Bloomington (eds) Indiana University Linguistic Club Publications. 1993:188-204. Pan H 1998 Adverbs of quantification and perfective aspects in Mandarin Chinese. http://ctlhpan.cityu.edu.hk/haihuapan/pan/pan-publication.htm Smith C 1988 Event Types in Mandarin. In Chan M, Ernst T (eds.) Proceedings of the 3rd Ohio State University Conference on Chinese Linguistics. Ohio, pp215-243. Smith C 1991 The parameter of aspect (1st Ed.). Kluwer Academic Publishers. Smith C 1997 The parameter of aspect (2nd Ed.). Kluwer Academic Publishers. Tai J 1984 Verbs and times in Chinese: Vendler's four categories. In Papers from Parasession on Lexical Semantics (CLS) 27-28: 289-296. 635 Teng S 1986 Hanyu dongci de shijian jiegou (The temporal structure of Chinese verbs). In Proceedings of 1st International Symposium on Chinese Teaching. Beijing, Beijing Languages Institute Press. Tiee H 1986 A Reference grammar of Chinese sentences. Tucson, the University of Arizona Press. Verkuyl H 1972 On the compositional natural of aspects. Dordrecht-Holland, D. Reidel. Verkuyl H 1993 A theory of aspectuality. Cambridge, Cambridge University Press. Vendler Z 1967 Linguistics in Philosophy. Cornell, Cornell University Press. Yang, S 1995 The aspectual system in Mandarin Chinese. Unpublished PhD thesis, Victoria University. Zhang L 1995 A Contrastive study of aspectuality in German, English & Chinese. Peter Lang. Zhu D 1981 Yufa jiangyi (Teacher's guide to Grammar). Beijing, Commercial Publishing House. A Corpus-based Contrastive Analysis of Spoken and Written Learner Corpora: The Case of Japanese-speaking Learners of English Mariko Abe (Sophia University) 1. Introduction The purpose of this research is to investigate the variability of interlanguage that has been claimed in previous study by means of a corpus-based quantitative analysis. 
It aims to observe style shifting in various grammatical features and word formation errors by error-tagging data from 297 learners. Various studies have been undertaken to describe and explain the process of second language (L2) acquisition. Tarone (1983), for example, claims that learners' interlanguage capability forms a continuum that diverges according to the degree of attention paid to language form, ranging from a 'careful' style to a 'vernacular' style. Tarone (1985) further showed that learner performance varies with the task, whether a written grammar test, an oral interview or an oral narrative. Ellis (1987) confirmed style shifting in L2 learners' use of the past tense: the accuracy of past tense morphemes varied when different amounts of planning time were allowed for a single narrative discourse task. From this examination Ellis (1987) concluded that the "so-called 'natural' order may not be a stable phenomenon" (p.1). In this study I compare the features of variability in L2 written and spoken corpus data, based on the hypothesis that the learners' processing mode affects their L2 performance. The research focuses on the same task performed in different production modes. The similarities and differences between learners' errors in each mode are examined mainly from the perspective of grammatical and word formation features. In addition, the learners' English proficiency level is a further factor included in this study. Although this addition was only possible for the spoken corpus data, we can still observe the performance of learners at different proficiency levels. 2. Corpus selection The spoken data were extracted from the Standard Speaking Test (SST) Corpus (Tono et al. 2002). This test has 9 different levels for assessing the speaking proficiency of learners of English. The spoken data came from 100 examinees belonging to SST levels 2 to 9. Although the speaking test has 5 stages, only one of them, the single picture description stage, was used. A single picture is chosen from 5 different pictures, and the examinees are asked to describe it in 2 or 3 minutes. The written data were all collected by the author using a similar type of picture description task. The 197 examinees were all university students who had been studying English for 6 years. In addition, 31 of the 197 examinees had also taken the simple version of the SST, and their results were used to create an L2 spoken subcorpus. 3. Data processing All of the hand-written manuscripts for the written corpus were transcribed on a word processor by the researcher, and the two sets of data were error-tagged according to The TAO Speech Corpus of Japanese Learner English Error Tagging Manual Ver.1.0 (Isahara, Saiga, and Izumi 2002). This error-tag set is divided into three main levels. The first level consists of five criteria: Word Formation Errors (WF), Grammatical Errors (G), Lexico-Grammatical Errors (LG), Lexical Errors (LXC), and Others (O). The second level is divided into part of speech and other categories.
The final level is the category of the error, as follows: inflection (inf), number (num), Japanese English (je), genitive (gen), agreement (ag), form (f), tense (tns), voice (vo), finite/infinite (fin), negation (ng), question (qst), modal (mo), quantifier (qnt), position (pst), countability (cnt), complement (cmp), dependent preposition (dprp), word redundancy (rdd), omission (oms), misordering (odr), ambiguity (amb), and unnaturalness (unl). Spelling errors and word division errors were not normalized but were manually tagged and included in the word formation error section. Since the total corpus size of the written and spoken data was almost the same, no frequency normalization was carried out in this research.

4. Data analysis: general error type

In this section, spoken and written errors are sorted by general error type. Table 1 below shows that grammar is the category with the most errors for Japanese learners of English. Consequently, it might be meaningful to inspect the effect of mode of production on grammatical errors. According to Ellis (1987), previous studies have not primarily investigated interlanguage variability from the perspective of grammatical structure.

Table 1: Frequency of general error types
                            WR Freq.   WR %    SP Freq.   SP %
Grammatical errors             1,136   50.02      473     58.40
Others                           421   18.54      185     22.84
Lexical errors                   291   12.81      111     13.70
Lexico-grammatical errors         53    2.33       33      4.07
Word formation errors            370   16.29        8      0.99
Total                          2,271              810

Figure 1: Frequency of general error types (bar chart of the percentages above, by category, for SP and WR).

Except for the category of word formation errors (WF), there is no difference in frequency rank order between the spoken and written corpora. Almost all categories show similar percentages of the overall errors; interestingly, however, the spoken data have a higher error frequency than the written data in every category except word formation. Since the criteria of G (Grammatical errors), LG (Lexico-grammatical errors) and WF (Word formation errors) are subcategorised by part of speech, further examination is provided in the following sections.

5. Data analysis: part of speech

When we sort each category by part of speech, another interesting feature can be observed. Articles have the highest error rate in both production modes, followed by verbs, nouns, pronouns, adjectives, adverbs, and prepositions. The error frequency rate as well as the rank order is almost identical in both modes, although the spoken data have a higher density of errors for nouns and pronouns. In addition to the low frequency of preposition errors, the error rates for adjectives and adverbs are low. This need not mean that the grammatical rules associated with these parts of speech have been mastered; it may rather indicate that learners are unconsciously avoiding these rules. Considering this rank order, learners may therefore underuse nouns and overuse pronouns, since according to Granger & Rayson (1998) the rank order of word categories in non-native data is as follows: nouns, verbs, prepositions, articles, adjectives, conjunctions, adverbs, determiners and pronouns. Table 2 below specifies the error distribution pattern in terms of part of speech, but it does not show accuracy rates. The analysis is based not on learners' accuracy rates but on the frequency of learners' errors, so we cannot generalise about the degree of learners' avoidance or acquisition of particular grammar points.
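As a minimal illustration of how breakdowns like Table 1 can be derived from the error-tagged files, here is a sketch in Python. The assumption that each extracted error is represented by a tag string whose prefix encodes the first-level category (g_, wf_, lg_, lxc, o_) mirrors the tag names used later in this paper, but the input format and function names are hypothetical.

from collections import Counter

LEVEL1 = [("lg_", "Lexico-grammatical"), ("g_", "Grammatical"),
          ("wf_", "Word formation"), ("lxc", "Lexical"), ("o_", "Others")]

def level1(tag):
    """Map an error tag such as 'g_v_agr' or 'wf_o_o' to its first-level category."""
    for prefix, label in LEVEL1:
        if tag.startswith(prefix):
            return label
    return "Others"

def breakdown(tags):
    """Count errors per first-level category and report raw frequency and percentage."""
    counts = Counter(level1(t) for t in tags)
    total = sum(counts.values())
    return {label: (n, 100.0 * n / total) for label, n in counts.most_common()}

wr_tags = ["g_at", "g_at", "wf_o_o", "lxc", "o_oms", "g_v_agr"]   # toy written-mode tags
print(breakdown(wr_tags))

Counts of this kind say nothing about accuracy rates, only about error frequencies.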
However, we can at least compare the error rates of the written and spoken modes in the various grammar categories of each part of speech. From this standpoint, detailed examinations are presented in the subsequent sections for the subcategories of nouns (N) and verbs (V) respectively.

Table 2: Error distribution pattern in part of speech
        WR Freq.   WR %    SP Freq.   SP %
AT        822      36.22     299      36.91
V         312      13.74     141      17.41
N          41       1.81      51       6.30
PN         21       0.92      15       1.85
AJ          1       0.04       4       0.49
AV          0        -         1       0.12
PRP         0        -         1       0.12
AT=article; V=verb; N=noun; PN=pronoun; AJ=adjective; AV=adverb; PRP=preposition

Figure 2: Error distribution pattern in part of speech (bar chart of the percentages in Table 2 for WR and SP).

5.1. Noun Singular common nouns are the nouns most frequently used in both spoken and written mode by native speakers of English (Leech et al. 2001), but this pattern does not hold in this study. There are 5 subcategories in the noun criteria, and their error rates in the two production modes are shown in Table 3. In this section we will concentrate on the most striking errors and on the items that show a striking dissimilarity between modes.

Table 3: Error frequency rate of subcategories in noun
              WR Freq.   WR %     SP Freq.   SP %
g_n_num          27      65.85       37      72.55
lg_n_cnt          5      12.20        5       9.80
wf_n_inf          0        -          5       9.80
Countability    (32)    (78.05)     (47)    (92.16)
g_n_gen           9      21.95        2       3.92
lg_n_dprp         0        -          2       3.92
Noun total       41                  51

Figure 3: Error frequency rate of subcategories in noun (bar chart of the percentages in Table 3 for WR and SP).

5.1.1. Countability Tags such as (many *book), (listening to a *music) and (*childs) fall under the category of countability. When Tarone (1985) examined the accuracy rate of the plural marker 's', only the form shift of the morpheme was tested, and no variability was observed across tasks. In this section, I examine the error rate of the plural marker. Before comparing the error rates of the written and spoken modes, errors such as "one of the *girl" and "I like *dog" were excluded so that only plural marker errors are counted. The result is that the error frequency for the writing mode is 14 out of 27 (51.85%) and that for the speaking mode is 18 out of 37 (48.65%). There seems to be no significant variability between the two modes, as might be expected. However, one prominent finding emerges when the spoken-mode errors are sorted by learners' proficiency level, as shown in Table 4.

Table 4: Error frequency in different SST level (SP)
SST                2    3    4    5    6    7    8    9   Total
g_n_num            0    4    3    6    9    9    5    1    37
  Plural marker   (0)  (0)  (0)  (4)  (5)  (3)  (5)  (1)  (18)
  Other           (0)  (4)  (3)  (2)  (4)  (6)  (0)  (0)  (19)
lg_n_cnt           0    1    1    0    0    2    1    0     5
wf_n_inf           0    0    2    2    1    0    0    0     5
g_n_gen            0    1    0    0    0    1    0    0     2
lg_n_dprp          0    0    0    0    0    2    0    0     2
Total              0    6    6    8   10   14    6    1    51

Figure 4: Error frequency in different SST level (SP) (bar chart of the counts in Table 4 by tag and SST level).

As can be seen from the dispersion of the frequencies in Table 4, a morpheme change of the plural marker is an error that often occurs at the intermediate levels but not at the novice level, whereas other grammar rules concerning countability are violated more often at the novice level than at the intermediate levels. As a result, we can hypothesise that when the necessity of attention to form and grammar is low, intermediate learners tend to make errors, and when the necessity of attention is high they can avoid making errors. In addition, when the necessity of attention is low, because the rule is simple, novice learners make fewer errors; and when the necessity is high, because the rule is complicated, they have a tendency to make errors.
By extracting Krashen's monitor model Ellis (1987) explains that easy rules can be monitored consciously causing the differences in style shifting. But this leads to another conclusion that intermediate learners may have an inclination to make much more errors over easily-learned rules than well-acquired rules and novice learners’ error rate increases as the load of attention to the rules increases. 5.1.2. Variability in easily learned rules Regarding the error over gender (*woman hair is black) and inflection (*childs), both has variability between the modes. Also both can be categorised as easily learned rules, nevertheless unexpectedly, these errors have the opposite error frequency rate in production modes. While there are only 2 examples in spoken corpus, larger corpus with adequate learners’ proficiency level information is needed to understand the cause of these differences. 5.2. Verb The subcategories for verbs and their corresponding error rates in different production modes are shown in Table5. In this section we will mainly focus on the most striking errors, and on the items that have dissimilarity between modes. Table 5: Error frequency rate of subcategories in verb WR Freq. % SP Freq. % g_v_agr 16151.60 7653.90 g_v_fml 4514.42 96.38 g_v_tns 4113.14 2215.60 g_v_vo 41.28 21.42 g_v_fin 30.96 74.96 g_v_ng 10.32 0-g_v_qst 10.32 0-g_v_mo 0-0-lg_v_dprp 4414.10 1510.64 lg_v_cmp 41.28 96.38 wf_v_inf 82.56 10.71 Total 312141 Figure 5: Error frequency rate of subcategories in verb 0102030405060g_v_agrg_v_fmlg_v_tnsg_v_vog_v_fing_v_ngg_v_qstlg_v_dprplg_v_cmpwf_v_inf(%)WRSP 5 5.2.1. Agreement (there *are the lady) (cat *sleep in her bed) In Tarone (1985) the accuracy level of third singular verb correction was examined, and variability according to tasks was also investigated. In this study the error frequency rate in different production modes does not diverge markedly. However, the error tag-set used in this research involves not only the third singular verb agreement errors but also every type of verb agreement errors. I excluded the agreement error for be-verbs and modal verbs from the data. The result is that the error frequency for writing mode is 95 out of 161 (59.00%) and speaking mode is 48 out of 76 (63.16%). Still a significant difference cannot be seen according to the production mode, but we have to remember the fact that Arabic learners of English were also included in Tarone's study. When we only focus on Japanese learners of English in her study, there is no apparent accuracy rate difference on different tasks. Although variability was found in neither different tasks nor modes, there is a striking result on error frequency in different SST level (see Tabel6). Table 6: Error frequency in different SST level (SP) SST 23456789G_v_agr 3281615461376person sing. verb (2)(21)(10)(12)(0)(3)(0)(0)(48)Be & modal verb (1)(7)(6)(3)(4)(3)(1)(3)(28)G_v_fml 102212019G_v_tns 1557112022G_v_fin 012301007Lg_v_dprp 0430341015Lg_v_cmp 030212109Total 5412829101654138 Figure 6: Error frequency in different SST level (SP) 051015202530g_v_agrg_v_fmlg_v_tnsg_v_finlg_v_dprplg_v_cmpSST2SST3SST4SST5SST6SST7SST8SST9 Novice learners overwhelmingly tend to make errors over whole verb agreement, compared with higher-level learners; but when we focus on the third singular verbs and the other verbs separately, another distribution is found. 
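A per-level cross-tabulation of this kind (Tables 4 and 6) can be produced with a few lines of code. The sketch below, in Python, assumes each error is stored as a (tag, SST level) pair; this data layout is an illustrative assumption rather than a description of the author's actual processing.

from collections import defaultdict

def cross_tab(errors, levels=range(2, 10)):
    """Cross-tabulate error tags against SST proficiency levels 2-9."""
    table = defaultdict(lambda: {lv: 0 for lv in levels})
    for tag, level in errors:
        table[tag][level] += 1
    return dict(table)

spoken_errors = [("g_v_agr", 3), ("g_v_agr", 4), ("g_n_num", 5), ("g_v_tns", 6)]  # toy data
for tag, row in cross_tab(spoken_errors).items():
    print(tag, [row[lv] for lv in range(2, 10)])

Table 6 above is a tabulation of this kind, restricted to the verb subcategories.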
If we follow the hypothesis concluded in section 5.1.1., ‘error rate is in inverse proportion to the degree of attention to the rule in intermediate learners, and error rate is in direct proportion to the degree of attention in novice learners,’ the agreement rule for third person singular verbs may be much more difficult than for be-verbs and modal verbs for Japanese learners of English. 5.2.2. Form (one boy is *listen to music and drinking something) Interestingly, form error is much more frequently seen in the written mode, when learners can pay more attention to the form than spoken mode. Regarding the distribution of proficiency level in spoken 6 3rd data, errors can be observed in almost every level in almost same number. When I sorted the errors into present particle errors and the others, there are 42 out of 45 (93.33%) in written mode and 5 out of 9 (55.56%) in spoken mode. In spoken data there is one example in each SST proficiency level 2, 4, 5, 7, and 9. How can we explain the fact that most of the errors are occur with present particle error, which is presumably easy learned, and with written mode? We need more adequate data and precise study to draw a conclusion. However, if learners do not have a high accuracy in simple mechanical verb form changing, we can assume that it is not fully acquired by the learners so that they need ample practice in verb form changing. 5.2.3. Complement (this person is teaching *how to them) While Tarone (1985) found the variability in the aspect of third person singular direct object pronoun (D.O. Pro ’It') in different task, it was difficult to discover the variability in different mode. This is because not all the examples in both modes cannot be categorised as errors in D.O. Pro ’It'. In the written mode 3 out of 4 (75%) examples are of this type: “A woman wears *to it”, “I will characterize *of her”, and “A girl crosses *with her legs.” Whereas in spoken mode only one example out of 9 (11.11%) can be found: “I don't know how to say *in English.” Moreover, the error related with does not seem to be connected to learners’ level. Considering the fact that there are not sufficient examples in both written and spoken mode, complement errors over the verb may have deeper connection with the misunderstanding of usage of verbs as far as Japanese learners are concerned. Consequently, analysis must be more focussed on the verbs themselves. 6. Data analysis: others In previous sections we were able to observe the variability and consistency in some of the categories. In this section, we examine the general tendency of errors that have not been mentioned so far. Table 7: The size of four corpora WR SP WR-subSP-subPicture 1 1-5 11SST level --- 2-9 2-62-6File 197 100 3128Tokens 17,863 17,222 3,2213,908Types 951 1,314 422451Type/Token/Ratio 5.32 7.63 13.111.54Ave. Word Length 4.56 3.37 4.513.31Sentences 1,507 986 265252Sent.length 10.5 16.51 10.4214.71sd. Sent. Length 5.45 14.7 6.2613.66 7 Table 8: Rank frequency of error in each corpus WR SP WR-sub SP-sub 17,863 17,222 3,221 3,908 Rank Freq.% Freq.% Freq. % Freq. 
% 1 g_at 82236.20 g_at 29936.91 g_at 13034.30 g_at 95 46.12 2 wf_o_o 36215.94 lxc 11113.70 lxc 5213.72 g_v_agr 29 14.08 3 lxc 29112.81 o_oms 10212.59 wf_o_o 5514.51 o_oms 28 13.59 4 o_oms 27812.24 g_v_agr 769.38 o_oms 4411.61 lxc 22 10.68 5 g_v_agr 1617.09 o_rdd 516.30 g_v_agr 3910.29 g_v_tns 5 2.43 6 o_odr 703.08 g_n_num 374.57 g_v_tns 123.17 g_pn 4 1.94 7 g_v_fml 451.98 g_v_tns 222.72 lg_v_dprp112.90 lg_v_dprp 4 1.94 8 lg_v_dprp 441.94 g_pn 151.85 g_v_fml 71.85 g_n_num 3 1.46 9 o_amb 421.85 lg_v_dprp 151.85 o_amb 61.58 g_v_fin 3 1.46 10 g_v_tns 411.81 o_amb 131.60 o_odr 61.58 o_odr 3 1.46 11 o_rdd 311.37 o_odr 101.23 g_n_num41.06 lg_v_cmp 2 0.97 12 g_n_num 271.19 g_v_fml 91.11 g_pn 30.79 g_av_pst 1 0.49 13 g_pn 210.92 lg_v_cmp 91.11 o_rdd 30.79 g_v_fml 1 0.49 14 g_n_gen 90.40 o_unl 91.11 g_n_gen 10.26 lg_aj_dprp 1 0.49 15 wf_v_inf 80.35 g_v_fin 70.86 g_v_ng 10.26 o_amb 1 0.49 16 lg_n_cnt 50.22 wf_n_inf 50.62 g_v_qst 10.26 o_unl 1 0.49 17 g_v_vo 40.18 lg_n_cnt 50.62 g_v_vo 10.26 wf_o_o 1 0.49 18 lg_v_cmp 40.18 g_av_pst 40.49 lg_n_cnt 10.26 wf_v_inf 1 0.49 19 g_v_fin 30.13 g_n_gen 20.25 lg_v_cmp10.26 lg_n_cnt 1 0.49 20 g_v_ng 10.04 g_v_vo 20.25 wf_v_inf10.26 21 g_v_qst 10.04 lg_n_dprp 20.25 22 g_aj_qnt 10.04 wf_v_inf 10.12 23 wf_o_je 10.12 24 wf_o_o 10.12 25 lg_aj_dprp 10.12 26 lg_prp_cmp10.12 Total error 2271 810 379 206 Errors due to the omission of one or more necessary words have relatively high frequency, but there is no significant difference in the error rate for the two modes. Regarding errors over word order , only the written corpus has a high frequency, but again there is not so much difference in error rate. This high error frequency in written corpus can be explained by novice low-level learners’ consistent errors patterns. It is caused by the total misunderstanding of English structure such as “*Open door” (“door is open”), and this was counted as an error related word order. The ranking of error concerning ambiguity is almost equal but except for the subcorpus of speaking data. Errors that were difficult to categorise in any of the criteria were included in this group. Another criterion is errors related to redundancy , and the error rate is dissimilar in each corpus. The most striking finding is that there is no error of this type in spoken subcorpus, while error rate in spoken corpus is fairy high. One possible explanation for this difference is that the proficiency level of learners in the spoken corpus is higher than that of the sub-corpus, consisting of learners from SST level 2 to 6. Therefore, we can presume that higher-level learners have a tendency to make redundancy errors. Lastly, the study did not show variability in the category of “other”, but the comparison between different proficiency level corpora will be useful in further studies. 7. Conclusion Through the detailed analysis on subcategories of nouns and verbs, we can observe the error rate difference in error over noun gender and verb form. Another finding that is noteworthy is that the error category, which has a high error rate, also has a large distribution among the learners’ proficiency level, as can be seen from the example of errors related to countability and agreement. Also we were 8 able to acknowledge that the error rate is in inverse proportion to the degree of attention to rules for intermediate learners, and error rate is in direct proportion to the degree of attention to rules for novice learners. 
Granger and Rayson (1998) have shown in their research the resemblances of written and spoken production of learners, and they conclude that communicative approach is one of the factors that have an influence on “speech-like nature of learner writing” (p.130). Since the rise of this ELT methodology we may come to emphasise fluency but not accuracy, however, this study suggests that it is also necessary to take notice on learners’ errors through instruction and feedback in the classroom. Since not all the examinees of written data were able to take the SST test in this study, it was unfortunately impossible to investigate the correlation of written and spoken modes in terms of learner's proficiency level. More detailed data on learner's proficiency level and much larger corpora will be needed in future studies. Another drawback was that all the analysis comprised of the error rate, but not of the accuracy rate. Much more impartial examination could be done, if it were possible to determine whether learners are avoiding the certain usage or not. The last point is that subcategorised tag-sets that accord with learners’ error tendency will be necessary for further study. Tag-sets for relative pronoun and conjunction, for example, were eliminated in this study. By analysing the similarities and differences between the two modes of learner corpora, I have arrived to identify the features of interlanguage variability in a more objective way, which will shed some light on the nature of the interlanguage development and possible implications for EFL pedagogy. References Ellis R 1987 Interlanguage variability in narrative discourse: Style-shifting in the use of the past tense. Studies in Second Language Acquisition 9: 1-20. Granger S, Rayson P 1998 Automatic profiling of learner texts. In Granger S (ed), Learner English on computer. London, Longman, pp119-131. Isahara H, Saiga T, and Izumi E 2002 The TAO Speech Corpus of Japanese Learner English Error Tagging Manual Ver.1.0. Leech G, Rayson P, Wilson A 2001 Word Frequencies in Written and Spoken Language. London, Longman. Tarone E 1983 On the Variability of Interlanguage Systems. Applied linguistics 4(2):143-163. Tarone E 1985 Variability in interlanguage use: a study of style-shifting in morphology and syntax. Language learning 35: 373-404. Tono Y, Kaneko T, Isahara H, Saiga T, Izumi E 2002 The Standard Speaking Test Corpus. Studies in Lexicography 11(2): 7-18.339 9 Methodology and steps towards the construction of EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing Aduriz I.*, Aranzabe M.J., Arriola J.M., Atutxa A., Díaz de Ilarraza A., Ezeiza N., Gojenola K., Oronoz M., Soroa A., & Urizar R. Department of Computer Languages and Systems Computer Science Faculty University of the Basque Country P.O. box 649, E-20080 Donostia jibaregj@si.ehu.es *Department of Linguistics Faculty of Philology University of Barcelona. E-08007 EPEC, the Reference Corpus for the Processing of Basque, is a corpus of standard written Basque that has been manually tagged at different levels (morphology, surface syntax, phrases) and is currently being hand tagged at deep syntax level. It is aimed to be a "reference" corpus for the development and improvement of several NLP tools for Basque. Although small (50,000 words), EPEC is a strategic resource for the processing of Basque and has already been used for the development and improvement of some tools. 
Half of this collection was obtained from the Statistical Corpus of 20th Century Basque, a reference corpus of Basque including 4.7 million word-forms. The other half was extracted from Euskaldunon Egunkaria, the only daily newspaper written entirely in standard Basque. When defining a general framework for the automatic processing of agglutinative languages like Basque, a morphological analyser of words is an indispensable basic tool. However, previous to the completion of the morphological analyser MORFEUS, the design of the tagset and a lexical database had to be accomplished. Choosing an appropriate tagset is a crucial task since the usefulness and ambiguity-rate of the analyser depend on it. For the morphosyntactic treatment of Basque texts, the tag system we developed is a four level system, ranging from the simplest part-of-speech tagging scheme up to the full morphosyntactic information. In addition to these four levels, further tags are added to mark verb chains, noun phrases, and postpositional phrases. Nowadays, we are involved in the syntactic tagging of the corpus, following the Dependency Structure-based Scheme in order to build a treebank. The Lexical Database for Basque (EDBL) is a general-purpose lexical database used in several text-processing tools for Basque. This large repository of lexical knowledge is the basis in many different NLP tasks, and provides lexical information for several language tools including, obviously, the morphological analyser. At present, it consists of nearly 80,000 entries. Morfeus is a robust morphological analyser for Basque. It is a basic tool for current and future work on NLP of Basque. It is based on the two-level formalism proposed by Koskenniemi (1983). Morfeus consists of three main modules: (i) the standard analyser, capable of analysing and generating standard word-forms, (ii) the analyser of linguistic variants (dialect uses and competence errors), and (iii) the guesser or analyser of words without lemmas in the lexicon. The manual disambiguation of the corpus was performed on the output of Morfeus. Thus, the whole corpus was morphosyntactically analysed giving to each word-form every possible analysis, without taking into account the context in which it appeared. Once each word-form in the corpus was analysed, we carried out the manual disambiguation process. Two linguists marked independently the correct syntactic tag to each word in the corpus, applying the “double blind” method described in Voutilainen & Järvinen (1995). Both linguists' answers were compared and, when differences occurred, they agreed a single tag. This manually disambiguated corpus was used both to improve a Constraint Grammar disambiguator and to develop a stochastic tagger. After disambiguating the morphological tags in the corpus, the next step was to assign the corresponding syntactic tag to each word-form. Syntactic function tags follow the philosophy of the Constraint Grammar (CG). By adopting the CG formalism, we express the syntactic functions of words and the interdependencies that exist among them rather than deep structural relations. So, the syntactic tags at this level refer to shallow syntactic functions, i.e. they may provide information about the surface structure of verb chains, noun phrases, or postpositional phrases. Once each word-form in the corpus was given at least one syntactic tag, we carried out the manual disambiguation process again. 
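As an illustration of the "double blind" procedure just described, the following sketch in Python compares two annotators' independently assigned syntactic tags and isolates the disagreements for joint adjudication; the tokens, tags and data structures are toy examples, not EPEC's actual annotation format.

# Each annotator's output: token (or analysis) mapped to the chosen syntactic function tag.
annotator_a = {"etxe+ABS": "@OBJ", "ikusi+du": "@+FMAINV", "mutilak": "@SUBJ"}
annotator_b = {"etxe+ABS": "@OBJ", "ikusi+du": "@+FMAINV", "mutilak": "@OBJ"}

def disagreements(a, b):
    """Return the tokens on which the two annotators chose different tags."""
    return {tok: (a[tok], b[tok]) for tok in a if tok in b and a[tok] != b[tok]}

to_adjudicate = disagreements(annotator_a, annotator_b)
agreement = 1 - len(to_adjudicate) / len(annotator_a)
print(to_adjudicate)                        # {'mutilak': ('@SUBJ', '@OBJ')}
print(f"raw agreement: {agreement:.0%}")    # 67% on this toy sample

Only the items returned by such a comparison need to be discussed by the two linguists before a single tag is agreed on.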
The method used was similar to the one used for the morphological disambiguation in the previous step. At this stage we have the corpus manually tagged with surface syntactic tags following the CG syntax. No phrase units are marked yet, although based on this representation, the identification of various kinds of phrase units, such as verb chains, noun phrases, and postpositional phrases is reasonably straightforward. In order to detect verb, noun, and postpositional phrases, we use different function tags as well as some particles (such as negative or modal particles). At present, a linguist is checking the tags that the first set of mapping rules marked up in the corpus. Whenever necessary, she adds, removes, or changes the tags automatically assigned. Once this work is finished, the first set of mapping rules developed will be tested on the corpus and the results will be used to improve the rules iteratively as well as to develop new ones. Nowadays, we are also involved in the syntactic tagging of the corpus following the Dependency Structure-based Scheme to tag syntactically the corpus in order to build a treebank. 10 During the last three years, a great effort has been done in our research group (Artola et al., 2002) to integrate the NLP tools for Basque described in previous sections. Due to the complexity of the information to be exchanged among the tools, Feature Structures (FSs) are used to represent it. Feature structures are coded following the TEI's DTD for FSs, and Feature Structure Definition descriptions (FSD) have been thoroughly defined. The documents used as input and output of the different tools, contain TEI-P3-conformant feature structures (FS) coded in SGML. In the future, we also intend to extend the corpus annotation to word sense tagging and anaphora annotation. 11 The mood of the (financial) markets: In a corpus of words and of pictures1 Khurshid Ahmad, Pensiri Manomaisupat, David Cheng, Tugba Taskaya, Saif Ahmad, Lee Gillam and Andrew Hippisley Department of Computing, University of Surrey, Guildford, Surrey. Methods of corpus linguistics are typically used to study language either synchronically or diachronically. Much of the work currently being carried out under the rubric of information extraction benefits directly or indirectly from work in corpus linguistics: the extraction of the so-called named entities, template elements, template relations, and scenario templates, all relating to meaning bearing units of language within a text, relies on various statistical tests that have to be carried out over a corpus of texts. Some brave folk attempt to carry out these tests over a corpus of multilingual texts. One of the important developments in information extraction relates to event modelling. Here linguists look at the lexico-grammatical properties of texts and attempt to derive information about events that are supposed to be reported in the texts. We have found that event modelling requires a good understanding of the modes used in communicating the events, including natural language, graphs and images. A case study of financial market movement, where a corpus of news wires and graphical information, or a financial time series, were correlated, is described. These are preliminary results of an EU 5th Framework Project –GIDA (No. IST 2000-31123). 
News streams provided by organisations like Reuters or Bloomberg comprise a range of keywords and indexical names that may change from one news item to the next; an event modeller will need to filter the news from such a diverse information resource. Specialist information providers deliver not only news texts but also, for example, time series of changes in the value of stocks, shares, currencies, bonds and other financial instruments. We have a narrower focus than other authors in information extraction (see for example Maybury et al., 1995) in that we are looking for changes in key financial instruments that are reported in financial news-wires. The news coverage of these instruments is of two types: first, there is a daily report about (numerical) changes in the value of the instruments; for instance, one can see time series comprising historic data about the changes in the values of currencies; second, the manner in which the value of the instruments changes depends on the reports relating, directly or indirectly, to the instrument. Reports about war or economic uplift/downturn, for example, affect the value of the instruments. Some authors claim that there is a correlation between 'good' or 'bad' news relating to an instrument and its potential numerical value. Our work focuses on primary movements within a market, which last from a few months to many years and represent the broad trend within the market. We report on some initial work that attempts to correlate changes in an index, the FTSE 100, with changes in 'market sentiment' as expressed in news reports about the UK economy specifically and in reports about the Wall Street indices; the latter have a substantial influence on the UK economy. Financial analysts use sophisticated political, economic and psychological analysis to determine the reaction of market operatives and to predict their possible trading decisions. Reports related to sentiment use a range of metaphors to express the state of a market and its possible movements. Francis Knowles (1996) has written about the use of health metaphors in financial news reports: markets are full of vigour and are strong, or they are anaemic and weak. Most newspapers also use animal metaphors: there are bull markets and bear markets, the former referring to expansion, and indirectly to fertility, and the latter to shy, retiring and grizzly behaviour much like that reported about bears in the popular press and in literature for children. Indeed, there are fairly literal words that express the sentiment about the markets as reported in the news wires: financial instruments rise and fall, markets boom and go bust, there are gains and losses within the markets, economies slow down or suffer downturns, and whole industry sectors may be hard-pressed. We created a corpus of 1,539 English financial texts from one source (Reuters) on the World Wide Web, published during a 3-month period (October 2001-January 2002) and comprising over 310,000 tokens. The corpus comprised a blend of both short news stories and financial reports. Most of the news is business news from Britain, with thirty percent of the news from Europe and the United States. We automatically extract the sentiment words and key terms from the text corpus diachronically.
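As a minimal sketch of this diachronic extraction step, the Python code below counts "good" and "bad" sentiment words per day of news text, producing the kind of daily series that can then be set against the FTSE 100; the keyword lists and the (date, text) input layout are illustrative assumptions rather than the project's actual lexicon or pipeline.

import re
from collections import defaultdict

GOOD = {"rise", "rises", "rising", "gain", "gains", "boom", "strong", "vigour"}
BAD  = {"fall", "falls", "falling", "loss", "losses", "slowdown", "weak", "anaemic"}

def daily_sentiment(news_items):
    """news_items: iterable of (date, text); returns {date: [good_count, bad_count]}."""
    counts = defaultdict(lambda: [0, 0])
    for date, text in news_items:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in GOOD:
                counts[date][0] += 1
            elif token in BAD:
                counts[date][1] += 1
    return dict(counts)

news = [("2001-10-05", "Stocks are rising as the market shows renewed vigour."),
        ("2001-10-08", "Losses deepen; analysts fear a broader slowdown.")]
print(daily_sentiment(news))   # {'2001-10-05': [2, 0], '2001-10-08': [0, 2]}

Series produced in this way can then be plotted against the index.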
The correlation between the metaphorical words and the FTSE looks good at first sight: 1 Based on papers presented at two workshops: LREC Event Modelling Workshop (Spain 2002) and Financial News Analysis Workshop, 11th International Terminology and Knowledge Engineering Congress (France 2002). 12 00.20.40.60.811.21256789121314151619202122232627282930DateRatioGood wordsBad wordsFTSE100 Recent work has looked beyond the frequency distributions of positive and negative sentiment keywords to the actual relevance of a keyword's contribution to the analysis through its grammatical properties. Specifically we look at the tense / aspect features of verbal keywords to determine their relevance to the immediate situation. Some of the categories and their relevance measures are shown below: Greater relevance Less relevance Present Continuous e.g. stocks are rising Present Simple e.g. stocks rise Perfect e.g. stocks have risen Past e.g. stocks rose The claim is that a positive verb such as rise which is marked as Present Continuous will suggest greater immediacy and therefore greater relevance as an indicator of current market sentiment. In this paper we will describe our attempts to look at a much larger corpus of texts (c. 1 million token financial news stream) and perform information extraction by making greater use of the linguistic properties of the sentimental tokens thus identified. Maybury (1995) Generating Summaries from Event Data. Information Processing and Management. 31(5) 733-751. Knowles, F. (1996) Lexicographical Aspects of Health Metaphors in Financial Texts. In (Eds.)Martin Gellerstam et al. Euralex'96 Proceedings (Part II). Göteborg, Sweden: Göteborg University, pp 789-796. 13 The Lacio-Web Project: overview and issues in Brazilian Portuguese corpora creation Sandra M. Aluísio*#, Gisele M. Pinheiro#, Marcelo Finger†, Maria das Graças V. Nunes*#, Stella E. O. Tagnin‡ *ICMC – DCCE, University of Sao Paulo, CP 668, 13560-970 Sao Carlos, SP, Brazil #Núcleo Interinstitucional de Lingüística Computacional (NILC), ICMC-USP, CP 668, 13560-970 Sao Carlos, SP, Brazil †IME – DCC, University of Sao Paulo, Rua do Matao, 05508-090 Sao Paulo – SP, Brazil ‡FFLCH – DLM, University of Sao Paulo, Av. Prof. Luciano Gualberto, 403, 05508-900 - Sao Paulo – SP, Brazil sandra@icmc.usp.br, gisele@nilc.icmc.usp.br, mfinge@ime.usp.br, gracan@icmc.usp.br, seotagni@usp.br 1. Introduction The pioneering balanced Brown Corpus launched in 1964, annotated reference corpora, such as Suzanne and the Penn Treebank and the balanced mega British National Corpus (BNC)1, to cite only a few, have helped both the development of English computational linguistic tools and English corpus linguistics. Portuguese, on the other hand, still requires a lot of work for building the basic resources to develop linguistic research based on and driven by corpus. Portuguese is the mother tongue of approximately 200 million people (Brazil, Portugal, Angola, Cabo Verde, Guiné Bissau, Mozambique and S. Tomé e Príncipe) and is the sixth most spoken language in the world today. The two language variants with the greatest number of users – European Portuguese (EP) and Brazilian Portuguese (BP) – differ in phonological, lexical, morphological and syntactical levels (Wittmann et al. 95) suggesting a real need to build corpora considering both of them. For European Portuguese (EP), the project AC/DC2 has compiled several corpora of non-literary (e.g. journalism) and literary texts (poetry, prose, plays) mainly from 16th to 19th centuries. 
These corpora are available both in raw format and annotated with lemma, POS and associated attributes, and syntactic tags, being the main purpose of this compilation to raise the quality of Portuguese NLP. Another example for EP is the Corpus de Referencia do Portugues Contemporâneo (CRPC)3, which has been under construction since 1988, and contains excerpts of several types of written discourse (literary, journalism, technical, scientific, didactic, economy, legal, parliamentary, etc.) and oral discourse from the main variants of Portuguese. Its main goal is to establish an on-line representative sample collection of general usage contemporary Portuguese accessible to anyone interested in engaging in theoretical and practical studies or applications. Both projects include texts from the Brazilian Portuguese and are valuable resources although there is a lot of work to be done with regard to text classification based on genre and text typology to better balance the variants included. On the other hand, Brazilian Portuguese (BP) corpora to date have mainly addressed spoken language; some written language corpora are only partially available. These corpora were used by specific projects, particularly for the production of dictionaries, and some are not publicly available due to copyright restrictions. As for their application, they have generally been used for specific linguistic studies such as sociolinguistic and phonetic-phonological research as well as historical linguistic studies (e.g NURC-RJ4 and NURC-SP5, PHPB6 and VARSUL7 projects); lexicography (e.g. the written corpus “Usos do Portugues” from the State University of Sao Paulo, at Araraquara, which gave birth to three dictionaries: a 1 info.ox.ac.uk/bnc/ 2 www.linguateca.pt/ 3 www.clul.ul.pt/sectores/projecto_crpc.html 4 www.letras.ufrj.br/nurc-rj/ 5 www.fflch.usp.br/dlcv/nurc/ 6 www.letras.ufrj.br/phpb-rj/ 7 www.cce.ufsc.br/~varsul 14 dictionary of frequency (Biderman, 2001), a dictionary of verbs (Borba, 2001) and one of contemporary Brazilian Portuguese usage (Borba, 2002); and a grammar of Brazilian Portuguese usage (Neves 2000)) and literary studies (e.g. the NUPILL project8, which made available classical Brazilian literary books for teaching and literary studies). There are also small specialized written language corpora compiled by researchers or research groups in order to assess the performance of NLP systems. Such corpora are generally not available either and their results are not reproducible in principle. The Lacio-Web project, a two-year project launched in early 2002, tries to fill this gap as it aims at compiling corpora which are freely accessible for both non-expert users interested in the Brazilian Portuguese language and expert users who pursue theoretical and practical linguistic studies and develop computational linguistics tools (e.g. taggers, parsers, sentence and word aligners, automatic term extraction tools, and automatic summarizers) and applications such as computer systems for natural language information retrieval, machine translation and grammar checking. Lacio-Web (LW) is being developed at the University of Sao Paulo (USP) under the auspices of the governmental agency Conselho Nacional de Desenvolvimento Científico e Tecnológico do Brasil (CNPq). 
The LW project comprises six corpora: 1) a reference corpus called Lacio-Ref; 2) Mac-Morpho9, a gold standard portion from Lacio-Ref, comprising 1,1 million words, which was manually-validated for morpho-syntactical tags; 3) an automatically-annotated portion of the Lacio-Ref with lemmas, POS and syntactic tags which are used by the parser Curupira developed at NILC10; 4) a deviation corpus composed of non-revised texts (Lacio-Dev); 5) and parallel and 6) comparable Portuguese-English corpora called, respectively, Par-C and Comp_C. The corpora will be available on the WWW for download. We will also develop a web-based interface for access to the corpora to meet several users´ needs. For this purpose we will consider the Project of Korpus 2000 (Andersen et al., 2002) as its corpus interface was designed with non-expert user needs in mind. It is worth mentioning that the design of the Lacio-Ref corpus and its text typology have been based on corpus linguistics principles (Sinclair and Ball, 1996) and on important corpora projects, e.g. American National Corpus (ANC) (Filmore et al. 1998; Ide & Macleod, 2001; Ide et al. 2002), BNC and Czech National Corpus (CNC)11. We have also tried to overcome some flaws in the typologies used in the written parts of the latter two (BNC and CNC) which would prevent us from broadening the set of potential users of the LW corpora. The construction of LW corpora builds upon the previous experience at NILC in the ad hoc compilation of a 35 million-token corpus named Corpus Nilc (CN). This paper details the corpora being created (Section 3), presents the rationale for LW corpora (Section 4) as, in Brazil, there is an urgent need for corpora (both annotated and raw) constructed according to corpus linguistics principles. It also compares and contrasts the development of both corpora (LW and CN) emphasizing the lessons learned in the process (Section 2). 2. Lessons learned from developing and critically analyzing CN The CN corpus was built to support the development of a grammar checker for Brazilian Portuguese named ReGra (Martins et al., 1998). Specifically, CN was designed to inform linguistic studies for the development of ReGra and to provide data for performance testing. The construction of ReGra started in 1993 and it has been improved since then under the auspices of both a Brazilian governmental agency and a Brazilian software company. CN is an opportunist corpus built in an ad hoc manner: its text selection was carried out on demand and its text classification was based on a very particular typology suitable for testing the grammar checker. At the beginning of the LW Project, we counted on being able to feed a major part of the CN into the Lacio-Web, thus, reducing the cost of its construction. However, after a detailed analysis of the CN subcorpora we found major problems regarding: classification, the number of texts in certain subcorpora, sample size (the main criterion in CN was full texts), grouping and formatting, documentation and copyright. 8 www.cce.ufsc.br/~nupill 9 Details about its tagset, the annotation process, including the results of the inter-annotator agreement evaluation and linguistic problems faced in developing Mac-Morpho can be found at www.nilc.icmc.usp.br/nilc/projects/lacio-web.htm 10 In a follow up project, we intend to do a manual revision of it. The resulting Treebank will serve, for example, to improve the parser itself and to train statistical parsers. 
Details about Curupira can be found at www.nilc.icmc.usp.br/nilc/tools/curupira.html 11 ucnk.ff.cuni.cz/english/index.html 15 These items will be detailed below together with an indication of the cost involved to correctly include the CN texts in LW. 2.1 Text classification As CN was built on demand, its text classification became problematic. The texts were divided into three classes, driven by the purpose of the corpus, namely: a) corrected texts, b) uncorrected texts and c) semi-corrected texts, i.e. texts published in books or journals e.g., unrevised texts, and text revised by advisors e.g., respectively. Inside these classes, the texts were grouped in an ad hoc fashion, either by domain (or subject), or genre or textual type. This irregular text classification prevents a user from recovering all texts on “sports” because only some of them were classified according to domain. One reason for this is that the texts in CN were not prepared to be automatically retrieved by a user. On the other hand, as we intend to design powerful and customized search tools for LW´s users, the first task after defining LW corpus typology was to define its text typology which has a 4-orthogonal category typology to classify texts (genre, textual type, medium and subject) and information about authorship and the publication of texts. Another benefit a sound text typology (driven by linguistic or commonly accepted external criteria) brings along is the possibility to better control the insertion of material into a corpus in order to achieve better representativeness in corpus design. 2.2 Number of texts Some CN subcorpora are under-represented, i.e. they consist of a small number of texts. For example, the “Technical and Scientific” subcorpus has only a few samples of theses and incomplete dissertations, most of them from the Computer Science domain. We can say that CN benefits terminology studies in Computer Science but it is not representative regarding other technical and scientific domains. On the other hand, the design of the LW corpora is based on a 4-orthogonal category typology to organize the texts and we will endeavour to complete each category in each corpus (see Section 4 for details on this procedure). This set-up will enable the development of several types of tools and linguistic researches. Automatic text categorization, for example, is a type of research which would benefit both the advance of the research area itself and the organization of the texts in a typology. Moreover, in the near future we will be able to perform linguistic analysis to evaluate the representativeness of our newborn corpora following Biber´s recommendation to proceed in a cyclical fashion of analysis, design, compilation and analysis again (Biber, 1993). 2.3 Sample size Some text samples deviated from the standard followed by CN which was to include only full texts. For example, some only have a few chapters of a book, other have only excerpts from the beginning, middle and end part of a whole document. In the LW Project, we are following the criterion of inclusion of sequential12 full texts to allow for research on text structure and summarization, for example. However, if copyright issues demand otherwise we will have this fact annotated in the text header. This may happen with textbooks, for example. 2.4 Grouping and formatting Issues regarding ad hoc grouping such as, to group several small texts from a same class in just one text causes several problems for the compilation of the header. 
It would be difficult to import these groups of texts to LW since some information on them has been lost and we intend to edit a header for each and every text from the corpora. This grouping has occurred in encyclopedia entries and small articles from newspapers. Also, CN has not keep text formatting e.g. sentence and paragraph marks for many texts. As we consider this annotation important it would be costly to recover this information. 12 The newspaper corpora in AC/DC Projects had to have their texts scrambled, which prevents several types of research of being conducted. 16 2.5 Documentation and copyright Annotating a text with a header which provides internal and external information on the texts was not taken seriously in the compilation of CN. Some texts have the traditional information on authorship and publication details but nothing is said about its domain (subject) or genre and text type; others do not have any header at all. The LW will try to correct this flaw and will invest large amounts of effort to edit and encode a header following largely accepted standards (see Section 4). With regard to copyright, there was no effort to secure rights to include material in the CN as it was designed to support the development of an application. LW, however, has other design criteria as it will be freely available on the WWW. One of the most difficult problems in building a corpus is surely to get permission to include texts from copyright holders as there is a lot of correspondence involved before we manage to get copyright clearance. We are using similar contact and permission letters provided by BNC and ANC to include texts in LW corpora. Although we do not have a consortium of commercial members to feed our corpora we have managed to include relevant newspaper, magazines and books in our corpora. In summary, CN was of great value to the development of ReGra and with regard to LW, CN is ready to contribute with the legislation (110.571 tokens), journalism (25,167.436 tokens) and literary texts (1,761.373 tokens). 3. The corpora of the Lacio-WEB Project A common concern in recent studies is to provide precise guidelines concerning the typology of a corpus. Atkins, Clear & Ostler (1992) discussed contrast-based parameters that allow identification of different types of corpora: synchronic vs. diachronic corpus; closed vs. open-ended corpus; full text vs. sample vs. monitor corpus; general vs. terminological; single vs. parallel-2 vs. parallel-3 vs. …, etc. Some criteria for classifying corpora have also been provided by Cathy Ball, in a tutorial about concordances and corpora13, on top of the parameters mentioned above: balanced vs. opportunist vs. pyramidal corpus, and plain vs. annotated corpus. A similar work on such parameters was carried out by Berber Sardinha (2000), who distinguished corpora according to their purpose: reference vs. study vs. training and testing. The corpora comprised by this project have a varied composition and can be classified according to the 8 parameters cited above. Regarding purpose, the Lacio-Ref corpus is a reference corpus; the Lacio-Dev – a deviation corpus composed of unrevised texts – was meant for training and testing applications, such as grammar checkers; and the Mac-Morpho corpus is closed and serves as a training and testing corpus for NLP tools, such as POS taggers. As far as design is concerned, the Mac-Morpho corpus and the automatically-annotated portion of Lacio-Ref are of annotated type, while the Lacio-Ref is a plain corpus. 
For the parameter single vs. parallel, the Par-C is of the parallel-2 type; and for the general vs. terminological parameter, the Comp_C is of the terminological type. All the corpora are synchronic, presenting BP language from 1900 onwards. Lacio-Dev, Par-C and Comp_C are opportunistic corpora. The specificities of each of these corpora are discussed below, together with their status in terms of completion and encoding. We distinguish two types of encoding: i) header and sentence markup and ii) annotation for linguistic phenomena such as lemmatization, POS and syntactic tags. Our header editor includes an XML compliant header, according to the international standard adopted in the Translational English Corpus (TEC)14 from University of Manchester's Institute of Science and Technology (UMIST). A tool named SPLITTER, developed at NILC, will be used to split the texts into sentences. 1) Lacio-Ref Corpus: contains texts from various genres (e.g., literary and its subdivisions, factual, informative, scientific, law), textual types (e.g. article, manual, research project, letter, biography), subjects (e.g., politics, environment, life style, sports, arts, religion etc.); and medium of distribution (e.g., books, internet texts, cd-rom material, newspapers and magazines, etc.). The four-category typology cited above (genre, textual type, medium and subject) will be used to allow for specific searches on the corpus. Also, the texts may be searched by authorship and other publication details. 13 www.georgetown.edu/faculty/ballc/corpora/tutorial.html 14 www2.umist.ac.uk/ctis/research/TEC/tec_home_page.htm 17 All pieces of text are authentic with identified sources (see Section 4) and are, preferentially, full texts. Tools will also be available to users for obtaining a statistical description of the texts, in terms of the text size (in number of words, pages or kbytes). Status: the corpus is being compiled (it already contains material contributed from the CN); the edition of the headers has not been started. 2) Lacio-Dev Corpus: most texts (in a total of 516.840 tokens) will be imported from the CN which contains a subcorpus comprised of unrevised texts of varied subjects produced by undergraduate students and prospective students attempting to enter the University. This Corpus can be used to assess the performance of grammar checkers for BP, as there is a need to ensure that the checkers are able to detect linguistic inadequacies during the tests. Status: edition of the headers has not been started. 3) Mac-Morpho Corpus: this 1.1 million-word Corpus is composed of a collection of randomly selected texts from several issues of Folha de Sao Paulo (1994)15, a major Brazilian newspaper, which ensures high quality contemporary Brazilian Portuguese from different authors and domain. The manual validation and correction process was carried out on the morphosyntactic tagging of the texts performed by the parser PALAVRAS16. The corpus contains structural markers for sentences. Besides being annotated by the XML-compliant format proposed by the Advisory Group on Languages Engineering Standards EAGLES (see Section 4) it will be available in annotators’ format (one word per line followed by its tag) which is appropriate for training and evaluating POS tagging methods. Status: the compilation and manual annotation have been completed; both the header edition and the corpus encoding have not been started. 
4) a portion of Lacio-Ref automatically-annotated with lemmas, POS and syntactical tags: as opposed to Mac-Morpho, this corpus will consist of a varied selection of text genres. It will be annotated by the parser Curupira and in the near future we hope to carry out a manual revision of it. The resulting Treebank will have several uses: to improve the parser itself, to train statistical parsers, to perform more accurate searches, etc. Status: compilation and header edition have not been started. 5) Par-C Corpus: because of its opportunistic behavior, this corpus will be enlarged from time to time. In this initial phase of the LW project Par-C is composed of 65 pairs of authentic academic parallel texts (abstracts) in Computer Science contributed from Project PESA17 which aims at evaluating sentence alignment methods. The corpus was divided into two groups: one comprising 65 pairs of authentic (non-revised) texts; another with the same 65 pairs revised by a human translator (pre-edited corpus). They were named CAT and CPT, respectively. CAT has 416 BP sentences and 439 English sentences. CPT has 418 BP sentences and 431 English sentences. Status: compilation and alignment completed; edition of headers has not been started. 6) Comp-C Corpus: it is also an opportunistic corpus which can be used to evaluate term extraction methods as well as for other linguistic researches. In this initial phase it comprises English-Portuguese comparable texts contributed by the Project COMET (Tagnin 2001, 2002). These texts are technical, scientific and marketing-related, amounting to 300,000 words in each language. The corpus was compiled by students in the Diploma in Translation course at FFCLH/USP in order to build glossaries18. Status: compilation completed; edition of the headers to be started. 4. Issues in corpus creation tackled in the Lacio-Web Project 4.1 Accessibility of the corpora As mentioned before, one of our main goals in constructing the LacioWeb corpora is to create a benchmark for Computational Linguistics in Brazilian Portuguese for tasks such as POS tagging, parsing, text alignment, and term extraction. For that purpose, texts must be provided with several sets of hand-validated linguistic annotation e.g. lemmas, POS, and syntactic tags. Additionally, in order to maximize its use, a corpus should be encoded according to largely agreed standards and must be available to each and every researcher in 15 http://www1.folha.uol.com.br/fsp/ 16 http://visl.hum.sdu.dk/visl/ 17 www.nilc.icmc.usp.br/nilc/projects/pesa.htm 18 http://www.fflch.usp.br/citrat 18 general. Therefore, the texts of the LacioWeb corpora project will be made freely available on the Internet, followings the initiative of several recently created corpora such as: the Tycho Brahe Corpus of Historical Portuguese19, which also provides POS- and parsing-annotated texts and the COMPARA parallel corpus for Brazilian and European Portuguese texts20. With regard to providing standardized encoding for the resources we are following the design of the ANC Corpus21 which will use the specifications of an XML-compliant format of the Corpus Encoding Standard XCES (Ide et al., 2000). In our case, the texts will be available for download from the project's home page22. It will be redesigned to allow powerful searches and friendly access to non-expert users who may use LW corpora as an extension of dictionary consultation to obtain authentic examples of real word usage in BP. 
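To make the encoding idea above more concrete, the following is a minimal Python sketch of how a per-text header carrying the four-category typology (genre, textual type, medium, subject) plus authorship and publication details might be generated as XML. The element names and the example values are illustrative assumptions only; they are not the project's actual XCES or TEC header schema.

# Illustrative sketch only: element names are hypothetical, not the
# project's actual XCES/TEC header markup.
import xml.etree.ElementTree as ET

def build_header(meta):
    """Build a minimal XML header for one text (hypothetical markup)."""
    header = ET.Element("textHeader")
    # Bibliographic information (authorship and publication details)
    biblio = ET.SubElement(header, "biblio")
    for key in ("author", "title", "publisher", "year"):
        ET.SubElement(biblio, key).text = str(meta[key])
    # The four orthogonal classification categories described in the text
    classification = ET.SubElement(header, "classification")
    for key in ("genre", "textualType", "medium", "subject"):
        ET.SubElement(classification, key).text = meta[key]
    return ET.tostring(header, encoding="unicode")

example = {
    "author": "Unknown",
    "title": "Sample newspaper article",
    "publisher": "Folha de Sao Paulo",
    "year": 1994,
    "genre": "non-literary/informative",
    "textualType": "article",
    "medium": "newspaper",
    "subject": "politics",
}
print(build_header(example))

A header of this kind attached to every text is what makes searches along the four orthogonal categories, and by authorship and publication details, possible in the web interface.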
4.2 Exhaustivity versus selectivity One of the basic tenets of Science is the reproducibility of results. In the field of Computational Linguistics, when an algorithm is tested against a non-public corpus the result is not reproducible in principle. It may well be the case that other researchers, applying the same algorithms to other corpora, may find distinct results. Without applying the test to a common, public corpus, these discrepancies may never be reconciled. From this point of view, any method, algorithm or measurement performed over a non-public corpus cannot be considered definitively tested. Moreover, no comparison of two distinct computational linguistic methods which perform the same task can be made unless they are tested against the same set of data. The reason is clear. One method may be better suited for one kind of text, while the other may be better suited for some other kind of data. Two researchers may test both methods on their private data and arrive at different conclusions. Again, the only way to solve this dispute is to apply both methods to the same piece of public data. And if different pieces of data yield different results, this is an important piece of information that one will not have unless the data is public. This fact has guided us in the construction of the Lacio-Ref, in particular in the guidelines for text gathering and classification23. This leads us to the next topic: the exhaustivity of the texts to be included, mainly, in Lacio-Ref. By exhaustivity we mean the broad and varied coverage of the most common categories used to classify texts such as: authorship, text size, and the four-category typology, namely genre (literary ¯ subcategorized into poetry, drama, prose ¯ and non-literary ¯ subcategorized into factual, informative, academic and legislation), textual type (e.g. article, manual, research project, letter, biography), medium (e.g. newspapers, books, ads, journals, magazines) and subject (or domain). The genre typology instances can be further subcategorized. This is a differential regarding reference corpus design as they are generally built in an ad hoc manner. Linguistic aspects of texts are usually only provided for corpora which are designed to be balanced, but in the hope of broadening the set of potential users of the Lacio-Ref such information will be provided for its texts. Each text classification according to these categories and others related with text publication will be documented in the text's header via XML-annotations in a format inspired by the TEC corpus header annotation. Choosing to be exhaustive instead of selective is another way to broaden the set of potential users of the corpus and the possibilities of linguistic analysis. We are not limited to including only a specific genre, e.g. informative with newspaper data, as followed by the CRPC project, cited in the Introduction, with regard to the Brazilian Portuguese variant. We also follow a much more detailed genre typology than those used in the BNC written corpus and the CNC SYN2000. Additionally we use a 4-orthogonal category typology (genre, type of text, medium and domain) as commented before while these other corpora focus mainly on domain. For example, BNC classifies its texts by medium, domain and time, reducing its genre typology to two classes: informative writings which are subcategorized by domain and imaginative writings which are subcategorized by literary and creative works. 
SYN2000 follows the same genre typology as the BNC but subdivides imaginative texts into poetry, drama, fiction, other and transitional types and the informative into journalism and technical and specialized texts. The latter is again subcategorized by domain as in the BNC. This does not mean that we are unaware of the difficulties to classify the texts following a more detailed taxonomy but this enterprise is worth trying as it can provide a better search tool for the user. 19 www.ime.usp.br/~tycho 20 www.linguateca.pt/COMPARA/ 21 http://americannationalcorpus.org/ 22 www.nilc.icmc.usp.br/nilc/projects/lacio-web.htm 23 We tried to adhere as much as possible to the EAGLES recommendations on text typology. 19 4.3 Representativeness and balance Availability and exhaustivity were two criteria that superseded representativeness and text balancing in the selection of texts for the Lacio-Ref corpus. Problems of copyright would prevent us from obtaining a balanced corpus within the two-year duration of the project for, unlike other corpora created by a consortium of publishers like the ANC, we started with no repository of texts of our own24. However, we have so far been very successful in obtaining authorization from newspaper and magazine publishers as well as donations of electronic versions of public domain books, which will enable us, in the near future, to start the design of a balanced corpus of modern Brazilian Portuguese which will be included in the set of corpora of the Lacio-Web project. After the initial text gathering and classification we will be in a position to follow Biber's classical recommendation (1993, abstract), “The actual construction of a corpus would then proceed in cycles: the original design based on theoretical and pilot-study analyses, followed by collection of texts, followed by further empirical investigations of linguistic variation and revision of the design.” 5. Conclusion and future work In Brazil, there is an urgent need for corpora constructed according to corpus linguistics principles regarding its text typology, encoded according to largely accepted standards, and freely available. In this paper we presented the Lacio-Web project which aims at compiling freely accessible corpora for both non-expert users interested in the Brazilian Portuguese language and expert users who pursue theoretical and practical linguistic studies and develop computational linguistics tools. Its purpose is not only to raise the quality of Portuguese NLP but also popularize the use of corpora for layman who may be interested in using them as an extension of dictionary consultation to obtain authentic examples of real word usage in BP. After the first year of the project we have already corrected the morphosyntactic annotation of the 1,1 million-word corpus MAC-Morpho, gathered several texts for the Lacio-Ref, critically analyzed the CN corpus which can also provide material for the Lacio-Ref. We are almost ready to release the first version of the bilingual corpus Par-C and the Comp-C. It is worth mentioning that the header must be included in the entire corpora before the release. In a new project, we intend to pursue the balancing of Lacio-Ref to make it more useful for research. References Andersen, M.S., Asmussen, H. & Asmussen, J. 2002 The project of Korpus 2000 going public. In Proceedings of Euralex 2002, pp 291-299. Atkins, S., Clear, J. & Ostler, N. 1992 Corpus design criteria. Literary and Linguistic Computing 7: 1-16. Biber, D. 1993 Representativeness in corpus design. 
Literary and Linguistic Computing 8: 1-15. Biderman, M.T.C. 2001 Dicionário de freqüencias do portugues brasileiro contemporâneo. In: Martins Fontes (ed), Teoria Lingüística. Sao Paulo, pp 335-348. Borba, F. S. 1991 Dicionário Gramatical de Verbos. Sao Paulo, UNESP. Borba, F. S. 2002 Dicionário de usos do Portugues do Brasil. Sao Paulo, Editora Ática. Fillmore, C., Ide, N., Jurafsky, D., and Macleod, C. 1998 An American National Corpus: A Proposal. In Proceedings of the First International Language Resources and Evaluation Conference, Granada, Spain, pp 965-70. 24 It is important to note that although we have the CN its texts don´t have copyright clearance, therefore they are not ready for inclusion in the Lacio-Ref. 20 Ide, N., Reppen, R., Suderman, K. 2002 The American National Corpus: More Than the Web Can Provide. In Proceedings of the Third Language Resources and Evaluation Conference (LREC), Las Palmas, Canary Islands, Spain, pp 839-844. Ide, N., Macleod, C. 2001 The American National Corpus: A Standardized Resource of American English. In Proceedings of Corpus Linguistics 2001, Lancaster UK. Available in: www.cs.vassar.edu/faculty/ide/pubs.html Ide, N., Bonhomme, P., Romary, L. 2000 XCES: An XML-based Standard for Linguistic Corpora. In Proceedings of the Second Language Resources and Evaluation Conference (LREC), Athens, Greece, pp 825-30. Martins, R.T.; Hasegawa, R.; Nunes, M.G.V.; Montilha, G.; Oliveira Jr., O.N. 1998 Linguistic issues in the development of ReGra: a Grammar Checker for Brazilian Portuguese. Natural Language Engineering 4(4): 287-307. Neves, M. H. M. 2000 Gramática de Usos do Portugues. Sao Paulo, UNESP. Sardinha, T.B. 2000 Lingüística de Corpus: Histórico e Problemática (Corpus Linguistics: History nd Problematization), DELTA 16(2): 323-367. Sinclair, J. and Ball, J. 1996 Preliminary Recommendations on Text Typology. EAG-TCWG-TTYP/P, June 1996. Available in: www.ilc.pi.cnr.it/EAGLES/texttyp/texttyp.html Tagnin S.E.O. 2002 Corpora and the Innocent Translator: how can they help him. In Lewandowska-Tomaszczyk, Barbara & Marcel Thelen (eds.) Translation and Meaning - Part 6 - Proceedings of the Lodz Session of the 3rd International Maastricht-Lodz Duo Colloquium on 'Translation and Meaning", held in Lodz, Poland, 22-24 September, 2000, Maastricht: Universitaire Pers Maastricht, pp 489-496 Tagnin S.E.O. 2001 COMET – A Multilingual Corpus for Teaching and Translation. In: PALC ‘01 – International Conference on Practical Applications in Language Corpora, Lodz, Polônia, To appear in the Proceedings. Wittmann, L. Pego,T. & Santos, D. 1995 Portugues do Brasil e de Portugal: alguns contrastes. In Actas do XI Encontro da Associaçao Portuguesa de Lingüística, Lisboa, Portugal, pp 465-487. 21 A corpus of seventeenth-century English news reportage: construction, encoding and applications Dawn Archer, Andrew Hardie, Tony McEnery & Scott Piao Dept. Linguistics, Lancaster University This poster describes a 750 thousand-word corpus of news reportage from the English Civil War, currently under construction at Lancaster, using texts drawn from the Thomason Tracts. This collection (held at the British Library) is unique in containing the greater part of what was published in London in the period 1640-1661. We are in the process of transcribing the periodical newsbooks published between December 1653 and May 1654 to an SGML-based format. 
The fairly light markup initially used by the transcribers is similar to HTML, allowing the texts subsequently to be mapped automatically to both a TEI-compliant SGML/XML format and a web-compatible HTML format, facilitating the widest possible potential re-use of the corpus. Simultaneous with the development of the corpus, a number of linguistic/historical issues have been investigated using the transcribed newsbooks. These include an examination of the complicated nature of text re-use in the press at this period, and an inquiry into the presentation of women in different newsbooks. 32 A database system for storing second language learner corpora Bertol Arrieta(1), Arantza Díaz de Ilarraza, Koldo Gojenola, Montse Maritxalar, Maite Oronoz Affiliation: IXA Group (http://ixa.si.ehu.es) University of the Basque Country (UPV/EHU) Postal address: Faculty of Computer Science 649 p.k., 20080 Donostia (The Basque Country) Tel.: +34 943 015 061 Fax: +34 943 219 306 E-mail (1): bertol@si.ehu.es Abstract With the aim of storing learner corpora as well as information about the Basque language students who wrote the texts, two different but complementary databases were created: ERREUS and IRAKAZI. Linguistic and technical information (error description, error category, tools for detection/correction…) will be stored in ERREUS, while IRAKAZI will be filled in with psycholinguistic information (error diagnosis, characteristics of the writer, grammatical competence…). These two databases will be the basis for constructing i) a robust Basque grammar corrector and, ii) a computer-assisted language-learning environment for advising on the use of Basque syntax. 1. Introduction The IXA research group has been working in Natural Language Processing during the last 14 years. At the same time, we have worked on Intelligent Computer Assistant Language Learning (ICALL) environments. The work we present in this paper has a wide background in these fields: NLP tools, error detection and ICALL environments using adapted NLP tools. Background in NLP tools In order to work on error detection, a very important background in NLP tools is needed. In this sense, these are the tools implemented in our group: a. EDBL, a lexical database, which at the moment contains more than 80,000 entries (Aduriz et al., 1998). b. A tokeniser that identifies tokens from the input text. c. Morpheus, a wide-coverage morphosyntactic analyser for Basque (Alegria et al., 2002) that includes a segmentiser, a morphosyntactic analyser and a recogniser of multiword lexical units (MWLUs). d. EusLem, a general-purpose tagger/lemmatiser. (Ezeiza et al., 1998). e. A shallow syntactic analyser that identifies noun phrases and verbal chains. Background in error detection As we have developed most of the tools in the linguistic analysis chain (morphology, morphosyntax, surface syntax, phrases, etc.), we started working on error detection. Thus, a robust spelling corrector, called Xuxen (Aduriz et al., 1997), was developed some years ago. With the aim of following with this work, a syntactic approach was planned. This way, some work in syntax error detection has been done in the last years, using different approaches: a. We have combined a robust partial parser which obtains the main components of the sentence (implemented in PATR-II), and a finite-state parser used for the description of syntactic error patterns (Xerox Finite State Tool, XFST, (Karttunen et al., 1997)) to detect errors in dates (Gojenola K. & Oronoz M., 2000). 
We defined six different types of errors and its combinations. 33 b. The Constraint Grammar (Karlsson, 1995) formalism has been used to analyse 25 types of errors about postpositions and other 10 different types of grammar errors. c. The relaxation of syntactic constraints (Douglas & Dale, 1992) has been used for the detection of agreement errors between the verb and the subject, object or indirect object (Gojenola, 2000). This grammar-based method allows the analysis of sentences that do not fulfil some of the constraints of the language by identifying a rule that might have been violated, determining whether its relaxation might lead to a successful parse. Background in ICALL environments The main work in this field done in our group is an environment for studying the learning process of language learners, called MUGARRI (Maritxalar, 1999). In this environment, we find three systems: IRAKAZI, IDAZKIDE and HITES. IRAKAZI helps the teacher in gathering psycholinguistic information about the students and the texts they write; IDAZKIDE is a student oriented ICALL environment for second language learning; and HITES is a system for modelling the interlanguage of particular learners and the common interlanguage of learners at the same language level. IRAKAZI interacts with the teacher, IDAZKIDE with the student, and HITES with the psycholinguist. ERREUS and its connection with IRAKAZI With this background, we realised that gathering error corpora is a very important task in order to i) have a basis for deciding which type of linguistic phenomena are important to treat, ii) have a corpora for tool-testing and evaluating. That is why we began thinking about a system that would store information about the errors of the corpora. The ERREUS database was born with this aim. ERREUS has the purpose of storing technical and linguistic information about any type of error, and it was designed for being, in some sense, a repository of error corpora. On the other hand, IRAKAZI is used to store all the information about the student (mainly, relative to his/her learning process) and the deviant structures he/she has used. Working in the design of the ERREUS database, we realized that ERREUS is complementary with IRAKAZI. In ERREUS we are going to store any kind of error made by language learners and native speakers, and in IRAKAZI, only the deviations made by language learners. It must be pointed out that in the ICALL environment, we will speak about deviant structures instead of errors. The word “error” is directly joined to correction, and it has a negative sense. That is why we have decided to use the word deviation when speaking about the learning process, following some psycholinguistic trends (Maritxalar et al, 1996). So, all deviations made by students are going to be referenced in both the IRAKAZI and the ERREUS databases, while the errors found in corpora that were not written by students are only going to be stored in the ERREUS database, as we can see in Figure 1. Due to the fact that both databases provide different points of view about the same matter, we saw the need of joining the two databases. Thus, for each error-containing-text, we would have its technical-linguistic information in ERREUS, as well as its corresponding psycholinguistic information in IRAKAZI. Taking into account the information stored in each database (see example in figure 2), we note that the information about the text that contains the error and its category appears in both. 
However, there is a difference when representing the linguistic category. In the case of IRAKAZI, we store the concrete category of the deviation (AGREEMENT_SUBJ_VERB), while in ERREUS we use a hierarchical classification of linguistic errors (Morphosyntactic > Agreement > Agreement between subject and verb). Since the final category in ERREUS matches the category in IRAKAZI, it is viable to join both databases.

[Figure 1: Errors vs. deviations — deviations are stored in the IRAKAZI database, errors in the ERREUS database.]

[Figure 2: A view of the main information stored in ERREUS and in IRAKAZI for the same deviant sentence *Hura igeriketa maite du (*He love swimming). IRAKAZI records the deviation (category AGREEMENT_SUBJ_VERB, deep reason: generalization of a rule), the text (number of words, type, reference) and the student (name, age, school, language knowledge, mother language, learning history). ERREUS records the error-containing text (sentence, correction Hark igeriketa maite du / He loves swimming, text reference), the error (description: loss of the letter 's' in the third person singular, present tense; whether it is detected/corrected and by which tool, e.g. Constraint Grammar) and the category and subcategory levels (Morphosyntactic > Agreement > Agreement between subject and verb). These boxes are not the entities of either ERREUS or IRAKAZI.]

2. The ERREUS database
The ERREUS database will be the basis for constructing a robust Basque grammar corrector. That is the reason why we have designed a database that stores linguistic and technical information about the errors found in the corpora. In designing the database, we followed several steps to ensure a good design and development, taking into account that a) it is important to access the database via the Internet, b) many non-specialized users would access it, c) we want to store a very large range of linguistic errors, and d) we want to link ERREUS to IRAKAZI. Next, we will briefly explain the steps we followed to build the database. Firstly, we made a complete classification of errors based on bibliographic research and hand-made studies of real corpora (step 1). Secondly, this classification was complemented with the results of a questionnaire given to some proofreaders and Basque language teachers (step 2). And, finally, this classification was used as a basis for designing and constructing the ERREUS database (step 3) and its corresponding ZOPE-based interface (step 4).
Step 1: Classifying the errors
As mentioned before, in order to make a thorough classification of the errors we could find in any corpus, we used as a basis a set of Basque grammars, our previous experience in error classification (Maritxalar, 1999), and the advice of the linguists in our research group. Besides, we contrasted our classification with works on error typology (Becker et al., 1999) made for other languages. This way, we obtained a classification in which all errors were divided into five main categories:
- Spelling errors
- Morphological, syntactic or morphosyntactic errors
- Semantic errors
- Punctuation errors and style suggestions
- Errors due to the standardisation process of Basque
Each category was subcategorised so as to make a classification as detailed as possible.
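As a rough illustration of how such a multi-level classification might be represented in software, the fragment below encodes a few branches of the hierarchy as nested mappings. The category labels are taken from the paper and its figures, but the structure shown here is an assumption for illustration, not the actual ERREUS inventory.

# A partial, illustrative encoding of the error classification hierarchy.
# Only a few branches are shown; the real ERREUS classification is richer.
ERROR_CLASSIFICATION = {
    "Spelling": {},                        # e.g. loss of the letter "h"
    "Morphosyntactic": {
        "Agreement": {
            "Agreement between subject and verb": {},
        },
        "Postpositions": {},
        "Word order": {},
    },
    "Semantic": {},
    "Punctuation and style": {},
    "Standardisation of Basque": {},
}

def category_path(*levels):
    """Check that a path such as ('Morphosyntactic', 'Agreement', ...) exists."""
    node = ERROR_CLASSIFICATION
    for level in levels:
        node = node[level]          # KeyError if the subcategory is unknown
    return " > ".join(levels)

print(category_path("Morphosyntactic", "Agreement",
                    "Agreement between subject and verb"))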
The relevance of this error classification relies on guiding the user through the interface into the appropriate category/subcategory. This procedure will let us organise in the database all the error occurrences according to the mentioned classification. Step 2: The questionnaire and its results The standardisation of Basque has not been yet completed. The Basque Language Academy (http://www.euskaltzaindia.net) publishes periodically rules for the standardisation of the language, but they do not cover all its aspects. For this reason, sometimes it is difficult to decide whether a given structure may be considered standard or not. All these characteristics made more difficult to create a proper error classification in Basque. Therefore, we prepared a questionnaire in order to contrast our classification. As we assumed that learners of Basque and natives do not make the same kind of errors and with the same frequency, we asked both, experienced Basque teachers and proofreaders, about two different aspects. We gave them our first draft of the error classification, and asked whether they knew any error category that was not included in such classification and whether all the errors we considered were actually errors. If this was the case, we also wanted to know, which was the frequency of occurrence of each error in the kind of texts they usually work with. Using this data, we completed our error classification. In the near future, we are going to intend to continue implementing rules for the detection of errors, starting with those ranked with the highest frequency in the questionnaire. Our objective is to use these rules to detect automatically such errors in real corpora. Step 3: The design of the database We carried out the design of the database with the objectives of being open and flexible enough to allow the addition of new information. We designed a simple, standard database to collect errors of the different types mentioned in the classification. The system will also allow restricted users to update the database via Internet. The database is composed of these main entities: error, linguistic categories, text and correction. In the entity named ‘error', we store, among other things, the following technical information: whether the error is automatically detectable/rectifiable, and in such case, which is the most appropriate NLP tool to detect/correct it. We also specify the origin of the error (e.g. influence of Spanish) and the possible cause of it. We have used four tables, each one for each level of the hierarchy in the classification. For example, in the first table (FirstLevelCategory), we have the general category of the error (orthographic, morphosyntactic, semantic, punctuation, style and errors due to the lack of standardisation of the language). Besides, each general category is divided into second level subcategories using the table (SecondLevelCategory), and so on (see figure 3). 36 The entity named ‘text’ stores, for each error occurrence, the sentence that contains the error. Besides, we have an attribute (with a value ranging from 0 to 5) to indicate to which extent we are sure that it is really an error in the context where it appears. A given word or structure might be always considered an error or it might be considered an error just in some given contexts (e.g. “The bread ate John” might be correct in a literary context). 
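A minimal relational sketch of the entities just described (the four category-level tables plus the error, text and correction entities, including the 0-5 certainty attribute) might look as follows. The table and column names and types are simplifying assumptions for illustration, not the actual ERREUS schema; the 'text' entity is named text_occurrence here only to avoid confusion with the SQL type name.

# Hypothetical, simplified schema for ERREUS-style storage (illustration only).
import sqlite3

schema = """
CREATE TABLE FirstLevelCategory  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE SecondLevelCategory (id INTEGER PRIMARY KEY, name TEXT,
                                  first_id  INTEGER REFERENCES FirstLevelCategory(id));
CREATE TABLE ThirdLevelCategory  (id INTEGER PRIMARY KEY, name TEXT,
                                  second_id INTEGER REFERENCES SecondLevelCategory(id));
CREATE TABLE FourthLevelCategory (id INTEGER PRIMARY KEY, name TEXT,
                                  third_id  INTEGER REFERENCES ThirdLevelCategory(id));
CREATE TABLE error (
    id INTEGER PRIMARY KEY,
    description    TEXT,
    detectable     INTEGER,   -- can it be detected automatically?
    detection_tool TEXT,      -- e.g. 'Constraint Grammar'
    origin         TEXT,      -- e.g. 'influence of Spanish'
    category_id    INTEGER    -- deepest category level that applies
);
CREATE TABLE text_occurrence (
    id INTEGER PRIMARY KEY,
    sentence  TEXT,
    certainty INTEGER,        -- 0-5: how sure we are it is an error in context
    error_id  INTEGER REFERENCES error(id)
);
CREATE TABLE correction (
    id INTEGER PRIMARY KEY,
    corrected_sentence TEXT,
    text_id INTEGER REFERENCES text_occurrence(id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
print("tables:", [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")])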
[Figure 3: Classification hierarchy — FirstLevelCategory: spelling errors; morphological, syntactic or morphosyntactic errors; semantic errors; punctuation errors; errors due to the lack of standardisation of Basque. SecondLevelCategory examples: loss of the letter "h", derivation of words, dates, postpositions, bad use of adverbs, fixed phrases, word order, agreement. Further subdivisions follow in ThirdLevelCategory and FourthLevelCategory.]

In the 'correction' entity, we store the correction of each error occurrence. In this sense, it is important to remark that if we have more than one error in a concrete sentence, we will have one different text occurrence for each error, in order to i) have the proper reference to each kind of error, and ii) have one correction for each kind of error.
Step 4: The ZOPE-based interface
As the interface will be used by people who are not specialised in computers and therefore has to be an easy-to-use tool, we designed a simple and user-friendly interface based on ZOPE technology (Latteier & Pelleitier, 2001). This way, we built an interface to guide the user through the error classification in order to choose one concrete category/subcategory. The user has the possibility of performing different operations:
a. Consulting operations, to find real examples of errors in the corpora for the chosen category. For example, if the category/subcategory "Morphosyntactic / Agreement / AgreementBetweenSubjectVerb" is chosen, the system will show all the texts with this error, e.g. "Hura igeriketa maite du" ("He love swimming"), "Bera ingelesa daki" ("She cans speak English") and so on.
b. Inserting operations, to insert an error-containing text into the chosen category, with its own correction. For example, let us suppose that the linguists find the sentence "That's not very apropriate". The steps to follow should be: i. Find the proper category/subcategory for that error (spelling error). ii. Check if the error has already been inserted (one "p" instead of the double "p"). iii. If not, insert the error and its technical characteristics. iv. Check if the sentence has already been inserted. v. If not, insert the sentence and its correction.
c. Updating operations, related to error information as well as text information.

[Figure 4: Inserting a text in ERREUS — a screen showing a sentence, its translation and its development (e.g. Basoaren barrena joan ziren / They went across the forest; Baso barrena joan ziren perretxikoen bila / They went across the forest to pick up mushrooms), a SAVE A TEXT button, and a field recording to what extent we are sure that it is really an error in this context.]

3. The IRAKAZI database
In the introduction, we drew a distinction between the information stored in ERREUS (linguistic/technical) and the information stored in IRAKAZI (psycholinguistic). In the same way, as we mentioned before, we distinguish errors (ERREUS) from deviant structures (IRAKAZI). So, when speaking about IRAKAZI, we will refer to students' deviations. IRAKAZI is responsible for storing the knowledge about the student, provided by the teacher. The main goal is twofold:
- To get information about the student, relative to his/her learning process.
- To work on the diagnosis of the deviant structures of the learner.
In the future, all this information will be used in the development of the diagnosis module of IDAZKIDE (Díaz de Ilarraza et al, 1999), a student-oriented ICALL system for second language learning. IRAKAZI is composed of an interface to interact with the teacher and a knowledge base (field work).
This knowledge base contains information about the learners, the type of exercises they do and the deviations found in texts written by them (see figures 5, 6 and 7).

[Figure 5: First screen in IRAKAZI.]

The interface of IRAKAZI helps the teacher to provide the necessary information to keep in the knowledge base. Such information is composed of:
- Specific features of the student such as age, language level, mother tongue, other languages, studies, frequency of use of Basque, environment of use (home, business, tourism...) and so on (see figure 6).
- Information about texts written by the student (including a list of deviations found in the texts). Each deviation will be classified from three different points of view: i) a superficial point of view (e.g. omission of a letter), ii) a linguistic/metalinguistic point of view (e.g. agreement between subject and verb, creation of new words by means of loans…), and iii) a deep perspective (e.g. transfer from the mother tongue…). The last one includes the reasons why the deviations were committed (Maritxalar & Díaz de Ilarraza, 1993).
- Information about the exercises proposed and their objective. For example, some texts can be the result of guessing a story or talking about something heard before, etc.
At this moment, the implementation of IRAKAZI is done in Access, but a new, improved version implemented in Zope will be available on the Internet in a few months. In this new version, which we are working on, we will add information about the strategies that should be followed when helping the student to improve the knowledge related to his/her deviant structures. In order to do that, we will collect information about the most adequate types of exercises for the treatment of the deviant structures in each case, that is, in the case of each particular learner (see figure 7). Some years ago, when IRAKAZI and MUGARRI were designed and implemented, field work was done collecting Basque students' texts from some schools specialised in the teaching of the language (Díaz de Ilarraza et al., 1998). These text corpora are a very interesting source of data, and they will be described in the next section.
4. Text corpora
Annotated corpora of errors for Basque are a very important resource, not only for deriving an empirically based error classification, but also as a basis for the development of error detecting tools. In our case, ERREUS and IRAKAZI will be used as repositories of errors that will be annotated from linguistic/technical and psycholinguistic points of view, respectively. Text corpora provide the necessary information to both databases. In text corpora, each kind of error occurs with very low frequency and, therefore, big corpora are needed for testing. The task of collecting corpora is not easy, and it becomes very difficult when error corpora have to be collected. Even if such corpora were available, recognising error instances for evaluation is a hard task, as there are no syntactically annotated treebanks of Basque with error marks. So, if we want to obtain naturally occurring test data, hundreds of texts have to be automatically and manually examined and marked. This work of collecting Basque students' texts was done following some criteria: we collected written material from different language schools (IRALE1, ILAZKI, AEK) and grouped this material depending on some features of the texts, such as i) the kind of exercise proposed by the teacher (e.g.
abstract, article about a subject, letter…) and ii) the student who wrote the text.

[Figure 6: Specific features of the student.]
[Figure 7: Texts and deviations in the texts.]

1 IRALE, ILAZKI and AEK: schools specialised in the teaching of Basque

These were students who attended classes regularly, with different characteristics and motivations for learning Basque (e.g. different learning rates, different knowledge of other languages, mother tongue…). The corpus is made up of 350 texts written from 1990 to 1995. We codified the texts of the corpora following a prefixed notation (e.g. il10as) showing the language school (e.g. "il", ILAZKI), the language level (e.g. "10", Level 10), the learner's code (e.g. "a", first letter of the name Ainhoa), and the type of exercise proposed (e.g. "s", summary). Information related to this corpus is stored in ERREUS and IRAKAZI. In addition to these texts, an archive of 1600 e-mail messages from the mailing list "EuskaraZ", the first workgroup in Basque, has been collected. This list was created in 1996 with the purpose of exchanging information about everything related to Basque. This corpus has the advantage of being easily accessible, electronically available and containing linguistic errors. On the other hand, it has the disadvantage of being written in an informal language, sometimes with incomplete words and abbreviations, so it is not easy to analyse. Apart from this text corpus, we will use grammars with error examples (Zubiri, 1994) as a source of errors and texts for filling in ERREUS.
5. Conclusions and Future Work
We have implemented two databases that gather information on linguistic errors analysed from different points of view. An interdisciplinary approach has been followed when analysing written errors in Basque texts. IRAKAZI contains the information related to the learning process of the student (diagnosis, writer characteristics, grammatical competence…), and ERREUS provides a vast linguistic and technical description of the errors (classification of the error, description, occurrences in texts, possible tools used for detection/correction…). Both databases have a reference to previously built and encoded learner corpora. So, we can link the two databases and create a complete system that takes into account different points of view on errors/deviations. In the near future, we have two projects in mind: a. A robust grammar corrector for Basque. b. A system for syntax teaching that will improve IDAZKIDE. The information contained in ERREUS is essential in the development of both projects, while IRAKAZI is a very important source of psycholinguistic information that will be used in IDAZKIDE. HITES is a system for modelling the interlanguage of particular learners and the common interlanguage of learners at the same language level. Using the information obtained from HITES, IDAZKIDE will be able to give linguistic advice to the students considering their level. For that purpose, the tools previously constructed in our NLP research group (the spelling corrector, the electronic dictionaries, the tool for shallow parsing…) could be adapted taking into account the knowledge level of the student. In the future, we will construct mechanisms in the form of linguistic rules, grammars or statistical methods for the detection of, basically, grammar errors and deviations.
Acknowledgements This research is supported by the University of the Basque Country (9/UPV00141.226-14601/2002) and, the Ministry of Industry of the Basque Government (XUXENG project, OD02UN52)). Thanks to Eli Pociello for her help writing the final version of the paper. References Aduriz I., Aldezabal I., Ansa O., Artola X., Díaz de Ilarraza A., Insausti J. M. 1998 EDBL: a Multi-Purposed Lexical Support for the Treatment of Basque. In Proceedings of the First Int. Conf. on Language Resources and Evaluation, vol II, 821-826. Granada (Spain). Aduriz I., Alegria I., Artola X., Ezeiza N., Sarasola K., Urkia M. 1997 A spelling corrector for Basque based on morphology. Literary & Linguistic Computing, Vol. 12, No. 1. Oxford University Press. Oxford. 1997. 40 Alegria I., Aranzabe M., Ezeiza A., Ezeiza N., Urizar R. 2002 Robustness and customisation in an analyser/lemmatiser for Basque. In proceedings of the LREC-2002 Customizing knowledge in NLP applications workshop. Becker M., Bredenkamp A., Crysmann B., Klein J. 1999 Annotation of Error Types for German News Corpus. In Proceedings of theATALA workshop on Treebanks, Paris. Díaz de Ilarraza A., Maritxalar A., Maritxalar M., Oronoz M. 1999 IDAZKIDE: an intelligent CALL environment for second language acquisition. In Proceedings of a one-day conference "Natural Language Processing in Computer-Assisted Language Learning" organised by the Centre for Computational Linguistics , UMIST, in association with EUROCALL, a special ReCALL publication, 12-19. UK. Díaz de Ilarraza A., Maritxalar M. Integration of natural language techniques in the ICALL systems field: the treatment of incorrect knowledge UPV/EHU-LSI TR 9-93. Díaz de Ilarraza A., Maritxalar M., Oronoz M. 1998 An Implemented Interlanguage Model for Learners of Basque. Language Teaching and Language Technology. Swets and Zeitlinger (Publisher). Sake Jager, John Nerbonne and Arthur van Essen editors. Lisse. pp 149-166. Douglas, S., Dale R. 1992. Towards Robust PATR. In proceedings of COLING'92, Nantes. Ezeiza N., Aduriz I., Alegria I., Arriola J.M., Urizar R. 1998 Combining Stochastic and Rule-Based Methods for Disambiguation in Agglutinative Languages. In Proc. COLING-ACL'98, 10-14. Montreal (Canada). Gojenola K. & Oronoz M. 2000. Corpus-Based Syntactic Error Detection Using Syntactic Patterns. In proceedings of NAACL-ANLP00,Student Research Workshop . Seattle. Gojenola, K. 2000 EUSKARAREN SINTAXI KONPUTAZIONALERANTZ. Oinarrizko baliabideak eta beren aplikazioa aditzen azpikategorizazio-informazioaren erauzketan eta erroreen tratamenduan. Unpublished PhD thesis, University of the Basque Country. Karlsson F., Voutilainen A., Heikkilä J., Anttila A. 1995 Constraint Grammar: A Language-independent System for Parsing Unrestricted Text. Mouton de Gruyter. Karttunen L., Chanod J-P., Grefenstette G., Schiller A. 1997 Regular Expressions For Language Engineering. Journal of Natural Language Engineering. Latteier A. & Pelleitier M. 2001. The Zope Book. New Riders. Maritxalar M., Díaz de Ilarraza A., Alegria I., Ezeiza N. 1996 Modelización de la competencia gramatical en la interlingua basada en el análisis de corpus. Procesamiento del Lenguaje Natural (SEPLN), 19: 166-178. Maritxalar, M. 1999 Mugarri: Bigarren Hizkuntzako ikasleen hizkuntza ezagutza eskuratzeko sistema anitzeko ingurunea. Unpublished PhD thesis, University of the Basque Country. Zubiri I. 1994 Gramática didáctica del euskera Didaktiker, S.A. 
41 Towards a methodology for corpus-based studies of linguistic change Contrastive observations and their possible diachronic interpretations in the Korpus 2000 and Korpus 90 General Corpora of Danish Jorg Asmussen Society for Danish Language and Literature Det Danske Sprog- og Litteraturselskab DSL The Korpus 2000 Project, www.dsl.dk/korpus2000 Christians Brygge 1, DK-1219 Copenhagen K ja@dsl.dk Abstract Corpora serve as a widely accepted base for synchronic descriptions of language. Yet easily accessible general corpora that enable diachronic descriptions of language are still quite rare, the Korpus 90 and Korpus 2000 Corpora of Danish being one exception. Korpus 90 and Korpus 2000 were both designed and compiled at the Society for Danish Language and Literature (DSL). Korpus 90 comprises text material from the period 1983-1992 and was compiled in the early 1990s. Korpus 2000 is a recently compiled corpus holding text material from the period 1998-2002. The joint web-based query interface of the two corpora enables immediate comparative studies. This paper first gives a very brief introduction to the background of the two corpora before focusing on examples of contrastive observations and their possible diachronic interpretations - and misinterpretations. The examples cover frequencies (new words, vanishing words), the inflectional and collocational behaviour of certain words, and their connotations. An example illustrating syntactical differences is also briefly sketched. The paper then discusses whether these observable differences reflect real changes in the Danish language, or whether they reflect the probable fact of differently compiled - and thus perhaps incomparable - corpora. Finally, the paper proposes some prerequisites for a methodology of comparative corpus investigation and the determination of diachronic corpus similarity. In this context, the concept of invariant textual features will be introduced. 42 A New Machine Learning Algorithm for Neoposy: coining new Parts of Speech Eric Atwell, School of Computing, University of Leeds eric@comp.leeds.ac.uk http://www.comp.leeds.ac.uk/eric 1. Introduction: Unsupervised Natural Language Learning According to the Collins English Dictionary, “neology” is: a newly coined word, or a phrase or familiar word used in a new sense; or the practice of using or introducing neologies . We propose “neoposy” as a neology meaning “a newly coined classification of words into Parts of Speech; or the practice of introducing or using neoposies”. Unsupervised Natural Language Learning, the use of machine learning algorithms to extract linguistic patterns from raw, un-annotated text, is a growing research subfield; for examples, see Proceedings of annual conferences of CoNLL: Computational Natural Language Learning, or the membership list of ACL-SIGNLL, the Association for Computational Linguistics – Special Interest Group in Natural Language Learning. Corpora, especially tagged and parsed corpora, can be used to train `machine learning’ or computational language learning models of complex sequence data. 
(Jurafsky and Martin 2000) divide Machine Learning systems into Supervised and Unsupervised approaches (p. 118): "… The task of a machine learning system is to automatically induce a model for some domain, given some data from the domain and, sometimes, other information as well… A supervised algorithm is one which is given the correct answers for some of this data, using those answers to induce a model which can generalize to new data it hasn't seen before… An unsupervised algorithm does this purely from the data. While unsupervised algorithms don't get to see the correct labels for the classifications, they can be given hints about the nature of the rules or models they should be forming… Such hints are called a learning bias." Hence, in Corpus-based Computational Language Learning, a supervised algorithm is one trained using an annotated corpus; for example, a supervised ML parser such as (Atwell 1988; 1993) is trained with a Treebank of example sentences annotated with their parses. An unsupervised algorithm has to devise an analysis from raw, un-analysed corpus data; for example, an unsupervised ML parser such as (van Zaanen 2002) is trained with raw text sentences and has to propose phrase-structure analyses "by itself". 2. Clustering words into word-classes A first stage in Unsupervised Natural Language Learning (UNLL) is the partitioning or grouping of words into word-classes. A range of approaches to clustering words into classes has been investigated (eg Atwell 1983, Atwell and Drakos 1983, Hughes and Atwell 1994, Finch and Chater 1993, …, Roberts 2002). In general these researchers have tried to cluster word-types whose representative tokens in a corpus appeared in similar contexts, but varied what counts as "context" (eg all immediate neighbour words; neighbouring function-words; wider contextual templates), and varied the similarity metric and clustering algorithm. This approach ultimately stems from linguists' attempts to define the concept of word-class in terms of syntactic interchangeability; the Collins English Dictionary explains "part of speech" as: a class of words sharing important syntactic or semantic features; a group of words in a language that may occur in similar positions or fulfil similar functions in a sentence. For example, the previous sentence includes the word-sequences a class of and a group of; this suggests class and group belong to the same word-class, as they occur in similar contexts. Clustering algorithms are not specific to UNLL: a range of generic clustering algorithms for Machine Learning can be found in the literature (eg Witten and Frank 2000). These generic clustering systems require the user to formalise the problem in terms of a feature-space: every instance or object to be clustered must be characterised by a set of feature-values, so that instances with the same or similar feature-values can be lumped together. Generally, clustering systems assume each instance is independent; whereas when clustering words in a text, it may be helpful to allow the "contextual features" either to be words or to be replaced by wordclass-labels as clustering proceeds (as in Atwell 1983). 3. Ambiguity in natural language word-classification A common flaw, from a linguist's perspective, is that these clustering algorithms assume all tokens of a given word belong to one cluster: a word-type can belong to one and only one word-class.
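To make the type-clustering idea concrete, the following sketch (in Python, purely for illustration; it is not a reconstruction of any of the systems cited above) characterises each word-type by the words that immediately precede its tokens, and greedily groups word-types whose context profiles are similar. The choice of cosine similarity, the threshold and the greedy grouping strategy are arbitrary assumptions made for exposition.

from collections import Counter, defaultdict
from math import sqrt

def context_profiles(tokens):
    # Map each word-type to a count of the word-types that immediately precede its tokens.
    profiles = defaultdict(Counter)
    for prev, word in zip(tokens, tokens[1:]):
        profiles[word][prev] += 1
    return profiles

def cosine(c1, c2):
    # Cosine similarity between two sparse count vectors.
    dot = sum(c1[k] * c2[k] for k in set(c1) & set(c2))
    norm = sqrt(sum(v * v for v in c1.values())) * sqrt(sum(v * v for v in c2.values()))
    return dot / norm if norm else 0.0

def cluster_types(tokens, threshold=0.5):
    # Greedily add each word-type to the first cluster whose representative
    # (its first member) has a sufficiently similar context profile.
    profiles = context_profiles(tokens)
    clusters = []
    for word, profile in profiles.items():
        for cluster in clusters:
            if cosine(profile, profiles[cluster[0]]) >= threshold:
                cluster.append(word)
                break
        else:
            clusters.append([word])
    return clusters

print(cluster_types("a class of words and a group of words in a sentence".split()))

Run on a toy sentence echoing the dictionary example above, class and group (both preceded by a) fall into the same cluster; note, though, that each word-type still receives exactly one class, which is precisely the limitation discussed next.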
This approach results in neoposy which passes a linguist's "looks good to me" evaluation (Hughes and Atwell 1994, Jurafsky and Martin 2000) for some small word-clusters corresponding to closed-class function-word categories (articles, prepositions, personal pronouns): the author can claim that at least some of the machine-learnt word groupings "look good" because they appear to correspond to linguistic intuitions about word-classes. However, the basic assumption that every word belongs to one and only one class does not allow existing word-clustering systems to cope adequately with words which linguists and lexicographers perceive as syntactically ambiguous. This is particularly problematic for isolating languages, that is, languages where words are generally not inflected for grammatical function and may serve more than one grammatical function; for example, in English many nouns can be used as verbs, and vice versa. The root of the problem is the general assumption that the word-type is the atomic unit to be clustered, using the set of word-token contexts for a word-type as the feature-vector for measuring similarity between word-types, applying standard statistical clustering techniques. For example, (Atwell 1983) assumes that a word-type can be characterised by its set of word-token contexts in a corpus, where the context is just the immediately preceding word: two word-types are merged into a joint word-class if the corresponding word-tokens in the training corpus show that similar sets of words tend to precede them. Subsequent researchers have tried varying clustering parameters such as the context window, the order of merging, and the similarity metric; but this does not allow a word to belong to more than one class. 4. Classifying word types or word tokens? One answer may be to try clustering word tokens rather than word types. In the earlier example, we can say that the specific word-tokens class and group in the given sentence share similar contexts and hence share word-class, BUT we need not generalise this to all other occurrences of class or group in a larger corpus, only to occurrences which share similar context. To illustrate, a simple Prolog implementation of this approach, which assumes "relevant context" is just the preceding word, produces the following:
?- neoposy([the,cat,sat,on,the,mat],Tagged).
Tagged = [[the,T1], [cat,T2], [sat,T3], [on,T4], [the,T5], [mat,T2]]
The Prolog variable Tagged is instantiated to a list of [word, Tag] pairs, where Tag has an arbitrary name generated by the program, the letter T followed by an integer. These integers increment starting from T1, unless a word has a "context" seen earlier, in which case it repeats the earlier tag. In the above example, the word "mat" has the same context (preceding word "the") as the earlier word "cat", so it gets the same T2 tag instead of a new tag T6. We see that the two tokens of "the" have distinct tags T1 and T5 since they have different contexts; but the token "mat" is assigned the same tag as the token "cat" because they have the same context (preceding word-type). This also illustrates an interesting contrast with word-type clustering: word-type clustering works best with high-frequency words for which there are plenty of example tokens; whereas word-token clustering, if it can be achieved, offers a way to assign low-frequency words and even hapax legomena to word-classes, as long as they appear in a context which can be recognised as characteristic of a known word-class.
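The behaviour of this Prolog demonstrator can be mimicked in a few lines of Python (again only an illustrative sketch, assuming, as in the Prolog version, that the "relevant context" is just the immediately preceding word-type):

def neoposy(words):
    # Tag each token by its context, here the immediately preceding word-type:
    # an unseen context receives a fresh tag, a context seen earlier reuses
    # the tag it was given before.
    context_tag = {}
    tagged = []
    prev = None          # the first token has no preceding word
    for word in words:
        if prev not in context_tag:
            context_tag[prev] = "T%d" % (len(context_tag) + 1)
        tagged.append([word, context_tag[prev]])
        prev = word
    return tagged

print(neoposy(["the", "cat", "sat", "on", "the", "mat"]))
# [['the', 'T1'], ['cat', 'T2'], ['sat', 'T3'], ['on', 'T4'], ['the', 'T5'], ['mat', 'T2']]

Each distinct preceding word-type gives rise to one provisional tag, so two tokens receive the same tag exactly when they share a preceding word, as with cat and mat above.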
In effect we are clustering or grouping together word-contexts rather than the words themselves. 5. How many token-classes will be learnt? However, this prototype demonstrator also illustrates some problems with clustering on tokens. If we no longer assume all tokens of one type belong to one class, do we allow as many classes as there are tokens? Presumably not, as there would then be no point in clustering; in fact it would be hard to justify even calling this clustering. In the above example sentence there are 6 words and 5 tags: at least two words share a cluster, because they share a context. This means there are as many clusters as there are "contexts"; in the above case, a context is the preceding word-TYPE, so this implies we will get as many clusters as there are word-types. A million-word corpus such as the Lancaster-Oslo/Bergen (LOB) corpus yields about 50,000 word-types, so our simple token-clusterer would yield about 50,000 word-classes. This is a lot less than a million (one per token), but arguably still too many to be useful. A "learning hint" to guide clustering may be a constraint on the number of word-classes for a word-type, for example to say that all tokens of a word-type must partition into at most 5 word-classes; but how is this to be achieved? 6. Clustering by constraint-based reasoning In supervised Part-of-Speech tagging research, statistical approaches (eg the Constituent Likelihood Automatic Word-tagging System CLAWS, Leech et al 1983) have been challenged (and arguably surpassed) by constraint-based taggers, exemplified by Transformation-Based Learning taggers (Brill 1995) and the ENGTWOL English Constraint Grammar tagger (Voutilainen 1995). So, a constraint-based approach to neoposy may be worth considering. However, the constraints in the Brill and Voutilainen taggers were based on surrounding PoS-tags, not words. If there are no "given" (supervising) PoS-tags, all constraints in our neoposy system would have to be framed in terms of word-contexts. This introduces significant computational complexity: Voutilainen's tagger had over 1000 constraints based on parts of speech within a window context, but if each PoS-context had to be replaced with specific word-contexts the number of constraints could easily grow to unmanageable proportions. For the neoposy task, we would like a word-type to be allowed to belong to more than one PoS-class. This means the set of tokens (and their contexts) for each word-type needs to be partitioned into a small number of subsets, one for each PoS the word can have. We would like this partitioning to yield similar context-subsets for pairs of words which share a PoS.
We can try to formalise this: each word-type W_i is represented in a training corpus by word-tokens {w_i1, w_i2, …, w_in} and their corresponding contexts {c_i1, c_i2, …, c_in}. We wish to partition the set of contexts for every word-type W_i in a way which maximises the similarity of context-subsets between words which have the same PoS. That is, we want to find partitions of the context-sets for W_i, W_j, W_k, … of the form
{ {c_i1, c_i2, …, c_ia}, {c_ia+1, …, c_ib}, {c_ib+1, …}, …, {…, c_in} },
{ {c_j1, c_j2, …, c_ja}, {c_ja+1, …, c_jb}, {c_jb+1, …}, …, {…, c_jn} },
{ {c_k1, c_k2, …, c_ka}, {c_ka+1, …, c_kb}, {c_kb+1, …}, …, {…, c_kn} }, …
in a way that maximises "reuse" of similar context-subsets between words:
{c_i1, c_i2, …, c_ia} ≈ {c_j1, c_j2, …, c_ja} ≈ {c_k1, c_k2, …, c_ka} ≈ … ,
{c_ic, …, c_id} ≈ {c_je, …, c_jf} ≈ {c_kg, …, c_kh} ≈ … , …
This is a horrendously large constraint-satisfaction problem, potentially more challenging than traditional constraint-satisfaction applications such as scheduling, e.g. see (Atwell and Lajos 1993). A million-word training corpus such as LOB contains 1000K word-tokens and about 50K word-types, yielding an average of about 20 word-tokens, and hence word-token-contexts, per word-type. Cross-comparing all full context-sets, on the assumption that each word can belong to only one word-class, is already a heavy computational task. A set of 20 contexts could be partitioned in a very large number of ways, so it would take an impossibly long time to cross-compare every possible partitioning of every word with every other partitioning of every other word. 7. Semi-supervised clustering via a language discovery toolkit A general observation in machine learning is that completely unsupervised learning is very hard, but even a little guidance can yield much more plausible results. The speculation above suggests that wholly unsupervised machine learning of token-based clustering may be unachievable, but perhaps some hybrid of token-based and type-based clustering, combined with limited sensible "learning hints", may be more manageable. To explore the range of possible components and parameters which might make up a successful hybrid solution, we need a "toolkit" of compatible "language discovery" software modules, to try putting together in various combinations. 8. Further research aims This vague concept needs to be explored further, to pin down the sort of model and hints which could work; we need a programme of research combining Corpus Linguistics resources and Machine Learning in a Language Discovery Toolkit. (Atwell 2003) outlines a programme of further research: 1) to explore theories and models from Machine Learning and Corpus Linguistics, to synthesise generic frameworks and models; 2) to fit past and current research projects and software into a coherent generic framework and architecture; 3) to collate and develop a general-purpose software toolkit for experiments with a wide range of algorithms for Corpus-based Computational Language Learning: discovery of language characteristics, patterns and structures from linguistic training data; 4) to explore applications of this toolkit for Language Education, and for language-oriented knowledge-mining, in discovery of language characteristics, patterns and structures in a range of datasets from bioinformatics, astronomy and multimedia; 5) to disseminate the Language Discovery Toolkit to a wide range of potential users and beneficiaries, to uncover new unforeseen applications, by providing an internet-based showcase demonstrator for the wider research and education community.
References E Atwell, 1983 “Constituent-Likelihood Grammar” in ICAME Journal Vol.7 E Atwell, 1988 “Transforming a Parsed Corpus into a Corpus Parser” in Kyto, M, Ihalainen, O & Risanen, M (eds), “Corpus Linguistics, Hard and Soft: Proceedings of the ICAME 8th International Conference on English Language Research on Computerised Corpora”, pp61-70, Amsterdam, Rodopi E Atwell, 1993 ”Corpus-based statistical modelling of English grammar” in S Souter and E Atwell (eds), “Corpus-based computational linguistics: Proc 12th ICAME”,pp195-214, Amsterdam, Rodopi E Atwell, 2003. Combining Corpus Linguistics resources and machine Learning in a Language Discovery Toolkit. Internal research proposal, School of Computing, University of Leeds 46 E Atwell, N Drakos, 1987 “Pattern Recognition Applied to the Acquisition of a Grammatical Classification System from Unrestricted English Text” in B Maegaard (ed), “Proceedings of EACL'87: the Third Conference of European Chapter of the Association for Computational Linguistics”, Copenhagen, ACL E Atwell, G Lajos, 1993 ”Knowledge and Constraint Management: Large Scale Applications” in E Atwell(ed), “Knowledge at Work in Universities: Proc 2nd HEFCs-KBSI” pp21-25, Leeds, Leeds University Press S Finch, N Chater,1992 “Bootstrapping Syntactic Categories Using Statistical Methods” in “Proceedings 1st SHOE Workshop”, Tilburg University, The Netherlands J Hughes, E Atwell, 1994 ”The automated evaluation of inferred word classifications” in A Cohn (ed), “Proc 11th European Conference on Artificial Intelligence”, pp535-539, Chichester, John Wiley D Jurafsky, J Martin, 2000 “Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition”, Prentice-Hall, New Jersey F Karlsson, A Voutilainen, J Heikkila, A Anttila (eds), 1995 ”Constraint Grammar”, Mouton de Gruyter, Berlin A Roberts, 2002. Automatic acquisition of word classification using distributional analysis of content words with respect to function words, Technical Report, School of Computing, University of Leeds M. van Zaanen, 2002. Bootstrapping Structure into Language: Alignment-Based Learning, PhD Thesis, School of Computing, University of Leeds. 47 Detecting student copying in a corpus of science laboratory reports: simple and smart approaches Eric Atwell, Paul Gent, Julia Medori, Clive Souter, University of Leeds, Leeds LS2 9JT, England 0. Introduction This case study is an evaluation of generic, general-purpose plagiarism detection systems applied to a specific domain and task: detecting intra-class student copying in a corpus of Biomedical Science laboratory reports. From the outset, our project had the practical, pragmatic aim to find a workable solution to a specific problem. Biomedical Science undergraduates learn experimental methods by working through a series of laboratory experiments and reporting on their results. These laboratory reports are “peer-reviewed” in large classes, following a prescribed marking scheme; as the reports are effectively marked by other students rather than by a single lecturer, there is an opportunity for an unscrupulous student to avoid having to carry out and report on an experiment, by simply copying another student's report. 
To reduce this temptation, the Biomedical Science director of teaching, Paul Gent, approached Eric Atwell of the School of Computing and Clive Souter of the Centre for Joint Honours in Science, to look at ways to compare laboratory reports automatically, and flag candidates with signs of copying. We were joined by Julia Medori, forensic linguist from Trinity College Dublin, who developed and evaluated a range of possible solutions. 1. Detailed requirements analysis We examined example student reports in the specific genre, to identify measurable characteristic features (e.g. use of identical diagrams and graphs may be indicative of copying but generic checkers like Turnitin assume text-only essays). We also interviewed Biomedical Science teaching staff to elicit/confirm significant diagnostic features which can identify copying, and identify the level of copying considered unacceptable (as some overlap in lab reports is expected). 2. Survey of what is available and how well this matches the requirements specification Unlike other surveys of generic plagiarism detection systems, we aimed to evaluate systems against the detailed specific requirements identified above. A number of candidate systems are available, for example: - (CheatChecker 2002) - an n-gram based tool; - (SIM 2002) and (YAP 2002) based on longest common sub-sequence approach; - detecting ‘unusual’ similarities between a closed set of essays (or programs) is called collusion and is the basis for (Copycatch 2002); - (Copyfind 2002) is an example of a system aimed at scientific rather than humanities documents; - (Turnitin 2002) is probably the most widely-known system, but the website gives scant details of the underlying methods; - (Clough 2002) uses more sophisticated techniques from Natural Language Processing. We decided to develop two ‘simple’ copying-detection solutions to assess; and compared these against ‘smarter’ commercial-strength systems. 48 3. Collating a test corpus of Biomedical Science student reports Evaluation using a test corpus is a well-established methodology in Corpus Linguistics research, and is appropriate to this task: in our case, the aim is to test candidate systems and compare detection (and ‘red-herring') rates. 3.1 sources: first and second year coursework exercises The initial test corpus used in our study was composed of 103 Biomedical Science first-year student reports. At the outset we did not know whether this included real cases of copying, or if so how many; none had been detected by teaching staff. We found 2 pairs of reports were almost identical, and called these Albert1, Albert2 and Barbara1, Barbara2 (all filenames are randomly generated and are not the real names of the students). As there turned out to be few examples of `real' plagiarism in our initial corpus, we extended it by adding 94 reports written by second year students, covering 3 more challenging, less constrained tasks; during subsequent tests, we discovered 4 pairs which appeared to involve copying: Geraldine2, Fred2; Jane2, Naomi2; Colin2, David2; and Jane3, Naomi3. There appeared to be no other clear cases of copying. To ensure our corpus included samples which we knew were definitely copied, we also added an example of first-year plagiarism that had been detected previously by Biomedical Science staff, which we called Scanfile1, Scanfile2; we were given the hardcopy of two reports to scan using an Optical Character Recognition tool. 
3.2 adding artificial TestFiles by merging parts of existing reports To check the sensitivity of each plagiarism detection system tested, we created 19 new TestFiles: each composed of one file from which 5% (then 10%, ..., 95%) was removed and replaced by a portion of another file. We also created 2 extra TestFiles: TestFileAB, made up of half SourceA and half SourceB; and TestFileABC, made up of TestFileAB with added text from SourceC from a different genre (an email message). This gave a final corpus of 220 student laboratory reports including artificial TestFiles (which were still "authentic corpus material" as they were composed of parts of real reports). 3.3 Format and content of the laboratory reports The first year assignment described the experiment and gave instructions on how to use a computer simulation of the laboratory apparatus to generate results; then the students had to produce tables and graphs and answer a few questions to analyse the results. It seemed likely that any plagiarism we would find would be in the text of the answers. Unlike the first year students, the second year students did not have to answer specific questions; so the text should be more 'free' and show more originality. However, the reports still follow the same generic structure of all science laboratory reports: Introduction, Methods, Results, Discussion. A characteristic of this genre is that a lot of the vocabulary from the specific science domain will be shared between all the reports. The answers were quite short (generally about 25 words, occasionally up to 100 words or more), so there is not much scope for originality from each student. As most of the plagiarism checkers available at the moment appear to have been developed to detect plagiarism either in humanities essays (high level of originality) or in programming (with limited vocabulary), there is no existing copying checker for the specific problem of scientific reports: a low level of originality but not a limited vocabulary. 4. Testing and Evaluation of a range of candidate copying-detection systems We used our corpus to test contrasting approaches to detection. We developed two "simple" programs: Zipping, based on a standard file-compression tool, and Bigrams, a basic comparison of frequent bigrams; and we tested "smart" methods using the commercial-strength plagiarism-checking systems Turnitin, Copycatch, and Copyfind. Our findings are reported in detail in (Medori et al 2002). 4.1 Zipping, a simple compression-based approach As a baseline of our study, we tested the idea that, using file compressors or zippers, we would be able to measure a 'distance' between files according to their similarities. The algorithm of common zippers (like gzip) finds duplicated strings and marks up the matching strings with a pointer, which means that, by appending a questioned text to different corpora (e.g. of different authors or languages) and measuring the difficulty of zipping these files, we can measure the level of similarity between two files. The smaller the zipped file of the concatenation, the more common elements the questioned text shares with the corpus. 4.2 Another 'simple' approach: Bigrams This method is another 'home-grown' program based on the idea that when 2 files are similar, they should share a high number of their bigrams. The programs were all implemented in Perl. We tested this method first with character bigrams and then with word bigrams.
In the first place, we created a model for each file in our corpus by counting all the occurrences of each character bigram in the file. To compare two files, we simply checked whether the most frequent bigrams were the same in both files. We compared the results for a model containing the 10, 20, or 30 most frequent bigrams. Note that this comparison did not take account of the frequencies of the top 10/20/30 bigrams; it just compared the members of the two lists. The results even for "top 30" character bigrams were disappointing: this method will flag copying of an entire document by highlighting documents sharing the same 30 most frequent letter-pairs, but below a total match, most files share a large proportion of character bigrams. This may be at least in part because most laboratory reports will have similar style and content, so a somewhat more sophisticated discriminator is needed to go beyond finding near-identical documents. We tried the same experiment with word-bigrams, and got better results: word-bigrams are better discriminators. 4.3 Turnitin: a commercial WWW-based system The first commercial system we used was Turnitin.com, a web-based plagiarism detection service first designed to detect material cut and pasted from the Internet, but which also detects similarities between submitted papers; Leeds University has subscribed to the service. The submission of papers is done by means of a laborious cut-and-paste of the text from each document. After about 24 hours, an "Originality Report" is sent to the user's account for each submitted paper. We found that TurnItIn gives quite good results. However, the fact that it gives an index of similarity between the file and all papers found on the web and in the database leads it to miss a few cases of plagiarism, especially in the case of the pair 'Colin-David'. Turnitin gives as a result the level of similarity between the submitted paper and all the papers in the database and the data found on the web; it seems a waste of time checking against the web and against papers other than the ones in the class. A more significant shortcoming is the laborious input method. On the positive side, Turnitin underlines the similarities, making them easier to visualise; and it makes it easier to detect partial plagiarism. 4.4 CopyCatch: a commercial PC-based system These conclusions led to the next step of our study, to test CopyCatch, software developed by Woolls from CFL Software Development. It seemed that it might be more appropriate to our specific needs as it gives instant results, sorting them by percentage of similarity between 2 files. The main distinguishing characteristic of this software was that it was much easier to use, especially the submission process: it allows the user to browse and select the files to check. The results are almost immediate. It outputs a list of pairs of files sorted by percentage of match between them. It is then possible to have a list of the vocabulary or phrases shared between the 2 files, and it can also mark up the files highlighting the similarities between them. It allows the user to check Word documents as well as text, rtf and HTML files: no previous conversion is needed. CopyCatch found all our 'real' cases of plagiarism. It even found the pair 'Colin-David', which was not an obvious case of plagiarism and which we missed using TurnItIn. The selection of our files in a few clicks is a very important practical advantage.
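For comparison with the commercial systems, a minimal sketch of the two "home-grown" measures of sections 4.1 and 4.2 is given below. Our own programs were written in Perl; this Python version is for illustration only, the report file names are hypothetical, and the exact normalisations (a ratio of compressed sizes, and a Jaccard-style overlap of the top-n word-bigram sets) are assumptions made here rather than the original implementation.

import zlib
from collections import Counter

def zip_distance(text_a, text_b):
    # How much does appending B to A add to the compressed size, relative to
    # compressing B on its own?  Values well below 1 suggest that B repeats
    # material already present in A.
    size_a = len(zlib.compress(text_a.encode("utf-8")))
    size_b = len(zlib.compress(text_b.encode("utf-8")))
    size_ab = len(zlib.compress((text_a + text_b).encode("utf-8")))
    return (size_ab - size_a) / size_b

def top_word_bigrams(text, n=30):
    # The n most frequent word bigrams of a text.
    words = text.lower().split()
    return set(bigram for bigram, _ in Counter(zip(words, words[1:])).most_common(n))

def bigram_overlap(text_a, text_b, n=30):
    # Proportion of the combined top-n word-bigram lists shared by both texts.
    a, b = top_word_bigrams(text_a, n), top_word_bigrams(text_b, n)
    return len(a & b) / max(len(a | b), 1)

report_1 = open("Albert1.txt").read()   # hypothetical file names, echoing the corpus above
report_2 = open("Albert2.txt").read()
print(zip_distance(report_1, report_2), bigram_overlap(report_1, report_2))

A pair of reports would then be flagged for manual inspection when its compression-based distance is unusually low, or its bigram overlap unusually high, relative to the other pairs in the class.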
4.5 CopyFind: a university-developed freeware alternative As another alternative, we decided to test CopyFind, which is available for Linux as well as for Windows. This software requires that the CopyFind executable and all the files to be checked are in the same folder. The user has to add a list of all the files that have to be checked. These can be Word documents. The user can define a threshold of similarity. The system will then output a file containing all the pairs of files and the number of matched words for each of them. An HTML report will then be created for every pair which is above the threshold, with underlined similarities. We ran the program with standard parameter values; the results were quite good, but it didn't detect the pair 'TestFile-TestFile2'. To find this pair in the results, we had to lower the threshold by changing the minimum number of matching words to report from 500 (the suggested value) to 300. The modification of this threshold didn't change any of our other results. CopyFind is a program that works quite well and gives a clear output underlining the similarities between the files. The results are immediate, but its submission method is cumbersome. 5. Discussion We found little difference in detection rates between "simple" and "smart" approaches, and in practice the main performance discriminator was ease of use rather than accuracy. We found this surprising, given the naivety of the "simple" models: Zipping found copying as a side-effect of file compression; and Bigrams did not compare frequencies, each document being simply characterised by a vector of its 50 most-used word-pairs. These minimal models were enough to flag all known real cases of plagiarism in our corpus. Turnitin gave interesting results too, as it was the only system to flag the overlap between our two artificially-constructed test files (TestFileAB, TestFileABC) and their sources (SourceA, SourceB). However, it didn't successfully detect the pair Colin, David. As this plagiarism is more representative of how students really plagiarise, this failure outweighs the success with the artificial TestFiles. In the tests, all approaches successfully flagged examples of "blatant" copying, but results were less clear-cut when reports had small amounts of overlapping text. The other top results were mainly pairs of small but unrelated files. This could be because of the nature of the assignment: the students often rewrite the objectives of the experiment, copy the questions, or answer them by rephrasing the question as a negative or affirmative sentence. This could explain why the reports where the students did not develop the answers are found first. This also suggests that we could try to take account of knowledge specific to our domain. In Biomedical Science laboratory reports, students must always write four sections: Introduction, Method, Results, Discussion. In fact this is standard practice even for research journal papers; and this format is a mandatory part of the coursework specification. Biomedical Science lecturers know that certain parts of the text can legitimately be copied (the given specification of the Method, and specific questions set by the lecturer), whereas originality is more important in other parts (notably the Discussion section). Perhaps a copying-checker should include some "intelligence", to make it ignore overlap in unimportant sections and focus on the Discussion section; overlap in the Discussion is a much surer indicator of copying.
On the other hand, perhaps we do not need such "intelligence". Research into students' reasons for copying suggests that the reason is often either last-minute panic to meet a submission deadline, or else laziness or unwillingness to devote the time needed to do the work. In these circumstances, students generally make blatant copies rather than intelligently disguised copying through time-consuming detailed modifications. Our "simple" approaches suffice for these cases. Anecdotal evidence from Biomedical Science teaching staff suggests that if cheaters modify the copied document to try to avoid detection, they tend to change just the first page (on the grounds that this is all that will be seen in a cursory skim through 200+ reports for blatant copying). A "simple" answer could be to check for blatant copying but ignore the first page; this would incidentally upgrade the significance of the Discussion section, as it comes last. 6. Conclusions We have analysed the requirements of our specific problem, through discussion with Biomedical Science teaching staff. We have surveyed a range of candidate systems, and evaluated five systems (Turnitin, Copycatch, Copyfind, and two "home-grown" systems, Zipping and Bigrams) in experiments with a Test Corpus of Biomedical Science student laboratory reports. We concluded that none of the systems stood clearly above the rest in their basic detection ability, but that Copycatch seemed marginally more successful, and, more significantly, it had the most appropriate user interface for our needs as it ran on a local PC and simplified the analysis of a whole directory of lab reports. Following personal contact with the Copycatch developers, we agreed to be the first beta-test site for the enhanced Java version. We have installed Copycatch on the Biomedical Science computing network for future use by staff. A side-effect of the availability and use of copy-checking software is the need to migrate from paper-based report submission and processing to electronic submission, probably via student pigeon-holes in the "Nathan Bodington" virtual building, an e-learning environment set up by the Flexible Learning Development Unit for general teaching use at Leeds University. The Flexible Learning Development Unit has accepted our recommendations (Atwell et al 2002). The School of Biomedical Sciences has decided to require electronic submission of all laboratory reports in future, to facilitate use of copying-detection software. During tests, we found evidence to support suspected cases of copying, and discovered at least one new (previously unknown) case of student copying, which was reported to the appropriate authority. A growing array of tools has appeared for the detection of plagiarism and copying; some authorities such as JISC in Britain advocate standardised adoption of a generic centralised web-based system. We conclude that our task is an exception meriting a specialised corpus-based study and tailored solution; and that such language analysis tools can be a catalyst for reform in assessment practices. 7. References Atwell, Eric; Gent, Paul; Souter, Clive; Medori, Julia. 2002. Final Report of the C&IT in the Curriculum project: Customising a copying-identifier for Biomedical Science student reports, School of Computing, University of Leeds. http://www.comp.leeds.ac.uk/eric/citFinalReport.txt, or http://www.comp.leeds.ac.uk/eric/citFinalReport.doc CheatChecker. 2002. http://www.cse.ucsc.edu/~elm/Software/CheatChecker/ Clough, P. 2002. http://www.dcs.shef.ac.uk/~clough CopyCatch. 2002.
http://www.copycatch.freeserve.co.uk CopyFind. 2002. http://plagiarism.phys.virginia.edu/software.html Medori, Julia; Atwell, Eric; Gent, Paul; Souter, Clive. 2002. Customising a copying-identifier for biomedical science student reports: comparing simple and smart analyses. In: O'Neill, M, Sutcliffe, R, Ryan, C, Eaton, M, & Griffith, N (editors) Artificial Intelligence and Cognitive Science, Proceedings of AICS02, pp. 228-233, Springer-Verlag. SIM. 2002. http://www.few.vu.nl/~dick/sim.html Turnitin. 2002. http://www.turnitin.com/ YAP. 2002. http://www.cs.su.oz.au/~michaelw/YAP.html A Corpus of Sworn Translations – for linguistic and historical research Francis Henrik Aubert Stella E. O. Tagnin University of Sao Paulo 1. Introduction In Brazil, any and all documents and papers, when in a foreign language, need to be translated by public sworn translators if they are to be used for any official purposes. Such documents will normally range from school papers needed for a student's transfer from one country to another, birth, marriage or death certificates for naturalization, marriage or inheritance purposes, up to contracts, powers-of-attorney, promissory notes and articles of incorporation or other commercial documents for international transactions. In principle, however, any text may be submitted to a "sworn" translation procedure if such a text, for any reason, is to be processed by government authorities, at any level, or by the Courts. Thus, for instance, a love letter may well have to be translated by a public sworn translator if such a letter is used as evidence in a divorce suit. A translation of a play may give rise to a suit for copyright infringement, and, as a consequence, even a literary text may have to be dealt with by a sworn translator, back-translating it into the initial source language, as documentary evidence in the records of the suit (see also Aubert 1996). Public translators must register these translations (i.e., true and unabridged copies of each translation delivered to the clients) in a special 'book of records', conforming to a certain number of rules. Upon their retirement or death, these books are turned in to the Board of Trade of the state in and for which they are qualified (the Junta Comercial). The Board of Trade of the State of Sao Paulo (JUCESP) has approached the University of Sao Paulo, where a large corpus for teaching and translation purposes is being built – the COMET – offering this material for inclusion in the corpus. It covers a period of over one hundred years and is written in more than 20 languages. However, this offer is not without problems: a large part of the material will require long and strenuous restoration work, which, in turn, will demand substantial funding. To get the project off the ground in a relatively short period of time, it has been decided to concentrate our efforts first on the books of the last thirty years with translations into and out of Portuguese, English, German and Spanish, the languages presently addressed by the COMET. This material will not only allow for research in linguistic and translation matters, such as contrastive stylistics, lexicography, legal terminology, translation norms and translationese, but also in studies of a more historical nature, especially Brazil's immigration waves, periods in which there is a high demand for public translations to comply with the legal aspects of immigration. We can envisage historians and sociologists as interested researchers.
The article will discuss the relevance of this unique material, the design and structure of the corpus, its population, the audience it is aimed at, the problems expected in the preparation of the texts, the header, and possible research areas. 2. The COMET project This project grew out of two experiments in teaching English-to-Portuguese translation to students on the Specialization in Translation course at the University of Sao Paulo (Tagnin 2002a; 2003b to appear). The students built small bilingual corpora ranging between 100,000 and 200,000 words in each language in several technical areas1 over a period of two years, 2000-2001. This material was put together on a CD-ROM and made available for their terminological research. The resulting glossaries are available at http://www.fflch.usp.br/citrat As other corpora were being built to inform masters and doctoral studies, it was decided to bring all this material together under a common project: COMET – A Multilingual Corpus for Teaching and Translation. The COMET consists of three subcorpora: a Technical Corpus, a Learner Corpus and a Translation Corpus. The Technical Corpus favours mainly three areas in which a significant lack of terminological sources has been identified by professional translators: Commercial Law, Computing and Orthodontics. This means that regular work is being carried out to enlarge these corpora systematically. Nevertheless, all technical corpora produced by student work or otherwise at the University of Sao Paulo (USP) will eventually be hosted here.2 As corpus work became more evident at USP, we joined the Br-Icle project, which was being conducted by Tony Berber Sardinha at the Catholic University of Sao Paulo. This was brought to the attention of other scholars interested in language teaching/acquisition and it was subsequently decided to construct a multilingual learner corpus at USP, as the Department of Modern Languages comprises five different areas: English, French, German, Italian and Spanish. At the moment of writing, English, German and Spanish have teamed up to pursue this project (for more details see Tagnin 2003a). The Translation Corpus initially consisted of student literary translations: 9 American short stories and 20 Canadian short stories. The latter have been published in book format, along with a bio-bibliography of each author (Tagnin 2002b). To these will be added parallel (original/translation) literary texts collected in the course of several theses and dissertations in translatology, including short stories by Edgar Allan Poe (for which several renderings into Brazilian and European Portuguese are available) and the translation into English of the classic "Os Sertoes" ("Rebellion in the Backlands") by Euclides da Cunha, to mention but a few. Inclusion of uncorrected student production has been considered so as to form a Translation Learner Corpus to function as a source for research in the pedagogy and practice of translation. As mentioned above, JUCESP has approached USP to donate the books of records of deceased and retired sworn translators to be included in COMET's Translation subcorpus. The sections below will provide the details of the project. 3. The JUCESP subproject The corpus is meant for research primarily in the fields of legal translation, lexicology and terminology. It will initially be fed with JUCESP texts translated from English, German and Spanish into Portuguese or out of Portuguese into any of these foreign languages.
The audience envisaged is both translation students and teachers, lexicographers and terminologists. However, as all texts will be included in their full form they will also make valuable material for researchers interested in discourse analysis. 1 The areas in which these corpora were built are: Biotechnology: transgenic foods; Cooking: Spices; Computing: Security; Fashion: Clothes; Veterinary: Bovine diseases; Ecology: Biodiversity; Dentistry: Orthodontics; Automation: Safety Locks; Business: Brazilian Financial Market; Tourism: Ecotourism; Genetic Engineering: Genoma. 2 Part of this corpus will be fed into the Lácio-Web Project (Aluísio et al., 2003). In exchange, all corpus tools developed within that project are being made available to the COMET Project. 55 As explained below, parallel texts (originals and their translations) will be rare so that the corpus will be above all a comparable one. It will also be an open-ended corpus as new material may be inserted when new material is processed or more books of records are made available due to the death or retirement of other sworn translators. A relevant feature is that the material is of public domain so that no copyright restrictions apply. The sworn translations on file at the Board of Trade are organized in ‘books of records', identified by the name of the sworn translator and by the foreign language from or into which the translation was carried out. Thus, e.g., a sworn translator qualified for English, French and Spanish, will have three series of volumes (of up to 400 pages each), one for each foreign language. If a given original contains textual material in more than one foreign language, the translation will normally be recorded in a volume corresponding to the prevailing language in the original text, or, alternatively, in the volume corresponding to the official language, if any, of the country in which the original document was issued.3 The major foreign languages represented (as source or as target languages) in the material are English, Spanish, French, Italian and German (roughly in this order). A certain amount of material is also to be found for Arabic, Dutch, Greek, Hebrew, Hungarian, Japanese, Korean, Latin, Norwegian and Russian, though for these languages the actual volume has yet to be assessed. 3.1 Conversion of the texts into electronic format One initial major difficulty will be the conversion of the material into electronic format. Up to the early 50's, most translators copied the translations into the book of records by hand. From the mid 50's and up to the early 70's, the copies found are mostly carbon copies. Photostatic reproduction became common only in the 1980's, and, even here, the quality of the copies is not always such that it will permit electronic scanning to be performed without involving an unreasonable amount of revision. It is therefore expected that, to a very large extent, the material will have to be transcribed, an operation which – as one knows from the times of Medieval copyists – is fraught with the risks of errors, slips of the finger, lapses and the conscious or subconscious desire to “improve”. The fact that, over the relevant period (1902-2002), Brazilian Portuguese has undergone two major orthographical reforms, does not render the task any easier. Nevertheless, the variety of the material and potential rewards for research are such that the effort involved (including strict supervision of the electronic transcripts) is felt to be well worth the while. 
Due to the volume of material donated, it will be processed in batches: 1. texts covering the last 30 years: 1972-2002; 2. texts covering the period 1935-1971; 3. texts covering the period 1902-1934. 3 The linguistically hybrid texts are indeed fairly common. A daughter company of a US holding, organized in the Grand-Duchy of Luxemburg, will have its Articles of Association drawn up in English, followed by a French version, and with its notarization also in French, unless one of the signatories is resident in Italy, in which case one of the notary public acknowledgments may be worded in Italian. In a more extreme case, a bill of lading was found to have been printed in English, its different boxes filled out in a blend of Portuguese and Spanish (in all likelihood, in a not very successful attempt to produce Portuguese); the rubber stamp of the carrier was worded in German, the address of the shipping company appeared in Norwegian, but the notarization had been conducted in the Canton of Ticino, Switzerland, and was thus formulated in Italian. Since the basic text (the starting point) was in English, this translation was inserted in the book of records for the English language of the relevant translator. 56 As soon as a substantial amount of the first batch has been processed, it will be made available on the Web as a pilot corpus. It will then be updated regularly as new texts are prepared and ready to be read electronically. 3.2 Text types Although any text can, in a given situation, be subjected to a “sworn translation” procedure (see Introduction), the major part of the translations available in the material can be divided into four main groups: (a) personal documents (identity documents, birth, marriage, divorce and death certificates, school documents, and the like); (b) corporate documents (articles of association or incorporation, corporate deliberations, minutes of shareholder meetings, secretary's certifications, etc.); (c) financial documents (bills of lading, agreements in general – purchase and sale, rental, licensing of trademarks, technology transfers –, promissory notes and other securities); and (d) legal documents (petitions, letters rogatory, court decisions). Obviously, groups (b), (c) and (d) intersect, to a large extent, in terminology and phraseology, although their specific purposes and the actors involved in the respective communicative processes are somewhat different. Texts dealing with matters directly related to industrial technology are relatively rare, save for patent registrations and exhibits to technology transfer agreements, but even these contain, to a greater or lesser extent, terms and phrasings which pertain to the legal and commercial specialty languages. 3.3 The structure of the corpus Because of the comparative and contrastive studies envisaged, structuring of the material by language will take precedence. The next level will be by date and then by text type. 
For instance:

English
  2002
    Personal Documents
      Birth, Death, Marriage and Divorce Certificates
      Transcripts
    Financial Documents
      Balance Sheets
      Bills of Lading
      Promissory Notes
    Corporate Documents
      Articles of Incorporation/Association
      Minutes of Shareholder and Board Meetings
    Legal Documents
      Petitions, pleas and similar acts of bringing suit
      Powers of Attorney
      etc.

German
  2002
    Personal Documents
      Birth, Death, Marriage and Divorce Certificates
      Transcripts
    Financial Documents
      Balance Sheets
      Bills of Lading
      Promissory Notes
    Corporate Documents
      Articles of Incorporation/Association
      Minutes of Shareholder and Board Meetings
    Legal Documents
      Petitions, pleas and similar acts of bringing suit
      Powers of Attorney
      etc.

3.4 Header Each text will be identified by a code indicating language direction, date, text type and text number. Thus GP902BC0001 means: German into Portuguese, 1902, Birth Certificate, text nr. 00001 in that category. This will allow up to 99,999 texts in each category. The potential total will probably never be reached for most categories, but if we are to plan ahead and hope to process most of the material, we may well come close to that number for at least a few categories, especially in English, which is, by far, the most prevalent source language. A list will be drawn up of all text types and their corresponding abbreviations, which will be part of the identifying code of each text (see next section). A header, based on the international standards devised for the Translational English Corpus (TEC)4, developed at the UMIST (University of Manchester Institute of Science and Technology), will provide more detailed information as to the direction of translation (into mother tongue, into foreign language), language of source text, language of target text, the date of the translation, the name of the translator, his/her sex and nationality (if available), the text type, the extent of the text, and the subject. This procedure requires careful analysis of the material so that the header is completed correctly as the data therein will serve as pointers for selecting the texts to be submitted to research with the aid of electronic search tools, i.e., the texts will be searchable by language, by text type, by date, by translator, by subject etc. 3.5 A brief analysis of the JUCESP texts Legally speaking, sworn translations are not "independent" texts. A sworn translation is not used officially in Brazil in lieu of the original; rather, it ensures that the original text5 may be put to official use, and the original and the translation are therefore jointly submitted to the public office, agency, court or other institution (schools and academies, banks, insurance companies, traffic departments, etc.) to which they are destined. This, in principle, suggests that sworn translations will tend to adhere more closely to the original texts, that, in a sense, they will be more "literal" than common unofficial translations. This, indeed, is – supposedly – a defining feature of sworn translations in general: the intent of such translations is to assist the recipient in understanding the text within its source cultural setting, and not to propose solutions which would have been appropriate if the text had been originally produced in the target language setting. In Venuti's (1995) terms, sworn translations adopt – or ought to adopt – translation strategies or procedures which are "foreignizing" rather than domesticating.
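Returning briefly to the identification scheme of section 3.4 above, the sketch below (Python) decomposes a code of the form GP902BC0001 into its fields. The field boundaries and the lookup tables are inferred from the single example given there, and the serial number is allowed to be four or five digits since the example shows four digits while the prose mentions up to 99,999; these are assumptions for exposition, not the project's actual specification.

import re

# Illustrative lookup tables only; the project's full abbreviation list would replace these.
LANGUAGES = {"G": "German", "P": "Portuguese", "E": "English", "S": "Spanish"}
TEXT_TYPES = {"BC": "Birth Certificate"}

def parse_text_id(code):
    # Two language letters (source, target), a three-digit date field, a
    # two-letter text-type abbreviation, and a four- or five-digit serial number.
    m = re.fullmatch(r"([A-Z])([A-Z])(\d{3})([A-Z]{2})(\d{4,5})", code)
    if not m:
        raise ValueError("unrecognised text identifier: " + code)
    source, target, date_field, text_type, serial = m.groups()
    return {
        "source_language": LANGUAGES.get(source, source),
        "target_language": LANGUAGES.get(target, target),
        "date_field": date_field,          # '902' corresponds to 1902 in the example
        "text_type": TEXT_TYPES.get(text_type, text_type),
        "serial_number": int(serial),
    }

print(parse_text_id("GP902BC0001"))
# {'source_language': 'German', 'target_language': 'Portuguese', 'date_field': '902',
#  'text_type': 'Birth Certificate', 'serial_number': 1}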
The material to be included in the corpus will, it is expected, afford the possibility of testing this "foreignization" hypothesis, and, more specifically, assist in identifying if and to what extent other factors beyond the "sworn translation mode" as such tend to stimulate or check the "foreignization rule" (e.g. text typology, translational direction – from/to Brazilian Portuguese –, subject matter, etc.). But here the material available presents a specific problem. Although the sworn translators must retain a full copy of each translation they have produced in their official capacity, there is no equivalent requirement concerning copies of the original texts. Thus, the books of records which will be fed into the COMET project will not provide a strict parallel corpus, and the original texts can only be inferred from the translations. This limitation will, at times, pose a problem in that it will not allow a comparison with the original. An odd passage in the translation might, of course, indicate an error or mistake of the translator. Alternatively, however, it could be indicative of an attempt to reproduce an oddity existing in the original, given that a sworn translation is supposed to "mirror" the original, not to improve on it (although, not infrequently, the original texts would probably have benefited from such improvements) (see Aubert, 1996, op. cit.). 4 www2.umist.ac.uk/ctis/research/TEC/tec_home_page.htm 5 or a certified copy thereof, but never a non-certified copy or other reproduction (facsimile, electronic printout, etc.). 3.6 Research possibilities Despite the limitations referred to in the preceding paragraphs, the material is expected to provide a wide range of information relevant to linguistic and historical research. A cursory review of the material seems to indicate that approximately three quarters of the translations have been made into Portuguese, the remaining quarter corresponding to translations from Portuguese into the relevant foreign languages. Despite the marked difference in distribution, the sheer quantity of the material available (some 3,000 volumes – i.e., approximately 1,200,000 pages – covering the entire 20th century) is expected to provide sufficient textual samples for a thorough linguistic, stylistic, translational, intercultural and historical investigation in both translation directions (from/into Portuguese), and afford relevant comparisons. For terminological purposes, the corpus of sworn translations is expected to provide a vast range of real-life translational situations, disclosing both the underlying strategies and the actual solutions provided. The fact that different source and target languages are involved will also afford a reasonable degree of comparison, and the possibility of verifying to what extent the specific source or target language exerts, as such, any influence on the terminological options made by the translators. The terminological studies afforded by the corpus will not only derive from the actual translated texts. Some sworn translators have, in the course of their careers, set up their own personal glossaries, and one such glossary (containing close to 2,000 terms) has already been made available to the Centre for Translation and Terminology at USP by the heirs of a recently deceased sworn translator for the German language.
The textual material to be included in the COMET will serve the purpose of validating, reviewing and/or varying the solutions proposed, by comparing the glossary with the actual usages and corresponding contexts found in the said translator's recorded sworn translations. Under a subsequent stage, the terminological solutions validated for the production of this specific translator can be compared to those found in the records of other translators as well as those proposed/recommended by Chambers of Commerce, monolingual glossaries in German and Brazilian Portuguese, and so on. Much the same expectations are relevant to phraseology, although, since phraseology is intimately related to stylistics and to the idiomatic features of each language, one might expect that the subcorpus of translations into the foreign languages will afford a safer ground for comparative analysis than translations into Brazilian Portuguese from several different foreign languages. Consider, for instance, the more or less set phrase which usually closes the preamble to a standard Brazilian contract: “As partes qualificadas supra tem entre si justo e acordado o que segue” (discursively equivalent to “Now therefore, in view of the promises and mutual undertakings and obligations contained herein, the parties agree to be bound by the following terms and conditions”). Although the actual original texts will not be available, the predictability of this set phrase is such that it can be readily inferred in the several translations to be analysed. It can be expected to reappear in a number of different translations, into a variety of languages, with varying solutions, and will thereby provide a basis for a typical “stylistique comparée” investigation, much as originally conceived by Vinay and Darbelnet (1958). 59 In the absence of the corresponding original texts, a direct observation of the translation procedures involved cannot be conducted in any systematic fashion6. Yet, the observation of the lexical and syntactical structures and their frequencies in the translated material, as compared to their corresponding frequencies in authentic original texts and in common (i.e. “not-sworn”) translations is expected to unveil the degree of structural “contamination” (or literal shift) of sworn translations from and into Brazilian Portuguese. The from/to distinction is here of a certain relevance. In fact, one of the initial hypotheses is that the translational strategies will not be the same, but will vary according to the translational direction.7 Also of a certain interest is the observation of translational solutions which become standard “translationese” for translations from Brazilian Portuguese into foreign languages, as markers of Brazilian cultural, legal and institutional specificities, e.g. the widespread use of “quotaholder” as a translation for “quotista” (a reference to the shareholder of a Brazilian limited liability company, legally termed “sociedade por quotas de responsabilidade limitada”) or of the official name of the country as “Federative Republic of Brazil” (derived from “República Federativa do Brasil”, although “Federal Republic of Brazil” would probably be a more idiomatically adequate solution in English).8 As already mentioned, the first stage of the corpus will cover a period corresponding to the last thirty years (1972/2002). At later stages, the books of records of sworn translators from earlier times will also be included, covering a full century. 
Even the initial stage, however, will provide sufficient elements for diachronic investigations; probably not in terms of linguistic structure, but most certainly in terminology and, very possibly, in translation procedures and strategies. Here, a number of relevant extralinguistic factors are likely to have exerted marked influences: the redemocratization process, after a long period of military rule, culminating with the 1988 Federal Constitution (and the new institutions and legal concepts arising therefrom); the first free presidential elections in a generation, in 1989; the opening of Brazilian economy as from 1990; the intensification of the commercial and cultural exchanges with other Latin American countries, specially within the framework of the Mercosur; the growing assertion of intellectual property rights by actions in court or otherwise (which, in turn, tend to require that translations be more closely tied to the original texts, so as to avoid the risk of infringing on such property rights by the intromission of a more explicit – and uncalled for – co-authorship); and, obviously, the translators’ move from the typewriter to the personal computer as a tool for writing, editing and proofreading, for terminological research, and for exploiting new strategies, including intersemiotic translation. If these factors are indeed relevant to the production of public sworn translators, one might reasonably expect to observe their reflections and refractions on the translated material, by comparing the translations produced in the mid-70's with those produced in the second half of the 90's. The preceding considerations point to other research possibilities beyond the realm of language studies as such. The close connection between public sworn translation and the political, institutional, 6 Occasionally, sworn translation clients request that the original text be set up together with the actual translation, in two parallel columns. Also, a significant volume of sworn translations involve standardized original texts (e.g. passports, driving licenses), in which the wording is basically the same, save only for the personal data of the actual bearer. Here, it will suffice to have access to one such original standardized text in each language in order to conduct direct observation of the translation procedures applied by the different translators. 7 In her doctoral dissertation, Sonia T. Gehring (1998), working with a different text and translation typology (social sciences), provides statistical evidence that translations from English into Brazilian Portuguese are in a sense more “literal” (or “foreignizing”), whilst equivalent texts translated from Brazilian Portuguese into English tend to be “freer” (or more “domesticated”). 8 Evidently, the same concern holds good for the opposite translational direction. It is interesting to observe that a US “county” is usually translated as “condado”, although the Brazilian institutional system has a reasonably close correspondent in “comarca”. And, for reasons unknown and which would bear further investigation, the Brazilian consulates abroad, when in the US or Canada, tend to identify “notary publics” as “notários” but, elsewhere (including the UK), as “tabeliaes”. 
60 economic and legal spheres suggests that material covering a full century will also bear marks of the historical processes that the target community of these translations has undergone: the initial republican regime, after the abolition of monarchy in 1889 and the subsequent dictatorships, alternating with civilian rule; the two world wars and Brazil's participation therein; the several waves of immigration, specially from Spain, Portugal, Italy, Lebanon, Japan and, more recently, from Korea; the budding industrialization of the country and the shift from a rural to an urban society, accelerated as from the mid-1950's; the ups and downs of the economy. At this point, the project branches out to a promising inter- and transdisciplinary co-operation with historians, sociologists and anthropologists. 4. Conclusion This paper has reported on the creation of a Multilingual Corpus of Sworn Translations (MCST) at the University of Sao Paulo, Brazil. The MCST will initially consist of sworn translations into Portuguese and out of English, German and Spanish, as well as translations out of Portuguese into these foreign languages, covering a 30-year period (1972-2002), extracted from the complete works of deceased and retired sworn translators in the state of Sao Paulo over a period of one hundred years (1902-2002). Due to the unique character of this material it is believed that even the “small” part currently under consideration will offer enough material for relevant studies in the areas of translational, lexical, syntactic, terminological, discursive and even historical and sociological research. References Aluísio S M, Pinheiro G M, Finger M, Nunes M G V, Tagnin S E O 2003 The Lácio-Web Project: overview and issues in Brazilian Portuguese corpora creation. In Proceedings of Corpus Linguistics 2003, Lancaster, UK. Aubert F H 1996 Translation Typology: the Case of 'Sworn Translations'. In Coulthard M, De Baubeta P A O (org) Theoretical Issues and Practical Cases in Portuguese-English Translations, Edwin Mellen Press. Gehring S T 1998 As modalidades de traduçao ingles/portugues: correlaçoes bidirecionais. Unpublished Doctoral Dissertation. University of Sao Paulo. Tagnin S E O (ed.) 2002a Lá do Canadá, Sao Paulo: Olavobrás. Tagnin S E O 2002b Corpora and the Innocent Translator: How can they help him. In Thelen M (ed.) Translation and Meaning, Part 6, Proceedings of the Lodz Session of the 3rd Maastricht-Lodz Duo Colloquium on “Translation on Meaning”, Lodz, Poland, September 22-24, 2000, Maastricht: Universitaire Pers Maastricht, 489-496. Tagnin S E O 2003a A multilingual learner corpus in Brazil. In the Proceedings of the Workshop on Learner Corpora at Corpus Linguistics 2003, Lancaster. Tagnin S E O 2003b Os Corpora: instrumentos de auto-ajuda para o tradutor. To appear in the special issue Translation and Corpora, Cadernos IX - 2002/1, University of Santa Catarina. Venuti L. 1995 The translator's invisibility. London, Routledge. Vinay J, Darbelnet J P 1958 Stylistique comparée du français et de l'anglais. Paris, Didier. 61 Constructing Corpora of South Asian Languages Paul Baker*, Andrew Hardie*, Tony McEnery*, and Sri B.D. Jayaram° * Department of Linguistics, Lancaster University ° Central Institute of Indian Languages, Mysore {j.p.baker, a.hardie, a.mcenery}@lancaster.ac.uk, jayaram@ciil.stpmy.soft.net Abstract The EMILLE Project (Enabling Minority Language Engineering) was established to construct a 67 million word corpus of South Asian languages. 
In addition, the project has had to address a number of issues related to establishing a language engineering (LE) environment for South Asian language processing, such as translating 8-bit language data into Unicode and producing a number of basic LE tools. This paper will focus on the corpus construction undertaken on the project and will outline the rationale behind data collection. In doing so a number of issues for South Asian corpus building will be highlighted. 1 Introduction The EMILLE project1 has three main goals: to build corpora of South Asian languages, to extend the GATE LE architecture2 and to develop basic LE tools. The architecture, tools and corpora should be of particular importance to the development of translation systems and translation tools. These systems and tools will, in turn, be of direct use to translators dealing with languages such as Bengali, Hindi and Punjabi both in the UK and internationally (McEnery, Baker and Burnard, 2000). This paper discusses progress made towards the first of these goals and considers to a lesser extent the third goal of the project. Readers interested in the second goal of the project are referred to Tablan et al (2002). 2 Development of the corpora This section describes our progress in collecting and annotating the different types of corpora covered by EMILLE. EMILLE was established with the goal of developing written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Punjabi, Sinhalese, Tamil and Urdu. In addition, for those languages with a UK community large enough to sustain spoken corpus collection (Bengali, Gujarati, Hindi, Punjabi and Urdu), the project aimed to produce spoken corpora of at least 500,000 words per language and 200,000 words of parallel corpus data for each language based on translations from English. At the outset we decided to produce our data in Unicode and annotate the data according to the Corpus Encoding Standard (CES) guidelines. As the project has developed, the initial goals of EMILLE have been refined. In the following subsections we describe the current state of the EMILLE corpora and outline the motives behind the various refinements that have been made to EMILLE's goals. 2.1 Monolingual written corpora The first major challenge facing any corpus builder is the identification of suitable sources of corpus data. Design criteria for large scale written corpora are of little use if no repositories of electronic text can be found with which to economically construct the corpus. This causes problems in corpus building for the languages of South Asia as the availability of electronic texts for these languages is limited. This availability does vary by language, but even at its best it cannot compare with the availability of electronic texts in English or other major European languages. 1 Funded by the UK EPSRC, project reference GR/N19106. The project commenced in July 2000 and is due to end in September 2003. 2 Funded by the UK EPSRC, project references GR/K25267 and GR/M31699. 71 We realised that much of the data which, in principle, we would have liked to include in the corpus existed in paper form only. On EMILLE, it would have been too expensive to pay typists to produce electronic versions of the 63 million words of monolingual written corpus (MWC) data. 
Even if the initial typing had been affordable, checking the data for errors would have added a further cost, particularly since tools for error correction, such as spell checkers, do not exist for many of the languages studied on EMILLE (Somers, 1998, McEnery and Ostler, 2000). Scanning in the text using an optical character recognition (OCR) program is a viable alternative to typing in printed text for languages printed in the Roman alphabet. However, OCR programs for South Asian scripts are still in their infancy (for an example of some early work see Pal and Chaudhuri, 1995) and were not considered stable and robust enough for this project to use gainfully.3 As part of a pilot project to EMILLE4, we ran a workshop that examined potential sources of electronic data for Indian languages. The workshop identified the Internet as one of the most likely sources of data5. This prediction proved accurate, and we have gathered our MWC corpus from the web on the basis of four, largely pragmatic, criteria: 1. Data should only be gathered from sources which agreed to the public distribution of the data gathered for research purposes; 2. Text must be machine readable: we could not afford to manually input tens of millions of words of corpus data; 3. Each web-site used should be able to yield significant quantities of data: to focus our efforts we excluded small and/or infrequently updated websites from our collection effort; 4. Text should be gathered in as few encoding formats as possible: as we map all data to Unicode, we wished to limit the development of mapping software needed to achieve this. While the first three criteria are somewhat easy to understand and have been discussed elsewhere (Baker et al, 2002) the fourth criteria merits some discussion. Ideally, we would have liked to include texts that already existed in Unicode format in our corpus. However, when we first started to collect data, we were unable to locate documents in the relevant languages in Unicode format6. We found that creators of such documents on the internet typically rely on five methods for publishing texts online: • They use online images, usually in GIF or JPEG format. Such texts would need to be keyed in again, making the data of no more use to us than a paper version; • They publish the text as a PDF file. Again, this made it almost impossible to acquire the original text in electronic format. We were sometimes able to acquire ASCII text from these documents, but were not able to access the fonts that had been used to render the South Asian scripts. Additionally, the formatting meant that words in texts would often appear in a jumbled order, especially when acquired from PDF documents that contained tables, graphics or two or more columns; • They use a specific piece of software in conjunction with a web browser. This was most common with Urdu texts, where a separate program, such as Urdu 98, is often used to handle the display of right-to-left text and the complex rendering of the nasta'liq style of Perso-Arabic script; • They use a single downloadable True Type (TTF) 8-bit font. While the text would still need to be converted into Unicode, this form of text was easily collected; • They use an embedded font. For reasons of security and user-convenience, some site-developers have started to use OpenType (eot) or TrueDoc (pfr) font technology with their web pages. As with PDF documents, these fonts no longer require users to download a font and save it to his or her PC. 
However, gaining access to the font is still necessary for conversion to Unicode. Yet gathering such fonts is difficult as they are often protected. We found that owners of websites that used 3 We wished to produce the corpora in the original scripts and hence avoided Romanised texts altogether. 4 This project, Minority Language Engineering (MILLE), was funded by the UK EPSRC (Grant number GR/L96400). 5 While we also considered publishers of books, religious texts, newspapers and magazines as a possible data source, the prevalence of old-fashioned hot-metal printing on the subcontinent made us realise early on that such sources were not likely providers of electronic data. Indeed, a number of publishers expressed an interest in helping us, but none could provide electronic versions of their texts. 6 To date, the only site we have found that uses Unicode for Indic languages is the BBC's; see for example www.bbc.co.uk/urdu or www.bbc.co.uk/hindi. 72 embedded fonts were typically unwilling to give those fonts up. Consequently using data from such sites proved to be virtually impossible. There are a number of possible reasons for the bewildering variety of formats and fonts needed to view South Asian scripts on the web. For example, many news companies who publish web pages in these scripts use in-house fonts or other unique rendering systems, possibly to protect their data from being used elsewhere, or sometimes to provide additional characters or logos that are not part of ISCII. However, the obvious explanation for the lack of Unicode data is that, to date, there have been few Unicode-compliant word-processors available. Similarly, until the advent of Windows 2000, operating systems capable of successfully rendering Unicode text in the relevant scripts were not in widespread use. Even where a producer of data had access to a Unicode word-processing/web-authoring system they would have been unwise to use it, as the readers on the web were unlikely to be using a web browser which could successfully read Unicode and render the scripts. Given the complexities of collecting this data, we chose to collect text from South Asian language websites that offered a single downloadable 8-bit TTF font. Unlike fonts that encode English, such as Times New Roman as opposed to Courier, fonts for South Asian languages are not merely repositories of a particular style of character rendering. They represent a range of incompatible glyph encodings. In different English fonts, the hexadecimal code 42 is always used to represent the character “B”. However, in various fonts which allow one to write in Devanagari script (used for Hindi among other languages), the hexadecimal code 42 could represent a number of possible characters and/or glyphs. While ISCII (Bureau of Indian Standards, 1991) has tried to impose a level of standardisation on 8-bit electronic encodings of Indian writing systems, almost all of the TTF 8-bit fonts have incompatible glyph encodings (McEnery and Ostler, 2000). ISCII is ignored by South Asian TTF font developers and is hence largely absent from the web. To complicate matters further, the various 8-bit encodings have different ways of rendering diacritics, conjunct forms and half-form characters. For example, the Hindi font used for the online newspaper Ranchi Express tends only to encode half-forms of Devanagari, and a full character is created by combining two of these forms together. 
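To make the conversion problem concrete, the sketch below shows the general shape of a per-font 8-bit-to-Unicode conversion routine. It is an illustration only, not the project's Unicodify code: the font table, byte values and the single reordering rule are invented, and real fonts of the most complex type described in section 2.1.2 below need considerably richer, font-specific rules. The Ranchi Express half-form behaviour is taken up again immediately after the sketch.

```python
# Minimal, hypothetical sketch of converting 8-bit font-encoded text to Unicode.
# Every byte value below is invented; a real font needs its own table, and the
# hardest fonts need font-specific, context-sensitive rules as well.

HYPOTHETICAL_FONT_TABLE = {
    0x42: "\u092A",        # one-to-one: byte 0x42 -> DEVANAGARI LETTER PA
    0x6B: "\u0915\u094D",  # one-to-many: a half-form glyph -> KA + VIRAMA
    0x88: "\u093F",        # vowel sign I: stored before the consonant in the font
}

PRE_BASE_SIGNS = {"\u093F"}  # glyphs that must be moved after the next consonant


def convert_to_unicode(data: bytes, table: dict) -> str:
    """Map each byte through the font table, reordering pre-base vowel signs
    from the font's graphical order into Unicode's logical order."""
    out, held = [], None
    for byte in data:
        mapped = table.get(byte, chr(byte))  # pass unknown bytes through untouched
        if mapped in PRE_BASE_SIGNS:
            held = mapped                    # hold the sign until the consonant arrives
            continue
        out.append(mapped)
        if held is not None:                 # emit the held sign after the consonant
            out.append(held)
            held = None
    return "".join(out)


# e.g. a half-form ka glyph followed by pa yields the conjunct in logical order
print(convert_to_unicode(bytes([0x6B, 0x42]), HYPOTHETICAL_FONT_TABLE))
```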
For example, to produce he (Unicode character U+092A) in this font, two keystrokes would need to be entered (h + e). However, other fonts use a single keystroke to produce he. We were mindful that for every additional source of data using a new encoding that we wished to include in our corpus, an additional conversion code page would have to be written in order to convert that corpus data to the Unicode standard. This issue, combined with the scarcity of electronic texts, meant that we didn't use as many sources of data as we would have initially liked. Thus we had to focus almost exclusively on newspaper material7. However, as noted in the following paragraph, as a consequence of the collaboration between Lancaster University and the Central Institute of Indian Languages (CIIL), the eventual corpus will now contain a wider range of genres. Web data gathered on the basis of these four criteria would have allowed us to fulfil our original MWC project goals. However, the MWC collection goals of the project have altered significantly. Thanks to a series of grants from the UK EPSRC8 the EMILLE project has been able to establish a dialogue with a number of centres of corpus building and language engineering research in South Asia. As a consequence, the EMILLE team has joined with the CIIL in Mysore, India, to produce a wider range of monolingual written corpora than originally envisaged on the EMILLE project. One effect of this change is that the uniform word counts of the monolingual written corpora will be lost.9 Each language will now be provided with varying amounts of data, though no language will be furnished with less than two million words. However, there is a further important effect of this collaboration: the corpus will now be able to cover a much wider range of languages (14 rather than 7) and a wider range of genres. By a process of serendipity, the corpus data being provided by CIIL covers a number of genres, but not newspaper material.10 As the material gathered at Lancaster focuses almost exclusively on 7 One important exception to this is the incorporation of the Sikh holy text, the Adi Granth or Guru Granth Sahib, into the Punjabi corpus. 8 Grants GR/M70735, GR/N28542 and GR/R42429/01. 9 This change was also necessitated by the varying availability of suitable newspaper websites for the different languages. For Hindi and Tamil, for example, plenty of data is available to be gathered; for Punjabi and Bengali, somewhat less; for Urdu, almost none. 10 The data provided by CIIL to the project covers a number of genres, including Ayurvedic medicine, novels and scientific writing. 73 newspapers, the CIIL and Lancaster data is complementary. Table 1 shows the state of the EMILLE/CIIL monolingual written corpora at present, and the revised target corpus size. Language Target word count (millions) Current word count (millions) Assamese 2.6 2.6 Bengali 9.0 5.5 Gujarati 10.6 10.6 Hindi 12.0 11.2 Kannada 2.2 2.2 Kashmiri 2.3 2.3 Malayalam 2.3 2.3 Marathi 2.2 2.2 Oriya 2.7 2.7 Punjabi 9.0 4.5 Sinhalese 9.0 6.0 Tamil 15.0 13.9 Telegu 4.0 4.0 Urdu 3.0 1.6 Total 85.9 72.1 Table 1: Word counts for each language in the EMILLE/CIIL Corpus as of January 2003 2.1.1 Encoding of the monolingual written corpora The decision had been made early on to use CES encoding for the EMILLE corpora. However, for CES documents there are a variety of different levels of conformance. 
Level 2 conformance requires the use of SGML entities to replace special characters such as emdashes, currency signs, and so on; however, since the corpus was to be encoded as 16-bit rather than 7-bit text this was not a matter of importance, as there is provision in the Unicode standard for all these “special” characters. Level 3 conformance involves adding tags for abbreviations, numbers, names and foreign words and phrases. Given the size of the corpus, this could only be practicable if accomplished automatically; however, given the large range of writing systems we were dealing with, a suitable algorithm would have been too time-consuming to implement. The corpus texts therefore extend only to level 1 CES conformance. For this it is necessary for documents to validate against the cesDoc DTD. The level 1 recommendation that italics and similar textual information that might conceivably indicate a linguistically relevant element be retained has not been implemented, again due to the unwieldiness of attempting to do so for the large number of scripts involved. In practice, this means that for monolingual written texts the main textual elements are

<p>, <head>, <list> and <item>. These elements could be deduced automatically from the HTML code of the original web pages from which the data was gathered. A full CES header has been used, however; see the Appendix to this paper for an example. 2.1.2 Mapping the corpus texts to Unicode The task of mapping the data in the monolingual written corpus is, as has been indicated above, a fairly difficult one. Whilst it is fairly simple to write a program that will map every character in a given font to one or more given Unicode characters, this basic algorithm will not handle the more problematic fonts. The formats we had to deal with fell into three broad groups. • Texts in Urdu or western Punjabi required one-to-one or one-to-many character mapping. This was due to the nature of the alphabet11 in which they were written, which does not contain conjunct consonants as the Indian alphabets do. • Texts in ISCII required one-to-one character mapping. These texts, primarily those from the data provided by CIIL, could be mapped very simply because the Unicode standard for Indian alphabets is actually based on an early version of the ISCII layout. • Texts in specially-designed TTF fonts as discussed above required the most complex mapping. They typically contain four types of characters. The first type needs to be mapped to a string of one or more Unicode characters as with ISCII and the Perso-Arabic script. The second type has two or more potential mappings, conditional on the surrounding characters. Some of these conditional mappings could be handled by generalised rules; others operated according to character-specific rules. The third type of characters required the insertion of one or more characters into the text stream prior to the point at which the character occurred12. The fourth type, conversely, required characters to be inserted into the text stream after the current point (in effect, into a Unicode stream which does not yet exist)13. In neither of the latter two types was it simply a case of going “one character forwards” or “one character back”; the insertion point is context-sensitive. The third type of text in particular could not be dealt with using simple mapping tables – each font required a unique conversion algorithm. The task of developing and coding these algorithms was split between a partner in India and the University of Lancaster. The “Unicodify” software suite developed at Lancaster is currently capable of mapping three fonts for three separate languages, and has been successfully used to encode the monolingual data published in the beta release of the corpus (see section 4.0 below). Unicodify is also capable of re-interpreting the HTML elements created by the “save as Web Page” function of Microsoft Office programs and mapping them to the CES elements of

, and , and of generating an appropriate header. At the time of writing, the collection phase for the EMILLE/CIIL MWC data is nearly complete. Only around 13 million words of data remain to be collected (as shown in Table 1 above). Good progress is also being made on mapping this data to Unicode. Consequently, the focus of the project is now falling increasingly on parallel and spoken data. 2.2 Parallel corpora The problems we encountered in collecting MWC data were also encountered when we started to collect parallel data. However, the relatively modest size of the parallel corpus we wished to collect (200,000 words in six languages) meant that we were able to contemplate paying typists to produce electronic versions of printed parallel texts. We eventually decided to do this as we had an excellent source of parallel texts which covered all of the languages we wished to look at: UK government advice leaflets. This was a good source of data for us, as we wished to collect data relevant to the translation of South Asian languages in UK in a genre that was term rich. The leaflets we were able to gather were mostly in PDF or print-only format. Typing these texts became a necessity when the UK government gave us permission to use the texts, but the company that produced the electronic versions of the texts refused to give us the electronic originals. We found it 11 We have come to refer to this alphabet as “Indo-Perso-Arabic”, although it is more widely known simply as the “Urdu alphabet”, or in the case of the various forms of western Punjabi that use it, “Shahmukhi”. This name is designed to capture the fact that the Perso-Arabic script as used for Indo-Aryan languages has certain shared features not found in Arabic, Persian, etc. – for instance, characters for retroflex consonants, or the use of the nasta'liq style of calligraphy. 12 This is primarily the case for those Indian alphabets which allow conjunct consonants whose first component is the letter “ra”. When this letter is the first half of a conjunct, it takes the form of a diacritic which appears after the second half of the conjunct. In Unicode, the text stream contains the logical order of the characters, but in the TTF fonts, the graphical order is almost always the order that is held in the computer's memory. 13 This is primarily the case for certain vowel diacritics which indicate vowels that follow the consonant but which appear before the consonant. Again, Unicode follows the logical order, whereas TTF fonts almost always follow the graphical order of the glyphs. 75 economic to pay typists to produce Unicode versions of the texts using Global Writer, a Unicode word-processor.14 The research value of the British government data is very high in our view. The UK government produces a large number of documents in a wide range of languages. All are focused in areas which are term-rich, e.g. personal/public health, social security and housing. To build the parallel corpus we collected 72 documents from the Departments of Health, Social Services, Education and Skills, and Transport, Local Government and the Regions.15 Other than the need to type the data from paper copies, the parallel corpus also presented one other significant challenge: while most of the data is translated into all of the languages we need, there are a few instances of a document not being available in one of the languages. Our solution is to employ translators to produce versions of the documents in the appropriate language. 
While far from ideal, this is not unprecedented as the English Norwegian Parallel Corpus project also commissioned translations (see Oksefjell, 1999). All such texts are identified as non-official translations in their header. The parallel corpus is now complete, and we are beginning the process of sentence aligning the texts using the algorithm of Piao (2000). 2.2.1 Annotation of the parallel corpora The annotation of the parallel texts was essentially the same as that applied to the monolingual texts. The principal difference was that the parallel texts had to be keyboarded manually by native speakers of the relevant languages, and thus rather than the automated insertion of SGML elements which characterises the monolingual written corpora, it was necessary to formulate guidelines that would allow transcribers with no knowledge of SGML to accurately mark up the text. In the event, the ultimate content of the transcription guidelines was dictated by certain problems relating to the recruitment of transcribers. Our initial strategy was to recruit typists from among the student body of Lancaster University, which includes native speakers of all the languages in question (Hindi-Urdu, Bengali, Punjabi, and Gujarati). However, we were only able to find two or three reliable transcribers in this way. The majority of students recruited were unable to commit themselves to working on the project for the necessary extended period. This problem was particularly acute because we had to recruit outside the Department of Linguistics; while there were potential academic benefits for Linguistics students which would make the task more worthwhile, this was not the case for students of other subjects. Therefore, we have formed an arrangement with a data-processing firm in India, Winfocus PVT, who have been able to take on the transcription of a significant proportion of the parallel corpus (as well as nearly all the spoken data – see also below). However, because Winfocus deployed a wide range of staff members on the project, it was necessary that the transcription guidelines be as simple as possible, as typing would otherwise be slowed to an unacceptable speed. The decision had already been made for the monolingual corpora to adhere to only the most basic level of CES-compliance (see above). This meant straightaway that the guidelines could be fairly simple. The SGML elements included in all versions of the guidelines were , ,

<list>16, and <item> – these were also used in the monolingual corpus. Unique to the parallel corpus was the <corr> tag for a transcriber's correction of a typographical error in the original printed text (the SGML of course retains the original “incorrect” version). The instructions for the <gap> element (replacing pictures and other non-transcribable graphical elements) have likewise always been part of the guidelines. It would seem that these instructions have 14 When the project began, Global Writer was one of the few word-processors able to handle the rendering of Indic languages in Unicode. Since then, Microsoft have made Word 2000 Unicode-compliant. However, unless running on a Windows 2000 machine the Unicode compliance of Word 2000 is not apparent. 15 We also collected a smaller number of texts from the Home Office, the Scottish Parliament, the Office of Fair Trading, and various local government bodies (e.g. Manchester City Council). 16 Lists of bullet-pointed items – which are very common in the UK government information leaflets – are encoded as separate <item> elements within a single <list>

, as are table cells. 76 been ignored by transcribers, however, who have just passed silently over the illustrations17. We intend to harmonise the use of the element in the data at the end of the project. In the initial version of the guidelines, instructions were included for an SGML encoding of footnotes using elements to anchor the notes and elements at the end of the file containing the footnote text. However, in practice transcribers seem to have ignored these guidelines as well, encoding footnotes using normal

and elements wherever they physically occurred in the text. Therefore and were excised from later versions of the guidelines, as were the instructions for the reference for bibliographic references – which in the event was never needed anyway. Initially, we also asked the transcribers to add details of title, publication date/place, etc. from the printed text to the header of the file as they typed it. However, due to regular inconsistencies and errors in the headers thus created, we later moved over to a system in which the header was added subsequently by a member of the EMILLE research team using information from a central database. Doing this drastically reduced the bulk of the transcribers’ guidelines, greatly facilitating the training of new transcribers. Similarly, it is a design feature of the EMILLE corpora that the filename of each text embodies the hierarchical structure of the corpus as a whole. For the parallel corpus, this means that each filename includes information on the language, medium and category of the text, as well as a descriptive name which is that text's unique identifier. For example, the Gujarati version of the text “The Health of the Nation and You”, an NHS information booklet published by the Department of Health, has the filename guj-w-health-nation.txt18. The guidelines initially gave step-by-step instructions for composing these filenames, but transcribers uniformly got it wrong, so this task too was excised from later versions of the guidelines. 2.3 Spoken corpora For the collection of spoken data we have pursued two strategies. Firstly we explored the possibility of following the BNC (British National Corpus) model of spoken corpus collection (see Crowdy, 1995). We piloted this approach by inviting members of South Asian minority communities in the UK to record their everyday conversations. In spite of the generous assistance of radio stations broadcasting to the South Asian community in the UK, notably BBC Radio Lancashire and the BBC Asian Network, the uptake on our offer was dismal. One local religious group taped some meetings conducted in Gujarati for us, and a small number of the people involved in transcription work on the project agreed to record conversations with their family and friends. The feedback from this trial was decisive – members of the South Asian minority communities in Britain were uneasy with having their everyday conversations included in a corpus, even when the data was fully anonymised. The trial ended with only 50,000 words of spoken Bengali and 40,000 words of Hindi collected in this way. Consequently we pursued our second strategy and decided to focus on Asian radio programmes broadcast in the UK on the BBC Asian Network as our main source of spoken data.19 The BBC Asian Network readily agreed to allow us to record their programmes and use them in our corpus. The five languages of the EMILLE spoken corpora (Bengali, Gujarati, Hindi-Urdu, and Punjabi) are all covered by programmes on the BBC Asian Network. At least four and a half hours in each language (and more in the case of Hindi-Urdu) are broadcast weekly. The programmes play Indian music (the lyrics of which have not been transcribed) as well as featuring news, reviews, interviews and phone-ins. As such the data allows a range of speakers to be represented in the corpus, and some minimal encoding of demographic features for speakers is often possible as at least the sex of the speaker on the programmes is apparent. 
The recordings of the radio programmes are currently being digitised and edited, to remove songs and other such material. The recordings will be made available in conjunction with the transcriptions. However, the transcriptions and recordings will not be time aligned. An obvious future enhancement of 17 It should however be noted that one transcriber went the opposite way, being extremely “trigger happy” with his use of the element and filling the text with non-requisite information which had to stripped later. 18 Spoken texts are similarly titled; for instance the file guj-s-cg-asiannet-02-11-23.txt is a context-governed spoken text containing a transcription of the BBC Asian Network Gujarati programme transmitted on the 23rd November 2002. 19 Programmes broadcast in Bengali and Urdu on BBC Radio Lancashire make up the remainder of the spoken corpus. 77 this corpus data would be to work on techniques, already well established for English, to time align the transcriptions. The recording and transcription of the broadcasts is ongoing and to date we have completed the transcription of 265,000 words of Bengali, 109,000 words of Gujarati, 41,000 words of Hindi, and 119,000 words of Urdu. 2.3.1 Annotation of the spoken corpus The transcription of the spoken texts was undertaken by the same group of typists who worked on the parallel corpus. In a similar way, the annotation guidelines, which began by embracing the full range of possible CES encoding elements, were of necessity simplified as the project progressed. As with the parallel transcriptions, the requirement to generate an appropriate header and filename were dropped early on. Information required for the header – on participants sex/age/profession, their relationships, and the setting of the conversation – was noted in plain English at the top of the file by the transcribers, and then converted to a proper SGML header later on. Within the text itself, the elements that were implemented were for utterance – which also records the speaker and a unique ID for each utterance – and also , , , , and , which relate to either the content of the speech or other noises on the tape. The codes for noting overlapping speech, the code, and the large set of codes to indicate changes in stress, seem to have been almost entirely neglected by the transcribers, so although they were not removed from the guidelines their importance was de-emphasised in later versions to avoid distracting attention from the greater importance of the , and elements. 3 Analytic annotation of the corpora We aimed from the outset to explore morphosyntactic annotation of the Urdu corpus. For a description of the work undertaken on this aspect of the project, see Hardie (2003). The corpus annotation research of EMILLE has recently expanded to cover another form of annotation – the annotation of demonstratives – in Hindi. The work on Hindi is at an early stage, with an annotation scheme originally designed to annotate demonstratives in English (Botley and McEnery, 2001) being used to annotate Hindi. The annotation is currently underway and the goal is to annotate the demonstratives in 100,000 words of Hindi news material by the end of the project. 4 Accessing the corpus A beta release of the EMILLE/CIIL corpus will be available, free of charge, for users from April 2003. The beta release of the corpus will contain a sample of MWC, parallel and spoken data for the core EMILLE languages. In order to register for access to the beta release, users should contact Andrew Hardie. 
5 Conclusion The EMILLE project has adapted and changed over the course of the past two years. With regard to the EMILLE corpora, this has in large part been due to the project team engaging in a dialogue with the growing community of researchers working on South Asian languages. As a result of this dialogue the EMILLE team has made some major changes to the original design of the EMILLE corpora. However, as with all large-scale corpus-building projects, other changes have occurred on the project which have been responses to unexpected factors, such as the reluctance of members of the minority communities to engage in the recording of everyday spontaneous speech, and the lack of compatible 8-bit font encoding standards used by the different producers of electronic texts in the relevant languages. Devising methodologies to convert the numerous disparate 8-bit based texts to Unicode has been one of the most complex and time-consuming tasks of the project. The area of South Asian corpus building is growing. As well as work in the UK and India, a new centre for South Asian language resources has been established in the US20. As the centres cooperate and integrate their research, there is little doubt that further work on the construction and annotation of South Asian corpora will grow. As this work grows, we believe that corpus builders should not loose 20 See http://ccat.sas.upenn.edu/~haroldfs/pedagog/salarc/overallplan.html 78 sight of two important truths. Firstly, that collaboration is better than competition – the corpus produced by Lancaster/CIIL will be larger and better because we have accepted this. The construction of large scale language resources needs the acceptance of this truth if it is to be effective. Secondly, that while many South Asian languages are entering the growing family of languages for which corpus data is available, there are still languages spoken in South Asia and the world for which corpus data is not available. While we must celebrate the creation of corpora of South Asian languages, we should also think of the work yet to be done in creating corpora for those languages not yet corpus enabled. References Baker, JP, Burnard, L, McEnery, AM and Wilson, A 1998 Techniques for the Evaluation of Language Corpora: a report from the front. In Proceedings of the First International Conference on Language Resources and Evaluation (LREC), Granada. Baker, JP, Hardie, A, McEnery, AM, Cunningham, H, and Gaizauskas, R 2002 EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation. In: González Rodíguez, M and Paz Suarez Araujo, C (eds.) Proceedings of 3rd Language Resources and Evaluation Conference(LREC). Las Palmas de Gran Canaria. Botley, S.P. and McEnery, A 2001 Demonstratives in English: a Corpus-Based Study. Journal of English Linguistics, 29: 7-33. Bureau of Indian Standards 1991 Indian Standard Code for Information Interchange, IS13194. Crowdy, S 1995 The BNC spoken corpus. In: Leech, G, Myers, G and Thomas, J (eds.), Spoken English on computer: transcription, mark-up and application. Longman: London. Hardie, A 2003 Developing a model for automated part-of-speech tagging in Urdu. Paper presented to CL2003 conference. McEnery, A, Baker, JP and Burnard, L 2000 Corpus Resources and Minority Language Engineering. In M. Gavrilidou, G. Carayannis, S. Markantontou, S. Piperidis and G. Stainhauoer (eds) Proceedings of the Second International Conference on Language Resources and Evaluation (LREC): Athens. 
McEnery, AM and Ostler, N 2000 A New Agenda for Corpus Linguistics – Working With All of the World's Languages. In Literary and Linguistic Computing, 15: 401-418. Oksefjell, S 1999 A Description of the English-Norwegian Parallel Corpus: Compilation and Further Developments. In International Journal of Corpus Linguistics, 4:197-219. Pal, U and Chaudhuri, BB (1995) Computer recognition of printed Bengali script. In International Journal of System Science, 26: 2107-2123. Piao, SS 2000 Sentence and Word Alignment Between Chinese and English. Ph.D. thesis, Department. of Linguistics and Modern English Language, Lancaster University, UK. Somers, H 1998 Language Resources and Minority Languages. Language Today 5. Tablan, V, Ursu, C, Bontcheva, K, Cunningham, H, Maynard, D, Hamza, O and Leisher, M 2002 A Unicode-based Environment for Creation and Use of Language Resources. In: González Rodíguez, M and Paz Suarez Araujo, C (eds.) Proceedings of 3rd Language Resources and Evaluation Conference(LREC). Las Palmas de Gran Canaria. 79 Appendix Example of the CES header for a monolingual corpus text. Note that throughout the corpus the date format yy-mm-dd is employed. guj-w-samachar-news-01-05-23.txt Electronic file created by Department of Linguistics, Lancaster University text collected by Andrew Hardie transferred into Unicode by "Unicodify" software by Andrew Hardie UCREL Department of Linguistics, Lancaster University, Lancaster, LA1 4YT, UK 02-12-18 "Gujarat Samachar" internet news (www.gujaratsamachar.com), news stories collected on 01-05-23 Gujarat Samachar Gujarat, India Gujarat Samachar 01-05-23 Text collected for use in the EMILLE project. Simple written text only has been transcribed. Diagrams, pictures and tables have been omitted and their place marked with a gap element. 02-12-18 Gujarati Universal Multiple-Octet Coded Character Set (UCS). print 80 The “new” quotatives in American English: A cross-register comparison Federica Barbieri Northern Arizona University Department of English Liberal Arts Building Flagstaff, AZ 86011-6032 USA Tel.: (928) 523 6845 Fax: (928) 523 7074 E-mail: Federica.Barbieri@NAU.EDU In the past decade, a growing number of studies has been concerned with the use of like as direct speech introducer, mostly in American English (Blyth et al. 1990, Romaine & Lange 1991, Ferrara & Bell 1995), but also in other varieties of English, such as Scottish English (Macaulay 2001), British and Canadian English (Tagliamonte & Hudson 1999), and African American Vernacular English (Cukor-Avila 2002). While they provide important insights in the use of this new quotative, as well as cumulative evidence of its expansion, these studies are based on small samples of conversational narrative, and their findings are thus hardly generalisable. In addition, very little attention has been devoted to other new direct quotation introducers, such as be all and go. To date, there are indeed no studies of the use of like and other new quotatives, in principled, diverse corpora of spoken English. This corpus-based study investigates the frequency of use and the discourse-pragmatic function of the innovative quotatives be like, go, be all, compared with the traditional quotative say, in contemporary spoken American English. The analysis is based on four corpora of spoken American English representing different registers of spoken interaction: casual conversation, campus-related service encounters, academic office hours consultations, and students’ study groups. 
The study includes quantitative analyses of frequency of occurrence of the quotatives, frequency in association with grammatical person, and comparison of frequency across corpora; and qualitative analysis of the discourse-pragmatic function of the quotatives. Simple present and simple past tense forms of the quotatives were analysed quantitatively and qualitatively in four small corpora of approximately 500,000 words. The findings showed that type and frequency of occurrence of the quotatives vary across different registers of spoken language, thus suggesting that the use of direct quotation and the way it is introduced are sensitive to context and level of formality. The new quotatives be like and go were found to be frequent quotatives which successfully compete with the traditional say in casual registers such as conversation and service encounters. However, these quotatives are infrequent in consultative registers such as office hours. The new quotative be all was found to be rather infrequent in all registers. The study also revealed that there are clear patterns of association between frequency of occurrence of the quotatives and grammatical person, as well as between grammatical person and discourse-pragmatic function of the quotative. 81 A preliminary analysis of collocational differences in monolingual comparable corpora Marco Baroni and Silvia Bernardini, University of Bologna at Forli 1. Introduction The notion of collocation has enjoyed mixed fortunes in the 50 odd years of its existence. Claimed to be obscure (Lyons, 1977: 612), counterproductive (Langendoen 1968: 63ff) and generally useless (Lehrer 1974)1 by its detractors, the idea that part of the meaning of a word is somehow related to its “word accompaniment, the other word material in which [it is] most commonly or most characteristically embedded” has also had its supporters. These have suggested that words have no meaning out of context, and meaning itself is not contained anywhere, but rather dispersed as the “light of mixed wave-lengths into a spectrum” (Firth 1957(1951): 192). The intuitive appeal of this view is evident if one thinks of the difficulty a compositional approach to meaning has (even allowing for subcategorisation frames) in explaining the patterned quality of language performance, as found in a corpus, and ultimately the speaker's or writer's effortless routine handling of co(n)textual restrictions. The hypothesis that “everything we say may be in some degree idiomatic – that […] there are affinities among words that continue to reflect the attachments the words had when we learned them, within larger groups” (Bolinger, 1976:102) provides a powerful argument in favour of the empirical study of collocations, with implications for theoretical, descriptive and applied branches of linguistics. In recent years, notwithstanding the vagueness of the notion and consequent methodological problems in investigating it empirically, the study of collocations has indeed defied difficulties and criticism and sparked renewed interest in a number of areas ranging from computational and corpus linguistics to lexicography, language pedagogy, and crucially for our purposes, translation studies. The hypotheses that “everything we say may be in some degree idiomatic” (Bolinger, above), and that “actual usage plays a very minor role in one's consciousness of language” (Sinclair 1991:39) raise a number of interesting questions for translation research. 
Is there any evidence that translators be aware of collocational restrictions in the source and target languages? Do they show sensitivity to phraseological (a)typicality and restrictedness? These are very complex issues, that can hardly be resolved in one fell swoop. For a start, theoretical as well as methodological problems remain as to what collocations are in the first place,2 and how best they can be retrieved from corpora and compared (see e.g. Krenn 2000a). Secondly, different types of corpora for the study of translation exist, providing different perspectives on the translation process. In this paper, we limit our investigation to monolingual comparable corpora (MCC) and present a number of attempts at selecting and comparing collocations across original and translated texts. This study is novel in at least two ways: to the best of our knowledge, no previous investigation of the behaviour of translators through MCC has focused on collocational restrictions, and no study of collocational restrictions in translated texts has attempted to select candidate bigrams automatically. The paper is organised as follows. Section 2 provides a general background on monolingual comparable corpora and collocation extraction. In section 3 we present a brief study conducted on a small corpus of EU reports in translated and original English. Section 4 describes a more substantial study of a corpus of translated and original world affairs articles in Italian. In section 5 we briefly discuss directions for further work. 2. Background 2.1 Monolingual comparable corpora in translation research Monolingual comparable corpora are collections of original and translated texts in the same language, assembled according to comparability criteria such as “a similar domain, variety of language and time span […]” (Baker 1995: 234). According to Frawley (1984: 168-169) translation […] is essentially a third code which arises out of the bilateral consideration of the matrix and target codes […] since [it] has a dual lineage, it emerges as a code in its own right, setting its own standards and structural presuppositions and entailments […]. 1 “The main criticism against the lexical approach to co-occurrence is that it does not explain anything. The lexical item is found to collocate with a second item and not with a third, but no explanation is given. Collocations and sets are treated as if combinatorial processes of a language were arbitrary” (Lehrer 1974: 176). 2 A recent book on corpus-based lexical semantics gives the following rather general definition of collocation: “‘collocation’ is frequent co-occurrence” (Stubbs 2001: 29). 82 Following Baker (e.g. 1995), work on MCC has adopted a corpus-based research methodology to unveil universal features of translation seen not as an individual act of interlinguistic transfer, but as a mediated communicative event, with its own “third code”. 3 Thus, rather than focusing on differences between single originals and their translations (i.e. parallel texts), MCC allow the analyst to compare collections of originals and translations in the same language, unrelated to each other but chosen so as to be broadly comparable. There is no doubt that MCC provide an innovative research environment in which translation norms (Toury 1995) and strategies (Löscher 1991) can be observed against the backdrop of target language use. Yet they raise substantial methodological problems that risk invalidating the results obtained. 
The main problem one is confronted with relates to the comparability of the corpus components. According to one of its compilers (Laviosa 1998a), the two components of the English Comparable Corpus (ECC) are comparable with regard to the relative proportion of biography and fiction (i.e. genre), time span, distribution of female and male authors, distribution of single and team authorship, and overall size of each component. Furthermore, continues Laviosa, “the target audience of both collections can be characterised as literate, intellectual adults of both sexes”. The compilers have clearly thought out their design criteria. Yet doubts about the comparability of the two corpora remain. The criteria just mentioned would appear to be derived from monolingual corpus building heuristics. Yet your average translated novel, say, has a much more complex history than your average original novel. First, it derives directly from an existing text (its source text). Disregarding the latter may have somewhat worrying consequences: for example, the source text could have been written much earlier than its translation. In this case, it would not seem unlikely for the latter to display features typical of a diachronically different state of the target language (for further discussion and a concrete example see Bernardini and Conrad 2002). In more general terms, translation involves two cultures-languages, at times distant from each other, and a set of decisions – often involving substantial investments of time and funds - that have to be taken in order for a work to migrate between the two. It has been suggested that these migrations are subject to socio-cultural norms (Toury 1995). For instance, not everything that is published in the Netherlands is translated into English, far from it; instead, Dutch fiction is chosen for translation either in the function of assumed target taste or in that of the status the work has acquired at the source pole, often as a combination of the two. (Vanderauwera 1985: 132) Ignoring such socio-cultural factors when setting up a monolingual comparable corpus may result in reduced comparability and doubtful interpretations of the data obtained. Given the complexity of the translation situation – we have only scratched the surface of the problem here, see Bernardini and Zanettin (forthcoming) for a more detailed discussion –, and the fact that corpus comparability is in itself an untrivial concept (Kilgarriff 2001), it would seem wise to give serious thought to the composition of a MCC, at least at these early stages, and to refrain from the assumption of comparability based on crude situational criteria inherited from monolingual research. In an attempt to limit the proliferation of variables, and consequent difficulty in interpreting results, the corpora on which the experiments we present are based were selected so as to be maximally comparable. The first corpus (EU) is a collection of official reports submitted by different EU countries to the EU Commission, describing progress made in the implementation of European guidelines in the area of employment policies.4 Two versions are typically available, the original in the country's language, and a translation, normally into English. Though the originals were not included in the corpus, they were consulted to make sure that the English texts were indeed translations, not independent texts nor source texts for the pair. 
From these texts we constructed a small corpus of originals from Ireland and the United Kingdom and translations from Finland, Italy, Portugal and Sweden.5 The corpus contains 72,966 words in the original section and 145,932 words in the translation section. 3 Among the universals proposed are simplification, explicitation/explicitness, normalization, levelling out, disambiguation and standardization (e.g. Baker 1995, 1996; Schmied and Schäffler 1996, Laviosa 1998a, 1998b; Olohan and Baker 2000; Olohan 2001). 4 The reports are freely available on the Web: http://europa.eu.int/comm/employment_social/news/2001/may/naps2001_en.html 5 The direction of translation was confirmed by bilingual speakers of English and each of the four languages in question. 83 This corpus has its advantages and its disadvantages. On the one hand, translation is carried out into international English, thus limiting the effect of preliminary norms deciding what to translate when (Toury 1995, above). The texts are very homogeneous, in terms of topic, date, register etc. Copyright clearance and text preparation are straightforward, and this is no trivial advantage in a exploratory study like this one, in which the possibility that the material be inadequate for the purposes of the research has to be investigated empirically. On the other hand, these are rather boring texts, translated into a highly conventionalised variety of English – sometimes called EUese; thus, doubts may be raised about the generalisability of any results to different translation settings. Furthermore, the corpus is rather small. The second corpus (LIMES) is a slightly less homogeneous but far larger collection of articles published in the Italian Quarterly Limes - Rivista italiana di geopolitica [Italian geopolitics journal]. 6The complete collection of articles published between 1993 and 1999 is included. The total size of the corpus is approximately 3 million words, with translations accounting for slightly less than one third of the total. These have been carried out by approximately 40 different translators, about half male and half female. The total number of authors exceeds 700. Each volume is centred around a theme (“War in Europe”; “Divisions within Islam”, “What use is NATO?”, “The global bomb”), and normally contains both originals and translations, thus ensuring thematic consistency across the original and translated components. Judging from the topic matter and names of authors and translators, it would appear that source languages for these include, among others, Albanian, Arabic, Chinese, English, French, German, Serbo-Croatian, Russian. These characteristics of homogeneity (originals and translations address the same audience, deal with virtually the same topics, conform to the same editorial policies) would appear to guarantee an acceptable level of comparability between the subcorpora, such as is rare in MCC. The likely variety of source languages should limit specific source language effects. Similarly, the variety of authors, translators and topics covered should limit the effect of idiosyncracies. 2.2 Collocational restrictions and translation In order to compare the level of patterned-ness in translated versus original language, it is first of all necessary to retrieve candidate patterns. A common approach to this problem has been the structural one: patterns are defined in terms of sequences of parts of speech, which are then searched either manually, or automatically with the help of a tagged corpus. 
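By way of illustration only (none of the studies cited below used this code), a crude version of such a structural search might look like the following sketch; the tag labels and the corpus format (sentences as lists of word and tag pairs) are assumptions made for the example.

```python
# Illustrative sketch of the "structural" approach: scan a POS-tagged corpus for
# word pairs matching a syntactic template (here adjective + noun).
# Tagset names and corpus format are assumptions, not a prescription.
from collections import Counter


def adjective_noun_candidates(tagged_sentences):
    """tagged_sentences: iterable of sentences, each a list of (word, tag) pairs.
    Returns a Counter of (adjective, noun) bigrams found in the corpus."""
    counts = Counter()
    for sentence in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if t1.startswith("ADJ") and t2.startswith("NOUN"):
                counts[(w1.lower(), w2.lower())] += 1
    return counts


corpus = [
    [("strong", "ADJ"), ("tea", "NOUN"), ("is", "VERB"), ("cheap", "ADJ")],
    [("strong", "ADJ"), ("tea", "NOUN"), ("and", "CONJ"), ("powerful", "ADJ"), ("cars", "NOUN")],
]
for pair, freq in adjective_noun_candidates(corpus).most_common():
    print(pair, freq)
```

A candidate list of this kind would then be inspected manually or ranked further, for instance with the association measures discussed in 4.2 below.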
For example, Gitsaki (1996) adopts this approach in her study of collocations in ESL student written production, deriving candidate structures from a dictionary (Benson et al 1986). Krenn and Evert (see e.g. Krenn 2000b, Krenn and Evert 2001) adopt statistical measures to rank potential collocations that match certain syntactic templates, on the basis of a tagged and partially parsed corpus. Heid (1996: 121) describes “discovery procedures for collocations […] based on a detailed description of the targeted collocations”. The starting point in this case is a classification of lexical functions as defined by Mel'čuk (e.g. 1996).

Within TS, two recent works have attempted to analyse collocations in original vs translated language. They both rely on bilingual (parallel) corpora, adopting a different viewpoint from the one adopted here. Yet the method used for retrieving relevant units of analysis is equally relevant. Kenny's (2001) study of “sanitisation” in translation adopts various techniques for spotting creativity in originals before checking how it is rendered in translation. One “node” is identified – the German word Auge [eye] – as an impressionistically viable choice, i.e. a word that is frequent enough, enters into fixed expressions, and has been found in previous studies to be the object of creative manipulation by other writers. Concordances, tables of collocates and lists of clusters are then retrieved for this word. This method is appropriate for investigating whether translators tend to normalise creative collocations, but would appear to be hardly adaptable to the aims of the present study.

More relevant to our concerns is Danielsson's (2001) attempt at designing an automatic process whereby UMs (units of meaning) for English and Swedish can be identified in a parallel corpus and compared. She takes a sample of about 200 “interesting” words occurring 200 times or more in her corpora, and automatically works through their downward and upward collocates (Sinclair 1991). This method yields units of varying length containing some of the most frequent words in the corpora. A third method, the longest-linear method, is used to retrieve “units of structure” of the type “the X of the new Y” (ibid: 149). What would appear to be left out by the joint application of these methods is the potentially large set of collocations containing less frequent words, whose significance is not necessarily small.

The collocation extraction methods discussed in this section, all relatively knowledge-intensive, have led to cleaner, easier-to-interpret results than the ones we report below. It might be argued, however, that they place strong interpretative grids over the data (e.g. through intuitive classifications, POS tagging and so forth) that are better avoided at these early stages of research. We prefer the simple knowledge-free method we describe below (4.2) since we do not know, a priori, which collocational structures are most typical of the original and translated languages we are studying, and thus we would not want to bias the results in the direction of a specific subset of collocational templates. Indeed, as we will briefly discuss in 4.5, it appears that some interesting differences between translated and original texts concern frequent syntactic structures that would not traditionally be considered collocations.

6 We would like to thank Limes for granting permission to use their CD-ROM for this research.
No less important, we are interested to see whether it is possible to obtain meaningful results with a method that would be applicable to any language or sub-language, independently of the NLP resources available. 3. A pilot study with the EU corpus We conducted a pilot study with the EU data in which, following a method similar to those proposed in Kilgariff (2001), we verified that the lists of collocations extracted from subcorpora constructed from parts of the same corpus are more strongly correlated than those extracted from different corpora. This result was obtained with all the collocation extraction and scoring methods we describe below. Thus, the pilot experiment indicates that our measures are sensitive to systematic similarities among corpora more than to the random similarities and differences that we expect to exist in any set of textual data. However, a more thorough investigation of the EU data along the lines of the one we report below for the LIMES data failed to reveal systematic differences between original and translated documents. While it is tempting to attribute this failure to the “scripted” nature of the EU report genre, that would tend to reduce the differences between originals and translations, we feel that, because of the small size of the corpus, it is premature to draw any conclusions from this failure. We plan to collect a larger corpus of EU reports, in order to obtain more robust results. 4. Analysis of the LIMES corpus 4.1 Corpus pre-processing The text extracted from the LIMES database was split into a corpus of articles originally written in Italian and a corpus of articles translated from other languages into Italian using an automated procedure. Interviews and roundtables were discarded, since we were not sure about their status. The output of the automated procedure was checked and corrected by hand. We also performed other minor clean-ups in a semi-automated fashion. The original text corpus (O) and the translated text corpus (T) were tokenized in an extremely rudimentary way, removing all non-alphabetic symbols except the apostrophe, that, following the conventions of Italian orthography, was tokenized as part of the preceding word. After tokenization, the O corpus contained 2,132,060 words and the T corpus contained 895,820 words. O and T were further subdivided into subcorpora as described in sections 4.3 and 4.4. 4.2 Collocation extraction and ranking In order to find collocations, we first collected for each subcorpus of interest (see discussion in 4.3 and 4.4 below) candidate bigrams that had the following characteristics: 1) they were made of words that occurred at least twice in all the subcorpora to be compared; 2) they occurred at least 3 times in the relevant subcorpus. The rationale for the first of these conditions is that we are not interested in differences in collocations that are due to differences in the topics covered by the various articles. We expect that words that are relatively frequent in all the subcorpora being compared are words that are not strongly linked to any particular topic. The second condition guarantees that we have a list of manageable size, and it is unlikely to exclude any “true” collocation, since collocations are, by definition, frequent ngrams. 7 We used three association measures to rank the lists of bigrams: raw frequency, (point-wise) mutual information (Church and Hanks 1990) and (-2*) log-likelihood ratio (Dunning 1993). 
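For concreteness, the extraction and scoring steps described in 4.1 and 4.2 could be implemented roughly as in the following sketch (the function names, the character class in the tokenizer and the default thresholds are illustrative assumptions, not a record of the scripts actually used; the two non-trivial association measures follow their standard formulations):

import math
import re
from collections import Counter

def tokenize(text):
    # Rudimentary tokenization as in 4.1: keep alphabetic material only and
    # leave the apostrophe attached to the preceding word (e.g. dell'isola
    # becomes "dell'" + "isola"). The character class is an approximation.
    return re.findall(r"[a-zàèéìòù]+'?", text.lower())

def candidate_bigrams(subcorpus, all_subcorpora, min_word=2, min_bigram=3):
    # Condition 1: both words occur at least min_word times in every subcorpus
    # to be compared; condition 2: the bigram occurs at least min_bigram times
    # in the subcorpus at hand.
    word_counts = [Counter(toks) for toks in all_subcorpora]
    def frequent_everywhere(w):
        return all(c[w] >= min_word for c in word_counts)
    bigram_counts = Counter(zip(subcorpus, subcorpus[1:]))
    return {bg: n for bg, n in bigram_counts.items()
            if n >= min_bigram and all(frequent_everywhere(w) for w in bg)}

def pmi(c12, c1, c2, n):
    # Pointwise mutual information (Church and Hanks 1990): log2 of the ratio
    # between the observed bigram probability and the probability expected
    # under independence.
    return math.log2(c12 * n / (c1 * c2))

def _binom_ll(k, n, x):
    # Binomial log likelihood, with 0*log(0) taken to be 0.
    ll = 0.0
    if k > 0:
        ll += k * math.log(x)
    if n - k > 0:
        ll += (n - k) * math.log(1 - x)
    return ll

def log_likelihood_ratio(c12, c1, c2, n):
    # Dunning's -2*log(lambda) for a bigram (w1, w2), where c1 and c2 are the
    # unigram counts, c12 the bigram count and n the subcorpus size.
    p, p1, p2 = c2 / n, c12 / c1, (c2 - c12) / (n - c1)
    return 2 * (_binom_ll(c12, c1, p1) + _binom_ll(c2 - c12, n - c1, p2)
                - _binom_ll(c12, c1, p) - _binom_ll(c2 - c12, n - c1, p))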
Unlike raw frequency, the other two measures take the unigram frequency of the words composing the bigram into account, and favor word combinations whose frequency is higher than what we would expect under assumptions of independence. Mutual information and log-likelihood ratio are calculated using the formulas given in Manning and Schütze (1999, ch. 5). For discussion of these and other measures of collocativity see also Evert (2001). We chose not to pick one specific measure as the “right” one given the exploratory nature of this study. Moreover, recent work (e.g. Inkpen and Hirst 2002, Baroni et al. 2002) suggests that measures such as mutual information and log-likelihood ratio should be used in combination, as they tend to discover different types of related words. As a consequence of the collocation extraction and evaluation methods we used, the results reported below are based on a rather generous and vague notion of what counts as a collocation: essentially, any pair of adjacent words that has a high frequency, and/or a higher frequency than what we would expect by chance, is treated as a collocation. Furthermore, by working with lists of ranked bigrams, we are implicitly assuming that collocativity is gradient, rather than binary.

7 We ran some preliminary experiments in which the bigram collection procedure ignored words from an automatically constructed list of likely function words. We are still in the process of analyzing the results obtained in this way, but see note 10 below for some short remarks on them.

4.3 Comparing the number of collocations in translated and original text

The first question we were interested in answering was the following: Do translators have a greater tendency to use fixed expressions than original authors? In principle, there could be an effect in either direction. On the one hand, translators could have a tendency to use a simplified language characterized, among other things, by the frequent repetition of the same expressions (a possible effect of the tendency towards explicitness, on which see Schmied and Schäffler 1996). On the other, faithfulness to the source language text, coupled with the fact that many fixed expressions are often not translatable from one language to another, could lead translators to use fewer collocations than the creators of original texts.

In order to study this issue, we split the T corpus into 5 subcorpora containing 179,164 words each, and we randomly selected 5 chunks of 179,164 words from the larger O corpus. From each of the 10 subcorpora created in this way, we extracted candidate bigrams and computed frequency, mutual information and log-likelihood as described above. First of all, we compared the number of candidate bigrams in the T-subcorpora to the number of candidate bigrams in the O-subcorpora. For the T-subcorpora, the average number of bigrams is 8,094.2 (median: 8,149); for the O-subcorpora the average is 8,044.8 (median: 8,128). According to the results of a two-tailed Mann-Whitney U test (see Siegel 1956, ch. 6), the difference between the two sets is not significant at the α = 0.05 level. In the subsequent analyses, rather than considering simply the number of bigrams extracted from each subcorpus, we looked at the association scores that were assigned to these bigrams. In particular, for each measure m and for each cutoff point c from a set of cutoff points across the distribution of m, we computed the percentage of bigrams in each subcorpus that had an m-score equal to or greater than c.
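The cutoff-based comparison just described can be sketched along the following lines (assuming scipy for the Mann-Whitney U test; the helper names and the five-versus-five set-up simply mirror the description above):

from scipy.stats import mannwhitneyu

def pct_at_or_above(scores, cutoff):
    # Percentage of a subcorpus' bigrams whose association score is equal to
    # or greater than the cutoff.
    return 100.0 * sum(s >= cutoff for s in scores) / len(scores)

def compare_subcorpus_sets(t_scores, o_scores, cutoff):
    # One percentage per T-subcorpus and per O-subcorpus (five values each),
    # then a two-tailed Mann-Whitney U test on the two sets of percentages.
    t_pct = [pct_at_or_above(s, cutoff) for s in t_scores]
    o_pct = [pct_at_or_above(s, cutoff) for s in o_scores]
    u, p = mannwhitneyu(t_pct, o_pct, alternative="two-sided")
    return t_pct, o_pct, p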
We then compared the percentage of bigrams at or above the various cutoff points in the T- vs O-subcorpora. In Table 1, we report the results of this type of analysis performed at 3 cutoff points for each of the measures (for frequency and log-likelihood ratio, the cutoffs are expressed as logarithms of the actual values). We chose this particular set of cutoff values since they seem to represent well the range of patterns encountered. For each measure and representative cutoff point, the table reports the average percentage of bigrams with scores at least as high as the cutoff point in the T- and O-subcorpora, the medians, and whether a two-tailed Mann-Whitney test comparing the T- and O-subcorpus sets with respect to the relevant percentages was significant at the α = 0.05 level.

measure  cutoff  T-avg   O-avg   T-med   O-med   MW test
fq       2       25.8    24.87   25.84   24.73   significant
fq       3       6.45    5.8     6.42    5.73    significant
fq       4       1.43    1.09    1.39    1.05    significant
mi       5       25.53   25.07   25.55   25.09   significant
mi       8       4.58    4.75    4.56    4.70    not sig
mi       10      1.05    1.15    1.04    1.16    not sig
llr      3       41.39   40.07   41.56   39.92   significant
llr      4       12.42   11.65   12.35   11.36   significant
llr      5       3.11    2.69    3.02    2.83    significant

Table 1 Proportion of bigrams ≥ cutoff value

As the table shows, there is a small but clear tendency for the translated texts to contain a larger number of bigrams with stronger association scores (this is also in line with the results on the absolute number of bigrams we presented above). The only data that do not go in this direction are those for the “middle” and “high” mutual information cutoffs. It is interesting that these are also the only levels at which the difference between the groups is not statistically significant, i.e., what we have here is not a reversal of the effect, but a lack of significant effects of the translated/original distinction. An informal comparison of the top bigrams according to mutual information with the top bigrams according to frequency and log-likelihood ratio suggests that the latter two measures (the first more than the second) tend to pick up bigrams where at least one component is a function word, whereas mutual information tends to pick up bigrams that are closer to our intuitive idea of what a collocation is (frequent/lexicalized N+Adj or V+N structures). Thus, the difference between translated and original texts detected with frequency, log-likelihood ratio and the lowest mutual information cutoff seems to be due to frequent bigrams that would not normally be treated as collocations. We will come back to this topic in 4.5 below.

4.4 Collocation overlap among translated and original texts

The data reported above provide some (weak) evidence that there are systematic differences between translated and original texts in terms of collocational patterns, but they do not tell us whether such differences are due to a general tendency for translators to use more fixed expressions, or whether there are specific fixed expressions that tend to be favored by translators (or by original writers). In order to test this second possibility, we conducted another set of experiments in which we measured the degree of overlap and correlation among the collocations found in original and translated texts. This time, we split the T corpus into 10 subcorpora containing 89,582 words each, and we randomly selected 10 chunks of 89,582 words from the larger O corpus.
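The splitting and sampling of subcorpora used here and in 4.3 can be sketched as follows (the random seed and the helper names are illustrative assumptions):

import random

def split_into_chunks(tokens, chunk_size):
    # Contiguous, non-overlapping chunks of chunk_size words; a remainder
    # shorter than chunk_size is discarded.
    return [tokens[i:i + chunk_size]
            for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]

def sample_chunks(tokens, n_chunks, chunk_size, seed=0):
    # Randomly select n_chunks of the available non-overlapping chunks from a
    # larger corpus (here: the O corpus).
    chunks = split_into_chunks(tokens, chunk_size)
    return random.Random(seed).sample(chunks, n_chunks)

# e.g. t_subcorpora = split_into_chunks(t_tokens, 89582)
#      o_subcorpora = sample_chunks(o_tokens, 10, 89582)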
We then merged 5 randomly selected T-subcorpora into a 447,910 word “reference” T corpus and 5 randomly selected O-subcorpora into a 447,910 word “reference” O corpus. The idea, then, was to compare the bigrams found in the unmerged T- and O-subcorpora to the bigrams found in reference T and reference O. If there is a tendency to use similar bigrams in texts of the same type, we should find that the bigrams in the T-subcorpora tend to be closer to those in reference T than to those in reference O, and/or that the bigrams in the O-subcorpora tend to be closer to those in reference O than to those in reference T.8

First of all, we looked at the number of candidate bigrams (in the sense of 4.2 above) that the T-subcorpora and the O-subcorpora shared with the reference corpora. To control for the effect of the absolute size of the bigram lists we computed the percentage of shared bigrams over the total number of distinct bigrams in the two lists being compared. The average percentage of bigrams shared by the T-subcorpora with reference T was 21.28 (median: 21.36); the average percentage shared by the T-subcorpora with reference O was 21.32 (median: 21.11). The average percentage of bigrams shared by the O-subcorpora with reference O was 21.16 (median: 21.20); the average percentage of bigrams shared by the O-subcorpora with reference T was 20.23 (median: 20.25). These data suggest that there is no strong trend in either direction as far as simple overlap of the candidate bigram lists goes. This was confirmed by the statistical analysis. We ran Wilcoxon two-tailed matched-pairs signed-rank tests (Siegel 1956, ch. 5) comparing the T-subcorpora percentage overlap with reference T vs their overlap with reference O, and comparing the O-subcorpora percentage overlap with each of the reference corpora. Neither test gave significant results at α = .05.

We then computed, for each of the association measures, the Spearman rank correlation coefficients (Siegel 1956, ch. 9) between each T- or O-subcorpus and the reference corpora. The correlations were computed by considering only those bigrams that occurred both in the list extracted from the relevant subcorpus and in the list extracted from the relevant reference corpus.9

8 We use the reference corpus strategy rather than directly comparing all subcorpora to each other since the latter strategy would yield results that are difficult to interpret, as the samples would not be independent from each other (e.g. the degree of overlap between, say, subcorpora T1 and T2, that between T1 and T3 and that between T2 and T3 would all have counted as instances of T-to-T comparisons).
9 In general, the percentage overlap between the bigrams in a subcorpus and the bigrams in a reference corpus is around 21%, as we have just seen. Including the 79% of bigrams that are not shared by the compared corpora into the correlation analyses would have been problematic both from a statistical point of view, because of the massive tie problem due to the 0-scores, and from an empirical point of view, since, in the best case, the analyses would have essentially been a replica of the overlap analyses we just presented.

The results of these analyses are summarized in Table 2.
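Before turning to Table 2, note that the overlap and correlation figures just described could be computed along these lines (a sketch assuming scipy; the dictionaries are taken to map bigrams to their association scores, and all names are illustrative):

from scipy.stats import spearmanr, wilcoxon

def overlap_pct(bigrams_a, bigrams_b):
    # Shared bigrams as a percentage of all distinct bigrams in the two lists
    # being compared.
    a, b = set(bigrams_a), set(bigrams_b)
    return 100.0 * len(a & b) / len(a | b)

def rank_correlation(scores_sub, scores_ref):
    # Spearman rank correlation computed only over bigrams that occur in both
    # the subcorpus list and the reference-corpus list (cf. note 9).
    shared = sorted(set(scores_sub) & set(scores_ref))
    return spearmanr([scores_sub[bg] for bg in shared],
                     [scores_ref[bg] for bg in shared]).correlation

# Matched-pairs comparison, e.g. for the T-subcorpora:
#   r_T = [rank_correlation(s, ref_t_scores) for s in t_subcorpus_scores]
#   r_O = [rank_correlation(s, ref_o_scores) for s in t_subcorpus_scores]
#   stat, p = wilcoxon(r_T, r_O)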
The first column reports the association measure; the second column reports the subcorpus set; the third column reports the average (in parentheses: median) of the Spearman coefficients of the correlations between each of the relevant subcorpora and reference T; the fourth column reports the same data for the correlations with reference O; the fifth column reports whether a Wilcoxon test for the corresponding data (correlation coefficients of the subcorpora with reference T vs reference O) gave significant results at α = .05.

measure  subcorpora  avg (med) r with T  avg (med) r with O  W test
fq       T           .63 (.63)           .59 (.60)           not sig
fq       O           .61 (.60)           .62 (.62)           not sig
mi       T           .91 (.91)           .91 (.91)           not sig
mi       O           .91 (.91)           .91 (.91)           not sig
llr      T           .74 (.73)           .72 (.72)           not sig
llr      O           .72 (.72)           .73 (.73)           not sig

Table 2 Correlations between subcorpora and reference corpora

First of all, notice that in general the correlation coefficients between corpora are quite high, in the case of mutual information so high that the uniform results could be due to a ceiling effect. The results with frequency and log-likelihood ratio go in the expected direction (each set of subcorpora is correlated more strongly with the corresponding reference corpus). However, the differences are very small and they are not statistically significant. Given the small size of the subcorpora (< 100,000 tokens) and their limited number (5 per set), it seems that an obvious next step with respect to the overlap/correlation analysis would be to test whether the weak trends we have detected are confirmed by an analysis based on a larger data set.

4.5 Qualitative analysis

In order to collect a smaller data set for a preliminary qualitative analysis, we extracted the collocations that appear to be most typical of translated texts and the collocations that appear to be most typical of original texts using the following method. We first computed the average log-likelihood ratio for each bigram in the O-subcorpora and in the T-subcorpora described in 4.3. Then, we computed the log ratio of these two values for each bigram. We put the bigrams with a value of this measure equal to or greater than +12 in the list of bigrams typical of original text, and the bigrams with a value equal to or lower than -12 in the list of bigrams typical of translated text. The ±12 cut-off point was arbitrarily chosen to limit the data to a manageable amount. Based on a subjective evaluation of meaningfulness and well-formedness, each set was further divided into two (sub-)sets. Set A contains sequences that appear to be meaningful and well-formed, while set B contains less likely collocation candidates, i.e. incomplete sequences resulting in syntactically ill-formed structures (termine geopolitica; iniziativa centro; veda nota); fully predictable sequences (suo figlio [“his/her son”]; noi europei [“we Europeans”]; sarà possibile [“it will be possible”]); and, somewhat more controversially, content words preceded or followed by function words (usually articles or prepositions), such as sull'isola [“on the island”]; delle riserve [“of the reserves”]; proveniente dal [“coming from the”]. Since this is not an attempt at proposing a classification of collocations, we have not provided interpretative labels for these groupings. Table 3 shows the number of bigrams assigned to each set, and their proportion out of the total number of bigrams selected for analysis.
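The selection of "typical" bigrams described above can be sketched as follows; note that the log base and the treatment of bigrams absent from one of the two sets of subcorpora are not spelled out in the text, so the natural logarithm and the small smoothing constant below are assumptions of this sketch, as are the function names:

import math

def typicality(avg_llr_o, avg_llr_t, eps=1e-6):
    # Log of the ratio between a bigram's average log-likelihood ratio in the
    # O-subcorpora and in the T-subcorpora; large positive values point to
    # originals, large negative values to translations. eps is a smoothing
    # constant (an assumption) for bigrams missing from one of the two sets.
    return math.log((avg_llr_o + eps) / (avg_llr_t + eps))

def typical_bigrams(avg_o, avg_t, threshold=12):
    # avg_o and avg_t map bigrams to their average log-likelihood ratio in the
    # O- and T-subcorpora respectively (cf. 4.3).
    typical_of_originals, typical_of_translations = [], []
    for bg in set(avg_o) | set(avg_t):
        score = typicality(avg_o.get(bg, 0.0), avg_t.get(bg, 0.0))
        if score >= threshold:
            typical_of_originals.append(bg)
        elif score <= -threshold:
            typical_of_translations.append(bg)
    return typical_of_originals, typical_of_translations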
            Tot   Set A   %      Set B   %
original    166   83      50     83      50
translated  203   74      36.45  129     63.54

Table 3 A tentative classification of bigrams

Although the initial number of bigrams selected is substantially larger in the case of translations (203 vs 166 in originals), when less meaningful and well-formed sequences are removed only 74 bigrams are left (vs 83 in the originals).10 These can be further analysed in terms of their topic-dependency, and grouped along the cline “technical-general” into the fuzzy categories strongly topic-dependent (containing a geographical term), weakly topic-dependent and topic-independent (general language). The top 5 bigrams from each category are reproduced in Table 4 (approximate English glosses in parentheses):

strongly topic-dependent
  ORIGINALS: lega nord (Northern League, It. pol. party); lingua russa (Russian language); minoranza italiana (Italian minority); Alto adriatico (High Adriatic); centro europea (Central-European)
  TRANSLATIONS: fratelli musulmani (Muslim brothers); marco tedesco (German mark); mar rosso (Red Sea); chiesa russa (Russian church); vicino oriente (Near East)

weakly topic-dependent
  ORIGINALS: guerra giusta (right war); minoranze etniche (ethnic minorities); opinioni pubbliche (public opinions); spazio vitale (vital space); prodotti industriali (industrial products)
  TRANSLATIONS: governo federale (Federal government); nucleo centrale (central nucleus); sistema monetario (monetary system); istituto orientale (Oriental institute); autorità federali (Federal authorities)

topic-independent
  ORIGINALS: basti pensare (suffice it to think); breve periodo (short period); chi scrive (the writer [lit. s/he who writes]); occorre realizzare (it is necessary to set up); scorso anno (last year)
  TRANSLATIONS: terza fase (third phase); porre fine (put a stop); stessa cosa (same thing); reso noto (made public); far sì (make possible)

Table 4 Examples of bigrams grouped according to topic-dependency

Table 5, based on a manual count, shows that topic-independent typical sequences are twice as common in originals as in translations, whilst the opposite is true of strongly topic-dependent sequences:

            topic-independent   strongly topic-dependent
original    21 (25.3%)          12 (14.4%)
translated  10 (13.5%)          29 (39.1%)

Table 5 Distribution of topic-independent and strongly topic-dependent bigrams

The initial impression of a more substantial incidence of repeated patterns in translated vs original language, supported by the number of patterns retrieved, is mitigated by observation of actual instances. It does seem that translated language is repetitive, possibly more repetitive than original language. Yet the two differ in what they tend to repeat: translations show a tendency to repeat structural patterns and strongly topic-dependent sequences, whereas originals show a higher incidence of topic-independent sequences, i.e. the more usual lexicalised collocations in the language. The latter may be viewed as instances of those “target-specific features” that, according to Mauranen (forthcoming), who analyses Finnish data, tend to be underrepresented in translations with respect to comparable originals.

Closer observation of the bigrams excluded from this analysis (B set) reveals further interesting patterns. For example, the sequences considerato come [considered.MASC.SING as] and considerata come [considered.FEM.SING as] appear in the list of typically translational expressions. A search for all the variants of the adjective/past participle (masculine singular and plural, feminine singular and plural) retrieves 619 occurrences from the original corpus, and 333 occurrences from the translation corpus.
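The count behind Table 6 amounts to retrieving all forms of the participle and the share of them immediately followed by come; a minimal sketch (names and tokenisation as in the earlier sketches, i.e. illustrative only):

def r1_proportion(tokens, keyword_forms, collocate):
    # Occurrences of any of the keyword forms, and the percentage of them that
    # are immediately followed (first position to the right, R1) by the collocate.
    keyword_forms = set(keyword_forms)
    total = with_r1 = 0
    for i, tok in enumerate(tokens):
        if tok in keyword_forms:
            total += 1
            if i + 1 < len(tokens) and tokens[i + 1] == collocate:
                with_r1 += 1
    pct = 100.0 * with_r1 / total if total else 0.0
    return total, with_r1, pct

# e.g. r1_proportion(o_tokens,
#          ["considerato", "considerata", "considerati", "considerate"], "come")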
These figures are in accordance with the relative sizes of the two corpora, approximately 2:1. However, if we look at the frequency of the collocate come in the first position to the right of the keyword, the proportions change dramatically (Table 6):

             considerato/considerata/considerati/considerate
             Total   tokens      %       + come (R1) %
Original     619     2,096,191   2.969   9.8
Translated   333     922,946     3.661   20.7

Table 6 Frequencies of considerato and considerato come

10 This agrees with what we observed in 4.3 concerning the results obtained with mutual information vs frequency and log-likelihood ratio, i.e. that the former tends to pick up the most plausible collocations whilst failing to detect differences between originals and translations. The same point emerges from a preliminary analysis of the bigrams extracted after removing the function words (see footnote 7 above). Again, in this kind of data the differences between original and translated language nearly disappeared.

It might be hypothesised that translators show a preference for using optional come [“as”] in this structure. This would be in line with Olohan's findings concerning overuse of optional elements in (English) translation (Olohan 2001). Clearly, caution must be exercised in drawing conclusions, especially in an exploratory study like this one. However, the current findings are promising, hinting at some systematic differences in the use of collocations in closely comparable translated and non-translated texts.

5. Conclusions

We believe that this study has shown that monolingual (closely) comparable corpora are promising resources for the study of collocational restrictions in translated vs non-translated language. Simple data-exploration methods coupled with qualitative analyses would appear to be adequate in providing at least preliminary insights in this area. At the same time, we were only able to detect weak trends in the LIMES corpus, and no effects in the EU corpus. We plan to improve on this via two strategies. On the one hand, we hope that by simply enlarging both corpora, we will be able to identify more robust statistical trends. On the other, we can bootstrap from the data-driven insights presented in this exploratory study to devise knowledge-richer collocation extraction methods. For example, the observations in 4.5 suggest that the analysis of frequent bigrams including function words might prove particularly relevant in telling translated language apart from original language. This is a strategy we would not have considered had we not performed this preliminary data-driven investigation.

References
Baker M 1995 Corpora in translation studies: an overview and some suggestions for future research. Target 7(2): 223-243.
Baker M 1996 Corpus-based translation studies: the challenges that lie ahead. In Somers H L (ed), Terminology, LSP and translation. Amsterdam, Benjamins, pp175-186.
Baroni M, Matiasek J, Trost H 2002 Using textual association measures and minimum edit distance to discover morphological relations. Paper presented at the International Workshop on Computational Approaches to Collocations. Online: http://sslmit.unibo.it/~baroni/
Benson M, Benson E, Ilson R 1986 (1997) The BBI dictionary of English word combinations. Amsterdam, Benjamins.
Bernardini S, Conrad S 2002 Multidimensional Analysis and translation. Paper presented at the international conference Corpora and Discourse, Camerino 27-29 September 2002.
Bernardini S and Zanettin F forthcoming. When is a universal not a universal?
Some limits of current corpus-based methodologies for the investigation of translation universals. In Proceedings of Translation universals, do they exist? Savonlinna, 19-20 October 2001. Bolinger D 1976 Meaning and memory. Forum Linguisticum 1(1): 1-10. Church K, Hanks P 1990 Word association norms, mutual information, and lexicography. Computational Linguistics 16: 22-29. Danielsson P 2001 The automatic identification of meaningful units in language. Unpublished PhD Thesis. Göteborg University. Dunning T 1993 Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1): 61-74. Evert S 2001 On lexical association measures. Online: http://www.collocations.de/EK/am-html/ Firth J R 1957 (1951) Modes of meaning. Papers in linguistics. London, OUP, 190-215. Firth JR 1964 (1930). Speech. London, OUP. Frawley W 1984 Prolegomenon to a theory of translation. In Frawley W (ed), Translation: literary, linguistic and philosophical perspectives. London and Toronto, Associated University Presses, pp159-175. Gitsaki C 1996 The development of ESL collocational knowledge. Unpublished PhD Thesis. University of Queensland. 90 Heid U 1996 Using lexical functions for the extraction of collocations from dictionaries and corpora. In Wanner L (ed), Lexical functions in lexicography and Natural Language Processing. Amsterdam, Benjamins, pp115-146. Inkpen D Z, Hirst G 2002. Acquiring collocations for lexical choice between near-synonyms. SIGLEX Workshop on Unsupervised Lexical Acquisition, 40th meeting of the Association for Computational Linguistics, Philadelphia, June 2002. Kenny D 2001 Lexis and creativity in translation. Manchester, St. Jerome. Kilgarriff A 2001 Comparing corpora International Journal of Corpus Linguistics 6(1): 1-37. Krenn B 2000a The usual suspects: data-oriented models for identification and representation of lexical collocations. Saarbrücken, Saarland University. Krenn B 2000b Collocation mining: exploiting corpora for collocation, identification and representation. KONVENS , pp209-214. Krenn B, Evert S 2001 Can we do better than frequency? A case study on extracting PP-verb collocations. Proceedings of the ACL Workshop on Collocations, Toulouse. Langendoen T 1968 The London School of linguistics: a study of the linguistic theory of B. Malinowski and J.R. Firth.Cambridge (MA), MIT Press. Laviosa S 1998a “Core patterns of lexical use in a comparable corpus of English lexical prose. Meta 43(4): 557-570. online: http://www.erudit.org/revue/meta/1998/v43/n4/. Laviosa S 1998b Universals of translation In Baker M (ed), The Routledge encyclopedia of translation studies. London, Routledge, pp288-291. Lehrer A 1974 Semantic fields and lexical structure. Amsterdam and London, North Holland. Löscher W 1991 Translation performance, translation process, and translation strategies. Tübingen, Gunter Narr. Lyons J 1977 Semantics - Volume I. Cambridge, Cambridge University Press. Manning C, Schütze H 1999 Foundations of statistical natural language processing. Cambridge (MA), MIT Press. Mauranen A forthcoming “Where's cultural adaptation?” InTRAlinea. Olohan M 2001 Spelling out the optionals in translation: a corpus study. In Rayson P, Wilson A, McEnery T, Hardie A and Khoja S (eds.) Proceedings of Corpus Linguistics 2001, Lancaster, pp423-432. Olohan M, Baker M 2000 Reporting that in translated English. Evidence for subconscious processes of explicitation? Across languages and cultures 1(2) 141-158. 
Schmied J, Schäffler H 1996 Explicitness as a universal feature of translation. In Ljung M (ed), Corpus-based studies in English: papers from the seventeenth International Conference on English Language Research on Computerized Corpora (ICAME 17). Amsterdam, Rodopi, pp21-36. Siegel S 1956 Nonparametric statistics for the behavioral sciences. New York, McGraw-Hill. Sinclair J 1991 Corpus, concordance, collocation. Oxford, Oxford University Press. Stubbs M 2001 Words and phrases. Corpus studies of lexical semantics. Oxford, Blackwell. Toury G 1995 Descriptive translation studies and beyond. Amstedam, Benjamins. Vanderauwera R 1985 Dutch novels translated into English. The transformation of a "minority" literature. Amsterdam, Rodopi. 91 Investigating cross-linguistic constraints on adjectival and adverbial premodification. A corpus-based study of English and German. Dr. Sabine Bartsch, Darmstadt University of Technology, Department of Linguistics and Literature, Hochschulstrasse 1, 64289 Darmstadt, Germany, eMail: bartsch@linglit.tu-darmstadt.de, http://www.linglit.tu-darmstadt.de/bartsch/ Combinatorial constraints are commonly assumed in linguistics to be either based on the grammatical system of a language or to be idiosyncratic constraints on the combinatorial properties of individual lexical items and not extensible in any systematic way to larger subsets of the vocabulary. Thus, the required complements of verbs are an example of a constraint of the former type, while the latter type of constraint is instantiated by lexical collocations which are commonly assumed to be individual and idiosyncratic co-selections of lexical items and are generally treated as a usage phenomenon. There are, however, subsets of the lexicon that display striking combinatorial constraints or, rather, combinatorial requirements, over and above singular lexical combinatorial preferences which are not modelled by the grammatical rules of the language. Such constraints can be shown to hold across larger subsets of the lexicon but are neither explicable as co-selection preferences of individual lexical items nor as constraints imposed by the grammatical rules of a language. Such constraints are found to govern, for instance premodification requirements of a subset of attributive adjectival past participle (APP) - noun combinations. It has thus been observed that for reasons yet to specify, the following noun phrases – comprised of an adjectival element plus head noun – are felt to be unacceptable unless further qualified by an appropriate premodifier: (1) (a) ?a built house (a') a newly built house (b) ?ein gebautes Haus (b') ein neu gebautes Haus (c) ?a born child (c') a recently born child (d) ?ein geborenes Kind (d') ein kürzlich geborenes Kind (e) ?a footed dancer (e') a light footed dancer (f) ?ein füßiger Tänzer (f') ein leichtfüßiger Tänzer (g) ?a prone routine (g') an accident prone routine (h) ?ein trächtiger Verlauf (h') eine unfallträchtiger Verlauf In a paper on adjectival passives, Levin, Rappaport (1986: 634) observe that „some APPs sound peculiar unless qualified, for reasons that are not entirely clear”. The unacceptability and unnaturalness of the non-premodified examples in the left column is, indeed, neither explicable based on grammatical rules nor is it attributable to singular idiosyncrasies of individual items. What makes this phenomenon so interesting is precisely the fact that it must be explained in terms different from other lexical or grammatical phenomena. 
This paper attempts to pursue this issue in terms of the information content of these phrases and their constituents. It also addresses the question what influence the information structure of utterances has on linguistic constraints, on the one hand, and discusses the language user's sensitivity to constraints on the amount of information comprised in an utterance on the other. Another aspect that is of interest in the context of this phenomenon is whether the constraints observed in the English data can also be shown to hold cross-linguistically in other languages which would allow statements about the universal validity of the constraint across languages. Surprisingly, relatively few studies have addressed this issue so far (e.g. Grimshaw & Vikner, 1993; Ackerman & Goldberg 1996), and none, to my knowledge, have made use of corpus evidence, especially in more than one language. This paper reports findings on this phenomenon based primarily on examples from two corpora of English and German, the British National Corpus (BNC) and the tagged subsection of the COSMAS corpus collection of contemporary German at the IDS (Institut für Deutsche Sprache) in Mannheim, Germany. The paper investigates premodification constraints based on a corpus study of English and German. It addresses the following questions: 1) what properties of this type of structure and its constituents lead to this (obligatory) premodification requirement? 2) does this type of constraint allow for generalisations across a wider range of data beyond adjectival past participle plus noun phrases? 3) what does this type of constraint tell us about the interplay of language structure and world knowledge? 4) can this constraint be shown to hold cross-linguistically in other languages e.g. German? The paper traces the premodification requirement in the examples above to the information content of the constituents involved and hypothesizes that the language user's world knowledge about the 92 properties of the head noun referent governs the amount and type of information that can be meaningfully contributed by an attributive adjective. This reasoning is in line with Ackerman & Goldberg's (1996: 28) conclusion that “APPs can only occur if they are construable as predicating an informative state of the head noun referent.” Based on a corpus-study of the phenomenon just outlined, the paper investigates the mechanisms responsible for the premodification requirement by a comparison with similar structures that do not require obligatory premodification. The paper also attempts to extend the findings to related structures besides those containing an adjectival past participle. A comparison of English and German examples of this type of phrase seeks to establish the cross-linguistic validity of the postulated premodification constraint. The paper postulates a pragmatic motivation for the above mentioned constraints on premodification. This postulate is based on the observation that the provision of a surplus of information by the attributive adjective violates pragmatic constraints such as conversational maxims (Grice 1975) on the informativeness of an utterance. More precisely, it is postulated that it is an overlap of information comprised in the adjectival and the noun constituent which results in a level of redundancy that renders the examples above unacceptable unless further qualified by an appropriate premodifier. 
Redundancy, which is defined here as a surplus of information, is standardly assumed to be recovered by a reduction of the information provided. Paradoxically, in the structure under study here, the redundancy is recovered by providing additional information by means of a premodifier that further qualifies the information provided by the noun phrase in order to establish an informative state. The paper raises the question how our knowledge of the world leads to or result in constraints on what can be or has to be expressed explicitly, which types of information are perceived to be redundant and how avoidance of redundancy is reflected in constraints on or requirements of premodifying elements in the above mentioned types of structures. 1. The nature of the pattern under study The paper is specifically concerned with the premodification requirements of a subset of English and German phrases of the basic pattern: obligatory premodifier (adverb, adjective, noun) + adjective + noun As it became apparent in the course of the study that the constraints investigated originally for APP plus noun phrases are also found in at least one other type of phrase of the above mentioned basic structure, namely obligatorily premodified desubstantival adjectives, it was decided that the following patterns should be analysed: I. premodifier + adjectival past participle (APP) + noun: (2) (a) a newly built house (b) ein neu gebautes Haus II. premodifier + desubstantival adjective (DesubstAdj) + noun (3) (a) a light footed dancer (b) ein leichtfüßiger Tänzer The following sets of examples illustrate the phenomenon for pattern I. (examples (4) and (6)) and II. (examples (5) and (7)) in German and English respectively: (4) (a) a nobly born orphan HH1 (‘a) ?a born orphan (b) a British born woman EA1 (‘b) ?a born woman (c) freshly baked bread C8S (‘c) ?baked bread (d) specially designed dashboard J1T (‘d) ?a designed dashboard (e) a lightly built rugby player CL2 (‘e) ?a built rugby player (f) one of the most injury-prone cricketers in the country CBG (‘d) ?a prone cricketer (g) the guilt ridden middle classes CE6 (‘d) ?the ridden middle classes (h) the traffic ridden part of Argyle Street GWL (‘d) ?the ridden part of Argyle St. (i) a naked, angst ridden, shorn haired young man ART (‘d) ?a ridden, haired young man (5) (a) to arrive empty handed (a') *to arrive handed (b) a hard headed woman (b') *a headed woman (c) a scatter-brained person (c') *a brained person (d) a thick leafed plant (d') *a leafed plant (e) a hot-blooded character (e') *a blooded character 93 (6) (a) ein neugeborenes Baby (a') *ein geborenes Baby (b) frisch gebackenes Brot (b') *gebackenes Brot (c) ein schlank gebauter Spieler (c') *ein gebauter Spieler (d) ein verletzungsanfälliger Spieler (d') ein anfälliger Spieler (e) ein kurzhaariger Mann (e') ein haariger Mann (7) (a) eine starrköpfige Frau (a') *eine köpfige Frau (b) eine leichtsinnige Person (b') *eine sinnige Person (c) eine dickblättrige Pflanze (c') ?eine blättrige Pflanze (d) eine dickstämmige Birke (d') ?eine stämmige Birke (e) ein heißblütiger Charakter (e') *ein blütiger Charakter While the phrases in the first column of example sets (4) - (7) are well-formed and perceived to be natural, acceptable and grammatical, the examples in the second column are mostly perceived as non-acceptable, unnatural and deviant in some unspecified way. The only difference between the phrases in the first column and those in the second column lies in the presence or absence of a premodifying element. 
It is thus reasonable to seek for an explanation of the perceived unacceptability of the non-premodified phrases in the role of the premodifying element and to ask why these phrases should require premodification in order to be rendered acceptable. After all, as the following examples show, other phrases of the same structure, APP/DesubstAdj plus noun, exist which are perfectly acceptable without premodification: (8) (a) It is now a declared policy of many governments and international agencies that the only vehicle for such preparation is ‘education, education, education', … (b) A balanced diet includes a variety of foods from all 5 food groups. (c) […] that grew out of the post-World War I settlement of the broken Ottoman Empire. (d) The born loser. (Comic title) (e) A married father of three. (f) He scored his first headed goal. (g) headed notepaper (h) a hung parliament (9) (a) …, denn ab 11:30 gibt es gebackenen Fisch auf dem Fest. (b) Die getötete Person ist circa 1,80 m groß. (c) gehackte Kräuter (d) Ein verheirateter Vater dreier Kinder. (e) Es ist der erklärte Wunsch der Parteien …. (f) eine ausgewogene Diät. Contrasting these sets of examples in (8) and (9) with the previous sets of examples raises the issue what exactly the difference is between those adjectival past participle/desubstantival adjective plus head noun phrases that do require premodification to be rendered acceptable, and those that do not require premodification. Two previous studies offering explanations of this type of structure and premodification requirement are discussed in the next section. 2. Previous studies: an event structure account versus a non-redundancy account In a paper on English adjectival past participles, Ackerman & Goldberg (1996: 17-30) postulate that it is the amount of information provided by an expression rather than the event structure of the verb underlying the APP that gives the relevant clue to the unacceptability of the above mentioned phrases unless they are premodified. Ackerman & Goldberg conclude that “adjectival past participles can only occur if they are construable as predicating an informative state of the head noun referent” (ibid.: 28). Another previous study (Grimshaw & Vikner 1986) of adjectival past participles which is critically discussed in Ackerman & Goldberg (ibid.: 19-20) traces the premodification requirement of adjectival past participles based on creation verbs to the event structure of the verbs underlying the adjectival past participles. This event structure account argues that the underlying creation verbs display an event structure with two subevents, a process and a state and that as the APP only serves to “identify” the state subevent, the process subevent must be specified by the adverb premodifier. As Ackerman & Goldberg (ibid.) correctly observe, this event structure based account would have to be assumed to hold for all verbs with this event structure, i.e. for all accomplishment verbs, not just for creation verbs. Yet, many change of state verbs such as broil, cool etc. can be used without obligatory premodification without being rendered unacceptable. In this light, Ackerman & Goldberg's account based on the observed redundancy of the non-premodified phrases appears more plausible and also has the advantage of promising to cover a wider 94 range of data. 
Building on the notion of redundancy introduced by Ackerman & Goldberg (ibid.), this paper explores the impact of the information content of the constituents in the phrases under study on the level of redundancy. It will therefore be argued in this paper that the requirement of an obligatory premodifier in a subset of the data can be attributed to the information content of the expressions under study and to the knowledge that the language user has of the semantic frameworks of the constituents. It will be demonstrated, furthermore, that while on the one hand redundancy is avoided in language and information processing and is usually recovered by omission of redundant information, we are dealing here with an example in which the redundancy of the adjective plus noun phrase is recovered by an added premodifier. This premodifier, in turn, is rendered obligatory by the fact that the plain combination of adjectival past participle / desubstantival adjective plus noun expresses a non-informative and therefore non-permissible state.

3. The semantic nature of the structures under study

As was already indicated, the motivation for the required premodification of the adjective plus noun phrases above lies in the semantics of the adjective and the noun, and specifically in the knowledge of the language user about the inherent properties of the head noun referent and the information contributed by the attributive adjectival modifier. It will thus be necessary to give a characterisation of the semantic content of the noun in relation to the information provided by the adjectival constituent. APPs are best characterised as denoting resultativity, i.e. a state resulting from an action or event described by the verb they are derived from. Syntactically, they function as adjectival elements modifying nouns or pronouns or as predicative adjectives. Desubstantival adjectives are related to adjectival past participles, and parallel in this context, in so far as they fulfil the same syntactic function. They likewise denote a state or property of their referent. The head nouns of these phrases denote either a referent or object that is, in a sense, the end-point or result of the property introduced by the APP (a newly born baby → ‘having been born is an inherent property of a baby'), or a referent, object or artefact that has the property introduced by the adjective as an inherent property (a hard-headed woman → having a head is an inherent property of a woman; hence, we cannot speak informatively of a headed woman). There thus appears to be an overlap of information between the adjectival past participle / desubstantival adjective and the head noun, i.e. the adjectival modifier provides information about its head noun referent which is already tacitly understood by the language user. The non-premodified adjectival constituent thus provides a surplus of information that is deemed uninformative and therefore perceived to be redundant and unacceptable by the language user. Based on this observation, a pragmatic motivation for the perceived unacceptability of the non-premodified phrases and for the requirement of a premodifier in a subset of these structures is postulated. The next section introduces a classification of the types of relations between adjectival constituent and head noun which will serve as a basis for the following discussion of the semantic and pragmatic reasons for the obligatoriness of the premodification in these phrases.

4.
Classification of the data In general terms, it may be said that premodification of the adjectival constituent, APP or DesubstAdj, in the examples quoted here seems to be the rule rather than the exception even though some of the following adjectival past participle/desubstantival adjective plus noun phrases will occasionally be found without a premodifying element. These cases are, however, heavily dependent on some kind of contrastive context. It also must be stressed that dependence of the acceptability of many of the phrases below on the premodifying element is often a matter of degree rather than a matter of absolute decisions. In this section, a set of phrases from the BNC and the Cosmas corpus is classified according to the role of the adjectival constituent in relation to the head noun it modifies. The classification is based on Bartsch (forthcoming, 2003). 1) Adjectival element denotes an inherent property of the modified head noun In the first type of phrase in this classification, the adjectival element denotes an inherent property of the modified head noun. In these examples, an adjective which is obligatorily premodified denotes an inherent property of the noun it modifies. For example, in the phrase (10) (a) fearlessly built industries 95 the adjectival past participle built denotes the fact that industries as a man-made artefact must, inherently, be built in order to come into existence. Even in the example (11) (a) the thickly carpeted mountainside (b) (…) statt üppiger Vegetation nur spärlich bewachsenen Untergrund the DesubstAdj carpeted metaphorically denotes the fact that a mountainside is assumed to have some kind of layered natural covering (e.g. moss, grass etc.). In this case, the particular selection of adjective is a metaphorical extension indicating the quality of the covering. In the German example, the information provided by the adverbial premodifier that the ground is sparsely covered makes the phrase sound more natural, even though it has to be said that the German phrase is also acceptable without the premodifier, arguably at the expense of its naturalness. A number of subtypes of this structure are distinguishable in the corpus data: 1.1) Construction / shape / configuration: In this subtype, the adjective refers to the construction, shape or configuration of the head noun referent. (12) (a) […] I guess that the slightly askew likes of Strangelove should be welcomed with open charms. CHB (b) Working as I do for a technically based industry, I hope I may be forgiven for believing that almost anything can be achieved through technology, […]. EA8 (c) […] that when a lightly built eight-stone person runs around on a hard surface, his/her knees suffer a momentary weight equivalent to three-quarters of a ton with each step ASD (d) Matchsticks almost fell out stepping back awkwardly like a badly constructed puppet. CEC (e) After 1970 the newly created Department of the Environment, […]. G05 (f) […] a surprisingly hard-edged collection of songs. C9K (g) These fossil seeds are preserved in a relatively coarse-grained sandstone. AMM (h) […] in fact a very fine-particled clay, Montmorillionite. C97 (i) I was astonished when we began walking down the now weed-strewn path to feel a familiar feeling of fear and expectation. CE9 (13) (a) Das erste privat gebaute Gefängnis Deutschlands steht in […]. (b) […] dort den köstlichen Duft frisch gebackener Plätzchen schnuppern, […]. (c) […] der auch Inhaber des neu geschaffenen Neurochirurgie-Lehrstuhls ist […]. 
1.2) States of development / creation / preparation The following phrases denote states of natural development and man-made creation or preparation. These are felt to be intrinsic to the object denoted by the head noun and can therefore only be addressed explicitly when this information is sanctioned by further modification and specification of the nature of the properties. Thus in the first two examples, (a) and (b), it is obvious that a son or an animal would have been born or bred in order to come into existence as a living being. The information that a son, or indeed a living being, is born or an animal bred is, by itself, redundant and therefore not permissible in this construction. Yet, as soon as qualification is added in the form of a specification of the circumstances of the birth of the son or the breeding of the animal, the expression is sanctioned as providing new and relevant information. (14) (a) Here two men, a ship's officer and a gently born younger son sent from England to make his way in the world, are involved in certain events […] EC8 (b) […] in most sheep rearing countries (although not Australasia or Japan); transmissible mink encephalopathy, occurring in rare outbreaks in captive bred animals on mink farms, mainly in North America; chronic wasting disease of captive mule deer and elk, seen in North America; […] EC7 (c) […] preventing certain grievances from developing into full fledged issues. G1G (e) Shostakovich's First Violin Concerto is a more rough-hewn affair. A5E (f) […] Ruth privately thought – there were freshly baked rolls, split and filled with ox-tongue. CB5 (g) Ockleton pointed in turn at a broadly built man smiling conventionally towards the camera […] H8T The German examples are slightly less clear-cut. Example (a) below is also permissible without the premodifier, example (b) must be judged differently because it is a metaphorically extended meaning, but example (c) corresponds to the mechanisms already explained for the English examples above; we can neither speak of a schultrig gebauter Mann (‘shouldered built man') nor of a gebauter Mann (‘built man') without further specification. (15) (a) […] das von seiner Frau nichtehelich geborene Kind zu adoptieren. (b) in relativ altbackener Kleidung (c) der breitschultrig gebaute Mann 96 2) inalienably possessed parts: This set of examples comprises adjectival elements referring to inherent and inalienably possessed parts of – mostly – living beings, i.e. humans, animals, plants. The part denoted as a property or part of the living being is perceived to be an intrinsic part of this living being. So much so, in fact, that mentioning this part explicitly requires the provision of further information in order to be sanctioned as new and relevant information. This type of the structure is one of the most frequent and reliable instantiations of obligatory premodification. It rests on the knowledge of language users regarding parts of living beings which is part of language users’ world knowledge. Such world knowledge is much more inseparable from purely linguistic knowledge than has been suggested at times in linguistic research. As will be shown below, this is an important factor in accounting for the obligatoriness of the premodification of the adjectival elements in these examples. 2.1) inalienably possessed body parts In these examples, the adjectival element functions as a modifier to a noun denoting an inseparable part of a living being (human, animal, plant) that is inherent or inalienable to its owner. 
In the example: (16) (a) a harshly-faced person the adjective faced denotes an inherent and, in fact, inalienable part of the human being, i.e. a person inherently has a face which is an inalienable part of his or her physiology. When the adjective faced is used in the context of a human being, it is tacitly understood that every human being has a face, in fact, so much so that this detail must not be mentioned explicitly unless there is good reason to do so. Good reason for mentioning such attributes or properties is the provision of additional information about the inherent part which goes beyond the mere statement of the existence of this part. Thus, stating about a human being that he or she has a face is only permissible if this information is further specified beyond what is already known as part of the language user's knowledge of the properties of living beings. In each of the examples below, the adjectival constituent must be further premodified because the combination of the information provided by the desubstantival adjective plus the head noun is, by itself, redundant. The premodifier contributes the information that recovers the unacceptable and unnatural redundancy of the non-premodified expressions. (16) (b) And it reached her; in a totally unwelcome manner she seemed trapped in a web spun by golden eyes, a harshly boned face, a sensual mouth that often hid its humour. HA7 (c) The house was very mucky and rotting food spilled in the side alley next to it which attracted the most bleary-eyed flea-ridden dogs. H9G (d) […] Mrs Beattie and Mrs Friar could hardly be classed as neighbours and even the most cock-eyed optimist would never even hope that they might even begin to be pleasant to each other. ATE (e) […] spend the rest of my life playing it, thickened with doleful dirges, vainly trying to lay the trauma, my only satisfaction the ashen faced, staring eyed audiences staggering out at the end of performances, primed, and ready to carry on the good work. A6C (f) Nuggett is the hot headed one. K20 but: But in the 86th minute of a frantic Roman derby, he produced a stunning headed goal as the game ended 1–1. CEP Ballpoint pen and carbon-paper suffice for ‘one-off’ letters written on donated headed notepaper.HHP The two last phrases under (16)(f) show that the phrase can also be used without premodification under the condition, as instantiated in these examples, that the adjectival past participle be used in a sense that does not overlap with tacit information language users already have about the head noun referent. In this case the phrase headed goal is used not in the sense ‘goal with a head’ parallel to ‘person with a head', but in the sense ‘goal scored with or by means of the head'. The expression is thus non-redundant and does therefore not require an obligatory premodifier. The same is true of the phrase headed notepaper in which the adjective headed contributes an informative state about the notepaper, namely that it has a printed header. This is a property that is not inherent to or prototypically assumed of notepaper; thus, the information provided by the adjectival constituent of the phrase is not redundant and does therefore not require premodification in order to be rendered informative. (17) (a) die unterschiedlichen Pigmente blauäugiger, hellhäutiger Menschen (b) ein bleichgesichtiger Mensch (c) Barfüßig und im Schlafrock parodiert …. 
(d) ein fahlgesichtiger junger Mann (e) ein kurzhaariger Mann 2.2) inalienably possessed plant parts These examples are analogous to the examples of inalienably possessed body parts in 2.1) above. They refer to parts of plants such as branches, leaves and the trunk that are perceived to be inalienable, characteristic, prototypical parts of plants. Because a tree is assumed to have these parts, mentioning them explicitly as parts of a tree is redundant to the proficient language user unless this information is sanctioned by further specification. (18) (a) "A slight sudden puff of breeze lifted a thickly leafed branch in front of her and she saw two figures, framed by the foliage at the instant their lips met […].” H9H (b) double-, five-, many-branched (OED 2nd: branched ppl. a.) (c) Sprinting across it, he reached a particularly thick-trunked tree at the farther edge. HJD (19) (a) Gelbrandige Spielarten bleiben buntblättrig, wenn man den grünen Mittel […] (b) […] vierblättrige Kleeblätter aber zerstörte … (c) Und daß neben dem Waldvögelchen die breitblättrige Stendelwurz schon einen g[…] 3) cognition, perception, communication: This category comprises adjectival elements referring to abilities and properties of the head noun referent that are associated with cognitive skills, with perception and communication. In some cases, e.g. spoken, voiced, it is difficult to decide whether speech or the voice is a part of the human body (in which case these examples would have had to be classified under 2.1) above) or whether they belong to the more abstract category of cognition, perception and communication. This latter interpretation was given preference because the voice or the capacity for speech is not a part of the human body in the same way that e.g. the limbs and head are. 3.1) cognitive and perceptive skills and properties This first set of examples comprises phrases with adjectival constituents that refer explicitly to cognitive abilities (e.g. minded) or perceptive skills (e.g. sighted; -sichtig). Most examples found in the BNC refer to distinctly human skills and properties. (20) (a) He was humble minded and never forced his opinions down other people 's throats. EVH (b) In view of the deeply felt resentment at the reign of Bayezid I on part of at least the more pious elements of the state, […]. H7S (c) One was the detestation by the liberally oriented of religious paternalism, a mild form of anti-clericalism. A07 (d) Ernest Gowers, a very far-sighted man, had no hesitation in suggesting a separate council for Scotland and this is what happened; […] K5M but: the sighted UFO (21) (a) […] nur die weitsichtige Entscheidung der Regierung kann schlimmere Folgen abwenden 3.2) communicative faculties A subtype of this type of pattern refers to the human faculties in the context of verbal communication. Again, it is interesting to note that a person would not be said to be just ‘spoken’ in the sense of having the faculty of speech, because speech is obviously perceived as an inherent human faculty. (22) (a) (…) and this time an impeccably spoken man replied, albeit with another question. CRE (b) An initially kind-voiced woman took down my details (…) CAG (23) (a) […] der vielstimmige Chor der Waldvögel. 4) affected states This last set of examples involving obligatory premodification of adjectival past participles denotes affected states or physical conditions. (24) (a) The most seriously-affected areas are Scotland, Wales and the Pennines.
J2V (c) […] the reasons for the supposedly violence-prone nature of the UMW, […] G1H (d) This problem was to make it the most accident-prone routine in Tiller history. B34 (e) Bed ridden patients at St Mary 's A St Mary Abbots Hospital […]. KCN (g) […] a large number of the rural population […] are poverty stricken. AN3 (25) (a) die von der Hochwasserkatastrophe am härtesten betroffenen Gegenden In some of these examples, e.g. (24) and (25) (a) and (b), it is not entirely clear that the premodifier is obligatory. The following modification of sentence (a) without the premodifier seems to be permissible without sounding unnatural or providing redundant information: (a') The most affected areas are Scotland, Wales and the Pennines. J2V A brief note on the level of establishment of this premodification pattern in both English and German might not be out of place: Looking at the structural types of the phrase pattern under study, it is interesting to note that the constraint on premodification is indeed deeply entrenched in the language. So much so, in fact, that many of the premodifier plus adjectival constituent combinations are no longer spelled as two separate words, but are hyphenated or even spelled as one word. German has a speciality here in linking some of the premodifiers to the adjectival past participles by means of a so-called Fugen-s (e.g. verletzungsanfällig – injury-prone), but as this feature is not specific to the pattern under study, the phenomenon will not be discussed any further here. It is interesting, however, that there is a strong indication that many of these phrases are becoming lexicalised as so-called complex participles which are spelt with a hyphen or as one word and many of which have an adjectival constituent that cannot be directly derived from an underlying verb. Here are some examples: (26) (a) rough-hewn, accident-prone, weed-strewn (b) altbacken It should be mentioned explicitly that both English and German also allow different types of premodification, such as attributively premodifying temporal or circumstantial clauses providing the same type of information as attributive premodifiers; however, these alternative structures are not the concern of this study. The examples discussed under 1.) - 4.) in this section as instantiations of an obligatory premodification requirement are only some of the most prominent ones discovered. More research might still lead to subtler classifications, but the classification undertaken here serves to illustrate the different areas of mostly tacit conceptual knowledge that the language user brings to the interpretation of such expressions. It is this knowledge of the proficient – and obviously knowledgeable – language user that leads to the perceived redundancy of the adjectival past participle / desubstantival adjective plus noun patterns when used without premodification. The language user's tacit knowledge about prototypical and inalienable properties of a human body or other living beings, or about the processes by which living creatures and artefacts come into existence, renders redundant information unnatural and unacceptable. Only by adding premodification can phrases such as the ones discussed above be recovered to an informative and therefore acceptable state. Language users have strong intuitions about the amount and types of information they will accept as informative and permissible, and reject expressions that violate this requirement.
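Because the study is corpus-based, it may help to sketch, purely for illustration, how candidate phrases of the premodifier plus adjectival past participle plus noun type could be pulled out of raw text for manual inspection. The following Python fragment is a minimal sketch under simplifying assumptions: the regular expressions and the toy sentences are invented here, it is not the retrieval procedure actually applied to the BNC or the Cosmas corpus (which rely on the corpora's own annotation), and it misses irregular participles such as born or built.

import re
from collections import defaultdict

# Very rough surface patterns for "<-ly adverb or hyphenated premodifier> <participle> <noun>"
# and for bare "<participle> <noun>"; a real study would query the annotated corpora instead.
PREMODIFIED = re.compile(r"\b(\w+ly \w+(?:ed|en)|\w+-\w+(?:ed|en))\s+(\w+)\b")
BARE = re.compile(r"\b(\w+(?:ed|en))\s+(\w+)\b")

def collect(texts):
    """Return participle -> premodified / bare occurrences found in the texts."""
    hits = defaultdict(lambda: {"premodified": [], "bare": []})
    for text in texts:
        for premod, noun in PREMODIFIED.findall(text):
            participle = premod.replace("-", " ").split()[-1].lower()
            hits[participle]["premodified"].append((premod.lower(), noun.lower()))
        # Remove premodified matches before scanning for bare uses,
        # so the same tokens are not counted twice.
        for participle, noun in BARE.findall(PREMODIFIED.sub(" ", text)):
            hits[participle.lower()]["bare"].append((participle.lower(), noun.lower()))
    return hits

sample = ["there were freshly baked rolls under the thick-trunked tree",
          "the baked rolls were still warm"]          # invented toy sentences
for participle, uses in sorted(collect(sample).items()):
    print(participle, uses)

Counts of how often a given participle occurs with and without such premodification could then be checked against the corpus annotation and against informant judgements; the point here is only to make the retrieval idea concrete.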
The examples discussed in this paper have given but a small glimpse of the extent to which conceptual knowledge influences and shapes linguistic combinatorial requirements. Conceptual knowledge brings about a combinatorial requirement that is covered neither by the grammar nor by the collocational co-selection requirements of the language. An interesting question is thus how this type of allegedly extralinguistic constraint can be incorporated into linguistic theory. The next section proposes a solution. 5. Recovering redundancy: pragmatic constraints on attributive modification As was already indicated, pragmatic constraints are postulated to play an important role in the establishment of collocations alongside lexical constraints. But why should the human language user be bothered by being provided with a surplus of information? After all, is it not helpful, and does it not ensure successful communication and information processing, if as much information as possible is provided? The answer is: yes and no. On the one hand, it is, of course, true that communication can only succeed if the utterance made is sufficiently informative to be decodable; on the other hand, communication can fail if too much, i.e. superfluous, information is provided. Sufficient information thus must be interpreted as: enough information, but not too much information. Linguistic communication is not normally maximally redundant and language systems can be shown to avoid redundancy, and what is true of elements in the language system (e.g. completely synonymous words) is also true of units of information and communication. This observation lies at the heart of Grice's (1975) conversational maxims. Grice formulated a set of maxims that can be applied to the constraints discussed in this paper. The redundancy which results from talking about *built houses, *born babies, *surfaced tables, *faced persons, *trunked trees, *gebaute Häuser, *geborene Babies, *blättrige Bäume, *haarige Männer etc. violates at least the conversational maxims of quantity and relation of Grice's Cooperative Principle (CP) (Grice 1975: 45-46), which entails – among other things – the following requirements:
QUANTITY: 1. Make your contribution as informative as is required (for the current purposes of the exchange). 2. Do not make your contribution more informative than is required.
RELATION: Be relevant.
By explicitly speaking of *a built house and *a born child, of *a surfaced table and *a faced person, information is provided that exceeds the information required for the exchange. It makes the expression redundant and therefore non-informative. On the surface, it actually makes the utterance more informative than is required, because the information provided about the head noun by the APP or the desubstantival adjective can be assumed to be known by every competent member of the linguistic community. As the Cooperative Principle rests on the prerequisite that every participant in a linguistic exchange observes this principle and its maxims, it can be expected by every participant in an exchange that by modifying a noun by means of an adjective, new and potentially interesting information is provided. Stating the obvious without further qualification violates the maxim that a contribution should not be more informative than required.
The information supplied by the adjectival past participle or the desubstantival adjective in the expressions discussed above is not only more informative than required; stating the obvious also tacitly insinuates that the speaker either does not know the rules of conversation or violates them intentionally, in the assumption that the person he/she communicates with does not know things that are obvious. These examples are strong evidence for the postulate that in the case of this type of expression pragmatic constraints have an influence on the type and amount of information that can be introduced felicitously into an utterance. 6. Cross-linguistic validity of the constraint An issue raised at the beginning of this paper is concerned with the question of the cross-linguistic validity of this type of constraint on premodification. In other words, does this constraint on premodification hold for parallel expressions in other languages? The corpus study of English and German has been able to show that the constraint does indeed hold, and can thus be assumed to be valid, in both languages. However, there are also differences which have to be addressed here. From the small set of examples discussed here, it looks like the English and German data generally adhere to the same principles; however, there are as yet unsystematic differences which might indicate that different linguistic communities perceive different types of information differently. There are, first of all, cases such as the German examples (6)(d) and (e) above, in which German permits the non-premodified use of the following expressions:
(6) (d) ein verletzungsanfälliger Spieler – an injury-prone player
(d') ein anfälliger Spieler – *a prone player
(e) ein kurzhaariger Mann – a short-haired man
(e') ein haariger Mann – *a haired man
In the first example, German can speak of a prone player (anfälliger Spieler), which unspecifically covers all sorts of injuries, illnesses and other afflictions that can affect a player. In the second example, the lexical collision of *haired with the already established adjective hairy prevents the use of the adjectival past participle form per se; the form -haired is permissible in combination with an appropriate premodifier. These examples are not permissible in English without premodification. In a small number of cases, native speaker informants of both languages were asked to pass judgement on certain phrases, which yielded interesting responses that mostly corroborated the corpus findings and offered telling explanations of why a phrase was or was not perceived as acceptable. An interesting intercultural difference was for example the fact that none of the English speakers would permit in any context the phrase *a killed person, whereas German speakers did not have any problems motivating contexts in which the phrase eine getötete Person could be used. There were a few other examples in which phrases were likewise found acceptable without premodification in German, but never in English. The premodified counterexamples were deemed acceptable by both groups of speakers, but as the group of native speaker informants was small and by no means representative, further research with native speaker informants will have to be carried out before definite statements can be made about potential intercultural differences concerning the phrases under study in the two languages.
7. Conclusion This paper has attempted to show to what extent constraints, or requirements, on premodification in a set of phrases of the basic structure premodifier plus adjectival past participle / desubstantival adjective plus head noun are shaped by the tacit conceptual knowledge that the language user brings to the processing of language. It could be shown that such constraints shape obligatory structural requirements that are clearly covered neither by the grammar of the language nor by the collocational co-selection requirements of the individual lexical item. It has furthermore become clear that an explanation of the type of premodification requirement exemplified in this paper is best attempted on the basis of the information content of the phrase constituents and the ensuing redundancy if the adjectival past participle or desubstantival adjective is not premodified. It has been possible to show that findings on the premodification requirements of adjectival past participles can also be extended to other, similar types of qualifying adjectives for which similar models of explanation can be employed. As postulated, the same mechanisms based on conceptual knowledge and avoidance of redundancy govern premodification requirements both in English and German, such that the question of the cross-linguistic validity of the constraint can be answered affirmatively for the two languages studied comparatively. Only some of the aspects underlying this phenomenon could be discussed here. More comparative studies are needed to investigate the alternative ways of resolving redundancy in these and similar structures. Further research must be extended to other languages and other patterns in order to establish the impact of conceptual knowledge on linguistic structures on a broader cross-linguistic basis. A question that remains unresolved, and will continue to pose a challenge to linguistics, is how this type of constraint, which is obviously based on factors that many approaches to language have deemed extra-linguistic, is to be incorporated in linguistic theory.
References:
Books
Ackerman, F, Goldberg, AE 1996 Constraints on adjectival past participles. In: Goldberg, AE (ed) 1996 Conceptual Structure, Discourse and Language. Stanford, Cal., CSLI. 17-30.
Bartsch, S forthcoming, 2003 Structural and functional properties of collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen, Verlag Gunter Narr.
Cole, P, Morgan, JL (eds) 1975 Syntax and Semantics 3: Speech Acts. New York, London, Academic Press.
Goldberg, AE (ed) 1996 Conceptual Structure, Discourse and Language. Stanford, Cal., CSLI.
Grice, HP 1975 “Logic and conversation.” In: Cole, P, Morgan, JL (eds) Syntax and Semantics 3: Speech Acts. New York, London, Academic Press. 41-58.
Levin, B 1993 English Verb Classes and Alternations. A Preliminary Investigation. Chicago, London, The University of Chicago Press.
Levin, B, Rappaport Hovav, M 1995 Unaccusativity: at the syntax-lexical semantics interface. Cambridge, Mass., MIT Press.
Levin B, Rappaport M 1986 The formation of adjectival passives. Linguistic Inquiry 17, 623-661.
Levinson, S 1983 Pragmatics. Cambridge, CUP.
Nuyts, J, Pederson E (eds) 1997 Language and conceptualisation. Language culture and cognition 1. Cambridge, Cambridge University Press.
Corpora:
British National Corpus: http://www.hcu.ox.ac.uk/BNC/ The corpus comprises 100 million running words of contemporary British English.
The three letter code accompanying each BNC example in the text refers to the source text from which the example was taken.
The Cosmas Corpus Collection at the IDS (Institut für Deutsche Sprache) in Mannheim, Germany (http://www.ids-mannheim.de/kt/corpora.shtml): The corpus used for the German examples in this study is the tagged subsection of the Cosmas Corpus at the IDS (Institut für Deutsche Sprache), Mannheim, a morpho-syntactically annotated subcorpus of contemporary German. This morpho-syntactically annotated subcorpus consists of a public section of the LIMAS corpus plus four years of newsprint of the German newspaper Mannheimer Morgen (years 91 + 94 + 95 + 96). The corpus comprises 18.22 million running words.
Synchronic and diachronic variation: the how and why of sociolinguistic corpora. Dr. Kate Beeching, Senior Lecturer, Linguistics and French, Faculty of Humanities, Languages and Social Sciences, University of the West of England, Bristol, Frenchay Campus, Coldharbour Lane, Bristol BS16 1QY. Tel: 0117-344-2385; Fax: 0117-344-2820; E-mail: Kate.Beeching@uwe.ac.uk Abstract Tape-recordings have come of age. The spoken word has been captured, stored on tape and (in as yet regrettably restricted amounts) transcribed for a number of decades now and this is a situation which can only improve. Where good accompanying demographic data are available, it is beginning to be possible to trace linguistic change and to provide empirical evidence of the way that synchronic and diachronic variation may be linked. In the past, it has only been possible to guess at long-gone spoken manifestations through written, mainly literary, evidence. This paper aims to describe and illustrate the potential of (spoken) sociolinguistic corpora for research studies in both synchronic and diachronic variation, with reference to French, and to suggest ways in which useful research corpora may be established for future generations of scholars. The Etude Sociolinguistique d'Orléans, currently available from the University of Leuven, Belgium, was conducted by a group of university teachers of French at Reading University from 1966-70. The resulting electronic corpus comprises approximately 109 hours of spoken French, 902,755 words of which have been orthographically transcribed, with a further 13 hours of phonetic transcription. These transcriptions are available at: http://bach.arts.kuleuven.ac.be/elicop. Although French researchers, notably Claire Blanche-Benveniste and the GARS team in Aix and Mireille Bilger in Perpignan, have collected and transcribed spoken French, their corpora are not (yet?) available on line. Beeching's (1980-90) “Bristol” Corpus, which comprises 17.5 hours of orthographically transcribed speech, or 155,000 words, involving 95 speakers, aged 7 to 90 years with a balanced range of educational backgrounds, though small in scale, may be accessed at http://www.uwe.ac.uk/facults/les/staff/kb/CORPUS.pdf. Beeching (forthcoming b) suggests that pragmaticalisation is a process of language change during which words, used in everyday social interaction, shift in meaning or accrue a certain social semiotic, become habituated in that usage and are propagated because of a new fashion or prestige which is attached to them. Spoken corpora and corpus tools are an excellent heuristic in charting distributional frequencies or probabilistic factors. Andersen (2000) suggests that the upsurge of innit and like in the COLT Corpus of adolescent English may be more than age-grading.
The present paper will present broad-brush preliminary evidence with respect to the evolution of selected pragmatic particles in French.
The MEANING Italian Corpus Luisa Bentivogli, Christian Girardi, Emanuele Pianta ITC-irst Via Sommarive 18 - 38050 Povo (Trento) – Italy E-mail: {bentivo, cgirardi, pianta}@itc.it Abstract The MEANING Italian Corpus (MIC) is a large-size corpus of written contemporary Italian, which is being created at ITC-irst, in the framework of the EU-funded MEANING project. Its novelty consists in the fact that domain-representativeness has been chosen as the fundamental criterion for the selection of the texts to be included in the corpus. A core set of 42 basic domains, broadly representative of all the branches of knowledge, has been chosen to be represented in the corpus. The MEANING Italian corpus will be encoded using XML and taking into account, whenever possible according to the requirements of our NLP applications, the XML version of the Corpus Encoding Standard (XCES) and the new standard ISO/TC 37/SC 4 for language resources. A multi-level annotation is planned in order to encode seven different kinds of information: orthographic features, the structure of the text, morphosyntactic information, multiwords, syntactic information, named entities, and word senses. 1. Introduction A domain-based corpus can be a useful resource in different research areas. It is well known that domain-specific sublanguages exhibit specific features at various linguistic levels (Grishman and Kittredge, 1986). Linguistic analyses carried out on a multi-domain corpus can uncover differences in the lexicon and morphology, in names and named entity structures, and in lexical semantics, syntactic and discourse structure. The NLP community can find in a domain-based corpus a fundamental resource for several tasks such as, for example, parsing (Sekine, 1997), domain-dependent lexical acquisition, Word Sense Disambiguation (WSD), etc. In WSD, domain lexical information has proved to be very useful in the development of high precision algorithms (Magnini et al., 2003). Domain (also called topic, or subject matter) is one of the criteria for text selection and/or classification in many existing corpora. For instance, in the written component of the British National Corpus two main text selection criteria were used: “medium” and “domain”. More specifically, the BNC uses 9 knowledge domains (arts, social sciences, world affairs, etc.). Similar selection criteria (medium and domain) have also been adopted in the design of the American National Corpus (Ide and Macleod, 2001). The Brown and LOB corpora classify texts in 15 different text categories, but such categories are a mix of genre labels (bibliography, popular lore) and domain labels (religion, “skill, trade and hobbies”). The NERC report (Calzolari et al., 1995) offers a summary of the classification systems used by major corpus projects in Europe, showing that domains are generally used in the classification of the texts. The same holds for the most important corpora created for the Italian language. In the SI-TAL Italian Treebank (Montemagni et al., 2000) texts have been “selected to cover a good variety of topics”. The reference corpus CORIS/CODI (Rossini Favretti et al., 2001) is structured in subsections, some of which can be compared to domains. However, in all the mentioned corpora, a complete representation of domains is not pursued in a systematic way.
On the contrary, domain is the fundamental selection criterion for the texts to be included in the MEANING Italian corpus (MIC). The MIC is being developed with the aim of supporting domain-based WSD in the framework of the MEANING project (Rigau et al., 2002). MEANING is an EU-funded project which aims at enriching existing wordnets (for English, Spanish, Catalan, Basque, and Italian) by acquiring new lexical information from corpora. MEANING tries to exploit the inter-dependency between Word Sense Disambiguation (WSD) and knowledge acquisition by applying the following steps: 1. Train accurate WSD systems and apply them to very large corpora; 2. Use the partly disambiguated data in conjunction with shallow parsing techniques and domain information to extract new linguistic knowledge to be incorporated into wordnets; 3. Re-train WSD systems and re-tag the corpus, exploiting the information acquired in the second step. The result of this cycle is twofold: the enrichment of the lexical resources with information acquired from the corpus and a multi-level linguistic annotation of the corpus itself. The rest of this paper is structured as follows. In Section 2 the structure of the MIC is described in detail. Section 3 deals with the encoding of the corpus, while in Section 4 the multi-level linguistic annotation of the corpus is illustrated with annotation scheme examples. Section 5 summarizes what has been done up to now and which tasks are still to be undertaken. 2. Corpus design The MIC is being created with the aim of representing the domains used in WORDNET DOMAINS (Magnini and Cavaglia, 2000), an extension of WordNet 1.6 where each synset has been annotated with at least one domain label, selected from a set of 164 labels hierarchically organized. The WORDNET DOMAINS hierarchy was created starting from the subject field codes used by current dictionaries, and the Dewey Decimal Classification system (DDC), a general knowledge organization tool that is continuously revised to keep pace with knowledge development. The DDC is the most widely used library classification system in the world and provides a very large and complete set of hierarchically structured domain labels (see DDC 1996). A core set of 42 basic domains (the second level of the WORDNET DOMAINS hierarchy) has been chosen to be represented in the MIC. A study carried out by Magnini and Gliozzo (2002) shows that these 42 domains have a domain-coverage equivalent to the domain-coverage of the DDC system. In the MIC, texts are assigned to a topic category on the basis of an existing, text-external, list of domain labels. The value of this kind of classification is one of the central controversial areas of text typology, as pointed out in the EAGLES preliminary recommendations on text typology (Sinclair and Ball, 1996). This report argues that it is not possible to classify the texts produced in the world on the basis of a limited list of topics, chosen on a text-external, a priori basis; there are too many possible methods for identifying the topic of a text. Also, the boundaries between topics are blurred, and texts usually cover a variety of topics. Rather, the topic(s) of a text should be identified on the basis of text-internal evidence such as vocabulary clustering. However, while claiming that internal evidence should be the primary criterion for the identification of a text topic, the EAGLES report admits the possibility of a defensible use of topic categories based on a few external criteria.
These are the sectionalisation of newspapers, some topic-related classifications institutionalized in a society (in particular lists of recognised professions and educational courses), and, where it exists, the self-classification of the text. We recognize the problems mentioned above and we agree with the position that there is no objective, scientific means of assigning topics. However, a commonly accepted topic classification scheme based on internal criteria has not been developed yet. Moreover, some practical considerations must be made. In current corpus practice, text-external criteria are widely used to assign topics to texts. As shown in the introduction, the topic categories given in the NERC report have a common ground in many or most of the corpora studied. The MIC is in line with the trend in corpus practice, as most of the commonly used topics reported in that document correspond to our basic domains. Moreover, as we will see below, in the construction of the corpus we exploit all the acceptable external criteria mentioned in the EAGLES report. Coming back to the model designed for the creation of the MIC, we were faced with two requirements. First, we needed completeness, i.e. we wanted all of the 42 domains to be represented. Second, we wanted the corpus to reflect the fact that different domains do not have the same relevance in the language. To meet the completeness requirement we are creating a micro-balanced corpus composed of 42 subcorpora, each representing a basic domain. On the other hand we are creating a macro-balanced corpus, i.e. a homogeneous corpus of the contemporary Italian language created without taking into account the domain criterion but in which we know that most domains are represented. This corpus will allow us to verify in an independent way the relevance of the different domains in the generic language. Figure 1 shows the overall structure of the MIC. As regards the other corpus building criteria, the MIC represents only the written (electronic) mode, relying on written texts already available in electronic form. The media used are essentially three: newspapers, press agency news, and web documents. The genre is mainly that of informative, “factual” prose. An important characteristic of the corpus is that a part of it is bilingual. It includes 5 million words of aligned parallel English/Italian news and the first version of MultiSemCor (Bentivogli and Pianta, 2000), which is a bilingual aligned parallel corpus semantically tagged with a shared inventory of senses. Up to now MultiSemCor consists of 30 English texts of the SemCor corpus (a subsection of the Brown corpus semantically tagged with WordNet senses) along with their Italian translations, for a total of about 120,000 words.
Figure 1 Corpus Composition
2.1 The micro-balanced component The micro-balanced section of the MIC will be composed of 42 subcorpora representing the 42 basic domains selected from WORDNET DOMAINS (reported in Table 1). To create the subcorpora, we take into account the whole hierarchy of WORDNET DOMAINS. This means that for each subcorpus we look for texts belonging not only to the corresponding basic-level domain, but also to the more specific domains related to it in the hierarchy. It is important to underline that the micro-balanced corpus is not composed of specialized texts, as we do not aim at creating specialized corpora but a general language corpus in which all the domains are covered.
Given the fact that the 42 basic domains seem to have a different absolute relevance, we distinguish major domains (e.g. Economy and Sport) and minor domains (e.g. Linguistics and Astronomy). Each major domain subcorpus will include 2 million words while the minor subcorpora will be composed of 1 million words each. The texts to be included in the micro-balanced corpus come from three main sources: press agency news, newspaper weekly special supplements, and web documents. Each domain subcorpus should be balanced with respect to the three media; however for some subcorpora most of the texts will be web documents as it is unlikely that we will be able to find enough news or supplements belonging to those domains (see for instance Mathematics, Pedagogy, Archeology).
Administration   Artisanship   Computer science   Law           Philosophy   Sexuality
Agriculture      Astrology     Earth              Linguistics   Physics      Sociology
Alimentation     Astronomy     Engineering        Literature    Play         Sports
Anthropology     Biology       Economy            Mathematics   Politics     Telecommunications
Archaeology      Body care     Fashion            Medicine      Psychology   Tourism
Architecture     Chemistry     History            Military      Publishing   Transport
Art              Commerce      Industry           Pedagogy      Religion     Veterinary
Table 1 Basic domains in WordNet Domains - Version 1.1
2.1.1 Press agency news and special supplements The press agency news were collected through the Excite (http://www.excite.it) and Virgilio (http://www.virgilio.it) portals. They come from the following press agencies: Reuters, ANSA, ASCA, DataSport, and ADNKRONOS (parallel Italian/English). Supplements come from a wide circulation newspaper called “La Stampa”, which contains weekly special supplements dealing with science (“Tuttoscienze”), books (“Tuttolibri”), finance (“Tuttosoldi”), television (“TV”), cars and motorbikes (“Speciale motori”), agriculture (“Speciale agricoltura”), Italian elections (“Speciale elezioni”) and local events in the town of Turin (“Speciale città” and “Torinosette”). To speed up the creation of the micro-balanced corpus, we explored the possibility of developing a methodology for the (semi-)automatic classification of news and special supplements to be assigned to the various subcorpora. This is made easier by the fact that both press agency news and special supplements are already classified by the publishers with two different kinds of information: broad topic and keywords. News are divided into 9 broad topics, namely economy, politics, cars and motorbikes, artistic performances, sports, science and technology, generic news, foreign news, local news. Also the 9 special supplements can be considered self-classified in 9 broad topics given in their title (science, finance, etc.). Moreover, one or more keywords, often corresponding to domain labels, are always associated with each piece of news and supplement article. We studied how to develop a procedure able to exploit this information to (semi-)automatically assign news and supplements to the appropriate domain. To develop and evaluate the procedure, a development and a test set have been created for both news and supplements. The development set is composed of all the 20,399 news collected from the Excite portal in five months, from April to August 2002.
As the news constitute an ever-growing open set, it is important to verify the productivity of the procedure when applied to news belonging to a different period of time. For this reason, the time-span covered by the test set was kept different from that of the development set. The test set was created by selecting 500 out of 15,014 news items, collected in five months from September 2002 to January 2003. The news in the test set were chosen randomly, keeping the temporal distribution and the proportions of the broad topics with which they were classified by Excite. As regards supplements, we had at our disposal newspaper special supplements covering a time-span of 10 years (from 1992 to 2001), for a total of 66,927 articles. As these supplements represent a closed set, both the development and test set can be selected from the same period of time. We selected 30,000 articles for the development set and 500 for the test set, randomly chosen but homogeneously distributed over the 10 years and keeping the proportions of each broad topic. The texts composing the two test sets were read and manually assigned to the appropriate domain. Note that the texts have been assigned to the most specific domain available among the WORDNET DOMAINS. As a first step, we tried a simple algorithm that can be considered as a baseline for our experiment. We associated with each domain a set of Italian words currently used to refer to that domain. This set of words (domain word set) has been manually created and contains the lemma of the domain, possible morphological variants, and possible synonyms. As an example, the following is the domain word set associated with the pharmacy domain:
Domain: PHARMACY
Domain word set: farmacia (pharmacy), farmaceutica (pharmaceutics), farmacologia (pharmacology)
Then, for each domain a procedure looks for a match between words in the domain word set and keywords associated with the texts. This procedure assigns a text (piece of news or article) to a domain if at least one of the words contained in the domain word set corresponds to one of the keywords associated with the text. The procedure exploits information about keywords, as their granularity is similar to that of domains. The 9 broad topics are too generic to be useful for the baseline algorithm. This procedure relies entirely on a priori information, as it does not require any kind of analysis of the development set. The results of the application of the baseline procedure to the test sets of the news and the special supplements are shown in Table 2. In the evaluation, the domain assigned by the procedure and the manually assigned domain are considered to match if they are equal or if they have a common ancestor that is a basic domain. Since these results show, especially for the news, a very low recall, a second procedure has been developed, based on a number of rules manually written on the basis of the study of the development set. These rules exploit wider information than the baseline algorithm:
• keywords that are not in the domain word set, but are somehow related to the domain
• the broad topics
• the words in the title
Table 3 shows three sample rules, which apply to the pharmacy and the computer science domains. The first rule only considers information about the text keyword(s). The second rule looks at both keywords and words in the title. The last one considers both keywords and broad topic.
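To make the two procedures more concrete, the following Python fragment is a rough sketch of the keyword-matching baseline and of one hand-written rule of the kind shown in Table 3 below. It is illustrative only: the function names, data structures and the Computer science word set are invented here and do not reproduce the project's actual implementation.

# Hypothetical domain word sets; only the PHARMACY entries come from the text above.
DOMAIN_WORD_SETS = {
    "PHARMACY": {"farmacia", "farmaceutica", "farmacologia"},
    "COMPUTER SCIENCE": {"informatica", "computer"},   # invented example entries
}

def baseline_assign(keywords, domain_word_sets=DOMAIN_WORD_SETS):
    """Assign a text to every domain whose word set shares a word with its keywords."""
    keywords = {k.lower() for k in keywords}
    return [domain for domain, words in domain_word_sets.items() if words & keywords]

def rule_computer_science(keywords, title_words, broad_topic):
    """A rule in the spirit of Table 3: keyword 'internet' plus the science/technology broad topic."""
    return "internet" in keywords and broad_topic == "tecnologia e scienza"

print(baseline_assign({"farmaco", "farmacia"}))                            # ['PHARMACY']
print(rule_computer_science({"internet"}, set(), "tecnologia e scienza"))  # True

The evaluation of both procedures against the manually labelled test sets is reported in Tables 2 to 4 below.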
                       Precision   Recall   Coverage
News                   0.72        0.15     0.20
Special supplements    0.54        0.56     0.70
Table 2 Performances of the baseline procedure on news and supplements
PHARMACY          if   KEYWORD = farmacia (pharmacy) or farmaceutica (pharmaceutics) or farmacologia (pharmacology) or farmaco (medicine) or vaccine (vaccine)
PHARMACY          if   KEYWORD = epatite (hepatitis) or morbillo (measles) or meningite (meningitis) or antipolio (polio) or virus (virus)
                  and  TITLE = vaccine (vaccine)
COMPUTER SCIENCE  if   KEYWORD = internet (internet)
                  and  BROAD TOPIC = tecnologia e scienza (science and technology)
Table 3 Examples of rules
This second procedure has been developed only for the press agency news and gives the results shown in Table 4 below. The precision decreases but both recall and coverage improve significantly. Unfortunately, despite these improvements, the precision of both algorithms is still insufficient to avoid manual intervention. Thus, we plan to use the results of the application of the second algorithm for supporting humans in creating the 42 subcorpora. Manual work will be sped up as corpus builders will only have to check whether the assignment of the text to a domain is correct or not, a task which is much simpler than assigning a text to one of the 42 subcorpora. In order to test the applicability of the hand-written rules, developed for the press agency news, to other kinds of texts, we also applied them to the articles of the special supplements. The application of the second procedure to the supplements does not change significantly the results obtained with the baseline algorithm: precision goes from 0.54 to 0.53, and recall from 0.56 to 0.57. These results demonstrate that the rules created on the basis of the press agency news are specific to the news themselves and cannot be reused for different kinds of texts. 2.1.2 Web documents The third main source of texts to be included in the micro-balanced corpus is the web. The web gives access to colossal quantities of texts of any type and more and more linguists and language technologists rely on it as a huge source of corpus materials (see Kilgarriff, 2001). The MEANING project itself treats the web as a corpus to learn information from it, with the final aim of opening the way for concept-based access to the Multilingual Web. Despite the web's usefulness for corpus research, when trying to collect web documents we have to face several problems: the web contains duplicates or very similar documents, not all documents contain enough text, they may contain mixes of languages, and so on. As it is impossible to visit, download and manually classify the millions of web pages available, we are at the moment studying how to devise automatic methods to draw materials from the web for inclusion in the corpus.
        Precision   Recall   Coverage
News    0.64        0.44     0.55
Table 4 Performance of the second procedure on the news
2.2 The macro-balanced component The macro-balanced corpus is being created in order to evaluate in an independent way the relevance of the domains in a generic corpus. This corpus is not intended to be a reference corpus for the Italian language, as it is not balanced with respect to different literary genres, media, modes, and styles. It is a homogeneous corpus composed of two general high circulation newspapers (“La Repubblica” and “La Stampa”) in which we expect most domains to be represented. The macro-balanced corpus contains about 90 million running words covering a time-span of 4 years (1998-2001).
This time-span has been chosen in order to keep the corpus comparable with the other corpora of the MEANING consortium. We assume that in the selected material the most common topics dealt with in periodicals are represented, giving us a picture of the distribution and proportions of the topics within the corpus. This will allow us to verify the relevance of the different domains in the current language. Table 5 summarizes the data about the texts we included in the corpus.
                 Size (tokens)   Time-span
La Repubblica    38 million      2000-2001
La Stampa        48 million      1998-1999
Table 5 Structure of the macro-balanced component
3. Corpus encoding The corpus will be encoded using XML as a common data format. We will take into account, whenever possible according to the requirements of our NLP applications, the Corpus Encoding Standard for XML (XCES) guidelines and the new standard ISO/TC 37/SC 4 for language resources (Ide and Romary, 2002). We chose full text as the type of sample for the corpus, that is, the complete newspaper article, piece of news, or other document is taken as the minimum size of the text. Each text is stored in a separate file. CES distinguishes three broad categories of information which are of direct relevance for the encoding of corpora for use in NLP applications:
• Documentation, which includes global information about the text, its content, and its encoding.
• Primary data, which consist of the text marked up with information regarding both gross structure (paragraphs, chapters, titles, footnotes, etc.; features of typography and layout; non-textual information, such as graphics, etc.) and sub-paragraph structures (sentences, highlighted words, dates, abbreviations, etc.)
• Linguistic annotation, i.e. information added to the primary data as a result of some linguistic analysis.
In the MIC, documentation about each text will be included in the form of a separate XCES-conformant header. All the original texts are stored in the legacy corpus, which is kept as a backup corpus. Then, to obtain the encoded version of the corpus, the legacy texts undergo a series of transformations. To this end, a number of normalization scripts have been implemented. In the CES guidelines primary data (i.e. the text itself marked up with information about its structure) form the so-called base or hub text. The hub text does not include linguistic annotations, which are stored in separate documents and linked to the hub text or other linguistic annotation documents. In the encoding of the MIC, we follow CES guidelines in retaining linguistic annotation in separate documents. However, we differ from CES in the way we treat primary data. In fact, we prefer our hub corpus to be completely plain, i.e. pure text without any type of markup (apart from carriage returns). Thus the encoding of the primary data is not kept together with the text itself: primary level information is coded in the same way as linguistic information and is stored in different files separated from the hub text. 4. Corpus annotation A multi-level annotation of the corpus is planned in order to encode seven different kinds of information: orthographic features, the structure of the text (primary data, level 1 and 2), morphosyntactic information, multiwords, syntactic information, named entities (primary data, level 3), and word senses. All annotations are performed automatically, using linguistic tools developed at ITC-irst. Information about each level of annotation is stored in separate documents.
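As a rough illustration of this standoff design, the following Python fragment shows how an orthographic layer of the kind described in Section 4.1 below might be derived from a plain hub text: the hub stays untouched, and each token is recorded elsewhere with an ID, its character offsets and its case. It is a minimal sketch under simplifying assumptions (naive whitespace tokenization, an invented record format), not the ITC-irst tool chain.

def orthographic_layer(hub_text):
    """Tokenize naively on whitespace and return standoff records pointing into hub_text."""
    records, offset = [], 0
    for i, token in enumerate(hub_text.split(), start=1):
        start = hub_text.index(token, offset)   # character position in the hub text
        end = start + len(token)
        offset = end
        case = ("upper" if token.isupper()
                else "capitalized" if token[0].isupper()
                else "lower")
        records.append({"id": f"t{i}", "from": start, "to": end, "case": case})
    return records

hub = "Il Ministero della Sanità dice che coi superalcolici bisogna andarci veramente piano."
for record in orthographic_layer(hub)[:3]:
    print(record)
# {'id': 't1', 'from': 0, 'to': 2, 'case': 'capitalized'} ...

An XML serialization of such records, linked back to the hub file by the character offsets, would correspond to the orthographic annotation document described below.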
Following the CES recommendations (see Section 3), all annotation documents are linked to the hub corpus or other annotation documents using one-way links. Two different means can be used to specify locations, namely reference to a unique identifier (ID) and reference to the position of the characters in the text. We use the character position locators to link the orthographic annotation to the hub corpus. ID locators are used to link all the other linguistic annotation documents. In the next sections, information about the different kinds of annotation is given. All the examples reported refer to parts of the same sentence: “Il Ministero della Sanità dice che coi superalcolici bisogna andarci veramente piano” (Eng. “The Department of Health and Human Services says that people must take it really easy with liquors”). 4.1 Orthographic annotation The corpus is automatically tokenized and each token is annotated with:
• token ID
• location in the hub corpus
• case (upper, lower, capitalized)
Example: “Il Ministero…” (The Department…)
Il          capitalized
Ministero   capitalized
4.2 Structure annotation At this level of annotation, primary data (level 1 and 2) are encoded. As said before, this information is stored in a document separated from the hub file, which contains only the pure text without any tags. The following information is recorded:
• text divisions, paragraphs, sentences, rendition information, etc. (i.e. all structural information)
• ID for text divisions, paragraphs, and sentences
• link to token IDs in the orthographic annotation file
Example:

Il Ministero della Sanità dice che coi superalcolici bisogna andarci veramente piano.
Negli ultimi anni, infatti, il numero di cirrosi epatiche è in continuo aumento.
4.3 Morphosyntactic annotation After PoS tagging and lemmatization each token in the corpus is annotated with its morphosyntactic information, that is:
• word ID
• link to token ID in the orthographic annotation file
• lemma, stem, PoS, form (when necessary), morphological features (gender, number, mood, tense, person)
Moreover, if the word belongs to a multiword:
• link to the multiword ID in the multiwords annotation file
• function of the word in the multiword (head, satellite)
As regards PoS tags, the tagset applied is a subset of the tagset specified in the EAGLES Guidelines for morphosyntactic annotation.
Example: “andarci veramente piano” (Eng. “take it really easy”)
andare  andar  v  VF  inf  pres  head
ci  pron  +E  satellite
veramente  avv  B
piano  avv  B  satellite
4.4 Multiwords annotation All expressions in the corpus which are multiwords are coded with the following information:
• multiword ID
• PoS, lemma
• link to the word IDs of the components in the morphosyntactic annotation file
• function of the component words (head, satellite)
Example: “andarci piano” (Eng. take it easy)
andarci_piano  v  head satellite satellite
If the multiword is present in our reference lexicon MultiWordNet1 (Pianta et al. 2002), PoS and lemma are those of MultiWordNet. 4.5 Named Entities annotation All named entities in the corpus are recognized and coded as such with the following information:
• named entity ID
• type of named entity
• link to the word ID or multiword ID in the respective annotation files
Example: Ministero della Sanità (Department of Health and Human Services)  organization
The tagset applied to annotate named entities is the one adopted in the framework of the DARPA/NIST HUB4 evaluation exercise. 4.6 Word sense annotation Content words and multiwords in the corpus which are present in MultiWordNet are disambiguated according to MultiWordNet synsets. The annotation includes:
• link to the word ID or multiword ID in the respective annotation files
• MultiWordNet lemma, PoS, and synset ID
Example: “bisogna andarci piano” (Eng. (people) should take it easy)
bisognare  v  v#3990811
andarci_piano  v  v#03437782
4.7 Syntactic annotation Syntactic annotation will be carried out only in the last phase of the creation of the MIC. The precise encoding of the syntactic annotation has not been decided yet. However, we plan to automatically annotate at least the main phrases of the sentence by using shallow parsing (phrase chunking) techniques. 5. Summary and conclusions The MEANING Italian corpus has been presented in this paper. MIC is being developed in the framework of the MEANING project with the aim of supporting word sense disambiguation; however, a domain-based corpus can be a very useful resource not only for natural language processing applications but also for different kinds of linguistic analyses. The corpus is on its way to realization. Its overall structure has been designed and the multi-level annotation scheme has been developed. The macro-balanced component has been created, normalized and linguistically annotated up to the level of morphosyntactic annotation. XCES-conformant headers for each text have been automatically created. As regards the micro-balanced component, we are collecting materials from different sources and we are devising semi-automatic procedures to speed up its construction. Our work will go on until the corpus is entirely created and all the levels of linguistic annotation have been performed.
Acknowledgments We would like to thank Pamela Forner for her extensive and invaluable work on the creation of the corpus, and Claudio Giuliano, Nancy Ide, and Tomaz Erjavec, who offered helpful comments on the development of the corpus annotation scheme.
References
Bentivogli L, Pianta E 2002 Opportunistic semantic tagging. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, Spain.
Calzolari N, Baker M, Kruyt T (eds) 1995 Towards a network of European reference corpora. Report of the NERC Consortium Feasibility Study, coordinated by Antonio Zampolli. Linguistica Computazionale XI-XII. Pisa, Giardini.
Dewey Decimal Classification and Relative Index 1996, Ed. 21, edited by J.S. Mitchell, Forest Press, Albany.
Fellbaum C (ed) 1998 WordNet: An Electronic Lexical Database. Cambridge (Mass.), The MIT Press.
Grishman R, Kittredge R (eds) 1986 Analyzing Language in Restricted Domains. Lawrence Erlbaum.
Ide N, Romary L 2002 Standards for Language Resources. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, Spain.
Ide N, Macleod C 2001 The American National Corpus: A Standardized Resource of American English. In Proceedings of Corpus Linguistics, Lancaster, UK.
Kilgarriff A 2001 Web as corpus. In Proceedings of Corpus Linguistics 2001 Conference, Lancaster, UK.
Magnini B, Strapparava C, Pezzulo G, Gliozzo A 2003 The Role of Domain Information in Word Sense Disambiguation. Journal of Natural Language Engineering (special issue on Senseval-2) 9(1).
Magnini B, Cavaglia G 2000 Integrating Subject Field Codes into WordNet. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece.
Magnini B, Gliozzo A 2002 Mapping WordNet Domains to the Dewey Decimal Classification. ITC-irst Technical Report.
Montemagni S, Barsotti F, Calzolari N, Corazzari O, Zampolli A, Fanciulli F, Massetani M, Raffaelli R, Basili R, Pazienza M T, Saracino D, Zanzotto F, Mana N, Pianesi F, Del Monte R 2000 Building the Italian Syntactic-Semantic Treebank. In Building and using syntactically annotated corpora. Kluwer, Dordrecht.
Pianta E, Bentivogli L, Girardi C 2002 MultiWordNet: developing an aligned multilingual database. In Proceedings of the First Global WordNet Conference, Mysore, India.
Rigau G, Magnini B, Agirre E, Vossen P, Carroll J 2002 MEANING: A Roadmap to Knowledge Technologies. In Proceedings of the COLING Workshop “A Roadmap for Computational Linguistics”, Taipei, Taiwan.
Rossini Favretti R, Tamburini F, De Sanctis C 2001 A corpus of written Italian: a defined and dynamic model. In Proceedings of Corpus Linguistics 2001 Conference, Lancaster, UK.
Sekine S 1997 The Domain Dependence of Parsing. In Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington D.C., USA.
Sinclair J, Ball J 1996 EAGLES Preliminary Recommendations on Text Typology.
(http://www.ilc.pi.cnr.it/EAGLES96/texttyp/texttyp.html)
Discovering Regularities in Non-Native Speech Julie Carson-Berndsen 1, Ulrike Gut 2 & Robert Kelly 1 1 Department of Computer Science, University College Dublin, Ireland, 2 Faculty of Linguistics and Literary Studies, University of Bielefeld, Germany {Julie.Berndsen, Robert.Kelly}@ucd.ie gut@spectrum.uni-bielefeld.de Fax: +353 1 2697262 Fax: +49 521 106 6008 This paper presents ongoing collaborative research which focuses on the application of computational linguistic techniques to the analysis of a corpus of native and non-native speech. The aim of this research is to use computational tools for phonological acquisition and representation to identify regularities and sub-regularities between different speaker groups. The corpus is being collected and annotated at different levels as part of ongoing research into the acquisition of prosody by non-native speakers at the University of Bielefeld (see Milde & Gut, 2001). The computational tools have been designed and implemented at University College Dublin as part of a suite of tools aimed at providing a development environment for modeling, testing and evaluating phonotactic descriptions of lesser-studied languages (Carson-Berndsen, 2002). The two hitherto separate research directions have now come together to apply computational linguistic tools to a corpus-based investigation of non-native speaker phonotactics. The term phonotactics refers to the permissible combinations of sounds in a language. There are various ways of acquiring representations of phonotactic constraints. One approach is to manually construct a set of rules based on the linguistic intuitions of a native speaker. Another approach is to learn such constraints from a data set. The latter is the approach taken in this paper. Currently the corpus consists of 253 annotated recordings of between 2 and 30 minutes in length by 88 different speakers with 21 different native languages. The corpus is annotated at a number of linguistic levels such as the level of the intonational phrase, the word, the syllable, and the skeletal (CV) structure, and comprises annotations of the prosodic structures of intonation and pitch range. Each level of annotation is viewed as a tier, analogous to the representations of autosegmental phonology (Goldsmith 1990). Analysis can take place either with respect to individual tiers or with respect to an associated set of tiers. In the latter case, one tier is chosen as the primary tier and the others are associated with it in terms of overlap and precedence relations between the units as suggested in Carson-Berndsen (1998: 60). Using the computational linguistic tools, finite state automaton and finite state transducer representations of the tiers are extracted automatically from the annotated corpus. Regularities in the data are then identified either with respect to a single tier or with respect to an associated set of tiers. The majority of previous studies on the acquisition of phonotactic constraints are based on small numbers of participants, which reflects the time-consuming nature of a manual analysis of this kind of data. Results on an initial subset of the corpus for Italian and Polish speakers of German demonstrate that the task of identifying phonotactic regularities and sub-regularities in the corpus data can be performed elegantly and efficiently using the computational linguistic tools.
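Purely as an illustration of what learning phonotactic constraints from a data set can mean in practice, the following Python sketch collects the attested transitions on one tier into a simple finite-state-style transition table and flags transitions in new material that were never observed. It is an invented toy, with hypothetical data and functions, and not the finite-state tools developed at University College Dublin.

from collections import defaultdict

def learn_transitions(sequences):
    """Collect attested symbol-to-symbol transitions, with '#' as a word boundary."""
    allowed = defaultdict(set)
    for seq in sequences:
        padded = ["#"] + list(seq) + ["#"]
        for a, b in zip(padded, padded[1:]):
            allowed[a].add(b)
    return allowed

def violations(seq, allowed):
    """Return the transitions in seq that were never attested in the training data."""
    padded = ["#"] + list(seq) + ["#"]
    return [(a, b) for a, b in zip(padded, padded[1:]) if b not in allowed[a]]

# Toy CV-skeleton sequences standing in for the native-speaker tier.
native = [["C", "V", "C"], ["C", "C", "V", "C"], ["V", "C"]]
model = learn_transitions(native)
print(violations(["C", "V", "V"], model))   # [('V', 'V'), ('V', '#')]

Comparing such tables learnt from native and non-native subcorpora, tier by tier, gives a flavour of the regularity hunting reported here, although the actual tools operate on full automaton and transducer representations of the tiers.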
Distinct differences between the phonotactic violations produced by the Italian and the Polish speakers have been found and will be documented in detail in the full paper. Further experiments are now underway on a larger corpus.
NLP model and tools for detecting and interpreting metaphors in domain-specific corpora P. BEUST, S. FERRARI, V. PERLERIN GREYC, Computer Science Laboratory, University of Caen - bd Maréchal Juin F14032 Caen Cedex - France Abstract The aim of this paper is to present how a user-centred lexical representation model, based on the theory of Interpretative Semantics, can be used for detecting and interpreting metaphors in domain-specific corpora. We present here several tools useful for such tasks and discuss the results of an experiment. Introduction In this paper, we present an NLP (Natural Language Processing) project addressing the interpretation process. This project, called “ISOMETA1”, focuses on computer-assisted metaphor interpretation following a user-centred point of view. We propose a model for lexical representation as well as tools for validation on corpora. In the first section, we give an overview of some previous approaches related to metaphor detection and interpretation in order to highlight the main concepts we deal with. We also introduce the theoretical background for knowledge representation and text interpretation underpinning our approach. In the second section, we argue for user-centred lexical representations and we present our model for this purpose (called Anadia) as well as practical examples. This model enables automatic computing of customized help for interpretation by means of the isotopy concept. We detail how to produce such help when dealing with conventional metaphors. In the third section, we present some of the tools implementing our main propositions. AnadiaBuilder is a user-friendly interface to build structured lexical representations. Complementary tools have been developed for corpus analysis, producing graphical representations for easy browsing through the results and customized help for interpretation. In the last section, we present the results of an experiment on a domain-specific corpus. We study examples of a specific conventional metaphor: the stock market domain expressed with meteorological terms. Finally, we discuss how to carry out an evaluation of our work. We also propose other applications of our model and tools. We conclude by pointing out the main directions for further developments and the next steps for the “ISOMETA” project. 1 Framework 1.1 Metaphors in NLP It is generally agreed that a metaphor involves two concepts: a source concept, related to the words used metaphorically, also called the vehicle of the metaphor, and a target concept, which is what the metaphor is used for and tries to describe, also called the tenor of the metaphor. If we consider the following example, first proposed by Wilks (1978), and still studied by Fass (1997): (1) “My car drinks gasoline”, the source of the metaphor is the action of drinking, and the target may be described as the use of gasoline by a car. The different NLP approaches for metaphor interpretation mainly depend on how the relation between the source and the target is viewed: as an analogy, as a novelty, or as an anomaly. In (Gentner, 1983; Falkenhainer et al., 1989), this relation is mostly viewed as an analogy. Thus, interpreting a metaphor requires deeply structured knowledge representations in order to trace back and describe the analogy between concepts.
In (Indurkhya, 1992; Gineste et al, 1997), the relation between the source and the target is viewed as a novelty: it is not a pre-existing similarity but one created by the existence of the metaphor. Thus, interpreting it requires the dynamic selection and transfer of knowledge from the source domain to the target domain. 1 “ISOMETA” stands for ISOtopy and METAphor. 114 Metaphor may also be viewed as a semantic anomaly. In example (1), there is an anomaly if one considers that “drinking” does not normally apply to physical objects such as cars. As shown by Martin (1992), metaphors are not always anomalies, and anomalies are not always metaphors. For instance, in: (2) “McEnroe killed Connors” (ibid), there is no anomaly, nonetheless “killed” may be viewed as metaphoric. Only contextual information can help for disambiguating the whole sentence. Fass (1997) proposes a method for discriminating semantic relations, which makes a clear distinction between metaphors and anomalies. This method makes it possible for multiple interpretations to coexist, as in example (2). It is not necessary to focus on the relation between the source and the target to interpret metaphors. Kintsch (2000) shows how the meaning of a metaphor can be interpreted and represented by a multi-dimensional vector, exactly like other meanings in the Latent Semantic Analysis approach. We also consider that metaphors require the same interpretation process as other meanings. We do not focus on the relation between the source and the target either. But in our approach, we use a symbolic representation in order to provide a novice user with easily understandable tools. Lakoff and Johnson (1980) introduced the notion of conventional conceptual metaphor, based on the observation that, for some semantic domains, multiple terms from a common source domain may be used to describe metaphorically multiple corresponding concepts from a common target domain. In (Ferrari, 1997), such conventional metaphors are studied in the scope of domain specific corpora. For instance, he observed that stock market events are often described by meteorological terms in newspaper articles related to economics. In our work, we look at conventional metaphors in order to use the pre-existent knowledge that the target domain may be partly structured as the source domain. We focus on the previous example, which we call “economics is meteorology”. Using limited and user-centred resources, we try to track down the analogy and the novelty points of view. In the next section, we present the linguistic basis of our approach. 1.2 Knowledge representation and text interpretation The lexical representation and the analysis process we use are mainly inspired by continental structural linguistics (Greimas, 1966; Pottier, 1987) and especially by the linguistic theory developed by F. Rastier (1987): Interpretative Semantics. In this theory, the interpretation is considered as a description of semantic units located both in a linguistic unit (corpus, text, sentence...) and a situation. Interpretation involves an interpreter, along with his knowledge, his goals and his social relation2 to these given linguistic units. Thus, the meaning of a word, for instance, is not a definition of this word, as could be found in a dictionary, but rather an explanation of its role in a given linguistic unit. A lexical content is described in terms of meaning components, themselves described in terms of semantic features called semes. 
For example, the lexical item “depression” can be related to a ‘meteorological phenomenon’ or a ‘mental state', and the meaning component ‘meteorological phenomenon’ can be represented with the following semes: /area/, /low pressure/, /bad weather/… Such a description is called a componential representation. Semes depend both on the user and on the task. They are potential meaning features, relevant only in specific contexts. The notion of isotopy, introduced by Greimas (1966), characterizes these contexts. An isotopy is the recurrence of one seme in a linguistic unit. For instance, in this paper, one may notice at least two main isotopies related to ‘computer science’ and ‘linguistics', supported by many different lexical items. In our work, we focus on lexical items from two domains, meteorology and stock market, in order to describe the underlying conventional metaphor. In the next section, we present Anadia, the model we have previously developed for such lexical representations, and show how to use it for metaphor processing. 2 A model for lexical representation 2.1 Main principles The main principles of our model have been described in details by Beust (1998) and Nicolle et al. (2002). Anadia is a model of lexical categorization based on both componential and 2 We are talking about the relation to linguistic units through social role. For instance, a juridical text is differently interpreted by a lawyer and by common people. 115 differential representation. The differential paradigm states that a lexical content can be described by opposing it to others through structural relations, following the notion of “linguistic value” proposed by Saussure (1915). The Anadia model allows a user to produce descriptions of meaning components by the way of semes, which are the componential part of the representation. Rather than classical componential representations, semes are represented by a set of opposite features. This is the basis of the differential part of the representation. For example, “depression” can be described as the combination of the semes [Zone] and [Pressure] respectively corresponding to the opposite features “area vs. line” and “low vs. high”. The activated features for “depression” are area and low. These semes also allow a semantic representation of the lexical item “anticyclone” described by the activated features area and high. Lexical items representations are therefore made from the combination of semes. In this way, our model allows its user to build tables where lexical items can be described in terms of differences and common points, as shown in Figure 1. Figure 1. Example of an Anadia table describing some pressure zones3. In Figure 1, the combination of the semes [Zone] and [Pressure] gives rise to four table rows in which lexical items can take place. When there are several lexical items in the same row, it implies that their semantic representations are not considered as different in this table (in another one, they could be differentiated). It is the case for “tropical wave” and “easterly wave” in the example. A row can stay empty if we do not know any lexical item corresponding to a certain combination of features. It is the case for the combination of ‘line’ and ‘high’ in the example. A row can also be filled in later if we find a corresponding lexical item (for instance, by the way of a corpus study). Several tables can be used to describe a specific semantic domain. 
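To make the table-building mechanism concrete, the following sketch (ours, not part of the Anadia software; all names are illustrative) generates the rows of such a table as the combinations of seme features and records which lexical items occupy which row, as in Figure 1.

```python
from itertools import product

# A seme is a name plus a tuple of mutually exclusive features.
SEMES = {
    "Zone": ("area", "line"),
    "Pressure": ("low", "high"),
}

def build_table(semes):
    """One row per combination of features; rows start out empty and are
    filled (or left empty) by the user, as in an Anadia table."""
    names = sorted(semes)
    return {combo: [] for combo in product(*(semes[n] for n in names))}

table = build_table(SEMES)            # keys follow sorted seme names: (Pressure, Zone)
table[("low", "area")]  += ["depression"]
table[("high", "area")] += ["anticyclone"]
table[("low", "line")]  += ["tropical wave", "easterly wave"]
# ("high", "line") stays empty until a corpus study supplies an item.

LEXICON = {item: dict(zip(sorted(SEMES), combo))
           for combo, items in table.items() for item in items}
print(LEXICON["depression"])          # {'Pressure': 'low', 'Zone': 'area'}
```

A second set of tables, for the stock market domain, would be built in exactly the same way.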
In such a set of tables, a table can be linked to a row in another table by a subcategorization relation (Figure 2). Figure 2. Extract from a set of tables for the stock market domain. The second row of the Domain objects table is linked to the Stock indices table by a relation of subcategorization. For many reasons (choice of semes, content of rows, subcategorization relations) tables represent the points of view of the user for a given task. Anadia is a user-centred model and the lexical representations built with the model are not supposed to be either universal or exhaustive. Tables can be modified and updated at any time, depending on the results obtained from the analysis process. Anadia tables allow proposing an analysis process based on the concept of isotopy. As shown by Tanguy (1997), isotopy can be seen as an easy and understandable way of expressing themes in linguistic units. Therefore, the interpretation process consists in finding isotopies in linguistic units. (3) During the three days immediately proceeding depression formation, anomalous moisture transforms from a pattern associated with a tropical wave transversing the open Atlantic Ocean ... (http://ams.confex.com/ams/25HURR/25HURR/abstracts/35268.htm) 3 The examples have been translated for this paper. 116 In example (3), using the representation of Figure 1, we notice that “tropical wave” and “depression” are described with the same semes: [Zone] and [Pressure]. These two recurring semes involve two isotopies that contribute to the meaning to the sentence. The recurring features also show that the sentence deals with pressure zones of different type : one corresponding to a ‘line’ of ‘low’ pressure and one to a ‘area’ of ‘'low’ pressure. 2.2 Using the model for metaphor processing The Anadia model was not originally designed for metaphor processing. The latter is just a specific task for which the model can be used. In order to study how the model can effectively be applied to metaphor processing, and what adjustments are to be made, we focus on the specific conventional metaphor: “economics is meteorology”. The model enables us to represent our lexical knowledge concerning the source and the target domains involved in this specific metaphor. Let us work on the assumption that one set of tables, set S, describes the lexical items of the source domain, meteorology, and a second one, set T, is dedicated to the target domain, stock market. At this point, the Anadia model enables us to use a single lexical item in multiple sets of tables. For instance, it is possible to represent “barometer” both in set S and in set T. In set S because it is a common term of meteorology, and in set T because we have noticed in newspaper articles that it is sometimes used in phrases such as “stock market barometer”, suggesting some economical tools for measures or predictions. This possibility becomes a problem when dealing with metaphors. If we want to use the model to detect the metaphorical use of “barometer” in phrases such as “stock market barometer”, we must not represent it in set T. Moreover, lexical items of set T must not be formed with words that can be considered as lexical items of set S. This is a first adjustment, or constraint, added to the Anadia original model: when building sets of tables for metaphor processing, it is necessary not to use words from a source domain in a set of tables for a target domain. Following this rule, “barometer” is now banished from the lexical items of set T. 
The reason for this is that when computing isotopies, the source semes are required to spot a metaphorical use. If “barometer” were in the two sets, S and T, its metaphorical use in “stock market barometer” would be ignored because an isotopy of words from set T would only hide the existence of semes from the source domain. It is important to notice that such a representation must not be considered as “wrong” and would not lead to misinterpretation. It would simply reflect the conventional aspect of the metaphor, which itself would be part of the knowledge of the user who would include “barometer” in the lexicon related to “stock market”. Assuming that S and T are now built according to that constraint, let us see how it is possible to spot a metaphor, and to what extent the lexical representation can produce guidance for its interpretation. The whole point is to detect an isotopy involving words from both the source and the target domain. On the one hand, with the Anadia model, isotopies are based on semes shared by lexical items involved in a single linguistic unit. On the other hand, previous works on conceptual metaphors have shown the existence of underlying structure analogies between the source and the target domains. It then stands to reason that the solution is to use some semes which are shared by lexical items from the two sets of tables, and which represent the structure analogy between the two domains. For example, if we use the seme [Role = studying, analysing vs playing a part] to describe “barometer” from the meteorology domain and “stock exchange” from the stock market domain, it then becomes possible to spot and produce guidance for interpreting the following metaphor : (4) a- “the Dow Jones is a stock exchange barometer”. The seme [Role] is here shared by two lexical items: “barometer” from the source domain and “Dow Jones” from the target domain. The fact that the lexical items involved belong to different domains is characteristic of a metaphorical use. The shared seme, creating an isotopy, is a first step for guiding the interpretation process. We shall discuss these points further in the following sections. At the moment, we can consider the use of shared semes as a second adjustment or constraint added to the model when processing metaphors. If sets S and T are built according to the two constraints presented in this section, it is not only possible to spot metaphors involving the lexical items initially used to organize the two sets, but also to process some of their extensions. Actually, when building the set of tables concerning meteorology, the user will probably consider lexical items such as “thermometer”, “mercury”, and propose to use the same seme [Role] to describe them. It will then be possible to process the following examples: (4) b- “the Dow Jones is a stock exchange thermometer” 117 (4) c- “the Dow Jones is the New York stock exchange mercury” even though the sets of tables were not originally designed for these specific metaphors. The next section presents tools developed in order to validate our model on corpus. 3 Tools The tools we created for our experiments are freely available for research purposes. They have been implemented with platform-independent languages (Java, XML and XSL). They can be used for different kinds of tasks including figurative language analysis (as shown in this paper) or for instance document retrieval (as shown in (Perlerin, 2001)). 
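Before turning to the tools, the detection mechanism of section 2.2 can be summarised in a small sketch. It is our own illustration, with invented seme assignments rather than the project's actual tables: an isotopy whose shared seme links an item of set S with an item of set T is reported as a metaphor candidate, as in example (4a).

```python
# Hypothetical seme assignments for a few items from the two sets of tables.
# 'Role' plays the part of a shared seme encoding the structural analogy.
SOURCE = {  # set S: meteorology
    "barometer": {"Role": "studying, analysing"},
    "storm":     {"Connotation": "bad"},
}
TARGET = {  # set T: stock market (must not contain meteorological words)
    "Dow Jones": {"Role": "studying, analysing"},
    "crash":     {"Connotation": "bad"},
}

def metaphor_candidates(matched_items):
    """Given the lexical items matched in one sentence, report semes whose
    recurrence (an isotopy) links an item of set S with an item of set T."""
    hits = []
    for s in matched_items:
        for t in matched_items:
            if s in SOURCE and t in TARGET:
                shared = {k for k in SOURCE[s]
                          if TARGET[t].get(k) == SOURCE[s][k]}
                if shared:
                    hits.append((s, t, shared))
    return hits

# (4a) "the Dow Jones is a stock exchange barometer"
print(metaphor_candidates(["Dow Jones", "barometer"]))
# -> [('barometer', 'Dow Jones', {'Role'})]
```

Extensions such as examples (4b) and (4c) come for free as soon as "thermometer" or "mercury" carry the same shared seme.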
3.1 AnadiaBuilder: a tool for building Anadia lexical representations AnadiaBuilder is software enabling to build lexical representations following the Anadia model (Nicolle et Al., 2002). The created data is stored in XML format. Via a user-friendly graphical interface, the user can build sets of tables according to the current task. The interface contains five main interactive panels: (A) (B) (C) (D) (E) The first one enables the user to create the semes he finds relevant for the representation. The user chooses the related sets of opposed features and an explicit name for each seme. The second one makes it possible to create tables made from the combination of semes (Figure 3). The user chooses the semes and the machine computes the combinations and automatically builds the table. The user fills in the cells (on the left-hand part of the table) corresponding to a given set of features from different semes with relevant lexical items. The third one displays a graphical representation of a table (called “topique” in French) showing the differences and the semantic proximity between lexical items by means of annotated links (Figure 3). The fourth one creates the relations between tables. It also makes it possible to see the whole set of tables through a schematic representation where only table names are displayed (Figure 4). In this panel, the user can allocate a colour to each table, which is useful for further corpus analysis. The last one is linked to the MAHTLEX lexical database, developed at the University of Toulouse4. For each lexical item, the computer proposes a set of inflections or enables the user to build the corresponding set of inflections by himself. Inflections will be used to match occurrences of lexical items in texts. At step (B), when building a table, if the user estimates that he can fill in several cells with the same lexical item, he must correct his proposals. This fact can happen because of two reasons. The chosen semes are not mutually exclusive, or the features of at least one seme are not mutually exclusive. The building contraints of the Anadia model are discussed by Beust (1998). Perlerin et Beust (2002) have undertaken an experiment with novice users. The results have shown that building a set of tables following the Anadia constraints is accessible to novice users. Such results may have to be moderated when dealing with a linguistic phenomenon such as metaphor. 4http://www.irit.fr/ACTIVITES/EQ_IHMPT/ress_ling.v1/accueil01.php 118 Figure 3. AnadiaBuilder: tables building panel and corresponding “topique” from the “topique” panel (extract of the screenshot). Figure 4. AnadiaBuilder: set of tables representation related to the stock market domain. Each set of semes, each set of tables or inflections dictionary can be saved independently and reused in different experiments. In particular, the sets of tables can be used for corpus analysis. Results are then produced as an annotated version of the corpus. Several tools help us to browse through the resulting corpus, mainly by the use of colours and charts. 119 3.2. Corpus analysis tools During the automatic part of the corpus analysis, all the possible occurrences of lexical items from the sets of tables are located in the texts. A first tool builds a graphical representation of each text in the corpus5, as shown in Figure 5. For one text, each table is represented by one bar inheriting its colour. Each bar is proportional to the number of matched lexical items from the table. 
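A minimal sketch of this matching-and-counting step might look as follows. The table contents and inflection lists are invented here, whereas the real tools read them from the XML produced by AnadiaBuilder and from the inflection dictionary; the per-table counts are what determine the height of each coloured bar.

```python
import re
from collections import Counter

# Hypothetical resources: each table lists its lexical items together
# with their inflected forms.
TABLES = {
    "Measuring instruments (meteorology)": ["barometer", "thermometer", "thermometers"],
    "Stock indices (stock market)":        ["Dow Jones", "Nikkei"],
    "Atmospheric events (meteorology)":    ["storm", "storms", "depression"],
}

def matches_per_table(text, tables=TABLES):
    """Locate occurrences of the items of every table in one article
    and return one count per table."""
    counts = Counter()
    for table, forms in tables.items():
        pattern = r"\b(?:%s)\b" % "|".join(map(re.escape, forms))
        counts[table] = len(re.findall(pattern, text, flags=re.IGNORECASE))
    return counts

article = "The Dow Jones, the thermometer of Wall Street, fell sharply as the storm spread."
print(matches_per_table(article))
```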
In our experiment on the metaphor “economics is meteorology”, the purpose of this graphical representation is to provide the user with a quick way to track down articles where the source domain is evoked. A single HTML page contains all the charts along with hyperlinks to the related texts (Figure 5). Figure 5. Graphical representations of the outputs: moving the mouse over a bar shows the corresponding table name and matches the number of lexical items. A second tool transforms the XML version of each text into an HTML version, as shown in Figure 6. In the HTML version, the matched lexical items are in the same colour as the corresponding table. This provides the user with an easy means to find the precise location of the lexical items he is interested in. Figure 6. A coloured article. Moving the mouse over a coloured lexical item shows the corresponding table name and the corresponding set of semes/features. 5 In our experiments, the article appeared to be a relevant unit to build the charts. The level of this linguitic unit can be changed. 120 The next session presents some results of an experiment realised on a journalistic corpus. 4 First results Our work has been validated through a corpus experiment. The corpus is constituted of about 600 articles from the French newspaper “Le Monde”, addressing economics and stock market (around 450,000 words) between 1987 and 1989. This corpus, already studied by Ferrari (1997), contains numerous examples of the conventional metaphor “economics is meteorology”. It also contains lexical items from the meteorology domain that are not used in a figurative way. For our experiment, the sets of tables have been designed with nine shared semes. These semes reflect our own view of the conceptual metaphor. Specialists of any of the two domains would probably have designed the sets of tables in a much different way. Our point of view reflects our knowledge of the underlying analogy between the two domains. In the following, we discuss two different examples in order to show how the analogy and novelty points of view can be retrieved with our proposals. (5) Le Dow Jones par exemple, le thermometre de la Bourse de New York, qui avait chuté de 508 points …6 - Article n°126 – Paragraph 1 In example (5), three lexical items from the sets of tables were matched (therefore coloured) by the analysis process. “Dow Jones” appears in the “Stock Indices” table of the stock market domain (see Figure 9). “thermometre” (thermometer) appears in the “Measuring Instruments” table of the meteorology domain (see Figure 8). Figure 8. Extract of the meteorology Anadia set of tables. Figure 9. Extract of the stock market Anadia set of tables. Following these representations of the two domains, an isotopy involves the shared inherited seme [Role] and the value ‘studying, analysing’ can be found thanks to the first two coloured lexical items. One can then conclude in favour of a metaphorical use and propose the following interpretation: “thermometre” (thermometer) is used in the same way as “graphics”, “ratio”... i.e. to suggest an object for analysis and study in the stock market domain. The lexical item could be replaced (more or less efficiently) by others from the “Measuring Instruments” table. 
(6) Ce krach était du (…) a la chute vertigineuse et incontrôlée du dollar, signe que la tempete affecte dorénavant les marchés financiers.7 - Article n°153 – Paragraph 3 In example (6), the lexical items “krach” (crash) and “tempete” (storm) appear in the following tables (Figure 10 and Figure 11). 6 Literal translation: The Dow Jones, for instance, the thermometer of Wall Street, which had fallen 508 points … 7 Literal translation: This crash was due (...) to the vertiginous and uncontrolled fall of the dollar, sign that the storm will henceforth affect the financial markets. 121 Figure 10. Extract from the stock market Anadia set of tables (the table has been truncated). Figure 11. Extract from the meteorology Anadia set of tables (the tables have been truncated). The isotopy found in this sentence (example 6) is based on two different semes. The first seme involved is [Connotation] (inherited for “storm”) with the same activated value ‘bad'. The second one is [Direction] with two different activated values: ‘down’ for “krach” and ‘up’ for “tempete”. Example (3) makes it possible to conclude in favour of a metaphorical use. First, due to the activated values, the seme [Direction] is less relevant than the other one, [Connotation]. Moreover the seme [Axis] is exclusively used in the meteorological domain and is not involved in any isotopy. We propose therefore to consider it as “irrelevant” in the context. The seme [Strength] does not take part in an isotopy either; but, unlike [Axis], it can be shared between several lexical domains. It seems to us that we can therefore consider it as relevant in this context. This illustrates how novelty is dealt with in our approach. Finally, we propose the following help for interpretation: “tempete” (storm) is used to evoke a not only bad but also violent dynamic phenomenon in the stock market domain. Numerous examples of sentences where the sets of tables enable to conclude in favour of metaphorical uses have been discovered in the corpus thanks to our tools. The two sets of tables have been modified several times depending on the results obtained from the analysis process. Those results are the fisrt step of the “ISOMETA” project validating our approach and our tools. Conclusion and further works This paper has presented a user-centred lexical representation model and its use to produce help for metaphor interpretation. There is no need to be an expert in a given domain to describe it by means of this user-centred model. Nevertheless, metaphor interpretation is a linguistic task. Thus, a description for a study on a conceptual metaphor, such as the one we have presented in this paper, requires a certain familiarity with linguistic sciences. The user must indeed be able to describe how he appreciates the analogy between the source domain and the target domain by the use of shared semes. Though we have presented the use of the Anadia model for a very specific task, we have already argued for its use in many applications, such as domain-specific corpus browsing or document retrieval, as shown in (Nicolle et al. 2002). We hope the same applies to the tools developed for the “ISOMETA” project. An experiment on domain-specific corpus has validated our method. Actually, producing customized help for metaphor interpretation appears to be possible. However, this result must be evaluated, both quantitatively and qualitatively. Nevertheless, such an evaluation is not easy to carry out. 
On the one hand, the user-centred aspect of the model implies that the evaluation process should be user-centred too. On the other hand, this evaluation requires an annotated corpus. Such a reference corpus does not exist yet and seems difficult to produce. In order to start the evaluation, our further works will concern other examples of conceptual metaphors, as well as other domain-specific corpora for their study and the automatic processing of isotopies. We also plan to use our model for metaphor and paraphrase in automatic text generation. 122 References Beust P. 1998 Contribution a un modele interactionniste du sens. Computer Sciences PhD Thesis of the University of Caen, France. Falkenhainer B., Forbus K.D. and Gentner D. 1989 The Structure-Mapping Engine : Algorithm and Examples. Artificial Intelligence, 41/1, pp.1-63. Fass, D. 1997 Processing metaphor and metonymy. Greenwich, Connecticut: Ablex Publishing Corporation. Ferrari, S. 1997 Méthode et outils informatiques pour le traitement des métaphores dans les documents écrits. Computer Sciences PhD Thesis of the University of Paris XI, France. Gentner D. 1983 Structure-Mapping: A Theoretical Framework for Analogy. Cognitive Science, 7, pp. 155-170. Gineste, M.-D., Indurkhya, B. and Scart-Lhomme, V. 1997 Mental representations in understanding metaphors. Technical report, 97/2, Groupe Cognition Humaine, LIMSI-CNRS, Orsay. Greimas A.J. 1966 Sémantique structurale. Paris: Larousse. Indurkhya, B. 1992 Metaphor and Cognition. Dordrecht, The Netherlands: Kluwer Academic Publishers. Kintsch, W. 2000 Metaphor comprehension: A computational theory. Psychonomic Bulletin & Review, pp. 257-266. Lakoff G. and Johnshon M. 1980 Metaphors we live by. University of Chicago Press, Chicago, U.S.A. Martin J.H. 1992 Computer Understanding of Conventional Metaphoric Language. Cognitive Science (16), pp.233-270. Nicolle A., Beust P. and Perlerin V. 2002 Un analogue de la mémoire pour un agent logiciel interactif. In Cognito, 21, pp. 37-66. Perlerin V., 2001 La recherche documentaire, une activité langagiere. In proceedings of TALN2001, Tours. Perlerin V. and Beust P. 2002 Pour une instrumentation informatique du sens. Proceedings of the CNRS/ARCO Summer School in Tatihou, to be published. Pottier B. 1987 Théories et analyse en linguistique. Hachette: Paris , p. 224. Rastier F. 1987 Sémantique interprétative. Presses Universitaires de France : Paris. Saussure F. de 1915 Cours de Linguistique Générale. Mauro-Payot: Paris (1986). Tanguy L. 1997 Computer-Aided Language Processing: Using Interpretation to Redefine Man-Machines Relations. Proceedings of the 2nd International on Cognitive Technology (CT'97), Humanizing the Information Age, Aizu Wakamatsu City, Japan, August 25-28. Wilks Y. 1978 Making Preferences More Active. Artificial Intelligence, 11/3, pp.197-223. 123 A corpus-based technique for grammar development Philippe Blache, Marie-Laure Guénot & Tristan van Rullen LPL-CNRS, Université de Provence 29 avenue Robert Schuman 13621 Aix-en-Provence, France {pb,mlg,tristan}@lpl.univ-aix.fr Abstract We argue in this paper in favor of a fully constraint-based approach in the perspective of grammar development. Representing syntactic information by means of constraints makes it possible to include several parsers at the core of the development process. In this approach, any constraint (then any syntactic property) can be evaluated separately. 
This aspect, on top of being highly useful in a dynamic grammar development, also allows to quantify the impact of a constraint over a grammar. We describe in this paper a general architecture for developing a grammar in this perspective and different tools that can be used in this schema. 1. Introduction Modern applications, especially in the context of human-machine communication, need to treat unrestricted linguistic material (including spoken languages) with a fine granularity. This means that NLP applications have to be at the same time robust and precise. The first step towards this perspective consists in developing adequate resources, in particular broad-coverage grammars. However, developing grammars remains an important problem and this work is usually done only empirically. We present in this paper an experiment based on a constraint-based linguistic formalism consisting in developing a broad-coverage grammar by means of corpus-based techniques. More precisely, the approach consists in using different tools, namely a POS-tagger, a shallow parser and a deep parser, for developing an electronic grammar for French taking into account various phenomena and different uses including spoken language syntactic turns. Different parsers are used in this perspective at different stages both in a development and an evaluation perspective. The general idea consists, starting from a core grammar, in parsing previously tagged and disambiguated corpus by means of a deep non-deterministic parser. The results are interpreted in order to identify syntactic phenomena beyond the scope of the grammar. The grammar is then completed and the modifications are experimented over the corpus. The new result is interpreted again in order to examine the consequences of the modifications: new parses, new structures, elimination of unexpected parses, but also false results, spurious ambiguities, etc. This work is then completed with a more systematic evaluation by means of a shallow parser which makes use of the entire grammar, but applies some heuristics in order to control the parse. This parser presents the advantage of being robust and efficient and can be used for parsing large corpora. In our experiment, this parser is used in order to evaluate the different versions of the grammar over different corpora. It is then very easy to compare automatically the efficiency of the grammar, even without any treebank. 2. The formalism of Property Grammars We think that using a fully constraint-based formalism for representing syntactic information offers several advantages that can have deep consequences in grammar development. More precisely, provided that 124 information is represented only by means of constraints, it is possible to conceive a grammar development architecture starting from zero and taking advantage of evaluation tools. In this section, we present the “Property Grammars” formalism (hereafter PG, see [Blache00]) in which linguistic information corresponds to different kinds of constraints (also called properties). In this approach, parsing comes to constraint satisfaction and, more interestingly in the perspective of grammar development, the result of a parse corresponds to the state of the constraint system. Such a result, called “characterization”, is available whatever the form of the input, being it grammatical or not. This makes this approach well adapted to the treatment of unrestricted texts and even spoken language corpora. 
We will give in the following some examples taken from different corpora, from newspapers to dialogues. In the end, we want to show that in such a perspective, different parsing tools can also be used both for developing and for evaluating grammars. In the remainder of this section, we present the PG formalism, illustrating the fact that constraints constitute a radically different approach to parsing unrestricted texts. In PG, different kinds of syntactic information correspond to different kinds of constraints. Basically, we use the following set of constraints: linearity (linear precedence), dependency (semantic relations between units), obligation (set of possible heads), exclusion (cooccurrence restriction), requirement (mandatory cooccurrence) and uniqueness:
1. Obligation (noted Oblig): specifies the possible heads of a phrase. One of these categories (and only one) has to be realized. Example: Oblig(NP) = {N, Pro}
2. Uniqueness (noted Uniq): set of categories that cannot be repeated in a phrase. Example: Uniq(NP) = {Det, N, AP, PP, Sup, Pro}
3. Requirement (noted ⇒): mandatory cooccurrence between sets of categories. Example: N[com] ⇒ Det
4. Exclusion (noted ⊗): cooccurrence restriction between sets of categories. Example: AP ⊗ Sup (in an NP, a superlative cannot co-occur with an AP)
5. Linearity (noted <): linear precedence constraints. Example (in an NP): Det < N
6. Dependency (noted →): dependency relations between categories. Example (in an NP): Det → N
In this approach, all constraints are defined as relations between categories, and all categories are described using a set of properties. Moreover, even if we preserve a hierarchical representation of syntactic information (distinguishing lexical and phrasal levels), the notion of constituency does not constitute explicit information. In other words, categories are described only by a constraint graph in which nodes represent the categories involved in the description and edges are relations (or constraints; cf. fig. 1). The fact that the set of properties describing a category forms a graph is a side effect: it indicates a certain cohesion of the description, which comes from the fact that the category is specified, unlike in classical generative approaches, not in terms of relations between some constituents and a projection, but only in terms of relations between these constituents. A grammar is then formed by a set of such constraint graphs. The important point, especially from the perspective of grammar engineering, is that all constraints are at the same level and therefore independent from each other. This means that it is possible to evaluate them separately. It is also possible to evaluate only a subpart of the constraint system.
Figure 1: Constraint graph of the Prepositional Phrase (PP). The categories mentioned in the properties of the PP are represented by nodes; each relation (or property) is represented by edges, possibly looping on one node (obligation, uniqueness) or connecting two nodes (requirement, exclusion, dependency, linearity).
Let us describe the parsing architecture more precisely. In PG, describing a category consists in finding all the relevant constraints in the grammar and evaluating them. In a data-driven perspective, this amounts, more precisely, to identifying a set of categories and verifying their properties. At the beginning of the process, the initial set of categories corresponds to the set of lexical categories.
The parsing process simply consists in identifying the set of constraints that can be evaluated for the categories belonging to this set. Any constraint belonging to a constraint graph, the evaluation process comes to activate one (or several) constraint graphs. Activating a constraint graph leads to instantiate new categories (which are the categories described by the corresponding graphs). Activating a graph is then simply the consequence of the evaluation of the constraint system. This means that after evaluation, for each graph, we obtain the set of constraints that are satisfied plus eventually the set of constraints that are violated. These two sets form what we call the characterization of a category. We can then describe any kind of input, grammatical (i.e. whose set of violated constraints will be empty), or not. figure 2: Characterization graph Characterization graph of the phrase “The air line that the doctor Fergusson intended to follow”. The nodes correspond to the parsed categories, arrows correspond to the satisfied properties. 126 Such an architecture shows in what sense this approach differs from generative ones. In this last case, the general process consists in first building a structure (typically a tree) and then verifying its properties. In our approach, we only use constraint evaluation, without needing any derivation relation nor other extra devices in order to build the syntactic description of an input. 3. The role of constraints in Grammar Engineering One of the most important aspects of PG is that all information is represented by constraints and all constraints are independent. The information comes from constraint interaction, but different kinds of information are represented separately. As explained above, it is then possible to evaluate separately all constraints. We will describe in the next section how it can be possible to take advantage of this aspect in our shallow parsing technique. As for grammar development itself, this characteristic also has important consequences. Let's first come back on an important point. One of the main differences between a fully constraint-based approach such as PG and other classical generative techniques lies in the fact that there is no need to build a structure before being able to verify its properties. More precisely, generative method consists in expressing relations in terms of structures where PG uses only relations between objects. This means that a grammar has to be coherent, consistent and complete when using generative formalisms. At least, it should allow to build a structure (a tree) covering an input. Moreover, it is not possible in such approaches to deal easily with partial information. Using PG avoids these problems. First, even when starting from zero the development of the grammar, it is possible to evaluate any subset of properties. This aspect has several consequences on the utility of automatic devices during the development. On one hand, it is possible at any stage of the development of the grammar, to use a parser in order to verify the consequences of new properties, for example in parsing the same corpus and comparing the results of different versions of the grammar. On the other hand, the same devices can also be used in order to evaluate the organization of the grammar itself. All properties being assessable separately, it is possible to evaluate the respective consequences, for example in terms of filtering, for each constraint. 
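As a rough illustration of how such a characterization can be computed, consider the following sketch. The encoding, the NP fragment and the feature-free requirement constraint are simplifications of ours, not the authors' parser; the point is only that each property can be evaluated on its own and that satisfied and violated properties are both retained.

```python
# A toy evaluation of a few NP properties over a flat category sequence.
NP_GRAMMAR = [
    ("oblig",   {"N", "Pro"}),    # exactly one head is realized
    ("uniq",    "Det"),           # Det cannot be repeated
    ("require", ("N", "Det")),    # a noun requires a Det (features ignored here)
    ("exclude", ("AP", "Sup")),   # superlative and AP do not co-occur
    ("precede", ("Det", "N")),    # Det < N
]

def check(kind, arg, cats):
    if kind == "oblig":
        return len([c for c in cats if c in arg]) == 1
    if kind == "uniq":
        return cats.count(arg) <= 1
    if kind == "require":
        trigger, required = arg
        return trigger not in cats or required in cats
    if kind == "exclude":
        a, b = arg
        return not (a in cats and b in cats)
    if kind == "precede":
        first, second = arg
        if first in cats and second in cats:
            return cats.index(first) < cats.index(second)
        return True

def characterization(cats, grammar=NP_GRAMMAR):
    """Satisfied and violated constraints: both sets are kept, so an
    ungrammatical input still receives a description."""
    satisfied = [c for c in grammar if check(*c, cats)]
    violated  = [c for c in grammar if not check(*c, cats)]
    return satisfied, violated

sat, vio = characterization(["N", "Det"])        # e.g. *"book the"
print(len(vio), "violated:", vio)                # only the linearity property fails
score = len(sat) / (len(sat) + len(vio))         # a crude grammaticality gradient
```

Because every property is checked independently, counting how often each one is satisfied or violated over a corpus is straightforward.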
This means a possibility of evaluating empirically the power of each constraint as well as specifying this power for a given type of property with respect to the others. Second, especially when developing a broad coverage grammar, unrestricted texts have to be taken into account. It is then necessary to deal with partial information and partial structures. The constraint-based approach proposed in PG relies, as shown before, on graphs which is well adapted to the representation of such information. Finally, dealing with unrestricted texts leads to treat even ungrammatical inputs. In a classical approach, such data are problematic when using automatic devices: the input can be either rejected (which is not interesting for example when treating spoken languages) or accepted (in this case, a wrong construction is added to the grammar). In our approach, it is possible to introduce a gradient in the grammaticality relying on the possibility of quantifying constraint evaluation. For a given construction, one can give the number and the type of constraints that are violated. Such information is in our opinion as important as the set of satisfied constraints, reason why our characterization contains both information. Reciprocally, for a given constraint, it is possible to specify its satisfiability degree in a corpus. And even more generally, it could be possible to specify what kind of constraint is preferably violated. At the difference with the optimality theory (see [Prince93]), we have here a systematic tool for evaluating the respective importance of constraints, generally and locally. The approach proposed here presents many advantages. It is in particular well adapted for the elaboration of grammars taking into account many different sources (including spoken languages one). Moreover, this approach is highly flexible in the sense that any syntactic information can be evaluated separately by means of tools making it possible to quantify their roles. This approach also presents a very practical interest: it only requires a simple constraint editor (which consists in an interface between an XML file and its representation). It can easily be adapted to the evolution of category representation. 127 4. Step by step semi automated grammar development The experimental process of grammar development described in this section is shown hereafter on figure 3 and can be schematised as a versioning step by step semi-automated sequence of experiments. Figure 3: step by step semi-automated sequence of experiments The development of a broad-coverage grammar for French relies, within our framework, on several points: a basic corpus, a core grammar, and some tools (several parsers). In order to carry out the tests on our grammar, we use a three million words corpus of journalistic texts (hereafter “corpus-test”), annotated and disambiguated (the treebank corpus of the LLF, University of Paris 7). A non-deterministic deep parser can be used as a real grammar development platform. 
The deep parser makes it possible to obtain various results for any input while being supplied with either the whole grammar or one of its subparts. One can thus focus, for example, on a subset of properties (to observe precisely their operation, their weight and their incidence on the parse), or on a subset of categories (to focus on the results obtained and the number of non-relevant parses, and to observe whether a recurring error pattern can be deduced and avoided, or, on the contrary, a recurring satisfactory pattern), without taking the rest of the grammar into account. This kind of manipulation can be carried out without modifying the structure of either the grammar or the parser, simply by neutralizing the unwanted data. The non-determinism of the parser provides a first-level means of evaluating the efficiency of the grammar according to the number and quality of the proposed parses. The development process can be carried out in three alternative ways. First, from a descriptive point of view: in this case, we first isolate from the corpus various occurrences of a given construction (e.g. clefts, coordination). We then build its fine empirical linguistic description, based on observation of attested productions, in order to integrate into our basic grammar the properties corresponding to our theoretical conclusions. Then we use a deep parser to analyse our test-corpus and compare the quality of the new results with the ones obtained from the previous version of the grammar. Such an evaluation takes into consideration the evolution of the quantity of parsing proposals according to the modifications of the grammar (i.e. conclusions about the consequences of introductions, suppressions and/or corrections, and in which proportions), the evolution of the quality of these proposals (number of relevant parses), the emergence of newly suggested structure patterns, etc. The second development prospect concerns the coverage of the grammar. In this case the first stage consists in isolating, among the results obtained with the first version of the grammar, various occurrences of a particular construction (e.g. unbounded dependencies). We can then study in detail the errors caused by the grammar (omission, imprecision, etc.). Once they are corrected, we run the parser again with the new version of the grammar to compare the quality of the results. Thirdly, in order to evaluate the set of properties itself and its semantics, we can isolate a type of property and observe its behaviour. This use of the deep parser allows us to observe directly the impact of a type of property on the results. This can lead to modifications of the set of constraints itself, i.e. of their operating mechanism and/or their presence. For example we can compare the quality of one parse result obtained with only one type of property on the one hand, and another one obtained with the other properties on the other hand.
Then, while referring to the results obtained with the entire grammar, we can deduce the weight of a type of property on the parsing, either by its presence (for example if the majority of relevant analyses are found in the results obtained with only that one), or by its absence (for example if the majority of relevant analyses are found in the results obtained without this property, and if the number of false results is decreased by its absence). This can lead to an in-depth redefinition of a type of property, for example a modification of its semantics (i.e. its satisfaction mechanism), and even its suppression from the grammar (if it appears to be useless, and even more a source of recurrent errors). Let us consider the parsing results for the following phrase (coming from a dialogue corpus): “Alors on vous demandera aussi heu pourquoi c'est le meilleur” (“so we'll ask you too hum why it's the best”). One parsing proposition with the first version of the grammar is: figure 4: A parsing proposition with the first version of the grammar Following the approach explained above, we have done a lot of changes in the grammar using alternatively one of the three methods described here, and get another parsing proposition with a later version of the grammar: figure 5:A parsing proposition with a later version of the grammar 129 Each modification of grammar, whatever the method, can have side effects on the rest of the parse (i.e. on the elements that we put aside during our different focusings). For example a modification of the constraint graph describing a category, aiming at allowing a particular structure treatment can lead to the proliferation of non-relevant parses for the rest of the corpus (e.g. this structure can then be proposed more often, which can cause as a consequence a more significant number of erroneous proposals). This is the reason why we have, for each modification of the grammar, to launch the parse again on the entire test-corpus. If necessary, a new handling of the grammar is to be planned so as to control the effects of the described process. This checking can be done with our deep parser, or with a shallow parser if we want to widen significantly the size of the corpus to parse. Each new grammar version is subsequently tested over large corpora by means of a shallow parser. The evaluation relies on the analysis of outputs and consists in comparing phrase boundaries of each sentence and in studying local statistics about the width, the nature an the count of parsed phrases. The Property Grammars parser used for these tests is deterministic and implements some heuristics in order to control the process. Each sentence is parsed once. The result is built with a stack of categories obtained by means of a classic left-corner parsing, and with a dynamic satisfaction algorithm of the constraint set given by the grammar. The main interest of this parser is that it provides linguistic information about chunks, with hierarchical features, while keeping fastness and robustness as with other shallow parsers. Thus it is necessary within this grammar development framework to use a tool for comparing parsing results which is fast, complete and efficient. This is the reason why we must handle a test-corpus (so that the provided results can be comparable in a reliable way) and an automatic tool capable of providing quickly and intelligibly some information about common points and differences between results. 
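A comparison tool of this kind can be sketched as simple set operations over phrase boundaries, in the spirit of the multiplexer discussed below; the bracket format and the agreement measure used here are our own simplifications.

```python
# Each parser output is taken here to be a set of (category, start, end)
# phrase boundaries for one sentence.
def compare(run_a, run_b):
    """Common and differing phrases for one sentence, plus a per-category
    agreement rate."""
    common, only_a, only_b = run_a & run_b, run_a - run_b, run_b - run_a
    cats = {c for c, _, _ in run_a | run_b}
    agreement = {c: sum(1 for x in common if x[0] == c) /
                    max(1, sum(1 for x in run_a | run_b if x[0] == c))
                 for c in cats}
    return common, only_a, only_b, agreement

gp1 = {("NP", 0, 2), ("VP", 2, 6), ("PP", 4, 6)}
gp2 = {("NP", 0, 2), ("VP", 2, 5), ("PP", 4, 6)}
common, a_only, b_only, agreement = compare(gp1, gp2)
print(agreement)   # NP and PP identical, VP boundaries disagree
```

Aggregating such per-sentence figures over a corpus yields summary percentages comparable to those reported below for the two grammar releases.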
In this way we can preserve throughout a detailed, step-by-step elaboration, a general point of view on the efficiency of the grammar, which is of primary importance so that the provided results remain homogeneous. The results of these tests are interesting for grammar development process as for parser improvement. Such a goal is reached with a parameterised automatic evaluation strategy (cf. [Blache02a]) which uses a multiplexer. The technique operates on common boundaries. Its output is the intersection or the difference of the inputs and statistics are given about the relative importance of the categories detected for each border. Set operators such as union, difference and intersection are applied on input borders as well as on categories of compared phrases. This multiplexing method is interesting for an evaluation process without needing a treebank. The first experiment consists in testing different versions of the grammar over the “treebank corpus”, so that the shallow parser results can be compared for two grammars in order to allow a manual correction of the grammar. The second experiment uses several corpora with different characteristic features such as spoken language, literary style etc. Its aim is to show efficiency of grammars on unrestricted texts. Results improve our knowledge on side effects, limits and the relative importance of properties. Both experiments give statistics about each kind of phrase, their width, their depth, their position. The multiplexed evaluation gives a precise comparison of grammars and information about their main differences. With this tool, we can observe differences and likelihood between two compared grammars in quantitative terms. The sample results below, obtained on the test corpora, show two grammars letting NPs unchanged, and giving 25% of different VPs and 15% of different PPs. 130 Figure 6: Some sample results of a multiplexing process for two grammar releases on a same corpus 5. Conclusion This paper describes a technique for empirical corpus-based grammar elaboration. Our approach relies on a fully constraint-based formalism that makes it possible to evaluate precisely each constraint. This is one of the main advantages of such a technique in comparison with classical derivational methods. It allows in particular to evaluate separately any constraint as well as any kind of constraint. This method constitutes then an efficient platform both for developing grammars as well as experimenting and evaluating theoretical results. The architecture proposed here makes it possible to implement an incremental and step-by-step grammar development environment in which several parsers are used both for developing and evaluating the grammars. References [Abeillé03] Abeillé A. 2003, Une grammaire électronique du français, CNRS Editions. [Blache00] Blache P. 2000, “Constraints, Linguistic Theories and Natural Language Processing”, in Natural Language Processing, D. Christodoulakis (ed), LNAI 1835, Springer-Verlag [Blache02a] Blache P, Van Rullen T. 2002, “An evaluation of different symbolic shallow parsing techniques”, in procs of LREC-02. [Blache02b] Blache P, Guénot M.-L 2002, “Flexible Corpus Annotation with Property Grammars”, in procs of the International Workshop on Treebanks and Linguistic Theories. [Butt99] Butt M., T.H. King, M.-E. Nino, & F. Segond 1999, A Grammar Writer's Cookbook, CSLI. [Copestake01] Copestake A. 2001, Implementing Typed Feature Structure Grammars, CSLI. [Kaplan02] Kaplan R., T. King, J. 
Maxwell III 2002, “Adapting Existing Grammars: The XLE Experience”, in proceedings of Workshop on Grammar Engineering and Evaluation (COLING-02). [Kinyon02] Kinyon A. & C. Prolo 2002, “A Classification of Grammar Development Strategies”, in proceedings of Workshop on Grammar Engineering and Evaluation (COLING-02). [Prince93] Prince A. & P. Smolensky 1993, “Optimality Theory: Constraint Interaction in Generative Grammars”, Technical Report RUCCS TR-2, Rutgers Center for Cognitive Science. [Simov02] Simov K. et al. 2002, “Building a Linguistically Interpreted Corpus of Bulgarian: the BulTreeBank”, in proceedings of LREC-02.
[Data belonging to Figure 6 (multiplexing of two grammar releases, GP1 and GP2, on the same corpus): phrases per sentence 19.04 (GP1) vs. 18.97 (GP2); words per phrase 1.50 vs. 1.50; identical NPs 100%; identical VPs 75%; identical PPs 85%.]
Towards an integrated model of service encounters Birte Bös, Rostock, Germany 1. Introduction Service encounters are everyday social contacts in which certain goods and/or services are provided. Despite the booming Internet and shopping channels, most of these encounters are still carried out face-to-face as they have always been, e.g. at department stores, travel agencies and restaurants. These institutionalised interactions are relatively clearly structured, and they generally have a fixed role allocation, involving (at least) a customer and a server. Service encounters are usually realised in a complex interplay of cognitive processes, dyadic verbal components, non-verbal elements and physical actions; and it has been stressed that these aspects should be considered in discourse analysis (cf. Ehlich 1993: 124). Looking at various existing models of service interactions, one gains the impression that most of them have certain deficiencies. For example, the diverse phase models, which are particularly popular in practical sales guides but also suggested in some linguistic publications, mostly present a rather unrealistic picture. Besides postulating fixed sequences of phases, they often focus on the server's perspective, neglecting the interactive character of the event. Another disadvantage of most of the approaches examined is that they do not capture the actual realisation of the interaction. This is why this study strives for an integrated model which allows for a comprehensive representation of service encounters (cf. Bös forth.). It is obvious that such a model cannot be developed without authentic data. Therefore, a corpus of 100 service encounters recorded in London bookstores was established as an essential prerequisite. As one of the main concerns is to go beyond the standard focus on the verbal elements, the scope of this study was extended to include selected facial expressions, gestures and physical actions documented during the recordings at the actual point of sale as well. Based on a close analysis of this material and the existing representations, an explanatory model was created, which is shown (in extracts) and described in section 2. In section 3, the proposed form of representation is applied to a sample interaction from the corpus, illustrating that the model can not only cope with idealised service encounters, but is also capable of reflecting the characteristics of specific service interactions. 2. The integrated model An integrated model of service encounters has to meet various requirements. One of the key objectives is to develop a general pattern which can serve as the basis for the representation of individual service encounters.
This general form of the integrated model should provide all possible components of service encounters in bookstores as well as different alternatives of their realisation. It should be flexible enough to be adapted to specific service encounters. The model should include both the customer's and the server's perspectives, whose contributions are considered as equally important for the progress of the interaction. Besides, it should capture communicative as well as physical and cognitive aspects and illustrate their interplay. For the diverse activities involved in service encounters, a unified representation appears necessary to allow for a better comparability of the interactions. 132 Global structure of service encounters One of the first steps in the development of the integrated model was the investigation of the global structure of service encounters in bookstores. The corpus analysis has produced a set of 10 possible structural elements, which, in the individual interactions, occur in various combinations. A full list of these components is given in Fig. 1. • RUN UP (RU) • ESTABLISHING CONTACT (EC) • GREETING (GR) • SERVICE BID (SB) • NEED PRESENTATION (NP) • SERVICE (SE) • PURCHASE DECISION (PD) • PURCHASE REALISATION (PR) • FAREWELL (FA) • BREAKING UP CONTACT (BC) Fig. 1 Global structural elements of service encounters Although the RUN UP precedes the actual service encounter and is, in this respect, not part of the interaction proper, it is included in the model, because it essentially influences the progress of the interaction. In shops with self-service like bookstores, customers may browse, select products, and even make their purchase decision in the RUN UP, before they turn to the salesperson for help, for further information or simply to pay for the goods chosen. The remaining global structural elements differ in their degrees of obligation. The only obligatory elements observed in the corpus are ESTABLISHING CONTACT, BREAKING UP CONTACT and the NEED PRESENTATION. More specifically, ESTABLISHING and BREAKING UP CONTACT are physical elements which are necessary to create the face-to-face situation of the interaction. In the investigated data it is usually the customer who approaches or leaves the counter, where the server is placed. The most central obligatory element is the NEED PRESENTATION. Since it is the primary goal of every service encounter to satisfy certain customers’ needs, inevitably, these needs have to be expressed at some point in the interaction either verbally or non-verbally. SERVICE, PURCHASE DECISION and PURCHASE REALISATION are considered to be obligatory alternatives in the course of the service interaction. As the corpus analysis has proved, a considerable number of service encounters does not involve the transfer of goods but the supply of information or other services, which are provided in the SERVICE element. This means, a service encounter does not necessarily include a (positive or negative) PURCHASE DECISION of the customer. This cognitive process is only realised when certain products are involved. If there is a PURCHASE DECISION, and it is positive, the actual exchange of money and goods takes place in the PURCHASE REALISATION, which usually consists of a routine sequence of typical physical actions and may be supplemented by certain communicative acts. Components like GREETING, SERVICE BID and FAREWELL are communicative elements which are truly optional, i.e. 
all of them can be omitted, and the data show that often they are indeed not realised. The element GREETING occurs in only 56% of the service encounters in the corpus and the SERVICE BID even less often, namely in 14%; the FAREWELL is documented more frequently, in 85% of the cases.
Local structure of service encounters – primitive acts Another aim of the integrated model is to provide a detailed but comparable description of the various activities performed during service encounters. Searching for a way to represent the communicative, physical and mental components of the interactions, the concept of primitive acts, first suggested by Schank (1972), proves helpful. As their application in the cognitive script model by Schank/Abelson (1977) has shown, primitive acts allow for a unified representation of actions. Therefore, they can help to clarify certain regularities of service encounters independently of their individual realisation. For its use in the integrated model, the set of primitive actions was adjusted and now comprises the acts summarised in Fig. 2. Apart from modifying the meaning of acts such as MTRANS (which, in Schank's classification, includes the transfer of information between persons), and placing more emphasis on the communicative function of SPEAK, the new act GEST was introduced to account for the non-verbal aspects of communication. Additionally, the primitive act ATTEND is used to document the non-verbal aspect gaze (‘ATTEND eyes'). Another new act is OPERATE, which allows for the description of certain activities such as writing, or using the telephone, computer etc.
PHYSICAL
• PTRANS – Change of the local position of a person or object, or transfer of an abstract relationship, e.g. possession
• GRASP – Grasping of an object by a person
• INGEST – Taking in of an object by a person
• EXPEL – Expulsion of an object from the body of a person into the physical world
• OPERATE – Performing of an action involving the use of a certain instrument, e.g. ‘OPERATE computer', ‘OPERATE pen'
MENTAL
• MTRANS – Transfer of information within a person
• MBUILD – Construction of new information based on old information
• ATTEND – Focussing of a sense organ on a certain stimulus
COMMUNICATIVE
• SPEAK – Transfer of information between persons by producing sounds
• GEST – Transfer of information between persons by moving parts of the body or the face, e.g. ‘GEST lips' (smile)
Fig. 2 Modified set of primitive acts (based on Schank/Abelson 1977: 12ff)
This limited set of primitive acts presupposes a basic level of description. It does not include the instrumental acts which are necessary to perform the suggested acts. For example, GRASP could be split up into various subordinate acts, e.g. moving the arm, bending the finger etc. On the other hand, the act GRASP itself can operate as an instrumental act, e.g. in the realisation of a PTRANS of an object. However, in this function it is, like all other instrumental acts, neglected in the model proposed. In this way, the modified set tries to avoid ‘endless conceptualisations' which have been criticised as one of the major problems of Schank's concept (cf. Dresher/Hornstein 1976: 368).
Graphic representation of service encounters Fig. 3 shows an extract from the general form of the integrated model illustrating the graphic arrangement which is used to capture the structure of service encounters. It demonstrates how the two perspectives of the major participants and various alternatives in the realisation of a service interaction are integrated in one form of representation.
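To make the idea of a unified, act-based representation more concrete before turning to the graphic arrangement, the following sketch shows one way such annotations could be captured as data. It is a minimal illustration only; the class and field names are hypothetical and are not part of Bös's model or of any software used in the study.

```python
# Purely illustrative: one possible way to encode primitive-act annotations such as
# "C PTRANS P to S" or "S ATTEND eyes to screen" as data. The class and field names
# are hypothetical; they are not part of Bös's model or of any tool used in the study.
from dataclasses import dataclass
from enum import Enum

class Actor(Enum):
    CUSTOMER = "C"
    SERVER = "S"

class Act(Enum):
    # physical acts
    PTRANS = "PTRANS"
    GRASP = "GRASP"
    INGEST = "INGEST"
    EXPEL = "EXPEL"
    OPERATE = "OPERATE"
    # mental acts (non-perceptible)
    MTRANS = "MTRANS"
    MBUILD = "MBUILD"
    ATTEND = "ATTEND"
    # communicative acts
    SPEAK = "SPEAK"
    GEST = "GEST"

@dataclass
class Event:
    actor: Actor
    act: Act
    detail: str = ""           # e.g. "eyes to screen", "computer", or an utterance
    element: str = ""          # global structural element, e.g. "NP2", "SE1"
    perceptible: bool = True   # mental acts belong to the non-perceptible segment

# A small fragment of the sample interaction, encoded as a flat list of events:
events = [
    Event(Actor.SERVER, Act.SPEAK, "I'm not coming across anything in general", "SE1"),
    Event(Actor.SERVER, Act.ATTEND, "eyes to screen", "SE1"),
    Event(Actor.CUSTOMER, Act.MTRANS, "Present need?", "NP2", perceptible=False),
    Event(Actor.CUSTOMER, Act.SPEAK, "do you have a philosophy section?", "NP2"),
    Event(Actor.SERVER, Act.OPERATE, "computer", "SE1"),
]

for e in events:
    segment = "perceptible" if e.perceptible else "non-perceptible"
    print(f"{e.actor.value} {e.act.value} {e.detail} [{e.element}, {segment}]")
```

Encoding events in such a form would, for instance, make it straightforward to filter a transcript by participant, by structural element, or by perceptibility.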
As mentioned before, both interactants are considered to have an equal share in the interaction; they are therefore assigned an equal space in the model, the left half of the diagram being reserved for the customer's and the right for the server's contributions. [Fig. 3 General integrated model for the representation of service interactions (extract): parallel CUSTOMER (C) and SERVER (S) columns, each divided into non-perceptible and perceptible activities, covering the elements from RUN UP and ESTABLISHING CONTACT through NEED PRESENTATION and SERVICE to FAREWELL and BREAKING UP CONTACT; P = product.] The extract contains the SERVICE BID, NEED PRESENTATION and SERVICE, i.e. three of the 10 components of a service interaction which the analysis of the corpus has yielded (cf. Fig. 1). These global structural elements are given in the left and the right margin of the figure. They are indicated separately for customer and salesperson, because the data show that asymmetries can occur (cf. section 3). In the general model, the structural elements are arranged in the idealised sequence presented in Fig. 1. However, this order is not fixed, but can be rearranged to represent specific service encounters. The various activities in the realisation of service encounters are described with the help of the primitive acts assembled in Fig. 2, and can be classified as perceptible or non-perceptible. This distinction is shown in Fig. 3 by their graphic separation. The perceptible segment contains the actual observable interaction. These communicative (i.e. verbal and non-verbal) and physical activities of the participants are arranged in the middle of the diagram (where they are again differentiated into the customer's and the server's parts). The general form of the model can only include the primitive acts which are typically used in the realisation of the various global structural elements. However, as Fig. 4 will illustrate, this segment expands considerably in detail when the model is applied to a specific interaction. What should always be taken into account is that the observable interaction is determined by the underlying mental processes of the interactants. An explanatory model aiming at a comprehensive picture of a service encounter should therefore also include cognitive aspects. These non-perceptible elements are recorded separately for the customer and the server on the left and right of the figure respectively (shaded grey). As the interactants are not just “passive vehicles in the unfolding of generic structure”, as George (1988: 316) puts it, but “knowledgeable participants”, their decisions presuppose a background knowledge of continually changing experiences and expectations.
Generally, assumptions about cognitive processes are, of course, of a hypothetical nature and involve interpretative problems. Their representation in the model has to remain restricted to standardised points of decision. The principle of incorporating yes/no choices is also used, for example, in flowchart models (e.g. Ventola 1987), and makes it possible to include alternatives for the progress of the interaction which are indicated in Fig. 3 by the arrows. 3. Application of the model The general integrated model introduced in section 2 can flexibly be adapted to represent specific service interactions by adjusting the number and sequence of global structural elements and adding the relevant primitive acts as well as the actual wording of the interaction. The analysis of the corpus has revealed that the complete, idealised sequence of structural elements presented in Fig. 1 does, in fact, never occur. Instead, there are certain typical combinations depending on whether the interaction is focussed on the provision of goods and/or different kinds of services. As this determines the choice and realisation of the obligatory alternatives (SERVICE, PURCHASE DECISION, PURCHASE REALISATION, s.a.), eight subtypes of service encounters in bookstores can be distinguished according to their global structure. It does not come as a surprise that in a self-service environment like the bookstore the most frequent type of interaction is the simple payment, which does not include a special SERVICE element. However, there are, for example, also many cases in which the customer seeks only the advice of the salesperson and, thus, no PURCHASE DECISION and REALISATION are required. As the global structure of the integrated model is not considered to be fixed, the model can cope with the different subtypes. The flexible form of representation also allows for a visualisation of the three major structural peculiarities which could be observed in the corpus: loops, embedded elements, and asymmetries. As 41% of the service encounters display one of these divergences or a combination of them, approaches which postulate linear sequences of phases are disproved. Loops are caused by the repetition of structural elements. For example, customers often present more than one need in the course of the 136 service encounter. As a result, NEED PRESENTATIONS and, of course, also other elements, e.g. the ones in which the needs are dealt with, can occur repeatedly. Embedded elements are performed while another element is in progress. A customer might, for instance, ask for further information while s/he is paying for some goods. In such cases, the (second) NEED PRESENTATION is embedded in the PURCHASE REALISATION. Loops and embedded elements cannot only occur separately or in combination with each other, they can also be linked with asymmetries, i.e. asynchronous sequences, in which the interactants are engaged in different structural elements. A sample of these phenomena is given in Fig. 4. For the description of the local structures of a particular service interaction, the activities displayed in the non-perceptible and the perceptible segments of the diagram have to be specified. The non-perceptible, cognitive processes, which involve a variety of alternatives in the general form of the model, can now be geared to the actual service interaction. Consequently, the representation will be restricted to those points of decision and their resolutions which can be derived from the progression of the specific interaction. 
The perceptible segment of the model, on the other hand, which contains only a few typical primitive acts in the general form, can be enlarged with the details of the concrete verbal, non-verbal and physical realisation of the encounter. For the inclusion of the actual verbal utterances, a new feature is added to the figure. The dialogue is shown in a column in the middle of the diagram, thus symbolising the centrality of this part in the interactive process. Fig. 4 illustrates the way the necessary adaptations are accomplished and the potential such a form of representation offers. The extract shown is taken from a service encounter of the corpus which does not involve any goods, but only the advice of the salesperson. The customer, in her first NEED PRESENTATION, has asked the server for books on a certain subject (‘erm I'm looking for something on – marriage as opposed to weddings – the state of marriage'). In the subsequent SERVICE section (SE1), the server starts checking on the computer, but because the search word is rather vague, she has difficulties in finding suitable publications (cf. Fig. 4, ‘...I'm not coming across anything in general'). The global structure of the example demonstrates the peculiarities discussed above. It contains a loop, as well as cases of embedded elements and asymmetries. For example, the customer accepts the negative reply of the server (‘okay') and then comes up with a new NEED PRESENTATION (NP2) (‘alright I'll just have a look on my own do you have a philosophy section?'). It appears that for the customer the SERVICE (SE1) for her first need is finished (though unsuccessfully), and, as the next issue is brought up (NP2), a loop occurs in the structure of the encounter. However, the server still tries to deal with the first problem, even after the customer has signalled that her needs have changed. While the salesperson provides the information asked for (SE2), she still continues her search for the first request. Consequently, from her perspective, NP2 and SE2 are embedded elements, performed while SE1 is in progress. As this is not the case for the customer, an asymmetry arises, which continues up to the end of the interaction. At that point, the server is still busy operating the computer, searching for suitable books, while the customer impatiently tries to signal that the interaction is finished by using the discourse marker ‘okay' and an expression of thanks (FA). Even as she leaves the counter, the server is still involved in SE1. [Fig. 4 Representation of a sample (extract): customer and server columns with their non-perceptible and perceptible activities, the dialogue shown in a central column, covering SE1, NP2, SE2 and FA.] As emphasised before, the cognitive, verbal, non-verbal and physical elements of the interaction are “not discrete activities but aspects of an ongoing stream of behavior” (Arndt/Janney 1987: 4). Fig. 4 exemplifies how this interplay can be illustrated in the integrated model. The perceptible part of the extract contains mainly verbal activities of the participants, represented by the primitive act SPEAK. Additionally, we find instances of non-verbal behaviour, e.g. visual contact (‘S ATTEND eyes to C'), and physical actions like the salesperson's use of the computer (‘S OPERATE computer'). As pointed out above, the underlying cognitive processes are restricted to selected points of decision. The chronological order of the activities is represented by their position in the diagram, their causal relations by the arrows connecting the entries. Simultaneous actions are placed on one horizontal line, as, for instance, demonstrated in the last line of the extract, where the customer is speaking while the server is still involved in the computer search. Needless to say, the non-perceptible activities (which are frequently realised in a split second) usually run in parallel with the perceptible ones. The sequences of events can be derived from their vertical arrangement. Often, it is this interplay of features which clarifies apparently abnormal phenomena in the plain text versions. For example, the inclusion of physical activities can explain the occurrence of pauses in the verbal part. In the corpus, long pauses (of 5 seconds up to several minutes) are usually observed when physical tasks are performed. Our sample interaction contains five instances of long pauses caused by the server operating the computer and waiting for the search results. They vary in length from 13 seconds (as shown in the extract) up to 25 seconds. The neglect of physical and non-verbal aspects in the analysis of service encounters might lead to the impression that certain global structural elements are missing. The obligatory NEED PRESENTATION, for instance, is realised non-verbally in almost 30% of the cases, usually by handing over a product for payment (‘C PTRANS P to S'). These instances would, of course, escape the analyst of the plain text versions. Thus, for a comprehensive analysis of service encounters the concentration on the verbal parts of the interaction proves insufficient. With the integrated model discussed in this paper, however, a detailed picture of service interactions in bookstores can be achieved. As this model offers the chance to represent complex processes in a relatively clear graphic form, it could also be applied (with the necessary adjustments) to other types of service encounters. Besides, it might be interesting to test the potential of the graphic arrangement for the representation of other types of dialogues. References Arndt H, R W Janney 1987 InterGrammar. Toward an Integrative Model of Verbal, Prosodic and Kinesic Choices in Speech. Berlin, de Gruyter. Bös B forth. Integrative Darstellung englischer Serviceinteraktionen: verbale, nonverbale, physische und kognitive Aspekte. Dissertation. Ehlich K 1993 HIAT: A Transcription System for Discourse Data. In Edwards J, Lampert M D (eds), Talking Data: Transcription and Coding in Discourse Research. Hillsdale, NJ, Erlbaum, 123-147. George S 1988 Postscript: discourse descriptions and pedagogic proposals.
In Aston G (ed), Negotiating Service. Studies in the discourse of bookshop encounters. Bologna, Editrice CLUEB, 303-332. Schank R C 1972, Conceptual Dependency: A Theory of Natural Language Understanding. Cognitive Psychology 3, 552-631. Schank R C, Abelson R P 1977 Scripts, Plans, Goals and Understanding. Hillsdale, NJ, Erlbaum. Ventola E 1987 The Structure of Social Interaction. A Systematic Approach to the Semiotics of Service Encounters. London, Frances Pinter. 139 A Statistical Analysis of the Source Origin of Maltese – Abstract Roderick Bovingdon 99 Chetwynd Road Merrylands NSW 2160 Australia roderick_bovingdon@hotmail.com Angelo Dalli NLP Research Group1 University of Sheffield United Kingdom angelo@dcs.shef.ac.uk The most recent theories relating to the original source of the Maltese language point to a direct Sicilian-Arabic connection (Agius, 1990; Agius, 1993; Agius, 1996; Brincat, 1994). This paper presents the results of the first ever large-scale statistical analysis of Maltese using the newly formed Maltilex Corpus (Rosner et al., 1999; Rosner et al., 2000). Traditional etymological and categorical analyses were supplemented with data mining techniques to provide accurate results confirming traditional subjective notions. The Maltilex Corpus is made up of a representative mixture of newspaper articles, local and foreign news coverage, sports articles, political discussions, government publications, radio show transcripts and some novels. As of the time of writing, the corpus had over 1.8 million words and almost 70,000 different word forms, making it the largest digital corpus of Maltese in existence. A random sample of 1000 unique word forms was selected from the corpus and the etymology and category class noted down for every word in the sample. Words falling under multiple category classes were duplicated and a single category was entered for every word to ensure unique etymology/word class pairs. Weights were added to every entry representing the number of category classes associated with a particular word to maintain accurate statistics. A data matrix with 1034 entries was thus obtained, representing all possible etymology/word class pairs for the sample word forms. The data matrix was analysed using a custom written data mining tool to extract statistics about the relationship between etymology and word classes in Maltese. Overall statistics about the source language origins of Maltese, together with the most commonly occurring word classes were also extracted. The use of a data mining tool enabled us to analyse the data from two different perspectives – word class distribution for every etymological class and vice-versa. Maltese grammar and morphology remain to this day largely Arabic, but with distinct Romance and English morphological accretions (Aquilina, 1973; Aquilina, 1979). Traditionally the Romance element in Maltese was thought to commence with the coming of the Normans in Medieval ages and intensified during the long period of rule under the Knights of Malta. Pre-Norman Arabic content appears to have been heavily Sicilian based (Brincat, 1995). The greatest linguistic inroads from the Italian mainland occurred during the early days of the British rule when large numbers of political refugees sought and were granted asylum during the unification of Italy (Friggieri, 1979). The most recent linguistic influence on Maltese is English. 
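Before turning to the individual historical layers in more detail, the sampling and weighting procedure described above can be sketched schematically. This is a hypothetical illustration, not the authors' custom data-mining tool; the toy word forms and the particular weighting scheme (one unit of mass per word form, split across its category classes) are assumptions made only for exposition.

```python
# Hypothetical sketch, not the authors' custom data-mining tool: building a weighted
# etymology/word-class matrix of the kind described above and reading it from both
# perspectives. The toy word forms and the 1/n weighting of multi-class words are
# assumptions made purely for illustration.
from collections import defaultdict

# (word form, etymology, word classes) -- a toy stand-in for the 1,000-word sample
sample = [
    ("ktieb", "Arabic", ["noun"]),
    ("skola", "Romance", ["noun"]),
    ("futbol", "English", ["noun"]),
    ("sabih", "Arabic", ["adjective", "noun"]),   # multi-class entry, duplicated with weights
]

matrix = defaultdict(float)                        # (etymology, word class) -> weighted count
for _, etymology, classes in sample:
    weight = 1.0 / len(classes)                    # keep one unit of mass per word form
    for word_class in classes:
        matrix[(etymology, word_class)] += weight

# distribution of etymological classes, and of word classes, over the whole sample
by_etymology, by_class = defaultdict(float), defaultdict(float)
for (etymology, word_class), w in matrix.items():
    by_etymology[etymology] += w
    by_class[word_class] += w

total = sum(by_etymology.values())
for etymology, w in sorted(by_etymology.items(), key=lambda item: -item[1]):
    print(f"{etymology}: {100 * w / total:.1f}% of the sample")
```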
English has steadily and increasingly affected Maltese, adding another language facet to the overall structure of Maltese (Mifsud, 1995). Despite the relatively recent accretions from English and the vastly different morphological structure of the two languages, the assimilation of English lexemes into a Maltese mould occurs with the least possible disturbance to Maltese morphology, especially with English verbs. Interestingly, contemporary Arabic is adopting assimilative patterns similar to those of Maltese in its borrowings from the English-American lexis. This study clearly shows that, in addition to other aspects, Italian lexical influence upon present-day Maltese has exceeded the Arabic content in a quantitative sense. This development has also enriched Maltese, supplementing its purely root-based morphology with the additional productive Romance feature of catenation (Schweiger, 1994). 1 With support from the Maltilex Project and the Department of Computer Science and Artificial Intelligence at the University of Malta.
References Agius, Dionisius A. 1990. Il-Miklem Malti: a contribution to Arabic lexical dialectology, British Society for Middle Eastern Studies. Agius, Dionisius A. 1993. Reconstructing the Medieval Arabic of Sicily. Languages of the Mediterranean, Brincat, Joseph M. Msida, University of Malta. 119-129. Agius, Dionisius A. 1996. Siculo Arabic. Library of Arabic linguistics (Kegan Paul International) 12. Aquilina, Joseph. 1973. The structure of Maltese. Msida, University of Malta. Aquilina, Joseph. 1979. Maltese-Arabic Comparative Grammar. Msida, University of Malta. Brincat, Joseph. 1994. Gli albori della lingua maltese: il problema del sostrato alla luce delle notizie storiche di al-Himyari sul periodo arabo a Malta. Languages of the Mediterranean, Msida, University of Malta. Brincat, Joseph. 1995. 870-1054: Al-Himyari's Account and its Linguistic Implications. Msida, University of Malta. Friggieri, Oliver. 1979. Storja tal-Letteratura Maltija. Msida, University of Malta. Mifsud, Manwel. 1995. Loan verbs in Maltese: a descriptive and comparative study. Studies in Semitic languages and linguistics. Leiden: Brill. Rosner, Michael et al. 1999. Linguistic and Computational Aspects of Maltilex. Proc. of the ATLAS Symposium, Tunis. Rosner, Mike, Ray Fabri, Joe Caruana, M Lougraieb, Matthew Montebello, David Galea, and G. Mangion. 1999. Maltilex Project, University of Malta. Rosner, Mike, Ray Fabri, and Joe Caruana. 2000. Maltilex: A Computational Lexicon for Maltese, Msida, University of Malta. Schweiger, Fritz. 1994. To what extent is Maltese a Semitic Language? Languages of the Mediterranean, Msida, University of Malta.

Xara: an XML aware tool for corpus searching Lou Burnard and Tony Dodd Research Technologies Service, Oxford University Computing Services 13 Banbury Road, Oxford OX2 6NN lou.burnard@oucs.ox.ac.uk tony.dodd@btinternet.com tel: +44 1865 273285 fax: +44 1865 273275
From SARA to Xara Xara is the working name for a new version of SARA, the `SGML aware retrieval application' originally developed for use with the British National Corpus (BNC) in 1994. The system has been completely rewritten as a general purpose tool for searching large XML corpora, with a particular focus on the needs of corpus linguists, with close attention to new XML-based encoding standards, and with the benefit of hindsight derived from a decade of feedback from hundreds of SARA users worldwide.
The Xara system combines the following components: (1) an indexer, which creates inverted file style indexes to a large collection of discrete XML documents; (2) a server, which handles all interaction between the client programs and the data files; (3) a Windows client, which handles interaction between the server and the user. The modularity of this architecture has several advantages, permitting, for example, the development of multiple specialized client programs for different applications or styles of usage. In addition, an index building utility, called Indextools, is supplied with Xara, which simplifies the process of constructing a Xara database. Its chief function is to collect information about the corpus to be supplied additional to that present in any pre-existing corpus header, and to produce a validated and extended form of the corpus header. It can also be used to run the indexer and test its output. Rather than review extensively its history and design philosophy in this short presentation, we give here some general comments covering the following aspects of the system: • XML support • Corpus related features • Ease of use XML support Xara will process any XML encoded corpus. The more detail present in the tagging, the more facilities are available to the client but the minimal requirement is only that the text be well-formed XML. If the corpus to be processed specifies a document type definition (DTD), then Xara will validate it at indexing time, and will not proceed if any validity errors are discovered. Unlike earlier versions of the program, any DTD may be used for this purpose, though Xara was (naturally) designed with the likely needs of such widely used DTDs as the TEI or CES in mind. Also unlike earlier versions of the program, as previously noted, no DTD at all need be supplied. Xara can thus do something useful with the full range of digital material one might wish to build into a corpus. At one extreme, we will demonstrate how it can be used to provide basic searching facilities for a collection of Project Gutenberg style texts, innocent of any explicit descriptive markup at all; at the other, we will show how it can also take full advantage of the rich annotation present in a multilingual corpus produced in full conformance with the XCES Guidelines, containing detailed feature structure analysis, POS-tagging, and explicit lemmatization. Oddly enough, the most problematic material is likely to be texts which have been marked up in loose (syntactically incorrect) HTML, such as that 142 generated by automatic conversion from Microsoft Word; fortunately, utilities such as Dave Ragget's Tidy are readily available to generate well-formed XML from such conversions. With XML support, comes Unicode support. Xara uses Unicode internally to represent all character data: it can thus handle text in any language, and any combination of languages. To take full advantage of this, the user of the system needs convenient methods both for displaying and for inserting Unicode characters at their workstation. Good Unicode fonts are now available for display of texts in several different scripts (we have so far tested Chinese, Eastern European, Medieval, and Ancient Greek scripts) and the number continues to grow. Xara does not assume the existence of any particular font, however, and allows the user to select the display font at run time. 
In common with other XML systems, the system will correctly process character entity references found in the data, such references being retained untranslated in the underlying corpus index, but rendered as the appropriate Unicode code point when displayed in non-XML modes. To enter characters not found on the keyboard, for example to search for words containing them, however, an appropriately configurable input system is necessary. Dynamic keyboard redefinition is built into more recent operating systems, but is still a little user-unfriendly in most cases; the Xara client therefore includes its own keyboard redefinition facilities, which allow selection of specific characters from a Unicode table by point and click, temporary mapping of keyboard keys, or complete redefinition of a new keyboard map, which can be loaded as needed. Features for Corpus Linguists Any XML or SGML aware search engine has the ability to locate specific tagged components and to carry out searches within the context of such components. Most also have the ability to reorganize and display search results in a variety of forms. SARA extended these facilities with some additional, more lexically-motivated, abilities. These include: • implicit or explicit tokenization of element content • implicit or explicit lemmatization of element content • multiple keys for index searching • expandable automatic collocation search Xara supports the full range of facilities originally provided by Sara, but with several modifications and simplifications to the interface. A number of new facilities have also been added. Xara inherits from SARA a rich range of query facilities. The user can search for substrings or regexp-style patterns, words, phrases, or the tags which delimit XML elements and their descriptive attributes, either simply checking for their presence or absence in the lexical index maintained for the corpus (and inspecting their frequency or collocative patterns therein), or retrieving and displaying actual instances of the words etc. sought for. Searches can be made using additional keys such as part of speech, or root form (lemma) of a token, specified either explicitly in the tagging of the texts, or implicitly by means of a named algorithm. Xara also supports a variety of scoped queries, searching for combinations of words etc. in particular contexts, which may be defined as XML elements, or as combinations of other identifiable entities, or as stretches of text. Such searches may be order-sensitive or insensitive. Different kinds of search can be combined to form complex queries of various kinds, using either a simple graphical interface, or a (rather esoteric) `Corpus Query Language' (CQL). This language defines the full capability of the query interface and forms a major part of the protocol by means of which client and server modules communicate. In Xara, it has been re-expressed using an XML syntax, in line with the desire to leverage XML standards wherever possible. The client also supports a simple ECMAscript-like scripting language which makes direct calls on an API defined in very similar terms to the CQL. 143 Like its predecessor, Xara displays the results of searches either one at a time or within a traditional KWIC style window which can be sorted, thinned, expanded etc. The context of hits located in this way can be explored by expanding it, up to the full text level if necessary. 
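For readers unfamiliar with KWIC displays, the following minimal sketch illustrates the general technique of building and sorting a concordance. It is a generic illustration only and bears no relation to Xara's internals, its query language, or its display code; the function name and the sample text are invented for the example.

```python
# A minimal keyword-in-context (KWIC) builder, shown only to illustrate the kind of
# sortable concordance display discussed here. It is a generic sketch and bears no
# relation to Xara's internals, its query language, or its display code.
import re

def kwic(text: str, keyword: str, span: int = 8):
    """Yield (left context, hit, right context) triples for each match of `keyword`."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            yield left, token, right

sample_text = ("Service encounters are everyday social contacts in which goods and "
               "services are provided; most encounters are still carried out face to face.")

# sort the hits on their right-hand context, as a KWIC window typically allows
for left, hit, right in sorted(kwic(sample_text, "encounters"), key=lambda row: row[2]):
    print(f"{left:>45} | {hit} | {right}")
```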
Unlike its predecessor, Xara allows user-definition of a range of formatting properties for KWIC and single hit displays, using a subset of the formatting properties defined by the W3C standard Cascading Style Sheets (CSS); it also allows the user to export results in a simple XML format which can then be reprocessed, either by Xara or by any other XML-competent application, such as a word processor or XSLT engine. Corpora can be reorganized or partitioned in a user-defined way, using the results of any query, the values of specified element/attribute combinations, or a manual classification. Searches carried out across partitioned corpora can be analysed by partition, so that the client can display the relative frequencies of a given lexical phenomenon in texts of different categories identified in a corpus.
Using Xara For existing users of SARA, the main new feature of Xara will probably be the facilities it offers for the indexing of new corpora. As noted above, all the parameters required by the indexer and server are now gathered and stored transparently within a TEI-conformant XML header file, rather than being supplied at runtime by various obscure control files, command line options, or hard-coded declarations. A new Windows utility has been included to facilitate creation and modification of this data: it allows the user to control the behaviour of the indexer by selecting available options from a series of dialogues, and saves the results of these decisions in an appropriate part of the TEI/XCES header. The same utility can be used to check the validity of the supplied corpus files and to run the indexing utility, and provides an interface for testing that the system index files have been correctly generated. Its use is not however essential: other XML-aware software can be used to define a corpus header which Xara will use, and the indexer utility can be run independently. The indexer needs to be provided with the following information:
• how PCDATA (element content) is to be tokenized
• how tokens are to be mapped to index terms (for example, by lemmatization or by the inclusion of additional keys)
• how indexed terms are to be referenced in terms of the document structure
• how and whether XML tags and attributes are to be indexed
Much, perhaps most, of this information is implicit in the XML structure for a richly tagged corpus: one could imagine a corpus in which every index term was explicitly tagged, with lemma, part of speech, representation etc. In practice, however, such richly tagged corpora remain imaginary, and software performs the largely automatic task of adding value to a document marked up with only a basic XML structure. The new indextools utility specifies how the task is to be performed in a standardized way, using the existing structure of the TEI header (see http://www.tei-c.org/Guidelines/HD.htm) to hold the specification. Because our original design goal was to avoid any need for extension or modification of the existing TEI DTD, we have used very general constructions rather than defining more `application focussed' tags. We have, however, done this in a fairly consistent and easily validated form, so that when TEI support for external namespaces is available, it will be easy to make our generic tags more specific. The Xara system is currently under beta-test and will be demonstrated at the conference.
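The information the indexer requires, namely how element content is tokenised, how tokens map to index terms, and how occurrences are referenced, can be made concrete with a toy inverted-file index. The sketch below is a generic illustration of that idea and should not be read as Xara's actual implementation; the document names and the lowercasing "index term" function are invented for the example.

```python
# A toy inverted-file index over two tiny XML documents, sketching the general steps the
# indexer is said to need (tokenise element content, map tokens to index terms, record
# where each term occurs). It is purely illustrative and is not Xara's indexer.
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

documents = {
    "doc1.xml": "<text><s>The customer makes a statement.</s></text>",
    "doc2.xml": "<text><s>Which statement did the customer make?</s></text>",
}

def index_term(token: str) -> str:
    # Stand-in for lemmatisation or for adding extra keys such as part of speech.
    return token.lower()

index = defaultdict(list)                       # term -> postings of (document, position)
for name, xml_source in documents.items():
    pcdata = " ".join(ET.fromstring(xml_source).itertext())   # element content only
    for position, token in enumerate(re.findall(r"\w+", pcdata)):
        index[index_term(token)].append((name, position))

print(index["statement"])                       # [('doc1.xml', 4), ('doc2.xml', 1)]
```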
We hope to make an initial general release available by the summer of 2003, but participants wishing to experiment with it earlier are very welcome to participate in the beta test. The first release will operate in any modern Microsoft Windows environment; it is intended, however, to port both indexer and server to Unix environments as soon as possible; it is also intended to licence open source versions of at least those components of the system at that time (open source licensing of the Microsoft-dependent components is rather less likely, for obvious reasons) 144 Expressions and structures of the delexical verb ÊÁÍ [“MAKE” / “DO”] in Modern Greek language: A corpus-based approach to newspaper articles1. Marianna N. Christou Postgraduate research student in Greek language and lexicography at the University of Birmingham (UK) 1. Introduction This is perhaps the first lesson to be learned from corpus study. Language cannot be invented; it can only be captured. (Sinclair 1997: 31) In this study I intend to explore the usage of the Modern Greek verb ÊÁÍ2 by means of a sub-corpus extracted from the 30-million-word written Hellenic National Corpus (HNC) developed by the Institute for Language and Speech Processing (ILSP) in Greece. In the belief that there is a cline of semantic differentiation between fixed expressions with figurative meaning on the one hand, and ‘simple’ collocations of ÊÁÍ with nouns on the other, I stress the need for applying this cline to language research. In addition, I present arguments for the explanation of the syntactic distribution of such phrases, which would contribute to the understanding of delexical structures. For the purposes of this study, I start with some methodological issues. Next, I adopt the term cline of idiomaticity for the development of a theoretical framework that supports the generalised structure of ÊÁÍ + noun, and divides all the instances of my sub-corpus into five categories. Subsequently, I refer to the distribution of the verb and its complement within either the same or different clauses. Finally, I discuss the significance of adopting the proposed cline of idiomaticity in dictionary-making, since this proposes a shift of the lexical load that a lexicographer needs to clarify and describe. 2. Methodology This study, as has already been mentioned, is corpus-based, since it was built on a sample (see Appendix I) extracted from the HNC (on-line access: http://hnc.ilsp.gr). 2.1. The data extracted from the corpus and their processing The whole corpus of my study was compiled by means of a lemma query. As the question posed initially was the examination of the role that the (delexical) verb ÊÁÍ plays in Modern Greek language, I expected that real data – even limited in number – would yield some fruitful results to this end. Thus, I restricted the lemma query of the verb ÊÁÍ to a particular medium (two of the most popular Greek newspapers, Åëåõèåñïôõðßá and Ôï ..µá, see Hatzigeorgiu et al. 2000: 1737), genre (informative texts) and topic (related to society). It has to be noted, though, that I made no further selection (e.g. according to more specific genre or topic), with the intention of achieving at least a representativeness “for certain high frequency linguistic features” (McEnery and Wilson 2001: 78). Therefore, my sub-corpus was at the same time small – compared with the whole 30-million-word HNC – and sufficient for the needs of the present research. 
To be more precise about the identity of the corpus used, this consisted of selected articles written between 1993 and 1997. The results showed 6,200 texts (4,851 from Åëåõèåñïôõðßá and 1,349 from Ôï ..µá), within which 4,236 instances (tokens) of the lemma KAN were automatically extracted. Then, all examples were processed by use of WordSmith Tools (henceforth WordSmith). 1 I would like to thank Mrs. Maria Gavriilidou, Head of the Electronic Lexicography Department of the Institute for Language and Speech Processing (ILSP) in Athens, and Mr. Nikos Hatzigeorgiu, Head of the ILSP branch in Xanthi, for the invaluable information they provided me with concerning the Hellenic National Corpus (HNC). I am also grateful to Professor Emeritus Geoffrey N. Leech, Dr. Andrew Wilson, and Mr. Philip King for their help and useful comments on this study. 2 Greek verbs are capitalised, when they are used as lemmas. 145 WordSmith, developed by Mike Scott3, proved to be an invaluable tool for the processing even of a language with an alphabet other than the roman. It provided the concordance for the instances extracted from the sub-corpus (see Appendix II), since the equivalent Greek tool can only be applied to the whole HNC for the moment. Nevertheless, WordSmith was by no means the most perfect and accurate solution to my problem. This became more obvious when I saved the HNC concordance output (4,236 instances) to file, and then ran Wordsmith on it. By virtue of some partial incompatibility between the two programmes, I had to remove some incomplete, misread or unrecognisable sentences manually. After that, 4,059 tokens of the lemma ÊÁÍ underwent a closer examination on an advanced stage of research. Subsequently, I sorted out the instances that I would later use by adding extra information in the set column of the concordance (Categories A, B, C and D, see section 3.2. ff.). Next, I both re-sorted the whole sub-corpus and weeded out the sentences that would be of no interest to me (Category E, see section 3.2.5.). Thus, I ended up with 2,139 instances. Since the annotation of the corpus used was considered an essential prerequisite for this research, I added simple tags4 manually. It is important to acknowledge here that this (non-machine-aided) procedure may not be one hundred percent accurate and reliable; however, the aims of the present study were accomplished, since all problems encountered were solved to a great extent. 2.2. Problems encountered and their solutions The table below illustrates some of the problems encountered, although not always anticipated throughout the manipulation of the data. Moreover, it is also hereby shown how I temporarily (i.e. for the purposes of the present study) solved them, along with how I perceive their potential amelioration. 
Table 2.2.: Problems encountered and their (temporary and future) solutions
• Problem: no reliable parsers / taggers for the Modern Greek language available for the moment. Temporary solution: manual annotation of the verb and the noun only across all 2,139 instances. Future solution: improvement of the existing tools (of the ILSP) and / or development of new ones.
• Problem (a consequence of the previous one): Greek text untagged – no possibility of identifying the systematic co-occurrences. Temporary solution: practice on a combination of WordSmith, Excel and Notepad to find collocations. Future solution: amelioration of the concordance tool, so that it can carry out more complex commands.
• Problem: partial incompatibility between the number of results of the HNC (4,236) and their insertion into the WordSmith programme (4,059). Temporary solution: manual elimination of the (few) incomplete, misread or unrecognisable examples. Future solution: improvement of both the WordSmith programme and the Greek concordance tool (the latter does not accept a query on a particular sub-corpus).
• Problem: large amount of instances (4,236) needed for a representative (insofar as possible) sample of newspaper articles. Temporary solution: reduced number of instances (2,139) taken into account – a theory based on their closer reading. Future solution: incorporation of all instances found, which could either consolidate or modify this theory.
• Problem: significant imperfections of the HNC (e.g. words not disambiguated yet). Temporary solution: reliance placed on the trustworthiness of the results. Future solution: enhancement of the morphological lexicon and machine-aided disambiguation.
• Problem: only one sentence of the corpus available, the one containing the node word (need for tracing back for more context, which was impossible using the WordSmith concordance). Temporary solution: work simultaneously at three levels: results and context (HNC), concordance (WordSmith) and manual annotation (Notepad). Future solution: expansion of the number of sentences that the HNC allows on its web interface (e.g. by allowing users to define the amount of context needed themselves).
• Problem: certain limits of space and time. Temporary solution: a general theory introduced. Future solution: evaluation and review of the work done.
3 For more information on WordSmith see McEnery and Wilson (ibid. 211). 4 Leech and Smith (1999: 24) have demonstrated that annotation is useful for ‘inputting' information, whereas a concordance programme helps in ‘outputting' information from a corpus.
3. Data analysis 3.1. Different approaches to the notion of ‘delexical structures' Until recently, there has been no attempt to standardise the terminology through which verbonominal structures (Stein 1991: 4, Nakas 2000: 125 ff.)5 are defined. That is, several of the terms suggested refer either to the verb itself or to the noun. More precisely, the idea of [semantically] ‘empty' or ‘light verbs' (Jespersen 1942: 117 ff.), on the one hand, has led modern theory to extremes, i.e. it has been both rejected (by Stein 1991: 15) and adopted (by Biber et al. 1999: 428). On the other hand, the labels of ‘eventive object' and ‘deverbal noun' (ibid. 128 and 428, Quirk et al. 1985: 750 ff.)6 are, among other labels, attributed to the noun that collocates with verbs of this kind. In this study, I shall use the term ‘delexical verbs' as defined by Sinclair et al. (1998: 147): [t]here are a number of very common verbs which are used with nouns as their object to indicate simply that someone performs an action, not that someone affects or creates something.
These verbs have very little meaning when they are used in this way. for two reasons: first, in order to benefit from corpus evidence to support the cline of semantic shift for the delexical structures, and second, in order to complement the definition in Babiniotis’ dictionary (2002: 246), in which it is stated that the meaning of a verb sometimes takes the form of a periphrasis instead of being represented by a cognate simple verb/lexeme7. As we shall see later, even though there is not always such a possibility of substituting the periphrasis for a simple verb deriving from the noun's stem, we still define the verb as delexical. In this sense, we accept that the ‘lexical load’ is carried by the second part of the phrase (Live 1973: 31). As regards the present study, the results of the concordance of the verb ÊÁÍ revealed a variety of word-classes (e.g. noun, adjective, article, adverb, pronoun, preposition, conjunction etc.) that are commonly combined with it. However, the major issue of my concern will be the collocations of this verb with its nominal complements (nouns / noun phrases). 3.2. Collocations of the delexical verb ÊÁÍ + (noun / noun phrase): the cline of idiomaticity Adopting Sinclair's terminology (1991: 115), ÊÁÍ could be regarded as a ‘node', and its complement – any noun / noun phrase, in this case – could be considered its ‘collocate’ in that [c]ollocation is the occurrence of two or more words within a short space of each other in a text … Collocations can be dramatic and interesting because unexpected, or they can be important in the lexical structure of the language because of being frequently repeated (ibid. 170). 5 In his article (1968), Nickel introduces the equivalent concept of “complex verbal structures”, while Live (1973: 32) preferably accepts a ‘phrasal form’ of the ‘light verbs'. 6 Stein (1991: 2) and Allan (1998: 2-3) provide a more general overview of the terminology used in the past. 7 Using the term simple (= single-word) verb I have translated the Greek µïíïëåêôéêü ..µá (e.g. äçëþíù [“state”] instead of ÊÁÍ äÞëùóç [“make a statement”]) which a) constitutes one word, b) has an equivalent meaning with the periphrasis and c) should be contrasted to (English) phrasal verbs (two or more words). 147 On the one hand, before looking into the corpus, I surmised that the core meaning of the verb would be expanded as well as restricted to some extent, since ÊÁÍ is among the most common verbs in Modern Greek (cf. “make” / “do”8 (English), “faire” (French), “machen” / “tun” (German), “hacer” (Spanish), etc.). A closer examination of the corpus, on the other hand, allowed me to provide concrete examples in support of the theory that I will develop next. Having named this cline of idiomaticity, on the basis of a proposal by Biber et al. (1999: 1026), I shall further suggest five distinct Categories. 3.2.1. Category A: ÊÁÍ + noun = fixed expression with figurative meaning (ÊÁÍ öôåñÜ) The first Category comprises instances of fixed (idiomatic) expressions with the verb ÊÁÍ, which have figurative meaning. The term ‘fixed expressions’ denotes that, whereas the verb conjugates regularly, the collocate remains uninflected, e.g. ÊÁÍ öôåñÜ [“vanish”], ÊÁÍ èñáýóç [“be popular”], ÊÁÍ <êÜôé> öýëëï ... ..... [“search sth. thoroughly”]. Furthermore, it has to be made clear that adjectives modifying the nominal complement of the verb ÊÁÍ do not frequently intervene, except in cases where they form part of the expression, e.g. ÊÁÍ ôá ....ß. 
µÜôéá [“turn a blind eye to sth.”] (but not * ÊÁÍ (ôá) µÜôéá), ÊÁÍ ÷ñõóÝò ........ [“earn a lot of money”] (but, ÊÁÍ äïõëåéÝò has a totally different (literal) meaning). Similarly, an indirect object of the verb is sometimes essential for the syntactic structure to be considered as grammatical, as in ÊÁÍ <óå .......> ôï ....... [“prepare and invite sb. for a meal”] and ÊÁÍ <óå .......> ôïí ß.. .ß.... [“make life unbearable for sb.”]. Finally, examples of rather informal or colloquial set phrases mentioned in the newspaper articles have been incorporated in the same Category, given that they constitute collocations: ÊÁÍ êÝöé [“feel like doing sth.”], (different from ÊÁÍ <êÜðïéïí> êÝöé [“like one's company”]), ÊÁÍ ðáé÷íßäé [“take the initiative”], ÊÁÍ êïõµÜíôï [“be in control” / “be the boss”]. 3.2.2. Category B: ÊÁÍ + noun = semi-fixed expression with figurative meaning (ÊÁÍ (+ adj.) âÞµá) Category B includes set phrases that have figurative meaning, since they do not originate directly from the literal content of the words in question, which is similar to the previous case. The difference, though, lies in that in the second Category the (idiomatic) expressions are semi-fixed, i.e. allow adjectives, pronouns, articles, etc. to intervene and modify the noun, e.g. ÊÁÍ Ýíá ............ ß.µá [“take a decisive step”], ÊÁÍ åíôõðùóéáêÞ ...... [“take an impressive change in direction”]. Moreover, it should be clarified that some of the adjacent groups of this kind can be used both in the singular and plural. Here are some examples: ÊÁÍ µéá ............. ...... [“do a masterstroke”] and ÊÁÍ ôéò ........... ........ [“act as is necessary”], ÊÁÍ ðñïóåêôéêü .....µá <ðñïò .......> [“try carefully to approach sb. / sth.”] and ÊÁÍ êÜðïéá .....µáôá [“try to approach sb. / sth. somehow”]. Similarly, ÊÁÍ ðáñáôÞñçóç and ÊÁÍ ðáñáôçñÞóåéò [“reprimand sb. for doing / saying sth.”] appear in both numbers. However, the findings of the corpus analysis underpinned the fact that there are also some set phrases, which are commonly applicable to either number, e.g. ÊÁÍ Ýñùôá [“make love”] (but not * ÊÁÍ Ýñùôåò), ÊÁÍ èõóßåò [“make sacrifices”] (occurring only in plural in my corpus, although it has a (less frequent) singular, as well). In this Category, account is taken of standardised Greek expressions in the singular only, such as ÊÁÍ ÷ñÞóç [only in the sense of “take drugs” / “drink alcohol” etc.], ÊÁÍ ôï .... (ôïý ...µïõ) [for “disseminating information around the world”] and ÊÁÍ áðåñãßá ...... [“go on a hunger strike”], as well. 3.2.3. Category C: ÊÁÍ + noun = main delexical structure with literal meaning (ÊÁÍ äÞëùóç) While figurative meaning was reflected in the previous two Categories, where the verb ÊÁÍ had an idiomatic and not a delexical function, Category C provides instances of what I have called main delexical structures with literal meaning. These are: main, in contrast with the subordinate ones (as explained below), because they can easily be substituted for a simple verb which shares the same meaning and stem with the collocate (noun, in most cases); delexical structures, as their most meaningful item is the noun; and literal, since they are meant in the noun's original sense. 8 Altenberg and Granger (2001: 173-175) place the English verbs ‘make’ and ‘do’ among high frequency verbs, which are interesting from a cross-linguistic perspective. 148 It is not surprising that the most frequent main delexical structure found in articles from the Greek press is by far the phrase ÊÁÍ äÞëùóç [“make a statement”]. 
A similarly high rate of the cognate verb ÇËÍ was also anticipated and eventually found (see Appendix III). These significant occurrences can be explained by the actual fact that eminent persons of public life, such as Prime Ministers, Ministers, VIPs, etc., make official statements, which the reporters note down. This also alludes to the question of why half of the most popular structures in this Category are closely related to speech acts (cf. ÊÁÍ áíáöïñÜ [“make reference to sb. / sth.”], ÊÁÍ ðñüôáóç [“make a suggestion”], ÊÁÍ ðáñݵâáóç [“intervene verbally”], ÊÁÍ áíáêïßíùóç [“make an announcement”]. It could be further argued here that the combination of verb and noun makes the foregrounding of information much easier in a language such as Greek, where the order of the sentence constituents could be characterised as either loose or rather variable in terms of focalisation. Following this argument, we can shed light on cases, such as ðñïóðÜèåéá ..... [“he made an attempt”], Ýëåã÷ï ...... [“they checked”] and Ýñåõíá ...... [“they searched”], where emphasis is placed on the noun (for the syntactic distribution of the verb ÊÁÍ, see also section 3.3. below). A special emphasis is also placed on collocates that not only precede the node, but are modified as well, e.g. áñ÷çãéêÞ / åðéèåôéêÞ .µöÜíéóç ..... <ï X.> [“X. appeared as leader / having an aggressive attitude”]. Lastly, as regards the delexical ÊÁÍ ÷ñÞóç [“make use of sth.”] (in its literal sense, instead of “use”), the concordance of the corpus showed that this is most commonly combined with another noun in the genitive, nothing intervening between ÊÁÍ and ÷ñÞóç in most cases. 3.2.4. Category D: ÊÁÍ + noun = subordinate delexical structure with literal meaning (ÊÁÍ ëüãï) The fundamental difference between this and the previously discussed group of expressions lies in that the structures which are brought together in Category D are still delexical (so as to complement Babiniotis’ definition (2002: 246), see 3.1.), but subordinate, in this case. I shall call subordinate those patterns that cannot be substituted for a simple verb, purely because no such verb derives from the noun's stem in Greek, for instance ÊÁÍ êáêü [“do harm”] and its opposite, ÊÁÍ êáëü [“do good to sb.”], ÊÁÍ ðáñÝá [“keep sb. company”], ÊÁÍ öáóáñßá [“make noise”]. In some similar cases, even though the cognate simple verb may exist, it can derive from an older tradition of Greek, such as Ancient Greek or even êáèáñåýïõóá (katharevousa), and therefore may nowadays have neither the same meaning nor the same use. For example, ÊÁÍ ëÜèïò = óöÜëëù [“make a mistake”] is rather distinct from ëáíèÜíù [“be concealed”], even though they share the same stem; ÊÁÍ äéÜëïãï [“converse with sb.”] is currently much more preferable than the ‘antiquated’ äéáëÝãïµáé having the same meaning; and ÊÁÍ Ýêêëçóç [“make a plea”] cannot be replaced by åêêáëþ [“make an appeal”], since the latter is restricted to juridical terminology. The following three collocations are subordinate delexical structures, as well, for their basic meaning is close to the literal one proposed by the noun: ÊÁÍ ëüãï [“refer to sth.”] (cf. ëÝãù [“say”]), ÊÁÍ µíåßá [“make reference to sb. / sth.”] (cf. µíçµïíåýù [“mention”]), ÊÁÍ ôçí ....... [“make a strong impression”] (cf. åêðëÞôôù [“surprise”]). The first structure is considerably the most common of this Category (cf. also ÊÁÍ äÞëùóç in Category C), whereas the collocate of the second is usually modified by the adjective éäéáßôåñç [“special”] (ÊÁÍ éäéáßôåñç µíåßá). 
In the third phrase the article ôçí is essential and adds sense to the meaning of åíôõðùóéÜæù; it can thus be dissociated from the main delexical structure ÊÁÍ Ýêðëçîç = åêðëÞôôù [“surprise”]. 3.2.5. Category E: ÊÁÍ + noun = lexical structure – both items carry a literal lexical load (ÊÁÍ Ýíá .....) The last Category of this cline of idiomaticity is composed of combinations of ÊÁÍ with any noun, so that both the verb and its complement carry a literal lexical load. In fact, these are neither idiomatic nor delexical, but still constitute an extreme of the cline, since they “retain the core meaning” of the verb (Biber et al. 1999: 1027). In this way, the core meaning of ÊÁÍ is expanded to multiple semantic fields, thus revealing the polysemy of this word (e.g. ÊÁÍ bearing the meaning of “arrange an event” (ÊÁÍ Ýíá .....), “broadcast” (ÊÁÍ µéá ....µðÞ), “construct”, “produce”, “create”, “accomplish”, “commit”, etc.). However, because these lexical structures are too numerous to identify even in such a small corpus, and their analysis would require considerable space and time, they are not discussed here. To sum up, the following diagram illustrates the cline of idiomaticity, as described in section 3.2:
Category A: ÊÁÍ + noun = fixed expression with figurative meaning (ÊÁÍ öôåñÜ)
Category B: ÊÁÍ + noun = semi-fixed expression with figurative meaning (ÊÁÍ (+ adj.) âÞµá)
Category C: ÊÁÍ + noun = main delexical structure with literal meaning (ÊÁÍ äÞëùóç)
Category D: ÊÁÍ + noun = subordinate delexical structure with literal meaning (ÊÁÍ ëüãï)
Category E: ÊÁÍ + noun = lexical structure (both items carry a literal lexical load)
Diagram 3.2.: The cline of idiomaticity of the verb ÊÁÍ
3.3. Syntactic distribution of the node ÊÁÍ and its (noun) collocates It has already been mentioned (in 3.2.3.) that the (noun) collocates can either precede or follow the node (verb). To demonstrate this, the table below charts the frequency of occurrence of nouns before and after the verb, as well as before the subordinate clause (there were no findings in relation to nouns following the subordinate clause). (NB: freq. indicates absolute frequency, i.e. frequency of occurrences out of 2,139 instances):
Table 3.3.: The frequency of syntactic distribution of the node ÊÁÍ and its (noun) collocates (Categories A-D)
Category | noun following verb (same clause): hits, freq. (%) | noun preceding verb (same clause): hits, freq. (%) | noun in main clause, verb in subordinate clause: hits, freq. (%) | Total
A. | 44, 2.06 | 3, 0.14 | 2, 0.09 | 49
B. | 173, 8.09 | 11, 0.51 | 17, 0.79 | 201
C. | 928, 43.38 | 199, 9.30 | 243, 11.36 | 1,370
D. | 491, 22.95 | 17, 0.79 | 11, 0.51 | 519
Total | 1,636 | 230 | 273 | 2,139
3.3.1. Distribution within the same clause According to the above results, the noun is most likely to follow the verb when they are both found in the same clause (main or subordinate). This is not surprising, since it is commonplace for the complement (object) to come after the predicate in Modern Greek, in a ‘natural' order of sentence constituents (Subject-Verb-Object / Verb-Subject-Object, see also Clairis and Babiniotis 1999: 298 ff.). 3.3.2. Distribution within different clauses: noun first More interesting, however, is the case of the collocate preceding the node, since this offers flexibility to the syntax of the sentence and causes the foregrounding of information. By placing the noun – i.e.
the item that carries the lexical load chiefly in Categories C and D – first, emphasis is given to the complement, while the reader's attention is drawn to the ‘unexpected’ order of the constituents. As Table 2 indicates, in approximately one third of the cases the noun comes before the verb, which is highly significant for the focalisation of the main information provided. As a final point, I should clarify that the subordinate clauses were almost exclusively relative clauses that facilitated the structure ‘Object-Verb-Subject,’ thus highlighting simultaneously the first and the last position of the constituents’ occurrence, again because of the disturbed ‘natural’ order, as described earlier. 150 4. Implementing the cline of idiomaticity in relation to dictionary-making Having analysed the theoretical framework of the cline of idiomaticity, it would be of supreme importance to show a way of utilising it during the compilation of a dictionary. The contribution of each separate Category could be summarised as follows: Category A provides all fixed expressions that have figurative meaning and usually constitute an obstacle for the foreign language learner. Furthermore, phrases of this Category are commonly used by native speakers. Category B provides the semi-fixed expressions with figurative meaning, which are quite helpful for the understanding of metaphor in language. In addition, the Category in question designates frequent collocations of nouns being modified by adjectives and other word-classes. Category C offers the analysis of a simple verb into its main delexical structure, which has literal meaning and can be used alternatively, in accordance with the speaker's / writer's intentions. The delexical structure has the advantage of the noun being both broadly modified and extensively preferred in a focussed (i.e. preceding the verb) position. Category D supports the subordinate delexical structure that cannot be substituted for a verb deriving from the noun's stem (the substitution is possible in the previous case). Nevertheless, structures of this Category are potentially replaced by a synonymous, simple verb with literal meaning, and synonyms belong to the lexicographer's field of research. Category E is equally essential for the dictionary's needs, since it reveals a thesaurus of multiple lexical meanings of a word. Moreover, polysemy can be examined to a significant extent through real examples extracted from a corpus. 5. Synopsis of the results and conclusion For the needs of the present study, I used a sub-corpus of the more extensive HNC developed by the ILSP in Greece. This sub-corpus consisted of articles from two popular Greek newspapers (medium) and was representative of informative texts (genre) with social content (topic). My main concern was to look into the usage of the common verb ÊÁÍ [“make” / “do”] and for this reason I adopted and put forward a theory on a cline of idiomaticity. With the help of the WordSmith software and by adding tags to the nodes and collocates that I would later use, I worked out the concordance of the corpus and focussed on the colligation of ÊÁÍ + noun. I then divided the latter into five main Categories, according to the semantic load that the verb carried within the phrase. Thus, I suggested that there is a cline for ÊÁÍ + noun, which ranges from fixed (idiomatic) expressions with figurative meaning (Category A) to lexical structures, where both items carry a literal meaning (Category E). 
In the middle, there exists what I called ‘semi-fixed expressions with figurative meaning’ (Category B), as well as the two Categories of (main and subordinate) delexical structures (Categories C and D, respectively). Having looked through the grammatical, lexical and semantic structure, I brought up the issue of the syntactic distribution of the phrases in question. My results attempted to make clear that the collocate commonly follows the node in these cases, while the noun often precedes the verb, when there is a relative clause following. Finally, I tried to highlight the usefulness of the cline of idiomaticity for lexicographic purposes. Given that most expressions and set phrases constitute a problematic field for learners, they should be adequately and appropriately described in the dictionary. Moreover, the delexical structures offer a wide range of possibilities for noun modification, foregrounding or focus, whereas a variety of collocations can be explained through the polysemy of the verb. However, these are matters that remain to be further analysed, since they go beyond the scope of the present research. 151 References Allan Q 1998 Delexical Verbs and Degrees of Desemanticization. Word 49(1): 1-17. Altenberg B, Granger S 2001 The Grammatical and Lexical Patterning of MAKE in Native and Non-native Student Writing. Applied Linguistics 22(2): 173-194. Babiniotis G 2002 Ëåîéêü ... .... ......... ....... (µå ...... ... .. ..... ..... ... ......) (äåýôåñç ......). [Dictionary of Modern Greek Language (with Comments on the Correct Usage of Words) (second edition)]. Athens, Lexicology Centre (in Greek). Biber D, Johansson S, Leech G, Conrad S, Finegan E 1999 Longman Grammar of Spoken and Written English. London, Longman. Clairis Chr, Babiniotis G 1999 ÃñáµµáôéêÞ ... .... ......... (äïµïëåéôïõñãéêÞ – åðéêïéíùíéáêÞ) ÉÉ. Ôï ..µá: Ç ........ ... µçíýµáôïò. [Modern Greek grammar (structural-functional – communicative) II. The verb: The organization of the message]. Athens, Greek Letters (in Greek). Hatzigeorgiu N, Gavrilidou M, Piperidis S, Carayannis G, Papakostopoulou A, Spiliotopoulou A, Vacalopoulou A, Labropoulou P, Mantzari E, Papageorgiou H, Demiros I 2000 Design and Implementation of the Online ILSP Greek Corpus. LREC 2000 Proceedings 3: 1737-1742. Jespersen O 1942 A Modern English Grammar on Historical Principles, part 6: Morphology. Copenhagen, Ejnar Munksgaard. Leech G, Smith N 1999 The Use of Tagging. In van Halteren H (ed), Syntactic Wordclass Tagging, vol. 9, Text, Speech and Language Technology. Dordrecht, Kluwer Academic Publishers, pp 23-36. Live A H 1973 The Take-Have Phrasal in English. Linguistics: An International Review 95: 31-50. McEnery T, Wilson A 2001 Corpus Linguistics (second edition). Edinburgh, Edinburgh University Press. Nakas Th 2000 ÃëùóóïöéëïëïãéêÜ B´: ÌåëåôÞµáôá ... .. ...... ... .. .......... (Ýêôç ......). [Studies on language and literarure, B´ (sixth edition)]. Athens, Parousia (in Greek). Nickel G 1968 Complex Verbal Structures in English. International Review of Applied Linguistics in Language Teaching (IRAL) 6: 1-21. Quirk R, Greenbaum S, Leech G, Svartvik J 1985 A Comprehensive Grammar of the English Language. London, Longman. Sinclair J 1991 Corpus, Concordance, Collocation. Oxford, Oxford University Press. Sinclair J M 1997 Corpus Evidence in Language Description. In Wichmann A, Fligelstone S, McEnery T, Knowles G (eds) Teaching and Language Corpora. London, Longman, pp 27-39. Sinclair J M et al. (eds) 1998 Collins COBUILD English Grammar. 
London, Collins. Stein G 1991 The Phrasal Verb Type ‘to Have a Look’ in Modern English. International Review of Applied Linguistics in Language Teaching (IRAL) 29: 1-29. 152 Appendices 5.1. Appendix I: A sample of the results from the HNC 5.2. Appendix II: A sample from the WordSmith Tools concordance 153 5.3. Appendix III: The frequency of main delexical verbs (Category C) and their cognate simple ones Delexical verb Hits Hits Lexical verb ÊÁÍ äÞëùóç [“make a statement”] 243 5,000+ ÇËÍ ÊÁÍ áíáöïñÜ [“make reference ”] 78 3,428 ÁÍÁÖÅÑ ÊÁÍ ðñüôáóç [“make a suggestion”] 72 935 ÐÑÏÔÅÉÍ ÊÁÍ ÷ñÞóç [“make use ”] 54 972 ×ÑÇÓÉÌÏÐÏÉ ÊÁÍ ðñïóðÜèåéá [“make an attempt”] 50 1,006 ÐÑÏÓÐÁÈ ÊÁÍ (ôçí) åµöÜíéóç (µïõ) [“appear”] 34 897 ÅÌÖÁÍÉÆ ÊÁÍ Ýëåã÷ï [“control” / “check”] 33 451 ÅËÅÃ× ÊÁÍ ðáñݵâáóç [“interfere”] 28 153 ÐÁÑÅÌÂÁÉÍ ÊÁÍ Ýñåõíá [“search” / “carry out research”] 24 181 ÅÑÅÕÍ ÊÁÍ áíáêïßíùóç [“make an announcement”] 19 1,655 ÁÍÁÊÏÉÍÍ 154 Using natural language processing tools to assist semiotic analysis of information systems Ken Cosh and Pete Sawyer Computing Department Lancaster University UK LA1 3EU [k.cosh, sawyer]@comp.lancs.ac.uk 1. Abstract Semiotic Analysis has been used to aid understanding of information or communication systems, providing information that can be used during requirements engineering. The MEASUR approach begins by analysing short, natural language problem statements and manually extracting the key themes involved. As the process is scaled up and applied to longer problem statements, as found in many real life circumstances, the manual effort required increases. When the starting point for Semiotic Analysis is a large document describing the information system, such as an ethnographic report, assistance in the analytical process is necessary. This paper investigates how statistical Natural Language Processing Tools can aid this analysis. Natural Language Processing Tools can assist the analyst by directing them to the central themes in the document. Comparing a frequency list of the document with a frequency list from a large corpus of text such as the British National Corpus reveals the key words in the document. Collocation analysis of these keywords enables the creation of a lexical network and then closer investigation of the collocates in context allows the analyst to add semantic information to the model. 2. Keywords Natural Language Processing, Semiotic Analysis, MEASUR, Requirements Engineering, Organisational Semiotics 3. Introduction to semiotic analysis Semiotics is the study of ‘signs', and how they are used to communicate information between people. Organisational Semiotics is merely the study of how these signs are used within organisations(Stamper 2000). As a sign can be anything that conveys information, understanding the properties and meanings of these signs can be a useful aid to understanding the workings of the organisation. One application of semiotics is the semiotic analysis of information systems. This can be used to aid requirement engineering. Requirement Engineering is used to analyse a problem domain, prior to designing a solution to the problem. It is a crucial part of any software engineering process, as it is vital to have a thorough understanding of the problem before attempting to solve it. Ambiguity and misunderstandings between the user and the developer are a big cause of costly rework. Analysing a problem using semiotics can reduce ambiguity by creating a common understanding of the semantics involved in a problem domain. 
This paper reviews the MEASUR approach to Semantic Analysis, developed by Stamper (Stamper 1994), and looks at the similarities between steps in MEASUR and some statistical Natural Language Processing techniques. It looks at some problems that currently exist with this Semantic Analysis approach and discusses how some statistical Natural Language Processing tools can be used as a solution to these problems. Several authors have pointed out the apparent links between Organisational Semiotics and Natural Language Processing (NLP). (Charrel 2002)(Connolly 2000). Whenever Semiotic Analysis is used to model any problem or domain, then natural language, whether spoken or documented, is studied. The purpose of this research is to demonstrate how probabilistic NLP techniques can be applied to Semantic Analysis, as described in the next section. 4. The MEASUR approach to semantic analysis There are 4 major phases to Semantic Analysis as proposed by MEASUR (Lui 2000): 155 Problem Definition | Candidate Affordance Generation | Candidate Grouping | Ontology Charting Figure 1, Stages in semantic analysis The first step, as shown, is problem definition. MEASUR begins by formulating a concise, well-articulated problem statement, which includes all the relevant parts of a problem. Working from this problem description, semantic analysis takes over, with the goal of creating a model of the problem. This begins with candidate affordance generation, where all the agents, objects, actions, etc. are identified from the text. Once these candidates have been identified, Candidate Grouping takes over as the first step towards creating an ontology chart. The two key types of entity that need to be identified during candidate affordance generation, to be used as construction blocks for the Ontology Chart, are agents and affordances. An agent is a type of object, a performer or processor, the initiator of an event. An agent can be a human, a device or a program, whatever has responsibility for the action. The behaviour of an agent is directed by its knowledge of, and constrained by the nature of, the environment. For instance an agent could swim assuming it has the knowledge of how to swim, and the environment affords it an area of liquid to swim in. A swimming pool could therefore be seen as an affordance within the environment enabling the agent to swim. Affordances can be seen as properties of a situation, not necessarily objects such as swimming pools. Other examples of affordances are budgets, projects, time and even agents, as sometimes agents can become affordances of another agent, depending on context. (Gibson 1979)(Stamper 1996) Agents are relatively easy to identify since they typically appear as nouns in the problem statement. Nouns may also represent affordances or roles (roles are explained in the conference organisation example below). Unlike agents (and roles), affordances do not map neatly onto a single part of speech. They can be nouns, but equally could be verbs or several other parts of speech. Hence, while a list of candidate agents may be generated mechanistically, the identification of affordances requires analysis of the semantics by considering the relationships between the various syntactic elements of the problem statement. Normally elements will depend on other elements for their existence. For instance, the affordance swim depends upon the existence of a swimming pool, in which to swim, and also a swimmer (the role of a person agent), to actually swim. 
This is depicted in the fragment of an ontology chart in figure 2. Here, agents are shown as ellipses, and affordances are depicted in rectangles. Figure 2. A fragment of an ontology chart The ontology chart is one of the goals of this process; it is a graphical representation of the relationships between these agents and affordances. The process to create an ontology chart is better demonstrated in the example below. The eventual chart is comprised from grouping the affordances together. Once the ontology chart has been created, the next step is to assemble a set of norms, which govern the standard behaviour of the model. Norms describe how the model is expected to work. They describe the normal behaviour of agents within the problem domain and their use of affordances. When the norms are attached to an ontology chart a thorough understanding of the problem is completed. This paper however concentrates upon the steps involved in creating the ontology chart. swimmer Swimmingpool Person Swim 156 5. Conference organisation example The following minimal problem statement permits the identification of the principal entities of the problem domain listed in table 1.: A member of an organisation can organise a conference, which they invite people to attend. Participants can submit papers to be reviewed by the conference organiser. Member Role of person in organization Organisation Agent Organise Affordance of conference and conference organizer Conference Affordance of organization Invite Affordance of conference organiser and person People Agent Attend Affordance of participant Participant Role of person who accepts invitation Submit Affordance of Participant and Paper Papers Affordance Review Affordance of conference organiser and contributed paper Conference Organiser Role of member who organises conference Table 1. The principal entities in a conference organisation problem The above list of entities represents the set of candidate nodes in the ontology chart. Once the candidates have been identified, the next task as also illustrated in table 1, is to categorise these as agents, roles and affordances and assemble the ontology chart to explicate the relationships between them. In this example, Organisation is an agent (agents aren't necessarily human) while Member is a role of a person within the organisation. In order to make the relationship understandable, the affordance Membership has to be added. Membership depends upon an Organisation, which can allow Membership, and a Person, to take advantage of Membership, whereupon they become a Member (Figure 3). Figure 3. Modelling the membership of an organisation In the model in figure 3, organisation and person are agents, while membership is an affordance. Member is a role and so is placed between person and membership. The important entities of the problem domain can be grouped to begin to create an ontology chart. Once candidates have been grouped a structure for the chart appears. Member has been identified as a role, not just an agent, as the problem statement identifies People, Participants and Members. An experienced analyst will recognise that Participant and Member are (in O-O terms) specialisations of Person. In an O-O notation such as UML class diagrams, the analyst would have the option of modelling these using either sub-classing or, if instances of a single class could play multiple roles, using roles at the termination points of association relationships. In an ontology chart, roles are the modelling mechanism used. 
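To make the notion of dependency between agents, affordances and roles concrete, the membership example of Figure 3 can be rendered as a small typed graph. The following sketch is purely illustrative: the class and function names are invented for this summary and are not MEASUR notation or any existing tool's API.

from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                                        # "agent", "affordance" or "role"
    antecedents: list = field(default_factory=list)  # nodes this one depends on for its existence

# Figure 3: organisation and person are agents, membership is an affordance,
# member is the role a person plays once membership is taken advantage of.
organisation = Node("organisation", "agent")
person       = Node("person", "agent")
membership   = Node("membership", "affordance", [organisation, person])
member       = Node("member", "role", [person, membership])

def ontologically_depends_on(node, other):
    # True if `node` cannot exist without `other`, following antecedent links.
    return other in node.antecedents or any(
        ontologically_depends_on(a, other) for a in node.antecedents)

print(ontologically_depends_on(member, organisation))  # True: no member without an organisation

Reading an ontology chart from left to right then corresponds to following the antecedent links in such a structure.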
Roles only exist in circumstances where a dependent affordance is taken advantage of, so they are depicted along the arc between the antecedent and the dependent. For instance, in figure 4 which shows the complete ontology chart for the conference organisation problem, a person only becomes a participant when they accept the invitation. Roles in an ontology chart are not merely labels, they can form nodes. This is illustrated in the conference organiser role which is itself derived as a special type of the role member when organising the conference. Note that roles needn't be restricted to the agents in the model. For membership organisation person member 157 example, contribution is a specific type of the affordance paper which has been submitted by the role participant. In the same way in which membership was added to the ontology chart to explain the relationship between a person and an organisation, other terms can be added. Authorship and accept are added to the ontology chart despite not being in the original problem statement. Authorship explains the relationship between a person and a paper. Accept has to be added to enable the existence of a participant as a person only becomes a participant when they have accepted the invitation to the conference. Figure 4. Complete ontology chart of conference organiser example An ontology chart should be read from left to right, as objects to the right are dependent on the items they are connected to by arcs on their left (Barjis & Chong 2002). For example, the affordance invite can only exist when there is a conference organiser (to perform the inviting) and a person (to be invited). Similarly the affordance accept relies on a person (to perform the accepting) and the existence of the invitation (to be accepted). The construction of an ontology chart, even from a concise problem description, isn't mechanistic. It requires considerable experience and skill in the application of MEASUR on the part of the analyst and this acts as an inhibitor to the application of MEASUR in systems development. 6. Statistical natural language processing and organisational semiotics. In all the documented case-studies and examples used in the organisational semiotics literature, the problem scope is small, and the problem statement is a concise description with no redundant description or ambiguity. For many real life cases it isn't possible to neatly and perfectly summarise the problem in a brief problem statement, so the starting point for semantic analysis may not be a carefully worded problem statement. Instead it could be a set of long descriptive reports. Examples could include ethnographic study reports, legal documents, codes of practice, meeting transcripts, and any other documentation which could add knowledge to the problem definition. Many problems are just too complex and subtle to be neatly encapsulated by a short problem statement. To be practical it should be possible to apply MEASUR to problems where the sources of information are less clearly bounded, diffuse, scattered and poorly structured. The problem with this is that to read through and analyse long documents, using the approach discussed above, becomes increasingly difficult, particularly if we intend to “generate candidate affordances”, by selecting every noun, noun-phrase, verb etc. 
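The mechanistic part of candidate generation, selecting every noun and verb from the problem statement, is exactly what an off-the-shelf part-of-speech tagger provides. A minimal sketch, assuming the NLTK library and its standard tokeniser and tagger models (not the tool actually used in this work):

import nltk                       # requires the 'punkt' and POS-tagger data to be downloaded
from collections import Counter

def candidate_terms(text):
    tokens = nltk.word_tokenize(text)
    tagged = nltk.pos_tag(tokens)                    # Penn Treebank tags
    nouns = [w.lower() for w, t in tagged if t.startswith("NN")]
    verbs = [w.lower() for w, t in tagged if t.startswith("VB")]
    return Counter(nouns), Counter(verbs)

statement = ("A member of an organisation can organise a conference, which they "
             "invite people to attend. Participants can submit papers to be "
             "reviewed by the conference organiser.")
noun_counts, verb_counts = candidate_terms(statement)
print(noun_counts.most_common(10))   # candidate agents, roles and noun affordances
print(verb_counts.most_common(10))   # candidate verb affordances

On a 40,000-word report such a raw listing is exactly where the filtering problem described below arises.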
While in a short, precisely written problem statement the analyst will find only the semantic units necessary to create the ontology chart, in a longer document the analyst might find many nouns which occur only once and do not actually add anything to the model. This irrelevant information needs to be filtered out. Many of the terms will be used several times, in different contexts, so they become easier to define accurately. Having selected the key candidates using the frequency tests, we can then look at these words in context, singling out the word and the passage in which the word occurs. Looking at the key word in context (KWIC), as well as aiding the definition of terms, can also be used for grouping the keywords together, as instances where keywords are used in conjunction with each other can be isolated. As large amounts of natural language have to be analysed, it is logical to use NLP tools to assist with the automation of the approach. The following section discusses how some NLP techniques can be used to aid Candidate Affordance Generation and Candidate Grouping.
7. Applying natural language processing techniques
For the purposes of this example, information collected in an ethnographic report on the Air Traffic Control (ATC) domain is used. This has previously been looked at in the REVERE project (Rayson, Garside & Sawyer 2000). The document is 66 pages long and contains over 40,000 words, so it clearly is not the concise, carefully written problem statement alluded to previously. Using this as the ‘problem statement', the next step is to generate a list of candidate affordances. Using the statistical NLP tools this can be done by first compiling a frequency list – counting the number of occurrences of each word. This frequency list can then be compared to a corpus frequency list. A corpus is a large body of text; the example used here is the British National Corpus (BNC), which provides a giant frequency list of English words and how frequently they can be expected to occur. The first step in this process is to create a frequency list for the words in our document. Initially the most frequently occurring words are likely to be words such as ‘the', ‘of', ‘and’ and ‘to'. These words are not a particularly helpful indication of what the document is about, as they are the words which occur most commonly in written natural language. The most interesting and helpful words are those which occur significantly more often within the document than we would expect them to, and we can detect these by comparing the document with the BNC. The first step in calculating the most significantly overused words is to calculate how frequently we would expect a word to occur, given the size of the document. This can be done by first creating the following contingency table:

                            Text to be analysed    BNC       Total
Frequency of word                   a               b        a + b
Frequency of other words          c - a           d - b      c + d - a - b
Total                               c               d        c + d

Figure 5. Contingency table.
(Rayson & Garside 2000) With the information in this table, the expected frequency for any word in the text to be analysed can be calculated using the following formula; E1 = c * (a + b) / (c + d) And the expected frequency, given out text to be analysed, for any word in the BNC can be calculated; E2 = d * (a + b) / (c + d) Once the two expected frequencies have been calculated, we can calculate the significance in the difference between these two scores, using a Log Likelihood test. The following formula will give a significance score showing how significant it is that the word occurs as frequently as it does; Sig = 2 * ( ( a * ln ( a / E1) ) + ( b * ln ( b / E2 ) ) ) (Rayson & Garside 2000) The higher the result of this test, the more significant it is that the word has occurred more often than it should have. After this comparison using a log likelihood test between the BNC frequency list, and 159 the ATC frequency list, a new frequency list is created with the most significantly overused words prominent. These words become the keywords of the problem domain. Using the example of the air traffic control ethnographic report, the words that are most significantly overused within the document are, as could be expected, words like, controller, radar, flight etc. (fig 6). Further information about this technique can be found in (Rayson, Garside & Sawyer 1999). Word Frequency Log Likelihood 1 controller 217 1386.84 2 strips 227 1372.39 3 radar 168 1060.39 4 strip 173 966.805 5 flight 113 576.605 6 controllers 73 458.849 7 chief 114 451.441 8 sector 82 353.931 9 pole 75 334.929 10 traffic 74 307.131 11 of 671 294.06 12 planes 56 293.969 13 pending 48 281.001 14 aircraft 58 279.465 15 hill 75 273.275 16 level 106 255.248 17 the 1848 254.816 18 airspace 41 246.31 19 ph 40 239.917 20 im 38 239.149 Figure 6. Significantly overused words in the ATC example. This list still includes words like “the” and “of” which are significantly overused within the document, but don't add to the model. These words can be removed by refining the list so we only have words that we are interested in. Clearly with a large document to attempt to draw out the candidate affordances manually would be very time consuming. Using NLP tools, the first steps of the semiotic analysis method can be completed more speedily, and with less manual input. The third stage of the process is to group candidate affordances. Once a list of the terms that should occur within our ontology chart has been generated, they can be grouped. Once again there are NLP tools to assist during this stage. By looking at each keyword in more detail, it is possible to discover what it means and how it relates to the other keywords more precisely. The overused keyword ‘controller’ is chosen, it is a role, played by a person agent. Collocation analysis is a statistical test, which produces a Z score that tells us how likely it is for two words to have “co-occurred”. This works by first predicting the number of times that the second word should occur within a specified range, bearing in mind the frequency of the word within the entire document. Given the expected co-occurrence frequency and the actual co-occurrence frequency, the probability can be calculated. With this probability, the significance of the co-occurrence can be tested using Berry Rogghe's z-score calculation (Oakes 1998). 
To calculate the significance of a collocation, the following information is needed:
Z – total number of words in the text
A – number of times the keyword occurs in the text
B – number of times the potential collocate occurs in the text
K – number of times the keyword and the collocate co-occur within the span
S – span: the number of words on either side of the keyword to be considered
The first step is to calculate the number of times the collocate should co-occur near the keyword if the two words were randomly distributed, and then compare this with the actual number of co-occurrences. To calculate the expected number of co-occurrences, we first need the probability of the collocate occurring where the keyword does not occur:
P = B / (Z – A)
Then the expected number of co-occurrences is given by:
E = P * A * S
The significance of the collocation is then determined by calculating a z score, as follows:
z = (K – E) / √(E * (1 – P))
Once again, the higher the z score, the more significant the collocation. Setting the span to 10 words either side of our keyword, ‘controller', the collocation significance of every other word in the document can be calculated. Once we have the z score, as given by Berry-Rogghe's calculation, we can compare it to a percentage significance level. Here, words have been split into 1%, 5% and 10% significance levels, 1% being the most significant. Words are significant at the 1% level with a z score greater than 2.33, at the 5% level when greater than 1.65, and at the 10% level when greater than 1.3. The most significant co-occurring words (collocates) for ‘controller’ are:

Word        Frequency   Co-occurrences   Expected co-occurrences   Z score
Roger       14          14               1.54                      10.03
midland     17          14               1.87                      8.86
Ph          40          19               4.41                      6.96
Upper       4           5                0.44                      6.87
131         2           3                0.22                      5.92
Fault       2           3                0.22                      5.92
Asks        2           3                0.22                      5.92
Wit         1           2                0.11                      5.69
32          1           2                0.11                      5.69
looming     1           2                0.11                      5.69
105         1           2                0.11                      5.69
arrived     1           2                0.11                      5.69
searching   1           2                0.11                      5.69
charters    1           2                0.11                      5.69
Fallen      1           2                0.11                      5.69
Leans       4           4                0.44                      5.36
10          17          9                1.87                      5.21
Who         48          17               5.29                      5.1
assistant   9           6                0.99                      5.03
Pilot       35          13               3.85                      4.66

Figure 7. Significant collocates at the 1% level
As can be seen, several of these collocates occur only once within the entire document, so it is not especially interesting that they co-occur with ‘controller'. As they occur only once, they add little to the eventual model, so we can filter the list and investigate the interesting co-occurrences further.

Word        Frequency   Co-occurrences   Expected co-occurrences   Z score
Roger       14          14               1.54                      10.03
midland     17          14               1.87                      8.86
Ph          40          19               4.41                      6.96
Upper       4           5                0.44                      6.87
Leans       4           4                0.44                      5.36
Who         48          17               5.29                      5.1
assistant   9           6                0.99                      5.03
Pilot       35          13               3.85                      4.66
Red         5           4                0.55                      4.65
marks       3           3                0.33                      4.64
express     3           3                0.33                      4.64
Him         78          22               8.59                      4.58
climbing    24          10               2.64                      4.53

Figure 8. Significant collocates of ‘controller’
Given the list of significant collocates for each keyword, a lexical network could be created automatically. This involves simply connecting every word to those which occur near it. However, as this network contains little semantic content it is not very useful for understanding the problem domain. Further investigation into each collocate produces much more valuable insight into the nature of the problem. Each of these significant collocates can be looked at in context, firstly so that the meaning of the second word can be understood, and secondly so that the relationship between the two words can be analysed.
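The same calculation can be transcribed directly from the quantities Z, A, B, K and S defined above. Reproducing the figures in the tables requires S to be the full window (10 words either side, i.e. 20), and the document length is taken here as roughly the 40,000 words stated earlier, so the check is only approximate:

import math

def z_score(Z, A, B, K, S):
    # Z = total words in text, A = keyword frequency, B = collocate frequency,
    # K = observed co-occurrences within the span, S = total span in words.
    p = B / (Z - A)            # probability of the collocate where the keyword is absent
    E = p * A * S              # expected co-occurrences under random distribution
    return (K - E) / math.sqrt(E * (1 - p))

# One row of Figure 8 as a rough check ('roger' near 'controller'):
print(round(z_score(Z=40_000, A=217, B=14, K=14, S=20), 2))   # ≈ 10.1 (Figure 8: 10.03 with exact totals)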
In this example, “roger” is the most significant collocate, and further investigation of the pair in context reveals that the controller communicates with a pilot via radio, and regularly uses the radio term “roger”. From this information it is possible to begin to group possible candidate affordances, such as that radio is dependent upon pilot and controller for its existence.
Figure 9. Grouping of controller, radio and pilot.
Looking at keywords in context with other keywords assists the analyst in understanding their semantics. By isolating the controller keyword it is possible to find sentences such as “It is the job of the radar or sector controller to coordinate this traffic through his/her sectors”, which, amongst others, provides the information that the controller is a role performed by a human agent. This information can be added to the growing ontology chart.
Figure 10. Fragment of ontology chart demonstrating the role controller.
It is also possible to learn further information about the controller role. Firstly, from the above sentence alone, a function of the controller role is to co-ordinate traffic through his or her sector. There is clearly a relationship between the controller and ‘sector'. Using NLP to look at keywords in context with each other, it is possible to isolate every piece of text which contains both ‘controller’ and ‘sector'. This provides the information needed to discover that it is a sector which the controllers control; in other words, the affordance control can only exist when there is a controller (to perform the controlling) and a sector (to be controlled). This information can be added to the ontology chart.
Figure 11. Fragment of ontology chart demonstrating the addition of sector.
Gradually, by grouping more information as various parts of the text are investigated using the keywords in context, the ontology chart can be extended until it provides a thorough analysis of the Air Traffic Control domain.
8. Conclusions
Semiotic analysis of an information system can help requirements engineers and other analysts to understand it fully. When conducting semiotic analysis of an information system, a rich source of information could be an ethnographic report, or another natural language document describing the domain. Current approaches to semiotic analysis are designed for use with concise, carefully constructed problem statements. This paper has investigated how NLP can be used to scale up the approach so that larger, more information-rich documents can be analysed. The NLP tools that have been included in this research are statistical frequency tests and collocation analysis, aimed at guiding an analyst to the important areas of the document being analysed. Further work in this area could usefully aid the analyst in identifying agents and affordances automatically, either by refining part-of-speech tagging or by semantic tagging. Using collocation analysis it is possible to automate the creation of a lexical network, connecting related words based on their occurring near one another in a document. Further human input is necessary to add semantic information to the lexical network. Whilst viewing the keyword in context alongside other keywords is an aid for the analyst, further tool support here could assist the process further. 9.
References Barjis & Chong 2002 Integrating Organisational Semiotic Approach with the Temporal Aspects of Petri Nets for Business Process Modeling, in J.Filipe, Sharp, B. and Miranda, P. (Eds.), Enterprise Information Systems III. Kluwer Academic Publishers, Dordrecht, The Netherlands Charrel 2002 Viewpoints for knowledge management in system design. In Proceedings 5th Annual International Workshop on Organisational Semiotics Connolly 2000 Accomodating Natural Language Within The Organisational Semiotic Framework, Third International Workshop in Organisational Semiotics WOS3 Gibson 1979 The Ecological Approach to Visual Perception. Boston, Houghton Mifflin company. Liu 2000 Semiotics in Information Systems Engineering, Cambridge University Press sector control controller person pilot radio 163 Oakes 1998 Statistics for Corpus Linguistics, Michael P. Oakes. Edinburgh textbooks in empirical linguistics. Rayson, Garside & Sawyer 1999 Language Engineering for the recovery of requirements from legacy documents. REVERE project report, Lancaster University, May 1999 Rayson & Garside 2000 Comparing corpora using frequency profiling. In proceedings of the workshop on Comparing Corpora, held in conjunction with the 38th annual meeting of the Association for Computational Linguistics (ACL 2000). 1-8 October 2000, Hong Kong, pp. 1 - 6. Rayson, Garside & Sawyer 2000 Assisting requirements engineering with semantic document analysis. In Proceedings of Content-based multimedia information access RIAO 2000 (Recherche d'Informations Assistie par Ordinateur, Computer-Assisted Information Retrieval) International Conference, College de France, Paris, France, April 12-14, 2000. C.I.D., Paris, pp. 1363 - 1371. ISBN 2-905450-07-X Stamper 1994 Signs. Information, Norms and Systems In: B. Holmqvist, P. B. Andersen, H. Klein and R. Posner (Eds), Signs of Work: Semiosis and Information Processing in Organisations, Walter de Gruyter & Co, Berlin, pp.349-397. Stamper 1996 Ontological Dependency In Proceedings of the workshop on “Ontological Engineering” Aug 1996 Stamper 2000 – Organisational Semiotics Informatics without the Computer? In Information, Organisation and Technology - Studies in Organisational Semiotics By Kecheng Liu, Rodney J. Clarke, Peter Bogh Andersen, Ronald K. Stamper (eds), Kluwer Academic Publishers, Boston, 2001. 164 Language engineering tools for collaborative corpus annotation H. Cunningham*, V. Tablan*, K. Bontcheva*, M. Dimitrov** * Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield S1 4DP {H.Cunningham, V.Tablan, K.Bontcheva}@dcs.shef.ac.uk **Ontotext Lab, Sirma AI Ltd. 38 A Hristo Botev Blvd. Sofia 1000, Bulgaria marin.dimitrov@sirma.bg 1. Introduction A vital step towards the creation of distributed corpora is the provision of tools for collaborative corpus annotation, in order to enable researchers to collaborate on annotating corpora regardless of their physical location. This problem can be decomposed in two major tasks: (i) provide users with access to distributed corpora; and (ii) provide visualisation and editing tools that require no installation effort and are easy to use. In this paper we will present the new collaborative corpus annotation facilities, recently developed as part of the GATE language engineering tools and infrastructure. These facilities have been used to build OLLIE – a client-server application that allows users to use the collaborative corpus annotation facilities in their own Web browser. 
This paper is structured as follows. An overview of GATE is provided in Section 2. Section 3 describes the support for distributed language resources implemented using relational database servers. The client-server architecture of OLLIE is introduced in Section 4 and details of its collaborative corpus annotation facilities are provided in Section 5. Section 7 concludes the paper and outlines future work. 2. GATE overview GATE (Cunningham et al 2002) is a open-source language engineering infrastructure1 which, among other things, is being used for the creation and annotation of a number of corpora in many languages, e.g., the American National Corpus and EMILLE - a 63 million word corpus of Indic languages (Baker et al 02). It is written in Java and exploits component-based software development, object orientation and mobile code. One advantage of GATE from a corpus processing perspective is that it uses Unicode throughout (Tablan et al. 2002), and has been tested on a variety of Slavic, Germanic, Romance, and Indic languages (Gamback and Olsson 2000), (Pastra et al 2002). Another advantage is its support for a wide range of input and output document formats – XML, HTML, SGML, email, RTF, and plain text – and GATE can also be extended easily to handle other formats, in a manner transparent to the rest of the system. When a document is loaded in GATE, its format is analysed and converted into a GATE document, which consists of content and one or more layers of annotation. The annotation format, a modified form of the TIPSTER format (Grishman 1997), is largely isomorphic with the Atlas format (Bird and Liberman 1999) and successfully supports I/O to/from XCES and TEI (Ide et al. 2000). An annotation has a type, a pair of nodes pointing to positions inside the document content, and a set of attribute-values, encoding further linguistic information. Attributes are strings; values can be any Java object. An annotation layer is organised as a Directed Acyclic Graph on which the nodes are particular locations in the document content and the arcs are made out of annotations. All mark-up contained in the text used to create the document content is automatically extracted into a special annotation layer and can be used for processing or for exporting the document back to its original format. 1 GATE is available for download from http://gate.ac.uk and is distributed freely as an open-source system under the GNU library license. 165 In addition to corpus annotation facilities, GATE provides a set of language processing components, e.g., tokeniser, part-of-speech tagger, named entity recogniser; services for persistent storage of language resources; and extensive visualisation and evaluation tools aimed at facilitating the development and deployment of language processing applications. The new distributed corpus annotation facilities discussed here were built to resemble GATE's easy-to-use and extendable facilities for text annotation of locally stored documents. These facilities allow both manual and semi-automatic annotation of corpora, based on GATE's language processing components. The text is annotated in a point-and-click fashion, where the user first highlights the text to be annotated and then clicks on the desired annotation type (e.g., ENAMEX, POS). The rest of the paper will discuss how GATE has been extended to support distributed language resources and how these facilities were used to build OLLIE – a client-server application for collaborative corpus annotation. 3. 
Distributed language resources Big corpora, such as the British National Corpus (BNC) are best stored on server machines with large amounts of disk space, so they are installed once and shared among many researchers. GATE offers support for such distributed language resources by storing them in relational databases, which run on the servers and accessed by the GATE visual environment that acts as a client. Currently Oracle and PostgreSQL are supported, of which the latter is freely available. Currently the database servers support storage of corpora and documents and their associated annotations. These GATE objects are stored in about 20 relational tables, grouped in the following categories2: 1) Language Resource-related tables: contain information common for all language resources, e.g., their name and type. 2) Corpus-related tables: contain information about the stored corpora, such as which documents belong to them. 3) Document-related tables: contain information about the stored documents, e.g., their encoding, content, and which annotation sets belong to them. 4) Annotation-related tables: contain information about the annotations, the sets they belong to, and the feature-value pairs associated with them. 5) Security related tables: contain information about users and groups. The GATE users are completely isolated from the concrete ways in which LRs are stored in the relational tables. When choosing to store an LR in a distributed database, users just have to choose which of the available servers they want to use and provide their user name and password. Then GATE transparently converts all LRs in the corresponding database objects and stores them via JDBC. In this way, all GATE users and applications are completely isolated from the technicalities of using a database server for distributed LR storage - remote RDBMS datastores are accessed in the same way as local file system based datastores. An important aspect of supporting distributed LRs is security and access control. In GATE we have implemented role-based access control and every language resource is associated with security properties specifying the actions that certain users and groups may perform with this resource. When users create LRs into the database datastore, they specify access rights for users, the group of the user and other users. For example, LRs created by one user/group can be made read-only to others, so they can use the data, but not modify it. The access modes supported by GATE are detailed in Table 1. If needed, ownership can be transferred from one user to another. Users, groups and LR permissions are administered in a special administration tool, by a privileged user, i.e., the database administrator. As already mentioned, the RDBMS persistence is fully JDBC compliant but in order to improve the performance, specific features of the two databases were used when possible. While this approach induces certain development overhead, since small parts of the implementation are database specific 2 For a detailed description of GATE's database storage mechanism and the relational tables used see http://gate.ac.uk/gate/doc/persistence.pdf and GATE's User's Guide (http://gate.ac.uk/sale/tao/). 166 (such as properties of the database objects, stored procedures, SQL optimizer hints, etc.) it ensures that the optimal performance is achieved for each database. 
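The four access modes listed in Table 1 below amount to a conventional owner/group/other check. Purely as an illustration of that logic (the names are invented for this sketch; this is not GATE's actual code or schema):

MODES = {
    # mode name: (owner, group, other) rights, each as a (read, write) pair
    "WORLD_READ_GROUP_WRITE": ((True, True), (True, True), (True, False)),
    "GROUP_READ_GROUP_WRITE": ((True, True), (True, True), (False, False)),
    "GROUP_READ_OWNER_WRITE": ((True, True), (True, False), (False, False)),
    "OWNER_READ_OWNER_WRITE": ((True, True), (False, False), (False, False)),
}

def can_access(user, resource, write=False):
    owner_rw, group_rw, other_rw = MODES[resource["mode"]]
    if user["name"] == resource["owner"]:
        rights = owner_rw
    elif resource["group"] in user["groups"]:
        rights = group_rw
    else:
        rights = other_rw
    return rights[1] if write else rights[0]

bnc = {"owner": "admin", "group": "corpus-users", "mode": "GROUP_READ_OWNER_WRITE"}
alice = {"name": "alice", "groups": ["corpus-users"]}
print(can_access(alice, bnc))              # True  – group members may read the shared corpus
print(can_access(alice, bnc, write=True))  # False – only the owner may modify it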
Mode Owner Owner's (Read/Write) group (Read/Write) Other users (Read/Write) World Read / Group Write +/+ +/+ +/- Group Read / Group Write +/+ +/+ -/- Group Read / Owner Write +/+ +/- -/- Owner Read / Owner Write +/+ -/- -/- Table 1 Database access rights. Another performance related feature of the database-based persistence is the load-on-demand implementation of certain GATE resources. Since it is inefficient to load huge amounts of data from a remote datastore, all language resources in GATE employ a load-on-demand behaviour when loaded from database datastore. In such cases the corpus documents, document content and annotation sets will be loaded from the datastore only when first accessed from the user (or the application). In a similar manner a complex tracking system ensures that only the parts of a resource that were changed are propagated back to the database for updates. This approach significantly reduces the amount of data transferred and the round trips to the server. Read-only resources allow shared use of existing corpora such as the British National Corpus (BNC), which only need to be saved in the database once and then become available through GATE to all users, who do not need to install and store them locally. In the future we will allow users to extend such resources by creating a new 'composite' LR, which references inside the original read-only LR. New data is added to the modifiable part of the composite LR, while users can access data from both parts. In this way, different groups can share a read-only copy of e.g. the BNC or WordNet, which is referenced by their separate customised LRs which store the new information, and made accessible to other groups/users as required. Key benefits of the approach include: 1) The database infrastructure takes care of multi-user access, locking and so on without any intervention by the user. 2) LRs can in this way be easily distributed across networks, and need not be installed locally on each machine used by their developers and users. 3) GATE's graphical display tools can be used to visualise and edit the data in the resources. 4) GATE's framework can be used for embedding the resources in diverse applications. The database infrastructure offered by GATE is employed in a number of projects and production systems. Some real world statistics from OntoText Lab for a project that uses GATE to annotate news articles look like: • Database size: 3.5 GB (growing by 20MB per day) • Number of documents: 170,000 (1000 new per day) • Number of annotations: 3,500,000 • Number of annotation features: 10,500,000 • Total number of rows in the database: 23,500,00 167 4. The OLLIE architecture for collaborative annotation OLLIE is a client-server application providing support for collaborative corpus annotation by using any Java-enabled Web browser. The three main functions of the OLLIE client are: 1) support user authentication and profiles; 2) support collaborative corpus annotation; 3) provide language processing tools to (partially) automate this task. OLLIE also supports training and running different Machine Learning methods but these aspects will not be discussed here due to space limitations. The OLLIE server on one hand communicates with the OLLIE client and on the other uses GATE and its distributed database facilities (described in Section 3) to store the user data and the language resources being edited (see Figure 0 for an illustration). 
Every user has a username and a password, used to retrieve their profiles and determine which documents they can access (according to the database access policies discussed above). The profiles contain information provided by the user, specifying the types of information that they are annotating in the corpora. For example, the profile for basic named entity annotation will specify Person, Organization, and Location. These tags will then be provided in the document annotation pages in the client (see Figure 0). The OLLIE server is implemented as a set of Java Server Pages (JSPs) along with some supporting Java classes. The entire application is then packaged for deployment into a servlet container such as Figure 0 The OLLIE corpus and document browser 168 Apache Tomcat. All the configuration data (e.g. URLs to the databases holding the security data or the GATE distributed database) is handled through the configuration of the servlet container (which in the case of Tomcat consists of a set of XML files). The server's efficiency is improved by using connection pooling for efficient database access. The client-server communication during on-line document annotation is carried over Java Remote Method Invocation (RMI), which supports data sending and exchange of messages between the client and the server. The continuous communication comes with the added benefit of data security in case of network failure. The data on the server always reflects the latest version of the document, so no data loss can occur. 5. The Web-based collaborative corpus annotation client OLLIE supports collaborative annotation of documents and corpora by allowing their shared, remote use and by making updates made by one client immediately available on the OLLIE server. In this way users can share the annotation task with other users. For example, one user can annotate a text with organisations, and then another annotates it with locations. The documents reside on the shared server which means that one user can see errors or questionable mark-up introduced by another user and initiate a discussion. The OLLIE client provides facilities for loading documents/corpora for annotation from a URL, uploaded from a file, or created from text pasted in a form. A variety of formats including XML, HTML, email and plain text are supported. As part of this process, the user also specifies the access rights to the document/corpus, which determine whether it can be shared for viewing and collaborative annotation (see Section 3). Figure 0 shows the OLLIE client window listing all corpora and documents accessible to the user and available for annotation through the Edit button. The Admin button is for specifying the access rights and there is also a button for deleting the LR from the server. Figure 0 The OLLIE client-server architecture Figure 0 Remote document editing in OLLIE 169 The document editor is used to annotate the text (see Figure 0). The right-hand side shows the classes of annotations (as specified in the user profile) and the user selects the text to be annotated (e.g., ``McCarthy") and clicks on the desired class (e.g., Person). The new annotation is added to the document and the server is updated immediately. The client also provides facilities for deleting wrong annotations, which are then propagated to the server, in a similar way. 
The annotation facilities in the OLLIE client are designed to work exactly in the same way as those in GATE's visual environment, so that users familiar with one can use the other without a major overhead. The advantages of using the OLLIE client is that it does not require any installation and runs in any Java-enabled Web browser, which makes it ideal for collaborative corpus annotation, especially if users tend to change their locations and machines. On the other hand, the GATE visual environment provides more extensive facilities, e.g., specialised viewers/editors for complex structures such as syntax trees, while still supporting distributed LRs via the Oracle/PostgreSQL databases, but needs to be installed locally on the machine. In addition, due to its more generic nature, as a development environment for language processing applications, it offers many additional features which make its user interface too generic, whereas OLLIE's client is specifically designed for collaborative corpus annotation. The users do not need to manually annotate all data, instead they can use GATE's language processing tools to bootstrap the annotations by running them on the corpus. The actual processing occurs on the OLLIE server where the corpus resides and can also be scheduled to take place overnight. While some other infrastructures, e.g., AGTK (Ma et al 2002) offer corpus sharing in a manner similar to GATE, GATE has the advantage of offering the bootstrapping support discussed here. OLLIE supports bootstrapping with a wide range of linguistic information - part-of-speech, sentence boundaries, gazetteer lists, and named entity class. This information, together with tokenisation information (kind, orthography, and token length) is provided by the language processing components available within GATE, as part of the ANNIE system (Cunningham et al 2002). The user chooses which of this information is needed and the respective language processing components are run on the OLLIE server, the document is updated with new annotations, and they are propagated immediately to the OLLIE client and shown to the user. The user can then correct these annotations by deleting the wrong ones and adding new ones. However, such bootstrapping would help only if the language processing tools have acceptable performance. Therefore, if the automatic results for some annotation types (e.g., Organizations) have proven unreliable, the OLLIE client also offers the facility to delete all annotations of a given type by selecting this type in the right-hand side pane and pressing the Delete key. 6. Web Services based collaboration Web services are a new breed of web applications that are based on the service-oriented architecture (SOA). Web Services are self-contained, modular applications that can be published, located and invoked across the Internet. They ease the inter-enterprise application integration, collaboration and process automation, reduce complexity associated with development and deployment and make it possible to easily build loosely-coupled systems that allow dynamic binding and just-in-time integration. At present Web Services functionality is being added to GATE. This will allow clients to use remotely the GATE functionality and use services (processing resources) that are otherwise not available locally. 
Making GATE Web Services aware will lead to: • increased collaboration between disparate research sites • easier integration of the HLT functionality in GATE with other applications • reduced cost and time of the development of GATE based applications • easy integration with light-weight and non-java components that require HLT functionality 7. Conclusion In this paper we described a set of language engineering tools for collaborative corpus annotation, which allow users to share work on language resources distributed over the network and also to automate the annotation process by running relevant language processing tools. These tools form part of GATE, the General Architecture for Text Engineering. Recently we have also provided machine 170 learning tools as part of GATE, in order to enable users to have trainable NLP tools they can use for corpus annotation. In future work we plan to enhance the GATE support for collaborative corpus annotation by providing access to the GATE corpora and tools via Web services. As shown by some recent work on using Web services and SOAP for distributed language resources (Dalli 2002), this model combines very well with efficient database storage mechanisms, like the ones used in GATE. 8. Acknowledgements Work on GATE has been supported by the Engineering and Physical Sciences Research Council (EPSRC) under grants GR/K25267 and GR/M31699, and by several smaller grants. The first author is currently supported by the EPSRC-funded AKT project (http://www.aktors.org) grant GR/N15764/01 . 171 Figure 0 Logical model of the GATE relational schema. 172 9. References Baker P, Hardie A, McEnery A, Cunningham H, Gaizauskas R. 2002. EMILLE, A 67-Million Word Corpus of Indic Languages: Data Collection, Mark-up and Harmonisation. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002), pages 819--825. Bird S, Liberman M 1999. A Formal Framework for Linguistic Annotation. Technical Report MS-CIS-99-01, Department of Computer and Information Science, University of Pennsylvania. http://xxx.lanl.gov/\-abs/cs.CL/9903003. Cunningham H, Maynard D, Bontcheva K, Tablan V 2002. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, US. Dalli, A 2002. Creation and Evaluation of Extensible Language Resources for Maltese. In: Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002). Gran Canaria, Spain. Gamback B, Olsson F 2000. Experiences of Language Engineering Algorithm Reuse. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC), pages 155--160, Athens, Greece. Grishman R 1997. TIPSTER Architecture Design Document Version 2.3. Technical report, DARPA. http://www.itl.nist.gov/div894/894.02/related_projects/tipster/. Ide N, Bonhomme P, Romary L 2000. XCES: An XML-based Standard for Linguistic Corpora. In Proceedings of the Second International Language Resources and Evaluation Conference (LREC), pages 825--830, Athens, Greece. Ma X, Lee H, Bird S, Maeda K 2002. Models and tools for collaborative annotation. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'2002), Gran Canaria, Spain. Pastra K, Maynard D, Hamza O, Cunningham H, Wilks Y 2002. How feasible is the reuse of grammars for Named Entity Recognition? In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'02), Gran Canaria, Spain. 
Tablan V, Ursu C, Bontcheva K, Cunningham H, Maynard D, Hamza O, McEnery A, Baker P, Leisher M 2002 A Unicode-based environment for creation and use of language resources. In Proceedings of 3rd Language Resources and Evaluation Conference (LREC'02), Gran Canaria, Spain.
(Oracle) Oracle database documentation from http://technet.oracle.com
(Postgres) Postgres database documentation from http://www.postgresql.com/docs

Annotation without lexicons: an alternative to the standard bootstrapping approach
Mark Davies
Illinois State University

1. Introduction

A fundamental problem facing the creators of corpora for less-common languages – or the older stages of an established language – is the lack of suitable lexicons to annotate the corpus. Typically, the annotation of these languages involves "bootstrapping", which refers to the process of starting with the most frequent forms (or some of the morphologically most predictable forms, or both) and progressively annotating the corpus as one also constructs the lexicon. As the annotation process continues, one can deal with less common or less regular forms, because of the predictive capacity of the increasingly robust lexicon and the increasingly rich, annotated context in the corpus itself. (For papers dealing with recent approaches to bootstrapping for a wide range of languages, see Rocio et al 1999, van Eynde et al 2000, Simov et al 2002, Cucerzan et al 2002, Ghani et al 2002, Moreno et al 2003).

A standard approach to corpus annotation is to recursively traverse the entire textual corpus itself, searching for textual patterns, and then adding annotation to the corpus. This can be done either by large-scale pattern matching and replacement (using regular expressions or a similar schema), or by processing sequential chunks of text in a buffer (where only a small window of text is used to disambiguate part of speech (POS) and lemma). The important point is that the annotation typically takes place directly in the textual corpus itself. In this paper, we will outline an alternative schema that was used to annotate the Corpus del Espanol (www.corpusdelespanol.org) – a 100 million word corpus of Spanish texts from the 1200s-1900s. The annotation of this corpus was done without directly dealing with or seeing the actual textual corpus itself. Rather, the annotation was done on tables containing all of the distinct 1, 2, 3, and 4-word sequences (n-grams) in the entire corpus, along with the frequency of each of these n-grams in each historical period and modern register of Spanish. As we will see, this non-traditional approach affords a number of important advantages, both in terms of the flexibility and speed of the search engine for the corpus.

For the purposes of this paper, we will focus primarily on the method of annotating the older stages of the language – the texts from the 1200s through the 1500-1600s. The lexicon that we created for Modern Spanish was able to annotate the texts from the 1800s-1900s quite well, but was progressively less useful for older stages of the language. For example, only 40% of the types from the 1700s appear in the Modern Spanish lexicon, and this decreases to 33% for the 1500s and 16% for the 1200s. In other words, most of the types from older historical periods were from a different "language", as far as the lexicon was concerned.
We will see, however, that by using an approach based on n-grams tables in relational databases, we were still able to annotate tens of thousands of distinct word forms from the oldest stages of the language in a matter of just a few hours.

2. Corpus architecture and design

Before discussing in detail the way in which the annotation is carried out with the aid of large relational databases of n-grams, let us briefly consider the overall organization of the Corpus del Espanol. The actual textual corpus for the 100 million word corpus is stored as 1000-2000 word chunks of text in a Microsoft SQL Server 7.0 database. This textual corpus itself is not annotated in any way, apart from a code that indicates the source of each block of text. However, it is indexed with SQL Server "Full-Text Indexing", which is similar to the standard Microsoft Search engine. This indexing scheme allows exact words and phrases to be found fairly quickly – usually in less than one second to query the entire corpus and return the relevant examples. The important limitation, however, is that the Full-Text search engine for SQL Server only works well with exact words and phrases. Even wildcard searches are problematic, and certainly there is no capability for customized annotation of any sort.

The annotation for the Corpus del Espanol resides in relational databases, which are completely separate from the textual corpus itself. These databases are composed of different tables for all of the 1, 2, 3, and 4-grams in the corpus. These tables also include the frequency for each of these n-grams in each of the centuries from the 1200s-1900s, as well as in the different registers of Modern Spanish. The data for these tables was generated from the textual corpus itself using the WordList function of WordSmith. This program was run separately on several three to four million word blocks of text, and the results were then merged together in the SQL Server tables.

As might be imagined, the tables are rather large, since they include all of the distinct 1, 2, 3, and 4-grams in the entire corpus. There are nearly one million distinct 1-grams (i.e. types), eleven million distinct 2-grams, forty million distinct 3-grams, and 65 million distinct 4-grams. An example of one of the forty million 3-grams from the corpus is the following:

Table 1. N-grams/frequency table
w1    w2    w3      x12   x13   x14   x15   x16   x17   x18   x19   19-Lit   19-Oral   19-Misc
son   las   cosas   38    16    77    67    16    19    33    68    24       40        14

The columns w1, w2, w3 refer to each of the "slots" in the 3-gram; the columns x12-x19 refer to the frequency of this 3-gram in the 1200s-1900s; and 19-Lit, 19-Oral, and 19-Misc refer to the frequency in these three registers from the 1900s. Each of these relational database tables is indexed, including some clustered indices (to be discussed in more detail later on), all of which leads to very fast retrieval.

3. Queries without annotation

At this level, however, there is still no annotation per se – only the textual corpus and the n-grams tables. Even at this level, however, there are some useful queries that can be run against the database. For example, a user can input any of the following queries into the web-based form:

Table 2. Queries without annotation
QUERY                     SORT BY   LIMITS
tan * como                1900s
*ización                  1900s     1900s>5 –1800s
quer* lo/la/los/las *r    1200s     +1200s –1500s

The first query will search the 3-grams table for all cases where the [w1] column is tan "as/so" and the [w3] column is como "as", and order the results by the frequency in the [x19] column.
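By way of illustration, the SQL generated for this first query would presumably look something like the following (a sketch based on the x3 table and column names described above, rather than the actual output of the web script):

-- [w1] = tan, [w3] = como, sorted by frequency in the 1900s
select top 300 *
from x3
where w1 = 'tan' and w3 = 'como'
order by x19 desc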
This will return strings like tan bueno como "as good as", tan rápido como "as fast as", etc. The second query will search the 1-grams table for all of the records where the word in the [w1] column ends in [-ización], the value in the [x19] column is more than 5, and the value in the [x18] column is 0 (meaning that the word appears for the first time in the 1900s). This will return strings like privatización, globalización, and urbanización. The final example will return strings like queriendo lo fazer "wanting to do it" and queremos las dezir "we want to say them", which represent cases of a form of querer "to want" followed by a direct object pronoun, followed by an infinitive. This query will search the 3-grams table for all records where the [w1] column has the pattern [quer-], the [w2] column is one of the following (lo, la, los, las), the word in the [w3] column ends in [-r], the [x12] column is greater than 0, and the [x15] column is 0 (meaning that the phrase is found primarily in Old Spanish).

The simple search syntax of the third query in the table above is transformed via web-based scripting into the following SQL statement, which is then run against the database, and which returns the results in less than one half of a second:

select top 300 * from x3 where w1 like ('quer%') and w2 in ('lo', 'la', 'los', 'las') and w3 like ('%r') and x12 > 0 and x15 = 0

Because Spanish has morphology that is both strong and fairly regular, users can employ simple lists of words and word patterns to search for even relatively complex syntactic constructions, as in Table 2 above. However, at some point it will obviously be necessary to have more complete annotation, including annotation for those lemma that are not morphologically regular (e.g. quis* for preterite forms of querer), as well as parts of speech that are not predictable in terms of forms (such as nouns and adjectives in Spanish). The major focus of this paper, then, is the way in which this can be done using collocational and frequency information from the n-grams tables themselves, and the challenge that this presents for languages (or stages of a language) for which we do not have a lexicon.

4. Annotating the corpus: inheriting information from related lexicons

Before turning to the basic question of how to enable "bootstrapping" via n-grams tables in relational databases, however, let us first consider a related and somewhat less difficult scenario. Imagine that there is a lexicon for the modern stage of a particular language, but that the need exists to annotate an older stage of the same language. Obviously, some of the forms from the modern language will be applicable to older stages, but this lexicon will become progressively less useful the farther back one goes. For example, as we have previously mentioned, 40% of the types from the 1700s in the Corpus del Espanol appear in the Modern Spanish lexicon, and this decreases to 33% for the 1500s and 16% for the 1200s. To the degree that there is similarity in types between the older and newer stages, however, perhaps the best strategy for annotating the corpus is simply to use the database to "inherit" features from related forms in the modern language. Let us briefly consider how this "inheritance" of annotation features has been carried out with the Corpus del Espanol.
First, we used the frequency information in the n-grams tables to identify the highest frequency forms from older stages of the language, for which lemma and/or POS has not been applied from the Modern Spanish lexicon. For example, a simple SQL query like the following would produce a rank-ordered list of the most common words from the 1500s (x15 column) in the x1 table (single words), where the word (w1) is the same as a word in the lemma table (x_l) which does not have a lemma assigned (column x1); a similar query could be run for other centuries, as well as for the POS table [x_c]:

select top 100 x1.x15, x1.w1 from x1, x_l where x_l.x1 is null and x_l.w1 = x1.w1 order by x1.x15 desc

This list of the most common unannotated forms for a particular historical period can be INSERTed into another table called "unannotated". The corpus creator would then manually go through this list and type into an adjacent column the modern forms that correspond to the older, unannotated forms, in those cases where there is such a correspondence. For example, in this query from the Corpus del Espanol, some of the unannotated forms that appear are començó, cavallos, and hazían, which correspond to the modern forms comenzó, caballos, and hacían "3SG-started, horses, 3PL-made". Once the older forms are listed in the "unannotated" table along with their modern counterparts, we use a simple SQL UPDATE command to "copy" the POS and lemma values for the modern forms and apply these to the older forms.

In addition to looking at the unannotated forms with the highest frequency, we can also search for forms according to regularized phonetic or morphological changes between the modern language and older stages of the language. For example, there was a regularized shift from [-zi-] to [-ci-] in Spanish, and the following query will find the 1000 most common unannotated forms from the 1500s that have the pattern [-zi-] and which correspond to an annotated [-ci-] word in the modern lexicon, such as the older forms haziendo, juizio, and vezinos:

select top 1000 x1.x15, x1.w1 from x1, x_l where x_l.x1 is null and x_l.w1 = x1.w1 and x1.w1 like '%zi%' and patindex('%zi%', x1.w1) in (select patindex('%ci%', w1) from x1 where w1 like '%ci%') order by x1.x15 desc

Once these older forms are INSERTed into an "unannotated" table, a simple REPLACE command can be used to place the modern form into another column, and a subsequent UPDATE query would copy the modern Spanish POS and lemma values to the older forms. In this way, with a knowledge of some of the basic phonetic, morphological, and orthographic changes in the language, it is possible to annotate thousands of forms from older stages of the language in a matter of a few hours.

5. POS annotation with n-grams/frequency information and pattern matching

In the previous section we assumed a more optimistic scenario, in which there is some type of related lexicon that can be applied to our corpus. Let us now turn to the more pessimistic scenario, in which there is no lexicon at all and we are simply "working from scratch". The only assumption here is that we have the n-grams/frequency tables and the relational database structure that we have previously discussed, but there are no other annotation tools available to us. In order to assign POS, at the most basic level we will simply use SQL commands to select the most common word forms that have a certain morphological pattern.
For example, the following SQL query selects those forms that end in [-ADO/-ADA/-ADOS/-ADAS], which is the typical marker of the past participle:

select top 100 x12, w1 from x1 where w1 like '%do' or w1 like '%da' or w1 like '%dos' or w1 like '%das' order by x12 desc

The one problem with such queries, however, is that they also retrieve many items that are morphologically similar, but which do not in fact belong to the desired grammatical category. For example, the preceding query would retrieve (-DO) quando, grado, mando; (-DA) espada, nada, cada; (-DOS) todos, dos; and (-DAS) espadas, todas – none of which are past participles. Fortunately, we can use SQL sub-queries to limit forms that have overly-generalized patterns (e.g. -DO, -DA, etc.), by comparing them to other forms with which they share a predictable morphological relationship. For example, the past participle of [-AR] verbs is (nearly always) formed by removing the [-AR] of the infinitive, and replacing it with [-ADO]. We can therefore include in the SQL query a sub-query that checks to see whether the "root" of an –ADO form (i.e. remove the –ADO) can also be found as the root for an infinitive (i.e. remove the –AR):

select top 100 x12, w1 from x1 where w1 like '%ado' and len(w1) >= 3 and left(w1, len(w1)-3) in ( select left(w1, len(w1)-2) from x1 where w1 like '%ar' and len(w1) >= 2 and x12 > 10 ) order by x12 desc

Using this sub-query, we eliminate spurious [–ADO] forms like grado, obispado, and prelado, because there are no corresponding infinitives with the (hypothetical) forms *grar, *obispar, and *prelar. In addition, we can use sub-queries to find derived forms of grammatical categories. For example, in searching for past participles ending in [–ADA] (fem sg), we can have the sub-query check to see whether the corresponding [–ADO] (m sg) form also exists (and at a certain frequency, such as 10 occurrences in the 1200s):

select top 100 x12, w1 from x1 where w1 like '%ada' and left(w1, len(w1)-1) in ( select left(w1, len(w1)-1) from x1 where w1 like '%ado' and x12 > 10 ) order by x12 desc

This would eliminate spurious "[-DA] past participles" like cada "each" and espada "sword", because there are no corresponding [–DO] past participles like the hypothetical forms *cado and *espado.
select top 1000 x12, w1 from x1 where w1 like '%r' order by x12 desc

An alternative strategy – and one that relies on the inherent strengths of an n-grams approach – would be to search the 2-grams table for words that end in [-AR / ER / IR / YR], but which also occur after some of the most common forms of one of the three auxiliary verbs poder, querer, and deber:

select top 1000 w2, sum(x12) from x2 where (w1 like 'quer%' or w1 like 'quis%' or w1 like 'quier%' or w1 like 'pod%' or w1 like 'pued%' or w1 like 'dev%' or w1 like 'deb%' or w1 like 'deu%') and w2 like '%r' group by w2 having sum(x12) > 2 order by sum(x12) desc

This query is much more accurate than the 1-gram query shown previously (producing only one false hit in the top 200 word forms), and takes less than one second to run.

The following are some additional searches that show how the n-grams tables can be used to annotate parts of speech. In each case, we want to identify a two or three word syntactic environment in which nearly all of the words in one of the "slots" belong to a particular syntactic category. For example, suppose that we want to identify the 2000 most common nouns in the 1200s. A syntactic environment in which these would occur is [indef art] + __ + [que] (un omne que "a man that", vna casa que "a house that", etc). The following query – which takes less than one second to run – selects the 2000 most frequent words in slot 2 (w2) in the 3-grams table (x3), which occur more than two times in that slot, which are preceded by a word (w1) that is one of (un, una, vn, vna), and which are followed by que "that" in the third (w3) slot:

select top 2000 w2, sum(x12) from x3 where w1 in ('un', 'una', 'vn', 'vna') and w3 = 'que' group by w2 having sum(x12) > 2 order by sum(x12) desc

Likewise, the following query – which can be run against the entire 100 million word corpus in less than one second – is an attempt to define a syntactic environment for adjectives. It selects all words in the third slot (w3) of the 3-grams table (x3), in which the first slot (w1) is one of several high-frequency forms of ser "to be", and the second slot (w2) is a form of muy "very" or tan "so":

select w3, sum(x12) from x3 where w1 in ('es','era','será','sera','fue') and w2 in ('muy','mui','tan') group by w3 having sum(x12) > 2 order by sum(x12) desc

Finally, let us return to the category of past participle, which we considered in the previous section. Recall that in the case of pattern matching with individual 1-grams, it was fairly difficult to morphologically define a past participle (words that typically end in –ADO / -ADOS / -ADA / –ADAS), because of the number of false hits like cada, dos, and espada. By using the 2-grams or 3-grams table, however, we can constrain the search much better, by selecting only those forms that occur after one of the auxiliary verbs haber "to have", or estar or ser "to be", which is the real evidence for a form being a past participle. In the following query, we search the 2-grams table (x2) to select those forms (w2) that end in [–ADO / -ADOS / -ADA / –ADAS], which are preceded (w1) by some of the highest frequency patterns matching forms of haber, ser and estar:
select top 1000 w2 from x2 where (w1 in ('ha','has','han') or w1 like 'ha__a%' or w1 like 'esta_a%' or w1 in ('fue','fueron','era','eran')) and (w2 like '%ado' or w2 like '%ada' or w2 like '%ados' or w2 like '%adas') group by w2 having sum(x12) >= 2 order by sum(x12) desc

As a final note regarding POS annotation, we should mention the progressive nature of the process. As with other methods of bootstrapping, our queries will become progressively more refined and powerful. For example, once we have accurately defined the 500 or 1000 most frequent nouns, we can then use the category [noun] to look for the members of other categories. In other words, it is not necessary for us to continually be doing queries "from scratch", as in the examples given here.

7. Lemmatization with n-grams, frequency, collocates, and pattern matching

In addition to POS tagging, the n-grams/frequency tables can also be used to help with lemmatization. Probably the most obvious application is to search for word forms that have a certain morphological pattern. For example, the following query will list the most frequent words (w1) in the 1200s where the word pattern is 'quer*' (querya, querremos), 'qu_er*' (quiere, qujeren) or 'quis*' (quisyere, quise), which are all Old Spanish forms of querer "to want":

select top 100 x12, w1 from x1 where w1 like 'quer%' or w1 like 'qu_er%' or w1 like 'quis%' order by x12 desc

We may determine, however, that a number of the morphologically similar word forms actually belong to another lemma. In the example above, a number of the matching forms belong to the Old Spanish verb querellar "to quarrel". To further limit the forms, we may wish to provide a syntactic context in which only the forms of querer "to want" will be found, such as those forms that precede an infinitive. The following query does this by using a sub-query to limit the [que-] forms to just those that occur immediately before an infinitive (w2; [-AR/-ER/-IR/-YR] for Old Spanish) in the table of 2-grams (x2), which is a syntactic environment in which querellar "to quarrel" would likely not occur:

select top 100 x12, w1 from x1 where (w1 like 'quer%' or w1 like 'qu_er%' or w1 like 'quis%') and w1 in ( select w1 from x2 where (w1 like 'quer%' or w1 like 'qu_er%' or w1 like 'quis%') and (w2 like '%ar' or w2 like '%er' or w2 like '%ir' or w2 like '%yr') ) order by x12 desc

By combining pattern matching (e.g. 'qu_er*'), collocational information from the multi-word n-grams tables, and frequency information for each of these n-grams, it is possible to lemmatize hundreds or thousands of words in just a few hours, even when there is no lexicon available. Finally, we should once again keep in mind the progressive and cumulative nature of the annotation process. The more we have defined a particular grammatical category or a particular lemma, the easier it will be to find members of additional categories or lemma. For example, in attempting to find the members of the lemma [querer] "to want", we have looked at n-grams in which the word in the following "slot" ends in [-AR / -ER / -IR / -YR] (in Old Spanish). Yet at a certain point we will have refined the [V_INF] category to the point that we can use it directly in the queries. Likewise, once we have identified the forms of a particular lemma, we can then use that lemma directly in subsequent queries.
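To give a rough idea of what such a lemma-based query might look like, the following sketch reuses the lemma table (x_l) introduced in section 4 to find infinitive-like forms occurring after any form already assigned to the lemma [querer]; the exact thresholds are illustrative rather than the ones actually used:

select top 100 w2, sum(x12)
from x2
where w1 in (select w1 from x_l where x1 = 'querer')  -- any form already assigned to the lemma [querer]
and w2 like '%r'                                      -- followed by an infinitive-like form
group by w2
having sum(x12) > 2
order by sum(x12) desc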
For example, rather than using 'quer*' and 'qu_er*' to define [querer], at a certain point we can refer to the lemma [querer] directly as we search for other lemma and grammatical categories. This is of course similar to more traditional methods of bootstrapping, which perform progressively more refined annotation on the textual corpus itself.

8. The advantage of n-grams databases vs. traditional approaches

In the preceding sections, we have shown how forms can be annotated, based on information from n-grams and frequency in relational databases. This is a much less orthodox approach than the standard approach, which is to use regular expressions (or other pattern matching schemes) to search for strings and patterns in the textual corpus itself, and then insert the annotation into the textual corpus. However, there are certain clear advantages to the relational database / n-gram approach, as we discuss in this section.

First, our approach is probably much more economical and efficient than the standard approach. As has been mentioned several times, even the most complex queries on tens of millions of records in the n-grams tables take only one or two seconds to run. In the standard approach, each query may involve a new traversal of the entire textual corpus. This is much less of a problem for small corpora (one million words or less), but may quickly become prohibitive for corpora containing tens or hundreds of millions of words of text.

Second, the relational database approach is able to take advantage of sub-queries, which allows it to perform complex multi-stage tests on the data. For example, in order to search for a 1SG form of the present tense (fablo "I speak"), the query could have a number of sub-queries: check to see if the form ends in [-O], check whether the verbal root is also the root for a relatively high frequency infinitival-like form (i.e. with the [-AR/-ER/-IR/-YR] detached), and check whether the form occurs in a multi-word sequence (e.g. a 2- or 3-grams table) where (1SG) present tense verbs would likely occur. Although the database actually performs these searches in sequence, the intermediate result of each sub-query is stored in a temporary table and is automatically processed by the database program. In the standard approach, it would likely be much more complicated. After each traversal of the entire corpus (to look for high frequency, morphologically-related forms, or word sequences), the program would have to store the temporary results and then re-use these as part of subsequent traversals of the corpus. Depending on the efficiency of the script used to process the data, this could quickly become too complicated or prohibitive, in terms of speed and performance.

Third, the n-grams approach naturally lends itself to advanced collocational analysis of the data. For example, in examining the top one hundred forms of suspected adjectives, it might be useful to see the fifteen or twenty most frequent 3-grams for each word form, in which the word form occupies the middle slot. With the relational database approach, the sorted results could be produced (using sub-queries) in two or three seconds. While it would certainly be possible to produce the same listing using the standard approach, it is likely that it would be somewhat more complicated than the five or six lines of code in the relational database SQL command.
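For a single suspected adjective, such a collocational listing could presumably be produced with a query along the following lines (a sketch based on the 3-grams table described earlier; the word bueno and the cut-off of twenty rows are chosen purely for illustration, and in practice the query would be repeated, or written as a correlated sub-query, for each candidate form):

select top 20 w1, w2, w3, sum(x12)
from x3
where w2 = 'bueno'     -- the suspected adjective occupies the middle slot of the 3-gram
group by w1, w2, w3
order by sum(x12) desc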
Fourth, the relational database approach allows one to easily select and deselect word forms that are potential members of a syntactic category or particular lemma. For example, the SQL command can INSERT 100 or 1000 probable forms into a "temporary" table, along with the highest frequency preceding and following words, if so desired. These forms can then be quickly reviewed, and the value of an "action" column set to a particular value for those word forms that belong to the grammatical category or lemma. Another UPDATE command can then go back and assign the particular POS or lemma to all of the matching forms in the actual n-grams databases, for those items whose "action" column has been set to a certain value. In the standard approach, it would likely not be as intuitive or natural to review a list of suspected forms, select a subset of these, and then annotate the corpus itself based on which items have been selected.

The fifth and final advantage of the relational database / n-grams approach is also perhaps the most important one. In this approach, the n-grams/frequency tables are simply one part of the overall database, albeit the most important part. But because they are in a relational database, they can very easily be joined to other tables within the same database, and there is no limit to the amount of annotation that can be applied to the corpus. For example, in the Corpus del Espanol, there is a table containing the synonyms for 30,000 words, and users can access this information as part of their search. An example of this is the following query, which searches for all forms of all lemma that are synonyms of mandar "to order, command", followed by the subjunctive marker que "that", followed by a past subjunctive (mandé que fueran "I made them leave"; hicieron que dijera "they made her say"):

!mandar.* que *.v_subj_ra

Likewise, users of the Corpus del Espanol can create customized lists of words that are morphologically, syntactically, or semantically related, such as adjectives describing emotions, words ending in [-azo] that denote a blow or strike, or a list of temporal adverbs. These are stored in tables, where they can later be used as part of the query syntax. For example, suppose that a user [Jones] creates a list called [emotions], which contains a list of verbs of emotion (e.g. gustar, alegrar, sorprender "to please, make happy, surprise"). The user can then use this list as part of the following syntactic construction to return phrases like me gusta que haya "it pleases me that there is", le sorprende que tengan "it surprises him/her that they have".
The following is the query as it is entered into the search form, along with a description of the query:

me/nos/te/os/le/les [Jones:emotions].* que *.v_subj_pres
[one of the indirect object pronouns me, nos, te, os, le, les + any form of any of the words in the [emotions] list created by Jones + que + present subjunctive]

The web script then translates this into the following SQL command, which is passed to the database:

select top 300 * from x4 where w1 in ('me', 'nos', 'te', 'os', 'le', 'les') and w2 in (select w1 from x_l where x1 in ('gustar', 'sorprender', 'agradar', 'alegrar')) and w3 in ('que') and w4 in (select w1 from x_c where x1 in ('v_subj_pres'))

The important point is that there can be an unlimited number of levels of annotation on the corpus – whether parts of speech, or lemma, or synonyms, or translations between languages, or etymologies, or customized lists – and these can all be linked together with simple SQL JOIN commands. It is not apparent how this degree of flexibility or power would be an inherent property of the standard scheme, in which the annotation is based within the textual corpus itself.

9. The design and architecture of annotation in the relational database

As explained previously, there are two approaches to the placement of annotation in the relational databases. One approach would be to include it as additional columns in the n-grams/frequency databases themselves. For example, in the following case there is annotation (POS and lemma) within the table for each of the three words in the 3-gram son las cosas "are the things":

Table 3: N-grams/frequency table with integrated POS and lemma annotation
W1    L1    C1   W2    L2   C2     W3      L3     C3   x12   x13   x14   x15   x16   x17   x18   x19   19-Lit   19-Oral   19-Misc
son   ser   vp   las   lo   adef   cosas   cosa   N    38    16    77    67    16    19    33    68    24       40        14

In this scenario, for example, a user who is searching for cases of a form of poder + estar + a present participle (e.g. puede estar pensando "3SG-can be thinking") would enter the following in the web-based form:

poder.* estar *.v_ndo

This is then translated into the following SQL command, which queries only the 3-grams table (x3), since it has all of the POS and lemma information included within that one table:

select top 300 * from x3 where L1 = 'poder' and W2 = 'estar' and C3 = 'v_ndo'

The alternative is to have the n-grams/frequency information in one table (x3), and the POS and lemma information in two additional tables (x_c and x_L, respectively):

Table 4: Separate n-grams/frequency, POS, and lemma tables
(x3)
w1    w2    w3      x12   x13   x14   x15   x16   x17   x18   x19   19-Lit   19-Oral   19-Misc
son   las   cosas   38    16    77    67    16    19    33    68    24       40        14

(x_c)                  (x_L)
w1     x1 (pos)        w1      x1 (lemma)
es     v_pres          es      ser
será   v_fut           cosa    cosa
sido   v_pp            cosas   cosa

In this scenario, there would be simple SQL JOIN queries to link the two databases. The same query shown above would be translated into the following SQL command. In this case, the database first sub-queries the POS (x_c) and lemma (x_L) tables, and then feeds the output from these tables into a query of the main n-gram/frequency table (x3):

select top 300 * from x3 where w1 in (select w1 from x_L where x1 in ('poder')) and w2 in ('estar') and w3 in (select w1 from x_c where x1 in ('v_ndo'))

So which of the two approaches produces the best results? The advantage of having the annotation within the n-grams/frequency table itself is that the database creator can use contextual information to resolve ambiguity.
For example, in the 3-grams table two of the 40+ million records are for the strings el poder de "the power of" and para poder "in order to be able":

Table 5. Contextual disambiguation, based on n-grams
W1     L1     C1     W2      L2   C2   W3    L3    C3      frequencies (x12-x19, 19-Lit, 19-Oral, 19-Misc)
el     el     a_d    poder   -    -    de    de    prep    19257542001359336025567 36 152
para   para   prep   poder   -    -    ser   ser   v_inf   0 0 2 115 3 13376 17 14

In Spanish, poder can either be a noun (el poder "the power") or an infinitive (poder saber "to be able to know"). Therefore, it is not immediately clear what the POS or the lemma should be for the [poder] in the [w2] slots. Using simple SQL updates, however, we can look at the contextual information to fill in or alter the POS and lemma values, based on the values in the other word slots. For example, the following UPDATE query will assign the value of [noun] to [poder] when it occurs before a preposition (as in the first row), and it will assign a value of [v_inf] when it is followed by another infinitive (the second row). The second case is shown in the following example:

update x3 set c2 = 'v_inf' where w2 = 'poder' and c3 = 'v_inf'

Judging from languages like English – which have such a high degree of polysemy – it may seem unwise to have to run a large number of SQL updates – such as the one just shown – to improve the annotation. Yet it turns out that because Spanish is such a morphologically strong language, there are relatively few forms (as with poder) that have a high frequency as two different parts of speech or two different lemma. Because of this very low level of polysemy, we might take a more radical step and completely remove the annotation from the n-grams/frequency tables, as shown in Table 4 above. In this case there would be two entries for poder in both the POS and lemma tables, and in most cases the queried phrase will naturally disambiguate itself. For example, if a user searches for [de *.v_inf *.v_inf], then [de poder saber], [de poder estar] and others will be retrieved, because poder is listed as a [v_inf] in the POS table. And it turns out that all of these examples of the potentially polysemous poder really are instances of its use as an infinitive. Likewise, if a user searches for [el *.n de], then [el poder de] "the power of" will be retrieved, and again virtually all of these cases will be with poder as a noun (because of the following preposition). There will be problems only in those relatively few cases in which 1) a word is polysemous, 2) both meanings are highly frequent, and 3) the context is not sufficiently rich to disambiguate the multiple meanings.

Assuming that ambiguity in POS and lemma assignment is much less of a problem in a morphologically complex language like Spanish (as opposed to a morphologically weak language like English), we can take the further step of simply removing the annotation from the n-grams/frequency tables altogether (where it might be useful for contextual disambiguation) and placing it in separate POS and lemma tables. In fact this is precisely what we have done in the Corpus del Espanol, and more than a year's worth of complex searching on the corpus suggests that it very rarely leads to any problems.

One of the major advantages of placing the annotation in tables that are separate from their context deals with redundancy. In the database architecture in which the POS and lemma information is part of the n-grams tables, redundant annotation occurs every time a word occurs in any slot of any n-gram table.
Thus a word like es "is" would be annotated as [POS=v_pres; lemma=ser] in each of the 209,160 rows of the 3-grams table where it appears in the [w2] slot – as well as in hundreds of thousands of other rows for the other slots of the 3-grams table and the other n-grams tables. By placing the annotation in a separate table, there is exactly one entry for [es]. There are also secondary benefits from placing the annotation in separate tables, both of which are related to the issue of redundancy. First, because the annotation only occurs in one or two rows of the POS or lemma tables, hundreds or thousands of forms can be updated in a matter of one or two seconds. If a form can occur in hundreds of thousands of rows, on the other hand, then updating the annotation for that large number of forms becomes more difficult. A second issue is somewhat more technical in nature, and deals with the physical architecture of database tables. In most databases, only one column in each table can have a "clustered" index, which means that the rows in the table are physically arranged on the hard drive according to the contents of that column. (In a non-clustered index, on the other hand, there are pointers to the data, but the data itself may be spread over the entire hard drive.) For our case, the important point is that if the clustered index is placed on the [w1] column (the first word in the n-gram), then any indices on the annotation columns in the n-grams table cannot be clustered, and queries dealing with POS or lemma will be relatively slow. By placing POS and lemma annotation in their own tables, each of these tables can contain a clustered index, and text retrieval will be much faster. This is why queries of the 100 million word Corpus del Espanol typically take just one or two seconds – even for the most complex queries, involving POS, lemma, word patterns, synonyms, and user-defined lists of words.

References

Cucerzan S, Yarowsky D 2002 Bootstrapping a multilingual part-of-speech tagger in one person-day. CoNLL-2002 Sixth Conference on Natural Language Learning. Taipei, Taiwan, pp. 132-38.
Ghani R, Jones R 2002 A comparison of efficacy and assumptions of bootstrapping algorithms for training information extraction systems. In Proceedings of LREC 2002 Workshop on "Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data". Las Palmas, Spain, pp. 61-72.
Moreno A, López S, Sánchez F 2003 Developing a syntactic annotation scheme and tools for a Spanish treebank. In Abeillé A (ed), Building and using syntactically annotated corpora. Dordrecht: Kluwer, pp. 149-65.
Rocio V, Pereira G 1999 Análise sintáctica parcial em cascata. In Marrafa P, Mota M (eds), Linguística Computacional: Investigaçao Fundamental e Aplicaçoes. Lisboa: Ediçoes Colibrí, pp. 235-251.
Simov K, Kouylekov M, Simov A 2002 Incremental specialization of an HPSG-based annotation scheme. In Proceedings of LREC 2002 Workshop on "Linguistic Knowledge Acquisition and Representation: Bootstrapping Annotated Language Data". Las Palmas, Spain, pp. 16-23.
van Eynde F, Zavrel J, Daelemans W 2000 Lemmatisation and morphosyntactic annotation for the spoken Dutch corpus. In Gavrilidou M et al. (eds), Proceedings of the Second International Conference on Language Resources and Evaluation. Paris: ELRA, pp. 1427-1433.

Consonant variation within words
Joost van de Weijer
Department of Linguistics and Phonetics, Lund University, Helgonabacken 12, 22362 Lund, Sweden
email: vdweijer@ling.lu.se

1. Introduction
The aim of this paper is to show that the consonants that a simple (monomorphemic) word is constructed with are usually different from each other. To illustrate this, consider words like glass, drink, butter, or tray, in which no consonant occurs more than once (it is the pronunciation I am concerned with, not the spelling). Changing one consonant so that it becomes identical to another one would result in rather odd-sounding new words, e.g., glagg, krink, butteb or bubber. Of course I realize that there are a great many words in which the same consonant occurs more than once, such as clock, deed, paper, and text. However, I believe that these words are in a minority and of relatively low frequency, so that they do not show up often in everyday speech. I do not think that the phenomenon is restricted to any language in particular, but instead applies to many languages, although I would not dare to claim all.

I am not aware of any study in which this specific claim was questioned, but there are a number of findings that point in the direction of the idea that identical consonants repulse rather than attract one another. First of all, it is common that suffixes are deleted when the adjacent consonant in the stem is homophonous, a phenomenon that has been labeled morphological haplology (Stemberger 1981). In English, for example, the genitive –s is not realized in plural forms, unless the plural does not end on –s (cf. the *boys's books with the children's books). Similarly, in Swedish, the present tense markers –er or –ar are not realized when the verb stem ends on –r, e.g., jag kommer (I come), but jag hör instead of *jag hörer (I hear). Second, identical nearby consonants sometimes undergo change during a process of borrowing between languages or as a result of historical sound change, a process known as phonological dissimilation (Hock and Joseph 1996). Examples of such a change in loan words are the French word marbre, which became marble when adopted into English, or the Italian word tartuffeli (truffle), which became Kartoffel (potato) in German. An example of an historical sound change is the Swedish word nyckel (key), which once was lyckel. A third observation is that whereas relatively many languages show vowel harmony (a phonological process according to which vowels within a word become more similar), there are only a few examples of languages with consonant harmony. A major exception is child language, in which patterns of consonant harmony frequently occur (Smith 1973; Vihman 1978). Finally, a well-known phonological constraint, the obligatory contour principle (OCP), states that 'adjacent identical segments are prohibited' (Clements and Hume 1995). Originally this constraint was applied to explain the distribution of lexical tones, but at a later stage it was also used to explain regularities in speech segments. McCarthy (1986), for instance, used the OCP to explain patterns in Arabic triliteral and quadriliteral roots that lack identical consonants.

In sum, it appears that there is a tendency for nearby consonants to be different. The aim of the present study is to take this claim one step further by stating that most morphologically simplex words are constructed of only different consonants. This means that not only nearby consonants are different, but also consonants further apart, as long as they are within the same morpheme.
If this is true, then the distribution of consonants can be a valuable source of information for automatic morphological decomposition, since there is an increased likelihood that there is a morphological boundary between two identical consonants. A second, related, application is that the distribution of consonants may help the listener or a speech recognizer in identifying word boundaries. Since there are no reliable markers of word boundaries in spoken language as there are in written language (Cole and Jakimik 1980), many researchers have searched for cues that may aid in the recognition of spoken words (e.g., Cutler and Norris 1988; McQueen 1998). The presence of identical consonants, then, may be a simple cue for the listener to hypothesize that there is a word boundary between them.

As a starting point for investigating these research claims, I analyzed the distribution of identical consonants in a selection of Swedish words. Swedish is a language for which I believe that the hypothesis is true. It has a phonological structure that is in a number of ways similar to that of English. Its syllable structure, for example, is comparable to that of English, with consonant clusters consisting of up to three consonants in syllable-initial position and clusters of up to three consonants in syllable-final position, or even five if suffixes are included (Sigurd 1965). The phonemic inventory contains 18 different consonants (Elert 1989), but additional consonants occur in loanwords, mainly from English or French. I restrict my analysis to morphologically simple words, so that compounds, derivations and inflections are not included. I do not know of any phonological process (other than morphological haplology, which only applies when the adjacent consonant is homophonous) according to which consonants in affixes change because the same consonant also occurs in the stem that the affix is attached to. The suffix –s for third person singular in English, for instance, is the same for verb stems containing an s (as in he sits) or any other consonant (as in he hits).

The remainder of the paper is organized as follows. In the next section I describe the characteristics of the material that was used, and how it was analyzed. Thereafter, I will address the following issues: How many words contain two or more identical consonants? Are certain consonants more likely to occur more than once within a word than others? Are words with identical consonants relatively common or rather rare?

2. Method

Material

The material was selected from a word list that was used for the production of a current Swedish pronunciation dictionary (Hedelin 1997). The complete list is composed of 116,362 entries. An entry consisted of the orthographic word form, a word-class code (noun, adjective, adverb, pronoun, verb, count noun, proper noun, preposition, article, interjection, abbreviation, conjunction, infinitive marker, prefix, word group), the word's pronunciation, and whether the word was a compound, a derivation or an inflected form.

Selection of the material

The aim of the selection was to obtain a word list of different monomorphemic words. Clearly, the dictionary contained a large number of entries that did not meet this requirement. Furthermore, a proportion of the dictionary was redundant for the purpose of the present analysis, since many entries were repetitions of the same word root (e.g., words that could be a noun as well as a verb).
Luckily, a large proportion could be excluded on the basis of the information already provided in each entry. A total of 90,463 entries coded as compounds, e.g., bordduk (table cloth), derivations, e.g., alkoholist (alcoholic – noun), alkoholism (alcoholism), alkoholisk (alcoholic – adjective), etc., and inflected forms (e.g., present tense verbs, genitive forms of pronouns, etc.) were excluded automatically. Obviously, it is debatable where the line between what counts as a morphologically complex word and a morphologically simple word should be drawn (see, e.g., Sproat 1992 for further discussion on this issue). From the list that resulted after the exclusion, I manually excluded an additional 1,164 entries that I considered morphologically complex as well. The entries that were excluded contained words with a meaning that could easily be derived from the individual meanings of the components. These words included for instance a number of change-of-state verbs which in Swedish are constructed of an adjective plus the suffix –na (e.g., the verb mjukna (to become soft) consists of the adjective mjuk (soft) followed by the suffix –na). When the meaning could not be derived from the word's individual components – for instance the meaning of the word protest (protest) – the word was not excluded. Although this strategy worked reasonably well for bisyllabic words, it became more difficult for words of three syllables or longer. For that reason, I decided to restrict the analysis to words of one and two syllables only, and excluded all words that were longer than two syllables. Finally, proper names (e.g., Alfa Romeo, Charlotte, Amsterdam), abbreviations and acronyms (e.g., SAS, NATO), interjections (e.g., oj (oops), usch (ugh)), orthographic variants (e.g., mej (me) instead of mig, or sebra (zebra) instead of zebra), and repeated entries of the same word root (e.g., bank in the meaning of 'sofa' as an alternative to bank in the meaning of 'bank') were also excluded. The final selection then consisted of 8,887 words. The number of consonants in each word varied from zero to eight.

Word frequencies

Word frequency information was obtained from a language corpus collected at the University of Gothenburg. The research group there provided me with a list of approximately 100,000 word types (100,998) that had a frequency of 20 or more in a large corpus consisting mainly of newspaper articles. The sum of the frequencies of these words was somewhat below 57 million tokens (56,916,383). In this list, there were a total of 5,388 word types that also occurred in the selection of morphologically simple words. Together these had a total token frequency of a little below 29 million (28,845,117).

Analysis

The decision whether a word contained two or more identical consonants was made on the basis of the transcriptions provided in the dictionary, with the following two exceptions. First, in the dictionary a distinction is made between long and short consonants. This difference is primarily context dependent, i.e., long consonants follow short vowels, as in the word tack /tAk:/ (thanks), and short consonants follow long vowels, as in the word tak /tA:k/ (roof). In addition, the difference in pronunciation is not realized in many Swedish dialects (Elert 1989). For these reasons, long and short variants of the same consonant were considered to be identical. Second, in some dialects of Swedish, the /r/ has an effect on the place of articulation of one or more subsequent alveolar consonants.
In these cases, the consonants adopt a retroflex or supradental place of articulation, and the /r/ disappears. The word barn (child), for example, is pronounced as /bA:ɳ/ (Elert 1989). The assimilated forms are given in the dictionary, but these were replaced in the analysis by the unassimilated form (e.g., /bA:rn/). Notice that these two criteria make the analysis more conservative, since the likelihood of finding two or more identical consonants within a word increases in both cases.

3. Results

During the analysis, it became clear that no consonant occurred twice within a syllabic onset or syllabic coda. For instance, there were no syllables that ended on –tst or –ksk, or on any other cluster in which the same consonant occurred twice. Although this is an intuitively plausible finding, it is not a trivial observation, since the presence of two identical consonants within a syllable onset or coda is in principle not prohibited by phonotactic constraints. In English, for instance, any noun ending on –st combined with a genitive –s, or any verb ending on –sp combined with third person singular –s, results in words with codas with two identical consonants, e.g., guest's or gasps, etc. A second thing that became clear during the analysis was that no two consonants directly adjacent to each other were ever the same. This could have been possible, for instance, at the boundary between the first and the second syllable in a bisyllabic word. However, there were no such examples.

These two findings were the starting point for dividing the words according to their syllable structure as shown in Table 1. The left column of the table shows the structure of the words, starting with words that consisted of a vowel only. The O's and the C's in the following rows stand for Onset and Coda, where O1 and C1 refer to the onset and the coda of the first syllable, and O2 and C2 refer to the onset and the coda of the second syllable. C1 in the second row, then, refers to all the words that had one or more consonants in the coda of the first syllable, but no consonants in the onset, or in the second syllable if there was any. The second column shows the numbers of word types with the structure listed in the first column. The third and the fourth columns show the absolute and the relative numbers of hits, i.e., the words that contained at least two identical consonants. The last column shows two examples of words with the structure given in column 1. The examples are, wherever possible, words with at least two identical consonants. The table starts with five categories that could not have identical consonants. For the sake of simplicity, bisyllabic words with only one consonant in C1 and one consonant in O2 (for example the words inte and syssla in the table) are classified as if these two consonants both belonged to the onset of the second syllable. Together the first five categories consisted of 438 types, corresponding to 4.93% of the total.

The total number of words with at least two identical consonants was 1,001 (11.26%). Most of them contained no more than two consonants that were the same, a few common examples of which were the pronouns någon (someone) and annan (other), the adverb nästan (almost), the conjunction trots (although), and the adjective/noun svensk (Swede, Swedish). However, there were 54 word types in which two consonants occurred twice, including the words status (status), porträtt (portrait), struktur (structure), and taktik (tactics).
In addition, there were ten words in which one consonant occurred three times, including skepsis (skepticism), substans (substance), traktat (treaty), and census (census). There were no words that contained more than three identical consonants or more than two consonants that occurred twice.

Table 1: Type count. C stands for coda; O stands for onset.
structure    types    hits (%)         examples
-            3        0 (0.000)        i (in), ö (island)
C1           113      0 (0.000)        upp (up), älv (river)
C2           3        0 (0.000)        oas (oasis), eon (aeon)
O1           153      0 (0.000)        trä (wood), vrå (corner)
O2           166      0 (0.000)        idé (idea), inte (not)
C1O2         29       0 (0.000)        önska (wish), ändra (change)
O1O2         2,373    154 (6.490)      syssla (work), kruka (pot)
O1C1         2,432    185 (7.607)      klipsk (shrewd), klok (wise)
O2C2         424      45 (10.613)      annan (other), essens (essence)
O1C1O2       159      20 (12.579)      virvla (whirl), dundra (thunder)
O1C2         66       12 (18.182)      nyans (nuance), neon (neon)
O1O2C2       2,799    527 (18.828)     banan (banana), staket (fence)
C1O2C2       39       11 (28.205)      emblem (badge), anstalt (institution)
O1C1O2C2     128      47 (36.719)      substans (substance), distrikt (district)
totals       8,887    1,001 (11.264)

Among the words that contained identical consonants, there were four relatively small groups of words that suggested that identical consonants in some cases serve a specific purpose. These four groups contained words that are typically used by children, onomatopoetic verbs and nouns, a set of nouns and adjectives that all had a pejorative or distasteful connotation, and finally a set of bisyllabic mimetic ('flip-flop') words that consist of two syllables that are exact or nearly exact replications of each other (see Table 2 for examples).

Table 2: Specific functions of identical consonants.
child words: mamma (mother), pappa (dad), baby (baby), pippi (dickybird), jojo (yoyo), bebis (baby), vovve (dog), dada (nanny)
onomatopoetic: pipa (chirp, squeak), bubbla (bubble), mummel (murmur), tuta (hoot), nynna (hum), babbla (blather)
phonesthemes: knark (drugs), skurk (villain), strunt (rubbish, trash), snusk (uncleanness), skolk (truancy), skunk (skunk), stursk (insolent, impudent), smisk (smack), slisk (cloyingly sweet), slusk (shabby person), smask (smacking noise), slask (wet garbage), stursk (stubborn)
'flip-flop' words: tiptop (tip-top), vigvam (wigwam), pingpong (ping-pong), picknick (picnic), virrvarr (crisscross), sicksack (zigzag), mischmasch (mishmash), gonggong (gong), snicksnack (chatter)

Which consonants were most likely to occur more than once in a word? Table 3 shows how often each consonant occurred overall and how often it occurred as an identical consonant. Comparing the two frequencies shows that /t/, /k/ and /s/ were relatively likely to occur more than once, since their overall relative frequencies were much lower than their frequencies as identical consonants. By contrast, /l/ was not likely to occur more than once, since there was an almost 7% difference between how often it occurred overall and how often it occurred as an identical consonant. The difference between the two frequencies was not more than 3% for all the other consonants.

Table 3: Consonant frequencies (%).
consonant    total relative frequency    frequency as identical consonant
d            5.022                       2.395
f            2.998                       0.958
g            3.417                       2.011
h            1.499                       0.000
j            2.929                       0.383
k            9.068                       14.464
l            10.644                      3.831
m            5.276                       4.789
n            7.787                       7.184
p            4.832                       4.693
r            12.831                      13.027
s            11.928                      20.881
t            10.444                      19.253
v            3.413                       2.490
Ó            1.008                       0.000
²            0.568                       0.192
N            2.180                       0.479
S            0.389                       0.192
b            3.766                       2.778

Table 4 shows the results of the token count.
Comparing the values listed in Tables 1 and 4 reveals that the total percentage of words with identical consonants decreases from 11.26 to 1.57. A major cause of this decrease is that the numbers of tokens in the first five rows are much higher in Table 4 than in Table 1. Included in these categories are a small number of very highly frequent function words, such as och (and), att (to), en (a). However, all the percentages in the other rows of Table 4 are lower than the corresponding percentages in Table 1 as well, indicating that the words with identical consonants were relatively rare.
Table 4: Token count. O stands for Onset, C stands for Coda.
structure    types    tokens        targets (%)
-            3        1,738,874     0 (0.000)
C1           92       6,866,180     0 (0.000)
C2           1        241           0 (0.000)
O1           134      3,171,934     0 (0.000)
O2           123      928,972       0 (0.000)
C1O2         21       127,133       0 (0.000)
O1C1         1,796    11,079,424    164,988 (1.489)
O1O2         1,397    2,244,855     37,974 (1.692)
O1C2         28       15,063        295 (1.958)
O2C2         243      1,067,550     44,522 (4.170)
C1O2C2       20       47,768        2,079 (4.352)
O1C1O2       71       54,653        4,013 (7.343)
O1O2C2       1,402    1,475,306     194,242 (13.166)
O1C1O2C2     57       27,164        5,021 (18.484)
totals       5,388    28,845,117    453,134 (1.571)
4. Conclusions
The hypothesis of the present investigation was that monomorphemic words that contain identical consonants are rare. The results showed that 11.26 percent of Swedish monosyllabic and bisyllabic words contained at least two identical consonants. However, these words seem to be relatively uncommon and, consequently, will not be heard or read often: in a selection of more than 28 million word tokens, I estimated that only 1.57 percent contained identical consonants. Taken together, the results show that identical consonants within words indeed constitute a dispreferred pattern in Swedish. Based on the result of the token count, one would expect to find a word with two or more identical consonants only once in every 65 words. The first question that is open for future research is whether this is also the case in other languages. I expect to find similar percentages in languages with a phonological structure that is comparable to that of Swedish, for instance the other Germanic languages. More interesting comparisons will be with languages that have deviant phonological structures, e.g., a different syllable structure or fewer consonants. A second question that needs to be answered is whether listeners are aware that identical consonants within words are relatively rare, and whether they use the presence of identical consonants for morphological decomposition. If this is true, then it is predictable that listeners will identify non-words with two identical consonants more quickly than non-words with only different consonants. A second prediction is that listeners will decompose morphologically complex words that contain identical consonants (e.g., sits) more quickly than words that do not (e.g., hits). I intend to test these predictions in the near future. Finally, one of the reasons for doing the present study was that identical consonants might be useful for morphological decomposition. As a preliminary attempt to answer the question of how useful identical consonants are, I counted the number of morpheme boundaries that were marked by identical consonants in a short Swedish text. The following text was selected from the introduction of a writing style guide (Svenska Skrivregler, Utgivna av Svenska Språknämnden, 1991): Var-för behöv-er man skriv-regl-er. Tal är ett språk för öra-t, skrift är ett språk för öga-t.
I tal-et ha-r vi fler-a sätt att signal-era hur det vi säg-er skall upp-fatta-s: ton-fall, paus-er, beton-ing, röst-styrka, och dess-utom kan vi an-vänd-a gest-er och min-er. I skrift måste vi an-vänd-a andr-a signal-er, som stycke-in-del-ning, stor eller lit-en bok-stav, skilj-e-tecken. De o-lik-a signal-er-na i tal-språk lär vi oss huvud-sak-lig-en spontan-t de är natur-lig-t fram-vuxn-a, men för skrift-språk-et behöv-s viss-a bestäm-d-a regl-er för den yttre ut-form-ning-en.
In this text, there are 56 morpheme boundaries (marked by dashes) and 79 word boundaries (marked by spaces). Out of these 135 boundaries, 26 (19.3%) were marked by identical consonants. By contrast, there was only one place where two identical consonants did not mark a boundary, namely within the word spontan-t (spontaneously). This suggests that the presence of two identical consonants is a limited but rather reliable source of information that is useful for the identification of morphological boundaries.
5. References
Clements G, Hume E 1995 The internal organization of speech sounds. In Goldsmith J (ed), The handbook of phonological theory. Oxford, Basil Blackwell Ltd, pp 245–306.
Cutler A, Norris D 1988 The role of strong syllables in segmentation for lexical access. Journal of Experimental Psychology: Human Perception and Performance 14: 113–121.
Elert C 1989 Allmän och svensk fonetik [General and Swedish phonetics]. Stockholm, Norstedts Förlag AB.
Hedelin P 1997 Norstedts svenska uttalslexikon [Norstedts Swedish pronunciation dictionary]. Svenska Förlag: Norstedts Ordbok.
Hock H, Joseph B 1996 Language history, language change, and language relationship. New York, Mouton de Gruyter.
McCarthy J 1986 OCP effects: gemination and antigemination. Linguistic Inquiry 17: 207–263.
McQueen J 1998 Segmentation of continuous speech using phonotactics. Journal of Memory and Language 39: 21–46.
Sigurd B 1965 Phonotactic structures in Swedish. Lund, Berlingska Boktryckeriet.
Smith N 1973 The acquisition of phonology. Cambridge, Cambridge University Press.
Sproat R 1992 Morphology and computation. Cambridge MA, The MIT Press.
Stemberger J 1981 Morphological haplology. Language 57: 791–817.
Vihman M 1978 Consonant harmony: its scope and function in child language. In Greenberg J (ed) Universals of language, vol. 2, phonology. Stanford CA, Stanford University Press, pp 281–334.
Rationale for a multilingual corpus for machine translation evaluation
Debbie Elliott, Anthony Hartley, Eric Atwell
debe@comp.leeds.ac.uk a.hartley@leeds.ac.uk eric@comp.leeds.ac.uk
School of Computing and Centre for Translation Studies, University of Leeds, Leeds LS2 9JT, England.
1. Introduction
An overview of research to date in human and automated machine translation (MT) evaluation (Elliott 2002) points to a growing interest in the investigation of new automated methods, allowing for the quick and inexpensive evaluation of MT output. It is clear, however, that corpora designed for this purpose are lacking. Our own research in automated evaluation methods will require not only a corpus of source texts with machine translations that represent actual MT use, but also the detailed scores for these translations given by human evaluators. These scores will allow us to test the reliability of new automated evaluation methods. It is our intention, therefore, to compile a multilingual corpus specifically for MT evaluation, to meet not only our own research requirements, but the needs of the MT community at large.
2.
Machine translation evaluation The evaluation of machine translation output has played a crucial role in the development of MT systems since their emergence over five decades ago. Evaluations are required both by developers, before and after system modifications, and by end-users who wish to compare different systems before making a purchase. However, evaluating the quality of any translated text is complex. Unlike the evaluation of part-of-speech taggers, parsers or speech recognisers (Atwell et al. 2000) it is not simply a matter of comparing MT output to some “gold standard” human translation, since translation is legitimately subject to stylistic and other variation. Instead, MT evaluation relies on either the objective scoring of very specific linguistic phenomena using test suites, or the somewhat subjective quality judgements made by evaluators, who are trained to score individual sentences or text segments using a chosen metric. The problem of subjectivity can, however, be reduced by obtaining scores from several evaluators for each sentence and by calculating a mean score. The reliability of results can also be increased by using a large number of texts. Designing and conducting reliable human MT evaluations has proven to be costly and time-consuming. As a result, more recent research has involved the investigation and application of automated methods, including IBM's BLEU (BiLingual Evaluation Understudy) method (Papineni et al. 2001) and work by Rajman and Hartley (2001, 2002). Successful automated evaluation methods will allow both developers, who need to conduct frequent evaluations after system modifications, and end-users to evaluate systems more quickly and cheaply. 3. Corpora or test suites? The evaluation of MT output involves the use of either a collection of texts, which in few cases seem large enough to be classified as corpora, or test suites. A corpus designed for this purpose has typically comprised texts in the chosen source language(s), machine translations produced by the systems for evaluation and one or more expert human translations of each text. Bilingual evaluators might then rate the fidelity (preservation of original content) of each machine-translated sentence or marked segment by comparing it to the source text and assigning a score using a particular scale. Alternatively, monolingual native speakers of the target language would perform the same kind of evaluation using the expert human translations for comparison. Scoring the fluency of each sentence, on the other hand, requires access only to the machine translations from the corpus, as no reference to the source text is needed when evaluating this attribute in isolation. Whereas corpora are widely used for “black box” MT evaluations by end-users, test suites are more often devised and used by researchers and developers, who need to pinpoint the handling of specific linguistic phenomena to guide system modifications (a “glass box” approach). Test suites for MT evaluation typically comprise many short annotated test items in the source language, with correct target translations, which are referenced according to specific linguistic categories. They allow for the systematic 191 and objective evaluation of carefully selected linguistic phenomena, complete control over every test point (which may be tested in isolation or in combination with other features) and the opportunity to include negative data to determine how a system deals with input errors. 
However, as test suites are normally designed to evaluate the handling of grammatical phenomena, the vocabulary is intentionally limited, making them less suitable than corpora for the evaluation of MT system glossaries. Furthermore, test suites for natural language processing applications “normally list items on a par without rating them in terms of frequency or even relevance with respect to an application” (Oepen et al. 1997: 25). Corpora, on the other hand, represent naturally occurring data and can be designed to include texts that reflect user needs. This factor is particularly important for end-users who wish to select an MT system to translate specific text types. It is clear, therefore, that test suites and corpora are not competing evaluation methods, but complementary ones, insofar as they serve different purposes. Our own research interests lie in the evaluation of MT systems for end-users. We require, therefore, a corpus that represents current user needs.
4. A need for multilingual corpora for MT evaluation
Previous research in MT evaluation has involved the use of either sentences or fairly small numbers of texts. Papineni et al. (2001), for instance, rely on a very small corpus that includes human reference translations. Other research (see Table 1) has made use of the much larger DARPA (Defense Advanced Research Projects Agency) corpus, along with results from the largest DARPA human MT evaluation, carried out in 1994. Researchers have used the DARPA corpus and evaluation results to validate (or not, as the case may be) experimental automated evaluation methods, by seeking correlations between the human DARPA scores and those from new methods. Table 1 details texts or corpora used in a sample of published MT evaluation projects, listed chronologically.
Table 1: The use of corpora and test sentences in previous MT evaluation projects
(Columns: author(s) and/or project name | evaluation type | attributes tested | no. of source items used for evaluation = N | no. of human translations of N | no. of machine translations of N)
Carroll (Pierce 1966) | Human | Intelligibility, Fidelity | 144 sentences, scientific Russian | 3 English | 3 English
Nagao et al. (1985) | Human | Intelligibility, Accuracy | 1,682 sentences, scientific Japanese | 0 | 1 English
Shiwen (1993) | Human and automated | 6 test points: words, idioms, morphology, elementary, moderate, advanced grammar | 3,200 random sentences, English | 1 Chinese | 1 Chinese
DARPA 1994 series (White 1997, 2001, forthcoming) | Human | Adequacy, Fluency, Informativeness | 100 texts French, 100 texts Spanish, 100 texts Japanese (news articles of approx. 300-400 words or 800 Japanese characters) | 2 English | 5 English (human and machine translation scores available for research)
JEIDA (Isahara 1995) | Human | Linguistic test sets | 770 sentences, English | 1 Japanese | 8 Japanese
IBM BLEU (Papineni et al. 2001) | Human and automated | Number of n-gram matches between MT output and human translations (with penalties) | Approx. 500 sentences, Chinese (all from news articles) | Up to 4 English | 3 English
White and Forner (2001) | Test: potential automated method | Noun-compound handling | 33 texts French, 33 texts Spanish (DARPA corpus) | 0 | 5 English (DARPA corpus with scores)
Reeder et al. (2001) | Test: potential automated method | Named-entity handling | 0 | 1 English of 1 Spanish text (DARPA corpus) | 5 English of 1 Spanish text (DARPA corpus with scores)
Miller and Vanni (2001), Vanni and Miller (2001, 2002) | Test: potential automated methods | Coherence, clarity, syntax, morphology, dictionary update, names, terminology | 0 | 1 English of 2 Spanish texts, 1 English of 1 Japanese text (DARPA corpus) | 3 English of 2 Spanish texts, 3 English of 1 Japanese text (DARPA corpus)
Rajman and Hartley (2001, 2002) | Human and automated | Grammaticality, preservation of semantic content | 20 French (DARPA corpus) | 1 English | 5 English of 100 French texts (DARPA corpus with scores), 1 English of 20 French texts (DARPA) by an additional MT system
The largest known corpus for MT evaluation, the DARPA corpus, makes available the associated evaluation scores, which has proved invaluable to the MT community. However, this corpus does have its limitations: it comprises only newspaper articles, which represent only a small part of MT use; the source texts are in only three languages; and all target texts are in American English. It is also clear from the above information that most projects and, therefore, corpora for MT evaluation are concerned with English as a target language. In response to these findings, it is our intention to compile a multilingual corpus specifically for MT evaluation. This will not only be used for our own work, but will also be made available for research within the MT community. Before text collection begins, however, decisions must be made regarding corpus content, size, language pairs and text types for inclusion.
5. Corpus content
We intend to provide a balanced corpus in terms of the number of words and text types for each language pair. Texts and language pairs will be selected to reflect the actual use of MT systems, and our decisions will be guided by a survey of MT users. The corpus will comprise source texts with at least one human translation and a number of machine translations of each one, along with our own detailed human evaluation scores. The corpus will be made available online, allowing users to browse the contents of each language pair, displayed in the form of a list of text types and topic areas. Users will be able to view each source text along with its human and machine translations, and analyse our human evaluation scores, which will be updated regularly as soon as they become available. The source texts will be of use to anyone wishing to evaluate their own system(s), and the human reference translations will provide material for comparison when scoring the MT output. Furthermore, our evaluation results, in addition to those from the DARPA series, will allow for the testing of experimental automated metrics.
6. Corpus size
Constraints in terms of research time and cost mean that informed decisions must be made with respect to corpus size. Using a very large corpus would be unsuitable for human MT evaluation projects for practical reasons: the greater the number of texts, the more time-consuming and expensive the evaluation. Furthermore, the provision of expert human translations of thousands of texts is costly and unnecessary if valid evaluation results can be obtained from a smaller corpus. On the other hand, proven automated evaluation methods might benefit from a larger corpus, which would allow for the generation of more scores at no greater cost than if a smaller number of texts were used.
This begs the question: at what point does a larger number of texts cease to give us more reliable evaluation results? How many texts do we need to obtain valid scores for system comparison? Our first attempt to answer this question has involved analysing DARPA scores with varying numbers of evaluated texts. We used the three scores (adequacy, fluency and informativeness) for the five machine translations and one human translation of each of the 100 French source texts (of approximately 300-400 words) to calculate a mean score for each number of texts evaluated. Figures 2, 3 and 4 show the mean scores for each of the three attributes for every number of texts evaluated (i.e. from one text to one hundred texts). Figure 5 shows the overall mean scores.
[Figure 2: Comparison of adequacy scores, DARPA 1994 (French-English): mean adequacy score plotted against number of texts evaluated (1-100) for Candide, Globalink, Metal, Systran, XLT and the human translation.]
[Figure 3: Comparison of fluency scores, DARPA 1994 (French-English): mean fluency score plotted against number of texts evaluated, same systems.]
[Figure 4: Comparison of informativeness scores, DARPA 1994 (French-English): mean informativeness score plotted against number of texts evaluated, same systems.]
[Figure 5: Comparison of overall scores, DARPA 1994 (French-English): overall mean score plotted against number of texts evaluated, same systems.]
Results show that scores from a very small number of texts (perhaps a sample of ten, amounting to around 3,500 words) can allow us to determine the highest and lowest ranking systems, in terms of individual attributes and overall scores. However, the highest scoring “system” here was the human, whom we would normally expect to perform better than the MT systems. It must also be noted that some MT evaluation projects do not involve the evaluation of human translations, but focus on the comparison of MT systems alone. Even then, we are able to determine that Systran performs better than the other MT systems by using scores from as few as ten texts. The only anomaly here is the informativeness score, where Systran and Globalink compete. A clearer picture of how all five MT systems compare can be obtained after the evaluation of approximately 40 texts (around 14,000 words) for each attribute, and further sampling serves only to confirm this. After around 30 samples, we see that scores begin to remain consistent within a relatively small variance fluctuation, although we do find instances of pairs of systems constantly switching position as more texts are evaluated (Systran/Globalink for informativeness, Globalink/Metal for fluency and adequacy, and Metal/Candide for the overall score). In these cases, any number of samples may never see the situation resolved, and the systems that continue to compete according to the number of texts evaluated can be considered “equal” in terms of particular attributes. It would then be up to the potential user to decide which attribute was more important for their translation needs. For example, a high adequacy score and low fluency score would be more acceptable to someone wishing to use an MT system for gisting or information extraction. Having obtained these results, our second step was to conduct the same statistical analysis using the Spanish-English and Japanese-English DARPA scores.
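This kind of stability analysis is easy to reproduce for any set of per-text scores. The sketch below is purely illustrative and is not the authors' code: the score arrays are random stand-ins for real per-text DARPA scores, the system names are invented, and it simply prints the ranking produced by the cumulative means after 10, 40 and 100 texts.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in data: 100 per-text adequacy scores for each of three hypothetical systems.
scores = {name: rng.uniform(0.4, 0.9, size=100) for name in ("sysA", "sysB", "human")}

def cumulative_means(values):
    """Mean score after evaluating the first 1, 2, ..., n texts."""
    values = np.asarray(values, dtype=float)
    return np.cumsum(values) / np.arange(1, len(values) + 1)

curves = {name: cumulative_means(v) for name, v in scores.items()}

# Report the ranking after 10, 40 and 100 texts to see where it stabilises.
for n in (10, 40, 100):
    ranking = sorted(curves, key=lambda s: curves[s][n - 1], reverse=True)
    print(n, "texts:", ", ".join(f"{s}={curves[s][n - 1]:.3f}" for s in ranking))
```

With real evaluation data, plotting the cumulative-mean curves would reproduce the kind of convergence visible in Figures 2 to 5.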
Results for both language pairs confirmed that reliable scores can be obtained from the evaluation of around 40 texts. We now intend to use texts from our new corpus to conduct human evaluations and to carry out the same analysis. Our initial sample will comprise 35-40,000 words, equal in size to one language pair in the DARPA corpus. This will allow us to compare the number of words required for valid results when evaluating both newspaper articles and different text types, which better represent MT user needs. Our findings will then guide us in terms of the initial number of words required per language pair.
7. Language pairs
In January 2003 we carried out a survey of MT users in order to obtain guidelines for corpus content. In the survey, sent as an email to a number of MT and translation-related mailing lists, we asked which language pairs and text types users regularly translate with the aid of fully automatic MT systems. The 25 replies received to date (16 from large translation providers or international corporations/organisations, 9 from single users) have provided valuable information on both issues. Of the 25 responses, 21 were used for this research, as 4 reported only on their use of translation memory tools. The survey is ongoing and will shortly be available online. Texts in a number of different language pairs will be needed for our own research, when we investigate new approaches to automated MT evaluation. Furthermore, the availability of texts and translations in several languages will make the corpus more useful for other research projects. It is important to evaluate texts translated from and into more than one language, including languages that are typologically different from one another, to explore the portability of new evaluation methods. Additionally, translation providers often use MT to translate more than one language pair and may need to test systems for several languages. Figure 6 shows the language pairs (in which the source or target language is English) translated by respondents using MT systems. A very small number of respondents also use systems to translate language pairs that do not involve English.
[Figure 6: Language pairs translated by MT users: number of respondents per pair, covering English-Spanish, English-French, English-German, English-Portuguese, English-Italian, English-Japanese, English-Danish, English-Dutch, English-Greek, English-Chinese, English-Finnish, English-Norwegian, English-Russian, English-Swedish, English-Vietnamese, German-English, French-English, Spanish-English, Italian-English, Japanese-English, Chinese-English, Portuguese-English, Finnish-English and Vietnamese-English.]
The number of language pairs that MT systems are now able to handle is constantly increasing. The IAMT (International Association for Machine Translation) Compendium of Translation Software (Hutchins and Hartmann 2002) lists an enormous number of MT systems translating many more languages than those shown above. As a starting point, therefore, we plan to collect source texts (with human and machine translations in English) in French, German, Spanish and Italian, along with texts in some typologically different languages, such as Chinese and Japanese. These will allow us to carry out our initial evaluations of systems translating into English.
However, in a second phase we will add translations out of English, which will allow us to test how well existing MT evaluation methods transfer to other language pairs and to develop new machine learnt metrics, which generalise across languages. The target languages for inclusion will be the subject of further research. 8. Text types Since expectations of MT systems have become more realistic, a greater number of uses have been found for imperfect raw MT output. Consequently, a variety of text types, genres and subject matter are now machine-translated for different text-handling tasks, including filtering, gisting, categorising, information gathering and post-editing (White 2000). It is crucial, therefore, to represent this variety of texts, ranging from emails to technical reports, in our corpus, allowing for the evaluation of texts that represent real MT use. The main purpose of our survey was to gather information on the kinds of texts and topics most frequently translated using MT systems. Information obtained from this part of the survey is providing useful guidelines on the types of texts to include in our corpus, but there are several problems involving the analysis of data. Firstly, results are based on respondents’ own interpretations of the “text types” suggested in the survey and these inevitably overlap in terms of content and grammatical structures. For example, technical material can be found in several separate categories in our questionnaire: internal company documents, technical documents, user manuals, instruction booklets, academic papers and web pages. This must be taken into account when we select our texts. Secondly, some respondents did not specify the subject matter of the material they machine translate, and many were unable to provide details on the number of texts. Finally, it is difficult to equate the comparatively small number of words translated by single users with the millions of words translated by international companies every year. In response to this last problem, we present two sets of results at this stage. Figure 7 shows the number of companies and Figure 8, the number of single users who use MT to translate particular text types. Responses to date show that single users and companies use MT systems to translate different types of documents. Five of the international companies/organisations who responded did give information about the number of texts they translate. Of these five respondents, all use MT systems to obtain a first draft of either user manuals, instruction booklets, technical documents or internal company documents, or a combination of these. Their total monthly word count is estimated at 3.5 million words. It is crucial, therefore, to represent these documents in our corpus. However, the single user market is likely to grow, as systems become cheaper, so it is important to reflect the needs of such users also. Findings so far tell us that we must represent all of the above text types to reflect MT use. Documents in our corpus will be categorised, enabling anyone wishing to compare MT systems to easily select source texts for evaluation according to their own needs. The subject matter of these texts will inevitably overlap, as it does in the real world. We are still receiving replies to our survey and updated results will shortly be available online. 
[Figure 7: Number of companies who use MT to translate particular text types: respondent counts per text type (web pages, academic papers, newspaper articles, technical docs, emails, tourist/travel info, scientific docs, medical docs, legal docs, internal company docs, business letters, patents, calls for tender, user manuals, software strings, memos, instruction booklets, financial docs).]
[Figure 8: Number of single users who use MT to translate particular text types: respondent counts over the same text-type categories.]
9. Conclusion
Our findings to date have provided valuable guidelines for the size and content of our corpus. Analysis of the existing DARPA scores indicates that a small sample of texts (amounting to around 14,000 words) is sufficient to rank a range of MT systems in terms of individual attributes and overall scores. However, our user survey indicates that we need to cover a much wider range of genres, beyond newspaper articles, so there is still a need for a larger corpus. We intend to compile a dynamic corpus, which will be updated to reflect changing trends in the MT user market. New source texts and translations will be added to reflect language change and the introduction of new terminology, and additional MT systems will be added to our evaluations over time. The key feature of our corpus, however, will be the detailed scores from our human evaluations, which will be made available to aid research in automated MT evaluation.
Acknowledgements
We wish to thank everyone who has responded to our MT user survey.
References
Atwell E, Demetriou G, Hughes J, Schiffrin A, Souter C, Wilcock S 2000 A comparative evaluation of modern English corpus grammatical annotation schemes. ICAME Journal 24: 7-23.
Elliott D 2002 Machine Translation Evaluation: Past, Present and Future. Unpublished MA dissertation, University of Leeds.
Hutchins J, Hartmann W 2002 IAMT Compendium of Translation Software 1.5. http://www.eamt.org/compendium.html
Isahara H 1995 JEIDA's Test-Sets for Quality Evaluation of MT Systems – Technical Evaluation from the Developer's Point of View. In Proceedings of Machine Translation Summit V, Luxembourg.
Miller K, Vanni M 2001 Scaling the ISLE Taxonomy: Development of Metrics for the Multi-Dimensional Characterisation of Machine Translation Quality. In Proceedings of Machine Translation Summit VIII, Santiago de Compostela, Spain.
Nagao M, Tsujii J, Nakamura J 1985 The Japanese government project for machine translation. Computational Linguistics 11: 91-109.
Oepen S, Netter K, Klein J 1997 TSNLP – Test Suites for Natural Language Processing. In Linguistic Databases. CSLI Lecture Notes, CSLI, Stanford.
Papineni K, Roukos S, Ward T, Zhu W 2001 BLEU: a Method for Automatic Evaluation of Machine Translation. IBM Research Report RC22176. Yorktown Heights, NY: IBM.
Pierce J (Chair) 1966 Language and Machines: Computers in Translation and Linguistics. Report by the Automatic Language Processing Advisory Committee (ALPAC). Publication 1416. National Academy of Sciences, National Research Council.
Rajman M, Hartley A 2001 Automatically predicting MT systems rankings compatible with Fluency, Adequacy and Informativeness scores. In Proceedings of the 4th ISLE Workshop on MT Evaluation, Machine Translation Summit VIII, Santiago de Compostela, Spain.
Rajman M, Hartley A 2002
Automatic ranking of MT systems. In Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain, pp 1247-1253. Reeder F, Miller K, Doyon J, White J 2001 The Naming of Things and the Confusion of Tongues. In Proceedings of the 4th ISLE Workshop on MT Evaluation, Machine Translation Summit VIII, Santiago de Compostela, Spain. Shiwen Y 1993 Automatic evaluation of output quality for machine translation systems. Machine Translation 8: 117-126. Vanni M, Miller K 2001 Scaling the ISLE Framework: Validating Tests of Machine Translation Quality for Multi-Dimensional Measurement. In Proceedings of the 4th ISLE Workshop on MT Evaluation, Machine Translation Summit VIII, Santiago de Compostela, Spain. Vanni M, Miller K 2002 Scaling the ISLE Framework: Use of Existing Corpus Resources for Validation of MT Evaluation Metrics across Languages. In Proceedings of the Third International Conference on Language Resources and Evaluation, Las Palmas de Gran Canaria, Spain. White J 1997 MT Evaluation: Old, New and Recycled Methods. Tutorial slides, Machine Translation Summit VI, San Diego. White J 2000 Toward an Automated, Task-Based MT Evaluation Strategy. In Proceedings of the Workshop on the Evaluation of Machine Translation, Third International Conference on Language Resources and Evaluation, Athens, Greece. White J, Forner M 2001 Predicting MT fidelity from noun-compound handling. In Proceedings of the 4th ISLE Workshop on MT Evaluation, Machine Translation Summit VIII, Santiago de Compostela, Spain. White J (Forthcoming) How to evaluate Machine Translation. In Somers H (ed), Machine translation: a handbook for translators. Amsterdam, Benjamins. O 200 Detecting gender-preferential patterns of linguistic features in face-to-face communication Jens Fauth and Hans-Jörg Schmid, University of Bayreuth, Germany Gender-preferential language is a research area with a long tradition. Most of the previous investigations, however, included linguistic features that are not suited for large-scale automatic frequency counts in computerized corpora, because they are semantic or pragmatic in nature rather than grammatical, and/or difficult to operationalise (e.g. references to emotions). The present paper sets out to discover previously unknown gender-related linguistic features that are retrievable automatically from corpora. Although the present study chooses a grammatical approach similar to Biber's (1988), it differs in that it does not only look for features that have already been suspected to bear a significant relation to gender but also for features that were not determined a priori in this respect. All in all 45 linguistic features from various grammatical domains were taken into consideration. Their occurrences were retrieved from the subcorpus direct conversation of ICE-GB and counted. Only conversations between same-sex participants were investigated because this constellation is known to enhance gender differences (for a complementary study on mixed-sex conversations see the abstract for this conference by Schmid and Fauth). In a second step the Pearson Correlation Coefficient was used to determine significant correlations between gender and linguistic features. The results were then used to predict the gender of the speakers of a conversation by computing a score for each text based on the correlating features. Basic descriptive statistics such as mean, standard deviation and z-scores were used for this purpose. 
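The scoring step described above lends itself to a compact illustration. The sketch below is not the authors' implementation: it assumes a small matrix of per-conversation feature frequencies and binary gender labels (all numbers are toy values), uses a fixed correlation threshold in place of the paper's significance testing, and builds a simple z-score-based conversation score from the correlating features.

```python
import numpy as np
from scipy.stats import pearsonr

# Toy data: rows = same-sex conversations, columns = normalised feature frequencies.
feature_names = ["3rd person pronouns", "definite article"]
X = np.array([[6.1, 3.0], [5.8, 3.2], [3.9, 5.1], [4.2, 4.8]])
gender = np.array([1, 1, 0, 0])          # 1 = female, 0 = male (illustrative coding)

# Keep features strongly correlated with gender;
# the fixed threshold stands in for a proper significance test.
selected = []
for j, name in enumerate(feature_names):
    r, _ = pearsonr(X[:, j], gender)
    if abs(r) >= 0.8:
        selected.append((j, name, r))

# Score each conversation as a sum of z-scores, signed by the direction of correlation.
mu, sd = X.mean(axis=0), X.std(axis=0)
def conversation_score(row):
    return sum(np.sign(r) * (row[j] - mu[j]) / sd[j] for j, _, r in selected)

predictions = ["female" if conversation_score(row) > 0 else "male" for row in X]
print([name for _, name, _ in selected], predictions)
```

On real data, the prediction accuracy would be assessed by comparing such scores against the known gender of each conversation, as the authors report for the ICE-GB subcorpus.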
The results show that four linguistic features are correlated with female gender: third person personal pronouns, indefinite pronouns, predicative adjectives and intensive adverbs. Male gender is correlated with three different features: the definite article, nominalizations and NP postmodifications realized by of-PPs. These features are interpreted with regard to their communicative function and the extent to which their use reflects gender-specific topic choices. The methodology allows us to predict the gender of speakers in same-sex conversations with a probability of 88.10% for female texts and 85.80% for male texts. Jens Fauth Professor Hans-Jörg Schmid Chair of English Linguistics Chair of English Linguistics D-95440 Bayreuth D-95440 Bayreuth E-mail: jot.fauth@gmx.net Hans-Joerg.Schmid@uni-bayreuth.de Fax: +49 921 553627 Fax: +49 921 553627 211 Analysis of the rhetorical structure of computer science abstracts in Portuguese Valéria D. Feltrim Sandra M. Aluísio Maria das Graças V. Nunes NILC – Computational Linguistics Group ICMC – University of Sao Paulo C.P.668 Sao Carlos, SP, Brazil {vfeltrim|sandra|gracan}@icmc.usp.br 1. Introduction It is widely acknowledged that academic writing is a complex task even for native speakers, since it involves the complexities of the writing process as well as those specific to the academic genre1 (Sharples and Pemberton 1992). To make it worse, especially to novice writers, often the demands of the academic genre are not clear enough. A number of writing tools have been described in the literature whose ultimate goal is to improve the quality of academic English texts produced by novice and/or non-native writers, e.g. WE (Smith and Lansman 1988), Writer's Assistant (Sharples et al. 1994), Composer (Pemberton et al. 1996), Academic Writer (Broady and Shurville 2000), Abstract Helper (Narita 2000), and Amadeus (Aluísio et al. 2001). These tools provide help during different stages of the writing process, from the generation and organization of ideas to post-processing tasks. For Portuguese, on the other hand, one can only find post-processing tools, such as spell-checkers, grammar checkers and online dictionaries and thesauri. With the ultimate goal of aiding academic writing, the project SciPo, currently being developed at NILC2, aims at analyzing the rhetoric structure of Portuguese academic texts . in terms of schematic structure, rhetorical strategies and lexical patterns . to derive models for supporting the creation and evaluation of computational writing tools. To make the project feasible, the analysis has focused on specific sections of theses in Computer Science . abstract, introduction and conclusion, which are the most studied in the literature (Swales 1990, Weissberg 1990, Liddy 1991, Santos 1996) and also have been pointed out as the most difficult ones in a questionnaire applied to graduated students. The choice for this kind of text was made mainly for three reasons: theses having to be written in Portuguese, unlike research articles, which are preferably written in English; the high standardization of Computer Science texts, as in other scientific research areas; and SciPo's developers’ familiarity with the Computer Science domain. A corpus of fifty-two (52) Computer Science theses was compiled and has been used both in the analysis of writing patterns specific to the focused research community and in the identification of their main writing problems. 
The whole corpus analysis is being carried out in a set of stages, the first of which has been accomplished, namely the analysis of the abstracts. This paper presents the results of this first stage, discusses the features of the abstracts at a macro level of textual organization, and comments on the annotation process. The next sections present the methodology used for the annotation and analysis of the abstracts (Section 2) as well as the results obtained (Section 3). Finally, features for a writing tool to assist novice academic writers are put forth based on what has been observed in the corpus (Section 4).
2. Corpus annotation
The annotation of the corpus was carried out manually in each of the 52 abstracts, consisting of three levels of analysis: identification of structural components, identification of the rhetorical strategies used for the realization of each component, and identification of lexical patterns (similar to the “formulaic expressions” used by Teufel et al. 1999) that can serve as lexical clues to the argumentative role of a sentence. We focused on these three levels because they reflect the levels of help that we intend to give in a writing tool, namely: how to organize the text (structural components), how to realize each part of the text (rhetorical strategies), and what common expressions are used in each part of the text according to the selected strategies (lexical patterns). (Footnote 1: We call “academic genre” the one employed in published academic works (papers, theses, technical reports, etc.) of a research community. Footnote 2: NILC is a Brazilian Computational Linguistics research group (www.nilc.icmc.usp.br).)
In order to render the annotation more reliable, we had four different human annotators annotate our whole corpus of abstracts. Given a scheme of 6 structural components, with 3 rhetorical strategies each, the annotators were asked to identify text fragments corresponding to those components and strategies, not leaving unclassified fragments. We actually had to perform the annotation of our corpus several times, since the first experiments showed low agreement between annotators. This may have been caused mainly by four factors: (1) the initial inadequacy of the annotation scheme, (2) the lack of familiarity of the annotators with the annotation scheme, (3) the nature of our corpus and (4) the subjective nature of this kind of annotation. Although the first versions of the annotation scheme had been based on models accepted in the literature, like Swales's model, the corpus presented some peculiarities that did not fit the model. Additionally, the actual meaning of its elements was not thoroughly stated, making room for misinterpretation. Finally, as the annotators were initially not very familiar with the annotation scheme, disagreement between them followed naturally. Another source of difficulty was the nature of our corpus. Although it is composed of academic theses, the texts probably were not submitted to such rigorous revision as published articles usually are. Many abstracts even presented passages that could not be classified within any category. This made the annotation process still more difficult and also required a critical view from the annotators. In addition, working with Portuguese brought the difficulty of dealing with long sentences. The average number of words per sentence in our corpus is 26.8, while in a corpus of English abstracts3 it is 23.
So, it is common to find different components of the annotation scheme in one sentence, sometimes mixed. The mixed components found in the corpus will be discussed in Section 3, and the major writing problems observed will be described in Section 4. (Footnote 3: In order to calculate this average, we took 54 abstracts from the Computation and Language (cmp-lg) corpus, which is composed of scientific papers that appeared in conferences sponsored by the Association for Computational Linguistics (ACL).)
Figure 1. Overview of the designed annotation scheme
1 Setting: S1 Arguing about the topic prominence; S2 Familiarizing terms, objects, or processes; S3 Introducing the research topic from the research area
2 Gap: G1 Citing problems/difficulties; G2 Citing needs/requirements; G3 Citing the absence of previous research
3 Purpose: P1 Indicating the main purpose; P2 Specifying the purpose; P3 Introducing more purposes
4 Methodology: M1 Listing criteria or conditions; M2 Citing/Describing materials and methods; M3 Justifying choices for methods and materials
5 Results: R1 Describing the artifact; R2 Presenting/Indicating results; R3 Commenting on/discussing the results
6 Conclusion: C1 Presenting conclusions; C2 Presenting contributions/value of research; C3 Presenting recommendations
As we could not change the nature of our corpus, we tried to minimize the subjectivity of the task by focusing on better defining the annotation scheme. Since the annotators took part in this process, they naturally became more familiar with the scheme and more likely to master it and agree on its usage. After these adjustments, we managed to reach a stable scheme and higher inter-annotator consistency. As a starting point, the annotation scheme was derived from three models: Swales' CARS (1990) and those by Weissberg (1990) and by Aluísio and Oliveira Jr (1996), the latter modeling introductions of experimental research papers written in English. Although these works deal with introduction sections, the basic structure of their models could also be applied to abstracts. So, during the first annotation experiment, the scheme was modified to accommodate all the argumentative roles found in the corpus. The components and rhetorical strategies which compose our annotation scheme are similar to the ones presented by those authors, especially Aluísio and Oliveira Jr's. The final version of our annotation scheme is presented in Figure 1. Such similarity had been expected and shows that, despite the heterogeneity of the corpus, it exhibits predictable rhetorical patterns of scientific argumentation. The major deviation from the traditional structure pattern was found in sentences reporting results. Many of them focus on the resulting product (mainly computational systems) instead of on the corroboration of the initial hypotheses of the underlying research issue. To accommodate this particularity, we included the new rhetorical strategy Describing the artifact. We believe that this “non-standard” strategy for reporting results stems from the technological nature of Computer Science, in which it is common to emphasize the artifact (i.e. a piece of software or a method) developed during the research.
3. What has been found in the corpus
The texts in our corpus were collected from online thesis repositories and date from 1994 to 2000, comprising 49 MSc theses and three PhD theses, most of them written by students from our Computer Science Department. Only 2 texts were written by students from other Brazilian universities.
Texts in the corpus span several Computer Science sub-areas. We therefore divided them into 7 research topics: database systems, computational intelligence, software engineering, hypermedia, digital systems, distributed systems, and graphical computation and image processing. Figure 2 presents the number of abstracts classified in each research topic as well as the total and average number of words per abstract.
Figure 2. Corpus distribution across Computer Science research topics
Research topic | Abstracts | Number of words (total) | Number of words (average)
Database Systems | 3 | 1,006 | 335.3
Computational Intelligence | 7 | 1,452 | 207.4
Software Engineering | 16 | 3,202 | 200.1
Hypermedia | 12 | 2,097 | 174.7
Digital Systems | 1 | 171 | 171
Distributed Systems | 12 | 1,962 | 163.5
Graphical Computation and Image Processing | 1 | 94 | 94
Total | 52 | 9,984 | 192
Considering the components of the annotation scheme presented in the previous section (Figure 1), 96.2% of the abstracts are classified as strict subsets of that scheme, with some repetitions of components. Only 3.8% of the abstracts contain all six components, in one of the following sequences: [S G P M C R C]4 or [S G P C M R], neither being in the order recommended by the scheme. Considering the number of different components observed in each abstract (Figure 3), 50% of the abstracts present 5-4 components, 44.3% present 3-2 components and 1.9% present only one component (Purpose). As mentioned earlier, repetition of components is very common in the corpus, and some patterns5 have been identified. The most frequent pattern is the repetition of Setting followed by Gap, i.e. (SG)+, present in 25% of the abstracts. Other repetition patterns have been observed, such as Methodology followed by Result, i.e. (MR)+, and Result followed by Conclusion, i.e. (RC)+, but with much lower frequency.
Figure 3. Number of components per abstract
Number of components | Frequency
6 | 3.8%
5-4 | 50%
3-2 | 44.3%
1 | 1.9%
(Footnote 4: In this section, the letters S, G, P, M, R and C stand for the components of the annotation scheme: Setting, Gap, Purpose, Methodology, Result and Conclusion. Footnote 5: We use regular expressions to represent repetition and structure patterns.)
Regarding the ordering of components, we also observed some patterns, especially involving the components at the beginning of the abstracts. The pattern Setting followed by Gap, with or without repetition, followed by Purpose, i.e. ((SG)+|(GS))P, appears in 30.7% of the corpus. Instances6 of this pattern are [S G S G P R] and [G S P R M]. Another frequent pattern is Setting followed by Purpose, followed or not by Methodology or Result, i.e. SP[M|R], which is observed in 21.1% of the corpus. Instances are [S P], [S P R C] and [S P M R]. The pattern Purpose followed by Result, with or without repetition, followed or not by Methodology or Conclusion, i.e. (PR)+[M|C], and the pattern Purpose followed by Methodology, followed or not by another Purpose, Result or Setting, i.e. PM[P|R|S], appear in 19.2% and 13.4% of the abstracts, respectively. Other combinations of components appear in the corpus with very low frequency, so they were not considered here. A summary of the most frequent patterns is presented in Figure 4.
Figure 4. Frequency of the observed patterns
Pattern | Frequency
((SG)+|(GS))P | 30.7%
SP(M|R)* | 21.1%
(PR)+[M|C] | 19.2%
PM[P|R|S] | 13.4%
The ordering patterns identified show that the order adopted in our annotation scheme is reasonable, since 51.8% of the corpus start with the components Setting or Gap followed by Purpose, although the order varies greatly from the middle to the final components.
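Once each abstract has been reduced to a string of component letters, pattern frequencies of this kind are easy to compute. The following is a minimal illustrative sketch, not the SciPo implementation: the component sequences are toy data, and the anchored regular expressions are simplified approximations of the patterns listed in Figure 4.

```python
import re
from collections import Counter

# Each abstract reduced to its annotated component sequence
# (S=Setting, G=Gap, P=Purpose, M=Methodology, R=Result, C=Conclusion).
abstracts = ["SGSGPR", "GSPRM", "SP", "SPRC", "SPMR", "PRM", "PMS"]   # toy data

# Anchored prefix patterns approximating those reported for the corpus.
patterns = {
    "((SG)+|(GS))P": r"((SG)+|(GS))P",
    "SP(M|R)*":      r"SP(M|R)*",
    "(PR)+[M|C]":    r"(PR)+(M|C)?",
    "PM[P|R|S]":     r"PM(P|R|S)?",
}

counts = Counter()
for seq in abstracts:
    for label, pat in patterns.items():
        if re.match(pat, seq):        # pattern must hold at the start of the abstract
            counts[label] += 1
            break                     # count each abstract once, as in the figures

for label, n in counts.items():
    print(f"{label}: {n / len(abstracts):.1%}")
```

Checking patterns in a fixed order, most specific first, mirrors the way the paper assigns each abstract to a single dominant structure.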
In fact, when analysing these middle and final components, especially Methodology and Result, we observed a lack of clarity: they usually occur mixed with each other or, even more often, with Purpose. The frequency of these cases is shown in Figure 5. The most frequent case of mixing occurs between Result and Purpose, which appears in 38.5% of the corpus. This means that, in 38.5% of the abstracts, the passage annotated as Indicating the main purpose also includes traces of Result. Also, in 9.6% of the corpus, Purpose presents traces of both Result and Methodology. The high frequency of Result mixed with Purpose may be viewed as further evidence of the importance given to the resulting product of the research, the artifact, as previously noted in Section 2. We believe that such importance is the reason why writers focus on the produced artifact inside Purpose, since it is a central component with high relevance for the abstract. Still considering the observed patterns, we noticed that many texts present not only similar structure but even the same lexical patterns, mainly when written by students who had the same advisor. Such similarity was also observed in texts written by students from the same research group. This suggests that students tend to use texts produced inside their work group as models when writing their own texts, and it also shows the great influence of the advisor on the student's text.
Figure 5. Frequency of attached components
Mixed components | Frequency
Result mixed with Purpose | 38.5%
Methodology mixed with Purpose | 19.2%
Result and Methodology with Purpose | 9.6%
Methodology mixed with Result | 7.7%
The frequency of each component of the annotation scheme in the corpus is presented in Figure 6. The numbers show lower frequencies for Setting, Gap and Conclusion, in contrast with the components Purpose, Methodology and Result. This is in agreement with the results reported by Santos (1996) and Motta-Roth and Hendges (1998). These authors have also observed the optional character of initial and final components in abstracts and a tendency towards more frequent use of middle components, especially Purpose, which appears in all the analysed abstracts. This strongly suggests that writers see Purpose as the most important kind of information that an abstract should provide to the reader, followed by Result and Methodology. However, the frequency of Setting in our corpus also suggests a concern with contextualizing the research, something not observed by Motta-Roth and Hendges. We believe that this difference is due to the nature of the two corpora. As our corpus is composed of thesis abstracts, the writers were not faced with the word-limit constraint that generally holds for journal abstracts, and so they were free to provide extensive contextualization. (Footnote 6: All the structure examples presented are taken from the corpus.)
Figure 6. Distribution of components
Component | Frequency
Setting | 55.7%
Gap | 42.3%
Purpose | 100%
Methodology | 63.4%
Result | 67.3%
Conclusion | 30.7%
Another aspect related to the nature of our corpus is the high frequency of the rhetorical strategy Presenting contributions/value of research. As can be seen in Figure 6, the component Conclusion occurs in 30.7% of the corpus. Of these occurrences, 68.8% use this strategy (or 21.2% of the abstracts in the corpus). Considering that it is very important in MSc/PhD theses to emphasize the contributions to a research area, it seems natural for writers to use that conclusion strategy in the abstracts with some frequency.
4.
Observed writing problems In this paper, when we refer to writing problems we mean deviations from the traditional structure model that can compromise the communicative objective of the abstract. As was previously commented in Section 2, the texts in our corpus probably have not been through such rigorous revision, even though the texts are MSc and PhD theses. So, it is not surprising that the abstracts presented some problems, not only general, like grammatical mistakes, but also specific to the academic genre. Moreover, as the major part of the corpus is MSc theses, we can also attribute the observed problems to the inexperience of the writers. We will not comment on the superficial problems found in the corpus, since these are not our object of study in this paper. Thus, we will only focus on problems that are specific to the academic genre. The major problems observed in the abstracts were misuse of lexical patterns and verbal tenses, inefficient organization and inappropriate emphasis on some specific components. We regard as misuse of lexical patterns those cases in which the writer uses expressions in a component which are specific to other components. An example is the use of however, which is a lexical clue to component Gap, in a Setting sentence and not with its correct contrasting value. Other examples are expressions of the type “A and B are also presented.”, which can be seen as an indication of the rhetorical strategy Introducing more purposes, used in sentences reporting results in an indicative way. Furthermore, the writers sometimes do not use the appropriate verbal tense, especially in sentences reporting results. It is recommended by the literature (e.g. Weissberg 1990) to use the past tense when reporting findings. This gives the reader the notion that the results were obtained by the reported research and not in a previous one. However, it is common to find sentences in our corpus reporting findings in the present tense. Due to this, sometimes we could not determine quite whether the writer was reporting his own results or commenting on someone else's results (as in literature review). Misuse of lexical patterns and verbal tenses surely demanded great effort from the annotators when it came to interpreting the texts to identify their components. Regarding inefficient organization, we observed some abstracts in which the writer mixed components in such a way as to confuse readers. An example of this problem can be observed in the structure [P M S G P]. The writer initiated by indicating the main purpose (first P) and then described the methodology used to accomplish that purpose (M). After that, a more natural move would be to present results; however, the writer used a Setting component, followed by a Gap (SG), in order to lead the reader into the further detailing of the previously stated purpose and the introduction of yet other purposes. The presence of Setting and Gap in the middle of the abstract, separating the main purpose from its detailing, confuses the reader, who may lose track of the main purpose of the related research. Also, the sequence Methodology-Setting disrupts the cohesion of the text, causing the reader to feel that “something is missing”. A different case of inefficient organization is inappropriate emphasis on some components. In our corpus such emphasis was observed mainly in contextualization (Setting + Gap). One such example is an abstract that has the structure [S G S G P] and 160 words. 
A Purpose component, which also has traces of Result and Methodology, corresponds to 16.9% of the abstract. The rest of it (83.1%) is dedicated to the contextualization of the problem, (SG)+. Taking into account that these writers had no limit of words to write their abstracts, we can consider such an abstract little informative. It would be 216 desirable to find more information about the reported research in addition to such a relatively extensive contextualization. Though less frequent, three further problems were found in the corpus, namely: (1) passages that could not be classified as a component of the annotation scheme. This also took the annotators some time, as previously commented in Section 2; (2) passages indicating obvious information, like “this work also presents a literature review about the focused subject”. It is quite sure that an MSc/PhD thesis will have a literature review on the main focused subjects, fairly obviating that piece of information; (3) exactly one abstract presenting the outline of the theses. This kind of information is common in introductions, but not in abstracts. 4. Conclusions In this paper we have reported on the annotation and analysis of a corpus of thesis abstracts in Computer Science, based on an annotation scheme designed specially for this project. We have argued that this scheme is stable and reasonable on the basis of the results of the corpus analysis. Furthermore, we have discussed structuring patterns and some particularities of the corpus, such as repetition of components and levels of relevance given to each component, and identified writing problems, which are deviations from what is prescribed as characteristic of a good academic abstract. As we mentioned in the introduction, the presented corpus analysis is part of a project which aims at deriving models for computational writing tools specific to the academic genre in Portuguese. Based on what was observed in the corpus, including its problems, we are developing computational tools for aiding especially novice writers. These tools are being implemented as Web-based applications featuring (1) prescriptive guidelines related to the academic genre and (2) repositories of good and bad examples of structure, writing strategies and lexical patterns. The example repository will have abstracts, introductions and conclusions. Another tool is meant to be a critiquing system capable of giving advice/criticism on structure organization, verbal tenses and level of emphasis given to each component. As a result, we expect to help writers overcome their difficulties related to the academic genre. As future work, we intend to extend our analysis to introductions and conclusions, implement stable prototypes of these tools, and evaluate them with real users, e.g. graduate Computer Science students. 6. References Aluísio S M, Barcelos I, Sampaio J, Oliveira Jr O 2001 How to learn the many unwritten "Rules of the Game" of the Academic Discourse: A hybrid Approach based on Critiques and Cases. In Proceedings of the IEEE International Conference on Advanced Learning Technologies, Madison/Wisconsin, pp 257-260. Aluisio S M, Oliveira Jr. O N 1996 A Detailed Schematic Structure of Research Papers Introductions: An Application in Support-Writing Tools. Revista de la Sociedad Espanyola para el Procesamiento del Lenguaje Natural, 19: 141-147. Also available in Broady E, Shurville S 2000 Developing Academic Writer: Designing a Writing Environment for Novice Academic Writers. In E. 
Broady (ed.), Second Language Writing in a Computer Environment, CILT, London, pp 131-151. Liddy E D 1991 The Discourse-Level Structure of Empirical Abstracts: An Exploratory Study. Information Processing & Management, 27(1): 55-81. Motta-Roth D, Hendges G 1998 Uma Análise Transdisciplinar do Gênero Abstract. Revista Intercâmbio, VII: 125-134. Narita M 2000 Corpus-based English Language Assistant to Japanese Software Engineers. In Proceedings of MT-2000 Machine Translation and Multilingual Applications in the New Millennium, pp 24-1 – 24-8. Pemberton L, Shurville S, Hartley A 1996 Motivating the Design of a Computer Assisted Environment for Writers in a Second Language. In Proceedings of CALICE'96, pp 141-148. Santos M B 1996 The textual organization of research paper abstracts. Text 16(4): 481-99. Sharples M, Pemberton L 1992 Representing writing: external representations and the writing process. In P.O. Holt and N. Williams (eds.) Computers and Writing: State of the Art. Intellect, Oxford, pp 319-336. Sharples M, Goodlet J, Clutterbuck A 1994 A comparison of algorithms for hypertext notes network linearization. International Journal of Human-Computer Studies 40(4): 727-752. Smith J B, Lansman M 1988 A Cognitive Basis for a Computer Writing Environment. Technical Report n.87-032, Chapel Hill. Swales J M 1990 Genre Analysis: English in Academic and Research Settings. Cambridge applied linguistics series. Teufel S, Carletta J, Moens M 1999 An annotation scheme for discourse-level argumentation in research articles. In Proceedings of EACL 1999. Weissberg R, Buker S 1990 Writing up Research: Experimental Research Report Writing for Students of English. Prentice Hall.

Updating LSP dictionaries with collocational information
Katerina T. Frantzi
Department of Mediterranean Studies, University of the Aegean, Dimokratias 1, 85100, Rhodes, Greece
frantzi@rhodes.aegean.gr
Abstract
Despite the large number of general-language dictionaries in electronic form, dictionaries for specialised areas are still "under construction". There are two main reasons for this: firstly, the need for these dictionaries was, and is, less pressing than the need for general-language dictionaries, since they are aimed mainly at specialists; and secondly, many specialised areas change over time, resulting in dictionaries that need continual updating. For this reason, techniques that improve the automatic or semi-automatic construction and updating of specialised dictionaries are, and will always be, welcome. In this work we are concerned with the updating of dictionaries for Languages for Special Purposes (LSPs) with information coming from collocations. The collocations to be used are extracted from LSP corpora, which need not be large.
1 Introduction – collocations
The large number of applications for collocations (dictionary construction, translation, language learning, etc.) makes them an interesting area to work on. The availability of corpora in electronic form has greatly helped this kind of research, since we are now able to work with real data. English is no longer the only language with electronic corpora, though it still accounts for the largest share. Also, although most electronic corpora describe the general language, corpora of languages for special purposes (LSPs) are becoming more and more available. Firth (Palmer 1968) introduced the notion of collocation when discussing senses.
He suggested that part of the sense of a word depends on its neighbouring words in texts: "You shall know a word by the company it keeps" (Palmer 1968:179). This "company" is what he named collocation, and he considered it very important for understanding words. Linguists have shown interest in collocations for quite some time now (Jones and Sinclair 1974), and various definitions have been given. Some allow collocations to consist of only two words, while others allow many more. Some are concerned with what information collocations can give us about semantics, others about syntax or grammar. Some accept common words, others do not. Some allow collocations to cross a comma, others do not. Regarding interrupted collocations, there are differences as to the size of the gap(s) between the collocates. Despite all the differences, collocations are arbitrary, recurrent and cohesive lexical clusters, and they depend on the language (Smadja 1993). We adopt the definition given by Sinclair and Carter, taking a collocation to be the occurrence of two or more words within a short space of each other in a text (Sinclair and Carter 1991).
As mentioned above, collocations depend on the language and sublanguage in which they are found. They actually play an important role in sublanguages (Frawley 1988; Ananiadou and McNaught 1995). The study of collocations in general language needs large corpora, since phenomena in general language are sparse: in the Brown Corpus there are only two instances of "cups of coffee", five of "for good" and seven of "as always" (Kjellmer 1994). When we deal with LSPs, however, things are easier with respect to corpus size, which can be a lot smaller, since the information there is dense.
Early work on collocation extraction was influential. Choueka et al. were among the first to use frequency of occurrence for recognising collocations (Choueka et al 1983). The work of Nagao and Mori was also based on frequency of occurrence, but they additionally considered the length of the collocations to be extracted, giving priority to longer ones (Nagao and Mori 1994). Church and Hanks were the first to use the association ratio (Church and Hanks 1990), a measure based on mutual information as first formulated by Fano (Fano 1961). They were interested in the semantic relations of the word pairs they recognised, which could be interrupted by other words. The work of Kim and Cho (Kim and Cho 1993) is also based on mutual information, extending it to three words, but in a different way from that originally defined by Fano. Collocation extraction is still an interesting issue for researchers (Kilgarriff and Tugwell 2001; Kim et al. 2001).
Collocations can be divided into those that do not appear as part of other, longer collocations and those that do. The latter we call nested collocations. For example, in Computational Linguistics, "Natural Language" is a collocation itself, but it is also part of the longer collocation "Natural Language Processing". Three important works that mention the problem of nested collocations are those of Smadja, Kita et al., and Ikehara et al. Xtract, based on frequency of occurrence, recognised as collocations only those expressions of the greatest length (Smadja 1993); it did not extract collocations that were part of others. The work of Ikehara et al., which was based on Nagao and Mori's work, only accepted those that were found as not-nested with satisfying frequency (Ikehara et al. 1995). The problem of nested collocations was a major concern for Kita et al.
These accepted a nested collocation when it also appeared as not-nested with satisfying frequency (Kita et al. 1994).
2 Updating the dictionary
We deal with the updating of LSP dictionaries for the Greek language. We use nested collocations to obtain the information in a way that is easier than looking directly into the corpus, which can be very time-consuming. C-value is used for the extraction of collocations from LSP corpora. C-value was initially constructed and used for the extraction of English collocations (Frantzi et al. 2000), and it has also been applied to Japanese (Mima et al. 2001). In this work we use it for Greek collocations and the updating of Greek dictionaries. Recall that C-value pays particular attention to nested collocations. When applied to the "Artillery Firing Military Rule Book" ("Óôñáôéùôéêüò .......µüò ....ß........ .....", the corpus we will be using), one of the collocations it extracts is "éïñèþóåéò .. .... .. ...µµÞ ß....". It also extracts "ãñáµµÞ ß....", which is nested in the previous collocation but also stands as a collocation by itself. We need such a method because we will use nested collocations to obtain the information for updating the dictionary.
When C-value is applied to an expression, it considers the following parameters: 1. The length of the expression (in number of words): the longer the expression, the more important it is. 2. The frequency of occurrence of the expression in the corpus: the higher the frequency, the more important the expression. 3. Whether the expression appears as nested and, if so, the number of different longer collocations that contain it; the number of times it is found within these longer collocations is also considered.
Recall that C-value is evaluated as follows (a is the expression we examine):
1. C-value(a) = 0, if the expression is part of one longer collocation and its frequency of occurrence is the same as that longer collocation's frequency; in this case the examined expression is not a collocation by itself.
2. C-value(a) = (|a| - 1) n(a), if the expression is not part of any longer collocation, where |a| is the size of the expression a in number of words and n(a) is the frequency of occurrence of a in the corpus.
3. C-value(a) = (|a| - 1) (n(a) - t(a)/c(a)), if the expression is part of longer (more than one) collocations, where c(a) is the number of these longer collocations that include a and t(a) is the total frequency of a as part of these longer collocations.
After extracting the collocations we group them and choose a group to start with. Care should be taken when grouping the collocations: if, for example, we only group them alphabetically by their first word, we could miss members of the group and, as a result, possibly useful information. Which group of collocations to start with is up to the application. A group of collocations that would be used to update the dictionary could be the following (a set of Greek expressions from the Rule Book corpus, omitted here).
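To make the computation above concrete, the following is a minimal sketch of the C-value calculation in Python. The input format (a mapping from candidate expressions, represented as tuples of words, to their corpus frequencies) and the function name are assumptions made for illustration; they are not part of the original implementation, and t(a) is approximated by summing the frequencies of the containing expressions.

    # Minimal sketch of the C-value computation described above (an illustration only).
    # candidates maps a candidate expression, as a tuple of words, to its corpus frequency n(a).
    def c_value(candidates):
        scores = {}
        for a, n_a in candidates.items():
            # longer candidates that contain a as a contiguous word sequence
            containers = [b for b in candidates
                          if len(b) > len(a)
                          and any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))]
            if not containers:
                scores[a] = (len(a) - 1) * n_a
            elif len(containers) == 1 and candidates[containers[0]] == n_a:
                # a only ever occurs inside one longer collocation, so it is not a collocation itself
                scores[a] = 0.0
            else:
                c_a = len(containers)                           # number of longer collocations containing a
                t_a = sum(candidates[b] for b in containers)    # approximate total frequency of a inside them
                scores[a] = (len(a) - 1) * (n_a - t_a / c_a)
        return scores

For instance, applied to a toy candidate list in which "natural language" only ever occurred inside "natural language processing", the nested expression would receive a C-value of zero, as in case 1 above.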
The algorithm for updating the dictionary is the following:

    L: existing LSP dictionary; entry_L(.): an entry in L;
    Extract collocations from the LSP corpus using C-value;
    Group collocations, creating collocation_groups;
    for each collocation_group cg in collocation_groups
        for each collocation c in cg
            length(c) = number of words of c
        max_length = max(length(c)) over the collocations c in cg
        new_c = the collocation c in cg with the smallest length(c)
        if entry_L(new_c) = 0
            create entry_L(new_c)
        info_length = length(new_c)
        while info_length < max_length
            for each collocation c in cg with length(c) = info_length + 1
                check c for new information
                update entry_L(new_c)
            end_for
            info_length = info_length + 1
        end_while
    end_for

The choice of C-value as the extraction method is critical, since it deals with nested collocations, the type of collocation we need for obtaining the information. Let us now assume the following imaginary group of collocations from the collocation list:

    a b
    a b c
    a b d
    a b f e
    a b c g
    a b f g h

where a, b, c, d, e, f, g and h are words. We take the collocation of the smallest length, in our example "a b". If the collocation "a b" does not yet exist in the lexicon, a new entry is created. Next we consider the collocations of the next smallest length (in number of words), in our case "a b c" and "a b d". We can start with "a b c", considering the word "c" in terms of the information it can give us about grammar, syntax or semantics (depending on the type of dictionary we want to update). We continue with "a b d" and the grammatical, syntactic or semantic information that the word "d" gives for the collocation "a b". We then move to the collocations (always of the same group) of the next smallest length, that is, "a b f e" and "a b c g". We do the same as before, so we consider "f e" for the information it can give for the collocation "a b". For the collocation "a b c g" we take into account that "a b c" is a nested collocation we have already checked, and add the information given by the word "g", and of course any new information provided by the word combination "c g". We finish with the collocation "a b f g h", from which we take information from the word combination "f g h". When a collocation group is finished we can move on to the next one. The method is semi-automatic, since the machine, the domain expert and the lexicographer need to cooperate. The human factor is necessary for the evaluation of the information coming from the collocation under consideration: it is for the domain expert and the lexicographer to judge which information is useful and which is not.
3 Application
The method is applied to the "Artillery Firing Military Rule Book" ("Óôñáôéùôéêüò .......µüò ....ß........ .....") of about 35,000 words. Since we are working with an LSP corpus, a small corpus is sufficient; with a general-language corpus things would be much harder in terms of size, since phenomena there are sparse. No tagging has been applied to the corpus. The implementation was done under Linux. Table 1 shows a sample of the corpus. First, collocation extraction takes place using C-value. In this application we extract expressions of 2 to 7 words; this is a parameter that can be changed according to the application. The extracted collocations are ordered according to their C-value. A threshold can be applied so that only expressions above a given value are extracted and proceed to the next stage.
A threshold could also have been applied to the frequency of occurrence of the candidate expressions. Table 2 shows a sample of the list of extracted collocations. The first column gives the C-value for the expression shown in the fifth column. The fourth column gives the frequency of occurrence of the expression. The third column gives the number of (longer) expressions that contain the current expression, while the second gives the total frequency of the expression within these longer ones. The expressions in Table 2 have been chosen so that the differences between C-value and frequency of occurrence can be noticed. We can see, for example, that long expressions, despite their low frequency, are valued highly by C-value, e.g. "óå .......... .. .... .. ...µµÞ ß...." and "ôï ..µéóµá ..... ... µå .. ß........". Those expressions are domain-dependent, and for that reason they are (correctly) valued highly. On the contrary, expressions such as "êáé ..." ("and to"), "ôç ....." ("the angle") and "áõôü ....." ("this is") are valued more highly by pure frequency of occurrence.
Table 1 Sample of the corpus (a passage in Greek from the Artillery Firing Military Rule Book).
Table 3 shows how the method behaves with nested expressions. We can see that if, instead of C-value, we used frequency of occurrence, and in order to give a value to a candidate expression we subtracted from its frequency the sum of its frequencies as part of longer expressions, we would underestimate quite a few important expressions. The extracted list is expected to contain "useless" expressions, like "êáé ..." ("and to") or "áõôü ....." ("this is"). However, according to Kjellmer, no extracted expression can easily (if at all) be characterised as "useless" (Kjellmer 1994). His dictionary of English collocations incorporates everything that has been extracted, with no characterisation as "correct" or "wrong".
However, we could use a part-of-speech tagger to allow only expressions of a particular form. This way we would eliminate some expressions we do not want, but we could also lose some that we do. What we do depends on the application.
Table 2 Sample of the list of extracted collocations, giving for each expression its C-value(a), t(a), c(a), f(a) and the extracted (Greek) collocation.
Let us now see an example from our corpus of how we obtain the information for updating the dictionary. We have already extracted the collocation list. Assume that the collocation group we work with is the following: a set of Greek collocations from 2 to 8 words long, grouped by length, which all contain the two-word collocation "óôïé÷åßá ß...." in one of its forms (the full Greek list is omitted here). The collocation we are dealing with is "óôïé÷åßá ß....". Length is taken in terms of number of words. The collocation appears in the corpus in two forms, "óôïé÷åßá ß...." and "óôïé÷åßùí ß...."; we treat these as the same collocation and then consider the two collocations of length 3. The domain expert and the lexicographer need to evaluate the information carried by each of the two words "õðïëïãéóµüò" and "êáôáãñáöÞ". The domain expert has to decide whether the expression "õðïëïãéóµüò ......... ß...." is a collocation or not. If it is, and the information that the word "õðïëïãéóµüò" carries about "óôïé÷åßùí ß...." does not yet exist in the dictionary, then the dictionary has to be updated by the lexicographer with the information given by the domain expert. The same happens for the other collocation of length 3, "êáôáãñáöÞ ......... ß....". When we finish with the collocations of length 3, we move to those of length 4, in our example "óôïé÷åßá ß.... ....ß.... ......" and "µÝèïäïò ........µïý ......... ß....". The first collocation, "óôïé÷åßá ß.... ....ß.... ......",
does not directly relate to either of the two collocations of length 3, and so "ðñïóâïëÞò ......" will be treated by the domain expert and the lexicographer in the same way as the words "õðïëïãéóµüò" and "êáôáãñáöÞ" at the previous stage. The collocation "µÝèïäïò ........µïý ......... ß...." will have to update the information of the previously checked collocation "õðïëïãéóµüò ......... ß....", so the latter is taken into consideration. The method continues in the same simple way until we reach and use the collocation of the greatest length in the group, in our example "÷ñÞóç ... ......... ... ... ........µü ......... ß...." with length 8.
Table 3 Collocations that have also been found as nested, with their C-value(a), t(a), c(a) and f(a) values.
The method is quite simple in the way it works. It is semi-automatic in the sense that it needs the domain expert and the lexicographer. We believe that this is necessary in order to provide a high degree of accuracy and completeness. However, the domain expert and the lexicographer do not have to look at the corpus itself (unless really needed) to obtain the information for updating the LSP dictionary, which is of course a considerable saving of time. We have not yet applied an evaluation measure to judge the results for the correctness and completeness of the information gained; this remains to be done. Another matter is the stemmer. It is not easy to decide whether to use one or not. If we do, words sharing the same stem would count as one (as they should in most cases). However, there are cases where this should not happen, as with the words "ðáñáôçñçôÞò" ("observer") and "ðáñáôçñçôÝò" ("observers"). These two words often need to stay as they are found, since they are used to indicate different meanings in different collocations.
4 Summary
In this paper we presented the incorporation of the C-value method for the extraction of collocations into dictionary updating. C-value lends itself to this since it focuses on nested collocations, the type of collocation we look at in order to obtain new information for the dictionary. The method makes the process faster, since we actually look at the extracted collocation list instead of the whole corpus. It is semi-automatic, since the final decision on which information should update the dictionary is taken by the domain expert and the lexicographer. Regarding future work, we first intend to apply the method to other languages, starting with English, but also Turkish, Arabic and Hebrew. Should things work as expected, we will move on to applying the method to multilingual corpora (including parallel corpora) for the updating of multilingual dictionaries.
References
Ananiadou S, McNaught J 1995 Terms are not alone: term choice and choice terms. Journal of Aslib Proceedings 47(2): 47-60. Choueka Y, Klein T, Neuwitz E 1983 Automatic retrieval of frequent idiomatic and collocational expressions in a large corpus. Journal of Literary and Linguistic Computing 4: 34-38.
Church K W, Hanks P 1990 Word Association Norms, Mutual Information, and Lexicography. Computational Linguistics 16: 16-29. Fano R M 1961 Transmission of Information: a statistical theory of communications. New York, M.I.T. Press. Frantzi K T, Ananiadou S, Mima H 2000 Automatic Recognition of Multi-Word Terms. International Journal on Digital Libraries 3(2): 115-130. Frawley W 1988 Relational models and metascience. In Evens M (ed) Relational models of the lexicon. Cambridge, Cambridge University Press, pp 335-372. Ikehara S, Shirai S, Kawaoka T 1995 Automatic Extraction of Collocations from Very Large Japanese Corpora using N-grams Statistics. Transactions of Information Processing Society of Japan 11: 2584-2596. Jones S, Sinclair J 1974 English Lexical Collocations: A Study in Computational Linguistics. Cahiers de Lexicologie 24(1): 15-61. Kilgarriff A, Tugwell D 2001 WORD SKETCH: Extraction and Display of Significant Collocations for Lexicography. In Proceedings of the Collocation Workshop, ACL 2001, pp 32-38. Kim P K, Cho Y K 1993 Indexing Compound Words from Korean Texts using Mutual Information. In Proceedings of the Natural Language Pacific Rim Symposium, pp 85-92. Kim Y, Zhang T, Kim Y T 2001 Collocation Dictionary Optimization Using WordNet and k-Nearest Neighbor Learning. Journal of Machine Translation 16: 89-108. Kita K, Kato Y, Omoto T, Yano Y 1994 A Comparative Study of Automatic Extraction of Collocations from Corpora: Mutual Information vs. Cost Criteria. Journal of Natural Language Processing 1: 21-33. Kjellmer G 1994 A Dictionary of English Collocations. Oxford, Clarendon Press. Mima H, Ananiadou S, Nenadic G 2001 The ATRACT Workbench: Automatic Term Recognition and Clustering for Terms. In Lecture Notes in Computer Science, LNAI 2166, Springer-Verlag, pp 126-133. Nagao M, Mori S 1994 A New Method of N-gram Statistics for a Large Number of n and Automatic Extraction of Words and Phrases from Large Text Data of Japanese. In Proceedings of the 14th International Conference on Computational Linguistics, pp 611-615. Palmer F R (ed) 1968 Selected Papers of J.R. Firth. Harlow, Longman. Sinclair J, Carter R (eds) 1991 Corpus, Concordance, Collocation. Oxford, Oxford University Press. Smadja F 1993 Retrieving Collocations from Text: Xtract. Computational Linguistics 19: 143-177.

Using the XARA XML-Aware Corpus Query Tool to Investigate the METER Corpus
Robert Gaizauskasa, Lou Burnardb, Paul Cloughc and Scott Piaod
Department of Computer Sciencea, University of Sheffield, Sheffield, S1 4DP, R.Gaizauskas@dcs.shef.ac.uk
Research Technologies Serviceb, University of Oxford, Oxford, OX1 2JD, lou.burnard@oucs.ox.ac.uk
Department of Information Studiesc, University of Sheffield, Sheffield, S1 4DP, p.d.clough@sheffield.ac.uk
Department of Linguistics and Modern English Languaged, Lancaster University, Lancaster, LA1 4YT, s.piao@lancaster.ac.uk
Abstract
The METER (MEasuring TExt Reuse) corpus is a corpus designed to support the study and analysis of journalistic text reuse. It consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers, some of which were derived from the PA version and some of which were written independently. The corpus has been annotated in accordance with the TEI Guidelines. The annotations include both descriptive metadata, such as the title, source, and date of publication, and human judgements about text reuse.
To exploit the value of the annotations, for searching and querying the corpus, requires a tool which is designed to “understand” XML, or even, ideally, TEI. Such a tool is XARA, which has been built specifically to support corpus investigations over XML-annotated corpora. This paper reports lessons learned in using XARA to explore the METER corpus, particularly the importance of designing the annotation scheme with an understanding of the capabilities and limitations of the retrieval application. 1. Introduction The METER (MEasuring TExt Reuse) corpus is a resource designed to support the study and analysis of journalistic text reuse. It consists of a set of news stories written by the Press Association (PA), the major UK news agency, and a set of stories about the same news events, as published in various British newspapers, some of which were derived from the PA version and some of which were written independently. The corpus has been annotated in accordance with the TEI guidelines. The annotations include both descriptive metadata, such as the title, source, and date of publication, and human judgements about text reuse. These judgements were made at two levels. Firstly, each newspaper story, as a whole, is labeled as wholly derived, partially derived, or not derived from a PA source. Secondly, for certain selected stories, every word in the text is labelled as either reused verbatim from the PA, rewritten from the PA, or new. The corpus and the associated annotation scheme have been described elsewhere (Clough et al., 2002; Gaizauskas et al., 2001), and this description will not be repeated, beyond bare essentials, below. The current paper focuses on work we are carrying out to explore the corpus using the newest version of the SARA (SGML Aware Retrieval Application) corpus query tool, called XARA, which was originally designed to support exploration of the British National Corpus. This work has two main purposes: (1) To verify and validate the TEI encoding scheme we have adopted for the METER corpus; i.e. have we encoded the correct things and have we encoded them correctly? (2) To verify and validate the XARA query tool; i.e. does XARA allow us to retrieve the things we want to retrieve and does it retrieve these things correctly? Specifically, we wanted to know whether XARA would allow us to answer queries such as: Show me the titles of Telegraph stories wholly derived from the PA; or, Extract a sub-corpus consisting of murder stories with the catchline ``Axe''; or, Extract all sentences from the Sun stories which were rewritten from the PA; or Which words occur most frequently in the verbatim/rewritten/new portions of texts? 227 More generally this work addresses the utility of adopting TEI encoding standards for corpus linguistic work. It also provides lessons into how to more effectively annotate texts for subsequent investigation and how to extend the capabilities of the XARA querying tool to better support corpus linguists. 2. The METER project 2.1. Aims and objectives of the project The METER project1 investigated the reuse of news agency (also known as newswire) sources by journalists and editors in the production of newspaper articles. While the reuse of others’ text without acknowledgement is, in academic life, a cardinal sin, in the newspaper industry this is not only an accepted behaviour, but is in fact a standard business practice. 
In the newspaper industry, journalists depend upon “pre-fabricated” agency press releases as the source for many of their own stories: “most of what journalists actually write is a re-processing of what already exists” (Bell, 1991). Because newspapers subscribe to newswire services, they are at liberty to reuse source texts in any way they want (even verbatim), with or without attribution to the original author: the news agency. Aims of the METER project included the following: (1) To investigate and provide examples of text reuse by journalists writing for the British Press. (2) To define a framework in which text reuse between the newswire-newspaper could be assessed. (3) To build a corpus, a collection of carefully selected source and possibly derived texts for further study and evaluation of automatic methods of detection. (4) To select and evaluate a range of algorithms to measure text reuse based on existing methods of computing similarity between natural language texts. The METER project was supported by one of the largest UK news agencies, the Press Association, who allowed us access to the full newswire service as supplied to journalists. Being a primary supplier, the news issued by the PA is widely reused, either directly or indirectly, in British newspapers. The study of text reuse has, aside from its intrinsic academic interest, a number of potential applications. Like most newswire agencies, the PA does not monitor the uptake or dissemination of copy they release because tools, technologies, and even the appropriate conceptual framework for measuring text reuse are unavailable. For news agencies like the PA who release vast amounts of news material everyday (on average the PA release over 1,500 stories each day), manually analyzing stories for text reuse is not only impractical, but with limited resources is also infeasible. For the PA, potential applications of being able to automatically measure reuse of their text accurately include: (1) monitoring of source take-up to identify unused or little used stories; (2) identifying the most reused stories within the British media; (3) determining customer dependencies on PA copy, and (4) new methods for charging customers based upon the amount of copy reused. This could create a fairer and more competitive pricing policy for the PA2. 2.2. The METER corpus To support our investigation of text reuse, we created a corpus of newswire and newspaper texts called the METER corpus. This corpus, the first of its kind as far as we are aware, provides over 700 examples of text reuse between Press Association source texts and subsequent articles published by nine newspapers in the British Press who subscribe to the PA news service. The corpus consists of 1,716 texts selected from the period between 12th July 1999 and 21st June 2000 from two staple and recurring domains in contemporary British Press: (1) law and court reporting, and (2) show business and entertainment. Newspaper and newswire texts collected for the METER corpus vary around a number of parameters which were taken into account during the corpus construction. These include the following: (1) domain, (2) source, (3) time period, (4) newspaper register, (5) length of newspaper story, (6) coverage of the news topic, (7) degree of reuse, and (8) the number of newspapers reporting the same story. Texts were selected to represent these parameters and their influence on text reuse was taken into account during analysis. 
1 METER was a 2-year UK EPSRC-funded research project – for details see: http://www.dcs.shef.ac.uk/nlp/meter.
2 For more information on the PA and public access to a portion of their newswire, see: http://www.ananova.com.
As well as selecting source texts for the corpus, professional journalists also analysed the PA and newspaper text pairs and recorded the derivational relationship believed to exist between them, according to a two-level, three-degree classification scheme. At the document level, all 944 newspaper articles were classified as either: (1) wholly derived, (2) partially derived, or (3) non-derived, reflecting their degree of dependency on the PA source text for the provision of news material. A more fine-grained scheme, at the lexical level, aimed to capture the reuse of text within the newspaper text itself and classified word sequences as either: (1) verbatim, (2) rewritten, or (3) new. At this level of reuse, only 445 texts were analysed and annotated, due to limitations of time and resources. These derivation relationships were captured in the corpus as pragmatic annotation – interpretive information added to the base text to capture a professional journalist's view of text reuse between a PA source text and the corresponding newspaper article.
2.3. Annotating the corpus in TEI/XML
To explicitly capture characteristics of the newspaper texts, e.g. their headline, author, date of publication, catchline, domain and source, as well as the derivation relationships existing between the newswire-newspaper counterpart texts, we originally created an SGML markup scheme in a beta release of the corpus. However, we later transformed the entire collection into XML conforming to the Text Encoding Initiative (TEI) standard, with the goal of making the resource compatible with international standards for corpus encoding and hence facilitating exchange with other members of the corpus linguistics community.
The corpus is stored physically in 27 separate files, one for each day on which stories were sampled. A global corpus header file contains information about the corpus as a whole, including publication information and the definition of attributes specific to the METER project (e.g. the document and lexical level annotation schemes). Within each day file, material is organized into catchlines – a group of articles from different sources all dealing with the same story, which is identified by a PA-assigned tag known as a catchline (e.g. "axe" for a story about an axe murderer). Within each catchline are the individual articles as published by the PA or the newspapers. Figure 1 shows how we have captured this structure using TEI tags and attributes. The TEI scheme allows for the renaming of tags, but we chose to use standard names for simplicity; in some cases, therefore, the tag and attribute names do not necessarily provide an intuitive abbreviation of what they are intended to represent. Each file consists of a header (encapsulated within the TEI header element), followed by a collection of catchlines for both PA and newspaper texts, held within the TEI text and body elements. Each catchline is defined by a division element with an accompanying heading element (the heading elements are not shown in Figure 1 to simplify the diagram). Within each catchline division, a lower-level division element is used to indicate the individual newspaper articles or pages of PA text for that catchline. At the most detailed level of annotation, each text consists of paragraphs and sentences, denoted by paragraph and sentence elements. Within sentences, a segment element is used to indicate portions of verbatim, rewritten and new text. More information about the structure of the XML/TEI corpus can be found in Clough et al. (2002).
Figure 1: The TEI markup of individual days in the METER corpus
Table 1 lists, in order of their hierarchical structure, the key TEI tags and attributes which are used to encode the METER corpus, as shown schematically in Figure 1. For each element it gives the attributes used, example values, and their meaning.
    Catchline division: id (e.g. "M01032000-1"; unique catchline ID); n (e.g. "Burstein", "mother", "wagstaff"; name of catchline).
    Article division: id (e.g. "A636"; unique ID, A = PA, M = newspaper); n (e.g. "pa-01032000-15"; reference for each news story); type ("courts" or "showbiz"; domain for this catchline); ana ("src", "wd", "pd", or "nd"; document-level classification).
    Source element: type (e.g. "pa", "sun", "mail", "times", "guardian"; the source of the news article).
    Sentence element: n (1..N; sentence number in the text).
    Segment element: ana ("verbatim", "rewritten" or "new"; lexical-level classification).
Table 1: Some of the TEI tags and attributes used in the METER corpus
3. XARA – the XML Aware Retrieval Application
3.1. Background
SARA ("SGML-Aware Retrieval Application") was originally developed with funding from the British Academy and the British Library to meet the need for a simple searching interface to the British National Corpus (BNC), when this was first distributed in 1994. The original design brief was to survey freely-available text retrieval software libraries and to build on these a simple user interface which could also be distributed without additional licensing costs. At this time the amount of freely available SGML-aware software was small, and it was rapidly decided that it would be quicker and more cost-effective to develop our own within the BNC project. SARA was accordingly developed with the very specific needs of the BNC in mind, incorporating a number of unusual design features unique to that resource (such as the specifics of how words are tagged, and how speakers in its spoken texts are identified), while excluding a number of more generic but expensive features (such as indexing of POS tags). The system was also designed with a particular software environment in mind: one in which researchers used desktop machines to access local server machines, probably running a dialect of Unix, rather than standalone desktop systems, and in which data were counted in megabytes rather than gigabytes. Because it was intended for use with only one specific data set, little thought went into generalizing (or even distributing) the component which created the index files used by the server to optimize access to the corpus. For similar reasons, little effort was put into optimizing the structure of those indexes, or into extending them to cope with user requirements beyond those already identified within the project. Some consequences of these design decisions are readily apparent: the system is designed to operate in a network environment, with the bulk of data storage and much of the processing carried out on a central server. The overall functionality of the system is constrained by the expressivity of the protocol by which clients communicate with the server. This is probably a good thing, since the availability of that protocol makes it easier to develop new clients in new environments; the constraints consequent on the need to support only the BNC are less beneficial, however, since the fixed index format made it difficult or impossible to support additional requirements such as efficient identification of collocates or POS searches.
3.2. New directions
In redesigning the tool, the primary objective was to take advantage wherever possible of the availability of XML and of XML-encoded data, reflected in a change of name – from SARA to XARA. This decision conveniently allows us also to leverage a number of powerful technologies: the most significant of these is probably Unicode, which not only enables us to support corpora in any of the world's current languages, but also allows us to rely upon standard methods of tokenization and character classing.
XARA will operate on any well-formed XML document, without the need for any DTD or detailed tagging; it will also make use of whatever DTD it is supplied with to enrich the functionality of the supplied data. 230 After nearly a decade of use, we were much better aware of the problems reported by users of the BNC and SARA, and felt confident that we now knew which of the many possible additional enhancements we might add would be of most general usefulness to our target user community: some of the new features are listed below. As well as relying on Unicode and XML, XARA uses other recognised standards for its communication with other systems and with the outside world. The metadata required by the indexing system is all recorded using the TEI header (though there is no requirement that the rest of the corpus follows this particular XML application); the scripting language offered for application development is based on Javascript; the formatting language is a simple subset of CSS, the W3C's Cascading Stylesheet Language; and we are also considering how best to re-express the query protocol itself in an existing XML vocabulary. 3.3. Technical Overview The XARA system combines the following components: (1) an indexer, which creates inverted file style indexes to a large collection of discrete XML documents; (2) a server, which handles all interaction between the client programs and the data files; (3) a Windows client, which handles interaction between the server and the user. In addition, an index building utility, called Indextools, is supplied with XARA, which simplifies the process of constructing a XARA database. Its chief function is to collect information about the corpus to be supplied additional to that present in any pre-existing corpus header, and to produce a validated and extended form of the corpus header. It can also be used to run the indexer and test its output. 3.3.1. Functionality for indexing In order to process a large amount of textual data marked up in XML, XARA uses a pre-built index to optimize access to the original source texts for particular (largely lexically-motivated) kinds of query. To construct this index, the software needs information about: • how PCDATA (element content) is to be tokenized • how tokens are to be mapped to index terms (for example, by lemmatization or by the inclusion of additional keys) • how index terms are to be referenced in terms of the document structure • how and whether XML tags and attributes are to be indexed Much, perhaps most, of this information is implicit in the XML structure for a richly tagged corpus: one could imagine a corpus in which every index term was explicitly tagged, with lemma, part of speech, representation etc. In practice however, such richly tagged corpora remain imaginary, and software performs the largely automatic task of adding value to a document marked up with only a basic XML structure. XARA allows for the specification of how that task is to be performed in a standardized way, using the existing structure of the TEI header (see http://www.tei-c.org/Guidelines/HD.htm) as the vehicle for its specification. 3.3.2. Functionality for searching XARA inherits from SARA a rich range of query facilities, but adds to them considerably. The system allows the user to search for substrings, words, phrases, or the tags which delimit XML elements; it also supports a variety of scoped queries, searching for combinations of words etc. 
in particular contexts, which may be defined as XML elements, or as combinations of other identifiable entities, as further discussed elsewhere in this paper. Searches can be made using additional keys such as part of speech, or root form of a token, specified either explicitly in the tagging of the texts, or implicitly by means of a named algorithm. Outputs from a search usually take the form of a traditional KWIC concordance, which can be displayed or organized in several different ways, under user control, or can be exported as an XML file for use with other XML-aware systems such as a word processor. Corpora can be reorganized or partitioned in a user-defined way, using the results of any query, the values of specified element/attribute combinations, or by means of manual classification. Searches carried out 231 across partitioned corpora can be analysed by partition: so the client can display the relative frequencies of a given lexical phenomenon in texts of all different categories identified in a corpus. 4. Using XARA to explore the METER corpus 4.1. Indexing the METER corpus To index the METER corpus, we used the XARA Indextools utility. Given a TEI corpus header file and a group of XML/TEI texts, the indexer tool creates a searchable index from the texts, which can then be explored using XARA. XML/TEI validation is carried out during indexing. One aspect of XARA which has important implications for subsequent searching, is the requirement for the user to specify a notional document and unit during indexing. The document can be any XML element. Its significance is that partitions of the corpus, or sub-corpora, can only be defined in terms of sets of whatever element is chosen as document. The unit can also be any XML element; its function is to identify the lower level context within which hits are identified. We chose
the article-level division to be our notional document, i.e. the individual newspaper or PA article, and the sentence element to be our notional unit. Thus we can create partitions, or sub-corpora, of articles matching various search criteria, and lexical search results are displayed in terms of the sentences in which they occur.
4.2. Searching the METER corpus
Once indexed, the corpus can be searched. XARA offers a number of query options including word, phrase, part-of-speech, pattern and XML-based searching. There are at least four specific features of XARA which are useful for querying the METER corpus: (1) the Query Builder tool, which enables users to create complex queries using a visual interface, (2) the ability to define a partition of the corpus, dividing it into sections according to a number of different criteria, (3) the way in which search results may be presented to the user, and (4) how search results may be saved for subsequent exploitation. We describe these capabilities in this section, then present a set of example queries in the next section. More detailed information about searching with SARA can be found in Aston and Burnard (1998).
Query Builder
To create complex queries the Query Builder tool can be used; its visual interface is shown in Figure 2. The left node is called the scope node and defines the scope or context in which a complex query is executed. For example, if the scope node is defined as the article
annotation, the search will take place on text and annotations only within the scope of this annotation – a news story in the case of the METER corpus. To fully specify the query, one or more right-hand nodes must be specified using any of the query options, e.g. pattern, word, phrase or XML annotation. Nodes can be added both vertically and horizontally. Query nodes added vertically are logically ANDed together to narrow the context in which the search is executed. Nodes added horizontally represent alternatives, i.e. are logically ORed. For vertical connections, the link type between nodes can be specified as either next (i.e. adjacent), not next, one-way, or two-way, all indicating the order and proximity in which matching search terms must be found. Figure 2: Ordering of nodes can affect the query: left-hand query produces no solutions; right-hand one produces 21,209 solutions Searches based on annotations and one-way link types are sensitive to the hierarchical structure and order of the corpus annotations. For example, consider the problem of listing all sentences from news stories. Two versions of the query are shown in Figure 2. The query on the left returns no solutions, whereas the 232 query on the right returns all 21,209 sentences in the METER corpus. The difference is the ordering of nodes in the query. The left-hand query does not work because the ordering requires matching sentences first and then the headline, but sentences always come after the headline. Defining partitions XARA allows the user to create sub-corpora, or partitions, each representing a different grouping of the documents in the whole corpus. By this means the user can select a portion of the corpus which is of interest and perform further analysis on this portion only. Partitions can be created by either selecting texts from the user interface or by using a query to partition the corpus into documents that match the query and those that do not. Once a partition has been created, it can be activated and queries are addressed to the activated partition only. By partitioning the METER corpus, queries aimed at one particular subset of the corpus, e.g. a particular newspaper or only articles derived from the PA, can be easily created, thereby reducing their complexity. This capability highlights the importance of (1) ensuring that the textual units over which one wants to aggregate results are defined as XML elements in the building the corpus, and (2) selecting the appropriate XML/TEI element as the notional “document” in indexing the corpus with XARA. Viewing the search results After executing a query in XARA, the results are displayed as either a listing or each result is displayed individually. The listing usually takes the form of a KWIC concordance in which results can be viewed as plain text, in XML showing the annotations, or in a custom-format using a user-defined Cascading Style Sheet (CSS). The context of the listing will depend on the search itself, but can be broadened and narrowed by the user. The listing can be exported as an XML file which can be viewed or edited with other XML-aware applications. Saving search results XARA enables the user to save various outputs including word frequency lists produced during a word query and the results of query analysis, in XML or comma/tab separated plain text formats. The XML results can be parsed by and viewed in any XML-aware tool. 4.3. 
Sample Searches In this section we present a number of example questions that one might like to ask of the METER corpus, and discuss whether and how XARA can support these requests. These fall into three classes: those addressing the lexical content of the corpus only, those addressing the XML annotations alone, and those combining lexical content and annotations. 4.3.1. Lexical searches based on the content of the news stories: free-text searching (1) Find me which texts contain a given word or phrase Suppose we begin with a simple word or phrase search, for example: “find me all the texts which contain the word ‘Ipswich'”. This can be performed using the “word query” option which matches 30 occurrences of “Ipswich”, and 1 occurrence of “Ipswich's”. At this point, the word frequency table can be saved in XML, or displayed as a KWIC concordance. As well as exact match searches, a regular expression can also be defined to search for a pattern, e.g. “town*” finds the words “town”, “town's”, “townhill”, “townhouse” and “townend”. Phrases can also be searched for, e.g. “Ipswich town”, although this option does not permit pattern searches. (2) Find me which words co-occur with a given word XARA can be used to find collocations, a common task in corpus linguistics. A Z-score or Mutual Information (MI) score indicates the “strength” of collocate and can be used to rank the results. Collocates of “Ipswich” include “CID”, “borough”, “town” and “crown”. Again, this listing can be viewed and/or saved. 233 (3) Find me which words are most frequent in the METER corpus To create a list of word frequencies, the word query option is used and the search term defined as a pattern, e.g. [a-zA-Z-]*. This calculates word frequency based on the entire METER corpus and the results can be saved to a file in the form (e.g. comma-separated or XML tagged). It would be useful to be able to filter certain words, e.g. function words, which are invariably most frequent, from these lists, either at search or index time: XARA does not support this directly, though it is possible to order lists by frequency and specify a cut-off point. When a partition is active, word frequencies are provided for each component of the partition, allowing one to compare the relative frequencies of the given search term across the partition. (4) More complex lexical searches As with most search engines, XARA can be used to generate complex queries using Boolean operators, such as “find me which texts contain the words X and Y”, or “find me which texts contain the words X or Y”. The Query Builder tool can be used to perform this search task, but also supports more sophisticated searching by enabling the user to limit the scope and order with which the query words match. For example, the search “Find me all texts containing Ipswich and council” can be defined in Query Builder as a search for two vertically-placed word nodes within the scope of a
single article division. With the link type defined as next, the query requires the word "council" to immediately follow the word "Ipswich" (equivalent to a phrase search); a one-way link means that "council" must follow "Ipswich" within the scope of that division
(i.e. ordered), and a two-way link means the words can occur in any order, but still within the same scope.
4.3.2. Searching on the XML annotations
Using the XML markup, we are able to exploit the interpretative information added to the METER corpus to answer queries which cannot be answered from the lexical content alone. As with searching over lexical content, XARA supports complex searching over annotations too, enabling us to answer questions such as "Find me all texts from the Sun newspaper derived from the PA". What you can do with XARA depends crucially on the annotation scheme; therefore, careful design of this scheme is paramount. For example, in the initial TEI markup of the METER corpus, we did not use the "type" attribute to identify the article source, but only to indicate whether the source was a newspaper or a PA text. The specific source could only be extracted from a substring of the "n" attribute on the article
tag (e.g. “sun-01032000-3”). The version of XARA we used did not support pattern searches on the XML attribute value strings: our solution was to extract the source name from the
annotation, using a Perl script, and change the “n” attribute to reflect this, i.e. to modify the annotation scheme. (1) Show me how many newspaper stories are in the METER corpus When constructing an XML query, selecting the tag and “type” attribute displays all possible attribute values and their counts, giving the number of articles from each newspaper in the corpus. This is true for all attributes, e.g. selecting the
(1) Show me how many newspaper stories are in the METER corpus
When constructing an XML query, selecting the text element and its "type" attribute displays all possible attribute values together with their counts, giving the number of articles from each newspaper in the corpus. The same is true for all attributes: for example, selecting the "ana" attribute shows that there are 205 non-derived, 438 partially-derived and 301 wholly-derived texts.

(2) Extract all titles from the Sun which are wholly derived from the PA
This can be performed using Query Builder. Scope is defined to be within each text, and the additional query constraints are that the "ana" attribute must be "wd" (wholly derived) and the "type" attribute must be "sun". The result is a list of 19 headlines. As a further example, to select all wholly-derived texts from the Sun in the courts domain, the preceding query need only be modified by adding a further constraint on the scope node ("type" equals "courts").

(3) Extract all sentences from the Sun which were taken verbatim from the PA
To extract verbatim sentences, a query can be defined to match sentences containing a segment whose "ana" attribute value is "verbatim". If, during indexing, the unit has been specified to be the sentence, as we have assumed, the results will be displayed within the context of the sentence, as required. Searches such as "Show me all verbatim segments from the Sun" or "Show me texts containing verbatim segments from the Sun" require re-indexing the corpus, defining the unit to be the segment or the whole text respectively.
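The same annotation-based queries can of course also be run outside XARA with any XML library. The sketch below assumes, purely for concreteness, elements named "text" carrying the "type" and "ana" attributes and a "head" element for the headline; the actual METER element names may differ:

import xml.etree.ElementTree as ET

def sun_wholly_derived_titles(paths):
    # Collect headlines of texts marked as Sun articles wholly derived from the PA.
    titles = []
    for path in paths:
        root = ET.parse(path).getroot()
        for text in root.iter("text"):
            if text.get("type") == "sun" and text.get("ana") == "wd":
                head = text.find(".//head")
                if head is not None and head.text:
                    titles.append(head.text.strip())
    return titles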
4.3.3. Combining searches

Using the Query Builder tool, search requests which combine information from the annotations with free text can be constructed, enabling us to answer questions such as the following.

(1) Show me all titles from the Sun with the word "Axe" in the headline
We define the scope node with the attribute "type" set to match "sun". By adding a word query node to match the word "axe", two titles are returned: "Sex taunt made axe monster murder 3 in love triangle" and "Street to axe Ravi". Note that in the first, "axe" is a noun modifying "monster", the context being a murder, whereas in the second "axe" is a verb, used here to headline a story about the actor Ravi being removed from the British soap opera Coronation Street. This reveals some interesting characteristics of the kind of newspaper language used by the popular press.

(2) Extract a sub-corpus consisting of all murder stories with the catchline "Axe"
Start by defining a query whose scope node has its "type" attribute set to "courts" (the category covering murder stories), and then add a word query node for "axe". This returns 45 stories from both newspapers and the PA (the result could also be filtered to include, for example, just newspaper stories). A new partition is defined on the basis of this query and saved, and the partition is activated to limit further queries to just these texts.
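Outside XARA, the same partition amounts to selecting document identifiers by a combined condition over annotation and content; a minimal sketch under the same assumed element and attribute names as above:

import re
import xml.etree.ElementTree as ET

def axe_courts_subcorpus(paths):
    # Identify court-domain texts whose wording contains "axe" as a whole word.
    selected = []
    for path in paths:
        for text in ET.parse(path).getroot().iter("text"):
            if text.get("type") != "courts":
                continue
            content = " ".join(text.itertext()).lower()
            if re.search(r"\baxe\b", content):
                selected.append(text.get("n", path))
    return selected  # the identifiers can be saved and used to restrict later queries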
(3) Which words occur most frequently in the verbatim, rewritten and new portions of the texts?
This search is more complex than the preceding ones because it involves creating a new XML file based on the results of a query, and then re-indexing this file in order to compute word frequencies. First, we index the METER corpus with the text as the document and the segment as the unit, so as to create a listing whose context is just the segment. This listing is saved in XML and used as the basis for a new index. The new index is then loaded into XARA and word frequencies can again be computed.

5. Conclusions

The METER corpus was encoded in TEI/XML format to enable the data to be shared with other members of the TEI community, and to enable generic XML and TEI parsers and search tools to be used, avoiding the need to build custom applications. The goals of the work described in this paper were (1) to verify and validate the TEI encoding scheme we have adopted for the METER corpus and (2) to verify and validate the XARA query tool. We pursued these goals by examining a variety of questions related to the annotations and lexical content of the METER corpus, questions which we hoped the corpus annotation and XARA would enable us to answer, and used XARA to try to answer them. Our first observation is that the TEI encoding of METER is in fact valid: by indexing the corpus using XARA we have been able to verify that the texts are encoded in valid XML. Secondly, the encoding was verified to the extent that it allowed us to answer all of the questions we set ourselves. However, a deeper question is whether the TEI encoding was really necessary. Certainly, there can be no doubt that an XML encoding of some sort is highly useful, enabling searches over the structured annotations as well as combined searches over lexical content and structured metadata. However, it is not clear that, for the METER corpus, annotating in TEI provides any additional benefit beyond encoding the corpus in XML. For those who are not familiar with TEI (in this case the creators of the corpus when we set out to build it), TEI involves a steep learning curve. Using TEI offers the corpus encoder a choice of predefined tags and attributes, at the expense of sometimes difficult decisions about how exactly to map the structural units of the corpus onto the available TEI elements and their associated attributes. XARA works perfectly well with any XML files, so the benefits it offers could be obtained without adopting TEI. While for many applications a ready-made tag and attribute set of the sort TEI offers may be entirely adequate and promote sensible annotation, for others carefully designing a bespoke annotation hierarchy and attribute set from first principles may be a better use of time than shoe-horning the problem into the framework that TEI provides. One specific problem we encountered illustrates this issue and the interaction between the annotation scheme and the retrieval tool. When encoding the METER corpus in TEI, we annotated sentences and then annotated verbatim, rewritten or new word sequences as segments within those sentences. Since XML does not permit overlapping elements, the result was that verbatim, rewritten or new word sequences which spanned sentences were split into multiple shorter sequences that fit within sentences. However, our research into detecting text reuse found that the length of verbatim matches was a key indicator of reuse. The TEI encoding scheme we had adopted therefore precluded determining whether adjacent verbatim word sequences in different sentences were part of a single longer verbatim sequence. One solution to this difficulty is to use a link attribute to chain together multiple connected segments.
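As a rough illustration of that chaining idea, the sketch below follows a hypothetical "next" attribute from one verbatim segment to its continuation and sums the chained lengths; the segment and attribute names are assumptions, and no such linking attribute exists in the current METER markup, which is precisely the limitation being discussed:

import xml.etree.ElementTree as ET

def verbatim_chain_lengths(path):
    # Sum word counts over chains of verbatim segments linked by a hypothetical "next" id.
    root = ET.parse(path).getroot()
    segs = {s.get("id"): s for s in root.iter("seg") if s.get("ana") == "verbatim"}
    continuations = {s.get("next") for s in segs.values() if s.get("next")}
    lengths = []
    for seg_id, seg in segs.items():
        if seg_id in continuations:   # skip segments that are not chain heads
            continue
        total, current = 0, seg
        while current is not None:
            total += len(" ".join(current.itertext()).split())
            current = segs.get(current.get("next"))
        lengths.append(total)
    return lengths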
Another solution would be to do without sentence annotations altogether, since they are not essential to the METER aims. However, this is not entirely satisfactory because not all texts in the corpus were annotated to the same degree of granularity, i.e. some are annotated down to the level of word sequences and some only at the document level. A certain degree of markup more fine-grained than the whole text, e.g. the sentence, is required if the results of search requests in XARA are to be of practical benefit.

The second aim of this paper was to verify and validate the XARA query tool as a possible search tool for release with the METER corpus. XARA has been able to support all of the initial questions we set out to answer, and has, so far as we are aware, correctly answered all of them. This included queries concerned with lexical content only, queries concerned with structured annotations only, and combinations of the two. The only limitations we discovered in the course of this study were that (1) partitions cannot be saved in an XML format that can be read by other applications, and (2) XARA does not support pattern-match searching over XML attribute value strings, which would be useful when the values assigned to attributes are non-atomic; this facility has subsequently been added to the system. A final lesson for would-be corpus creators is this: before committing to an annotation scheme, learn about the capabilities of an XML query tool such as XARA, identify the key queries you would like the scheme to support, and assure yourself that the query tool and annotation scheme together support those queries.

References:
Aston, G. and L. Burnard. 1998. The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh Textbooks in Empirical Linguistics, Edinburgh University Press, Edinburgh, UK.
Bell, A. 1991. The Language of News Media. Blackwell, Oxford, UK.
Clough, P., R. Gaizauskas, and S. L. Piao. 2002. Building and annotating a corpus for the study of journalistic text reuse. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC02), pp. 1678-1691.
Gaizauskas, R., J. Foster, Y. Wilks, J. Arundel, P. Clough, and S. Piao. 2001. The METER corpus: A corpus for analysing journalistic text reuse. In Proceedings of the Corpus Linguistics 2001 Conference, pp. 214-223.
Sperberg-McQueen, C. and L. Burnard, editors. 1999. Guidelines for Electronic Text Encoding and Interchange. TEI P3, Text Encoding Initiative, Oxford, UK, revised reprint edition.

Repetition and young learners' initiations in the L2: a corpus-driven analysis
Ana Llinares García
Universidad Autónoma de Madrid

Abstract
The present study is based on the analysis of classroom interactions among children and with their teacher in different types of foreign language contexts. The subjects of our study are five-year-old children and their teachers from different types of schools, with different degrees of immersion in the L2. The data come from the UAM corpus, which is collecting spoken data from different types of EFL contexts.1 Our analysis is based on Halliday's and Painter's classification of the communicative functions that children can convey in their mother tongue at the pre-school level (Halliday 1975; Painter 1999). In this paper we will provide a corpus-driven classification of the different types of repetitions performed by both teachers and children. We believe that teachers should encourage certain types of repetitions in the children's discourse.
The data show that when children are encouraged to repeat certain utterances with specific discourse functions, even children in low-immersion contexts will end up initiating interaction with no help from the teacher. This is especially important in the case of foreign language learners, in order to prevent their functional language production from being limited to responses to their teacher's initiations.

1. Introduction
The present study is part of a larger project that focuses on the functional analysis of a spoken corpus of English as a foreign language. The corpus has been collected from different types of EFL contexts with different degrees of immersion (from complete to low-immersion contexts). The first part of this corpus, which will be used in this study, contains data from five-year-old children and their teachers. One of the reasons for our choice of these subjects is the growing interest in Spain, over the last few years, in introducing the teaching of English at the pre-school level. Moreover, there has been very little research on how a foreign language develops at this age. Halliday's (1975) and Painter's (1999) analyses of the language of their children centred on the observation that children develop their language because they need to do things with it. Following this idea, the point of departure of our study was that, at the pre-school level, the teacher should encourage learners to use the L2 to convey different types of functions (to give information, to ask for information, to order an action, etc.). At this age and level, in our opinion, the priority should be that children perceive the learning of a foreign language as a process in which they can achieve things, where the L2 has a functional purpose, in the same way as their mother tongue. In this particular paper, we focus on the use of repetition, both by teachers and by children, in a high-immersion and a low-immersion classroom context, and on the different functions it conveys.

2. Repetition in the language of teachers and pupils
In the 1970s great interest arose in the analysis of the language used with children, which was called baby talk or motherese (Snow and Ferguson 1977). This kind of research focused on the importance of interaction and found that parents used well-structured sentences, repetitions and reformulations in order to guarantee communication. However, some authors, such as Aitchison (1998), questioned the validity of the input theory and argued that parents' corrections and reformulations had no positive effects on the child's linguistic development. Aitchison was referring here to grammatical complexity. However, as we will try to show in this paper, repetitions and expansions have considerable influence on other types of linguistic development, such as pragmatic development. In our opinion, input and interaction are fundamental aspects of the functional development of a foreign language, especially at the pre-school level. Repetitions are not only a characteristic of the language of parents. Studies on communication in childhood show that children tend to reinitiate communication when they do not receive any response.

1 The UAM corpus has been collected since 1998 and contains longitudinal data from children and teachers in EFL classroom contexts with different degrees of immersion in the L2. At the moment we have recorded data from the same children from the age of five to nine (project financed by the Comunidad Autónoma de Madrid, 06/0027/2001).
Garvey and Beringer (1981) find out that children use repetition to reinitiate their discourse 1/3 of the times. Repetition is also a common strategy used by foreign language learners. In her classification of learner strategies, Oxford (1990) considers “repetition” a cognitive strategy within the group of “direct strategies”. As far as teacher repetitions are concerned, Pica (1994) shows that when sentences are repeated or reformulated to enhance comprehension, the learners have more opportunities to become aware of characteristics of the L2. In the analysis of the L2 teacher´s language, Richards and Lockhart (1994) find that the instructions are repetitive in order to facilitate their pupils´comprehension. Classroom discourse is not always spontaneous and it has some elements common to other type of interactions in everyday life and some elements that are specific of the classroom context. At the pre-school level, the classroom discourse tends to be very similar to the type of interaction that is carried out at home between parents and children. Therefore, the use of repetitions by both teachers/parents and learners/children is common in both contexts. Parents tend not to correct the form of their children´s utterances. They seem to be more worried about communication and they tend to repeat the correct form if something is not very clear (Kess 1992). With these parameters children´s language develops. Therefore, the teacher´s discourse should be similar in second language contexts at an early age. However, there is a type of interaction that is characteristic and specific of the L2 classroom: teacher initiation-learner response-teacher feedback (Sinclair & Coulthard 1975). Usually, after correcting the learners, the teacher expects them to repeat the correct utterance. This is a type of repetition encouraged by the teacher that is very common in classroom contexts. We believe it is very important for teachers to ask learners to produce this kind of repetitions in EFL contexts at initial learning levels, as they will lead, in the end, to the learners´ initiation of interactions, as we will see in the present paper. As far as bilingual contexts are concerned, Genesee (1994) suggests the teacher repetitions of important words as one of the ways of facilitating comprehension. He suggests the importance that the teacher offers opportunities for language use and interaction, promoting rich activities that give the children the opportunity to initiate a conversation. He also suggests the importance of encouraging children´s repetitions of the teacher´s model in order to stimulate specific linguistic aspects. 3. The function of repetition in the corpus: qualitative analysis The function of repeating is included in many of the taxonomies on pragmatic functions both outside and inside the classroom context. In the analysis of the language of the child, it is important to mention taxonomies such as the one by Vila (1987) and by Ninio et al. (1994), which include the function of imitating through repetition as very characteristic of the language of the child. Vila (1987), for example, describes the function of imitating as characteristic of the dialogue and defines it as the function of repeating the adult´s utterance without providing any information. In the analysis of children´s production in second language classroom contexts it is important to mention Cathcart´s taxonomy (1986). 
She identifies a further type of repetition, which consists of children repeating their own utterances in order to reinforce a message (in this case it is self-repetition, as opposed to the imitation of others' utterances, which is the function identified in Vila's and Ninio et al.'s taxonomies).

In a broader analysis, we classified the different pragmatic functions realised by teachers and pupils in our corpus on the basis of Halliday's and Painter's classification of the functions of language that children realise in their mother tongue: personal, informative, heuristic, interactional, regulatory and instrumental. We made a corpus-driven subclassification within each function, and we added some functions that we found to be specific to the foreign language classroom context. We found it important to include a function specific to the teacher, called teacher feedback, and another group of functions that we called secondary functions, whose aim is to reinforce a specific message. The different types of repetition were classified as follows:2

2 Not all the functions presented below were always realised through repetitions.

Teacher repetitions:
I./ TEACHER FEEDBACK (other-repetition / voluntary)
I.a./ Repetition of the message given by the learner in order to encourage them to say more, or as a simple acknowledgement.
Example:
TCH: And how do you tear it off?
CH: This cuts it
TCH: Oh, that cuts it...
CH: Like this
I.b./ Repetition of the message given by the learner in order to show positive assessment.
Example:
TCH: You want to cut the bread. Where does that come from? Where does th- the bread come from?
CH: From wheat.
TCH: From wheat.

In example I.b, the teacher asks the pupil a display question. Therefore, we have considered the teacher's repetition of the child's response as having a positive evaluative function. In example I.a, on the other hand, the teacher asks a referential question, and the repetition of the child's response does not have an evaluative function in this case, but an interactional one.3

3 Display questions are those where the answer is known to the speaker; referential questions are those where the speaker does not know the answer. The first type is characteristic of the language of the teacher, used to test pupils' knowledge.

Children repetitions:
II./ RESPONSE TO THE TEACHER'S PEDAGOGIC REGULATORY FUNCTION (other-repetition / non-voluntary)
The teacher asks a pupil to imitate or repeat for pedagogic purposes, so that he or she learns an utterance better or as reinforced input for their classmates.
II.a./ The children are made to imitate the teacher's utterance
Example:4
TCH: An elephant?
CH: An elephant?
TCH: In my house? No way
CH: No way

4 This extract belongs to an activity in which several pupils take part in a role-play about a mother who does not allow her child to keep certain animals at home. The dialogue is, in many cases, repetitive. However, this repetition seems to be necessary for the learners to complete the dialogue at this initial stage (we must remember that the pupils are five-year-old EFL learners).

II.b./ The children are made to repeat their own utterance
Example:
CH: How many?
TCH: Fernando, sh, sh, can you repeat, David?
CH: How many?

One of the most characteristic types of interaction in a second language context is that in which the teacher gives a model that the child has to imitate (Prator 1969).
Among first language acquisition studies, some within a generative approach have suggested the low impact of children´s imitation for their language development (Ervin 1964). Aitchison (1998) points out that repetition is necessary but is not enough: What is being said is that practice alone cannot account for language acquisition: children do not learn merely by constant repetition of items (Aitchison 1998:75). In our opinion, it is important to encourage children to imitate in foreign language classroom contexts at an early age, even though it does not create a natural interaction in the L2 classroom. III./ CHILDREN´S SPONTANEOUS REPETITIONS OF THE TEACHER´S OR OTHER CHILDREN´S UTTERANCES (other-repetition/voluntary) Example: TCH: Sit down ((TO FERNANDO)) OK. And now tell me, what's this? (( SHOWS A PICTURE )) CH: Boy. CH: Boy One of the most characteristic types of interaction between the child and the adult includes the repetition that the child makes of the language of the adult (Montfort et al. 1996). In second language learning contexts, Genesee (1994) observes that the children have the tendency to repeat, only with the aim of practising something new for them. Teacher and children repetitions: IV./ RESPONSE TO A REQUEST FOR CLARIFICATION (self-repetition/non-voluntary) Example: a) CH: Red. TCH: Sorry? CH: Red b) CH:You know? My daddy has . three or four brothers. CH: What? CH: My daddy has three or four brothers In examples a) the teacher demands the repetition for clarification and in b) it´s a child that demands it from one of his classmates. V./ SECONDARY FUNCTION (self-repetition/ voluntary) Self-repetition of a message to reinforce it Example: CH: and then .. he do it wrong TCH: He did it wrong. Why did he do it wrong? How was it wrong? In the acquisition of the mother tongue, self-repetition is one of the most common strategies used to teach and learn. In the acquisition of a second language, Tomlin (1994) argues that repetition is a social 240 act with cognitive consequences. With repetition, the teacher helps the pupil to understand the sentence produced, and this has the cognitive consequence of helping the learner to transform “input” in the L2 into “intake”. The secondary functions could be classified into what Halliday calls “textual meaning”, as they are used to clarify what has already been said. We have followed Sinclair and Coulthard (1975) when they classify some of the moves as “subordinate”, when they are not followed by a response and depend on the central move. In our taxonomy, we have considered different types of secondary functions. We have included here the one that involves repetition of the previous utterance. We have codified an utterance with this function when there is a repetition of the content of the message, even though it is repeated with other words, as we can see in the example above. However, when the children use this function, reinforcing their message, they tend to use the exact words of the central move, as we can see in the following example: CH1: I'm hungry. ((NOISE)) I'm hungry. CH2: This is television. This is television. TCH: OK. Television. In this example, we have a case of repetition of an utterance by a pupil to reinforce his need: he wants something to eat. The other type of repetition made by another child has the purpose of showing that he knows the answer to the teacher´s question. In our opinion, this is an important function both in the discourse of the teacher and the pupils. 
The fact that the teacher reinforces an utterance can have important implications for the pupils' comprehension and future production in the L2. When this function is realised by the pupils, it helps them to achieve their communicative goals. It is very common for children to reinitiate communication when they do not obtain any response: Garvey and Beringer (1981) observed that children between 2;10 and 5;7 tended to reinitiate their discourse through repetition one third of the time.

4. Methodology and quantitative analysis
The data analysed for this study come from an English school, where children were exposed to the L2 for the whole school day, and from a low-immersion bilingual school, where children were exposed to the L2 for one hour every day. In both cases, the children were five years old. We recorded five sessions from two groups in the English school and one group in the low-immersion school. After transcribing and tagging each individual utterance in the language of both teachers and pupils, we identified 67 different functions (Llinares García 2002). In this particular study, we will focus on the functions realised through repetition.5

5 The WordSmith Tools were used to analyse the data once it was tagged.

4.1. Self-repetition
Of the 67 functions coded in the corpus, the function of self-repeating or self-reinforcing an utterance is one of the three most frequent in the language of both teachers and children in both groups, as shown in the tables below.

Table 1: Three most common functions realised by the teachers in both the high and the low immersion contexts
Teacher/high-imm./Group A: Marker of initiation of an utterance (S1) 11,88%; Call somebody (R1) 11,10%; Self-repetition or reinforcement of an utterance (S2) 10,01%
Teacher/high-imm./Group B: Open display questions (H3.bo) 10,71%; Self-repetition or reinforcement of an utterance (S2) 9,43%; Ask somebody to perform an action (R3) 9,11%
Teacher/low-imm.: Self-repetition or reinforcement of an utterance (S2) 13,16%; Positive evaluation (TF2.a) 13,16%; Call somebody (R1) 10,32%

Table 2: Three most common functions realised by the pupils in the L2 in both the high and the low immersion contexts
Pupils/high-imm./Group A (L2): Call somebody (R1) 10,90%; Self-repetition or reinforcement of an utterance (S2) 8,79%; Identify or describe (Inf1) 6,08%
Pupils/high-imm./Group B (L2): Response to open display questions (H3.re.bo) 7,07%; Call somebody (R1) 6,42%; Self-repetition or reinforcement of an utterance (S2) 6,06%
Pupils/low-imm. (L2): Completion of the teacher's request (RP2.re) 16,82%; Identify or describe people or things in the present (Inf1) 12,78%; Call somebody (R1) 6,05%

Table 1 shows that the function of self-repetition is among the three most common in the language of all three teachers, as we would expect in a classroom context, where teachers need to reinforce their message to make sure that the pupils understand. This function is even more necessary in the language of teachers in EFL contexts; in our data, the teacher in the low-immersion context seems to perceive the need to reinforce the message even more strongly, as it is her most common function. This function is also one of the most common in the language of the pupils in the high-immersion context.
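The percentages in Tables 1 and 2 are simple relative frequencies of function codes over the tagged utterances of each group. The actual counts were obtained with WordSmith Tools; the sketch below only illustrates the tallying step, and the (group, code) input format is invented for illustration:

from collections import Counter

def top_functions(tagged_utterances, n=3):
    # tagged_utterances: iterable of (speaker_group, function_code) pairs.
    by_group = {}
    for group, code in tagged_utterances:
        by_group.setdefault(group, Counter())[code] += 1
    report = {}
    for group, counts in by_group.items():
        total = sum(counts.values())
        report[group] = [(code, round(100.0 * c / total, 2))
                         for code, c in counts.most_common(n)]
    return report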
4.2. The children's response to the teacher's feedback in the high-immersion context
An interesting result in the high-immersion context was that one of the teachers used the feedback function very frequently, encouraging the pupils to continue with their discourse, whereas the other teacher used it much less frequently: the teacher in group A uses this function 25 times, against twice in the case of the teacher in group B. As we can see in the example below, children go on with their discourse when they are encouraged to do so:
TCH: He does? Who kills him?
CH: a bird, a bird kills him
TCH: A bird kills him
CH: It's funny at the end because the grasshopper said, is that one another one of your tricks?
The example below is one of the two that correspond to group B, and we can see that this function is not only realised through repetition:
CH1: ((native)) I I I saw a grasshopper in my garden
TCH: Did you?
CH2: Me too. I saw it when I was little
CH3: I saw one in the old school
A native pupil initiates this interaction, but once the teacher encourages them to go on, other non-native children participate.

4.3. The pupils' imitation of the teacher's utterances in the low-immersion context
In the low-immersion context, we identified a type of repetition that was important in encouraging the pupils' discourse initiation: children's imitation or repetition after the teacher's request to do so. Llinares García (2002) classified asking for repetition and asking for imitation within a group of three functions called teacher pedagogic regulatory functions, which comprise asking for imitation, asking for repetition and asking for completion. In this paper we focus only on the first two, which are the ones that involve repetition. All these functions were much more frequently used by the teacher in the low-immersion EFL context, as can be seen in Figure 1 below. Figure 2 shows that the functions of asking to repeat and asking to imitate are not used at all in the high-immersion classes.

Figure 1: Use of the teacher pedagogic regulatory functions in the high and low-immersion contexts (bar chart comparing high-immersion groups A and B with the low-immersion group)
Figure 2: Use of each individual pedagogic regulatory function (TRP1, TRP2, TRP3) in the high and low-immersion contexts (bar chart comparing the same three groups)

The example below shows the importance of this function for the language development of children in low-immersion contexts:
TCH: Who said yellow? Okay. Is it yellow? Santiago, ask Luis.
CH1: Is it yellow?
CH2: No
TCH: Good boy.. No. It isn't yellow No.
CH2: No.
TCH: It isn't yellow.
CH2: It isn't yellow.
TCH: No. Good boy. Table three. Who's going to ask a question in table three. Ínigo. Ask if it's dangerous... Is it dangerous?
CH3: Is is dangeorus?
CH2: Is dangerous.
CH3: Can I go to the bathroom, please?
TCH: No. Ínigo. W- wait class. Who wants to ask a question. Cristina, ask a question. What question would you like to ask?
CH4: Is it fast?
CH2: No.
In the example above, we can observe how the teacher asks the pupils to repeat a question in order to find out which animal it is. At the end, we can see how one of the pupils asks the question herself without having to repeat the teacher's question.

5.
Conclusions There are three main conclusions that we can raise in relation to the use of repetition in the EFL classroom contexts in our corpus: - Self-repetition is a very common function in the language of the teacher, and it is the most common in the low-immersion EFL context. This surely implies that the teacher feels the need to reinforce the message to make sure that the pupils understand. - In the high-immersion EFL context, pupils have the L2 knowledge to be able to express their messages in the L2. However, they need to be encouraged by their teacher. Therefore, we can see that, although both groups are exposed to the same hours of English, in one the pupils participate more in a spontaneous interaction, because the teacher motivates them to do so. Maybe we can conclude that the time of exposure to a language is a key point in child L2 acquisition, but it is also important the type of interaction promoted by the teacher. - In low-immersion contexts, on the other hand, it is not enough to motivate the pupils to talk, as they don´t have the language level. Here, it is important for the teacher to carry out interactions based on repetitions and imitations, that will later make it easier for the children to initiate the discourse on their own and to express things with more autonomy. 6. Bibliography Aitchison J 1998 The articulate mammal. London, Routledge. Cathcart R 1986 Situational differences and the sampling of young children´s school language. In Day R. (ed.) Talking to learn: conversation in second language acquisition. Rowley, Mass, Newbury House. Ervin, SM 1964 Imitation and structural change in children´s language. In Lenneberg, E.H. (ed.) New directions in the study of language. Cambridge, MA, MIT Press. Garvey C, Beringer G 1981 Timing and turn-taking in children´s conversations. Discourse Processes 4: 27-57. 244 Genesee F 1994 Educating second language children. Cambridge, Cambridge University Press. Halliday MAK 1975 Learning how to mean: explorations in the functions of language. London, Edward Arnold. Kess JF 1992 Psycholinguistics. Amsterdam, John Benjamins. Llinares García A 2002 La interacción lingüística en el aula de segundas lenguas en edades tempranas: análisis de un corpus desde una perspectiva funcional. Unpublished doctoral dissertation. Monfort M, Juárez Sánchez A 1996 El nino que habla. Madrid, Colección Educación Preescolar. Ninio A, Snow C, Pan B, Rollins, P 1994 Classifying communicative acts in children´s interactions. Journal of Communicative Disorders 27: 157-188. Oxford R. 1990 Language learning strategies: what every teacher should know. New York, Newbury House. Painter C. 1999 Learning through language in early childhood. London, Cassell. Pica T. 1994 Research on negotiation: what does it reveal about second-language learning conditions, processes, and outcomes? Language Learning 44 (3): 493-527. Prator CH 1969 Adding a second language. TESOL Quarterly 3 (2): 95-104. Richards JC, Lockhart, C 1994 Reflective teaching in second language classrooms. Cambridge, Cambridge University Press. Sinclair J, Coulthard, M 1975 Towards an analysis of discourse: the English used by teachers and pupils. Oxford, Oxford University Press. Snow C 1977 Mothers´speech research: from input to interaction. In Snow C, Ferguson, C (eds) Talking to children. Cambridge, Cambridge University Press. Tomlin RS 1994 Repetition in second language acquisition. In Johnstone B (ed.) Repetition in discourse interdisciplinary perspectives. Norwwod, NJ., Ablex. 
Vila I 1987 Adquisición y desarrollo del lenguaje. Barcelona, Grao. 245 Learner Corpora: Design, Development and Applications Development of NLP tools for CALL based on learner corpora (German as a foreign language) Sandrine Garnier, Youhanizou Tall, Sisay Fissaha, Johann Haller Institut für angewandte Informationsforschung Institute of Applied Information Sciences (IAI) Martin-Luther-Str. 14, D-66111 Saarbrücken E-Mail: info@iai.uni-sb.de, http://www.iai.uni-sb.de Telefon: +49 (0)681-38951-0 Fax: +49 (0)681-38951-40 Introduction The project unideutsch.de is carried out by the DAF Institute in Munich (German as foreign language) in collaboration with IAI which is responsible for the parts described in this paper. IAI was founded as a non-profit-making Research & Development (R&D) organisation in 1985 in Saarbrücken, Germany, to carry out the European EUROTRA machine translation project for the German language. Meanwhile IAI is an internationally acknowledged R&D institute in the field of multilingual information processing covering all aspects of natural language processing, computer-aided translation, and information management and knowledge management in advanced information technology environments. This is mainly done through strategic alliances with industry and national and European authorities. At present, the institute's major R&D activity is in the area of multilingual language technology, in particular machine translation technology with special emphasis on domain-specific terminology and concept-based methodologies, multilingual information retrieval and information filtering, as well as general knowledge management. 1 Project uni-deutsch.de: description Currently IAI is working on learning projects such as uni-deutsch.de, which combines our experience of NLP and the results of the research in foreign languages learning. The use of the technical advanced NLP tools provides another means of learning. The main objective of the project is to devise an autonomous, long-distance, language learning system for advanced learners. The project is aimed further at: improving the language performance of advanced learners. boosting the learning process by giving vocabulary in certain specified domains such as science or education. checking automatically a text produced by a foreign language learner allowing them to learn by an interactive process which combines a human corrector and a machine corrector. The machine corrector has the purpose of correcting spelling and grammar errors whereas the human corrector has to answer learner´s questions concerning errors found by the machine and to comment on other semantic errors. The combination of these both methods allows for accelerated language acquisition through error based learning. 246 2 3 Terminology Computer based learning of vocabulary In order to achieve the above mentioned aims, IAI has developed a programme, LiLa, - Linguistisch intelligente Lehrwerksanalyse – the task of which is to analyse textbooks with linguistic intelligence (texts, exercices and vocabulary listing).1 The programme is a combination of morphosyntactic taggers and parsers based on partial parsing techniques (Constraint Grammar) with developed linguistic resources in German. The different programmes break down the text into sentences and words. This programme allows students to have a list of new vocabulary which is the result of comparing the existing vocabulary of a lexicon containing a list of basic words (level B1: Zertifikat Deutsch) with the vocabulary in a text. 
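The comparison step just described can be pictured as a set difference between the lemmas occurring in the text and the base lexicon; the sketch below works on plain word lists only, whereas LiLa itself operates on a full morphosyntactic analysis:

def new_vocabulary(text_lemmas, base_lexicon):
    # Return lemmas from the text that are not in the basic-word lexicon, in order of appearance.
    known = {w.lower() for w in base_lexicon}
    seen, new_words = set(), []
    for lemma in text_lemmas:
        key = lemma.lower()
        if key not in known and key not in seen:
            seen.add(key)
            new_words.append(lemma)
    return new_words

# The enlarged lexicon to be used for the next text is then simply
# set(base_lexicon) | set(new_words).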
The new entries are produced with grammatical information. LiLa indicates new nouns together with their gender information and plural form, and for verbs infinitive, present, imperfect and past participle. Other words such as conjunctions are associated with their grammatical categories. All this information is essential for learners in order to improve their use of vocabulary. Consequently the student will then see from the list what they have still got to learn and what they should already know. LiLa provides a second lexicon which is the addition of the first lexicon to the other new words. This second lexicon can be then compared with another text. Example of an automatically produced vocabulary listing: abkühlen, kühlt ab, kühlte ab, abgekühlt Abkühlung, die, Abkühlungen absinken, sinkt ab, sank ab, abgesunken adiabatisch (Adjektiv) aerob (Adjektiv) aggressiv (Adjektiv) Alge, die, Algen altern, altert, alterte, gealtert Ammoniak, das andauernd (Adjektiv) anhand (Präposition) anpassen, passt an, passte an, angepasst Anreicherung, die, Anreicherungen Argon, das The listing can also be shown as one text where new words appear in color. Other words can be tagged, e.g. a spelling error, a proper noun etc. Another programme which isn´t part of the project uni-deutsch.de, but which could also be useful for learning vocabulary is the programme AUTOTERM. This programme, based on statistics, automatically gives the terminology from specialised texts. This could be useful when students are working in specialized fields in English, German and French. 1 http://www.spz.tu-darmstadt.de/projekt_ejournal/jg_07_1/beitrag/haller1.htm 247 Example of a text from the online newspaper Wissenschaft-aktuell.de (http://www.wissenschaft-aktuell.de/): Eintrittspforten in der Hülle von Nervenzellen nachgewiesen [Neurowissenschaften] Durham (USA) - Bisher glaubte man, dass die äußere Zellmembran gleichförmig aufgebaut ist, so dass an jeder Stelle der Zelloberfläche Moleküle in eine Zelle eindringen können. Jetzt haben Wissenschaftler der Duke University bei Nervenzellen ganz definierte Eintrittsstellen nachgewiesen. Die Regulation des Stofftransports in diesen Membranzonen ist für die Funktion der Nervenzelle wichtig. Weitere Untersuchungen der neu entdeckten Strukturen könnten daher helfen, Wirkstoffe gegen verschiedene Nervenerkrankungen zu entwickeln, schreiben die Forscher im Fachblatt "Neuron". Nervenzellen geben über so genannte Synapsen Signale in Form von Botenstoffen an Nachbarzellen weiter. Diese Neurotransmitter binden an Rezeptorproteine an der Außenseite der Empfängerzelle. Die Rezeptoren werden ständig durch neue ersetzt, also in beiden Richtungen durch die Membran transportiert. Indem die Zelle diesen Austausch reguliert, kontrolliert sie die Zahl der Rezeptoren und damit die Empfindlichkeit, mit der sie auf die Neurotransmitter reagiert. Wirkstoffe, die diesen Transport hemmen oder beschleunigen, könnten als Medikamente zur Behandlung von Depressionen, Epilepsie oder Suchterkrankungen eingesetzt werden. Die Arbeitsgruppe von Michael Ehlers wollte verfolgen, wie die außen sitzenden Rezeptormoleküle wieder in die Zelle gelangen. Dazu markierten die Wissenschaftler das Protein Clathrin mit einem Fluoreszenzfarbstoff und gaben es zu Kulturen von Nervenzellen. Clathrin lagert sich an die Stellen der Membran, an denen sie sich einstülpt, um Proteine nach innen zu transportieren. 
Die mikroskopische Auswertung ergab, dass sich das Clathrin nicht gleichmäßig verteilt sondern nur in bestimmten Zonen auf der Membran ablagerte und damit definierte Eintrittsstellen markierte. "Wir haben herausgefunden, dass die Nervenzelle mit einem Zimmer vergleichbar ist, in das nur ganz bestimmte Türen führen", sagt Ehlers. Bisher habe man die Zellmembran eher für einen an vielen Stellen durchlässigen Vorhang gehalten. Während sich die Verteilung der Clathrin-markierten Eintrittsstellen bei jungen Nervenzellen häufig veränderte, blieb sie in reifen Neuronen weitgehend konstant. Möglicherweise sei das ein Ausdruck für die im Alter nachlassende Anpassungsfähigkeit des Gehirns, vermutet Ehlers. Die jetzt nachgewiesene Membranzone könnte nur die erste von weiteren noch unbekannten speziellen Membranstrukturen sein, glauben die Forscher. Author: Joachim Czichos Source: EurekAlert Extract of an automatically produced terminology listing: Anpassungsfähigkeit Anpassungsfähigkeit des Gehirns Ausdruck Austausch Außenseite Außenseite der Empfängerzelle Behandlung Behandlung von Depressionen Behandlung von Epilepsie Botenstoff Clathrin Clathrin-markierte Eintrittsstellen bestimmte Tür bestimmte Zone definierte Eintrittsstelle 248 Example of the dialoge interface showing the nouns and noun groups found in the text: 4 Error-based self learning According to some authors (Prof. H. J. Heringer) students make the same or similar errors which are repeated. If the student knows what type of errors he constantly makes, he could then accelerate the process of language acquisition. This idea is the basis of the development of specialized programmes for foreign language learners. Several programmes are connected together in order to give the structure of a German sentence. The different information which is given by the programmes is finally processed by an additional programme, KURD, which has the task of finding syntactically or morphologically incorrect structures in the German sentence. KURD is "a formalism for shallow post morphological processing" and uses "the input from the morphological analyser MPRO".2 It works on grammar checking and style control (German & English). This programme uses several operators such as unification and deletion to add and delete information in the linguistic representation of the sentence in order to mark one or several errors in the sentence. Formalism rules have been developed to add information in the analysis of the most common structural mistakes made by learners of German. The new information is then processed by another programme which tags the errors found in the text. 2 http://www.iai.uni-sb.de/iaide/en/kurd.htm 249 Example: The position of the verb in a subordinate clause in German, where the verb must be at the end of the sentence. The error message is "Subordinate clause: verb at the end". 4.1 4.2 Error messages This objective aims at developing the necessary tools for the system to evaluate the task done by the learners and give them feedback on their performance. Qualitative methods (morphosyntactic tagger and syntactic parser) allows a linguistically based analysis of the texts produced by learners. Error messages can be more precise due to linguistic analysis. This method replaces the straightforward right/wrong answer or the comparison of pattern structures and instead provides a detailed message to help the students correct themselves. The type of error message can be adapted to the linguistic knowledge of the user. 
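KURD's rule formalism is not reproduced here; as an illustration only, a toy lookup rule for one of the error types discussed below (wrong strong-verb participles such as "gebrach") might look like this, with the correction table written by hand:

# Toy illustration only; this is not KURD's formalism.
WRONG_PARTICIPLES = {
    "gebrach": "gebracht",        # bringen
    "geschreiben": "geschrieben", # schreiben
    "geworft": "geworfen",        # werfen
}

def flag_participles(tokens):
    # Return (token, suggestion, message) triples for known wrong participle forms.
    errors = []
    for tok in tokens:
        suggestion = WRONG_PARTICIPLES.get(tok.lower())
        if suggestion:
            errors.append((tok, suggestion, "Wrong participle"))
    return errors

# flag_participles("Ich habe das gebrach".split())
# -> [("gebrach", "gebracht", "Wrong participle")]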
Example 1: the genitive form in German. Sentence: “Der Stil des schweizerischen Autors des langen Textes ist langweilig.“ The error message is the following: “too many genitive forms.” Example 2: wrong participle. Sentence: “Ich habe das gebrach.” The error message is the following: “Wrong participle”. Work basis: learner corpora In order to develop rules adapted to foreign learners we had to collect errors made by speakers of other languages. The sentences collected were written by Socrates DAF students and students from Burkina Faso, Vietnam and Russia. 250 We have also used material from Professor Heringer. He has listed different incorrect types of sentences in the book Fehlerlexikon and on his web-site.3 From these sources we have classified different error types in order to help us formalize them. A lot of errors made can be automatically found: spelling errors grammatical errors some valence errors. What we can´t find at the moment in most cases are the valence. Work on this is currently in progress and we hope to resolve it during the coming year. Classification: structure and some examples H stands for examples from Prof. Heringer, I for IAI, S for Socrates students and K for correct structures. The sign + means that IAI is able to detect the wrong structure and gives a corresponding error message, whereas the sign – means that we are not able at the moment to detect the error. 2.4. Wortbildung 2.4.1. Partizip von starken Verben +H: Ich habe Münzen zu den Leuten geworft. +H: Danach hat er ein Rezept geschreiben. +I: Ich habe das gebrach. 2.4.2. Partizip von schwachen Verben +H: Am Sonntag bin ich um 10 Uhr aufgewach. +H: Sofort habe ich eine Wohnung gemiet. -V: Die Stadt wurde überschwommen. 2.4.3. Kein ge- bei nicht trennbaren Präfixverben und Verben auf -ieren +H: Er hat ein Zimmer für seinen Vater gebestellt. +H: Er hat dort Medizin gestudiert. 2.4.4. Falsche Konjugation +S: Man verglicht. +U: Mit Sicherheit geb es auch künftig große Kraftfahrzeuge, aber die Besitzer von kleinen Autos würden lächeln, weil Sie durchkommen viel schneller durch den Straßenchaos. 2.4.5. Komposition +K: Beschwerdeführer +I: Beschwerdenführer -I: Beispielssätze -U: Statt eines 20-Minuten-Taktes sei in Basel ein 6-Minute-Takt eingeführt werden. The result of the analysis can be seen in unix as well as in winword. The userfriendly interface shows the tagged wrong structure and gives a message which helps learners to correct themselves. The programme can also give a correct example or several options of correction. You can find the site at: http://www.uni-deutsch.de. The internet address of the DAF Institute is the following: http://werkstadt.daf.uni-muenchen.de/ home2.htm. 3 http://www.philhist.uni-augsburg.de/Faecher/GERMANIS/daf/Forschung/fehler/analyse.html 251 Conclusion The NLP tools will be a powerful additional means for the acquisition of a foreign language. Computer-assisted-language-learning seems to be a good solution today to the individual e-learning of languages and helps the human corrector, who can then be more aware of semantic and style problems. The CALL must be seen as another option and be added to the other traditional means of acquiring a foreign language. What we have described in this paper, is just a beginning. Special grammar rules for specific grammar or vocabulary exercises, integration and comparison with predefined answers into grammar chapters and dictionary could be the next task in order to customize the learning process. 
Books Garnier S, Tall Youhanizou, 2003 Uni-deutsch.de Report, forthcoming, IAI Heringer H. J. 2000 Fehlerlexikon Deutsch als Fremdsprache, Berlin, Cornelsen Verlag. Internet pages Haller J 2002 Linguistisch intelligente Lehrwerksanalyse. In Zeitschrift für Interkulturellen Fremdsprachenunterricht, 7(1), 6 pp. http://www.spz.tu-darmstadt.de/projekt_ejournal/jg_07_1/beitrag/haller1.htm (stand 02.13.2003) Michael C 2001 KURD, 10 pp . http://www.iai.uni-sb.de/iaide/en/kurd.htm (stand 02.13.2003) Heringer Hans Jürgen, 1999, Aus Fehlern lernen (Multimediales Computerprogramm zur Fehleranalyse, mit Grammatikteil), Augsburg http://www.philhist.uni-augsburg.de/Faecher/GERMANIS/daf/Forschung/fehler/analyse.html (stand 02.13.2003) 252 The company women and men keep: what collocations can reveal about culture Sara Gesuato Dept. of Anglo-Germanic and Slavic languages and literatures, University of Padua, Italy 1. Introduction Habitual ways of talking about given phenomena show people's attitudes to and conceptualizations of them: the meanings associated with certain events or situations determine the cultural significance of given topics within a group of speakers (cf Fairclough 1992, Hodge and Kress 1993). In particular, a language's keywords give insight into the culture of its speakers (e.g. Krishnamurty 1996, Partington 1998, Wierzbicka 1999); but actually all the words repeatedly circulating in a community form constellations of repeated meanings conveyed in typical patterns that produce conventional expressions and opinions. Therefore, any word recurrently used in interaction may carry social implications, even if these may go unnoticed among the interactants (see Tognini-Bonelli 2000: 125). To understand the social salience and semantic nature of the terms exchanged in verbal interactions, it is useful to look at their recurrent patterns of usage, namely their occurrences in co-text (see Stubbs 2001). In this paper I examine collocations of WOMAN, MAN, GIRL, and BOY (from the Usbooks, Ukbooks, Time and Today components of the Cobuild on-line corpus) to determine if their denotational symmetry (identification of adult and non-adult human beings of the female and male sexes) is paralleled by similar patterns of usage (association with the same semantic fields and embodiment of the same semantic roles in given syntactic environments), i.e. if they belong to similar discourse domains. 2. Most significant collocates The lexemes under examination do not share all of their most representative collocates (i.e. words they co-occur with in a statistically significant way), which reveals that they are part of different domains of discourse. The top 50 significant collocates of WOMAN include terms that fall within the discourse domains of physical attractiveness (beautiful, pretty, attractive), age (young, old, older, elderly, aged), physical appearance (body, dressed, looking, white, black), family and personal relations (married, pregnant, singles), women's liberation (rights, movement), religion (pontiff, ordination), (groups of) people (woman, women, girls, man, men). Other collocates comprise said, drowned, listening, named, division, percent, working, golf, title, doubles, lives, and won. The wordforms man and men co-occur with words exemplifying some of the semantic fields identified for WOMAN: age (old, older, aged, young), physical appearance (white, black, tall, big, fat), family and personal relationships (married, singles), and (groups of) people (woman, women, men). 
Other collocates are relevant to distinct domains of discourse: negative states (blind, dead, poor), the military (officers, enlisted, squad), sex (gay, sexual), negatively connoted physical force or action (violence, arrested), non-physical attractiveness (rich, kind, nice, good, wise). Among the remaining collocates, some are common to WOMAN as well (division, percent, working; labeling: title, named), while others are peculiar only to MAN (burned, made, isle, saw, matc). The data show that both WOMAN and MAN are described in terms of such notions as age, physical appearance, family and personal relationships. On the other hand, WOMAN and MAN partly occupy different semantic space. First of all, MAN is relevant to a wider range of semantic fields, which points to a more versatile use of its wordforms. In addition, even when both lexemes are relevant to the same domains, these may be highlighted in different ways. For example, in the domain of physical appearance the collocates of MAN stress the notion of ‘size’ while those of WOMAN that of ‘overall qualitative effect', as if to suggest that a person's visual impact on others is more relevant to women than men. Moreover, the euphemistic term referring to old age (elderly) is reserved to WOMAN, which signals a mild taboo in the co-occurrence of the notion of ‘femaleness’ and ‘age'. Furthermore, WOMAN co-occurs with a greater variety of terms denoting people, indicating that relationships or interaction are more often discussed with reference to women. Finally, WOMAN and MAN occur in partially complementary distribution; the former lexeme is associated with the domains of physical attractiveness, religion, civil rights, and involuntary actions (drowned, won), the latter with those of non-physical attractiveness, the military, violence, and voluntary actions (burned, made), which shows that distinct areas of activity appear to be salient to the two gender groups. The terms occurring with GIRL and BOY reproduce patterns very much like those outlined above. The collocates of GIRL are representative of the semantic fields of age (little, year, young, teenage, baby, aged, old, adolescent), personal relationships (boy, girl, father, married, lover, boys, women, girls, men), physical attractiveness (lovely, pretty, beautiful, beauty), negative state (poor, 253 raped, died, dead, killed, murdered, screaming), positive state (nice, golden, jolly, pregnant), labeling (named, called), and physical appearance (black, white, haired). The collocates for BOY refer to the domains of age (little, old, year, young, baby, aged, teenage, older), size (small, big), relationships (girl, girls, boy, boys, men, brigade, club, posse, scouts, scout, mother, father), negative state (died, poor, dead, bad), positive state (golden, good), activities (school, game, play, jobs), labeling (named, called), and outdoor events or fun activities (beach, pet, shop, do). The collocates reveal that GIRL and BOY show up in partially different discourse domains. The former lexeme is relevant to the topics of physical appearance and attractiveness, which suggests a young woman's outward characteristics are paid attention to. GIRL is also often associated with participles which denote dangerous or hopeless conditions and convey ideas of helplessness and victimization. 
Finally, GIRL may occur in contexts of talk focusing on adult relationships (see married, lover, pregnant), which indicates that such notions are considered appropriate even to young females and/or that GIRL is also used in partial contradiction with its denotational meaning, that is with reference to not very young women. The lexeme BOY, on the other hand, is more relevant to the notions of behavior (good, bad), bonding (brigade, club, possse), active involvement (game, play, jobs) and associated with outdoor contexts (scout, beach). This lexeme, therefore, occurs in contexts focusing on the idea of direct participation in concrete activities. In general, the collocates of WOMAN, GIRL, MAN, and BOY show that these lexemes share certain contexts of use, but also that there are discourse domains pertaining only to WOMAN and GIRL, and others only to MAN and BOY. The distinct sets of collocates suggest that WOMAN and GIRL are more frequently associated with notions of passivity and physicality, and MAN and BOY with those of activity and general behavior. 3. Descriptive labels The descriptive labels attached to a noun specify the characteristics of its referent and actually narrow it down by identifying a subset of it. A language's typical descriptive labels are attributive adjectives, but a similar categorizing function is also played by post-modifiers like relative clauses and of-headed prepositional phrases. 3.1. Attributes Adjective+noun combinations represent a type of colligation or association between grammatical categories. When examined in high numbers, they can be semantically revealing because they show what properties (as encoded in the pre-nominal adjective or other modifier) are frequently or by default ascribed to what entities (encoded as nouns). The following findings are based on a sample of the relevant Cobuild data (100 concordances per each wordform woman, man, girl and boy), systematically selected through the software running the corpus (one every a given number of occurrences determined by the program). A comparison of representative samples of attribute+BOY vs. +GIRL combinations reveals that most of the attributes preceding both lexemes are of three types: they mention physical characteristics (little, grown, blind), refer to behavioral traits (good, brave) or identify the group membership of the relevant referent (Jewish, factory). But despite this basic similarity, the collocates of GIRL and BOY also reveal differences in the usage of those terms: GIRL is associated with the notion of beauty, encoded through various terms (beautiful, lovely, pretty), four times as frequently as BOY, which co-occurs with pretty only once; in addition, only GIRL is occasionally associated with attributes that signal its referent is thought of as an adult engaged or involvable in heterosexual relationships (call-girl, flirtatious girl, and had a steady girl, is going to find a nice girl and marry her). On the other hand, more varied terms are available for a description of the behavior and personality of the referents of BOY (bully, charming, different, dull, fine, funny, nasty, problem, suitable, wonder) than those of GIRL (dangerous, hard, material, naughty, sympathetic). Furthermore, if the attributes referring to physical attractiveness are excluded, it turns out that those referring to boys are also more evenly distributed between the categories of positive and negative connotations. 
Finally, BOY is accompanied by attributes referring to intellectual abilities (clever, dull, sharp) three times as frequently as GIRL (smart). The attributes preceding WOMAN and MAN are more varied than those of GIRL and BOY, and identify, e.g., states or conditions experienced by their referents (pregnant, dead, forgotten), their area of origin (Palestinian, Yale), permanent properties (Catholic, religious), traits of behavior (liberated, gambling), physical and emotional characteristics (tall, big, broken, loud, naked), and group or category membership (neighbor woman, company man). However, although the general patterns of usage are similar, the specifics of these descriptive labels point out subtle differences in the ways people talk about women and men. Attributes of WOMAN often signal negative or demeaning characteristics of the lexeme's referents (as revealed in other contexts as well; see sections 4.1. and 5.2.), namely women's past or present role as victims (liberated, helpless), mental or emotional instability (deranged, vengeful), lack of control over their behavior (loud, broken), limited intellectual ability (stupid), socially marginalized love life (divorced, spurned) or low-qualified professional position (cleaning woman); the salience of their physical attributes (beautiful, fair, fine, full) and the non-prototypicality of their playing certain social roles (woman boss, neighbor woman) are also highlighted in the concordances. Neutral or positive attributes are infrequent (black, tall, rich, real, sensible). Attributes of MAN signal neutral, positive, and only occasionally negative characteristics of its referents, which indicates that people's discourse practices provide a more balanced picture of men's personalities and behavior. Pre-modifiers of MAN emphasize men's importance and power (big, powerful, wealthy), appropriate conduct (cautious, devout, helpful, proper, religious, spiritual, suitable), general likeability (great, fantastic, decent, lovely, handsome, dishy), education (cultured), cleverness (witty), independence (bold, free [cf. liberated of WOMAN]), modernity and authenticity (modern, plain man of the people), satisfaction (chuffed), and emotional success (ladies man). Negative attributes may refer to conditions the referent of MAN is not responsible for (blighted, harassed, dying), denote reprimandable behavior (marked, violent, strange, gambling) or point out a lack of socially desirable qualities like out-goingness (timid, shy). Neutral attributes are those identifying types or groups of men performing specific functions in given contexts (ball man, in the context of badminton; best man, in the context of weddings; radio man, in the context of communication) or temporary or permanent physical characteristics (naked, hooded, white). In general, while a variety of attributes are available for describing both females and males, those associated with the former are more frequently negatively connoted, restricted to fewer domains of discourse or focused on the decorative function of women. On the other hand, the attributive collocates of MAN are typically more numerous and varied, reveal positive and negative qualities of their referents in equal measure, and highlight either socially important roles played by men or their intellectual values. 3.2. Of-headed prepositional phrases More of-headed prepositional phrases modify BOY (100) and MAN (425) than GIRL (82) and WOMAN (218). Most can be grouped under the same headings, i.e.
they classify their referents in similar ways. Those attached to boy and girl mostly refer to age (When I was a boy of 13; A girl of four saved her family yesterday); only 3 of the concordances of girl and 1 of those of boy refer to other characteristics of their referents (Alex is a girl of very high morals; A Girl of the Streets; he was a boy of mercurial nature). The phrases describing boys and girls are more varied and some are shared: they include reference to age (Dr. Stuart Horner said girls of 12 could be donors; at Middlesex University found that boys of 13 or 14 who played computer), place of origin (Fox suggests that the girls of Jerusalem speak the first part; compared to the big boys of Hollywood), and group membership (and said the girls of the horde learned their walk; but even more the boys of his own class). In addition, the concordances of girls exemplify only one other kind of descriptive phrase, which indicates what girls may possess (a description of bombed houses in The Girls of Slender Means). The other concordances of boys, on the other hand, specify a few more types of characteristics of their referents: character (The right type of boys of a type - to appreciate her), job (the smart boys of the press), epoch (all the other little boys of his generation), condition of being somebody's “property” (They have four boys of their own), identification with reference to a topic (the seven Hard Flint boys of the Navajo myth; one of the boys of Old Woman Running). The concordances of both woman and man reveal the same types of classifications: both sets describe their referents in terms of their age (anyone who thinks that a woman of 40 plus is undesirable; Just as a man of 70 probably can't lift as much), good qualities and bad traits (she was a woman of considerable literary ability; a woman of dubious loyalty; he was not a man of achievement, but a man of promise; a man of disgusting morals), physical features (You're a woman of medium height; if it is possible for a man of his size to be peripheral), group membership (a woman of the horde; a man of the tribe called the Westfolk), possession (the title of her book A Woman of Substance; would be a difficult matter to find a man of any property in the country), character (a woman of her word; a man of constant surprises), origin (a woman of England called Agnes; a man of the north), importance (a woman of rank; might well have become a man of destiny); only woman, however, has concordances describing its referent as a man's “property” (He's the sort that needs a woman of his own), and only man has concordances mentioning its referent's work (curator of the museum had to be a man of the pen and of the book; We might have had Msgr Montini a man of letters) and epoch (Only a man of Colonel Sartoris’ generation). In general, phrases that describe non-physical characteristics make up about 60% of the concordances of man and 42% of those of woman. Furthermore, of the concordances mentioning positive qualities, only 5 (i.e. about 8%) of those relevant to woman include intensifiers (a woman of singular attainments; A woman of strong convictions!), while 72 (i.e. about 45%) of those relevant to 255 men are pre-modified in this way (a man of charismatic personality; a man of considerable intelligence; A man of evident refinement; A man of formidable intellectual gifts; A man of great courage and determination; a man of immense patience; a man of international calibre). 
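The percentage figures quoted in this subsection result from manually assigning each of-headed concordance line to a semantic category and then tallying the category labels; a minimal sketch of the tallying step in Python (the labels and lists below are purely illustrative and are not the actual annotations):

    from collections import Counter

    # one manually assigned semantic category per concordance line (illustrative only)
    man_of_labels = ["age", "quality", "character", "group", "possession", "quality", "importance"]
    woman_of_labels = ["age", "physical", "quality", "group", "age", "possession"]

    def category_shares(labels):
        """Return the percentage of concordance lines falling into each category."""
        counts = Counter(labels)
        total = len(labels)
        return {category: round(100 * n / total, 1) for category, n in counts.items()}

    print(category_shares(man_of_labels))
    print(category_shares(woman_of_labels))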
The concordances of women and men reveal the same patterns as those identified above for woman and man: their embedded prepositional phrases describe the referents of those wordforms along very similar lines, but to different degrees. Age is mentioned 26 times (i.e. about 17%) with reference to women and 11 times (i.e. about 4%) with reference to men (can be used on women of all ages; four units a day by men of all ages); physical characteristics show up 4 times with reference to women (i.e. about 2.5%) and 3 times (i.e. about 1.1%) in the case of men (This chapter is meant for women of every size; place the onus on men of Cleese proportions); the time or generation a person belongs to is indicated more often in the vicinity of women (13 times, i.e. about 8%) than men (8 times, i.e. about 3%; the women of that 1960s generation; men of an old generation); the geographical, national and/or family origin of women is referred to more frequently than that of men (58 times, i.e. about 37% vs 57 times, i.e. about 21%, respectively; All the women of the valleys had found some pride; labourers assumed that men of Anglo-American origins); group membership, on the other hand, appears to be more salient to male than female referents, the frequency of occurrence being 83 times (i.e. about 31%) for the former and 31 times (i.e. 20%) for the latter (that was unheard of among women of the aristocratic classes; to the men of the B company it felt as if); the specification of people's non-physical qualities is seemingly equally relevant to women and men, since it is to be found in 19 concordances of women (i.e. about 12%) and in 44 concordances of men (i.e. about 16.5%; it stresses the importance to women of good qualifications; was led by men of enduring stature); however, there are also prepositional phrases accompanying men that mention other non-physical characteristics of their referents, like profession (13 times, i.e. about 5%; we men of business), type, personality and/or behavior (27 times, i.e. about 10%; For men of the stamp of Scheer), level of importance (9 times, i.e. about 3.5%; You find it a lot with men of power); overall, therefore, men are described in more varied detail than women through of-headed prepositional phrases. Finally, the positive characteristics of women are enhanced by intensifying adjectives slightly less frequently than those of men, namely 10 times (i.e. about 6%) vs 15 times (i.e. about 5%), respectively (women of strong will; as free men of definite and sincere worth). In conclusion, the of-headed prepositional phrases post-modifying GIRL, BOY, MAN, and WOMAN present comparable, but not identical, kinds of specifications; for one thing, some of these do not occur with the same frequency across the concordances: physical attributes appear to be more salient to female than male referents, while non-physical characteristics are more tightly associated with male than female referents; in addition, certain types of prepositional phrases are not always available to describe females, for instance those mentioning jobs; finally, the positivity of certain attributes tends to be more strongly emphasized (through adjectives) when the words encoding them collocate with terms marked for maleness. In brief, the same types of concepts used to describe their referents are not equally salient for GIRL, BOY, WOMAN, and MAN. 4. Roles and situations A clause is the verbal representation of an event, that is the linguistic rendering of a phenomenon in the world.
It pivots around a predicate, which represents what is going on, and may include nominals, phrases, and adverbials that refer to entities and circumstances that contribute to making up that event or situation. It is often possible to verbally represent the “same” event in various ways; the specific lexico-syntactic wording chosen will highlight certain aspects of it and obscure others (see Halliday 1994: ch. 5). Examining clauses in which WOMAN, MAN, GIRL, and BOY occur will help identify what events and situations they are most typically associated with and what roles their referents are represented as embodying. 4.1. Relative clauses A relative clause represents an event or situation in which the referent of the relative pronoun is a main participant. In the corpus, I looked for examples of who-headed relative clauses, in which, that is, the referent of the pronoun would most often be the subject or direct object of the clause, and thus likely to play the semantic role of agent or patient, respectively. From the output of each of my queries, I had the system select a random representative sample of 100 concordances. The result of the girl+who search presents three interesting findings: (1) 34 of the concordances describe negative situations that the referent of the relative pronoun is affected by or, occasionally, responsible for (strongly enough to want to murder the girl who'd informed them; the five-year-old girl who escaped the massacre; a troubled girl who felt pressures at home; I have a neurotic girl who is panic-stricken; a severely disabled girl who is unable to speak; I had a girl who was a Down's Syndrome's baby; the subject of the aggressive little girl who had had a ‘perfect birth'; a young 256 Monk shelters an Albanian girl who is being hunted for the murder). (2) Moreover, 7 concordances show that girls may be talked about with reference to their bodies, especially their physical attractiveness, occasionally seen as a source of problems (period: puberty. What happens to a girl who begins to menstruate at 10; She was really pretty like that. A girl who feels pretty has a tendency; a girl with breasts and a girl who's skinny; in the case of a teenage girl who is ashamed of too little breast), and 9 more associate girls with the notions of romantic and/or sexual involvement, that is in (a not always happy) relation to some man (We found a Romanian girl who is married to a local man; for a little girl who loved him; a girl who fails to marry her suitor; a Mexican girl who eventually jilted him). 
(3) Finally, 79 concordances reveal that the subject of the relative clause plays the role of an experiencer of situations, events or emotions, or the patient of others’ actions, which may even be unpleasant (a Colorado girl who becomes an opera singer; A girl who lived with her parents; She was a vivacious, outgoing girl who was always enthusiastic; a girl who will never forget a friend; a girl who shares his interest in sport; A teenage girl who collapsed; The girl who died was her friend; the handcuffed girl who was being pushed; A girl who was raped and strangled); on the other hand, when playing the role of an agent, the situation the referent of girl is in charge of is often negative, vilifying or restricted to the field of love-relationships (A heart-swap girl who battled against all the odds; did not bode well for the girl who had invited Kurt to lunch; Except that the girl who told the tale had been expelled; the black girl who'd killed herself; bordello in Lima, Noriega asked a girl who'd just finished making love; Paula shot to fame as the girl who ditched all her lover's gifts). The 100 concordances of girls+who reveal very similar patterns: 32 depict negative events or situations (it may be that girls who are weak academically; his daughter was attacked by girls who are fellow pupils; to throw stones at the Chinese girls who broke world records; girls who develop eating disorders fail; This was very hard on the girls who had paid for their machines; the poor and unfortunate girls who walk the streets; child labor and raw immigrant girls who would work for next to nothing; nervous girls who never laugh); 17 concordances mention girls in relation to sex, the emotions, or careers in the show business (while girls who are sexually active are loose; more than half of the girls who become pregnant before age 18; and happily-ever-after endings, girls who buy into this script; suspicious of the girls who want commitment; one of the other girls who crossed his bachelor path; scores of local girls who ended up marrying; girls who are cheerleaders; where modeling is at now, for girls who could once only dream of owing; Hollywood's pick of the girls who look set to become tomorrow's); 80 lines of concordance portray girls as experiencers or patients, sometimes in negative contexts, and not always in control of themselves (girls who are average in development; girls who have good communication about; girls who had sat in that chair; three girls who arrived rather apprehensively; girls who looked pale and unhealthy; delicate girls who enjoy shopping; those girls who most fully accept; memories of the girls who'd drowned; girls who were adopted in the 1960s; girls who were sacked; quite young girls who must have been the victims; fair young girls who trembled; the three girls who were panicking); when presented as agents, girls do not appear to always cut a good show (girls who drop out most often do so; He caught two other girls who were cheating; they were middle-class girls who, to revolt against; there are far more girls who say they would rather be boys; ran a network of call girls who serviced French high society). 
Of the 200 collocations of the combinations woman+who and women+who, 49 present women as facing or, occasionally, causing difficult circumstances (frustrating it must be for women who couldn't do what they wanted; I know there are many women who endure similar treatment; women who have abnormal Pap smears; only against illegal practitioners, women who obtained abortions; talking to the other women who shared some of their concerns; author of Women Who Love Too Much; mortally wounded a young woman who was watching the crowds; an unmarried, unemployed 24-year-old woman who is struggling to feed her four; warned against women who are always fawning; in fact, it was the woman who instigated the affair; a woman who is forever scaring; the woman who made his life a misery); 157 concordances depict women as being passively involved, as experiencers or patients, in situations determined by external circumstances or run by more active participants; the verbs most commonly accompanying the wordforms woman and women are indeed be and have (women who are infected with HIV; women who have had their ovaries removed; battered women who explode into loss of control; the number of women who are encouraged to do so; a young woman who disappears after meeting him; woman who had been infected); however, when presented as agents, only occasionally are women presented as involved in negative situations (the woman who'd rented Samelu his brickworks; the woman who had administered the test; the woman who tried to kidnap April; for two young women who have booked to go abroad; the older women who supported Mother Marie; Only one in ten women who runs a business works fewer); 51 occurrences show women in association with the domains of emotions, sex, and physical appearance (a young Roman woman who begins a passionate affair; was one of the women who adored their husbands; a woman who, along with sexual aversion; a fair number 257 of women who have never had an orgasm; lovely succulent woman who should be eaten as whole; a raven-haired, proud-breasted young woman who has risen from a humble; the woman who brings back the elegance; particularly for women who may be watching their weight). Also, the notions of negativity and passivity often co-occur and the semantic fields that are specially “reserved” to women are negatively connoted (women who are incapacitated; unmarried women who aren't prepared to occupy; women who have been widowed; women who have bulimia; a woman who discovers a breast lump; single woman who was attacked; woman who is afraid of flying). The 200 concordances of the wordforms boy and boys also include reference to negative contexts, but less frequently than those of girl and girls, namely on 30 occasions overall (angry, needy and frightened boy who looks to women to mother him; Gregory was a loner, an Afrikaans farm boy who had been neglected; eight-year-old retarded quadriplegic boy who was murdered; motivate pupils, especially boys, who fail at school; prison sentences imposed on the two boys who killed James Bulger; neighbours also joined boys, who had set fire to curtains); interestingly, these negative contexts occasionally involve females more directly than males (it is girls rather than boys who are at greater risk; a boy who saw his grandmother killed). 
In addition, only 10 are the concordances which somehow include reference to sex, physical appearance or feelings (many teenage boys who feel homoerotic urges; for high school boys who are too skinny to play; A 16-year-old boy who's never had a girlfriend). Finally, more numerous in these sets of data, namely 99, are the concordances that include representations of events in which boys are actively participating (incredibly wiry boy, who could climb the highest pines; from a boy who first lifted a tennis racket; said the boy who had brought the coffee; I had trained a boy who had won a scholarship; punched and kicked by a boy who had knocked on her door; crowds are made up of boys who play sports; were moistened by two small boys who pressed sodden cloths against; and the boys who take the field will be very); 15 of these portray boys as involved in or causing problems (an eleven-year-old who has stolen some dried figs; a wild boy who poached rabbits; the boy who took the most drugs; care for boys who had committed offences; pension book stolen by boys who tricked their way into his East). The wordforms man and men display collocational patterns similar to those of boy and boys: 45 of the 200 concordances considered refer to negative situations (A man who appeared to be drunk; What rung for the man who has been unemployed; Here we are in a room with a man who's been charged with murder; about the same frequency in men who are insuline-dependent; one of the families of the SAS men who died on the mission; in black balaclavas attacked two Asian men who were lying on the floor). Reference to the domains of love or sex life and beauty are limited to 17 concordances (he's the man who sleeps his way to the top; I am the type of man who can give to more than one woman; Clearly, a man who feels sexually attracted to; can't compete for women with men with flourishing forelocks; a troupe called The Hollywood Men, who dance in skimpy briefs; twice as likely to get divorced as the men who married later). Furthermore, 97 are the concordances showing men taking an active role in the events being represented, and only a dozen of these refer to negative circumstances (the man who brought you The Sound of Music; why I'm the man who helped fill the missing link; lost their lives trying to save a man who jumped in the water to rescue; a chapel man who later spoiled himself by drink; men who ally themselves with England; were written by men who had fought in it; half of the men who immigrated to Australia; prosecute the men who sexually harass me at work); only 4 are the instances of passive clauses (the man who had been certified dead; two men who had been killed in the quarry). Finally, quite frequent, i.e. 40, are the concordances that show men in a positive light, whether they are presented as determining or simply experiencing the positive circumstances relevant to their lives (This is a young man who believes in the future; the son of a man who had risen from beginnings; paid their last respects to the man who made Jaguar a world leader; and so is the man who now seeks supreme power; this is the man who won over the public; He was a loving man, a man who would do anything for anybody; the universal regard for men who had risked their lives to defend; Toward the men who held national power; eloquent men who stand above the crowd; two men who turned their dream into reality; the only two men who could win the title). 
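The relative-clause data in section 4.1 were obtained by querying the corpus for wordform + who combinations and then drawing a random representative sample of 100 concordance lines per query. A minimal sketch of such a retrieval and sampling step, assuming the corpus is available as a plain-text file (the file name, context window and query code are illustrative; this is not the Cobuild interface itself):

    import random
    import re

    def concordance_sample(path, pattern, sample_size=100, window=60):
        """Collect lines matching the pattern and return a random sample of concordance snippets."""
        regex = re.compile(pattern, re.IGNORECASE)
        hits = []
        with open(path, encoding="utf-8") as corpus:
            for line in corpus:
                match = regex.search(line)
                if match:
                    start = max(match.start() - window, 0)
                    end = match.end() + window
                    hits.append(line[start:end].strip())
        if len(hits) <= sample_size:
            return hits
        return random.sample(hits, sample_size)

    # e.g. who-headed relative clauses with "girl" as antecedent
    for snippet in concordance_sample("corpus.txt", r"\bgirl who\b"):
        print(snippet)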
The above data suggest that while WOMAN, GIRL, MAN, and BOY share similar contexts of usage, they do not occur in those contexts with the same frequency: it is more often the referents of WOMAN and GIRL than those of MAN and BOY that appear to be involved in difficult situations, affected by external circumstances or other agents’ interventions, and associated with discourse domains that draw attention to a female's physical attractiveness and her role in relation to a male. 5. Semantic preferences The notions people exchange in interaction when talking about females and males can be revealed in part by examining their collocates in syntactic environments likely to introduce topics for discussion. Two such constructions are the genitive and for-headed prepositional phrases. 5.1. Genitive The semantic relation holding between ideas expressed by nouns can be of various types. When these nouns are grammatically linked by the genitive, they can express such notions as possession (e.g. my neighbor's house), personal relation (e.g. Frank's son-in-law), classification (e.g. a children's story), origin (e.g. John's telegram), authorship (e.g. Wilbur Smith's latest bestseller), agency (or subject-verb relation, e.g. Mary's consent), passive experience of an event (e.g. the town's destruction), predication of a state relative to an entity (or subject-complement relation, e.g. Paul's happiness), and more. Of all the types of phrases examined in this study, the genitive is the one that reveals the least striking categorial differences between the collocates of the pairs of gender-marked lexemes. The concordances obtained for the wordforms girls (92) and boys (86), and representative samples of 120 concordances for girl, boy, woman, women, man, and men show that the words following the lexemes in question (almost exclusively nouns or noun phrases) convey information about the group membership, physical characteristics, thoughts or attitudes, feelings, states, and qualities of the possessor, and also about the objects possessed or actions performed by them, the events, experiences or abstract notions relevant to them or, finally, their relationships with others. Neither the concordances of boy nor those of girl exemplify the notion of group membership. All the other concepts are represented, although to varying degrees within each set. For girl and boy, respectively, the frequencies of occurrence of the various notions, given as percentages, are as follows: event (about 3% vs 6%; Girl's 24-hour van rape ordeal; Boy's 130mph joyride), relationships (about 13% vs 7%; Last night the girl's grandmother said; sent the boy's aunt Marlene in her place), physical characteristics (about 42% vs 26%; the desert area where the little girl's body had been found; He crossed to Mack and touched the boy's arm), actions (about 14% vs 10%; to detect a euphemism for the girl's ‘welcoming’ her lover into her; for his appointed visit.
The little boy's coming back), feelings (about 3% vs 0.8%; the girl's love is said to be better; the latter start fighting for the boy's affections), thoughts (about 4% vs 6%; because of a girl's mistrust of one's courtiers; one boy's desire to become an actor), states (about 2.5% vs 4%; to ensure the girl's safety; she said the boy's presence made teaching the class), qualities (about 0.8% vs 4%; a fraction of a 13-year-old girl's energy and concern; If you can tell a boy's character from the company he), objects (about 9% vs 13%; Leaphorn checked the girl's duffel bag; The boy's bicycle was found in a pigsty), and abstract notions (about 3% vs 9%; only partially to do with a girl's psychological make-up; it became the boy's duty). It appears from the above data that certain types of notions cluster more frequently around girl (actions, feelings, relationships, and especially physical characteristics), while others more typically accompany boy (events, conditions, thoughts, qualities and abstract notions, and especially possession of objects). In addition, more frequently associated with girl than with boy (i.e. about 26% vs 3%) is the reference to all of the following domains taken together: love (may be rivals for the girl's affection; men's and boy's [sic] dominance in sexual relationships), sexual attributes (spending the night between the girl's breasts; keep secret a boy's description of his genitals), clothing (a girl's dress is displayed like an exhibit). Finally, differently from other sets of concordances, reference to negative situations is only slightly higher in the case of girl than boy (i.e. about 26% vs 22%; Moors murder girl's grave wrecked by ‘sick’ gang; Four charged after ecstasy boy's death). The frequency of occurrence of the semantic categories identified above is markedly different among the collocates of both girls and boys. The genitives of girls do not include reference to the notions of qualities or thoughts, and mostly concentrate on those of events (about 16%; choir concert, night out, reunion, race, party) and “possession” of objects (about 36%; bedroom walls, cabin, dressing room, home, independent school, tent, toy). On the other hand, the concordances of boys do not represent the category of state, while they focus on those of group membership (about 25%; Brigade, Choir, club, network, soccer team) and possession of objects (about 37%; boarding school, camp, dormitory, house, locker room, old baseball caps, test room). Also, the concordances of girls and boys represent negative situations much less frequently than those of girl and boy (both about 8%; I saw the girls’ tent had collapsed; too familiar with the terrors of the boys’ locker room). Finally, the association of the referents of the above wordforms with the notions of beauty, love, and sex is reduced to about 10% in the case of girls (he knows that the girls’ beauty will burn out a lot quicker; all Catholic girls dream of falling in love with; Girls’ interest in sexuality) and not attested at all in relation to boys. The data about WOMAN and MAN reveal that the wordforms of these lexemes share only some semantic preferences. 
The concordances of both woman and women do not refer to the notions of feelings or states, but while the former focus on actions (about 12.5%; actions, annual consumption, answer, becoming pregnant, behaviour, knitting, laugh, leading a caravan) and especially physical appearance (about 56%; acne, age, appearance, body, dark hand, face, lap, mouth), the latter concentrate on the discourse domains of abstract concepts (about 8%; health, issues, politics, rights, 259 roles), group membership (25%; City Club Members, Health Book Collective, Health Education Network, liberation movement, political organization, Political Caucus, teams), and especially sports events (about 37%; championship, lives, national indoors singles, polo tournament, World Cup downhill races). On a parallel with the concordances of GIRL, those of WOMAN too show that representation of negative contexts and reference to physical appearance and sexuality is more frequent in association with the wordform woman (about 34% and 23%, respectively) than with the plural women (about 10% and 4%, respectively). Examples include: the concentrated toxins in a woman's blood; women's abuse, the great waste of women's potential; skin care products improve a woman's appearance; sexual parts of a woman's body parts, possibly “vulva”; women's lingerie. The data about MAN show that its singular wordform disregards the notion of group membership, while it prefers those of action (about 9%; abuse, activity, advances, bluff, historic quest), abstract concepts (10%; destiny, dignity, ego), possession (about 19%; books, country, house, land, name-tag, poster), and especially physical attributes (about 28%; abdomen, breath, cool eyes, face, hair, legs, shirt, shoulder, tonsured head), only a few of which, however, suggest attractiveness. On the other hand, the concordances of the plural men do not refer to qualities or relationships, but draw attention to abstract concepts (about 8%; title, greater access, sexual problems, studies), groups (about 16%; amateur squad, teams, Bible Study group, division, group), actions (about 27%; abuse, bidding, further domination, practice, projection and externalization, violence), events, especially related to sports (about 27%; 400m hurdles, doubles, game, lives, singles). These data thus reveal that both man and men show a preference for the categories of actions and abstract concepts, although only for the latter to a comparable degree. Another similarity has to do with the frequency of occurrence of negatively marked contexts, which is about 16% for man (there was the problem of the man's face, distinctive even in death) and about 17% for men (within the complex relationship of men's sexual abuse and men's violence), and which in most cases refer to problems caused, rather than experienced by, men. On the other hand, reference to men's attractiveness or sexuality is less frequent in the case of man (about 7%; blocking the man's ability to get an erection; a means of bridling man's unbridled lust) than in that of men (about 15%; In the ballet world men's bodies are stunning, sinewy; an intro for some men's mag sex in the kitchen). In conclusion, in these sets of data it is still the referents of WOMAN and GIRL who are more frequently talked about in terms of sexuality, physical attractiveness, and romantic involvement; however, interesting correlations can also be found between their singular and plural wordforms and those of MAN and BOY. 
For instance, in the singular both GIRL and BOY are more often associated with the notions of physicality, sexuality, attractiveness, negativity, possession, and action. Similarly, both WOMAN and MAN show a preference for the notions of physicality, action, and negativity in the singular, and for group membership and events (mostly sports events) in the plural. Some conceptual categories turn out to be favored across wordforms; for example, that of possession is a “favorite” with girl, girls, boy, boys, women, and man; that of action with woman, man, men, girl, and boy; physicality with girl, woman, man, and boy; that of abstractness with man, men, and boy; and finally, that of group membership with women, men, and boys. Therefore it appears that in the concordances relative to the genitive both gender and number are relevant grammatical categories for the identification of similarities and differences in the patterns of usage of WOMAN, MAN, GIRL, and BOY. 5.2. For-headed prepositional phrases The preposition for potentially encodes a number of semantic relationships between two concepts; for instance, it may indicate that something is intended to be given a specific person (e.g. Let's save some cake for Phil), that someone is in favor of a person (e.g. Who are you going to vote for?), that a characteristic is surprising when considering what a given person is like (e.g. That was fast for you), that something is an intended action (e.g. The plan is for us to leave early). For-headed prepositional phrases, therefore, may provide a convenient window to the range of semantic preferences linkable to the referent of the object of that preposition. The Cobuild corpus reveals that for-headed prepositional phrases embedding the lexemes GIRL or WOMAN tend to be accompanied by words that can be grouped under a few broad notions: appropriateness of conduct, adequacy of conditions, problematicity of circumstances, unusuality of achievement, (cause of) disruption of relationships, and feelings. Such phrases are often preceded by adjectives or expressions conveying judgments about what a woman could, couldn't, should or shouldn't do (advisable, indispensable, fitting role, acceptable norm, offence, unbecoming, unthinkable, risky), expressing opinions about what is acceptable or condemnable in her behavior (absurd, believable goal, lunatic) or indicating restrictions on her actions (unusual job, no place, only sphere, wasn't a job, what else is there, ain't the right kind of life, not the kind of garment for a girl to wear in a lonely search, too much). Alternatively, such phrases are preceded by evaluative descriptions of the contexts in 260 which women act and of what they represent to them; often, these are negatively connoted (torture [reference is to age], obstacle, unusual, urgency, important, nostalgic event). Several of the circumstances mentioned in relation to such prepositional phrases are explicitly negatively marked (cause of misery [reference is to age], to have her face smacked, complicated training, lonely life, ultimate time possible, problem, too small, dissatisfaction, concern, chances … are remote, fears), which reveals that problematic situations are frequently mentioned in association with women, and more frequently than neutral or positive ones (conditions of life, pleasurable activity, splendid effect). Moreover, when positive qualities are mentioned, these are presented as exceptionally related to women (she could run well, professional, not bad, good question, convincingly). 
A woman is also not infrequently mentioned as the “indirect” cause of the disruption of a relationship, thus being cast in the ambiguous role of a new beneficiary of the loving care that someone else is being deprived of (left me, left his wife, leaves his male lover, left home). Finally, when playing the role of a beneficiary, the referent of GIRL and WOMAN appears to be the object of affection rather than other kinds of attention (affection, ideal mate, desire, love, hopeless love). Comparable prepositional phrases for MAN and BOY are more numerous and less negatively marked. For one thing, more frequent are the phrases in which the referents of those lexemes play the role of beneficiaries, depicted as profiting from situations in which they are not specifically or necessarily the object of someone's affection, but of various forms of attention (I would do this for a man; they had voted for a man who had promised much; very little a woman won't do for a man who makes her look half her age; asking for immunity for a man who slaughtered; public affirmations of support for a man who was directly responsible). Secondly, more numerous are the phrases that introduce or are part of mini-descriptions of sets of conditions in which the referents of MAN or BOY may find themselves acting; these are not meant to express value judgments, but rather to simply report on the state of their circumstances or to mention their salient characteristics: most of them are neutral or downright positively connoted (for a man of Haig's background; for a man like you; For a man running a campaign; For a man as careful as De Gaulle; for a man in his position; For a man who has had little or no trouble; For a man who potentially had another life; For a man who could have taught Japanese; For a man with such power at his fingertips; for a man on his stag night; forty was a terrific age for a man). Indications of the appropriate conduct for a man are scanty (What greater shame was there for a man than to walk the streets cringing; You consider it right for a man of my years), as is the reference to negative experiences or problematic circumstances (The problems will be worse for a man if the disfigurement is on the left; an impossible position for a man of deeply skeptical inclinations; was the inevitable result for a boy whose disaffection from his family; scandal represented tragedy for a man). More frequent instead are representations of positive situations applying to men: advantageous circumstances, personal achievements, solutions to possible problems (wonderfully releasing experience for a man who, as a child, felt dominated; it will make it easier for a man to find and keep a partner; ease and comfort; it was a dream job, no mean achievement for a man of 50; Not bad for a man with an average voice, didn't do badly at all - for a man hospitalized).
More generally, such for-headed prepositional phrases are set in text segments presenting positive or neutral descriptions or evaluations of the circumstances of events or situations regarding men (which was normal for a man of his age; more creditable for a man to capture an animal alive; It is usual, I imagine, for a man to look for perfection; may be particularly intense for a man; a move to … is a gamble for a man who has spent his whole career; jail would be another kick for a man who is down); sometimes they mention untypical situations (this is strange for a man to be turned into a woman; it's somehow abnormal for a man with a partner; It seemed a queer thing for a man to do; Most unusual for a man to offer another man a seat; how out of the common it is, for a man to say something like that); finally, although concordances occasionally re-present some of the semantic contexts already identified for WOMAN and GIRL (cause of disruption of relationship: dumps her for a man; object of affection: would fall for a man serving a life sentence; yearning of a man; problematic conditions: it's dangerous for a man knowing and feeling; not an easy assignment for a man; appropriateness of conduct: it is high treason for a man to violate the wife of the Sovereign; it's unnatural for a man to show such inertia), these do not constitute the focus of the various patterns of discourse characterizing men and boys. 6. Conclusion An examination of the most significant collocates of WOMAN, MAN, GIRL, and BOY in four components of the Cobuild on-line corpus has revealed that these lexemes both share some contexts of use (e.g. age, physical appearance, personal relationships, positive and negative states), but not always to the same degree, and also that they are relevant to complementary discourse domains (e.g. religion, physical attractiveness, danger, and women's liberation for females, and the military, non-physical attractiveness, activities, and physical force for males). A consideration of the concordances of the 261 above lexemes in selected syntactic environments has also shown that females and males may be associated with the same types of topics and notions, but also that the frequency of their co-occurrence is often dissimilar. Thus, for example, attributes describing females are more often negative or focused on the notion of beauty, while those modifying males tend to be more evenly distributed between positive and negative qualities or to stress men's social roles and intellectual qualities. Similarly, of-headed prepositional phrases preferably describe physical attributes of females but non-physical characteristics of males. In relative clauses, it is more often females rather than males who are represented as having to face difficult situations or in relation to their feelings, while men are portrayed in their professional roles. For-headed prepositional phrases typically refer to appropriateness of conduct with regard to females, and to advantageous circumstances or personal achievements in the case of males. On the other hand, concordances of genitive constructions have shown that comparable wordforms of different lexemes may occasionally share similar patterns of co-occurrence with certain semantic fields; thus, the singular of GIRL and BOY are both frequently associated with the notions of physical attractiveness, physicality, and negativity, while both WOMAN and MAN in their plural forms are similarly associated with the notions of group membership and sports events. 
The findings indicate that in the discourse practices of English speakers the world of females is frequently associated with notions of passivity, negativity, and physicality, while that of males with notions of activity, positivity, and cognitivity. This suggests that socio-culturally salient concepts encoded in potentially symmetric terms may develop different ideological significance when they recurrently keep company with different sets of words. On the other hand, more data need to be collected to confirm or disprove these conclusions, especially considering that different wordforms of the same lexeme may not occur in the same contexts of use (see Stubbs 2001: 27-28), and that collocations differ considerably in different text types and thus in different corpora components (see Nakamura and Sinclair 1995: 108-109). To this end, further pairs of gender-marked terms need to be examined (e.g. sister, brother, wife, husband, female, male) in more varied syntactic contexts (e.g. in-headed prepositional phrases, complements, passive constructions) and in additional corpora components (e.g. relevant to the spoken medium and to specific genres). References Fairclough N 1992 Discourse and social change. Cambridge, UK, Polity. Halliday M.A.K. 1994 An Introduction to Functional Grammar. London, New York, Sydney, Auckland, Arnold. Hodge R, Kress G 1993 Language as ideology. New York, Routledge. Krishnamurthy R 1996 Ethnic, racial and tribal: the language of racism? In Caldas-Coulthard R, Coulthard M (eds) Texts and practices: readings in critical discourse analysis. London, Routledge, pp. 129-149. Nakamura J, Sinclair J 1995 The World of Woman in the Bank of English: Internal Criteria for the Classification of Corpora. Literary and Linguistic Computing 10(2): 99-110. Partington A 1998 Patterns and meanings. Amsterdam, Benjamins. Stubbs M 2001 Words and phrases. Corpus studies of lexical semantics. Oxford, UK, Blackwell. Tognini-Bonelli E 2000 Corpus Linguistics at Work. Birmingham, UK, TWC. Wierzbicka A 1999 Emotions across languages and cultures: diversity and universals. Cambridge, CUP. Tracking Lexical Changes in the Reference Corpus of Slovene Texts Vojko Gorjanc University of Ljubljana, Faculty of Arts, Aškerčeva 2, 1000 Ljubljana, Slovenia vojko.gorjanc@guest.arnes.si Keywords: corpus linguistics, corpus-based lexicology, lexical semantics, lexical changes, Slovene language Introduction With the help of a corpus, we can track lexical changes quickly and reliably, and also observe the response of a selected language to new lexical items introduced into it from other languages, e.g. English, or some other language with which the selected language has direct contact; in the case of Slovene, these are Italian, German, Hungarian and Croatian. The present paper focuses on lexical items introduced into English in the last decade of the 20th century. The starting point for the comparison with the state in the corpus of the Slovene language is John Ayto's list of lexical items. We will try to determine how the Slovene language reacts to lexical items introduced into Slovene from English as the language of global society. Using the corpus, we can track the new lexical items through the last decade, and observe their characteristics in the corpus. The Corpus The Corpus of the Slovene Language, called FIDA, is a reference corpus of Slovene. It is composed of contemporary Slovene texts, the majority of which were published in the 1990s.
The corpus contains just over 100 million words, encompassing a broad variety of language variants and registers. It is composed of written texts and texts originally produced as written-to-be-spoken; the transcripts of Slovene parliamentary proceedings are the only spoken component of the corpus. Methodology Using a FIDA wordlist, we will obtain information on the lexical items from Ayto's list which are relevant to the Slovene language. By means of corpus analysis, we will determine when a word first occurs in the Slovene language and how it establishes itself in the language, and by means of statistical analysis, we will determine the possible collocations of the selected word and their changes from the first occurrence of the word until the end of the decade. Since pairs or strings of synonyms often occur with new expressions, we will try to determine how they occur in the corpus and how they disappear. With the help of markers of semantic relations already identified for the Slovene language by corpus analysis, we will identify pairs of synonyms and strings within the corpus. We will focus on the context of synonyms, their distribution within the corpus regarding the time of their occurrence and the genre, and on when and why one of the synonyms becomes dominant in the language while other variants disappear. For extracting collocations, the MI3 value, together with information on the probability of a word pair occurring together or separately, will be used. The MI3 value has turned out to provide sound information for content words in Slovene. On the other hand, with this value it is hard to identify function words as part of collocations, in particular in the case of collocators of verbs and nouns with prepositional phrases. For instance, to detect prepositional words, raw frequency statistics provide more valuable information. After a noun + preposition pair has been detected, for example, the MI3 value for the whole pair is calculated to extract the string of collocators. Conclusion In its observation of the process of accepting new lexical items into Slovene, corpus analysis reveals the great creativity of Slovene language speakers; in addition to loan words, original Slovene expressions occur almost invariably. Corpus data show a great deal of variability, linked above all to the desire for original expressions, and, after a few years, the data begin to reveal the prevailing variant within the entire corpus or a selected genre. The question of which variant is eventually fully accepted, however, remains open. Full acceptance of a variant is conditioned by a series of linguistic as well as non-linguistic factors; in order for a variant to be fully accepted in Slovene today, it must also be sexy and cool. A corpus-based approach to informality: the case of Internet chat Stefan Grondelaers – Dirk Speelman – Dirk Geeraerts RU Quantitative Lexicology and Variational Linguistics Dept. of Linguistics – University of Leuven Blijde-Inkomststraat 21 B-3000 Leuven, Belgium fax: +32 16 324767 e-mail: stefan.grondelaers@arts.kuleuven.ac.be The language of IRC – Internet Relay Chat – is in many respects an example of “spoken language in written form”: although produced in a written medium, it shares with spoken language a dialogical immediacy that ordinary written text usually lacks, as a result of which it tends to appear highly informal, even to the untrained observer.
Linguists should not, however, be misled by this global impression of informality: the language of IRC is informal in at least four different respects. In addition to its dialogical character (a situational form of informality which manifests itself in, among other things, the relatively higher frequency of 2nd person pronouns and vocatives), IRC is characterized by an abundance of abbreviation and ellipsis, which reflect the production demands of a written medium whose users try to imitate spoken language (e.g. Hentschel 1998). A third source of informality in IRC is speaker-related: being for the most part pre-adult males, chatters tend to indulge significantly more often in tabooed topics and territorial behavior (Hentschel 1998, De Gryse 2000). A fourth group of informality-inducing factors is register-related: no matter how colloquial the situation and/or medium, chatters may choose to sound “vernacular”, or to maintain a more formal standard, to “look after” their language. Our paper has a methodological and an empirical aim. We will demonstrate, first, that whereas stylometric approaches based on the calculation of isolated variables (e.g. Biber 1994) can be used to identify the first three types of informality (by counting and comparing across corpora the relative frequency of, for instance, 2nd person pronouns, vocatives, abbreviations, ellipses, taboo words and maledicta), the register-related source of informality cannot be effectively identified unless the alternate surface forms of 2nd person pronouns, vocatives, taboo words, etc. are taken into account. While current stylometric approaches, more particularly, correctly observe that 2nd person pronouns and maledicta occur much more frequently in IRC than in other written media, the proportion of, for instance, dialectal and vernacular variants of 2nd person references must be determined to identify register variation. In order to tackle the latter type of variability more adequately, an operational measure of linguistic overlap was introduced in Geeraerts, Grondelaers & Speelman (1999), which builds on the notions of onomasiological profile - the set of denotationally equivalent designations of a concept/function and their respective relative frequencies in a corpus - and uniformity, i.e. the quantified overlap between onomasiological profiles. This methodology was subsequently used to compare lexical, morphological, syntactic & phonological preferences in Belgian IRC logs and other modes of written communication (viz. UseNet, regional & national popular newspapers, and quality newspapers). What these calculations reveal is that the four types of informality do not coincide in Belgian Dutch IRC: no matter how informal chat is situation-, medium-, and speaker-wise, from the register point of view chatters do not manifest colloquial or non-standard linguistic behavior to the degree one might anticipate. This leads us to the conclusion that the linguistic specificity of chat is not, or not in the first place, determined by register choices, but by the production demands (specifically speed, and turn-taking efficiency) of spoken conversation. References Biber, D. 1994. An analytical framework for register studies. In D. Biber & E. Finegan (eds.), Sociolinguistic perspectives on register, 31-56. OUP. De Gryse, P. 2000. Vloeken en verwensingen op IRC. M.A. Thesis, University of Leuven. Geeraerts, D., S. Grondelaers & D. Speelman 1999. Convergentie en divergentie in de Nederlandse woordenschat.
Een onderzoek naar kleding- en voetbaltermen. Amsterdam: Meertens Instituut. Hentschel, E. 1998. “Communication on IRC”. Online publication on http://viadrina.euv-frankfurt-o.de/~wjournal/irc.html Between Verbs and Nouns and Between the Base Form and the Other Forms of Verbs – A Contrastive Study into COLEC and LOCNESS Xiaotian Guo Birmingham University Introduction One of my previous contrastive studies into a learner corpus COLEC1 and a native speaker corpus LOCNESS2 has shown that these learners overuse some of the verbs. In this article, I am going to apply the theories of Halliday and Biber et al. to identify some possible reasons. There are two questions I am going to answer throughout this paper: Do COLEC writers use verbs more than LOCNESS writers where there are noun equivalents? And do COLEC writers use the base form to the same extent as LOCNESS writers within the verbs? In the first part, a study of twenty-five verbs and their related nouns is carried out to examine the tendencies of the two groups of corpus writers to choose nouns in their writing. In the second part, a study of fifty verbs, and then of four verbs in particular, is made, focusing on the distribution of the frequencies and the behaviour of the base form and the other forms of the verbs. I. Verbs or Nouns: A Study of Tendency According to Halliday (1985: 72-75), it is a characteristic of written English that “lexical meaning is largely carried in the nouns”. He attributes this phenomenon to two factors: one is the structure of the nominal group and the other is the structure of the clause. In analyzing the structure of the nominal group, Halliday believes that “there are a lot of things that can only be said in nominal constructions: especially in registers that have to do with the world of science and technology, where things, and the ideas behind them, are multiplying and proliferating all the time”. He compares the structure of verb groups with that of nominal groups and points out that in verbal groups there is only one lexical element: the verb itself. Even though “other lexical material may be expressed in adverbial groups”, these are “very limited in scope”. In the structure of the clause, Halliday also identifies an internal requirement for nominal groups in the grammar of modern English. Based on corpus investigations, Biber et al. (1999: 65) find that the lexical word classes vary greatly both in overall frequency and across registers. In overall frequency, nouns are the most frequent word class, and across registers nouns are most common in news and academic prose but least common in conversation. As has been found in my previous study of COLEC, learner English is characterized by spoken-like language rather than the academic style which is supposed to be the appropriate one. Supported by this finding and informed by the theories of Halliday and Biber et al., it can be predicted that native speakers will tend to use nouns more frequently than learners where there are noun equivalents of verbs. Now I will choose some verbs and their noun equivalents to find out whether it is true that native speakers tend to use nouns more frequently than learners where there are noun equivalents of verbs. 1 COLEC is a corpus of learner English mainly composed of Chinese university students’ essays in national tests. Its size is 503,799 words. 2 LOCNESS is a corpus of native speakers. It is composed of four components, namely, British essays of A-level students, British essays of university students, American argumentative essays and American literary-mixed essays. Its size is 324,161 words.
If this is true, it means the tendency to choose noun equivalents is stronger for native speakers than for learners. These verbs and their noun equivalents are as follows: accept (acceptance), apply (application), argue (argument), assume (assumption), believe (belief), choose (choice), commit (commitment), communicate (communication), compare (comparison), complete (completion), create (creation), enter (entry), examine (examination), express (expression), include (inclusion), indicate (indication), introduce (introduction), involve (involvement), manage (management), occur (occurrence), produce (production), realise (realization), realize (realization), refuse (refusal) and survive (survival). Table 1 shows the frequencies of these verbs and their equivalent nouns. Each verb in the table is treated as a lemma, including all the forms of the verb: the base form, the third singular form, the “-ing” form, the past form and the past participle. Each noun frequency includes both the singular form and the plural form.
Table 1: The frequencies of some verbs and frequencies of their equivalent nouns
Lemma         COLEC VERB   COLEC NOUN   LOCNESS VERB   LOCNESS NOUN
ACCEPT        41           0            182            33
APPLY         65           2            60             11
ARGUE         1            3            167            339
ASSUME        3            0            40             13
BELIEVE       295          14           373            125
CHOOSE        121          31           140            129
COMMIT        8            0            90             12
COMMUNICATE   24           30           24             27
COMPARE       52           2            49             15
COMPLETE      35           0            42             5
CREATE        18           1            182            38
ENTER         84           1            56             10
EXAMINE       17           60           28             5
EXPRESS       25           9            56             12
INCLUDE       67           0            111            2
INDICATE      21           0            10             6
INTRODUCE     12           2            61             44
INVOLVE       12           0            159            9
MANAGE        29           8            27             12
OCCUR         25           0            96             5
PRODUCE       239          65           89             38
REALISE       9            0            98             16
REALIZE       196          3            122            14
REFUSE        27           0            64             13
SURVIVE       34           4            47             16
Total         1460         235          2373           949
N Total3      28.98        4.66         73.2           29.28
If we look at the total use of these twenty-five verbs and their equivalent nouns, there are 1460 cases of verb use and 235 cases of noun use in COLEC, and there are 2373 cases of verb use and 949 cases of noun use in LOCNESS. If we normalize these figures by the total word counts of the two corpora respectively, we get the following comparison: in every 10,000 words, there are 28.98 cases of verb use and 4.66 cases of noun use in COLEC, and there are 73.2 cases of verb use and 29.28 cases of noun use in LOCNESS. If we compare the noun use with the verb use across the two corpora based on the normalized total frequencies, we get the following ratios: NOUN:VERB in COLEC is 0.16 and NOUN:VERB in LOCNESS is 0.4. This means that, relative to verbs, LOCNESS writers use nouns two and a half times as often as COLEC writers. If we look at the individual frequencies of these verbs and their equivalent nouns, we find that most nouns are less frequent than the corresponding verbs in both corpora. But there are also a few exceptions. For example, examination appears much more frequently in COLEC than in LOCNESS because COLEC writers are mostly using examination in the sense of test rather than in the sense of investigation found in LOCNESS: an exhaustive examination of the broadcast networks’ programming. Since tests are overwhelmingly a major concern of university students, the overuse of this sense of examination is understandable. Another point I have noticed is the frequent occurrence of production, which is the most frequently used of these nouns in COLEC. This is caused by the topic of the production of fake commodities, which is addressed in a majority of the essays in COLEC. If we ignore these two cases, however, we find an even more striking underuse of nouns in COLEC.
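As a sanity check, the normalized figures and the noun:verb ratios just discussed follow directly from the raw totals in Table 1 and the corpus sizes given in footnotes 1 and 2; a minimal sketch of the arithmetic (no data beyond the figures quoted above are involved):

    # raw totals for the 25 verb lemmas and their noun equivalents (Table 1)
    colec = {"verbs": 1460, "nouns": 235, "size": 503799}    # COLEC word count from footnote 1
    locness = {"verbs": 2373, "nouns": 949, "size": 324161}  # LOCNESS word count from footnote 2

    def per_10k(count, corpus_size):
        """Frequency normalized to occurrences per 10,000 words."""
        return round(count / corpus_size * 10000, 2)

    for name, corpus in [("COLEC", colec), ("LOCNESS", locness)]:
        v = per_10k(corpus["verbs"], corpus["size"])
        n = per_10k(corpus["nouns"], corpus["size"])
        print(name, "verbs per 10k:", v, "nouns per 10k:", n,
              "noun:verb ratio:", round(n / v, 2))
    # yields 28.98 / 4.66 and a ratio of 0.16 for COLEC,
    # and 73.2 / 29.28 and a ratio of 0.4 for LOCNESS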
Table 2: The total frequencies and the normalized total frequencies of the selected verbs and of their related nouns in COLEC and LOCNESS, excluding EXAMINE and PRODUCE

            COLEC verb  COLEC noun  LOCNESS verb  LOCNESS noun
Total            1204        110        2256          906
N Total          23.9       2.18          67        27.95

If we compare the ratios between noun use and verb use across the two corpora, the underuse of nouns by COLEC writers becomes even more marked: the NOUN:VERB ratio in COLEC is 0.09 and the NOUN:VERB ratio in LOCNESS is 0.4, so that, relative to their use of verbs, LOCNESS writers use nouns 4.4 times as often as COLEC writers. In other words, once the nouns inflated by topic effects are discarded, the tendency to choose nouns is even weaker in COLEC. Having established that COLEC writers choose verbs more often, we can understand learner English better by examining how the different forms of these verbs are used.

3 "N Total" stands for normalized total, i.e. the total normalized by the overall size of the corpus (per 10,000 words).

II. The Favoured Form of Verbs: the Base Form

As found in my previous studies, there is a larger discrepancy between the base form and the other forms of a verb in COLEC than in LOCNESS. This part examines these differences in detail, looking first at the distribution of the different forms of 50 verbs and then at four verbs in particular, to see how appropriately the base form is used in COLEC. The verbs were chosen, checked and re-selected to make sure that they function only as verbs and that no individual form of any verb overlaps in spelling with its noun or with another word class. Choosing the fifty verbs was not straightforward: KNOW, for example, was initially included but turned out to be used overwhelmingly by the learners, and it was therefore removed from the list so as not to distort the overall frequency picture and affect the soundness of the comparison. Table 3 shows the distribution of these 50 verbs across the base form (V), the present participle (V-ing), the third person singular form (V-s), the past form (V-ed(1)) and the past participle (V-ed(2)). COLEC is represented by C and LOCNESS by L.
Table 3: The distribution of all the forms of 50 verbs in COLEC and LOCNESS in alphabetical order V V V-ing V-s V-ed(1) V-ed(2) C L C L C L C L C L ACCEPT 28 92 5 21 0 14 8 55 AFFECT 32 29 0 14 7 23 13 37 AFFORD 16 38 0 0 1 0 1 2 ALLOW 5 83 1 43 0 47 5 100 APPLY 51 27 1 6 4 11 9 16 ARGUE 1 81 0 13 0 29 0 44 ASSUME 3 21 0 7 0 3 0 9 BEGIN 108 52 99 64 8 39 52 54 2 22 BELIEVE 269 226 1 13 6 76 19 58 BRING 290 80 4 18 34 35 37 81 CARRY 39 48 3 21 2 10 17 40 CEASE 1 15 0 3 0 0 1 1 CHOOSE 99 57 5 20 1 23 13 27 3 13 COMMIT 7 36 0 20 0 10 1 24 COMMUNICATE 18 18 4 4 0 1 2 1 COMPARE 11 12 11 1 1 4 29 32 COMPLETE 32 34 0 1 0 0 3 7 CONSIDER 88 60 7 26 2 9 26 81 CREATE 11 70 1 28 0 19 6 65 EAT 127 44 29 50 3 1 13 1 4 8 ENTER 60 31 8 7 1 8 15 10 EXAMINE 12 13 2 5 1 2 2 8 EXPRESS 22 24 1 7 1 6 1 19 GIVE 297 164 10 62 31 52 23 41 42 147 GO 684 212 169 148 63 92 115 35 22 37 277 HAPPEN 69 66 22 25 13 35 76 30 HURT 60 23 0 7 3 4 INCLUDE 16 38 42 39 7 17 2 17 INDICATE 5 4 1 3 12 2 3 1 INTRODUCE 5 13 2 8 0 2 5 38 INVOLVE 5 17 1 16 0 24 6 102 LEAD 138 130 10 30 33 49 25 74 MANAGE 22 4 2 0 0 10 5 13 OCCUR 13 47 2 9 5 24 5 16 PRODUCE 145 48 46 12 12 10 36 19 REALISE 9 33 0 7 0 38 0 20 REALIZE 137 72 8 15 1 21 50 14 REFUSE 19 12 1 10 0 23 7 19 RUN 126 84 44 52 5 12 30 7 SAVE 199 40 32 29 5 8 11 10 SEEM 18 130 0 2 48 140 22 24 SEND 20 13 16 3 0 3 19 20 SPEAK 161 35 134 25 3 4 3 4 33 4 SPREAD 6 19 3 4 2 1 SURVIVE 34 31 0 6 0 3 0 7 TAKE 909 293 99 115 51 77 63 59 117 136 TELL 151 41 9 11 41 41 77 57 THROW 17 13 5 7 1 4 1 1 6 22 WANT 1154 217 5 17 94 106 99 90 WRITE 228 19 105 25 0 30 7 23 17 33 Total 5977 3009 950 1079 502 1202 963 1513 246 422 If we add the frequencies of all the forms of the 50 verbs in COLEC (5977+950+502+963+246), the figure amounts to 8638 and if we add those in LOCNESS (3009+1079+1202+1513+422), the figure is 7225. To get an overall percentage of the use of the fifty verbs (including all the forms of each lemma) in the two corpora, we may divide the two figures above by the total frequencies of the two corpora respectively. For COLEC it is 8638/503799=0.017 and for LOCNESS it is 7225/324161=0.022. There is not a dramatic difference between the two groups of corpus writers in the total percentages of verbal uses because most of the verbs are commonly used in both of the corpora. However, if we compare the percentage of the use of the base form and that of the other forms of the verbs in the corpora, the picture will look different (see Table 4). The base form takes up 69.4 percent of the total use of the lemma in COLEC and 41.6 percent in LOCNESS. Contrary to the overuse of the base form of verbs, COLEC writers are underusing all the other forms of verbs. Table 4. Percentages of all the forms of the 50 verbs in the two corpora Form COLEC LOCNESS V 69.4 41.6 278 V-ing 10.9 14.9 V-s 5.8 16.6 V-ed (1) 11.1 20.9 V-ed (2) 2.8 5.8 Total 100 99.8 There may be several reasons for the overuse of the base form of verbs. One possible reason is that COLEC writers are misusing the base form for other forms that native speakers use. This assumption is based upon one of the features of the native language of the learner corpus writers, namely, Chinese is a non-inflected language. COLEC writers are not properly aware of the requirement of the English language in the change of word forms in different contextual situations. In the following part, this assumption will be tested. To do this, four verbs are chosen to see how properly the base forms are used in COLEC: APPLY, HAPPEN, CHOOSE, and BEGIN. 
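Before turning to these four verbs, the relation between Table 3 and Table 4 can be made explicit. The sketch below is added for illustration only (it is not part of the original study); it derives the overall proportions and the form percentages from the column totals of Table 3, and since the published figures are rounded, small differences may remain:

# Illustrative sketch: overall proportions and form percentages from the
# column totals of Table 3 (rounding may differ slightly from Table 4).
form_totals = {
    #           COLEC  LOCNESS
    "V":        (5977, 3009),
    "V-ing":    (950, 1079),
    "V-s":      (502, 1202),
    "V-ed(1)":  (963, 1513),
    "V-ed(2)":  (246, 422),
}
corpus_size = {"COLEC": 503799, "LOCNESS": 324161}

colec_total = sum(c for c, _ in form_totals.values())    # 8638
locness_total = sum(l for _, l in form_totals.values())  # 7225
print(f"Share of the 50 lemmas in each corpus: "
      f"COLEC {colec_total / corpus_size['COLEC']:.3f}, "
      f"LOCNESS {locness_total / corpus_size['LOCNESS']:.3f}")
for form, (c, l) in form_totals.items():
    print(f"{form:8} COLEC {100 * c / colec_total:5.1f}%  "
          f"LOCNESS {100 * l / locness_total:5.1f}%")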
An investigation of all the concordance lines of the base forms of these verbs reveals several types of misuse of the base form. These misuses fall mainly into the following categories:

1. The base form misused for the third person singular form. Some examples are:
"Practice Makes Perfect " also apply to other subjects, The phenomenon often happen in people's everyday life, That often happen to everyone student. Whatever happen, wherever we are, we must keep our mind clear. So, when he choose a job he must …

2. The base form misused for the "-ing" form. Some examples are:
Because failure is the thing which is always happen everything has a process from happen to developing We can know what happen in our city from what we talk about. "OH, my dear, what's happen? Asked the father. A good begin is half done.

3. The base form misused for the past participle form. Some examples are:
we shall know what has happen outside We can learn what have happen in society. Since we have choose the professinl field, Once I have choose a kind of job. Well begin is half done.

4. The verb misused as a noun. Some examples are:
I have written the apply form to join the Party. Therefore, choose a good job is very important. Now we have no choose, we must … …from begin to end …

There are 6 such misuses for APPLY, 16 for HAPPEN, 6 for CHOOSE and 12 for BEGIN, and together they account for 11.7 per cent of all occurrences of the four lemmas. Because the base form is misused for other forms, the frequency of the base form is artificially inflated. The investigation into the base forms of HAPPEN, CHOOSE and BEGIN thus supports the assumption: COLEC writers do have problems choosing the appropriate form of the verb where it is required. By applying the wrong form, they increase the frequency of the base form accordingly, although the extent of this over-application varies from verb to verb.

III. Some miscellaneous findings

Apart from the misuse of verb forms, there are quite a few cases where a verb is used in place of another verb, or even of another word class. Examples are:
The factory can produce all kinds of producs which apply good servece for the people. (provide) the hospital apply the people with morden servece. (provide) factorys require the adequent water to apply the contemporary production (meet the need of) The numourous wells can also apply lots of fresh water (supply) We should create many conditions to cause the fire to apply to its advantage (exert) more and more people want to live comfortable and happen (happy)

In most of the examples above, the verb APPLY is used to mean SUPPLY, a misuse which seems to be due simply to the similarity in spelling. Although the misuse of one verb for another makes no difference to the analysis of the overuse of the base form, it is a by-product of this study: it reflects a stage of language acquisition and the need to distinguish between words that are similar in form. Another striking point is that a couple of verbs are dramatically underused by COLEC writers. For example, the lemma ARGUE appears 167 times in LOCNESS but only once in COLEC, and the lemma INVOLVE, used 159 times by LOCNESS writers, is used only 12 times by COLEC writers.
These two verbs rank 14th and 15th respectively in LOCNESS, near the top of the list, but 50th and 44th, at the bottom of the list, in COLEC. Furthermore, if we refer to Table 1 to see whether COLEC writers use the related nouns of these two verbs, the nouns too remain underused. I would therefore speculate that these two verbs have not been adequately acquired by this group of learners compared with the other verbs in the list. Another interesting finding is that COLEC writers tend to choose the American spelling REALIZE rather than the British spelling REALISE (REALIZE 196 occurrences in total, REALISE 9). This must be the influence of the textbooks used by COLEC writers and their teachers.

Conclusion

This study has revealed the learners' tendency to choose verbs where native speakers would often prefer nouns in writing. COLEC writers' limited awareness of this stylistic difference is evident in their choice between nouns and verbs. Another important finding about learner English is that COLEC writers overuse the base form of verbs overall while underusing the V-ing form, the third person singular form, the past form and the past participle form. Among the uses of the base form, a considerable number of cases were found in which the base form appears where another form should have been used. COLEC writers also sometimes choose the wrong word to express a meaning that should be expressed by another word, although this is not a dominant feature of the misuses. Attention should be paid to these features of learner English in English learning and teaching.

References
1. Biber, D., Johansson, S., Leech, G., Conrad, S. and Finegan, E. 1999. Longman Grammar of Spoken and Written English. London: Longman.
2. Halliday, M.A.K. 1985. Spoken and Written Language. Oxford: Oxford University Press.

A method for word segmentation in Vietnamese
Le An Ha
Research Group in Computational Linguistics
School of Humanities, Languages and Social Sciences
University of Wolverhampton
Stafford Street, Wolverhampton, WV1 1SB, UK
L.A.Ha@wlv.ac.uk

Abstract

Word segmentation is the very first step in natural language processing for languages such as Vietnamese. Given that un-annotated corpora are the only widely available resource, we propose a method of word segmentation for Vietnamese which uses only n-gram information. We calculate the probabilities of different combinations of n-grams in a chunk and choose the one that produces the maximum probability. In order to estimate these probabilities, we build a 10M-word corpus from two years of newspaper articles. The results, while not very impressive, show that the method works.

1. Introduction

It has long been known that word segmentation is a vital problem in natural language processing for certain Asian languages such as Chinese, Japanese and Vietnamese. Without knowing the boundaries of the words in a sentence, we cannot go any further. The problem is that in these languages each sound unit, when written, is separated by a space, and there is no overt marking of word boundaries. Of course, native speakers have no problem using their language: they communicate efficiently and precisely in both spoken and written form without explicitly identifying word boundaries. A computer, however, is far less able to cope with word boundaries in these languages. Moreover, "what is a word?" is itself a problem. This question cannot be answered to everybody's satisfaction, so different researchers continue to work with their own definitions.
There have been efforts in the past to group sound units into "words" in written form, writing, for example, "ky-thuat" or "kythuat" instead of "ky thuat" (technology). These attempts failed because not everybody was happy with this kind of notation, and the resulting forms were still confusing. Thus the problem of word boundaries continues to exist. A lot of methods have been introduced to solve the segmentation problem in Japanese and Chinese, but none for Vietnamese. A further difficulty is that these methods tend to rely on lexical resources, which are still underdeveloped for Vietnamese. These issues have led to an attempt to develop a method of word segmentation for Vietnamese using pure statistics only. To do this, we build a corpus of 10M tokens and use it to calculate the different scores needed for the method. In this paper the various issues that arose while carrying out the experiment will also be discussed.

2. About the Vietnamese language and the problem of word boundaries

Vietnamese is, although this is still debated, a monosyllabic language, and belongs to the Southeast Asian language family. Its phonetic, grammatical and semantic features differ from those of Indo-European languages, which makes it difficult for the Vietnamese not only to learn European languages but also to develop techniques for natural language processing. The options are either to try to make Vietnamese fit into the framework of other well-investigated European languages, or to develop a new framework from scratch. Both have proved painful: the former has not been very successful, given that Vietnamese differs significantly from Indo-European languages, while the latter requires a lot of resources, both human and material, which may not be available in a developing country such as Vietnam.

Vietnamese is traditionally a spoken language, so for the native speaker identifying word boundaries is not a very serious problem. Speakers know what a word is and use it naturally, and even if there is some difference of opinion between them on the word boundary issue, this does not make communication more difficult, because they all agree on what a sound, the more basic unit of Vietnamese, is. When written forms of the language were developed, using either Chinese or Latin characters, they were simply extensions of the spoken form: each sound was represented by a (sequence of) character(s) and separated by a space. This caused no difficulty for native speakers in understanding each other, and the problem of word boundaries did not yet arise. At the beginning and in the middle of the 20th century, when Vietnamese scholars were introduced to Western schools of grammar, some changes to the written form of Vietnamese were proposed to make it more "word oriented", using various marks to make word boundaries more explicit and the language look more like European ones. These changes included eliminating the space between syllables that may form a "word", and using hyphens. The attempts were unsuccessful, perhaps because of the nature of the Vietnamese language, in which the identification of words may not be that important at all. The discussion about what a word is in Vietnamese goes on and, up to now, there is still no general agreement on the issue. The development of computers in general, and of computational linguistics in particular, no longer permits researchers to avoid the problem of word boundaries.
Unlike native speakers, computers (at least so far) cannot easily identify word boundaries in electronic texts. This is the bottleneck of natural language processing for Vietnamese: without knowing word boundaries, the computer cannot do anything else (part-of-speech tagging, parsing, etc.); all it can do is count n-grams. A way to process texts without knowing about word boundaries may in fact exist. But given that most existing methods of automatic natural language processing were developed for word-oriented languages, the time needed to develop word segmentation techniques and then apply those existing methods is probably far less than the time that would be spent developing entirely new theories and techniques for Vietnamese language processing.

Why is identifying word boundaries in Vietnamese so difficult? As discussed in the previous section, Vietnamese is first of all a spoken language, and in a spoken language the most important unit is the syllable, not the word; word boundaries can vary from person to person without affecting communication. Another reason is that combining syllable units is the only way to construct new lexical units for new concepts in Vietnamese: there are no prefixes or suffixes, just syllables, which makes everything look confusing. The fact that the part-of-speech system of Vietnamese, like that of Chinese or Japanese, is not very well defined leads to differences among dictionaries and thus also contributes to the difficulty of word boundary identification. A further problem is that a large part of the Vietnamese vocabulary comes from (ancient) Chinese, and these units seem to be bound together more strongly than pure Vietnamese syllable units, for example "cong nhan" (worker), "thuong nhan" (businessman) ("nhan" roughly means "person" in Chinese). Should we treat these units differently or in the same way? Note that their word order differs from natural Vietnamese word order (modifier-head, as opposed to the head-modifier order of pure Vietnamese, as in "nguoi lao dong", "nguoi buon ban"). Yet another type of lexical unit in Vietnamese is problematic: reduplication, usually of two syllables, in which only one, or even neither, has a meaning of its own, the other being just a variant of the sound of the first. This type of unit is very common, especially among adjectives; in fact almost every adjective has a reduplicated form. This can be explained by the importance of sound in a spoken language. The question is how to treat this kind of unit in word boundary identification.

3. The construction of the corpus and some of its properties

With so many problems, the only solution seems to be to use corpora to discover the nature of word boundaries in Vietnamese. Until recently it was very difficult, for various reasons, to construct a Vietnamese corpus, mainly because of the lack of human and material resources in a developing country where there are other, more pressing things to do. But the rapid development of the Internet and the World Wide Web in Vietnam makes it possible to build an inexpensive corpus: news, stories, books, etc. are becoming more and more available on the web, and finding material for a corpus is no longer a difficult task.
Of course, if the task is to build a well-balanced corpus, the web may still not be a very good resource, but if the task is just to evaluate techniques in their early days, the web is the right place. For Vietnamese, methods of natural language processing are in their early days, and researchers cannot wait until a standard corpus has been built to try out their techniques. For these reasons we chose to harvest news from online sources to construct our corpus. In particular, two years of news articles from http://vnexpress.net were collected. These articles come from different newspaper sources and are thus somewhat balanced in style and genre. They are classified into different topics: society, world, business, health, science and way-of-life. Various perl scripts were used to fetch the articles and put them together into a corpus.

The size of the final corpus is difficult to state, as it depends on what we use to measure it. The Vietnamese characters are encoded in Unicode UTF-8 and stored in an HTML-friendly format (in order to display the content easily in different applications). If we measure the size of the corpus by treating everything between two spaces (or other punctuation marks) as a token, excluding the punctuation marks themselves, the corpus contains about 10M tokens. Of course, one can argue that this is a rather small corpus (segmentation techniques for Chinese use corpora of 40M to 60M), but we think it is enough for our experiment. The information harvested from the web site allows us to record the author (source) of each article, a short summary and the date. These are not used in the present project but may be useful for future development, so they are recorded in the corpus using SGML tags. Paragraphs are also marked, following the HTML tags.

Statistical figures for the corpus. There are quite a few problems in counting n-grams in Vietnamese. As discussed above, natural language processing for Vietnamese is still in its early days, and the definition of an n-gram has not been discussed much. For example, personal names in Vietnamese, as in Chinese, have meanings, but unlike in Chinese they are capitalised and thus distinguishable from other syllables. This raises a question: should we convert everything to lower case before counting n-grams? There are possible solutions to the problem, such as a named-entity recognition module, but we do not have one. In the end we decided to count n-grams in both versions of the corpus, the original and a lower-cased one; observing the difference between the two will give us some idea of the effect of case on n-gram frequencies in Vietnamese. As in other languages, sentence boundaries are also a problem in processing Vietnamese; a simple approach has been used whereby everything between two full stops is treated as a sentence, which will hopefully not introduce many errors. Observing the unigram list gives an idea of how different Vietnamese is from English; some of the highest-frequency items are given in Table 1. Comparing the unigram lists with and without case distinctions suggests that it is better to ignore case in this particular application. The number of unique unigrams is 60,637, which is very high, and almost half of them appear only once in the corpus.
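As an illustration of the kind of counting involved, the sketch below (Python; a hedged sketch rather than the scripts actually used, and the file name and treatment of punctuation are assumptions) gathers unigram frequencies with and without case folding and reports the number of unique unigrams and hapax legomena:

# Illustrative sketch only: syllable unigram counts from a whitespace-
# delimited UTF-8 corpus, with and without case folding. The file name and
# the exact tokenization are assumptions made for this example.
from collections import Counter

def unigram_counts(path, fold_case=False):
    counts = Counter()
    with open(path, encoding="utf-8") as corpus:
        for line in corpus:
            for token in line.split():
                # treat everything between spaces as a token, stripping
                # surrounding punctuation rather than counting it
                token = token.strip(".,;:!?\"'()-")
                if token:
                    counts[token.lower() if fold_case else token] += 1
    return counts

if __name__ == "__main__":
    for fold in (False, True):
        c = unigram_counts("vnexpress_corpus.txt", fold_case=fold)
        hapaxes = sum(1 for freq in c.values() if freq == 1)
        print(f"case folded={fold}: {sum(c.values())} tokens, "
              f"{len(c)} unique unigrams, {hapaxes} hapax legomena")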
The explanation for this large number of hapaxes is that the corpus contains a lot of foreign words, which will be another issue for NLP applications in Vietnamese.

Table 1: unigram frequencies and their English meanings (where available)

va (and) 88000; các 84380; c.a (of) 83915; có 81828; m.t (one) 65332; la (to be) 59392; trong (in) 58964; cho 56378; ngu.i 55388; ðu.c 54722; ða 53700; không (negative) 48899; theo 44716; nay 43040; nh.ng 41430; v.i 38816; . 38767; công 37175; s. (will) 34160; ra 31654; b. 31093; khi 29587; th. 29308; ông 28955; ð. 28936; tren (on) 27762; nu.c (water, nation) 26209; vao (into) 25539; t. (from) 25109; hi.n 24467; nha 24439; ð.n 24050; v 24015; ðó 23843; t.i 23172; cng 22980; s. 22936; nãm (year) 22902; nhân 22726; ð.ng 22471; vi.c 22446; thanh 20947; ph.i 20721; nhu 20711; nhiu 20173

4. Methods used for word segmentation in different languages

Word segmentation is a common problem for Asian languages such as Chinese, Japanese and Korean, although the root of the problem may differ. Numerous methods have been introduced to solve it, and it can be divided into two sub-problems: 1) disambiguating word boundaries when a lexical database is available, and 2) identifying the boundaries of new lexical items. To solve the first problem, dictionaries and statistical methods have been used: usually the longest possible string is matched using a dictionary, and statistical scores are calculated to disambiguate when ambiguities arise. These approaches have some problems. Firstly, relying on a dictionary implies that the dictionary is a reliable source for natural language processing, in the sense that it is consistent and complete. In reality such a dictionary is very hard to find, and different dictionaries tend to be inconsistent, especially for languages that have not been extensively investigated. Furthermore, the vocabulary of a developing country's language is itself developing, and a dictionary cannot stay up to date. In particular, the problem with Vietnamese dictionaries is that they are often built to different standards and have different sets of entries. The grammar is also partly to blame: there is still no concrete grammatical theory for Vietnamese, and every attempt to apply Indo-European grammars to Vietnamese has failed, leaving dictionary compilers unsure which system of categories they should use to classify their words. Nouns are generally agreed upon among linguists, but verbs, adjectives and adverbs are debatable. In the past these problems were tolerable because, as discussed previously, they do not greatly affect the quality of communication, but they can no longer be avoided if we want to use computers to process Vietnamese: the computer is not human, and without being shown exact word boundaries it will refuse to do anything else.

The main approach to identifying the boundaries of new lexical units is the statistical one, in which various statistical scores are calculated to determine how strongly different smaller units (sounds) are bound together. The assumption behind this approach is that units which frequently appear together in text form a bigger, steadier unit that we call a "word". The difficulty is to find a score that really reflects the phenomenon of "word" in the language in question. This seems to be hard: even a statistically solid measure cannot guarantee the success of the segmentation process. Mutual information and the t-score are examples.
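To make one of these association scores concrete, the short sketch below (a generic illustration, not the authors' code; all counts are invented for the example) computes the pointwise mutual information of a syllable pair from unigram and bigram counts:

# Generic illustration of pointwise mutual information between two adjacent
# syllables, estimated from unigram and bigram counts; the counts used in
# the example calls are invented.
import math

def pmi(bigram_count, left_count, right_count, n_tokens):
    """log2( p(xy) / (p(x) * p(y)) ), using maximum-likelihood estimates."""
    p_xy = bigram_count / n_tokens
    p_x = left_count / n_tokens
    p_y = right_count / n_tokens
    return math.log2(p_xy / (p_x * p_y))

# a strongly bound pair versus a chance co-occurrence (figures invented)
print(pmi(bigram_count=900, left_count=1200, right_count=1500, n_tokens=10_000_000))
print(pmi(bigram_count=3, left_count=50_000, right_count=40_000, n_tokens=10_000_000))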
Although these scores have solid statistical theory behind them, the results obtained with them are not that impressive. In the case of the t-score this may be because the assumption of a normal distribution does not hold; in the case of mutual information, naturally occurring texts may simply not fit the assumptions of information theory very well. There is still a lot of work to do in this field to establish the relation between the phenomenon of "word" in monosyllabic languages and statistics, or even to disprove the phenomenon. But this does not mean that we should do nothing: different methods should be tried on smaller problems, and these may give hints for the development of a more robust theory of "word". What follows is a short review of methods used for word segmentation. Maosong et al. use mutual information and t-scores to identify word boundaries in Chinese, and the reported results, although challenged by other authors, are very high (>90%); the problem with this method is that the use of the t-score assumes a normal distribution. Wong and Chan employ a lexicon of 80,000 entries and a corpus of 63M characters for word segmentation based on maximum matching and word binding force; this algorithm relies heavily on the lexicon. Sproat et al. introduce a stochastic finite-state word-segmentation algorithm, which also relies on lexical resources. Sornlertlamvanich et al. use the C4.5 learning algorithm for word segmentation in Thai; the information used includes string length, mutual information, frequency and entropy.

5. The proposed method

Given that, for Vietnamese, large lexicographic resources are lacking and annotated corpora are also rare, we want to develop a method that does not rely on these resources and uses only a raw corpus. From a raw corpus, frequency is the only statistical score that can be calculated reliably. Our approach starts from units that native speakers do agree on: sentences, and/or chunks separated by punctuation marks (commas, hyphens, quotes, etc.). These units are less ambiguous than words, in both spoken and written form. We then try to maximise the probability of the chunk over different segmentations; the final segmentation is the one that gives the chunk the maximum probability. Of course, we are not talking about the real probability of an n-gram or a chunk: the "probability" of an n-gram used here is its maximum likelihood estimate, and the probability of a chunk is the product of its n-gram "probabilities". The implicit idea behind this calculation is that, given a chunk, it is most likely to have been produced by the segmentation that gives the maximum probability. This method of calculation faces certain problems. The first is combinatorial explosion: the number of possible segmentations is an exponential function of the length of the chunk. Other problems include the difficulty of estimating the probabilities, and data sparseness. To deal with the combinatorial explosion, dynamic programming is employed, whereby the maximum probability of a smaller chunk is calculated only once and then reused in other calculations; this reduces the complexity of the problem to O(n^3). The detailed calculation is given below.

P(i,j): the maximised probability of the segment beginning at i and ending at i+j.
p(i,j): the probability of the n-gram beginning at i and ending at i+j.
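As an illustration only, the maximisation just described can be written in a standard Viterbi-style dynamic-programming form. The sketch below is mine, not the author's implementation: the helper ngram_prob and the max_len limit are assumptions, and the author's own recursion and pseudocode follow below.

# A minimal sketch of maximising the product of n-gram "probabilities" over
# the possible segmentations of a chunk, using dynamic programming.
# ngram_prob is assumed to return the maximum-likelihood estimate of a
# candidate n-gram (and a very small value for unseen ones); syllables is a
# chunk already split on spaces.
def segment(syllables, ngram_prob, max_len=4):
    n = len(syllables)
    best = [0.0] * (n + 1)   # best[j]: maximum probability of syllables[:j]
    best[0] = 1.0
    back = [0] * (n + 1)     # back[j]: start of the last segment in the best split
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            p = best[i] * ngram_prob(syllables[i:j])
            if p > best[j]:
                best[j], back[j] = p, i
    # recover the segmentation from the backpointers
    words, j = [], n
    while j > 0:
        i = back[j]
        words.append(" ".join(syllables[i:j]))
        j = i
    return list(reversed(words)), best[n]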
The maximum probability is calculated as:
P(i,0) = p(i,0)
P(i,j) = max { P(i,0)*P(i+1,j-1), P(j+i,0)*P(i,j-1), P(i,1)*P(i+2,j-2), P(j+i-1,1)*P(i,j-2), ..., p(i,j) }
The actual calculation using dynamic programming proceeds as follows: for(i=0;i