NOTE: BLC KWIC CONCORDANCER HAS BEEN SWITCHED OFF
AND ANY TIPS HEREIN REGARDING ITS USE ARE OUT OF DATE
Consult Yasumasa Someya's original help pages:
a) BLC KWiC Concordancer
b) BIGRAM PLUS
A string is a word or phrase, or a part thereof, entered directly into the search box.
The selected 'Search Type' determines whether the string is treated as:
a) a complete lexical unit [=Equal to],
b) a 'prefix' [=Start with]
c) a 'suffix' [=End with]
d) a middle-part of a larger word/phrase ('infix') [=Contain]
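One way to picture the four search types is as four different ways of anchoring the same string inside a whole-word regular expression. The Python sketch below is only an illustration: the function name search_type_to_regex, the letters-only filler and the mapping itself are assumptions, not a description of how the concordancer was implemented.

```python
import re

def search_type_to_regex(string, search_type):
    """Hypothetical mapping from a search type to a whole-word regex."""
    core = re.escape(string)      # treat the search string literally
    filler = r"[A-Za-z]*"         # assumption: the rest of the word is letters only
    patterns = {
        "Equal to": core,
        "Start with": core + filler,
        "End with": filler + core,
        "Contain": filler + core + filler,
    }
    return re.compile(r"\b" + patterns[search_type] + r"\b")

text = "A useful tip on usefulness; an unuseful one used usefully."
for kind in ("Equal to", "Start with", "End with", "Contain"):
    print(kind, search_type_to_regex("useful", kind).findall(text))
```

Under these assumptions "Equal to" returns only the bare word, while "Contain" returns every word in which the string appears anywhere.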
'Regular expressions' are patterns written in (a subset of) the syntax accepted by the Perl programming language. They allow more powerful searches for lemmas, collocations, and more.
A. Convenient combinations for cut and paste:
Symbol(s) | Meaning |
\w | any single letter, digit or underscore |
\w+ | any word including internal digits, excluding words with internal hyphens or apostrophes |
[a-z]+ | any lowercase word (no digits, no internal hyphens or apostrophes) |
[A-Za-z]+ | any word, regardless of case (no digits, no internal hyphens or apostrophes) |
[A-Za-z\-]+ | any word, also hyphenated |
[A-Za-z\']+ | any word, also with internal apostrophe(s) |
[A-Za-z\-\']+ | any word, also hyphenated or contracted |
\s+ | space(s) between words (one or more) |
[^a-zA-Z0-9]+ | any word boundary, including any punctuation |
[^a-zA-Z0-9\']+ | any word boundary, including punctuation except an apostrophe |
(word1|word2|word3|...) | multiple forms of a word or alternative words (case sensitive) |
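The difference between these classes is easiest to see on a sentence that contains hyphens, an apostrophe and a number. The following is a small Python sketch (Python's re module follows Perl syntax for these patterns); the sample sentence is invented for illustration.

```python
import re

sample = "Don't over-think it: the well-known co-op has 24 e-mails."

# \w+ breaks words at hyphens and apostrophes but keeps digits
plain = re.findall(r"\w+", sample)

# letters, hyphens and apostrophes together keep compounds and contractions whole
full = re.findall(r"[A-Za-z\-\']+", sample)

# splitting on a "word boundary" class that spares the apostrophe
pieces = [w for w in re.split(r"[^a-zA-Z0-9\']+", sample) if w]

print(plain)
print(full)
print(pieces)
```

Note that the second pattern drops "24" altogether, because digits are not part of its character class.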
B. Important operators:
Operator | Function |
[x-y] | square brackets are used to indicate ranges of characters (letters, digits, etc) from x to y |
| | a vertical stroke separates alternative strings |
() | round brackets enable the user to group alternative strings especially if these are meant to be words |
{x,y} | means preceding letter, range or group must repeat at least x times and no more than y times |
? | means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional |
* | means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional and may repeat any number of times |
+ | means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) MUST occur at least once and may repeat any number of times |
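The repetition operators can be compared side by side on a handful of test strings. This is a minimal Python sketch; the word list and the helper name whole_matches are made up for the example.

```python
import re

words = ["way", "ways", "wayss", "go", "goo", "gooo", "goooo"]

def whole_matches(pattern):
    """Return the sample words that the pattern matches in full."""
    compiled = re.compile(pattern)
    return [w for w in words if compiled.fullmatch(w)]

print(whole_matches(r"ways?"))     # ? : the final s is optional
print(whole_matches(r"ways*"))     # * : zero or more s
print(whole_matches(r"go+"))       # + : one or more o
print(whole_matches(r"go{2,3}"))   # {2,3} : two or three o
```

The second line shows why * is usually the wrong choice for an optional plural: it also accepts "wayss".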
C. More info on Regexes:
D. Many regular-expression patterns used with the BLC KWiC concordancer (see 3 & 4 below) will also work with Bigram PLUS (see 5).
3. Concordancing examples: [Note: The BLC concordancer is off]
A. Simple searches for individual strings (words and parts of words)
String or expression | Search type | Result |
useful | EQUAL TO | useful (lower case) |
Useful | EQUAL TO | Useful (capitalised) |
useful | START WITH | useful, usefulness, usefully |
useful | END WITH | useful, unuseful, Unuseful |
useful | CONTAIN | useful, usefulness, usefully, unuseful, Unuseful |
B. Word patterns, multiple forms, lemmas [many patterns will work on BIGRAM PLUS]
Pattern | Query type | Result |
[Uu]seful | EQUAL TO | Useful, useful |
[a-z] | EQUAL TO | any one letter word (lower case throughout) |
[A-Z] | EQUAL TO | any one letter word (uppercase throughout) |
\w | EQUAL TO | any one-character word (letter or digit, either case) |
\w\w | EQUAL TO | any two-character word (letters or digits, any case) |
\w\w\w\w* | EQUAL TO | any word of three or more characters (any case, non-hyphenated) |
\w+ | EQUAL TO | any word without hyphenation or apostrophe |
\w+-\w+ | EQUAL TO | any hyphenated word with one internal hyphen |
\w+-\w+-\w+ | EQUAL TO | any hyphenated word with two internal hyphens |
[A-Za-z\-]+ | EQUAL TO | any word, whether hyphenated or not, without apostrophe |
\w+'\w+ | EQUAL TO | any word with apostrophe, e.g. contraction |
[A-Z]\w* | EQUAL TO | any word (no hyphen, no apostrophe) beginning with a capital letter |
[A-Z] | START WITH | any word beginning with a capital letter |
C | START WITH | any word beginning with capital C |
c | START WITH | any word beginning with lowercase c |
[Cc] | START WITH | any word beginning with c, either case |
[C]\w+ | EQUAL TO | any two-letter or longer word beginning with a capital C |
[A-Z]\w+ | EQUAL TO | any two-letter or longer word beginning with a capital letter |
[Ww]ays? | EQUAL TO | way or ways, with the initial letter in either case (s at the end is optional) |
(unite|unites|uniting|united) | EQUAL TO | lemma UNITE (lowercase occurrences only) |
(Unite|Unites|Uniting|United) | EQUAL TO | lemma UNITE (uppercase beginnings only) |
([Uu]nite|[Uu]nites|[Uu]niting|[Uu]nited) | EQUAL TO | lemma UNITE (either lowercase or with capitalised U) |
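A lemma pattern of this kind can be checked against a list of candidate tokens before it is used in a query. In the Python sketch below, the explicit pattern is the last one from the table; the compact variant is an assumed shorthand added for comparison, and the token list is invented.

```python
import re

explicit = re.compile(r"([Uu]nite|[Uu]nites|[Uu]niting|[Uu]nited)")
compact = re.compile(r"[Uu]nit(e|es|ing|ed)")   # assumed equivalent shorthand

tokens = ["unite", "United", "uniting", "Unites", "unit", "reunited", "UNITED"]

# fullmatch mimics an EQUAL TO query: the whole token must fit the pattern
hits_explicit = [t for t in tokens if explicit.fullmatch(t)]
hits_compact = [t for t in tokens if compact.fullmatch(t)]

print(hits_explicit)
print(hits_compact)
```

Both patterns accept the same four tokens here; neither matches the all-capitals form, since only the initial letter is allowed to vary in case.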
C. Multi-words, collocations, frames, etc. [many patterns will work on BIGRAM PLUS]
Pattern | Query Type | Result |
\w+-\w+ | EQUAL TO | hyphenated compounds (mixed case, digits included) |
[Ww]ays?\s+\w+\s+\w+ing | EQUAL TO | frame way(s) X -ing |
middle[^a-z]+[A-Za-z]+[^a-z]+the | EQUAL TO | frame middle X the, allowing punctuation (e.g. hyphens) as word boundaries |
as a matter of | EQUAL TO | phrase as a matter of, with single spaces between words |
as\s+a\s+matter\s+of | EQUAL TO | phrase as a matter of, with possible multiple spaces between words |
as a [a-z]+ of | EQUAL TO | frame as a X of [with single spaces] |
as\s+a\s+[a-z]+\s+of | EQUAL TO | frame as a X of [tolerates multiple spaces between words] |
(go|went|gone|goes)\s+([a-z]+[^a-z]+){0,4}mad | EQUAL TO | lemmatised collocation GO mad with up to 4 words in between (e.g. "people who have gone computer mad" or "people who went computer-mad") |
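The GO ... mad pattern can be tried out on a few sentences to see where the {0,4} limit cuts off. This Python sketch uses the pattern from the table; the surrounding \b anchors are an assumption standing in for the EQUAL TO setting, and the test sentences beyond the two quoted in the table are invented.

```python
import re

# pattern from the table, wrapped in \b so that e.g. "ago" or "madness" do not match
go_mad = re.compile(r"\b(go|went|gone|goes)\s+([a-z]+[^a-z]+){0,4}mad\b")

sentences = [
    "people who have gone computer mad",
    "people who went computer-mad",
    "they go completely and utterly mad",
    "they go one two three four five mad",   # five intervening words: too many
]

found = []
for s in sentences:
    m = go_mad.search(s)
    found.append(m.group(0) if m else None)

print(found)
```

The hyphen in "computer-mad" is accepted because [^a-z]+ treats any non-letter, not only a space, as the separator after each intervening word.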
D. Punctuation searches
Pattern | Query type | Result |
\w+\.\s+ | EQUAL TO | full-stops attached to words, followed by a space (= ends of sentences) |
,\s+\w+\s+\w+,\s+ | EQUAL TO | any two words surrounded by commas |
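Both punctuation patterns can be tested on a short passage. The Python sketch below uses an invented sample text and shows one limitation worth knowing about.

```python
import re

sample = "It rained. We stayed in, quite happily, and read. Done."

# full stop attached to a word and followed by whitespace
sentence_ends = re.findall(r"\w+\.\s+", sample)

# exactly two words enclosed by commas
comma_pairs = re.findall(r",\s+\w+\s+\w+,\s+", sample)

print(sentence_ends)
print(comma_pairs)
```

The final "Done." is not reported as a sentence end, because the pattern requires whitespace after the full stop and none follows the last word of the text.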
E. Searching POS-tagged corpora [outdated section]
Click here to find out about the TAGSET used.
Examples of queries on adjectives:
Pattern | Query type | Result |
ADJ | EQUAL TO | all adjectives |
ADJ\(ge,.*\) | CONTAIN | all general adjectives |
ADJ\(ge,pos,\w*\) | CONTAIN | all general adjectives in the positive degree |
ADJ\(ge,pos,\w+p\) | CONTAIN | all participles acting as adjectives |
ADJ\(ge,pos,ingp\) | CONTAIN | all present participial adjectives |
ingp\) | CONTAIN | any -ing forms |
(ingp\)|edp\)) | CONTAIN | all -ing and past participle forms |
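The backslashes before the brackets in these queries make the regex engine read ( and ) as literal characters of the tag rather than as grouping operators. The Python sketch below illustrates this on a few tag strings. Only ADJ(ge,pos,ingp) and the edp value are taken from the table above; the other two tags are made up to exercise the patterns and are not taken from the actual tagset.

```python
import re

tags = [
    "ADJ(ge,pos,ingp)",
    "ADJ(ge,pos,edp)",
    "ADJ(ge,comp,ingp)",   # invented tag, for illustration only
    "V(montr,ingp)",       # invented tag, for illustration only
]

def containing(pattern):
    """Mimic a CONTAIN query: keep tags in which the pattern occurs anywhere."""
    return [t for t in tags if re.search(pattern, t)]

print(containing(r"ADJ\(ge,.*\)"))
print(containing(r"ADJ\(ge,pos,\w+p\)"))
print(containing(r"ingp\)"))
print(containing(r"(ingp\)|edp\))"))
```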
4. Using POLISH corpora [outdated section; these tips will NOT work any more]
a) Adjust the encoding of the concordance results window to Central-European (Windows)
b) [^a-z] will sometimes capture Polish letters instead of non-letter characters. Sometimes a word boundary will need to be spelled out as fully as [^a-ząćęłńóśźżĄĆĘŁŃÓŚŹŻ].
Pattern | Search type | Result |
[Nn]ik(t|ogo|omu|im) | EQUAL TO | Lemma NIKT |
niż\s+ | EQUAL TO | word niż |
([Dd]zień[^a-z]|[Dd]nia[^a-z]|[Dd]niu[^a-z]|[Dd]niem[^a-z]|[Dd]ni[^a-z]|[Dd]niom[^a-z]|[Dd]niami[^a-z]|[Dd]niach[^a-z]) | EQUAL TO | lemma DZIEŃ |
[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+ | EQUAL TO | any Polish word |
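The reason a plain a-z range is not enough for Polish can be shown in a few lines. The Python sketch below is an illustration only: the sample sentence is invented, and the character class spells out the full set of Polish diacritic letters.

```python
import re

sample = "Nikt nie wie, co przyniesie dzień."

# ASCII-only class: the Polish letter ń is treated as a non-letter
ascii_words = re.findall(r"[a-zA-Z]+", sample)

# class extended with the Polish diacritic letters
polish_words = re.findall(r"[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+", sample)

print(ascii_words)
print(polish_words)
```

With the ASCII-only class the last word is cut short to "dzie", which is the same effect that makes [^a-z] misfire as a word boundary.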
5. BIGRAM PLUS AND Trigram Frame Counter
A. View Yasumasa Someya's original help page for BIGRAM PLUS
B. Help with regular expressions.
C. Examples (English and Polish) -- all refer to case-insensitive queries:
Unit 1 | Intervening words (Trigram Frame Counter =1, always) | Unit 2 | BIGRAM PLUS Result | Trigram Frame Counter Result |
of | 0 | the | Number of "of the" bigrams (2-word clusters) in the corpus | number of of _ the frames and/or freq list of all nucleus words |
(tell|tells|telling|told) | 3 | about | frequency list of 2, 3 and 4 word clusters where a form of the verb TELL precedes about | freq list of tell _ about, tells _ about, telling _ about and told _ about frames and/or freq list of their nucleus words |
\w+ | 0 | life | bigrams ending with "life" | freq list of all <word> _ life frames and/or freq list of all their nucleus words |
\w+ | 0 | dzień | Polish bigrams ending with dzień | freq list of all <word> _ dzień frames and/or freq list of all their nucleus words |
\w+\s+\w+ | 0 | life | freq list of 3-word clusters ending with the wordform life | freq list of all <word> <word> _ life 4-gram frames and/or freq list of all nucleus words |
\w+\s+\w+ | 0 | życie | freq list of Polish 3-word clusters ending with the wordform życie | freq list of all <word> <word> _ życie 4-gram frames and/or freq list of all nucleus words |
\w+\s+\w+\s+\w+ | 0 | life | freq list of 4-word clusters ending with life | freq list of all <word> <word> <word>_ life 5-gram frames and/or freq list of all nucleus words |
\w+\s+\w+\s+\w+ | 0 | życie | freq list of Polish 4-word clusters ending with the wordform życie | freq list of all <word> <word> <word>_ życie 5-gram frames and/or freq list of all nucleus words |
way\w*\s+of | 0 | \w+ | frequency list of 3-word clusters beginning with WAY of, e.g. way of life, way of living, ways of spending | freq list of all way of _ <word> 4-gram frames and/or freq list of all nucleus words |
pod\s+względem | 0 | \w+ | frequency list of 3-word clusters beginning with pod względem or Pod względem | freq list of all pod względem _ <word> 4-gram frames and/or freq list of all nucleus words |
(tell|tells|telling|told)\s+\w+ | 0 | about | frequency list of only 3-word combinations in which exactly one word separates a wordform of TELL and the preposition about | freq list of tell <word> _ about, tells <word> _ about, telling <word> _ about and told <word> _ about 4-gram frames and/or freq list of their nucleus words |
\w+ | 0 | \w+ | all bigrams, except words with hyphens and/or apostrophes; SLOW | all 3-gram frames of the form <word> _ <word> and freq list of all nucleus words that occur in them |
\w+\s+\w+ | 0 | \w+ | all trigrams, except words with hyphens and/or apostrophes; SLOW | all 4-gram frames of the form <word> <word>_ <word> and freq list of all nucleus words that occur in them |
\w+ | 0 | \w+\s+\w+ | as above | all 4-gram frames of the form <word> _ <word> <word> and freq list of all nucleus words that occur in them |
\w+\s+\w+\s+\w+ | 0 | \w+ | all 4-word clusters, except words with hyphens and/or apostrophes; SLOW | all 5-gram frames of the form <word> <word> <word>_ <word> and freq list of all nucleus words that occur in them |
\w+\s+\w+ | 0 | \w+\s+\w+ | as above | all 5-gram frames of the form <word> <word> _ <word> <word> and freq list of all nucleus words that occur in them |
\w+ | 0 | \w+\s+\w+\s+\w+ | as above | all 5-gram frames of the form <word> _ <word> <word> <word> and freq list of all nucleus words that occur in them |
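The idea of counting clusters between two units with a limited number of intervening words can be approximated in a few lines of Python. This is a rough sketch under stated assumptions, not BIGRAM PLUS's actual algorithm: the function count_clusters, its max_gap parameter and the sample sentence are hypothetical, and each unit is restricted to a single-word pattern.

```python
import re
from collections import Counter

def count_clusters(tokens, unit1, unit2, max_gap=0):
    """Count clusters in which a token matching unit1 is followed, after at
    most max_gap intervening tokens, by a token matching unit2.
    Matching is case-insensitive, as in the examples in the table."""
    first = re.compile(unit1, re.IGNORECASE)
    second = re.compile(unit2, re.IGNORECASE)
    counts = Counter()
    for i, token in enumerate(tokens):
        if not first.fullmatch(token):
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens) and second.fullmatch(tokens[j]):
                counts[" ".join(tokens[i:j + 1]).lower()] += 1
    return counts

tokens = "He told me about the way of life and she told us all about it".split()

tell_about = count_clusters(tokens, r"(tell|tells|telling|told)", r"about", max_gap=3)
ends_in_life = count_clusters(tokens, r"\w+", r"life")

print(tell_about)
print(ends_in_life)
```

A real corpus would first need tokenising in a way that handles punctuation; the plain split() used here only works because the sample contains none.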
Page maintained by Przemysław Kaszubski
Last update: 2004-11-09