NOTE: THE BLC KWiC CONCORDANCER HAS BEEN SWITCHED OFF, AND ANY TIPS HEREIN REGARDING ITS USE ARE OUT OF DATE.

Consult Yasumasa Someya's original help pages:
a) BLC KWiC Concordancer
b) BIGRAM PLUS


BIGRAM PLUS / TRIGRAM FRAME COUNTER

Help with search patterns:


1. Strings

A string is a word or phrase, or a part thereof, entered directly into the search box.

The selected 'Search Type' determines whether the string is treated as:

a) a complete lexical unit [=Equal to],
b) a 'prefix' [=Start with],
c) a 'suffix' [=End with],
d) a part occurring anywhere in a larger word/phrase, including the middle ('infix') [=Contain].
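
The four search types can be pictured as four ways of anchoring the same string. The sketch below is only an illustration in Python, not the concordancer's actual code; the helper name `matches` and the sample word list are assumptions made for the example, and the string is treated literally rather than as a regular expression.

```python
import re


def matches(word, string, search_type):
    """Return True if `word` satisfies `string` under the given search type.

    Hypothetical helper: the string is escaped, so it is matched literally.
    """
    literal = re.escape(string)
    patterns = {
        "EQUAL TO": literal,                 # the whole word
        "START WITH": literal + r".*",       # string, then anything
        "END WITH": r".*" + literal,         # anything, then string
        "CONTAIN": r".*" + literal + r".*",  # string anywhere in the word
    }
    return re.fullmatch(patterns[search_type], word) is not None


words = ["useful", "Useful", "usefulness", "unuseful"]
for search_type in ("EQUAL TO", "START WITH", "END WITH", "CONTAIN"):
    hits = [w for w in words if matches(w, "useful", search_type)]
    print(search_type, hits)
```

In this sketch an exact match also counts as a 'Start with', 'End with' and 'Contain' hit, and every comparison is case sensitive.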

 


2. Regular expressions

'Regular expressions' are search patterns; (some of) those accepted by the Perl programming language can be used here. They allow more powerful searches for lemmas, collocations, and more.

 

A. Convenient combinations for cut and paste:

Symbol(s)

Meaning

\w any letter, digit or underscore
\w+ any word including internal digits, excluding words with internal hyphens or apostrophes
[a-z]+ any lowercase word (no digits, no internal hyphens or apostrophes)
[A-Za-z]+ any word, regardless of case (no digits, no internal hyphens or apostrophes)
[A-Za-z\-]+ any word, also hyphenated
[A-Za-z\']+ any word, also with internal apostrophe(s)
[A-Za-z\-\']+ any word, also hyphenated or contracted
\s+ space(s) between words (one or more)
[^a-zA-Z0-9]+ any word boundary, including any punctuation
[^a-zA-Z0-9\']+ any word boundary, including punctuation except an apostrophe
(word1|word2|word3|...) multiple forms of a word or alternative words (case sensitive)
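
To see how the choice of character class changes what counts as a 'word', the combinations above can be tried on one sample sentence. This is a minimal sketch using Python's `re` module as a stand-in for Perl; the sample sentence is an assumption made for the example.

```python
import re

sentence = "The well-known editor didn't sleep in 2004."

# Each pattern is applied to the same sentence to compare the "words" found.
for pattern in (r"\w+", r"[A-Za-z]+", r"[A-Za-z\-\']+"):
    print(pattern, re.findall(pattern, sentence))

# \w also accepts the underscore, not only letters and digits.
print(re.fullmatch(r"\w", "_") is not None)
```

With `\w+` and `[A-Za-z]+` the hyphenated and contracted words are split into pieces; only the last pattern keeps `well-known` and `didn't` whole, at the cost of dropping the number.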

 

B. Important operators:

 

Operator

Function

[x-y] square brackets are used to indicate ranges of characters (letters, digits, etc) from x to y
| a vertical stroke separates alternative strings
() round brackets enable the user to group alternative strings especially if these are meant to be words
{x,y} means preceding letter, range or group must repeat at least x times and no more than y times
? means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional
* means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional and may repeat any number of times
+ means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) MUST occur at least once
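
The difference between ?, *, + and {x,y} is easiest to see side by side. The following is a small sketch in Python (standing in for Perl); the helper name `whole` and the test words are assumptions made for the example.

```python
import re


def whole(pattern, word):
    """True if the pattern matches the entire word."""
    return re.fullmatch(pattern, word) is not None


checks = [
    (r"ways?", "way"),         # ? : the s may be absent
    (r"ways?", "wayss"),       # ? : but it may not repeat
    (r"ways*", "wayss"),       # * : zero or more s
    (r"ways+", "way"),         # + : at least one s is required
    (r"[a-z]{2,3}", "ab"),     # {2,3} : two or three lowercase letters
    (r"[a-z]{2,3}", "abcd"),   # {2,3} : four letters is too many
    (r"(un|re)?do", "redo"),   # a whole group made optional by ?
]
for pattern, word in checks:
    print(pattern, word, whole(pattern, word))
```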

 

C. More info on Regexes:

 

D. Many regular expression patterns used with the BLC KWiC concordancer (see 3 & 4 below) will also work with Bigram PLUS (see 5).

 


3. Concordancing examples: [Note: The BLC concordancer is off]

 

A. Simple searches for individual strings (words and parts of words)

String or expression

Search type

Result

useful EQUAL TO useful (lower case)
Useful EQUAL TO Useful (capitalised)
useful START WITH useful, usefulness, usefully
useful END WITH unuseful, Unuseful
useful CONTAIN useful, usefulness, usefully, unuseful, Unuseful

 

B. Word patterns, multiple forms, lemmas [many patterns will work on BIGRAM PLUS]

Pattern

Query type

Result

[Uu]seful EQUAL TO Useful, useful
[a-z] EQUAL TO any one-letter word (lower case)
[A-Z] EQUAL TO any one-letter word (upper case)
\w EQUAL TO any one-letter word (either case; \w also matches a single digit or underscore)
\w\w EQUAL TO any two-letter word (any mixed case)
\w\w\w\w* EQUAL TO any three-letter or longer word (any mixed case, non-hyphenated)
\w+ EQUAL TO any word without hyphenation or apostrophe
\w+-\w+ EQUAL TO any hyphenated word with one internal hyphen
\w+-\w+-\w+ EQUAL TO any hyphenated word with two internal hyphens
[A-Za-z\-]+ EQUAL TO any word, whether hyphenated or not, without apostrophe
\w+'\w+ EQUAL TO any word with apostrophe, e.g. contraction
[A-Z]\w* EQUAL TO any word (no hyphen, no apostrophe) beginning with a capital letter
[A-Z] START WITH any word beginning with a capital letter
C START WITH any word beginning with capital C
c START WITH any word beginning with lowercase c
[Cc] START WITH any word beginning with c, either case
[C]\w+ EQUAL TO any two-letter or longer word beginning with a capital C
[A-Z]\w+ EQUAL TO any two-letter or longer word beginning with a capital letter
[Ww]ays? EQUAL TO way or ways, with a capital or lowercase w (the s at the end is optional)
(unite|unites|uniting|united) EQUAL TO lemma UNITE (lowercase occurrences only)
(Unite|Unites|Uniting|United) EQUAL TO lemma UNITE (capitalised occurrences only)
([Uu]nite|[Uu]nites|[Uu]niting|[Uu]nited) EQUAL TO lemma UNITE (either lowercase or with capitalised U)
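
Typing out every form with both initials by hand is error-prone, so a lemma pattern of the last kind can also be generated. The sketch below is one possible way to do this in Python; the function name `lemma_pattern` is an assumption made for the example and is not part of either tool.

```python
import re


def lemma_pattern(forms):
    """Build an alternation that accepts each form with either initial case."""
    alternatives = [
        f"[{form[0].upper()}{form[0].lower()}]{re.escape(form[1:])}"
        for form in forms
    ]
    return "(" + "|".join(alternatives) + ")"


pattern = lemma_pattern(["unite", "unites", "uniting", "united"])
print(pattern)

for word in ("unites", "United", "UNITED", "unity"):
    print(word, re.fullmatch(pattern, word) is not None)
```

The generated string can be pasted into the search box as it is; fully capitalised forms such as UNITED are still not covered.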

 

C. Multi-words, collocations, frames, etc. [many patterns will work on BIGRAM PLUS]

Pattern

Query Type

Result

\w+-\w+ EQUAL TO hyphenated compounds (mixed case, digits included)
[Ww]ays?\s+\w+\s+\w+ing EQUAL TO frame way(s) X -ing
middle[^a-z]+[A-Za-z]+[^a-z]+the EQUAL TO frame middle X the, allowing punctuation (e.g. hyphens) as word boundaries
as a matter of EQUAL TO phrase as a matter of, with single spaces between words
as\s+a\s+matter\s+of EQUAL TO phrase as a matter of, with possible multiple spaces between words
as a [a-z]+ of EQUAL TO frame as a X of [with single spaces]
as\s+a\s+[a-z]+\s+of EQUAL TO frame as a X of [tolerates multiple spaces between words]
(go|went|gone|goes)\s+([a-z]+[^a-z]+){0,4}mad EQUAL TO lemmatised collocation GO mad with up to 4 words in between (e.g. "people who have gone computer mad" or "people who went computer-mad")
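
The GO mad pattern can be checked against the two example phrases, plus one where too many words intervene. This is a minimal sketch in Python (standing in for Perl); the third sentence is an assumption made for the example.

```python
import re

# The pattern from the last row above: a form of GO, then up to four
# lowercase words (each followed by non-letters), then "mad".
GO_MAD = re.compile(r"(go|went|gone|goes)\s+([a-z]+[^a-z]+){0,4}mad")

sentences = [
    "people who have gone computer mad",
    "people who went computer-mad",
    "he went into the big old red house mad",  # six words in between
]
for sentence in sentences:
    match = GO_MAD.search(sentence)
    print(sentence, "->", match.group(0) if match else None)
```

Because [^a-z]+ is used as the word boundary, the hyphen in computer-mad is accepted just like a space.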

 

D. Punctuation searches

Pattern

Query type

Result

\w+\.\s+ EQUAL TO full stops attached to words, followed by a space (= ends of sentences)
,\s+\w+\s+\w+,\s+ EQUAL TO any two words surrounded by commas
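
Both punctuation patterns can be tried on short sample texts. The sketch below uses Python's `re` module as a stand-in; the two sample texts are assumptions made for the example.

```python
import re

text = "It rained. We stayed in. The end."
# The final "end." has no space after it, so this pattern does not count it.
sentence_ends = re.findall(r"\w+\.\s+", text)
print(sentence_ends)

clause = "He came, very late, to the party."
between_commas = re.findall(r",\s+\w+\s+\w+,\s+", clause)
print(between_commas)
```

Since the first pattern requires a following space, a sentence that closes a line or a text may be missed.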

 

E. Searching POS-tagged corpora [outdated section]

Click here to find out about the TAGSET used.

Examples of queries on adjectives:

Pattern

Query type

Result

ADJ EQUAL TO all adjectives
ADJ\(ge,.*\) CONTAIN all general adjectives
ADJ\(ge,pos,\w*\) CONTAIN all general adjectives in the positive degree
ADJ\(ge,pos,\w+p\) CONTAIN all participles acting as adjectives
ADJ\(ge,pos,ingp\) CONTAIN all present participial adjectives
ingp\) CONTAIN any -ing forms
(ingp\)|edp\)) CONTAIN all -ing and past participle forms
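
The backslashes before the round brackets matter: without them the brackets are read as a grouping operator rather than as part of the tag. The sketch below illustrates this in Python; the tag strings are hypothetical, modelled only on the patterns in the table above, and the third one is made up to show a non-match.

```python
import re

# Hypothetical tag strings; the real tagset may differ.
tags = ["ADJ(ge,pos,ingp)", "ADJ(ge,pos,edp)", "ADJ(ge,pos,xyz)"]

escaped = r"ADJ\(ge,pos,\w+p\)"   # brackets matched literally
unescaped = r"ADJ(ge,pos,\w+p)"   # brackets act as a group instead

participles = [t for t in tags if re.search(escaped, t)]
print(participles)

# The unescaped version looks for "ADJge,pos,...p" and finds nothing.
print([t for t in tags if re.search(unescaped, t)])
```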

 


4. Using POLISH corpora [outdated section; these tips will NOT work any more]

 

a) Adjust the encoding of the concordance results window to Central-European (Windows)

b) [^a-z] will sometimes capture Polish letters instead of non-letter characters. Sometimes there will be a need to enter as much as [^a-ząćęłńóśźżĄĆĘŁŃÓŚŹŻ] for a word boundary.

 

Pattern

Search type

Result

[Nn]ik(t|ogo|omu|im) EQUAL TO lemma NIKT
niż\s+ EQUAL TO word niż
([Dd]zień[^a-z]|[Dd]nia[^a-z]|[Dd]niu[^a-z]|[Dd]niem[^a-z]|[Dd]ni[^a-z]|[Dd]niom[^a-z]|[Dd]niami[^a-z]|[Dd]niach[^a-z]) EQUAL TO lemma DZIEŃ
[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+ EQUAL TO any Polish word
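
The reason for listing the Polish letters explicitly can be shown outside the (now switched-off) concordancer. The sketch below is in Python 3 and only illustrates the general principle; the constant names and the sample phrase are assumptions made for the example.

```python
import re

ASCII_WORD = r"[a-zA-Z]+"
POLISH_WORD = r"[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+"

text = "Dzień dobry, żółw śpi."

# A plain a-z class breaks words at every Polish letter.
print(re.findall(ASCII_WORD, text))

# Listing the Polish letters keeps the words whole.
print(re.findall(POLISH_WORD, text))

# In Python 3, \w is Unicode-aware for str input, so it also works here.
print(re.findall(r"\w+", text))
```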

 


5. BIGRAM PLUS and Trigram Frame Counter

 

A. View Yasumasa Someya's original help page for BIGRAM PLUS

B. Help with regular expressions.

C. Examples (English and Polish) -- all refer to case-insensitive queries:

 

Each example below lists Unit 1, the number of intervening words (for the Trigram Frame Counter this is always 1), Unit 2, the BIGRAM PLUS result and the Trigram Frame Counter result.

Unit 1: of
Intervening words: 0
Unit 2: the
BIGRAM PLUS result: number of "of the" bigrams (2-word clusters) in the corpus
Trigram Frame Counter result: number of "of _ the" frames and/or freq list of all nucleus words

Unit 1: (tell|tells|telling|told)
Intervening words: 3
Unit 2: about
BIGRAM PLUS result: freq list of 2, 3 and 4 word clusters where a form of the verb TELL precedes about
Trigram Frame Counter result: freq list of tell _ about, tells _ about, telling _ about and told _ about frames and/or freq list of their nucleus words

Unit 1: \w+
Intervening words: 0
Unit 2: life
BIGRAM PLUS result: bigrams ending with "life"
Trigram Frame Counter result: freq list of all <word> _ life frames and/or freq list of all their nucleus words

Unit 1: \w+
Intervening words: 0
Unit 2: dzień
BIGRAM PLUS result: Polish bigrams ending with dzień
Trigram Frame Counter result: freq list of all <word> _ dzień frames and/or freq list of all their nucleus words

Unit 1: \w+\s+\w+
Intervening words: 0
Unit 2: life
BIGRAM PLUS result: freq list of 3-word clusters ending with the wordform life
Trigram Frame Counter result: freq list of all <word> <word> _ life 4-gram frames and/or freq list of all nucleus words

Unit 1: \w+\s+\w+
Intervening words: 0
Unit 2: życie
BIGRAM PLUS result: freq list of Polish 3-word clusters ending with the wordform życie
Trigram Frame Counter result: freq list of all <word> <word> _ życie 4-gram frames and/or freq list of all nucleus words

Unit 1: \w+\s+\w+\s+\w+
Intervening words: 0
Unit 2: life
BIGRAM PLUS result: freq list of 4-word clusters ending with life
Trigram Frame Counter result: freq list of all <word> <word> <word> _ life 5-gram frames and/or freq list of all nucleus words

Unit 1: \w+\s+\w+\s+\w+
Intervening words: 0
Unit 2: życie
BIGRAM PLUS result: freq list of Polish 4-word clusters ending with the wordform życie
Trigram Frame Counter result: freq list of all <word> <word> <word> _ życie 5-gram frames and/or freq list of all nucleus words

Unit 1: way\w*\s+of
Intervening words: 0
Unit 2: \w+
BIGRAM PLUS result: freq list of 3-word clusters beginning with WAY of, e.g. way of life, way of living, ways of spending
Trigram Frame Counter result: freq list of all way of _ <word> 4-gram frames and/or freq list of all nucleus words

Unit 1: pod\s+względem
Intervening words: 0
Unit 2: \w+
BIGRAM PLUS result: freq list of 3-word clusters beginning with pod względem or Pod względem
Trigram Frame Counter result: freq list of all pod względem _ <word> 4-gram frames and/or freq list of all nucleus words

Unit 1: (tell|tells|telling|told)\s+\w+
Intervening words: 0
Unit 2: about
BIGRAM PLUS result: freq list of only those 3-word combinations in which exactly one word separates a wordform of TELL from the preposition about
Trigram Frame Counter result: freq list of tell <word> _ about, tells <word> _ about, telling <word> _ about and told <word> _ about 4-gram frames and/or freq list of their nucleus words

Unit 1: \w+
Intervening words: 0
Unit 2: \w+
BIGRAM PLUS result: all bigrams, except words with hyphens and/or apostrophes; SLOW
Trigram Frame Counter result: all 3-gram frames of the form <word> _ <word> and freq list of all nucleus words that occur in them

Unit 1: \w+\s+\w+
Intervening words: 0
Unit 2: \w+
BIGRAM PLUS result: all trigrams, except words with hyphens and/or apostrophes; SLOW
Trigram Frame Counter result: all 4-gram frames of the form <word> <word> _ <word> and freq list of all nucleus words that occur in them

Unit 1: \w+
Intervening words: 0
Unit 2: \w+\s+\w+
BIGRAM PLUS result: as above
Trigram Frame Counter result: all 4-gram frames of the form <word> _ <word> <word> and freq list of all nucleus words that occur in them

Unit 1: \w+\s+\w+\s+\w+
Intervening words: 0
Unit 2: \w+
BIGRAM PLUS result: all 4-word clusters, except words with hyphens and/or apostrophes; SLOW
Trigram Frame Counter result: all 5-gram frames of the form <word> <word> <word> _ <word> and freq list of all nucleus words that occur in them

Unit 1: \w+\s+\w+
Intervening words: 0
Unit 2: \w+\s+\w+
BIGRAM PLUS result: as above
Trigram Frame Counter result: all 5-gram frames of the form <word> <word> _ <word> <word> and freq list of all nucleus words that occur in them

Unit 1: \w+
Intervening words: 0
Unit 2: \w+\s+\w+\s+\w+
BIGRAM PLUS result: as above
Trigram Frame Counter result: all 5-gram frames of the form <word> _ <word> <word> <word> and freq list of all nucleus words that occur in them
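
The two kinds of result can be imitated on a small text to see what is being counted. The sketch below is written in Python and is not how BIGRAM PLUS or the Trigram Frame Counter are implemented; the function names `count_clusters` and `count_frame_nuclei` and the sample text are assumptions made for the example. It handles only the case of 0 intervening words for clusters and exactly 1 for frames.

```python
import re
from collections import Counter


def count_clusters(text, unit1, unit2):
    """Count clusters in which unit1 is immediately followed by unit2."""
    pattern = re.compile(rf"\b(?:{unit1})\s+(?:{unit2})\b", re.IGNORECASE)
    # Lowercase and collapse spacing so that variants are pooled together.
    # finditer returns non-overlapping matches only.
    return Counter(
        " ".join(m.group(0).lower().split()) for m in pattern.finditer(text)
    )


def count_frame_nuclei(text, unit1, unit2):
    """Count the single word (the nucleus) found between unit1 and unit2."""
    pattern = re.compile(
        rf"\b(?:{unit1})\s+(\w+)\s+(?:{unit2})\b", re.IGNORECASE
    )
    return Counter(m.group(1).lower() for m in pattern.finditer(text))


text = "A way of life and a way of living, but not the Way of life."
clusters = count_clusters(text, r"way\w*\s+of", r"\w+")
nuclei = count_frame_nuclei(text, r"way", r"life")
print(clusters)
print(nuclei)
```

Here the cluster count corresponds to a "way of <word>" frequency list, and the nucleus count to the words filling the gap in "way _ life".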

 


Page maintained by Przemysław Kaszubski
Last update: 2004-11-09