BIGRAM PLUS / TRIGRAM FRAME COUNTER

Help with search patterns:

REGULAR EXPRESSIONS
EXAMPLES

1. Strings

A string is a word or phrase, or a part thereof, entered directly into the search box.

The selected 'Search Type' determines whether the string is treated as:

a) a complete lexical unit [=Equal to]
b) a 'prefix' [=Start with]
c) a 'suffix' [=End with]
d) a middle part of a larger word/phrase ('infix') [=Contain]
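
The four search types can be read as different ways of anchoring the same string. The following is a minimal sketch in Python (used here only as a stand-in for the Perl-style engine), assuming the search is run against individual words. The names build_pattern and search, and the word list, are hypothetical and not part of the tool.

```python
import re

def build_pattern(string, search_type):
    """Wrap a search string in anchors according to the chosen search type."""
    templates = {
        "EQUAL TO": r"^(?:{})$",    # the whole word must match
        "START WITH": r"^(?:{})",   # match at the beginning ('prefix')
        "END WITH": r"(?:{})$",     # match at the end ('suffix')
        "CONTAIN": r"(?:{})",       # match anywhere ('infix')
    }
    return re.compile(templates[search_type].format(string))

# Hypothetical word list standing in for the words of a corpus.
words = ["useful", "Useful", "usefulness", "usefully", "unuseful", "Unuseful"]

def search(string, search_type):
    """Return the words that satisfy the string under the given search type."""
    pattern = build_pattern(string, search_type)
    return [w for w in words if pattern.search(w)]

equal_to = search("useful", "EQUAL TO")
start_with = search("useful", "START WITH")
end_with = search("useful", "END WITH")
contain = search("useful", "CONTAIN")
```

Searches are case sensitive in this sketch, so "Useful" is only found when the string itself is capitalised or written as [Uu]seful.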

2. Regular expressions

'Regular expressions' are search patterns written in (a subset of) the syntax accepted by the Perl programming language. They allow more powerful searches for lemmas, collocations, and more.

A. Convenient combinations for cut and paste:

Symbol(s)	Meaning
\w	any letter or digit (or underscore)
\w+	any word including internal digits, excluding words with internal hyphens or apostrophes
[a-z]+	any lowercase word (no digits, no internal hyphens or apostrophes)
[A-Za-z]+	any word, regardless of case (no digits, no internal hyphens or apostrophes)
[A-Za-z\-]+	any word, also hyphenated
[A-Za-z\']+	any word, also with internal apostrophe(s)
[A-Za-z\-\']+	any word, also hyphenated or contracted
\s+	space(s) between words (one or more)
[^a-zA-Z0-9]+	any word boundary, including any punctuation
[^a-zA-Z0-9\']+	any word boundary, including punctuation except an apostrophe
(word1|word2|word3|...)	multiple forms of a word or alternative words (case sensitive)
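
The effect of these combinations can be checked on a handful of tokens. The sketch below uses Python's re module as a stand-in for the Perl-style engine; the token list and the helper whole_matches are made up for illustration, and each pattern is required to match the whole token (as with EQUAL TO).

```python
import re

# Hypothetical sample tokens, chosen to exercise digits, hyphens and apostrophes.
tokens = ["word", "Word", "mp3", "well-known", "don't", "mother-in-law's"]

def whole_matches(pattern):
    """Return the tokens that the pattern matches from start to end."""
    return [t for t in tokens if re.fullmatch(pattern, t)]

plain = whole_matches(r"\w+")               # letters and digits only
lowercase = whole_matches(r"[a-z]+")        # lowercase letters only
any_case = whole_matches(r"[A-Za-z]+")      # letters of either case
hyphenated = whole_matches(r"[A-Za-z\-]+")  # letters and hyphens
contracted = whole_matches(r"[A-Za-z\']+")  # letters and apostrophes
either = whole_matches(r"[A-Za-z\-\']+")    # letters, hyphens and apostrophes
```

Only the last class accepts a token such as mother-in-law's, which contains both hyphens and an apostrophe.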

B. Important operators:

Operator	Function
[x-y]	square brackets are used to indicate ranges of characters (letters, digits, etc) from x to y
|	a vertical stroke separates alternative strings
()	round brackets enable the user to group alternative strings especially if these are meant to be words
{x,y}	means the preceding letter, range or group must repeat at least x times and no more than y times
?	means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional
*	means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional and may repeat any number of times
+	means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) MUST occur at least once and may repeat
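
The operators can be compared side by side on small candidate lists. This is an illustrative sketch in Python, whose re module treats these operators the same way as Perl; the candidate strings and the helper matches are invented for the example.

```python
import re

def matches(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

spellings = ["gd", "god", "good", "gooood"]

optional = matches(r"colou?r", ["color", "colour", "colouur"])  # u is optional
star = matches(r"go*d", spellings)        # o may be absent or repeat
plus = matches(r"go+d", spellings)        # o must occur at least once
bounded = matches(r"go{1,2}d", spellings) # o occurs once or twice
grouped = matches(r"(way|road)s?", ["way", "ways", "road", "roads", "path"])
in_range = matches(r"[a-c]at", ["bat", "cat", "hat"])  # first letter from a to c
```

The round brackets in (way|road)s? keep the alternatives together, so the optional s applies to both words rather than to road alone.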

C. More info on Regexes:

D. Many regular expression patterns used with the BLC KWiC concordancer (see 3 & 4 below) will also work with Bigram PLUS (see 5).

3. Concordancing examples: [Note: The BLC concordancer is off]

A. Simple searches for individual strings (words and parts of words)

String or expression	Search type	Result
useful	EQUAL TO	useful (lower case)
Useful	EQUAL TO	Useful (capitalised)
useful	START WITH	useful, usefulness, usefully
useful	END WITH	useful, unuseful, Unuseful
useful	CONTAIN	useful, usefulness, usefully, unuseful, Unuseful

B. Word patterns, multiple forms, lemmas [many patterns will work on BIGRAM PLUS]

Pattern	Query type	Result
[Uu]seful	EQUAL TO	Useful, useful
[a-z]	EQUAL TO	any one letter word (lower case throughout)
[A-Z]	EQUAL TO	any one letter word (uppercase throughout)
\w	EQUAL TO	any one letter word (any mixed case)
\w\w	EQUAL TO	any two-letter word (any mixed case)
\w\w\w\w*	EQUAL TO	any three-letter or longer word (any mixed case, non-hyphenated)
\w+	EQUAL TO	any word without hyphenation or apostrophe
\w+-\w+	EQUAL TO	any hyphenated word with one internal hyphen
\w+-\w+-\w+	EQUAL TO	any hyphenated word with two internal hyphens
[A-Za-z\-]+	EQUAL TO	any word, whether hyphenated or not, without apostrophe
\w+'\w+	EQUAL TO	any word with apostrophe, e.g. contraction
[A-Z]\w*	EQUAL TO	any word (no hyphen, no apostrophe) beginning with a capital letter
[A-Z]	START WITH	any word beginning with a capital letter
C	START WITH	any word beginning with capital C
c	START WITH	any word beginning with lowercase c
[Cc]	START WITH	any word beginning with c, either case
[C]\w+	EQUAL TO	any two-letter or longer word beginning with a capital C
[A-Z]\w+	EQUAL TO	any two-letter or longer word beginning with a capital letter
[Ww]ays?	EQUAL TO	way or ways, with a lowercase or capital initial (the s at the end is optional)
(unite|unites|uniting|united)	EQUAL TO	lemma UNITE (lowercase occurrences only)
(Unite|Unites|Uniting|United)	EQUAL TO	lemma UNITE (uppercase beginnings only)
([Uu]nite|[Uu]nites|[Uu]niting|[Uu]nited)	EQUAL TO	lemma UNITE (either lowercase or with capitalised U)
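
Lemma patterns like the ones for UNITE are tedious to type by hand, so they can be generated from a list of wordforms. The helper below is a hypothetical convenience written in Python, not a feature of the concordancer; it simply assembles the alternation as text, which can then be pasted into the search box.

```python
import re

def lemma_pattern(forms, either_case=True):
    """Build an alternation such as ([Uu]nite|[Uu]nites) from a list of wordforms."""
    parts = []
    for form in forms:
        first, rest = form[0], re.escape(form[1:])
        if either_case and first.isalpha():
            # Accept the first letter in upper or lower case.
            parts.append("[{}{}]{}".format(first.upper(), first.lower(), rest))
        else:
            parts.append(re.escape(first) + rest)
    return "(" + "|".join(parts) + ")"

pattern = lemma_pattern(["unite", "unites", "uniting", "united"])
lowercase_only = lemma_pattern(["unite", "unites"], either_case=False)

# Quick check of the generated pattern against a few wordforms.
sample = ["unite", "United", "UNITED", "unity"]
found = [w for w in sample if re.fullmatch(pattern, w)]
```

Fully capitalised forms such as UNITED are not covered by this pattern, because only the first letter is allowed to vary in case.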

C. Multi-words, collocations, frames, etc. [many patterns will work on BIGRAM PLUS]

Pattern	Query Type	Result
\w+-\w+	EQUAL TO	hyphenated compounds (mixed case, digits included)
[Ww]ays?\s+\w+\s+\w+ing	EQUAL TO	frame way(s) X -ing
middle[^a-z]+[A-Za-z]+[^a-z]+the	EQUAL TO	frame middle X the, allowing punctuation (e.g. hyphens) as word boundaries
as a matter of	EQUAL TO	phrase as a matter of, with single spaces between words
as\s+a\s+matter\s+of	EQUAL TO	phrase as a matter of, with possible multiple spaces between words
as a [a-z]+ of	EQUAL TO	frame as a X of [with single spaces]
as\s+a\s+[a-z]+\s+of	EQUAL TO	frame as a X of [allows multiple spaces between words]
(go|went|gone|goes)\s+([a-z]+[^a-z]+){0,4}mad	EQUAL TO	lemmatised collocation GO mad with up to 4 words in between (e.g. "people who have gone computer mad" or "people who went computer-mad")
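
The GO ... mad pattern can be traced on the two example phrases. The sketch below runs it with Python's re module as a stand-in for the Perl-style engine; the extra test sentences and the helper first_match are invented for illustration.

```python
import re

# The collocation pattern from the table: a form of GO, then up to four
# intervening lowercase words (each followed by a non-letter), then "mad".
pattern = re.compile(r"(go|went|gone|goes)\s+([a-z]+[^a-z]+){0,4}mad")

sentences = [
    "people who have gone computer mad",
    "people who went computer-mad",
    "he goes mad",
    "they go one two three four five mad",  # five words in between: too many
]

def first_match(sentence):
    """Return the matched stretch of text, or None if there is no match."""
    m = pattern.search(sentence)
    return m.group(0) if m else None

found = [first_match(s) for s in sentences]
```

Because the word boundary inside the repeated group is [^a-z]+ rather than \s+, the hyphen in computer-mad counts as a separator just like a space does.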

D. Punctuation searches

Pattern	Query type	Result
\w+\.\s+	EQUAL TO	full stops attached to words, followed by a space (= ends of sentences)
,\s+\w+\s+\w+,\s+	EQUAL TO	any two words surrounded by commas
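
Both punctuation patterns can be tried on a short made-up text. The sketch uses Python's re module as a stand-in for the Perl-style engine and collects every non-overlapping match.

```python
import re

# Hypothetical sample text.
text = "It works. Then, as expected, it stopped. Done."

# Words with an attached full stop, followed by whitespace.
sentence_ends = re.findall(r"\w+\.\s+", text)

# Exactly two words enclosed by commas.
comma_pairs = re.findall(r",\s+\w+\s+\w+,\s+", text)
```

The final "Done." is not reported as a sentence end, because the pattern requires at least one space after the full stop and nothing follows it here.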

E. Searching POS-tagged corpora [outdated section]

Click here to find out about the TAGSET used.

Examples of queries on adjectives:

Pattern	Query type	Result
ADJ	EQUAL TO	all adjectives
ADJ\(ge,.*\)	CONTAIN	all general adjectives
ADJ\(ge,pos,\w*\)	CONTAIN	all general adjectives in the positive degree
ADJ\(ge,pos,\w+p\)	CONTAIN	all participles acting as adjectives
ADJ\(ge,pos,ingp\)	CONTAIN	all present participial adjectives
ingp\)	CONTAIN	any ing forms
(ingp\)|edp\))	CONTAIN	all -ing and past participle forms
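
The backslashes before the brackets in these queries matter: a plain ( opens a group, while \( matches a literal bracket in the tag. A small sketch in Python illustrates the difference. The two tag strings are assumptions reconstructed from the patterns above, not taken from the actual tagset documentation, and the helper containing is hypothetical.

```python
import re

# Assumed tag strings, reconstructed from the query patterns for illustration only.
tags = ["ADJ(ge,pos,ingp)", "ADJ(ge,pos,edp)"]

def containing(pattern):
    """Return the tags in which the pattern is found (CONTAIN-style search)."""
    return [t for t in tags if re.search(pattern, t)]

escaped = containing(r"ADJ\(ge,pos,ingp\)")    # brackets matched literally
unescaped = containing(r"ADJ(ge,pos,ingp)")    # brackets act as a group: no match
participles = containing(r"ADJ\(ge,pos,\w+p\)")
ing_or_ed = containing(r"(ingp\)|edp\))")
```

In the unescaped version the engine looks for the text ADJge,pos,ingp without any brackets, which never occurs in a tag.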

4. Using POLISH corpora [outdated section; these tips will NOT work any more]

a) Adjust the encoding of the concordance results window to Central-European (Windows)

b) [^a-z] will sometimes capture Polish letters instead of non-letter characters. A word boundary may then have to be spelled out in full as [^a-ząćęłńóśźżĄĆĘŁŃÓŚŹŻ].

Pattern	Search type	Result
[Nn]ik(t|ogo|omu|im)	EQUAL TO	lemma NIKT
niż\s+	EQUAL TO	word niż
([Dd]zień[^a-z]|[Dd]nia[^a-z]|[Dd]niu[^a-z]|[Dd]niem[^a-z]|[Dd]ni[^a-z]|[Dd]niom[^a-z]|[Dd]niami[^a-z]|[Dd]niach[^a-z])	EQUAL TO	lemma DZIEŃ
[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+	EQUAL TO	any Polish word
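
Why [^a-z] can capture Polish letters is easy to see outside the tool. The sketch below uses Python 3's re module, which is an assumption made for illustration and says nothing about how the concordancer itself handles encodings. In Python 3, \w covers Polish letters by default, and the re.ASCII flag reproduces the narrower, ASCII-only behaviour.

```python
import re

word = "dzień"

# The plain class treats the Polish letter as a non-letter.
caught = re.findall(r"[^a-z]", word)

# The extended boundary class from the text no longer catches it.
boundary = r"[^a-ząćęłńóśźżĄĆĘŁŃÓŚŹŻ]"
caught_extended = re.findall(boundary, word)

# The 'any Polish word' class applied to a short sample phrase.
polish_word = r"[a-zA-ZąćęłńóśźżĄĆĘŁŃÓŚŹŻ]+"
tokens = re.findall(polish_word, "Dzień dobry, życie!")

# \w+ with and without Unicode awareness.
unicode_words = re.findall(r"\w+", "dzień życie")
ascii_words = re.findall(r"\w+", "dzień życie", flags=re.ASCII)
```

With ASCII-only matching the words are cut at the Polish letters, which is the kind of damage the extended character classes are meant to prevent.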

5. BIGRAM PLUS and Trigram Frame Counter

A. View Yasumasa Someya's original help page for BIGRAM PLUS

B. Help with regular expressions.

C. Examples (English and Polish) -- all refer to case-insensitive queries:

Unit 1	Intervening words (Trigram Frame Counter =1, always)	Unit 2	BIGRAM PLUS Result	Trigram Frame Counter Result
of	0	the	number of "of the" bigrams (2-word clusters) in the corpus	number of "of _ the" frames and/or freq list of all nucleus words
(tell|tells|telling|told)	3	about	frequency list of 2-, 3- and 4-word clusters where a form of the verb TELL precedes about	freq list of tell _ about, tells _ about, telling _ about and told _ about frames and/or freq list of their nucleus words
\w+	0	life	bigrams ending with "life"	freq list of all <word> _ life frames and/or freq list of all their nucleus words
\w+	0	dzień	Polish bigrams ending with dzień	freq list of all <word> _ dzień frames and/or freq list of all their nucleus words
\w+\s+\w+	0	life	freq list of 3-word clusters ending with the wordform life	freq list of all <word> <word> _ life 4-gram frames and/or freq list of all nucleus words
\w+\s+\w+	0	życie	freq list of Polish 3-word clusters ending with the wordform życie	freq list of all <word> <word> _ życie 4-gram frames and/or freq list of all nucleus words
\w+\s+\w+\s+\w+	0	life	freq list of 4-word clusters ending with life	freq list of all <word> <word> <word>_ life 5-gram frames and/or freq list of all nucleus words
\w+\s+\w+\s+\w+	0	życie	freq list of Polish 4-word clusters ending with the wordform życie	freq list of all <word> <word> <word>_ życie 5-gram frames and/or freq list of all nucleus words
way\w*\s+of	0	\w+	frequency list of 3-word clusters beginning with WAY of, e.g. way of life, way of living, ways of spending	freq list of all way of _ <word> 4-gram frames and/or freq list of all nucleus words
pod\s+względem	0	\w+	frequency list of 3-word clusters beginning with pod względem or Pod względem	freq list of all pod względem _ <word> 4-gram frames and/or freq list of all nucleus words
(tell|tells|telling|told)\s+\w+	0	about	frequency list of only 3-word combinations in which exactly one word separates a wordform of TELL and the preposition about	freq list of tell <word> _ about, tells <word> _ about, telling <word> _ about and told <word> _ about 4-gram frames and/or freq list of their nucleus words
\w+	0	\w+	all bigrams, except words with hyphens and/or apostrophes; SLOW	all 3-gram frames of the form <word> _ <word> and freq list of all nucleus words that occur in them
\w+\s+\w+	0	\w+	all trigrams, except words with hyphens and/or apostrophes; SLOW	all 4-gram frames of the form <word> <word>_ <word> and freq list of all nucleus words that occur in them
\w+	0	\w+\s+\w+	as above	all 4-gram frames of the form <word> _ <word> <word> and freq list of all nucleus words that occur in them
\w+\s+\w+\s+\w+	0	\w+	all 4-word clusters, except words with hyphens and/or apostrophes; SLOW	all 5-gram frames of the form <word> <word> <word>_ <word> and freq list of all nucleus words that occur in them
\w+\s+\w+	0	\w+\s+\w+	as above	all 5-gram frames of the form <word> <word> _ <word> <word> and freq list of all nucleus words that occur in them
\w+	0	\w+\s+\w+\s+\w+	as above	all 5-gram frames of the form <word> _ <word> <word> <word> and freq list of all nucleus words that occur in them
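
The relationship between the two result columns can be sketched in a few lines of Python. This is not the actual implementation of BIGRAM PLUS or the Trigram Frame Counter; it only assumes that Unit 1 and Unit 2 are regular expressions separated by whitespace, that queries are case-insensitive, and that the frame counter leaves exactly one word (the nucleus) between the units. The function names and sample texts are hypothetical.

```python
import re
from collections import Counter

def count_bigrams(text, unit1, unit2):
    """Count clusters in which unit1 is directly followed by unit2."""
    # The lookahead lets overlapping clusters be counted as well.
    pattern = re.compile(
        r"\b(?=(?P<u1>{})\s+(?P<u2>{})\b)".format(unit1, unit2), re.IGNORECASE)
    return Counter(
        "{} {}".format(m.group("u1"), m.group("u2")).lower()
        for m in pattern.finditer(text))

def count_frames(text, unit1, unit2):
    """Count 'unit1 _ unit2' frames and the nucleus words that fill the slot."""
    pattern = re.compile(
        r"\b(?=(?P<u1>{})\s+(?P<nucleus>\w+)\s+(?P<u2>{})\b)".format(unit1, unit2),
        re.IGNORECASE)
    frames, nuclei = Counter(), Counter()
    for m in pattern.finditer(text):
        frames["{} _ {}".format(m.group("u1"), m.group("u2")).lower()] += 1
        nuclei[m.group("nucleus").lower()] += 1
    return frames, nuclei

# Hypothetical sample texts.
ways = "The way of life and the way of living. Ways of spending the day."
telling = "He told me about it. She tells us about life. Tell them about us."

clusters = count_bigrams(ways, r"way\w*\s+of", r"\w+")
frames, nuclei = count_frames(telling, r"(tell|tells|telling|told)", r"about")
```

In this sketch punctuation breaks a cluster, since the units must be separated by whitespace only, and all counts are reported in lowercase to reflect case-insensitive querying.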

Page maintained by Przemysław Kaszubski
Last update: 2004-11-09