PK Concordancer: Tips with Regex queries

PATTERN	THE RESULTING NODE
\beffect\b	only occurrences of the word-form effect
effect	effect on its own and also when occurring inside a longer word (often a derivative)
\b\w+effect\w+\b	effect only when occurring inside a longer word (often a derivative)
\b\w*effect\w*\b	any word containing effect, including effect itself (potential family of word-forms)

\bcounter\w*	prefix counter, including counter itself
\bcounter\w+	prefix counter, excluding counter itself
\b\w+ment\b	suffix: all words ending in ment
\b\w+ish\w+\b	infix: words with ish in the middle
\b\w+-\w+\b	any hyphenated compound
\bas a matter of\b	phrase as a matter of, with single spaces between words
\bas\s+a\s+matter\s+of\b	phrase as a matter of, with possibly multiple spaces between words
\bas a \w+ of\b	frame as a X of
\bas\s+a\s+\w+\s+of\b	frame as a X of, tolerating multiple spaces between words
\b(suggestion|suggestions)\b	lemma SUGGESTION
\b(go|went|gone|goes)\b	lemma GO
\b(go(|es|ne)|went)\b	lemma GO [another option for the above]
\blast (\w+ ){0,4}least\b	collocation: last followed by least with up to 4 intervening words
\b(go|went|gone|goes) ([a-zA-Z\-]+[^a-zA-Z]+){0,4}mad\b	lemmatised collocation GO mad with up to 4 words in between
^\t	tab indentation (often marks the beginning of a paragraph) {temporarily off}
^The\b	paragraph beginning with The
\boff\.$	paragraph ending with off. {temporarily off}
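
The English patterns above can be tried outside the concordancer. The following is a minimal sketch in Python, assuming that Python's re module treats these patterns the same way as the concordancer's Perl regex engine; the sample sentence is invented for illustration.

```python
import re

# Invented sample paragraph, for illustration only.
text = ("The side effects were ineffective; the effect, last but not least, "
        "was that he went completely mad.")

# Word-form only: the boundaries \b exclude 'effects' and 'ineffective'.
wordform = re.findall(r"\beffect\b", text)

# Whole word family: \w* allows zero or more word characters on either side.
family = re.findall(r"\b\w*effect\w*\b", text)

# Collocation with up to 4 intervening words; group(0) is the full match.
colloc = re.search(r"\blast (\w+ ){0,4}least\b", text).group(0)

# Lemmatised collocation: any listed form of GO, then 'mad' within 4 words.
go_mad = re.search(
    r"\b(go|went|gone|goes) ([a-zA-Z\-]+[^a-zA-Z]+){0,4}mad\b", text
).group(0)

print(wordform)  # ['effect']
print(family)    # ['effects', 'ineffective', 'effect']
print(colloc)    # last but not least
print(go_mad)    # went completely mad
```

Note that the alternation is written with a plain vertical stroke; a backslash before it would make the stroke a literal character.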

Searching text-only corpora	Searching POS-tagged corpora
useful cut-and-paste	tagset info
English examples	useful cut-and-paste
Polish examples	POS search examples

Operator	Function
[x-y]	square brackets are used to indicate ranges of characters (letters, digits, etc) from x to y
|	a vertical stroke separates alternative strings
()	round brackets enable the user to group alternative strings especially if these are meant to be words
{x,y}	means preceding letter, range or group must repeat at least x times and no more than y times
?	means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional
*	means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional and may repeat any number of times
+	means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) MUST occur at least once and may repeat
^	marks the beginning of paragraph
$	marks the end of paragraph
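
The operators above can be checked with a short Python sketch. It assumes that Python's re module behaves like the concordancer's engine for these operators, and that each paragraph occupies one line, so that ^ and $ (with the MULTILINE flag) mark paragraph edges. All sample strings are invented.

```python
import re

# ? makes the preceding character optional.
optional = re.findall(r"\bcolou?r\b", "color colour colouur")

# {x,y} bounds the number of repetitions: words of 2 to 3 lowercase letters.
repeated = re.findall(r"\b[a-z]{2,3}\b", "a an the word")

# * allows zero or more repetitions, + requires at least one.
zero_or_more = re.findall(r"\bcounter\w*", "counter counteract encounter")
one_or_more = re.findall(r"\bcounter\w+", "counter counteract encounter")

# ^ and $ anchor to paragraph edges (one paragraph per line is assumed here).
paragraphs = "The cat sat.\nThen it left.\nLights off."
starts = re.findall(r"^The\b", paragraphs, flags=re.MULTILINE)
ends = re.findall(r"\boff\.$", paragraphs, flags=re.MULTILINE)

print(optional)      # ['color', 'colour']
print(repeated)      # ['an', 'the']
print(zero_or_more)  # ['counter', 'counteract']
print(one_or_more)   # ['counteract']
print(starts)        # ['The']
print(ends)          # ['off.']
```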

Symbol(s)	Meaning
[[:punct:]]	any punctuation mark
\w	any single word character (letter, digit or underscore)
[a-z]+	any lowercase word (no digits, no internal hyphens or apostrophes)
[A-Za-z]+	any word, regardless of case (no digits, no internal hyphens or apostrophes)
[A-Za-z\-]+	any word, also hyphenated
[A-Za-z\']+	any word, also with internal apostrophe(s)
[A-Za-z\-\']+	any word, also hyphenated or contracted
\W+	any sequence of non-word characters (spaces, punctuation, etc.)
[^\w\']+	any sequence of non-word characters (spaces, punctuation) except an apostrophe
(word1|word2|word3|...)	alternative forms of a word or alternative words (case sensitive)
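
The difference between the symbol classes above is easiest to see on a single string. This is a minimal sketch, assuming Python's re module as a stand-in for the concordancer's engine; the sample string is invented.

```python
import re

sample = "It's a well-known fact_2."

# Letters only: apostrophes, hyphens, digits and underscores split the words.
letters_only = re.findall(r"[A-Za-z]+", sample)

# Letters plus hyphen and apostrophe keep contractions and compounds whole.
with_hyphen_apos = re.findall(r"[A-Za-z\-\']+", sample)

# \w also accepts digits and the underscore.
word_chars = re.findall(r"\w+", sample)

# \W+ matches the stretches between words, apostrophe included.
gaps = re.findall(r"\W+", sample)

# [^\w\']+ does the same but leaves apostrophes alone.
gaps_keep_apos = re.findall(r"[^\w\']+", sample)

print(letters_only)      # ['It', 's', 'a', 'well', 'known', 'fact']
print(with_hyphen_apos)  # ["It's", 'a', 'well-known', 'fact']
print(word_chars)        # ['It', 's', 'a', 'well', 'known', 'fact_2']
print(gaps)              # ["'", ' ', ' ', '-', ' ', '.']
print(gaps_keep_apos)    # [' ', ' ', '-', ' ', '.']
```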

PATTERN	RETRIEVES
\bjesteś\b	wordform jesteś
\bpolsk	morpheme polsk plus any word or phrase of which polsk is part, the central highlight (and sort if available) applies only to the string 'polsk'
\bpolsk\w+\b	any (often derivative) word containing polsk, the central highlight (and sort if available) applies to the complete words
\bdługo\w*\b	prefix długo, inclusive of długo itself
\bdługo\w+\b	prefix, exclusive of długo
\b\w+polskiej\b	suffix polskiej
\b\w+byś\w+\b	infix byś
\b\w+-\w+\b	any hyphenated compound
\bz tym\b	phrase z tym
\bu\w+ się\b	any phrase like u... się
\bna\W+\w+\W+że\b	frame na X że
\bnik(t|ogo|omu|im)\b	lemma NIKT
\bwniosek(\W+\w+){0,4}\W+który\b	collocation: wniosek followed by który with up to 4 words in between
\b(czas|czasu|czasowi|czasem|czasie|czasy|czasów|czasom|czasami|czasach)(\W+\w+){0,4}\W+któr\w+\b	lemmatised collocation: lemma CZAS followed by a relative pronoun któr... with up to 4 words intervening
^\t	tab indentation (often marks the beginning of a paragraph)
^Jeden\b	paragraph beginning with the word Jeden
\bnie\.$	paragraph ending with the word nie.
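
The Polish patterns above rely on \w and \b recognising letters such as ł, ś and ó. A minimal sketch, assuming Python 3 strings (where \w is Unicode-aware by default) in place of the concordancer's own engine; whether the concordancer handles Polish characters identically is an assumption, and the sample sentences are invented.

```python
import re

text = ("nikt nie pomógł nikomu, bo nikogo nie było. "
        "wniosek komisji, który odrzucono. "
        "długo szukał długopisu.")

# Lemma NIKT: finditer plus group(0) returns whole matches, not just the group.
nikt_forms = [m.group(0) for m in re.finditer(r"\bnik(t|ogo|omu|im)\b", text)]

# Prefix długo, inclusive and exclusive of the bare form.
dlugo_incl = re.findall(r"\bdługo\w*\b", text)
dlugo_excl = re.findall(r"\bdługo\w+\b", text)

# Collocation: wniosek ... który with up to 4 intervening words.
colloc = re.search(r"\bwniosek(\W+\w+){0,4}\W+który\b", text).group(0)

print(nikt_forms)  # ['nikt', 'nikomu', 'nikogo']
print(dlugo_incl)  # ['długo', 'długopisu']
print(dlugo_excl)  # ['długopisu']
print(colloc)      # wniosek komisji, który
```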

[A-Za-z\-]+ [A-Z]+\([a-z,]+\)	ANY WORD (possibly hyphenated) WITH ANY POS TAG (= empty slot between successively tagged words)
([A-Za-z\-]+ [A-Z]+\([a-z,]+\)){0,4}	0-4 occurrences of ANY WORD-TAG combinations (= up to 4 words intervening between searched specified elements)
[A-Z]+\([a-z,]+\)	ANY TAG
[A-Za-z\-]+	ANY WORD (possibly hyphenated, after which a tag occurs)
\([a-z,]+\)	ANY FEATURES FOLLOWING A WORDCLASS LABEL
[A-Z]+	Any Wordclass label -- must be followed by \(respective,features\)

PK's Perl Concordancer

PATTERN	RETRIEVES
ADJ	Wordclass tag: any adjective
(ADJ\(ge,pos,\w*\) )	Wordclass tag with specific feature(s): any general adjective in the positive degree
(ingp\)|edp\))	Feature tag(s): any -ing or past participle forms
\w*late\w* VB	tag meeting lexical criteria: any <verb> containing the string/morpheme late
\bspread [A-Z]+	word-tag pair: spread + wordclass_info
\bof [A-Z]+\([a-z,]+\) \w+ N	immediate colligation: of + <noun>
\bof [A-Z]+\([a-z,]+\) \w+ N\(sing\)\s+	of + <singular noun>
\bof [A-Z]+\([a-z,]+\) \w+ N\(sing[,a-z]*\)\s+	of + <noun: singular or singular/collective>
(be|am|are|is|was|were) [A-Z]+\([a-z,]+\) ([A-Za-z\-\']+ [A-Z]+\([a-z,]+\)){0,2} [A-Za-z\-]+ ADJ	lemmatised colligation: lemma BE + <adj> with up to 2 words intervening
(be|am|are|is|was|were) [A-Z]+\([a-z,]+\) ([A-Za-z\-\']+ [A-Z]+\([a-z,]+\)){0,1} [A-Za-z\-]+ ADJ\([a-z,]+\) [A-Za-z\-]+ ([^N])	lemma BE + <adj> with up to 1 word intervening and no noun following the <adj>
ADJ\([a-z,]+\) [A-Za-z\-]+ N	contiguous tag combination: <adj>+<noun> combinations
ADJ\([a-z,]+\) ([A-Za-z\-]+ [A-Z]+\([a-z,]+\)){0,2} [A-Za-z\-]+ N	non-contiguous tag combination: <adj>+<noun> with up to 2 words intervening
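
The simpler POS patterns above can be tested on a hand-made tagged line. This sketch assumes a corpus layout of word TAG(features) pairs separated by single spaces, which is what the patterns imply; the sample line and the labels ART and PREP, along with their features, are invented for illustration and are not taken from the actual tagset. Python's re module stands in for the concordancer's engine.

```python
import re

# Hypothetical tagged line in the assumed 'word TAG(features)' layout.
tagged = ("a ART(indef) cup N(sing) of PREP(ge) coffee N(sing) "
          "with PREP(ge) strong ADJ(ge,pos) flavour N(sing)")

# ANY TAG: an uppercase wordclass label followed by bracketed features.
tags = re.findall(r"[A-Z]+\([a-z,]+\)", tagged)

# Immediate colligation: of + <noun>.
of_noun = re.search(r"\bof [A-Z]+\([a-z,]+\) \w+ N", tagged).group(0)

# Contiguous tag combination: <adj> + <noun>.
adj_noun = re.search(r"ADJ\([a-z,]+\) [A-Za-z\-]+ N", tagged).group(0)

print(tags)
print(of_noun)   # of PREP(ge) coffee N
print(adj_noun)  # ADJ(ge,pos) flavour N
```

Because the word belonging to a tag precedes it, a match that starts at a tag (as in the adjective pattern) highlights the tag and the following word rather than the adjective itself.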