PK's Perl Concordancer

Regular Expression Searches

Basic help on Perl regular expressions.
Exhaustive account of Perl Regexes.
Get practice using Regexes.

General principles
Important operators

Searching text-only corpora: useful cut-and-paste sequences, English examples, Polish examples
Searching POS-tagged corpora: tagset info, useful cut-and-paste combinations, POS search examples



1. General principles:


a) By default, a search matches a character string anywhere, including inside longer words. Put \b on both sides of the search pattern to match complete words or phrases only.

b) Use \w* to indicate optional suffixes / word-parts.

c) Use \w+ (with spaces or boundaries around as required) to capture any obligatory word or word-part (one that contains no hyphens or apostrophes).

d) Use [\w\-\']+ (with spaces or boundaries around as required) to represent a word that may also contain internal hyphen(s) and/or apostrophe(s); \w on its own already covers digits.

e) Use a single space to separate words in general.

f) To capture one or more non-word characters between words (spaces and punctuation), use \W+

g) For Polish, set the character encoding in your browser to Central European (ISO).
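The principles above can be tried out in a short sketch. It uses Python's re module as a stand-in, on the assumption that these particular patterns behave the same way there as in Perl; the sample sentence is invented and the concordancer itself is not involved.

```python
import re

# Invented sample sentence, for illustration only.
text = "The effect was ineffective; side-effects don't count as effects."

# (a) A bare string matches anywhere, including inside longer words.
substring_hits = re.findall(r"effect", text)

# (a) \b on both sides restricts the match to the complete word.
whole_word_hits = re.findall(r"\beffect\b", text)

# (b) \w* allows an optional suffix after the search string.
with_suffix = re.findall(r"\beffect\w*", text)

# (d) [\w\-\']+ keeps hyphenated and contracted words in one piece.
tokens = re.findall(r"[\w\-\']+", text)

print(len(substring_hits), whole_word_hits, with_suffix)
print(tokens)
```

The bare string finds four occurrences, while the \b version finds only the free-standing word effect.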


2. Important operators:




[x-y] square brackets are used to indicate ranges of characters (letters, digits, etc) from x to y
| a vertical stroke separates alternative strings
() round brackets enable the user to group alternative strings especially if these are meant to be words
{x,y} means the preceding character, range or group must repeat at least x times and no more than y times
? means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional
* means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional and may repeat any number of times
+ means the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) must occur at least once and may repeat
^ marks the beginning of a paragraph
$ marks the end of a paragraph
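As a rough illustration of these operators, the sketch below applies them to one invented paragraph using Python's re module, which shares this part of the syntax with Perl. The (?: ) form is a non-capturing group (also valid in Perl), used here only so that findall returns whole matches.

```python
import re

# Invented paragraph; the whole string stands in for one paragraph.
paragraph = "Colour or color? In 1999 and 2004 the editors went, then gone they were."

# [x-y] with {x,y}: a run of exactly four digits, delimited as a word.
years = re.findall(r"\b[0-9]{4}\b", paragraph)

# ? makes the preceding character optional.
spellings = re.findall(r"\b[Cc]olou?r\b", paragraph)

# | inside a group lists alternative words.
go_forms = re.findall(r"\b(?:go|goes|went|gone)\b", paragraph)

# ^ and $ anchor the pattern to the start and end of the paragraph.
starts_with_colour = re.search(r"^Colour\b", paragraph) is not None
ends_with_were = re.search(r"\bwere\.$", paragraph) is not None

print(years, spellings, go_forms, starts_with_colour, ends_with_were)
```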


1. Text-only search

C. Useful sequences for cut and paste:



[[:punct:]] any punctuation mark
\w any single letter, digit or underscore
[a-z]+ any lowercase word (no digits, no internal hyphens or apostrophes)
[A-Za-z]+ any word, regardless of case (no digits, no internal hyphens or apostrophes)
[A-Za-z\-]+ any word, also hyphenated
[A-Za-z\']+ any word, also with internal apostrophe(s)
[A-Za-z\-\']+ any word, also hyphenated or contracted
\W+ any sequence of non-word characters between words (spaces and any punctuation)
[^\w\']+ any sequence of non-word characters between words, including punctuation except an apostrophe
(word1|word2|word3|...) alternative forms of a word or alternative words (case sensitive)
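The difference between these word classes shows up most clearly on a sentence with hyphens and apostrophes. The following sketch uses Python's re module and an invented sentence. Python's re does not support [[:punct:]], so the punctuation class is approximated from string.punctuation (ASCII only); that substitution is an assumption of this sketch, not something the concordancer does.

```python
import re
import string

# Invented sample sentence.
sentence = "Well-known authors don't stop; they re-write!"

# [A-Za-z]+ breaks words at hyphens and apostrophes.
plain = re.findall(r"[A-Za-z]+", sentence)

# [A-Za-z\-\']+ keeps hyphenated and contracted words whole.
extended = re.findall(r"[A-Za-z\-\']+", sentence)

# [^\w\']+ used as a separator: splits on spaces, hyphens and other
# punctuation, but not on apostrophes.
pieces = [p for p in re.split(r"[^\w\']+", sentence) if p]

# Rough ASCII stand-in for Perl's [[:punct:]].
punct_class = "[" + re.escape(string.punctuation) + "]"
marks = re.findall(punct_class, sentence)

print(plain)
print(extended)
print(pieces)
print(marks)
```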



D. English examples:



\beffect\b only occurrences of the word-form effect
effect effect on its own and also when occurring inside a longer word (often a derivative)
\b\w+effect\w+\b effect only when embedded inside a longer word, with at least one character both before and after it (e.g. ineffective)
\b\w*effect\w*\b any word containing effect including effect itself (potential family of wordforms)
\bcounter\w* prefix counter, including counter itself
\bcounter\w+ prefix counter, excluding counter itself
\b\w+ment\b suffix: all words ending in ment
\b\w+ish\w+\b infix: words with ish in the middle
\b\w+-\w+\b any hyphenated compound
\bas a matter of\b phrase as a matter of, with single spaces between words
\bas\s+a\s+matter\s+of\b phrase as a matter of, with possibly multiple spaces between words
\bas a \w+ of\b frame as a X of
\bas\s+a\s+\w+\s+of\b frame as a X of, tolerating multiple spaces between words
\b(suggestion|suggestions)\b lemma SUGGESTION
\b(go|went|gone|goes)\b lemma GO (going is not covered; add |going to include it)
\b(go(|es|ne)|went)\b lemma GO [a more compact equivalent of the above]
\blast (\w+ ){0,4}least\b collocation: last followed by least with up to 4 intervening words
\b(go|went|gone|goes) ([a-zA-Z\-]+[^a-zA-Z]+){0,4}mad\b lemmatised collocation GO mad with up to 4 words in between
^\t tab indentation (often marks the beginning of a paragraph) {temporarily off}
^The\b paragraph beginning with The
\boff\.$ paragraph ending with off. {temporarily off}
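A few of the English queries above can be checked against a small invented corpus. This is only a sketch using Python's re module. The IGNORECASE flag on the frame query is an addition made here so that sentence-initial As is also found, and the capturing group around \w+ is added to extract the slot filler.

```python
import re

# Invented mini-corpus, one paragraph per string.
corpus = [
    "As a matter of fact, he came last but not least.",
    "She acted as a kind of guide, and he went completely mad.",
    "They have gone stark raving mad as a result of it.",
]

# Frame "as a X of"; the group captures the word filling the X slot.
frame = re.compile(r"\bas a (\w+) of\b", re.IGNORECASE)

# Collocation: last followed by least with up to 4 intervening words.
last_least = re.compile(r"\blast (\w+ ){0,4}least\b")

# Lemmatised collocation: GO ... mad with up to 4 words in between.
go_mad = re.compile(r"\b(go|went|gone|goes) ([a-zA-Z\-]+[^a-zA-Z]+){0,4}mad\b")

fillers = [m.group(1) for line in corpus for m in frame.finditer(line)]
last_least_hits = [m.group(0) for line in corpus for m in last_least.finditer(line)]
go_mad_hits = [m.group(0) for line in corpus for m in go_mad.finditer(line)]

print(fillers)
print(last_least_hits)
print(go_mad_hits)
```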


E. Polish examples:



\bjesteś\b wordform jesteś
\bpolsk morpheme polsk at the start of any word; the central highlight (and the sort, if available) applies only to the string 'polsk'
\bpolsk\w+\b any (often derivative) word beginning with polsk; the central highlight (and the sort, if available) applies to the complete word
\bdługo\w*\b prefix długo, inclusive of długo itself
\bdługo\w+\b prefix, exclusive of długo
\b\w+polskiej\b suffix polskiej
\b\w+byś\w+\b infix byś
\b\w+-\w+\b any hyphenated compound
\bz tym\b phrase z tym
\bu\w+ się\b any phrase like u... się
\bna\W+\w+\W+że\b frame na X że
\bnik(t|ogo|omu|im)\b lemma NIKT
\bwniosek(\W+\w+){0,4}\W+który\b collocation: wniosek followed by który with up to 4 words in between
\b(czas|czasu|czasowi|czasem|czasie|czasy|czasów|czasom|czasami|czasach)(\W+\w+){0,4}\W+któr\w+\b lemmatised collocation: lemma CZAS followed by a relative pronoun któr... with up to 4 words intervening
^\t tab indentation (often marks the beginning of a paragraph)
^Jeden\b paragraph beginning with the word Jeden
\bnie\.$ paragraph ending with the word nie.
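The Polish queries rely on \w and \b treating letters such as ł, ę and ó as word characters. The sketch below assumes the text is handled as Unicode, which is the default for Python 3 strings; how the concordancer handles encodings internally is not shown here. The sentences are invented.

```python
import re

# Invented sample sentences joined into one text.
sentences = [
    "To nikogo nie dziwi, bo nikt tego nie widział.",
    "Zarząd odrzucił wniosek o dotację, który złożono wczoraj.",
    "Czekał długo na długotrwały efekt.",
]
text = " ".join(sentences)

# Lemma NIKT: finditer is used so that the whole word form is returned,
# not just the captured ending.
nikt_forms = [m.group(0) for m in re.finditer(r"\bnik(t|ogo|omu|im)\b", text)]

# Collocation: wniosek followed by który with up to 4 words in between.
collocation = re.search(r"\bwniosek(\W+\w+){0,4}\W+który\b", text)

# Prefix długo, inclusive of długo itself.
dlugo_words = re.findall(r"\bdługo\w*\b", text)

print(nikt_forms)
print(collocation.group(0))
print(dlugo_words)
```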


2. Searching English POS-tagged corpora


A. Click here to find out about the TAGSET used.

B. Follow the general principles and important operators used with plain-text English corpora.

C. Useful combinations for cut and paste:

[A-Za-z\-]+ [A-Z]+\([a-z,]+\) ANY WORD (possibly hyphenated) WITH ANY POS TAG (= empty slot between successively tagged words)
([A-Za-z\-]+ [A-Z]+\([a-z,]+\) ){0,4} 0-4 occurrences of ANY WORD-TAG combination, each followed by a space (= up to 4 words intervening between the specified search elements)
[A-Z]+\([a-z,]+\) ANY TAG
[A-Za-z\-]+ ANY WORD (possibly hyphenated, after which a tag occurs)
[A-Z]+ Any Wordclass label -- must be followed by \(respective,features\)
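To see how these building blocks fit together, the sketch below runs them over one invented line in the "word TAG(features)" layout that the patterns imply. The sample line and its feature names are placeholders made up for illustration, not taken from the actual tagset, and the names TAG and WORD are hypothetical helpers.

```python
import re

# Invented line; words and tags alternate, separated by single spaces.
tagged = "the ART(def) old ADJ(ge,pos) house N(com,sing) of PREP(ge) cards N(com,plu)"

TAG = r"[A-Z]+\([a-z,]+\)"   # ANY TAG
WORD = r"[A-Za-z\-]+"        # ANY WORD (possibly hyphenated)

# ANY WORD WITH ANY POS TAG; the groups split each pair into word and tag.
pairs = re.findall(rf"({WORD}) ({TAG})", tagged)

# ANY TAG, then the bare wordclass label in front of the bracket.
tags = re.findall(TAG, tagged)
wordclasses = [t.split("(")[0] for t in tags]

print(pairs)
print(wordclasses)
```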


D. Examples of queries:

(for some queries it may be advisable to use the case-sensitive search)



ADJ Wordclass tag: any adjective
(ADJ\(ge,pos,\w*\) ) Wordclass tag with specific feature(s): any general adjective in the positive degree
(ingp\)|edp\)) Feature tag(s): any -ing or past participle forms
\w*late\w* VB tag meeting lexical criteria: any <verb> containing the string/morpheme late
\bspread [A-Z]+ word-tag pair: spread + wordclass_info
\bof [A-Z]+\([a-z,]+\) \w+ N immediate colligation: of + <noun>
\bof [A-Z]+\([a-z,]+\) \w+ N\(sing\)\s+ of + <singular noun>
\bof [A-Z]+\([a-z,]+\) \w+ N\(sing[,a-z]*\)\s+ of + <noun: singular or singular/collective>
(be|am|are|is|was|were) [A-Z]+\([a-z,]+\) ([A-Za-z\-\']+ [A-Z]+\([a-z,]+\) ){0,2}[A-Za-z\-]+ ADJ lemmatised colligation: lemma BE + <adj> with up to 2 words intervening
(be|am|are|is|was|were) [A-Z]+\([a-z,]+\) ([A-Za-z\-\']+ [A-Z]+\([a-z,]+\) ){0,1}[A-Za-z\-]+ ADJ\([a-z,]+\) [A-Za-z\-]+ ([^N]) lemma BE + <adj> with up to 1 word intervening and no noun following the <adj>
ADJ\([a-z,]+\) [A-Za-z\-]+ N contiguous tag combination: <adj>+<noun> combinations
ADJ\([a-z,]+\) ([A-Za-z\-]+ [A-Z]+\([a-z,]+\) ){0,2}[A-Za-z\-]+ N non-contiguous tag combination: <adj>+<noun> with up to 2 words intervening
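Two of the colligation queries can be sketched on invented tagged lines. The sample lines and feature names are placeholders, not taken from the actual tagset. The sketch assumes word-tag pairs are separated by single spaces, so the separating space sits inside the repeated group. The leading \b, the trailing \b after N and the capturing groups are additions made here to avoid matches inside longer words and to pull out the words of interest.

```python
import re

# Invented tagged lines in the "word TAG(features)" layout.
lines = [
    "she PRON(pers) is V(cop,pres) happy ADJ(ge,pos)",
    "they PRON(pers) were V(cop,past) not ADV(neg) very ADV(inten) old ADJ(ge,pos)",
    "an ART(indef) old ADJ(ge,pos) house N(com,sing)",
]

TAG = r"[A-Z]+\([a-z,]+\)"
PAIR = rf"[A-Za-z\-\']+ {TAG} "   # one intervening word-tag pair plus its space

# Lemma BE + <adj> with up to 2 word-tag pairs intervening.
be_adj = re.compile(
    rf"\b(?:be|am|are|is|was|were) {TAG} (?:{PAIR}){{0,2}}([A-Za-z\-]+) ADJ"
)

# Contiguous <adj> + <noun>; \b after N excludes longer labels starting with N.
adj_noun = re.compile(r"([A-Za-z\-]+) ADJ\([a-z,]+\) ([A-Za-z\-]+) N\b")

adjectives = [m.group(1) for line in lines for m in be_adj.finditer(line)]
adj_noun_pairs = [m.groups() for line in lines for m in adj_noun.finditer(line)]

print(adjectives)
print(adj_noun_pairs)
```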


Page maintained by Przemysław Kaszubski

Last update 2004-11-04