Basic help on Perl regular expressions.
Exhaustive account of Perl Regexes.
Get practice using Regexes.
1. Searching text-only corpora
A. General principles
a) By default, a search matches character strings anywhere, including inside longer words. Put \b on both sides of the search pattern to capture complete words or phrases.
b) Use \w* to indicate optional suffixes or other optional word parts.
c) Use \w+ (with spaces or boundaries around it as required) to capture any obligatory word or word part (not containing hyphens or apostrophes).
d) Use [\w\-\']+ (with spaces or boundaries around it as required) to represent a word that may also contain internal hyphens, apostrophes and/or digits.
e) Use a single space to separate words in most queries.
f) To capture the material between words (one or more spaces and/or punctuation marks), use \W+
g) For Polish, the character encoding in your browser should be set to Central European (ISO), i.e. ISO-8859-2.
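The principles above can be tried outside the concordancer. What follows is a minimal sketch in Python, whose re module is used here only as a stand-in for the Perl regex engine (the operators shown behave the same way in both). The sample sentence and the variable names are invented for illustration.

```python
import re

# Invented sample sentence (not taken from any corpus).
text = "The well-known effect wasn't effective; after-effects counteract it."

# (a) A bare string matches inside longer words; \b...\b restricts it to the whole word.
substring_hits = re.findall(r"effect", text)
whole_word_hits = re.findall(r"\beffect\b", text)

# (b) \w* allows an optional suffix; (c) \w+ demands at least one more character.
optional_suffix = re.findall(r"\beffect\w*", text)
obligatory_suffix = re.findall(r"\beffect\w+", text)

# (d) [\w\-\']+ keeps hyphenated and apostrophised words in one piece.
tokens = re.findall(r"[\w\-\']+", text)

# (f) \W+ matches the runs of spaces and punctuation between words.
separators = re.findall(r"\W+", text)

for name, result in [("substring", substring_hits), ("whole word", whole_word_hits),
                     ("optional suffix", optional_suffix),
                     ("obligatory suffix", obligatory_suffix),
                     ("tokens", tokens), ("separators", separators)]:
    print(name, result)
```

Note that \w stops at the hyphen, so after-effects yields effects for the \beffect\w* query, while [\w\-\']+ returns the whole compound.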
B. Important operators
Operator | Function |
[x-y] | square brackets are used to indicate ranges of characters (letters, digits, etc) from x to y |
| | a vertical stroke separates alternative strings |
() | round brackets enable the user to group alternative strings especially if these are meant to be words |
{x,y} | means preceding letter, range or group must repeat at least x times and no more than y times |
? | means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional |
* | means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) is optional and may repeat any number of times |
+ | means preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2) ) MUST occur at least once and may repeat any number of times |
^ | marks the beginning of a paragraph |
$ | marks the end of a paragraph |
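The operators in this table can be seen at work in the following sketch. It is written in Python rather than Perl, on the assumption that each text unit searched is one paragraph, so every paragraph is held in its own string. The mini-corpus and the helper hits() are hypothetical.

```python
import re

# Hypothetical mini-corpus: each string stands for one paragraph.
paragraphs = [
    "The colour of the sky changed.",
    "Colors fade; the color returned.",
    "He said no. Then he said nooo",
]

def hits(pattern):
    """Return every match of pattern, searching paragraph by paragraph."""
    return [m.group(0) for p in paragraphs for m in re.finditer(pattern, p)]

optional_u = hits(r"\bcolou?r\b")        # ?     : the u may be absent
alternatives = hits(r"\b(sky|fade)\b")   # (a|b) : either word
char_range = hits(r"\b[A-Z][a-z]+\b")    # [x-y] : a capital, then lowercase letters
repeat_range = hits(r"\bno{2,4}\b")      # {x,y} : n followed by 2 to 4 o's
para_start = hits(r"^The\b")             # ^     : paragraph-initial only
para_end = hits(r"\bchanged\.$")         # $     : paragraph-final only

print(optional_u, alternatives, char_range, repeat_range, para_start, para_end)
```

Searches are case-sensitive here, which is why Colors is not returned by the colou?r query and why ^The does not pick up Then in the third paragraph.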
C. Useful sequences for cut and paste:
Symbol(s) | Meaning |
[[:punct:]] | any punctuation mark |
\w | any single letter, digit or underscore |
[a-z]+ | any lowercase word (no digits, no internal hyphens or apostrophes) |
[A-Za-z]+ | any word, regardless of case (no digits, no internal hyphens or apostrophes) |
[A-Za-z\-]+ | any word, also hyphenated |
[A-Za-z\']+ | any word, also with internal apostrophe(s) |
[A-Za-z\-\']+ | any word, also hyphenated or contracted |
\W+ | any sequence of non-word characters between words (spaces and any punctuation) |
[^\w\']+ | any sequence of non-word characters between words (spaces and punctuation), except apostrophes |
(word1|word2|word3|...) | alternative forms of a word or alternative words (case sensitive) |
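To compare what the individual sequences retrieve, the sketch below runs them over one invented sample. It uses Python's re module as a stand-in for Perl. Python has no [[:punct:]] class, so an approximation is built from string.punctuation (the ASCII punctuation characters). That substitution is an assumption of this sketch, not part of the concordancer.

```python
import re
import string

# Invented sample; not taken from any corpus.
sample = "Don't over-think it, O'Brien!"

lowercase = re.findall(r"[a-z]+", sample)
any_case = re.findall(r"[A-Za-z]+", sample)
with_hyphen = re.findall(r"[A-Za-z\-]+", sample)
with_apostrophe = re.findall(r"[A-Za-z\']+", sample)
with_both = re.findall(r"[A-Za-z\-\']+", sample)
gaps = re.findall(r"\W+", sample)
gaps_keep_apostrophe = re.findall(r"[^\w\']+", sample)

# Approximation of [[:punct:]]: a character class listing ASCII punctuation.
punct_class = "[" + re.escape(string.punctuation) + "]"
punctuation = re.findall(punct_class, sample)

print(lowercase, any_case, with_hyphen, with_apostrophe, with_both, sep="\n")
print(gaps, gaps_keep_apostrophe, punctuation, sep="\n")
```

Only [A-Za-z\-\']+ returns Don't, over-think and O'Brien as single words; the narrower classes split them at the hyphen or the apostrophe.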
D. English examples:
PATTERN | THE RESULTING NODE |
\beffect\b | only occurrences of the word-form effect |
effect | effect on its own and also when occurring inside a longer word (often a derivative) |
\b\w+effect\w+\b | effect only when occurring inside a longer word, with at least one character on each side (e.g. ineffective, but not effects) |
\b\w*effect\w*\b | any word containing effect including effect itself (potential family of wordforms) |
\bcounter\w* | prefix counter, including counter itself |
\bcounter\w+ | prefix counter query, excluding counter |
\b\w+ment\b | suffix: all words ending in ment |
\b\w+ish\w+\b | infix: words with ish in the middle |
\b\w+-\w+\b | any hyphenated compound |
\bas a matter of\b | phrase as a matter of, with single spaces between words |
\bas\s+a\s+matter\s+of\b | phrase as a matter of, with possibly multiple spaces between words |
\bas a \w+ of\b | frame as a X of |
\bas\s+a\s+\w+\s+of\b | frame as a X of, with possibly multiple spaces between words |
\b(suggestion|suggestions)\b | lemma SUGGESTION |
\b(go|goes|going|went|gone)\b | lemma GO |
\b(go(|es|ing|ne)|went)\b | lemma GO [another option for the above] |
\blast (\w+ ){0,4}least\b | collocation: last followed by least with up to 4 intervening words |
\b(go|goes|going|went|gone) ([a-zA-Z\-]+[^a-zA-Z]+){0,4}mad\b | lemmatised collocation GO mad with up to 4 words in between |
/\^t/ | tab indentation (often marks the beginning of a paragraph) {temporarily off} |
^The\b | paragraph beginning with The |
\boff\.$ | paragraph ending with off. {temporarily off} |
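Several of the English queries above can be checked against a short invented text, as in the following sketch. It uses Python's re module in place of Perl. The text and the helper whole() are hypothetical, and re.IGNORECASE stands in for a case-insensitive search setting, which is assumed rather than documented here.

```python
import re

# Invented text, written so that each query has something to find.
text = ("Last but not least, the counter-offer met a counterargument at the counter. "
        "She went completely mad, as a matter of fact, as a result of the foolishness.")

def whole(pattern, flags=0):
    """Return the full text of every match (the node), ignoring sub-groups."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

prefix_inclusive = whole(r"\bcounter\w*")      # counter itself and longer words
prefix_exclusive = whole(r"\bcounter\w+")      # only words longer than counter
suffix = whole(r"\b\w+ment\b")                 # words ending in ment
infix = whole(r"\b\w+ish\w+\b")                # ish in the middle of a word
hyphenated = whole(r"\b\w+-\w+\b")             # two-part hyphenated compounds
frame = whole(r"\bas a \w+ of\b")              # as a X of
collocation = whole(r"\blast (\w+ ){0,4}least\b", re.IGNORECASE)
go_mad = whole(r"\b(go|goes|going|went|gone) ([a-zA-Z\-]+[^a-zA-Z]+){0,4}mad\b")

print(prefix_inclusive, prefix_exclusive, suffix, infix, hyphenated, sep="\n")
print(frame, collocation, go_mad, sep="\n")
```

In the collocation queries the space sits inside the repeated group, (\w+ ){0,4}, so the group can repeat zero or several times without disturbing the spacing around it.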
E. Polish examples:
PATTERN | RETRIEVES |
\bjesteś\b | wordform jesteś |
\bpolsk | morpheme polsk at the beginning of any word; the central highlight (and sort, if available) applies only to the string 'polsk' |
\bpolsk\w+\b | any (often derivative) word beginning with polsk and continuing with at least one more character; the central highlight (and sort, if available) applies to the complete words |
\bdługo\w*\b | prefix długo, inclusive of długo itself |
\bdługo\w+\b | prefix, exclusive of długo |
\b\w+polskiej\b | suffix polskiej |
\b\w+byś\w+\b | infix byś |
\b\w+-\w+\b | any hyphenated compound |
\bz tym\b | phrase z tym |
\bu\w+ się\b | any phrase like u... się |
\bna\W+\w+\W+że\b | frame na X że |
\bnik(t|ogo|omu|im)\b | lemma NIKT |
\bwniosek(\W+\w+){0,4}\W+który\b | collocation: wniosek followed by który with up to 4 words in between |
\b(czas|czasu|czasowi|czasem|czasie|czasy|czasów|czasom|czasami|czasach)(\W+\w+){0,4}\W+któr\w+\b | lemmatised collocation: lemma CZAS followed by a relative pronoun któr... with up to 4 words intervening |
/\^t/ | tab indentation (often marks the beginning of a paragraph) |
^Jeden\b | paragraph beginning with the word Jeden |
\bnie\.$ | paragraph ending with the word nie. |
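The Polish queries rely on \w and \b treating letters such as ł, ś or ż as word characters. The sketch below uses Python 3, where this holds for ordinary (Unicode) strings. Whether another engine behaves the same way depends on its encoding settings, so treat this as an illustration rather than a description of the concordancer. The text and the helper whole() are invented.

```python
import re

# Invented Polish text, written so that each query has something to find.
text = ("Nikt nie zwrócił uwagi na to, że wniosek komisji sejmowej, "
        "który złożono wczoraj, długo czekał na długotrwałe rozpatrzenie. "
        "Nikomu to nie przeszkadzało.")

def whole(pattern, flags=0):
    """Return the full text of every match, ignoring sub-groups."""
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

prefix_inclusive = whole(r"\bdługo\w*\b")      # długo itself and longer words
prefix_exclusive = whole(r"\bdługo\w+\b")      # only words longer than długo
lemma_nikt = whole(r"\bnik(t|ogo|omu|im)\b", re.IGNORECASE)
frame = whole(r"\bna\W+\w+\W+że\b")            # na X że
collocation = whole(r"\bwniosek(\W+\w+){0,4}\W+który\b")

print(prefix_inclusive, prefix_exclusive, lemma_nikt, frame, collocation, sep="\n")
```

Using \W+ instead of a single space between words lets the frame and collocation queries match across the commas that Polish punctuation places before że and który.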
2. Searching English POS-tagged corpora
A. See the tagset information page to find out about the TAGSET used.
B. Follow the general principles and important operators used with plain-text English corpora.
C. Useful combinations for cut and paste:
[A-Za-z\-]+ [A-Z]+\([a-z,]+\) | ANY WORD (possibly hyphenated) WITH ANY POS-TAG (= empty slot between successively tagged words) |
([A-Za-z\-]+ [A-Z]+\([a-z,]+\) ){0,4} | 0-4 occurrences of ANY WORD-TAG combination, each followed by a space (= up to 4 words intervening between specified search elements) |
[A-Z]+\([a-z,]+\) | ANY TAG |
[A-Za-z\-]+ | ANY WORD (possibly hyphenated, after which a tag occurs) |
\([a-z,]+\) | ANY FEATURES FOLLOWING A WORDCLASS LABEL |
[A-Z]+ | Any Wordclass label -- must be followed by \(respective,features\) |
(for some queries it may be advisable to use the case-sensitive search)
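The building blocks above can be tested on a single tagged line, as sketched below in Python. The line format (word, space, TAG(features), space) and the tag names are assumptions inferred from the patterns themselves; the real tagset and corpus layout may differ, so consult the tagset information before relying on them.

```python
import re

# Hypothetical tagged line; tags and features are invented for illustration.
tagged = ("the ART(def) old ADJ(ge,pos) house N(sing) "
          "was V(cop,past) very ADV(ge) quiet ADJ(ge,pos) ")

any_tag = re.findall(r"[A-Z]+\([a-z,]+\)", tagged)
word_tag_pairs = re.findall(r"[A-Za-z\-]+ [A-Z]+\([a-z,]+\)", tagged)
features = re.findall(r"\([a-z,]+\)", tagged)
wordclass_labels = re.findall(r"[A-Z]+", tagged)   # case-sensitive search

print(any_tag, word_tag_pairs, features, wordclass_labels, sep="\n")
```

The [A-Z]+ query only isolates wordclass labels because the search is case-sensitive and the words in this sample are all lowercase; a capitalised word would be picked up as well.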
D. POS search examples:
PATTERN | RETRIEVES |
ADJ | Wordclass tag: any adjective |
(ADJ\(ge,pos,\w*\) ) | Wordclass tag with specific feature(s): any general adjective in the positive degree |
(ingp\)|edp\)) | Feature tag(s): any -ing or past participle forms |
\w*late\w* VB | tag meeting lexical criteria: any <verb> containing the string/morpheme late |
\bspread [A-Z]+ | word-tag pair: spread + wordclass_info |
\bof [A-Z]+\([a-z,]+\) \w+ N | immediate colligation: of + <noun> |
\bof [A-Z]+\([a-z,]+\) \w+ N\(sing\)\s+ | of + <singular noun> |
\bof [A-Z]+\([a-z,]+\) \w+ N\(sing[,a-z]*\)\s+ | of + <noun: singular or singular/collective> |
\b(be|been|being|am|are|is|was|were) [A-Z]+\([a-z,]+\) ([A-Za-z\-\']+ [A-Z]+\([a-z,]+\) ){0,2}[A-Za-z\-]+ ADJ | lemmatised colligation: lemma BE + <adj> with up to 2 words intervening |
\b(be|been|being|am|are|is|was|were) [A-Z]+\([a-z,]+\) ([A-Za-z\-\']+ [A-Z]+\([a-z,]+\) ){0,1}[A-Za-z\-]+ ADJ\([a-z,]+\) [A-Za-z\-]+ ([^N]) | lemma BE + <adj> with up to 1 word intervening and no noun following the <adj> |
ADJ\([a-z,]+\) [A-Za-z\-]+ N | contiguous tag combination: <adj>+<noun> combinations |
ADJ\([a-z,]+\) ([A-Za-z\-]+ [A-Z]+\([a-z,]+\) ){0,2}[A-Za-z\-]+ N | non-contiguous tag combination: <adj>+<noun> with up to 2 words intervening |
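A few of the colligation queries can be tried on invented tagged lines, as in the sketch below (Python standing in for Perl). The tags, the features and the single-space layout are assumptions made for this illustration only. The intervening word-tag group carries its own trailing space so that it can repeat zero, one or two times when tokens are separated by single spaces.

```python
import re

# Hypothetical tagged lines; tag names and features are invented.
line_a = ("the ART(def) old ADJ(ge,pos) roof N(sing) of PREP(ge) houses N(plu) "
          "is V(cop,pres) not ADV(neg) very ADV(ge) solid ADJ(ge,pos) ")
line_b = "she PRON(pers) was V(cop,past) happy ADJ(ge,pos) "

TAG = r"[A-Z]+\([a-z,]+\)"          # any tag with its features
BE = r"\b(be|been|being|am|are|is|was|were)"

def first_match(pattern, line):
    """Return the text of the first match in line, or None if there is none."""
    m = re.search(pattern, line)
    return m.group(0) if m else None

adj_noun = first_match(r"ADJ\([a-z,]+\) [A-Za-z\-]+ N", line_a)
of_noun = first_match(r"\bof " + TAG + r" \w+ N", line_a)

be_adj_pattern = BE + " " + TAG + r" ([A-Za-z\-\']+ " + TAG + r" ){0,2}[A-Za-z\-]+ ADJ"
be_adj_two_between = first_match(be_adj_pattern, line_a)
be_adj_none_between = first_match(be_adj_pattern, line_b)

print(adj_noun, of_noun, be_adj_two_between, be_adj_none_between, sep="\n")
```

The \b before of keeps the query from matching the last two letters of roof, and the \b before the forms of BE similarly avoids hits inside words such as this.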
Page maintained by Przemysław Kaszubski
Last update 2004-11-04