PK's Perl Concordancer

Regular Expression Searches

Basic help on Perl regular expressions.
Exhaustive account of Perl Regexes.
Get practice using Regexes.


General principles
Important operators

Searching text-only corpora: useful cut-and-paste, English examples, Polish examples
Searching POS-tagged corpora: tagset info, useful cut-and-paste, POS search examples

 


 

1. General principles:

 

a) By default, a search matches a character string anywhere, including inside longer words. Put \b on both sides of the search pattern to capture only complete words / phrases.

b) Use \w* to indicate an optional suffix or other word-part.

c) Use \w+ (with spaces or boundaries around it as required) to capture any obligatory word or word-part that contains no hyphens or apostrophes.

d) Use [\w\-\']+ (with spaces or boundaries around it as required) to represent a word that may also contain internal hyphen(s) and/or apostrophe(s); \w on its own already covers letters, digits and the underscore.

e) Use a single space as the normal separator between words.

f) To capture one or more non-word characters between words (spaces and any punctuation), use \W+

g) For Polish, the character encoding in your browser should be set to Central European (ISO), i.e. ISO-8859-2.
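The principles above can be tried out in a short sketch. It assumes Python's re module as a stand-in for the concordancer's Perl engine (the two agree on \b, \w, * and +); the sample sentence and the helper find() are invented for illustration and are not part of the concordancer.

```python
import re

# Hypothetical sample text, not taken from any corpus.
text = "The side-effects weren't effective; the effect was counterproductive."

def find(pattern, s=text):
    """Return the full text of every non-overlapping match of pattern in s."""
    return [m.group(0) for m in re.finditer(pattern, s)]

plain_string = find(r"effect")            # (a) bare string: also matches inside words
whole_word = find(r"\beffect\b")          # (a) \b on both sides: complete word only
optional_suffix = find(r"\beffect\w*")    # (b) suffix may be absent
obligatory_suffix = find(r"\beffect\w+")  # (c) suffix must be present
full_words = find(r"[\w\-\']+")           # (d) words incl. hyphens and apostrophes

for name in ("plain_string", "whole_word", "optional_suffix",
             "obligatory_suffix", "full_words"):
    print(name, globals()[name])
```

Note how \b also finds a boundary after the hyphen in side-effects, so the \beffect... patterns pick up the second half of the compound.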

 

2. Important operators:

 

Operator    Function

[x-y]       square brackets indicate a range of characters (letters, digits, etc.) from x to y
|           a vertical bar separates alternative strings
()          round brackets group alternative strings, especially when these are meant to be whole words
{x,y}       the preceding character, range or group must repeat at least x times and no more than y times
?           the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2)) is optional
*           the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2)) is optional and may repeat any number of times
+           the preceding character, symbol, range (e.g. [a-z]) or group (e.g. (word1|word2)) MUST occur at least once and may repeat
^           marks the beginning of a paragraph
$           marks the end of a paragraph
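A minimal sketch of the operators in action, assuming Python's re module behaves like the Perl engine for these constructs and treating one string as one paragraph. The sample paragraph and the helper find() are made up for the example.

```python
import re

# Hypothetical one-paragraph sample.
paragraph = "The colour of the color chart faded in 1999, then in 2004 it was gone."

def find(pattern, s=paragraph):
    """Return the full text of every non-overlapping match of pattern in s."""
    return [m.group(0) for m in re.finditer(pattern, s)]

optional_letter = find(r"colou?r")           # ?  : the u may be missing
alternatives = find(r"\b(faded|gone)\b")     # () and | : either whole word
four_digits = find(r"[0-9]{4}")              # [x-y] with an exact repeat count
short_words = find(r"\b[a-z]{2,3}\b")        # {x,y} : 2 to 3 lowercase letters
star = find(r"\bthe\w*")                     # *  : zero or more word characters
plus = find(r"\bthe\w+")                     # +  : at least one word character
starts_with_the = re.search(r"^The\b", paragraph) is not None   # ^ : paragraph start
ends_with_gone = re.search(r"gone\.$", paragraph) is not None   # $ : paragraph end

print(optional_letter, alternatives, four_digits, short_words, star, plus)
print(starts_with_the, ends_with_gone)
```

The searches are case sensitive here, which is why The at the start of the paragraph is not returned by \bthe\w*.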

 


1. Text-only search

C. Useful sequences for cut and paste:

Symbol(s)                  Meaning

[[:punct:]]                any punctuation mark
\w                         any single letter, digit or underscore
[a-z]+                     any lowercase word (no digits, no internal hyphens or apostrophes)
[A-Za-z]+                  any word, regardless of case (no digits, no internal hyphens or apostrophes)
[A-Za-z\-]+                any word, also hyphenated
[A-Za-z\']+                any word, also with internal apostrophe(s)
[A-Za-z\-\']+              any word, also hyphenated or contracted
\W+                        any sequence of non-word characters between words (spaces and any punctuation)
[^\w\']+                   any sequence of non-word characters between words, except the apostrophe
(word1|word2|word3|...)    alternative forms of a word or alternative words (case sensitive)
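To see how the word and separator classes in the table divide up a sentence, here is a small sketch. It assumes Python's re module in place of the Perl engine; [[:punct:]] is left out because Python's re does not support POSIX classes. The sample sentence is invented.

```python
import re

# Hypothetical sample sentence.
text = "It's a well-known fact, isn't it?"

letters_only = re.findall(r"[A-Za-z]+", text)        # splits at - and '
full_words = re.findall(r"[A-Za-z\-\']+", text)      # keeps - and ' inside words
separators = re.findall(r"\W+", text)                # everything between \w runs
separators_no_apos = re.findall(r"[^\w\']+", text)   # same, apostrophe excluded

print(letters_only)
print(full_words)
print(separators)
print(separators_no_apos)
```

Comparing the last two lists shows why [^\w\']+ is the safer separator when contracted forms such as isn't should stay in one piece.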

 

 

D. English examples:

PATTERN                      THE RESULTING NODE

\beffect\b                   only occurrences of the word-form effect
effect                       effect on its own and also when occurring inside a longer word (often a derivative)
\b\w+effect\w+\b             effect only when occurring inside a longer word, with at least one character on each side (often a derivative)
\b\w*effect\w*\b             any word containing effect, including effect itself (potential family of wordforms)
\bcounter\w*                 prefix counter, including counter itself
\bcounter\w+                 prefix counter, excluding counter itself
\b\w+ment\b                  suffix: all words ending in ment
\b\w+ish\w+\b                infix: words with ish in the middle
\b\w+-\w+\b                  any hyphenated compound
\bas a matter of\b           phrase as a matter of, with single spaces between words
\bas\s+a\s+matter\s+of\b     phrase as a matter of, with possibly multiple spaces between words
\bas a \w+ of\b              frame as a X of, with single spaces between words
\bas\s+a\s+\w+\s+of\b        frame as a X of, with possibly multiple spaces between words
\b(suggestion|suggestions)\b lemma SUGGESTION
\b(go|went|gone|goes)\b      lemma GO (going is not covered; add it to the list if required)
\b(go(|es|ne)|went)\b        lemma GO [another option for the above]
\blast (\w+ ){0,4}least\b    collocation: last followed by least with up to 4 intervening words
\b(go|went|gone|goes) ([a-zA-Z\-]+[^a-zA-Z]+){0,4}mad\b
                             lemmatised collocation GO mad with up to 4 words in between
/\^t/                        tab indentation (often marks the beginning of a paragraph) {temporarily off}
^The\b                       paragraph beginning with The
\boff\.$                     paragraph ending with off. {temporarily off}
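The lemma and collocation patterns from the table can be checked against a short text. This is only a sketch: it assumes Python's re module as a substitute for the Perl engine, and the sample sentence and the helper find() are invented for the purpose.

```python
import re

# Hypothetical sample text (lowercase "last" on purpose: the search is case sensitive).
text = ("At last she went completely mad, and last but not least "
        "they go quite mad over counter-attacks.")

def find(pattern, s=text):
    """Return the full text of every non-overlapping match of pattern in s."""
    return [m.group(0) for m in re.finditer(pattern, s)]

lemma_go = find(r"\b(go|went|gone|goes)\b")
last_least = find(r"\blast (\w+ ){0,4}least\b")
go_mad = find(r"\b(go|went|gone|goes) ([a-zA-Z\-]+[^a-zA-Z]+){0,4}mad\b")
hyphenated = find(r"\b\w+-\w+\b")
prefix_counter = find(r"\bcounter\w*")

print(lemma_go)
print(last_least)
print(go_mad)
print(hyphenated)
print(prefix_counter)
```

The first last in the sample is not reported by the collocation pattern: the gap group (\w+ ) requires each intervening word to be followed by a space, so the comma after mad stops it before least is reached. Likewise \bcounter\w* stops at the hyphen, because \w does not cover it.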

 

E. Polish examples:

PATTERN                      RETRIEVES

\bjesteś\b                   wordform jesteś
\bpolsk                      morpheme polsk at the start of any word or phrase; the central highlight (and sort, if available) applies only to the string 'polsk'
\bpolsk\w+\b                 any (often derivative) word beginning with polsk; the central highlight (and sort, if available) applies to the complete word
\bdługo\w*\b                 prefix długo, inclusive of długo itself
\bdługo\w+\b                 prefix długo, exclusive of długo itself
\b\w+polskiej\b              suffix polskiej
\b\w+byś\w+\b                infix byś
\b\w+-\w+\b                  any hyphenated compound
\bz tym\b                    phrase z tym
\bu\w+ się\b                 any phrase like u... się
\bna\W+\w+\W+że\b            frame na X że
\bnik(t|ogo|omu|im)\b        lemma NIKT
\bwniosek(\W+\w+){0,4}\W+który\b
                             collocation: wniosek followed by który with up to 4 words in between
\b(czas|czasu|czasowi|czasem|czasie|czasy|czasów|czasom|czasami|czasach)(\W+\w+){0,4}\W+któr\w+\b
                             lemmatised collocation: lemma CZAS followed by a relative pronoun któr... with up to 4 words intervening
/\^t/                        tab indentation (often marks the beginning of a paragraph)
^Jeden\b                     paragraph beginning with the word Jeden
\bnie\.$                     paragraph ending with the word nie.
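The Polish patterns depend on \w and \b treating letters such as ł, ś and ę as word characters. A minimal sketch, assuming Python 3's re module (where str patterns are Unicode-aware by default) rather than the concordancer's own Perl setup; the sample text is invented and kept in lowercase because the searches are case sensitive.

```python
import re

# Hypothetical sample text, all lowercase so that case-sensitive patterns apply.
text = ("nikt nie pytał, nikogo to nie dziwi. "
        "wniosek o dotację, który czekał długo, to długotrwała sprawa.")

def find(pattern, s=text):
    """Return the full text of every non-overlapping match of pattern in s."""
    return [m.group(0) for m in re.finditer(pattern, s)]

lemma_nikt = find(r"\bnik(t|ogo|omu|im)\b")
prefix_inclusive = find(r"\bdługo\w*\b")
prefix_exclusive = find(r"\bdługo\w+\b")
collocation = find(r"\bwniosek(\W+\w+){0,4}\W+który\b")

print(lemma_nikt)
print(prefix_inclusive)
print(prefix_exclusive)
print(collocation)
```

Because the gap here is written as (\W+\w+), the comma after dotację does not block the collocation match, unlike a gap that insists on a single space after every word.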

 


2. Searching English POS-tagged corpora

 

A. Click here to find out about the TAGSET used.

B. Follow the general principles and important operators used with plain-text English corpora.

C. Useful combinations for cut and paste:

[A-Za-z\-]+ [A-Z]+\([a-z,]+\)             any word (possibly hyphenated) with any POS tag (= an empty slot between successively tagged words)
([A-Za-z\-]+ [A-Z]+\([a-z,]+\) ){0,4}     0-4 occurrences of any word-tag combination, each followed by a space (= up to 4 words intervening between the specified search elements)
[A-Z]+\([a-z,]+\)                         any tag
[A-Za-z\-]+                               any word (possibly hyphenated, after which a tag occurs)
\([a-z,]+\)                               any features following a wordclass label
[A-Z]+                                    any wordclass label; must be followed by \(respective,features\)

 

D. Examples of queries:

(for some queries it may be advisable to use the case-sensitive search)

PATTERN                      RETRIEVES

ADJ                          Wordclass tag: any adjective
(ADJ\(ge,pos,\w*\) )         Wordclass tag with specific feature(s): any general adjective in the positive degree
(ingp\)|edp\))               Feature tag(s): any -ing or past participle forms
\w*late\w* VB                tag meeting lexical criteria: any <verb> containing the string/morpheme late
\bspread [A-Z]+              word-tag pair: spread + wordclass_info
\bof [A-Z]+\([a-z,]+\) \w+ N
                             immediate colligation: of + <noun>
\bof [A-Z]+\([a-z,]+\) \w+ N\(sing\)\s+
                             of + <singular noun>
\bof [A-Z]+\([a-z,]+\) \w+ N\(sing[,a-z]*\)\s+
                             of + <noun: singular or singular/collective>
(be|am|are|is|was|were) [A-Z]+\([a-z,]+\) ([A-Za-z\-\']+ [A-Z]+\([a-z,]+\) ){0,2}[A-Za-z\-]+ ADJ
                             lemmatised colligation: lemma BE + <adj> with up to 2 words intervening
(be|am|are|is|was|were) [A-Z]+\([a-z,]+\) ([A-Za-z\-\']+ [A-Z]+\([a-z,]+\) ){0,1}[A-Za-z\-]+ ADJ\([a-z,]+\) [A-Za-z\-]+ ([^N])
                             lemma BE + <adj> with up to 1 word intervening and no noun following the <adj>
ADJ\([a-z,]+\) [A-Za-z\-]+ N
                             contiguous tag combination: <adj>+<noun> combinations
ADJ\([a-z,]+\) ([A-Za-z\-]+ [A-Z]+\([a-z,]+\) ){0,2}[A-Za-z\-]+ N
                             non-contiguous tag combination: <adj>+<noun> with up to 2 words intervening
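A sketch of how such queries behave on tagged text. Everything about the sample is an assumption: the line is invented, it presumes the layout word TAG(features) with single spaces throughout, and the tag and feature names (ART, PREP, cop, inten, ...) are placeholders rather than the actual TAGSET. Python's re module stands in for the Perl engine, and a leading \b is added to the BE pattern so that is cannot match at the end of a longer word.

```python
import re

# Hypothetical tagged line; layout and tag names are assumed, not the real tagset.
tagged = ("the ART(def) price N(com,sing) of PREP(ge) bread N(com,sing) "
          "is V(cop,pres) very ADV(inten) high ADJ(ge,pos) "
          "for PREP(ge) poor ADJ(ge,pos) families N(com,plu)")

def find(pattern, s=tagged):
    """Return the full text of every non-overlapping match of pattern in s."""
    return [m.group(0) for m in re.finditer(pattern, s)]

# Immediate colligation: of + <noun>.
of_noun = find(r"\bof [A-Z]+\([a-z,]+\) \w+ N")

# Contiguous tag combination: <adj> + <noun>.
adj_noun = find(r"ADJ\([a-z,]+\) [A-Za-z\-]+ N")

# Lemma BE + <adj> with up to 2 intervening word-tag pairs. The gap group
# keeps its trailing space inside the repetition, so 0, 1 or 2 pairs all fit.
be_adj = find(r"\b(be|am|are|is|was|were) [A-Z]+\([a-z,]+\) "
              r"([A-Za-z\-\']+ [A-Z]+\([a-z,]+\) ){0,2}[A-Za-z\-]+ ADJ")

# Any tag: wordclass label plus bracketed features.
all_tags = find(r"[A-Z]+\([a-z,]+\)")

print(of_noun)
print(adj_noun)
print(be_adj)
print(len(all_tags))
```

The <adj>+<noun> match starts at the tag ADJ, not at the adjective itself, because the pattern begins with the wordclass label; high is not reported since the word after it is tagged PREP, not N.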

 


Page maintained by Przemysław Kaszubski

Last update 2004-11-04