Computational Linguistics — Regular Sets and Regular Expressions
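Definitions
A regular set over an alphabet A:
the empty set, and the set {a} for each element a of A, are regular sets (over A)
and, if X and Y are regular sets (over A), then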
X | Y - defined as X union Y:
the strings which are in either (or both) of X or Y
XY - defined as the set of strings, each of which consists of
a prefix, which is an element of X, followed by
a suffix, which is an element of Y
X* - defined as the set of strings, each of which is
a concatenation of zero or more elements of X
are regular sets (over A)
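The three operators can be tried out on small finite sets of strings. This is only a sketch: the helper names union, concat and star are made up for illustration, and because X* is infinite whenever X contains a non-empty string, star is cut off at a chosen number of concatenated elements (the max_parts parameter, also an assumption of this sketch).

```python
from itertools import product

def union(X, Y):
    """X | Y: strings that are in X, in Y, or in both."""
    return X | Y

def concat(X, Y):
    """XY: every prefix from X followed by every suffix from Y."""
    return {x + y for x, y in product(X, Y)}

def star(X, max_parts):
    """Finite approximation of X*: concatenations of 0..max_parts elements of X."""
    result = {""}   # zero elements gives the empty string
    layer = {""}
    for _ in range(max_parts):
        layer = concat(layer, X)
        result |= layer
    return result

X = {"a", "ab"}
Y = {"b"}
print(sorted(union(X, Y)))    # ['a', 'ab', 'b']
print(sorted(concat(X, Y)))   # ['ab', 'abb']
print(sorted(star(Y, 3)))     # ['', 'b', 'bb', 'bbb']
```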
Without parentheses or precedence rules, expressions such as
X | YZ
are not well defined: this could mean (X | Y)Z or X | (YZ).
Usually, precedences are assigned (in order of highest to lowest) as Star, Concatenation, and Alternation.
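The precedence order can be checked with a pattern matcher. The sketch below uses Python's re module, which applies the same order (star, then concatenation, then alternation); the helper name accepts is made up for illustration, and re.fullmatch is used so that the whole string must belong to the set.

```python
import re

def accepts(pattern, s):
    """True if the whole string s is described by pattern."""
    return re.fullmatch(pattern, s) is not None

# a|bc* is read as a|(b(c*)), not as (a|b)c* or a|(bc)*
print(accepts(r"a|bc*", "a"))        # True
print(accepts(r"a|bc*", "bccc"))     # True
print(accepts(r"a|bc*", "ac"))       # False
print(accepts(r"(a|b)c*", "ac"))     # True
print(accepts(r"a|bc*", "bcbc"))     # False: the star applies to c only
print(accepts(r"a|(bc)*", "bcbc"))   # True
```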
Note, e.g., that "English Dictionary"* (all finite sequences of dictionary words) is a regular set, but that English, which is a subset of "English Dictionary"*, is not.
{w | w = u u^R}, where u^R is u reversed, is not.
{w | w = a^n b^n, for n = 0, 1, 2, ...} is not either.
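Both of these sets are easy to recognise with a few lines of ordinary code, because the code can count or compare the two halves of the string. The sketch below (function names are made up for illustration) puts that next to the regular expression a*b*, which is the closest simple pattern but accepts too much, since it cannot force the two counts to agree.

```python
import re

def is_anbn(w):
    """True if w is n a's followed by n b's, for some n >= 0."""
    n = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * n + "b" * n

def is_u_uR(w):
    """True if w is some string u followed by u reversed."""
    half = len(w) // 2
    return len(w) % 2 == 0 and w[:half] == w[half:][::-1]

loose = re.compile(r"a*b*")   # any number of a's, then any number of b's

print(is_anbn("aabb"), is_anbn("aab"))          # True False
print(loose.fullmatch("aab") is not None)       # True: a*b* over-accepts
print(is_u_uR("abba"), is_u_uR("abab"))         # True False
```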
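A regular expression over an alphabet A:
each element a of A is a regular expression (over A)
and, if u and v are regular expressions (over A), then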
u | v
uv
u*
( u )
are regular expressions (over A)
[with definitions related to the regular set operators above]
Note: Regular expressions can be used to generate or describe a regular set.
Notations for reg. sets and reg. exp.'s are not unique.
- any "normal" character
a
- any concatenation thereof
Fred
- alternation (with concatenation having higher precedence)
F | B | W matches F or B or W
Fred|Wilma|Barney|Betty matches one of the Flintstones' characters
- Kleene-star
b*aby matches bbbbbbbaby
- + (one or more)
bab+y matches babbbbbbby but not bay, which bab*y would match
- can be grouped within parentheses
(F|L)oxie matches Foxie or Loxie
whereas
F|Loxie would match F or Loxie
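The examples above can be checked mechanically. The sketch below uses Python's re module, whose syntax for these operators is the same as Perl's; this choice of language is an assumption of the sketch, not part of the notes. It uses re.fullmatch, so "matches" here means the whole string is described by the pattern; a substring search would also find F inside Foxie.

```python
import re

def whole(pattern, s):
    """True if pattern describes the entire string s."""
    return re.fullmatch(pattern, s) is not None

print(whole(r"F|B|W", "B"))              # True
print(whole(r"b*aby", "bbbbbbbaby"))     # True
print(whole(r"bab+y", "babbbbbbby"))     # True
print(whole(r"bab+y", "bay"))            # False: + needs at least one b
print(whole(r"bab*y", "bay"))            # True:  * allows zero b's
print(whole(r"(F|L)oxie", "Loxie"))      # True
print(whole(r"F|Loxie", "Foxie"))        # False: only F or Loxie
print(whole(r"F|Loxie", "F"))            # True
```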
More power than normal reg. exp.'s:
- w{n} repeats w "n" times
w{5}hoa matches wwwwwhoa
moo{3} matches moooo (the {3} applies only to the final o)
whereas
(moo){3} matches moomoomoo
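Note: counted repetition alone still describes regular sets, since w{n} is shorthand for n copies of w. It is the \number tokens (listed under other special tokens) that let Perl match expressions which do not form regular sets.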
- w{m,n} matches w repeated from m to n times
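A short check of the counted forms, again sketched with Python's re module on the assumption that it treats {n} and {m,n} as Perl does (the helper name whole is made up for illustration):

```python
import re

def whole(pattern, s):
    """True if pattern describes the entire string s."""
    return re.fullmatch(pattern, s) is not None

print(whole(r"w{5}hoa", "wwwwwhoa"))      # True
print(whole(r"moo{3}", "moooo"))          # True: m, o, then o three times
print(whole(r"moo{3}", "moomoomoo"))      # False
print(whole(r"(moo){3}", "moomoomoo"))    # True
# w{2,3}: two or three w's, no fewer and no more
print([whole(r"w{2,3}", "w" * k) for k in range(1, 5)])   # [False, True, True, False]
```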
- ^ matches only at start of string
- $ matches only at end
- . (dot) matches any character except \n (new line)
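The anchors and the dot can be seen in a substring search, where position matters. This is a sketch in Python's re module, assumed here to behave like Perl for these three tokens in their default (single-line) mode:

```python
import re

s = "Fred and Wilma"
print(bool(re.search(r"^Fred", s)))     # True:  Fred is at the start
print(bool(re.search(r"^Wilma", s)))    # False: Wilma is not at the start
print(bool(re.search(r"Wilma$", s)))    # True:  Wilma is at the end
print(bool(re.search(r"Fred$", s)))     # False

print(bool(re.fullmatch(r"b.by", "baby")))    # True:  dot stands for the a
print(bool(re.fullmatch(r"b.by", "b\nby")))   # False: dot does not match \n
```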
- character classes []
[a-z] any lower case letter
[A-Z] any upper case letter
[A-Za-z] any letter
[abc] an a or a b or a c
[0-9] any digit
negation: [^ whatever] matches any one character not listed
[^a-z] anything which is not a lower case letter
[^0-9] anything which is not a digit
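Character classes combine naturally with + to pick out runs of characters. A small sketch with Python's re module (assumed to share Perl's class syntax); the sample string is made up for illustration:

```python
import re

text = "Room 101, Floor 7"

print(re.findall(r"[0-9]+", text))      # ['101', '7']
print(re.findall(r"[A-Z]", text))       # ['R', 'F']
print(re.findall(r"[A-Za-z]+", text))   # ['Room', 'Floor']
print(re.findall(r"[^a-z]+", "abc123def"))   # ['123']
print(bool(re.fullmatch(r"[abc]", "b")), bool(re.fullmatch(r"[abc]", "d")))   # True False
```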
- special backslash escapes:
\n newline
\r carriage return
\d digit (same as [0-9])
\D non-digit (same as [^0-9])
\w word character (same as [a-zA-Z_0-9])
\W non-word character (same as [^a-zA-Z_0-9])
\s white space character (same as [ \t\n\r\f])
\S non-white space character (anything else)
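The escapes can be compared against the bracket forms given above. One caveat for this sketch: it uses Python's re module rather than Perl, and in Python \d, \w and \s also cover non-ASCII characters unless the re.ASCII flag is passed, so the flag is used here to get the equivalences exactly as listed.

```python
import re

text = "foo_bar baz-9\tqux"

# \w+ and its bracket equivalent pick out the same pieces
print(re.findall(r"\w+", text, re.ASCII))         # ['foo_bar', 'baz', '9', 'qux']
print(re.findall(r"[a-zA-Z_0-9]+", text))         # ['foo_bar', 'baz', '9', 'qux']

print(re.findall(r"\d", "a1b22", re.ASCII))       # ['1', '2', '2']
print(re.findall(r"\D+", "a1b22", re.ASCII))      # ['a', 'b']
print(re.split(r"\s+", "a b\t\nc"))               # ['a', 'b', 'c']
print(re.findall(r"\S+", "a b\t\nc"))             # ['a', 'b', 'c']
```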
- other special tokens
\number refers back to the text matched by the parenthesized group with that ordinal number
e.g., (a|b)(\1) matches 'aa' or 'bb' but not 'ab' or 'ba'
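The \number token is one place where these patterns go beyond regular sets: (.+)\1 describes the strings made of some non-empty string written twice, and that set is not regular. A sketch with Python's re module, assumed here to handle \1 as Perl does (the helper name whole is made up for illustration):

```python
import re

def whole(pattern, s):
    """True if pattern describes the entire string s."""
    return re.fullmatch(pattern, s) is not None

# \1 must repeat exactly what group 1 matched
print([s for s in ("aa", "bb", "ab", "ba") if whole(r"(a|b)(\1)", s)])   # ['aa', 'bb']

# A string written twice in a row
print(whole(r"(.+)\1", "abcabc"))   # True
print(whole(r"(.+)\1", "abcabd"))   # False
```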