Computational Linguistics — Regular Sets and Regular Expressions

Definitions

Alphabet:
set of characters (tokens)
String:
a sequence of zero or more tokens from a given alphabet
[ a string is said to be "over an alphabet" ]
λ (lambda):
empty string of characters
A regular set over an alphabet A is defined inductively as follows:
  1. Basis: {}, {λ}, and {a} for all a in A (singleton sets) are regular sets
  2. Recursion: if X and Y are regular sets (over A) then
             X | Y      - defined as X union Y;
                          elements which are in either (or both) of X or Y
    
             XY         - defined as the set of strings,
                          each of which consists of
                             a prefix, which is an element of X, followed by
                             a suffix, which is an element of Y
    
             X*         - defined as the set of strings, each of which is a
                          concatenation of zero or more elements of X
    
           are regular sets (over A)
    
  3. Closure: nothing else is a regular set
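The three operators can be tried out on small finite sets of strings. The following is a minimal Python sketch, not part of the original notes: the function names union, concat and star are made up for illustration, and because X* is normally infinite, star takes an assumed max_len cut-off and returns only the strings up to that length.

```python
from itertools import product

def union(X, Y):
    """X | Y: strings that are in X or in Y (or both)."""
    return X | Y

def concat(X, Y):
    """XY: every prefix from X followed by every suffix from Y."""
    return {x + y for x, y in product(X, Y)}

def star(X, max_len):
    """X*: concatenations of zero or more elements of X, cut off at max_len characters."""
    result = {""}      # zero elements gives the empty string (lambda)
    frontier = {""}    # strings found in the previous round
    while frontier:
        # Append one more element of X to each newly found string.
        frontier = {w + x for w in frontier for x in X
                    if len(w + x) <= max_len} - result
        result |= frontier
    return result

A = {"a"}   # singleton sets from the basis step
B = {"b"}

print(sorted(union(A, B)))
print(sorted(concat(union(A, B), B)))
print(sorted(star(concat(A, B), 6), key=len))
```

With max_len = 6 the last line lists the empty string, ab, abab and ababab; raising the cut-off adds longer members of (ab)* without changing the shorter ones.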
Note that in the absence of precedence rules, expressions like:

X | YZ

are not well defined.

Usually, precedences are assigned (in order from highest to lowest) as Star, Concat., and Alternation.
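The usual precedence order can be checked with any regular expression library that follows it. The sketch below uses Python's re module as a stand-in (an assumption; the notes do not name a tool at this point), with a hypothetical helper full that tests whether the whole string matches.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# Alternation binds loosest: a|bc is read as a|(bc), not (a|b)c.
assert full(r"a|bc", "a")
assert full(r"a|bc", "bc")
assert not full(r"a|bc", "ac")
assert full(r"(a|b)c", "ac")

# Star binds tightest: ab* is read as a(b*), not (ab)*.
assert full(r"ab*", "abbb")
assert not full(r"ab*", "abab")
assert full(r"(ab)*", "abab")
```

Parentheses are what select the other reading in each case.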


What isn't a regular set?

Note, e.g., that "English Dictionary"* (all sequences of dictionary words) is a regular set, but that English, which is a subset of "English Dictionary"*, is not.


{w | w = u u^R}, where u^R is u reversed, is not.

{w | w = a^n b^n, for n = 0,1,2, ... } is not either.
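Both of these sets are easy to recognize with an ordinary program, which makes the point that "not regular" does not mean "hard to compute". What they need is a comparison between two halves of unbounded length, and the three regular operators cannot express that. A minimal Python sketch follows; the function names are illustrative only.

```python
def is_u_u_reversed(w):
    """True if w = u u^R for some u, i.e. w is an even-length palindrome."""
    return len(w) % 2 == 0 and w == w[::-1]

def is_an_bn(w):
    """True if w is n a's followed by n b's, for some n >= 0."""
    n = len(w) // 2
    return w == "a" * n + "b" * n

print(is_u_u_reversed("abba"), is_u_u_reversed("abab"))
print(is_an_bn("aaabbb"), is_an_bn("aabbb"))
```

Each function looks at the whole string at once; neither is built from union, concatenation and star alone.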


Regular Expressions:

A regular expression over an alphabet A:

  1. Basis: {}, λ, and a for all a in A are regular expressions
  2. Recursion: if u and v are regular expressions (over A) then
    
             u | v
    
             uv
    
             u*
    
             ( u )
    
           are regular expressions (over A)

    [with definitions related to the regular set operators above]

  3. Closure: nothing else is a regular expression

Note: Regular expressions can be used to generate or describe a regular set.

Notations for reg. sets and reg. exp.'s are not unique.
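As an illustration of non-uniqueness, (ab)*a and a(ba)* are two different expressions for the same set. The Python sketch below enumerates every string over {a, b} up to an assumed length bound and compares which ones each pattern matches; the helper name language is made up for this example, and agreement up to a bound is evidence, not a proof.

```python
import re
from itertools import product

def language(pattern, alphabet, max_len):
    """All strings over alphabet, of length <= max_len, that pattern matches in full."""
    return {
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if re.fullmatch(pattern, "".join(chars))
    }

first = language(r"(ab)*a", "ab", 7)
second = language(r"a(ba)*", "ab", 7)

print(sorted(first, key=len))
print(first == second)
```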


Perl-style regular expressions:

- any "normal" character

        a

- any concatenation thereof

        Fred

- alternation (with concatenation having higher precedence)

        F|B|W        matches F or B or W
        Fred|Wilma|Barney|Betty   matches one of the Flintstones' characters

- Kleene-star

         b*aby       matches bbbbbbbaby

- + (one or more)

         bab+y       matches babbbbbbby but not bay, which bab*y would match

- can be grouped within parentheses

         (F|L)oxie   matches Foxie or Loxie

          whereas

         F|Loxie     would match F or Loxie
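The examples above can be checked mechanically. The sketch below uses Python's re module rather than Perl itself (an assumption made so the check is self-contained; for these basic operators the two share the same syntax). The helper full is hypothetical and requires the whole string to match.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# Concatenation and alternation
assert full(r"Fred", "Fred")
assert full(r"Fred|Wilma", "Wilma")
assert not full(r"Fred|Wilma", "FredWilma")

# Kleene star (zero or more) versus + (one or more)
assert full(r"b*aby", "bbbbbbbaby")
assert full(r"b*aby", "aby")
assert full(r"bab+y", "babbbbbbby")
assert not full(r"bab+y", "bay")
assert full(r"bab*y", "bay")

# Grouping changes what the alternation covers
assert full(r"(F|L)oxie", "Foxie")
assert full(r"(F|L)oxie", "Loxie")
assert full(r"F|Loxie", "F")
assert not full(r"F|Loxie", "Foxie")
```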

Extensions beyond normal reg. exp.'s:

- w{n} matches w repeated n times (the count applies only to the item just before it)


         w{5}hoa    matches wwwwwhoa

         moo{3}      matches moooo

      whereas

         (moo){3}  matches moomoomoo
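Note: counted repetition is only shorthand (w{3} means www), so these patterns still describe regular sets. It is back-references (\number, below) that let Perl match sets which are not regular.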
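Each counted pattern above can be expanded by hand into plain concatenation. A minimal Python sketch of that check follows, again using Python's re module as a stand-in for Perl (an assumption); the helper full and the expansions table are made up for illustration.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# Each counted pattern paired with its hand expansion.
expansions = {
    r"w{5}hoa": "wwwwwhoa",
    r"moo{3}": "moooo",          # the count applies to the last o only
    r"(moo){3}": "moomoomoo",    # parentheses make the count apply to moo
}

for counted, expanded in expansions.items():
    assert full(counted, expanded)

# The two readings of moo...{3} do not overlap.
assert not full(r"moo{3}", "moomoomoo")
assert not full(r"(moo){3}", "moooo")
```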

- w{m,n} matches w repeated from m to n times

- ^ matches only at start of string

- $ matches only at end

- . (dot) matches any character except \n (new line)

- character classes []

        [a-z]     any lower case letter
        [A-Z]     any upper case letter
        [A-Za-z]  any letter
        [abc]     an a or a b or a c
        [0-9]     any digit

        negation: [^whatever] matches any one character NOT in whatever

        [^a-z]    anything which is not a lower case letter
        [^0-9]    anything which is not a digit

- special backslash escapes:

         \n    newline
         \r    carriage return
         \d    digit (same as [0-9])
         \D    non-digit (same as [^0-9])
         \w    word character (same as [a-zA-Z_0-9])
         \W    non-word character (same as [^a-zA-Z_0-9])
         \s    white space character (same as [ \t\n\r\f])
         \S    non-white space character (anything else)

- other special tokens

         \number    back-reference: matches again the text matched by parenthesized group number

                e.g., (a|b)(\1)     matches 'aa' or 'bb'  but not 'ab' or 'ba'
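The extensions listed above can be exercised in the same way. This sketch uses Python's re module in place of Perl (an assumption), and passes the re.ASCII flag so that \d, \w and \s mean exactly the ASCII classes given in the list; without it Python also accepts non-ASCII digits and letters. The helpers full and found are hypothetical names.

```python
import re

def full(pattern, s):
    """True if the whole string s matches pattern (ASCII-only escapes)."""
    return re.fullmatch(pattern, s, re.ASCII) is not None

def found(pattern, s):
    """True if pattern matches somewhere inside s (ASCII-only escapes)."""
    return re.search(pattern, s, re.ASCII) is not None

# Ranged repetition
assert full(r"a{2,4}", "aaa")
assert not full(r"a{2,4}", "a")

# Anchors
assert found(r"^Fred", "Fred and Wilma")
assert not found(r"^Wilma", "Fred and Wilma")
assert found(r"Wilma$", "Fred and Wilma")

# Dot: any character except newline
assert full(r"b.y", "bay")
assert not full(r"b.y", "b\ny")

# Character classes and negated classes
assert full(r"[A-Za-z]+", "Barney")
assert full(r"[^0-9]+", "Barney")
assert not full(r"[^a-z]", "b")

# Backslash escapes
assert full(r"\d+", "2024")
assert full(r"\w+\s\w+", "fred_1 wilma")
assert full(r"\D\W\S", "a-b")

# Back-references: group 1 must repeat exactly
assert full(r"(a|b)(\1)", "aa")
assert full(r"(a|b)(\1)", "bb")
assert not full(r"(a|b)(\1)", "ab")

# (.+)\1 matches strings of the form ww
assert full(r"(.+)\1", "moomoo")
assert not full(r"(.+)\1", "moomo")
```

The last pattern shows where the extra power comes from: the set of strings of the form ww is a standard example of a set that is not regular, yet a single back-reference describes it.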