Grammars

The term "grammar" in the general sense usually refers to (among other things):

– syntax (roughly, how words can be combined to make larger phrases, such as sentences);
– morphology (how morphemes, the parts of words, can be combined to make up words; e.g., "writers" is made up of the verb "write", the "agentive affix" "er", and the plural marker "s");
– phonology (the study of the sound systems of human language);
– semantics (how and why various words and combinations of words mean what they mean).

A grammar in the mathematical sense is one of many types of ``rewrite'' rule systems. These systems seek to explain one or more of the above aspects of ``grammar''.

Terms required for defining grammar.

Alphabet

An alphabet is a set of "symbols".

E.g., A = {a, b, ..., z}

E.g., B = {red, blue, green}.
Note that there are only three items in this set.

E.g., D = The English Dictionary.

A string (over an alphabet, A) is a sequence of these symbols. E.g., abbccc, dog, and cat are strings over the alphabet A above.

E.g., red red blue, blue green, and red green blue are strings over the alphabet B above.

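Note the problem of meta-notation when a symbol occupies more than one character in its written form: the symbols must be separated (here, by spaces) so that it is clear where one ends and the next begins. Cf. the problem of hexadecimal notation.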

Given an alphabet A, A* denotes the set of all strings (including the null string) over A.
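
Since A* is infinite for any nonempty alphabet, a program can only list a finite part of it. The following is a minimal sketch, assuming strings are represented as tuples of symbols so that multi-character symbols such as "red" stay distinct; the name strings_up_to is hypothetical.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first.

    Each string is a tuple of symbols, so a symbol like "red" is one item,
    not three letters.
    """
    symbols = sorted(alphabet)
    for n in range(max_len + 1):
        # product(..., repeat=n) gives all sequences of exactly n symbols
        for s in product(symbols, repeat=n):
            yield s

B = {"red", "blue", "green"}
found = list(strings_up_to(B, 2))
print(len(found))   # 1 null string + 3 of length one + 9 of length two
print(found[0])     # the null string, an empty tuple
```

Raising max_len by one multiplies the number of new strings by the size of the alphabet, which is one way to see why A* can never be listed in full.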

Regular Expressions

Given sets S1 and S2, assumed to be subsets of A*
[if S1 consists of strings over A1 and S2 over A2, let A = A1 union A2]:

Note and be able to explain examples like the following.

Regular Sets

Regular sets are defined by the following rules:

  1. the empty set is a regular set
  2. the set {λ} is a regular set. [The set containing just the null string.]
  3. if a ∈ alphabet A, then {a} is a regular set
  4. if S is a regular set, so is S* [What is S* defined to be?]
  5. if S1 and S2 are regular sets, so is S1 + S2
  6. if S1 and S2 are regular sets, so is S1 S2
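
The three set-building rules can be tried out on small finite sets. This is only a sketch under stated assumptions: strings are Python str values (so every symbol is a single character), "+" is read as set union, and S* is cut off at a maximum length because it is usually infinite. The names union, concat, and star are hypothetical.

```python
def union(s1, s2):
    """S1 + S2: every string that is in S1 or in S2."""
    return s1 | s2

def concat(s1, s2):
    """S1 S2: a string from S1 followed by a string from S2."""
    return {u + v for u in s1 for v in s2}

def star(s, max_len):
    """The strings of S* that are no longer than max_len."""
    result = {""}      # zero repetitions gives the null string
    frontier = {""}
    while frontier:
        # extend each newly found string by one more element of S
        frontier = {u + v for u in frontier for v in s
                    if len(u + v) <= max_len} - result
        result |= frontier
    return result

a, b = {"a"}, {"b"}
print(sorted(star(union(a, b), 2)))   # all strings of a's and b's up to length 2
print(sorted(concat(star(a, 2), b)))  # up to two a's followed by one b
```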

What isn't a regular set?

E.g., {w | w = uv, where u = the reverse of v}
Note that ``the reverse of u'' is often written u^R.
And that ``u repeated n times'' is written u^n.
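
Membership in the set above is easy to test with a general-purpose program, even though the set is not regular (over an alphabet of two or more symbols). A minimal sketch, with the hypothetical name is_u_then_reverse:

```python
def is_u_then_reverse(w):
    """True if w = uv where u is the reverse of v."""
    half, odd = divmod(len(w), 2)
    if odd:
        return False               # u and v must have the same length
    u, v = w[:half], w[half:]
    return u == v[::-1]

for w in ["abba", "", "abab", "aba"]:
    print(repr(w), is_u_then_reverse(w))
```

The function has to compare the whole first half against the second half, so the amount it must remember grows with the length of w. That unbounded memory is what the regular-set rules cannot supply.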

Languages

What is a language?

Given an alphabet A, any set L which is a subset of A* is a language (over A.) The elements of L are called the strings or the sentences of L.

What about English?

First (bad) attempt: L = subset of {a-z}*

Is this subset a regular set? [Yes, it's finite.]

This is okay to denote the language of spell checkers. But linguists want to understand the structure of what are normally called "sentences" in English. So what are the symbols of A (the alphabet for English)?

A = {w | w is in the dictionary}

Note the complications:

For example, ``The'' and ``the'' are both in the ``dictionary''.

But ``The quick brown fox jumped over the lazy dog.'' is a sentence, while ``the quick brown fox jumped over The lazy dog.'' isn't.

Punctuation is also included in the alphabet.

More powerful tools are required to explain the legal sentences of English.

Syntax

The syntax of a language describes the structure of legal sentences of that language. For example, for a programming language, the ``sentences'' of that language are the legal ``programs'', i.e., the ones the compiler will compile without error.

It is distinguished from ``semantics'' which assigns a meaning to each sentence of the language. [How does this relate to the programs of a programming language?]

As an aside, to put the task of ascribing semantics in perspective, the following are typical approaches to specifying it.

Operational Semantics: how a real or abstract machine would behave as it interpreted the sentence. Note the relative simplicity as applied to programming languages, but the problems caused even in that restricted domain.

Axiomatic Semantics: A system of ``predicates'' and how these predicates would change in evaluation upon interpretation of the sentence.

E.g.,

pre- world as it is now.

Sentence (Bob Dole speaking): Bob Dole wants inclusion.

post- causes Republican reaction R
      causes Voter reaction V
      causes Media reaction M

Actually, the system probably would be more restricted:

post- it is known that Bob is amenable to some form of minority recognition and (implicature) inclusion was not previously recognized as part of his plank

This example shows why semantics must be based on a larger text (frame), rather than what is normally called a sentence.

Denotational Semantics: A ``mathematical'' description of what a sentence means. (One of the simpler examples may be a semantic net and how it is altered by a sentence.)


         See this apple?    apple -------> view
                                   is-in

         The apple is red.  apple -------> view
                              |    is-in
                              |
                              | is (has the property)
                              V
                             red
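
One way to make the semantic-net picture concrete is to treat the net as a set of (node, relation, node) triples and each sentence as a function from one net to the next. This is only a sketch: the triple representation and the function names are assumptions, not part of the notes.

```python
def see_this_apple(net):
    """Meaning of "See this apple?": the apple is now in view."""
    return net | {("apple", "is-in", "view")}

def the_apple_is_red(net):
    """Meaning of "The apple is red.": the apple gains the property red."""
    return net | {("apple", "is", "red")}

net = set()                    # nothing is known yet
net = see_this_apple(net)
net = the_apple_is_red(net)
for triple in sorted(net):
    print(triple)
```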

Continuing with syntax:

An example, which attempts to define a legal English sentence:



<sentence> ::= <determiner> <noun> <verb> <determiner> <noun>
<determiner> ::= a
<determiner> ::= the
<noun> ::= boy
<noun> ::= boys
<noun> ::= girl
<noun> ::= girls
<verb> ::= watch
<verb> ::= watches

From which we can derive the following sentences:

   the boy watches the girls
   the girls watch a boy

But we can also derive some other ``legal'' sentences, which we know are not part of normal English:


   the boy watch the girls
   a girls watch a boy

In general, it is very difficult to handle these kinds of context-sensitive constructs.
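
Because this first grammar has no recursion, every sentence it allows can be listed mechanically. The sketch below assumes the grammar is stored as a Python dict from nonterminal to lists of right-hand sides; the names GRAMMAR and expand are hypothetical, and expand would not terminate on a recursive grammar.

```python
from itertools import product

GRAMMAR = {
    "sentence":   [["determiner", "noun", "verb", "determiner", "noun"]],
    "determiner": [["a"], ["the"]],
    "noun":       [["boy"], ["boys"], ["girl"], ["girls"]],
    "verb":       [["watch"], ["watches"]],
}

def expand(symbol):
    """Return every list of words derivable from `symbol`."""
    if symbol not in GRAMMAR:      # a terminal stands for itself
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # combine every expansion of each right-hand-side symbol
        for parts in product(*(expand(s) for s in rhs)):
            results.append([w for part in parts for w in part])
    return results

sentences = {" ".join(words) for words in expand("sentence")}
print(len(sentences))                      # 2 * 4 * 2 * 2 * 4 combinations
print("a girls watch a boy" in sentences)  # derivable, though not English
```

The grammar accepts every combination of its word choices, which is why number agreement between determiner, noun, and verb goes unchecked.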

An attempt to clean up the previous "grammar":

   <sentence> ::= <noun part> <verb part> <object part>
   <noun part> ::= <noun phrase>
   <verb part> ::= <verb>
   <object part> ::= <noun phrase>
   <noun phrase> ::= <determiner> <noun>
   <determiner> ::= a | the
   <noun> ::= boy | boys | girl | girls
   <verb> ::= watch | watches

Notes:
the "factoring" of <noun phrase> into <noun part> and <object part>;
the shorthand "|" for grammar rules with the same left-hand side.

Grammars

A grammar is an abstract entity which attempts to describe strings of a language.

The concept of a grammar can be divided into (at least) 2 major theories -- structural and transformational. These theories can be reasonably associated with two theories on learning -- empiricist vs. rule governed.

Empiricist point of view:

Language is acquired as a set of habits. These habits are formed by reinforcement, association, and generalization. [Mark Lester, Readings in Applied Transformational Grammar 2nd ed., Holt, Rinehart, Winston, Inc.]

Rule-governed point of view:

Language is acquired through the formation of a set of rules. [Chomsky claims virtually every sentence a speaker says is new to him.] [Chomsky calls this internal rule system the speaker's "linguistic competence". A rule system which purports to mimic this "linguistic competence" is called a "generative grammar".]

It has been found that in order for these rule systems (generative grammars) to be sufficiently general, they must be very abstract. That is, they are ``many steps removed from any kind of physical fact''. A significant question is whether the rules (if they in fact exist) are learned or inherent. Chomsky claims the human mind has an ``intrinsic intellectual organization''.

Formally, one type of grammar can be defined:

A context-free grammar is a 4-tuple (T,N,P,S), consisting of:

  1. T = set of terminals (the legal "tokens" of the language)

  2. N = set of nonterminals (aka variables)

  3. P = a set of productions, each of the form:
    A -> x
       where A is a nonterminal;
       x is a string of nonterminals and terminals.
    

  4. S = a special nonterminal called the ``start symbol''.

E.g., Let

T = { 0, 1, or, (, ) }, N = { A, B }, S = A,
P = {
      A -> B
      A -> ( A or B )
      B -> 0
      B -> 1
    }

Which of the following are legal?

( 1 or 0 )
   yes, since:
      A => ( A or B ) => ( B or B ) => ( 1 or B ) => ( 1 or 0 )


( ( ( 0 or 1 ) or 1 ) or 0 )
   yes, since:
      A => ( A or B ) => ( ( A or B ) or B ) => ( ( ( A or B ) or B ) or B )
        => ( ( ( B or B ) or B ) or B ) =>* ( ( ( 0 or 1 ) or 1 ) or 0 )

Note: a "*" on the arrow means 0 or more applications of productions; the production used in a step may be written under its "=>".

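( ( 0 or 1 ) or ( 1 or 0 ) )
   no, since the right operand of each "or" must be derived from B,
   which yields only 0 or 1, never a parenthesized expression


( 0 or 1 ) or ( 1 or 0 )
   no, since the grammar produces "or" only inside a pair of parentheses,
   and the middle "or" here is not enclosed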
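
The four answers above can be checked with a small recognizer for this grammar. This is a sketch, not the only way to do it: it assumes tokens are separated by spaces, and the function names parse_A, parse_B, and is_legal are hypothetical. Each function mirrors one nonterminal and returns the position just after what it recognized, or None on failure.

```python
def parse_B(tokens, i):
    """B -> 0 | 1"""
    if i < len(tokens) and tokens[i] in ("0", "1"):
        return i + 1
    return None

def parse_A(tokens, i):
    """A -> B | ( A or B )"""
    if i < len(tokens) and tokens[i] == "(":
        j = parse_A(tokens, i + 1)                 # left operand is an A
        if j is None or j >= len(tokens) or tokens[j] != "or":
            return None
        k = parse_B(tokens, j + 1)                 # right operand is a B
        if k is None or k >= len(tokens) or tokens[k] != ")":
            return None
        return k + 1
    return parse_B(tokens, i)                      # otherwise A -> B

def is_legal(text):
    tokens = text.split()
    # legal only if an A covers the whole input
    return parse_A(tokens, 0) == len(tokens)

for s in ["( 1 or 0 )",
          "( ( ( 0 or 1 ) or 1 ) or 0 )",
          "( ( 0 or 1 ) or ( 1 or 0 ) )",
          "( 0 or 1 ) or ( 1 or 0 )"]:
    print(s, "->", is_legal(s))
```

Looking at one token is enough to pick a production here, because only A -> ( A or B ) can start with "(".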

Grammar-Related definitions:

Derivation
Parse Tree
- real or abstract tree structure (What is the definition of a tree?) whose
    internal nodes = nonterminals
    external (leaf) nodes = terminals (or maybe not)
Left-most derivation
Top-down derivation
Bottom-up derivation
Sentential Form
Sentence Generation
Sentence Recognition

BNF

When the description of Algol-58 was published, John Backus noted several inconsistencies in the language description. He set about describing the syntax in a grammar-like notation. When the Algol-60 language definition was to be released, Peter Naur, editor of the "Algol Bulletin", borrowed from and added to Backus' notation, developing what became known as Backus-Naur Form (BNF).

In BNF, nonterminals are enclosed in "<", ">", and the arrow symbol of productions is "::=".

E.g.:

   <expression> ::= <expression> + <term>
                |   <expression> - <term>
                |   <term>
   <term>       ::= <term> * <factor>
                |   <term> / <factor>
                |   <factor>
   <factor>     ::= ( <expression> )
                |   <variable>
                |   <constant>

Note:

- typical resolution of precedences in an expression;
- use of recursion to expand expression;
- the start symbol and the sets T and N are not always explicitly specified.

Try parsing:

      a + b * c

      5/10-3-x

Create a parse tree.
Create a derivation.
(Note: they are not the same thing.)
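
A derivation is a sequence of sentential forms, each obtained from the previous one by applying one production. The sketch below replays one possible left-most derivation of a + b * c. It assumes productions <variable> ::= a | b | c, which the grammar above leaves unspecified; STEPS and leftmost_derivation are hypothetical names.

```python
# Productions in the order applied; each rewrites the leftmost nonterminal.
# The <variable> rules are an assumption, not part of the grammar above.
STEPS = [
    ("<expression>", ["<expression>", "+", "<term>"]),
    ("<expression>", ["<term>"]),
    ("<term>",       ["<factor>"]),
    ("<factor>",     ["<variable>"]),
    ("<variable>",   ["a"]),
    ("<term>",       ["<term>", "*", "<factor>"]),
    ("<term>",       ["<factor>"]),
    ("<factor>",     ["<variable>"]),
    ("<variable>",   ["b"]),
    ("<factor>",     ["<variable>"]),
    ("<variable>",   ["c"]),
]

def leftmost_derivation(start, steps):
    """Return the list of sentential forms, starting with [start]."""
    form = [start]
    forms = [form]
    for lhs, rhs in steps:
        # find the leftmost nonterminal in the current sentential form
        i = next(k for k, sym in enumerate(form) if sym.startswith("<"))
        if form[i] != lhs:
            raise ValueError(f"leftmost nonterminal is {form[i]}, not {lhs}")
        form = form[:i] + rhs + form[i + 1:]
        forms.append(form)
    return forms

forms = leftmost_derivation("<expression>", STEPS)
for f in forms:
    print("=>", " ".join(f))
```

The parse tree records which production expanded each nonterminal but not the order of the steps, so several derivations (left-most, right-most, and others) correspond to the same tree.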

Extended BNF (EBNF)

Nonterminals begin with an uppercase letter; terminals are enclosed within single quotes ('); the vertical bar (|) is formalized as the "alternation" symbol; parentheses ( ) are used for grouping; braces { } represent 0 or more repetitions; and brackets [ ] represent an optional construct. Reasonable variations on these conventions are common.

EBNF e.g.:

   Expression ::= Term { ('+'|'-') Term }
   Term       ::= Factor { ('*'|'/') Factor }
   Factor     ::= '(' Expression ')' | Variable | Constant

Note:

- quotes are often omitted;
- nonterminals need not start with capitals.
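
The EBNF rules above translate almost line for line into a recursive-descent parser: one method per nonterminal, with each { ... } repetition becoming a while loop. The following is a minimal sketch. It assumes Variable is an identifier and Constant is an unsigned integer (the grammar does not say), and it returns a nested tuple (operator, left, right) rather than a full parse tree with nonterminal nodes. The names Parser and parse are hypothetical.

```python
import re

class Parser:
    def __init__(self, text):
        # findall silently skips characters that match no alternative
        self.tokens = re.findall(r"\d+|[A-Za-z_]\w*|[-+*/()]", text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        self.pos += 1
        return tok

    def expression(self):
        # Expression ::= Term { ('+'|'-') Term }
        node = self.term()
        while self.peek() in ("+", "-"):
            op = self.take()
            node = (op, node, self.term())   # groups to the left
        return node

    def term(self):
        # Term ::= Factor { ('*'|'/') Factor }
        node = self.factor()
        while self.peek() in ("*", "/"):
            op = self.take()
            node = (op, node, self.factor())
        return node

    def factor(self):
        # Factor ::= '(' Expression ')' | Variable | Constant
        tok = self.take()
        if tok == "(":
            node = self.expression()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok in ("+", "-", "*", "/", ")"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok                           # a Variable or a Constant

def parse(text):
    parser = Parser(text)
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree

print(parse("a + b * c"))
print(parse("5/10-3-x"))
```

Because Term sits below Expression, "*" and "/" bind tighter than "+" and "-", and the while loops make operators of equal precedence group from left to right.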

Note how some syntactic entities don't require the power of context-free grammars. For these, a right-linear (regular) grammar, i.e., a grammar whose productions all have the form:

         A -> a
         B -> bC

is powerful enough.

These grammars lead to the concept of regular expressions.

These entities are usually "discovered" by a lexical analyzer. Thus, Identifier would be a terminal, rather than a nonterminal, in the grammar for the language.
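
A lexical analyzer of this kind can be sketched with regular expressions, one per token class. The token classes and their patterns below (Identifier, Constant, Operator) are assumptions chosen for illustration, and lex is a hypothetical name.

```python
import re

TOKEN_SPEC = [
    ("Identifier", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("Constant",   r"\d+"),
    ("Operator",   r"[-+*/()]"),
    ("Skip",       r"\s+"),          # whitespace separates tokens
]
# One alternation with a named group per token class
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def lex(text):
    """Return a list of (token class, matched text) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r}")
        if m.lastgroup != "Skip":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(lex("rate * 60 + x1"))
```

The parser then sees only the token classes, so its grammar can treat Identifier and Constant as terminals.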