Grammars
The term "grammar" in the general sense usually refers to (among others):
A grammar in the mathematical sense is one of many kinds of ``rewrite-rule'' systems. These systems seek to explain one or more of the above aspects of ``grammar''.
Terms required for defining grammar.
Alphabet
An alphabet is a set of "symbols".
E.g., A = {a, b, ..., z}
E.g., B = {red, blue, green}.
Note that there are only three items in this set.
E.g., D = The English Dictionary.
A string (over an alphabet, A) is a sequence of these symbols. E.g., abbccc, dog, and cat are strings over the alphabet A above.
E.g., red red blue, blue green, and red green blue are strings over the alphabet B above.
Note the meta-notation problem that arises when a symbol takes more than one character to write down: without separators, it is unclear where one symbol ends and the next begins. Cf. the problem of hexadecimal notation.
Given an alphabet A, A* denotes the set of all strings (including the null string) over A.
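As a small illustration of A*, the sketch below enumerates the strings over an alphabet in order of length. The function name strings_up_to and the length cap are assumptions made for the example (A* itself is infinite whenever A is non-empty). Strings are represented as tuples of symbols, which sidesteps the meta-notation problem for multi-character symbols such as red.

```python
from itertools import product

def strings_up_to(alphabet, max_len):
    """Yield every string over `alphabet` of length 0..max_len, shortest first.

    A string is a tuple of symbols, so a symbol such as "red" stays one unit.
    """
    symbols = sorted(alphabet)
    for n in range(max_len + 1):
        # product(..., repeat=0) yields the single empty tuple: the null string.
        yield from product(symbols, repeat=n)

B = {"red", "blue", "green"}
short = list(strings_up_to(B, 2))
print(len(short))  # 1 null string + 3 of length one + 9 of length two
```

With three symbols there are 3^n strings of each length n, so the finite slice above already hints at how quickly A* grows.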
Regular Expressions
Given sets S1 and S2 assumed to be subsets of A*
[If S1 consists of strings over A1 and S2 over A2,
make A = A1 union A2]:
Why not just use Union? Because we're not talking about arbitrary sets. [What kind of sets are we talking about?]
Note and be able to explain examples like the following.
Regular Sets A regular set is
What isn't a regular set?
E.g., {w | w = uv, where u = the reverse of v}
Note that ``the reverse of u'' is often written u^R.
And that ``u repeated n times'' is written u^n.
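The two notations, and the example set above, can be made concrete with a few lines of Python. This is only a sketch: the function names are invented for the illustration, and strings are ordinary Python strings over single-character symbols. Note that the membership test has to compare the whole first half of w with the second half, so it must remember an arbitrarily long prefix; informally, that is what puts the set out of reach of regular expressions (over an alphabet with at least two symbols).

```python
def reverse(u):
    """u^R: the symbols of u in reverse order."""
    return u[::-1]

def power(u, n):
    """u^n: u concatenated with itself n times (u^0 is the null string)."""
    return u * n

def is_u_then_reverse(w):
    """True if w = uv for some u and v with u = the reverse of v."""
    half, odd = divmod(len(w), 2)
    if odd:
        # u and v have the same length, so w must have even length.
        return False
    return w[:half] == reverse(w[half:])

print(reverse("abc"), power("ab", 3), is_u_then_reverse("abba"))
```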
Languages
What is a language?
Given an alphabet A, any set L which is a subset of A* is a language (over A.) The elements of L are called the strings or the sentences of L.
What about English?
First (bad) attempt: L = subset of {a-z}*
Is this subset a regular set? [Yes, it's finite.]
This is okay to denote the language of spell checkers. But linguists want to understand the structure of what are normally called "sentences" in English. So what are the symbols of A (the alphabet for English)?
A = {w| w is in the dictionary}
Note the complications:
For example, ``The'' and ``the'' are both in the ``dictionary''.
But ``The quick brown fox jumped over the lazy dog.'' is a sentence; while ``the quick brown fox jumped over The lazy dog.'' isn't.
Punctuation is also included in the alphabet.
More powerful tools are required to explain the legal sentences of English.
Syntax
The syntax of a language describes the structure of legal sentences of that language. For example, for a programming language, the ``sentences'' of that language are the legal ``programs'' - ones the compiler will compile without error.
It is distinguished from ``semantics'' which assigns a meaning to each sentence of the language. [How does this relate to the programs of a programming language?]
As an aside, to put perspective on the task of ascribing semantics, the following are typical specifications.
Operational Semantics: how a real or abstract machine would behave as it interpreted the sentence. This is relatively simple when applied to programming languages, yet it causes problems even in that restricted domain.
Axiomatic Semantics: A system of ``predicates'' and how these predicates would change in evaluation upon interpretation of the sentence.
E.g.,
pre- world as it is now.
Sentence (Bob Dole speaking): Bob Dole wants inclusion.
post- causes Republican reaction R
      causes Voter reaction V
      causes Media reaction M
Actually, the system probably would be more restricted:
post- it is known that Bob is amenable to some form of minority recognition and (implicature) inclusion was not previously recognized as part of his plank
This example shows why semantics must be based on a larger text (frame), rather than what is normally called a sentence.
Denotational Semantics: A ``mathematical'' description of what a sentence means. (One of the simpler examples may be a semantic net and how it is altered by a sentence.)
See this apple?      apple --(is-in)--> view

The apple is red.    apple --(is-in)--> view
                       |
                       | is (has the property)
                       V
                      red
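One way to read the apple example is that each sentence denotes a function from semantic nets to semantic nets. The sketch below assumes a net is just a set of (node, relation, node) triples; the function names see and describe, and the triple encoding, are choices made for this illustration and not part of any standard formalism.

```python
def see(net, thing):
    """Meaning of ``See this <thing>?'': the thing is placed in view."""
    return net | {(thing, "is-in", "view")}

def describe(net, thing, prop):
    """Meaning of ``The <thing> is <prop>.'': the thing gains the property."""
    return net | {(thing, "is", prop)}

net0 = frozenset()                      # nothing is known yet
net1 = see(net0, "apple")               # See this apple?
net2 = describe(net1, "apple", "red")   # The apple is red.

for edge in sorted(net2):
    print(edge)
```

Because each function returns a new net, the earlier nets stay available, which makes it easy to compare the world before and after a sentence.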
Continuing with syntax:
An example, which attempts to define a legal English sentence:
<sentence> ::= <determiner> <noun> <verb> <determiner> <noun>
<determiner> ::= a
<determiner> ::= the
<noun> ::= boy
<noun> ::= boys
<noun> ::= girl
<noun> ::= girls
<verb> ::= watch
<verb> ::= watches
From which we can derive the following sentences:
the boy watches the girls
the girls watch a boy
But we can also derive some other ``legal'' sentences, which we know are not part of normal English:
the boy watch the girls
a girls watch a boy
In general, it is very difficult to handle these kinds of context-sensitive constructs (here, agreement in number between determiner, noun, and verb).
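Because this grammar has no recursion, its whole language can be listed by filling each slot of the <sentence> rule independently. The sketch below does that with itertools.product; the variable and function names are made up for the example.

```python
from itertools import product

DETERMINERS = ["a", "the"]
NOUNS = ["boy", "boys", "girl", "girls"]
VERBS = ["watch", "watches"]

def all_sentences():
    """Every string derivable from <sentence>, one word choice per slot."""
    slots = [DETERMINERS, NOUNS, VERBS, DETERMINERS, NOUNS]
    return [" ".join(words) for words in product(*slots)]

sentences = all_sentences()
print(len(sentences))  # 2 * 4 * 2 * 2 * 4 choices
print("the boy watch the girls" in sentences)
```

The slots are chosen independently of one another, which is exactly why the grammar cannot rule out mismatches such as ``a girls''.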
An attempt to clean up the previous "grammar":
<sentence> ::= <noun part> <verb part> <object part>
<noun part> ::= <noun phrase>
<verb part> ::= <verb>
<object part> ::= <noun phrase>
<noun phrase> ::= <determiner> <noun>
<determiner> ::= a | the
<noun> ::= boy | boys | girl | girls
<verb> ::= watch | watches
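It is worth checking what the clean-up actually changed. The sketch below encodes the first grammar and the cleaned-up one as Python dictionaries (an encoding assumed for this example, with each nonterminal mapped to a list of alternatives) and expands both exhaustively. The expansion terminates only because neither grammar is recursive.

```python
from itertools import product

FLAT = {
    "sentence": [["determiner", "noun", "verb", "determiner", "noun"]],
    "determiner": [["a"], ["the"]],
    "noun": [["boy"], ["boys"], ["girl"], ["girls"]],
    "verb": [["watch"], ["watches"]],
}

LAYERED = {
    "sentence": [["noun part", "verb part", "object part"]],
    "noun part": [["noun phrase"]],
    "verb part": [["verb"]],
    "object part": [["noun phrase"]],
    "noun phrase": [["determiner", "noun"]],
    "determiner": [["a"], ["the"]],
    "noun": [["boy"], ["boys"], ["girl"], ["girls"]],
    "verb": [["watch"], ["watches"]],
}

def language(grammar, symbol="sentence"):
    """Return the set of word tuples derivable from `symbol`."""
    if symbol not in grammar:
        return {(symbol,)}              # a terminal stands for itself
    results = set()
    for alternative in grammar[symbol]:
        parts = [language(grammar, s) for s in alternative]
        for combo in product(*parts):
            results.add(sum(combo, ()))  # concatenate the sub-derivations
    return results

print(language(FLAT) == language(LAYERED))
```

The two grammars generate the same set of strings: the extra nonterminals add structure to the parse, but the agreement problem remains.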
Notes:
Grammars
A grammar is an abstract entity which attempts to describe strings of a language.
The concept of a grammar can be divided into (at least) 2 major theories -- structural and transformational. These theories can be reasonably associated with two theories on learning -- empiricist vs. rule governed.
Empiricist point of view:
Language is acquired as a set of habits. These habits are formed by reinforcement, association, and generalization. [Mark Lester, Readings in Applied Transformational Grammar 2nd ed., Holt, Rinehart, Winston, Inc.]
Rule-governed point of view:
Language is acquired through the formation of a set of rules. [Chomsky claims virtually every sentence a speaker says is new to him.] [Chomsky calls this internal rule system the speaker's "linguistic competence". A rule system which purports to mimic this "linguistic competence" is called a "generative grammar".]
It has been found that in order for these rule systems (generative grammars) to be sufficiently general, they must be very abstract. That is, they are ``many steps removed from any kind of physical fact''. A significant question is whether the rules (if they in fact exist) are learned or inherent. Chomsky claims the human mind has an ``intrinsic intellectual organization''.
Formally, one type of grammar can be defined:
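A context-free grammar is a 4-tuple (T,N,P,S), consisting of:
T, a finite set of terminal symbols;
N, a finite set of nonterminal symbols, disjoint from T;
P, a finite set of productions, each of the form
A -> x where A is a nonterminal; x is a string of nonterminals and terminals;
S, the start symbol, a member of N.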
E.g., Let
T = { 0, 1, or, (, ) }, N = { A, B }, S = A,
P = {
A -> B
A -> ( A or B )
B -> 0
B -> 1
}
Which of the following are legal?
( 1 or 0 )
yes, since:
A => ( A or B ) => ( B or B ) => ( 1 or B ) => ( 1 or 0 )
( ( ( 0 or 1 ) or 1 ) or 0 )
yes, since:
A => ( A or B ) => ( ( A or B ) or B ) => ( ( ( A or B ) or B ) or B )
=> ( ( ( B or B ) or B ) or B ) =>* ( ( ( 0 or 1 ) or 1 ) or 0 )
Note: "=>*" means 0 or more applications of productions; the production used at each step can be written under the "=>".
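( ( 0 or 1 ) or ( 1 or 0 ) ) no, since the right operand of "or" must be derived from B, which yields only 0 or 1
( 0 or 1 ) or ( 1 or 0 ) no, since every "or" must sit inside its own pair of parentheses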
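The four answers above can be checked mechanically. The following is a minimal recursive-descent recognizer for this particular grammar, assuming (as in the examples) that tokens are separated by spaces; the function names are invented for the sketch. It works without backtracking because the two productions for A begin with different tokens: "(" for ( A or B ), and 0 or 1 for B.

```python
def derives(text):
    """True if `text` can be derived from the start symbol A."""
    tokens = text.split()

    def parse_B(i):
        """Return the position just past a B starting at i, or None."""
        if i < len(tokens) and tokens[i] in ("0", "1"):
            return i + 1
        return None

    def parse_A(i):
        """Return the position just past an A starting at i, or None."""
        if i < len(tokens) and tokens[i] == "(":
            j = parse_A(i + 1)                      # A -> ( A or B )
            if j is None or j >= len(tokens) or tokens[j] != "or":
                return None
            k = parse_B(j + 1)
            if k is None or k >= len(tokens) or tokens[k] != ")":
                return None
            return k + 1
        return parse_B(i)                           # A -> B

    # Legal only if one A covers the entire input.
    return parse_A(0) == len(tokens)

for s in ["( 1 or 0 )",
          "( ( ( 0 or 1 ) or 1 ) or 0 )",
          "( ( 0 or 1 ) or ( 1 or 0 ) )",
          "( 0 or 1 ) or ( 1 or 0 )"]:
    print(s, derives(s))
```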
Grammar-Related definitions:
BNF
When the description of Algol-58 was published, John Backus noted several inconsistencies in the language description. He set about describing the syntax in a grammar-like notation. When the Algol-60 language definition was to be released, Peter Naur, editor of the "Algol Bulletin", borrowed from and added to Backus' notation, developing what became known as Backus-Naur Form (BNF).
In BNF, nonterminals are enclosed in "<", ">", and the arrow symbol of productions is "::=".
E.g:
<expression> ::= <expression> + <term>
               | <expression> - <term>
               | <term>
<term> ::= <term> * <factor>
         | <term> / <factor>
         | <factor>
<factor> ::= ( <expression> )
           | <variable>
           | <constant>
Note:
Try parsing:
a + b * c
5/10-3-x
Create a parse tree.
Create a derivation.
(Note: they are not the same thing.)
Extended BNF (EBNF)
Nonterminals begin with an uppercase letter; terminals are enclosed within single quotes ('); the vertical bar (|) is formalized as the "alternation" symbol; parentheses, ( ), are used for grouping; braces, { }, represent 0 or more repetitions; and brackets, [ ], represent an optional construct. Reasonable variations on these conventions are common.
EBNF e.g:
Expression ::= Term { ('+'|'-') Term }
Term ::= Factor { ('*'|'/') Factor }
Factor ::= '(' Expression ')' | Variable | Constant
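One attraction of EBNF is that each rule translates almost directly into a procedure: a { ... } repetition becomes a loop and a | becomes a branch. The sketch below is a small recursive-descent parser written that way. Several things in it are assumptions, since the grammar leaves them open: Variable is taken to be an identifier, Constant an unsigned integer, and the result is a nested tuple (operator, left, right), which is a condensed syntax tree and not the full parse tree.

```python
import re

# Assumed token shapes: identifiers, unsigned integers, operators, parentheses.
TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+|[-+*/()])")

def tokenize(text):
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def parse(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expression():                 # Expression ::= Term { ('+'|'-') Term }
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())  # the loop groups to the left
        return node

    def term():                       # Term ::= Factor { ('*'|'/') Factor }
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())
        return node

    def factor():                     # Factor ::= '(' Expression ')' | Variable | Constant
        tok = peek()
        if tok == "(":
            advance()
            node = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok is None or tok in {"+", "-", "*", "/", ")"}:
            raise SyntaxError(f"unexpected token {tok!r}")
        return advance()

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse("a + b * c"))
print(parse("5/10-3-x"))
```

Building the tuple inside the loop keeps subtraction and division left-associative, matching the grouping that the left-recursive BNF rules give.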
Note:
Note how some syntactic entities don't require the power of context-free grammars. Right-linear grammars, i.e., those whose productions all have the form:
A -> a
B -> bC
are powerful enough.
These grammars lead to the concept of regular expressions.
These entities are usually "discovered" by a lexical analyzer. Thus, Identifier would be a terminal, not a nonterminal, in the grammar for the language.
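To make the connection concrete, the sketch below recognizes identifiers in two ways: by stepping through a right-linear grammar one character at a time, and by an equivalent regular expression. The definition of an identifier used here (a letter followed by letters or digits) is an assumption for the example; real languages differ in the details.

```python
import re
import string

LETTERS = set(string.ascii_letters)
DIGITS = set(string.digits)

# Assumed right-linear grammar for identifiers:
#   I -> letter | letter R
#   R -> letter | digit | letter R | digit R
def is_identifier_by_grammar(s):
    """Follow the productions left to right; the nonterminal acts as a state."""
    state = "I"
    for ch in s:
        if state == "I":
            if ch not in LETTERS:
                return False
            state = "R"
        elif ch not in LETTERS and ch not in DIGITS:
            return False
    return state == "R"               # at least one letter was consumed

# The same language written as a regular expression.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*")

for s in ["x", "count1", "1abc", ""]:
    print(repr(s), is_identifier_by_grammar(s), bool(IDENTIFIER.fullmatch(s)))
```

Since only one nonterminal is ever pending, recognition needs no stack, which is why a lexical analyzer can handle such entities before the parser sees them.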