Computational Linguistics — Overview

What is the course (viz., computational linguistics) about?

Short answer: Computer recognition of (natural) languages.

Extended explanation:

• "Computer recognition" could be replaced by "algorithmic recognition", since recognition methodologies can be analyzed without a computer.

• Recognition normally implies translation.

• Linguists are typically interested in natural languages, but the use of subsets is necessary to make the problem tractable [1].

• Other forms of communication are of interest to the computational linguist (speech recognition [2], pattern recognition [3], sign language [4], input/output devices [5] for managing disabilities, etc.).

Notes [1] through [5] mark areas of intense interest to linguistics researchers:
[1] Wikipedia entry related to tractability.
[2] Speech recognition has become "ad hoc" doable and, with increasing parallelism, may become "solvable".
[3] The general problem of pattern recognition is even more difficult than that of speech recognition.

Goals

  1. "Review" (or examine for the first time) the elements of linguistics from which "grammar school" derives its name.
  2. Study some of the structures which map cleanly and generally to simple languages. (These structures give insight into how language might be described and understood.) Cf. Grammars.
  3. Study the classifications of grammars.
  4. Implement examples of these concepts in some programming languages, e.g., Lex, C, Lisp, Perl, Prolog.


Topic overview

  1. Introduction
  2. Lexical Analysis
       1. Regular Expressions
       2. Automata
       3. Morphology
  3. Syntax
       1. Word Classes
       2. Part-of-Speech Tagging
       3. Context-Free Grammars
       4. Parsing
       5. Unification
       6. Features
       7. Complexity
  4. Semantics
       1. Representation of Meaning
       2. Logic
       3. Semantic Analysis
       4. Lexical Semantics
  5. Pragmatics
       1. Discourse
       2. Language Generation




Objectives of Computational Linguistics

Why would Inf. Sci./Comp. Sci. students want to study this subject?
Think of some applications you've encountered in your studies.


Ancillary Objectives of Computational Linguistics

The realization that translation is not just a manipulation of symbols created a need for systems with more "complete understanding", and that need has extended the field.

More on machine translation.

Original methodology:


      SOURCE TEXT ---> TARGET LANGUAGE TRANSLATION
But, consider:

      Navajo speakers use one verb for "picking up round thin
      objects" and another for "picking up long flexible
      objects";

      Hopi is said to have no noun for time and no verb tenses for
      past, present, and future (a claim that has since been disputed).

      Or the famous Jimmy Carter faux pas:

         Meant to say: "... the American people have great desires
            (hopes) for the Polish people ..."
         Instead, translated to: "... the American people lust
            after the Polish people ..."

New methodology:


      SOURCE TEXT ---> MEANING OF SOURCE TEXT ---> TARGET LANGUAGE TRANSLATION

Terms/Concepts

  1. Natural Language Processing (NLP)
    the use of computers to understand human languages (i.e., to recognize and respond to natural language inputs).

    This presupposes:

    1. a translation to an "internal" language of the computer (or pseudo-computer)
    2. an internal "representation" of knowledge (knowledge representation)

  2. Form (language is an underlying system of rules)
    vs
  3. Substance (language is not just a set of utterances or behaviors)

    Some linguists claim there is an underlying "form", a grammar, which defines language.

    Supported by:
    - the brain has structures which support language (compare "physical organization" to "functional organization")
    - some claim language is what differentiates man from other animals.

    Others say there is an underlying "substance", which we might try to organize and describe with one or many grammars.

    Supported by:
    - the mapping between brain structures and functional support for language is not perfect
    - others claim that there is no fundamental difference between man and other animals; man may simply be more intelligent, just as intelligence differs among human beings.

  4. Competence (mastery of the rules of the system)
    vs
  5. Performance (observable behavior)

    One would like to design computer programs which are masters of the system (human language) that they are trying to emulate. Mostly we settle for performance, viz., systems that appear to interact in a humanlike fashion but are really just using tricks.

  6. Arbitrariness of language (this is what allows the study of subsets of languages to be meaningful; it is also what demands that conclusions be based on general theories, not just on specific observations).

    This somewhat contradicts the notion that language has a built-in form.

  7. Discrete (digital) [only one way to parse a phrase]
    vs
  8. Continuous (analog)

    The debate over whether language is fundamentally discrete or continuous concerns form. It seems hard to ascribe purely digital behavior to semantics.

    E.g., "This turkey takes 5 hours to cook in a 450 degree oven. So put it in an oven at 900 degrees for 2 1/2 hours." This "almost" makes sense to people. In general, jokes and analogies appear to have subtle shadings of meaning.
    One could argue that this might also apply to syntax. When a speaker says "Errr...", some might interpret this as "thought", "retrieval of information", "lack of knowledge", "embarrassment", etc.

  9. Duality of Patterning: Words are strings of sounds (characters), and utterances (sentences) are strings of words.

    [but note how strings of words often can be rearranged with retention of meaning - strings of sounds usually can't be]


Language and the Brain

Some ask questions like, "Why is there a science of linguistics and not of checkerology?"

Synopses of possible answers:

Some answer this by stating that the brain is structured to develop language but it is not structured to play checkers.

A large group holds the differing view that the brain has various functional components, some of which can be adapted to language.


Levels of Linguistic Analysis

Six (or more) levels of structure of human language may be delimited:
  1. Phonology - sound
  2. Morphology - word formation
  3. Syntax - sentence structure
  4. Semantics - meaning of a sentence
  5. Pragmatics - use of language in context
  6. Discourse - analysis of constructs larger than a sentence

Phonology - sound

How sounds are used. Every language has an alphabet of sounds called "phonemes".
Each phoneme has one or more physical realizations called "allophones".
[Example: "t" in "stop" and "top". These are pronounced differently, but English considers them the same phoneme.]
[Another controversy: the "whole language" approach (to reading) vs. "phonics".]

Morphology - word formation

Two kinds
- "inflection" and "derivation".

Inflection
- how words change to show, e.g., number or tense.

Derivation
- how new words are created from others, e.g., by forming adverbs, compound nouns, or gerunds.

[Student Opportunity - where would this apply in IS applications? Consider systems, even non-language systems, which do not exhaustively store knowledge, but build from basic structures. Which components are going to be the basics (axioms) and which will be derived from these (theorems)?]
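
As one possible illustration of the question above, inflection can be sketched as a small table of stored exceptions (the "axioms") plus general rules that derive everything else. The following C sketch forms English noun plurals. The function name pluralize, the three-entry exception table, and the suffix rules are all assumptions made for this example, and the rules are deliberately incomplete (for instance, "city" would come out wrong).

```c
#include <stdio.h>
#include <string.h>

/* Stored knowledge: a hypothetical, tiny table of irregular plurals. */
struct irregular { const char *singular; const char *plural; };
static const struct irregular IRREGULAR[] = {
    { "child", "children" },
    { "mouse", "mice" },
    { "sheep", "sheep" },
};

/* Returns nonzero if word ends with suffix. */
static int ends_with(const char *word, const char *suffix) {
    size_t wl = strlen(word), sl = strlen(suffix);
    return wl >= sl && strcmp(word + wl - sl, suffix) == 0;
}

/* Writes the plural of noun into out (at most out_size bytes). */
void pluralize(const char *noun, char *out, size_t out_size) {
    /* 1. Look the word up among the stored exceptions. */
    for (size_t i = 0; i < sizeof IRREGULAR / sizeof IRREGULAR[0]; i++) {
        if (strcmp(noun, IRREGULAR[i].singular) == 0) {
            snprintf(out, out_size, "%s", IRREGULAR[i].plural);
            return;
        }
    }
    /* 2. Otherwise derive the plural by rule: sibilant endings take "es". */
    const char *suffix = "s";
    if (ends_with(noun, "s") || ends_with(noun, "x") || ends_with(noun, "z") ||
        ends_with(noun, "ch") || ends_with(noun, "sh")) {
        suffix = "es";
    }
    snprintf(out, out_size, "%s%s", noun, suffix);
}
```

For example, pluralize("box", buf, sizeof buf) leaves "boxes" in buf, while "child" is answered from the stored table rather than from a rule.
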
Syntax - sentence structure

The "structure" of legal sentences in a language. Syntax can be described using a well-behaved tool called a grammar. Grammars can be used both to generate and to recognize sentences.
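
To make the idea of recognition concrete, here is a minimal sketch in C of a recognizer for a toy grammar. The grammar (S -> NP VP, NP -> Det N, VP -> V | V NP), the word lists, and all function names are assumptions invented for illustration; they are not a description of English. Only recognition is shown, not generation.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical toy lexicon; each list is NULL-terminated. */
static const char *const DET[]  = { "the", "a", NULL };
static const char *const NOUN[] = { "dog", "cat", NULL };
static const char *const VERB[] = { "sees", "sleeps", NULL };

static bool in_list(const char *word, const char *const *list) {
    for (; *list != NULL; list++) {
        if (strcmp(word, *list) == 0) return true;
    }
    return false;
}

/* NP -> Det N. On success, advances *pos past the phrase. */
static bool parse_np(const char *const *words, int n, int *pos) {
    if (*pos + 1 < n && in_list(words[*pos], DET) &&
        in_list(words[*pos + 1], NOUN)) {
        *pos += 2;
        return true;
    }
    return false;
}

/* VP -> V | V NP. The object noun phrase is optional. */
static bool parse_vp(const char *const *words, int n, int *pos) {
    if (*pos < n && in_list(words[*pos], VERB)) {
        *pos += 1;
        (void)parse_np(words, n, pos);  /* consume an object if present */
        return true;
    }
    return false;
}

/* S -> NP VP. Accepts only if every word is consumed. */
bool is_sentence(const char *const *words, int n) {
    int pos = 0;
    return parse_np(words, n, &pos) && parse_vp(words, n, &pos) && pos == n;
}
```

With this lexicon, "the dog sees a cat" is accepted and "dog the sees" is rejected. Note that "the dog sleeps a cat" is accepted as well: the grammar checks structure only, not meaning.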


                        Computational Linguistics
                         |                    |
                         |                    |
              Language Analysis       Language Generation
              |              |
              |              |
   Sentence Analysis    Discourse Structure and Analysis
    |              |
    |              |
Syntax Analysis   Semantic Analysis
[Note that the "language generation" subtree can be a mirror image of the "language analysis" subtree.]

There are less general tools for describing the syntax of simple (non-natural) languages. These include automata and regular expressions.
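
A small sketch in C may help show how the two tools relate: the regular expression ab* (an "a" followed by any number of "b"s) can be recognized by an automaton with three states. The expression, the state names, and the function names are assumptions chosen for this example.

```c
#include <stdbool.h>

/* States of a hypothetical automaton for the regular expression ab*. */
enum state { START, ACCEPT, DEAD };

/* Transition function: next state for the current state and input character. */
static enum state step(enum state s, char c) {
    switch (s) {
    case START:  return c == 'a' ? ACCEPT : DEAD;  /* must begin with 'a' */
    case ACCEPT: return c == 'b' ? ACCEPT : DEAD;  /* then only 'b's */
    default:     return DEAD;                      /* no way out of DEAD */
    }
}

/* Runs the automaton over the whole input and reports whether it accepts. */
bool matches_ab_star(const char *input) {
    enum state s = START;
    for (const char *p = input; *p != '\0'; p++) {
        s = step(s, *p);
    }
    return s == ACCEPT;
}
```

Here "a" and "abbb" are accepted, while the empty string, "ba", and "abab" are rejected. The automaton needs no memory beyond its current state, which is what makes this class of languages simpler than those needing a grammar.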

Semantics - meaning of a sentence

Meaning is imparted to a sentence both by its constituent elements and by their locations within the sentence:

       She is not the only one with an axe to grind.
       Only she is not the one with an axe to grind.

The meanings of the individual words may not add up to the total meaning of a sentence:

       The green-eyed monster drove her insane.

And individual meaning may be altered by other aspects of the language (idiom, hyperbole):

       I could care less.
       I could eat a horse.

Some words cannot meaningfully appear together, even though the sentence is syntactically correct:

       Green rocks sleep furiously.
[Student Opportunity - display more examples.]

Included in semantics are the ideas of "sense" (meaning) and "reference" (the actual entity referred to):

       The hostess was a blathering fool by night's end.

(We can tell that "hostess" is used in a semantically correct way without knowing to whom it refers, or even whether the statement is true.)

[Student Opportunity - display more examples.]

Related concepts

Implicature
- information, not part of the semantics of a sentence, which can be inferred by the observer
E.g., "Can you close the door?" (Literally a question about ability, but normally understood as a request.)

Presupposition
- essential "pre-conditions" for a statement to be logically consistent (e.g., if it is a T/F statement, what must be true to decide whether it is T or F?)

E.g., "The King of France is bald." (The statement presupposes that there is a King of France.)
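
One way to picture a failed presupposition in a program is a three-valued result: true, false, or undefined when there is nothing for the statement to be about. The C sketch below assumes that treatment for the King of France example (other analyses simply call such a statement false). The type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Three possible outcomes of evaluating a statement. */
enum truth { T_FALSE, T_TRUE, T_UNDEFINED };

/* Hypothetical world model: a referent either exists (non-NULL) or does not. */
struct person { bool is_bald; };

/* "X is bald" presupposes that X exists. Without a referent,
   the statement cannot be judged either true or false. */
enum truth is_bald(const struct person *referent) {
    if (referent == NULL) {
        return T_UNDEFINED;  /* the presupposition fails */
    }
    return referent->is_bald ? T_TRUE : T_FALSE;
}
```

Calling is_bald(NULL) models asking about a king who does not exist and yields T_UNDEFINED; only when a referent is supplied does the question of T or F arise.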

Presupposition is related to "algorithmic encoding content", a measure of how much information is contained in a message. What appears to be an efficient encoding may rely on copious amounts of implicit knowledge.

E.g., in 'C' (and Java), a for loop looks like:

          for(int i = 0; i < 10; i++) { ... }

What are the semantics?

What happens if a loop is written as:

          for(int i = 0;    ; i++) { ... }

What about?

          for(int i=0; i <= 10;  i = i++) { ... }
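
As a sketch of the implicit knowledge packed into the first two loops, the C functions below spell out the usual reading of for(init; condition; step): run init once, test the condition before every pass, and run the step after the body. The function names and the cap parameter are assumptions made for this example. The third loop is deliberately not reproduced: in C, i = i++ modifies i twice without sequencing and is undefined behavior, whereas Java defines it so that i keeps its old value and the loop never ends.

```c
/* Counts the passes of: for (int i = 0; i < limit; i++) { ... } */
int count_for(int limit) {
    int count = 0;
    for (int i = 0; i < limit; i++) {
        count++;
    }
    return count;
}

/* The same loop with its hidden steps written out. */
int count_while(int limit) {
    int count = 0;
    int i = 0;            /* init: runs exactly once */
    while (i < limit) {   /* condition: tested before every pass */
        count++;          /* body */
        i++;              /* step: runs after the body */
    }
    return count;
}

/* An omitted condition is treated as always true, so only the
   explicit break (with a hypothetical cap) ends this loop. */
int count_no_condition(int cap) {
    int count = 0;
    for (int i = 0; ; i++) {
        if (i >= cap) break;
        count++;
    }
    return count;
}
```

Both count_for(10) and count_while(10) make ten passes. Without the break, count_no_condition would never return.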

Cancellation
- removing implicatures and presuppositions.

E.g.,

   "John Travolta is a genius."
vs.
   "In the movie Phenomenon, John Travolta is a genius."
Or
   "That guy is good-looking."
vs.
   "That guy is good-looking, for an ugly guy."

(A polite way to say the above is, "That guy has off-beat good looks.")

Be able to define and illustrate with examples not used in class.

Pragmatics
- use of language in context

Discourse
- analysis of structures that span multiple sentences.