CSC 350 Assignment #2 — Trial Stemming
A major purpose of this assignment is to make sure everyone can run software on moxie and to let everyone work any bugs out of their scripting process.

Due Tuesday, February 19 (scripted by midnight).

Create a script file (as described in the on-line document on scripting) for each problem solved. Do not edit the script files. Hand in a hard copy of the script file(s) at the next class after the date on which you created the file.

A major use of stemming is to aid in information retrieval, for example in Web searches, where it allows discovery of documents containing words whose "roots" match the roots of the entered search terms. The word "roots" is in quotes because stemming does not try to derive the grammatical root of a word.

Some words that obviously should have the same root stem to different values. For example, electricity stems to electr but electrician stems to electrician. Similarly, programmer stems to programm while programming stems to program.

Conversely, some words that are not etymologically related do reduce to the same stem. For example, baleful (sinister) and baling (to make into a bale) both stem to bale.
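As a rough illustration of the two categories, the sketch below pairs up consecutive words and compares their stems. The function name classify_pairs is hypothetical, and the stems are copied from the examples above rather than produced by running the stemmer.

```python
def classify_pairs(words, stems):
    """Pair up consecutive words and report whether their stems agree."""
    if len(words) != len(stems) or len(words) % 2 != 0:
        raise ValueError("expected an even number of words and one stem per word")
    results = []
    for i in range(0, len(words), 2):
        same = stems[i] == stems[i + 1]
        category = "same stem" if same else "different stems"
        results.append((words[i], words[i + 1], category))
    return results


# Stems below are the ones quoted in this handout, not computed here.
words = ["electricity", "electrician", "baleful", "baling"]
stems = ["electr", "electrician", "bale", "bale"]
for first, second, category in classify_pairs(words, stems):
    print(f"{first} / {second}: {category}")
```

A pair of related words reported as "different stems", or a pair of unrelated words reported as "same stem", is the kind of result this assignment asks for. Deciding whether two words are actually related is still up to you.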

Use the Python version of the Porter stemming algorithm — /usr/misc/bin/stem — to stem a file of words that you've created to demonstrate this phenomenon.

Specifically, come up with five distinct pairs of words that yield erroneous stemming results of the kinds discussed above.

Steps to take in the process:

  1. Edit a file named words.dat and put in 10 words (5 pairs), one per line.
  2. Edit a file named explanation.txt and state which category each pair falls into (related words that stem to different values, or unrelated words that stem to the same value). Include documentation on who you are and what the assignment is. E.g.,
    Name: J. Random Student
    Assignment: #2 - Trial Stemming
    Synopsis: create some word pairs that demonstrate the inability of the Porter stemming algorithm to account for English grammar.
    Due Date: Sunday February 17
    Data used:
      electrician, electricity should have same root, but stem to different values
      baling, baleful have different roots, but stem to the same value
      etc.

  3. Make a script of your session. The sequence of commands you issue should resemble the following. Be sure to exit from the script before trying to print.
    script two.ts
    cat explanation.txt
    cat words.dat
    /usr/misc/bin/stem words.dat
    exit
    a2ps -R -1 -B -PSnyggLabPrinterQueue two.ts
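
Before scripting, it may help to confirm that words.dat has the layout step 1 asks for. The following is a minimal sketch only. The function name check_word_list is hypothetical, and treating a repeated word as a problem is an assumption, not a stated requirement.

```python
def check_word_list(text, expected=10):
    """Return a list of problems found in the contents of a words file."""
    lines = text.splitlines()
    problems = []
    if len(lines) != expected:
        problems.append(f"expected {expected} lines, found {len(lines)}")
    for number, line in enumerate(lines, start=1):
        # Each line should hold exactly one whitespace-free word.
        if len(line.split()) != 1:
            problems.append(f"line {number} should hold exactly one word")
    words = [line.strip().lower() for line in lines if line.strip()]
    if len(set(words)) != len(words):
        # Assumption: repeated words are flagged, though the handout does not forbid them.
        problems.append("some words are repeated")
    return problems


# A four-word sample using words from this handout; a real file would have ten.
sample = "electricity\nelectrician\nbaleful\nbaling\n"
print(check_word_list(sample, expected=4))
```

To check the real file, pass its contents with the default count: check_word_list(open("words.dat").read()). An empty list means no problems were found.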