CSC 350 Assignment #1 — Trial Stemming
A major purpose of this assignment is to make sure everyone can run software on moxie, and to let you work any bugs out of your scripting process.

Due Sunday, September 14 (scripted by midnight).

Create a script file (as described in the on-line document on scripting) for each problem solved. Do not edit the script files. Hard-copy of script file(s) should be handed in at the next class following the date on which you created the file.

A major use of stemming is to aid in information retrieval, e.g., in Web searches, where it allows discovery of documents containing words whose "roots" match the roots of the entered search terms. The word "roots" is in quotes because stemming does not try to derive the grammatical root of a word.

Depending on how the stemming algorithm is implemented, some words that obviously should have the same root stem to different values. For example, electricity might stem to electr while electrician stems to electrician. Similarly, programmer may stem to programm while programming stems to program.

Conversely, some words that are not etymologically related do reduce to the same stem. For example, baleful (sinister) and baling (to make into a bale) both stem to bale.

Use the Python version of the Porter stemming algorithm — /usr/misc/bin/stem — to stem a file of words that you've created to demonstrate this phenomenon.

Specifically, come up with six distinct pairs of words (12 words total) that yield erroneous stemming results as discussed above. Three of the pairs should be in the category "stem to same root but shouldn't", while the other three pairs should be in the "should stem to same form but don't" category. Find your own words; don't use mine or someone else's. You should not expect prefixes to be stemmed by this algorithm, so don't demonstrate words that depend on prefixes to determine their etymology.
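
To make the two categories concrete, here is a minimal sketch of the pair check in Python. It does not run the Porter algorithm. The table of stems simply records the example values quoted earlier in this handout (which are "might stem to" illustrations, not verified output of /usr/misc/bin/stem), and the names EXAMPLE_STEMS, same_stem and check_pairs are hypothetical, chosen only for this illustration.

```python
# Stems copied from the examples in this handout. This table is a stand-in
# for real stemmer output; it is NOT produced by the Porter algorithm here.
EXAMPLE_STEMS = {
    "electricity": "electr",
    "electrician": "electrician",
    "programmer": "programm",
    "programming": "program",
    "baleful": "bale",
    "baling": "bale",
}


def same_stem(pair, stems):
    """Return True when both words of the pair map to the same stem."""
    first, second = pair
    return stems[first] == stems[second]


def check_pairs(pairs, stems):
    """Map each pair to True (stems agree) or False (stems differ)."""
    return {pair: same_stem(pair, stems) for pair in pairs}


if __name__ == "__main__":
    pairs = [
        ("electricity", "electrician"),  # related words
        ("programmer", "programming"),   # related words
        ("baleful", "baling"),           # unrelated words
    ]
    for pair, same in check_pairs(pairs, EXAMPLE_STEMS).items():
        verdict = "same stem" if same else "different stems"
        print(f"{pair[0]} / {pair[1]}: {verdict}")
```

A pair of related words that reports "different stems" belongs in the "should stem to same form but don't" category; a pair of unrelated words that reports "same stem" belongs in the "stem to same root but shouldn't" category.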

Steps to take in the process:

  1. Edit a file named words.dat and put in 12 words (6 pairs), one per line.
  2. Edit a file named explanation.txt and specify which category each pair will fall into. Include documentation on who you are and what the assignment is. E.g.,
    Name: J. Random Student
    Assignment: #1 - Trial Stemming
    Synopsis: create some word pairs that demonstrate the inability of the Porter stemming algorithm to account for English grammar.
    Due Date: Sunday September 14
    Data used:
      Should have same root, but stem to different values
        librarian library
        shift shifty
        contraction contractor
      Have different roots, but stem to the same value
        bowling bowlful
        sows (female pigs) sowing (planting seed)
        integral (smooth sum in math) integrity

  3. Make a script of your session. The sequence of commands you issue should resemble the following. Be sure to exit from the script before trying to print.
    script one.ts
    cat explanation.txt
    cat words.dat
    /usr/misc/bin/stem words.dat
    exit
    snyggprint one.ts
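
Before starting the script session, it may help to sanity-check the word list. The following is a minimal sketch, not part of the assignment: the function name find_problems and its expected_count parameter are hypothetical, and the default of 12 reflects the six pairs called for in the task description. The sample words are the ones from the example explanation.txt above.

```python
def find_problems(lines, expected_count=12):
    """Return a list of human-readable problems found in a word list."""
    problems = []
    # Ignore blank lines and surrounding whitespace.
    words = [line.strip() for line in lines if line.strip()]
    if len(words) != expected_count:
        problems.append(f"expected {expected_count} words, found {len(words)}")
    # The file should hold exactly one word per line.
    for word in words:
        if len(word.split()) > 1:
            problems.append(f"more than one word on a line: {word!r}")
    # Every word should appear only once (case-insensitive).
    seen = set()
    for word in words:
        key = word.lower()
        if key in seen:
            problems.append(f"duplicate word: {word}")
        seen.add(key)
    return problems


# Sample data taken from the example explanation.txt in this handout.
SAMPLE_WORDS = [
    "librarian", "library", "shift", "shifty", "contraction", "contractor",
    "bowling", "bowlful", "sows", "sowing", "integral", "integrity",
]

if __name__ == "__main__":
    for problem in find_problems(SAMPLE_WORDS):
        print(problem)
```

To check your own file, pass an open file in place of the sample list, e.g. find_problems(open("words.dat")). An empty result means none of the three checks found a problem; it says nothing about whether the pairs actually stem the way you claim.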