CSC 241- Set13

CSC 241- Set #13


Set

There are many applications that require maintaining of objects based on a key value. For example, we may have an application program that needs to keep track of employees. For the most part, employees are uniquely identified by their Social Security number. Imagine that our application allows an end-user to add or remove employees or update their information. May be, we would like the program to filter employee records and give reports on how many of different levels of employees we have. What is the best way for us to keep track of these employees?

At this point, it should be easy for you to think about storing them in an array or a linked list, sequentially search for them, remove some or add new ones. But, a simple array or linked list may not give us the performance that we like. The performance of most of our operations is O(n) when we utilize a simple sequential data structure.

What the application wants to do with employee information has something to do with our decisions on how store and manage the information. Yet, there is a significant level of commonality between applications that need to keep objects that have keys.

We want a key for objects because we want them to be unique and we like to be able to retrieve them based on their key. We also have learned that it is best to leave things general, if possible, when designing classes. Doing so, provides us with the ability to reuse already developed classes in many applications. The concept of Sets seem to fit this generalization well. What is the difference between an employee and a part? Can you not think of an application that would maintain parts and provide the same functionality as we said we would for employees? Even the more far fetched example like a dictionary with words and their meanings can be thought of as a set where each word is uniquely identified.

As you will see Sets are not designed to anticipate all functionality required by all applications; they are designed to address the most common requirements for storing keyed data. We can always extend such collections to satisfy the needs specific to a particular application.

The Set Interface

Set is defined as an interface, it allows us to find/add/remove objects that adhere to an elementType interface. We are also identifying a few helper methods as well, size, dump, and toArray. size lets us know how many elements there are in a set. toArray creates an array and stores all elements from the set in it and returns this array. dump displays the content of a set, intended to help in debugging, not really for an end-user.

What is an interface? An interface has no code associated with its methods, has no fields or constructors. But, it can be designed by any number of classes. It is also important to understand that a class in java can implements multiple interfaces. for example, if we need to, we can make a class implement a Set as well as, Comparable which as an interface defined in java.lang. This means that our class must have the methods that we say Sets must have, but also have a compareTo method as well. By implementing both interfaces, we can have objects that not only store elements, but also can be compared against other Comparables.

Here is the definition for Set. Pay close attention to the return values, parameter types, and exceptions of the methods.

public interface Set {
  public void add(elementType e) throws duplicateException;
  public void remove(String k) throws notFoundException;
  public elementType find (String k);
  public int size();
  public void dump();
  public Object[] toArray();
}

elementType Interface

elementType is also an interface. So, any class can implement it. For example, an Employee class can implement it, a dictionaryElement class can implement it; basically, anything that we would consider for storing in a Set would implement elementType.

So, what does it mean when we say something "implements" something else? First, it has to be defined that way, for example, we would say class Employee implements elementType ... in the header of the class. But more importantly, this definition obligates such a class to have the methods that the interface says it must have. In the case of elementType, we are saying that any class that implements it must have the key() and the stringView() methods.

public interface elementType {
  public String stringView();
  public String key();
}

The stringView method is useful when the dump method of needs to display the content of a set. key plays a more important role.

Why is it useful for all objects stored in a collection to have a key method? Why use a String object for the return value?

Well, if a set needs to verify that a particular employee or part or any object is in fact stored in it, it can use the key method of its objects to determine that. Without such an interface we can not implement a clean and general class that implements the Set interface. All classes that implement elementType must be designed to have their key method return a unique string that identify them.

For example, an employee object would return the SS# of the employee. Even if the actual key information is an integer or float, we have methods in Java's API that convert such values to strings.

Binary Search Tree

We are going to use the Binary Search Tree as a class that implements Set. BST keeps unique elements and as long as we have a way of comparing them, it is easy to place new nodes in it or find existing ones.

Let us first define Trees and Graphs, keeping in mind that there is much more to them than just a Binary Search Tree. We will discuss these further in subsequent courses.

A Graph is a collection of nodes/vertices connected by lines or edges.

A Tree is a specialization of a graph where the edges are directed with the concept of a single starting point, a root, and only one path to any given node in the tree. This means that we can't have cycles (a node pointing back to an ancestor) and no node can have more than one parent. There is an inherent parent-child relationship between nodes embedded in this definition which we will take advantage of for defining BSTs. There is also a clean and important concept of a subtree. Take out any edge connecting two nodes of the tree, what you end up with is the original tree with a portion of it pruned off, and what we refer to as a subtree which is also a tree; the proof for this is fun, but I'll let it go for now.

The heuristic for Binary Search Tree is well defined and simple; note that it is a recursive.

A Tree is a Binary Search Tree iff ...

Lets demonstrate with a few example:


          45                 33                 4                 40
         /  \               /  \                 \               /  \
      12      60         20      40                5            7    70
     /  \       \         \                         \            \  / 
   3     20      65        34                         6            9

      BST                 NOT BST                BST            NOT BST

As mentioned before we will use BST for implementing the Set interface. There are several good reasons for this. BST's structure lends itself well to search processes that look for a particular element, you travel the height of the tree to find something. Adding and removing are not costly either, each, only requiring traversal of a single path to the bottom at worst case. As the 3rd example above shows, it is possible to turn a tree into a linked list; however, it is only the worst case scenario. We have found that BST performs close to log(n) in average cases for add, remove, and find.

To start implementing a BST, we need to define our node. A node here will hold a elementType element and two references, one to the left subtree and the other to the right. Here is the code:


public class treeNode {
  elementType element_;
  treeNode left_;
  treeNode right_;
  public   treeNode (elementType e, treeNode l, treeNode r) {
     left_=l;
     right_=r;
     element_=e;
   }
}

The header for Tree class (our BST) will look like this: public class Tree implements Set as it needs to be declared as a class that "implements" Set.

The only state variable in this class is root which is a treeNode and initially set to null. The constructor for Tree does nothing. In the following sections, we will discuss the details of add, remove, and dump methods.

add method

Lets consider the add method. First of all Tree has to have an add method since Set identified it as a method that must be implemented. Here is the code:


  public void add(elementType x) throws duplicateException {
      elementType elementAlreadyInSet = find(x.key());
      if (elementAlreadyInSet != null)
        throw new duplicateException(x,elementAlreadyInSet);

      size_++;//increment the size of this tree
      root_=insert(root_,x);//use recursive insert to add x to the tree
  }

The find method is used to search for a key, if it is found, the element found is returned. In add, find method returning an object instead of null, means that the element is already in the tree which cause us to throw the duplicateException. Note that the constructor for duplicateException is designed to take both the object that we are trying to add to the set and the object that is already in the set as arguments. Once we determine that the object is not found, we can proceed with putting it in.

The insert method is written recursively, not because we had to do it that way, but only to provide yet another example of recursion. The statement root_=insert(root_,x); makes it possible to modify root. We had seen this technique before when discussing mergeSort.

Let us now look at the code for insert:


  private treeNode insert (treeNode root, elementType x) {
    if (root == null) return new treeNode(x,null,null);
    if (root.element_.key().compareTo(x.key()) < 0) 
      root.right_ = insert (root.right_,x);
    if (root.element_.key().compareTo(x.key()) > 0) 
      root.left_ = insert (root.left_,x);
    return root;
  }

in reading a statement like if(root.element_.key().compareTo(x.key())<0), keep in mind that root references a treeNode object. in that object, we have an element_ which is an elementType object. elementType objects have a key method that returns a String. String objects have a compareTo method that compare their content against another String object's content, in this case x.key().

It is easy to see that this algorithm traverses down a path in the tree and does not stop until a null reference to the tree or a subtree is encountered. Yet, the most fundamental aspect of insert which is the process of attaching the newly created node to the rest of the tree is the least obvious element of this algorithm. The code for attaching the new node is hidden away in the recursion here. Lets use an example to show how it works:

Suppose a tree is in the following state:


          root
           10
          /  \
         5    20
             /  \
            15   30

Let us insert a 17 into this tree. Since 17 is not already in the tree, we expect insert to be called and we execute the following statement (I hope my crude syntax here makes sense):


  root = insert (root,x);               1st call
                  10  17
                 /  \
                5    20
                    /  \
                   15   30

Since root is not null, we decide which way to go, and recursively call insert:


  root.right_ = insert (root.right_,x);   2nd call
                         20         17
                        /  \
                      15    30

root is not null, so we recursively call insert again:


  root.left_ = insert (root.left_,x);     3rd call
                        15        17

This will be the last call to insert, since root.right_ is null, the root parameter will be null upon entry:


  root.right_ = insert (root.right_,x);   4th call
                         null       17

At this point, a new node is created with 17 in it and returned. But where does it return to and what do we do with it? We are going to return the new node to where the 4th call to insert was made from, which as you can see places the return value in root.right_. consequently, making that subtree look like the following:


         root
          15
            \
             17

Note that the last instruction executed in the method is return root; This statement is designed to reattach the rest of the tree as we return from each recursive call, finally, making the tree look like the following:


          root
           10
          /  \
         5    20
             /  \
            15   30
              \
               17

This reattachment technique is great because it has very little over head and we don't have to think about our subtrees changing in subsequent recursive calls. There is no need for checking to see if the right or left branches have changed upon return from a recusive call; a statement such as, root.left_ = insert (root.left_,x); restores connection to a node's left subtree, regardless of the subtree changing or remaining the same.

remove method

The algorithm for removal of elements from a tree is a bit more complicated than the one for adding elements. There are three cases that must be handled here, two of which are simple. When deleting an element, you must traverse the tree until you locate the node that it is in, this, of course, assumes that the element exists in the tree. The node being deleted may be a leaf (no children) or it may have only one offspring, a left child, or a right child. These cases are simple to handle as you will see later. The third case is a bit more complicated and it deals with a node being removed that has both a left and right child. Keep in mind that when I say a left or right child, I am really talking about subtrees, so we don't know how many levels each have.

The remove method is designed similarly to the add method in that it calls a recursive method, del, to actually remove an element based on its key. remove also makes use of find to determine if an element really exists in the tree.

Lets consider the code for del piece by piece. Again, first we need to located the appropriate node to be deleted, then determine which case we are dealing with and of course take appropriate action to remove the node.

This segment is very similar to what we did in insert and is designed to traverse the tree recursively until the node that we want to delete is found. Notice that it uses the same technique for reconnecting the tree, for example, root.left_=del(root.left_,key); will reconnect a node to its left subtree and not bother worrying about if it was changed or not.


    if (root.element_.key().compareTo(key) < 0) { //go right
       root.right_ = del (root.right_,key);
       return root;
    }
    if (root.element_.key().compareTo(key) > 0) { //go left
       root.left_ = del (root.left_,key);
       return root;
    }

Now, the rest of the code deals with determining which case we have and handling it. The first two if statements deal with a node being deleted that has a left subtree, or a right subtree, or none at all.


    if (root.left_ == null)
      return root.right_; //since there is no left side replace root with right
    if (root.right_ == null)
      return root.left_; //since there is no right side replace root with left

Here are a couple of examples to show what happens here. Consider the following tree:


          root
           10
          /  \
         5    20
             /  \
            15   30
              \
               17

Suppose we like to delete the 5 from this tree, we call del from remove with the following call:


 root_=del(root_,key);
           10     5
          /  \
         5    20
             /  \
            15   30
              \
               17

Since 5 is less than 10, we go left using:


root.left_=del(root.left_,key);
                 5         5

Upon entering del we find out that we don't want to go left or right, so we have the correct node. Then we check root.left_ against null, since it is null, we return root.right_. In this case, notice that we are returning null as the root.right_ is also null.

At the previous level of recursion, the root gets reconnected with its left side, which is now null. So, the tree will look like the following and gets returned to remove.


          root
           10 
             \
              20
             /  \
            15   30
              \
               17

Suppose we like to delete the 15 from the original tree, we call del from remove with the following call:


 root_=del(root_,key);
           10     15
          /  \
         5    20
             /  \
            15   30
              \
               17

Since 15 is greater than 10, we go right using:


root.right_=del(root.right_,key);
                  20         15
                 /  \
                15   30
                  \
                   17

Since 15 is less than 20, we go left using:


root.left_=del(root.left_,key);
                  15       15
                    \
                     17

Again, we find out that we have found the correct node that its left side is null. root.right_ is returned. When the root at the previous level gets reconnected with its left side, we end up with:


    root
     20
    /  \
   17   30

This subtree is returned to the previous level of recursion and root, at that level, is reconnected with its right side and the tree is returned to the remove method. The tree will look like the following:


          root
           10 
          /  \
         5    20
             /  \
            17   30

In summary, for the simple cases, once we locate the node being deleted, we return either its left or right subtree. Now, on to the interesting case, when the node being deleted has both a left and right subtree.

When a node has both a left and right subtree, we can't simply return its left or right subtree; if you do, you will lose one of them.

The technique utilized here requires to first locate either the successor or the prredecessor of a node. Lets assume we will want to use the successor in our delete algorithm.

What node is considered to be a successor of the current node? The node that contains the next largest value. In our original tree, the successor to 10 is 15. How does one locate the successor node? Start from the root of the right subtree of the node that we wish to find the successor of, and follow the left branches of each subsequent node until the left most node in that subtree is found (i.e. the left branch of that node is null); that node contains the successor.

Once the successor is located, the element from the successor is copied into the node that contains what we are deleting. Then, we continue the recursive delete process in the direction of the right subtree for the element in the successor. Keep in mind that without deleting the content of the successor node, the same value will now appear twice in our table.

Lets show this by deleting 10 from our original tree. del will be called with following:


 root_=del(root_,key);
           10   10
          /  \
         5    20
             /  \
            15   30
              \
               17

for(temp=root.right_;temp.left_!=null;temp=temp.left_); is designed to locate the successor node, root.element_=temp.element_; copies the element from the successor into the current root node. At this point the tree will look like the following:


          root
           15 
          /  \
         5    20
             /  \
            15   30
              \
               17

keep in mind, even though this is not a BST at this instant, we are not done yet, the search delete process continues with:


    root.right_ = del (root.right_,temp.element_.key());
                          20           15
                         /  \
                        15   30
                          \
                           17

Since element in root is greater than 10, we go left with:


    root.left_ = del (root.left_,key);
                          15      15
                            \
                             17

Found the right node, this time, however, there is only a right subtree, so we simply return it in place of root. Once all recursive steps are completed, the tree will look like this:


          root
           15 
          /  \
         5    20
             /  \
            17   30

dump & toArray methods

Both dump and toArray use recursive methods. You should notice they both call inOrder. If you pay attention, you will notice that they are calling different versions of inOrder. The parameters for the version called from dump are root and an indent value. The parameters for the version called from toArray are the root, an array to hold the objects retrieved from the tree and an index variable identifying where to store the next element in the array.

inOrder called from dump performs an in-order traversal from right to left and displays the tree content in the form of a tree turned 90 degrees counterclockwise.

inOrder called from toArray performs an in-order traversal from left to right and stores each element in an array. Here, I'll simply discuss the concept of inOrder traversal which is what inOrder utilizes.

inOrder is a special traversal technique for BSTs whereby visiting the left subtree, consider root, then visiting the right subtree we get the tree content in an ascending order. If we visit the right side first, consider the root node, then the left, we get the tree content in descending order. The following is the general logic for getting to the elements in an ascending order:


inOrder traversal
     if root is not null
        apply inOrder traversal to left subtree
        display key in root
        apply inOrder traversal to right subtree

To make inOrder traversal clear, let us follow this example:


          root
           15 
          /  \
         5    20
             /  \
            17   30

First we traverse left. Since this subtree only has the node with 5, no left or right child, we display 5 and return. We then display 15 when we return to the previous level of recursion. Now we traverse right, the subtree there looks like:


             root
              20
             /  \
            17   30

Again, first we traverse left, and display 17. Then we display the 20, then we traverse right and display 30. Note that all nodes were visited with this algorithm and the list of elements is displayed in ascending order.