CSC 241- Set13



Set

There are many applications that require maintaining unique objects. For example, we may have an application program that needs to keep track of employees. For the most part, employees are uniquely identified by their Social Security number. Imagine that our application allows an end-user to add or remove employees or update their information. Maybe we would like the program to filter employee records and give reports on how many employees we have at different levels. What is the best way for us to keep track of these employees?

The kinds of operations discussed here for employees are common to a significant number of applications that need to keep a set of unique objects.

At this point, it should be easy for you to think about storing them in an array or a linked list, sequentially searching for them, removing some, or adding new ones. But a simple array or linked list may not give us the performance that we would like. The performance of most of our operations is O(n) when we utilize a simple sequential data structure.

All objects have a unique address in Java, but that's not the kind of uniqueness that we need here. If two different String objects contain the word Hello, we consider them to be the same, even though they have different addresses. In cases like employee objects, we don't want two employees with the same SS#s regardless of what they have in their other fields; uniqueness, in this case, is based only on the SS# of the employee objects.

As the previous two examples show, being unique means different things for different objects. It is important to keep things general enough so that we can store these vastly different types of entities in our data structures. We want our data structures to efficiently store unique objects, regardless of how that uniqueness is determined. We don't want to design one efficient data structure to hold unique strings and a different one to hold employees. If possible, when designing classes, we want to be able to reuse our good ideas/design/code in many applications.

The concept of Sets seems to fit this generalization well. What is the difference between an employee and a part? Can you not think of an application that would maintain parts and provide the same functionality as we described for employees? Even a more far-fetched example like a dictionary with words and their meanings can be thought of as a set where each word is a unique String.

As you will see Sets are not designed to anticipate all functionality required by all applications; they are designed to address the most common requirements for storing unique data. We can always extend our notion of a Set to satisfy the needs specific to a particular application.

The Set Interface

Set is defined as an interface; the code for this set is in Set.java. Set allows us to find/add/remove objects that adhere to the Comparable interface. Comparable lets us compare objects of a class that implements it using the compareTo method. The String class already implements Comparable; that's why we can compare String objects with the compareTo method. If we want employee objects stored in our set, they too have to implement Comparable.
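For example, a hypothetical Employee class (the field names here are my own assumptions, not from the actual application) could implement Comparable so that comparisons use only the SS#:

```java
// Hypothetical sketch: an Employee whose uniqueness is based only on
// the Social Security number, not on the other fields.
class Employee implements Comparable {
  String ssn_;
  String name_;

  Employee(String ssn, String name) {
    ssn_ = ssn;
    name_ = name;
  }

  // compareTo looks only at the SSN, so two Employee objects with the
  // same SSN compare as equal regardless of their other fields.
  public int compareTo(Object other) {
    return ssn_.compareTo(((Employee) other).ssn_);
  }
}
```

With this compareTo, two employees named Alice and Bob that happen to share an SS# would be considered the same element by a Set.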

Set requires a few helper methods as well: size, dump, and toArray. size lets us know how many elements there are in a set. toArray creates an array, stores all elements from the set in it, and returns this array. dump displays the content of a set; its output is intended to help in debugging, not really for an end-user.

What is an interface? An interface has no code associated with its methods and no constructors. You have not yet seen an interface with fields, although it is legal to have them. Interfaces need to be implemented by one or more classes. It is also important to understand that a class in Java can implement multiple interfaces. For example, if we need to, we can make a class implement Set as well as Comparable. A class must have the methods that its interface(s) obligate it to have. If a class implements both Set and Comparable, its objects can hold objects that are unique, and we can compare them.

Here is the definition for Set. Pay close attention to the return values and parameter types. Note that methods like add and remove don't throw exceptions, they are expected to do nothing when they detect an error.

public interface Set {
  public void add(Comparable e);
  public void remove(Comparable e);
  public Comparable find (Comparable e);
  public int size();
  public void dump();
  public Comparable[] toArray();
}

Binary Search Tree

We are going to use the Binary Search Tree as the class that implements Set; the code for it is available at Tree.java. A BST keeps unique elements, but it also needs the natural ordering that the compareTo method provides.

Let us first define Trees and Graphs, keeping in mind that there is much more to them than just a Binary Search Tree. We will discuss these further in subsequent courses.

A Graph is a collection of nodes/vertices connected by lines or edges.

A Tree is a specialization of a graph where the edges are directed, with the concept of a single starting point, a root, and only one path to any given node in the tree. This means that we can't have cycles (a node pointing back to an ancestor) and no node can have more than one parent. There is an inherent parent-child relationship between nodes embedded in this definition, which we will take advantage of when defining BSTs. There is also a clean and important concept of a subtree: take out any edge connecting two nodes of the tree, and what you end up with is the original tree with a portion of it pruned off. The pruned-off portion is what we refer to as a subtree, which is itself a tree; the proof for this is fun, but I'll let it go for now.

The heuristic for Binary Search Tree is well defined and simple; note that it is recursive.

A Tree is a Binary Search Tree iff, for every node, all elements in its left subtree are smaller than the element in that node, and all elements in its right subtree are larger (note that this also rules out duplicates).

Let's demonstrate with a few examples:


          45                 33                 4                 40
         /  \               /  \                 \               /  \
      12      60         20      40                5            7    70
     /  \       \         \                         \            \  / 
   3     20      65        34                         6            9

      BST                 NOT BST                BST            NOT BST

As mentioned before, we will use a BST for implementing the Set interface. There are several good reasons for this. The BST's structure lends itself well to search processes that look for a particular element; you travel at most the height of the tree to find something. Adding and removing are not costly either, each only requiring traversal of a single path to the bottom in the worst case. As the 3rd example above shows, it is possible for a tree to degenerate into a linked list; however, that is only the worst-case scenario. A BST performs in O(log n) on average for add, remove, and find.

To start implementing a BST, we need to define our node. A node here will hold a Comparable element and two references, one to the left subtree and the other to the right. Here is the code:


public class treeNode {
  Comparable element_;
  treeNode left_;
  treeNode right_;
  public   treeNode (Comparable e, treeNode l, treeNode r) {
     left_=l;
     right_=r;
     element_=e;
   }
}

The header for the Tree class (our BST) will look like this: public class Tree implements Set, as it needs to be declared as a class that "implements" Set.

The fields in this class are root_ and size_. root_ references the treeNode object which represents the root of the tree; it starts as null. size_ helps us keep track of how many elements are stored in the tree. The constructor for Tree does nothing. In the following sections, we will discuss the details of add, remove, toArray and dump methods.

add method

Let's consider the add method. First of all, Tree has to have an add method since Set declares it. Here is the code; note that the actual act of adding an element to the tree happens in a recursive insert method:


  public void add(Comparable e) {
    //if element is found, do nothing; otherwise, add the element.
    if (find(e) == null) {
      size_++;//increment the size of this tree
      root_=insert(root_,e);//use recursive insert to add e to the tree
    }
  }

The find method is used to search for an element; if it is found, the stored element is returned; otherwise, it returns null. If find returns an object instead of null, we know that the element is already in the tree, so we do nothing.
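The actual find is in Tree.java; a sketch consistent with how insert uses compareTo might look like this (written iteratively here, though a recursive version works equally well, and the demo class is made up for illustration):

```java
// Minimal treeNode, as defined earlier in these notes.
class treeNode {
  Comparable element_;
  treeNode left_, right_;
  treeNode(Comparable e, treeNode l, treeNode r) {
    element_ = e; left_ = l; right_ = r;
  }
}

class FindDemo {
  // Sketch of find (an assumption, not necessarily the exact code in
  // Tree.java): walk down one path of the tree, going right when the
  // current element is smaller than e and left when it is larger,
  // until we hit a match or fall off the tree.
  static Comparable find(treeNode root, Comparable e) {
    treeNode current = root;
    while (current != null) {
      int cmp = current.element_.compareTo(e);
      if (cmp == 0) return current.element_;  // found: return the stored element
      current = (cmp < 0) ? current.right_ : current.left_;
    }
    return null; // e is not in the tree
  }
}
```

Note that find returns the element stored in the tree, not the argument e; for employees this matters, since the stored object carries all the other fields.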

The insert method is written recursively, not because we had to do it that way, but only to provide yet another example of recursion. The statement root_=insert(root_,e); makes it possible to modify root. We had seen this technique before when discussing mergeSort.

Let us now look at the code for insert:


  private treeNode insert (treeNode root, Comparable x) {
    if (root == null) return new treeNode(x,null,null);
    if (root.element_.compareTo(x) < 0) 
      root.right_ = insert (root.right_,x);
    if (root.element_.compareTo(x) > 0) 
      root.left_ = insert (root.left_,x);
    return root;
  }

This is a private method, so it can only be called from within this class. The first if statement checks for the base case when we hit a null root (i.e. this could be the original root, or the root of any subtree in our tree). A treeNode is constructed and returned; this new node will be assigned to the left_ or right_ field of its parent, or to root_ itself.

if (root.element_.compareTo(x) < 0) establishes whether we need to go left or right to insert the new element in our tree. If the result of compareTo is negative, the element in the root is smaller than what we are inserting, so we need to go to the right subtree; if the result is positive, we need to go to the left subtree.

It is easy to see that this algorithm traverses down a path in the tree and does not stop until a null reference is encountered. Yet the most fundamental aspect of insert, attaching the newly created node to the rest of the tree, happens to be the least obvious part of this algorithm. The code for attaching the new node is hidden away in the recursion. Let's use an example to show how it works:

Suppose a tree is in the following state:


          root
           10
          /  \
         5    20
             /  \
            15   30

Let us insert a 17 into this tree. Since 17 is not already in the tree, we expect insert to be called and we execute the following statement (I hope my crude syntax here makes sense):


  root_= insert (root_,e);               1st call
                  10  17
                 /  \
                5    20
                    /  \
                   15   30

Since root is not null, we decide which way to go and recursively call insert. Note that we are now talking about the parameter root, not the field root_, which references the root of the whole tree for our Tree object:


  root.right_ = insert (root.right_,x);   2nd call
                         20         17
                        /  \
                      15    30

root is not null, so we recursively call insert again:


  root.left_ = insert (root.left_,x);     3rd call
                        15        17

This will be the last call to insert, since root.right_ is null, the root parameter will be null upon entry:


  root.right_ = insert (root.right_,x);   4th call
                         null       17

At this point, a new node is created with 17 in it and returned. But where does it return to and what do we do with it? We return the new node to where the 4th call to insert was made from, which, as you can see, places the return value in root.right_, consequently making that subtree look like the following:


         root
          15
            \
             17

Note that the last instruction executed in the method is return root; This statement is designed to reattach the rest of the tree as we return from each recursive call, finally, making the tree look like the following:


          root
           10
          /  \
         5    20
             /  \
            15   30
              \
               17

This reattachment technique is great because it has very little overhead and we don't have to think about our subtrees changing in subsequent recursive calls. There is no need to check whether the right or left branches have changed upon return from a recursive call. A statement such as root.left_ = insert (root.left_,x); connects the node to its left side; there could be a new node there, or we may be reconnecting it to what was to its left already.
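The walkthrough above can be checked end to end with a small self-contained harness; insert is the method from the text, while the treeNode and demo class packaging here is just for illustration:

```java
// treeNode as defined earlier in these notes.
class treeNode {
  Comparable element_;
  treeNode left_, right_;
  treeNode(Comparable e, treeNode l, treeNode r) {
    element_ = e; left_ = l; right_ = r;
  }
}

class InsertDemo {
  // The recursive insert from the text, reproduced so the walkthrough
  // can be run: smaller roots send x right, larger roots send x left,
  // and "return root" reattaches each subtree on the way back up.
  static treeNode insert(treeNode root, Comparable x) {
    if (root == null) return new treeNode(x, null, null);
    if (root.element_.compareTo(x) < 0)
      root.right_ = insert(root.right_, x);  // x belongs in the right subtree
    if (root.element_.compareTo(x) > 0)
      root.left_ = insert(root.left_, x);    // x belongs in the left subtree
    return root;                             // reattach this subtree to its parent
  }
}
```

Inserting 10, 5, 20, 15, 30, and then 17 produces exactly the final tree pictured above, with 17 as the right child of 15.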

remove method

The algorithm for removal of elements from a tree is a bit more complicated than the one for adding elements. There are three cases that must be handled here, two of which are simple. When deleting an element, you must traverse the tree until you locate the node it is in; this, of course, assumes that the element exists in the tree. The node being deleted may be a leaf (no children) or it may have only one child, either a left child or a right child. These cases are simple to handle, as you will see later. The third case is a bit more complicated and deals with a node being removed that has both a left and a right child. Keep in mind that when I say a left or right child, I am really talking about subtrees, so we don't know how many levels each has.

The remove method is designed similarly to the add method in that it calls a recursive method, del, to actually remove an element. find is also used here to determine if an element really exists in the tree before calling del.

Let's consider the code for del piece by piece. Again, first we need to locate the appropriate node to be deleted, then determine which case we are dealing with, and of course take the appropriate action to remove the node.

This segment is very similar to what we did in insert and is designed to traverse the tree recursively until the node that we want to delete is found. Notice that it uses the same technique for reconnecting the tree; for example, root.left_=del(root.left_,e); will reconnect a node to its left subtree without worrying about whether it was changed or not.


    if (root.element_.compareTo(e) < 0) { //go right
       root.right_ = del (root.right_,e);
       return root;
    }
    if (root.element_.compareTo(e) > 0) { //go left
       root.left_ = del (root.left_,e);
       return root;
    }

Now, the rest of the code assumes we are at the node with the element we are removing. The first two if statements deal with a node that has only a left subtree, only a right subtree, or no children at all.


    if (root.left_ == null)
      return root.right_; //since there is no left side replace root with right_
    if (root.right_ == null)
      return root.left_; //since there is no right side replace root with left_

Here are a couple of examples to show what happens here. Consider the following tree:


          root
           10
          /  \
         5    20
             /  \
            15   30
              \
               17

Suppose we'd like to delete the 5 from this tree; we call del from remove with the following call:


 root_=del(root_,e);
           10     5
          /  \
         5    20
             /  \
            15   30
              \
               17

Since 5 is less than 10, we go left using:


root.left_=del(root.left_,e);
                 5         5

Upon entering del we find out that we don't want to go left or right, so we have the correct node. Then we check root.left_ against null, since it is null, we return root.right_. In this case, notice that we are returning null as the root.right_ is also null.

At the previous level of recursion, the root gets reconnected with its left side, which is now null. So, the tree will look like the following and is returned to remove, where it is assigned to root_.


          root_
           10 
             \
              20
             /  \
            15   30
              \
               17

Suppose we'd like to delete the 15 from the original tree; we call del from remove with the following call:


 root_=del(root_,e);
           10     15
          /  \
         5    20
             /  \
            15   30
              \
               17

Since 15 is greater than 10, we go right using:


root.right_=del(root.right_,e);
                  20         15
                 /  \
                15   30
                  \
                   17

Since 15 is less than 20, we go left using:


root.left_=del(root.left_,e);
                  15       15
                    \
                     17

Again, we find that we have reached the correct node and that its left side is null, so root.right_ is returned. When the root at the previous level gets reconnected with its left side, we end up with:


    root
     20
    /  \
   17   30

This subtree is returned to the previous level of recursion and root, at that level, is reconnected with its right side and the tree is returned to the remove method. The tree will look like the following:


          root
           10 
          /  \
         5    20
             /  \
            17   30

In summary, for the simple cases, once we locate the node being deleted, we return either its left or right subtree. Now, on to the interesting case, when the node being deleted has both a left and right subtree.

When a node has both a left and right subtree, we can't simply return its left or right subtree; if you do, you will lose one of them.

The technique utilized here requires us to first locate either the successor or the predecessor of the node. Let's assume we use the successor in our delete algorithm.

What node is considered to be a successor of the current node? The node that contains the next largest value. In our original tree, the successor to 10 is 15. How does one locate the successor node? Start from the root of the right subtree of the node that we wish to find the successor of, and follow the left branches of each subsequent node until the left most node in that subtree is found (i.e. the left branch of that node is null); that node contains the successor.

Once the successor is located, the element from the successor is copied into the node that contains what we are deleting. Then, we continue the recursive delete process in the direction of the right subtree for the element in the successor. Keep in mind that without deleting the content of the successor node, the same value will appear twice in our tree.
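Putting the fragments together, a complete del might look like the sketch below. This is an assembly of the pieces quoted in this section (including the successor-finding for loop discussed next), not necessarily identical to the code in Tree.java, and the demo class packaging is made up:

```java
// treeNode as defined earlier in these notes.
class treeNode {
  Comparable element_;
  treeNode left_, right_;
  treeNode(Comparable e, treeNode l, treeNode r) {
    element_ = e; left_ = l; right_ = r;
  }
}

class DelDemo {
  // Recursive del: traverse to the node holding e, then handle the
  // three removal cases. Assumes e is in the tree (remove checks
  // with find before calling del).
  static treeNode del(treeNode root, Comparable e) {
    if (root.element_.compareTo(e) < 0) { // go right
      root.right_ = del(root.right_, e);
      return root;
    }
    if (root.element_.compareTo(e) > 0) { // go left
      root.left_ = del(root.left_, e);
      return root;
    }
    // root holds e; the two simple cases: at most one subtree.
    if (root.left_ == null) return root.right_;
    if (root.right_ == null) return root.left_;
    // Two subtrees: find the successor (leftmost node of the right
    // subtree), copy its element into root, then delete the
    // successor's element from the right subtree.
    treeNode temp;
    for (temp = root.right_; temp.left_ != null; temp = temp.left_);
    root.element_ = temp.element_;
    root.right_ = del(root.right_, temp.element_);
    return root;
  }
}
```

Running this on the original tree and deleting 10 reproduces the final tree shown at the end of this section: 15 at the root, with 17 taking the successor's old place.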

Let's show this by deleting 10 from our original tree. del will be called with the following:


 root_=del(root_,e);
           10   10
          /  \
         5    20
             /  \
            15   30
              \
               17

for(temp=root.right_;temp.left_!=null;temp=temp.left_); is designed to locate the successor node, root.element_=temp.element_; copies the element from the successor into the current root node. At this point the tree will look like the following:


          root
           15 
          /  \
         5    20
             /  \
            15   30
              \
               17

Even though this is not a BST at this instant, we are not done yet; the search/delete process continues with:


    root.right_ = del (root.right_,temp.element_);
                          20           15
                         /  \
                        15   30
                          \
                           17

Since the element in the root (20) is greater than 15, we go left with:


    root.left_ = del (root.left_,e);
                          15      15
                            \
                             17

We have found the right node; this time, however, there is only a right subtree, so we simply return it in place of root. Once all recursive steps are completed, the tree will look like this:


          root
           15 
          /  \
         5    20
             /  \
            17   30

dump & toArray methods

Both dump and toArray use recursive methods. You should notice that they both call inOrder, but they are calling different overloaded versions of it. The parameters for the version called from dump are the root and an indent value. The parameters for the version called from toArray are the root, an array to hold the objects retrieved from the tree, and an index variable identifying where to store the next element in the array.

inOrder called from dump performs an in-order traversal from right to left and displays the tree content in the form of a tree turned 90 degrees counterclockwise.

inOrder called from toArray performs an in-order traversal from left to right and stores each element in an array. Here, I'll simply discuss the concept of in-order traversal, which both versions utilize.

In-order is a special traversal technique for BSTs: by visiting the left subtree, then considering the root, then visiting the right subtree, we get the tree content in ascending order. If we visit the right side first, then the root, then the left, we get the tree content in descending order. The following is the general logic for getting to the elements in ascending order:


inOrder traversal
     if root is not null
        apply inOrder traversal to left subtree
        display e in root
        apply inOrder traversal to right subtree

To make inOrder traversal clear, let us follow this example:


          root
           15 
          /  \
         5    20
             /  \
            17   30

First we traverse left. Since this subtree only has the node with 5, no left or right child, we display 5 and return. We then display 15 when we return to the previous level of recursion. Now we traverse right, the subtree there looks like:


             root
              20
             /  \
            17   30

Again, first we traverse left, and display 17. Then we display the 20, then we traverse right and display 30. Note that all nodes were visited with this algorithm and the list of elements is displayed in ascending order.