Lab 3: Searching

Copyright RIT 2008
$Id: writeup.xml,v 1.17 2009/03/19 22:28:02 vcss233 Exp $


Goal

In this lab you will work with an implementation of binary search and use some simple tools to measure the average performance of the algorithm. You will also explore ways to modify the basic binary search algorithm in an attempt to improve performance.

Overview

Objectives

  1. Gain experience in testing and collecting information about the performance of a program.
  2. Study ways to modify a basic algorithm to improve performance.
  3. Practice writing code from specifications.

Pre - Lab Work

  1. Review your class notes on Searching.

  2. Study the Java implementation of binary search by clicking here (./Auxiliary/TestBinarySearch1.java)

    Notice that this class contains a test procedure in main. Writing tests while (or even before) you implement a class is a very good practice. Can you think of some more tests not covered in the code?

  3. Determine the worst case performance for linear and binary search.

In-Lab Activities

Each activity has specific items you must submit for credit. Each activity is graded separately.

Activity #1 : Study the performance of various search algorithms

You are to instrument the search code and write a test program to report search performance over a range of data set sizes. Download and unpack the jar file (http://www.cs.rit.edu/~vcss233/pub/lab03/Binaries/lab3.jar) to get the code you will need to complete this lab.

The class Searchers provides implementations of linear, binary, and ternary searches. Review the Javadoc for the Searchers class and then study the code for the methods that implement the search algorithms. Ternary search is similar to binary search, but it splits the array into three areas instead of two. Do not proceed to the next step until you understand how the code in this class works.

Your first task to instrument the code. Instrumenting adds code to the Searchers class so that you can determine the number of compares made by a search method whenever it is invoked to search an array.

In the Searchers class you will find a static class variable, named numCompares. Use that variable for this task. The class also contains the methods, getNumCompares() and resetNumCompares(), which can be called to obtain or reset the class variable numCompares. Modify the code in the methods linearSearch(), binarySearch(), and ternarySearch() so that numCompares is incremented anytime two values are compared using compareTo().

The next part of this activity tests the changes made to the Searchers class. Do this by taking advantage of your knowledge of the performance of the various search algorithms implemented by this class. A linear search for a value that is not in a collection will take N compares to determine that the value is not in the collection. A binary search will take log2(N)+1 compares to determine that the value is not present. Finally, a ternary search will take log3(N)+1 compares.

You should test your instrumentation changes by writing a program that searches an array of a given size for an element that you know is not in the array and print the number of compares required to determine the element is not in the array. The value printed should agree with the theoretical performance of the algorithms.

Write a program named TestSearchers that tests the changes you made to the Searchers class. The main body of this program is a loop that creates a data set of a specific size, invokes each search method on the data and reports its result for the data set. The loop creates an array of a specified size and fills that array with Java Integer objects. Each Integer object should be double the value of the index for that value. Note you must create an array that holds references to Java Integer objects since the search methods expect the array to contain references to objects that are Comparable (an int is not a reference; it is a primitive type). The array is then searched for the value 3 using linearSearch(), binarySearch(), and finally with ternarySearch(). After each search returns, the method getNumCompares() is invoked to determine the number of compares made during the search (note that you will also need to invoke resetNumCompares() before invoking each search method).

Write the program so that it starts the test loop with an array of size 8 and doubles the size of the array each time through the loop. The program stops after it runs the tests on an array of size 65536. The output from your program must be formatted as shown below:


            N = 8, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 16, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 32, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 64, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 128, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 256, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 512, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 1024, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 2048, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 4096, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 8192, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 16384, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 32768, Linear == xxx, Binary == xxx, Ternary == xxx
            N = 65536, Linear == xxx, Binary == xxx, Ternary == xxx
            

where the xxx above will be replaced with the number of compares required to determine that the number was not in the array. Before you begin to write this program you should be able to predict what the output will be.

How To Submit

When your program is working correctly, and you are sure that it is producing the correct results, submit your work by executing the following command:

try grd-233 lab3-1 TestSearchers.java Searchers.java

Activity #2 : Build a simple spelling checker

In this activity you will implement a very simple spell checker.

A useful capability of a word processing system is the ability to check the spelling of the words in a document. The basic operation of any spell checker is to take each word in the document and search for that word in a list of correctly spelled words. If the word is not found, it is considered incorrect. The time that a spell checker takes to check all the words in a document is important to anyone using the spell checker program.

In the jar file that you downloaded is a file that contains the class Dictionary. This class provides the ability to create word lists from files that can then be used by a spell checker program to spell check a document. Review the Javadoc for the Dictionaryclass and then study the code that implements the class. Do not proceed until you understand what the class does and how it works.

For this activity write a program named Checker that uses the Dictionary class to spell check a text document. The program requires two command line arguments: the name of the file containing a list of correctly spelled words, and the name of the file containing the document to be checked. If there are too few or too many arguments, the program must produce the following output on the standard error stream and terminate without producing any additional output:

 Usage: java Checker dict-file file-to-check 

With the correct number of arguments, the program attempts to create a dictionary object using the file specified on the command line. If the dictionary object cannot be created (i.e., the constructor fails), the program must print the string "Checker: " followed by the message contained in the exception thrown by the constructor. This error message also must appear on the standard error stream. The program will then terminate without producing any additional output.

After creating the dictionary object, the program will open the document to be checked and read the words from the document one at a time. The program checks the spelling of each word by invoking the lookUp() method of the Dictionary object. If the word is not in the dictionary, the program decides it is spelled incorrectly and prints the word on standard output followed by a newline. (That is, the program prints misspelled words one per line). If a word is found in the dictionary, the program will not print it. Consequently, if all words in the document are found in the dictionary, the Checker program will produce no output other than the statistics described below.

If any errors occur during the processing of the document (e.g., the file cannot be opened or an I/O error occurs), the program must print the string "Checker: " followed by the message contained in the exception that caused the error on the standard error stream.

After checking the spelling of the whole document, the Checker program must print a blank line followed by the statistics shown below on the standard output stream:


Size of Dictionary: xxx
Words checked: xxx
Average probes: xxx
            

where the xxx will be replaced by the appropriate values for the document that you just checked. There are methods in the Dictionary class that will allow you to determine these statistics.

To read the file that you will be spell-checking, you should use a Scanner to naturally break lines into individual tokens (i.e. words). In addition, the spell-checking application must also ignore any punctuation that may appear. The easiest way to do this is to change the delimiter that the Scanner uses.

The delimiter defines what marks the end of a word. In this case, anything that is not an upper or lower case letter marks the end of the word. Calling the Scanner method useDelimiter("[^a-zA-Z]+"); on a Scanner object essentially says the delimiter is "one or more non-letters". (the String argument is a regular expression).

The directory ~vcss233/pub/lab03/Binaries (http://www.cs.rit.edu/~vcss233/pub/lab03/Binaries) contains some files that you can use with your spell checker program. The file words is a list of words derived from the dictionary file available on the UNIX systems that you can use as your dictionary file. The other files found in that directory contain some fairly famous speeches and literary works that you might want to use to check your spell checker. Don't be surprised if you find a lot of misspelled words in the "Tale of Two Cities". English just is not the language it used to be.

How To Submit

After you are sure that your spell checker is working correctly, submit your work by typing the following command:

try grd-233 lab3-2 Checker.java

Activity #3 : Improve search performance using an index

In this activity you will modify the Dictionary class to improve the performance of the spelling checker program.

When you need to look a person's name up in the phone book, you would not start at the exact center of the book. Instead, you look at the first letter of the last name and confine your search to the area of the phone book where you know that name will appear. The code for the lookUp() method in the Dictionary class uses a binary search to locate the word in the dictionary; if you look up a word that starts with an 'a', the search will begin in the middle of the dictionary. The performance of the method lookUp() would improve if the method limited its search to that part of the dictionary where the word was expected.

To limit the search for a word in a specific portion of the dictionary, the Dictionary class needs an index that tells it where in the array of words the words that start with 'a' begin, where the 'b' words begin, where the 'c' words begin, and so on. If this information were available, the method lookUp() could use the index to limit the binary search to that portion of the word list in which the word for which it is searching is located. (You may assume that the words file representing the dictionary contains at least one word beginning with each of the letters in the alphabet.)

You will have to make two changes to the Dictionary class. First, you must change the constructor so that it builds an index while it builds the dictionary from the word list file. The method lookUp() can use this index later.

The index is nothing more than an array whose position 0 tells where the 'a' words start, position 1 tells where the 'b' words start, and so on. The array will have a total of 26 positions, one for each of the letters in the alphabet. Because the word list file is in alphabetical order, it is relatively easy to determine where the 'c' words start, etc.

Recall that all characters in a computer are stored using a character code. A common character code is ASCII, another is Unicode http://www.unicode.org. A character code is simply a mapping of characters to numeric values. In Java you can determine the numeric value that is being used to represent a given character by using the static method getNumericValue() in the Java Character class. For example, if you are trying to store the position in the dictionary where the 'c' words are stored, the code below shows how to determine where this value should be stored in the index:

 int pos = Character.getNumericValue( 'c' ) - Character.getNumericValue( 'a' ); 

After changing the constructor (and testing it), the second change must make the lookUp() method use the index to limit the dictionary search. The new version of lookUp() must determine the range of the search based on the first letter of the target and the index. Use the appropriate method from the String class to obtain the first letter of the target and use your knowledge of the Character class and character codes to determine where to look in the index to determine the bounds of the search. Then using the second form of the binarySearch() method in the Searchers class (the one that allows you to specify a range), modify the parameters in its invocation to look for the target.

After you have finished making the changes to the Dictionary class, see if you improved the performance of your spelling checker program by running it with the new class. If you did things correctly you should see an improvement in the average number of probes reported by your program.

Note that even though the performance of binarySearch() has been improved, it is still O(log2(N)). The improvement is due primarily to the fact that n is smaller in the new version of the method.

How To Submit

After you are sure your Dictionary class is correct, submit the class by typing the following command:

try grd-233 lab3-3 Dictionary.java


Grade Computation

Grade Breakdown: