CS 051 Lecture 37

CS 051 Fall 2012

Searching

We all know what searching is - looking for something. In a computer
program, the search could be:

Looking in a collection of values for some specific value
(where is the 17 in this array of int?).
Looking for a value with a specific property (which object
on the canvas contains the location where I clicked the mouse?).
Looking for a record in a database (what is the tax history
for the last four years for the taxpayer with SSN 101-11-1009?).
Searching for text in some document or collection of
documents (what web pages contain the text "searching and sorting
algorithms"?).
Looking for the closest matches rather than an exact one (what known
amino acid sequences best match this sequence gathered from proteins in
the SARS virus?).

Linear Search

Let's think about searching in an array, where we have the option
to look at any element directly. We will consider an array of ints,
though most of what we discuss applies to a wider range of
"searchable" items.

An iterative method to do this:

  /*
   * Search for num in array.  Return the index of the number, or
   * -1 if it is not found.
   */
    int search(int[] array, int num) {
        for (int index = 0; index < array.length; index++) {
            if (array[index] == num) {
                return index;
            }
        }
        return -1;
    }

An equivalent recursive method would look like the following (we would
call it with a start value of 0):

  /*
   * Search for num in array recursively. Return the index of the
   * number, or -1 if it is not found.
   */
    int recSearch(int[] array, int num, int start) {
        if (start >= array.length) {
            return -1;
        } else if (array[start] == num) {
            return start;
        } else {
            return recSearch(array, num, start + 1);
        }
    }
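
As a small usage sketch, one way to avoid passing the start value by hand is to wrap the recursive method. The wrapper name find and the sample array below are assumptions made for illustration; they are not part of the lecture code.

```java
class RecSearchWrapperDemo {
    // Recursive linear search from the notes: returns the index of num
    // at or after start, or -1 if it is not found.
    static int recSearch(int[] array, int num, int start) {
        if (start >= array.length) {
            return -1;
        } else if (array[start] == num) {
            return start;
        } else {
            return recSearch(array, num, start + 1);
        }
    }

    // Hypothetical convenience wrapper (not in the notes): always starts at index 0.
    static int find(int[] array, int num) {
        return recSearch(array, num, 0);
    }

    public static void main(String[] args) {
        int[] values = {42, 8, 17, 23, 4};     // arbitrary unsorted sample data
        System.out.println(find(values, 17));  // prints 2
        System.out.println(find(values, 99));  // prints -1
    }
}
```

The caller then only supplies the array and the value, and the start parameter stays an internal detail of the recursion.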

Let's try to get some idea of how much "work" it takes the computer to get an
answer. We will use the recursive method above for our discussion. Since each
recursive call to the method consists of one comparison of the search value
to a value in the array, we can use the total number of calls as a rough estimate
of the amount of work done. (It should be straightforward to apply the same
logic to the iterative method above by counting the number of iterations through the
for loop.) We will assume an array of arbitrary size containing n
values, and that those values are not stored in any particular order.

The "best" case for our search, in terms of the amount of work the computer does,
is for the value we are searching for to be stored in the array at index 0.
Then, it is found after only 1 call to recSearch. The "worst" case
for our search is for the value not to exist in the array. Then, we will call
recSearch once for each value in the array, plus one more time to
recognize that we have run off the end of the array. Since there are n
values in the array, this is n + 1 calls to the method.

In reality, most searches will end up somewhere between these best-case and
worst-case situations. Some values will be found closer to the beginning of the
array, while others will be found closer to the end. If we assume that all
values have about an equal chance of being searched for, then we can expect
to walk about half way down the array each time.
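
To make "about half way" concrete, here is a minimal sketch that counts the calls recSearch would make and averages them over one search for every value in the array. The names countCalls and averageCalls are hypothetical helpers introduced for this illustration, and the array simply holds the distinct values 0 to n - 1 (the order does not affect the average when every value is searched for once).

```java
class LinearSearchCost {
    // Counts how many calls the recursive linear search makes for this search.
    // Each call does one check, so the count mirrors recSearch exactly.
    static int countCalls(int[] array, int num, int start) {
        if (start >= array.length) {
            return 1;                       // the call that runs off the end
        } else if (array[start] == num) {
            return 1;                       // the call that finds the value
        } else {
            return 1 + countCalls(array, num, start + 1);
        }
    }

    // Average number of calls when each of the n stored values is searched for once.
    static double averageCalls(int n) {
        int[] array = new int[n];
        for (int i = 0; i < n; i++) {
            array[i] = i;                   // n distinct values
        }
        long total = 0;
        for (int target = 0; target < n; target++) {
            total += countCalls(array, target, 0);
        }
        return (double) total / n;
    }

    public static void main(String[] args) {
        System.out.println(averageCalls(100));                  // prints 50.5
        System.out.println(countCalls(new int[100], -1, 0));    // prints 101
    }
}
```

The average works out to (n + 1) / 2 calls for a value that is present, which is the "half way down the array" estimate, while a missing value costs the full n + 1 calls.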

One question that arises: does it help if the values in the array are ordered?
We'll begin with a naive attempt to improve our search by modifying the
"linear" search we've done above. If we know the values are stored in sorted
order, then we can stop searching once we reach a value larger than the one we
are searching for. (This code is left as an exercise for the student, and
should be possible by adding a single "else-if" clause to either of the above
methods.) Now, searches for values that are not in the array should average about
n/2 calls to recSearch, assuming that the "next larger" value
that is in the array is evenly distributed throughout the array. However, this
does not significantly improve our results. Our best case is still finding a
value in the first location. Our worst case is essentially unchanged: a value
in the last location takes n calls, and a value larger than everything in the
array still runs off the end after n + 1 calls. In other words, nearly the same.

Binary Search

There is a significantly better approach to searching for values in a sorted
array. To understand the intuition behind this approach, think back to your favorite
number guessing game. I pick a number between 1 and 100 and you have to guess
what it is. The game usually goes something like this:

Me: Guess my number.
You: 50.
Me: Too High.
You: 25.
Me: Too Low.
You: 37.
Me: Too High.
You: 31.
Me: That's right.

If you know that there is an order, and each value is equally likely, where
do you start your search? In the middle, since this allows you to eliminate
half of the remaining values with each incorrect guess. (Of course, a correct
guess ends the search!) For example, if you guess 50, and you learn that the
answer is lower than 50, you never have to guess 51, 52, 53, and so on. There
are only 49 values that could still be the correct answer.

Of course, we can repeat this procedure on the remaining list, as we did in the
game above. A Java method for binary search would look like this:

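    /*
     * Binary Search for num in array.  The array must be sorted in
     * ascending order.  Return the index of the number, or -1 if it
     * is not found.
     */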
    int search(int[] array, int num) {
        return binarySearch(array, num, 0, array.length - 1);
    }


    /*
     * Binary Search for num in array.  Pass in the low and high 
     * indices of the array for the range in which the number may
     * still occur.
     */
    int binarySearch(int[] array, int num, int low, int high) {
        if (low > high) {
            return -1;
        } else {
            // same midpoint as (low + high) / 2, but cannot overflow int
            int mid = low + (high - low) / 2;
            if (array[mid] == num) {
                // num is same as middle number
                return mid;
            } else if (num < array[mid]) {
                // num is smaller than middle number
                return binarySearch(array, num, low, mid - 1);
            } else {
                // num is larger than middle number
                return binarySearch(array, num, mid + 1, high);
            }
        }
    }
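
The same idea can also be written with a loop instead of recursion. The following is a minimal sketch, not the version used in class: the class name and sample array are made up for illustration, and the array is assumed to be sorted in ascending order.

```java
class IterativeBinarySearch {
    // Iterative binary search: returns the index of num in a sorted array, or -1.
    static int binarySearch(int[] array, int num) {
        int low = 0;
        int high = array.length - 1;
        while (low <= high) {
            // Midpoint of the remaining range; written this way so that
            // the arithmetic cannot overflow for very large arrays.
            int mid = low + (high - low) / 2;
            if (array[mid] == num) {
                return mid;              // found it
            } else if (num < array[mid]) {
                high = mid - 1;          // keep only the lower half
            } else {
                low = mid + 1;           // keep only the upper half
            }
        }
        return -1;                       // range is empty: num is not present
    }

    public static void main(String[] args) {
        int[] sorted = {2, 5, 8, 13, 21, 34, 55};   // sample sorted data
        System.out.println(binarySearch(sorted, 13)); // prints 3
        System.out.println(binarySearch(sorted, 4));  // prints -1
    }
}
```

Each pass through the loop plays the role of one recursive call: low and high shrink the range in exactly the same way as the arguments passed to the recursive method.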

How much work are we doing for such a search? In the best case, we find the
value we are searching for with one call to binarySearch (i.e., the
value is in the middle location). The worst case is when the value is not in the
array. To figure out how many steps this will take, consider the following. We
are starting with an array of some arbitrary size n. For each guess, we
eliminate about half the remaining values from consideration. If the value we
are searching for is not in the list, we will eventually get down to an array of
size 1. Once that last value has been rejected, the next call to binarySearch
will return -1 because low is greater than high.

To simplify this a little, let's consider the case where the list size is
given as n = 2^k, and we eliminate exactly half of the remaining
values with each incorrect guess.

Then,

After one incorrect guess, the list size is n/2 = 2^(k-1)
After two incorrect guesses, the list size is n/4 = 2^(k-2)
After three incorrect guesses, the list size is n/8 = 2^(k-3)
...

Thus, it will take k guesses to reach a list of size 1. To solve for k
in our original equation, n = 2^k, we take the log (base 2) of both sides.

    log₂(n) = log₂(2^k) = k

So, the amount of work done by binary search is about log₂(n). To
see how this compares to linear search, let's look at how many calls we would
make to our respective recursive methods for each approach.

Number of elements (n)                 10     100    1,000    1,000,000
Linear search (n/2)                     5      50      500      500,000
Binary search (log₂ n, rounded up)      4       7       10           20

In the case of a small array, the difference is not really significant. But as
the size grows, the difference becomes enormous. Binary search is the clear
winner for larger arrays!
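
As a rough check of the numbers in the table, here is a small sketch that counts how many times a range of n values can be cut in half before only one value remains, and prints it next to n/2. The helper name halvings is hypothetical, and rounding each half up is an assumption chosen so the count matches log₂(n) rounded up.

```java
class SearchCostTable {
    // Number of times a range of n values can be halved (rounding up)
    // before a single value remains.
    static int halvings(int n) {
        int steps = 0;
        while (n > 1) {
            n = (n + 1) / 2;   // keep the larger half
            steps++;
        }
        return steps;
    }

    public static void main(String[] args) {
        int[] sizes = {10, 100, 1000, 1000000};
        for (int n : sizes) {
            // columns: n, estimated linear search cost, estimated binary search cost
            System.out.println(n + "\t" + (n / 2) + "\t" + halvings(n));
        }
    }
}
```

Multiplying the array size by 1,000 (from 1,000 to 1,000,000) adds only 10 more halvings, while the linear estimate grows by the full factor of 1,000.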

See the Searching and Sorting Demo we used in class to give a graphical visualization
of our searching methods.