Lecture 37 - Searching
11/27/06

We all know what searching is - looking for something. In a computer program, the search could be:

Looking in a collection of values for some specific value (where is the 17 in this array of int?).
Looking for a value with a specific property (which object on the canvas contains the location where I clicked the mouse?).
Looking for a record in a database (what is the tax history for the last four years for the taxpayer with SSN 101-11-1009?).
Searching for text in some document or collection of documents (what web pages contain the text "searching and sorting algorithms"?).
Searching for the best match to a pattern (what known amino acid sequences best match this sequence gathered from proteins in the SARS virus?).

We have done some searching this semester. Remember your contains method for a Scribble?

  public boolean contains(Location point) {
    if (first.contains(point)) {
      return true;
    } else {
      return rest.contains(point);
    }
  }

We have to search through our collection of Lines that we call a Scribble to see which one, if any, contains the point.

How do we know that we're done searching? Well, at any time, we have access to the Line known as first. This Line might contain the point, and we'd know that this Scribble contains the point. There is no need to continue our search. If this Line does not contain the point, it might be the case that one of the other Lines in rest does. So if there are more lines, we see if any of them contain the point with a recursive call. Or perhaps, we have gotten to the end of the list and have checked every Line and none contained the point. In that case, we also know we're done and return false.
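To make the three cases concrete, here is a minimal, self-contained sketch of the same recursive structure. It is not the course's Scribble code: the names IntList, EmptyIntList and NonEmptyIntList are hypothetical, and each Line is replaced by a plain int so the example can run on its own. The empty list plays the role of "we have checked everything" and returns false.

```java
// Hypothetical stand-ins for the Scribble classes: each "line" is just an int
// here so that the sketch is self-contained.
interface IntList {
    boolean contains(int target);
}

// The empty list: there is nothing left to check, so the search fails.
class EmptyIntList implements IntList {
    public boolean contains(int target) {
        return false;
    }
}

// A non-empty list: one element (first) followed by the remaining list (rest).
class NonEmptyIntList implements IntList {
    private final int first;
    private final IntList rest;

    NonEmptyIntList(int first, IntList rest) {
        this.first = first;
        this.rest = rest;
    }

    public boolean contains(int target) {
        if (first == target) {
            return true;                   // found it: no need to look further
        } else {
            return rest.contains(target);  // keep looking in the rest of the list
        }
    }
}

class IntListDemo {
    public static void main(String[] args) {
        IntList list = new NonEmptyIntList(4,
                new NonEmptyIntList(17,
                        new NonEmptyIntList(9, new EmptyIntList())));
        System.out.println(list.contains(17));
        System.out.println(list.contains(5));
    }
}
```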

Let's try to get some idea of how much "work" it takes for us to get an answer. As a rough estimate of work, we will count how many times we call the contains method of a Line.

If our Scribble consists of n Lines, how many calls to the Line contains method will we have to make before we know the answer? It depends. If the Scribble does not contain the point at all, we need to check all n Lines before we know the answer. If the Scribble does contain the point, we can stop as soon as we find the Line that contains the point. It might be the first, it might be the last - we just don't know. Assuming that there's an equal probability that the Line that contains the point is at any of the n positions, we have to examine, on average, about n/2 Lines.

In this case, we can't do any better. We might be able to if the list of Lines did not force us to examine the first, then the second, and so on. But we can't jump right to the last Line, since our recursive structure does not provide access to it without first going through the whole list.

So let's think about searching in an array, where we have the option to look at any element directly. We will consider an array of int, though most of what we discuss applies to a wider range of "searchable" items.

A method to do this:

  /*
   * Search for num in array.  Return the index of the number, or
   * -1 if it is not found.
   */
  int search(int[] array, int num) {
    for (int index = 0; index < array.length; index++) {
      if (array[index] == num) {
        return index;
      }
    }
    return -1;
  }

The procedure here is a lot like the search for a Line in a Scribble. We have no way of knowing that we're done until we either find the number we're looking for, or until we get to the end of the array. So again, if the array contains n numbers, we have to examine all n in an unsuccessful search, and, on average, about n/2 for a successful search.
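One way to check these counts is to instrument the search so that it reports how many elements it examined. The sketch below is only an illustration: the class and method names are made up for this example, and it assumes a non-empty array of distinct values. Under that assumption the successful-search average works out to exactly (n+1)/2, which is the "about n/2" above.

```java
class LinearSearchCount {
    // Returns how many array elements a linear search examines while
    // looking for num (whether or not num is actually present).
    static int comparisons(int[] array, int num) {
        int count = 0;
        for (int index = 0; index < array.length; index++) {
            count++;                      // we examined array[index]
            if (array[index] == num) {
                break;                    // successful search: stop here
            }
        }
        return count;
    }

    // Average number of elements examined over one successful search for
    // each element of the array. Assumes a non-empty array of distinct values.
    static double averageSuccessful(int[] array) {
        long total = 0;
        for (int value : array) {
            total += comparisons(array, value);
        }
        return (double) total / array.length;
    }

    public static void main(String[] args) {
        int[] array = new int[100];
        for (int i = 0; i < array.length; i++) {
            array[i] = i;
        }
        System.out.println("unsuccessful: " + comparisons(array, -1));
        System.out.println("average successful: " + averageSuccessful(array));
    }
}
```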

Alternatively, we could use recursion instead of a for loop for the search:

  /*
   * Search for num in array recursively. Return the index of the
   * number, or -1 if it is not found.
   */
  int recSearch(int[] array, int num, int start) {
    if (start >= array.length) {
      return -1;
    } else if (array[start] == num) {
      return start;
    } else {
      return recSearch(array, num, start + 1);
    }
  }

Now, suppose the array has been sorted in ascending order.

Class demo: search for a number in an ordered array of numbers.

  /*
   * Search for num in a sorted array recursively. Return the index
   * of the number, or -1 if it is not found.
   */
  int recSearch(int[] array, int num, int start) {
    if (start >= array.length) {
      return -1;
    } else if (array[start] == num) {
      return start;
    } else if (array[start] > num) {
      return -1;  // num will not appear in rest of array since it is sorted.
    } else {
      return recSearch(array, num, start + 1);
    }
  }

Well, we can do the same type of search - start at the beginning and keep looking for the number. In the case of a successful search, we still stop when we find it. But now, we can also determine that a search is unsuccessful as soon as we encounter any number larger than our search number. Assuming that our search number falls, on average, near the median value of the array, an unsuccessful search now requires that we examine, on average, about n/2 items. This sounds great, but in fact it is not a really significant gain, as we will see. These are all examples of a linear search - we examine items one at a time in some linear order until we find the search item or until we can determine that we will not find it.
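The same early-exit idea can also be written with a loop. The following is a sketch rather than part of the class code: the class name and the examined helper are invented for this example, and both methods assume the array is already sorted in ascending order.

```java
class SortedLinearSearch {
    // Linear search in an array assumed to be sorted in ascending order.
    // Returns the index of num, or -1 if it is not found.
    static int search(int[] array, int num) {
        for (int index = 0; index < array.length; index++) {
            if (array[index] == num) {
                return index;
            } else if (array[index] > num) {
                return -1;  // everything from here on is larger than num
            }
        }
        return -1;
    }

    // Returns how many elements the search above examines for num.
    static int examined(int[] array, int num) {
        int count = 0;
        for (int index = 0; index < array.length; index++) {
            count++;
            if (array[index] >= num) {
                break;      // either found num or passed where it would be
            }
        }
        return count;
    }

    public static void main(String[] args) {
        int[] sorted = {2, 5, 8, 13, 21};
        System.out.println(search(sorted, 8));
        System.out.println(search(sorted, 6));
        System.out.println(examined(sorted, 6));
    }
}
```

Searching this five-element array for 6 stops at the 8, after examining three elements instead of all five.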

But there is a better way. To get the intuition for the next way to search for a number, think back to your favorite number guessing game. I pick a number between 1 and 100 and you have to guess what it is. The game usually goes something like this:

Me: Guess my number.
You: 50.
Me: Too High.
You: 25.
Me: Too Low.
You: 37.
Me: Too High.
You: 31.
Me: That's right.

If you know that there is an order - where do you start your search? In the middle, since then even if you don't find it, you can look at the value you found and see if the search item is smaller or larger. From that, you can decide to look only in the bottom half of the array or in the top half of the array. You could then do a linear search on the appropriate half - or better yet - repeat the procedure and cut the half in half, and so on. This is a binary search. It is an example of a divide and conquer algorithm, because at each step, it divides the problem in half.
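The guessing game can be played by a short program that always guesses the middle of the range that is still possible. This is a sketch with made-up names (GuessingGame, guesses), not code from the class; it assumes the secret number lies between low and high. With a range of 1 to 100 and a secret number of 31, it makes the same four guesses as the conversation above.

```java
import java.util.ArrayList;
import java.util.List;

class GuessingGame {
    // Returns the sequence of guesses made when always guessing the middle
    // of the remaining range [low, high]. Assumes low <= secret <= high.
    static List<Integer> guesses(int low, int high, int secret) {
        List<Integer> result = new ArrayList<>();
        while (low <= high) {
            int guess = (low + high) / 2;
            result.add(guess);
            if (guess == secret) {
                break;              // "That's right."
            } else if (guess > secret) {
                high = guess - 1;   // "Too High": discard the top half
            } else {
                low = guess + 1;    // "Too Low": discard the bottom half
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(guesses(1, 100, 31));  // prints [50, 25, 37, 31]
    }
}
```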

A Java method to do this:

  /*
   * Binary Search for num in a sorted array.  Return the index of
   * the number, or -1 if it is not found.
   */
  int search(int[] array, int num) {
    return binarySearch(array, num, 0, array.length - 1);
  }


  /*
   * Binary Search for num in array.  Pass in the low and high 
   * indices of the array for the range in which the number may
   * still occur.
   */
  int binarySearch(int[] array, int num, int low, int high) {
    if (low > high) {
      return -1;
    } else {
      // Same value as (low + high) / 2, but cannot overflow for very large arrays.
      int mid = low + (high - low) / 2;
      if (array[mid] == num) {
        // num is same as middle number
        return mid;
      } else if (num < array[mid]) {
        // num is smaller than middle number
        return binarySearch(array, num, low, mid - 1);
      } else {
        // num is larger than middle number
        return binarySearch(array, num, mid + 1, high);
      }
    }
  }
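Just as the linear search can be written with either a loop or recursion, so can the binary search. Here is one possible loop version, offered as a sketch: the class and method names are invented for this example, and like the recursive version it assumes the array is sorted in ascending order.

```java
class IterativeBinarySearch {
    // Binary search for num in an array assumed to be sorted in ascending
    // order. Returns the index of num, or -1 if it is not found.
    static int search(int[] array, int num) {
        int low = 0;
        int high = array.length - 1;
        while (low <= high) {
            // Middle of the remaining range, written so the sum cannot overflow.
            int mid = low + (high - low) / 2;
            if (array[mid] == num) {
                return mid;
            } else if (num < array[mid]) {
                high = mid - 1;   // num can only be in the lower half
            } else {
                low = mid + 1;    // num can only be in the upper half
            }
        }
        return -1;                // the range is empty: num is not there
    }

    public static void main(String[] args) {
        int[] sorted = {2, 5, 8, 13, 21, 34};
        System.out.println(search(sorted, 13));
        System.out.println(search(sorted, 7));
    }
}
```

The Java library also provides java.util.Arrays.binarySearch for sorted arrays. Note that when the value is missing it returns a negative number that encodes where the value would be inserted, rather than always returning -1.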

How many steps are needed for this?

Each time, we cut the part of the array we still need to search in half.
How many times can you divide a number in half before you get to 1?
If you start with n, you divide to get n/2, then n/4, n/8, ... and eventually get to 1.
Let's suppose that n = 2^k. Then we divide to get 2^(k-1), 2^(k-2), 2^(k-3), ..., 2^0 = 1, so we divide by 2 a total of k times.
In general, we can divide n by 2 at most log₂ n times to get down to 1.

So how much better is this, really? In the case of a small array, the difference is not really significant. But as the size grows...

Number of elements (n)	10	100	1000	1,000,000
Linear search: n/2	5	50	500	500,000
Binary search: log₂ n, rounded up	4	7	10	20

That's a pretty huge difference as n increases.
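The numbers in the table can be recomputed with a few lines of code. This is a sketch with invented names (SearchCosts, binarySearchProbes). It counts how many times a range of n elements can be cut in half before nothing is left, which is the largest number of elements a binary search could examine, and compares that with n/2 for the linear search.

```java
class SearchCosts {
    // Largest number of elements a binary search examines in an array of
    // n elements: each probe leaves at most half of the range to search.
    static int binarySearchProbes(int n) {
        int probes = 0;
        while (n > 0) {
            n = n / 2;   // at most half of the range remains after a probe
            probes++;
        }
        return probes;
    }

    public static void main(String[] args) {
        int[] sizes = {10, 100, 1000, 1000000};
        for (int n : sizes) {
            System.out.println(n + " elements: linear about " + (n / 2)
                    + ", binary at most " + binarySearchProbes(n));
        }
    }
}
```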

Demo: Searching and Sorting