**Lecture 37 - Searching** (11/27/06)

- Looking in a collection of values for some specific value (where is the 17 in this array of `int`?).
- Looking for a value with a specific property (which object on the canvas contains the location where I clicked the mouse?).
- Looking for a record in a database (what is the tax history for the last four years for the taxpayer with SSN 101-11-1009?).
- Searching for text in some document or collection of documents (what web pages contain the text "searching and sorting algorithms"?).
- What known amino acid sequences best match this sequence gathered from proteins in the SARS virus?

We have done some searching this semester. Remember your `contains`
method for a `Scribble`?

    public boolean contains(Location point) {
        if (first.contains(point)) {
            return true;
        } else {
            return rest.contains(point);
        }
    }

We have to search through our collection of `Line`s that we call a
`Scribble` to see which one, if any, contains the point.

How do we know that we're done searching? Well, at any time, we have
access to the `Line` known as `first`. This `Line` might
contain the point, and we'd know that this `Scribble` contains the
point. There is no need to continue our search. If this `Line`
does not contain the point, it might be the case that one of the other
`Line`s in `rest` does. So if there are more lines, we see if
any of them contain the point with a recursive call. Or perhaps, we
have gotten to the end of the list and have checked every `Line`
and none contained the point. In that case, we also know we're done
and return `false`.
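The three cases above (found in `first`, keep looking in `rest`, reached the end) can be sketched with a small recursive list. This is only an illustration under assumptions: `Segment`, `ScribbleList`, `EmptyScribble`, `NonEmptyScribble`, and `ScribbleDemo` are hypothetical stand-ins for the course's `Line` and `Scribble` classes, and a "point" is simplified to a single `int` so that the example runs without a graphics library.

```java
// Hypothetical stand-in for Line: a 1-D segment from lo to hi (inclusive).
class Segment {
    private final int lo;
    private final int hi;

    Segment(int lo, int hi) {
        this.lo = lo;
        this.hi = hi;
    }

    boolean contains(int point) {
        return lo <= point && point <= hi;
    }
}

// Hypothetical stand-in for the Scribble list: either empty, or first + rest.
interface ScribbleList {
    boolean contains(int point);
}

class EmptyScribble implements ScribbleList {
    // End of the list: every Segment was checked and none matched.
    public boolean contains(int point) {
        return false;
    }
}

class NonEmptyScribble implements ScribbleList {
    private final Segment first;
    private final ScribbleList rest;

    NonEmptyScribble(Segment first, ScribbleList rest) {
        this.first = first;
        this.rest = rest;
    }

    public boolean contains(int point) {
        if (first.contains(point)) {
            return true;                 // found it, no need to keep searching
        } else {
            return rest.contains(point); // keep looking in the rest of the list
        }
    }
}

class ScribbleDemo {
    // Builds a three-segment list: [0,2], [5,7], [10,12].
    static ScribbleList sample() {
        return new NonEmptyScribble(new Segment(0, 2),
               new NonEmptyScribble(new Segment(5, 7),
               new NonEmptyScribble(new Segment(10, 12), new EmptyScribble())));
    }

    public static void main(String[] args) {
        ScribbleList scribble = sample();
        System.out.println(scribble.contains(6)); // true: second segment
        System.out.println(scribble.contains(3)); // false: reached the empty list
    }
}
```

In this sketch the "end of the list" case lives in the empty class, so the non-empty class never has to ask whether there are more segments.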

Let's try to get some idea of how much "work" it takes for us to get
an answer. As a rough estimate of work, we will count how many times
we call the `contains` method of a `Line`.

If our `Scribble` consists of *n* `Line`s, how many calls to
the `Line` `contains` method will we have to make before we
know the answer? It depends. If the `Scribble` does not
contain the point at all, we need to check all *n* `Line`s before
we know the answer. If the `Scribble` does contain the point, we
can stop as soon as we find the `Line` that contains the point.
It might be the first, it might be the last - we just don't know.
Assuming that there's an equal probability that the `Line` that
contains the point is at any of the *n* positions, we have to examine,
on average, about *n/2* `Line`s.
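As a short worked derivation of that average, assume (as above) that the matching `Line` is equally likely to be at each of the positions $1, 2, \ldots, n$, and that finding it at position $i$ costs $i$ calls to `contains`. The expected number of calls is then:

$$
\frac{1}{n}\sum_{i=1}^{n} i \;=\; \frac{1}{n}\cdot\frac{n(n+1)}{2} \;=\; \frac{n+1}{2} \;\approx\; \frac{n}{2}
$$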

In this case, we can't do any better. Perhaps we could if we were not
restricted by the fact that the list of `Line`s forces us to
examine the first, then the second, and so on. We can't jump right to
the last `Line`, since our recursive structure does not provide
access to it without first going through the whole list.

So let's think about searching in an array, where we have the option
to look at any element directly. We will consider an array of `int`, though most of what we discuss applies to a wider range of
"searchable" items.

A method to do this:

    /*
     * Search for num in array. Return the index of the number, or
     * -1 if it is not found.
     */
    int search(int[] array, int num) {
        for (int index = 0; index < array.length; index++) {
            if (array[index] == num) {
                return index;
            }
        }
        return -1;
    }

The procedure here is a lot like the search for a `Line` in a `Scribble`. We have no way of knowing that we're done until we either
find the number we're looking for, or until we get to the end of the
array. So again, if the array contains *n* numbers, we have to
examine all *n* in an unsuccessful search, and, on average,
about *n/2* for a successful search.
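To make these counts concrete, here is a minimal sketch that instruments the linear search to report how many elements it examined. The class name `LinearSearchCount` and its methods are hypothetical names chosen for this illustration, and the average assumes that every element is equally likely to be the one searched for.

```java
class LinearSearchCount {
    // Linear search that returns how many elements were examined
    // before the search ended (whether or not num was found).
    static int examined(int[] array, int num) {
        int count = 0;
        for (int index = 0; index < array.length; index++) {
            count++;
            if (array[index] == num) {
                return count;
            }
        }
        return count;
    }

    // Average number of elements examined when each element of the
    // array is searched for exactly once (assumes distinct values).
    static double averageSuccessful(int[] array) {
        long total = 0;
        for (int value : array) {
            total += examined(array, value);
        }
        return (double) total / array.length;
    }

    public static void main(String[] args) {
        int n = 1000;
        int[] array = new int[n];
        for (int i = 0; i < n; i++) {
            array[i] = 2 * i; // distinct values; linear search needs no ordering
        }
        System.out.println(averageSuccessful(array)); // 500.5, about n/2
        System.out.println(examined(array, -1));      // 1000, all n examined
    }
}
```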

Alternatively, we could use recursion instead of a for loop for the search:

    /*
     * Search for num in array recursively. Return the index of the
     * number, or -1 if it is not found.
     */
    int recSearch(int[] array, int num, int start) {
        if (start >= array.length) {
            return -1;
        } else if (array[start] == num) {
            return start;
        } else {
            return recSearch(array, num, start + 1);
        }
    }

Now, suppose the array has been sorted in ascending order.

Class demo: search for a number in an ordered array of numbers.

    /*
     * Search for num in a sorted array recursively. Return the index
     * of the number, or -1 if it is not found.
     */
    int recSearch(int[] array, int num, int start) {
        if (start >= array.length) {
            return -1;
        } else if (array[start] == num) {
            return start;
        } else if (array[start] > num) {
            return -1; // num will not appear in rest of array since it is sorted.
        } else {
            return recSearch(array, num, start + 1);
        }
    }

Well, we can do the same type of search - start at the beginning and
keep looking for the number. In the case of a successful search, we
still stop when we find it. But now, we can also determine that a
search is unsuccessful as soon as we encounter any number larger than
our search number. Assuming that our search number falls, on average,
near the median value of the array, an unsuccessful
search now requires that we examine, on average, about
*n/2* items. This sounds great, but in fact it is not a really
significant gain, as we will see.

These are all examples of a *linear search* - we examine items one at a time in some linear order
until we find the search item or until we can determine that we will
not find it.

But there is a better way. To get the intuition for the next way to search for a number, think back to your favorite number guessing game. I pick a number between 1 and 100 and you have to guess what it is. The game usually goes something like this:

- Me: Guess my number.
- You: 50.
- Me: Too High.
- You: 25.
- Me: Too Low.
- You: 37.
- Me: Too High.
- You: 31.
- Me: That's right.

If you know that there is an order - where do you start your search?
In the middle, since then even if you don't find it, you can look at
the value you found and see if the search item is smaller or larger.
From that, you can decide to look only in the bottom half of the array
or in the top half of the array. You could then do a linear search on
the appropriate half - or better yet - repeat the procedure and cut
the half in half, and so on. This is a *binary search*. It is an
example of a *divide and conquer* algorithm, because at each step,
it divides the problem in half.
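The guessing game itself can be sketched in a few lines. This is an illustrative sketch, not code from the course: `GuessingGame`, `guessesNeeded`, and `worstCase` are hypothetical names, and the guesser is assumed to always pick the middle of the range that is still possible.

```java
class GuessingGame {
    // Returns the number of guesses needed to find secret in [low, high],
    // always guessing the middle of the remaining range.
    // Returns -1 if secret is not in the range at all.
    static int guessesNeeded(int secret, int low, int high) {
        int guesses = 0;
        while (low <= high) {
            int guess = (low + high) / 2;
            guesses++;
            if (guess == secret) {
                return guesses;
            } else if (guess > secret) {
                high = guess - 1; // "Too High": discard the upper part
            } else {
                low = guess + 1;  // "Too Low": discard the lower part
            }
        }
        return -1;
    }

    // Largest number of guesses needed over every secret in [low, high].
    static int worstCase(int low, int high) {
        int worst = 0;
        for (int secret = low; secret <= high; secret++) {
            worst = Math.max(worst, guessesNeeded(secret, low, high));
        }
        return worst;
    }

    public static void main(String[] args) {
        // Guesses 50, 25, 37, 31, as in the dialogue above.
        System.out.println(guessesNeeded(31, 1, 100)); // 4
        System.out.println(worstCase(1, 100));         // 7
    }
}
```

With this strategy, no number between 1 and 100 takes more than 7 guesses.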

A Java method to do this:

    /*
     * Binary Search for num in array.
     */
    int search(int[] array, int num) {
        return binarySearch(array, num, 0, array.length - 1);
    }

    /*
     * Binary Search for num in array. Pass in the low and high
     * indices of the array for the range in which the number may
     * still occur.
     */
    int binarySearch(int[] array, int num, int low, int high) {
        if (low > high) {
            return -1;
        } else {
            int mid = (low + high) / 2;
            if (array[mid] == num) {
                // num is same as middle number
                return mid;
            } else if (num < array[mid]) {
                // num is smaller than middle number
                return binarySearch(array, num, low, mid - 1);
            } else {
                // num is larger than middle number
                return binarySearch(array, num, mid + 1, high);
            }
        }
    }
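The same idea can also be written with a loop instead of recursion. The following is a sketch of one possible variant, not the version used in class: the class name `IterativeBinarySearch` is hypothetical, and the midpoint is computed as `low + (high - low) / 2`, which gives the same index as `(low + high) / 2` but cannot overflow `int` when the indices are very large.

```java
class IterativeBinarySearch {
    // Returns the index of num in the sorted array, or -1 if it is not found.
    static int search(int[] array, int num) {
        int low = 0;
        int high = array.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2; // overflow-safe midpoint
            if (array[mid] == num) {
                return mid;
            } else if (num < array[mid]) {
                high = mid - 1; // num can only be in the lower part
            } else {
                low = mid + 1;  // num can only be in the upper part
            }
        }
        return -1; // the range became empty without finding num
    }

    public static void main(String[] args) {
        int[] sorted = {2, 3, 5, 7, 11, 13};
        System.out.println(search(sorted, 7)); // 3
        System.out.println(search(sorted, 4)); // -1
    }
}
```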

How many steps are needed for this?

- Each time, we cut the part of the array we still need to search in half.
- How many times can you divide a number in half before you get to 1?
- If you start with $n$, you divide to get $n/2$, then $n/4$, $n/8$, ... and eventually get 1.
- Let's suppose that $n = 2^k$; then we divide to get $2^{k-1}$, $2^{k-2}$, $2^{k-3}$, ..., $2^0 = 1$; we divided $k$ times by 2.
- In general, we can divide $n$ by 2 at most $\log_2 n$ times to get down to 1.

So how much better is this, really? In the case of a small array, the difference is not really significant. But as the size grows...

| Search / number of elements | 10 | 100 | 1000 | 1,000,000 |
|---|---|---|---|---|
| Linear search (*n/2*) | 5 | 50 | 500 | 500,000 |
| Binary search ($\log_2 n$, rounded up) | 4 | 7 | 10 | 20 |

That's a pretty huge difference as n increases.
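The binary search numbers in the table can be checked by counting how many elements a binary search actually examines. The sketch below is an illustration under assumptions: `ProbeCount`, `probes`, and `probesForMissing` are hypothetical names, the array simply holds 0 through *n* - 1, and the search is for a value larger than every element, which is a worst case for this code.

```java
class ProbeCount {
    // Binary search that returns how many array elements were examined.
    static int probes(int[] array, int num) {
        int low = 0;
        int high = array.length - 1;
        int count = 0;
        while (low <= high) {
            int mid = low + (high - low) / 2;
            count++;
            if (array[mid] == num) {
                return count;
            } else if (num < array[mid]) {
                high = mid - 1;
            } else {
                low = mid + 1;
            }
        }
        return count;
    }

    // Elements examined when searching a sorted array of size n
    // for a value larger than everything in it.
    static int probesForMissing(int n) {
        int[] array = new int[n];
        for (int i = 0; i < n; i++) {
            array[i] = i; // sorted: 0, 1, ..., n - 1
        }
        return probes(array, n); // n itself is not in the array
    }

    public static void main(String[] args) {
        int[] sizes = {10, 100, 1000, 1000000};
        for (int n : sizes) {
            // Prints n, the linear-search average n/2, and the binary-search count:
            // 4, 7, 10, and 20 examined elements for the four sizes.
            System.out.println(n + " " + (n / 2) + " " + probesForMissing(n));
        }
    }
}
```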

Demo.