CS62 - Spring 2010 - Lecture 18

  • found watch

  • last lecture in java!

  • hashtables overview
       - U is our universe of keys
       - want to mimic having an array/table with |U| entries, but have it be smaller, of size m
          - this would give us constant time adding and searching
       - use a hash function (h) to map from U to an entry in the table
       - if m < |U| (which is the basic idea), we're going to have "collisions", where h(x) = h(y) even though x != y, that is, two keys hash to the same location in our table even though they're not equal
       - look at Set interface in Hashtables code

  • collision resolution by chaining
       - a common approach is to allow multiple items to occupy a given entry in our array. How?
          - rather than just having the item stored at the entry, store a linked list
       - put: if two items hash to the same location in the array, just add them to the linked list
       - contains: search the linked list stored at that entry to see if the item being searched for is there
       - walk through an example
          - let h(x) = 1, h(y) = 2, h(z) = 1
          - insert x, y and z in to the table
          - now search for z
       - show ChainedHashtable class in Hashtables code (a rough sketch appears at the end of this section)
       - what is the run-time of the put and containsKey methods?
          - put: O(1), we just have to add it to the beginning or end of a linked list
          - search: O(length of the linked list)
             - worst case: all of the data hashes to the same entry, e.g. h(x) = 1
                - search: O(n)
             - average case: depends on how well the hash function distributes the items over the table
                - to analyze, we'll make the "simple uniform hashing" assumption, which is that an item could be hashed to any entry equally likely
                 - let's say I roll a 10-sided die: how likely is it to see any particular value on a given roll?
                    - 1/10
                    - what about if I roll it 100 times, how many times do I expect to see each value?
                       - 100/10 = 10 times per value on average
              - the hashtable is similar: if we have m entries in the table and n items that we've added, what is the expected (average) length of any linked list, i.e. the number of things hashed to that entry under simple uniform hashing?
                - n/m (like n rolls of an m sided die)
                - this value (n/m) has a special name for hashtables and is called the "load" of the hashtable (often written as alpha)
             - search: O(1+alpha), on average
                 - why the 1?
                    - we always pay for computing the hash and examining that entry, even when the list is empty
             - Exercise 15.9: When 23 randomly selected people are brought together, chances are greater than 50 percent that two have the same birthday. What does this tell us about uniformly distributed hash codes for keys in a hash table?
                - m = 365, we're hashing people based on their birthday
                - n = 23
                - load = 23/365, which is still quite low
                 - yet there is still a greater than 50% chance of a collision
                - hash table size is important and collisions are very likely, even for small amounts of data
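                 - a quick check of the 50% figure (the probability that all 23 birthdays are distinct):

                      public class Birthday {
                         public static void main(String[] args) {
                            // P(no collision) = 365/365 * 364/365 * ... * 343/365
                            double pNoCollision = 1.0;
                            for (int i = 0; i < 23; i++) {
                               pNoCollision *= (365.0 - i) / 365.0;
                            }
                            System.out.println(1 - pNoCollision);   // about 0.507
                         }
                      }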

       - benefits and drawbacks?
          - pros
              - since we use linked lists, there is no limit to the number of items we can store in the table
             - it's very straightforward to implement
          - cons
             - as the load gets high, our run-time will degrade
             - there is a lot of overhead involved with storing the linked lists
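        - a sketch of what such a class might look like (a reconstruction for illustration, not the actual ChainedHashtable from the Hashtables code):

             import java.util.LinkedList;

             // sketch of a chained hashtable: each table entry holds a
             // linked list of the items that hashed there
             public class ChainedHashtable<E> {
                private LinkedList<E>[] table;

                @SuppressWarnings("unchecked")
                public ChainedHashtable(int m) {
                   table = (LinkedList<E>[]) new LinkedList[m];
                   for (int i = 0; i < m; i++) {
                      table[i] = new LinkedList<E>();
                   }
                }

                // h(x): mask off the sign bit (hashCode can be negative
                // in Java), then take it modulo the table size
                private int hash(E item) {
                   return (item.hashCode() & 0x7fffffff) % table.length;
                }

                // put: O(1), just add to the front of that entry's list
                public void put(E item) {
                   table[hash(item)].addFirst(item);
                }

                // contains: O(length of the linked list at that entry)
                public boolean contains(E item) {
                   return table[hash(item)].contains(item);
                }
             }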

  • open addressing
       - because of some of the cons above (in particular the overhead), we often only want to use a basic array to store the hashtable
       - we still have to do something about collisions... ideas?
       - when we have a collision and there's already an item at that location, we need to find another possible place to put it
       - for open addressing, we must define a "probe sequence" that determines where to look in the table next if we have a collision
           - if h(x) is the hash function, the probe sequence is often written as h(x, i), that is, the ith place to look (counting from 0) when the previous i locations were all full
             - h(x, 0) is the first place to check
             - h(x, 1) the next
             - and so on
          - notice that this is defined by the hash function, so it could be different for different items, etc.
          - the probe sequence must be a permutation of all of the entries in the table, that is, if we look at h(x, 0), h(x, 1), ..., h(x, m-1), these values will be a permutation of 0, 1, ..., m-1
             - why?
       - inserting
          - given this, how can we insert items into the table?
             - start at probe sequence 0, if it's empty put the item there
             - if it's full, go on to 1, etc.
             - note that we can actually fill up the table here
       - contains
          - what do we need to check here?
             - again, start at probe 0
                - see if there's something there AND see if the item is equal to the item we're actually looking for
             - if not, keep looking
             - when do we stop?
                - when we find an empty entry
       - look at OpenAddressedHashtable class in Hashtables code
          - what is the "put" method doing?
          - write the "contains" method
          - notice that the class is abstract since we haven't defined what the probe sequence will be
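        - one way this might look in code; probe(item, i) plays the role of h(x, i) and is left abstract (an illustration, not the actual OpenAddressedHashtable class):

             // sketch of an open-addressed hashtable with an abstract
             // probe sequence h(x, i)
             public abstract class OpenAddressedHashtable<E> {
                protected Object[] table;

                public OpenAddressedHashtable(int m) {
                   table = new Object[m];
                }

                // h(x, i): the ith entry to check for this item
                protected abstract int probe(E item, int i);

                // put: follow the probe sequence until an empty entry
                public void put(E item) {
                   for (int i = 0; i < table.length; i++) {
                      int entry = probe(item, i);
                      if (table[entry] == null) {
                         table[entry] = item;
                         return;
                      }
                   }
                   // the table really can fill up
                   throw new IllegalStateException("table is full");
                }

                // contains: stop at an empty entry; if the item were in
                // the table, it would have been placed there
                public boolean contains(E item) {
                   for (int i = 0; i < table.length; i++) {
                      int entry = probe(item, i);
                      if (table[entry] == null) {
                         return false;
                      }
                      if (table[entry].equals(item)) {
                         return true;
                      }
                   }
                   return false;   // probed every entry without finding it
                }
             }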

  • probe sequences
       - our one requirement is that the probe sequence must visit every entry in the table
       - ideas?
       - linear probing
          - easiest to understand:
             - h(k, i) = (h(k) + i) % m
             - just look at the next location in the hash table
              - if the original hash function says to look at location j and it's full, then we look at j+1, j+2, j+3, ...
             - need to modulo the size of the table to wrap around
          - look at LinearAddressedHashtable class in Hashtables code
          - problems?
             - "primary clustering"
             - you tend to get very long sequences of things clustered together
             - show an example
       - double hashing
           - h(x, i) = (h(x) + i*h2(x)) % m
          - unlike linear, where the offset is constant, the offset this time is another hash of the data
          - avoids primary clustering
          - what is the challenge?
              - the probe sequence must be a permutation of the table entries
              - h2(x) must be nonzero and relatively prime to m so that the sequence visits every entry (e.g., make m prime)
           - most commonly used in real implementations
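        - sketches of both probe sequences as subclasses of the open-addressed table sketched earlier (illustrative class names, not the actual course code):

             // linear probing: h(k, i) = (h(k) + i) % m
             public class LinearHashtable<E> extends OpenAddressedHashtable<E> {
                public LinearHashtable(int m) {
                   super(m);
                }

                protected int probe(E item, int i) {
                   int h = (item.hashCode() & 0x7fffffff) % table.length;
                   return (h + i) % table.length;
                }
             }

             // double hashing: h(x, i) = (h(x) + i*h2(x)) % m, with h2 a
             // second hash; since 1 <= h2(x) <= m-1, choosing m prime makes
             // h2(x) relatively prime to m, so the sequence visits every
             // entry (assumes m >= 2)
             public class DoubleHashtable<E> extends OpenAddressedHashtable<E> {
                public DoubleHashtable(int m) {
                   super(m);
                }

                protected int probe(E item, int i) {
                   int code = item.hashCode() & 0x7fffffff;
                   int h = code % table.length;
                   int h2 = 1 + code % (table.length - 1);
                   // the (long) cast avoids int overflow when i*h2 is large
                   return (int) ((h + (long) i * h2) % table.length);
                }
             }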

  • running time for open addressing
       - what is the run-time for contains for open addressing?
       - again, assume an ideal hash function where each original location is equally likely and also each probe is equally likely
        - assuming n items in a table with m entries (i.e. a load of alpha = n/m)
          - what is the probability that the first place we look is occupied?
             - alpha
          - given the first was occupied, what is the probability that the second place we look is occupied?
             - alpha (actually, (n-1)/(m-1), but almost alpha :)
           - what is the probability that we need a third probe?
              - both the first and the second positions were occupied
              - alpha * alpha = alpha^2 (approximately)
           - so, what is the expected number of probes before we find an open one?
              - it's the sum over i of the probability that we need an ith probe
              - 1 + alpha + alpha^2 + alpha^3 + ...
              - which is bounded by: 1/(1-alpha)
       - how does this help us with our run-time?
          - the run-time is bounded by the number of probes we have to make
          - to insert, we need to find an open entry, what is the running time?
             O(1 + 1/(1-alpha))
          - for contains, we may have to search until we find an open entry, what's the running time?
             O(1 + 1/(1-alpha))
       - what does this translate to search-wise?
           alpha   average number of searches
           0.10    1.11
           0.25    1.33
           0.50    2
           0.75    4
           0.90    10
           0.95    20
           0.99    100
          (note that these are ideal case numbers)
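        - a quick check that the numbers above are just 1/(1-alpha):

             // prints the table above: expected probes = 1/(1-alpha)
             public class LoadTable {
                public static void main(String[] args) {
                   double[] loads = {0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99};
                   for (double alpha : loads) {
                      System.out.printf("%.2f   %.2f%n", alpha, 1 / (1 - alpha));
                   }
                }
             }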

  • deleting in open addressing
       - what is the challenge with deleting in open addressing?
           - let's use linear probing (though the problem arises regardless of the probing scheme)
          - let h(x) = h(y) = h(z)
          - insert x, y and z
          - delete y
          - now search for z. What happens?
             - we won't find z because we will stop our search when we find an empty entry
       - solutions?
          - besides just being occupied or not occupied, keep track if it was deleted
          - for inserting, if we find a deleted item, fill it in
          - in searching if we find a deleted item, keep searching
       - any problems with this approach?
          - if we delete a lot of items, our search times can remain large even though our table isn't very full
          - in general, if you plan on doing a lot of deleting, use a chained hashtable
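        - a sketch of the "deleted" marker idea, as changes to the open-addressed sketch from earlier (the DELETED sentinel name is made up for illustration):

             // a sentinel "tombstone" marking entries whose item was deleted
             protected static final Object DELETED = new Object();

             // remove: mark the entry as deleted rather than emptying it
             public void remove(E item) {
                for (int i = 0; i < table.length; i++) {
                   int entry = probe(item, i);
                   if (table[entry] == null) {
                      return;   // hit an empty entry: item isn't in the table
                   }
                   if (table[entry].equals(item)) {
                      table[entry] = DELETED;
                      return;
                   }
                }
             }

             // put now treats DELETED entries as available (fills them in)
             public void put(E item) {
                for (int i = 0; i < table.length; i++) {
                   int entry = probe(item, i);
                   if (table[entry] == null || table[entry] == DELETED) {
                      table[entry] = item;
                      return;
                   }
                }
                throw new IllegalStateException("table is full");
             }

        - note that contains needs no change: a DELETED entry is not null and never equals a real item, so the search keeps going past it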

  • hash functions
       - h(x) = x.hashCode() % m
       - many other options
       - http://en.wikipedia.org/wiki/List_of_hash_functions
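        - one Java wrinkle here: hashCode() can be negative, and % keeps the sign, so mask off the sign bit first; a sketch, assuming m is the table-size field:

             private int hash(Object x) {
                return (x.hashCode() & 0x7fffffff) % m;   // always in [0, m)
             }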

  • key/value pairing
       - so far, we've just talked about storing sets of things
        - often, we want to store things based on a key, but we also want some data/value associated with that key
          - social security number
             - name, address, etc.
          - bank account number
          - counting the number of times a word occurs in a document
             - key is the word
             - data/value is the frequency
       - look at Map interface in Hashtables code
          - similar to Set
          - the put method has a value as well
           - there is a get method instead of containsKey, which returns the value associated with a key
       - how would this change the code?
          - need to store both the key and the value
          - all the hashing is still based on the key; the value is just a tagalong item
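        - a sketch of the change for a chained table: store (key, value) pairs but hash and compare on the key only (Pair and these methods are illustrative, not the course's Map code):

             // a simple (key, value) pair
             public class Pair<K, V> {
                public final K key;
                public V value;

                public Pair(K key, V value) {
                   this.key = key;
                   this.value = value;
                }
             }

             // in a chained table declared as LinkedList<Pair<K, V>>[] table:
             public void put(K key, V value) {
                for (Pair<K, V> p : table[hash(key)]) {
                   if (p.key.equals(key)) {
                      p.value = value;   // key already present: replace the value
                      return;
                   }
                }
                table[hash(key)].add(new Pair<K, V>(key, value));
             }

             public V get(K key) {
                for (Pair<K, V> p : table[hash(key)]) {
                   if (p.key.equals(key)) {
                      return p.value;   // the value just tags along with the key
                   }
                }
                return null;   // key not in the table
             }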

  • hashtables in java
       - Set interface (http://java.sun.com/j2se/1.5.0/docs/api/java/util/Set.html)
          - add
          - contains
          - remove
          - HashSet (http://java.sun.com/j2se/1.5.0/docs/api/java/util/HashSet.html)
           - what do you think SortedSet and HashSet look like?
       - Map interface (http://java.sun.com/j2se/1.5.0/docs/api/java/util/Map.html)
          - put
          - get
          - remove
          - HashMap (http://java.sun.com/j2se/1.5.0/docs/api/java/util/HashMap.html)
          - others
             - TreeMap
             - SortedMap
              - Hashtable (the legacy synchronized version; HashMap is to Hashtable sort of like ArrayList is to Vector)
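        - the word-counting example from earlier, done with java.util's HashMap:

             import java.util.HashMap;
             import java.util.Map;

             public class WordCount {
                public static void main(String[] args) {
                   String[] words = {"the", "cat", "saw", "the", "dog"};
                   // key: the word; value: how many times we've seen it
                   Map<String, Integer> counts = new HashMap<String, Integer>();
                   for (String word : words) {
                      Integer count = counts.get(word);
                      counts.put(word, count == null ? 1 : count + 1);
                   }
                   // prints something like {saw=1, cat=1, the=2, dog=1}
                   // (iteration order is unspecified for a HashMap)
                   System.out.println(counts);
                }
             }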