CS62  Spring 2010  Lecture 18
found watch
last lecture in java!
hashtables overview
 U is our universe of keys
 want to mimic having an array/table with |U| entries, but have it be smaller, of size m
 this would give us constant time adding and searching
 use a hash function h to map from U to an entry in the table
 if m < |U| (which is the basic idea), we're going to have "collisions": h(x) = h(y) even though x != y, i.e. two keys hash to the same location in the table even though they're not equal
 look at Set interface in Hashtables code
collision resolution by chaining
 a common approach is to allow multiple items to occupy a given entry in our array. How?
 rather than just having the item stored at the entry, store a linked list
 put: if two items hash to the same location in the array, just add them to the linked list
 contains: search through all of the items stored at that entry to see if the item being searched for is there
 walk through an example
 let h(x) = 1, h(y) = 2, h(z) = 1
 insert x, y and z into the table
 now search for z
 show ChainedHashtable class in Hashtables code
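The ChainedHashtable from the Hashtables code isn't reproduced in these notes; a minimal sketch of the chaining idea (class and method names here are illustrative, not the course's code) might look like:

```java
import java.util.LinkedList;

// Minimal chained hashtable sketch: each array entry holds a linked
// list ("chain") of all the items that hashed to that entry.
public class ChainedHashtableSketch<E> {
    private final LinkedList<E>[] table;

    @SuppressWarnings("unchecked")
    public ChainedHashtableSketch(int m) {
        table = new LinkedList[m];
        for (int i = 0; i < m; i++) {
            table[i] = new LinkedList<>();
        }
    }

    private int hash(E item) {
        // mask off the sign bit so the index is non-negative
        return (item.hashCode() & 0x7fffffff) % table.length;
    }

    public void put(E item) {
        // O(1): just add to the front of the chain
        // (a real Set would first check contains to avoid duplicates)
        table[hash(item)].addFirst(item);
    }

    public boolean contains(E item) {
        // O(length of the chain): scan the list at that entry
        return table[hash(item)].contains(item);
    }
}
```

With h(x) = h(z), inserting x, y and z simply puts x and z on the same chain, and searching for z walks that chain.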
 what is the runtime of the put and containsKey methods?
 put: O(1), we just have to add it to the beginning or end of a linked list
 search: O(length of the linked list)
 worst case: all of the data hashes to the same entry, e.g. h(x) = 1 for every x
 search: O(n)
 average case: depends on how well the hash function distributes the items over the table
 to analyze, we'll make the "simple uniform hashing" assumption, which is that an item could be hashed to any entry equally likely
 let's say I roll a 10-sided die 10 times; how likely is it to see any particular value?
 1/10
 what about if I roll it 100 times?
 100/10 = 10 times per value on average
 the hashtable is similar: if we have m entries in the table and n items that we've added, what is the expected (average) length of any linked list, i.e. the number of things hashed to that entry under simple uniform hashing?
 n/m (like n rolls of an m sided die)
 this value (n/m) has a special name for hashtables and is called the "load" (or load factor) of the hashtable (often written as alpha)
 search: O(1+alpha), on average
 why the 1?
 Exercise 15.9: When 23 randomly selected people are brought together, chances are greater than 50 percent that two have the same birthday. What does this tell us about uniformly distributed hash codes for keys in a hash table?
 m = 365, we're hashing people based on their birthday
 n = 23
 load = 23/365, which is still quite low
 there's still a better than 50% chance of having a collision
 hash table size is important and collisions are very likely, even for small amounts of data
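The birthday chance can be computed exactly: the probability that all n keys land in distinct slots is (m/m) * ((m-1)/m) * ... * ((m-n+1)/m), and the collision probability is one minus that. A quick check (the class and method here are just an illustrative sketch):

```java
// Exact birthday-problem computation: probability of at least one
// collision when n keys are hashed uniformly into m slots.
public class BirthdayCollision {
    public static double collisionProbability(int n, int m) {
        double allDistinct = 1.0;
        for (int i = 0; i < n; i++) {
            // the (i+1)th key must avoid the i slots already taken
            allDistinct *= (double) (m - i) / m;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        // 23 keys, 365 slots: collision chance is already over 50%
        System.out.println(collisionProbability(23, 365));
    }
}
```

Even at a load of only 23/365, a collision is more likely than not.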
 benefits and drawbacks?
 pros
 since we used linked lists, there is no limit to the number of items we can store in the table
 it's very straightforward to implement
 cons
 as the load gets high, our runtime will degrade
 there is a lot of overhead involved with storing the linked lists
open addressing
 because of some of the cons above (in particular the overhead), we often only want to use a basic array to store the hashtable
 we still have to do something about collisions... ideas?
 when we have a collision and there's already an item at that location, we need to find another possible place to put it
 for open addressing, we must define a "probe sequence" that determines where to look in the table next if we have a collision
 if h(x) is the hash function, the probe sequence is often written as h(x, i), that is, the next place to look if the previous i locations (h(x, 0) through h(x, i-1)) were all full already
 h(x, 0) is the first place to check
 h(x, 1) the next
 and so on
 notice that this is defined by the hash function, so it could be different for different items, etc.
 the probe sequence must be a permutation of all of the entries in the table, that is, if we look at h(x, 0), h(x, 1), ..., h(x, m-1), these values will be a permutation of 0, 1, ..., m-1
 why?
 otherwise the sequence could skip entries: put could report the table full, and contains could miss items, even though open entries remain
 inserting
 given this, how can we insert items into the table?
 start at probe sequence 0, if it's empty put the item there
 if it's full, go on to 1, etc.
 note that we can actually fill up the table here
 contains
 what do we need to check here?
 again, start at probe 0
 see if there's something there AND see if the item is equal to the item we're actually looking for
 if not, keep looking
 when do we stop?
 when we find an empty entry
 look at OpenAddressedHashtable class in Hashtables code
 what is the "put" method doing?
 write the "contains" method
 notice that the class is abstract since we haven't defined what the probe sequence will be
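The OpenAddressedHashtable class itself isn't reproduced in these notes; a minimal sketch of put and contains over an abstract probe sequence (class and method names are illustrative, not the course code) could look like:

```java
// Open-addressing sketch with an abstract probe sequence.
// Subclasses define the probe method (linear, double hashing, ...).
public abstract class OpenAddressedSketch<E> {
    protected final Object[] table;

    public OpenAddressedSketch(int m) {
        table = new Object[m];
    }

    // h(x, i): the entry to try on the ith probe for this item
    protected abstract int probe(E item, int i);

    public void put(E item) {
        for (int i = 0; i < table.length; i++) {
            int slot = probe(item, i);
            if (table[slot] == null) {  // found an open entry
                table[slot] = item;
                return;
            }
        }
        throw new IllegalStateException("table is full"); // can really happen
    }

    public boolean contains(E item) {
        for (int i = 0; i < table.length; i++) {
            int slot = probe(item, i);
            if (table[slot] == null) {
                return false;           // empty entry: stop, item can't be further along
            }
            if (table[slot].equals(item)) {
                return true;
            }
        }
        return false;                   // probed every entry without finding it
    }
}
```

Note how contains stops at the first empty entry, exactly as described above.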
probe sequences
 our one requirement is that the probe sequence must visit every entry in the table
 ideas?
 linear probing
 easiest to understand:
 h(k, i) = (h(k) + i) % m
 just look at the next location in the hash table
 if the original hash function says to look at location j and it's full, then we look at j+1, j+2, j+3, ...
 need to modulo the size of the table to wrap around
 look at LinearAddressedHashtable class in Hashtables code
 problems?
 "primary clustering"
 you tend to get very long sequences of things clustered together
 show an example
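To make the wrap-around concrete, here is a small sketch (illustrative names, not the course's code) that generates the linear-probing sequence:

```java
// Linear probing probe sequence: h(k, i) = (h(k) + i) % m
public class LinearProbeDemo {
    public static int[] probeSequence(int baseHash, int m) {
        int[] seq = new int[m];
        for (int i = 0; i < m; i++) {
            seq[i] = (baseHash + i) % m; // next slot, wrapping around the table
        }
        return seq;
    }

    public static void main(String[] args) {
        // base hash 5 in a table of size 7 probes: 5, 6, 0, 1, 2, 3, 4
        System.out.println(java.util.Arrays.toString(probeSequence(5, 7)));
    }
}
```

Because every item that lands in a cluster probes the very next slot, clusters tend to absorb new insertions and grow, which is exactly the primary clustering problem.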
 double hashing
 h(x, i) = (h(x) + i * h2(x)) % m
 unlike linear, where the offset is constant, the offset this time is another hash of the data
 avoids primary clustering
 what is the challenge?
 probe sequence must be a permutation of the table entries
 h2(x) must be relatively prime to m for the sequence to visit every entry (e.g., make m prime and ensure h2(x) is never 0)
 this is the scheme most commonly used in real implementations
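These notes don't pin down h2; a common textbook choice (an assumption here, not from the lecture) is h2(x) = 1 + (x mod (m-1)) with m prime, which keeps h2 nonzero and relatively prime to m:

```java
// Double hashing: h(x, i) = (h1(x) + i * h2(x)) % m, with m prime
// and h2(x) never zero so the sequence actually moves.
public class DoubleHashDemo {
    static final int M = 7; // prime table size (assumed for this demo)

    static int h1(int x) { return x % M; }
    static int h2(int x) { return 1 + (x % (M - 1)); } // in 1..M-1, never 0

    public static int probe(int x, int i) {
        return (h1(x) + i * h2(x)) % M;
    }

    public static void main(String[] args) {
        // For x = 10: h1 = 3, h2 = 5; the probes visit 3, 1, 6, 4, 2, 0, 5
        for (int i = 0; i < M; i++) {
            System.out.print(probe(10, i) + " ");
        }
    }
}
```

Since M is prime, any step size in 1..M-1 is relatively prime to M, so the probes form a permutation of the table entries.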
running time for open addressing
 what is the runtime for contains for open addressing?
 again, assume an ideal hash function where each original location is equally likely and also each probe is equally likely
 assuming n things in the table and m entries (i.e. a load of alpha = n/m)
 what is the probability that the first place we look is occupied?
 alpha
 given the first was occupied, what is the probability that the second place we look is occupied?
 alpha (actually, (n-1)/(m-1), but almost alpha :)
 what is the probability that we have to make a third probe?
 alpha * alpha, i.e. roughly alpha^2 (both the first AND the second positions were occupied)
 in general, the probability that we need an (i+1)th probe is roughly alpha^i
 so, what is the expected number of probes before we find an open one?
 it's the sum of the probabilities of making each probe
 1 + alpha + alpha^2 + alpha^3 + ...
 which is bounded by: 1/(1-alpha)
 how does this help us with our runtime?
 the runtime is bounded by the number of probes we have to make
 to insert, we need to find an open entry, what is the running time?
 O(1/(1-alpha))
 for contains, we may have to search until we find an open entry, what's the running time?
 O(1/(1-alpha))
 what does this translate to search-wise?
 alpha    average number of probes per search
 0.1      1.11
 0.25     1.33
 0.5      2
 0.75     4
 0.9      10
 0.95     20
 0.99     100
 (note that these are ideal-case numbers)
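The table values are just 1/(1-alpha), which is easy to check (an illustrative sketch, not course code):

```java
// Expected probes per search under ideal open addressing: 1/(1-alpha)
public class ExpectedProbes {
    public static double expectedProbes(double alpha) {
        return 1.0 / (1.0 - alpha);
    }

    public static void main(String[] args) {
        double[] loads = {0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99};
        for (double a : loads) {
            // prints each load alongside its expected probe count
            System.out.printf("%.2f -> %.2f%n", a, expectedProbes(a));
        }
    }
}
```

Notice how sharply the cost blows up as alpha approaches 1, which is why real implementations resize well before the table fills.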
deleting in open addressing
 what is the challenge with deleting in open addressing?
 let's use linear probing (but it happens regardless of probing scheme)
 let h(x) = h(y) = h(z)
 insert x, y and z
 delete y
 now search for z. What happens?
 we won't find z because we will stop our search when we find an empty entry
 solutions?
 besides an entry just being occupied or not occupied, also keep track of whether it held an item that has since been deleted
 for inserting, if we find a deleted entry, fill it in
 for searching, if we find a deleted entry, keep searching
 any problems with this approach?
 if we delete a lot of items, our search times can remain large even though our table isn't very full
 in general, if you plan on doing a lot of deleting, use a chained hashtable
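One way to sketch the deleted-marker idea (often called "tombstones") on top of linear probing; the class and names are illustrative, not the course's code:

```java
// Linear-probing table with tombstones so deletion doesn't break search.
public class TombstoneTable {
    private static final Object DELETED = new Object(); // tombstone marker
    private final Object[] table;

    public TombstoneTable(int m) { table = new Object[m]; }

    private int hash(Object item) {
        return (item.hashCode() & 0x7fffffff) % table.length;
    }

    public void put(Object item) {
        int h = hash(item);
        for (int i = 0; i < table.length; i++) {
            int slot = (h + i) % table.length;
            // an empty slot OR a tombstone can be reused for inserts
            if (table[slot] == null || table[slot] == DELETED) {
                table[slot] = item;
                return;
            }
        }
        throw new IllegalStateException("table is full");
    }

    public boolean contains(Object item) {
        int h = hash(item);
        for (int i = 0; i < table.length; i++) {
            int slot = (h + i) % table.length;
            if (table[slot] == null) return false; // truly empty: stop
            if (table[slot] == DELETED) continue;  // tombstone: keep looking
            if (table[slot].equals(item)) return true;
        }
        return false;
    }

    public void remove(Object item) {
        int h = hash(item);
        for (int i = 0; i < table.length; i++) {
            int slot = (h + i) % table.length;
            if (table[slot] == null) return;       // not present
            if (table[slot] != DELETED && table[slot].equals(item)) {
                table[slot] = DELETED;             // leave a tombstone behind
                return;
            }
        }
    }
}
```

With x, y, z all hashing to the same slot, removing y leaves a tombstone, so a later search for z keeps probing past it instead of stopping early.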
hash functions
 h(x) = x.hashCode() % m (in Java, hashCode() can be negative, so in practice mask off the sign bit first)
 many other options

http://en.wikipedia.org/wiki/List_of_hash_functions
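One Java wrinkle with x.hashCode() % m: hashCode() can return a negative number, and Java's % preserves the sign of the dividend, so the raw index can be negative. A common guard (a sketch, not from the notes) masks off the sign bit; masking is preferred over Math.abs because Math.abs(Integer.MIN_VALUE) is still negative:

```java
public class HashIndex {
    // Convert an arbitrary hashCode() into a valid table index in [0, m).
    public static int index(Object x, int m) {
        return (x.hashCode() & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        // Integer.hashCode() is the value itself, so -3 would give a
        // negative index with a plain %, but masking keeps it in range
        System.out.println(index(-3, 10));
    }
}
```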
key/value pairing
 so far, we've just talked about storing sets of things
 often, we want to store things based on a key, with some data/value associated with that key
 social security number
 name, address, etc.
 bank account number
 counting the number of times a word occurs in a document
 key is the word
 data/value is the frequency
 look at Map interface in Hashtables code
 similar to Set
 the put method has a value as well
 a get method (instead of containsKey) that returns the value associated with a key
 how would this change the code?
 need to store both the key and the value
 all the hashing is still based on the key; the value is just a tagalong item
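The word-counting example maps directly onto this key/value interface; a sketch using Java's built-in HashMap (the WordCount class itself is illustrative, not course code):

```java
import java.util.HashMap;
import java.util.Map;

// Counting word frequencies: the word is the key, the count is the value.
public class WordCount {
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            // hashing is based only on the key (the word);
            // the count just tags along as the value
            counts.put(word, counts.getOrDefault(word, 0) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("the cat and the hat"));
    }
}
```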
hashtables in java
 Set interface (http://java.sun.com/j2se/1.5.0/docs/api/java/util/Set.html)
 add
 contains
 remove
 HashSet (http://java.sun.com/j2se/1.5.0/docs/api/java/util/HashSet.html)
 what do you think SortedSet and HashSet look like?
 Map interface (http://java.sun.com/j2se/1.5.0/docs/api/java/util/Map.html)
 put
 get
 remove
 HashMap (http://java.sun.com/j2se/1.5.0/docs/api/java/util/HashMap.html)
 others
 TreeMap
 SortedMap
 Hashtable (a legacy synchronized version; Hashtable is to HashMap sort of like Vector is to ArrayList)
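A quick, illustrative taste of the built-in classes in action:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BuiltInHashing {
    public static void main(String[] args) {
        // Set: membership only
        Set<String> seen = new HashSet<>();
        seen.add("x");
        System.out.println(seen.contains("x"));    // true
        System.out.println(seen.contains("y"));    // false

        // Map: key -> value
        Map<String, Integer> accounts = new HashMap<>();
        accounts.put("alice", 100);
        accounts.put("alice", 250);                // put replaces the old value
        System.out.println(accounts.get("alice")); // 250
        accounts.remove("alice");
        System.out.println(accounts.get("alice")); // null
    }
}
```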