CS201 - Spring 2014 - Class 30

exercise

storing data for quick lookup:
   - support three key operations:
      - insert
      - search/contains
      - remove

   - key idea:
      - use an array to store the data
      - associate with each data item an index in the array

   1. generate a numerical representation for the data item

   2. take numerical representation (hash code) and map it to an entry in the array

   3. handle collisions
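
The three steps above can be sketched in a few lines of Python. This is only an illustration: hash_code, table_index, and the multiplier 31 are assumptions made for this sketch, not the course's Hashtables code.

```python
def hash_code(item):
    """Step 1: numerical representation of a string (a toy polynomial hash)."""
    code = 0
    for ch in item:
        code = (code * 31 + ord(ch)) % 2**32
    return code


def table_index(item, m):
    """Step 2: map the hash code to an entry in an array with m entries."""
    return hash_code(item) % m


m = 7
index_a = table_index("a", m)   # hash code 97, and 97 % 7 == 6
index_h = table_index("h", m)   # hash code 104, and 104 % 7 == 6

# Step 3: two different items can land on the same entry, which is a
# collision that the table has to handle somehow.
collision = index_a == index_h
```

With m = 7 the items "a" and "h" map to the same entry, so even this tiny table needs a collision strategy.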

collision resolution by chaining

open addressing
   - because of some of the cons above (in particular the overhead), we often only want to use a basic array to store the hashtable
   - we still have to do something about collisions... ideas?
   - when we have a collision and there's already an item at that location, we need to find another possible place to put it
   - for open addressing, we must define a "probe sequence" that determines where to look in the table next if we have a collision
      - if h(x) is the hash function, the probe sequence is often written as h(x, i), that is, the place to look on probe i, after the i earlier locations h(x, 0), ..., h(x, i-1) were all full already
         - h(x, 0) is the first place to check
         - h(x, 1) the next
         - and so on
      - notice that this is defined by the hash function, so it could be different for different items, etc.
      - the probe sequence must be a permutation of all of the entries in the table, that is, if we look at h(x, 0), h(x, 1), ..., h(x, m-1), these values will be a permutation of 0, 1, ..., m-1
         - why?
   - inserting
      - given this, how can we insert items into the table?
         - start at probe sequence 0, if it's empty put the item there
         - if it's full, go on to 1, etc.
         - note that we can actually fill up the table here
   - contains
      - what do we need to check here?
         - again, start at probe 0
            - see if there's something there AND see if the item is equal to the item we're actually looking for
         - if not, keep looking
         - when do we stop?
            - when we find an empty entry (the item would have been put there), or after m probes if the table is completely full
   - look at OpenAddressedHashtable class in Hashtables code
      - what is the "put" method doing?
      - write the "contains" method
      - notice that the class is abstract since we haven't defined what the probe sequence will be
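
A minimal Python sketch of the insert and contains logic described above, with the probe sequence left abstract. The names OpenAddressedTable, probe, and NextSlotTable are hypothetical and are not the course's OpenAddressedHashtable class. The sketch also assumes items are never None and ignores removal.

```python
from abc import ABC, abstractmethod


class OpenAddressedTable(ABC):
    """Open addressing over a plain list; subclasses supply the probe sequence."""

    def __init__(self, m):
        self.slots = [None] * m   # None marks an empty entry

    @abstractmethod
    def probe(self, item, i):
        """Return the index to check on probe i (i = 0, 1, ..., m-1)."""

    def put(self, item):
        for i in range(len(self.slots)):
            j = self.probe(item, i)
            if self.slots[j] is None:
                self.slots[j] = item
                return True
            if self.slots[j] == item:
                return True       # already in the table
        return False              # every entry was full

    def contains(self, item):
        for i in range(len(self.slots)):
            j = self.probe(item, i)
            if self.slots[j] is None:
                return False      # an empty entry ends the search
            if self.slots[j] == item:
                return True
        return False              # probed all m entries without finding it


class NextSlotTable(OpenAddressedTable):
    """Hypothetical concrete subclass, only here to make the sketch runnable."""

    def probe(self, item, i):
        return (hash(item) + i) % len(self.slots)


table = NextSlotTable(5)
for key in [3, 8, 13]:            # all three keys start at index 3
    table.put(key)
```

Trying to create an OpenAddressedTable directly raises a TypeError because probe is abstract, which mirrors why the class in the notes has to be abstract.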

abstract methods/classes

probe sequences
   - our one requirement is that the probe sequence must visit every entry in the table
   - ideas?
   - linear probing
      - easiest to understand:
         - h(x, i) = (h(x) + i) % m
         - just look at the next location in the hash table
         - if the original hash function says to look at location j and it's full, then we look at j+1, j+2, j+3, ...
         - need to modulo the size of the table to wrap around
      - look at LinearAddressedHashtable class in Hashtables code
      - problems?
         - "primary clustering"
         - you tend to get very long sequences of things clustered together
         - show an example
   - double hashing
      - h(x, i) = (h(x) + i h2(x)) % m
      - unlike linear probing, where the step between probes is always 1, the step this time is a second hash of the data, h2(x)
      - avoids primary clustering
      - what is the challenge?
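         - probe sequence must be a permutation of the entries in the table
         - h2(x) must never be 0 and must share no common factor with m (e.g. m prime and 1 <= h2(x) < m), otherwise the probes only cycle through part of the table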
      - commonly used in real implementations
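
A small Python comparison of the two probe sequences above on the same keys. It is a sketch under stated assumptions: integer keys with h(x) = x % m, and a second hash h2(x) = 1 + x % (m - 1) chosen only for illustration. The function names are hypothetical.

```python
def linear_probe(x, i, m):
    """h(x, i) = (h(x) + i) % m, assuming h(x) = x % m."""
    return (x % m + i) % m


def double_hash_probe(x, i, m):
    """h(x, i) = (h(x) + i * h2(x)) % m, with an assumed h2 in 1..m-1."""
    h2 = 1 + x % (m - 1)
    return (x % m + i * h2) % m


def insert_all(keys, m, probe):
    """Insert keys into an empty table; return the table and probes per key."""
    table = [None] * m
    counts = []
    for x in keys:
        for i in range(m):
            j = probe(x, i, m)
            if table[j] is None:
                table[j] = x
                counts.append(i + 1)
                break
        else:
            raise RuntimeError("table is full")
    return table, counts


def is_permutation(x, m, probe):
    """True if the probe sequence for x visits each of the m entries once."""
    return sorted(probe(x, i, m) for i in range(m)) == list(range(m))


keys = [11, 22, 33, 1, 12, 2]
linear_table, linear_counts = insert_all(keys, 11, linear_probe)
# linear_counts == [1, 2, 3, 3, 4, 4]: the keys pile up in entries 0..5
double_table, double_counts = insert_all(keys, 11, double_hash_probe)
# double_counts == [1, 2, 2, 1, 3, 1]: the keys spread out instead
```

With m = 11 (prime) every double hashing sequence here is a permutation. With m = 8 the key 8 gets h2 = 2 and only visits entries 0, 2, 4, 6, which shows why h2(x) and m must share no common factor.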

running time for open addressing
   - what is the run-time for contains for open addressing?
   - again, assume an ideal hash function where each original location is equally likely and also each probe is equally likely
   - assuming n items in a table with m entries (i.e. a load factor of alpha = n/m)
      - what is the probability that the first place we look is occupied?
         - alpha (n/m)
      - given the first was occupied, what is the probability that the second place we look is occupied?
         - alpha (actually, (n-1)/(m-1), but almost alpha :)
      - what is the probability that we have to make a third probe?
         - alpha * alpha = alpha^2 (the first position was occupied AND the second position was occupied)
      - so, what is the expected number of positions we have to probe before we find an open one?
         - it's the sum of the probabilities that we have to make each probe (the first probe always happens, so it contributes 1)
         - 1 + alpha + alpha^2 + alpha^3 + ...
         - which is bounded by: 1/(1-alpha)
   - how does this help us with our run-time?
      - the run-time is bounded by the number of probes we have to make
      - to insert, we need to find an open entry, what is the running time?
         O(1 + 1/(1-alpha))
      - for contains, we may have to search until we find an open entry, what's the running time?
         O(1 + 1/(1-alpha))
   - what does this translate to in number of probes?
      alpha   expected number of probes, 1/(1-alpha)
      0.1     1.11
      0.25    1.33
      0.5     2
      0.75    4
      0.9     10
      0.95    20
      0.99    100
      (note that these are ideal case numbers)
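
The table values can be reproduced, and roughly checked by simulation, with a short Python sketch. The simulation assumes the idealized model from above: every probe goes to a uniformly random entry that has not been visited yet. The function names and the defaults (m = 200, 2000 trials) are arbitrary choices for illustration.

```python
import random


def expected_probes(alpha):
    """Ideal-case bound on the number of probes at load factor alpha."""
    return 1 / (1 - alpha)


def simulate_probes(alpha, m=200, trials=2000, seed=0):
    """Average probes needed to reach an empty entry in a random table."""
    rng = random.Random(seed)
    n = round(alpha * m)
    occupied = set(rng.sample(range(m), n))
    total = 0
    for _ in range(trials):
        # a random permutation of the entries plays the role of the probe sequence
        for count, slot in enumerate(rng.sample(range(m), m), start=1):
            if slot not in occupied:
                total += count
                break
    return total / trials


alphas = [0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99]
bounds = {a: round(expected_probes(a), 2) for a in alphas}
simulated_half_full = simulate_probes(0.5)
```

For a finite table the simulated average comes out slightly below 1/(1-alpha), since each occupied entry that gets probed leaves fewer occupied entries for later probes.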