CS201 - Spring 2014 - Class 30

  • exercise

  • storing data for quick lookup:
       - support three key operations:
          - insert
          - search/contains
          - remove

       - key idea:
          - use an array to store the data
          - associate with each data item an index in the array

       1. generate a numerical representation for the data item
       
       2. take numerical representation (hash code) and map it to an entry in the array

       3. handle collisions
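
       - a minimal Java sketch of steps 1 and 2, and of what a collision in step 3 looks like. The names HashIndexSketch and indexFor are made up for illustration and are not from the course code; it assumes the item's hashCode() is the numerical representation:

```java
// Hypothetical helper, not part of the course's Hashtables code.
class HashIndexSketch {
    // Step 1: hashCode() gives a numerical representation of the item.
    // Step 2: floorMod maps it to an index in [0, m), even if the hash code is negative.
    static int indexFor(Object item, int m) {
        return Math.floorMod(item.hashCode(), m);
    }

    public static void main(String[] args) {
        int m = 11; // table size (an arbitrary choice for this example)
        System.out.println(indexFor(3, m));   // 3
        System.out.println(indexFor(14, m));  // 3, the same entry as above: a collision
        System.out.println(indexFor(-3, m));  // 8
    }
}
```

       - the first two calls land on the same entry, which is exactly the situation step 3 has to handle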


  • collision resolution by chaining

  • open addressing
       - because of some of the cons of chaining (in particular the overhead), we often want to use only a basic array to store the hashtable
       - we still have to do something about collisions... ideas?
       - when we have a collision and there's already an item at that location, we need to find another possible place to put it
       - for open addressing, we must define a "probe sequence" that determines where to look in the table next if we have a collision
          - if h(x) is the hash function, the probe sequence is often written as h(x, i), that is, the place to look on probe number i (counting from 0), used only if all i earlier locations were already full
             - h(x, 0) is the first place to check
             - h(x, 1) the next
             - and so on
          - notice that this is defined by the hash function, so it could be different for different items, etc.
          - the probe sequence must be a permutation of all of the entries in the table, that is, if we look at h(x, 0), h(x, 1), ..., h(x, m-1), these values will be a permutation of 0, 1, ..., m-1
             - why?
       - inserting
          - given this, how can we insert items into the table?
             - start at probe sequence 0, if it's empty put the item there
             - if it's full, go on to 1, etc.
             - note that we can actually fill up the table here
       - contains
          - what do we need to check here?
             - again, start at probe 0
                - see if there's something there AND see if the item is equal to the item we're actually looking for
             - if not, keep looking
             - when do we stop?
                - when we find an empty entry, or after we have checked all m entries in the probe sequence (the table is full)
       - look at OpenAddressedHashtable class in Hashtables code
          - what is the "put" method doing?
          - write the "contains" method
          - notice that the class is abstract since we haven't defined what the probe sequence will be
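
       - one possible shape for such a class, as a rough sketch only. The class names, fields, and method signatures here are assumptions and will not match the course's OpenAddressedHashtable exactly; the small concrete subclass exists only so the sketch can run:

```java
// Hypothetical sketch; not the course's OpenAddressedHashtable.
abstract class OpenAddressingSketch {
    protected final String[] table;

    OpenAddressingSketch(int m) {
        table = new String[m];
    }

    // h(x, i): the i-th place to look for key. Subclasses define the probe sequence.
    protected abstract int probe(String key, int i);

    // Follow the probe sequence until we find an empty entry (or the key itself).
    boolean put(String key) {
        for (int i = 0; i < table.length; i++) {
            int pos = probe(key, i);
            if (table[pos] == null) {
                table[pos] = key;
                return true;
            }
            if (table[pos].equals(key)) {
                return true; // already in the table
            }
        }
        return false; // every entry was probed and occupied: the table is full
    }

    // Stop at the first empty entry, or after all m entries have been probed.
    boolean contains(String key) {
        for (int i = 0; i < table.length; i++) {
            int pos = probe(key, i);
            if (table[pos] == null) {
                return false;
            }
            if (table[pos].equals(key)) {
                return true;
            }
        }
        return false;
    }
}

// Simplest concrete subclass (try the next entry each time), only for demonstration.
class NextSlotSketch extends OpenAddressingSketch {
    NextSlotSketch(int m) {
        super(m);
    }

    @Override
    protected int probe(String key, int i) {
        int m = table.length;
        return (Math.floorMod(key.hashCode(), m) + i) % m;
    }
}
```

       - note that this sketch has no remove; with open addressing, simply emptying an entry would break the "stop at an empty entry" rule in contains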

  • abstract methods/classes

  • probe sequences
       - our one requirement is that the probe sequence must visit every entry in the table
       - ideas?
       - linear probing
          - easiest to understand:
             - h(k, i) = (h(k) + i) % m
             - just look at the next location in the hash table
             - if the original hash function says to look at location j and it's full, then we look at j+1, j+2, j+3, ...
             - need to take the result modulo the size of the table to wrap around
          - look at LinearAddressedHashtable class in Hashtables code
          - problems?
             - "primary clustering"
             - you tend to get very long sequences of things clustered together
             - show an example
       - double hashing
          - h(x, i) = (h(x) + i h2(x)) % m
          - unlike linear probing, where the step between probes is always 1, the step this time is another hash of the data, h2(x)
          - avoids primary clustering
          - what is the challenge?
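             - probe sequence must be a permutation of the entries in the table
             - h2(x) must make the sequence visit all possible positions in the table: it must be nonzero and share no common factor with m (e.g. make m prime and keep h2(x) between 1 and m-1)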
          - most commonly used in real implementations
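
       - a small Java sketch that prints both kinds of probe sequence and checks the permutation requirement. The class ProbeSequenceSketch and its methods are made up for illustration; the values of h(x) and h2(x) are passed in directly as plain numbers rather than computed from real data:

```java
import java.util.Arrays;

// Hypothetical helper class; h and h2 are supplied by the caller as assumptions.
class ProbeSequenceSketch {
    // Linear probing: h(k, i) = (h(k) + i) % m
    static int[] linear(int h, int m) {
        int[] seq = new int[m];
        for (int i = 0; i < m; i++) {
            seq[i] = (h + i) % m;
        }
        return seq;
    }

    // Double hashing: h(x, i) = (h(x) + i * h2(x)) % m
    static int[] doubleHash(int h, int h2, int m) {
        int[] seq = new int[m];
        for (int i = 0; i < m; i++) {
            seq[i] = (h + i * h2) % m;
        }
        return seq;
    }

    // True if seq contains each of 0, 1, ..., m-1 exactly once.
    static boolean isPermutation(int[] seq) {
        boolean[] seen = new boolean[seq.length];
        for (int pos : seq) {
            if (pos < 0 || pos >= seq.length || seen[pos]) {
                return false;
            }
            seen[pos] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(linear(5, 7)));        // [5, 6, 0, 1, 2, 3, 4]
        System.out.println(Arrays.toString(doubleHash(5, 3, 7))); // [5, 1, 4, 0, 3, 6, 2]
        // h2 = 2 shares a factor with m = 8, so half of the table is never visited:
        System.out.println(Arrays.toString(doubleHash(5, 2, 8))); // [5, 7, 1, 3, 5, 7, 1, 3]
        System.out.println(isPermutation(doubleHash(5, 2, 8)));   // false
    }
}
```

       - the last sequence shows the challenge with double hashing: a poor choice of h2 can leave open entries that an insert never finds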

  • running time for open addressing
       - what is the run-time for contains for open addressing?
       - again, assume an ideal hash function where each original location is equally likely and also each probe is equally likely
       - assuming n things stored in a table with m entries (i.e. a load of alpha = n/m)
          - what is the probability that the first place we look is occupied?
             - alpha (n/m)
          - given the first was occupied, what is the probability that the second place we look is occupied?
             - alpha (actually, (n-1)/(m-1), but almost alpha :)
          - what is the probability that we have to make a third probe?
             - alpha * alpha (both the first and the second positions were occupied)
          - so, what is the expected number of probes before we find an open position?
             - it's the sum of the probabilities that we have to make each probe (the first probe always happens, with probability 1)
             - 1 + alpha + alpha^2 + alpha^3 + ...
             - which is bounded by: 1/(1-alpha)
       - how does this help us with our run-time?
          - the run-time is bounded by the number of probes we have to make
          - to insert, we need to find an open entry, what is the running time?
             O(1 + 1/(1-alpha))
          - for contains, we may have to search until we find an open entry, what's the running time?
             O(1 + 1/(1-alpha))
       - what does this translate to search-wise?
          alpha    average number of probes
          0.1      1.11
          0.25     1.33
          0.5      2
          0.75     4
          0.9      10
          0.95     20
          0.99     100
          (note that these are ideal case numbers)
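
       - the numbers above are just 1/(1-alpha) evaluated at each load. A minimal Java sketch that recomputes them (the class and method names are made up for illustration, and it assumes the same ideal hash function as the analysis):

```java
// Hypothetical helper for recomputing the ideal-case numbers.
class ExpectedProbesSketch {
    // Ideal-case bound on the expected number of probes, valid for 0 <= alpha < 1.
    static double expectedProbes(double alpha) {
        return 1.0 / (1.0 - alpha);
    }

    public static void main(String[] args) {
        double[] loads = {0.1, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99};
        for (double alpha : loads) {
            // Print the load and the bound, each rounded to two decimal places.
            System.out.printf("%.2f  %.2f%n", alpha, expectedProbes(alpha));
        }
    }
}
```

       - the bound grows slowly at first and then blows up as alpha approaches 1, which is one reason to keep an open-addressed table well short of full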