CS201 - Spring 2014

CS201 - Spring 2014 - Class 29

exercise

storing data for quick lookup:
   - support three key operations:
      - insert
      - search/contains
      - remove

   - key idea:
      - use an array to store the data
      - associate with each data item an index in the array

   1. generate a numerical representation for the data item
      - if it's a number already, you're set!
      - if it's something like a string, etc., need to convert it into a number
      - if the data has multiple fields:
         - may only need to rely on one field, e.g. an ID number
         - otherwise, need to come up with a number that represents the aggregate of data, e.g. first AND last name combined
      - in Java, this is done via the hashCode method
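      - a minimal sketch of a hashCode for data with multiple fields (the Person class and its two name fields are made up for illustration; Objects.hash is one standard-library way to combine fields, not the only one):

```java
import java.util.Objects;

// Hypothetical example class: the name and fields are assumptions for illustration.
final class Person {
    private final String firstName;
    private final String lastName;

    Person(String firstName, String lastName) {
        this.firstName = firstName;
        this.lastName = lastName;
    }

    // Two Person objects are equal when both name fields match.
    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof Person)) return false;
        Person p = (Person) other;
        return firstName.equals(p.firstName) && lastName.equals(p.lastName);
    }

    // Combine first AND last name into one number.
    // Equal objects must always return equal hash codes.
    @Override
    public int hashCode() {
        return Objects.hash(firstName, lastName);
    }
}
```

      - note that equals and hashCode are overridden together, so two Person objects with the same names get the same numerical representation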

   2. take the hashCode (numerical representation) and map it to an entry in the array
      - steps 1 and 2 are sometimes thought of as a single step called a hash function

   3. handle collisions
      - the challenge is that we may have two things that are different, but that map to the same entry in the array
      - need to figure out a way to store them anyway

hash functions
   - say we have a number already (unbounded) and we want to map it to a fixed-length array. How can we do this?
   - mod function (called the division method)
      h(k) = k % m

      (where m is the length of the array)

      - are all array lengths equally good?
         - No!
         - tend to use prime numbers
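      - a small sketch of the division method, and of why the array length matters (the class and method names are assumptions; Math.floorMod is used instead of % because Java's % can return a negative value when k is negative):

```java
import java.util.HashSet;
import java.util.Set;

final class DivisionMethod {
    // h(k) = k % m, where m is the length of the array.
    // Math.floorMod keeps the result in 0..m-1 even for negative k.
    static int hash(int k, int m) {
        return Math.floorMod(k, m);
    }

    // Count how many different array entries the keys 10, 20, ..., 100 land in.
    static int distinctEntriesForMultiplesOfTen(int m) {
        Set<Integer> used = new HashSet<>();
        for (int k = 10; k <= 100; k += 10) {
            used.add(hash(k, m));
        }
        return used.size();
    }

    public static void main(String[] args) {
        // With m = 10, every multiple of 10 collides at entry 0.
        System.out.println(distinctEntriesForMultiplesOfTen(10)); // prints 1
        // With the prime m = 11, the same ten keys land in ten different entries.
        System.out.println(distinctEntriesForMultiplesOfTen(11)); // prints 10
    }
}
```

      - the keys here were picked to share a factor with 10; a prime length makes it less likely that a pattern in the keys lines up with the array length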

   - multiplication method
      h(k) = Math.floor( m * (k*A - Math.floor(k*A)) )

      (where m is the length of the array and A is some constant, 0 < A < 1)
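      - a sketch of the multiplication method in Java (the class name is an assumption, and so is the value of A: (sqrt(5) - 1) / 2 is a commonly suggested constant, but the notes only require 0 < A < 1; keys are assumed to be non-negative):

```java
final class MultiplicationMethod {
    // Any constant with 0 < A < 1 works; this particular value is an assumption.
    static final double A = (Math.sqrt(5) - 1) / 2;

    // h(k) = floor(m * (k*A - floor(k*A))), for k >= 0.
    static int hash(int k, int m) {
        double product = k * A;
        // Fractional part of k*A, a value in [0, 1).
        double fraction = product - Math.floor(product);
        // Scale the fraction up to the table size and drop the remainder.
        return (int) Math.floor(m * fraction);
    }

    public static void main(String[] args) {
        int m = 16; // unlike the division method, m does not need to be prime here
        for (int k = 0; k < 5; k++) {
            System.out.println(k + " -> " + hash(k, m));
        }
    }
}
```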

collision resolution by chaining
   - ideas for solving this problem of collisions?
   - a common approach is to allow multiple items to occupy a given entry in our array. How?
      - rather than just having the item stored at the entry, store a linked list

   - insert
      - if two items hash to the same location in the array, just add them to the linked list
   - search/contains
      - search all of the items in the linked list at that entry to see if the item being searched for is there
   - walk through an example
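   - a minimal sketch of chaining for strings (the class and method names are assumptions for illustration, not a definitive implementation; insert adds to the front of the list without checking for duplicates, so it stays constant time):

```java
import java.util.LinkedList;

final class ChainedHashTable {
    // One linked list per array entry.
    private final LinkedList<String>[] table;

    @SuppressWarnings("unchecked")
    ChainedHashTable(int m) {
        table = new LinkedList[m];
        for (int i = 0; i < m; i++) {
            table[i] = new LinkedList<>();
        }
    }

    // Steps 1 and 2: take the hashCode, then map it into 0..m-1.
    private int indexFor(String item) {
        return Math.floorMod(item.hashCode(), table.length);
    }

    // Step 3: colliding items simply share the list at that entry.
    void insert(String item) {
        table[indexFor(item)].addFirst(item);
    }

    // Walk only the one list the item could be in.
    boolean contains(String item) {
        return table[indexFor(item)].contains(item);
    }

    // Removes the first matching copy, if any; returns whether one was found.
    boolean remove(String item) {
        return table[indexFor(item)].remove(item);
    }
}
```

   - to see collisions in action, construct it with a tiny array, e.g. new ChainedHashTable(1), so that every item ends up in the same list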

   - what is the run-time of the insert and search methods?
      - insert/put: O(1), we just have to add it to the beginning or end of a linked list

      - search/contains: O(length of the linked list)
         - worst case: all of the data hashes to the same entry, e.g. h(x) = 1
            - search: O(n)
         - average case: depends on how well the hash function distributes the items over the table
            - to analyze, we'll make the "simple uniform hashing" assumption, which is that an item is equally likely to be hashed to any entry
            - let's say I roll a 10-sided die 10 times: how many times do I expect to see any particular value?
               - each roll shows that value with probability 1/10, so 10/10 = 1 time on average
               - what about if I roll it 100 times?
                  - 100/10 = 10 times per value on average
         - the hashtable is similar: if we have m entries in the table and n items that we've added, what is the expected (average) length of any linked list, i.e. the number of items hashed to that entry under simple uniform hashing?
            - n/m (like n rolls of an m sided die)
            - this value (n/m) has a special name for hashtables and is called the "load" of the hashtable (often written as alpha)
         - search: O(1+alpha), on average
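            - why the 1?
               - even if the linked list is empty, we still have to compute the hash function and look at the entry in the array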
         - Exercise 15.9: When 23 randomly selected people are brought together, chances are greater than 50 percent that two have the same birthday. What does this tell us about uniformly distributed hash codes for keys in a hash table?
            - m = 365, we're hashing people based on their birthday
            - n = 23
            - load = 23/365, which is still quite low
            - still better than 50% chance of having a collision
            - hash table size is important and collisions are very likely, even for small amounts of data
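            - the "better than 50%" claim can be checked directly; a small sketch (class and method names are assumptions) that multiplies out the chance that all n items land in different entries under simple uniform hashing:

```java
final class BirthdayCollision {
    // Probability that at least two of n items share an entry when each item
    // is equally likely to land in any of m entries.
    static double collisionProbability(int n, int m) {
        double allDistinct = 1.0;
        for (int i = 0; i < n; i++) {
            // The (i+1)-th item must avoid the i entries already taken.
            allDistinct *= (double) (m - i) / m;
        }
        return 1.0 - allDistinct;
    }

    public static void main(String[] args) {
        System.out.println(collisionProbability(23, 365)); // roughly 0.507
    }
}
```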

   - benefits and drawbacks of chaining?
      - pros
         - since we used linked lists, there is no limit to the number of items we can store in the table
         - it's very straightforward to implement
      - cons
         - as the load gets high, our run-time will degrade
         - there is a lot of overhead involved with storing the linked lists