Machine Learning - Fall 2013 - Class 25

  • admin
       - class on Friday in MBH 632

  • general mapreduce program pipeline
       - Look at figures in http://developer.yahoo.com/hadoop/tutorial/module4.html
       
       1. the input is provided to the mapreduce program
          - think of the input as a giant list of elements
          - elements are ALWAYS some key/value pair
          - however, the default key/value pair is:
             - value = line in a file
             - key = byte offset into the file
       2. each key/value pair is passed to the mapping function
          - the mapping function takes a key/value pair as input
          - does some processing
          - and outputs a key/value pair as output (not necessarily the same types as input)
       3. all of the output pairs are grouped by the key
          - this results in: key -> value1, value2, value3, ... for all the values associated with that specific key
           - this is still a key/value pair
              - the key = the key
              - the value = an iterator over the values
       4. these new key/value pairs are then passed to the reducer function
          - input is key -> value iterator
          - does some processing (often some sort of aggregation)
           - outputs the final key/value pairs, which should be the answer (or at least the answer to the subproblem)
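
        - to make the data flow concrete, here is a tiny single-machine simulation of these four steps for counting word occurrences (plain Java, no Hadoop; the class and variable names are made up purely for illustration):

             import java.util.AbstractMap.SimpleEntry;
             import java.util.ArrayList;
             import java.util.HashMap;
             import java.util.List;
             import java.util.Map;

             public class PipelineSketch {
                 public static void main(String[] args) {
                     // 1. the input: in Hadoop the key for each line would be its byte offset
                     String[] lines = {"the cat sat", "the cat ran"};

                     // 2. map: each input pair produces a (word, 1) pair for every word in the line
                     List<SimpleEntry<String, Integer>> mapped = new ArrayList<SimpleEntry<String, Integer>>();
                     for (String line : lines) {
                         for (String word : line.split("\\s+")) {
                             mapped.add(new SimpleEntry<String, Integer>(word, 1));
                         }
                     }

                     // 3. group by key: word -> [1, 1, ...]
                     Map<String, List<Integer>> grouped = new HashMap<String, List<Integer>>();
                     for (SimpleEntry<String, Integer> pair : mapped) {
                         if (!grouped.containsKey(pair.getKey())) {
                             grouped.put(pair.getKey(), new ArrayList<Integer>());
                         }
                         grouped.get(pair.getKey()).add(pair.getValue());
                     }

                     // 4. reduce: aggregate (here, sum) the values for each key
                     for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                         int sum = 0;
                         for (int v : entry.getValue()) {
                             sum += v;
                         }
                         System.out.println(entry.getKey() + "\t" + sum);  // e.g. "cat  2"
                     }
                 }
             }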

  • writing a mapreduce program
       - three components:
          1. map function
          2. reduce function
          3. driver

  • let's try and write a function that counts the number of word occurrences in a file

  • to write your own program, here's how I recommend doing it
       1. Figure out what your input is
           - in particular, what will the map step get as input for its key/value
          - the default is just a line in a file
       2. Figure out how to break the program down into a map step and a reduce step
          - The map step will take the input, do some processing and produce a new collection of key/value pairs
          - The reduce step will take the output from the map step as input and then produce another new collection of key/value pairs
          - Sometimes, you may have to break the program into multiple map/reduce steps!
             - most of the programs we'll look at can be accomplished with just 1-2 map/reduce steps
       3. Write pseudo-code for the map and reduce functions
          - be very specific about what the key/value input/output types are for the map/reduce step
          - think about what processing needs to happen in each function
             - ideally, you should keep this processing down to a minimum
             - there cannot be shared state between calls to the map function!
                - if you find that you need it, you need to rethink how to write the program
       4. Write the code
          a. Decide whether you want to have a single class (with nested map and reduce classes) or three classes
          b. Write your map function
             - convert your input types into the appropriate hadoop types (IntWritable, DoubleWritable, Text, ...)
           c. Write a basic driver function (a rough sketch appears after this list)
             - setup the job configuration, in particular
                 - create a new JobConf object based on the driver class
                - set the job name
                - set the key/value output types
                - Optional: if the key/value output types of the map are *different* than the output types of the reduce stage, set these as well
                - set the mapper and reducer classes
                - Optional: if you're using a combiner class, set that as well
                - set the input and output directories
                - Optional: if your program requires additional input, set these as well
                - setup code to run the job
          d. Debug your map function
              - I strongly encourage you to use the NoOpReducer and make sure your map function is printing out what you expect before trying to put the whole thing together
             - Run it on some test data and make sure your map function is working
          e. Write your reduce function
             - convert your input types into the appropriate hadoop types
          f. Put it all together and run it!
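
        - to make step 4c concrete, here is a rough driver sketch using the old-style org.apache.hadoop.mapred API; the class names (WordCount, WordCountMapper, WordCountReducer) and the job name are placeholders matching the word count example below, not necessarily the exact code we'll look at:

             import org.apache.hadoop.fs.Path;
             import org.apache.hadoop.io.IntWritable;
             import org.apache.hadoop.io.Text;
             import org.apache.hadoop.mapred.FileInputFormat;
             import org.apache.hadoop.mapred.FileOutputFormat;
             import org.apache.hadoop.mapred.JobClient;
             import org.apache.hadoop.mapred.JobConf;

             public class WordCount {
                 public static void main(String[] args) throws Exception {
                     // create a new JobConf object based on the driver class and name the job
                     JobConf conf = new JobConf(WordCount.class);
                     conf.setJobName("wordcount");

                     // key/value output types; these apply to the reduce output (and to the map
                     // output too, unless you also call setMapOutputKeyClass/setMapOutputValueClass)
                     conf.setOutputKeyClass(Text.class);
                     conf.setOutputValueClass(IntWritable.class);

                     // mapper and reducer classes (while debugging the mapper, step 4d, you could
                     // temporarily swap in a no-op/identity reducer here instead)
                     conf.setMapperClass(WordCountMapper.class);
                     conf.setReducerClass(WordCountReducer.class);
                     // optional: conf.setCombinerClass(WordCountReducer.class);

                     // input and output directories, taken from the command line here
                     FileInputFormat.setInputPaths(conf, new Path(args[0]));
                     FileOutputFormat.setOutputPath(conf, new Path(args[1]));

                     // run the job
                     JobClient.runJob(conf);
                 }
             }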

  • input/output types
        - before you can actually write your program you need to figure out what the input and output types should be
          - the input and output are always key/value pairs

       - map
          - the default input to a mapper is
             - key = number (the specific type is LongWritable)
                 - LongWritable is basically a long whose value can be set/changed (Java's Long is immutable); see the small example after this list
                - LongWritable (and IntWritable, BooleanWritable, etc.) are part of the hadoop package
             - value = line (the specific type is Text)
                - Text is basically a mutable String object
          - the output types will depend on what computation you're doing
             - for word count?
                - key = Text
                 - value = IntWritable (could actually use almost anything here)
       - reduce
          - the input to reduce will always be the output from the map function, specifically
             - input key type = map output key type
             - input value type = Iterator<map output value type>
           - the output of reduce will depend on the task (but the key is often the same as the input key)
             - for word count?
                - key = Text (the word)
                - value = IntWritable
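
        - as a small aside, here is a tiny (non-mapreduce) example of why these Writable wrapper types are handy: unlike Java's String and Long, they are mutable, so the same object can be reused over and over (the class name WritableDemo is made up for illustration):

             import org.apache.hadoop.io.IntWritable;
             import org.apache.hadoop.io.LongWritable;
             import org.apache.hadoop.io.Text;

             public class WritableDemo {
                 public static void main(String[] args) {
                     // LongWritable/IntWritable wrap a primitive value that can be changed in place
                     LongWritable offset = new LongWritable(0);
                     offset.set(1024);                  // reuse the same object with a new value
                     System.out.println(offset.get());  // 1024

                     IntWritable count = new IntWritable(1);
                     count.set(count.get() + 1);
                     System.out.println(count.get());   // 2

                     // Text is basically a mutable String
                     Text word = new Text();
                     word.set("hadoop");
                     System.out.println(word);          // hadoop
                 }
             }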

  • general overview: first, let's look at how we can break this down into a map step and a reduce step
       - map step for word count?
          - input is a line
          - two output options
              - option 1: word -> # of occurrences in this line
              - option 2: word -> 1 for each word in the line
           - either option is fine, however, most often we'll choose option 2
             - simpler to implement
             - you want the map function to be as fast and as simple as possible
             - you want to avoid having to declare/create new objects since this takes time
                - remember that this map function is going to be called for *every* line in the data
             - you want the processing for each call to map to be as consistent as possible
       - reduce step
          - the reduce step gets as input the output from the map step with the values aggregated into an iterator per key
             - in our case: word -> 1, 1, 1, 1, 1, 1, 1, 1 (an iterator with a bunch of 1s)
           - all we need to do is sum these up and output a pair word -> sum (see the reducer sketch below)

       - driver
          - the driver just sets a bunch of configuration parameters required to get the job going
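
        - a sketch of what the word count reduce function might look like with the old-style org.apache.hadoop.mapred API (the class name WordCountReducer is a placeholder; the code we look at in class may differ in details):

             import java.io.IOException;
             import java.util.Iterator;

             import org.apache.hadoop.io.IntWritable;
             import org.apache.hadoop.io.Text;
             import org.apache.hadoop.mapred.MapReduceBase;
             import org.apache.hadoop.mapred.OutputCollector;
             import org.apache.hadoop.mapred.Reducer;
             import org.apache.hadoop.mapred.Reporter;

             public class WordCountReducer extends MapReduceBase
                     implements Reducer<Text, IntWritable, Text, IntWritable> {

                 public void reduce(Text key, Iterator<IntWritable> values,
                                    OutputCollector<Text, IntWritable> output, Reporter reporter)
                         throws IOException {
                     // sum up the 1s associated with this word
                     int sum = 0;
                     while (values.hasNext()) {
                         sum += values.next().get();
                     }
                     // emit word -> total count
                     output.collect(key, new IntWritable(sum));
                 }
             }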



  • look at WordCount code
       
  • look at WordCount example
       - both the map and reduce function MUST be written in their own classes
          - the map function should be in a class that implements Mapper
             - three methods to implement: map, close and configure
             - often we'll extend MapReduceBase which has default methods for close and configure
          - the reduce function should be in a class that implements Reducer
             - three methods to implement: reduce, close and configure
             - often we'll extend MapReduceBase again
          - two options for generating these classes
             - stand alone classes
             - as static classes inside another class
                 - for simple programs (and ones where we don't need any extra state) this is a nice option

       - WordCountMapper class
          - when implementing the Mapper interface, we need to supply the types for the input/output pairs
          - then we just have to implement the map function
             - takes 4 parameters
                - first is the input key
                 - second is the input value
                 - third is the output collector, which is where we put all of the key/value pairs that we're *outputting*
                - fourth is a reporter, which we'll talk about later
             - functionality:
                - split up the line into words
                - for each word, add an output pair word -> 1
                - why do we have the two instance variables?
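
        - here is a sketch of what this mapper might look like (again the old-style mapred API; the real code we look at may differ in details); it also suggests an answer to the instance variable question: word and one are reused on every call to map, so we avoid creating new objects for every single line of input:

             import java.io.IOException;
             import java.util.StringTokenizer;

             import org.apache.hadoop.io.IntWritable;
             import org.apache.hadoop.io.LongWritable;
             import org.apache.hadoop.io.Text;
             import org.apache.hadoop.mapred.MapReduceBase;
             import org.apache.hadoop.mapred.Mapper;
             import org.apache.hadoop.mapred.OutputCollector;
             import org.apache.hadoop.mapred.Reporter;

             public class WordCountMapper extends MapReduceBase
                     implements Mapper<LongWritable, Text, Text, IntWritable> {

                 // the two instance variables: reused across calls so we don't allocate
                 // a new Text/IntWritable for every word of every line
                 private final IntWritable one = new IntWritable(1);
                 private final Text word = new Text();

                 public void map(LongWritable key, Text value,
                                 OutputCollector<Text, IntWritable> output, Reporter reporter)
                         throws IOException {
                     // split the line up into words
                     StringTokenizer tokenizer = new StringTokenizer(value.toString());
                     while (tokenizer.hasMoreTokens()) {
                         // for each word, output the pair word -> 1
                         word.set(tokenizer.nextToken());
                         output.collect(word, one);
                     }
                 }
             }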