Machine Learning - Fall 2013

Machine Learning - Fall 2013 - Class 27

Our hadoop cluster
   - not particularly robust :(
   - don't save any files on there that are important
   - remember that it is a shared resource among the class AND other students using the computers

makefiles
   - allow you to compile, generate jar files, etc. more easily :)
   - manual: http://www.gnu.org/software/make/manual/
   - I've provided a reasonable one to start from in the Demos directory

Combiner
   - what does the combiner do?
   - the combiner
      - is a reducer that runs locally on the map machine
      - it only runs on the output of the local map
   - why are combiners useful?
      - the "slow" steps for map reduce are:
         - reading/writing from disks
         - sending data across the network
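         - the sort
      - a combiner shrinks the map output before it is written, sent, and sorted, so it cuts the cost of all three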
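
To make the effect of a combiner concrete, here is a minimal sketch in plain Java. It does not use Hadoop at all: CombinerSketch, map, and combine are hypothetical names invented for this illustration, and word count is assumed as the job. It only shows that summing locally leaves fewer pairs to write, send, and sort.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation (no Hadoop): shows how a combiner shrinks the
// map output before it would be written, sent, and sorted.
final class CombinerSketch {

    // Map step for word count: emit (word, 1) for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(Map.entry(word, 1));
            }
        }
        return out;
    }

    // Combine step: the same summing logic a reducer would use, but applied
    // only to the pairs produced on this one (simulated) map machine.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = map("the cat saw the dog chase the cat");
        Map<String, Integer> combined = combine(mapOutput);
        System.out.println("pairs without combiner: " + mapOutput.size());
        System.out.println("pairs with combiner:    " + combined.size());
        System.out.println(combined);
    }
}
```

Here the 8 pairs from the map step collapse to 5 after combining. This only works because addition can be applied in stages; a combiner is safe when running it zero, one, or many times does not change the final reduce result.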

overall MapReduce framework:
   - http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
   - (this link also has some more nice MapReduce examples)

Within the code you write, this is how the program flows:
   - start in the main method (like all java programs)
      - this is still run on whatever machine you ran "hadoop jar" on until it reaches a "runJob" call
      - this should be your "driver"
   - runJob = start running on the hadoop cluster
   - run the map phase
      - the mapper is instantiated using the constructor
         - needs to have a zero-parameter constructor! (java provides one by default, but only if you don't define any other constructor)
         - what if your mapper needs a parameter?
      - the configure method
         - called before any of the calls to the map function
         - it is passed the JobConf configuration object you constructed in your driver
         - the JobConf object allows you to set arbitrary attributes
            - the general "set" method, sets a string
            - other "set" methods exist, though, for setting other types:
               - setBoolean
               - setInt
               - setFloat
               - setLong
         - you can then grab these "set" attributes in the configure method using get
            - good practice to use a global constant (e.g. a static final String) for the name of this configuration parameter, so the driver and the mapper cannot disagree on it
      - finally, for each item in the input, the map function is called
   - the combiner and reducer are then run in similar fashion: each is instantiated, then configure is called, then the reduce method is called for each key
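
The flow above (zero-parameter constructor, then configure, then map once per item) can be sketched in plain Java. This is a simulation, not Hadoop's API: SimpleConf is a hypothetical stand-in for JobConf, and LengthFilterMapper, MIN_LENGTH, and runMapPhase are names invented for the illustration. The point is why the parameter has to travel through the configuration object rather than the constructor.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation (no Hadoop) of the mapper life cycle:
// zero-parameter constructor, then configure, then one map call per input.
final class LifecycleSketch {

    // Hypothetical stand-in for JobConf: everything is stored as a string.
    static final class SimpleConf {
        private final Map<String, String> values = new HashMap<>();

        void set(String name, String value) { values.put(name, value); }
        void setInt(String name, int value) { values.put(name, Integer.toString(value)); }

        String get(String name) { return values.get(name); }
        int getInt(String name, int defaultValue) {
            String raw = values.get(name);
            return raw == null ? defaultValue : Integer.parseInt(raw);
        }
    }

    // Hypothetical mapper that keeps only lines of at least MIN_LENGTH characters.
    static final class LengthFilterMapper {
        // One shared constant for the parameter name, used by driver and mapper.
        static final String MIN_LENGTH = "sketch.min.length";

        private int minLength;
        final List<String> output = new ArrayList<>();

        // The framework creates the mapper itself, so no arguments can be passed here.
        LengthFilterMapper() { }

        // Called once, before any map call; this is where parameters arrive.
        void configure(SimpleConf conf) { minLength = conf.getInt(MIN_LENGTH, 0); }

        // Called once per input item.
        void map(String line) {
            if (line.length() >= minLength) {
                output.add(line);
            }
        }
    }

    // What the framework does on the cluster side, in miniature.
    static List<String> runMapPhase(SimpleConf conf, List<String> input) {
        LengthFilterMapper mapper;
        try {
            // Reflection needs a zero-parameter constructor to exist.
            mapper = LengthFilterMapper.class.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("mapper needs a zero-parameter constructor", e);
        }
        mapper.configure(conf);
        for (String line : input) {
            mapper.map(line);
        }
        return mapper.output;
    }

    public static void main(String[] args) {
        // Driver side: store the parameter in the configuration object.
        SimpleConf conf = new SimpleConf();
        conf.setInt(LengthFilterMapper.MIN_LENGTH, 4);
        System.out.println(runMapPhase(conf, List.of("a", "hadoop", "map", "reduce")));
    }
}
```

In a real job the driver and the mapper run on different machines, which is why the value has to be stored in the configuration object and cannot simply be kept in a field set by the driver.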

Grep
   - find a particular string in a file
   - how do we pose this as a MapReduce problem?
      - map phase
         - input: key = byte offset of file, value = line in the file
         - output: key = byte offset, value = line IF it contains the word
            - we also could do key = byte offset AND filename
               - use the reporter to get the filename

      - reduce phase: NoOpReducer!
   - do either of the phases need additional information?
      - the map phase needs the word!
      - we can pass this in through the JobConf and read it in the configure method
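
As a rough illustration of grep as a map-only job, here is a plain-Java sketch that does not depend on Hadoop. GrepSketch and its methods are hypothetical names, the search word is passed as an ordinary argument instead of through the JobConf, and lines are assumed to end in a single "\n" and to be UTF-8 encoded when computing byte offsets.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java simulation (no Hadoop) of grep as a map-only job.
final class GrepSketch {

    // Map function: emit (byte offset, line) only if the line contains the word.
    static void map(long offset, String line, String word, Map<Long, String> output) {
        if (line.contains(word)) {
            output.put(offset, line);
        }
    }

    // Feed the text to map one line at a time, keyed by the byte offset
    // at which each line starts.
    static Map<Long, String> grep(String text, String word) {
        Map<Long, String> output = new LinkedHashMap<>();
        long offset = 0;
        for (String line : text.split("\n", -1)) {
            map(offset, line, word, output);
            // +1 accounts for the newline that split() removed.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return output;
    }

    public static void main(String[] args) {
        String text = "map reduce\nhadoop cluster\nreduce phase\n";
        System.out.println(grep(text, "reduce"));
    }
}
```

There is no reduce step in the sketch because nothing needs to be aggregated: each matching line is already a finished output record.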

Grep with the filename
   - the above just includes the byte offset in the file
      - this is really only useful if we have a single file
   - what we'd really like is to figure out which file it occurred in
   - how can we do this?
      - the reporter gives us access to the split, which gives us information about which chunk of the data we're processing
      - can get the filename from this
   - look at the GrepWithFilename example
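
A minimal plain-Java sketch of the same idea, again without Hadoop. SplitInfo is a hypothetical stand-in for the split the reporter exposes, and the "filename:offset" key format is an assumption made for this illustration; the GrepWithFilename example may format its key differently. In the old org.apache.hadoop.mapred API the filename is usually obtained with something like ((FileSplit) reporter.getInputSplit()).getPath().getName(), but check that against the example in the Demos directory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java simulation (no Hadoop) of grep that also reports the filename.
final class GrepWithFilenameSketch {

    // Hypothetical stand-in for the split information the reporter exposes.
    static final class SplitInfo {
        final String fileName;
        SplitInfo(String fileName) { this.fileName = fileName; }
    }

    // Map function: the output key is "filename:offset" instead of the offset alone.
    static void map(SplitInfo split, long offset, String line, String word,
                    Map<String, String> output) {
        if (line.contains(word)) {
            output.put(split.fileName + ":" + offset, line);
        }
    }

    public static void main(String[] args) {
        Map<String, String> output = new LinkedHashMap<>();
        // Two files can both match at offset 0; the filename keeps the keys distinct.
        map(new SplitInfo("a.txt"), 0, "reduce phase", "reduce", output);
        map(new SplitInfo("b.txt"), 0, "reduce again", "reduce", output);
        map(new SplitInfo("b.txt"), 13, "map phase", "reduce", output);
        System.out.println(output);
    }
}
```

With the offset alone as the key, the two matches at offset 0 would be indistinguishable; adding the filename is what makes the output useful across more than one input file.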