Machine Learning - Fall 2013 - Class 27

  • Our hadoop cluster
       - not particularly robust :(
       - don't save any files on there that are important
       - remember that it is a shared resource among the class AND other students using the computers

  • makefiles
       - make it easier to compile, generate jar files, etc. :)
       - manual: http://www.gnu.org/software/make/manual/
       - I've provided a reasonable one to start from in the Demos directory
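       - a rough sketch of what such a Makefile might look like (NOT the Demos
         version; "Grep.java" and the use of `hadoop classpath` here are
         assumptions):

             # minimal Makefile sketch for building a Hadoop job jar
             # note: recipe lines must begin with a tab character
             SOURCES = Grep.java
             CLASSES = $(SOURCES:.java=.class)

             grep.jar: $(CLASSES)
             	jar cf grep.jar *.class

             %.class: %.java
             	javac -cp `hadoop classpath` $<

             clean:
             	rm -f *.class grep.jar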

  • Combiner
       - what does the combiner do?
       - the combiner
          - is a reducer that runs locally on the map machine
          - it runs only on the output of the local map, before anything is sent over the network
       - why are combiners useful?
          - the "slow" steps for MapReduce are:
             - reading/writing from disk
             - sending data across the network
             - the sort
          - by reducing locally first, the combiner shrinks the map output before it hits all three of these steps (see the sketch below)
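       - a hedged sketch of wiring a combiner into a driver, using the old
         org.apache.hadoop.mapred API; WordCount, WordCountMapper, and
         SumReducer are hypothetical classes (the mapper emits (word, 1)
         pairs, the reducer sums the counts for each key):

             JobConf conf = new JobConf(WordCount.class);
             conf.setMapperClass(WordCountMapper.class);
             // run the reducer locally on each mapper's output...
             conf.setCombinerClass(SumReducer.class);
             // ...and again globally after the shuffle
             conf.setReducerClass(SumReducer.class);

         summing can safely be applied twice (once locally, once globally),
         which is what makes this reducer usable as a combiner; far fewer
         (word, count) pairs then need to be written, sorted, and shipped
         across the network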

  • overall MapReduce framework:
       - http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
          - (this link also has some more nice MapReduce examples)

  • Within the code you write, this is how the program flows:
       - start in the main method (like all java programs)
          - this is still run on whatever machine you ran "hadoop jar" on until it reaches a "runJob" call
          - this should be your "driver"
       - runJob = start running on the hadoop cluster
       - run the map phase
          - the mapper is instantiated using the constructor
             - needs to have a zero-parameter constructor! (if you don't define any constructors, java supplies one by default)
             - what if your mapper needs a parameter?
          - the configure method
             - called before any of the calls to the map function
             - it is passed the JobConf configuration object you constructed in your driver
             - the JobConf object allows you to set arbitrary attributes
                - the general "set" method sets a string value
                - other "set" methods exist for setting other types:
                   - setBoolean
                   - setInt
                   - setFloat
                   - setLong
             - you can then grab these "set" attributes in the configure method using the matching "get" methods
                - good practice: use a shared constant for the name of this configuration parameter (see the sketch after this list)
          - finally, for each item in the input, the map function is called
       - the combiner and reducer are then run in similar fashion: instantiate, then call configure, then call the reduce method
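       - a minimal sketch of the set/configure round trip (old
         org.apache.hadoop.mapred API; the class and attribute names are made
         up for illustration):

             import java.io.IOException;
             import org.apache.hadoop.io.LongWritable;
             import org.apache.hadoop.io.Text;
             import org.apache.hadoop.mapred.JobConf;
             import org.apache.hadoop.mapred.MapReduceBase;
             import org.apache.hadoop.mapred.Mapper;
             import org.apache.hadoop.mapred.OutputCollector;
             import org.apache.hadoop.mapred.Reporter;

             public class GrepMapper extends MapReduceBase
                     implements Mapper<LongWritable, Text, LongWritable, Text> {

                 // good practice: one shared constant for the attribute name,
                 // used by both the driver and this mapper
                 public static final String WORD_KEY = "grep.search.word";

                 private String word;

                 // called once, before any calls to map(), with the JobConf
                 // you built in the driver
                 @Override
                 public void configure(JobConf job) {
                     word = job.get(WORD_KEY);
                 }

                 public void map(LongWritable key, Text value,
                                 OutputCollector<LongWritable, Text> output,
                                 Reporter reporter) throws IOException {
                     // map() can now use "word" (see the Grep section below)
                 }
             }

         and in the driver, before the runJob call:

             JobConf conf = new JobConf(Grep.class);   // Grep = hypothetical driver class
             conf.set(GrepMapper.WORD_KEY, args[2]);   // string attribute
             conf.setInt("grep.max.matches", 10);      // typed setters also exist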

  • Grep
       - find a particular string in a file
       - how do we pose this as a MapReduce problem?
          - map phase
             - input: key = byte offset into the file, value = line in the file
             - output: key = byte offset, value = line IF it contains the word
                - we could also use key = byte offset AND filename
                   - use the reporter to get the filename

          - reduce phase: NoOpReducer!
       - does either phase need additional information?
          - the map phase needs the word!
          - we can pass it in through the JobConf in the driver and grab it in configure (see the sketch below)
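       - a hedged sketch of just the map function, assuming "word" was loaded
         in configure() as in the GrepMapper sketch above:

             public void map(LongWritable key, Text value,
                             OutputCollector<LongWritable, Text> output,
                             Reporter reporter) throws IOException {
                 // key = byte offset into the file, value = one line of the file
                 if (value.toString().contains(word)) {
                     output.collect(key, value);  // emit the line only if it matches
                 }
             }

         if you don't have your own NoOpReducer, Hadoop's built-in
         org.apache.hadoop.mapred.lib.IdentityReducer gives the same
         pass-through behavior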

  • Grep with the filename
       - the above just includes the byte offset in the file
          - this is really only useful if we have a single file
       - what we'd really like is to figure out which file it occurred in
       - how can we do this?
          - the reporter gives us access to the split, which tells us which chunk of the data we're processing
          - can get the filename from this
       - look at the GrepWithFilename example
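       - a hedged sketch of that map function (old org.apache.hadoop.mapred
         API; it needs import org.apache.hadoop.mapred.FileSplit, and the
         driver must now set the map output key class to Text):

             public void map(LongWritable key, Text value,
                             OutputCollector<Text, Text> output,
                             Reporter reporter) throws IOException {
                 // the reporter hands back the InputSplit being processed; for
                 // file input it is a FileSplit, which knows the file's path
                 FileSplit split = (FileSplit) reporter.getInputSplit();
                 String filename = split.getPath().getName();

                 if (value.toString().contains(word)) {
                     // key = filename AND byte offset, value = the matching line
                     output.collect(new Text(filename + ":" + key.get()), value);
                 }
             }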