Machine Learning - Fall 2013 - Class 27
Our hadoop cluster
- not particularly robust :(
- don't save any files on there that are important
- remember that it is a shared resource among the class AND other students using the computers
- allow you to compile, generate jar files, etc. easier :)
- I've provided a reasonable one to start from in the Demos directory
- what does the combiner do?
- the combiner
- is a reducer that runs locally on the map machine
- it only runs on the output of the local map
- why are combiners useful?
- the "slow" steps for map reduce are:
- reading/writing from disks
- sending data across the network
- the sort
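To make the combiner's benefit concrete, here is a minimal plain-Java sketch (no Hadoop classes) of word count on one map machine. The class and method names are made up for illustration. The combine step sums counts locally, so fewer pairs have to be written to disk, sorted and sent across the network.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation; CombinerSketch and its methods are hypothetical names.
class CombinerSketch {

    // "Map" step for word count: emit a (word, 1) pair for every word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // "Combine" step: the same summing a reducer would do, but applied only
    // to the pairs produced by the local map.
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            sums.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = map("the cat saw the dog and the bird");
        Map<String, Integer> combined = combine(mapOutput);
        // Each pair here is one record that would otherwise leave the map machine.
        System.out.println("pairs without combiner: " + mapOutput.size());
        System.out.println("pairs with combiner:    " + combined.size());
    }
}
```

The saving grows with how often keys repeat within a single map task's output.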
overall MapReduce framework:
- (this link also has some more nice MapReduce examples)
Within the code you write, this is how the program flows:
- start in the main method (like all java programs)
- this is still run on whatever machine you ran "hadoop jar" on until it reaches a "runJob" call
- this should be your "driver"
- runJob = start running on the hadoop cluster
- run the map phase
- the mapper is instantiated using the constructor
- needs to have a zero-parameter constructor! (if you don't provide one, java does this by default)
- what if your mapper needs a parameter?
- the configure method
- called before any of the calls to the map function
- it is passed the JobConf configuration object you constructed in your driver
- the JobConf object allows you to set arbitrary attributes
- the general "set" method sets a string
- other "set" methods exist, though, for setting other types (e.g. setInt, setBoolean)
- you can then grab these "set" attributes in the configure method using get
- good practice to use a global variable for the name of this configuration parameter
- finally, for each item in the input, the map function is called
- the combiner and reducer are then run in similar fashion: instantiating, then calling configure, then the reduce method
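The flow above can be sketched in plain Java without Hadoop. This is only a simulation under stated assumptions: a Map<String, String> stands in for the JobConf, runMapPhase stands in for the framework, and PrefixMapper and the parameter name "sketch.prefix" are hypothetical. It shows the order that matters: zero-parameter constructor, then configure once, then map once per input item.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the mapper lifecycle; all names are hypothetical.
class LifecycleSketch {

    // One shared constant for the parameter name, used by both driver and mapper.
    static final String PREFIX_PARAM = "sketch.prefix";

    // Stand-in for a mapper class.
    static class PrefixMapper {
        private String prefix = "";

        // The framework can only call a zero-parameter constructor.
        public PrefixMapper() { }

        // Called once, before any map call, with the driver's configuration.
        void configure(Map<String, String> conf) {
            prefix = conf.getOrDefault(PREFIX_PARAM, "");
        }

        // Called once per input item.
        String map(String value) {
            return prefix + value;
        }
    }

    // Stand-in for the framework: instantiate by reflection, configure, then map.
    static List<String> runMapPhase(Map<String, String> conf, List<String> input) {
        PrefixMapper mapper;
        try {
            mapper = PrefixMapper.class.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("mapper needs a zero-parameter constructor", e);
        }
        mapper.configure(conf);
        List<String> output = new ArrayList<>();
        for (String item : input) {
            output.add(mapper.map(item));
        }
        return output;
    }

    // The "driver": build the configuration, then hand off to the framework.
    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(PREFIX_PARAM, ">> ");
        System.out.println(runMapPhase(conf, Arrays.asList("a", "b")));
    }
}
```

Because the framework builds the mapper itself, the only way to get a parameter into it is through the configuration object, which is why configure exists.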
- find a particular string in a file
- how do we pose this as a MapReduce problem?
- map phase
- input: key = byte offset of file, value = line in the file
- output: key = byte offset, value = line IF it contains the word
- we also could do key = byte offset AND filename
- use the reporter to get the filename
- reduce phase: NoOpReducer!
- do either of the phases need additional information?
- the map phase needs the word!
- we can pass this by setting it on the JobConf in the driver and reading it in the configure method
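As a rough illustration of the map phase described above, here is a plain-Java sketch (no Hadoop classes; GrepSketch and grepMap are made-up names). It walks the lines of one file, tracks the byte offset of each line, and emits (offset, line) only when the line contains the word. In a real job the word would arrive through the configuration. Here it is just a method parameter, and the sketch assumes UTF-8 text with "\n" line endings.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java simulation of the grep map phase; names are hypothetical.
class GrepSketch {

    // Returns offset -> line for every line of the file that contains the word.
    static Map<Long, String> grepMap(String fileContents, String word) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : fileContents.split("\n", -1)) {
            if (line.contains(word)) {
                out.put(offset, line);
            }
            // Advance past this line's bytes plus the newline that ended it.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        String file = "map reduce\nhello world\nreduce again\n";
        for (Map.Entry<Long, String> hit : grepMap(file, "reduce").entrySet()) {
            System.out.println(hit.getKey() + "\t" + hit.getValue());
        }
    }
}
```

Since the map output is already the final answer, the reduce phase has nothing to add, which is why a no-op reducer is enough.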
Grep with the filename
- the above just includes the byte offset in the file
- this is really only useful if we have a single file
- what we'd really like is to figure out which file it occurred in
- how can we do this?
- the reporter gives us access to the split, which gives us information about which chunk of the data we're processing
- can get the filename from this
- look at the GrepWithFilename example
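The idea can be sketched in plain Java without Hadoop. This is not the GrepWithFilename demo itself: the Split class below is a hypothetical stand-in for the split information the reporter provides, and the "filename:offset" key format is just one possible choice. In the old org.apache.hadoop.mapred API the split is typically obtained with reporter.getInputSplit() and cast to FileSplit to reach the path, but check that against the demo code.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation of grep with filenames; all names are hypothetical.
class GrepWithFilenameSketch {

    // Stand-in for the split: which file this chunk came from, plus its text.
    static class Split {
        final String filename;
        final String contents;

        Split(String filename, String contents) {
            this.filename = filename;
            this.contents = contents;
        }
    }

    // Emits "filename:offset<TAB>line" for every matching line in the split.
    static List<String> grepMap(Split split, String word) {
        List<String> out = new ArrayList<>();
        long offset = 0;
        for (String line : split.contents.split("\n", -1)) {
            if (line.contains(word)) {
                out.add(split.filename + ":" + offset + "\t" + line);
            }
            // Advance past this line's bytes plus the newline that ended it.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> hits = new ArrayList<>();
        hits.addAll(grepMap(new Split("a.txt", "foo\nbar reduce\n"), "reduce"));
        hits.addAll(grepMap(new Split("b.txt", "reduce\n"), "reduce"));
        for (String hit : hits) {
            System.out.println(hit);
        }
    }
}
```

With the filename in the key, matches from different files no longer collide on the same byte offset.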