Machine Learning - Fall 2025 - Class 20

  • admin
       - Midterm 2 will be posted on Monday
       
       - Final project proposals due next Tuesday (11/11)

  • MapReduce framework
       - http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
          - (this link also has some more nice MapReduce examples)

  • Inverted index
       - Given multiple files, want to create an inverted index:
          - word -> files with that word in it
       - How can we set this up as a MapReduce problem?
          - Note, you can get access to the name of the file when you're processing a line in that file

       - Map:
          - Input:
             - key: LongWritable
             - value: Text
          - Output:
             - key: Text
             - value: Text

             - Split line into words, output <word, filename>
       - Reduce:
          - Input:
             - key: Text
             - value: Text
          - Output:
             - key: Text
             - value: Text
          - Each call to reduce receives one word along with all of the filenames that word occurred in. The only challenge is that the list can contain duplicates (a word can appear many times in the same file), so we need to remove them. We can do that using a HashSet or by sorting. Final output is one record per call to reduce: <word, filename1, filename2, ...>

       - Look at LineIndexer code
          - Key: use the reporter to get information about the file being processed
          - Sometimes we can't avoid using a data structure (or we could do it as another mapreduce dedup phase)
          - This builds an inverted index, which is a key structure for how search engines work
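       - Sketch of the inverted index job. This is a plain-Python simulation of the map, shuffle and reduce steps described above, not the LineIndexer code and not Hadoop API calls. The function names and the two tiny input files are made up for illustration, and the set-then-sort step stands in for the HashSet dedup.

```python
from collections import defaultdict


def map_line(filename, line):
    """Map: emit a (word, filename) pair for every word on the line."""
    return [(word, filename) for word in line.split()]


def shuffle(pairs):
    """Group values by key and order the keys, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_word(word, filenames):
    """Reduce: drop duplicate filenames with a set, then sort for stable output."""
    return word, sorted(set(filenames))


def inverted_index(files):
    """Run map, shuffle and reduce over a {filename: text} dictionary."""
    pairs = []
    for filename, text in files.items():
        for line in text.splitlines():
            pairs.extend(map_line(filename, line))
    return dict(reduce_word(word, names) for word, names in shuffle(pairs))


# Hypothetical input files, invented for this example.
files = {
    "a.txt": "the cat sat\nthe cat ran",
    "b.txt": "the dog sat",
}
index = inverted_index(files)
print(index["the"])  # ['a.txt', 'b.txt']
print(index["cat"])  # ['a.txt']
```

       - Without the set in reduce_word, "the" would list a.txt twice, once for each line it appears on.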


  • Word frequencies
       - Word frequencies tend to follow a common pattern (Zipf's law):
          - https://phys.org/news/2017-08-unzipping-zipf-law-solution-century-old.html
       - One side effect of this is that there are a few words that are very frequent, a few more words that are moderately frequent, and lots of words that are infrequent
       - We would like to create a histogram of word frequencies, specifically, data of the form:
          - frequency -> how many words had that frequency
       - How can we do this with MapReduce?
          - First, run our word count MapReduce job:
             - will output word -> frequency

          - Then, chain another MapReduce phase:
             - Map:
                - Input (output from the word count job):
                   - key: Text
                   - value: IntWritable
                - Output:
                   - key: IntWritable
                   - value: IntWritable

                - For each input of <word, freq> output <freq, 1>
             - Reduce:
                - Input:
                   - key: IntWritable
                   - value: IntWritable
                - Output:
                   - key: IntWritable
                   - value: IntWritable
                - Sum reducer: add up all of the 1s and output <freq, sum>. As an added bonus, since the keys are sorted between map and reduce, each reducer's output will be in sorted order from lowest frequency to highest frequency
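          - Sketch of the two chained phases. This is a plain-Python simulation, not the course's Hadoop code; run_job is a hypothetical stand-in for the framework (map, group by key, sort keys, reduce) and the input lines are made up.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job: map, group by key, sort keys, reduce."""
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return [reducer(key, values) for key, values in sorted(grouped.items())]


def word_count_map(line):
    """Job 1 map: emit (word, 1) for each word on the line."""
    return [(word, 1) for word in line.split()]


def histogram_map(word_freq):
    """Job 2 map: turn (word, freq) into (freq, 1)."""
    _word, freq = word_freq
    return [(freq, 1)]


def sum_reduce(key, values):
    """Generic sum reducer, shared by both jobs."""
    return key, sum(values)


lines = ["a b a c", "a b d"]  # made-up input
word_counts = run_job(lines, word_count_map, sum_reduce)
histogram = run_job(word_counts, histogram_map, sum_reduce)
print(word_counts)  # [('a', 3), ('b', 2), ('c', 1), ('d', 1)]
print(histogram)    # [(1, 2), (2, 1), (3, 1)]
```

          - The same sum_reduce serves both jobs: only the mapper changes between the word count phase and the histogram phase.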


  • Look at SimpleWordFreqHistogram code
       - Takes the output from WordCount and counts how many words have each frequency
       - Note the use of separate files/classes for the mapper, reducer and driver
       - Note also the use of the generic "SumReducer" reducer

  • Look at WordFreqHistogram code for full pipeline
       - We can run multiple MapReduce jobs by calling their driver methods in series, in this case WordCount and then SimpleWordFreqHistogram
       - We use an intermediate directory (working) to connect the two jobs: it is the output directory of the first job and the input directory of the second.
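       - Sketch of chaining two jobs through a working directory. This is a plain-Python illustration using local files, not the WordFreqHistogram driver; the function names, the directory names other than "working", and the part-00000 output file name are assumptions made for the example.

```python
import tempfile
from collections import Counter
from pathlib import Path


def read_lines(directory):
    """Yield every line of every file in a directory."""
    for path in sorted(Path(directory).iterdir()):
        yield from path.read_text().splitlines()


def write_pairs(directory, pairs):
    """Write key<TAB>value lines to one output file in a new directory."""
    Path(directory).mkdir()  # raises an error if the directory already exists
    text = "".join(f"{key}\t{value}\n" for key, value in pairs)
    (Path(directory) / "part-00000").write_text(text)


def word_count_job(input_dir, output_dir):
    """First job: word -> frequency."""
    counts = Counter(word for line in read_lines(input_dir) for word in line.split())
    write_pairs(output_dir, sorted(counts.items()))


def histogram_job(input_dir, output_dir):
    """Second job: frequency -> number of words with that frequency."""
    freqs = Counter(int(line.split("\t")[1]) for line in read_lines(input_dir))
    write_pairs(output_dir, sorted(freqs.items()))


def run_pipeline(input_dir, working_dir, output_dir):
    """Run the two jobs in series; working_dir links job 1 to job 2."""
    word_count_job(input_dir, working_dir)
    histogram_job(working_dir, output_dir)


with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    (root / "input").mkdir()
    (root / "input" / "doc.txt").write_text("a b a c\na b d\n")  # made-up input
    run_pipeline(root / "input", root / "working", root / "output")
    result = (root / "output" / "part-00000").read_text()

print(result, end="")
```

       - The second job never sees the original text; it only reads what the first job wrote into the working directory.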