Machine Learning - Fall 2013

Machine Learning - Fall 2016 - Class 19

admin
   - assignment 8
      - two parts
         - first part due after a week (pseudocode + get hadoop up and running)
         - second part due after two weeks (code up NB training on hadoop... it's easier than it sounds)

hadoop video: https://www.youtube.com/watch?v=irK7xHUmkUA&list=PLAwxTw4SYaPkXJ6LAV96gH8yxIfGaN3H-&index=20
- videos 20 to 35

quick review of hadoop basics
   - consists of two key components
      - HDFS: Hadoop Distributed File System
         - allows us to store large amounts of data
         - distributes the files across multiple computers
         - stores redundant copies of the data for reliability
      - MapReduce framework
         - computational framework to allow programmers to analyze very large data sets
         - splits up the computation across multiple computers
         - tries to take advantage of data locality based on the HDFS
         - two key phases: map phase and reduce phase
         - look at figures on: http://developer.yahoo.com/hadoop/tutorial/module1.html

   - these two components are what make up a hadoop cluster

   - from a programmer's perspective, there are a number of other tools
      - there is an extensive set of libraries that dictate how you interact with the system
         - we'll see some of these as time goes on
      - there are a number of programs that allow us to interact with the cluster, e.g.
         - interact with the HDFS
         - submit jobs to the system
         - check the status of the system
         - ...

Virtual machines
   - A virtual machine is a simulation of a computer (and all its parts) on another machine
   - Very commonplace now
   - We're going to use a VM to run our hadoop cluster
      - This simplifies things so everyone has their own instance of hadoop running
      - Makes installation *much* easier
      - Only downside is that you won't really see major speedups, since everything runs on a single machine

HDFS basics
   - it is only a file system
      - in particular there is no operating system, no shell, etc.
   - it is stored within the local file system
   - it is a distributed file system so it is distributed across the local file systems of all of the computers in the cluster
      - this is what allows it to store very large data sets
   - to interact with it you use a call to hadoop, specifically:

      > hdfs dfs <some_command>

   - because there is no shell, there is no current working directory, therefore
      - no cd
      - all commands either
         - require the full path, that is starting with /
         - or are assumed to start in the user's home directory

HDFS file structure
   - like the normal linux file structure, everything starts at /
   - instead of having "home" directories, user files go in a directory called "/user"
   - the default location for interactions with the HDFS is:
      /user/<username>

      for our VM, the username is "training", so all the files will go in:

      /user/training/

      - you can see all the user directories by typing

      > hdfs dfs -ls /user/
      Found 3 items
      drwxr-xr-x - hue supergroup 0 2013-09-05 20:08 /user/hive
      drwxr-xr-x - hue hue 0 2013-09-10 10:37 /user/hue
      drwxr-xr-x - training supergroup
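
The path rules above (full paths start with /, everything else starts in /user/<username>) can be sketched in a few lines of Python. This is only an illustration of the rule, not how hadoop implements it; the function name resolve_hdfs_path is made up for this example.

```python
import posixpath

def resolve_hdfs_path(path, username):
    """Return the absolute HDFS path that a command argument refers to.

    Hypothetical helper: it only mirrors the rule described in the notes.
    """
    if path.startswith("/"):
        # full paths are used as given
        return posixpath.normpath(path)
    # anything else is taken relative to the user's HDFS home directory
    return posixpath.normpath(posixpath.join("/user", username, path))

relative = resolve_hdfs_path("temp/file1.txt", "training")
absolute = resolve_hdfs_path("/user/hue", "training")
home = resolve_hdfs_path("", "training")
print(relative)  # /user/training/temp/file1.txt
print(absolute)  # /user/hue
print(home)      # /user/training
```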

HDFS commands
   - HDFS supports many of the basic file management commands
   - You can see a list of the dfs commands online at: http://developer.yahoo.com/hadoop/tutorial/module2.html#commandref or simply by typing:
      > hdfs dfs

HDFS basic commands
   - -ls: lists the files in the directory
      - by default, it just gives an ls of your user directory:

      > hdfs dfs -ls

      - alternatively, you can specify a path and it will give an ls of that directory

      > hdfs dfs -ls /user

   - -ls -R: lists the files in the directory AND recurses and lists all subdirectories
      - the default input to many hadoop programs is a directory containing a collection of files, so this can be useful
      - given that you can't do cd and some of the other associated tricks with a traditional terminal, this helps

      > hdfs dfs -ls -R /user/
      drwxr-xr-x - hue supergroup 0 2013-09-05 20:08 /user/hive
      drwxrwxrwx - hue supergroup 0 2013-09-05 20:08 /user/hive/warehouse
      drwxr-xr-x - hue hue 0 2013-09-10 10:37 /user/hue
      drwxrwxrwt - hue hue 0 2013-09-10 10:39 /user/hue/jobsub
      drwx--x--x - training hue 0 2013-09-10 10:37 /user/hue/jobsub/_training_-design-1
      -rw-r--r-- 1 training hue 677 2013-09-10 10:37 /user/hue/jobsub/_training_-design-1/workflow.xml
      drwx--x--x - training hue 0 2013-09-10 10:39 /user/hue/jobsub/_training_-design-2
      -rw-r--r-- 1 training hue 1054 2013-09-10 10:39 /user/hue/jobsub/_training_-design-2/workflow.xml
      drwxr-xr-x - training supergroup 0 2016-11-03 20:48 /user/training

   - -mv: moves files WITHIN the HDFS

   - -cp: similar to mv, but copies files WITHIN the HDFS

   - -rm: removes a particular file or EMPTY directory

   - -mkdir: makes a directory WITHIN the HDFS
      > hdfs dfs -mkdir temp
      > hdfs dfs -ls
      Found 1 items
      drwxr-xr-x - training supergroup 0 2016-11-03 20:50 temp

   - -rm -R
      - similar to rm -r on normal file system
      - deletes the directory, the files in that directory AND recurses
          - be careful!

   - -chmod and -chown: same commands as on the normal file system for handling permissions and ownership

Putting and getting data to the HDFS
   - so far, we've only talked about how to move files around, copy them, etc.
   - because it's a separate file system, there are also special commands for moving files from the current computer's file system TO the HDFS and vice versa

   - putting files onto the HDFS
      - to add files (e.g. files that you want your programs to process) you need to put them onto the HDFS
      - -put
         > hdfs dfs -put file1.txt temp
         (copies file1.txt from the local file system TO temp directory, actually /user/training/temp/ directory, on the HDFS)

         - notice that the first argument is a file/directory on the local file system and the second argument is a location on the HDFS

         > hdfs dfs -ls temp
         Found 1 items
         -rw-r--r-- 1 training supergroup 57 2016-11-03 20:53 temp/file1.txt

   - viewing files on the HDFS
      - if you just want to peek at the contents of a file on the HDFS there are a couple of commands you can use
      - -cat: display the contents of a file (same as cat on normal fs)
         > hdfs dfs -cat temp/file1.txt
         this is the first file
         it has lots of good stuff in it

      - -tail: display the last 1K of the file
         - if the file is very large, you may not want to see the entire thing
         - you can use tail to peek just at the last bit of the file

   - getting files from the HDFS
      - eventually, you may want to actually get files from the HDFS
      - -get: get the specified file OR directory to the local file system
         > hdfs dfs -get temp/file1.txt .
         (copies file1.txt into my current directory)

         > hdfs dfs -get temp .
         (copies the directory and all contents to the current directory)

         - notice that the first argument is a file/directory on the HDFS and the second argument is the location on the local file system

      - -getmerge
         - a common output of a mapreduce program is a directory filled with different files, each representing a portion of the final solution
         - the getmerge function allows us to grab the files in a directory and merge them into a single file on the local file system
         > hdfs dfs -getmerge temp temp.txt
         (copies all of the files in temp into a single file, temp.txt)
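
To make the idea behind -getmerge concrete, here is a small local sketch in Python that concatenates every file in a directory into one file. It runs on the normal file system only and does not talk to the HDFS. The function name merge_directory, the file names part-0 and part-1, and the choice to merge in sorted name order are all assumptions made for this example.

```python
import os
import tempfile

def merge_directory(src_dir, dest_file):
    """Concatenate every regular file in src_dir into dest_file."""
    with open(dest_file, "wb") as out:
        # assumption of this sketch: merge in sorted file-name order
        for name in sorted(os.listdir(src_dir)):
            path = os.path.join(src_dir, name)
            if os.path.isfile(path):
                with open(path, "rb") as part:
                    out.write(part.read())

with tempfile.TemporaryDirectory() as work:
    # build a pretend output directory with two pieces of the answer
    out_dir = os.path.join(work, "temp")
    os.mkdir(out_dir)
    for name, text in [("part-0", "cat 1\n"), ("part-1", "dog 2\n")]:
        with open(os.path.join(out_dir, name), "w") as f:
            f.write(text)

    merged_path = os.path.join(work, "temp.txt")
    merge_directory(out_dir, merged_path)
    with open(merged_path) as f:
        merged = f.read()

print(merged, end="")
```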

Interacting with the HDFS programmatically
   - You can also interact with the HDFS programmatically
   - we won't cover much of this in this class, however, there are many examples online
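
One simple way to script the HDFS, sketched below, is to call the same hdfs dfs commands from a program. This is not hadoop's own library interface, just a thin Python wrapper around the command line. It assumes the hdfs program is on the PATH; the helper names hdfs_command and run_hdfs are made up for this example.

```python
import subprocess

def hdfs_command(*args):
    """Build the argument list for `hdfs dfs <args>`."""
    return ["hdfs", "dfs", *args]

def run_hdfs(*args):
    """Run an hdfs dfs command and return what it printed.

    Only works on a machine where hadoop is installed (e.g. the VM).
    """
    completed = subprocess.run(hdfs_command(*args), check=True,
                               capture_output=True, text=True)
    return completed.stdout

# building a command does not touch the cluster, so this part runs anywhere
put_cmd = hdfs_command("-put", "file1.txt", "temp")
print(" ".join(put_cmd))  # hdfs dfs -put file1.txt temp
```

On the VM, something like run_hdfs("-ls", "temp") would then return the listing as a string.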

general mapreduce program pipeline
   - Look at figures in http://developer.yahoo.com/hadoop/tutorial/module4.html

   1. the input is provided to the mapreduce program
      - think of the input as a giant list of elements
      - elements are ALWAYS some key/value pair
      - by default, the key/value pair is:
         - value = line in a file
         - key = byte offset into the file
   2. each key/value pair is passed to the mapping function
      - the mapping function takes a key/value pair as input
      - does some processing
      - and outputs a key/value pair as output (not necessarily the same types as input)
   3. all of the output pairs are grouped by the key
      - this results in: key -> value1, value2, value3, ... for all the values associated with that specific key
      - this is still a key value pair
         - the key = key
         - the value = an iterator over values
   4. these new key/value pairs are then passed to the reducer function
      - input is key -> value iterator
      - does some processing (often some sort of aggregation)
      - outputs the final key/value pairs, which should be the answer (or at least the answer to the subproblem)
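
The four steps above can be imitated on a single machine in plain Python. This is only a toy model of the data flow, not hadoop code: nothing is distributed, and the names default_records, run_mapreduce, words_per_line_mapper and sum_reducer are made up for this sketch. The mapper here is written as a generator so it can output zero or more pairs, and the keys are sorted before reducing just to make the output predictable.

```python
from collections import defaultdict

def default_records(text):
    """Step 1: yield (byte offset, line) pairs, like the default input."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))

def run_mapreduce(records, mapper, reducer):
    grouped = defaultdict(list)
    for key, value in records:
        # step 2: pass each input pair to the mapping function
        for out_key, out_value in mapper(key, value):
            # step 3: group the mapper output by key
            grouped[out_key].append(out_value)
    results = []
    for key in sorted(grouped):
        # step 4: one reducer call per key, with an iterator over its values
        results.extend(reducer(key, iter(grouped[key])))
    return results

def words_per_line_mapper(offset, line):
    # output: (number of words in this line) -> 1
    yield len(line.split()), 1

def sum_reducer(key, values):
    # aggregate all the 1s for this key
    yield key, sum(values)

text = "a b\nc\nd e\n"
records = list(default_records(text))
result = run_mapreduce(records, words_per_line_mapper, sum_reducer)
print(records)  # [(0, 'a b'), (4, 'c'), (6, 'd e')]
print(result)   # [(1, 1), (2, 2)]
```

The example counts how many lines have each number of words: one line with 1 word and two lines with 2 words.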

writing a mapreduce program
   - three components:
      1. map function
      2. reduce function
      3. driver

word count
   - let's write a program that takes a file as input and produces a list of all the words in the file and the number of times that word occurs in the file

   - general overview: first, let's look at how we can break this down into a map step and a reduce step
      - map step for word count?
         - input is a line
         - two output options
            - option 1: word -> # of occurrence in this line
            - option 2: word -> 1 for each word in the line
         - either of the options is fine, however, most often we will choose option 2
            - simpler to implement
            - you want the map function to be as fast and as simple as possible
            - you want to avoid having to declare/create new objects since this takes time
               - remember that this map function is going to be called for *every* line in the data
            - you want the processing for each call to map to be as consistent as possible
      - reduce step
         - the reduce step gets as input the output from the map step with the values aggregated into an iterator per key
            - in our case: word -> 1, 1, 1, 1, 1, 1, 1, 1 (an iterator with a bunch of 1s)
         - all we need to do is sum these up and output a pair of word -> sum

      - driver
         - the driver just sets a bunch of configuration parameters required to get the job going
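
Here is a rough local sketch of word count in plain Python, showing both map options and the reduce step. It is not hadoop code: the word_count function stands in for the driver and does the grouping by key itself. All function names are made up for this example, and words are split on whitespace only.

```python
from collections import Counter, defaultdict

def map_option1(offset, line):
    # option 1: word -> number of occurrences in this line
    for word, count in Counter(line.split()).items():
        yield word, count

def map_option2(offset, line):
    # option 2: word -> 1 for each word in the line
    for word in line.split():
        yield word, 1

def reduce_sum(word, counts):
    # sum up the values for this word and output word -> sum
    yield word, sum(counts)

def word_count(lines, mapper=map_option2):
    # stand-in "driver": wires map, group-by-key and reduce together locally
    grouped = defaultdict(list)
    offset = 0
    for line in lines:
        for word, count in mapper(offset, line):
            grouped[word].append(count)
        offset += len(line.encode("utf-8")) + 1  # +1 for the newline
    totals = {}
    for word, counts in grouped.items():
        for key, total in reduce_sum(word, iter(counts)):
            totals[key] = total
    return totals

lines = ["the cat saw the dog", "the dog ran"]
counts_option2 = word_count(lines)
counts_option1 = word_count(lines, mapper=map_option1)
print(counts_option2)
```

Both map options lead to the same final counts; only the intermediate pairs differ (option 1 emits the -> 2 for the first line, option 2 emits the -> 1 twice).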