Machine Learning - Fall 2013 - Class 24

  • admin
       - assignment 7: how did it go?
       - assignment 8
          - three parts
             - part 1: due Friday (11/8) before class
             - part 2: due Monday (11/11)
             - part 3: due Sunday (11/16) at midnight

  • finish hadoop video

  • quick review of hadoop basics
       - consists of two key components
          - HDFS: Hadoop Distributed File System
             - allows us to store large amounts of data
             - distributes the files across multiple computers
             - stores redundant copies of the data for reliability
          - MapReduce framework
             - computational framework to allow programmers to analyze very large data sets
             - splits up the computation across multiple computers
             - tries to take advantage of data locality based on the HDFS
              - two key phases: the map phase and the reduce phase
             - look at figures on: http://developer.yahoo.com/hadoop/tutorial/module1.html

       - these two components are what make up a hadoop cluster

       - from a programmer's perspective, there are a number of other tools
          - there is an extensive set of libraries that dictate how you interact with the system
              - we'll see some of these as we go
          - there are a number of programs that allow us to interact with the cluster, e.g.
             - interact with the HDFS
             - submit jobs to the system
             - check the status of the system
             - ...
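              - for example (a quick sketch using the Hadoop 1.x style command line these notes assume; my_job.jar and MyJobClass are just placeholder names):

                 > hadoop jar my_job.jar MyJobClass <args>
                 (submits a mapreduce job packaged in a jar to the cluster)

                 > hadoop job -list
                 (lists the jobs currently running on the cluster)

                 > hadoop dfsadmin -report
                 (reports the status and capacity of the HDFS)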

  • HDFS basics
       - it is only a file system
          - in particular there is no operating system, no shell, etc.
       - it is stored within the local file system
       - it is a distributed file system so it is distributed across the local file systems of all of the computers in the cluster
          - this is what allows it to store very large data sets
       - to interact with it you use a call to hadoop, specifically:

          > hadoop dfs <some_command>

       - because there is no shell, there is no current working directory, therefore
          - no cd
          - all commands either
             - require the full path, that is starting with /
              - or are interpreted relative to the user's home directory
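        - for example, the following two commands list the same directory (using the dkauchak user directory from the examples below):

           > hadoop dfs -ls hdfs_demo
           > hadoop dfs -ls /user/dkauchak/hdfs_demo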

  • HDFS file structure
        - like the normal Linux file structure, everything starts at /
       - instead of having "home" directories, user files go in a directory called "/user"
       - the default location for interactions with the HDFS is:
          /user/<username>

          for example, for me it's

          /user/dkauchak/

  • HDFS commands
       - HDFS supports many of the basic file management commands
       - You can see a list of the dfs commands online at: http://developer.yahoo.com/hadoop/tutorial/module2.html#commandref or simply by typing:
          > hadoop dfs


  • HDFS basic commands
       - -ls: lists the files in the directory
          - by default, it just gives an ls of your user directory:   

          > hadoop dfs -ls

          - alternatively, you can specify a path and it will give an ls of that directory

          > hadoop dfs -ls /user

          
        - -lsr: lists the files in the directory AND recurses, listing all subdirectories
           - the default input to many hadoop programs is a directory containing a collection of files, so this can be useful
           - given that you can't cd around or use the other associated tricks of a traditional terminal, this helps

           > hadoop dfs -lsr /tmp

       - -mv: moves files WITHIN the HDFS
          > hadoop dfs -lsr
          drwxr-xr-x - dkauchak supergroup 0 2013-11-03 13:09 /user/dkauchak/hdfs_demo
          -rw-r--r-- 1 dkauchak supergroup 55 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file1.txt
          -rw-r--r-- 1 dkauchak supergroup 82 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file2.txt

          > hadoop dfs -mv hdfs_demo/file1.txt hdfs_demo/file3.txt
          > hadoop dfs -lsr
          drwxr-xr-x - dkauchak supergroup 0 2013-11-03 13:11 /user/dkauchak/hdfs_demo
          -rw-r--r-- 1 dkauchak supergroup 82 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file2.txt
          -rw-r--r-- 1 dkauchak supergroup 55 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file3.txt

        - -cp: similar to mv, copies files WITHIN the HDFS
          > hadoop dfs -cp hdfs_demo/file3.txt hdfs_demo/file1.txt
          > hadoop dfs -lsr
          drwxr-xr-x - dkauchak supergroup 0 2013-11-03 13:12 /user/dkauchak/hdfs_demo
          -rw-r--r-- 1 dkauchak supergroup 55 2013-11-03 13:12 /user/dkauchak/hdfs_demo/file1.txt
          -rw-r--r-- 1 dkauchak supergroup 82 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file2.txt
          -rw-r--r-- 1 dkauchak supergroup 55 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file3.txt

       - -rm: removes a particular file or EMPTY directory
          > hadoop dfs -rm hdfs_demo/file3.txt
          Deleted hdfs://basin:9000/user/dkauchak/hdfs_demo/file3.txt
          > hadoop dfs -lsr
          drwxr-xr-x - dkauchak supergroup 0 2013-11-03 13:09 /user/dkauchak/hdfs_demo
          -rw-r--r-- 1 dkauchak supergroup 55 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file1.txt
          -rw-r--r-- 1 dkauchak supergroup 82 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file2.txt


       - -mkdir: makes a directory
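           > hadoop dfs -mkdir new_dir
           (creates an empty directory called new_dir in your user directory; new_dir is just a placeholder name)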

       - -rmr
          - similar to rm -r on normal file system
          - deletes the directory, the files in that directory AND recurses
              - be careful!
          > hadoop dfs -mkdir temp
          > hadoop dfs -cp hdfs_demo/* temp
           (note that you can use * wildcards just like in most file systems)
          > hadoop dfs -lsr
          drwxr-xr-x - dkauchak supergroup 0 2013-11-03 13:14 /user/dkauchak/hdfs_demo
          -rw-r--r-- 1 dkauchak supergroup 55 2013-11-03 13:12 /user/dkauchak/hdfs_demo/file1.txt
          -rw-r--r-- 1 dkauchak supergroup 82 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file2.txt
          drwxr-xr-x - dkauchak supergroup 0 2013-11-03 13:15 /user/dkauchak/temp
          -rw-r--r-- 1 dkauchak supergroup 55 2013-11-03 13:15 /user/dkauchak/temp/file1.txt
          -rw-r--r-- 1 dkauchak supergroup 82 2013-11-03 13:15 /user/dkauchak/temp/file2.txt
          > hadoop dfs -rmr temp
          Deleted hdfs://basin:9000/user/dkauchak/temp
          > hadoop dfs -lsr
          drwxr-xr-x - dkauchak supergroup 0 2013-11-03 13:14 /user/dkauchak/hdfs_demo
          -rw-r--r-- 1 dkauchak supergroup 55 2013-11-03 13:12 /user/dkauchak/hdfs_demo/file1.txt
          -rw-r--r-- 1 dkauchak supergroup 82 2013-11-03 13:09 /user/dkauchak/hdfs_demo/file2.txt

        - -chmod and -chown: same commands as on the normal file system for handling permissions and ownership

  • Putting data onto and getting data from the HDFS
       - so far, we've only talked about how to move files around, copy them, etc.
       - because it's a separate file system, there are also special commands for moving files from the current computer's file system TO the HDFS and vice versa
       
        - viewing files on the HDFS
           - if you just want to peek at the contents of a file on the HDFS, there are a couple of commands you can use
          - -cat: display the contents of a file (same as cat on normal fs)
             > hadoop dfs -cat hdfs_demo/file1.txt
             this is the first file
             it has lots of good stuff in it

          - -tail: display the last 1K of the file
             - if the file is very large, you may not want to see the entire thing
              - you can use tail to peek at just the last bit of the file
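              - for example, to peek at the end of file2.txt from the earlier listings:

              > hadoop dfs -tail hdfs_demo/file2.txt
              (displays roughly the last 1K of file2.txt)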

       - getting files from the HDFS
          - eventually, you may want to actually get files from the HDFS
          - -get: get the specified file OR directory to the local file system
             > hadoop dfs -get hdfs_demo/file1.txt .
             (copies file1.txt into my current directory)

             > hadoop dfs -get hdfs_demo .
             (copies the directory and all contents to the current directory)

             - notice that the first argument is a file/directory on the HDFS and the second argument is the location on the local file system

          - -getmerge
             - a common output of a mapreduce program is a directory filled with different files, each representing a portion of the final solution
             - the getmerge function allows us to grab the files in a directory and merge them into a single file on the local file system
             > hadoop dfs -getmerge hdfs_demo temp.txt
             (copies all of the files in hdfs_demo into a single file, temp.txt)

       - putting files onto the HDFS
          - to add files (e.g. files that you want your programs to process) you need to put them onto the HDFS
          - -put
             > hadoop dfs -put file3.txt hdfs_demo
             (copies file3.txt from the local file system TO hdfs_demo on the HDFS)

             - notice that the first argument is a file/directory on the local file system and the second argument is a location on the HDFS

  • Interacting with the HDFS programmatically
       - You can also interact with the HDFS programmatically
          - we won't cover much of this in this class, however, there are many examples online
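           - as a very rough sketch (not code we'll use directly in class; the hdfs_demo path and the ListDemo class name are just examples), the Java API centers on the FileSystem and Path classes:

           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.fs.FileStatus;
           import org.apache.hadoop.fs.FileSystem;
           import org.apache.hadoop.fs.Path;

           public class ListDemo {
              public static void main(String[] args) throws Exception {
                 // picks up the cluster settings (e.g. fs.default.name) from the hadoop configuration files
                 Configuration conf = new Configuration();
                 FileSystem fs = FileSystem.get(conf);

                 // roughly equivalent to: hadoop dfs -ls hdfs_demo
                 for (FileStatus status : fs.listStatus(new Path("hdfs_demo"))) {
                    System.out.println(status.getPath());
                 }
              }
           }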

  • general mapreduce program pipeline
       - Look at figures in http://developer.yahoo.com/hadoop/tutorial/module4.html
       
       1. the input is provided to the mapreduce program
          - think of the input as a giant list of elements
          - elements are ALWAYS some key/value pair
          - however, the default key/value pair is:
             - value = line in a file
             - key = byte offset into the file
       2. each key/value pair is passed to the mapping function
          - the mapping function takes a key/value pair as input
          - does some processing
          - and outputs a key/value pair as output (not necessarily the same types as input)
       3. all of the output pairs are grouped by the key
          - this results in: key -> value1, value2, value3, ... for all the values associated with that specific key
          - this is still a key value pair
             - the key = key
              - the value = an iterator over values
       4. these new key/value pairs are then passed to the reducer function
          - input is key -> value iterator
          - does some processing (often some sort of aggregation)
           - outputs the final key/value pairs, which should be the answer (or at least the answer to the subproblem)
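        - as a small made-up example, here is word count traced through those four steps:

           input (key = byte offset, value = line):
              (0, "the cat sat")
              (12, "the dog sat")
           after the map function (one (word, 1) pair per word):
              ("the", 1), ("cat", 1), ("sat", 1), ("the", 1), ("dog", 1), ("sat", 1)
           after grouping by key:
              ("the", [1, 1]), ("cat", [1]), ("sat", [1, 1]), ("dog", [1])
           after the reduce function (sum the values for each key):
              ("the", 2), ("cat", 1), ("sat", 2), ("dog", 1)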

  • writing a mapreduce program
       - three components:
          1. map function
          2. reduce function
          3. driver
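
        - for reference, here is a sketch of the classic word count program with those three components (using the org.apache.hadoop.mapreduce API; class names like WordMapper and SumReducer are just for illustration):

           import java.io.IOException;
           import java.util.StringTokenizer;

           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.fs.Path;
           import org.apache.hadoop.io.IntWritable;
           import org.apache.hadoop.io.LongWritable;
           import org.apache.hadoop.io.Text;
           import org.apache.hadoop.mapreduce.Job;
           import org.apache.hadoop.mapreduce.Mapper;
           import org.apache.hadoop.mapreduce.Reducer;
           import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
           import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

           public class WordCount {

              // 1. map function: key = byte offset into the file, value = one line of the file
              //    outputs a (word, 1) pair for every word on the line
              public static class WordMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
                 private final static IntWritable ONE = new IntWritable(1);
                 private Text word = new Text();

                 public void map(LongWritable key, Text value, Context context)
                       throws IOException, InterruptedException {
                    StringTokenizer tokens = new StringTokenizer(value.toString());
                    while (tokens.hasMoreTokens()) {
                       word.set(tokens.nextToken());
                       context.write(word, ONE);
                    }
                 }
              }

              // 2. reduce function: key = word, values = iterator over all the 1s for that word
              //    outputs (word, total count)
              public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
                 public void reduce(Text key, Iterable<IntWritable> values, Context context)
                       throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable count : values) {
                       sum += count.get();
                    }
                    context.write(key, new IntWritable(sum));
                 }
              }

              // 3. driver: configures the job and submits it to the cluster
              public static void main(String[] args) throws Exception {
                 Configuration conf = new Configuration();
                 Job job = new Job(conf, "word count");
                 job.setJarByClass(WordCount.class);
                 job.setMapperClass(WordMapper.class);
                 job.setReducerClass(SumReducer.class);
                 job.setOutputKeyClass(Text.class);
                 job.setOutputValueClass(IntWritable.class);
                 FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory on the HDFS
                 FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not already exist)
                 System.exit(job.waitForCompletion(true) ? 0 : 1);
              }
           }

        - once compiled and packaged into a jar, it would be run with something like (wordcount.jar is a placeholder name):

           > hadoop jar wordcount.jar WordCount <input dir on HDFS> <output dir on HDFS>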