CS30 - Spring 2015 - Class 16

Example code in this lecture

   word-stats.py
   write_user_input.py
   unique.py

Lecture notes

  • admin
       - assignment 7 due Friday
       - all assignments up through assignment 6 have been graded and returned
       - Eli's mentor session (Tuesday) permanently moved to 6-8pm
       - Academic honesty

  • what does the print_stats function in word-stats.py code do? What can we call it with?
       - Anything that is iterable, e.g.
          - a list
          - a string
          - a tuple
          (also need to be able to call len() on the items in it, e.g. a list of strings)

       - It iterates over each item (say in the list) and keeps track of:
          - longest string found
          - shortest string found
          - total length of the strings iterated over
          - the total number of strings/items

       - how does it keep track of the longest?
          - starts with ""
          - compares every word to the longest so far
          - if longer, updates longest

       - what does 'shortest == "" or' do? Why don't we have it for the longest condition?
          - for longest, we started with the shortest possible string, so any string will be longer
          - hard to start with the longest possible string
          - instead we add a special case for the first time through the loop
             - could have initialized shortest to be a really long string, but this is a more robust solution
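
       - a minimal sketch of the bookkeeping described above. This is NOT the actual code from word-stats.py: the name word_stats and the choice to return a tuple are assumptions made for illustration. print is called with parentheses and a single argument so the sketch runs under both Python 2 and Python 3.

```python
def word_stats(words):
    """Return (count, longest, shortest, average length) for an iterable of strings."""
    longest = ""       # the shortest possible string, so any word is at least as long
    shortest = ""      # placeholder; replaced the first time through the loop
    total_length = 0
    count = 0

    for word in words:
        word = word.strip()  # drop surrounding whitespace such as "\n"

        if len(word) > len(longest):
            longest = word

        # special case: the first word seen is always the shortest so far
        if shortest == "" or len(word) < len(shortest):
            shortest = word

        total_length += len(word)
        count += 1

    # float() avoids integer division in Python 2; guard against an empty input
    average = float(total_length) / count if count > 0 else 0.0
    return count, longest, shortest, average


stats = word_stats(["this", "is", "a", "list", "of", "words"])
print(stats)  # (6, 'words', 'a', 3.0)
```

         Because the loop only needs to iterate and call len(), the same function accepts a list, a tuple, or an open file of one word per line.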

  • running print_stats
       - we can run it directly by passing it a list of strings

          >>> print_stats(["this", "is", "a", "list", "of", "words"])
          Number of words: 6
          Longest word: words
          Shortest word: a
          Avg. word length: 3.0

       - look at the sentence_stats function in word-stats.py
          - the "split" method is called on a string and splits up the string into a list of strings based on whitespace (spaces, tabs and end of line characters)
          
             >>> "this is a sentence".split()
             ['this', 'is', 'a', 'sentence']
       
          - the sentence_stats function just creates a list of strings and then calls the print_stats function

             >>> sentence_stats("this is a sentence")
             Number of words: 4
             Longest word: sentence
             Shortest word: a
             Avg. word length: 3.75

  • files
       - what is a file?
          - a chunk of data stored on the hard disk
       - why do we need files?
          - hard-drives persist state regardless of whether the power is on or not
          - when a program is running, all the data it is generating/processing is in main memory (e.g. RAM)
             - main memory is faster, but doesn't persist when the power goes off

  • reading files
       - to read a file in Python we first need to open it

          file = open("some_file_name", "r")

          - open is another function that has two parameters
          - the first parameter is a *string* identifying the filename
             - be careful about the path/directory. Python looks for the file in the current working directory, which is normally the same directory as the program (.py file) when you run it from your editor, unless you tell it to look elsewhere
          - the second parameter is another string telling Python what you want to do with the file
             - "r" stands for "read", that is, we're going to read some data from the file
          - open returns a "file" object that we can use later on for reading purposes
             - above, I've saved that in a variable called "file", but I could have called it anything else

       - once we have a file open, we can read a line at a time from the file using a for loop:

          for <variable> in <file_variable>:
             # do something

          - for each line in the file, the loop will get run
          - each time the variable will get assigned to the next line in the file
             - the line will be of type string
             - the line will also have an end of line character at the end of it, which you'll often want to get rid of (the string's strip() method is often good for this)

  • a simple example
       - put some text in a file called "example.txt", e.g.

          this is my file
          it has three lines
          and this is the third

       - if we have the following program in a .py file *saved in the same directory*

          reader = open("example.txt", "r")

          for line in reader:
           print line

          reader.close()

       - then we will see the following:

          this is my file

          it has three lines

          and this is the third
          >>>

       - Anything funny about this?

       - add "print len(line)" inside the for loop and run again:

          this is my file

          16
          it has three lines

          19
          and this is the third
          21
       
       - what's the problem?
          - when you read a line from the file, you also get the end of line character
          - what's really in this file is:

          
          this is my file\nit has three lines\nand this is the third

       - to fix this, we want to "strip" (i.e. remove) the end of line character:

          reader = open("example.txt", "r")

          for line in reader:
           line = line.strip()
           print line

          reader.close()
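
       - a self-contained sketch for seeing the end of line character directly. It first writes the three-line example file itself (writing files is covered later in these notes), then compares the length of each line before and after strip(). Using repr to display the line is an addition for illustration; it shows the "\n" instead of printing it as a line break.

```python
# create the example file so the sketch can be run on its own
writer = open("example.txt", "w")
writer.write("this is my file\nit has three lines\nand this is the third")
writer.close()

raw_lengths = []
stripped_lengths = []

reader = open("example.txt", "r")
for line in reader:
    print(repr(line))                           # repr makes the "\n" visible
    raw_lengths.append(len(line))               # counts the trailing "\n" if present
    stripped_lengths.append(len(line.strip()))  # length with the "\n" removed
reader.close()

print(raw_lengths)       # [16, 19, 21]
print(stripped_lengths)  # [15, 18, 21]
```

         The last line has the same length both times because the file does not end with an end of line character.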


  • look at file_stats in word-stats.py
       - because we can iterate over lines in a file, once we open the file, we can use the same print_stats function to analyze words in a file
          - this is why the line word = word.strip() is in print_stats

       - I have a file called "english.txt" which contains a list of ~47K English words. I can use this to understand some basic stats about English:
          
          
          - again, the file called "english.txt" needs to be in the same directory as the .py file

             >>> file_stats("english.txt")
             Number of words: 47158
             Longest word: antidisestablishmentarianism
             Shortest word: Hz
             Avg. word length: 8.37891768099

  • what does this tell us about English? Average word length is 8.3? Does that sound right?
       - seems long!
       - the problem is that it doesn't take into account word frequency. This is just a dictionary of words
       - How might we measure actual word average length in language use?
          - try and find a corpus/sample of dialogue
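
       - a small made-up illustration of why a plain word list overestimates average word length in actual use. The sentence below is invented for this sketch, and average_length is a hypothetical helper, not part of word-stats.py. Short words like "the" are repeated often in real text, so counting every occurrence pulls the average down.

```python
def average_length(words):
    """Average number of characters per string in a non-empty collection."""
    total = 0
    for word in words:
        total += len(word)
    return float(total) / len(words)


sentence = "the cat and the dog saw the extraordinarily large bird"
tokens = sentence.split()   # every word as used, repeats included
distinct = set(tokens)      # each different word once, like a dictionary word list

usage_average = average_length(tokens)
word_list_average = average_length(distinct)

print(usage_average)      # 4.5
print(word_list_average)  # 4.875
```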

  • wikipedia data
       - although not exactly spoken data, looking at Wikipedia data should give a reasonable approximation for English usage
       - We'll analyze some data I put together that includes the sentences from 60K English Wikipedia articles
          - if you want to do your own experiments, you can find the data at: http://www.cs.pomona.edu/~dkauchak/simplification/ (it's version 2.0 with the document aligned data)
       
       - if we look at this data, can we just use our file_stats function?
          - No. It assumes one word per line!

       - look at the general_print_stats function in word-stats.py code
          - does the same thing as print_stats
          - BUT, splits the line up first into words, so we end up with two for loops
             - the outer for loop iterates over lines in a file
             - the inner for loop iterates over the words in a given line

       - there is a general_file_stats function as well that then uses this function to print out the stats for the file
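
       - a minimal sketch of the two nested loops, assuming the same bookkeeping as print_stats. The name general_word_stats and the tuple it returns are assumptions for illustration; the real general_print_stats in word-stats.py may differ. It accepts any iterable of lines, so an open file works as well as the list used here.

```python
def general_word_stats(lines):
    """Return (count, longest, shortest, average length) over all words in all lines."""
    longest = ""
    shortest = ""
    total_length = 0
    count = 0

    for line in lines:               # outer loop: one line at a time
        for word in line.split():    # inner loop: each word in that line
            if len(word) > len(longest):
                longest = word
            if shortest == "" or len(word) < len(shortest):
                shortest = word
            total_length += len(word)
            count += 1

    average = float(total_length) / count if count > 0 else 0.0
    return count, longest, shortest, average


lines = ["this is my file\n", "it has three lines\n", "and this is the third"]
result = general_word_stats(lines)
print(result[0])  # 13
print(result[1])  # three
print(result[2])  # is
```

         split() with no arguments also discards the end of line character, so no separate strip() is needed in this version.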

       - if we run it we get:
          >>> general_file_stats("wikipedia.txt")
          Number of words: 97912818
          Longest word: http://search.dma.mil/search?q=cache:0r3uwrsm8b8j:www.navy.mil/oceans/5090_1c_manual.pdf+officially+regard+the+whole+region+as+international+waters&client=navy_search&proxystylesheet=navy_search&output=xml_no_dtd&ie=utf-8&oe=utf-8&site=navy_all&access=p
          Shortest word: ,
          Avg. word length: 4.49793781852

          - any problems?
             - both the longest word and shortest word are a little unsatisfying
             - we shouldn't be considering these things as "words"
             - we could just add some more code to filter these, but better to write some code to generate *new* data that's been cleaned up

  • writing files
       - we can also write data to files
       - look at write_user_input.py code
       - we can open a file for "writing" by using a "w" instead of an "r"
          - "w" stands for write
          - if the file doesn't exist it will create it
          - if the file does exist, it will erase the current contents and overwrite it (be careful!)
          - we can also write to a file without overwriting the contents, but instead appending to the end
             - "a" stands for append
       
       - just like with reading from a file, we get a file object when we call open
          - the "write" method writes a string to the file (other types of data must be converted with str() first)
          - if you want to write a line to a file, you need to include the end of line character ("\n"), it does not do this by default
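
       - a short sketch of the difference between "w" and "a", and of adding the "\n" by hand. The file name write_demo.txt is made up for this example.

```python
out = open("write_demo.txt", "w")  # creates the file, or erases it if it already exists
out.write("first line\n")          # the "\n" has to be included explicitly
out.write("second line\n")
out.close()

out = open("write_demo.txt", "a")  # "a" keeps the existing contents and adds to the end
out.write("appended line\n")
out.close()

# read the file back to check what ended up in it
lines = []
reader = open("write_demo.txt", "r")
for line in reader:
    lines.append(line.strip())
reader.close()

print(lines)  # ['first line', 'second line', 'appended line']
```

         If the second open had used "w" instead of "a", the file would contain only "appended line".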

       - the write_me_to_file function opens a file for writing and then prompts the user for strings. It writes each string to the file as a line, for as long as the user keeps entering non-empty data
          - For example, if I run it:

             >>> write_me_to_file("test.txt")
             Next line: this is line 1
             Next line: and I can keep
             Next line: entering text
             Next line: as long as I want
             Next line: and it will get written to the
             Next line: file
             Next line:

          and then look in the file "test.txt", I'll find those lines in the file

  • look at cleanup_data in word-stats.py
       - looks similar to the general_print_stats function
       - opens the input file for reading and the output file for writing
       - iterates over the lines in the input file
          - splits up the line into words
          - for each word, uses the "isalpha" method to determine if the string is all alphabetic characters and appends those that are onto cleaned

       - what do you think "join" does?
          >>> help(str.join)
          Help on method_descriptor:
          
          join(...)
           S.join(iterable) -> string

           Return a string which is the concatenation of the strings in the
           iterable. The separator between elements is S.

          >>> words = ["some", "words", "in", "a", "list"]
          >>> " ".join(words)
          'some words in a list'
          >>> ", ".join(words)
          'some, words, in, a, list'
          >>> "--".join(words)
          'some--words--in--a--list'

       - make sure to call "close()" when you're writing a file
          - in most cases, if you don't do it when *reading* a file, it's fine (but you should still get in the habit of doing it)
          - can (and will) cause problems if you don't do it when you're writing a file
             - it can lead to the last bit of data you wrote NOT appearing in the file!

       - I used this function to generate "wikipedia.cleaned.txt"
          >>> general_file_stats("wikipedia.cleaned.txt")
          Number of words: 80581346
          Longest word: outsideofadogabookismansbeguituitglsajsakhdlaysioeyashdklsalkdn
          Shortest word: a
          Avg. word length: 4.94404439459      
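
       - the cleanup steps described above can be sketched as follows. This is an illustration under assumptions, not the actual cleanup_data from word-stats.py: the name cleanup_sketch, the demo file names, and the choice to skip lines with no remaining words are all made up here.

```python
def cleanup_sketch(input_name, output_name):
    """Copy input_name to output_name, keeping only all-alphabetic words."""
    reader = open(input_name, "r")
    writer = open(output_name, "w")

    for line in reader:
        cleaned = []
        for word in line.split():
            if word.isalpha():       # True only if every character is a letter
                cleaned.append(word)
        if len(cleaned) > 0:         # assumption: drop lines with nothing left
            writer.write(" ".join(cleaned) + "\n")

    reader.close()
    writer.close()  # without this, the last data written may not reach the file


# build a tiny input file, clean it, and read the result back
writer = open("raw_demo.txt", "w")
writer.write("the cat , sat on 2 mats\n")
writer.write("see http://example.com !\n")
writer.close()

cleanup_sketch("raw_demo.txt", "cleaned_demo.txt")

cleaned_lines = []
reader = open("cleaned_demo.txt", "r")
for line in reader:
    cleaned_lines.append(line.strip())
reader.close()

print(cleaned_lines)  # ['the cat sat on mats', 'see']
```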

  • other questions we might want to ask about wikipedia?

  • how many unique words (well, strings) are in wikipedia?

  • dictionaries (aka maps)
       - store keys and an associated value
          - each key is associated with a value
          - lookup can be done based on the key
          - this is a very common phenomenon in the real world. What are some examples?
             - social security number
                - key = social security number
                - value = name, address, etc
             - phone numbers in your phone (and phone directories in general)
                - key = name
                - value = phone number
             - websites
                - key = url
                - value = location of the computer that hosts this website
             - car license plates
                - key = license plate number
                - value = owner, type of car, ...
             - flight information
                - key = flight number
                - value = departure city, destination city, time, ...

       - like sets, dictionaries allow us to efficiently look up (and update) keys in the dictionary

       - creating new dictionaries
          - dictionaries can be created using curly braces
             >>> d = {}
             >>> d
             {}
       
       - dictionaries function similarly to lists, except we can put things in ANY index and can use non-numerical indices
          >>> d["grapes"] = "purple"

          - this says associate the value "purple" with the key "grapes"

          >>> d
          {'grapes': 'purple'}

          - when they're printed out they're printed as key/value pairs, e.g.

          >>> d["apples"] = "red"
          >>> d
          {'apples': 'red', 'grapes': 'purple'}

       - accessing values
          - you can get back the value associated with a key
             >>> d["apples"]
             'red'

       - keys are unique!
          - if you assign to a key again, it will update the value associated with that key (no second entry is created)

             >>> d["apples"] = "green"
             >>> d["apples"]
             'green'
             >>> d
             {'apples': 'green', 'grapes': 'purple'}

  • how can we use this to count the number of unique words?
       - look at unique.py code
          - iterate through each of the words
          - add each to a dictionary with value 1 (in fact, any value would work)
          - return the len of the dictionary (i.e. how many entries are in the dictionary)
       
       - we can leverage the fact that keys *must* be unique
          - if we see a word multiple times it will still only have one entry in the dictionary

       >>> unique_words("wikipedia.cleaned.txt")
       553736
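
       - a minimal sketch of the counting idea above, assuming words are separated by whitespace. It is not the actual code in unique.py; the name unique_words_sketch and the demo file are made up for illustration.

```python
def unique_words_sketch(filename):
    """Return the number of distinct whitespace-separated strings in a file."""
    seen = {}
    reader = open(filename, "r")
    for line in reader:
        for word in line.split():
            seen[word] = 1  # the value is arbitrary; only the keys matter
    reader.close()
    return len(seen)        # one entry per distinct word


# small demo file: 10 words in total, but only 5 different ones
writer = open("unique_demo.txt", "w")
writer.write("the cat saw the dog\n")
writer.write("the dog saw the bird\n")
writer.close()

unique_count = unique_words_sketch("unique_demo.txt")
print(unique_count)  # 5
```

         Assigning to a key that is already present just overwrites its value, which is why repeated words do not increase the count.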