### CS30 - Spring 2015 - Class 16

#### Example code in this lecture

word-stats.py
write_user_input.py
unique.py

#### Lecture notes

- assignment 7 due Friday
- all assignments up through assignment 6 have been graded and returned
- Eli's mentor session (Tuesday) permanently moved to 6-8pm

• what does the print_stats function in word-stats.py do? What can we call it with?
- Anything that is iterable, e.g.
- a list
- a string
- a tuple
(also need to be able to call len() on the items in it, e.g. a list of strings)

- It iterates over each item (say in the list) and keeps track of:
- longest string found
- shortest string found
- total length of the strings iterated over
- the total number of strings/items

- how does it keep track of the longest?
- starts with ""
- compares every word to the longest so far

- what does 'shortest == "" or' do? Why don't we have it for the longest condition?
- for longest, we started with the shortest possible string, so any string will be longer
- for shortest, we instead add a special case for the first time through the loop
- could have initialized shortest to be a really long string, but this is a more robust solution
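Putting those pieces together, here is a sketch of what print_stats might look like (the real version is in word-stats.py; the exact variable names and output wording are assumptions based on the transcript):

```python
def print_stats(words):
    # one pass over the iterable, tracking four running values
    longest = ""    # start with the shortest possible string
    shortest = ""   # "" here is a flag meaning "nothing seen yet"
    total = 0
    count = 0
    for word in words:
        word = word.strip()
        if len(word) > len(longest):
            longest = word
        # special case: the first word is always the shortest so far
        if shortest == "" or len(word) < len(shortest):
            shortest = word
        total = total + len(word)
        count = count + 1
    print("Number of words: " + str(count))
    print("Longest word: " + longest)
    print("Shortest word: " + shortest)
    print("Avg. word length: " + str(float(total) / count))
```

Calling it as print_stats(["this", "is", "a", "list", "of", "words"]) reproduces the four output lines in the transcript below.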

• running print_stats
- we can run it directly by passing it a list of strings

>>> print_stats(["this", "is", "a", "list", "of", "words"])
Number of words: 6
Longest word: words
Shortest word: a
Avg. word length: 3.0

- look at the sentence_stats function in word-stats.py
- the "split" method is called on a string and splits up the string into a list of strings based on spaces

>>> "this is a sentence".split()
['this', 'is', 'a', 'sentence']

- the sentence_stats function just creates a list of strings and then calls the print_stats function

>>> sentence_stats("this is a sentence")
Number of words: 4
Longest word: sentence
Shortest word: a
Avg. word length: 3.75

• files
- what is a file?
- a chunk of data stored on the hard disk
- why do we need files?
- hard-drives persist state regardless of whether the power is on or not
- when a program is running, all the data it is generating/processing is in main memory (e.g. RAM)
- main memory is faster, but doesn't persist when the power goes off

- to read a file in Python we first need to open it

file = open("some_file_name", "r")

- open is another function that has two parameters
- the first parameter is a *string* identifying the filename
- be careful about the path/directory. Python looks for the file in the same directory as the program (.py file) unless you tell it to look elsewhere
- the second parameter is another string telling Python what you want to do with the file
- "r" stands for "read", that is, we're going to read some data from the file
- open returns a "file" object that we can use later on for reading purposes
- above, I've saved that in a variable called "file", but I could have called it anything else

- once we have a file open, we can read a line at a time from the file using a for loop:

for <variable> in <file_variable>:
    # do something

- the loop body will run once for each line in the file
- each time through, the variable gets assigned the next line in the file
- the line will be of type string
- the line will also have an end-of-line character at the end, which you'll often want to get rid of (the string's strip() method is good for this)
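A minimal, self-contained version of this loop (the snippet writes its own copy of example.txt to a temporary directory purely so it runs anywhere; in class the file was made by hand):

```python
import os
import tempfile

# create the three-line example file from the lecture
path = os.path.join(tempfile.mkdtemp(), "example.txt")
f = open(path, "w")
f.write("this is my file\nit has three lines\nand this is the third\n")
f.close()

stripped = []
for line in open(path, "r"):
    # each line arrives as a string with "\n" still attached;
    # strip() removes it
    stripped.append(line.strip())
print(stripped)
# -> ['this is my file', 'it has three lines', 'and this is the third']
```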

• a simple example
- put some text in a file called "example.txt", e.g.

this is my file
it has three lines
and this is the third

- if we have the following program in a .py file *saved in the same directory*:

file = open("example.txt", "r")
for line in file:
    print line

- then we will see the following:

this is my file

it has three lines

and this is the third
>>>

- add "print len(line)" in the for and run again:

this is my file

16
it has three lines

19
and this is the third
21

- what's the problem?
- when you read a line from the file, you also get the end-of-line character
- what's really in this file is:

this is my file\nit has three lines\nand this is the third

- to fix this, we want to "strip" (i.e. remove) the end of line character:

line = line.strip()
print line

• look at file_stats in word-stats.py
- because we can iterate over lines in a file, once we open the file, we can use the same print_stats function to analyze words in a file
- this is why the line word = word.strip() is in print_stats

- I have a file called "english.txt" which contains a list of ~47K English words. I can use this to understand some basic stats about English:

- again, the file called "english.txt" needs to be in the same directory as the .py file

>>> file_stats("english.txt")
Number of words: 47158
Longest word: antidisestablishmentarianism
Shortest word: Hz
Avg. word length: 8.37891768099

• what does this tell us about English? Average word length is 8.3? Does that sound right?
- seems long!
- the problem is that it doesn't take into account word frequency. This is just a dictionary of words
- How might we measure actual word average length in language use?
- try and find a corpus/sample of dialogue

• wikipedia data
- although not exactly spoken data, looking at Wikipedia data should give a reasonable approximation for English usage
- We'll analyze some data I put together that includes the sentences from 60K English Wikipedia articles
- if you want to do your own experiments, you can find the data at: http://www.cs.pomona.edu/~dkauchak/simplification/ (it's version 2.0 with the document aligned data)

- if we look at this data, can we just use our file_stats function?
- No. It assumes one word per line!

- look at the general_print_stats function in word-stats.py code
- does the same thing as print_stats
- BUT, splits the line up first into words, so we end up with two for loops
- the outer for loop iterates over lines in a file
- the inner for loop iterates over the words in a given line
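The two-loop structure on its own, with a list of strings standing in for the lines of a file (names here are illustrative, not from word-stats.py):

```python
# the two-loop pattern: outer loop over lines, inner loop over words
lines = ["this is a sentence", "and another line"]
words = []
for line in lines:              # outer: one line at a time
    for word in line.split():   # inner: the words within that line
        words.append(word)
print(words)
# -> ['this', 'is', 'a', 'sentence', 'and', 'another', 'line']
```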

- there is a general_file_stats function as well that uses this function to print out the stats for the file

- if we run it we get:
>>> general_file_stats("wikipedia.txt")
Number of words: 97912818
Longest word: http://search.dma.mil/search?q=cache:0r3uwrsm8b8j:www.navy.mil/oceans/5090_1c_manual.pdf+officially+regard+the+whole+region+as+international+waters&client=navy_search&proxystylesheet=navy_search&output=xml_no_dtd&ie=utf-8&oe=utf-8&site=navy_all&access=p
Shortest word: ,
Avg. word length: 4.49793781852

- any problems?
- both the longest word and shortest word are a little unsatisfying
- we shouldn't be considering these things as "words"
- we could just add some more code to filter these, but better to write some code to generate *new* data that's been cleaned up

• writing files
- we can also write data to files
- look at write_user_input.py code
- we can open a file for "writing" by using a "w" instead of an "r"
- "w" stands for write
- if the file doesn't exist it will create it
- if the file does exist, it will erase the current contents and overwrite it (be careful!)
- we can also write to a file without overwriting the contents, but instead appending to the end
- "a" stands for append

- just like with reading from a file, we get a file object when we call open
- the "write" method writes an object to the file as a string
- if you want to write a line to a file, you need to include the end of line character ("\n"), it does not do this by default
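A small self-contained demonstration of writing with "w" (the temporary path is just so the example doesn't clobber any real file):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")

out = open(path, "w")            # "w": create, or erase an existing file!
out.write("this is line 1\n")    # write() does NOT add the newline itself
out.write("this is line 2\n")
out.close()                      # make sure the data actually reaches the file

print(open(path, "r").read())
```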

- the write_me_to_file function opens a file for writing and then prompts the user for strings, writing each one to the file as long as the user keeps entering non-empty lines
- For example, if I run it:

>>> write_me_to_file("test.txt")
Next line: this is line 1
Next line: and I can keep
Next line: entering text
Next line: as long as I want
Next line: and it will get written to the
Next line: file
Next line:

and then look in the file "test.txt", I'll find those lines in the file
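A sketch of what write_me_to_file might look like (the real version is in write_user_input.py; details are assumptions, and in Python 2 the prompt call would be raw_input rather than input):

```python
def write_me_to_file(filename):
    # keep asking for lines until the user enters an empty one
    out = open(filename, "w")
    line = input("Next line: ")
    while line != "":
        out.write(line + "\n")   # add the newline ourselves
        line = input("Next line: ")
    out.close()
```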

• look at cleanup_data in word-stats.py
- looks similar to the general_print_stats function
- opens the file for reading and the output file
- iterates over the lines in the input file
- splits up the line into words
- for each word, uses the "isalpha" method to determine if the string is all alphabetic characters and appends those that are onto the cleaned list

- what do you think "join" does?
>>> help(str.join)
Help on method_descriptor:

join(...)
S.join(iterable) -> string

Return a string which is the concatenation of the strings in the
iterable. The separator between elements is S.

>>> list = ["some", "words", "in", "a", "list"]
>>> " ".join(list)
'some words in a list'
>>> ", ".join(list)
'some, words, in, a, list'
>>> "--".join(list)
'some--words--in--a--list'

- make sure to call "close()" when you're writing a file
- in most cases, if you don't do it when *reading* a file, it's fine (but still should get in the habit of doing it)
- can (and will) cause problems if you don't do it when you're writing a file
- it can lead to the last bit of data you wrote NOT appearing in the file!
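The steps above (read, split, filter with isalpha, rejoin, write, close) can be sketched as follows; the real cleanup_data is in word-stats.py, so the details here are assumptions:

```python
def cleanup_data(in_name, out_name):
    # keep only the purely alphabetic tokens on each line
    infile = open(in_name, "r")
    outfile = open(out_name, "w")
    for line in infile:
        cleaned = []
        for word in line.split():
            if word.isalpha():      # True only if every character is a letter
                cleaned.append(word)
        outfile.write(" ".join(cleaned) + "\n")
    infile.close()
    outfile.close()                 # without this, the last data written can be lost
```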

- I used this function to generate "wikipedia.cleaned.txt"
>>> general_file_stats("wikipedia.cleaned.txt")
Number of words: 80581346
Shortest word: a
Avg. word length: 4.94404439459

• how many unique words (well, strings) are in wikipedia?

• dictionaries (aka maps)
- store keys and an associated value
- each key is associated with a value
- lookup can be done based on the key
- this is a very common phenomenon in the real world. What are some examples?
- social security number
- key = social security number
- value = name, address, etc
- phone numbers in your phone (and phone directories in general)
- key = name
- value = phone number
- websites
- key = url
- value = location of the computer that hosts this website
- key = license plate number
- value = owner, type of car, ...
- flight information
- key = flight number
- value = departure city, destination city, time, ...

- like sets, dictionaries allow us to efficiently lookup (and update) keys in the dictionary

- creating new dictionaries
- dictionaries can be created using curly braces
>>> d = {}
>>> d
{}

- dictionaries function similarly to lists, except we can put things in ANY index and can use non-numerical indices
>>> d["grapes"] = "purple"

- this says associate the value "purple" with the key "grapes"

>>> d
{'grapes': 'purple'}

- when they're printed out they're printed as key/value pairs, e.g.

>>> d["apples"] = "red"
>>> d
{'apples': 'red', 'grapes': 'purple'}

- accessing values
- you can get back the value associated with a key
>>> d["apples"]
'red'

- keys are unique!
- if you assign to a key again, it will update the key

>>> d["apples"] = "green"
>>> d["apples"]
'green'
>>> d
{'apples': 'green', 'grapes': 'purple'}

• how can we use this to count the number of unique words?
- look at unique.py code
- iterate through each of the words
- add each to a dictionary with value 1 (in fact, any value would work)
- return the len of the dictionary (i.e. how many entries are in the dictionary)

- we can leverage the fact that keys *must* be unique
- if we see a word multiple times it will still only have one entry in the dictionary
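A sketch of how unique_words in unique.py might put this together (exact names assumed):

```python
def unique_words(filename):
    # dictionary keys are unique, so adding a word that is already
    # a key changes nothing -- duplicates collapse automatically
    seen = {}
    for line in open(filename, "r"):
        for word in line.split():
            seen[word] = 1      # the value is irrelevant; any value works
    return len(seen)
```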

>>> unique_words("wikipedia.cleaned.txt")
553736